Report SFU WebSocket connection and join timeouts accurately#1740
Report SFU WebSocket connection and join timeouts accurately#1740PratimMallick wants to merge 5 commits into
Conversation
Splits the SFU socket connect into a transport-open phase (bounded by OkHttp's connect timeout) and a distinct join-response phase (bounded by a dedicated timer via a new WebSocketConnected state), so a silent SFU no longer hangs the join. Surfaces real transport failure messages and HTTP status codes on the resulting NetworkError, routes all recoverable errors through a single DisconnectedTemporarily state carrying the exact code/reason, and maps both timeout flavours to REQUEST_TIMEOUT in analytics (join-response via error code, transport via SocketTimeoutException cause). Also moves join-error analytics to the join flow only, adds a per-session SFU WS retry counter, and wires connectionTimeoutInMs from the builder to both the OkHttp client and the join-response deadline. Co-authored-by: Cursor <cursoragent@cursor.com>
PR checklist ❌The following issues were detected:
What we check
|
SDK Size Comparison 📏
|
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/internal/module/SfuConnectionModule.kt (1)
60-73: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winSplit the API and SFU socket clients
callTimeoutonly affects the handshake; the real risk is that this shared client also sets the SFU WebSocketreadTimeoutfromconnectionTimeoutInMs. A 5s idle window can drop quiet socket sessions. Keep the API client bounded, but give the socket its own client withreadTimeout(0)andpingInterval(joinResponseTimeoutMsalready covers the post-open join deadline).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/internal/module/SfuConnectionModule.kt` around lines 60 - 73, The shared OkHttpClient built by buildSfuOkHttpClient is applying connectionTimeoutInMs to readTimeout, which can prematurely close quiet SFU WebSocket sessions. Split the API and socket client configuration: keep the API-side client bounded with the existing timeouts, but create a separate client for the SFU socket path with readTimeout(0) and a pingInterval, using joinResponseTimeoutMs to enforce the post-open join deadline. Update the relevant client निर्माण in SfuConnectionModule so the WebSocket path no longer inherits the API read timeout.stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocket.kt (1)
148-197: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winStale
joinResponseTimeoutJobnot cancelled when re-enteringConnecting.Per
SfuSocketStateService,WebSocketConnectedtransitions back toConnectingon a newConnectevent (reachable wheneverconnect()is invoked while still awaiting the join response, e.g. a caller-driven reconnect). The observer'sConnectingbranch (Lines 181-184) starts a newconnectUser()flow but doesn't cancel the previously-startedjoinResponseTimeoutJob. If that stale timer fires before the new attempt reachesWebSocketConnected, it injectsSFU_JOIN_RESPONSE_TIMEOUTinto the state machine while it's inConnecting, which (per the FSM) drives the new attempt straight toDisconnectedTemporarily— a spurious failure of an unrelated connection attempt. Every other exit fromWebSocketConnected(Connected,Disconnected) explicitly cancels the timer;Connectingshould too.🔧 Proposed fix
is SfuSocketState.Connecting -> { + cancelJoinResponseTimeout() connectUser(state.connectionConf) callListeners { listener -> listener.onConnecting() } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocket.kt` around lines 148 - 197, The SfuSocketState observer in SfuSocket.kt leaves a stale joinResponseTimeoutJob running when the state re-enters SfuSocketState.Connecting from SfuSocketState.WebSocketConnected. Update the Connecting branch in the observer to cancelJoinResponseTimeout() before calling connectUser(state.connectionConf), matching the cleanup already done in Connected and Disconnected so an old timeout cannot fire during a new connect attempt.
🧹 Nitpick comments (2)
stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/RtcSessionTest2.kt (1)
54-54: 📐 Maintainability & Code Quality | 🔵 TrivialPrefer
org.junit.Assert.assertFalseover the legacyjunit.framework.TestCaseimport.The rest of the file uses JUnit 4 style assertions (
assertTrue,assertEquals,assertNotNull); importingassertFalsefrom the legacyjunit.framework.TestCaseshim is inconsistent.♻️ Proposed fix
-import junit.framework.TestCase.assertFalse +import org.junit.Assert.assertFalse🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/RtcSessionTest2.kt` at line 54, Replace the legacy junit.framework.TestCase.assertFalse import in RtcSessionTest2 with org.junit.Assert.assertFalse so the test file uses consistent JUnit 4 assertions alongside assertTrue, assertEquals, and assertNotNull; update the import in the test class and keep the existing assertFalse call sites unchanged.stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateService.kt (1)
172-227: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winNew
WebSocketConnectedstate block triplicates terminal-event handling.The new
state<SfuSocketState.WebSocketConnected>block repeats the same sevenonEventhandlers (WebSocketEventLost,NetworkNotAvailable,UnrecoverableError,NetworkError,RequiredDisconnection,Stop,ConnectionEstablished) already present verbatim inConnecting(Lines 177-199) andConnected(Lines 230-251). This is now the third copy of identical transition logic; a future change to one of these outcomes (e.g. adjustingDisconnectedTemporarilyconstruction) risks silently diverging across the three states.Consider extracting a shared handler block/function that all three "active" states can apply, if the
FiniteStateMachineDSL supports composition.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateService.kt` around lines 172 - 227, The new SfuSocketState.WebSocketConnected state duplicates the same terminal-event transition logic already used in SfuSocketState.Connecting and SfuSocketState.Connected. Refactor the repeated onEvent handlers into a shared helper or reusable transition block in SfuSocketStateService, then apply it to all three active states so ConnectionEstablished, WebSocketEventLost, NetworkNotAvailable, UnrecoverableError, NetworkError, RequiredDisconnection, and Stop stay consistent in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`:
- Around line 774-783: Avoid racing on the mutable session reference during SFU
connect in Call: the null check and the later connectInternal() call on
session.value!! are not atomic, so cleanup can clear the session and cause an
NPE. In the connect/join flow, capture session.value into a localSession once
after the null check, use that same localSession to set state._connection and to
call connectInternal(), and then use the final active session consistently
before returning; apply the same pattern to the other connect path referenced by
the review.
- Around line 2277-2279: The timeout comment in Call’s connection flow is stale
because it still mentions a 10s default even though
StreamVideoBuilder.connectionTimeoutInMs now defaults to 5s. Update the wording
near the connection/join timeout explanation to remove the hardcoded literal and
describe the timeout in a way that won’t drift from the builder’s default; use
the existing connectionTimeoutInMs and StreamVideoBuilder references to locate
the comment.
---
Outside diff comments:
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/internal/module/SfuConnectionModule.kt`:
- Around line 60-73: The shared OkHttpClient built by buildSfuOkHttpClient is
applying connectionTimeoutInMs to readTimeout, which can prematurely close quiet
SFU WebSocket sessions. Split the API and socket client configuration: keep the
API-side client bounded with the existing timeouts, but create a separate client
for the SFU socket path with readTimeout(0) and a pingInterval, using
joinResponseTimeoutMs to enforce the post-open join deadline. Update the
relevant client निर्माण in SfuConnectionModule so the WebSocket path no longer
inherits the API read timeout.
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocket.kt`:
- Around line 148-197: The SfuSocketState observer in SfuSocket.kt leaves a
stale joinResponseTimeoutJob running when the state re-enters
SfuSocketState.Connecting from SfuSocketState.WebSocketConnected. Update the
Connecting branch in the observer to cancelJoinResponseTimeout() before calling
connectUser(state.connectionConf), matching the cleanup already done in
Connected and Disconnected so an old timeout cannot fire during a new connect
attempt.
---
Nitpick comments:
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateService.kt`:
- Around line 172-227: The new SfuSocketState.WebSocketConnected state
duplicates the same terminal-event transition logic already used in
SfuSocketState.Connecting and SfuSocketState.Connected. Refactor the repeated
onEvent handlers into a shared helper or reusable transition block in
SfuSocketStateService, then apply it to all three active states so
ConnectionEstablished, WebSocketEventLost, NetworkNotAvailable,
UnrecoverableError, NetworkError, RequiredDisconnection, and Stop stay
consistent in one place.
In
`@stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/RtcSessionTest2.kt`:
- Line 54: Replace the legacy junit.framework.TestCase.assertFalse import in
RtcSessionTest2 with org.junit.Assert.assertFalse so the test file uses
consistent JUnit 4 assertions alongside assertTrue, assertEquals, and
assertNotNull; update the import in the test class and keep the existing
assertFalse call sites unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 02a2be2b-3aef-4a48-9d87-cdbd58e9bf59
📒 Files selected for processing (19)
stream-video-android-core/api/stream-video-android-core.apistream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/StreamVideoBuilder.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/StreamVideoClient.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/errors/VideoErrorCode.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/internal/module/SfuConnectionModule.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/common/StreamWebSocket.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/common/StreamWebSocketEvent.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/coordinator/CoordinatorSocket.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocket.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketConnection.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateService.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/state/SfuSocketState.ktstream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/state/SfuSocketStateEvent.ktstream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/FastReconnectIceRestartTest.ktstream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/JoinRecoverableFailureTest.ktstream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/RtcSessionTest2.ktstream-video-android-core/src/test/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateServiceTest.kt
…ssion.value was cleared to null while connecting to sfu was still in progress
There was a problem hiding this comment.
Solid PR. Few things before merge:
-
connectionTimeoutInMsKDoc says "Defaults to 10s" but the default is now5_000. 5s is fine, just fix the doc.StreamVideoClientstill defaults to10_000, so align those too, and add a changelog line since the coordinator/SFU connect budget effectively drops from 10s. -
On the recovered path, once
didReconnectSucceed()returns true,_joinfalls through tosetActiveCall+monitorSession(result.value)+Success. But reconnect already created a new session and calledmonitorSessionon it. Does this double-register, or hand back a response that doesn't match the recovered session? -
didReconnectSucceed()waits onstate.connection.first { ... }with no timeout. Reconnect is bounded so it should settle, but if it hitsDisconnectedtransiently it returns false early, and if it never settles_joinhangs. A timeout on the same budget would be safer.
Nit: the RtcSessionTest2 timeout test emits DisconnectedTemporarily directly, so the WebSocketConnected -> NetworkError -> DisconnectedTemporarily transition isn't actually tested. Worth a case in SfuSocketStateServiceTest.
|
From the comments of @aleksandar-apostolov
|
- Rename SfuSocketStateEvent.WebSocketConnected to WebSocketEstablished to avoid the naming collision with the SfuSocketState.WebSocketConnected state. - Guard monitorSession() in _join recovery so it only runs when the original session is reused, preventing double-registration after rejoin/migrate. - Default connectionTimeoutInMs to 5s and fix stale KDoc. - Rename triggersReconnect() to canTriggersReconnect() for clarity. Co-authored-by: Cursor <cursoragent@cursor.com>
| val sfuSocketState = sfuConnectionModule.socketConnection.state().first { | ||
| it is SfuSocketState.Connected || it is SfuSocketState.Disconnected |
There was a problem hiding this comment.
Did we covered the failure case in which our retries will be exhausted?
Look at this code where we do an early exit after 10 retries -
if (loopIteration >= MAX_RECONNECT_ATTEMPTS) {
logger.w { "[reconnect] Max reconnect attempts ($MAX_RECONNECT_ATTEMPTS) reached — giving up" }
state._connection.value = RealtimeConnection.ReconnectingFailed
break
}There was a problem hiding this comment.
So the reconnect loop comes into action after the socket state has moved :
socketState has the following states :
Connecting -> WebSocketConnected -> Connected -> Disconnected
Disconnected futher has 7 mini states of its own
When the socket moves to any of the disconnected states the retry loop starts. Also parallely this code
val sfuSocketState = sfuConnectionModule.socketConnection.state().first {
it is SfuSocketState.Connected || it is SfuSocketState.Disconnected
}
would collapse and go into the else part of the cide which returns a SfuConnectionResult.Failure to the caller of connectInternal which is _join in the joining scenario . The _join now awaits on the result of retry loop in the method didReconnectSucceed where we take care of the events which retry loop can emit which are :
ReconnectOutcome.Success , ReconnectOutcome.PreconditionNotMet , ReconnectOutcome.PeerConnectionStale , ReconnectOutcome.Failed,
So we have covered all the scenarios of retry loop
…_TIMEOUT
OkHttp connect/call timeouts surface as InterruptedIOException("timeout"),
not SocketTimeoutException, so SFU join analytics were incorrectly reporting
SFU_ERROR despite the failure reason being "timeout".
Co-authored-by: Cursor <cursoragent@cursor.com>
|


Goal
This PR reworks SFU socket connect/timeout handling so every outcome is attributed precisely.
Issues this PR solves:
JoinCallResponsewe could wait indefinitely. There was no separate deadline for the post-open handshake.withTimeoutOrNullinsideconnectInternal) and the terminal state wasn't interpreted consistently.SOCKET_FAILURE's generic "See stack trace in logs…" instead of the real reason (e.g.Failed to connect to /1.2.3.4:443,timeout, SSL handshake errors), and the HTTP status code from a failed WS upgrade (401/429/5xx) was discarded.connectInternalreported join failures even when it ran as part of the internal reconnect loop, double-counting/false-positive join errors.connectionTimeoutInMsfromStreamVideoBuilderwas ignored by the SFU socket path (a hardcoded 10s was used, and the post-open handshake had no configurable bound).Implementation
WebSocketConnectedstate. The transport open (HTTP→WS upgrade) is bounded by OkHttp's connect timeout; a newSfuSocketState.WebSocketConnectedmarks "socket open, awaitingJoinCallResponse".StreamWebSocketnow emits aStreamWebSocketEvent.OpenononOpento drive this transition (coordinator socket treats it as informational).WebSocketConnected, a timer (joinResponseTimeoutMs) bounds the wait forJoinCallResponse; on expiry it pushes a retryableNetworkErrortaggedVideoErrorCode.SFU_JOIN_RESPONSE_TIMEOUT→ resolves toDisconnectedTemporarilyinstead of hanging.withTimeoutOrNullinconnectInternal; all recoverable errors flow through oneDisconnectedTemporarily(error)whoseNetworkErrorcarries the exact code + reason.connectInternalclassifies the terminal state viatriggersReconnect()/networkErrorOrNull().StreamWebSocket.onFailurenow buildsError.NetworkErrordirectly with the throwable's message and the HTTPresponse.codewhen the upgrade failed on an HTTP response (falls back to unknown for pure transport failures).toAbortReason()maps both the join-response timeout (by error code) and the transport connect timeout (bySocketTimeoutExceptioncause — a type check, not string matching) toREQUEST_TIMEOUT; everything else staysSFU_ERROR.connectInternalintoCall._joinviasendJoinErrorAnalytics(...), so reconnect-driven connects don't report join errors.SfuConnectionResult.Failurenow carriesrecoverable+abortReason; on a recoverable failure the join flow awaits the reconnect outcome (didReconnectSucceed()).RtcSession.sfuWsRetryCountincrements only on actual retries (aconnectInternalwithreconnectDetails), resets to 0 onConnected, and is passed toonSfuWsCompleted.StreamVideoBuilder.connectionTimeoutInMsnow drives both the SFU OkHttp client and the join-response deadline.SfuConnectionResult.Connected/Failed→Success/Failure; renamedawaitReconnectOutcome()→didReconnectSucceed().Testing
./gradlew :stream-video-android-core:testDebugUnitTest --tests "io.getstream.video.android.core.socket.sfu.*" --tests "io.getstream.video.android.core.rtc.*"— pass./gradlew :stream-video-android-core:compileDebugKotlin— pass./gradlew :stream-video-android-core:spotlessApply— cleanNew/updated tests:
SfuSocketStateServiceTest(newWebSocketConnected/join-response transitions),JoinRecoverableFailureTest(recoverable-vs-permanent join handling),RtcSessionTest2(connectInternalterminal-state classification),FastReconnectIceRestartTest(result rename).☑️Contributor Checklist
General
developbranchCode & documentation
stream-video-examples) (N/A)☑️Reviewer Checklist
Made with Cursor
Summary by CodeRabbit
New Features
Bug Fixes
Documentation