Skip to content

Report SFU WebSocket connection and join timeouts accurately#1740

Open
PratimMallick wants to merge 5 commits into
developfrom
fix/sfu-ws-connection-timeout-handling
Open

Report SFU WebSocket connection and join timeouts accurately#1740
PratimMallick wants to merge 5 commits into
developfrom
fix/sfu-ws-connection-timeout-handling

Conversation

@PratimMallick

@PratimMallick PratimMallick commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Goal

This PR reworks SFU socket connect/timeout handling so every outcome is attributed precisely.

Issues this PR solves:

  1. Connect could hang on a silent SFU. Once the transport WebSocket was upgraded, OkHttp no longer applied a timeout, so if the SFU never sent its JoinCallResponse we could wait indefinitely. There was no separate deadline for the post-open handshake.
  2. A successful-but-slow connect / socket timeout was wrongly surfaced as a hard join failure to the app. There were effectively two competing timeouts (OkHttp's, plus a withTimeoutOrNull inside connectInternal) and the terminal state wasn't interpreted consistently.
  3. Opaque error messages. Transport failures surfaced as SOCKET_FAILURE's generic "See stack trace in logs…" instead of the real reason (e.g. Failed to connect to /1.2.3.4:443, timeout, SSL handshake errors), and the HTTP status code from a failed WS upgrade (401/429/5xx) was discarded.
  4. Join-error analytics were emitted during reconnects. connectInternal reported join failures even when it ran as part of the internal reconnect loop, double-counting/false-positive join errors.
  5. Retry count was inaccurate. There was no per-session notion of how many SFU WS retries an attempt actually took.
  6. connectionTimeoutInMs from StreamVideoBuilder was ignored by the SFU socket path (a hardcoded 10s was used, and the post-open handshake had no configurable bound).

Implementation

  • Two-phase connect with a new WebSocketConnected state. The transport open (HTTP→WS upgrade) is bounded by OkHttp's connect timeout; a new SfuSocketState.WebSocketConnected marks "socket open, awaiting JoinCallResponse". StreamWebSocket now emits a StreamWebSocketEvent.Open on onOpen to drive this transition (coordinator socket treats it as informational).
  • Dedicated join-response timer. While in WebSocketConnected, a timer (joinResponseTimeoutMs) bounds the wait for JoinCallResponse; on expiry it pushes a retryable NetworkError tagged VideoErrorCode.SFU_JOIN_RESPONSE_TIMEOUT → resolves to DisconnectedTemporarily instead of hanging.
  • Single recoverable path. Removed the extra withTimeoutOrNull in connectInternal; all recoverable errors flow through one DisconnectedTemporarily(error) whose NetworkError carries the exact code + reason. connectInternal classifies the terminal state via triggersReconnect()/networkErrorOrNull().
  • Real error surfaced. StreamWebSocket.onFailure now builds Error.NetworkError directly with the throwable's message and the HTTP response.code when the upgrade failed on an HTTP response (falls back to unknown for pure transport failures).
  • Accurate timeout analytics. toAbortReason() maps both the join-response timeout (by error code) and the transport connect timeout (by SocketTimeoutException cause — a type check, not string matching) to REQUEST_TIMEOUT; everything else stays SFU_ERROR.
  • Join-only analytics. Join-error analytics moved out of connectInternal into Call._join via sendJoinErrorAnalytics(...), so reconnect-driven connects don't report join errors. SfuConnectionResult.Failure now carries recoverable + abortReason; on a recoverable failure the join flow awaits the reconnect outcome (didReconnectSucceed()).
  • Per-session retry counter. RtcSession.sfuWsRetryCount increments only on actual retries (a connectInternal with reconnectDetails), resets to 0 on Connected, and is passed to onSfuWsCompleted.
  • Configurable timeout. StreamVideoBuilder.connectionTimeoutInMs now drives both the SFU OkHttp client and the join-response deadline.
  • Renamed SfuConnectionResult.Connected/FailedSuccess/Failure; renamed awaitReconnectOutcome()didReconnectSucceed().

Testing

  • ./gradlew :stream-video-android-core:testDebugUnitTest --tests "io.getstream.video.android.core.socket.sfu.*" --tests "io.getstream.video.android.core.rtc.*" — pass
  • ./gradlew :stream-video-android-core:compileDebugKotlin — pass
  • ./gradlew :stream-video-android-core:spotlessApply — clean
  • Manual: reproduced an SFU connect timeout via Charles proxy and verified the connection resolves to a retryable disconnect and the reconnect loop takes over (no hang).

New/updated tests: SfuSocketStateServiceTest (new WebSocketConnected/join-response transitions), JoinRecoverableFailureTest (recoverable-vs-permanent join handling), RtcSessionTest2 (connectInternal terminal-state classification), FastReconnectIceRestartTest (result rename).

☑️Contributor Checklist

General

  • I have signed the Stream CLA (required)
  • Assigned a person / code owner group (required)
  • Thread with the PR link started in a respective Slack channel (required internally)
  • PR targets the develop branch
  • PR is linked to the GitHub issue it resolves

Code & documentation

  • Changelog is updated with client-facing changes
  • New code is covered by unit tests
  • Comparison screenshots added for visual changes (N/A — no UI changes)
  • Affected documentation updated (KDocs)
  • Tutorial starter kit updated (N/A)
  • Examples/guides starter kits updated (stream-video-examples) (N/A)

☑️Reviewer Checklist

  • XML sample runs & works
  • Compose sample runs & works
  • Tutorial starter kit
  • Example starter kits work
  • UI Changes correct (before & after images)
  • Bugs validated (bugfixes)
  • New feature tested and works
  • Release notes and docs clearly describe changes
  • All code we touched has new or updated KDocs
  • Check the SDK Size Comparison table in the CI logs

Made with Cursor

Summary by CodeRabbit

  • New Features

    • Added clearer handling for WebSocket connection stages, including a new “WebSocket connected” state before the final join response arrives.
    • Introduced support for timeout-based recovery when the join response does not arrive in time.
  • Bug Fixes

    • Improved reconnect behavior for recoverable connection failures.
    • Added better error reporting for SFU join timeouts and transport failures.
  • Documentation

    • Updated connection timeout guidance and defaults for a more predictable setup experience.

Splits the SFU socket connect into a transport-open phase (bounded by
OkHttp's connect timeout) and a distinct join-response phase (bounded by
a dedicated timer via a new WebSocketConnected state), so a silent SFU no
longer hangs the join. Surfaces real transport failure messages and HTTP
status codes on the resulting NetworkError, routes all recoverable errors
through a single DisconnectedTemporarily state carrying the exact
code/reason, and maps both timeout flavours to REQUEST_TIMEOUT in analytics
(join-response via error code, transport via SocketTimeoutException cause).

Also moves join-error analytics to the join flow only, adds a per-session
SFU WS retry counter, and wires connectionTimeoutInMs from the builder to
both the OkHttp client and the join-response deadline.

Co-authored-by: Cursor <cursoragent@cursor.com>
@PratimMallick PratimMallick requested a review from a team as a code owner July 1, 2026 13:26
@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

PR checklist ❌

The following issues were detected:

  • Linked issue missing. Add a Linear ticket reference (e.g. AND-123, Closes AND-123, or a https://linear.app/... link) or a GitHub issue (Closes #123) to the PR description.

What we check

  1. Title is concise (5–18 words) unless labeled pr:ignore-for-release.
  2. At least one pr: label exists (e.g., pr:bug, pr:new-feature).
  3. Sections ### Goal, ### Implementation, and ### Testing contain content. Bot-authored PRs are exempt.
  4. PR description references an issue (Linear ticket like AND-123, a Linear URL, or a GitHub Closes #N). Bot-authored PRs are exempt.

@github-actions

github-actions Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

SDK Size Comparison 📏

SDK Before After Difference Status
stream-video-android-core 12.27 MB 12.27 MB 0.00 MB 🟢
stream-video-android-ui-xml 5.68 MB 5.68 MB 0.00 MB 🟢
stream-video-android-ui-compose 6.20 MB 6.20 MB 0.00 MB 🟢

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 7.41% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description covers Goal, Implementation, Testing, and checklists well; only optional UI/GIF sections are missing, which is acceptable for a non-visual change.
Title check ✅ Passed The title is concise and accurately reflects the main change around distinguishing SFU WebSocket connection and join-response timeouts.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/sfu-ws-connection-timeout-handling

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/internal/module/SfuConnectionModule.kt (1)

60-73: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Split the API and SFU socket clients

callTimeout only affects the handshake; the real risk is that this shared client also sets the SFU WebSocket readTimeout from connectionTimeoutInMs. A 5s idle window can drop quiet socket sessions. Keep the API client bounded, but give the socket its own client with readTimeout(0) and pingInterval (joinResponseTimeoutMs already covers the post-open join deadline).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/internal/module/SfuConnectionModule.kt`
around lines 60 - 73, The shared OkHttpClient built by buildSfuOkHttpClient is
applying connectionTimeoutInMs to readTimeout, which can prematurely close quiet
SFU WebSocket sessions. Split the API and socket client configuration: keep the
API-side client bounded with the existing timeouts, but create a separate client
for the SFU socket path with readTimeout(0) and a pingInterval, using
joinResponseTimeoutMs to enforce the post-open join deadline. Update the
relevant client निर्माण in SfuConnectionModule so the WebSocket path no longer
inherits the API read timeout.
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocket.kt (1)

148-197: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Stale joinResponseTimeoutJob not cancelled when re-entering Connecting.

Per SfuSocketStateService, WebSocketConnected transitions back to Connecting on a new Connect event (reachable whenever connect() is invoked while still awaiting the join response, e.g. a caller-driven reconnect). The observer's Connecting branch (Lines 181-184) starts a new connectUser() flow but doesn't cancel the previously-started joinResponseTimeoutJob. If that stale timer fires before the new attempt reaches WebSocketConnected, it injects SFU_JOIN_RESPONSE_TIMEOUT into the state machine while it's in Connecting, which (per the FSM) drives the new attempt straight to DisconnectedTemporarily — a spurious failure of an unrelated connection attempt. Every other exit from WebSocketConnected (Connected, Disconnected) explicitly cancels the timer; Connecting should too.

🔧 Proposed fix
                     is SfuSocketState.Connecting -> {
+                        cancelJoinResponseTimeout()
                         connectUser(state.connectionConf)
                         callListeners { listener -> listener.onConnecting() }
                     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocket.kt`
around lines 148 - 197, The SfuSocketState observer in SfuSocket.kt leaves a
stale joinResponseTimeoutJob running when the state re-enters
SfuSocketState.Connecting from SfuSocketState.WebSocketConnected. Update the
Connecting branch in the observer to cancelJoinResponseTimeout() before calling
connectUser(state.connectionConf), matching the cleanup already done in
Connected and Disconnected so an old timeout cannot fire during a new connect
attempt.
🧹 Nitpick comments (2)
stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/RtcSessionTest2.kt (1)

54-54: 📐 Maintainability & Code Quality | 🔵 Trivial

Prefer org.junit.Assert.assertFalse over the legacy junit.framework.TestCase import.

The rest of the file uses JUnit 4 style assertions (assertTrue, assertEquals, assertNotNull); importing assertFalse from the legacy junit.framework.TestCase shim is inconsistent.

♻️ Proposed fix
-import junit.framework.TestCase.assertFalse
+import org.junit.Assert.assertFalse
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/RtcSessionTest2.kt`
at line 54, Replace the legacy junit.framework.TestCase.assertFalse import in
RtcSessionTest2 with org.junit.Assert.assertFalse so the test file uses
consistent JUnit 4 assertions alongside assertTrue, assertEquals, and
assertNotNull; update the import in the test class and keep the existing
assertFalse call sites unchanged.
stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateService.kt (1)

172-227: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

New WebSocketConnected state block triplicates terminal-event handling.

The new state<SfuSocketState.WebSocketConnected> block repeats the same seven onEvent handlers (WebSocketEventLost, NetworkNotAvailable, UnrecoverableError, NetworkError, RequiredDisconnection, Stop, ConnectionEstablished) already present verbatim in Connecting (Lines 177-199) and Connected (Lines 230-251). This is now the third copy of identical transition logic; a future change to one of these outcomes (e.g. adjusting DisconnectedTemporarily construction) risks silently diverging across the three states.

Consider extracting a shared handler block/function that all three "active" states can apply, if the FiniteStateMachine DSL supports composition.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateService.kt`
around lines 172 - 227, The new SfuSocketState.WebSocketConnected state
duplicates the same terminal-event transition logic already used in
SfuSocketState.Connecting and SfuSocketState.Connected. Refactor the repeated
onEvent handlers into a shared helper or reusable transition block in
SfuSocketStateService, then apply it to all three active states so
ConnectionEstablished, WebSocketEventLost, NetworkNotAvailable,
UnrecoverableError, NetworkError, RequiredDisconnection, and Stop stay
consistent in one place.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt`:
- Around line 774-783: Avoid racing on the mutable session reference during SFU
connect in Call: the null check and the later connectInternal() call on
session.value!! are not atomic, so cleanup can clear the session and cause an
NPE. In the connect/join flow, capture session.value into a localSession once
after the null check, use that same localSession to set state._connection and to
call connectInternal(), and then use the final active session consistently
before returning; apply the same pattern to the other connect path referenced by
the review.
- Around line 2277-2279: The timeout comment in Call’s connection flow is stale
because it still mentions a 10s default even though
StreamVideoBuilder.connectionTimeoutInMs now defaults to 5s. Update the wording
near the connection/join timeout explanation to remove the hardcoded literal and
describe the timeout in a way that won’t drift from the builder’s default; use
the existing connectionTimeoutInMs and StreamVideoBuilder references to locate
the comment.

---

Outside diff comments:
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/internal/module/SfuConnectionModule.kt`:
- Around line 60-73: The shared OkHttpClient built by buildSfuOkHttpClient is
applying connectionTimeoutInMs to readTimeout, which can prematurely close quiet
SFU WebSocket sessions. Split the API and socket client configuration: keep the
API-side client bounded with the existing timeouts, but create a separate client
for the SFU socket path with readTimeout(0) and a pingInterval, using
joinResponseTimeoutMs to enforce the post-open join deadline. Update the
relevant client निर्माण in SfuConnectionModule so the WebSocket path no longer
inherits the API read timeout.

In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocket.kt`:
- Around line 148-197: The SfuSocketState observer in SfuSocket.kt leaves a
stale joinResponseTimeoutJob running when the state re-enters
SfuSocketState.Connecting from SfuSocketState.WebSocketConnected. Update the
Connecting branch in the observer to cancelJoinResponseTimeout() before calling
connectUser(state.connectionConf), matching the cleanup already done in
Connected and Disconnected so an old timeout cannot fire during a new connect
attempt.

---

Nitpick comments:
In
`@stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateService.kt`:
- Around line 172-227: The new SfuSocketState.WebSocketConnected state
duplicates the same terminal-event transition logic already used in
SfuSocketState.Connecting and SfuSocketState.Connected. Refactor the repeated
onEvent handlers into a shared helper or reusable transition block in
SfuSocketStateService, then apply it to all three active states so
ConnectionEstablished, WebSocketEventLost, NetworkNotAvailable,
UnrecoverableError, NetworkError, RequiredDisconnection, and Stop stay
consistent in one place.

In
`@stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/RtcSessionTest2.kt`:
- Line 54: Replace the legacy junit.framework.TestCase.assertFalse import in
RtcSessionTest2 with org.junit.Assert.assertFalse so the test file uses
consistent JUnit 4 assertions alongside assertTrue, assertEquals, and
assertNotNull; update the import in the test class and keep the existing
assertFalse call sites unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 02a2be2b-3aef-4a48-9d87-cdbd58e9bf59

📥 Commits

Reviewing files that changed from the base of the PR and between 6035a0d and 24b05e3.

📒 Files selected for processing (19)
  • stream-video-android-core/api/stream-video-android-core.api
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/StreamVideoBuilder.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/StreamVideoClient.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/call/RtcSession.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/errors/VideoErrorCode.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/internal/module/SfuConnectionModule.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/common/StreamWebSocket.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/common/StreamWebSocketEvent.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/coordinator/CoordinatorSocket.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocket.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketConnection.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateService.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/state/SfuSocketState.kt
  • stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/socket/sfu/state/SfuSocketStateEvent.kt
  • stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/FastReconnectIceRestartTest.kt
  • stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/JoinRecoverableFailureTest.kt
  • stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/rtc/RtcSessionTest2.kt
  • stream-video-android-core/src/test/kotlin/io/getstream/video/android/core/socket/sfu/SfuSocketStateServiceTest.kt

Comment thread stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt Outdated
Comment thread stream-video-android-core/src/main/kotlin/io/getstream/video/android/core/Call.kt Outdated
@PratimMallick PratimMallick added the pr:improvement Enhances an existing feature or code label Jul 1, 2026
@aleksandar-apostolov aleksandar-apostolov changed the title fix(core): report SFU WebSocket connection/join timeouts accurately Report SFU WebSocket connection and join timeouts accurately Jul 2, 2026
…ssion.value was cleared to null while connecting to sfu was still in progress

@aleksandar-apostolov aleksandar-apostolov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid PR. Few things before merge:

  1. connectionTimeoutInMs KDoc says "Defaults to 10s" but the default is now 5_000. 5s is fine, just fix the doc. StreamVideoClient still defaults to 10_000, so align those too, and add a changelog line since the coordinator/SFU connect budget effectively drops from 10s.

  2. On the recovered path, once didReconnectSucceed() returns true, _join falls through to setActiveCall + monitorSession(result.value) + Success. But reconnect already created a new session and called monitorSession on it. Does this double-register, or hand back a response that doesn't match the recovered session?

  3. didReconnectSucceed() waits on state.connection.first { ... } with no timeout. Reconnect is bounded so it should settle, but if it hits Disconnected transiently it returns false early, and if it never settles _join hangs. A timeout on the same budget would be safer.

Nit: the RtcSessionTest2 timeout test emits DisconnectedTemporarily directly, so the WebSocketConnected -> NetworkError -> DisconnectedTemporarily transition isn't actually tested. Worth a case in SfuSocketStateServiceTest.

@PratimMallick

PratimMallick commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

From the comments of @aleksandar-apostolov

  1. connectionTimeoutInMs KDoc says "Defaults to 10s" but the default is now 5_000. 5s is fine, just fix the doc. StreamVideoClient still defaults to 10_000, so align those too, and add a changelog line since the coordinator/SFU connect budget effectively drops from 10s.
    Fixed
  2. On the recovered path, once didReconnectSucceed() returns true, _join falls through to setActiveCall + monitorSession(result.value) + Success. But reconnect already created a new session and called monitorSession on it. Does this double-register, or hand back a response that doesn't match the recovered session?
    Added a check for this. In case of fast reconnect we donot monitor seperately
  3. didReconnectSucceed() waits on state.connection.first { ... } with no timeout. Reconnect is bounded so it should settle, but if it hits Disconnected transiently it returns false early, and if it never settles _join hangs. A timeout on the same budget would be safer.
    The reconnect loop always reaches a terminal state and hence we donot have to worry about this 'hung' scenario

- Rename SfuSocketStateEvent.WebSocketConnected to WebSocketEstablished to
  avoid the naming collision with the SfuSocketState.WebSocketConnected state.
- Guard monitorSession() in _join recovery so it only runs when the original
  session is reused, preventing double-registration after rejoin/migrate.
- Default connectionTimeoutInMs to 5s and fix stale KDoc.
- Rename triggersReconnect() to canTriggersReconnect() for clarity.

Co-authored-by: Cursor <cursoragent@cursor.com>
Comment on lines +943 to +944
val sfuSocketState = sfuConnectionModule.socketConnection.state().first {
it is SfuSocketState.Connected || it is SfuSocketState.Disconnected

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we covered the failure case in which our retries will be exhausted?

Look at this code where we do an early exit after 10 retries -

if (loopIteration >= MAX_RECONNECT_ATTEMPTS) {
     logger.w { "[reconnect] Max reconnect attempts ($MAX_RECONNECT_ATTEMPTS) reached — giving up" }
     state._connection.value = RealtimeConnection.ReconnectingFailed
     break
}

@PratimMallick PratimMallick Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the reconnect loop comes into action after the socket state has moved :

socketState has the following states :
Connecting -> WebSocketConnected -> Connected -> Disconnected

Disconnected futher has 7 mini states of its own

When the socket moves to any of the disconnected states the retry loop starts. Also parallely this code

val sfuSocketState = sfuConnectionModule.socketConnection.state().first {
            it is SfuSocketState.Connected || it is SfuSocketState.Disconnected
            }

would collapse and go into the else part of the cide which returns a SfuConnectionResult.Failure to the caller of connectInternal which is _join in the joining scenario . The _join now awaits on the result of retry loop in the method didReconnectSucceed where we take care of the events which retry loop can emit which are :
ReconnectOutcome.Success , ReconnectOutcome.PreconditionNotMet , ReconnectOutcome.PeerConnectionStale , ReconnectOutcome.Failed,

So we have covered all the scenarios of retry loop

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice 👍

…_TIMEOUT

OkHttp connect/call timeouts surface as InterruptedIOException("timeout"),
not SocketTimeoutException, so SFU join analytics were incorrectly reporting
SFU_ERROR despite the failure reason being "timeout".

Co-authored-by: Cursor <cursoragent@cursor.com>
@sonarqubecloud

sonarqubecloud Bot commented Jul 2, 2026

Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
46.7% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr:improvement Enhances an existing feature or code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants