fix(voice): gapless audio chunk playback#1747
Conversation
🦋 Changeset detectedLatest commit: 8e4717e The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
|
/bonk do a deep review, look for edge cases and things we might have missed, how's the ux/dx on this? |
There was a problem hiding this comment.
Deep Review
Good approach overall — scheduling on the audio clock instead of waiting for ended callbacks is the right fix for this class of problem. The generation-based cancellation pattern is solid and the test coverage is thorough. A few things worth discussing:
Things that look good
-
Core scheduling logic is correct.
Math.max(ctx.currentTime, cursor)handles both the happy path (back-to-back) and the catch-up case (cursor fell behind real time) cleanly. -
#isScheduling/#isPlayingseparation. This is a subtle but important distinction —#isPlayingstays true while scheduled sources are still audibly playing even after the scheduling loop finishes, so barge-in detection works through the "tail." Nice. -
Generation-based cancellation. The
#playbackGenerationpattern already existed, and extending it to cover the new scheduling loop +onendedcallbacks is the right move. Multiple interrupt vectors (serverplayback_interrupt, user transcript, client-side RMS interrupt,endCall) all go through#stopPlayback, which is clean. -
Cursor reset on
#stopPlayback. Without this, a stale cursor from a previous utterance would cause the next one to schedule far in the future. Good catch.
Edge cases and concerns
1. onended callbacks from source.stop() during #stopPlayback
When #stopPlayback iterates through [...this.#scheduledSources], it calls source.stop() on each. Per the Web Audio spec, stop() dispatches the ended event. The onended handler checks generation === this.#playbackGeneration, but #stopPlayback increments #playbackGeneration before calling stop(), so the generation check will fail and the handler is effectively a no-op. This is correct! But it's subtle — worth a brief inline comment on why the generation increment must come before the stop loop.
2. Floating-point drift in cursor accumulation
Each chunk advances the cursor by audioBuffer.duration, which is a floating-point number. Over a long agent response (say 60s of audio in 0.1s chunks = 600 additions), accumulated floating-point error could theoretically cause micro-gaps or micro-overlaps. In practice this is probably negligible (IEEE 754 double precision gives ~15 significant digits, and we're adding values around 0.1), but it's worth being aware of. If it ever becomes an issue, tracking the cursor in samples rather than seconds would eliminate it.
3. No test for #isPlaying transitioning to false after all sources end
The tests verify scheduling times and interrupt behavior, but none assert that #isPlaying becomes false after all scheduled sources finish playing (i.e., after all onended callbacks fire). This is the path that would surface bugs in the !#isScheduling && #scheduledSources.size === 0 && #playbackQueue.length === 0 condition. Consider adding a test that sends a few chunks, fires all onended callbacks, and asserts the client no longer considers playback active. You can observe this indirectly through barge-in: after all sources end, RMS above the interrupt threshold should not trigger an interrupt message.
4. Race: new audio arriving between #isScheduling = false and last onended
Consider this sequence:
#processPlaybackQueuefinishes, sets#isScheduling = false- Sources are still playing (
#scheduledSources.size > 0), so#isPlayingstaystrue - New audio chunk arrives from the server
#processPlaybackQueueis called again —#isSchedulingisfalse, so it enters the loop- Meanwhile, the last source's
onendedfires
This is actually fine because step 4 will set #isScheduling = true again before any await, and the onended handler checks !this.#isScheduling. Just wanted to confirm this is intentional — the interleaving is safe because JS is single-threaded and the #isScheduling = true in step 4 happens synchronously before any microtask queue processing.
5. pcm16 path creates AudioBuffer with duration from length / sampleRate
The createBuffer in the test fake returns duration: length / sampleRate, which matches the real Web Audio API. But the production code at line 896-901 creates the buffer at 16kHz while the AudioContext is at 48kHz. The browser will implicitly resample, but audioBuffer.duration is based on the buffer's own sample rate (16kHz), so the duration is correct for scheduling. Good — no issue here, just noting it's correct for a non-obvious reason.
Minor / DX notes
- The changeset description is excellent — it explains both the symptom and the mechanism clearly. Consumers upgrading will understand what changed and why.
- The test helper
startPcm16Call()using PCM16 format is a good choice — it avoidsdecodeAudioDataasync complexity and makes the scheduling math deterministic.
Verdict
This is a solid fix. The core approach is correct, the edge cases around cancellation are handled well, and the test coverage is good. The main suggestion would be adding a test for the #isPlaying → false transition after natural playback completion (concern #3 above). The rest is minor.
|
|
||
| #stopPlayback(): void { | ||
| const source = this.#activeSource; | ||
| this.#playbackGeneration++; |
There was a problem hiding this comment.
Nit: worth a brief comment here explaining why the generation increment must come before the stop loop — source.stop() synchronously dispatches ended, and the handler checks generation === this.#playbackGeneration, so bumping first makes those callbacks no-ops. Without this ordering, the onended handler would set #isPlaying = false prematurely before we finish cleaning up.
|
|
||
| // Without the reset this would start at the stale 5.1 cursor. | ||
| expect(sources[1].startedAt).toBe(2); | ||
| }); |
There was a problem hiding this comment.
Consider adding a test that verifies #isPlaying transitions to false after all onended callbacks fire (natural playback completion, no interrupt). You could observe this indirectly: send a couple of PCM chunks, fire all onended callbacks on the sources, then send RMS above the interrupt threshold and assert that no { type: "interrupt" } message is sent — because the client should no longer consider playback active.
This would exercise the !this.#isScheduling && this.#scheduledSources.size === 0 && this.#playbackQueue.length === 0 condition in the onended handler.
|
Review posted on PR #1747. Here's a summary of the findings: Overall verdict: Solid fix. The approach of scheduling chunks on the audio clock via a playback cursor instead of waiting for Key findings:
|
This PR fixes audible clicks at audio chunk boundaries during agent speech in
VoiceClient. Closes #1736.Why
VoiceClientplayed each response chunk by starting it atcurrentTimeand waiting for itsendedevent before scheduling the next. The round-trip latency (event-loop delay plus the next chunk's setup) left a few milliseconds of silence at every seam, audible as a periodic click.endedcallbacks run on the main thread, not the audio clock, so there will always be some gap.Code Changes
packages/voice/src/voice-client.ts#activeSourcewith a#scheduledSourcesset and a#playbackCursor. Each incoming chunk is scheduled atMath.max(currentTime, cursor), and the cursor is advanced by the chunk duration.#stopPlaybackstops every scheduled source, not just the current one, so an interrupt or end-call cleanly cancels the whole pipeline.packages/voice/src/tests/voice-client.test.tscurrentTime, interrupt stops all scheduled chunks, barge-in works through the scheduled tail, and cursor reset on end-call.