Reconstruct fragmented mic clip: fragment-aware direct capture by MaxHeimbrock · Pull Request #306 · livekit/client-sdk-unity

MaxHeimbrock · 2026-06-12T13:30:14Z

Stacked on #304. Fixes the choppy/garbled published audio with a Bluetooth HFP headset mic on macOS — by reading the audio Unity actually delivers, which turned out to be intact but scattered.

Root cause (proven by buffer inspection)

A raw WAV dump of the mic clip in the bad state showed the exact structure: FMOD writes each real 20 ms packet of clip.frequency audio, then advances Microphone.GetPosition as if it had written ~3.2× as much, zero-filling the skipped range. Concretely: valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k where k = counterRate/clip.frequency = 3.2), with exact-zero padding between them (true silence has a real noise floor, never exact zeros).
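
The structure described above can be checked mechanically on a dumped buffer. The following is a minimal sketch, not the analysis script used for this PR: `make_padded_buffer`, `find_fragments`, `summarize`, and the `min_gap` threshold are hypothetical names and values chosen for illustration. It treats any sufficiently long run of exact zeros as padding and reports the fragment lengths and strides it finds.

```python
def make_padded_buffer(n_fragments, fragment_len, stride):
    """Build a synthetic buffer: nonzero fragments followed by exact-zero padding."""
    buf = []
    for _ in range(n_fragments):
        buf.extend(1 + (i % 7) for i in range(fragment_len))  # any nonzero values
        buf.extend([0] * (stride - fragment_len))
    return buf


def find_fragments(samples, min_gap=16):
    """Return (start, length) of each run of audio separated by >= min_gap exact zeros.

    min_gap is an assumed threshold so that an incidental zero sample inside
    real audio does not split a fragment.
    """
    fragments = []
    start = None   # start index of the fragment currently being scanned
    zero_run = 0   # length of the current run of exact zeros
    for i, s in enumerate(samples):
        if s == 0:
            zero_run += 1
            if start is not None and zero_run == min_gap:
                # The fragment ended where this zero run began.
                fragments.append((start, i - min_gap + 1 - start))
                start = None
        else:
            if start is None:
                start = i
            zero_run = 0
    if start is not None:
        fragments.append((start, len(samples) - zero_run - start))
    return fragments


def summarize(fragments):
    """Return the distinct fragment lengths and the distinct start-to-start strides."""
    lengths = sorted({length for _, length in fragments})
    starts = [start for start, _ in fragments]
    strides = sorted({b - a for a, b in zip(starts, starts[1:])})
    return lengths, strides
```

On a buffer with the layout reported here, `summarize` would return a single length (320) and a single stride (1024), and their ratio 1024/320 gives k = 3.2, which at a 16 kHz clip corresponds to a counter rate of about 51.2 kHz.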

Junction analysis showed the fragments join continuously (boundary sample deltas within normal in-fragment variation) — the full stream is present, just zero-padded. Concatenating fragments reconstructed clean, correct-pitch voice (verified by ear).
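
A junction check of this kind can be sketched as follows. This is an illustration on a synthetic tone, under the assumption that fragments start at stride-aligned offsets; `make_tone`, `scatter`, `concatenate_fragments`, and `junction_report` are hypothetical helpers, not code from this branch. It compares the sample-to-sample delta across each fragment boundary with the deltas inside fragments after concatenation.

```python
import math


def make_tone(n_samples, frequency, sample_rate):
    """Generate a contiguous sine tone as a list of floats."""
    return [math.sin(2 * math.pi * frequency * n / sample_rate) for n in range(n_samples)]


def scatter(stream, fragment_len, stride):
    """Lay a contiguous stream out as fragments with exact-zero padding between them."""
    out = []
    for i in range(0, len(stream), fragment_len):
        chunk = stream[i:i + fragment_len]
        out.extend(chunk)
        out.extend([0.0] * (stride - len(chunk)))
    return out


def concatenate_fragments(buffer, fragment_len, stride):
    """Keep the first fragment_len samples of every stride and drop the padding."""
    out = []
    for start in range(0, len(buffer), stride):
        out.extend(buffer[start:start + fragment_len])
    return out


def junction_report(buffer, fragment_len, stride):
    """Return (largest delta across a junction, largest delta inside a fragment)."""
    joined = concatenate_fragments(buffer, fragment_len, stride)
    deltas = [abs(b - a) for a, b in zip(joined, joined[1:])]
    # deltas[i] spans samples i and i+1, so junctions sit at fragment_len - 1, 2*fragment_len - 1, ...
    junction = [deltas[i - 1] for i in range(fragment_len, len(joined), fragment_len)]
    inside = [d for i, d in enumerate(deltas) if (i + 1) % fragment_len != 0]
    return max(junction), max(inside)
```

If the fragments really are consecutive pieces of one stream, the junction deltas stay in the same range as the in-fragment deltas; a discontinuity at the boundaries would show up as junction deltas well above that range.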

This explains every prior symptom:

- plain playback → 31% voice + 69% silence = chop
- counter-paced reading → fragments + padding played fast over a live buffer = noise with echo
- pitch servo (either model) → cannot help; the gaps are in the data layout, not the timing

Change

MicrophoneSource now does fragment-aware direct capture:

- Reads the clip ring buffer directly (no AudioSource, no OnAudioFilterRead), which also decouples capture from the output device's clock.
- A ~0.3 s pre-roll measures k (counter rate ÷ clip.frequency) and the counter's smallest discrete jump (the stride J).
- k ≈ 1 (healthy devices): plain contiguous read at the counter's pace.
- k > 1.05 (this state): read only the first J/k samples of each stride (exactly the valid fragments), skipping the padding.
- Downmix → mono, resample clip.frequency → fixed 48 kHz native source (streaming linear; resampler state carries across fragments since junctions are continuous).
- Backlog beyond 200 ms after a stall is dropped (stride-aligned) so the native queue can't overrun.
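
The fragment-aware read and the stride-aligned backlog drop can be sketched roughly as below. This is a simplified illustration, not the MicrophoneSource implementation: `read_fragments` and `drop_backlog` are hypothetical names, the read position is assumed to start stride-aligned, the ring size is assumed to be a multiple of the stride, and `max_backlog` is expressed in buffer positions (how the PR converts 200 ms into positions is not shown here).

```python
def read_fragments(ring, read_pos, write_pos, stride, fragment_len):
    """Collect the valid samples of every complete stride between read_pos and write_pos.

    Returns (samples, new_read_pos). Only whole strides are consumed, so the
    read position stays stride-aligned.
    """
    size = len(ring)
    available = (write_pos - read_pos) % size
    out = []
    while available >= stride:
        # The first fragment_len samples of the stride are real audio; the rest is padding.
        out.extend(ring[(read_pos + i) % size] for i in range(fragment_len))
        read_pos = (read_pos + stride) % size
        available -= stride
    return out, read_pos


def drop_backlog(read_pos, write_pos, size, stride, max_backlog):
    """Skip whole strides after a stall so the unread backlog shrinks toward max_backlog.

    Rounds down to whole strides, so up to max_backlog + stride - 1 positions
    may remain unread; the read position never passes the write position.
    """
    available = (write_pos - read_pos) % size
    if available <= max_backlog:
        return read_pos
    strides_to_skip = (available - max_backlog) // stride
    return (read_pos + strides_to_skip * stride) % size
```

With J = 1024 and k = 3.2, `fragment_len` would be J/k = 320, matching the fragment size seen in the buffer dump.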

Expected log in the bad state:

MicrophoneSource: fragmented clip detected (k=3.20); reading 320 of every 1024 samples at 16000Hz

Healthy devices log contiguous capture (k=1.00).
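
As a rough illustration of how the pre-roll numbers lead to the fragmented-state log line above, the sketch below derives k and J from polled (time, position) pairs. `measure_preroll` and `describe_mode` are hypothetical names and the polling scheme is an assumption; only the fragmented-state message text is taken from this description, and the healthy-state wording is deliberately not reproduced.

```python
def measure_preroll(observations, clip_frequency, ring_size):
    """Estimate k and the stride J from (seconds, position) pairs polled during pre-roll."""
    total = 0
    jumps = []
    for (_, p0), (_, p1) in zip(observations, observations[1:]):
        jump = (p1 - p0) % ring_size  # unwrap ring-buffer wraparound
        total += jump
        if jump > 0:
            jumps.append(jump)
    elapsed = observations[-1][0] - observations[0][0]
    counter_rate = total / elapsed
    k = counter_rate / clip_frequency
    stride = min(jumps)  # smallest discrete jump the counter made
    return k, stride


def describe_mode(k, stride, clip_frequency, threshold=1.05):
    """Return the fragmented-state log message, or None for the contiguous path."""
    if k > threshold:
        valid = round(stride / k)
        return (f"MicrophoneSource: fragmented clip detected (k={k:.2f}); "
                f"reading {valid} of every {stride} samples at {clip_frequency}Hz")
    return None
```

For a counter that jumps by 1024 every 20 ms on a 16 kHz clip, this gives k ≈ 3.2 and J = 1024, and therefore 320 valid samples per stride.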

Verification

Runtime compiles clean.
Buffer-dump analysis (fragment sizes/strides, junction continuity, reconstruction) done offline on a captured WAV; reconstruction validated by ear.
To validate end-to-end: mac publisher with BT headset mic → Android receiver; expect clean, correct-pitch, non-choppy audio. Built-in mic should behave unchanged.

History

This branch went through two falsified designs first — a pitch servo at the counter ratio (garbled: the counter doesn't describe the data) and a k-rescaled lag servo at pitch 1 (perfect telemetry, still choppy: the gaps are in the buffer itself). The WAV dump diagnostic settled it. Commits preserved for the record.

🤖 Generated with Claude Code

The native (Rust) audio source was created with a hardcoded sample rate (48000) and channel count (2). Microphone frames flow through Unity's audio graph (AudioProbe) at the actual DSP output configuration, which often differs, e.g. with a Bluetooth headset. The Rust source does not resample; it rejects frames whose rate/channels don't match, causing the metadata-mismatch warning and capture failures.

Read the source's sample rate and channel count from Unity's output configuration (AudioSettings.GetConfiguration) instead of hardcoded defaults, falling back to the defaults only when Unity can't report one.

The base constructor now exposes a device-mode overload (type only) and an explicit overload (type, sampleRate, channels) for sources that generate a fixed format. MicrophoneSource and BasicAudioSource use device mode; BasicAudioSource drops its unused channels parameter. SineWaveAudioSource declares its exact format.

If a frame's format still doesn't match (inconsistent Unity report or a runtime output change), drop it with a throttled warning instead of sending a mismatch the native side would error on.

Also removes the redundant Microphone.Start in the Meet sample.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
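
The fallback and the drop-with-throttled-warning behavior described in this commit could look roughly like the sketch below. It is an illustration only: `resolve_format`, `FormatGate`, and the 2-second warning interval are assumptions, while the defaults (48000 Hz, 2 channels) are the hardcoded values named in the commit message.

```python
DEFAULT_SAMPLE_RATE = 48000  # previous hardcoded default, per the commit message
DEFAULT_CHANNELS = 2


def resolve_format(reported_rate, reported_channels):
    """Prefer the reported output configuration; fall back only when it is missing."""
    rate = reported_rate if reported_rate and reported_rate > 0 else DEFAULT_SAMPLE_RATE
    channels = reported_channels if reported_channels and reported_channels > 0 else DEFAULT_CHANNELS
    return rate, channels


class FormatGate:
    """Drop frames whose format differs from the declared one, warning at most once per interval."""

    def __init__(self, sample_rate, channels, warn_interval=2.0):
        self.sample_rate = sample_rate
        self.channels = channels
        self.warn_interval = warn_interval  # assumed throttle period, in seconds
        self._last_warning = None
        self.warnings = []  # stands in for the real logger

    def accept(self, frame_rate, frame_channels, now):
        """Return True if the frame may be sent to the native source."""
        if frame_rate == self.sample_rate and frame_channels == self.channels:
            return True
        if self._last_warning is None or now - self._last_warning >= self.warn_interval:
            self._last_warning = now
            self.warnings.append(
                f"dropping frame: got {frame_rate}Hz/{frame_channels}ch, "
                f"expected {self.sample_rate}Hz/{self.channels}ch"
            )
        return False
```

Dropping on the managed side keeps a mismatched frame from ever reaching the native source, which would otherwise reject it with an error.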

Temporary, ~2s-throttled diagnostics to investigate choppy received audio:

- RtcAudioSource logs the effective capture sample rate (samples/sec by wall clock) vs the rate declared to the native source. A measured rate that differs from the declared rate means the frame format label is wrong, which would sound fast/slow/choppy on the receiver.
- AudioStream logs buffer fill, underrun count, callback count and frames received, to distinguish receive-side starvation from a clean stream.

Emitted via Utils.Info so they appear without LK_DEBUG (Utils.Debug is compiled out unless LK_DEBUG is defined).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
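
The effective-rate measurement amounts to counting samples against the wall clock. A minimal sketch, assuming a hypothetical `CaptureRateMeter` that is fed the per-channel sample count of each frame together with the current time (the real RtcAudioSource diagnostic may be structured differently):

```python
class CaptureRateMeter:
    """Measure samples per second by wall clock and report it about once per interval."""

    def __init__(self, declared_rate, now, interval=2.0):
        self.declared_rate = declared_rate
        self.interval = interval      # ~2 s throttle, as in the commit message
        self._window_start = now
        self._samples = 0

    def add(self, samples_per_channel, now):
        """Count a frame; return (measured_rate, declared_rate) when a window closes, else None."""
        self._samples += samples_per_channel
        elapsed = now - self._window_start
        if elapsed < self.interval:
            return None
        measured = self._samples / elapsed
        # Start a fresh window so each report covers only recent frames.
        self._window_start = now
        self._samples = 0
        return measured, self.declared_rate
```

A report such as (16000.0, 48000) would indicate that frames labeled 48 kHz are actually arriving at 16 kHz worth of samples per second.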

MicrophoneSource started the device at the hardcoded DefaultMicrophoneSampleRate and played the looping clip through an AudioSource read on the DSP thread. When the device's actual rate differs from the engine output rate, the clip fills and plays back at different rates, so the read position drifts against the write position and the captured audio becomes choppy.

Open the microphone at AudioSettings.outputSampleRate when the device supports it (clamped to the device's reported caps; falling back to the default when the output rate is unknown), so capture and playback run at the same rate. This also aligns the mic rate with the native source rate, which is taken from the same output configuration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
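
The rate selection described here reduces to a clamp with two special cases. The sketch below is a hedged illustration: `choose_mic_rate` is a hypothetical helper, `default_rate` stands in for DefaultMicrophoneSampleRate (whose value is not given in this page), and it relies on Unity's documented convention that Microphone.GetDeviceCaps reports 0 for both limits when the device supports any frequency.

```python
def choose_mic_rate(output_rate, min_freq, max_freq, default_rate):
    """Pick the rate to open the microphone at.

    min_freq and max_freq are the device's reported caps; (0, 0) means the
    device accepts any rate.
    """
    if output_rate <= 0:
        # Output rate unknown: keep the previous default.
        return default_rate
    if min_freq == 0 and max_freq == 0:
        return output_rate
    # Clamp the output rate into the device's supported range.
    return max(min_freq, min(output_rate, max_freq))
```

For example, an output rate of 48000 with device caps of 8000 to 16000 would open the microphone at 16000, the closest supported rate.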

The mic clip is filled by the capture device's clock while the AudioSource that plays it (feeding OnAudioFilterRead) runs on the output device's clock. Some devices also misreport the clip rate entirely: a Bluetooth headset on macOS labels its clip 16kHz while filling it at ~51kHz. Either way the read head drifts against the write head and gets lapped, which sounds like periodic chopping.

Add a pacing servo that measures how fast the write head actually advances (GetPosition delta over wall clock - steady within ±0.1% even when the instantaneous position is jumpy) and continuously adjusts AudioSource.pitch so the read head consumes clip samples at the same rate, holding a fixed lag behind the writer. A short pre-roll measures the rate before playback starts so the initial pitch is already correct; the fill-rate estimate and the lag target (sized to ~4x observed jitter, bounded by clip capacity) keep adapting while capturing, and an out-of-bounds resync recovers from long hitches.

In the normal case the measured rate matches clip.frequency, pitch hovers at ~1.0, and the servo is effectively a no-op. In the misreporting case pitch settles at the true ratio (~3.2), which plays the clip's real-time data at correct speed and eliminates the chop. Pitch is rate control, not a delay: the added latency is only the held lag (~80-150ms, adaptive).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
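
The control idea in this commit (the design the later commits report as falsified) can be sketched as a feed-forward rate term plus a proportional lag correction. `PacingServo`, its `gain`, and its `smoothing` factor are hypothetical; the commit does not state the actual control law or constants.

```python
class PacingServo:
    """Track the write head's measured rate and derive a playback pitch from it."""

    def __init__(self, clip_frequency, target_lag, gain=0.5, smoothing=0.1):
        self.clip_frequency = clip_frequency
        self.target_lag = target_lag           # samples the reader should stay behind the writer
        self.gain = gain                       # assumed proportional gain, per second of lag error
        self.smoothing = smoothing             # assumed exponential smoothing factor
        self.fill_rate = float(clip_frequency)

    def update_fill_rate(self, position_delta, seconds):
        """Fold one (position delta over wall clock) measurement into the rate estimate."""
        measured = position_delta / seconds
        self.fill_rate += self.smoothing * (measured - self.fill_rate)

    def pitch(self, lag):
        """Pitch that matches the fill rate, nudged to pull the lag toward its target."""
        base = self.fill_rate / self.clip_frequency
        # Positive error: the reader has fallen too far behind and should speed up slightly.
        error_seconds = (lag - self.target_lag) / self.clip_frequency
        return base * (1.0 + self.gain * error_seconds)
```

On a healthy device the estimate stays at clip.frequency and the pitch sits at 1.0; with a counter advancing at about 51.2 kHz on a 16 kHz clip, the same loop settles near 3.2.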

…head

Field test falsified the previous model: with pitch set to the measured counter ratio (3.2), the published audio became garbled repeats ("noise with echo"), while the servo's own lag telemetry stayed perfectly stable, because it was measuring against the same lying counter.

Combined with earlier results (1x playback yields correct-pitch voice; reading at the counter's pace yields noise), the consistent model is:

- The clip DATA genuinely is at clip.frequency (16kHz here).
- Microphone.GetPosition's counter is inflated ~3.2x on macOS + BT-HFP; it does not describe the data.

The choppiness on the plain path is the read head colliding with the bursty real write head due to a small, unmanaged startup lag, not a rate mismatch.

Rework the servo accordingly: pitch stays pinned near 1.0 (max ±3% trim). The counter is used only after rescaling by its measured inflation factor k = counterRate / clip.frequency (~1 on healthy devices) to estimate the real write head, and the servo holds the read head a generous adaptive lag (150ms default) behind that estimate. Clip buffer extended to 2s for more collision headroom.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
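
The two pieces of the reworked servo, rescaling the counter by k and limiting the pitch trim to ±3%, can be illustrated as below. `estimate_write_head`, `trimmed_pitch`, and the `gain` value are assumptions for illustration; only the ±3% bound, the 150 ms default lag, and the definition of k come from the commit message.

```python
def estimate_write_head(counter_advance, k):
    """Convert the inflated counter's advance into an estimate of real clip samples written."""
    return counter_advance / k


def trimmed_pitch(lag_samples, target_lag_samples, clip_frequency, gain=0.5, max_trim=0.03):
    """Pitch pinned near 1.0, with a bounded trim that steers the lag toward its target."""
    error_seconds = (lag_samples - target_lag_samples) / clip_frequency
    trim = max(-max_trim, min(max_trim, gain * error_seconds))
    return 1.0 + trim
```

At 16 kHz the 150 ms default lag corresponds to 2400 samples, so `trimmed_pitch(2400, 2400, 16000)` returns exactly 1.0 and larger deviations saturate at 0.97 or 1.03.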

The reworked servo's telemetry is perfect in the bad state (k=3.20, pitch~1.00, lag locked on target, jitter ~0, no resyncs) yet the published audio still chops like the unpaced path. That falsifies the read/write collision model: the reader is provably never near the writer.

Remaining hypothesis: the chop is baked into the clip data itself: FMOD scatters the real 16kHz samples at the inflated counter's positions, leaving stale regions between fragments (~31% fresh per cycle). That would also explain why counter-paced reading sounds like noise with echo (fragments + stale older audio, fast).

Snapshot the raw clip to a WAV 4s after capture starts (editor-only) so the buffer contents can be inspected directly: contiguous voice means the chop is downstream and still fixable; fragmented voice means capture data is destroyed at write time and the Unity Microphone path cannot work for this device.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
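
Dumping a float buffer to WAV for offline inspection needs very little code. The sketch below uses Python's standard `wave` module and is not the editor-only dump added in this commit; `dump_to_wav` and `dump_to_bytes` are hypothetical names, and 16-bit PCM is an assumed output format.

```python
import io
import struct
import wave


def dump_to_wav(target, samples, sample_rate, channels=1):
    """Write float samples in [-1, 1] as 16-bit PCM to a path or a seekable file object."""
    frames = bytearray()
    for s in samples:
        clamped = max(-1.0, min(1.0, s))
        frames += struct.pack("<h", int(round(clamped * 32767)))
    with wave.open(target, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(bytes(frames))


def dump_to_bytes(samples, sample_rate, channels=1):
    """Convenience wrapper returning the WAV file contents as bytes."""
    buf = io.BytesIO()
    dump_to_wav(buf, samples, sample_rate, channels)
    return buf.getvalue()
```

One caveat of this format choice: 16-bit quantization also maps very small nonzero samples to 0, so a float or 24-bit dump would preserve the distinction between exact-zero padding and a quiet noise floor more faithfully.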

A raw dump of the mic clip in the macOS + Bluetooth HFP state revealed the true buffer structure: FMOD writes each real 20ms packet of clip.frequency audio, then advances the position counter as if it had written k (~3.2x) as much and zero-fills the skipped range. The buffer holds valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k), and the fragments join continuously (junction sample deltas within normal in-fragment variation) - i.e. the full audio stream is present, just zero-padded.

Concatenating the fragments reconstructed clean, correct-pitch voice (verified by ear), which also explains every earlier symptom: plain playback = 31% voice + 69% silence (chop); counter-paced reading = fragments and padding played fast over a live buffer (noise with echo).

Replace the pitch-servo playback approach with fragment-aware direct capture:

- Read the clip ring buffer directly (no AudioSource, no OnAudioFilterRead), which also decouples capture from the output device's clock.
- Pre-roll measures the counter rate (k = counterRate / clip.frequency) and the counter's smallest discrete jump (the stride J).
- k ~ 1: plain contiguous read at the counter's pace (healthy devices).
- k > 1.05: read only the first J/k samples of each stride - exactly the valid fragments - skipping the zero padding.
- Downmix to mono and resample clip.frequency -> 48kHz (streaming linear; state carries across fragments since their junctions are continuous), into a native source fixed at 48kHz mono.
- Backlog beyond 200ms after a stall is dropped, stride-aligned, to avoid overrunning the native queue.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
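
The downmix and the streaming linear resample, with state carried across fragments, can be sketched as follows. This is an illustration rather than the code in this commit: `downmix_to_mono` and `LinearResampler` are hypothetical names, and the phase is kept as an exact `Fraction` only so that chunked and one-shot processing agree bit for bit in the example; a real-time implementation would more likely use a floating-point or fixed-point phase.

```python
from fractions import Fraction


def downmix_to_mono(interleaved, channels):
    """Average each interleaved frame down to a single mono sample."""
    return [sum(interleaved[i:i + channels]) / channels
            for i in range(0, len(interleaved), channels)]


class LinearResampler:
    """Streaming linear-interpolation resampler that keeps its phase between calls."""

    def __init__(self, src_rate, dst_rate):
        self.step = Fraction(src_rate, dst_rate)  # input samples advanced per output sample
        self.pos = Fraction(0)                    # read position relative to the carried sample
        self.prev = None                          # last input sample of the previous chunk

    def process(self, chunk):
        """Resample one chunk; the output continues seamlessly from the previous chunk."""
        if not chunk:
            return []
        # Prepend the carried sample so interpolation can span the chunk boundary.
        data = list(chunk) if self.prev is None else [self.prev] + list(chunk)
        out = []
        while self.pos < len(data) - 1:
            i = int(self.pos)
            frac = float(self.pos - i)
            out.append(data[i] + (data[i + 1] - data[i]) * frac)
            self.pos += self.step
        # Rebase the position onto the last sample, which becomes the carried sample.
        self.pos -= len(data) - 1
        self.prev = data[-1]
        return out
```

Feeding the resampler one 320-sample fragment at a time produces the same output as feeding it the concatenated stream in one call, which is the property the continuous junctions make useful.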

MaxHeimbrock · 2026-06-15T15:16:58Z

The MDR-1000X microphone issue is fixed in Unity 6, so this change is no longer needed.

MaxHeimbrock and others added 7 commits June 11, 2026 17:45

MaxHeimbrock changed the title ~~Adaptive pitch servo: lock mic playback to the measured capture rate~~ Reconstruct fragmented mic clip: fragment-aware direct capture Jun 12, 2026

This was referenced Jun 12, 2026

Add AudioClipDump debugging utility (dump audio buffers to WAV) #307 (Open)

Fix microphone capture: device-true source format + fragment-aware clip reading #308 (Closed)

MaxHeimbrock force-pushed the max/mic-samplerate-device-init branch 3 times, most recently from 266730d to 2a26265 Compare June 15, 2026 14:23

MaxHeimbrock closed this Jun 15, 2026

MaxHeimbrock wants to merge 7 commits into max/mic-samplerate-device-init from max/mic-pitch-servo
