Reconstruct fragmented mic clip: fragment-aware direct capture #306
Closed
MaxHeimbrock wants to merge 7 commits into
The native (Rust) audio source was created with a hardcoded sample rate (48000) and channel count (2). Microphone frames flow through Unity's audio graph (AudioProbe) at the actual DSP output configuration, which often differs, e.g. with a Bluetooth headset. The Rust source does not resample; it rejects frames whose rate/channels don't match, causing the metadata-mismatch warning and capture failures.
Read the source's sample rate and channel count from Unity's output configuration (AudioSettings.GetConfiguration) instead of hardcoded defaults, falling back to the defaults only when Unity can't report one. The base constructor now exposes a device-mode overload (type only) and an explicit overload (type, sampleRate, channels) for sources that generate a fixed format. MicrophoneSource and BasicAudioSource use device mode; BasicAudioSource drops its unused channels parameter. SineWaveAudioSource declares its exact format.
If a frame's format still doesn't match (inconsistent Unity report or a runtime output change), drop it with a throttled warning instead of sending a mismatch the native side would error on.
Also removes the redundant Microphone.Start in the Meet sample.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
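
The drop-with-throttled-warning behavior can be modeled in a few lines. This is an illustrative Python sketch, not the plugin's C# code: `FormatGuard`, its fields, and the 2-second throttle interval are hypothetical names and values chosen for the example.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FormatGuard:
    """Drops frames whose format differs from the declared one (hypothetical helper)."""
    sample_rate: int
    channels: int
    warn_interval_s: float = 2.0            # assumed throttle interval
    dropped: int = 0
    warnings: List[str] = field(default_factory=list)
    _last_warn: float = float("-inf")       # time of the last emitted warning

    def accept(self, frame_rate: int, frame_channels: int,
               now: Optional[float] = None) -> bool:
        """Return True if the frame may be forwarded to the native source."""
        if frame_rate == self.sample_rate and frame_channels == self.channels:
            return True
        # Mismatch: always drop, but only warn once per throttle interval.
        self.dropped += 1
        now = time.monotonic() if now is None else now
        if now - self._last_warn >= self.warn_interval_s:
            self._last_warn = now
            self.warnings.append(
                f"dropping frame: got {frame_rate} Hz/{frame_channels} ch, "
                f"expected {self.sample_rate} Hz/{self.channels} ch"
            )
        return False
```

Passing `now` explicitly makes the throttle deterministic in tests; in normal use the guard would fall back to the monotonic clock.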
Temporary, ~2s-throttled diagnostics to investigate choppy received audio:
- RtcAudioSource logs the effective capture sample rate (samples/sec by wall clock) vs the rate declared to the native source. A measured rate that differs from the declared rate means the frame format label is wrong, which would sound fast/slow/choppy on the receiver.
- AudioStream logs buffer fill, underrun count, callback count and frames received, to distinguish receive-side starvation from a clean stream.
Emitted via Utils.Info so they appear without LK_DEBUG (Utils.Debug is compiled out unless LK_DEBUG is defined).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
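
A rough Python model of the "effective rate by wall clock" measurement, for illustration only. `RateMeter`, its method names, and the 2% tolerance are assumptions made for this sketch; they are not taken from RtcAudioSource.

```python
class RateMeter:
    """Estimates capture frames/sec by wall clock (hypothetical diagnostic helper)."""

    def __init__(self, declared_rate: int, channels: int):
        self.declared_rate = declared_rate
        self.channels = channels
        self._start = None   # timestamp of the first buffer
        self._frames = 0     # frames received after the first buffer

    def on_buffer(self, interleaved_len: int, now: float) -> None:
        """Record one audio callback carrying `interleaved_len` interleaved samples."""
        if self._start is None:
            self._start = now    # the first buffer only starts the clock
            return
        self._frames += interleaved_len // self.channels

    def effective_rate(self, now: float) -> float:
        """Frames per second since the first buffer; 0.0 until time has elapsed."""
        if self._start is None or now <= self._start:
            return 0.0
        return self._frames / (now - self._start)

    def mismatch(self, now: float, tolerance: float = 0.02) -> bool:
        """True when the measured rate deviates from the declared rate by > tolerance."""
        rate = self.effective_rate(now)
        return rate > 0 and abs(rate - self.declared_rate) / self.declared_rate > tolerance
```

The first buffer is not counted because it represents audio captured before the clock started; counting it would bias the estimate upward over short windows.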
MicrophoneSource started the device at the hardcoded DefaultMicrophoneSampleRate and played the looping clip through an AudioSource read on the DSP thread. When the device's actual rate differs from the engine output rate, the clip fills and plays back at different rates, so the read position drifts against the write position and the captured audio becomes choppy.
Open the microphone at AudioSettings.outputSampleRate when the device supports it (clamped to the device's reported caps; falling back to the default when the output rate is unknown), so capture and playback run at the same rate. This also aligns the mic rate with the native source rate, which is taken from the same output configuration.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
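
The rate selection can be sketched as below, in Python for illustration. `choose_mic_rate` is a hypothetical helper, and `DEFAULT_MIC_RATE = 48000` is an assumed placeholder for DefaultMicrophoneSampleRate (its real value is not stated here). The 0/0 case follows Unity's documented Microphone.GetDeviceCaps convention, where both values being 0 means the device accepts any rate.

```python
DEFAULT_MIC_RATE = 48000  # assumed placeholder for DefaultMicrophoneSampleRate


def choose_mic_rate(output_rate: int, min_freq: int, max_freq: int,
                    default_rate: int = DEFAULT_MIC_RATE) -> int:
    """Pick the rate to open the microphone at.

    output_rate: engine output rate, or 0 when unknown.
    min_freq/max_freq: device caps; 0/0 means any rate is supported.
    """
    wanted = output_rate if output_rate > 0 else default_rate
    if min_freq == 0 and max_freq == 0:
        return wanted
    # Clamp the wanted rate into the device's supported range.
    return max(min_freq, min(wanted, max_freq))
```

For example, a device that reports caps of 8000 to 16000 Hz would be opened at 16000 Hz even when the engine output runs at 48000 Hz.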
The mic clip is filled by the capture device's clock while the AudioSource that plays it (feeding OnAudioFilterRead) runs on the output device's clock. Some devices also misreport the clip rate entirely: a Bluetooth headset on macOS labels its clip 16kHz while filling it at ~51kHz. Either way the read head drifts against the write head and gets lapped, which sounds like periodic chopping.
Add a pacing servo that measures how fast the write head actually advances (GetPosition delta over wall clock - steady within ±0.1% even when the instantaneous position is jumpy) and continuously adjusts AudioSource.pitch so the read head consumes clip samples at the same rate, holding a fixed lag behind the writer. A short pre-roll measures the rate before playback starts so the initial pitch is already correct; the fill-rate estimate and the lag target (sized to ~4x observed jitter, bounded by clip capacity) keep adapting while capturing, and an out-of-bounds resync recovers from long hitches.
In the normal case the measured rate matches clip.frequency, pitch hovers at ~1.0, and the servo is effectively a no-op. In the misreporting case pitch settles at the true ratio (~3.2), which plays the clip's real-time data at correct speed and eliminates the chop. Pitch is rate control, not a delay: the added latency is only the held lag (~80-150ms, adaptive).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…head
Field test falsified the previous model: with pitch set to the measured
counter ratio (3.2), the published audio became garbled repeats ("noise with
echo"), while the servo's own lag telemetry stayed perfectly stable — because
it was measuring against the same lying counter. Combined with earlier
results (1x playback yields correct-pitch voice; reading at the counter's
pace yields noise), the consistent model is:
- The clip DATA genuinely is at clip.frequency (16kHz here).
- Microphone.GetPosition's counter is inflated ~3.2x on macOS + BT-HFP; it
does not describe the data. The choppiness on the plain path is the read
head colliding with the bursty real write head due to a small, unmanaged
startup lag — not a rate mismatch.
Rework the servo accordingly: pitch stays pinned near 1.0 (max ±3% trim).
The counter is used only after rescaling by its measured inflation factor
k = counterRate / clip.frequency (~1 on healthy devices) to estimate the
real write head, and the servo holds the read head a generous adaptive lag
(150ms default) behind that estimate. Clip buffer extended to 2s for more
collision headroom.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
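
The servo arithmetic in this commit (rescale the counter by k, hold a target lag, trim pitch by at most ±3%) can be sketched as a proportional controller. This is a Python illustration under assumptions: the function names and the `gain` value are hypothetical, and the commit does not state what control law the C# code uses.

```python
def estimate_write_head(counter_advance: float, k: float) -> float:
    """Rescale the counter's total advance (in samples) by the inflation factor k."""
    return counter_advance / k


def pitch_trim(write_head: float, read_head: float, clip_rate: int,
               target_lag_s: float = 0.150, gain: float = 0.5,
               max_trim: float = 0.03) -> float:
    """Return a pitch near 1.0 that nudges the read head toward the target lag.

    Lag above target -> play slightly faster; lag below target -> slightly slower.
    `gain` is an assumed tuning constant; max_trim is the +/-3% bound.
    """
    lag_s = (write_head - read_head) / clip_rate
    trim = gain * (lag_s - target_lag_s)
    trim = max(-max_trim, min(max_trim, trim))
    return 1.0 + trim
```

With k = 3.2 and a counter advance of 51200, the estimated write head is at 16000 samples, so a read head at 13600 sits exactly 150 ms behind at 16 kHz and the pitch stays at 1.0.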
The reworked servo's telemetry is perfect in the bad state (k=3.20, pitch~1.00, lag locked on target, jitter ~0, no resyncs) yet the published audio still chops like the unpaced path. That falsifies the read/write collision model: the reader is provably never near the writer.
Remaining hypothesis: the chop is baked into the clip data itself. FMOD scatters the real 16kHz samples at the inflated counter's positions, leaving stale regions between fragments (~31% fresh per cycle). That would also explain why counter-paced reading sounds like noise with echo (fragments + stale older audio, fast).
Snapshot the raw clip to a WAV 4s after capture starts (editor-only) so the buffer contents can be inspected directly: contiguous voice means the chop is downstream and still fixable; fragmented voice means capture data is destroyed at write time and the Unity Microphone path cannot work for this device.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
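
One way to inspect such a dump offline is to look for runs of audio separated by gaps of exact zeros. The Python sketch below is illustrative only: `find_fragments`, `strides`, and the `min_gap` threshold are hypothetical, and the input is assumed to be a sequence of float samples already decoded from the WAV.

```python
def find_fragments(samples, min_gap=8):
    """Return (start, length) for each run of audio separated by >= min_gap exact zeros.

    Isolated zero samples (e.g. at zero crossings) do not split a fragment;
    only a run of at least `min_gap` consecutive exact zeros does.
    """
    fragments = []
    start = None          # start index of the fragment being built
    last_nonzero = None   # index of the most recent non-zero sample
    zeros = 0             # length of the current zero run
    for i, s in enumerate(samples):
        if s != 0.0:
            if start is None:
                start = i
            last_nonzero = i
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= min_gap:
                fragments.append((start, last_nonzero - start + 1))
                start = None
    if start is not None:
        fragments.append((start, last_nonzero - start + 1))
    return fragments


def strides(fragments):
    """Distance between the starts of consecutive fragments."""
    return [b[0] - a[0] for a, b in zip(fragments, fragments[1:])]
```

A contiguous recording yields one long fragment, while a fragmented buffer yields many fragments of similar length at a regular stride. A fragment that happens to begin or end on an exact-zero sample would be measured slightly short, so the lengths are best read as a distribution rather than individually.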
A raw dump of the mic clip in the macOS + Bluetooth HFP state revealed the true buffer structure: FMOD writes each real 20ms packet of clip.frequency audio, then advances the position counter as if it had written k (~3.2x) as much and zero-fills the skipped range. The buffer holds valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k), and the fragments join continuously (junction sample deltas within normal in-fragment variation) - i.e. the full audio stream is present, just zero-padded.
Concatenating the fragments reconstructed clean, correct-pitch voice (verified by ear), which also explains every earlier symptom: plain playback = 31% voice + 69% silence (chop); counter-paced reading = fragments and padding played fast over a live buffer (noise with echo).
Replace the pitch-servo playback approach with fragment-aware direct capture:
- Read the clip ring buffer directly (no AudioSource, no OnAudioFilterRead), which also decouples capture from the output device's clock.
- Pre-roll measures the counter rate (k = counterRate / clip.frequency) and the counter's smallest discrete jump (the stride J).
- k ~ 1: plain contiguous read at the counter's pace (healthy devices).
- k > 1.05: read only the first J/k samples of each stride - exactly the valid fragments - skipping the zero padding.
- Downmix to mono and resample clip.frequency -> 48kHz (streaming linear; state carries across fragments since their junctions are continuous), into a native source fixed at 48kHz mono.
- Backlog beyond 200ms after a stall is dropped, stride-aligned, to avoid overrunning the native queue.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
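
A minimal Python sketch of two of the steps above, the fragment read and the streaming linear resampler. It is an illustration, not the C# implementation: `read_fragments` and `LinearResampler` are hypothetical names, the input is assumed to be mono already, and the resampler keeps its phase as an integer so that chunked and one-shot processing give identical output.

```python
def read_fragments(ring, start, strides, stride, k):
    """Collect the first stride/k samples of each stride, wrapping around the ring."""
    valid = round(stride / k)            # e.g. 1024 / 3.2 = 320
    n = len(ring)
    out = []
    for s in range(strides):
        base = (start + s * stride) % n
        out.extend(ring[(base + i) % n] for i in range(valid))
    return out


class LinearResampler:
    """Streaming linear-interpolation resampler; state carries across chunks."""

    def __init__(self, src_rate, dst_rate):
        self.src_rate = src_rate
        self.dst_rate = dst_rate
        self.phase = 0      # read position, in units of 1/dst_rate source samples
        self.prev = None    # last sample of the previous chunk

    def process(self, chunk):
        if not chunk:
            return []
        # Prepend the previous chunk's last sample so interpolation spans the junction.
        data = list(chunk) if self.prev is None else [self.prev] + list(chunk)
        limit = (len(data) - 1) * self.dst_rate   # phase of the last available sample
        out = []
        while self.phase < limit:
            i, rem = divmod(self.phase, self.dst_rate)
            frac = rem / self.dst_rate
            out.append(data[i] + (data[i + 1] - data[i]) * frac)
            self.phase += self.src_rate
        # Re-base the phase onto the sample that becomes index 0 of the next call.
        self.phase -= limit
        self.prev = data[-1]
        return out
```

Usage would look like `resampler.process(read_fragments(ring, pos, count, 1024, 3.2))` once per poll, with `pos` kept stride-aligned. Because the last sample and the phase persist between calls, feeding the fragments one at a time produces the same 48 kHz stream as feeding their concatenation in one call.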
This was referenced Jun 12, 2026
266730d to 2a26265
The MDR-1000X microphone issue is fixed in Unity 6, so there is no need for this.
Stacked on #304. Fixes the choppy/garbled published audio with a Bluetooth HFP headset mic on macOS — by reading the audio Unity actually delivers, which turned out to be intact but scattered.
Root cause (proven by buffer inspection)
A raw WAV dump of the mic clip in the bad state showed the exact structure: FMOD writes each real 20 ms packet of clip.frequency audio, then advances Microphone.GetPosition as if it had written ~3.2× as much, zero-filling the skipped range. Concretely: valid fragments of exactly 320 samples at a stride of exactly 1024 (320/1024 = 1/k where k = counterRate/clip.frequency = 3.2), with exact-zero padding between them (true silence has a real noise floor, never exact zeros).
Junction analysis showed the fragments join continuously (boundary sample deltas within normal in-fragment variation): the full stream is present, just zero-padded. Concatenating fragments reconstructed clean, correct-pitch voice (verified by ear).
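
The numbers above are mutually consistent, which a short Python check makes explicit. The constant names are chosen for this example; the values (16 kHz clip rate, 320-sample fragments, 1024-sample stride) are the ones reported in this PR.

```python
from fractions import Fraction

CLIP_RATE = 16_000   # clip.frequency reported for the headset, in Hz
FRAGMENT = 320       # valid samples per stride
STRIDE = 1024        # counter advance per packet, in samples

k = Fraction(STRIDE, FRAGMENT)                    # counter inflation factor
packet_ms = Fraction(FRAGMENT, CLIP_RATE) * 1000  # duration of one valid fragment
fresh_fraction = Fraction(FRAGMENT, STRIDE)       # share of the buffer holding audio
counter_rate = k * CLIP_RATE                      # apparent samples/sec of the counter
```

Here `k` is 16/5 = 3.2, each fragment lasts 20 ms, 5/16 (about 31%) of the buffer is audio, and the counter appears to advance at 51200 samples per second, which matches the ~51kHz fill rate observed earlier.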
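This explains every prior symptom:
- Plain playback: 31% voice + 69% silence, heard as chop.
- Counter-paced reading: fragments and padding played fast over a live buffer, heard as noise with echo.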
Change
MicrophoneSource now does fragment-aware direct capture:
- Reads the clip ring buffer directly (no AudioSource, no OnAudioFilterRead), which also decouples capture from the output device's clock.
- Pre-roll measures k (counter rate ÷ clip.frequency) and the counter's smallest discrete jump (the stride J).
- When k > 1.05, reads only the first J/k samples of each stride, exactly the valid fragments, skipping the padding.
- Downmixes to mono and resamples clip.frequency → fixed 48 kHz native source (streaming linear; resampler state carries across fragments since junctions are continuous).
Expected log in the bad state:
Healthy devices log contiguous capture (k=1.00).
Verification
History
This branch went through two falsified designs first — a pitch servo at the counter ratio (garbled: the counter doesn't describe the data) and a k-rescaled lag servo at pitch 1 (perfect telemetry, still choppy: the gaps are in the buffer itself). The WAV dump diagnostic settled it. Commits preserved for the record.
🤖 Generated with Claude Code