fix(qwen3_5): preserve packed-sample boundaries in GatedDeltaNet#2147
Summary
Fixes #2131.

`Qwen3_5GatedDeltaNet` was leaking recurrent state across packed-sample boundaries because:

- `chunk_gated_delta_rule` was called without `cu_seqlens` → recurrent state ran continuously across the whole pack.
- `causal_conv1d_fn` (4-tap depthwise conv, upstream of FLA) was called without `seq_idx` → sample-1's last 3 tokens bled into sample-2's first 3 tokens.

Layer-output diff between standalone and packed-then-sliced on a small reproducer:
4.011e-01 → 0.000e+00.
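The failure mode can be illustrated with a toy linear recurrence standing in for the gated delta rule (this is not the FLA kernel; `decay`, the reset logic, and the doc lengths are all illustrative):

```python
def recurrence(xs, decay=0.9, cu_seqlens=None):
    """Toy stand-in for a linear-attention recurrence: h_t = decay*h_{t-1} + x_t.
    When cu_seqlens is given, state is reset at each document start."""
    starts = set(cu_seqlens[:-1]) if cu_seqlens is not None else set()
    h, out = 0.0, []
    for t, x in enumerate(xs):
        if t in starts:
            h = 0.0                 # new packed document: state must restart
        h = decay * h + x
        out.append(h)
    return out

pack = [1.0] * 6                                  # two packed docs, 3 tokens each
leaky = recurrence(pack)                          # no cu_seqlens: state leaks
clean = recurrence(pack, cu_seqlens=[0, 3, 6])    # boundaries respected
standalone = recurrence([1.0] * 3)                # doc 2 run on its own
# clean[3:] == standalone, while leaky[3:] carries doc 1's state into doc 2
```

The layer-output diff going to exactly 0.0 corresponds to the `clean`-vs-`standalone` case above: with boundaries threaded through, a packed document computes the identical float sequence as a standalone run.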
Mechanism
The fix rides on the existing `patch_hf_model` class-swap machinery that already wraps every `Qwen3_5GatedDeltaNet` instance into `CPAwareGatedDeltaNet`.

- Dense path (`Qwen3_5ForCausalLM` / `Qwen3_5ForConditionalGeneration`): a new `Qwen3_5DecoderLayerWithPacking(Qwen3_5DecoderLayer)` subclass overrides `forward` to derive `cu_seqlens` and `indices` from the indexed `attention_mask` and pass them to `linear_attn` as keyword arguments. `patch_hf_model` is extended to class-swap every `Qwen3_5DecoderLayer` instance to this subclass.
- MoE path (`Qwen3_5MoeForConditionalGeneration`): `Qwen3_5MoeBlock(Block)` is already custom-built, so its `forward` is overridden in place to do the same kwarg-threading. No class-swap needed.
- `CPAwareGatedDeltaNet._forward_no_cp` accepts the new kwargs. When packed, it skips `apply_mask_to_padding_states` (the unpad subsumes it), unpads `[B, T, H] → [1, total_valid, H]` if there is actual padding (the B>1 case), threads `seq_idx` into `causal_conv1d_fn` and `cu_seqlens` into `chunk_gated_delta_rule`, and repads on exit. When not packed, the original path runs verbatim, bit-for-bit unchanged.

Files changed
```
M nemo_automodel/_transformers/infrastructure.py
M nemo_automodel/components/distributed/cp_utils.py
M nemo_automodel/components/models/common/packing.py
A nemo_automodel/components/models/qwen3_5/decoder_layer.py
M nemo_automodel/components/models/qwen3_5_moe/cp_linear_attn.py
M nemo_automodel/components/models/qwen3_5_moe/model.py
M tests/unit_tests/models/qwen3_5/test_cp_linear_attn_patch.py
```
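The `cu_seqlens` / `seq_idx` contract the decoder-layer override has to satisfy can be sketched in pure Python (helper name and loop form are illustrative, not the PR's actual tensor code):

```python
def derive_cu_seqlens(sample_ids):
    """Illustrative sketch: given a flattened indexed attention mask (one int
    per token, where each packed sample's tokens share a distinct positive id
    such as 1,1,1,2,2,... and padding is 0), produce the prefix-sum cu_seqlens
    and the per-token document index seq_idx."""
    valid = [i for i in sample_ids if i > 0]      # drop padding positions
    cu_seqlens, seq_idx, prev = [0], [], None
    for pos, sid in enumerate(valid):
        if prev is not None and sid != prev:
            cu_seqlens.append(pos)                # a new document starts here
        prev = sid
        seq_idx.append(len(cu_seqlens) - 1)      # document index of this token
    cu_seqlens.append(len(valid))
    return cu_seqlens, seq_idx

# e.g. a pack of a 3-token doc, a 2-token doc, and one padding token:
# derive_cu_seqlens([1, 1, 1, 2, 2, 0]) -> ([0, 3, 5], [0, 0, 0, 1, 1])
```

`cu_seqlens` is what `chunk_gated_delta_rule` consumes to reset recurrent state per document; `seq_idx` is what `causal_conv1d_fn` consumes to keep the depthwise conv window from crossing a boundary.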
What does NOT change
`_forward_with_cp` is not modified. Packing+CP support is intentionally out of scope (no shipped Qwen3.5 recipe sets `cp_size > 1`).

Verification
Unit + reproducer evidence
- `tests/unit_tests/models/qwen3_5/test_cp_linear_attn_patch.py` (`TestPackingHelpers`)
- `logs/reproduce_deltanet_packing_via_subclass.py`
- `logs/reproduce_deltanet_packing_logits.py`
- `logs/reproduce_deltanet_packing_logits_moe.py`

Production-recipe verification
Captured live during `automodel <qwen3_5_4b_neat_packing.yaml>`:
```
[INFO] Patched 24 GatedDeltaNet modules (cp=False) with FSDP-safe fp32 param wrapping.
step 0 chunk_gated_delta_rule:
q=(1, 1503, 32, 128) bf16 needs_unpad=False
cu_seqlens=[0, 371, 495, 816, 1074, 1198, 1503] n_docs=6 doc_lens=[371, 124, 321, 258, 124, 305]
step 1 chunk_gated_delta_rule:
q=(1, 2150, 32, 128) bf16 needs_unpad=False
cu_seqlens=[0, 1699, 2150] n_docs=2 doc_lens=[1699, 451]
```
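The logged `cu_seqlens` follow the usual prefix-sum convention, so the step-0 line can be sanity-checked directly (values copied from the log above):

```python
# cu_seqlens from step 0 of the production run
cu = [0, 371, 495, 816, 1074, 1198, 1503]

# successive differences recover the per-document lengths
doc_lens = [b - a for a, b in zip(cu, cu[1:])]
assert doc_lens == [371, 124, 321, 258, 124, 305]   # matches logged doc_lens
assert cu[-1] == sum(doc_lens) == 1503              # matches logged q length
```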
End-to-end: production recipe → indexed mask → `Qwen3_5DecoderLayerWithPacking.forward` → `CPAwareGatedDeltaNet._forward_no_cp` → FLA `chunk_gated_delta_rule` now receives real per-document boundaries. With the bug, FLA would have seen `cu_seqlens=None` and run all 1503 tokens as one sequence.

🤖 Generated with Claude Code