
fix(qwen3_5): preserve packed-sample boundaries in GatedDeltaNet #2147

Open

HuiyingLi wants to merge 1 commit into main from huiyingl/fix/qwen3_5-packing-2131

Conversation

@HuiyingLi
Contributor

Summary

Fixes #2131. Qwen3_5GatedDeltaNet was leaking recurrent state across packed-sample boundaries because:

  1. FLA chunk_gated_delta_rule was called without cu_seqlens → recurrent state ran continuously across the whole pack.
  2. causal_conv1d_fn (4-tap depthwise conv, upstream of FLA) was called without seq_idx → sample-1's last 3 tokens bled into sample-2's first 3 tokens.
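
Both kernels already accept an explicit boundary argument. Below is a minimal sketch of the buggy versus fixed call sites, with toy shapes of my own choosing, assuming the public `causal_conv1d_fn` and `chunk_gated_delta_rule` entry points (the real module wiring differs):

```python
# Hypothetical toy shapes; real head/dim sizes come from the Qwen3.5 config.
import torch
from causal_conv1d import causal_conv1d_fn
from fla.ops.gated_delta_rule import chunk_gated_delta_rule

T, H, D = 7, 4, 16                        # two packed docs: lengths 3 and 4
conv_dim = H * D
x = torch.randn(1, conv_dim, T, device="cuda", dtype=torch.bfloat16)
w = torch.randn(conv_dim, 4, device="cuda", dtype=torch.bfloat16)  # 4-tap depthwise kernel
q, k, v = (torch.randn(1, T, H, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))
g = torch.randn(1, T, H, device="cuda", dtype=torch.float32)       # log decay
beta = torch.rand(1, T, H, device="cuda", dtype=torch.bfloat16)

cu_seqlens = torch.tensor([0, 3, 7], dtype=torch.int32, device="cuda")
seq_idx = torch.tensor([[0, 0, 0, 1, 1, 1, 1]], dtype=torch.int32, device="cuda")

# Bug: without seq_idx the 4-tap window crosses the doc boundary at t=3.
y_bug = causal_conv1d_fn(x, w, activation="silu")
# Fix: seq_idx restarts the conv state at each document start.
y_fix = causal_conv1d_fn(x, w, seq_idx=seq_idx, activation="silu")

# Bug: without cu_seqlens the recurrent state runs across all 7 tokens.
o_bug, _ = chunk_gated_delta_rule(q, k, v, g, beta)
# Fix: cu_seqlens resets the recurrent state at each document boundary.
o_fix, _ = chunk_gated_delta_rule(q, k, v, g, beta, cu_seqlens=cu_seqlens)
```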

Layer-output diff between standalone and packed-then-sliced on a small reproducer:
4.011e-01 → 0.000e+00.

Mechanism

The fix rides on the existing patch_hf_model class-swap machinery that already wraps every Qwen3_5GatedDeltaNet instance into CPAwareGatedDeltaNet.

  • Dense path (Qwen3_5ForCausalLM / Qwen3_5ForConditionalGeneration): a new Qwen3_5DecoderLayerWithPacking(Qwen3_5DecoderLayer) subclass overrides forward to derive cu_seqlens and indices from the indexed attention_mask (see the sketch after this list) and pass them to linear_attn as keyword arguments. patch_hf_model is extended to class-swap every Qwen3_5DecoderLayer instance to this subclass.
  • MoE path (Qwen3_5MoeForConditionalGeneration): Qwen3_5MoeBlock(Block) is already custom code; its forward is overridden in place to do the same kwarg threading. No class-swap needed.
  • CPAwareGatedDeltaNet._forward_no_cp accepts the new kwargs. When packed, it skips apply_mask_to_padding_states (the unpad subsumes it), unpads [B, T, H] → [1, total_valid, H] when there is actual padding (the B>1 case), threads seq_idx into causal_conv1d_fn and cu_seqlens into chunk_gated_delta_rule, and repads on exit. When not packed, the original kwargs path runs verbatim, bit-for-bit unchanged.
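
For illustration, a hedged sketch of the metadata derivation (the helper name and exact return layout are mine; the PR's real helpers live in nemo_automodel/components/models/common/packing.py and may differ):

```python
import torch

def derive_packing_metadata(attention_mask: torch.Tensor):
    """attention_mask: [B, T] per-token document ids (1..n per row, 0 = padding)."""
    B, T = attention_mask.shape
    # Offset ids per row so documents from different rows never share an id.
    offsets = (attention_mask.max() + 1) * torch.arange(B, device=attention_mask.device)
    flat = (attention_mask + (attention_mask > 0) * offsets.unsqueeze(1)).flatten()
    indices = (flat > 0).nonzero(as_tuple=True)[0]   # positions of valid tokens
    ids = flat[indices]
    # A document starts at position 0 and wherever the id changes.
    starts = torch.ones_like(ids, dtype=torch.bool)
    starts[1:] = ids[1:] != ids[:-1]
    cu_seqlens = torch.cat([
        starts.nonzero(as_tuple=True)[0],
        torch.tensor([ids.numel()], device=ids.device),
    ]).to(torch.int32)
    # Running document index per valid token, shape [1, total_valid].
    seq_idx = (starts.to(torch.int32).cumsum(0) - 1).unsqueeze(0).to(torch.int32)
    return cu_seqlens, indices, seq_idx
```

indices then drives the unpad and repad around the kernel calls: roughly hidden_states.flatten(0, 1)[indices].unsqueeze(0) on entry, and an index_copy_ back into a zeroed [B*T, H] buffer on exit.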

Files changed

```
M nemo_automodel/_transformers/infrastructure.py
M nemo_automodel/components/distributed/cp_utils.py
M nemo_automodel/components/models/common/packing.py
A nemo_automodel/components/models/qwen3_5/decoder_layer.py
M nemo_automodel/components/models/qwen3_5_moe/cp_linear_attn.py
M nemo_automodel/components/models/qwen3_5_moe/model.py
M tests/unit_tests/models/qwen3_5/test_cp_linear_attn_patch.py
```

What does NOT change

  • The unpacked path (no indexed mask) is bit-for-bit identical to before.
  • _forward_with_cp is not modified — packing+CP support is intentionally out of scope (no shipped Qwen3.5 recipe sets cp_size > 1).

Verification

Unit + reproducer evidence

| Test | Coverage | Result |
| --- | --- | --- |
| tests/unit_tests/models/qwen3_5/test_cp_linear_attn_patch.py | 35 cases incl. new TestPackingHelpers | all pass |
| logs/reproduce_deltanet_packing_via_subclass.py | First-DeltaNet-layer comparison (the issue's literal procedure) for B=1 packed-no-padding and B=2 packed-with-padding | all per-doc max_abs_diff = 0.000e+00 |
| logs/reproduce_deltanet_packing_logits.py | 4-layer dense stack + final norm + lm_head | all per-doc logits diff = 0.000e+00 |
| logs/reproduce_deltanet_packing_logits_moe.py | 4-layer MoE stack + final norm + lm_head | all per-doc logits diff = 0.000e+00 |
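
The reproducer scripts follow the issue's procedure: run each document standalone, run the pack, slice the packed output back per document, and compare. A minimal sketch of that check (hypothetical harness; the actual scripts live under logs/ and are not part of this diff):

```python
import torch

def packed_vs_standalone_diff(layer, doc_a, doc_b, packed_inputs):
    """Max abs diff between standalone outputs and packed-then-sliced outputs."""
    out_a = layer(doc_a)            # [1, len_a, H]
    out_b = layer(doc_b)            # [1, len_b, H]
    out_p = layer(**packed_inputs)  # [1, len_a + len_b, H], boundaries threaded
    len_a = doc_a.size(1)
    return max(
        (out_p[:, :len_a] - out_a).abs().max().item(),
        (out_p[:, len_a:] - out_b).abs().max().item(),
    )
```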

Production-recipe verification

Captured live during `automodel <qwen3_5_4b_neat_packing.yaml>`:

```
[INFO] Patched 24 GatedDeltaNet modules (cp=False) with FSDP-safe fp32 param wrapping.

step 0 chunk_gated_delta_rule:
q=(1, 1503, 32, 128) bf16 needs_unpad=False
cu_seqlens=[0, 371, 495, 816, 1074, 1198, 1503] n_docs=6 doc_lens=[371, 124, 321, 258, 124, 305]

step 1 chunk_gated_delta_rule:
q=(1, 2150, 32, 128) bf16 needs_unpad=False
cu_seqlens=[0, 1699, 2150] n_docs=2 doc_lens=[1699, 451]
```

End-to-end: production recipe → indexed mask → Qwen3_5DecoderLayerWithPacking.forward → CPAwareGatedDeltaNet._forward_no_cp → FLA chunk_gated_delta_rule now receives real per-document boundaries. With the bug, FLA would have seen cu_seqlens=None and run all 1503 tokens as one sequence.
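
As a sanity check on the step-0 log line, doc_lens is just the consecutive difference of cu_seqlens:

```python
# Step-0 values copied from the log above.
cu = [0, 371, 495, 816, 1074, 1198, 1503]
doc_lens = [b - a for a, b in zip(cu, cu[1:])]
assert doc_lens == [371, 124, 321, 258, 124, 305]
assert sum(doc_lens) == 1503  # matches q's sequence length
```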

🤖 Generated with Claude Code

Qwen3_5GatedDeltaNet leaked recurrent state across packed-sample
boundaries because FLA chunk_gated_delta_rule was called without
cu_seqlens (state ran continuously across the whole pack) and
causal_conv1d_fn was called without seq_idx (sample-1's last 3
tokens bled into sample-2's first 3 tokens).

The fix builds on the existing patch_hf_model class-swap path:

- Dense path: new Qwen3_5DecoderLayerWithPacking subclass derives
  cu_seqlens / indices from the indexed attention_mask and threads
  them, plus position_ids, into linear_attn as kwargs. patch_hf_model
  now class-swaps every Qwen3_5DecoderLayer instance to this subclass.
- MoE path: Qwen3_5MoeBlock.forward override does the same kwarg
  threading. No class-swap needed since the block is custom code.
- CPAwareGatedDeltaNet._forward_no_cp accepts the new kwargs. When
  packed: skips apply_mask_to_padding_states (the unpad subsumes it),
  unpads [B, T, H] -> [1, total_valid, H] for B>1 cases, threads
  seq_idx into causal_conv1d_fn, threads cu_seqlens into
  chunk_gated_delta_rule, repads on exit. When not packed: bit-for-bit
  unchanged.

Layer-output diff between standalone-vs-packed-then-sliced on a small
reproducer: 4.011e-01 -> 0.000e+00.

_forward_with_cp is intentionally not modified; packing+CP is out of
scope (no shipped Qwen3.5 recipe sets cp_size > 1).

Fixes #2131

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

