
fix(qwen3_5): preserve packed-sample boundaries in GatedDeltaNet #2147

Open

HuiyingLi wants to merge 1 commit into main from huiyingl/fix/qwen3_5-packing-2131

Conversation

@HuiyingLi
Contributor

Summary

Fixes #2131. Qwen3_5GatedDeltaNet was leaking recurrent state across packed-sample boundaries because:

  1. FLA chunk_gated_delta_rule was called without cu_seqlens → recurrent state ran continuously across the whole pack.
  2. causal_conv1d_fn (4-tap depthwise conv, upstream of FLA) was called without seq_idx → sample-1's last 3 tokens bled into sample-2's first 3 tokens.
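
Both kernels already accept an explicit boundary argument. Below is a minimal sketch of the buggy versus fixed call sites, with toy shapes of my own choosing, assuming the public `causal_conv1d_fn` and `chunk_gated_delta_rule` entry points (the real module wiring differs):

```python
# Hypothetical toy shapes; real head/dim sizes come from the Qwen3.5 config.
import torch
from causal_conv1d import causal_conv1d_fn
from fla.ops.gated_delta_rule import chunk_gated_delta_rule

T, H, D = 7, 4, 16                        # two packed docs: lengths 3 and 4
conv_dim = H * D
x = torch.randn(1, conv_dim, T, device="cuda", dtype=torch.bfloat16)
w = torch.randn(conv_dim, 4, device="cuda", dtype=torch.bfloat16)  # 4-tap depthwise kernel
q, k, v = (torch.randn(1, T, H, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))
g = torch.randn(1, T, H, device="cuda", dtype=torch.float32)       # log decay
beta = torch.rand(1, T, H, device="cuda", dtype=torch.bfloat16)

cu_seqlens = torch.tensor([0, 3, 7], dtype=torch.int32, device="cuda")
seq_idx = torch.tensor([[0, 0, 0, 1, 1, 1, 1]], dtype=torch.int32, device="cuda")

# Bug: without seq_idx the 4-tap window crosses the doc boundary at t=3.
y_bug = causal_conv1d_fn(x, w, activation="silu")
# Fix: seq_idx restarts the conv state at each document start.
y_fix = causal_conv1d_fn(x, w, seq_idx=seq_idx, activation="silu")

# Bug: without cu_seqlens the recurrent state runs across all 7 tokens.
o_bug, _ = chunk_gated_delta_rule(q, k, v, g, beta)
# Fix: cu_seqlens resets the recurrent state at each document boundary.
o_fix, _ = chunk_gated_delta_rule(q, k, v, g, beta, cu_seqlens=cu_seqlens)
```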

Layer-output diff between standalone and packed-then-sliced on a small reproducer:
4.011e-01 → 0.000e+00.

Mechanism

The fix rides on the existing patch_hf_model class-swap machinery that already wraps every Qwen3_5GatedDeltaNet instance into CPAwareGatedDeltaNet.

  • Dense path (Qwen3_5ForCausalLM / Qwen3_5ForConditionalGeneration): a new Qwen3_5DecoderLayerWithPacking(Qwen3_5DecoderLayer) subclass overrides forward to derive cu_seqlens and indices from the indexed attention_mask (see the sketch after this list) and pass them to linear_attn as keyword arguments. patch_hf_model is extended to class-swap every Qwen3_5DecoderLayer instance to this subclass.
  • MoE path (Qwen3_5MoeForConditionalGeneration): Qwen3_5MoeBlock(Block) is already custom code; its forward is overridden in place to do the same kwarg threading. No class-swap needed.
  • CPAwareGatedDeltaNet._forward_no_cp accepts the new kwargs. When packed, it skips apply_mask_to_padding_states (the unpad subsumes it), unpads [B, T, H] → [1, total_valid, H] when there is actual padding (the B>1 case), threads seq_idx into causal_conv1d_fn and cu_seqlens into chunk_gated_delta_rule, and repads on exit. When not packed, the original kwargs path runs verbatim, bit-for-bit unchanged.
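
For illustration, a hedged sketch of the metadata derivation (the helper name and exact return layout are mine; the PR's real helpers live in nemo_automodel/components/models/common/packing.py and may differ):

```python
import torch

def derive_packing_metadata(attention_mask: torch.Tensor):
    """attention_mask: [B, T] per-token document ids (1..n per row, 0 = padding)."""
    B, T = attention_mask.shape
    # Offset ids per row so documents from different rows never share an id.
    offsets = (attention_mask.max() + 1) * torch.arange(B, device=attention_mask.device)
    flat = (attention_mask + (attention_mask > 0) * offsets.unsqueeze(1)).flatten()
    indices = (flat > 0).nonzero(as_tuple=True)[0]   # positions of valid tokens
    ids = flat[indices]
    # A document starts at position 0 and wherever the id changes.
    starts = torch.ones_like(ids, dtype=torch.bool)
    starts[1:] = ids[1:] != ids[:-1]
    cu_seqlens = torch.cat([
        starts.nonzero(as_tuple=True)[0],
        torch.tensor([ids.numel()], device=ids.device),
    ]).to(torch.int32)
    # Running document index per valid token, shape [1, total_valid].
    seq_idx = (starts.to(torch.int32).cumsum(0) - 1).unsqueeze(0).to(torch.int32)
    return cu_seqlens, indices, seq_idx
```

indices then drives the unpad and repad around the kernel calls: roughly hidden_states.flatten(0, 1)[indices].unsqueeze(0) on entry, and an index_copy_ back into a zeroed [B*T, H] buffer on exit.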

Files changed

```
M nemo_automodel/_transformers/infrastructure.py
M nemo_automodel/components/distributed/cp_utils.py
M nemo_automodel/components/models/common/packing.py
A nemo_automodel/components/models/qwen3_5/decoder_layer.py
M nemo_automodel/components/models/qwen3_5_moe/cp_linear_attn.py
M nemo_automodel/components/models/qwen3_5_moe/model.py
M tests/unit_tests/models/qwen3_5/test_cp_linear_attn_patch.py
```

What does NOT change

  • The unpacked path (no indexed mask) is bit-for-bit identical to before.
  • _forward_with_cp is not modified — packing+CP support is intentionally out of scope (no shipped Qwen3.5 recipe sets cp_size > 1).

Verification

Unit + reproducer evidence

| Test | Coverage | Result |
| --- | --- | --- |
| tests/unit_tests/models/qwen3_5/test_cp_linear_attn_patch.py | 35 cases incl. new TestPackingHelpers | all pass |
| logs/reproduce_deltanet_packing_via_subclass.py | First-DeltaNet-layer comparison (the issue's literal procedure) for B=1 packed-no-padding and B=2 packed-with-padding | all per-doc max_abs_diff = 0.000e+00 |
| logs/reproduce_deltanet_packing_logits.py | 4-layer dense stack + final norm + lm_head | all per-doc logits diff = 0.000e+00 |
| logs/reproduce_deltanet_packing_logits_moe.py | 4-layer MoE stack + final norm + lm_head | all per-doc logits diff = 0.000e+00 |
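
The reproducer scripts follow the issue's procedure: run each document standalone, run the pack, slice the packed output back per document, and compare. A minimal sketch of that check (hypothetical harness; the actual scripts live under logs/ and are not part of this diff):

```python
import torch

def packed_vs_standalone_diff(layer, doc_a, doc_b, packed_inputs):
    """Max abs diff between standalone outputs and packed-then-sliced outputs."""
    out_a = layer(doc_a)            # [1, len_a, H]
    out_b = layer(doc_b)            # [1, len_b, H]
    out_p = layer(**packed_inputs)  # [1, len_a + len_b, H], boundaries threaded
    len_a = doc_a.size(1)
    return max(
        (out_p[:, :len_a] - out_a).abs().max().item(),
        (out_p[:, len_a:] - out_b).abs().max().item(),
    )
```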

Production-recipe verification

Captured live during `automodel <qwen3_5_4b_neat_packing.yaml>`:

```
[INFO] Patched 24 GatedDeltaNet modules (cp=False) with FSDP-safe fp32 param wrapping.

step 0 chunk_gated_delta_rule:
q=(1, 1503, 32, 128) bf16 needs_unpad=False
cu_seqlens=[0, 371, 495, 816, 1074, 1198, 1503] n_docs=6 doc_lens=[371, 124, 321, 258, 124, 305]

step 1 chunk_gated_delta_rule:
q=(1, 2150, 32, 128) bf16 needs_unpad=False
cu_seqlens=[0, 1699, 2150] n_docs=2 doc_lens=[1699, 451]
```

End-to-end: production recipe → indexed mask → Qwen3_5DecoderLayerWithPacking.forward → CPAwareGatedDeltaNet._forward_no_cp → FLA chunk_gated_delta_rule now receives real per-document boundaries. With the bug, FLA would have seen cu_seqlens=None and run all 1503 tokens as one sequence.
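
As a sanity check on the step-0 log line, doc_lens is just the consecutive difference of cu_seqlens:

```python
# Step-0 values copied from the log above.
cu = [0, 371, 495, 816, 1074, 1198, 1503]
doc_lens = [b - a for a, b in zip(cu, cu[1:])]
assert doc_lens == [371, 124, 321, 258, 124, 305]
assert sum(doc_lens) == 1503  # matches q's sequence length
```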

🤖 Generated with Claude Code

Qwen3_5GatedDeltaNet leaked recurrent state across packed-sample
boundaries because FLA chunk_gated_delta_rule was called without
cu_seqlens (state ran continuously across the whole pack) and
causal_conv1d_fn was called without seq_idx (sample-1's last 3
tokens bled into sample-2's first 3 tokens).

The fix builds on the existing patch_hf_model class-swap path:

- Dense path: new Qwen3_5DecoderLayerWithPacking subclass derives
  cu_seqlens / indices from the indexed attention_mask and threads
  them, plus position_ids, into linear_attn as kwargs. patch_hf_model
  now class-swaps every Qwen3_5DecoderLayer instance to this subclass.
- MoE path: Qwen3_5MoeBlock.forward override does the same kwarg
  threading. No class-swap needed since the block is custom code.
- CPAwareGatedDeltaNet._forward_no_cp accepts the new kwargs. When
  packed: skips apply_mask_to_padding_states (the unpad subsumes it),
  unpads [B, T, H] -> [1, total_valid, H] for B>1 cases, threads
  seq_idx into causal_conv1d_fn, threads cu_seqlens into
  chunk_gated_delta_rule, repads on exit. When not packed: bit-for-bit
  unchanged.

Layer-output diff between standalone-vs-packed-then-sliced on a small
reproducer: 4.011e-01 -> 0.000e+00.

_forward_with_cp is intentionally not modified; packing+CP is out of
scope (no shipped Qwen3.5 recipe sets cp_size > 1).

Fixes #2131

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

