Describe the bug
Qwen3.5 models that use Qwen3_5GatedDeltaNet (DeltaNet / FLA backend) produce incorrect training behavior when using packed sequences (e.g., NEAT packing).
Although packing is explicitly demonstrated in the official config (qwen3_5_4b_neat_packing.yaml), the underlying DeltaNet implementation is not segment-aware and does not respect sample boundaries inside packed sequences.
As a result, information leaks across packed samples via recurrent state, leading to incorrect training dynamics.
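To illustrate the leak mechanism, here is a minimal sketch using a toy scalar recurrence, not the actual gated delta rule. The function `toy_recurrent_scan` and its `cu_seqlens` argument are hypothetical names chosen for this example; they only show how a recurrent state carried across a packing boundary changes the second sample's output, and how resetting the state at each boundary would avoid that.

```python
from typing import List, Optional


def toy_recurrent_scan(
    x: List[float],
    decay: float = 0.5,
    cu_seqlens: Optional[List[int]] = None,
) -> List[float]:
    """Toy linear recurrence s_t = decay * s_{t-1} + x_t, returning every s_t.

    This is a stand-in for a recurrent layer, NOT the real DeltaNet kernel.
    cu_seqlens holds cumulative sample boundaries, e.g. [0, 3, 5] for two
    packed samples of lengths 3 and 2. When it is given, the state is reset
    at the start of every sample; when it is None, the state is only
    initialised once at position 0.
    """
    starts = set(cu_seqlens[:-1]) if cu_seqlens is not None else {0}
    state = 0.0
    out = []
    for t, x_t in enumerate(x):
        if t in starts:
            state = 0.0  # segment-aware reset at a sample boundary
        state = decay * state + x_t
        out.append(state)
    return out


sample_1 = [1.0, 2.0, 3.0]
sample_2 = [4.0, 5.0]
packed = sample_1 + sample_2

# sample_2 processed on its own
alone_2 = toy_recurrent_scan(sample_2)
# sample_2's slice of the packed run, with no boundary handling
leaky = toy_recurrent_scan(packed)[len(sample_1):]
# sample_2's slice of the packed run, with state reset at each boundary
segmented = toy_recurrent_scan(packed, cu_seqlens=[0, 3, 5])[len(sample_1):]

print(alone_2)    # [4.0, 7.0]
print(leaky)      # [6.125, 8.0625]
print(segmented)  # [4.0, 7.0]
```

In this toy, the boundary-unaware scan gives sample_2 a different output because the state left over from sample_1 is still present when sample_2 starts. This is the same kind of cross-sample dependence described above.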
Steps/Code to reproduce bug
1. Prepare a dataset with two independent samples.
2. Run the recipe (with Qwen3_5GatedDeltaNet) on each sample individually and log the output of the first DeltaNet layer. Packing may stay enabled, as long as the two samples go through separate runs or batches.
3. Run the model on both samples packed sequentially into one sequence (packing on).
4. Split the output of step 3 back into segments (sample_1, sample_2).
5. Compare the outputs from steps 2 and 4; they differ.
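The split-and-compare part of the steps above could be sketched as follows. This is a rough illustration on plain Python lists: `split_packed`, `max_abs_diff`, `compare_packed_vs_individual`, and the `running_sum` stand-in layer are all hypothetical helpers, not part of the recipe or of any library. In a real reproduction, `layer` would be replaced by whatever hook captures the first DeltaNet layer's output.

```python
from typing import Callable, List, Sequence


def split_packed(output: Sequence[float], lengths: Sequence[int]) -> List[List[float]]:
    """Cut a packed output back into per-sample segments of the given lengths."""
    if sum(lengths) != len(output):
        raise ValueError("segment lengths do not add up to the packed length")
    segments, start = [], 0
    for n in lengths:
        segments.append(list(output[start:start + n]))
        start += n
    return segments


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length outputs."""
    if len(a) != len(b):
        raise ValueError("outputs have different lengths")
    return max((abs(p - q) for p, q in zip(a, b)), default=0.0)


def compare_packed_vs_individual(
    layer: Callable[[List[float]], List[float]],
    samples: List[List[float]],
) -> List[float]:
    """Return, per sample, the max difference between individual and packed runs."""
    individual = [layer(s) for s in samples]
    packed_input = [v for s in samples for v in s]
    packed_segments = split_packed(layer(packed_input), [len(s) for s in samples])
    return [max_abs_diff(i, p) for i, p in zip(individual, packed_segments)]


def running_sum(x: List[float]) -> List[float]:
    """Hypothetical boundary-unaware causal layer used only for this demo."""
    total, out = 0.0, []
    for v in x:
        total += v
        out.append(total)
    return out


diffs = compare_packed_vs_individual(running_sum, [[1.0, 2.0], [3.0, 4.0]])
print(diffs)  # [0.0, 3.0]
```

For a causal layer like this stand-in, the first sample matches and only later samples in the pack diverge, so the comparison is most informative on sample_2. With real floating-point activations, a tolerance-based check such as `torch.testing.assert_close` would likely be more appropriate than an exact comparison.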
Expected behavior
In step 5 of the reproduction, the outputs should match: sample_1 and sample_2 should each yield the same first-layer DeltaNet output whether run individually or as part of a packed sequence.