Add Qwen3.5 vision encoder and connector #3962
🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.
The Pull Request successfully adds support for the Qwen3.5 vision encoder and connector by subclassing the Qwen3 Omni layers. This approach ensures clean checkpoint parameter keys while reusing established logic. The fix for the hybrid GDN logic in attentions.py is a crucial correction for vision tower integration.
🔍 General Feedback
- Clean Architecture: Subclassing Qwen3 Omni layers to achieve clean checkpoint names while reusing logic is an excellent use of NNX and maintains code modularity.
- Critical Bug Fix: The update to `is_qwen3_hybrid` in `attentions.py` correctly prevents the hybrid GDN/Attention logic from being incorrectly applied to the vision tower.
- Comprehensive Testing: The addition of `tests/unit/qwen3_5_layers_test.py` with detailed comparisons against the HuggingFace reference implementation ensures the correctness of the new vision tower.
- Consistent Configuration: The MRoPE and vision configuration in the YAML file align perfectly with the model's architectural requirements.
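To make the bug-fix point concrete, here is a minimal sketch of the kind of guard described: the hybrid path is enabled for the text decoder only and never for the vision tower. Only the name `is_qwen3_hybrid` comes from the PR. The `AttentionSpec` class, its fields, the `"qwen3_5"` value, and the function signature are assumptions made for illustration and are not the actual code in `attentions.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionSpec:
  """Hypothetical stand-in for the settings an attention layer sees."""

  model_family: str  # illustrative value such as "qwen3_5"; not a real config key
  is_vision: bool = False  # True when the layer belongs to the vision tower


def is_qwen3_hybrid(spec: AttentionSpec) -> bool:
  """Enable the hybrid decoder path only outside the vision tower (assumed rule)."""
  return spec.model_family == "qwen3_5" and not spec.is_vision
```

With a guard of this shape, a vision-tower attention layer built for a Qwen3.5 model falls through to the regular attention path.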
  """Get the bidirectional mask for specific models."""
  bidirectional_mask_audio = None
- if config.model_name in ["qwen3-omni-30b-a3b"]:
+ if config.model_name in ["qwen3-omni-30b-a3b", "qwen3.5-397b-a17b"]:
I thought qwen3.5-397b-a17b doesn't support audio, could you double check?
Great catch, thanks! Qwen3.5 indeed doesn't support audio input, so I've removed this edit.
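One way to keep this kind of check from drifting is to gate on a single list of audio-capable models. The sketch below is hypothetical: the two model names come from the diff in this thread, while `AUDIO_CAPABLE_MODELS` and `supports_audio` are illustrative names and not existing MaxText helpers.

```python
# Hypothetical registry; only the model names are taken from the diff.
AUDIO_CAPABLE_MODELS = frozenset({"qwen3-omni-30b-a3b"})


def supports_audio(model_name: str) -> bool:
  """Return True only for models that accept audio input (assumed list)."""
  return model_name in AUDIO_CAPABLE_MODELS
```

Under this assumption, `supports_audio("qwen3.5-397b-a17b")` is `False`, so no audio bidirectional mask would be built for Qwen3.5.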
Description
- Subclassed JAX Vision Layers: Created clean JAX `Qwen3_5MoeVisionEncoder` and `Qwen3_5MoeVisionProjector` subclasses to reuse Qwen3-Omni layers (both share the `3VL` base), keeping checkpoint parameter keys clean by specifying names in `encoders.py`.
- Key Differences against Qwen3-Omni:
  - `ln_q`/`mlp` to `norm`, `linear_fc1`, `linear_fc2`. We updated this in the unit test (`copy_qwen3_5_patch_merger_weights`), and should also address it in the follow-up ckpt PR.
  - `deepstack_visual_indexes_for_vit: []` in yml.
- Hybrid Attention Bug Fix: Fixed `maxtext/layers/attentions.py` to prevent the Qwen3.5 hybrid GDN query-splitting logic from executing on the vision tower attention layer.
- Equivalence Unit Test: Added `tests/unit/qwen3_5_layers_test.py` comparing the subclassed JAX tower against HF `Qwen3_5MoeVisionModel` on TPU. Uses `atol=2e-2` (due to more accumulated error from the 4096 visual projection dimension vs 2048 in Omni). Passed cleanly.
Tests
Offline unit test against HF Qwen3.5 reference implementation:
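As a rough sketch of the two ingredients described above (renaming patch-merger weights and comparing outputs under an absolute tolerance), the pure-Python example below may help. It is not the code in `tests/unit/qwen3_5_layers_test.py`. Only the names `ln_q`, `mlp`, `norm`, `linear_fc1`, `linear_fc2` and the value `atol=2e-2` come from the description; the `.0`/`.2` sub-indices, the direction of the mapping, and the helper names are assumptions. A real test would more likely compare arrays with something like `numpy.testing.assert_allclose`.

```python
# Hypothetical prefix mapping; the ".0"/".2" sub-indices are an assumption.
MERGER_KEY_MAP = {
    "ln_q": "norm",
    "mlp.0": "linear_fc1",
    "mlp.2": "linear_fc2",
}


def rename_merger_keys(params: dict) -> dict:
  """Return a copy of `params` with Omni-style merger prefixes renamed."""
  renamed = {}
  for key, value in params.items():
    for old, new in MERGER_KEY_MAP.items():
      # Match the prefix exactly or followed by a dot, e.g. "mlp.0.weight".
      if key == old or key.startswith(old + "."):
        key = new + key[len(old):]
        break
    renamed[key] = value
  return renamed


def max_abs_diff(a, b) -> float:
  """Largest element-wise absolute difference between two flat sequences."""
  if len(a) != len(b):
    raise ValueError("length mismatch")
  return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def within_atol(a, b, atol: float = 2e-2) -> bool:
  """True when every element of `a` is within `atol` of the matching element of `b`."""
  return max_abs_diff(a, b) <= atol
```

Keys that match none of the listed prefixes pass through unchanged, so the same helper can be applied to a full parameter dictionary.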
Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.