Add Qwen3.5 vision encoder and connector by hengtaoguo · Pull Request #3962 · AI-Hypercomputer/maxtext

hengtaoguo · 2026-05-21T05:40:08Z

Description

Subclassed JAX Vision Layers: Created JAX Qwen3_5MoeVisionEncoder and Qwen3_5MoeVisionProjector subclasses that reuse the Qwen3-Omni layers (both share the Qwen3-VL base). Checkpoint parameter keys stay clean because the module names are specified in encoders.py.
Key differences from Qwen3-Omni:
- HuggingFace renames the connector weights from ln_q/mlp to norm, linear_fc1, linear_fc2. We updated the unit test helper copy_qwen3_5_patch_merger_weights accordingly, and the follow-up ckpt PR should also address this.
- Deepstack layers are disabled for Qwen3.5 by setting deepstack_visual_indexes_for_vit: [] in the yml.
Hybrid Attention Bug Fix: Fixed maxtext/layers/attentions.py to prevent the Qwen3.5 hybrid GDN query-splitting logic from executing on the vision tower attention layer.
Equivalence Unit Test: Added tests/unit/qwen3_5_layers_test.py, which compares the subclassed JAX tower against HF Qwen3_5MoeVisionModel on TPU. It uses atol=2e-2 because the 4096-dimensional visual projection accumulates more error than the 2048-dimensional one in Omni. The test passes.
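
For readers following the connector rename, here is a minimal sketch of how old-style merger keys could be mapped to the new names. It is not the code from this PR: the function name `rename_merger_key`, the `.merger.` path component, and the `mlp.0` / `mlp.2` indices are assumptions made for illustration only.

```python
# Hypothetical mapping from Omni-style connector names to the Qwen3.5 names.
# Only ln_q/mlp -> norm/linear_fc1/linear_fc2 comes from the PR description;
# the "mlp.0" / "mlp.2" indices are an assumption about the old layout.
OLD_TO_NEW = {
    "ln_q": "norm",
    "mlp.0": "linear_fc1",
    "mlp.2": "linear_fc2",
}


def rename_merger_key(key: str) -> str:
    """Return `key` with an old connector sub-module name replaced by the new one.

    Keys outside the (assumed) ".merger." scope are returned unchanged.
    """
    if ".merger." not in key:
        return key
    for old, new in OLD_TO_NEW.items():
        marker = f".{old}."
        if marker in key:
            # Replace only the first occurrence to avoid touching suffixes.
            return key.replace(marker, f".{new}.", 1)
    return key


if __name__ == "__main__":
    for k in ("visual.merger.ln_q.weight", "visual.merger.mlp.2.bias"):
        print(k, "->", rename_merger_key(k))
```

A mapping table like this keeps the rename in one place, so the unit test helper and a later checkpoint conversion script could share it.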

Tests

Offline unit test against HF Qwen3.5 reference implementation:

python -m pytest tests/unit/qwen3_5_layers_test.py -vv -s

collected 1 item                                             

tests/unit/qwen3_5_layers_test.py::TestQwen3_5MoeVisionEncoderEndToEnd::test_vision_encoder_subclasses_match_torch W0521 05:45:28.587099 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:30.585665 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:31.026261 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:31.343423 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:31.926974 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:32.180391 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
PASSED
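
As a rough illustration of what an absolute tolerance of 2e-2 means in the equivalence test, the sketch below checks the largest element-wise gap between two flat lists of outputs. The helper names and the sample values are made up for this example; the actual test compares JAX and HF tensors and may use a different comparison routine.

```python
def max_abs_diff(actual, expected):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(actual) != len(expected):
        raise ValueError("length mismatch")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def within_atol(actual, expected, atol=2e-2):
    """True when every element differs by at most `atol` (no relative term)."""
    return max_abs_diff(actual, expected) <= atol


if __name__ == "__main__":
    # Illustrative numbers only, not outputs from the real models.
    jax_out = [0.500, -1.250, 3.000]
    torch_out = [0.510, -1.245, 2.985]
    print(within_atol(jax_out, torch_out))  # largest gap is about 0.015
```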

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

codecov · 2026-05-21T06:04:24Z

Codecov Report

❌ Patch coverage is 10.00000% with 18 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/maxtext/layers/encoders.py	0.00%	7 Missing ⚠️
src/maxtext/multimodal/processor.py	12.50%	7 Missing ⚠️
src/maxtext/models/qwen3_5_vision.py	0.00%	3 Missing ⚠️
src/maxtext/layers/attentions.py	50.00%	1 Missing ⚠️


github-actions · 2026-05-22T00:42:11Z

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

github-actions

## 📋 Review Summary

The Pull Request successfully adds support for the Qwen3.5 vision encoder and connector by subclassing the Qwen3 Omni layers. This approach ensures clean checkpoint parameter keys while reusing established logic. The fix for the hybrid GDN logic in attentions.py is a crucial correction for vision tower integration.

🔍 General Feedback

Clean Architecture: Subclassing Qwen3 Omni layers to achieve clean checkpoint names while reusing logic is an excellent use of NNX and maintains code modularity.
Critical Bug Fix: The update to is_qwen3_hybrid in attentions.py correctly prevents the hybrid GDN/Attention logic from being incorrectly applied to the vision tower.
Comprehensive Testing: The addition of tests/unit/qwen3_5_layers_test.py with detailed comparisons against the HuggingFace reference implementation ensures the correctness of the new vision tower.
Consistent Configuration: The MRoPE and vision configuration in the YAML file align perfectly with the model's architectural requirements.
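
The kind of guard the feedback describes can be sketched as follows. This is an assumption-laden illustration, not the change made in attentions.py: `AttentionContext`, `use_hybrid_query_split`, the `is_vision` flag, and the `"qwen3_5"` block name are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionContext:
    """Hypothetical stand-in for what an attention layer knows about itself."""

    decoder_block: str  # assumed identifier for the text decoder family
    is_vision: bool  # True when the layer belongs to the vision tower


def use_hybrid_query_split(ctx: AttentionContext) -> bool:
    """Enable the hybrid query-splitting path only for text-decoder layers.

    Vision-tower attention is plain attention, so it must skip this path
    even when the surrounding model is a hybrid one.
    """
    return ctx.decoder_block == "qwen3_5" and not ctx.is_vision


if __name__ == "__main__":
    print(use_hybrid_query_split(AttentionContext("qwen3_5", is_vision=False)))
    print(use_hybrid_query_split(AttentionContext("qwen3_5", is_vision=True)))
```

Keying the decision on both the model family and the layer's role avoids the failure mode where a model-wide flag silently changes the behavior of a reused sub-module.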

aireenmei

Thanks for the work!

aireenmei · 2026-05-22T01:05:01Z

  """Get the bidirectional mask for specific models."""
  bidirectional_mask_audio = None
-  if config.model_name in ["qwen3-omni-30b-a3b"]:
+  if config.model_name in ["qwen3-omni-30b-a3b", "qwen3.5-397b-a17b"]:


I thought qwen3.5-397b-a17b doesn't support audio. Could you double check?

Great catch, thanks! Qwen3.5 indeed doesn't support audio input, so I've removed this edit.

hengtaoguo force-pushed the hengtaoguo-qwen35 branch 4 times, most recently from 899d4db to bbcf606 May 21, 2026 05:58

hengtaoguo marked this pull request as ready for review May 21, 2026 06:00

entrpn approved these changes May 21, 2026


aireenmei added the gemini-review label May 22, 2026

github-actions Bot reviewed May 22, 2026


aireenmei approved these changes May 22, 2026


Add Qwen3.5 vision layers

545b677

hengtaoguo force-pushed the hengtaoguo-qwen35 branch from a43b833 to 545b677 May 22, 2026 06:13

hengtaoguo added the pull ready label May 22, 2026

hengtaoguo wants to merge 1 commit into main from hengtaoguo-qwen35
