Add Qwen3.5 vision encoder and connector #3962

Open
hengtaoguo wants to merge 1 commit into main from hengtaoguo-qwen35

Collaborator

@hengtaoguo hengtaoguo commented May 21, 2026

Description

  • Subclassed JAX Vision Layers: Added JAX Qwen3_5MoeVisionEncoder and Qwen3_5MoeVisionProjector subclasses that reuse the Qwen3-Omni layers (both models share the Qwen3-VL base). Checkpoint parameter keys stay clean because the module names are specified explicitly in encoders.py.

  • Key Differences against Qwen3-Omni:

    • HuggingFace renames the connector weights from ln_q/mlp to norm, linear_fc1, and linear_fc2. The unit-test helper copy_qwen3_5_patch_merger_weights has been updated for the new names; the follow-up checkpoint PR should handle the rename as well.
    • Deepstack layers are disabled for Qwen3.5 by setting deepstack_visual_indexes_for_vit: [] in the YAML config.
  • Hybrid Attention Bug Fix: Fixed maxtext/layers/attentions.py to prevent Qwen3.5 hybrid GDN query-splitting logic from executing on the vision tower attention layer.

  • Equivalence Unit Test: Added tests/unit/qwen3_5_layers_test.py, which compares the subclassed JAX tower against HF Qwen3_5MoeVisionModel on TPU. It uses atol=2e-2 because the 4096-wide visual projection accumulates more numerical error than the 2048-wide projection in Omni. The test passes.
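
For readers following the connector rename, here is a minimal sketch of the name correspondence described above. It is not the code in this PR: the function name, the `visual.merger.` key prefix, and the `mlp.0` / `mlp.2` sub-keys are assumptions made for illustration, and only the ln_q/mlp to norm, linear_fc1, linear_fc2 correspondence comes from the description.

```python
# Hypothetical illustration of the connector (patch merger) key rename.
# ASSUMPTION: the old "mlp" block exposes its two linear layers as "mlp.0"
# and "mlp.2"; the real checkpoint layout may differ.
OLD_TO_NEW = {
    "ln_q": "norm",
    "mlp.0": "linear_fc1",
    "mlp.2": "linear_fc2",
}


def rename_merger_key(key: str) -> str:
    """Map one old-style merger parameter key to the new naming.

    Keys outside the merger (ASSUMPTION: identified by ".merger.") and keys
    that match no old name are returned unchanged.
    """
    if ".merger." not in key:
        return key
    for old, new in OLD_TO_NEW.items():
        marker = f".{old}."
        if marker in key:
            # Replace only the first occurrence to avoid touching suffixes.
            return key.replace(marker, f".{new}.", 1)
    return key


if __name__ == "__main__":
    for k in ("visual.merger.ln_q.weight", "visual.merger.mlp.2.bias"):
        print(k, "->", rename_merger_key(k))
```

A table-driven mapping like this keeps the rename in one place, which may help when the same correspondence is needed again in the checkpoint conversion.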

Tests

Offline unit test against HF Qwen3.5 reference implementation:

python -m pytest tests/unit/qwen3_5_layers_test.py -vv -s
collected 1 item                                             

tests/unit/qwen3_5_layers_test.py::TestQwen3_5MoeVisionEncoderEndToEnd::test_vision_encoder_subclasses_match_torch W0521 05:45:28.587099 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:30.585665 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:31.026261 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:31.343423 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:31.926974 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
W0521 05:45:32.180391 3188396 pjrt_executable.cc:642] Assume version compatibility. PjRt-IFRT does not track XLA executable versions.
PASSED
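
To make the atol=2e-2 criterion concrete, the following is a rough, stdlib-only sketch of an absolute-tolerance check. The helper names are hypothetical and the real test presumably relies on NumPy or JAX utilities on full tensors; this only illustrates what "within atol" means for flattened outputs.

```python
from typing import Sequence


def max_abs_diff(actual: Sequence[float], expected: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two flat sequences."""
    if len(actual) != len(expected):
        raise ValueError("sequences must have the same length")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def assert_within_atol(
    actual: Sequence[float], expected: Sequence[float], atol: float = 2e-2
) -> float:
    """Raise AssertionError if any element differs by more than atol.

    Returns the observed maximum difference so callers can log it.
    """
    diff = max_abs_diff(actual, expected)
    if diff > atol:
        raise AssertionError(f"max abs diff {diff:.4g} exceeds atol={atol}")
    return diff


if __name__ == "__main__":
    # Stand-ins for flattened JAX and reference outputs.
    jax_out = [0.500, -1.250, 3.010]
    ref_out = [0.505, -1.245, 3.000]
    print("max abs diff:", assert_within_atol(jax_out, ref_out))
```

Returning the observed difference makes it easier to see how much headroom remains under the chosen tolerance.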

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.


codecov Bot commented May 21, 2026

Codecov Report

❌ Patch coverage is 10.00000% with 18 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/maxtext/layers/encoders.py | 0.00% | 7 Missing ⚠️ |
| src/maxtext/multimodal/processor.py | 12.50% | 7 Missing ⚠️ |
| src/maxtext/models/qwen3_5_vision.py | 0.00% | 3 Missing ⚠️ |
| src/maxtext/layers/attentions.py | 50.00% | 1 Missing ⚠️ |


@github-actions

🤖 Hi @aireenmei, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.


@github-actions github-actions Bot left a comment


## 📋 Review Summary

The Pull Request successfully adds support for the Qwen3.5 vision encoder and connector by subclassing the Qwen3 Omni layers. This approach ensures clean checkpoint parameter keys while reusing established logic. The fix for the hybrid GDN logic in attentions.py is a crucial correction for vision tower integration.

## 🔍 General Feedback

  • Clean Architecture: Subclassing Qwen3 Omni layers to achieve clean checkpoint names while reusing logic is an excellent use of NNX and maintains code modularity.
  • Critical Bug Fix: The update to is_qwen3_hybrid in attentions.py correctly prevents the hybrid GDN/Attention logic from being incorrectly applied to the vision tower.
  • Comprehensive Testing: The addition of tests/unit/qwen3_5_layers_test.py with detailed comparisons against the HuggingFace reference implementation ensures the correctness of the new vision tower.
  • Consistent Configuration: The MRoPE and vision configuration in the YAML file align perfectly with the model's architectural requirements.
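
As a rough illustration of the guard described in the bug-fix item, the sketch below shows one way a hybrid-attention predicate could exclude the vision tower. Everything here is hypothetical: the `AttentionContext` fields, the `"qwen3_5"` tag, and the function name are assumptions, and the actual `is_qwen3_hybrid` condition in attentions.py may be expressed quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionContext:
    """Hypothetical stand-in for the settings an attention layer can see."""

    model_family: str  # ASSUMPTION: e.g. "qwen3_5" for the hybrid decoder
    is_vision: bool    # ASSUMPTION: True when the layer is in the vision tower


def use_hybrid_query_split(ctx: AttentionContext) -> bool:
    """Enable the hybrid GDN query-splitting path only for text decoder layers.

    The vision tower uses ordinary attention, so it must never take the
    hybrid path even when the overall model is a hybrid one.
    """
    return ctx.model_family == "qwen3_5" and not ctx.is_vision


if __name__ == "__main__":
    print(use_hybrid_query_split(AttentionContext("qwen3_5", is_vision=False)))
    print(use_hybrid_query_split(AttentionContext("qwen3_5", is_vision=True)))
```

Keeping the decision in a single predicate means the vision tower and the text decoder cannot drift apart in how they evaluate the hybrid condition.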

Collaborator

@aireenmei aireenmei left a comment


Thanks for the work!

Comment thread src/maxtext/multimodal/processor.py Outdated
  """Get the bidirectional mask for specific models."""
  bidirectional_mask_audio = None
- if config.model_name in ["qwen3-omni-30b-a3b"]:
+ if config.model_name in ["qwen3-omni-30b-a3b", "qwen3.5-397b-a17b"]:
Collaborator


I thought qwen3.5-397b-a17b doesn't support audio, could you double check?

Collaborator Author


Great catch, thanks! Qwen3.5 indeed doesn't support audio input, so I've removed this edit.
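
For context on the outcome of this thread, here is a small sketch of gating the audio mask on model name. The constant and function names are hypothetical; only the model-name strings come from the diff above, and the real processor.py logic does more than this check.

```python
# Hypothetical sketch: only models with audio input get an audio mask.
# After the review, "qwen3.5-397b-a17b" is intentionally NOT in this set.
AUDIO_CAPABLE_MODELS = frozenset({"qwen3-omni-30b-a3b"})


def supports_audio(model_name: str) -> bool:
    """Return True if the named model accepts audio input."""
    return model_name in AUDIO_CAPABLE_MODELS


if __name__ == "__main__":
    for name in ("qwen3-omni-30b-a3b", "qwen3.5-397b-a17b"):
        print(name, supports_audio(name))
```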
