Feature/add minimax m3 vl #1096
Draft
shagsood wants to merge 10 commits into
Trigger: deepseek-ai/DeepSeek-V4-Pro requires transformers.models.deepseek_v4, which first appeared in transformers 5.10.2.
Fixes applied:
- B9 (Gemma3 drift): `logger` is no longer exported from transformers.models.gemma3.modeling_gemma3; replaced with logging.getLogger(__name__).
Regression test: 121 passed, 3 skipped, 0 failures (test_model_quickcheck.py full suite on isolated venv)
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
Trigger: google/diffusiongemma-26B-A4B-it requires transformers >= 5.11.0
for transformers.models.diffusion_gemma support.
Test evidence (isolated venv qeff_env_upgrade_5.11.0):
- Per-model import sweep: 0 failures across all QEfficient/transformers/models/*/
- Quickcheck (tests/unit_test/models/test_model_quickcheck.py): 107 passed, 17 skipped, 0 failed in 3:00
- Causal LM CPU parity (HF vs QEff vs ORT tokens): codegen, falcon, gpt2, gpt_oss, gptj,
granite, llama, mistral, mixtral, mpt, olmo2, phi3, phi, qwen2, starcoder2 — all PASS
- Subfunction export smoke (15 archs) — PASS
- VLM export smoke (gemma3, qwen2_5_vl) — PASS
- Whisper export smoke — PASS
- AWQ export smoke — PASS
- Audio CTC, sequence classification, text embedding — PASS
No code fixes required: 5.10.2 → 5.11.0 is a clean minor bump (no R1-R5 / B1-B9 hits).
Cloud AI 100 stack impact: none — no MXFP6/MXINT8, KV cache, or ONNX surface changes.
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
(see .archon/artifacts/runs/minimax_m3_vl/fix-spine-state.md) Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
…rward
Two correctness bugs surfaced by the tiny-dummy CPU parity test (HF == QEff == ORT, all green at max|delta| = 1e-6 after these fixes):
- B1 (RoPE shape): TextModel was pre-unsqueezing cos/sin to [B,1,S,d] before calling apply_rotary_pos_emb, but upstream's apply_rotary_pos_emb does its own unsqueeze (unsqueeze_dim=1), making cos 5D against q's 4D and crashing with 'Tensors must have same number of dimensions'. Pass cos/sin as [B,S,head_dim] so the upstream broadcast works as documented.
- B2 (causal mask): target_length was bare past_seen_tokens, which is 0 on the first eager forward and produced an empty kv-axis (arange(0,0)). Use attention_mask.shape[-1] when the caller supplies an SDPA-shaped mask, else past_seen_tokens + current seq_len. Mirrors the qwen3_vl_moe pattern.
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
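The B1 shape mismatch can be reproduced with shape bookkeeping alone. The following is a minimal sketch, not the PR's code: `unsqueeze` and `broadcast_same_rank` are hypothetical helpers that operate on plain tuples, the dimension sizes are made up, and the equal-rank requirement is a simplification standing in for the 'same number of dimensions' failure reported above.

```python
def unsqueeze(shape, dim):
    """Shape that inserting a size-1 axis at non-negative `dim` would produce."""
    return shape[:dim] + (1,) + shape[dim:]


def broadcast_same_rank(a, b):
    """Broadcast two shapes, insisting on equal rank (a simplification)."""
    if len(a) != len(b):
        raise ValueError(f"rank mismatch: {len(a)}D vs {len(b)}D")
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"cannot broadcast {x} with {y}")
        out.append(max(x, y))
    return tuple(out)


# Illustrative sizes only: batch, heads, sequence length, head_dim.
B, H, S, D = 1, 4, 8, 16
q = (B, H, S, D)
cos = (B, S, D)

# Fixed path: the caller passes [B,S,head_dim]; the rotary helper unsqueezes once.
cos_fixed = unsqueeze(cos, 1)               # (B, 1, S, D)
fixed = broadcast_same_rank(q, cos_fixed)   # matches q's shape

# Buggy path: the caller pre-unsqueezes, then the rotary helper unsqueezes again.
cos_buggy = unsqueeze(unsqueeze(cos, 1), 1)  # (B, 1, 1, S, D), 5D against q's 4D
try:
    broadcast_same_rank(q, cos_buggy)
    buggy_error = None
except ValueError as exc:
    buggy_error = str(exc)
```

Keeping the unsqueeze in exactly one place (the upstream helper) is what makes the [B,S,head_dim] calling convention safe.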
…ask target_length
Production-path qaic-compile bring-up surfaced three more issues, all fixed with full HF == QEff(PT) == QPC parity now green on a tiny dummy (52M params, fp16 KV on AI 100):
- S1 (specs vision symbols): get_specializations didn't bind num_patches or num_images, but get_onnx_dynamic_axes named pixel_values dim 0 = num_patches and image_grid_thw dim 0 = num_images. qaic-compile rejected the ONNX with 'symbol num_patches is undefined'. Bound both in the vision/lang specs as image_seq_length * spatial_merge^2 * max_num_images and max_num_images.
- S2 (CCL kwarg leak): comp_ctx_lengths_prefill / comp_ctx_lengths_decode default to None and were passed verbatim into compiler_options, which qaic-compile then received as the literal string '-comp-ctx-lengths-prefill= None' and rejected with 'Invalid option'. Pop them at the top of get_specializations, before the **compiler_options leak path.
- M2 (causal-mask kv-axis): target_length was past_seen_tokens + seq_len, which double-counts on the export trace, where past_kv comes pre-allocated at prefill_seq_len: past_seen=32, seq_len=32 → mask kv-axis=64, but k_out after QEffDynamicCache.update is 32 (scatter, not concat) → broadcast crash. target_length should be max(past_seen, seq_len), the actual kv-buffer extent.
Tested on dummy (1×8×8 grid, 52M params, fp16 QPC):
- HF: top1=1110, max=1.0359
- QEff PT: top1=1110, max=1.0359 (max|Δ| vs HF = 1.0e-6)
- QPC: top1=1110, max=1.0361 (max|Δ| vs HF = 2.0e-3, fp16 noise)
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
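The M2 and S2 fixes described above are easy to check with plain arithmetic and dict handling. This is a sketch under stated assumptions rather than the code in this PR: `kv_axis_candidates`, `split_ccl_kwargs`, `render_flags`, and the `example_option` key are hypothetical, the flag format is illustrative, and the first-forward seq_len of 32 is chosen only to match the export-trace numbers in the commit message.

```python
def kv_axis_candidates(past_seen, seq_len):
    """The three target_length formulas tried across the commits."""
    return {
        "bare_past_seen": past_seen,
        "past_plus_seq": past_seen + seq_len,
        "max_of_both": max(past_seen, seq_len),
    }


def split_ccl_kwargs(compiler_options):
    """Pop the CCL kwargs so unset (None) values never reach the renderer."""
    opts = dict(compiler_options)  # copy, so the caller's dict is untouched
    ccl_prefill = opts.pop("comp_ctx_lengths_prefill", None)
    ccl_decode = opts.pop("comp_ctx_lengths_decode", None)
    return ccl_prefill, ccl_decode, opts


def render_flags(opts):
    """Hypothetical renderer: each key=value pair becomes a dash-prefixed flag."""
    return [f"-{k.replace('_', '-')}={v}" for k, v in opts.items()]


# First eager forward: no cache yet and 32 new tokens, so k spans 32 positions.
first_forward = kv_axis_candidates(past_seen=0, seq_len=32)
# Export trace: cache pre-allocated at 32 and updated by scatter, so k spans 32.
export_trace = kv_axis_candidates(past_seen=32, seq_len=32)

raw = {
    "comp_ctx_lengths_prefill": None,
    "comp_ctx_lengths_decode": None,
    "example_option": 16,  # hypothetical option, for illustration only
}
leaky_flags = render_flags(raw)       # includes '...=None' strings
_, _, clean = split_ccl_kwargs(raw)
clean_flags = render_flags(clean)     # only the real option survives
```

In these two scenarios only `max_of_both` gives 32 both times: the bare count collapses to 0 on the first forward and the sum doubles to 64 on the export trace.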
Upload `shagunsd/tiny-random-MiniMaxM3SparseForConditionalGeneration` (106M params, seed=0) to HF Hub so the parametrized harness is reproducible on any machine. Switch `image_text_model_configs.json` from the full 444B `MiniMaxAI/MiniMax-M3` to the Hub-hosted tiny. Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
…rse operator)
Graft QEffMiniMaxM3VLIndexer onto the modeling file so sparse layers (layer_types[i] == "minimax_m3_sparse") run top-k key-block selection instead of falling back to a dense causal mask. This is the model's defining long-context operator; it was previously bypassed.
- select_blocks: score Q.K^T on already-rotated states (no dedicated indexer-K cache), pick top-k blocks per query, force-keep local blocks.
- build_block_maskout: return a BOOLEAN mask-out grid (True = masked) that folds into the boolean causal mask consumed by qeff_eager_attention_forward. Chosen over upstream's additive finfo.min bias, which folds to fp16 -inf and yields 0 * -inf = NaN under -convert-to-fp16.
- Register MiniMaxM3VLIndexer -> QEffMiniMaxM3VLIndexer (import + KVCacheTransform).
- Tiny test config sets layer_types=["minimax_m3_sparse","full_attention"] + indexer dims so both the sparse and dense paths are exercised.
- _layerwise.py: minimax window hooks (frozenset entry, _set_layer_windows, window patches, skip_vision on per-window export).
CPU parity with the indexer active: MoE block dInf=1.5e-8, RMSNorm=0, full-attn layer=0.064, sparse-attn layer (indexer running real top-k)=0.054 (matches the dense eager-vs-SDPA tolerance).
ORT/QPC re-run on a sparse dummy is still outstanding before validated-hw.
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
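The block-selection and boolean mask-out idea above can be illustrated in pure Python for a single query position. This is a rough sketch, not the QEffMiniMaxM3VLIndexer implementation: the function bodies, their signatures, the per-block scores, and the block size are all assumptions made for illustration; only the names select_blocks and build_block_maskout and the True = masked convention come from the commit message.

```python
import math


def select_blocks(block_scores, q_pos, block_size, top_k, num_local=1):
    """Pick key blocks for one query: top-k by score, plus forced local blocks."""
    q_block = q_pos // block_size
    visible = list(range(q_block + 1))  # causal: blocks at or before the query's
    ranked = sorted(visible, key=lambda b: block_scores[b], reverse=True)
    keep = set(ranked[:top_k])
    # Force-keep the most recent `num_local` blocks regardless of score.
    keep.update(b for b in visible if q_block - b < num_local)
    return keep


def build_block_maskout(keep, q_pos, num_keys, block_size):
    """Boolean mask-out row (True = masked), folding in the causal constraint."""
    return [(k > q_pos) or (k // block_size not in keep) for k in range(num_keys)]


# Assumed per-block scores for one query; 4 key blocks of 2 keys each.
scores = [0.1, 0.9, 0.3, 0.5]

keep_last = select_blocks(scores, q_pos=7, block_size=2, top_k=2)
mask_last = build_block_maskout(keep_last, q_pos=7, num_keys=8, block_size=2)

keep_mid = select_blocks(scores, q_pos=5, block_size=2, top_k=2)
mask_mid = build_block_maskout(keep_mid, q_pos=5, num_keys=8, block_size=2)

# Why a boolean mask rather than an additive bias: once the bias overflows to
# -inf, multiplying it by a zero produces NaN instead of a clean "masked" value.
additive_hazard = 0.0 * float("-inf")
hazard_is_nan = math.isnan(additive_hazard)
```

A boolean grid defers the choice of fill value to the attention kernel, which can then select a finite value for masked positions instead of adding a bias that may overflow.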
Add MiniMax-M3-VL (MiniMaxAI/MiniMax-M3)
HF class: MiniMaxM3SparseForConditionalGeneration · requires transformers >= 5.11.0
444B total / 22.5B active · 60 layers · GQA (partial RoPE) · 128-expert top-4 MoE + 1 shared · MiniMax Sparse Attention (1M ctx) · CLIP ViT vision (3D-RoPE + patch-merge) · Native multimodal VLM
What's in this PR
- New modeling file (743 LOC, 11 QEff classes) covering partial-RoPE static cache, ONNX-friendly eager attention, fused BMM MoE with swigluoai activation, vision 3D-RoPE + patch-merge, and the encoder/decoder wrapper split for two-phase VLM export
- Transform registrations: 9 KVCacheTransform class pairs + MiniMaxM3VLRMSNorm → GemmaCustomRMSNormAIC
- ONNX-trace fixes: finite mask values via torch.where; no-ellipsis einsum; positive-axis unsqueeze
- Transformers pin bump: 5.10.2 → 5.12.1
Validation
CPU + Hardware parity (AI 100, fp16): HF == QEff == ORT == QPC ✅ (token 1110 across all 4 edges)
Tiny-random model: shagunsd/tiny-random-MiniMaxM3SparseForConditionalGeneration (full pipeline exercised end-to-end, including MoE routing, partial RoPE, and vision patch-merge).
Performance (roofline)
TS16 minimum fit · mxfp6 + mxint8 KV · disaggregated compile · 144.9 tok/s roofline (bs=1, ctx=1024)
Huge-tier handoff — full-weight compile is user-side.