Feature/add minimax m3 vl by shagsood · Pull Request #1096 · quic/efficient-transformers

shagsood · 2026-06-18T09:41:04Z

Add MiniMax-M3-VL (MiniMaxAI/MiniMax-M3)

HF class: MiniMaxM3SparseForConditionalGeneration · requires transformers >= 5.11.0

444B total / 22.5B active · 60 layers · GQA (partial RoPE) · 128-expert top-4 MoE + 1 shared ·
MiniMax Sparse Attention (1M ctx) · CLIP ViT vision (3D-RoPE + patch-merge) · Native multimodal VLM

What's in this PR

New modeling file (743 LOC, 11 QEff classes) covering partial-RoPE static cache, ONNX-friendly
eager attention, fused BMM MoE with swigluoai activation, vision 3D-RoPE + patch-merge,
encoder/decoder wrapper split for two-phase VLM export

Transform registrations: 9 KVCacheTransform class pairs + MiniMaxM3VLRMSNorm → GemmaCustomRMSNormAIC

ONNX-trace fixes: finite mask values via torch.where; no-ellipsis einsum; positive-axis unsqueeze

Transformers pin bump: 5.10.2 → 5.12.1

Validation

CPU + Hardware parity (AI 100, fp16): HF == QEff == ORT == QPC ✅ (token 1110 across all 4 edges)

Tiny-random model: shagunsd/tiny-random-MiniMaxM3SparseForConditionalGeneration
— full pipeline exercised end-to-end including MoE routing, partial-RoPE, vision patch-merge.

Performance (roofline)

TS16 minimum fit · mxfp6 + mxint8 KV · disaggregated compile · 144.9 tok/s roofline (bs=1, ctx=1024)
Huge-tier handoff — full-weight compile is user-side.

Trigger: deepseek-ai/DeepSeek-V4-Pro requires transformers.models.deepseek_v4, which first appeared in transformers 5.10.2.

Fixes applied:
- B9 (Gemma3 drift): `logger` is no longer exported from transformers.models.gemma3.modeling_gemma3; replaced with logging.getLogger(__name__).

Regression test: 121 passed, 3 skipped, 0 failures (test_model_quickcheck.py full suite on isolated venv).

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

Trigger: google/diffusiongemma-26B-A4B-it requires transformers >= 5.11.0 for transformers.models.diffusion_gemma support.

Test evidence (isolated venv qeff_env_upgrade_5.11.0):
- Per-model import sweep: 0 failures across all QEfficient/transformers/models/*/
- Quickcheck (tests/unit_test/models/test_model_quickcheck.py): 107 passed, 17 skipped, 0 failed in 3:00
- Causal LM CPU parity (HF vs QEff vs ORT tokens): codegen, falcon, gpt2, gpt_oss, gptj, granite, llama, mistral, mixtral, mpt, olmo2, phi3, phi, qwen2, starcoder2: all PASS
- Subfunction export smoke (15 archs): PASS
- VLM export smoke (gemma3, qwen2_5_vl): PASS
- Whisper export smoke: PASS
- AWQ export smoke: PASS
- Audio CTC, sequence classification, text embedding: PASS

No code fixes required: 5.10.2 → 5.11.0 is a clean minor bump (no R1-R5 / B1-B9 hits).

Cloud AI 100 stack impact: none. No MXFP6/MXINT8, KV cache, or ONNX surface changes.

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

(see .archon/artifacts/runs/minimax_m3_vl/fix-spine-state.md)

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

…rward

Two correctness bugs surfaced by the tiny-dummy CPU parity test (HF == QEff == ORT all green at max|delta| = 1e-6 after these fixes):

B1 (RoPE shape): TextModel was pre-unsqueezing cos/sin to [B,1,S,d] before calling apply_rotary_pos_emb, but upstream's apply_rotary_pos_emb does its own unsqueeze (unsqueeze_dim=1). That made cos 5D against q's 4D and crashed with 'Tensors must have same number of dimensions'. Pass cos/sin in [B,S,head_dim] so the upstream broadcast works as documented.

B2 (causal mask): target_length was bare past_seen_tokens, which is 0 on the first eager forward and produced an empty kv-axis (arange(0,0)). Use attention_mask.shape[-1] when the caller supplies an SDPA-shaped mask, else past_seen_tokens + current seq_len. Mirrors the qwen3_vl_moe pattern.

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
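The B1 shape mismatch can be checked with plain shape arithmetic. The following is a minimal sketch that models tensor shapes as tuples rather than calling the real library; `unsqueeze` and `rotary_cos_shape` are hypothetical helpers written for this illustration, and the only behaviour assumed is the one the commit message states (the rotary helper inserts one axis at `unsqueeze_dim=1`).

```python
def unsqueeze(shape, dim):
    """Shape a tensor would have after inserting a size-1 axis at `dim`."""
    shape = tuple(shape)
    return shape[:dim] + (1,) + shape[dim:]


def rotary_cos_shape(cos_shape, unsqueeze_dim=1):
    """Hypothetical stand-in for the unsqueeze done inside the rotary helper."""
    return unsqueeze(cos_shape, unsqueeze_dim)


B, H, S, D = 1, 4, 8, 16          # batch, heads, sequence, head_dim (illustrative sizes)
q_shape = (B, H, S, D)            # query states are 4D

# Buggy path: the caller already reshaped cos to [B, 1, S, d], so the helper adds a fifth axis.
buggy = rotary_cos_shape((B, 1, S, D))

# Fixed path: the caller passes [B, S, head_dim] and the helper produces [B, 1, S, head_dim].
fixed = rotary_cos_shape((B, S, D))

print("buggy cos rank:", len(buggy), "vs q rank:", len(q_shape))
print("fixed cos shape:", fixed)
```

With the fixed path, cos has the same rank as q and its size-1 axis broadcasts over the head dimension.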

…ask target_length

Production-path qaic-compile bring-up surfaced three more issues, all fixed, with full HF == QEff(PT) == QPC parity now green on a tiny dummy (52M params, fp16 KV on AI 100):

S1 (specs vision symbols): get_specializations didn't bind num_patches or num_images, but get_onnx_dynamic_axes named pixel_values dim 0 = num_patches and image_grid_thw dim 0 = num_images. qaic-compile rejected the ONNX with 'symbol num_patches is undefined'. Bound both in the vision/lang specs as image_seq_length * spatial_merge^2 * max_num_images and max_num_images.

S2 (CCL kwarg leak): comp_ctx_lengths_prefill / comp_ctx_lengths_decode default to None and were passed verbatim into compiler_options, which qaic-compile then received as the literal string '-comp-ctx-lengths-prefill= None' and rejected with 'Invalid option'. Pop them at the top of get_specializations before the **compiler_options leak path.

M2 (causal-mask kv-axis): target_length was past_seen_tokens + seq_len, which double-counts on the export trace where past_kv comes pre-allocated at prefill_seq_len: past_seen=32, seq_len=32 → mask kv-axis=64, but k_out after QEffDynamicCache.update is 32 (scatter, not concat) → broadcast crash. target_length should be max(past_seen, seq_len), the actual kv-buffer extent.

Tested on dummy (1×8×8 grid, 52M params, fp16 QPC):
  HF       top1=1110  max=1.0359
  QEff PT  top1=1110  max=1.0359  (max|Δ| vs HF = 1.0e-6)
  QPC      top1=1110  max=1.0361  (max|Δ| vs HF = 2.0e-3, fp16 noise)

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
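The S2 and M2 fixes are both small enough to sketch in isolation. The code below is an illustration under stated assumptions, not the code in this PR: `split_ccl_options`, `to_flags`, and `mask_kv_length` are hypothetical names, `example_option` is a made-up compiler option, and the flag rendering in `to_flags` is only a guess at how keyword options might be turned into command-line strings.

```python
def split_ccl_options(compiler_options):
    """Pop the two CCL kwargs so they cannot leak into the generic option dict."""
    opts = dict(compiler_options)  # copy, so the caller's dict is untouched
    prefill = opts.pop("comp_ctx_lengths_prefill", None)
    decode = opts.pop("comp_ctx_lengths_decode", None)
    return opts, prefill, decode


def to_flags(compiler_options):
    """Hypothetical rendering of keyword options as '-name=value' strings."""
    return [f"-{key.replace('_', '-')}={value}" for key, value in compiler_options.items()]


def mask_kv_length(past_seen_tokens, seq_len):
    """kv-axis extent of the causal mask when the cache is scattered into a
    pre-allocated buffer (not concatenated), as described for M2."""
    return max(past_seen_tokens, seq_len)


raw = {
    "comp_ctx_lengths_prefill": None,
    "comp_ctx_lengths_decode": None,
    "example_option": 16,  # made-up option for illustration
}

leaky_flags = to_flags(raw)                    # contains '-comp-ctx-lengths-prefill=None'
clean_opts, prefill, decode = split_ccl_options(raw)
clean_flags = to_flags(clean_opts)             # only '-example-option=16'

print(leaky_flags)
print(clean_flags)
print(mask_kv_length(32, 32))  # export trace: buffer pre-allocated at prefill_seq_len
print(mask_kv_length(0, 32))   # first eager forward: nothing cached yet
```

Summing the two lengths in the export-trace case would give 64, which no longer matches the 32-wide key buffer; taking the maximum keeps the mask and the buffer the same width in both cases shown.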

Upload `shagunsd/tiny-random-MiniMaxM3SparseForConditionalGeneration` (106M params, seed=0) to HF Hub so the parametrized harness is reproducible on any machine. Switch `image_text_model_configs.json` from the full 444B `MiniMaxAI/MiniMax-M3` to the Hub-hosted tiny. Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

…rse operator)

Graft QEffMiniMaxM3VLIndexer onto the modeling file so sparse layers (layer_types[i] == "minimax_m3_sparse") run top-k key-block selection instead of falling back to a dense causal mask. This is the model's defining long-context operator; it was previously bypassed.

- select_blocks: score Q.K^T on already-rotated states (no dedicated indexer-K cache), pick top-k blocks per query, force-keep local blocks.
- build_block_maskout: return a BOOLEAN mask-out grid (True = masked) that folds into the boolean causal mask consumed by qeff_eager_attention_forward. Chosen over upstream's additive finfo.min bias, which folds to fp16 -inf and yields 0 * -inf = NaN under -convert-to-fp16.
- Register MiniMaxM3VLIndexer -> QEffMiniMaxM3VLIndexer (import + KVCacheTransform).
- Tiny test config sets layer_types=["minimax_m3_sparse","full_attention"] + indexer dims so both the sparse and dense paths are exercised.
- _layerwise.py: minimax window hooks (frozenset entry, _set_layer_windows, window patches, skip_vision on per-window export).

CPU parity with the indexer active: MoE block dInf=1.5e-8, RMSNorm=0, full-attn layer=0.064, sparse-attn layer (indexer running real top-k)=0.054 (matches the dense eager-vs-SDPA tolerance).

ORT/QPC re-run on a sparse dummy is still outstanding before validated-hw.

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
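The block-selection idea and the reason for a boolean mask-out can be sketched in pure Python for a single query. This is a simplified illustration, not the QEffMiniMaxM3VLIndexer code: the function names are reused from the commit message but the bodies are hypothetical, block scores are given directly instead of being computed from Q.K^T, and "local blocks" is assumed to mean the most recent key blocks.

```python
import math


def select_blocks(block_scores, top_k, local_blocks):
    """Return the sorted indices of key blocks kept for one query.

    Keeps the `top_k` highest-scoring blocks and always keeps the last
    `local_blocks` blocks (assumed here to be the local ones).
    """
    n = len(block_scores)
    ranked = sorted(range(n), key=lambda i: block_scores[i], reverse=True)
    keep = set(ranked[:top_k])
    keep.update(range(max(0, n - local_blocks), n))
    return sorted(keep)


def build_block_maskout(keep, num_blocks):
    """Boolean mask-out row: True means the block is masked (not attended)."""
    kept = set(keep)
    return [i not in kept for i in range(num_blocks)]


scores = [0.1, 0.9, 0.5, 0.2, 0.05, 0.3]   # one score per key block (made-up values)
keep = select_blocks(scores, top_k=2, local_blocks=1)
block_maskout = build_block_maskout(keep, len(scores))

# Folding into a boolean causal mask is a plain logical OR per position.
causal_maskout = [False] * len(scores)     # every block is causally visible in this toy case
combined = [c or b for c, b in zip(causal_maskout, block_maskout)]

# Why an additive -inf bias is risky: multiplying it by zero is NaN, not zero.
nan_product = 0.0 * float("-inf")

print(keep)
print(combined)
print(math.isnan(nan_product))
```

Because the mask stays boolean until the attention function applies it, no infinite value ever enters the arithmetic, which is the property the commit message relies on.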

shagsood added 9 commits June 17, 2026 09:54

Bump transformers pin: 5.11.0 → 5.12.1

3eb1d31

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

perf(minimax_m3_vl): apply kept performance fixes from perf-fix loop

81370b2

(see .archon/artifacts/runs/minimax_m3_vl/fix-spine-state.md)

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

minimax_m3_vl: add get_submodules_for_export + tiny test config

19f3043

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>

Merge branch 'quic:main' into feature/add-minimax_m3_vl

beb0828

shagsood marked this pull request as draft June 18, 2026 09:41

shagsood wants to merge 10 commits into quic:main from shagsood:feature/add-minimax_m3_vl
