Skip to content

Feature/add minimax m3 vl#1096

Draft
shagsood wants to merge 10 commits into
quic:mainfrom
shagsood:feature/add-minimax_m3_vl
Draft

Feature/add minimax m3 vl#1096
shagsood wants to merge 10 commits into
quic:mainfrom
shagsood:feature/add-minimax_m3_vl

Conversation

@shagsood

Copy link
Copy Markdown

Add MiniMax-M3-VL (MiniMaxAI/MiniMax-M3)

HF class: MiniMaxM3SparseForConditionalGeneration · requires transformers >= 5.11.0

444B total / 22.5B active · 60 layers · GQA (partial RoPE) · 128-expert top-4 MoE + 1 shared ·
MiniMax Sparse Attention (1M ctx) · CLIP ViT vision (3D-RoPE + patch-merge) · Native multimodal VLM

What's in this PR

New modeling file (743 LOC, 11 QEff classes) covering partial-RoPE static cache, ONNX-friendly
eager attention, fused BMM MoE with swigluoai activation, vision 3D-RoPE + patch-merge,
encoder/decoder wrapper split for two-phase VLM export

Transform registrations: 9 KVCacheTransform class pairs + MiniMaxM3VLRMSNormGemmaCustomRMSNormAIC

ONNX-trace fixes: finite mask values via torch.where; no-ellipsis einsum; positive-axis unsqueeze

Transformers pin bump: 5.10.2 → 5.12.1

Validation

CPU + Hardware parity (AI 100, fp16): HF == QEff == ORT == QPC ✅ (token 1110 across all 4 edges)

Tiny-random model: shagunsd/tiny-random-MiniMaxM3SparseForConditionalGeneration
— full pipeline exercised end-to-end including MoE routing, partial-RoPE, vision patch-merge.

Performance (roofline)

TS16 minimum fit · mxfp6 + mxint8 KV · disaggregated compile · 144.9 tok/s roofline (bs=1, ctx=1024)
Huge-tier handoff — full-weight compile is user-side.

shagsood added 9 commits June 17, 2026 09:54
Trigger: deepseek-ai/DeepSeek-V4-Pro requires transformers.models.deepseek_v4
which first appeared in transformers 5.10.2.

Fixes applied:
- B9 (Gemma3 drift): `logger` no longer exported from
  transformers.models.gemma3.modeling_gemma3 — replaced with
  logging.getLogger(__name__)

Regression test: 121 passed, 3 skipped, 0 failures
(test_model_quickcheck.py full suite on isolated venv)

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
Trigger: google/diffusiongemma-26B-A4B-it requires transformers >= 5.11.0
for transformers.models.diffusion_gemma support.

Test evidence (isolated venv qeff_env_upgrade_5.11.0):
- Per-model import sweep: 0 failures across all QEfficient/transformers/models/*/
- Quickcheck (tests/unit_test/models/test_model_quickcheck.py): 107 passed, 17 skipped, 0 failed in 3:00
  - Causal LM CPU parity (HF vs QEff vs ORT tokens): codegen, falcon, gpt2, gpt_oss, gptj,
    granite, llama, mistral, mixtral, mpt, olmo2, phi3, phi, qwen2, starcoder2 — all PASS
  - Subfunction export smoke (15 archs) — PASS
  - VLM export smoke (gemma3, qwen2_5_vl) — PASS
  - Whisper export smoke — PASS
  - AWQ export smoke — PASS
  - Audio CTC, sequence classification, text embedding — PASS

No code fixes required: 5.10.2 → 5.11.0 is a clean minor bump (no R1-R5 / B1-B9 hits).

Cloud AI 100 stack impact: none — no MXFP6/MXINT8, KV cache, or ONNX surface changes.

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
(see .archon/artifacts/runs/minimax_m3_vl/fix-spine-state.md)

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
…rward

Two correctness bugs surfaced by tiny-dummy CPU parity test (HF==QEff==ORT
all green at max|delta|=1e-6 after these fixes):

B1 (RoPE shape): TextModel was pre-unsqueezing cos/sin to [B,1,S,d] before
calling apply_rotary_pos_emb, but upstream's apply_rotary_pos_emb does its
own unsqueeze(unsqueeze_dim=1) — making cos 5D against q's 4D and crashing
with 'Tensors must have same number of dimensions'. Pass cos/sin in
[B,S,head_dim] so the upstream broadcast works as documented.

B2 (causal mask): target_length was bare past_seen_tokens, which is 0 on
the first eager forward and produced an empty kv-axis (arange(0,0)). Use
attention_mask.shape[-1] when caller supplies an SDPA-shaped mask, else
past_seen_tokens + current seq_len. Mirrors the qwen3_vl_moe pattern.

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
…ask target_length

Production-path qaic-compile bring-up surfaced three more issues, all fixed
with full HF==QEff(PT)==QPC parity now green on a tiny dummy (52M params,
fp16 KV on AI 100):

S1 (specs vision symbols): get_specializations didn't bind num_patches or
num_images, but get_onnx_dynamic_axes named pixel_values dim 0 = num_patches
and image_grid_thw dim 0 = num_images. qaic-compile rejected the ONNX with
'symbol num_patches is undefined'. Bound both in vision/lang specs as
image_seq_length * spatial_merge^2 * max_num_images and max_num_images.

S2 (CCL kwarg leak): comp_ctx_lengths_prefill / comp_ctx_lengths_decode
default to None and were passed verbatim into compiler_options, which
qaic-compile then received as the literal string '-comp-ctx-lengths-prefill=
None' and rejected with 'Invalid option'. Pop them at the top of
get_specializations before the **compiler_options leak path.

M2 (causal-mask kv-axis): target_length was past_seen_tokens + seq_len, which
double-counts on the export trace where past_kv comes pre-allocated at
prefill_seq_len: past_seen=32, seq_len=32 → mask kv-axis=64, but k_out after
QEffDynamicCache.update is 32 (scatter, not concat) → broadcast crash.
target_length should be max(past_seen, seq_len), the actual kv-buffer extent.

Tested on dummy (1×8×8 grid, 52M params, fp16 QPC):
  HF top1=1110 max=1.0359
  QEff PT top1=1110 max=1.0359 (max|Δ| vs HF = 1.0e-6)
  QPC top1=1110 max=1.0361 (max|Δ| vs HF = 2.0e-3, fp16 noise)

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
Upload `shagunsd/tiny-random-MiniMaxM3SparseForConditionalGeneration`
(106M params, seed=0) to HF Hub so the parametrized harness is
reproducible on any machine. Switch `image_text_model_configs.json`
from the full 444B `MiniMaxAI/MiniMax-M3` to the Hub-hosted tiny.

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
@shagsood shagsood marked this pull request as draft June 18, 2026 09:41
…rse operator)

Graft QEffMiniMaxM3VLIndexer onto the modeling file so sparse layers
(layer_types[i] == "minimax_m3_sparse") run top-k key-block selection instead of
falling back to a dense causal mask. This is the model's defining long-context
operator; it was previously bypassed.

- select_blocks: score Q.K^T on already-rotated states (no dedicated indexer-K
  cache), pick top-k blocks per query, force-keep local blocks.
- build_block_maskout: return a BOOLEAN mask-out grid (True = masked) that folds
  into the boolean causal mask consumed by qeff_eager_attention_forward. Chosen
  over upstream's additive finfo.min bias, which folds to fp16 -inf and yields
  0 * -inf = NaN under -convert-to-fp16.
- Register MiniMaxM3VLIndexer -> QEffMiniMaxM3VLIndexer (import + KVCacheTransform).
- Tiny test config sets layer_types=["minimax_m3_sparse","full_attention"] +
  indexer dims so both the sparse and dense paths are exercised.
- _layerwise.py: minimax window hooks (frozenset entry, _set_layer_windows,
  window patches, skip_vision on per-window export).

CPU parity with the indexer active: MoE block dInf=1.5e-8, RMSNorm=0,
full-attn layer=0.064, sparse-attn layer (indexer running real top-k)=0.054
(matches the dense eager-vs-SDPA tolerance). ORT/QPC re-run on a sparse dummy
is still outstanding before validated-hw.

Signed-off-by: shagsood <shagsood@qti.qualcomm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant