Updated example file for ai200 runs by asmigosw · Pull Request #1127 · quic/efficient-transformers

asmigosw · 2026-06-25T18:24:49Z

Overview

Adds a production-ready example demonstrating MiniMax-M3 VLM decode-only inference on AI200 servers with replicate KV-head optimization.

What's Updated

📄 Files

examples/text_generation/minimax_m3_decode_only.py - Main inference script
examples/text_generation/README_MINIMAX_M3_AI200.md - Comprehensive documentation

✨ Key Features

Replicate KV-head optimization (num_replicate_kv_heads: 8) for AI200 hardware
Decode-only mode (PL=1) for efficient token-by-token generation
Mixed-precision computation with MXFP6 matmul and MXINT8 KV cache
Multi-device parallelism supporting up to 24 AI200 devices
Command-line interface for the run configuration (for example --ctx-len, --generation-len, and --prompt)
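The mixed-precision item refers to OCP Microscaling (MX) formats, in which a block of elements shares one power-of-two scale. As a rough illustration of the MXINT8 idea used for the KV cache, here is a simplified sketch of block-scaled int8 quantization. It assumes a block size of 32 and int8 codes with 6 fractional bits. It is not the AI200 compiler's implementation: rounding, clamping, and scale selection there may differ, and MXFP6 (6-bit floating-point elements) is not shown.

```python
import math

BLOCK_SIZE = 32  # MX block size: 32 elements share one scale


def quantize_block_int8(block):
    """Encode one block as (shared exponent, int8 codes)."""
    amax = max((abs(x) for x in block), default=0.0)
    # Pick the shared power-of-two exponent so amax / 2**exp lies in [1, 2).
    exp = math.floor(math.log2(amax)) if amax > 0 else 0
    scale = 2.0 ** exp
    # Codes are fixed-point values with 6 fractional bits, clamped symmetrically.
    codes = [max(-127, min(127, round(x / scale * 64))) for x in block]
    return exp, codes


def dequantize_block_int8(exp, codes):
    """Reconstruct approximate values from a shared exponent and int8 codes."""
    scale = 2.0 ** exp
    return [c / 64 * scale for c in codes]


def quantize(values, block_size=BLOCK_SIZE):
    """Split a flat list into blocks and quantize each block independently."""
    return [quantize_block_int8(values[i:i + block_size])
            for i in range(0, len(values), block_size)]


exp, codes = quantize_block_int8([1.0, -0.5, 0.25, 0.0])
print(exp, codes)  # 0 [64, -32, 16, 0]
```

Because the scale is shared per block, one large outlier coarsens the resolution for the other elements in its block; that trade-off is what keeps per-element storage at 8 bits.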

Usage

Quick Start

# Install dependencies
pip install transformers --upgrade

# Run with defaults
python examples/text_generation/minimax_m3_decode_only.py

# Custom configuration
python examples/text_generation/minimax_m3_decode_only.py \
    --ctx-len 2048 \
    --generation-len 64 \
    --prompt "Your prompt here"
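Decode-only mode means every prompt token and every generated token costs one forward pass. The minimal, framework-agnostic sketch below shows what that implies for --ctx-len and --generation-len, assuming PL denotes the prefill sequence length. `step` is a hypothetical stand-in for one compiled single-token forward pass; this is not the script's actual code.

```python
from typing import Callable, List

# Hypothetical stand-in for one compiled forward pass over a single token:
# takes (token_id, position) and returns the next predicted token id.
StepFn = Callable[[int, int], int]


def decode_only_generate(prompt_ids: List[int], generation_len: int,
                         ctx_len: int, step: StepFn) -> List[int]:
    """Consume the prompt and generate, one token per forward pass (PL=1)."""
    if not prompt_ids:
        raise ValueError("prompt must contain at least one token")
    if len(prompt_ids) + generation_len > ctx_len:
        raise ValueError("prompt length + generation_len exceeds ctx_len")

    position = 0
    next_token = prompt_ids[0]
    # Feed the prompt token by token; only the prediction made after the
    # last prompt token is used to start generation.
    for token in prompt_ids:
        next_token = step(token, position)
        position += 1

    generated: List[int] = []
    while len(generated) < generation_len:
        generated.append(next_token)
        if len(generated) == generation_len:
            break  # skip a pass whose output would be discarded
        next_token = step(next_token, position)
        position += 1
    return generated
```

Under these assumptions a run costs len(prompt) + generation_len - 1 forward passes, and the prompt plus generated tokens must fit in ctx_len.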

Replicate KV-Head Configuration

# Applied in model initialization and compilation
qaic_config={"num_replicate_kv_heads": 8}
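Replicating KV heads duplicates the K/V projection output rows for each KV head, leaving the query projection untouched. Because the copies are identical, attention output is unchanged as long as each query head still lands on a copy of its original KV head. The sketch below checks that property. It assumes num_replicate_kv_heads acts as a per-head repeat factor (the PR text does not say whether it is a factor or a target head count), uses plain lists instead of tensors, and the 64/8 head counts are hypothetical, not MiniMax-M3's.

```python
def replicate_kv_heads(rows, num_kv_heads, head_dim, repeat):
    """Duplicate each KV head's block of K/V projection rows `repeat` times.

    `rows` holds the projection's output rows (num_kv_heads * head_dim of them);
    copies of the same head are kept adjacent so GQA grouping still lines up.
    """
    if len(rows) != num_kv_heads * head_dim:
        raise ValueError("expected num_kv_heads * head_dim rows")
    out = []
    for h in range(num_kv_heads):
        block = rows[h * head_dim:(h + 1) * head_dim]
        for _ in range(repeat):
            out.extend(list(row) for row in block)
    return out


def kv_head_for_query(q_head, num_q_heads, num_kv_heads):
    """Standard GQA mapping: consecutive query heads share one KV head."""
    return q_head // (num_q_heads // num_kv_heads)


def grouping_preserved(num_q_heads, num_kv_heads, repeat):
    """Check every query head still reads a copy of its original KV head."""
    return all(
        kv_head_for_query(q, num_q_heads, num_kv_heads * repeat) // repeat
        == kv_head_for_query(q, num_q_heads, num_kv_heads)
        for q in range(num_q_heads)
    )


print(replicate_kv_heads([[0], [1], [2], [3]], num_kv_heads=2, head_dim=2, repeat=2))
# [[0], [1], [0], [1], [2], [3], [2], [3]]
print(grouping_preserved(num_q_heads=64, num_kv_heads=8, repeat=8))  # True
```

Keeping copies of the same head adjacent is the design choice that makes the simple integer-division GQA mapping keep working after replication.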

…untime/test plumbing.

- Reworked ReplicateKVHeadTransform into a mutator-style flow with strict config validation, idempotent re-apply handling, and encoder-wrapper skip (EncoderWrapper class-name guard).
- Replaced KV projection duplication branching with internal factory dispatch (repeat_kv_projection_dispatch.py) and routed all duplication through shared repeat_kv_utils.
- Added shared config utilities/constants to resolve head/hidden keys across model families and compute/apply num_replicate_kv_heads.
- Plumbed num_replicate_kv_heads through auto-model wrappers/export paths for CausalLM and ImageTextToText.
- Updated tests/helpers for RepeatKV coverage (causal + image-text paths, plus fast unit checks), and renamed the example/docs QAIC knob from num_kv_heads_repeat to num_replicate_kv_heads.

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
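The commit message mentions strict config validation, idempotent re-apply handling, and utilities that resolve head keys across model families. A minimal sketch of how those ideas can fit together on a plain dict config follows. KV_HEAD_KEYS, APPLIED_MARKER, and both function names are hypothetical and do not reproduce ReplicateKVHeadTransform or repeat_kv_utils.

```python
# Hypothetical list of config keys that different model families use for the
# KV-head count; the real utilities may resolve a different set.
KV_HEAD_KEYS = ("num_key_value_heads", "num_kv_heads", "n_kv_heads")
APPLIED_MARKER = "_num_replicate_kv_heads_applied"  # hypothetical bookkeeping key


def resolve_kv_head_key(config):
    """Return the first KV-head key present in the config dict."""
    for key in KV_HEAD_KEYS:
        if key in config:
            return key
    raise KeyError("config has no recognized KV-head key")


def apply_num_replicate_kv_heads(config, repeat):
    """Multiply the KV-head count once; re-applying the same factor is a no-op."""
    # Strict validation: bool is a subclass of int, so reject it explicitly.
    if isinstance(repeat, bool) or not isinstance(repeat, int) or repeat < 1:
        raise ValueError("num_replicate_kv_heads must be a positive integer")
    key = resolve_kv_head_key(config)
    previous = config.get(APPLIED_MARKER)
    if previous is not None:
        if previous != repeat:
            raise ValueError(f"already replicated by {previous}; refusing {repeat}")
        return config  # idempotent re-apply
    config[key] *= repeat
    config[APPLIED_MARKER] = repeat
    return config


cfg = {"num_key_value_heads": 8}
apply_num_replicate_kv_heads(cfg, 8)
apply_num_replicate_kv_heads(cfg, 8)  # second call changes nothing
print(cfg["num_key_value_heads"])  # 64
```

The function mutates the dict in place and returns it for convenience; recording the applied factor is what lets a second call with the same value be a no-op while a conflicting value is rejected.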

quic-dhirajku and others added 3 commits June 25, 2026 13:59

Updated example file for ai200 runs

c9a5216

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added README

71115fe

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

asmigosw wants to merge 3 commits into quic:minimax-m3-layerwise-onboarding-qeff from asmigosw:minimax-m3-layerwise-onboarding-qeff
