Updated example file for ai200 runs #1127

Open
asmigosw wants to merge 3 commits into quic:minimax-m3-layerwise-onboarding-qeff from asmigosw:minimax-m3-layerwise-onboarding-qeff

@asmigosw asmigosw commented Jun 25, 2026

Overview

Adds a production-ready example demonstrating MiniMax-M3 VLM decode-only inference on AI200 servers with replicate KV-head optimization.

What's Updated

📄 Files

  • examples/text_generation/minimax_m3_decode_only.py - Main inference script
  • examples/text_generation/README_MINIMAX_M3_AI200.md - Comprehensive documentation

✨ Key Features

  • Replicate KV-head optimization (num_replicate_kv_heads: 8) for AI200 hardware
  • Decode-only mode (PL=1) for efficient token-by-token generation
  • Mixed-precision computation with MXFP6 matmul and MXINT8 KV cache
  • Multi-device parallelism supporting up to 24 AI200 devices
  • Flexible CLI with comprehensive argument support

Usage

Quick Start

# Install dependencies
pip install transformers --upgrade

# Run with defaults
python examples/text_generation/minimax_m3_decode_only.py

# Custom configuration
python examples/text_generation/minimax_m3_decode_only.py \
    --ctx-len 2048 \
    --generation-len 64 \
    --prompt "Your prompt here"
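The three flags used above could be parsed along the following lines. This is a minimal sketch, not the script's actual parser: the flag names come from the commands above, while the default values, help strings, and the rule that the generation length must fit inside the context length are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the Quick Start commands."""
    parser = argparse.ArgumentParser(
        description="MiniMax-M3 decode-only example (illustrative sketch)"
    )
    # Flag names are taken from the PR description; defaults are placeholders.
    parser.add_argument("--ctx-len", type=int, default=2048,
                        help="Maximum context length in tokens")
    parser.add_argument("--generation-len", type=int, default=64,
                        help="Number of tokens to generate")
    parser.add_argument("--prompt", type=str, default="Your prompt here",
                        help="Input prompt text")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse and sanity-check arguments; argv=None falls back to sys.argv."""
    args = build_parser().parse_args(argv)
    if args.ctx_len <= 0 or args.generation_len <= 0:
        raise ValueError("--ctx-len and --generation-len must be positive")
    # Assumption: generated tokens have to fit inside the context window.
    if args.generation_len > args.ctx_len:
        raise ValueError("--generation-len must not exceed --ctx-len")
    return args
```

Calling `parse_args(["--ctx-len", "1024", "--generation-len", "32"])` yields a namespace with `ctx_len` and `generation_len` attributes, since argparse converts the dashes in flag names to underscores.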

Replicate KV-Head Configuration

# Applied in model initialization and compilation
qaic_config={"num_replicate_kv_heads": 8}
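To illustrate what a value such as 8 implies, the sketch below checks a replication factor against a model's head counts. It assumes that `num_replicate_kv_heads` is a multiplier on the original KV-head count and that the replicated count must still divide the number of attention heads; both are assumptions, not confirmed behaviour of the transform. The function name `resolve_replicated_kv_heads` and the head counts in the usage note are hypothetical.

```python
def resolve_replicated_kv_heads(
    num_attention_heads: int,
    num_key_value_heads: int,
    num_replicate_kv_heads: int,
) -> int:
    """Return the KV-head count after replication, or raise ValueError.

    Hypothetical helper: treats num_replicate_kv_heads as a multiplier.
    """
    # Reject bools explicitly, since bool is a subclass of int in Python.
    if isinstance(num_replicate_kv_heads, bool) or not isinstance(num_replicate_kv_heads, int):
        raise ValueError("num_replicate_kv_heads must be an integer")
    if num_replicate_kv_heads < 1:
        raise ValueError("num_replicate_kv_heads must be >= 1")
    if num_key_value_heads < 1 or num_attention_heads % num_key_value_heads != 0:
        raise ValueError("attention heads must be a multiple of KV heads")

    replicated = num_key_value_heads * num_replicate_kv_heads
    # Assumption: every replicated KV head must serve a whole number of query heads.
    if replicated > num_attention_heads or num_attention_heads % replicated != 0:
        raise ValueError(
            f"{replicated} replicated KV heads do not evenly divide "
            f"{num_attention_heads} attention heads"
        )
    return replicated
```

For example, with illustrative head counts of 64 attention heads and 8 KV heads, a factor of 8 resolves to 64 KV heads (one per query head), while a factor of 3 is rejected.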

quic-dhirajku and others added 3 commits June 25, 2026 13:59
…untime/test plumbing.

  - Reworked ReplicateKVHeadTransform into a mutator-style flow with strict config validation, idempotent re-apply
    handling, and encoder-wrapper skip (EncoderWrapper class-name guard).
  - Replaced KV projection duplication branching with internal factory dispatch (repeat_kv_projection_dispatch.py) and
    routed all duplication through shared repeat_kv_utils.
  - Added shared config utilities/constants to resolve head/hidden keys across model families and compute/apply
    num_replicate_kv_heads.
  - Plumbed num_replicate_kv_heads through auto-model wrappers/export paths for CausalLM and ImageTextToText.
  - Updated tests/helpers for RepeatKV coverage (causal + image-text paths, plus fast unit checks), and renamed the
    example/docs QAIC knob from num_kv_heads_repeat to num_replicate_kv_heads.
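The KV projection duplication described in this commit message can be pictured with a small pure-Python sketch. It is not the code in repeat_kv_utils: the helper names `repeat_kv_rows` and `kv_head_for_query` are hypothetical, and the layout (one block of `head_dim` weight rows per KV head, each block repeated consecutively) is an assumption chosen so that the standard grouped-query mapping still pairs every query head with the same weights.

```python
from typing import List

Matrix = List[List[float]]


def repeat_kv_rows(weight: Matrix, num_kv_heads: int, head_dim: int, repeat: int) -> Matrix:
    """Duplicate each KV head's block of rows `repeat` times, consecutively."""
    if len(weight) != num_kv_heads * head_dim:
        raise ValueError("weight must have num_kv_heads * head_dim rows")
    if repeat < 1:
        raise ValueError("repeat must be >= 1")
    out: Matrix = []
    for head in range(num_kv_heads):
        block = weight[head * head_dim:(head + 1) * head_dim]
        for _ in range(repeat):
            # Copy rows so the replicated heads do not alias the originals.
            out.extend(list(row) for row in block)
    return out


def kv_head_for_query(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Grouped-query mapping: consecutive query heads share one KV head."""
    return q_head // (num_q_heads // num_kv_heads)


# Toy example: 2 KV heads, head_dim 2, hidden size 3, replicated 2x.
toy_weight = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
replicated = repeat_kv_rows(toy_weight, num_kv_heads=2, head_dim=2, repeat=2)
```

With consecutive repetition, query head `q` maps to replicated KV head `q // (num_q_heads // (num_kv_heads * repeat))`, which holds the same rows as its original KV head, so the attention result is unchanged while the KV-head count grows.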

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>