Skip to content

[AMD] Add MiniMax-M3-MXFP4 MI355X single-node vLLM recipe#1936

Closed
andyluo7 wants to merge 2 commits into
mainfrom
feat/minimaxm3-fp4-mi355x-vllm-singlenode
Closed

[AMD] Add MiniMax-M3-MXFP4 MI355X single-node vLLM recipe#1936
andyluo7 wants to merge 2 commits into
mainfrom
feat/minimaxm3-fp4-mi355x-vllm-singlenode

Conversation

@andyluo7

Copy link
Copy Markdown
Collaborator

Single-node vLLM benchmark for amd/MiniMax-M3-MXFP4 on MI355X (gfx950), served via the AITER MoE backend. Complements the FP4 disagg recipe (#1914) and the FP4 ATOM recipe (minimaxm3-fp4-mi355x-atom) with a plain single-node vLLM path.

Recipe

  • benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x.sh: launcher. Mirrors the MXFP8 single-node script (minimaxm3_fp8_mi355x.sh) — block-size 128 (MSA), TRITON_ATTN, --language-model-only, --no-enable-prefix-caching, minimax_m3 tool/reasoning parsers — with the FP4 adjustments: --moe-backend aiter + VLLM_ROCM_USE_AITER{,_MOE}=1, and no --kv-cache-dtype fp8 (this checkpoint ships no calibrated KV scales).
  • amd-master.yaml: minimaxm3-fp4-mi355x-vllm config, image rocm/vllm-dev:vllm-0.23.1-rocm723-mi35x-mori-0625 (same proven MXFP4 image as [AMD] Add MiniMax-M3-MXFP4 MI355X vLLM disagg recipe #1914).

Search space

Pure TP, 1k/1k and 8k/1k, conc 1–256:

  • TP8 — low-concurrency / low-latency regime, conc 1–32.
  • TP4 — full sweep, conc 1–256.

Validation

Validated single-node on 8x MI355X (gfx950), TP=4, with this exact image (vLLM 0.23.1): vllm serve amd/MiniMax-M3-MXFP4 … reaches Application startup complete, engine reports quantization=quark / moe_backend=aiter, and chat completions return coherent output with the minimax_m3 reasoning parser splitting reasoning from content. I also confirmed that --kv-cache-dtype fp8 does not crash on vLLM (unlike ATOM) but falls back to an uncalibrated KV scale of 1.0 with an accuracy warning — hence the default KV dtype here.

Upstream recipe alignment: vllm-project/recipes#579 adds the matching mxfp4 variant to the official MiniMax-M3 recipe.

Test plan

  • generate_sweep_configs full-sweep --single-node … --framework vllm expands to the expected 30-job matrix (TP8 1–32 + TP4 1–256 at 1k1k/8k1k)
  • CI single-node MI355X sweep passes

Single-node vLLM benchmark for amd/MiniMax-M3-MXFP4 on MI355X (gfx950),
served via the AITER MoE backend. Mirrors the MXFP8 single-node recipe
(block-size 128 for MSA, TRITON_ATTN, minimax_m3 parsers) with the FP4
adjustments validated on hardware: --moe-backend aiter + AITER env, and
no fp8 KV cache (the checkpoint ships no calibrated KV scales).

Search space: pure TP, conc 1-256. TP8 covers the low-concurrency /
low-latency regime (conc 1-32); TP4 carries the full sweep (conc 1-256).
Both 1k/1k and 8k/1k.

Upstream recipe: https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3 (mxfp4 variant).
@functionstackx

Copy link
Copy Markdown
Collaborator

@andyluo7 i have this PR open with upstream vllm #1935

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — straightforward FP4 sibling of the existing MXFP8 MI355X single-node recipe.

Extended reasoning...

Overview

This PR adds a new vLLM single-node benchmark recipe for amd/MiniMax-M3-MXFP4 on MI355X. The changes are confined to three files: a new config entry in amd-master.yaml, a new launcher script in benchmarks/single_node/fixed_seq_len/, and a corresponding entry in perf-changelog.yaml. The launcher is a near-clone of the sibling MXFP8 script (minimaxm3_fp8_mi355x.sh) with the three advertised FP4 deltas: --moe-backend aiter, VLLM_ROCM_USE_AITER{,_MOE}=1 env vars, and removal of --kv-cache-dtype fp8 (the checkpoint has no calibrated KV scales).

Security risks

None. This is benchmark plumbing — YAML config, a launcher shell script that invokes vllm serve with a fixed flag set, and a changelog entry. No auth, crypto, network exposure, or user-facing code paths.

Level of scrutiny

Low. Recipe additions like this follow a well-established template across many sibling files in this directory (the minimaxm3_fp*_*.sh family), and the diff matches that template byte-for-byte except for the deliberate FP4 differences. The author validated the configuration locally on 8x MI355X TP=4 with this exact image. CI sweep validation is still pending per the test plan checkbox, but that gates merge separately from this review.

Other factors

The image (rocm/vllm-dev:vllm-0.23.1-rocm723-mi35x-mori-0625) is the same one already in use by the FP4 disagg recipe (#1914), so no new image surface is being introduced. The search space (TP8 low-conc + TP4 full sweep) and concurrency ranges are consistent with neighboring FP4 recipes.

@andyluo7

Copy link
Copy Markdown
Collaborator Author

Closing in favor of #1935 (functionstackx), which uses the public upstream nightly image with merged Quark MXFP4 support (vllm-project/vllm#45794) — better for upstream recipe alignment. I've left my hardware-validation findings (notably the missing VLLM_USE_BREAKABLE_CUDAGRAPH=0 and the explicit AITER MoE backend) as a comment on #1935 to fold in. Upstream recipe variant: vllm-project/recipes#579.

@andyluo7 andyluo7 closed this Jun 25, 2026
@andyluo7 andyluo7 deleted the feat/minimaxm3-fp4-mi355x-vllm-singlenode branch June 25, 2026 20:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants