[codex] add MiniMax M3 FP4 MI355X vLLM MTP benchmark#1939
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.
If additional help is needed, PR authors can reach out to core maintainers over Slack.
aad5226 to a9d0e82
Claude finished @functionstackx's task in 1m 39s
Review: PR #1939, MiniMax-M3 FP4 MI355X vLLM MTP benchmark
LGTM — no blocking issues found. The new script is a clean derivative of Validation against the repo's review checklist:
Note: the FP4 nightly is claimed to already include AMD's |
LGTM — straightforward MiniMax-M3 MXFP4 MTP recipe addition mirroring the MXFP8 sister entry.
Extended reasoning...
Overview
This PR adds the EAGLE3 speculative-decoding (MTP) variant of the existing minimaxm3-fp4-mi355x-vllm recipe. Three changes: a new config entry in amd-master.yaml (mirrors the MXFP8 MTP search space, trimmed at extreme concurrency), a new launcher script minimaxm3_fp4_mi355x_vllm_mtp.sh (text-only serving, TRITON_ATTN, EAGLE3 with 3 spec tokens, --use-chat-template), and the standard perf-changelog entry.
Security risks
None. This is a benchmark config + launcher script — no auth, crypto, network-exposed code paths, or production logic. The launcher runs inside the benchmark container.
Level of scrutiny
Low. Pure additive benchmark recipe following an established pattern across the existing MTP family (minimaxm3_fp8_mi{300,325,355}x_mtp.sh). PR was validated with bash -n, matrix-logic pytest, and the sweep entry generation. No critical code paths touched.
Other factors
The only finding is a nit about the DRAFT_MODEL fetch being unconditional rather than guarded by the same if [[ "$MODEL" != /* ]] block as its three sister MTP scripts. The PR author has the inline comment to act on — it's a minor consistency cleanup that doesn't impact the current CI path (where MODEL is passed as a bare HF id). Shadow-approving since this is exactly the kind of mechanical recipe addition that follows established patterns.
if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
hf download "$DRAFT_MODEL"
🟡 The draft-model fetch on line 30 is unconditional, but the three sister MTP recipes (minimaxm3_fp8_mi300x_mtp.sh:40-44, minimaxm3_fp8_mi325x_mtp.sh:52-56, minimaxm3_fp8_mi355x_mtp.sh:49-53) all wrap both hf download "$MODEL" and hf download "$DRAFT_MODEL" inside the same if [[ "$MODEL" != /* ]]; then ... fi guard, with an explicit comment that local-path MODEL implies an offline-pre-staged cache. The new script breaks that invariant — a runner with MODEL set to a local path still hits HuggingFace to pull the EAGLE3 draft, which fails on offline-staged runners. Fix: move hf download "$DRAFT_MODEL" inside the existing if block to match the family pattern.
Extended reasoning...
What the bug is
The new minimaxm3_fp4_mi355x_vllm_mtp.sh recipe handles the target and draft model downloads inconsistently:
# Line 29-30
if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
hf download "$DRAFT_MODEL"
$MODEL is gated on the "not a local path" check, but $DRAFT_MODEL is fetched unconditionally: every invocation reaches out to HuggingFace for Inferact/MiniMax-M3-EAGLE3.
How sister scripts handle this
All three existing sister MTP recipes wrap both downloads under the same guard, with an explicit comment documenting the invariant. From minimaxm3_fp8_mi355x_mtp.sh:47-53:
# MODEL stays a bare HF id on the mi355x single-node runner (weights are
# pre-staged in the mounted NFS HF cache, so this is a fast cache hit). The
# EAGLE3 draft is not staged; fetch it into the same cache.
if [[ "$MODEL" != /* ]]; then
hf download "$MODEL"
hf download "$DRAFT_MODEL"
fi
minimaxm3_fp8_mi300x_mtp.sh:40-44 and minimaxm3_fp8_mi325x_mtp.sh:52-56 use the identical pattern with the same comment. The convention is clear: a local-path MODEL is the offline-pre-staged signal, and when it is set, the runner has no business hitting HF for anything.
Why this matters
On a runner with MODEL=/some/local/path and no HF network/auth (the exact scenario the local-path mode is designed for), the unconditional hf download "$DRAFT_MODEL" on line 30 will fail. The other MTP recipes in the family correctly skip the draft fetch in that case under the assumption that the offline staging includes the draft.
Step-by-step proof
- Operator pre-stages both amd/MiniMax-M3-MXFP4 and Inferact/MiniMax-M3-EAGLE3 to a local path /staged/models/... for an air-gapped runner.
- They set MODEL=/staged/models/amd/MiniMax-M3-MXFP4 and run the recipe.
- Line 29: "$MODEL" != /* is false (MODEL does start with /), so the target download is skipped, which is correct.
- Line 30: hf download "$DRAFT_MODEL" runs unconditionally and attempts to reach huggingface.co for Inferact/MiniMax-M3-EAGLE3.
- On an offline runner: command fails with a network error. On a runner without HF auth for that repo: command fails with 401.
- Compare with running minimaxm3_fp8_mi355x_mtp.sh in the same setup: its if block is fully skipped because MODEL is a local path, no network call is made, and serving proceeds.
Practical impact
Limited in current usage — the AMD master config passes amd/MiniMax-M3-MXFP4 as a bare HF id, so the != /* branch is taken and both downloads run together. The divergence only manifests when someone runs in offline-staged mode, which isn't the current CI path. Still worth fixing for family consistency since the pattern (and its comment) is established across all sister MTP scripts.
Fix
Move hf download "$DRAFT_MODEL" inside the existing if block — matching the existing 3-script pattern:
if [[ "$MODEL" != /* ]]; then
hf download "$MODEL"
hf download "$DRAFT_MODEL"
fi
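To make the difference concrete, here is a minimal offline sketch of the guarded pattern. The `hf` function below is a hypothetical stub that only counts calls (it is not the real `hf` CLI), and `fetch_models` and `FETCH_COUNT` are illustrative names that do not appear in the recipe.

```shell
#!/usr/bin/env bash
# Hypothetical stub standing in for the real `hf` CLI so the sketch runs
# without network access; it only counts and prints requested downloads.
FETCH_COUNT=0
hf() { FETCH_COUNT=$((FETCH_COUNT + 1)); echo "would fetch: $2"; }

# Guarded fetch, following the pattern quoted from the sister MTP scripts:
# a MODEL that starts with "/" is treated as a pre-staged local path, so
# neither the target nor the draft model is downloaded.
fetch_models() {
  local model="$1" draft_model="$2"
  FETCH_COUNT=0
  if [[ "$model" != /* ]]; then
    hf download "$model"
    hf download "$draft_model"
  fi
}

fetch_models "amd/MiniMax-M3-MXFP4" "Inferact/MiniMax-M3-EAGLE3"
echo "bare HF id: $FETCH_COUNT download(s)"

fetch_models "/staged/models/amd/MiniMax-M3-MXFP4" "Inferact/MiniMax-M3-EAGLE3"
echo "local path: $FETCH_COUNT download(s)"
```

With a bare HF id both downloads are requested; with a local path neither is, which is the behavior the three sister scripts rely on for offline-staged runners.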
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28206198366
/reuse-sweep-run
# Conflicts: # perf-changelog.yaml
590ba00 to 05ffa9a
Following https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3?features=tool_calling%2Creasoning%2Cspec_decoding&hardware=mi355x&variant=mxfp4 as of 5:59pm PT, using the latest upstream container currently available: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e

What changed
- Add minimaxm3-fp4-mi355x-vllm-mtp to the AMD master config
- amd/MiniMax-M3-MXFP4 with Inferact/MiniMax-M3-EAGLE3 using three speculative tokens
- VLLM_USE_BREAKABLE_CUDAGRAPH=0
- --use-chat-template for realistic EAGLE3 acceptance
Why
The MXFP4 target recipe is now on main. This adds the corresponding EAGLE3 speculative-decoding coverage on MI355X using the same pinned ROCm nightly and comparable MTP search space as MXFP8.
Validation
- bash -n benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
- python -m pytest utils/matrix_logic/ -q (180 passed)
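For context on what the first check covers: `bash -n` only parses a script and never executes it, so it catches syntax errors but not runtime behavior such as an unguarded download. A small sketch, where `syntax_status` and the temporary files are illustrative and not part of the repo:

```shell
#!/usr/bin/env bash
# Report whether a script parses, without executing any of it.
syntax_status() {
  if bash -n "$1" 2>/dev/null; then echo "ok"; else echo "error"; fi
}

good="$(mktemp)"
bad="$(mktemp)"

# Parses cleanly even though the draft download sits outside the guard.
printf '%s\n' 'if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi' \
              'hf download "$DRAFT_MODEL"' > "$good"

# Missing the closing `fi`, so the parser rejects it.
printf '%s\n' 'if [[ "$MODEL" != /* ]]; then' \
              '  hf download "$MODEL"' > "$bad"

echo "good: $(syntax_status "$good")"
echo "bad:  $(syntax_status "$bad")"
rm -f "$good" "$bad"
```

The first file reports `ok` and the second `error`, so a passing `bash -n` says nothing about whether the download guard behaves as intended.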