[codex] add MiniMax M3 FP4 MI355X vLLM benchmark#1935
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
Claude finished @functionstackx's task in 1m 29s. PR Review: MiniMax M3 FP4 MI355X vLLM benchmark
LGTM — no blocking issues found. I compared this against the already-merged
No bugs, logic errors, or resource issues found. The script faithfully reuses the working FP8 recipe shape. |
SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
🔴 The new minimaxm3_fp4_mi355x_vllm.sh is missing export VLLM_USE_BREAKABLE_CUDAGRAPH=0 after the VLLM_ENGINE_READY_TIMEOUT_S line, which every other MiniMax-M3 vLLM recipe in the repo sets (including the MXFP4 multi-node disagg entry at models_vllm.yaml:44 for the SAME amd/MiniMax-M3-MXFP4 model). Without it, the M3 decode path silently falls back to eager mode via the breakable-cudagraph fallback, invalidating the "direct precision comparison" with the MXFP8 baseline (which DOES run with CUDA graphs) that the PR description names as the motivation. Fix: add export VLLM_USE_BREAKABLE_CUDAGRAPH=0 at line 34, matching minimaxm3_fp8_mi355x.sh:33.
Extended reasoning...
The bug
The new benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh script (lines 32-34) sets SERVER_LOG and VLLM_ENGINE_READY_TIMEOUT_S=3600, but does NOT export VLLM_USE_BREAKABLE_CUDAGRAPH=0. Every other MiniMax-M3 vLLM recipe in this repo sets this env var:
| File | Line |
|---|---|
| minimaxm3_fp8_mi300x.sh | 35 |
| minimaxm3_fp8_mi300x_mtp.sh | 52 |
| minimaxm3_fp8_mi325x.sh | 33 |
| minimaxm3_fp8_mi325x_mtp.sh | 64 |
| minimaxm3_fp8_mi355x.sh | 33 (the direct sibling this PR claims to mirror) |
| minimaxm3_fp8_mi355x_mtp.sh | 63 |
| benchmarks/multi_node/amd_utils/models_vllm.yaml | 44 (MiniMax-M3-MXFP4 disagg) |
The inline comment in those scripts identifies it as a MiniMax-M3 model-specific (not precision-specific) workaround: "VLLM_USE_BREAKABLE_CUDAGRAPH=0 avoids the M3-decode breakable-cudagraph path that previously forced eager execution."
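The pattern in the table can also be checked mechanically. This is a minimal sketch, assuming the single-node recipes sit in one directory and follow the minimaxm3_*.sh naming seen above. The helper name list_recipes_missing_flag is hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print every MiniMax-M3 recipe in a
# directory that does not export VLLM_USE_BREAKABLE_CUDAGRAPH=0.
list_recipes_missing_flag() {
  local dir="$1" script
  for script in "$dir"/minimaxm3_*.sh; do
    [ -e "$script" ] || continue   # the glob matched nothing
    # Look for an export line that sets the variable to exactly 0.
    if ! grep -qE '^[[:space:]]*export[[:space:]]+VLLM_USE_BREAKABLE_CUDAGRAPH=0([[:space:]]|$)' "$script"; then
      printf '%s\n' "$script"
    fi
  done
}
```

Running `list_recipes_missing_flag benchmarks/single_node/fixed_seq_len` should print nothing once every recipe sets the variable. The check is text-based only, so it would miss a recipe that sets the variable some other way (for example through a sourced helper).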
Why this is not specific to MXFP8
The disagg config at benchmarks/multi_node/amd_utils/models_vllm.yaml:44 uses the exact same model (amd/MiniMax-M3-MXFP4) and explicitly sets VLLM_USE_BREAKABLE_CUDAGRAPH=0 in its env string. So the requirement is tied to the MiniMax-M3 model + ROCm decode path, not to the weight quantization. PRs #1750/#1754/#1755/#1756 (recorded in perf-changelog.yaml) landed this fix "per AMD guidance" across every MiniMax-M3 single-node recipe at the time; this new MXFP4 single-node recipe breaks the established pattern without justification.
Concrete trigger walkthrough
- Sweep launcher runs bash minimaxm3_fp4_mi355x_vllm.sh with one of the TP/EP shapes from amd-master.yaml.
- Script exports only VLLM_ENGINE_READY_TIMEOUT_S=3600; VLLM_USE_BREAKABLE_CUDAGRAPH is unset (default: enabled).
- vllm serve is invoked without --enforce-eager, so vLLM normally captures CUDA graphs for decode.
- On MiniMax-M3, the decode path hits the "breakable cudagraph" fallback (the issue the env var was added to suppress, per the inline comments in all sister scripts).
- Decode silently runs eager-mode while the FP8 MI355X baseline runs with CUDA graphs enabled (since its script DOES export the var).
- The PR description explicitly states the motivation is to "mirror the existing MXFP8 MI355X TP/EP/DP-attention sweep for direct precision comparisons", but the comparison is no longer apples-to-apples: FP4 measures eager-mode decode while FP8 measures graph-captured decode.
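Because the fallback in the walkthrough is silent, a recipe could fail fast instead of producing numbers that are not comparable. This is a minimal sketch of such a preflight guard. The helper name require_env_value is hypothetical, and it only verifies what is exported to child processes, not what vLLM does with CUDA graphs at runtime.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not in the repo): succeed only when the named variable
# is exported with the expected value, so the server process will inherit it.
require_env_value() {
  local name="$1" expected="$2" actual
  # printenv only sees exported variables, which is what vllm serve inherits.
  actual="$(printenv "$name" || true)"
  if [ "$actual" != "$expected" ]; then
    printf 'error: %s is "%s", expected "%s"\n' "$name" "${actual:-<unset>}" "$expected" >&2
    return 1
  fi
}
```

A recipe could call `require_env_value VLLM_USE_BREAKABLE_CUDAGRAPH 0 || exit 1` just before launching the server.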
Impact
This silently invalidates the benchmark's stated purpose. The numbers will look worse than they should because eager-mode decode throughput is substantially below graph-captured decode on MoE models. Anyone comparing these results to the MXFP8 baseline will draw incorrect conclusions about MXFP4's quality/perf trade-off. This is a normal-severity bug because the measurement validity is the explicit goal of this PR.
Fix
One-line addition at benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh:34:
SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0 # avoids the M3-decode breakable-cudagraph path that previously forced eager execution

This exactly mirrors the layout in minimaxm3_fp8_mi355x.sh:31-33, which the PR description claims to mirror.
|
I had a duplicate of this in #1936 (same config key
Other things I confirmed that corroborate this PR's choices: block-size 128 + TRITON_ATTN MSA work; the FYI the matching upstream recipe variant is up at vllm-project/recipes#579. One open question for you: search space. This mirrors the MXFP8 grid (EP + conc→1024); my closed PR used pure TP4+TP8 capped at conc 256 per the original ask. Either is fine — flagging in case you want to trim. |
|
Heads up — the pinned image is likely a blocker for MXFP4, separate from the flags above. Per the AMD MXFP4 enablement owners:
This PR pins
Required serve flags (from #46419, matches what I validated): Accuracy target to validate against: gsm8k 5-shot 0.940 flexible / 0.941 strict. |
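The accuracy targets quoted above can be turned into a pass/fail gate. This is a rough sketch only: the helper name meets_accuracy_target and the optional tolerance argument are assumptions for illustration, and the comment does not say what margin is acceptable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed when measured >= target - tolerance.
# Arguments: measured score, target score, optional tolerance (default 0).
meets_accuracy_target() {
  awk -v m="$1" -v t="$2" -v tol="${3:-0}" \
    'BEGIN { exit !((m + 0) >= (t - tol)) }'   # awk exit 0 means the check passed
}
```

For example, `meets_accuracy_target "$score" 0.940` would gate on the flexible-match target and `meets_accuracy_target "$score" 0.941` on the strict-match one, where `$score` comes from whatever eval harness produced the gsm8k 5-shot number.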
hi @andyluo7 we do not enable AITER because it is still WIP and, as of June 25, 2026 12:42pm PT, it is not accessible in upstream docker containers (vllm-project/vllm#46419). I am happy to update this PR once it is merged and accessible in an https://hub.docker.com/r/vllm/ docker image
while i am glad that there is a development build that it works on, it is not accessible in an upstream https://hub.docker.com/r/vllm/ docker image. Feel free to create an update PR once it is accessible in an upstream docker image |
ChuanLi1101
left a comment
Thanks @functionstackx. +1 to the blocker @claude / @andyluo7 already flagged — VLLM_USE_BREAKABLE_CUDAGRAPH=0 is required (model-specific, set by every other M3 recipe incl. the MXFP4 disagg at models_vllm.yaml:44). Please add at line 34 and I'm good to approve. Two things to add on top of their reviews:
1. Make the "AITER off" status explicit (builds on @andyluo7's MoE-backend point). Leaving MoE to vLLM's default is the right interim call: the MXFP4 AITER path isn't in the upstream docker yet (vllm-project/vllm#46419, gated behind the aiter bump #46692). But that means these are non-AITER MXFP4 baseline numbers and won't match the AITER FP4 disagg results in #1914. Asks: (a) call this out in the changelog entry so it's not read as the optimized path; (b) plan an AITER-enabled follow-up once #46419/#46474/#46692 land. (If the nightly already carries AITER and you want it now, set it explicitly per Andy's note rather than relying on the default.)
2. The aiter 0.1.16.post2 mla_reduce_v1 regression does NOT affect this PR. That break only hits MLA models (DSR1/Kimi). M3 is not MLA (TRITON_ATTN here), so it never touches the MLA decode/reduce path — irrelevant to this benchmark now and after we flip AITER on. Flagging so nobody blocks this on the war-room regression.
Once VLLM_USE_BREAKABLE_CUDAGRAPH=0 is in, LGTM.
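One way to keep the "AITER off" status explicit is to stamp each run with a label derived from the environment. This is a sketch under a stated assumption: it treats VLLM_ROCM_USE_AITER as the relevant switch, which should be confirmed against the pinned vLLM build, and the helper name aiter_label is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: label a run by whether AITER was requested via the
# environment. ASSUMPTION: VLLM_ROCM_USE_AITER is the switch that matters for
# this build; unset or any other value is treated as the non-AITER baseline.
aiter_label() {
  case "${VLLM_ROCM_USE_AITER:-0}" in
    1|true|True|TRUE) echo "aiter" ;;
    *)                echo "non-aiter-baseline" ;;
  esac
}
```

The label could be echoed into the server log or carried into the changelog entry so the numbers are not read as the optimized path.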
hi Andy, following ur recipe as of 1pm ET June 25, 2026 it doesn't work |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28195297568 |
Thanks @ChuanLi1101 for seeing that AITER MoE is not accessible in any upstream https://hub.docker.com/r/vllm/ docker image, so not enabling it is an accurate reflection of upstream performance as of Jun 25, 1:30pm PT. Happy to accept follow-up PRs to track the performance once it is accessible in an upstream vLLM docker image
thanks @ChuanLi1101 i have implemented ur single suggestion of |
|
Claude finished @functionstackx's task in 52s —— View job Verified —
|
|
@functionstackx thanks for re-testing — two things: 1. Your 1pm ET test predated my fix. I pushed recipes#579 92fe4c6 at ~4:40pm ET, after your run. The version you hit had no variant-level image pin (it fell back to the generic 2. On official This keeps the two PRs consistent: your #1935 = non-AITER baseline on official docker (correct, already merged), and #579 documents both the emulation path (official images, today) and the AITER perf path (dev image now, official nightly once #46419 ships). Happy to do an AITER-enabled InferenceX follow-up once it's in an official image. |
ur AI wants me to use MXFP4 emulation even tho emulation is slower & even then non-emulation MXFP4 works on upstream now and passes evals!? can u please upgrade ur ai to codex 5.5 or opus 4.8 xhigh plz https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28195297568/job/83520505469 |
ur recipe was/is wrong, as addressed in vllm-project/recipes#579 (review). i am glad u were able to address my suggestions about where the bugs were |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28198641824 |
|
|
/reuse-sweep-run |

What changed
- Adds minimaxm3-fp4-mi355x-vllm to the AMD master config using amd/MiniMax-M3-MXFP4
- nightly-3f5a1e1733200760169ff31ebe60a271072b199e

Why
AMD's MiniMax M3 MXFP4 checkpoint now has upstream vLLM support through vllm-project/vllm#45794. This adds benchmark coverage for that path on MI355X while keeping the FP8 and FP4 sweep shapes comparable.
This PR does not enable AITER, as it is still WIP and, as of June 25, 2026 12:42pm PT, it is not accessible in upstream docker containers (vllm-project/vllm#46419).
Validation
- bash -n benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh
- python -m pytest utils/matrix_logic/ -v (180 passed)
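The bash -n step only parses the one new script. This is a minimal sketch of running the same parse-only check over every recipe in a directory. The helper name syntax_check_dir is hypothetical and not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run `bash -n` (parse only, nothing is executed) over
# every .sh file in a directory; return non-zero if any file fails to parse.
syntax_check_dir() {
  local dir="$1" script status=0
  for script in "$dir"/*.sh; do
    [ -e "$script" ] || continue   # the glob matched nothing
    if ! bash -n "$script"; then
      printf 'syntax error: %s\n' "$script" >&2
      status=1
    fi
  done
  return "$status"
}
```

Usage would be `syntax_check_dir benchmarks/single_node/fixed_seq_len`. Like the original check, this catches parse errors only. It says nothing about missing environment variables or wrong serve flags.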