Add MiniMax-M3 NVFP4 B200 single-node aggregated vLLM benchmark by Ankur-singh · Pull Request #1932 · SemiAnalysisAI/InferenceX

Ankur-singh · 2026-06-25T17:31:44Z

Adds the minimaxm3-fp4-b200-vllm config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B200 (runner: b200-dgxc), no spec decode.

Config: nvidia-master.yaml entry (fp4 / vllm / runner b200-dgxc); sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-1024.
Recipe: benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh — overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; --block-size 128 (MSA), --language-model-only.
Weights: pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 — added a minimaxm3 && fp4 branch to launch_b200-dgxc.sh that resolves MODEL_PATH there (the launcher rewrites MODEL to it and bind-mounts it).
perf-changelog entry appended.
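A minimal sketch of what a "minimaxm3 && fp4" launcher branch could look like. The function name `resolve_model_path` and its two arguments are hypothetical and are not taken from launch_b200-dgxc.sh. Only the weights path comes from the description above.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: map a (model key, precision) pair to pre-staged weights.
# The function name and argument names are illustrative assumptions, not the
# real launch_b200-dgxc.sh interface.

resolve_model_path() {
  local model_key="$1" precision="$2"
  if [[ "${model_key}" == "minimaxm3" && "${precision}" == "fp4" ]]; then
    # Path quoted from the PR description.
    echo "/scratch/fsw/models/MiniMax-M3-NVFP4"
  else
    # Unknown combination: report failure and let the caller decide the fallback.
    return 1
  fi
}

if MODEL_PATH="$(resolve_model_path minimaxm3 fp4)"; then
  echo "MODEL_PATH=${MODEL_PATH}"
fi
```

Returning non-zero for unknown combinations keeps the fallback decision with the caller instead of silently guessing a path.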

New minimaxm3-fp4-b200-vllm config (fp4 vLLM aggregated on b200-dgxc). The benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve. Weights are pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added a minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh).

claude · 2026-06-25T17:41:28Z

+for f in \
+  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
+  model_executor/layers/quantization/modelopt.py \
+  model_executor/layers/quantization/utils/flashinfer_utils.py
+do
+  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
+done
+python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"


🔴 The NVFP4 overlay-patch loop at lines 27-33 downloads 3 files from raw.githubusercontent.com with no error handling: the script has no set -e, benchmark_lib.sh does not set it either, and there is no || exit after curl -fsSL. If only modelopt.py or flashinfer_utils.py fails to download (transient 5xx, rate limit, network blip), curl writes no file and the loop continues — the verification at line 34 only imports from the first file (trtllm_nvfp4_moe.py), so the failure is not caught and the benchmark dies much later inside vllm serve with an opaque unrecognized-NVFP4-quant-config error. Fix: add || exit 1 to the curl invocation, or set -euo pipefail at the top — matching the || { echo ...; exit 1; } pattern the sibling minimaxm3_fp8_b200.sh already uses.

Extended reasoning...

What the bug is

The new script benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh overlays vllm-project/vllm PR #46380 onto the installed vLLM package by curl-fetching three source files from raw.githubusercontent.com:

    for f in \
      model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
      model_executor/layers/quantization/modelopt.py \
      model_executor/layers/quantization/utils/flashinfer_utils.py
    do
      curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
    done
    python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

The loop has no error handling. The script does not declare set -e (the only set in the file is set -x on line 65, which is shell tracing), and benchmark_lib.sh sourced at line 9 does not enable -e globally either (its only set -e/set +e calls are scoped inside a single function around lines 1265/1270). The curl invocation also has no || exit / || { …; exit 1; } trailer.

The specific failure path

curl -fsSL returns non-zero on HTTP errors (the -f flag), and crucially, with -f curl writes no output file on failure — the existing site-packages file from the image stays in place untouched. With no set -e and no explicit error check, the loop simply moves to the next iteration; the script then proceeds.

The post-patch verification at line 34 only imports TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe.py — the first file in the loop. If the second (modelopt.py) or third (flashinfer_utils.py) download fails, the verification still passes, because the original stock-vLLM files those names reference are still valid Python modules; they simply lack the NVFP4 quant-config support that PR #46380 added. The benchmark then proceeds to vllm serve, which fails opaquely much later with an unrecognized-NVFP4-quant-config error or an ImportError — far from the actual patch step.
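The failure path described here can be reproduced offline with a small simulation. In this sketch, `fake_fetch` is a hypothetical stand-in for the `curl -fsSL ... -o ...` call. It fails for modelopt.py and leaves the destination untouched, which is the behaviour the review attributes to curl on an HTTP error.

```shell
#!/usr/bin/env bash
# Sketch: an unchecked failure inside a for-loop goes unnoticed without `set -e`.
# `fake_fetch` is a hypothetical stand-in for curl; no network access is needed.

workdir="$(mktemp -d)"
files=(trtllm_nvfp4_moe.py modelopt.py flashinfer_utils.py)

# Pretend the image already ships stock copies of all three files.
for f in "${files[@]}"; do
  echo "stock" > "${workdir}/${f}"
done

fake_fetch() {
  local name="$1" dest="$2"
  if [ "${name}" = "modelopt.py" ]; then
    return 22   # exit code curl uses for HTTP errors under -f
  fi
  echo "patched" > "${dest}"
}

# Same shape as the loop under review: no `|| exit`, no `set -e`.
for f in "${files[@]}"; do
  fake_fetch "${f}" "${workdir}/${f}"
done

# A check that only inspects the first file cannot see the stale second file.
first_state="$(cat "${workdir}/trtllm_nvfp4_moe.py")"
second_state="$(cat "${workdir}/modelopt.py")"
third_state="$(cat "${workdir}/flashinfer_utils.py")"
echo "first=${first_state} second=${second_state} third=${third_state}"
```

The script runs to completion and reports the second file as still "stock", which is the mixed patched/unpatched state the review warns about.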

Step-by-step proof

1. raw.githubusercontent.com returns a transient 503 (or rate-limits) for modelopt.py, which is realistic during GitHub Actions runner storms.

2. curl -fsSL …/modelopt.py -o …/modelopt.py exits 22, prints nothing (-s), and writes nothing (-f suppresses the output file on HTTP failure). The stock modelopt.py in the image is untouched.

3. The for-loop ignores the non-zero exit and continues to flashinfer_utils.py (which may or may not end up patched, depending on whether its own download succeeds).

4. python3 -c "from …trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular" succeeds: file #1 was overwritten correctly, and the symbol the new file defines is importable. Prints [nvfp4-patch] OK.

5. vllm serve nvidia/MiniMax-M3-NVFP4 … starts. Inside vLLM, the modelopt loader is reached for the NVFP4 quant config, but the unpatched modelopt.py does not recognise the NVFP4 variant from PR #46380, so startup fails with an opaque error well after the patch step.

Why existing code doesn't prevent this

The verification command is correct for confirming file #1's overwrite, but is silent on files #2 and #3. The sibling script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh shows the project's own defensive convention for an analogous patching block — its Python heredoc patch is followed by || { echo "… patch failed" >&2; exit 1; } (around line 30). This new fp4 recipe diverges from that convention.
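One way to avoid depending on a per-file import check is to make the overlay all-or-nothing: stage every download first and copy into the install only if all of them succeeded. The sketch below is an illustration under that assumption. The names `fetch_one` and `overlay_all` are hypothetical and do not exist in the repository.

```shell
#!/usr/bin/env bash
# Sketch: stage all files, then overwrite the install only if every fetch succeeded.
# `fetch_one` and `overlay_all` are hypothetical helper names.

# Default fetcher; redefine it to exercise the logic without network access.
fetch_one() {
  local url="$1" dest="$2"
  curl -fsSL "${url}" -o "${dest}"
}

# Usage: overlay_all BASE_URL TARGET_DIR FILE...
overlay_all() {
  local base_url="$1" target_dir="$2"
  shift 2
  local staging f
  staging="$(mktemp -d)" || return 1

  # Phase 1: download everything into a scratch directory.
  for f in "$@"; do
    mkdir -p "${staging}/$(dirname "${f}")"
    # Fail if the fetch errors out or produces a missing/empty file.
    if ! fetch_one "${base_url}/${f}" "${staging}/${f}" || [ ! -s "${staging}/${f}" ]; then
      echo "[nvfp4-patch] failed to fetch ${f}" >&2
      rm -rf "${staging}"
      return 1
    fi
  done

  # Phase 2: only now touch the target tree.
  for f in "$@"; do
    mkdir -p "${target_dir}/$(dirname "${f}")"
    cp "${staging}/${f}" "${target_dir}/${f}" || { rm -rf "${staging}"; return 1; }
  done
  rm -rf "${staging}"
}
```

A caller would then write something like `overlay_all "<base url>" "${VLLM_DIR}" <files...> || exit 1`, so a single failed download aborts the run before any installed file is modified.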

Impact

When the failure hits, the benchmark fails inside vllm serve startup with a confusing NVFP4-quant-config or ImportError, not at the patch step where the actual problem lives. That's the worst kind of CI failure — expensive to triage and easy to mis-attribute to a vLLM or model-config issue. raw.githubusercontent.com 5xx / rate limiting / DNS blips during runner storms are realistic, not hypothetical.

Fix

Either of the following one-line fixes works:

    # Option A: fail-fast inside the loop
    curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" \
      || { echo "[nvfp4-patch] failed to fetch ${f}" >&2; exit 1; }

    # Option B: enable strict mode at the top of the script
    set -euo pipefail

Option A matches the project's existing pattern in minimaxm3_fp8_b200.sh. Option B is broader and would also catch other unchecked failures (hf download, the python3 -c …vllm.__file__ lookup, etc.).
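The effect of Option B can be seen in isolation with a two-command script run once without and once with strict mode. This is a sketch only: `false` stands in for a failing `curl -fsSL`, and nothing touches the network.

```shell
#!/usr/bin/env bash
# Sketch: the same two-step script with and without strict mode.
# `false` is a stand-in for a failing download command.

body='false; echo "reached next step"'

# Without strict mode the failure is ignored and execution continues.
lenient_out="$(bash -c "${body}")"
lenient_rc=$?

# With strict mode the shell exits at the first failing command.
strict_out="$(bash -c "set -euo pipefail; ${body}")"
strict_rc=$?

echo "lenient: rc=${lenient_rc} out='${lenient_out}'"
echo "strict:  rc=${strict_rc} out='${strict_out}'"
```

Note that `set -e` does not fire for commands used as `if` conditions or on the left side of `||` and `&&`, so an explicit `|| { …; exit 1; }` trailer as in Option A is still useful where a targeted error message is wanted.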

# Conflicts: # perf-changelog.yaml

functionstackx

fix patchwork as discussed in slack
missing vllm recipes

github-actions · 2026-06-25T20:11:58Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28189599852
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28189599852

The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the benchmark script no longer overwrites vLLM files at runtime.

Ankur-singh requested a review from a team June 25, 2026 17:31

Ankur-singh requested review from jgangani and kedarpotdar-nv as code owners June 25, 2026 17:31

github-project-automation Bot added this to InferenceMAX Board Jun 25, 2026

Update perf-changelog pr-link for #1932

f2c156b

claude Bot reviewed Jun 25, 2026

Merge remote-tracking branch 'origin/main' into minimaxm3-fp4-b200-vllm

e65ca33

Ankur-singh added the full-sweep-enabled label Jun 25, 2026

functionstackx requested changes Jun 25, 2026

Ankur-singh added 2 commits June 25, 2026 13:15

Drop runtime NVFP4 patch; bump perf image to ...-8b00f41

ef18622

Merge remote-tracking branch 'origin/main' into minimaxm3-fp4-b200-vllm

436111d

Ankur-singh wants to merge 5 commits into main from minimaxm3-fp4-b200-vllm