Add MiniMax-M3 NVFP4 B200 single-node aggregated vLLM benchmark#1932
Ankur-singh wants to merge 5 commits into
New minimaxm3-fp4-b200-vllm config (fp4 vLLM aggregated on b200-dgxc). The benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve. Weights are pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 (added a minimaxm3-fp4 MODEL_PATH branch to launch_b200-dgxc.sh).
for f in \
  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
  model_executor/layers/quantization/modelopt.py \
  model_executor/layers/quantization/utils/flashinfer_utils.py
do
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"
🔴 The NVFP4 overlay-patch loop at lines 27-33 downloads 3 files from raw.githubusercontent.com with no error handling: the script has no set -e, benchmark_lib.sh does not set it either, and there is no || exit after curl -fsSL. If only modelopt.py or flashinfer_utils.py fails to download (transient 5xx, rate limit, network blip), curl writes no file and the loop continues — the verification at line 34 only imports from the first file (trtllm_nvfp4_moe.py), so the failure is not caught and the benchmark dies much later inside vllm serve with an opaque unrecognized-NVFP4-quant-config error. Fix: add || exit 1 to the curl invocation, or set -euo pipefail at the top — matching the || { echo ...; exit 1; } pattern the sibling minimaxm3_fp8_b200.sh already uses.
Extended reasoning...
What the bug is
The new script benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh overlays vllm-project/vllm PR #46380 onto the installed vLLM package by curl-fetching three source files from raw.githubusercontent.com:
for f in \
model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
model_executor/layers/quantization/modelopt.py \
model_executor/layers/quantization/utils/flashinfer_utils.py
do
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"
The loop has no error handling. The script does not declare set -e (the only set in the file is set -x on line 65, which is shell tracing), and benchmark_lib.sh sourced at line 9 does not enable -e globally either (its only set -e/set +e calls are scoped inside a single function around lines 1265/1270). The curl invocation also has no || exit / || { …; exit 1; } trailer.
The specific failure path
curl -fsSL returns non-zero on HTTP errors (the -f flag), and crucially, with -f curl writes no output file on failure — the existing site-packages file from the image stays in place untouched. With no set -e and no explicit error check, the loop simply moves to the next iteration; the script then proceeds.
The post-patch verification at line 34 only imports TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe.py — the first file in the loop. If the second (modelopt.py) or third (flashinfer_utils.py) download fails, the verification still passes, because the original stock-vLLM files those names reference are still valid Python modules; they simply lack the NVFP4 quant-config support that PR #46380 added. The benchmark then proceeds to vllm serve, which fails opaquely much later with an unrecognized-NVFP4-quant-config error or an ImportError — far from the actual patch step.
Step-by-step proof
- raw.githubusercontent.com returns a transient 503 (or rate-limits) for modelopt.py, which is realistic during GitHub Actions runner storms.
- curl -fsSL …/modelopt.py -o …/modelopt.py exits 22, prints nothing (-s), and writes nothing (-f suppresses the output file on HTTP failure). The stock modelopt.py in the image is untouched.
- The for-loop ignores the non-zero exit and continues to flashinfer_utils.py (which may also be patched or original).
- python3 -c "from …trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular" succeeds: file #1 was overwritten correctly, and the symbol the new file defines is importable. Prints [nvfp4-patch] OK.
- vllm serve nvidia/MiniMax-M3-NVFP4 … starts. Inside vLLM, the modelopt loader is reached for the NVFP4 quant config, but the unpatched modelopt.py does not recognise the NVFP4 variant from PR #46380, so startup fails with an opaque error well after the patch step.
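The steps above can be reproduced offline. The following is a minimal sketch, assuming a stand-in function `fake_fetch` (hypothetical, not part of the PR) that fails only for modelopt.py with the exit code `curl -f` uses for HTTP errors. The loop has the same shape as the overlay loop and runs with errexit off, as the benchmark script does.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `curl -fsSL`: fails with exit 22 only for modelopt.py.
fake_fetch() {
  case "$1" in
    modelopt.py) return 22 ;;
    *) return 0 ;;
  esac
}

# Same shape as the overlay loop: errexit is off and nothing checks the fetch.
# The body runs in a subshell so `set +e` does not leak into the caller.
unchecked_overlay() (
  set +e
  for f in trtllm_nvfp4_moe.py modelopt.py flashinfer_utils.py; do
    fake_fetch "$f"
  done
  # Reached even though one fetch failed.
  echo "[nvfp4-patch] OK"
)

unchecked_overlay
```

The function prints the OK marker and returns 0 despite the failed fetch, which is the silent pass described in the review.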
Why existing code doesn't prevent this
The verification command is correct for confirming file #1's overwrite, but is silent on files #2 and #3. The sibling script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh shows the project's own defensive convention for an analogous patching block — its Python heredoc patch is followed by || { echo "… patch failed" >&2; exit 1; } (around line 30). This new fp4 recipe diverges from that convention.
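One way to make the check cover all three files is to stage every download first and install only when the whole set succeeded. This is a sketch under stated assumptions, not the project's method: `fetch_one` and `stage_and_install` are hypothetical helper names introduced here, and the only external commands used are curl, mktemp, mkdir, cp, and rm.

```shell
#!/usr/bin/env bash
# Default fetcher (hypothetical helper); redefine fetch_one for offline testing.
fetch_one() {  # usage: fetch_one URL DEST
  curl -fsSL "$1" -o "$2"
}

# usage: stage_and_install BASE_URL DEST_DIR FILE...
# Downloads every file into a temp dir first and copies into DEST_DIR only if
# all downloads succeeded, so a partial overlay never reaches the install.
stage_and_install() {
  local base_url="$1" dest_dir="$2" stage f
  shift 2
  stage="$(mktemp -d)" || return 1
  for f in "$@"; do
    mkdir -p "${stage}/$(dirname "$f")"
    if ! fetch_one "${base_url}/${f}" "${stage}/${f}"; then
      echo "[nvfp4-patch] failed to fetch ${f}" >&2
      rm -rf "$stage"
      return 1
    fi
  done
  for f in "$@"; do
    mkdir -p "${dest_dir}/$(dirname "$f")"
    cp "${stage}/${f}" "${dest_dir}/${f}" || { rm -rf "$stage"; return 1; }
  done
  rm -rf "$stage"
}

# Possible call site, using the URL and VLLM_DIR from the script under review:
# stage_and_install "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm" \
#   "${VLLM_DIR}" \
#   model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
#   model_executor/layers/quantization/modelopt.py \
#   model_executor/layers/quantization/utils/flashinfer_utils.py || exit 1
```

Staging trades a few extra copies for an all-or-nothing overlay: either every file comes from commit 6c08558 or the stock files stay in place and the script stops at the patch step.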
Impact
When the failure hits, the benchmark fails inside vllm serve startup with a confusing NVFP4-quant-config or ImportError, not at the patch step where the actual problem lives. That's the worst kind of CI failure — expensive to triage and easy to mis-attribute to a vLLM or model-config issue. raw.githubusercontent.com 5xx / rate limiting / DNS blips during runner storms are realistic, not hypothetical.
Fix
Either of the following one-line fixes works:
# Option A: fail-fast inside the loop
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" || { echo "[nvfp4-patch] failed to fetch ${f}" >&2; exit 1; }
# Option B: enable strict mode at the top of the script
set -euo pipefail
Option A matches the project's existing pattern in minimaxm3_fp8_b200.sh. Option B is broader and would also catch other unchecked failures (hf download, the python3 -c …vllm.__file__ lookup, etc.).
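To see what Option B changes in practice, here is a small offline sketch. `strict_status` is a hypothetical helper introduced for illustration; it runs a snippet in a fresh bash with strict mode enabled and prints the resulting exit status.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a snippet under strict mode and print its exit status.
strict_status() {
  local rc=0
  bash -c "set -euo pipefail; $1" >/dev/null 2>&1 || rc=$?
  echo "$rc"
}

# A bare failing command aborts the script before the echo runs.
strict_status 'false; echo reached'          # prints 1
# A failure on the left of `||` counts as handled, so the script continues.
strict_status 'false || true; echo reached'  # prints 0
# pipefail: a failing producer makes the whole pipeline fail.
strict_status 'false | cat; echo reached'    # prints 1
```

The second case is the caveat worth keeping in mind: strict mode does not fire for commands whose failure is already tested or chained, so an explicit check on the curl call (Option A) remains useful even with Option B enabled.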
# Conflicts: # perf-changelog.yaml
functionstackx left a comment
- fix patchwork as discussed in slack
- missing vllm recipes
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28189599852
The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the benchmark script no longer overwrites vLLM files at runtime.
Adds the minimaxm3-fp4-b200-vllm config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B200 (runner: b200-dgxc), no spec decode.
- nvidia-master.yaml entry (fp4 / vllm / runner b200-dgxc); sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-1024.
- benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh: overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; --block-size 128 (MSA), --language-model-only.
- Weights pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4: added a minimaxm3 && fp4 branch to launch_b200-dgxc.sh that resolves MODEL_PATH there (the launcher rewrites MODEL to it and bind-mounts it).