
Add MiniMax-M3 NVFP4 B200 single-node aggregated vLLM benchmark #1932

Open
Ankur-singh wants to merge 5 commits into main from minimaxm3-fp4-b200-vllm


@Ankur-singh

Collaborator

Adds the minimaxm3-fp4-b200-vllm config: MiniMax-M3 NVFP4 (nvidia/MiniMax-M3-NVFP4) single-node aggregated vLLM on B200 (runner: b200-dgxc), no spec decode.

  • Config: nvidia-master.yaml entry (fp4 / vllm / runner b200-dgxc); sweeps tp 4/8 with and without EP and dp-attn at 1k1k and 8k1k, conc 1-1024.
  • Recipe: benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh — overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4 support, commit 6c08558) before serve; --block-size 128 (MSA), --language-model-only.
  • Weights: pre-staged at /scratch/fsw/models/MiniMax-M3-NVFP4 — added a minimaxm3 && fp4 branch to launch_b200-dgxc.sh that resolves MODEL_PATH there (the launcher rewrites MODEL to it and bind-mounts it).
  • perf-changelog entry appended.

New minimaxm3-fp4-b200-vllm config (fp4 vLLM aggregated on b200-dgxc). The
benchmark script overlays vllm-project/vllm PR #46380 (MiniMax-M3 modelopt NVFP4
support, commit 6c08558) before serve. Weights are pre-staged at
/scratch/fsw/models/MiniMax-M3-NVFP4 (added a minimaxm3-fp4 MODEL_PATH branch to
launch_b200-dgxc.sh).
Comment on lines +27 to +34
for f in \
  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
  model_executor/layers/quantization/modelopt.py \
  model_executor/layers/quantization/utils/flashinfer_utils.py
do
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

Contributor


🔴 The NVFP4 overlay-patch loop at lines 27-33 downloads 3 files from raw.githubusercontent.com with no error handling: the script has no set -e, benchmark_lib.sh does not set it either, and there is no || exit after curl -fsSL. If only modelopt.py or flashinfer_utils.py fails to download (transient 5xx, rate limit, network blip), curl writes no file and the loop continues — the verification at line 34 only imports from the first file (trtllm_nvfp4_moe.py), so the failure is not caught and the benchmark dies much later inside vllm serve with an opaque unrecognized-NVFP4-quant-config error. Fix: add || exit 1 to the curl invocation, or set -euo pipefail at the top — matching the || { echo ...; exit 1; } pattern the sibling minimaxm3_fp8_b200.sh already uses.

Extended reasoning...

What the bug is

The new script benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh overlays vllm-project/vllm PR #46380 onto the installed vLLM package by curl-fetching three source files from raw.githubusercontent.com:

for f in \
  model_executor/layers/fused_moe/experts/trtllm_nvfp4_moe.py \
  model_executor/layers/quantization/modelopt.py \
  model_executor/layers/quantization/utils/flashinfer_utils.py
do
  curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}"
done
python3 -c "from vllm.model_executor.layers.fused_moe.experts.trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular; print('[nvfp4-patch] OK')"

The loop has no error handling. The script does not declare set -e (the only set in the file is set -x on line 65, which is shell tracing), and benchmark_lib.sh sourced at line 9 does not enable -e globally either (its only set -e/set +e calls are scoped inside a single function around lines 1265/1270). The curl invocation also has no || exit / || { …; exit 1; } trailer.

The specific failure path

curl -fsSL returns non-zero on HTTP errors (the -f flag), and crucially, with -f curl writes no output file on failure — the existing site-packages file from the image stays in place untouched. With no set -e and no explicit error check, the loop simply moves to the next iteration; the script then proceeds.

The post-patch verification at line 34 only imports TrtLlmNvFp4ExpertsModular from trtllm_nvfp4_moe.py, the first file in the loop. If the second (modelopt.py) or third (flashinfer_utils.py) download fails, the verification still passes, because the stock vLLM files left at those paths are still valid Python modules; they simply lack the NVFP4 quant-config support that PR #46380 added. The benchmark then proceeds to vllm serve, which fails much later with an opaque unrecognized-NVFP4-quant-config error or an ImportError, far from the actual patch step.

Step-by-step proof

  1. raw.githubusercontent.com returns a transient 503 (or rate-limits) for modelopt.py — realistic during GitHub Actions runner storms.
  2. curl -fsSL …/modelopt.py -o …/modelopt.py exits 22, prints nothing (-s), writes nothing (-f suppresses the output file on HTTP failure). The stock modelopt.py in the image is untouched.
  3. The for-loop ignores the non-zero exit and continues to flashinfer_utils.py, whose download may or may not succeed.
  4. python3 -c "from …trtllm_nvfp4_moe import TrtLlmNvFp4ExpertsModular" succeeds: file #1 was overwritten correctly, and the symbol the new file defines is importable. Prints [nvfp4-patch] OK.
  5. vllm serve nvidia/MiniMax-M3-NVFP4 … starts. Inside vLLM, the modelopt loader is reached for the NVFP4 quant config, but the unpatched modelopt.py does not recognise the NVFP4 variant from PR #46380 — startup fails with an opaque error well after the patch step.
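The control flow in these steps can be reproduced without any network access. The sketch below is only an illustration: fake_fetch is a hypothetical stand-in for the curl call that fails for modelopt.py alone, and unchecked_loop / checked_loop are made-up names. The subshells turn errexit off explicitly to mirror a script that never enables it.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `curl -fsSL ... -o ...`: fails only for modelopt.py.
fake_fetch() {
  [ "$1" != "modelopt.py" ]
}

# Mirrors the unchecked loop. Runs in a subshell with errexit off,
# matching a script that never calls `set -e`.
unchecked_loop() (
  set +e
  for f in trtllm_nvfp4_moe.py modelopt.py flashinfer_utils.py; do
    fake_fetch "$f"            # non-zero status is silently discarded
  done
  echo "[nvfp4-patch] OK"      # still reached although one fetch failed
)

# Same loop with an explicit `|| { ...; exit 1; }` trailer on the fetch.
checked_loop() (
  set +e
  for f in trtllm_nvfp4_moe.py modelopt.py flashinfer_utils.py; do
    fake_fetch "$f" || { echo "[nvfp4-patch] failed to fetch ${f}" >&2; exit 1; }
  done
  echo "[nvfp4-patch] OK"
)
```

Calling unchecked_loop prints the OK line and returns 0, while checked_loop stops at modelopt.py with a message on stderr and a non-zero status.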

Why existing code doesn't prevent this

The verification command is correct for confirming file #1's overwrite, but is silent on files #2 and #3. The sibling script benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_b200.sh shows the project's own defensive convention for an analogous patching block — its Python heredoc patch is followed by || { echo "… patch failed" >&2; exit 1; } (around line 30). This new fp4 recipe diverges from that convention.
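One possible way to make the verification cover every overlaid file, rather than only the first, is to compare each installed file against a staged copy of the download. This is a sketch under assumptions: verify_overlay and the separate staging directory are hypothetical and not part of the PR, which downloads straight into the vLLM package directory.

```shell
#!/usr/bin/env bash
# Sketch only: verify_overlay and the staging-directory layout are hypothetical.
# Returns non-zero if any listed file under dest_dir is missing or differs
# from its staged counterpart under staging_dir.
verify_overlay() {
  local staging_dir="$1" dest_dir="$2"
  shift 2
  local f status=0
  for f in "$@"; do
    if [ ! -s "${staging_dir}/${f}" ]; then
      # The download never produced a usable staged file.
      echo "[nvfp4-patch] missing staged copy of ${f}" >&2
      status=1
    elif ! cmp -s "${staging_dir}/${f}" "${dest_dir}/${f}" 2>/dev/null; then
      # The installed file is absent or still has the old contents.
      echo "[nvfp4-patch] ${f} was not overlaid" >&2
      status=1
    fi
  done
  return "${status}"
}
```

Unlike an import check, a byte-for-byte comparison also catches the case where a stock file is still in place and remains valid Python.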

Impact

When the failure hits, the benchmark fails inside vllm serve startup with a confusing NVFP4-quant-config or ImportError, not at the patch step where the actual problem lives. That's the worst kind of CI failure — expensive to triage and easy to mis-attribute to a vLLM or model-config issue. raw.githubusercontent.com 5xx / rate limiting / DNS blips during runner storms are realistic, not hypothetical.

Fix

Either of the following one-line fixes works:

# Option A: fail-fast inside the loop
curl -fsSL "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm/${f}" -o "${VLLM_DIR}/${f}" || { echo "[nvfp4-patch] failed to fetch ${f}" >&2; exit 1; }

# Option B: enable strict mode at the top of the script
set -euo pipefail

Option A matches the project's existing pattern in minimaxm3_fp8_b200.sh. Option B is broader and would also catch other unchecked failures (hf download, the python3 -c …vllm.__file__ lookup, etc.).
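A slightly more defensive variant of Option A might download each file to a temporary path and install it only after the fetch succeeds, so a failed or empty download never touches the package directory. The sketch below is an assumption-laden illustration: fetch_overlay and the FETCH override are hypothetical names introduced here, and FETCH simply defaults to the curl flags already used in the recipe.

```shell
#!/usr/bin/env bash
# Sketch only: fetch_overlay and FETCH are hypothetical, not part of the PR.
# FETCH defaults to the curl flags used in the recipe and can be overridden.
FETCH="${FETCH:-curl -fsSL}"

fetch_overlay() {
  local base_url="$1" dest_dir="$2"
  shift 2
  local f tmp
  for f in "$@"; do
    tmp="$(mktemp)" || return 1
    # Fetch into a temp file first; treat a non-zero exit or an empty file as failure.
    if ! $FETCH "${base_url}/${f}" -o "${tmp}" || [ ! -s "${tmp}" ]; then
      echo "[nvfp4-patch] failed to fetch ${f}" >&2
      rm -f "${tmp}"
      return 1
    fi
    # Install only after a successful fetch, with readable permissions.
    mkdir -p "$(dirname "${dest_dir}/${f}")"
    install -m 0644 "${tmp}" "${dest_dir}/${f}" || { rm -f "${tmp}"; return 1; }
    rm -f "${tmp}"
  done
}
```

In the recipe this could be called as fetch_overlay "https://raw.githubusercontent.com/vllm-project/vllm/6c08558/vllm" "${VLLM_DIR}" followed by the three relative paths from the loop and || exit 1. The function stops at the first failure, so later files are left untouched.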

@functionstackx functionstackx left a comment

Collaborator


  1. fix patchwork as discussed in slack
  2. missing vllm recipes

@github-actions

Contributor

The vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 image bakes in
MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380), so the benchmark
script no longer overwrites vLLM files at runtime.