[codex] add MiniMax M3 FP4 MI355X vLLM MTP benchmark #1939
Merged
98 changes: 98 additions & 0 deletions
benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,98 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| # MiniMax-M3 MXFP4 MI355X (gfx950) single-node vLLM recipe with EAGLE3 | ||
| # speculative decoding. This is the spec-decoding=mtp variant of | ||
| # minimaxm3_fp4_mi355x_vllm.sh and uses three speculative tokens from | ||
| # Inferact/MiniMax-M3-EAGLE3. The pinned nightly includes upstream AMD | ||
| # MiniMax-M3 SupportsEagle3 support, so no runtime model patch is needed. | ||
|
|
||
| source "$(dirname "$0")/../../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| EP_SIZE \ | ||
| DP_ATTENTION \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| MAX_MODEL_LEN \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| DRAFT_MODEL="Inferact/MiniMax-M3-EAGLE3" | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||
| hf download "$DRAFT_MODEL" | ||
|
|
||
| if [ -n "$ROCR_VISIBLE_DEVICES" ]; then | ||
| export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES" | ||
| fi | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
| export VLLM_ENGINE_READY_TIMEOUT_S=3600 | ||
| export VLLM_USE_BREAKABLE_CUDAGRAPH=0 | ||
|
|
||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| fi | ||
|
|
||
| PARALLEL_ARGS=(--tensor-parallel-size "$TP") | ||
| if [ "${DP_ATTENTION}" = "true" ]; then | ||
| PARALLEL_ARGS=( | ||
| --tensor-parallel-size 1 | ||
| --data-parallel-size "$TP" | ||
| --enable-expert-parallel | ||
| ) | ||
| elif [ "$EP_SIZE" -gt 1 ]; then | ||
| PARALLEL_ARGS+=(--enable-expert-parallel) | ||
| fi | ||
|
|
||
| NUM_SPEC_TOKENS=3 | ||
|
|
||
| start_gpu_monitor | ||
|
|
||
| set -x | ||
| vllm serve "$MODEL" --port "$PORT" \ | ||
| "${PARALLEL_ARGS[@]}" \ | ||
| --trust-remote-code \ | ||
| --block-size 128 \ | ||
| --no-enable-prefix-caching \ | ||
| --language-model-only \ | ||
| --max-model-len "$MAX_MODEL_LEN" \ | ||
| --attention-backend TRITON_ATTN \ | ||
| --speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \ | ||
| --tool-call-parser minimax_m3 \ | ||
| --enable-auto-tool-choice \ | ||
| --reasoning-parser minimax_m3 > "$SERVER_LOG" 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| pip install -q datasets pandas | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ \ | ||
| --trust-remote-code \ | ||
| --use-chat-template | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
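
The parallelism branching in the recipe can be checked in isolation, without launching a server. The sketch below mirrors that logic in a hypothetical helper, `resolve_parallel_args` (a name introduced here for illustration; it is not part of the PR or of `benchmark_lib.sh`), and prints the flags that would be handed to `vllm serve` for a given `TP`, `EP_SIZE`, and `DP_ATTENTION`.

```shell
#!/usr/bin/env bash
# Illustrative only: resolve_parallel_args is a hypothetical helper that
# mirrors the PARALLEL_ARGS branching in the recipe.
resolve_parallel_args() {
    local tp="$1" ep_size="$2" dp_attention="$3"
    # Default: plain tensor parallelism across $tp GPUs.
    local args=(--tensor-parallel-size "$tp")
    if [ "$dp_attention" = "true" ]; then
        # DP attention: TP collapses to 1 and $tp becomes the data-parallel size.
        args=(--tensor-parallel-size 1 --data-parallel-size "$tp" --enable-expert-parallel)
    elif [ "$ep_size" -gt 1 ]; then
        # Expert parallelism layered on top of tensor parallelism.
        args+=(--enable-expert-parallel)
    fi
    printf '%s\n' "${args[*]}"
}

resolve_parallel_args 8 1 false   # TP only
resolve_parallel_args 8 8 false   # TP plus expert parallelism
resolve_parallel_args 8 1 true    # DP attention
```

As in the recipe, `DP_ATTENTION=true` takes precedence over `EP_SIZE`, so the `EP_SIZE` value is not consulted in that branch.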
🟡 The draft-model fetch on line 30 is unconditional, but the three sister MTP recipes (`minimaxm3_fp8_mi300x_mtp.sh:40-44`, `minimaxm3_fp8_mi325x_mtp.sh:52-56`, `minimaxm3_fp8_mi355x_mtp.sh:49-53`) all wrap both `hf download "$MODEL"` and `hf download "$DRAFT_MODEL"` inside the same `if [[ "$MODEL" != /* ]]; then ... fi` guard, with an explicit comment that a local-path `MODEL` implies an offline pre-staged cache. The new script breaks that invariant: a runner with `MODEL` set to a local path still hits HuggingFace to pull the EAGLE3 draft, which fails on offline-staged runners. Fix: move `hf download "$DRAFT_MODEL"` inside the existing `if` block to match the family pattern.

Extended reasoning...

What the bug is

The new `minimaxm3_fp4_mi355x_vllm_mtp.sh` recipe handles the target and draft model downloads inconsistently: `$MODEL` is gated on the "not a local path" check, but `$DRAFT_MODEL` is fetched unconditionally, so every invocation reaches out to HuggingFace for `Inferact/MiniMax-M3-EAGLE3`.

How sister scripts handle this

All three existing sister MTP recipes wrap both downloads under the same guard, with an explicit comment documenting the invariant; see `minimaxm3_fp8_mi355x_mtp.sh:47-53`. `minimaxm3_fp8_mi300x_mtp.sh:40-44` and `minimaxm3_fp8_mi325x_mtp.sh:52-56` use the identical pattern with the same comment. The convention is clear: a local-path `MODEL` is the offline-pre-staged signal. When it is set, the runner has no business hitting HF for anything.

Why this matters

On a runner with `MODEL=/some/local/path` and no HF network/auth (the exact scenario the local-path mode is designed for), the unconditional `hf download "$DRAFT_MODEL"` on line 30 will fail. The other MTP recipes in the family correctly skip the draft fetch in that case, under the assumption that the offline staging includes the draft.

Step-by-step proof

1. Pre-stage `amd/MiniMax-M3-MXFP4` and `Inferact/MiniMax-M3-EAGLE3` to a local path `/staged/models/...` for an air-gapped runner.
2. Set `MODEL=/staged/models/amd/MiniMax-M3-MXFP4` and run the recipe.
3. `"$MODEL" != /*` is false (`MODEL` does start with `/`), so the target download is skipped, which is correct.
4. `hf download "$DRAFT_MODEL"` runs unconditionally and attempts to reach `huggingface.co` for `Inferact/MiniMax-M3-EAGLE3`.
5. Compare with `minimaxm3_fp8_mi355x_mtp.sh` in the same setup: its `if` block is fully skipped because `MODEL` is a local path, no network call is made, and serving proceeds.

Practical impact

Limited in current usage: the AMD master config passes `amd/MiniMax-M3-MXFP4` as a bare HF id, so the `!= /*` branch is taken and both downloads run together. The divergence only manifests when someone runs in offline-staged mode, which isn't the current CI path. Still worth fixing for family consistency, since the pattern (and its comment) is established across all sister MTP scripts.

Fix

Move `hf download "$DRAFT_MODEL"` inside the existing `if` block, matching the existing 3-script pattern.
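
A minimal sketch of that change, under stated assumptions: the guard and the two `hf download` calls come from the recipe, while the `fetch_models` wrapper is a hypothetical name introduced here so the guard can be exercised in isolation. In the recipe itself the `if` block would sit inline, and the comment wording below is illustrative rather than copied from the sister scripts.

```shell
#!/usr/bin/env bash
DRAFT_MODEL="Inferact/MiniMax-M3-EAGLE3"

# fetch_models is a hypothetical wrapper used only for this illustration.
fetch_models() {
    local model="$1" draft="$2"
    # A local-path MODEL (leading "/") is treated as the offline pre-staged
    # signal, so neither the target nor the draft is fetched in that case.
    if [[ "$model" != /* ]]; then
        hf download "$model"
        hf download "$draft"
    fi
}
```

With this shape, `fetch_models /staged/models/amd/MiniMax-M3-MXFP4 "$DRAFT_MODEL"` makes no `hf` call at all, while `fetch_models amd/MiniMax-M3-MXFP4 "$DRAFT_MODEL"` fetches the target and then the draft, the same order the recipe uses today.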