Merged
30 changes: 30 additions & 0 deletions .github/configs/amd-master.yaml
@@ -2667,6 +2667,36 @@ minimaxm3-fp4-mi355x-vllm:
      - { tp: 4, conc-start: 1, conc-end: 128 }
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 512 }

# EAGLE3 speculative-decoding variant of minimaxm3-fp4-mi355x-vllm. Pair the
# amd/MiniMax-M3-MXFP4 target with Inferact/MiniMax-M3-EAGLE3 and three draft
# tokens. Search space mirrors the MI355X MXFP8 MTP entry, trimming the base
# FP4 sweep at extreme concurrency where speculative decoding loses value.
minimaxm3-fp4-mi355x-vllm-mtp:
  image: vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e
  model: amd/MiniMax-M3-MXFP4
  model-prefix: minimaxm3
  runner: mi355x
  precision: fp4
  framework: vllm
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 1, conc-end: 64, spec-decoding: mtp }
      - { tp: 8, ep: 8, conc-start: 1, conc-end: 256, spec-decoding: mtp }
      - { tp: 4, conc-start: 1, conc-end: 64, spec-decoding: mtp }
      - { tp: 4, ep: 4, conc-start: 64, conc-end: 256, spec-decoding: mtp }
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 512, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 1, conc-end: 64, spec-decoding: mtp }
      - { tp: 8, ep: 8, conc-start: 1, conc-end: 256, spec-decoding: mtp }
      - { tp: 4, conc-start: 1, conc-end: 64, spec-decoding: mtp }
      - { tp: 8, ep: 8, dp-attn: true, conc-start: 128, conc-end: 256, spec-decoding: mtp }

# MiniMax-M3 MXFP4 MI355X atom recipe:
# https://github.com/ROCm/ATOM/blob/5d42d49f9e4292e5b61475917e92e7ec1b1dacb7/recipes/MiniMax-M3.md
# block size 128 is mandatory for MSA. TP4 on a single gfx950 node, per the recipe.
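
Each `search-space` row in the new `minimaxm3-fp4-mi355x-vllm-mtp` entry bounds a concurrency sweep with `conc-start` and `conc-end`. The diff does not show how the harness steps between the two bounds, so the sketch below simply assumes an inclusive range that doubles at each step. The helper name `conc_points` is hypothetical and is not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): list the concurrency values for
# one search-space row.
# ASSUMPTION: the range is inclusive and doubles each step; the real stepping
# is defined by the benchmark harness, which this diff does not show.
conc_points() {
  local conc="$1" end="$2"
  while [ "$conc" -le "$end" ]; do
    printf '%s\n' "$conc"
    conc=$((conc * 2))
  done
}

# Row: { tp: 4, ep: 4, conc-start: 64, conc-end: 256, spec-decoding: mtp }
conc_points 64 256
```

Under that assumption, the 1k1k rows hand off from plain TP at low concurrency to EP and then DP attention as concurrency grows, with adjacent rows sharing their boundary values.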
@@ -0,0 +1,98 @@
#!/usr/bin/env bash

# MiniMax-M3 MXFP4 MI355X (gfx950) single-node vLLM recipe with EAGLE3
# speculative decoding. This is the spec-decoding=mtp variant of
# minimaxm3_fp4_mi355x_vllm.sh and uses three speculative tokens from
# Inferact/MiniMax-M3-EAGLE3. The pinned nightly includes upstream AMD
# MiniMax-M3 SupportsEagle3 support, so no runtime model patch is needed.

source "$(dirname "$0")/../../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    EP_SIZE \
    DP_ATTENTION \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

DRAFT_MODEL="Inferact/MiniMax-M3-EAGLE3"

if [[ -n "$SLURM_JOB_ID" ]]; then
  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
hf download "$DRAFT_MODEL"
Comment on lines +29 to +30

🟡 The draft-model fetch on line 30 is unconditional, but the three sister MTP recipes (minimaxm3_fp8_mi300x_mtp.sh:40-44, minimaxm3_fp8_mi325x_mtp.sh:52-56, minimaxm3_fp8_mi355x_mtp.sh:49-53) all wrap both hf download "$MODEL" and hf download "$DRAFT_MODEL" inside the same if [[ "$MODEL" != /* ]]; then ... fi guard, alongside a comment on how the weights are staged, so a local-path MODEL (the offline-pre-staged case) skips every HuggingFace fetch. The new script breaks that invariant: a runner with MODEL set to a local path still contacts HuggingFace to pull the EAGLE3 draft, which fails on offline-staged runners. Fix: move hf download "$DRAFT_MODEL" inside the existing if block to match the family pattern.

Extended reasoning...

What the bug is

The new minimaxm3_fp4_mi355x_vllm_mtp.sh recipe handles the target and draft model downloads inconsistently:

# Line 29-30
if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
hf download "$DRAFT_MODEL"

$MODEL is gated on the "not a local path" check, but $DRAFT_MODEL is fetched unconditionally — every invocation reaches out to HuggingFace for Inferact/MiniMax-M3-EAGLE3.

How sister scripts handle this

All three existing sister MTP recipes wrap both downloads under the same guard, with an explicit comment documenting the invariant. From minimaxm3_fp8_mi355x_mtp.sh:47-53:

# MODEL stays a bare HF id on the mi355x single-node runner (weights are
# pre-staged in the mounted NFS HF cache, so this is a fast cache hit). The
# EAGLE3 draft is not staged; fetch it into the same cache.
if [[ "$MODEL" != /* ]]; then
  hf download "$MODEL"
  hf download "$DRAFT_MODEL"
fi

minimaxm3_fp8_mi300x_mtp.sh:40-44 and minimaxm3_fp8_mi325x_mtp.sh:52-56 use the identical pattern with the same comment. The convention is clear: a local-path MODEL is the offline-pre-staged signal — when it's set, the runner has no business hitting HF for anything.

Why this matters

On a runner with MODEL=/some/local/path and no HF network/auth (the exact scenario the local-path mode is designed for), the unconditional hf download "$DRAFT_MODEL" on line 30 will fail. The other MTP recipes in the family correctly skip the draft fetch in that case under the assumption that the offline staging includes the draft.

Step-by-step proof

  1. Operator pre-stages both amd/MiniMax-M3-MXFP4 and Inferact/MiniMax-M3-EAGLE3 to a local path /staged/models/... for an air-gapped runner.
  2. They set MODEL=/staged/models/amd/MiniMax-M3-MXFP4 and run the recipe.
  3. Line 29: "$MODEL" != /* is false (MODEL does start with /), so the target download is skipped — correct.
  4. Line 30: hf download "$DRAFT_MODEL" runs unconditionally and attempts to reach huggingface.co for Inferact/MiniMax-M3-EAGLE3.
  5. On an offline runner: command fails with a network error. On a runner without HF auth for that repo: command fails with 401.
  6. Compare with running minimaxm3_fp8_mi355x_mtp.sh in the same setup — its if block is fully skipped because MODEL is a local path, no network call is made, and serving proceeds.

Practical impact

Limited in current usage — the AMD master config passes amd/MiniMax-M3-MXFP4 as a bare HF id, so the != /* branch is taken and both downloads run together. The divergence only manifests when someone runs in offline-staged mode, which isn't the current CI path. Still worth fixing for family consistency since the pattern (and its comment) is established across all sister MTP scripts.

Fix

Move hf download "$DRAFT_MODEL" inside the existing if block — matching the existing 3-script pattern:

if [[ "$MODEL" != /* ]]; then
  hf download "$MODEL"
  hf download "$DRAFT_MODEL"
fi
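
A minimal sketch of what the guarded form does, assuming the family convention described above (a MODEL value that starts with `/` means the weights are pre-staged and nothing should be fetched). The helper name `models_to_fetch` is hypothetical; it only prints the ids that would be handed to `hf download` and makes no network call.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print the model ids that the
# guarded block would download, one per line.
models_to_fetch() {
  local model="$1" draft="$2"
  # Same test as the recipe: an absolute path means offline staging, so
  # neither the target nor the draft is fetched.
  if [[ "$model" != /* ]]; then
    printf '%s\n' "$model" "$draft"
  fi
}

# Bare HF id (the current CI path): both models are listed.
models_to_fetch "amd/MiniMax-M3-MXFP4" "Inferact/MiniMax-M3-EAGLE3"

# Local path (offline-staged runner): nothing is listed.
models_to_fetch "/staged/models/amd/MiniMax-M3-MXFP4" "Inferact/MiniMax-M3-EAGLE3"
```

The first call prints both ids and the second prints nothing, which is the behavior the sister recipes already have.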


if [ -n "$ROCR_VISIBLE_DEVICES" ]; then
  export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES"
fi

SERVER_LOG=/workspace/server.log
export VLLM_ENGINE_READY_TIMEOUT_S=3600
export VLLM_USE_BREAKABLE_CUDAGRAPH=0

if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
fi

PARALLEL_ARGS=(--tensor-parallel-size "$TP")
if [ "${DP_ATTENTION}" = "true" ]; then
  PARALLEL_ARGS=(
    --tensor-parallel-size 1
    --data-parallel-size "$TP"
    --enable-expert-parallel
  )
elif [ "$EP_SIZE" -gt 1 ]; then
  PARALLEL_ARGS+=(--enable-expert-parallel)
fi

NUM_SPEC_TOKENS=3

start_gpu_monitor

set -x
vllm serve "$MODEL" --port "$PORT" \
    "${PARALLEL_ARGS[@]}" \
    --trust-remote-code \
    --block-size 128 \
    --no-enable-prefix-caching \
    --language-model-only \
    --max-model-len "$MAX_MODEL_LEN" \
    --attention-backend TRITON_ATTN \
    --speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \
    --tool-call-parser minimax_m3 \
    --enable-auto-tool-choice \
    --reasoning-parser minimax_m3 > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code \
    --use-chat-template

if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary
fi

stop_gpu_monitor
set +x
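
The `PARALLEL_ARGS` branch in the script above maps each search-space row to a set of `vllm serve` flags. The sketch below isolates that mapping so it can be checked without launching a server. The function name `parallel_args` is hypothetical, and it assumes the harness exports `EP_SIZE=1` for rows that do not set `ep`, which this diff does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): print the parallelism flags the
# recipe would pass to `vllm serve`, joined on one line.
# ASSUMPTION: EP_SIZE is 1 when a search-space row has no `ep` key.
parallel_args() {
  local tp="$1" ep_size="$2" dp_attention="$3"
  local -a args=(--tensor-parallel-size "$tp")
  if [ "$dp_attention" = "true" ]; then
    # DP attention: TP collapses to 1 and the GPU count becomes the number
    # of data-parallel replicas, with expert parallelism enabled.
    args=(--tensor-parallel-size 1 --data-parallel-size "$tp" --enable-expert-parallel)
  elif [ "$ep_size" -gt 1 ]; then
    args+=(--enable-expert-parallel)
  fi
  printf '%s\n' "${args[*]}"
}

parallel_args 8 1 false   # { tp: 8 }
parallel_args 8 8 false   # { tp: 8, ep: 8 }
parallel_args 8 8 true    # { tp: 8, ep: 8, dp-attn: true }
```

Note that the script only tests whether `EP_SIZE` exceeds 1; its actual value is never passed to vLLM, so `ep: 4` and `ep: 8` both reduce to `--enable-expert-parallel`.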
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -4213,3 +4213,11 @@
    - "Image: lmsysorg/sglang:nightly-dev-cu13-20260608-303757cc."
    - "8k/1k STP recipes: 1P1D TP4 (conc 1-256), 5P1D DEP4+1D DEP16 (conc 2048, NIXL), 6P1D and 7P1D DEP4+1D DEP16 (conc 5120, Mooncake)."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1921

- config-keys:
    - minimaxm3-fp4-mi355x-vllm-mtp
  description:
    - "Add a MiniMax-M3 MXFP4 MI355X vLLM benchmark with EAGLE3 speculative decoding using amd/MiniMax-M3-MXFP4 and Inferact/MiniMax-M3-EAGLE3 with three speculative tokens."
    - "Reuse the pinned vllm/vllm-openai-rocm:nightly-3f5a1e1733200760169ff31ebe60a271072b199e image, text-only target path, TRITON_ATTN, automatic tool choice, MiniMax-M3 parsers, VLLM_USE_BREAKABLE_CUDAGRAPH=0, default KV-cache dtype, and automatic MoE backend selection."
    - "Pass --use-chat-template for MTP acceptance and mirror the existing MiniMax-M3 MXFP8 MI355X MTP TP/EP/DP-attention search space at 1k1k and 8k1k."
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1939
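
The EAGLE3 settings named in this changelog entry reach vLLM as a JSON string in `--speculative-config`, which the recipe builds inline with escaped quotes. As an illustrative alternative (not what the PR does), the same string can be assembled once with `printf`; the variable name `SPEC_CONFIG` is an assumption introduced here.

```shell
#!/usr/bin/env bash
# Values taken from the recipe in this PR.
DRAFT_MODEL="Inferact/MiniMax-M3-EAGLE3"
NUM_SPEC_TOKENS=3

# SPEC_CONFIG is a hypothetical variable name. printf fills in the draft
# model id and the token count, so no backslash-escaped quotes are needed.
SPEC_CONFIG="$(printf '{"method": "eagle3", "model": "%s", "num_speculative_tokens": %d}' \
  "$DRAFT_MODEL" "$NUM_SPEC_TOKENS")"

echo "$SPEC_CONFIG"
```

The result is byte-for-byte the string the recipe passes, so it could be supplied as `--speculative-config "$SPEC_CONFIG"` without changing behavior.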