Add GLM-5-FP8 GB300 multinode dynamo-sglang MTP benchmark #1907
hshrivastava-droid wants to merge 4 commits into
- nvidia-master.yaml: add glm5-fp8-gb300-dynamo-sglang-mtp (14 topologies
across 1k/1k and 8k/1k; prefill TP4 + decode wide-EP DEP16/24/32/40/48/56
high-throughput and per-node TP4 low-latency, all with spec-decoding: mtp).
- 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/
glm5/gb300-fp8/{8k1k,1k1k}/disagg/mtp/, mirroring the existing stp/ siblings
with EAGLE speculative decoding (num-steps 2, eagle-topk 1, num-draft-tokens 3).
- perf-changelog: entry for the new config.
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
benchmark:
  type: sa-bench
  req_rate: inf
  isl: 1024
  osl: 1024
  concurrencies: '8192'
🔴 All 14 new MTP recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{1k1k,8k1k}/disagg/mtp/ omit use_chat_template: true in the benchmark: block, which AGENTS.md and .github/workflows/claude-pr-review.yml explicitly mandate for MTP benchmarks. Without it the benchmark measures EAGLE acceptance against raw prompts instead of chat-formatted inputs, silently regressing the reported acceptance rate and making these numbers not comparable to other MTP benchmarks in the repo. Fix: add use_chat_template: true under the benchmark: block in each of the 14 new files (matching every existing sglang multi-node MTP recipe under dsr1/b200-fp4/8k1k/disagg/mtp/ and deepseek-v4/8k1k/disagg-*-mtp.yaml).
Extended reasoning...
What the bug is
Every MTP YAML in this PR (14 files under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{1k1k,8k1k}/disagg/mtp/) ends with a benchmark: block of the shape:
benchmark:
  type: sa-bench
  req_rate: inf
  isl: <isl>
  osl: 1024
  concurrencies: '<N>'

There is no use_chat_template: true field. The PR description states these files are 'byte-identical to the existing stp/ siblings except for the EAGLE speculative-decoding flags on the decode block', and the STP siblings correctly omit chat-template (raw-prompt input is fine for non-spec-decoding STP). The copy carried that omission into MTP, where it is incorrect.
Why this is mandatory for MTP
AGENTS.md:56 says verbatim: 'MTP scripts MUST pass --use-chat-template to run_benchmark_serving — EAGLE-style spec decoding is trained against chat-formatted inputs; benchmarking against raw prompts silently regresses acceptance rate.' The repository's own automated review at .github/workflows/claude-pr-review.yml:280-296 enforces the same rule: 'MTP benchmarks MUST include the --use-chat-template flag in the benchmark client configuration.'
For multi-node recipes consumed by sa-bench, the YAML key use_chat_template: true under the benchmark: block is the equivalent of the shell --use-chat-template flag — sa-bench plumbs the field through benchmark_lib.sh into benchmark_serving.py, where it gates tokenizer.apply_chat_template formatting of prompts.
Why existing code doesn't prevent it
Nothing in the loader or runtime cross-checks use_chat_template against the presence of speculative-algorithm: EAGLE in the decode block. The omission is silent — the benchmark runs, produces numbers, and the only visible signal is a lower acceptance rate than the model is actually capable of.
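The missing cross-check could look roughly like the sketch below. It is not code from this repository: the function names are hypothetical, the sample recipe's nesting under `backend:` is an assumption, and the parser is a deliberately simple line-based one (standard library only) that relies on `benchmark:` being a top-level key with flat `key: value` children, as in the snippets quoted in this review.

```python
import re

def is_mtp_recipe(text: str) -> bool:
    """True if any line sets speculative-algorithm to EAGLE."""
    pattern = r'^\s*speculative-algorithm:\s*["\']?EAGLE["\']?\s*$'
    return re.search(pattern, text, re.MULTILINE) is not None

def benchmark_block(text: str) -> dict:
    """Collect `key: value` pairs indented under a top-level `benchmark:` key."""
    fields = {}
    inside = False
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        if not line[0].isspace():
            # Any top-level key either opens or closes the benchmark block.
            inside = line.rstrip() == "benchmark:"
            continue
        if inside and ":" in line:
            key, _, value = line.strip().partition(":")
            fields[key.strip()] = value.strip().strip("'\"")
    return fields

def missing_chat_template(text: str) -> bool:
    """True for an EAGLE recipe whose benchmark block lacks use_chat_template: true."""
    if not is_mtp_recipe(text):
        return False
    return benchmark_block(text).get("use_chat_template", "false").lower() != "true"

# Hypothetical recipe excerpt; the nesting under `backend:` is assumed.
RECIPE = """\
backend:
  decode:
    speculative-algorithm: "EAGLE"
    speculative-num-steps: 2
benchmark:
  type: sa-bench
  req_rate: inf
  isl: 1024
  osl: 1024
  concurrencies: '8192'
"""

FIXED = RECIPE + "  use_chat_template: true\n"
STP = RECIPE.replace('    speculative-algorithm: "EAGLE"\n', "")
```

Here `missing_chat_template(RECIPE)` flags the recipe, while `FIXED` and the non-speculative `STP` variant pass. A check of this kind could run over the recipe directory in CI, though a real implementation would more likely use a proper YAML parser.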
Precedent
Every existing sglang multi-node MTP recipe in the repo sets this field:
- benchmarks/multi_node/srt-slurm-recipes/sglang/dsr1/b200-fp4/8k1k/disagg/mtp/*.yaml: all 6 files
- benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-*-mtp.yaml: all -mtp variants
- All single-node GLM-5 MTP scripts (glm5_fp8_b300_mtp.sh, glm5_fp8_b200_mtp.sh, glm5_fp8_h200_mtp.sh, etc.) pass --use-chat-template
GLM-5 specifically requires chat-template formatting for EAGLE to perform as intended — the per-platform MTP scripts in this repo already encode that rule.
Step-by-step proof
Take 1k1k_mtp_hightpt_0.yaml in this PR (lines 147-152):
1. The decode: block sets speculative-algorithm: "EAGLE", speculative-num-steps: 2, speculative-eagle-topk: 1, speculative-num-draft-tokens: 3, so MTP is on.
2. The benchmark: block is type: sa-bench / req_rate: inf / isl: 1024 / osl: 1024 / concurrencies: '8192', with no use_chat_template key.
3. When sa-bench launches, benchmark_serving.py reads use_chat_template (defaults to false) and skips tokenizer.apply_chat_template(...). Prompts are sent to the GLM-5 server in raw form.
4. The EAGLE draft head was trained on chat-formatted token sequences; raw-prompt distribution shift drops draft-token acceptance.
5. The reported acceptance rate is silently lower than the model's true capability, and not comparable to other MTP benchmarks in the repo (dsr1, deepseek-v4), which all measure against chat-formatted prompts.
Repeat verbatim for the other 13 files; same structure, same omission.
Fix
Add one line under the benchmark: block of each of the 14 new YAMLs:
benchmark:
  type: sa-bench
  req_rate: inf
  isl: <isl>
  osl: 1024
  concurrencies: '<N>'
  use_chat_template: true

Files to update:
- 1k1k/disagg/mtp/1k1k_mtp_hightpt_{0,1,2,3,4}.yaml
- 1k1k/disagg/mtp/1k1k_mtp_lowlat_{0,1}.yaml
- 8k1k/disagg/mtp/8k1k_mtp_hightpt_{0,1,2,3}.yaml
- 8k1k/disagg/mtp/8k1k_mtp_lowlat_{0,1,2}.yaml
backend:
  prefill_environment:
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800'
    PYTHONUNBUFFERED: '1'
    DYN_SKIP_SGLANG_LOG_FORMATTING: '1'
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000'
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000'
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000'
    MC_TE_METRIC: 'true'
    MC_FORCE_MNNVL: '1'
    NCCL_MNNVL_ENABLE: '1'
    NCCL_CUMEM_ENABLE: '1'
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True'
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0'
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1'
    DYN_REQUEST_PLANE: nats
🔴 All 14 new GB300 MTP recipe YAMLs omit SGLANG_ENABLE_SPEC_V2: '1' from both prefill_environment and decode_environment blocks, even though every other GLM-5 / SGLang MTP path in this repo (every existing dsr1/b200-fp4/{1k1k,8k1k}/disagg/mtp/*.yaml recipe, every single-node *_mtp.sh launcher including benchmarks/single_node/fixed_seq_len/glm5_fp8_b300_mtp.sh:39, and benchmarks/multi_node/amd_utils/env.sh:156) sets it explicitly. runners/launch_gb300-nv.sh does not inject it either, so the recipe YAML is the only entry point — without it, EAGLE on lmsysorg/sglang:v0.5.11-cu130 will run via the legacy spec-decoding path (or silently no-op with the NSA + DeepEP + DPA decode topology), producing decode behavior inconsistent with every other validated MTP benchmark in the repo and invalidating the new measurements. Fix: add SGLANG_ENABLE_SPEC_V2: '1' to both env blocks in every new MTP recipe (matching the dsr1 MTP precedent).
Extended reasoning...
What the bug is
All 14 new MTP recipe YAMLs added by this PR (benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{1k1k,8k1k}/disagg/mtp/*.yaml) are missing the SGLANG_ENABLE_SPEC_V2: '1' environment variable in both prefill_environment and decode_environment blocks. This env var is the documented enablement gate for EAGLE/MTP speculative decoding in SGLang across this repo.
Why this is a bug — overwhelming precedent
Every other MTP launch path in this repo sets this variable explicitly:
- Every existing multinode SGLang MTP recipe under benchmarks/multi_node/srt-slurm-recipes/sglang/dsr1/b200-fp4/{1k1k,8k1k}/disagg/mtp/*.yaml sets SGLANG_ENABLE_SPEC_V2: '1' in both prefill and decode env blocks (e.g. dsr1/b200-fp4/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml:48 and :66; 20 hits across 10 files).
- Every single-node GLM-5 MTP launcher: benchmarks/single_node/fixed_seq_len/glm5_fp8_b300_mtp.sh:39 runs export SGLANG_ENABLE_SPEC_V2=1 immediately before sglang.launch_server --speculative-algorithm EAGLE. Same in glm5_fp8_b200_mtp.sh, glm5_fp8_mi355x_mtp.sh, glm5_fp4_b300_mtp.sh, glm5_fp4_b200_mtp.sh.
- AMD multi-node env: benchmarks/multi_node/amd_utils/env.sh:156 exports it.
- perf-changelog.yaml describes every prior GLM-5 MTP entry (b300/b200/mi355x FP8 and FP4 variants) verbatim as "adds EAGLE speculative decoding ... behind SGLANG_ENABLE_SPEC_V2=1"; this is the maintainer-documented contract.
Why existing code doesn't catch it
runners/launch_gb300-nv.sh contains zero references to SPEC_V2, speculative, or MTP; it only runs srtctl apply on the recipe YAML. The recipe YAML's prefill_environment/decode_environment blocks are the only place SGLang env vars reach the worker containers on this launch path. A missing entry is not silently filled in.
Root cause (confirmed by PR description)
The PR description states the new recipes are "byte-identical to the existing stp/ siblings except for the EAGLE speculative-decoding flags on the decode block." The STP siblings don't need this env var (no spec decoding), so the copy carried the STP environment forward and the new EAGLE-specific env var was never added. Spot-check: diff stp/1k1k_stp_hightpt_0.yaml mtp/1k1k_mtp_hightpt_0.yaml shows the new MTP file is byte-identical to STP except for the name change and the four speculative-* keys appended to the decode block.
Step-by-step proof of impact
1. CI invokes launch_gb300-nv.sh for glm5-fp8-gb300-dynamo-sglang-mtp.
2. The launcher copies recipes/sglang/glm5/gb300-fp8/.../mtp/*.yaml into srt-slurm and runs srtctl apply. It does not inject SGLANG_ENABLE_SPEC_V2.
3. srtctl reads prefill_environment/decode_environment from the YAML and exports them into the worker containers. Neither block contains SGLANG_ENABLE_SPEC_V2.
4. SGLang v0.5.11-cu130 starts with --speculative-algorithm EAGLE but without SGLANG_ENABLE_SPEC_V2=1, so it routes EAGLE through the legacy v1 spec-decoding code path (or silently disables spec for the NSA + DeepEP + DPA decode topology, since v2 is the implementation that supports this combination in v0.5.11).
5. The benchmark completes and publishes throughput/latency numbers, but they are measuring a different decode code path than every other GLM-5 MTP entry in perf-changelog.yaml, and different from the GB300 single-node sibling glm5_fp8_b300_mtp.sh.
The whole point of -mtp is to measure EAGLE MTP performance; without SPEC_V2=1 the published numbers do not represent the intended config, defeating the purpose of the entry and breaking the apples-to-apples comparison with the existing MTP benchmarks.
Fix
Add SGLANG_ENABLE_SPEC_V2: '1' to both prefill_environment and decode_environment in every new MTP recipe (28 env blocks across 14 files), matching the dsr1 MTP recipes exactly.
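A quick way to confirm that all 28 env blocks carry the variable is a small checker along these lines. This is only a sketch, not repository code: `env_blocks_missing` is a hypothetical helper, it assumes each env block is a run of more deeply indented `KEY: 'value'` lines under a `prefill_environment:` or `decode_environment:` key, and it uses line-based parsing rather than a YAML library.

```python
def env_blocks_missing(text, key="SGLANG_ENABLE_SPEC_V2",
                       blocks=("prefill_environment", "decode_environment")):
    """Return the names of env blocks that do not set `key` to '1'."""
    found = {name: False for name in blocks}
    current = None       # env block currently being scanned, if any
    block_indent = -1    # indentation of that block's header line
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        indent = len(line) - len(line.lstrip())
        if current is not None and indent <= block_indent:
            current = None  # a dedent back to header level closes the block
        name = stripped.rstrip(":")
        if stripped.endswith(":") and name in found:
            current, block_indent = name, indent
        elif current is not None and ":" in stripped:
            entry, _, value = stripped.partition(":")
            if entry.strip() == key and value.strip().strip("'\"") == "1":
                found[current] = True
    return [name for name in blocks if not found[name]]

# Hypothetical excerpt: prefill sets the variable, decode does not.
SAMPLE = """\
backend:
  prefill_environment:
    PYTHONUNBUFFERED: '1'
    SGLANG_ENABLE_SPEC_V2: '1'
  decode_environment:
    PYTHONUNBUFFERED: '1'
"""

print(env_blocks_missing(SAMPLE))  # ['decode_environment']
```

Running it over each of the 14 recipe files and requiring an empty list would cover both blocks per file.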
- config-keys:
    - glm5-fp8-gb300-dynamo-sglang-mtp
  description:
    - "Add GLM-5-FP8 GB300 multinode dynamo-sglang benchmark with EAGLE MTP speculative decoding"
🟡 The new perf-changelog entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — looks like the XXX placeholder was never replaced with this PR's number (1907). Every other entry in the file points at a real PR (the entry immediately above uses pull/1897), so after merge this one would render as a broken link in the changelog. Please update line 4122 to pull/1907.
Extended reasoning...
What the bug is. The new glm5-fp8-gb300-dynamo-sglang-mtp entry added at the bottom of perf-changelog.yaml ends with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

The literal token XXX is clearly a placeholder the author intended to replace with the actual PR number before opening the PR. The PR metadata confirms this is PR #1907.

Why this is unambiguously a bug. Every other pr-link in this file points at a concrete PR number: the entry immediately preceding this one uses pull/1897, and earlier entries in the recent additions use 1888, 1893, etc. The file's convention is real, merged PR numbers; XXX violates that convention and is not interpretable as anything other than an unreplaced template token.

Impact. After merge, GitHub will resolve https://github.com/SemiAnalysisAI/InferenceX/pull/XXX to a 404 (no PR numbered XXX exists or can exist; PR numbers are integers). Anyone browsing the changelog to find the context for this benchmark addition will hit a dead link. Runtime/benchmark behavior is unaffected, so this is a documentation/cosmetic problem only (hence "nit" severity), but it is a real, actionable issue that should be fixed before merge.

Step-by-step proof.
1. Open perf-changelog.yaml at line 4122 (the last line of the diff).
2. Observe the value: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
3. Cross-check the PR metadata in the PR description / GitHub UI: this is PR #1907.
4. Cross-check the preceding entry at line ~4117: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1897, a numeric PR ID, matching the file's convention.
5. Try the URL with XXX in a browser → GitHub returns a 404 because XXX is not a valid PR identifier.

Fix. One-character (well, three-character) edit:

- pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1907
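Placeholder links like this are easy to catch mechanically. The sketch below is a hypothetical lint, not something that exists in the repository; it assumes every `pr-link:` value sits on a single line and should end in `/pull/<integer>`, and the first entry in the sample changelog is invented for illustration.

```python
import re

PR_LINK_LINE = re.compile(r"^\s*pr-link:\s*(\S+)\s*$")
VALID_PR_URL = re.compile(r"^https://github\.com/[^/]+/[^/]+/pull/\d+$")

def bad_pr_links(text):
    """Return (line_number, url) for each pr-link whose PR id is not an integer."""
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = PR_LINK_LINE.match(line)
        if match and not VALID_PR_URL.match(match.group(1)):
            bad.append((number, match.group(1)))
    return bad

# Hypothetical changelog excerpt; "some-existing-config" is a made-up key.
CHANGELOG = """\
- config-keys:
    - some-existing-config
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1897
- config-keys:
    - glm5-fp8-gb300-dynamo-sglang-mtp
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
"""

for number, url in bad_pr_links(CHANGELOG):
    print(f"line {number}: placeholder pr-link {url}")
```

On the sample this reports only line 6, the entry with the `XXX` placeholder.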
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28058155782
…5 MTP decode

Sweep on 7f7d765 hit `Assertion error /build/DeepEP/csrc/deep_ep.cpp:1233 'x.size(0) <= num_max_dispatch_tokens_per_rank'` during CUDA-graph capture on the wide-EP decode configs (TP16/EP16, TP32/EP32, TP40/EP40). The old comment sized the buffer for ceil(cuda_graph_max_bs / dp_size) and ignored MTP's speculative_num_draft_tokens=3 multiplier: capture-time per-rank tokens (cuda_graph_max_bs * num_draft_tokens under DP-attention) overflowed the 512 buffer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
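The sizing mismatch described in that commit message can be made concrete with a little arithmetic. The sketch below uses the two formulas and the 512-token buffer from the message; the `cuda_graph_max_bs` value of 256 and the DP sizes are hypothetical, chosen only to illustrate the overflow, since the actual recipe values are not quoted here.

```python
import math

BUFFER_TOKENS = 512      # num_max_dispatch_tokens_per_rank from the commit message
NUM_DRAFT_TOKENS = 3     # speculative-num-draft-tokens in the MTP recipes
CUDA_GRAPH_MAX_BS = 256  # hypothetical value, not taken from the recipes

def old_estimate(cuda_graph_max_bs: int, dp_size: int) -> int:
    """Per-rank tokens as the old comment sized them (no draft multiplier)."""
    return math.ceil(cuda_graph_max_bs / dp_size)

def capture_time_tokens(cuda_graph_max_bs: int, num_draft_tokens: int) -> int:
    """Per-rank tokens at CUDA-graph capture under DP-attention, per the commit message."""
    return cuda_graph_max_bs * num_draft_tokens

for dp_size in (16, 32, 40):  # assumed to match the wide-EP decode configs
    estimate = old_estimate(CUDA_GRAPH_MAX_BS, dp_size)
    actual = capture_time_tokens(CUDA_GRAPH_MAX_BS, NUM_DRAFT_TOKENS)
    print(f"dp={dp_size}: old estimate {estimate}, capture-time {actual}, "
          f"fits in buffer: {actual <= BUFFER_TOKENS}")
```

With these assumed numbers the old estimate stays far below 512 (16, 8, and 7 tokens), while the capture-time count is 768 regardless of DP size, which is the kind of gap that would trip the DeepEP assertion.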
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28188192700
Summary
Adds glm5-fp8-gb300-dynamo-sglang-mtp, the EAGLE multi-token-prediction sibling of the existing glm5-fp8-gb300-dynamo-sglang (STP) GB300 disagg benchmark.
- Recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{8k1k,1k1k}/disagg/mtp/, byte-identical to the existing stp/ siblings except for the EAGLE speculative-decoding flags on the decode block (speculative-algorithm: EAGLE, num-steps 2, eagle-topk 1, num-draft-tokens 3).
- lmsysorg/sglang:v0.5.11-cu130 (same as STP entry).
- launch_gb300-nv.sh already handles glm5-fp8 (MODEL_PATH + recipes-copy); no launch-script change needed.