Add GLM-5-FP8 GB300 multinode dynamo-sglang MTP benchmark #1907

Open
hshrivastava-droid wants to merge 4 commits into main from nv/glm5-fp8-gb300-nv2
@hshrivastava-droid

Collaborator

Summary

Adds glm5-fp8-gb300-dynamo-sglang-mtp — the EAGLE multi-token-prediction sibling of the existing glm5-fp8-gb300-dynamo-sglang (STP) GB300 disagg benchmark.

  • 14 search-space points across 1k/1k and 8k/1k (8k/1k: 4 high-throughput + 3 low-latency; 1k/1k: 5 high-throughput + 2 low-latency): prefill TP4 + decode wide-EP DEP16/24/32/40/48/56 high-throughput and per-node TP4 low-latency.
  • 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{8k1k,1k1k}/disagg/mtp/, byte-identical to the existing stp/ siblings except for the EAGLE speculative-decoding flags on the decode block (speculative-algorithm: EAGLE, num-steps 2, eagle-topk 1, num-draft-tokens 3).
  • Image: lmsysorg/sglang:v0.5.11-cu130 (same as STP entry).
  • perf-changelog: entry for the new config.

launch_gb300-nv.sh already handles glm5-fp8 (MODEL_PATH + recipes-copy) — no launch-script change needed.

- nvidia-master.yaml: add glm5-fp8-gb300-dynamo-sglang-mtp (14 topologies
  across 1k/1k and 8k/1k; prefill TP4 + decode wide-EP DEP16/24/32/40/48/56
  high-throughput and per-node TP4 low-latency, all with spec-decoding: mtp).
- 14 split recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/
  glm5/gb300-fp8/{8k1k,1k1k}/disagg/mtp/, mirroring the existing stp/ siblings
  with EAGLE speculative decoding (num-steps 2, eagle-topk 1, num-draft-tokens 3).
- perf-changelog: entry for the new config.
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.



Comment on lines +147 to +152
benchmark:
  type: sa-bench
  req_rate: inf
  isl: 1024
  osl: 1024
  concurrencies: '8192'


🔴 All 14 new MTP recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{1k1k,8k1k}/disagg/mtp/ omit use_chat_template: true in the benchmark: block, which AGENTS.md and .github/workflows/claude-pr-review.yml explicitly mandate for MTP benchmarks. Without it the benchmark measures EAGLE acceptance against raw prompts instead of chat-formatted inputs, silently regressing the reported acceptance rate and making these numbers not comparable to other MTP benchmarks in the repo. Fix: add use_chat_template: true under the benchmark: block in each of the 14 new files (matching every existing sglang multi-node MTP recipe under dsr1/b200-fp4/8k1k/disagg/mtp/ and deepseek-v4/8k1k/disagg-*-mtp.yaml).

Extended reasoning...

What the bug is

Every MTP YAML in this PR (14 files under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{1k1k,8k1k}/disagg/mtp/) ends with a benchmark: block of the shape:

benchmark:
  type: sa-bench
  req_rate: inf
  isl: <isl>
  osl: 1024
  concurrencies: '<N>'

There is no use_chat_template: true field. The PR description states these files are 'byte-identical to the existing stp/ siblings except for the EAGLE speculative-decoding flags on the decode block' — and the STP siblings correctly omit chat-template (raw-prompt input is fine for non-spec-decoding STP). The copy carried that omission into MTP, where it is incorrect.

Why this is mandatory for MTP

AGENTS.md:56 says verbatim: 'MTP scripts MUST pass --use-chat-template to run_benchmark_serving — EAGLE-style spec decoding is trained against chat-formatted inputs; benchmarking against raw prompts silently regresses acceptance rate.' The repository's own automated review at .github/workflows/claude-pr-review.yml:280-296 enforces the same rule: 'MTP benchmarks MUST include the --use-chat-template flag in the benchmark client configuration.'

For multi-node recipes consumed by sa-bench, the YAML key use_chat_template: true under the benchmark: block is the equivalent of the shell --use-chat-template flag — sa-bench plumbs the field through benchmark_lib.sh into benchmark_serving.py, where it gates tokenizer.apply_chat_template formatting of prompts.

Why existing code doesn't prevent it

Nothing in the loader or runtime cross-checks use_chat_template against the presence of speculative-algorithm: EAGLE in the decode block. The omission is silent — the benchmark runs, produces numbers, and the only visible signal is a lower acceptance rate than the model is actually capable of.
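The missing cross-check described above could be sketched roughly as follows. This is an illustration only, not code from the repository: it assumes the recipe has already been parsed into a Python dict (for example with a YAML loader), that `benchmark` is a top-level key as in the snippets quoted in this review, and that the EAGLE flag appears under the key `speculative-algorithm` at some depth. The helper names `find_key` and `missing_chat_template` are hypothetical.

```python
from typing import Any, Iterator


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` at any depth of a parsed YAML tree."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def missing_chat_template(recipe: dict) -> bool:
    """True when EAGLE is enabled somewhere but benchmark.use_chat_template is not true."""
    uses_eagle = any(v == "EAGLE" for v in find_key(recipe, "speculative-algorithm"))
    benchmark = recipe.get("benchmark") or {}
    return uses_eagle and benchmark.get("use_chat_template") is not True


# Hypothetical, heavily trimmed recipe; the real files carry many more keys.
recipe = {
    "decode": {"speculative-algorithm": "EAGLE", "speculative-num-steps": 2},
    "benchmark": {"type": "sa-bench", "isl": 1024, "osl": 1024},
}
print(missing_chat_template(recipe))  # True
```

Searching the whole tree for `speculative-algorithm` avoids hard-coding where the decode block sits in the recipe, which this review does not spell out.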

Precedent

Every existing sglang multi-node MTP recipe in the repo sets this field:

  • benchmarks/multi_node/srt-slurm-recipes/sglang/dsr1/b200-fp4/8k1k/disagg/mtp/*.yaml — all 6 files
  • benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/8k1k/disagg-*-mtp.yaml — all -mtp variants
  • All single-node GLM-5 MTP scripts (glm5_fp8_b300_mtp.sh, glm5_fp8_b200_mtp.sh, glm5_fp8_h200_mtp.sh, etc.) pass --use-chat-template

GLM-5 specifically requires chat-template formatting for EAGLE to perform as intended — the per-platform MTP scripts in this repo already encode that rule.

Step-by-step proof

Take 1k1k_mtp_hightpt_0.yaml in this PR (lines 147-152):

  1. The decode: block sets speculative-algorithm: "EAGLE", speculative-num-steps: 2, speculative-eagle-topk: 1, speculative-num-draft-tokens: 3 → MTP is on.
  2. The benchmark: block is type: sa-bench / req_rate: inf / isl: 1024 / osl: 1024 / concurrencies: '8192' — no use_chat_template key.
  3. When sa-bench launches, benchmark_serving.py reads use_chat_template (defaults to false) and skips tokenizer.apply_chat_template(...). Prompts are sent to the GLM-5 server in raw form.
  4. The EAGLE draft head was trained on chat-formatted token sequences; raw-prompt distribution shift drops draft-token acceptance.
  5. The reported acceptance rate is silently lower than the model's true capability — and not comparable to other MTP benchmarks in the repo (dsr1, deepseek-v4) which all measure against chat-formatted prompts.

Repeat verbatim for the other 13 files; same structure, same omission.

Fix

Add one line under the benchmark: block of each of the 14 new YAMLs:

benchmark:
  type: sa-bench
  req_rate: inf
  isl: <isl>
  osl: 1024
  concurrencies: '<N>'
  use_chat_template: true

Files to update:

  • 1k1k/disagg/mtp/1k1k_mtp_hightpt_{0,1,2,3,4}.yaml
  • 1k1k/disagg/mtp/1k1k_mtp_lowlat_{0,1}.yaml
  • 8k1k/disagg/mtp/8k1k_mtp_hightpt_{0,1,2,3}.yaml
  • 8k1k/disagg/mtp/8k1k_mtp_lowlat_{0,1,2}.yaml
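As a quick sanity check that the four brace patterns above really cover 14 files, they can be expanded with a small helper. This is a sketch for illustration; `expand_braces` is a hypothetical helper that only handles flat `{a,b,c}` groups, not nested braces or ranges.

```python
import re
from itertools import product


def expand_braces(pattern: str) -> list[str]:
    """Expand flat {a,b,c} groups in a path pattern into concrete paths."""
    parts = re.split(r"\{([^{}]*)\}", pattern)
    # Even indices are literal text, odd indices are comma-separated alternatives.
    choices = [p.split(",") if i % 2 else [p] for i, p in enumerate(parts)]
    return ["".join(combo) for combo in product(*choices)]


PATTERNS = [
    "1k1k/disagg/mtp/1k1k_mtp_hightpt_{0,1,2,3,4}.yaml",
    "1k1k/disagg/mtp/1k1k_mtp_lowlat_{0,1}.yaml",
    "8k1k/disagg/mtp/8k1k_mtp_hightpt_{0,1,2,3}.yaml",
    "8k1k/disagg/mtp/8k1k_mtp_lowlat_{0,1,2}.yaml",
]

files = [path for pattern in PATTERNS for path in expand_braces(pattern)]
print(len(files))  # 14
```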

Comment on lines +22 to +37
backend:
  prefill_environment:
    TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: '1800'
    PYTHONUNBUFFERED: '1'
    DYN_SKIP_SGLANG_LOG_FORMATTING: '1'
    SGLANG_DISAGGREGATION_HEARTBEAT_MAX_FAILURE: '100000'
    SGLANG_DISAGGREGATION_BOOTSTRAP_TIMEOUT: '100000'
    SGLANG_DISAGGREGATION_WAITING_TIMEOUT: '100000'
    MC_TE_METRIC: 'true'
    MC_FORCE_MNNVL: '1'
    NCCL_MNNVL_ENABLE: '1'
    NCCL_CUMEM_ENABLE: '1'
    SGLANG_MOONCAKE_CUSTOM_MEM_POOL: 'True'
    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0'
    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1'
    DYN_REQUEST_PLANE: nats


🔴 All 14 new GB300 MTP recipe YAMLs omit SGLANG_ENABLE_SPEC_V2: '1' from both prefill_environment and decode_environment blocks, even though every other GLM-5 / SGLang MTP path in this repo (every existing dsr1/b200-fp4/{1k1k,8k1k}/disagg/mtp/*.yaml recipe, every single-node *_mtp.sh launcher including benchmarks/single_node/fixed_seq_len/glm5_fp8_b300_mtp.sh:39, and benchmarks/multi_node/amd_utils/env.sh:156) sets it explicitly. runners/launch_gb300-nv.sh does not inject it either, so the recipe YAML is the only entry point — without it, EAGLE on lmsysorg/sglang:v0.5.11-cu130 will run via the legacy spec-decoding path (or silently no-op with the NSA + DeepEP + DPA decode topology), producing decode behavior inconsistent with every other validated MTP benchmark in the repo and invalidating the new measurements. Fix: add SGLANG_ENABLE_SPEC_V2: '1' to both env blocks in every new MTP recipe (matching the dsr1 MTP precedent).

Extended reasoning...

What the bug is

All 14 new MTP recipe YAMLs added by this PR (benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/{1k1k,8k1k}/disagg/mtp/*.yaml) are missing the SGLANG_ENABLE_SPEC_V2: '1' environment variable in both prefill_environment and decode_environment blocks. This env var is the documented enablement gate for EAGLE/MTP speculative decoding in SGLang across this repo.

Why this is a bug — overwhelming precedent

Every other MTP launch path in this repo sets this variable explicitly:

  1. Every existing multinode SGLang MTP recipe under benchmarks/multi_node/srt-slurm-recipes/sglang/dsr1/b200-fp4/{1k1k,8k1k}/disagg/mtp/*.yaml sets SGLANG_ENABLE_SPEC_V2: '1' in both prefill and decode env blocks (e.g. dsr1/b200-fp4/8k1k/disagg/mtp/8k1k_mtp_lowlat_0.yaml:48 and :66 — 20 hits across 10 files).
  2. Every single-node GLM-5 MTP launcher: benchmarks/single_node/fixed_seq_len/glm5_fp8_b300_mtp.sh:39 runs export SGLANG_ENABLE_SPEC_V2=1 immediately before sglang.launch_server --speculative-algorithm EAGLE. Same in glm5_fp8_b200_mtp.sh, glm5_fp8_mi355x_mtp.sh, glm5_fp4_b300_mtp.sh, glm5_fp4_b200_mtp.sh.
  3. AMD multi-node env: benchmarks/multi_node/amd_utils/env.sh:156 exports it.
  4. perf-changelog.yaml describes every prior GLM-5 MTP entry (b300/b200/mi355x FP8 and FP4 variants) verbatim as "adds EAGLE speculative decoding ... behind SGLANG_ENABLE_SPEC_V2=1" — this is the maintainer-documented contract.

Why existing code doesn't catch it

runners/launch_gb300-nv.sh contains zero references to SPEC_V2, speculative, or MTP; it only runs srtctl apply on the recipe YAML. The recipe YAML's prefill_environment/decode_environment is the only place SGLang env vars reach the worker containers on this launch path. A missing entry is not silently filled in.

Root cause (confirmed by PR description)

The PR description states the new recipes are "byte-identical to the existing stp/ siblings except for the EAGLE speculative-decoding flags on the decode block." The STP siblings don't need this env var (no spec decoding), so the copy carried the STP environment forward and the new EAGLE-specific env var was never added. Spot-check: diff stp/1k1k_stp_hightpt_0.yaml mtp/1k1k_mtp_hightpt_0.yaml shows the new MTP file is byte-identical to STP except for the name change and the four speculative-* keys appended to the decode block.

Step-by-step proof of impact

  1. CI invokes launch_gb300-nv.sh for glm5-fp8-gb300-dynamo-sglang-mtp.
  2. The launcher copies recipes/sglang/glm5/gb300-fp8/.../mtp/*.yaml into srt-slurm and runs srtctl apply. It does not inject SGLANG_ENABLE_SPEC_V2.
  3. srtctl reads prefill_environment/decode_environment from the YAML and exports them into the worker containers. Neither block contains SGLANG_ENABLE_SPEC_V2.
  4. SGLang v0.5.11-cu130 starts with --speculative-algorithm EAGLE but without SGLANG_ENABLE_SPEC_V2=1 — it routes EAGLE through the legacy v1 spec-decoding code path (or silently disables spec for the NSA + DeepEP + DPA decode topology, since v2 is the implementation that supports this combination in v0.5.11).
  5. The benchmark completes and publishes throughput/latency numbers, but they measure a different decode code path than every other GLM-5 MTP entry in perf-changelog.yaml, and a different one from the B300 single-node counterpart glm5_fp8_b300_mtp.sh.

The whole point of -mtp is to measure EAGLE MTP performance; without SPEC_V2=1 the published numbers do not represent the intended config, defeating the purpose of the entry and breaking the apples-to-apples comparison with the existing MTP benchmarks.

Fix

Add SGLANG_ENABLE_SPEC_V2: '1' to both prefill_environment and decode_environment in every new MTP recipe (28 env blocks across 14 files), matching the dsr1 MTP recipes exactly.
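A check-and-patch pass for this fix might look like the sketch below. It is an illustration under assumptions: the `backend` mapping is taken to be an already-parsed dict, and `decode_environment` is assumed to sit next to `prefill_environment` under `backend` (only the prefill block is quoted in this review). The helper names are hypothetical.

```python
ENV_BLOCKS = ("prefill_environment", "decode_environment")
SPEC_V2_KEY = "SGLANG_ENABLE_SPEC_V2"


def blocks_missing_spec_v2(backend: dict) -> list[str]:
    """Return the names of env blocks that do not set SGLANG_ENABLE_SPEC_V2 to '1'."""
    return [
        name
        for name in ENV_BLOCKS
        if (backend.get(name) or {}).get(SPEC_V2_KEY) != "1"
    ]


def add_spec_v2(backend: dict) -> dict:
    """Return a copy of `backend` with the variable set in both env blocks."""
    patched = dict(backend)
    for name in ENV_BLOCKS:
        patched[name] = {**(backend.get(name) or {}), SPEC_V2_KEY: "1"}
    return patched


# Hypothetical, trimmed backend block; real recipes carry many more variables.
backend = {
    "prefill_environment": {"PYTHONUNBUFFERED": "1"},
    "decode_environment": {"PYTHONUNBUFFERED": "1"},
}
print(blocks_missing_spec_v2(backend))               # both blocks are reported
print(blocks_missing_spec_v2(add_spec_v2(backend)))  # nothing left to fix
```

Returning a patched copy rather than mutating in place keeps the original mapping available for a before/after diff.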

Comment thread perf-changelog.yaml
- config-keys:
    - glm5-fp8-gb300-dynamo-sglang-mtp
  description:
    - "Add GLM-5-FP8 GB300 multinode dynamo-sglang benchmark with EAGLE MTP speculative decoding"


🟡 The new perf-changelog entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX — looks like the XXX placeholder was never replaced with this PR's number (1907). Every other entry in the file points at a real PR (the entry immediately above uses pull/1897), so after merge this one would render as a broken link in the changelog. Please update line 4122 to pull/1907.

Extended reasoning...

What the bug is. The new glm5-fp8-gb300-dynamo-sglang-mtp entry added at the bottom of perf-changelog.yaml ends with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

The literal token XXX is clearly a placeholder the author intended to replace with the actual PR number before opening the PR. The PR metadata confirms this is PR #1907.

Why this is unambiguously a bug. Every other pr-link in this file points at a concrete PR number: the entry immediately preceding this one uses pull/1897, and earlier entries in the recent additions use 1888, 1893, etc. The file's convention is real, merged PR numbers; XXX violates that convention and is not interpretable as anything other than an unreplaced template token.

Impact. After merge, GitHub will resolve https://github.com/SemiAnalysisAI/InferenceX/pull/XXX to a 404 (no PR numbered XXX exists or can exist, because PR numbers are integers). Anyone browsing the changelog to find the context for this benchmark addition will hit a dead link. Runtime/benchmark behavior is unaffected, so this is a documentation/cosmetic problem only (hence "nit" severity), but it is a real, actionable issue that should be fixed before merge.

Step-by-step proof.

  1. Open perf-changelog.yaml at line 4122 (the last line of the diff).
  2. Observe the value: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  3. Cross-check the PR metadata in the PR description / GitHub UI: this is PR #1907.
  4. Cross-check the preceding entry at line ~4117: pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1897, a numeric PR ID matching the file's convention.
  5. Try the URL with XXX in a browser → GitHub returns a 404 because XXX is not a valid PR identifier.

Fix. One-character (well, three-character) edit:

  - pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
  + pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1907
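A placeholder like this could be caught mechanically with a check along the following lines. This is a minimal sketch, not an existing repository check: the function name `pr_number` and the strict URL shape it accepts are assumptions made for illustration.

```python
import re
from typing import Optional

# Accept only GitHub pull-request URLs whose final path segment is an integer.
PR_LINK = re.compile(r"^https://github\.com/[^/]+/[^/]+/pull/(\d+)$")


def pr_number(link: str) -> Optional[int]:
    """Return the PR number encoded in `link`, or None if the link is malformed."""
    match = PR_LINK.match(link)
    return int(match.group(1)) if match else None


print(pr_number("https://github.com/SemiAnalysisAI/InferenceX/pull/XXX"))   # None
print(pr_number("https://github.com/SemiAnalysisAI/InferenceX/pull/1907"))  # 1907
```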


hshrivastava-droid and others added 2 commits June 25, 2026 10:04
…5 MTP decode

Sweep on 7f7d765 hit `Assertion error /build/DeepEP/csrc/deep_ep.cpp:1233
'x.size(0) <= num_max_dispatch_tokens_per_rank'` during CUDA-graph capture on
the wide-EP decode configs (TP16/EP16, TP32/EP32, TP40/EP40). The old comment
sized the buffer for ceil(cuda_graph_max_bs / dp_size) and ignored MTP's
speculative_num_draft_tokens=3 multiplier — capture-time per-rank tokens
(cuda_graph_max_bs * num_draft_tokens under DP-attention) overflowed the 512
buffer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
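The sizing arithmetic in this commit message can be illustrated with a short sketch. Only the 512-token buffer and the draft-token count of 3 come from the commit message; the `cuda_graph_max_bs` and `dp_size` values below are hypothetical, chosen just to show how the old estimate can pass while the capture-time token count overflows. The function names are also illustrative.

```python
import math

BUFFER_TOKENS = 512    # per-rank dispatch buffer size named in the commit message
NUM_DRAFT_TOKENS = 3   # speculative_num_draft_tokens from the recipes


def old_estimate(cuda_graph_max_bs: int, dp_size: int) -> int:
    """Per-rank tokens as the old comment sized them: ceil(max_bs / dp_size)."""
    return math.ceil(cuda_graph_max_bs / dp_size)


def capture_time_tokens(cuda_graph_max_bs: int, num_draft_tokens: int) -> int:
    """Per-rank tokens during CUDA-graph capture under DP-attention with MTP."""
    return cuda_graph_max_bs * num_draft_tokens


# Hypothetical values, not taken from the recipes.
cuda_graph_max_bs = 256
dp_size = 16

estimate = old_estimate(cuda_graph_max_bs, dp_size)
actual = capture_time_tokens(cuda_graph_max_bs, NUM_DRAFT_TOKENS)
print(estimate, estimate <= BUFFER_TOKENS)  # 16 True
print(actual, actual <= BUFFER_TOKENS)      # 768 False
```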
