Skip to content

add all backends for moe benchmark#948

Open
ybenvidia wants to merge 4 commits into
NVIDIA:mainfrom
ybenvidia:main
Open

add all backends for moe benchmark#948
ybenvidia wants to merge 4 commits into
NVIDIA:mainfrom
ybenvidia:main

Conversation

@ybenvidia

Copy link
Copy Markdown
Contributor

Summary

Provide a concise summary of the changes introduced by this pull request. Detail the purpose and scope of the changes, referencing any relevant issues or discussions. Explain how these changes address the problem or improve the project.

Test Plan

In this section, describe the testing you have performed to verify the changes. Include:

  • A clear description of the testing environment.
  • The steps you followed to test the new features or bug fixes.
  • Any specific commands used during testing, along with their outputs.
  • A description of the results and observations from your testing.
    This information is crucial for reviewers to understand how the changes have been validated.

Additional Notes

Include any other notes or comments about the pull request here. This can include challenges faced, future considerations, or context that reviewers might find helpful.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Updates MoE benchmark arguments, Slurm execution, throughput reporting, and experimental benchmark configs to support multiple backend variants, hybrid rendezvous setup, and revised result generation.

Changes

MoE Benchmark Multi-Backend & Dashboard

Layer / File(s) Summary
Command args and report schema
src/cloudai/workloads/moe_benchmark/moe_benchmark.py, src/cloudai/workloads/moe_benchmark/report_generation_strategy.py, tests/test_acceptance.py
Adds deepep_versions, updates benchmark and results paths to /workspace/DeepEP/..., excludes the new field from serialized command args, changes CSV column ordering to use time_s, and extends the acceptance mapping with the backend list.
Slurm generation and single-sbatch checks
src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py, src/cloudai/workloads/common/moe_benchmark_report.py, src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py, src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py, src/cloudai/systems/slurm/single_sbatch_runner.py, tests/ref_data/moe-benchmark.sbatch, tests/test_single_sbatch_runner.py
Adds backend-specific environment prefixes, hybrid rendezvous startup, per-backend benchmark chaining, generated config type selection, single-sbatch incompatibility reporting, and matching updates in the NCCL and UCC test strategies and reference sbatch script.
Throughput dashboard rendering
src/cloudai/workloads/moe_benchmark/throughput_reporter.py
Reads all MoE result rows, extracts backend-aware bus and split bandwidth series, skips non-finite NCCL values, and renders a multi-panel SVG dashboard with conditional combine and split panels.
Benchmark configs and scenarios
conf/experimental/test/moe_benchmark_HT.toml, conf/experimental/test/moe_benchmark_low_latency.toml, conf/experimental/test_scenario/moe_benchmark_HT.toml, conf/experimental/test_scenario/moe_benchmark_low_latency.toml, conf/experimental/test/nccl_test_alltoallv.toml, conf/experimental/test/ucc_alltoallv_deepep.toml
Updates HT and low-latency benchmark configs with new backend lists, DeepEP paths, combine mode, and expanded environment variables; renames the HT scenario test name; adds the chained low-latency scenario; and adjusts NCCL and UCX test settings.

Estimated code review effort: 4 (Complex) | ~50 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name Status Explanation Resolution
Description check ❓ Inconclusive The description is just a template and does not describe the actual changeset in any meaningful way. Replace the template with a brief summary of the MoE backend changes and any testing performed.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the main change: expanding the MoE benchmark to cover all backends.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`:
- Around line 95-97: The hybrid alias handling in `SlurmCommandGenStrategy` is
inconsistent because `deepep_hybrid`, `hybrid`, and `hybrid_ep` are detected
together earlier, but `_BACKEND_ENV` is later derived from the raw version
value. Normalize the selected DeePEP version to a single canonical hybrid name
inside `generate` before the `_BACKEND_ENV` lookup, and use that normalized
value for the include/NCCL override logic so `hybrid` and `hybrid_ep` follow the
same path as `deepep_hybrid`.
- Around line 101-106: The Hybrid-EP rendezvous setup in the command generation
flow still depends on a fixed sleep after starting etcd, which can leave the
benchmark racing an unready endpoint. Update the logic around the etcd startup
sequence in the strategy that builds the Slurm command to poll the
NIXL_ETCD_ENDPOINTS/2379 endpoint until it is healthy, and make it fail fast
with a clear error if readiness is never reached. Keep the background launch of
etcd, but replace the hardcoded delay with an explicit readiness check in the
same command-building path.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Around line 93-123: The bandwidth parsing in _extract_moe_separate_bw (and the
related single-value extractor used by the throughput report) currently accepts
NaN/Inf via float(...) and can later break rendering when gmax is converted to
int in the SVG path. Update the parsing logic to reject non-finite values before
returning from these helpers, and ensure the downstream chart generation path
that uses gmax only receives finite numbers so invalid measurements from
results.json or baseline logs are skipped instead of crashing report generation.

In `@tests/ref_data/moe-benchmark.sbatch`:
- Around line 26-30: The current sbatch fixture only covers the default
legacy-and-elastic command path, so add reference coverage in
tests/ref_data/moe-benchmark.sbatch for the backend override cases as well.
Update the fixture to include at least one generated srun command that uses the
inline _BACKEND_ENV prefixes and one hybrid etcd bootstrap run, so
test_sbatch_generation will fail if those paths regress. Use the existing
moe-benchmark sbatch fixture and its srun command variants as the place to lock
down these exact-match scenarios.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: dfb90795-08c7-4282-9006-a440e28e3842

📥 Commits

Reviewing files that changed from the base of the PR and between 1bca58d and 9fb7bfc.

📒 Files selected for processing (9)
  • conf/experimental/test/moe_benchmark_HT.toml
  • conf/experimental/test/moe_benchmark_low_latency.toml
  • conf/experimental/test_scenario/moe_benchmark_HT.toml
  • conf/experimental/test_scenario/moe_benchmark_low_latency.toml
  • src/cloudai/workloads/moe_benchmark/moe_benchmark.py
  • src/cloudai/workloads/moe_benchmark/report_generation_strategy.py
  • src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
  • src/cloudai/workloads/moe_benchmark/throughput_reporter.py
  • tests/ref_data/moe-benchmark.sbatch

Comment thread src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
Comment thread src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py Outdated
Comment thread src/cloudai/workloads/moe_benchmark/throughput_reporter.py
Comment thread tests/ref_data/moe-benchmark.sbatch Outdated
@podkidyshev

Copy link
Copy Markdown
Contributor

@ybenvidia please resolve coderabbit comments

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)

101-103: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Reject negative bandwidth values too.

math.isfinite(...) still allows negative bandwidths, which are invalid metric values and can corrupt the shared SVG axis/bar geometry downstream. Use one helper for all bandwidth sources.

Proposed fix
 import math
 import re
 from pathlib import Path
 from typing import Any
+
+
+def _valid_bw(value: float) -> bool:
+    return math.isfinite(value) and value >= 0.0
@@
-    if not math.isfinite(val):  # skip NaN/Inf so gmax stays finite for the int() in the SVG path
+    if not _valid_bw(val):
         return None
@@
-    if not (math.isfinite(nvl) and math.isfinite(rdma)):  # skip NaN/Inf (keeps gmax finite)
+    if not (_valid_bw(nvl) and _valid_bw(rdma)):
         return None
@@
-        if sz < 1048576 or not math.isfinite(bavg):
+        if sz < 1048576 or not _valid_bw(bavg):
             continue
@@
-        if not math.isfinite(v):
+        if not _valid_bw(v):
             continue

Also applies to: 130-132, 202-203, 220-222

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py` around lines 101
- 103, The bandwidth parsing helper still accepts negative values because it
only filters non-finite numbers, which can break the shared SVG axis/bar
calculations. Update the common bandwidth helper(s) in throughput_reporter.py
used by the affected sources so they reject any value below zero as well as
NaN/Inf, and make sure all bandwidth inputs flow through the same validation
path rather than duplicating checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Around line 101-103: The bandwidth parsing helper still accepts negative
values because it only filters non-finite numbers, which can break the shared
SVG axis/bar calculations. Update the common bandwidth helper(s) in
throughput_reporter.py used by the affected sources so they reject any value
below zero as well as NaN/Inf, and make sure all bandwidth inputs flow through
the same validation path rather than duplicating checks.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: d43aa464-e1a6-4488-b29b-4e975e3960e6

📥 Commits

Reviewing files that changed from the base of the PR and between 9fb7bfc and d7c7e54.

📒 Files selected for processing (7)
  • conf/experimental/test/moe_benchmark_HT.toml
  • conf/experimental/test/nccl_test_alltoallv.toml
  • conf/experimental/test/ucc_alltoallv_deepep.toml
  • src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
  • src/cloudai/workloads/moe_benchmark/throughput_reporter.py
  • tests/ref_data/moe-benchmark.sbatch
  • tests/test_acceptance.py

@ybenvidia

Copy link
Copy Markdown
Contributor Author

@ybenvidia please resolve coderabbit comments

Done

[
"# Hybrid-EP NIXL rendezvous: etcd on the head node (background).",
f'srun --overlap --nodes=1 --ntasks=1 -w "$head_node" --container-image={image} '
'bash -c "etcd --log-level error --listen-client-urls http://0.0.0.0:2379 '

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does moe benchmark need a single-sbatch support? (because etcd is a background job which is not cleaned up by script at the end = single-sbatch mode will likely fail)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@podkidyshev podkidyshev Jul 3, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see my comment was taken quite far :D

I’m not fully convinced by this job-level prologue approach for ETCD. It still feels a bit ad hoc, and it behaves differently from the NIXL implementation. Could you instead take inspiration from src/cloudai/workloads/common/nixl.py? There ETCD is managed at the test-case level: started before the test, waited on, then shut down after the test, which also works correctly in single-sbatch because each case gets its own lifecycle.

I was told that this per-test lifecycle is the correct behavior, and the NIXL path has already been tested quite a lot by the verification team. Maybe we can extract the common ETCD start/wait/stop logic into something like workloads/common/etcd.py and reuse it from both NIXL and MoE, instead of adding a separate MoE-specific implementation.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py (1)

146-166: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Normalize hybrid aliases before invoking benchmark.py. hybrid/hybrid_ep are already treated as aliases for env setup and rendezvous, but the positional CLI arg still passes the raw token through. That leaves alias inputs inconsistent with the generated benchmark command, which uses deepep_hybrid, and can break runs when those values are supplied.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py` around
lines 146 - 166, The generate_test_command method is passing the raw deepep
version token through to benchmark.py even when the value is a hybrid alias,
which makes the CLI argument inconsistent with the normalized environment
handling. Update the command assembly in generate_test_command so that hybrid
and hybrid_ep are normalized to the same benchmark-facing value used by
_BACKEND_ENV before appending the positional version argument, keeping alias
handling consistent across env setup and invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`:
- Around line 48-51: The single-sbatch guard in single_sbatch_unsupported_reason
is too strict because it rejects every use_deepep_matrix case even when gen
provides the fallback that _deepep_ucc_matrix_host_path relies on. Update the
logic in single_sbatch_unsupported_reason to only return
DEEPEP_MATRIX_SINGLE_SBATCH_REASON when use_deepep_matrix is enabled and the
config still lacks a usable gen fallback, so the same compatibility check
matches the behavior in _deepep_ucc_matrix_host_path and allows valid
single-sbatch configs.

In `@tests/test_single_sbatch_runner.py`:
- Around line 490-503: The test for SingleSbatchRunner is asserting on the
exception message manually instead of using a narrowed pytest.raises match,
which can hide unrelated ValueError failures in gen_sbatch_content(). Update
test_rejects_single_sbatch_incompatible to use pytest.raises(ValueError,
match=...) with a pattern that covers the existing checks for single-sbatch
mode, nccl_tr.name, and use_deepep_matrix, so the assertion stays focused on
_reject_single_sbatch_incompatible().

---

Outside diff comments:
In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`:
- Around line 146-166: The generate_test_command method is passing the raw
deepep version token through to benchmark.py even when the value is a hybrid
alias, which makes the CLI argument inconsistent with the normalized environment
handling. Update the command assembly in generate_test_command so that hybrid
and hybrid_ep are normalized to the same benchmark-facing value used by
_BACKEND_ENV before appending the positional version argument, keeping alias
handling consistent across env setup and invocation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: 5a8b6f3d-f90b-4678-bca9-a960032b9916

📥 Commits

Reviewing files that changed from the base of the PR and between d7c7e54 and 8a4b8b3.

📒 Files selected for processing (7)
  • src/cloudai/systems/slurm/single_sbatch_runner.py
  • src/cloudai/workloads/common/moe_benchmark_report.py
  • src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py
  • src/cloudai/workloads/nccl_test/slurm_command_gen_strategy.py
  • src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py
  • tests/ref_data/moe-benchmark.sbatch
  • tests/test_single_sbatch_runner.py

Comment on lines +48 to +51
def single_sbatch_unsupported_reason(self) -> str | None:
tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test)
return DEEPEP_MATRIX_SINGLE_SBATCH_REASON if tdef.cmd_args.use_deepep_matrix else None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Single-sbatch rejection ignores the gen fallback that makes the config actually compatible.

_deepep_ucc_matrix_host_path() (Lines 61-63) already falls back to tdef.cmd_args.gen when the runtime-produced matrix isn't available — it doesn't require the MoE benchmark to have run first in that case. But single_sbatch_unsupported_reason() rejects use_deepep_matrix=True unconditionally, even when gen is set, needlessly blocking a configuration that would otherwise generate correctly in single-sbatch mode.

🐛 Proposed fix
     def single_sbatch_unsupported_reason(self) -> str | None:
         tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test)
-        return DEEPEP_MATRIX_SINGLE_SBATCH_REASON if tdef.cmd_args.use_deepep_matrix else None
+        return (
+            DEEPEP_MATRIX_SINGLE_SBATCH_REASON
+            if tdef.cmd_args.use_deepep_matrix and tdef.cmd_args.gen is None
+            else None
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def single_sbatch_unsupported_reason(self) -> str | None:
tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test)
return DEEPEP_MATRIX_SINGLE_SBATCH_REASON if tdef.cmd_args.use_deepep_matrix else None
def single_sbatch_unsupported_reason(self) -> str | None:
tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test)
return (
DEEPEP_MATRIX_SINGLE_SBATCH_REASON
if tdef.cmd_args.use_deepep_matrix and tdef.cmd_args.gen is None
else None
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py` around lines 48
- 51, The single-sbatch guard in single_sbatch_unsupported_reason is too strict
because it rejects every use_deepep_matrix case even when gen provides the
fallback that _deepep_ucc_matrix_host_path relies on. Update the logic in
single_sbatch_unsupported_reason to only return
DEEPEP_MATRIX_SINGLE_SBATCH_REASON when use_deepep_matrix is enabled and the
config still lacks a usable gen fallback, so the same compatibility check
matches the behavior in _deepep_ucc_matrix_host_path and allows valid
single-sbatch configs.

Comment on lines +490 to +503
def test_rejects_single_sbatch_incompatible(nccl_tr: TestRun, slurm_system: SlurmSystem) -> None:
cast(NCCLTestDefinition, nccl_tr.test).cmd_args.use_deepep_matrix = True
tc = TestScenario(name="tc", test_runs=[nccl_tr])
runner = SingleSbatchRunner(mode="run", system=slurm_system, test_scenario=tc, output_path=slurm_system.output_path)

with pytest.raises(ValueError) as exc_info:
runner.gen_sbatch_content()

msg = str(exc_info.value)
assert "single-sbatch mode" in msg
assert nccl_tr.name in msg
assert "use_deepep_matrix" in msg


Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Narrow the exception assertion per Ruff PT011.

pytest.raises(ValueError) without match= would also pass if an unrelated ValueError were raised elsewhere in gen_sbatch_content(), masking a real regression in _reject_single_sbatch_incompatible().

✅ Proposed fix
-    with pytest.raises(ValueError) as exc_info:
+    with pytest.raises(ValueError, match="cannot run in single-sbatch mode") as exc_info:
         runner.gen_sbatch_content()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def test_rejects_single_sbatch_incompatible(nccl_tr: TestRun, slurm_system: SlurmSystem) -> None:
cast(NCCLTestDefinition, nccl_tr.test).cmd_args.use_deepep_matrix = True
tc = TestScenario(name="tc", test_runs=[nccl_tr])
runner = SingleSbatchRunner(mode="run", system=slurm_system, test_scenario=tc, output_path=slurm_system.output_path)
with pytest.raises(ValueError) as exc_info:
runner.gen_sbatch_content()
msg = str(exc_info.value)
assert "single-sbatch mode" in msg
assert nccl_tr.name in msg
assert "use_deepep_matrix" in msg
def test_rejects_single_sbatch_incompatible(nccl_tr: TestRun, slurm_system: SlurmSystem) -> None:
cast(NCCLTestDefinition, nccl_tr.test).cmd_args.use_deepep_matrix = True
tc = TestScenario(name="tc", test_runs=[nccl_tr])
runner = SingleSbatchRunner(mode="run", system=slurm_system, test_scenario=tc, output_path=slurm_system.output_path)
with pytest.raises(ValueError, match="cannot run in single-sbatch mode") as exc_info:
runner.gen_sbatch_content()
msg = str(exc_info.value)
assert "single-sbatch mode" in msg
assert nccl_tr.name in msg
assert "use_deepep_matrix" in msg
🧰 Tools
🪛 Ruff (0.15.20)

[warning] 495-495: pytest.raises(ValueError) is too broad, set the match parameter or use a more specific exception

(PT011)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_single_sbatch_runner.py` around lines 490 - 503, The test for
SingleSbatchRunner is asserting on the exception message manually instead of
using a narrowed pytest.raises match, which can hide unrelated ValueError
failures in gen_sbatch_content(). Update test_rejects_single_sbatch_incompatible
to use pytest.raises(ValueError, match=...) with a pattern that covers the
existing checks for single-sbatch mode, nccl_tr.name, and use_deepep_matrix, so
the assertion stays focused on _reject_single_sbatch_incompatible().

Source: Linters/SAST tools

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants