add all backends for moe benchmark#948
Conversation
📝 WalkthroughWalkthroughUpdates MoE benchmark arguments, Slurm execution, throughput reporting, and experimental benchmark configs to support multiple backend variants, hybrid rendezvous setup, and revised result generation. ChangesMoE Benchmark Multi-Backend & Dashboard
Estimated code review effort: 4 (Complex) | ~50 minutes 🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 inconclusive)
✅ Passed checks (3 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`:
- Around line 95-97: The hybrid alias handling in `SlurmCommandGenStrategy` is
inconsistent because `deepep_hybrid`, `hybrid`, and `hybrid_ep` are detected
together earlier, but `_BACKEND_ENV` is later derived from the raw version
value. Normalize the selected DeePEP version to a single canonical hybrid name
inside `generate` before the `_BACKEND_ENV` lookup, and use that normalized
value for the include/NCCL override logic so `hybrid` and `hybrid_ep` follow the
same path as `deepep_hybrid`.
- Around line 101-106: The Hybrid-EP rendezvous setup in the command generation
flow still depends on a fixed sleep after starting etcd, which can leave the
benchmark racing an unready endpoint. Update the logic around the etcd startup
sequence in the strategy that builds the Slurm command to poll the
NIXL_ETCD_ENDPOINTS/2379 endpoint until it is healthy, and make it fail fast
with a clear error if readiness is never reached. Keep the background launch of
etcd, but replace the hardcoded delay with an explicit readiness check in the
same command-building path.
In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Around line 93-123: The bandwidth parsing in _extract_moe_separate_bw (and the
related single-value extractor used by the throughput report) currently accepts
NaN/Inf via float(...) and can later break rendering when gmax is converted to
int in the SVG path. Update the parsing logic to reject non-finite values before
returning from these helpers, and ensure the downstream chart generation path
that uses gmax only receives finite numbers so invalid measurements from
results.json or baseline logs are skipped instead of crashing report generation.
In `@tests/ref_data/moe-benchmark.sbatch`:
- Around line 26-30: The current sbatch fixture only covers the default
legacy-and-elastic command path, so add reference coverage in
tests/ref_data/moe-benchmark.sbatch for the backend override cases as well.
Update the fixture to include at least one generated srun command that uses the
inline _BACKEND_ENV prefixes and one hybrid etcd bootstrap run, so
test_sbatch_generation will fail if those paths regress. Use the existing
moe-benchmark sbatch fixture and its srun command variants as the place to lock
down these exact-match scenarios.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: dfb90795-08c7-4282-9006-a440e28e3842
📒 Files selected for processing (9)
conf/experimental/test/moe_benchmark_HT.tomlconf/experimental/test/moe_benchmark_low_latency.tomlconf/experimental/test_scenario/moe_benchmark_HT.tomlconf/experimental/test_scenario/moe_benchmark_low_latency.tomlsrc/cloudai/workloads/moe_benchmark/moe_benchmark.pysrc/cloudai/workloads/moe_benchmark/report_generation_strategy.pysrc/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.pysrc/cloudai/workloads/moe_benchmark/throughput_reporter.pytests/ref_data/moe-benchmark.sbatch
|
@ybenvidia please resolve coderabbit comments |
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/cloudai/workloads/moe_benchmark/throughput_reporter.py (1)
101-103: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winReject negative bandwidth values too.
math.isfinite(...)still allows negative bandwidths, which are invalid metric values and can corrupt the shared SVG axis/bar geometry downstream. Use one helper for all bandwidth sources.Proposed fix
import math import re from pathlib import Path from typing import Any + + +def _valid_bw(value: float) -> bool: + return math.isfinite(value) and value >= 0.0 @@ - if not math.isfinite(val): # skip NaN/Inf so gmax stays finite for the int() in the SVG path + if not _valid_bw(val): return None @@ - if not (math.isfinite(nvl) and math.isfinite(rdma)): # skip NaN/Inf (keeps gmax finite) + if not (_valid_bw(nvl) and _valid_bw(rdma)): return None @@ - if sz < 1048576 or not math.isfinite(bavg): + if sz < 1048576 or not _valid_bw(bavg): continue @@ - if not math.isfinite(v): + if not _valid_bw(v): continueAlso applies to: 130-132, 202-203, 220-222
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py` around lines 101 - 103, The bandwidth parsing helper still accepts negative values because it only filters non-finite numbers, which can break the shared SVG axis/bar calculations. Update the common bandwidth helper(s) in throughput_reporter.py used by the affected sources so they reject any value below zero as well as NaN/Inf, and make sure all bandwidth inputs flow through the same validation path rather than duplicating checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@src/cloudai/workloads/moe_benchmark/throughput_reporter.py`:
- Around line 101-103: The bandwidth parsing helper still accepts negative
values because it only filters non-finite numbers, which can break the shared
SVG axis/bar calculations. Update the common bandwidth helper(s) in
throughput_reporter.py used by the affected sources so they reject any value
below zero as well as NaN/Inf, and make sure all bandwidth inputs flow through
the same validation path rather than duplicating checks.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: d43aa464-e1a6-4488-b29b-4e975e3960e6
📒 Files selected for processing (7)
conf/experimental/test/moe_benchmark_HT.tomlconf/experimental/test/nccl_test_alltoallv.tomlconf/experimental/test/ucc_alltoallv_deepep.tomlsrc/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.pysrc/cloudai/workloads/moe_benchmark/throughput_reporter.pytests/ref_data/moe-benchmark.sbatchtests/test_acceptance.py
Done |
| [ | ||
| "# Hybrid-EP NIXL rendezvous: etcd on the head node (background).", | ||
| f'srun --overlap --nodes=1 --ntasks=1 -w "$head_node" --container-image={image} ' | ||
| 'bash -c "etcd --log-level error --listen-client-urls http://0.0.0.0:2379 ' |
There was a problem hiding this comment.
does moe benchmark need a single-sbatch support? (because etcd is a background job which is not cleaned up by script at the end = single-sbatch mode will likely fail)
There was a problem hiding this comment.
I see my comment was taken quite far :D
I’m not fully convinced by this job-level prologue approach for ETCD. It still feels a bit ad hoc, and it behaves differently from the NIXL implementation. Could you instead take inspiration from src/cloudai/workloads/common/nixl.py? There ETCD is managed at the test-case level: started before the test, waited on, then shut down after the test, which also works correctly in single-sbatch because each case gets its own lifecycle.
I was told that this per-test lifecycle is the correct behavior, and the NIXL path has already been tested quite a lot by the verification team. Maybe we can extract the common ETCD start/wait/stop logic into something like workloads/common/etcd.py and reuse it from both NIXL and MoE, instead of adding a separate MoE-specific implementation.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py (1)
146-166: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winNormalize hybrid aliases before invoking
benchmark.py.hybrid/hybrid_epare already treated as aliases for env setup and rendezvous, but the positional CLI arg still passes the raw token through. That leaves alias inputs inconsistent with the generated benchmark command, which usesdeepep_hybrid, and can break runs when those values are supplied.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py` around lines 146 - 166, The generate_test_command method is passing the raw deepep version token through to benchmark.py even when the value is a hybrid alias, which makes the CLI argument inconsistent with the normalized environment handling. Update the command assembly in generate_test_command so that hybrid and hybrid_ep are normalized to the same benchmark-facing value used by _BACKEND_ENV before appending the positional version argument, keeping alias handling consistent across env setup and invocation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py`:
- Around line 48-51: The single-sbatch guard in single_sbatch_unsupported_reason
is too strict because it rejects every use_deepep_matrix case even when gen
provides the fallback that _deepep_ucc_matrix_host_path relies on. Update the
logic in single_sbatch_unsupported_reason to only return
DEEPEP_MATRIX_SINGLE_SBATCH_REASON when use_deepep_matrix is enabled and the
config still lacks a usable gen fallback, so the same compatibility check
matches the behavior in _deepep_ucc_matrix_host_path and allows valid
single-sbatch configs.
In `@tests/test_single_sbatch_runner.py`:
- Around line 490-503: The test for SingleSbatchRunner is asserting on the
exception message manually instead of using a narrowed pytest.raises match,
which can hide unrelated ValueError failures in gen_sbatch_content(). Update
test_rejects_single_sbatch_incompatible to use pytest.raises(ValueError,
match=...) with a pattern that covers the existing checks for single-sbatch
mode, nccl_tr.name, and use_deepep_matrix, so the assertion stays focused on
_reject_single_sbatch_incompatible().
---
Outside diff comments:
In `@src/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.py`:
- Around line 146-166: The generate_test_command method is passing the raw
deepep version token through to benchmark.py even when the value is a hybrid
alias, which makes the CLI argument inconsistent with the normalized environment
handling. Update the command assembly in generate_test_command so that hybrid
and hybrid_ep are normalized to the same benchmark-facing value used by
_BACKEND_ENV before appending the positional version argument, keeping alias
handling consistent across env setup and invocation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Enterprise
Run ID: 5a8b6f3d-f90b-4678-bca9-a960032b9916
📒 Files selected for processing (7)
src/cloudai/systems/slurm/single_sbatch_runner.pysrc/cloudai/workloads/common/moe_benchmark_report.pysrc/cloudai/workloads/moe_benchmark/slurm_command_gen_strategy.pysrc/cloudai/workloads/nccl_test/slurm_command_gen_strategy.pysrc/cloudai/workloads/ucc_test/slurm_command_gen_strategy.pytests/ref_data/moe-benchmark.sbatchtests/test_single_sbatch_runner.py
| def single_sbatch_unsupported_reason(self) -> str | None: | ||
| tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test) | ||
| return DEEPEP_MATRIX_SINGLE_SBATCH_REASON if tdef.cmd_args.use_deepep_matrix else None | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Single-sbatch rejection ignores the gen fallback that makes the config actually compatible.
_deepep_ucc_matrix_host_path() (Lines 61-63) already falls back to tdef.cmd_args.gen when the runtime-produced matrix isn't available — it doesn't require the MoE benchmark to have run first in that case. But single_sbatch_unsupported_reason() rejects use_deepep_matrix=True unconditionally, even when gen is set, needlessly blocking a configuration that would otherwise generate correctly in single-sbatch mode.
🐛 Proposed fix
def single_sbatch_unsupported_reason(self) -> str | None:
tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test)
- return DEEPEP_MATRIX_SINGLE_SBATCH_REASON if tdef.cmd_args.use_deepep_matrix else None
+ return (
+ DEEPEP_MATRIX_SINGLE_SBATCH_REASON
+ if tdef.cmd_args.use_deepep_matrix and tdef.cmd_args.gen is None
+ else None
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def single_sbatch_unsupported_reason(self) -> str | None: | |
| tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test) | |
| return DEEPEP_MATRIX_SINGLE_SBATCH_REASON if tdef.cmd_args.use_deepep_matrix else None | |
| def single_sbatch_unsupported_reason(self) -> str | None: | |
| tdef: UCCTestDefinition = cast(UCCTestDefinition, self.test_run.test) | |
| return ( | |
| DEEPEP_MATRIX_SINGLE_SBATCH_REASON | |
| if tdef.cmd_args.use_deepep_matrix and tdef.cmd_args.gen is None | |
| else None | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/cloudai/workloads/ucc_test/slurm_command_gen_strategy.py` around lines 48
- 51, The single-sbatch guard in single_sbatch_unsupported_reason is too strict
because it rejects every use_deepep_matrix case even when gen provides the
fallback that _deepep_ucc_matrix_host_path relies on. Update the logic in
single_sbatch_unsupported_reason to only return
DEEPEP_MATRIX_SINGLE_SBATCH_REASON when use_deepep_matrix is enabled and the
config still lacks a usable gen fallback, so the same compatibility check
matches the behavior in _deepep_ucc_matrix_host_path and allows valid
single-sbatch configs.
| def test_rejects_single_sbatch_incompatible(nccl_tr: TestRun, slurm_system: SlurmSystem) -> None: | ||
| cast(NCCLTestDefinition, nccl_tr.test).cmd_args.use_deepep_matrix = True | ||
| tc = TestScenario(name="tc", test_runs=[nccl_tr]) | ||
| runner = SingleSbatchRunner(mode="run", system=slurm_system, test_scenario=tc, output_path=slurm_system.output_path) | ||
|
|
||
| with pytest.raises(ValueError) as exc_info: | ||
| runner.gen_sbatch_content() | ||
|
|
||
| msg = str(exc_info.value) | ||
| assert "single-sbatch mode" in msg | ||
| assert nccl_tr.name in msg | ||
| assert "use_deepep_matrix" in msg | ||
|
|
||
|
|
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Narrow the exception assertion per Ruff PT011.
pytest.raises(ValueError) without match= would also pass if an unrelated ValueError were raised elsewhere in gen_sbatch_content(), masking a real regression in _reject_single_sbatch_incompatible().
✅ Proposed fix
- with pytest.raises(ValueError) as exc_info:
+ with pytest.raises(ValueError, match="cannot run in single-sbatch mode") as exc_info:
runner.gen_sbatch_content()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def test_rejects_single_sbatch_incompatible(nccl_tr: TestRun, slurm_system: SlurmSystem) -> None: | |
| cast(NCCLTestDefinition, nccl_tr.test).cmd_args.use_deepep_matrix = True | |
| tc = TestScenario(name="tc", test_runs=[nccl_tr]) | |
| runner = SingleSbatchRunner(mode="run", system=slurm_system, test_scenario=tc, output_path=slurm_system.output_path) | |
| with pytest.raises(ValueError) as exc_info: | |
| runner.gen_sbatch_content() | |
| msg = str(exc_info.value) | |
| assert "single-sbatch mode" in msg | |
| assert nccl_tr.name in msg | |
| assert "use_deepep_matrix" in msg | |
| def test_rejects_single_sbatch_incompatible(nccl_tr: TestRun, slurm_system: SlurmSystem) -> None: | |
| cast(NCCLTestDefinition, nccl_tr.test).cmd_args.use_deepep_matrix = True | |
| tc = TestScenario(name="tc", test_runs=[nccl_tr]) | |
| runner = SingleSbatchRunner(mode="run", system=slurm_system, test_scenario=tc, output_path=slurm_system.output_path) | |
| with pytest.raises(ValueError, match="cannot run in single-sbatch mode") as exc_info: | |
| runner.gen_sbatch_content() | |
| msg = str(exc_info.value) | |
| assert "single-sbatch mode" in msg | |
| assert nccl_tr.name in msg | |
| assert "use_deepep_matrix" in msg |
🧰 Tools
🪛 Ruff (0.15.20)
[warning] 495-495: pytest.raises(ValueError) is too broad, set the match parameter or use a more specific exception
(PT011)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/test_single_sbatch_runner.py` around lines 490 - 503, The test for
SingleSbatchRunner is asserting on the exception message manually instead of
using a narrowed pytest.raises match, which can hide unrelated ValueError
failures in gen_sbatch_content(). Update test_rejects_single_sbatch_incompatible
to use pytest.raises(ValueError, match=...) with a pattern that covers the
existing checks for single-sbatch mode, nccl_tr.name, and use_deepep_matrix, so
the assertion stays focused on _reject_single_sbatch_incompatible().
Source: Linters/SAST tools
Summary
Provide a concise summary of the changes introduced by this pull request. Detail the purpose and scope of the changes, referencing any relevant issues or discussions. Explain how these changes address the problem or improve the project.
Test Plan
In this section, describe the testing you have performed to verify the changes. Include:
This information is crucial for reviewers to understand how the changes have been validated.
Additional Notes
Include any other notes or comments about the pull request here. This can include challenges faced, future considerations, or context that reviewers might find helpful.