Add pre-test hook support for MegatronBridge workload by saivishal1999 · Pull Request #957 · NVIDIA/cloudai

saivishal1999 · 2026-07-02T07:09:11Z

MegatronBridge's gen_exec_command uses setup_experiment.py to generate its own sbatch, bypassing CloudAI's normal pre_test injection. This adds support by submitting the pre-hook as a separate sbatch first, then passing dependency=afterok:<job_id> to setup_experiment.py via --additional_slurm_params so Slurm gates training on pre-hook success.

_gen_pre_hook_sbatch: generates standalone sbatch for pre-hook tests, injecting main test's nnodes so srun allocates all training nodes
_collect_additional_slurm_params: extracted from _build_launcher_parts so the base params can be built in Python and dependency appended at bash runtime after the pre-hook job ID is known
_wrap_launcher_for_job_id_and_quiet_output: submits pre-hook sbatch, captures job ID, builds --additional_slurm_params with dependency

MegatronBridge's gen_exec_command uses setup_experiment.py to generate its own sbatch, bypassing CloudAI's normal pre_test injection. This adds support by submitting the pre-hook as a separate sbatch first, then passing dependency=afterok:<job_id> to setup_experiment.py via --additional_slurm_params so Slurm gates training on pre-hook success. - _gen_pre_hook_sbatch: generates standalone sbatch for pre-hook tests, injecting main test's nnodes so srun allocates all training nodes - _collect_additional_slurm_params: extracted from _build_launcher_parts so the base params can be built in Python and dependency appended at bash runtime after the pre-hook job ID is known - _wrap_launcher_for_job_id_and_quiet_output: submits pre-hook sbatch, captures job ID, builds --additional_slurm_params with dependency Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-07-02T07:09:30Z

📝 Walkthrough

Walkthrough

This PR adds optional pre-test sbatch support to the Megatron-Bridge Slurm command generation strategy. A standalone pre-hook sbatch script can now be generated, submitted, and its job ID used as an afterok dependency for the main launcher job, with conditional inclusion of additional Slurm parameters.

Changes

Pre-hook sbatch dependency for Megatron-Bridge

Layer / File(s)	Summary
Pre-hook script generation and entry wiring `src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`	`gen_exec_command` conditionally generates a pre-hook sbatch script via new `_gen_pre_hook_sbatch`, and builds `--additional_slurm_params` via new `_collect_additional_slurm_params`, passing both into the wrapper generator.
Wrapper submission and job ID dependency injection `src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`	`_wrap_launcher_for_job_id_and_quiet_output` gains `pre_hook_sbatch_path` and `base_slurm_params` parameters, submits the pre-hook via sbatch, parses its job ID (failing fast on error), and injects `dependency=afterok:<job_id>` into the main launcher invocation.
Conditional Slurm params in launcher parts `src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`	`_build_launcher_parts` accepts an `include_slurm_params` flag guarding appending `--additional_slurm_params`, avoiding duplication between pre-hook and main launcher commands.

Estimated code review effort: 4 (Complex) | ~45 minutes

Poem

A rabbit hops before the race,
submits a hook to hold its place,
waits for job IDs, then binds the chain,
afterok, dependency remains,
two scripts now run in tidy grace. 🐇

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and concisely describes the main change: adding pre-test hook support for the MegatronBridge workload.
Description check	✅ Passed	The description directly matches the changeset and explains the new pre-hook submission and dependency wiring.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 133-136: The pre-hook sbatch generation in build_sbatch_lines is
dropping excluded nodes when get_cached_nodes_spec() only provides num_nodes.
Update the sbatch line assembly so self.test_run.exclude_nodes is preserved even
in the --nodes path, not just when node_list is present, and keep the existing
behavior around num_nodes in SlurmCommandGenStrategy while ensuring excluded
nodes are still passed through.
- Around line 117-118: The pre-test node assignment in
slurm_command_gen_strategy.py is using the read-only TestRun.nnodes property,
which will fail before sbatch generation. Update the loop in the strategy method
that iterates over self.test_run.pre_test.test_runs to assign the underlying
mutable field num_nodes instead of nnodes, keeping the existing logic that
copies the parent test run’s node count into each pre-test run.
- Line 139: The pre-hook submission in gen_pre_test / SlurmCommandGenStrategy
currently only propagates PRE_TEST_SUCCESS, so a failed pre-test can still let
the sbatch wrapper exit successfully and release the main job. Update the
command assembly around sbatch_lines.extend and pre_hook_cmds to explicitly exit
non-zero when the pre-test check fails, using the gen_pre_test-generated success
flag as the gate so the pre-hook sbatch cannot complete as success on failure.
- Around line 344-350: Make the include_slurm_params parameter keyword-only in
_build_launcher_parts so it cannot be passed positionally and the FBT001/FBT002
warning is eliminated. Update the _build_launcher_parts function signature to
force include_slurm_params as a keyword-only argument, and ensure any internal
call sites use the named argument form; the existing keyword-based caller should
continue to work unchanged.
- Around line 268-270: The shell wrapper in the Slurm command generation logic
is inserting base Slurm args directly into the ADDITIONAL_SLURM_PARAMS
assignment, which can allow shell expansion and produce a leading separator when
the base value is empty. Update the Megatron bridge’s slurm_command_gen_strategy
around the ADDITIONAL_SLURM_PARAMS construction to safely quote/escape
base_slurm_params before embedding it in the shell snippet, and adjust the
PRE_HOOK_JOB_ID append logic so it only adds the dependency separator when
ADDITIONAL_SLURM_PARAMS already has content. Use the existing command assembly
in the strategy method that emits these lines to keep the fix localized.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: dce884bc-1a4a-49c7-bf8d-0aa679ff96f6

📥 Commits

Reviewing files that changed from the base of the PR and between 5e0c351 and a5f95a4.

📒 Files selected for processing (1)

src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

coderabbitai · 2026-07-02T07:14:18Z

+        if node_list:
+            sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}")
+        elif num_nodes:
+            sbatch_lines.append(f"#SBATCH --nodes={num_nodes}")


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Preserve excluded nodes in the pre-hook sbatch.

When get_cached_nodes_spec() returns only num_nodes, this script emits --nodes but drops self.test_run.exclude_nodes, so the pre-hook can run on nodes the main job excludes.

🐛 Proposed fix

if node_list: sbatch_lines.append(f"`#SBATCH` --nodelist={','.join(node_list)}") elif num_nodes: sbatch_lines.append(f"`#SBATCH` --nodes={num_nodes}") + if self.test_run.exclude_nodes: + sbatch_lines.append(f"`#SBATCH` --exclude={','.join(self.test_run.exclude_nodes)}")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if node_list:

sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}")

elif num_nodes:

sbatch_lines.append(f"#SBATCH --nodes={num_nodes}")

if node_list:

sbatch_lines.append(f"`#SBATCH` --nodelist={','.join(node_list)}")

elif num_nodes:

sbatch_lines.append(f"`#SBATCH` --nodes={num_nodes}")

if self.test_run.exclude_nodes:

sbatch_lines.append(f"`#SBATCH` --exclude={','.join(self.test_run.exclude_nodes)}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around lines 133 - 136, The pre-hook sbatch generation in build_sbatch_lines is dropping excluded nodes when get_cached_nodes_spec() only provides num_nodes. Update the sbatch line assembly so self.test_run.exclude_nodes is preserved even in the --nodes path, not just when node_list is present, and keep the existing behavior around num_nodes in SlurmCommandGenStrategy while ensuring excluded nodes are still passed through.

this is a broader issue rather than only num nodes. it also doesn't set distribution, gres/gpu-per-node and some other potential directives. what I suggest as a fix, is to update the base class SlurmCommandGenStrategy with extract directives generation into a sub-method. So that you could explicitly call it here to write directives into your buffer

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash fd -a SlurmCommandGenStrategy fd . -a --type f -g '*slurm_command_gen_strategy.py' src/cloudai/systems/slurm </command>

Length of output: 810

🏁 Script executed:

#!/bin/bash fd -a slurm_command_gen_strategy.py echo "---" fd -a -e py . src/cloudai/systems/slurm

Length of output: 2675

🏁 Script executed:

#!/bin/bash ast-grep outline src/cloudai/systems/slurm/slurm_command_gen_strategy.py

Length of output: 995

🏁 Script executed:

#!/bin/bash ast-grep run --pattern 'def _append_sbatch_directives($$$) { $$$ }' --lang python src/cloudai/systems/slurm/slurm_command_gen_strategy.py echo "---" sed -n '1,400p' src/cloudai/systems/slurm/slurm_command_gen_strategy.py | grep -n "_append_sbatch_directives\|_append_nodes_related_directives\|_write_sbatch_script"

Length of output: 1164

🏁 Script executed:

#!/bin/bash sed -n '342,470p' src/cloudai/systems/slurm/slurm_command_gen_strategy.py

Length of output: 5433

@podkidyshev, agreed — that's the better fix. I checked src/cloudai/systems/slurm/slurm_command_gen_strategy.py, and it already has most of this logic centralized in _append_sbatch_directives (Line 375) and _append_nodes_related_directives (Line 411), covering: reservation, --distribution, nodelist/hostfile vs -N, exclude, gpus-per-node/gres, ntasks-per-node, --time, and extra_sbatch_args. The _gen_pre_hook_sbatch method in slurm_command_gen_strategy.py currently reimplements only a small subset of that (job-name, output/error, partition, account, nodelist/nodes, time_limit), which is why distribution, gres/gpus-per-node, ntasks-per-node, extra_sbatch_args, and exclude are all missing there.

The one wrinkle is that _append_sbatch_directives hardcodes self.test_run.output_path (for stdout/stderr) and self.job_name(), both of which differ for the pre-hook script (pre_hook subdir, pre_hook_ prefix). So a clean extraction would be something like pulling the node/GPU/misc-directive portion (distribution, nodelist/exclude/hostfile, gpus-per-node/gres, ntasks-per-node, extra_sbatch_args, time_limit) out of _append_sbatch_directives into its own method, e.g. _append_resource_directives(content: List[str]), that both _write_sbatch_script and _gen_pre_hook_sbatch call directly, while each caller still appends its own job-name/output/error/partition/account lines.

Since this touches the shared base class and is broader than the original single-line fix, would you like this done in this PR, or should I open a follow-up issue to track the base-class refactor so both _gen_pre_hook_sbatch and the main sbatch path stay in sync?

✏️ Learnings added

Learnt from: podkidyshev Repo: NVIDIA/cloudai PR: 957 File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:133-136 Timestamp: 2026-07-03T12:16:35.962Z Learning: In NVIDIA/cloudai, src/cloudai/systems/slurm/slurm_command_gen_strategy.py already centralizes most SBATCH directive generation (distribution, nodelist/hostfile/-N, exclude_nodes, gpus-per-node/gres, ntasks-per-node, extra_sbatch_args, time_limit) inside `_append_sbatch_directives` and `_append_nodes_related_directives`. When a workload-specific strategy (e.g., src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py's `_gen_pre_hook_sbatch`) needs to generate a standalone sbatch script, it should reuse/extend these base-class methods rather than reimplementing a partial subset of directives, to avoid missing directives like distribution, gres/gpus-per-node, ntasks-per-node, extra_sbatch_args, and exclude_nodes. Maintainer podkidyshev suggested extracting the resource-directive portion of `_append_sbatch_directives` into its own reusable sub-method on the base class for this purpose.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

🧠 Learnings used

Learnt from: amaslenn Repo: NVIDIA/cloudai PR: 754 File: src/cloudai/_core/registry.py:226-234 Timestamp: 2025-12-16T19:47:41.994Z Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Learnt from: amaslenn Repo: NVIDIA/cloudai PR: 799 File: src/cloudai/workloads/vllm/report_generation_strategy.py:47-48 Timestamp: 2026-02-10T13:29:25.671Z Learning: In src/cloudai/workloads/vllm/report_generation_strategy.py, it is acceptable to annotate parse_vllm_bench_output with a caching decorator (e.g., functools.cache) because benchmark result files are immutable after creation in typical usage. This improves repeated calls by avoiding re-parsing unchanged data. Ensure the function is side-effect free and that cache invalidation is not needed for these inputs. If inputs include non-file-state variability, consider cache keys on file path or content hash.

Learnt from: podkidyshev Repo: NVIDIA/cloudai PR: 821 File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:87-97 Timestamp: 2026-03-10T11:01:25.158Z Learning: In NVIDIA/cloudai repository, there is no secrets management infrastructure; environment variable values (including tokens/secrets) are serialized as literal strings into generated commands and run artifacts (e.g., cloudai_generated_command.sh). This is an architectural limitation acknowledged by the maintainer (podkidyshev). During code reviews for this repo, do not flag secret-serialization issues unless a secrets management solution is available. If a future change introduces proper secrets handling, update this guideline accordingly.

Learnt from: podkidyshev Repo: NVIDIA/cloudai PR: 821 File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:69-70 Timestamp: 2026-03-10T11:09:12.655Z Learning: In src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, _build_launcher_parts uses ' '.join(parts) intentionally. The parts list contains raw shell constructs and pre-quoted entries (via shlex.quote) so they survive as single arguments to the Python launcher. Replacing with shlex.join(parts) would double-quote shell syntax and break the generated bash command. Do not suggest replacing ' '.join(parts) with shlex.join(parts) in this file during reviews.

Learnt from: podkidyshev Repo: NVIDIA/cloudai PR: 829 File: src/cloudai/workloads/common/nixl.py:198-217 Timestamp: 2026-03-13T14:17:44.666Z Learning: In NVIDIA/cloudai, when reviewing code in src/cloudai/workloads/common/nixl.py, understand that NIXL Pydantic models (e.g., NIXLExtendedCmdArgs) may define fields like device_list and total_buffer_size as str | list[str] to support sweep configuration. However, by the time methods such as _device_list_mounts in NIXLCmdGenBase are invoked, these fields are resolved to a single str or None. Casting them to str | None at the command-generation layer is intentional and correct and should not be flagged as a bug. If reviewing related code, verify that the sweep expansion happens before command generation and that the command-gen layer operates on concrete string values or None rather than lists.

Learnt from: amaslenn Repo: NVIDIA/cloudai PR: 836 File: tests/workloads/test_llm_serving.py:35-83 Timestamp: 2026-03-16T14:10:11.280Z Learning: In the NVIDIA/cloudai repository, require meaningful docstrings for non-test Python classes (production/library code) to improve readability and maintainability. Test doubles, test helper classes, and other classes defined inside test files (e.g., Fake*, Plain*, stub classes) are exempt from the docstring requirement. Enforce via automated checks (e.g., a docstring style/quality gate using CI) that validate presence and quality of class docstrings for production code while skipping test-only classes within test files.

Learnt from: amaslenn Repo: NVIDIA/cloudai PR: 839 File: src/cloudai/workloads/sglang/slurm_command_gen_strategy.py:41-41 Timestamp: 2026-03-17T20:01:53.137Z Learning: In src/cloudai/workloads/sglang/slurm_command_gen_strategy.py, 0.0.0.0 bindings (default prefill_host/decode_host in get_serve_commands and --host 0.0.0.0 in get_helper_command) are intentional for in-cluster serving. Ruff S104 warnings on these lines should not be flagged or require suppression comments, as the maintainer (amaslenn) has explicitly confirmed this is the intended behavior.

Learnt from: podkidyshev Repo: NVIDIA/cloudai PR: 846 File: src/cloudai/reporter.py:168-168 Timestamp: 2026-03-24T23:24:17.668Z Learning: In the NVIDIA/cloudai repository, follow the project-wide logging convention: use root/global logging calls like `logging.info(...)` and `logging.warning(...)` rather than creating module-level loggers via `logging.getLogger(__name__)`. When reviewing Python code, do not flag or reference LOG015 for this reason and do not suggest switching to module-level loggers. Separately, Ruff’s formatting rule (e.g., G004 f-string vs %-style) is independent and may still be raised if applicable.

Learnt from: podkidyshev Repo: NVIDIA/cloudai PR: 869 File: src/cloudai/_core/test_scenario.py:23-23 Timestamp: 2026-04-14T11:05:59.444Z Learning: In the NVIDIA/cloudai repository, do not surface or fail code-review findings for Ruff UP035 warnings about deprecated typing aliases (e.g., typing.List, typing.Set, typing.Type) on individual files. This deprecation cleanup is planned for a dedicated whole-repo modernization bump, so UP035 should be treated as an expected/known issue rather than something reviewers should block at the per-file level.

Learnt from: orbalayla-nvidia Repo: NVIDIA/cloudai PR: 899 File: src/cloudai/_core/grading_strategy.py:2-2 Timestamp: 2026-06-07T12:01:43.285Z Learning: In the NVIDIA/cloudai repo, copyright headers’ year formatting must match the test logic enforced by tests/test_check_copyright_headers.py (via collect_years_same_file and _format_years_to_ranges): use comma-separated year lists for non-consecutive “touched” years (e.g., "2024, 2026") and use hyphen ranges only for consecutive years (e.g., "2024-2026"). Do not “normalize” "YYYY, YYYY" into "YYYY-YYYY" unless the missing years are actually present in git history; otherwise the copyright header check will fail.

Learnt from: rutayan-nv Repo: NVIDIA/cloudai PR: 901 File: src/cloudai/cli/handlers.py:147-162 Timestamp: 2026-06-24T01:05:39.500Z Learning: For NVIDIA/cloudai: Ruff rule G004 (logging-f-string) is not enabled in this repo’s Ruff configuration, and CI Ruff checks pass even with f-string usage in logging calls. During code reviews, do not flag or request changes specifically for Ruff G004 violations; leave logging f-strings as-is in .py files unless CI or other enabled rules fail.

coderabbitai · 2026-07-02T07:14:18Z

+            sbatch_lines.append(f"#SBATCH --nodes={num_nodes}")
+        if self.test_run.time_limit:
+            sbatch_lines.append(f"#SBATCH --time={self.test_run.time_limit}")
+        sbatch_lines.extend(["", pre_hook_cmds])


🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Exit non-zero when the pre-hook success check fails.

gen_pre_test() only sets PRE_TEST_SUCCESS; without an explicit exit here, the pre-hook sbatch can finish successfully and unblock the main training job even after a failed pre-test.

🐛 Proposed fix

- sbatch_lines.extend(["", pre_hook_cmds]) + sbatch_lines.extend( + [ + "", + pre_hook_cmds, + "", + 'if [ "${PRE_TEST_SUCCESS:-0}" -ne 1 ]; then', + ' echo "Pre-hook failed; not releasing dependent training job." >&2', + " exit 1", + "fi", + ] + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

sbatch_lines.extend(["", pre_hook_cmds])

sbatch_lines.extend(

[

"",

pre_hook_cmds,

"",

'if [ "${PRE_TEST_SUCCESS:-0}" -ne 1 ]; then',

' echo "Pre-hook failed; not releasing dependent training job." >&2',

" exit 1",

"fi",

]

)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` at line 139, The pre-hook submission in gen_pre_test / SlurmCommandGenStrategy currently only propagates PRE_TEST_SUCCESS, so a failed pre-test can still let the sbatch wrapper exit successfully and release the main job. Update the command assembly around sbatch_lines.extend and pre_hook_cmds to explicitly exit non-zero when the pre-test check fails, using the gen_pre_test-generated success flag as the gate so the pre-hook sbatch cannot complete as success on failure.

coderabbitai · 2026-07-02T07:14:18Z

+                f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"',
+                'if [ -n "$PRE_HOOK_JOB_ID" ]; then',
+                '  ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Quote base Slurm params and avoid a leading empty parameter.

base_slurm_params is injected into a double-quoted shell assignment, so $, backticks, or quotes from extra Slurm args can be expanded by the wrapper. Also, an empty base value becomes ;dependency=....

🐛 Proposed fix

if pre_hook_sbatch_path is not None: + quoted_base_slurm_params = shlex.quote(base_slurm_params) pre_hook_lines = [ f'PRE_HOOK_SBATCH="{pre_hook_sbatch_path.absolute()}"', 'PRE_HOOK_OUTPUT=$(sbatch "$PRE_HOOK_SBATCH" 2>&1)', @@ "fi", 'echo "Submitted pre-hook batch job $PRE_HOOK_JOB_ID"', - f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"', - 'if [ -n "$PRE_HOOK_JOB_ID" ]; then', + f"ADDITIONAL_SLURM_PARAMS={quoted_base_slurm_params}", + 'if [ -n "$ADDITIONAL_SLURM_PARAMS" ]; then', ' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"', + "else", + ' ADDITIONAL_SLURM_PARAMS="dependency=afterok:${PRE_HOOK_JOB_ID}"', "fi", "", ]

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"',

'if [ -n "$PRE_HOOK_JOB_ID" ]; then',

' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',

quoted_base_slurm_params = shlex.quote(base_slurm_params)

pre_hook_lines = [

f'PRE_HOOK_SBATCH="{pre_hook_sbatch_path.absolute()}"',

'PRE_HOOK_OUTPUT=$(sbatch "$PRE_HOOK_SBATCH" 2>&1)',

# ...

"fi",

'echo "Submitted pre-hook batch job $PRE_HOOK_JOB_ID"',

f"ADDITIONAL_SLURM_PARAMS={quoted_base_slurm_params}",

'if [ -n "$ADDITIONAL_SLURM_PARAMS" ]; then',

' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',

"else",

' ADDITIONAL_SLURM_PARAMS="dependency=afterok:${PRE_HOOK_JOB_ID}"',

"fi",

"",

]

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around lines 268 - 270, The shell wrapper in the Slurm command generation logic is inserting base Slurm args directly into the ADDITIONAL_SLURM_PARAMS assignment, which can allow shell expansion and produce a leading separator when the base value is empty. Update the Megatron bridge’s slurm_command_gen_strategy around the ADDITIONAL_SLURM_PARAMS construction to safely quote/escape base_slurm_params before embedding it in the shell snippet, and adjust the PRE_HOOK_JOB_ID append logic so it only adds the dependency separator when ADDITIONAL_SLURM_PARAMS already has content. Use the existing command assembly in the strategy method that emits these lines to keep the fix localized.

coderabbitai · 2026-07-02T07:14:18Z

    def _build_launcher_parts(  # noqa: C901
-        self, args: MegatronBridgeCmdArgs, tdef: MegatronBridgeTestDefinition, repo_path: Path, launcher_py: Path
+        self,
+        args: MegatronBridgeCmdArgs,
+        tdef: MegatronBridgeTestDefinition,
+        repo_path: Path,
+        launcher_py: Path,
+        include_slurm_params: bool = True,


📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash ruff check --select FBT001,FBT002 src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

Repository: NVIDIA/cloudai

Length of output: 1524

Make include_slurm_params keyword-only. This removes the FBT001/FBT002 warning and avoids boolean positional calls; the existing keyword call site stays unchanged.

♻️ Proposed fix

args: MegatronBridgeCmdArgs, tdef: MegatronBridgeTestDefinition, repo_path: Path, launcher_py: Path, + *, include_slurm_params: bool = True,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def _build_launcher_parts( # noqa: C901

self, args: MegatronBridgeCmdArgs, tdef: MegatronBridgeTestDefinition, repo_path: Path, launcher_py: Path

self,

args: MegatronBridgeCmdArgs,

tdef: MegatronBridgeTestDefinition,

repo_path: Path,

launcher_py: Path,

include_slurm_params: bool = True,

def _build_launcher_parts( # noqa: C901

self,

args: MegatronBridgeCmdArgs,

tdef: MegatronBridgeTestDefinition,

repo_path: Path,

launcher_py: Path,

*,

include_slurm_params: bool = True,

🧰 Tools

🪛 Ruff (0.15.20)

[warning] 344-344: Too many branches (28 > 12)

(PLR0912)

[warning] 344-344: Too many statements (152 > 50)

(PLR0915)

[warning] 350-350: Boolean-typed positional argument in function definition

(FBT001)

[warning] 350-350: Boolean default positional argument in function definition

(FBT002)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around lines 344 - 350, Make the include_slurm_params parameter keyword-only in _build_launcher_parts so it cannot be passed positionally and the FBT001/FBT002 warning is eliminated. Update the _build_launcher_parts function signature to force include_slurm_params as a keyword-only argument, and ensure any internal call sites use the named argument form; the existing keyword-based caller should continue to work unchanged.

Source: Linters/SAST tools

tr.nnodes is a read-only property backed by num_nodes; set num_nodes directly so srun gets -N<main_job_nodes> instead of -N1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

podkidyshev · 2026-07-02T10:47:06Z

@saivishal1999 please resolve coderabbit comments

podkidyshev · 2026-07-03T12:13:52Z

+        if node_list:
+            sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}")
+        elif num_nodes:
+            sbatch_lines.append(f"#SBATCH --nodes={num_nodes}")


this is a broader issue rather than only num nodes. it also doesn't set distribution, gres/gpu-per-node and some other potential directives. what I suggest as a fix, is to update the base class SlurmCommandGenStrategy with extract directives generation into a sub-method. So that you could explicitly call it here to write directives into your buffer

saivishal1999 requested review from jeffnvidia, podkidyshev and srivatsankrishnan as code owners July 2, 2026 07:09

coderabbitai Bot reviewed Jul 2, 2026

View reviewed changes

Fix nnodes injection for pre-hook test runs

65d4b3a

tr.nnodes is a read-only property backed by num_nodes; set num_nodes directly so srun gets -N<main_job_nodes> instead of -N1. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

podkidyshev requested changes Jul 3, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add pre-test hook support for MegatronBridge workload#957

Add pre-test hook support for MegatronBridge workload#957
saivishal1999 wants to merge 2 commits into
NVIDIA:mainfrom
saivishal1999:spothula/nccl-alltoall-pre-hook

saivishal1999 commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot Jul 2, 2026 •

edited

Loading

Uh oh!

podkidyshev Jul 3, 2026

Uh oh!

coderabbitai Bot Jul 3, 2026

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

coderabbitai Bot Jul 2, 2026

Uh oh!

podkidyshev commented Jul 2, 2026

Uh oh!

podkidyshev Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

-        sbatch_lines.extend(["", pre_hook_cmds])
+        sbatch_lines.extend(
+            [
+                "",
+                pre_hook_cmds,
+                "",
+                'if [ "${PRE_TEST_SUCCESS:-0}" -ne 1 ]; then',
+                '  echo "Pre-hook failed; not releasing dependent training job." >&2',
+                "  exit 1",
+                "fi",
+            ]
+        )

-                f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"',
-                'if [ -n "$PRE_HOOK_JOB_ID" ]; then',
-                '  ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',
+            quoted_base_slurm_params = shlex.quote(base_slurm_params)
+            pre_hook_lines = [
+                f'PRE_HOOK_SBATCH="{pre_hook_sbatch_path.absolute()}"',
+                'PRE_HOOK_OUTPUT=$(sbatch "$PRE_HOOK_SBATCH" 2>&1)',
+                # ...
+                "fi",
+                'echo "Submitted pre-hook batch job $PRE_HOOK_JOB_ID"',
+                f"ADDITIONAL_SLURM_PARAMS={quoted_base_slurm_params}",
+                'if [ -n "$ADDITIONAL_SLURM_PARAMS" ]; then',
+                '  ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',
+                "else",
+                '  ADDITIONAL_SLURM_PARAMS="dependency=afterok:${PRE_HOOK_JOB_ID}"',
+                "fi",
+                "",
+            ]

Uh oh!

Conversation

saivishal1999 commented Jul 2, 2026

Uh oh!

coderabbitai Bot commented Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot Jul 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

podkidyshev Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 2, 2026

Choose a reason for hiding this comment

Uh oh!

podkidyshev commented Jul 2, 2026

Uh oh!

podkidyshev Jul 3, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented Jul 2, 2026 •

edited

Loading

coderabbitai Bot Jul 2, 2026 •

edited

Loading