Skip to content

Add pre-test hook support for MegatronBridge workload#957

Open
saivishal1999 wants to merge 2 commits into
NVIDIA:mainfrom
saivishal1999:spothula/nccl-alltoall-pre-hook
Open

Add pre-test hook support for MegatronBridge workload#957
saivishal1999 wants to merge 2 commits into
NVIDIA:mainfrom
saivishal1999:spothula/nccl-alltoall-pre-hook

Conversation

@saivishal1999

Copy link
Copy Markdown
Contributor

MegatronBridge's gen_exec_command uses setup_experiment.py to generate its own sbatch, bypassing CloudAI's normal pre_test injection. This adds support by submitting the pre-hook as a separate sbatch first, then passing dependency=afterok:<job_id> to setup_experiment.py via --additional_slurm_params so Slurm gates training on pre-hook success.

  • _gen_pre_hook_sbatch: generates standalone sbatch for pre-hook tests, injecting main test's nnodes so srun allocates all training nodes
  • _collect_additional_slurm_params: extracted from _build_launcher_parts so the base params can be built in Python and dependency appended at bash runtime after the pre-hook job ID is known
  • _wrap_launcher_for_job_id_and_quiet_output: submits pre-hook sbatch, captures job ID, builds --additional_slurm_params with dependency

MegatronBridge's gen_exec_command uses setup_experiment.py to generate its
own sbatch, bypassing CloudAI's normal pre_test injection. This adds support
by submitting the pre-hook as a separate sbatch first, then passing
dependency=afterok:<job_id> to setup_experiment.py via --additional_slurm_params
so Slurm gates training on pre-hook success.

- _gen_pre_hook_sbatch: generates standalone sbatch for pre-hook tests,
  injecting main test's nnodes so srun allocates all training nodes
- _collect_additional_slurm_params: extracted from _build_launcher_parts
  so the base params can be built in Python and dependency appended at
  bash runtime after the pre-hook job ID is known
- _wrap_launcher_for_job_id_and_quiet_output: submits pre-hook sbatch,
  captures job ID, builds --additional_slurm_params with dependency

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds optional pre-test sbatch support to the Megatron-Bridge Slurm command generation strategy. A standalone pre-hook sbatch script can now be generated, submitted, and its job ID used as an afterok dependency for the main launcher job, with conditional inclusion of additional Slurm parameters.

Changes

Pre-hook sbatch dependency for Megatron-Bridge

Layer / File(s) Summary
Pre-hook script generation and entry wiring
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
gen_exec_command conditionally generates a pre-hook sbatch script via new _gen_pre_hook_sbatch, and builds --additional_slurm_params via new _collect_additional_slurm_params, passing both into the wrapper generator.
Wrapper submission and job ID dependency injection
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
_wrap_launcher_for_job_id_and_quiet_output gains pre_hook_sbatch_path and base_slurm_params parameters, submits the pre-hook via sbatch, parses its job ID (failing fast on error), and injects dependency=afterok:<job_id> into the main launcher invocation.
Conditional Slurm params in launcher parts
src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py
_build_launcher_parts accepts an include_slurm_params flag guarding appending --additional_slurm_params, avoiding duplication between pre-hook and main launcher commands.

Estimated code review effort: 4 (Complex) | ~45 minutes

Poem

A rabbit hops before the race,
submits a hook to hold its place,
waits for job IDs, then binds the chain,
afterok, dependency remains,
two scripts now run in tidy grace. 🐇

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely describes the main change: adding pre-test hook support for the MegatronBridge workload.
Description check ✅ Passed The description directly matches the changeset and explains the new pre-hook submission and dependency wiring.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py`:
- Around line 133-136: The pre-hook sbatch generation in build_sbatch_lines is
dropping excluded nodes when get_cached_nodes_spec() only provides num_nodes.
Update the sbatch line assembly so self.test_run.exclude_nodes is preserved even
in the --nodes path, not just when node_list is present, and keep the existing
behavior around num_nodes in SlurmCommandGenStrategy while ensuring excluded
nodes are still passed through.
- Around line 117-118: The pre-test node assignment in
slurm_command_gen_strategy.py is using the read-only TestRun.nnodes property,
which will fail before sbatch generation. Update the loop in the strategy method
that iterates over self.test_run.pre_test.test_runs to assign the underlying
mutable field num_nodes instead of nnodes, keeping the existing logic that
copies the parent test run’s node count into each pre-test run.
- Line 139: The pre-hook submission in gen_pre_test / SlurmCommandGenStrategy
currently only propagates PRE_TEST_SUCCESS, so a failed pre-test can still let
the sbatch wrapper exit successfully and release the main job. Update the
command assembly around sbatch_lines.extend and pre_hook_cmds to explicitly exit
non-zero when the pre-test check fails, using the gen_pre_test-generated success
flag as the gate so the pre-hook sbatch cannot complete as success on failure.
- Around line 344-350: Make the include_slurm_params parameter keyword-only in
_build_launcher_parts so it cannot be passed positionally and the FBT001/FBT002
warning is eliminated. Update the _build_launcher_parts function signature to
force include_slurm_params as a keyword-only argument, and ensure any internal
call sites use the named argument form; the existing keyword-based caller should
continue to work unchanged.
- Around line 268-270: The shell wrapper in the Slurm command generation logic
is inserting base Slurm args directly into the ADDITIONAL_SLURM_PARAMS
assignment, which can allow shell expansion and produce a leading separator when
the base value is empty. Update the Megatron bridge’s slurm_command_gen_strategy
around the ADDITIONAL_SLURM_PARAMS construction to safely quote/escape
base_slurm_params before embedding it in the shell snippet, and adjust the
PRE_HOOK_JOB_ID append logic so it only adds the dependency separator when
ADDITIONAL_SLURM_PARAMS already has content. Use the existing command assembly
in the strategy method that emits these lines to keep the fix localized.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Enterprise

Run ID: dce884bc-1a4a-49c7-bf8d-0aa679ff96f6

📥 Commits

Reviewing files that changed from the base of the PR and between 5e0c351 and a5f95a4.

📒 Files selected for processing (1)
  • src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

Comment thread src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py Outdated
Comment on lines +133 to +136
if node_list:
sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}")
elif num_nodes:
sbatch_lines.append(f"#SBATCH --nodes={num_nodes}")

@coderabbitai coderabbitai Bot Jul 2, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Preserve excluded nodes in the pre-hook sbatch.

When get_cached_nodes_spec() returns only num_nodes, this script emits --nodes but drops self.test_run.exclude_nodes, so the pre-hook can run on nodes the main job excludes.

🐛 Proposed fix
         if node_list:
             sbatch_lines.append(f"`#SBATCH` --nodelist={','.join(node_list)}")
         elif num_nodes:
             sbatch_lines.append(f"`#SBATCH` --nodes={num_nodes}")
+            if self.test_run.exclude_nodes:
+                sbatch_lines.append(f"`#SBATCH` --exclude={','.join(self.test_run.exclude_nodes)}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if node_list:
sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}")
elif num_nodes:
sbatch_lines.append(f"#SBATCH --nodes={num_nodes}")
if node_list:
sbatch_lines.append(f"`#SBATCH` --nodelist={','.join(node_list)}")
elif num_nodes:
sbatch_lines.append(f"`#SBATCH` --nodes={num_nodes}")
if self.test_run.exclude_nodes:
sbatch_lines.append(f"`#SBATCH` --exclude={','.join(self.test_run.exclude_nodes)}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 133 - 136, The pre-hook sbatch generation in build_sbatch_lines is
dropping excluded nodes when get_cached_nodes_spec() only provides num_nodes.
Update the sbatch line assembly so self.test_run.exclude_nodes is preserved even
in the --nodes path, not just when node_list is present, and keep the existing
behavior around num_nodes in SlurmCommandGenStrategy while ensuring excluded
nodes are still passed through.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a broader issue rather than only num nodes. it also doesn't set distribution, gres/gpu-per-node and some other potential directives. what I suggest as a fix, is to update the base class SlurmCommandGenStrategy with extract directives generation into a sub-method. So that you could explicitly call it here to write directives into your buffer

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd -a SlurmCommandGenStrategy
fd . -a --type f -g '*slurm_command_gen_strategy.py' src/cloudai/systems/slurm
</command>

Length of output: 810


🏁 Script executed:

#!/bin/bash
fd -a slurm_command_gen_strategy.py
echo "---"
fd -a -e py . src/cloudai/systems/slurm

Length of output: 2675


🏁 Script executed:

#!/bin/bash
ast-grep outline src/cloudai/systems/slurm/slurm_command_gen_strategy.py

Length of output: 995


🏁 Script executed:

#!/bin/bash
ast-grep run --pattern 'def _append_sbatch_directives($$$) {
$$$
}' --lang python src/cloudai/systems/slurm/slurm_command_gen_strategy.py

echo "---"
sed -n '1,400p' src/cloudai/systems/slurm/slurm_command_gen_strategy.py | grep -n "_append_sbatch_directives\|_append_nodes_related_directives\|_write_sbatch_script"

Length of output: 1164


🏁 Script executed:

#!/bin/bash
sed -n '342,470p' src/cloudai/systems/slurm/slurm_command_gen_strategy.py

Length of output: 5433


@podkidyshev, agreed — that's the better fix. I checked src/cloudai/systems/slurm/slurm_command_gen_strategy.py, and it already has most of this logic centralized in _append_sbatch_directives (Line 375) and _append_nodes_related_directives (Line 411), covering: reservation, --distribution, nodelist/hostfile vs -N, exclude, gpus-per-node/gres, ntasks-per-node, --time, and extra_sbatch_args. The _gen_pre_hook_sbatch method in slurm_command_gen_strategy.py currently reimplements only a small subset of that (job-name, output/error, partition, account, nodelist/nodes, time_limit), which is why distribution, gres/gpus-per-node, ntasks-per-node, extra_sbatch_args, and exclude are all missing there.

The one wrinkle is that _append_sbatch_directives hardcodes self.test_run.output_path (for stdout/stderr) and self.job_name(), both of which differ for the pre-hook script (pre_hook subdir, pre_hook_ prefix). So a clean extraction would be something like pulling the node/GPU/misc-directive portion (distribution, nodelist/exclude/hostfile, gpus-per-node/gres, ntasks-per-node, extra_sbatch_args, time_limit) out of _append_sbatch_directives into its own method, e.g. _append_resource_directives(content: List[str]), that both _write_sbatch_script and _gen_pre_hook_sbatch call directly, while each caller still appends its own job-name/output/error/partition/account lines.

Since this touches the shared base class and is broader than the original single-line fix, would you like this done in this PR, or should I open a follow-up issue to track the base-class refactor so both _gen_pre_hook_sbatch and the main sbatch path stay in sync?


✏️ Learnings added
Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 957
File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:133-136
Timestamp: 2026-07-03T12:16:35.962Z
Learning: In NVIDIA/cloudai, src/cloudai/systems/slurm/slurm_command_gen_strategy.py already centralizes most SBATCH directive generation (distribution, nodelist/hostfile/-N, exclude_nodes, gpus-per-node/gres, ntasks-per-node, extra_sbatch_args, time_limit) inside `_append_sbatch_directives` and `_append_nodes_related_directives`. When a workload-specific strategy (e.g., src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py's `_gen_pre_hook_sbatch`) needs to generate a standalone sbatch script, it should reuse/extend these base-class methods rather than reimplementing a partial subset of directives, to avoid missing directives like distribution, gres/gpus-per-node, ntasks-per-node, extra_sbatch_args, and exclude_nodes. Maintainer podkidyshev suggested extracting the resource-directive portion of `_append_sbatch_directives` into its own reusable sub-method on the base class for this purpose.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 754
File: src/cloudai/_core/registry.py:226-234
Timestamp: 2025-12-16T19:47:41.994Z
Learning: In this repository, prefer expressing behavioral documentation through tests rather than docstrings. Tests act as living, verified documentation. Reserve docstrings for interfaces or high-level descriptions, and avoid duplicating behavior that is already covered by tests.

Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 799
File: src/cloudai/workloads/vllm/report_generation_strategy.py:47-48
Timestamp: 2026-02-10T13:29:25.671Z
Learning: In src/cloudai/workloads/vllm/report_generation_strategy.py, it is acceptable to annotate parse_vllm_bench_output with a caching decorator (e.g., functools.cache) because benchmark result files are immutable after creation in typical usage. This improves repeated calls by avoiding re-parsing unchanged data. Ensure the function is side-effect free and that cache invalidation is not needed for these inputs. If inputs include non-file-state variability, consider cache keys on file path or content hash.

Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 821
File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:87-97
Timestamp: 2026-03-10T11:01:25.158Z
Learning: In NVIDIA/cloudai repository, there is no secrets management infrastructure; environment variable values (including tokens/secrets) are serialized as literal strings into generated commands and run artifacts (e.g., cloudai_generated_command.sh). This is an architectural limitation acknowledged by the maintainer (podkidyshev). During code reviews for this repo, do not flag secret-serialization issues unless a secrets management solution is available. If a future change introduces proper secrets handling, update this guideline accordingly.

Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 821
File: src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py:69-70
Timestamp: 2026-03-10T11:09:12.655Z
Learning: In src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py, _build_launcher_parts uses ' '.join(parts) intentionally. The parts list contains raw shell constructs and pre-quoted entries (via shlex.quote) so they survive as single arguments to the Python launcher. Replacing with shlex.join(parts) would double-quote shell syntax and break the generated bash command. Do not suggest replacing ' '.join(parts) with shlex.join(parts) in this file during reviews.

Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 829
File: src/cloudai/workloads/common/nixl.py:198-217
Timestamp: 2026-03-13T14:17:44.666Z
Learning: In NVIDIA/cloudai, when reviewing code in src/cloudai/workloads/common/nixl.py, understand that NIXL Pydantic models (e.g., NIXLExtendedCmdArgs) may define fields like device_list and total_buffer_size as str | list[str] to support sweep configuration. However, by the time methods such as _device_list_mounts in NIXLCmdGenBase are invoked, these fields are resolved to a single str or None. Casting them to str | None at the command-generation layer is intentional and correct and should not be flagged as a bug. If reviewing related code, verify that the sweep expansion happens before command generation and that the command-gen layer operates on concrete string values or None rather than lists.

Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 836
File: tests/workloads/test_llm_serving.py:35-83
Timestamp: 2026-03-16T14:10:11.280Z
Learning: In the NVIDIA/cloudai repository, require meaningful docstrings for non-test Python classes (production/library code) to improve readability and maintainability. Test doubles, test helper classes, and other classes defined inside test files (e.g., Fake*, Plain*, stub classes) are exempt from the docstring requirement. Enforce via automated checks (e.g., a docstring style/quality gate using CI) that validate presence and quality of class docstrings for production code while skipping test-only classes within test files.

Learnt from: amaslenn
Repo: NVIDIA/cloudai PR: 839
File: src/cloudai/workloads/sglang/slurm_command_gen_strategy.py:41-41
Timestamp: 2026-03-17T20:01:53.137Z
Learning: In src/cloudai/workloads/sglang/slurm_command_gen_strategy.py, 0.0.0.0 bindings (default prefill_host/decode_host in get_serve_commands and --host 0.0.0.0 in get_helper_command) are intentional for in-cluster serving. Ruff S104 warnings on these lines should not be flagged or require suppression comments, as the maintainer (amaslenn) has explicitly confirmed this is the intended behavior.

Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 846
File: src/cloudai/reporter.py:168-168
Timestamp: 2026-03-24T23:24:17.668Z
Learning: In the NVIDIA/cloudai repository, follow the project-wide logging convention: use root/global logging calls like `logging.info(...)` and `logging.warning(...)` rather than creating module-level loggers via `logging.getLogger(__name__)`. When reviewing Python code, do not flag or reference LOG015 for this reason and do not suggest switching to module-level loggers. Separately, Ruff’s formatting rule (e.g., G004 f-string vs %-style) is independent and may still be raised if applicable.

Learnt from: podkidyshev
Repo: NVIDIA/cloudai PR: 869
File: src/cloudai/_core/test_scenario.py:23-23
Timestamp: 2026-04-14T11:05:59.444Z
Learning: In the NVIDIA/cloudai repository, do not surface or fail code-review findings for Ruff UP035 warnings about deprecated typing aliases (e.g., typing.List, typing.Set, typing.Type) on individual files. This deprecation cleanup is planned for a dedicated whole-repo modernization bump, so UP035 should be treated as an expected/known issue rather than something reviewers should block at the per-file level.

Learnt from: orbalayla-nvidia
Repo: NVIDIA/cloudai PR: 899
File: src/cloudai/_core/grading_strategy.py:2-2
Timestamp: 2026-06-07T12:01:43.285Z
Learning: In the NVIDIA/cloudai repo, copyright headers’ year formatting must match the test logic enforced by tests/test_check_copyright_headers.py (via collect_years_same_file and _format_years_to_ranges): use comma-separated year lists for non-consecutive “touched” years (e.g., "2024, 2026") and use hyphen ranges only for consecutive years (e.g., "2024-2026"). Do not “normalize” "YYYY, YYYY" into "YYYY-YYYY" unless the missing years are actually present in git history; otherwise the copyright header check will fail.

Learnt from: rutayan-nv
Repo: NVIDIA/cloudai PR: 901
File: src/cloudai/cli/handlers.py:147-162
Timestamp: 2026-06-24T01:05:39.500Z
Learning: For NVIDIA/cloudai: Ruff rule G004 (logging-f-string) is not enabled in this repo’s Ruff configuration, and CI Ruff checks pass even with f-string usage in logging calls. During code reviews, do not flag or request changes specifically for Ruff G004 violations; leave logging f-strings as-is in .py files unless CI or other enabled rules fail.

sbatch_lines.append(f"#SBATCH --nodes={num_nodes}")
if self.test_run.time_limit:
sbatch_lines.append(f"#SBATCH --time={self.test_run.time_limit}")
sbatch_lines.extend(["", pre_hook_cmds])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Exit non-zero when the pre-hook success check fails.

gen_pre_test() only sets PRE_TEST_SUCCESS; without an explicit exit here, the pre-hook sbatch can finish successfully and unblock the main training job even after a failed pre-test.

🐛 Proposed fix
-        sbatch_lines.extend(["", pre_hook_cmds])
+        sbatch_lines.extend(
+            [
+                "",
+                pre_hook_cmds,
+                "",
+                'if [ "${PRE_TEST_SUCCESS:-0}" -ne 1 ]; then',
+                '  echo "Pre-hook failed; not releasing dependent training job." >&2',
+                "  exit 1",
+                "fi",
+            ]
+        )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
sbatch_lines.extend(["", pre_hook_cmds])
sbatch_lines.extend(
[
"",
pre_hook_cmds,
"",
'if [ "${PRE_TEST_SUCCESS:-0}" -ne 1 ]; then',
' echo "Pre-hook failed; not releasing dependent training job." >&2',
" exit 1",
"fi",
]
)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` at line
139, The pre-hook submission in gen_pre_test / SlurmCommandGenStrategy currently
only propagates PRE_TEST_SUCCESS, so a failed pre-test can still let the sbatch
wrapper exit successfully and release the main job. Update the command assembly
around sbatch_lines.extend and pre_hook_cmds to explicitly exit non-zero when
the pre-test check fails, using the gen_pre_test-generated success flag as the
gate so the pre-hook sbatch cannot complete as success on failure.

Comment on lines +268 to +270
f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"',
'if [ -n "$PRE_HOOK_JOB_ID" ]; then',
' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Quote base Slurm params and avoid a leading empty parameter.

base_slurm_params is injected into a double-quoted shell assignment, so $, backticks, or quotes from extra Slurm args can be expanded by the wrapper. Also, an empty base value becomes ;dependency=....

🐛 Proposed fix
         if pre_hook_sbatch_path is not None:
+            quoted_base_slurm_params = shlex.quote(base_slurm_params)
             pre_hook_lines = [
                 f'PRE_HOOK_SBATCH="{pre_hook_sbatch_path.absolute()}"',
                 'PRE_HOOK_OUTPUT=$(sbatch "$PRE_HOOK_SBATCH" 2>&1)',
@@
                 "fi",
                 'echo "Submitted pre-hook batch job $PRE_HOOK_JOB_ID"',
-                f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"',
-                'if [ -n "$PRE_HOOK_JOB_ID" ]; then',
+                f"ADDITIONAL_SLURM_PARAMS={quoted_base_slurm_params}",
+                'if [ -n "$ADDITIONAL_SLURM_PARAMS" ]; then',
                 '  ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',
+                "else",
+                '  ADDITIONAL_SLURM_PARAMS="dependency=afterok:${PRE_HOOK_JOB_ID}"',
                 "fi",
                 "",
             ]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
f'ADDITIONAL_SLURM_PARAMS="{base_slurm_params}"',
'if [ -n "$PRE_HOOK_JOB_ID" ]; then',
' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',
quoted_base_slurm_params = shlex.quote(base_slurm_params)
pre_hook_lines = [
f'PRE_HOOK_SBATCH="{pre_hook_sbatch_path.absolute()}"',
'PRE_HOOK_OUTPUT=$(sbatch "$PRE_HOOK_SBATCH" 2>&1)',
# ...
"fi",
'echo "Submitted pre-hook batch job $PRE_HOOK_JOB_ID"',
f"ADDITIONAL_SLURM_PARAMS={quoted_base_slurm_params}",
'if [ -n "$ADDITIONAL_SLURM_PARAMS" ]; then',
' ADDITIONAL_SLURM_PARAMS="${ADDITIONAL_SLURM_PARAMS};dependency=afterok:${PRE_HOOK_JOB_ID}"',
"else",
' ADDITIONAL_SLURM_PARAMS="dependency=afterok:${PRE_HOOK_JOB_ID}"',
"fi",
"",
]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 268 - 270, The shell wrapper in the Slurm command generation logic is
inserting base Slurm args directly into the ADDITIONAL_SLURM_PARAMS assignment,
which can allow shell expansion and produce a leading separator when the base
value is empty. Update the Megatron bridge’s slurm_command_gen_strategy around
the ADDITIONAL_SLURM_PARAMS construction to safely quote/escape
base_slurm_params before embedding it in the shell snippet, and adjust the
PRE_HOOK_JOB_ID append logic so it only adds the dependency separator when
ADDITIONAL_SLURM_PARAMS already has content. Use the existing command assembly
in the strategy method that emits these lines to keep the fix localized.

Comment on lines 344 to +350
def _build_launcher_parts( # noqa: C901
self, args: MegatronBridgeCmdArgs, tdef: MegatronBridgeTestDefinition, repo_path: Path, launcher_py: Path
self,
args: MegatronBridgeCmdArgs,
tdef: MegatronBridgeTestDefinition,
repo_path: Path,
launcher_py: Path,
include_slurm_params: bool = True,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
ruff check --select FBT001,FBT002 src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py

Repository: NVIDIA/cloudai

Length of output: 1524


Make include_slurm_params keyword-only. This removes the FBT001/FBT002 warning and avoids boolean positional calls; the existing keyword call site stays unchanged.

♻️ Proposed fix
         args: MegatronBridgeCmdArgs,
         tdef: MegatronBridgeTestDefinition,
         repo_path: Path,
         launcher_py: Path,
+        *,
         include_slurm_params: bool = True,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _build_launcher_parts( # noqa: C901
self, args: MegatronBridgeCmdArgs, tdef: MegatronBridgeTestDefinition, repo_path: Path, launcher_py: Path
self,
args: MegatronBridgeCmdArgs,
tdef: MegatronBridgeTestDefinition,
repo_path: Path,
launcher_py: Path,
include_slurm_params: bool = True,
def _build_launcher_parts( # noqa: C901
self,
args: MegatronBridgeCmdArgs,
tdef: MegatronBridgeTestDefinition,
repo_path: Path,
launcher_py: Path,
*,
include_slurm_params: bool = True,
🧰 Tools
🪛 Ruff (0.15.20)

[warning] 344-344: Too many branches (28 > 12)

(PLR0912)


[warning] 344-344: Too many statements (152 > 50)

(PLR0915)


[warning] 350-350: Boolean-typed positional argument in function definition

(FBT001)


[warning] 350-350: Boolean default positional argument in function definition

(FBT002)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cloudai/workloads/megatron_bridge/slurm_command_gen_strategy.py` around
lines 344 - 350, Make the include_slurm_params parameter keyword-only in
_build_launcher_parts so it cannot be passed positionally and the FBT001/FBT002
warning is eliminated. Update the _build_launcher_parts function signature to
force include_slurm_params as a keyword-only argument, and ensure any internal
call sites use the named argument form; the existing keyword-based caller should
continue to work unchanged.

Source: Linters/SAST tools

tr.nnodes is a read-only property backed by num_nodes; set num_nodes
directly so srun gets -N<main_job_nodes> instead of -N1.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@podkidyshev

Copy link
Copy Markdown
Contributor

@saivishal1999 please resolve coderabbit comments

Comment on lines +133 to +136
if node_list:
sbatch_lines.append(f"#SBATCH --nodelist={','.join(node_list)}")
elif num_nodes:
sbatch_lines.append(f"#SBATCH --nodes={num_nodes}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a broader issue rather than only num nodes. it also doesn't set distribution, gres/gpu-per-node and some other potential directives. what I suggest as a fix, is to update the base class SlurmCommandGenStrategy with extract directives generation into a sub-method. So that you could explicitly call it here to write directives into your buffer

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants