fix: make MCP Slurm wait poll remote job state#1900
Conversation
Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
📝 WalkthroughWalkthroughAdds Slurm submit metadata persistence and SSH-based status resolution (sacct/squeue) into bridge.py with new helper functions and updated job_status_impl, changes the managed-source launcher's uv invocation to use --with-editable, adds corresponding tests, and introduces a new hostname smoke-test script. ChangesSlurm Sidecar Status Tracking
Managed-source Launcher Argv Change
Hostname Smoke Test Script
Estimated code review effort: 3 (Moderate) | ~25 minutes Sequence Diagram(s)sequenceDiagram
participant Client
participant Bridge as bridge.py
participant ExperimentDir as Experiment Directory
participant SlurmNode as Slurm Login Node
Client->>Bridge: submit Slurm job
Bridge->>ExperimentDir: write _SLURM_STATUS_META (job id, ssh info)
Client->>Bridge: job_status_impl(experiment_id)
Bridge->>ExperimentDir: load _SLURM_STATUS_META
Bridge->>SlurmNode: SSH sacct -P (fallback squeue)
SlurmNode-->>Bridge: Slurm state/exit code
Bridge->>Bridge: map Slurm state to MCP status
Bridge-->>Client: overall status (done/running/failed/unknown)
Suggested reviewers: Caution Pre-merge checks failedPlease resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tools/mcp/modelopt_mcp/bridge.py (1)
1705-1780: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low valueConsider validating/quoting
slurm_job_idbefore it reaches the remote shell.
job_idis interpolated into thesacct/squeuecommand strings, whichsshexecutes through the remote login shell. Todayslurm_job_idis parsed from trusted launcher output (numeric), so this is not currently exploitable, but a non-numeric value would be interpreted by the remote shell. A cheap defense-in-depth is to assert the id matches an expected pattern (e.g.^\d+(_\d+)?(\.\w+)?$) before building the command.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1705 - 1780, The _query_slurm_status helper builds remote sacct/squeue command strings using slurm_job_id, so add a defensive validation step before constructing those commands. In _query_slurm_status, ensure job_id matches the expected Slurm job-id format (for example numeric with optional array/step suffixes) and return a failed diagnostic if it does not, then only pass the validated id into _ssh_argv_for_slurm and the sacct/squeue command strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1826-1841: In the Slurm status handling branch of the experiment
status lookup, do not always fall back to running when _query_slurm_status
returns a non-ok result. Update the logic around _load_slurm_status_metadata and
_query_slurm_status to treat slurm_status_not_found as a special case and
consult the local completion indicators (_DONE and status_*.out) before
defaulting to running. Keep the returned slurm_status payload intact, but ensure
the overall status can become completed when the local markers show the
experiment has finished so wait_for_experiment_impl stops polling correctly.
---
Nitpick comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1705-1780: The _query_slurm_status helper builds remote
sacct/squeue command strings using slurm_job_id, so add a defensive validation
step before constructing those commands. In _query_slurm_status, ensure job_id
matches the expected Slurm job-id format (for example numeric with optional
array/step suffixes) and return a failed diagnostic if it does not, then only
pass the validated id into _ssh_argv_for_slurm and the sacct/squeue command
strings.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 64116f92-4ae7-4046-b0ef-672474565ecb
📒 Files selected for processing (3)
tools/launcher/common/smoke/hostname.shtools/mcp/modelopt_mcp/bridge.pytools/mcp/tests/test_bridge.py
| slurm_meta = _load_slurm_status_metadata(exp_dir) | ||
| if slurm_meta is not None: | ||
| slurm_status = _query_slurm_status(slurm_meta) | ||
| if slurm_status.get("ok"): | ||
| overall = slurm_status.get("status", "unknown") | ||
| else: | ||
| overall = "running" | ||
| return { | ||
| "ok": True, | ||
| "experiment_id": experiment_id, | ||
| "experiment_dir": str(exp_dir), | ||
| "status": overall, | ||
| "task_statuses": task_statuses, | ||
| "has_done_marker": done_marker.exists(), | ||
| "slurm_status": slurm_status, | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect relevant symbols and surrounding code in bridge.py
python3 - <<'PY'
from pathlib import Path
path = Path("tools/mcp/modelopt_mcp/bridge.py")
text = path.read_text().splitlines()
symbols = ["def _query_slurm_status", "def _load_slurm_status_metadata", "def wait_for_experiment_impl", "status_*.out", "_DONE", "slurm_status_not_found"]
for sym in symbols:
for i, line in enumerate(text, 1):
if sym in line:
start = max(1, i-25)
end = min(len(text), i+120)
print(f"\n=== {sym} @ line {i} ===")
for j in range(start, end+1):
print(f"{j:5d}: {text[j-1]}")
break
# Search for other slurm status handling and wait-timeout behavior
print("\n=== SEARCH RESULTS ===")
for needle in ["wait_timeout", "overall = \"running\"", "has_done_marker", "status_*.out", "slurm_status_not_found", "task_statuses"]:
print(f"\n-- {needle} --")
for i, line in enumerate(text, 1):
if needle in line:
print(f"{i}: {line}")
PYRepository: NVIDIA/Model-Optimizer
Length of output: 39024
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1800,1955p' tools/mcp/modelopt_mcp/bridge.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 6877
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "_save_slurm_status_metadata|_remove_slurm_status_metadata|_load_slurm_status_metadata|slurm_enabled|slurm_meta" tools/mcp/modelopt_mcp/bridge.pyRepository: NVIDIA/Model-Optimizer
Length of output: 383
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '1978,2175p' tools/mcp/modelopt_mcp/bridge.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 9278
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '2060,2170p' tools/mcp/modelopt_mcp/bridge.py | cat -nRepository: NVIDIA/Model-Optimizer
Length of output: 4902
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "TERMINAL_STATUSES|class ExperimentStatusResult|class WaitForExperimentResult|class QuerySlurmStatusResult" tools/mcp/modelopt_mcp/bridge.pyRepository: NVIDIA/Model-Optimizer
Length of output: 160
Handle slurm_status_not_found with local completion markers If the Slurm query comes back slurm_status_not_found, this branch still reports running even when _DONE or status_*.out already mark the experiment complete. That can keep wait_for_experiment_impl polling until timeout for an aged-out finished job; fall back to the local markers before defaulting to running.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1826 - 1841, In the Slurm
status handling branch of the experiment status lookup, do not always fall back
to running when _query_slurm_status returns a non-ok result. Update the logic
around _load_slurm_status_metadata and _query_slurm_status to treat
slurm_status_not_found as a special case and consult the local completion
indicators (_DONE and status_*.out) before defaulting to running. Keep the
returned slurm_status payload intact, but ensure the overall status can become
completed when the local markers show the experiment has finished so
wait_for_experiment_impl stops polling correctly.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1900 +/- ##
==========================================
+ Coverage 61.21% 61.24% +0.02%
==========================================
Files 515 515
Lines 57245 57245
==========================================
+ Hits 35043 35060 +17
+ Misses 22202 22185 -17
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Summary
uv run --with-editableso themodelopt-launcherconsole script is availablejob_status/wait_for_experimentprefer remotesacct/squeuestate over local_DONEcommon/smoke/hostname.shused byexamples/smoke/hostname.yamlWhy
Detached Slurm submission creates the local
_DONEmarker when the submit process exits, not when the remote Slurm job completes. This made MCPwait_for_experimentreturndonewhile Slurm jobs were still running. Reproduced on both MFA and non-MFA clusters.Validation
uv run --project tools/mcp ruff check tools/mcp/modelopt_mcp/bridge.py tools/mcp/tests/test_bridge.pyuv run --project tools/mcp pytest tools/mcp/tests/test_bridge.pyuv run --project tools/launcher pytest tools/launcher/tests/test_core_extended.py tools/launcher/tests/test_slurm_config.pyuv run --with-editable tools/launcher modelopt-launcher --yaml tools/launcher/examples/smoke/hostname.yaml --dryrun --yesManual smoke evidence
cicd_1783037181, Slurm job2320415, exit0:02945347was still running; job later completed with exit0:0Summary by CodeRabbit
New Features
Bug Fixes