fix: make MCP Slurm wait poll remote job state by ChenhanYu · Pull Request #1900 · NVIDIA/Model-Optimizer

ChenhanYu · 2026-07-04T03:05:29Z

Summary

run managed Model-Optimizer checkouts via uv run --with-editable so the modelopt-launcher console script is available
persist Slurm submission metadata beside the local experiment and make job_status / wait_for_experiment prefer remote sacct / squeue state over local _DONE
add the missing common/smoke/hostname.sh used by examples/smoke/hostname.yaml

Why

Detached Slurm submission creates the local _DONE marker when the submit process exits, not when the remote Slurm job completes. This made MCP wait_for_experiment return done while Slurm jobs were still running. Reproduced on both MFA and non-MFA clusters.

Validation

uv run --project tools/mcp ruff check tools/mcp/modelopt_mcp/bridge.py tools/mcp/tests/test_bridge.py
uv run --project tools/mcp pytest tools/mcp/tests/test_bridge.py
uv run --project tools/launcher pytest tools/launcher/tests/test_core_extended.py tools/launcher/tests/test_slurm_config.py
uv run --with-editable tools/launcher modelopt-launcher --yaml tools/launcher/examples/smoke/hostname.yaml --dryrun --yes

Manual smoke evidence

ptyche smoke completed: experiment cicd_1783037181, Slurm job 2320415, exit 0:0
computelab reproduced the wait bug pre-fix: MCP reported done immediately while Slurm job 2945347 was still running; job later completed with exit 0:0

Summary by CodeRabbit

New Features
- Improved job status reporting for Slurm-backed launches, with more accurate tracking of running, completed, and failed jobs.
- Added a hostname smoke check to help validate launcher environments.
Bug Fixes
- Status checks now rely on the remote scheduler when available, reducing false “done” results from local markers.
- Waiting for experiments now continues until the scheduler confirms a terminal state.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

coderabbitai · 2026-07-04T03:05:46Z

📝 Walkthrough

Walkthrough

Adds Slurm submit metadata persistence and SSH-based status resolution (sacct/squeue) into bridge.py with new helper functions and updated job_status_impl, changes the managed-source launcher's uv invocation to use --with-editable, adds corresponding tests, and introduces a new hostname smoke-test script.

Changes

Slurm Sidecar Status Tracking

Layer / File(s)	Summary
Slurm state constants and helpers `tools/mcp/modelopt_mcp/bridge.py`	Adds `json` import, Slurm terminal/failure/running state constants, a sidecar filename constant, and helpers to write/load metadata, build SSH argv, parse `sacct -P` output, query status via SSH with `squeue` fallback, and map Slurm state to MCP status.
Persist metadata on submission `tools/mcp/modelopt_mcp/bridge.py`	Slurm submission flow now writes experiment id, Slurm job id, and SSH connection details to the experiment directory sidecar file.
job_status_impl prefers sidecar `tools/mcp/modelopt_mcp/bridge.py`	`job_status_impl` queries Slurm via the sidecar for authoritative status instead of relying solely on local `_DONE`/`status_*.out` markers.
Tests for sidecar persistence and polling `tools/mcp/tests/test_bridge.py`	Adds `json` import, assertions on persisted sidecar fields, new tests where Slurm `RUNNING`/`COMPLETED` states override or confirm status, and a `wait_for_experiment_impl` test verifying continued polling until Slurm reaches terminal state.

Managed-source Launcher Argv Change

Layer / File(s)	Summary
uv invocation switched to --with-editable `tools/mcp/modelopt_mcp/bridge.py`, `tools/mcp/tests/test_bridge.py`	`_launcher_argv` now builds the uv command with `--with-editable` instead of `--reinstall-package`/`--project`, with matching test expectations and docstring update.

Hostname Smoke Test Script

Layer / File(s)	Summary
New hostname smoke-test script `tools/launcher/common/smoke/hostname.sh`	New Bash script with SPDX header, strict error handling, and start/done markers around a `hostname` command.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Bridge as bridge.py
    participant ExperimentDir as Experiment Directory
    participant SlurmNode as Slurm Login Node

    Client->>Bridge: submit Slurm job
    Bridge->>ExperimentDir: write _SLURM_STATUS_META (job id, ssh info)
    Client->>Bridge: job_status_impl(experiment_id)
    Bridge->>ExperimentDir: load _SLURM_STATUS_META
    Bridge->>SlurmNode: SSH sacct -P (fallback squeue)
    SlurmNode-->>Bridge: Slurm state/exit code
    Bridge->>Bridge: map Slurm state to MCP status
    Bridge-->>Client: overall status (done/running/failed/unknown)

Suggested reviewers: kevalmorabia97

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

Ignore

❌ Failed checks (1 error)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	bridge.py adds new `# nosec B603` comments on the new sacct/squeue SSH subprocess calls, which this check forbids.	Remove the new `# nosec` comments (or add explicit SECURITY.md justification plus `@NVIDIA/modelopt-setup-codeowners` approval).

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly reflects the main change: making MCP Slurm wait on remote job state instead of local completion markers.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch chenhany/fix-mcp-slurm-wait

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)

tools/mcp/modelopt_mcp/bridge.py (1)
1705-1780: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Consider validating/quoting slurm_job_id before it reaches the remote shell.

job_id is interpolated into the sacct/squeue command strings, which ssh executes through the remote login shell. Today slurm_job_id is parsed from trusted launcher output (numeric), so this is not currently exploitable, but a non-numeric value would be interpreted by the remote shell. A cheap defense-in-depth is to assert the id matches an expected pattern (e.g. ^\d+(_\d+)?(\.\w+)?$) before building the command.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1705 - 1780, The
_query_slurm_status helper builds remote sacct/squeue command strings using
slurm_job_id, so add a defensive validation step before constructing those
commands. In _query_slurm_status, ensure job_id matches the expected Slurm
job-id format (for example numeric with optional array/step suffixes) and return
a failed diagnostic if it does not, then only pass the validated id into
_ssh_argv_for_slurm and the sacct/squeue command strings.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1826-1841: In the Slurm status handling branch of the experiment
status lookup, do not always fall back to running when _query_slurm_status
returns a non-ok result. Update the logic around _load_slurm_status_metadata and
_query_slurm_status to treat slurm_status_not_found as a special case and
consult the local completion indicators (_DONE and status_*.out) before
defaulting to running. Keep the returned slurm_status payload intact, but ensure
the overall status can become completed when the local markers show the
experiment has finished so wait_for_experiment_impl stops polling correctly.

---

Nitpick comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1705-1780: The _query_slurm_status helper builds remote
sacct/squeue command strings using slurm_job_id, so add a defensive validation
step before constructing those commands. In _query_slurm_status, ensure job_id
matches the expected Slurm job-id format (for example numeric with optional
array/step suffixes) and return a failed diagnostic if it does not, then only
pass the validated id into _ssh_argv_for_slurm and the sacct/squeue command
strings.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 64116f92-4ae7-4046-b0ef-672474565ecb

📥 Commits

Reviewing files that changed from the base of the PR and between 75b5803 and 9f0c4ec.

📒 Files selected for processing (3)

tools/launcher/common/smoke/hostname.sh
tools/mcp/modelopt_mcp/bridge.py
tools/mcp/tests/test_bridge.py

coderabbitai · 2026-07-04T03:12:48Z

+    slurm_meta = _load_slurm_status_metadata(exp_dir)
+    if slurm_meta is not None:
+        slurm_status = _query_slurm_status(slurm_meta)
+        if slurm_status.get("ok"):
+            overall = slurm_status.get("status", "unknown")
+        else:
+            overall = "running"
+        return {
+            "ok": True,
+            "experiment_id": experiment_id,
+            "experiment_dir": str(exp_dir),
+            "status": overall,
+            "task_statuses": task_statuses,
+            "has_done_marker": done_marker.exists(),
+            "slurm_status": slurm_status,
+        }


🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail # Inspect relevant symbols and surrounding code in bridge.py python3 - <<'PY' from pathlib import Path path = Path("tools/mcp/modelopt_mcp/bridge.py") text = path.read_text().splitlines() symbols = ["def _query_slurm_status", "def _load_slurm_status_metadata", "def wait_for_experiment_impl", "status_*.out", "_DONE", "slurm_status_not_found"] for sym in symbols: for i, line in enumerate(text, 1): if sym in line: start = max(1, i-25) end = min(len(text), i+120) print(f"\n=== {sym} @ line {i} ===") for j in range(start, end+1): print(f"{j:5d}: {text[j-1]}") break # Search for other slurm status handling and wait-timeout behavior print("\n=== SEARCH RESULTS ===") for needle in ["wait_timeout", "overall = \"running\"", "has_done_marker", "status_*.out", "slurm_status_not_found", "task_statuses"]: print(f"\n-- {needle} --") for i, line in enumerate(text, 1): if needle in line: print(f"{i}: {line}") PY

Repository: NVIDIA/Model-Optimizer

Length of output: 39024

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1800,1955p' tools/mcp/modelopt_mcp/bridge.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 6877

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "_save_slurm_status_metadata|_remove_slurm_status_metadata|_load_slurm_status_metadata|slurm_enabled|slurm_meta" tools/mcp/modelopt_mcp/bridge.py

Repository: NVIDIA/Model-Optimizer

Length of output: 383

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '1978,2175p' tools/mcp/modelopt_mcp/bridge.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 9278

🏁 Script executed:

#!/bin/bash set -euo pipefail sed -n '2060,2170p' tools/mcp/modelopt_mcp/bridge.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 4902

🏁 Script executed:

#!/bin/bash set -euo pipefail rg -n "TERMINAL_STATUSES|class ExperimentStatusResult|class WaitForExperimentResult|class QuerySlurmStatusResult" tools/mcp/modelopt_mcp/bridge.py

Repository: NVIDIA/Model-Optimizer

Length of output: 160

Handle slurm_status_not_found with local completion markers If the Slurm query comes back slurm_status_not_found, this branch still reports running even when _DONE or status_*.out already mark the experiment complete. That can keep wait_for_experiment_impl polling until timeout for an aged-out finished job; fall back to the local markers before defaulting to running.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1826 - 1841, In the Slurm status handling branch of the experiment status lookup, do not always fall back to running when _query_slurm_status returns a non-ok result. Update the logic around _load_slurm_status_metadata and _query_slurm_status to treat slurm_status_not_found as a special case and consult the local completion indicators (_DONE and status_*.out) before defaulting to running. Keep the returned slurm_status payload intact, but ensure the overall status can become completed when the local markers show the experiment has finished so wait_for_experiment_impl stops polling correctly.

codecov · 2026-07-04T03:14:37Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.24%. Comparing base (75b5803) to head (9f0c4ec).

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1900      +/-   ##
==========================================
+ Coverage   61.21%   61.24%   +0.02%     
==========================================
  Files         515      515              
  Lines       57245    57245              
==========================================
+ Hits        35043    35060      +17     
+ Misses      22202    22185      -17

Flag	Coverage Δ
regression	`14.82% <ø> (+0.06%)`	⬆️
unit	`54.92% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

fix: make mcp slurm status authoritative

9f0c4ec

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>

ChenhanYu requested a review from a team as a code owner July 4, 2026 03:05

coderabbitai Bot reviewed Jul 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: make MCP Slurm wait poll remote job state#1900

fix: make MCP Slurm wait poll remote job state#1900
ChenhanYu wants to merge 1 commit into
mainfrom
chenhany/fix-mcp-slurm-wait

ChenhanYu commented Jul 4, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jul 4, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram(s)

Pre-merge checks failed

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jul 4, 2026

Uh oh!

codecov Bot commented Jul 4, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

ChenhanYu commented Jul 4, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Validation

Manual smoke evidence

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Pre-merge checks failed

❌ Failed checks (1 error)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jul 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ChenhanYu commented Jul 4, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jul 4, 2026 •

edited

Loading

codecov Bot commented Jul 4, 2026 •

edited

Loading