Skip to content

fix: make MCP Slurm wait poll remote job state#1900

Open
ChenhanYu wants to merge 1 commit into
mainfrom
chenhany/fix-mcp-slurm-wait
Open

fix: make MCP Slurm wait poll remote job state#1900
ChenhanYu wants to merge 1 commit into
mainfrom
chenhany/fix-mcp-slurm-wait

Conversation

@ChenhanYu

@ChenhanYu ChenhanYu commented Jul 4, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • run managed Model-Optimizer checkouts via uv run --with-editable so the modelopt-launcher console script is available
  • persist Slurm submission metadata beside the local experiment and make job_status / wait_for_experiment prefer remote sacct / squeue state over local _DONE
  • add the missing common/smoke/hostname.sh used by examples/smoke/hostname.yaml

Why

Detached Slurm submission creates the local _DONE marker when the submit process exits, not when the remote Slurm job completes. This made MCP wait_for_experiment return done while Slurm jobs were still running. Reproduced on both MFA and non-MFA clusters.

Validation

  • uv run --project tools/mcp ruff check tools/mcp/modelopt_mcp/bridge.py tools/mcp/tests/test_bridge.py
  • uv run --project tools/mcp pytest tools/mcp/tests/test_bridge.py
  • uv run --project tools/launcher pytest tools/launcher/tests/test_core_extended.py tools/launcher/tests/test_slurm_config.py
  • uv run --with-editable tools/launcher modelopt-launcher --yaml tools/launcher/examples/smoke/hostname.yaml --dryrun --yes

Manual smoke evidence

  • ptyche smoke completed: experiment cicd_1783037181, Slurm job 2320415, exit 0:0
  • computelab reproduced the wait bug pre-fix: MCP reported done immediately while Slurm job 2945347 was still running; job later completed with exit 0:0

Summary by CodeRabbit

  • New Features

    • Improved job status reporting for Slurm-backed launches, with more accurate tracking of running, completed, and failed jobs.
    • Added a hostname smoke check to help validate launcher environments.
  • Bug Fixes

    • Status checks now rely on the remote scheduler when available, reducing false “done” results from local markers.
    • Waiting for experiments now continues until the scheduler confirms a terminal state.

Signed-off-by: Chenhan Yu <chenhany@nvidia.com>
@ChenhanYu ChenhanYu requested a review from a team as a code owner July 4, 2026 03:05
@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds Slurm submit metadata persistence and SSH-based status resolution (sacct/squeue) into bridge.py with new helper functions and updated job_status_impl, changes the managed-source launcher's uv invocation to use --with-editable, adds corresponding tests, and introduces a new hostname smoke-test script.

Changes

Slurm Sidecar Status Tracking

Layer / File(s) Summary
Slurm state constants and helpers
tools/mcp/modelopt_mcp/bridge.py
Adds json import, Slurm terminal/failure/running state constants, a sidecar filename constant, and helpers to write/load metadata, build SSH argv, parse sacct -P output, query status via SSH with squeue fallback, and map Slurm state to MCP status.
Persist metadata on submission
tools/mcp/modelopt_mcp/bridge.py
Slurm submission flow now writes experiment id, Slurm job id, and SSH connection details to the experiment directory sidecar file.
job_status_impl prefers sidecar
tools/mcp/modelopt_mcp/bridge.py
job_status_impl queries Slurm via the sidecar for authoritative status instead of relying solely on local _DONE/status_*.out markers.
Tests for sidecar persistence and polling
tools/mcp/tests/test_bridge.py
Adds json import, assertions on persisted sidecar fields, new tests where Slurm RUNNING/COMPLETED states override or confirm status, and a wait_for_experiment_impl test verifying continued polling until Slurm reaches terminal state.

Managed-source Launcher Argv Change

Layer / File(s) Summary
uv invocation switched to --with-editable
tools/mcp/modelopt_mcp/bridge.py, tools/mcp/tests/test_bridge.py
_launcher_argv now builds the uv command with --with-editable instead of --reinstall-package/--project, with matching test expectations and docstring update.

Hostname Smoke Test Script

Layer / File(s) Summary
New hostname smoke-test script
tools/launcher/common/smoke/hostname.sh
New Bash script with SPDX header, strict error handling, and start/done markers around a hostname command.

Estimated code review effort: 3 (Moderate) | ~25 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Bridge as bridge.py
    participant ExperimentDir as Experiment Directory
    participant SlurmNode as Slurm Login Node

    Client->>Bridge: submit Slurm job
    Bridge->>ExperimentDir: write _SLURM_STATUS_META (job id, ssh info)
    Client->>Bridge: job_status_impl(experiment_id)
    Bridge->>ExperimentDir: load _SLURM_STATUS_META
    Bridge->>SlurmNode: SSH sacct -P (fallback squeue)
    SlurmNode-->>Bridge: Slurm state/exit code
    Bridge->>Bridge: map Slurm state to MCP status
    Bridge-->>Client: overall status (done/running/failed/unknown)
Loading

Suggested reviewers: kevalmorabia97


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

  • Ignore

❌ Failed checks (1 error)

Check name Status Explanation Resolution
Security Anti-Patterns ❌ Error bridge.py adds new # nosec B603 comments on the new sacct/squeue SSH subprocess calls, which this check forbids. Remove the new # nosec comments (or add explicit SECURITY.md justification plus @NVIDIA/modelopt-setup-codeowners approval).
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly reflects the main change: making MCP Slurm wait on remote job state instead of local completion markers.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch chenhany/fix-mcp-slurm-wait

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)
tools/mcp/modelopt_mcp/bridge.py (1)

1705-1780: 🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Consider validating/quoting slurm_job_id before it reaches the remote shell.

job_id is interpolated into the sacct/squeue command strings, which ssh executes through the remote login shell. Today slurm_job_id is parsed from trusted launcher output (numeric), so this is not currently exploitable, but a non-numeric value would be interpreted by the remote shell. A cheap defense-in-depth is to assert the id matches an expected pattern (e.g. ^\d+(_\d+)?(\.\w+)?$) before building the command.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1705 - 1780, The
_query_slurm_status helper builds remote sacct/squeue command strings using
slurm_job_id, so add a defensive validation step before constructing those
commands. In _query_slurm_status, ensure job_id matches the expected Slurm
job-id format (for example numeric with optional array/step suffixes) and return
a failed diagnostic if it does not, then only pass the validated id into
_ssh_argv_for_slurm and the sacct/squeue command strings.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1826-1841: In the Slurm status handling branch of the experiment
status lookup, do not always fall back to running when _query_slurm_status
returns a non-ok result. Update the logic around _load_slurm_status_metadata and
_query_slurm_status to treat slurm_status_not_found as a special case and
consult the local completion indicators (_DONE and status_*.out) before
defaulting to running. Keep the returned slurm_status payload intact, but ensure
the overall status can become completed when the local markers show the
experiment has finished so wait_for_experiment_impl stops polling correctly.

---

Nitpick comments:
In `@tools/mcp/modelopt_mcp/bridge.py`:
- Around line 1705-1780: The _query_slurm_status helper builds remote
sacct/squeue command strings using slurm_job_id, so add a defensive validation
step before constructing those commands. In _query_slurm_status, ensure job_id
matches the expected Slurm job-id format (for example numeric with optional
array/step suffixes) and return a failed diagnostic if it does not, then only
pass the validated id into _ssh_argv_for_slurm and the sacct/squeue command
strings.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 64116f92-4ae7-4046-b0ef-672474565ecb

📥 Commits

Reviewing files that changed from the base of the PR and between 75b5803 and 9f0c4ec.

📒 Files selected for processing (3)
  • tools/launcher/common/smoke/hostname.sh
  • tools/mcp/modelopt_mcp/bridge.py
  • tools/mcp/tests/test_bridge.py

Comment on lines +1826 to +1841
slurm_meta = _load_slurm_status_metadata(exp_dir)
if slurm_meta is not None:
slurm_status = _query_slurm_status(slurm_meta)
if slurm_status.get("ok"):
overall = slurm_status.get("status", "unknown")
else:
overall = "running"
return {
"ok": True,
"experiment_id": experiment_id,
"experiment_dir": str(exp_dir),
"status": overall,
"task_statuses": task_statuses,
"has_done_marker": done_marker.exists(),
"slurm_status": slurm_status,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect relevant symbols and surrounding code in bridge.py
python3 - <<'PY'
from pathlib import Path
path = Path("tools/mcp/modelopt_mcp/bridge.py")
text = path.read_text().splitlines()

symbols = ["def _query_slurm_status", "def _load_slurm_status_metadata", "def wait_for_experiment_impl", "status_*.out", "_DONE", "slurm_status_not_found"]
for sym in symbols:
    for i, line in enumerate(text, 1):
        if sym in line:
            start = max(1, i-25)
            end = min(len(text), i+120)
            print(f"\n=== {sym} @ line {i} ===")
            for j in range(start, end+1):
                print(f"{j:5d}: {text[j-1]}")
            break

# Search for other slurm status handling and wait-timeout behavior
print("\n=== SEARCH RESULTS ===")
for needle in ["wait_timeout", "overall = \"running\"", "has_done_marker", "status_*.out", "slurm_status_not_found", "task_statuses"]:
    print(f"\n-- {needle} --")
    for i, line in enumerate(text, 1):
        if needle in line:
            print(f"{i}: {line}")
PY

Repository: NVIDIA/Model-Optimizer

Length of output: 39024


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1800,1955p' tools/mcp/modelopt_mcp/bridge.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 6877


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "_save_slurm_status_metadata|_remove_slurm_status_metadata|_load_slurm_status_metadata|slurm_enabled|slurm_meta" tools/mcp/modelopt_mcp/bridge.py

Repository: NVIDIA/Model-Optimizer

Length of output: 383


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '1978,2175p' tools/mcp/modelopt_mcp/bridge.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 9278


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '2060,2170p' tools/mcp/modelopt_mcp/bridge.py | cat -n

Repository: NVIDIA/Model-Optimizer

Length of output: 4902


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "TERMINAL_STATUSES|class ExperimentStatusResult|class WaitForExperimentResult|class QuerySlurmStatusResult" tools/mcp/modelopt_mcp/bridge.py

Repository: NVIDIA/Model-Optimizer

Length of output: 160


Handle slurm_status_not_found with local completion markers If the Slurm query comes back slurm_status_not_found, this branch still reports running even when _DONE or status_*.out already mark the experiment complete. That can keep wait_for_experiment_impl polling until timeout for an aged-out finished job; fall back to the local markers before defaulting to running.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/mcp/modelopt_mcp/bridge.py` around lines 1826 - 1841, In the Slurm
status handling branch of the experiment status lookup, do not always fall back
to running when _query_slurm_status returns a non-ok result. Update the logic
around _load_slurm_status_metadata and _query_slurm_status to treat
slurm_status_not_found as a special case and consult the local completion
indicators (_DONE and status_*.out) before defaulting to running. Keep the
returned slurm_status payload intact, but ensure the overall status can become
completed when the local markers show the experiment has finished so
wait_for_experiment_impl stops polling correctly.

@codecov

codecov Bot commented Jul 4, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 61.24%. Comparing base (75b5803) to head (9f0c4ec).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1900      +/-   ##
==========================================
+ Coverage   61.21%   61.24%   +0.02%     
==========================================
  Files         515      515              
  Lines       57245    57245              
==========================================
+ Hits        35043    35060      +17     
+ Misses      22202    22185      -17     
Flag Coverage Δ
regression 14.82% <ø> (+0.06%) ⬆️
unit 54.92% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant