Add native v1 debug CLI and validate modes #1905

Open
rasdani wants to merge 6 commits into main from feat/v1-debug-cli

rasdani (Contributor) commented Jul 1, 2026

Summary

  • Adds native v1 validate --mode support for apply-answer, noop, and both. The both mode runs the two high-level validations in independent runtimes and reports nested subresults.
  • Adds a native v1 debug CLI that runs task setup, then either an inline shell command or an uploaded host script, and persists diagnostics under trace.info["debug"] in eval-style config.toml / results.jsonl output.
  • Deprecates the older composable debug path in warnings/docs for v1 tasksets while keeping compatibility.

Notes

  • research-environments#619 remains active. I audited it for native-tool hook needs: debug only needs loadable tasksets, runtime config, and setup; validate apply-answer uses the underlying taskset validate() hooks. No missing hook was found, so no stacked follow-up PR was opened.
  • No new test files were added.

Validation

  • uv run ruff check verifiers/v1/cli/debug.py verifiers/v1/cli/validate.py verifiers/v1/configs/debug.py verifiers/v1/configs/validate.py verifiers/v1/configs/__init__.py verifiers/v1/cli/output.py verifiers/envs/experimental/composable/sandbox_debug_env.py verifiers/envs/experimental/composable/swe_debug_env.py passed.
  • uv run ruff format --check verifiers/v1/cli/debug.py verifiers/v1/cli/validate.py verifiers/v1/configs/debug.py verifiers/v1/configs/validate.py verifiers/v1/configs/__init__.py verifiers/v1/cli/output.py verifiers/envs/experimental/composable/sandbox_debug_env.py verifiers/envs/experimental/composable/swe_debug_env.py passed.
  • uv run python -m compileall -q verifiers/v1 verifiers/envs/experimental/composable passed.
  • uv run python - <<'PY' ... import verifiers.v1.cli.debug/validate and DebugConfig/ValidateConfig ... PY passed.
  • uv build --wheel passed; generated dist/ was removed.
  • uv pip install -e /home/ubuntu/git/research-environments/environments/r2e_gym_v1 passed.
  • uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode noop passed on real task namanjain12/orange3_final:2d9617bd0cb1f0ba61771258410ab8fae8e7e24d.
  • uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode apply-answer passed on the same real task.
  • uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode both passed and started independent validate-apply-answer-0 and validate-noop-0 containers.
  • uv run debug r2e-gym-v1 --runtime.type docker -n 1 -c 1 --command 'printf "VF_DEBUG_COMMAND_OK\n"; pwd; git status --short | head -20' --output-dir /tmp/vf-debug-command passed.
  • uv run debug r2e-gym-v1 --runtime.type docker -n 1 -c 1 --script-path /tmp/vf-debug-host-script.sh --output-dir /tmp/vf-debug-script passed.
  • uv run python - <<'PY' ... inspect /tmp/vf-debug-command/results.jsonl and /tmp/vf-debug-script/results.jsonl ... PY passed and confirmed both saved traces contain command/script output in trace.info["debug"].
  • Push pre-push hooks passed: ruff check, ruff format, generated AGENTS/CLAUDE check, and ty (ci parity).

Note

Medium Risk
New CLI paths touch runtime provisioning, setup hooks, and container sandboxes; validate --mode both doubles runtime work per task. Changes are additive with legacy wrappers kept, but misconfigured debug/validate runs could still stress remote runtimes.

Overview
Adds native v1 developer CLIs for model-free taskset checks: extended validate and a new debug entrypoint, plus docs and deprecation nudges away from composable sandbox debug envs.

validate --mode now supports apply-answer (default), noop (setup only), and both. In both mode, apply-answer and noop each get their own runtime lifecycle per task, with nested subresults rolled into one aggregate row.
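As a rough illustration of how two independent per-mode results might be rolled into one aggregate row, here is a minimal sketch. The `SubResult` and `AggregateRow` types and the `run_mode` helper are hypothetical stand-ins, not the PR's actual classes; only the mode names and the `validate-<mode>-<idx>` runtime naming come from the description and validation log above.

```python
from dataclasses import dataclass, field


@dataclass
class SubResult:
    mode: str
    runtime_name: str
    passed: bool


@dataclass
class AggregateRow:
    task_idx: int
    subresults: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # The aggregate passes only if every nested subresult passed.
        return all(sub.passed for sub in self.subresults.values())


def run_mode(mode: str, task_idx: int) -> SubResult:
    # Hypothetical stand-in for: start a fresh runtime, run setup
    # (plus the taskset validate() hook for apply-answer), stop the runtime.
    runtime_name = f"validate-{mode}-{task_idx}"
    return SubResult(mode=mode, runtime_name=runtime_name, passed=True)


def validate_both(task_idx: int) -> AggregateRow:
    row = AggregateRow(task_idx=task_idx)
    for mode in ("apply-answer", "noop"):
        # Each mode gets its own runtime name, so nothing is shared between them.
        row.subresults[mode] = run_mode(mode, task_idx)
    return row


row = validate_both(0)
```

Whether the real CLI runs the two modes concurrently or one after the other is not stated here; the sketch runs them sequentially for simplicity.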

uv run debug (new debug console script) provisions a runtime per task, runs taskset.setup, then executes exactly one inline shell command or uploaded host script. It writes eval-style config.toml and results.jsonl, with command/script diagnostics under trace.info["debug"]. Shared output helpers accept any Pydantic config, not only EvalConfig.
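To inspect saved diagnostics after a debug run, something along these lines could work. This is a sketch under assumptions: it treats each line of results.jsonl as a serialized trace with a top-level `info` dict, and the `stdout` and `exit_code` keys in the sample row are made up for the demo, since the actual schema is not shown in this PR description.

```python
import json
import tempfile
from pathlib import Path


def load_debug_diagnostics(results_path: Path) -> list:
    """Return the info["debug"] payload of every row in a results.jsonl file."""
    diagnostics = []
    with results_path.open() as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines
            row = json.loads(line)
            # Assumed layout: each row is a trace dict with an `info` mapping.
            diagnostics.append(row.get("info", {}).get("debug", {}))
    return diagnostics


# Self-contained demo with an illustrative row; real rows come from `uv run debug`.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.jsonl"
    sample = {"info": {"debug": {"stdout": "VF_DEBUG_COMMAND_OK\n", "exit_code": 0}}}
    path.write_text(json.dumps(sample) + "\n")
    debug_rows = load_debug_diagnostics(path)
```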

Deprecation: SandboxDebugEnv warns on construction; SWEDebugEnv / SWEDebugRubric point to the v1 debug CLI, with SWEDebugEnv suppressing the base warning so only one fires.

Documentation and the evaluate-environments skill describe when to use validate / debug alongside prime eval run.

Reviewed by Cursor Bugbot for commit e6515f3. Bugbot is set up for automated code reviews on this repo.

Note

Add native debug CLI and --mode flag to validate CLI for v1 tasksets

  • Adds a new debug console script (verifiers/v1/cli/debug.py) that provisions runtimes for selected tasks, runs a shell command or uploaded host script, and writes config.toml and results.jsonl to an output directory.
  • Adds a DebugConfig pydantic model (verifiers/v1/configs/debug.py) with fields for runtime, command/script (mutually exclusive), timeouts, concurrency, and output control.
  • Extends the validate CLI with a --mode flag supporting apply-answer (default), noop (setup-only), and both (runs each independently and aggregates results).
  • Deprecates SandboxDebugEnv and SWEDebugEnv with warnings directing v1 taskset users to the native debug CLI instead.

Macroscope summarized e6515f3.

Comment thread verifiers/v1/cli/debug.py Outdated
Comment thread verifiers/v1/cli/debug.py
Comment thread verifiers/v1/cli/debug.py Outdated
macroscopeapp Bot commented Jul 1, 2026

Approvability

Verdict: Needs human review

This PR introduces a new CLI command (debug) with substantial new runtime logic for task setup and command execution. New features with this scope warrant human review. Additionally, an unresolved comment identifies a potential bug in command validation logic.

Comment thread verifiers/v1/cli/debug.py
Comment thread verifiers/envs/experimental/composable/swe_debug_env.py
Comment thread verifiers/v1/cli/debug.py
Comment thread docs/evaluation.md
rasdani (Contributor, Author) commented Jul 1, 2026

@codex review

cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

@model_validator(mode="after")
def validate_action(self):
    if bool(self.command) == bool(self.script_path):
        raise ValueError("pass exactly one of `--command` or `--script-path`")

Empty command skips script path

Medium Severity

DebugConfig treats an empty command as “missing” when checking that exactly one of --command or --script-path was provided, but run_action treats any non-None command (including "") as the inline-command path. A config with both an empty command and a script_path can pass validation yet run sh -lc with an empty string instead of uploading and executing the script.
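The mismatch can be reproduced in isolation. The two functions below are simplified stand-ins written for illustration: `passes_validation` mirrors the quoted truthiness check, and `dispatch` mirrors the `is not None` dispatch the comment attributes to run_action. Neither is the PR's actual code.

```python
from typing import Optional


def passes_validation(command: Optional[str], script_path: Optional[str]) -> bool:
    # Mirrors the quoted check: truthiness decides whether an option was "given".
    return bool(command) != bool(script_path)


def dispatch(command: Optional[str], script_path: Optional[str]) -> str:
    # Mirrors the described dispatch: any non-None command takes the inline path.
    return "inline-command" if command is not None else "script"


# An empty command plus a script path is accepted by validation...
accepted = passes_validation("", "/tmp/vf-debug-host-script.sh")
# ...yet dispatch still picks the inline-command branch and ignores the script.
chosen = dispatch("", "/tmp/vf-debug-host-script.sh")
```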

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed5efa5a6b

Comment thread verifiers/v1/cli/debug.py
    }
    cancelled = False
    runtime = make_runtime(
        resolve_runtime_config(config.runtime, task), name=f"debug-{task.idx}"

P2: Use unique debug runtime names

When two debug sessions select the same task index with --runtime.type subprocess (or a previous run leaves /tmp/debug-0 behind), this fixed name is reused; SubprocessRuntime.start() creates /tmp/<name> with mkdir() and will fail on the existing directory before the command runs. Since the output path already has a UUID, the runtime name should include a run/trace-unique suffix instead of only task.idx.
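One possible shape for the suggested fix is to append a short random suffix to the runtime name. The helper below is hypothetical (the name `debug_runtime_name` and the 8-character suffix length are assumptions for illustration); it uses only the standard-library `uuid` module.

```python
import uuid
from typing import Optional


def debug_runtime_name(task_idx: int, run_id: Optional[str] = None) -> str:
    # Reuse a caller-supplied run id if there is one (for example the UUID
    # already used in the output path); otherwise generate a short random suffix.
    suffix = run_id or uuid.uuid4().hex[:8]
    return f"debug-{task_idx}-{suffix}"


first = debug_runtime_name(0)
second = debug_runtime_name(0)
```

With a name like this, two sessions on the same task index, or a leftover directory from an earlier run, would no longer collide on /tmp/<name>.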

@model_validator(mode="after")
def validate_action(self):
    if bool(self.command) == bool(self.script_path):

P3: Reject empty command plus script path

When a config file or shell invocation passes an empty command together with script_path, this check treats the command as absent because bool('') is false, but run_action() dispatches on config.command is not None and runs the empty command instead of the uploaded script. Use presence checks consistently so the “exactly one action” invariant is actually enforced.
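A consistent presence check could look roughly like the following. To stay self-contained, the sketch uses a plain dataclass named `DebugAction` as a hypothetical stand-in for the Pydantic DebugConfig model; the same comparison would go inside the existing model validator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebugAction:
    command: Optional[str] = None
    script_path: Optional[str] = None

    def __post_init__(self) -> None:
        # Use `is not None` here so validation agrees with a dispatch
        # that also branches on `command is not None`.
        has_command = self.command is not None
        has_script = self.script_path is not None
        if has_command == has_script:
            raise ValueError("pass exactly one of `--command` or `--script-path`")


try:
    DebugAction(command="", script_path="/tmp/vf-debug-host-script.sh")
    rejected = False
except ValueError:
    rejected = True
```

Under this check an empty command on its own is still accepted and would run as an empty inline command; rejecting empty strings outright is a separate design choice.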
