Add native v1 debug CLI and validate modes #1905
Approvability
Verdict: Needs human review. This PR introduces a new CLI command (…
@codex review
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e6515f3.
    @model_validator(mode="after")
    def validate_action(self):
        if bool(self.command) == bool(self.script_path):
            raise ValueError("pass exactly one of `--command` or `--script-path`")
Empty command skips script path
Medium Severity
`DebugConfig` treats an empty `command` as “missing” when checking that exactly one of `--command` or `--script-path` was provided, but `run_action` treats any non-`None` `command` (including `""`) as the inline-command path. A config with both an empty `command` and a `script_path` can pass validation yet run `sh -lc` with an empty string instead of uploading and executing the script.
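The mismatch can be reproduced without the real classes. The sketch below uses two hypothetical stand-in functions, `passes_validation` and `chosen_action`, that mirror the two checks described in this comment; they are illustrations, not the PR's code.

```python
from typing import Optional


def passes_validation(command: Optional[str], script_path: Optional[str]) -> bool:
    """Truthiness-based check, mirroring the validator quoted above."""
    return bool(command) != bool(script_path)


def chosen_action(command: Optional[str], script_path: Optional[str]) -> str:
    """Presence-based dispatch, as the comment describes run_action."""
    return "command" if command is not None else "script"


# An empty command plus a script path passes validation...
ok = passes_validation("", "/tmp/script.sh")
# ...yet dispatch still picks the (empty) inline command.
action = chosen_action("", "/tmp/script.sh")
print(ok, action)
```

The two functions disagree only when a field is present but falsy, which is exactly the empty-string case.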
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ed5efa5a6b
    }
    cancelled = False
    runtime = make_runtime(
        resolve_runtime_config(config.runtime, task), name=f"debug-{task.idx}"
Use unique debug runtime names
When two debug sessions select the same task index with `--runtime.type subprocess` (or a previous run leaves `/tmp/debug-0` behind), this fixed name is reused; `SubprocessRuntime.start()` creates `/tmp/<name>` with `mkdir()` and fails on the existing directory before the command runs. Since the output path already has a UUID, the runtime name should include a run- or trace-unique suffix instead of only `task.idx`.
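A minimal sketch of the collision and of one possible fix follows. `debug_runtime_name` and `start_workdir` are hypothetical helpers written for this illustration; `start_workdir` only imitates the `mkdir()` behaviour described above and is not the real `SubprocessRuntime.start()`.

```python
import tempfile
import uuid
from pathlib import Path


def debug_runtime_name(task_idx: int, run_id: str) -> str:
    """Hypothetical helper: combine the task index with a per-run suffix."""
    return f"debug-{task_idx}-{run_id}"


def start_workdir(root: Path, name: str) -> Path:
    """Stand-in for a runtime start that creates <root>/<name> with mkdir()."""
    workdir = root / name
    workdir.mkdir()  # raises FileExistsError if the directory already exists
    return workdir


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Fixed name: the second session for task 0 collides with the first.
    start_workdir(root, "debug-0")
    try:
        start_workdir(root, "debug-0")
        collided = False
    except FileExistsError:
        collided = True

    # Unique suffix per run: both sessions get their own directory.
    first = start_workdir(root, debug_runtime_name(0, uuid.uuid4().hex[:8]))
    second = start_workdir(root, debug_runtime_name(0, uuid.uuid4().hex[:8]))
    distinct = first != second

print(collided, distinct)
```

Reusing the UUID that already appears in the output path, as the comment suggests, would also keep the runtime name and the output directory easy to correlate.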
    @model_validator(mode="after")
    def validate_action(self):
        if bool(self.command) == bool(self.script_path):
Reject empty command plus script path
When a config file or shell invocation passes an empty `command` together with `script_path`, this check treats the command as absent because `bool('')` is false, but `run_action()` dispatches on `config.command is not None` and runs the empty command instead of the uploaded script. Use presence checks consistently so the “exactly one action” invariant is actually enforced.
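One way to apply presence checks consistently is sketched below. It uses a plain dataclass, `DebugAction`, as a hypothetical stand-in for the two relevant `DebugConfig` fields so that it runs without pydantic; in the real model the same logic would presumably sit inside the `@model_validator(mode="after")` method. Rejecting empty values outright is an extra assumption, not something the review asks for.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebugAction:
    """Hypothetical stand-in for the command/script fields of DebugConfig."""

    command: Optional[str] = None
    script_path: Optional[str] = None

    def __post_init__(self) -> None:
        # Presence check: a field counts as provided whenever it is not None,
        # which matches a dispatch on `command is not None`.
        has_command = self.command is not None
        has_script = self.script_path is not None
        if has_command == has_script:
            raise ValueError("pass exactly one of `--command` or `--script-path`")
        # Assumption: reject an empty command instead of treating it as absent.
        if has_command and not self.command.strip():
            raise ValueError("`--command` must not be empty")


def is_valid(**fields: Optional[str]) -> bool:
    """Return True when the given fields form a valid action."""
    try:
        DebugAction(**fields)
        return True
    except ValueError:
        return False
```

With this check, an empty `command` combined with a `script_path` is rejected at validation time, so the dispatch step never sees the ambiguous case.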


Summary
- Add `validate --mode` support for `apply-answer`, `noop`, and `both`; `both` runs the two high-level validations in independent runtimes and reports nested subresults.
- Add a native `debug` CLI that runs task setup, then either an inline shell command or an uploaded host script, and persists diagnostics under `trace.info["debug"]` in eval-style `config.toml`/`results.jsonl` output.

Notes
- `research-environments#619` remains active. I audited it for native-tool hook needs: `debug` only needs loadable tasksets, runtime config, and `setup`; `validate apply-answer` uses the underlying taskset `validate()` hooks. No missing hook was found, so no stacked follow-up PR was opened.

Validation
- `uv run ruff check verifiers/v1/cli/debug.py verifiers/v1/cli/validate.py verifiers/v1/configs/debug.py verifiers/v1/configs/validate.py verifiers/v1/configs/__init__.py verifiers/v1/cli/output.py verifiers/envs/experimental/composable/sandbox_debug_env.py verifiers/envs/experimental/composable/swe_debug_env.py` passed.
- `uv run ruff format --check verifiers/v1/cli/debug.py verifiers/v1/cli/validate.py verifiers/v1/configs/debug.py verifiers/v1/configs/validate.py verifiers/v1/configs/__init__.py verifiers/v1/cli/output.py verifiers/envs/experimental/composable/sandbox_debug_env.py verifiers/envs/experimental/composable/swe_debug_env.py` passed.
- `uv run python -m compileall -q verifiers/v1 verifiers/envs/experimental/composable` passed.
- `uv run python - <<'PY' ... import verifiers.v1.cli.debug/validate and DebugConfig/ValidateConfig ... PY` passed.
- `uv build --wheel` passed; generated `dist/` was removed.
- `uv pip install -e /home/ubuntu/git/research-environments/environments/r2e_gym_v1` passed.
- `uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode noop` passed on real task `namanjain12/orange3_final:2d9617bd0cb1f0ba61771258410ab8fae8e7e24d`.
- `uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode apply-answer` passed on the same real task.
- `uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode both` passed and started independent `validate-apply-answer-0` and `validate-noop-0` containers.
- `uv run debug r2e-gym-v1 --runtime.type docker -n 1 -c 1 --command 'printf "VF_DEBUG_COMMAND_OK\n"; pwd; git status --short | head -20' --output-dir /tmp/vf-debug-command` passed.
- `uv run debug r2e-gym-v1 --runtime.type docker -n 1 -c 1 --script-path /tmp/vf-debug-host-script.sh --output-dir /tmp/vf-debug-script` passed.
- `uv run python - <<'PY' ... inspect /tmp/vf-debug-command/results.jsonl and /tmp/vf-debug-script/results.jsonl ... PY` passed and confirmed both saved traces contain command/script output in `trace.info["debug"]`.
- `ruff check`, `ruff format`, generated AGENTS/CLAUDE check, and `ty (ci parity)`.

Note
Medium Risk
New CLI paths touch runtime provisioning, setup hooks, and container sandboxes; `validate --mode both` doubles runtime work per task. Changes are additive with legacy wrappers kept, but misconfigured debug/validate runs could still stress remote runtimes.

Overview
Adds native v1 developer CLIs for model-free taskset checks: extended `validate` and a new `debug` entrypoint, plus docs and deprecation nudges away from composable sandbox debug envs.
- `validate --mode` now supports `apply-answer` (default), `noop` (setup only), and `both`. In `both`, apply-answer and noop each get their own runtime lifecycle per task, with nested subresults rolled into one aggregate row.
- `uv run debug` (new `debug` console script) provisions a runtime per task, runs `taskset.setup`, then executes exactly one inline shell command or uploaded host script. It writes eval-style `config.toml` and `results.jsonl`, with command/script diagnostics under `trace.info["debug"]`. Shared output helpers accept any Pydantic config, not only `EvalConfig`.
- Deprecation: `SandboxDebugEnv` warns on construction; `SWEDebugEnv`/`SWEDebugRubric` point to the v1 `debug` CLI, with `SWEDebugEnv` suppressing the base warning so only one fires.
- Documentation and the evaluate-environments skill describe when to use `validate`/`debug` alongside `prime eval run`.

Reviewed by Cursor Bugbot for commit e6515f3.
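For readers who want to inspect the saved diagnostics, here is a hedged sketch of reading `results.jsonl`. It assumes each non-empty line is one JSON object with a top-level `info` mapping that holds the `debug` entry; the PR text only states that diagnostics live under `trace.info["debug"]`, so the on-disk layout, the helper name `load_debug_entries`, and the `stdout`/`exit_code` keys in the demo rows are all assumptions.

```python
import json
import tempfile
from pathlib import Path
from typing import Any


def load_debug_entries(results_path: Path) -> list[dict[str, Any]]:
    """Collect the `info["debug"]` mapping from every row of a JSONL file.

    Assumes each non-empty line is one JSON object with a top-level "info"
    mapping; rows without a "debug" entry are skipped.
    """
    entries = []
    with results_path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue
            row = json.loads(line)
            debug = row.get("info", {}).get("debug")
            if debug is not None:
                entries.append(debug)
    return entries


# Self-contained demo with made-up rows; the keys inside "debug" are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.jsonl"
    rows = [
        {"info": {"debug": {"stdout": "VF_DEBUG_COMMAND_OK\n", "exit_code": 0}}},
        {"info": {}},
    ]
    path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
    entries = load_debug_entries(path)

print(len(entries))
```

If the real rows nest the trace differently, only the `row.get("info", {})` lookup needs to change.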
Note
Add native `debug` CLI and `--mode` flag to `validate` CLI for v1 tasksets
- Adds a `debug` console script (`debug.py`) that provisions runtimes for selected tasks, runs a shell command or uploaded host script, and writes `config.toml` and `results.jsonl` to an output directory.
- Adds a `DebugConfig` pydantic model (`debug.py`) with fields for runtime, command/script (mutually exclusive), timeouts, concurrency, and output control.
- Extends the `validate` CLI with a `--mode` flag supporting `apply-answer` (default), `noop` (setup-only), and `both` (runs each independently and aggregates results).
- Deprecates `SandboxDebugEnv` and `SWEDebugEnv` with warnings directing v1 taskset users to the native `debug` CLI instead.

Macroscope summarized e6515f3.
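The single-warning behaviour described for the deprecated envs can be sketched with the standard `warnings` module. `BaseDebugEnv` and `ChildDebugEnv` below are hypothetical stand-ins for `SandboxDebugEnv` and `SWEDebugEnv`; the PR does not show how the suppression is actually implemented, so this is one plausible approach rather than the author's code.

```python
import warnings


class BaseDebugEnv:
    """Hypothetical stand-in for a base class that warns on construction."""

    def __init__(self) -> None:
        warnings.warn(
            "BaseDebugEnv is deprecated; use the `debug` CLI",
            DeprecationWarning,
            stacklevel=2,
        )


class ChildDebugEnv(BaseDebugEnv):
    """Hypothetical subclass that mutes the base warning and emits its own."""

    def __init__(self) -> None:
        # Temporarily ignore DeprecationWarning while the base class initialises.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", DeprecationWarning)
            super().__init__()
        warnings.warn(
            "ChildDebugEnv is deprecated; use the `debug` CLI",
            DeprecationWarning,
            stacklevel=2,
        )


# Record every warning raised while constructing the subclass.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ChildDebugEnv()

messages = [str(w.message) for w in caught]
print(messages)
```

`warnings.catch_warnings` modifies process-wide state and is not thread-safe, so an implementation that constructs environments concurrently may prefer a constructor flag on the base class instead.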