Add native v1 debug CLI and validate modes by rasdani · Pull Request #1905 · PrimeIntellect-ai/verifiers

rasdani · 2026-07-01T02:58:54Z

Summary

Adds native v1 validate --mode support for apply-answer, noop, and both. The both mode runs the two high-level validations in independent runtimes and reports nested subresults.
Adds a native v1 debug CLI that runs task setup, then either an inline shell command or an uploaded host script, and persists diagnostics under trace.info["debug"] in eval-style config.toml / results.jsonl output.
Deprecates the older composable debug path in warnings/docs for v1 tasksets while keeping compatibility.

Notes

research-environments#619 remains active. I audited it for native-tool hook needs: debug only needs loadable tasksets, runtime config, and setup; validate apply-answer uses the underlying taskset validate() hooks. No missing hook was found, so no stacked follow-up PR was opened.
No new test files were added.

Validation

uv run ruff check verifiers/v1/cli/debug.py verifiers/v1/cli/validate.py verifiers/v1/configs/debug.py verifiers/v1/configs/validate.py verifiers/v1/configs/__init__.py verifiers/v1/cli/output.py verifiers/envs/experimental/composable/sandbox_debug_env.py verifiers/envs/experimental/composable/swe_debug_env.py passed.
uv run ruff format --check verifiers/v1/cli/debug.py verifiers/v1/cli/validate.py verifiers/v1/configs/debug.py verifiers/v1/configs/validate.py verifiers/v1/configs/__init__.py verifiers/v1/cli/output.py verifiers/envs/experimental/composable/sandbox_debug_env.py verifiers/envs/experimental/composable/swe_debug_env.py passed.
uv run python -m compileall -q verifiers/v1 verifiers/envs/experimental/composable passed.
uv run python - <<'PY' ... import verifiers.v1.cli.debug/validate and DebugConfig/ValidateConfig ... PY passed.
uv build --wheel passed; generated dist/ was removed.
uv pip install -e /home/ubuntu/git/research-environments/environments/r2e_gym_v1 passed.
uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode noop passed on real task namanjain12/orange3_final:2d9617bd0cb1f0ba61771258410ab8fae8e7e24d.
uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode apply-answer passed on the same real task.
uv run validate r2e-gym-v1 --runtime.type docker -n 1 -c 1 --rich false --mode both passed and started independent validate-apply-answer-0 and validate-noop-0 containers.
uv run debug r2e-gym-v1 --runtime.type docker -n 1 -c 1 --command 'printf "VF_DEBUG_COMMAND_OK\n"; pwd; git status --short | head -20' --output-dir /tmp/vf-debug-command passed.
uv run debug r2e-gym-v1 --runtime.type docker -n 1 -c 1 --script-path /tmp/vf-debug-host-script.sh --output-dir /tmp/vf-debug-script passed.
uv run python - <<'PY' ... inspect /tmp/vf-debug-command/results.jsonl and /tmp/vf-debug-script/results.jsonl ... PY passed and confirmed both saved traces contain command/script output in trace.info["debug"].
Push pre-push hooks passed: ruff check, ruff format, generated AGENTS/CLAUDE check, and ty (ci parity).

Note

Medium Risk
New CLI paths touch runtime provisioning, setup hooks, and container sandboxes; validate --mode both doubles runtime work per task. Changes are additive with legacy wrappers kept, but misconfigured debug/validate runs could still stress remote runtimes.

Overview
Adds native v1 developer CLIs for model-free taskset checks: extended validate and a new debug entrypoint, plus docs and deprecation nudges away from composable sandbox debug envs.

validate --mode now supports apply-answer (default), noop (setup only), and both. In both mode, apply-answer and noop each get their own runtime lifecycle per task, with nested subresults rolled into one aggregate row.
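
The aggregation described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `SubResult`, `AggregateRow`, and `aggregate` are hypothetical names, and the rule that the aggregate passes only when every nested validation passes is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SubResult:
    """Outcome of one high-level validation (hypothetical shape)."""
    mode: str  # "apply-answer" or "noop"
    passed: bool
    error: Optional[str] = None


@dataclass
class AggregateRow:
    """One row per task, with the per-mode results nested inside."""
    task_idx: int
    subresults: dict = field(default_factory=dict)

    @property
    def passed(self) -> bool:
        # Assumption: the row passes only if every nested validation passed.
        return bool(self.subresults) and all(
            result.passed for result in self.subresults.values()
        )


def aggregate(task_idx: int, results: list) -> AggregateRow:
    """Roll independently produced subresults into a single row, keyed by mode."""
    row = AggregateRow(task_idx=task_idx)
    for result in results:
        row.subresults[result.mode] = result
    return row
```

Keying the nested results by mode keeps a noop failure (setup broke) distinguishable from an apply-answer failure (the reference answer did not validate) in the same row.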

uv run debug (new debug console script) provisions a runtime per task, runs taskset.setup, then executes exactly one inline shell command or uploaded host script. It writes eval-style config.toml and results.jsonl, with command/script diagnostics under trace.info["debug"]. Shared output helpers accept any Pydantic config, not only EvalConfig.
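
A saved run could be inspected with something like the sketch below. It assumes each line of results.jsonl is a JSON object with an "info" mapping that holds the "debug" entry; the exact row layout and the keys inside the debug payload are not shown in this PR, so `debug_payloads` is a hypothetical helper and treats the payload as opaque.

```python
import json
from typing import Iterable


def debug_payloads(lines: Iterable[str]) -> list:
    """Return the info["debug"] payload for every JSONL row (empty dict if absent)."""
    payloads = []
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines between rows
        row = json.loads(line)
        # Assumed layout: row["info"]["debug"] holds the command/script diagnostics.
        info = row.get("info") or {}
        payloads.append(info.get("debug") or {})
    return payloads
```

For the run above, the lines would come from something like `Path("/tmp/vf-debug-command/results.jsonl").read_text().splitlines()`.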

Deprecation: SandboxDebugEnv warns on construction; SWEDebugEnv / SWEDebugRubric point to the v1 debug CLI, with SWEDebugEnv suppressing the base warning so only one fires.
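
The "only one warning fires" behavior can be illustrated with the standard warnings module. The classes below are stand-ins with hypothetical names and messages, not the actual SandboxDebugEnv / SWEDebugEnv code; they only show one way a subclass might silence its base class's deprecation warning and emit its own.

```python
import warnings


class BaseDebugEnv:
    """Stand-in for a deprecated base class that warns on construction."""

    def __init__(self) -> None:
        warnings.warn(
            "BaseDebugEnv is deprecated; use the v1 debug CLI",
            DeprecationWarning,
            stacklevel=2,
        )


class ChildDebugEnv(BaseDebugEnv):
    """Stand-in for a subclass that replaces the base warning with its own."""

    def __init__(self) -> None:
        # Silence the base-class warning so callers see a single message.
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", DeprecationWarning)
            super().__init__()
        warnings.warn(
            "ChildDebugEnv is deprecated; use the v1 debug CLI",
            DeprecationWarning,
            stacklevel=2,
        )
```

Because `catch_warnings()` restores the previous filters on exit, the suppression applies only to the base constructor call and does not hide unrelated deprecation warnings elsewhere.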

Documentation and the evaluate-environments skill describe when to use validate / debug alongside prime eval run.

Reviewed by Cursor Bugbot for commit e6515f3.

Note

Add native `debug` CLI and `--mode` flag to `validate` CLI for v1 tasksets

Adds a new debug console script (debug.py) that provisions runtimes for selected tasks, runs a shell command or uploaded host script, and writes config.toml and results.jsonl to an output directory.
Adds a DebugConfig Pydantic model (debug.py) with fields for runtime, command/script (mutually exclusive), timeouts, concurrency, and output control.
Extends the validate CLI with a --mode flag supporting apply-answer (default), noop (setup-only), and both (runs each independently and aggregates results).
Deprecates SandboxDebugEnv and SWEDebugEnv with warnings directing v1 taskset users to the native debug CLI instead.

Macroscope summarized e6515f3.

macroscopeapp · 2026-07-01T03:03:49Z

Approvability

Verdict: Needs human review

This PR introduces a new CLI command (debug) with substantial new runtime logic for task setup and command execution. New features with this scope warrant human review. Additionally, an unresolved comment identifies a potential bug in command validation logic.


rasdani · 2026-07-01T03:37:40Z

@codex review

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-07-01T03:42:22Z

+    @model_validator(mode="after")
+    def validate_action(self):
+        if bool(self.command) == bool(self.script_path):
+            raise ValueError("pass exactly one of `--command` or `--script-path`")


Empty command skips script path

Medium Severity

DebugConfig treats an empty command as “missing” when checking that exactly one of --command or --script-path was provided, but run_action treats any non-None command (including "") as the inline-command path. A config with both an empty command and a script_path can pass validation yet run sh -lc with an empty string instead of uploading and executing the script.

Additional Locations (1)

verifiers/v1/cli/debug.py#L187-L194
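
The mismatch can be reproduced in isolation. The two functions below are plain-Python stand-ins: `passes_validation` mirrors the truthiness check quoted above, and `chosen_action` mirrors the presence-based dispatch that the comment attributes to run_action. The script path is a placeholder.

```python
from typing import Optional


def passes_validation(command: Optional[str], script_path: Optional[str]) -> bool:
    """Truthiness check: valid when exactly one of the two values is truthy."""
    return bool(command) != bool(script_path)


def chosen_action(command: Optional[str], script_path: Optional[str]) -> str:
    """Presence check: any non-None command selects the inline-command path."""
    return "command" if command is not None else "script"


# An empty command plus a script path is accepted by the validator ...
accepted = passes_validation("", "/tmp/script.sh")
# ... yet dispatch still picks the (empty) inline command, not the script.
action = chosen_action("", "/tmp/script.sh")
```

The two checks agree for None and for non-empty strings; they diverge only for the empty string, which is why the case is easy to miss.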


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ed5efa5a6b


chatgpt-codex-connector · 2026-07-01T03:44:01Z

+    }
+    cancelled = False
+    runtime = make_runtime(
+        resolve_runtime_config(config.runtime, task), name=f"debug-{task.idx}"


Use unique debug runtime names

When two debug sessions select the same task index with --runtime.type subprocess (or a previous run leaves /tmp/debug-0 behind), this fixed name is reused; SubprocessRuntime.start() creates /tmp/<name> with mkdir() and will fail on the existing directory before the command runs. Since the output path already has a UUID, the runtime name should include a run/trace-unique suffix instead of only task.idx.
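
A rough sketch of the collision and of one possible fix follows. `start` only mimics a runtime that creates its working directory with a strict `mkdir()`; it is not SubprocessRuntime, and the `uuid`-based suffix in `unique_name` is an assumption about how a run-unique name might be built.

```python
import tempfile
import uuid
from pathlib import Path


def fixed_name(task_idx: int) -> str:
    """Name derived from the task index only, as in the quoted diff."""
    return f"debug-{task_idx}"


def unique_name(task_idx: int) -> str:
    """Hypothetical fix: append a short random suffix per run."""
    return f"debug-{task_idx}-{uuid.uuid4().hex[:8]}"


def start(root: Path, name: str) -> Path:
    """Create the working directory; fails if a previous run left it behind."""
    workdir = root / name
    workdir.mkdir()  # raises FileExistsError when the directory already exists
    return workdir
```

With `fixed_name`, a second `start()` for the same task index raises `FileExistsError`; with `unique_name`, repeated or concurrent sessions get separate directories.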


chatgpt-codex-connector · 2026-07-01T03:44:01Z

+
+    @model_validator(mode="after")
+    def validate_action(self):
+        if bool(self.command) == bool(self.script_path):


Reject empty command plus script path

When a config file or shell invocation passes an empty command together with script_path, this check treats the command as absent because bool('') is false, but run_action() dispatches on config.command is not None and runs the empty command instead of the uploaded script. Use presence checks consistently so the “exactly one action” invariant is actually enforced.
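
One way to enforce the invariant with presence checks is sketched below. `DebugAction` is a dataclass stand-in for the two relevant fields, not the real DebugConfig (which is a Pydantic model), and rejecting a blank action outright is an extra assumption beyond what the comment asks for.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DebugAction:
    """Plain-Python stand-in for the command/script fields of a debug config."""
    command: Optional[str] = None
    script_path: Optional[str] = None

    def __post_init__(self) -> None:
        # Count fields that are present, not fields that are truthy, so the
        # check matches a dispatch written as `command is not None`.
        provided = [v for v in (self.command, self.script_path) if v is not None]
        if len(provided) != 1:
            raise ValueError("pass exactly one of `--command` or `--script-path`")
        # Assumption: an action that is present but blank is also an error.
        if not provided[0].strip():
            raise ValueError("the selected action must not be empty")
```

In a Pydantic model the same two checks would sit inside the existing `model_validator(mode="after")` method.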


feat: add native v1 debug CLI

9c68c39

cursor Bot reviewed Jul 1, 2026


Comment thread verifiers/v1/cli/debug.py Outdated

macroscopeapp Bot reviewed Jul 1, 2026


Comment thread verifiers/v1/cli/debug.py

Comment thread verifiers/v1/cli/debug.py Outdated

rasdani added 2 commits July 1, 2026 03:19

fix: report debug setup elapsed time

e5b0e57

fix: persist cancelled debug traces

b8fd9a1

macroscopeapp Bot reviewed Jul 1, 2026


Comment thread verifiers/v1/cli/debug.py

fix: update swe debug deprecation test

214ff8d

cursor Bot reviewed Jul 1, 2026


Comment thread verifiers/envs/experimental/composable/swe_debug_env.py

Comment thread verifiers/v1/cli/debug.py

fix: address debug cli bugbot comments

ed5efa5

cursor Bot reviewed Jul 1, 2026


Comment thread docs/evaluation.md

fix: address remaining debug review comments

e6515f3

cursor Bot reviewed Jul 1, 2026


chatgpt-codex-connector Bot reviewed Jul 1, 2026


This was referenced Jul 1, 2026

Add missing SWE v1 tasksets PrimeIntellect-ai/research-environments#563

Open

Add SWE debug v1 taskset PrimeIntellect-ai/research-environments#619

Closed

rasdani wants to merge 6 commits into main from feat/v1-debug-cli
