feat(eval): add --prune to drop errored rollout rows (parallel to retry types) #1893

Open
hallerite wants to merge 1 commit into main from feat/eval-prune-errors

@hallerite hallerite commented Jun 29, 2026

What

Adds uv run eval --prune <dir> [--prune-include <csv>] [--prune-exclude <csv>] — a post-hoc cleanup that drops errored rollout rows from a finished run's results.jsonl, reusing resume.rewrite_results for the atomic write. No model, no config.

Error-type selection is parallel to retries. This extracts error_type_selected(error_type, include, exclude) (exclude wins; empty include = all not excluded; matched on the most-recent errors[].type, the same error should_retry keys off via trace.error) and shares it between should_retry and --prune. So --prune-include/--prune-exclude name the same exception classes (SandboxError, ProviderError, HarnessError, …) as --retries.rollout.include/exclude, with identical semantics:

| | retries | prune |
| --- | --- | --- |
| selector | --retries.rollout.include / exclude | --prune-include / --prune-exclude |
| empty include | retry all not excluded | prune all errored not excluded |
| exclude | wins over include | wins over include |
| matches | trace.error.type (most recent) | most-recent errors[].type |
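
A minimal sketch of the selection rule in the table, for illustration only. The name and argument order of error_type_selected come from this description; the type hints, the defaults, and the exact-string match on the class name are assumptions, and the real helper in verifiers/v1/retries.py may differ (for example in how it treats an absent type).

```python
from collections.abc import Collection


def error_type_selected(
    error_type: str,
    include: Collection[str] = (),
    exclude: Collection[str] = (),
) -> bool:
    """Return True if an error type is selected by the include/exclude lists.

    Assumption: types are compared as exact class-name strings.
    """
    # Exclude wins, even if the same type is also listed in include.
    if error_type in exclude:
        return False
    # Empty include means "everything that was not excluded".
    if not include:
        return True
    return error_type in include
```

With this rule, include=["SandboxError"] together with exclude=["SandboxError"] selects nothing, which is the "exclude wins" row above.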

Composes with --resume:

  • --prune <dir> alone → clean and stop.
  • --resume <dir> alone → re-run owed rollouts (unchanged).
  • --resume <dir> --prune [--prune-include/--prune-exclude …] → prune first, then resume the same dir.
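
The three cases above can be sketched as a small planning function. plan_steps and its return shape are hypothetical, not the code in main.py; the plain string comparison of directories and the rejection of a bare --prune without --resume are assumptions drawn from the flag forms listed in this PR.

```python
from typing import Optional


def plan_steps(
    resume_dir: Optional[str],
    prune_requested: bool,
    prune_dir: Optional[str],
) -> list[tuple[str, str]]:
    """Return the ordered (action, directory) steps for a flag combination.

    Hypothetical helper: the real dispatch may normalise paths before
    comparing them and reports problems as usage errors, not ValueError.
    """
    if not prune_requested:
        # --resume <dir> alone: re-run owed rollouts (unchanged behaviour).
        return [("resume", resume_dir)] if resume_dir is not None else []
    if resume_dir is None:
        # --prune <dir> alone: clean and stop. Assumed: a bare --prune is
        # only valid when --resume supplies the directory.
        if prune_dir is None:
            raise ValueError("--prune needs a directory unless combined with --resume")
        return [("prune", prune_dir)]
    # Both flags: they must agree on the target directory.
    if prune_dir is not None and prune_dir != resume_dir:
        raise ValueError("--resume and --prune name different directories")
    # Prune first, then resume the same directory.
    return [("prune", resume_dir), ("resume", resume_dir)]
```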

How

  • New verifiers/v1/cli/eval/prune.py: split_prune (handles --prune <dir>, --prune=<dir>, bare --prune when combined with --resume, plus --prune-include/--prune-exclude) + prune_results; own byte-offset streamer; reuses resume.rewrite_results.
  • verifiers/v1/retries.py — extracts error_type_selected; should_retry now delegates to it (no behaviour change).
  • main.py dispatch parses --resume then --prune, reconciles the target dir (rejecting a mismatch), and runs prune→resume / prune-only / resume-only.
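
For readers unfamiliar with the pattern, here is a rough standalone sketch of what pruning a results.jsonl amounts to. It is not the code in prune.py: it buffers kept lines in memory instead of using a byte-offset streamer, it stands in a temp file plus os.replace for resume.rewrite_results, and it assumes each row is a JSON object with an optional "errors" list of objects carrying a "type" key, most recent last. The treatment of a missing type here is also an assumption.

```python
import json
import os
import tempfile
from pathlib import Path


def _selected(error_type, include, exclude):
    # Exclude wins; empty include selects everything not excluded.
    if error_type in exclude:
        return False
    return not include or error_type in include


def prune_results(run_dir, include=(), exclude=()):
    """Drop errored rows from <run_dir>/results.jsonl; return the drop count."""
    path = Path(run_dir) / "results.jsonl"
    if not path.is_file():
        raise FileNotFoundError(path)

    kept, dropped = [], 0
    with path.open(encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            errors = json.loads(line).get("errors") or []
            # Only the most recent error decides whether the row is pruned.
            if errors and _selected(errors[-1].get("type"), include, exclude):
                dropped += 1
            else:
                kept.append(line if line.endswith("\n") else line + "\n")

    # Write to a sibling temp file, then swap it in with a single rename so a
    # crash never leaves a half-written results.jsonl behind.
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        out.writelines(kept)
    os.replace(tmp, path)
    return dropped
```

Rows without errors are always kept, so running the sketch twice with the same filters drops nothing the second time.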

Testing

  • tests/v1/test_prune.py (9 tests): default / include / exclude / exclude-wins / absent-type pruning, missing-dir error, split_prune forms, and the --resume --prune ordering + bad-combo rejections.
  • should_retry + error_type_selected re-verified; ruff check + ruff format clean. (Pre-push ty not run locally — env not synced; CI covers it.)

🤖 Generated with Claude Code

Note

Add --prune flag to drop errored rows from results.jsonl in eval CLI

  • Adds a new --prune mode to the eval CLI that rewrites results.jsonl, removing rows whose error type matches include/exclude filters (parallel to existing retry selectors).
  • Implements argument parsing in prune.py supporting --prune [dir], --prune-include, and --prune-exclude with validation that rejects empty inline values.
  • Supports a combined --resume --prune workflow that prunes first, then resumes; validates that both flags agree on the output directory when both specify one.
  • Refactors should_retry in retries.py to share selection logic with prune via a new error_type_selected helper.
  • Behavioral Change: --include/--exclude flags now require --prune to be present, and --resume/--prune cannot be combined with other arguments.

Macroscope summarized 8612336.


Note

Medium Risk
The change rewrites eval output files in place on disk and extends eval CLI dispatch alongside resume, but uses atomic rewrites, validates legacy resume before pruning, and is covered by targeted tests.

Overview
Adds uv run eval --prune to remove errored rollout rows from a finished run’s results.jsonl in place, with optional --prune-include / --prune-exclude CSV filters that mirror --retries.rollout.include/exclude semantics (exclude wins; empty include = all errored types not excluded), keyed off the most recent errors[].type.

error_type_selected is extracted in retries.py and shared by should_retry and prune so retry and prune stay aligned; should_retry behavior is unchanged.

Eval main.py parses --resume then --prune: prune-only exits after cleanup; --resume <dir> --prune loads resume config (including legacy rejection before any rewrite), prunes, then continues resume; mismatched dirs and invalid flag combos exit with usage errors. Pruning reuses resume.rewrite_results for atomic writes.

Reviewed by Cursor Bugbot for commit 8612336.

Comment thread verifiers/v1/cli/eval/prune.py
Comment thread verifiers/v1/cli/eval/main.py
@hallerite hallerite marked this pull request as ready for review June 29, 2026 01:19

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit e03ec6d.

Comment thread verifiers/v1/cli/eval/main.py
@macroscopeapp

macroscopeapp Bot commented Jun 29, 2026

Approvability

Verdict: Needs human review

New feature introducing --prune CLI capability that rewrites results.jsonl files based on error type filtering. New user-facing features with file manipulation capabilities warrant human review regardless of implementation quality.

…ry types)

`uv run eval --prune <dir> [--prune-include <csv>] [--prune-exclude <csv>]`
rewrites a finished run's results.jsonl in place, dropping rows whose most-recent
error type is selected, reusing `resume.rewrite_results` for the atomic write —
no model, no config.

Error-type selection is parallel to retries: extract `error_type_selected`
(exclude wins; empty include = all not excluded; matched on the most-recent
`errors[].type`, the same error `should_retry` keys off via `trace.error`) and
share it between `should_retry` and `--prune`. So `--prune-include`/
`--prune-exclude` name the same exception classes (SandboxError, ProviderError,
...) as `--retries.rollout.include`/`exclude`, with the same semantics.

Composes with `--resume`: `--prune` alone cleans and stops; `--resume <dir>
--prune` prunes first, then resumes the same dir.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hallerite hallerite force-pushed the feat/eval-prune-errors branch from e03ec6d to 8612336 Compare June 29, 2026 01:26