feat(eval): add --prune to drop errored rollout rows (parallel to retry types) by hallerite · Pull Request #1893 · PrimeIntellect-ai/verifiers

hallerite · 2026-06-29T01:09:16Z

What

Adds uv run eval --prune <dir> [--prune-include <csv>] [--prune-exclude <csv>] — a post-hoc cleanup that drops errored rollout rows from a finished run's results.jsonl, reusing resume.rewrite_results for the atomic write. No model, no config.

Error-type selection is parallel to retries. This extracts error_type_selected(error_type, include, exclude) (exclude wins; empty include = all not excluded; matched on the most-recent errors[].type, the same error should_retry keys off via trace.error) and shares it between should_retry and --prune. So --prune-include/--prune-exclude name the same exception classes (SandboxError, ProviderError, HarnessError, …) as --retries.rollout.include/exclude, with identical semantics:

| | retries | prune |
|---|---|---|
| selector | `--retries.rollout.include` / `exclude` | `--prune-include` / `--prune-exclude` |
| empty include | retry all not excluded | prune all errored not excluded |
| exclude | wins over include | wins over include |
| matches | `trace.error.type` (most recent) | most-recent `errors[].type` |
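The selection rule in the table can be sketched as a small pure function. This is a minimal illustration, not the PR's code: the name `error_type_selected_sketch` is hypothetical, and handling of a missing error type is deliberately left out because the description does not specify it.

```python
from typing import Collection


def error_type_selected_sketch(
    error_type: str,
    include: Collection[str],
    exclude: Collection[str],
) -> bool:
    """Return True when error_type is selected by the include/exclude filters."""
    # Exclude always wins, even if the same type is also listed in include.
    if error_type in exclude:
        return False
    # An empty include list means "every type that was not excluded".
    return not include or error_type in include
```

Because both retry and prune would call the same predicate, a filter such as `include=["SandboxError"]` selects the same rows whether it is used to retry them or to drop them.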

Composes with --resume:

--prune <dir> alone → clean and stop.
--resume <dir> alone → re-run owed rollouts (unchanged).
--resume <dir> --prune [--prune-include/--prune-exclude …] → prune first, then resume the same dir.
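The three combinations above amount to a small planning step. The sketch below is an assumption about how such a dispatch could be ordered, not the actual `main.py` logic; `plan_actions` and its parameters are hypothetical names.

```python
from typing import Optional


def plan_actions(
    resume_dir: Optional[str],
    prune: bool,
    prune_dir: Optional[str],
) -> list[tuple[str, str]]:
    """Return the ordered (action, directory) steps for a flag combination."""
    if not prune:
        if resume_dir is None:
            raise ValueError("neither --resume nor --prune was given")
        return [("resume", resume_dir)]

    # When both flags name a directory, they must agree.
    if resume_dir and prune_dir and resume_dir != prune_dir:
        raise ValueError("--resume and --prune name different directories")

    target = prune_dir or resume_dir
    if target is None:
        raise ValueError("bare --prune needs a directory from --resume")

    steps = [("prune", target)]  # pruning always runs first
    if resume_dir:
        steps.append(("resume", target))
    return steps
```

Pruning before resuming matters if resume decides which rollouts are still owed by reading `results.jsonl`: under that assumption, rows dropped by the prune step become owed again.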

How

New verifiers/v1/cli/eval/prune.py — split_prune (handles --prune <dir>, --prune=<dir>, bare --prune when combined with --resume, plus --prune-include/--prune-exclude) + prune_results; own byte-offset streamer; reuses resume.rewrite_results.
verifiers/v1/retries.py — extracts error_type_selected; should_retry now delegates to it (no behaviour change).
main.py dispatch parses --resume then --prune, reconciles the target dir (rejecting a mismatch), and runs prune→resume / prune-only / resume-only.
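For illustration, a prune pass with an atomic rewrite might look like the following. This is a sketch under stated assumptions: each row is a JSON object whose `errors` list ends with the most recent error, a temp-file-plus-`os.replace` stands in for `resume.rewrite_results`, and `prune_results_sketch` is a hypothetical name. Treating an errored row without a `type` as selected only when include is empty is also an assumption.

```python
import json
import os
import tempfile
from pathlib import Path


def prune_results_sketch(run_dir, include=(), exclude=()):
    """Drop selected errored rows from <run_dir>/results.jsonl; return the count dropped."""
    path = Path(run_dir) / "results.jsonl"
    if not path.is_file():
        raise FileNotFoundError(f"no results.jsonl under {run_dir}")

    kept, dropped = [], 0
    with path.open("r", encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            errors = json.loads(line).get("errors") or []
            # Assumed layout: the last entry is the most recent error.
            error_type = errors[-1].get("type") if errors else None
            selected = bool(errors) and error_type not in exclude and (
                not include or error_type in include
            )
            if selected:
                dropped += 1
            else:
                kept.append(line if line.endswith("\n") else line + "\n")

    # Write to a temp file in the same directory, then swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as dst:
        dst.writelines(kept)
    os.replace(tmp_name, path)
    return dropped
```

Keeping the raw lines rather than re-serializing each row leaves surviving rows byte-for-byte unchanged.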

Testing

tests/v1/test_prune.py (9 tests): default / include / exclude / exclude-wins / absent-type pruning, missing-dir error, split_prune forms, and the --resume --prune ordering + bad-combo rejections.
should_retry + error_type_selected re-verified; ruff check + ruff format clean. (Pre-push ty not run locally — env not synced; CI covers it.)
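The `split_prune` forms exercised by the tests could be parsed roughly as below. This is a hypothetical sketch, not the code in `prune.py`: the function name, return shape, and error messages are assumptions, and it only models the flag forms listed in this description.

```python
def split_prune_sketch(argv):
    """Split prune flags out of argv.

    Returns (prune_present, prune_dir, include, exclude, remaining_args).
    """
    present, target, include, exclude, rest = False, None, [], [], []
    i = 0
    while i < len(argv):
        arg = argv[i]
        name, eq, inline = arg.partition("=")
        if name == "--prune":
            present = True
            if eq:  # --prune=<dir>
                if not inline:
                    raise ValueError("--prune= needs a directory")
                target = inline
            elif i + 1 < len(argv) and not argv[i + 1].startswith("-"):
                target = argv[i + 1]  # --prune <dir>
                i += 1
            # otherwise: bare --prune, directory comes from --resume
        elif name in ("--prune-include", "--prune-exclude"):
            if eq:
                value = inline
            elif i + 1 < len(argv):
                value = argv[i + 1]
                i += 1
            else:
                value = ""
            if not value:
                raise ValueError(f"{name} needs a comma-separated value")
            names = [s.strip() for s in value.split(",") if s.strip()]
            (include if name == "--prune-include" else exclude).extend(names)
        else:
            rest.append(arg)
        i += 1
    if (include or exclude) and not present:
        raise ValueError("--prune-include/--prune-exclude require --prune")
    return present, target, include, exclude, rest
```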

🤖 Generated with Claude Code

Note

Add `--prune` flag to drop errored rows from `results.jsonl` in eval CLI

Adds a new --prune mode to the eval CLI that rewrites results.jsonl, removing rows whose error type matches include/exclude filters (parallel to existing retry selectors).
Implements argument parsing in prune.py supporting --prune [dir], --prune-include, and --prune-exclude with validation that rejects empty inline values.
Supports a combined --resume --prune workflow that prunes first, then resumes; validates that both flags agree on the output directory when both specify one.
Refactors should_retry in retries.py to share selection logic with prune via a new error_type_selected helper.
Behavioral Change: the --prune-include/--prune-exclude flags now require --prune to be present, and --resume/--prune cannot be combined with other arguments.

Macroscope summarized 8612336.

Note

Medium Risk
The change rewrites eval output files in place on disk and extends eval CLI dispatch alongside resume, but uses atomic rewrites, validates legacy resume before pruning, and is covered by targeted tests.

Overview
Adds uv run eval --prune to remove errored rollout rows from a finished run’s results.jsonl in place, with optional --prune-include / --prune-exclude CSV filters that mirror --retries.rollout.include/exclude semantics (exclude wins; empty include = all errored types not excluded), keyed off the most recent errors[].type.

error_type_selected is extracted in retries.py and shared by should_retry and prune so retry and prune stay aligned; should_retry behavior is unchanged.

Eval main.py parses --resume then --prune: prune-only exits after cleanup; --resume <dir> --prune loads resume config (including legacy rejection before any rewrite), prunes, then continues resume; mismatched dirs and invalid flag combos exit with usage errors. Pruning reuses resume.rewrite_results for atomic writes.

Reviewed by Cursor Bugbot for commit 8612336.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit e03ec6d.

macroscopeapp · 2026-06-29T01:21:17Z

Approvability

Verdict: Needs human review

New feature introducing --prune CLI capability that rewrites results.jsonl files based on error type filtering. New user-facing features with file manipulation capabilities warrant human review regardless of implementation quality.


…ry types)

`uv run eval --prune <dir> [--prune-include <csv>] [--prune-exclude <csv>]` rewrites a finished run's results.jsonl in place, dropping rows whose most-recent error type is selected, reusing `resume.rewrite_results` for the atomic write (no model, no config).

Error-type selection is parallel to retries: extract `error_type_selected` (exclude wins; empty include = all not excluded; matched on the most-recent `errors[].type`, the same error `should_retry` keys off via `trace.error`) and share it between `should_retry` and `--prune`. So `--prune-include`/`--prune-exclude` name the same exception classes (SandboxError, ProviderError, ...) as `--retries.rollout.include`/`exclude`, with the same semantics.

Composes with `--resume`: `--prune` alone cleans and stops; `--resume <dir> --prune` prunes first, then resumes the same dir.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp Bot reviewed Jun 29, 2026


Comment thread verifiers/v1/cli/eval/prune.py

Comment thread verifiers/v1/cli/eval/main.py

hallerite marked this pull request as ready for review June 29, 2026 01:19

cursor Bot reviewed Jun 29, 2026


Comment thread verifiers/v1/cli/eval/main.py

hallerite force-pushed the feat/eval-prune-errors branch from e03ec6d to 8612336 Compare June 29, 2026 01:26

hallerite wants to merge 1 commit into main from feat/eval-prune-errors
