feat(eval): add --prune to drop errored rollout rows (parallel to retry types) #1893
hallerite wants to merge 1 commit
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e03ec6d.
Approvability verdict: Needs human review. New feature introducing …
…ry types)

`uv run eval --prune <dir> [--prune-include <csv>] [--prune-exclude <csv>]` rewrites a finished run's results.jsonl in place, dropping rows whose most-recent error type is selected, reusing `resume.rewrite_results` for the atomic write. No model, no config.

Error-type selection is parallel to retries: extract `error_type_selected` (exclude wins; empty include = all not excluded; matched on the most-recent `errors[].type`, the same error `should_retry` keys off via `trace.error`) and share it between `should_retry` and `--prune`. So `--prune-include`/`--prune-exclude` name the same exception classes (SandboxError, ProviderError, ...) as `--retries.rollout.include`/`exclude`, with the same semantics.

Composes with `--resume`: `--prune` alone cleans and stops; `--resume <dir> --prune` prunes first, then resumes the same dir.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
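The selection rule stated in the commit message (exclude wins; an empty include list selects every type that is not excluded) can be illustrated with a minimal sketch. The signature follows the `error_type_selected(error_type, include, exclude)` form named in this PR, but the body and the handling of a missing error type are assumptions, not the code in `verifiers/v1/retries.py`.

```python
from typing import Iterable, Optional


def error_type_selected(
    error_type: Optional[str],
    include: Iterable[str] = (),
    exclude: Iterable[str] = (),
) -> bool:
    """Sketch of the include/exclude rule; not the repository's actual code."""
    if error_type is None:
        # Assumption: a row with no recorded error type is never selected.
        return False
    include_set, exclude_set = set(include), set(exclude)
    if error_type in exclude_set:
        # Exclude wins, even when the same type also appears in include.
        return False
    # An empty include list means "every error type that is not excluded".
    return not include_set or error_type in include_set
```

Because retries and pruning would both call the same predicate, a type listed in `--retries.rollout.exclude` and in `--prune-exclude` is treated identically by both paths.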
Force-pushed: e03ec6d to 8612336

What

Adds `uv run eval --prune <dir> [--prune-include <csv>] [--prune-exclude <csv>]`, a post-hoc cleanup that drops errored rollout rows from a finished run's `results.jsonl`, reusing `resume.rewrite_results` for the atomic write. No model, no config.

Error-type selection is parallel to retries. This extracts `error_type_selected(error_type, include, exclude)` (exclude wins; empty `include` = all not excluded; matched on the most-recent `errors[].type`, the same error `should_retry` keys off via `trace.error`) and shares it between `should_retry` and `--prune`. So `--prune-include`/`--prune-exclude` name the same exception classes (`SandboxError`, `ProviderError`, `HarnessError`, …) as `--retries.rollout.include`/`exclude`, with identical semantics:

| `--retries.rollout.include`/`exclude` | `--prune-include`/`--prune-exclude` |
| --- | --- |
| `trace.error.type` | (most recent) `errors[].type` |

Composes with `--resume`:

- `--prune <dir>` alone → clean and stop.
- `--resume <dir>` alone → re-run owed rollouts (unchanged).
- `--resume <dir> --prune [--prune-include/--prune-exclude …]` → prune first, then resume the same dir.

How

- `verifiers/v1/cli/eval/prune.py`: `split_prune` (handles `--prune <dir>`, `--prune=<dir>`, bare `--prune` when combined with `--resume`, plus `--prune-include`/`--prune-exclude`) + `prune_results`; own byte-offset streamer; reuses `resume.rewrite_results`.
- `verifiers/v1/retries.py`: extracts `error_type_selected`; `should_retry` now delegates to it (no behaviour change).
- `main.py` dispatch parses `--resume` then `--prune`, reconciles the target dir (rejecting a mismatch), and runs prune→resume / prune-only / resume-only.

Testing

`tests/v1/test_prune.py` (9 tests): default / include / exclude / exclude-wins / absent-type pruning, missing-dir error, `split_prune` forms, and the `--resume --prune` ordering + bad-combo rejections. `should_retry` + `error_type_selected` re-verified; `ruff check` + `ruff format` clean. (Pre-push `ty` not run locally; env not synced; CI covers it.)

🤖 Generated with Claude Code
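For readers unfamiliar with in-place JSONL pruning, here is a rough sketch of the general pattern: stream the rows, skip those whose most recent error type is selected, and swap the file atomically. This is not the PR's `prune_results`, which per the description uses its own byte-offset streamer and `resume.rewrite_results`. The names `most_recent_error_type` and `prune_jsonl`, the assumption that `errors` is ordered oldest to newest, and the temp-file-plus-`os.replace` strategy are all illustrative.

```python
import json
import os
import tempfile
from typing import Callable, Optional


def most_recent_error_type(row: dict) -> Optional[str]:
    # Assumption: `errors` is a list of dicts ordered oldest to newest.
    errors = row.get("errors") or []
    return errors[-1].get("type") if errors else None


def prune_jsonl(path: str, drop: Callable[[str], bool]) -> int:
    """Rewrite `path` without rows whose latest error type satisfies `drop`.

    Returns the number of rows removed. Hypothetical helper, not the PR's code.
    """
    removed = 0
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp, open(
            path, encoding="utf-8"
        ) as src:
            for line in src:
                if not line.strip():
                    continue  # skip blank lines
                error_type = most_recent_error_type(json.loads(line))
                if error_type is not None and drop(error_type):
                    removed += 1
                    continue
                tmp.write(line if line.endswith("\n") else line + "\n")
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        os.unlink(tmp_path)  # leave the original file untouched on failure
        raise
    return removed
```

Passing the selection rule in as a callable keeps the file handling independent of how include/exclude lists are interpreted.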
Note
Add `--prune` flag to drop errored rows from `results.jsonl` in eval CLI

- Adds a `--prune` mode to the eval CLI that rewrites `results.jsonl`, removing rows whose error type matches include/exclude filters (parallel to existing retry selectors).
- Parses `--prune [dir]`, `--prune-include`, and `--prune-exclude` with validation that rejects empty inline values.
- Adds a `--resume --prune` workflow that prunes first, then resumes; validates that both flags agree on the output directory when both specify one.
- Refactors `should_retry` in retries.py to share selection logic with prune via a new `error_type_selected` helper.
- `--include`/`--exclude` flags now require `--prune` to be present, and `--resume`/`--prune` cannot be combined with other arguments.

Macroscope summarized 8612336.
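The flag forms summarized above (`--prune <dir>`, `--prune=<dir>`, bare `--prune`, CSV filters, and rejection of empty inline values) could be split out of `argv` roughly as follows. This is a sketch only: `PruneArgs` and `split_prune_args` are hypothetical names, the return shape is an assumption, and the real `split_prune` in `prune.py` may differ in its details.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

PRUNE_FLAGS = ("--prune", "--prune-include", "--prune-exclude")


@dataclass
class PruneArgs:
    requested: bool = False
    directory: Optional[str] = None  # None means bare --prune
    include: List[str] = field(default_factory=list)
    exclude: List[str] = field(default_factory=list)


def _csv(value: str) -> List[str]:
    return [item.strip() for item in value.split(",") if item.strip()]


def split_prune_args(argv: List[str]) -> Tuple[PruneArgs, List[str]]:
    """Separate prune flags from the rest of argv (hypothetical helper)."""
    parsed, rest = PruneArgs(), []
    i = 0
    while i < len(argv):
        name, eq, inline = argv[i].partition("=")
        if name not in PRUNE_FLAGS:
            rest.append(argv[i])
            i += 1
            continue
        if eq and not inline:
            raise ValueError(f"{name}= requires a non-empty value")
        value = inline if eq else None
        # A separate following token is the value unless it looks like a flag.
        if value is None and i + 1 < len(argv) and not argv[i + 1].startswith("--"):
            value = argv[i + 1]
            i += 1
        if name == "--prune":
            parsed.requested, parsed.directory = True, value
        elif value is None:
            raise ValueError(f"{name} requires a value")
        elif name == "--prune-include":
            parsed.include = _csv(value)
        else:
            parsed.exclude = _csv(value)
        i += 1
    if (parsed.include or parsed.exclude) and not parsed.requested:
        raise ValueError("--prune-include/--prune-exclude require --prune")
    return parsed, rest
```

For example, `split_prune_args(["--resume", "out", "--prune"])` leaves `["--resume", "out"]` for the resume parser and records a bare `--prune` with no directory of its own.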
Note
Medium Risk
The change rewrites eval output files in place on disk and extends eval CLI dispatch alongside resume, but uses atomic rewrites, validates legacy resume before pruning, and is covered by targeted tests.
Overview
Adds `uv run eval --prune` to remove errored rollout rows from a finished run’s `results.jsonl` in place, with optional `--prune-include`/`--prune-exclude` CSV filters that mirror `--retries.rollout.include`/`exclude` semantics (exclude wins; empty include = all errored types not excluded), keyed off the most recent `errors[].type`.

`error_type_selected` is extracted in `retries.py` and shared by `should_retry` and prune so retry and prune stay aligned; `should_retry` behavior is unchanged.

Eval `main.py` parses `--resume` then `--prune`: prune-only exits after cleanup; `--resume <dir> --prune` loads resume config (including legacy rejection before any rewrite), prunes, then continues resume; mismatched dirs and invalid flag combos exit with usage errors. Pruning reuses `resume.rewrite_results` for atomic writes.

Reviewed by Cursor Bugbot for commit 8612336.
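The dispatch described in this overview reduces to a small decision over which flags were given and whether their directories agree. The sketch below is an assumption about how that reconciliation might look; `resolve_mode`, its mode strings, and the use of `os.path.normpath` for comparing directories are illustrative and not taken from `main.py`.

```python
import os
from typing import Optional, Tuple


def resolve_mode(
    resume_dir: Optional[str],
    prune_requested: bool,
    prune_dir: Optional[str],
) -> Tuple[str, str]:
    """Return (mode, target_dir) for the eval CLI (hypothetical helper).

    mode is one of "resume", "prune", or "prune-then-resume".
    """
    if not prune_requested:
        if resume_dir is None:
            raise ValueError("nothing to do: pass --resume and/or --prune")
        return "resume", resume_dir
    if resume_dir is None:
        if prune_dir is None:
            # Bare --prune only makes sense when --resume supplies the dir.
            raise ValueError("bare --prune needs a directory or --resume <dir>")
        return "prune", prune_dir
    # Both flags given: they must agree when --prune names its own directory.
    if prune_dir is not None and os.path.normpath(prune_dir) != os.path.normpath(
        resume_dir
    ):
        raise ValueError(
            f"--resume ({resume_dir}) and --prune ({prune_dir}) target different dirs"
        )
    return "prune-then-resume", resume_dir
```

A caller would then run the prune step before the resume step whenever the mode is `"prune-then-resume"`, matching the ordering stated in the overview.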