feat(v1): resume dashboard counts the whole run, not just resumed rollouts by hallerite · Pull Request #1892 · PrimeIntellect-ai/verifiers

hallerite · 2026-06-29T01:09:15Z

What

On --resume, the live eval dashboard counted only the re-run rollouts, so it showed e.g. 1246/9683 with a session-only reward — ignoring the rollouts already kept on disk.

This makes the dashboard reflect the whole run: the counter shows kept+done / kept+total, and the headline reward / err share / reward-metric breakdown fold in the kept on-disk rows.
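As a rough illustration of the counter change, the sketch below renders kept+done / kept+total. The function name `progress_label` and the numbers are hypothetical and are not taken from the PR's code; only the kept+done / kept+total arithmetic comes from the description above.

```python
def progress_label(done: int, total: int, kept: int = 0) -> str:
    """Render the counter as kept+done / kept+total.

    `done` and `total` count only the rollouts owed in this session;
    `kept` is the number of rollouts already on disk (0 when not resuming).
    """
    return f"{kept + done}/{kept + total}"


# Hypothetical numbers: 5000 rollouts kept on disk, 100 of 300 owed ones done.
session_only = progress_label(100, 300)          # what the old counter showed
whole_run = progress_label(100, 300, kept=5000)  # what the new counter shows
print(session_only, whole_run)
```

With `kept=0` the label is identical to the session-only one, which matches the claim that non-resume runs are unchanged.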

How

resume.plan now also returns a Baseline over the kept rows (count + summed reward + per-@reward/@metric sums), accumulated inside the parse it already does — so it's effectively free.
format_mean gains base_sum/base_n to seed both the error-corrected and the global (errored-as-0) mean.
The dashboard threads the baseline through Progress and _breakdown.
Token/time stay session-scoped (kept rows weren't recomputed; re-deriving their tokens would need full Trace reconstruction). Non-resume runs are unchanged (empty baseline).
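The baseline seeding can be sketched as follows. This is a minimal stand-in, not the real `format_mean`: the name `format_mean_sketch`, the `values`/`n_errored` parameters, and the output format are assumptions; only the `base_sum`/`base_n` keyword names and the two means (error-corrected, and global with errored rollouts counted as 0) come from the description.

```python
from typing import Sequence


def format_mean_sketch(
    values: Sequence[float],
    n_errored: int = 0,
    *,
    base_sum: float = 0.0,
    base_n: int = 0,
) -> str:
    """Format a mean as 'clean' or 'clean (global)'.

    `values` holds the non-errored session values and `n_errored` the number
    of errored session rollouts. `base_sum`/`base_n` seed the sum and count
    with the kept on-disk rows; the defaults leave the result unchanged.
    """
    total = base_sum + sum(values)
    n_clean = base_n + len(values)
    if n_clean == 0:
        return "-"  # nothing non-errored to average yet
    clean = total / n_clean  # error-corrected mean
    if n_errored == 0:
        return f"{clean:.3f}"
    overall = total / (n_clean + n_errored)  # errored rollouts count as 0
    return f"{clean:.3f} ({overall:.3f})"


session = format_mean_sketch([1.0, 0.0], n_errored=2)
resumed = format_mean_sketch([1.0, 0.0], n_errored=2, base_sum=3.0, base_n=4)
print(session, resumed)
```

Because the baseline only adds to the numerator and the denominators, callers that omit `base_sum`/`base_n` get the same string as before.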

Testing

tests/v1/test_resume_baseline.py (4 tests): baseline math, plan aggregation, the group-scored drop case.
ruff check + ruff format clean on touched files. (Pre-push ty not run locally — env not synced; CI covers it.)

🤖 Generated with Claude Code

Note

Include kept rollouts from resumed runs in eval dashboard counts and means

Introduces a Baseline dataclass in resume.py that aggregates non-errored kept rollouts (count, reward sum, per-key component sums); resume.plan now returns this alongside keep offsets and owed counts.
Updates format_mean to accept base_sum/base_n kwargs, folding prior rollouts into both the clean mean and the parenthesized global mean.
Updates the dashboard's Progress and _breakdown components in eval.py to accept and apply the Baseline, so counts, error rates, rewards, and per-key metrics reflect the entire run rather than only the current session.
Behavioral Change: resumed run dashboards now show cumulative statistics; callers that do not pass base_sum/base_n are unaffected.
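A minimal sketch of how such an aggregate might be accumulated while scanning kept rows. The class name `BaselineSketch`, its field names, and the row keys (`reward`, `metrics`, `error`) are assumptions for illustration; the actual `Baseline` dataclass and the results.jsonl schema may differ.

```python
import json
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class BaselineSketch:
    """Count, reward sum, and per-key sums over non-errored kept rows."""

    n: int = 0
    reward_sum: float = 0.0
    component_sums: dict[str, float] = field(
        default_factory=lambda: defaultdict(float)
    )

    def add(self, row: dict) -> None:
        if row.get("error"):  # assumed key: errored rows are not aggregated
            return
        self.n += 1
        self.reward_sum += float(row.get("reward", 0.0))
        for key, value in row.get("metrics", {}).items():
            self.component_sums[key] += float(value)


def baseline_from_lines(lines: Iterable[str]) -> BaselineSketch:
    """Fold JSONL lines into a baseline in a single pass."""
    base = BaselineSketch()
    for line in lines:
        if line.strip():
            base.add(json.loads(line))
    return base


# Hypothetical kept rows; the third is errored and is skipped.
rows = [
    '{"reward": 1.0, "metrics": {"accuracy": 1.0}}',
    '{"reward": 0.5, "metrics": {"accuracy": 0.0}}',
    '{"reward": 0.0, "error": "timeout"}',
]
base = baseline_from_lines(rows)
print(base.n, base.reward_sum, dict(base.component_sums))
```

An empty `BaselineSketch()` has `n == 0` and zero sums, which corresponds to the non-resume case.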

Macroscope summarized 3c764e2.

Note

Low Risk
Display and aggregation-only for the resume dashboard path; format_mean defaults preserve existing callers, and persisted results logic is unchanged aside from reading reward/metric dicts when planning keeps.

Overview
On --resume, the rich eval dashboard now treats kept on-disk rollouts as part of the run for progress and scoring stats, instead of showing only the rollouts re-executed in the current session.

resume.plan returns a new Baseline (count, summed headline reward, per-@reward/@metric sums) built while scanning results.jsonl for rows to keep. The runner passes that baseline into the dashboard. format_mean accepts optional base_sum/base_n so error-corrected and parenthesized global means include those kept rows. Progress shows kept+done / kept+total, and reward/err/breakdown reward-metric rows use the combined denominators. Usage and time in the breakdown stay session-only (no token/time re-derivation from disk). Non-resume runs are unchanged (empty baseline).

Adds tests/v1/test_resume_baseline.py for format_mean baseline folding and plan aggregation (including group-scored incomplete groups).

Reviewed by Cursor Bugbot for commit 3c764e2.

…louts

On `--resume`, the live dashboard built its progress counter and reward/err headline from only the owed (re-run) rollouts, so it showed e.g. `1246/9683` with a session-only reward, ignoring the rollouts already kept on disk.

`resume.plan` now also returns a `Baseline` over the kept rows (count + summed reward + per-component sums), computed in the parse it already does (so it's free). `format_mean` gains `base_sum`/`base_n` to fold those kept rows into both the error-corrected and the global mean. The eval dashboard threads the baseline through, so the counter (`kept+done / kept+total`), headline reward, err share, and the reward/metric breakdown cover the whole run.

Token/time totals stay session-scoped (kept rows weren't recomputed). Non-resume runs are unchanged (empty baseline).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-29T01:29:00Z

Approvability

Verdict: Needs human review

This PR introduces new user-facing dashboard behavior during resumed runs, with a new Baseline dataclass and modified calculations threaded through multiple functions. The new feature and runtime behavior changes warrant human review.

hallerite marked this pull request as ready for review June 29, 2026 01:26

hallerite wants to merge 1 commit into main from feat/resume-dashboard-totals
