feat(v1): resume dashboard counts the whole run, not just resumed rollouts #1892

Open
hallerite wants to merge 1 commit into main from feat/resume-dashboard-totals

hallerite (Member) commented Jun 29, 2026

What

On --resume, the live eval dashboard counted only the re-run rollouts, so it showed e.g. 1246/9683 with a session-only reward — ignoring the rollouts already kept on disk.

This makes the dashboard reflect the whole run: the counter shows kept+done / kept+total, and the headline reward / err share / reward-metric breakdown fold in the kept on-disk rows.
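As a rough illustration of the counter change described above, here is a minimal sketch. The helper names `progress_label` and `err_share` are hypothetical and do not come from the PR, and the err share assumes that kept rows count as non-errored.

```python
def progress_label(done: int, total: int, kept: int = 0) -> str:
    """Format a whole-run counter as 'kept+done/kept+total'.

    Hypothetical helper: `kept` is the number of rollouts already on disk,
    `done`/`total` are the rollouts finished/owed in the current session.
    """
    return f"{kept + done}/{kept + total}"


def err_share(session_errors: int, done: int, kept: int = 0) -> float:
    """Share of errored rollouts over the whole run.

    Assumption: kept rows are non-errored, so they only grow the denominator.
    """
    denom = kept + done
    return session_errors / denom if denom else 0.0
```

With `kept=0` (a non-resume run) both helpers reduce to the session-only values, which matches the "empty baseline" behaviour described in this PR.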

How

  • resume.plan now also returns a Baseline over the kept rows (count + summed reward + per-@reward/@metric sums), accumulated inside the parse it already does — so it's effectively free.
  • format_mean gains base_sum/base_n to seed both the error-corrected and the global (errored-as-0) mean.
  • The dashboard threads the baseline through Progress and _breakdown.
  • Token/time stay session-scoped (kept rows weren't recomputed; re-deriving their tokens would need full Trace reconstruction). Non-resume runs are unchanged (empty baseline).
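The baseline seeding of both means can be sketched as follows. This is not the PR's `format_mean`; `seeded_means` is a hypothetical stand-in that only reuses the `base_sum`/`base_n` parameter names, and it assumes errored rollouts contribute 0 to the global mean and that kept rows are non-errored.

```python
from typing import Optional, Sequence, Tuple


def seeded_means(
    rewards: Sequence[float],
    errored: Sequence[bool],
    base_sum: float = 0.0,
    base_n: int = 0,
) -> Tuple[Optional[float], Optional[float]]:
    """Return (error_corrected_mean, global_mean) for session + kept rows.

    `rewards[i]` is ignored when `errored[i]` is true (it counts as 0).
    `base_sum`/`base_n` are the summed reward and count of kept rows.
    """
    clean = [r for r, e in zip(rewards, errored) if not e]
    total_sum = base_sum + sum(clean)
    clean_n = base_n + len(clean)      # denominator without errored rollouts
    total_n = base_n + len(rewards)    # denominator with errored rollouts as 0
    clean_mean = total_sum / clean_n if clean_n else None
    global_mean = total_sum / total_n if total_n else None
    return clean_mean, global_mean
```

Leaving `base_sum` and `base_n` at their defaults gives the session-only means, which is why callers that do not pass a baseline are unaffected.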

Testing

  • tests/v1/test_resume_baseline.py (4 tests): baseline math, plan aggregation, the group-scored drop case.
  • ruff check + ruff format clean on touched files. (The pre-push ty check was not run locally because the env was not synced; CI covers it.)

🤖 Generated with Claude Code

Note

Include kept rollouts from resumed runs in eval dashboard counts and means

  • Introduces a Baseline dataclass in resume.py that aggregates non-errored kept rollouts (count, reward sum, per-key component sums); resume.plan now returns this alongside keep offsets and owed counts.
  • Updates format_mean to accept base_sum/base_n kwargs, folding prior rollouts into both the clean mean and the parenthesized global mean.
  • Updates the dashboard's Progress and _breakdown components in eval.py to accept and apply the Baseline, so counts, error rates, rewards, and per-key metrics reflect the entire run rather than only the current session.
  • Behavioral Change: resumed run dashboards now show cumulative statistics; callers that do not pass base_sum/base_n are unaffected.

Macroscope summarized 3c764e2.


Note

Low Risk
Display and aggregation-only for the resume dashboard path; format_mean defaults preserve existing callers, and persisted results logic is unchanged aside from reading reward/metric dicts when planning keeps.

Overview
On --resume, the rich eval dashboard now treats kept on-disk rollouts as part of the run for progress and scoring stats, instead of showing only the rollouts re-executed in the current session.

resume.plan returns a new Baseline (count, summed headline reward, per-@reward/@metric sums) built while scanning results.jsonl for rows to keep. The runner passes that baseline into the dashboard. format_mean accepts optional base_sum/base_n so error-corrected and parenthesized global means include those kept rows. Progress shows kept+done / kept+total, and reward/err/breakdown reward-metric rows use the combined denominators. Usage and time in the breakdown stay session-only (no token/time re-derivation from disk). Non-resume runs are unchanged (empty baseline).
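A minimal sketch of accumulating such a baseline in the same pass that scans the results file might look like the following. The field names (`n`, `reward_sum`, `component_sums`) and the row keys (`reward`, `components`, `error`) are assumptions for illustration; the actual `Baseline` dataclass and the `results.jsonl` schema are not shown in this PR description.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Baseline:
    """Running totals over kept rows (hypothetical field names)."""

    n: int = 0
    reward_sum: float = 0.0
    component_sums: Dict[str, float] = field(default_factory=dict)

    def add(self, row: dict) -> None:
        self.n += 1
        self.reward_sum += float(row.get("reward", 0.0))
        for key, value in row.get("components", {}).items():
            self.component_sums[key] = self.component_sums.get(key, 0.0) + float(value)


def baseline_from_lines(lines: Iterable[str]) -> Baseline:
    """Fold non-errored JSONL rows into a Baseline in a single pass."""
    base = Baseline()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row = json.loads(line)
        if row.get("error"):
            continue  # assumed: errored rows are not kept
        base.add(row)
    return base
```

Because the totals are updated while each row is already being parsed, the baseline needs no second read of the file; an input with no kept rows yields the empty baseline used for non-resume runs.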

Adds tests/v1/test_resume_baseline.py for format_mean baseline folding and plan aggregation (including group-scored incomplete groups).

Reviewed by Cursor Bugbot for commit 3c764e2.

…louts

On `--resume`, the live dashboard built its progress counter and reward/err
headline from only the owed (re-run) rollouts, so it showed e.g. `1246/9683`
with a session-only reward — ignoring the rollouts already kept on disk.

`resume.plan` now also returns a `Baseline` over the kept rows (count + summed
reward + per-component sums), computed in the parse it already does (so it's
free). `format_mean` gains `base_sum`/`base_n` to fold those kept rows into both
the error-corrected and the global mean. The eval dashboard threads the baseline
through, so the counter (`kept+done / kept+total`), headline reward, err share,
and the reward/metric breakdown cover the whole run. Token/time totals stay
session-scoped (kept rows weren't recomputed). Non-resume runs are unchanged
(empty baseline).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@hallerite hallerite marked this pull request as ready for review June 29, 2026 01:26
macroscopeapp (Bot) commented Jun 29, 2026

Approvability

Verdict: Needs human review

This PR introduces new user-facing dashboard behavior during resumed runs, with a new Baseline dataclass and modified calculations threaded through multiple functions. The new feature and runtime behavior changes warrant human review.

