feat(v1): resume dashboard counts the whole run, not just resumed rollouts #1892
Open
hallerite wants to merge 1 commit into
…louts
On `--resume`, the live dashboard built its progress counter and reward/err headline from only the owed (re-run) rollouts, so it showed e.g. `1246/9683` with a session-only reward, ignoring the rollouts already kept on disk.
`resume.plan` now also returns a `Baseline` over the kept rows (count + summed reward + per-component sums), computed in the parse it already does (so it's free). `format_mean` gains `base_sum`/`base_n` to fold those kept rows into both the error-corrected and the global mean. The eval dashboard threads the baseline through, so the counter (`kept+done / kept+total`), headline reward, err share, and the reward/metric breakdown cover the whole run.
Token/time totals stay session-scoped (kept rows weren't recomputed). Non-resume runs are unchanged (empty baseline).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Approvability
Verdict: Needs human review
This PR introduces new user-facing dashboard behavior during resumed runs, with a new …
What
On `--resume`, the live eval dashboard counted only the re-run rollouts, so it showed e.g. `1246/9683` with a session-only reward, ignoring the rollouts already kept on disk.
This makes the dashboard reflect the whole run: the counter shows `kept+done / kept+total`, and the headline reward / err share / reward-metric breakdown fold in the kept on-disk rows.
How
`resume.plan` now also returns a `Baseline` over the kept rows (count + summed reward + per-`@reward`/`@metric` sums), accumulated inside the parse it already does, so it's effectively free.
`format_mean` gains `base_sum`/`base_n` to seed both the error-corrected and the global (errored-as-0) mean.
The eval dashboard threads the baseline through `Progress` and `_breakdown`.
Token/time totals stay session-scoped (no `Trace` reconstruction for kept rows). Non-resume runs are unchanged (empty baseline).
Testing
`tests/v1/test_resume_baseline.py` (4 tests): baseline math, `plan` aggregation, the group-scored drop case.
`ruff check` + `ruff format` clean on touched files. (Pre-push `ty` not run locally: env not synced; CI covers it.)
🤖 Generated with Claude Code
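To make the `base_sum`/`base_n` folding concrete, here is a minimal sketch of the arithmetic only. It is not the PR's `format_mean`: the function name `fold_mean`, its return shape, and the use of `None` to mark an errored rollout are all assumptions for illustration. It assumes, as the description states, that kept rows are non-errored.

```python
from typing import Optional, Sequence


def fold_mean(
    rewards: Sequence[Optional[float]],
    base_sum: float = 0.0,
    base_n: int = 0,
) -> tuple[Optional[float], Optional[float]]:
    """Return (error_corrected_mean, global_mean) for session + kept rows.

    Hypothetical helper: `None` in `rewards` marks an errored rollout.
    Kept rows (base_sum, base_n) are assumed to be non-errored.
    """
    clean = [r for r in rewards if r is not None]
    total = base_sum + sum(clean)

    # Error-corrected mean: errored rollouts are left out of the denominator.
    clean_n = base_n + len(clean)
    # Global mean: errored rollouts count as reward 0 but stay in the denominator.
    global_n = base_n + len(rewards)

    clean_mean = total / clean_n if clean_n else None
    global_mean = total / global_n if global_n else None
    return clean_mean, global_mean
```

With the defaults (`base_sum=0.0`, `base_n=0`) the result depends only on the session rewards, which mirrors the claim that non-resume runs are unchanged.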
Note
Include kept rollouts from resumed runs in eval dashboard counts and means
Adds a `Baseline` dataclass in `resume.py` that aggregates non-errored kept rollouts (count, reward sum, per-key component sums); `resume.plan` now returns this alongside keep offsets and owed counts.
Extends `format_mean` to accept `base_sum`/`base_n` kwargs, folding prior rollouts into both the clean mean and the parenthesized global mean.
Updates the `Progress` and `_breakdown` components in `eval.py` to accept and apply the `Baseline`, so counts, error rates, rewards, and per-key metrics reflect the entire run rather than only the current session.
Callers that omit `base_sum`/`base_n` are unaffected.
Macroscope summarized 3c764e2.
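The aggregation described above could look roughly like the following sketch. The field names on this `Baseline`, the row keys (`reward`, `rewards`, `metrics`, `error`), and the helper `baseline_from_lines` are hypothetical; the actual dataclass in `resume.py` and the `results.jsonl` schema may differ.

```python
import json
from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class Baseline:
    """Running totals over kept, non-errored rows (illustrative field names)."""

    n: int = 0
    reward_sum: float = 0.0
    component_sums: dict[str, float] = field(default_factory=dict)

    def add(self, row: dict) -> None:
        self.n += 1
        self.reward_sum += float(row.get("reward", 0.0))
        # Sum each per-key component under a namespaced key.
        for group in ("rewards", "metrics"):
            for key, value in (row.get(group) or {}).items():
                name = f"{group}/{key}"
                self.component_sums[name] = (
                    self.component_sums.get(name, 0.0) + float(value)
                )


def baseline_from_lines(lines: Iterable[str]) -> Baseline:
    """Accumulate a Baseline in the same pass that parses each JSONL row."""
    base = Baseline()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row.get("error"):
            continue  # assumption: errored rows are not part of the baseline
        base.add(row)
    return base
```

Because the sums are updated while each row is already being parsed, no second pass over the file is needed, which is the sense in which the description calls the baseline effectively free.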
Note
Low Risk
Display and aggregation-only for the resume dashboard path; `format_mean` defaults preserve existing callers, and persisted results logic is unchanged aside from reading reward/metric dicts when planning keeps.
Overview
On `--resume`, the rich eval dashboard now treats kept on-disk rollouts as part of the run for progress and scoring stats, instead of showing only the rollouts re-executed in the current session.
`resume.plan` returns a new `Baseline` (count, summed headline reward, per-`@reward`/`@metric` sums) built while scanning `results.jsonl` for rows to keep. The runner passes that baseline into the dashboard.
`format_mean` accepts optional `base_sum`/`base_n` so error-corrected and parenthesized global means include those kept rows. Progress shows `kept+done / kept+total`, and reward/err/breakdown reward-metric rows use the combined denominators. Usage and time in the breakdown stay session-only (no token/time re-derivation from disk). Non-resume runs are unchanged (empty baseline).
Adds `tests/v1/test_resume_baseline.py` for `format_mean` baseline folding and `plan` aggregation (including group-scored incomplete groups).
Reviewed by Cursor Bugbot for commit 3c764e2.
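As a small illustration of the combined denominators, the sketch below computes the `kept+done / kept+total` counter and an err share over the whole run. The function names and the choice to report err share as a fraction are assumptions; the PR's `Progress` component may format these differently.

```python
def progress_label(done: int, total: int, kept: int = 0) -> str:
    """Counter over the whole run: kept rows count as already done."""
    return f"{kept + done}/{kept + total}"


def err_share(errors: int, done: int, kept: int = 0) -> float:
    """Fraction of finished rollouts that errored, with kept rows in the denominator.

    Assumes kept rows are non-errored, so they add to the denominator only.
    """
    finished = kept + done
    return errors / finished if finished else 0.0
```

With `kept=0` both helpers reduce to the session-only values, matching the statement that non-resume runs are unchanged.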