Option A: tag replay resume points via trace.info#1895
Conversation
Illustrative change for the replay-buffer design discussion: how a harness marks nodes a replay buffer can resume from. Two things are tagged because the typed graph can't express them — branch provenance (compaction) and tool failure status (failed_tool_call). Tool identity / "is this a tool call" stay intrinsic to the graph. Option A stores tags in trace.info["node_tags"] (no core schema change). The program is the sensor for tool failures (the trace drops isError/exit); the harness is the writer, tagging the finished graph post-launch. See compact/annotate.py for the reader. Compare with Option B (typed MessageNode.kind field): exp/replay-node-tags-field. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ng + ReplayTaskset (Option A) Extends the Option-A node-tagging skeleton into a runnable replay-buffer slice: - Drop the failed-tool-call path (deprioritized): program.py reverted; tags are now compaction-only and authored purely from graph structure (no program sensor). - Two resume points per compaction: `compaction_after` (post-compaction branch start — continue from the compaction message) and `compaction_before` (pre-compaction branch leaf — the model regenerates the compaction, then continues). Tagged into trace.info. - Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py) for exec/sandbox replay; durable-ref contract documented. Capture (per-turn) is a follow-up framework hook; restore is wired in ReplayTaskset.setup (no-op while refs are None). - New `environments/replay` env: ReplayTaskset materializes one task per resume point from a buffer glob (offline), seeds task.prompt with the root->node prefix (run with the default harness), and stubs scoring (reuse the original verifier — TODO). Online/growing buffer noted as a rollout-time-sampling follow-up. Option A stores tags/snapshot refs in trace.info (no core schema change). Compare with Option B (typed MessageNode fields): exp/replay-node-tags-field. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| children.setdefault(node.parent, []).append(nid) | ||
| starts: list[int] = [] | ||
| for kids in children.values(): | ||
| if len(kids) > 1: |
There was a problem hiding this comment.
🟡 Medium compact/annotate.py:32
compaction_after_nodes() collects every node that is a second-or-later child of a multi-child parent (kids[1:]), assuming all such children are post-compaction branch starts. When the graph deduplicates identical user messages by (parent, message_hash), a later turn that produces the same rewritten user prompt reuses the existing user node and attaches its assistant response as an additional child. This function mis-tags that later assistant node as a compaction_after resume point instead of the reused user node, so replay resumes from the wrong place and skips the compaction prompt.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/compact/compact/annotate.py around line 32:
`compaction_after_nodes()` collects every node that is a second-or-later child of a multi-child parent (`kids[1:]`), assuming all such children are post-compaction branch starts. When the graph deduplicates identical user messages by `(parent, message_hash)`, a later turn that produces the same rewritten user prompt reuses the existing user node and attaches its assistant response as an additional child. This function mis-tags that later assistant node as a `compaction_after` resume point instead of the reused user node, so replay resumes from the wrong place and skips the compaction prompt.
| is the turn whose output was summarized into the next branch's compaction message, so | ||
| resuming there puts the model right before it writes a compaction.""" | ||
| leaves = sorted(graph.leaves(trace)) | ||
| return leaves[:-1] if len(leaves) > 1 else [] |
There was a problem hiding this comment.
🟡 Medium compact/annotate.py:42
compaction_before_nodes() unconditionally drops the highest-numbered leaf (leaves[:-1]), assuming it is the final-answer branch. On failed or interrupted rollouts there is no final-answer leaf, so the function silently omits the last real pre-compaction leaf. Replay selectors built from the returned tags will miss a valid resume point on crashed/timed-out traces. Consider checking the run status before dropping the last leaf, or document the assumption that this function is only called on successful rollouts.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/compact/compact/annotate.py around line 42:
`compaction_before_nodes()` unconditionally drops the highest-numbered leaf (`leaves[:-1]`), assuming it is the final-answer branch. On failed or interrupted rollouts there is no final-answer leaf, so the function silently omits the last real pre-compaction leaf. Replay selectors built from the returned tags will miss a valid resume point on crashed/timed-out traces. Consider checking the run status before dropping the last leaf, or document the assumption that this function is only called on successful rollouts.
…me sampling (Option A)
- Scoring delegation: ReplayTaskset.score now reuses the ORIGINAL env's verifier
(config.inner) by running its rewards/metrics over the replay trace with the original
task swapped in. Original task + reward ride on the replay task (offline) or in
trace.info["replay"] (online). Empty inner.id => no-op (skeleton-safe).
- Online buffer: config.mode="online" returns pool_size virtual slots; new ReplayHarness
samples a stored trace + resume point from the LIVE buffer per rollout (re-globs each
time), restores the snapshot, seeds the default chat loop with the root->node prefix via
INITIAL_MESSAGES, and stashes provenance. Empty buffer => stop("replay_buffer_empty")
(warmup). Offline path (materialize in load_tasks) unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 'check your work, fix if wrong' mode: take the full sampled rollout, append a user turn (config.followup), and re-roll — scored by the same original verifier as the rest. - Structural, tag-free: the recheck point is the rollout's final-answer leaf, so it works for ANY rollout (linear or compacting) and is identical across A/B (no tag dependency). - selector: recheck_points() + build_seed() (appends the follow-up user turn for recheck; plain prefix otherwise). resume_points() now yields recheck alongside compaction kinds. - taskset + harness: new `followup` config; `recheck` added to default `kinds`; both build seeds via build_seed so offline and online modes get the appended turn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…de (Option A)
Cleanup:
- Shared DEFAULT_KINDS / DEFAULT_FOLLOWUP constants in selector.py, referenced by both the
taskset and harness configs (was duplicated literals).
- Buffer reading centralized as selector.iter_traces; taskset drops its glob/json copy.
- resume_points folds recheck/judge final-leaf points in one place.
Judge mode (config: add "judge" to kinds):
- A judge replay point presents the rollout's transcript ("was this correct? yes/no")
instead of continuing it (selector.judge_prompt / build_seed).
- ReplayTaskset.score grades the model's verdict against the original rollout's reward
(original_reward > judge_threshold) — a self-supervised correctness label; no inner
verifier or snapshot needed. recheck/compaction_* still reuse the original verifier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… sandbox needed) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion A) Was a single ReplayTaskset with a `kinds` list; now each mode is its own selectable taskset (taskset id), so an env config picks exactly one task type — no mode mixing. Structure: one distribution (`environments/replay`) ships five top-level modules — - replay_common: shared base (BaseReplayTaskset/BaseReplayHarness parameterized by KIND, buffer sourcing offline+online, seeding, snapshot restore, scoring) + selector. - replay_recheck / replay_judge / replay_compaction_after / replay_compaction_before: thin modules each fixing KIND and bundling a harness (auto-selected via default_harness_id). recheck/compaction_* reuse the original verifier; judge grades the verdict against the original reward. The bundled harness handles both modes: offline (materialized task.prompt -> default chat loop) and online (sample this KIND from the live buffer). selector is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
||
| def _sample(self, rng: random.Random) -> tuple[Trace, dict] | None: | ||
| """Scan the live buffer in random order; return the first (trace, ``KIND`` point) found.""" | ||
| files = sorted(glob.glob(self.config.buffer_glob)) |
There was a problem hiding this comment.
🟡 Medium replay_common/base.py:203
_sample() shuffles the file list and returns the first matching rollout from the first file that has any, so a file with one matching rollout is as likely to be picked as a file with hundreds. This skews online sampling toward sparse files instead of giving each resume point in the buffer equal probability. Consider collecting all matching points across all files (or using a two-stage draw weighted by per-file point counts) before choosing one.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/replay/replay_common/base.py around line 203:
`_sample()` shuffles the file list and returns the first matching rollout from the first file that has any, so a file with one matching rollout is as likely to be picked as a file with hundreds. This skews online sampling toward sparse files instead of giving each resume point in the buffer equal probability. Consider collecting all matching points across all files (or using a two-stage draw weighted by per-file point counts) before choosing one.
…ion A) - Bug: BaseReplayHarness subclassed DefaultHarness without parameterizing Harness[ReplayHarnessConfig], so harness_config_type resolved to DefaultHarnessConfig (no buffer_glob/followup) -> self.config.buffer_glob would AttributeError in the online path. Now subclasses Harness[ReplayHarnessConfig] directly and seeds the default program itself (one unified offline/online seed path via _resolve_seed), so the config type and the --harness.buffer_glob/followup fields resolve correctly. - resume_points: single get_tag per node instead of two. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…groups (Option A) Was seeded by trace.id (unique per rollout), so the group_size rollouts of one group each sampled a different source rollout — GRPO's group-relative baseline would then compare across different problems. Seed by trace.task.idx (shared across a group) so all rollouts of a group replay the same source+point. Freshness preserved by the growing buffer. Offline unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Option A — replay buffer via
trace.infotags (no core schema change)Design comparison PR (draft, not for merge). Pairs with Option B (#1896), which is identical except it stores the same metadata as typed
MessageNodefields. The two branches differ in only three files —replay_common/selector.py(two accessors), the one tag-write line in the compact harness, and (B-only)verifiers/v1/graph.py— so the diff between #1895 and #1896 is the A/B decision.What this is
A runnable skeleton of the replay-buffer slice: tag compaction resume points during generation, then turn old rollouts into new training tasks. Producer = the
compactexample env; consumer = a newreplaydistribution.Four separate tasksets (pick one per env via
--taskset.id)replay_recheck("try again") — seed the full rollout + an appended user turn (Check your work, fix if wrong) and re-roll.replay_compaction_after— resume from a compaction message; the model continues solving.replay_compaction_before— resume before a compaction; the model writes the compaction itself, then continues.replay_judge— present the rollout's transcript and ask "was this correct? yes/no", graded against the original rollout's actual reward (original_reward > judge_threshold) — a self-supervised correctness label.Each env selects exactly one task type — no mode-mixing.
Scoring
recheck/compaction_*reuse the original env's verifier (config.inner):scoreruns its rewards/metrics over the replay trace with the original task swapped in (e.g. a math verifier checks the replayed continuation's final answer against the original ground truth).judgecompares the model's yes/no verdict to the original reward (noinner, no sandbox).Provenance (original task + reward) rides on the task (offline) or
trace.info["replay"](online). Compaction tags are authored from graph structure (no program sensor);recheck/judgeuse the structural final-answer leaf (tag-free, work on any rollout). Failed-tool-call tagging stays deprioritized (program.pyreverted).Structure — one package, five modules
environments/replay/ships a sharedreplay_commonlibrary + the four selectable taskset modules.replay_common.baseholds everything shared (BaseReplayTaskset/BaseReplayHarnessparameterized byKIND, buffer sourcing, seeding, snapshot restore, scoring); each taskset module is ~3 lines fixing itsKINDand bundling a harness (auto-selected viadefault_harness_id).mode=offline):load_tasksmaterializes one task perKINDresume point from a buffer glob (.../rollouts/step_*/train_rollouts.jsonl), seedingtask.promptwith theroot→nodeprefix.mode=online):load_tasksreturnspool_sizevirtual slots; the bundled harness samples thatKINDfrom the live buffer per rollout (and serves offline too, seeding the same program from the task's materialized prompt). Sampling is keyed by task index, so a GRPO group replays the same source+point. Empty buffer →stop("replay_buffer_empty")(warmup).Sandbox / exec replay (
snapshot_ref)Runtime.snapshot()/restore()stubs (verifiers/v1/runtimes/base.py), durable-ref contract documented. Restore is wired (offline: tasksetsetup; online: harness). Capture (per-turn) needs a framework hook — deliberate follow-up; refs stayNoneuntil the sandbox-snapshot feature lands.The Option-A choice
Tags + snapshot refs live in
trace.info(node_tags/snapshotsmaps keyed bynode_id). No change toMessageNode/Trace. Trade-off: untyped dict joined bynode_id(safe across the dump→reload we control; fragile if the graph is re-derived). Reader isolated inreplay_common/selector.py, so A→B is a two-function swap.Files (+515)
verifiers/v1/runtimes/base.py—snapshot/restorestubsenvironments/compact/compact/{annotate,harness}.py— find + tag compaction before/after intotrace.infoenvironments/replay/replay_common/{base,selector}.py— shared base + the A readerenvironments/replay/replay_{recheck,judge,compaction_after,compaction_before}/— the four tasksets +pyproject.toml🤖 Generated with Claude Code