Skip to content

Option A: tag replay resume points via trace.info#1895

Draft
mikasenghaas wants to merge 10 commits into
mainfrom
exp/replay-node-tags-info
Draft

Option A: tag replay resume points via trace.info#1895
mikasenghaas wants to merge 10 commits into
mainfrom
exp/replay-node-tags-info

Conversation

@mikasenghaas

@mikasenghaas mikasenghaas commented Jun 29, 2026

Copy link
Copy Markdown
Member

Option A — replay buffer via trace.info tags (no core schema change)

Design comparison PR (draft, not for merge). Pairs with Option B (#1896), which is identical except it stores the same metadata as typed MessageNode fields. The two branches differ in only three filesreplay_common/selector.py (two accessors), the one tag-write line in the compact harness, and (B-only) verifiers/v1/graph.py — so the diff between #1895 and #1896 is the A/B decision.

What this is

A runnable skeleton of the replay-buffer slice: tag compaction resume points during generation, then turn old rollouts into new training tasks. Producer = the compact example env; consumer = a new replay distribution.

Four separate tasksets (pick one per env via --taskset.id)

  • replay_recheck ("try again") — seed the full rollout + an appended user turn (Check your work, fix if wrong) and re-roll.
  • replay_compaction_after — resume from a compaction message; the model continues solving.
  • replay_compaction_before — resume before a compaction; the model writes the compaction itself, then continues.
  • replay_judge — present the rollout's transcript and ask "was this correct? yes/no", graded against the original rollout's actual reward (original_reward > judge_threshold) — a self-supervised correctness label.

Each env selects exactly one task type — no mode-mixing.

Scoring

  • recheck / compaction_* reuse the original env's verifier (config.inner): score runs its rewards/metrics over the replay trace with the original task swapped in (e.g. a math verifier checks the replayed continuation's final answer against the original ground truth).
  • judge compares the model's yes/no verdict to the original reward (no inner, no sandbox).

Provenance (original task + reward) rides on the task (offline) or trace.info["replay"] (online). Compaction tags are authored from graph structure (no program sensor); recheck/judge use the structural final-answer leaf (tag-free, work on any rollout). Failed-tool-call tagging stays deprioritized (program.py reverted).

Structure — one package, five modules

environments/replay/ ships a shared replay_common library + the four selectable taskset modules. replay_common.base holds everything shared (BaseReplayTaskset/BaseReplayHarness parameterized by KIND, buffer sourcing, seeding, snapshot restore, scoring); each taskset module is ~3 lines fixing its KIND and bundling a harness (auto-selected via default_harness_id).

  • Offline (mode=offline): load_tasks materializes one task per KIND resume point from a buffer glob (.../rollouts/step_*/train_rollouts.jsonl), seeding task.prompt with the root→node prefix.
  • Online (mode=online): load_tasks returns pool_size virtual slots; the bundled harness samples that KIND from the live buffer per rollout (and serves offline too, seeding the same program from the task's materialized prompt). Sampling is keyed by task index, so a GRPO group replays the same source+point. Empty buffer → stop("replay_buffer_empty") (warmup).

Sandbox / exec replay (snapshot_ref)

  • Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py), durable-ref contract documented. Restore is wired (offline: taskset setup; online: harness). Capture (per-turn) needs a framework hook — deliberate follow-up; refs stay None until the sandbox-snapshot feature lands.

The Option-A choice

Tags + snapshot refs live in trace.info (node_tags/snapshots maps keyed by node_id). No change to MessageNode/Trace. Trade-off: untyped dict joined by node_id (safe across the dump→reload we control; fragile if the graph is re-derived). Reader isolated in replay_common/selector.py, so A→B is a two-function swap.

Files (+515)

  • verifiers/v1/runtimes/base.pysnapshot/restore stubs
  • environments/compact/compact/{annotate,harness}.py — find + tag compaction before/after into trace.info
  • environments/replay/replay_common/{base,selector}.py — shared base + the A reader
  • environments/replay/replay_{recheck,judge,compaction_after,compaction_before}/ — the four tasksets + pyproject.toml

🤖 Generated with Claude Code

Illustrative change for the replay-buffer design discussion: how a harness marks
nodes a replay buffer can resume from. Two things are tagged because the typed graph
can't express them — branch provenance (compaction) and tool failure status
(failed_tool_call). Tool identity / "is this a tool call" stay intrinsic to the graph.

Option A stores tags in trace.info["node_tags"] (no core schema change). The program
is the sensor for tool failures (the trace drops isError/exit); the harness is the
writer, tagging the finished graph post-launch. See compact/annotate.py for the reader.

Compare with Option B (typed MessageNode.kind field): exp/replay-node-tags-field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread environments/compact/compact/harness.py Outdated
…ng + ReplayTaskset (Option A)

Extends the Option-A node-tagging skeleton into a runnable replay-buffer slice:

- Drop the failed-tool-call path (deprioritized): program.py reverted; tags are now
  compaction-only and authored purely from graph structure (no program sensor).
- Two resume points per compaction: `compaction_after` (post-compaction branch start —
  continue from the compaction message) and `compaction_before` (pre-compaction branch
  leaf — the model regenerates the compaction, then continues). Tagged into trace.info.
- Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py) for exec/sandbox
  replay; durable-ref contract documented. Capture (per-turn) is a follow-up framework
  hook; restore is wired in ReplayTaskset.setup (no-op while refs are None).
- New `environments/replay` env: ReplayTaskset materializes one task per resume point from
  a buffer glob (offline), seeds task.prompt with the root->node prefix (run with the
  default harness), and stubs scoring (reuse the original verifier — TODO). Online/growing
  buffer noted as a rollout-time-sampling follow-up.

Option A stores tags/snapshot refs in trace.info (no core schema change). Compare with
Option B (typed MessageNode fields): exp/replay-node-tags-field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
children.setdefault(node.parent, []).append(nid)
starts: list[int] = []
for kids in children.values():
if len(kids) > 1:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium compact/annotate.py:32

compaction_after_nodes() collects every node that is a second-or-later child of a multi-child parent (kids[1:]), assuming all such children are post-compaction branch starts. When the graph deduplicates identical user messages by (parent, message_hash), a later turn that produces the same rewritten user prompt reuses the existing user node and attaches its assistant response as an additional child. This function mis-tags that later assistant node as a compaction_after resume point instead of the reused user node, so replay resumes from the wrong place and skips the compaction prompt.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/compact/compact/annotate.py around line 32:

`compaction_after_nodes()` collects every node that is a second-or-later child of a multi-child parent (`kids[1:]`), assuming all such children are post-compaction branch starts. When the graph deduplicates identical user messages by `(parent, message_hash)`, a later turn that produces the same rewritten user prompt reuses the existing user node and attaches its assistant response as an additional child. This function mis-tags that later assistant node as a `compaction_after` resume point instead of the reused user node, so replay resumes from the wrong place and skips the compaction prompt.

is the turn whose output was summarized into the next branch's compaction message, so
resuming there puts the model right before it writes a compaction."""
leaves = sorted(graph.leaves(trace))
return leaves[:-1] if len(leaves) > 1 else []

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium compact/annotate.py:42

compaction_before_nodes() unconditionally drops the highest-numbered leaf (leaves[:-1]), assuming it is the final-answer branch. On failed or interrupted rollouts there is no final-answer leaf, so the function silently omits the last real pre-compaction leaf. Replay selectors built from the returned tags will miss a valid resume point on crashed/timed-out traces. Consider checking the run status before dropping the last leaf, or document the assumption that this function is only called on successful rollouts.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/compact/compact/annotate.py around line 42:

`compaction_before_nodes()` unconditionally drops the highest-numbered leaf (`leaves[:-1]`), assuming it is the final-answer branch. On failed or interrupted rollouts there is no final-answer leaf, so the function silently omits the last real pre-compaction leaf. Replay selectors built from the returned tags will miss a valid resume point on crashed/timed-out traces. Consider checking the run status before dropping the last leaf, or document the assumption that this function is only called on successful rollouts.

…me sampling (Option A)

- Scoring delegation: ReplayTaskset.score now reuses the ORIGINAL env's verifier
  (config.inner) by running its rewards/metrics over the replay trace with the original
  task swapped in. Original task + reward ride on the replay task (offline) or in
  trace.info["replay"] (online). Empty inner.id => no-op (skeleton-safe).
- Online buffer: config.mode="online" returns pool_size virtual slots; new ReplayHarness
  samples a stored trace + resume point from the LIVE buffer per rollout (re-globs each
  time), restores the snapshot, seeds the default chat loop with the root->node prefix via
  INITIAL_MESSAGES, and stashes provenance. Empty buffer => stop("replay_buffer_empty")
  (warmup). Offline path (materialize in load_tasks) unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread environments/replay/replay/taskset.py Outdated
Comment thread environments/replay/replay/harness.py Outdated
Comment thread environments/replay/replay/harness.py Outdated
The 'check your work, fix if wrong' mode: take the full sampled rollout, append a user
turn (config.followup), and re-roll — scored by the same original verifier as the rest.

- Structural, tag-free: the recheck point is the rollout's final-answer leaf, so it works
  for ANY rollout (linear or compacting) and is identical across A/B (no tag dependency).
- selector: recheck_points() + build_seed() (appends the follow-up user turn for recheck;
  plain prefix otherwise). resume_points() now yields recheck alongside compaction kinds.
- taskset + harness: new `followup` config; `recheck` added to default `kinds`; both build
  seeds via build_seed so offline and online modes get the appended turn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread environments/replay/replay/harness.py Outdated
mikasenghaas and others added 2 commits June 30, 2026 00:28
…de (Option A)

Cleanup:
- Shared DEFAULT_KINDS / DEFAULT_FOLLOWUP constants in selector.py, referenced by both the
  taskset and harness configs (was duplicated literals).
- Buffer reading centralized as selector.iter_traces; taskset drops its glob/json copy.
- resume_points folds recheck/judge final-leaf points in one place.

Judge mode (config: add "judge" to kinds):
- A judge replay point presents the rollout's transcript ("was this correct? yes/no")
  instead of continuing it (selector.judge_prompt / build_seed).
- ReplayTaskset.score grades the model's verdict against the original rollout's reward
  (original_reward > judge_threshold) — a self-supervised correctness label; no inner
  verifier or snapshot needed. recheck/compaction_* still reuse the original verifier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… sandbox needed)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread environments/replay/replay/taskset.py Outdated
Comment thread environments/replay/replay_common/selector.py Outdated
Comment thread environments/replay/replay/taskset.py Outdated
mikasenghaas and others added 2 commits June 30, 2026 00:40
…tion A)

Was a single ReplayTaskset with a `kinds` list; now each mode is its own selectable
taskset (taskset id), so an env config picks exactly one task type — no mode mixing.

Structure: one distribution (`environments/replay`) ships five top-level modules —
- replay_common: shared base (BaseReplayTaskset/BaseReplayHarness parameterized by KIND,
  buffer sourcing offline+online, seeding, snapshot restore, scoring) + selector.
- replay_recheck / replay_judge / replay_compaction_after / replay_compaction_before:
  thin modules each fixing KIND and bundling a harness (auto-selected via
  default_harness_id). recheck/compaction_* reuse the original verifier; judge grades the
  verdict against the original reward.

The bundled harness handles both modes: offline (materialized task.prompt -> default chat
loop) and online (sample this KIND from the live buffer). selector is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

def _sample(self, rng: random.Random) -> tuple[Trace, dict] | None:
"""Scan the live buffer in random order; return the first (trace, ``KIND`` point) found."""
files = sorted(glob.glob(self.config.buffer_glob))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium replay_common/base.py:203

_sample() shuffles the file list and returns the first matching rollout from the first file that has any, so a file with one matching rollout is as likely to be picked as a file with hundreds. This skews online sampling toward sparse files instead of giving each resume point in the buffer equal probability. Consider collecting all matching points across all files (or using a two-stage draw weighted by per-file point counts) before choosing one.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/replay/replay_common/base.py around line 203:

`_sample()` shuffles the file list and returns the first matching rollout from the first file that has any, so a file with one matching rollout is as likely to be picked as a file with hundreds. This skews online sampling toward sparse files instead of giving each resume point in the buffer equal probability. Consider collecting all matching points across all files (or using a two-stage draw weighted by per-file point counts) before choosing one.

mikasenghaas and others added 2 commits June 30, 2026 22:57
…ion A)

- Bug: BaseReplayHarness subclassed DefaultHarness without parameterizing
  Harness[ReplayHarnessConfig], so harness_config_type resolved to DefaultHarnessConfig
  (no buffer_glob/followup) -> self.config.buffer_glob would AttributeError in the online
  path. Now subclasses Harness[ReplayHarnessConfig] directly and seeds the default program
  itself (one unified offline/online seed path via _resolve_seed), so the config type and
  the --harness.buffer_glob/followup fields resolve correctly.
- resume_points: single get_tag per node instead of two.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…groups (Option A)

Was seeded by trace.id (unique per rollout), so the group_size rollouts of one group each
sampled a different source rollout — GRPO's group-relative baseline would then compare across
different problems. Seed by trace.task.idx (shared across a group) so all rollouts of a group
replay the same source+point. Freshness preserved by the growing buffer. Offline unaffected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant