Option A: tag replay resume points via trace.info by mikasenghaas · Pull Request #1895 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-06-29T21:59:09Z

Option A — replay buffer via `trace.info` tags (no core schema change)

Design comparison PR (draft, not for merge). Pairs with Option B (#1896), which is identical except it stores the same metadata as typed MessageNode fields. The two branches differ in only three files — replay_common/selector.py (two accessors), the one tag-write line in the compact harness, and (B-only) verifiers/v1/graph.py — so the diff between #1895 and #1896 is the A/B decision.

What this is

A runnable skeleton of the replay-buffer slice: tag compaction resume points during generation, then turn old rollouts into new training tasks. Producer = the compact example env; consumer = a new replay distribution.

Four separate tasksets (pick one per env via `--taskset.id`)

replay_recheck ("try again") — seed the full rollout + an appended user turn (Check your work, fix if wrong) and re-roll.
replay_compaction_after — resume from a compaction message; the model continues solving.
replay_compaction_before — resume before a compaction; the model writes the compaction itself, then continues.
replay_judge — present the rollout's transcript and ask "was this correct? yes/no", graded against the original rollout's actual reward (original_reward > judge_threshold) — a self-supervised correctness label.

Each env selects exactly one task type — no mode-mixing.

Scoring

recheck / compaction_* reuse the original env's verifier (config.inner): score runs its rewards/metrics over the replay trace with the original task swapped in (e.g. a math verifier checks the replayed continuation's final answer against the original ground truth).
judge compares the model's yes/no verdict to the original reward (no inner, no sandbox).

Provenance (original task + reward) rides on the task (offline) or trace.info["replay"] (online). Compaction tags are authored from graph structure (no program sensor); recheck/judge use the structural final-answer leaf (tag-free, work on any rollout). Failed-tool-call tagging stays deprioritized (program.py reverted).

Structure — one package, five modules

environments/replay/ ships a shared replay_common library + the four selectable taskset modules. replay_common.base holds everything shared (BaseReplayTaskset/BaseReplayHarness parameterized by KIND, buffer sourcing, seeding, snapshot restore, scoring); each taskset module is ~3 lines fixing its KIND and bundling a harness (auto-selected via default_harness_id).

Offline (mode=offline): load_tasks materializes one task per KIND resume point from a buffer glob (.../rollouts/step_*/train_rollouts.jsonl), seeding task.prompt with the root→node prefix.
Online (mode=online): load_tasks returns pool_size virtual slots; the bundled harness samples that KIND from the live buffer per rollout (and serves offline too, seeding the same program from the task's materialized prompt). Sampling is keyed by task index, so a GRPO group replays the same source+point. Empty buffer → stop("replay_buffer_empty") (warmup).

Sandbox / exec replay (`snapshot_ref`)

Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py), durable-ref contract documented. Restore is wired (offline: taskset setup; online: harness). Capture (per-turn) needs a framework hook — deliberate follow-up; refs stay None until the sandbox-snapshot feature lands.

The Option-A choice

Tags + snapshot refs live in trace.info (node_tags/snapshots maps keyed by node_id). No change to MessageNode/Trace. Trade-off: untyped dict joined by node_id (safe across the dump→reload we control; fragile if the graph is re-derived). Reader isolated in replay_common/selector.py, so A→B is a two-function swap.

Files (+515)

verifiers/v1/runtimes/base.py — snapshot/restore stubs
environments/compact/compact/{annotate,harness}.py — find + tag compaction before/after into trace.info
environments/replay/replay_common/{base,selector}.py — shared base + the A reader
environments/replay/replay_{recheck,judge,compaction_after,compaction_before}/ — the four tasksets + pyproject.toml

🤖 Generated with Claude Code

Illustrative change for the replay-buffer design discussion: how a harness marks nodes a replay buffer can resume from. Two things are tagged because the typed graph can't express them — branch provenance (compaction) and tool failure status (failed_tool_call). Tool identity / "is this a tool call" stay intrinsic to the graph. Option A stores tags in trace.info["node_tags"] (no core schema change). The program is the sensor for tool failures (the trace drops isError/exit); the harness is the writer, tagging the finished graph post-launch. See compact/annotate.py for the reader. Compare with Option B (typed MessageNode.kind field): exp/replay-node-tags-field. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ng + ReplayTaskset (Option A) Extends the Option-A node-tagging skeleton into a runnable replay-buffer slice: - Drop the failed-tool-call path (deprioritized): program.py reverted; tags are now compaction-only and authored purely from graph structure (no program sensor). - Two resume points per compaction: `compaction_after` (post-compaction branch start — continue from the compaction message) and `compaction_before` (pre-compaction branch leaf — the model regenerates the compaction, then continues). Tagged into trace.info. - Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py) for exec/sandbox replay; durable-ref contract documented. Capture (per-turn) is a follow-up framework hook; restore is wired in ReplayTaskset.setup (no-op while refs are None). - New `environments/replay` env: ReplayTaskset materializes one task per resume point from a buffer glob (offline), seeds task.prompt with the root->node prefix (run with the default harness), and stubs scoring (reuse the original verifier — TODO). Online/growing buffer noted as a rollout-time-sampling follow-up. Option A stores tags/snapshot refs in trace.info (no core schema change). Compare with Option B (typed MessageNode fields): exp/replay-node-tags-field. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-29T23:53:02Z

+        children.setdefault(node.parent, []).append(nid)
+    starts: list[int] = []
+    for kids in children.values():
+        if len(kids) > 1:


🟡 Medium compact/annotate.py:32

compaction_after_nodes() collects every node that is a second-or-later child of a multi-child parent (kids[1:]), assuming all such children are post-compaction branch starts. When the graph deduplicates identical user messages by (parent, message_hash), a later turn that produces the same rewritten user prompt reuses the existing user node and attaches its assistant response as an additional child. This function mis-tags that later assistant node as a compaction_after resume point instead of the reused user node, so replay resumes from the wrong place and skips the compaction prompt.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file @environments/compact/compact/annotate.py around line 32: `compaction_after_nodes()` collects every node that is a second-or-later child of a multi-child parent (`kids[1:]`), assuming all such children are post-compaction branch starts. When the graph deduplicates identical user messages by `(parent, message_hash)`, a later turn that produces the same rewritten user prompt reuses the existing user node and attaches its assistant response as an additional child. This function mis-tags that later assistant node as a `compaction_after` resume point instead of the reused user node, so replay resumes from the wrong place and skips the compaction prompt.

macroscopeapp · 2026-06-29T23:53:02Z

+    is the turn whose output was summarized into the next branch's compaction message, so
+    resuming there puts the model right before it writes a compaction."""
+    leaves = sorted(graph.leaves(trace))
+    return leaves[:-1] if len(leaves) > 1 else []


🟡 Medium compact/annotate.py:42

compaction_before_nodes() unconditionally drops the highest-numbered leaf (leaves[:-1]), assuming it is the final-answer branch. On failed or interrupted rollouts there is no final-answer leaf, so the function silently omits the last real pre-compaction leaf. Replay selectors built from the returned tags will miss a valid resume point on crashed/timed-out traces. Consider checking the run status before dropping the last leaf, or document the assumption that this function is only called on successful rollouts.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file @environments/compact/compact/annotate.py around line 42: `compaction_before_nodes()` unconditionally drops the highest-numbered leaf (`leaves[:-1]`), assuming it is the final-answer branch. On failed or interrupted rollouts there is no final-answer leaf, so the function silently omits the last real pre-compaction leaf. Replay selectors built from the returned tags will miss a valid resume point on crashed/timed-out traces. Consider checking the run status before dropping the last leaf, or document the assumption that this function is only called on successful rollouts.

…me sampling (Option A) - Scoring delegation: ReplayTaskset.score now reuses the ORIGINAL env's verifier (config.inner) by running its rewards/metrics over the replay trace with the original task swapped in. Original task + reward ride on the replay task (offline) or in trace.info["replay"] (online). Empty inner.id => no-op (skeleton-safe). - Online buffer: config.mode="online" returns pool_size virtual slots; new ReplayHarness samples a stored trace + resume point from the LIVE buffer per rollout (re-globs each time), restores the snapshot, seeds the default chat loop with the root->node prefix via INITIAL_MESSAGES, and stashes provenance. Empty buffer => stop("replay_buffer_empty") (warmup). Offline path (materialize in load_tasks) unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The 'check your work, fix if wrong' mode: take the full sampled rollout, append a user turn (config.followup), and re-roll — scored by the same original verifier as the rest. - Structural, tag-free: the recheck point is the rollout's final-answer leaf, so it works for ANY rollout (linear or compacting) and is identical across A/B (no tag dependency). - selector: recheck_points() + build_seed() (appends the follow-up user turn for recheck; plain prefix otherwise). resume_points() now yields recheck alongside compaction kinds. - taskset + harness: new `followup` config; `recheck` added to default `kinds`; both build seeds via build_seed so offline and online modes get the appended turn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…de (Option A) Cleanup: - Shared DEFAULT_KINDS / DEFAULT_FOLLOWUP constants in selector.py, referenced by both the taskset and harness configs (was duplicated literals). - Buffer reading centralized as selector.iter_traces; taskset drops its glob/json copy. - resume_points folds recheck/judge final-leaf points in one place. Judge mode (config: add "judge" to kinds): - A judge replay point presents the rollout's transcript ("was this correct? yes/no") instead of continuing it (selector.judge_prompt / build_seed). - ReplayTaskset.score grades the model's verdict against the original rollout's reward (original_reward > judge_threshold) — a self-supervised correctness label; no inner verifier or snapshot needed. recheck/compaction_* still reuse the original verifier. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… sandbox needed) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tion A) Was a single ReplayTaskset with a `kinds` list; now each mode is its own selectable taskset (taskset id), so an env config picks exactly one task type — no mode mixing. Structure: one distribution (`environments/replay`) ships five top-level modules — - replay_common: shared base (BaseReplayTaskset/BaseReplayHarness parameterized by KIND, buffer sourcing offline+online, seeding, snapshot restore, scoring) + selector. - replay_recheck / replay_judge / replay_compaction_after / replay_compaction_before: thin modules each fixing KIND and bundling a harness (auto-selected via default_harness_id). recheck/compaction_* reuse the original verifier; judge grades the verdict against the original reward. The bundled harness handles both modes: offline (materialized task.prompt -> default chat loop) and online (sample this KIND from the live buffer). selector is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-30T00:43:56Z

+
+    def _sample(self, rng: random.Random) -> tuple[Trace, dict] | None:
+        """Scan the live buffer in random order; return the first (trace, ``KIND`` point) found."""
+        files = sorted(glob.glob(self.config.buffer_glob))


🟡 Medium replay_common/base.py:203

_sample() shuffles the file list and returns the first matching rollout from the first file that has any, so a file with one matching rollout is as likely to be picked as a file with hundreds. This skews online sampling toward sparse files instead of giving each resume point in the buffer equal probability. Consider collecting all matching points across all files (or using a two-stage draw weighted by per-file point counts) before choosing one.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:

In file @environments/replay/replay_common/base.py around line 203: `_sample()` shuffles the file list and returns the first matching rollout from the first file that has any, so a file with one matching rollout is as likely to be picked as a file with hundreds. This skews online sampling toward sparse files instead of giving each resume point in the buffer equal probability. Consider collecting all matching points across all files (or using a two-stage draw weighted by per-file point counts) before choosing one.

…ion A) - Bug: BaseReplayHarness subclassed DefaultHarness without parameterizing Harness[ReplayHarnessConfig], so harness_config_type resolved to DefaultHarnessConfig (no buffer_glob/followup) -> self.config.buffer_glob would AttributeError in the online path. Now subclasses Harness[ReplayHarnessConfig] directly and seeds the default program itself (one unified offline/online seed path via _resolve_seed), so the config type and the --harness.buffer_glob/followup fields resolve correctly. - resume_points: single get_tag per node instead of two. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…groups (Option A) Was seeded by trace.id (unique per rollout), so the group_size rollouts of one group each sampled a different source rollout — GRPO's group-relative baseline would then compare across different problems. Seed by trace.task.idx (shared across a group) so all rollouts of a group replay the same source+point. Freshness preserved by the growing buffer. Offline unaffected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp Bot reviewed Jun 29, 2026

View reviewed changes

Comment thread environments/compact/compact/harness.py Outdated

mikasenghaas mentioned this pull request Jun 29, 2026

Option B: typed MessageNode.kind for replay resume points #1896

Draft

macroscopeapp Bot reviewed Jun 29, 2026

View reviewed changes

macroscopeapp Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread environments/replay/replay/taskset.py Outdated

Comment thread environments/replay/replay/harness.py Outdated

Comment thread environments/replay/replay/harness.py Outdated

macroscopeapp Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread environments/replay/replay/harness.py Outdated

mikasenghaas and others added 2 commits June 30, 2026 00:28

fix(replay): online harness skips snapshot restore for judge mode (no…

ec39356

… sandbox needed) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread environments/replay/replay/taskset.py Outdated

Comment thread environments/replay/replay_common/selector.py Outdated

Comment thread environments/replay/replay/taskset.py Outdated

mikasenghaas and others added 2 commits June 30, 2026 00:40

refactor(replay): drop now-unused DEFAULT_KINDS constant (Option A)

ff68638

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp Bot reviewed Jun 30, 2026

View reviewed changes

mikasenghaas and others added 2 commits June 30, 2026 22:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Option A: tag replay resume points via trace.info#1895

Option A: tag replay resume points via trace.info#1895
mikasenghaas wants to merge 10 commits into
mainfrom
exp/replay-node-tags-info

mikasenghaas commented Jun 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

macroscopeapp Bot Jun 29, 2026

Uh oh!

macroscopeapp Bot Jun 29, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot Jun 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mikasenghaas commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Option A — replay buffer via trace.info tags (no core schema change)

What this is

Four separate tasksets (pick one per env via --taskset.id)

Scoring

Structure — one package, five modules

Sandbox / exec replay (snapshot_ref)

The Option-A choice

Files (+515)

Uh oh!

Uh oh!

macroscopeapp Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

macroscopeapp Bot Jun 29, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

macroscopeapp Bot Jun 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

mikasenghaas commented Jun 29, 2026 •

edited

Loading

Option A — replay buffer via `trace.info` tags (no core schema change)

Four separate tasksets (pick one per env via `--taskset.id`)

Sandbox / exec replay (`snapshot_ref`)