Option B: typed MessageNode.kind for replay resume points#1896
Option B: typed MessageNode.kind for replay resume points#1896mikasenghaas wants to merge 9 commits into
Conversation
Illustrative change for the replay-buffer design discussion: how a harness marks nodes a replay buffer can resume from. Two things are tagged because the typed graph can't express them — branch provenance (compaction) and tool failure status (failed_tool_call). Tool identity / "is this a tool call" stay intrinsic to the graph. Option B adds a typed `kind: NodeTag` field to MessageNode (a core schema change), so the tag rides with the node (no keying, validated, rides the wire + dump automatically). The program is the sensor for tool failures (the trace drops isError/exit); the harness is the writer, stamping node.kind on the finished graph post-launch. annotate.py reads it. Compare with Option A (trace.info side-channel, no schema change): exp/replay-node-tags-info. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
🟡 Medium
verifiers/verifiers/v1/graph.py
Line 128 in a8df793
MessageNode.kind is annotated as NodeTag, a Literal union, but model_config doesn't set validate_assignment=True, so assigning a value that isn't one of the allowed tags (e.g. trace.nodes[nid].kind = rec["tag"] with a typo) stores it unchecked. That invalid tag then serializes to JSON and the replay resume-point selector silently misses it, because Pydantic never rejects the out-of-enum string. Consider enabling validate_assignment=True or gating tag writes through a validating helper.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/v1/graph.py around line 128:
`MessageNode.kind` is annotated as `NodeTag`, a `Literal` union, but `model_config` doesn't set `validate_assignment=True`, so assigning a value that isn't one of the allowed tags (e.g. `trace.nodes[nid].kind = rec["tag"]` with a typo) stores it unchecked. That invalid tag then serializes to JSON and the replay resume-point selector silently misses it, because Pydantic never rejects the out-of-enum string. Consider enabling `validate_assignment=True` or gating tag writes through a validating helper.
…ng + ReplayTaskset (Option B) Extends the Option-B node-field skeleton into a runnable replay-buffer slice: - Drop the failed-tool-call path (deprioritized): program.py reverted; NodeTag is now compaction-only (compaction_before/compaction_after/subagent) and authored purely from graph structure (no program sensor). - Two resume points per compaction: `compaction_after` (post-compaction branch start — continue from the compaction message) and `compaction_before` (pre-compaction branch leaf — the model regenerates the compaction, then continues). Stamped on MessageNode.kind. - Typed `MessageNode.snapshot_ref` for exec/sandbox replay + Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py); durable-ref contract documented. Capture (per-turn) is a follow-up framework hook; restore is wired in ReplayTaskset.setup (no-op while None). - New `environments/replay` env: ReplayTaskset materializes one task per resume point from a buffer glob (offline), seeds task.prompt with the root->node prefix (run with the default harness), and stubs scoring (reuse the original verifier — TODO). Online/growing buffer noted as a rollout-time-sampling follow-up. Option B stores tags/snapshot refs as typed MessageNode fields (core schema change). Compare with Option A (trace.info side-channel): exp/replay-node-tags-info. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| """True iff a model call produced this message (the response passed to `commit`); False for | ||
| every prompt-supplied message — including assistant/tool messages fabricated as context | ||
| the model never generated, which role alone can't tell apart from real turns.""" | ||
| kind: NodeTag = None |
There was a problem hiding this comment.
🟡 Medium v1/graph.py:84
Because Trace.nodes deduplicates on (parent, message_hash) — prepare_turn() resolves repeated prefix messages to one shared MessageNode — kind and snapshot_ref are occurrence-specific metadata stored on a shared node. When the same (parent, message_hash) is reached again in a later turn, writing trace.nodes[nid].kind = ... or snapshot_ref = ... overwrites the earlier occurrence's values, so resume_points()/snapshot_ref_of() return whichever occurrence wrote last rather than the one on the current path. Consider moving kind and snapshot_ref to a path-local store (keyed by node id within the branch) instead of mutating the shared MessageNode.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/v1/graph.py around line 84:
Because `Trace.nodes` deduplicates on `(parent, message_hash)` — `prepare_turn()` resolves repeated prefix messages to one shared `MessageNode` — `kind` and `snapshot_ref` are occurrence-specific metadata stored on a shared node. When the same `(parent, message_hash)` is reached again in a later turn, writing `trace.nodes[nid].kind = ...` or `snapshot_ref = ...` overwrites the earlier occurrence's values, so `resume_points()`/`snapshot_ref_of()` return whichever occurrence wrote last rather than the one on the current path. Consider moving `kind` and `snapshot_ref` to a path-local store (keyed by node id within the branch) instead of mutating the shared `MessageNode`.
| result = await runtime.run_program([*program, trace.task.prompt], env) | ||
|
|
||
| # Tag compaction resume points on the finished graph (Option B: typed MessageNode.kind). |
There was a problem hiding this comment.
🟡 Medium compact/harness.py:56
launch unconditionally overwrites trace.nodes[*].kind after runtime.run_program(...) returns, before the caller checks result.exit_code. When the compact harness crashes or exits non-zero after producing a partial trace, the compaction resume tags are still persisted on an errored rollout. ReplayTaskset.load_tasks() reads these tags from every stored trace without filtering trace.error, so failed or incomplete compact rollouts generate replay tasks that resume from bogus prefixes. Consider skipping the tagging when result.exit_code indicates failure, or guarding load_tasks() to skip traces with trace.error.
result = await runtime.run_program([*program, trace.task.prompt], env)
+ if result.exit_code != 0:
+ return result
+
# Tag compaction resume points on the finished graph (Option B: typed MessageNode.kind).🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/compact/compact/harness.py around lines 56-58:
`launch` unconditionally overwrites `trace.nodes[*].kind` after `runtime.run_program(...)` returns, before the caller checks `result.exit_code`. When the compact harness crashes or exits non-zero after producing a partial trace, the compaction resume tags are still persisted on an errored rollout. `ReplayTaskset.load_tasks()` reads these tags from every stored trace without filtering `trace.error`, so failed or incomplete compact rollouts generate replay tasks that resume from bogus prefixes. Consider skipping the tagging when `result.exit_code` indicates failure, or guarding `load_tasks()` to skip traces with `trace.error`.
…me sampling (Option B)
- Scoring delegation: ReplayTaskset.score now reuses the ORIGINAL env's verifier
(config.inner) by running its rewards/metrics over the replay trace with the original
task swapped in. Original task + reward ride on the replay task (offline) or in
trace.info["replay"] (online). Empty inner.id => no-op (skeleton-safe).
- Online buffer: config.mode="online" returns pool_size virtual slots; new ReplayHarness
samples a stored trace + resume point from the LIVE buffer per rollout (re-globs each
time), restores the snapshot, seeds the default chat loop with the root->node prefix via
INITIAL_MESSAGES, and stashes provenance. Empty buffer => stop("replay_buffer_empty")
(warmup). Offline path (materialize in load_tasks) unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The 'check your work, fix if wrong' mode: take the full sampled rollout, append a user turn (config.followup), and re-roll — scored by the same original verifier as the rest. - Structural, tag-free: the recheck point is the rollout's final-answer leaf, so it works for ANY rollout (linear or compacting) and is identical across A/B (no tag dependency). - selector: recheck_points() + build_seed() (appends the follow-up user turn for recheck; plain prefix otherwise). resume_points() now yields recheck alongside compaction kinds. - taskset + harness: new `followup` config; `recheck` added to default `kinds`; both build seeds via build_seed so offline and online modes get the appended turn. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…de (Option B)
Cleanup:
- Shared DEFAULT_KINDS / DEFAULT_FOLLOWUP constants in selector.py, referenced by both the
taskset and harness configs (was duplicated literals).
- Buffer reading centralized as selector.iter_traces; taskset drops its glob/json copy.
- resume_points folds recheck/judge final-leaf points in one place.
Judge mode (config: add "judge" to kinds):
- A judge replay point presents the rollout's transcript ("was this correct? yes/no")
instead of continuing it (selector.judge_prompt / build_seed).
- ReplayTaskset.score grades the model's verdict against the original rollout's reward
(original_reward > judge_threshold) — a self-supervised correctness label; no inner
verifier or snapshot needed. recheck/compaction_* still reuse the original verifier.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… sandbox needed) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion B) Was a single ReplayTaskset with a `kinds` list; now each mode is its own selectable taskset (taskset id), so an env config picks exactly one task type — no mode mixing. Structure: one distribution (`environments/replay`) ships five top-level modules — - replay_common: shared base (BaseReplayTaskset/BaseReplayHarness parameterized by KIND, buffer sourcing offline+online, seeding, snapshot restore, scoring) + selector. - replay_recheck / replay_judge / replay_compaction_after / replay_compaction_before: thin modules each fixing KIND and bundling a harness (auto-selected via default_harness_id). recheck/compaction_* reuse the original verifier; judge grades the verdict against the original reward. The bundled harness handles both modes: offline (materialized task.prompt -> default chat loop) and online (sample this KIND from the live buffer). Dropped the now-unused DEFAULT_KINDS constant. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| name = "replay" | ||
| version = "0.1.0" | ||
| description = "replay — replay-buffer tasksets (recheck / judge / compaction_before / compaction_after)." | ||
| requires-python = ">=3.10" |
There was a problem hiding this comment.
🟠 High replay/pyproject.toml:5
requires-python = ">=3.10" allows installation on Python 3.10, but the only dependency verifiers requires >=3.11,<3.14. On Python 3.10, dependency resolution fails to find a compatible verifiers build, so installing replay fails. Consider aligning this constraint with verifiers by using >=3.11,<3.14.
| requires-python = ">=3.10" | |
| requires-python = ">=3.11,<3.14" |
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/replay/pyproject.toml around line 5:
`requires-python = ">=3.10"` allows installation on Python 3.10, but the only dependency `verifiers` requires `>=3.11,<3.14`. On Python 3.10, dependency resolution fails to find a compatible `verifiers` build, so installing `replay` fails. Consider aligning this constraint with `verifiers` by using `>=3.11,<3.14`.
…ion B) - Bug: BaseReplayHarness subclassed DefaultHarness without parameterizing Harness[ReplayHarnessConfig], so harness_config_type resolved to DefaultHarnessConfig (no buffer_glob/followup) -> self.config.buffer_glob would AttributeError in the online path. Now subclasses Harness[ReplayHarnessConfig] directly and seeds the default program itself (one unified offline/online seed path via _resolve_seed). - resume_points: single get_tag per node instead of two. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…groups (Option B) Was seeded by trace.id (unique per rollout), so the group_size rollouts of one group each sampled a different source rollout — GRPO's group-relative baseline would then compare across different problems. Seed by trace.task.idx (shared across a group) so all rollouts of a group replay the same source+point (N diverse continuations of one point). Freshness is preserved by the growing buffer (glob picks up new files, shifting each index's draw). Offline was already correct (a group = N rollouts of one materialized task). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Option B — replay buffer via typed
MessageNodefields (core schema change)Design comparison PR (draft, not for merge). Pairs with Option A (#1895), which is identical except it stores the same metadata in
trace.infoinstead of on the node. The two branches differ in only three files —replay_common/selector.py(two accessors), the one tag-write line in the compact harness, and this branch'sverifiers/v1/graph.py— so the diff between #1895 and #1896 is the A/B decision.What this is
A runnable skeleton of the replay-buffer slice: tag compaction resume points during generation, then turn old rollouts into new training tasks. Producer = the
compactexample env; consumer = a newreplaydistribution.Four separate tasksets (pick one per env via
--taskset.id)replay_recheck("try again") — seed the full rollout + an appended user turn (Check your work, fix if wrong) and re-roll.replay_compaction_after— resume from a compaction message; the model continues solving.replay_compaction_before— resume before a compaction; the model writes the compaction itself, then continues.replay_judge— present the rollout's transcript and ask "was this correct? yes/no", graded against the original rollout's actual reward (original_reward > judge_threshold) — a self-supervised correctness label.Each env selects exactly one task type — no mode-mixing.
Scoring
recheck/compaction_*reuse the original env's verifier (config.inner):scoreruns its rewards/metrics over the replay trace with the original task swapped in (e.g. a math verifier checks the replayed continuation's final answer against the original ground truth).judgecompares the model's yes/no verdict to the original reward (noinner, no sandbox).Provenance (original task + reward) rides on the task (offline) or
trace.info["replay"](online). Compaction tags are authored from graph structure (no program sensor);recheck/judgeuse the structural final-answer leaf (tag-free, work on any rollout). Failed-tool-call tagging stays deprioritized (NodeTagis compaction-only).Structure — one package, five modules
environments/replay/ships a sharedreplay_commonlibrary + the four selectable taskset modules.replay_common.baseholds everything shared (BaseReplayTaskset/BaseReplayHarnessparameterized byKIND, buffer sourcing, seeding, snapshot restore, scoring); each taskset module is ~3 lines fixing itsKINDand bundling a harness (auto-selected viadefault_harness_id).mode=offline):load_tasksmaterializes one task perKINDresume point from a buffer glob (.../rollouts/step_*/train_rollouts.jsonl), seedingtask.promptwith theroot→nodeprefix.mode=online):load_tasksreturnspool_sizevirtual slots; the bundled harness samples thatKINDfrom the live buffer per rollout (and serves offline too, seeding the same program from the task's materialized prompt). Sampling is keyed by task index, so a GRPO group replays the same source+point. Empty buffer →stop("replay_buffer_empty")(warmup).Sandbox / exec replay (
snapshot_ref)MessageNode.snapshot_ref: str | Nonefield +Runtime.snapshot()/restore()stubs (verifiers/v1/runtimes/base.py), durable-ref contract documented. Restore is wired (offline: tasksetsetup; online: harness). Capture (per-turn) needs a framework hook — deliberate follow-up; refs stayNoneuntil the sandbox-snapshot feature lands.The Option-B choice
Adds typed fields to
MessageNodeinverifiers/v1/graph.py:kind: NodeTagandsnapshot_ref: str | None. The label/ref rides with the node — keyless, validated, visible in any trace viewer; both are plainstr/None, so they ride the wire + JSON dump automatically (no_NODE_DUMP_EXCLUDE). Cost: a change to a core model shared by every env/trace/the trainer (upstream review + prime-rl pin bump).graph.pyis the only file beyond the shared set that differs from Option A.Files (+533)
verifiers/v1/graph.py—NodeTag+MessageNode.kind+MessageNode.snapshot_ref(the core change)verifiers/v1/runtimes/base.py—snapshot/restorestubsenvironments/compact/compact/{annotate,harness}.py— find + stamp compaction before/after onnode.kindenvironments/replay/replay_common/{base,selector}.py— shared base + the B readerenvironments/replay/replay_{recheck,judge,compaction_after,compaction_before}/— the four tasksets +pyproject.toml🤖 Generated with Claude Code