Skip to content

Option B: typed MessageNode.kind for replay resume points#1896

Draft
mikasenghaas wants to merge 9 commits into
mainfrom
exp/replay-node-tags-field
Draft

Option B: typed MessageNode.kind for replay resume points#1896
mikasenghaas wants to merge 9 commits into
mainfrom
exp/replay-node-tags-field

Conversation

@mikasenghaas

@mikasenghaas mikasenghaas commented Jun 29, 2026

Copy link
Copy Markdown
Member

Option B — replay buffer via typed MessageNode fields (core schema change)

Design comparison PR (draft, not for merge). Pairs with Option A (#1895), which is identical except it stores the same metadata in trace.info instead of on the node. The two branches differ in only three filesreplay_common/selector.py (two accessors), the one tag-write line in the compact harness, and this branch's verifiers/v1/graph.py — so the diff between #1895 and #1896 is the A/B decision.

What this is

A runnable skeleton of the replay-buffer slice: tag compaction resume points during generation, then turn old rollouts into new training tasks. Producer = the compact example env; consumer = a new replay distribution.

Four separate tasksets (pick one per env via --taskset.id)

  • replay_recheck ("try again") — seed the full rollout + an appended user turn (Check your work, fix if wrong) and re-roll.
  • replay_compaction_after — resume from a compaction message; the model continues solving.
  • replay_compaction_before — resume before a compaction; the model writes the compaction itself, then continues.
  • replay_judge — present the rollout's transcript and ask "was this correct? yes/no", graded against the original rollout's actual reward (original_reward > judge_threshold) — a self-supervised correctness label.

Each env selects exactly one task type — no mode-mixing.

Scoring

  • recheck / compaction_* reuse the original env's verifier (config.inner): score runs its rewards/metrics over the replay trace with the original task swapped in (e.g. a math verifier checks the replayed continuation's final answer against the original ground truth).
  • judge compares the model's yes/no verdict to the original reward (no inner, no sandbox).

Provenance (original task + reward) rides on the task (offline) or trace.info["replay"] (online). Compaction tags are authored from graph structure (no program sensor); recheck/judge use the structural final-answer leaf (tag-free, work on any rollout). Failed-tool-call tagging stays deprioritized (NodeTag is compaction-only).

Structure — one package, five modules

environments/replay/ ships a shared replay_common library + the four selectable taskset modules. replay_common.base holds everything shared (BaseReplayTaskset/BaseReplayHarness parameterized by KIND, buffer sourcing, seeding, snapshot restore, scoring); each taskset module is ~3 lines fixing its KIND and bundling a harness (auto-selected via default_harness_id).

  • Offline (mode=offline): load_tasks materializes one task per KIND resume point from a buffer glob (.../rollouts/step_*/train_rollouts.jsonl), seeding task.prompt with the root→node prefix.
  • Online (mode=online): load_tasks returns pool_size virtual slots; the bundled harness samples that KIND from the live buffer per rollout (and serves offline too, seeding the same program from the task's materialized prompt). Sampling is keyed by task index, so a GRPO group replays the same source+point. Empty buffer → stop("replay_buffer_empty") (warmup).

Sandbox / exec replay (snapshot_ref)

  • New typed MessageNode.snapshot_ref: str | None field + Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py), durable-ref contract documented. Restore is wired (offline: taskset setup; online: harness). Capture (per-turn) needs a framework hook — deliberate follow-up; refs stay None until the sandbox-snapshot feature lands.

The Option-B choice

Adds typed fields to MessageNode in verifiers/v1/graph.py: kind: NodeTag and snapshot_ref: str | None. The label/ref rides with the node — keyless, validated, visible in any trace viewer; both are plain str/None, so they ride the wire + JSON dump automatically (no _NODE_DUMP_EXCLUDE). Cost: a change to a core model shared by every env/trace/the trainer (upstream review + prime-rl pin bump). graph.py is the only file beyond the shared set that differs from Option A.

Files (+533)

  • verifiers/v1/graph.pyNodeTag + MessageNode.kind + MessageNode.snapshot_ref (the core change)
  • verifiers/v1/runtimes/base.pysnapshot/restore stubs
  • environments/compact/compact/{annotate,harness}.py — find + stamp compaction before/after on node.kind
  • environments/replay/replay_common/{base,selector}.py — shared base + the B reader
  • environments/replay/replay_{recheck,judge,compaction_after,compaction_before}/ — the four tasksets + pyproject.toml

🤖 Generated with Claude Code

Illustrative change for the replay-buffer design discussion: how a harness marks
nodes a replay buffer can resume from. Two things are tagged because the typed graph
can't express them — branch provenance (compaction) and tool failure status
(failed_tool_call). Tool identity / "is this a tool call" stay intrinsic to the graph.

Option B adds a typed `kind: NodeTag` field to MessageNode (a core schema change), so
the tag rides with the node (no keying, validated, rides the wire + dump automatically).
The program is the sensor for tool failures (the trace drops isError/exit); the harness
is the writer, stamping node.kind on the finished graph post-launch. annotate.py reads it.

Compare with Option A (trace.info side-channel, no schema change): exp/replay-node-tags-info.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread environments/compact/compact/annotate.py Outdated
Comment thread verifiers/v1/graph.py

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium

model_config = ConfigDict(extra="forbid", arbitrary_types_allowed=True)

MessageNode.kind is annotated as NodeTag, a Literal union, but model_config doesn't set validate_assignment=True, so assigning a value that isn't one of the allowed tags (e.g. trace.nodes[nid].kind = rec["tag"] with a typo) stores it unchecked. That invalid tag then serializes to JSON and the replay resume-point selector silently misses it, because Pydantic never rejects the out-of-enum string. Consider enabling validate_assignment=True or gating tag writes through a validating helper.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/v1/graph.py around line 128:

`MessageNode.kind` is annotated as `NodeTag`, a `Literal` union, but `model_config` doesn't set `validate_assignment=True`, so assigning a value that isn't one of the allowed tags (e.g. `trace.nodes[nid].kind = rec["tag"]` with a typo) stores it unchecked. That invalid tag then serializes to JSON and the replay resume-point selector silently misses it, because Pydantic never rejects the out-of-enum string. Consider enabling `validate_assignment=True` or gating tag writes through a validating helper.

Comment thread environments/compact/compact/harness.py Outdated
…ng + ReplayTaskset (Option B)

Extends the Option-B node-field skeleton into a runnable replay-buffer slice:

- Drop the failed-tool-call path (deprioritized): program.py reverted; NodeTag is now
  compaction-only (compaction_before/compaction_after/subagent) and authored purely from
  graph structure (no program sensor).
- Two resume points per compaction: `compaction_after` (post-compaction branch start —
  continue from the compaction message) and `compaction_before` (pre-compaction branch
  leaf — the model regenerates the compaction, then continues). Stamped on MessageNode.kind.
- Typed `MessageNode.snapshot_ref` for exec/sandbox replay + Runtime.snapshot()/restore()
  stubs (verifiers/v1/runtimes/base.py); durable-ref contract documented. Capture (per-turn)
  is a follow-up framework hook; restore is wired in ReplayTaskset.setup (no-op while None).
- New `environments/replay` env: ReplayTaskset materializes one task per resume point from
  a buffer glob (offline), seeds task.prompt with the root->node prefix (run with the
  default harness), and stubs scoring (reuse the original verifier — TODO). Online/growing
  buffer noted as a rollout-time-sampling follow-up.

Option B stores tags/snapshot refs as typed MessageNode fields (core schema change).
Compare with Option A (trace.info side-channel): exp/replay-node-tags-info.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread verifiers/v1/graph.py
"""True iff a model call produced this message (the response passed to `commit`); False for
every prompt-supplied message — including assistant/tool messages fabricated as context
the model never generated, which role alone can't tell apart from real turns."""
kind: NodeTag = None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium v1/graph.py:84

Because Trace.nodes deduplicates on (parent, message_hash)prepare_turn() resolves repeated prefix messages to one shared MessageNodekind and snapshot_ref are occurrence-specific metadata stored on a shared node. When the same (parent, message_hash) is reached again in a later turn, writing trace.nodes[nid].kind = ... or snapshot_ref = ... overwrites the earlier occurrence's values, so resume_points()/snapshot_ref_of() return whichever occurrence wrote last rather than the one on the current path. Consider moving kind and snapshot_ref to a path-local store (keyed by node id within the branch) instead of mutating the shared MessageNode.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @verifiers/v1/graph.py around line 84:

Because `Trace.nodes` deduplicates on `(parent, message_hash)` — `prepare_turn()` resolves repeated prefix messages to one shared `MessageNode` — `kind` and `snapshot_ref` are occurrence-specific metadata stored on a shared node. When the same `(parent, message_hash)` is reached again in a later turn, writing `trace.nodes[nid].kind = ...` or `snapshot_ref = ...` overwrites the earlier occurrence's values, so `resume_points()`/`snapshot_ref_of()` return whichever occurrence wrote last rather than the one on the current path. Consider moving `kind` and `snapshot_ref` to a path-local store (keyed by node id within the branch) instead of mutating the shared `MessageNode`.

Comment on lines +56 to +58
result = await runtime.run_program([*program, trace.task.prompt], env)

# Tag compaction resume points on the finished graph (Option B: typed MessageNode.kind).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Medium compact/harness.py:56

launch unconditionally overwrites trace.nodes[*].kind after runtime.run_program(...) returns, before the caller checks result.exit_code. When the compact harness crashes or exits non-zero after producing a partial trace, the compaction resume tags are still persisted on an errored rollout. ReplayTaskset.load_tasks() reads these tags from every stored trace without filtering trace.error, so failed or incomplete compact rollouts generate replay tasks that resume from bogus prefixes. Consider skipping the tagging when result.exit_code indicates failure, or guarding load_tasks() to skip traces with trace.error.

        result = await runtime.run_program([*program, trace.task.prompt], env)
+        if result.exit_code != 0:
+            return result
+
         # Tag compaction resume points on the finished graph (Option B: typed MessageNode.kind).
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/compact/compact/harness.py around lines 56-58:

`launch` unconditionally overwrites `trace.nodes[*].kind` after `runtime.run_program(...)` returns, before the caller checks `result.exit_code`. When the compact harness crashes or exits non-zero after producing a partial trace, the compaction resume tags are still persisted on an errored rollout. `ReplayTaskset.load_tasks()` reads these tags from every stored trace without filtering `trace.error`, so failed or incomplete compact rollouts generate replay tasks that resume from bogus prefixes. Consider skipping the tagging when `result.exit_code` indicates failure, or guarding `load_tasks()` to skip traces with `trace.error`.

…me sampling (Option B)

- Scoring delegation: ReplayTaskset.score now reuses the ORIGINAL env's verifier
  (config.inner) by running its rewards/metrics over the replay trace with the original
  task swapped in. Original task + reward ride on the replay task (offline) or in
  trace.info["replay"] (online). Empty inner.id => no-op (skeleton-safe).
- Online buffer: config.mode="online" returns pool_size virtual slots; new ReplayHarness
  samples a stored trace + resume point from the LIVE buffer per rollout (re-globs each
  time), restores the snapshot, seeds the default chat loop with the root->node prefix via
  INITIAL_MESSAGES, and stashes provenance. Empty buffer => stop("replay_buffer_empty")
  (warmup). Offline path (materialize in load_tasks) unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread environments/replay/replay/harness.py Outdated
The 'check your work, fix if wrong' mode: take the full sampled rollout, append a user
turn (config.followup), and re-roll — scored by the same original verifier as the rest.

- Structural, tag-free: the recheck point is the rollout's final-answer leaf, so it works
  for ANY rollout (linear or compacting) and is identical across A/B (no tag dependency).
- selector: recheck_points() + build_seed() (appends the follow-up user turn for recheck;
  plain prefix otherwise). resume_points() now yields recheck alongside compaction kinds.
- taskset + harness: new `followup` config; `recheck` added to default `kinds`; both build
  seeds via build_seed so offline and online modes get the appended turn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread environments/replay/replay/taskset.py Outdated
Comment thread environments/replay/replay/harness.py Outdated
mikasenghaas and others added 2 commits June 30, 2026 00:29
…de (Option B)

Cleanup:
- Shared DEFAULT_KINDS / DEFAULT_FOLLOWUP constants in selector.py, referenced by both the
  taskset and harness configs (was duplicated literals).
- Buffer reading centralized as selector.iter_traces; taskset drops its glob/json copy.
- resume_points folds recheck/judge final-leaf points in one place.

Judge mode (config: add "judge" to kinds):
- A judge replay point presents the rollout's transcript ("was this correct? yes/no")
  instead of continuing it (selector.judge_prompt / build_seed).
- ReplayTaskset.score grades the model's verdict against the original rollout's reward
  (original_reward > judge_threshold) — a self-supervised correctness label; no inner
  verifier or snapshot needed. recheck/compaction_* still reuse the original verifier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… sandbox needed)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread environments/replay/replay/taskset.py Outdated
Comment thread environments/replay/replay_common/selector.py
Comment thread environments/replay/replay/taskset.py Outdated
Comment thread environments/replay/replay/harness.py Outdated
Comment thread environments/replay/replay/harness.py Outdated
…tion B)

Was a single ReplayTaskset with a `kinds` list; now each mode is its own selectable
taskset (taskset id), so an env config picks exactly one task type — no mode mixing.

Structure: one distribution (`environments/replay`) ships five top-level modules —
- replay_common: shared base (BaseReplayTaskset/BaseReplayHarness parameterized by KIND,
  buffer sourcing offline+online, seeding, snapshot restore, scoring) + selector.
- replay_recheck / replay_judge / replay_compaction_after / replay_compaction_before:
  thin modules each fixing KIND and bundling a harness (auto-selected via
  default_harness_id). recheck/compaction_* reuse the original verifier; judge grades the
  verdict against the original reward.

The bundled harness handles both modes: offline (materialized task.prompt -> default chat
loop) and online (sample this KIND from the live buffer). Dropped the now-unused
DEFAULT_KINDS constant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
name = "replay"
version = "0.1.0"
description = "replay — replay-buffer tasksets (recheck / judge / compaction_before / compaction_after)."
requires-python = ">=3.10"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 High replay/pyproject.toml:5

requires-python = ">=3.10" allows installation on Python 3.10, but the only dependency verifiers requires >=3.11,<3.14. On Python 3.10, dependency resolution fails to find a compatible verifiers build, so installing replay fails. Consider aligning this constraint with verifiers by using >=3.11,<3.14.

Suggested change
requires-python = ">=3.10"
requires-python = ">=3.11,<3.14"
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file @environments/replay/pyproject.toml around line 5:

`requires-python = ">=3.10"` allows installation on Python 3.10, but the only dependency `verifiers` requires `>=3.11,<3.14`. On Python 3.10, dependency resolution fails to find a compatible `verifiers` build, so installing `replay` fails. Consider aligning this constraint with `verifiers` by using `>=3.11,<3.14`.

Comment thread environments/replay/replay_common/base.py
mikasenghaas and others added 2 commits June 30, 2026 22:57
…ion B)

- Bug: BaseReplayHarness subclassed DefaultHarness without parameterizing
  Harness[ReplayHarnessConfig], so harness_config_type resolved to DefaultHarnessConfig
  (no buffer_glob/followup) -> self.config.buffer_glob would AttributeError in the online
  path. Now subclasses Harness[ReplayHarnessConfig] directly and seeds the default program
  itself (one unified offline/online seed path via _resolve_seed).
- resume_points: single get_tag per node instead of two.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…groups (Option B)

Was seeded by trace.id (unique per rollout), so the group_size rollouts of one group each
sampled a different source rollout — GRPO's group-relative baseline would then compare across
different problems. Seed by trace.task.idx (shared across a group) so all rollouts of a group
replay the same source+point (N diverse continuations of one point). Freshness is preserved by
the growing buffer (glob picks up new files, shifting each index's draw). Offline was already
correct (a group = N rollouts of one materialized task).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant