Option B: typed MessageNode.kind for replay resume points by mikasenghaas · Pull Request #1896 · PrimeIntellect-ai/verifiers

mikasenghaas · 2026-06-29T22:05:23Z

Option B — replay buffer via typed `MessageNode` fields (core schema change)

Design comparison PR (draft, not for merge). Pairs with Option A (#1895), which is identical except it stores the same metadata in trace.info instead of on the node. The two branches differ in only three files — replay_common/selector.py (two accessors), the one tag-write line in the compact harness, and this branch's verifiers/v1/graph.py — so the diff between #1895 and #1896 is the A/B decision.

What this is

A runnable skeleton of the replay-buffer slice: tag compaction resume points during generation, then turn old rollouts into new training tasks. Producer = the compact example env; consumer = a new replay distribution.

Four separate tasksets (pick one per env via `--taskset.id`)

replay_recheck ("try again") — seed the full rollout + an appended user turn (Check your work, fix if wrong) and re-roll.
replay_compaction_after — resume from a compaction message; the model continues solving.
replay_compaction_before — resume before a compaction; the model writes the compaction itself, then continues.
replay_judge — present the rollout's transcript and ask "was this correct? yes/no", graded against the original rollout's actual reward (original_reward > judge_threshold) — a self-supervised correctness label.

Each env selects exactly one task type — no mode-mixing.

Scoring

recheck / compaction_* reuse the original env's verifier (config.inner): `score` runs that verifier's rewards/metrics over the replay trace with the original task swapped in (e.g. a math verifier checks the replayed continuation's final answer against the original ground truth).
judge compares the model's yes/no verdict to the original reward (no inner, no sandbox).
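The judge grading rule above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `parse_verdict`, `judge_reward`, the "last yes/no word wins" parsing rule, and the default threshold of 0.5 are all assumptions made for the example.

```python
import re
from typing import Optional


def parse_verdict(completion: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', None if neither word appears.

    Assumption: the last standalone yes/no word in the completion is the verdict.
    """
    matches = re.findall(r"\b(yes|no)\b", completion.lower())
    if not matches:
        return None
    return matches[-1] == "yes"


def judge_reward(completion: str, original_reward: float,
                 judge_threshold: float = 0.5) -> float:
    """Score 1.0 when the verdict agrees with the self-supervised label.

    The label is `original_reward > judge_threshold`, as in the description.
    """
    verdict = parse_verdict(completion)
    if verdict is None:
        return 0.0  # unparseable verdicts score zero in this sketch
    label = original_reward > judge_threshold
    return 1.0 if verdict == label else 0.0
```

Because the label comes from the stored reward, this path needs neither the inner verifier nor a sandbox.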

Provenance (original task + reward) rides on the task (offline) or trace.info["replay"] (online). Compaction tags are authored from graph structure (no program sensor); recheck/judge use the structural final-answer leaf (tag-free, work on any rollout). Failed-tool-call tagging stays deprioritized (NodeTag is compaction-only).

Structure — one package, five modules

environments/replay/ ships a shared replay_common library + the four selectable taskset modules. replay_common.base holds everything shared (BaseReplayTaskset/BaseReplayHarness parameterized by KIND, buffer sourcing, seeding, snapshot restore, scoring); each taskset module is ~3 lines fixing its KIND and bundling a harness (auto-selected via default_harness_id).

Offline (mode=offline): load_tasks materializes one task per KIND resume point from a buffer glob (.../rollouts/step_*/train_rollouts.jsonl), seeding task.prompt with the root→node prefix.
Online (mode=online): load_tasks returns pool_size virtual slots; the bundled harness samples that KIND from the live buffer per rollout (and serves offline too, seeding the same program from the task's materialized prompt). Sampling is keyed by task index, so a GRPO group replays the same source+point. Empty buffer → stop("replay_buffer_empty") (warmup).
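Seeding `task.prompt` with the root→node prefix amounts to walking parent pointers from the resume node back to the root. A minimal sketch, assuming each node carries a message and a parent id; the `Node` class and `prefix_messages` helper are hypothetical stand-ins, not the actual `MessageNode` API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a trace node: a message plus a parent pointer."""
    message: dict
    parent: Optional[str] = None  # parent node id; None at the root


def prefix_messages(nodes: dict, node_id: str) -> list:
    """Collect the messages on the path root -> node_id, in conversation order."""
    path = []
    current: Optional[str] = node_id
    while current is not None:
        node = nodes[current]
        path.append(node.message)
        current = node.parent
    path.reverse()  # the walk went leaf -> root, so flip it
    return path
```

Sibling branches (for example the pre- and post-compaction branches) share ancestors, so each resume point gets only the messages on its own path.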

Sandbox / exec replay (`snapshot_ref`)

New typed MessageNode.snapshot_ref: str | None field + Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py), durable-ref contract documented. Restore is wired (offline: taskset setup; online: harness). Capture (per-turn) needs a framework hook — deliberate follow-up; refs stay None until the sandbox-snapshot feature lands.

The Option-B choice

Adds typed fields to MessageNode in verifiers/v1/graph.py: kind: NodeTag and snapshot_ref: str | None. The label/ref rides with the node — keyless, validated, visible in any trace viewer; both are plain str/None, so they ride the wire + JSON dump automatically (no _NODE_DUMP_EXCLUDE). Cost: a change to a core model shared by every env/trace/the trainer (upstream review + prime-rl pin bump). graph.py is the only file beyond the shared set that differs from Option A.

Files (+533)

verifiers/v1/graph.py — NodeTag + MessageNode.kind + MessageNode.snapshot_ref (the core change)
verifiers/v1/runtimes/base.py — snapshot/restore stubs
environments/compact/compact/{annotate,harness}.py — find + stamp compaction before/after on node.kind
environments/replay/replay_common/{base,selector}.py — shared base + the B reader
environments/replay/replay_{recheck,judge,compaction_after,compaction_before}/ — the four tasksets + pyproject.toml

🤖 Generated with Claude Code

Illustrative change for the replay-buffer design discussion: how a harness marks nodes a replay buffer can resume from. Two things are tagged because the typed graph can't express them: branch provenance (compaction) and tool failure status (failed_tool_call). Tool identity / "is this a tool call" stay intrinsic to the graph.

Option B adds a typed `kind: NodeTag` field to MessageNode (a core schema change), so the tag rides with the node (no keying, validated, rides the wire + dump automatically). The program is the sensor for tool failures (the trace drops isError/exit); the harness is the writer, stamping node.kind on the finished graph post-launch. annotate.py reads it.

Compare with Option A (trace.info side-channel, no schema change): exp/replay-node-tags-info.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-29T22:07:50Z

🟡 Medium

verifiers/verifiers/v1/graph.py

Line 128 in a8df793

model_config = ConfigDict(extra="forbid", arbitrary_types_allowed=True)

MessageNode.kind is annotated as NodeTag, a Literal union, but model_config doesn't set validate_assignment=True, so assigning a value that isn't one of the allowed tags (e.g. trace.nodes[nid].kind = rec["tag"] with a typo) stores it unchecked. That invalid tag then serializes to JSON and the replay resume-point selector silently misses it, because Pydantic never rejects the out-of-enum string. Consider enabling validate_assignment=True or gating tag writes through a validating helper.
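The second suggestion, gating tag writes through a validating helper, could look roughly like this. It is a standard-library sketch under assumptions: the tag set shown is illustrative (the real `NodeTag` lives in verifiers/v1/graph.py), and `Node` and `set_kind` are hypothetical names.

```python
from typing import Literal, Optional, get_args

# Assumed tag set for illustration only; not copied from graph.py.
TagLiteral = Literal["compaction_before", "compaction_after"]
ALLOWED_TAGS = frozenset(get_args(TagLiteral))


class Node:
    """Hypothetical stand-in for MessageNode with an unvalidated `kind` attribute."""

    def __init__(self) -> None:
        self.kind: Optional[str] = None


def set_kind(node: Node, tag: Optional[str]) -> None:
    """Write `tag` to `node.kind`, rejecting anything outside the Literal."""
    if tag is not None and tag not in ALLOWED_TAGS:
        raise ValueError(f"unknown node tag {tag!r}; expected one of {sorted(ALLOWED_TAGS)}")
    node.kind = tag
```

A typo such as `set_kind(node, "compaction_afer")` then fails at the write site instead of producing a tag the selector silently skips.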


…ng + ReplayTaskset (Option B)

Extends the Option-B node-field skeleton into a runnable replay-buffer slice:

- Drop the failed-tool-call path (deprioritized): program.py reverted; NodeTag is now compaction-only (compaction_before/compaction_after/subagent) and authored purely from graph structure (no program sensor).
- Two resume points per compaction: `compaction_after` (post-compaction branch start: continue from the compaction message) and `compaction_before` (pre-compaction branch leaf: the model regenerates the compaction, then continues). Stamped on MessageNode.kind.
- Typed `MessageNode.snapshot_ref` for exec/sandbox replay + Runtime.snapshot()/restore() stubs (verifiers/v1/runtimes/base.py); durable-ref contract documented. Capture (per-turn) is a follow-up framework hook; restore is wired in ReplayTaskset.setup (no-op while None).
- New `environments/replay` env: ReplayTaskset materializes one task per resume point from a buffer glob (offline), seeds task.prompt with the root->node prefix (run with the default harness), and stubs scoring (reuse the original verifier: TODO). Online/growing buffer noted as a rollout-time-sampling follow-up.

Option B stores tags/snapshot refs as typed MessageNode fields (core schema change). Compare with Option A (trace.info side-channel): exp/replay-node-tags-info.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-29T23:59:19Z

    """True iff a model call produced this message (the response passed to `commit`); False for
    every prompt-supplied message — including assistant/tool messages fabricated as context
    the model never generated, which role alone can't tell apart from real turns."""
+    kind: NodeTag = None


🟡 Medium v1/graph.py:84

Because Trace.nodes deduplicates on (parent, message_hash) — prepare_turn() resolves repeated prefix messages to one shared MessageNode — kind and snapshot_ref are occurrence-specific metadata stored on a shared node. When the same (parent, message_hash) is reached again in a later turn, writing trace.nodes[nid].kind = ... or snapshot_ref = ... overwrites the earlier occurrence's values, so resume_points()/snapshot_ref_of() return whichever occurrence wrote last rather than the one on the current path. Consider moving kind and snapshot_ref to a path-local store (keyed by node id within the branch) instead of mutating the shared MessageNode.
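One way to picture the suggested path-local store: keep occurrence metadata in a side table keyed by branch and node id rather than on the shared node. This is only a sketch; `PathLocalTags`, `Occurrence`, and the notion of a `branch_id` are hypothetical and do not appear in the PR.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Occurrence:
    """Metadata for one occurrence of a node on one path."""
    kind: Optional[str] = None
    snapshot_ref: Optional[str] = None


@dataclass
class PathLocalTags:
    """Hypothetical side store keyed by (branch_id, node_id)."""
    _by_occurrence: dict = field(default_factory=dict)

    def tag(self, branch_id: str, node_id: str, *,
            kind: Optional[str] = None, snapshot_ref: Optional[str] = None) -> None:
        # Each (branch, node) pair owns its entry, so a later branch that
        # reaches the same deduplicated node cannot overwrite an earlier one.
        self._by_occurrence[(branch_id, node_id)] = Occurrence(kind, snapshot_ref)

    def get(self, branch_id: str, node_id: str) -> Occurrence:
        # Untagged occurrences read back as empty metadata.
        return self._by_occurrence.get((branch_id, node_id), Occurrence())
```

With this layout, the shared node stays immutable after creation and a reader resolves tags against the path it is actually on.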


macroscopeapp · 2026-06-29T23:59:19Z

+        result = await runtime.run_program([*program, trace.task.prompt], env)
+
+        # Tag compaction resume points on the finished graph (Option B: typed MessageNode.kind).


🟡 Medium compact/harness.py:56

launch unconditionally overwrites trace.nodes[*].kind after runtime.run_program(...) returns, before the caller checks result.exit_code. When the compact harness crashes or exits non-zero after producing a partial trace, the compaction resume tags are still persisted on an errored rollout. ReplayTaskset.load_tasks() reads these tags from every stored trace without filtering trace.error, so failed or incomplete compact rollouts generate replay tasks that resume from bogus prefixes. Consider skipping the tagging when result.exit_code indicates failure, or guarding load_tasks() to skip traces with trace.error.

+        result = await runtime.run_program([*program, trace.task.prompt], env)
+        if result.exit_code != 0:
+            return result
+        # Tag compaction resume points on the finished graph (Option B: typed MessageNode.kind).
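The alternative fix, guarding the reader instead of the writer, might be sketched like this. It assumes stored traces are JSON objects, one per line, with an optional `error` field; `iter_clean_traces` is a hypothetical helper, not the PR's `load_tasks`.

```python
import json
from typing import Iterable, Iterator


def iter_clean_traces(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed traces from JSONL lines, skipping errored rollouts."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the buffer file
        trace = json.loads(line)
        if trace.get("error"):
            continue  # errored or incomplete rollouts never become replay tasks
        yield trace
```

Filtering on the read side also covers traces written before any writer-side fix, at the cost of leaving stale tags in the stored files.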


…me sampling (Option B)

- Scoring delegation: ReplayTaskset.score now reuses the ORIGINAL env's verifier (config.inner) by running its rewards/metrics over the replay trace with the original task swapped in. Original task + reward ride on the replay task (offline) or in trace.info["replay"] (online). Empty inner.id => no-op (skeleton-safe).
- Online buffer: config.mode="online" returns pool_size virtual slots; new ReplayHarness samples a stored trace + resume point from the LIVE buffer per rollout (re-globs each time), restores the snapshot, seeds the default chat loop with the root->node prefix via INITIAL_MESSAGES, and stashes provenance. Empty buffer => stop("replay_buffer_empty") (warmup). Offline path (materialize in load_tasks) unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The 'check your work, fix if wrong' mode: take the full sampled rollout, append a user turn (config.followup), and re-roll, scored by the same original verifier as the rest.

- Structural, tag-free: the recheck point is the rollout's final-answer leaf, so it works for ANY rollout (linear or compacting) and is identical across A/B (no tag dependency).
- selector: recheck_points() + build_seed() (appends the follow-up user turn for recheck; plain prefix otherwise). resume_points() now yields recheck alongside compaction kinds.
- taskset + harness: new `followup` config; `recheck` added to default `kinds`; both build seeds via build_seed so offline and online modes get the appended turn.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…de (Option B)

Cleanup:

- Shared DEFAULT_KINDS / DEFAULT_FOLLOWUP constants in selector.py, referenced by both the taskset and harness configs (was duplicated literals).
- Buffer reading centralized as selector.iter_traces; taskset drops its glob/json copy.
- resume_points folds recheck/judge final-leaf points in one place.

Judge mode (config: add "judge" to kinds):

- A judge replay point presents the rollout's transcript ("was this correct? yes/no") instead of continuing it (selector.judge_prompt / build_seed).
- ReplayTaskset.score grades the model's verdict against the original rollout's reward (original_reward > judge_threshold), a self-supervised correctness label; no inner verifier or snapshot needed. recheck/compaction_* still reuse the original verifier.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… sandbox needed)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tion B)

Was a single ReplayTaskset with a `kinds` list; now each mode is its own selectable taskset (taskset id), so an env config picks exactly one task type (no mode mixing).

Structure: one distribution (`environments/replay`) ships five top-level modules:

- replay_common: shared base (BaseReplayTaskset/BaseReplayHarness parameterized by KIND, buffer sourcing offline+online, seeding, snapshot restore, scoring) + selector.
- replay_recheck / replay_judge / replay_compaction_after / replay_compaction_before: thin modules each fixing KIND and bundling a harness (auto-selected via default_harness_id).

recheck/compaction_* reuse the original verifier; judge grades the verdict against the original reward. The bundled harness handles both modes: offline (materialized task.prompt -> default chat loop) and online (sample this KIND from the live buffer). Dropped the now-unused DEFAULT_KINDS constant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

macroscopeapp · 2026-06-30T00:46:43Z

+name = "replay"
+version = "0.1.0"
+description = "replay — replay-buffer tasksets (recheck / judge / compaction_before / compaction_after)."
+requires-python = ">=3.10"


🟠 High replay/pyproject.toml:5

requires-python = ">=3.10" allows installation on Python 3.10, but the only dependency verifiers requires >=3.11,<3.14. On Python 3.10, dependency resolution fails to find a compatible verifiers build, so installing replay fails. Consider aligning this constraint with verifiers by using >=3.11,<3.14.

Suggested change

-requires-python = ">=3.10"
+requires-python = ">=3.11,<3.14"


…ion B)

- Bug: BaseReplayHarness subclassed DefaultHarness without parameterizing Harness[ReplayHarnessConfig], so harness_config_type resolved to DefaultHarnessConfig (no buffer_glob/followup) -> self.config.buffer_glob would raise AttributeError in the online path. Now subclasses Harness[ReplayHarnessConfig] directly and seeds the default program itself (one unified offline/online seed path via _resolve_seed).
- resume_points: single get_tag per node instead of two.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…groups (Option B)

Was seeded by trace.id (unique per rollout), so the group_size rollouts of one group each sampled a different source rollout; GRPO's group-relative baseline would then compare across different problems. Seed by trace.task.idx (shared across a group) so all rollouts of a group replay the same source+point (N diverse continuations of one point). Freshness is preserved by the growing buffer (glob picks up new files, shifting each index's draw). Offline was already correct (a group = N rollouts of one materialized task).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
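The task-index seeding described in this commit can be illustrated with a small sketch. It assumes the buffer is an in-memory list and uses Python's `random.Random` as the generator; `sample_source` is a hypothetical name, and the real harness samples a stored trace plus a resume point rather than a bare list entry.

```python
import random
from typing import Optional, Sequence


def sample_source(buffer: Sequence[str], task_idx: int) -> Optional[str]:
    """Pick one buffer entry deterministically from the task index.

    Every rollout in a group shares task_idx, so all of them draw the same
    entry for a given buffer state.
    """
    if not buffer:
        return None  # the caller would stop with "replay_buffer_empty" here
    rng = random.Random(task_idx)  # seeded by task index, not by rollout id
    return buffer[rng.randrange(len(buffer))]
```

Seeding by a per-rollout id instead would give each group member its own generator state, which is the bug the commit describes.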

mikasenghaas mentioned this pull request Jun 29, 2026

Option A: tag replay resume points via trace.info #1895

Draft

macroscopeapp Bot reviewed Jun 29, 2026
