
feat(v1): agent runs — judges as first-class harness+model agents (single-call to in-world)#1910

Draft
hallerite wants to merge 2 commits into main from feat/agent-runs

@hallerite hallerite commented Jul 1, 2026

Member

Design-inspiration draft: judges as first-class agent runs, built on the premise that a judge is not a special kind of rubric — it is another agent (harness + model), and the primitive verifiers is missing is the ability to execute more than one role-tagged agent within a rollout's lifecycle. One spec covers the whole spectrum: with the tool-less null harness it is the classic single-call LLM judge; with a tools harness it is a full agent grading inside the world the policy actually mutated.

The data model (where the design actually lives)

  • AgentSpec: who an agent is. Fields: harness (any harness, including off-the-shelf coding agents), model (a logical name), placement, budget, trainable. JudgeSpec adds name, prompt (the grading criteria, supplied already rendered: the judges() hook receives the task and trace, so prompts are plain f-strings and need no template language), verdict (a pydantic schema), and verdict_source (see below).
  • AgentRun on trace.agents: the provenance carrier, {name, role, model, trainable, trace, verdict}. The judge's own conversation is a full nested Trace, as inspectable as the policy's; its spend folds into extra_usage. trainable=False keeps judge tokens out of training data by construction. The trainer-side rule becomes "train on a trace iff its model is the policy and it opted in", which is also exactly the seam multi-agent needs later.
  • Model table: EnvConfig.models binds logical names to endpoints (ModelEndpointConfig: base_url/key-var/model/sampling). Tasksets say what they need (model="grader"); run config says where that resolves. Endpoints never live in taskset code. "policy" is reserved for the rollout's own model context, so self-judging is a deliberate one-word choice, not an accident. Environment.aclose() closes the table's shared clients.
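
The provenance record, the trainer-side rule, and the model-table lookup can be sketched in plain Python. This is a minimal illustration, not the PR's code: it uses stdlib dataclasses where the PR uses pydantic, and `ModelEndpoint`, `should_train`, `resolve_model`, and the example URLs and model names are hypothetical stand-ins chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

POLICY = "policy"  # reserved logical name: the rollout's own model context


@dataclass
class AgentRun:
    """Provenance record for one agent run (field names from the PR text)."""
    name: str
    role: str
    model: str
    trainable: bool = False
    trace: dict = field(default_factory=dict)  # stand-in for a nested Trace
    verdict: Optional[Any] = None


@dataclass
class ModelEndpoint:
    """Hypothetical, reduced stand-in for ModelEndpointConfig."""
    base_url: str
    model: str


def should_train(run: AgentRun) -> bool:
    # Rule quoted in the description: train on a trace iff its model is
    # the policy and it opted in.
    return run.model == POLICY and run.trainable


def resolve_model(table: dict, name: str, policy: ModelEndpoint) -> ModelEndpoint:
    # "policy" is never looked up in the table; it is the rollout's own model.
    if name == POLICY:
        return policy
    if name not in table:
        raise KeyError(f"model {name!r} is not bound in the model table")
    return table[name]


# Hypothetical run config: the taskset only ever says model="grader".
table = {"grader": ModelEndpoint("http://localhost:8001/v1", "grader-model")}
policy = ModelEndpoint("http://localhost:8000/v1", "policy-model")

judge_run = AgentRun(name="correct", role="judge", model="grader")
self_judge = AgentRun(name="self", role="judge", model=POLICY)
policy_run = AgentRun(name="policy", role="policy", model=POLICY, trainable=True)
```

Under these assumptions a self-judging run (`model="policy"`, `trainable=False`) shares the policy's endpoint but is still excluded from training data, because the rule needs both conditions.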

Execution

run_agent is a standalone executor, deliberately not a private method of Rollout: it owns placement, the run's own interception session (turns recorded onto the run's trace, AgentBudget enforced between turns via the same RolloutLimits mechanism that bounds the policy), input-file materialization, and output collection before an owned runtime is torn down. A failed run captures its error onto its own trace before re-raising, so the provenance record is self-contained. Rollout is merely the first caller — group-scope scorers (agentic credit assignment at the algorithm layer) and RUN-stage teammates are future callers of the same function.
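
Two behaviours in that paragraph, the budget check between turns and error capture on the run's own trace, can be sketched as follows. This is an illustrative reduction under stated assumptions: `RunTrace`, `BudgetExceeded`, `run_agent_sketch`, and the `step` callback are hypothetical names, and the real executor also handles placement, interception, and file I/O, which are omitted here.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Optional


class BudgetExceeded(RuntimeError):
    """Hypothetical error for a run that exhausts its turn budget."""


@dataclass
class RunTrace:
    """Hypothetical stand-in for the run's own Trace."""
    turns: list = field(default_factory=list)
    error: Optional[str] = None


async def run_agent_sketch(
    trace: RunTrace,
    step: Callable[[int], Awaitable[Optional[str]]],
    max_turns: int,
) -> RunTrace:
    try:
        turn = 0
        while True:
            # The budget is checked between turns, never in the middle of one.
            if turn >= max_turns:
                raise BudgetExceeded(f"turn budget of {max_turns} exhausted")
            reply = await step(turn)
            if reply is None:  # the agent signalled that it is done
                break
            trace.turns.append(reply)
            turn += 1
    except Exception as exc:
        # Record the failure on the run's own trace, then re-raise so the
        # caller still sees it; the trace stays self-contained either way.
        trace.error = f"{type(exc).__name__}: {exc}"
        raise
    return trace


async def two_turns(turn: int) -> Optional[str]:
    return f"turn {turn}" if turn < 2 else None


ok_trace = asyncio.run(run_agent_sketch(RunTrace(), two_turns, max_turns=5))
```

Passing the trace in, rather than creating it inside the function, is one way to keep the record reachable after the exception propagates.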

Judges execute in the SCORING stage, inside the existing scoring_timeout, with the runtime still live:

  • The judges() hook is injectable like rewards — any subset of task, trace, runtime. The task supplies criteria, the trace supplies evidence for reply-verdict judges, and conditional judging is return [] when a programmatic check already settles the grade.
  • Two verdict channels, derived from the harness (verdict_source): a tools judge gets the rollout's records as files (task.json / transcript.md / trace.json under a per-run /tmp/vf-agent/<id>/ — never the rollout's workdir, which is the world being judged) and writes verdict.json; a null-harness judge has no tools, so its evidence rides in its prompt and its final reply is the verdict JSON (fenced replies tolerated). The choice is explicit, not a fallback — a tools judge that fails to write the file is an error, never quietly re-parsed from prose.
  • Placement "rollout" (default) provisions the judge into the rollout's live runtime — only after the policy finished, so the policy can never observe or tamper with judge machinery. Judges sharing the live runtime run sequentially (one world, one provisioning path — no races); a RuntimeConfig placement gives a fresh clean-room runtime and runs concurrently. All judges read one input snapshot taken before any judge runs — judges grade the rollout, not each other's verdicts.
  • Failure is strict: a missing or invalid verdict raises JudgeError and fails the rollout — never a silent 0 reward, which would poison group baselines.
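
The reply channel and the strict-failure rule can be sketched together. This is a minimal sketch, assuming a verdict with a single boolean `correct` field; the PR validates against a pydantic schema, whereas this version uses a dataclass and a hand-written check to stay stdlib-only, and `parse_reply_verdict` is a hypothetical helper name. Only `JudgeError` is a name taken from the PR text.

```python
import json
import re
from dataclasses import dataclass


class JudgeError(RuntimeError):
    """Raised when a judge yields no usable verdict."""


@dataclass
class CorrectVerdict:
    correct: bool


_TICKS = "`" * 3  # a Markdown code fence, built here to avoid a literal one
_FENCE = re.compile(rf"^{_TICKS}[\w-]*[ \t]*\n(.*?)\n?{_TICKS}$", re.DOTALL)


def parse_reply_verdict(reply: str) -> CorrectVerdict:
    text = reply.strip()
    match = _FENCE.match(text)
    if match:
        # Tolerate a fenced reply by unwrapping the fence first.
        text = match.group(1)
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # Never fall back to a silent 0 reward: fail loudly instead.
        raise JudgeError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("correct"), bool):
        raise JudgeError("verdict must be an object with a boolean 'correct'")
    return CorrectVerdict(correct=data["correct"])


fenced_reply = f'{_TICKS}json\n{{"correct": false}}\n{_TICKS}'
plain = parse_reply_verdict('{"correct": true}')
fenced = parse_reply_verdict(fenced_reply)
```

A prose reply such as "Looks right to me" raises `JudgeError` here rather than being scored, which mirrors the rule that an invalid verdict fails the rollout.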

The single-call judge, on a real taskset shape (this is what vf.Judge does today, as one spec):

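# Two Taskset methods. CorrectVerdict is assumed to be a pydantic model with a
# boolean `correct` field; `vf` is the verifiers package.
async def judges(self, task, trace):
    # The null harness has no tools, so the evidence rides in the prompt.
    return [vf.JudgeSpec(
        name="correct",
        prompt=f"Question: {task.question}\nGold: {task.answer}\n"
               f"Response: {trace.last_reply}\nIs the response correct?",
        verdict=CorrectVerdict, harness={"id": "null"}, model="grader",
    )]

@vf.reward
async def correct(self, verdicts) -> float:
    # Verdicts arrive keyed by judge name.
    return float(verdicts["correct"].correct)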

Drop the harness={"id": "null"} and add a budget and the same judge becomes an agent that inspects the repo and re-runs the tests.

vf.Judge is superseded, not removed

vf.Judge (the hand-rolled single-call helper) predates this and is kept working for its ~18 downstream research-environments users; its docstring and the GUIDE now point at JudgeSpec. Migrating those envs and deleting vf.Judge is a coordinated follow-up with a verifiers bump, not this PR.

What fought back (adaptations from the sketch)

  1. Placement could not ride inside harness.runtime — the RuntimeConfig union's discriminator feeds runtime_is_local / resolve_runtime_config everywhere; a "rollout" variant would have polluted every consumer. It is a separate AgentSpec.placement field; the spec's harness.runtime is documented as ignored.
  2. AgentRun construction had to move after the run: validating a live Trace into the wire type snapshots (copies) it, so building the record before the run silently dropped everything the session recorded. Caught by the integration test.
  3. AgentSpec.harness needed the same id→config narrowing as EnvConfig (narrow_plugin_field) — a bare HarnessConfig(id="default") lacks the default harness's own fields. Also caught by the integration test.
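
The ordering bug in item 2 can be reproduced in miniature. This is a hedged illustration only: `copy.deepcopy` stands in for the copy that validation into the wire type makes, and `LiveTrace`, `Record`, `make_record`, and `run` are hypothetical names.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class LiveTrace:
    turns: list = field(default_factory=list)


@dataclass
class Record:
    trace: LiveTrace


def make_record(trace: LiveTrace) -> Record:
    # Stand-in for validating into the wire type: the record holds a copy,
    # not a reference to the live object.
    return Record(trace=copy.deepcopy(trace))


def run(trace: LiveTrace) -> None:
    trace.turns.append("judge turn")  # what the session records during the run


# Buggy order: snapshot first, run second. The record misses the turn.
live = LiveTrace()
early = make_record(live)
run(live)

# Fixed order: run first, snapshot second.
live2 = LiveTrace()
run(live2)
late = make_record(live2)
```

Here `early.trace.turns` stays empty while `late.trace.turns` holds the recorded turn, which is the silent data loss the integration test caught.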

Review findings (macroscope) — disposition

  • Fixed: error capture on the agent's own trace; collect-loop logging (a dead runtime is no longer indistinguishable from a judge that wrote nothing — the runtimes' read has no uniform missing-file signal, so the honest fix is logging the cause); sequential rollout-placed judges; one pre-judge input snapshot; model-table client cleanup (Environment.aclose, called by run_eval and EnvServer).
  • Obsolete: the seed-task-vs-solver-task and double-harness_timeout findings lived in the topology commit, which was dropped from this PR (see below).

Validation

  • tests/v1/test_agent.py — unit tests (spec/table/budget/verdict-source/fences/round-trip/dup-names), plus two keyless full-path integration tests in the subprocess runtime against scripted local endpoints: test_judged_rollout_against_stub (file-verdict judge whose canned bash tool-call actually executes and writes the verdict) and test_reply_judged_rollout_against_stub (a null-harness judge whose fenced JSON reply is parsed as the verdict; exercises the injectable judges(task, trace) hook). All 34 non-e2e tests pass.
  • tests/v1/test_e2e.py::test_judged / ::test_reply_judged — real-model e2e twins across the harness-runtime matrix (key-gated like their siblings).

Deliberately out of scope / follow-ups

  • Slots + topologies (multi-agent episodes) — an earlier revision of this branch carried a topology layer (single / proposer_solver, slot bindings under EnvConfig.slots); it was deliberately dropped to keep this PR the primitive, and is parked on feat/topologies-parked for the follow-up. Judging is orthogonal to episode structure and composes with any future topology.
  • Migrating the ~18 research-environments users of vf.Judge and deleting it — coordinated with a bump.
  • Folding the policy's own conversation into agents["policy"] — the full rollout-record/agent-trace split; the provenance semantics are already those of the final shape.
  • @group_reward removal + algorithm-owned cohort scoring (ranking judges, agentic credit assignment) — separate change; run_agent is already the executor it will call, which is why it is not a Rollout method.
  • prime-rl consumption — the wire change is additive (trace.agents rides along); the orchestrator threading of EnvConfig.models and provenance-based sample selection land with the verifiers bump.

🤖 Generated with Claude Code

An AgentSpec names who an agent is (harness + model name + placement + budget);
a JudgeSpec is an agent run executed in the SCORING stage, while the runtime is
live, to grade the finished rollout with a typed verdict. run_agent is the
standalone executor (placement, own interception session, file materialization,
output collection) — Rollout calls it for judges today; group-scope scorers and
RUN-stage agents are future callers of the same primitive.

- Taskset.judges(task) -> list[JudgeSpec]; @reward/@metric receive verdicts by name
- placement: "rollout" (provisioned into the live runtime, post-run only) or a
  fresh RuntimeConfig (clean-room container / subprocess for trace-only judges)
- file I/O contract (task.json / transcript.md / trace.json in, verdict.json out,
  schema validated) so any harness can judge; missing/invalid verdict fails the
  rollout via JudgeError — never a silent 0 reward
- provenance: runs recorded as AgentRun{name, role, model, trainable, trace,
  verdict} on trace.agents; judge spend folds into extra_usage; trainable=False
  keeps judge tokens out of training data by construction
- model table: EnvConfig.models binds logical names (AgentSpec.model) to
  endpoints (ModelEndpointConfig); "policy" = the rollout's own model context
- budgets enforced between turns via the run's own RolloutLimits session

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Comment thread verifiers/v1/agent.py
Comment thread verifiers/v1/env.py
Comment thread verifiers/v1/agent.py Outdated
)
)

results = await asyncio.gather(*(_run_one(spec) for spec in specs))

🟠 High v1/agent.py:406

run_judges launches all judges concurrently via asyncio.gather, but the default JudgeSpec.placement is "rollout", so every judge runs in parallel against the same live rollout runtime and workdir. Two judges that read or mutate files in that shared sandbox will race with each other and observe each other's side effects, making their verdicts non-independent and non-reproducible. Consider either running "rollout"-placement judges sequentially, or defaulting judge placement to a fresh per-judge runtime so each gets an isolated sandbox.

Member Author

Fixed in 5f26ea5, different remedy than suggested: rollout-placed judges now run sequentially (they share one world and one harness-provisioning path — the sharper race is two concurrent harness.setup(runtime) calls), while own-runtime judges stay concurrent. Defaulting to fresh per-judge runtimes would defeat the point of "rollout" placement (inspecting the live world) and cost real sandbox money.
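
The scheduling described in this reply can be sketched with asyncio. This is a sketch under assumptions, not the code from 5f26ea5: `Spec`, `run_judges_sketch`, and the `run_one` callback are hypothetical, and whether the real sequential chain overlaps with the own-runtime judges is not stated in the thread; this version lets them overlap.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Spec:
    name: str
    placement: str = "rollout"  # any other value means an owned runtime here


async def run_judges_sketch(specs, run_one):
    shared = [s for s in specs if s.placement == "rollout"]
    owned = [s for s in specs if s.placement != "rollout"]

    async def run_shared():
        # One world and one provisioning path: strictly one judge at a time.
        return [await run_one(s) for s in shared]

    # Own-runtime judges are isolated, so they can run concurrently.
    shared_results, *owned_results = await asyncio.gather(
        run_shared(), *(run_one(s) for s in owned)
    )
    results = dict(zip((s.name for s in shared), shared_results))
    results.update(zip((s.name for s in owned), owned_results))
    return results


async def demo_run(spec: Spec) -> str:
    await asyncio.sleep(0)  # placeholder for the actual judge run
    return f"verdict from {spec.name}"


demo = asyncio.run(
    run_judges_sketch([Spec("a"), Spec("b"), Spec("c", placement="fresh")], demo_run)
)
```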

Comment thread verifiers/v1/agent.py
    raise session.error
trace.timing.generation.end = time.time()
for path in collect:
    try:

🟡 Medium v1/agent.py:310

The collect loop in run_agent catches every exception from runtime.read() and sets the result to None, so transport failures, crashed containers, and permission errors are indistinguishable from a genuinely missing file. In run_judges this causes a verdict.json read failure to be misreported as "judge ... finished without writing a verdict", hiding the real runtime error. Consider catching only the missing-file error (e.g. FileNotFoundError) and letting other exceptions propagate.

Member Author

Partially fixed in 5f26ea5: the swallowed exception is now logged with its traceback, so a dead runtime is no longer silent. Catching only FileNotFoundError doesn't work — only the subprocess runtime raises it; docker/prime/modal wrap every read failure (missing file, dead container, transport) into an undifferentiated SandboxError. A uniform missing-file signal on Runtime.read is a runtime-taxonomy change, deferred.
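
The partial fix can be sketched as a collect loop that still maps every read failure to None but logs the cause. This is an illustrative sketch, not the PR's code: `collect_outputs` and the logger name are hypothetical, and `read` is synchronous here for brevity whereas a real runtime read is likely asynchronous.

```python
import logging
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("agent.collect")  # hypothetical logger name


def collect_outputs(
    read: Callable[[str], bytes], paths: List[str]
) -> Dict[str, Optional[bytes]]:
    outputs: Dict[str, Optional[bytes]] = {}
    for path in paths:
        try:
            outputs[path] = read(path)
        except Exception:
            # Runtimes do not share a missing-file error type, so every
            # failure still becomes None, but the traceback is logged so a
            # dead runtime is distinguishable from a file never written.
            logger.warning("could not read %s from runtime", path, exc_info=True)
            outputs[path] = None
    return outputs


def fake_read(path: str) -> bytes:
    # Hypothetical runtime whose transport fails for one file.
    if path == "verdict.json":
        raise ConnectionError("runtime gone")
    return b"ok"


collected = collect_outputs(fake_read, ["notes.txt", "verdict.json"])
```

`exc_info=True` attaches the active exception and its traceback to the log record, so the caller's "finished without writing a verdict" message is no longer the only trace of a transport failure.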

Comment thread verifiers/v1/rollout.py Outdated
Comment thread verifiers/v1/trace.py
Comment thread verifiers/v1/rollout.py
One judge API: a JudgeSpec with the tool-less null harness is the classic
single-call LLM judge — its final reply is the verdict (verdict_source="reply",
derived from the harness; fenced JSON tolerated), its evidence rides in the
prompt. vf.Judge is superseded and kept for existing downstream tasksets.
The judges() hook is now async + injectable (task/trace/runtime), enabling
evidence-in-prompt judges and conditional judging.

Review hardening (macroscope findings):
- run_agent captures failures onto the run's own trace before re-raising
- collect logs swallowed read errors (dead runtime vs missing file)
- judges sharing the live rollout runtime run sequentially (one world,
  no provisioning races); own-runtime judges stay concurrent
- judge inputs snapshot once before any judge runs (no cross-verdict leaks)
- Environment.aclose() closes the model table's shared clients (run_eval +
  EnvServer call it)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hallerite hallerite changed the title feat(v1): agent runs — agentic judges as first-class harness+model runs feat(v1): agent runs — judges as first-class harness+model agents (single-call to in-world) Jul 2, 2026