feat(v1): agent runs — judges as first-class harness+model agents (single-call to in-world)#1910
hallerite wants to merge 2 commits into
An AgentSpec names who an agent is (harness + model name + placement + budget); a JudgeSpec is an agent run executed in the SCORING stage, while the runtime is live, to grade the finished rollout with a typed verdict. run_agent is the standalone executor (placement, own interception session, file materialization, output collection). Rollout calls it for judges today; group-scope scorers and RUN-stage agents are future callers of the same primitive.

- Taskset.judges(task) -> list[JudgeSpec]; @reward/@Metric receive verdicts by name
- placement: "rollout" (provisioned into the live runtime, post-run only) or a fresh RuntimeConfig (clean-room container / subprocess for trace-only judges)
- file I/O contract (task.json / transcript.md / trace.json in, verdict.json out, schema validated) so any harness can judge; a missing or invalid verdict fails the rollout via JudgeError, never a silent 0 reward
- provenance: runs recorded as AgentRun{name, role, model, trainable, trace, verdict} on trace.agents; judge spend folds into extra_usage; trainable=False keeps judge tokens out of training data by construction
- model table: EnvConfig.models binds logical names (AgentSpec.model) to endpoints (ModelEndpointConfig); "policy" = the rollout's own model context
- budgets enforced between turns via the run's own RolloutLimits session

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
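The verdict half of the file I/O contract can be illustrated with a minimal sketch. It assumes the collected `verdict.json` content arrives as a string, or `None` when the file was not produced. `PassFailVerdict` and `load_verdict` are hypothetical names invented for this example, the `JudgeError` class here is only a stand-in for the PR's exception, and a plain dataclass with manual checks stands in for the PR's schema validation so the block runs on the standard library alone.

```python
import json
from dataclasses import dataclass
from typing import Optional


class JudgeError(Exception):
    """Stand-in for the PR's JudgeError: the judge produced no usable verdict."""


@dataclass(frozen=True)
class PassFailVerdict:
    # Hypothetical verdict schema, invented for this example.
    passed: bool
    reason: str


def load_verdict(raw: Optional[str], judge_name: str) -> PassFailVerdict:
    """Turn collected verdict.json text into a typed verdict, or raise."""
    if raw is None:
        # Missing file: fail loudly instead of scoring the rollout as 0.
        raise JudgeError(f"judge {judge_name!r} finished without writing a verdict")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise JudgeError(f"judge {judge_name!r} wrote invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise JudgeError(f"judge {judge_name!r} verdict is not a JSON object")
    passed, reason = data.get("passed"), data.get("reason")
    if not isinstance(passed, bool) or not isinstance(reason, str):
        raise JudgeError(f"judge {judge_name!r} verdict does not match the schema")
    return PassFailVerdict(passed=passed, reason=reason)
```

Every failure path raises, so a caller cannot mistake a broken judge for a low score.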
| ) | ||
| ) | ||
| results = await asyncio.gather(*(_run_one(spec) for spec in specs)) |
🟠 High v1/agent.py:406
run_judges launches all judges concurrently via asyncio.gather, but the default JudgeSpec.placement is "rollout", so every judge runs in parallel against the same live rollout runtime and workdir. Two judges that read or mutate files in that shared sandbox will race with each other and observe each other's side effects, making their verdicts non-independent and non-reproducible. Consider either running "rollout"-placement judges sequentially, or defaulting judge placement to a fresh per-judge runtime so each gets an isolated sandbox.
Fixed in 5f26ea5, different remedy than suggested: rollout-placed judges now run sequentially (they share one world and one harness-provisioning path — the sharper race is two concurrent harness.setup(runtime) calls), while own-runtime judges stay concurrent. Defaulting to fresh per-judge runtimes would defeat the point of "rollout" placement (inspecting the live world) and cost real sandbox money.
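The scheduling described in this reply can be sketched as follows. This is an illustration under stated assumptions, not the code in 5f26ea5: `JudgeSpecSketch`, `run_judges_sketch`, and the `run_one` callback are hypothetical names, judge names are assumed unique, and letting the sequential chain overlap with the concurrent group is a choice made for this sketch only.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List, Sequence


@dataclass(frozen=True)
class JudgeSpecSketch:
    name: str
    placement: Any  # "rollout", or any other value standing in for a RuntimeConfig


async def run_judges_sketch(
    specs: Sequence[JudgeSpecSketch],
    run_one: Callable[[JudgeSpecSketch], Awaitable[Any]],
) -> List[Any]:
    shared = [s for s in specs if s.placement == "rollout"]
    isolated = [s for s in specs if s.placement != "rollout"]
    results: Dict[str, Any] = {}

    async def run_shared() -> None:
        # One live runtime, one provisioning path: strictly one judge at a time.
        for spec in shared:
            results[spec.name] = await run_one(spec)

    async def run_isolated() -> None:
        # Each of these judges owns a fresh runtime, so they can overlap freely.
        outputs = await asyncio.gather(*(run_one(s) for s in isolated))
        for spec, output in zip(isolated, outputs):
            results[spec.name] = output

    await asyncio.gather(run_shared(), run_isolated())
    # Return results in the order the specs were given.
    return [results[s.name] for s in specs]
```

The shared-runtime judges never overlap with each other, which is the property the fix needs; the isolated judges keep their concurrency.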
| raise session.error | ||
| trace.timing.generation.end = time.time() | ||
| for path in collect: | ||
| try: |
🟡 Medium v1/agent.py:310
The collect loop in run_agent catches every exception from runtime.read() and sets the result to None, so transport failures, crashed containers, and permission errors are indistinguishable from a genuinely missing file. In run_judges this causes a verdict.json read failure to be misreported as "judge ... finished without writing a verdict", hiding the real runtime error. Consider catching only the missing-file error (e.g. FileNotFoundError) and letting other exceptions propagate.
Partially fixed in 5f26ea5: the swallowed exception is now logged with its traceback, so a dead runtime is no longer silent. Catching only FileNotFoundError doesn't work — only the subprocess runtime raises it; docker/prime/modal wrap every read failure (missing file, dead container, transport) into an undifferentiated SandboxError. A uniform missing-file signal on Runtime.read is a runtime-taxonomy change, deferred.
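A minimal sketch of the partial fix described in this reply: the read failure is still converted to `None`, but the cause is logged with its traceback. The names `collect_outputs` and `read`, and the logger name, are assumptions for illustration; the actual loop lives inside `run_agent` and reads through the runtime.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Dict, Iterable, Optional

logger = logging.getLogger("agent_sketch")  # hypothetical logger name


async def collect_outputs(
    read: Callable[[str], Awaitable[str]],
    paths: Iterable[str],
) -> Dict[str, Optional[str]]:
    """Read each output path; a failed read becomes None, with the cause logged."""
    collected: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            collected[path] = await read(path)
        except Exception:
            # The runtime does not distinguish "missing file" from "dead
            # container", so the result stays None, but exc_info=True keeps
            # the traceback in the log for whoever debugs the run.
            logger.warning("collect: could not read %s", path, exc_info=True)
            collected[path] = None
    return collected
```

The caller still sees `None` for any failed read, so the "finished without writing a verdict" message can still appear for a dead runtime; the log is where the two cases become distinguishable.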
One judge API: a JudgeSpec with the tool-less null harness is the classic single-call LLM judge. Its final reply is the verdict (verdict_source="reply", derived from the harness; fenced JSON tolerated), and its evidence rides in the prompt. vf.Judge is superseded and kept for existing downstream tasksets. The judges() hook is now async + injectable (task/trace/runtime), enabling evidence-in-prompt judges and conditional judging.

Review hardening (macroscope findings):
- run_agent captures failures onto the run's own trace before re-raising
- collect logs swallowed read errors (dead runtime vs missing file)
- judges sharing the live rollout runtime run sequentially (one world, no provisioning races); own-runtime judges stay concurrent
- judge inputs snapshot once before any judge runs (no cross-verdict leaks)
- Environment.aclose() closes the model table's shared clients (run_eval + EnvServer call it)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
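How a reply verdict with "fenced JSON tolerated" might be parsed can be sketched as below. The commit message does not spell out the exact tolerance rules, so this assumes the simplest case: the whole reply is either bare JSON or a single fenced block with an optional `json` tag. `extract_reply_json` is a hypothetical helper name, not a function from the PR.

```python
import json
from typing import Any

# Three backticks, built this way so the sketch can sit inside a fenced block.
FENCE = "`" * 3


def extract_reply_json(reply: str) -> Any:
    """Parse a judge's final reply as JSON, unwrapping one surrounding fence."""
    text = reply.strip()
    fenced = (
        len(text) >= 2 * len(FENCE)
        and text.startswith(FENCE)
        and text.endswith(FENCE)
    )
    if fenced:
        body = text[len(FENCE):-len(FENCE)]
        # Drop the opening fence line when it is empty or a "json" tag.
        first_line, newline, rest = body.partition("\n")
        if newline and first_line.strip().lower() in ("", "json"):
            body = rest
        text = body.strip()
    # Anything that is not valid JSON raises json.JSONDecodeError here.
    return json.loads(text)
```

Prose replies raise instead of being guessed at; schema validation of the parsed object would follow as a separate step.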
aa05ccd to 5f26ea5
Design-inspiration draft: judges as first-class agent runs, built on the premise that a judge is not a special kind of rubric. It is another agent (harness + model), and the primitive `verifiers` is missing is the ability to execute more than one role-tagged agent within a rollout's lifecycle. One spec covers the whole spectrum: with the tool-less `null` harness it is the classic single-call LLM judge; with a tools harness it is a full agent grading inside the world the policy actually mutated.

The data model (where the design actually lives)
- `AgentSpec`: who an agent is: `harness` (any harness, including off-the-shelf coding agents), `model` (a logical name), `placement`, `budget`, `trainable`.
- `JudgeSpec` adds `name`, `prompt` (grading criteria, written already rendered; the `judges()` hook receives the task and trace, so prompts are f-strings, no template language), `verdict` (a pydantic schema), and `verdict_source` (see below).
- `AgentRun` on `trace.agents`: the provenance carrier, `{name, role, model, trainable, trace, verdict}`. The judge's own conversation is a full nested `Trace`, as inspectable as the policy's; its spend folds into `extra_usage`. `trainable=False` keeps judge tokens out of training data by construction. The trainer-side rule becomes "train on a trace iff its model is the policy and it opted in", which is also exactly the seam multi-agent needs later.
- `EnvConfig.models` binds logical names to endpoints (`ModelEndpointConfig`: base_url/key-var/model/sampling). Tasksets say what they need (`model="grader"`); run config says where that resolves. Endpoints never live in taskset code. `"policy"` is reserved for the rollout's own model context, so self-judging is a deliberate one-word choice, not an accident. `Environment.aclose()` closes the table's shared clients.

Execution
`run_agent` is a standalone executor, deliberately not a private method of `Rollout`: it owns placement, the run's own interception session (turns recorded onto the run's trace, `AgentBudget` enforced between turns via the same `RolloutLimits` mechanism that bounds the policy), input-file materialization, and output collection before an owned runtime is torn down. A failed run captures its error onto its own trace before re-raising, so the provenance record is self-contained. `Rollout` is merely the first caller; group-scope scorers (agentic credit assignment at the algorithm layer) and RUN-stage teammates are future callers of the same function.

Judges execute in the SCORING stage, inside the existing `scoring_timeout`, with the runtime still live:

- The `judges()` hook is injectable like rewards: any subset of `task`, `trace`, `runtime`. The task supplies criteria, the trace supplies evidence for reply-verdict judges, and conditional judging is `return []` when a programmatic check already settles the grade.
- Verdict source (`verdict_source`): a tools judge gets the rollout's records as files (`task.json`/`transcript.md`/`trace.json` under a per-run `/tmp/vf-agent/<id>/`, never the rollout's workdir, which is the world being judged) and writes `verdict.json`; a `null`-harness judge has no tools, so its evidence rides in its prompt and its final reply is the verdict JSON (fenced replies tolerated). The choice is explicit, not a fallback: a tools judge that fails to write the file is an error, never quietly re-parsed from prose.
- Placement: `"rollout"` (default) provisions the judge into the rollout's live runtime, and only after the policy has finished, so the policy can never observe or tamper with judge machinery. Judges sharing the live runtime run sequentially (one world, one provisioning path, no races); a `RuntimeConfig` placement gives a fresh clean-room runtime and runs concurrently. All judges read one input snapshot taken before any judge runs: judges grade the rollout, not each other's verdicts.
- A missing or invalid verdict raises `JudgeError` and fails the rollout. It is never a silent 0 reward, which would poison group baselines.

The single-call judge, on a real taskset shape (this is what `vf.Judge` does today, as one spec):

[…]

Drop the `harness={"id": "null"}` and add a `budget` and the same judge becomes an agent that inspects the repo and re-runs the tests.

`vf.Judge` is superseded, not removed

`vf.Judge` (the hand-rolled single-call helper) predates this and is kept working for its ~18 downstream research-environments users; its docstring and the GUIDE now point at `JudgeSpec`. Migrating those envs and deleting `vf.Judge` is a coordinated follow-up with a verifiers bump, not this PR.

What fought back (adaptations from the sketch)
- Placement could not live on `harness.runtime`: the `RuntimeConfig` union's discriminator feeds `runtime_is_local`/`resolve_runtime_config` everywhere; a `"rollout"` variant would have polluted every consumer. It is a separate `AgentSpec.placement` field; the spec's `harness.runtime` is documented as ignored.
- `AgentRun` construction had to move after the run: validating a live `Trace` into the wire type snapshots (copies) it, so building the record before the run silently dropped everything the session recorded. Caught by the integration test.
- `AgentSpec.harness` needed the same id→config narrowing as `EnvConfig` (`narrow_plugin_field`): a bare `HarnessConfig(id="default")` lacks the default harness's own fields. Also caught by the integration test.

Review findings (macroscope): disposition
- […] `read` has no uniform missing-file signal, so the honest fix is logging the cause); sequential rollout-placed judges; one pre-judge input snapshot; model-table client cleanup (`Environment.aclose`, called by `run_eval` and `EnvServer`).
- […] `harness_timeout` findings lived in the topology commit, which was dropped from this PR (see below).

Validation
- `tests/v1/test_agent.py`: unit tests (spec/table/budget/verdict-source/fences/round-trip/dup-names), plus two keyless full-path integration tests in the subprocess runtime against scripted local endpoints: `test_judged_rollout_against_stub` (file-verdict judge whose canned bash tool-call actually executes and writes the verdict) and `test_reply_judged_rollout_against_stub` (a `null`-harness judge whose fenced JSON reply is parsed as the verdict; exercises the injectable `judges(task, trace)` hook). All 34 non-e2e tests pass.
- `tests/v1/test_e2e.py::test_judged` / `::test_reply_judged`: real-model e2e twins across the harness-runtime matrix (key-gated like their siblings).

Deliberately out of scope / follow-ups
- […] `single`/`proposer_solver`, slot bindings under `EnvConfig.slots`); it was deliberately dropped to keep this PR the primitive, and is parked on `feat/topologies-parked` for the follow-up. Judging is orthogonal to episode structure and composes with any future topology.
- Migrating downstream environments off `vf.Judge` and deleting it: coordinated with a bump.
- […] `agents["policy"]`: the full rollout-record/agent-trace split; the provenance semantics are already those of the final shape.
- `@group_reward` removal + algorithm-owned cohort scoring (ranking judges, agentic credit assignment): separate change; `run_agent` is already the executor it will call, which is why it is not a `Rollout` method.
- […] `trace.agents` rides along); the orchestrator threading of `EnvConfig.models` and provenance-based sample selection land with the verifiers bump.

🤖 Generated with Claude Code