feat: drive user simulator as an explicit harness feature #1880
mikasenghaas wants to merge 2 commits into
Make the user simulator a capability harnesses opt into and drive over a new `/user` interception endpoint, instead of a transparent inject-and-re-prompt inside the interception server.

The transparent design forked the message graph whenever a rollout combined task tools with a user simulator: the harness re-prompted with its own sim-blind conversation (omitting the injected user turns), so prepare_turn matched a short prefix and forked. Driving the simulator from the harness keeps every turn in the harness's own request, so the graph stays linear by construction (verifiers#1871).

- interception: handle_request is now a 1:1 model proxy (no internal multi-turn or opening seed); new handle_user (/user) exposes the simulator (POST the assistant text -> next user turn(s) + done)
- harness: thread user_url through run/launch; rollout gates user-sim on SUPPORTS_USER_SIM (loud failure, not silent corruption)
- default/bash/bash_edit programs drive /user (open a no-prompt task, then POST on each no-tool-call turn until done)
- remove the now-dead Dialect.extend from base/chat/responses
- add tool_user_sim_v1 toy taskset + an e2e regression test (num_branches == 1)

Closes #1871

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
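The fork described in this commit message can be reproduced with a toy graph. The sketch below is illustrative only: `ToyGraph`, `record`, and the message tuples are hypothetical stand-ins, not the real `prepare_turn` implementation. It assumes a request extends a branch only when that whole branch is a prefix of the request, and forks otherwise.

```python
Message = tuple[str, str]  # (role, content)


class ToyGraph:
    """Hypothetical stand-in for the message graph (not the real implementation)."""

    def __init__(self) -> None:
        self.branches: list[list[Message]] = []

    def record(self, request: list[Message], recorded_tail: list[Message]) -> None:
        """Store request + recorded_tail, extending a branch that is a full prefix."""
        for i, branch in enumerate(self.branches):
            if request[: len(branch)] == branch:
                self.branches[i] = request + recorded_tail
                return
        self.branches.append(request + recorded_tail)  # no full-prefix match: fork

    @property
    def num_branches(self) -> int:
        return len(self.branches)


S, U1 = ("system", "rules"), ("user", "2+3")
A1 = ("assistant", "<answer>5</answer>")  # no tool call
U2 = ("user", "10+20")
A2 = ("assistant", "tool call: calc_add")
T2 = ("tool", "30")
A3 = ("assistant", "<answer>30</answer>")


def transparent_injection() -> int:
    graph = ToyGraph()
    # One harness request: the server plays A1, injects U2, and returns only A2.
    graph.record([S, U1], [A1, U2, A2])
    # The harness never saw A1 or U2, so its next request omits them.
    graph.record([S, U1, A2, T2], [A3])
    return graph.num_branches


def harness_driven() -> int:
    graph = ToyGraph()
    graph.record([S, U1], [A1])                # one model call per request
    graph.record([S, U1, A1, U2], [A2])        # harness appended U2 itself
    graph.record([S, U1, A1, U2, A2, T2], [A3])
    return graph.num_branches
```

Under these assumptions the transparent variant ends with two branches, while the harness-driven variant keeps a single one, because every request is a strict extension of what was recorded before.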
    http, user_url, args.api_key, message.content or ""
)
messages.extend(reply)
🟡 Medium bash/program.py:218
When the model refuses a request, `message.content` is `None` and the refusal text is stored in `message.refusal`. The current code posts `message.content or ""` to the user simulator, so refusal-only turns send an empty string to `/user`. The simulator receives no indication that a refusal occurred, which can cause multi-turn tasks to branch incorrectly or loop.
- reply, done = await next_user_turn(
- http, user_url, args.api_key, message.content or ""
- )
Also found in 2 other location(s)
verifiers/v1/harnesses/bash_edit/program.py:281
On the no-tool-call path, `next_user_turn(..., message.content or "")` drops assistant refusals. In the OpenAI chat SDK, `ChatCompletionMessage.content` is optional and a refusal is carried separately in `message.refusal`; a refusal-only turn therefore posts `""` to `/user`. Any `User.respond(message: str)` implementation will see an empty assistant message instead of the actual refusal text, so multi-turn user-sim tasks can take the wrong branch or loop incorrectly after a refusal.
verifiers/v1/harnesses/default/program.py:187
On the no-tool-call path, `next_user_turn(..., message.content or "")` drops assistant refusals. In the OpenAI chat SDK, `ChatCompletionMessage.content` is optional and a refusal is carried separately in `message.refusal`; a refusal-only turn therefore posts `""` to `/user`. Any `User.respond(message: str)` implementation will see an empty assistant message instead of the actual refusal text, so multi-turn user-sim tasks can take the wrong branch or loop incorrectly after a refusal.
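One possible shape for the fix is a small helper that falls back to the refusal text. This is a minimal sketch, not the change the PR adopted: `assistant_text` is a hypothetical name, and the `SimpleNamespace` objects only stand in for `ChatCompletionMessage`, assuming it exposes optional `content` and `refusal` attributes as the comment describes.

```python
from types import SimpleNamespace


def assistant_text(message) -> str:
    """Return the text to forward to the user simulator.

    Prefers `content`; falls back to `refusal` so that a refusal-only turn
    is not forwarded as an empty string.
    """
    content = getattr(message, "content", None)
    if content:
        return content
    refusal = getattr(message, "refusal", None)
    return refusal or ""


# Stand-ins for SDK message objects (assumed attribute names).
normal = SimpleNamespace(content="The answer is 5.", refusal=None)
refused = SimpleNamespace(content=None, refusal="I can't help with that.")
empty = SimpleNamespace(content=None, refusal=None)
```

Whether the simulator should see the bare refusal text or a marked-up form (for example a prefix that identifies it as a refusal) is a design choice this sketch leaves open.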
🟠 High
verifiers/verifiers/v1/harness.py
Line 118 in e239b28
`Harness.run` calls `self.launch(..., user_url)` with 7 arguments, but existing custom harness subclasses may still implement the old 6-argument `launch` signature. The bundled scaffold in `verifiers/v1/cli/init.py` still generates that old signature. Any rollout loading such a harness will fail with `TypeError: ... takes 7 positional arguments but 8 were given` before the harness starts.
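One way to tolerate old subclasses is to inspect the bound `launch` signature before passing the extra argument. The sketch below is an assumption about how such a shim could look, not code from the PR: `accepts_positional` and `call_launch` are hypothetical helpers, and the real `launch` is likely a coroutine whose result the caller would still await.

```python
import inspect


def accepts_positional(method, n_args: int) -> bool:
    """True if the bound `method` can be called with `n_args` positional arguments."""
    try:
        inspect.signature(method).bind(*([None] * n_args))
    except TypeError:
        return False
    return True


def call_launch(harness, base_args: tuple, user_url):
    """Call `launch` with `user_url` only when the subclass accepts it."""
    if accepts_positional(harness.launch, len(base_args) + 1):
        return harness.launch(*base_args, user_url)
    if user_url is not None:
        # Fail loudly with an actionable message instead of a bare arity error.
        raise TypeError(
            f"{type(harness).__name__}.launch does not accept user_url; "
            "update its signature to run user-simulator tasks"
        )
    return harness.launch(*base_args)


class OldHarness:  # old-style signature, without user_url
    def launch(self, a, b):
        return ("old", a, b)


class NewHarness:  # new-style signature
    def launch(self, a, b, user_url):
        return ("new", a, b, user_url)
```

With this shape, old harnesses keep working for tasks without a user simulator and fail with a clear message only when a simulator is actually requested.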
Per review: piggyback on the MCP machinery the harness already uses for tools instead of a bespoke httpx POST. The vf.User is already an MCP server (a `respond` tool), so serve it harness-reachable (like a tool, but never shown to the model) and have the harness connect over MCP and call `respond` for each user turn, injecting the reply into its own conversation.

Same mechanism for owned chat loops and external CLI agents (codex exec resume / Claude Code stream-json): the user turn arrives through the harness's native conversation, so it's a regular user message on the single branch.

- interception: dropped the /user endpoint + handle_user + RolloutSession.user; it's a pure model proxy that knows nothing about user simulation (the @stop flag still arrives over the /state channel)
- serve_user: serves the user sim harness-reachable and yields its MCP URL; connect_user kept as the host-side driver for a CLI-wrapper harness
- default/bash/bash_edit programs: connect to the user-sim MCP server and call its `respond` tool (an empty reply = done); deps back to [openai, mcp], no httpx

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
async def connect_user_sim(stack: AsyncExitStack, url: str):
    """Connect to the task's user simulator, its own MCP server (a `vf.User`) never shown to the
    model, and return an async `respond(text)` that calls its `respond` tool, returning the next
    user message(s) as OpenAI wire dicts (an empty list = the simulator is done)."""
    from mcp import ClientSession
    from mcp.client.streamable_http import (
        create_mcp_http_client,
        streamable_http_client,
    )

    http_client = await stack.enter_async_context(create_mcp_http_client())
    read, write, *_ = await stack.enter_async_context(
        streamable_http_client(url, http_client=http_client)
    )
    session = await stack.enter_async_context(ClientSession(read, write))
    await session.initialize()

    async def respond(text: str) -> list[dict]:
        result = await session.call_tool("respond", {"message": text})
        parts = [b.text for b in result.content if getattr(b, "type", None) == "text"]
        return json.loads("\n".join(parts))["messages"]

    return respond
🟡 Medium bash/program.py:126
The `respond()` wrapper raises raw MCP/HTTP exceptions when the user-simulator connection drops mid-rollout, aborting the harness instead of surfacing a normal rollout error. The host-side `connect_user()` converts these into captured `UserError`s; this harness should do the same so connection loss is handled gracefully.
  async def respond(text: str) -> list[dict]:
+     try:
          result = await session.call_tool("respond", {"message": text})
          parts = [b.text for b in result.content if getattr(b, "type", None) == "text"]
          return json.loads("\n".join(parts))["messages"]
+     except Exception as e:
+         return [{"role": "system", "content": f"User simulator error: {e}"}]
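An alternative to returning a synthetic system message is to translate failures into one typed error that the caller can record as a rollout error. The sketch below is hypothetical: `UserSimError` and `translate_errors` are illustrative names, and the real `UserError` type used by `connect_user()` is not shown in this PR excerpt.

```python
import asyncio
from typing import Awaitable, Callable

Respond = Callable[[str], Awaitable[list[dict]]]


class UserSimError(RuntimeError):
    """Hypothetical error type for any failure while calling the user simulator."""


def translate_errors(respond: Respond) -> Respond:
    """Wrap `respond` so transport or parsing failures surface as UserSimError."""

    async def wrapped(text: str) -> list[dict]:
        try:
            return await respond(text)
        except Exception as exc:
            # asyncio.CancelledError derives from BaseException (Python 3.8+),
            # so task cancellation still propagates untouched.
            raise UserSimError(f"user simulator call failed: {exc!r}") from exc

    return wrapped


async def broken_respond(text: str) -> list[dict]:
    raise ConnectionResetError("connection dropped")


async def demo() -> str:
    safe = translate_errors(broken_respond)
    try:
        await safe("hello")
    except UserSimError as err:
        return type(err.__cause__).__name__
    return "no error"


cause_name = asyncio.run(demo())
```

Raising keeps the failure out of the conversation itself, so the model never sees an error string as if it were part of the dialogue.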
completion = response.raw
logger.debug(
    "intercept turn: id=%s tools=%d",
# One model call, recorded and returned to the harness 1:1, no internal multi-turn. A
🟡 Medium interception/server.py:268
When the harness-driven user simulator finishes (returns empty messages), `handle_request` returns without checking `session.refused()`. The taskset's `@stop` condition stored in `trace.state` (e.g. `game_over`) is only evaluated before model calls, so the rollout records `"agent_completed"` instead of the actual stop reason. Consider calling `refused()` after the simulator exits to capture the true end-of-trajectory condition.
    return mcp_content_to_chat_content(result.content)


async def connect_user_sim(stack: AsyncExitStack, url: str):
🟡 Medium default/program.py:100
`connect_user_sim()` calls `streamable_http_client(url, http_client=http_client)` without retry logic, so a transient `ConnectError` from the colocated user server under load escapes and aborts the harness during startup. Consider adding retry logic to match the host-side `connect_user()` behavior so transient failures recover instead of failing the rollout.
Also found in 1 other location(s)
verifiers/v1/harnesses/bash_edit/program.py:194
`connect_user_sim()` opens the user-simulator MCP connection with a single `streamable_http_client(...)` attempt and no error translation. The same repo's host-side `connect_user()` retries because the colocated user server can briefly refuse connections under load; here that transient `ConnectError` will escape during startup and abort the harness instead of recovering, so user-sim rollouts can fail spuriously.
Summary
Makes the user simulator an explicit harness feature instead of a transparent interception trick, fixing the message-graph fork that happened whenever a rollout combined task tools with a user simulator (verifiers#1871).
The user simulator is already an MCP server: `vf.User` exposes a `respond` tool. So instead of the interception server playing out the exchange and handing the harness only the final assistant message (which forked the graph), the harness drives the simulator itself over MCP, exactly the way it already talks to tool servers:
- `vf.User` is served as a harness-reachable MCP server (like a tool, but never shown to the model).
- The harness calls its `respond` tool, appends the returned user message(s) to its own conversation, and re-prompts. A no-prompt task is opened by `respond("")` before the first model call. An empty reply means the simulator is done.
- The interception server stays a pure model proxy; the `@vf.stop` flag still reaches it over the existing `/state` channel like any other state.

This is the same mechanism for every harness, owned chat loops and external CLI agents alike (see below). No bespoke HTTP endpoint, no extra dependency: the program reuses the `mcp` client it already uses for tools.

Changes
- Interception (`server.py`): `handle_request` is a 1:1 model proxy; removed the internal inject-and-re-prompt loop, the opening-turn seed, and the `RolloutSession.user` field. (No `/user` endpoint, no `Dialect.extend`; both removed.)
- User simulator (`serve_user`): serves the `vf.User` harness-reachable (like a tool) and yields its MCP URL; `connect_user` stays as the host-side driver utility (for CLI-wrapper harnesses).
- Harness: `Harness.run`/`launch` thread `user_url` (the simulator's MCP server); the rollout gates on `SUPPORTS_USER_SIM` (loud failure, not silent corruption).
- default/bash/bash_edit programs: connect to the user-sim MCP server and call `respond` (a small `connect_user_sim` helper); deps stay `["openai", "mcp"]`.
- `rlm` / `codex` / others get the threaded param but keep `SUPPORTS_USER_SIM = False`.

Why
Under transparent injection the interception server played the whole user/assistant exchange inside one program request and returned only the final assistant message to the harness. A harness that runs its own tool loop then re-prompted with its own view, which omitted the injected user turns, so `graph.prepare_turn` matched only a short prefix and forked the graph (`sampled=False` duplicate branches). Single-turn and tool-less user-sim rollouts were unaffected; tools + user-sim (e.g. BFCL v3 multi-turn) forked every turn, and naive `trace.branches[-1]` scoring saw only the tail. An OpenAI chat response can carry back only one assistant message, so there's no way to keep the harness's conversation correct while keeping it sim-unaware: the simulator has to be a capability the harness opts into.

How CLI agents fit (same mechanism)
The user turn always comes from the simulator's `respond` tool and is injected through the harness's native multi-turn input, so it's a real user message and the graph stays linear. Only the injection verb differs per agent (and it's each agent's real entry point, not a hack):
- Codex: `codex exec "<open>"` then, per turn, `respond(...)` → `codex exec resume <session_id> "<user msg>"`. Codex reconstructs context from its transcript and re-prompts with the user turn in history.
- Claude Code: `claude -p --input-format stream-json --output-format stream-json` is a bidirectional session: `respond(...)` → write a `{"type":"user",...}` line to stdin; Claude appends it and re-prompts.

In both, the host-side wrapper calls the same `respond` tool (via `connect_user`) and feeds the result into the CLI's own resume/stdin channel. (These two harnesses still set `SUPPORTS_USER_SIM = False` here; implementing their drivers is a follow-up. Codex also can't yet use our MCP tools, `SUPPORTS_MCP = False`.)

Verification
New toy taskset `tool_user_sim_v1` (a calculator that must call a `calc_add` MCP tool and answer a user simulator across turns; no prompt, so the simulator opens it), plus an e2e regression test asserting the graph stays linear. On the `default` harness against `deepseek/deepseek-v4-flash`:

[Table: `num_branches` before and after; values missing]

After, the graph is a single linear path: `system → user(2+3) → assistant(tool call) → tool(5) → assistant(<answer>5</answer>) → user(10+20) → …`. `bash` and `bash_edit` were smoke-checked on the same env (reward 1.0, `num_branches == 1`).

No regressions on the existing envs (all reward 1.0, `num_branches == 1`):
- `echo_user_sim_v1` (tool-less user sim, previously relied on transparent injection)
- `alphabet_sort_v1` (user sim, opens the conversation)
- `counter_tool_v1` (pure tools, no user sim)

Closes #1871.
🤖 Generated with Claude Code