feat: drive user simulator as an explicit harness feature #1880

Draft
mikasenghaas wants to merge 2 commits into main from feat/user-sim-harness
mikasenghaas (Member) commented Jun 26, 2026

Summary

Makes the user simulator an explicit harness feature instead of a transparent interception trick, fixing the message-graph fork that happened whenever a rollout combined task tools with a user simulator (verifiers#1871).

The user simulator is already an MCP server — vf.User exposes a respond tool. So instead of the interception server playing out the exchange and handing the harness only the final assistant message (which forked the graph), the harness drives the simulator itself over MCP, exactly the way it already talks to tool servers:

  • The vf.User is served as a harness-reachable MCP server (like a tool, but never shown to the model).
  • On a no-tool-call turn, the harness's conversation driver calls the simulator's respond tool, appends the returned user message(s) to its own conversation, and re-prompts. A no-prompt task is opened by respond("") before the first model call. An empty reply means the simulator is done.
  • So every user turn arrives in the harness's own next model call and is recorded as a regular user message on the single branch — no fork, no transparent injection.
  • The interception server drops out of the user-sim path entirely: it's now a pure 1:1 model proxy that knows nothing about user simulation. The simulator's end-of-trajectory @vf.stop flag still reaches it over the existing /state channel like any other state.

This is the same mechanism for every harness — owned chat loops and external CLI agents alike (see below). No bespoke HTTP endpoint, no extra dependency: the program reuses the mcp client it already uses for tools.

Changes

  • interception (server.py): handle_request is a 1:1 model proxy; removed the internal inject-and-re-prompt loop, the opening-turn seed, and the RolloutSession.user field. (No /user endpoint, no Dialect.extend — both removed.)
  • launch (serve_user): serves the vf.User harness-reachable (like a tool) and yields its MCP URL; connect_user stays as the host-side driver utility (for CLI-wrapper harnesses).
  • harness: Harness.run/launch thread user_url (the simulator's MCP server); the rollout gates on SUPPORTS_USER_SIM (loud failure, not silent corruption).
  • default / bash / bash_edit programs connect to the simulator's MCP server and call respond (a small connect_user_sim helper); deps stay ["openai", "mcp"].
  • rlm / codex / others get the threaded param but keep SUPPORTS_USER_SIM = False.

Why

Under transparent injection the interception server played the whole user/assistant exchange inside one program request and returned only the final assistant message to the harness. A harness that runs its own tool loop then re-prompted with its own view — which omitted the injected user turns — so graph.prepare_turn matched only a short prefix and forked the graph (sampled=False duplicate branches). Single-turn and tool-less user-sim rollouts were unaffected; tools + user-sim (e.g. BFCL v3 multi-turn) forked every turn, and naive trace.branches[-1] scoring saw only the tail. An OpenAI chat response can carry back only one assistant message, so there's no way to keep the harness's conversation correct while keeping it sim-unaware — the simulator has to be a capability the harness opts into.
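
A toy model of the fork, for illustration only. It is not the real `graph.prepare_turn`; `shared_prefix_len` and `extends_branch` are hypothetical helpers that assume a branch is extended only when the incoming prompt starts with every message already recorded on it.

```python
def shared_prefix_len(recorded: list[dict], incoming: list[dict]) -> int:
    """Number of leading messages the incoming prompt shares with the recorded branch."""
    n = 0
    for a, b in zip(recorded, incoming):
        if a != b:
            break
        n += 1
    return n


def extends_branch(recorded: list[dict], incoming: list[dict]) -> bool:
    """True when the incoming prompt continues the recorded branch (no fork)."""
    return shared_prefix_len(recorded, incoming) == len(recorded)


SYS = {"role": "system", "content": "calc"}
U1 = {"role": "user", "content": "hi"}
A1 = {"role": "assistant", "content": "hello, what should I add?"}
U2 = {"role": "user", "content": "2+3"}  # user turn injected by the simulator
A2 = {"role": "assistant", "content": None, "tool_calls": ["calc_add(2, 3)"]}
TOOL = {"role": "tool", "content": "5"}

# What the interception server recorded while it played the exchange internally.
recorded = [SYS, U1, A1, U2, A2]

# Transparent injection: the harness received only A2, so its re-prompt omits A1 and U2.
sim_blind = [SYS, U1, A2, TOOL]

# Harness-driven: the harness appended every turn itself, so its re-prompt contains them all.
sim_aware = [SYS, U1, A1, U2, A2, TOOL]
```

Here `shared_prefix_len(recorded, sim_blind)` stops at 2 of 5 recorded messages, which is the short-prefix match that produced a new branch, while `sim_aware` extends the recorded branch in full.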

How CLI agents fit (same mechanism)

The user turn always comes from the simulator's respond tool and is injected through the harness's native multi-turn input, so it's a real user message and the graph stays linear — only the injection verb differs per agent (and it's each agent's real entry point, not a hack):

  • Codex headless (resume): codex exec "<open>" then, per turn, respond(...) → codex exec resume <session_id> "<user msg>". Codex reconstructs context from its transcript and re-prompts with the user turn in history.
  • Claude Code headless (stream-json): claude -p --input-format stream-json --output-format stream-json is a bidirectional session — respond(...) → write a {"type":"user",...} line to stdin; Claude appends it and re-prompts.

In both, the host-side wrapper calls the same respond tool (via connect_user) and feeds the result into the CLI's own resume/stdin channel. (These two harnesses still set SUPPORTS_USER_SIM = False here — implementing their drivers is a follow-up; Codex also can't yet use our MCP tools, SUPPORTS_MCP = False.)

Verification

New toy taskset tool_user_sim_v1 (a calculator that must call a calc_add MCP tool and answer a user simulator across turns; no prompt, so the simulator opens it), plus an e2e regression test asserting the graph stays linear. On the default harness against deepseek/deepseek-v4-flash:

| | reward | num_branches |
|---|---|---|
| before (transparent injection) | 0.67 | 4 |
| after (harness-driven MCP) | 1.00 | 1 |

After, the graph is a single linear path: system → user(2+3) → assistant(tool call) → tool(5) → assistant(<answer>5</answer>) → user(10+20) → …. bash and bash_edit were smoke-checked on the same env (reward 1.0, num_branches == 1).

No regressions on the existing envs (all reward 1.0, num_branches == 1):

  • echo_user_sim_v1 (tool-less user sim, previously relied on transparent injection)
  • alphabet_sort_v1 (user sim, opens the conversation)
  • counter_tool_v1 (pure tools, no user sim)

Closes #1871.

🤖 Generated with Claude Code

Make the user simulator a capability harnesses opt into and drive over a new
`/user` interception endpoint, instead of a transparent inject-and-re-prompt
inside the interception server. The transparent design forked the message graph
whenever a rollout combined task tools with a user simulator: the harness
re-prompted with its own sim-blind conversation (omitting the injected user
turns), so prepare_turn matched a short prefix and forked. Driving the simulator
from the harness keeps every turn in the harness's own request, so the graph
stays linear by construction (verifiers#1871).

- interception: handle_request is now a 1:1 model proxy (no internal multi-turn
  or opening seed); new handle_user (/user) exposes the simulator (POST the
  assistant text -> next user turn(s) + done)
- harness: thread user_url through run/launch; rollout gates user-sim on
  SUPPORTS_USER_SIM (loud failure, not silent corruption)
- default/bash/bash_edit programs drive /user (open a no-prompt task, then POST
  on each no-tool-call turn until done)
- remove the now-dead Dialect.extend from base/chat/responses
- add tool_user_sim_v1 toy taskset + an e2e regression test (num_branches == 1)

Closes #1871

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread verifiers/v1/harnesses/bash/program.py Outdated
Comment on lines +218 to +220
                    http, user_url, args.api_key, message.content or ""
                )
                messages.extend(reply)


🟡 Medium bash/program.py:218

When the model refuses a request, message.content is None and the refusal text is stored in message.refusal. The current code posts message.content or "" to the user simulator, so refusal-only turns send an empty string to /user. The simulator receives no indication that a refusal occurred, which can cause multi-turn tasks to branch incorrectly or loop.

-                reply, done = await next_user_turn(
-                    http, user_url, args.api_key, message.content or ""
-                )
Also found in 2 other location(s)

verifiers/v1/harnesses/bash_edit/program.py:281

On the no-tool-call path, next_user_turn(..., message.content or "") drops assistant refusals. In the OpenAI chat SDK, ChatCompletionMessage.content is optional and a refusal is carried separately in message.refusal; a refusal-only turn therefore posts "" to /user. Any User.respond(message: str) implementation will see an empty assistant message instead of the actual refusal text, so multi-turn user-sim tasks can take the wrong branch or loop incorrectly after a refusal.

verifiers/v1/harnesses/default/program.py:187

On the no-tool-call path, next_user_turn(..., message.content or "") drops assistant refusals. In the OpenAI chat SDK, ChatCompletionMessage.content is optional and a refusal is carried separately in message.refusal; a refusal-only turn therefore posts "" to /user. Any User.respond(message: str) implementation will see an empty assistant message instead of the actual refusal text, so multi-turn user-sim tasks can take the wrong branch or loop incorrectly after a refusal.
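
One possible shape for the fix, sketched under assumptions: `assistant_text` is a hypothetical helper, and forwarding the refusal text verbatim (rather than tagging it) is a design choice this sketch makes for simplicity. It relies only on the `content` and `refusal` attributes described above, so a `SimpleNamespace` stands in for the SDK message object.

```python
from types import SimpleNamespace


def assistant_text(message) -> str:
    """Text to forward to the simulator: content if present, else the refusal, else ''."""
    return getattr(message, "content", None) or getattr(message, "refusal", None) or ""


# Stand-ins for SDK message objects; only the two attributes used above are modelled.
normal = SimpleNamespace(content="The answer is 5.", refusal=None)
refused = SimpleNamespace(content=None, refusal="I can't help with that.")
empty = SimpleNamespace(content=None, refusal=None)
```

Each of the three call sites would then pass `assistant_text(message)` instead of `message.content or ""`.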


Comment thread verifiers/v1/harness.py


🟠 High

async def run(

Harness.run calls self.launch(..., user_url) with 7 arguments, but existing custom harness subclasses may still implement the old 6-argument launch signature. The bundled scaffold in verifiers/v1/cli/init.py still generates that old signature. Any rollout loading such a harness will fail with TypeError: ... takes 7 positional arguments but 8 were given before the harness starts.


Per review: piggyback on the MCP machinery the harness already uses for tools
instead of a bespoke httpx POST. The vf.User is already an MCP server (a
`respond` tool), so serve it harness-reachable (like a tool, but never shown to
the model) and have the harness connect over MCP and call `respond` for each
user turn, injecting the reply into its own conversation. Same mechanism for
owned chat loops and external CLI agents (codex exec resume / Claude Code
stream-json): the user turn arrives through the harness's native conversation,
so it's a regular user message on the single branch.

- interception: dropped the /user endpoint + handle_user + RolloutSession.user;
  it's a pure model proxy that knows nothing about user simulation (the @stop
  flag still arrives over the /state channel)
- serve_user: serves the user sim harness-reachable and yields its MCP URL;
  connect_user kept as the host-side driver for a CLI-wrapper harness
- default/bash/bash_edit programs: connect to the user-sim MCP server and call
  its `respond` tool (an empty reply = done); deps back to [openai, mcp], no httpx

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment on lines +126 to +148
async def connect_user_sim(stack: AsyncExitStack, url: str):
    """Connect to the task's user simulator, its own MCP server (a `vf.User`) never shown to the
    model, and return an async `respond(text)` that calls its `respond` tool, returning the next
    user message(s) as OpenAI wire dicts (an empty list = the simulator is done)."""
    from mcp import ClientSession
    from mcp.client.streamable_http import (
        create_mcp_http_client,
        streamable_http_client,
    )

    http_client = await stack.enter_async_context(create_mcp_http_client())
    read, write, *_ = await stack.enter_async_context(
        streamable_http_client(url, http_client=http_client)
    )
    session = await stack.enter_async_context(ClientSession(read, write))
    await session.initialize()

    async def respond(text: str) -> list[dict]:
        result = await session.call_tool("respond", {"message": text})
        parts = [b.text for b in result.content if getattr(b, "type", None) == "text"]
        return json.loads("\n".join(parts))["messages"]

    return respond


🟡 Medium bash/program.py:126

The respond() wrapper raises raw MCP/HTTP exceptions when the user-simulator connection drops mid-rollout, aborting the harness instead of surfacing a normal rollout error. The host-side connect_user() converts these into captured UserErrors; this harness should do the same so connection loss is handled gracefully.

    async def respond(text: str) -> list[dict]:
+        try:
             result = await session.call_tool("respond", {"message": text})
             parts = [b.text for b in result.content if getattr(b, "type", None) == "text"]
             return json.loads("\n".join(parts))["messages"]
+        except Exception as e:
+            return [{"role": "system", "content": f"User simulator error: {e}"}]

completion = response.raw
logger.debug(
"intercept turn: id=%s tools=%d",
# One model call, recorded and returned to the harness 1:1 — no internal multi-turn. A


🟡 Medium interception/server.py:268

When the harness-driven user simulator finishes (returns empty messages), handle_request returns without checking session.refused(). The taskset's @stop condition stored in trace.state (e.g. game_over) is only evaluated before model calls, so the rollout records "agent_completed" instead of the actual stop reason. Consider calling refused() after the simulator exits to capture the true end-of-trajectory condition.


return mcp_content_to_chat_content(result.content)


async def connect_user_sim(stack: AsyncExitStack, url: str):


🟡 Medium default/program.py:100

connect_user_sim() calls streamable_http_client(url, http_client=http_client) without retry logic, so a transient ConnectError from the colocated user server under load escapes and aborts the harness during startup. Consider adding retry logic to match the host-side connect_user() behavior so transient failures recover instead of failing the rollout.

Also found in 1 other location(s)

verifiers/v1/harnesses/bash_edit/program.py:194

connect_user_sim() opens the user-simulator MCP connection with a single streamable_http_client(...) attempt and no error translation. The same repo's host-side connect_user() retries because the colocated user server can briefly refuse connections under load; here that transient ConnectError will escape during startup and abort the harness instead of recovering, so user-sim rollouts can fail spuriously.


Development

Successfully merging this pull request may close these issues.

User simulator must be an explicit harness feature (transparent injection forks the graph with tools)
