Add reusable OpenEnv taskset support #1885
    - `openenv_echo_v1/taskset.py` — a thin config over `vf`'s reusable `OpenEnvTaskset`.

    Echo's production MCP contract is unscored, so the reward is neutral while the tool call and
    result remain in the Verifiers trace.
Echo env README noncompliant
Low Severity
The new openenv_echo_v1 package uses a hand-written README instead of the section structure produced by prime env init / uv run init (intro, Develop steps, Layout, CLI tuning). Project rules disallow freeform environment READMEs for packages under environments/.
Triggered by project rule: BugBot Instructions
Reviewed by Cursor Bugbot for commit d7020e9.
    payload = response.json()
    if error := payload.get("error"):
        raise RuntimeError(error.get("message", str(error)))
    self.tools = payload["result"]["tools"]
UDS wait not HTTP ready
Medium Severity
JSONRPCToolset.setup treats a Unix socket file appearing on disk as readiness, then immediately POSTs tools/list. If the OpenEnv server creates the socket before its ASGI app accepts JSON-RPC, setup can fail intermittently even though the rollout eventually would work.
Reviewed by Cursor Bugbot for commit d7020e9.
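One way to close the race described above is to poll the JSON-RPC endpoint itself instead of the socket file. The following is a minimal sketch, not the PR's implementation: `wait_until_ready`, its parameters, and the `probe` coroutine (assumed to issue one cheap request such as `tools/list` and report success) are hypothetical names introduced for illustration.

```python
import asyncio
import time
from collections.abc import Awaitable, Callable


async def wait_until_ready(
    probe: Callable[[], Awaitable[bool]],
    timeout: float = 30.0,
    interval: float = 0.1,
) -> int:
    """Call `probe` until it returns True; return the number of attempts made.

    `probe` is a hypothetical coroutine that sends one cheap request and
    returns whether the server answered. Connection-level failures
    (OSError and subclasses) are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            if await probe():
                return attempts
        except OSError:
            pass  # socket missing or server not accepting requests yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"server not ready after {attempts} attempts")
        await asyncio.sleep(interval)
```

Under this sketch, setup would only store the tool list once the probe has succeeded, so a socket file that exists before the ASGI app is serving no longer counts as ready.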
    prompt: str = (
        'Call the echo_message tool with the message "Hello, World!", then return '
        "the echoed text."
    )
Echo prompt wrong tool name
Medium Severity
The example task prompt tells the model to call echo_message, but MCP-capable harnesses expose colocated tools as {server}_{tool}. With JSONRPCToolset's TOOL_PREFIX of jsonrpc, the default harness registers the tool as jsonrpc_echo_message, so the documented docker smoke eval (default harness) steers the model toward a name that is not offered.
Reviewed by Cursor Bugbot for commit 0a2178f.
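The naming mismatch can be illustrated with a small sketch that derives the prompt from the exposed tool name rather than hard-coding it. `harness_tool_name` and `echo_prompt` are hypothetical helpers; the `{server}_{tool}` rule and the `jsonrpc` prefix are taken from the comment above, and nothing here is confirmed as the harness's actual API.

```python
def harness_tool_name(server_prefix: str, tool: str) -> str:
    """Name under which a colocated tool is offered, per the `{server}_{tool}` rule."""
    return f"{server_prefix}_{tool}"


def echo_prompt(tool_name: str, message: str = "Hello, World!") -> str:
    """Build the example prompt around whatever tool name the harness exposes."""
    return (
        f'Call the {tool_name} tool with the message "{message}", then return '
        "the echoed text."
    )


# With the prefix named in the review, the prompt would reference the offered name.
prompt = echo_prompt(harness_tool_name("jsonrpc", "echo_message"))
```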
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d7020e98d3
    payload = response.json()
    if error := payload.get("error"):
        raise RuntimeError(error.get("message", str(error)))
    return payload.get("result")
Preserve MCP tool results instead of returning raw JSON
When the proxied JSON-RPC tool returns a normal MCP CallToolResult shape, such as content blocks with images or an isError flag, returning the raw dict here causes FastMCP to serialize that whole object as a text JSON blob. In OpenEnv tasks with non-text tool output or tool-level errors, the selected harness will see only a text wrapper and lose the original content blocks/error metadata, so the adapter should rebuild/return the MCP result type rather than the raw JSON result.
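The suggestion above can be sketched with a stand-in type. `ToolResult` is a hypothetical placeholder for the MCP `CallToolResult` type and `rebuild_tool_result` is an illustrative name; a real adapter would return the MCP library's own result type rather than this dataclass.

```python
import json
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolResult:
    """Hypothetical stand-in for an MCP CallToolResult."""

    content: list[dict[str, Any]] = field(default_factory=list)
    is_error: bool = False


def rebuild_tool_result(result: Any) -> ToolResult:
    """Keep content blocks and the error flag instead of stringifying the dict."""
    if isinstance(result, dict) and isinstance(result.get("content"), list):
        return ToolResult(
            content=list(result["content"]),
            is_error=bool(result.get("isError", False)),
        )
    # Fallback for results without content blocks: wrap them as one text block.
    text = result if isinstance(result, str) else json.dumps(result)
    return ToolResult(content=[{"type": "text", "text": text}])
```

Only results that already carry a `content` list are passed through untouched; anything else still ends up as text, which matches the current behavior for plain values.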
Approvability
Verdict: Needs human review
5 blocking correctness issues found. This PR introduces a new OpenEnv taskset feature with significant new runtime logic, integrations, and state management. Multiple unresolved review comments identify medium-severity bugs including race conditions in socket waiting, incorrect tool naming, state inheritance issues, and missing version constraints that could cause runtime failures.
            raise RuntimeError(error.get("message", str(error)))
        self.tools = payload["result"]["tools"]

    def _register(self, mcp: FastMCP) -> None:
🟡 Medium mcp/toolset.py:146
The call closure in _register drops every argument whose value is None before forwarding it to the remote tools/call endpoint. A parameter explicitly set to JSON null arrives as Python None and is silently removed from the request, so the remote tool sees it as omitted rather than null. Tools that distinguish null from missing fields receive the wrong request. Consider using a sentinel value for unset parameters instead of filtering on None.
    version = "0.1.0"
    description = "openenv-echo-v1 — <one-line description>."
    requires-python = ">=3.11"
    dependencies = ["verifiers"]
🟡 Medium openenv_echo_v1/pyproject.toml:6
dependencies = ["verifiers"] has no version lower bound, so pip install can resolve to verifiers 0.1.14 (the current stable release) which predates the verifiers.v1.tasksets.openenv module this package imports. Importing openenv_echo_v1 then fails with ModuleNotFoundError. Add a minimum version constraint that includes the new OpenEnvTaskset API.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
Reviewed by Cursor Bugbot for commit e96a9f1.
    connector = aiohttp.UnixConnector(path=self.config.uds)
    timeout = aiohttp.ClientTimeout(total=self.config.timeout)
    self.session = aiohttp.ClientSession(connector=connector, timeout=timeout)
    self.ws = await self.session.ws_connect(f"{self.config.base_url}/ws")
Gym user skips socket wait
Medium Severity
For contract="gym", OpenEnvUser._connect opens the WebSocket as soon as the user simulator runs, but OpenEnvTaskset.setup only detaches uvicorn and does not wait for /tmp/openenv.sock. The new MCP path waits on that socket in JSONRPCToolset.setup, so gym rollouts can fail intermittently if the server is still starting.
Reviewed by Cursor Bugbot for commit e96a9f1.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e96a9f1ced
        resources: vf.TaskResources = vf.TaskResources(cpu=2, memory=4, disk=10)


    class OpenEnvEchoTaskset(OpenEnvTaskset, vf.Taskset[vf.Task, OpenEnvEchoConfig]):
Preserve OpenEnvState when narrowing the example config
In this inheritance list, vf.Taskset[...] defaults the state type to vf.State. The framework's state_cls(type(taskset)) scans the most-derived generic bases first, so openenv-echo-v1 rollouts get a base State instead of OpenEnvState; then the inherited openenv_reward/openenv_done methods read trace.state.reward/done and scoring raises AttributeError for every run. Please keep the OpenEnvState type when rebinding only the config.
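The effect described above can be reproduced with a toy model. Everything in this sketch is a stand-in: the three-parameter `Taskset`, the explicit `State` argument (emulating the default state type), and this `state_cls` are assumptions about how such a lookup could behave, not the framework's actual code.

```python
from typing import Generic, TypeVar, get_args, get_origin

TaskT = TypeVar("TaskT")
ConfigT = TypeVar("ConfigT")
StateT = TypeVar("StateT")


class Task: ...
class Config: ...
class EchoConfig(Config): ...
class State: ...


class OpenEnvState(State):
    reward: float = 0.0
    done: bool = False


class Taskset(Generic[TaskT, ConfigT, StateT]): ...


class OpenEnvTaskset(Taskset[Task, Config, OpenEnvState]): ...


def state_cls(cls: type) -> type:
    """Toy lookup: the first Taskset[...] binding found, most-derived class first."""
    for klass in cls.__mro__:
        for base in klass.__dict__.get("__orig_bases__", ()):
            if get_origin(base) is Taskset:
                state = get_args(base)[2]
                if isinstance(state, type):
                    return state
    return State


# Rebinding the config while leaving the state at the base type shadows OpenEnvState.
class NarrowedWrong(OpenEnvTaskset, Taskset[Task, EchoConfig, State]): ...


# Restating the state type alongside the narrowed config keeps it.
class NarrowedRight(OpenEnvTaskset, Taskset[Task, EchoConfig, OpenEnvState]): ...
```

In this model, `state_cls(NarrowedWrong)` resolves to `State`, so attributes such as `reward` and `done` would be missing, while `state_cls(NarrowedRight)` resolves to `OpenEnvState`.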
    connector = aiohttp.UnixConnector(path=self.config.uds)
    timeout = aiohttp.ClientTimeout(total=self.config.timeout)
    self.session = aiohttp.ClientSession(connector=connector, timeout=timeout)
    self.ws = await self.session.ws_connect(f"{self.config.base_url}/ws")
Wait for the gym socket before opening it
For contract="gym", setup() only spawns uvicorn with run_background() and returns; the first no-prompt request calls respond("") and reaches _connect() immediately. If the OpenEnv server has not bound /tmp/openenv.sock yet (slow image or fast harness), this ws_connect raises a connector error and the rollout fails before reset. Add the same socket/readiness wait used by JSONRPCToolset before opening the websocket.
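A socket wait of the kind requested above could be sketched with the standard library alone. `wait_for_unix_socket` and its parameters are hypothetical, and this version only confirms that something is accepting connections on the path; it does not check that the WebSocket route is being served. It relies on `asyncio.open_unix_connection`, which is available on Unix only.

```python
import asyncio
import time


async def wait_for_unix_socket(
    path: str, timeout: float = 30.0, interval: float = 0.1
) -> None:
    """Block until a Unix socket at `path` accepts a connection, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            _, writer = await asyncio.open_unix_connection(path)
        except OSError:
            # Covers a missing socket file and a file nobody is listening on yet.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no listener on {path} after {timeout}s")
            await asyncio.sleep(interval)
        else:
            writer.close()
            await writer.wait_closed()
            return
```

Calling this at the start of `_connect()`, before `ws_connect`, would turn a startup race into a bounded wait.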
    def has_tools(self) -> bool:
        return self.config.contract == "mcp"

    def tools(self, task: Task) -> list[Toolset]:
🟡 Medium openenv/taskset.py:208
For contract="mcp", tools() returns a plain JSONRPCToolset that sends tool calls directly to the /mcp endpoint via JSON-RPC tools/call. The openenv_done and openenv_reward methods read trace.state.done and trace.state.reward, but nothing in the MCP path updates those fields — only OpenEnvUser.respond() (used for contract="gym") writes to state.done and state.reward. As a result, MCP-backed OpenEnv tasks always return 0.0 reward and never terminate from the environment's done signal. Consider wiring MCP tool-call results into state.done and state.reward, or document that MCP tasksets are not scored/terminating if that is intentional.
    async def _connect(self) -> None:
        if self.ws is not None:
            return
        connector = aiohttp.UnixConnector(path=self.config.uds)
        timeout = aiohttp.ClientTimeout(total=self.config.timeout)
        self.session = aiohttp.ClientSession(connector=connector, timeout=timeout)
        self.ws = await self.session.ws_connect(f"{self.config.base_url}/ws")
        self.action_schema = await self._schema()
🟡 Medium openenv/taskset.py:92
In _connect(), self.ws is assigned before _schema() is awaited, so if _schema() raises (timeout, non-2xx, bad JSON), the websocket and session are left set but self.action_schema remains {}. Every subsequent respond() call returns early at the if self.ws is not None: return guard, so the schema is never refetched — action parsing and rendering stay broken for the rest of the episode instead of recovering on retry. Consider resetting self.ws and self.session when _schema() fails so the next call reconnects from scratch.
     async def _connect(self) -> None:
         if self.ws is not None:
             return
         connector = aiohttp.UnixConnector(path=self.config.uds)
         timeout = aiohttp.ClientTimeout(total=self.config.timeout)
         self.session = aiohttp.ClientSession(connector=connector, timeout=timeout)
         self.ws = await self.session.ws_connect(f"{self.config.base_url}/ws")
    -    self.action_schema = await self._schema()
    +    try:
    +        self.action_schema = await self._schema()
    +    except Exception:
    +        await self._close()
    +        raise
        self.action_schema: dict[str, Any] = {}
        self.initial: dict[str, Any] | None = None

    async def respond(self, message: str) -> Messages:
🟡 Medium openenv/taskset.py:71
When message is an empty string, respond treats it as a reset request instead of a step call. Since the caller passes response.message.content or "" on every turn, any empty assistant message causes respond to replay the cached self.initial reset result (or reset the environment again) instead of sending the empty string as a step action, silently corrupting the rollout state. Track the initial reset explicitly with a boolean flag rather than testing the truthiness of message on every call.
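The flag-based approach suggested above can be shown with a synchronous toy. `EpisodeDriver` is a hypothetical stand-in for `OpenEnvUser`; its `respond` only records which operation would be sent, whereas the real method is async and talks to the environment.

```python
class EpisodeDriver:
    """Toy stand-in for OpenEnvUser: records reset/step calls instead of sending them."""

    def __init__(self) -> None:
        self.started = False  # explicit "initial reset already happened" flag
        self.log: list[tuple[str, str]] = []

    def respond(self, message: str) -> str:
        if not self.started:
            # The first call resets regardless of what the message contains.
            self.started = True
            self.log.append(("reset", ""))
            return "initial observation"
        # Every later call is a step, including an empty assistant message.
        self.log.append(("step", message))
        return f"observation after {message!r}"
```

Because the decision depends on `started` rather than on the truthiness of `message`, an empty assistant turn after the first call is forwarded as a step instead of replaying the reset.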


Overview
Adds Verifiers-owned OpenEnv support that runs image-backed environments through any MCP-capable v1 harness while preserving normal Verifiers traces.
Details
- `OpenEnvTaskset` that maps image, prompt, workdir, and resources onto Verifiers tasks and delegates tool execution to OpenEnv.
- `JSONRPCToolset` adapter that exposes OpenEnv JSON-RPC tools through the standard Verifiers MCP abstraction.
- `openenv-echo-v1` as a thin example that owns the Echo image, hello-world prompt, and resource settings.

Note
Medium Risk
Touches container startup, external OpenEnv images, and Codex harness MCP configuration; misconfiguration could break rollouts or tool reachability, but scope is mostly additive with CI skipping the docker-only example.
Overview
Adds reusable OpenEnv image support in v1: a built-in `OpenEnvTaskset` starts the image's ASGI app on a Unix socket, then either exposes tools via a new `JSONRPCToolset` (`tools/list` + `tools/call` over HTTP/UDS) for `contract=mcp`, or drives `contract=gym` through a colocated `OpenEnvUser` WebSocket client with reset/step, stops, and rewards from env state.

Ships `openenv-echo-v1` as a thin package that only pins the official Echo image digest, prompt, and resources. Docs and environment indexes list it; CI smoke eval skips it because it needs the container image.

Harness / load-time behavior: `CodexHarness` now sets `SUPPORTS_MCP = True` and forwards task MCP server URLs into Codex `-c mcp_servers.*` config; default Codex version moves to 0.116.0 (before an MCP regression). `Environment` compatibility checks use `taskset.has_tools()` / `has_user()` (with `OpenEnvTaskset` overriding those by contract) instead of comparing unbound `Taskset.tools` / `user` methods.

Minor runtime fix: `unset UV_SYSTEM_PYTHON` before `uv sync --script` when preparing PEP 723 scripts in sandboxes.

Reviewed by Cursor Bugbot for commit e96a9f1.
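The difference between comparing unbound methods and asking `has_tools()` can be seen in a toy version. The classes below are simplified stand-ins: the default `has_tools` body is an assumption about a plausible base implementation, and only the contract check mirrors the override shown in the diff.

```python
class Taskset:
    def tools(self) -> list[str]:
        return []

    def has_tools(self) -> bool:
        # Assumed default: a subclass that overrides `tools` is taken to provide some.
        return type(self).tools is not Taskset.tools


class OpenEnvTaskset(Taskset):
    def __init__(self, contract: str) -> None:
        self.contract = contract

    def tools(self) -> list[str]:
        return ["jsonrpc"] if self.contract == "mcp" else []

    def has_tools(self) -> bool:
        # Mirrors the override in the diff: only the MCP contract exposes tools.
        return self.contract == "mcp"


gym = OpenEnvTaskset("gym")
# The method comparison reports "has tools" for every OpenEnvTaskset,
# while has_tools() answers per instance according to the contract.
overrides_tools = type(gym).tools is not Taskset.tools
```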
Note
Add reusable `OpenEnvTaskset` with MCP and gym contract support
- Adds `OpenEnvTaskset` in taskset.py, a reusable base taskset that boots OpenEnv container images via Unix domain socket and exposes tools either as MCP (via `JSONRPCToolset`) or drives the model via a gym WebSocket protocol (`OpenEnvUser`).
- Adds `JSONRPCToolset` in toolset.py that discovers and proxies tools from any `tools/list` / `tools/call` JSON-RPC endpoint, synthesizing proper MCP argument schemas dynamically.
- Adds `openenv-echo-v1` in environments/openenv_echo_v1 as a thin concrete taskset pinning the Echo image digest and a fixed prompt, serving as a reference implementation.
- Updates `CodexHarness` to set `SUPPORTS_MCP=True` and forward MCP server URLs to the Codex process; defaults the Codex version to `0.116.0` (last release before a regression in `0.117`).
- Adds `has_tools()` and `has_user()` helpers to `Taskset` and updates `Environment` to use them instead of direct type comparisons.

Macroscope summarized e96a9f1.