
Add reusable OpenEnv taskset support#1885

Open
xeophon wants to merge 3 commits into main from codex/openenv-support
xeophon (Member) commented Jun 26, 2026

Overview

Adds Verifiers-owned OpenEnv support that runs image-backed environments through any MCP-capable v1 harness while preserving normal Verifiers traces.

Details

  • Introduces a generic OpenEnvTaskset that maps image, prompt, workdir, and resources onto Verifiers tasks and delegates tool execution to OpenEnv.
  • Adds a reusable JSONRPCToolset adapter that exposes OpenEnv JSON-RPC tools through the standard Verifiers MCP abstraction.
  • Adds openenv-echo-v1 as a thin example that owns the Echo image, hello-world prompt, and resource settings.
  • Enables task MCP configuration in the Codex harness and selects the custom-provider-compatible Codex release used with Nemotron Ultra.
  • Documents the reusable taskset, harness selection, tracing behavior, and example commands.

Note

Medium Risk
Touches container startup, external OpenEnv images, and Codex harness MCP configuration; misconfiguration could break rollouts or tool reachability, but scope is mostly additive with CI skipping the docker-only example.

Overview
Adds reusable OpenEnv image support in v1: a built-in OpenEnvTaskset starts the image’s ASGI app on a Unix socket, then either exposes tools via a new JSONRPCToolset (tools/list + tools/call over HTTP/UDS) for contract=mcp, or drives contract=gym through a colocated OpenEnvUser WebSocket client with reset/step, stops, and rewards from env state.

Ships openenv-echo-v1 as a thin package that only pins the official Echo image digest, prompt, and resources. Docs and environment indexes list it; CI smoke eval skips it because it needs the container image.

Harness / load-time behavior: CodexHarness now sets SUPPORTS_MCP = True and forwards task MCP server URLs into Codex -c mcp_servers.* config; default Codex version moves to 0.116.0 (before an MCP regression). Environment compatibility checks use taskset.has_tools() / has_user() (with OpenEnvTaskset overriding those by contract) instead of comparing unbound Taskset.tools / user methods.

Minor runtime fix: unset UV_SYSTEM_PYTHON before uv sync --script when preparing PEP 723 scripts in sandboxes.
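
The runtime fix above amounts to dropping one variable from the child environment before uv runs. A minimal sketch, assuming the sandbox command is launched from Python; `sandbox_env` and the script name `task.py` are hypothetical and are not taken from the PR:

```python
import os
from typing import Dict, Optional


def sandbox_env(base: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment without UV_SYSTEM_PYTHON."""
    env = dict(os.environ if base is None else base)
    # pop() with a default is a no-op when the variable is absent.
    env.pop("UV_SYSTEM_PYTHON", None)
    return env


# Hypothetical usage (not executed here):
# subprocess.run(["uv", "sync", "--script", "task.py"], env=sandbox_env(), check=True)
```

Copying the mapping rather than mutating `os.environ` keeps the change local to the one subprocess.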

Reviewed by Cursor Bugbot for commit e96a9f1. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Add reusable OpenEnvTaskset with MCP and gym contract support

  • Adds OpenEnvTaskset in taskset.py, a reusable base taskset that boots OpenEnv container images via Unix domain socket and exposes tools either as MCP (via JSONRPCToolset) or drives the model via a gym WebSocket protocol (OpenEnvUser).
  • Adds JSONRPCToolset in toolset.py that discovers and proxies tools from any tools/list / tools/call JSON-RPC endpoint, synthesizing proper MCP argument schemas dynamically.
  • Adds openenv-echo-v1 in environments/openenv_echo_v1 as a thin concrete taskset pinning the Echo image digest and a fixed prompt, serving as a reference implementation.
  • Updates CodexHarness to set SUPPORTS_MCP=True and forward MCP server URLs to the Codex process; defaults the Codex version to 0.116.0 (last release before a regression in 0.117).
  • Adds has_tools() and has_user() helpers to Taskset and updates Environment to use them instead of direct type comparisons.
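
For readers unfamiliar with the wire format, the two JSON-RPC methods mentioned above can be sketched as plain request envelopes. This is an illustration only, not the PR's `JSONRPCToolset`; the helper names are hypothetical, and the `message` argument name for the Echo tool is an assumption:

```python
import itertools
from typing import Any, Dict, Optional

_ids = itertools.count(1)


def jsonrpc_request(method: str, params: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Build a JSON-RPC 2.0 request envelope with a fresh id."""
    request: Dict[str, Any] = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        request["params"] = params
    return request


def unwrap(payload: Dict[str, Any]) -> Any:
    """Return the result, or raise if the response carries an error object."""
    if error := payload.get("error"):
        raise RuntimeError(error.get("message", str(error)))
    return payload.get("result")


# Discovery, then a call. The argument name "message" is assumed.
list_request = jsonrpc_request("tools/list")
call_request = jsonrpc_request(
    "tools/call",
    {"name": "echo_message", "arguments": {"message": "Hello, World!"}},
)
```

The `unwrap` helper mirrors the error handling quoted in the review snippets on this page: a response with an `error` member raises instead of returning a result.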

Macroscope summarized e96a9f1.

- `openenv_echo_v1/taskset.py` — a thin config over `vf`'s reusable `OpenEnvTaskset`.

Echo's production MCP contract is unscored, so the reward is neutral while the tool call and
result remain in the Verifiers trace.


Echo env README noncompliant

Low Severity

The new openenv_echo_v1 package uses a hand-written README instead of the section structure produced by prime env init / uv run init (intro, Develop steps, Layout, CLI tuning). Project rules disallow freeform environment READMEs for packages under environments/.


Triggered by project rule: BugBot Instructions

Reviewed by Cursor Bugbot for commit d7020e9. Configure here.

payload = response.json()
if error := payload.get("error"):
    raise RuntimeError(error.get("message", str(error)))
self.tools = payload["result"]["tools"]


UDS wait not HTTP ready

Medium Severity

JSONRPCToolset.setup treats the appearance of the Unix socket file on disk as readiness, then immediately POSTs tools/list. If the OpenEnv server creates the socket before its ASGI app accepts JSON-RPC, setup can fail intermittently even though the rollout would eventually work.
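
One way to close this gap is to probe at the HTTP level instead of checking for the socket file. The sketch below uses only the standard library and is not the PR's code; the function names and the `/health` path are assumptions. Any HTTP status line counts as ready, because even a 404 shows the app is serving:

```python
import socket
import time
from typing import Callable


def http_ready_over_uds(path: str, request_path: str = "/health") -> bool:
    """Return True once something answers HTTP on the Unix socket at `path`."""
    request = (
        f"GET {request_path} HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n"
    )
    try:
        with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
            sock.settimeout(1.0)
            sock.connect(path)
            sock.sendall(request.encode("ascii"))
            return sock.recv(16).startswith(b"HTTP/")
    except OSError:
        # Socket file missing, nobody listening yet, or the read timed out.
        return False


def wait_until(probe: Callable[[], bool], timeout: float = 30.0, interval: float = 0.1) -> bool:
    """Poll `probe` until it returns True or the deadline passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False


# Hypothetical usage:
# ready = wait_until(lambda: http_ready_over_uds("/tmp/openenv.sock"))
```

Separating the probe from the polling loop lets the same loop serve other readiness checks.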


Reviewed by Cursor Bugbot for commit d7020e9. Configure here.

prompt: str = (
    'Call the echo_message tool with the message "Hello, World!", then return '
    "the echoed text."
)


Echo prompt wrong tool name

Medium Severity

The example task prompt tells the model to call echo_message, but MCP-capable harnesses expose colocated tools as {server}_{tool}. With JSONRPCToolset's TOOL_PREFIX of jsonrpc, the default harness registers the tool as jsonrpc_echo_message, so the documented docker smoke eval (default harness) steers the model toward a name that is not offered.
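
The mismatch can be illustrated with a small sketch that derives the prompt from the exposed name. It assumes the `{server}_{tool}` convention described in this comment; `exposed_name` and `build_prompt` are hypothetical helpers, not part of the PR:

```python
TOOL_PREFIX = "jsonrpc"  # prefix named in the review comment


def exposed_name(tool: str, server: str = TOOL_PREFIX) -> str:
    """Name under which a colocated tool is offered to the model."""
    return f"{server}_{tool}"


def build_prompt(tool: str) -> str:
    """Render the task prompt from the exposed name so the two cannot drift."""
    return (
        f'Call the {exposed_name(tool)} tool with the message "Hello, World!", '
        "then return the echoed text."
    )


offered = {exposed_name("echo_message")}
```

Under this assumption the bare name `echo_message` is not in `offered`, which is the failure the comment describes.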


Reviewed by Cursor Bugbot for commit 0a2178f. Configure here.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7020e98d3

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

payload = response.json()
if error := payload.get("error"):
    raise RuntimeError(error.get("message", str(error)))
return payload.get("result")


P2: Preserve MCP tool results instead of returning raw JSON

When the proxied JSON-RPC tool returns a normal MCP CallToolResult shape, such as content blocks with images or an isError flag, returning the raw dict here causes FastMCP to serialize that whole object as a text JSON blob. In OpenEnv tasks with non-text tool output or tool-level errors, the selected harness will see only a text wrapper and lose the original content blocks/error metadata, so the adapter should rebuild/return the MCP result type rather than the raw JSON result.
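
A rough sketch of the rebuild step follows. `ToolResult` is a hypothetical stand-in dataclass, not the MCP SDK's result type. A real adapter would presumably return the SDK's own type, but the branching is the same: keep content blocks and the error flag when the payload already has the MCP shape, and wrap it as text only otherwise.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolResult:
    """Stand-in for an MCP tool result: content blocks plus an error flag."""
    content: List[Dict[str, Any]] = field(default_factory=list)
    is_error: bool = False


def to_tool_result(raw: Any) -> ToolResult:
    # Already MCP-shaped: preserve the blocks and the isError flag as-is.
    if isinstance(raw, dict) and isinstance(raw.get("content"), list):
        return ToolResult(
            content=list(raw["content"]),
            is_error=bool(raw.get("isError", False)),
        )
    # Anything else: fall back to a single text block.
    text = raw if isinstance(raw, str) else json.dumps(raw)
    return ToolResult(content=[{"type": "text", "text": text}])
```

With this split, an image block or a tool-level error keeps its own block and flag instead of being flattened into a JSON string.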


@macroscopeapp

macroscopeapp Bot commented Jun 26, 2026


Approvability

Verdict: Needs human review

5 blocking correctness issues found. This PR introduces a new OpenEnv taskset feature with significant new runtime logic, integrations, and state management. Multiple unresolved review comments identify medium-severity bugs including race conditions in socket waiting, incorrect tool naming, state inheritance issues, and missing version constraints that could cause runtime failures.


        raise RuntimeError(error.get("message", str(error)))
    self.tools = payload["result"]["tools"]

def _register(self, mcp: FastMCP) -> None:


🟡 Medium mcp/toolset.py:146

The call closure in _register drops every argument whose value is None before forwarding it to the remote tools/call endpoint. A parameter explicitly set to JSON null arrives as Python None and is silently removed from the request, so the remote tool sees it as omitted rather than null. Tools that distinguish null from missing fields receive the wrong request. Consider using a sentinel value for unset parameters instead of filtering on None.
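
The sentinel approach suggested here can be sketched as follows. The names `UNSET`, `forwarded_arguments`, and the two-parameter `call` are hypothetical and stand in for the dynamically generated closure:

```python
from typing import Any, Dict


class _Unset:
    """Marker type for parameters the caller did not supply."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


def forwarded_arguments(**kwargs: Any) -> Dict[str, Any]:
    # Drop only parameters that were never supplied; an explicit None survives
    # and is serialized as JSON null.
    return {key: value for key, value in kwargs.items() if value is not UNSET}


def call(message: Any = UNSET, prefix: Any = UNSET) -> Dict[str, Any]:
    """Toy tool wrapper: returns the arguments it would forward to tools/call."""
    return forwarded_arguments(message=message, prefix=prefix)
```

Comparing by identity against a private marker object is what lets `None` pass through untouched.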

Full path: `verifiers/v1/mcp/toolset.py`, around line 146.

version = "0.1.0"
description = "openenv-echo-v1 — <one-line description>."
requires-python = ">=3.11"
dependencies = ["verifiers"]


🟡 Medium openenv_echo_v1/pyproject.toml:6

dependencies = ["verifiers"] has no version lower bound, so pip install can resolve to verifiers 0.1.14 (the current stable release) which predates the verifiers.v1.tasksets.openenv module this package imports. Importing openenv_echo_v1 then fails with ModuleNotFoundError. Add a minimum version constraint that includes the new OpenEnvTaskset API.

Full path: `environments/openenv_echo_v1/pyproject.toml`, around line 6.

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).


Reviewed by Cursor Bugbot for commit e96a9f1. Configure here.

connector = aiohttp.UnixConnector(path=self.config.uds)
timeout = aiohttp.ClientTimeout(total=self.config.timeout)
self.session = aiohttp.ClientSession(connector=connector, timeout=timeout)
self.ws = await self.session.ws_connect(f"{self.config.base_url}/ws")


Gym user skips socket wait

Medium Severity

For contract="gym", OpenEnvUser._connect opens the WebSocket as soon as the user simulator runs, but OpenEnvTaskset.setup only detaches uvicorn and does not wait for /tmp/openenv.sock. The new MCP path waits on that socket in JSONRPCToolset.setup, so gym rollouts can fail intermittently if the server is still starting.


Reviewed by Cursor Bugbot for commit e96a9f1. Configure here.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e96a9f1ced


resources: vf.TaskResources = vf.TaskResources(cpu=2, memory=4, disk=10)


class OpenEnvEchoTaskset(OpenEnvTaskset, vf.Taskset[vf.Task, OpenEnvEchoConfig]):


P1: Preserve OpenEnvState when narrowing the example config

In this inheritance list, vf.Taskset[...] defaults the state type to vf.State. The framework's state_cls(type(taskset)) scans the most-derived generic bases first, so openenv-echo-v1 rollouts get a base State instead of OpenEnvState; then the inherited openenv_reward/openenv_done methods read trace.state.reward/done and scoring raises AttributeError for every run. Please keep the OpenEnvState type when rebinding only the config.
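
The resolution order described here can be reproduced with a toy model. Everything below is a simplified stand-in, not the framework's implementation: the toy `Taskset` takes three explicit type arguments (the real one defaults the state type), and `state_cls` is a guess at the scanning behaviour the comment describes.

```python
from typing import Generic, TypeVar, get_args, get_origin

TaskT = TypeVar("TaskT")
ConfigT = TypeVar("ConfigT")
StateT = TypeVar("StateT")


class Task: ...
class Config: ...
class EchoConfig(Config): ...
class State: ...


class OpenEnvState(State):
    reward: float = 0.0
    done: bool = False


class Taskset(Generic[TaskT, ConfigT, StateT]): ...
class OpenEnvTaskset(Taskset[Task, Config, OpenEnvState]): ...


def state_cls(cls: type) -> type:
    """Toy resolver: first Taskset[...] base found, most-derived class first."""
    for klass in cls.__mro__:
        for base in klass.__dict__.get("__orig_bases__", ()):
            if get_origin(base) is Taskset:
                return get_args(base)[2]
    return State


# Rebinding the config with a plain State shadows the inherited OpenEnvState.
class EchoBroken(OpenEnvTaskset, Taskset[Task, EchoConfig, State]): ...


# Naming OpenEnvState in the rebinding keeps the state type.
class EchoFixed(OpenEnvTaskset, Taskset[Task, EchoConfig, OpenEnvState]): ...
```

In this toy, `state_cls(EchoBroken)` resolves to `State` and `state_cls(EchoFixed)` to `OpenEnvState`, matching the failure mode and the fix the comment asks for.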


connector = aiohttp.UnixConnector(path=self.config.uds)
timeout = aiohttp.ClientTimeout(total=self.config.timeout)
self.session = aiohttp.ClientSession(connector=connector, timeout=timeout)
self.ws = await self.session.ws_connect(f"{self.config.base_url}/ws")


P2: Wait for the gym socket before opening it

For contract="gym", setup() only spawns uvicorn with run_background() and returns; the first no-prompt request calls respond("") and reaches _connect() immediately. If the OpenEnv server has not bound /tmp/openenv.sock yet (slow image or fast harness), this ws_connect raises a connector error and the rollout fails before reset. Add the same socket/readiness wait used by JSONRPCToolset before opening the websocket.
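
For the async gym path, a bounded retry around the connection attempt is one possible shape for that wait. The sketch below uses only `asyncio` and is not the PR's code; `wait_for_unix_socket` is a hypothetical helper, and it only checks that the socket accepts connections, not that the app is fully initialised:

```python
import asyncio
import time


async def wait_for_unix_socket(path: str, timeout: float = 30.0, interval: float = 0.1) -> None:
    """Return once `path` accepts a connection; raise TimeoutError otherwise."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            _, writer = await asyncio.open_unix_connection(path)
        except OSError:
            # Not bound yet (file missing) or not listening yet (refused).
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{path} did not accept connections within {timeout}s"
                ) from None
            await asyncio.sleep(interval)
        else:
            writer.close()
            await writer.wait_closed()
            return


# Hypothetical usage inside _connect(), before ws_connect:
# await wait_for_unix_socket(self.config.uds, timeout=self.config.timeout)
```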


def has_tools(self) -> bool:
    return self.config.contract == "mcp"

def tools(self, task: Task) -> list[Toolset]:


🟡 Medium openenv/taskset.py:208

For contract="mcp", tools() returns a plain JSONRPCToolset that sends tool calls directly to the /mcp endpoint via JSON-RPC tools/call. The openenv_done and openenv_reward methods read trace.state.done and trace.state.reward, but nothing in the MCP path updates those fields — only OpenEnvUser.respond() (used for contract="gym") writes to state.done and state.reward. As a result, MCP-backed OpenEnv tasks always return 0.0 reward and never terminate from the environment's done signal. Consider wiring MCP tool-call results into state.done and state.reward, or document that MCP tasksets are not scored/terminating if that is intentional.

Full path: `verifiers/v1/tasksets/openenv/taskset.py`, around line 208.

Comment on lines +92 to +99
    async def _connect(self) -> None:
        if self.ws is not None:
            return
        connector = aiohttp.UnixConnector(path=self.config.uds)
        timeout = aiohttp.ClientTimeout(total=self.config.timeout)
        self.session = aiohttp.ClientSession(connector=connector, timeout=timeout)
        self.ws = await self.session.ws_connect(f"{self.config.base_url}/ws")
        self.action_schema = await self._schema()


🟡 Medium openenv/taskset.py:92

In _connect(), self.ws is assigned before _schema() is awaited, so if _schema() raises (timeout, non-2xx, bad JSON), the websocket and session are left set but self.action_schema remains {}. Every subsequent respond() call returns early at the if self.ws is not None: return guard, so the schema is never refetched — action parsing and rendering stay broken for the rest of the episode instead of recovering on retry. Consider resetting self.ws and self.session when _schema() fails so the next call reconnects from scratch.

    async def _connect(self) -> None:
        if self.ws is not None:
            return
        connector = aiohttp.UnixConnector(path=self.config.uds)
        timeout = aiohttp.ClientTimeout(total=self.config.timeout)
        self.session = aiohttp.ClientSession(connector=connector, timeout=timeout)
        self.ws = await self.session.ws_connect(f"{self.config.base_url}/ws")
-        self.action_schema = await self._schema()
+        try:
+            self.action_schema = await self._schema()
+        except Exception:
+            await self._close()
+            raise

Full path: `verifiers/v1/tasksets/openenv/taskset.py`, around lines 92-99.

    self.action_schema: dict[str, Any] = {}
    self.initial: dict[str, Any] | None = None

async def respond(self, message: str) -> Messages:


🟡 Medium openenv/taskset.py:71

When message is an empty string, respond treats it as a reset request instead of a step call. Since the caller passes response.message.content or "" on every turn, any empty assistant message causes respond to replay the cached self.initial reset result (or reset the environment again) instead of sending the empty string as a step action, silently corrupting the rollout state. Track the initial reset explicitly with a boolean flag rather than testing the truthiness of message on every call.
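
The suggested flag can be sketched with a toy user class. `GymUserSketch` is hypothetical and records calls in a list instead of talking to an environment; only the control flow is the point:

```python
from typing import List, Tuple


class GymUserSketch:
    """Toy stand-in that tracks the initial reset with an explicit flag."""

    def __init__(self) -> None:
        self.started = False
        self.log: List[Tuple[str, str]] = []

    def respond(self, message: str) -> str:
        if not self.started:
            # First call only: reset, whatever the message content is.
            self.started = True
            self.log.append(("reset", ""))
            return "initial observation"
        # Every later call is a step, including an empty assistant message.
        self.log.append(("step", message))
        return f"observation after {message!r}"
```

With the flag, a second empty message is recorded as a step rather than triggering another reset.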

Full path: `verifiers/v1/tasksets/openenv/taskset.py`, around line 71.
