Add NeMo Gym #1886

Open
xeophon wants to merge 4 commits into main from codex/nemo-gym-taskset

@xeophon xeophon commented Jun 26, 2026

Overview

Adds a compact v1 NeMo Gym taskset and a small environment package that shows how to use it with a harder packaged NeMo Gym example.

Details

  • Loads NeMo Gym JSONL rows into typed Verifiers tasks while preserving the source payload.
  • Maps Responses-format prompts through the existing dialect.
  • Forwards declared calls to NeMo Gym's packaged resource server through a generic MCP tool boundary.
  • Keeps harness selection and trace capture in Verifiers, with no NeMo-specific harness or program.
  • Adds nemo-gym-workplace-v1 as a thin config wrapper over the reusable taskset, pinned to NeMo Gym's workplace_assistant resource server.
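
The first bullet (typed tasks that still carry the source payload) can be pictured with a minimal sketch. `SketchTask` and `load_rows` are hypothetical names for illustration, not the classes added by this PR; only the `responses_create_params` key, the `nemo_gym_row` field name, and the `name:idx` naming come from the PR text and diff.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SketchTask:
    """Hypothetical stand-in for the PR's typed task; not the real class."""

    name: str
    request: dict[str, Any]       # the row's responses_create_params
    nemo_gym_row: dict[str, Any]  # the full source row, kept verbatim


def load_rows(jsonl_text: str, prefix: str) -> list[SketchTask]:
    """Parse JSONL text into tasks, skipping blank lines."""
    lines = [line for line in jsonl_text.splitlines() if line.strip()]
    tasks = []
    for idx, line in enumerate(lines):
        row = json.loads(line)
        tasks.append(
            SketchTask(
                name=f"{prefix}:{idx}",
                request=row["responses_create_params"],
                nemo_gym_row=row,
            )
        )
    return tasks
```

Keeping the whole row next to the parsed request means later stages, such as verification, can read fields the typed task does not model.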

Note

Medium Risk
New optional integration that bootstraps NeMo Gym servers in-process and proxies tool HTTP calls, but it is confined to the new taskset path and does not change core harness or auth flows.

Overview
Adds a built-in NeMoGymTaskset so Verifiers v1 can run NVIDIA NeMo Gym packaged JSONL benchmarks through the standard MCP harness, without a NeMo-specific program.

Tasks are built from each row’s responses_create_params (via ResponsesDialect) while keeping the full nemo_gym_row payload. Tooling spins up NeMo Gym’s in-process ASGI resource server for the configured nemo_env, seeds a session per task, and exposes a single nemo_gym_call MCP tool that forwards only tools declared on that task. nemo-gym==0.3.0 stays optional; missing installs get a clear ImportError with a uv run --with hint.
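
A minimal sketch of the optional-dependency guard described above, assuming a check at load time. `require_optional` and the exact message wording are hypothetical; only the `ImportError` and the `uv run --with` hint are taken from the summary.

```python
import importlib.util


def require_optional(module: str, pin: str) -> None:
    """Raise a readable ImportError when an optional dependency is absent."""
    # find_spec returns None for a top-level module that is not installed.
    if importlib.util.find_spec(module) is None:
        raise ImportError(
            f"{module} is not installed; try: uv run --with {pin} <command>"
        )
```

Calling `require_optional("nemo_gym", "nemo-gym==0.3.0")` before the first import would surface the hint instead of a bare `ModuleNotFoundError`.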

Also ships nemo-gym-workplace-v1 as a thin taskset that pins workplace_assistant, exports NeMoGymConfig / NeMoGymTaskset from verifiers.v1.tasksets, and documents CLI usage in environments/README.md and verifiers/v1/README.md (replacing the older nemo_gym_env listing).

Reviewed by Cursor Bugbot for commit 0fb0045.

Note

Add NeMo Gym taskset and workplace environment to verifiers v1

  • Adds NeMoGymTaskset and NeMoGymConfig to verifiers/v1/tasksets/nemo_gym/taskset.py, which loads tasks from a NeMo Gym JSONL dataset and proxies tool calls through an in-process resource server via httpx.AsyncClient with ASGI transport.
  • Tool execution uses session seeding (POST /seed_session) and per-tool endpoints (POST /{name}), exposing a single nemo_gym_call tool to MCP-capable harnesses; tasks without declared tools return no tools.
  • Adds a nemo-gym-workplace-v1 environment package in environments/nemo_gym_workplace_v1/ that pins nemo_env to workplace_assistant via NemoGymWorkplaceTaskset.
  • Exports NeMoGymConfig and NeMoGymTaskset from verifiers.v1.tasksets alongside the existing HarborTaskset.
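
The seeding and forwarding flow in the second bullet can be sketched as follows. `FakeResourceClient` and `ToolProxy` are hypothetical stand-ins, the seed payload shape is an assumption, and the fake client returns plain dicts instead of HTTP responses; the `/seed_session` and `/{name}` paths and the error message match the summary and the diff excerpt.

```python
import asyncio
from typing import Any


class FakeResourceClient:
    """Hypothetical stand-in for the in-process HTTP client; records POSTs."""

    def __init__(self) -> None:
        self.posts: list[tuple[str, dict[str, Any]]] = []

    async def post(self, path: str, json: dict[str, Any]) -> dict[str, Any]:
        self.posts.append((path, json))
        return {"path": path, "echo": json}


class ToolProxy:
    """Forwards only the tools declared on the task."""

    def __init__(self, client: Any, declared: list[str]) -> None:
        self.client = client
        self.declared = set(declared)

    async def setup(self, seed: dict[str, Any]) -> None:
        # Seed one session per task before any tool call.
        await self.client.post("/seed_session", json=seed)

    async def call(self, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
        if name not in self.declared:
            raise ValueError(f"unknown NeMo Gym tool: {name}")
        return await self.client.post(f"/{name}", json=arguments)
```

Rejecting undeclared names before the POST keeps the single wrapper tool from becoming a generic gateway to every endpoint on the resource server.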

Macroscope summarized 0fb0045.

@xeophon changed the title from "Add lean NeMo Gym taskset" to "Add NeMo Gym" on Jun 26, 2026

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 59b494f.

raise ValueError(f"unknown NeMo Gym tool: {name}")
response = await self.client.post(f"/{name}", json=arguments)
response.raise_for_status()
return response.json()

AsyncClient crosses event loops

High Severity

The in-process httpx.AsyncClient for the NeMo resource server is created in _NeMoGymTools.setup during the tool server’s pre-serve asyncio.run setup phase, while nemo_gym_call runs later under uvicorn’s separate event loop. Reusing that client after the first loop closes can break call even when /seed_session succeeded during setup.
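
One possible mitigation, sketched under assumptions rather than taken from the PR: build the client lazily inside whichever event loop is running, and rebuild it if the loop has changed. `LoopBoundClient` and `factory` are hypothetical; in this PR the factory would presumably construct the `httpx.AsyncClient`, and a real fix would also have to close the stale client and re-seed the session.

```python
import asyncio
from typing import Any, Callable, Optional


class LoopBoundClient:
    """Lazily builds a client inside the currently running event loop."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._client: Any = None
        self._loop: Optional[asyncio.AbstractEventLoop] = None

    def get(self) -> Any:
        # Must be called from a coroutine; raises RuntimeError otherwise.
        loop = asyncio.get_running_loop()
        if self._client is None or self._loop is not loop:
            # Assumption: dropping the stale client is acceptable here.
            self._client = self._factory()
            self._loop = loop
        return self._client
```

With this shape, setup under one `asyncio.run` and tool calls under uvicorn's loop would each get a client created in their own loop.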

@macroscopeapp

macroscopeapp Bot commented Jun 26, 2026

Approvability

Verdict: Needs human review

1 blocking correctness issue found. This PR introduces a new NeMo Gym taskset integration with substantial new runtime behavior. Multiple unresolved review comments identify potential bugs, including a high-severity issue with AsyncClient crossing event loops that could cause runtime failures.

Comment thread verifiers/v1/tasksets/nemo_gym/taskset.py
Comment on lines +140 to +144
system_prompt=(
"Call `nemo_gym_call` with the matching tool name and arguments. "
f"Available NeMo Gym tools: "
f"{json.dumps(row['responses_create_params'].get('tools', []))}"
),

🟡 Medium nemo_gym/taskset.py:140

load_tasks() always sets system_prompt to instruct the model to call nemo_gym_call with the matching tool name, but for rows where responses_create_params.tools is missing or empty, tools() returns []. Those tasks expose no nemo_gym_call tool yet demand its use, making them unsatisfiable. Consider omitting or adjusting the system_prompt when no tools are present.

Suggested change
- system_prompt=(
- "Call `nemo_gym_call` with the matching tool name and arguments. "
- f"Available NeMo Gym tools: "
- f"{json.dumps(row['responses_create_params'].get('tools', []))}"
- ),
+ system_prompt=(
+ "Call `nemo_gym_call` with the matching tool name and arguments. "
+ f"Available NeMo Gym tools: "
+ f"{json.dumps(row['responses_create_params'].get('tools', []))}"
+ ) if row['responses_create_params'].get('tools') else None,

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1d0eb27aaa

return response.json()


class NeMoGymTaskset(Taskset[NeMoGymTask, NeMoGymConfig]):

P1: Score rollouts with NeMo Gym verification

As introduced, this taskset loads tasks and tools but never defines any @reward or group reward that calls the resource server's /verify endpoint. Taskset.score() only records decorated rewards, and Trace.reward is just the sum of trace.rewards, so uv run --with nemo-gym==0.3.0 eval nemo_gym ... will report 0 reward for every successful rollout even when NeMo Gym's verifier would return a nonzero reward. Add a reward/finalize path that sends the completed response plus responses_create_params to the NeMo Gym verifier and records the returned reward.
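
A rough sketch of the reward path the comment asks for. Everything here is an assumption except the `/verify` path and the idea of sending the completed response together with `responses_create_params`: the payload keys, the `reward` field in the reply, and the `verify_reward` helper are illustrative, and wiring it into a Verifiers `@reward` is left out because that decorator's signature is not shown in this thread.

```python
import asyncio
from typing import Any


class StubReply:
    """Hypothetical reply object mimicking the two methods used below."""

    def __init__(self, body: dict[str, Any]) -> None:
        self._body = body

    def raise_for_status(self) -> None:
        pass

    def json(self) -> dict[str, Any]:
        return self._body


class StubClient:
    """Hypothetical client that always reports a reward of 1.0."""

    async def post(self, path: str, json: dict[str, Any]) -> StubReply:
        assert path == "/verify"
        return StubReply({"reward": 1.0})


async def verify_reward(client: Any, row: dict[str, Any], response: dict[str, Any]) -> float:
    """POST the finished rollout to /verify and return the reported reward."""
    payload = {
        "responses_create_params": row["responses_create_params"],
        "response": response,  # assumed key; check the verifier's schema
    }
    reply = await client.post("/verify", json=payload)
    reply.raise_for_status()
    return float(reply.json()["reward"])  # assumed field name
```

A decorated reward would then await this helper with the same client that served the tool calls, so the verifier sees the session state the rollout produced.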

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1850666644

name=f"{self.config.resource_server}:{idx}",
prompt=dialect.parse_request(row["responses_create_params"])[0],
system_prompt=(
"Call `nemo_gym_call` with the matching tool name and arguments. "

P2: Avoid prompting for a missing NeMo tool

When a NeMo row has no responses_create_params.tools, tools() returns no MCP server, so the harness never exposes nemo_gym_call; however this unconditional system prompt still tells the model to call it. This affects answer-only NeMo resource servers/tasks and makes the prompt ask for an unavailable tool instead of letting the model answer normally, so condition the instruction on actually exposing the wrapper tool.
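
The condition can be factored into a small helper, sketched here under the assumption that the task accepts `None` as "no system prompt". `build_system_prompt` is a hypothetical name; the prompt text mirrors the diff excerpt.

```python
import json
from typing import Any, Optional


def build_system_prompt(row: dict[str, Any]) -> Optional[str]:
    """Return the tool instruction only when the row declares tools."""
    tools = row["responses_create_params"].get("tools") or []
    if not tools:
        # Answer-only rows get no instruction, so the model answers normally.
        return None
    return (
        "Call `nemo_gym_call` with the matching tool name and arguments. "
        f"Available NeMo Gym tools: {json.dumps(tools)}"
    )
```

Using the same truthiness test here and in `tools()` would keep the prompt and the exposed MCP server from drifting apart.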

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0fb0045a1a

version = "0.1.0"
description = "nemo-gym-workplace-v1 - NeMo Gym workplace_assistant through the built-in taskset."
requires-python = ">=3.11"
dependencies = ["verifiers"]

P2: Add the NeMo Gym runtime dependency

For packaged installs of this environment, this metadata only installs verifiers, but the taskset immediately delegates to NeMoGymTaskset.load_tasks(), which imports nemo_gym and raises if it is absent. The README's manual --with nemo-gym==0.3.0 workaround only helps that one dev command; prime env install/Hub consumers will hit an ImportError before tasks load. Add the NeMo Gym dependency here, and align requires-python with it if needed.

Comment on lines +1 to +5
# nemo-gym-workplace-v1

NeMo Gym's `workplace_assistant` example through the built-in `NeMoGymTaskset`.
This environment pins the taskset config only; the standard Verifiers harness owns the
rollout loop, and NeMo Gym's packaged resource server owns tool execution.

P2: Replace freeform README with generated sections

The root AGENTS.md says, "Environment READMEs must use the generated prime env init section structure; freeform environment READMEs are not allowed." This new README starts as a custom overview/Develop/Layout page instead of that required structure, so it does not satisfy the repository's documented environment README contract. Please regenerate/use the standard sections and fill them in.
