Add v1 shared scoring helpers by xeophon · Pull Request #1907 · PrimeIntellect-ai/verifiers

xeophon · 2026-07-01T08:34:19Z

Overview

Adds one readable v1 scoring helper module for recurring environment idioms.

Details

Adds `verifiers.v1.scoring` with boxed answer extraction, answer-file fallback, simple judge-choice parsing, boxed math verification, stdout comparison, and pytest outcome parsing.
Re-exports those helpers through `verifiers.v1` and includes them in `verifiers.v1.__all__` for the usual environment import style.
Keeps the parsing and comparison behavior aligned with recurring environment needs, including boxed passthroughs, final judge choices, digit-labelled choices, non-finite stdout tokens, and pytest parameterized node IDs.

Note

Low Risk
Additive library utilities with no changes to existing rollout or auth paths; scoring edge cases are covered by new tests.

Overview
Introduces `verifiers.v1.scoring` so v1 environments can share the same scoring idioms via `import verifiers.v1 as vf`.

The module adds `extract_boxed_answer` (last `\boxed{…}` with nested-brace handling), `read_answer_file_or_last_reply` (sandbox file read with `trace.last_reply` fallback), `parse_judge_choice` (verdict markers, `<think>` stripping, boxed strict mode), `verify_boxed_math_answer` (`math_verify` with timeouts), `compare_stdout_results` (whitespace/token equality plus numeric tolerance), and `parse_pytest_outcomes` (ANSI strip, reason trimming, parameterized node IDs). All six are re-exported from `verifiers.v1` and `__all__`.
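To make "last `\boxed{…}` with nested-brace handling" concrete, here is a minimal sketch of one way to do it. The function name `last_boxed_content` is hypothetical, and the fallback to an earlier `\boxed{` when the last one is unbalanced is an assumption of this sketch, not something the PR confirms about `extract_boxed_answer`.

```python
from typing import Optional


def last_boxed_content(text: str) -> Optional[str]:
    # Hypothetical helper: returns the content of the last balanced \boxed{...}.
    marker = "\\boxed{"
    start = text.rfind(marker)
    while start != -1:
        depth = 1
        begin = start + len(marker)
        for index in range(begin, len(text)):
            char = text[index]
            if char == "{":
                depth += 1
            elif char == "}":
                depth -= 1
                if depth == 0:
                    # Matching close brace found: return what is inside.
                    return text[begin:index]
        # Assumption: an unbalanced box is skipped and the previous one is tried.
        start = text.rfind(marker, 0, start)
    return None


print(last_boxed_content(r"so \boxed{1} then \boxed{\frac{1}{2}}"))
```

Counting brace depth rather than using a regex is what lets nested groups such as `\frac{1}{2}` survive intact.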

tests/v1/test_scoring.py covers stdout leniency, numeric tolerance, pytest outcome parsing, and judge choice after a verdict marker.

Reviewed by Cursor Bugbot for commit 349f0c5.

Note

Add shared scoring helpers to the `verifiers.v1` package

Adds `scoring.py` with six utilities: `extract_boxed_answer`, `read_answer_file_or_last_reply`, `parse_judge_choice`, `verify_boxed_math_answer`, `compare_stdout_results`, and `parse_pytest_outcomes`.
`compare_stdout_results` tolerates minor formatting and numeric differences (default tolerance 1e-3) when comparing stdout output.
`verify_boxed_math_answer` uses math-verify with timeout handling to produce a float score for boxed math answers.
`parse_judge_choice` and `parse_pytest_outcomes` extract structured results from judge responses and pytest summary text respectively.
Exports all six helpers from `verifiers.v1.__init__` via `__all__`.
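As an illustration of the numeric tolerance mentioned above, the following sketch compares whitespace-separated tokens and accepts numeric tokens that differ by at most `tolerance`. The name `tokens_match`, the use of an absolute (rather than relative) tolerance, and the choice to reject non-finite values unless the tokens are textually identical are all assumptions; the PR does not state how `compare_stdout_results` handles these details.

```python
import math


def tokens_match(actual: str, expected: str, tolerance: float = 1e-3) -> bool:
    # Hypothetical helper: token-wise comparison with absolute numeric tolerance.
    actual_tokens = actual.split()
    expected_tokens = expected.split()
    if len(actual_tokens) != len(expected_tokens):
        return False
    for got, want in zip(actual_tokens, expected_tokens):
        if got == want:
            continue  # identical text always matches
        try:
            got_value, want_value = float(got), float(want)
        except ValueError:
            return False  # differing non-numeric tokens never match
        # Assumption: nan/inf only match when the tokens are textually equal.
        if not (math.isfinite(got_value) and math.isfinite(want_value)):
            return False
        if abs(got_value - want_value) > tolerance:
            return False
    return True


print(tokens_match("3.1415", "3.1416"))   # within 1e-3
print(tokens_match("1.0 2.0", "1.0 2.5"))  # outside 1e-3
```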

Macroscope summarized 349f0c5.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b78094618

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

macroscopeapp · 2026-07-01T08:39:52Z

Approvability

Verdict: Needs human review

1 blocking correctness issue found. Multiple unresolved review comments identify potential bugs in the new scoring helper logic, including issues with verdict parsing order, pytest ID handling, and stdout comparison semantics. These substantive concerns warrant human review before merging.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec35662624

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49affab23f

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc3938567a

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3f3bb9494

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dfbc0f5de9

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: edd7384e8a

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 349f0c5.

cursor · 2026-07-01T19:04:06Z

+    verdict = re.search(verdict_re, text_upper)
+    if verdict:
+        match = re.search(choice_re, text_upper[verdict.end() :])
+        return choices_by_upper.get(match.group(1)) if match else None


Verdict parsing picks wrong choice

Medium Severity

When a judge reply starts with a line like `Answer:`, `parse_judge_choice` treats that as a verdict marker and returns the first choice letter after it. Wording such as `Answer: A is wrong, B is correct` can score A even when the intended final choice is B.
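The reported behavior can be reproduced with a small sketch. `VERDICT_RE` and `CHOICE_RE` below are assumed stand-ins, since the actual `verdict_re` and `choice_re` patterns are not shown in the excerpt. The second function, which takes the last choice after the marker, is only a contrast case and not a proposed fix: it would in turn mis-score a reply like `Answer: B, not A`.

```python
import re
from typing import Optional

VERDICT_RE = r"ANSWER\s*:"   # assumed verdict-marker pattern
CHOICE_RE = r"\b(A|B)\b"     # assumed choice pattern for choices A and B


def first_choice_after_verdict(text: str) -> Optional[str]:
    # Mirrors the excerpt: first choice letter following the verdict marker.
    text_upper = text.upper()
    verdict = re.search(VERDICT_RE, text_upper)
    if not verdict:
        return None
    match = re.search(CHOICE_RE, text_upper[verdict.end():])
    return match.group(1) if match else None


def last_choice_after_verdict(text: str) -> Optional[str]:
    # Contrast case only: last choice letter following the verdict marker.
    text_upper = text.upper()
    verdict = re.search(VERDICT_RE, text_upper)
    if not verdict:
        return None
    matches = re.findall(CHOICE_RE, text_upper[verdict.end():])
    return matches[-1] if matches else None


reply = "Answer: A is wrong, B is correct"
print(first_choice_after_verdict(reply))  # A
print(last_choice_after_verdict(reply))   # B
```

Because the text is upper-cased before matching, the English article "a" also matches the choice `A` in this sketch, which is another reason positional heuristics alone are fragile here.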

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 349f0c57f8

chatgpt-codex-connector · 2026-07-01T19:06:33Z

+            parts = test_id.split(" - ")
+            for index in range(1, len(parts)):
+                node_id = " - ".join(parts[:index])
+                if node_id.count("[") == node_id.count("]"):
+                    test_id = node_id


Preserve pytest ids that contain bracketed separators

Fresh evidence: I checked `pytest -q -rA` with `pytest.param(id='a] - [b')`, and pytest reports a summary row like `FAILED ...::test_fail[a] - [b] - assert False`; `pytest --help` documents `-rA` as showing extra summary info for failed/xfail/xpass rows. This bracket-balance heuristic stops at the first balanced prefix (`...::test_fail[a]`), so required outcomes keyed by the real node id `...::test_fail[a] - [b]` are recorded as missing. The reason stripper needs to avoid treating arbitrary parameter-id text as the failure-reason delimiter.
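A self-contained sketch of the heuristic shows the truncation. The early `return` stands in for whatever follows the excerpted loop (assumed to stop at the first balanced prefix, as the comment describes), and the path `tests/test_x.py` is a made-up placeholder for the elided `...` prefix. The second function is a hypothetical alternative that only works when the caller already knows the required node ids.

```python
from typing import Iterable


def strip_reason_by_bracket_balance(summary_id: str) -> str:
    # Sketch of the excerpted heuristic: first " - " prefix with balanced brackets.
    parts = summary_id.split(" - ")
    for index in range(1, len(parts)):
        node_id = " - ".join(parts[:index])
        if node_id.count("[") == node_id.count("]"):
            return node_id
    return summary_id


def strip_reason_with_known_ids(summary_id: str, known_ids: Iterable[str]) -> str:
    # Hypothetical alternative: prefer the longest known node id that prefixes the row.
    candidates = [
        known
        for known in known_ids
        if summary_id == known or summary_id.startswith(known + " - ")
    ]
    if candidates:
        return max(candidates, key=len)
    return strip_reason_by_bracket_balance(summary_id)


row = "tests/test_x.py::test_fail[a] - [b] - assert False"  # placeholder path
print(strip_reason_by_bracket_balance(row))
print(strip_reason_with_known_ids(row, ["tests/test_x.py::test_fail[a] - [b]"]))
```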

chatgpt-codex-connector · 2026-07-01T19:06:33Z

+    actual_tokens = [token for line in actual_lines for token in line.split()]
+    expected_tokens = [token for line in expected_lines for token in line.split()]
+    if actual_tokens == expected_tokens:
+        return True


Preserve line boundaries when matching text tokens

For nonnumeric stdout where line breaks are part of the expected answer, such as expected `YES` then `NO` on separate lines but the program prints `YES NO` on one line, these flattened token lists compare equal and return `True` before the numeric-only tolerance path. That can award incorrect text output in environments adopting this shared helper. Compare tokenized lines for textual whitespace normalization and reserve flattened tokens for numeric tolerance.
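The difference between the two comparisons can be sketched as follows. Both function names are hypothetical, and dropping blank lines in the line-wise variant is an assumption of this sketch rather than documented behavior of `compare_stdout_results`.

```python
def flattened_tokens_equal(actual: str, expected: str) -> bool:
    # Flattens all whitespace, including newlines, before comparing.
    return actual.split() == expected.split()


def linewise_tokens_equal(actual: str, expected: str) -> bool:
    # Normalizes whitespace within each line but keeps line boundaries.
    # Assumption: blank lines are ignored on both sides.
    actual_lines = [line.split() for line in actual.splitlines() if line.strip()]
    expected_lines = [line.split() for line in expected.splitlines() if line.strip()]
    return actual_lines == expected_lines


expected = "YES\nNO\n"
actual = "YES NO\n"
print(flattened_tokens_equal(actual, expected))  # True: line break is lost
print(linewise_tokens_equal(actual, expected))   # False: line break is preserved
```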

macroscopeapp · 2026-07-01T19:07:34Z

+
+    text_upper = text.upper()
+    choices_by_upper = {choice.upper(): choice for choice in choices}
+    allowed = "|".join(re.escape(choice) for choice in choices_by_upper)


🟡 Medium v1/scoring.py:62

`choice_re` joins choices in their original order, so when one label is a prefix of another (e.g. `("A", "AB")`), the regex `A|AB` matches the shorter label first. `parse_judge_choice("Final Verdict: AB", choices=("A", "AB"))` returns `"A"` instead of `"AB"`. Sort the alternatives by descending length so longer labels take priority.

Suggested change

-    allowed = "|".join(re.escape(choice) for choice in choices_by_upper)
+    allowed = "|".join(re.escape(choice) for choice in sorted(choices_by_upper, key=len, reverse=True))
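The effect of alternation order can be checked in isolation. The surrounding pattern `VERDICT:\s*(...)` is an assumed simplification of the real `choice_re`, which is not shown in full; only the ordering behavior of Python's `re` alternation is being illustrated.

```python
import re
from typing import Optional, Sequence


def match_choice(text: str, choices: Sequence[str], longest_first: bool) -> Optional[str]:
    # Hypothetical helper: builds an alternation and returns the captured label.
    ordered = sorted(choices, key=len, reverse=True) if longest_first else list(choices)
    allowed = "|".join(re.escape(choice) for choice in ordered)
    match = re.search(rf"VERDICT:\s*({allowed})", text)
    return match.group(1) if match else None


text = "FINAL VERDICT: AB"
print(match_choice(text, ("A", "AB"), longest_first=False))  # A
print(match_choice(text, ("A", "AB"), longest_first=True))   # AB
```

Python tries alternatives left to right and keeps the first that lets the overall pattern succeed, so the shorter label wins unless something after the group (for example a trailing `\b`) forces backtracking. Sorting by descending length removes that dependence.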


cursor Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/parsers.py Outdated
Comment thread verifiers/v1/__init__.py Outdated
Comment thread verifiers/v1/parsers.py Outdated

macroscopeapp Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/tasksets/swe/scoring.py Outdated
Comment thread verifiers/v1/tasksets/swe/scoring.py Outdated
Comment thread verifiers/v1/parsers.py Outdated
Comment thread verifiers/v1/tasksets/math/scoring.py Outdated

chatgpt-codex-connector Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/tasksets/swe/scoring.py Outdated
Comment thread verifiers/v1/tasksets/code/scoring.py Outdated
Comment thread verifiers/v1/tasksets/math/scoring.py Outdated

xeophon force-pushed the feat/v1-shared-scoring-helpers branch from 4b78094 to ec35662 · July 1, 2026 09:24

macroscopeapp Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/parsers.py Outdated
Comment thread verifiers/v1/tasksets/swe/scoring.py Outdated

chatgpt-codex-connector Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/parsers.py Outdated

xeophon force-pushed the feat/v1-shared-scoring-helpers branch from ec35662 to 90b4615 · July 1, 2026 11:02

cursor Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py Outdated
Comment thread verifiers/v1/scoring.py Outdated

xeophon force-pushed the feat/v1-shared-scoring-helpers branch from 90b4615 to 49affab · July 1, 2026 11:08

macroscopeapp Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py Outdated

chatgpt-codex-connector Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/__init__.py

Commit bc39385: Add v1 shared scoring helpers

xeophon force-pushed the feat/v1-shared-scoring-helpers branch from 49affab to bc39385 · July 1, 2026 11:13

chatgpt-codex-connector Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py Outdated

Commit b3f3bb9: Address scoring helper review feedback

macroscopeapp Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py

chatgpt-codex-connector Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py Outdated

Commit dfbc0f5: Ignore pytest skipped summary rows

cursor Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py

macroscopeapp Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py Outdated

chatgpt-codex-connector Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py

xeophon mentioned this pull request Jul 1, 2026
Use shared v1 scoring helpers PrimeIntellect-ai/research-environments#626 (Open)

Commit edd7384: Fix v1 scoring edge cases

cursor Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py Outdated

chatgpt-codex-connector Bot reviewed Jul 1, 2026
Comment thread verifiers/v1/scoring.py Outdated

Commit 349f0c5: Tighten v1 scoring parsers

mikasenghaas approved these changes Jul 1, 2026

cursor Bot reviewed Jul 1, 2026

chatgpt-codex-connector Bot reviewed Jul 1, 2026

macroscopeapp Bot reviewed Jul 1, 2026
