Add v1 shared scoring helpers#1907

Open
xeophon wants to merge 5 commits into main from feat/v1-shared-scoring-helpers

Conversation

@xeophon xeophon commented Jul 1, 2026

Member

Overview

Adds one readable v1 scoring helper module for recurring environment idioms.

Details

  • Adds verifiers.v1.scoring with boxed answer extraction, answer-file fallback, simple judge-choice parsing, boxed math verification, stdout comparison, and pytest outcome parsing.
  • Re-exports those helpers through verifiers.v1 and includes them in verifiers.v1.__all__ for the usual environment import style.
  • Keeps the parsing and comparison behavior aligned with recurring environment needs, including boxed passthroughs, final judge choices, digit-labelled choices, non-finite stdout tokens, and pytest parameterized node IDs.
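As a rough illustration of the pytest outcome parsing mentioned above, here is a minimal sketch that strips ANSI color codes and collects `PASSED`/`FAILED`/`ERROR` summary rows. The name `parse_summary_rows`, the regexes, and the test paths are hypothetical and are not taken from the PR. Reason trimming is deliberately left out, so a failed row keeps its trailing ` - reason` text.

```python
import re

# Matches SGR color sequences such as "\x1b[31m" and "\x1b[0m".
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# Assumed row shape: a status word, whitespace, then the rest of the row.
ROW_RE = re.compile(r"^(PASSED|FAILED|ERROR)\s+(\S.*)$")


def parse_summary_rows(text: str) -> dict[str, str]:
    """Map the text after the status word to the status, one entry per row."""
    outcomes: dict[str, str] = {}
    for raw in text.splitlines():
        line = ANSI_RE.sub("", raw).strip()
        match = ROW_RE.match(line)
        if match:
            outcomes[match.group(2)] = match.group(1)
    return outcomes


summary = (
    "\x1b[32mPASSED\x1b[0m tests/test_a.py::test_one\n"
    "\x1b[31mFAILED\x1b[0m tests/test_a.py::test_two - assert False\n"
)
rows = parse_summary_rows(summary)
```

Separating the node ID from the failure reason is the hard part, because parameterized IDs can themselves contain ` - `.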

Note

Low Risk
Additive library utilities with no changes to existing rollout or auth paths; scoring edge cases are covered by new tests.

Overview
Introduces verifiers.v1.scoring so v1 environments can share the same scoring idioms via import verifiers.v1 as vf.

The module adds extract_boxed_answer (last \boxed{…} with nested-brace handling), read_answer_file_or_last_reply (sandbox file read with trace.last_reply fallback), parse_judge_choice (verdict markers, <think> stripping, boxed strict mode), verify_boxed_math_answer (math_verify with timeouts), compare_stdout_results (whitespace/token equality plus numeric tolerance), and parse_pytest_outcomes (ANSI strip, reason trimming, parameterized node IDs). All six are re-exported from verifiers.v1 and __all__.
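For readers unfamiliar with the "last \boxed{…} with nested-brace handling" idiom, a minimal sketch follows. It is not the PR's `extract_boxed_answer`; the name `last_boxed` and the choice to ignore an unbalanced trailing box are assumptions made for illustration.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last balanced \\boxed{...}, or None."""
    marker = "\\boxed{"
    result = None
    pos = text.find(marker)
    while pos != -1:
        begin = pos + len(marker)
        depth = 1
        i = begin
        # Walk forward, tracking brace depth until the box closes.
        while i < len(text) and depth:
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
            i += 1
        if depth != 0:
            break  # unbalanced tail: keep whatever was found earlier
        result = text[begin : i - 1]
        pos = text.find(marker, i)
    return result
```

A plain regex such as `\\boxed\{(.*?)\}` would stop at the first closing brace and truncate answers like `\frac{1}{2}`, which is why a depth counter is used here.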

tests/v1/test_scoring.py covers stdout leniency, numeric tolerance, pytest outcome parsing, and judge choice after a verdict marker.

Reviewed by Cursor Bugbot for commit 349f0c5. Bugbot is set up for automated code reviews on this repo.

Note

Add shared scoring helpers to the verifiers.v1 package

  • Adds scoring.py with six utilities: extract_boxed_answer, read_answer_file_or_last_reply, parse_judge_choice, verify_boxed_math_answer, compare_stdout_results, and parse_pytest_outcomes.
  • compare_stdout_results tolerates minor formatting and numeric differences (default tolerance 1e-3) when comparing stdout output.
  • verify_boxed_math_answer uses math-verify with timeout handling to produce a float score for boxed math answers.
  • parse_judge_choice and parse_pytest_outcomes extract structured results from judge responses and pytest summary text respectively.
  • Exports all six helpers from verifiers.v1.__init__ via __all__.
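To make the numeric tolerance concrete, here is a minimal per-token sketch. It is not the PR's `compare_stdout_results`: the name `token_matches` is hypothetical, the tolerance is assumed to be absolute, and non-finite tokens (`nan`, `inf`) are assumed to require an exact string match. None of these three details is confirmed by the summaries above.

```python
import math


def token_matches(actual: str, expected: str, tolerance: float = 1e-3) -> bool:
    """Compare two stdout tokens: exact text match, else numeric closeness."""
    if actual == expected:
        return True
    try:
        a, e = float(actual), float(expected)
    except ValueError:
        return False  # at least one token is non-numeric text
    if not (math.isfinite(a) and math.isfinite(e)):
        return False  # assumed: nan/inf never match through the tolerance path
    return abs(a - e) <= tolerance
```

Under these assumptions `3.1416` matches `3.14159`, while `inf` does not match `Infinity` even though both parse to the same float.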

Macroscope summarized 349f0c5.

Comment thread verifiers/v1/parsers.py Outdated
Comment thread verifiers/v1/__init__.py Outdated
Comment thread verifiers/v1/parsers.py Outdated
Comment thread verifiers/v1/tasksets/swe/scoring.py Outdated
Comment thread verifiers/v1/tasksets/swe/scoring.py Outdated
Comment thread verifiers/v1/parsers.py Outdated
Comment thread verifiers/v1/tasksets/math/scoring.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4b78094618

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread verifiers/v1/tasksets/swe/scoring.py Outdated
Comment thread verifiers/v1/tasksets/code/scoring.py Outdated
Comment thread verifiers/v1/tasksets/math/scoring.py Outdated

macroscopeapp Bot commented Jul 1, 2026

Approvability

Verdict: Needs human review

1 blocking correctness issue found. Multiple unresolved review comments identify potential bugs in the new scoring helper logic, including issues with verdict parsing order, pytest ID handling, and stdout comparison semantics. These substantive concerns warrant human review before merging.

@xeophon xeophon force-pushed the feat/v1-shared-scoring-helpers branch from 4b78094 to ec35662 Compare July 1, 2026 09:24
Comment thread verifiers/v1/parsers.py Outdated
Comment thread verifiers/v1/tasksets/swe/scoring.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ec35662624

Comment thread verifiers/v1/parsers.py Outdated
@xeophon xeophon force-pushed the feat/v1-shared-scoring-helpers branch from ec35662 to 90b4615 Compare July 1, 2026 11:02
Comment thread verifiers/v1/scoring.py Outdated
Comment thread verifiers/v1/scoring.py Outdated
@xeophon xeophon force-pushed the feat/v1-shared-scoring-helpers branch from 90b4615 to 49affab Compare July 1, 2026 11:08
Comment thread verifiers/v1/scoring.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49affab23f

Comment thread verifiers/v1/__init__.py
@xeophon xeophon force-pushed the feat/v1-shared-scoring-helpers branch from 49affab to bc39385 Compare July 1, 2026 11:13

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bc3938567a

Comment thread verifiers/v1/scoring.py Outdated
Comment thread verifiers/v1/scoring.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b3f3bb9494

Comment thread verifiers/v1/scoring.py Outdated
Comment thread verifiers/v1/scoring.py
Comment thread verifiers/v1/scoring.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dfbc0f5de9

Comment thread verifiers/v1/scoring.py
Comment thread verifiers/v1/scoring.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: edd7384e8a

Comment thread verifiers/v1/scoring.py Outdated

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 349f0c5.

Comment thread verifiers/v1/scoring.py
verdict = re.search(verdict_re, text_upper)
if verdict:
    match = re.search(choice_re, text_upper[verdict.end() :])
    return choices_by_upper.get(match.group(1)) if match else None

Verdict parsing picks wrong choice

Medium Severity

When a judge reply starts with a line like Answer:, parse_judge_choice treats that as a verdict marker and returns the first choice letter after it. Wording such as Answer: A is wrong, B is correct can score A even when the intended final choice is B.
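The ambiguity can be reproduced with a small sketch. The patterns `VERDICT_RE` and `CHOICE_RE` below are stand-ins chosen for illustration; the PR's actual `verdict_re` and `choice_re` are not shown in this thread, and both function names are hypothetical.

```python
import re

CHOICES = ("A", "B")
# Assumed patterns, not the PR's real ones.
VERDICT_RE = r"ANSWER:"
CHOICE_RE = r"\b(" + "|".join(re.escape(c) for c in CHOICES) + r")\b"


def first_choice_after_marker(text: str):
    """Mirror the flagged behavior: take the first label after the marker."""
    upper = text.upper()
    verdict = re.search(VERDICT_RE, upper)
    if not verdict:
        return None
    match = re.search(CHOICE_RE, upper[verdict.end():])
    return match.group(1) if match else None


def last_choice_after_marker(text: str):
    """One possible alternative: take the last label after the marker."""
    upper = text.upper()
    verdict = re.search(VERDICT_RE, upper)
    if not verdict:
        return None
    found = re.findall(CHOICE_RE, upper[verdict.end():])
    return found[-1] if found else None


reply = "Answer: A is wrong, B is correct"
```

For `reply`, the first-match variant yields `A` and the last-match variant yields `B`. Taking the last label is not a general fix either: `Answer: B, not A` would then yield `A`, so free-form wording after a marker stays ambiguous under both rules.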

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 349f0c57f8

Comment thread verifiers/v1/scoring.py
Comment on lines +159 to +163
parts = test_id.split(" - ")
for index in range(1, len(parts)):
    node_id = " - ".join(parts[:index])
    if node_id.count("[") == node_id.count("]"):
        test_id = node_id

P2: Preserve pytest ids that contain bracketed separators

Fresh evidence: I checked pytest -q -rA with pytest.param(id='a] - [b'), and pytest reports a summary row like FAILED ...::test_fail[a] - [b] - assert False; pytest --help documents -rA as showing extra summary info for failed/xfail/xpass rows. This bracket-balance heuristic stops at the first balanced prefix (...::test_fail[a]), so required outcomes keyed by the real node id ...::test_fail[a] - [b] are recorded as missing; the reason stripper needs to avoid treating arbitrary parameter-id text as the failure-reason delimiter.
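A standalone sketch of the bracket-balance heuristic makes the failure mode easy to reproduce. The function name `strip_reason` and the file path `tests/test_x.py` are hypothetical, and returning at the first balanced prefix is an assumption based on the description above, since the quoted hunk ends before any `break` or `return`.

```python
def strip_reason(test_id: str) -> str:
    """Trim a trailing ' - reason' using the bracket-balance heuristic."""
    parts = test_id.split(" - ")
    for index in range(1, len(parts)):
        node_id = " - ".join(parts[:index])
        if node_id.count("[") == node_id.count("]"):
            return node_id  # assumed: stop at the first balanced prefix
    return test_id


# Parameter id "1 - 2": the first prefix is unbalanced, so the id survives.
ok = strip_reason("tests/test_x.py::test_ok[1 - 2] - assert False")
# Parameter id "a] - [b": the first prefix is already balanced, so the id is cut short.
cut = strip_reason("tests/test_x.py::test_fail[a] - [b] - assert False")
```

Here `ok` is the full node ID `tests/test_x.py::test_ok[1 - 2]`, while `cut` is `tests/test_x.py::test_fail[a]`, which drops the ` - [b]` part of the real node ID.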

Comment thread verifiers/v1/scoring.py
Comment on lines +119 to +122
actual_tokens = [token for line in actual_lines for token in line.split()]
expected_tokens = [token for line in expected_lines for token in line.split()]
if actual_tokens == expected_tokens:
    return True

P2: Preserve line boundaries when matching text tokens

For nonnumeric stdout where line breaks are part of the expected answer, such as expected YES then NO on separate lines but the program prints YES NO on one line, these flattened token lists compare equal and return True before the numeric-only tolerance path. That can award incorrect text output in environments adopting this shared helper; compare tokenized lines for textual whitespace normalization and reserve flattened tokens for numeric tolerance.
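The difference between the two comparisons can be sketched as follows. Both function names are hypothetical, and skipping blank lines in the line-wise variant is an assumption rather than something the PR specifies.

```python
def flattened_equal(actual: str, expected: str) -> bool:
    """Compare all whitespace-separated tokens, ignoring line breaks."""
    return actual.split() == expected.split()


def linewise_equal(actual: str, expected: str) -> bool:
    """Compare tokens line by line, so line breaks stay significant."""
    def rows(text: str) -> list[list[str]]:
        # Assumed: blank lines carry no meaning and are dropped.
        return [line.split() for line in text.splitlines() if line.strip()]

    return rows(actual) == rows(expected)
```

With expected output `YES` and `NO` on separate lines and actual output `YES NO` on one line, `flattened_equal` accepts the answer and `linewise_equal` rejects it. Trailing spaces and a missing final newline are still tolerated by the line-wise variant.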

Comment thread verifiers/v1/scoring.py

text_upper = text.upper()
choices_by_upper = {choice.upper(): choice for choice in choices}
allowed = "|".join(re.escape(choice) for choice in choices_by_upper)

🟡 Medium v1/scoring.py:62

choice_re joins choices in their original order, so when one label is a prefix of another (e.g. ("A", "AB")), the regex A|AB matches the shorter label first. parse_judge_choice("Final Verdict: AB", choices=("A", "AB")) returns "A" instead of "AB". Sort the alternatives by descending length so longer labels take priority.

Suggested change
- allowed = "|".join(re.escape(choice) for choice in choices_by_upper)
+ allowed = "|".join(re.escape(choice) for choice in sorted(choices_by_upper, key=len, reverse=True))
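The alternation-order behavior is a property of Python's `re` module and can be checked in isolation. The helper `first_label` below is hypothetical and only demonstrates the ordering effect; it does not reproduce the full `choice_re` from the PR.

```python
import re


def first_label(text: str, choices: tuple[str, ...], longest_first: bool):
    """Return the first label matched by an alternation built from choices."""
    ordered = sorted(choices, key=len, reverse=True) if longest_first else choices
    allowed = "|".join(re.escape(choice) for choice in ordered)
    match = re.search(allowed, text)
    return match.group(0) if match else None


tail = " AB"  # the text after a marker such as "FINAL VERDICT:"
naive = first_label(tail, ("A", "AB"), longest_first=False)
fixed = first_label(tail, ("A", "AB"), longest_first=True)
```

Python tries alternatives left to right and takes the first one that succeeds at a given position, so `A|AB` yields `A` while `AB|A` yields `AB`.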

2 participants