Add v1 shared scoring helpers #1907
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4b78094618
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Approvability
Verdict: Needs human review
1 blocking correctness issue found. Multiple unresolved review comments identify potential bugs in the new scoring helper logic, including issues with verdict parsing order, pytest ID handling, and stdout comparison semantics. These substantive concerns warrant human review before merging.
4b78094 to ec35662
Reviewed commit: ec35662624
ec35662 to 90b4615
90b4615 to 49affab
Reviewed commit: 49affab23f
49affab to bc39385
Reviewed commit: bc3938567a
Reviewed commit: b3f3bb9494
Reviewed commit: dfbc0f5de9
Reviewed commit: edd7384e8a
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 349f0c5.
    verdict = re.search(verdict_re, text_upper)
    if verdict:
        match = re.search(choice_re, text_upper[verdict.end() :])
        return choices_by_upper.get(match.group(1)) if match else None
Verdict parsing picks wrong choice
Medium Severity
When a judge reply starts with a line like `Answer:`, `parse_judge_choice` treats that as a verdict marker and returns the first choice letter after it. Wording such as `Answer: A is wrong, B is correct` can score `A` even when the intended final choice is `B`.
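One conservative way to avoid the misparse described here is to accept a verdict line only when it names exactly one choice, and to treat anything else as unparseable. The sketch below is not the PR's implementation: the function name `parse_choice_strict` and the marker list (`FINAL VERDICT`, `VERDICT`, `ANSWER`) are assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def parse_choice_strict(text: str, choices: Sequence[str] = ("A", "B")) -> Optional[str]:
    """Hypothetical stricter parser: a verdict line must name exactly one choice."""
    by_upper = {choice.upper(): choice for choice in choices}
    # Longest labels first so a label that prefixes another cannot shadow it.
    allowed = "|".join(re.escape(c) for c in sorted(by_upper, key=len, reverse=True))
    # Only match labels that stand alone, not letters inside longer words.
    choice_re = re.compile(rf"(?<![A-Z0-9])({allowed})(?![A-Z0-9])")
    # Assumed marker spellings; the real helper may recognise a different set.
    marker_re = re.compile(r"(?:FINAL\s+VERDICT|VERDICT|ANSWER)\s*:\s*(.*)")

    # Scan from the end so the last verdict line wins.
    for line in reversed(text.upper().splitlines()):
        marker = marker_re.search(line)
        if marker is None:
            continue
        found = set(choice_re.findall(marker.group(1)))
        # Zero or several distinct labels on the line: refuse to guess.
        return by_upper[found.pop()] if len(found) == 1 else None
    return None
```

The trade-off is that a line such as `Answer: B is a better reply` is also rejected, because the article "a" reads as the label `A`. Whether returning `None` is preferable to a wrong score depends on how the environment handles unparsed verdicts.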
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 349f0c57f8
    parts = test_id.split(" - ")
    for index in range(1, len(parts)):
        node_id = " - ".join(parts[:index])
        if node_id.count("[") == node_id.count("]"):
            test_id = node_id
Preserve pytest ids that contain bracketed separators
Fresh evidence: I checked `pytest -q -rA` with `pytest.param(id='a] - [b')`, and pytest reports a summary row like `FAILED ...::test_fail[a] - [b] - assert False`; `pytest --help` documents `-rA` as showing extra summary info for failed/xfail/xpass rows. This bracket-balance heuristic stops at the first balanced prefix (`...::test_fail[a]`), so required outcomes keyed by the real node id `...::test_fail[a] - [b]` are recorded as missing; the reason stripper needs to avoid treating arbitrary parameter-id text as the failure-reason delimiter.
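When the caller already knows which node ids it requires, one possible way to sidestep the delimiter ambiguity is to match summary rows against those ids instead of guessing where the reason starts. This is a sketch under that assumption only; `match_known_node_id` and the `test_mod.py` path are hypothetical names, not part of the PR.

```python
from typing import Iterable, Optional


def match_known_node_id(row: str, known_ids: Iterable[str]) -> Optional[str]:
    """Return the longest known node id that prefixes a summary row.

    `row` is the text after the outcome word, e.g. the part following "FAILED ".
    """
    best: Optional[str] = None
    for node_id in known_ids:
        # The id must be the whole row or be followed by the " - " reason separator.
        if row == node_id or row.startswith(node_id + " - "):
            if best is None or len(node_id) > len(best):
                best = node_id
    return best


# Hypothetical ids: the second one contains " - " inside its parameter id.
known = {"test_mod.py::test_fail[a]", "test_mod.py::test_fail[a] - [b]"}
row = "test_mod.py::test_fail[a] - [b] - assert False"
matched = match_known_node_id(row, known)
```

Preferring the longest match keeps `test_fail[a] - [b]` from collapsing into `test_fail[a]`. It does not help when the required ids are unknown in advance, and a reason text that happens to extend one known id into another would still be misread.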
    actual_tokens = [token for line in actual_lines for token in line.split()]
    expected_tokens = [token for line in expected_lines for token in line.split()]
    if actual_tokens == expected_tokens:
        return True
Preserve line boundaries when matching text tokens
For nonnumeric stdout where line breaks are part of the expected answer, such as expected `YES` then `NO` on separate lines but the program prints `YES NO` on one line, these flattened token lists compare equal and return `True` before the numeric-only tolerance path. That can award incorrect text output in environments adopting this shared helper; compare tokenized lines for textual whitespace normalization and reserve flattened tokens for numeric tolerance.
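The ordering suggested in this comment could look roughly like the following: compare per-line token lists first, and fall back to flattened tokens only when every token parses as a number. The name `outputs_match` is hypothetical, and applying the tolerance as both a relative and an absolute bound is an assumption; only the `1e-3` default comes from the PR summary.

```python
import math


def outputs_match(actual: str, expected: str, tol: float = 1e-3) -> bool:
    """Hypothetical comparison that keeps line boundaries for text output."""
    actual_lines = [line.split() for line in actual.strip().splitlines()]
    expected_lines = [line.split() for line in expected.strip().splitlines()]

    # Textual path: whitespace inside a line is normalised, line breaks are not.
    if actual_lines == expected_lines:
        return True

    # Numeric path: line breaks are ignored, but every token must be a number.
    actual_tokens = [token for line in actual_lines for token in line]
    expected_tokens = [token for line in expected_lines for token in line]
    if len(actual_tokens) != len(expected_tokens):
        return False
    try:
        return all(
            math.isclose(float(a), float(e), rel_tol=tol, abs_tol=tol)
            for a, e in zip(actual_tokens, expected_tokens)
        )
    except ValueError:
        # A non-numeric token reached the numeric path, so the outputs differ.
        return False
```

With this ordering, `YES NO` on one line no longer matches `YES` and `NO` on separate lines, while purely numeric output may still be reflowed across lines.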
    text_upper = text.upper()
    choices_by_upper = {choice.upper(): choice for choice in choices}
    allowed = "|".join(re.escape(choice) for choice in choices_by_upper)
🟡 Medium verifiers/v1/scoring.py:62
`choice_re` joins choices in their original order, so when one label is a prefix of another (e.g. `("A", "AB")`), the regex `A|AB` matches the shorter label first. `parse_judge_choice("Final Verdict: AB", choices=("A", "AB"))` returns `"A"` instead of `"AB"`. Sort the alternatives by descending length so longer labels take priority.
    - allowed = "|".join(re.escape(choice) for choice in choices_by_upper)
    + allowed = "|".join(re.escape(choice) for choice in sorted(choices_by_upper, key=len, reverse=True))
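The prefix issue can be reproduced in isolation, since Python's `re` alternation tries branches left to right and stops at the first one that matches. The helper `first_label` below is a hypothetical stand-in for illustration, not the PR's `parse_judge_choice`.

```python
import re
from typing import Optional, Sequence


def first_label(text: str, choices: Sequence[str], longest_first: bool) -> Optional[str]:
    """Return the first label found after the colon in `text`."""
    ordered = sorted(choices, key=len, reverse=True) if longest_first else list(choices)
    pattern = re.compile("|".join(re.escape(choice) for choice in ordered))
    # Search only after the marker so letters inside the marker itself are skipped.
    match = pattern.search(text, text.index(":") + 1)
    return match.group(0) if match else None


text = "FINAL VERDICT: AB"
original_order = first_label(text, ("A", "AB"), longest_first=False)
sorted_order = first_label(text, ("A", "AB"), longest_first=True)
```

Here `original_order` is `"A"` and `sorted_order` is `"AB"`, which matches the behaviour described in the comment.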


Overview
Adds one readable v1 scoring helper module for recurring environment idioms.
Details
- Adds `verifiers.v1.scoring` with boxed answer extraction, answer-file fallback, simple judge-choice parsing, boxed math verification, stdout comparison, and pytest outcome parsing.
- Re-exports the helpers from `verifiers.v1` and includes them in `verifiers.v1.__all__` for the usual environment import style.
Note
Low Risk
Additive library utilities with no changes to existing rollout or auth paths; scoring edge cases are covered by new tests.
Overview
Introduces `verifiers.v1.scoring` so v1 environments can share the same scoring idioms via `import verifiers.v1 as vf`.
The module adds:
- `extract_boxed_answer` (last `\boxed{…}` with nested-brace handling)
- `read_answer_file_or_last_reply` (sandbox file read with `trace.last_reply` fallback)
- `parse_judge_choice` (verdict markers, `<think>` stripping, boxed strict mode)
- `verify_boxed_math_answer` (`math_verify` with timeouts)
- `compare_stdout_results` (whitespace/token equality plus numeric tolerance)
- `parse_pytest_outcomes` (ANSI strip, reason trimming, parameterized node IDs)
All six are re-exported from `verifiers.v1` and `__all__`.
`tests/v1/test_scoring.py` covers stdout leniency, numeric tolerance, pytest outcome parsing, and judge choice after a verdict marker.
Reviewed by Cursor Bugbot for commit 349f0c5.
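As an illustration of what "last `\boxed{…}` with nested-brace handling" involves, the sketch below finds the final `\boxed{` and walks forward with a brace-depth counter. It is written from the summary alone; the name `last_boxed` is hypothetical and the PR's `extract_boxed_answer` may differ in details such as whitespace handling.

```python
from typing import Optional


def last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, or None if absent or unbalanced."""
    opener = "\\boxed{"
    start = text.rfind(opener)
    if start == -1:
        return None

    begin = start + len(opener)
    depth = 1  # The opener's own brace is already open.
    for index in range(begin, len(text)):
        char = text[index]
        if char == "{":
            depth += 1
        elif char == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace for the opener: everything between is the answer.
                return text[begin:index]
    return None  # Ran out of text before the braces balanced.


answer = last_boxed(r"First \boxed{1}, then finally \boxed{\frac{1}{2}}.")
```

A regex such as `\\boxed\{(.*?)\}` would stop at the first `}` and return `\frac{1`, which is why a depth counter is used here. Note that this sketch returns the inner box when one `\boxed` is nested inside another.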
Note
Add shared scoring helpers to the `verifiers.v1` package
- Adds `extract_boxed_answer`, `read_answer_file_or_last_reply`, `parse_judge_choice`, `verify_boxed_math_answer`, `compare_stdout_results`, and `parse_pytest_outcomes`.
- `compare_stdout_results` tolerates minor formatting and numeric differences (default tolerance 1e-3) when comparing stdout output.
- `verify_boxed_math_answer` uses `math-verify` with timeout handling to produce a float score for boxed math answers.
- `parse_judge_choice` and `parse_pytest_outcomes` extract structured results from judge responses and pytest summary text respectively.
- Exports the helpers from `verifiers.v1.__init__` via `__all__`.
Macroscope summarized 349f0c5.
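For readers unfamiliar with the pytest summary text mentioned above, a minimal sketch of the ANSI-stripping step might look like this. The name `summary_rows`, the restriction to `PASSED`/`FAILED`/`ERROR` rows, and the `test_mod.py` ids are assumptions; the sketch deliberately leaves any trailing ` - reason` text attached, since trimming it is the ambiguous part discussed in the review.

```python
import re
from typing import Dict

# SGR colour sequences such as "\x1b[31m" (red) and "\x1b[0m" (reset).
ANSI_RE = re.compile(r"\x1b\[[0-9;]*m")
# Outcome word, whitespace, then the rest of the row (node id plus optional reason).
ROW_RE = re.compile(r"^(PASSED|FAILED|ERROR)\s+(\S.*)$")


def summary_rows(output: str) -> Dict[str, str]:
    """Map each summary row's remainder to its outcome word."""
    outcomes: Dict[str, str] = {}
    for raw_line in output.splitlines():
        line = ANSI_RE.sub("", raw_line).strip()
        row = ROW_RE.match(line)
        if row:
            outcomes[row.group(2)] = row.group(1)
    return outcomes


sample = (
    "\x1b[32mPASSED\x1b[0m test_mod.py::test_ok\n"
    "\x1b[31mFAILED\x1b[0m test_mod.py::test_bad - assert False\n"
    "2 tests ran\n"
)
rows = summary_rows(sample)
```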