Port CP-aware verifier + error taxonomy by corbyrosset · Pull Request #77 · microsoft/fara

corbyrosset · 2026-05-28T20:49:32Z

Summary

Brings webeval's rubric_agent package up to parity with agento_next/main (post-#1071), adapted for webeval.

- New CriticalPointAgent (Step -1): classifies the task against a critical-point taxonomy (critical_point_types.yaml) and threads the result through rubric generation, action-only scoring, outcome verification, and a new CP-violation check.
- New VerifierAgent (Steps 9a/9b/10): first-point-of-failure analysis using the error taxonomy, trajectory-informed task verification, and unified task verification. Step 11 (synthetic human-voice feedback) is intentionally dropped; it is not needed in webeval.
- Some refactors: extracts shared helpers (format_action_history, call_llm, encode_image_b64, get_init_url_context, build_scored_rubric_summary, build_all_screenshot_evidence_text) into formatting.py so MMRubricAgent and VerifierAgent don't duplicate them.
- Include support for evaluating run_command: adds StepSummary.tool_output, populates it from post-action ToolOutput observations, renders a Command Output: line in the action history, and teaches the prompts to treat that output as ground truth (so an unchanged desktop after run_command is not read as failure) while sanity-checking that the command isn't a fake (echo "success").
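
As a rough illustration of the run_command item above, the rendering could look something like the sketch below. Only `StepSummary.tool_output`, `format_action_history`, and the `Command Output:` label come from this PR; the other fields, the step layout, and the sample actions are assumptions made for the example, not the code in formatting.py.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepSummary:
    # Hypothetical fields; only `tool_output` is named in this PR.
    index: int
    action: str
    # Text captured from a post-action ToolOutput observation, if any.
    tool_output: Optional[str] = None


def format_action_history(steps: List[StepSummary]) -> str:
    """Render one line per action, plus a Command Output line when present."""
    lines = []
    for step in steps:
        lines.append(f"Step {step.index}: {step.action}")
        if step.tool_output is not None:
            lines.append(f"  Command Output: {step.tool_output.strip()}")
    return "\n".join(lines)


history = format_action_history([
    StepSummary(1, "click(search_box)"),
    StepSummary(2, "run_command('ls /tmp/out')", tool_output="report.csv\n"),
])
print(history)
```

Keeping the output on its own labelled line is what lets a prompt refer to it as ground truth, independent of whether the screenshot changed.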

Architecture note vs upstream #1071

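The branch keeps MMRubricAgent and VerifierAgent as independent agents orchestrated by the caller, rather than following upstream #1071's compose pattern. verify_trajectories.py drives them in sequence.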

Test plan

All 28 webeval unit tests pass (pytest tests/).
Live end-to-end test exists: tests/test_verify_trajectories.py::test_verify_trajectories_live_llm runs the real MMRubricAgent + VerifierAgent against the checked-in data/example_trajectory/alltrails_find_23/ trajectory and asserts the new CP-aware fields (cp_type_used, cp_violation, error_taxonomy.first_point_of_failure.failure_points[].error_code, Steps 9b/10 is_ambiguous/is_invalid) hit the score file. Opt-in:
```
FARA_VERIFY_LIVE_TEST=1 \
FARA_VERIFY_JUDGE_CONFIG=/path/to/judge/configs \
FARA_VERIFY_O4MINI_CONFIG=/path/to/o4mini/configs \
FARA_VERIFY_JUDGE_MODEL=gpt-5.2 \
pytest tests/test_verify_trajectories.py::test_verify_trajectories_live_llm -v -s
```
Confirm live-LLM run passes end-to-end (currently in flight; will update PR if it surfaces issues).
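
For orientation, the field check in the live test might be sketched as follows. The field names and the `error_taxonomy.first_point_of_failure.failure_points[].error_code` path come from the test plan above; the helper name `missing_cp_fields`, the top-level placement of `cp_type_used` and `cp_violation`, and all sample values are assumptions, and the Steps 9b/10 `is_ambiguous`/`is_invalid` flags are left out because their location in the score file is not stated here.

```python
from typing import Any, Dict, List


def missing_cp_fields(score: Dict[str, Any]) -> List[str]:
    """Return the CP-aware fields that are absent from a score dict."""
    missing = []
    # Assumed to live at the top level of the score file.
    for key in ("cp_type_used", "cp_violation"):
        if key not in score:
            missing.append(key)
    failure_points = (
        score.get("error_taxonomy", {})
        .get("first_point_of_failure", {})
        .get("failure_points")
    )
    if failure_points is None:
        missing.append("error_taxonomy.first_point_of_failure.failure_points")
    else:
        # Every listed failure point should carry an error code.
        for i, point in enumerate(failure_points):
            if "error_code" not in point:
                missing.append(f"failure_points[{i}].error_code")
    return missing


# Hypothetical score contents, for illustration only.
sample_score = {
    "cp_type_used": "example_cp_type",
    "cp_violation": False,
    "error_taxonomy": {
        "first_point_of_failure": {
            "failure_points": [{"error_code": "EXAMPLE_CODE"}],
        }
    },
}
print(missing_cp_fields(sample_score))
print(missing_cp_fields({"cp_type_used": "example_cp_type"}))
```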

🤖 Generated with Claude Code

Brings the rubric_agent package up to parity with agento_next/main (post-#1071 architecture), adapted to webeval:

- Adds `CriticalPointAgent` (Step -1) that classifies the task against a critical-point taxonomy (`critical_point_types.yaml`) and threads the result through rubric generation, action-only scoring, outcome verification, and a new CP-violation check.
- Adds `VerifierAgent` (Steps 9a/9b/10): first-point-of-failure analysis with the error taxonomy, trajectory-informed task verification, and unified task verification. Step 11 (synthetic human-voice feedback) is intentionally dropped; it is not needed in webeval.
- Mirrors upstream #1071's DRY refactor: extracts shared helpers (`format_action_history`, `call_llm`, `encode_image_b64`, `get_init_url_context`, `build_scored_rubric_summary`, `build_all_screenshot_evidence_text`) into `formatting.py` so `MMRubricAgent` and `VerifierAgent` don't duplicate them. Also removes the lazy `MMRubricAgent._format_action_history` import from `critical_point_classifier.py`.
- Picks up upstream #889 `run_command` support: adds `StepSummary.tool_output`, populates it from post-action `ToolOutput` observations, renders a `Command Output:` line in the action history, and teaches the prompts to treat that output as ground truth (so an unchanged desktop after `run_command` is not read as failure) while sanity-checking that the command isn't a fake (`echo "success"`).
- Adds missing runtime deps (`imagehash`, `jinja2`) to `webeval/pyproject.toml`; both are imported by the new modules.

Architecture note: the branch keeps `MMRubricAgent` and `VerifierAgent` as independent agents orchestrated by the caller, rather than upstream #1071's compose pattern. This is the same direction (decoupling) and goes a step further by also stripping Step 11. `verify_trajectories.py` drives them in sequence.

All 28 webeval unit tests pass; the live-LLM end-to-end test (`test_verify_trajectories_live_llm`, opt-in via `FARA_VERIFY_LIVE_TEST=1`) asserts the new CP-aware fields (`cp_type_used`, `cp_violation`, `error_taxonomy.first_point_of_failure.failure_points[].error_code`, Steps 9b/10 `is_ambiguous`/`is_invalid`) hit the score file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A live verifier test wedged for ~19h: a judge endpoint returning HTTP 401 (mapped by the openai SDK to AuthenticationError) hit the AuthenticationError branch, which neither blocklisted the endpoint nor consumed the `tries` budget, so next_client() kept handing back the same failing endpoint and the loop spun forever, opening connections the whole time.

- AuthenticationError and the check-access-response-enc branch now decrement `tries` (and AuthenticationError backs off 1s) so a persistent auth/access failure on a small pool can't loop forever.
- Add a hard `max_total_attempts` cap (default max_retries + 2*n_endpoints) as a backstop: create() always terminates regardless of which branch fires, including blocklisting paths that intentionally don't spend the budget.
- Exhaustion error now reports total_attempts + both caps.
- Tests: 3 regression cases (persistent-auth termination across 1 and 3 endpoints, all-endpoints-blocklisted termination), each guarded by a 30s asyncio.wait_for so any future regression fails loudly instead of hanging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
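
The termination argument in this commit can be sketched with a toy retry loop. This is not the client's actual `create()`: the `AuthenticationError` class is a local stand-in for the openai SDK's exception, endpoint rotation is a simple modulo in place of next_client(), and the `ConnectionError` branch is a hypothetical proxy for a path that deliberately does not spend `tries`. Only the `tries` decrement, the backoff, the `max_retries + 2*n_endpoints` default, and the 30s `asyncio.wait_for` guard follow the text above.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class AuthenticationError(Exception):
    """Local stand-in for the openai SDK's AuthenticationError (HTTP 401)."""


async def create(
    endpoints: List[Callable[[], Awaitable[str]]],
    max_retries: int = 3,
    max_total_attempts: Optional[int] = None,
    auth_backoff_s: float = 1.0,
) -> str:
    if max_total_attempts is None:
        # Backstop default described in the commit message.
        max_total_attempts = max_retries + 2 * len(endpoints)
    tries = max_retries
    total_attempts = 0
    while tries > 0 and total_attempts < max_total_attempts:
        # Simplistic rotation; a stand-in for next_client().
        endpoint = endpoints[total_attempts % len(endpoints)]
        total_attempts += 1
        try:
            return await endpoint()
        except AuthenticationError:
            tries -= 1  # auth failures now spend the retry budget
            await asyncio.sleep(auth_backoff_s)
        except ConnectionError:
            # Hypothetical branch that does not spend `tries`;
            # only the total-attempt cap bounds it.
            continue
    raise RuntimeError(
        f"exhausted after total_attempts={total_attempts} "
        f"(max_retries={max_retries}, max_total_attempts={max_total_attempts})"
    )


async def always_401() -> str:
    raise AuthenticationError("HTTP 401")


async def always_down() -> str:
    raise ConnectionError("unreachable")


async def outcome(endpoint: Callable[[], Awaitable[str]]) -> str:
    try:
        # The 30s guard turns a regression into a timeout instead of a hang.
        return await asyncio.wait_for(
            create([endpoint], auth_backoff_s=0.0), timeout=30
        )
    except RuntimeError as exc:
        return str(exc)


auth_message = asyncio.run(outcome(always_401))
down_message = asyncio.run(outcome(always_down))
print(auth_message)
print(down_message)
```

With a single endpoint and the defaults, the auth case stops when the three `tries` are spent, while the non-spending branch stops at the cap of 3 + 2*1 = 5 attempts.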

…gateway)

llm_call_expect_json did `result.content.content`, assuming a nested object. On the OpenAI/phyagi-gateway path CreateResult.content is already a plain str, so this raised "'str' object has no attribute 'content'" on every attempt; Step 9b (trajectory-informed task verification) failed all retries and silently fell back to a default. The bug was masked because the test only asserts the result keys exist, not that the step ran.

Mirror task_classification's helper: unwrap .content only when present. Step 9b now runs clean against the gateway (no retries, real verdict).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
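
The "unwrap .content only when present" fix can be illustrated with a small sketch. The `CreateResult` and `NestedMessage` classes below are local stand-ins, and `extract_text` is a hypothetical name; the actual helper in task_classification may differ.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class NestedMessage:
    # Hypothetical wrapper for clients that nest the text one level deeper.
    content: str


@dataclass
class CreateResult:
    # Stand-in: a plain str on the gateway path, a nested object elsewhere.
    content: Any


def extract_text(result: CreateResult) -> Any:
    """Unwrap the inner `.content` only when the payload has one."""
    payload = result.content
    # A plain str has no `.content` attribute, so it is returned unchanged.
    return payload.content if hasattr(payload, "content") else payload


flat = extract_text(CreateResult('{"verdict": "ok"}'))
nested = extract_text(CreateResult(NestedMessage('{"verdict": "ok"}')))
print(flat == nested)
```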

corbyrosset changed the title ~~Port CP-aware verifier + error taxonomy from agento_next~~ Port CP-aware verifier + error taxonomy May 29, 2026

corby and others added 2 commits May 29, 2026 12:52

corbyrosset wants to merge 3 commits into main from corby/port-cp-aware-verifier-and-taxonomy
