Port CP-aware verifier + error taxonomy #77

Open
corbyrosset wants to merge 3 commits into main from corby/port-cp-aware-verifier-and-taxonomy
@corbyrosset corbyrosset commented May 28, 2026

Summary

Brings webeval's rubric_agent package up to parity with agento_next/main (post-#1071), adapted for webeval.

  • New CriticalPointAgent (Step -1): classifies the task against a critical-point taxonomy (critical_point_types.yaml) and threads the result through rubric generation, action-only scoring, outcome verification, and a new CP-violation check.
  • New VerifierAgent (Steps 9a/9b/10): first-point-of-failure analysis using the error taxonomy, trajectory-informed task verification, and unified task verification. Step 11 (synthetic human-voice feedback) is intentionally dropped — not needed in webeval.
  • Some refactors: extracts shared helpers (format_action_history, call_llm, encode_image_b64, get_init_url_context, build_scored_rubric_summary, build_all_screenshot_evidence_text) into formatting.py so MMRubricAgent and VerifierAgent don't duplicate them.
  • Include support for evaluating run_command: adds StepSummary.tool_output, populates it from post-action ToolOutput observations, renders a Command Output: line in the action history, and teaches the prompts to treat that output as ground truth (so an unchanged desktop after run_command is not read as failure) while sanity-checking that the command isn't a fake (echo "success").
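
To make the run_command bullet concrete, here is a minimal sketch of how a Command Output: line could be rendered into the action history. The `StepSummary` shape below is a simplified, hypothetical stand-in (only `tool_output` is named in this PR), and the signature of `format_action_history` is an assumption, not the one in `formatting.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepSummary:
    """Hypothetical, simplified stand-in for the real StepSummary."""
    action: str
    # Set only when a ToolOutput observation followed the action.
    tool_output: Optional[str] = None


def format_action_history(steps: List[StepSummary]) -> str:
    """Render one line per step, plus a 'Command Output:' line when output exists."""
    lines = []
    for i, step in enumerate(steps, start=1):
        lines.append(f"Step {i}: {step.action}")
        if step.tool_output is not None:
            lines.append(f"  Command Output: {step.tool_output.strip()}")
    return "\n".join(lines)


history = format_action_history([
    StepSummary(action="click(id=12)"),
    StepSummary(action="run_command('ls /tmp/out')", tool_output="report.csv\n"),
])
print(history)
```

Steps without a tool observation render no Command Output: line, so the judge only sees command output where one was actually captured.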

Architecture note vs upstream #1071

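The branch keeps MMRubricAgent and VerifierAgent as independent agents orchestrated by the caller, rather than using upstream #1071's compose pattern. This moves in the same direction (decoupling) and goes a step further by also dropping Step 11; verify_trajectories.py drives the two agents in sequence.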

Test plan

  • All 28 webeval unit tests pass (pytest tests/).
  • Live end-to-end test exists: tests/test_verify_trajectories.py::test_verify_trajectories_live_llm runs the real MMRubricAgent + VerifierAgent against the checked-in data/example_trajectory/alltrails_find_23/ trajectory and asserts the new CP-aware fields (cp_type_used, cp_violation, error_taxonomy.first_point_of_failure.failure_points[].error_code, Steps 9b/10 is_ambiguous/is_invalid) hit the score file. Opt-in:
    FARA_VERIFY_LIVE_TEST=1 \
    FARA_VERIFY_JUDGE_CONFIG=/path/to/judge/configs \
    FARA_VERIFY_O4MINI_CONFIG=/path/to/o4mini/configs \
    FARA_VERIFY_JUDGE_MODEL=gpt-5.2 \
    pytest tests/test_verify_trajectories.py::test_verify_trajectories_live_llm -v -s
    
  • Confirm live-LLM run passes end-to-end (currently in flight; will update PR if it surfaces issues).

🤖 Generated with Claude Code

Brings the rubric_agent package up to parity with agento_next/main
(post-#1071 architecture), adapted to webeval:

- Adds `CriticalPointAgent` (Step -1) that classifies the task against
  a critical-point taxonomy (`critical_point_types.yaml`) and threads
  the result through rubric generation, action-only scoring, outcome
  verification, and a new CP-violation check.
- Adds `VerifierAgent` (Steps 9a/9b/10): first-point-of-failure
  analysis with the error taxonomy, trajectory-informed task
  verification, and unified task verification. Step 11 (synthetic
  human-voice feedback) is intentionally dropped — not needed in webeval.
- Mirrors upstream #1071's DRY refactor: extracts shared helpers
  (`format_action_history`, `call_llm`, `encode_image_b64`,
  `get_init_url_context`, `build_scored_rubric_summary`,
  `build_all_screenshot_evidence_text`) into `formatting.py` so
  `MMRubricAgent` and `VerifierAgent` don't duplicate them. Also
  removes the lazy `MMRubricAgent._format_action_history` import
  from `critical_point_classifier.py`.
- Picks up upstream #889 `run_command` support: adds
  `StepSummary.tool_output`, populates it from post-action
  `ToolOutput` observations, renders a `Command Output:` line in the
  action history, and teaches the prompts to treat that output as
  ground truth (so an unchanged desktop after `run_command` is not
  read as failure) while sanity-checking that the command isn't a
  fake (`echo "success"`).
- Adds missing runtime deps (`imagehash`, `jinja2`) to
  `webeval/pyproject.toml` — both are imported by the new modules.

Architecture note: the branch keeps `MMRubricAgent` and `VerifierAgent`
as independent agents orchestrated by the caller, rather than upstream
#1071's compose pattern. This is the same direction (decoupling) and
goes a step further by also stripping Step 11. `verify_trajectories.py`
drives them in sequence.
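
As an illustration of the caller-orchestrated shape described above, here is a rough sketch. The method names `score` and `verify`, the stub classes, and the returned dict layout are all hypothetical; the real agents' interfaces are not shown in this PR description.

```python
from typing import Any, Dict, Protocol


class RubricAgent(Protocol):
    def score(self, trajectory: Dict[str, Any]) -> Dict[str, Any]: ...


class Verifier(Protocol):
    def verify(
        self, trajectory: Dict[str, Any], rubric: Dict[str, Any]
    ) -> Dict[str, Any]: ...


def verify_trajectory(
    trajectory: Dict[str, Any], rubric_agent: RubricAgent, verifier: Verifier
) -> Dict[str, Any]:
    """The caller runs the two agents in sequence; neither agent owns the other."""
    rubric = rubric_agent.score(trajectory)
    verdict = verifier.verify(trajectory, rubric)
    return {"rubric": rubric, "verification": verdict}


class StubRubricAgent:
    def score(self, trajectory: Dict[str, Any]) -> Dict[str, Any]:
        return {"cp_type_used": "none", "cp_violation": False}


class StubVerifier:
    def verify(
        self, trajectory: Dict[str, Any], rubric: Dict[str, Any]
    ) -> Dict[str, Any]:
        # Records that the rubric result was passed in by the caller.
        return {"is_ambiguous": False, "is_invalid": False, "saw_rubric": bool(rubric)}


result = verify_trajectory({"task": "demo"}, StubRubricAgent(), StubVerifier())
print(result)
```

Because the verifier only receives the rubric result as an argument, either agent can be stubbed or replaced in tests without touching the other.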

All 28 webeval unit tests pass; the live-LLM end-to-end test
(`test_verify_trajectories_live_llm`, opt-in via `FARA_VERIFY_LIVE_TEST=1`)
asserts the new CP-aware fields (`cp_type_used`, `cp_violation`,
`error_taxonomy.first_point_of_failure.failure_points[].error_code`,
Steps 9b/10 `is_ambiguous`/`is_invalid`) hit the score file.
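
A sketch of the kind of field check the live test performs, assuming the score file loads as a nested dict that mirrors the dotted paths above. The helper name and the sample values are hypothetical, and the Steps 9b/10 fields are left out because their location in the file is not stated here.

```python
from typing import Any, Dict, List


def missing_cp_fields(score: Dict[str, Any]) -> List[str]:
    """Return the CP-aware fields that are absent from a loaded score dict."""
    missing = [key for key in ("cp_type_used", "cp_violation") if key not in score]
    fpf = score.get("error_taxonomy", {}).get("first_point_of_failure", {})
    points = fpf.get("failure_points")
    prefix = "error_taxonomy.first_point_of_failure.failure_points"
    if not isinstance(points, list):
        missing.append(prefix)
    else:
        for i, point in enumerate(points):
            if "error_code" not in point:
                missing.append(f"{prefix}[{i}].error_code")
    return missing


# Placeholder values; real error codes come from the error taxonomy.
sample_score = {
    "cp_type_used": "example_type",
    "cp_violation": False,
    "error_taxonomy": {
        "first_point_of_failure": {"failure_points": [{"error_code": "EXAMPLE"}]}
    },
}
print(missing_cp_fields(sample_score))
print(missing_cp_fields({"cp_type_used": "example_type"}))
```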

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@corbyrosset corbyrosset changed the title from "Port CP-aware verifier + error taxonomy from agento_next" to "Port CP-aware verifier + error taxonomy" on May 29, 2026
corby and others added 2 commits May 29, 2026 12:52
A live verifier test wedged for ~19h: a judge endpoint returning HTTP 401
(mapped by the openai SDK to AuthenticationError) hit the AuthenticationError
branch, which neither blocklisted the endpoint nor consumed the `tries`
budget — so next_client() kept handing back the same failing endpoint and
the loop spun forever, opening connections the whole time.

- AuthenticationError and the check-access-response-enc branch now decrement
  `tries` (and AuthenticationError backs off 1s) so a persistent auth/access
  failure on a small pool can't loop forever.
- Add a hard `max_total_attempts` cap (default max_retries + 2*n_endpoints)
  as a backstop: create() always terminates regardless of which branch fires,
  including blocklisting paths that intentionally don't spend the budget.
- Exhaustion error now reports total_attempts + both caps.
- Tests: 3 regression cases (persistent-auth termination across 1 and 3
  endpoints, all-endpoints-blocklisted termination), each guarded by a 30s
  asyncio.wait_for so any future regression fails loudly instead of hanging.
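
The termination guarantee can be sketched as follows. This is a simplified model, not the client's actual `create()`: `AuthError` stands in for the SDK's AuthenticationError, endpoints are plain async callables, round-robin selection stands in for `next_client()`, and blocklisting is omitted.

```python
import asyncio
from typing import Awaitable, Callable, List


class AuthError(Exception):
    """Stand-in for the SDK's authentication error (HTTP 401)."""


async def create_with_retries(
    endpoints: List[Callable[[], Awaitable[str]]],
    max_retries: int = 3,
    backoff_s: float = 1.0,
) -> str:
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    tries = max_retries
    # Hard backstop: the loop ends even if some branch forgets to spend `tries`.
    max_total_attempts = max_retries + 2 * len(endpoints)
    total_attempts = 0
    while tries > 0 and total_attempts < max_total_attempts:
        call = endpoints[total_attempts % len(endpoints)]  # simple round-robin
        total_attempts += 1
        try:
            return await call()
        except AuthError:
            tries -= 1  # auth failures now consume the retry budget
            await asyncio.sleep(backoff_s)
    raise RuntimeError(
        f"exhausted: total_attempts={total_attempts}, "
        f"max_retries={max_retries}, max_total_attempts={max_total_attempts}"
    )


async def always_401() -> str:
    raise AuthError("HTTP 401")


async def main() -> str:
    try:
        # The outer timeout mirrors the 30s guard used by the regression tests.
        await asyncio.wait_for(
            create_with_retries([always_401], backoff_s=0.0), timeout=30
        )
    except RuntimeError as exc:
        return str(exc)
    return "unexpected success"


message = asyncio.run(main())
print(message)
```

With a single always-failing endpoint the retry budget runs out first; the total-attempts cap only matters for branches that do not decrement `tries`.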

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gateway)

llm_call_expect_json did `result.content.content`, assuming a nested object.
On the OpenAI/phyagi-gateway path CreateResult.content is already a plain
str, so this raised "'str' object has no attribute 'content'" on every
attempt — Step 9b (trajectory-informed task verification) failed all retries
and silently fell back to a default. The bug was masked because the test only
asserts the result keys exist, not that the step ran.

Mirror task_classification's helper: unwrap .content only when present. Step
9b now runs clean against the gateway (no retries, real verdict).
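
A minimal sketch of the "unwrap only when present" idea, assuming the result's content is either a plain `str` or an object exposing a `.content` attribute. The helper name `unwrap_content` is hypothetical and `SimpleNamespace` stands in for the nested result type.

```python
import json
from types import SimpleNamespace
from typing import Any


def unwrap_content(content: Any) -> Any:
    """Unwrap one level of .content only when the attribute is present."""
    return content.content if hasattr(content, "content") else content


raw = '{"verdict": "ok"}'
plain = unwrap_content(raw)                            # gateway path: already a str
nested = unwrap_content(SimpleNamespace(content=raw))  # nested-object path
print(json.loads(plain), json.loads(nested))
```

A plain `str` has no `content` attribute, so it passes through unchanged instead of raising the AttributeError described above.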

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>