Port CP-aware verifier + error taxonomy #77
Open
corbyrosset wants to merge 3 commits into
Brings the rubric_agent package up to parity with agento_next/main (post-#1071 architecture), adapted to webeval:

- Adds `CriticalPointAgent` (Step -1) that classifies the task against a critical-point taxonomy (`critical_point_types.yaml`) and threads the result through rubric generation, action-only scoring, outcome verification, and a new CP-violation check.
- Adds `VerifierAgent` (Steps 9a/9b/10): first-point-of-failure analysis with the error taxonomy, trajectory-informed task verification, and unified task verification. Step 11 (synthetic human-voice feedback) is intentionally dropped because it is not needed in webeval.
- Mirrors upstream #1071's DRY refactor: extracts shared helpers (`format_action_history`, `call_llm`, `encode_image_b64`, `get_init_url_context`, `build_scored_rubric_summary`, `build_all_screenshot_evidence_text`) into `formatting.py` so `MMRubricAgent` and `VerifierAgent` don't duplicate them. Also removes the lazy `MMRubricAgent._format_action_history` import from `critical_point_classifier.py`.
- Picks up upstream #889 `run_command` support: adds `StepSummary.tool_output`, populates it from post-action `ToolOutput` observations, renders a `Command Output:` line in the action history, and teaches the prompts to treat that output as ground truth (so an unchanged desktop after `run_command` is not read as failure) while sanity-checking that the command isn't a fake (`echo "success"`).
- Adds missing runtime deps (`imagehash`, `jinja2`) to `webeval/pyproject.toml`; both are imported by the new modules.

Architecture note: the branch keeps `MMRubricAgent` and `VerifierAgent` as independent agents orchestrated by the caller, rather than upstream #1071's compose pattern. This is the same direction (decoupling) and goes a step further by also stripping Step 11. `verify_trajectories.py` drives them in sequence.
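The `run_command` bullet above can be illustrated with a minimal sketch. Only `StepSummary.tool_output`, `format_action_history`, and the `Command Output:` label come from this PR; the other fields, the function signature, and the line layout are assumptions made for illustration and may not match the real `formatting.py`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepSummary:
    """Hypothetical, trimmed-down step record; only tool_output is named in the PR."""
    index: int
    action: str
    tool_output: Optional[str] = None  # text captured from a post-action ToolOutput


def format_action_history(steps: List[StepSummary]) -> str:
    """Render one line per step, plus a Command Output line when output exists."""
    lines = []
    for step in steps:
        lines.append(f"Step {step.index}: {step.action}")
        if step.tool_output is not None:
            lines.append(f"  Command Output: {step.tool_output.strip()}")
    return "\n".join(lines)


history = format_action_history([
    StepSummary(1, "click(button='Files')"),
    StepSummary(2, "run_command('ls /tmp/report')", tool_output="summary.csv\n"),
])
print(history)
```

Keeping the command output next to its step lets a judge prompt treat it as evidence even when the screenshot after `run_command` looks unchanged.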
All 28 webeval unit tests pass; the live-LLM end-to-end test (`test_verify_trajectories_live_llm`, opt-in via `FARA_VERIFY_LIVE_TEST=1`) asserts that the new CP-aware fields (`cp_type_used`, `cp_violation`, `error_taxonomy.first_point_of_failure.failure_points[].error_code`, Steps 9b/10 `is_ambiguous`/`is_invalid`) hit the score file.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
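As a rough illustration of the field check described above, the sketch below walks a score dictionary and reports which CP-aware paths are absent. The field paths are the ones listed in this commit message; the helper name `missing_cp_fields`, the sample payload, and all of its values are hypothetical, and the Steps 9b/10 fields are left out because their location in the score file is not stated here.

```python
from typing import Any, Dict, List


def missing_cp_fields(score: Dict[str, Any]) -> List[str]:
    """List the CP-aware field paths that are absent from a score dict."""
    missing = [key for key in ("cp_type_used", "cp_violation") if key not in score]
    fpf = score.get("error_taxonomy", {}).get("first_point_of_failure", {})
    points = fpf.get("failure_points")
    prefix = "error_taxonomy.first_point_of_failure.failure_points"
    if points is None:
        missing.append(prefix)
    else:
        # Every failure point is expected to carry an error_code.
        missing += [
            f"{prefix}[{i}].error_code"
            for i, point in enumerate(points)
            if "error_code" not in point
        ]
    return missing


# Hypothetical score payload; every value here is a placeholder.
sample_score = {
    "cp_type_used": "placeholder_type",
    "cp_violation": False,
    "error_taxonomy": {
        "first_point_of_failure": {
            "failure_points": [{"error_code": "PLACEHOLDER"}],
        }
    },
}
print(missing_cp_fields(sample_score))
```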
A live verifier test wedged for ~19h: a judge endpoint returning HTTP 401 (mapped by the openai SDK to AuthenticationError) hit the AuthenticationError branch, which neither blocklisted the endpoint nor consumed the `tries` budget. As a result, next_client() kept handing back the same failing endpoint and the loop spun forever, opening connections the whole time.

- AuthenticationError and the check-access-response-enc branch now decrement `tries` (and AuthenticationError backs off 1s) so a persistent auth/access failure on a small pool can't loop forever.
- Add a hard `max_total_attempts` cap (default max_retries + 2*n_endpoints) as a backstop: create() always terminates regardless of which branch fires, including blocklisting paths that intentionally don't spend the budget.
- Exhaustion error now reports total_attempts + both caps.
- Tests: 3 regression cases (persistent-auth termination across 1 and 3 endpoints, all-endpoints-blocklisted termination), each guarded by a 30s asyncio.wait_for so any future regression fails loudly instead of hanging.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
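The termination guarantee described above can be sketched as a retry loop with two budgets. This is a simplified stand-in, not the project's client: `AuthError` substitutes for the SDK's AuthenticationError, the round-robin pool replaces next_client(), blocklisting is omitted, and the backoff defaults to 0s here rather than the 1s mentioned in the commit.

```python
import asyncio
import itertools


class AuthError(Exception):
    """Stand-in for the SDK's AuthenticationError (HTTP 401)."""


async def create(endpoints, call, max_retries=3, backoff_s=0.0):
    """Try endpoints round-robin until one succeeds or a budget runs out."""
    if not endpoints:
        raise ValueError("need at least one endpoint")
    max_total_attempts = max_retries + 2 * len(endpoints)  # hard backstop
    tries = max_retries
    total_attempts = 0
    pool = itertools.cycle(endpoints)
    while tries > 0 and total_attempts < max_total_attempts:
        endpoint = next(pool)
        total_attempts += 1
        try:
            return await call(endpoint)
        except AuthError:
            tries -= 1  # auth failures spend the budget, so the loop ends
            await asyncio.sleep(backoff_s)
    raise RuntimeError(
        f"exhausted: total_attempts={total_attempts}, "
        f"max_retries={max_retries}, max_total_attempts={max_total_attempts}"
    )


async def always_401(endpoint):
    raise AuthError(f"401 from {endpoint}")


async def main():
    try:
        # Same guard idea as the regression tests: a hang becomes a timeout.
        await asyncio.wait_for(create(["a"], always_401), timeout=30)
    except RuntimeError as exc:
        return str(exc)


message = asyncio.run(main())
print(message)
```

The second cap matters for branches that deliberately leave `tries` untouched: even if every such branch fired on every attempt, `total_attempts` would still reach `max_total_attempts` and end the loop.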
…gateway)

llm_call_expect_json did `result.content.content`, assuming a nested object. On the OpenAI/phyagi-gateway path CreateResult.content is already a plain str, so this raised "'str' object has no attribute 'content'" on every attempt. Step 9b (trajectory-informed task verification) failed all retries and silently fell back to a default. The bug was masked because the test only asserts the result keys exist, not that the step ran.

Mirror task_classification's helper: unwrap .content only when present. Step 9b now runs clean against the gateway (no retries, real verdict).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
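A minimal sketch of the "unwrap only when present" idea follows. The helper name `unwrap_content` is hypothetical, and `SimpleNamespace` is used only to imitate a result whose content is a nested object; the real task_classification helper may differ.

```python
import json
from types import SimpleNamespace


def unwrap_content(content):
    """Return the text payload whether content is a str or wraps one in .content."""
    # A plain str has no .content attribute, so it is returned unchanged.
    return content.content if hasattr(content, "content") else content


# Gateway-style result: content is already a plain string.
flat = SimpleNamespace(content='{"verdict": "pass"}')
# Nested-style result: content is an object that carries the string.
nested = SimpleNamespace(content=SimpleNamespace(content='{"verdict": "pass"}'))

for result in (flat, nested):
    print(json.loads(unwrap_content(result.content)))
```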
Summary
Brings webeval's `rubric_agent` package up to parity with `agento_next/main` (post-#1071), adapted for webeval.

- `CriticalPointAgent` (Step -1): classifies the task against a critical-point taxonomy (`critical_point_types.yaml`) and threads the result through rubric generation, action-only scoring, outcome verification, and a new CP-violation check.
- `VerifierAgent` (Steps 9a/9b/10): first-point-of-failure analysis using the error taxonomy, trajectory-informed task verification, and unified task verification. Step 11 (synthetic human-voice feedback) is intentionally dropped because it is not needed in webeval.
- Shared helpers (`format_action_history`, `call_llm`, `encode_image_b64`, `get_init_url_context`, `build_scored_rubric_summary`, `build_all_screenshot_evidence_text`) are extracted into `formatting.py` so `MMRubricAgent` and `VerifierAgent` don't duplicate them.
- `run_command`: adds `StepSummary.tool_output`, populates it from post-action `ToolOutput` observations, renders a `Command Output:` line in the action history, and teaches the prompts to treat that output as ground truth (so an unchanged desktop after `run_command` is not read as failure) while sanity-checking that the command isn't a fake (`echo "success"`).

Architecture note vs upstream #1071
The branch keeps `MMRubricAgent` and `VerifierAgent` as independent agents orchestrated by the caller.

Test plan
- webeval unit tests pass (`pytest tests/`).
- `tests/test_verify_trajectories.py::test_verify_trajectories_live_llm` runs the real `MMRubricAgent` + `VerifierAgent` against the checked-in `data/example_trajectory/alltrails_find_23/` trajectory and asserts that the new CP-aware fields (`cp_type_used`, `cp_violation`, `error_taxonomy.first_point_of_failure.failure_points[].error_code`, Steps 9b/10 `is_ambiguous`/`is_invalid`) hit the score file. Opt-in via `FARA_VERIFY_LIVE_TEST=1`.

🤖 Generated with Claude Code