Follow-up to #137.
Observation (n=907 baseline, 2026-05-25)
Distribution from the v1 prompt in `evals/scripts/classify-cc-stops.mjs` and `claude/lib/judge.mjs`:
| Category | Count | % |
|---|---|---|
| working | 374 | 41% |
| complete | 261 | 29% |
| waiting_for_user_legitimate | 210 | 23% |
| summary_drift_stop | 35 | 4% |
| genuinely_stuck | 27 | 3% |
| tool_available_punt | 0 | 0% |
Problems
- `working` over-assigned. At Stop time the agent is by definition not working. The classifier appears to read just-finished-action summaries ("I've created the file. Now run tests.") as "still working." Content like this should be labeled `summary_drift_stop` instead.
- `tool_available_punt` never assigned. The heuristic filter surfaced 26 candidates with `hint:punt` (PUNT_PATTERNS matched), and the classifier reassigned every one. Either the pattern is rare in this user's data, or the prompt fails to discriminate it from `waiting_for_user_legitimate`.
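One way to tell the two explanations apart is to tally which final labels the `hint:punt` candidates received: if nearly all of them land in `waiting_for_user_legitimate`, prompt confusion is the likelier cause. A minimal sketch, assuming each classified record carries a `hints` array and a `category` field (both field names and `tallyReassignments` are hypothetical, not taken from the existing scripts):

```javascript
// Tally the final categories of records flagged by the heuristic filter.
// Assumed record shape (hypothetical): { hints: string[], category: string }
function tallyReassignments(records, hint = "hint:punt") {
  const tally = {};
  for (const record of records) {
    // Skip records that the heuristic filter did not flag with this hint.
    if (!Array.isArray(record.hints) || !record.hints.includes(hint)) continue;
    tally[record.category] = (tally[record.category] ?? 0) + 1;
  }
  return tally;
}

// Illustrative records only, not real data from the 907-record corpus.
const sample = [
  { hints: ["hint:punt"], category: "waiting_for_user_legitimate" },
  { hints: ["hint:punt"], category: "waiting_for_user_legitimate" },
  { hints: ["hint:punt"], category: "complete" },
  { hints: [], category: "working" },
];

console.log(tallyReassignments(sample));
```

Running this over the 26 real candidates would show whether the reassignments concentrate in one category or are spread out.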
Proposed fixes
- Add 2–4 few-shot examples per category from the redacted gold file (`evals/datasets/cc-stop-labeled-gold-redacted.jsonl`).
- Add an explicit anti-pattern to the prompt: "AT STOP, 'working' is almost never correct. If the agent appears to still be working, prefer `summary_drift_stop` (claimed a next step but stopped) or `genuinely_stuck` (no closure)."
- Add a discriminator clause for `tool_available_punt` vs `waiting_for_user_legitimate`: "If the user's question could be answered by any tool in TOOLS THE ASSISTANT HAD, prefer `tool_available_punt`. Use `waiting_for_user_legitimate` only when no tool could give the answer."
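The three fixes could be combined into a single prompt-assembly step. The sketch below is one possible shape, not the current code in `claude/lib/judge.mjs`: `buildPrompt`, the `{ text }` example shape, and the "Message:/Category:" layout are all assumptions made for illustration.

```javascript
// Sketch of assembling a v2 classifier prompt from the proposed fixes.
// buildPrompt and the example shape are hypothetical names for illustration.
const ANTI_PATTERN =
  "AT STOP, 'working' is almost never correct. If the agent appears to still be working, " +
  "prefer summary_drift_stop (claimed a next step but stopped) or genuinely_stuck (no closure).";

const PUNT_DISCRIMINATOR =
  "If the user's question could be answered by any tool in TOOLS THE ASSISTANT HAD, " +
  "prefer tool_available_punt. Use waiting_for_user_legitimate only when no tool could give the answer.";

// examplesByCategory: { [category: string]: Array<{ text: string }> }
function buildPrompt(basePrompt, examplesByCategory, maxPerCategory = 4) {
  const shots = [];
  for (const [category, examples] of Object.entries(examplesByCategory)) {
    // Cap the few-shots per category (the proposal asks for 2 to 4).
    for (const example of examples.slice(0, maxPerCategory)) {
      shots.push(`Message: ${example.text}\nCategory: ${category}`);
    }
  }
  return [basePrompt, ANTI_PATTERN, PUNT_DISCRIMINATOR, "EXAMPLES", ...shots].join("\n\n");
}

// Illustrative usage with made-up example text.
const prompt = buildPrompt("Classify the assistant's final message.", {
  summary_drift_stop: [{ text: "I've created the file. Now run tests." }],
});
console.log(prompt);
```

Keeping the two clauses as named constants makes it easy to ablate each one separately when measuring its effect on the distribution.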
Acceptance
- F1 ≥ 0.75 on `summary_drift_stop` and `tool_available_punt` against an expanded gold set (≥ 60 records, ≥ 8 per category).
- `working` share drops below 5% on the same 907-record corpus.
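The acceptance criteria can be checked mechanically. The following is a rough sketch that assumes gold and predicted labels are available as parallel arrays of category strings; the function names are hypothetical and nothing here reflects an existing script.

```javascript
// Per-category F1 from parallel arrays of gold and predicted labels.
// F1 = 2*TP / (2*TP + FP + FN); returns 0 when the category never appears.
function f1ForCategory(gold, predicted, category) {
  let tp = 0, fp = 0, fn = 0;
  for (let i = 0; i < gold.length; i++) {
    const isGold = gold[i] === category;
    const isPred = predicted[i] === category;
    if (isGold && isPred) tp++;
    else if (isPred) fp++;
    else if (isGold) fn++;
  }
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

// Fraction of labels equal to the given category.
function shareOf(labels, category) {
  if (labels.length === 0) return 0;
  return labels.filter((label) => label === category).length / labels.length;
}

// Gold set size check: at least 60 records and at least 8 per category.
function goldSetLargeEnough(gold, categories) {
  if (gold.length < 60) return false;
  return categories.every((c) => gold.filter((label) => label === c).length >= 8);
}

// Combines the two acceptance bullets into one boolean.
function meetsAcceptance(gold, predictedOnGold, predictedOnCorpus) {
  const targets = ["summary_drift_stop", "tool_available_punt"];
  const f1Ok = targets.every((c) => f1ForCategory(gold, predictedOnGold, c) >= 0.75);
  const workingOk = shareOf(predictedOnCorpus, "working") < 0.05;
  return f1Ok && workingOk;
}
```

Here `predictedOnGold` would be the classifier's output on the gold records and `predictedOnCorpus` its output on the 907-record corpus.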
Notes
- Reproduce baseline with `node evals/scripts/classify-cc-stops.mjs` (uses OAuth Bearer from `~/.claude/.credentials.json` against Anthropic API directly).
- Add `tool_available_punt` few-shots from the user's earlier session examples (browser MCP available but agent punted, Bash available but agent asked).