Refine cc-stop classifier prompt: over-assigns working, under-assigns tool_available_punt #138

Follow-up to #137.

## Observation (n=907 baseline, 2026-05-25)

Distribution from the v1 prompt in `evals/scripts/classify-cc-stops.mjs` and `claude/lib/judge.mjs`:

| Category | Count | % |
|---|---|---|
| working | 374 | 41% |
| complete | 261 | 29% |
| waiting_for_user_legitimate | 210 | 23% |
| summary_drift_stop | 35 | 4% |
| genuinely_stuck | 27 | 3% |
| tool_available_punt | 0 | 0% |
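
The shares in the table can be rechecked from the raw counts. The snippet below is a minimal sketch: `categoryShares` is a hypothetical helper, not something that exists in `classify-cc-stops.mjs`, and it rounds to whole percents.

```javascript
// Counts copied from the baseline table above.
const baselineCounts = {
  working: 374,
  complete: 261,
  waiting_for_user_legitimate: 210,
  summary_drift_stop: 35,
  genuinely_stuck: 27,
  tool_available_punt: 0,
};

// Hypothetical helper: converts raw counts into whole-percent shares of the total.
function categoryShares(counts) {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  const shares = {};
  for (const [category, count] of Object.entries(counts)) {
    shares[category] = total === 0 ? 0 : Math.round((count / total) * 100);
  }
  return { total, shares };
}

console.log(categoryShares(baselineCounts));
```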

## Problems

1. **`working` over-assigned.** At Stop time the agent is by definition no longer working. The classifier appears to read just-finished-action summaries ("I've created the file. Now run tests.") as "still working", although that content should be labeled `summary_drift_stop`.

2. **`tool_available_punt` never assigned.** The heuristic filter surfaced 26 candidates with `hint:punt` (PUNT_PATTERNS matched), and the classifier reassigned every one of them to another category. Either the pattern is rare in this user's data, or the prompt fails to discriminate it from `waiting_for_user_legitimate`.
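
To tell those two explanations apart, it may help to see which categories the `hint:punt` candidates ended up in. This is a sketch only: the record shape (`hints` array plus a `label` field) is an assumption and may not match what the script actually writes.

```javascript
// Hypothetical record shape (assumed, not taken from the script's output):
//   { id: "abc", hints: ["hint:punt"], label: "waiting_for_user_legitimate" }
function reassignmentBreakdown(records, hint = "hint:punt") {
  const breakdown = {};
  for (const record of records) {
    // Skip records the heuristic filter did not flag with this hint.
    if (!Array.isArray(record.hints) || !record.hints.includes(hint)) continue;
    breakdown[record.label] = (breakdown[record.label] ?? 0) + 1;
  }
  return breakdown;
}

// Illustrative input, not real data.
const sample = [
  { id: "a", hints: ["hint:punt"], label: "waiting_for_user_legitimate" },
  { id: "b", hints: ["hint:punt"], label: "working" },
  { id: "c", hints: [], label: "complete" },
];
console.log(reassignmentBreakdown(sample));
```

If nearly all candidates land in `waiting_for_user_legitimate`, that would point at the prompt's discrimination rather than at rarity in the data.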

## Proposed fixes

- Add 2–4 few-shot examples per category from the redacted gold file (`evals/datasets/cc-stop-labeled-gold-redacted.jsonl`).
- Add an explicit anti-pattern to the prompt: "AT STOP, 'working' is almost never correct. If the agent appears to still be working, prefer `summary_drift_stop` (claimed a next step but stopped) or `genuinely_stuck` (no closure)."
- Add a discriminator clause for `tool_available_punt` vs `waiting_for_user_legitimate`: "If the user's question could be answered by any tool in TOOLS THE ASSISTANT HAD, prefer `tool_available_punt`. Use `waiting_for_user_legitimate` only when no tool could give the answer."
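
One possible way to wire these three fixes into the prompt is sketched below. All names here (`pickFewShots`, `buildPrompt`, the `RULES` and `EXAMPLES` headings) are hypothetical, and the gold record shape `{ label, text }` is assumed rather than taken from the redacted gold file.

```javascript
// Keep at most `perCategory` gold records per label, in file order.
// Assumed record shape: { label: string, text: string }.
function pickFewShots(goldRecords, perCategory = 3) {
  const byLabel = new Map();
  for (const record of goldRecords) {
    const shots = byLabel.get(record.label) ?? [];
    if (shots.length < perCategory) shots.push(record);
    byLabel.set(record.label, shots);
  }
  return byLabel;
}

// Append extra rule clauses and few-shot examples to an existing base prompt.
function buildPrompt(basePrompt, rules, fewShots) {
  const ruleLines = rules.map((rule) => `- ${rule}`);
  const shotLines = [];
  for (const [label, shots] of fewShots) {
    for (const shot of shots) {
      shotLines.push(`Message: ${JSON.stringify(shot.text)}`, `Category: ${label}`, "");
    }
  }
  return [basePrompt, "", "RULES", ...ruleLines, "", "EXAMPLES", ...shotLines]
    .join("\n")
    .trimEnd();
}

// Illustrative usage with made-up gold records.
const gold = [
  { label: "summary_drift_stop", text: "I've created the file. Now run tests." },
  { label: "tool_available_punt", text: "Could you paste the contents of the log?" },
];
const rules = ["AT STOP, 'working' is almost never correct."];
console.log(buildPrompt("Classify the final assistant message.", rules, pickFewShots(gold, 2)));
```

Passing the rule clauses in as data keeps the wording easy to iterate on without touching the assembly code.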

## Acceptance

- F1 ≥ 0.75 on `summary_drift_stop` and `tool_available_punt` against an expanded gold set (≥ 60 records, ≥ 8 per category).
- `working` share drops below 5% of the same 907-record corpus.
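
The two criteria above could be checked along these lines. This is a sketch under assumptions: the function names are hypothetical, `pairs` is assumed to hold one `{ gold, predicted }` entry per gold-set record, and F1 is the usual one-vs-rest score for a single category.

```javascript
// One-vs-rest F1 for a single category: 2*TP / (2*TP + FP + FN).
function f1ForCategory(pairs, category) {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const { gold, predicted } of pairs) {
    if (predicted === category && gold === category) tp++;
    else if (predicted === category) fp++;
    else if (gold === category) fn++;
  }
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

// Checks both acceptance criteria: F1 thresholds on the gold set and the
// `working` share on the full corpus labels.
function checkAcceptance(pairs, corpusLabels) {
  const f1 = {
    summary_drift_stop: f1ForCategory(pairs, "summary_drift_stop"),
    tool_available_punt: f1ForCategory(pairs, "tool_available_punt"),
  };
  const workingCount = corpusLabels.filter((label) => label === "working").length;
  const workingShare = corpusLabels.length === 0 ? 0 : workingCount / corpusLabels.length;
  return {
    f1,
    workingShare,
    passed: Object.values(f1).every((score) => score >= 0.75) && workingShare < 0.05,
  };
}
```

Returning the individual scores alongside `passed` makes it easier to see which criterion failed on a given run.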

## Notes

- Reproduce baseline with `node evals/scripts/classify-cc-stops.mjs` (uses OAuth Bearer from `~/.claude/.credentials.json` against Anthropic API directly).
- Add `tool_available_punt` few-shots from the user's earlier session examples (browser MCP available but agent punted, Bash available but agent asked).
