fix(card): adaptive stall_finalize threshold + false-positive recovery #137
Merged
Two regressions on a long-reasoning metrics-debug session:

(1) ``STALL_FINALIZE_AFTER_SECONDS=90s`` fired on ``tail=tool_use idle=96s`` while Claude was legitimately thinking toward the final answer;
(2) once the STALL_NOTE was appended, the real answer landing later got silently edited into the now-finalized card the user had scrolled past.

Fix (1): split the threshold by tail event type. ``tool_use`` tails get ``STALL_FINALIZE_TOOL_USE_SECONDS=300s`` because slow tools and post-tool reasoning are routinely silent for minutes; ``text`` / ``thinking`` tails keep the original 90s because mid-emit silence is genuinely suspicious.

Fix (2): ``maybe_finalize_stalled`` now arms ``CardState.stall_finalized=True`` after the STALL_NOTE lands. The next call to ``update_session_card`` or ``finalize_task`` runs ``_recover_from_false_stall``, which wipes msg_id / events / pagination and flips ``is_continuation=True`` so ``_send_card`` spawns a fresh card below the stub instead of clobbering it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
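For reviewers, a minimal sketch of the per-tail threshold in Fix (1). The two constants and their values come from the description above; the helper names `stall_threshold_seconds` and `should_finalize_stalled` are hypothetical, and the `>=` comparison is an assumption, so the real code in this PR may be shaped differently.

```python
# Values as stated in the PR description.
STALL_FINALIZE_AFTER_SECONDS = 90
STALL_FINALIZE_TOOL_USE_SECONDS = 300


def stall_threshold_seconds(tail_event_type: str) -> int:
    """Pick the idle threshold for the last event type (hypothetical helper)."""
    if tail_event_type == "tool_use":
        # Slow tools and post-tool reasoning can be silent for minutes.
        return STALL_FINALIZE_TOOL_USE_SECONDS
    # "text" / "thinking" tails keep the original threshold.
    return STALL_FINALIZE_AFTER_SECONDS


def should_finalize_stalled(tail_event_type: str, idle_seconds: float) -> bool:
    """True once the card has been idle for at least its threshold (assumed >=)."""
    return idle_seconds >= stall_threshold_seconds(tail_event_type)
```

Under these assumptions the `tail=tool_use idle=96s` case from the regression no longer fires, while a `text` tail idle for the same 96 s still does.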
Summary
- Split `STALL_FINALIZE_AFTER_SECONDS` by tail event type. `tool_use` → 300 s, `text` / `thinking` → 90 s. Catches the metrics-debug class of false positives where Claude is reasoning toward a final answer after a slow tool.
- `maybe_finalize_stalled` now arms `CardState.stall_finalized=True`. If a genuine assistant turn lands afterwards, `update_session_card` / `finalize_task` wipe the binding via `_recover_from_false_stall` and `_send_card` spawns a fresh card below the stalled stub (`…continued` header marker).

The stalled stub stays in chat history; we don't rewrite it. The recovery card appears below with the real answer.
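A rough sketch of the recovery step, for orientation only. `stall_finalized`, `is_continuation` and `msg_id` are named in this PR; the `events` and `page` fields, the boolean return value, and clearing the flag after recovery are assumptions made for illustration, not a description of the actual `CardState` or `_recover_from_false_stall`.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CardState:
    # Assumed shape: only msg_id, stall_finalized and is_continuation
    # are named in the PR text.
    msg_id: Optional[int] = None
    events: List[str] = field(default_factory=list)
    page: int = 0
    is_continuation: bool = False
    stall_finalized: bool = False


def recover_from_false_stall(state: CardState) -> bool:
    """Drop the binding to a stall-finalized card so the next send starts fresh.

    Returns True if a recovery happened (return value is an assumption).
    """
    if not state.stall_finalized:
        return False
    state.msg_id = None           # forget the stalled stub's message
    state.events.clear()          # start the new card with no events
    state.page = 0                # reset pagination
    state.is_continuation = True  # new card is marked as a continuation
    state.stall_finalized = False # assumed: disarm so recovery runs once
    return True
```

In this sketch a caller such as `update_session_card` would invoke the helper before sending, so a late assistant turn lands in a new card rather than being edited into the stub.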
Context
On 2026-06-17 a long metrics debug session emitted its final BT_fin/CDM race answer to the tmux pane and the JSONL, but `stall_finalize` had already fired with `idle=96 s tail=tool_use`. Subsequent switcher taps painted empty stubs (`len=33` edits to `msg=7179`) and the user never saw the answer in TG. The 90 s blanket threshold was the root cause; the silent edit of the finalized card was the user-visible part.
Test plan
- `tests/test_stalled_card.py`: `tool_use` tail now uses the longer threshold; added `test_tool_use_within_extended_threshold_no_fire` for the metrics-debug regression and `test_stall_arms_recovery_flag` for the flag.
- `tests/ccbot/handlers/test_stall_recovery.py`: `_recover_from_false_stall` helper, `update_session_card` and `finalize_task` paths after a stalled stub.
- `tests/e2e/test_stalled_finalize.py`: per-tail-type threshold in the seeded card.
- `uv run ruff check`: clean.
- `uv run pyright src/ccbot/handlers/`: 0 errors.
- `uv run pytest`: 731 passing (full suite) + 14 e2e green.

🤖 Generated with Claude Code