fix(adapter): surface error on agent EOF without final response#1198
Conversation
This comment has been minimized.
This comment has been minimized.
The ACP recv loop's EOF branch (rx.recv() -> None) broke out of the loop without setting response_error. When a bridged agent crashes on a backend error (e.g. Gemini HTTP 500 / quota exhausted) it exits without ever emitting an ACP error notification, so the pipe closes, recv() returns None, and final_content falls through to the "_(no response)_" sentinel -- hiding the real failure from the user. Mark unexpected termination explicitly on EOF when no error was recorded and no text was streamed, so the user sees a concrete error instead of a vague sentinel. The text_buf.is_empty() guard preserves any partial text that did arrive before the connection closed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01X6Vv8eNtc4T9ESSSzZ2KgM
dc9dfee to
513c523
Compare
Mob review (Codex + Claude multi-finder) found the original text_buf.is_empty() guard predicates error surfacing on buffer contents, which is not a reliable proxy for turn completion. Three crash classes still slipped through silently: - session-reset turns pre-seed text_buf with the expiry notice, so the guard is always false and a crash is masked; - send-once mode slices off inter-tool narration before delivery, so a crash after a tool (non-empty buffer, empty delivered body) shows no error; - native-streaming partial-text crashes were shown as complete answers. Drop the text_buf.is_empty() conjunct: on EOF, set response_error whenever none was recorded. This is safe because a successful turn is always signalled by the id-bearing JSON-RPC response to session/prompt, which breaks the loop before EOF — so the EOF arm is reachable only on abnormal termination, regardless of buffer contents. When partial text was streamed, final_content prepends the warning to it, preserving the output while flagging the truncation. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01X6Vv8eNtc4T9ESSSzZ2KgM
Final Aggregated Review — PR #1198Modegroup-review (2/2 voices active: Claude 8-finder/14-verified workflow + Codex) Outcome (after fix + re-review)
Consensus Critical (must fix)None. Consensus Important (must fix)Both voices converge on a single root cause: the
Unifying fix (resolves all three): drop the The original spec's Disputed / scope decision (user decides)
Actionable NIT
Voices unavailable
Out-of-PR note
|
|
LGTM ✅ — Minimal, correct defensive fix for the silent-EOF-on-bridged-agent-crash path. What This PR DoesWhen a bridged (non-ACP-native) agent crashes on a backend error (e.g. HTTP 500, quota exhausted), it exits without emitting an ACP error notification. Previously the recv loop's EOF branch simply broke out, leaving How It WorksIn the Findings
Finding Details🟢 F1: Correct pattern reuseThe fix follows the exact same 🟢 F2: Thorough design rationale in commentsThe inline comments explain why the 🟢 F3: Defensive
|
antigenius0910
left a comment
There was a problem hiding this comment.
LGTM ✅ — independent re-verification of the bug and the fix.
Bug reproduces in code
Confirmed in crates/openab-core/src/adapter.rs on main:
:815-816— EOF arm is a barebreakand never setsresponse_error.:1033-1038— whenfinal_contentis empty andresponse_errorisNone, falls through to the_(no response)_sentinel.- Sibling termination paths already break the symmetry:
:820(Agent process died),:825(hard timeout),:845(coded ACP error) all setresponse_error. The EOF arm is the only error termination that stays silent — this PR closes that asymmetry.
Fix is provably safe
The mob-review claim that "successful turns never reach the EOF arm" verifies directly at :835-848:
if let Some(notification_id) = notification.id {
if notification_id != request_id { continue; }
if let Some(ref err) = notification.error { ... }
break; // matching id ⇒ always breaks here, before any EOF
}A successful ACP turn closes with a JSON-RPC response carrying the matching request_id, which trips this branch before the pipe close can be observed. So the EOF arm is reachable only on abnormal termination, regardless of text_buf state. Dropping the text_buf.is_empty() guard is therefore correct — keeping it would have masked the three confirmed crash classes already documented in the aggregated review (session-reset pre-seed, send-once post-tool slice, native-streaming partial).
Scope check
Searched open issues for an existing report: none filed (PR is Discord-driven, no Closes #).
- #997 (codex-acp LLM call hang) looks adjacent but is a different defect — that one is an intra-turn LLM stall with no EOF and no events, which is the liveness/inactivity-timeout territory, not the EOF arm. This PR does not address #997 and should not be conflated with it.
- #1101, #976, #1075, #990, #1090 — unrelated.
CI
- 27/28 green; the only red is the pre-existing
manual_is_multiple_oflint onpre_seed.rs:289from main (landed via #1189/#1196/#1197) — not from this diff. - All 14 smoke-test configs + 14 unified-build configs pass, including
agy-acp(the bridged-agent path this fix targets).
Known follow-ups (not blockers)
- The aggregated review's Finding #4 —
response_errordoesn't drivedelivery_failed, so dispatch still reacts 🆗 instead of ❌ on crashed turns — is pre-existing and shared with the three sibling error branches (Agent process died, hard-timeout, coded-error). Wiringresponse_error → delivery_faileduniformly is a broader contract change and is correctly deferred to a separate PR.
Approving.
What problem does this solve?
When a bridged (non-ACP-native) agent crashes on a backend error, OpenAB shows the user a vague
_(no response)_(or, worse, a partially-streamed buffer presented as a complete answer) instead of the real failure reason.agy does not natively speak ACP — we bridge it. So when an AI backend (e.g. Gemini) returns HTTP 500 or a quota-exhausted error, agy can crash/exit without first emitting an ACP final response carrying that error. The failure chain:
rx.recv() -> Nonebranch breaks out, butresponse_erroris stillNone.final_contentis empty → falls through to the_(no response)_sentinel.The user is left with no idea their request hit a quota limit or server error.
This was surfaced in the OpenAB community discussion. Because agy itself doesn't support ACP, the bridging layer (OpenAB side) must supply a sensible error on pipe EOF rather than letting it fall through.
Closes #
Discord Discussion URL: https://discord.com/channels/1491295327620169908/1519250661664100455
At a Glance
Prior Art & Industry Research
Not applicable — defensive bug fix on the ACP bridge. Adds error surfacing on an existing
breakpath; no new architecture, dependency, or protocol. It makes the EOF branch consistent with the
sibling
!conn.alive()branch, which already setsresponse_error = "Agent process died".Proposed Solution
In
crates/openab-core/src/adapter.rs, the recv loop EOF branch records an explicit error when thepipe closes without a final response and no error was already recorded:
The downstream
final_contentbuilder already rendersresponse_erroras⚠️ {err}(standalonewhen there is no body, or prepended to any partial content), so no other change is needed.
Why this approach?
crashing, so the bridge detects the abnormal EOF itself, mirroring the existing
!conn.alive()branch.text_buf.is_empty(). Multi-modelmob review (Codex + a Claude multi-finder review) showed that guard is unsound: it is defeated by
the session-reset pre-seed (
text_bufalready holds the expiry notice), by send-once mode (thedelivered body is a post-
answer_startslice, not the full buffer), and by native streaming(partial text shown as a complete answer). The guard is also unnecessary: a successful turn is
always signalled by the id-bearing JSON-RPC response to
session/prompt, which breaks the recvloop before EOF — so the EOF branch is reachable only on abnormal termination, regardless of
buffer contents. When partial text was streamed,
final_contentprepends the warning to it,preserving the output while flagging the truncation.
Alternatives Considered
text_buf.is_empty()(earlier revision): rejected after mob review — masks crashes onreset turns, send-once tool turns, and native-streaming partials (see above).
signals the pipe closed; a timeout only adds latency for the same outcome.
emit an error before crashing, so the bridge must be robust to silent EOF. A complementary agy-side
fix is still worthwhile but out of scope for this repo.
Known Issues / Follow-ups
response_errordrives the
⚠️message banner but notdelivery_failed, so dispatch still reacts with success(🆗) rather than error (❌). This gap is shared with the existing
Agent process died/hard-timeout / coded-error branches — this PR only joins the pattern. Recommended as a separate PR
that wires
response_error → ❌uniformly for all error branches.Validation
Rust changes:
cargo build -p openab-corepassescargo test -p openab-core --lib adapter— 43 passed, 0 failedadapter.rsclippy-cleantext_buf.is_empty()guard was droppedCI note:
checkjob (cargo clippy --workspace -- -D warnings) is currently RED oncrates/openab-core/src/pre_seed.rs:289(manual_is_multiple_of, rust-1.96 lint) — pre-existingon main (landed via feat: add pre_seed phase — S3 zip download before pre_boot #1189/feat(pre_seed): support .tar.gz/.tgz archives #1196/build: make pre-seed a default feature #1197), not part of this diff. The merge gate will stay red until
main is fixed; flagging for maintainers.
All PRs:
confirm the user sees
⚠️ Agent process exited unexpectedly(with any partial text preserved)instead of
_(no response)_. No automated test seam exists for this deeply-nested recv loop;the sibling error paths are likewise covered only by manual repro against a live agent.