Harden the Codex turn lifecycle: bound stalls, stop abandoned upstream work, idle broker shutdown#361
Open
geoff-ux wants to merge 5 commits into
Open
Harden the Codex turn lifecycle: bound stalls, stop abandoned upstream work, idle broker shutdown#361geoff-ux wants to merge 5 commits into
geoff-ux wants to merge 5 commits into
Conversation
captureTurn awaited `state.completion` with no timeout and no coupling to the app-server process. A turn that never emits a terminal `turn/completed` (e.g. a hung MCP server whose connection blocks the turn from finalizing, or an app-server that dies mid-turn) left the await pending forever — which surfaces as a "stuck" Codex run producing no output. Bound the wait three ways, all env-overridable for genuinely long turns: - CODEX_COMPANION_TURN_STALL_MS (default 600s): no inbound traffic at all - CODEX_COMPANION_TURN_TIMEOUT_MS (default 1800s): absolute per-turn ceiling - app-server process exit now rejects immediately The broker (app-server-broker.mjs) had no idle self-shutdown and is keyed per-cwd, reaped only on a later ensureBrokerSession for the SAME cwd. A broker for a cwd that is never revisited (deleted worktree, ended session) ran forever and orphaned processes accumulated. It now exits after CODEX_COMPANION_BROKER_IDLE_MS (default 1800s) with no connected clients; the companion transparently respawns one on the next task. Tests: a `never-completes` fake-codex behavior with a fast-stall assertion, and a broker idle self-shutdown assertion. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The codex-rescue agent and the codex-cli-runtime skill instructed the forwarder to "return nothing" when the Bash call failed or Codex could not be invoked. That masks a real failure as an empty response. Return the error output (stderr and any stdout) verbatim so the failure is visible. Updates the doc-consistency assertions in commands.test.mjs to pin the new wording. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adversarial review caught that the ceiling/stall/exit guards were constructed before `await startRequest()` but only observed by `Promise.race` after it. A start RPC (turn/start, review/start) that hangs before ACK was therefore not covered: the await stayed stuck, the guards rejected while unobserved (unhandledRejection), and the finally cleanup never ran. A start RPC that rejected pre-ACK likewise left the exit guard's later rejection unobserved. Wrap the start RPC plus response processing in a `work` promise and race that against the guards, so every guard is observed from the outset and the watchdog also covers a hung or failed start RPC. Adds a `start-hangs` fake-codex behavior and a regression test asserting the task fails fast (rather than hanging) and emits no unhandled rejection. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adversarial re-review found the idle self-shutdown could race a new client: shutdown() closed the upstream appClient before server.close(), so a task connecting at the teardown boundary could be accepted and then fail against a closed app-server — an error withAppServer does not retry. Close the listening server first. A client arriving at the boundary is then either served by the still-open appClient during drain, or refused with ECONNREFUSED, which the companion retries into a fresh broker. Also adds a `start-rejects` fake-codex behavior and a test for the pre-ACK RPC rejection path: the real error surfaces with no unhandled rejection. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A watchdog/exit guard stopped the wait but not the work: on the shared broker transport, abandoning the turn closed only the companion→broker socket, leaving the turn running upstream where it could keep mutating the workspace and collide with the next task. (The direct transport is fine — close() kills the app-server.) Two complementary stops, gated by a `turnAbandoned` tag so a turn that merely errored (already over) does not trigger teardown: - When the turn id is known, the companion sends a bounded turn/interrupt so the upstream turn ends and the broker stays alive and reusable. - The broker self-terminates when the client owning in-progress upstream work disconnects before it finishes — whether a request is still in flight (a start that never ACKed) or a streaming turn has ACKed but not reached turn/completed (e.g. the companion crashed before it could interrupt). The broker is single-flight, so no other in-flight work is at stake; it closes the upstream app-server, and the next task reconnects to a fresh runtime. This covers owned and env-shared brokers uniformly. A turn that completed (or was interrupted) has already cleared activeStreamSocket, so a normal close does not trigger it. Tests: a stalled brokered turn is interrupted upstream and tears the broker down; a pre-ACK abandonment makes the broker self-terminate rather than wait out its idle window. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Jafee
added a commit
to Jafee/codex-plugin-cc-stable
that referenced
this pull request
Jun 9, 2026
… turns, idle broker shutdown (openai#361) Reconcile upstream PR openai#361 into the already-merged openai#302 wall-clock timeouts. Replace openai#302's captureTurn idle timer (armIdle / CODEX_TURN_IDLE_TIMEOUT_MS) with openai#361's Promise.race([work, ceiling, stall, exit]) + interruptTurnBestEffort. This also covers a start RPC that hangs or rejects before any ACK, and interrupts an abandoned turn upstream so it cannot overlap a retry. Keep openai#302's app-server request() wall-clock timeout untouched (complementary transport-layer guard). Add broker idle self-shutdown (CODEX_COMPANION_BROKER_IDLE_MS, default 30m) and onSocketGone teardown so an orphaned broker/turn self-heals instead of leaking; close the listener before the upstream app-server on shutdown. Surface Codex invocation failures verbatim instead of swallowing them into an empty response. openai#361 supersedes openai#312 (the per-turn watchdog openai#312 attempted, implemented correctly here). Dual security review (Opus + Codex gpt-5.5 xhigh): both APPROVE, all 9 red lines clear, 0 security findings. Converged LOW non-blockers: onSocketGone error+close double-event is idempotent but untested; CEILING_MS backstop untested; openai#355 background-job interaction noted (not merged). Full suite 98/98 in a clean env. Upstream PR: openai#361 Co-Authored-By: geoff-ux <geoff-ux@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jafee
added a commit
to Jafee/codex-plugin-cc-stable
that referenced
this pull request
Jun 9, 2026
…mark openai#312 superseded Both Opus and Codex (gpt-5.5 xhigh) independently APPROVE the openai#361 reconcile (8be4ab3): all 9 red lines clear, 0 security findings, 3 converged LOW non-blockers (onSocketGone double-event idempotent-but-untested, CEILING_MS backstop untested, openai#355 background-job interaction noted). openai#312 is superseded by openai#361's correct Promise.race implementation; openai#294/openai#343/openai#355 remain HOLD. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jafee
added a commit
to Jafee/codex-plugin-cc-stable
that referenced
this pull request
Jun 9, 2026
…t merging Record the decision to not merge openai#346: 16 files / 3099 lines, the only PR in the batch adding an auto-executing PostToolUse hook. Its value (codex-rescue completion signal + worker-lifecycle detection) is reliability polish, not a root-cause fix — the stuck/no-response root causes are already covered by openai#302+openai#361. Quick skim found nothing alarming (hook only reads state + emits text; spawns use array args), but it was not given a merge-grade dual audit. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jafee
added a commit
to Jafee/codex-plugin-cc-stable
that referenced
this pull request
Jun 9, 2026
…(review follow-up) Address two findings from an external review of the merged openai#302/openai#361 timeout work. P2 — CodexAppServerClient.connect() returned without closing the client when initialize() failed, orphaning the spawned app-server (or broker socket). Close it on the failure path, but bound and harden the cleanup so it can never turn a fast failure into a hang: - Promise.race the close() against CLEANUP_TIMEOUT_MS (2s): a wedged app-server that ignores SIGTERM, or a broker client that threw before creating its socket, can no longer block the rethrow of the original error. - close() escalates SIGTERM -> SIGKILL after a grace (guarded on exitCode === null && signalCode === null), then destroys our stdout/stderr read streams so a descendant that inherited the app-server's stdio fds cannot keep the host process's event loop alive after a failed connect. P3 — timeout env parsing was inconsistent: the turn/stall/broker-idle vars used `Number(x) || default`, so a negative or absurdly large value armed a timer that fires immediately (Node clamps delays above 2^31-1 to ~1ms). Unify into resolveTimeoutMs(env, default, {allowDisable}): reject non-finite/negative, clamp to the timer ceiling, and preserve the RPC timeout's 0=disabled escape hatch. Dual security review (Opus + Codex gpt-5.5 xhigh), four rounds: Codex caught three real HIGH issues in the first cut of the P2 cleanup (unbounded close hang, an un-reaped child's stdio keeping the host alive, descendant fd inheritance). Each is fixed and covered by a regression test proven to fail without the fix. Both APPROVE the final diff; all 9 red lines clear. Full suite 105/105 clean env. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jafee
added a commit
to Jafee/codex-plugin-cc-stable
that referenced
this pull request
Jun 9, 2026
…ardown Final independent cross-audit (a fresh Opus reviewer + a fresh Codex over the full c7d071a..HEAD delta) caught a HIGH in the merged openai#361 broker teardown: shutdown() does `await server.close()`, which only resolves once EVERY connection has fully closed. A second broker client (a concurrent session on the same cwd) that is wedged or half-open never reciprocates our socket.end(), so server.close() — and thus the upstream app-server close and process exit — blocks forever, defeating onSocketGone's orphan-turn teardown. Bound it: server.close() races a SHUTDOWN_DRAIN_MS (1s) timer that force-destroys any straggler, so shutdown always completes. Align the idle self-shutdown path with onSocketGone's `.finally(() => process.exit(0))` so a shutdown rejection cannot skip the exit. New regression test (half-open straggler + broker/shutdown) is proven to hang (8s) without the bound. Verdicts on this delta: independent Opus APPROVE (0 Critical/Important); Codex raised this HIGH — now fixed. Full suite 106/106 in a clean env. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kingdoooo
added a commit
to kingdoooo/codex-plugin-cc
that referenced
this pull request
Jun 10, 2026
…enai#361) Broker-side hardening taken wholesale from openai#361 (my branch never touched app-server-broker.mjs, so this is conflict-free): - idle self-shutdown after CODEX_COMPANION_BROKER_IDLE_MS (default 30min) with no clients — self-heals the orphaned-broker leak when no SessionEnd hook ever runs for the cwd - self-teardown when a client abandons in-flight upstream work (pendingUpstreamSocket / activeStreamSocket gone) so an orphaned turn cannot keep mutating the workspace - close the listener before the upstream app-server on shutdown so a client arriving at the teardown boundary gets ECONNREFUSED (retryable) - rescue agent/skill: surface Codex invocation errors instead of returning nothing Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kingdoooo
added a commit
to kingdoooo/codex-plugin-cc
that referenced
this pull request
Jun 10, 2026
…g (PR openai#361 x Defect A/B/C) Semantic merge of upstream PR openai#361's turn-lifecycle hardening into the existing Defect A/B/C captureTurn fixes (a straight cherry-pick conflicts: both rewrite the same captureTurn region): - stall watchdog is now ALWAYS ON: review callers keep their explicit --turn-idle-timeout; every other turn (incl. /codex:task) gets a 10min default via CODEX_COMPANION_TURN_STALL_MS. Re-arm stays belonging-gated (Defect B) — openai#361's any-traffic lastActivity refresh would have reintroduced the foreign-chatter masking bug. - absolute per-turn ceiling (CODEX_COMPANION_TURN_TIMEOUT_MS, 30min). - app-server exit guard: process death rejects the wait immediately instead of hanging await state.completion forever. - abandoned turns are interrupted upstream via interruptTurnBestEffort (awaited, 5s-bounded) using turnId ?? pendingTurnId (Defect C), on both broker and direct transports. - start RPC covered by the same guards via openai#361's work-wrapper plus a single up-front Promise.race (no unhandledRejection from late guards). - fixture: start-hangs / start-rejects / never-completes behaviors; runtime tests: openai#361's six lifecycle tests appended verbatim. Verified: investigation 39/39, openai#361's six tests 6/6, broker-lifecycle 5/5, commands 8/8. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stops a Codex run launched through this plugin from appearing "stuck," and makes a failed or abandoned run stop its upstream work instead of lingering. Five focused commits.
1. A wedged or dead turn no longer hangs forever (
captureTurn)captureTurnawaitedstate.completion, which resolves only on a terminalturn/completednotification (or the inferred-completion heuristic). It had no timeout and was not coupled to the app-server process, so two real situations left it pending forever:taskshowing no output the entire time.handleExitonly rejects in-flight RPCs;turn/starthas already resolved by the time we await completion.The wait is now bounded and the start RPC is inside the raced work, so a start that hangs or rejects before ACK is covered too. All env-overridable:
CODEX_COMPANION_TURN_STALL_MS(default 600s) — no inbound traffic at allCODEX_COMPANION_TURN_TIMEOUT_MS(default 1800s) — absolute per-turn ceilingexitPromise— process death rejects immediatelyPromise.raceobserves every guard from the outset, so a guard rejecting (even pre-ACK, or after the race settles) is always handled — nounhandledRejection.2. Abandoned turns stop their upstream work
Stopping the wait is not enough on the shared broker: abandoning a turn there closed only the companion→broker socket, leaving the turn running upstream where it could keep mutating the workspace and collide with the next task. (The direct transport is fine —
close()kills the app-server.) Two complementary stops:turn/interruptso the turn ends and the broker stays reusable.turn/completed(e.g. the companion crashed before it could interrupt). The broker is single-flight, so no other work is at stake; it closes the upstream app-server and the next task reconnects to a fresh runtime. Covers owned and env-shared brokers uniformly. A turn that completed/was interrupted has already clearedactiveStreamSocket, so a normal close does not trigger it.3. Brokers self-shut-down when idle
The broker is keyed per-cwd and only reaped on a later
ensureBrokerSessionfor the same cwd, with no idle self-shutdown — so a broker for a cwd never revisited (deleted worktree, ended session) ran forever and orphaned processes accumulated. It now exits afterCODEX_COMPANION_BROKER_IDLE_MS(default 1800s) with no connected clients;shutdown()stops accepting connections before closing the upstream, so a client arriving at the teardown boundary is served or cleanly refused (ECONNREFUSED, which the companion retries), never accepted-then-failed.4. The rescue forwarder surfaces errors instead of returning empty
codex-rescue.mdand thecodex-cli-runtimeskill told the forwarder to "return nothing" on failure, masking a real failure as an empty response. It now returns the error output verbatim.Tests
never-completes,start-hangs, andstart-rejectsfake-codex behaviors.All new tests pass (
node --test). The watchdog, interrupt, idle-shutdown, and abandonment self-termination were also verified end-to-end against a live Codex install.Notes
state.completionand wins the race; a completed turn clearsactiveStreamSocketbefore close, so the broker is not torn down.