Skip to content

Harden the Codex turn lifecycle: bound stalls, stop abandoned upstream work, idle broker shutdown#361

Open
geoff-ux wants to merge 5 commits into
openai:mainfrom
geoff-ux:harden-turn-lifecycle
Open

Harden the Codex turn lifecycle: bound stalls, stop abandoned upstream work, idle broker shutdown#361
geoff-ux wants to merge 5 commits into
openai:mainfrom
geoff-ux:harden-turn-lifecycle

Conversation

@geoff-ux

@geoff-ux geoff-ux commented Jun 8, 2026

Copy link
Copy Markdown

Summary

Stops a Codex run launched through this plugin from appearing "stuck," and makes a failed or abandoned run stop its upstream work instead of lingering. Five focused commits.

1. A wedged or dead turn no longer hangs forever (captureTurn)

captureTurn awaited state.completion, which resolves only on a terminal turn/completed notification (or the inferred-completion heuristic). It had no timeout and was not coupled to the app-server process, so two real situations left it pending forever:

  • A hung MCP server. Codex blocks a turn until MCP init settles. A remote MCP server whose endpoint hangs (rather than failing fast) stalls the whole turn until its connection times out — observed at ~225s per turn for one misconfigured server, with the foreground task showing no output the entire time.
  • App-server death mid-turn. handleExit only rejects in-flight RPCs; turn/start has already resolved by the time we await completion.

The wait is now bounded and the start RPC is inside the raced work, so a start that hangs or rejects before ACK is covered too. All env-overridable:

  • CODEX_COMPANION_TURN_STALL_MS (default 600s) — no inbound traffic at all
  • CODEX_COMPANION_TURN_TIMEOUT_MS (default 1800s) — absolute per-turn ceiling
  • the app-server exitPromise — process death rejects immediately

Promise.race observes every guard from the outset, so a guard rejecting (even pre-ACK, or after the race settles) is always handled — no unhandledRejection.

2. Abandoned turns stop their upstream work

Stopping the wait is not enough on the shared broker: abandoning a turn there closed only the companion→broker socket, leaving the turn running upstream where it could keep mutating the workspace and collide with the next task. (The direct transport is fine — close() kills the app-server.) Two complementary stops:

  • When the turn id is known, the companion sends a bounded turn/interrupt so the turn ends and the broker stays reusable.
  • The broker self-terminates when the client owning in-progress upstream work disconnects before it finishes — whether a request is still in flight (a start that never ACKed) or a streaming turn ACKed but never reached turn/completed (e.g. the companion crashed before it could interrupt). The broker is single-flight, so no other work is at stake; it closes the upstream app-server and the next task reconnects to a fresh runtime. Covers owned and env-shared brokers uniformly. A turn that completed/was interrupted has already cleared activeStreamSocket, so a normal close does not trigger it.

3. Brokers self-shut-down when idle

The broker is keyed per-cwd and only reaped on a later ensureBrokerSession for the same cwd, with no idle self-shutdown — so a broker for a cwd never revisited (deleted worktree, ended session) ran forever and orphaned processes accumulated. It now exits after CODEX_COMPANION_BROKER_IDLE_MS (default 1800s) with no connected clients; shutdown() stops accepting connections before closing the upstream, so a client arriving at the teardown boundary is served or cleanly refused (ECONNREFUSED, which the companion retries), never accepted-then-failed.

4. The rescue forwarder surfaces errors instead of returning empty

codex-rescue.md and the codex-cli-runtime skill told the forwarder to "return nothing" on failure, masking a real failure as an empty response. It now returns the error output verbatim.

Tests

  • never-completes, start-hangs, and start-rejects fake-codex behaviors.
  • Fast-fail on a turn that never completes; on a start RPC that hangs (pre-ACK); and on a start RPC that rejects (pre-ACK, no unhandled rejection).
  • A stalled brokered turn is interrupted upstream and tears the broker down.
  • A pre-ACK abandonment makes the broker self-terminate rather than wait out its idle window.
  • Broker idle self-shutdown.

All new tests pass (node --test). The watchdog, interrupt, idle-shutdown, and abandonment self-termination were also verified end-to-end against a live Codex install.

Notes

  • No happy-path change: a normal turn resolves via state.completion and wins the race; a completed turn clears activeStreamSocket before close, so the broker is not torn down.
  • Defaults are generous so long legitimate runs are unaffected; the env vars exist for tuning.
  • For maintainers: broker teardown (idle or abandonment) closes the listener before the upstream to make the boundary served-or-retryable; a sub-millisecond accept-then-teardown window remains, shared with the existing idle path. Fully closing it would mean a broker-side request-cancellation/fencing primitive in the shared-concurrency model, which felt out of scope here — happy to follow up if you'd prefer it handled in this PR.

geoff-ux and others added 5 commits June 8, 2026 11:32
captureTurn awaited `state.completion` with no timeout and no coupling to
the app-server process. A turn that never emits a terminal `turn/completed`
(e.g. a hung MCP server whose connection blocks the turn from finalizing,
or an app-server that dies mid-turn) left the await pending forever — which
surfaces as a "stuck" Codex run producing no output.

Bound the wait three ways, all env-overridable for genuinely long turns:
- CODEX_COMPANION_TURN_STALL_MS (default 600s): no inbound traffic at all
- CODEX_COMPANION_TURN_TIMEOUT_MS (default 1800s): absolute per-turn ceiling
- app-server process exit now rejects immediately

The broker (app-server-broker.mjs) had no idle self-shutdown and is keyed
per-cwd, reaped only on a later ensureBrokerSession for the SAME cwd. A
broker for a cwd that is never revisited (deleted worktree, ended session)
ran forever and orphaned processes accumulated. It now exits after
CODEX_COMPANION_BROKER_IDLE_MS (default 1800s) with no connected clients;
the companion transparently respawns one on the next task.

Tests: a `never-completes` fake-codex behavior with a fast-stall assertion,
and a broker idle self-shutdown assertion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The codex-rescue agent and the codex-cli-runtime skill instructed the
forwarder to "return nothing" when the Bash call failed or Codex could not
be invoked. That masks a real failure as an empty response. Return the
error output (stderr and any stdout) verbatim so the failure is visible.

Updates the doc-consistency assertions in commands.test.mjs to pin the new
wording.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adversarial review caught that the ceiling/stall/exit guards were
constructed before `await startRequest()` but only observed by
`Promise.race` after it. A start RPC (turn/start, review/start) that hangs
before ACK was therefore not covered: the await stayed stuck, the guards
rejected while unobserved (unhandledRejection), and the finally cleanup
never ran. A start RPC that rejected pre-ACK likewise left the exit guard's
later rejection unobserved.

Wrap the start RPC plus response processing in a `work` promise and race
that against the guards, so every guard is observed from the outset and the
watchdog also covers a hung or failed start RPC.

Adds a `start-hangs` fake-codex behavior and a regression test asserting the
task fails fast (rather than hanging) and emits no unhandled rejection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adversarial re-review found the idle self-shutdown could race a new client:
shutdown() closed the upstream appClient before server.close(), so a task
connecting at the teardown boundary could be accepted and then fail against
a closed app-server — an error withAppServer does not retry.

Close the listening server first. A client arriving at the boundary is then
either served by the still-open appClient during drain, or refused with
ECONNREFUSED, which the companion retries into a fresh broker.

Also adds a `start-rejects` fake-codex behavior and a test for the pre-ACK
RPC rejection path: the real error surfaces with no unhandled rejection.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A watchdog/exit guard stopped the wait but not the work: on the shared broker
transport, abandoning the turn closed only the companion→broker socket,
leaving the turn running upstream where it could keep mutating the workspace
and collide with the next task. (The direct transport is fine — close() kills
the app-server.)

Two complementary stops, gated by a `turnAbandoned` tag so a turn that merely
errored (already over) does not trigger teardown:
- When the turn id is known, the companion sends a bounded turn/interrupt so
  the upstream turn ends and the broker stays alive and reusable.
- The broker self-terminates when the client owning in-progress upstream work
  disconnects before it finishes — whether a request is still in flight (a
  start that never ACKed) or a streaming turn has ACKed but not reached
  turn/completed (e.g. the companion crashed before it could interrupt). The
  broker is single-flight, so no other in-flight work is at stake; it closes
  the upstream app-server, and the next task reconnects to a fresh runtime.
  This covers owned and env-shared brokers uniformly. A turn that completed (or
  was interrupted) has already cleared activeStreamSocket, so a normal close
  does not trigger it.

Tests: a stalled brokered turn is interrupted upstream and tears the broker
down; a pre-ACK abandonment makes the broker self-terminate rather than wait
out its idle window.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@geoff-ux geoff-ux requested a review from a team June 8, 2026 17:33
Jafee added a commit to Jafee/codex-plugin-cc-stable that referenced this pull request Jun 9, 2026
… turns, idle broker shutdown (openai#361)

Reconcile upstream PR openai#361 into the already-merged openai#302 wall-clock timeouts.
Replace openai#302's captureTurn idle timer (armIdle / CODEX_TURN_IDLE_TIMEOUT_MS) with
openai#361's Promise.race([work, ceiling, stall, exit]) + interruptTurnBestEffort. This
also covers a start RPC that hangs or rejects before any ACK, and interrupts an
abandoned turn upstream so it cannot overlap a retry. Keep openai#302's app-server
request() wall-clock timeout untouched (complementary transport-layer guard).

Add broker idle self-shutdown (CODEX_COMPANION_BROKER_IDLE_MS, default 30m) and
onSocketGone teardown so an orphaned broker/turn self-heals instead of leaking;
close the listener before the upstream app-server on shutdown. Surface Codex
invocation failures verbatim instead of swallowing them into an empty response.

openai#361 supersedes openai#312 (the per-turn watchdog openai#312 attempted, implemented correctly here).

Dual security review (Opus + Codex gpt-5.5 xhigh): both APPROVE, all 9 red lines
clear, 0 security findings. Converged LOW non-blockers: onSocketGone error+close
double-event is idempotent but untested; CEILING_MS backstop untested; openai#355
background-job interaction noted (not merged). Full suite 98/98 in a clean env.

Upstream PR: openai#361
Co-Authored-By: geoff-ux <geoff-ux@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jafee added a commit to Jafee/codex-plugin-cc-stable that referenced this pull request Jun 9, 2026
…mark openai#312 superseded

Both Opus and Codex (gpt-5.5 xhigh) independently APPROVE the openai#361 reconcile
(8be4ab3): all 9 red lines clear, 0 security findings, 3 converged LOW
non-blockers (onSocketGone double-event idempotent-but-untested, CEILING_MS
backstop untested, openai#355 background-job interaction noted). openai#312 is superseded by
openai#361's correct Promise.race implementation; openai#294/openai#343/openai#355 remain HOLD.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jafee added a commit to Jafee/codex-plugin-cc-stable that referenced this pull request Jun 9, 2026
…t merging

Record the decision to not merge openai#346: 16 files / 3099 lines, the only PR in the
batch adding an auto-executing PostToolUse hook. Its value (codex-rescue
completion signal + worker-lifecycle detection) is reliability polish, not a
root-cause fix — the stuck/no-response root causes are already covered by
openai#302+openai#361. Quick skim found nothing alarming (hook only reads state + emits
text; spawns use array args), but it was not given a merge-grade dual audit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jafee added a commit to Jafee/codex-plugin-cc-stable that referenced this pull request Jun 9, 2026
…(review follow-up)

Address two findings from an external review of the merged openai#302/openai#361 timeout work.

P2 — CodexAppServerClient.connect() returned without closing the client when
initialize() failed, orphaning the spawned app-server (or broker socket). Close
it on the failure path, but bound and harden the cleanup so it can never turn a
fast failure into a hang:
  - Promise.race the close() against CLEANUP_TIMEOUT_MS (2s): a wedged app-server
    that ignores SIGTERM, or a broker client that threw before creating its
    socket, can no longer block the rethrow of the original error.
  - close() escalates SIGTERM -> SIGKILL after a grace (guarded on
    exitCode === null && signalCode === null), then destroys our stdout/stderr
    read streams so a descendant that inherited the app-server's stdio fds cannot
    keep the host process's event loop alive after a failed connect.

P3 — timeout env parsing was inconsistent: the turn/stall/broker-idle vars used
`Number(x) || default`, so a negative or absurdly large value armed a timer that
fires immediately (Node clamps delays above 2^31-1 to ~1ms). Unify into
resolveTimeoutMs(env, default, {allowDisable}): reject non-finite/negative, clamp
to the timer ceiling, and preserve the RPC timeout's 0=disabled escape hatch.

Dual security review (Opus + Codex gpt-5.5 xhigh), four rounds: Codex caught
three real HIGH issues in the first cut of the P2 cleanup (unbounded close hang,
an un-reaped child's stdio keeping the host alive, descendant fd inheritance).
Each is fixed and covered by a regression test proven to fail without the fix.
Both APPROVE the final diff; all 9 red lines clear. Full suite 105/105 clean env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jafee added a commit to Jafee/codex-plugin-cc-stable that referenced this pull request Jun 9, 2026
…ardown

Final independent cross-audit (a fresh Opus reviewer + a fresh Codex over the
full c7d071a..HEAD delta) caught a HIGH in the merged openai#361 broker teardown:
shutdown() does `await server.close()`, which only resolves once EVERY connection
has fully closed. A second broker client (a concurrent session on the same cwd)
that is wedged or half-open never reciprocates our socket.end(), so server.close()
— and thus the upstream app-server close and process exit — blocks forever,
defeating onSocketGone's orphan-turn teardown.

Bound it: server.close() races a SHUTDOWN_DRAIN_MS (1s) timer that force-destroys
any straggler, so shutdown always completes. Align the idle self-shutdown path
with onSocketGone's `.finally(() => process.exit(0))` so a shutdown rejection
cannot skip the exit. New regression test (half-open straggler + broker/shutdown)
is proven to hang (8s) without the bound.

Verdicts on this delta: independent Opus APPROVE (0 Critical/Important); Codex
raised this HIGH — now fixed. Full suite 106/106 in a clean env.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kingdoooo added a commit to kingdoooo/codex-plugin-cc that referenced this pull request Jun 10, 2026
…enai#361)

Broker-side hardening taken wholesale from openai#361
(my branch never touched app-server-broker.mjs, so this is conflict-free):
- idle self-shutdown after CODEX_COMPANION_BROKER_IDLE_MS (default 30min)
  with no clients — self-heals the orphaned-broker leak when no
  SessionEnd hook ever runs for the cwd
- self-teardown when a client abandons in-flight upstream work
  (pendingUpstreamSocket / activeStreamSocket gone) so an orphaned turn
  cannot keep mutating the workspace
- close the listener before the upstream app-server on shutdown so a
  client arriving at the teardown boundary gets ECONNREFUSED (retryable)
- rescue agent/skill: surface Codex invocation errors instead of
  returning nothing

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
kingdoooo added a commit to kingdoooo/codex-plugin-cc that referenced this pull request Jun 10, 2026
…g (PR openai#361 x Defect A/B/C)

Semantic merge of upstream PR openai#361's turn-lifecycle hardening into the
existing Defect A/B/C captureTurn fixes (a straight cherry-pick conflicts:
both rewrite the same captureTurn region):
- stall watchdog is now ALWAYS ON: review callers keep their explicit
  --turn-idle-timeout; every other turn (incl. /codex:task) gets a 10min
  default via CODEX_COMPANION_TURN_STALL_MS. Re-arm stays belonging-gated
  (Defect B) — openai#361's any-traffic lastActivity refresh would have
  reintroduced the foreign-chatter masking bug.
- absolute per-turn ceiling (CODEX_COMPANION_TURN_TIMEOUT_MS, 30min).
- app-server exit guard: process death rejects the wait immediately
  instead of hanging await state.completion forever.
- abandoned turns are interrupted upstream via interruptTurnBestEffort
  (awaited, 5s-bounded) using turnId ?? pendingTurnId (Defect C), on both
  broker and direct transports.
- start RPC covered by the same guards via openai#361's work-wrapper plus a
  single up-front Promise.race (no unhandledRejection from late guards).
- fixture: start-hangs / start-rejects / never-completes behaviors;
  runtime tests: openai#361's six lifecycle tests appended verbatim.

Verified: investigation 39/39, openai#361's six tests 6/6, broker-lifecycle 5/5,
commands 8/8.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant