Bound the Codex turn await: fix unbounded hang on stalled/dead app-server + make turn budget configurable#376
Conversation
…rever captureTurn ended in `return await state.completion`, a Promise resolved only by completeTurn() (on turn/completed or a 250ms inferred fallback). Its rejectCompletion counterpart was wired into state but never called — dead code. So if the app-server died mid-turn or simply stalled, nothing rejected the await: the turn hung until the caller's external timeout SIGKILLed the process with no output. Separately, the per-turn budget was read at import time (`const TURN_TIMEOUT_MS = Number(process.env.CODEX_TURN_TIMEOUT_MS) || 600000`), so any value set at runtime (a foreground budget below the caller's ceiling, or --turn-timeout-ms) was ignored — the budget was frozen at the default and effectively inert. Fixes: - Wire rejectCompletion to client.exitPromise so a mid-turn process exit surfaces immediately as an error instead of hanging. - Race state.completion against a hard deadline timer (cleared in finally). - Resolve the turn budget at call time (resolveTurnTimeoutMs: option > CODEX_TURN_TIMEOUT_MS env > default) so the companion's foreground budget and --turn-timeout-ms actually take effect. - companion: set a foreground turn budget below the caller's ceiling and add a --turn-timeout-ms flag; background jobs keep the full default. Verified: node --check on both files; --turn-timeout-ms 3000 now aborts a long turn at ~3s with a structured error; a normal turn still completes cleanly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5415c43684
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| state.rejectCompletion( | ||
| client.exitError ?? new Error("codex app-server exited before the turn completed.") | ||
| ); |
There was a problem hiding this comment.
Avoid rejecting an unobserved completion promise
When startRequest() rejects before the code reaches the Promise.race below (for example, a broker-busy error from turn/start/review/start, or the app-server exiting while that request is pending), withAppServer closes the client and exitPromise resolves, so these lines reject state.completion even though no handler has been attached to that promise. On supported Node versions this becomes an unhandled rejection that can terminate the CLI or preempt the intended direct retry/original error reporting; defer this rejection until the completion promise is being raced, or explicitly handle it on early failures.
Useful? React with 👍 / 👎.
| const explicitTurnTimeout = Number(options["turn-timeout-ms"]); | ||
| if (Number.isFinite(explicitTurnTimeout) && explicitTurnTimeout > 0) { | ||
| process.env.CODEX_TURN_TIMEOUT_MS = String(explicitTurnTimeout); | ||
| } else if (!process.env.CODEX_TURN_TIMEOUT_MS) { | ||
| process.env.CODEX_TURN_TIMEOUT_MS = String(FOREGROUND_TURN_TIMEOUT_MS); |
There was a problem hiding this comment.
Apply the foreground turn budget to reviews too
This default foreground budget is only installed in handleTask, but handleReviewCommand still runs /codex:review --wait and /codex:adversarial-review --wait through the same foreground runForegroundCommand path without setting CODEX_TURN_TIMEOUT_MS. Those review turns therefore fall back to the 600000ms default in codex.mjs, so under Claude Code's ~120000ms Bash ceiling they can still be SIGKILLed with an empty result instead of returning the structured timeout this change is meant to provide.
Useful? React with 👍 / 👎.
Two P2 fixes from the automated review: 1. codex.mjs — register the client.exitPromise -> rejectCompletion handler AFTER startRequest() resolves, not before. If startRequest() rejects (e.g. broker-busy from turn/start, or the app-server exiting while the request is pending), the error now propagates directly and state.completion is never observed. Previously the early-registered handler could reject an unobserved state.completion and surface as an unhandled rejection that can terminate the CLI or preempt the intended error reporting. The mid-turn-death coverage is unchanged (the handler is in place before the completion is awaited). 2. codex-companion.mjs — apply the foreground turn budget to review turns too. Extracted applyForegroundTurnBudget(options) and call it from both handleTask and handleReviewCommand. /codex:review --wait and adversarial-review --wait run foreground through runForegroundCommand but previously never set CODEX_TURN_TIMEOUT_MS, so they fell back to the 600000ms default and could be SIGKILLed at the host Bash ceiling with an empty result. Also accept --turn-timeout-ms on the review command. Guarded with !options.background so a detached worker still inherits the full default budget. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the review — both P2s addressed in
Verified: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 688192a624
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| deadlineTimer = setTimeout(() => { | ||
| reject(new Error(`codex turn exceeded the ${turnTimeoutMs}ms turn budget.`)); | ||
| }, turnTimeoutMs); |
There was a problem hiding this comment.
Render timeout failures to stdout
When the new deadline fires, this rejects out of captureTurn; runTrackedJob then takes its catch path and main().catch prints only the error message to stderr, so runForegroundCommand never writes a rendered result to stdout. In the foreground slash-command paths that return the command stdout verbatim, a timed-out turn still yields empty stdout instead of the structured timeout/partial output this budget is meant to surface; return a failed turn state/rendered result or render the timeout from the catch path.
Useful? React with 👍 / 👎.
|
Fair point — a scoping note on this PR vs. a follow-up. The primary goal here is eliminating the unbounded hang: a stalled or dead app-server previously blocked on Rendering the timeout (plus any partial turn output) to stdout as a structured result is a reasonable next step, but it touches the foreground render path ( |
Problem
A foreground Codex task can hang until the caller's external process timeout kills it with no output. Two distinct defects in
scripts/lib/codex.mjs:1. Unbounded, un-rejectable completion await.
captureTurnends inreturn await state.completion— a Promise resolved only bycompleteTurn()(on aturn/completednotification, or a 250ms inferred fallback). ItsrejectCompletioncounterpart is wired into the turn-capture state but has zero call sites — dead code. So oncestartRequest()resolves withinProgress, if the app-server dies or simply stalls, nothing rejects the await. It hangs until the caller (e.g. a host harness with a ~2-minute shell ceiling) SIGKILLs the process — surfacing as "hangs every time, returns nothing."2. Turn budget read at import time.
const TURN_TIMEOUT_MS = Number(process.env.CODEX_TURN_TIMEOUT_MS) || 600000is evaluated when the module is imported. The companion sets the env var after import (e.g. a foreground budget below the caller's ceiling, or via a flag), so the override never takes effect — the budget is frozen at the default and effectively inert.Fix
rejectCompletiontoclient.exitPromiseso a mid-turn process exit rejects the await immediately with a structured error instead of hanging.state.completionagainst a hard deadline timer (cleared in afinally).resolveTurnTimeoutMs(options): explicit option →CODEX_TURN_TIMEOUT_MSenv → default — so runtime overrides actually apply.codex-companion.mjs: set a foreground turn budget just below the caller's external ceiling and add a--turn-timeout-msflag. Background/detached jobs inherit the full default.Verification
node --checkpasses on both files.task --turn-timeout-ms 3000on a long-generation prompt now aborts at ~3s withcodex turn exceeded the 3000ms turn budget(previously ran unbounded — the flag was inert).🤖 Generated with Claude Code