Skip to content

Bound the Codex turn await: fix unbounded hang on stalled/dead app-server + make turn budget configurable#376

Open
russjhammond wants to merge 2 commits into
openai:mainfrom
russjhammond:fix/bound-codex-turn-timeout
Open

Bound the Codex turn await: fix unbounded hang on stalled/dead app-server + make turn budget configurable#376
russjhammond wants to merge 2 commits into
openai:mainfrom
russjhammond:fix/bound-codex-turn-timeout

Conversation

@russjhammond

Copy link
Copy Markdown

Problem

A foreground Codex task can hang until the caller's external process timeout kills it with no output. Two distinct defects in scripts/lib/codex.mjs:

1. Unbounded, un-rejectable completion await. captureTurn ends in return await state.completion — a Promise resolved only by completeTurn() (on a turn/completed notification, or a 250ms inferred fallback). Its rejectCompletion counterpart is wired into the turn-capture state but has zero call sites — dead code. So once startRequest() resolves with inProgress, if the app-server dies or simply stalls, nothing rejects the await. It hangs until the caller (e.g. a host harness with a ~2-minute shell ceiling) SIGKILLs the process — surfacing as "hangs every time, returns nothing."

2. Turn budget read at import time. const TURN_TIMEOUT_MS = Number(process.env.CODEX_TURN_TIMEOUT_MS) || 600000 is evaluated when the module is imported. The companion sets the env var after import (e.g. a foreground budget below the caller's ceiling, or via a flag), so the override never takes effect — the budget is frozen at the default and effectively inert.

Fix

  • Wire rejectCompletion to client.exitPromise so a mid-turn process exit rejects the await immediately with a structured error instead of hanging.
  • Race state.completion against a hard deadline timer (cleared in a finally).
  • Resolve the budget at call timeresolveTurnTimeoutMs(options): explicit option → CODEX_TURN_TIMEOUT_MS env → default — so runtime overrides actually apply.
  • codex-companion.mjs: set a foreground turn budget just below the caller's external ceiling and add a --turn-timeout-ms flag. Background/detached jobs inherit the full default.

Verification

  • node --check passes on both files.
  • task --turn-timeout-ms 3000 on a long-generation prompt now aborts at ~3s with codex turn exceeded the 3000ms turn budget (previously ran unbounded — the flag was inert).
  • A normal foreground turn still completes cleanly (exit 0).

🤖 Generated with Claude Code

…rever

captureTurn ended in `return await state.completion`, a Promise resolved only
by completeTurn() (on turn/completed or a 250ms inferred fallback). Its
rejectCompletion counterpart was wired into state but never called — dead
code. So if the app-server died mid-turn or simply stalled, nothing rejected
the await: the turn hung until the caller's external timeout SIGKILLed the
process with no output.

Separately, the per-turn budget was read at import time
(`const TURN_TIMEOUT_MS = Number(process.env.CODEX_TURN_TIMEOUT_MS) || 600000`),
so any value set at runtime (a foreground budget below the caller's ceiling,
or --turn-timeout-ms) was ignored — the budget was frozen at the default and
effectively inert.

Fixes:
- Wire rejectCompletion to client.exitPromise so a mid-turn process exit
  surfaces immediately as an error instead of hanging.
- Race state.completion against a hard deadline timer (cleared in finally).
- Resolve the turn budget at call time (resolveTurnTimeoutMs: option >
  CODEX_TURN_TIMEOUT_MS env > default) so the companion's foreground budget
  and --turn-timeout-ms actually take effect.
- companion: set a foreground turn budget below the caller's ceiling and add a
  --turn-timeout-ms flag; background jobs keep the full default.

Verified: node --check on both files; --turn-timeout-ms 3000 now aborts a long
turn at ~3s with a structured error; a normal turn still completes cleanly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@russjhammond russjhammond requested a review from a team June 16, 2026 01:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5415c43684

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread plugins/codex/scripts/lib/codex.mjs Outdated
Comment on lines +590 to +592
state.rejectCompletion(
client.exitError ?? new Error("codex app-server exited before the turn completed.")
);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid rejecting an unobserved completion promise

When startRequest() rejects before the code reaches the Promise.race below (for example, a broker-busy error from turn/start/review/start, or the app-server exiting while that request is pending), withAppServer closes the client and exitPromise resolves, so these lines reject state.completion even though no handler has been attached to that promise. On supported Node versions this becomes an unhandled rejection that can terminate the CLI or preempt the intended direct retry/original error reporting; defer this rejection until the completion promise is being raced, or explicitly handle it on early failures.

Useful? React with 👍 / 👎.

Comment on lines +793 to +797
const explicitTurnTimeout = Number(options["turn-timeout-ms"]);
if (Number.isFinite(explicitTurnTimeout) && explicitTurnTimeout > 0) {
process.env.CODEX_TURN_TIMEOUT_MS = String(explicitTurnTimeout);
} else if (!process.env.CODEX_TURN_TIMEOUT_MS) {
process.env.CODEX_TURN_TIMEOUT_MS = String(FOREGROUND_TURN_TIMEOUT_MS);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Apply the foreground turn budget to reviews too

This default foreground budget is only installed in handleTask, but handleReviewCommand still runs /codex:review --wait and /codex:adversarial-review --wait through the same foreground runForegroundCommand path without setting CODEX_TURN_TIMEOUT_MS. Those review turns therefore fall back to the 600000ms default in codex.mjs, so under Claude Code's ~120000ms Bash ceiling they can still be SIGKILLed with an empty result instead of returning the structured timeout this change is meant to provide.

Useful? React with 👍 / 👎.

Two P2 fixes from the automated review:

1. codex.mjs — register the client.exitPromise -> rejectCompletion handler
   AFTER startRequest() resolves, not before. If startRequest() rejects (e.g.
   broker-busy from turn/start, or the app-server exiting while the request is
   pending), the error now propagates directly and state.completion is never
   observed. Previously the early-registered handler could reject an unobserved
   state.completion and surface as an unhandled rejection that can terminate the
   CLI or preempt the intended error reporting. The mid-turn-death coverage is
   unchanged (the handler is in place before the completion is awaited).

2. codex-companion.mjs — apply the foreground turn budget to review turns too.
   Extracted applyForegroundTurnBudget(options) and call it from both handleTask
   and handleReviewCommand. /codex:review --wait and adversarial-review --wait
   run foreground through runForegroundCommand but previously never set
   CODEX_TURN_TIMEOUT_MS, so they fell back to the 600000ms default and could be
   SIGKILLed at the host Bash ceiling with an empty result. Also accept
   --turn-timeout-ms on the review command. Guarded with !options.background so a
   detached worker still inherits the full default budget.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@russjhammond

Copy link
Copy Markdown
Author

Thanks for the review — both P2s addressed in 688192a:

  1. Unobserved completion rejection — moved the client.exitPromise → rejectCompletion wiring to after startRequest() resolves. An early startRequest() rejection (broker-busy, app-server exit while the request is pending) now propagates directly and state.completion is never observed, so it can't surface as an unhandled rejection. Mid-turn-death coverage is unchanged (handler is in place before the completion is awaited).

  2. Review turns — extracted applyForegroundTurnBudget(options) and now call it from both handleTask and handleReviewCommand, so --wait review / adversarial-review turns get the foreground budget instead of falling back to 600000ms. Also accept --turn-timeout-ms on the review command; guarded with !options.background so detached workers keep the full default.

Verified: node --check on both files; --turn-timeout-ms 3000 aborts a long turn at ~3s with the structured error; a normal turn still completes (exit 0).

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 688192a624

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +648 to +650
deadlineTimer = setTimeout(() => {
reject(new Error(`codex turn exceeded the ${turnTimeoutMs}ms turn budget.`));
}, turnTimeoutMs);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Render timeout failures to stdout

When the new deadline fires, this rejects out of captureTurn; runTrackedJob then takes its catch path and main().catch prints only the error message to stderr, so runForegroundCommand never writes a rendered result to stdout. In the foreground slash-command paths that return the command stdout verbatim, a timed-out turn still yields empty stdout instead of the structured timeout/partial output this budget is meant to surface; return a failed turn state/rendered result or render the timeout from the catch path.

Useful? React with 👍 / 👎.

@russjhammond

Copy link
Copy Markdown
Author

Fair point — a scoping note on this PR vs. a follow-up.

The primary goal here is eliminating the unbounded hang: a stalled or dead app-server previously blocked on await state.completion until the host's external timeout SIGKILLed the process with no signal. After this change a timed-out turn rejects at the budget and exits non-zero with codex turn exceeded the <n>ms turn budget on stderr — already a strict improvement over a silent kill.

Rendering the timeout (plus any partial turn output) to stdout as a structured result is a reasonable next step, but it touches the foreground render path (runForegroundCommand / result rendering) and the right shape depends on this repo's output conventions, which you'd know better than a drive-by patch would. Happy to extend this PR (or open a follow-up) to render a structured timeout result if that's the direction you'd prefer — I just didn't want to guess at the rendering contract here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant