fix: idle self-shutdown for the app-server broker to stop stale brokers accumulating#400
fix: idle self-shutdown for the app-server broker to stop stale brokers accumulating#400storyarcade wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 312557d3a2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| idleTimer = setTimeout(() => { | ||
| // Re-check under the timer: only exit if still idle (no client reconnected meanwhile). | ||
| if (sockets.size === 0) { | ||
| shutdown(serverRef).then(() => process.exit(0)); |
There was a problem hiding this comment.
Clear the broker session before idle exit
When the idle timer shuts the broker down, shutdown() removes the socket and pid file but leaves the saved broker.json for this cwd. After the idle window elapses, callers that intentionally use reuseExistingBroker: true (for example setup/status auth checks) load that stale endpoint without re-probing or spawning, so codex setup --json reports ready: false with connect ENOENT ...broker.sock while still showing a shared session until some other command cleans the state. The idle self-exit path should clear the persisted broker session before exiting.
Useful? React with 👍 / 👎.
…rs accumulating The shared app-server broker is spawned detached+unref'd and reused per-cwd via endpoint-readiness. When a session ends/crashes/replaces its host, the broker keeps its codex app-server child alive with no client connected, and is never torn down because ensureBrokerSession only reaps the current cwd's stale entry. Idle brokers from other cwds accumulate. Add an idle self-shutdown: when no client has been connected for an idle window (default 30m, env CODEX_BROKER_IDLE_SHUTDOWN_MS, <=0 disables), call the existing shutdown(). The timer clears on any connect and only arms when sockets.size===0, so it cannot race a live broker (a turn holds its socket open; the broker is only clientless between turns and is respawned via endpoint-readiness). Also stop accepting connections before the async appClient.close() in shutdown() to close a drop-mid-handshake window. Adds tests/broker-idle-shutdown.test.mjs (fake-codex fixture, no login needed).
312557d to
46fc561
Compare
Addresses the review note: when the idle timer shuts the broker down, it removed the socket+pidfile but left broker.json for the cwd. A later reuseExistingBroker:true caller (e.g. `codex setup --json`) would then load that stale endpoint without re-probing and report ready:false / connect ENOENT until other cleanup ran. Clear the persisted session (clearBrokerSession(cwd)) in the autonomous idle-exit path before shutdown. The SIGTERM/RPC shutdown paths are caller-driven and already clear state, so only the self-exit path needs this. Adds a regression test asserting broker.json is gone after idle exit.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad54711506
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // left broker.json pointing at the now-removed socket, a later `reuseExistingBroker: true` | ||
| // caller (e.g. `codex setup --json`) would load the stale endpoint without re-probing and | ||
| // report ready:false / connect ENOENT until something else cleans state. | ||
| try { clearBrokerSession(cwd); } catch { /* best-effort; teardown still proceeds */ } |
There was a problem hiding this comment.
Only clear the session this broker still owns
If an older broker for the same cwd survives after its state has been torn down and replaced, this unconditional delete can remove the newer broker's broker.json; I checked ensureBrokerSession and its stale-entry cleanup unlinks the old endpoint but passes killProcess: options.killProcess ?? null, so by default it can leave a live old process behind. When that old process later idles out, reuseExistingBroker paths such as setup/status/interrupt stop discovering the active shared broker, and an interrupt during a later turn can go to a fresh direct app-server instead. Check that the loaded session still matches this broker's endpoint/pid before deleting it.
Useful? React with 👍 / 👎.
Problem
The shared Codex app-server broker (
app-server-broker.mjs) is spawneddetached+unref'd and reused per-cwd via endpoint-readiness (broker-lifecycle.mjsensureBrokerSession). When a session ends, crashes, or its host process is replaced, the broker keeps itscodex app-serverchild (and everything that child fans out) alive with no client connected.It is never torn down, because
ensureBrokerSessiononly tears down the current cwd's stale entry — it never reaps brokers left behind by sessions in other cwds. Across many sessions in distinct working directories, idle brokers accumulate monotonically (observed: a dozen idle brokers, oldest ~15 days, each pinning an app-server subtree).Fix
Give the broker an idle self-shutdown. It already tracks connected
socketsand already has a fullshutdown(); this wires a timer that calls the existingshutdown()when no client has been connected for an idle window.CODEX_BROKER_IDLE_SHUTDOWN_MS,<= 0disables.sockets.size === 0, so it can never race a live broker: a long turn holds its socket open for the whole turn, and the broker only sits clientless between turns (reused by endpoint-readiness, not a held socket). After idle-exit,ensureBrokerSessionre-probes the now-dead endpoint and respawns cleanly.shutdown()to stop accepting connections before the asyncappClient.close(), closing a small window where a connect during teardown could be dropped mid-handshake.Tests
tests/broker-idle-shutdown.test.mjsuses the existingfake-codexfixture (runs in CI without a real login): self-exit when never connected, connected-client cancels the timer, re-arm-and-exit after disconnect, disabled at<= 0, and unset-env falls back to the long default.node --testpasses (a few pre-existing unrelated flaky tests aside).