fix: reconcile orphaned jobs whose launcher process has died #379
mroldrobot wants to merge 1 commit into
A tracked job's completion is written by the launcher process that owns it (tracked-jobs.mjs::runTrackedJob). If that launcher dies before the turn finishes (cancelled background job, ended session, crash, sleep), the job is frozen at "running"/"queued" with a stale pid forever: /status reports it running indefinitely and /result refuses with "still running". listJobs() now probes the recorded pid via process.kill(pid, 0). Jobs whose process is gone (ESRCH) are reconciled to "failed" and the correction is persisted to both the state index and the per-job file. EPERM/unknown errors are treated as alive; jobs without a pid or already in a terminal state are left untouched. Adds tests for dead-pid running/queued reconciliation, persistence across reads, and live-pid / pidless / finished jobs left untouched.
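The liveness probe described above can be sketched roughly as follows. This is a minimal illustration rather than the code in this commit: the helper name `isLauncherGone` is hypothetical, and the only thing taken from the description is the use of `process.kill(pid, 0)` and the `ESRCH` / `EPERM` handling.

```javascript
// Hypothetical helper (name and shape are assumptions, not from the PR).
// Returns true only when the OS positively reports that the pid no longer exists.
function isLauncherGone(pid) {
  // Jobs without a usable pid cannot be probed, so they are never reported as gone.
  if (!Number.isInteger(pid) || pid <= 0) {
    return false;
  }
  try {
    // Signal 0 delivers nothing; it only checks that the process exists
    // and that we are allowed to signal it.
    process.kill(pid, 0);
    return false; // process exists
  } catch (error) {
    // ESRCH: no such process. Anything else (EPERM, unknown) is treated
    // as "still alive" so a job is never failed on ambiguous evidence.
    return error.code === "ESRCH";
  }
}
```

Treating every error other than `ESRCH` as alive keeps the check conservative: the worst case is that an orphan stays `running` a little longer, not that a live job is marked `failed`.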
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: cc61122b1b
status: job.status,
phase: job.phase,
Preserve completed job files when reconciling
If a launcher dies after runTrackedJob has written the per-job completion file but before its following index upsert, the index still says running with a dead pid while the job file already says completed. This block unconditionally spreads stored and then overwrites its status/phase to failed, so the next /codex:status or /codex:result corrupts that completed job into an orphan failure. Check the stored job's terminal status, or avoid overwriting it, before persisting reconciliation.
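One possible shape for that guard, as a sketch only: the function name `reconcileOrphan`, the `TERMINAL_STATUSES` set, and the error message text are hypothetical, and the real code may recognise more terminal statuses than the two named in this PR.

```javascript
// Assumption: only the terminal statuses mentioned in this PR are listed here;
// the real codebase may have others that belong in this set.
const TERMINAL_STATUSES = new Set(["completed", "failed"]);

// indexJob:  the entry from the state index (still "running"/"queued", dead pid)
// storedJob: the contents of the per-job file, or null if it could not be read
function reconcileOrphan(indexJob, storedJob, now = new Date().toISOString()) {
  // The launcher may have written the per-job file and died before the index
  // upsert. In that case the file is the source of truth: keep its outcome
  // and only clear the stale pid.
  if (storedJob && TERMINAL_STATUSES.has(storedJob.status)) {
    return { ...indexJob, ...storedJob, pid: null };
  }
  // Otherwise the job really was orphaned mid-turn: mark it failed.
  return {
    ...storedJob,
    ...indexJob,
    status: "failed",
    phase: "failed",
    pid: null,
    completedAt: now,
    errorMessage: "Launcher process exited before the job finished."
  };
}
```

With this ordering, a job whose file already says `completed` is repaired in the index rather than downgraded to `failed`.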
Problem
A tracked job's completion is written by the launcher process that owns it (`tracked-jobs.mjs::runTrackedJob`): it marks the job `running`, awaits the turn, then writes `completed`/`failed`. If that launcher dies before the turn finishes (a cancelled background job, an ended Claude Code session, a crash, or the machine sleeping), the completion write never runs and the job is frozen at `running`/`queued` with a now-dead pid.

Nothing reconciles this: `buildStatusSnapshot`/`buildSingleJobSnapshot` filter purely on the stored `status` string, and the only `ESRCH` handling in the codebase is in the cancel path. So `/codex:status` reports the job `running` forever and `/codex:result` refuses with "still running". In practice these orphans accumulate (I had several stuck for days).
Fix
`listJobs()`, which every status/result/cancel path flows through, now probes the recorded launcher pid with `process.kill(pid, 0)`:

- `ESRCH` (process gone) → reconcile the job to `failed`, clear the pid, set `completedAt` and an explanatory `errorMessage`. The correction is persisted to both the state index and the per-job file.
- `EPERM` / unknown → treat as alive (fail safe).

A dead background turn now surfaces as `failed` (and is retryable) instead of hanging `running` indefinitely.

Tests
Adds coverage in `tests/state.test.mjs`: dead-pid running/queued reconciliation, persistence across reads, and live-pid / pidless / finished jobs left untouched.

`node --test tests/state.test.mjs` → 7/7 green.

Note: 3 unrelated tests (#63/#65/#67) fail under the aggregate `npm test` due to pre-existing cross-file state leakage; they pass in isolation and fail identically on `main` without this change.