fix: reconcile orphaned jobs whose launcher process has died #379

Open
mroldrobot wants to merge 1 commit into openai:main from mroldrobot:fix/reconcile-orphaned-jobs

@mroldrobot

Problem

A tracked job's completion is written by the launcher process that owns it
(tracked-jobs.mjs::runTrackedJob): it marks the job running, awaits the
turn, then writes completed/failed. If that launcher dies before the turn
finishes — a cancelled background job, an ended Claude Code session, a crash, or
the machine sleeping — the completion write never runs and the job is frozen at
running/queued with a now-dead pid.
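
The failure mode can be illustrated with a minimal sketch of that lifecycle. This is not the code in tracked-jobs.mjs: `runTrackedJobSketch`, `runTurn`, and `writeJob` are hypothetical names, and recording `process.pid` at the running step is an assumption.

```javascript
// Hypothetical stand-in for the launcher lifecycle described above.
async function runTrackedJobSketch(job, runTurn, writeJob) {
  // Step 1: mark the job running and record the launcher's pid (assumed).
  writeJob({ ...job, status: "running", pid: process.pid });
  try {
    // Step 2: await the turn. If the launcher dies here, nothing below runs.
    const result = await runTurn();
    // Step 3a: terminal write on success.
    writeJob({ ...job, status: "completed", pid: null, result });
  } catch (error) {
    // Step 3b: terminal write on failure.
    writeJob({ ...job, status: "failed", pid: null, errorMessage: String(error) });
  }
}
```

With a turn that never settles, the only record ever written is the running one, which is the state a dead launcher leaves behind.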

Nothing reconciles this: buildStatusSnapshot / buildSingleJobSnapshot filter
purely on the stored status string, and the only ESRCH handling in the
codebase is in the cancel path. So /codex:status reports the job running
forever and /codex:result refuses with "still running". In practice these
orphans accumulate (I had several stuck for days).

Fix

listJobs() — which every status/result/cancel path flows through — now probes
the recorded launcher pid with process.kill(pid, 0):

  • ESRCH (process gone) → reconcile the job to failed, clear the pid, set
    completedAt and an explanatory errorMessage. The correction is persisted to
    both the state index and the per-job file.
  • EPERM / unknown → treat as alive (fail safe).
  • Jobs without a pid, or already in a terminal state, are left untouched.

A dead background turn now surfaces as failed (and is retryable) instead of
staying in running indefinitely.
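
As a rough sketch of the rules above (not the actual diff): `isLauncherAlive`, `reconcileJob`, the injectable `kill` parameter, and the error message text are hypothetical. Only `process.kill(pid, 0)`, the ESRCH/EPERM handling, and the field names come from this description. Persisting the result to the state index and the per-job file is left out.

```javascript
// Statuses that are still eligible for reconciliation.
const ACTIVE_STATUSES = new Set(["running", "queued"]);

// Returns false only when the OS reports that no such process exists.
// `kill` is injectable so the probe can be exercised without real processes.
function isLauncherAlive(pid, kill = process.kill.bind(process)) {
  try {
    kill(pid, 0); // signal 0 checks for existence and delivers nothing
    return true;
  } catch (error) {
    // ESRCH: process gone. EPERM or anything unknown: treat as alive (fail safe).
    return error.code !== "ESRCH";
  }
}

// Returns the job unchanged, or a copy reconciled to "failed".
function reconcileJob(job, kill, now = () => new Date().toISOString()) {
  if (!ACTIVE_STATUSES.has(job.status)) return job; // terminal: untouched
  if (!Number.isInteger(job.pid)) return job; // no pid: untouched
  if (isLauncherAlive(job.pid, kill)) return job; // launcher still running
  return {
    ...job,
    status: "failed",
    pid: null,
    completedAt: now(),
    errorMessage: "Launcher process exited before the job finished." // hypothetical wording
  };
}
```

Treating every error other than ESRCH as "alive" means a permissions problem can leave a job stuck, but it never turns a live job into a failure.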

Tests

Adds coverage in tests/state.test.mjs: dead-pid running/queued reconciliation,
persistence across reads, and live-pid / pidless / finished jobs left untouched.
node --test tests/state.test.mjs → 7/7 green.

Note: 3 unrelated tests (#63/#65/#67) fail under the aggregate npm test due to
pre-existing cross-file state leakage; they pass in isolation and fail
identically on main without this change.

A tracked job's completion is written by the launcher process that owns
it (tracked-jobs.mjs::runTrackedJob). If that launcher dies before the
turn finishes (cancelled background job, ended session, crash, sleep),
the job is frozen at "running"/"queued" with a stale pid forever:
/status reports it running indefinitely and /result refuses with
"still running".

listJobs() now probes the recorded pid via process.kill(pid, 0). Jobs
whose process is gone (ESRCH) are reconciled to "failed" and the
correction is persisted to both the state index and the per-job file.
EPERM/unknown errors are treated as alive; jobs without a pid or already
in a terminal state are left untouched.

Adds tests for dead-pid running/queued reconciliation, persistence
across reads, and live-pid / pidless / finished jobs left untouched.

mroldrobot requested a review from a team June 17, 2026 13:58

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cc61122b1b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +208 to +209
status: job.status,
phase: job.phase,

P2: Preserve completed job files when reconciling

If a launcher dies after runTrackedJob has written the per-job completion file but before its following index upsert, the index still says running with a dead pid while the job file already says completed. This block unconditionally spreads stored and then overwrites its status/phase to failed, so the next /codex:status or /codex:result corrupts that completed job into an orphan failure. Check the stored job's terminal status, or avoid overwriting it, before persisting reconciliation.

Useful? React with 👍 / 👎.
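
One way to apply the suggested guard, as a sketch only: `TERMINAL_STATUSES` and `mergeReconciledJob` are hypothetical names, and the set of terminal statuses is an assumption (only completed and failed are named in this PR). The idea is to check the per-job file before overwriting its status/phase.

```javascript
// Assumed terminal statuses; the real codebase may define more.
const TERMINAL_STATUSES = new Set(["completed", "failed"]);

// reconciled: the index entry after orphan reconciliation (status "failed").
// stored: the per-job file contents, or null if there is none.
// Returns the record that should be written back to the per-job file.
function mergeReconciledJob(reconciled, stored) {
  if (stored && TERMINAL_STATUSES.has(stored.status)) {
    // The launcher finished its job-file write before dying: keep that outcome.
    return stored;
  }
  // No terminal record on disk: apply the orphan failure to status/phase.
  return { ...stored, status: reconciled.status, phase: reconciled.phase };
}
```

In the first branch the index entry would then be repaired from the stored job rather than the other way around.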
