feat(scheduler): force-resume gate for failed jobs by truffle-dev · Pull Request #138 · ghostwright/phantom

truffle-dev · 2026-05-20T07:18:42Z

Closes #128.

Rate-limit storms (or any sustained error) drive a scheduled job to status=failed with next_run_at=NULL once the MAX_CONSECUTIVE_ERRORS = 10 breaker trips (src/scheduler/executor.ts:8,67-69). The current resumeJob (src/scheduler/service.ts) refuses to touch a failed row, so recovery means a direct SQLite UPDATE on the live DB. That manual fix has been needed twice so far (2026-04-30, 2026-05-05).
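
For readers who have not looked at the executor, the breaker described above can be sketched roughly as follows. This is an illustration only: the JobRow shape and recordFailure helper are hypothetical, the field names are taken from the text of this PR rather than from the real schema, and the `>=` comparison is an assumption about how the threshold is checked.

```typescript
// Hypothetical row shape; field names follow the PR text, not the real schema.
type JobStatus = "active" | "paused" | "failed" | "completed";

interface JobRow {
  status: JobStatus;
  consecutive_errors: number;
  next_run_at: string | null;
}

const MAX_CONSECUTIVE_ERRORS = 10; // value quoted in the PR description

// Record one failed run. Once the counter reaches the threshold the job is
// marked failed and loses its next_run_at, so the scheduler never picks it up.
// (Assumption: the real executor may compare with > instead of >=.)
function recordFailure(job: JobRow): JobRow {
  const consecutive_errors = job.consecutive_errors + 1;
  if (consecutive_errors >= MAX_CONSECUTIVE_ERRORS) {
    return { status: "failed", consecutive_errors, next_run_at: null };
  }
  return { ...job, consecutive_errors };
}
```

With next_run_at cleared, nothing short of an explicit revival brings the job back, which is the gap this PR addresses.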

Change

Scheduler.resumeJob(id, opts?: { force?: boolean }) now allows one extra transition:

paused → active: always (unchanged).
failed → active: requires opts.force === true.
completed → active: never, even with force. A one-shot may already have deleted itself inline (executor.ts delete_after_run path); re-activating is a sharp edge.
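
The three rules above reduce to a small predicate. The sketch below is not the code in service.ts; canResume and ResumeOptions are hypothetical names used only to make the transition table concrete.

```typescript
type JobStatus = "active" | "paused" | "failed" | "completed";

interface ResumeOptions {
  force?: boolean;
}

// Returns true when a resume call may move the job to "active".
function canResume(status: JobStatus, opts: ResumeOptions = {}): boolean {
  if (status === "paused") return true; // always allowed, unchanged behaviour
  if (status === "failed") return opts.force === true; // explicit opt-in only
  return false; // "completed" never revives; "active" has nothing to resume
}
```

Checking `opts.force === true` rather than truthiness means a stray value such as the string "true" does not count as an opt-in.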

The HTTP path mirrors the gate:

POST /ui/api/scheduler/:id/resume with no body still revives a paused job.
On a failed job without force, it returns 409 with a message that names the force option.
POST .../resume {"force": true} revives and audits the transition.
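
A rough sketch of how a handler could apply this gate to a failed job, including the empty-body tolerance. The helper names, the outcome type, and the message wording are all hypothetical, and treating malformed JSON as "no force" is an assumption rather than something this PR specifies.

```typescript
// Read the optional JSON body. An empty body means "no force".
// (Assumption: malformed JSON is also treated as "no force".)
function parseForce(rawBody: string): boolean {
  if (rawBody.trim() === "") return false;
  try {
    const parsed: unknown = JSON.parse(rawBody);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { force?: unknown }).force === true
    );
  } catch {
    return false;
  }
}

type ResumeOutcome =
  | { kind: "revive" }
  | { kind: "conflict"; httpStatus: 409; message: string };

// Decide what to do with a resume request that targets a failed job.
function gateFailedJob(rawBody: string): ResumeOutcome {
  if (parseForce(rawBody)) return { kind: "revive" };
  return {
    kind: "conflict",
    httpStatus: 409,
    // Hypothetical wording; the real message only needs to name the option.
    message: 'Job is failed; send {"force": true} to resume it.',
  };
}
```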

Both paths recompute next_run_at from the stored schedule and reset consecutive_errors to 0 so the revived job gets a clean retry budget.
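
As an illustration of what revival resets, here is a minimal sketch. It assumes a hypothetical fixed-interval schedule stored as every_ms; the real stored schedule format is not described in this PR, and revive is an invented helper name.

```typescript
interface JobRow {
  status: "active" | "paused" | "failed" | "completed";
  consecutive_errors: number;
  next_run_at: string | null;
  every_ms: number; // hypothetical: fixed interval standing in for the stored schedule
}

// Return the row as it should look after a successful resume.
function revive(job: JobRow, now: Date): JobRow {
  return {
    ...job,
    status: "active",
    consecutive_errors: 0, // clean retry budget
    // Recomputed from the schedule, not restored from the old value.
    next_run_at: new Date(now.getTime() + job.every_ms).toISOString(),
  };
}
```

Recomputing next_run_at relative to the current time avoids scheduling the revived job in the past, and zeroing the counter means a single new failure does not immediately trip the breaker again.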

Why opt-in

Failures the executor marks terminal are usually transient (model-provider rate limits, a brief Slack outage), but the executor cannot tell transient from broken. Passing force is the operator's explicit statement that they have judged the underlying cause cleared. Without force the path stays a no-op, so an accidental resume call cannot bypass the circuit-breaker.

Tests

src/scheduler/__tests__/service.test.ts

resumeJob without force is a no-op on a non-paused job (active, failed, completed) (renamed)
resumeJob({force:true}) revives a failed job and resets the error counter
resumeJob({force:true}) still refuses to revive a completed job
resumeJob({force:true}) on a paused job behaves like the unforced path

src/ui/api/__tests__/scheduler.test.ts

POST /:id/resume on a failed job without force returns 409
POST /:id/resume on a failed job with {force:true} revives it (also asserts the audit row)
POST /:id/resume tolerates an empty body for the paused → active path

68 / 68 tests pass in the two suites; bun run typecheck and bun run lint are both clean.

Scope

Service-layer method + UI HTTP handler + tests. The phantom_schedule MCP tool doesn't expose pause/resume actions today, so I did not add resume there; that is a separate concern and a different surface to design. The recovery playbook in agent-notes for direct-UPDATE remains valid; this just gives operators a non-destructive path.

Rate-limit storms (and any sustained error) drive a scheduled job to status=failed with next_run_at=NULL once MAX_CONSECUTIVE_ERRORS = 10 trips. resumeJob refused to touch the row, so recovery meant a direct SQLite UPDATE on the live DB.

Add an opt-in revival path:

scheduler.resumeJob(id, { force: true })
POST /ui/api/scheduler/:id/resume { "force": true }

The HTTP path returns 409 with a force-prompt message when the caller omits force on a failed job, so the circuit-breaker still defaults to no-op. paused → active stays unchanged; completed → active stays forbidden even with force (one-shots may have already self-deleted).

Closes ghostwright#128

truffle-dev wants to merge 1 commit into ghostwright:main from truffle-dev:scheduler-revive-failed-jobs
