feat(scheduler): force-resume gate for failed jobs #138

Open: truffle-dev wants to merge 1 commit into ghostwright:main from truffle-dev:scheduler-revive-failed-jobs

@truffle-dev (Contributor)

Closes #128.

Rate-limit storms (and any sustained error) drive a scheduled job to status=failed with next_run_at=NULL once MAX_CONSECUTIVE_ERRORS = 10 trips (src/scheduler/executor.ts:8,67-69). The current resumeJob (src/scheduler/service.ts) refuses to touch the row, so recovery means a direct SQLite UPDATE on the live DB. It has happened twice now (2026-04-30, 2026-05-05).
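For readers unfamiliar with the breaker, the failure path described above can be sketched roughly as follows. This is an illustration, not the code in executor.ts: the JobRow shape, the recordRunOutcome name, the reset-on-success rule, and the choice of `>=` for the threshold comparison are all assumptions; only MAX_CONSECUTIVE_ERRORS = 10, status=failed, and next_run_at=NULL come from the description.

```typescript
// Sketch only: JobRow, recordRunOutcome, and the reset-on-success rule are
// assumptions for illustration, not the executor's actual code.
const MAX_CONSECUTIVE_ERRORS = 10;

type JobStatus = "active" | "paused" | "failed" | "completed";

interface JobRow {
  status: JobStatus;
  consecutive_errors: number;
  next_run_at: string | null;
}

// Apply the outcome of one run to a job row and return the updated row.
function recordRunOutcome(job: JobRow, ok: boolean, nextRunAt: string): JobRow {
  if (ok) {
    // Assumed: a successful run clears the error streak.
    return { ...job, consecutive_errors: 0, next_run_at: nextRunAt };
  }
  const errors = job.consecutive_errors + 1;
  if (errors >= MAX_CONSECUTIVE_ERRORS) {
    // Breaker trips: the job is parked with no future run scheduled.
    return { status: "failed", consecutive_errors: errors, next_run_at: null };
  }
  // Below the threshold the job stays active and is rescheduled.
  return { ...job, consecutive_errors: errors, next_run_at: nextRunAt };
}
```

Once a row reaches this state nothing reschedules it, which is why a paused-only resumeJob leaves the operator with no API-level recovery.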

Change

Scheduler.resumeJob(id, opts?: { force?: boolean }) now allows one extra transition:

  • paused → active: always (unchanged).
  • failed → active: requires opts.force === true.
  • completed → active: never, even with force. A one-shot may already have deleted itself inline (executor.ts delete_after_run path), so re-activating it could target a row that no longer exists.
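The three rules above reduce to a small predicate. The sketch below is illustrative: canResume is a hypothetical helper name, and treating an already-active job as "nothing to resume" is inferred from the renamed no-op test rather than stated in the list.

```typescript
type JobStatus = "active" | "paused" | "failed" | "completed";

// Hypothetical helper: returns true when resumeJob would move the job to active.
function canResume(status: JobStatus, opts?: { force?: boolean }): boolean {
  switch (status) {
    case "paused":
      return true; // always allowed, unchanged behavior
    case "failed":
      return opts?.force === true; // strict check: truthy values like 1 or "yes" do not count
    case "completed":
      return false; // never, even with force
    case "active":
      return false; // already active, so resume is a no-op
  }
}
```

The strict `=== true` comparison matches the wording of the failed → active rule, so a loosely typed caller cannot trip the gate by accident.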

The HTTP path mirrors the gate:

  • POST /ui/api/scheduler/:id/resume with no body still revives a paused job.
  • On a failed job without force, it returns 409 with a message that names the force opt.
  • POST .../resume {"force": true} revives and audits the transition.
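A rough sketch of how a handler might map these cases to responses is shown below. It is not the handler in this PR: parseForce, replyFor, the 200 success status, the error message wording, and the decision to treat a malformed body as "no force" are all assumptions. Only the 409 on an unforced failed job and the tolerance for an empty body come from the list above.

```typescript
// Hypothetical helper: read the force flag from a raw request body.
// An empty body is valid and means "no force". Treating malformed JSON the
// same way is an assumption; a real handler might answer 400 instead.
function parseForce(rawBody: string): boolean {
  if (rawBody.trim() === "") return false;
  try {
    const parsed: unknown = JSON.parse(rawBody);
    return (
      typeof parsed === "object" &&
      parsed !== null &&
      (parsed as { force?: unknown }).force === true
    );
  } catch {
    return false;
  }
}

interface ResumeReply {
  status: number;
  error?: string;
}

// Hypothetical helper covering only the two job states the list above describes.
function replyFor(jobStatus: "paused" | "failed", force: boolean): ResumeReply {
  if (jobStatus === "failed" && !force) {
    // The message names the opt so the operator knows how to proceed.
    return { status: 409, error: 'Job is failed; resend with {"force": true} to revive it.' };
  }
  return { status: 200 }; // success code assumed
}
```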

Both paths recompute next_run_at from the stored schedule and reset consecutive_errors to 0 so the revived job gets a clean retry budget.
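The revival step itself might look like the sketch below. The stored schedule format is not described here, so a fixed interval in milliseconds (schedule_every_ms) stands in for it; that field, the JobRow shape, and the reviveJob name are hypothetical. The sketch assumes the transition gate has already allowed the resume.

```typescript
// Sketch only: schedule_every_ms is a stand-in for whatever schedule the
// real job row stores; reviveJob and JobRow are hypothetical names.
type JobStatus = "active" | "paused" | "failed" | "completed";

interface JobRow {
  status: JobStatus;
  schedule_every_ms: number;
  consecutive_errors: number;
  next_run_at: string | null;
}

// Move an already-approved job back to active with a clean retry budget.
function reviveJob(job: JobRow, now: Date): JobRow {
  const next = new Date(now.getTime() + job.schedule_every_ms);
  return {
    ...job,
    status: "active",
    consecutive_errors: 0, // clean retry budget
    next_run_at: next.toISOString(), // recomputed from the schedule, not restored
  };
}
```

Recomputing next_run_at from the schedule, rather than restoring an old value, matters because a failed row holds NULL there and any earlier timestamp would already be in the past.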

Why opt-in

Failures the executor marks terminal are usually transient (model-provider rate limits, a brief Slack outage), but the executor cannot tell transient from broken. Passing force is the operator's explicit statement that they have judged the underlying cause cleared. Without force the path stays a no-op, so an accidental resume call cannot bypass the circuit-breaker.

Tests

src/scheduler/__tests__/service.test.ts

  • resumeJob without force is a no-op on a non-paused job (active, failed, completed) (renamed)
  • resumeJob({force:true}) revives a failed job and resets the error counter
  • resumeJob({force:true}) still refuses to revive a completed job
  • resumeJob({force:true}) on a paused job behaves like the unforced path

src/ui/api/__tests__/scheduler.test.ts

  • POST /:id/resume on a failed job without force returns 409
  • POST /:id/resume on a failed job with {force:true} revives it (also asserts the audit row)
  • POST /:id/resume tolerates an empty body for the paused → active path

68 / 68 tests pass across the two suites; bun run typecheck clean; bun run lint clean.

Scope

Service-layer method, UI HTTP handler, and tests. The phantom_schedule MCP tool doesn't expose pause/resume actions today, so I did not add resume there; that's a separate concern and a different surface to design. The recovery playbook in agent-notes for direct-UPDATE remains valid; this just gives operators a non-destructive path.

Rate-limit storms (and any sustained error) drive a scheduled job to
status=failed with next_run_at=NULL once MAX_CONSECUTIVE_ERRORS = 10
trips. resumeJob refused to touch the row, so recovery meant a direct
SQLite UPDATE on the live DB.

Add an opt-in revival path:

  scheduler.resumeJob(id, { force: true })

  POST /ui/api/scheduler/:id/resume { "force": true }

The HTTP path returns 409 with a force-prompt message when the caller
omits force on a failed job, so the circuit-breaker still defaults to
no-op. paused → active stays unchanged; completed → active stays
forbidden even with force (one-shots may have already self-deleted).

Closes ghostwright#128
Linked issue (may close on merge): scheduler: no public API recovers failed jobs; resumeJob handles only paused