fix(chat-recovery): bound Durable Object memory-limit (OOM) crash loops (#1825) #1826
Merged
…ps (#1825)

A chat-recovery turn whose Durable Object isolate exceeds its 128 MB memory limit could loop forever, re-running the (billable) turn on every platform alarm retry. The isolate streams a little content before the reset, which bumps the durable progress counter; on the next wake recovery reads that as forward progress and resets both progress-keyed bounds (maxAttempts, noProgressTimeoutMs), and because each crash lands inside the alarm-debounce window the attempt counter is pinned too. With maxRecoveryWork defaulting to Infinity, no instrument could ever seal the turn, so the model ran forever.

This lands a layered fix:

1. Finite maxRecoveryWork default (1000, was Infinity). The work meter is the one signal that keeps climbing across the loop, so a finite default seals a runaway with reason="work_budget_exceeded".
2. OOM-specific in-DO budget (chatRecovery.maxOomRetries, default 3). A memory reset re-OOMs on re-run (the turn's working set, not the platform, is the cause), so it is classified as a distinct deterministic failure rather than a deploy-style transient: it is NOT deferred and retried forever. Each crash bumps a durable per-incident oomAttempts counter; after a small number of tries it seals with reason="out_of_memory". Fast and attributable.
3. Alarm-boundary circuit breaker (Agent.alarm()) as the universal backstop for OOMs that bypass the in-DO budgets entirely - thrown before the budget code runs (boot-time state hydration), or whose own small writes also OOM under memory pressure. Left unhandled such an error propagates out of alarm() and the platform auto-retries forever. alarm() now intercepts ONLY Durable Object memory-limit resets at the outermost frame, where the heavy turn has unwound and GC has reclaimed its footprint, so the seal/purge writes can land where mid-turn ones OOMed. A durable strike counter (static maxAlarmMemoryLimitStrikes, default 3) tolerates a few resets - backing off the looping rows so the retry is not a hot loop - then seals the recovery (out_of_memory) and surgically purges ONLY the looping schedule rows, leaving unrelated scheduled tasks intact. Emits a new alarm:memory_limit_reset event. Everything except memory-limit resets re-throws exactly as before.

Supporting changes:

- Broaden + export isDurableObjectMemoryLimitReset(error): matches the shared "exceeded its memory limit" fragment so truncated/reworded surfacings observed in real #1825 logs still classify. Sibling to isDurableObjectCodeUpdateReset / isPlatformTransientError.
- _executeScheduleCallback now DEFERS (re-throws) memory-limit resets for one-shot rows instead of swallowing them after in-process retries, so the error reaches the alarm-boundary breaker; track the executing row id so the breaker can purge the exact looping row.
- think/ai-chat override _cf_recoveryAlarmCallbacks() and _cf_sealMemoryLimitedRecovery() to target their recovery continuation callbacks and terminalize active incidents (banner + onExhausted + seal).
- Remove the redundant result-path OOM handling in continueLastTurn: those turns are already terminalized, so it only risked wasteful reschedules and duplicate terminal signals.

Adds unit + integration coverage (predicate, listActiveChatRecoveryIncidents, alarm circuit breaker), an RFC follow-up section, docs, and changesets.

Co-authored-by: Cursor <cursoragent@cursor.com>
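The interaction between the bounds described in this commit message can be illustrated with a toy model. This is only a sketch: the Incident and Bounds shapes, the onWakeAfterOom and simulate helpers, the "10 units of work per run" figure, the maxAttempts value, and the exact comparison operators are all assumptions made for illustration, not the library's implementation. Only the option names, the seal reasons, and the defaults (1000 and 3) come from the text above.

```typescript
// Toy model of the recovery bounds. All types and helpers here are hypothetical.
type SealReason = "max_attempts" | "work_budget_exceeded" | "out_of_memory";

interface Incident {
  attempts: number; // progress-keyed: reset whenever progress is observed
  lastProgress: number; // durable progress counter seen on the previous wake
  work: number; // work meter: only ever increases
  oomAttempts: number; // per-incident OOM counter: never reset by progress
}

interface Bounds {
  maxAttempts: number;
  maxRecoveryWork: number;
  maxOomRetries: number;
}

// One wake after an OOM crash. `progress` is the durable progress counter,
// `workDone` is the work the crashed run performed before the reset.
function onWakeAfterOom(
  i: Incident,
  progress: number,
  workDone: number,
  b: Bounds
): SealReason | null {
  i.work += workDone;
  i.oomAttempts += 1;
  if (progress > i.lastProgress) {
    i.attempts = 0; // looks like forward progress, so the attempt cap restarts
    i.lastProgress = progress;
  } else {
    i.attempts += 1;
  }
  // Assumption: a budget is exceeded once the counter goes strictly past it.
  if (i.oomAttempts > b.maxOomRetries) return "out_of_memory";
  if (i.work > b.maxRecoveryWork) return "work_budget_exceeded";
  if (i.attempts >= b.maxAttempts) return "max_attempts";
  return null;
}

// Replays up to `maxWakes` crash/wake cycles and reports when (if ever) the turn seals.
function simulate(b: Bounds, maxWakes: number): { reason: SealReason | null; wakes: number } {
  const i: Incident = { attempts: 0, lastProgress: 0, work: 0, oomAttempts: 0 };
  for (let wake = 1; wake <= maxWakes; wake++) {
    // Each crashed run streams a little content first, so progress always advances.
    const reason = onWakeAfterOom(i, wake, 10, b);
    if (reason) return { reason, wakes: wake };
  }
  return { reason: null, wakes: maxWakes };
}
```

Under these toy numbers, simulate({ maxAttempts: 5, maxRecoveryWork: Infinity, maxOomRetries: Infinity }, 10000) never seals, because every wake looks like progress and nothing else is finite. Setting maxRecoveryWork to 1000 seals with work_budget_exceeded on wake 101, and additionally setting maxOomRetries to 3 seals with out_of_memory on wake 4.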
🦋 Changeset detected
Latest commit: c69cfd5
The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
The alarm-boundary memory-limit strike counter (maxAlarmMemoryLimitStrikes, #1825) is documented as counting CONSECUTIVE alarm OOM resets, but it was only ever deleted when the breaker sealed, never after a clean alarm, so it actually tracked LIFETIME resets. A Durable Object hitting rare, non-consecutive transient spikes (e.g. one a month) would eventually reach the strike budget and wrongly seal healthy recovery work.

alarm() now best-effort clears cf_agents:oom_alarm_strikes after a clean _cf_runAlarmBody() so strikes must be consecutive to seal. The clear reads first and only writes when a strike is recorded, so the common no-strike path costs no write.

Adds a regression test (strike recorded -> clean alarm resets to 0 -> next OOM starts at strike 1).

Co-authored-by: Cursor <cursoragent@cursor.com>
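The consecutive-strike behavior described in this commit can be sketched as follows. This is a simplified, synchronous illustration, not the actual Agent.alarm() code: FakeStorage, runAlarm, isMemoryLimitReset, and the outcome names are hypothetical stand-ins, and the rule "seal when the strike count reaches the budget" is an assumption. Only the storage key, the "exceeded its memory limit" fragment, and the default of 3 are taken from the text.

```typescript
// Sketch only: an in-memory stand-in for durable storage and a simplified breaker.
const STRIKES_KEY = "cf_agents:oom_alarm_strikes";
const MEMORY_LIMIT_FRAGMENT = "exceeded its memory limit";

// Hypothetical storage double that counts writes so the no-write path is observable.
class FakeStorage {
  private data = new Map<string, number>();
  writes = 0;
  get(key: string): number | undefined {
    return this.data.get(key);
  }
  put(key: string, value: number): void {
    this.data.set(key, value);
    this.writes++;
  }
  delete(key: string): void {
    this.data.delete(key);
    this.writes++;
  }
}

// Matches on the shared message fragment, so reworded surfacings still classify.
function isMemoryLimitReset(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return message.includes(MEMORY_LIMIT_FRAGMENT);
}

type AlarmOutcome = "clean" | "backed_off" | "sealed";

function runAlarm(storage: FakeStorage, body: () => void, maxStrikes = 3): AlarmOutcome {
  try {
    body();
  } catch (error) {
    if (!isMemoryLimitReset(error)) throw error; // everything else re-throws unchanged
    const strikes = (storage.get(STRIKES_KEY) ?? 0) + 1;
    if (strikes >= maxStrikes) {
      // Budget reached: real code would also seal the recovery and purge the looping rows.
      storage.delete(STRIKES_KEY);
      return "sealed";
    }
    storage.put(STRIKES_KEY, strikes); // tolerate this reset, remember the strike
    return "backed_off";
  }
  // Clean alarm: read first, and only write when a strike is actually recorded.
  if (storage.get(STRIKES_KEY) !== undefined) storage.delete(STRIKES_KEY);
  return "clean";
}
```

In this sketch a clean alarm on an object with no recorded strike performs zero writes, and a clean alarm after a single strike clears the counter, so the next memory-limit reset starts again at strike 1 rather than accumulating over the object's lifetime.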
Summary
Fixes #1825: a chat-recovery turn whose Durable Object isolate exceeds its 128 MB memory limit could loop forever, re-running the (billable) turn on every platform alarm retry.
Why it looped
A reset isolate has usually already streamed a little content, which bumps the durable progress counter. On the next wake, recovery reads that as forward progress and resets both progress-keyed bounds: the attempt cap (maxAttempts) and the no-progress window (noProgressTimeoutMs). Because each crash lands inside the alarm-debounce window, the attempt counter is pinned too. With maxRecoveryWork defaulting to Infinity, no instrument could ever seal the turn, so the model re-ran indefinitely.

This matches the customer's logs exactly: OOM during boot/hydration → "failed to read recovery incident during give-up" (the give-up read itself OOMing mid-turn) → "~4 min later" platform alarm retry → "error executing callback _chatRecoveryContinue after 3 attempts" → repeat. Their workaround was an override alarm() that caught "exceeded its memory limit" and called deleteAlarm(); this PR builds that behavior into the base class, but surgical, bounded, attributable, and observable.

The fix (layered)
1. Finite maxRecoveryWork default (1000, was Infinity). The work meter is the one signal that keeps climbing across the loop, so a finite default seals a runaway with reason="work_budget_exceeded". A normal interrupted turn never approaches it.
2. OOM-specific in-DO budget (chatRecovery.maxOomRetries, default 3). A memory reset re-OOMs on re-run (the turn's working set, not the platform, is the cause), so it's classified as a distinct deterministic failure rather than a deploy-style transient: it is not deferred and retried forever. Each crash bumps a durable per-incident oomAttempts counter; after a small number of tries it seals with reason="out_of_memory". Fast and attributable.
3. Alarm-boundary circuit breaker (Agent.alarm()): the universal backstop for OOMs that bypass the in-DO budgets entirely, either thrown before the budget code runs (boot-time state hydration) or whose own small writes also OOM under memory pressure. Left unhandled, such an error propagates out of alarm() and the platform auto-retries forever. alarm() now intercepts only Durable Object memory-limit resets at the outermost frame, where the heavy turn has unwound and GC has reclaimed its footprint, so the seal/purge writes can land where mid-turn ones OOMed. A durable strike counter (static maxAlarmMemoryLimitStrikes, default 3) tolerates a few resets (a transient spike may clear), backing off the looping rows so the retry isn't a hot loop, then seals the recovery and surgically purges only the looping schedule rows, leaving unrelated scheduled tasks intact. Emits a new alarm:memory_limit_reset observability event. Everything except memory-limit resets re-throws exactly as before.

Supporting changes
- isDurableObjectMemoryLimitReset(error): now matches the shared "exceeded its memory limit" fragment so truncated/reworded surfacings (observed in real #1825 logs) still classify. Sibling to isDurableObjectCodeUpdateReset / isPlatformTransientError.
- _executeScheduleCallback now defers (re-throws) memory-limit resets for one-shot rows instead of swallowing them after in-process retries, so the error reaches the alarm-boundary breaker. Tracks the executing row id so the breaker can purge the exact looping row.
- think/ai-chat override _cf_recoveryAlarmCallbacks() + _cf_sealMemoryLimitedRecovery() to target their recovery continuation callbacks and terminalize active incidents (banner + onExhausted + seal).
- Removed the redundant result-path OOM handling in continueLastTurn: those turns are already terminalized, so it only risked wasteful reschedules and duplicate terminal signals.

Configuration
- chatRecovery.maxRecoveryWork: 1000 (was Infinity)
- chatRecovery.maxOomRetries: 3
- maxAlarmMemoryLimitStrikes: 3

Test plan
- pnpm run check (sherif + exports + oxfmt + oxlint + typecheck, 113 projects): green
- agents / think / ai-chat test suites: green
- … listActiveChatRecoveryIncidents …
- … 1000/3/3) feel right for shipped behavior

Notes
- … agents / @cloudflare/think / @cloudflare/ai-chat.
- … alarm:memory_limit_reset + downstream recovery exhaustion) for the catchable alarm-loop class that was previously also silent; a natural follow-up for #1285 (Durable Object OOM kills produce zero observability signal) is a boot-time "interrupted run" breadcrumb detector.

Made with Cursor