fix(chat-recovery): bound Durable Object memory-limit (OOM) crash loops (#1825) by threepointone · Pull Request #1826 · cloudflare/agents

threepointone · 2026-06-28T10:12:51Z

Summary

Fixes #1825: a chat-recovery turn whose Durable Object isolate exceeds its 128 MB memory limit could loop forever, re-running the (billable) turn on every platform alarm retry.

Why it looped

A reset isolate has usually already streamed a little content, which bumps the durable progress counter. On the next wake, recovery reads that as forward progress and resets both progress-keyed bounds — the attempt cap (maxAttempts) and the no-progress window (noProgressTimeoutMs). Because each crash lands inside the alarm-debounce window, the attempt counter is pinned too. With maxRecoveryWork defaulting to Infinity, no instrument could ever seal the turn, so the model re-ran indefinitely.
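The loop described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `simulateCrashLoop`, `workPerWake`, and the `"attempts_exhausted"` reason are hypothetical names invented for illustration, and the real recovery code is not shown in this PR description. It only demonstrates why a progress-keyed attempt cap never fires while a monotonically climbing work meter does.

```typescript
// Hypothetical model of the recovery bounds; option names mirror the PR text,
// everything else is an assumption made for illustration.
interface Bounds {
  maxAttempts: number;
  maxRecoveryWork: number;
}

type Outcome =
  | { sealed: true; reason: string; wakes: number }
  | { sealed: false; wakes: number };

function simulateCrashLoop(
  bounds: Bounds,
  workPerWake: number,
  maxWakes: number
): Outcome {
  let attempts = 0;
  let work = 0;
  for (let wake = 1; wake <= maxWakes; wake++) {
    // The isolate streamed a little content before the OOM reset, so recovery
    // reads that as forward progress and resets the progress-keyed counter.
    attempts = 0;
    // The crash lands inside the alarm-debounce window, so the attempt
    // counter is not bumped either. Only the work meter keeps climbing.
    work += workPerWake;
    if (attempts >= bounds.maxAttempts) {
      return { sealed: true, reason: "attempts_exhausted", wakes: wake };
    }
    if (work >= bounds.maxRecoveryWork) {
      return { sealed: true, reason: "work_budget_exceeded", wakes: wake };
    }
  }
  return { sealed: false, wakes: maxWakes };
}

// Old default: nothing ever seals the turn.
const unbounded = simulateCrashLoop(
  { maxAttempts: 3, maxRecoveryWork: Infinity },
  10,
  10_000
);
// New default: the work meter eventually seals the runaway.
const bounded = simulateCrashLoop(
  { maxAttempts: 3, maxRecoveryWork: 1000 },
  10,
  10_000
);
```

In this model the attempt cap is unreachable by construction, which is the point: any bound that resets on "progress" is defeated by a crash that streams a few tokens first.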

This matches the customer's logs exactly: OOM during boot/hydration → "failed to read recovery incident during give-up" (the give-up read itself OOMing mid-turn) → "~4 min later" platform alarm retry → "error executing callback _chatRecoveryContinue after 3 attempts" → repeat. Their workaround was an override alarm() that caught "exceeded its memory limit" and called deleteAlarm(); this PR builds that behavior into the base class — but surgical, bounded, attributable, and observable.

The fix (layered)

1. Finite maxRecoveryWork default (1000, was Infinity). The work meter is the one signal that keeps climbing across the loop, so a finite default seals a runaway with reason="work_budget_exceeded". A normal interrupted turn never approaches it.
2. OOM-specific in-DO budget (chatRecovery.maxOomRetries, default 3). A memory reset re-OOMs on re-run (the turn's working set, not the platform, is the cause), so it's classified as a distinct deterministic failure rather than a deploy-style transient: it is not deferred and retried forever. Each crash bumps a durable per-incident oomAttempts counter; after a small number of tries it seals with reason="out_of_memory". Fast and attributable.
3. Alarm-boundary circuit breaker (Agent.alarm()): the universal backstop for OOMs that bypass the in-DO budgets entirely, either thrown before the budget code runs (boot-time state hydration) or whose own small writes also OOM under memory pressure. Left unhandled, such an error propagates out of alarm() and the platform auto-retries forever. alarm() now intercepts only Durable Object memory-limit resets at the outermost frame, where the heavy turn has unwound and GC has reclaimed its footprint, so the seal/purge writes can land where mid-turn ones OOMed. A durable strike counter (static maxAlarmMemoryLimitStrikes, default 3) tolerates a few resets (a transient spike may clear), backing off the looping rows so the retry isn't a hot loop, then seals the recovery and surgically purges only the looping schedule rows, leaving unrelated scheduled tasks intact. Emits a new alarm:memory_limit_reset observability event. Everything except memory-limit resets re-throws exactly as before.
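The alarm-boundary breaker in the third layer can be sketched roughly as follows. This is a minimal, synchronous illustration and not the code in this PR: `alarmWithBreaker`, `BreakerHooks`, and the `Map` standing in for durable storage are assumptions, and the real alarm() is asynchronous. It assumes the breaker seals on the strike that reaches the budget, which is how the test plan describes it ("at budget → seal/purge").

```typescript
// Storage key named in this PR's follow-up commit; all other names are hypothetical.
const STRIKES_KEY = "cf_agents:oom_alarm_strikes";

type BreakerResult = "ok" | "backoff" | "sealed";

interface BreakerHooks {
  backoffLoopingRows(): void; // push the looping schedule rows into the future
  sealAndPurgeLoopingRows(): void; // seal the recovery, delete only looping rows
}

function looksLikeMemoryLimitReset(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return message.includes("exceeded its memory limit");
}

function alarmWithBreaker(
  storage: Map<string, number>, // stand-in for durable storage
  runAlarmBody: () => void,
  hooks: BreakerHooks,
  maxStrikes = 3
): BreakerResult {
  try {
    runAlarmBody();
    return "ok";
  } catch (error) {
    // Everything except memory-limit resets re-throws exactly as before.
    if (!looksLikeMemoryLimitReset(error)) throw error;

    const strikes = (storage.get(STRIKES_KEY) ?? 0) + 1;
    if (strikes < maxStrikes) {
      // Under budget: tolerate the reset, but back off so it is not a hot loop.
      storage.set(STRIKES_KEY, strikes);
      hooks.backoffLoopingRows();
      return "backoff";
    }
    // At budget: seal and purge, then drop the counter.
    storage.delete(STRIKES_KEY);
    hooks.sealAndPurgeLoopingRows();
    return "sealed";
  }
}
```

Catching at the outermost frame matters because the handler itself must write to storage; by that point the heavy turn has unwound, so these small writes have a chance to succeed where mid-turn writes did not.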

Supporting changes

- Broaden + export isDurableObjectMemoryLimitReset(error): it now matches the shared "exceeded its memory limit" fragment so truncated/reworded surfacings (observed in real #1825 logs) still classify. Sibling to isDurableObjectCodeUpdateReset / isPlatformTransientError.
- _executeScheduleCallback now defers (re-throws) memory-limit resets for one-shot rows instead of swallowing them after in-process retries, so the error reaches the alarm-boundary breaker. It also tracks the executing row id so the breaker can purge the exact looping row.
- think / ai-chat override _cf_recoveryAlarmCallbacks() + _cf_sealMemoryLimitedRecovery() to target their recovery continuation callbacks and terminalize active incidents (banner + onExhausted + seal).
- Remove the redundant result-path OOM handling in continueLastTurn: those turns are already terminalized, so it only risked wasteful reschedules and duplicate terminal signals.
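To make the broadened predicate concrete, here is a hedged sketch of fragment-based matching. It is not the exported isDurableObjectMemoryLimitReset implementation, which is not shown in this description; `isMemoryLimitResetSketch` is a hypothetical name, and the sample error strings used with it are assumptions about how such errors may surface.

```typescript
// The shared fragment quoted in this PR; matching on it (rather than on a full
// sentence) lets truncated or reworded surfacings still classify.
const MEMORY_LIMIT_FRAGMENT = "exceeded its memory limit";

function isMemoryLimitResetSketch(error: unknown): boolean {
  if (error == null) return false;

  let message = "";
  if (typeof error === "string") {
    message = error;
  } else if (error instanceof Error) {
    message = error.message;
  } else if (typeof (error as { message?: unknown }).message === "string") {
    // Plain objects that crossed a serialization boundary may keep only `message`.
    message = (error as { message: string }).message;
  }
  return message.includes(MEMORY_LIMIT_FRAGMENT);
}
```

For example, both a full sentence containing the fragment and a log line cut off right after "exceeded its memory limit" classify as memory-limit resets, while an unrelated transient such as a connection error does not.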

Configuration

Option	Default	Scope
`chatRecovery.maxRecoveryWork`	`1000` (was `Infinity`)	chat recovery work backstop
`chatRecovery.maxOomRetries`	`3`	in-DO OOM budget
`maxAlarmMemoryLimitStrikes`	`3`	base-agent alarm circuit breaker
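As a quick illustration of how these defaults might compose with user overrides, the sketch below resolves a partial options object against the shipped values. The option names and defaults come from the table above; `ChatRecoveryBudgets`, `resolveBudgets`, and the constant names are hypothetical, and the actual option-merging code in the packages may differ.

```typescript
// Option names and defaults are taken from the table above; the types and
// helper are assumptions made for illustration.
interface ChatRecoveryBudgets {
  maxRecoveryWork: number; // chat recovery work backstop
  maxOomRetries: number; // in-DO OOM budget
}

const DEFAULT_CHAT_RECOVERY_BUDGETS: ChatRecoveryBudgets = {
  maxRecoveryWork: 1000, // was Infinity before this PR
  maxOomRetries: 3,
};

// The alarm circuit breaker budget lives on the base agent as a static,
// not under chatRecovery, so it is modeled separately here.
const DEFAULT_MAX_ALARM_MEMORY_LIMIT_STRIKES = 3;

function resolveBudgets(
  overrides: Partial<ChatRecoveryBudgets> = {}
): ChatRecoveryBudgets {
  return {
    maxRecoveryWork:
      overrides.maxRecoveryWork ?? DEFAULT_CHAT_RECOVERY_BUDGETS.maxRecoveryWork,
    maxOomRetries:
      overrides.maxOomRetries ?? DEFAULT_CHAT_RECOVERY_BUDGETS.maxOomRetries,
  };
}

// Opting back into the previous unbounded work budget, keeping the OOM budget.
const legacy = resolveBudgets({ maxRecoveryWork: Infinity });
```

Using `??` rather than object spread keeps an explicit `undefined` override from erasing a default.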

Test plan

- pnpm run check (sherif + exports + oxfmt + oxlint + typecheck, 113 projects): green
- agents / think / ai-chat test suites: green
- New unit coverage: broadened predicate, listActiveChatRecoveryIncidents
- New integration coverage: alarm memory-limit circuit breaker (under budget → backoff/row preserved; at budget → seal/purge; truncated message match; non-memory errors pass through unchanged)
- Reviewer: confirm default budgets (1000 / 3 / 3) feel right for shipped behavior

Notes

- Two changesets (work-budget default flip + OOM budget/breaker), both patch on agents / @cloudflare/think / @cloudflare/ai-chat.
- Does not close #1285 (zero-signal hard OOM kills during non-alarm requests): a true hard kill runs no in-isolate code, so nothing can emit. This PR does add a new signal (alarm:memory_limit_reset + downstream recovery exhaustion) for the catchable alarm-loop class that was previously also silent. A natural follow-up for #1285 ("Durable Object OOM kills produce zero observability signal") is a boot-time "interrupted run" breadcrumb detector.
- RFC follow-up section + user-facing docs updated.

Made with Cursor

…ps (#1825)

A chat-recovery turn whose Durable Object isolate exceeds its 128 MB memory limit could loop forever, re-running the (billable) turn on every platform alarm retry.

The isolate streams a little content before the reset, which bumps the durable progress counter; on the next wake recovery reads that as forward progress and resets both progress-keyed bounds (maxAttempts, noProgressTimeoutMs), and because each crash lands inside the alarm-debounce window the attempt counter is pinned too. With maxRecoveryWork defaulting to Infinity, no instrument could ever seal the turn, so the model ran forever.

This lands a layered fix:

1. Finite maxRecoveryWork default (1000, was Infinity). The work meter is the one signal that keeps climbing across the loop, so a finite default seals a runaway with reason="work_budget_exceeded".
2. OOM-specific in-DO budget (chatRecovery.maxOomRetries, default 3). A memory reset re-OOMs on re-run (the turn's working set, not the platform, is the cause), so it is classified as a distinct deterministic failure rather than a deploy-style transient: it is NOT deferred and retried forever. Each crash bumps a durable per-incident oomAttempts counter; after a small number of tries it seals with reason="out_of_memory". Fast and attributable.
3. Alarm-boundary circuit breaker (Agent.alarm()) as the universal backstop for OOMs that bypass the in-DO budgets entirely - thrown before the budget code runs (boot-time state hydration), or whose own small writes also OOM under memory pressure. Left unhandled such an error propagates out of alarm() and the platform auto-retries forever. alarm() now intercepts ONLY Durable Object memory-limit resets at the outermost frame, where the heavy turn has unwound and GC has reclaimed its footprint, so the seal/purge writes can land where mid-turn ones OOMed. A durable strike counter (static maxAlarmMemoryLimitStrikes, default 3) tolerates a few resets - backing off the looping rows so the retry is not a hot loop - then seals the recovery (out_of_memory) and surgically purges ONLY the looping schedule rows, leaving unrelated scheduled tasks intact. Emits a new alarm:memory_limit_reset event. Everything except memory-limit resets re-throws exactly as before.

Supporting changes:

- Broaden + export isDurableObjectMemoryLimitReset(error): matches the shared "exceeded its memory limit" fragment so truncated/reworded surfacings observed in real #1825 logs still classify. Sibling to isDurableObjectCodeUpdateReset / isPlatformTransientError.
- _executeScheduleCallback now DEFERS (re-throws) memory-limit resets for one-shot rows instead of swallowing them after in-process retries, so the error reaches the alarm-boundary breaker; track the executing row id so the breaker can purge the exact looping row.
- think/ai-chat override _cf_recoveryAlarmCallbacks() and _cf_sealMemoryLimitedRecovery() to target their recovery continuation callbacks and terminalize active incidents (banner + onExhausted + seal).
- Remove the redundant result-path OOM handling in continueLastTurn: those turns are already terminalized, so it only risked wasteful reschedules and duplicate terminal signals.

Adds unit + integration coverage (predicate, listActiveChatRecoveryIncidents, alarm circuit breaker), an RFC follow-up section, docs, and changesets.

Co-authored-by: Cursor <cursoragent@cursor.com>

changeset-bot · 2026-06-28T10:12:56Z

🦋 Changeset detected

Latest commit: c69cfd5

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

Name	Type
@cloudflare/ai-chat	Patch
@cloudflare/think	Patch
agents	Patch

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

pkg-pr-new · 2026-06-28T10:24:17Z

agents

npm i https://pkg.pr.new/agents@1826

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1826

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1826

create-think

npm i https://pkg.pr.new/create-think@1826

hono-agents

npm i https://pkg.pr.new/hono-agents@1826

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1826

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1826

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1826

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1826

commit: c69cfd5

The alarm-boundary memory-limit strike counter (maxAlarmMemoryLimitStrikes, #1825) is documented as counting CONSECUTIVE alarm OOM resets, but it was only ever deleted when the breaker sealed, never after a clean alarm, so it actually tracked LIFETIME resets. A Durable Object hitting rare, non-consecutive transient spikes (e.g. one a month) would eventually reach the strike budget and wrongly seal healthy recovery work.

alarm() now best-effort clears cf_agents:oom_alarm_strikes after a clean _cf_runAlarmBody() so strikes must be consecutive to seal. The clear reads first and only writes when a strike is recorded, so the common no-strike path costs no write.

Adds a regression test (strike recorded -> clean alarm resets to 0 -> next OOM starts at strike 1).

Co-authored-by: Cursor <cursoragent@cursor.com>
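The read-then-conditionally-write clear described in this commit can be sketched like this. It is an illustration only: `CountingStorage`, `recordStrike`, and `clearStrikesAfterCleanAlarm` are hypothetical names, and the in-memory store merely stands in for durable storage so that writes can be counted.

```typescript
const OOM_STRIKES_KEY = "cf_agents:oom_alarm_strikes";

// In-memory stand-in for durable storage that counts write operations.
class CountingStorage {
  private data = new Map<string, number>();
  writes = 0;

  get(key: string): number | undefined {
    return this.data.get(key);
  }
  put(key: string, value: number): void {
    this.data.set(key, value);
    this.writes++;
  }
  delete(key: string): void {
    this.data.delete(key);
    this.writes++;
  }
}

// Called when an alarm ends in a memory-limit reset.
function recordStrike(storage: CountingStorage): number {
  const strikes = (storage.get(OOM_STRIKES_KEY) ?? 0) + 1;
  storage.put(OOM_STRIKES_KEY, strikes);
  return strikes;
}

// Called after a clean alarm body. Reads first and only writes when a strike
// is actually recorded, so the common no-strike path costs no write.
function clearStrikesAfterCleanAlarm(storage: CountingStorage): void {
  try {
    if (storage.get(OOM_STRIKES_KEY) !== undefined) {
      storage.delete(OOM_STRIKES_KEY);
    }
  } catch {
    // Best effort: a failed clear must not fail an otherwise clean alarm.
  }
}
```

Walking the regression scenario through this sketch: one recorded strike, then a clean alarm, then another OOM yields strike 1 again rather than strike 2, so only consecutive resets can reach the budget.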

threepointone mentioned this pull request Jun 28, 2026

Neverending retries during recovery #1825

Closed

devin-ai-integration Bot reviewed Jun 28, 2026

threepointone merged commit 1bbd9bc into main Jun 28, 2026
7 checks passed

threepointone deleted the fix/chat-recovery-oom-alarm-breaker branch June 28, 2026 10:49

github-actions Bot mentioned this pull request Jun 28, 2026

Version Packages #1822

Merged

threepointone merged 2 commits into main from fix/chat-recovery-oom-alarm-breaker
