feat(codemode): add durable execution retries#1769
Conversation
🦋 Changeset detected
Latest commit: f82430e
The changes in this PR will be included in the next version bump.
This PR includes changesets to release 1 package
agents
@cloudflare/ai-chat
@cloudflare/codemode
create-think
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
6bb77f6 to
4d8e3d9
Compare
|
Keeping this PR in draft until #1791 lands. #1791 fixes the root |
4d8e3d9 to
4dab36b
Compare
|
#1791 is now merged. I rebased this PR onto its merge commit and resolved the overlap in favor of #1791:
Post-rebase verification: 297 unit tests, 43 durable-runtime tests, 33 browser tests, full 24-project build, and all 113 projects typecheck. The cloudflare-mcp consumer was also updated to |
4dab36b to
8c04b8a
Compare
|
I noticed one retry-fencing edge that would be worth locking down in the tests. When a superseded pass finishes late, I would add an assertion around the late |
|
Taking this back to draft for a strict code-quality pass. I found two correctness gaps (late results after terminalization and optional attempt fencing), plus avoidable orchestration/SQL growth in already-large runtime files and the open durable-log finding. I’m addressing these in order with explicit regression coverage before requesting review again. |
8c04b8a to
5b51060
Compare
|
Addressed the stale-pass review note in The strict quality pass also extracted execution/retry orchestration from |
5b51060 to
4eb6c43
Compare
|
Strict code-quality pass complete in
Coverage now includes default/custom/disabled/declined/throwing policies, invalid attempt limits, Retry-After and bounded backoff, caught retry signals, timeout fencing, stale terminal results, attempt-store lifecycle/backfill, and executor success/unclassified/structured failures. Verification: 310 unit, 51 real runtime, 33 browser, package build, full repo check (113 projects), cloudflare-mcp 256 tests, and staging dry-run. |
    __stepDecide: async (name: unknown) =>
      runtime.decide(
        executionId,
        cursor.next(),
        STEP_CONNECTOR,
        String(name),
        undefined,
        false,
        false,
        attempt
      ),

    __stepRecord: async (seq: unknown, value: unknown) =>
      runtime.recordResult(executionId, Number(seq), value, attempt)
🚩 Steps executed after a caught RetryableError persist in the durable log and replay on retry
The control.failure guard in buildConnectorBindings (runtime-execution.ts:270) blocks subsequent connector calls after a RetryableError, but codemode.step() calls go through the platform provider's __stepDecide/__stepRecord handlers (runtime-execution.ts:420-433), which do NOT check control.failure. If model code catches the retryable error and then calls codemode.step(), the step executes and its result is recorded durably. On retry, the step is replayed from the log, potentially returning a value from the failed pass's context.
In practice, this is unlikely to cause issues: (1) model code catching connector errors and continuing with steps is not the normal pattern; (2) divergence detection (runtime.ts:473-491) catches cases where the retry takes a different code path at the same seq; (3) the replay semantic of codemode.step (record once, replay thereafter) is intentionally sticky. But for completeness, the __stepDecide handler could check control.failure and return a pause decision, consistent with the connector guard's behavior.
Fixed in a4217dd. createPlatformProvider now receives the same per-pass control object as connector bindings, and __stepDecide returns a pause decision immediately once a retry signal exists, before executing or recording the step closure. Added a real runtime regression where generated code catches RetryableError and calls codemode.step; the retry succeeds, the closure never runs, and the final durable log contains only the connector entry (no __step entry). Verification remains green: 310 unit, 52 runtime, 33 browser, full build/check.
4eb6c43 to
a4217dd
Compare
|
Production smoke completed against local
The throwaway Worker |
|
Taking this briefly back to draft to add cooperative cancellation. Attempt fencing prevents stale results from being recorded, but it does not stop an in-flight connector operation. Each pass will own an AbortController; connector |
a4217dd to
f82430e
Compare
Summary
Adds durable execution retries to @cloudflare/codemode's runtime.tool() path.
A connector can now throw RetryableError to explicitly tell Code Mode that a failed tool boundary is safe to retry. The runtime retries inside the same durable execution, replays already-applied calls from the durable log, and re-executes only the failed boundary. This gives callers resilient Code Mode executions without making every application build its own replay/retry runtime.
This also adds attempt fencing and cooperative cancellation so stale work from timed-out or superseded sandbox passes cannot mutate the durable log.
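The replay behavior described in this summary can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `runWithReplay`, `LogEntry`, and the local `RetryableError` class are hypothetical stand-ins, not the package's API, and the real runtime persists its log durably rather than in an array.

```typescript
// Hypothetical stand-in; the real RetryableError comes from the package.
class RetryableError extends Error {}

type LogEntry = { seq: number; name: string; value: unknown };
type Call = (name: string, fn: () => unknown) => unknown;

// Runs `program` up to `maxAttempts` times against one shared log.
function runWithReplay(
  program: (call: Call) => unknown,
  maxAttempts = 3
): { result: unknown; log: LogEntry[]; attempts: number } {
  const log: LogEntry[] = []; // survives across passes
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let seq = 0; // every pass walks the log from the start
    const call: Call = (name, fn) => {
      const index = seq++;
      const logged = log[index];
      if (logged !== undefined) {
        if (logged.name !== name) throw new Error(`divergence at seq ${index}`);
        return logged.value; // replay: the side effect is not run again
      }
      const value = fn(); // if this throws, nothing is logged for this seq
      log.push({ seq: index, name, value });
      return value;
    };
    try {
      return { result: program(call), log, attempts: attempt };
    } catch (err) {
      // Only explicitly retryable failures start another pass.
      if (!(err instanceof RetryableError) || attempt === maxAttempts) throw err;
    }
  }
  throw new Error("maxAttempts must be at least 1");
}

// Usage: `flaky` fails once, so `first` runs once and is replayed on pass 2.
const runs = { first: 0, flaky: 0, after: 0 };
const outcome = runWithReplay((call) => {
  const a = call("first", () => { runs.first++; return "a"; });
  const b = call("flaky", () => {
    runs.flaky++;
    if (runs.flaky === 1) throw new RetryableError("try again");
    return "b";
  });
  const c = call("after", () => { runs.after++; return "c"; });
  return [a, b, c];
});
```

In this model the failed boundary is the only call that executes twice; everything before it is served from the log, and everything after it runs once on the successful pass.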
Default retry policy
By default, runtime.tool() retries only failures that are explicitly marked retryable with RetryableError.
Default behavior:
- 3 total attempts.
- If RetryableError includes retryAfterMs, the runtime honors it.
- retry: false disables retries completely.
- retry: { ... } allows callers to provide a custom retry policy.
The default policy intentionally does not retry arbitrary thrown errors, executor failures, or timeouts. Those may represent bugs, validation errors, non-idempotent writes, or ambiguous remote state.
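As a rough illustration of the default policy described above, the decision could be modeled as a pure function. Everything here is an assumption made for the sketch: `decideRetry`, `RetryDecision`, and the backoff constants are hypothetical, and the local `RetryableError` class only mimics the shape implied by `retryAfterMs`.

```typescript
// Hypothetical stand-in; the real class and its constructor may differ.
class RetryableError extends Error {
  retryAfterMs?: number;
  constructor(message: string, retryAfterMs?: number) {
    super(message);
    this.name = "RetryableError";
    this.retryAfterMs = retryAfterMs;
  }
}

type RetryDecision = { retry: false } | { retry: true; delayMs: number };

const MAX_ATTEMPTS = 3; // total attempts, including the first (from the PR text)
const BASE_DELAY_MS = 100; // assumed value, not stated in the PR
const MAX_DELAY_MS = 5_000; // assumed bound, not stated in the PR

// `attempt` is the 1-based number of the pass that just failed.
function decideRetry(error: unknown, attempt: number, enabled = true): RetryDecision {
  if (!enabled) return { retry: false }; // models `retry: false`
  if (!(error instanceof RetryableError)) return { retry: false }; // ordinary errors stay terminal
  if (attempt >= MAX_ATTEMPTS) return { retry: false }; // attempts exhausted
  const backoff = Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
  // A connector-provided hint takes precedence over the computed backoff.
  return { retry: true, delayMs: error.retryAfterMs ?? backoff };
}
```

Keeping the decision separate from the loop that acts on it makes each rule (disabled, unclassified error, exhausted attempts, `retryAfterMs` hint) easy to test in isolation.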
How the runtime knows an error is retryable
The SDK does not infer retryability from tool names, HTTP methods, response codes, or exception messages.
A failure is retryable by default only when connector code explicitly throws RetryableError.
That keeps the semantic decision at the connector boundary, where the implementation actually understands the protocol. For example, a connector may know that a server explicitly rejected a request and asked the client to retry, but the generic Code Mode runtime cannot safely infer that from the outside.
Normal errors remain terminal unless the application opts into a custom retry policy.
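To make the connector-boundary idea concrete, here is a minimal sketch of a connector classifying its own failures. The API being called is imaginary: `ApiResponse`, `unwrapResponse`, and the rule that a 429 means "rejected before any work was done" are assumptions for illustration, and the local `RetryableError` class is a stand-in for the one exported by the package.

```typescript
// Hypothetical stand-in; the real class and its constructor may differ.
class RetryableError extends Error {
  retryAfterMs?: number;
  constructor(message: string, retryAfterMs?: number) {
    super(message);
    this.name = "RetryableError";
    this.retryAfterMs = retryAfterMs;
  }
}

// Minimal response shape so the sketch needs no network access.
type ApiResponse = { status: number; retryAfterSeconds?: number; body: string };

// Connector-side classification: only this code knows that, for this
// imagined API, a 429 means the request was rejected and never applied.
function unwrapResponse(res: ApiResponse): string {
  if (res.status === 429) {
    const retryAfterMs =
      res.retryAfterSeconds !== undefined ? res.retryAfterSeconds * 1000 : undefined;
    throw new RetryableError("rate limited; request was not applied", retryAfterMs);
  }
  if (res.status >= 400) {
    // Any other failure is thrown as an ordinary error and stays terminal.
    throw new Error(`request failed with status ${res.status}`);
  }
  return res.body;
}
```

The runtime in this picture never looks at `status`; it only sees whether the connector chose to throw the retryable type.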
Durable retry behavior
Retries happen inside the same Code Mode execution:
- ... error and the durable log is preserved for audit/debugging.
For example, if this code runs:
and api.flaky() throws RetryableError once, the runtime behaves as follows:
- first runs and is logged.
- flaky fails with RetryableError.
- first is replayed from the log.
- flaky runs again.
- after runs only after flaky succeeds.
What happens on execution timeout
Executor timeouts are surfaced as structured failures, but they are not retried by default.
Timeouts are ambiguous: the sandbox stopped waiting, but the runtime cannot prove whether an in-flight connector operation committed externally. Retrying a timed-out write automatically could duplicate side effects.
Applications that know a timeout is safe may opt into retrying it with a custom retry policy, but the SDK default remains conservative.
To make timeout/custom-retry behavior safe, this PR adds durable attempt fencing:
Cooperative cancellation
Connector execute contexts now include an optional pass-scoped AbortSignal:
The runtime aborts the signal when a pass completes, pauses, errors, times out, or moves to a retry. This lets connectors cancel in-flight work from old passes before the runtime continues.
Cancellation is cooperative only. It reduces overlap and wasted work, but it does not prove that a remote write was rolled back or never committed. Timeout retry policy remains conservative for that reason.
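A connector that cooperates with a pass-scoped signal might look roughly like the following. The names `ExecuteContext`, `slowOperation`, and `startPass` are hypothetical; only the standard `AbortController` and `AbortSignal` APIs are real, and the actual context shape in the package may differ.

```typescript
// Hypothetical context shape; the PR only says the signal is optional.
type ExecuteContext = { signal?: AbortSignal };

// A connector operation that stops its pending work when the pass is aborted.
function slowOperation(ctx: ExecuteContext, ms: number): Promise<string> {
  return new Promise((resolve, reject) => {
    if (ctx.signal?.aborted) {
      reject(new Error("pass already aborted"));
      return;
    }
    const timer = setTimeout(() => resolve("done"), ms);
    ctx.signal?.addEventListener(
      "abort",
      () => {
        clearTimeout(timer); // drop the in-flight work from the old pass
        reject(new Error("pass aborted"));
      },
      { once: true }
    );
  });
}

// Each pass owns one controller; ending the pass aborts its signal.
function startPass(): { ctx: ExecuteContext; end: () => void } {
  const controller = new AbortController();
  return { ctx: { signal: controller.signal }, end: () => controller.abort() };
}
```

Aborting here only stops local waiting. If the operation had already reached a remote system, the abort says nothing about whether that write committed, which is the reason the text above keeps timeout retries opt-in.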
Storage changes
Adds a durable attempts store/table used to fence retries by execution attempt.
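The fencing idea can be sketched with an in-memory model. `AttemptStore` and `FencedLog` are hypothetical names standing in for the durable attempts table and the durable log; the `recordResult` parameter order loosely mirrors the call shown in the review diff, but this is not the package's implementation.

```typescript
// Hypothetical in-memory stand-in for the durable attempts table.
class AttemptStore {
  private current = new Map<string, number>();

  // Starts a new attempt; any earlier attempt for this execution is superseded.
  begin(executionId: string): number {
    const next = (this.current.get(executionId) ?? 0) + 1;
    this.current.set(executionId, next);
    return next;
  }

  isCurrent(executionId: string, attempt: number): boolean {
    return this.current.get(executionId) === attempt;
  }
}

type Entry = { executionId: string; seq: number; value: unknown; attempt: number };

// Hypothetical log that refuses writes from superseded attempts.
class FencedLog {
  readonly entries: Entry[] = [];
  private attempts: AttemptStore;

  constructor(attempts: AttemptStore) {
    this.attempts = attempts;
  }

  // Returns false (and records nothing) when the writer's attempt is stale.
  recordResult(executionId: string, seq: number, value: unknown, attempt: number): boolean {
    if (!this.attempts.isCurrent(executionId, attempt)) return false;
    this.entries.push({ executionId, seq, value, attempt });
    return true;
  }
}
```

In a real store the currency check and the write would need to happen atomically (for example inside one transaction); the sketch skips that because it runs single-threaded in memory.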
The migration is additive. Existing released Code Mode execution state is backfilled into the attempt table with INSERT OR IGNORE.
Public API
Adds:
- RetryableError
- signal on connector execute context
No retry configuration is required for normal RetryableError usage.
Tests
Added coverage for:
- retryAfterMs handling
- RetryableError not allowing codemode.step() to persist
Local validation:
pnpm --filter @cloudflare/codemode test
pnpm --filter @cloudflare/codemode build
pnpm run check