
Add durable execution Step + Wait end-to-end #2360

Draft
GarrettBeatty wants to merge 1 commit into feature/durablefunction from GarrettBeatty/stack/2

@GarrettBeatty GarrettBeatty commented May 12, 2026

Stacked PRs:


#2216

What

This is the first real end-to-end slice of the Amazon.Lambda.DurableExecution SDK. After this PR, you can write a workflow that calls StepAsync and WaitAsync, deploy it as a Lambda, and have it run against the actual durable-execution service with replay-aware checkpointing.

Public API introduced:

| Type | Purpose |
| --- | --- |
| DurableFunction.WrapAsync | Entry point; handles the durable-execution envelope (input hydration, output construction, status mapping). Has both reflection and JsonSerializerContext (AOT) overloads. |
| IDurableContext | User-facing context: StepAsync (3 overloads: reflection, void, and AOT-safe ICheckpointSerializer), WaitAsync, LambdaContext, Logger, ExecutionContext. |
| StepConfig | Per-step configuration. Empty in this PR; RetryStrategy and Semantics get wired in #2363. |
| ICheckpointSerializer | Pluggable JSON serializer for step inputs/outputs, passed directly to the AOT-safe StepAsync overload. |
| DurableExecutionException | Base exception type for durable-execution failures. |

Why

Durable execution lets a Lambda function suspend and resume across invocations by checkpointing each side effect to the service. This PR lays down the minimum that everything after it builds on: retries, callbacks, parallelism, the Annotations integration, and the test runner package all depend on the Step + Wait primitives and the replay machinery introduced here.

I kept the scope narrow on purpose. Anything that does not block real-Lambda execution is pushed to follow-up PRs (see Out of scope) so this stays reviewable.

How

The runtime runs a Task.WhenAny race in DurableExecutionHandler between the user's workflow task and a suspension signal. Every StepAsync / WaitAsync call goes to a per-operation class (StepOperation / WaitOperation) that checks ExecutionState (built from operations the service replayed) before deciding what to do:

  • Replay hit — the operation has a SUCCEEDED / FAILED record from a prior invocation. Return the cached result (or rethrow) without re-running user code.
  • Cache miss — run user code, enqueue a SUCCEED/FAIL checkpoint, return.
  • Wait pending — the wait was scheduled previously but has not expired. Call TerminationManager.SuspendAndAwait to win the WhenAny race, returning Pending so the service re-invokes us when the timer fires.
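
The replay-hit and cache-miss branches above can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: OperationRecord, OperationStatus, and ReplaySketch are hypothetical names, a plain Dictionary stands in for ExecutionState, and the checkpoint is written to that dictionary instead of being enqueued to the service.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Usage: the first call runs the body; the second call with the same ID replays it.
var state = new Dictionary<string, OperationRecord>();
int runs = 0;
int first = await ReplaySketch.StepAsync(state, "op-1", () => { runs++; return Task.FromResult(42); });
int second = await ReplaySketch.StepAsync(state, "op-1", () => { runs++; return Task.FromResult(-1); });
Console.WriteLine($"{first} {second} runs={runs}"); // prints "42 42 runs=1"

// Hypothetical record shape; the real operation model is not shown in this PR description.
enum OperationStatus { Succeeded, Failed }
record OperationRecord(OperationStatus Status, object? Result, string? Error);

static class ReplaySketch
{
    public static async Task<T> StepAsync<T>(
        Dictionary<string, OperationRecord> state, string operationId, Func<Task<T>> body)
    {
        // Replay hit: return the cached result (or rethrow) without running user code.
        if (state.TryGetValue(operationId, out var record))
        {
            if (record.Status == OperationStatus.Succeeded) return (T)record.Result!;
            throw new InvalidOperationException(record.Error);
        }

        // Cache miss: run user code and record the outcome as a checkpoint.
        try
        {
            T result = await body();
            state[operationId] = new OperationRecord(OperationStatus.Succeeded, result, null);
            return result;
        }
        catch (Exception ex)
        {
            state[operationId] = new OperationRecord(OperationStatus.Failed, null, ex.Message);
            throw;
        }
    }
}
```

The second call never increments `runs`, which is the property replay relies on: side effects in a completed step are not repeated.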

Replay determinism comes from OperationIdGenerator, which produces stable IDs from the workflow's call sequence so the same step always lands on the same record across invocations.
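
One way to get IDs that are stable across invocations is to derive them from a per-execution counter, so the Nth call in the workflow always maps to the same ID. The sketch below assumes that scheme; SequenceIdGenerator, the scope string, and the hash-and-truncate format are illustrative choices, not the actual OperationIdGenerator algorithm.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Two generators with the same scope model the first invocation and a later replay.
var firstInvocation = new SequenceIdGenerator("exec-123");
var replay = new SequenceIdGenerator("exec-123");
string a1 = firstInvocation.Next(), a2 = firstInvocation.Next();
string b1 = replay.Next(), b2 = replay.Next();
Console.WriteLine(a1 == b1 && a2 == b2 && a1 != a2); // prints "True"

sealed class SequenceIdGenerator
{
    private readonly string _scope;
    private int _counter;

    public SequenceIdGenerator(string scope) => _scope = scope;

    // The ID depends only on the scope and the call position, never on time or randomness.
    public string Next()
    {
        _counter++;
        byte[] hash = SHA256.HashData(Encoding.UTF8.GetBytes($"{_scope}:{_counter}"));
        return Convert.ToHexString(hash, 0, 8);
    }
}
```

A scheme like this only holds if the workflow issues the same calls in the same order on every replay; branching on wall-clock time or random values before a step would shift the sequence.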

Custom checkpoint serialization is opt-in: the reflection-based StepAsync overload uses System.Text.Json (annotated RequiresUnreferencedCode / RequiresDynamicCode), while for NativeAOT or trimmed deployments callers pass an ICheckpointSerializer to the dedicated overload. StepConfig is intentionally empty in this PR; it is the configuration carrier for RetryStrategy (#2363) and future per-step knobs. I considered adding a Serializer property there, but rejected it because the serializer is type-bound (ICheckpointSerializer) while StepConfig is type-erased.

Checkpoint flushing goes through CheckpointBatcher, a Channel-based queue with a single background worker. Each EnqueueAsync returns a Task that completes when the worker has flushed the containing batch to the service. WrapAsyncCore sets up the batcher, threads it into DurableContext, and awaits DrainAsync() before returning to Lambda. Defaults match the Java SDK (MaxBatchOperations = 200, MaxBatchBytes = 750 KB, FlushInterval = 0 — flush as soon as the queue drains). There's a TODO for the async-flush overload that Map/Parallel/Child Context will eventually need.
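
The queue-plus-single-worker shape described above might look roughly like this. It is a sketch under assumptions: BatcherSketch and its flush delegate are hypothetical, only a max-operations limit is modeled (no byte limit or flush interval), and the flush callback stands in for the service call.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

var flushed = new List<int>(); // batch sizes, written only by the single worker
var batcher = new BatcherSketch(maxBatch: 3, flush: batch =>
{
    flushed.Add(batch.Count);
    return Task.CompletedTask;
});
var pending = new List<Task>();
for (int i = 0; i < 5; i++) pending.Add(batcher.EnqueueAsync($"op-{i}"));
await batcher.DrainAsync();
await Task.WhenAll(pending);
Console.WriteLine($"flushed {flushed.Sum()} operations in {flushed.Count} batch(es)");

sealed class BatcherSketch
{
    private readonly Channel<(string Op, TaskCompletionSource Done)> _queue =
        Channel.CreateUnbounded<(string Op, TaskCompletionSource Done)>(
            new UnboundedChannelOptions { SingleReader = true });
    private readonly int _maxBatch;
    private readonly Func<IReadOnlyList<string>, Task> _flush;
    private readonly Task _worker;

    public BatcherSketch(int maxBatch, Func<IReadOnlyList<string>, Task> flush)
    {
        _maxBatch = maxBatch;
        _flush = flush;
        _worker = Task.Run(() => RunAsync());
    }

    // The returned task completes once the batch containing this operation has been flushed.
    public Task EnqueueAsync(string op)
    {
        var done = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
        if (!_queue.Writer.TryWrite((op, done)))
            done.SetException(new InvalidOperationException("Batcher is already drained."));
        return done.Task;
    }

    // Stop accepting work and wait until the worker has flushed everything already queued.
    public Task DrainAsync()
    {
        _queue.Writer.TryComplete();
        return _worker;
    }

    private async Task RunAsync()
    {
        while (await _queue.Reader.WaitToReadAsync())
        {
            // Take whatever is queued right now, up to the batch limit, then flush immediately.
            var batch = new List<(string Op, TaskCompletionSource Done)>();
            while (batch.Count < _maxBatch && _queue.Reader.TryRead(out var item)) batch.Add(item);
            try
            {
                await _flush(batch.ConvertAll(x => x.Op));
                foreach (var (_, done) in batch) done.SetResult();
            }
            catch (Exception ex)
            {
                foreach (var (_, done) in batch) done.SetException(ex);
            }
        }
    }
}
```

How the five operations split into batches depends on timing between the producer and the worker, so the example reports totals instead of exact batch sizes.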

Key files:

  • DurableFunction.cs — envelope + replay-state hydration + batcher lifecycle
  • DurableContext.cs — facade; constructs the right per-operation class
  • Internal/DurableOperation.cs — abstract base with StartAsync / ReplayAsync template methods
  • Internal/StepOperation.cs / Internal/WaitOperation.cs — per-op replay logic
  • Internal/CheckpointBatcher.cs / Internal/CheckpointBatcherConfig.cs — checkpoint queue + worker
  • Internal/ReflectionJsonCheckpointSerializer.cs — default reflection-based serializer for the JIT-only overload
  • Internal/ExecutionState.cs — operation lookup + replay-mode flag
  • Internal/OperationIdGenerator.cs — deterministic IDs
  • Internal/TerminationManager.cs — suspension trigger
  • Internal/DurableExecutionHandler.cs — the WhenAny race
  • Services/LambdaDurableServiceClient.cs — service client wrapper

Testing

Unit tests in Amazon.Lambda.DurableExecution.Tests cover: enums, exceptions, models, OperationIdGenerator, TerminationManager, ExecutionState, the handler race, both Step and Wait paths through DurableContext, DurableFunction.WrapAsync, and CheckpointBatcher (enqueue/flush, batching within window, overflow splitting, error propagation, drain, dispose, token updates, concurrency).

End-to-end integration tests in Amazon.Lambda.DurableExecution.IntegrationTests build each test workflow into a Docker container, deploy it as a real Lambda on provided.al2023, and run against the durable-execution service:

  • StepWaitStep — basic Step → Wait → Step sequence
  • MultipleSteps — several sequential steps
  • WaitOnly — wait without any steps
  • LongerWait — wait that spans multiple invocations
  • ReplayDeterminism — verifies stable operation IDs across replays
  • StepFails — verifies a failed step surfaces correctly

Out of scope (follow-up PRs)

  • IRetryStrategy, ExponentialRetryStrategy, retry decision factories, StepConfig.RetryStrategy / StepConfig.Semantics (Adds retry support to the Amazon.Lambda.DurableExecution #2363)
  • StepException and the per-step failure exception type
  • DurableLogger replay-suppression (currently returns NullLogger)
  • Callbacks, InvokeAsync, ParallelAsync, MapAsync, RunInChildContextAsync, WaitForConditionAsync — the interface intentionally does not declare these yet
  • Annotations source-generator integration (including a [DurableExecution] marker attribute)
  • DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
  • dotnet new lambda.DurableFunction blueprint

@GarrettBeatty GarrettBeatty requested review from a team as code owners May 12, 2026 02:36
@GarrettBeatty GarrettBeatty requested review from normj and philasmar and removed request for a team May 12, 2026 02:36
GarrettBeatty added a commit that referenced this pull request May 12, 2026
Implements the minimum viable slice of the Amazon.Lambda.DurableExecution
SDK: a workflow can run StepAsync and WaitAsync against a real Lambda,
with replay-aware checkpointing wired through to the AWS service.

Public API surface introduced:
- DurableFunction.WrapAsync — entry point that handles the durable
  execution envelope (input hydration, output construction, status mapping)
- IDurableContext.StepAsync / WaitAsync (4 Step overloads, 1 Wait)
- StepConfig with retry strategy, semantics, and serializer hooks
- IRetryStrategy + ExponentialRetryStrategy + retry decision factories
- ICheckpointSerializer + DefaultJsonCheckpointSerializer
- [DurableExecution] attribute (recognized by future source generator)
- DurableExecutionException base + StepException

Internals:
- DurableExecutionHandler — Task.WhenAny race between user code and
  the suspension signal, returning Succeeded/Failed/Pending
- ExecutionState — replay-aware operation lookup and pending checkpoint
  buffer
- OperationIdGenerator — deterministic, replay-stable IDs
- TerminationManager — TaskCompletionSource-based suspension trigger
- LambdaDurableServiceClient — wraps AWSSDK.Lambda's checkpoint and
  state APIs

Tests:
- 86 unit tests covering enums, exceptions, models, configs, retry,
  ID generation, termination, execution state, the handler race,
  the context (Step + Wait paths), and the WrapAsync entry point
- 8 end-to-end integration tests deploying real Lambdas via Docker on
  the provided.al2023 runtime: StepWaitStep, MultipleSteps, WaitOnly,
  LongerWait, ReplayDeterminism, RetrySucceeds, RetryExhausts, StepFails

Out of scope (follow-up PRs):
- Callbacks, InvokeAsync, ParallelAsync, MapAsync, RunInChildContextAsync,
  WaitForConditionAsync — interface intentionally does not declare them
- DurableLogger replay-suppression
- Annotations source-generator integration
- DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
- dotnet new lambda.DurableFunction blueprint

stack-info: PR: #2360, branch: GarrettBeatty/stack/2
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch from 55db890 to 92e2428 Compare May 12, 2026 02:36
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/1 branch from 9d78744 to 075b5e6 Compare May 12, 2026 02:36
@GarrettBeatty GarrettBeatty marked this pull request as draft May 12, 2026 02:37
@GarrettBeatty GarrettBeatty removed the request for review from a team May 12, 2026 02:37
@GarrettBeatty GarrettBeatty changed the base branch from GarrettBeatty/stack/1 to feature/durablefunction May 12, 2026 03:03
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch from 92e2428 to edd8c5f Compare May 12, 2026 03:03
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch from edd8c5f to fe03624 Compare May 12, 2026 03:03
@GarrettBeatty GarrettBeatty mentioned this pull request May 12, 2026
@GarrettBeatty GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/1 May 12, 2026 03:03
@GarrettBeatty GarrettBeatty changed the base branch from GarrettBeatty/stack/1 to feature/durablefunction May 12, 2026 03:05
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch from fe03624 to 322fa09 Compare May 12, 2026 03:05
@GarrettBeatty GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/1 May 12, 2026 03:05
@GarrettBeatty GarrettBeatty changed the base branch from GarrettBeatty/stack/1 to feature/durablefunction May 12, 2026 03:15
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch from 322fa09 to 983c9aa Compare May 12, 2026 03:16
@GarrettBeatty GarrettBeatty changed the base branch from feature/durablefunction to GarrettBeatty/stack/1 May 12, 2026 03:16
@GarrettBeatty GarrettBeatty requested a review from Copilot May 12, 2026 03:24
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch 3 times, most recently from 2726800 to 8c9d7dc Compare May 13, 2026 16:39
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/IDurableContext.cs Outdated
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch 2 times, most recently from a3aa60d to 173c9ee Compare May 13, 2026 19:15
@GarrettBeatty GarrettBeatty requested a review from Copilot May 13, 2026 19:15
Copilot AI left a comment

Pull request overview

Copilot reviewed 69 out of 69 changed files in this pull request and generated 7 comments.

Comment thread Libraries/src/Amazon.Lambda.DurableExecution/Models/ErrorObject.cs Outdated
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/Internal/StepOperation.cs Outdated
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/DurableContext.cs
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/Internal/StepOperation.cs Outdated
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/DurableExecutionHandler.cs Outdated
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch from 173c9ee to 02ed1fd Compare May 13, 2026 19:57
@GarrettBeatty GarrettBeatty added the Release Not Needed Add this label if a PR does not need to be released. label May 13, 2026
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch from 02ed1fd to d589dcd Compare May 13, 2026 20:13
@GarrettBeatty GarrettBeatty requested a review from Copilot May 13, 2026 20:15
Copilot AI left a comment

Pull request overview

Copilot reviewed 74 out of 74 changed files in this pull request and generated 24 comments.

Comments suppressed due to low confidence (2)

Libraries/test/Amazon.Lambda.DurableExecution.IntegrationTests/WaitOnlyTest.cs:1

  • The if/else branching here (and in StepWaitStepTest, LongerWaitTest, ReplayDeterminismTest, MultipleStepsTest) makes the test pass under two very different service contracts. If the service silently changes which branch is taken, regressions on the unexercised branch go undetected. Consider asserting which mode is expected (or factoring the two cases into separate [Fact]s) so the test fails when the assumed behavior changes.

Libraries/test/Amazon.Lambda.DurableExecution.Tests/DurableContextTests.cs:1

  • This test path (an already-SUCCEEDED wait on replay) isn't covered by WaitOperation's switch statement — OperationStatuses.Succeeded returns Task.FromResult(null), which is fine, but the corresponding case for Failed/Cancelled (which can occur when a wait is explicitly stopped) has no test or production handling and will fall through to default → StartAsync. Either add an explicit case + assertion or document that failed waits aren't representable.

Comment thread Libraries/src/Amazon.Lambda.DurableExecution/DurableFunction.cs Outdated
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/Models/ErrorObject.cs Outdated
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/DurableContext.cs
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/DurableContext.cs Outdated
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/DurableFunction.cs Outdated
Comment thread Libraries/src/Amazon.Lambda.DurableExecution/DurableFunction.cs
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch 2 times, most recently from acf3d85 to b01b068 Compare May 13, 2026 21:49
/// <summary>
/// Custom serializer for the step result. Default is System.Text.Json.
/// </summary>
public ICheckpointSerializer? Serializer { get; set; }
Collaborator Author

This is a deviation from the previous design. Originally I had the serializer as part of StepConfig, but after playing around with the code I realized it needs to be a parameter of the StepAsync/WaitAsync functions so that Native AOT can have its own function calls.

@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch 3 times, most recently from 8b501c1 to e2d087a Compare May 13, 2026 22:35
  // Step 1: Validate the order (checkpointed automatically)
  var validation = await context.StepAsync(
-     async () => await ValidateOrder(input.OrderId),
+     async (step) => await ValidateOrder(input.OrderId),
Collaborator Author

@GarrettBeatty GarrettBeatty May 13, 2026

In my design doc I previously had an API in which the user's function does not receive the step context (in case the user didn't need it). However, after looking at Java's code, I saw that they deprecated that function and kept only the API that gives the user the step context.

/// Uses a TaskCompletionSource that resolves when the function should suspend.
/// Only the first Terminate() call wins; subsequent calls are ignored.
/// </summary>
internal sealed class TerminationManager
Collaborator Author

@GarrettBeatty GarrettBeatty May 13, 2026

This is used when we need to suspend the Lambda function. There are actually two ways to implement this functionality.

Python and Java raise a special SuspendExecution exception from inside wait() that bubbles up the user's stack to a top-level except in the wrapper, which converts it to PENDING.

.NET (and JS) don't throw. WaitAsync flips a one-shot signal on the TerminationManager and hands user code a Task that never completes, so the user's await parks forever. Meanwhile the wrapper runs Task.WhenAny on the user's workflow and that signal: the signal wins the race, the wrapper returns PENDING, and the abandoned user task gets GC'd.

Collaborator Author

Python and Java are able to do this because the exceptions they throw derive from BaseException and java.lang.Error respectively, which users do not usually catch (they usually catch just the Exception type). In .NET I think every exception derives from Exception, and most users will catch a generic Exception, so we can't do that.
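
The no-throw approach from this thread can be sketched as below. SuspensionSketch and its string outcomes are hypothetical stand-ins for TerminationManager and the handler's status values; the workflow trips the signal itself here, where the real WaitAsync would do so on the user's behalf.

```csharp
using System;
using System.Threading.Tasks;

var outcome = await SuspensionSketch.RunAsync(async signal =>
{
    // Stand-in for a pending WaitAsync: trip the one-shot signal,
    // then park on a task that never completes.
    signal.TrySetResult();
    await new TaskCompletionSource().Task;
    return "unreachable";
});
Console.WriteLine(outcome); // prints "Pending"

static class SuspensionSketch
{
    public static async Task<string> RunAsync(Func<TaskCompletionSource, Task<string>> workflow)
    {
        // TrySetResult makes the signal one-shot: only the first call has any effect.
        var suspend = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);
        Task<string> userTask = Task.Run(() => workflow(suspend));

        // Whichever finishes first decides the outcome; a parked workflow can never win.
        Task winner = await Task.WhenAny(userTask, suspend.Task);
        return winner == suspend.Task ? "Pending" : "Succeeded: " + await userTask;
    }
}
```

Because nothing is thrown through user code, a `catch (Exception)` inside the workflow cannot swallow the suspension.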

@@ -0,0 +1,7 @@
FROM public.ecr.aws/lambda/provided:al2023
Collaborator Author

Using Dockerfiles is a workaround until the step function team allowlists the .NET managed runtime to call durable functions.

if (duration < TimeSpan.FromSeconds(1))
throw new ArgumentOutOfRangeException(nameof(duration), duration, "Wait duration must be at least 1 second.");

if (duration > TimeSpan.FromSeconds(31_622_400))
Collaborator Author

Should we be validating this on our end?

/// <c>sendOperationUpdate</c> vs <c>sendOperationUpdateAsync</c> is the model.
/// Today every call site is sync, so the API stays minimal.
/// </remarks>
internal sealed class CheckpointBatcher : IAsyncDisposable
Collaborator Author

Right now every StepAsync "batch" is one item. Eventually, when we implement Map/Parallel operations, they will run operations concurrently and checkpoints will actually be batched.

Collaborator Author

The fire-and-forget and START step will be implemented in #2363.

It's technically not needed in this PR because durable functions really only care about seeing SUCCEEDED and FAILED steps. But once we add retries, it needs to know how many attempts were made (i.e., the number of START steps), so it's required then.

// the termination signal. When TerminationManager fires (e.g., WaitAsync),
// we need the WhenAny race below to resolve immediately without waiting
// for the user task to reach an await point.
var userTask = Task.Run(userHandler);
Collaborator Author

The reason for this: imagine the user had

async Task<TestResult> Workflow(TestEvent input, IDurableContext ctx)
{
    // Imagine the user does CPU work or sync I/O before any await:
    Thread.Sleep(2000);                            // or a long compute loop
    await ctx.WaitAsync(TimeSpan.FromSeconds(5));  // first real await
    ...
}

If we called userHandler() directly instead of Task.Run(userHandler):

var userTask = userHandler(); // ← starts running RIGHT HERE, synchronously
var winner = await Task.WhenAny(userTask, terminationManager.TerminationTask);

The userHandler() invocation runs synchronously up to the first real await. If the user sleeps, blocks on sync I/O, or does any non-yielding work first, we don't even reach the await Task.WhenAny(...) line yet. The wrapper is stuck inside the user's call.
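
The difference can be observed with a small timing experiment. StartupSketch.Workflow is a hypothetical stand-in for the user handler, with a 500 ms synchronous prefix in place of the 2-second sleep above; exact timings will vary by machine.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

var sw = Stopwatch.StartNew();
Task direct = StartupSketch.Workflow();                     // the Thread.Sleep runs on the caller's thread
long directMs = sw.ElapsedMilliseconds;

sw.Restart();
Task offloaded = Task.Run(() => StartupSketch.Workflow()); // the Thread.Sleep runs on a pool thread
long offloadedMs = sw.ElapsedMilliseconds;

await Task.WhenAll(direct, offloaded);
Console.WriteLine($"direct call returned after {directMs} ms, Task.Run after {offloadedMs} ms");

static class StartupSketch
{
    public static async Task Workflow()
    {
        Thread.Sleep(500);   // synchronous work before the first await
        await Task.Yield();  // first real await
    }
}
```

With the direct call, the caller gets its Task back only after the synchronous prefix finishes, so it could not have reached a WhenAny during that time; with Task.Run it gets the Task back almost immediately.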

Implements the minimum viable slice of the Amazon.Lambda.DurableExecution
SDK: a workflow can run StepAsync and WaitAsync against a real Lambda,
with replay-aware checkpointing wired through to the AWS service.

Public API surface introduced:
- DurableFunction.WrapAsync — entry point that handles the durable
  execution envelope (input hydration, output construction, status mapping)
- IDurableContext.StepAsync / WaitAsync (4 Step overloads, 1 Wait)
- StepConfig with serializer hook (retry deferred to follow-up PR)
- ICheckpointSerializer interface
- [DurableExecution] attribute (recognized by future source generator)
- DurableExecutionException base + StepException

Internals:
- DurableExecutionHandler — Task.WhenAny race between user code and
  the suspension signal, returning Succeeded/Failed/Pending
- ExecutionState — replay-aware operation lookup and pending checkpoint
  buffer
- OperationIdGenerator — deterministic, replay-stable IDs
- TerminationManager — TaskCompletionSource-based suspension trigger
- LambdaDurableServiceClient — wraps AWSSDK.Lambda's checkpoint and
  state APIs

Tests:
- 86 unit tests covering enums, exceptions, models, configs,
  ID generation, termination, execution state, the handler race,
  the context (Step + Wait paths), and the WrapAsync entry point
- 8 end-to-end integration tests deploying real Lambdas via Docker on
  the provided.al2023 runtime: StepWaitStep, MultipleSteps, WaitOnly,
  LongerWait, ReplayDeterminism, RetrySucceeds, RetryExhausts, StepFails

Out of scope (follow-up PRs):
- IRetryStrategy, ExponentialRetryStrategy, retry decision factories
- DefaultJsonCheckpointSerializer
- DurableLogger replay-suppression (currently returns NullLogger)
- Callbacks, InvokeAsync, ParallelAsync, MapAsync, RunInChildContextAsync,
  WaitForConditionAsync — interface intentionally does not declare them
- Annotations source-generator integration
- DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
- dotnet new lambda.DurableFunction blueprint

stack-info: PR: #2360, branch: GarrettBeatty/stack/2

remove

update

update

update

update
@GarrettBeatty GarrettBeatty force-pushed the GarrettBeatty/stack/2 branch from e2d087a to 2a4575a Compare May 14, 2026 01:24

/// <summary>
/// Wrap a workflow that takes typed input and returns no value.
/// Wrap a workflow (typed input + output) with explicit Lambda client.
Collaborator Author

Updated the docs to be clearer about the Native AOT APIs.
