Add durable execution Step + Wait end-to-end #2360
Implements the minimum viable slice of the Amazon.Lambda.DurableExecution SDK: a workflow can run StepAsync and WaitAsync against a real Lambda, with replay-aware checkpointing wired through to the AWS service.

Public API surface introduced:
- DurableFunction.WrapAsync: entry point that handles the durable execution envelope (input hydration, output construction, status mapping)
- IDurableContext.StepAsync / WaitAsync (4 Step overloads, 1 Wait)
- StepConfig with retry strategy, semantics, and serializer hooks
- IRetryStrategy + ExponentialRetryStrategy + retry decision factories
- ICheckpointSerializer + DefaultJsonCheckpointSerializer
- [DurableExecution] attribute (recognized by future source generator)
- DurableExecutionException base + StepException

Internals:
- DurableExecutionHandler: Task.WhenAny race between user code and the suspension signal, returning Succeeded/Failed/Pending
- ExecutionState: replay-aware operation lookup and pending checkpoint buffer
- OperationIdGenerator: deterministic, replay-stable IDs
- TerminationManager: TaskCompletionSource-based suspension trigger
- LambdaDurableServiceClient: wraps AWSSDK.Lambda's checkpoint and state APIs

Tests:
- 86 unit tests covering enums, exceptions, models, configs, retry, ID generation, termination, execution state, the handler race, the context (Step + Wait paths), and the WrapAsync entry point
- 8 end-to-end integration tests deploying real Lambdas via Docker on the provided.al2023 runtime: StepWaitStep, MultipleSteps, WaitOnly, LongerWait, ReplayDeterminism, RetrySucceeds, RetryExhausts, StepFails

Out of scope (follow-up PRs):
- Callbacks, InvokeAsync, ParallelAsync, MapAsync, RunInChildContextAsync, WaitForConditionAsync (the interface intentionally does not declare them)
- DurableLogger replay-suppression
- Annotations source-generator integration
- DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
- dotnet new lambda.DurableFunction blueprint

stack-info: PR: #2360, branch: GarrettBeatty/stack/2
Pull request overview
Copilot reviewed 74 out of 74 changed files in this pull request and generated 24 comments.
Comments suppressed due to low confidence (2)
Libraries/test/Amazon.Lambda.DurableExecution.IntegrationTests/WaitOnlyTest.cs:1
- The if/else branching here (and in `StepWaitStepTest`, `LongerWaitTest`, `ReplayDeterminismTest`, `MultipleStepsTest`) makes the test pass under two very different service contracts. If the service silently changes which branch is taken, regressions on the unexercised branch go undetected. Consider asserting which mode is expected (or factoring the two cases into separate `[Fact]`s) so the test fails when the assumed behavior changes.

Libraries/test/Amazon.Lambda.DurableExecution.Tests/DurableContextTests.cs:1
- This test path (an already-`SUCCEEDED` wait on replay) isn't covered by `WaitOperation`'s switch statement: `OperationStatuses.Succeeded` returns `Task.FromResult(null)`, which is fine, but the corresponding case for `Failed`/`Cancelled` (which can occur when a wait is explicitly stopped) has no test or production handling and will fall through to `default → StartAsync`. Either add an explicit case + assertion or document that failed waits aren't representable.
/// <summary>
/// Custom serializer for the step result. Default is System.Text.Json.
/// </summary>
public ICheckpointSerializer? Serializer { get; set; }
This is a deviation from the previous design. Originally I had the serializer as part of the step config, but after playing around with the code I realized it needs to be a parameter of the StepAsync/WaitAsync functions so that Native AOT can have its own function calls.
  // Step 1: Validate the order (checkpointed automatically)
  var validation = await context.StepAsync(
-     async () => await ValidateOrder(input.OrderId),
+     async (step) => await ValidateOrder(input.OrderId),
In my design doc I previously had an API in which the user's function does not receive the step context (in case the user didn't need it). However, after looking at the Java code, I saw that they deprecated this function and kept only the API that gives the user the step context.
/// Uses a TaskCompletionSource that resolves when the function should suspend.
/// Only the first Terminate() call wins; subsequent calls are ignored.
/// </summary>
internal sealed class TerminationManager
This is used whenever we need to suspend the Lambda function. There are actually two ways to implement this functionality.
Python and Java raise a special SuspendExecution exception from inside wait() that bubbles up the user's stack to a top-level except in the wrapper, which converts it to PENDING.
.NET (and JS) don't throw. WaitAsync flips a one-shot signal on the TerminationManager and hands user code a Task that never completes, so the user's await parks forever. Meanwhile the wrapper runs Task.WhenAny over the user's workflow and that signal: the signal wins the race, the wrapper returns PENDING, and the abandoned user task gets GC'd.
Python and Java are able to do this because the exceptions they throw derive from BaseException and java.lang.Error respectively, which users do not usually catch (they usually catch only the Exception type). In .NET, I think every exception derives from Exception, and most users catch a generic Exception, so we can't do that.
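To make the no-throw approach concrete, here is a minimal sketch of the signal-plus-race idea. `SuspensionSignal`, `SuspendAndPark`, and `SuspensionDemo` are hypothetical names invented for illustration; this is not the PR's `TerminationManager` or wrapper code, only the pattern described above under those assumptions.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical stand-in for the idea behind TerminationManager; not the PR's code.
public sealed class SuspensionSignal
{
    private readonly TaskCompletionSource<bool> _tcs =
        new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);

    // Completes once suspension has been requested.
    public Task SignalTask => _tcs.Task;

    // One-shot: only the first call flips the signal, later calls return false.
    public bool Suspend() => _tcs.TrySetResult(true);

    // Flips the signal and hands the caller a task that never completes,
    // so the caller's await parks instead of throwing.
    public Task SuspendAndPark()
    {
        Suspend();
        return new TaskCompletionSource<bool>().Task;
    }
}

public static class SuspensionDemo
{
    // Returns "PENDING" when the signal wins the race, "SUCCEEDED" otherwise.
    public static async Task<string> RunAsync(Func<SuspensionSignal, Task> workflow)
    {
        var signal = new SuspensionSignal();
        var userTask = Task.Run(() => workflow(signal));

        var winner = await Task.WhenAny(userTask, signal.SignalTask);
        if (winner == signal.SignalTask)
            return "PENDING"; // the parked user task is simply abandoned

        await userTask; // rethrows if the workflow failed
        return "SUCCEEDED";
    }
}
```

A workflow that wraps the wait in `catch (Exception)` still suspends, because nothing is thrown through the user's stack:

```csharp
var status = await SuspensionDemo.RunAsync(async s =>
{
    try { await s.SuspendAndPark(); }
    catch (Exception) { /* not reached: the await parks, it does not throw */ }
});
// status is "PENDING"
```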
@@ -0,0 +1,7 @@
FROM public.ecr.aws/lambda/provided:al2023
Using Dockerfiles is a workaround until the Step Functions team allow-lists the .NET managed runtime to call durable functions.
if (duration < TimeSpan.FromSeconds(1))
    throw new ArgumentOutOfRangeException(nameof(duration), duration, "Wait duration must be at least 1 second.");

if (duration > TimeSpan.FromSeconds(31_622_400))
Should we be validating this on our end?
/// <c>sendOperationUpdate</c> vs <c>sendOperationUpdateAsync</c> is the model.
/// Today every call site is sync, so the API stays minimal.
/// </remarks>
internal sealed class CheckpointBatcher : IAsyncDisposable
Right now every StepAsync "batch" is one item. Eventually, when we implement Map/Parallel operations, they will run operations concurrently and checkpoints will actually be batched.
The fire-and-forget path and the START step will be implemented in #2363.
It's technically not needed in this PR because durable functions really only care about seeing SUCCEEDED and FAILED steps. But once we add retries, it needs to know how many attempts there were (i.e. the number of START steps), so it's required then.
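For readers unfamiliar with the queue-plus-worker shape, here is a rough sketch of a channel-backed batcher with a single background worker. `MiniBatcher` and its members are hypothetical; the real `CheckpointBatcher` also handles byte limits, flush intervals, and checkpoint tokens, none of which are modeled here.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical sketch of a channel-backed batcher; not the PR's CheckpointBatcher.
public sealed class MiniBatcher
{
    private readonly Channel<(string Item, TaskCompletionSource<bool> Done)> _queue =
        Channel.CreateUnbounded<(string Item, TaskCompletionSource<bool> Done)>();
    private readonly Func<IReadOnlyList<string>, Task> _flush;
    private readonly int _maxBatch;
    private readonly Task _worker;

    public MiniBatcher(Func<IReadOnlyList<string>, Task> flush, int maxBatch)
    {
        _flush = flush;
        _maxBatch = maxBatch;
        _worker = Task.Run(() => WorkerAsync());
    }

    // The returned task completes once the batch containing this item was flushed.
    public Task EnqueueAsync(string item)
    {
        var done = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        if (!_queue.Writer.TryWrite((item, done)))
            throw new InvalidOperationException("Batcher has already been drained.");
        return done.Task;
    }

    // Stops accepting items and waits until everything queued has been flushed.
    public Task DrainAsync()
    {
        _queue.Writer.TryComplete();
        return _worker;
    }

    private async Task WorkerAsync()
    {
        var reader = _queue.Reader;
        while (await reader.WaitToReadAsync())
        {
            // Take whatever is queued right now, up to the batch limit.
            var batch = new List<(string Item, TaskCompletionSource<bool> Done)>();
            while (batch.Count < _maxBatch && reader.TryRead(out var entry))
                batch.Add(entry);
            if (batch.Count == 0) continue;

            try
            {
                await _flush(batch.ConvertAll(e => e.Item));
                foreach (var e in batch) e.Done.TrySetResult(true);
            }
            catch (Exception ex)
            {
                // A failed flush fails every caller waiting on that batch.
                foreach (var e in batch) e.Done.TrySetException(ex);
            }
        }
    }
}
```

With sequential callers that await each `EnqueueAsync` before enqueuing the next item, every batch holds one item, which matches the situation described above; batches only grow once several operations enqueue concurrently.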
// the termination signal. When TerminationManager fires (e.g., WaitAsync),
// we need the WhenAny race below to resolve immediately without waiting
// for the user task to reach an await point.
var userTask = Task.Run(userHandler);
The reason for this: imagine the user had

    async Task<TestResult> Workflow(TestEvent input, IDurableContext ctx)
    {
        // Imagine the user does CPU work or sync I/O before any await:
        Thread.Sleep(2000); // or a long compute loop
        await ctx.WaitAsync(TimeSpan.FromSeconds(5)); // first real await
        ...
    }

If we called userHandler() directly instead of Task.Run(userHandler):

    var userTask = userHandler(); // ← starts running RIGHT HERE, synchronously
    var winner = await Task.WhenAny(userTask, terminationManager.TerminationTask);

The userHandler() invocation runs synchronously up to the first real await. If the user sleeps, blocks on sync I/O, or does any non-yielding work first, we don't even reach the await Task.WhenAny(...) line yet. The wrapper is stuck inside the user's call.
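The difference is easy to observe in isolation. The sketch below is a standalone illustration with hypothetical names (`SyncPrefixDemo`, `Workflow`), not code from the PR: it checks whether the blocking prefix of an async method has already run by the time the caller gets a `Task` back.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SyncPrefixDemo
{
    // Hypothetical workflow: blocking work first, then the first real await.
    private static async Task Workflow(Action onBlockingPartDone)
    {
        Thread.Sleep(200);          // stands in for CPU work or sync I/O
        onBlockingPartDone();
        await Task.Delay(10);       // first real await
    }

    // Direct call: the caller is blocked until the workflow's first await.
    public static bool DirectCallBlocks()
    {
        var done = false;
        var task = Workflow(() => done = true);
        var blocked = done;         // already true: the prefix ran on our thread
        task.Wait();
        return blocked;
    }

    // Task.Run: the caller gets the task back before the prefix has finished.
    public static bool TaskRunBlocks()
    {
        var done = false;
        var task = Task.Run(() => Workflow(() => Volatile.Write(ref done, true)));
        var blocked = Volatile.Read(ref done); // prefix is still sleeping on a pool thread
        task.Wait();
        return blocked;
    }
}
```

Under these assumptions `DirectCallBlocks()` returns `true` and `TaskRunBlocks()` returns `false`, which is why the wrapper can reach its `Task.WhenAny` immediately only in the `Task.Run` version.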
Implements the minimum viable slice of the Amazon.Lambda.DurableExecution SDK: a workflow can run StepAsync and WaitAsync against a real Lambda, with replay-aware checkpointing wired through to the AWS service.

Public API surface introduced:
- DurableFunction.WrapAsync: entry point that handles the durable execution envelope (input hydration, output construction, status mapping)
- IDurableContext.StepAsync / WaitAsync (4 Step overloads, 1 Wait)
- StepConfig with serializer hook (retry deferred to follow-up PR)
- ICheckpointSerializer interface
- [DurableExecution] attribute (recognized by future source generator)
- DurableExecutionException base + StepException

Internals:
- DurableExecutionHandler: Task.WhenAny race between user code and the suspension signal, returning Succeeded/Failed/Pending
- ExecutionState: replay-aware operation lookup and pending checkpoint buffer
- OperationIdGenerator: deterministic, replay-stable IDs
- TerminationManager: TaskCompletionSource-based suspension trigger
- LambdaDurableServiceClient: wraps AWSSDK.Lambda's checkpoint and state APIs

Tests:
- 86 unit tests covering enums, exceptions, models, configs, ID generation, termination, execution state, the handler race, the context (Step + Wait paths), and the WrapAsync entry point
- 8 end-to-end integration tests deploying real Lambdas via Docker on the provided.al2023 runtime: StepWaitStep, MultipleSteps, WaitOnly, LongerWait, ReplayDeterminism, RetrySucceeds, RetryExhausts, StepFails

Out of scope (follow-up PRs):
- IRetryStrategy, ExponentialRetryStrategy, retry decision factories
- DefaultJsonCheckpointSerializer
- DurableLogger replay-suppression (currently returns NullLogger)
- Callbacks, InvokeAsync, ParallelAsync, MapAsync, RunInChildContextAsync, WaitForConditionAsync (the interface intentionally does not declare them)
- Annotations source-generator integration
- DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
- dotnet new lambda.DurableFunction blueprint
stack-info: PR: #2360, branch: GarrettBeatty/stack/2 remove update update update update
/// <summary>
/// Wrap a workflow that takes typed input and returns no value.
/// Wrap a workflow (typed input + output) with explicit Lambda client.
Updated the doc to be clearer about the Native AOT APIs.
Stacked PRs:
Amazon.Lambda.DurableExecution#2363#2216
What
This is the first real end-to-end slice of the `Amazon.Lambda.DurableExecution` SDK. After this PR, you can write a workflow that calls `StepAsync` and `WaitAsync`, deploy it as a Lambda, and have it run against the actual durable-execution service with replay-aware checkpointing.

Public API introduced:
- `DurableFunction.WrapAsync`: `JsonSerializerContext` (AOT) overloads.
- `IDurableContext`: `StepAsync` (3 overloads: reflection, void, and AOT-safe `ICheckpointSerializer`), `WaitAsync`, `LambdaContext`, `Logger`, `ExecutionContext`.
- `StepConfig`: `RetryStrategy` and `Semantics` get wired in #2363.
- `ICheckpointSerializer`: `StepAsync` overload.
- `DurableExecutionException`

Why
Durable execution lets a Lambda function suspend and resume across invocations by checkpointing each side-effect to the service. This PR lays down the minimum needed for everything that comes after: retries, callbacks, parallelism, the Annotations integration, and the test runner package all build on the `Step + Wait` primitives and the replay machinery here.

I kept the scope narrow on purpose. Anything that does not block real-Lambda execution is pushed to follow-up PRs (see Out of scope) so this stays reviewable.
How
The runtime runs a `Task.WhenAny` race in `DurableExecutionHandler` between the user's workflow task and a suspension signal. Every `StepAsync`/`WaitAsync` call goes to a per-operation class (`StepOperation`/`WaitOperation`) that checks `ExecutionState` (built from operations the service replayed) before deciding what to do:
- Replay: there is a `SUCCEEDED`/`FAILED` record from a prior invocation. Return the cached result (or rethrow) without re-running user code.
- Fresh step: run the user code, write a `SUCCEED`/`FAIL` checkpoint, return.
- Fresh wait: call `TerminationManager.SuspendAndAwait` to win the `WhenAny` race, returning `Pending` so the service re-invokes us when the timer fires.

Replay determinism comes from `OperationIdGenerator`, which produces stable IDs from the workflow's call sequence so the same step always lands on the same record across invocations.

Checkpoint serialization is opt-in: the reflection-based `StepAsync` overload uses `System.Text.Json` (annotated `RequiresUnreferencedCode`/`RequiresDynamicCode`); for NativeAOT or trimmed deployments, callers pass an `ICheckpointSerializer` to the dedicated overload. `StepConfig` is intentionally empty in this PR; it's the configuration carrier for `RetryStrategy` (#2363) and future per-step knobs. I considered adding a `Serializer` property there, but rejected it because the serializer is type-bound (`ICheckpointSerializer`) while `StepConfig` is type-erased.

Checkpoint flushing goes through `CheckpointBatcher`, a `Channel`-based queue with a single background worker. Each `EnqueueAsync` returns a `Task` that completes when the worker has flushed the containing batch to the service. `WrapAsyncCore` sets up the batcher, threads it into `DurableContext`, and `await`s `DrainAsync()` before returning to Lambda. Defaults match the Java SDK (`MaxBatchOperations = 200`, `MaxBatchBytes = 750 KB`, `FlushInterval = 0`, i.e. flush as soon as the queue drains). There's a `TODO` for the async-flush overload that Map/Parallel/Child Context will eventually need.

Key files:
- `DurableFunction.cs`: envelope + replay-state hydration + batcher lifecycle
- `DurableContext.cs`: facade; constructs the right per-operation class
- `Internal/DurableOperation.cs`: abstract base with `StartAsync`/`ReplayAsync` template methods
- `Internal/StepOperation.cs` / `Internal/WaitOperation.cs`: per-op replay logic
- `Internal/CheckpointBatcher.cs` / `Internal/CheckpointBatcherConfig.cs`: checkpoint queue + worker
- `Internal/ReflectionJsonCheckpointSerializer.cs`: default reflection-based serializer for the JIT-only overload
- `Internal/ExecutionState.cs`: operation lookup + replay-mode flag
- `Internal/OperationIdGenerator.cs`: deterministic IDs
- `Internal/TerminationManager.cs`: suspension trigger
- `Internal/DurableExecutionHandler.cs`: the `WhenAny` race
- `Services/LambdaDurableServiceClient.cs`: service client wrapper

Testing
Unit tests in `Amazon.Lambda.DurableExecution.Tests` cover: enums, exceptions, models, `OperationIdGenerator`, `TerminationManager`, `ExecutionState`, the handler race, both `Step` and `Wait` paths through `DurableContext`, `DurableFunction.WrapAsync`, and `CheckpointBatcher` (enqueue/flush, batching within window, overflow splitting, error propagation, drain, dispose, token updates, concurrency).

End-to-end integration tests in `Amazon.Lambda.DurableExecution.IntegrationTests` build each test workflow into a Docker container, deploy it as a real Lambda on `provided.al2023`, and run against the durable-execution service:
- `StepWaitStep`: basic Step → Wait → Step sequence
- `MultipleSteps`: several sequential steps
- `WaitOnly`: wait without any steps
- `LongerWait`: wait that spans multiple invocations
- `ReplayDeterminism`: verifies stable operation IDs across replays
- `StepFails`: verifies a failed step surfaces correctly

Out of scope (follow-up PRs)
- `IRetryStrategy`, `ExponentialRetryStrategy`, retry decision factories, `StepConfig.RetryStrategy`/`StepConfig.Semantics` (Adds retry support to the `Amazon.Lambda.DurableExecution` #2363)
- `StepException` and the per-step failure exception type
- `DurableLogger` replay-suppression (currently returns `NullLogger`)
- Callbacks, `InvokeAsync`, `ParallelAsync`, `MapAsync`, `RunInChildContextAsync`, `WaitForConditionAsync`: the interface intentionally does not declare these yet
- Annotations source-generator integration (`[DurableExecution]` marker attribute)
- `DurableTestRunner` / `Amazon.Lambda.DurableExecution.Testing` package
- `dotnet new lambda.DurableFunction` blueprint
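
As a reading aid for the replay path described under How, here is a toy model of replay-aware steps with call-order IDs. `MiniContext`, `ReplayDemo`, the counter-based ID scheme, and the string checkpoint format are all assumptions made for illustration; the actual `OperationIdGenerator`, `ExecutionState`, and checkpoint payloads in this PR may look quite different.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical miniature of the replay path; none of these are the PR's real types.
public sealed class MiniContext
{
    private readonly Dictionary<string, string> _recorded; // operation ID -> checkpointed result
    private int _counter;

    // Checkpoints produced by this invocation (format is illustrative only).
    public List<string> NewCheckpoints { get; } = new List<string>();

    public MiniContext(Dictionary<string, string> recorded) => _recorded = recorded;

    // IDs depend only on call order, so the n-th step gets the same ID on every invocation.
    private string NextId() => (++_counter).ToString();

    public async Task<string> StepAsync(Func<Task<string>> body)
    {
        var id = NextId();
        if (_recorded.TryGetValue(id, out var cached))
            return cached;                       // replay: user code is skipped

        var result = await body();               // fresh: run user code once
        NewCheckpoints.Add($"{id}:SUCCEED:{result}");
        return result;
    }
}

public static class ReplayDemo
{
    // Runs a two-step workflow against the given recorded state.
    public static async Task<(string Result, int BodyRuns, int Checkpoints)> RunAsync(
        Dictionary<string, string> recorded)
    {
        var ctx = new MiniContext(recorded);
        var runs = 0;
        var a = await ctx.StepAsync(() => { runs++; return Task.FromResult("validated"); });
        var b = await ctx.StepAsync(() => { runs++; return Task.FromResult("charged"); });
        return ($"{a},{b}", runs, ctx.NewCheckpoints.Count);
    }
}
```

Running the workflow with an empty record executes both step bodies; running it again with step `1` already recorded returns the same result while executing only the second body. The sketch also shows why user code must call steps in the same order on every invocation: the ID is derived from position alone.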