Add WaitForConditionAsync polling primitive (DOTNET-8665)#2376

Draft
GarrettBeatty wants to merge 12 commits into gcbeatty/durable-child-context from gcbeatty/durable-waitforcondition

@GarrettBeatty GarrettBeatty commented May 14, 2026

#2216

Summary

Adds service-mediated polling to the .NET Durable Execution SDK. WaitForConditionAsync repeatedly evaluates a check function, with a configurable wait strategy between attempts. Each iteration runs in its own Lambda invocation: the workflow suspends via STEP+RETRY checkpoints carrying NextAttemptDelaySeconds, so no compute time is consumed while waiting between attempts.

Stacked on top of #2372 (Wave 0 cross-cutting types).

Fixes DOTNET-8665.

Public surface

  • IDurableContext.WaitForConditionAsync<TState> (single overload; per-iteration state is serialized via the ILambdaSerializer registered on ILambdaContext.Serializer, configured via LambdaBootstrapBuilder.Create(handler, serializer) — same pattern as StepAsync / RunInChildContextAsync)
  • IConditionCheckContext (Logger + AttemptNumber)
  • WaitForConditionConfig<TState> (required InitialState + WaitStrategy)
  • IWaitStrategy<TState> with Decide(state, attempt) returning WaitDecision
  • WaitDecision (readonly record struct; ShouldContinue + Delay; Stop() / ContinueAfter(TimeSpan) factories)
  • WaitStrategy factories: Exponential / Linear / Fixed / FromDelegate, each accepting an optional Func<TState, bool> isDone predicate
  • WaitForConditionException with AttemptsExhausted and LastState (preserved across both live execution and replay)
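
To make the strategy contract above concrete, here is a minimal, self-contained sketch of how a fixed-delay strategy with an isDone predicate might decide between stopping and continuing. The types are local stand-ins written for this example, not the SDK's definitions: the member shapes follow the list above, but `FixedWaitStrategySketch`, its constructor parameters, and the use of `InvalidOperationException` in place of `WaitForConditionException` are assumptions.

```csharp
using System;

// Usage: poll until the state reads "READY", waiting 10 seconds between attempts.
var strategy = new FixedWaitStrategySketch<string>(
    delay: TimeSpan.FromSeconds(10),
    maxAttempts: 3,
    isDone: state => state == "READY");

WaitDecision pending = strategy.Decide("PENDING", attempt: 1);
WaitDecision ready = strategy.Decide("READY", attempt: 2);
Console.WriteLine($"pending: continue={pending.ShouldContinue}, delay={pending.Delay}");
Console.WriteLine($"ready:   continue={ready.ShouldContinue}");

// Stand-in for the WaitDecision shape described above (ShouldContinue + Delay).
public readonly record struct WaitDecision(bool ShouldContinue, TimeSpan Delay)
{
    public static WaitDecision Stop() => new(false, TimeSpan.Zero);
    public static WaitDecision ContinueAfter(TimeSpan delay) => new(true, delay);
}

// Stand-in for IWaitStrategy<TState>.
public interface IWaitStrategy<TState>
{
    WaitDecision Decide(TState state, int attempt);
}

// Hypothetical fixed-delay strategy; not the SDK's WaitStrategy.Fixed factory.
public sealed class FixedWaitStrategySketch<TState> : IWaitStrategy<TState>
{
    private readonly TimeSpan _delay;
    private readonly int _maxAttempts;
    private readonly Func<TState, bool> _isDone;

    public FixedWaitStrategySketch(TimeSpan delay, int maxAttempts, Func<TState, bool> isDone)
    {
        _delay = delay;
        _maxAttempts = maxAttempts;
        _isDone = isDone;
    }

    public WaitDecision Decide(TState state, int attempt)
    {
        // The predicate is checked first, so a finished state never counts as exhaustion.
        if (_isDone(state)) return WaitDecision.Stop();

        // The PR throws WaitForConditionException here; a BCL exception keeps the sketch standalone.
        if (attempt >= _maxAttempts)
            throw new InvalidOperationException($"Condition not met after {attempt} attempts.");

        return WaitDecision.ContinueAfter(_delay);
    }
}
```

Whether the real strategies treat the attempt limit as inclusive, or check the predicate before the limit, is not stated in this description; the ordering here is one reasonable choice.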

Internal

  • WaitForConditionOperation<TState> wire format = STEP + SubType "WaitForCondition". Each polling iteration emits Action=RETRY with the new state in payload and NextAttemptDelaySeconds for the service to schedule the next invocation.
  • Serialization is delegated to the registered ILambdaSerializer via stream-based Serialize<T> / Deserialize<T> calls — no AOT trim attributes on the public API. Mirrors StepOperation / ChildContextOperation.
  • Strategies signal max-attempts exhausted by throwing WaitForConditionException directly from Decide(); the operation enriches with LastState before checkpointing FAIL.
  • LastState survives FAIL replay: serialized into FAIL payload at write time, deserialized in BuildFailureException with warning-logged fallback for legacy/corrupt data.
  • ExponentialBackoff helper extracted for sharing with ExponentialRetryStrategy. Math is byte-for-byte identical.
  • Reuses OperationSubTypes.WaitForCondition from Wave 0.
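
As a rough illustration of the wire format described above, the sketch below maps the outcome of one polling iteration to a checkpoint update. Everything in it is hypothetical: `CheckpointUpdateSketch`, `PollingSketch`, and `JobState` are invented for the example, `System.Text.Json` stands in for the registered ILambdaSerializer, and rounding the delay up to whole seconds is an assumption rather than documented behavior.

```csharp
using System;
using System.Text.Json;

// One iteration that is not done yet: emit RETRY with the new state and a delay.
var retry = PollingSketch.NextUpdate(new JobState("RUNNING", 40), done: false, delay: TimeSpan.FromSeconds(7.5));
Console.WriteLine(retry);

// A later iteration whose state satisfies the condition: emit SUCCEED.
var success = PollingSketch.NextUpdate(new JobState("DONE", 100), done: true, delay: TimeSpan.Zero);
Console.WriteLine(success);

// Example per-iteration state (hypothetical).
public sealed record JobState(string Status, int Percent);

// Simplified stand-in for a checkpoint update; the real wire model has more fields.
public sealed record CheckpointUpdateSketch(
    string Type, string SubType, string Action, string Payload, int? NextAttemptDelaySeconds);

public static class PollingSketch
{
    public static CheckpointUpdateSketch NextUpdate<TState>(TState newState, bool done, TimeSpan delay)
    {
        // The PR serializes through ILambdaSerializer; JSON here keeps the sketch standalone.
        string payload = JsonSerializer.Serialize(newState);

        return done
            ? new CheckpointUpdateSketch("STEP", "WaitForCondition", "SUCCEED", payload, null)
            : new CheckpointUpdateSketch("STEP", "WaitForCondition", "RETRY", payload,
                (int)Math.Ceiling(delay.TotalSeconds)); // assumed: round up to whole seconds
    }
}
```

The FAIL path (max attempts exhausted, with LastState in the payload) is omitted to keep the example short.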

Defaults

60 attempts / 5s initial / 300s max / 1.5x rate / Full jitter — distinct from RetryStrategy.Default and matching Python/JS/Java reference SDKs.
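
For intuition about these defaults, the sketch below computes a capped exponential ceiling and applies full jitter to it. It assumes the common formulation in which the ceiling for attempt n is min(max, initial × rate^(n−1)) and the actual delay is drawn uniformly between zero and that ceiling; the PR's ExponentialBackoff helper may index attempts or clamp minimums differently. `BackoffSketch` is a hypothetical name.

```csharp
using System;

var rng = new Random(42); // fixed seed only so repeated runs of the demo match

for (int attempt = 1; attempt <= 12; attempt++)
{
    double ceiling = BackoffSketch.CeilingSeconds(attempt);
    double delay = BackoffSketch.FullJitterSeconds(attempt, rng);
    Console.WriteLine($"attempt {attempt,2}: ceiling {ceiling,6:F1}s, jittered delay {delay,6:F1}s");
}

public static class BackoffSketch
{
    // Capped exponential growth: initial * rate^(attempt - 1), never above maxSeconds.
    public static double CeilingSeconds(
        int attempt, double initialSeconds = 5, double maxSeconds = 300, double rate = 1.5)
        => Math.Min(maxSeconds, initialSeconds * Math.Pow(rate, attempt - 1));

    // Full jitter: a uniform draw from [0, ceiling).
    public static double FullJitterSeconds(int attempt, Random rng)
        => rng.NextDouble() * CeilingSeconds(attempt);
}
```

Under these assumptions the ceiling starts at 5s, grows to 7.5s and 11.25s on the next two attempts, and first reaches the 300s cap at attempt 12.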

Cross-SDK note: Python returns success on max-attempts exhausted; .NET/Java/JS throw. Workflows ported from Python should review for new failure modes. Documented in the design doc.

Test plan

  • Build clean (zero warnings, TreatWarningsAsErrors enforced) on net8.0 and net10.0
  • 41 new unit tests pass alongside existing 161 (202 total), including each wait strategy, isDone predicate paths, max-attempts exhaustion, user-check exceptions, replay determinism, exponential backoff bounds, and corrupt-payload fallback logging
  • 5 new integration tests build successfully (require AWS credentials to run)

🤖 Generated with Claude Code


COPY bin/publish/ ${LAMBDA_TASK_ROOT}

ENTRYPOINT ["/var/task/bootstrap"]

@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-wave0 branch from 464c591 to d308c3b Compare May 14, 2026 21:49
@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-waitforcondition branch 2 times, most recently from 7f91202 to 3fa06ce Compare May 14, 2026 22:19
GarrettBeatty and others added 11 commits May 15, 2026 18:14
Implements the minimum viable slice of the Amazon.Lambda.DurableExecution
SDK: a workflow can run StepAsync and WaitAsync against a real Lambda,
with replay-aware checkpointing wired through to the AWS service.

Public API surface introduced:
- DurableFunction.WrapAsync — entry point that handles the durable
  execution envelope (input hydration, output construction, status mapping)
- IDurableContext.StepAsync / WaitAsync (4 Step overloads, 1 Wait)
- StepConfig with serializer hook (retry deferred to follow-up PR)
- ICheckpointSerializer interface
- [DurableExecution] attribute (recognized by future source generator)
- DurableExecutionException base + StepException

Internals:
- DurableExecutionHandler — Task.WhenAny race between user code and
  the suspension signal, returning Succeeded/Failed/Pending
- ExecutionState — replay-aware operation lookup and pending checkpoint
  buffer
- OperationIdGenerator — deterministic, replay-stable IDs
- TerminationManager — TaskCompletionSource-based suspension trigger
- LambdaDurableServiceClient — wraps AWSSDK.Lambda's checkpoint and
  state APIs

Tests:
- 86 unit tests covering enums, exceptions, models, configs,
  ID generation, termination, execution state, the handler race,
  the context (Step + Wait paths), and the WrapAsync entry point
- 8 end-to-end integration tests deploying real Lambdas via Docker on
  the provided.al2023 runtime: StepWaitStep, MultipleSteps, WaitOnly,
  LongerWait, ReplayDeterminism, RetrySucceeds, RetryExhausts, StepFails

Out of scope (follow-up PRs):
- IRetryStrategy, ExponentialRetryStrategy, retry decision factories
- DefaultJsonCheckpointSerializer
- DurableLogger replay-suppression (currently returns NullLogger)
- Callbacks, InvokeAsync, ParallelAsync, MapAsync, RunInChildContextAsync,
  WaitForConditionAsync — interface intentionally does not declare them
- Annotations source-generator integration
- DurableTestRunner / Amazon.Lambda.DurableExecution.Testing package
- dotnet new lambda.DurableFunction blueprint

stack-info: PR: #2360, branch: GarrettBeatty/stack/2

remove

update

update

update

update
Match the Python / Java / JavaScript reference SDKs' replay-mode model:
the workflow is "replaying" iff it has not yet revisited every
checkpointed completed user-replayable operation. A single global flag
flipped on the first fresh op (the prior model) misclassified workflow-
body code that runs before the first step and would not generalize to
Map/Parallel/Callback later.

ExecutionState changes:
- Replace `Mode`/`ExecutionMode`/`EnterExecutionMode()` with `IsReplaying`
  + `TrackReplay(operationId)`.
- Initial replay decision: any non-EXECUTION op present means we're
  replaying. The service always sends an EXECUTION-type op carrying the
  input payload — that's bookkeeping, not user history, so it does not
  count toward replay (matches Python execution.py:258, Java
  ExecutionManager:81, JS execution-context.ts:62).
- TrackReplay flips IsReplaying false once every checkpointed terminal-
  status non-EXECUTION op has been visited. Terminal set matches
  Python's: SUCCEEDED, FAILED, CANCELLED, STOPPED.

Operation changes:
- DurableOperation.ExecuteAsync calls TrackReplay(OperationId) at the
  top, so every operation participates in visit accounting without each
  subclass needing to remember.
- StepOperation/WaitOperation drop their manual EnterExecutionMode calls.

Tests:
- ExecutionStateTests rewritten around IsReplaying/TrackReplay, including
  pinning regressions: only-EXECUTION-op ⇒ NotReplaying, all-visited ⇒
  flips out of replay, PENDING ops do not block transition, idempotency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
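
The replay rule described in the commit message above can be sketched as follows. This is an illustrative model only: `ReplayTrackerSketch` and its tuple-based input are invented for the example and are not the SDK's ExecutionState; only the rule itself (EXECUTION ops are ignored, terminal-status ops must all be visited, PENDING ops do not block the transition) comes from the commit message.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var tracker = new ReplayTrackerSketch(new[]
{
    (Id: "exec", Type: "EXECUTION", Status: "STARTED"),  // bookkeeping, ignored
    (Id: "op-1", Type: "STEP", Status: "SUCCEEDED"),     // must be revisited
    (Id: "op-2", Type: "STEP", Status: "PENDING"),       // does not block the transition
});

Console.WriteLine($"before visiting op-1: IsReplaying={tracker.IsReplaying}");
tracker.TrackReplay("op-1");
Console.WriteLine($"after visiting op-1:  IsReplaying={tracker.IsReplaying}");

public sealed class ReplayTrackerSketch
{
    private static readonly HashSet<string> TerminalStatuses =
        new() { "SUCCEEDED", "FAILED", "CANCELLED", "STOPPED" };

    private readonly HashSet<string> _unvisitedTerminalOps;

    public bool IsReplaying { get; private set; }

    public ReplayTrackerSketch(IEnumerable<(string Id, string Type, string Status)> checkpointed)
    {
        var userOps = checkpointed.Where(op => op.Type != "EXECUTION").ToList();

        // Any non-EXECUTION op in the checkpoint means this invocation starts in replay.
        IsReplaying = userOps.Count > 0;

        _unvisitedTerminalOps = userOps
            .Where(op => TerminalStatuses.Contains(op.Status))
            .Select(op => op.Id)
            .ToHashSet();
    }

    // Idempotent: visiting the same operation twice has no further effect.
    public void TrackReplay(string operationId)
    {
        if (!IsReplaying) return;
        _unvisitedTerminalOps.Remove(operationId);
        if (_unvisitedTerminalOps.Count == 0) IsReplaying = false;
    }
}
```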
…Serializer

DurableExecution now reads the registered ILambdaSerializer from the per-invocation
ILambdaContext (added in the prior PR) for both step-result checkpointing and
workflow input/output. AOT-safety is now determined entirely by which serializer
the user registers with LambdaBootstrapBuilder.Create — there is no longer a
forked path between reflection-based and AOT-safe APIs.

Removed:
- ICheckpointSerializer<T> + SerializationContext record
- ReflectionJsonCheckpointSerializer<T>
- The four JsonSerializerContext-taking overloads of DurableFunction.WrapAsync
- The IDurableContext.StepAsync overload that took ICheckpointSerializer<T>
- All [RequiresUnreferencedCode]/[RequiresDynamicCode] attributes and their
  related [UnconditionalSuppressMessage] shims

Net result: 8 WrapAsync overloads → 4, 3 StepAsync overloads → 2, zero trim
attributes in the public API. The AOT smoke test continues to publish with zero
IL2026/IL3050 warnings.
- Wrap LambdaDurableServiceClient SDK calls in DurableExecutionException with
  durable-execution context (which call, which ARN). User logs no longer show
  bare AWSSDK stack traces. Update IsTerminalCheckpointError to unwrap the
  inner AmazonServiceException for classification.
- Move public-API files out of Models/, Config/, Exceptions/ into the project
  root so folder layout matches the Amazon.Lambda.DurableExecution namespace.
- Replace string action literals ("SUCCEED", "FAIL", "START") with the
  Amazon.Lambda.OperationAction enum constants.
- Replace hand-rolled ToHex with Amazon.Util.AWSSDKUtils.ToHex. Drop the
  netstandard2.0 SHA-256 fallback now that DurableExecution targets net8+.
- Spell "iff" as "if and only if" in ExecutionState replay-mode docs.

Tests updated for the new wrapping shape: terminal classification asserts on
DurableExecutionException with the inner SDK exception preserved; transient
and hydration paths assert ThrowsAsync<DurableExecutionException> with
InnerException set to the original AmazonServiceException.
stack-info: PR: #2363, branch: GarrettBeatty/stack/3
Adds child-context support to the .NET Durable Execution SDK. A child
context is a logical sub-workflow with its own deterministic
operation-ID space, persisted as a CONTEXT operation so subsequent
invocations replay the cached value without re-executing the function.

Public surface:
- IDurableContext.RunInChildContextAsync<T> (reflection + AOT-safe
  ICheckpointSerializer<T> overloads, plus a void overload).
- ChildContextConfig with SubType (observability label) and
  ErrorMapping (transform exceptions before they surface to the caller).
- ChildContextException for failure surfacing.

Used as a building block for upcoming WaitForCallbackAsync.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lays down shared types/constants for the upcoming durable-execution
context operations (Callbacks, Invoke, Parallel, Map, WaitForCondition)
and updates the design doc to match decisions reached after comparing
against the Python, JS, and Java reference SDKs.

SDK changes:
- OperationSubTypes constants class (Step, Wait, Callback, WaitForCallback,
  Invoke, WaitForCondition, Parallel, ParallelBranch, Map, MapIteration).
  Replaces hard-coded SubType literals in StepOperation and WaitOperation.
- OperationStatuses.TimedOut for callback/invoke timeout handling.

Design-doc alignment:
- Drop Serializer field from CallbackConfig, InvokeConfig,
  ChildContextConfig. Custom serializers flow through AOT-safe
  ICheckpointSerializer<T> overloads (matches the existing StepConfig
  pattern documented at line 1247).
- InvokeConfig gains TenantId (matches Python/JS/Java); drops
  PayloadSerializer / ResultSerializer.
- BatchItemStatus.Cancelled -> Started. The SDK does not synchronously
  cancel branches; the wire state of items still in flight when the
  batch resolves (e.g., FirstSuccessful short-circuit) is STARTED.
  Matches Python and JS.
- IBatchResult<T> expanded to the full JS/Python surface: adds Started,
  GetErrors(), HasFailure, SuccessCount, FailureCount, StartedCount,
  TotalCount.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-wave0 branch from d308c3b to be4c3ad Compare May 18, 2026 15:23
Adds service-mediated polling to the .NET Durable Execution SDK.
WaitForConditionAsync repeatedly evaluates a check function with a
configurable wait strategy between attempts; each iteration is its own
Lambda invocation (suspended via STEP+RETRY checkpoints carrying
NextAttemptDelaySeconds), so polling does not consume compute time.

Public surface:
- IDurableContext.WaitForConditionAsync<TState> (single overload; the
  per-iteration state checkpoint is serialized via the ILambdaSerializer
  registered on ILambdaContext.Serializer, configured via
  LambdaBootstrapBuilder.Create(handler, serializer))
- IConditionCheckContext (Logger + AttemptNumber)
- WaitForConditionConfig<TState> (required InitialState + WaitStrategy)
- IWaitStrategy<TState> with Decide(state, attempt) returning
  WaitDecision
- WaitDecision (readonly record struct, ShouldContinue + Delay,
  Stop() / ContinueAfter(TimeSpan) factories)
- WaitStrategy factories: Exponential / Linear / Fixed / FromDelegate,
  each accepting an optional Func<TState, bool> isDone predicate
- WaitForConditionException with AttemptsExhausted and LastState
  (preserved across both live execution and replay)

Internal:
- WaitForConditionOperation<TState> wire format = STEP + SubType
  "WaitForCondition". Each polling iteration emits Action=RETRY with
  the new state in payload and NextAttemptDelaySeconds for the
  service to schedule the next invocation.
- Serialization is delegated to the registered ILambdaSerializer via
  Stream-based Serialize<T>/Deserialize<T> calls; no AOT trim attributes
  on the public API. Mirrors StepOperation/ChildContextOperation.
- Strategies signal max-attempts exhausted by throwing
  WaitForConditionException directly from Decide(); the operation
  enriches with LastState before checkpointing FAIL.
- LastState survives FAIL replay: serialized into FAIL payload at
  write time, deserialized in BuildFailureException with
  warning-logged fallback for legacy/corrupt data.
- ExponentialBackoff helper extracted for sharing with
  ExponentialRetryStrategy. Math is byte-for-byte identical.
- Reuses OperationSubTypes.WaitForCondition from Wave 0.

Defaults: 60 attempts / 5s initial / 300s max / 1.5x rate / Full jitter,
distinct from RetryStrategy.Default and matching Python/JS/Java reference
SDKs. (Note: Python returns success on max-attempts; .NET/Java/JS throw.
This is documented in the design doc.)

Adds 41 unit tests + 5 integration tests covering each wait strategy,
isDone predicate paths, max-attempts exhaustion, user-check exceptions,
replay determinism, exponential backoff bounds, and corrupt-payload
fallback logging.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-waitforcondition branch from 3fa06ce to 67f0c0c Compare May 18, 2026 15:50
@GarrettBeatty GarrettBeatty force-pushed the gcbeatty/durable-wave0 branch 3 times, most recently from ad4d208 to 3acbed5 Compare May 20, 2026 17:46
Base automatically changed from gcbeatty/durable-wave0 to gcbeatty/durable-child-context May 20, 2026 17:46