fix: ensure FlushMutex is released when WASM flush throws#204
Open
jonathannorris wants to merge 2 commits into
Open
fix: ensure FlushMutex is released when WASM flush throws#204jonathannorris wants to merge 2 commits into
jonathannorris wants to merge 2 commits into
Conversation
When localBucketing.FlushEventQueue throws (e.g. AssemblyScript abort inside the WASM event queue path), the previous implementation let the exception escape between StartFlush() and EndFlush(). That permanently leaked the FlushMutex semaphore, so every subsequent flush deadlocked on StartFlush() and events accumulated in the WASM heap indefinitely - compounding any pre-existing heap corruption. Wrap the flush body in try/finally so EndFlush() always runs. Catch exceptions from GetPayloads() and the per-payload mark-success/failure calls so a failing flush surfaces as a failed FlushedEvents subscriber notification instead of an uncaught exception.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR hardens the local SDK’s event flush path so a WASM trap does not leak the flush mutex and stall future event delivery. It fits into the local bucketing/event pipeline by making EventQueue.FlushEvents() more resilient when the embedded WASM engine throws.
Changes:
- Wrapped
FlushEvents()intry/finallysoEndFlush()always runs afterStartFlush(). - Added exception handling around
GetPayloads()and payload bookkeeping to convert flush-path failures into failedFlushedEventsnotifications. - Added MSTests covering
FlushEventQueuethrow scenarios with a fakeILocalBucketing.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
DevCycle.SDK.Server.Local/Api/EventQueue.cs |
Refactors flush execution into an inner method and adds new exception-handling paths around payload retrieval and per-payload completion logic. |
DevCycle.SDK.Server.Local.MSTests/EventQueueTest.cs |
Adds regression tests and a fake ILocalBucketing implementation to simulate flush-path WASM traps. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- EventQueue.cs: revert per-payload catch from 'Exception' back to 'DevCycleException'. The broadened catch was over-aggressive and would have re-queued payloads as retryable failures if OnPayloadSuccess/OnPayloadFailure themselves threw, including re-sending events that DevCycle had already accepted with a 201. The outer try/finally on FlushEvents() ensures EndFlush is called regardless, so the inner catch doesn't need to swallow non-HTTP exceptions. - EventQueueTest.cs: TrappingLocalBucketing fake now uses a real SemaphoreSlim with a 5s timeout for StartFlush/EndFlush. A regression that skips EndFlush will now actually deadlock the next StartFlush, surfaced as InvalidOperationException, so the deadlock test exercises the real failure mode instead of just counting calls. - Wrapped the first FlushEvents call in the deadlock test in try/catch so the test specifically exercises 'second flush after failed first flush' regardless of whether the first call propagates the underlying exception.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
EventQueue.FlushEvents()body in try/finally solocalBucketing.EndFlush()always runs even if the WASM bucketing engine throws insideflushEventQueueGetPayloads()and from the per-payload mark-success / mark-failure calls; surface them as a failedFlushedEventssubscriber notification instead of letting them escapeILocalBucketingthat simulates a WASM trap insideFlushEventQueueMotivation
If the WASM bucketing engine throws on the flush path (e.g. AssemblyScript
throw new Error(...)fromrequestPayloadManager.ts:281Request Payload: ${payloadId} has not finished sending, or other flush-internal throws), the previous implementation leaked the FlushMutex.The previous
FlushEvents()implementation:If
GetPayloads()(or any subsequent step before the matchingEndFlush()) throws, the FlushMutex is permanently leaked. Every subsequent flush blocks onStartFlush()forever, events accumulate indefinitely in the WASM heap, and any heap corruption compounds.Cross-SDK audit
I audited every server SDK that uses the WASM bucketing engine to confirm this bug is unique to .NET. Every other SDK uses a language-idiomatic exception-safe lock pattern:
Wait()/Release()onSemaphoreSlim, no try/finallysynchronizedonflushEvents()(JVM auto-releases monitor on exception)with self._flush_lock+with self.wasm_lock(context managers)Mutex#synchronize(usesensureblock under the hood)defer Unlock()+recover()on both flush and queue mutexes; native bucketingThe audit also surfaced two unrelated minor issues that I filed as follow-ups in their respective repos:
UnboundLocalErrorin_flush_eventsmasks WASM trap errors. Lock is released correctly, but the original WASM error is hidden behind a confusing variable-binding error.isFlushingEventsflag can get stucktrueifpublishEvents()throwsInterruptedException, permanently disabling future flushes. Low severity (most paths swallow exceptions today) but worth fixing for robustness.Neither of those affects the fix in this PR.
Behaviour change
FlushEvents(). It is logged, surfaced viaFlushedEventssubscribers as a failed event, and the FlushMutex is released so subsequent flushes still run.Tests
Each test injects a fake
ILocalBucketingthat throws onFlushEventQueueto simulate a WASM trap, then asserts behaviour.FlushEvents_WhenFlushEventQueueThrows_DoesNotPropagateFlushEvents()FlushEvents_WhenFlushEventQueueThrows_StillCallsEndFlushEndFlush()was called (mutex released)FlushEvents_WhenFlushEventQueueThrows_RaisesFailureToSubscribersSuccess=falsewith errors populatedFlushEvents_AfterFailedFlush_NextFlushDoesNotDeadlockI verified all 4 fail on
main(or hang in the case of the deadlock test) and pass with the fix. All 56 existing tests still pass.Related