Skip to content

fix: ensure FlushMutex is released when WASM flush throws#204

Open
jonathannorris wants to merge 2 commits into
mainfrom
fix/flush-mutex-leak
Open

fix: ensure FlushMutex is released when WASM flush throws#204
jonathannorris wants to merge 2 commits into
mainfrom
fix/flush-mutex-leak

Conversation

@jonathannorris
Copy link
Copy Markdown
Member

@jonathannorris jonathannorris commented May 4, 2026

Summary

  • Wrap EventQueue.FlushEvents() body in try/finally so localBucketing.EndFlush() always runs even if the WASM bucketing engine throws inside flushEventQueue
  • Catch exceptions from GetPayloads() and from the per-payload mark-success / mark-failure calls; surface them as a failed FlushedEvents subscriber notification instead of letting them escape
  • Add 4 tests against an injected fake ILocalBucketing that simulates a WASM trap inside FlushEventQueue

Motivation

If the WASM bucketing engine throws on the flush path (e.g. AssemblyScript throw new Error(...) from requestPayloadManager.ts:281 Request Payload: ${payloadId} has not finished sending, or other flush-internal throws), the previous implementation leaked the FlushMutex.

The previous FlushEvents() implementation:

public virtual async Task FlushEvents()
{
    localBucketing.StartFlush();          // acquires FlushMutex semaphore
    var flushPayloads = GetPayloads();    // can throw a WASM trap
    // ...
    localBucketing.EndFlush();            // never reached if anything above throws
}

If GetPayloads() (or any subsequent step before the matching EndFlush()) throws, the FlushMutex is permanently leaked. Every subsequent flush blocks on StartFlush() forever, events accumulate indefinitely in the WASM heap, and any heap corruption compounds.

Cross-SDK audit

I audited every server SDK that uses the WASM bucketing engine to confirm this bug is unique to .NET. Every other SDK uses a language-idiomatic exception-safe lock pattern:

SDK Flush mutex pattern Same bug?
.NET (this PR) Manual Wait() / Release() on SemaphoreSlim, no try/finally Yes — fixed here
Java synchronized on flushEvents() (JVM auto-releases monitor on exception) No
Python with self._flush_lock + with self.wasm_lock (context managers) No
Ruby Mutex#synchronize (uses ensure block under the hood) No
Go defer Unlock() + recover() on both flush and queue mutexes; native bucketing n/a (no WASM)
PHP No local bucketing, cloud HTTP proxy only n/a

The audit also surfaced two unrelated minor issues that I filed as follow-ups in their respective repos:

  • Python: DevCycleHQ/python-server-sdk#111UnboundLocalError in _flush_events masks WASM trap errors. Lock is released correctly, but the original WASM error is hidden behind a confusing variable-binding error.
  • Java: DevCycleHQ/java-server-sdk#192isFlushingEvents flag can get stuck true if publishEvents() throws InterruptedException, permanently disabling future flushes. Low severity (most paths swallow exceptions today) but worth fixing for robustness.

Neither of those affects the fix in this PR.

Behaviour change

  • A WASM trap during flush no longer escapes FlushEvents(). It is logged, surfaced via FlushedEvents subscribers as a failed event, and the FlushMutex is released so subsequent flushes still run.
  • No change to successful-flush behaviour or to retryable HTTP failures.

Tests

Each test injects a fake ILocalBucketing that throws on FlushEventQueue to simulate a WASM trap, then asserts behaviour.

Test Asserts
FlushEvents_WhenFlushEventQueueThrows_DoesNotPropagate Exception does not escape FlushEvents()
FlushEvents_WhenFlushEventQueueThrows_StillCallsEndFlush EndFlush() was called (mutex released)
FlushEvents_WhenFlushEventQueueThrows_RaisesFailureToSubscribers Subscribers receive Success=false with errors populated
FlushEvents_AfterFailedFlush_NextFlushDoesNotDeadlock After a failed flush, the next flush still completes (no leaked mutex)

I verified all 4 fail on main (or hang in the case of the deadlock test) and pass with the fix. All 56 existing tests still pass.

Related

When localBucketing.FlushEventQueue throws (e.g. AssemblyScript abort
inside the WASM event queue path), the previous implementation let the
exception escape between StartFlush() and EndFlush(). That permanently
leaked the FlushMutex semaphore, so every subsequent flush deadlocked
on StartFlush() and events accumulated in the WASM heap indefinitely -
compounding any pre-existing heap corruption.

Wrap the flush body in try/finally so EndFlush() always runs. Catch
exceptions from GetPayloads() and the per-payload mark-success/failure
calls so a failing flush surfaces as a failed FlushedEvents subscriber
notification instead of an uncaught exception.
@jonathannorris jonathannorris requested a review from a team as a code owner May 4, 2026 19:31
Copilot AI review requested due to automatic review settings May 4, 2026 19:31
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the local SDK’s event flush path so a WASM trap does not leak the flush mutex and stall future event delivery. It fits into the local bucketing/event pipeline by making EventQueue.FlushEvents() more resilient when the embedded WASM engine throws.

Changes:

  • Wrapped FlushEvents() in try/finally so EndFlush() always runs after StartFlush().
  • Added exception handling around GetPayloads() and payload bookkeeping to convert flush-path failures into failed FlushedEvents notifications.
  • Added MSTests covering FlushEventQueue throw scenarios with a fake ILocalBucketing.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
DevCycle.SDK.Server.Local/Api/EventQueue.cs Refactors flush execution into an inner method and adds new exception-handling paths around payload retrieval and per-payload completion logic.
DevCycle.SDK.Server.Local.MSTests/EventQueueTest.cs Adds regression tests and a fake ILocalBucketing implementation to simulate flush-path WASM traps.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread DevCycle.SDK.Server.Local/Api/EventQueue.cs Outdated
Comment thread DevCycle.SDK.Server.Local.MSTests/EventQueueTest.cs Outdated
- EventQueue.cs: revert per-payload catch from 'Exception' back to
  'DevCycleException'. The broadened catch was over-aggressive and
  would have re-queued payloads as retryable failures if
  OnPayloadSuccess/OnPayloadFailure themselves threw, including
  re-sending events that DevCycle had already accepted with a 201.
  The outer try/finally on FlushEvents() ensures EndFlush is called
  regardless, so the inner catch doesn't need to swallow non-HTTP
  exceptions.
- EventQueueTest.cs: TrappingLocalBucketing fake now uses a real
  SemaphoreSlim with a 5s timeout for StartFlush/EndFlush. A
  regression that skips EndFlush will now actually deadlock the next
  StartFlush, surfaced as InvalidOperationException, so the deadlock
  test exercises the real failure mode instead of just counting
  calls.
- Wrapped the first FlushEvents call in the deadlock test in
  try/catch so the test specifically exercises 'second flush after
  failed first flush' regardless of whether the first call propagates
  the underlying exception.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants