Restart-aware oracle property tests#4649
Merged
Merged
Conversation
The Bug 1 fix iterated chunks via get_chunk_end_log_offset to walk the source shape's persisted history, but stream_main_log returns the entire requested range in one call (unlike snapshot chunks which come back one at a time). When the main log spanned more than one chunk, the loop's next-iteration offset was the end of the first main-log chunk, so the next call streamed the remainder of the main log - entries the previous call had already applied. Re-applying inserts crashed the materializer with "Key already exists", taking down the dependent shape's consumer and producing 409 must-refetch on the next poll. Stop iterating as soon as the read steps into the main log: either state.offset is already a real or last-virtual offset, or the next chunk boundary is a real offset. Adds a regression test in oracle_restore_test.exs (single shape, 30 level_3 UPDATEs, chunk_size forced to 200 so the main log spans multiple chunks) that fails on the broken iteration and passes with the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-shape minimal reproduction of the remaining property-test failure: when multiple subquery shapes are restored concurrently after a server restart and the source shape's main log spans more than one chunk (forced by chunk_size: 200), at least one shape's materialized view ends up missing rows the oracle has. Single-shape variants of the same mutation pattern pass, so this is an interaction between concurrent materializer recoveries — likely a per-stack shared structure (link values cache or SubqueryIndex) being read by one shape's consumer before another shape's materializer has finished repopulating it. Test fails today; commit lays out the reproducer for the next pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Bug 5 regression test (already added in the previous commit) is the second-shape variant - reproducing only when both: - multiple subquery shapes are restored concurrently after restart, and - the source shape's main log spans more than one chunk. After batch_2 applies a mutation that should move rows into the dependent shape's view, the materialized view stays stale. Single-shape variants pass. This points at an interaction between concurrent materializer recoveries and post-restart event delivery rather than the chunked replay itself. Documents the symptom, reproducer, and likely-suspect code paths in bugs.md. The investigation continues from here. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added trace logging through the post-restart event flow and confirmed that for the failing Bug 5 scenario: - the source consumer's materializer_subscribed? is set back to true when the materializer subscribes; - batch_2's mutation reaches the materializer via the range-based new_changes call; - the materializer applies value_counts changes and flushes the right move_in / move_out events to its subscriber list; - the dependent shape's consumer receives the :materializer_changes message with the right move counts. The break is on the dependent-consumer side, after it receives the materializer_changes: the move-in dispatched through Buffering.start / ActiveMove / SplicePlan doesn't end up appending the moved-in level_4 rows to the shape's on-disk view. Updates bugs.md with the suspect chain so the next pass starts from the right files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trace logging through the post-restart event flow nails it. When EventHandlerBuilder.build runs for a restored outer consumer it seeds state.views from Materializer.get_link_values, i.e. the current materializer state. But the outer shape's on-disk storage reflects the *pre-restart* shutdown LSN, while the materializer is rebuilt from disk and may be advanced by source-consumer events that arrive during the eager-restart window (slot replay, batch_2's mutation). When the materializer's view is ahead of the outer's storage, the seeded state.views is the materializer's value. The first :materializer_changes message that arrives is then dropped by MoveQueue.enqueue's redundant?/2 check (the value is already in base_view), so the corresponding move-in query never runs and the outer's storage never gains the rows that were supposed to move in. The test ends up with materialized missing exactly those rows. bugs.md captures the trace, the divergence step-by-step, and three fix shapes: 1. persist state.views alongside the shape on flush, load on restart; 2. derive the view from outer storage (distinct dep-foreign-key values) 3. track per-subscriber offsets so the materializer can replay events from the outer's persisted offset. (1) is the simplest and matches how the rest of the recovery machinery works. The fix itself is left for the next pass — it touches the storage metadata format so it deserves a focused review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 5 root cause: after restart, the outer (subquery) consumer's EventHandlerBuilder.build seeds state.views from the *current* materializer view. If events flow through the SLC before the outer consumer's init runs, the materializer can absorb new transactions and end up ahead of the outer shape's on-disk storage. The next move-in event for a value already in the seeded view is then dropped as redundant. Fix: in ShapeCache.handle_continue(:wait_for_restore), eagerly start every subquery shape's consumer (and its dependency materializer) *before* calling ShapeLogCollector.mark_as_ready. await_snapshot_start blocks until each outer consumer's init has run, guaranteeing state.views and the materializer view are both derived from on-disk state alone. The existing shape_cache_test "starts a consumer plus dependencies" is updated to assert the new behavior — consumers come back up automatically after a restart, no explicit start_consumer_for_handle call needed. The continue runs asynchronously after start_supervised! returns, so the assertion polls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After fixing Bug 5 (eager-start subquery consumers before mark_as_ready), the heavy property test still fails with 409 must-refetch on subquery shapes after batch_7 restart. Root cause documented: ShapeStatus.remove_shape removes from SQLite *first*, then attempts SLC.remove_shape — if SLC is mid-shutdown the call exits but the SQLite/ETS delete already committed. After restart the shape is gone. Triggered by a consumer crashing with a non-shutdown reason during the stop sequence (e.g. dependent materializer killed by supervisor brutal-shutdown), which routes through stop_and_clean and schedules remove_shape_async. The async task races stack shutdown. Triage note updated to reflect Bug 5 fixed and Bug 6 as next blocker. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…istency Previously the heavy oracle property test (RESTART_SERVER_EVERY=7) failed right after the post-restart batch with a 409 must-refetch on optimized subquery shapes. Root causes traced: 1. Materializer crashes on inconsistent inner-shape log: a `DeletedRecord` for a key not in the index, a duplicate `NewRecord`, or an `UpdatedRecord` for an unknown old_key. Each crash kills the materializer; the dependent consumer's handle_materializer_down then calls stop_and_clean, which triggers ShapeCleaner.remove_shape_async — the shape is wiped from disk and the next poll returns 409. 2. Inflight long-poll requests held by Bandit can wake up during the stack- restart window when shape_meta_table has been freed by the old ShapeStatusOwner but not yet recreated by the new one. ets:lookup raises ArgumentError, propagating as a 500 (or worse, racing with a partial cleanup that produces a 409). Make the materializer treat every "this looks impossible" branch as an upstream-bug warning instead of a process-killing exception: - DeletedRecord with key not in index → log + skip - NewRecord with key already in index → log + skip - UpdatedRecord guarded by `is_map_key(index, old_key)` so the update path only fires when the index actually has the row - move-out/in iterations switch from Map.fetch! to Map.fetch with a skip branch - decrement_value treats a missing value_count as a no-op Make the ETS-missing case explicit, not a crash: - ShapeStatus.validate_shape_handle/3 rescues ArgumentError → :error, letting the caller fall back to fetch_handle_by_shape against the live SQLite store - Api.check_for_disk_updates/1 rescues ArgumentError → :no_change, so a held long-poll that wakes up mid-restart returns no-change instead of 500-ing. Net effect: no more cascading shape cleanup from a single materializer hiccup, no more 500/409 storms during the stack-restart window. Net result of the property run isn't yet clean — Bug 2 (snapshot/log boundary returns duplicate inserts to the client) still trips with SHAPE_COUNT=10 RESTART_SERVER_EVERY=7 — but the cleanup-driven 409 path is closed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a materializer dies (e.g. brutal-killed by supervisor during stack shutdown), the consumer's monitor delivers a :DOWN message that handle_materializer_down/2 turns into a clean stop. But if the consumer is mid-handle_event when the materializer dies, the inline GenServer.call to Materializer.new_changes/3 exits the calling process with :noproc *before* the :DOWN is processed. That exit propagates as a non-shutdown reason → handle_writer_termination → @shutdown_cleanup → remove_shape_async → shape gone from disk. Catch :exit on the call and return :ok. The pending :DOWN message will still drive a clean stop via handle_materializer_down once the current event finishes processing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eeper triage notes
The "log and skip" branches I added in 47ad817 (NewRecord with key already in index, DeletedRecord for unknown key, move-out/move-in iterating over keys not in the index, decrement_value on a missing value, UpdatedRecord guarded by is_map_key) hide real upstream bugs. The materializer crashing on impossible-looking state is the correct behaviour — duplicates and missing values are invariant violations, not transient noise to swallow. Restore Map.pop!, Map.fetch!, raise on duplicate INSERT, and the unguarded UpdatedRecord rewrite path. Bug 2 will re-surface as a materializer crash; that's where the investigation needs to focus. Keeps the orthogonal changes from 47ad817 that aren't about masking state inconsistency — the ETS-rescue in ShapeStatus.validate_shape_handle and Api.check_for_disk_updates guard a genuine transient race during stack restart, not a state bug. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 6 fix is now properly split: - KEPT: consumer :noproc catch + ETS rescue (process/resource lifecycle). - REVERTED: materializer "log and skip" branches that were masking real invariant violations (committed in 57df37c). Bug 2 analysis refined: it's not strictly a post-restart bug. It's a race between the shape's direct Filter path and the materializer's move-in path when the *same* transaction contains both a row-level update that crosses the filter boundary AND a dep-view change. Both paths emit log ops for the same row. Fresh clients polling from offset=-1 see the duplicates; pre-restart clients started from a consistent in-memory state happen not to observe the duplication. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…match Earlier write-up incorrectly claimed Bug 2 also exists in steady state. That's wrong — the oracle test passes cleanly without RESTART_SERVER_EVERY, so the bug is restart-specific. Real root cause: The inner shape's storage and the outer shape's storage are written by independent writers with independent flush schedules. At the moment the stack is killed, they're typically at DIFFERENT last_persisted_offsets: L_outer ≤ L_inner. On recovery: - Materializer rebuilds its view from inner shape's history → state at L_inner. - Outer consumer's state.views is seeded from the materializer → also at L_inner. - But the outer shape's STORAGE is still at L_outer. So state.views and outer storage are at different LSNs. The materializer events for offsets in (L_outer, L_inner] correspond to ops that were never written to outer storage (the writer was killed before flushing them) but state.views reflects them as already-applied. When live events resume, the materializer eventually re-emits those move-ins/move-outs (from replication catch-up or fresh mutations). state.views says "already there" so the move-in is processed as redundant — but the move-in query still runs and writes the matching rows again. Result: duplicate INSERTs in the on-disk log; the orphan DELETE variant is the mirror. Bug 5 status downgraded to PARTIALLY FIXED. The eager-start change prevents the materializer view from drifting *forward* of the outer consumer's seeded view between consumer init and the first live event, but it can't fix the gap between materializer view and outer-storage view that already exists at recovery time. Real fix needs one of: 1. Persist state.views alongside outer shape's storage; restore atomically. 2. Derive state.views from outer storage's row contents on init. 3. Track per-outer-shape last_persisted_dep_lsn and replay materializer events from there on subscribe. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent per-shape writers with their own buffers and flush periods. The FlushTracker coordinates the *min* across shapes for the replication slot's flushed_wal — that's the lower bound. But each shape's on-disk last_persisted_txn_offset can be ahead of that min, and at kill time there's no cross-shape coordinated flush, so inner and outer shape writers can land at different durable offsets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The post-restart oracle property test was failing because different
shapes' on-disk last_persisted_txn_offset can land at different LSNs
after a stack stops: each shape has its own writer with its own buffer
and flush schedule, the FlushTracker only enforces a min lower-bound
for advancing the replication slot, and there's no cross-shape
coordinated flush at supervisor tear-down. After restart the
materializer rebuilds its view from the inner shape's history (at
L_inner) but the outer shape's storage is at L_outer ≠ L_inner; the
outer consumer's state.views (seeded from the materializer) and its
storage end up at different LSNs and produce duplicate inserts /
orphan deletes when live events resume.
Sound recovery from that state is hard, so:
1. Add Electric.StackSupervisor.ShutdownCoordinator as the LAST child
of StackSupervisor (so it terminates FIRST on graceful shutdown,
while every other process is still alive). Its terminate/2:
- stops the ReplicationClient so no new events arrive at the SLC
- calls ShapeLogCollector.drain/2 which sets a draining flag,
blocks new handle_event calls, and waits on the FlushTracker
until every shape has flushed up to last_seen_offset
- writes a clean-shutdown marker to persistent_kv
If any step fails (timeout, no SLC, exit, brutal kill), the marker
is NOT written. Brutal-killed processes don't run terminate/2
either way.
2. ShapeStatusOwner.handle_continue(:initialize) consumes the marker
on startup. If absent (= previous shutdown was dirty), drops all
shape state — ShapeDb.reset (SQLite) + Storage.cleanup_all
(per-shape storage dirs) — before initialising. Clients re-request
shapes from scratch with new handles.
3. ShapeLogCollector gets a drain/2 public API and a :drain handle_call.
The :writer_flushed cast now also resolves the pending drain_waiter
when the FlushTracker becomes empty.
4. Threads persistent_kv from StackSupervisor through
MonitoredCoreSupervisor to ShapeStatusOwner.
Verified:
- mix test test/electric/{shape_cache_test.exs, shapes/consumer/materializer_test.exs,
shapes/consumer_test.exs} test/integration/oracle_restore_test.exs --include oracle
--include slow → 134 / 134 passing.
- CHECK_TIMEOUT=60000 SHAPE_COUNT=10 BATCH_COUNT=10 RESTART_SERVER_EVERY=7
LONG_POLL_TIMEOUT=2000: passing on seeds 1, 2, 3, 4, 5, 8.
- CHECK_TIMEOUT=60000 SHAPE_COUNT=30 BATCH_COUNT=50 RESTART_SERVER_EVERY=7
RESTART_CLIENT_EVERY=11 seed 8: passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Revert the bug fixes, bug doc, and per-bug regression tests that were explored on rob/oracle-restore-tests. This branch carries only the restart-aware oracle property test harness (RESTART_SERVER_EVERY / RESTART_CLIENT_EVERY) and its supporting setup: - test/integration/oracle_property_test.exs - test/support/oracle_harness.ex - test/support/component_setup.ex Removed: lib fixes, bugs.md, oracle_restore_test.exs, and unit regression tests, which belong with the fixes. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #4649 +/- ##
==========================================
- Coverage 59.70% 59.69% -0.01%
==========================================
Files 385 385
Lines 43021 43021
Branches 12374 12372 -2
==========================================
- Hits 25684 25683 -1
Misses 17259 17259
- Partials 78 79 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
alco
approved these changes
Jun 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this does
Adds a restart-aware mode to the oracle property test so it exercises
server-side restore-from-disk and client-side resync mid-run:
RESTART_SERVER_EVERY=N— stops and restarts theStackSupervisorevery N batches (with a persistent replication slot so on-disk shape data
survives), then reconnects fresh clients and asserts the restored state
matches the oracle.
RESTART_CLIENT_EVERY=M— throws away client poll cursors / materializedrows every M batches and reconnects, asserting a fresh poll correctly
assembles snapshot + log.
Supporting changes: a
Support.ComponentSetup.restart_complete_stack/1helperthat stops and restarts the stack with the same
stack_id/persistent_kv/storage/publication_name, and the restart/recreate-checker wiring inSupport.OracleHarness.Scope
This branch is test infrastructure only — it does not include any fixes.
Running the property test with
RESTART_SERVER_EVERYset surfaces a number ofreal restore/restart bugs in the sync service; the test is expected to fail in
their presence so they get noticed. The fixes and the per-bug triage doc are
tracked separately in #4648.