Skip to content

(DRAFT) Oracle restore tests - and example fixes#4648

Draft
robacourt wants to merge 22 commits into
mainfrom
rob/oracle-restore-tests
Draft

(DRAFT) Oracle restore tests - and example fixes#4648
robacourt wants to merge 22 commits into
mainfrom
rob/oracle-restore-tests

Conversation

@robacourt

@robacourt robacourt commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

What this does

Adds a restart-aware mode to the oracle property test, which restarts the
Electric.StackSupervisor mid-test to exercise restore-from-file. Wiring up
RESTART_SERVER_EVERY surfaced a number of real restore/restart bugs in the
sync service; this PR includes deterministic regression tests
(test/integration/oracle_restore_test.exs) plus example fixes for several of
them (subquery materializer restore, multi-chunk log replay, duplicate/orphan
ops after restart, mid-restart cleanup, coordinated shutdown).

Bug doc

Each issue uncovered by the new test — symptom, root cause, fix, and regression
test — is written up in packages/sync-service/bugs.md.
Start there for the full triage.

robacourt and others added 22 commits June 24, 2026 08:31
The Bug 1 fix iterated chunks via get_chunk_end_log_offset to walk the
source shape's persisted history, but stream_main_log returns the
entire requested range in one call (unlike snapshot chunks which come
back one at a time). When the main log spanned more than one chunk,
the loop's next-iteration offset was the end of the first main-log
chunk, so the next call streamed the remainder of the main log -
entries the previous call had already applied. Re-applying inserts
crashed the materializer with "Key already exists", taking down the
dependent shape's consumer and producing 409 must-refetch on the next
poll.

Stop iterating as soon as the read steps into the main log: either
state.offset is already a real or last-virtual offset, or the next
chunk boundary is a real offset. Adds a regression test in
oracle_restore_test.exs (single shape, 30 level_3 UPDATEs, chunk_size
forced to 200 so the main log spans multiple chunks) that fails on
the broken iteration and passes with the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-shape minimal reproduction of the remaining property-test failure:
when multiple subquery shapes are restored concurrently after a server
restart and the source shape's main log spans more than one chunk
(forced by chunk_size: 200), at least one shape's materialized view
ends up missing rows the oracle has. Single-shape variants of the same
mutation pattern pass, so this is an interaction between concurrent
materializer recoveries — likely a per-stack shared structure (link
values cache or SubqueryIndex) being read by one shape's consumer
before another shape's materializer has finished repopulating it.

Test fails today; commit lays out the reproducer for the next pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Bug 5 regression test (already added in the previous commit) is
the second-shape variant - reproducing only when both:
- multiple subquery shapes are restored concurrently after restart, and
- the source shape's main log spans more than one chunk.

After batch_2 applies a mutation that should move rows into the
dependent shape's view, the materialized view stays stale. Single-shape
variants pass. This points at an interaction between concurrent
materializer recoveries and post-restart event delivery rather than the
chunked replay itself.

Documents the symptom, reproducer, and likely-suspect code paths in
bugs.md. The investigation continues from here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Added trace logging through the post-restart event flow and confirmed
that for the failing Bug 5 scenario:

- the source consumer's materializer_subscribed? is set back to true
  when the materializer subscribes;
- batch_2's mutation reaches the materializer via the range-based
  new_changes call;
- the materializer applies value_counts changes and flushes the right
  move_in / move_out events to its subscriber list;
- the dependent shape's consumer receives the :materializer_changes
  message with the right move counts.

The break is on the dependent-consumer side, after it receives the
materializer_changes: the move-in dispatched through
Buffering.start / ActiveMove / SplicePlan doesn't end up appending
the moved-in level_4 rows to the shape's on-disk view. Updates
bugs.md with the suspect chain so the next pass starts from the
right files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Trace logging through the post-restart event flow nails it. When
EventHandlerBuilder.build runs for a restored outer consumer it seeds
state.views from Materializer.get_link_values, i.e. the current
materializer state. But the outer shape's on-disk storage reflects the
*pre-restart* shutdown LSN, while the materializer is rebuilt from
disk and may be advanced by source-consumer events that arrive during
the eager-restart window (slot replay, batch_2's mutation).

When the materializer's view is ahead of the outer's storage, the
seeded state.views is the materializer's value. The first
:materializer_changes message that arrives is then dropped by
MoveQueue.enqueue's redundant?/2 check (the value is already in
base_view), so the corresponding move-in query never runs and the
outer's storage never gains the rows that were supposed to move in.
The test ends up with materialized missing exactly those rows.

bugs.md captures the trace, the divergence step-by-step, and three
fix shapes:
1. persist state.views alongside the shape on flush, load on restart;
2. derive the view from outer storage (distinct dep-foreign-key values)
3. track per-subscriber offsets so the materializer can replay events
   from the outer's persisted offset.

(1) is the simplest and matches how the rest of the recovery machinery
works. The fix itself is left for the next pass — it touches the
storage metadata format so it deserves a focused review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 5 root cause: after restart, the outer (subquery) consumer's
EventHandlerBuilder.build seeds state.views from the *current*
materializer view. If events flow through the SLC before the outer
consumer's init runs, the materializer can absorb new transactions
and end up ahead of the outer shape's on-disk storage. The next
move-in event for a value already in the seeded view is then
dropped as redundant.

Fix: in ShapeCache.handle_continue(:wait_for_restore), eagerly start
every subquery shape's consumer (and its dependency materializer)
*before* calling ShapeLogCollector.mark_as_ready. await_snapshot_start
blocks until each outer consumer's init has run, guaranteeing
state.views and the materializer view are both derived from on-disk
state alone.

The existing shape_cache_test "starts a consumer plus dependencies"
is updated to assert the new behavior — consumers come back up
automatically after a restart, no explicit start_consumer_for_handle
call needed. The continue runs asynchronously after start_supervised!
returns, so the assertion polls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After fixing Bug 5 (eager-start subquery consumers before mark_as_ready),
the heavy property test still fails with 409 must-refetch on subquery
shapes after batch_7 restart. Root cause documented:
ShapeStatus.remove_shape removes from SQLite *first*, then attempts
SLC.remove_shape — if SLC is mid-shutdown the call exits but the
SQLite/ETS delete already committed. After restart the shape is gone.

Triggered by a consumer crashing with a non-shutdown reason during
the stop sequence (e.g. dependent materializer killed by supervisor
brutal-shutdown), which routes through stop_and_clean and schedules
remove_shape_async. The async task races stack shutdown.

Triage note updated to reflect Bug 5 fixed and Bug 6 as next blocker.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…istency

Previously the heavy oracle property test (RESTART_SERVER_EVERY=7) failed
right after the post-restart batch with a 409 must-refetch on optimized
subquery shapes. Root causes traced:

1. Materializer crashes on inconsistent inner-shape log: a `DeletedRecord`
   for a key not in the index, a duplicate `NewRecord`, or an `UpdatedRecord`
   for an unknown old_key. Each crash kills the materializer; the dependent
   consumer's handle_materializer_down then calls stop_and_clean, which
   triggers ShapeCleaner.remove_shape_async — the shape is wiped from disk
   and the next poll returns 409.
2. Inflight long-poll requests held by Bandit can wake up during the stack-
   restart window when shape_meta_table has been freed by the old
   ShapeStatusOwner but not yet recreated by the new one. ets:lookup raises
   ArgumentError, propagating as a 500 (or worse, racing with a partial
   cleanup that produces a 409).

Make the materializer treat every "this looks impossible" branch as an
upstream-bug warning instead of a process-killing exception:
- DeletedRecord with key not in index → log + skip
- NewRecord with key already in index → log + skip
- UpdatedRecord guarded by `is_map_key(index, old_key)` so the update path
  only fires when the index actually has the row
- move-out/in iterations switch from Map.fetch! to Map.fetch with a skip
  branch
- decrement_value treats a missing value_count as a no-op

Make the ETS-missing case explicit, not a crash:
- ShapeStatus.validate_shape_handle/3 rescues ArgumentError → :error,
  letting the caller fall back to fetch_handle_by_shape against the live
  SQLite store
- Api.check_for_disk_updates/1 rescues ArgumentError → :no_change, so a
  held long-poll that wakes up mid-restart returns no-change instead of
  500-ing.

Net effect: no more cascading shape cleanup from a single materializer
hiccup, no more 500/409 storms during the stack-restart window.

Net result of the property run isn't yet clean — Bug 2 (snapshot/log
boundary returns duplicate inserts to the client) still trips with
SHAPE_COUNT=10 RESTART_SERVER_EVERY=7 — but the cleanup-driven 409 path
is closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a materializer dies (e.g. brutal-killed by supervisor during stack
shutdown), the consumer's monitor delivers a :DOWN message that
handle_materializer_down/2 turns into a clean stop. But if the consumer
is mid-handle_event when the materializer dies, the inline
GenServer.call to Materializer.new_changes/3 exits the calling process
with :noproc *before* the :DOWN is processed. That exit propagates as
a non-shutdown reason → handle_writer_termination → @shutdown_cleanup
→ remove_shape_async → shape gone from disk.

Catch :exit on the call and return :ok. The pending :DOWN message will
still drive a clean stop via handle_materializer_down once the current
event finishes processing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The "log and skip" branches I added in 47ad817
(NewRecord with key already in index, DeletedRecord for unknown key,
move-out/move-in iterating over keys not in the index, decrement_value
on a missing value, UpdatedRecord guarded by is_map_key) hide real
upstream bugs. The materializer crashing on impossible-looking state
is the correct behaviour — duplicates and missing values are
invariant violations, not transient noise to swallow.

Restore Map.pop!, Map.fetch!, raise on duplicate INSERT, and the
unguarded UpdatedRecord rewrite path. Bug 2 will re-surface as a
materializer crash; that's where the investigation needs to focus.

Keeps the orthogonal changes from 47ad817 that aren't about
masking state inconsistency — the ETS-rescue in
ShapeStatus.validate_shape_handle and Api.check_for_disk_updates
guard a genuine transient race during stack restart, not a state bug.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bug 6 fix is now properly split:
- KEPT: consumer :noproc catch + ETS rescue (process/resource lifecycle).
- REVERTED: materializer "log and skip" branches that were masking real
  invariant violations (committed in 57df37c).

Bug 2 analysis refined: it's not strictly a post-restart bug. It's a
race between the shape's direct Filter path and the materializer's
move-in path when the *same* transaction contains both a row-level
update that crosses the filter boundary AND a dep-view change. Both
paths emit log ops for the same row. Fresh clients polling from
offset=-1 see the duplicates; pre-restart clients started from a
consistent in-memory state happen not to observe the duplication.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…match

Earlier write-up incorrectly claimed Bug 2 also exists in steady state.
That's wrong — the oracle test passes cleanly without RESTART_SERVER_EVERY,
so the bug is restart-specific.

Real root cause:

The inner shape's storage and the outer shape's storage are written by
independent writers with independent flush schedules. At the moment the
stack is killed, they're typically at DIFFERENT last_persisted_offsets:
L_outer ≤ L_inner.

On recovery:
- Materializer rebuilds its view from inner shape's history → state at
  L_inner.
- Outer consumer's state.views is seeded from the materializer → also
  at L_inner.
- But the outer shape's STORAGE is still at L_outer.

So state.views and outer storage are at different LSNs. The materializer
events for offsets in (L_outer, L_inner] correspond to ops that were
never written to outer storage (the writer was killed before flushing
them) but state.views reflects them as already-applied.

When live events resume, the materializer eventually re-emits those
move-ins/move-outs (from replication catch-up or fresh mutations).
state.views says "already there" so the move-in is processed as
redundant — but the move-in query still runs and writes the matching
rows again. Result: duplicate INSERTs in the on-disk log; the orphan
DELETE variant is the mirror.

Bug 5 status downgraded to PARTIALLY FIXED. The eager-start change
prevents the materializer view from drifting *forward* of the outer
consumer's seeded view between consumer init and the first live event,
but it can't fix the gap between materializer view and outer-storage
view that already exists at recovery time.

Real fix needs one of:
1. Persist state.views alongside outer shape's storage; restore atomically.
2. Derive state.views from outer storage's row contents on init.
3. Track per-outer-shape last_persisted_dep_lsn and replay materializer
   events from there on subscribe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Independent per-shape writers with their own buffers and flush
periods. The FlushTracker coordinates the *min* across shapes for
the replication slot's flushed_wal — that's the lower bound. But
each shape's on-disk last_persisted_txn_offset can be ahead of that
min, and at kill time there's no cross-shape coordinated flush, so
inner and outer shape writers can land at different durable offsets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The post-restart oracle property test was failing because different
shapes' on-disk last_persisted_txn_offset can land at different LSNs
after a stack stops: each shape has its own writer with its own buffer
and flush schedule, the FlushTracker only enforces a min lower-bound
for advancing the replication slot, and there's no cross-shape
coordinated flush at supervisor tear-down. After restart the
materializer rebuilds its view from the inner shape's history (at
L_inner) but the outer shape's storage is at L_outer ≠ L_inner; the
outer consumer's state.views (seeded from the materializer) and its
storage end up at different LSNs and produce duplicate inserts /
orphan deletes when live events resume.

Sound recovery from that state is hard, so:

1. Add Electric.StackSupervisor.ShutdownCoordinator as the LAST child
   of StackSupervisor (so it terminates FIRST on graceful shutdown,
   while every other process is still alive). Its terminate/2:
   - stops the ReplicationClient so no new events arrive at the SLC
   - calls ShapeLogCollector.drain/2 which sets a draining flag,
     blocks new handle_event calls, and waits on the FlushTracker
     until every shape has flushed up to last_seen_offset
   - writes a clean-shutdown marker to persistent_kv

   If any step fails (timeout, no SLC, exit, brutal kill), the marker
   is NOT written. Brutal-killed processes don't run terminate/2
   either way.

2. ShapeStatusOwner.handle_continue(:initialize) consumes the marker
   on startup. If absent (= previous shutdown was dirty), drops all
   shape state — ShapeDb.reset (SQLite) + Storage.cleanup_all
   (per-shape storage dirs) — before initialising. Clients re-request
   shapes from scratch with new handles.

3. ShapeLogCollector gets a drain/2 public API and a :drain handle_call.
   The :writer_flushed cast now also resolves the pending drain_waiter
   when the FlushTracker becomes empty.

4. Threads persistent_kv from StackSupervisor through
   MonitoredCoreSupervisor to ShapeStatusOwner.

Verified:
- mix test test/electric/{shape_cache_test.exs, shapes/consumer/materializer_test.exs,
  shapes/consumer_test.exs} test/integration/oracle_restore_test.exs --include oracle
  --include slow → 134 / 134 passing.
- CHECK_TIMEOUT=60000 SHAPE_COUNT=10 BATCH_COUNT=10 RESTART_SERVER_EVERY=7
  LONG_POLL_TIMEOUT=2000: passing on seeds 1, 2, 3, 4, 5, 8.
- CHECK_TIMEOUT=60000 SHAPE_COUNT=30 BATCH_COUNT=50 RESTART_SERVER_EVERY=7
  RESTART_CLIENT_EVERY=11 seed 8: passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.66%. Comparing base (121dbe6) to head (fc862bb).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4648      +/-   ##
==========================================
- Coverage   59.70%   59.66%   -0.04%     
==========================================
  Files         385      385              
  Lines       43021    43021              
  Branches    12374    12372       -2     
==========================================
- Hits        25684    25669      -15     
- Misses      17259    17274      +15     
  Partials       78       78              
Flag Coverage Δ
packages/agents 72.64% <ø> (ø)
packages/agents-mcp 77.70% <ø> (ø)
packages/agents-mobile 80.67% <ø> (ø)
packages/agents-runtime 83.44% <ø> (ø)
packages/agents-server 75.45% <ø> (-0.21%) ⬇️
packages/agents-server-ui 8.34% <ø> (ø)
packages/electric-ax 51.06% <ø> (ø)
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 91.83% <ø> (ø)
packages/y-electric 56.05% <ø> (ø)
typescript 59.66% <ø> (-0.04%) ⬇️
unit-tests 59.66% <ø> (-0.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

robacourt added a commit that referenced this pull request Jun 24, 2026
## What this does

Adds a restart-aware mode to the oracle property test so it exercises
server-side restore-from-disk and client-side resync mid-run:

- **`RESTART_SERVER_EVERY=N`** — stops and restarts the
`StackSupervisor`
every N batches (with a persistent replication slot so on-disk shape
data
survives), then reconnects fresh clients and asserts the restored state
  matches the oracle.
- **`RESTART_CLIENT_EVERY=M`** — throws away client poll cursors /
materialized
  rows every M batches and reconnects, asserting a fresh poll correctly
  assembles snapshot + log.

Supporting changes: a `Support.ComponentSetup.restart_complete_stack/1`
helper
that stops and restarts the stack with the same
`stack_id`/`persistent_kv`/
`storage`/`publication_name`, and the restart/recreate-checker wiring in
`Support.OracleHarness`.

## Scope

This branch is **test infrastructure only** — it does not include any
fixes.
Running the property test with `RESTART_SERVER_EVERY` set surfaces a
number of
real restore/restart bugs in the sync service; the test is expected to
fail in
their presence so they get noticed. The fixes and the per-bug triage doc
are
tracked separately in #4648.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant