
perf: warm-on-write read caching + leaf-split to fix reindex-churn read regression #1415

Open: bplatz wants to merge 11 commits into main from feature/warm-on-write-reindex-cache


bplatz commented Jul 1, 2026


Problem

Under continuous writes with a low reindex_min_bytes, frequent reindexing was
hurting read latency instead of helping, the opposite of the expected
"binary index ≫ novelty query" behavior. A sweep showed the default (100 KB)
churn regime degrading product-detail read latency by ~30× relative to a 2 MB
setting.

Root cause was not cache invalidation (the shared cache is a single global Arc
and CID-keyed, so unchanged warmth survives a swap). It was that every reindex
swap left the touched artifacts cold — the incremental build decoded the leaves
it rewrote but never seeded the shared read cache — compounded by local-read I/O
amplification and unbounded trailing-leaf growth.

Changes

  • Warm-on-write: the incremental CoW build now seeds the shared LeafletCache
    with the leaves it just wrote (decoded under ColumnSet::ALL) and warms the
    reverse-dict leaves, so the first read after a swap hits warm instead of
    cold-decoding. Wired through a late-bound WarmCacheSource at a single
    chokepoint (start_background_indexing_dyn) so every build path is covered.
  • Superset cache fallback: a narrow (projection-keyed) read is served from a
    cached ALL batch via ColumnBatch::project_to (invariant: cached ⊇ requested),
    so warming does not duplicate column data across cache entries.
  • Bounded incremental leaves: oversized touched leaves are split during the
    incremental build — gap-free, with first_key → next-first-key boundaries and
    ±∞ sentinels at the ends — fixing unbounded growth of the trailing leaf.
  • Local read path: mmap local leaf reads and cache the decoded directory.
    Previously a local read did std::fs::read on the whole leaf to serve one of
    its leaflets and re-decoded the directory, bypassing the CID-keyed LeafDir
    cache.
  • Within-leaflet seek: binary-search the bound leading key within a leaflet
    scan instead of scanning the leaflet linearly to find the subject's rows.

Bench

Adds reindex_swap_read_profile, a deterministic read-after-reindex-swap harness
that isolates the per-swap cold cost as a single number (not inferred from QMpH),
with first-touch and cold-subject controls.

Validation

Existing-data reads after a churn swap drop from ~3 ms to ~0.2–0.3 ms (~10×) in
the read-after-swap microbench, with cold-miss cache insertions driven toward
zero. Query correctness is unchanged (query and api group suites pass).

Scope note

A follow-on experiment was prototyped and dropped: warming the per-generation
query caches (overlay translation + stats view) at index apply. It is flat on
BSBM update, because the between-swap overlay is small and the query:swap ratio
is high, so the per-swap translation cost amortizes to noise. The remaining
default-regime cost is the co-located background build competing for CPU, which
is a separate lever tracked for later.

bplatz added 10 commits July 1, 2026 13:33
Leaflet V3 column batches are cached under the projection (columns) a reader decoded, so a warm full-projection batch could not serve a narrower read. Add ColumnBatch::project_to and a superset fallback in try_get_or_decode_v3_batch: on an exact-key miss, serve a cached ColumnSet::ALL batch by projecting it down to the requested columns (invariant: cached columns must be a superset of requested; unrequested columns become AbsentDefault). Also add insert_leaf_dir/insert_v3_batch for writer-side seeding and a Debug impl for LeafletCache. This lets a single ALL batch per leaflet satisfy every projection and is the read-side half of warm-on-write.
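
The superset fallback described in this commit can be sketched roughly as follows. This is an illustrative sketch only, not the PR's code: the three-column bitmask `ColumnSet`, the `Column` enum, and the `BatchCache` type with its `insert`/`get` methods are assumptions made for the example. Only the names `ColumnBatch::project_to`, `ColumnSet::ALL`, and `AbsentDefault` come from the commit message, and their real signatures may differ.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical three-column bitmask; the real leaflet column set is not shown in the PR text.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ColumnSet(u8);

impl ColumnSet {
    pub const S: ColumnSet = ColumnSet(0b001);
    pub const P: ColumnSet = ColumnSet(0b010);
    pub const O: ColumnSet = ColumnSet(0b100);
    pub const ALL: ColumnSet = ColumnSet(0b111);

    pub fn union(self, other: ColumnSet) -> ColumnSet {
        ColumnSet(self.0 | other.0)
    }

    /// True when every column in `other` is also in `self`.
    pub fn contains(self, other: ColumnSet) -> bool {
        self.0 & other.0 == other.0
    }
}

#[derive(Clone, Debug, PartialEq)]
pub enum Column {
    Values(Arc<Vec<u64>>),
    AbsentDefault,
}

#[derive(Clone, Debug, PartialEq)]
pub struct ColumnBatch {
    pub columns: ColumnSet,
    pub s: Column,
    pub p: Column,
    pub o: Column,
}

impl ColumnBatch {
    /// Project down to `requested`. Returns None when the invariant
    /// (cached columns are a superset of requested) does not hold.
    pub fn project_to(&self, requested: ColumnSet) -> Option<ColumnBatch> {
        if !self.columns.contains(requested) {
            return None;
        }
        // Requested columns share the Arc; unrequested ones become AbsentDefault.
        let keep = |bit: ColumnSet, col: &Column| -> Column {
            if requested.contains(bit) { col.clone() } else { Column::AbsentDefault }
        };
        Some(ColumnBatch {
            columns: requested,
            s: keep(ColumnSet::S, &self.s),
            p: keep(ColumnSet::P, &self.p),
            o: keep(ColumnSet::O, &self.o),
        })
    }
}

/// Hypothetical stand-in for the projection-keyed part of the shared cache.
#[derive(Default)]
pub struct BatchCache {
    entries: HashMap<(u128, ColumnSet), Arc<ColumnBatch>>,
}

impl BatchCache {
    pub fn insert(&mut self, leaflet_id: u128, batch: ColumnBatch) {
        self.entries.insert((leaflet_id, batch.columns), Arc::new(batch));
    }

    /// Exact-key lookup first; on a miss, fall back to a cached ALL batch.
    pub fn get(&self, leaflet_id: u128, requested: ColumnSet) -> Option<Arc<ColumnBatch>> {
        if let Some(hit) = self.entries.get(&(leaflet_id, requested)) {
            return Some(Arc::clone(hit));
        }
        let all = self.entries.get(&(leaflet_id, ColumnSet::ALL))?;
        all.project_to(requested).map(Arc::new)
    }
}
```

In this sketch the projected batch reuses the `Arc`-held column vectors from the cached ALL batch, which is one way a single ALL entry per leaflet could serve every projection without duplicating column data.
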
The streaming copy-on-write path now seeds the shared LeafletCache with the leaflets it just wrote, decoded under ColumnSet::ALL, so a co-located query server's immediate read of a freshly-rewritten (new-CID) leaf hits the cache instead of re-reading and re-decoding from disk. The leaf id derivation matches the reader (xxh3_128 of the leaf CID). Gated behind a late-bound WarmCacheSource resolver on IndexerConfig (None by default); only the CoW update path warms, never fresh/full rebuilds (which would decode the whole graph).
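
A minimal sketch of the gating this commit describes, under stated assumptions: the `LeafletCache` below is a toy map, `BuildKind` and `write_leaf` are hypothetical names, and `leaf_id_for` is a simple placeholder hash standing in for the xxh3_128 derivation the commit mentions (it is not xxh3). Only the ideas of a late-bound resolver that defaults to `None` and of warming solely on the CoW update path are taken from the text.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Toy stand-in for the shared read cache: leaf id -> decoded rows.
#[derive(Default)]
pub struct LeafletCache {
    decoded: Mutex<HashMap<u128, Arc<Vec<u64>>>>,
}

impl LeafletCache {
    pub fn insert(&self, leaf_id: u128, rows: Arc<Vec<u64>>) {
        self.decoded.lock().unwrap().insert(leaf_id, rows);
    }

    pub fn get(&self, leaf_id: u128) -> Option<Arc<Vec<u64>>> {
        self.decoded.lock().unwrap().get(&leaf_id).cloned()
    }
}

/// Late-bound resolver: called at build time, so the cache can be wired in
/// after the config is created. The exact type in the PR is an assumption.
pub type WarmCacheSource = Arc<dyn Fn() -> Option<Arc<LeafletCache>> + Send + Sync>;

#[derive(Default)]
pub struct IndexerConfig {
    /// None by default, e.g. for indexers that are not co-located with readers.
    pub warm_cache_source: Option<WarmCacheSource>,
}

pub enum BuildKind {
    IncrementalCow,
    FullRebuild,
}

/// Placeholder id derivation (FNV-style fold). The PR uses xxh3_128 of the
/// leaf CID; what matters here is that writer and reader derive the same id.
pub fn leaf_id_for(cid: &str) -> u128 {
    cid.bytes()
        .fold(0xcbf29ce484222325_u128, |h, b| (h ^ b as u128).wrapping_mul(0x100000001b3))
}

/// Write a leaf (persistence elided) and seed the cache only on the CoW path.
pub fn write_leaf(config: &IndexerConfig, kind: BuildKind, cid: &str, rows: Vec<u64>) -> u128 {
    let id = leaf_id_for(cid);
    if let (BuildKind::IncrementalCow, Some(source)) = (&kind, &config.warm_cache_source) {
        if let Some(cache) = source() {
            cache.insert(id, Arc::new(rows));
        }
    }
    id
}
```

Resolving the cache through a closure rather than storing it directly is what lets a single setup point bind it late; whether the PR uses a closure or a trait object for `WarmCacheSource` is not stated in the text.
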
Resolve the shared read cache for the background indexer from the running LedgerManager (LedgerManagerWarmCache), set once in start_background_indexing_dyn so every co-located build path warms the exact cache readers use; separate-machine indexers leave it unset. Add reindex_swap_read_profile, an instrumented harness that drives BSBM-shape write bursts and measures post-swap read latency, build cadence, and cache occupancy so warm-on-write is measurable independent of QMpH.
Extend warm-on-write to the reverse-dictionary tree: when the incremental CoW build writes a new reverse-dict leaf, seed the shared cache (DictLeaf) with its bytes so a co-located reader resolving a just-added IRI/string hits the cache instead of a cold read. Adds LeafletCache::insert_dict_leaf; the reader keys dict leaves on the CAS address string (cid.to_string()), which the warm key matches exactly. Threaded through the reverse-tree upload path and gated by the same warm_cache_source resolver (co-located only).
Absorb the one-time apply/swap cost with a throwaway load before the measured reads, so latency reflects a query on the applied generation (as a client read does after the background listener applies) rather than the first-load apply cost.
The incremental copy-on-write build never split a touched leaf (leaf_target_rows was bumped to existing_total+novelty for CID stability), so appended high-SID subjects concentrated into an ever-growing trailing leaf whose whole-blob read + full directory decode inflate every point read. Now a touched leaf splits once its merged size exceeds 2x the leaf target (matching the config's leaf_max = 2*leaf_target and the full-build LeafWriter's greedy packing), into bounded ~target-sized leaves; below the ceiling it still grows in place to avoid churning the branch on small commits. Applies uniformly to every touched leaf (middle, leftmost, rightmost). Splits are gap-free: novelty is routed to leaves by first_key(next) half-open intervals (slice_novelty_to_leaves), and the leftmost/rightmost leaves keep their -inf/+inf coverage. Adds a test asserting a split preserves the full row count, keeps leaves strictly ordered and non-overlapping, and spans the full key range.
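
The split threshold and the half-open routing can be illustrated with a small sketch. It assumes plain `u64` keys and in-memory vectors; `split_if_oversized` is a hypothetical name, and this `slice_novelty_to_leaves` only borrows the name from the commit message, so its signature here is an assumption rather than the PR's.

```rust
/// Split a merged, sorted leaf only once it exceeds 2x the target;
/// below that ceiling it stays a single leaf (grows in place).
pub fn split_if_oversized(merged: Vec<u64>, leaf_target: usize) -> Vec<Vec<u64>> {
    assert!(leaf_target > 0, "leaf_target must be positive");
    if merged.len() <= 2 * leaf_target {
        return vec![merged];
    }
    // Greedy packing into target-sized leaves; the last one may be smaller.
    merged.chunks(leaf_target).map(|chunk| chunk.to_vec()).collect()
}

/// Route sorted novelty keys to leaves by half-open intervals
/// [first_key(i), first_key(i + 1)). The leftmost leaf also takes keys below
/// its first key and the rightmost takes everything above, so there are no gaps.
pub fn slice_novelty_to_leaves<'a>(first_keys: &[u64], novelty: &'a [u64]) -> Vec<&'a [u64]> {
    let mut out = Vec::with_capacity(first_keys.len());
    let mut start = 0;
    for next in first_keys.iter().skip(1) {
        // First index whose key is >= the next leaf's first key.
        let end = novelty.partition_point(|k| k < next);
        out.push(&novelty[start..end]);
        start = end;
    }
    if !first_keys.is_empty() {
        out.push(&novelty[start..]);
    }
    out
}
```

Both inputs are assumed sorted ascending. Because each slice starts where the previous one ended, every novelty key lands in exactly one leaf, which is the gap-free, non-overlapping property the commit's test asserts.
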
Absorb the one-time apply cost with a throwaway load before the measured reads and time graph().load() separately from execute(), reporting query-only latency. This showed the read-after-swap cost lives in query execution (load is ~0.01ms) on leaves that grow with each incremental burst, not in apply or novelty.
The local leaf-open fast path used to std::fs::read the entire leaf blob into a fresh heap buffer on every read and re-decode the directory each time — cost that grows with the leaf and is re-paid per read. Add MmapLeafHandle: mmap the (immutable, content-addressed) leaf so raw bytes stay in OS page cache and only touched pages fault in (no whole-blob copy), and take the decoded directory as an Arc from the shared LeafletCache (parsed once per leaf CID). This also activates the LeafDir warm-on-write already seeded by the incremental build. Column data is still materialized once per leaflet via the V3Batch cache; raw leaf bytes are never copied into the cache budget.
…let scan

filter_batch scanned every row of a leaflet applying the row filter. Leaflet rows are sorted by the order's key, so when the leading sort column is pinned to a single value (e.g. a bound subject on a SPOT scan) binary-search its contiguous [start,end) range and scan only that, instead of the whole (possibly large) leaflet. Output is identical to a full scan — rows outside the range can't match a filter that pins the leading column — so replay/overlay downstream are unaffected. Falls back to a full scan when the leading column is unbound or not a materialized sorted block, so no rows are ever missed. Cuts a bound-subject point read from O(leaflet) to O(log leaflet + result); measured ~2x on warm base-subject reads in the read-after-swap bench.
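
As a rough illustration of the within-leaflet seek, the sketch below uses the standard library's `slice::partition_point` to find the contiguous range for a pinned leading key and falls back to a full scan when the key is unbound. The function names and the row-index closure are hypothetical; the PR's `filter_batch` operates on column batches whose types are not shown here.

```rust
use std::ops::Range;

/// Contiguous [start, end) range of rows whose leading key equals `key`.
/// `leading` must be sorted ascending, as leaflet rows are by the order's key.
pub fn leading_key_range(leading: &[u64], key: u64) -> Range<usize> {
    let start = leading.partition_point(|&k| k < key);
    let end = leading.partition_point(|&k| k <= key);
    start..end
}

/// Apply `row_filter` only inside the bound key's range when the leading
/// column is pinned; otherwise scan every row so no match is missed.
pub fn filter_rows<F>(leading: &[u64], bound: Option<u64>, mut row_filter: F) -> Vec<usize>
where
    F: FnMut(usize) -> bool,
{
    let range = match bound {
        Some(key) => leading_key_range(leading, key),
        None => 0..leading.len(),
    };
    range.filter(|&row| row_filter(row)).collect()
}
```

With the leading column pinned, the two binary searches cost O(log n) and the filter then touches only the matching run, which matches the O(log leaflet + result) bound stated in the commit message.
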
…ap bench

read_new2 (second read of the same just-inserted product) isolates a first-touch effect from insertedness; read_old_random (a different existing product each burst) isolates cold-subject from just-inserted. Together they showed the read tail is not storage: read_old_random and read_new2 are both fast (~0.4ms) while read_new (first query of a fresh generation) is ~3.9ms — pinpointing the per-generation stats-view rebuild that the first query after each reindex swap pays.
bplatz requested review from aaj3f and zonotope July 1, 2026 19:07