perf: warm-on-write read caching + leaf-split to fix reindex-churn read regression by bplatz · Pull Request #1415 · fluree/db

bplatz · 2026-07-01T19:07:06Z

Problem

Under continuous writes with a low reindex_min_bytes, frequent reindexing was
hurting read latency instead of helping, the opposite of the expected
"binary index ≫ novelty query" behavior. A sweep showed product-detail read
latency in the default (100 KB) churn regime to be ~30× worse than with a 2 MB
setting.

The root cause was not cache invalidation: the shared cache is a single global
Arc and is CID-keyed, so warmth for unchanged artifacts survives a swap.
Instead, every reindex swap left the touched artifacts cold, because the
incremental build decoded the leaves it rewrote but never seeded the shared
read cache. Local-read I/O amplification and unbounded trailing-leaf growth
compounded the cost.

Changes

Warm-on-write: the incremental CoW build now seeds the shared LeafletCache
with the leaves it just wrote (decoded under ColumnSet::ALL) and warms the
reverse-dict leaves, so the first read after a swap hits warm instead of
cold-decoding. Wired through a late-bound WarmCacheSource at a single
chokepoint (start_background_indexing_dyn) so every build path is covered.

Superset cache fallback: a narrow (projection-keyed) read is served from a
cached ALL batch via ColumnBatch::project_to (invariant: cached ⊇ requested),
so warming does not duplicate column data across cache entries.

Bounded incremental leaves: oversized touched leaves are split during the
incremental build (gap-free, with first_key → next-first-key boundaries and
±∞ sentinels at the ends), fixing unbounded growth of the trailing leaf.

Local read path: mmap local leaf reads and cache the decoded directory.
Previously a local read did std::fs::read on the whole leaf to serve one of
its leaflets and re-decoded the directory, bypassing the CID-keyed LeafDir
cache.

Within-leaflet seek: binary-search the bound leading key within a leaflet
scan instead of scanning the leaflet linearly to find the subject's rows.

Bench

Adds reindex_swap_read_profile, a deterministic read-after-reindex-swap harness
that isolates the per-swap cold cost as a single number (not inferred from QMpH),
with first-touch and cold-subject controls.

Validation

Existing-data reads after a churn swap drop from ~3 ms to ~0.2–0.3 ms (~10×) in
the read-after-swap microbench, with cold-miss cache insertions driven toward
zero. Query correctness is unchanged (query and api group suites pass).

Scope note

A follow-on experiment — warming the per-generation query caches (overlay
translation + stats view) at index apply — was prototyped and dropped: it is flat
on BSBM update (the between-swap overlay is small and the query:swap ratio is
high, so the per-swap translation cost amortizes to noise), and the remaining
default-regime cost is the co-located background build competing for CPU, which is
a separate lever tracked for later.

Leaflet V3 column batches are cached under the projection (columns) a reader decoded, so a warm full-projection batch could not serve a narrower read. Add ColumnBatch::project_to and a superset fallback in try_get_or_decode_v3_batch: on an exact-key miss, serve a cached ColumnSet::ALL batch by projecting it down to the requested columns (invariant: cached columns must be a superset of requested; unrequested columns become AbsentDefault). Also add insert_leaf_dir/insert_v3_batch for writer-side seeding and a Debug impl for LeafletCache. This lets a single ALL batch per leaflet satisfy every projection and is the read-side half of warm-on-write.
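The superset fallback can be illustrated with a minimal sketch. The types below are simplified stand-ins written for this example, not the PR's actual definitions: ColumnSet is modeled as a three-bit mask, Column as a plain vector or an AbsentDefault placeholder, and BatchCache is a hypothetical cache keyed by (leaflet id, projection). Only the lookup order (exact key first, then a projected ALL batch) and the superset invariant are taken from the description above.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for ColumnSet: a bitmask over three columns.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ColumnSet(pub u8);

impl ColumnSet {
    pub const S: ColumnSet = ColumnSet(0b001);
    pub const P: ColumnSet = ColumnSet(0b010);
    pub const O: ColumnSet = ColumnSet(0b100);
    pub const ALL: ColumnSet = ColumnSet(0b111);

    pub fn is_superset_of(self, other: ColumnSet) -> bool {
        self.0 & other.0 == other.0
    }
}

/// Simplified column: decoded values, or a placeholder for an unrequested column.
#[derive(Clone, Debug, PartialEq)]
pub enum Column {
    Values(Vec<u64>),
    AbsentDefault,
}

#[derive(Clone, Debug, PartialEq)]
pub struct ColumnBatch {
    pub columns: ColumnSet,
    pub s: Column,
    pub p: Column,
    pub o: Column,
}

impl ColumnBatch {
    /// Project down to `requested`. Returns None when the invariant
    /// (cached columns ⊇ requested columns) does not hold.
    pub fn project_to(&self, requested: ColumnSet) -> Option<ColumnBatch> {
        if !self.columns.is_superset_of(requested) {
            return None;
        }
        // Requested columns are kept; every other column becomes AbsentDefault.
        let keep = |bit: ColumnSet, col: &Column| {
            if requested.is_superset_of(bit) {
                col.clone()
            } else {
                Column::AbsentDefault
            }
        };
        Some(ColumnBatch {
            columns: requested,
            s: keep(ColumnSet::S, &self.s),
            p: keep(ColumnSet::P, &self.p),
            o: keep(ColumnSet::O, &self.o),
        })
    }
}

/// Hypothetical cache keyed by (leaflet id, projection).
pub struct BatchCache {
    entries: HashMap<(u128, ColumnSet), ColumnBatch>,
}

impl BatchCache {
    pub fn new() -> Self {
        BatchCache { entries: HashMap::new() }
    }

    /// Writer-side seeding: store a batch under the projection it was decoded with.
    pub fn insert(&mut self, leaflet_id: u128, batch: ColumnBatch) {
        self.entries.insert((leaflet_id, batch.columns), batch);
    }

    /// Exact-key hit first; on a miss, fall back to a cached ALL batch projected down.
    pub fn get(&self, leaflet_id: u128, requested: ColumnSet) -> Option<ColumnBatch> {
        if let Some(hit) = self.entries.get(&(leaflet_id, requested)) {
            return Some(hit.clone());
        }
        self.entries.get(&(leaflet_id, ColumnSet::ALL))?.project_to(requested)
    }
}
```

With this shape, seeding one ALL batch per leaflet is enough for a later S-only read to hit. The sketch clones column data on projection for simplicity; a real implementation would presumably share it.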

The streaming copy-on-write path now seeds the shared LeafletCache with the leaflets it just wrote, decoded under ColumnSet::ALL, so a co-located query server's immediate read of a freshly-rewritten (new-CID) leaf hits the cache instead of re-reading and re-decoding from disk. The leaf id derivation matches the reader (xxh3_128 of the leaf CID). Gated behind a late-bound WarmCacheSource resolver on IndexerConfig (None by default); only the CoW update path warms, never fresh/full rebuilds (which would decode the whole graph).
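A rough sketch of the gating described above, under stated assumptions: the trait shape, the SharedCache alias, the BuildKind enum, and warm_after_write are all hypothetical names invented for illustration, and the cache is reduced to a map from leaf id to decoded bytes. It only shows the two gates from the text: the resolver is optional (None by default), and only the incremental CoW path warms.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical shared cache: leaf id -> decoded payload.
pub type SharedCache = Arc<Mutex<HashMap<u128, Vec<u8>>>>;

/// Late-bound resolver: the indexer asks for the cache at build time
/// instead of capturing it when the config is constructed.
pub trait WarmCacheSource: Send + Sync {
    fn resolve(&self) -> Option<SharedCache>;
}

/// Simplified config; `None` means "do not warm" (e.g. a separate-machine indexer).
pub struct IndexerConfig {
    pub warm_cache_source: Option<Arc<dyn WarmCacheSource>>,
}

/// Trivial resolver that always hands back one fixed cache (illustration only).
pub struct StaticSource(pub SharedCache);

impl WarmCacheSource for StaticSource {
    fn resolve(&self) -> Option<SharedCache> {
        Some(self.0.clone())
    }
}

#[derive(Clone, Copy, PartialEq)]
pub enum BuildKind {
    IncrementalCow,
    FullRebuild,
}

/// Seed the cache with a just-written leaf. Returns true if it was warmed.
pub fn warm_after_write(
    cfg: &IndexerConfig,
    kind: BuildKind,
    leaf_id: u128,
    decoded: Vec<u8>,
) -> bool {
    // Full rebuilds never warm: that would decode the whole graph.
    if kind != BuildKind::IncrementalCow {
        return false;
    }
    let Some(source) = cfg.warm_cache_source.as_ref() else {
        return false;
    };
    let Some(cache) = source.resolve() else {
        return false;
    };
    cache.lock().unwrap().insert(leaf_id, decoded);
    true
}
```

The sketch takes leaf_id as a plain u128 and does not compute it; the PR derives it as xxh3_128 of the leaf CID so that writer and reader agree on the key.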

Resolve the shared read cache for the background indexer from the running LedgerManager (LedgerManagerWarmCache), set once in start_background_indexing_dyn so every co-located build path warms the exact cache readers use; separate-machine indexers leave it unset. Add reindex_swap_read_profile, an instrumented harness that drives BSBM-shape write bursts and measures post-swap read latency, build cadence, and cache occupancy so warm-on-write is measurable independent of QMpH.

Extend warm-on-write to the reverse-dictionary tree: when the incremental CoW build writes a new reverse-dict leaf, seed the shared cache (DictLeaf) with its bytes so a co-located reader resolving a just-added IRI/string hits the cache instead of a cold read. Adds LeafletCache::insert_dict_leaf; the reader keys dict leaves on the CAS address string (cid.to_string()), which the warm key matches exactly. Threaded through the reverse-tree upload path and gated by the same warm_cache_source resolver (co-located only).

Absorb the one-time apply/swap cost with a throwaway load before the measured reads, so latency reflects a query on the applied generation (as a client read does after the background listener applies) rather than the first-load apply cost.

The incremental copy-on-write build never split a touched leaf (leaf_target_rows was bumped to existing_total+novelty for CID stability), so appended high-SID subjects concentrated into an ever-growing trailing leaf whose whole-blob read + full directory decode inflate every point read. Now a touched leaf splits once its merged size exceeds 2x the leaf target (matching the config's leaf_max = 2*leaf_target and the full-build LeafWriter's greedy packing), into bounded ~target-sized leaves; below the ceiling it still grows in place to avoid churning the branch on small commits. Applies uniformly to every touched leaf (middle, leftmost, rightmost). Splits are gap-free: novelty is routed to leaves by first_key(next) half-open intervals (slice_novelty_to_leaves), and the leftmost/rightmost leaves keep their -inf/+inf coverage. Adds a test asserting a split preserves the full row count, keeps leaves strictly ordered and non-overlapping, and spans the full key range.
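The split threshold and the gap-free routing can be sketched as follows. This is an illustrative model, not the PR's code: rows and keys are plain u64 values, split_if_oversized and route_novelty are hypothetical helpers (the PR's routing function is slice_novelty_to_leaves, whose signature is not shown here), and the even chunking is an assumption about how "~target-sized" leaves might be produced.

```rust
/// Split a touched leaf's merged, sorted rows once they exceed 2 × `leaf_target`.
/// Below that ceiling the leaf grows in place and is returned as a single leaf.
pub fn split_if_oversized(rows: Vec<u64>, leaf_target: usize) -> Vec<Vec<u64>> {
    assert!(leaf_target > 0, "leaf_target must be positive");
    if rows.len() <= 2 * leaf_target {
        return vec![rows];
    }
    // Smallest leaf count that keeps every leaf at or under the target,
    // then spread the rows roughly evenly across that many leaves.
    let n_leaves = (rows.len() + leaf_target - 1) / leaf_target;
    let per_leaf = (rows.len() + n_leaves - 1) / n_leaves;
    rows.chunks(per_leaf).map(|c| c.to_vec()).collect()
}

/// Route sorted novelty keys to leaves by half-open first-key intervals.
/// Leaf i owns [first_keys[i], first_keys[i + 1]); the leftmost leaf also owns
/// everything below its first key and the rightmost everything above, so no
/// key can fall into a gap.
pub fn route_novelty(first_keys: &[u64], novelty: &[u64]) -> Vec<Vec<u64>> {
    assert!(!first_keys.is_empty(), "need at least one leaf");
    let mut out = vec![Vec::new(); first_keys.len()];
    for &k in novelty {
        // Count of leaves whose first key is <= k; the last of them owns k.
        let idx = first_keys.partition_point(|&fk| fk <= k);
        // saturating_sub maps keys below the first leaf's first key to leaf 0.
        out[idx.saturating_sub(1)].push(k);
    }
    out
}
```

For example, with first keys 10, 20, 30, the novelty keys 5 and 19 both land in the leftmost leaf, 20 lands in the middle leaf, and 35 lands in the rightmost leaf.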

Absorb the one-time apply cost with a throwaway load before the measured reads and time graph().load() separately from execute(), reporting query-only latency. This showed the read-after-swap cost lives in query execution (load is ~0.01ms) on leaves that grow with each incremental burst, not in apply or novelty.
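The measurement shape described here (one throwaway load, then load and execute timed separately) might look like the sketch below. measure_read and ReadSample are hypothetical names for this example; the closures stand in for graph().load() and execute(), whose real signatures are not shown in this PR text.

```rust
use std::time::{Duration, Instant};

/// Per-read timings, with load and execute reported separately.
pub struct ReadSample {
    pub load: Duration,
    pub execute: Duration,
}

/// Run one measured read. `load` is called twice: the first call is a
/// throwaway that absorbs any one-time apply cost, the second is timed.
pub fn measure_read<G, R>(
    mut load: impl FnMut() -> G,
    execute: impl FnOnce(&G) -> R,
) -> (R, ReadSample) {
    // Throwaway load: its result and its cost are both discarded.
    drop(load());

    let t_load = Instant::now();
    let graph = load();
    let load_elapsed = t_load.elapsed();

    let t_exec = Instant::now();
    let result = execute(&graph);
    let exec_elapsed = t_exec.elapsed();

    (result, ReadSample { load: load_elapsed, execute: exec_elapsed })
}
```

Reporting the two durations separately is what lets a harness attribute the post-swap cost to query execution rather than to load or apply.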

The local leaf-open fast path used to std::fs::read the entire leaf blob into a fresh heap buffer on every read and re-decode the directory each time — cost that grows with the leaf and is re-paid per read. Add MmapLeafHandle: mmap the (immutable, content-addressed) leaf so raw bytes stay in OS page cache and only touched pages fault in (no whole-blob copy), and take the decoded directory as an Arc from the shared LeafletCache (parsed once per leaf CID). This also activates the LeafDir warm-on-write already seeded by the incremental build. Column data is still materialized once per leaflet via the V3Batch cache; raw leaf bytes are never copied into the cache budget.
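The "parsed once per leaf CID" half of this change can be sketched on its own. The mmap half is omitted because it needs a memory-mapping crate outside the standard library. DirCache, its methods, and the LeafDir fields below are hypothetical simplifications, not the PR's LeafletCache API; the sketch only shows a decoded directory being shared as an Arc and optionally seeded by the writer.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Simplified decoded directory: byte offsets of the leaflets inside a leaf.
#[derive(Debug, PartialEq)]
pub struct LeafDir {
    pub leaflet_offsets: Vec<u32>,
}

/// Hypothetical CID-keyed directory cache. Leaves are immutable and
/// content-addressed, so an entry never needs invalidation.
#[derive(Default)]
pub struct DirCache {
    dirs: Mutex<HashMap<String, Arc<LeafDir>>>,
}

impl DirCache {
    /// Return the cached directory, decoding it only on the first request.
    pub fn get_or_decode(&self, cid: &str, decode: impl FnOnce() -> LeafDir) -> Arc<LeafDir> {
        let mut dirs = self.dirs.lock().unwrap();
        if let Some(hit) = dirs.get(cid) {
            return Arc::clone(hit);
        }
        let dir = Arc::new(decode());
        dirs.insert(cid.to_string(), Arc::clone(&dir));
        dir
    }

    /// Writer-side seeding: the build already has the directory in hand.
    pub fn seed(&self, cid: &str, dir: LeafDir) {
        self.dirs.lock().unwrap().insert(cid.to_string(), Arc::new(dir));
    }
}
```

When the writer seeds an entry for a new CID, the first reader of that leaf never runs the decode closure at all, which is the effect the text describes as activating the LeafDir warm-on-write.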

…let scan filter_batch scanned every row of a leaflet applying the row filter. Leaflet rows are sorted by the order's key, so when the leading sort column is pinned to a single value (e.g. a bound subject on a SPOT scan) binary-search its contiguous [start,end) range and scan only that, instead of the whole (possibly large) leaflet. Output is identical to a full scan — rows outside the range can't match a filter that pins the leading column — so replay/overlay downstream are unaffected. Falls back to a full scan when the leading column is unbound or not a materialized sorted block, so no rows are ever missed. Cuts a bound-subject point read from O(leaflet) to O(log leaflet + result); measured ~2x on warm base-subject reads in the read-after-swap bench.
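The within-leaflet seek amounts to two binary searches over the sorted leading column. A minimal sketch, assuming the leading column is materialized as a sorted slice of u64 subject ids; pinned_range and scan_bound_subject are hypothetical names, and the real filter_batch operates on column batches rather than plain slices.

```rust
use std::ops::Range;

/// Contiguous [start, end) range of rows whose leading column equals `pinned`.
/// `leading` must be sorted ascending, as leaflet rows are for the order's key.
pub fn pinned_range(leading: &[u64], pinned: u64) -> Range<usize> {
    // First row that is not less than the pinned value.
    let start = leading.partition_point(|&v| v < pinned);
    // From there, the run of rows equal to the pinned value.
    let end = start + leading[start..].partition_point(|&v| v == pinned);
    start..end
}

/// Apply the remaining row filter only inside the pinned range and return
/// the matching row indices, instead of testing every row of the leaflet.
pub fn scan_bound_subject(
    subjects: &[u64],
    objects: &[u64],
    subject: u64,
    keep: impl Fn(u64) -> bool,
) -> Vec<usize> {
    pinned_range(subjects, subject)
        .filter(|&i| keep(objects[i]))
        .collect()
}
```

For a subject that does not occur in the leaflet, the range is empty and no rows are tested. The result matches a full scan because rows outside the range cannot satisfy a filter that pins the leading column.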

…ap bench read_new2 (second read of the same just-inserted product) isolates a first-touch effect from insertedness; read_old_random (a different existing product each burst) isolates cold-subject from just-inserted. Together they showed the read tail is not storage: read_old_random and read_new2 are both fast (~0.4ms) while read_new (first query of a fresh generation) is ~3.9ms — pinpointing the per-generation stats-view rebuild that the first query after each reindex swap pays.

bplatz added 10 commits July 1, 2026 13:33

bplatz requested review from aaj3f and zonotope July 1, 2026 19:07

fmt

497a3be

bplatz wants to merge 11 commits into main from feature/warm-on-write-reindex-cache
