perf: warm-on-write read caching + leaf-split to fix reindex-churn read regression #1415
Open
bplatz wants to merge 11 commits into
Leaflet V3 column batches are cached under the projection (columns) a reader decoded, so a warm full-projection batch could not serve a narrower read. Add ColumnBatch::project_to and a superset fallback in try_get_or_decode_v3_batch: on an exact-key miss, serve a cached ColumnSet::ALL batch by projecting it down to the requested columns (invariant: cached columns must be a superset of requested; unrequested columns become AbsentDefault). Also add insert_leaf_dir/insert_v3_batch for writer-side seeding and a Debug impl for LeafletCache. This lets a single ALL batch per leaflet satisfy every projection and is the read-side half of warm-on-write.
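The superset fallback described in this commit can be sketched as follows. This is a minimal illustration, not the PR's code: the bitmask layout of `ColumnSet`, the `Column` enum, the three-column `ColumnBatch`, and the `BatchCache` type with its `get` method are all assumptions made up for the example; only the names `ColumnSet::ALL`, `ColumnBatch::project_to`, and `AbsentDefault` come from the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical bitmask of decoded columns (layout assumed for illustration).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ColumnSet(pub u8);

impl ColumnSet {
    pub const S: ColumnSet = ColumnSet(0b001);
    pub const P: ColumnSet = ColumnSet(0b010);
    pub const O: ColumnSet = ColumnSet(0b100);
    pub const ALL: ColumnSet = ColumnSet(0b111);

    /// True when every column in `other` is also in `self`.
    pub fn contains(self, other: ColumnSet) -> bool {
        self.0 & other.0 == other.0
    }
}

#[derive(Clone, Debug, PartialEq)]
pub enum Column {
    Values(Vec<u64>),
    AbsentDefault,
}

#[derive(Clone, Debug, PartialEq)]
pub struct ColumnBatch {
    pub columns: ColumnSet,
    pub s: Column,
    pub p: Column,
    pub o: Column,
}

impl ColumnBatch {
    /// Project down to `requested`. Returns None when the invariant
    /// (cached columns are a superset of requested) does not hold.
    pub fn project_to(&self, requested: ColumnSet) -> Option<ColumnBatch> {
        if !self.columns.contains(requested) {
            return None;
        }
        // Unrequested columns become AbsentDefault.
        let keep = |bit: ColumnSet, col: &Column| {
            if requested.contains(bit) { col.clone() } else { Column::AbsentDefault }
        };
        Some(ColumnBatch {
            columns: requested,
            s: keep(ColumnSet::S, &self.s),
            p: keep(ColumnSet::P, &self.p),
            o: keep(ColumnSet::O, &self.o),
        })
    }
}

/// Hypothetical cache keyed by (leaflet id, projection).
pub struct BatchCache {
    entries: HashMap<(u128, ColumnSet), ColumnBatch>,
}

impl BatchCache {
    pub fn new() -> Self {
        BatchCache { entries: HashMap::new() }
    }

    pub fn insert(&mut self, leaflet: u128, batch: ColumnBatch) {
        self.entries.insert((leaflet, batch.columns), batch);
    }

    /// Exact-key hit first; on a miss, fall back to a cached ALL batch.
    pub fn get(&self, leaflet: u128, requested: ColumnSet) -> Option<ColumnBatch> {
        if let Some(hit) = self.entries.get(&(leaflet, requested)) {
            return Some(hit.clone());
        }
        self.entries.get(&(leaflet, ColumnSet::ALL))?.project_to(requested)
    }
}
```

The sketch clones column vectors for brevity; a real cache would presumably share column data behind reference-counted handles so that a projection does not copy it.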
The streaming copy-on-write path now seeds the shared LeafletCache with the leaflets it just wrote, decoded under ColumnSet::ALL, so a co-located query server's immediate read of a freshly-rewritten (new-CID) leaf hits the cache instead of re-reading and re-decoding from disk. The leaf id derivation matches the reader (xxh3_128 of the leaf CID). Gated behind a late-bound WarmCacheSource resolver on IndexerConfig (None by default); only the CoW update path warms, never fresh/full rebuilds (which would decode the whole graph).
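The gating described here (resolver unset by default, only the copy-on-write path warms) might look roughly like the sketch below. Everything about the shapes is assumed: the `resolve` method, the `BuildKind` enum, the `warm_after_write` helper, the `SharedCache` wrapper, and the simplified `insert_v3_batch` signature are hypothetical, and the leaf id is taken as a precomputed `u128` rather than derived with xxh3_128.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for the shared read cache; a decoded batch is just a Vec here.
pub struct LeafletCache {
    batches: Mutex<HashMap<u128, Arc<Vec<u64>>>>,
}

impl LeafletCache {
    pub fn new() -> Self {
        LeafletCache { batches: Mutex::new(HashMap::new()) }
    }

    /// Writer-side seeding (signature assumed for illustration).
    pub fn insert_v3_batch(&self, leaf_id: u128, batch: Arc<Vec<u64>>) {
        self.batches.lock().unwrap().insert(leaf_id, batch);
    }

    pub fn get(&self, leaf_id: u128) -> Option<Arc<Vec<u64>>> {
        self.batches.lock().unwrap().get(&leaf_id).cloned()
    }
}

/// Late-bound resolver: asked at build time which cache (if any) to warm.
pub trait WarmCacheSource: Send + Sync {
    fn resolve(&self) -> Option<Arc<LeafletCache>>;
}

/// Hypothetical resolver that always hands back one shared cache.
pub struct SharedCache(pub Arc<LeafletCache>);

impl WarmCacheSource for SharedCache {
    fn resolve(&self) -> Option<Arc<LeafletCache>> {
        Some(Arc::clone(&self.0))
    }
}

pub struct IndexerConfig {
    /// None by default; separate-machine indexers leave it unset.
    pub warm_cache_source: Option<Arc<dyn WarmCacheSource>>,
}

pub enum BuildKind {
    IncrementalCow,
    FullRebuild,
}

/// Seed the cache with a freshly written leaf. Returns true if it warmed.
pub fn warm_after_write(
    cfg: &IndexerConfig,
    kind: BuildKind,
    leaf_id: u128,
    decoded: Arc<Vec<u64>>,
) -> bool {
    // Full rebuilds never warm: that would decode the whole graph.
    if !matches!(kind, BuildKind::IncrementalCow) {
        return false;
    }
    let Some(source) = cfg.warm_cache_source.as_ref() else { return false };
    let Some(cache) = source.resolve() else { return false };
    cache.insert_v3_batch(leaf_id, decoded);
    true
}
```

Keeping the resolver optional means a build that runs away from the query server pays nothing for the feature.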
Resolve the shared read cache for the background indexer from the running LedgerManager (LedgerManagerWarmCache), set once in start_background_indexing_dyn so every co-located build path warms the exact cache readers use; separate-machine indexers leave it unset. Add reindex_swap_read_profile, an instrumented harness that drives BSBM-shape write bursts and measures post-swap read latency, build cadence, and cache occupancy so warm-on-write is measurable independent of QMpH.
Extend warm-on-write to the reverse-dictionary tree: when the incremental CoW build writes a new reverse-dict leaf, seed the shared cache (DictLeaf) with its bytes so a co-located reader resolving a just-added IRI/string hits the cache instead of a cold read. Adds LeafletCache::insert_dict_leaf; the reader keys dict leaves on the CAS address string (cid.to_string()), which the warm key matches exactly. Threaded through the reverse-tree upload path and gated by the same warm_cache_source resolver (co-located only).
Absorb the one-time apply/swap cost with a throwaway load before the measured reads, so latency reflects a query on the applied generation (as a client read does after the background listener applies) rather than the first-load apply cost.
The incremental copy-on-write build never split a touched leaf (leaf_target_rows was bumped to existing_total+novelty for CID stability), so appended high-SID subjects concentrated into an ever-growing trailing leaf whose whole-blob read + full directory decode inflate every point read. Now a touched leaf splits once its merged size exceeds 2x the leaf target (matching the config's leaf_max = 2*leaf_target and the full-build LeafWriter's greedy packing), into bounded ~target-sized leaves; below the ceiling it still grows in place to avoid churning the branch on small commits. Applies uniformly to every touched leaf (middle, leftmost, rightmost). Splits are gap-free: novelty is routed to leaves by first_key(next) half-open intervals (slice_novelty_to_leaves), and the leftmost/rightmost leaves keep their -inf/+inf coverage. Adds a test asserting a split preserves the full row count, keeps leaves strictly ordered and non-overlapping, and spans the full key range.
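A rough sketch of the two pieces this commit describes: the 2x split ceiling and the half-open routing of novelty. Keys are plain `u64` values, and the function signatures are assumptions (only the name `slice_novelty_to_leaves` appears in the commit). The even division of rows across the resulting leaves is also an assumption; the actual build may pack greedily instead.

```rust
/// Split a merged, sorted leaf once it exceeds 2x the target; below that
/// ceiling the leaf is returned unchanged (it grows in place).
pub fn split_leaf(rows: Vec<u64>, leaf_target: usize) -> Vec<Vec<u64>> {
    assert!(leaf_target > 0, "leaf_target must be positive");
    if rows.len() <= 2 * leaf_target {
        return vec![rows];
    }
    // Round to the nearest leaf count so each leaf holds roughly the target.
    let n = (rows.len() + leaf_target / 2) / leaf_target;
    let base = rows.len() / n;
    let extra = rows.len() % n;
    let mut out = Vec::with_capacity(n);
    let mut it = rows.into_iter();
    for i in 0..n {
        // The first `extra` leaves take one additional row.
        let take = base + usize::from(i < extra);
        out.push(it.by_ref().take(take).collect());
    }
    out
}

/// Route novelty keys to leaves by half-open intervals
/// [first_key(i), first_key(i + 1)). The leftmost leaf also covers
/// everything below its first key, and the rightmost everything above.
pub fn slice_novelty_to_leaves(first_keys: &[u64], novelty: &[u64]) -> Vec<Vec<u64>> {
    assert!(!first_keys.is_empty(), "need at least one leaf");
    let mut out = vec![Vec::new(); first_keys.len()];
    for &key in novelty {
        // Count interior boundaries at or below the key; that count is the
        // index of the owning leaf, so no key can fall into a gap.
        let idx = first_keys[1..].partition_point(|&boundary| boundary <= key);
        out[idx].push(key);
    }
    out
}
```

For example, with first keys `[100, 200, 300]`, a key of 5 lands in the first leaf (its coverage extends to negative infinity), 200 lands in the second, and 999 lands in the third.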
Absorb the one-time apply cost with a throwaway load before the measured reads and time graph().load() separately from execute(), reporting query-only latency. This showed the read-after-swap cost lives in query execution (load is ~0.01ms) on leaves that grow with each incremental burst, not in apply or novelty.
The local leaf-open fast path used to std::fs::read the entire leaf blob into a fresh heap buffer on every read and re-decode the directory each time — cost that grows with the leaf and is re-paid per read. Add MmapLeafHandle: mmap the (immutable, content-addressed) leaf so raw bytes stay in OS page cache and only touched pages fault in (no whole-blob copy), and take the decoded directory as an Arc from the shared LeafletCache (parsed once per leaf CID). This also activates the LeafDir warm-on-write already seeded by the incremental build. Column data is still materialized once per leaflet via the V3Batch cache; raw leaf bytes are never copied into the cache budget.
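The "parsed once per leaf CID" half of this change can be illustrated with a small get-or-decode cache. This sketch deliberately leaves out the mmap part, and the `LeafDir` fields, the `DirCache` type, and `get_or_decode` are hypothetical names chosen for the example.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for a decoded leaf directory.
#[derive(Debug)]
pub struct LeafDir {
    pub leaflet_offsets: Vec<u32>,
}

/// Hypothetical directory cache keyed by leaf CID string.
#[derive(Default)]
pub struct DirCache {
    dirs: Mutex<HashMap<String, Arc<LeafDir>>>,
}

impl DirCache {
    /// Return the cached directory for `cid`, decoding it only on a miss.
    /// Leaves are immutable and content-addressed, so an entry never goes
    /// stale and needs no invalidation.
    pub fn get_or_decode(&self, cid: &str, decode: impl FnOnce() -> LeafDir) -> Arc<LeafDir> {
        let mut dirs = self.dirs.lock().unwrap();
        if let Some(dir) = dirs.get(cid) {
            return Arc::clone(dir);
        }
        // Decoding under the lock keeps the sketch short; a production
        // cache would likely decode outside it.
        let dir = Arc::new(decode());
        dirs.insert(cid.to_string(), Arc::clone(&dir));
        dir
    }
}
```

Every reader of the same leaf then shares one `Arc<LeafDir>` instead of re-decoding the directory per read.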
…let scan filter_batch scanned every row of a leaflet applying the row filter. Leaflet rows are sorted by the order's key, so when the leading sort column is pinned to a single value (e.g. a bound subject on a SPOT scan) binary-search its contiguous [start,end) range and scan only that, instead of the whole (possibly large) leaflet. Output is identical to a full scan — rows outside the range can't match a filter that pins the leading column — so replay/overlay downstream are unaffected. Falls back to a full scan when the leading column is unbound or not a materialized sorted block, so no rows are ever missed. Cuts a bound-subject point read from O(leaflet) to O(log leaflet + result); measured ~2x on warm base-subject reads in the read-after-swap bench.
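The range narrowing can be sketched with the standard library's `partition_point`. The column representation (a sorted `&[u64]` for the leading column), the `pinned_range` helper, and this simplified `filter_batch` signature are assumptions for illustration, not the PR's actual code.

```rust
use std::ops::Range;

/// Contiguous [start, end) range of rows whose leading sort column equals
/// `pinned`. `leading` must be sorted ascending.
pub fn pinned_range(leading: &[u64], pinned: u64) -> Range<usize> {
    let start = leading.partition_point(|&v| v < pinned);
    let end = leading.partition_point(|&v| v <= pinned);
    start..end
}

/// Apply `row_filter` only to candidate rows and return matching indices.
/// With a pinned leading value the candidates are the binary-searched
/// range; otherwise the whole leaflet is scanned, so no rows are missed.
pub fn filter_batch(
    leading: &[u64],
    pinned: Option<u64>,
    mut row_filter: impl FnMut(usize) -> bool,
) -> Vec<usize> {
    let candidates = match pinned {
        Some(value) => pinned_range(leading, value),
        None => 0..leading.len(),
    };
    candidates.filter(|&row| row_filter(row)).collect()
}
```

Both searches are O(log n), and rows outside the returned range cannot equal the pinned value, so the result matches a full scan whose filter also checks the leading column.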
…ap bench read_new2 (second read of the same just-inserted product) isolates a first-touch effect from insertedness; read_old_random (a different existing product each burst) isolates cold-subject from just-inserted. Together they showed the read tail is not storage: read_old_random and read_new2 are both fast (~0.4ms) while read_new (first query of a fresh generation) is ~3.9ms — pinpointing the per-generation stats-view rebuild that the first query after each reindex swap pays.
Problem
Under continuous writes with a low reindex_min_bytes, frequent reindexing was
hurting read latency instead of helping, the opposite of the expected
"binary index ≫ novelty query" behavior. A sweep showed the default (100 KB)
churn regime collapsing product-detail read latency ~30× versus a 2 MB setting.
Root cause was not cache invalidation (the shared cache is a single global Arc
and CID-keyed, so unchanged warmth survives a swap). It was that every reindex
swap left the touched artifacts cold (the incremental build decoded the leaves
it rewrote but never seeded the shared read cache), compounded by local-read I/O
amplification and unbounded trailing-leaf growth.
Changes
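- Warm-on-write: the incremental copy-on-write build seeds the shared LeafletCache with the leaves it just wrote (decoded under ColumnSet::ALL) and warms the reverse-dict leaves, so the first read after a swap hits warm instead of cold-decoding. Wired through a late-bound WarmCacheSource at a single chokepoint (start_background_indexing_dyn) so every build path is covered.
- Superset projection: a narrower read is served from the cached ALL batch via ColumnBatch::project_to (invariant: cached ⊇ requested), so warming does not duplicate column data across cache entries.
- Leaf split: a touched leaf that outgrows 2x the leaf target is split in the incremental build (gap-free, with first_key → next-first-key boundaries and ±∞ sentinels at the ends), fixing unbounded growth of the trailing leaf.
- Memory-mapped local leaf reads (MmapLeafHandle): previously a local read did std::fs::read on the whole leaf to serve one of its leaflets and re-decoded the directory, bypassing the CID-keyed LeafDir cache.
- Range-bounded scan: when the leading sort column is pinned to a single value, filter_batch binary-searches the matching row range and scans only that, instead of scanning the leaflet linearly to find the subject's rows.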
Bench
Adds reindex_swap_read_profile, a deterministic read-after-reindex-swap harness
that isolates the per-swap cold cost as a single number (not inferred from QMpH),
with first-touch and cold-subject controls.
Validation
Existing-data reads after a churn swap drop from ~3 ms to ~0.2–0.3 ms (~10×) in
the read-after-swap microbench, with cold-miss cache insertions driven toward
zero. Query correctness is unchanged (query and api group suites pass).
Scope note
A follow-on experiment (warming the per-generation query caches, overlay
translation + stats view, at index apply) was prototyped and dropped: it is flat
on BSBM update (the between-swap overlay is small and the query:swap ratio is
high, so the per-swap translation cost amortizes to noise), and the remaining
default-regime cost is the co-located background build competing for CPU, which is
a separate lever tracked for later.