Sdstor 22200 #884

Open

shosseinimotlagh wants to merge 10 commits into eBay:dev/SDSTOR-22200 from shosseinimotlagh:SDSTOR-22200

Conversation

@shosseinimotlagh
Contributor

No description provided.

The class was named BlkAllocMetrics but is used exclusively by
VarsizeBlkAllocator and its slab subsystem (SlabMetrics registers it as its
parent), so it is renamed to VarsizeBlkAllocMetrics. The name is now accurate
and leaves room for a separate FixedBlkAllocMetrics class.

FixedBlkAllocMetrics: new metrics class with num_alloc and num_alloc_failure
counters and a blk_alloc_memory_size gauge. Wired into the FixedBlkAllocator
constructor; the gauge is set once at construction since the MPMC queue
capacity is fixed (total_blks × 12B per slot: 6B BlkId + 6B folly sequence
field; confirmed via GDB on a 461GB index device = 10.06GB).

VarsizeBlkAllocMetrics: adds the same blk_alloc_memory_size gauge, updated
once after FreeBlkCacheQueue construction via the new total_slab_capacity()
helper (sum of all slab entry_capacity() × 12B per slot = 1.60GB for the
production data blkstore configuration).
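
To make the arithmetic concrete, here is a minimal, self-contained sketch of
both gauge computations; kSlotSizeBytes and the function names are
illustrative stand-ins, not the actual allocator code:

    #include <cstdint>
    #include <vector>

    // 12B per queue slot: 6B BlkId + 6B folly MPMC sequence field.
    constexpr uint64_t kSlotSizeBytes = 12;

    // FixedBlkAllocator: queue capacity is fixed at construction, so the
    // gauge is computed once. Per the commit, a 461GB index device comes
    // out to ~10.06GB of queue memory.
    constexpr uint64_t fixed_alloc_queue_bytes(uint64_t total_blks) {
        return total_blks * kSlotSizeBytes;
    }

    // VarsizeBlkAllocator: sum each slab's entry capacity, mirroring what
    // the new total_slab_capacity() helper reports (1.60GB for the
    // production data blkstore configuration, per the commit).
    uint64_t varsize_alloc_queue_bytes(const std::vector<uint64_t>& slab_capacities) {
        uint64_t total_entries = 0;
        for (uint64_t cap : slab_capacities) { total_entries += cap; }
        return total_entries * kSlotSizeBytes;
    }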
The cache evictor's get_size() callback was returning only m_cache_size (the
raw node data buffer), but each cached entry allocates additional heap memory:

  actual_cost = m_cache_size + sizeof(CacheBufferType)
                             + sizeof(homeds::MemVector)
                             + sizeof(homeds::MemPiece)

This missing overhead caused two interconnected bugs:

1. The cache_size metric underreported actual RSS by a factor of:
   (m_cache_size + overhead) / m_cache_size

2. Evictor's m_cur_size tracked only m_cache_size, so when m_cur_size
   approached m_max_size, actual RSS was already at:
   m_max_size × (m_cache_size + overhead) / m_cache_size

   This prevented eviction from triggering before OOM.

The fix adds the overhead using compile-time sizeof() for each component,
ensuring that both the cache_size gauge and the eviction threshold comparison
(m_cur_size vs. m_max_size) accurately reflect true memory consumption.

For typical 512B btree nodes: overhead ≈ 320B → multiplier ≈ 1.625×
For larger 4KB nodes: same absolute overhead → multiplier ≈ 1.078×

The formula scales correctly across different node sizes without hardcoding.
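
As a quick check of those multipliers, a minimal sketch; kEntryOverheadBytes
stands in for the ~320B figure above, which the real fix derives from
sizeof() on the actual types:

    #include <cstdint>

    // Illustrative per-entry overhead (~320B per the commit); the real value is
    // sizeof(CacheBufferType) + sizeof(homeds::MemVector) + sizeof(homeds::MemPiece).
    constexpr uint64_t kEntryOverheadBytes = 320;

    constexpr double rss_multiplier(uint64_t node_size_bytes) {
        return double(node_size_bytes + kEntryOverheadBytes) / double(node_size_bytes);
    }

    static_assert(rss_multiplier(512) == 1.625, "512B nodes -> 1.625x");
    // 4KB nodes: (4096 + 320) / 4096 ~= 1.078x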
Add verify_node_fowarding() using std::queue for iterative tree traversal.
Benefits over recursive verify_node_recursive():
- Reduces lock contention: ancestor nodes unlocked while processing descendants
- Enables cache eviction during traversal: ref_count drops to 1 immediately per node
- Lower peak stack depth for deep trees

This does NOT solve cross-volume cache accumulation during recovery. The actual
OOM fix is force_evict() after each volume completes (separate commit).

Add recursive parameter (default false) to verify_tree() propagated through:
- btree.hpp: verify_tree(update_debug_bm, recursive)
- mapping.hpp/cpp: verify_tree delegation
- volume.hpp/cpp: verify_tree delegation

The dispatcher verify_node() routes to verify_node_recursive() when
recursive=true and to verify_node_fowarding() when recursive=false (the default).
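
A minimal, self-contained sketch of the queue-based traversal pattern; the
Node type, the locking, and the function name are illustrative stand-ins for
the btree's actual node and verification code:

    #include <mutex>
    #include <queue>
    #include <vector>

    struct Node {
        std::mutex lock;
        std::vector<Node*> children;
    };

    template <typename VerifyFn>
    void verify_tree_iterative(Node* root, VerifyFn&& verify) {
        std::queue<Node*> pending;
        pending.push(root);
        while (!pending.empty()) {
            Node* node = pending.front();
            pending.pop();
            std::lock_guard<std::mutex> guard(node->lock); // only this node is held
            verify(node);
            for (Node* child : node->children) { pending.push(child); }
            // guard releases here: ancestors are never held while descendants
            // are processed, so each node's ref_count can drop and the cache
            // may evict it immediately after verification.
        }
    }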
Add release_cached_tree() to evict cached btree nodes when safe:
- Buffer not locked (try_lock succeeds)
- Buffer not dirty (no pending writeback)
- ref_count permits eviction (only cache/wb_cache hold refs)

Implementation:
- btree.hpp: EvictionStats struct + release_cached_tree() + release_cached_nodes_recursive()
- ssd_btree.hpp: free_node_from_cache() with isSyncCall=true to avoid null deref on wb_req->bcp
- mapping.hpp: delegation wrapper
- volume.hpp: delegation wrapper

Safety: Cache::erase (cache_only=true) correctly handles wb_cache refs: the
buffer is removed from the hash_set/eviction list while wb_cache retains a
ref for the checkpoint flush.

Use case: Call after verify_tree() during recovery to release per-volume cache before
processing next volume, preventing cross-volume memory accumulation.
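
A minimal, self-contained sketch of those three safety checks; EvictableBuffer
and kCacheOnlyRefs are illustrative stand-ins for the real cache buffer type
and its expected cache/wb_cache refcount:

    #include <atomic>
    #include <mutex>

    struct EvictableBuffer {
        std::mutex lock;
        bool dirty{false};
        std::atomic<int> ref_count{2}; // e.g. cache + wb_cache references
    };

    constexpr int kCacheOnlyRefs = 2;

    bool safe_to_evict(EvictableBuffer& buf) {
        std::unique_lock<std::mutex> guard(buf.lock, std::try_to_lock);
        if (!guard.owns_lock()) { return false; }      // (1) buffer locked elsewhere
        if (buf.dirty) { return false; }               // (2) pending writeback
        return buf.ref_count.load() <= kCacheOnlyRefs; // (3) only cache/wb_cache refs
    }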

In vol_recovery_start_phase2(), call release_cached_tree() after recovery_start_phase2()
finishes for each volume. This proactively evicts clean cached btree nodes, providing:

1. Limits peak memory: max(single volume) instead of sum(all volumes)
   Without this fix, cache accumulates across volumes during sequential recovery.

2. Frees cache space for dirty buffers: Clean nodes from verify_tree() consume cache
   quota but provide zero value after recovery. Evicting them makes room for dirty
   buffers that need cache space for writeback/checkpoint.

3. Defensive against future bugs: Reduces OOM risk from any dirty buffer accumulation
   by not wasting cache on stale clean buffers.

Critical ordering per volume:
1. verify_tree() - loads nodes into cache for verification
2. recovery_start_phase2() - io_replay needs nodes in cache
3. release_cached_tree() - safe to evict after io_replay completes
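
A minimal sketch of that per-volume ordering; the Volume interface below just
mirrors the three calls named above and is not the actual class:

    #include <memory>
    #include <vector>

    struct Volume {
        void verify_tree();            // 1. loads nodes into cache for verification
        void recovery_start_phase2();  // 2. io_replay needs those nodes in cache
        void release_cached_tree();    // 3. safe to evict after io_replay completes
    };

    void recover_all_volumes(std::vector<std::unique_ptr<Volume>>& volumes) {
        for (auto& vol : volumes) {
            vol->verify_tree();
            vol->recovery_start_phase2();
            vol->release_cached_tree(); // peak cache = max(one volume), not the sum
        }
    }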
Add a release_cache_after_recovery boolean flag to GeneralConfig in
homeblks_config.fbs, defaulting to true. This allows runtime control of the
cache release behavior during volume recovery phase 2.

When enabled (default), cached btree nodes are safely evicted after each volume's
recovery to limit peak memory usage and free cache space for dirty buffers.

The flag is marked as hotswap, allowing dynamic configuration updates without restart.
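
A minimal sketch of how such a hotswap flag can gate the eviction; the struct
and accessor below are assumptions standing in for the flatbuffers-generated
config object, not the repo's actual config API:

    #include <atomic>

    // Stand-in for the generated GeneralConfig; the real flag lives in
    // homeblks_config.fbs (default true, marked hotswap).
    struct GeneralConfigView {
        std::atomic<bool> release_cache_after_recovery{true};
    };

    // Re-read the flag for every volume rather than caching it once, so a
    // hotswap update takes effect without a restart.
    bool should_release_cache(const GeneralConfigView& cfg) {
        return cfg.release_cache_after_recovery.load(std::memory_order_relaxed);
    }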
Add a recovery_cache_release_latency gauge to HomeBlksMetrics and an
m_cache_release_ms field to HomeBlksRecoveryStats to track the time spent
releasing cached btree nodes during volume recovery phase 2.

The metric accumulates time across all volumes and logs per-volume cache release
duration. This enables monitoring the performance overhead of cache eviction and
helps identify if the cache release operation becomes a bottleneck during recovery.
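
A minimal, self-contained sketch of the per-volume timing; the stats struct
mirrors the m_cache_release_ms field named above, the rest is illustrative:

    #include <chrono>
    #include <cstdint>

    struct HomeBlksRecoveryStats {
        uint64_t m_cache_release_ms{0}; // accumulated across all volumes
    };

    template <typename ReleaseFn>
    void timed_cache_release(HomeBlksRecoveryStats& stats, ReleaseFn&& release) {
        auto start = std::chrono::steady_clock::now();
        release(); // the per-volume release_cached_tree() call
        auto elapsed_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
            std::chrono::steady_clock::now() - start).count();
        stats.m_cache_release_ms += elapsed_ms; // also feeds the
                                                // recovery_cache_release_latency gauge
    }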
Run ./apply-clang-format.sh to format all C++ source files according to the
src/.clang-format style configuration. This ensures consistent code style
across the codebase.