Add coolTransactionLogs() to advise transaction-log mmaps cold (MADV_COLD)#652
Add coolTransactionLogs() to advise transaction-log mmaps cold (MADV_COLD)#652kriszyp wants to merge 1 commit into
Conversation
…COLD) During replication base-copy/catch-up Harper reads the full transaction logs, faulting their MAP_SHARED pages into reclaimable page cache that still counts toward the container's cgroup memory.current. Under pressure the repeated fault-back-in of that cache becomes the catch-up throughput bottleneck (and can wedge it). Add a cooling pass that advises MADV_COLD on the file-backed region of every mapped transaction log, demoting already-read pages to the inactive LRU so the kernel reclaims them first under pressure without freeing them (non-destructive; safe for the concurrent, not-perfectly-sequential reader pattern). Advice is scoped to [0, actualSize) page-floored so it never touches the MAP_PRIVATE|MAP_ANONYMOUS zero-fill overlay tail, guarded for kernels without MADV_COLD (< 5.4, EINVAL-latched), and a no-op on macOS and Windows. Exposed as a process-global coolTransactionLogs() (the registry is shared across worker threads, so one call cools every worker's maps); Harper drives the interval timer from a single thread. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a mechanism to advise the kernel that file-backed pages of mapped transaction logs are cold (MADV_COLD) so they can be reclaimed first under memory pressure. This is exposed via a new coolTransactionLogs binding and implemented across POSIX and Windows platforms, complete with native and integration tests. The code review identified a critical concurrent data race on the memoryMap std::shared_ptr between getMemoryMap() and adviseCold(), suggesting the use of std::atomic<std::shared_ptr<MemoryMap>>. Additionally, the reviewer recommended using standard division/multiplication for page-flooring to avoid assuming page sizes are always powers of two, and proposed a more robust regex-based parsing of the Linux kernel version in tests to handle non-numeric suffixes.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| std::lock_guard<std::mutex> lock(this->fileMutex); | ||
| if (!this->memoryMap || !this->memoryMap->map) { | ||
| return 0; | ||
| } | ||
| map = this->memoryMap; | ||
| actualSize = std::min(this->size.load(std::memory_order_relaxed), map->mapSize); | ||
| } |
There was a problem hiding this comment.
There is a concurrent data race on the this->memoryMap std::shared_ptr. While adviseCold() reads and copies this->memoryMap under fileMutex, getMemoryMap() reassigns it under indexMutex without holding fileMutex. Concurrent read and write operations on the same std::shared_ptr instance are not thread-safe in C++ and lead to undefined behavior (such as memory corruption or double-free crashes).
Since the project is compiled with C++20, a robust and modern solution is to change the type of memoryMap in transaction_log_file.h to std::atomic<std::shared_ptr<MemoryMap>>. This makes all concurrent loads and stores thread-safe without requiring complex mutex serialization.
| if (pageSize <= 0) { | ||
| pageSize = 4096; | ||
| } | ||
| size_t length = static_cast<size_t>(actualSize) & ~(static_cast<size_t>(pageSize) - 1); |
There was a problem hiding this comment.
The bitwise page-flooring logic assumes that pageSize is always a power of two. While this is practically always true, using standard division and multiplication is safer, more readable, and avoids any implicit assumptions or complex bitwise casting.
size_t length = (static_cast<size_t>(actualSize) / pageSize) * pageSize;| const [major = 0, minor = 0] = | ||
| process.platform === 'linux' ? release().split('.').map(Number) : []; |
There was a problem hiding this comment.
Parsing the Linux kernel release version by splitting on . and converting directly to Number can be fragile if the release string contains non-numeric suffixes (e.g., "5.4.0-custom" or "5.4-rc1"), which would result in NaN and cause the support check to incorrectly evaluate to false.
Using a regular expression to extract the major and minor version numbers is more robust and prevents potential test flakiness on custom or pre-release kernels.
| const [major = 0, minor = 0] = | |
| process.platform === 'linux' ? release().split('.').map(Number) : []; | |
| const match = process.platform === 'linux' ? release().match(/^(\d+)\.(\d+)/) : null; | |
| const major = match ? parseInt(match[1], 10) : 0; | |
| const minor = match ? parseInt(match[2], 10) : 0; |
|
Thanks for the thorough review! Adjudicating the three points:
— Claude Opus 4.7 (acting for @kriszyp) |
📊 Benchmark Resultsget-sync.bench.tsgetSync() > random keys - small key size (100 records)
getSync() > sequential keys - small key size (100 records)
ranges.bench.tsgetRange() > small range (100 records, 50 range)
realistic-load.bench.tsRealistic write load with workers > write variable records with transaction log
transaction-log.bench.tsTransaction log > read 100 iterators while write log with 100 byte records
Transaction log > read one entry from random position from log with 1000 100 byte records
worker-put-sync.bench.tsputSync() > random keys - small key size (100 records, 10 workers)
worker-transaction-log.bench.tsTransaction log with workers > write log with 100 byte records
Results from commit 59ec75f |
Summary
Adds a
coolTransactionLogs()cooling pass that advisesMADV_COLDon the file-backed region of every mapped transaction log, and exposes it as a process-global N-API function for Harper to drive on an interval.TransactionLogFile::adviseCold()(POSIX) —madvise(MADV_COLD)on[0, actualSize), page-floored so it never touches theMAP_PRIVATE|MAP_ANONYMOUSzero-fill overlay tail. Guarded by#ifdef MADV_COLD+ a runtime EINVAL latch (kernels < 5.4); no-op on macOS/Windows.TransactionLogStoreRegistry::CoolTransactionLogs()— walks the process-global registry (entries → stores → sequenceFiles), snapshottingshared_ptrs under each level's mutex and releasing before themadvisesyscall. Returns{ maps, bytes }.coolTransactionLogs()module export + TS surface inload-binding.ts/index.ts.Purpose
Closes #650. During replication base-copy/catch-up Harper reads the full transaction logs, faulting their
MAP_SHAREDpages into reclaimable page cache that still counts toward the container's cgroupmemory.current. Under amemory.highthrottle the kernel parks at the watermark and the repeated fault-back-in of that cache becomes the catch-up throughput bottleneck (catchup time swung 218s→840s, occasionally wedging).MADV_COLDdemotes already-read pages to the inactive LRU so reclaim targets them first — non-destructive (a re-read just re-activates), which is safe for the concurrent, not-perfectly-sequential reader pattern (replication + real-time consumers at different offsets).The cooling timer is intentionally not in this binding — the registry is a process-global singleton shared across worker threads, so Harper drives one
unref()ed interval from a single thread (separate harper-pro PR). Opt-in by construction: nothing callscoolTransactionLogs()until Harper does.Where to look
transaction_log_file_posix.cpp,adviseCold) — the floor-to-page onactualSizeis what keeps the advice off the anonymous tail; worth confirming for bothTRANSACTION_LOG_ENABLE_ANONYMOUS_OVERLAYmodes.this->memoryMapshared_ptr:adviseCold()reads it underfileMutex(consistent withclose()/removeFile()), butgetMemoryMap()reassigns it withoutfileMutex(it runs underdataSetsMutex/indexMutex).adviseColdparticipates in that race as a new periodic reader. In practice the reassignment only happens at first-map (frozen files are pre-sized and never remap; the active file is pre-sized tomaxFileSize), so the window is near-nil. A proper fix is all-or-nothing (serializegetMemoryMapunderfileMutex, changing read/write serialization, or convert everymemoryMapaccess site to atomicshared_ptr) and is deferred to the follow-up PR that reworksmemoryMapownership (weak-ref for frozen files + finalization). Flagging so it isn't rediscovered.Tests: native GoogleTest (
transaction_log_madvise_test.cc) asserts the advice is scoped tofloor(actualSize)/ never the anon tail / no-ops on the EINVAL path; Vitest asserts a cooling pass is non-destructive (logs read back intact), gated forMADV_COLDavailability. Full suite green (483 passed, 33 native).Generated by an LLM (Claude Opus 4.7); reviewed/curated by Kris.