Skip to content

Add coolTransactionLogs() to advise transaction-log mmaps cold (MADV_COLD)#652

Open
kriszyp wants to merge 1 commit into
mainfrom
kris/txn-log-madv-cold
Open

Add coolTransactionLogs() to advise transaction-log mmaps cold (MADV_COLD)#652
kriszyp wants to merge 1 commit into
mainfrom
kris/txn-log-madv-cold

Conversation

@kriszyp

@kriszyp kriszyp commented Jun 12, 2026

Copy link
Copy Markdown
Member

Summary

Adds a coolTransactionLogs() cooling pass that advises MADV_COLD on the file-backed region of every mapped transaction log, and exposes it as a process-global N-API function for Harper to drive on an interval.

  • TransactionLogFile::adviseCold() (POSIX) — madvise(MADV_COLD) on [0, actualSize), page-floored so it never touches the MAP_PRIVATE|MAP_ANONYMOUS zero-fill overlay tail. Guarded by #ifdef MADV_COLD + a runtime EINVAL latch (kernels < 5.4); no-op on macOS/Windows.
  • TransactionLogStoreRegistry::CoolTransactionLogs() — walks the process-global registry (entries → stores → sequenceFiles), snapshotting shared_ptrs under each level's mutex and releasing before the madvise syscall. Returns { maps, bytes }.
  • coolTransactionLogs() module export + TS surface in load-binding.ts / index.ts.

Purpose

Closes #650. During replication base-copy/catch-up Harper reads the full transaction logs, faulting their MAP_SHARED pages into reclaimable page cache that still counts toward the container's cgroup memory.current. Under a memory.high throttle the kernel parks at the watermark and the repeated fault-back-in of that cache becomes the catch-up throughput bottleneck (catchup time swung 218s→840s, occasionally wedging). MADV_COLD demotes already-read pages to the inactive LRU so reclaim targets them first — non-destructive (a re-read just re-activates), which is safe for the concurrent, not-perfectly-sequential reader pattern (replication + real-time consumers at different offsets).

The cooling timer is intentionally not in this binding — the registry is a process-global singleton shared across worker threads, so Harper drives one unref()ed interval from a single thread (separate harper-pro PR). Opt-in by construction: nothing calls coolTransactionLogs() until Harper does.

Where to look

  • Page-flooring (transaction_log_file_posix.cpp, adviseCold) — the floor-to-page on actualSize is what keeps the advice off the anonymous tail; worth confirming for both TRANSACTION_LOG_ENABLE_ANONYMOUS_OVERLAY modes.
  • Deferred concurrency note (read before reviewing the locking): cross-model review (Codex + a Harper-domain pass) flagged a pre-existing, timing-narrow data race on the this->memoryMap shared_ptr: adviseCold() reads it under fileMutex (consistent with close()/removeFile()), but getMemoryMap() reassigns it without fileMutex (it runs under dataSetsMutex/indexMutex). adviseCold participates in that race as a new periodic reader. In practice the reassignment only happens at first-map (frozen files are pre-sized and never remap; the active file is pre-sized to maxFileSize), so the window is near-nil. A proper fix is all-or-nothing (serialize getMemoryMap under fileMutex, changing read/write serialization, or convert every memoryMap access site to atomic shared_ptr) and is deferred to the follow-up PR that reworks memoryMap ownership (weak-ref for frozen files + finalization). Flagging so it isn't rediscovered.

Tests: native GoogleTest (transaction_log_madvise_test.cc) asserts the advice is scoped to floor(actualSize) / never the anon tail / no-ops on the EINVAL path; Vitest asserts a cooling pass is non-destructive (logs read back intact), gated for MADV_COLD availability. Full suite green (483 passed, 33 native).

Generated by an LLM (Claude Opus 4.7); reviewed/curated by Kris.

…COLD)

During replication base-copy/catch-up Harper reads the full transaction
logs, faulting their MAP_SHARED pages into reclaimable page cache that
still counts toward the container's cgroup memory.current. Under pressure
the repeated fault-back-in of that cache becomes the catch-up throughput
bottleneck (and can wedge it).

Add a cooling pass that advises MADV_COLD on the file-backed region of
every mapped transaction log, demoting already-read pages to the inactive
LRU so the kernel reclaims them first under pressure without freeing them
(non-destructive; safe for the concurrent, not-perfectly-sequential reader
pattern). Advice is scoped to [0, actualSize) page-floored so it never
touches the MAP_PRIVATE|MAP_ANONYMOUS zero-fill overlay tail, guarded for
kernels without MADV_COLD (< 5.4, EINVAL-latched), and a no-op on macOS and
Windows.

Exposed as a process-global coolTransactionLogs() (the registry is shared
across worker threads, so one call cools every worker's maps); Harper
drives the interval timer from a single thread.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a mechanism to advise the kernel that file-backed pages of mapped transaction logs are cold (MADV_COLD) so they can be reclaimed first under memory pressure. This is exposed via a new coolTransactionLogs binding and implemented across POSIX and Windows platforms, complete with native and integration tests. The code review identified a critical concurrent data race on the memoryMap std::shared_ptr between getMemoryMap() and adviseCold(), suggesting the use of std::atomic<std::shared_ptr<MemoryMap>>. Additionally, the reviewer recommended using standard division/multiplication for page-flooring to avoid assuming page sizes are always powers of two, and proposed a more robust regex-based parsing of the Linux kernel version in tests to handle non-numeric suffixes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +230 to +236
std::lock_guard<std::mutex> lock(this->fileMutex);
if (!this->memoryMap || !this->memoryMap->map) {
return 0;
}
map = this->memoryMap;
actualSize = std::min(this->size.load(std::memory_order_relaxed), map->mapSize);
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a concurrent data race on the this->memoryMap std::shared_ptr. While adviseCold() reads and copies this->memoryMap under fileMutex, getMemoryMap() reassigns it under indexMutex without holding fileMutex. Concurrent read and write operations on the same std::shared_ptr instance are not thread-safe in C++ and lead to undefined behavior (such as memory corruption or double-free crashes).

Since the project is compiled with C++20, a robust and modern solution is to change the type of memoryMap in transaction_log_file.h to std::atomic<std::shared_ptr<MemoryMap>>. This makes all concurrent loads and stores thread-safe without requiring complex mutex serialization.

if (pageSize <= 0) {
pageSize = 4096;
}
size_t length = static_cast<size_t>(actualSize) & ~(static_cast<size_t>(pageSize) - 1);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The bitwise page-flooring logic assumes that pageSize is always a power of two. While this is practically always true, using standard division and multiplication is safer, more readable, and avoids any implicit assumptions or complex bitwise casting.

	size_t length = (static_cast<size_t>(actualSize) / pageSize) * pageSize;

Comment on lines +1812 to +1813
const [major = 0, minor = 0] =
process.platform === 'linux' ? release().split('.').map(Number) : [];

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Parsing the Linux kernel release version by splitting on . and converting directly to Number can be fragile if the release string contains non-numeric suffixes (e.g., "5.4.0-custom" or "5.4-rc1"), which would result in NaN and cause the support check to incorrectly evaluate to false.

Using a regular expression to extract the major and minor version numbers is more robust and prevents potential test flakiness on custom or pre-release kernels.

Suggested change
const [major = 0, minor = 0] =
process.platform === 'linux' ? release().split('.').map(Number) : [];
const match = process.platform === 'linux' ? release().match(/^(\d+)\.(\d+)/) : null;
const major = match ? parseInt(match[1], 10) : 0;
const minor = match ? parseInt(match[2], 10) : 0;

@kriszyp

kriszyp commented Jun 12, 2026

Copy link
Copy Markdown
Member Author

Thanks for the thorough review! Adjudicating the three points:

  1. memoryMap shared_ptr race — agreed, and this is the known concern called out under "Where to look" in the PR description. A cross-model review (Codex + a Harper-domain pass) reached the same conclusion. The proper fix is all-or-nothing (serialize getMemoryMap under fileMutex, or move every memoryMap access to std::atomic<std::shared_ptr>), so it's consolidated into the follow-up PR that reworks memoryMap ownership (weak-ref for frozen files). In practice the reassignment only happens at first-map (frozen files are pre-sized and never remap; the active file is pre-sized to maxFileSize), so the window is near-nil — but the follow-up closes it properly.

  2. Page-flooring & ~(pageSize - 1) — this is correct as written: POSIX page size (sysconf(_SC_PAGESIZE)) is architecturally always a power of two (4K/16K/64K), so the bitmask is the standard floor-to-page idiom. No correctness gap here.

  3. Kernel-version parse in the test — the gating only reads the major/minor components, which parse cleanly; the non-numeric suffix (e.g. -generic) only ever lands on the patch component, which is unused. So release().split('.').map(Number) is robust for the comparison as written.

— Claude Opus 4.7 (acting for @kriszyp)

@github-actions

Copy link
Copy Markdown
Contributor

📊 Benchmark Results

get-sync.bench.ts

getSync() > random keys - small key size (100 records)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 22.83K ops/sec 43.81 41.50 2,283.849 0.147 114,134
🥈 rocksdb 2 11.43K ops/sec 87.52 83.92 22,528.952 0.891 57,130

getSync() > sequential keys - small key size (100 records)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 27.65K ops/sec 36.16 34.96 489.94 0.101 138,268
🥈 rocksdb 2 11.51K ops/sec 86.89 84.15 741.211 0.056 57,543

ranges.bench.ts

getRange() > small range (100 records, 50 range)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 lmdb 1 25.52K ops/sec 39.18 35.95 1,792.917 0.295 127,603
🥈 rocksdb 2 17.46K ops/sec 57.27 50.48 2,174.8 0.155 87,301

realistic-load.bench.ts

Realistic write load with workers > write variable records with transaction log

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 481.50 ops/sec 2,076.864 117.89 54,994.652 12.42 963
🥈 lmdb 2 26.27 ops/sec 38,066.551 445.641 1,215,846.13 136.408 64.00

transaction-log.bench.ts

Transaction log > read 100 iterators while write log with 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 39.02K ops/sec 25.63 12.64 286.34 0.174 195,115
🥈 lmdb 2 442.18 ops/sec 2,261.519 158.266 29,848.2 1.60 2,211

Transaction log > read one entry from random position from log with 1000 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 752.40K ops/sec 1.33 1.16 4,028.404 0.186 3,762,000
🥈 lmdb 2 372.33K ops/sec 2.69 1.25 8,669.384 1.87 1,861,643

worker-put-sync.bench.ts

putSync() > random keys - small key size (100 records, 10 workers)

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 849.35 ops/sec 1,177.372 1,007.648 4,496.862 0.549 1,699
🥈 lmdb 2 1.16 ops/sec 858,777.889 815,023.835 895,025.238 2.37 10.00

worker-transaction-log.bench.ts

Transaction log with workers > write log with 100 byte records

Implementation Rank Operations/sec Mean (ms) Min (ms) Max (ms) RME (%) Samples
🥇 rocksdb 1 21.09K ops/sec 47.43 29.18 537.663 0.504 42,172
🥈 lmdb 2 838.24 ops/sec 1,192.971 189.759 9,977.424 5.01 1,677

Results from commit 59ec75f

@kriszyp kriszyp marked this pull request as ready for review June 12, 2026 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Advise MADV_COLD on transaction-log mmap to reduce memory pressure during replication catch-up

1 participant