Add coolTransactionLogs() to advise transaction-log mmaps cold (MADV_COLD) by kriszyp · Pull Request #652 · HarperFast/rocksdb-js

kriszyp · 2026-06-12T14:46:51Z

Summary

Adds a coolTransactionLogs() cooling pass that advises MADV_COLD on the file-backed region of every mapped transaction log, and exposes it as a process-global N-API function for Harper to drive on an interval.

TransactionLogFile::adviseCold() (POSIX) — madvise(MADV_COLD) on [0, actualSize), page-floored so it never touches the MAP_PRIVATE|MAP_ANONYMOUS zero-fill overlay tail. Guarded by #ifdef MADV_COLD + a runtime EINVAL latch (kernels < 5.4); no-op on macOS/Windows.
TransactionLogStoreRegistry::CoolTransactionLogs() — walks the process-global registry (entries → stores → sequenceFiles), snapshotting shared_ptrs under each level's mutex and releasing before the madvise syscall. Returns { maps, bytes }.
coolTransactionLogs() module export + TS surface in load-binding.ts / index.ts.

Purpose

Closes #650. During replication base-copy/catch-up Harper reads the full transaction logs, faulting their MAP_SHARED pages into reclaimable page cache that still counts toward the container's cgroup memory.current. Under a memory.high throttle the kernel parks at the watermark and the repeated fault-back-in of that cache becomes the catch-up throughput bottleneck (catchup time swung 218s→840s, occasionally wedging). MADV_COLD demotes already-read pages to the inactive LRU so reclaim targets them first — non-destructive (a re-read just re-activates), which is safe for the concurrent, not-perfectly-sequential reader pattern (replication + real-time consumers at different offsets).

The cooling timer is intentionally not in this binding — the registry is a process-global singleton shared across worker threads, so Harper drives one unref()ed interval from a single thread (separate harper-pro PR). Opt-in by construction: nothing calls coolTransactionLogs() until Harper does.

Where to look

Page-flooring (transaction_log_file_posix.cpp, adviseCold) — the floor-to-page on actualSize is what keeps the advice off the anonymous tail; worth confirming for both TRANSACTION_LOG_ENABLE_ANONYMOUS_OVERLAY modes.
Deferred concurrency note (read before reviewing the locking): cross-model review (Codex + a Harper-domain pass) flagged a pre-existing, timing-narrow data race on the this->memoryMap shared_ptr: adviseCold() reads it under fileMutex (consistent with close()/removeFile()), but getMemoryMap() reassigns it without fileMutex (it runs under dataSetsMutex/indexMutex). adviseCold participates in that race as a new periodic reader. In practice the reassignment only happens at first-map (frozen files are pre-sized and never remap; the active file is pre-sized to maxFileSize), so the window is near-nil. A proper fix is all-or-nothing (serialize getMemoryMap under fileMutex, changing read/write serialization, or convert every memoryMap access site to atomic shared_ptr) and is deferred to the follow-up PR that reworks memoryMap ownership (weak-ref for frozen files + finalization). Flagging so it isn't rediscovered.

Tests: native GoogleTest (transaction_log_madvise_test.cc) asserts the advice is scoped to floor(actualSize) / never the anon tail / no-ops on the EINVAL path; Vitest asserts a cooling pass is non-destructive (logs read back intact), gated for MADV_COLD availability. Full suite green (483 passed, 33 native).

Generated by an LLM (Claude Opus 4.7); reviewed/curated by Kris.

…COLD) During replication base-copy/catch-up Harper reads the full transaction logs, faulting their MAP_SHARED pages into reclaimable page cache that still counts toward the container's cgroup memory.current. Under pressure the repeated fault-back-in of that cache becomes the catch-up throughput bottleneck (and can wedge it). Add a cooling pass that advises MADV_COLD on the file-backed region of every mapped transaction log, demoting already-read pages to the inactive LRU so the kernel reclaims them first under pressure without freeing them (non-destructive; safe for the concurrent, not-perfectly-sequential reader pattern). Advice is scoped to [0, actualSize) page-floored so it never touches the MAP_PRIVATE|MAP_ANONYMOUS zero-fill overlay tail, guarded for kernels without MADV_COLD (< 5.4, EINVAL-latched), and a no-op on macOS and Windows. Exposed as a process-global coolTransactionLogs() (the registry is shared across worker threads, so one call cools every worker's maps); Harper drives the interval timer from a single thread. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a mechanism to advise the kernel that file-backed pages of mapped transaction logs are cold (MADV_COLD) so they can be reclaimed first under memory pressure. This is exposed via a new coolTransactionLogs binding and implemented across POSIX and Windows platforms, complete with native and integration tests. The code review identified a critical concurrent data race on the memoryMap std::shared_ptr between getMemoryMap() and adviseCold(), suggesting the use of std::atomic<std::shared_ptr<MemoryMap>>. Additionally, the reviewer recommended using standard division/multiplication for page-flooring to avoid assuming page sizes are always powers of two, and proposed a more robust regex-based parsing of the Linux kernel version in tests to handle non-numeric suffixes.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T14:48:51Z

+		std::lock_guard<std::mutex> lock(this->fileMutex);
+		if (!this->memoryMap || !this->memoryMap->map) {
+			return 0;
+		}
+		map = this->memoryMap;
+		actualSize = std::min(this->size.load(std::memory_order_relaxed), map->mapSize);
+	}


There is a concurrent data race on the this->memoryMap std::shared_ptr. While adviseCold() reads and copies this->memoryMap under fileMutex, getMemoryMap() reassigns it under indexMutex without holding fileMutex. Concurrent read and write operations on the same std::shared_ptr instance are not thread-safe in C++ and lead to undefined behavior (such as memory corruption or double-free crashes).

Since the project is compiled with C++20, a robust and modern solution is to change the type of memoryMap in transaction_log_file.h to std::atomic<std::shared_ptr<MemoryMap>>. This makes all concurrent loads and stores thread-safe without requiring complex mutex serialization.

gemini-code-assist · 2026-06-12T14:48:51Z

+	if (pageSize <= 0) {
+		pageSize = 4096;
+	}
+	size_t length = static_cast<size_t>(actualSize) & ~(static_cast<size_t>(pageSize) - 1);


The bitwise page-flooring logic assumes that pageSize is always a power of two. While this is practically always true, using standard division and multiplication is safer, more readable, and avoids any implicit assumptions or complex bitwise casting.

size_t length = (static_cast<size_t>(actualSize) / pageSize) * pageSize;

gemini-code-assist · 2026-06-12T14:48:51Z

+		const [major = 0, minor = 0] =
+			process.platform === 'linux' ? release().split('.').map(Number) : [];


Parsing the Linux kernel release version by splitting on . and converting directly to Number can be fragile if the release string contains non-numeric suffixes (e.g., "5.4.0-custom" or "5.4-rc1"), which would result in NaN and cause the support check to incorrectly evaluate to false.

Using a regular expression to extract the major and minor version numbers is more robust and prevents potential test flakiness on custom or pre-release kernels.

Suggested change

const [major = 0, minor = 0] =

process.platform === 'linux' ? release().split('.').map(Number) : [];

const match = process.platform === 'linux' ? release().match(/^(\d+)\.(\d+)/) : null;

const major = match ? parseInt(match[1], 10) : 0;

const minor = match ? parseInt(match[2], 10) : 0;

kriszyp · 2026-06-12T14:55:28Z

Thanks for the thorough review! Adjudicating the three points:

memoryMap shared_ptr race — agreed, and this is the known concern called out under "Where to look" in the PR description. A cross-model review (Codex + a Harper-domain pass) reached the same conclusion. The proper fix is all-or-nothing (serialize getMemoryMap under fileMutex, or move every memoryMap access to std::atomic<std::shared_ptr>), so it's consolidated into the follow-up PR that reworks memoryMap ownership (weak-ref for frozen files). In practice the reassignment only happens at first-map (frozen files are pre-sized and never remap; the active file is pre-sized to maxFileSize), so the window is near-nil — but the follow-up closes it properly.
Page-flooring & ~(pageSize - 1) — this is correct as written: POSIX page size (sysconf(_SC_PAGESIZE)) is architecturally always a power of two (4K/16K/64K), so the bitmask is the standard floor-to-page idiom. No correctness gap here.
Kernel-version parse in the test — the gating only reads the major/minor components, which parse cleanly; the non-numeric suffix (e.g. -generic) only ever lands on the patch component, which is unused. So release().split('.').map(Number) is robust for the comparison as written.

— Claude Opus 4.7 (acting for @kriszyp)

github-actions · 2026-06-12T14:58:31Z

📊 Benchmark Results

get-sync.bench.ts

getSync() > random keys - small key size (100 records)

Implementation	Rank	Operations/sec	Mean (ms)	Min (ms)	Max (ms)	RME (%)	Samples
🥇 lmdb	1	22.83K ops/sec	43.81	41.50	2,283.849	0.147	114,134
🥈 rocksdb	2	11.43K ops/sec	87.52	83.92	22,528.952	0.891	57,130

getSync() > sequential keys - small key size (100 records)

Implementation	Rank	Operations/sec	Mean (ms)	Min (ms)	Max (ms)	RME (%)	Samples
🥇 lmdb	1	27.65K ops/sec	36.16	34.96	489.94	0.101	138,268
🥈 rocksdb	2	11.51K ops/sec	86.89	84.15	741.211	0.056	57,543

ranges.bench.ts

getRange() > small range (100 records, 50 range)

Implementation	Rank	Operations/sec	Mean (ms)	Min (ms)	Max (ms)	RME (%)	Samples
🥇 lmdb	1	25.52K ops/sec	39.18	35.95	1,792.917	0.295	127,603
🥈 rocksdb	2	17.46K ops/sec	57.27	50.48	2,174.8	0.155	87,301

realistic-load.bench.ts

Realistic write load with workers > write variable records with transaction log

Implementation	Rank	Operations/sec	Mean (ms)	Min (ms)	Max (ms)	RME (%)	Samples
🥇 rocksdb	1	481.50 ops/sec	2,076.864	117.89	54,994.652	12.42	963
🥈 lmdb	2	26.27 ops/sec	38,066.551	445.641	1,215,846.13	136.408	64.00

transaction-log.bench.ts

Transaction log > read 100 iterators while write log with 100 byte records

Implementation	Rank	Operations/sec	Mean (ms)	Min (ms)	Max (ms)	RME (%)	Samples
🥇 rocksdb	1	39.02K ops/sec	25.63	12.64	286.34	0.174	195,115
🥈 lmdb	2	442.18 ops/sec	2,261.519	158.266	29,848.2	1.60	2,211

Transaction log > read one entry from random position from log with 1000 100 byte records

Implementation	Rank	Operations/sec	Mean (ms)	Min (ms)	Max (ms)	RME (%)	Samples
🥇 rocksdb	1	752.40K ops/sec	1.33	1.16	4,028.404	0.186	3,762,000
🥈 lmdb	2	372.33K ops/sec	2.69	1.25	8,669.384	1.87	1,861,643

worker-put-sync.bench.ts

putSync() > random keys - small key size (100 records, 10 workers)

Implementation	Rank	Operations/sec	Mean (ms)	Min (ms)	Max (ms)	RME (%)	Samples
🥇 rocksdb	1	849.35 ops/sec	1,177.372	1,007.648	4,496.862	0.549	1,699
🥈 lmdb	2	1.16 ops/sec	858,777.889	815,023.835	895,025.238	2.37	10.00

worker-transaction-log.bench.ts

Transaction log with workers > write log with 100 byte records

Implementation	Rank	Operations/sec	Mean (ms)	Min (ms)	Max (ms)	RME (%)	Samples
🥇 rocksdb	1	21.09K ops/sec	47.43	29.18	537.663	0.504	42,172
🥈 lmdb	2	838.24 ops/sec	1,192.971	189.759	9,977.424	5.01	1,677

Results from commit 59ec75f

kriszyp requested review from cb1kenobi and kylebernhardy June 12, 2026 14:46

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

kriszyp marked this pull request as ready for review June 12, 2026 15:03

kriszyp mentioned this pull request Jun 12, 2026

Release frozen transaction-log memory maps once JS releases them #653

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add coolTransactionLogs() to advise transaction-log mmaps cold (MADV_COLD)#652

Add coolTransactionLogs() to advise transaction-log mmaps cold (MADV_COLD)#652
kriszyp wants to merge 1 commit into
mainfrom
kris/txn-log-madv-cold

kriszyp commented Jun 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

kriszyp commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		const [major = 0, minor = 0] =
		process.platform === 'linux' ? release().split('.').map(Number) : [];

-		const [major = 0, minor = 0] =
-			process.platform === 'linux' ? release().split('.').map(Number) : [];
+		const match = process.platform === 'linux' ? release().match(/^(\d+)\.(\d+)/) : null;
+		const major = match ? parseInt(match[1], 10) : 0;
+		const minor = match ? parseInt(match[2], 10) : 0;

Conversation

kriszyp commented Jun 12, 2026

Summary

Purpose

Where to look

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

kriszyp commented Jun 12, 2026

Uh oh!

github-actions Bot commented Jun 12, 2026

📊 Benchmark Results

get-sync.bench.ts

getSync() > random keys - small key size (100 records)

getSync() > sequential keys - small key size (100 records)

ranges.bench.ts

getRange() > small range (100 records, 50 range)

realistic-load.bench.ts

Realistic write load with workers > write variable records with transaction log

transaction-log.bench.ts

Transaction log > read 100 iterators while write log with 100 byte records

Transaction log > read one entry from random position from log with 1000 100 byte records

worker-put-sync.bench.ts

putSync() > random keys - small key size (100 records, 10 workers)

worker-transaction-log.bench.ts

Transaction log with workers > write log with 100 byte records

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant