history: add hissqlite, a SQLite-based history method #349
@blgl wonder if you can take a look at this. @rra after studying dbz, there are a bunch of odd corners in the implementation. I think the main tradeoff here is probably a (30%?) space regression due to B-tree partial fill, but I am wondering if the performance may be adequate or even better due to the oddities of dbz's msync once history gets very large (and that might also be OS/FS dependent). This remains faithful to the dbz data layout, for better or worse; WITHOUT ROWID allows for that, but it might not be optimal for this shape of DB. On the positive side, the pause and rewrite go away (they can also create a temporarily inconsistent state, see the nnrp log in https://csiph.com/innreport/news-notice.2026.06.12-04.15.00.html), and on the whole this is a lot more durable. @jrehmer is there any chance you can assault this with a test instance with your ingest? If it's on ZFS, put it on a dataset with recordsize=8k to avoid write amplification. It would theoretically be plausible to intertwine expireover and expire a bit more deeply if both ovsqlite and hissqlite were used, and even plausibly do a SQLite spool such that everything is transactionally sound. Fodder for future work if this turns out OK.
490dcdb to 0da6416
Many thanks again, Kevin, for all your contributions! Though we could keep using pathdb, I'm wondering whether this would be an appropriate time to introduce a separate pathhistory directory for history (from TODO: "Support putting the history file in different directory from the other (much smaller) db files without hand-editing a bunch of files."). But it may be a bit tedious to look for all the pathdb calls, check whether they apply to history only, and make the change. You say there's a NEWS entry, but there was no doc/pod/news.pod file in the PR. I think there are other parts of the documentation to amend, as we assumed hisv6 and now not everything applies to hissqlite (for instance tagged hash, large file support with incompatible 32-bit and 64-bit history files, checklist and INSTALL to mention the possible backends...). It would need a thorough review. I can do a pass too. Do external programs like prunehistory still work and do what is expected with hissqlite? I'm also wondering whether there isn't now some useless dbz-related code like Adding a new history backend is great, thanks for it! As this is the first time we add a new history backend, there may be unexpected behaviour to check. I doubt the whole code of INN was written with the possibility of several history backends in mind. I'll have a deeper look soon. The migration and history manipulation tools are much appreciated! Last but not least, I promised the 2.7.4 release in June to more easily handle the blacklist migration. I can't integrate hissqlite in that time, and we need a bit more testing. Would it be OK to ship your work in the next 2.7.5 release and not delay the 2.7.4 one?
It would be a good opportunity here to put this in a new dir (for the ZFS recordsize property).
Will fix.
Sure, there are probably some loose ends.
Some are almost certainly dbz-specific, so I will need to note that.
This is definitely still a work in progress. Right now the main question is whether it performs adequately enough to bother going forward, so there is no need to plan a release yet.
Indeed, it is the first question to answer before going further. I was under the impression that expiration was much faster on large spools (no need to rewrite the history file). I don't know if lookups are also faster to the extent it is noticeable for readers on large spools.
This is a great experiment. Thank you! I am absolutely swamped with non-INN stuff, so I am not sure if I will get a chance to look in detail at the code, but I'm really glad to see someone experiment with this. Noting here just for the record since the hashing was mentioned in the PR: the MD5 hashing of message IDs is purely an artifact of the requirement for fixed-length records, which meant that the message ID couldn't be stored directly. It's probably spread tentacles all over the code, but it would be interesting to see if SQLite could store records based directly on the message ID without any of that hashing. Might be a lot of work to experiment with, though.
It would certainly be possible, but it is probably not pragmatic to consider yet. One issue is recovering /remember/ entries, since the hashing is destructive and a migration cannot get the original message IDs back. It would have to be an initialization-time decision, not capable of full migration. If a SQLite spool were a thing, I guess you could consider overview and history to simply be indexes on the message DB. That would be a fairly radical change, but promoting everything to a single DB might actually make this more feasible. So that might be the experiment to test it out on.
It looks reasonable to me. Documentation issue: those bits that mention database leaf pages seem to assume that row data is stored only in leaf pages, which is not the case for |
Add hissqlite, a SQLite history method selectable instead of hisv6 (hismethod = hissqlite) when INN is built with SQLite. It keeps the whole history in a single transactional database and, unlike ovsqlite, needs no server daemon: innd is the primary writer and updates the database directly, while nnrpd readers and the offline tools open the WAL database read-only, and expiration runs in place rather than rebuilding and swapping files.

- The backend uses a single WITHOUT ROWID table clustered on the 16-byte MD5 of the Message-ID (a lookup is one clustered-leaf access); a row is a real article, a remembered Message-ID, or absent. Writes autocommit under WAL with synchronous=NORMAL, giving dbz's durability contract (recent writes may be lost on power loss, never corrupted, peers resend) without batching, which would only starve the second writer. The WAL is checkpointed in the background and truncated on a clean shutdown.
- Expiration is two-horizon and in place (real articles past retention become remembered, remembered entries past /remember/ are deleted), streamed in hash-keyset pages. A new HISCTLG_INPLACEEXPIRE capability lets expire(8) reopen the backend read/write with no ICCpause; -d/-f (side-file rebuild) and a bare -x are refused, with -t as the in-place dry run; hisv6's path is unchanged.
- The SQL prepared-statement codegen (sqlite-helper) moves from storage/ovsqlite to lib/ so the ovsqlite and hissqlite backends share one copy; libinn now carries the libsqlite3 dependency (LTVERSION bump).
- Page size, writer and per-nnrpd reader cache sizes, and mmap size are tunable through inn.conf; mmap defaults off, as in ovsqlite.
- hissqlite-convert migrates an existing history by walking the source backend and bulk-loading by hash, faithfully preserving the remembered entries and timestamps a from-spool makehistory rebuild cannot.

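For readers who want to poke at the storage shape described above, here is a minimal sketch in Python's stdlib sqlite3 module rather than the PR's C code. The table name, column names, token value and the `open_writer`/`open_reader`/`write`/`lookup` helpers are all assumptions made for illustration; only the WITHOUT ROWID table keyed on the 16-byte MD5, WAL with synchronous=NORMAL, per-write autocommit, read-only readers, and the three lookup outcomes come from the commit message.

```python
import hashlib
import os
import pathlib
import sqlite3
import tempfile

# Hypothetical layout: table and column names are illustrative, not the PR's.
SCHEMA = """
CREATE TABLE IF NOT EXISTS history (
    hash    BLOB PRIMARY KEY,  -- 16-byte MD5 of the Message-ID
    arrived INTEGER NOT NULL,  -- arrival time, seconds since the epoch
    token   BLOB               -- storage token; NULL = remembered Message-ID
) WITHOUT ROWID
"""


def msgid_hash(msgid):
    """Return the 16-byte MD5 digest used as the clustered key."""
    return hashlib.md5(msgid.encode("utf-8")).digest()


def open_writer(path):
    # isolation_level=None leaves the connection in autocommit mode, so every
    # INSERT is its own transaction (no batching).
    db = sqlite3.connect(path, isolation_level=None)
    db.execute("PRAGMA page_size = 4096")  # only effective on a new database
    db.execute("PRAGMA journal_mode = WAL")
    db.execute("PRAGMA synchronous = NORMAL")
    db.execute(SCHEMA)
    return db


def open_reader(path):
    # Readers open the same file read-only through a URI filename.
    return sqlite3.connect(pathlib.Path(path).as_uri() + "?mode=ro", uri=True)


def write(db, msgid, arrived, token=None):
    db.execute("INSERT OR REPLACE INTO history VALUES (?, ?, ?)",
               (msgid_hash(msgid), arrived, token))


def lookup(db, msgid):
    """Classify a Message-ID as 'article', 'remembered' or 'absent'."""
    row = db.execute("SELECT token FROM history WHERE hash = ?",
                     (msgid_hash(msgid),)).fetchone()
    if row is None:
        return "absent"
    return "remembered" if row[0] is None else "article"


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "history.sqlite")
writer = open_writer(path)
write(writer, "<a@example.invalid>", 1700000000, b"@0123@")
write(writer, "<b@example.invalid>", 1700000000)  # remembered only
reader = open_reader(path)
states = [lookup(reader, m) for m in
          ("<a@example.invalid>", "<b@example.invalid>", "<c@example.invalid>")]
print(states)
```

The reader connection never takes a write lock, which is the property that lets nnrpd processes keep answering lookups while the single writer commits.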
Comes with the hissqlite-util inspection tool, man pages, a NEWS entry, and two runtime tests wired into tests/TESTS: a full-vtable backend test and an end-to-end hissqlite-convert test (seed hisv6 -> convert -> verify tokens, timestamps and remembered entries survive).
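The two-horizon, in-place expiration from the commit message can be sketched roughly as follows, again in Python's sqlite3 rather than the real C backend. Everything named here is hypothetical: the `expire_in_place` function, the schema, and the single `retention` value (the real expire decides per article). The sketch also assumes that an article past both horizons is dropped in the same pass, which the commit message does not spell out. The `dry_run` flag stands in for the in-place `-t` behaviour.

```python
import hashlib
import sqlite3


def expire_in_place(db, now, retention, remember, page_rows=1000, dry_run=False):
    """Walk the table in primary-key order, one keyset page at a time."""
    demoted = deleted = 0
    last = b""  # an empty blob sorts before every 16-byte hash
    while True:
        page = db.execute(
            "SELECT hash, arrived, token FROM history"
            " WHERE hash > ? ORDER BY hash LIMIT ?",
            (last, page_rows)).fetchall()
        if not page:
            return demoted, deleted
        with db:  # one short write transaction per page
            for key, arrived, token in page:
                # Horizon 1: a real article past retention becomes remembered.
                expired = token is not None and arrived < now - retention
                remembered = token is None or expired
                # Horizon 2: a remembered entry past /remember/ is deleted.
                if remembered and arrived < now - remember:
                    deleted += 1
                    if not dry_run:
                        db.execute("DELETE FROM history WHERE hash = ?", (key,))
                elif expired:
                    demoted += 1
                    if not dry_run:
                        db.execute("UPDATE history SET token = NULL"
                                   " WHERE hash = ?", (key,))
        last = page[-1][0]  # resume after the last key seen


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE history (hash BLOB PRIMARY KEY,"
           " arrived INTEGER NOT NULL, token BLOB) WITHOUT ROWID")
rows = [("a", 950, b"tok"), ("b", 800, b"tok"), ("c", 800, None),
        ("d", 600, None), ("e", 600, b"tok")]
with db:
    db.executemany(
        "INSERT INTO history VALUES (?, ?, ?)",
        [(hashlib.md5(n.encode()).digest(), t, tok) for n, t, tok in rows])

preview = expire_in_place(db, now=1000, retention=100, remember=300,
                          page_rows=2, dry_run=True)
result = expire_in_place(db, now=1000, retention=100, remember=300, page_rows=2)
print(preview, result)
```

Paging on `hash > last` rather than OFFSET keeps each page a single index seek and stays correct when rows in earlier pages have been deleted, and committing per page keeps write locks short so the primary writer is not blocked for the whole run.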
A standalone benchmark (deliberately not part of the TAP suite) for comparing the hisv6 and hissqlite history backends at scale. It opens each backend exactly as innd does, including the in-core index hint, and reads tuning from inn.conf, so results reflect production settings; with no inn.conf it falls back to each backend's built-in defaults, which match the inn.conf defaults, so the comparison stays fair either way. Per method it times create+write (one autocommit per article, like innd's steady state), sequential check, random lookup, and missing lookup, reports the on-disk footprint, and validates correctness: it fails if any inserted entry is missing or any absent entry is found. Lookups use a fixed-seed PRNG reset per phase so every method replays the identical order. Output is both a human-readable table and a CSV row. Defaults to 100M entries (sized for large local testing); build with "make -C tests benchmarks" and run tests/lib/history-bench (see -h).
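The replay idea behind the lookup phases can be illustrated with a small sketch. This is not the C benchmark itself: the seed value, the `replay_order` and `bench` helpers, and the set/dict stand-ins for the two history backends are assumptions chosen so the example runs on its own.

```python
import random
import time

SEED = 20240601  # hypothetical fixed seed


def replay_order(n_entries, n_lookups, seed=SEED):
    """Build the lookup order from a freshly seeded generator (reset per phase)."""
    rng = random.Random(seed)
    return [rng.randrange(n_entries) for _ in range(n_lookups)]


def bench(method, contains, n_entries, n_lookups):
    """Time the random and missing lookup phases and validate correctness."""
    timings = {"method": method}
    # Keys 0..n_entries-1 were inserted; keys from n_entries upward were not.
    for phase, offset, expected in (("random", 0, True),
                                    ("missing", n_entries, False)):
        order = replay_order(n_entries, n_lookups)
        start = time.perf_counter()
        wrong = sum(1 for i in order if contains(i + offset) != expected)
        timings[phase + "_s"] = time.perf_counter() - start
        if wrong:
            raise RuntimeError(f"{method}: {wrong} wrong answers in {phase} phase")
    return timings


N = 10_000
# Stand-in "backends": any callable that answers membership would do.
backends = {"set": set(range(N)).__contains__,
            "dict": dict.fromkeys(range(N)).__contains__}
results = [bench(name, contains, N, 5_000) for name, contains in backends.items()]

print("method,random_s,missing_s")
for r in results:
    print(f"{r['method']},{r['random_s']:.6f},{r['missing_s']:.6f}")
```

Because the generator is rebuilt from the same seed at the start of each phase, both backends see the same sequence of keys, so a timing difference cannot be blamed on one method drawing a luckier access pattern.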
0da6416 to ba76a80
I force-pushed the original commit with two small tweaks: drop a secondary index that I was investigating for expire (bloom already covers that case), and reduce the default page size from 8k to 4k (less RMW overhead, a universal win in perf testing). The second commit is new benchmark infrastructure for testing this. Sample run on an AMD 5975WX, 256GB RAM, Crucial T705 4TB NVMe, FreeBSD -CURRENT (prod kernel/malloc settings), ZFS:
The results of this are somewhat surprising: I was expecting disk space to be less efficient, but the opposite occurred. I would say the seq check and missing checks are adequate in practice because of hiscache. The random lookup is a welcome win for nnrpd (and non-bloom expire). The primary concern: write speed is disappointing but workable as is for a steady-state instance. The fundamental issue is rebalancing the B-tree pages under the random MD5 hash write workload, which causes a lot of I/O overhead. On balance, the speed is what it is, and it shouldn't degrade at extreme scale like 1B articles. I experimented with a rowid table (and a separate index on the hash), but that was slower. I played a little with batching (no TXN per record), but I couldn't crack any major performance win; what is here is the best effort so far. It's possible there is some multiple of write perf waiting for the right eyes. It would be interesting to get some results from Linux, ext4/xfs/btrfs in particular where mmap could be enabled ( I am generally convinced of the idea now: it is the right shape for csiph.com because it will remove the innd pauses, which result in a bunch of timeouts on csiph-web while hisv6 is rewriting. But it would be nice to nail down any schema changes if they matter for write perf; I am just fresh out of ideas there.