
history: add hissqlite, a SQLite-based history method#349

Draft
kev009 wants to merge 2 commits into InterNetNews:main from kev009:hissqlite-history

kev009 commented Jun 15, 2026

Contributor

Add hissqlite, a SQLite history method selectable instead of hisv6 (hismethod = hissqlite) when INN is built with SQLite. It keeps the whole history in a single transactional database and, unlike ovsqlite, needs no server daemon: innd is the primary writer and updates the database directly, while nnrpd readers and the offline tools open the WAL database read-only, and expiration runs in place rather than rebuilding and swapping files.

  • The backend uses a single WITHOUT ROWID table clustered on the 16-byte MD5 of the Message-ID (a lookup is one clustered-leaf access); a row is a real article, a remembered Message-ID, or absent. Writes autocommit under WAL with synchronous=NORMAL, giving dbz's durability contract (recent writes may be lost on power loss, never corrupted, peers resend) without batching, which would only starve the second writer. The WAL is checkpointed in the background and truncated on a clean shutdown.

  • Expiration is two-horizon and in place (real articles past retention become remembered, remembered entries past /remember/ are deleted), streamed in hash-keyset pages. A new HISCTLG_INPLACEEXPIRE capability lets expire(8) reopen the backend read/write with no ICCpause; -d/-f (side-file rebuild) and a bare -x are refused, with -t as the in-place dry run; hisv6's path is unchanged.

  • The SQL prepared-statement codegen (sqlite-helper) moves from storage/ovsqlite to lib/ so the ovsqlite and hissqlite backends share one copy; libinn now carries the libsqlite3 dependency (LTVERSION bump).

  • Page size, writer and per-nnrpd reader cache sizes, and mmap size are tunable through inn.conf; mmap defaults off, as in ovsqlite.

  • hissqlite-convert migrates an existing history by walking the source backend and bulk-loading by hash, faithfully preserving the remembered entries and timestamps a from-spool makehistory rebuild cannot.
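
A minimal sketch of the storage layout described in the first bullet above, written in Python for brevity rather than in INN's C. The table name, column names, and helper functions are assumptions made for illustration and are not taken from the patch; the hash is a plain MD5 of the Message-ID bytes, which may differ in detail from INN's own hashing routine.

```python
import hashlib
import sqlite3

# Hypothetical schema: table and column names are illustrative, not hissqlite's.
SCHEMA = """
CREATE TABLE IF NOT EXISTS history (
    hash    BLOB PRIMARY KEY,  -- 16-byte MD5 of the Message-ID
    arrived INTEGER NOT NULL,  -- arrival time, seconds since the epoch
    token   BLOB               -- storage token; NULL marks a remembered entry
) WITHOUT ROWID
"""


def msgid_hash(message_id):
    # Plain MD5 of the bytes; INN's own hashing routine may differ in detail.
    return hashlib.md5(message_id.encode("utf-8")).digest()


def open_writer(path):
    # isolation_level=None leaves the connection in autocommit mode, so every
    # INSERT issued through it is its own transaction.
    db = sqlite3.connect(path, isolation_level=None)
    db.execute("PRAGMA journal_mode = WAL")
    db.execute("PRAGMA synchronous = NORMAL")
    db.execute(SCHEMA)
    return db


def open_reader(path):
    # Readers open the same file read-only through a URI.
    return sqlite3.connect("file:" + path + "?mode=ro", uri=True)


def write_entry(db, message_id, arrived, token=None):
    # token=None records a remembered Message-ID with no stored article.
    db.execute("INSERT OR REPLACE INTO history VALUES (?, ?, ?)",
               (msgid_hash(message_id), arrived, token))


def lookup(db, message_id):
    row = db.execute("SELECT token FROM history WHERE hash = ?",
                     (msgid_hash(message_id),)).fetchone()
    if row is None:
        return "absent"
    return "remembered" if row[0] is None else "article"
```

With this shape, the three states in the bullet (real article, remembered Message-ID, absent) map to a row with a token, a row with a NULL token, and no row at all.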

Comes with the hissqlite-util inspection tool, man pages, a NEWS entry and a full-vtable runtime test.
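
The two-horizon, in-place expiration from the second bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch's code: the `history` table and its columns are hypothetical, a single `retention` value stands in for whatever actually decides that an article has expired, and entries past both horizons are simply demoted on this pass.

```python
import sqlite3


def expire_in_place(db, now, retention, remember, page=1000):
    """Walk the table in hash order, one page per transaction.

    Returns (demoted, deleted). A single retention value is a simplification.
    """
    demoted = deleted = 0
    last = b""  # the empty blob sorts before every 16-byte hash
    while True:
        # Keyset pagination: resume strictly after the last hash seen.
        rows = db.execute(
            "SELECT hash, arrived, token FROM history"
            " WHERE hash > ? ORDER BY hash LIMIT ?",
            (last, page)).fetchall()
        if not rows:
            return demoted, deleted
        db.execute("BEGIN IMMEDIATE")
        for key, arrived, token in rows:
            age = now - arrived
            if token is not None and age > retention:
                # Real article past retention: keep the row, drop the token.
                db.execute("UPDATE history SET token = NULL WHERE hash = ?",
                           (key,))
                demoted += 1
            elif token is None and age > remember:
                # Remembered entry past /remember/: remove it entirely.
                db.execute("DELETE FROM history WHERE hash = ?", (key,))
                deleted += 1
        db.execute("COMMIT")
        last = rows[-1][0]


def demo():
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("CREATE TABLE history (hash BLOB PRIMARY KEY,"
               " arrived INTEGER NOT NULL, token BLOB) WITHOUT ROWID")
    rows = [
        (bytes([1]) * 16, 0, b"tok"),   # old article: becomes remembered
        (bytes([2]) * 16, 80, b"tok"),  # recent article: kept
        (bytes([3]) * 16, 0, None),     # old remembered entry: deleted
        (bytes([4]) * 16, 50, None),    # recent remembered entry: kept
    ]
    db.executemany("INSERT INTO history VALUES (?, ?, ?)", rows)
    result = expire_in_place(db, now=100, retention=50, remember=90, page=2)
    return db, result
```

Because each page is its own short transaction and the cursor is just the last hash seen, the walk never needs to hold the database for the whole run.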

kev009 commented Jun 15, 2026

Contributor Author

@blgl wonder if you can take a look at this

@rra after studying dbz, there are a bunch of odd corners in the implementation. I think the main tradeoff here is probably a (30%?) space regression due to B-tree partial fill, but I am wondering if the performance may be adequate or even better, given the oddities of dbz's msync once history gets very large (which might also be OS/FS dependent). This remains faithful to the dbz data layout, for better or worse; WITHOUT ROWID allows for that, but it might not be optimal for this shape of DB. On the positive side, the pause and rewrite go away (they can also create a temporarily inconsistent state; see the nnrp log at https://csiph.com/innreport/news-notice.2026.06.12-04.15.00.html), and on the whole this is a lot more durable.

@jrehmer is there any chance you can assault this with a test instance with your ingest? If it's on ZFS, stick it on a recordsize=8k dataset to avoid write amplification.

It would theoretically be plausible to intertwine expireover and expire a bit deeper if both ovsqlite and hissqlite were used, and even plausibly do a sqlite spool such that everything is transactionally sound. Fodder for future work if this turns out ok.

kev009 force-pushed the hissqlite-history branch from 490dcdb to 0da6416 on June 15, 2026 09:59
@Julien-Elie

Contributor

Many thanks again, Kevin, for all your contributions!

Though we could keep using pathdb, I'm wondering whether it would not be an appropriate time to have a possible separate pathhistory directory for history (from TODO: "Support putting the history file in different directory from the other (much smaller) db files without hand-editing a bunch of files."). But it may be a bit tedious to look for all the pathdb calls, check whether they apply to history only, and make the change.

You say there's a NEWS entry but there was no doc/pod/news.pod file in the PR.

I think there are other parts of the documentation to amend as we assumed hisv6 and now not everything applies to hissqlite (for instance tagged hash, large file support with incompatible 32-bit and 64-bit history files, checklist and INSTALL to mention the possibilities of backends...). It would need a bit of thorough review. I can do a pass too.

Do external programs like prunehistory still work and do what is expected with hissqlite?

I'm also wondering whether there isn't now some useless dbz-related code like dbzneedfilecount call in innd/innd.c with hissqlite.

Adding a new history backend is great, thanks for it! As this is the first time we add a new history backend, there may be unexpected behaviour to check. I doubt the whole code of INN was written with the possibility of several history backends in mind.

I'll have a deeper look soon. The migration and history manipulation tools are very appreciated!

Last but not least, I promised the 2.7.4 release in June to more easily handle the blacklist migration. I can't integrate hissqlite in that time, and we need a bit more testing. Would it be OK to ship your work in the next 2.7.5 release and not delay the 2.7.4 one?

kev009 commented Jun 15, 2026

Contributor Author

> Many thanks again, Kevin, for all your contributions!
>
> Though we could keep using pathdb, I'm wondering whether it would not be an appropriate time to have a possible separate pathhistory directory for history (from TODO: "Support putting the history file in different directory from the other (much smaller) db files without hand-editing a bunch of files."). But it may be a bit tedious to look for all the pathdb calls and check whether they apply to history only and do the change.

It would be a good opportunity here to put this in a new dir (for the zfs recordsize property).

> You say there's a NEWS entry but there was no doc/pod/news.pod file in the PR.

Will fix.

> I think there are other parts of the documentation to amend as we assumed hisv6 and now not everything applies to hissqlite (for instance tagged hash, large file support with incompatible 32-bit and 64-bit history files, checklist and INSTALL to mention the possibilities of backends...). It would need a bit of thorough review. I can do a pass too.

Sure there are probably some loose ends

> Do external programs like prunehistory still work and do what is expected with hissqlite?

Some are almost certainly dbz specific so will need to note it.

> I'm also wondering whether there isn't now some useless dbz-related code like dbzneedfilecount call in innd/innd.c with hissqlite.
>
> Adding a new history backend is great, thanks for it! As this is the first time we add a new history backend, there may be unexpected behaviour to check. I doubt the whole code of INN was written with the possibility of several history backends in mind.
>
> I'll have a deeper look soon. The migration and history manipulation tools are very appreciated!
>
> Last but not least, I promised the 2.7.4 release in June to more easily handle the blacklist migration. I can't integrate hissqlite in that time, and we need a bit more testing. Would it be OK to ship your work in the next 2.7.5 release and not delay the 2.7.4 one?

This is definitely still a work in progress. Right now the main question is: does it even perform adequately to bother going forward, so no need to plan a release yet.

@Julien-Elie

Contributor

> Right now the main question is: does it even perform adequately to bother going forward, so no need to plan a release yet.

Indeed, it is the first question to answer before going further. I was under the impression that expiration was much faster on large spools (no need to rewrite the history file). I don't know if lookups are also faster to the extent it is noticeable for readers on large spools.
Maybe we should benchmark that in order to decide whether it is worth the effort to add an hissqlite backend, with noticeable improvement. I hope so, because you already did a ton of work on it! The implementation already looks pretty mature. Perhaps we could also say an SQLite database is more robust and the related code more maintainable than a custom dbz database, and it is enough to be useful to have in INN!

rra commented Jun 15, 2026

Contributor

This is a great experiment. Thank you! I am absolutely swamped with non-INN stuff, so I am not sure if I will get a chance to look in detail at the code, but I'm really glad to see someone experiment with this.

Noting here just for the record since the hashing was mentioned in the PR: the MD5 hashing of message IDs is purely an artifact of the requirement for fixed-length records, which meant that the message ID couldn't be stored directly. It's probably spread tentacles all over the code, but it would be interesting to see if SQLite could store records based directly on the message ID without any of that hashing. Might be a lot of work to experiment with, though.

kev009 commented Jun 16, 2026

Contributor Author

> This is a great experiment. Thank you! I am absolutely swamped with non-INN stuff, so I am not sure if I will get a chance to look in detail at the code, but I'm really glad to see someone experiment with this.
>
> Noting here just for the record since the hashing was mentioned in the PR: the MD5 hashing of message IDs is purely an artifact of the requirement for fixed-length records, which meant that the message ID couldn't be stored directly. It's probably spread tentacles all over the code, but it would be interesting to see if SQLite could store records based directly on the message ID without any of that hashing. Might be a lot of work to experiment with, though.

It would certainly be possible, but it is probably not pragmatic to consider yet. One issue is recovering /remember/, since the hashing is destructive for migration. It would have to be an initialization-time decision, not capable of full migration.

If spool sqlite were a thing, I guess you could consider overview and history to simply be indexes on the message DB. That would be a fairly radical change but promoting everything to a single DB might actually make this more feasible. So that might be the experiment to test it out on.
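
To make the migration problem above concrete, here is a small sketch of why a hash-keyed history cannot be fully converted to one keyed on the Message-ID itself. Everything in it is hypothetical: the `by_hash` and `by_msgid` tables, the helper names, and the assumption that a real article's Message-ID could be read back from somewhere while a remembered entry's could not.

```python
import hashlib
import sqlite3


def md5(message_id):
    # Plain MD5 of the bytes, used here as a stand-in for the history hash.
    return hashlib.md5(message_id.encode("utf-8")).digest()


def make_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE by_hash (hash BLOB PRIMARY KEY,"
               " arrived INTEGER) WITHOUT ROWID")
    db.execute("CREATE TABLE by_msgid (msgid TEXT PRIMARY KEY,"
               " arrived INTEGER) WITHOUT ROWID")
    return db


def migrate(db, recoverable):
    """Copy hash-keyed rows into the Message-ID-keyed table.

    `recoverable` maps hash -> Message-ID for entries whose ID can still be
    read back from elsewhere. Returns how many rows could not be migrated.
    """
    lost = 0
    rows = db.execute("SELECT hash, arrived FROM by_hash").fetchall()
    for key, arrived in rows:
        msgid = recoverable.get(key)
        if msgid is None:
            lost += 1  # MD5 is one-way: the original ID is gone
            continue
        db.execute("INSERT INTO by_msgid VALUES (?, ?)", (msgid, arrived))
    return lost
```

Rows whose Message-ID survives only as a hash have nothing to put in the new key column, which is why the choice would have to be made when the database is first initialized.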

blgl commented Jun 18, 2026

Contributor

It looks reasonable to me.

Documentation issue: those bits that mention database leaf pages seem to assume that row data is stored only in leaf pages, which is not the case for WITHOUT ROWID tables. This doesn't invalidate anything, but trying to avoid confusing other people is a good thing.

kev009 added 2 commits June 20, 2026 18:57
Add hissqlite, a SQLite history method selectable instead of hisv6
(hismethod = hissqlite) when INN is built with SQLite.  It keeps the whole
history in a single transactional database and, unlike ovsqlite, needs no
server daemon: innd is the primary writer and updates the database
directly, while nnrpd readers and the offline tools open the WAL database
read-only, and expiration runs in place rather than rebuilding and swapping
files.

- The backend uses a single WITHOUT ROWID table clustered on the 16-byte MD5
  of the Message-ID (a lookup is one clustered-leaf access); a row is a real
  article, a remembered Message-ID, or absent.  Writes autocommit under WAL
  with synchronous=NORMAL, giving dbz's durability contract (recent writes
  may be lost on power loss, never corrupted, peers resend) without batching,
  which would only starve the second writer.  The WAL is checkpointed in the
  background and truncated on a clean shutdown.

- Expiration is two-horizon and in place (real articles past retention become
  remembered, remembered entries past /remember/ are deleted), streamed in
  hash-keyset pages.  A new HISCTLG_INPLACEEXPIRE capability lets expire(8)
  reopen the backend read/write with no ICCpause; -d/-f (side-file rebuild)
  and a bare -x are refused, with -t as the in-place dry run; hisv6's path is
  unchanged.

- The SQL prepared-statement codegen (sqlite-helper) moves from
  storage/ovsqlite to lib/ so the ovsqlite and hissqlite backends share one
  copy; libinn now carries the libsqlite3 dependency (LTVERSION bump).

- Page size, writer and per-nnrpd reader cache sizes, and mmap size are
  tunable through inn.conf; mmap defaults off, as in ovsqlite.

- hissqlite-convert migrates an existing history by walking the source
  backend and bulk-loading by hash, faithfully preserving the remembered
  entries and timestamps a from-spool makehistory rebuild cannot.

Comes with the hissqlite-util inspection tool, man pages, a NEWS entry, and
two runtime tests wired into tests/TESTS: a full-vtable backend test and an
end-to-end hissqlite-convert test (seed hisv6 -> convert -> verify tokens,
timestamps and remembered entries survive).

A standalone benchmark (deliberately not part of the TAP suite) for comparing
the hisv6 and hissqlite history backends at scale.  It opens each backend
exactly as innd does, including the in-core index hint, and reads tuning
from inn.conf, so results reflect production settings; with no inn.conf it
falls back to each backend's built-in defaults, which match the inn.conf
defaults, so the comparison stays fair either way.

Per method it times create+write (one autocommit per article, like innd's
steady state), sequential check, random lookup, and missing lookup, reports
the on-disk footprint, and validates correctness.  It fails if any inserted
entry is missing or any absent entry is found.  Lookups use a fixed-seed PRNG
reset per phase so every method replays the identical order.  Output is both a
human readable table and a CSV row.

Defaults to 100M entries (sized for large local testing); build with
"make -C tests benchmarks" and run tests/lib/history-bench (see -h).
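
A rough illustration of the phase structure described above, in Python rather than the C of tests/lib/history-bench. It is only a sketch: it exercises a hypothetical in-memory SQLite table instead of INN's history API, the function names and Message-ID format are made up, and the sizes are tiny. What it does show is the fixed-seed PRNG being reset per phase and the validation that fails on a missing or spurious entry.

```python
import hashlib
import random
import sqlite3
import time


def key(i):
    # Synthetic Message-ID hashed to 16 bytes; the ID format is made up.
    return hashlib.md5(b"<%d@bench.invalid>" % i).digest()


def lookup_order(seed, n, count):
    # A fresh generator per phase, so every run replays the same order.
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(count)]


def bench(n=10_000, lookups=1_000, seed=42):
    db = sqlite3.connect(":memory:", isolation_level=None)  # autocommit
    db.execute("CREATE TABLE history (hash BLOB PRIMARY KEY,"
               " arrived INTEGER) WITHOUT ROWID")

    def found(i):
        cur = db.execute("SELECT 1 FROM history WHERE hash = ?", (key(i),))
        return cur.fetchone() is not None

    timings = {}

    def phase(name, work):
        start = time.perf_counter()
        ok = work()
        timings[name] = time.perf_counter() - start
        if not ok:
            raise AssertionError(name + " phase failed validation")

    def write():
        for i in range(n):  # one autocommit per entry
            db.execute("INSERT INTO history VALUES (?, ?)", (key(i), i))
        return True

    phase("write", write)
    phase("seq check", lambda: all(found(i) for i in range(n)))
    phase("random lookup",
          lambda: all(found(i) for i in lookup_order(seed, n, lookups)))
    # Indexes n..2n-1 were never inserted, so none of them may be found.
    phase("missing",
          lambda: not any(found(n + i)
                          for i in lookup_order(seed, n, lookups)))
    return timings
```

Calling `bench()` returns a dict of elapsed seconds per phase; timings from an in-memory toy like this say nothing about the real backends.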
kev009 force-pushed the hissqlite-history branch from 0da6416 to ba76a80 on June 21, 2026 04:08
kev009 commented Jun 21, 2026

Contributor Author

I force pushed the original commit with two small tweaks: drop a secondary index that I was investigating for expire (bloom already covers that case), and reduce default page size to 4k from 8k (less RMW overhead, universal win upon perf testing).

The second commit is a new benchmark infra for testing this.

Sample run on an AMD 5975WX, 256GB RAM, Crucial T705 4TB NVMe, FreeBSD -CURRENT (prod kernel/malloc settings), ZFS:

  • 100M entries, 1M random lookups, sync every 10K, built-in defaults (no inn.conf;
    hissqlite: 64 MB cache, mmap off)
  • hisv6 on a default 128K recordsize ZFS dataset
  • hissqlite -d /zroot/test/p4k/bench on a 4K recordsize ZFS dataset
  • Run sequentially so the two did not contend.

| phase | hisv6 (128K) | hissqlite (4K) | winner |
|---|---|---|---|
| write 100M | 218.6 s (457,542/s) | 3334.1 s (29,993/s) | hisv6 ~15x |
| seq check 100M | 55.5 s (1,802,588/s) | 788.9 s (126,758/s) | hisv6 ~14x |
| random lookup 1M | 33.0 s (30,337/s) | 7.9 s (126,296/s) | hissqlite ~4x |
| missing 1M | 0.6 s (1,678,703/s) | 7.8 s (128,347/s) | hisv6 ~13x |
| on disk | 13.8 GB | 5.7 GB | hissqlite ~2.4x smaller |
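
For anyone rerunning the benchmark, the "winner" factors above are just the ratio of the two reported rates. A short sketch of that arithmetic, using only the numbers from the table (the `RATES` dict and `winner` helper are names invented here):

```python
# Entries per second from the sample run: (hisv6, hissqlite).
RATES = {
    "write": (457_542, 29_993),
    "seq check": (1_802_588, 126_758),
    "random lookup": (30_337, 126_296),
    "missing": (1_678_703, 128_347),
}

# On-disk footprint in GB: (hisv6, hissqlite).
DISK_GB = (13.8, 5.7)


def winner(hisv6_rate, hissqlite_rate):
    # Returns the faster backend and how many times faster it is.
    if hisv6_rate >= hissqlite_rate:
        return "hisv6", hisv6_rate / hissqlite_rate
    return "hissqlite", hissqlite_rate / hisv6_rate


for phase, (hisv6_rate, hissqlite_rate) in RATES.items():
    name, factor = winner(hisv6_rate, hissqlite_rate)
    print(f"{phase}: {name} ~{factor:.1f}x")

print(f"on disk: hissqlite ~{DISK_GB[0] / DISK_GB[1]:.1f}x smaller")
```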

The results are somewhat surprising: I was expecting disk space to be less efficient, but the opposite occurred. I would say the seq check and missing-lookup rates are plenty in practice because of hiscache. The random lookup is a welcome win for nnrpd (and non-bloom expire).

The primary concern: write speed is disappointing but workable as is for a steady-state instance. The fundamental issue is rebalancing the B-tree pages under the random MD5 hash write workload, which incurs a lot of I/O overhead. On balance, the speed is what it is, and it shouldn't degrade at extreme scale like 1B articles. I experimented with a rowid table (and a separate index on the hash), but that was slower. I played a little with batching (no TXN per record) but couldn't crack any major performance win; what is here is the best effort so far. It's possible there is some multiple of write perf waiting for the right eyes.
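
For anyone who wants to poke at the batching question, the difference between one transaction per record and several records per transaction can be sketched like this. It is a toy in Python, not the patch's code: the table, the `load` and `compare` helpers, and the batch size are all assumptions, and random 16-byte keys stand in for MD5 hashes.

```python
import os
import sqlite3
import tempfile
import time


def load(path, keys, batch=1):
    """Insert keys with `batch` rows per transaction; batch=1 is autocommit."""
    db = sqlite3.connect(path, isolation_level=None)
    db.execute("PRAGMA journal_mode = WAL")
    db.execute("PRAGMA synchronous = NORMAL")
    db.execute("CREATE TABLE history (hash BLOB PRIMARY KEY,"
               " arrived INTEGER) WITHOUT ROWID")
    start = time.perf_counter()
    for i in range(0, len(keys), batch):
        if batch > 1:
            db.execute("BEGIN")  # group the next `batch` inserts
        for k in keys[i:i + batch]:
            db.execute("INSERT INTO history VALUES (?, 0)", (k,))
        if batch > 1:
            db.execute("COMMIT")
    elapsed = time.perf_counter() - start
    count = db.execute("SELECT count(*) FROM history").fetchone()[0]
    db.close()
    return elapsed, count


def compare(n=2000):
    keys = [os.urandom(16) for _ in range(n)]  # stand-ins for MD5 hashes
    with tempfile.TemporaryDirectory() as d:
        one = load(os.path.join(d, "autocommit.db"), keys)
        many = load(os.path.join(d, "batched.db"), keys, batch=500)
    return one, many
```

`compare()` returns an `(elapsed, count)` pair for each mode. Any difference it shows on a small file is not a prediction for a 100M-entry history, where the random-key page churn described above is the dominant cost.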

It would be interesting to get some results from Linux, ext4/xfs/btrfs in particular where mmap could be enabled (INNCONF=/tmp/inn-hissqlite-XYZ.conf ./lib/history-bench - hissqlite)

I am generally convinced of the idea now: it is the right shape for csiph.com because it will remove the innd pauses, which result in a bunch of timeouts on csiph-web while hisv6 is rewriting. But it would be nice to nail down any schema changes if they matter for write perf; I am just fresh out of ideas there.
