Add compressed vector storage (Int8 SQ, RaBitQ, TurboQUANT) by sven-n · Pull Request #90 · Build5Nines/SharpVector

sven-n · 2026-05-21T08:10:42Z

Summary

Adds an IVectorEncoding abstraction so a vector database can store its embeddings in compressed form to save memory and disk space. The encoding is injected at the database constructor and the codec owns the asymmetric (query float vs. stored encoded) distance computation, so search no longer has to decode back to float[].

Four built-in encodings ship in this PR:

| Id | Compression | Notes |
| --- | --- | --- |
| `raw-f32` | 1x (default) | Backwards-compatible passthrough |
| `int8-sq` | ~4x | Symmetric per-vector int8 scalar quantization |
| `rabitq` | ~32x | Rotation-free RaBitQ: 1-bit sign code + per-vector norm and correction factor |
| `turboquant` | ~8x | 4-bit symmetric SQ, nibble-packed on disk |
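
To make the `int8-sq` row concrete, here is a minimal sketch of symmetric per-vector int8 quantization with an asymmetric cosine that never decodes back to float[]. The class and method names are hypothetical and are not taken from the PR; the scale rule (max absolute value mapped to 127) is an assumption about what "symmetric per-vector" means here.

```csharp
using System;

// Hypothetical helper for illustration; not the PR's Int8ScalarQuantizationEncoding.
public static class Int8SqSketch
{
    // Symmetric per-vector quantization: scale = max|x| / 127, code = round(x / scale).
    public static (sbyte[] Codes, float Scale) Encode(float[] vector)
    {
        float maxAbs = 0f;
        foreach (float x in vector) maxAbs = Math.Max(maxAbs, Math.Abs(x));

        // An all-zero vector gets scale 1 so the division below is safe.
        float scale = maxAbs > 0f ? maxAbs / 127f : 1f;

        var codes = new sbyte[vector.Length];
        for (int i = 0; i < vector.Length; i++)
            codes[i] = (sbyte)Math.Clamp((int)MathF.Round(vector[i] / scale), -127, 127);
        return (codes, scale);
    }

    // Asymmetric cosine: float query against int8 codes, with no float[] decode.
    // The per-vector scale cancels between numerator and denominator.
    public static float Cosine(float[] query, sbyte[] codes)
    {
        double dot = 0, qNorm = 0, cNorm = 0;
        for (int i = 0; i < query.Length; i++)
        {
            dot += query[i] * codes[i];
            qNorm += query[i] * (double)query[i];
            cNorm += codes[i] * (double)codes[i];
        }
        if (qNorm == 0 || cNorm == 0) return 0f;
        return (float)(dot / Math.Sqrt(qNorm * cNorm));
    }
}
```

One byte per dimension plus a 4-byte scale is where the roughly 4x figure comes from in this layout.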

Usage:

var db = new BasicMemoryVectorDatabase(Int8ScalarQuantizationEncoding.Instance);

The original parameter-less constructors remain (raw encoding, same behavior as before).

Backward compatibility

Existing .b59vdb files load unchanged — the regression test (VectorDatabaseVersion_2_0_2_001) still passes with bit-exact similarity (0.3396831452846527).
Raw-encoded databases write the same legacy JSON shape, byte-for-byte: VectorEncodingId is omitted from database.json when the encoding is raw, and the per-item JSON keeps the {"Vector": [...]} array. A JsonConverter on VectorTextItem<,> understands both the legacy shape and the new {"EncodingId", "Dimensions", "EncodedBytes"} shape.
New nullable IEncodedVector EncodedVector { get; set; } on IVectorTextItem<,> has a default implementation that adapts to/from the existing float[] Vector, so external implementations keep compiling.
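
As a rough illustration of how one reader can accept both per-item shapes, the sketch below uses `System.Text.Json` to branch on the presence of `EncodedBytes`. It is not the PR's `JsonConverter`; the class name is hypothetical, and treating the legacy `Vector` array as native-endian float32 bytes is an assumption made only so both branches return the same kind of payload.

```csharp
using System;
using System.Text.Json;

// Hypothetical reader; the PR implements this as a JsonConverter on VectorTextItem<,>.
public static class VectorItemShapeSketch
{
    // Returns the encoding id, dimension count, and payload bytes for either shape.
    public static (string EncodingId, int Dimensions, byte[] Bytes) Read(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        // New shape: {"EncodingId": "...", "Dimensions": N, "EncodedBytes": "base64..."}
        if (root.TryGetProperty("EncodedBytes", out JsonElement encoded))
        {
            return (root.GetProperty("EncodingId").GetString() ?? "raw-f32",
                    root.GetProperty("Dimensions").GetInt32(),
                    encoded.GetBytesFromBase64());
        }

        // Legacy shape: {"Vector": [...]}; treated here as raw float32 (assumption).
        JsonElement vector = root.GetProperty("Vector");
        var floats = new float[vector.GetArrayLength()];
        int i = 0;
        foreach (JsonElement e in vector.EnumerateArray()) floats[i++] = e.GetSingle();

        var bytes = new byte[floats.Length * sizeof(float)];
        Buffer.BlockCopy(floats, 0, bytes, 0, bytes.Length);
        return ("raw-f32", floats.Length, bytes);
    }
}
```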

Key design choices (called out for review)

Codec owns distance. IVectorComparer gained a VectorComparison MetricKind { get; } property so the database can dispatch to the encoding's fast asymmetric path: encoding.Compare(comparer.MetricKind, queryFloat, stored.EncodedVector). External IVectorComparer implementations will get a compile error pointing them at the new requirement; this is the only intentional public-API break.
VectorTextItem.Vector becomes a computed view — getter decodes from EncodedVector, setter wraps in raw encoding. No values are silently cached.
VectorEncodingRegistry holds singleton instances keyed by id so persisted vectors can be rehydrated to the right codec on load. The encoding actually used after a Load is whatever is recorded in the file, not the constructor parameter.
RaBitQ caveat. The published algorithm pre-rotates database and query vectors through a shared random orthonormal matrix for stronger concentration bounds. That would require per-database state which the registry-singleton IVectorEncoding shape does not model, so the rotation is omitted here. For already-isotropic embedding outputs (the typical input) the rotation-free estimator is still close in practice. A rotation-based variant is a follow-up that would also touch the registry contract.
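
The rotation-free idea can be sketched as follows. This is an illustration under stated assumptions, not the PR's `RaBitQEncoding`: it assumes the correction factor is the inner product between the normalized sign code sign(v)/sqrt(D) and the unit vector v/|v|, and that the dot product is estimated by dividing the query's projection on the sign code by that factor. All names are hypothetical.

```csharp
using System;

// Hypothetical sketch of a rotation-free 1-bit code; not the PR's RaBitQEncoding.
public static class SignCodeSketch
{
    public sealed record Code(byte[] Bits, float Norm, float Correction, int Dimensions);

    public static Code Encode(float[] v)
    {
        int d = v.Length;
        var bits = new byte[(d + 7) / 8];
        double sumSq = 0, sumAbs = 0;
        for (int i = 0; i < d; i++)
        {
            // Bit i is set when component i is non-negative.
            if (v[i] >= 0f) bits[i >> 3] |= (byte)(1 << (i & 7));
            sumSq += v[i] * (double)v[i];
            sumAbs += Math.Abs(v[i]);
        }
        double norm = Math.Sqrt(sumSq);
        // Assumed correction: <sign(v)/sqrt(d), v/|v|> = |v|_1 / (sqrt(d) * |v|_2).
        double correction = norm > 0 ? sumAbs / (Math.Sqrt(d) * norm) : 1.0;
        return new Code(bits, (float)norm, (float)correction, d);
    }

    // Estimates dot(v, query) from the sign bits, the norm, and the correction factor.
    public static float EstimateDot(float[] query, Code code)
    {
        double signed = 0;
        for (int i = 0; i < code.Dimensions; i++)
        {
            bool positive = (code.Bits[i >> 3] & (1 << (i & 7))) != 0;
            signed += positive ? query[i] : -query[i];
        }
        double codeDotQuery = signed / Math.Sqrt(code.Dimensions);
        return (float)(code.Norm * codeDotQuery / code.Correction);
    }
}
```

A cosine estimate follows by dividing `EstimateDot` by `code.Norm` times the query norm. In this layout the payload is ceil(D/8) bytes of sign bits plus two 4-byte floats, so a 384-dimension vector takes 56 bytes against 1536 bytes as float32, about 27x; the ratio approaches 32x as D grows.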

Test plan

23 new tests under SharpVectorTest/VectorEncoding/ cover: per-codec encode/decode round-trip, byte round-trip equivalence, cosine similarity accuracy bounds for random vectors, similarity-ranking correctness, exact storage-size expectations, the odd-dimension nibble-packing tail case, database-level save/load through each codec, and that a raw-encoded database.json omits the new VectorEncodingId field.
Full suite green except for two pre-existing flaky BasicDiskVocabularyStore / BasicDiskVectorStore WAL tests (AddAndGetText_PersistsToDisk, Delete_RemovesFromIndexButKeepsFile) that fail on main for an unrelated reason; those are fixed in #89 (Fix WAL file-handle leak in disk store recovery).

🤖 Generated with Claude Code

New abstractions (in src/Build5Nines.SharpVector/VectorEncoding/)
- IEncodedVector: encoded payload (id, dims, bytes, decode)
- IVectorEncoding: Encode(float[]), LoadFromBytes(bytes, dims), Compare(metric, queryFloat, encoded)
- VectorEncodingRegistry: id → encoding lookup
- RawFloat32Encoding (default, lossless), Int8ScalarQuantizationEncoding (~4× memory savings), RaBitQEncoding / TurboQuantEncoding (stubs that throw NotImplementedException)

Interface tweaks
- VectorComparison enum (CosineSimilarity, EuclideanDistance) repurposes the empty existing file
- IVectorComparer gains VectorComparison MetricKind { get; }. Both bundled comparers updated; external comparers will get a compile error pointing at the new requirement
- IVectorTextItem<,> gains IEncodedVector EncodedVector with a default impl that adapts to/from Vector (so external implementers keep compiling). VectorTextItem<,> now backs storage with the encoded form; Vector getter decodes, setter wraps in raw

Database injection
Every database accepts an optional encoding parameter; the original constructors still exist:
- new BasicMemoryVectorDatabase() (raw, default)
- new BasicMemoryVectorDatabase(Int8ScalarQuantizationEncoding.Instance)
- Same pattern on MemoryVectorDatabase<TMetadata> and BasicDiskVectorDatabase<TMetadata>. VectorDatabaseBase / MemoryVectorDatabaseBase / BasicDiskMemoryVectorDatabaseBase ctors are paired (existing signature kept; new overload added)

The public VectorEncoding property is exposed for inspection. In the embedding-path search, the loop dispatches per item: encoding.Compare(comparer.MetricKind, queryFloat, item.EncodedVector), picking the encoding from the registry by the item's own EncodingId, so even a mixed-encoding DB works.

Persistence & backward compat
- DatabaseInfo gained VectorEncodingId with [JsonIgnore(WhenWritingNull)]. Raw-encoded saves omit the field entirely → database.json is byte-identical to older versions for raw DBs
- VectorTextItem got a JsonConverter that reads both shapes: legacy {"Vector": [...]} and new {"EncodingId": "...", "Dimensions": N, "EncodedBytes": "base64..."}. Writing a raw-encoded item still produces the legacy {"Vector": [...]} shape
- On load, the file's VectorEncodingId overrides the constructor-provided encoding (so reload preserves the original setting); absence = raw

Test results
- 79/81 pass. All 15 new encoding tests pass.
- Notably the regression test VectorDatabaseVersion_2_0_2_001 (loads a real 2.0.2-era .b59vdb and asserts Similarity == 0.3396831452846527) passes: legacy files load bit-exactly
- The 2 failures (AddAndGetText_PersistsToDisk, Delete_RemovesFromIndexButKeepsFile) are a pre-existing bug in BasicDiskVocabularyStore.RecoverFromWalOrIndex:138: a using var fs keeps the WAL file open while File.WriteAllBytes(_walPath, ...) tries to overwrite it.
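
The registry lookup and per-item dispatch described in this commit message might look roughly like the sketch below. The interface is deliberately reduced and every type name is hypothetical; the PR's `IVectorEncoding` also takes a metric and an `IEncodedVector`, which are left out here to keep the example short. The raw payload is assumed to be native-endian float32 bytes.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical, reduced shapes for illustration; the PR's actual interfaces differ.
public interface IEncodingSketch
{
    string Id { get; }
    float Compare(float[] query, byte[] encodedBytes);
}

public sealed class RawFloat32Sketch : IEncodingSketch
{
    public static readonly RawFloat32Sketch Instance = new();
    public string Id => "raw-f32";

    // Dot product against a native-endian float32 payload (assumed layout).
    public float Compare(float[] query, byte[] encodedBytes)
    {
        float dot = 0f;
        for (int i = 0; i < query.Length; i++)
            dot += query[i] * BitConverter.ToSingle(encodedBytes, i * sizeof(float));
        return dot;
    }
}

public static class EncodingRegistrySketch
{
    // Singleton instances keyed by id, so stored items can find their codec on load.
    private static readonly Dictionary<string, IEncodingSketch> ById = new()
    {
        [RawFloat32Sketch.Instance.Id] = RawFloat32Sketch.Instance,
    };

    public static void Register(IEncodingSketch encoding) => ById[encoding.Id] = encoding;

    // A missing id means a legacy file, which is treated as raw.
    public static IEncodingSketch Resolve(string? id)
    {
        if (id is null) return RawFloat32Sketch.Instance;
        if (ById.TryGetValue(id, out var found)) return found;
        throw new KeyNotFoundException($"Unknown vector encoding '{id}'.");
    }
}
```

A search loop under these assumptions would call `EncodingRegistrySketch.Resolve(item.EncodingId).Compare(query, item.Bytes)` for each item, which is what lets items with different encodings coexist in one database.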

Replace the NotImplementedException stubs with working encoders so all four built-in encodings (raw, int8-sq, rabitq, turboquant) can be used to construct a database.

RaBitQEncoding (rotation-free variant):
- 1-bit sign code per dimension, packed into bytes
- Per-vector L2 norm and reconstruction-correction factor
- Asymmetric distance: float query vs packed sign bits
- Storage ~D/8 + 8 bytes (about 32x compression at D=384)
- The published algorithm pre-rotates with a shared random orthonormal matrix; that requires per-database state the registry-singleton IVectorEncoding shape does not model, so the rotation is omitted here

TurboQuantEncoding (4-bit symmetric scalar quantization):
- Per-vector scale plus int4 codes packed two per byte
- Asymmetric cosine (scale cancels) and Euclidean (code * scale)
- Storage ~D/2 + 4 bytes (about 8x compression at D=384)
- Handles odd-dimension nibble-packing tail case correctly

Tests cover: byte round-trip equivalence, cosine accuracy bounds for random vectors, similarity-ranking correctness, exact storage-size expectations, odd-dimension nibble packing, and database-level save/load through each codec.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
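
The nibble packing and its odd-dimension tail case can be sketched as below. This is not the PR's `TurboQuantEncoding`: the byte layout (4-byte scale first, then packed codes), the code range of [-7, 7], and the low-nibble-first ordering are all assumptions chosen for illustration, and the names are hypothetical.

```csharp
using System;

// Hypothetical sketch of 4-bit symmetric quantization; not the PR's TurboQuantEncoding.
public static class Int4PackSketch
{
    // Assumed layout: 4-byte float scale, then ceil(D/2) bytes of packed codes.
    public static byte[] Encode(float[] v)
    {
        float maxAbs = 0f;
        foreach (float x in v) maxAbs = Math.Max(maxAbs, Math.Abs(x));
        float scale = maxAbs > 0f ? maxAbs / 7f : 1f; // codes live in [-7, 7]

        var bytes = new byte[4 + (v.Length + 1) / 2];
        BitConverter.TryWriteBytes(bytes.AsSpan(0, 4), scale);
        for (int i = 0; i < v.Length; i++)
        {
            int code = Math.Clamp((int)MathF.Round(v[i] / scale), -7, 7);
            int nibble = code & 0xF; // 4-bit two's complement
            // Even index goes to the low nibble, odd index to the high nibble.
            bytes[4 + i / 2] |= (byte)((i & 1) == 0 ? nibble : nibble << 4);
        }
        // For odd D the final high nibble stays zero as padding.
        return bytes;
    }

    public static float[] Decode(byte[] bytes, int dimensions)
    {
        float scale = BitConverter.ToSingle(bytes, 0);
        var v = new float[dimensions];
        for (int i = 0; i < dimensions; i++)
        {
            int nibble = (bytes[4 + i / 2] >> ((i & 1) * 4)) & 0xF;
            int code = nibble >= 8 ? nibble - 16 : nibble; // sign-extend from 4 bits
            v[i] = code * scale;
        }
        return v;
    }
}
```

Under this layout a 384-dimension vector takes 4 + 192 = 196 bytes against 1536 bytes as float32, about 7.8x, which is consistent with the "about 8x" figure. The dimension count has to be stored alongside the bytes, because an odd D cannot be told apart from D + 1 by payload length alone.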

sven-n and others added 2 commits May 20, 2026 18:49

sven-n wants to merge 2 commits into Build5Nines:main from sven-n:vector-compression
