Add compressed vector storage (Int8 SQ, RaBitQ, TurboQUANT) #90
Open
sven-n wants to merge 2 commits into
New abstractions (in src/Build5Nines.SharpVector/VectorEncoding/)
- IEncodedVector — encoded payload (id, dims, bytes, decode)
- IVectorEncoding — Encode(float[]), LoadFromBytes(bytes, dims), Compare(metric, queryFloat, encoded)
- VectorEncodingRegistry — id → encoding lookup
- RawFloat32Encoding (default, lossless), Int8ScalarQuantizationEncoding (~4× memory savings), RaBitQEncoding / TurboQuantEncoding (stubs that throw
NotImplementedException)
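As a rough illustration of how these abstractions might fit together, the sketch below declares the two interfaces and a lossless float32 codec. The member names and signatures (`EncodingId`, `Bytes`, `Decode`, `RawFloat32Sketch`, the private `Payload` class) are assumptions inferred from the bullet list, not the PR's actual declarations.

```csharp
using System;

public enum VectorComparison { CosineSimilarity, EuclideanDistance }

// Assumed shape of an encoded payload: id, dimensions, bytes, decode.
public interface IEncodedVector
{
    string EncodingId { get; }
    int Dimensions { get; }
    byte[] Bytes { get; }
    float[] Decode();
}

// Assumed shape of a codec: encode, rehydrate from bytes, asymmetric compare.
public interface IVectorEncoding
{
    string Id { get; }
    IEncodedVector Encode(float[] vector);
    IEncodedVector LoadFromBytes(byte[] bytes, int dimensions);
    float Compare(VectorComparison metric, float[] query, IEncodedVector stored);
}

// Lossless baseline: 4 bytes per dimension in the machine's native float layout.
public sealed class RawFloat32Sketch : IVectorEncoding
{
    public string Id => "raw-f32";

    public IEncodedVector Encode(float[] vector)
    {
        var bytes = new byte[vector.Length * sizeof(float)];
        Buffer.BlockCopy(vector, 0, bytes, 0, bytes.Length);
        return new Payload(Id, vector.Length, bytes);
    }

    public IEncodedVector LoadFromBytes(byte[] bytes, int dimensions) =>
        new Payload(Id, dimensions, bytes);

    // Cosine similarity or Euclidean distance; a zero vector yields NaN for cosine.
    public float Compare(VectorComparison metric, float[] query, IEncodedVector stored)
    {
        float[] v = stored.Decode();
        float dot = 0f, qq = 0f, vv = 0f, sq = 0f;
        for (int i = 0; i < v.Length; i++)
        {
            dot += query[i] * v[i];
            qq += query[i] * query[i];
            vv += v[i] * v[i];
            float d = query[i] - v[i];
            sq += d * d;
        }
        return metric == VectorComparison.CosineSimilarity
            ? dot / MathF.Sqrt(qq * vv)
            : MathF.Sqrt(sq);
    }

    private sealed class Payload : IEncodedVector
    {
        public Payload(string id, int dims, byte[] bytes)
        {
            EncodingId = id; Dimensions = dims; Bytes = bytes;
        }
        public string EncodingId { get; }
        public int Dimensions { get; }
        public byte[] Bytes { get; }
        public float[] Decode()
        {
            var v = new float[Dimensions];
            Buffer.BlockCopy(Bytes, 0, v, 0, Dimensions * sizeof(float));
            return v;
        }
    }
}
```

A compressed codec would keep the same surface and change only the byte layout and the body of `Compare`.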
Interface tweaks
- VectorComparison enum (CosineSimilarity, EuclideanDistance) repurposes the empty existing file
- IVectorComparer gains VectorComparison MetricKind { get; } — both bundled comparers updated; external comparers will get a compile error pointing at the
new requirement
- IVectorTextItem<,> gains IEncodedVector EncodedVector with a default impl that adapts to/from Vector (so external implementers keep compiling).
VectorTextItem<,> now backs storage with the encoded form; Vector getter decodes, setter wraps in raw
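The default-implementation trick for `IVectorTextItem<,>` can be sketched with a C# default interface member. `EncodedSketch`, `ITextItemSketch`, and `LegacyItem` are hypothetical, simplified stand-ins; the real interface is generic and uses `IEncodedVector`.

```csharp
using System;

// Hypothetical stand-in for an encoded payload (raw float32 bytes only).
public sealed class EncodedSketch
{
    public byte[] Bytes { get; }
    public EncodedSketch(byte[] bytes) => Bytes = bytes;

    public static EncodedSketch FromFloats(float[] v)
    {
        var bytes = new byte[v.Length * sizeof(float)];
        Buffer.BlockCopy(v, 0, bytes, 0, bytes.Length);
        return new EncodedSketch(bytes);
    }

    public float[] Decode()
    {
        var v = new float[Bytes.Length / sizeof(float)];
        Buffer.BlockCopy(Bytes, 0, v, 0, Bytes.Length);
        return v;
    }
}

public interface ITextItemSketch
{
    float[] Vector { get; set; }

    // Default interface member (C# 8+): implementers that only know about
    // Vector still compile, and the encoded view is derived on demand.
    EncodedSketch EncodedVector
    {
        get => EncodedSketch.FromFloats(Vector);
        set => Vector = value.Decode();
    }
}

// An "external" implementation written before EncodedVector existed.
public sealed class LegacyItem : ITextItemSketch
{
    public float[] Vector { get; set; } = Array.Empty<float>();
}
```

Note that a default member is reachable only through an interface-typed reference (`ITextItemSketch item = new LegacyItem();`), not through the concrete class.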
Database injection
Every database accepts an optional encoding parameter — the original constructors still exist:
- new BasicMemoryVectorDatabase() (raw, default)
- new BasicMemoryVectorDatabase(Int8ScalarQuantizationEncoding.Instance)
- Same pattern on MemoryVectorDatabase<TMetadata> and BasicDiskVectorDatabase<TMetadata>. VectorDatabaseBase / MemoryVectorDatabaseBase /
BasicDiskMemoryVectorDatabaseBase ctors are paired (existing signature kept; new overload added)
The public VectorEncoding property is exposed for inspection. In the embedding-path search, the loop dispatches per item:
encoding.Compare(comparer.MetricKind, queryFloat, item.EncodedVector) — picking the encoding from the registry by the item's own EncodingId, so even a
mixed-encoding DB works.
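A minimal sketch of that per-item dispatch, assuming a registry that maps an encoding id to a cosine function over (float query, encoded bytes). The `StoredItem` record, the delegate-based registry, and the int8 byte layout (4-byte scale followed by signed codes) are illustrative assumptions, not the PR's types or on-disk format.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record StoredItem(string Text, string EncodingId, byte[] Bytes);

public static class MixedEncodingSearch
{
    // Hypothetical registry: encoding id -> cosine(float query, encoded bytes).
    public static readonly Dictionary<string, Func<float[], byte[], float>> Registry = new()
    {
        ["raw-f32"] = (q, b) => Cosine(q, i => BitConverter.ToSingle(b, i * 4)),
        // Assumed layout: 4-byte float scale, then one signed byte per dimension.
        // The scale cancels in cosine, so the codes are used directly.
        ["int8-sq"] = (q, b) => Cosine(q, i => (sbyte)b[4 + i]),
    };

    private static float Cosine(float[] q, Func<int, float> stored)
    {
        float dot = 0f, qq = 0f, vv = 0f;
        for (int i = 0; i < q.Length; i++)
        {
            float v = stored(i);
            dot += q[i] * v;
            qq += q[i] * q[i];
            vv += v * v;
        }
        return dot / MathF.Sqrt(qq * vv);
    }

    public static StoredItem EncodeRaw(string text, float[] v)
    {
        var bytes = new byte[v.Length * 4];
        Buffer.BlockCopy(v, 0, bytes, 0, bytes.Length);
        return new StoredItem(text, "raw-f32", bytes);
    }

    public static StoredItem EncodeInt8(string text, float[] v)
    {
        float maxAbs = v.Max(x => MathF.Abs(x));
        float scale = maxAbs > 0f ? maxAbs / 127f : 1f;
        var bytes = new byte[4 + v.Length];
        BitConverter.GetBytes(scale).CopyTo(bytes, 0);
        for (int i = 0; i < v.Length; i++)
        {
            int code = Math.Clamp((int)MathF.Round(v[i] / scale), -127, 127);
            bytes[4 + i] = unchecked((byte)(sbyte)code);
        }
        return new StoredItem(text, "int8-sq", bytes);
    }

    // Each item is scored by the codec named in its own EncodingId,
    // so items with different encodings can share one result list.
    public static IEnumerable<(string Text, float Score)> Search(
        float[] query, IEnumerable<StoredItem> items) =>
        items.Select(item => (item.Text, Score: Registry[item.EncodingId](query, item.Bytes)))
             .OrderByDescending(r => r.Score);
}
```

The query stays in float form throughout; only the stored side is read from its compressed bytes, which is what makes the comparison asymmetric.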
Persistence & backward compat
- DatabaseInfo gained VectorEncodingId with [JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]. Raw-encoded saves omit the field entirely, so database.json is byte-identical to
older versions for raw DBs
- VectorTextItem got a JsonConverter that reads both shapes: legacy {"Vector": [...]} and new {"EncodingId": "...", "Dimensions": N, "EncodedBytes":
"base64..."}. Writing a raw-encoded item still produces the legacy {"Vector": [...]} shape
- On load, the file's VectorEncodingId overrides the constructor-provided encoding (so reload preserves the original setting); absence = raw
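The dual-shape read can be sketched with `System.Text.Json`. This is only the shape-detection logic, written as a plain helper rather than the PR's `JsonConverter`; the `VectorItemJson.ReadVector` name and the tuple return type are hypothetical, and the "raw-f32" id is taken from the PR description.

```csharp
using System;
using System.Text.Json;

public static class VectorItemJson
{
    // Accepts either {"Vector": [...]} (legacy) or
    // {"EncodingId": "...", "Dimensions": N, "EncodedBytes": "base64..."}.
    public static (string EncodingId, int Dimensions, byte[] Bytes) ReadVector(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        if (root.TryGetProperty("Vector", out JsonElement legacy))
        {
            // Legacy shape: a float array, treated as raw float32.
            int n = legacy.GetArrayLength();
            var floats = new float[n];
            int i = 0;
            foreach (JsonElement e in legacy.EnumerateArray())
                floats[i++] = e.GetSingle();

            var bytes = new byte[n * sizeof(float)];
            Buffer.BlockCopy(floats, 0, bytes, 0, bytes.Length);
            return ("raw-f32", n, bytes);
        }

        // New shape: encoding id, dimension count, base64 payload.
        string id = root.GetProperty("EncodingId").GetString()
                    ?? throw new JsonException("EncodingId is null");
        return (id,
                root.GetProperty("Dimensions").GetInt32(),
                root.GetProperty("EncodedBytes").GetBytesFromBase64());
    }
}
```

Checking for the legacy property first means older files never need a version flag; absence of `EncodingId` simply implies raw.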
Test results
- 79/81 pass. All 15 new encoding tests pass.
- Notably the regression test VectorDatabaseVersion_2_0_2_001 (loads a real 2.0.2-era .b59vdb and asserts Similarity == 0.3396831452846527) passes — legacy
files load bit-exactly
- The 2 failures (AddAndGetText_PersistsToDisk, Delete_RemovesFromIndexButKeepsFile) are a pre-existing bug in
BasicDiskVocabularyStore.RecoverFromWalOrIndex:138 — a using var fs keeps the WAL file open while File.WriteAllBytes(_walPath, ...) tries to overwrite it.
Replace the NotImplementedException stubs with working encoders so all four built-in encodings (raw, int8-sq, rabitq, turboquant) can be used to construct a database.
RaBitQEncoding (rotation-free variant):
- 1-bit sign code per dimension, packed into bytes
- Per-vector L2 norm and reconstruction-correction factor
- Asymmetric distance: float query vs packed sign bits
- Storage ~D/8 + 8 bytes (about 32x compression at D=384)
- The published algorithm pre-rotates with a shared random orthonormal matrix; that requires per-database state the registry-singleton IVectorEncoding shape does not model, so the rotation is omitted here
TurboQuantEncoding (4-bit symmetric scalar quantization):
- Per-vector scale plus int4 codes packed two per byte
- Asymmetric cosine (scale cancels) and Euclidean (code * scale)
- Storage ~D/2 + 4 bytes (about 8x compression at D=384)
- Handles odd-dimension nibble-packing tail case correctly
Tests cover: byte round-trip equivalence, cosine accuracy bounds for random vectors, similarity-ranking correctness, exact storage-size expectations, odd-dimension nibble packing, and database-level save/load through each codec.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
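The two packing schemes described in this commit can be illustrated as follows. This is a sketch of the bit layouts only: the bit order, the offset-by-8 nibble encoding, and the `PackingSketch` helper names are assumptions, and the per-vector headers (norm and correction factor, or scale) and the distance estimators are left out.

```csharp
using System;

public static class PackingSketch
{
    // One sign bit per dimension, 8 dimensions per byte.
    // Dimension i goes to bit (i % 8) of byte (i / 8); zero counts as non-negative.
    public static byte[] PackSigns(float[] v)
    {
        var packed = new byte[(v.Length + 7) / 8];
        for (int i = 0; i < v.Length; i++)
            if (v[i] >= 0f)
                packed[i >> 3] |= (byte)(1 << (i & 7));
        return packed;
    }

    // 4-bit symmetric scalar quantization: codes in [-7, 7], stored as code + 8.
    // Even dimensions use the low nibble, odd dimensions the high nibble.
    public static (float Scale, byte[] Packed) QuantizeInt4(float[] v)
    {
        float maxAbs = 0f;
        foreach (float x in v) maxAbs = Math.Max(maxAbs, Math.Abs(x));
        float scale = maxAbs > 0f ? maxAbs / 7f : 1f;

        // With an odd dimension count the last high nibble stays unused (0).
        var packed = new byte[(v.Length + 1) / 2];
        for (int i = 0; i < v.Length; i++)
        {
            int code = Math.Clamp((int)MathF.Round(v[i] / scale), -7, 7);
            int nibble = code + 8;
            packed[i >> 1] |= (byte)((i & 1) == 0 ? nibble : nibble << 4);
        }
        return (scale, packed);
    }

    public static float[] DequantizeInt4(float scale, byte[] packed, int dims)
    {
        var v = new float[dims];
        for (int i = 0; i < dims; i++)
        {
            int nibble = (i & 1) == 0 ? packed[i >> 1] & 0x0F : packed[i >> 1] >> 4;
            v[i] = (nibble - 8) * scale;
        }
        return v;
    }
}
```

At D=384 these payloads are 48 and 192 bytes against 1,536 bytes of raw float32, which is where the roughly 32x and 8x figures come from before the per-vector header is counted.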
Summary
Adds an IVectorEncoding abstraction so a vector database can store its embeddings in compressed form to save memory and disk space. The encoding is injected at the database constructor and the codec owns the asymmetric (query float vs. stored encoded) distance computation, so search no longer has to decode back to float[].
Four built-in encodings ship in this PR:
- raw-f32
- int8-sq
- rabitq
- turboquant
Usage: pass an encoding instance to the database constructor, for example new BasicMemoryVectorDatabase(Int8ScalarQuantizationEncoding.Instance).
The original parameter-less constructors remain (raw encoding, same behavior as before).
Backward compatibility
- Legacy .b59vdb files load unchanged: the regression test (VectorDatabaseVersion_2_0_2_001) still passes with bit-exact similarity (0.3396831452846527).
- VectorEncodingId is omitted from database.json when the encoding is raw, and the per-item JSON keeps the {"Vector": [...]} array. A JsonConverter on VectorTextItem<,> understands both the legacy shape and the new {"EncodingId", "Dimensions", "EncodedBytes"} shape.
- IEncodedVector EncodedVector { get; set; } on IVectorTextItem<,> has a default implementation that adapts to/from the existing float[] Vector, so external implementations keep compiling.
Key design choices (called out for review)
- IVectorComparer gained a VectorComparison MetricKind { get; } property so the database can dispatch to the encoding's fast asymmetric path: encoding.Compare(comparer.MetricKind, queryFloat, stored.EncodedVector). External IVectorComparer implementations will get a compile error pointing them at the new requirement; this is the only intentional public-API break.
- VectorTextItem.Vector becomes a computed view: the getter decodes from EncodedVector, the setter wraps in raw encoding. No values are silently cached.
- VectorEncodingRegistry holds singleton instances keyed by id so persisted vectors can be rehydrated to the right codec on load. The encoding actually used after a Load is whatever is recorded in the file, not the constructor parameter.
- RaBitQEncoding is the rotation-free variant. The published algorithm pre-rotates with a shared random orthonormal matrix; that requires per-database state the registry-singleton IVectorEncoding shape does not model, so the rotation is omitted here. For already-isotropic embedding outputs (the typical input) the rotation-free estimator is still close in practice. A rotation-based variant is a follow-up that would also touch the registry contract.
Test plan
- Tests in SharpVectorTest/VectorEncoding/ cover: per-codec encode/decode round-trip, byte round-trip equivalence, cosine similarity accuracy bounds for random vectors, similarity-ranking correctness, exact storage-size expectations, odd-dimension nibble packing tail case, database-level save/load through each codec, and that a raw-encoded database.json omits the new VectorEncodingId field.
- The two BasicDiskVocabularyStore/BasicDiskVectorStore WAL tests (AddAndGetText_PersistsToDisk, Delete_RemovesFromIndexButKeepsFile) fail on main for an unrelated reason; those are fixed in "Fix WAL file-handle leak in disk store recovery" (#89).
🤖 Generated with Claude Code