Add compressed vector storage (Int8 SQ, RaBitQ, TurboQUANT)#90

Open
sven-n wants to merge 2 commits into Build5Nines:main from sven-n:vector-compression

sven-n commented May 21, 2026

Summary

Adds an IVectorEncoding abstraction so a vector database can store its embeddings in compressed form to save memory and disk space. The encoding is injected at the database constructor and the codec owns the asymmetric (query float vs. stored encoded) distance computation, so search no longer has to decode back to float[].

Four built-in encodings ship in this PR:

| Id | Compression | Notes |
|---|---|---|
| raw-f32 | 1x (default) | Backwards-compatible passthrough |
| int8-sq | ~4x | Symmetric per-vector int8 scalar quantization |
| rabitq | ~32x | Rotation-free RaBitQ: 1-bit sign code + per-vector norm and correction factor |
| turboquant | ~8x | 4-bit symmetric SQ, nibble-packed on disk |

Usage:

var db = new BasicMemoryVectorDatabase(Int8ScalarQuantizationEncoding.Instance);

The original parameter-less constructors remain (raw encoding, same behavior as before).

Backward compatibility

  • Existing .b59vdb files load unchanged — the regression test (VectorDatabaseVersion_2_0_2_001) still passes with bit-exact similarity (0.3396831452846527).
  • Raw-encoded databases write the same legacy JSON shape, byte-for-byte: VectorEncodingId is omitted from database.json when the encoding is raw, and the per-item JSON keeps the {"Vector": [...]} array. A JsonConverter on VectorTextItem<,> understands both the legacy shape and the new {"EncodingId", "Dimensions", "EncodedBytes"} shape.
  • New nullable IEncodedVector EncodedVector { get; set; } on IVectorTextItem<,> has a default implementation that adapts to/from the existing float[] Vector, so external implementations keep compiling.
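
To make the two per-item JSON shapes concrete, here is a minimal sketch of a reader that accepts both. It is not the PR's JsonConverter: `ItemShapeReader` and its tuple return type are hypothetical names for illustration, it assumes System.Text.Json, and it assumes the legacy float array maps to raw bytes in native byte order.

```csharp
using System;
using System.Linq;
using System.Text.Json;

// Hypothetical helper (not the PR's converter): reads either per-item JSON shape.
public static class ItemShapeReader
{
    public static (string EncodingId, int Dimensions, byte[] Bytes) Read(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        JsonElement root = doc.RootElement;

        // New shape: {"EncodingId": "...", "Dimensions": N, "EncodedBytes": "base64..."}
        if (root.TryGetProperty("EncodedBytes", out JsonElement encoded))
        {
            return (root.GetProperty("EncodingId").GetString() ?? "raw-f32",
                    root.GetProperty("Dimensions").GetInt32(),
                    encoded.GetBytesFromBase64());
        }

        // Legacy shape: {"Vector": [...]}, treated as raw-f32 at 4 bytes per value.
        // Assumption: native byte order is acceptable for this illustration.
        float[] vector = root.GetProperty("Vector").EnumerateArray()
                             .Select(e => e.GetSingle()).ToArray();
        byte[] bytes = new byte[vector.Length * sizeof(float)];
        Buffer.BlockCopy(vector, 0, bytes, 0, bytes.Length);
        return ("raw-f32", vector.Length, bytes);
    }
}
```

Branching on the presence of `EncodedBytes` means a legacy file never needs an explicit version marker, which is consistent with raw saves staying byte-identical.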

Key design choices (called out for review)

  • Codec owns distance. IVectorComparer gained a VectorComparison MetricKind { get; } property so the database can dispatch to the encoding's fast asymmetric path: encoding.Compare(comparer.MetricKind, queryFloat, stored.EncodedVector). External IVectorComparer implementations will get a compile error pointing them at the new requirement; this is the only intentional public-API break.
  • VectorTextItem.Vector becomes a computed view — getter decodes from EncodedVector, setter wraps in raw encoding. No values are silently cached.
  • VectorEncodingRegistry holds singleton instances keyed by id so persisted vectors can be rehydrated to the right codec on load. The encoding actually used after a Load is whatever is recorded in the file, not the constructor parameter.
  • RaBitQ caveat. The published algorithm pre-rotates database and query vectors through a shared random orthonormal matrix for stronger concentration bounds. That would require per-database state which the registry-singleton IVectorEncoding shape does not model, so the rotation is omitted here. For already-isotropic embedding outputs (the typical input) the rotation-free estimator is still close in practice. A rotation-based variant is a follow-up that would also touch the registry contract.
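
As an illustration of what "codec owns distance" can look like for int8-sq, the sketch below quantizes with one scale per vector and computes cosine directly between a float query and the int8 codes. `Int8SqSketch` and its method names are hypothetical and are not taken from the PR; the actual Int8ScalarQuantizationEncoding may choose its scale or rounding differently.

```csharp
using System;

// Hypothetical sketch of symmetric per-vector int8 scalar quantization.
public static class Int8SqSketch
{
    // Maps each value to round(v / scale) in [-127, 127], with scale = max|v| / 127.
    public static (sbyte[] Codes, float Scale) Encode(float[] vector)
    {
        float maxAbs = 0f;
        foreach (float v in vector) maxAbs = Math.Max(maxAbs, Math.Abs(v));
        float scale = maxAbs > 0f ? maxAbs / 127f : 1f;

        var codes = new sbyte[vector.Length];
        for (int i = 0; i < vector.Length; i++)
            codes[i] = (sbyte)Math.Clamp((int)MathF.Round(vector[i] / scale), -127, 127);
        return (codes, scale);
    }

    // Asymmetric cosine: float query against int8 codes, without decoding to float[].
    // The per-vector scale cancels between numerator and denominator, so it is not needed here.
    public static float Cosine(float[] query, sbyte[] codes)
    {
        double dot = 0, queryNorm = 0, codeNorm = 0;
        for (int i = 0; i < codes.Length; i++)
        {
            dot += query[i] * codes[i];
            queryNorm += query[i] * query[i];
            codeNorm += codes[i] * codes[i];
        }
        double denom = Math.Sqrt(queryNorm) * Math.Sqrt(codeNorm);
        return denom == 0 ? 0f : (float)(dot / denom);
    }
}
```

For Euclidean distance the scale no longer cancels, so a codec along these lines would multiply each code by the stored scale before differencing.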

Test plan

  • 23 new tests under SharpVectorTest/VectorEncoding/ cover: per-codec encode/decode round-trip, byte round-trip equivalence, cosine similarity accuracy bounds for random vectors, similarity-ranking correctness, exact storage-size expectations, odd-dimension nibble packing tail case, database-level save/load through each codec, and that a raw-encoded database.json omits the new VectorEncodingId field.
  • Full suite green except for two pre-existing flaky BasicDiskVocabularyStore / BasicDiskVectorStore WAL tests (AddAndGetText_PersistsToDisk, Delete_RemovesFromIndexButKeepsFile) that fail on main for an unrelated reason; those are fixed in #89 (Fix WAL file-handle leak in disk store recovery).

🤖 Generated with Claude Code

sven-n and others added 2 commits May 20, 2026 18:49
New abstractions (in src/Build5Nines.SharpVector/VectorEncoding/)

  - IEncodedVector — encoded payload (id, dims, bytes, decode)
  - IVectorEncoding — Encode(float[]), LoadFromBytes(bytes, dims), Compare(metric, queryFloat, encoded)
  - VectorEncodingRegistry — id → encoding lookup
  - RawFloat32Encoding (default, lossless), Int8ScalarQuantizationEncoding (~4× memory savings), RaBitQEncoding / TurboQuantEncoding (stubs that throw
  NotImplementedException)

  Interface tweaks

  - VectorComparison enum (CosineSimilarity, EuclideanDistance) repurposes the empty existing file
  - IVectorComparer gains VectorComparison MetricKind { get; } — both bundled comparers updated; external comparers will get a compile error pointing at the
  new requirement
  - IVectorTextItem<,> gains IEncodedVector EncodedVector with a default impl that adapts to/from Vector (so external implementers keep compiling).
  VectorTextItem<,> now backs storage with the encoded form; Vector getter decodes, setter wraps in raw

  Database injection

  Every database accepts an optional encoding parameter — the original constructors still exist:
  - new BasicMemoryVectorDatabase() (raw, default)
  - new BasicMemoryVectorDatabase(Int8ScalarQuantizationEncoding.Instance)
  - Same pattern on MemoryVectorDatabase<TMetadata> and BasicDiskVectorDatabase<TMetadata>. VectorDatabaseBase / MemoryVectorDatabaseBase /
  BasicDiskMemoryVectorDatabaseBase ctors are paired (existing signature kept; new overload added)

  The public VectorEncoding property is exposed for inspection. In the embedding-path search, the loop dispatches per item:
  encoding.Compare(comparer.MetricKind, queryFloat, item.EncodedVector) — picking the encoding from the registry by the item's own EncodingId, so even a
  mixed-encoding DB works.

  Persistence & backward compat

  - DatabaseInfo gained VectorEncodingId with [JsonIgnore(Condition = JsonIgnoreCondition.WhenWritingNull)]. Raw-encoded saves omit the field entirely → database.json is byte-identical to
  older versions for raw DBs
  - VectorTextItem got a JsonConverter that reads both shapes: legacy {"Vector": [...]} and new {"EncodingId": "...", "Dimensions": N, "EncodedBytes":
  "base64..."}. Writing a raw-encoded item still produces the legacy {"Vector": [...]} shape
  - On load, the file's VectorEncodingId overrides the constructor-provided encoding (so reload preserves the original setting); absence = raw

  Test results

  - 79/81 pass. All 15 new encoding tests pass.
  - Notably the regression test VectorDatabaseVersion_2_0_2_001 (loads a real 2.0.2-era .b59vdb and asserts Similarity == 0.3396831452846527) passes — legacy
   files load bit-exactly
  - The 2 failures (AddAndGetText_PersistsToDisk, Delete_RemovesFromIndexButKeepsFile) are a pre-existing bug in
  BasicDiskVocabularyStore.RecoverFromWalOrIndex:138 — a using var fs keeps the WAL file open while File.WriteAllBytes(_walPath, ...) tries to overwrite it.

Replace the NotImplementedException stubs with working encoders so all
four built-in encodings (raw, int8-sq, rabitq, turboquant) can be used
to construct a database.

RaBitQEncoding (rotation-free variant):
- 1-bit sign code per dimension, packed into bytes
- Per-vector L2 norm and reconstruction-correction factor
- Asymmetric distance: float query vs packed sign bits
- Storage ~D/8 + 8 bytes (about 27x compression at D=384, approaching 32x as D grows)
- The published algorithm pre-rotates with a shared random orthonormal
  matrix; that requires per-database state the registry-singleton
  IVectorEncoding shape does not model, so the rotation is omitted here
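
A minimal sketch of a rotation-free 1-bit estimator consistent with the bullets above, under stated assumptions: the correction factor is taken to be the inner product between the normalized sign code and the unit vector, and `SignCodeSketch`, `Code`, and `Factor` are hypothetical names. The real RaBitQEncoding may define its factor and byte layout differently.

```csharp
using System;

// Hypothetical sketch of a rotation-free 1-bit sign code with norm and correction factor.
public static class SignCodeSketch
{
    public sealed record Code(byte[] Bits, float Norm, float Factor, int Dimensions);

    public static Code Encode(float[] vector)
    {
        int d = vector.Length;
        var bits = new byte[(d + 7) / 8]; // one sign bit per dimension
        double sumSquares = 0, sumAbs = 0;
        for (int i = 0; i < d; i++)
        {
            if (vector[i] >= 0f) bits[i >> 3] |= (byte)(1 << (i & 7));
            sumSquares += vector[i] * vector[i];
            sumAbs += Math.Abs(vector[i]);
        }
        double norm = Math.Sqrt(sumSquares);
        // Factor = <sign(v)/sqrt(d), v/|v|>: how well the sign code aligns with the unit vector.
        double factor = norm > 0 ? sumAbs / (Math.Sqrt(d) * norm) : 0;
        return new Code(bits, (float)norm, (float)factor, d);
    }

    // Estimates cosine(query, stored) from the packed sign bits, without decoding.
    public static float Cosine(float[] query, Code code)
    {
        double signedSum = 0, querySquares = 0;
        for (int i = 0; i < code.Dimensions; i++)
        {
            bool positive = (code.Bits[i >> 3] & (1 << (i & 7))) != 0;
            signedSum += positive ? query[i] : -query[i];
            querySquares += query[i] * query[i];
        }
        double queryNorm = Math.Sqrt(querySquares);
        if (queryNorm == 0 || code.Factor == 0) return 0f;
        double estimate = signedSum / (Math.Sqrt(code.Dimensions) * code.Factor * queryNorm);
        return (float)Math.Clamp(estimate, -1.0, 1.0);
    }
}
```

In this sketch the stored norm is unused for cosine; a Euclidean path would need it to combine the cosine estimate with the query norm. The estimate is exact when the query equals the stored vector and degrades as the two diverge.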

TurboQuantEncoding (4-bit symmetric scalar quantization):
- Per-vector scale plus int4 codes packed two per byte
- Asymmetric cosine (scale cancels) and Euclidean (code * scale)
- Storage ~D/2 + 4 bytes (about 8x compression at D=384)
- Handles odd-dimension nibble-packing tail case correctly
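
The nibble packing and its odd-dimension tail case can be sketched as follows. This is an illustration only: `NibbleSketch`, the [-7, 7] code range, the +8 offset, and the low-nibble-first order are assumptions, not details confirmed by the commit.

```csharp
using System;

// Hypothetical sketch of 4-bit symmetric scalar quantization with nibble packing.
public static class NibbleSketch
{
    // Codes lie in [-7, 7]; each is stored as an unsigned nibble (code + 8).
    public static (byte[] Packed, float Scale) Encode(float[] vector)
    {
        float maxAbs = 0f;
        foreach (float v in vector) maxAbs = Math.Max(maxAbs, Math.Abs(v));
        float scale = maxAbs > 0f ? maxAbs / 7f : 1f;

        // Two codes per byte; for odd D the last byte's high nibble stays unused.
        var packed = new byte[(vector.Length + 1) / 2];
        for (int i = 0; i < vector.Length; i++)
        {
            int code = Math.Clamp((int)MathF.Round(vector[i] / scale), -7, 7);
            int nibble = code + 8;
            packed[i >> 1] |= (byte)((i & 1) == 0 ? nibble : nibble << 4);
        }
        return (packed, scale);
    }

    // The dimension count must be supplied: the packed length alone cannot
    // distinguish D = 2k - 1 from D = 2k.
    public static float[] Decode(byte[] packed, float scale, int dimensions)
    {
        var vector = new float[dimensions];
        for (int i = 0; i < dimensions; i++)
        {
            int nibble = (i & 1) == 0 ? packed[i >> 1] & 0x0F : packed[i >> 1] >> 4;
            vector[i] = (nibble - 8) * scale;
        }
        return vector;
    }
}
```

Passing the dimension count to `Decode` mirrors the LoadFromBytes(bytes, dims) shape described in the first commit.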

Tests cover: byte round-trip equivalence, cosine accuracy bounds for
random vectors, similarity-ranking correctness, exact storage-size
expectations, odd-dimension nibble packing, and database-level
save/load through each codec.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>