ndarray-SIMD consumer integration: turbovec ANN bridge + bgz17 AVX-512 + blasgraph Hamming dedup #493
New excluded standalone crate wrapping the AdaWorldAPI turbovec fork (Google
TurboQuant, arXiv 2504.19874). TurboVec exposes a Kernel::{NativeLut,
PolyfillGemm} A/B switch over one index: the native hand-written nibble-LUT ADC,
and the ndarray::simd::matmul_i8_to_i32 int8 GEMM (AMX-ready via ndarray's
runtime dispatch). Path-deps both the turbovec and ndarray forks; kept out of the
main graph so turbovec's faer/statrs tree never enters the deterministic
lance-graph compile path.
KNOWLEDGE.md is the synergy map vs the bgz-tensor primitives the request named
-- HDR popcount stacking early-exit (stacked.rs vedic cascade), Belichtungsmesser
sigma confidence thresholds (belichtungsmesser.rs), preheating vs palette256
ranking (WeightPalette / prepare()) -- plus the measured finding that the
polyfill GEMM is 11.4x slower than native: AMX is the wrong tool for this index
because a gather is not a matmul. Placement verdict: the index belongs on the
spine (lance-graph), the kernel math belongs in ndarray (which already owns
clam/cam_pq/cascade/amx_matmul). The promising synergy is a Belichtungsmesser
sigma-gate on the LUT scan, not AMX.
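The "a gather is not a matmul" point can be illustrated with a minimal sketch of a nibble-LUT ADC score. The packing layout (two 4-bit codes per byte, low nibble first), the `i32` table type, and the name `adc_score` are assumptions for illustration only; turbovec's actual layout is not shown in this PR.

```rust
/// Score one packed code vector against per-position 16-entry lookup tables.
///
/// ASSUMPTION (hypothetical layout): each byte of `codes` holds two 4-bit
/// codes, low nibble first, and `luts[p]` is the query-specific table for
/// code position `p`.
fn adc_score(codes: &[u8], luts: &[[i32; 16]]) -> i32 {
    assert_eq!(luts.len(), codes.len() * 2, "one table per 4-bit code");
    let mut acc = 0i32;
    for (i, &byte) in codes.iter().enumerate() {
        let lo = (byte & 0x0F) as usize; // code at position 2*i
        let hi = (byte >> 4) as usize; // code at position 2*i + 1
        // Each step is a data-dependent table lookup (a gather),
        // not a multiply-accumulate over dense operands.
        acc += luts[2 * i][lo];
        acc += luts[2 * i + 1][hi];
    }
    acc
}
```

Every inner step indexes a table with a value read from the data, so the access pattern is a gather. That is consistent with the measured result above, where the GEMM polyfill loses to the native LUT kernel.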
Board hygiene (same commit): EPIPHANIES E-TURBOVEC-AMX-WRONG-TOOL-1, AGENT_LOG
run entry, LATEST_STATE shipped entry, root Cargo.toml exclude.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
… scalar on v4)
`detect_simd()` returned `SimdLevel::Avx512` but `batch_palette_distance` had only an `Avx2` match arm, so AVX-512 hosts fell through to `scalar_batch`. Added `avx512_batch`, a 16-wide `_mm512_i32gather_epi32::<2>` mirror of `avx2_batch`, low-u16 masked, bit-identical to `scalar_batch`. Kept bgz17 0-dependency (option b; routing through `ndarray::simd` is the Phase-3 'move bgz17 into the workspace' change, not a bugfix). 126 tests pass (incl. `test_batch_matches_scalar` exercising the AVX-512 arm here).
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
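The fall-through described in this commit can be sketched as a match with a wildcard arm. Only `SimdLevel::Avx512` and the `Avx2` arm come from the commit message; the `Scalar` variant, the `Path` enum, and both function names are hypothetical stand-ins, not the bgz17 code.

```rust
/// Detected CPU capability. `Scalar` is an assumed variant name.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SimdLevel {
    Scalar,
    Avx2,
    Avx512,
}

/// Which kernel the dispatcher ends up running (hypothetical, for illustration).
#[derive(Debug, PartialEq, Eq)]
enum Path {
    Scalar,
    Avx2,
    Avx512,
}

/// Shape of the bug: the wildcard arm swallows `Avx512`.
fn select_before(level: SimdLevel) -> Path {
    match level {
        SimdLevel::Avx2 => Path::Avx2,
        _ => Path::Scalar, // AVX-512 hosts silently land here
    }
}

/// Shape of the fix: an explicit arm per level, no wildcard.
fn select_after(level: SimdLevel) -> Path {
    match level {
        SimdLevel::Avx512 => Path::Avx512,
        SimdLevel::Avx2 => Path::Avx2,
        SimdLevel::Scalar => Path::Scalar,
    }
}
```

Dropping the wildcard is a design choice rather than something the commit states: with an exhaustive match, adding a new `SimdLevel` variant becomes a compile error instead of a silent scalar fallback.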
…wise
Under `ndarray-hpc`, `dispatch_hamming`/`dispatch_popcount` and the typed
`hamming_distance_dispatch` now call `ndarray::hpc::bitwise::{hamming_distance_raw,
popcount_raw}` (the canonical VPOPCNTDQ → AVX-512BW → AVX2 → scalar dispatch),
per the "all SIMD from ndarray" doctrine. The hand-rolled in-crate intrinsics
survive only as the `#[cfg(not(feature = "ndarray-hpc"))]` fallback for minimal /
non-x86 builds (CI, wasm, embedded). Mirrors the episodic.rs pattern.
Validated: `cargo test -p lance-graph --lib blasgraph` → 194 passed, 0 failed
(protoc installed to unblock the lance-encoding build script).
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0eaaa86c52
    // u16 base pointer, lane j reads the i32 at byte offset candidates[..]*2,
    // i.e. the target u16 (low half) plus the next u16 (high half). Identical
    // trick to avx2_batch, widened to 16 lanes.
    let gathered = _mm512_i32gather_epi32::<2>(indices, row_ptr as *const i32);
Avoid reading past the distance matrix row
On AVX-512 hosts this newly enabled path runs for candidate batches of at least 16. Because each lane performs a 32-bit gather from a u16 row and then masks off the high half, a lookup for the final entry of the final row (for example query == k - 1 and candidate == k - 1) reads two bytes past dm_data; the previous scalar fallback did not. This can fault at a page boundary or invoke UB despite only using the low 16 bits, so the boundary entry needs scalar handling/padding or a true 16-bit-safe load path.
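One way to picture the boundary condition is the safe-Rust model below. It emulates the 32-bit read with two `u16` loads and routes the last element through a plain 16-bit load. The function names and the flat `dm` slice are hypothetical; this is a sketch of the "scalar handling" option the review mentions, not the bgz17 fix.

```rust
/// True if a 4-byte load starting at u16 element `idx` stays inside a
/// buffer of `len` u16 elements (it touches elements `idx` and `idx + 1`).
fn wide_load_in_bounds(idx: usize, len: usize) -> bool {
    idx + 1 < len
}

/// Look up one distance, modelling the gather's "read 32 bits, keep the
/// low 16" trick only where the wide read is in bounds.
fn lookup(dm: &[u16], idx: usize) -> u16 {
    if wide_load_in_bounds(idx, dm.len()) {
        // Little-endian view of two adjacent u16s as one u32.
        let wide = u32::from(dm[idx]) | (u32::from(dm[idx + 1]) << 16);
        (wide & 0xFFFF) as u16
    } else {
        // Final element: a wide read would run 2 bytes past the buffer,
        // so fall back to a true 16-bit load.
        dm[idx]
    }
}

/// Distances from `query` to each candidate in a row-major k x k matrix.
fn batch_lookup(dm: &[u16], k: usize, query: usize, candidates: &[usize]) -> Vec<u16> {
    candidates.iter().map(|&c| lookup(dm, query * k + c)).collect()
}
```

In a real SIMD kernel the same predicate could decide which candidates go through the gather and which take a scalar tail. Padding the matrix allocation by one trailing `u16` is the alternative the review also names.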
    [dependencies]
    # turbovec with the `ndarray-simd` polyfill path compiled in (the native
    # kernel is always present); turbovec itself path-deps the ndarray fork.
    turbovec = { path = "../../../turbovec/turbovec", default-features = false, features = ["ndarray-simd"] }
Vendor or fetch the turbovec dependency
This path resolves from crates/lance-graph-turbovec to /workspace/turbovec/turbovec, but the repo does not contain that tree and the documented setup only provisions the sibling /workspace/ndarray fork. In a fresh checkout the documented cargo test --manifest-path crates/lance-graph-turbovec/Cargo.toml cannot load this dependency, making the new crate unusable unless turbovec is vendored/submoduled or referenced via a fetchable git/registry dependency.
…darray-hpc
After routing `hamming_distance_dispatch` through `ndarray::hpc::bitwise`, the in-crate `hamming_distance_scalar` is reached only from the `not(ndarray-hpc)` fallback and `#[cfg(test)]` parity tests, so a non-test build with the default `ndarray-hpc` feature warns it's unused. `#[cfg_attr(feature = "ndarray-hpc", allow(dead_code))]` keeps the lib build warning-free while leaving it live for the fallback and tests.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
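The feature-gated routing and the `cfg_attr` guard can be sketched together as follows. The byte-slice signatures are assumptions: this PR does not show the real signature of `ndarray::hpc::bitwise::hamming_distance_raw` in the fork, so the gated call is illustrative and only compiles when that feature and dependency exist.

```rust
/// Portable scalar fallback: XOR each byte pair and count the set bits.
/// With the feature on, only the fallback path and tests reach this, so
/// the attribute silences the dead-code warning in that configuration.
#[cfg_attr(feature = "ndarray-hpc", allow(dead_code))]
fn hamming_distance_scalar(a: &[u8], b: &[u8]) -> u64 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter()
        .zip(b)
        .map(|(x, y)| u64::from((x ^ y).count_ones()))
        .sum()
}

/// Feature on: delegate to the canonical ndarray dispatch.
/// ASSUMPTION: the (&[u8], &[u8]) -> u64 signature is a guess.
#[cfg(feature = "ndarray-hpc")]
fn hamming_distance_dispatch(a: &[u8], b: &[u8]) -> u64 {
    ndarray::hpc::bitwise::hamming_distance_raw(a, b)
}

/// Feature off (minimal / non-x86 builds): use the in-crate scalar path.
#[cfg(not(feature = "ndarray-hpc"))]
fn hamming_distance_dispatch(a: &[u8], b: &[u8]) -> u64 {
    hamming_distance_scalar(a, b)
}
```

Built without the feature, `hamming_distance_dispatch(&[0xFF, 0x00], &[0x0F, 0x01])` takes the scalar path and returns 5 (four differing bits in the first byte, one in the second).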
Summary
Routes lance-graph SIMD through the canonical `ndarray::simd` / `ndarray::hpc` surface ("all SIMD from ndarray" doctrine) and bridges the TurboQuant ANN index onto the spine.
Changes
- refactor(blasgraph): `dispatch_hamming` / `dispatch_popcount` and the typed `hamming_distance_dispatch` now route through `ndarray::hpc::bitwise::{hamming_distance_raw, popcount_raw}` (the canonical VPOPCNTDQ → AVX-512BW → AVX2 → scalar dispatch) under the `ndarray-hpc` feature; the hand-rolled intrinsics survive only as the `#[cfg(not(feature = "ndarray-hpc"))]` fallback for minimal / non-x86 builds. Mirrors the `episodic.rs` pattern. Validated: `cargo test -p lance-graph --lib blasgraph` → 194 passed, 0 failed.
- fix(bgz17): real AVX-512 arm for `batch_palette_distance` (`_mm512_i32gather_epi32`); it was silently falling to scalar on AVX-512/v4 hosts.
- feat(lance-graph-turbovec): TurboQuant ANN index (normalize → random rotation → TQ+ calibration → Lloyd-Max 4-bit quantize → nibble-LUT ADC scan) bridged onto the spine, consuming `ndarray::simd`.
Notes
The `lance-encoding` build script needs `protoc` (`apt-get install -y protobuf-compiler`); with it the workspace builds and the blasgraph suite is green.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
Generated by Claude Code