ndarray-SIMD consumer integration: turbovec ANN bridge + bgz17 AVX-512 + blasgraph Hamming dedup #493

Merged
AdaWorldAPI merged 5 commits into main from claude/wonderful-hawking-lodtql
Jun 14, 2026

@AdaWorldAPI

Summary

Routes lance-graph SIMD through the canonical ndarray::simd / ndarray::hpc
surface ("all SIMD from ndarray" doctrine) and bridges the TurboQuant ANN index
onto the spine.

Changes

  • refactor(blasgraph): dispatch_hamming / dispatch_popcount and the
    typed hamming_distance_dispatch now route through
    ndarray::hpc::bitwise::{hamming_distance_raw, popcount_raw} (the canonical
    VPOPCNTDQ → AVX-512BW → AVX2 → scalar dispatch) under the ndarray-hpc feature;
    the hand-rolled intrinsics survive only as the #[cfg(not(feature = "ndarray-hpc"))]
    fallback for minimal / non-x86 builds. Mirrors the episodic.rs pattern.
    Validated: cargo test -p lance-graph --lib blasgraph → 194 passed, 0 failed.

  • fix(bgz17) — real AVX-512 arm for batch_palette_distance
    (_mm512_i32gather_epi32); it was silently falling to scalar on AVX-512/v4 hosts.

  • feat(lance-graph-turbovec) — TurboQuant ANN index (normalize → random
    rotation → TQ+ calibration → Lloyd-Max 4-bit quantize → nibble-LUT ADC scan)
    bridged onto the spine, consuming ndarray::simd.
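
The last stage of that pipeline, the nibble-LUT ADC scan, can be sketched as follows. This is a minimal illustration under stated assumptions, not the turbovec kernel: the function names, the one-table-per-dimension layout, the nibble order, and the codebook values are all hypothetical, since the PR does not show the index's actual memory layout.

```rust
/// Number of centroids addressable by one 4-bit code.
const LEVELS: usize = 16;

/// Build one 16-entry table per dimension: lut[d][c] = query[d] * centroids[c].
/// `centroids` stands in for a trained Lloyd-Max codebook (hypothetical values).
fn build_lut(query: &[f32], centroids: &[f32; LEVELS]) -> Vec<[f32; LEVELS]> {
    query
        .iter()
        .map(|&q| {
            let mut row = [0.0f32; LEVELS];
            for (slot, &c) in row.iter_mut().zip(centroids.iter()) {
                *slot = q * c;
            }
            row
        })
        .collect()
}

/// Pack 4-bit codes two per byte: even dimension in the low nibble, odd
/// dimension in the high nibble (an assumed layout).
fn pack_nibbles(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0F;
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0F;
            lo | (hi << 4)
        })
        .collect()
}

/// Approximate inner product of the query with one packed vector: one table
/// lookup per dimension, no multiplications at scan time.
fn adc_score(lut: &[[f32; LEVELS]], packed: &[u8]) -> f32 {
    let mut score = 0.0f32;
    for (d, row) in lut.iter().enumerate() {
        let byte = packed[d / 2];
        let code = if d % 2 == 0 { byte & 0x0F } else { byte >> 4 };
        score += row[code as usize];
    }
    score
}
```

The multiplications happen once per query when the tables are built; the per-vector scan is then pure lookups and additions, which is the property the native kernel exploits.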

Notes

The lance-encoding build script needs protoc (apt-get install -y protobuf-compiler); with it the workspace builds and the blasgraph suite is green.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo


Generated by Claude Code

claude added 3 commits June 13, 2026 19:19
New excluded standalone crate wrapping the AdaWorldAPI turbovec fork (Google
TurboQuant, arXiv 2504.19874). TurboVec exposes a Kernel::{NativeLut,
PolyfillGemm} A/B switch over one index: the native hand-written nibble-LUT ADC,
and the ndarray::simd::matmul_i8_to_i32 int8 GEMM (AMX-ready via ndarray's
runtime dispatch). path-deps both the turbovec + ndarray forks; kept out of the
main graph so turbovec's faer/statrs tree never enters the deterministic
lance-graph compile path.

KNOWLEDGE.md is the synergy map vs the bgz-tensor primitives the request named
-- HDR popcount stacking early-exit (stacked.rs vedic cascade), Belichtungsmesser
sigma confidence thresholds (belichtungsmesser.rs), preheating vs palette256
ranking (WeightPalette / prepare()) -- plus the measured finding that the
polyfill GEMM is 11.4x slower than native: AMX is the wrong tool for this index
because a gather is not a matmul. Placement verdict: the index belongs on the
spine (lance-graph), the kernel math belongs in ndarray (which already owns
clam/cam_pq/cascade/amx_matmul). The promising synergy is a Belichtungsmesser
sigma-gate on the LUT scan, not AMX.

Board hygiene (same commit): EPIPHANIES E-TURBOVEC-AMX-WRONG-TOOL-1, AGENT_LOG
run entry, LATEST_STATE shipped entry, root Cargo.toml exclude.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
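
The "a gather is not a matmul" remark can be made concrete with a small sketch. The commit message does not say how the polyfill phrases the LUT scan as a GEMM; the one-hot expansion here is an assumed illustration of the general mismatch, and `lookup_sum` / `one_hot_dot` are hypothetical names, not turbovec functions.

```rust
const LEVELS: usize = 16;

/// Direct gather: one table load per 4-bit code.
fn lookup_sum(lut: &[[i32; LEVELS]], codes: &[u8]) -> i32 {
    lut.iter().zip(codes).map(|(row, &c)| row[c as usize]).sum()
}

/// The same sum phrased as a dot product: each code becomes a 16-wide one-hot
/// vector, so every lookup costs 16 multiply-adds, 15 of them by zero.
fn one_hot_dot(lut: &[[i32; LEVELS]], codes: &[u8]) -> i32 {
    let mut acc = 0i32;
    for (row, &c) in lut.iter().zip(codes) {
        for (level, &entry) in row.iter().enumerate() {
            let one_hot = (level == c as usize) as i32;
            acc += entry * one_hot;
        }
    }
    acc
}
```

Both functions return the same value, but the matmul form does 16 times the arithmetic to select a single table entry. That is one plausible reason a matrix unit would lose to a native gather on this workload.
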
… scalar on v4)

detect_simd() returned SimdLevel::Avx512 but batch_palette_distance had only an
Avx2 match arm, so AVX-512 hosts fell through to scalar_batch. Added avx512_batch
— a 16-wide _mm512_i32gather_epi32::<2> mirror of avx2_batch, low-u16 masked,
bit-identical to scalar_batch. Kept bgz17 0-dependency (option b; routing through
ndarray::simd is the Phase-3 'move bgz17 into the workspace' change, not a bugfix).
126 tests pass (incl test_batch_matches_scalar exercising the AVX-512 arm here).

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
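
The failure mode described in this commit can be sketched as a match with a missing arm. This is an assumed reconstruction: the `Scalar` variant, the `Path` enum, the wildcard arm, and both `route_*` functions are hypothetical stand-ins, and the real arms call `avx512_batch`, `avx2_batch`, and `scalar_batch` rather than returning a tag.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum SimdLevel {
    Scalar,
    Avx2,
    Avx512,
}

/// Which kernel a dispatch decision would run (illustrative tag only).
#[derive(Debug, PartialEq)]
enum Path {
    Scalar,
    Avx2,
    Avx512,
}

/// Before the fix: only Avx2 has its own arm, so a catch-all arm silently
/// sends Avx512 hosts to the scalar kernel.
fn route_before(level: SimdLevel) -> Path {
    match level {
        SimdLevel::Avx2 => Path::Avx2,
        _ => Path::Scalar,
    }
}

/// After the fix: every level is named, so adding a variant later becomes a
/// compile error instead of a silent slowdown.
fn route_after(level: SimdLevel) -> Path {
    match level {
        SimdLevel::Avx512 => Path::Avx512,
        SimdLevel::Avx2 => Path::Avx2,
        SimdLevel::Scalar => Path::Scalar,
    }
}
```

An exhaustive match without a wildcard is a design choice rather than something the commit states; it is shown here because it would have surfaced the original gap at compile time.
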
…wise

Under `ndarray-hpc`, `dispatch_hamming`/`dispatch_popcount` and the typed
`hamming_distance_dispatch` now call `ndarray::hpc::bitwise::{hamming_distance_raw,
popcount_raw}` (the canonical VPOPCNTDQ → AVX-512BW → AVX2 → scalar dispatch),
per the "all SIMD from ndarray" doctrine. The hand-rolled in-crate intrinsics
survive only as the `#[cfg(not(feature = "ndarray-hpc"))]` fallback for minimal /
non-x86 builds (CI, wasm, embedded). Mirrors the episodic.rs pattern.

Validated: `cargo test -p lance-graph --lib blasgraph` → 194 passed, 0 failed
(protoc installed to unblock the lance-encoding build script).

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
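
A minimal sketch of the feature-gated routing this commit describes. The `&[u8]` arguments and `u64` return type are assumptions; the commit names `ndarray::hpc::bitwise::hamming_distance_raw` but does not show its signature, and the fallback body here is a plain scalar loop rather than the crate's hand-rolled intrinsics.

```rust
/// With the `ndarray-hpc` feature on, defer to the shared dispatcher.
/// The signature of `hamming_distance_raw` is assumed for illustration.
#[cfg(feature = "ndarray-hpc")]
fn dispatch_hamming(a: &[u8], b: &[u8]) -> u64 {
    ndarray::hpc::bitwise::hamming_distance_raw(a, b)
}

/// Without the feature (minimal or non-x86 builds), use an in-crate fallback:
/// XOR the inputs byte by byte and count the differing bits.
#[cfg(not(feature = "ndarray-hpc"))]
fn dispatch_hamming(a: &[u8], b: &[u8]) -> u64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x ^ y).count_ones() as u64)
        .sum()
}
```

Exactly one of the two definitions is compiled in any given build, so callers see a single `dispatch_hamming` regardless of which backend is active.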
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Warning

Review limit reached

@AdaWorldAPI, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 38 minutes and 23 seconds. Learn how PR review limits work.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 275592c3-c377-432c-892b-932f35fb9827

📥 Commits

Reviewing files that changed from the base of the PR and between 2f9c3ca and 31d7757.

⛔ Files ignored due to path filters (1)
  • crates/lance-graph-turbovec/Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (10)
  • .claude/board/AGENT_LOG.md
  • .claude/board/EPIPHANIES.md
  • .claude/board/LATEST_STATE.md
  • Cargo.toml
  • crates/bgz17/src/simd.rs
  • crates/lance-graph-turbovec/Cargo.toml
  • crates/lance-graph-turbovec/KNOWLEDGE.md
  • crates/lance-graph-turbovec/src/lib.rs
  • crates/lance-graph/src/graph/blasgraph/ndarray_bridge.rs
  • crates/lance-graph/src/graph/blasgraph/types.rs

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0eaaa86c52

Comment thread crates/bgz17/src/simd.rs
// u16 base pointer, lane j reads the i32 at byte offset candidates[..]*2,
// i.e. the target u16 (low half) plus the next u16 (high half). Identical
// trick to avx2_batch, widened to 16 lanes.
let gathered = _mm512_i32gather_epi32::<2>(indices, row_ptr as *const i32);

P1: Avoid reading past the distance matrix row

On AVX-512 hosts this newly enabled path runs for candidate batches of at least 16. Because each lane performs a 32-bit gather from a u16 row and then masks off the high half, a lookup for the final entry of the final row (for example query == k - 1 and candidate == k - 1) reads two bytes past dm_data; the previous scalar fallback did not. This can fault at a page boundary or invoke UB despite only using the low 16 bits, so the boundary entry needs scalar handling/padding or a true 16-bit-safe load path.
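
The arithmetic behind this finding can be checked with a short sketch. It assumes `dm_data` is a flat k × k buffer of `u16`, which the review implies but does not show; the helper names are hypothetical, and `padded` illustrates only the padding option among the remedies the review lists.

```rust
use std::ops::Range;

/// Byte range touched by one 32-bit gather lane that indexes a u16 buffer
/// with scale 2: it starts at index * 2 and always spans 4 bytes.
fn gather_lane_bytes(index: usize) -> Range<usize> {
    let start = index * 2;
    start..start + 4
}

/// True if the 4-byte read for `index` stays inside a buffer of `len_u16`
/// u16 elements.
fn lane_in_bounds(index: usize, len_u16: usize) -> bool {
    gather_lane_bytes(index).end <= len_u16 * 2
}

/// Padding remedy: copy the matrix with one trailing u16 so the read for the
/// final entry lands on owned memory.
fn padded(dm: &[u16]) -> Vec<u16> {
    let mut v = Vec::with_capacity(dm.len() + 1);
    v.extend_from_slice(dm);
    v.push(0);
    v
}
```

For k = 4 the buffer holds 16 entries (32 bytes). The lane for index 15 touches bytes 30..34, two bytes past the end, while every other index stays in bounds; with one padding element the same read ends exactly at the buffer boundary.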


[dependencies]
# turbovec with the `ndarray-simd` polyfill path compiled in (the native
# kernel is always present); turbovec itself path-deps the ndarray fork.
turbovec = { path = "../../../turbovec/turbovec", default-features = false, features = ["ndarray-simd"] }

P2: Vendor or fetch the turbovec dependency

This path resolves from crates/lance-graph-turbovec to /workspace/turbovec/turbovec, but the repo does not contain that tree and the documented setup only provisions the sibling /workspace/ndarray fork. In a fresh checkout the documented cargo test --manifest-path crates/lance-graph-turbovec/Cargo.toml cannot load this dependency, making the new crate unusable unless turbovec is vendored/submoduled or referenced via a fetchable git/registry dependency.

claude added 2 commits June 14, 2026 07:32
…darray-hpc

After routing `hamming_distance_dispatch` through `ndarray::hpc::bitwise`, the
in-crate `hamming_distance_scalar` is reached only from the `not(ndarray-hpc)`
fallback and `#[cfg(test)]` parity tests — so a non-test build with the default
`ndarray-hpc` feature warns it's unused. `#[cfg_attr(feature = "ndarray-hpc",
allow(dead_code))]` keeps the lib build warning-free while leaving it live for
the fallback and tests.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
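
A minimal sketch of the conditional lint attribute this commit describes. The function name comes from the commit message; its signature and body are assumptions made for illustration.

```rust
// With the `ndarray-hpc` feature on, no non-test code calls this function,
// so the attribute expands to `#[allow(dead_code)]`. With the feature off it
// expands to nothing and the dead-code lint stays fully active.
#[cfg_attr(feature = "ndarray-hpc", allow(dead_code))]
fn hamming_distance_scalar(a: &[u8], b: &[u8]) -> u64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x ^ y).count_ones() as u64)
        .sum()
}
```

Scoping the `allow` to the feature keeps the lint useful: if the fallback build ever stopped calling the function, the warning would still fire there.
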
@AdaWorldAPI AdaWorldAPI merged commit 8624cf3 into main Jun 14, 2026
6 checks passed