Skip to content

Commit f714de8

Browse files
committed
docs(splat-native): ndarray-side SIMD substrate plan (D-SPLAT-2)
Companion to lance-graph's canonical splat-native-ultrasound-v1 plan (see `lance-graph/.claude/plans/splat-native-ultrasound-v1.md`). This file is the ndarray-side perspective: the one deliverable owed by ndarray — `src/simd_splat.rs` — and its W1c-style five primitives: - `batched_cholesky_3x3` — 3×3 packed Σ → Cholesky factor L - `batched_mahalanobis` — squared Mahalanobis over M queries × N Gaussians - `batched_opacity_blend` — front-to-back alpha composite over sorted Gaussians - `batched_sh_eval_l3` — degree-3 spherical harmonics at a single view direction - `batched_se3_transform` — rigid SE(3) transform on centroids + covariance All three backends mandatory (AVX-512 / NEON / scalar) per the existing consumer-contract knowledge doc + W1a primitive-addition pattern. Parity tests gating; Jirak-grounded significance thresholds for any derived bounds (I-NOISE-FLOOR-JIRAK). Sprint window: P1 (sprint 1-2). FOUNDATION work — must land before any downstream consumer (lance-graph carriers, splat-fit engine, registration math, AR renderer). No code in this PR. Spec-only.
1 parent 0129b5c commit f714de8

1 file changed

Lines changed: 382 additions & 0 deletions

File tree

Lines changed: 382 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,382 @@
1+
# splat-native-ultrasound — SIMD substrate (ndarray perspective) — v1
2+
3+
> **Status:** PROPOSAL / integration plan (ndarray-side perspective). Design-spec only; **no code in this plan**.
4+
> **Authored:** 2026-06-05 (session `claude/lance-graph-ontology-review-Pyry3`).
5+
> **Canonical plan:** `lance-graph/.claude/plans/splat-native-ultrasound-v1.md` — this doc is the ndarray-side companion.
6+
>
7+
> **Anchored to:** `I-NOISE-FLOOR-JIRAK` (Jirak weak-dependence Berry-Esseen for any significance threshold); the ndarray vertical-SIMD consumer contract at `lance-graph/.claude/knowledge/ndarray-vertical-simd-alien-magic.md` (W1a-style primitive-addition pattern, all three backends mandatory).
8+
9+
---
10+
11+
## 0. What this is from ndarray's perspective
12+
13+
ndarray is the **SIMD substrate** for the splat-native ultrasound arc. Every floating-point op in the splat-fit → registration → render pipeline runs on the same `src/simd_*` family ndarray already exposes. This doc spells out the **one new module** owed by ndarray — `src/simd_splat.rs` — and its W1c-style five primitives.
14+
15+
This is one deliverable (**D-SPLAT-2**) from the canonical plan. The carriers (`Gaussian3D`, `SplatBatch<N>`) live in `lance-graph-contract`, not here; ndarray sees them as opaque `&[u8]` / typed slice views. The same separation already governs `Fingerprint<256>` and `CamCodecContract` — math here, carriers there, engines downstream.
16+
17+
**Sprint window:** P1 (sprint 1-2) — the **foundation phase**. Must land first; the entire downstream arc depends on these primitives.
18+
19+
---
20+
21+
## 1. What ndarray owns
22+
23+
### D-SPLAT-2 — `src/simd_splat.rs` (the W1c primitive bundle)
24+
25+
A new module under the existing `src/simd_*` family. Five batch primitives, all dispatched through the runtime `simd_caps()` singleton, all three backends mandatory (AVX-512 / NEON / scalar) per the consumer-contract knowledge doc.
26+
27+
**Module skeleton:**
28+
29+
```rust
30+
//! src/simd_splat.rs — anisotropic Gaussian batch ops for splat-native ultrasound
31+
//!
32+
//! Five primitives, all SIMD-dispatched via `simd_caps()`:
33+
//! - batched_cholesky_3x3
34+
//! - batched_mahalanobis
35+
//! - batched_opacity_blend
36+
//! - batched_sh_eval_l3
37+
//! - batched_se3_transform
38+
//!
39+
//! Operates on packed `Gaussian3D` SoA layouts owned by lance-graph-contract.
40+
//! All primitives are zero-allocation in the hot loop; SoA column slices in/out.
41+
42+
use crate::simd::SimdBackend;
43+
use crate::simd_caps;
44+
45+
pub fn batched_cholesky_3x3(
46+
sigma_packed: &[f32], // N × 6 packed lower-tri Σ (xx, yy, zz, xy, xz, yz)
47+
out_l_packed: &mut [f32], // N × 6 Cholesky factor L (same packing)
48+
);
49+
50+
pub fn batched_mahalanobis(
51+
query_xyz: &[f32], // M × 3 query points
52+
mu_xyz: &[f32], // N × 3 Gaussian centroids
53+
sigma_packed: &[f32], // N × 6 packed Σ
54+
out_dist_sq: &mut [f32], // M × N output (squared Mahalanobis)
55+
);
56+
57+
pub fn batched_opacity_blend(
58+
sorted_amplitudes: &[f32], // N (front-to-back along view ray)
59+
opacity_lut: &[u8; 256], // amplitude → opacity LUT
60+
out_alpha: &mut [u8], // composited alpha per pixel
61+
);
62+
63+
pub fn batched_sh_eval_l3(
64+
view_dir: [f32; 3], // unit view direction
65+
sh_coeffs: &[f32], // N × 16 SH coefficients per Gaussian
66+
out_luminance: &mut [f32], // N evaluated luminances
67+
);
68+
69+
pub fn batched_se3_transform(
70+
pose_se3: [f32; 12], // 3 translation + 9 rotation matrix
71+
mu_in: &[f32], // N × 3 input centroids
72+
mu_out: &mut [f32], // N × 3 transformed centroids
73+
sigma_in_packed: &[f32], // N × 6 input Σ
74+
sigma_out_packed: &mut [f32], // N × 6 transformed (R Σ Rᵀ)
75+
);
76+
```
77+
78+
### Three-backend implementation matrix (mandatory)
79+
80+
Per the consumer-contract knowledge doc, every primitive ships in three implementations:
81+
82+
| Primitive | AVX-512 | NEON | Scalar |
83+
|---|---|---|---|
84+
| `batched_cholesky_3x3` | 16-lane batch (process 16 Σ matrices in parallel) | 4-lane batch (NEON 128-bit) | per-Σ scalar |
85+
| `batched_mahalanobis` | 16-lane query × 16-lane Gaussian unroll | 4-lane × 4-lane | nested scalar loop |
86+
| `batched_opacity_blend` | gather LUT + alpha composite, 16-lane | 4-lane | scalar |
87+
| `batched_sh_eval_l3` | 16-lane SH-basis evaluation, view dir broadcast | 4-lane | scalar |
88+
| `batched_se3_transform` | 16-lane matmul (μ vector × R) + 16-lane Σ similarity | 4-lane | scalar |
89+
90+
Parity tests are MANDATORY (`I-NOISE-FLOOR-JIRAK`-aware significance gates if any thresholds are claimed).
91+
92+
### Saturating / overflow semantics
93+
94+
- **Cholesky**: degenerate Σ (det ≤ 0) yields NaN in L; downstream uses NaN as a sentinel for "this Gaussian is degenerate, skip it" rather than panicking. Documented as the W1c "VPABSB-correction-style" semantic carve-out in the function docstring.
95+
- **Mahalanobis**: returns `f32::INFINITY` for distances against degenerate Σ rather than NaN (lets the registration math sort distance arrays without NaN-propagation pitfalls).
96+
- **Opacity blend**: front-to-back saturation at 1.0 (alpha ceiling); accumulator uses `u16` internally then truncates to `u8` output.
97+
- **SH evaluation**: f32 luminance output, no clamping (caller decides whether to clamp for visual rendering vs computational pipeline).
98+
- **SE(3) transform**: assumes input rotation matrix is orthonormal; documentation includes a `debug_assert!` validation in debug builds.
99+
100+
---
101+
102+
## 2. What ndarray consumes
103+
104+
Nothing splat-specific — D-SPLAT-2 is foundation work. The primitives consume:
105+
106+
- **Existing `simd_caps()` dispatch** — already shipped, already battle-tested across the 5 SIMD modules.
107+
- **Existing AVX-512 / NEON test infrastructure** — same parity-test patterns as `simd_avx2`, `simd_avx512`, `simd_neon`.
108+
- **Standard `core::arch::x86_64` / `core::arch::aarch64` intrinsics** — no new third-party deps.
109+
110+
No upstream blockers.
111+
112+
---
113+
114+
## 3. What ndarray hands off
115+
116+
### To `lance-graph-contract` (D-SPLAT-1/3 in the canonical plan)
117+
118+
`Gaussian3D` and `SplatBatch<N>` carrier types — they consume ndarray's primitives via the typed-slice API. The carrier crate has zero direct dep on ndarray (consistent with the zero-dep policy for contract); it exposes `&[f32]` column slices for the consumer to feed into `simd_splat`.
119+
120+
### To `crates/splat-fit` (D-SPLAT-6)
121+
122+
The new standalone `splat-fit` crate consumes `ndarray::simd_splat` via the `ndarray-hpc` feature flag (mirrors the `deepnsm` / `bgz17` pattern: optional ndarray dep, zero-dep core).
123+
124+
### To `lance-graph::splat::registration` (D-SPLAT-5)
125+
126+
The ICP / SE(3) optimizer consumes `batched_cholesky_3x3`, `batched_mahalanobis`, `batched_se3_transform` directly. The Σ-sandwich Mahalanobis is two `batched_mahalanobis` calls (against atlas Σ, against live Σ) with elementwise sum — no new primitive needed.
127+
128+
### To `lance-graph::splat-render` (D-SPLAT-12)
129+
130+
The renderer consumes `batched_opacity_blend` + `batched_sh_eval_l3` per-frame. Sort + paint = two ndarray calls.
131+
132+
---
133+
134+
## 4. Per-primitive specifications
135+
136+
### 4.1 `batched_cholesky_3x3`
137+
138+
**Signature** (full Rust):
139+
140+
```rust
141+
/// Compute Cholesky factor L such that L Lᵀ = Σ, for N packed 3×3 symmetric
142+
/// positive-definite matrices Σ.
143+
///
144+
/// # Layout
145+
/// - `sigma_packed[i*6..i*6+6]` = [Σ₀₀, Σ₁₁, Σ₂₂, Σ₀₁, Σ₀₂, Σ₁₂]
146+
/// - `out_l_packed[i*6..i*6+6]` = [L₀₀, L₁₁, L₂₂, L₁₀, L₂₀, L₂₁]
147+
/// (lower-triangular, diag + sub-diag)
148+
///
149+
/// # Semantics
150+
/// - Degenerate Σ (non-PD, det ≤ 0) produces NaN in the output. Downstream
151+
/// uses NaN as a sentinel for "skip this Gaussian"; do NOT propagate.
152+
/// - SIMD-batched 16-at-a-time on AVX-512, 4-at-a-time on NEON, scalar fallback.
153+
///
154+
/// # Panics
155+
/// Never.
156+
pub fn batched_cholesky_3x3(sigma_packed: &[f32], out_l_packed: &mut [f32]);
157+
```
158+
159+
**Algorithm** (3×3 explicit, no loops):
160+
161+
```text
162+
L₀₀ = √Σ₀₀
163+
L₁₀ = Σ₀₁ / L₀₀
164+
L₂₀ = Σ₀₂ / L₀₀
165+
L₁₁ = √(Σ₁₁ - L₁₀²)
166+
L₂₁ = (Σ₁₂ - L₂₀·L₁₀) / L₁₁
167+
L₂₂ = √(Σ₂₂ - L₂₀² - L₂₁²)
168+
```
169+
170+
**Tests:**
171+
- Reference comparison against `nalgebra::Matrix3::cholesky()` on random PD Σ (N = 100k).
172+
- Degenerate-Σ NaN sentinel (Σ = zero matrix; Σ with negative eigenvalue).
173+
- SIMD parity AVX-512 ≡ NEON ≡ scalar (f32 ULP tolerance ≤ 4).
174+
- Bench: ≥ 5× scalar throughput on AVX-512 at N = 1M.
175+
176+
### 4.2 `batched_mahalanobis`
177+
178+
**Signature:**
179+
180+
```rust
181+
/// Squared Mahalanobis distance: (x - μ)ᵀ Σ⁻¹ (x - μ).
182+
///
183+
/// Computes M queries against N Gaussians. Output is M × N row-major.
184+
///
185+
/// # Inputs
186+
/// - `query_xyz[m*3..m*3+3]` = m-th query point
187+
/// - `mu_xyz[n*3..n*3+3]` = n-th Gaussian centroid
188+
/// - `sigma_packed[n*6..n*6+6]` = n-th Σ (same packing as Cholesky)
189+
///
190+
/// # Output
191+
/// - `out_dist_sq[m*N + n]` = ‖x_m - μ_n‖²_{Σ_n}
192+
///
193+
/// # Semantics
194+
/// - Degenerate Σ (Cholesky NaN) yields f32::INFINITY (sortable; never NaN).
195+
/// - SIMD-batched 16×16 on AVX-512, 4×4 on NEON.
196+
pub fn batched_mahalanobis(
197+
query_xyz: &[f32], mu_xyz: &[f32], sigma_packed: &[f32], out_dist_sq: &mut [f32],
198+
);
199+
```
200+
201+
**Implementation note:** internally calls `batched_cholesky_3x3` once on `sigma_packed`, caches L (heap-free via stack or caller-provided scratch), then triangular-solve + squared norm per (m, n) pair.
202+
203+
**Tests:**
204+
- Reference comparison against scipy `scipy.spatial.distance.mahalanobis` on random points + Σ.
205+
- Triangle inequality on a metric subset (commutativity / non-negativity / zero-on-equal).
206+
- SIMD parity.
207+
- Bench: ≥ 1.4× scalar at M=1k, N=1M (the splat-to-splat registration regime).
208+
209+
### 4.3 `batched_opacity_blend`
210+
211+
**Signature:**
212+
213+
```rust
214+
/// Front-to-back alpha compositing over a sorted Gaussian sequence.
215+
///
216+
/// # Inputs
217+
/// - `sorted_amplitudes[i]` = i-th Gaussian's peak echo amplitude, sorted
218+
/// front-to-back along the view ray.
219+
/// - `opacity_lut[i]` = amplitude-bucket → opacity LUT (256 buckets;
220+
/// amplitude quantized into [0..256) via the per-frame max-amplitude).
221+
///
222+
/// # Output
223+
/// - `out_alpha[ray]` = composited alpha per ray (8-bit final).
224+
///
225+
/// # Semantics
226+
/// - Composition: α_new = α_old + (1 - α_old) · α_i.
227+
/// - Internal accumulator: u16 with saturation; truncate to u8 at end.
228+
pub fn batched_opacity_blend(
229+
sorted_amplitudes: &[f32], opacity_lut: &[u8; 256], out_alpha: &mut [u8],
230+
);
231+
```
232+
233+
**Tests:**
234+
- Reference comparison against scalar reference for known sequences.
235+
- Saturation at full opacity (sequence of high-amplitude Gaussians → α = 255).
236+
- Empty sequence → α = 0.
237+
- SIMD parity.
238+
239+
### 4.4 `batched_sh_eval_l3`
240+
241+
**Signature:**
242+
243+
```rust
244+
/// Evaluate degree-3 spherical harmonics at a single view direction for N Gaussians.
245+
///
246+
/// # Inputs
247+
/// - `view_dir[3]` — unit view direction in scanner frame.
248+
/// - `sh_coeffs[i*16..i*16+16]` — i-th Gaussian's 16 SH coefficients
249+
/// (degree 0: 1, degree 1: 3, degree 2: 5, degree 3: 7 — totaling 16).
250+
///
251+
/// # Output
252+
/// - `out_luminance[i]` — i-th Gaussian's view-dependent luminance.
253+
///
254+
/// # Semantics
255+
/// - No clamping; caller decides downstream behavior (visual: clamp to [0, 1];
256+
/// computational pipeline: signed luminance is fine).
257+
pub fn batched_sh_eval_l3(view_dir: [f32; 3], sh_coeffs: &[f32], out_luminance: &mut [f32]);
258+
```
259+
260+
**Tests:**
261+
- Reference comparison against analytical SH evaluation (closed-form basis functions at degree ≤ 3).
262+
- Rotation invariance: SH coefficients rotated by R, evaluated at view dir = identity; vs original SH coefficients, evaluated at view dir = R⁻¹.
263+
- SIMD parity.
264+
265+
### 4.5 `batched_se3_transform`
266+
267+
**Signature:**
268+
269+
```rust
270+
/// Apply a rigid SE(3) transform to N Gaussian centroids and covariance matrices.
271+
///
272+
/// # Inputs
273+
/// - `pose_se3[0..3]` = translation t
274+
/// - `pose_se3[3..12]` = rotation R (row-major 3×3)
275+
/// - `mu_in[i*3..i*3+3]` = input centroid
276+
/// - `sigma_in_packed[i*6..i*6+6]` = input Σ
277+
///
278+
/// # Output
279+
/// - `mu_out[i*3..i*3+3]` = R · μ_in + t
280+
/// - `sigma_out_packed[i*6..i*6+6]` = R · Σ_in · Rᵀ
281+
///
282+
/// # Semantics
283+
/// - Assumes R is orthonormal (debug_assert in debug builds).
284+
/// - SIMD-batched 16-at-a-time on AVX-512.
285+
pub fn batched_se3_transform(
286+
pose_se3: [f32; 12], mu_in: &[f32], mu_out: &mut [f32],
287+
sigma_in_packed: &[f32], sigma_out_packed: &mut [f32],
288+
);
289+
```
290+
291+
**Tests:**
292+
- Identity transform (pose = (0, I)) → output equals input.
293+
- Orthonormal rotation preserves Σ trace.
294+
- Composition: SE(3)(A) followed by SE(3)(B) ≡ SE(3)(B·A) within ULP tolerance.
295+
- SIMD parity.
296+
297+
---
298+
299+
## 5. Cross-repo dependencies (what ndarray waits on / blocks)
300+
301+
### Upstream (what ndarray waits on)
302+
303+
- **Nothing.** D-SPLAT-2 is foundation work; runs before everything else.
304+
305+
### Downstream (what blocks on ndarray D-SPLAT-2)
306+
307+
- **`lance-graph-contract`** D-SPLAT-1/3 (carriers) — can develop in parallel; the carrier API is independent of the math implementation. Soft edge.
308+
- **`crates/splat-fit`** D-SPLAT-6 — **hard blocker**. PSF estimation calls `batched_cholesky_3x3` per peak.
309+
- **`lance-graph::splat::registration`** D-SPLAT-5 — **hard blocker**. ICP runs Mahalanobis + SE(3) transforms per iteration.
310+
- **`lance-graph::splat-render`** D-SPLAT-12 — **hard blocker**. Opacity blend + SH eval per frame.
311+
- **`splat-actors`** D-SPLAT-7 — **soft blocker**. Pose accumulator only needs `batched_se3_transform`; could partially run on scalar fallback while waiting for SIMD.
312+
- **`MedCare-rs`** D-SPLAT-10/11 — **independent**. Storage / audit don't run math.
313+
- **`OGAR`** Phase 8 (FMA TTL) — **independent**. Ontology prep doesn't touch ndarray.
314+
315+
---
316+
317+
## 6. Risk matrix (ndarray-specific)
318+
319+
| Risk | Severity | Mitigation |
320+
|---|---|---|
321+
| SIMD parity diverges across backends (AVX-512 vs NEON vs scalar) | MED | Mandatory parity tests on every primitive (per consumer-contract knowledge doc); ULP-tolerance gates documented per primitive. |
322+
| Degenerate-Σ NaN propagation into downstream | HIGH | Documented NaN-sentinel semantics; `batched_mahalanobis` converts NaN distances to `f32::INFINITY` to make sort stable; integration tests cover degenerate paths. |
323+
| AVX-512 throughput regression on older Skylake-X / Cascade Lake (VBMI gate) | MED | Reuse existing `VBMI gate for permute_bytes` pattern from ndarray PR #142; gate `batched_opacity_blend`'s 16-lane gather variant behind VBMI feature detection. |
324+
| Bench fluctuation across CI runners | LOW | Use the existing nightly bench infrastructure; ≥ 5 runs minimum; report median + IQR. |
325+
| Carrier-API changes in `lance-graph-contract` between draft and final | LOW | ndarray exposes typed-slice API (`&[f32]`); carrier struct layout changes don't affect ndarray. |
326+
327+
---
328+
329+
## 7. Acceptance criteria
330+
331+
### Per-primitive
332+
333+
- **Correctness:** reference comparison passes (≤ 4 ULP) against the named reference implementation.
334+
- **Parity:** AVX-512 ≡ NEON ≡ scalar (within ULP tolerance).
335+
- **Bench:** primitive-specific throughput gate met on the target hardware (AVX-512: x86-64-v4; NEON: Apple M-series or Cortex-A78).
336+
337+
### Module-level
338+
339+
- `cargo test -p ndarray --features simd-splat` — green.
340+
- `cargo bench -p ndarray --bench simd_splat` — published baseline numbers.
341+
- Three-backend implementations live in the existing `src/simd_*` family pattern; no new `unsafe` blocks beyond what the family already uses; all `unsafe` blocks have `// SAFETY:` comments.
342+
- Module is feature-gated behind `simd-splat` (default-on, per ndarray's "ndarray is mandatory" policy from PR #463); zero-impact on existing consumers.
343+
344+
---
345+
346+
## 8. Cross-references
347+
348+
- **Canonical plan:** `lance-graph/.claude/plans/splat-native-ultrasound-v1.md` §§3.1 (D-SPLAT-1), 3.2 (D-SPLAT-2 detailed spec), 11.1-11.7 (work division)
349+
- **Consumer contract:** `lance-graph/.claude/knowledge/ndarray-vertical-simd-alien-magic.md` (W1a primitive-addition pattern; mandatory three-backend implementation; parity-test gates; VPABSB-correction example for saturating semantics)
350+
- **ndarray priors:**
351+
- `src/simd_avx512.rs` (W1a primitives in production)
352+
- `src/simd_avx2.rs`, `src/simd_neon.rs` (companion backends)
353+
- PR #142 (VBMI gate for `permute_bytes` — relevant for `batched_opacity_blend`)
354+
- PR #189 (`OntologySchema::is_ancestor` — relevant for D-SPLAT-8 FMA traversal downstream)
355+
- PR #463 (ndarray-is-mandatory in lance-graph — relevant for splat-fit feature-flag design)
356+
- **Iron rules:** `I-NOISE-FLOOR-JIRAK` (any threshold derived in downstream registration cites Jirak under weak dependence, not classical Berry-Esseen)
357+
358+
---
359+
360+
## 9. Sprint window
361+
362+
**P1, sprint 1-2.** Foundation work. **Must land before everything else in the splat-native arc.**
363+
364+
### Sprint 1
365+
366+
- D-SPLAT-2 implementations on the AVX-512 backend (the lead path, since ndarray's pinned target is x86-64-v4).
367+
- Reference-implementation parity tests against `nalgebra` + analytical SH closed-form + scipy where available.
368+
- Initial bench numbers.
369+
370+
### Sprint 2
371+
372+
- D-SPLAT-2 NEON backend (Apple M-series + Cortex-A78 target).
373+
- D-SPLAT-2 scalar fallback (CI floor for non-AVX-512 / non-NEON CI runners).
374+
- Three-backend parity test green.
375+
- Bench gates verified on representative hardware.
376+
- AGENT_LOG entry; PR_ARC entry.
377+
378+
**Acceptance for handoff to P2 (lance-graph splat-fit):** all five primitives green on all three backends; bench targets met; carrier API stable from `lance-graph-contract` D-SPLAT-1/3 (concurrent sprint 1-2).
379+
380+
---
381+
382+
_End of ndarray companion v1._

0 commit comments

Comments
 (0)