AMX int8/bf16 tile GEMM — enable, fix 5 bugs, optimize (0→197 GMAC/s) + vertical-SIMD primitives by AdaWorldAPI · Pull Request #217 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-06-14T07:16:09Z

Summary

Brings Intel AMX (Advanced Matrix Extensions) from silently disabled on every
capable host to enabled, bit-exact, and fast on stable Rust 1.94, plus the
vertical-SIMD primitives that back the turbovec/lance-graph consumers.

Verified on Emerald Rapids (CPUID model 0xCF), kernel 6.18.5. The fixes are
ISA-level (identical on Sapphire Rapids / Granite Rapids).

AMX (the headline)

Enablement — the one-digit bug. ARCH_REQ_XCOMP_PERM (0x1023) is an
arch_prctl op (syscall 158); the code issued it on prctl (157) →
-EINVAL, so amx_available() returned false on every AMX host and the tile
path was dead code. Fixed 157→158 (simd_amx.rs). This is the "AMX is available
~50-79% of the time but needs a special way to enable it."

Four more tile-kernel bugs, surfaced once the path actually executed and
bisected with examples/amx_probe (prints a flushed line before each tile op):

TileConfig rows/colsb regions swapped → LDTILECFG #GP (SIGSEGV).
TILELOADD/TILESTORED SIB base/index swapped → deref'd stride as ptr (SIGSEGV).
TDPBUSD/TDPBF16PS ModRM 0xC1 aliased two source tiles → same-tile #UD (SIGILL).
Operand index+sign convention is the mirror of the naive SDM reading
(plain M×K = ModRM.rm = unsigned; VNNI K×N = VEX.vvvv = signed) — proven
by sweeping all four TDPB**D opcodes against sign-sensitive inputs.

The old header claimed "INLINE ASM TESTED ✓" — it never was: every AMX test
early-returns when amx_available() is false (which it always was), so the asm
had never run. That's why these bugs sat latent (incl. in the SPR-era
AMX_GOTCHAS.md, which shipped three of them).

Now correct: matmul_i8_to_i32 (TDPBUSD) bit-exact vs scalar across all
shapes; matmul_f32/matmul_bf16_to_f32 (TDPBF16PS) within BF16 tolerance.

Performance (Emerald Rapids, single-thread, target-cpu=native, all correct=true):

size	first-correct	hoist LDTILECFG	2×2 register-blocked
2048³	14.8	169.7	197.7 GMAC/s (395 GOP/s, 600× scalar)

Hoisted the serializing LDTILECFG + VNNI-pack out of the per-tile loops (11.5×).
2×2 register blocking (tile_dpbusd_2x2, 4 C accumulators + 2 A + 2 B tiles)
with a BLIS-style panel loop order — halves load bytes/MAC; +16% over serial at
2048³, wins at every size. All 12 new tile encodings validated in isolation
(examples/amx_rb_probe).
rayon row-tile parallelism (feature-gated) → 237 GMAC/s for huge GEMMs.

Detection + ergonomics: amx_available() cached in a LazyLock (non-blocking
gates; process-wide arch_prctl requested once, rayon-safe); new CpuModel
(SPR/EMR/GNR/SierraForest) + cpu_model() + amx_report(), exported via
ndarray::simd::*. has_amx() && !available cleanly separates "no AMX silicon"
from "AMX not OS-enabled".

Docs (teach-to-fish): new .claude/knowledge/amx-enablement-and-kernel.md
(canonical reference + validated byte-code table), new .claude/agents/amx-savant.md,
and a rewritten .claude/AMX_GOTCHAS.md (corrects the three bugs it shipped; adds
a fault-signature→cause playbook).

Vertical-SIMD primitives

Native AVX2 U16x16 (__m256i), FastScan PQ4-ADC helpers on U16x16/U8x32,
F32x8::{mul_add, cmp_gt_mask}, U8x32::from_ptr (zero-overhead hot-loop load),
backend gemm_bf16/gemm_i8/gemv wired to real SIMD dispatch, and the AMX/VNNI
int8 GEMM re-exported through simd.rs — the consumer-facing surface turbovec and
lance-graph build on.

Validation & honest caveats

Validated via examples/{amx_probe, amx_rb_probe, amx_gemm_bench} and
cargo clippy -p ndarray --lib [--features rayon] (clean except one pre-existing
simd_int_ops needless_return). The lib unit-test target is pre-broken by
unrelated src/tri.rs type-inference errors (noted in CLAUDE.md), so the examples
are the gate.
rayon-over-rb (the 2×2+rayon combination) was attempted and reverted — it
ran slower than rb-single and correct=false at 1024³/2048³ (an
AMX-under-rayon-at-scale issue, documented in the knowledge doc). The banked wins
(rb-single 197, 16×16-rayon 237) are both correct. Bigger lever ahead: full BLIS
Mc/Nc/Kc cache blocking.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

Generated by Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Register-blocked AMX tile GEMM acceleration with improved int8 performance.
- CPU model reporting to support silicon-specific routing and clearer capability reporting.
- New probe/benchmark examples for AMX performance and correctness.
Bug Fixes
- Corrected AMX tile configuration and instruction encodings to improve runtime stability.
- Fixed Linux AMX enablement permission handling.
- Improved GEMM/GEMV dispatch behavior and added tiered GEMV execution for non-scalar modes.
Documentation
- Updated AMX enablement and troubleshooting playbooks with a fault-signature “cause” index and expanded coverage.

Replace the scalar avx2_int_type!(U16x16, ...) polyfill ([u16;16] array + scalar loops) with a real __m256i wrapper using _mm256_*_epi16, mirroring the already-native U16x32 in simd_avx512.rs:1200. Both x86 dispatch arms (v3 AVX2, and v4 AVX-512 which re-exports the avx2-tier U16x16) now surface the hardware type — per agnostic-surface-cpu-matrix.md §A (U16x16 = __m256i) and the TD-T22 polyfill-audit cell. Lets the PQ4-ADC FastScan u16 accumulate (turbovec's AVX2 search kernel, the consumer migration this unblocks) run on hardware instead of a scalar [u16;16] loop. Drop-in: 2141 lib tests pass, 0 failed (no consumer needed changes). https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

Adds the per-arch primitives a 4-bit-PQ ADC FastScan consumer (turbovec's AVX2 search kernel) needs to route through ndarray::simd instead of hand-rolled _mm256_* intrinsics: - U16x16::permute2x128::<IMM> / blend_epi32::<IMM> — the SUB-trick cross-128-lane combine (IMM 0x21 / 0xF0) - U16x16::to_f32x8_lo / to_f32x8_hi — lossless u16->f32 (_mm256_cvtepu16_epi32 + _mm256_cvtepi32_ps); PQ4 accumulators are <= FLUSH_EVERY*127 so they fit f32 - U8x32::as_u16x16 — zero-cost bitcast bridging shuffle_bytes LUT results into the 16-bit accumulator All thin __m256i/__m256 intrinsic wrappers built on the native U16x16 (f766d18). ndarray builds clean. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

_mm256_fmadd_ps wrapper (self*a + b, single rounding). The 256-bit sibling of the existing F32x16::mul_add; the PQ4-ADC FastScan flush needs an 8-wide FMA for the per-query fa = v_scale*partial + fa reduction. Validated end-to-end by turbovec's avx2_topk_matches_default_kernel parity test (migrated AVX2 kernel == default). https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

_mm256_cmp_ps::<_CMP_GT_OQ> + _mm256_movemask_ps → u8 mask. The SIMD threshold- prune primitive the PQ4-ADC FastScan heap needs (skip an 8-lane chunk with no candidate above heap-min in one op). Validated by turbovec's avx2 parity test at v4. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

Unchecked _mm256_loadu_si256 from a raw ptr (caller guarantees >=32 bytes). The from_slice assert + slice-index bounds check are a measurable tax in tight scan loops; this is the consumer-side escape hatch for validated inner loops. Closed turbovec's migrated AVX2 FastScan gap: 91.0us -> 78.8us/q at v4 (~1% vs upstream raw AVX2 77.9us; was ~17%). Safety documented on the fn. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

…audit TD-T2/3/5/7) - gemm_bf16: scalar bf16_gemm_f32 -> amx_matmul::matmul_bf16_to_f32 (AMX TDPBF16PS -> AVX-512 VDPBF16PS -> scalar) via ArrayView2 wrapping; bit-equivalent on non-AMX/non-AVX512BF16 hosts. - gemm_i8: 2-tier vnni_gemm -> simd_int_ops::gemm_u8_i8 (4-tier AMX TDPBUSD / VNNI-zmm / AVX-VNNI-ymm / scalar). Deliberately stays u8xi8 (NOT the signed matmul_i8_to_i32, which would misread A bytes >=128); same int8_gemm_i32 scalar reference, so bit-equivalent. - gemv_f32/f64: scalar -> per-row dot_f32/dot_f64 on the AVX2/AVX-512 tiers (scalar tier kept byte-identical); reuses the parity-tested BLAS-1 dot, zero new unsafe. Validated centrally: cargo test -p ndarray -> 2136 passed, 0 failed, 29 ignored. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

…ut) + fmt amx_gemm_bench reports amx_available() and dispatched matmul_i8_to_i32 throughput vs scalar across square sizes — for verifying the AMX tile path lights up on SPR/GNR silicon (on this host amx_available()=false, VNNI ceiling ~30 GMAC/s). Also folds the rustfmt normalization of the from_ptr addition. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

… bugs AMX was silently disabled on EVERY capable host and, once enabled, the tile kernel produced wrong results / faulted. `amx_available()` now returns true on this Emerald Rapids host and `matmul_i8_to_i32` is bit-exact vs the scalar reference (validated by examples/amx_probe across single/multi K-block and single/multi-tile shapes, incl. negative i8 through +128/bias). Five bugs, found by bisecting with examples/amx_probe: 1. ENABLEMENT (simd_amx.rs): `ARCH_REQ_XCOMP_PERM` (0x1023) is an *arch_prctl* op (syscall 158), but the code issued it on `prctl` (syscall 157) → -EINVAL → step 4 always failed → amx_available() returned false on all AMX silicon. Fixed 157 → 158. This is the "special way to enable it": Linux 5.16+ requires the XTILEDATA permission request before any AMX tile op. 2. TILECFG (amx_matmul.rs for_dpbusd): rows/colsb regions were SWAPPED. Per the SDM XTILECFG layout, colsb[t] is a u16 at 16+2t and rows[t] a u8 at 48+t; the code wrote rows into the colsb region and vice-versa, yielding colsb[0]=4112 / rows[0]=64 → LDTILECFG #GP (SIGSEGV). 3. TILELOADD/TILESTORED SIB (amx_matmul.rs): SIB 0x08 = [base=rax, index=rcx] but the registers are bound rcx=ptr, rax=stride, so the engine dereferenced the stride (~64) as the base → SIGSEGV. Correct SIB is 0x01 = [base=rcx, index=rax]. 4. TDPBUSD/TDPBF16PS ModRM (amx_matmul.rs): ModRM 0xC1 set rm=tmm1 while vvvv also = tmm1, aliasing the two sources → same-tile #UD (SIGILL). The SDM requires the three tile operands distinct; fixed to 0xC2 (rm=tmm2). 5. OPERAND CONVENTION (int8/bf16 amx_path): on this silicon the AMX operand mapping is the mirror of the naive SDM reading — ModRM.rm is the plain M×K operand and VEX.vvvv the VNNI K×N operand, and for 0x71 rm is UNSIGNED / vvvv is SIGNED (proven by sweeping all four TDPB**D opcodes on sign-sensitive inputs). So the kernels load A(plain)→tmm2 and B(vnni)→tmm1; the standard TDPBUSD (0x71) is the correct int8 variant. The header's "INLINE ASM TESTED" was aspirational: every AMX test early-returns when amx_available() is false (which it always was), so the tile asm had never actually executed until bug #1 was fixed. Throughput (Emerald Rapids, single-thread, target-cpu=native): int8 GEMM 2048^3 = 14.8 GMAC/s, 52.8x scalar, correct=true. (ldtilecfg + VNNI-pack are still per-tile — a follow-up will hoist them; this commit is the correctness checkpoint.) https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

….5x) `int8_gemm_amx_tiled` called `int8_tile_gemm_16x16` per 16×16 output tile, which ran LDTILECFG (a serializing instruction) + TILERELEASE and re-VNNI- packed the B column band on EVERY tile — e.g. 256 LDTILECFGs for a 256² output. Restructured so the driver: * loads the tile config ONCE up front, releases ONCE at the end; * VNNI-packs each B column band ONCE per j-tile (reused across all M/16 row tiles), not once per (i,j) tile; * TILEZEROs the C accumulator and TILESTOREs the 16×16 result straight into its strided slot in C (row pitch n·4 bytes) — no scratch buffer, no copy loop. Single-thread int8 GEMM on Emerald Rapids (target-cpu=native), all correct=true vs scalar: size before(naive) after(hoisted) vs scalar 256^3 3.7 GMAC/s 65.7 GMAC/s 68x 512^3 6.4 GMAC/s 124.9 GMAC/s 130x 1024^3 10.4 GMAC/s 155.9 GMAC/s 177x 2048^3 14.8 GMAC/s 169.7 GMAC/s 600x (339 GOP/s) `int8_tile_gemm_16x16` (the standalone 16×16 accumulating API) is unchanged and still used by the unit tests. Further headroom: 2×2 register blocking (reuse A/B tile loads across 4 C accumulators) and rayon over row tiles. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

The operand-swap fix touched both tile kernels; amx_probe now also checks matmul_f32 (which routes through the BF16 TDPBF16PS path on AMX) against an f32 scalar reference. Max relative error 0.0033–0.0051 across shapes — within BF16's ~8-mantissa-bit precision. Confirms the index-swap fix is correct for TDPBF16PS as well as TDPBUSD. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

…s/agent Detection (`amx_available()`) ran CPUID + XGETBV + the arch_prctl syscall on EVERY matmul call. Cache it in a `LazyLock<bool>` computed once; all four gates are non-blocking (no I/O, no lock, no spin) so the init can't stall, and the arch_prctl grant is process-wide + inherited by all threads, so requesting it once is correct even under rayon. Add `CpuModel` detection from CPUID.01H (GenuineIntel only): SapphireRapids (0x8F) / EmeraldRapids (0xCF) / GraniteRapids (0xAD/0xAE) / SierraForest (0xAF, E-core, no AMX) / OtherX86 / NonX86, also cached. Exposed via `ndarray::simd::{cpu_model, CpuModel, amx_report}`. `amx_report()` now reads e.g. "AMX [Emerald Rapids expects_amx=true]: TILE=true INT8=true BF16=true available=true" — so `has_amx() && !available` cleanly separates "no AMX silicon" from "AMX present but not OS-enabled". Answers the SPR-vs-EMR question: it is NOT the cause. The five bugs are ISA/encoding-level and were latent on Sapphire Rapids too — the SPR-era AMX_GOTCHAS.md literally shipped three of them (syscall 157, TDPBUSD=…73…C1, swapped TILECFG layout). They never fired because detection never returned true; EMR was just the first host to actually execute the tile path. Docs (teach-to-fish): - NEW .claude/knowledge/amx-enablement-and-kernel.md — canonical reference: enablement sequence, validated byte-code table, the empirically-verified (mirrored) operand convention, detection API, perf story, modus operandi. - NEW .claude/agents/amx-savant.md — the AMX specialist agent (enable / troubleshoot / mindset), references the knowledge doc + gotchas. - REWROTE .claude/AMX_GOTCHAS.md — corrected the three bugs it shipped, added a fault-signature→cause index (SIGSEGV@LDTILECFG / SIGSEGV@TILELOADD / SIGILL@TDPBUSD / wrong-values) and the "skipped test ≠ passing test" trap. - simd_amx.rs module header: replaced the false "INLINE ASM TESTED ✓" claim (and wrong "256 MACs/instr" → 16384) with the verified status + bug history. - amx_probe prints amx_report() + cpu_model() up front. No W3C/ISA encoding source exists in-tree — the .claude "w3c" files are semantic-web ontologies (SKOS/PROV-O/FIBO); the byte-code authority is the Intel SDM opcode map + the empirical 4-opcode sweep (0x70/0x73 match the SDM exactly, confirming the map). Validated: amx_probe all CORRECT, clippy -p ndarray --lib clean (only the pre-existing simd_int_ops needless_return), cpu_model()=EmeraldRapids. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

…ated) Adds `int8_gemm_amx_tiled_par`: B is VNNI-packed once into a shared read-only buffer, then the M/16 row-tiles fan across the rayon pool (one task per 16-row block of C). Each worker runs its own LDTILECFG (tile config is per-thread CPU state) + the byte-for-byte validated serial tile sequence; AMX permission is process-wide so it's inherited by every worker. Correctness unchanged (amx_probe --features rayon: all shapes CORRECT, incl. the multi-row-tile parallel cases). HONEST FINDING: this AMX kernel is MEMORY-BANDWIDTH-BOUND. On this 4-core Emerald Rapids box rayon-over-rows scales sublinearly — 2048³ 169.7 → 237.5 GMAC/s (1.4×), and it REGRESSES small/medium shapes (512³ 125 → 73) because thread dispatch + the shared B-prepack dominate. So it's gated to large work only: `m >= 32 && m·n·k >= 2e9` (≈ ≥2048³ here), keeping the fast serial path for everything else. Many-core servers with more memory channels gain more. The real lever for a bandwidth-bound kernel is reducing bytes/FLOP (2×2 register blocking — reuse each A/B tile load across 4 C accumulators), which also lets rayon scale better; that's the queued next step. Feature-gated behind `rayon` (no change to default builds). https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

…act) The AMX int8 GEMM is memory-bandwidth-bound (~8% of peak). The lever is fewer bytes/MAC: 2×2 register blocking computes a 32×32 output block with four C accumulators (tmm0-3) fed by two A row-tiles (tmm4-5) and two B col-tiles (tmm6-7) via `tile_dpbusd_2x2`, so each A/B tile load serves TWO products — half the tile loads of the 16×16 kernel. New tile primitives (amx_matmul.rs), each validated by examples/amx_rb_probe (32×32 block vs scalar, K=64/128/256, all CORRECT on first run): * tile_load extended to tmm3-7, tile_store to tmm1-3 (ModRM 0x04|tile<<3). * TileConfig::for_dpbusd_8 — all 8 tiles 16×64. * tile_dpbusd_2x2 — 4 TDPBUSDs (C4 E2 49 5E C4 / 41 5E CC / 49 5E D5 / 41 5E DD); byte2 = ((~vvvv&0xF)<<3)|1, ModRM = 0xC0|dst<<3|rm; all 8 operand tiles distinct (no same-tile #UD). Driver (`int8_gemm_amx_tiled_rb`): BLIS-style loop order is what makes it win — OUTER over 32-col panels, pack only that panel's two B bands (L2-resident) and reuse across all row-blocks, INNER over 32-row blocks. (A first cut pre-packed ALL of B ≈4 MB and thrashed cache, regressing 1024³ 156→125; the panel-local pack fixes it and also halves A's DRAM re-reads.) 16-wide M/N remainders finish on the validated 16×16 path. amx_probe adds 48×48 / 96×80 cases covering both remainder strips + corner — all CORRECT. Single-thread, target-cpu=native, all correct=true vs scalar: size serial rb(2×2) 256³ 65.7 80.8 (+23%) 512³ 124.9 132.0 (+6%) 1024³ 155.9 170.2 (+9%) 2048³ 169.7 197.7 (+16%) 395 GOP/s Dispatch: rayon huge → _par; else m,n≥32 → _rb; else _serial. Next: rayon-over-rb (projected ~270+ GMAC/s) + full BLIS cache blocking. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

…ng at scale) The knowledge doc projected "rayon-over-rb ~270+ GMAC/s" as easy headroom. It is NOT: a first attempt (each rayon task runs _rb on a 64-row band) ran slower than rb-single (155 vs 197 at 2048³ — per-task B re-pack) AND produced correct=false at 1024³/2048³ (256³/512³ stayed correct; single-thread _rb is bit-exact at all sizes, so it's an AMX-tiles-under-rayon-at-scale issue, not yet diagnosed). Reverted. Documented the two prerequisites before reshipping (shared B pre-pack + a probe reproducing/explaining the large-size correctness failure) so a future session doesn't re-walk it blind. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

coderabbitai · 2026-06-14T07:16:28Z

Warning

Review limit reached

@AdaWorldAPI, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 35 minutes and 8 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4052a99e-4a4d-4318-9fcf-de0325cc1fd1

📥 Commits

Reviewing files that changed from the base of the PR and between d33b983 and e563fdc.

📒 Files selected for processing (9)

examples/amx_probe.rs
examples/amx_rb_probe.rs
examples/edge_residue_probe.rs
examples/golden_helix_probe.rs
src/backend/native.rs
src/hpc/amx_matmul.rs
src/simd_avx2.rs
src/simd_avx512.rs
src/simd_int_ops.rs

📝 Walkthrough

Walkthrough

Fixes the Linux AMX permission syscall (157→158), corrects multiple tile instruction encoding bugs (TILECFG field layout, SIB base/index, TDPBUSD/TDPBF16PS ModRM aliasing), swaps A/B operand placement in int8 and bf16 kernels, adds adaptive int8 GEMM dispatch (serial/Rayon/register-blocked 2×2), routes backend GEMM/GEMV through corrected tier dispatch, replaces array-backed U16x16 with a native AVX2 wrapper, adds three AMX validation examples plus three supplemental algorithm probes, and updates the documentation corpus.

Changes

AMX Enablement and Correctness Overhaul

Layer / File(s)	Summary
AMX detection API: CpuModel, syscall fix, public re-exports `src/simd_amx.rs`, `src/simd.rs`	Adds `CpuModel` enum and cached `cpu_model()`, fixes Linux AMX permission gate from syscall 157 to 158 (`arch_prctl` with `ARCH_REQ_XCOMP_PERM`/`XTILEDATA=18`), updates `amx_report()` to include CPU model and `expects_amx`, and re-exports `amx_report`/`cpu_model`/`CpuModel` from the public `simd` module.
AMX tile primitive encoding corrections and `tile_dpbusd_2x2` `src/hpc/amx_matmul.rs`	Rewrites `TileConfig::for_dpbusd` with correct Intel SDM XTILECFG colbytes/rows field offsets (previously swapped). Fixes `tile_load`/`tile_store` SIB byte to use base=rcx, index=rax and expands `tile_load` to cover `tmm0`–`tmm7`. Corrects `tile_dpbusd` ModRM byte sequence, corrects `tile_dpbf16ps` ModRM to prevent same-tile `#UD`. Adds `pub unsafe fn tile_dpbusd_2x2()` for 2×2 register-blocked four-accumulator TDPBUSD. Strengthens `test_tile_config_creation` to assert exact little-endian field values.
AMX kernel operand placement and int8 adaptive GEMM dispatch `src/hpc/int8_tile_gemm.rs`, `src/hpc/bf16_tile_gemm.rs`	Fixes `amx_path` operand loading in both kernels: VNNI-packed i8 B → `tmm1` (VEX.vvvv), plain u8 A → `tmm2` (ModRM.rm) for int8; VNNI B → `tile_load(1)`, plain A → `tile_load(2)` for bf16. Rewrites `int8_gemm_amx_tiled` into an adaptive driver selecting among serial (16×16 tile_store direct), Rayon-parallel (pre-packed B, per-worker tile config), and register-blocked 2×2 (`tile_dpbusd_2x2` + strip fallback) strategies.
Backend GEMM/GEMV tier dispatch routing `src/backend/mod.rs`, `src/backend/native.rs`	Switches `gemm_i8` from `vnni_gemm::int8_gemm_vnni` to `simd_int_ops::gemm_u8_i8`. Rewrites `gemm_bf16` to call `hpc::amx_matmul::matmul_bf16_to_f32` via `ArrayView2`. Changes `gemv_f32`/`gemv_f64` from unconditional scalar to tier-dispatched `dot_f32`/`dot_f64` row-dot-product loops.
Native AVX2 `U16x16` and `F32x8` helpers `src/simd_avx2.rs`, `src/simd_avx512.rs`	Replaces macro-generated array-backed `U16x16` with a `#[repr(transparent)]` `__m256i` wrapper with lane ops, FastScan helpers, and `F32x8` conversion. Adds `U8x32::from_ptr` (unsafe unaligned load) and `U8x32::as_u16x16` (bitcast). Adds `F32x8::mul_add` (FMA3) and `F32x8::cmp_gt_mask` (ordered comparison + movemask).
AMX correctness validation examples `examples/amx_probe.rs`, `examples/amx_rb_probe.rs`, `examples/amx_gemm_bench.rs`, `Cargo.toml`	Adds probe examples validating 16×16 tile GEMM, general int8/f32 matmul (with relative-error tolerance), and 2×2 register-blocked tile kernel. Adds a throughput benchmark reporting GMAC/s and speedup vs scalar for square int8 GEMM sizes. Gates examples behind `required-features = ["std"]` in manifest.
Supplemental algorithm probes `examples/edge_residue_probe.rs`, `examples/golden_helix_probe.rs`, `examples/morton_cascade_probe.rs`	Adds edge-residue palette + 4-bit residue reconstruction probe via int8 GEMM. Adds golden-helix hemisphere sampling validation via nearest-neighbor angular statistics. Adds morton-cascade quadtree pruning benchmark with band-query early-exit evaluation.
AMX documentation corpus `.claude/AMX_GOTCHAS.md`, `.claude/agents/amx-savant.md`, `.claude/knowledge/amx-enablement-and-kernel.md`, `.claude/blackboard.md`, `.claude/board/AGENT_LOG.md`	Rewrites `AMX_GOTCHAS.md` as a troubleshooting playbook with corrected syscall/SIB/TILECFG/TDPBUSD entries and a fault-signature index. Adds `amx-savant` agent spec with 4-gate enablement flow, operand convention, and troubleshooting workflow. Adds canonical `amx-enablement-and-kernel.md` covering the five-bug table, validated encoding table, detection API contract, and performance heuristics. Updates architecture decisions and agent log.

Sequence Diagram

sequenceDiagram
  participant App
  participant backend_gemm_i8
  participant simd_int_ops_gemm_u8_i8
  participant int8_gemm_amx_tiled
  participant amx_available
  participant TileConfig

  App->>backend_gemm_i8: gemm_i8(m,k,n,a,b,c)
  backend_gemm_i8->>simd_int_ops_gemm_u8_i8: dispatch via tier
  simd_int_ops_gemm_u8_i8->>amx_available: amx_available() [LazyLock]
  amx_available-->>simd_int_ops_gemm_u8_i8: true/false

  alt AMX available (M>=32, N>=32)
    simd_int_ops_gemm_u8_i8->>int8_gemm_amx_tiled: int8_gemm_amx_tiled_rb
    int8_gemm_amx_tiled->>TileConfig: for_dpbusd_8(k_bytes)
    int8_gemm_amx_tiled->>int8_gemm_amx_tiled: pack B into VNNI quads
    int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_dpbusd_2x2() × K/64 blocks
    int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_store 4 quadrants → c
  else AMX available serial
    simd_int_ops_gemm_u8_i8->>int8_gemm_amx_tiled: int8_gemm_amx_tiled_serial
    int8_gemm_amx_tiled->>TileConfig: for_dpbusd(k_bytes)
    int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_dpbusd() + tile_store → c
  else no AMX
    simd_int_ops_gemm_u8_i8->>simd_int_ops_gemm_u8_i8: VNNI/scalar fallback
  end
  int8_gemm_amx_tiled-->>App: c filled

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 Hops onto the silicon stage...

The syscall was wrong, one-five-seven astray,
A single-digit fix unlocked AMX today.
SIB bytes corrected, ModRM aligned true,
Four tile accumulators blazing brand new.
The rabbit checked each colsb field with care—
Correct encodings found, and GMAC/s to spare! 🎉

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the primary changes: AMX tile GEMM enablement and bug fixes, achieving significant performance improvement (0→197 GMAC/s), plus new vertical-SIMD primitives. It is concise, specific, and directly reflects the main objectives documented in the PR.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3e6223088

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T07:22:20Z

    #[cfg(feature = "std")]
    {
-        crate::hpc::vnni_gemm::int8_gemm_vnni(a, b, c, m, n, k);
+        crate::simd_int_ops::gemm_u8_i8(a, b, c, m, n, k);


Keep the runtime VNNI fallback in gemm_i8

In the default x86-64-v3/std build, this now calls simd_int_ops::gemm_u8_i8, whose AVX-512/AVX-VNNI arms are behind #[cfg(target_feature = ...)] after the AMX alignment check; if AMX is unavailable or the shape is not 16/16/64-aligned, it therefore falls through to scalar even on Ice Lake/Zen 4/Sapphire Rapids hardware with VNNI. The previous hpc::vnni_gemm::int8_gemm_vnni used runtime simd_caps() detection and handled arbitrary shapes, so backend::gemm_i8 regresses hardware-supported VNNI calls to the scalar kernel in normal baseline builds.

Useful? React with 👍 / 👎.

examples/{amx_gemm_bench,amx_probe,amx_rb_probe}.rs import `ndarray::simd` and `ndarray::hpc`, both `#[cfg(feature = "std")]`. With std as a default feature they built locally, but the `--no-default-features` CI job compiles examples with std off → `error[E0432]: unresolved import ndarray::simd` (item configured out, gated behind `std`). Declared the three examples with `required-features = ["std"]` (same pattern as `ocr_benchmark`) so non-std jobs skip them. Verified: `cargo build -p ndarray --no-default-features --features portable-atomic-critical-section --examples` now finishes clean (examples skipped) instead of failing on amx_gemm_bench. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

coderabbitai

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

src/hpc/int8_tile_gemm.rs (1)

367-397: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Don’t rely on debug_assert! for a safe AMX entry point.

In release builds, a caller can invoke int8_gemm_amx_tiled() on a non-AMX host or with misaligned shapes, and this function will still fall into the raw-pointer/tile-instruction paths below. That turns a safe public API into either an immediate SIGILL (amx_available() == false) or full 16×16 tile loads/stores against a shape the code no longer proves is aligned.

Suggested fix

-    debug_assert!(crate::hpc::amx_matmul::amx_available());
-    debug_assert_eq!(m % 16, 0, "int8_gemm_amx_tiled: M must be multiple of 16");
-    debug_assert_eq!(n % 16, 0, "int8_gemm_amx_tiled: N must be multiple of 16");
-    debug_assert_eq!(k % 64, 0, "int8_gemm_amx_tiled: K must be multiple of 64");
+    assert!(
+        crate::hpc::amx_matmul::amx_available(),
+        "int8_gemm_amx_tiled: AMX must be available"
+    );
+    assert_eq!(m % 16, 0, "int8_gemm_amx_tiled: M must be multiple of 16");
+    assert_eq!(n % 16, 0, "int8_gemm_amx_tiled: N must be multiple of 16");
+    assert_eq!(k % 64, 0, "int8_gemm_amx_tiled: K must be multiple of 64");

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hpc/int8_tile_gemm.rs` around lines 367 - 397, The function
int8_gemm_amx_tiled uses debug_assert! macros to validate that AMX is available
and that input dimensions m, n, and k are properly aligned (multiples of 16, 16,
and 64 respectively), but these assertions are stripped in release builds. This
allows a caller to invoke the function with invalid preconditions on non-AMX
hosts or with misaligned shapes, causing either SIGILL or undefined behavior
when the code reaches the raw pointer and AMX instruction paths. Replace all
four debug_assert! calls (the amx_available check and the three shape alignment
checks) with regular assert! macro calls to ensure these validation checks
execute in both debug and release builds, protecting the safe public API from
unsafe inputs.

🧹 Nitpick comments (2)

src/backend/mod.rs (1)
170-183: ⚡ Quick win

Add examples to the new public GEMM docs.

Both gemm_i8 and gemm_bf16 now have detailed API docs, but they still miss a usage example. A tiny row-major example for each would satisfy the repo contract and pin down the signedness/shape expectations for callers. As per coding guidelines, "All public APIs (public functions and methods) must have /// doc comments with examples".

Also applies to: 201-208
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/mod.rs` around lines 170 - 183, The public GEMM functions gemm_i8
and gemm_bf16 have detailed API documentation but are missing usage examples,
which violates the repo's requirement that all public APIs include documentation
comments with examples. Add a usage example to the doc comments for both
functions using the `///` style documentation. Include a small row-major matrix
multiplication example for each function that demonstrates how callers should
use it and clarifies the signedness and shape expectations (e.g., u8 × i8 → i32
for gemm_i8, and appropriate types for gemm_bf16).
Source: Coding guidelines
src/backend/native.rs (1)
285-308: ⚡ Quick win

Add examples to the public GEMV docs.

gemv_f32 and gemv_f64 now have expanded API docs, but they still do not include examples. A small doctest for the row-major contract would bring these entrypoints in line with the repository requirement. As per coding guidelines, "All public APIs (public functions and methods) must have /// doc comments with examples".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/native.rs` around lines 285 - 308, The `gemv_f32` and `gemv_f64`
functions are missing doctest examples in their doc comments, which violates the
coding guideline requiring all public APIs to have doc comments with examples.
Add executable doctest examples to both functions' doc comments that demonstrate
the row-major GEMV operation (y = alpha * A * x + beta * y) using concrete
values for the matrix, vectors, and scalars to illustrate how the computation
works.
Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/amx_probe.rs`:
- Around line 82-110: The relative error calculation in the test_matmul_f32
function is incorrect because using e.abs().max(1.0) as the denominator converts
small values into absolute-error tests rather than true relative-error tests.
This allows large relative regressions on small outputs to incorrectly appear
correct. Fix the denominator calculation in the loop where max_rel is computed
to use a true relative metric: either use a small epsilon as a floor while still
dividing by the actual expected value's magnitude, or compute both absolute and
relative error separately and use the maximum of the two comparisons against the
tolerance threshold.

In `@examples/amx_rb_probe.rs`:
- Around line 45-61: Add validation to reject K values that are not divisible by
64 instead of silently truncating them. Before computing k_blocks with the
division operation k / 64, check if k modulo 64 is not equal to zero and either
return an error or panic with a clear message indicating that K must be a
multiple of 64. This ensures the function fails fast when given invalid input
rather than producing a partial product from the truncated k_blocks calculation
in the loop.

In `@src/backend/native.rs`:
- Around line 289-303: The gemv_f32 and gemv_f64 functions eagerly slice x[..n]
before checking if there are any rows to process. This causes a panic when m
equals zero and x.len() is less than n, whereas the scalar implementations
gracefully handle this case by returning early. Add an early return check at the
beginning of the non-Scalar branch in both gemv_f32 (before line 295 where xn is
assigned) and gemv_f64 (the sibling function also applying the same pattern): if
m equals zero, return immediately without attempting to slice x, preserving the
no-op behavior when there are no matrix rows.

In `@src/hpc/amx_matmul.rs`:
- Around line 827-835: The `test_tile_zero_and_release` test function still
builds the old incorrect XTILECFG layout with swapped offsets (using `data[16] =
1` and `data[48] = 4`), while the assertions in the diff now validate the
corrected offsets. Update the XTILECFG construction in
`test_tile_zero_and_release` to match the fixed offset layout shown in the
assertions, ensuring colsb values are placed at the correct byte offsets (16+2t
for each tile) and rows values are placed at offsets 48+t, so the test will not
create an invalid configuration when AMX becomes reachable on capable hosts.
- Around line 32-42: The `for_dpbusd()` function is configuring the tiles
incorrectly. Currently, tile 1 is programmed with a `16×kb` shape and tile 2
with a `kb/4×64` shape, but the documented and expected contract (as shown in
the comments and used by `tile_dpbusd()`) requires tile 1 (tmm1) to be `kb/4×64`
for VNNI K×N operands and tile 2 (tmm2) to be `16×kb` for plain M×K operands.
Swap the tile configuration assignments so that tile 1 receives the `kb/4×64`
shape and tile 2 receives the `16×kb` shape. This fix needs to be applied
wherever these tile configurations are set in the `for_dpbusd()` function
implementation.
- Around line 77-83: The public functions TileConfig::for_dpbusd_8() and
tile_dpbusd_2x2() are missing documentation examples as required by coding
guidelines for all public Rust APIs. Add a no_run example to the doc comment for
each function demonstrating the strict tile/register contract they expose. Each
example should show the required sequence of operations: calling
tile_loadconfig, performing the necessary tile loads, and then calling
tile_release. This will make these APIs safer to consume by providing clear
usage patterns in the documentation.

In `@src/simd_avx2.rs`:
- Around line 1603-1629: The shr and shl methods in U16x32 only handle shift
amounts of 1, 2, 4, and 8, returning a zero vector for all other values, which
produces incorrect results. Replace both the shr and shl method implementations
to use the vector-count intrinsic forms (_mm256_srl_epi16 and _mm256_sll_epi16
respectively) instead of the immediate forms, since these accept runtime shift
amounts rather than compile-time constants. This approach mirrors how
U8x32::shr_epi16 handles variable shift amounts and will correctly support all
valid shift values.

In `@src/simd_avx512.rs`:
- Around line 1367-1371: The public method cmp_gt_mask in src/simd_avx512.rs is
missing a /// doc comment with examples, which violates the coding guidelines.
Add a doc comment above the cmp_gt_mask method that explains its purpose
(lane-wise ordered greater-than comparison returning a bitmask), describes the
return value semantics (bit i is set if self[i] > other[i]), explains its use
case (SIMD early-out without scalar iteration), and includes a clear example
demonstrating the method with different input scenarios to show both when lanes
are greater and when they are not.
- Around line 1353-1359: Both the `mul_add` method and the `cmp_gt_mask` method
are missing example code blocks in their doc comments. Add an "# Examples"
section to each method's existing doc comment with the provided example code
snippets that demonstrate their usage and expected behavior. The examples for
`mul_add` should show a fused multiply-add operation with sample values, and the
examples for `cmp_gt_mask` should demonstrate comparing two F32x8 values and
showing the resulting mask value.

---

Outside diff comments:
In `@src/hpc/int8_tile_gemm.rs`:
- Around line 367-397: The function int8_gemm_amx_tiled uses debug_assert!
macros to validate that AMX is available and that input dimensions m, n, and k
are properly aligned (multiples of 16, 16, and 64 respectively), but these
assertions are stripped in release builds. This allows a caller to invoke the
function with invalid preconditions on non-AMX hosts or with misaligned shapes,
causing either SIGILL or undefined behavior when the code reaches the raw
pointer and AMX instruction paths. Replace all four debug_assert! calls (the
amx_available check and the three shape alignment checks) with regular assert!
macro calls to ensure these validation checks execute in both debug and release
builds, protecting the safe public API from unsafe inputs.

---

Nitpick comments:
In `@src/backend/mod.rs`:
- Around line 170-183: The public GEMM functions gemm_i8 and gemm_bf16 have
detailed API documentation but are missing usage examples, which violates the
repo's requirement that all public APIs include documentation comments with
examples. Add a usage example to the doc comments for both functions using the
`///` style documentation. Include a small row-major matrix multiplication
example for each function that demonstrates how callers should use it and
clarifies the signedness and shape expectations (e.g., u8 × i8 → i32 for
gemm_i8, and appropriate types for gemm_bf16).

In `@src/backend/native.rs`:
- Around line 285-308: The `gemv_f32` and `gemv_f64` functions are missing
doctest examples in their doc comments, which violates the coding guideline
requiring all public APIs to have doc comments with examples. Add executable
doctest examples to both functions' doc comments that demonstrate the row-major
GEMV operation (y = alpha * A * x + beta * y) using concrete values for the
matrix, vectors, and scalars to illustrate how the computation works.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 24bb1a31-496b-46a6-b035-ca49a8d87553

📥 Commits

Reviewing files that changed from the base of the PR and between bdf243c and f3e6223.

📒 Files selected for processing (17)

.claude/AMX_GOTCHAS.md
.claude/agents/amx-savant.md
.claude/blackboard.md
.claude/board/AGENT_LOG.md
.claude/knowledge/amx-enablement-and-kernel.md
examples/amx_gemm_bench.rs
examples/amx_probe.rs
examples/amx_rb_probe.rs
src/backend/mod.rs
src/backend/native.rs
src/hpc/amx_matmul.rs
src/hpc/bf16_tile_gemm.rs
src/hpc/int8_tile_gemm.rs
src/simd.rs
src/simd_amx.rs
src/simd_avx2.rs
src/simd_avx512.rs

First probe toward the Morton-tile cascade substrate (codec-agnostic; the f32 cell value stands in for the eventual palette256 / helix Fisher-2z per-cell codec — wiring that is the next step). Substrate proven: * 4×4 Morton leaf tile = 16 cells = one F32x16 loaded from `simd_soa::MultiLaneColumn` (the "2bit×2bit" tile; gridlake SoA carrier). * Quadtree over T×T tiles (2×2 per level) ⇒ total grid (4T)² for T=2^k gives the ladder 64, 256, 1024, 4096, 16384, 64k, 256k. * Morton order ⇒ every quadtree node is a contiguous index range, so the aggregate min/max pyramid (the Belichtungsmesser "bands") is a flat bottom-up reduction; the cascade prunes a subtree whose [min,max] can't intersect the query band [q−r, q+r] (the 3-stroke band-miss early-exit, generalized). examples/morton_cascade_probe.rs, validated: cascade count == brute-force count for every (size, query) 64..262144 cells; selective queries prune ~63-67% of cells, broad queries correctly prune 0% (no over-prune). `required-features = ["std"]` (uses `ndarray::simd`). Next: wire the per-cell codec (palette256 / helix), the SIMD leaf mask via F32x16, and a fully non-materialized aggregate variant. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

…asured) Anti-"eigenvalue theater" probe putting numbers on the two load-bearing claims: 1. Golden-angle hemisphere sampling (γ = π(3−√5); θ = ½·arccos(1−2(n+0.5)/N)) maximises the minimum nearest-neighbour gap — 2.6–10× the regular (θ,φ) grid and up to 15× uniform-random at N=64..1024 — so the irrational stride genuinely prevents node collapse (the Fujifilm-X-Trans low-discrepancy point). The regular grid's lower NN-CoV is a mirage: it clusters at the pole, which the min-gap exposes. 2. Fisher-z percentile rank (= arctanh, strictly monotone in cosine) preserves every pairwise similarity ordering — 0 inversions vs cosine order — so ranks are a normalised [0,1] key comparable directly, never re-materialising cosine; and Fisher-z gives the rim (s=0.9) 5.3× the resolution of the centre. examples/golden_helix_probe.rs (pure std, required-features = ["std"]). https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

… complement) Option 1 ("stream pairwise into AMX-turbovec as a cheap edge residue that complements the 16×8bit=128bit coarse code"), measured end-to-end on this session's validated AMX int8 GEMM + the turbovec quant idea: vectors → AMX matmul_i8_to_i32 assigns each to a 256-palette (1-byte coarse, the per-edge code) → 4-bit TurboQuant of the residue (D/2-byte fine) → reconstruct. Results (Emerald Rapids, target-cpu=native): * AMX assignment is 100% accurate (finds the true centroid every vector). * coarse(1 byte) rel-err 0.148–0.287 → +turbovec 4-bit rel-err 0.010–0.020 = ~14× better reconstruction for a fixed small byte cost (D/2 = 32–64 B). So the coarse palette edge-code + a cheap turbovec residue is a real, measurable win, with AMX doing the pairwise. (Assignment GEMM here is a small shape → 17–34 GMAC/s, below the 2048³ 197 peak; correctness is the point.) examples/edge_residue_probe.rs, required-features = ["std"]. Path map for the siblings (both now plug into existing certified math): * Option 2 (Gaussian-splat pyramid): wire `hpc::pillar::ewa_sandwich_{2d,3d}` + `splat_invariants` + `cov_high_d` into the Morton cascade. * Option 3 (x264/x265 2×2/4×4 reality-check, eigenvalue/perturbation): use `jc::{weyl,cartan,pearl}` for the transform/resonance math. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

coderabbitai

🧹 Nitpick comments (1)

examples/morton_cascade_probe.rs (1)

115-150: ⚡ Quick win

Remove unused field parameter.

The field parameter is never used in cascade_count (line 148 just suppresses the warning). The function only reads from the col: &MultiLaneColumn parameter. Consider removing the unused parameter to simplify the signature.

♻️ Proposed refactor

-fn cascade_count(field: &[f32], col: &MultiLaneColumn, pyr: &Pyramid, q: f32, r: f32) -> (usize, usize) {
+fn cascade_count(col: &MultiLaneColumn, pyr: &Pyramid, q: f32, r: f32) -> (usize, usize) {
     let (lo, hi) = (q - r, q + r);
     let mut count = 0usize;
     let mut visited = 0usize;
     // Stack of (level, node_index_within_level).
     let mut stack = vec![(pyr.k as usize, 0usize)];
     let bytes = col.as_bytes();
     while let Some((level, node)) = stack.pop() {
         let (mn, mx) = pyr.levels[level][node];
         if mx < lo || mn > hi {
             continue; // band miss → prune whole subtree (early-exit)
         }
         if level == 0 {
             // Leaf tile = 16 cells = one F32x16 chunk in the SoA column.
             let off = node * 64; // 16 f32 × 4 bytes
             let chunk: [u8; 64] = bytes[off..off + 64].try_into().unwrap();
             let arr = f32x16_from_bytes(&chunk).to_array();
             for &v in arr.iter() {
                 if (v - q).abs() <= r {
                     count += 1;
                 }
             }
             visited += 16;
         } else {
             let base = node * 4;
             for c in 0..4 {
                 stack.push((level - 1, base + c));
             }
         }
     }
-    let _ = field; // field kept for the brute-force reference; cascade reads the SoA column
     (count, visited)
 }

And update the call site at line 180:

-        let (got, visited) = cascade_count(&field, &col, &pyr, q, r);
+        let (got, visited) = cascade_count(&col, &pyr, q, r);

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/morton_cascade_probe.rs` around lines 115 - 150, The `cascade_count`
function has an unused `field` parameter that is only suppressed with a `let _ =
field;` statement. Remove the `field: &[f32]` parameter from the `cascade_count`
function signature, delete the `let _ = field;` line that suppresses the
warning, and update the call site to `cascade_count` (which should be removing
the field argument from the function call) to match the new signature.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@examples/morton_cascade_probe.rs`:
- Around line 115-150: The `cascade_count` function has an unused `field`
parameter that is only suppressed with a `let _ = field;` statement. Remove the
`field: &[f32]` parameter from the `cascade_count` function signature, delete
the `let _ = field;` line that suppresses the warning, and update the call site
to `cascade_count` (which should be removing the field argument from the
function call) to match the new signature.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: fbfebf93-8e90-449a-8393-68db69ce96ae

📥 Commits

Reviewing files that changed from the base of the PR and between f3e6223 and d33b983.

📒 Files selected for processing (4)

Cargo.toml
examples/edge_residue_probe.rs
examples/golden_helix_probe.rs
examples/morton_cascade_probe.rs

@16

CI fmt: ran `cargo fmt --all` (edge_residue_probe / golden_helix_probe were committed unformatted). Correctness (codex / coderabbit): * simd_int_ops::gemm_u8_i8 — VNNI dispatch was compile-time `#[cfg(target_feature)]`, so the default x86-64-v3 GitHub build stripped both VNNI arms → scalar on Ice Lake / SPR / Zen 4 silicon (codex P2 regression). Now RUNTIME `is_x86_feature_detected!` (avx512vnni → avxvnni → scalar); compiles + reaches VNNI under v3, and removes the pre-existing `needless_return` clippy warning. * simd_avx2.rs U16x16 `shr`/`shl` — returned ZERO for any shift ∉{1,2,4,8}; now `_mm256_srl_epi16`/`_mm256_sll_epi16` with a runtime lane count (all shifts). * amx_matmul::for_dpbusd — tile 1/2 shapes now match the operand contract (tmm1 = VNNI kb/4×64, tmm2 = plain 16×kb); identical at kb=64 (tests unaffected), correct for kb<64. * backend::native gemv_f32/f64 — early-return on m==0 (don't slice `x[..n]` when there are no rows; matches the scalar reference no-op). * test_tile_zero_and_release — minimal config rewritten on the corrected XTILECFG offsets (colsb=4 @16 / rows=1 @48), with an explanatory note. Probes / docs: * amx_probe matmul_f32 validator — true relative-L2 + max-abs (the old `|e|.max(1.0)` denominator was an absolute test for |e|<1). * amx_rb_probe rb_32 — assert K % 64 == 0 (was silently truncating the tail). * doc `# Examples` (ignore) on the new public APIs: TileConfig::for_dpbusd_8, tile_dpbusd_2x2, F32x8::mul_add, F32x8::cmp_gt_mask. Validated under x86-64-v3 (GitHub target): clippy clean, `cargo build --examples` Finished; native AMX probes still all CORRECT. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

claude added 14 commits June 13, 2026 20:32

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

claude added 3 commits June 14, 2026 07:39

coderabbitai Bot reviewed Jun 14, 2026

View reviewed changes

AdaWorldAPI merged commit ce83ab6 into master Jun 14, 2026
18 checks passed

coderabbitai Bot mentioned this pull request Jun 14, 2026

Entropy-ladder foundation: reliability suite + edge-codec flavors + entropy_ladder #218

Merged

Conversation

AdaWorldAPI commented Jun 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

AMX (the headline)

Vertical-SIMD primitives

Validation & honest caveats

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai Bot commented Jun 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review limit reached

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Poem

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

AdaWorldAPI commented Jun 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 14, 2026 •

edited

Loading