Skip to content

AMX int8/bf16 tile GEMM — enable, fix 5 bugs, optimize (0→197 GMAC/s) + vertical-SIMD primitives#217

Merged
AdaWorldAPI merged 19 commits into
masterfrom
claude/wonderful-hawking-lodtql
Jun 14, 2026
Merged

AMX int8/bf16 tile GEMM — enable, fix 5 bugs, optimize (0→197 GMAC/s) + vertical-SIMD primitives#217
AdaWorldAPI merged 19 commits into
masterfrom
claude/wonderful-hawking-lodtql

Conversation

@AdaWorldAPI

@AdaWorldAPI AdaWorldAPI commented Jun 14, 2026

Copy link
Copy Markdown
Owner

Summary

Brings Intel AMX (Advanced Matrix Extensions) from silently disabled on every
capable host
to enabled, bit-exact, and fast on stable Rust 1.94, plus the
vertical-SIMD primitives that back the turbovec/lance-graph consumers.

Verified on Emerald Rapids (CPUID model 0xCF), kernel 6.18.5. The fixes are
ISA-level (identical on Sapphire Rapids / Granite Rapids).

AMX (the headline)

Enablement — the one-digit bug. ARCH_REQ_XCOMP_PERM (0x1023) is an
arch_prctl op (syscall 158); the code issued it on prctl (157) →
-EINVAL, so amx_available() returned false on every AMX host and the tile
path was dead code. Fixed 157→158 (simd_amx.rs). This is the "AMX is available
~50-79% of the time but needs a special way to enable it."

Four more tile-kernel bugs, surfaced once the path actually executed and
bisected with examples/amx_probe (prints a flushed line before each tile op):

  1. TileConfig rows/colsb regions swapped → LDTILECFG #GP (SIGSEGV).
  2. TILELOADD/TILESTORED SIB base/index swapped → deref'd stride as ptr (SIGSEGV).
  3. TDPBUSD/TDPBF16PS ModRM 0xC1 aliased two source tiles → same-tile #UD (SIGILL).
  4. Operand index+sign convention is the mirror of the naive SDM reading
    (plain M×K = ModRM.rm = unsigned; VNNI K×N = VEX.vvvv = signed) — proven
    by sweeping all four TDPB**D opcodes against sign-sensitive inputs.

The old header claimed "INLINE ASM TESTED ✓" — it never was: every AMX test
early-returns when amx_available() is false (which it always was), so the asm
had never run. That's why these bugs sat latent (incl. in the SPR-era
AMX_GOTCHAS.md, which shipped three of them).

Now correct: matmul_i8_to_i32 (TDPBUSD) bit-exact vs scalar across all
shapes; matmul_f32/matmul_bf16_to_f32 (TDPBF16PS) within BF16 tolerance.

Performance (Emerald Rapids, single-thread, target-cpu=native, all correct=true):

size first-correct hoist LDTILECFG 2×2 register-blocked
2048³ 14.8 169.7 197.7 GMAC/s (395 GOP/s, 600× scalar)
  • Hoisted the serializing LDTILECFG + VNNI-pack out of the per-tile loops (11.5×).
  • 2×2 register blocking (tile_dpbusd_2x2, 4 C accumulators + 2 A + 2 B tiles)
    with a BLIS-style panel loop order — halves load bytes/MAC; +16% over serial at
    2048³, wins at every size. All 12 new tile encodings validated in isolation
    (examples/amx_rb_probe).
  • rayon row-tile parallelism (feature-gated) → 237 GMAC/s for huge GEMMs.

Detection + ergonomics: amx_available() cached in a LazyLock (non-blocking
gates; process-wide arch_prctl requested once, rayon-safe); new CpuModel
(SPR/EMR/GNR/SierraForest) + cpu_model() + amx_report(), exported via
ndarray::simd::*. has_amx() && !available cleanly separates "no AMX silicon"
from "AMX not OS-enabled".

Docs (teach-to-fish): new .claude/knowledge/amx-enablement-and-kernel.md
(canonical reference + validated byte-code table), new .claude/agents/amx-savant.md,
and a rewritten .claude/AMX_GOTCHAS.md (corrects the three bugs it shipped; adds
a fault-signature→cause playbook).

Vertical-SIMD primitives

Native AVX2 U16x16 (__m256i), FastScan PQ4-ADC helpers on U16x16/U8x32,
F32x8::{mul_add, cmp_gt_mask}, U8x32::from_ptr (zero-overhead hot-loop load),
backend gemm_bf16/gemm_i8/gemv wired to real SIMD dispatch, and the AMX/VNNI
int8 GEMM re-exported through simd.rs — the consumer-facing surface turbovec and
lance-graph build on.

Validation & honest caveats

  • Validated via examples/{amx_probe, amx_rb_probe, amx_gemm_bench} and
    cargo clippy -p ndarray --lib [--features rayon] (clean except one pre-existing
    simd_int_ops needless_return). The lib unit-test target is pre-broken by
    unrelated src/tri.rs type-inference errors (noted in CLAUDE.md), so the examples
    are the gate.
  • rayon-over-rb (the 2×2+rayon combination) was attempted and reverted — it
    ran slower than rb-single and correct=false at 1024³/2048³ (an
    AMX-under-rayon-at-scale issue, documented in the knowledge doc). The banked wins
    (rb-single 197, 16×16-rayon 237) are both correct. Bigger lever ahead: full BLIS
    Mc/Nc/Kc cache blocking.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo


Generated by Claude Code

Summary by CodeRabbit

Release Notes

  • New Features

    • Register-blocked AMX tile GEMM acceleration with improved int8 performance.
    • CPU model reporting to support silicon-specific routing and clearer capability reporting.
    • New probe/benchmark examples for AMX performance and correctness.
  • Bug Fixes

    • Corrected AMX tile configuration and instruction encodings to improve runtime stability.
    • Fixed Linux AMX enablement permission handling.
    • Improved GEMM/GEMV dispatch behavior and added tiered GEMV execution for non-scalar modes.
  • Documentation

    • Updated AMX enablement and troubleshooting playbooks with a fault-signature “cause” index and expanded coverage.

claude added 14 commits June 13, 2026 20:32
Replace the scalar avx2_int_type!(U16x16, ...) polyfill ([u16;16] array +
scalar loops) with a real __m256i wrapper using _mm256_*_epi16, mirroring the
already-native U16x32 in simd_avx512.rs:1200. Both x86 dispatch arms (v3 AVX2,
and v4 AVX-512 which re-exports the avx2-tier U16x16) now surface the hardware
type — per agnostic-surface-cpu-matrix.md §A (U16x16 = __m256i) and the TD-T22
polyfill-audit cell.

Lets the PQ4-ADC FastScan u16 accumulate (turbovec's AVX2 search kernel, the
consumer migration this unblocks) run on hardware instead of a scalar [u16;16]
loop. Drop-in: 2141 lib tests pass, 0 failed (no consumer needed changes).

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
Adds the per-arch primitives a 4-bit-PQ ADC FastScan consumer (turbovec's AVX2
search kernel) needs to route through ndarray::simd instead of hand-rolled
_mm256_* intrinsics:

- U16x16::permute2x128::<IMM> / blend_epi32::<IMM> — the SUB-trick cross-128-lane
  combine (IMM 0x21 / 0xF0)
- U16x16::to_f32x8_lo / to_f32x8_hi — lossless u16->f32 (_mm256_cvtepu16_epi32 +
  _mm256_cvtepi32_ps); PQ4 accumulators are <= FLUSH_EVERY*127 so they fit f32
- U8x32::as_u16x16 — zero-cost bitcast bridging shuffle_bytes LUT results into the
  16-bit accumulator

All thin __m256i/__m256 intrinsic wrappers built on the native U16x16 (f766d18).
ndarray builds clean.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
_mm256_fmadd_ps wrapper (self*a + b, single rounding). The 256-bit sibling of the
existing F32x16::mul_add; the PQ4-ADC FastScan flush needs an 8-wide FMA for the
per-query fa = v_scale*partial + fa reduction. Validated end-to-end by turbovec's
avx2_topk_matches_default_kernel parity test (migrated AVX2 kernel == default).

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
_mm256_cmp_ps::<_CMP_GT_OQ> + _mm256_movemask_ps → u8 mask. The SIMD threshold-
prune primitive the PQ4-ADC FastScan heap needs (skip an 8-lane chunk with no
candidate above heap-min in one op). Validated by turbovec's avx2 parity test at v4.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
Unchecked _mm256_loadu_si256 from a raw ptr (caller guarantees >=32 bytes). The
from_slice assert + slice-index bounds check are a measurable tax in tight scan
loops; this is the consumer-side escape hatch for validated inner loops. Closed
turbovec's migrated AVX2 FastScan gap: 91.0us -> 78.8us/q at v4 (~1% vs upstream
raw AVX2 77.9us; was ~17%). Safety documented on the fn.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…audit TD-T2/3/5/7)

- gemm_bf16: scalar bf16_gemm_f32 -> amx_matmul::matmul_bf16_to_f32 (AMX TDPBF16PS
  -> AVX-512 VDPBF16PS -> scalar) via ArrayView2 wrapping; bit-equivalent on
  non-AMX/non-AVX512BF16 hosts.
- gemm_i8: 2-tier vnni_gemm -> simd_int_ops::gemm_u8_i8 (4-tier AMX TDPBUSD /
  VNNI-zmm / AVX-VNNI-ymm / scalar). Deliberately stays u8xi8 (NOT the signed
  matmul_i8_to_i32, which would misread A bytes >=128); same int8_gemm_i32 scalar
  reference, so bit-equivalent.
- gemv_f32/f64: scalar -> per-row dot_f32/dot_f64 on the AVX2/AVX-512 tiers
  (scalar tier kept byte-identical); reuses the parity-tested BLAS-1 dot, zero
  new unsafe.

Validated centrally: cargo test -p ndarray -> 2136 passed, 0 failed, 29 ignored.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…ut) + fmt

amx_gemm_bench reports amx_available() and dispatched matmul_i8_to_i32 throughput
vs scalar across square sizes — for verifying the AMX tile path lights up on
SPR/GNR silicon (on this host amx_available()=false, VNNI ceiling ~30 GMAC/s).
Also folds the rustfmt normalization of the from_ptr addition.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
… bugs

AMX was silently disabled on EVERY capable host and, once enabled, the
tile kernel produced wrong results / faulted. `amx_available()` now returns
true on this Emerald Rapids host and `matmul_i8_to_i32` is bit-exact vs the
scalar reference (validated by examples/amx_probe across single/multi
K-block and single/multi-tile shapes, incl. negative i8 through +128/bias).

Five bugs, found by bisecting with examples/amx_probe:

1. ENABLEMENT (simd_amx.rs): `ARCH_REQ_XCOMP_PERM` (0x1023) is an
   *arch_prctl* op (syscall 158), but the code issued it on `prctl`
   (syscall 157) → -EINVAL → step 4 always failed → amx_available()
   returned false on all AMX silicon. Fixed 157 → 158. This is the
   "special way to enable it": Linux 5.16+ requires the XTILEDATA
   permission request before any AMX tile op.

2. TILECFG (amx_matmul.rs for_dpbusd): rows/colsb regions were SWAPPED.
   Per the SDM XTILECFG layout, colsb[t] is a u16 at 16+2t and rows[t] a
   u8 at 48+t; the code wrote rows into the colsb region and vice-versa,
   yielding colsb[0]=4112 / rows[0]=64 → LDTILECFG #GP (SIGSEGV).

3. TILELOADD/TILESTORED SIB (amx_matmul.rs): SIB 0x08 = [base=rax,
   index=rcx] but the registers are bound rcx=ptr, rax=stride, so the
   engine dereferenced the stride (~64) as the base → SIGSEGV. Correct
   SIB is 0x01 = [base=rcx, index=rax].

4. TDPBUSD/TDPBF16PS ModRM (amx_matmul.rs): ModRM 0xC1 set rm=tmm1 while
   vvvv also = tmm1, aliasing the two sources → same-tile #UD (SIGILL).
   The SDM requires the three tile operands distinct; fixed to 0xC2
   (rm=tmm2).

5. OPERAND CONVENTION (int8/bf16 amx_path): on this silicon the AMX
   operand mapping is the mirror of the naive SDM reading — ModRM.rm is
   the plain M×K operand and VEX.vvvv the VNNI K×N operand, and for 0x71
   rm is UNSIGNED / vvvv is SIGNED (proven by sweeping all four TDPB**D
   opcodes on sign-sensitive inputs). So the kernels load A(plain)→tmm2
   and B(vnni)→tmm1; the standard TDPBUSD (0x71) is the correct int8
   variant.

The header's "INLINE ASM TESTED" was aspirational: every AMX test
early-returns when amx_available() is false (which it always was), so the
tile asm had never actually executed until bug #1 was fixed.

Throughput (Emerald Rapids, single-thread, target-cpu=native): int8 GEMM
2048^3 = 14.8 GMAC/s, 52.8x scalar, correct=true. (ldtilecfg + VNNI-pack
are still per-tile — a follow-up will hoist them; this commit is the
correctness checkpoint.)

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
….5x)

`int8_gemm_amx_tiled` called `int8_tile_gemm_16x16` per 16×16 output tile,
which ran LDTILECFG (a serializing instruction) + TILERELEASE and re-VNNI-
packed the B column band on EVERY tile — e.g. 256 LDTILECFGs for a 256²
output. Restructured so the driver:
  * loads the tile config ONCE up front, releases ONCE at the end;
  * VNNI-packs each B column band ONCE per j-tile (reused across all M/16
    row tiles), not once per (i,j) tile;
  * TILEZEROs the C accumulator and TILESTOREs the 16×16 result straight
    into its strided slot in C (row pitch n·4 bytes) — no scratch buffer,
    no copy loop.

Single-thread int8 GEMM on Emerald Rapids (target-cpu=native), all
correct=true vs scalar:

  size       before(naive)   after(hoisted)   vs scalar
  256^3       3.7 GMAC/s      65.7 GMAC/s        68x
  512^3       6.4 GMAC/s     124.9 GMAC/s       130x
  1024^3     10.4 GMAC/s     155.9 GMAC/s       177x
  2048^3     14.8 GMAC/s     169.7 GMAC/s       600x   (339 GOP/s)

`int8_tile_gemm_16x16` (the standalone 16×16 accumulating API) is unchanged
and still used by the unit tests. Further headroom: 2×2 register blocking
(reuse A/B tile loads across 4 C accumulators) and rayon over row tiles.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
The operand-swap fix touched both tile kernels; amx_probe now also checks
matmul_f32 (which routes through the BF16 TDPBF16PS path on AMX) against an
f32 scalar reference. Max relative error 0.0033–0.0051 across shapes — within
BF16's ~8-mantissa-bit precision. Confirms the index-swap fix is correct for
TDPBF16PS as well as TDPBUSD.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…s/agent

Detection (`amx_available()`) ran CPUID + XGETBV + the arch_prctl syscall on
EVERY matmul call. Cache it in a `LazyLock<bool>` computed once; all four gates
are non-blocking (no I/O, no lock, no spin) so the init can't stall, and the
arch_prctl grant is process-wide + inherited by all threads, so requesting it
once is correct even under rayon.

Add `CpuModel` detection from CPUID.01H (GenuineIntel only): SapphireRapids
(0x8F) / EmeraldRapids (0xCF) / GraniteRapids (0xAD/0xAE) / SierraForest (0xAF,
E-core, no AMX) / OtherX86 / NonX86, also cached. Exposed via
`ndarray::simd::{cpu_model, CpuModel, amx_report}`. `amx_report()` now reads
e.g. "AMX [Emerald Rapids expects_amx=true]: TILE=true INT8=true BF16=true
available=true" — so `has_amx() && !available` cleanly separates "no AMX
silicon" from "AMX present but not OS-enabled".

Answers the SPR-vs-EMR question: it is NOT the cause. The five bugs are
ISA/encoding-level and were latent on Sapphire Rapids too — the SPR-era
AMX_GOTCHAS.md literally shipped three of them (syscall 157, TDPBUSD=…73…C1,
swapped TILECFG layout). They never fired because detection never returned
true; EMR was just the first host to actually execute the tile path.

Docs (teach-to-fish):
- NEW .claude/knowledge/amx-enablement-and-kernel.md — canonical reference:
  enablement sequence, validated byte-code table, the empirically-verified
  (mirrored) operand convention, detection API, perf story, modus operandi.
- NEW .claude/agents/amx-savant.md — the AMX specialist agent (enable /
  troubleshoot / mindset), references the knowledge doc + gotchas.
- REWROTE .claude/AMX_GOTCHAS.md — corrected the three bugs it shipped, added a
  fault-signature→cause index (SIGSEGV@LDTILECFG / SIGSEGV@TILELOADD /
  SIGILL@TDPBUSD / wrong-values) and the "skipped test ≠ passing test" trap.
- simd_amx.rs module header: replaced the false "INLINE ASM TESTED ✓" claim
  (and wrong "256 MACs/instr" → 16384) with the verified status + bug history.
- amx_probe prints amx_report() + cpu_model() up front.

No W3C/ISA encoding source exists in-tree — the .claude "w3c" files are
semantic-web ontologies (SKOS/PROV-O/FIBO); the byte-code authority is the
Intel SDM opcode map + the empirical 4-opcode sweep (0x70/0x73 match the SDM
exactly, confirming the map).

Validated: amx_probe all CORRECT, clippy -p ndarray --lib clean (only the
pre-existing simd_int_ops needless_return), cpu_model()=EmeraldRapids.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…ated)

Adds `int8_gemm_amx_tiled_par`: B is VNNI-packed once into a shared read-only
buffer, then the M/16 row-tiles fan across the rayon pool (one task per 16-row
block of C). Each worker runs its own LDTILECFG (tile config is per-thread CPU
state) + the byte-for-byte validated serial tile sequence; AMX permission is
process-wide so it's inherited by every worker. Correctness unchanged
(amx_probe --features rayon: all shapes CORRECT, incl. the multi-row-tile
parallel cases).

HONEST FINDING: this AMX kernel is MEMORY-BANDWIDTH-BOUND. On this 4-core
Emerald Rapids box rayon-over-rows scales sublinearly — 2048³ 169.7 → 237.5
GMAC/s (1.4×), and it REGRESSES small/medium shapes (512³ 125 → 73) because
thread dispatch + the shared B-prepack dominate. So it's gated to large work
only: `m >= 32 && m·n·k >= 2e9` (≈ ≥2048³ here), keeping the fast serial path
for everything else. Many-core servers with more memory channels gain more.

The real lever for a bandwidth-bound kernel is reducing bytes/FLOP (2×2
register blocking — reuse each A/B tile load across 4 C accumulators), which
also lets rayon scale better; that's the queued next step.

Feature-gated behind `rayon` (no change to default builds).

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…act)

The AMX int8 GEMM is memory-bandwidth-bound (~8% of peak). The lever is fewer
bytes/MAC: 2×2 register blocking computes a 32×32 output block with four C
accumulators (tmm0-3) fed by two A row-tiles (tmm4-5) and two B col-tiles
(tmm6-7) via `tile_dpbusd_2x2`, so each A/B tile load serves TWO products —
half the tile loads of the 16×16 kernel.

New tile primitives (amx_matmul.rs), each validated by examples/amx_rb_probe
(32×32 block vs scalar, K=64/128/256, all CORRECT on first run):
  * tile_load extended to tmm3-7, tile_store to tmm1-3 (ModRM 0x04|tile<<3).
  * TileConfig::for_dpbusd_8 — all 8 tiles 16×64.
  * tile_dpbusd_2x2 — 4 TDPBUSDs (C4 E2 49 5E C4 / 41 5E CC / 49 5E D5 /
    41 5E DD); byte2 = ((~vvvv&0xF)<<3)|1, ModRM = 0xC0|dst<<3|rm; all 8
    operand tiles distinct (no same-tile #UD).

Driver (`int8_gemm_amx_tiled_rb`): BLIS-style loop order is what makes it win —
OUTER over 32-col panels, pack only that panel's two B bands (L2-resident) and
reuse across all row-blocks, INNER over 32-row blocks. (A first cut pre-packed
ALL of B ≈4 MB and thrashed cache, regressing 1024³ 156→125; the panel-local
pack fixes it and also halves A's DRAM re-reads.) 16-wide M/N remainders finish
on the validated 16×16 path. amx_probe adds 48×48 / 96×80 cases covering both
remainder strips + corner — all CORRECT.

Single-thread, target-cpu=native, all correct=true vs scalar:
  size    serial    rb(2×2)
  256³     65.7      80.8  (+23%)
  512³    124.9     132.0  (+6%)
  1024³   155.9     170.2  (+9%)
  2048³   169.7     197.7  (+16%)   395 GOP/s

Dispatch: rayon huge → _par; else m,n≥32 → _rb; else _serial. Next:
rayon-over-rb (projected ~270+ GMAC/s) + full BLIS cache blocking.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…ng at scale)

The knowledge doc projected "rayon-over-rb ~270+ GMAC/s" as easy headroom. It is
NOT: a first attempt (each rayon task runs _rb on a 64-row band) ran slower than
rb-single (155 vs 197 at 2048³ — per-task B re-pack) AND produced correct=false
at 1024³/2048³ (256³/512³ stayed correct; single-thread _rb is bit-exact at all
sizes, so it's an AMX-tiles-under-rayon-at-scale issue, not yet diagnosed).
Reverted. Documented the two prerequisites before reshipping (shared B pre-pack
+ a probe reproducing/explaining the large-size correctness failure) so a future
session doesn't re-walk it blind.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@AdaWorldAPI, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 35 minutes and 8 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4052a99e-4a4d-4318-9fcf-de0325cc1fd1

📥 Commits

Reviewing files that changed from the base of the PR and between d33b983 and e563fdc.

📒 Files selected for processing (9)
  • examples/amx_probe.rs
  • examples/amx_rb_probe.rs
  • examples/edge_residue_probe.rs
  • examples/golden_helix_probe.rs
  • src/backend/native.rs
  • src/hpc/amx_matmul.rs
  • src/simd_avx2.rs
  • src/simd_avx512.rs
  • src/simd_int_ops.rs
📝 Walkthrough

Walkthrough

Fixes the Linux AMX permission syscall (157→158), corrects multiple tile instruction encoding bugs (TILECFG field layout, SIB base/index, TDPBUSD/TDPBF16PS ModRM aliasing), swaps A/B operand placement in int8 and bf16 kernels, adds adaptive int8 GEMM dispatch (serial/Rayon/register-blocked 2×2), routes backend GEMM/GEMV through corrected tier dispatch, replaces array-backed U16x16 with a native AVX2 wrapper, adds three AMX validation examples plus three supplemental algorithm probes, and updates the documentation corpus.

Changes

AMX Enablement and Correctness Overhaul

Layer / File(s) Summary
AMX detection API: CpuModel, syscall fix, public re-exports
src/simd_amx.rs, src/simd.rs
Adds CpuModel enum and cached cpu_model(), fixes Linux AMX permission gate from syscall 157 to 158 (arch_prctl with ARCH_REQ_XCOMP_PERM/XTILEDATA=18), updates amx_report() to include CPU model and expects_amx, and re-exports amx_report/cpu_model/CpuModel from the public simd module.
AMX tile primitive encoding corrections and tile_dpbusd_2x2
src/hpc/amx_matmul.rs
Rewrites TileConfig::for_dpbusd with correct Intel SDM XTILECFG colbytes/rows field offsets (previously swapped). Fixes tile_load/tile_store SIB byte to use base=rcx, index=rax and expands tile_load to cover tmm0tmm7. Corrects tile_dpbusd ModRM byte sequence, corrects tile_dpbf16ps ModRM to prevent same-tile #UD. Adds pub unsafe fn tile_dpbusd_2x2() for 2×2 register-blocked four-accumulator TDPBUSD. Strengthens test_tile_config_creation to assert exact little-endian field values.
AMX kernel operand placement and int8 adaptive GEMM dispatch
src/hpc/int8_tile_gemm.rs, src/hpc/bf16_tile_gemm.rs
Fixes amx_path operand loading in both kernels: VNNI-packed i8 B → tmm1 (VEX.vvvv), plain u8 A → tmm2 (ModRM.rm) for int8; VNNI B → tile_load(1), plain A → tile_load(2) for bf16. Rewrites int8_gemm_amx_tiled into an adaptive driver selecting among serial (16×16 tile_store direct), Rayon-parallel (pre-packed B, per-worker tile config), and register-blocked 2×2 (tile_dpbusd_2x2 + strip fallback) strategies.
Backend GEMM/GEMV tier dispatch routing
src/backend/mod.rs, src/backend/native.rs
Switches gemm_i8 from vnni_gemm::int8_gemm_vnni to simd_int_ops::gemm_u8_i8. Rewrites gemm_bf16 to call hpc::amx_matmul::matmul_bf16_to_f32 via ArrayView2. Changes gemv_f32/gemv_f64 from unconditional scalar to tier-dispatched dot_f32/dot_f64 row-dot-product loops.
Native AVX2 U16x16 and F32x8 helpers
src/simd_avx2.rs, src/simd_avx512.rs
Replaces macro-generated array-backed U16x16 with a #[repr(transparent)] __m256i wrapper with lane ops, FastScan helpers, and F32x8 conversion. Adds U8x32::from_ptr (unsafe unaligned load) and U8x32::as_u16x16 (bitcast). Adds F32x8::mul_add (FMA3) and F32x8::cmp_gt_mask (ordered comparison + movemask).
AMX correctness validation examples
examples/amx_probe.rs, examples/amx_rb_probe.rs, examples/amx_gemm_bench.rs, Cargo.toml
Adds probe examples validating 16×16 tile GEMM, general int8/f32 matmul (with relative-error tolerance), and 2×2 register-blocked tile kernel. Adds a throughput benchmark reporting GMAC/s and speedup vs scalar for square int8 GEMM sizes. Gates examples behind required-features = ["std"] in manifest.
Supplemental algorithm probes
examples/edge_residue_probe.rs, examples/golden_helix_probe.rs, examples/morton_cascade_probe.rs
Adds edge-residue palette + 4-bit residue reconstruction probe via int8 GEMM. Adds golden-helix hemisphere sampling validation via nearest-neighbor angular statistics. Adds morton-cascade quadtree pruning benchmark with band-query early-exit evaluation.
AMX documentation corpus
.claude/AMX_GOTCHAS.md, .claude/agents/amx-savant.md, .claude/knowledge/amx-enablement-and-kernel.md, .claude/blackboard.md, .claude/board/AGENT_LOG.md
Rewrites AMX_GOTCHAS.md as a troubleshooting playbook with corrected syscall/SIB/TILECFG/TDPBUSD entries and a fault-signature index. Adds amx-savant agent spec with 4-gate enablement flow, operand convention, and troubleshooting workflow. Adds canonical amx-enablement-and-kernel.md covering the five-bug table, validated encoding table, detection API contract, and performance heuristics. Updates architecture decisions and agent log.

Sequence Diagram

sequenceDiagram
  participant App
  participant backend_gemm_i8
  participant simd_int_ops_gemm_u8_i8
  participant int8_gemm_amx_tiled
  participant amx_available
  participant TileConfig

  App->>backend_gemm_i8: gemm_i8(m,k,n,a,b,c)
  backend_gemm_i8->>simd_int_ops_gemm_u8_i8: dispatch via tier
  simd_int_ops_gemm_u8_i8->>amx_available: amx_available() [LazyLock]
  amx_available-->>simd_int_ops_gemm_u8_i8: true/false

  alt AMX available (M>=32, N>=32)
    simd_int_ops_gemm_u8_i8->>int8_gemm_amx_tiled: int8_gemm_amx_tiled_rb
    int8_gemm_amx_tiled->>TileConfig: for_dpbusd_8(k_bytes)
    int8_gemm_amx_tiled->>int8_gemm_amx_tiled: pack B into VNNI quads
    int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_dpbusd_2x2() × K/64 blocks
    int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_store 4 quadrants → c
  else AMX available serial
    simd_int_ops_gemm_u8_i8->>int8_gemm_amx_tiled: int8_gemm_amx_tiled_serial
    int8_gemm_amx_tiled->>TileConfig: for_dpbusd(k_bytes)
    int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_dpbusd() + tile_store → c
  else no AMX
    simd_int_ops_gemm_u8_i8->>simd_int_ops_gemm_u8_i8: VNNI/scalar fallback
  end
  int8_gemm_amx_tiled-->>App: c filled
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Poem

🐇 Hops onto the silicon stage...

The syscall was wrong, one-five-seven astray,
A single-digit fix unlocked AMX today.
SIB bytes corrected, ModRM aligned true,
Four tile accumulators blazing brand new.
The rabbit checked each colsb field with care—
Correct encodings found, and GMAC/s to spare! 🎉

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the primary changes: AMX tile GEMM enablement and bug fixes, achieving significant performance improvement (0→197 GMAC/s), plus new vertical-SIMD primitives. It is concise, specific, and directly reflects the main objectives documented in the PR.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3e6223088

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/backend/mod.rs
#[cfg(feature = "std")]
{
crate::hpc::vnni_gemm::int8_gemm_vnni(a, b, c, m, n, k);
crate::simd_int_ops::gemm_u8_i8(a, b, c, m, n, k);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep the runtime VNNI fallback in gemm_i8

In the default x86-64-v3/std build, this now calls simd_int_ops::gemm_u8_i8, whose AVX-512/AVX-VNNI arms are behind #[cfg(target_feature = ...)] after the AMX alignment check; if AMX is unavailable or the shape is not 16/16/64-aligned, it therefore falls through to scalar even on Ice Lake/Zen 4/Sapphire Rapids hardware with VNNI. The previous hpc::vnni_gemm::int8_gemm_vnni used runtime simd_caps() detection and handled arbitrary shapes, so backend::gemm_i8 regresses hardware-supported VNNI calls to the scalar kernel in normal baseline builds.

Useful? React with 👍 / 👎.

examples/{amx_gemm_bench,amx_probe,amx_rb_probe}.rs import `ndarray::simd` and
`ndarray::hpc`, both `#[cfg(feature = "std")]`. With std as a default feature
they built locally, but the `--no-default-features` CI job compiles examples
with std off → `error[E0432]: unresolved import ndarray::simd` (item configured
out, gated behind `std`). Declared the three examples with
`required-features = ["std"]` (same pattern as `ocr_benchmark`) so non-std jobs
skip them.

Verified: `cargo build -p ndarray --no-default-features
--features portable-atomic-critical-section --examples` now finishes clean
(examples skipped) instead of failing on amx_gemm_bench.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/hpc/int8_tile_gemm.rs (1)

367-397: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Don’t rely on debug_assert! for a safe AMX entry point.

In release builds, a caller can invoke int8_gemm_amx_tiled() on a non-AMX host or with misaligned shapes, and this function will still fall into the raw-pointer/tile-instruction paths below. That turns a safe public API into either an immediate SIGILL (amx_available() == false) or full 16×16 tile loads/stores against a shape the code no longer proves is aligned.

Suggested fix
-    debug_assert!(crate::hpc::amx_matmul::amx_available());
-    debug_assert_eq!(m % 16, 0, "int8_gemm_amx_tiled: M must be multiple of 16");
-    debug_assert_eq!(n % 16, 0, "int8_gemm_amx_tiled: N must be multiple of 16");
-    debug_assert_eq!(k % 64, 0, "int8_gemm_amx_tiled: K must be multiple of 64");
+    assert!(
+        crate::hpc::amx_matmul::amx_available(),
+        "int8_gemm_amx_tiled: AMX must be available"
+    );
+    assert_eq!(m % 16, 0, "int8_gemm_amx_tiled: M must be multiple of 16");
+    assert_eq!(n % 16, 0, "int8_gemm_amx_tiled: N must be multiple of 16");
+    assert_eq!(k % 64, 0, "int8_gemm_amx_tiled: K must be multiple of 64");
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/hpc/int8_tile_gemm.rs` around lines 367 - 397, The function
int8_gemm_amx_tiled uses debug_assert! macros to validate that AMX is available
and that input dimensions m, n, and k are properly aligned (multiples of 16, 16,
and 64 respectively), but these assertions are stripped in release builds. This
allows a caller to invoke the function with invalid preconditions on non-AMX
hosts or with misaligned shapes, causing either SIGILL or undefined behavior
when the code reaches the raw pointer and AMX instruction paths. Replace all
four debug_assert! calls (the amx_available check and the three shape alignment
checks) with regular assert! macro calls to ensure these validation checks
execute in both debug and release builds, protecting the safe public API from
unsafe inputs.
🧹 Nitpick comments (2)
src/backend/mod.rs (1)

170-183: ⚡ Quick win

Add examples to the new public GEMM docs.

Both gemm_i8 and gemm_bf16 now have detailed API docs, but they still miss a usage example. A tiny row-major example for each would satisfy the repo contract and pin down the signedness/shape expectations for callers. As per coding guidelines, "All public APIs (public functions and methods) must have /// doc comments with examples".

Also applies to: 201-208

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/mod.rs` around lines 170 - 183, The public GEMM functions gemm_i8
and gemm_bf16 have detailed API documentation but are missing usage examples,
which violates the repo's requirement that all public APIs include documentation
comments with examples. Add a usage example to the doc comments for both
functions using the `///` style documentation. Include a small row-major matrix
multiplication example for each function that demonstrates how callers should
use it and clarifies the signedness and shape expectations (e.g., u8 × i8 → i32
for gemm_i8, and appropriate types for gemm_bf16).

Source: Coding guidelines

src/backend/native.rs (1)

285-308: ⚡ Quick win

Add examples to the public GEMV docs.

gemv_f32 and gemv_f64 now have expanded API docs, but they still do not include examples. A small doctest for the row-major contract would bring these entrypoints in line with the repository requirement. As per coding guidelines, "All public APIs (public functions and methods) must have /// doc comments with examples".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/backend/native.rs` around lines 285 - 308, The `gemv_f32` and `gemv_f64`
functions are missing doctest examples in their doc comments, which violates the
coding guideline requiring all public APIs to have doc comments with examples.
Add executable doctest examples to both functions' doc comments that demonstrate
the row-major GEMV operation (y = alpha * A * x + beta * y) using concrete
values for the matrix, vectors, and scalars to illustrate how the computation
works.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/amx_probe.rs`:
- Around line 82-110: The relative error calculation in the test_matmul_f32
function is incorrect because using e.abs().max(1.0) as the denominator converts
small values into absolute-error tests rather than true relative-error tests.
This allows large relative regressions on small outputs to incorrectly appear
correct. Fix the denominator calculation in the loop where max_rel is computed
to use a true relative metric: either use a small epsilon as a floor while still
dividing by the actual expected value's magnitude, or compute both absolute and
relative error separately and use the maximum of the two comparisons against the
tolerance threshold.

In `@examples/amx_rb_probe.rs`:
- Around line 45-61: Add validation to reject K values that are not divisible by
64 instead of silently truncating them. Before computing k_blocks with the
division operation k / 64, check if k modulo 64 is not equal to zero and either
return an error or panic with a clear message indicating that K must be a
multiple of 64. This ensures the function fails fast when given invalid input
rather than producing a partial product from the truncated k_blocks calculation
in the loop.

In `@src/backend/native.rs`:
- Around line 289-303: The gemv_f32 and gemv_f64 functions eagerly slice x[..n]
before checking if there are any rows to process. This causes a panic when m
equals zero and x.len() is less than n, whereas the scalar implementations
gracefully handle this case by returning early. Add an early return check at the
beginning of the non-Scalar branch in both gemv_f32 (before line 295 where xn is
assigned) and gemv_f64 (the sibling function also applying the same pattern): if
m equals zero, return immediately without attempting to slice x, preserving the
no-op behavior when there are no matrix rows.

In `@src/hpc/amx_matmul.rs`:
- Around line 827-835: The `test_tile_zero_and_release` test function still
builds the old incorrect XTILECFG layout with swapped offsets (using `data[16] =
1` and `data[48] = 4`), while the assertions in the diff now validate the
corrected offsets. Update the XTILECFG construction in
`test_tile_zero_and_release` to match the fixed offset layout shown in the
assertions, ensuring colsb values are placed at the correct byte offsets (16+2t
for each tile) and rows values are placed at offsets 48+t, so the test will not
create an invalid configuration when AMX becomes reachable on capable hosts.
- Around line 32-42: The `for_dpbusd()` function is configuring the tiles
incorrectly. Currently, tile 1 is programmed with a `16×kb` shape and tile 2
with a `kb/4×64` shape, but the documented and expected contract (as shown in
the comments and used by `tile_dpbusd()`) requires tile 1 (tmm1) to be `kb/4×64`
for VNNI K×N operands and tile 2 (tmm2) to be `16×kb` for plain M×K operands.
Swap the tile configuration assignments so that tile 1 receives the `kb/4×64`
shape and tile 2 receives the `16×kb` shape. This fix needs to be applied
wherever these tile configurations are set in the `for_dpbusd()` function
implementation.
- Around line 77-83: The public functions TileConfig::for_dpbusd_8() and
tile_dpbusd_2x2() are missing documentation examples as required by coding
guidelines for all public Rust APIs. Add a no_run example to the doc comment for
each function demonstrating the strict tile/register contract they expose. Each
example should show the required sequence of operations: calling
tile_loadconfig, performing the necessary tile loads, and then calling
tile_release. This will make these APIs safer to consume by providing clear
usage patterns in the documentation.

In `@src/simd_avx2.rs`:
- Around line 1603-1629: The shr and shl methods in U16x32 only handle shift
amounts of 1, 2, 4, and 8, returning a zero vector for all other values, which
produces incorrect results. Replace both the shr and shl method implementations
to use the vector-count intrinsic forms (_mm256_srl_epi16 and _mm256_sll_epi16
respectively) instead of the immediate forms, since these accept runtime shift
amounts rather than compile-time constants. This approach mirrors how
U8x32::shr_epi16 handles variable shift amounts and will correctly support all
valid shift values.

In `@src/simd_avx512.rs`:
- Around line 1367-1371: The public method cmp_gt_mask in src/simd_avx512.rs is
missing a /// doc comment with examples, which violates the coding guidelines.
Add a doc comment above the cmp_gt_mask method that explains its purpose
(lane-wise ordered greater-than comparison returning a bitmask), describes the
return value semantics (bit i is set if self[i] > other[i]), explains its use
case (SIMD early-out without scalar iteration), and includes a clear example
demonstrating the method with different input scenarios to show both when lanes
are greater and when they are not.
- Around line 1353-1359: Both the `mul_add` method and the `cmp_gt_mask` method
are missing example code blocks in their doc comments. Add an "# Examples"
section to each method's existing doc comment with the provided example code
snippets that demonstrate their usage and expected behavior. The examples for
`mul_add` should show a fused multiply-add operation with sample values, and the
examples for `cmp_gt_mask` should demonstrate comparing two F32x8 values and
showing the resulting mask value.

---

Outside diff comments:
In `@src/hpc/int8_tile_gemm.rs`:
- Around line 367-397: The function int8_gemm_amx_tiled uses debug_assert!
macros to validate that AMX is available and that input dimensions m, n, and k
are properly aligned (multiples of 16, 16, and 64 respectively), but these
assertions are stripped in release builds. This allows a caller to invoke the
function with invalid preconditions on non-AMX hosts or with misaligned shapes,
causing either SIGILL or undefined behavior when the code reaches the raw
pointer and AMX instruction paths. Replace all four debug_assert! calls (the
amx_available check and the three shape alignment checks) with regular assert!
macro calls to ensure these validation checks execute in both debug and release
builds, protecting the safe public API from unsafe inputs.

---

Nitpick comments:
In `@src/backend/mod.rs`:
- Around line 170-183: The public GEMM functions gemm_i8 and gemm_bf16 have
detailed API documentation but are missing usage examples, which violates the
repo's requirement that all public APIs include documentation comments with
examples. Add a usage example to the doc comments for both functions using the
`///` style documentation. Include a small row-major matrix multiplication
example for each function that demonstrates how callers should use it and
clarifies the signedness and shape expectations (e.g., u8 × i8 → i32 for
gemm_i8, and appropriate types for gemm_bf16).

In `@src/backend/native.rs`:
- Around line 285-308: The `gemv_f32` and `gemv_f64` functions are missing
doctest examples in their doc comments, which violates the coding guideline
requiring all public APIs to have doc comments with examples. Add executable
doctest examples to both functions' doc comments that demonstrate the row-major
GEMV operation (y = alpha * A * x + beta * y) using concrete values for the
matrix, vectors, and scalars to illustrate how the computation works.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 24bb1a31-496b-46a6-b035-ca49a8d87553

📥 Commits

Reviewing files that changed from the base of the PR and between bdf243c and f3e6223.

📒 Files selected for processing (17)
  • .claude/AMX_GOTCHAS.md
  • .claude/agents/amx-savant.md
  • .claude/blackboard.md
  • .claude/board/AGENT_LOG.md
  • .claude/knowledge/amx-enablement-and-kernel.md
  • examples/amx_gemm_bench.rs
  • examples/amx_probe.rs
  • examples/amx_rb_probe.rs
  • src/backend/mod.rs
  • src/backend/native.rs
  • src/hpc/amx_matmul.rs
  • src/hpc/bf16_tile_gemm.rs
  • src/hpc/int8_tile_gemm.rs
  • src/simd.rs
  • src/simd_amx.rs
  • src/simd_avx2.rs
  • src/simd_avx512.rs

Comment thread examples/amx_probe.rs Outdated
Comment thread examples/amx_rb_probe.rs
Comment thread src/backend/native.rs
Comment thread src/hpc/amx_matmul.rs
Comment thread src/hpc/amx_matmul.rs
Comment thread src/hpc/amx_matmul.rs
Comment thread src/simd_avx2.rs
Comment thread src/simd_avx512.rs
Comment thread src/simd_avx512.rs
claude added 3 commits June 14, 2026 07:39
First probe toward the Morton-tile cascade substrate (codec-agnostic; the f32
cell value stands in for the eventual palette256 / helix Fisher-2z per-cell
codec — wiring that is the next step).

Substrate proven:
  * 4×4 Morton leaf tile = 16 cells = one F32x16 loaded from
    `simd_soa::MultiLaneColumn` (the "2bit×2bit" tile; gridlake SoA carrier).
  * Quadtree over T×T tiles (2×2 per level) ⇒ total grid (4T)² for T=2^k gives
    the ladder 64, 256, 1024, 4096, 16384, 64k, 256k.
  * Morton order ⇒ every quadtree node is a contiguous index range, so the
    aggregate min/max pyramid (the Belichtungsmesser "bands") is a flat bottom-up
    reduction; the cascade prunes a subtree whose [min,max] can't intersect the
    query band [q−r, q+r] (the 3-stroke band-miss early-exit, generalized).

examples/morton_cascade_probe.rs, validated: cascade count == brute-force count
for every (size, query) 64..262144 cells; selective queries prune ~63-67% of
cells, broad queries correctly prune 0% (no over-prune). `required-features =
["std"]` (uses `ndarray::simd`).

Next: wire the per-cell codec (palette256 / helix), the SIMD leaf mask via
F32x16, and a fully non-materialized aggregate variant.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…asured)

Anti-"eigenvalue theater" probe putting numbers on the two load-bearing claims:

1. Golden-angle hemisphere sampling (γ = π(3−√5); θ = ½·arccos(1−2(n+0.5)/N))
   maximises the minimum nearest-neighbour gap — 2.6–10× the regular (θ,φ) grid
   and up to 15× uniform-random at N=64..1024 — so the irrational stride genuinely
   prevents node collapse (the Fujifilm-X-Trans low-discrepancy point). The
   regular grid's lower NN-CoV is a mirage: it clusters at the pole, which the
   min-gap exposes.

2. Fisher-z percentile rank (= arctanh, strictly monotone in cosine) preserves
   every pairwise similarity ordering — 0 inversions vs cosine order — so ranks
   are a normalised [0,1] key comparable directly, never re-materialising cosine;
   and Fisher-z gives the rim (s=0.9) 5.3× the resolution of the centre.

examples/golden_helix_probe.rs (pure std, required-features = ["std"]).

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
… complement)

Option 1 ("stream pairwise into AMX-turbovec as a cheap edge residue that
complements the 16×8bit=128bit coarse code"), measured end-to-end on this
session's validated AMX int8 GEMM + the turbovec quant idea:

  vectors → AMX matmul_i8_to_i32 assigns each to a 256-palette (1-byte coarse,
  the per-edge code) → 4-bit TurboQuant of the residue (D/2-byte fine) →
  reconstruct.

Results (Emerald Rapids, target-cpu=native):
  * AMX assignment is 100% accurate (finds the true centroid every vector).
  * coarse(1 byte) rel-err 0.148–0.287  →  +turbovec 4-bit  rel-err 0.010–0.020
    = ~14× better reconstruction for a fixed small byte cost (D/2 = 32–64 B).

So the coarse palette edge-code + a cheap turbovec residue is a real,
measurable win, with AMX doing the pairwise. (Assignment GEMM here is a small
shape → 17–34 GMAC/s, below the 2048³ 197 peak; correctness is the point.)

examples/edge_residue_probe.rs, required-features = ["std"].

Path map for the siblings (both now plug into existing certified math):
  * Option 2 (Gaussian-splat pyramid): wire `hpc::pillar::ewa_sandwich_{2d,3d}`
    + `splat_invariants` + `cov_high_d` into the Morton cascade.
  * Option 3 (x264/x265 2×2/4×4 reality-check, eigenvalue/perturbation): use
    `jc::{weyl,cartan,pearl}` for the transform/resonance math.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
examples/morton_cascade_probe.rs (1)

115-150: ⚡ Quick win

Remove unused field parameter.

The field parameter is never used in cascade_count (line 148 just suppresses the warning). The function only reads from the col: &MultiLaneColumn parameter. Consider removing the unused parameter to simplify the signature.

♻️ Proposed refactor
-fn cascade_count(field: &[f32], col: &MultiLaneColumn, pyr: &Pyramid, q: f32, r: f32) -> (usize, usize) {
+fn cascade_count(col: &MultiLaneColumn, pyr: &Pyramid, q: f32, r: f32) -> (usize, usize) {
     let (lo, hi) = (q - r, q + r);
     let mut count = 0usize;
     let mut visited = 0usize;
     // Stack of (level, node_index_within_level).
     let mut stack = vec![(pyr.k as usize, 0usize)];
     let bytes = col.as_bytes();
     while let Some((level, node)) = stack.pop() {
         let (mn, mx) = pyr.levels[level][node];
         if mx < lo || mn > hi {
             continue; // band miss → prune whole subtree (early-exit)
         }
         if level == 0 {
             // Leaf tile = 16 cells = one F32x16 chunk in the SoA column.
             let off = node * 64; // 16 f32 × 4 bytes
             let chunk: [u8; 64] = bytes[off..off + 64].try_into().unwrap();
             let arr = f32x16_from_bytes(&chunk).to_array();
             for &v in arr.iter() {
                 if (v - q).abs() <= r {
                     count += 1;
                 }
             }
             visited += 16;
         } else {
             let base = node * 4;
             for c in 0..4 {
                 stack.push((level - 1, base + c));
             }
         }
     }
-    let _ = field; // field kept for the brute-force reference; cascade reads the SoA column
     (count, visited)
 }

And update the call site at line 180:

-        let (got, visited) = cascade_count(&field, &col, &pyr, q, r);
+        let (got, visited) = cascade_count(&col, &pyr, q, r);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/morton_cascade_probe.rs` around lines 115 - 150, The `cascade_count`
function has an unused `field` parameter that is only suppressed with a `let _ =
field;` statement. Remove the `field: &[f32]` parameter from the `cascade_count`
function signature, delete the `let _ = field;` line that suppresses the
warning, and update the call site to `cascade_count` (which should be removing
the field argument from the function call) to match the new signature.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@examples/morton_cascade_probe.rs`:
- Around line 115-150: The `cascade_count` function has an unused `field`
parameter that is only suppressed with a `let _ = field;` statement. Remove the
`field: &[f32]` parameter from the `cascade_count` function signature, delete
the `let _ = field;` line that suppresses the warning, and update the call site
to `cascade_count` (which should be removing the field argument from the
function call) to match the new signature.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: fbfebf93-8e90-449a-8393-68db69ce96ae

📥 Commits

Reviewing files that changed from the base of the PR and between f3e6223 and d33b983.

📒 Files selected for processing (4)
  • Cargo.toml
  • examples/edge_residue_probe.rs
  • examples/golden_helix_probe.rs
  • examples/morton_cascade_probe.rs

CI fmt: ran `cargo fmt --all` (edge_residue_probe / golden_helix_probe were
committed unformatted).

Correctness (codex / coderabbit):
* simd_int_ops::gemm_u8_i8 — VNNI dispatch was compile-time `#[cfg(target_feature)]`,
  so the default x86-64-v3 GitHub build stripped both VNNI arms → scalar on
  Ice Lake / SPR / Zen 4 silicon (codex P2 regression). Now RUNTIME
  `is_x86_feature_detected!` (avx512vnni → avxvnni → scalar); compiles + reaches
  VNNI under v3, and removes the pre-existing `needless_return` clippy warning.
* simd_avx2.rs U16x16 `shr`/`shl` — returned ZERO for any shift ∉{1,2,4,8};
  now `_mm256_srl_epi16`/`_mm256_sll_epi16` with a runtime lane count (all shifts).
* amx_matmul::for_dpbusd — tile 1/2 shapes now match the operand contract
  (tmm1 = VNNI kb/4×64, tmm2 = plain 16×kb); identical at kb=64 (tests
  unaffected), correct for kb<64.
* backend::native gemv_f32/f64 — early-return on m==0 (don't slice `x[..n]`
  when there are no rows; matches the scalar reference no-op).
* test_tile_zero_and_release — minimal config rewritten on the corrected
  XTILECFG offsets (colsb=4 @16 / rows=1 @48), with an explanatory note.

Probes / docs:
* amx_probe matmul_f32 validator — true relative-L2 + max-abs (the old
  `|e|.max(1.0)` denominator was an absolute test for |e|<1).
* amx_rb_probe rb_32 — assert K % 64 == 0 (was silently truncating the tail).
* doc `# Examples` (ignore) on the new public APIs: TileConfig::for_dpbusd_8,
  tile_dpbusd_2x2, F32x8::mul_add, F32x8::cmp_gt_mask.

Validated under x86-64-v3 (GitHub target): clippy clean, `cargo build
--examples` Finished; native AMX probes still all CORRECT.

https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
@AdaWorldAPI AdaWorldAPI merged commit ce83ab6 into master Jun 14, 2026
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants