AMX int8/bf16 tile GEMM — enable, fix 5 bugs, optimize (0→197 GMAC/s) + vertical-SIMD primitives#217
Conversation
Replace the scalar avx2_int_type!(U16x16, ...) polyfill ([u16;16] array + scalar loops) with a real __m256i wrapper using _mm256_*_epi16, mirroring the already-native U16x32 in simd_avx512.rs:1200. Both x86 dispatch arms (v3 AVX2, and v4 AVX-512 which re-exports the avx2-tier U16x16) now surface the hardware type — per agnostic-surface-cpu-matrix.md §A (U16x16 = __m256i) and the TD-T22 polyfill-audit cell. Lets the PQ4-ADC FastScan u16 accumulate (turbovec's AVX2 search kernel, the consumer migration this unblocks) run on hardware instead of a scalar [u16;16] loop. Drop-in: 2141 lib tests pass, 0 failed (no consumer needed changes). https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
Adds the per-arch primitives a 4-bit-PQ ADC FastScan consumer (turbovec's AVX2 search kernel) needs to route through ndarray::simd instead of hand-rolled _mm256_* intrinsics: - U16x16::permute2x128::<IMM> / blend_epi32::<IMM> — the SUB-trick cross-128-lane combine (IMM 0x21 / 0xF0) - U16x16::to_f32x8_lo / to_f32x8_hi — lossless u16->f32 (_mm256_cvtepu16_epi32 + _mm256_cvtepi32_ps); PQ4 accumulators are <= FLUSH_EVERY*127 so they fit f32 - U8x32::as_u16x16 — zero-cost bitcast bridging shuffle_bytes LUT results into the 16-bit accumulator All thin __m256i/__m256 intrinsic wrappers built on the native U16x16 (f766d18). ndarray builds clean. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
_mm256_fmadd_ps wrapper (self*a + b, single rounding). The 256-bit sibling of the existing F32x16::mul_add; the PQ4-ADC FastScan flush needs an 8-wide FMA for the per-query fa = v_scale*partial + fa reduction. Validated end-to-end by turbovec's avx2_topk_matches_default_kernel parity test (migrated AVX2 kernel == default). https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
_mm256_cmp_ps::<_CMP_GT_OQ> + _mm256_movemask_ps → u8 mask. The SIMD threshold- prune primitive the PQ4-ADC FastScan heap needs (skip an 8-lane chunk with no candidate above heap-min in one op). Validated by turbovec's avx2 parity test at v4. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
Unchecked _mm256_loadu_si256 from a raw ptr (caller guarantees >=32 bytes). The from_slice assert + slice-index bounds check are a measurable tax in tight scan loops; this is the consumer-side escape hatch for validated inner loops. Closed turbovec's migrated AVX2 FastScan gap: 91.0us -> 78.8us/q at v4 (~1% vs upstream raw AVX2 77.9us; was ~17%). Safety documented on the fn. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…audit TD-T2/3/5/7) - gemm_bf16: scalar bf16_gemm_f32 -> amx_matmul::matmul_bf16_to_f32 (AMX TDPBF16PS -> AVX-512 VDPBF16PS -> scalar) via ArrayView2 wrapping; bit-equivalent on non-AMX/non-AVX512BF16 hosts. - gemm_i8: 2-tier vnni_gemm -> simd_int_ops::gemm_u8_i8 (4-tier AMX TDPBUSD / VNNI-zmm / AVX-VNNI-ymm / scalar). Deliberately stays u8xi8 (NOT the signed matmul_i8_to_i32, which would misread A bytes >=128); same int8_gemm_i32 scalar reference, so bit-equivalent. - gemv_f32/f64: scalar -> per-row dot_f32/dot_f64 on the AVX2/AVX-512 tiers (scalar tier kept byte-identical); reuses the parity-tested BLAS-1 dot, zero new unsafe. Validated centrally: cargo test -p ndarray -> 2136 passed, 0 failed, 29 ignored. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…ut) + fmt amx_gemm_bench reports amx_available() and dispatched matmul_i8_to_i32 throughput vs scalar across square sizes — for verifying the AMX tile path lights up on SPR/GNR silicon (on this host amx_available()=false, VNNI ceiling ~30 GMAC/s). Also folds the rustfmt normalization of the from_ptr addition. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
… bugs AMX was silently disabled on EVERY capable host and, once enabled, the tile kernel produced wrong results / faulted. `amx_available()` now returns true on this Emerald Rapids host and `matmul_i8_to_i32` is bit-exact vs the scalar reference (validated by examples/amx_probe across single/multi K-block and single/multi-tile shapes, incl. negative i8 through +128/bias). Five bugs, found by bisecting with examples/amx_probe: 1. ENABLEMENT (simd_amx.rs): `ARCH_REQ_XCOMP_PERM` (0x1023) is an *arch_prctl* op (syscall 158), but the code issued it on `prctl` (syscall 157) → -EINVAL → step 4 always failed → amx_available() returned false on all AMX silicon. Fixed 157 → 158. This is the "special way to enable it": Linux 5.16+ requires the XTILEDATA permission request before any AMX tile op. 2. TILECFG (amx_matmul.rs for_dpbusd): rows/colsb regions were SWAPPED. Per the SDM XTILECFG layout, colsb[t] is a u16 at 16+2t and rows[t] a u8 at 48+t; the code wrote rows into the colsb region and vice-versa, yielding colsb[0]=4112 / rows[0]=64 → LDTILECFG #GP (SIGSEGV). 3. TILELOADD/TILESTORED SIB (amx_matmul.rs): SIB 0x08 = [base=rax, index=rcx] but the registers are bound rcx=ptr, rax=stride, so the engine dereferenced the stride (~64) as the base → SIGSEGV. Correct SIB is 0x01 = [base=rcx, index=rax]. 4. TDPBUSD/TDPBF16PS ModRM (amx_matmul.rs): ModRM 0xC1 set rm=tmm1 while vvvv also = tmm1, aliasing the two sources → same-tile #UD (SIGILL). The SDM requires the three tile operands distinct; fixed to 0xC2 (rm=tmm2). 5. OPERAND CONVENTION (int8/bf16 amx_path): on this silicon the AMX operand mapping is the mirror of the naive SDM reading — ModRM.rm is the plain M×K operand and VEX.vvvv the VNNI K×N operand, and for 0x71 rm is UNSIGNED / vvvv is SIGNED (proven by sweeping all four TDPB**D opcodes on sign-sensitive inputs). So the kernels load A(plain)→tmm2 and B(vnni)→tmm1; the standard TDPBUSD (0x71) is the correct int8 variant. The header's "INLINE ASM TESTED" was aspirational: every AMX test early-returns when amx_available() is false (which it always was), so the tile asm had never actually executed until bug #1 was fixed. Throughput (Emerald Rapids, single-thread, target-cpu=native): int8 GEMM 2048^3 = 14.8 GMAC/s, 52.8x scalar, correct=true. (ldtilecfg + VNNI-pack are still per-tile — a follow-up will hoist them; this commit is the correctness checkpoint.) https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
….5x)
`int8_gemm_amx_tiled` called `int8_tile_gemm_16x16` per 16×16 output tile,
which ran LDTILECFG (a serializing instruction) + TILERELEASE and re-VNNI-
packed the B column band on EVERY tile — e.g. 256 LDTILECFGs for a 256²
output. Restructured so the driver:
* loads the tile config ONCE up front, releases ONCE at the end;
* VNNI-packs each B column band ONCE per j-tile (reused across all M/16
row tiles), not once per (i,j) tile;
* TILEZEROs the C accumulator and TILESTOREs the 16×16 result straight
into its strided slot in C (row pitch n·4 bytes) — no scratch buffer,
no copy loop.
Single-thread int8 GEMM on Emerald Rapids (target-cpu=native), all
correct=true vs scalar:
size before(naive) after(hoisted) vs scalar
256^3 3.7 GMAC/s 65.7 GMAC/s 68x
512^3 6.4 GMAC/s 124.9 GMAC/s 130x
1024^3 10.4 GMAC/s 155.9 GMAC/s 177x
2048^3 14.8 GMAC/s 169.7 GMAC/s 600x (339 GOP/s)
`int8_tile_gemm_16x16` (the standalone 16×16 accumulating API) is unchanged
and still used by the unit tests. Further headroom: 2×2 register blocking
(reuse A/B tile loads across 4 C accumulators) and rayon over row tiles.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
The operand-swap fix touched both tile kernels; amx_probe now also checks matmul_f32 (which routes through the BF16 TDPBF16PS path on AMX) against an f32 scalar reference. Max relative error 0.0033–0.0051 across shapes — within BF16's ~8-mantissa-bit precision. Confirms the index-swap fix is correct for TDPBF16PS as well as TDPBUSD. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…s/agent
Detection (`amx_available()`) ran CPUID + XGETBV + the arch_prctl syscall on
EVERY matmul call. Cache it in a `LazyLock<bool>` computed once; all four gates
are non-blocking (no I/O, no lock, no spin) so the init can't stall, and the
arch_prctl grant is process-wide + inherited by all threads, so requesting it
once is correct even under rayon.
Add `CpuModel` detection from CPUID.01H (GenuineIntel only): SapphireRapids
(0x8F) / EmeraldRapids (0xCF) / GraniteRapids (0xAD/0xAE) / SierraForest (0xAF,
E-core, no AMX) / OtherX86 / NonX86, also cached. Exposed via
`ndarray::simd::{cpu_model, CpuModel, amx_report}`. `amx_report()` now reads
e.g. "AMX [Emerald Rapids expects_amx=true]: TILE=true INT8=true BF16=true
available=true" — so `has_amx() && !available` cleanly separates "no AMX
silicon" from "AMX present but not OS-enabled".
Answers the SPR-vs-EMR question: it is NOT the cause. The five bugs are
ISA/encoding-level and were latent on Sapphire Rapids too — the SPR-era
AMX_GOTCHAS.md literally shipped three of them (syscall 157, TDPBUSD=…73…C1,
swapped TILECFG layout). They never fired because detection never returned
true; EMR was just the first host to actually execute the tile path.
Docs (teach-to-fish):
- NEW .claude/knowledge/amx-enablement-and-kernel.md — canonical reference:
enablement sequence, validated byte-code table, the empirically-verified
(mirrored) operand convention, detection API, perf story, modus operandi.
- NEW .claude/agents/amx-savant.md — the AMX specialist agent (enable /
troubleshoot / mindset), references the knowledge doc + gotchas.
- REWROTE .claude/AMX_GOTCHAS.md — corrected the three bugs it shipped, added a
fault-signature→cause index (SIGSEGV@LDTILECFG / SIGSEGV@TILELOADD /
SIGILL@TDPBUSD / wrong-values) and the "skipped test ≠ passing test" trap.
- simd_amx.rs module header: replaced the false "INLINE ASM TESTED ✓" claim
(and wrong "256 MACs/instr" → 16384) with the verified status + bug history.
- amx_probe prints amx_report() + cpu_model() up front.
No W3C/ISA encoding source exists in-tree — the .claude "w3c" files are
semantic-web ontologies (SKOS/PROV-O/FIBO); the byte-code authority is the
Intel SDM opcode map + the empirical 4-opcode sweep (0x70/0x73 match the SDM
exactly, confirming the map).
Validated: amx_probe all CORRECT, clippy -p ndarray --lib clean (only the
pre-existing simd_int_ops needless_return), cpu_model()=EmeraldRapids.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…ated) Adds `int8_gemm_amx_tiled_par`: B is VNNI-packed once into a shared read-only buffer, then the M/16 row-tiles fan across the rayon pool (one task per 16-row block of C). Each worker runs its own LDTILECFG (tile config is per-thread CPU state) + the byte-for-byte validated serial tile sequence; AMX permission is process-wide so it's inherited by every worker. Correctness unchanged (amx_probe --features rayon: all shapes CORRECT, incl. the multi-row-tile parallel cases). HONEST FINDING: this AMX kernel is MEMORY-BANDWIDTH-BOUND. On this 4-core Emerald Rapids box rayon-over-rows scales sublinearly — 2048³ 169.7 → 237.5 GMAC/s (1.4×), and it REGRESSES small/medium shapes (512³ 125 → 73) because thread dispatch + the shared B-prepack dominate. So it's gated to large work only: `m >= 32 && m·n·k >= 2e9` (≈ ≥2048³ here), keeping the fast serial path for everything else. Many-core servers with more memory channels gain more. The real lever for a bandwidth-bound kernel is reducing bytes/FLOP (2×2 register blocking — reuse each A/B tile load across 4 C accumulators), which also lets rayon scale better; that's the queued next step. Feature-gated behind `rayon` (no change to default builds). https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…act)
The AMX int8 GEMM is memory-bandwidth-bound (~8% of peak). The lever is fewer
bytes/MAC: 2×2 register blocking computes a 32×32 output block with four C
accumulators (tmm0-3) fed by two A row-tiles (tmm4-5) and two B col-tiles
(tmm6-7) via `tile_dpbusd_2x2`, so each A/B tile load serves TWO products —
half the tile loads of the 16×16 kernel.
New tile primitives (amx_matmul.rs), each validated by examples/amx_rb_probe
(32×32 block vs scalar, K=64/128/256, all CORRECT on first run):
* tile_load extended to tmm3-7, tile_store to tmm1-3 (ModRM 0x04|tile<<3).
* TileConfig::for_dpbusd_8 — all 8 tiles 16×64.
* tile_dpbusd_2x2 — 4 TDPBUSDs (C4 E2 49 5E C4 / 41 5E CC / 49 5E D5 /
41 5E DD); byte2 = ((~vvvv&0xF)<<3)|1, ModRM = 0xC0|dst<<3|rm; all 8
operand tiles distinct (no same-tile #UD).
Driver (`int8_gemm_amx_tiled_rb`): BLIS-style loop order is what makes it win —
OUTER over 32-col panels, pack only that panel's two B bands (L2-resident) and
reuse across all row-blocks, INNER over 32-row blocks. (A first cut pre-packed
ALL of B ≈4 MB and thrashed cache, regressing 1024³ 156→125; the panel-local
pack fixes it and also halves A's DRAM re-reads.) 16-wide M/N remainders finish
on the validated 16×16 path. amx_probe adds 48×48 / 96×80 cases covering both
remainder strips + corner — all CORRECT.
Single-thread, target-cpu=native, all correct=true vs scalar:
size serial rb(2×2)
256³ 65.7 80.8 (+23%)
512³ 124.9 132.0 (+6%)
1024³ 155.9 170.2 (+9%)
2048³ 169.7 197.7 (+16%) 395 GOP/s
Dispatch: rayon huge → _par; else m,n≥32 → _rb; else _serial. Next:
rayon-over-rb (projected ~270+ GMAC/s) + full BLIS cache blocking.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…ng at scale) The knowledge doc projected "rayon-over-rb ~270+ GMAC/s" as easy headroom. It is NOT: a first attempt (each rayon task runs _rb on a 64-row band) ran slower than rb-single (155 vs 197 at 2048³ — per-task B re-pack) AND produced correct=false at 1024³/2048³ (256³/512³ stayed correct; single-thread _rb is bit-exact at all sizes, so it's an AMX-tiles-under-rayon-at-scale issue, not yet diagnosed). Reverted. Documented the two prerequisites before reshipping (shared B pre-pack + a probe reproducing/explaining the large-size correctness failure) so a future session doesn't re-walk it blind. https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
|
Warning Review limit reached
More reviews will be available in 35 minutes and 8 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (9)
📝 WalkthroughWalkthroughFixes the Linux AMX permission syscall (157→158), corrects multiple tile instruction encoding bugs (TILECFG field layout, SIB base/index, TDPBUSD/TDPBF16PS ModRM aliasing), swaps A/B operand placement in int8 and bf16 kernels, adds adaptive int8 GEMM dispatch (serial/Rayon/register-blocked 2×2), routes backend GEMM/GEMV through corrected tier dispatch, replaces array-backed ChangesAMX Enablement and Correctness Overhaul
Sequence DiagramsequenceDiagram
participant App
participant backend_gemm_i8
participant simd_int_ops_gemm_u8_i8
participant int8_gemm_amx_tiled
participant amx_available
participant TileConfig
App->>backend_gemm_i8: gemm_i8(m,k,n,a,b,c)
backend_gemm_i8->>simd_int_ops_gemm_u8_i8: dispatch via tier
simd_int_ops_gemm_u8_i8->>amx_available: amx_available() [LazyLock]
amx_available-->>simd_int_ops_gemm_u8_i8: true/false
alt AMX available (M>=32, N>=32)
simd_int_ops_gemm_u8_i8->>int8_gemm_amx_tiled: int8_gemm_amx_tiled_rb
int8_gemm_amx_tiled->>TileConfig: for_dpbusd_8(k_bytes)
int8_gemm_amx_tiled->>int8_gemm_amx_tiled: pack B into VNNI quads
int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_dpbusd_2x2() × K/64 blocks
int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_store 4 quadrants → c
else AMX available serial
simd_int_ops_gemm_u8_i8->>int8_gemm_amx_tiled: int8_gemm_amx_tiled_serial
int8_gemm_amx_tiled->>TileConfig: for_dpbusd(k_bytes)
int8_gemm_amx_tiled->>int8_gemm_amx_tiled: tile_dpbusd() + tile_store → c
else no AMX
simd_int_ops_gemm_u8_i8->>simd_int_ops_gemm_u8_i8: VNNI/scalar fallback
end
int8_gemm_amx_tiled-->>App: c filled
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f3e6223088
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| #[cfg(feature = "std")] | ||
| { | ||
| crate::hpc::vnni_gemm::int8_gemm_vnni(a, b, c, m, n, k); | ||
| crate::simd_int_ops::gemm_u8_i8(a, b, c, m, n, k); |
There was a problem hiding this comment.
Keep the runtime VNNI fallback in gemm_i8
In the default x86-64-v3/std build, this now calls simd_int_ops::gemm_u8_i8, whose AVX-512/AVX-VNNI arms are behind #[cfg(target_feature = ...)] after the AMX alignment check; if AMX is unavailable or the shape is not 16/16/64-aligned, it therefore falls through to scalar even on Ice Lake/Zen 4/Sapphire Rapids hardware with VNNI. The previous hpc::vnni_gemm::int8_gemm_vnni used runtime simd_caps() detection and handled arbitrary shapes, so backend::gemm_i8 regresses hardware-supported VNNI calls to the scalar kernel in normal baseline builds.
Useful? React with 👍 / 👎.
examples/{amx_gemm_bench,amx_probe,amx_rb_probe}.rs import `ndarray::simd` and
`ndarray::hpc`, both `#[cfg(feature = "std")]`. With std as a default feature
they built locally, but the `--no-default-features` CI job compiles examples
with std off → `error[E0432]: unresolved import ndarray::simd` (item configured
out, gated behind `std`). Declared the three examples with
`required-features = ["std"]` (same pattern as `ocr_benchmark`) so non-std jobs
skip them.
Verified: `cargo build -p ndarray --no-default-features
--features portable-atomic-critical-section --examples` now finishes clean
(examples skipped) instead of failing on amx_gemm_bench.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/hpc/int8_tile_gemm.rs (1)
367-397:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winDon’t rely on
debug_assert!for a safe AMX entry point.In release builds, a caller can invoke
int8_gemm_amx_tiled()on a non-AMX host or with misaligned shapes, and this function will still fall into the raw-pointer/tile-instruction paths below. That turns a safe public API into either an immediateSIGILL(amx_available() == false) or full 16×16 tile loads/stores against a shape the code no longer proves is aligned.Suggested fix
- debug_assert!(crate::hpc::amx_matmul::amx_available()); - debug_assert_eq!(m % 16, 0, "int8_gemm_amx_tiled: M must be multiple of 16"); - debug_assert_eq!(n % 16, 0, "int8_gemm_amx_tiled: N must be multiple of 16"); - debug_assert_eq!(k % 64, 0, "int8_gemm_amx_tiled: K must be multiple of 64"); + assert!( + crate::hpc::amx_matmul::amx_available(), + "int8_gemm_amx_tiled: AMX must be available" + ); + assert_eq!(m % 16, 0, "int8_gemm_amx_tiled: M must be multiple of 16"); + assert_eq!(n % 16, 0, "int8_gemm_amx_tiled: N must be multiple of 16"); + assert_eq!(k % 64, 0, "int8_gemm_amx_tiled: K must be multiple of 64");🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/hpc/int8_tile_gemm.rs` around lines 367 - 397, The function int8_gemm_amx_tiled uses debug_assert! macros to validate that AMX is available and that input dimensions m, n, and k are properly aligned (multiples of 16, 16, and 64 respectively), but these assertions are stripped in release builds. This allows a caller to invoke the function with invalid preconditions on non-AMX hosts or with misaligned shapes, causing either SIGILL or undefined behavior when the code reaches the raw pointer and AMX instruction paths. Replace all four debug_assert! calls (the amx_available check and the three shape alignment checks) with regular assert! macro calls to ensure these validation checks execute in both debug and release builds, protecting the safe public API from unsafe inputs.
🧹 Nitpick comments (2)
src/backend/mod.rs (1)
170-183: ⚡ Quick winAdd examples to the new public GEMM docs.
Both
gemm_i8andgemm_bf16now have detailed API docs, but they still miss a usage example. A tiny row-major example for each would satisfy the repo contract and pin down the signedness/shape expectations for callers. As per coding guidelines, "All public APIs (public functions and methods) must have///doc comments with examples".Also applies to: 201-208
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/backend/mod.rs` around lines 170 - 183, The public GEMM functions gemm_i8 and gemm_bf16 have detailed API documentation but are missing usage examples, which violates the repo's requirement that all public APIs include documentation comments with examples. Add a usage example to the doc comments for both functions using the `///` style documentation. Include a small row-major matrix multiplication example for each function that demonstrates how callers should use it and clarifies the signedness and shape expectations (e.g., u8 × i8 → i32 for gemm_i8, and appropriate types for gemm_bf16).Source: Coding guidelines
src/backend/native.rs (1)
285-308: ⚡ Quick winAdd examples to the public GEMV docs.
gemv_f32andgemv_f64now have expanded API docs, but they still do not include examples. A small doctest for the row-major contract would bring these entrypoints in line with the repository requirement. As per coding guidelines, "All public APIs (public functions and methods) must have///doc comments with examples".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/backend/native.rs` around lines 285 - 308, The `gemv_f32` and `gemv_f64` functions are missing doctest examples in their doc comments, which violates the coding guideline requiring all public APIs to have doc comments with examples. Add executable doctest examples to both functions' doc comments that demonstrate the row-major GEMV operation (y = alpha * A * x + beta * y) using concrete values for the matrix, vectors, and scalars to illustrate how the computation works.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/amx_probe.rs`:
- Around line 82-110: The relative error calculation in the test_matmul_f32
function is incorrect because using e.abs().max(1.0) as the denominator converts
small values into absolute-error tests rather than true relative-error tests.
This allows large relative regressions on small outputs to incorrectly appear
correct. Fix the denominator calculation in the loop where max_rel is computed
to use a true relative metric: either use a small epsilon as a floor while still
dividing by the actual expected value's magnitude, or compute both absolute and
relative error separately and use the maximum of the two comparisons against the
tolerance threshold.
In `@examples/amx_rb_probe.rs`:
- Around line 45-61: Add validation to reject K values that are not divisible by
64 instead of silently truncating them. Before computing k_blocks with the
division operation k / 64, check if k modulo 64 is not equal to zero and either
return an error or panic with a clear message indicating that K must be a
multiple of 64. This ensures the function fails fast when given invalid input
rather than producing a partial product from the truncated k_blocks calculation
in the loop.
In `@src/backend/native.rs`:
- Around line 289-303: The gemv_f32 and gemv_f64 functions eagerly slice x[..n]
before checking if there are any rows to process. This causes a panic when m
equals zero and x.len() is less than n, whereas the scalar implementations
gracefully handle this case by returning early. Add an early return check at the
beginning of the non-Scalar branch in both gemv_f32 (before line 295 where xn is
assigned) and gemv_f64 (the sibling function also applying the same pattern): if
m equals zero, return immediately without attempting to slice x, preserving the
no-op behavior when there are no matrix rows.
In `@src/hpc/amx_matmul.rs`:
- Around line 827-835: The `test_tile_zero_and_release` test function still
builds the old incorrect XTILECFG layout with swapped offsets (using `data[16] =
1` and `data[48] = 4`), while the assertions in the diff now validate the
corrected offsets. Update the XTILECFG construction in
`test_tile_zero_and_release` to match the fixed offset layout shown in the
assertions, ensuring colsb values are placed at the correct byte offsets (16+2t
for each tile) and rows values are placed at offsets 48+t, so the test will not
create an invalid configuration when AMX becomes reachable on capable hosts.
- Around line 32-42: The `for_dpbusd()` function is configuring the tiles
incorrectly. Currently, tile 1 is programmed with a `16×kb` shape and tile 2
with a `kb/4×64` shape, but the documented and expected contract (as shown in
the comments and used by `tile_dpbusd()`) requires tile 1 (tmm1) to be `kb/4×64`
for VNNI K×N operands and tile 2 (tmm2) to be `16×kb` for plain M×K operands.
Swap the tile configuration assignments so that tile 1 receives the `kb/4×64`
shape and tile 2 receives the `16×kb` shape. This fix needs to be applied
wherever these tile configurations are set in the `for_dpbusd()` function
implementation.
- Around line 77-83: The public functions TileConfig::for_dpbusd_8() and
tile_dpbusd_2x2() are missing documentation examples as required by coding
guidelines for all public Rust APIs. Add a no_run example to the doc comment for
each function demonstrating the strict tile/register contract they expose. Each
example should show the required sequence of operations: calling
tile_loadconfig, performing the necessary tile loads, and then calling
tile_release. This will make these APIs safer to consume by providing clear
usage patterns in the documentation.
In `@src/simd_avx2.rs`:
- Around line 1603-1629: The shr and shl methods in U16x32 only handle shift
amounts of 1, 2, 4, and 8, returning a zero vector for all other values, which
produces incorrect results. Replace both the shr and shl method implementations
to use the vector-count intrinsic forms (_mm256_srl_epi16 and _mm256_sll_epi16
respectively) instead of the immediate forms, since these accept runtime shift
amounts rather than compile-time constants. This approach mirrors how
U8x32::shr_epi16 handles variable shift amounts and will correctly support all
valid shift values.
In `@src/simd_avx512.rs`:
- Around line 1367-1371: The public method cmp_gt_mask in src/simd_avx512.rs is
missing a /// doc comment with examples, which violates the coding guidelines.
Add a doc comment above the cmp_gt_mask method that explains its purpose
(lane-wise ordered greater-than comparison returning a bitmask), describes the
return value semantics (bit i is set if self[i] > other[i]), explains its use
case (SIMD early-out without scalar iteration), and includes a clear example
demonstrating the method with different input scenarios to show both when lanes
are greater and when they are not.
- Around line 1353-1359: Both the `mul_add` method and the `cmp_gt_mask` method
are missing example code blocks in their doc comments. Add an "# Examples"
section to each method's existing doc comment with the provided example code
snippets that demonstrate their usage and expected behavior. The examples for
`mul_add` should show a fused multiply-add operation with sample values, and the
examples for `cmp_gt_mask` should demonstrate comparing two F32x8 values and
showing the resulting mask value.
---
Outside diff comments:
In `@src/hpc/int8_tile_gemm.rs`:
- Around line 367-397: The function int8_gemm_amx_tiled uses debug_assert!
macros to validate that AMX is available and that input dimensions m, n, and k
are properly aligned (multiples of 16, 16, and 64 respectively), but these
assertions are stripped in release builds. This allows a caller to invoke the
function with invalid preconditions on non-AMX hosts or with misaligned shapes,
causing either SIGILL or undefined behavior when the code reaches the raw
pointer and AMX instruction paths. Replace all four debug_assert! calls (the
amx_available check and the three shape alignment checks) with regular assert!
macro calls to ensure these validation checks execute in both debug and release
builds, protecting the safe public API from unsafe inputs.
---
Nitpick comments:
In `@src/backend/mod.rs`:
- Around line 170-183: The public GEMM functions gemm_i8 and gemm_bf16 have
detailed API documentation but are missing usage examples, which violates the
repo's requirement that all public APIs include documentation comments with
examples. Add a usage example to the doc comments for both functions using the
`///` style documentation. Include a small row-major matrix multiplication
example for each function that demonstrates how callers should use it and
clarifies the signedness and shape expectations (e.g., u8 × i8 → i32 for
gemm_i8, and appropriate types for gemm_bf16).
In `@src/backend/native.rs`:
- Around line 285-308: The `gemv_f32` and `gemv_f64` functions are missing
doctest examples in their doc comments, which violates the coding guideline
requiring all public APIs to have doc comments with examples. Add executable
doctest examples to both functions' doc comments that demonstrate the row-major
GEMV operation (y = alpha * A * x + beta * y) using concrete values for the
matrix, vectors, and scalars to illustrate how the computation works.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 24bb1a31-496b-46a6-b035-ca49a8d87553
📒 Files selected for processing (17)
.claude/AMX_GOTCHAS.md.claude/agents/amx-savant.md.claude/blackboard.md.claude/board/AGENT_LOG.md.claude/knowledge/amx-enablement-and-kernel.mdexamples/amx_gemm_bench.rsexamples/amx_probe.rsexamples/amx_rb_probe.rssrc/backend/mod.rssrc/backend/native.rssrc/hpc/amx_matmul.rssrc/hpc/bf16_tile_gemm.rssrc/hpc/int8_tile_gemm.rssrc/simd.rssrc/simd_amx.rssrc/simd_avx2.rssrc/simd_avx512.rs
First probe toward the Morton-tile cascade substrate (codec-agnostic; the f32
cell value stands in for the eventual palette256 / helix Fisher-2z per-cell
codec — wiring that is the next step).
Substrate proven:
* 4×4 Morton leaf tile = 16 cells = one F32x16 loaded from
`simd_soa::MultiLaneColumn` (the "2bit×2bit" tile; gridlake SoA carrier).
* Quadtree over T×T tiles (2×2 per level) ⇒ total grid (4T)² for T=2^k gives
the ladder 64, 256, 1024, 4096, 16384, 64k, 256k.
* Morton order ⇒ every quadtree node is a contiguous index range, so the
aggregate min/max pyramid (the Belichtungsmesser "bands") is a flat bottom-up
reduction; the cascade prunes a subtree whose [min,max] can't intersect the
query band [q−r, q+r] (the 3-stroke band-miss early-exit, generalized).
examples/morton_cascade_probe.rs, validated: cascade count == brute-force count
for every (size, query) 64..262144 cells; selective queries prune ~63-67% of
cells, broad queries correctly prune 0% (no over-prune). `required-features =
["std"]` (uses `ndarray::simd`).
Next: wire the per-cell codec (palette256 / helix), the SIMD leaf mask via
F32x16, and a fully non-materialized aggregate variant.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
…asured) Anti-"eigenvalue theater" probe putting numbers on the two load-bearing claims: 1. Golden-angle hemisphere sampling (γ = π(3−√5); θ = ½·arccos(1−2(n+0.5)/N)) maximises the minimum nearest-neighbour gap — 2.6–10× the regular (θ,φ) grid and up to 15× uniform-random at N=64..1024 — so the irrational stride genuinely prevents node collapse (the Fujifilm-X-Trans low-discrepancy point). The regular grid's lower NN-CoV is a mirage: it clusters at the pole, which the min-gap exposes. 2. Fisher-z percentile rank (= arctanh, strictly monotone in cosine) preserves every pairwise similarity ordering — 0 inversions vs cosine order — so ranks are a normalised [0,1] key comparable directly, never re-materialising cosine; and Fisher-z gives the rim (s=0.9) 5.3× the resolution of the centre. examples/golden_helix_probe.rs (pure std, required-features = ["std"]). https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
… complement)
Option 1 ("stream pairwise into AMX-turbovec as a cheap edge residue that
complements the 16×8bit=128bit coarse code"), measured end-to-end on this
session's validated AMX int8 GEMM + the turbovec quant idea:
vectors → AMX matmul_i8_to_i32 assigns each to a 256-palette (1-byte coarse,
the per-edge code) → 4-bit TurboQuant of the residue (D/2-byte fine) →
reconstruct.
Results (Emerald Rapids, target-cpu=native):
* AMX assignment is 100% accurate (finds the true centroid every vector).
* coarse(1 byte) rel-err 0.148–0.287 → +turbovec 4-bit rel-err 0.010–0.020
= ~14× better reconstruction for a fixed small byte cost (D/2 = 32–64 B).
So the coarse palette edge-code + a cheap turbovec residue is a real,
measurable win, with AMX doing the pairwise. (Assignment GEMM here is a small
shape → 17–34 GMAC/s, below the 2048³ 197 peak; correctness is the point.)
examples/edge_residue_probe.rs, required-features = ["std"].
Path map for the siblings (both now plug into existing certified math):
* Option 2 (Gaussian-splat pyramid): wire `hpc::pillar::ewa_sandwich_{2d,3d}`
+ `splat_invariants` + `cov_high_d` into the Morton cascade.
* Option 3 (x264/x265 2×2/4×4 reality-check, eigenvalue/perturbation): use
`jc::{weyl,cartan,pearl}` for the transform/resonance math.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
There was a problem hiding this comment.
🧹 Nitpick comments (1)
examples/morton_cascade_probe.rs (1)
115-150: ⚡ Quick winRemove unused
fieldparameter.The
fieldparameter is never used incascade_count(line 148 just suppresses the warning). The function only reads from thecol: &MultiLaneColumnparameter. Consider removing the unused parameter to simplify the signature.♻️ Proposed refactor
-fn cascade_count(field: &[f32], col: &MultiLaneColumn, pyr: &Pyramid, q: f32, r: f32) -> (usize, usize) { +fn cascade_count(col: &MultiLaneColumn, pyr: &Pyramid, q: f32, r: f32) -> (usize, usize) { let (lo, hi) = (q - r, q + r); let mut count = 0usize; let mut visited = 0usize; // Stack of (level, node_index_within_level). let mut stack = vec![(pyr.k as usize, 0usize)]; let bytes = col.as_bytes(); while let Some((level, node)) = stack.pop() { let (mn, mx) = pyr.levels[level][node]; if mx < lo || mn > hi { continue; // band miss → prune whole subtree (early-exit) } if level == 0 { // Leaf tile = 16 cells = one F32x16 chunk in the SoA column. let off = node * 64; // 16 f32 × 4 bytes let chunk: [u8; 64] = bytes[off..off + 64].try_into().unwrap(); let arr = f32x16_from_bytes(&chunk).to_array(); for &v in arr.iter() { if (v - q).abs() <= r { count += 1; } } visited += 16; } else { let base = node * 4; for c in 0..4 { stack.push((level - 1, base + c)); } } } - let _ = field; // field kept for the brute-force reference; cascade reads the SoA column (count, visited) }And update the call site at line 180:
- let (got, visited) = cascade_count(&field, &col, &pyr, q, r); + let (got, visited) = cascade_count(&col, &pyr, q, r);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/morton_cascade_probe.rs` around lines 115 - 150, The `cascade_count` function has an unused `field` parameter that is only suppressed with a `let _ = field;` statement. Remove the `field: &[f32]` parameter from the `cascade_count` function signature, delete the `let _ = field;` line that suppresses the warning, and update the call site to `cascade_count` (which should be removing the field argument from the function call) to match the new signature.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@examples/morton_cascade_probe.rs`:
- Around line 115-150: The `cascade_count` function has an unused `field`
parameter that is only suppressed with a `let _ = field;` statement. Remove the
`field: &[f32]` parameter from the `cascade_count` function signature, delete
the `let _ = field;` line that suppresses the warning, and update the call site
to `cascade_count` (which should be removing the field argument from the
function call) to match the new signature.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: fbfebf93-8e90-449a-8393-68db69ce96ae
📒 Files selected for processing (4)
Cargo.tomlexamples/edge_residue_probe.rsexamples/golden_helix_probe.rsexamples/morton_cascade_probe.rs
CI fmt: ran `cargo fmt --all` (edge_residue_probe / golden_helix_probe were
committed unformatted).
Correctness (codex / coderabbit):
* simd_int_ops::gemm_u8_i8 — VNNI dispatch was compile-time `#[cfg(target_feature)]`,
so the default x86-64-v3 GitHub build stripped both VNNI arms → scalar on
Ice Lake / SPR / Zen 4 silicon (codex P2 regression). Now RUNTIME
`is_x86_feature_detected!` (avx512vnni → avxvnni → scalar); compiles + reaches
VNNI under v3, and removes the pre-existing `needless_return` clippy warning.
* simd_avx2.rs U16x16 `shr`/`shl` — returned ZERO for any shift ∉{1,2,4,8};
now `_mm256_srl_epi16`/`_mm256_sll_epi16` with a runtime lane count (all shifts).
* amx_matmul::for_dpbusd — tile 1/2 shapes now match the operand contract
(tmm1 = VNNI kb/4×64, tmm2 = plain 16×kb); identical at kb=64 (tests
unaffected), correct for kb<64.
* backend::native gemv_f32/f64 — early-return on m==0 (don't slice `x[..n]`
when there are no rows; matches the scalar reference no-op).
* test_tile_zero_and_release — minimal config rewritten on the corrected
XTILECFG offsets (colsb=4 @16 / rows=1 @48), with an explanatory note.
Probes / docs:
* amx_probe matmul_f32 validator — true relative-L2 + max-abs (the old
`|e|.max(1.0)` denominator was an absolute test for |e|<1).
* amx_rb_probe rb_32 — assert K % 64 == 0 (was silently truncating the tail).
* doc `# Examples` (ignore) on the new public APIs: TileConfig::for_dpbusd_8,
tile_dpbusd_2x2, F32x8::mul_add, F32x8::cmp_gt_mask.
Validated under x86-64-v3 (GitHub target): clippy clean, `cargo build
--examples` Finished; native AMX probes still all CORRECT.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
Summary
Brings Intel AMX (Advanced Matrix Extensions) from silently disabled on every
capable host to enabled, bit-exact, and fast on stable Rust 1.94, plus the
vertical-SIMD primitives that back the turbovec/lance-graph consumers.
Verified on Emerald Rapids (CPUID model 0xCF), kernel 6.18.5. The fixes are
ISA-level (identical on Sapphire Rapids / Granite Rapids).
AMX (the headline)
Enablement — the one-digit bug.
ARCH_REQ_XCOMP_PERM(0x1023) is anarch_prctlop (syscall 158); the code issued it onprctl(157) →-EINVAL, soamx_available()returnedfalseon every AMX host and the tilepath was dead code. Fixed 157→158 (
simd_amx.rs). This is the "AMX is available~50-79% of the time but needs a special way to enable it."
Four more tile-kernel bugs, surfaced once the path actually executed and
bisected with
examples/amx_probe(prints a flushed line before each tile op):TileConfigrows/colsb regions swapped →LDTILECFG #GP(SIGSEGV).TILELOADD/TILESTOREDSIB base/index swapped → deref'd stride as ptr (SIGSEGV).TDPBUSD/TDPBF16PSModRM0xC1aliased two source tiles → same-tile#UD(SIGILL).(plain M×K =
ModRM.rm= unsigned; VNNI K×N =VEX.vvvv= signed) — provenby sweeping all four
TDPB**Dopcodes against sign-sensitive inputs.Now correct:
matmul_i8_to_i32(TDPBUSD) bit-exact vs scalar across allshapes;
matmul_f32/matmul_bf16_to_f32(TDPBF16PS) within BF16 tolerance.Performance (Emerald Rapids, single-thread,
target-cpu=native, allcorrect=true):LDTILECFG+ VNNI-pack out of the per-tile loops (11.5×).tile_dpbusd_2x2, 4 C accumulators + 2 A + 2 B tiles)with a BLIS-style panel loop order — halves load bytes/MAC; +16% over serial at
2048³, wins at every size. All 12 new tile encodings validated in isolation
(
examples/amx_rb_probe).rayonrow-tile parallelism (feature-gated) → 237 GMAC/s for huge GEMMs.Detection + ergonomics:
amx_available()cached in aLazyLock(non-blockinggates; process-wide
arch_prctlrequested once, rayon-safe); newCpuModel(SPR/EMR/GNR/SierraForest) +
cpu_model()+amx_report(), exported viandarray::simd::*.has_amx() && !availablecleanly separates "no AMX silicon"from "AMX not OS-enabled".
Docs (teach-to-fish): new
.claude/knowledge/amx-enablement-and-kernel.md(canonical reference + validated byte-code table), new
.claude/agents/amx-savant.md,and a rewritten
.claude/AMX_GOTCHAS.md(corrects the three bugs it shipped; addsa fault-signature→cause playbook).
Vertical-SIMD primitives
Native AVX2
U16x16(__m256i), FastScan PQ4-ADC helpers onU16x16/U8x32,F32x8::{mul_add, cmp_gt_mask},U8x32::from_ptr(zero-overhead hot-loop load),backend
gemm_bf16/gemm_i8/gemvwired to real SIMD dispatch, and the AMX/VNNIint8 GEMM re-exported through
simd.rs— the consumer-facing surface turbovec andlance-graph build on.
Validation & honest caveats
examples/{amx_probe, amx_rb_probe, amx_gemm_bench}andcargo clippy -p ndarray --lib [--features rayon](clean except one pre-existingsimd_int_opsneedless_return). The lib unit-test target is pre-broken byunrelated
src/tri.rstype-inference errors (noted in CLAUDE.md), so the examplesare the gate.
ran slower than rb-single and
correct=falseat 1024³/2048³ (anAMX-under-rayon-at-scale issue, documented in the knowledge doc). The banked wins
(rb-single 197, 16×16-rayon 237) are both correct. Bigger lever ahead: full BLIS
Mc/Nc/Kc cache blocking.
https://claude.ai/code/session_01D2WSmezQBNC3bUdHuGfGmo
Generated by Claude Code
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation