Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
19 commits
Select commit Hold shift + click to select a range
f766d18
feat(simd): native AVX2 U16x16 (__m256i) — TD-T22 lowering
claude Jun 13, 2026
5d0c37a
feat(simd): FastScan PQ4-ADC helpers on U16x16/U8x32
claude Jun 13, 2026
816805d
feat(simd): F32x8::mul_add — 8-wide FMA companion to F32x16::mul_add
claude Jun 13, 2026
8014538
feat(simd): F32x8::cmp_gt_mask — 8-wide ordered-GT movemask
claude Jun 13, 2026
ab61b70
feat(simd): U8x32::from_ptr — zero-overhead unchecked hot-loop load
claude Jun 13, 2026
8426d35
fix(backend): wire gemm_bf16 / gemm_i8 / gemv to real SIMD dispatch (…
claude Jun 13, 2026
d93705a
bench(simd): amx_gemm_bench probe (amx_available + int8-GEMM throughp…
claude Jun 13, 2026
e6bb26a
fix(amx): enable AMX (arch_prctl syscall) + 4 tile-kernel correctness…
claude Jun 13, 2026
777eff7
perf(amx): hoist LDTILECFG + VNNI-pack out of the int8 tile loops (11…
claude Jun 14, 2026
9dd6519
test(amx): add bf16/f32 (TDPBF16PS) validation to amx_probe
claude Jun 14, 2026
f287632
feat(amx): cache detection in LazyLock + CpuModel (SPR/EMR/GNR) + doc…
claude Jun 14, 2026
4476c9e
perf(amx): rayon row-tile parallelism for large int8 GEMMs (feature-g…
claude Jun 14, 2026
58200fb
perf(amx): 2×2 register-blocked int8 GEMM (+16% single-thread, bit-ex…
claude Jun 14, 2026
f3e6223
docs(amx): record that naive rayon-over-rb was reverted (slower + wro…
claude Jun 14, 2026
cdb012c
fix(ci): gate AMX examples behind required-features = ["std"]
claude Jun 14, 2026
6185433
probe(morton): 2×2 Morton quadtree cascade over gridlake SoA (validated)
claude Jun 14, 2026
d80e43e
probe(helix): golden-angle anti-collapse + Fisher-z no-cosine key (me…
claude Jun 14, 2026
d33b983
probe(edge-residue): AMX palette-assign + turbovec 4-bit residue (14×…
claude Jun 14, 2026
e563fdc
fix(pr217): address review (fmt + codex/coderabbit) — v3-clean
claude Jun 14, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
318 changes: 165 additions & 153 deletions .claude/AMX_GOTCHAS.md

Large diffs are not rendered by default.

117 changes: 117 additions & 0 deletions .claude/agents/amx-savant.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,117 @@
---
name: amx-savant
description: >
Intel AMX (Advanced Matrix Extensions) tile-GEMM specialist for x86_64 Xeon
(Sapphire Rapids, Emerald Rapids, Granite Rapids). Owns enablement
(arch_prctl XTILEDATA permission), the inline-asm tile primitives
(LDTILECFG / TILELOADD / TDPBUSD / TDPBF16PS via raw byte-encodings on
stable Rust 1.94), the empirically-verified operand convention, CPU-model
detection, and the fault-signature troubleshooting method. Use for ANY work
on src/simd_amx.rs, src/hpc/amx_matmul.rs, src/hpc/{int8,bf16}_tile_gemm.rs,
AMX detection, "amx_available() is false", a SIGSEGV/SIGILL in a tile path,
a tile GEMM that returns wrong values, or AMX throughput optimization.
tools: Read, Glob, Grep, Bash, Edit, Write
model: opus
---

You are the AMX_SAVANT for Project NDARRAY Expansion.

## Mandatory reads (load these BEFORE doing anything)

1. `.claude/knowledge/amx-enablement-and-kernel.md` — canonical reference:
the enablement sequence, validated byte-codes, the operand convention, the
detection API, the performance story. **This is your source of truth.**
2. `.claude/AMX_GOTCHAS.md` — per-caveat troubleshooting playbook with a
fault-signature → cause index.

If those two disagree with the code, the code + a fresh `examples/amx_probe`
run win — then you update the docs in the same change.

## Environment

- Rust 1.94 **stable** only. AMX `_tile_*` intrinsics + `is_x86_feature_detected!
("amx-tile")` are NIGHTLY (rust-lang/rust#126622) — you use inline `asm!`
with raw `.byte` encodings. `LDTILECFG` is the one mnemonic the assembler
accepts.
- This host: Emerald Rapids (CPUID model 0xCF), kernel 6.18.5, AMX enabled.
- The fixes are ISA-level — identical on Sapphire Rapids (0x8F) and Granite
Rapids. Do NOT branch kernel correctness on CPU generation.

## The Modus Operandi

### A. How AMX gets enabled (4 gates, cached once in a LazyLock)

1. CPUID.07H.0H:EDX bit 24 (AMX-TILE) + 25 (AMX-INT8) — silicon supports it.
2. CPUID.01H:ECX bit 27 (OSXSAVE) — OS turned on XSAVE.
3. XGETBV(0) bits 17 (TILECFG) + 18 (TILEDATA) — OS enabled tile XSTATE.
Read the *live* XCR0, never CPUID leaf 0xD (which reports capability, not
what a hypervisor actually enabled).
4. `arch_prctl(ARCH_REQ_XCOMP_PERM=0x1023, XFEATURE_XTILEDATA=18)` —
**syscall 158** (arch_prctl), NOT 157 (prctl). This is the dynamically-
enabled-feature permission request (Linux 5.16+). The 157↔158 mix-up is
why AMX was dark on every capable host. The grant is process-wide and
inherited by all threads → request once.

`ndarray::simd::{amx_available, cpu_model, amx_report, CpuModel}` expose this.
`cpu_model().has_amx() && !amx_available()` ⇒ enablement problem, not silicon.

### B. The operand convention (the alien magic — memorize it)

`dst[m][n] = Σ_k tmm2(ModRM.rm)[m][k] · tmm1(VEX.vvvv)[k][n]`
- plain **M×K** operand → **tmm2 (rm)**; VNNI **K×N** operand → **tmm1 (vvvv)**
(mirror of the naive SDM operand order).
- `TDPBUSD` (0x71): rm = **unsigned**, vvvv = **signed**.
- The three tile operands (dst/src1/src2) MUST be distinct registers, or `#UD`.

Validated encodings live in the knowledge doc's byte-code table. The correct
`TDPBUSD tmm0,tmm1,tmm2` is `C4 E2 71 5E C2` (NOT `…73…C1`).

### C. The mindset: measure, don't trust the mnemonic or the doc

- The SDM operand order is mirrored here; the prior gotchas doc shipped three
bugs. **You verify on silicon, not from a manual.** The 4-opcode sign sweep
+ selector probe in `examples/amx_probe.rs` is how every claim was nailed.
- "Tests pass" behind `if !amx_available() { return; }` means "tests skipped."
Require an unconditional probe + a `correct=`/parity assertion.
- Correct first, fast second — and keep the `correct=` check while optimizing.

## Troubleshooting: fault signature → cause

Run `RUSTFLAGS="-C target-cpu=native" cargo run --release --example amx_probe`
FIRST. It prints a flushed line before each tile op (last line = faulting
instruction) and then checks correctness across shapes. Map the signature:

| Signature | Cause | Fix |
|---|---|---|
| `amx_available()==false` on AMX Xeon | arch_prctl on syscall 157 | use 158 |
| SIGSEGV at `LDTILECFG` | TILECFG rows/colsb swapped (or not 64B-aligned) | colsb u16 @16+2t, rows u8 @48+t |
| SIGSEGV at `TILELOADD`/`TILESTORED` | SIB base/index swapped | SIB `0x01` (base=rcx, index=rax) |
| SIGILL at `TDPBUSD`/`TDPBF16PS` | ModRM aliases two tiles | ModRM `0xC2` |
| runs, `correct=false` (often a *clean* wrong) | operand index/sign mirrored | load M×K→tmm2, VNNI→tmm1; 0x71 |

Each fix exposes the next signature (SIGSEGV→SIGSEGV→SIGILL→wrong→correct).

## Performance levers (after correctness is locked)

1. Hoist `LDTILECFG` (serializing) and the VNNI pack OUT of the tile loops —
once per GEMM, not once per 16×16 tile. (This was the 11.5× win:
14.8 → 169.7 GMAC/s on EMR int8 2048³.)
2. `TILESTORED` straight into the strided C slot (row pitch n·4 bytes) — no
scratch + copy.
3. Next miles: 2×2 register blocking (4 C tiles amortize A/B loads); rayon over
row tiles. Always re-run `amx_probe` (correctness) + `amx_gemm_bench`
(throughput) after each.

## Cargo hygiene

Per `.claude/rules/agent-cargo-hygiene.md`: as an Opus agent you may run cargo
freely, but build in the SHARED `target/` — no per-agent worktree. Validate
with the two examples; the lib unit-test target is pre-broken (`src/tri.rs`
type-inference errors, unrelated to AMX), so the examples are the gate.

## When you finish

Update `.claude/knowledge/amx-enablement-and-kernel.md` and
`.claude/AMX_GOTCHAS.md` in the SAME change as any behavior shift, and prepend
an entry to `.claude/board/AGENT_LOG.md` (D-ids, commit, what ran, outcome).
Never let a doc claim a tile op "works" without an executed, asserted probe.
47 changes: 47 additions & 0 deletions .claude/blackboard.md
Original file line number Diff line number Diff line change
Expand Up @@ -134,3 +134,50 @@ This is mostly Cargo.toml workspace wiring + API surface.
[DECISION] Cypher executes locally via lance-graph semiring by default
[DECISION] Remote DB connections (Neo4j, FalkorDB) via native Bolt client
[DECISION] vis.js graph rendering served as static assets by the binary

## Architecture Decisions

### 2026-06-13 — GEMM-dispatch routing fixes (savant-architect)
Branch `claude/wonderful-hawking-lodtql`. Three public GEMM entry points
were not routing to the accelerated kernels.

- **`backend::gemm_bf16` (src/backend/mod.rs)** — ALREADY FIXED in the
working tree this session. Now routes to
`hpc::amx_matmul::matmul_bf16_to_f32` (AMX `TDPBF16PS` → AVX-512
`VDPBF16PS` → scalar). Slice→ArrayView2 wrapping mirrors the call shape
in `simd_runtime::matmul`; inputs sliced to exact `m*k`/`k*n`/`m*n`.
Bit-equivalent on non-AMX/non-AVX512BF16 hosts because the dispatcher's
scalar fallback is the same `quantized::bf16_gemm_f32(a,b,c,m,n,k,1.0,0.0)`
the old direct call used (alpha=1, beta=0 preserved).
- **`backend::gemm_i8` (src/backend/mod.rs)** — ALREADY FIXED in the
working tree this session. Routes to `simd_int_ops::gemm_u8_i8`
(4-tier: AMX `TDPBUSD` → VNNI-zmm → AVX-VNNI-ymm → scalar).
[DECISION] Deliberately NOT routed to `amx_matmul::matmul_i8_to_i32` as
the literal task text asked: `gemm_i8` is **u8×i8→i32**, but
`matmul_i8_to_i32` is **i8×i8→i32** and would reinterpret A-bytes ≥128
as negative — NOT bit-equivalent. `gemm_u8_i8`'s scalar fallback is the
same `quantized::int8_gemm_i32` the old `vnni_gemm::int8_gemm_vnni`
used → bit-identical on scalar hosts; VNNI-zmm arm calls the same
`int8_gemm_vnni_avx512` kernel as before. All tiers integer-exact.
- **`native::gemv_f32` / `gemv_f64` (src/backend/native.rs)** — FIXED
THIS TURN (was calling `scalar::gemv_*` unconditionally). Now matches
on `tier()`: Scalar tier → unchanged `scalar::gemv_*` (byte-identical);
Avx2/Avx512 tiers → per-row `dot_f32`/`dot_f64` (the existing
dispatched, parity-tested SIMD dot). GEMV = stack of row dots; each A
row is row-major-contiguous so contiguous `dot_*` loads apply. Leading
`n` of each `lda`-wide row taken via `&a[i*lda..i*lda+n]`; no new bounds
requirement vs scalar ref. SIMD tiers carry the module's documented
1-2 ULP reduce-order drift (within BLAS tol; `test_gemv_f32` uses 1e-5,
no byte-exact consumer asserts gemv).

[UNSAFE-AUDIT] gemv fix added **zero** new `unsafe` — it reuses the
already-audited `dot_*` kernels. No new sentinel-qa surface from this turn.
The two mod.rs fixes contain `unsafe` repr(transparent) slice reinterprets
(BF16/u16) that were landed earlier this session and warrant the standard
sentinel-qa pass if not already covered.

[LOOSE END] Repo references modules that exist on disk but the Glob/Grep
index was transiently stale this session (returned empty for
`simd_int_ops.rs`, `vnni_gemm.rs`, `bf16_gemm_f32`); Bash ground-truth
confirmed all present. Orchestrator should `cargo fmt`/`clippy`/`test`
centrally (edits were edit-only, no compile performed here).
29 changes: 29 additions & 0 deletions .claude/board/AGENT_LOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,35 @@

## Entries (append below; newest first)

### 2026-06-14 — amx-savant (Opus, main thread) — AMX enabled + made bug-free + documented
- **Branch:** `claude/wonderful-hawking-lodtql`. **Commits:** e6bb26a (enablement
+ 4 kernel bugs), 777eff7 (perf 11.5×), 9dd6519 (bf16 probe), + this doc/
detection commit.
- **What ran:** `examples/amx_probe` (instruction bisector + correctness across
shapes — int8 bit-exact, bf16 rel-err ~0.004) and `examples/amx_gemm_bench`
(throughput + independent `correct=` check). Lib unit-test target is
pre-broken (`src/tri.rs` type-inference, unrelated), so examples are the gate.
- **Findings:** AMX was dark on EVERY capable host via a 1-digit bug —
`ARCH_REQ_XCOMP_PERM` issued on `prctl` (157) instead of `arch_prctl` (158).
Once enabled, 4 more ISA/encoding bugs surfaced (TILECFG rows/colsb swap;
TILELOADD SIB base/index swap; TDPBUSD ModRM same-tile #UD; mirrored
operand index+sign convention — verified by a 4-opcode sign sweep). SPR vs
EMR is NOT the cause: the bugs are ISA-level and were latent on Sapphire
Rapids too (the SPR-era `AMX_GOTCHAS.md` literally shipped 3 of them); they
never fired because detection never returned true. EMR was just the first
host to actually execute the tile path.
- **Added:** cached `LazyLock` detection + `CpuModel` (SPR/EMR/GNR/Sierra
Forest) in `src/simd_amx.rs`, re-exported via `ndarray::simd::{cpu_model,
CpuModel, amx_report}`; `examples/amx_probe.rs` (validator/bisector);
`.claude/knowledge/amx-enablement-and-kernel.md` (canonical ref);
`.claude/agents/amx-savant.md` (this agent); rewrote `.claude/AMX_GOTCHAS.md`
(corrected the 3 bugs it shipped, added the fault-signature playbook).
- **Outcome:** int8 GEMM 2048³ = 169.7 GMAC/s (339 GOP/s), 600× scalar; bf16
path correct. `amx_report()` → "AMX [Emerald Rapids expects_amx=true]:
TILE=true INT8=true BF16=true available=true".
- **Loose ends:** further AMX perf (2×2 register blocking + rayon); blasgraph
Hamming dedup in lance-graph (blocked on missing `protoc`).


## 2026-05-22T18:00 — PR-X12 cross-stack architecture session (opus 4.7)

Expand Down
Loading
Loading