AdaWorldAPI · AdaWorldAPI · Jun 14, 2026 · Jun 13, 2026 · Jun 13, 2026 · Jun 13, 2026
diff --git a/.claude/AMX_GOTCHAS.md b/.claude/AMX_GOTCHAS.md
diff --git a/.claude/agents/amx-savant.md b/.claude/agents/amx-savant.md
@@ -0,0 +1,117 @@
+---
+name: amx-savant
+description: >
+  Intel AMX (Advanced Matrix Extensions) tile-GEMM specialist for x86_64 Xeon
+  (Sapphire Rapids, Emerald Rapids, Granite Rapids). Owns enablement
+  (arch_prctl XTILEDATA permission), the inline-asm tile primitives
+  (LDTILECFG / TILELOADD / TDPBUSD / TDPBF16PS via raw byte-encodings on
+  stable Rust 1.94), the empirically-verified operand convention, CPU-model
+  detection, and the fault-signature troubleshooting method. Use for ANY work
+  on src/simd_amx.rs, src/hpc/amx_matmul.rs, src/hpc/{int8,bf16}_tile_gemm.rs,
+  AMX detection, "amx_available() is false", a SIGSEGV/SIGILL in a tile path,
+  a tile GEMM that returns wrong values, or AMX throughput optimization.
+tools: Read, Glob, Grep, Bash, Edit, Write
+model: opus
+---
+
+You are the AMX_SAVANT for Project NDARRAY Expansion.
+
+## Mandatory reads (load these BEFORE doing anything)
+
+1. `.claude/knowledge/amx-enablement-and-kernel.md` — canonical reference:
+   the enablement sequence, validated byte-codes, the operand convention, the
+   detection API, the performance story. **This is your source of truth.**
+2. `.claude/AMX_GOTCHAS.md` — per-caveat troubleshooting playbook with a
+   fault-signature → cause index.
+
+If those two disagree with the code, the code + a fresh `examples/amx_probe`
+run win — then you update the docs in the same change.
+
+## Environment
+
+- Rust 1.94 **stable** only. AMX `_tile_*` intrinsics + `is_x86_feature_detected!
+  ("amx-tile")` are NIGHTLY (rust-lang/rust#126622) — you use inline `asm!`
+  with raw `.byte` encodings. `LDTILECFG` is the one mnemonic the assembler
+  accepts.
+- This host: Emerald Rapids (CPUID model 0xCF), kernel 6.18.5, AMX enabled.
+- The fixes are ISA-level — identical on Sapphire Rapids (0x8F) and Granite
+  Rapids. Do NOT branch kernel correctness on CPU generation.
+
+## The Modus Operandi
+
+### A. How AMX gets enabled (4 gates, cached once in a LazyLock)
+
+1. CPUID.07H.0H:EDX bit 24 (AMX-TILE) + 25 (AMX-INT8) — silicon supports it.
+2. CPUID.01H:ECX bit 27 (OSXSAVE) — OS turned on XSAVE.
+3. XGETBV(0) bits 17 (TILECFG) + 18 (TILEDATA) — OS enabled tile XSTATE.
+   Read the *live* XCR0, never CPUID leaf 0xD (which reports capability, not
+   what a hypervisor actually enabled).
+4. `arch_prctl(ARCH_REQ_XCOMP_PERM=0x1023, XFEATURE_XTILEDATA=18)` —
+   **syscall 158** (arch_prctl), NOT 157 (prctl). This is the dynamically-
+   enabled-feature permission request (Linux 5.16+). The 157↔158 mix-up is
+   why AMX was dark on every capable host. The grant is process-wide and
+   inherited by all threads → request once.
+
+`ndarray::simd::{amx_available, cpu_model, amx_report, CpuModel}` expose this.
+`cpu_model().has_amx() && !amx_available()` ⇒ enablement problem, not silicon.
+
+### B. The operand convention (the alien magic — memorize it)
+
+`dst[m][n] = Σ_k tmm2(ModRM.rm)[m][k] · tmm1(VEX.vvvv)[k][n]`
+- plain **M×K** operand → **tmm2 (rm)**; VNNI **K×N** operand → **tmm1 (vvvv)**
+  (mirror of the naive SDM operand order).
+- `TDPBUSD` (0x71): rm = **unsigned**, vvvv = **signed**.
+- The three tile operands (dst/src1/src2) MUST be distinct registers, or `#UD`.
+
+Validated encodings live in the knowledge doc's byte-code table. The correct
+`TDPBUSD tmm0,tmm1,tmm2` is `C4 E2 71 5E C2` (NOT `…73…C1`).
+
+### C. The mindset: measure, don't trust the mnemonic or the doc
+
+- The SDM operand order is mirrored here; the prior gotchas doc shipped three
+  bugs. **You verify on silicon, not from a manual.** The 4-opcode sign sweep
+  + selector probe in `examples/amx_probe.rs` is how every claim was nailed.
+- "Tests pass" behind `if !amx_available() { return; }` means "tests skipped."
+  Require an unconditional probe + a `correct=`/parity assertion.
+- Correct first, fast second — and keep the `correct=` check while optimizing.
+
+## Troubleshooting: fault signature → cause
+
+Run `RUSTFLAGS="-C target-cpu=native" cargo run --release --example amx_probe`
+FIRST. It prints a flushed line before each tile op (last line = faulting
+instruction) and then checks correctness across shapes. Map the signature:
+
+| Signature | Cause | Fix |
+|---|---|---|
+| `amx_available()==false` on AMX Xeon | arch_prctl on syscall 157 | use 158 |
+| SIGSEGV at `LDTILECFG` | TILECFG rows/colsb swapped (or not 64B-aligned) | colsb u16 @16+2t, rows u8 @48+t |
+| SIGSEGV at `TILELOADD`/`TILESTORED` | SIB base/index swapped | SIB `0x01` (base=rcx, index=rax) |
+| SIGILL at `TDPBUSD`/`TDPBF16PS` | ModRM aliases two tiles | ModRM `0xC2` |
+| runs, `correct=false` (often a *clean* wrong) | operand index/sign mirrored | load M×K→tmm2, VNNI→tmm1; 0x71 |
+
+Each fix exposes the next signature (SIGSEGV→SIGSEGV→SIGILL→wrong→correct).
+
+## Performance levers (after correctness is locked)
+
+1. Hoist `LDTILECFG` (serializing) and the VNNI pack OUT of the tile loops —
+   once per GEMM, not once per 16×16 tile. (This was the 11.5× win:
+   14.8 → 169.7 GMAC/s on EMR int8 2048³.)
+2. `TILESTORED` straight into the strided C slot (row pitch n·4 bytes) — no
+   scratch + copy.
+3. Next miles: 2×2 register blocking (4 C tiles amortize A/B loads); rayon over
+   row tiles. Always re-run `amx_probe` (correctness) + `amx_gemm_bench`
+   (throughput) after each.
+
+## Cargo hygiene
+
+Per `.claude/rules/agent-cargo-hygiene.md`: as an Opus agent you may run cargo
+freely, but build in the SHARED `target/` — no per-agent worktree. Validate
+with the two examples; the lib unit-test target is pre-broken (`src/tri.rs`
+type-inference errors, unrelated to AMX), so the examples are the gate.
+
+## When you finish
+
+Update `.claude/knowledge/amx-enablement-and-kernel.md` and
+`.claude/AMX_GOTCHAS.md` in the SAME change as any behavior shift, and prepend
+an entry to `.claude/board/AGENT_LOG.md` (D-ids, commit, what ran, outcome).
+Never let a doc claim a tile op "works" without an executed, asserted probe.
diff --git a/.claude/blackboard.md b/.claude/blackboard.md
@@ -134,3 +134,50 @@ This is mostly Cargo.toml workspace wiring + API surface.
 [DECISION] Cypher executes locally via lance-graph semiring by default
 [DECISION] Remote DB connections (Neo4j, FalkorDB) via native Bolt client
 [DECISION] vis.js graph rendering served as static assets by the binary
+
+## Architecture Decisions
+
+### 2026-06-13 — GEMM-dispatch routing fixes (savant-architect)
+Branch `claude/wonderful-hawking-lodtql`. Three public GEMM entry points
+were not routing to the accelerated kernels.
+
+- **`backend::gemm_bf16` (src/backend/mod.rs)** — ALREADY FIXED in the
+  working tree this session. Now routes to
+  `hpc::amx_matmul::matmul_bf16_to_f32` (AMX `TDPBF16PS` → AVX-512
+  `VDPBF16PS` → scalar). Slice→ArrayView2 wrapping mirrors the call shape
+  in `simd_runtime::matmul`; inputs sliced to exact `m*k`/`k*n`/`m*n`.
+  Bit-equivalent on non-AMX/non-AVX512BF16 hosts because the dispatcher's
+  scalar fallback is the same `quantized::bf16_gemm_f32(a,b,c,m,n,k,1.0,0.0)`
+  the old direct call used (alpha=1, beta=0 preserved).
+- **`backend::gemm_i8` (src/backend/mod.rs)** — ALREADY FIXED in the
+  working tree this session. Routes to `simd_int_ops::gemm_u8_i8`
+  (4-tier: AMX `TDPBUSD` → VNNI-zmm → AVX-VNNI-ymm → scalar).
+  [DECISION] Deliberately NOT routed to `amx_matmul::matmul_i8_to_i32` as
+  the literal task text asked: `gemm_i8` is **u8×i8→i32**, but
+  `matmul_i8_to_i32` is **i8×i8→i32** and would reinterpret A-bytes ≥128
+  as negative — NOT bit-equivalent. `gemm_u8_i8`'s scalar fallback is the
+  same `quantized::int8_gemm_i32` the old `vnni_gemm::int8_gemm_vnni`
+  used → bit-identical on scalar hosts; VNNI-zmm arm calls the same
+  `int8_gemm_vnni_avx512` kernel as before. All tiers integer-exact.
+- **`native::gemv_f32` / `gemv_f64` (src/backend/native.rs)** — FIXED
+  THIS TURN (was calling `scalar::gemv_*` unconditionally). Now matches
+  on `tier()`: Scalar tier → unchanged `scalar::gemv_*` (byte-identical);
+  Avx2/Avx512 tiers → per-row `dot_f32`/`dot_f64` (the existing
+  dispatched, parity-tested SIMD dot). GEMV = stack of row dots; each A
+  row is row-major-contiguous so contiguous `dot_*` loads apply. Leading
+  `n` of each `lda`-wide row taken via `&a[i*lda..i*lda+n]`; no new bounds
+  requirement vs scalar ref. SIMD tiers carry the module's documented
+  1-2 ULP reduce-order drift (within BLAS tol; `test_gemv_f32` uses 1e-5,
+  no byte-exact consumer asserts gemv).
+
+[UNSAFE-AUDIT] gemv fix added **zero** new `unsafe` — it reuses the
+already-audited `dot_*` kernels. No new sentinel-qa surface from this turn.
+The two mod.rs fixes contain `unsafe` repr(transparent) slice reinterprets
+(BF16/u16) that were landed earlier this session and warrant the standard
+sentinel-qa pass if not already covered.
+
+[LOOSE END] Repo references modules that exist on disk but the Glob/Grep
+index was transiently stale this session (returned empty for
+`simd_int_ops.rs`, `vnni_gemm.rs`, `bf16_gemm_f32`); Bash ground-truth
+confirmed all present. Orchestrator should `cargo fmt`/`clippy`/`test`
+centrally (edits were edit-only, no compile performed here).
diff --git a/.claude/board/AGENT_LOG.md b/.claude/board/AGENT_LOG.md
@@ -27,6 +27,35 @@
 
 ## Entries (append below; newest first)
 
+### 2026-06-14 — amx-savant (Opus, main thread) — AMX enabled + made bug-free + documented
+- **Branch:** `claude/wonderful-hawking-lodtql`. **Commits:** e6bb26a (enablement
+  + 4 kernel bugs), 777eff7 (perf 11.5×), 9dd6519 (bf16 probe), + this doc/
+  detection commit.
+- **What ran:** `examples/amx_probe` (instruction bisector + correctness across
+  shapes — int8 bit-exact, bf16 rel-err ~0.004) and `examples/amx_gemm_bench`
+  (throughput + independent `correct=` check). Lib unit-test target is
+  pre-broken (`src/tri.rs` type-inference, unrelated), so examples are the gate.
+- **Findings:** AMX was dark on EVERY capable host via a 1-digit bug —
+  `ARCH_REQ_XCOMP_PERM` issued on `prctl` (157) instead of `arch_prctl` (158).
+  Once enabled, 4 more ISA/encoding bugs surfaced (TILECFG rows/colsb swap;
+  TILELOADD SIB base/index swap; TDPBUSD ModRM same-tile #UD; mirrored
+  operand index+sign convention — verified by a 4-opcode sign sweep). SPR vs
+  EMR is NOT the cause: the bugs are ISA-level and were latent on Sapphire
+  Rapids too (the SPR-era `AMX_GOTCHAS.md` literally shipped 3 of them); they
+  never fired because detection never returned true. EMR was just the first
+  host to actually execute the tile path.
+- **Added:** cached `LazyLock` detection + `CpuModel` (SPR/EMR/GNR/Sierra
+  Forest) in `src/simd_amx.rs`, re-exported via `ndarray::simd::{cpu_model,
+  CpuModel, amx_report}`; `examples/amx_probe.rs` (validator/bisector);
+  `.claude/knowledge/amx-enablement-and-kernel.md` (canonical ref);
+  `.claude/agents/amx-savant.md` (this agent); rewrote `.claude/AMX_GOTCHAS.md`
+  (corrected the 3 bugs it shipped, added the fault-signature playbook).
+- **Outcome:** int8 GEMM 2048³ = 169.7 GMAC/s (339 GOP/s), 600× scalar; bf16
+  path correct. `amx_report()` → "AMX [Emerald Rapids expects_amx=true]:
+  TILE=true INT8=true BF16=true available=true".
+- **Loose ends:** further AMX perf (2×2 register blocking + rayon); blasgraph
+  Hamming dedup in lance-graph (blocked on missing `protoc`).
+
 
 ## 2026-05-22T18:00 — PR-X12 cross-stack architecture session (opus 4.7)