FluffyAIcode · cursor · Jun 13, 2026
diff --git a/README.md b/README.md
@@ -105,6 +105,131 @@ the binding correctness gate. Mac M4 evidence on `main`:
 
 Raw artifacts: [`results/platform-tests/bench_session_4h_1780332893.json`](results/platform-tests/) (4-h evidence) and the v0.3.0 GA tag's smoke run committed at `6399546`.
 
+## Design philosophy — AR verifier + dLLM proposer, KV restoration for a memory-bounded Gemma-4 26B
+
+Kakeya pairs a frozen **autoregressive (AR) verifier** — `Gemma-4 26B-A4B-it` — with
+a **diffusion-LM (dLLM) proposer** (`z-lab` DFlash, 0.4 B). The proposer's *first*
+role is not "drafter" but **history reconstructor**: a dLLM carries **no KV cache**
+and can emit transient K/V for *any* past position, so it can restore the verifier's
+**evicted** K/V on demand. A small trained projection **f_θ** maps proposer hidden
+states → verifier K/V; on Gemma-4 the **S5** strategy keeps the 5 full-attention
+layers exact and restores the sliding-window layers.
+
+The whole architecture is built around one constraint:
+
+> Make `Gemma-4 26B-A4B-it` **memory-bounded** *without* trading away model
+> **intelligence** (recall), **token throughput**, or **context length**.
+
+**KV restoration is the mechanism.** The verifier only ever keeps a **bounded
+sink+window** of its own K/V resident (constant **~17 MB** on CUDA; **~133 MB** at 5 810 tok in
+the Mac S5 config), while the *effective* attention context — the full
+multi-thousand-token history — is **reconstructed on demand** by the proposer + f_θ.
+Because the restored/spec-decoded K/V is **byte-checked** against the AR cache, the
+**output is identical to the standalone AR model** (recall **1.0**): the memory win
+costs **zero intelligence**. Throughput and context length are held at **parity**
+(Mac) or **improved** (CUDA spec-decode **1.79× AR**) — never sacrificed. This is the
+inversion of the usual quantize/evict trade-off: instead of *cheaper, dumber, shorter*,
+KV restoration buys *bounded memory at full fidelity*. See
+[ADR 0012](docs/adr/0012-proposer-verifier-value-proposition.md) (value realised on
+the **memory axis** on all platforms and the **throughput** axis on CUDA) and
+[ADR 0013](docs/adr/0013-distributed-inference-topology.md).
+
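A minimal sketch of the sink+window residency idea described above, for illustration only. The README states that 68 positions stay resident; the 4 sink + 64 window split and both function names are assumptions made for this sketch, not the engine's actual configuration or API.

```python
from __future__ import annotations


def resident_positions(seq_len: int, sink: int, window: int) -> list[int]:
    """Positions whose K/V stay resident under a sink+window policy."""
    if seq_len <= sink + window:
        # Short contexts fit entirely, so nothing is evicted.
        return list(range(seq_len))
    # Keep the first `sink` positions and the most recent `window` positions.
    return list(range(sink)) + list(range(seq_len - window, seq_len))


def evicted_positions(seq_len: int, sink: int, window: int) -> list[int]:
    """Positions that would have to be restored on demand."""
    kept = set(resident_positions(seq_len, sink, window))
    return [p for p in range(seq_len) if p not in kept]


# Hypothetical split of the 68 resident positions (assumption, not from the README).
SINK, WINDOW = 4, 64

for seq_len in (3238, 6438):
    kept = resident_positions(seq_len, SINK, WINDOW)
    gone = evicted_positions(seq_len, SINK, WINDOW)
    print(f"T={seq_len}: resident={len(kept)} evicted={len(gone)}")
```

The resident count stays fixed as the context grows, while the evicted set (what the proposer + f_θ would have to reconstruct) grows with it.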
+### Beta scorecards — Kakeya vs the standalone model (`main` @ `9d5e6b4`)
+
+Both betas run the *same* `Gemma-4 26B-A4B-it` verifier, `z-lab` DFlash proposer, and
+`f_theta_v5_s5_sliding`. "Standalone model" = the same Gemma-4 run **without** Kakeya
+(`mlx_lm` AR oracle on Mac; HuggingFace bf16 AR on CUDA) — i.e. the honest *"what does
+the engine cost vs just running the model?"* baseline.
+
+**Mac (MLX) — Kakeya vs `mlx_lm` AR oracle** · Mac mini M4 · 4-bit verifier:
+
+| Axis | Kakeya | MLX-only | Result |
+| --- | --- | --- | --- |
+| **Memory** (resident KV @ 5 810 tok) | **132.92 MB** (S5) | 1 308.88 MB | **89.8 % saved** (20 vs 220 KB/tok, 11× slower growth) |
+| **Context length** | 4 406–5 810 tok handled, **recall 1.0** | recall 1.0 | byte-identical output |
+| **Throughput** (code, 128-tok decode) | 21.68 tok/s | 23.26 tok/s | **0.93×** (≈ parity) |
+
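The "Result" column of the Mac table can be re-derived from the Kakeya and MLX-only columns. The snippet below is only an arithmetic check on the published figures; the variable names are illustrative.

```python
# Figures copied from the Mac (MLX) scorecard table.
kakeya_mb, mlx_mb = 132.92, 1308.88                 # resident KV at 5 810 tokens
kakeya_kb_per_tok, mlx_kb_per_tok = 20.0, 220.0     # KV growth per token
kakeya_tps, mlx_tps = 21.68, 23.26                  # code workload, 128-token decode

# Memory saved, as a percentage of the MLX-only resident KV.
saved_pct = 100.0 * (1.0 - kakeya_mb / mlx_mb)

# How much more slowly the Kakeya KV grows per token.
growth_ratio = mlx_kb_per_tok / kakeya_kb_per_tok

# Throughput relative to the MLX-only AR oracle.
throughput_ratio = kakeya_tps / mlx_tps

print(f"memory saved:     {saved_pct:.1f} %")
print(f"growth ratio:     {growth_ratio:.0f}x")
print(f"throughput ratio: {throughput_ratio:.2f}x")
```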
+*Raw scorecard report — Mac MLX (reproducible evidence):*
+
+```
+Kakeya Inference Engine (MLX beta, main @ 9d5e6b4 / PR #117) vs MLX-only
+Gemma-4 26B-A4B-it 4-bit, Mac mini M4, verifier=gemma-4-26B-A4B-it-mlx-4bit,
+drafter=z-lab DFlash, f_theta=v5_s5_sliding, S5 (5 exact full-attn layers).
+
+================ 1) MEMORY BOUNDED  (NIAH ctx280, T=5810 tok) ================
+                         Kakeya (S5)     MLX-only (naive full-KV)
+resident KV @5810 tok    132.92 MB       1308.88 MB        -> 89.8% saved
+KV growth per token       20.0 KB/tok      220.0 KB/tok     -> 11x slower
+exact full-attn layers    5,11,17,23,29 hold all 5810 pos (full recall)
+sliding layers            bounded to 68 resident positions
+
+================ 2) CONTEXT LENGTH  (NIAH ctx280) ===========================
+prompts handled          4406 - 5810 tokens
+recall (Kakeya)          1.0  (5/5)   == MLX-only oracle 1.0 (5/5)  byte-identical
+verifier attention ctx   full 5810-tok window kept EXACT on 5 full-attn layers
+                         while sliding layers stay window-bounded
+
+================ 3) TOKEN THROUGHPUT  (code workload, 128-tok decode) ========
+                         Kakeya fused    MLX-only AR        ratio
+long-sample mean (e2e)   21.68 tok/s     23.26 tok/s        0.93x  (~parity)
+decode-only (long)       ~24-27 tok/s    --                 best 0.99x
+recall                   1.0 (8/8)       1.0 (8/8)          byte-identical
+
+Net: Kakeya delivers bounded memory (~90% KV saving) + full-context recall at
+MLX-only-identical output, at ~AR-parity throughput on Mac (the 26B verify(L)
+compute per block is the throughput floor; >AR remains CUDA-favored: H200 1.79x).
+```
+
+**CUDA (H200) — Kakeya vs standalone Gemma-4 26B AR** · bf16:
+
+| Axis | Kakeya | AR | Result |
+| --- | --- | --- | --- |
+| **Memory** (resident KV @ 3 238 / 6 438 tok) | **constant 16.71 MB** | 733.06 / 1 453.96 MB | **43.9× / 87.0× saving** |
+| **Context length** | 68-tok window ↦ 3 254 / 6 454 tok, **recall 1.0** | recall 1.0 | **47.9× / 94.9× compression** |
+| **Throughput** (fused spec-decode, block-16) | **28.94 tok/s** | 16.13 tok/s | **1.79× AR** (accept-len 3.32) |
+
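Because the scorecard reports AR KV at two context rungs and describes its growth as linear, a two-point fit gives a rough per-token cost and lets the saving be projected to other lengths. This is a sketch under that linearity assumption; it also assumes the Kakeya resident KV stays at 16.71 MB beyond the two measured rungs, which the README does not show. The function names are illustrative.

```python
# (prompt tokens, AR resident KV in MB) from the CUDA scorecard table.
points = [(3238, 733.06), (6438, 1453.96)]
KAKEYA_MB = 16.71  # constant resident KV reported for Kakeya

(t0, m0), (t1, m1) = points
slope = (m1 - m0) / (t1 - t0)      # MB of AR KV per extra token
intercept = m0 - slope * t0        # fixed offset of the linear fit


def ar_kv_mb(tokens: int) -> float:
    """AR resident KV predicted by the two-point linear fit."""
    return intercept + slope * tokens


def saving(tokens: int) -> float:
    """Ratio of predicted AR KV to the constant Kakeya KV."""
    return ar_kv_mb(tokens) / KAKEYA_MB


print(f"AR KV growth: {slope:.4f} MB/token")
for tokens in (3238, 6438, 32768):  # 32768 is an extrapolation, not a measurement
    print(f"{tokens:>6} tok: AR {ar_kv_mb(tokens):8.2f} MB, saving {saving(tokens):6.1f}x")
```

The two measured rungs reproduce the table's saving figures by construction; any other length is a projection, not evidence.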
+*Raw scorecard report — CUDA H200 (reproducible evidence):*
+
+```
+Kakeya Inference Engine (GPU beta, main @ 9d5e6b4 / #107+#117) vs standalone AR
+NVIDIA H200 · Gemma-4 26B-A4B-it (bf16) · verifier=google/gemma-4-26B-A4B-it
+drafter=z-lab DFlash · f_theta=v5_s5_sliding · S5 (5 exact full-attn layers)
+"AR" = standalone Gemma-4 26B AR model (GPU analog of "mlx-only").
+
+================ 1) MEMORY BOUNDED  (resident KV) ===========================
+context rung      AR full-KV      Kakeya restored     saving
+3238-tok prompt   733.06 MB       16.71 MB            43.9x
+6438-tok prompt   1453.96 MB      16.71 MB            87.0x
+-> Kakeya KV is CONSTANT 16.71 MB (68-tok sink+window) regardless of context;
+   AR KV grows linearly. Saving scales with context length.
+
+================ 2) CONTEXT LENGTH  (window vs effective) ===================
+context rung      resident window   effective ctx      compression   recall
+3238-tok prompt   68 tok            3254 tok           47.9x         1.0 == AR
+6438-tok prompt   68 tok            6454 tok           94.9x         1.0 == AR
+-> 68-token bounded window reconstructs full multi-thousand-token context
+   via f_theta/S5 restoration, with recall identical to AR.
+
+================ 3) TOKEN THROUGHPUT  (decode tok/s, 3238-tok prompt) ========
+path                         tok/s     vs AR     recall
+standalone AR                16.125    1.00x     1.0
+restored per-token (Gap A)   16.297    1.01x     1.0   (restoration is free)
+Kakeya FUSED spec-decode     28.937    1.79x     1.0   (block-16, accept_len 3.32)
+-> On GPU the fused spec-decode delivers 1.79x AR at byte-identical output,
+   because verify-batch is cheap (vs Mac ~0.93x where 26B verify(L) dominates).
+
+Net (GPU): bounded memory (44-87x KV saving, constant 16.71 MB) + full-context
+recall (48-95x compression, recall 1.0) + 1.79x AR throughput, all at
+AR-identical correctness. This is the platform where spec-decode value lands.
+```
+
+Both platforms hold **recall 1.0 / byte-identical output**. They differ on the
+throughput axis only: CUDA's cheap verify-batch turns spec-decode into a **1.79×**
+win, while on Mac the **26 B `verify(L)` compute per block** is the floor, so the
+engine lands at **≈ AR parity** — the memory + context wins are platform-independent.
+Reproduce with `scripts/research/k3_e2e_gpu_bench.py` + `k3_specdecode_gpu_bench.py`
+(CUDA) and the `k3-beta-scorecard` / `k3-fused-allmlx-code-trim` Mac-bridge presets.
+
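One way to read the throughput split is a simple cost model: spec-decode beats AR only when one draft+verify cycle costs fewer AR-step equivalents than the number of tokens it commits. The sketch below applies that model to the CUDA figures. It assumes each cycle commits `accept_len` tokens on average, which is an interpretation of the reported "accept-len 3.32" rather than something the README states; the names are illustrative.

```python
# Figures from the CUDA H200 scorecard (3238-token prompt).
ar_tps = 16.125       # standalone AR decode, tokens/s
fused_tps = 28.937    # Kakeya fused spec-decode, tokens/s
accept_len = 3.32     # assumed: mean tokens committed per draft+verify cycle

ar_step_s = 1.0 / ar_tps            # wall time of one AR decode step
cycle_s = accept_len / fused_tps    # implied wall time of one spec-decode cycle

# Cost of one cycle expressed in AR-step equivalents.
cycle_cost = cycle_s / ar_step_s


def speedup_model(accept_len: float, cycle_cost: float) -> float:
    """Speedup over AR: tokens per cycle divided by cycle cost in AR steps."""
    return accept_len / cycle_cost


print(f"AR step:     {ar_step_s * 1e3:.1f} ms")
print(f"spec cycle:  {cycle_s * 1e3:.1f} ms ({cycle_cost:.2f} AR steps)")
print(f"speedup:     {speedup_model(accept_len, cycle_cost):.2f}x")
```

Under this model the speedup drops to 1.0 or below once a cycle costs as many AR steps as it commits tokens, which is consistent with the Mac result where the verifier's per-block compute dominates.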
 ## Kakeya Inference Engine for Mac — MLX speculative-decode port (K3 beta baseline)
 
 After the **CUDA** beta (PR #107: f_θ + S5 K/V-restoration verifier, **fused DFlash
@@ -125,7 +250,8 @@ the [Mac bridge](#evaluation-environment); ×AR is the ratio).
 
 **Honest ceiling & what was *ruled out*.** ≈AR parity is the Mac result on the
 spec-decode sweet spot (short-context, naturally-long *code/agent* generation);
-**>AR meaningfully remains CUDA-favoured** (H200 1.27×) because the binding
+**>AR meaningfully remains CUDA-favoured** (H200 **1.79×** fused/block-16 on the
+fresh `main` scorecard above; #107 originally reported 1.27×) because the binding
 constraint is the **26B `verify(L)` compute per block** — *not* rollback (fixed),
 *not* sync count (a one-graph "single-fused" probe ran stably at ~0.16 s/block and
 was ≈ equal — the b876 single-fused "143 s" pathology is **large-cache-specific**,