diff --git a/README.md b/README.md index 062b4525..8df8a424 100644 --- a/README.md +++ b/README.md @@ -105,6 +105,131 @@ the binding correctness gate. Mac M4 evidence on `main`: Raw artifacts: [`results/platform-tests/bench_session_4h_1780332893.json`](results/platform-tests/) (4-h evidence) and the v0.3.0 GA tag's smoke run committed at `6399546`. +## Design philosophy — AR verifier + dLLM proposer, KV restoration for a memory-bounded Gemma-4 26B + +Kakeya pairs a frozen **autoregressive (AR) verifier** — `Gemma-4 26B-A4B-it` — with +a **diffusion-LM (dLLM) proposer** (`z-lab` DFlash, 0.4 B). The proposer's *first* +role is not "drafter" but **history reconstructor**: a dLLM carries **no KV cache** +and can emit transient K/V for *any* past position, so it can restore the verifier's +**evicted** K/V on demand. A small trained projection **f_θ** maps proposer hidden +states → verifier K/V; on Gemma-4 the **S5** strategy keeps the 5 full-attention +layers exact and restores the sliding-window layers. + +The whole architecture is built around one inequality: + +> Make `Gemma-4 26B-A4B-it` **memory-bounded** *without* trading away model +> **intelligence** (recall), **token throughput**, or **context length**. + +**KV restoration is the mechanism.** The verifier only ever keeps a **bounded +sink+window** of its own K/V resident (constant **~17 MB** on CUDA / **~133 MB** in +the Mac S5 config), while the *effective* attention context — the full +multi-thousand-token history — is **reconstructed on demand** by the proposer + f_θ. +Because the restored/spec-decoded K/V is **byte-checked** against the AR cache, the +**output is identical to the standalone AR model** (recall **1.0**): the memory win +costs **zero intelligence**. Throughput and context length are held at **parity** +(Mac) or **improved** (CUDA spec-decode **1.79× AR**) — never sacrificed. This is the +inversion of the usual quantize/evict trade-off: instead of *cheaper, dumber, shorter*, +KV restoration buys *bounded memory at full fidelity*. See +[ADR 0012](docs/adr/0012-proposer-verifier-value-proposition.md) (value realised on +the **memory axis** all-platform + **throughput** on CUDA) and +[ADR 0013](docs/adr/0013-distributed-inference-topology.md). + +### Beta scorecards — Kakeya vs the standalone model (`main` @ `9d5e6b4`) + +Both betas run the *same* `Gemma-4 26B-A4B-it` verifier, `z-lab` DFlash proposer, and +`f_theta_v5_s5_sliding`. "Standalone model" = the same Gemma-4 run **without** Kakeya +(`mlx_lm` AR oracle on Mac; HuggingFace bf16 AR on CUDA) — i.e. the honest *"what does +the engine cost vs just running the model?"* baseline. + +**Mac (MLX) — Kakeya vs `mlx_lm` AR oracle** · Mac mini M4 · 4-bit verifier: + +| Axis | Kakeya | MLX-only | Result | +| --- | --- | --- | --- | +| **Memory** (resident KV @ 5 810 tok) | **132.92 MB** (S5) | 1 308.88 MB | **89.8 % saved** (20 vs 220 KB/tok, 11× slower growth) | +| **Context length** | 4 406–5 810 tok handled, **recall 1.0** | recall 1.0 | byte-identical output | +| **Throughput** (code, 128-tok decode) | 21.68 tok/s | 23.26 tok/s | **0.93×** (≈ parity) | + +*Raw scorecard report — Mac MLX (reproducible evidence):* + +``` +Kakeya Inference Engine (MLX beta, main @ 9d5e6b4 / PR #117) vs MLX-only +Gemma-4 26B-A4B-it 4-bit, Mac mini M4, verifier=gemma-4-26B-A4B-it-mlx-4bit, +drafter=z-lab DFlash, f_theta=v5_s5_sliding, S5 (5 exact full-attn layers). + +================ 1) MEMORY BOUNDED (NIAH ctx280, T=5810 tok) ================ + Kakeya (S5) MLX-only (naive full-KV) +resident KV @5810 tok 132.92 MB 1308.88 MB -> 89.8% saved +KV growth per token 20.0 KB/tok 220.0 KB/tok -> 11x slower +exact full-attn layers 5,11,17,23,29 hold all 5810 pos (full recall) +sliding layers bounded to 68 resident positions + +================ 2) CONTEXT LENGTH (NIAH ctx280) =========================== +prompts handled 4406 - 5810 tokens +recall (Kakeya) 1.0 (5/5) == MLX-only oracle 1.0 (5/5) byte-identical +verifier attention ctx full 5810-tok window kept EXACT on 5 full-attn layers + while sliding layers stay window-bounded + +================ 3) TOKEN THROUGHPUT (code workload, 128-tok decode) ======== + Kakeya fused MLX-only AR ratio +long-sample mean (e2e) 21.68 tok/s 23.26 tok/s 0.93x (~parity) +decode-only (long) ~24-27 tok/s -- best 0.99x +recall 1.0 (8/8) 1.0 (8/8) byte-identical + +Net: Kakeya delivers bounded memory (~90% KV saving) + full-context recall at +MLX-only-identical output, at ~AR-parity throughput on Mac (the 26B verify(L) +compute per block is the throughput floor; >AR remains CUDA-favored: H200 1.79x). +``` + +**CUDA (H200) — Kakeya vs standalone Gemma-4 26B AR** · bf16: + +| Axis | Kakeya | AR | Result | +| --- | --- | --- | --- | +| **Memory** (resident KV @ 3 238 / 6 438 tok) | **constant 16.71 MB** | 733.06 / 1 453.96 MB | **43.9× / 87.0× saving** | +| **Context length** | 68-tok window ↦ 3 254 / 6 454 tok, **recall 1.0** | recall 1.0 | **47.9× / 94.9× compression** | +| **Throughput** (fused spec-decode, block-16) | **28.94 tok/s** | 16.13 tok/s | **1.79× AR** (accept-len 3.32) | + +*Raw scorecard report — CUDA H200 (reproducible evidence):* + +``` +Kakeya Inference Engine (GPU beta, main @ 9d5e6b4 / #107+#117) vs standalone AR +NVIDIA H200 · Gemma-4 26B-A4B-it (bf16) · verifier=google/gemma-4-26B-A4B-it +drafter=z-lab DFlash · f_theta=v5_s5_sliding · S5 (5 exact full-attn layers) +"AR" = standalone Gemma-4 26B AR model (GPU analog of "mlx-only"). + +================ 1) MEMORY BOUNDED (resident KV) =========================== +context rung AR full-KV Kakeya restored saving +3238-tok prompt 733.06 MB 16.71 MB 43.9x +6438-tok prompt 1453.96 MB 16.71 MB 87.0x +-> Kakeya KV is CONSTANT 16.71 MB (68-tok sink+window) regardless of context; + AR KV grows linearly. Saving scales with context length. + +================ 2) CONTEXT LENGTH (window vs effective) =================== +context rung resident window effective ctx compression recall +3238-tok prompt 68 tok 3254 tok 47.9x 1.0 == AR +6438-tok prompt 68 tok 6454 tok 94.9x 1.0 == AR +-> 68-token bounded window reconstructs full multi-thousand-token context + via f_theta/S5 restoration, with recall identical to AR. + +================ 3) TOKEN THROUGHPUT (decode tok/s, 3238-tok prompt) ======== +path tok/s vs AR recall +standalone AR 16.125 1.00x 1.0 +restored per-token (Gap A) 16.297 1.01x 1.0 (restoration is free) +Kakeya FUSED spec-decode 28.937 1.79x 1.0 (block-16, accept_len 3.32) +-> On GPU the fused spec-decode delivers 1.79x AR at byte-identical output, + because verify-batch is cheap (vs Mac ~0.93x where 26B verify(L) dominates). + +Net (GPU): bounded memory (44-87x KV saving, constant 16.71 MB) + full-context +recall (48-95x compression, recall 1.0) + 1.79x AR throughput, all at +AR-identical correctness. This is the platform where spec-decode value lands. +``` + +Both platforms hold **recall 1.0 / byte-identical output**. The fork is on the +throughput axis only: CUDA's cheap verify-batch turns spec-decode into a **1.79×** +win, while on Mac the **26 B `verify(L)` compute per block** is the floor, so the +engine lands at **≈ AR parity** — the memory + context wins are platform-independent. +Reproduce with `scripts/research/k3_e2e_gpu_bench.py` + `k3_specdecode_gpu_bench.py` +(CUDA) and the `k3-beta-scorecard` / `k3-fused-allmlx-code-trim` Mac-bridge presets. + ## Kakeya Inference Engine for Mac — MLX speculative-decode port (K3 beta baseline) After the **CUDA** beta (PR #107: f_θ + S5 K/V-restoration verifier, **fused DFlash @@ -125,7 +250,8 @@ the [Mac bridge](#evaluation-environment); ×AR is the ratio). **Honest ceiling & what was *ruled out*.** ≈AR parity is the Mac result on the spec-decode sweet spot (short-context, naturally-long *code/agent* generation); -**>AR meaningfully remains CUDA-favoured** (H200 1.27×) because the binding +**>AR meaningfully remains CUDA-favoured** (H200 **1.79×** fused/block-16 on the +fresh `main` scorecard above; #107 originally reported 1.27×) because the binding constraint is the **26B `verify(L)` compute per block** — *not* rollback (fixed), *not* sync count (a one-graph "single-fused" probe ran stably at ~0.16 s/block and was ≈ equal — the b876 single-fused "143 s" pathology is **large-cache-specific**,