Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
128 changes: 127 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -105,6 +105,131 @@ the binding correctness gate. Mac M4 evidence on `main`:

Raw artifacts: [`results/platform-tests/bench_session_4h_1780332893.json`](results/platform-tests/) (4-h evidence) and the v0.3.0 GA tag's smoke run committed at `6399546`.

## Design philosophy — AR verifier + dLLM proposer, KV restoration for a memory-bounded Gemma-4 26B

Kakeya pairs a frozen **autoregressive (AR) verifier** — `Gemma-4 26B-A4B-it` — with
a **diffusion-LM (dLLM) proposer** (`z-lab` DFlash, 0.4 B). The proposer's *first*
role is not "drafter" but **history reconstructor**: a dLLM carries **no KV cache**
and can emit transient K/V for *any* past position, so it can restore the verifier's
**evicted** K/V on demand. A small trained projection **f_θ** maps proposer hidden
states → verifier K/V; on Gemma-4 the **S5** strategy keeps the 5 full-attention
layers exact and restores the sliding-window layers.

The whole architecture is built around one inequality:

> Make `Gemma-4 26B-A4B-it` **memory-bounded** *without* trading away model
> **intelligence** (recall), **token throughput**, or **context length**.

**KV restoration is the mechanism.** The verifier only ever keeps a **bounded
sink+window** of its own K/V resident (constant **~17 MB** on CUDA / **~133 MB** in
the Mac S5 config), while the *effective* attention context — the full
multi-thousand-token history — is **reconstructed on demand** by the proposer + f_θ.
Because the restored/spec-decoded K/V is **byte-checked** against the AR cache, the
**output is identical to the standalone AR model** (recall **1.0**): the memory win
costs **zero intelligence**. Throughput and context length are held at **parity**
(Mac) or **improved** (CUDA spec-decode **1.79× AR**) — never sacrificed. This is the
inversion of the usual quantize/evict trade-off: instead of *cheaper, dumber, shorter*,
KV restoration buys *bounded memory at full fidelity*. See
[ADR 0012](docs/adr/0012-proposer-verifier-value-proposition.md) (value realised on
the **memory axis** all-platform + **throughput** on CUDA) and
[ADR 0013](docs/adr/0013-distributed-inference-topology.md).

### Beta scorecards — Kakeya vs the standalone model (`main` @ `9d5e6b4`)

Both betas run the *same* `Gemma-4 26B-A4B-it` verifier, `z-lab` DFlash proposer, and
`f_theta_v5_s5_sliding`. "Standalone model" = the same Gemma-4 run **without** Kakeya
(`mlx_lm` AR oracle on Mac; HuggingFace bf16 AR on CUDA) — i.e. the honest *"what does
the engine cost vs just running the model?"* baseline.

**Mac (MLX) — Kakeya vs `mlx_lm` AR oracle** · Mac mini M4 · 4-bit verifier:

| Axis | Kakeya | MLX-only | Result |
| --- | --- | --- | --- |
| **Memory** (resident KV @ 5 810 tok) | **132.92 MB** (S5) | 1 308.88 MB | **89.8 % saved** (20 vs 220 KB/tok, 11× slower growth) |
| **Context length** | 4 406–5 810 tok handled, **recall 1.0** | recall 1.0 | byte-identical output |
| **Throughput** (code, 128-tok decode) | 21.68 tok/s | 23.26 tok/s | **0.93×** (≈ parity) |

*Raw scorecard report — Mac MLX (reproducible evidence):*

```
Kakeya Inference Engine (MLX beta, main @ 9d5e6b4 / PR #117) vs MLX-only
Gemma-4 26B-A4B-it 4-bit, Mac mini M4, verifier=gemma-4-26B-A4B-it-mlx-4bit,
drafter=z-lab DFlash, f_theta=v5_s5_sliding, S5 (5 exact full-attn layers).

================ 1) MEMORY BOUNDED (NIAH ctx280, T=5810 tok) ================
Kakeya (S5) MLX-only (naive full-KV)
resident KV @5810 tok 132.92 MB 1308.88 MB -> 89.8% saved
KV growth per token 20.0 KB/tok 220.0 KB/tok -> 11x slower
exact full-attn layers 5,11,17,23,29 hold all 5810 pos (full recall)
sliding layers bounded to 68 resident positions

================ 2) CONTEXT LENGTH (NIAH ctx280) ===========================
prompts handled 4406 - 5810 tokens
recall (Kakeya) 1.0 (5/5) == MLX-only oracle 1.0 (5/5) byte-identical
verifier attention ctx full 5810-tok window kept EXACT on 5 full-attn layers
while sliding layers stay window-bounded

================ 3) TOKEN THROUGHPUT (code workload, 128-tok decode) ========
Kakeya fused MLX-only AR ratio
long-sample mean (e2e) 21.68 tok/s 23.26 tok/s 0.93x (~parity)
decode-only (long) ~24-27 tok/s -- best 0.99x
recall 1.0 (8/8) 1.0 (8/8) byte-identical

Net: Kakeya delivers bounded memory (~90% KV saving) + full-context recall at
MLX-only-identical output, at ~AR-parity throughput on Mac (the 26B verify(L)
compute per block is the throughput floor; >AR remains CUDA-favored: H200 1.79x).
```

**CUDA (H200) — Kakeya vs standalone Gemma-4 26B AR** · bf16:

| Axis | Kakeya | AR | Result |
| --- | --- | --- | --- |
| **Memory** (resident KV @ 3 238 / 6 438 tok) | **constant 16.71 MB** | 733.06 / 1 453.96 MB | **43.9× / 87.0× saving** |
| **Context length** | 68-tok window ↦ 3 254 / 6 454 tok, **recall 1.0** | recall 1.0 | **47.9× / 94.9× compression** |
| **Throughput** (fused spec-decode, block-16) | **28.94 tok/s** | 16.13 tok/s | **1.79× AR** (accept-len 3.32) |

*Raw scorecard report — CUDA H200 (reproducible evidence):*

```
Kakeya Inference Engine (GPU beta, main @ 9d5e6b4 / #107+#117) vs standalone AR
NVIDIA H200 · Gemma-4 26B-A4B-it (bf16) · verifier=google/gemma-4-26B-A4B-it
drafter=z-lab DFlash · f_theta=v5_s5_sliding · S5 (5 exact full-attn layers)
"AR" = standalone Gemma-4 26B AR model (GPU analog of "mlx-only").

================ 1) MEMORY BOUNDED (resident KV) ===========================
context rung AR full-KV Kakeya restored saving
3238-tok prompt 733.06 MB 16.71 MB 43.9x
6438-tok prompt 1453.96 MB 16.71 MB 87.0x
-> Kakeya KV is CONSTANT 16.71 MB (68-tok sink+window) regardless of context;
AR KV grows linearly. Saving scales with context length.

================ 2) CONTEXT LENGTH (window vs effective) ===================
context rung resident window effective ctx compression recall
3238-tok prompt 68 tok 3254 tok 47.9x 1.0 == AR
6438-tok prompt 68 tok 6454 tok 94.9x 1.0 == AR
-> 68-token bounded window reconstructs full multi-thousand-token context
via f_theta/S5 restoration, with recall identical to AR.

================ 3) TOKEN THROUGHPUT (decode tok/s, 3238-tok prompt) ========
path tok/s vs AR recall
standalone AR 16.125 1.00x 1.0
restored per-token (Gap A) 16.297 1.01x 1.0 (restoration is free)
Kakeya FUSED spec-decode 28.937 1.79x 1.0 (block-16, accept_len 3.32)
-> On GPU the fused spec-decode delivers 1.79x AR at byte-identical output,
because verify-batch is cheap (vs Mac ~0.93x where 26B verify(L) dominates).

Net (GPU): bounded memory (44-87x KV saving, constant 16.71 MB) + full-context
recall (48-95x compression, recall 1.0) + 1.79x AR throughput, all at
AR-identical correctness. This is the platform where spec-decode value lands.
```

Both platforms hold **recall 1.0 / byte-identical output**. The fork is on the
throughput axis only: CUDA's cheap verify-batch turns spec-decode into a **1.79×**
win, while on Mac the **26 B `verify(L)` compute per block** is the floor, so the
engine lands at **≈ AR parity** — the memory + context wins are platform-independent.
Reproduce with `scripts/research/k3_e2e_gpu_bench.py` + `k3_specdecode_gpu_bench.py`
(CUDA) and the `k3-beta-scorecard` / `k3-fused-allmlx-code-trim` Mac-bridge presets.

## Kakeya Inference Engine for Mac — MLX speculative-decode port (K3 beta baseline)

After the **CUDA** beta (PR #107: f_θ + S5 K/V-restoration verifier, **fused DFlash
Expand All @@ -125,7 +250,8 @@ the [Mac bridge](#evaluation-environment); ×AR is the ratio).

**Honest ceiling & what was *ruled out*.** ≈AR parity is the Mac result on the
spec-decode sweet spot (short-context, naturally-long *code/agent* generation);
**>AR meaningfully remains CUDA-favoured** (H200 1.27×) because the binding
**>AR meaningfully remains CUDA-favoured** (H200 **1.79×** fused/block-16 on the
fresh `main` scorecard above; #107 originally reported 1.27×) because the binding
constraint is the **26B `verify(L)` compute per block** — *not* rollback (fixed),
*not* sync count (a one-graph "single-fused" probe ran stably at ~0.16 s/block and
was ≈ equal — the b876 single-fused "143 s" pathology is **large-cache-specific**,
Expand Down
Loading