FluffyAIcode · FluffyAIcode · Jun 11, 2026 · Jun 9, 2026 · Jun 10, 2026 · Jun 10, 2026
diff --git a/.gitattributes b/.gitattributes
@@ -1 +1,6 @@
 models/dflash-kakeya-baseline/*.safetensors filter=lfs diff=lfs merge=lfs -text
+results/research/f_theta_v1/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
+results/research/f_theta_v3_attn_distill/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
+results/research/f_theta_v4a_warmstart_hybrid/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
+results/research/f_theta_v4b_fresh_hybrid/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
+results/research/f_theta_v5_s5_sliding/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
diff --git a/docs/k3-gpu-beta.md b/docs/k3-gpu-beta.md
@@ -0,0 +1,109 @@
+# K3 GPU beta — Kakeya inference (f_θ + S5 K/V-Restoration)
+
+Status: beta, GPU-validated on NVIDIA H200 with `google/gemma-4-26B-A4B-it`
+(verifier) + `z-lab/gemma-4-26B-A4B-it-DFlash` (drafter) + the trained f_θ v5
+checkpoint (`results/research/f_theta_v5_s5_sliding/`). Recall 1.0 throughout.
+
+## What it is
+
+The verifier keeps only a **sink+window** local KV cache; at every *evicted*
+position its attention reads **reconstructed** K/V, so it attends over the full
+context while holding `O(sink+window)` resident KV (ADR 0008 §11).
+
+  verifier (Gemma 4 26B-A4B):  sink+window resident KV
+      ├─ sliding layers  → evicted K/V restored via f_θ(drafter K/V)
+      └─ full-attn layers (S5: [5,11,17,23,29]) → verifier's OWN exact K/V
+                            (recall-critical; f_θ cannot reconstruct these —
+                             proven by the α-sweep, eval rel_mse floor ~1.4)
+
+  drafter (DFlash 0.4B): no KV cache; constant-memory K/V reconstruction
+      source (its K/V are projected into verifier space by f_θ).
+
+## Components (this branch)
+
+| piece | file |
+|---|---|
+| DFlash drafter (block diffusion, faithful to z-lab `qwen3_dflash`) | `inference_engine/v04/dflash_drafter.py` |
+| f_θ projection (drafter K/V → verifier K/V) | `inference_engine/v04/f_theta.py` |
+| Cross-model restored verifier (CUDA) + S5 | `inference_engine/v04/cross_model_dlm_verifier.py` |
+| Cross-model restored verifier (MLX / Apple Silicon) | `inference_engine/backends/mlx/cross_model_dlm_verifier.py` |
+| Incremental restored verifier (`SinkWindowVerifier` API) | `inference_engine/v04/restored_sink_window_verifier.py` |
+| Served-path factories + gRPC `--backend restored` | `inference_engine/v04/build_restored.py`, `scripts/start_grpc_runtime_server.py` |
+
+## Three engines (decode modes)
+
+* **Re-forward** (`incremental=False`) — memory-optimal, eval-grade; recomputes
+  restoration each step (O(T)/step). Bit-equivalent reference for the gate.
+* **Gap-A incremental** (`incremental=True`) — capture restored K/V into a
+  `DynamicCache` at prefill, decode natively (O(L)/block). **= AR decode speed**,
+  KV 16.9×–43.9× smaller, recall 1.0.
+* **Fused spec-decode** (`restored_specdecode_fused`) — DFlash block draft +
+  incremental verify, with three prefill-built, incrementally-extended caches:
+  (A) verifier aux hidden captured from the verify forward, (B) drafter context
+  K/V cache, (C) Gap-A restored KV. Per-block O(L). **> AR** (see below).
+
+## Validated results (H200, ctx 1238, gemma-4-26B-A4B)
+
+| path | decode tok/s | vs AR | recall |
+|---|---|---|---|
+| standalone AR | 21.1 | 1.0× | 1.0 |
+| Gap-A incremental restored | 21.7 | 1.03× | 1.0 |
+| fused DFlash spec-decode (aggregate) | 26.8 | **1.27×** | 1.0 |
+
+KV memory: restored resident KV constant **16.71 MB** vs AR 282 MB @1238 tok →
+733 MB @3238 tok (**16.9× → 43.9×**, grows with context). DFlash acceptance on
+HumanEval ≈ official gemma-4-26B parity (length ~3.9 ≈ official 3.3× speedup).
+
+## Run
+
+```bash
+# Incremental restored decode vs AR (memory + tok/s + recall)
+PYTHONPATH=.:sdks/python python scripts/research/k3_e2e_gpu_bench.py \
+  --verifier-id google/gemma-4-26B-A4B-it \
+  --drafter-id z-lab/gemma-4-26B-A4B-it-DFlash \
+  --f-theta-dir results/research/f_theta_v5_s5_sliding \
+  --incremental --haystack-lines 60,160
+
+# Fused DFlash spec-decode vs AR
+PYTHONPATH=.:sdks/python python scripts/research/k3_specdecode_gpu_bench.py \
+  --drafter-id z-lab/gemma-4-26B-A4B-it-DFlash --skip-unfused
+
+# gRPC server with the restored backend
+PYTHONPATH=.:sdks/python python scripts/start_grpc_runtime_server.py \
+  --backend restored --device cuda \
+  --verifier-id google/gemma-4-26B-A4B-it \
+  --drafter-id z-lab/gemma-4-26B-A4B-it-DFlash \
+  --f-theta-dir results/research/f_theta_v5_s5_sliding --sink 4 --window 64
+```
+
+## Canonical proposer
+
+The proposer/drafter is **`z-lab/gemma-4-26B-A4B-it-DFlash`** (the official
+checkpoint, with the Gap-B embed-scale fix) — used uniformly for both drafting
+and as the f_θ restoration K/V source across all entry points. The earlier
+`models/dflash-kakeya-baseline` was alignment-trained against a buggy
+(`×sqrt(hidden)`-scaled) embed pipeline and is not the beta drafter.
+
+f_θ v5 was trained against the kakeya-baseline drafter, so its **sliding-layer**
+restoration is technically off for z-lab K/V — but this is **harmless for
+recall**: recall is carried by the S5 exact full-attention layers, and the
+sliding-layer restored K/V are window-masked during decode. Both incremental
+decode and fused spec-decode measure **recall 1.0** with z-lab. (If pure
+sliding-layer restoration is ever needed, retrain f_θ on z-lab K/V.)
+
+All **inference/eval** entry points default to z-lab (`k3_e2e_gpu_bench`,
+`k3_specdecode_gpu_bench`, `k3_integrated_niah_eval`(+`_mac`),
+`k3_dflash_specdecode_eval`(+`_mac`); the gRPC server takes an explicit
+`--drafter-id`). The **f_θ training** script (`k3_f_theta_train.py`) and its
+orchestration `.sh` keep `models/dflash-kakeya-baseline` because that is how the
+shipped v5 checkpoint was historically trained.
+
+## Notes / scope
+
+* Drafting conditions on the restored verifier hidden for committed decode tokens
+  (clean aux for the prompt) — resolves the bounded-KV vs clean-aux tension
+  natively; no SGLang/vLLM dependency.
+* Stable decode requires loading the verifier without `device_map` (no accelerate
+  per-forward hooks; the 26B-A4B fits on one H200) + a full-length warmup.
+* f_θ v5 restores the sliding layers; recall is carried by the S5 exact
+  full-attention layers, so f_θ fidelity is not the recall bottleneck.
diff --git a/docs/mlx-port-lessons.md b/docs/mlx-port-lessons.md
@@ -0,0 +1,82 @@
+# Porting the K3 GPU beta (#107) to MLX — lessons & plan
+
+Audience: whoever ports the validated CUDA restored-verifier engine
+(`inference_engine/v04/…`, PR #107) to the Apple-Silicon MLX backend
+(`inference_engine/backends/mlx/…`). The current MLX blocker is **decode
+token-throughput collapse**. This doc distills *why* #107 is fast and exactly
+which mechanisms must be reproduced in MLX.
+
+## TL;DR — the throughput collapse is the O(T²) re-forward
+
+On MLX today, `restored_logits` (`backends/mlx/cross_model_dlm_verifier.py`) does
+a **full-position forward over the whole sequence every step**, and the Mac
+harness calls it per generated token → **O(T²)** → collapse (the same harness
+also shows the *oracle* is fast because it uses mlx_lm's **native incremental KV
+cache**). The fix is the #107 **Gap-A** trick, ported verbatim:
+
+> **Capture the restored K/V into a persistent (sink+window) cache at prefill,
+> then decode with mlx_lm's native incremental step (O(L)/block) — never
+> re-forward the whole sequence per token.**
+
+This alone takes the restored path from "collapsed" to **= native AR decode
+speed** (on CUDA: 1.3–2.8 tok/s re-forward → ~21 tok/s incremental = AR).
+
+## What makes #107 fast — and the MLX analog of each
+
+| # | #107 (CUDA) mechanism | MLX analog / gotcha |
+|---|---|---|
+| 1 | **Gap-A incremental decode**: capture restored K/V (per layer, post-norm/RoPE) into a `transformers.DynamicCache` at prefill; decode L new tokens against it. | Capture into `inference_engine/backends/mlx/cache.SinkWindowKVCache` (already exists) and decode via **`mlx_lm.generate.generate_step`** with `prompt_cache=` — its **chunked prefill + `mx.async_eval` pipelined decode** is the throughput-critical part. A hand-rolled per-token loop with `mx.eval` each step is itself a collapse cause. |
+| 2 | **S5 carries recall** via the 5 full-attention layers' **exact own K/V**; f_θ restores only the sliding layers (masked at decode). | Same: store the 5 full-attn evicted own K/V (KakeyaLattice-compressible); **do not** invest in f_θ sliding fidelity for recall. The needle reaches output through the full-attn layers only. |
+| 3 | **Eliminate the extra `capture_own_kv` forward**: in #107 the full-attn own K/V are captured once at prefill (not recomputed per step). PR #108 showed removing it via *f_θ full-attn* breaks recall — wrong fix. | The Mac harness's 12.4s `build_restoration` is this extra forward. Right fix: capture own K/V from the **prefill** forward / store as positions evict — **not** f_θ-restore the full-attn layers. |
+| 4 | **Fused spec-decode (>AR)** = three prefill-built, incrementally-extended caches: (A) verifier aux hidden from the verify forward, (B) drafter context K/V cache, (C) Gap-A restored KV. Per-block O(L). | Port `draft_block_cached` + `make/extend_context_kv` semantics to the MLX drafter path; capture aux from the MLX verify forward. Only after #1 works. |
+| 5 | **Stabilization**: load verifier **without `device_map`** (no accelerate per-forward hooks) + **full-length warmup** (pre-size the allocator) → removed per-block variance. | MLX analog of the variance source is **graph (re)compilation + lazy eval**: warm up the *exact* shapes (prefill chunk size + 1-token decode) before timing; avoid shape churn; force `mx.eval` only where measuring. |
+| 6 | **Gap-B drafter fidelity**: drafter query embedding is a **plain lookup — no Gemma `×sqrt(hidden)`** (port bug; fixed). | Same fix on the MLX drafting path: do not scale the shared embedding fed to the drafter. (z-lab acceptance 0.05→reference parity.) |
+
+## MLX-specific gotchas already learned
+
+- **MPS/MLX SDPA materializes scores** (no flash kernel for some shapes) → OOM at
+  long context. Use **bounded attention** (decode only attends sink+window+restored
+  evicted, not a transient full O(T) matrix) and/or **query-chunked SDPA**
+  (`KAKEYA_DFLASH_ATTN_QCHUNK`). Bounded decode (Gap-A) avoids the transient full
+  cache that OOM'd the ctx280 runs.
+- **Lazy eval**: MLX is lazy; throughput depends on `mx.async_eval` pipelining
+  (mlx_lm's `generate_step` does this). Per-token `mx.eval().item()` serializes →
+  collapse. Mirror the native loop.
+- **`make_sink_window_cache(model, *, sink_size, window_size)`** is keyword-only
+  (a past bug was positional args). The cache is a drop-in `_BaseCache`.
+- **Cross-runtime bridge**: verifier in MLX, drafter+f_θ in PyTorch (MPS/CPU) is
+  workable, but the per-step tensor bridging must not re-forward; bridge only at
+  the K/V-injection boundary, once per block.
+
+## MLX port plan (ordered; each gates the next)
+
+1. **Incremental decode (kills the collapse).** Add an MLX analog of
+   `CrossModelRestoredSinkWindowVerifier(incremental=True)`: prefill → capture
+   restored K/V into `SinkWindowKVCache` (full-attn = own/exact; sliding = f_θ or
+   window-masked) → decode via `generate_step(prompt_cache=…)`. **Gate: decode
+   tok/s ≈ native mlx_lm AR; recall 1.0** (carried by S5).
+2. **Drop the extra build forward.** Capture full-attn own K/V at prefill; do not
+   re-run a clean verifier forward per request beyond prefill. **Gate:
+   `build_restoration` from ~12s → ~prefill cost.**
+3. **Gap-B drafter embed fix** (no `×sqrt`) on the MLX/Bridge drafting path.
+   **Gate: acceptance toward reference on code prompts.**
+4. **Fused spec-decode** (A+B+C incremental caches). **Gate: tok/s > AR.**
+
+## Validation gates (match #107 evidence)
+
+- Recall **1.0** vs oracle (S5).
+- Bounded resident KV (sink+window), reported via `kv_memory_report`.
+- Decode tok/s: incremental **≥ native AR**; fused **> AR**.
+- Reference: #107 on H200 — incremental = 1.0× AR (KV 16.9–43.9× smaller),
+  fused 1.27× AR, recall 1.0. (`docs/k3-gpu-beta.md`,
+  `results/research/k3_e2e_gpu_bench_incremental.json`,
+  `k3_specdecode_fused_stable.json`.)
+
+## Do-not-repeat (anti-patterns)
+
+- ❌ Re-forwarding the full sequence per generated token (the current collapse).
+- ❌ A custom decode loop with per-token `mx.eval` (no async pipelining).
+- ❌ f_θ-restoring the **full-attention** layers (PR #108: breaks recall; those
+  K/V are not reconstructable from the shallow drafter — α-sweep proven). Keep S5.
+- ❌ Scaling the drafter's shared embedding by `×sqrt(hidden)` (Gap-B port bug).
+- ❌ Materializing a transient full-T attention score matrix on MPS (OOM).