Merged
84 commits
bb7909f
K3 Block B + C: f_theta projection + cross-model DLMRestoredVerifier …
cursoragent Jun 9, 2026
c404aee
K3 P0 critical fixes + vast reviewer aids + integrated NIAH eval
cursoragent Jun 10, 2026
634acea
K3: support Gemma4 multimodal nested config/decoder in f_theta train …
cursoragent Jun 10, 2026
a9706aa
K3: capture V from k_proj output for Gemma4 v_proj-None (KV-sharing) …
cursoragent Jun 10, 2026
4a4d96d
K3: heterogeneous per-layer verifier KV heads in f_theta + per-layer …
cursoragent Jun 10, 2026
d3a64c0
K3: Gemma4-faithful cross-model restore forward (per-layer KV, v_norm…
cursoragent Jun 10, 2026
d257a11
K3: cast f_theta input to encoder weight dtype (fp32 f_theta vs bf16 …
cursoragent Jun 10, 2026
46410ad
K3: fix integrated NIAH eval to use real niah_eval API (chat-template…
cursoragent Jun 10, 2026
a0a9fb5
K3: handle BatchEncoding return from Gemma4 apply_chat_template in in…
cursoragent Jun 10, 2026
72ddd15
K3: per-layer verifier head_dim in f_theta (Gemma4 full layers use gl…
cursoragent Jun 10, 2026
844aaac
K3: add identity-restore diagnostic (inject verifier's own K/V) to is…
cursoragent Jun 10, 2026
9aa1f51
K3 f_theta v1 trained checkpoint (Gemma4 26B-A4B verifier, per-layer …
cursoragent Jun 10, 2026
e18f2fc
K3 integrated NIAH gate evidence: arch_correct=1.0 PASS, recall gate …
cursoragent Jun 10, 2026
6c2fc23
K3 f_θ trainer v2 — fix recall=0 (cosine+mag loss + NIAH data + cosin…
cursoragent Jun 10, 2026
6f168dd
K3 f_θ trainer v3 — one-shot attention-output distillation (skip v2 i…
cursoragent Jun 10, 2026
9ae40a8
K3 S6: --mix-alpha-sweep fidelity->recall diagnostic (interpolate evi…
cursoragent Jun 10, 2026
1444416
K3 attn_distill v3 evidence: train reduction 21.47x (attn-output rel-…
cursoragent Jun 10, 2026
72ce157
K3 S6 alpha-sweep on attn_distill v3: recall 0 for all alpha<1.0 (deg…
cursoragent Jun 10, 2026
76f54cc
K3 S6 alpha-sweep on scale-matched relmse v3: recall knee in (0,0.5];…
cursoragent Jun 10, 2026
4a9b6bc
K3 f_θ trainer v4: attn_distill_hybrid loss — fix the f_θ collapse ex…
cursoragent Jun 10, 2026
3643b74
K3 S6 knee refinement (relmse v3): recall transition alpha 0.3->0.4->…
cursoragent Jun 10, 2026
e5a927c
K3 trainer aid: forward NIAH_MIN_LINES/NIAH_MAX_LINES env to --niah-{…
cursoragent Jun 10, 2026
a4f1a46
K3 fix: import apply_rotary_pos_emb for attn_distill_hybrid too (was …
cursoragent Jun 10, 2026
84b5194
K3 v4a warm-start hybrid checkpoint (rank256, init relmse v3, attn_di…
cursoragent Jun 10, 2026
e90528e
K3 v4b fresh hybrid checkpoint (rank768, 128 NIAH, gen1024, niah140, …
cursoragent Jun 10, 2026
523d0c3
K3 v4a/v4b hybrid integrated NIAH evidence: both recall 0/10 both run…
cursoragent Jun 10, 2026
fcd2ebd
K3 fidelity probe v4a/v4b: eval full-attn rel_mse 1.42/1.52 (== relms…
cursoragent Jun 10, 2026
ae68bd6
K3 v4a/v4b canonical NIAH + alpha-sweep artifacts: NIAH 0/10 both; sw…
cursoragent Jun 10, 2026
65ac245
K3 S5: exact_layer_indices in cross-model verifier + --s5-exact-full-…
cursoragent Jun 11, 2026
579d8f0
K3 S5 fix: inject verifier's OWN true K/V at evicted positions for fu…
cursoragent Jun 11, 2026
d85211b
K3 S5 ctx280 PASS: exact full-attn layers [5,11,17,23,29] + v4b slidi…
cursoragent Jun 11, 2026
5377220
K3 S5 trainer mode: --s5-exact-full-attn excludes full-attention laye…
cursoragent Jun 11, 2026
5be2d83
K3 v5 S5 dedicated sliding f_theta (full-attn excluded from loss, ctx…
cursoragent Jun 11, 2026
ac9234a
K3 MLX integration: cross-model DLM-restored verifier (S5 + f_theta) …
cursoragent Jun 11, 2026
3f74c86
Mac M4 K3 S5 NIAH latency diagnostic evidence
Jun 11, 2026
d3160c8
K3 MLX v2: (1) --compress-full-attn KakeyaLattice round-trip on full-…
cursoragent Jun 11, 2026
f785d0e
Mac M4 K3 S5 KL ctx280 OOM evidence
Jun 11, 2026
8452c5a
K3 fix MPS OOM: DFlash attention uses memory-efficient SDPA instead o…
cursoragent Jun 11, 2026
2d855ba
Mac M4 K3 S5 KL ctx280 SDPA OOM evidence
Jun 11, 2026
8e2601f
K3 fix MPS OOM (2): query-chunked drafter attention (_chunked_sdpa, q…
cursoragent Jun 11, 2026
1be821a
Mac M4 K3 S5 KL ctx280 rerun OOM evidence
Jun 11, 2026
91ecaa1
K3: make DFlash attention query-chunk env-tunable (KAKEYA_DFLASH_ATTN…
cursoragent Jun 11, 2026
8b3cea3
Mac M4 K3 S5 KL ctx70 CPU timeout evidence
Jun 11, 2026
95613ed
K3 MLX harness refactor (usability): (1) amortize restoration — captu…
cursoragent Jun 11, 2026
b3aa685
Mac M4 K3 S5 KL ctx70 teacher-forced evidence
Jun 11, 2026
8dcb1d0
K3 MLX harness: fix recall metric — default to free-generation (teach…
cursoragent Jun 11, 2026
4863a47
Mac M4 K3 S5 KL ctx70 free-gen slow evidence
Jun 11, 2026
2d19069
Mac high-perf deployment benchmark: bench_mlx_kakeya_deployment.py — …
cursoragent Jun 11, 2026
2b6851c
Mac deployment bench: default to gemma-4-26B-A4B-it-mlx-4bit; measure…
cursoragent Jun 11, 2026
880f7c5
Mac M4 Gemma 4 MLX deployment benchmark evidence
Jun 11, 2026
85b9c5a
Fix Kakeya path in Mac deployment bench: make_sink_window_cache() tak…
cursoragent Jun 11, 2026
2a8c484
Mac M4 Gemma 4 MLX Kakeya benchmark evidence
Jun 11, 2026
d7d2a99
Mac deployment bench: drive BOTH vanilla and Kakeya through mlx_lm's …
cursoragent Jun 11, 2026
d70a783
Mac deployment bench: add MLX kernel warmup for both cache paths befo…
cursoragent Jun 11, 2026
ed9e351
K3 Gap1+Gap2: wire f_theta+S5 K/V Restoration into the spec-decode lo…
cursoragent Jun 11, 2026
5083260
Tests: 100% coverage for restored sink+window verifier + spec-decode …
cursoragent Jun 11, 2026
ca1f437
K3 e2e GPU bench: Kakeya restored verifier vs standalone Gemma4 26B A…
cursoragent Jun 11, 2026
98e02a3
K3 e2e GPU evidence (H200): Kakeya restored verifier vs standalone Ge…
cursoragent Jun 11, 2026
01061b7
K3 spec-decode GPU bench (restored verifier) + DFlash acceptance evid…
cursoragent Jun 11, 2026
f4905b2
K3 spec-decode GPU evidence (H200): restored verifier block spec-deco…
cursoragent Jun 11, 2026
3c95dc5
Gap-A: incremental-decode restored verifier (capture restored K/V at …
cursoragent Jun 11, 2026
7b2e541
Gap-A GPU evidence (H200): incremental restored decode reaches AR parity
cursoragent Jun 11, 2026
0497504
B: fix DFlash draft embedding scale (reference uses plain lookup, no …
cursoragent Jun 11, 2026
49818a8
B progress: DFlash embed-scale fix validated (3x acceptance), evidenc…
cursoragent Jun 11, 2026
8b5e631
B: add HumanEval-style code prompt set (--prompt-set code) to charact…
cursoragent Jun 11, 2026
46dbfb7
B evidence: DFlash acceptance on code regime = 0.227/4.19 (peaks >7.7…
cursoragent Jun 11, 2026
27bfcde
B: add canonical HumanEval loader (--humaneval-jsonl) + --raw-complet…
cursoragent Jun 11, 2026
bd1c07f
B evidence: canonical HumanEval acceptance = 0.199 / length 3.87 (raw…
cursoragent Jun 11, 2026
342b894
Integrated bench: restored spec-decode now uses Gap-A incremental ver…
cursoragent Jun 11, 2026
ac5983d
Fix stale verifier_forwards print ref in integrated spec-decode bench…
cursoragent Jun 11, 2026
5026b13
Fix integrated spec-decode report aggregation (time_breakdown_s_mean …
cursoragent Jun 11, 2026
0c2217c
Integrated GPU evidence (H200): Gap-A incremental restored decode = A…
cursoragent Jun 11, 2026
e9c33e4
Fused spec-decode engine (A+B+C) in the Kakeya engine: per-block O(L)
cursoragent Jun 11, 2026
bef6bf1
Spec-decode bench: warmup all measured paths before timing (the cold …
cursoragent Jun 11, 2026
a14d7b5
Spec-decode bench: --skip-unfused for clean fused-vs-AR steady-state …
cursoragent Jun 11, 2026
4b3d2e1
Fused engine GPU evidence (H200): reaches/exceeds AR on stable sample…
cursoragent Jun 11, 2026
427ba5a
Stabilize fused spec-decode: load verifier without device_map (no acc…
cursoragent Jun 11, 2026
71d1e91
Spec-decode bench: full-length 2-pass warmup so the caching allocator…
cursoragent Jun 11, 2026
a53abc0
Stabilized fused engine GPU evidence (H200): fused spec-decode EXCEED…
cursoragent Jun 11, 2026
b909142
Trim beta + add architecture note
cursoragent Jun 11, 2026
676c616
Standardize proposer to z-lab/gemma-4-26B-A4B-it-DFlash: change k3_e2…
cursoragent Jun 11, 2026
cbbab48
Unify proposer to z-lab across #107: replace e2e incremental evidence…
cursoragent Jun 11, 2026
6bad344
Unify proposer to z-lab across ALL inference/eval entry points (CUDA …
cursoragent Jun 11, 2026
80574c2
docs: MLX port lessons from #107 — root-cause the MLX decode throughp…
cursoragent Jun 11, 2026
5 changes: 5 additions & 0 deletions .gitattributes
@@ -1 +1,6 @@
models/dflash-kakeya-baseline/*.safetensors filter=lfs diff=lfs merge=lfs -text
results/research/f_theta_v1/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
results/research/f_theta_v3_attn_distill/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
results/research/f_theta_v4a_warmstart_hybrid/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
results/research/f_theta_v4b_fresh_hybrid/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
results/research/f_theta_v5_s5_sliding/f_theta_weights.pt filter=lfs diff=lfs merge=lfs -text
109 changes: 109 additions & 0 deletions docs/k3-gpu-beta.md
@@ -0,0 +1,109 @@
# K3 GPU beta — Kakeya inference (f_θ + S5 K/V-Restoration)

Status: beta, GPU-validated on NVIDIA H200 with `google/gemma-4-26B-A4B-it`
(verifier) + `z-lab/gemma-4-26B-A4B-it-DFlash` (drafter) + the trained f_θ v5
checkpoint (`results/research/f_theta_v5_s5_sliding/`). Recall 1.0 throughout.

## What it is

The verifier keeps only a **sink+window** local KV cache; at every *evicted*
position its attention reads **reconstructed** K/V, so it attends over the full
context while holding `O(sink+window)` resident KV (ADR 0008 §11).

    verifier (Gemma 4 26B-A4B): sink+window resident KV
      ├─ sliding layers → evicted K/V restored via f_θ(drafter K/V)
      └─ full-attn layers (S5: [5,11,17,23,29]) → verifier's OWN exact K/V
           (recall-critical; f_θ cannot reconstruct these, as shown by
            the α-sweep: eval rel_mse floor ~1.4)

    drafter (DFlash 0.4B): no KV cache; constant-memory K/V reconstruction
      source (its K/V are projected into verifier space by f_θ).
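
To make the resident/evicted split concrete, here is a minimal sketch of how positions could be partitioned for a sink+window cache. The helper name `partition_positions` is hypothetical and not taken from the engine; the sizes (`sink=4`, `window=64`, context 1238) are the ones used elsewhere in this doc.

```python
def partition_positions(seq_len: int, sink: int, window: int):
    """Split positions 0..seq_len-1 into resident and evicted sets.

    Resident = the first `sink` positions plus the trailing `window` positions.
    Evicted  = everything in between (these are the positions whose K/V
               would be reconstructed rather than held in the cache).
    """
    sink_end = min(sink, seq_len)
    window_start = max(sink_end, seq_len - window)
    resident = list(range(sink_end)) + list(range(window_start, seq_len))
    evicted = list(range(sink_end, window_start))
    return resident, evicted


resident, evicted = partition_positions(seq_len=1238, sink=4, window=64)
print(len(resident), len(evicted))  # 68 1170
```

The resident count depends only on `sink + window`, which is why the resident KV footprint stays constant as the context grows.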

## Components (this branch)

| piece | file |
|---|---|
| DFlash drafter (block diffusion, faithful to z-lab `qwen3_dflash`) | `inference_engine/v04/dflash_drafter.py` |
| f_θ projection (drafter K/V → verifier K/V) | `inference_engine/v04/f_theta.py` |
| Cross-model restored verifier (CUDA) + S5 | `inference_engine/v04/cross_model_dlm_verifier.py` |
| Cross-model restored verifier (MLX / Apple Silicon) | `inference_engine/backends/mlx/cross_model_dlm_verifier.py` |
| Incremental restored verifier (`SinkWindowVerifier` API) | `inference_engine/v04/restored_sink_window_verifier.py` |
| Served-path factories + gRPC `--backend restored` | `inference_engine/v04/build_restored.py`, `scripts/start_grpc_runtime_server.py` |

## Three engines (decode modes)

* **Re-forward** (`incremental=False`) — memory-optimal, eval-grade; recomputes
restoration each step (O(T)/step). Bit-equivalent reference for the gate.
* **Gap-A incremental** (`incremental=True`) — capture restored K/V into a
`DynamicCache` at prefill, decode natively (O(L)/block). **= AR decode speed**,
KV 16.9×–43.9× smaller, recall 1.0.
* **Fused spec-decode** (`restored_specdecode_fused`) — DFlash block draft +
incremental verify, with three prefill-built, incrementally-extended caches:
(A) verifier aux hidden captured from the verify forward, (B) drafter context
K/V cache, (C) Gap-A restored KV. Per-block O(L). **> AR** (see below).
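
For readers unfamiliar with block spec-decode, the verify step can be illustrated with the generic greedy acceptance rule: keep the longest drafted prefix that matches the verifier's own greedy choices, then commit one verifier token. This is a sketch of the standard technique under that assumption, not a transcription of `restored_specdecode_fused`; the function name `accept_block` is hypothetical.

```python
def accept_block(draft_tokens, verifier_tokens):
    """Greedy speculative-decoding acceptance for one drafted block.

    verifier_tokens[i] is the verifier's greedy token for the slot of
    draft_tokens[i], given the prefix plus draft_tokens[:i]. It must have
    len(draft_tokens) + 1 entries: the last one is the "bonus" token used
    when every drafted token is accepted.
    """
    if len(verifier_tokens) != len(draft_tokens) + 1:
        raise ValueError("verifier_tokens must be one longer than draft_tokens")
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != verifier_tokens[i]:
            break  # first mismatch: stop accepting drafted tokens
        accepted.append(tok)
    # Commit the verifier's token at the first mismatch (or the bonus token).
    accepted.append(verifier_tokens[len(accepted)])
    return accepted


print(accept_block([5, 7, 9, 2], [5, 7, 1, 4, 8]))  # [5, 7, 1]
print(accept_block([5, 7], [5, 7, 3]))              # [5, 7, 3]
```

Each verify forward therefore commits between 1 and `len(draft_tokens) + 1` tokens, which is where a speedup over AR can come from when acceptance is high.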

## Validated results (H200, ctx 1238, gemma-4-26B-A4B)

| path | decode tok/s | vs AR | recall |
|---|---|---|---|
| standalone AR | 21.1 | 1.0× | 1.0 |
| Gap-A incremental restored | 21.7 | 1.03× | 1.0 |
| fused DFlash spec-decode (aggregate) | 26.8 | **1.27×** | 1.0 |

KV memory: restored resident KV stays constant at **16.71 MB**, while AR KV grows
from 282 MB at 1238 tokens to 733 MB at 3238 tokens, a **16.9× to 43.9×**
reduction that widens with context. DFlash acceptance on HumanEval is at parity
with the official gemma-4-26B numbers (acceptance length ~3.9, comparable to the
official 3.3× speedup).
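
The speedup and KV ratios follow directly from the reported figures. A quick check, using only numbers stated in this section (the variable names are illustrative):

```python
# Figures copied from the table and KV-memory paragraph (H200, gemma-4-26B-A4B).
AR_TOK_S = 21.1
GAP_A_TOK_S = 21.7
FUSED_TOK_S = 26.8
RESTORED_KV_MB = 16.71
AR_KV_MB = {1238: 282.0, 3238: 733.0}  # context length -> AR KV size in MB

speedups = {
    "gap_a": round(GAP_A_TOK_S / AR_TOK_S, 2),
    "fused": round(FUSED_TOK_S / AR_TOK_S, 2),
}
kv_ratios = {ctx: round(mb / RESTORED_KV_MB, 1) for ctx, mb in AR_KV_MB.items()}

print(speedups)   # {'gap_a': 1.03, 'fused': 1.27}
print(kv_ratios)  # {1238: 16.9, 3238: 43.9}
```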

## Run

```bash
# Incremental restored decode vs AR (memory + tok/s + recall)
PYTHONPATH=.:sdks/python python scripts/research/k3_e2e_gpu_bench.py \
    --verifier-id google/gemma-4-26B-A4B-it \
    --drafter-id z-lab/gemma-4-26B-A4B-it-DFlash \
    --f-theta-dir results/research/f_theta_v5_s5_sliding \
    --incremental --haystack-lines 60,160

# Fused DFlash spec-decode vs AR
PYTHONPATH=.:sdks/python python scripts/research/k3_specdecode_gpu_bench.py \
    --drafter-id z-lab/gemma-4-26B-A4B-it-DFlash --skip-unfused

# gRPC server with the restored backend
PYTHONPATH=.:sdks/python python scripts/start_grpc_runtime_server.py \
    --backend restored --device cuda \
    --verifier-id google/gemma-4-26B-A4B-it \
    --drafter-id z-lab/gemma-4-26B-A4B-it-DFlash \
    --f-theta-dir results/research/f_theta_v5_s5_sliding --sink 4 --window 64
```

## Canonical proposer

The proposer/drafter is **`z-lab/gemma-4-26B-A4B-it-DFlash`** (the official
checkpoint, with the Gap-B embed-scale fix) — used uniformly for both drafting
and as the f_θ restoration K/V source across all entry points. The earlier
`models/dflash-kakeya-baseline` was alignment-trained against a buggy
(`×sqrt(hidden)`-scaled) embed pipeline and is not the beta drafter.

f_θ v5 was trained against the kakeya-baseline drafter, so its **sliding-layer**
restoration is mismatched for z-lab K/V. This is **harmless for recall**: recall
is carried by the S5 exact full-attention layers, and the sliding-layer restored
K/V are window-masked during decode. Both incremental decode and fused
spec-decode measure **recall 1.0** with z-lab. (If pure sliding-layer
restoration is ever needed, retrain f_θ on z-lab K/V.)

All **inference/eval** entry points default to z-lab (`k3_e2e_gpu_bench`,
`k3_specdecode_gpu_bench`, `k3_integrated_niah_eval`(+`_mac`),
`k3_dflash_specdecode_eval`(+`_mac`); the gRPC server takes an explicit
`--drafter-id`). The **f_θ training** script (`k3_f_theta_train.py`) and its
orchestration `.sh` keep `models/dflash-kakeya-baseline` because that is how the
shipped v5 checkpoint was historically trained.

## Notes / scope

* Drafting conditions on the restored verifier hidden for committed decode tokens
(clean aux for the prompt) — resolves the bounded-KV vs clean-aux tension
natively; no SGLang/vLLM dependency.
* Stable decode requires loading the verifier without `device_map` (no accelerate
per-forward hooks; the 26B-A4B fits on one H200) + a full-length warmup.
* f_θ v5 restores the sliding layers; recall is carried by the S5 exact
full-attention layers, so f_θ fidelity is not the recall bottleneck.
82 changes: 82 additions & 0 deletions docs/mlx-port-lessons.md
@@ -0,0 +1,82 @@
# Porting the K3 GPU beta (#107) to MLX — lessons & plan

Audience: whoever ports the validated CUDA restored-verifier engine
(`inference_engine/v04/…`, PR #107) to the Apple-Silicon MLX backend
(`inference_engine/backends/mlx/…`). The current MLX blocker is **decode
token-throughput collapse**. This doc distills *why* #107 is fast and exactly
which mechanisms must be reproduced in MLX.

## TL;DR — the throughput collapse is the O(T²) re-forward

On MLX today, `restored_logits` (`backends/mlx/cross_model_dlm_verifier.py`) runs
a **full-position forward over the whole sequence every step**, and the Mac
harness calls it once per generated token, so decode cost is **O(T²)** and
throughput collapses. The same harness shows the *oracle* is fast precisely
because it uses mlx_lm's **native incremental KV cache**. The fix is the #107
**Gap-A** trick, ported verbatim:

> **Capture the restored K/V into a persistent (sink+window) cache at prefill,
> then decode with mlx_lm's native incremental step (O(L)/block) — never
> re-forward the whole sequence per token.**

This alone takes the restored path from "collapsed" to **native AR decode
speed** (on CUDA: 1.3–2.8 tok/s re-forward → ~21 tok/s incremental = AR).
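
A rough cost model shows why the re-forward path collapses. The sketch below counts token positions pushed through the verifier, which is a proxy for work and not a FLOP count; the function names and the 256-token generation length are illustrative assumptions.

```python
def reforward_positions(prompt_len: int, new_tokens: int) -> int:
    """Re-forward decode: step i re-runs the whole sequence (prompt_len + i)."""
    return sum(prompt_len + i for i in range(new_tokens))


def incremental_positions(prompt_len: int, new_tokens: int) -> int:
    """Incremental decode: one prefill over the prompt, then one position per token."""
    return prompt_len + new_tokens


prompt_len, new_tokens = 1238, 256  # 1238 = the H200 context; 256 is an example
slow = reforward_positions(prompt_len, new_tokens)
fast = incremental_positions(prompt_len, new_tokens)
print(slow, fast, round(slow / fast))  # 349568 1494 234
```

The re-forward total grows quadratically with generated length, while the incremental total grows linearly, which matches the O(T²) versus O(L)/block framing above.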

## What makes #107 fast — and the MLX analog of each

| # | #107 (CUDA) mechanism | MLX analog / gotcha |
|---|---|---|
| 1 | **Gap-A incremental decode**: capture restored K/V (per layer, post-norm/RoPE) into a `transformers.DynamicCache` at prefill; decode L new tokens against it. | Capture into `inference_engine/backends/mlx/cache.SinkWindowKVCache` (already exists) and decode via **`mlx_lm.generate.generate_step`** with `prompt_cache=` — its **chunked prefill + `mx.async_eval` pipelined decode** is the throughput-critical part. A hand-rolled per-token loop with `mx.eval` each step is itself a collapse cause. |
| 2 | **S5 carries recall** via the 5 full-attention layers' **exact own K/V**; f_θ restores only the sliding layers (masked at decode). | Same: store the 5 full-attn evicted own K/V (KakeyaLattice-compressible); **do not** invest in f_θ sliding fidelity for recall. The needle reaches output through the full-attn layers only. |
| 3 | **Eliminate the extra `capture_own_kv` forward**: in #107 the full-attn own K/V are captured once at prefill (not recomputed per step). PR #108 showed removing it via *f_θ full-attn* breaks recall — wrong fix. | The Mac harness's 12.4s `build_restoration` is this extra forward. Right fix: capture own K/V from the **prefill** forward / store as positions evict — **not** f_θ-restore the full-attn layers. |
| 4 | **Fused spec-decode (>AR)** = three prefill-built, incrementally-extended caches: (A) verifier aux hidden from the verify forward, (B) drafter context K/V cache, (C) Gap-A restored KV. Per-block O(L). | Port `draft_block_cached` + `make/extend_context_kv` semantics to the MLX drafter path; capture aux from the MLX verify forward. Only after #1 works. |
| 5 | **Stabilization**: load verifier **without `device_map`** (no accelerate per-forward hooks) + **full-length warmup** (pre-size the allocator) → removed per-block variance. | MLX analog of the variance source is **graph (re)compilation + lazy eval**: warm up the *exact* shapes (prefill chunk size + 1-token decode) before timing; avoid shape churn; force `mx.eval` only where measuring. |
| 6 | **Gap-B drafter fidelity**: drafter query embedding is a **plain lookup — no Gemma `×sqrt(hidden)`** (port bug; fixed). | Same fix on the MLX drafting path: do not scale the shared embedding fed to the drafter. (z-lab acceptance 0.05→reference parity.) |
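
The Gap-B issue in row 6 can be illustrated with a toy embedding table: scaling by `sqrt(hidden)` inflates every embedding fed to the drafter by a constant factor. This is a minimal sketch with made-up values; `plain_lookup` and `scaled_lookup` are hypothetical names, not functions from the repo.

```python
import math

# Toy shared embedding table: 3 tokens, hidden size 4 (values are made up).
EMBED = [
    [0.1, -0.2, 0.3, 0.0],
    [0.5, 0.5, -0.5, 0.5],
    [-0.3, 0.1, 0.2, 0.4],
]


def plain_lookup(token_id: int):
    """What the drafter should receive: the raw embedding row."""
    return list(EMBED[token_id])


def scaled_lookup(token_id: int):
    """The port bug: the row multiplied by sqrt(hidden)."""
    row = EMBED[token_id]
    return [x * math.sqrt(len(row)) for x in row]


def norm(vec):
    return math.sqrt(sum(x * x for x in vec))


ratio = norm(scaled_lookup(1)) / norm(plain_lookup(1))
print(ratio)  # sqrt(4) = 2.0 for this toy table
```

With a realistic hidden size the factor is much larger than 2, so the drafter sees inputs far outside the scale it was trained on.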

## MLX-specific gotchas already learned

- **MPS/MLX SDPA materializes scores** (no flash kernel for some shapes) → OOM at
long context. Use **bounded attention** (decode only attends sink+window+restored
evicted, not a transient full O(T) matrix) and/or **query-chunked SDPA**
(`KAKEYA_DFLASH_ATTN_QCHUNK`). Bounded decode (Gap-A) avoids the transient full
cache that OOM'd the ctx280 runs.
- **Lazy eval**: MLX is lazy; throughput depends on `mx.async_eval` pipelining
(mlx_lm's `generate_step` does this). A per-token `mx.eval(...)` followed by
`.item()` forces a sync on every step and serializes decode → collapse. Mirror
the native loop.
- **`make_sink_window_cache(model, *, sink_size, window_size)`** is keyword-only
(a past bug was positional args). The cache is a drop-in `_BaseCache`.
- **Cross-runtime bridge**: verifier in MLX, drafter+f_θ in PyTorch (MPS/CPU) is
workable, but the per-step tensor bridging must not re-forward; bridge only at
the K/V-injection boundary, once per block.
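
The query-chunking idea from the first bullet can be sketched in plain Python: process queries in slices so that at most `q_chunk × T` scores are live at once, while producing the same output as the unchunked computation. This is an illustration of the technique only (no causal mask, no batching or heads) and is not the repo's `_chunked_sdpa`; both function names are hypothetical.

```python
import math


def attention(q_rows, k_rows, v_rows):
    """Plain softmax attention over lists of vectors (one score row per query)."""
    scale = 1.0 / math.sqrt(len(k_rows[0]))
    out = []
    for q in q_rows:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_rows]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[d] for w, v in zip(weights, v_rows)) / total
            for d in range(len(v_rows[0]))
        ])
    return out


def chunked_attention(q_rows, k_rows, v_rows, q_chunk):
    """Same result, but only `q_chunk` query rows are processed per call."""
    out = []
    for start in range(0, len(q_rows), q_chunk):
        out.extend(attention(q_rows[start:start + q_chunk], k_rows, v_rows))
    return out


q = [[0.1 * i, 0.2, -0.1 * i] for i in range(5)]
k = [[0.3, 0.1 * j, 0.2] for j in range(7)]
v = [[float(j), 1.0 - j] for j in range(7)]
print(chunked_attention(q, k, v, q_chunk=2) == attention(q, k, v))  # True
```

Because softmax is applied per query row, splitting along the query axis changes peak memory but not the result; splitting along the key axis would need a running-softmax scheme instead.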

## MLX port plan (ordered; each gates the next)

1. **Incremental decode (kills the collapse).** Add an MLX analog of
`CrossModelRestoredSinkWindowVerifier(incremental=True)`: prefill → capture
restored K/V into `SinkWindowKVCache` (full-attn = own/exact; sliding = f_θ or
window-masked) → decode via `generate_step(prompt_cache=…)`. **Gate: decode
tok/s ≈ native mlx_lm AR; recall 1.0** (carried by S5).
2. **Drop the extra build forward.** Capture full-attn own K/V at prefill; do not
re-run a clean verifier forward per request beyond prefill. **Gate:
`build_restoration` from ~12s → ~prefill cost.**
3. **Gap-B drafter embed fix** (no `×sqrt`) on the MLX/Bridge drafting path.
**Gate: acceptance toward reference on code prompts.**
4. **Fused spec-decode** (A+B+C incremental caches). **Gate: tok/s > AR.**

## Validation gates (match #107 evidence)

- Recall **1.0** vs oracle (S5).
- Bounded resident KV (sink+window), reported via `kv_memory_report`.
- Decode tok/s: incremental **≥ native AR**; fused **> AR**.
- Reference: #107 on H200: incremental 1.03× AR (KV 16.9–43.9× smaller),
fused 1.27× AR, recall 1.0. (`docs/k3-gpu-beta.md`,
`results/research/k3_e2e_gpu_bench_incremental.json`,
`k3_specdecode_fused_stable.json`.)
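
The gates above could be checked mechanically along these lines. This is a sketch only: the metric key names are hypothetical and do not reflect the schema of the JSON evidence files or of `kv_memory_report`; the sample values are the H200 figures from `docs/k3-gpu-beta.md`.

```python
def check_gates(metrics: dict) -> dict:
    """Return pass/fail per validation gate. Key names are hypothetical."""
    kv = list(metrics["resident_kv_mb_by_ctx"].values())
    return {
        "recall": metrics["recall"] >= 1.0,
        # Bounded KV: resident size should not change with context length.
        "kv_bounded": max(kv) - min(kv) < 1e-6,
        "incremental_ge_ar": metrics["incremental_tok_s"] >= metrics["ar_tok_s"],
        "fused_gt_ar": metrics["fused_tok_s"] > metrics["ar_tok_s"],
    }


h200 = {
    "recall": 1.0,
    "ar_tok_s": 21.1,
    "incremental_tok_s": 21.7,
    "fused_tok_s": 26.8,
    "resident_kv_mb_by_ctx": {1238: 16.71, 3238: 16.71},
}
gates = check_gates(h200)
print(all(gates.values()), gates)
```

An MLX run would substitute its own measurements; a failing `kv_bounded` gate would indicate that a transient full-length cache is still being held.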

## Do-not-repeat (anti-patterns)

- ❌ Re-forwarding the full sequence per generated token (the current collapse).
- ❌ A custom decode loop with per-token `mx.eval` (no async pipelining).
- ❌ f_θ-restoring the **full-attention** layers (PR #108: breaks recall; those
K/V are not reconstructable from the shallow drafter — α-sweep proven). Keep S5.
- ❌ Scaling the drafter's shared embedding by `×sqrt(hidden)` (Gap-B port bug).
- ❌ Materializing a transient full-T attention score matrix on MPS (OOM).