diff --git a/docs/adr/0012-proposer-verifier-value-proposition.md b/docs/adr/0012-proposer-verifier-value-proposition.md
new file mode 100644
index 00000000..b48f939a
--- /dev/null
+++ b/docs/adr/0012-proposer-verifier-value-proposition.md
@@ -0,0 +1,182 @@
+# ADR 0012 — Proposer/verifier value proposition: bounded-memory + recall (all platforms), platform-forked throughput
+
+- **Status**: Accepted (2026-06-13)
+- **Date**: 2026-06-13
+- **Decision drivers**:
+  - A recurring question keeps being re-opened by new contributors (human
+    and agent): *"is the proposer still worth it, given Step-1 reaches 1.0× AR
+    on Mac without using it?"* and *"is speculative decoding dead on Mac?"*
+    This ADR settles the value map so the decision tree is not re-derived
+    every time the Mac throughput number looks bad in isolation.
+  - 2026-06-12 Mac ctx280 validation (`results/research/k3_mlx_fused_fair_ctx280_n5_gen32_*.json`):
+    Step-1 recall 5/5 vs oracle 5/5, bounded resident KV 132.9 MB vs naive
+    1308.9 MB (89.8 % saving) at 4406–5810-token prompts.
+  - 2026-06-11 H200 #107 evidence: fused spec-decode 1.27× AR, recall 1.0.
+  - 2026-06-13 `verify(L)` calibration sweep (`results/research/verify_l_sweep.json`,
+    ctx 4096): measured kernel-dedup headroom 3.92× at L=16, ≈87 % of the
+    router-measured expert-union bound (4.52×).
+  - Builds on / re-affirms ADR 0001 (proposer sizing + alignment),
+    ADR 0004 (alignment data policy), ADR 0006 (local-agent-infra
+    positioning), ADR 0008 §11 (dLM K/V-Restoration architecture),
+    ADR 0009 (capability exchange), ADR 0010 (full-attention low-precision
+    KV / affine4), ADR 0011 (cross-attention coupling, falsified by R1e).
+
+## ⚠️ Revision (2026-06-13) — the Step-1 / S5-coupon result is a *validation trap*, not architecture evidence
+
+A 2026-06-13 directive supersedes the optimistic reading of "Step-1 = realised
+deliverable" below. The correction:
+
+- **Step-1 (incremental restored decode) and the native-cache path get their
+  recall from Gemma-4's *native* retained 5 full-attention layers + native
+  sliding-window eviction — they never exercise f_θ or proposer KV
+  restoration.** So "Step-1 recall 5/5 / 1.0× AR" is **Gemma-4 native
+  behaviour, not evidence that the K/V-Restoration architecture (ADR 0008 §11)
+  works.**
+- The path is structurally **incapable of failing in a way that tests the
+  architecture**: the full-attention coupon always carries recall regardless of
+  whether f_θ/restoration is correct or even present. Citing it as a deliverable
+  **corrupts the integrity assessment**.
+- Sharper: **on Gemma-4 no configuration makes proposer/f_θ restoration the
+  recall source** (the 5 full-attn layers' own exact K/V always do; f_θ only
+  touches sliding layers, which are window-masked at decode). Gemma-4 is
+  therefore the **wrong model to validate the restoration architecture**.
+- **Step-1 / native-cache bypass is forbidden for any architecture-validation
+  attempt.** The bounded-memory + recall *architecture* claim is **unvalidated
+  on a falsifiable model** and must be re-validated on a **pure sliding-window
+  model (Qwen3, the K1/K2 path)** where recall is mathematically impossible
+  without proposer/f_θ restoration. Gemma-4 may still be used as a *product*
+  model, but never as the validation vehicle for architectural integrity.
+
+The §1 "won / realised" framing below is retained for history but must be read
+through this revision: the **memory-saving numbers are real**, but they are not
+proof the *restoration mechanism* works — only that Gemma-4 + a bounded sliding
+cache works, which Gemma-4 does natively.
+
+## Context
+
+ADR 0008 §11 (K-series) changed the proposer's primary role from *drafter*
+to *history reconstructor*: the dLM proposer has no KV cache and can produce
+transient K/V for the **entire** history, which is used to restore the
+verifier's attention at structurally-evicted positions. Speculative decoding
+is the **second** product line on the same architecture, not the first.
+
+The trap is to evaluate the architecture on a single cell of its value
+table — "Mac, single host, generic chat, current un-aligned DFlash" — see a
+weak throughput number, and conclude the proposer (or spec-decode) has no
+value. That conclusion does not generalise. This ADR records the full value
+map and prices the open options explicitly.
+
+## Decision
+
+The proposer/verifier value proposition is realised on **two axes**, and its
+status is **platform- and workload-dependent**, not a single scalar:
+
+### 1. The core value is "bounded memory + recall", not "fast"
+
+Since ADR 0008 §11, the proposer's first-class role is **history
+reconstruction**: no KV cache, transient full-history K/V → restore the
+verifier's evicted-position attention. The main line has already **won**, but
+the value is realised on the **memory axis**, not the throughput axis:
+Step-1 = **1.0× AR throughput + recall 5/5 + KV 132.9 MB vs naive 1308.9 MB
+(89.8 % saving; ~48 MB after affine4 / ADR 0010)**. That is the
+proposer/verifier deliverable.
+
+A finer honesty note: in the Mac **S5-native** shipping configuration,
+Gemma-4's *native hybrid attention* means keeping the **5 full-attention
+layers exact** is already enough to carry recall — so on this specific model
+the f_θ/proposer reconstruction is **replaced by the S5 shortcut**. But on a
+**pure sliding-window architecture** (the K1/K2 Qwen3 case — no
+full-attention layers to preserve) and on the **CUDA full-restoration path**,
+proposer reconstruction remains the **only** source of recall. The
+architecture's domain of applicability is unchanged; Gemma-4 simply handed us
+a free coupon.
+
+### 2. Speculative-decoding value forks by platform — the Mac negative does not extrapolate
+
+- **H200 (#107 measured)**: fused = **1.27× AR, recall 1.0** — the *same*
+  proposer/verifier code; on a platform where verify-batch is nearly free,
+  spec-decode value holds.
+- **Mac's 0.26×** has a **concrete, movable** bottleneck: real per-token
+  acceptance is **30–40 %**. The vLLM reference reports the *same* drafter at
+  **44.7 %**, and our own drafter docs say "the precise EAGLE-3 ↔ block-fusion
+  alignment is a Stage-2 task". Alignment fine-tuning (the plan that has been
+  queued in ADR 0001 / 0004 all along) lifting acceptance to **~70 %** makes
+  the block-4 arithmetic `3.5 × 43.8 / 140 ≈ 1.1×` — Mac clears the bar too.
+  So the Mac status is **"waiting for the alignment asset"**, not
+  **"architecture pronounced dead"**.
+
+### 3. Option value of the verification primitive: correctness containment makes any draft source plug-and-play
+
+The v3 loop + byte-level consistency guarantees one thing: a draft source can
+only affect **throughput**, never **pollute output**. This is an open
+interface; the drafter can be swapped for anything. A concrete, this-week,
+testable Mac route: **NGramProposer** (the zero-weight prompt-lookup proposer
+already in PR #105) + the v3 loop — draft cost ≈ 0, no alignment dependency,
+naturally high acceptance on agentic workloads (tool-call JSON, templated
+replies, highly self-repetitive sessions). Arithmetic: `draft ≈ 0 +
+verify(4) = 120 ms`, committing 2.5/block → 0.78×, 3.5/block → 1.1× — on
+Kakeya's target workload (ADR 0006: local agent infrastructure) this is
+entirely plausible. One bridge command verifies it.
+
+### 4. Beyond single-host throughput, the split is the foundation for a multi-host architecture
+
+ADR 0009 / PR #105's capability-exchange plane is built with the
+proposer/verifier roles as primitives: the proposer is a fleet capability that
+can be gossip-discovered and remote-invoked. Even if a single Mac never runs
+spec-decode, the "verifier on host A, proposer capability on host B / cloud"
+shape (including the dev/eval tool plane) is **already running** — the Mac
+bridge used over the last two days is itself an instance of this
+architecture's tool plane.
+
+### Bottom line
+
+If the proposition is narrowed to *"Mac single-host + generic chat + the
+current un-aligned DFlash"* — yes, the proposer has **no runtime value in
+that one cell today**, and Step-1 reaches 1.0× without it. But the
+architecture's value map is: **the realised bounded-memory story (all
+platforms) + the realised throughput story (CUDA) + two explicitly-priced Mac
+throughput options (alignment fine-tuning / n-gram drafting) + the foundation
+for the multi-host capability plane.**
+
+## Consequences
+
+- The proposer/verifier split is **retained**; "Step-1 doesn't use the
+  proposer on Mac/Gemma-4" is **not** grounds to deprecate it (it is the only
+  recall source on pure-sliding-window models and on CUDA full-restoration).
+- Memory-axis claims (bounded KV, S5, affine4) are the **primary**,
+  all-platform deliverable and should be reported as such; throughput claims
+  must be qualified by platform (CUDA: realised; Mac: option-pending).
+- Two priced Mac throughput options are tracked as next steps:
+  (a) **alignment fine-tuning** (ADR 0001/0004) to lift DFlash acceptance
+  toward the 44.7 % reference / ~70 % target; (b) **NGramProposer × v3 loop**
+  for agentic workloads (draft ≈ 0, no training). Option (b) is the cheapest
+  and is verifiable in a single bridge run.
+- Any future "is spec-decode worth it?" discussion must specify the
+  **(platform, workload, drafter)** cell; a negative in one cell does not
+  generalise.
+
+## Alternatives considered
+
+- **"The Mac throughput number kills the architecture."** Rejected: the Mac
+  cell has a concrete, movable bottleneck (acceptance 30–40 % vs 44.7 %
+  reference), and the negative does not extrapolate to CUDA (1.27×, realised)
+  or to the memory axis (89.8 % saving, realised, all platforms).
+- **"Deprecate the proposer because Step-1 reaches 1.0× without it on
+  Gemma-4."** Rejected: S5 is a Gemma-4-specific free coupon (native hybrid
+  attention); on pure sliding-window models (K1/K2 Qwen3) and the CUDA
+  full-restoration path the proposer is the only recall source.
+- **"Only ship spec-decode if it beats AR everywhere."** Rejected: the value
+  is platform-forked; CUDA already clears it, and the verification primitive's
+  correctness containment makes the Mac throughput a strictly-additive option
+  (it can never regress correctness).
+
+## Evidence pointers
+
+- Mac bounded-memory + recall: `results/research/k3_mlx_fused_fair_ctx280_n5_gen32_*.json`
+  (recall 5/5, KV 132.9 MB vs 1308.9 MB), `docs/pr109-mac-ctx280-validation.md`.
+- CUDA throughput: PR #107, `docs/k3-gpu-beta.md`.
+- verify(L) headroom: `results/research/verify_l_sweep.json` (3.92× measured @
+  L=16 vs 4.52× expert-union bound).
+- Drafter alignment status: ADR 0001/0004; `inference_engine/v04/dflash_drafter.py`
+  ("Stage-2" fidelity note).
+- Capability plane / multi-host: ADR 0009, PR #105; `scripts/mac_bridge/`.
diff --git a/docs/adr/README.md b/docs/adr/README.md
index 60ffdd83..89dfdf42 100644
--- a/docs/adr/README.md
+++ b/docs/adr/README.md
@@ -40,6 +40,7 @@ reader what was *not* chosen.
 | 0006 | [Project positioning as local agent infrastructure](0006-local-agent-infrastructure-positioning.md) | Accepted |
 | 0007 | [Cross-request KV cache reuse for long sessions](0007-cross-request-kv-reuse.md) | Superseded by 0008 |
 | 0008 | [Session-bound runtime + gRPC protocol](0008-session-bound-runtime-and-grpc-protocol.md) | Accepted |
+| 0012 | [Proposer/verifier value proposition: bounded-memory + recall, platform-forked throughput](0012-proposer-verifier-value-proposition.md) | Accepted |
 
 Note: ADR numbering is monotonically increasing; in-flight or planned
 numbers (0005) appear in the index so readers can