From 5b363d49cadce1174c153c2b7441d1ae0d796b7a Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Sat, 13 Jun 2026 03:45:23 +0000
Subject: [PATCH] =?UTF-8?q?ADR=200013:=20distributed=20inference=20topolog?=
 =?UTF-8?q?y=20=E2=80=94=20what=20AR=20sequentiality=20allows?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fix the can/can't-parallelize conclusion so it is not re-derived. Governing
constraint: single-sequence AR decode is sequential (token N+1 depends on N) ->
a sequence's token chain cannot be split across independent parallel verifiers.

Topologies mapped:
- split one sequence across N independent verifiers: NOT possible (AR seq).
- single big verifier sharded across hosts: yes (mlx.distributed model.shard).
- N proposers : 1 verifier (tree/multi-candidate spec): the path to single-
  request throughput; feasible on ADR 0009 substrate but NOT built (current loop
  is single RemoteProposer + linear accept).
- 1 proposer : N verifiers: realized, but = fleet multi-tenancy throughput, not
  single-task speedup.

Companion/clarification to ADR 0009 (not an amendment - 0009 is Accepted and on
another branch). Cross-links ADR 0001/0008/0009/0012; cites verify_l_sweep.json
(sublinear verify = the tree-spec headroom). README index updated.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
---
 .../0013-distributed-inference-topology.md    | 111 ++++++++++++++++++
 docs/adr/README.md                            |   1 +
 2 files changed, 112 insertions(+)
 create mode 100644 docs/adr/0013-distributed-inference-topology.md

diff --git a/docs/adr/0013-distributed-inference-topology.md b/docs/adr/0013-distributed-inference-topology.md
new file mode 100644
index 00000000..342591f2
--- /dev/null
+++ b/docs/adr/0013-distributed-inference-topology.md
@@ -0,0 +1,111 @@
+# ADR 0013 — Distributed inference topology: what AR sequentiality allows
+
+- **Status**: Accepted (2026-06-13)
+- **Date**: 2026-06-13
+- **Relates to**: ADR 0009 (mlx.distributed + capability exchange — this ADR
+  is its topology companion / clarification), ADR 0008 (session-bound runtime),
+  ADR 0001 (proposer sizing), ADR 0012 (value proposition).
+
+## Context
+
+A recurring vision is raised for "distributed inference": *decompose one
+inference task into several parallel subtasks, have the proposer coordinate
+across multiple verifiers, and win total token throughput* — extrapolated to
+**many-to-many** proposer/verifier wiring (one verifier fed by many proposers;
+one proposer drafting for many verifiers).
+
+ADR 0009 shipped the *substrate* (capability exchange + remote `ProposeBlock` +
+`DistributedSpeculativeDecoder` + an optional `mlx.distributed` data plane) but
+did not pin down **which inference topologies are physically achievable**. This
+ADR fixes the can/can't-parallelize conclusion so it is not re-derived each time
+the idea resurfaces.
+
+## Decision
+
+### The governing constraint
+
+**Single-sequence autoregressive (AR) decoding is inherently sequential**:
+token `N+1` depends on the realized value of token `N`. A single sequence's
+token chain therefore **cannot** be split into independent parallel subtasks
+across multiple verifiers the way a batch / map-reduce job can. This is a
+causal-dependency property of AR generation, not an engineering gap.
+
+The only parallelism available to a **single** sequence is:
+
+1. **Intra-forward (model parallelism)** — split *one* verifier's weights/compute
+   across hosts via tensor/pipeline parallelism (`mlx.distributed`
+   `model.shard`, ADR 0009 §2.1). This is "one verifier across N hosts," not "N
+   verifiers." It enables / accelerates a verifier too big for one host;
+   throughput scales **sublinearly** (collective-communication bound), and its
+   real purpose is fit + latency, not linear throughput multiplication.
+2. **Intra-block (speculative decoding)** — the verifier checks `L` drafted
+   tokens in **one batched forward**; throughput gain = `acceptance × block`,
+   amortizing one verify over many tokens. The `verify(L)` cost is **sublinear**
+   in `L` (`results/research/verify_l_sweep.json`: ~4× at L=16), which is the
+   headroom that makes blocks and trees pay off.
+3. **N:1 tree / multi-candidate speculation** — multiple drafts (many proposers,
+   or one proposer emitting a token *tree*) are verified in **one** batched
+   forward via tree attention; the longest correct path is accepted. This
+   raises **single-request** throughput by exploiting the sublinear `verify(L)`
+   headroom from (2).
+
+### The topologies, mapped to feasibility
+
+| Topology | Realizable? | What it is | Status on the ADR-0009 substrate |
+|---|---|---|---|
+| **Split one sequence across N independent verifiers in parallel** | ❌ No | category error — blocked by AR sequentiality | n/a |
+| **Single big verifier sharded across hosts** (1 verifier, N hosts) | ✅ Yes | tensor/pipeline parallel of one model | `mlx.distributed` ring adapter shipped (ADR 0009 §4.4); sharding is mlx-lm `model.shard` |
+| **N proposers : 1 verifier** (tree / multi-candidate) | ✅ Yes — **the** path to single-request throughput | parallel candidate verification | **feasible, not built** — current `DistributedSpeculativeDecoder` is single `RemoteProposer` + linear accept; needs tree-attention verify + multi-proposer aggregation |
+| **1 proposer : N verifiers** | ✅ Yes (already realized) | a shared proposer capability serves many independent verifier sessions | shipped: `ProposerService` + capability exchange (ADR 0009 §4) |
+
+### What "total throughput advantage" means (two regimes)
+
+- **Single-request throughput**: only (1) intra-verifier model parallelism and
+  (3) N:1 tree speculation help. Multiple *independent* verifiers do **not** —
+  there is nothing to parallelize across them for one sequence.
+- **Fleet / aggregate throughput** (many independent requests): the **1:N**
+  proposer-sharing + role placement is the realized win — it raises utilization
+  (offload the asymmetrically-cheap 0.25–1 B proposer, free
+  `proposer_weight_bytes` on verifier hosts), but does not speed up any single
+  request beyond ordinary spec-decode.
+
+## Consequences
+
+- For **single-request throughput**, the correct next investment is **N:1 tree /
+  multi-candidate speculation** built on the ADR-0009 capability substrate +
+  the sublinear `verify(L)` headroom — **not** "more verifiers." This is tracked
+  as a v0.5+ extension to `DistributedSpeculativeDecoder` (territory of the
+  ADR-0009 / capability-plane workstream).
+- **Multi-host spec-decode trades latency for placement**: F3 (aux hidden
+  states) is MB/block on the critical path (ADR 0009 F-flow table); it only pays
+  off behind a fast data plane (ring / `jaccl`). Distributing for its own sake
+  can regress single-request latency.
+- The **Mac bridge** (`scripts/mac_bridge/`) used for dev/eval is an instance of
+  the capability plane's **tool plane**, *not* a production inference data
+  plane. "The multi-host tool plane is running" must not be extrapolated to "a
+  distributed inference data plane is ready."
+- Any future "let's parallelize one request across machines" proposal must first
+  identify which of the four topologies it is; the "N independent verifiers on
+  one sequence" form is closed.
+
+## Alternatives considered
+
+- **"Decompose one sequence into parallel subtasks across N verifiers."**
+  Rejected: AR sequentiality (token `N+1` needs token `N`) makes the subtasks
+  causally dependent, not independent. No coordination protocol recovers
+  independence that the math forbids.
+- **"Multiple verifiers vote / ensemble on one sequence for speed."** Rejected
+  for throughput: ensembling changes *quality semantics* and still runs each
+  verifier over the same sequential chain — it multiplies cost, not speed.
+- **"Treat distribution as the throughput lever."** Rejected as the primary
+  lever: the realized throughput wins are intra-block (spec-decode, single host)
+  and fleet-aggregate (1:N); cross-host distribution's single-request value is
+  bounded by F3 latency and is a fit/placement tool, not a linear scaler.
+
+## Evidence pointers
+
+- `verify(L)` sublinearity (the headroom tree-spec would exploit):
+  `results/research/verify_l_sweep.json` (3.92× measured @ L=16).
+- F-flow latency analysis + `mlx.distributed` data-plane scope: ADR 0009 §2.
+- Realized 1:N substrate: ADR 0009 §4 (`CapabilityService`, `ProposerService`,
+  `DistributedSpeculativeDecoder`); `inference_engine/distributed/`.
diff --git a/docs/adr/README.md b/docs/adr/README.md
index 89dfdf42..6081cb86 100644
--- a/docs/adr/README.md
+++ b/docs/adr/README.md
@@ -41,6 +41,7 @@ reader what was *not* chosen.
 | 0007 | [Cross-request KV cache reuse for long sessions](0007-cross-request-kv-reuse.md) | Superseded by 0008 |
 | 0008 | [Session-bound runtime + gRPC protocol](0008-session-bound-runtime-and-grpc-protocol.md) | Accepted |
 | 0012 | [Proposer/verifier value proposition: bounded-memory + recall, platform-forked throughput](0012-proposer-verifier-value-proposition.md) | Accepted |
+| 0013 | [Distributed inference topology: what AR sequentiality allows](0013-distributed-inference-topology.md) | Accepted |
 
 Note: ADR numbering is monotonically increasing; in-flight or
 planned numbers (0005) appear in the index so readers can