diff --git a/README.md b/README.md
index 94be72dd..22c45229 100644
--- a/README.md
+++ b/README.md
@@ -230,6 +230,48 @@ engine lands at **≈ AR parity** — the memory + context wins are platform-ind
 Reproduce with `scripts/research/k3_e2e_gpu_bench.py` + `k3_specdecode_gpu_bench.py`
 (CUDA) and the `k3-beta-scorecard` / `k3-fused-allmlx-code-trim` Mac-bridge presets.
 
+### Agent-connection capacity & cross-host topology ([ADR 0014](docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md))
+
+**Agent connections (gRPC `RuntimeService`, Mac mini M4).** A connection load
+test (`scripts/research/grpc_agent_capacity_loadtest.py`, preset
+`agent-capacity-loadtest`) ramps N concurrent agents — an independent gRPC
+channel + session each — against one runtime:
+
+| | result |
+| --- | --- |
+| Max concurrent agents | **256 / 256, zero errors** (the configured capacity — a clean floor, not a failure point) |
+| Per-session resident KV | **bounded** (sink+window; ~7.8 MB @ window 64, ~30 MB @ window 256) |
+| Node KV upper bound | **capacity × per-session bound** (≈2.0 GB @ cap 256) — independent of context length / churn |
+| Server RSS vs agents | **flat** (3825 → 3850 MB across 1 → 256) — adding agents costs ~0 memory |
+
+Caveat: v0.3 is **single-tenant** (one shared verifier, RPCs serialized on one
+asyncio loop — per-session binding is a v0.4 / PR-A3c item), so create/generate
+latency is linear in N and "256" is the max concurrent connections *served*, not
+parallel inferences. Pushing further (preset `agent-capacity-stress`, FD raised
+to 100k / hard unlimited on the Mac) shows the true ceilings: **FD is not the
+limit**; **memory** scales with `capacity × window` (capacity 2048 @ window 256
+→ ~11 GB RSS, theoretical node bound ~61 GB > 24 GB RAM, so capacity must be
+sized to RAM); and with a per-agent **context** prefill the binding constraint
+is **single-tenant serialization** (concurrent heavy-prefill agents serialize
+and time out well before any FD/connection limit). Bounded memory is structural:
+light-session agent count does **not** grow RSS; the memory lever is the
+resident **window**, not the number of agents.
+
+**Cross-host proposer/verifier.** A GPU proposer ⇄ Mac verifier *token-level
+draft* data plane is **design-only** (no `CapabilityService` / `ProposeBlock` /
+gossip) **and** ruled out by the WAN latency budget — now **measured** on real
+H200 compute by injecting one proposer↔verifier round-trip per block:
+
+| per-block RTT | 0 (co-located) | 15 ms (LAN) | 30 ms | 60 ms | 100 ms | 150 ms |
+| --- | --- | --- | --- | --- | --- | --- |
+| vs AR | **2.20×** | 1.81× | 1.50× | 1.22× | **0.98×** (break-even) | 0.77× (loss) |
+
+**Break-even ≈100 ms/block**: a cloud↔desk WAN (30–150 ms) straddles/exceeds it,
+while a LAN (≤15 ms) keeps the 1.8–2.2× win. So the realizable split is **WAN =
+control + tool plane** (the Mac bridge) and **LAN = co-located data plane**. See
+[ADR 0014](docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md)
+for the full plan, evidence, and the served-MLX-gemma gap found during testing.
+
 ## Kakeya Inference Engine for Mac — MLX speculative-decode port (K3 beta baseline)
 
 After the **CUDA** beta (PR #107: f_θ + S5 K/V-restoration verifier, **fused DFlash
diff --git a/docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md b/docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md
new file mode 100644
index 00000000..5c6a1cc2
--- /dev/null
+++ b/docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md
@@ -0,0 +1,226 @@
+# ADR 0014 — Agent-connection capacity & cross-host proposer/verifier topology: test plan & results
+
+- **Status**: Accepted (test record + topology decision)
+- **Date**: 2026-06-14
+- **Relates to**: [ADR 0008](0008-session-bound-runtime-and-grpc-protocol.md)
+  (session-bound gRPC runtime), [`docs/mac-bridge.md`](../mac-bridge.md) +
+  [`docs/design/mac-bridge-cloud-agent-access.md`](../design/mac-bridge-cloud-agent-access.md)
+  (the cloud-agent ⇄ Mac soft link).
+- **Implementation**: `scripts/research/grpc_agent_capacity_loadtest.py`,
+  manifest preset `agent-capacity-loadtest`
+  (`inference_engine/bridge/manifest.py`).
+- **Evidence**: `results/research/k3_agent_capacity_mac.json`.
+
+> Note on numbering: `main`'s ADR index stops at 0008; the README references
+> 0009/0012/0013, which were authored on branches that never merged to `main`
+> (a known doc gap, out of scope here). This ADR uses 0014 to avoid collision.
+
+## 1. Context
+
+Two test cases were requested against the AR-verifier + dLLM-proposer
+architecture, using the Mac bridge as the cloud-agent ⇄ Mac mini M4 link:
+
+1. **Case 1 — agent connection capacity.** Simulate many agents connecting to
+   the Kakeya inference engine on the Mac mini and find the **maximum
+   concurrent agent connections**, plus the bounded KV residency.
+2. **Case 2 — cross-host proposer/verifier.** Run the CUDA proposer on a GPU,
+   have it **discover** and submit **drafts** to the verifier on the Mac mini,
+   and measure **token throughput / max agent connections / Mac KV upper
+   bound** under that topology.
+
+Ground truth from a code audit of `main` (`9d5e6b4` lineage) determined what is
+runnable vs design-only and shaped the test plan below.
+
+## 2. Test environment
+
+- **Mac mini M4** (24 GB unified memory), self-hosted Actions runner
+  `[self-hosted, macOS, ARM64, kakeya-mac-m4]`, reached via the **Mac bridge**
+  git-bus plane (no inbound path; allowlisted presets only).
+- **Cloud agent**: Linux x86 VM (no Metal). Orchestrates via the bridge.
+- **GPU**: H200 NVL (vast.ai) — runs the CUDA proposer + verifier for the
+  co-located reference (fused **2.06–2.20× AR**, recall 1.0) and the §4.3
+  cross-host WAN-penalty sweep.
+- **Engine**: gRPC `RuntimeService` (`scripts/start_grpc_runtime_server.py`),
+  Python SDK clients (`sdks/python/kakeya`).
+
+## 3. Case 1 — agent connection capacity (RUN, real evidence)
+
+### 3.1 Implementation
+
+`scripts/research/grpc_agent_capacity_loadtest.py` launches one
+`RuntimeService` subprocess and ramps `N` concurrent **agents**, each an
+independent gRPC channel + session that creates a session, appends a short
+prompt, holds the session open while all `N` are established (true concurrent
+peak), then generates and reads `GetSessionInfo.kv_live_bytes`. It records, per
+level: created/generate success, create & generate latency p50/p95,
+per-session bounded KV, and server RSS. Run on the Mac via the
+`agent-capacity-loadtest` bridge preset.
+
+Verifier: **cpu `Qwen/Qwen3-0.6B`** (the integration-gate model). Connection /
+admission scaling is **model-independent**, so this isolates the connection
+behavior; the served **MLX gemma** path is a separate v0.4 gap (§6).
+
+### 3.2 Results (Mac mini M4, capacity=256, sink=4 window=64)
+
+| agents | created | errors | create p95 (s) | gen p95 (s) | per-session KV | server RSS |
+| --- | --- | --- | --- | --- | --- | --- |
+| 1 | 1/1 | — | 0.78 | 0.10 | 1.38 MB | 3825 MB |
+| 16 | 16/16 | — | 1.33 | 1.66 | 7.80 MB | 3835 MB |
+| 64 | 64/64 | — | 5.66 | 6.85 | 7.80 MB | 3840 MB |
+| 128 | 128/128 | — | 11.27 | 13.64 | 7.80 MB | 3845 MB |
+| **256** | **256/256** | **—** | 22.61 | 26.44 | 7.80 MB | 3850 MB |
+
+- **Max concurrent agents: 256 / 256, zero errors.** 256 was the configured
+  `--capacity` and was sustained completely — i.e. **256 is a clean floor on
+  the connection ceiling, not a failure point**. The true resource ceiling is
+  higher (not probed past the configured capacity).
+- **Per-session KV is bounded at 7.80 MB** (plateaus from 16 agents up): the
+  `sink+window` (68-token) ceiling holds regardless of agent count.
+- **Node KV upper bound = capacity × per-session bound = 256 × 7.80 MB ≈
+  2.0 GB** — the whole-node resident-KV ceiling, independent of context length
+  or agent churn. This is the bounded-memory guarantee at the fleet level.
+- **Server RSS is flat** (3825 → 3850 MB across 1 → 256 agents): adding agents
+  costs ~0 memory beyond the bounded slab; model weights dominate.
+
+### 3.2b Stress beyond 256 — the real ceilings (preset `agent-capacity-stress`)
+
+Pushing further with the FD limit raised (`RLIMIT_NOFILE` soft 100k, hard
+unlimited on the Mac) and a **per-agent context prefill** (window 256,
+`--context-len 256`, capacity 2048):
+
+| agents | created | create p95 | per-session KV | server RSS |
+| --- | --- | --- | --- | --- |
+| 1 | 1/1 | 3.07 s | 29.8 MB | 11 477 MB |
+| 8 | 8/8 | 25.2 s | 29.8 MB | 11 343 MB |
+| 16 | 15/16 (1 `RpcCancelled`) | 44.6 s | 29.8 MB | 10 781 MB |
+
+- **FD is not the ceiling** (raised to 100k; Mac hard limit is unlimited).
+- **Memory** scales with `capacity × window`: capacity 2048 @ window 256 →
+  **~11.5 GB RSS**, and the theoretical node bound is **~61 GB > 24 GB RAM** —
+  so capacity must be **sized to RAM** (it is the memory knob, not agent count).
+- The binding constraint with real per-agent context is **single-tenant
+  serialization**: create latency is purely linear (3 → 12 → 25 → 45 s as
+  N = 1 → 4 → 8 → 16) because every session's prefill serializes through the one
+  shared verifier, so clean concurrency tops out at **~8 heavy-context agents**
+  before RPCs time out — versus **256 light-session agents** (§3.2). Per-session
+  KV stays bounded (29.8 MB @ window 256) throughout.
+
+### 3.3 Honest caveat — v0.3 is single-tenant
+
+Create/generate latency scales **linearly** with `N` (256 agents → gen p95
+26 s). That is the single-tenant signature: one shared verifier, RPC handlers
+serialized on one asyncio loop (per-session verifier binding is deferred to
+v0.4 / PR-A3c, see ADR 0008). So **256 = max concurrent connections admitted
+and served**, *not* 256 parallel inferences. The capacity cap + LRU eviction
+(`SessionStore`) + slab pool (`PoolExhausted → RESOURCE_EXHAUSTED`) are the
+admission-control levers; `--max-concurrent-rpcs` caps in-flight handlers.
+
+## 4. Case 2 — cross-host proposer/verifier (FEASIBILITY VERDICT)
+
+### 4.1 Verdict: the requested topology is not implementable today, and is architecturally bounded out
+
+A code audit found the cross-host discovery + draft plane is **design-only**:
+
+- **No `distributed.proto`, no `CapabilityService`/`ProposerService`, no
+  `ProposeBlock` RPC, no gossip/registry/TTL** — zero runnable cross-process
+  wiring (the ADR 0009 file is itself absent from `main`).
+- The **only implemented cross-machine plane is the Mac-bridge git-bus**
+  (async, batch, allowlisted presets) — a **tool/control plane**, not a
+  token-level data plane.
+- Speculative decoding (proposer + verifier) is implemented **in-process
+  only** (`kv_cache_proposer/speculative.py`, `inference_engine/v04/`).
+
+Even if built, **per-block draft submission over WAN is ruled out by the
+latency budget** (design doc §4.2): a Gemma-4-26B M4 verify of an 8-token block
+is ~50–100 ms; a cloud↔desk RTT is 30–150 ms **per block**, i.e. 30–300 %
+overhead that consumes any acceptance gain. **Proposer and verifier must share
+a LAN** for the data plane.
+
+### 4.2 What the topology decomposes into (and the measurable proxies)
+
+| Plane | Crosses WAN? | Status | Measured |
+| --- | --- | --- | --- |
+| Discovery / capability advertise | yes (seconds-scale) | bridge proxy only | bridge dispatch ~10 s + queue; one Mac, serialized (`concurrency: mac-bridge`) |
+| Job/tool dispatch (eval/bench) | yes | implemented (bridge) | this ADR's Case-1 run is itself an instance |
+| **Token-level draft (data plane)** | **no — must be LAN** | not implemented | **measured penalty curve §4.3** (break-even ~100 ms/block) |
+| Co-located spec-decode (the feasible data plane) | n/a (same host) | implemented | **GPU H200 2.06–2.20× AR**; **Mac 0.93× AR** (PR #118) |
+
+So the answers to Case 2's three metrics, under the **realizable** topology:
+
+- **Token throughput**: spec-decode is a *co-located* win — **2.06–2.20× AR on
+  the GPU** (recall 1.0) and **≈AR parity (0.93×) on the Mac**. The measured
+  WAN-penalty curve (§4.3) shows the cross-host draft loop falls to **break-even
+  at ~100 ms/block and a net loss at 150 ms**, i.e. slower than running AR
+  locally — so it is not a throughput strategy.
+- **Max agent connections**: governed by the *serving* node (Case 1): **256+
+  concurrent agents** on the Mac via `RuntimeService`.
+- **Mac KV upper bound**: bounded — **capacity × per-session `sink+window`**
+  (≈2.0 GB at capacity 256 for Qwen3-0.6B; for the gemma S5 production config
+  the per-agent resident KV is ~133 MB at 5.8k ctx, dominated by the 5 exact
+  full-attention layers — see the README beta scorecard).
+
+### 4.3 Measured WAN-penalty curve (H200, real models, `--rtt-sweep`)
+
+Rather than rest on the latency *estimate*, we measured it: the fused engine was
+re-timed with one injected proposer↔verifier round-trip **per block** on the real
+Gemma-4-26B verifier + DFlash drafter (H200 NVL), sweeping per-block RTT across
+the cloud↔desk range (`scripts/research/k3_specdecode_gpu_bench.py --rtt-sweep`,
+`results/research/k3_crosshost_rtt_gpu.json`):
+
+| per-block RTT | decode tok/s | vs AR | regime |
+| --- | --- | --- | --- |
+| 0 ms (co-located) | 52.4 | **2.20×** | the win |
+| 5 ms (LAN) | 47.0 | 1.97× | LAN keeps it |
+| 15 ms | 43.3 | 1.81× | LAN keeps it |
+| 30 ms | 35.9 | 1.50× | WAN edge |
+| 60 ms | 29.2 | 1.22× | shrinking |
+| 100 ms | 23.5 | **0.98×** | **break-even** |
+| 150 ms | 18.4 | 0.77× | net **loss** |
+
+AR baseline = 23.8 tok/s. **Break-even is ~100 ms/block**: beyond it, cross-host
+spec-decode is *slower than running AR locally*. A cloud↔desk WAN (30–150 ms RTT)
+straddles or exceeds break-even, while a LAN/Thunderbolt link (≤15 ms) preserves
+the 1.8–2.2× win. This is the architecture's prediction (design doc §4.2),
+**now quantified on real compute** — and it is why the data plane must be LAN.
+
+## 5. Decision
+
+1. **Case 1 is validated**: the session-bound gRPC runtime admits and serves
+   **≥256 concurrent agent connections** on an M4 with **flat memory** and a
+   **bounded ~2.0 GB node KV ceiling**, with the documented single-tenant
+   latency-serialization caveat.
+2. **Case 2's WAN data plane is rejected** as a throughput strategy and is
+   unbuilt: cross-host token-level draft must not cross the cloud↔desk
+   boundary. The correct topology is **WAN = control + tool plane (bridge),
+   LAN = co-located data plane (spec-decode)** — the same conclusion as the
+   Mac-bridge design doc §4, now backed by the audit and the co-located
+   throughput evidence.
+
+## 6. Consequences & follow-ups
+
+- **Served MLX gemma gap (found during this test)**: `MLXSinkWindowVerifier`
+  reads a flat `cfg.num_hidden_layers`, but the gemma-4 MLX model nests its
+  config → `AttributeError` when starting `--backend mlx` with the gemma
+  verifier. The served gRPC path is wired for the torch/HF verifier (and
+  Qwen3-MLX), not gemma-4 MLX. Tracked as a v0.4 item (alongside per-session
+  binding); Case 1 therefore used the cpu verifier.
+- **Multi-tenant (PR-A3c)**: per-session verifier binding would lift the
+  serialization caveat and turn "256 connections" into "256 *concurrent
+  inferences*"; until then, capacity sizing should reflect serialized service.
+- **M3 (fleet capability plane)**: if/when built, placement must treat
+  `ring_address`/RTT class as a hard constraint so data-plane (draft) pairings
+  never span WAN — a one-line filter in the placement candidate set.
+
+## 7. Alternatives considered
+
+- **Build the cross-host gRPC draft plane now and benchmark it.** Rejected: it
+  is a large unimplemented feature (proto + services + discovery) whose result
+  is already known to be *worse than co-located* by the latency budget — it
+  would confirm a negative at high cost.
+- **Run Case 1 against the MLX gemma verifier.** Blocked by the served-MLX gap
+  (§6); connection scaling is model-independent, so cpu Qwen3-0.6B gives the
+  same admission/capacity answer with the production KV bound reported
+  analytically.
+- **Hold live cloud→Mac gRPC sessions for Case 1.** Impossible: the Mac has no
+  inbound path (the reason the bridge exists). The load test runs co-located on
+  the Mac, dispatched via the bridge.
diff --git a/docs/adr/README.md b/docs/adr/README.md
index 60ffdd83..53254167 100644
--- a/docs/adr/README.md
+++ b/docs/adr/README.md
@@ -40,6 +40,7 @@ reader what was *not* chosen.
 | 0006 | [Project positioning as local agent infrastructure](0006-local-agent-infrastructure-positioning.md) | Accepted |
 | 0007 | [Cross-request KV cache reuse for long sessions](0007-cross-request-kv-reuse.md) | Superseded by 0008 |
 | 0008 | [Session-bound runtime + gRPC protocol](0008-session-bound-runtime-and-grpc-protocol.md) | Accepted |
+| 0014 | [Agent-connection capacity & cross-host proposer/verifier topology: test plan & results](0014-agent-connection-capacity-and-cross-host-topology-tests.md) | Accepted |
 
 Note: ADR numbering is monotonically increasing; in-flight or
 planned numbers (0005) appear in the index so readers can
diff --git a/inference_engine/bridge/manifest.py b/inference_engine/bridge/manifest.py
index 64b6a3bb..9a5887cf 100644
--- a/inference_engine/bridge/manifest.py
+++ b/inference_engine/bridge/manifest.py
@@ -131,6 +131,56 @@ def _harness_preset(
             ),
             timeout_minutes=60,
         ),
+        Preset(
+            name="agent-capacity-loadtest",
+            description="Test case 1: ramp concurrent agent connections "
+                        "(independent gRPC channel + session each) against a "
+                        "single RuntimeService; report max concurrent agents, "
+                        "per-session bounded KV, node KV upper bound, latency "
+                        "curve, server RSS. Uses the cpu Qwen3-0.6B verifier "
+                        "(the integration-gate model; connection/admission "
+                        "scaling is model-independent — the served MLX gemma "
+                        "path is a separate v0.4 item).",
+            command_templates=(
+                (
+                    "python3", "scripts/research/grpc_agent_capacity_loadtest.py",
+                    "--backend", "cpu",
+                    "--verifier-id", "Qwen/Qwen3-0.6B",
+                    "--capacity", "256",
+                    "--sink", "4", "--window", "64",
+                    "--levels", "1,2,4,8,16,32,64,128,256",
+                    "--gen-tokens", "4",
+                    "--output",
+                    "results/research/k3_mac_bridge_agent_capacity.json",
+                ),
+            ),
+            timeout_minutes=90,
+            validate_reports=False,
+        ),
+        Preset(
+            name="agent-capacity-stress",
+            description="Test case 1 (stress): push concurrent agents to 2048 "
+                        "with a per-agent prefilled context (window 256), "
+                        "raised FD limit, to probe the true connection ceiling "
+                        "and the bounded-memory behavior (RSS vs agents) on the "
+                        "Mac. cpu Qwen3-0.6B verifier.",
+            command_templates=(
+                (
+                    "python3", "scripts/research/grpc_agent_capacity_loadtest.py",
+                    "--backend", "cpu",
+                    "--verifier-id", "Qwen/Qwen3-0.6B",
+                    "--capacity", "2048",
+                    "--sink", "4", "--window", "256",
+                    "--context-len", "256",
+                    "--levels", "1,4,8,16,32,48,64,96",
+                    "--gen-tokens", "1",
+                    "--output",
+                    "results/research/k3_mac_bridge_agent_capacity_stress.json",
+                ),
+            ),
+            timeout_minutes=120,
+            validate_reports=False,
+        ),
         _harness_preset(
             "k3-step1-incremental",
             "PR #109 Step-1 evidence: incremental restored decode.",
diff --git a/results/research/k3_agent_capacity_mac.json b/results/research/k3_agent_capacity_mac.json
new file mode 100644
index 00000000..f848757b
--- /dev/null
+++ b/results/research/k3_agent_capacity_mac.json
@@ -0,0 +1,188 @@
+{
+  "kind": "grpc_agent_capacity_loadtest",
+  "schema_version": 1,
+  "config": {
+    "backend": "cpu",
+    "verifier_id": "Qwen/Qwen3-0.6B",
+    "capacity": 256,
+    "sink": 4,
+    "window": 64,
+    "gen_tokens": 4,
+    "prompt_len": 8,
+    "levels": [
+      1,
+      2,
+      4,
+      8,
+      16,
+      32,
+      64,
+      128,
+      256
+    ],
+    "single_tenant_note": "v0.3 single-tenant: shared verifier, RPCs serialized on one asyncio loop; this measures connection/session admission scaling, not parallel inference."
+  },
+  "results": [
+    {
+      "agents": 1,
+      "created_ok": 1,
+      "generate_ok": 1,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 0.7768,
+        "p95": 0.7768
+      },
+      "generate_latency_s": {
+        "p50": 0.097,
+        "p95": 0.097
+      },
+      "per_session_kv_bytes": 1376256,
+      "server_rss_mb": 3825.0,
+      "wall_s": 0.87
+    },
+    {
+      "agents": 2,
+      "created_ok": 2,
+      "generate_ok": 2,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 0.0791,
+        "p95": 0.157
+      },
+      "generate_latency_s": {
+        "p50": 0.1981,
+        "p95": 0.1982
+      },
+      "per_session_kv_bytes": 1835008,
+      "server_rss_mb": 3828.4,
+      "wall_s": 0.36
+    },
+    {
+      "agents": 4,
+      "created_ok": 4,
+      "generate_ok": 4,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 0.2667,
+        "p95": 0.3537
+      },
+      "generate_latency_s": {
+        "p50": 0.3802,
+        "p95": 0.4059
+      },
+      "per_session_kv_bytes": 2752512,
+      "server_rss_mb": 3829.0,
+      "wall_s": 0.76
+    },
+    {
+      "agents": 8,
+      "created_ok": 8,
+      "generate_ok": 8,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 0.4326,
+        "p95": 0.7292
+      },
+      "generate_latency_s": {
+        "p50": 0.7434,
+        "p95": 0.8227
+      },
+      "per_session_kv_bytes": 4587520,
+      "server_rss_mb": 3830.5,
+      "wall_s": 1.56
+    },
+    {
+      "agents": 16,
+      "created_ok": 16,
+      "generate_ok": 16,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 0.8042,
+        "p95": 1.3275
+      },
+      "generate_latency_s": {
+        "p50": 1.1834,
+        "p95": 1.664
+      },
+      "per_session_kv_bytes": 7798784,
+      "server_rss_mb": 3834.8,
+      "wall_s": 3.09
+    },
+    {
+      "agents": 32,
+      "created_ok": 32,
+      "generate_ok": 32,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 1.5013,
+        "p95": 2.6569
+      },
+      "generate_latency_s": {
+        "p50": 2.0461,
+        "p95": 3.3688
+      },
+      "per_session_kv_bytes": 7798784,
+      "server_rss_mb": 3836.0,
+      "wall_s": 6.25
+    },
+    {
+      "agents": 64,
+      "created_ok": 64,
+      "generate_ok": 64,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 3.053,
+        "p95": 5.6604
+      },
+      "generate_latency_s": {
+        "p50": 3.7894,
+        "p95": 6.8495
+      },
+      "per_session_kv_bytes": 7798784,
+      "server_rss_mb": 3839.5,
+      "wall_s": 12.89
+    },
+    {
+      "agents": 128,
+      "created_ok": 128,
+      "generate_ok": 128,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 6.0018,
+        "p95": 11.2716
+      },
+      "generate_latency_s": {
+        "p50": 7.3445,
+        "p95": 13.6421
+      },
+      "per_session_kv_bytes": 7798784,
+      "server_rss_mb": 3845.4,
+      "wall_s": 25.9
+    },
+    {
+      "agents": 256,
+      "created_ok": 256,
+      "generate_ok": 256,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 12.0809,
+        "p95": 22.6077
+      },
+      "generate_latency_s": {
+        "p50": 14.1342,
+        "p95": 26.4388
+      },
+      "per_session_kv_bytes": 7798784,
+      "server_rss_mb": 3849.5,
+      "wall_s": 51.47
+    }
+  ],
+  "summary": {
+    "max_concurrent_agents_clean": 256,
+    "per_session_kv_bytes": 7798784,
+    "per_session_kv_mb": 7.7988,
+    "node_kv_upper_bound_mb": 1996.49,
+    "node_kv_upper_bound_note": "capacity * per-session bounded KV \u2014 the whole-node resident-KV ceiling, independent of context length or agent churn.",
+    "server_peak_rss_mb": 3849.5
+  }
+}
\ No newline at end of file
diff --git a/results/research/k3_agent_capacity_stress_mac.json b/results/research/k3_agent_capacity_stress_mac.json
new file mode 100644
index 00000000..d8aa896e
--- /dev/null
+++ b/results/research/k3_agent_capacity_stress_mac.json
@@ -0,0 +1,117 @@
+{
+  "kind": "grpc_agent_capacity_loadtest",
+  "schema_version": 1,
+  "config": {
+    "backend": "cpu",
+    "verifier_id": "Qwen/Qwen3-0.6B",
+    "capacity": 2048,
+    "sink": 4,
+    "window": 256,
+    "gen_tokens": 1,
+    "prompt_len": 8,
+    "context_len": 256,
+    "prefill_len": 256,
+    "fd_limit": {
+      "requested": 100000,
+      "before": [
+        10240,
+        9223372036854775807
+      ],
+      "after": [
+        100000,
+        9223372036854775807
+      ]
+    },
+    "levels": [
+      1,
+      4,
+      8,
+      16,
+      32,
+      48,
+      64,
+      96
+    ],
+    "single_tenant_note": "v0.3 single-tenant: shared verifier, RPCs serialized on one asyncio loop; this measures connection/session admission scaling, not parallel inference."
+  },
+  "results": [
+    {
+      "agents": 1,
+      "created_ok": 1,
+      "generate_ok": 1,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 3.0748,
+        "p95": 3.0748
+      },
+      "generate_latency_s": {
+        "p50": 0.0296,
+        "p95": 0.0296
+      },
+      "per_session_kv_bytes": 29474816,
+      "server_rss_mb": 11476.5,
+      "wall_s": 3.11
+    },
+    {
+      "agents": 4,
+      "created_ok": 4,
+      "generate_ok": 4,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 9.255,
+        "p95": 12.3516
+      },
+      "generate_latency_s": {
+        "p50": 0.0867,
+        "p95": 0.1155
+      },
+      "per_session_kv_bytes": 29818880,
+      "server_rss_mb": 11474.8,
+      "wall_s": 12.47
+    },
+    {
+      "agents": 8,
+      "created_ok": 8,
+      "generate_ok": 8,
+      "errors": {},
+      "create_latency_s": {
+        "p50": 15.7695,
+        "p95": 25.2241
+      },
+      "generate_latency_s": {
+        "p50": 0.1739,
+        "p95": 0.2352
+      },
+      "per_session_kv_bytes": 29818880,
+      "server_rss_mb": 11342.6,
+      "wall_s": 25.46
+    },
+    {
+      "agents": 16,
+      "created_ok": 15,
+      "generate_ok": 15,
+      "errors": {
+        "RpcCancelledError": 1
+      },
+      "create_latency_s": {
+        "p50": 25.6965,
+        "p95": 44.5888
+      },
+      "generate_latency_s": {
+        "p50": 0.2797,
+        "p95": 0.4459
+      },
+      "per_session_kv_bytes": 29818880,
+      "server_rss_mb": 10780.7,
+      "wall_s": 63.64
+    }
+  ],
+  "summary": {
+    "max_concurrent_agents_clean": 8,
+    "per_session_kv_bytes": 29818880,
+    "per_session_kv_mb": 29.8189,
+    "node_kv_upper_bound_mb": 61069.07,
+    "node_kv_upper_bound_note": "capacity * per-session bounded KV \u2014 the whole-node resident-KV ceiling, independent of context length or agent churn.",
+    "server_peak_rss_mb": 11476.5
+  }
+}
\ No newline at end of file
diff --git a/results/research/k3_crosshost_rtt_gpu.json b/results/research/k3_crosshost_rtt_gpu.json
new file mode 100644
index 00000000..d184a70c
--- /dev/null
+++ b/results/research/k3_crosshost_rtt_gpu.json
@@ -0,0 +1,308 @@
+{
+  "kind": "k3_specdecode_gpu_bench",
+  "config": {
+    "verifier_id": "google/gemma-4-26B-A4B-it",
+    "drafter_id": "z-lab/gemma-4-26B-A4B-it-DFlash",
+    "f_theta_dir": "results/research/f_theta_v5_s5_sliding",
+    "haystack_lines": 160,
+    "n_samples": 2,
+    "max_new_tokens": 64,
+    "block_size": 16,
+    "sink": 4,
+    "window": 64,
+    "seed": 0,
+    "skip_unfused": true,
+    "rtt_sweep": "0,5,15,30,60,100,150",
+    "output": "results/research/k3_crosshost_rtt_gpu3.json"
+  },
+  "env": {
+    "gpu": "NVIDIA H200 NVL",
+    "torch": "2.12.0+cu130"
+  },
+  "prompt_tokens": {
+    "min": 3238,
+    "max": 3238
+  },
+  "ar_incremental": {
+    "decode_tokens_per_s_mean": 23.842,
+    "recall": 1.0
+  },
+  "restored_pertoken": {
+    "decode_tokens_per_s_mean": 24.331,
+    "recall": 1.0
+  },
+  "restored_specdecode": {
+    "skipped": true,
+    "decode_tokens_per_s_mean": null,
+    "mean_accept_len": 0.0,
+    "recall": 0.0,
+    "per_sample": [
+      {
+        "decode_tokens_per_s": null,
+        "mean_accept_len": 0.0,
+        "time_breakdown_s": {
+          "aux_clean_forward": 0.0,
+          "drafter": 0.0,
+          "incremental_verify": 0.0
+        },
+        "tokens": []
+      },
+      {
+        "decode_tokens_per_s": null,
+        "mean_accept_len": 0.0,
+        "time_breakdown_s": {
+          "aux_clean_forward": 0.0,
+          "drafter": 0.0,
+          "incremental_verify": 0.0
+        },
+        "tokens": []
+      }
+    ]
+  },
+  "restored_specdecode_fused": {
+    "decode_tokens_per_s_mean": 44.971,
+    "mean_accept_len": 3.49,
+    "time_breakdown_s_mean": {
+      "drafter_cached": 0.165,
+      "incremental_verify": 1.246,
+      "ctx_kv_extend": 0.025
+    },
+    "recall": 1.0,
+    "per_sample": [
+      {
+        "tokens": [
+          818,
+          6789,
+          3393,
+          563,
+          5213,
+          28487,
+          1618,
+          236772,
+          236832,
+          236828,
+          236819,
+          236771,
+          84750,
+          106,
+          106,
+          107,
+          45518,
+          107,
+          101,
+          818,
+          6789,
+          3393,
+          563,
+          5213,
+          28487,
+          1618,
+          236772,
+          236832,
+          236828,
+          236819,
+          236771,
+          84750,
+          106,
+          106,
+          107,
+          45518,
+          107,
+          101,
+          818,
+          6789,
+          3393,
+          563,
+          5213,
+          28487,
+          1618,
+          236772,
+          236832,
+          236828,
+          236819,
+          236771,
+          84750,
+          106,
+          106,
+          106,
+          45518,
+          107,
+          101,
+          818,
+          6789,
+          3393,
+          563,
+          5213,
+          28487,
+          1618
+        ],
+        "decode_s": 1.2984125055372715,
+        "prefill_s": 0.812,
+        "decode_tokens_per_s": 49.291,
+        "time_breakdown_s": {
+          "drafter_cached": 0.056,
+          "incremental_verify": 1.218,
+          "ctx_kv_extend": 0.024,
+          "network_rtt": 0.0
+        },
+        "blocks": 14,
+        "mean_accept_len": 3.64,
+        "decode_tokens": 64,
+        "block_rtt_ms": 0.0
+      },
+      {
+        "tokens": [
+          818,
+          6789,
+          3393,
+          563,
+          5213,
+          236777,
+          59790,
+          236772,
+          236828,
+          236819,
+          236825,
+          236770,
+          84750,
+          106,
+          106,
+          45518,
+          107,
+          101,
+          818,
+          6789,
+          3393,
+          563,
+          5213,
+          236777,
+          59790,
+          236772,
+          236828,
+          236819,
+          236825,
+          236770,
+          84750,
+          106,
+          106,
+          106,
+          107,
+          45518,
+          107,
+          101,
+          818,
+          6789,
+          3393,
+          563,
+          5213,
+          236777,
+          59790,
+          236772,
+          236828,
+          236819,
+          236825,
+          236770,
+          84750,
+          106,
+          106,
+          106,
+          107,
+          45518,
+          107,
+          101,
+          818,
+          6789,
+          3393,
+          563,
+          5213,
+          236777
+        ],
+        "decode_s": 1.5743505377322435,
+        "prefill_s": 0.808,
+        "decode_tokens_per_s": 40.652,
+        "time_breakdown_s": {
+          "drafter_cached": 0.274,
+          "incremental_verify": 1.274,
+          "ctx_kv_extend": 0.026,
+          "network_rtt": 0.0
+        },
+        "blocks": 15,
+        "mean_accept_len": 3.33,
+        "decode_tokens": 64,
+        "block_rtt_ms": 0.0
+      }
+    ],
+    "speedup_over_ar_x": 1.89
+  },
+  "crosshost_rtt_sweep": {
+    "ar_baseline_tps": 23.842,
+    "colocated_fused_tps": 44.971,
+    "sweep": [
+      {
+        "block_rtt_ms": 0.0,
+        "decode_tokens_per_s": 52.38,
+        "vs_ar_x": 2.197,
+        "blocks": 14,
+        "mean_accept_len": 3.64,
+        "network_s": 0.0,
+        "decode_s": 1.2218368761241436
+      },
+      {
+        "block_rtt_ms": 5.0,
+        "decode_tokens_per_s": 47.005,
+        "vs_ar_x": 1.972,
+        "blocks": 14,
+        "mean_accept_len": 3.64,
+        "network_s": 0.083,
+        "decode_s": 1.3615625966340303
+      },
+      {
+        "block_rtt_ms": 15.0,
+        "decode_tokens_per_s": 43.266,
+        "vs_ar_x": 1.815,
+        "blocks": 14,
+        "mean_accept_len": 3.64,
+        "network_s": 0.214,
+        "decode_s": 1.4792107436805964
+      },
+      {
+        "block_rtt_ms": 30.0,
+        "decode_tokens_per_s": 35.864,
+        "vs_ar_x": 1.504,
+        "blocks": 14,
+        "mean_accept_len": 3.64,
+        "network_s": 0.424,
+        "decode_s": 1.7845264580100775
+      },
+      {
+        "block_rtt_ms": 60.0,
+        "decode_tokens_per_s": 29.148,
+        "vs_ar_x": 1.223,
+        "blocks": 14,
+        "mean_accept_len": 3.64,
+        "network_s": 0.845,
+        "decode_s": 2.1957101207226515
+      },
+      {
+        "block_rtt_ms": 100.0,
+        "decode_tokens_per_s": 23.453,
+        "vs_ar_x": 0.984,
+        "blocks": 14,
+        "mean_accept_len": 3.64,
+        "network_s": 1.407,
+        "decode_s": 2.7288324255496264
+      },
+      {
+        "block_rtt_ms": 150.0,
+        "decode_tokens_per_s": 18.4,
+        "vs_ar_x": 0.772,
+        "blocks": 14,
+        "mean_accept_len": 3.64,
+        "network_s": 2.103,
+        "decode_s": 3.4781875386834145
+      }
+    ],
+    "max_rtt_ms_at_or_above_ar": 60.0,
+    "note": "one proposer<->verifier round-trip per block; cloud<->desk RTT is typically 30-150 ms. Quantifies why the cross-host token-level draft data plane is WAN-infeasible."
+  }
+}
\ No newline at end of file
diff --git a/scripts/research/grpc_agent_capacity_loadtest.py b/scripts/research/grpc_agent_capacity_loadtest.py
new file mode 100644
index 00000000..c8525055
--- /dev/null
+++ b/scripts/research/grpc_agent_capacity_loadtest.py
@@ -0,0 +1,336 @@
+"""Agent-connection capacity load test for the Kakeya gRPC RuntimeService.
+
+Test case 1 (cloud-agent ⇄ Mac mini via the Mac bridge): simulate N
+concurrent "agents" — each an independent gRPC channel + session — against a
+single running ``RuntimeService`` and find the maximum number of concurrent
+agent connections the node sustains, plus the bounded per-session KV residency
+and the resulting whole-node KV upper bound.
+
+What this measures (and what it does NOT):
+
+* Measures: concurrent **session/connection admission** scaling — how many
+  independent gRPC channels + open sessions the node holds at once, the
+  create/generate latency curve vs concurrency, server RSS growth, the
+  per-session bounded KV (``GetSessionInfo.kv_live_bytes``), and the admission
+  semantics at ``--capacity`` (LRU eviction / ``RESOURCE_EXHAUSTED``).
+* Does NOT measure parallel-inference throughput: v0.3 is single-tenant — one
+  shared verifier, RPC handlers serialized on one asyncio loop (per-session
+  verifier binding is deferred to v0.4 / PR-A3c). Concurrent ``Generate`` calls
+  therefore serialize; the latency curve reflects that and is reported as such.
+
+The server is launched as a subprocess (mirrors real deployment); clients are
+threads using the Python SDK. Self-contained: one process, one JSON report.
+"""
+
+from __future__ import annotations
+
+import argparse
+import json
+import socket
+import subprocess
+import sys
+import threading
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Optional
+
+
+def _raise_fd_limit(target: int = 100_000) -> Dict[str, Any]:
+    """Best-effort raise of RLIMIT_NOFILE so the parent (and the server
+    subprocess it spawns, which inherits the soft limit) can hold many
+    concurrent gRPC channels. Returns the before/after for the report."""
+    info: Dict[str, Any] = {"requested": target}
+    try:
+        import resource
+
+        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
+        info["before"] = [soft, hard]
+        new_soft = hard if hard not in (-1, resource.RLIM_INFINITY) else target
+        new_soft = min(new_soft, target) if new_soft > 0 else target
+        new_hard = hard
+        try:
+            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, new_hard))
+        except (ValueError, OSError):
+            # Some platforms forbid raising; fall back to the hard cap.
+            resource.setrlimit(resource.RLIMIT_NOFILE, (min(soft, hard), hard))
+        info["after"] = list(resource.getrlimit(resource.RLIMIT_NOFILE))
+    except Exception as exc:  # noqa: BLE001
+        info["error"] = str(exc)
+    return info
+
+
+def _free_port() -> int:
+    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    s.bind(("127.0.0.1", 0))
+    port = s.getsockname()[1]
+    s.close()
+    return port
+
+
+def _rss_mb(pid: int) -> Optional[float]:
+    """Resident set size in MB for a pid, via ``ps`` (Linux + macOS)."""
+    try:
+        out = subprocess.run(
+            ["ps", "-o", "rss=", "-p", str(pid)],
+            capture_output=True, text=True, timeout=10,
+        )
+        kb = int(out.stdout.strip().split()[0])
+        return round(kb / 1024.0, 1)
+    except Exception:
+        return None
+
+
+def _pctl(xs: List[float], q: float) -> Optional[float]:
+    if not xs:
+        return None
+    xs = sorted(xs)
+    k = max(0, min(len(xs) - 1, int(round(q * (len(xs) - 1)))))
+    return round(xs[k], 4)
+
+
+def _wait_ready(address: str, timeout_s: float) -> bool:
+    """Poll until a CreateSession+close round-trips (server is serving)."""
+    from kakeya import Client
+    from kakeya.errors import KakeyaError
+
+    deadline = time.time() + timeout_s
+    while time.time() < deadline:
+        try:
+            c = Client(address)
+            s = c.create_session(client_label="readyprobe")
+            s.close()
+            c.close()
+            return True
+        except (KakeyaError, Exception):  # noqa: BLE001 - readiness poll
+            time.sleep(2.0)
+    return False
+
+
+class _AgentResult:
+    __slots__ = ("created", "error", "create_s", "gen_s", "kv_bytes", "gen_tokens")
+
+    def __init__(self) -> None:
+        self.created = False
+        self.error: Optional[str] = None
+        self.create_s: Optional[float] = None
+        self.gen_s: Optional[float] = None
+        self.kv_bytes: Optional[int] = None
+        self.gen_tokens: int = 0
+
+
+def _run_level(
+    address: str, n: int, prompt_ids: List[int], gen_tokens: int,
+    seed: int,
+) -> List[_AgentResult]:
+    """Open ``n`` concurrent agents; hold all sessions open simultaneously,
+    then generate concurrently. Returns per-agent results."""
+    from kakeya import Client
+    from kakeya.errors import KakeyaError
+
+    results = [_AgentResult() for _ in range(n)]
+    created_barrier = threading.Barrier(n, timeout=max(60.0, n * 1.0))
+    clients: List[Any] = [None] * n
+    sessions: List[Any] = [None] * n
+
+    def worker(i: int) -> None:
+        r = results[i]
+        try:
+            t0 = time.time()
+            c = Client(address)
+            sess = c.create_session(client_label=f"agent-{i}")
+            # Prefill a per-agent context (chunked) so each session carries a
+            # realistic KV footprint up to the verifier's bounded window.
+            for off in range(0, len(prompt_ids), 256):
+                sess.append(prompt_ids[off:off + 256])
+            r.create_s = time.time() - t0
+            r.created = True
+            clients[i] = c
+            sessions[i] = sess
+        except KakeyaError as exc:
+            r.error = type(exc).__name__
+            return
+        except Exception as exc:  # noqa: BLE001
+            r.error = f"{type(exc).__name__}:{exc}"[:120]
+            return
+        # Hold until every agent in this level has its session open, so the
+        # node is genuinely holding N concurrent connections at the peak.
+        try:
+            created_barrier.wait()
+        except threading.BrokenBarrierError:
+            pass
+        try:
+            t1 = time.time()
+            toks = list(sess.generate(max_tokens=gen_tokens, seed=seed))
+            r.gen_s = time.time() - t1
+            r.gen_tokens = len(toks)
+            r.kv_bytes = sess.info().kv_live_bytes
+        except Exception as exc:  # noqa: BLE001
+            r.error = (r.error or "") + f"|gen:{type(exc).__name__}"
+
+    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
+    for t in threads:
+        t.start()
+    for t in threads:
+        t.join()
+
+    for i in range(n):
+        try:
+            if sessions[i] is not None:
+                sessions[i].close()
+            if clients[i] is not None:
+                clients[i].close()
+        except Exception:  # noqa: BLE001
+            pass
+    return results
+
+
+def main() -> int:
+    ap = argparse.ArgumentParser(description=__doc__)
+    ap.add_argument("--backend", default="cpu", choices=["cpu", "mlx"])
+    ap.add_argument("--verifier-id", default="Qwen/Qwen3-0.6B")
+    ap.add_argument("--capacity", type=int, default=512,
+                    help="server SessionStore + SlabPool size (the admission cap)")
+    ap.add_argument("--sink", type=int, default=4)
+    ap.add_argument("--window", type=int, default=64)
+    ap.add_argument("--levels", default="1,2,4,8,16,32,64,128,256",
+                    help="comma-separated concurrent-agent counts to ramp")
+    ap.add_argument("--gen-tokens", type=int, default=4)
+    ap.add_argument("--prompt-len", type=int, default=8)
+    ap.add_argument("--context-len", type=int, default=0,
+                    help="per-agent prefill length (tokens); overrides "
+                         "--prompt-len when >0. Fills each session's KV up to "
+                         "the bounded window to probe the memory ceiling.")
+    ap.add_argument("--seed", type=int, default=0)
+    ap.add_argument("--server-ready-timeout", type=float, default=600.0)
+    ap.add_argument("--output", default=None)
+    args = ap.parse_args()
+
+    fd_info = _raise_fd_limit()
+    print(f"[loadtest] RLIMIT_NOFILE: {fd_info}", flush=True)
+
+    port = _free_port()
+    address = f"127.0.0.1:{port}"
+    server_cmd = [
+        sys.executable, "scripts/start_grpc_runtime_server.py",
+        "--backend", args.backend,
+        "--verifier-id", args.verifier_id,
+        "--bind", address,
+        "--capacity", str(args.capacity),
+        "--sink", str(args.sink),
+        "--window", str(args.window),
+        "--skip-cache-check",
+        "--log-level", "WARNING",
+    ]
+    print(f"[loadtest] launching server: {' '.join(server_cmd)}", flush=True)
+    server = subprocess.Popen(server_cmd)
+    levels = [int(x) for x in args.levels.split(",") if x.strip()]
+    prefill_len = args.context_len if args.context_len > 0 else args.prompt_len
+    # Cycle small valid token ids to reach the requested prefill length.
+    prompt_ids = [1 + (j % 64) for j in range(prefill_len)]
+    rows: List[Dict[str, Any]] = []
+    peak_rss = 0.0
+    try:
+        if not _wait_ready(address, args.server_ready_timeout):
+            print("[loadtest] ERROR: server never became ready", flush=True)
+            return 2
+        print(f"[loadtest] server ready on {address} "
+              f"(backend={args.backend}, capacity={args.capacity})", flush=True)
+
+        for n in levels:
+            t0 = time.time()
+            res = _run_level(address, n, prompt_ids, args.gen_tokens, args.seed)
+            wall = time.time() - t0
+            rss = _rss_mb(server.pid)
+            if rss:
+                peak_rss = max(peak_rss, rss)
+            created = sum(1 for r in res if r.created)
+            gen_ok = sum(1 for r in res if r.gen_tokens > 0)
+            errs: Dict[str, int] = {}
+            for r in res:
+                if r.error:
+                    key = r.error.split("|")[0].split(":")[0]
+                    errs[key] = errs.get(key, 0) + 1
+            create_lat = [r.create_s for r in res if r.create_s is not None]
+            gen_lat = [r.gen_s for r in res if r.gen_s is not None]
+            kvs = [r.kv_bytes for r in res if r.kv_bytes is not None]
+            row = {
+                "agents": n,
+                "created_ok": created,
+                "generate_ok": gen_ok,
+                "errors": errs,
+                "create_latency_s": {"p50": _pctl(create_lat, 0.5),
+                                     "p95": _pctl(create_lat, 0.95)},
+                "generate_latency_s": {"p50": _pctl(gen_lat, 0.5),
+                                       "p95": _pctl(gen_lat, 0.95)},
+                "per_session_kv_bytes": (max(kvs) if kvs else None),
+                "server_rss_mb": rss,
+                "wall_s": round(wall, 2),
+            }
+            rows.append(row)
+            print(f"[loadtest] agents={n:5d} created={created}/{n} "
+                  f"gen_ok={gen_ok} errs={errs} "
+                  f"create_p95={row['create_latency_s']['p95']}s "
+                  f"gen_p95={row['generate_latency_s']['p95']}s "
+                  f"kv/sess={row['per_session_kv_bytes']}B rss={rss}MB", flush=True)
+            if created < n:
+                print(f"[loadtest] admission/resource ceiling hit at n={n} "
+                      f"(created {created}); stopping ramp.", flush=True)
+                break
+    finally:
+        server.terminate()
+        try:
+            server.wait(timeout=15)
+        except Exception:  # noqa: BLE001
+            server.kill()
+
+    full = [r for r in rows if r["created_ok"] == r["agents"] and not r["errors"]]
+    max_conc = max((r["agents"] for r in full), default=0)
+    per_sess_kv = next((r["per_session_kv_bytes"] for r in reversed(rows)
+                        if r["per_session_kv_bytes"]), None)
+    report = {
+        "kind": "grpc_agent_capacity_loadtest",
+        "schema_version": 1,
+        "config": {
+            "backend": args.backend,
+            "verifier_id": args.verifier_id,
+            "capacity": args.capacity,
+            "sink": args.sink,
+            "window": args.window,
+            "gen_tokens": args.gen_tokens,
+            "prompt_len": args.prompt_len,
+            "context_len": args.context_len,
+            "prefill_len": prefill_len,
+            "fd_limit": fd_info,
+            "levels": levels,
+            "single_tenant_note": (
+                "v0.3 single-tenant: shared verifier, RPCs serialized on one "
+                "asyncio loop; this measures connection/session admission "
+                "scaling, not parallel inference."),
+        },
+        "results": rows,
+        "summary": {
+            "max_concurrent_agents_clean": max_conc,
+            "per_session_kv_bytes": per_sess_kv,
+            "per_session_kv_mb": (round(per_sess_kv / 1e6, 4) if per_sess_kv else None),
+            "node_kv_upper_bound_mb": (
+                round(args.capacity * per_sess_kv / 1e6, 2) if per_sess_kv else None),
+            "node_kv_upper_bound_note": (
+                "capacity * per-session bounded KV — the whole-node resident-KV "
+                "ceiling, independent of context length or agent churn."),
+            "server_peak_rss_mb": peak_rss or None,
+        },
+    }
+    if args.output:
+        Path(args.output).parent.mkdir(parents=True, exist_ok=True)
+        Path(args.output).write_text(json.dumps(report, indent=2))
+        print(f"[loadtest] wrote {args.output}", flush=True)
+    s = report["summary"]
+    print(f"[loadtest] DONE max_concurrent_agents_clean={s['max_concurrent_agents_clean']} "
+          f"per_session_kv={s['per_session_kv_mb']}MB "
+          f"node_kv_upper_bound={s['node_kv_upper_bound_mb']}MB "
+          f"peak_rss={s['server_peak_rss_mb']}MB", flush=True)
+    return 0
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
diff --git a/scripts/research/k3_specdecode_gpu_bench.py b/scripts/research/k3_specdecode_gpu_bench.py
index bed07520..c1b28f6c 100644
--- a/scripts/research/k3_specdecode_gpu_bench.py
+++ b/scripts/research/k3_specdecode_gpu_bench.py
@@ -169,6 +169,7 @@ def restored_specdecode(
 def restored_specdecode_fused(
     adapter, drafter, verifier, aux_layer_ids, embed_fn, lm_head_fn,
     prompt, gen_tokens, block_size, device, eos_ids,
+    block_rtt_ms: float = 0.0,
 ) -> Dict[str, Any]:
     """FUSED spec-decode engine (A+B+C): per-block O(L).
 
@@ -195,12 +196,21 @@ def restored_specdecode_fused(
 
     generated: List[int] = []
     accepts: List[int] = []
-    t_draft = t_verify = t_extend = 0.0
+    t_draft = t_verify = t_extend = t_network = 0.0
+    rtt_s = max(0.0, block_rtt_ms) / 1000.0
     torch.cuda.synchronize(device)
     t0 = time.perf_counter()
     while len(generated) < gen_tokens:
         L = min(block_size, gen_tokens - len(generated))
         cstart = adapter._past_len           # committed length at block start
+        # Cross-host model: one proposer<->verifier round-trip per block
+        # (proposer ships the draft block to the verifier host, gets the
+        # accept/reject + bonus back). Injected as a per-block delay so the
+        # measured decode throughput reflects the WAN penalty on real compute.
+        if rtt_s:
+            tn = time.perf_counter()
+            time.sleep(rtt_s)
+            t_network += time.perf_counter() - tn
         bonus = int(adapter.next_token_logits.argmax().item())
         td = time.perf_counter()
         drafts = drafter.draft_block_cached(
@@ -252,10 +262,12 @@ def restored_specdecode_fused(
             "drafter_cached": round(t_draft, 3),
             "incremental_verify": round(t_verify, 3),
             "ctx_kv_extend": round(t_extend, 3),
+            "network_rtt": round(t_network, 3),
         },
         "blocks": len(accepts),
         "mean_accept_len": round(sum(accepts) / len(accepts), 2) if accepts else 0.0,
         "decode_tokens": len(generated),
+        "block_rtt_ms": block_rtt_ms,
     }
 
 
@@ -275,6 +287,11 @@ def main() -> int:
                     help="Skip the un-fused restored spec-decode baseline "
                          "(already characterized; removes GPU contention for a "
                          "clean fused-vs-AR steady-state measurement).")
+    ap.add_argument("--rtt-sweep", default=None,
+                    help="Comma-separated per-block RTT values (ms) to model a "
+                         "cross-host proposer<->verifier draft loop. When set, "
+                         "after the co-located run the fused path is re-timed on "
+                         "prompt[0] at each RTT — the WAN-penalty curve (Case 2).")
     ap.add_argument("--output", default=None)
     args = ap.parse_args()
 
@@ -461,6 +478,41 @@ def recall(tokens, ans):
     fu_tps = report["restored_specdecode_fused"]["decode_tokens_per_s_mean"]
     report["restored_specdecode_fused"]["speedup_over_ar_x"] = (
         round(fu_tps / ar_mean, 2) if ar_mean else None)
+
+    # --- Case 2: cross-host proposer<->verifier WAN-penalty curve ---
+    if args.rtt_sweep:
+        rtts = [float(x) for x in args.rtt_sweep.split(",") if x.strip()]
+        prompt0 = ids_list[0][0].tolist()
+        sweep = []
+        for rtt in rtts:
+            r = restored_specdecode_fused(
+                adapter, drafter, verifier, aux_layer_ids, embed_fn, lm_head_fn,
+                prompt0, args.max_new_tokens, args.block_size, device, eos_ids,
+                block_rtt_ms=rtt)
+            tps = r["decode_tokens_per_s"]
+            sweep.append({
+                "block_rtt_ms": rtt,
+                "decode_tokens_per_s": tps,
+                "vs_ar_x": round(tps / ar_mean, 3) if ar_mean else None,
+                "blocks": r["blocks"],
+                "mean_accept_len": r["mean_accept_len"],
+                "network_s": r["time_breakdown_s"]["network_rtt"],
+                "decode_s": r["decode_s"],
+            })
+            print(f"[sd][rtt] {rtt:6.1f} ms/block -> {tps} tok/s "
+                  f"({sweep[-1]['vs_ar_x']}x AR, blocks={r['blocks']})",
+                  file=sys.stderr, flush=True)
+        # break-even: highest RTT still >= AR (vs_ar_x >= 1.0)
+        over_ar = [s["block_rtt_ms"] for s in sweep if (s["vs_ar_x"] or 0) >= 1.0]
+        report["crosshost_rtt_sweep"] = {
+            "ar_baseline_tps": ar_mean,
+            "colocated_fused_tps": fu_tps,
+            "sweep": sweep,
+            "max_rtt_ms_at_or_above_ar": (max(over_ar) if over_ar else 0.0),
+            "note": ("one proposer<->verifier round-trip per block; cloud<->desk "
+                     "RTT is typically 30-150 ms. Quantifies why the cross-host "
+                     "token-level draft data plane is WAN-infeasible."),
+        }
     out_path = Path(args.output) if args.output else Path(
         f"results/research/k3_specdecode_gpu_bench_{int(time.time())}.json")
     out_path.parent.mkdir(parents=True, exist_ok=True)
diff --git a/tests/inference_engine/bridge/test_manifest.py b/tests/inference_engine/bridge/test_manifest.py
index ffe43c99..d2889be4 100644
--- a/tests/inference_engine/bridge/test_manifest.py
+++ b/tests/inference_engine/bridge/test_manifest.py
@@ -55,6 +55,8 @@ def _manifest(**overrides):
 
 def test_allowlist_contains_exactly_the_documented_presets():
     assert sorted(PRESETS) == [
+        "agent-capacity-loadtest",
+        "agent-capacity-stress",
         "integration-tests",
         "k3-beta-scorecard",
         "k3-drafter-parity",