42 changes: 42 additions & 0 deletions README.md
Expand Up @@ -230,6 +230,48 @@ engine lands at **≈ AR parity** — the memory + context wins are platform-ind
Reproduce with `scripts/research/k3_e2e_gpu_bench.py` + `k3_specdecode_gpu_bench.py`
(CUDA) and the `k3-beta-scorecard` / `k3-fused-allmlx-code-trim` Mac-bridge presets.

### Agent-connection capacity & cross-host topology ([ADR 0014](docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md))

**Agent connections (gRPC `RuntimeService`, Mac mini M4).** A connection load
test (`scripts/research/grpc_agent_capacity_loadtest.py`, preset
`agent-capacity-loadtest`) ramps N concurrent agents — an independent gRPC
channel + session each — against one runtime:

| | result |
| --- | --- |
| Max concurrent agents | **256 / 256, zero errors** (the configured capacity — a clean floor, not a failure point) |
| Per-session resident KV | **bounded** (sink+window; ~7.8 MB @ window 64, ~30 MB @ window 256) |
| Node KV upper bound | **capacity × per-session bound** (≈2.0 GB @ cap 256) — independent of context length / churn |
| Server RSS vs agents | **flat** (3825 → 3850 MB across 1 → 256) — adding agents costs ~0 memory |

Caveat: v0.3 is **single-tenant**: one shared verifier, with RPCs serialized on
one asyncio loop (per-session binding is a v0.4 / PR-A3c item). Create/generate
latency is therefore linear in N, and "256" is the maximum number of concurrent
connections *served*, not parallel inferences. Pushing further (preset
`agent-capacity-stress`, FD soft limit raised to 100k, hard limit unlimited on
the Mac) shows the true ceilings. **FD is not the limit.** **Memory** scales
with `capacity × window`: capacity 2048 @ window 256 gives ~11.5 GB RSS, and the
theoretical node bound of ~61 GB exceeds the 24 GB of RAM, so capacity must be
sized to RAM. With a per-agent **context** prefill, the binding constraint is
**single-tenant serialization**: concurrent heavy-prefill agents serialize and
time out well before any FD/connection limit. Bounded memory is structural:
light-session agent count does **not** grow RSS, and the memory lever is the
resident **window**, not the number of agents.

**Cross-host proposer/verifier.** A GPU proposer ⇄ Mac verifier *token-level
draft* data plane is **design-only** (no `CapabilityService` / `ProposeBlock` /
gossip) **and** ruled out by the WAN latency budget. That budget is now
**measured** on real H200 compute by injecting one proposer↔verifier round-trip
per block:

| per-block RTT | 0 (co-located) | 15 ms (LAN) | 30 ms | 60 ms | 100 ms | 150 ms |
| --- | --- | --- | --- | --- | --- | --- |
| vs AR | **2.20×** | 1.81× | 1.50× | 1.22× | **0.98×** (break-even) | 0.77× (loss) |

**Break-even ≈100 ms/block**: a cloud↔desk WAN (30–150 ms) straddles/exceeds it,
while a LAN (≤15 ms) keeps the 1.8–2.2× win. So the realizable split is **WAN =
control + tool plane** (the Mac bridge) and **LAN = co-located data plane**. See
[ADR 0014](docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md)
for the full plan, evidence, and the served-MLX-gemma gap found during testing.

## Kakeya Inference Engine for Mac — MLX speculative-decode port (K3 beta baseline)

After the **CUDA** beta (PR #107: f_θ + S5 K/V-restoration verifier, **fused DFlash
@@ -0,0 +1,226 @@
# ADR 0014 — Agent-connection capacity & cross-host proposer/verifier topology: test plan & results

- **Status**: Accepted (test record + topology decision)
- **Date**: 2026-06-14
- **Relates to**: [ADR 0008](0008-session-bound-runtime-and-grpc-protocol.md)
(session-bound gRPC runtime), [`docs/mac-bridge.md`](../mac-bridge.md) +
[`docs/design/mac-bridge-cloud-agent-access.md`](../design/mac-bridge-cloud-agent-access.md)
(the cloud-agent ⇄ Mac soft link).
- **Implementation**: `scripts/research/grpc_agent_capacity_loadtest.py`,
manifest preset `agent-capacity-loadtest`
(`inference_engine/bridge/manifest.py`).
- **Evidence**: `results/research/k3_agent_capacity_mac.json`.

> Note on numbering: `main`'s ADR index stops at 0008; the README references
> 0009/0012/0013, which were authored on branches that never merged to `main`
> (a known doc gap, out of scope here). This ADR uses 0014 to avoid collision.

## 1. Context

Two test cases were requested against the AR-verifier + dLLM-proposer
architecture, using the Mac bridge as the cloud-agent ⇄ Mac mini M4 link:

1. **Case 1 — agent connection capacity.** Simulate many agents connecting to
the Kakeya inference engine on the Mac mini and find the **maximum
concurrent agent connections**, plus the bounded KV residency.
2. **Case 2 — cross-host proposer/verifier.** Run the CUDA proposer on a GPU,
have it **discover** and submit **drafts** to the verifier on the Mac mini,
and measure **token throughput / max agent connections / Mac KV upper
bound** under that topology.

A code audit of `main` (`9d5e6b4` lineage) established which parts are runnable
and which are design-only, and shaped the test plan below.

## 2. Test environment

- **Mac mini M4** (24 GB unified memory), self-hosted Actions runner
`[self-hosted, macOS, ARM64, kakeya-mac-m4]`, reached via the **Mac bridge**
git-bus plane (no inbound path; allowlisted presets only).
- **Cloud agent**: Linux x86 VM (no Metal). Orchestrates via the bridge.
- **GPU**: H200 NVL (vast.ai) — runs the CUDA proposer + verifier for the
co-located reference (fused **2.06–2.20× AR**, recall 1.0) and the §4.3
cross-host WAN-penalty sweep.
- **Engine**: gRPC `RuntimeService` (`scripts/start_grpc_runtime_server.py`),
Python SDK clients (`sdks/python/kakeya`).

## 3. Case 1 — agent connection capacity (RUN, real evidence)

### 3.1 Implementation

`scripts/research/grpc_agent_capacity_loadtest.py` launches one
`RuntimeService` subprocess and ramps `N` concurrent **agents**, each with an
independent gRPC channel. Every agent creates a session, appends a short
prompt, holds the session open until all `N` are established (a true concurrent
peak), then generates and reads `GetSessionInfo.kv_live_bytes`. For each level
the script records created/generate success, create and generate latency
p50/p95, per-session bounded KV, and server RSS. It runs on the Mac via the
`agent-capacity-loadtest` bridge preset.
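
The ramp-and-hold pattern described above can be sketched roughly as follows. This is a minimal illustration, not the load-test script itself: `FakeSession`, `run_level`, and `ramp` are hypothetical names, and the session object is an in-memory stand-in rather than the `kakeya` SDK client.

```python
import asyncio


class FakeSession:
    """Hypothetical in-memory stand-in for an SDK session (not the kakeya API)."""

    def __init__(self, agent_id: int) -> None:
        self.agent_id = agent_id

    async def generate(self) -> str:
        await asyncio.sleep(0)  # yield to the loop, as a real RPC would
        return f"tok-{self.agent_id}"


async def run_level(n: int) -> dict:
    """Establish n sessions, hold them all open, then generate on each."""
    established = 0
    all_up = asyncio.Event()

    async def agent(i: int) -> str:
        nonlocal established
        session = FakeSession(i)       # "create session" step
        established += 1
        if established == n:           # last agent releases everyone
            all_up.set()
        await all_up.wait()            # hold until the concurrent peak is reached
        return await session.generate()

    results = await asyncio.gather(
        *(agent(i) for i in range(n)), return_exceptions=True
    )
    ok = sum(1 for r in results if not isinstance(r, Exception))
    return {"agents": n, "ok": ok, "peak_concurrent": established}


def ramp(levels: list[int]) -> list[dict]:
    """Run each concurrency level in turn, like the --levels flag."""
    return [asyncio.run(run_level(n)) for n in levels]
```

Holding every agent on a shared event before generating is what makes each level a true concurrent peak rather than a rolling average of short-lived sessions.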

Verifier: **cpu `Qwen/Qwen3-0.6B`** (the integration-gate model). Connection /
admission scaling is **model-independent**, so this isolates the connection
behavior; the served **MLX gemma** path is a separate v0.4 gap (§6).

### 3.2 Results (Mac mini M4, capacity=256, sink=4 window=64)

| agents | created | errors | create p95 (s) | gen p95 (s) | per-session KV | server RSS |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1/1 | — | 0.78 | 0.10 | 1.38 MB | 3825 MB |
| 16 | 16/16 | — | 1.33 | 1.66 | 7.80 MB | 3835 MB |
| 64 | 64/64 | — | 5.66 | 6.85 | 7.80 MB | 3840 MB |
| 128 | 128/128 | — | 11.27 | 13.64 | 7.80 MB | 3845 MB |
| **256** | **256/256** | **—** | 22.61 | 26.44 | 7.80 MB | 3850 MB |

- **Max concurrent agents: 256 / 256, zero errors.** 256 was the configured
`--capacity` and was sustained completely, so **256 is a lower bound on the
connection ceiling, not a failure point**. The true resource ceiling is
higher; this run did not probe past the configured capacity.
- **Per-session KV is bounded at 7.80 MB** (plateaus from 16 agents up): the
`sink+window` (68-token) ceiling holds regardless of agent count.
- **Node KV upper bound = capacity × per-session bound = 256 × 7.80 MB ≈
2.0 GB** — the whole-node resident-KV ceiling, independent of context length
or agent churn. This is the bounded-memory guarantee at the fleet level.
- **Server RSS is flat** (3825 → 3850 MB across 1 → 256 agents): adding agents
costs ~0 memory beyond the bounded slab; model weights dominate.
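
The bound arithmetic above can be checked with a short sketch. The per-token KV size is not stated in this ADR; here it is derived from the measured 7.80 MB plateau at `sink+window` = 68 tokens, which is an assumption of this example. The function names are illustrative.

```python
MB = 1_000_000  # decimal megabytes, matching the tables in this ADR

# Assumption: per-token KV bytes inferred from the measured plateau
# (7.80 MB resident at sink=4 + window=64 = 68 tokens).
BYTES_PER_TOKEN = 7.80 * MB / 68


def per_session_kv_mb(sink: int, window: int) -> float:
    """Resident KV per session under the sink+window policy."""
    return (sink + window) * BYTES_PER_TOKEN / MB


def node_kv_bound_gb(capacity: int, per_session_mb: float) -> float:
    """Whole-node resident-KV ceiling: capacity x per-session bound."""
    return capacity * per_session_mb / 1000


print(round(node_kv_bound_gb(256, 7.80), 1))   # 2.0
print(round(per_session_kv_mb(4, 256), 1))     # 29.8
```

The second line extrapolates the same per-token figure to window 256 and lands on the 29.8 MB per-session value reported for the stress run, which suggests the bound scales with `sink+window` as described.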

### 3.2b Stress beyond 256 — the real ceilings (preset `agent-capacity-stress`)

Pushing further with the FD limit raised (`RLIMIT_NOFILE` soft 100k, hard
unlimited on the Mac) and a **per-agent context prefill** (window 256,
`--context-len 256`, capacity 2048):

| agents | created | create p95 | per-session KV | server RSS |
| --- | --- | --- | --- | --- |
| 1 | 1/1 | 3.07 s | 29.8 MB | 11 477 MB |
| 8 | 8/8 | 25.2 s | 29.8 MB | 11 343 MB |
| 16 | 15/16 (1 `RpcCancelled`) | 44.6 s | 29.8 MB | 10 781 MB |

- **FD is not the ceiling** (raised to 100k; Mac hard limit is unlimited).
- **Memory** scales with `capacity × window`: capacity 2048 @ window 256 →
**~11.5 GB RSS**, and the theoretical node bound is **~61 GB > 24 GB RAM** —
so capacity must be **sized to RAM** (it is the memory knob, not agent count).
- The binding constraint with real per-agent context is **single-tenant
serialization**: create latency grows roughly linearly (3 → 12 → 25 → 45 s as
N = 1 → 4 → 8 → 16) because every session's prefill serializes through the one
shared verifier. Clean concurrency therefore tops out at **~8 heavy-context
agents** before RPCs time out (one `RpcCancelled` at N = 16), versus **256
light-session agents** (§3.2). Per-session KV stays bounded (29.8 MB @ window
256) throughout.
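
Sizing capacity to RAM can be sketched as the inverse of the node bound. The 12 000 MB KV budget below is a hypothetical figure chosen for illustration (roughly what might remain on a 24 GB machine after weights and headroom); it is not a measured or recommended value.

```python
import math


def max_capacity(kv_budget_mb: float, per_session_kv_mb: float) -> int:
    """Largest session capacity whose worst-case resident KV fits the budget."""
    return math.floor(kv_budget_mb / per_session_kv_mb)


RAM_MB = 24_000        # Mac mini M4 unified memory (from section 2)
KV_BUDGET_MB = 12_000  # hypothetical budget left for KV; an assumption

# The stress configuration overshoots physical RAM in the worst case:
worst_case_mb = 2048 * 29.8
print(worst_case_mb > RAM_MB)            # True

# Capacities that would fit the assumed budget at each window size:
print(max_capacity(KV_BUDGET_MB, 29.8))  # 402  (window 256)
print(max_capacity(KV_BUDGET_MB, 7.80))  # 1538 (window 64)
```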

### 3.3 Honest caveat — v0.3 is single-tenant

Create/generate latency scales **linearly** with `N` (256 agents → gen p95
26 s). That is the single-tenant signature: one shared verifier, RPC handlers
serialized on one asyncio loop (per-session verifier binding is deferred to
v0.4 / PR-A3c, see ADR 0008). So **256 = max concurrent connections admitted
and served**, *not* 256 parallel inferences. The capacity cap + LRU eviction
(`SessionStore`) + slab pool (`PoolExhausted → RESOURCE_EXHAUSTED`) are the
admission-control levers; `--max-concurrent-rpcs` caps in-flight handlers.
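
The linear latency signature follows from a simple FIFO model, sketched below under the assumption that every request costs the same service time and that requests are handled strictly one at a time. The 0.10 s service time is the single-agent gen p95 from §3.2; the model is an illustration, not the runtime's scheduler.

```python
import math


def serialized_p95(n_agents: int, service_s: float) -> float:
    """p95 latency when n requests arrive together and are served one by one.

    The k-th request in the queue finishes after k * service_s, so latency
    grows linearly with queue position. Uses the nearest-rank percentile.
    """
    latencies = [k * service_s for k in range(1, n_agents + 1)]
    rank = math.ceil(0.95 * n_agents)  # nearest-rank p95, 1-indexed
    return latencies[rank - 1]


for n in (1, 16, 64, 128, 256):
    print(n, round(serialized_p95(n, 0.10), 2))
```

For 256 agents the model gives about 24 s, in the same range as the measured 26.44 s, which is consistent with one shared verifier serving requests in sequence.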

## 4. Case 2 — cross-host proposer/verifier (FEASIBILITY VERDICT)

### 4.1 Verdict: the requested topology is not implementable today, and is architecturally bounded out

A code audit found the cross-host discovery + draft plane is **design-only**:

- **No `distributed.proto`, no `CapabilityService`/`ProposerService`, no
`ProposeBlock` RPC, no gossip/registry/TTL** — zero runnable cross-process
wiring (the ADR 0009 file is itself absent from `main`).
- The **only implemented cross-machine plane is the Mac-bridge git-bus**
(async, batch, allowlisted presets) — a **tool/control plane**, not a
token-level data plane.
- Speculative decoding (proposer + verifier) is implemented **in-process
only** (`kv_cache_proposer/speculative.py`, `inference_engine/v04/`).

Even if built, **per-block draft submission over WAN is ruled out by the
latency budget** (design doc §4.2): a Gemma-4-26B M4 verify of an 8-token block
is ~50–100 ms; a cloud↔desk RTT is 30–150 ms **per block**, i.e. 30–300 %
overhead that consumes any acceptance gain. **Proposer and verifier must share
a LAN** for the data plane.
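
The 30–300 % range is the ratio of per-block RTT to per-block verify time at the two extremes. A minimal sketch of that arithmetic, using only the figures quoted above:

```python
def rtt_overhead_pct(rtt_ms: float, verify_ms: float) -> float:
    """Added per-block latency as a percentage of the verify time."""
    return 100.0 * rtt_ms / verify_ms


# Best case: shortest RTT against the slowest verify.
print(rtt_overhead_pct(30, 100))   # 30.0
# Worst case: longest RTT against the fastest verify.
print(rtt_overhead_pct(150, 50))   # 300.0
```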

### 4.2 What the topology decomposes into (and the measurable proxies)

| Plane | Crosses WAN? | Status | Measured |
| --- | --- | --- | --- |
| Discovery / capability advertise | yes (seconds-scale) | bridge proxy only | bridge dispatch ~10 s + queue; one Mac, serialized (`concurrency: mac-bridge`) |
| Job/tool dispatch (eval/bench) | yes | implemented (bridge) | this ADR's Case-1 run is itself an instance |
| **Token-level draft (data plane)** | **no — must be LAN** | not implemented | **measured penalty curve §4.3** (break-even ~100 ms/block) |
| Co-located spec-decode (the feasible data plane) | n/a (same host) | implemented | **GPU H200 2.06–2.20× AR**; **Mac 0.93× AR** (PR #118) |

So the answers to Case 2's three metrics, under the **realizable** topology:

- **Token throughput**: spec-decode is a *co-located* win — **2.06–2.20× AR on
the GPU** (recall 1.0) and **≈AR parity (0.93×) on the Mac**. The measured
WAN-penalty curve (§4.3) shows the cross-host draft loop falls to **break-even
at ~100 ms/block and a net loss at 150 ms**, i.e. slower than running AR
locally — so it is not a throughput strategy.
- **Max agent connections**: governed by the *serving* node (Case 1): **256+
concurrent agents** on the Mac via `RuntimeService`.
- **Mac KV upper bound**: bounded — **capacity × per-session `sink+window`**
(≈2.0 GB at capacity 256 for Qwen3-0.6B; for the gemma S5 production config
the per-agent resident KV is ~133 MB at 5.8k ctx, dominated by the 5 exact
full-attention layers — see the README beta scorecard).

### 4.3 Measured WAN-penalty curve (H200, real models, `--rtt-sweep`)

Rather than rest on the latency *estimate*, we measured it: the fused engine was
re-timed with one injected proposer↔verifier round-trip **per block** on the real
Gemma-4-26B verifier + DFlash drafter (H200 NVL), sweeping per-block RTT across
the cloud↔desk range (`scripts/research/k3_specdecode_gpu_bench.py --rtt-sweep`,
`results/research/k3_crosshost_rtt_gpu.json`):

| per-block RTT | decode tok/s | vs AR | regime |
| --- | --- | --- | --- |
| 0 ms (co-located) | 52.4 | **2.20×** | the win |
| 5 ms (LAN) | 47.0 | 1.97× | LAN keeps it |
| 15 ms | 43.3 | 1.81× | LAN keeps it |
| 30 ms | 35.9 | 1.50× | WAN edge |
| 60 ms | 29.2 | 1.22× | shrinking |
| 100 ms | 23.5 | **0.98×** | **break-even** |
| 150 ms | 18.4 | 0.77× | net **loss** |

AR baseline = 23.8 tok/s. **Break-even is ~100 ms/block**: beyond it, cross-host
spec-decode is *slower than running AR locally*. A cloud↔desk WAN (30–150 ms RTT)
straddles or exceeds break-even, while a LAN/Thunderbolt link (≤15 ms) preserves
the 1.8–2.2× win. This is the architecture's prediction (design doc §4.2),
**now quantified on real compute** — and it is why the data plane must be LAN.
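
The break-even point can be estimated from the table by interpolating between the two rows that bracket the AR baseline. Linear interpolation is an assumption of this sketch; the ADR itself only reports the swept points.

```python
AR_TOK_S = 23.8  # AR baseline from the paragraph above

# (per-block RTT in ms, decode tok/s), copied from the table above
RTT_SWEEP = [
    (0, 52.4), (5, 47.0), (15, 43.3), (30, 35.9),
    (60, 29.2), (100, 23.5), (150, 18.4),
]


def break_even_rtt_ms(sweep, ar_tok_s):
    """RTT at which spec-decode throughput drops to the AR baseline.

    Linearly interpolates between the two adjacent sweep points that
    straddle the baseline; returns None if the sweep never crosses it.
    """
    for (r0, t0), (r1, t1) in zip(sweep, sweep[1:]):
        if t0 >= ar_tok_s > t1:
            return r0 + (r1 - r0) * (t0 - ar_tok_s) / (t0 - t1)
    return None


print(round(break_even_rtt_ms(RTT_SWEEP, AR_TOK_S)))  # 98
```

The interpolated crossing sits just under 100 ms, in line with the ~100 ms/block break-even stated above.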

## 5. Decision

1. **Case 1 is validated**: the session-bound gRPC runtime admits and serves
**≥256 concurrent agent connections** on an M4 with **flat memory** and a
**bounded ~2.0 GB node KV ceiling**, with the documented single-tenant
latency-serialization caveat.
2. **Case 2's WAN data plane is rejected** as a throughput strategy and remains
unbuilt: cross-host token-level draft must not cross the cloud↔desk
boundary. The correct topology is **WAN = control + tool plane (bridge),
LAN = co-located data plane (spec-decode)**. This is the same conclusion as
the Mac-bridge design doc §4, now backed by the audit, the co-located
throughput evidence, and the measured WAN-penalty curve (§4.3).

## 6. Consequences & follow-ups

- **Served MLX gemma gap (found during this test)**: `MLXSinkWindowVerifier`
reads a flat `cfg.num_hidden_layers`, but the gemma-4 MLX model nests its
config → `AttributeError` when starting `--backend mlx` with the gemma
verifier. The served gRPC path is wired for the torch/HF verifier (and
Qwen3-MLX), not gemma-4 MLX. Tracked as a v0.4 item (alongside per-session
binding); Case 1 therefore used the cpu verifier.
- **Multi-tenant (PR-A3c)**: per-session verifier binding would lift the
serialization caveat and turn "256 connections" into "256 *concurrent
inferences*"; until then, capacity sizing should reflect serialized service.
- **M3 (fleet capability plane)**: if/when built, placement must treat
`ring_address`/RTT class as a hard constraint so data-plane (draft) pairings
never span WAN — a one-line filter in the placement candidate set.
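
The placement filter mentioned in the last item could look roughly like the sketch below. Everything here is hypothetical, since no capability plane exists yet: the `Node` type, the `rtt_class` labels, and the function name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical placement candidate; not an existing engine type."""
    name: str
    rtt_class: str  # assumed labels: "local", "lan", "wan"


# RTT classes allowed to carry token-level draft traffic (assumption).
DATA_PLANE_OK = frozenset({"local", "lan"})


def draft_pair_candidates(nodes: list[Node]) -> list[Node]:
    """Drop any candidate whose link to the verifier would cross a WAN."""
    return [n for n in nodes if n.rtt_class in DATA_PLANE_OK]


candidates = [
    Node("gpu-same-rack", "lan"),
    Node("gpu-cloud", "wan"),
    Node("mac-local", "local"),
]
print([n.name for n in draft_pair_candidates(candidates)])
# ['gpu-same-rack', 'mac-local']
```

Treating the RTT class as a hard filter rather than a scoring term keeps a WAN pairing from ever being selected, even when it is the only candidate.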

## 7. Alternatives considered

- **Build the cross-host gRPC draft plane now and benchmark it.** Rejected: it
is a large unimplemented feature (proto + services + discovery) whose result
is already known to be *worse than co-located* by the latency budget — it
would confirm a negative at high cost.
- **Run Case 1 against the MLX gemma verifier.** Blocked by the served-MLX gap
(§6); connection scaling is model-independent, so cpu Qwen3-0.6B gives the
same admission/capacity answer with the production KV bound reported
analytically.
- **Hold live cloud→Mac gRPC sessions for Case 1.** Impossible: the Mac has no
inbound path (the reason the bridge exists). The load test runs co-located on
the Mac, dispatched via the bridge.
1 change: 1 addition & 0 deletions docs/adr/README.md
Expand Up @@ -40,6 +40,7 @@ reader what was *not* chosen.
| 0006 | [Project positioning as local agent infrastructure](0006-local-agent-infrastructure-positioning.md) | Accepted |
| 0007 | [Cross-request KV cache reuse for long sessions](0007-cross-request-kv-reuse.md) | Superseded by 0008 |
| 0008 | [Session-bound runtime + gRPC protocol](0008-session-bound-runtime-and-grpc-protocol.md) | Accepted |
| 0014 | [Agent-connection capacity & cross-host proposer/verifier topology: test plan & results](0014-agent-connection-capacity-and-cross-host-topology-tests.md) | Accepted |

Note: ADR numbering is monotonically increasing; in-flight or
planned numbers (0005) appear in the index so readers can
50 changes: 50 additions & 0 deletions inference_engine/bridge/manifest.py
Expand Up @@ -131,6 +131,56 @@ def _harness_preset(
        ),
        timeout_minutes=60,
    ),
    Preset(
        name="agent-capacity-loadtest",
        description="Test case 1: ramp concurrent agent connections "
                    "(independent gRPC channel + session each) against a "
                    "single RuntimeService; report max concurrent agents, "
                    "per-session bounded KV, node KV upper bound, latency "
                    "curve, server RSS. Uses the cpu Qwen3-0.6B verifier "
                    "(the integration-gate model; connection/admission "
                    "scaling is model-independent — the served MLX gemma "
                    "path is a separate v0.4 item).",
        command_templates=(
            (
                "python3", "scripts/research/grpc_agent_capacity_loadtest.py",
                "--backend", "cpu",
                "--verifier-id", "Qwen/Qwen3-0.6B",
                "--capacity", "256",
                "--sink", "4", "--window", "64",
                "--levels", "1,2,4,8,16,32,64,128,256",
                "--gen-tokens", "4",
                "--output",
                "results/research/k3_mac_bridge_agent_capacity.json",
            ),
        ),
        timeout_minutes=90,
        validate_reports=False,
    ),
    Preset(
        name="agent-capacity-stress",
        description="Test case 1 (stress): push concurrent agents to 2048 "
                    "with a per-agent prefilled context (window 256), "
                    "raised FD limit, to probe the true connection ceiling "
                    "and the bounded-memory behavior (RSS vs agents) on the "
                    "Mac. cpu Qwen3-0.6B verifier.",
        command_templates=(
            (
                "python3", "scripts/research/grpc_agent_capacity_loadtest.py",
                "--backend", "cpu",
                "--verifier-id", "Qwen/Qwen3-0.6B",
                "--capacity", "2048",
                "--sink", "4", "--window", "256",
                "--context-len", "256",
                "--levels", "1,4,8,16,32,48,64,96",
                "--gen-tokens", "1",
                "--output",
                "results/research/k3_mac_bridge_agent_capacity_stress.json",
            ),
        ),
        timeout_minutes=120,
        validate_reports=False,
    ),
    _harness_preset(
        "k3-step1-incremental",
        "PR #109 Step-1 evidence: incremental restored decode.",