diff --git a/README.md b/README.md index 94be72dd..22c45229 100644 --- a/README.md +++ b/README.md @@ -230,6 +230,48 @@ engine lands at **≈ AR parity** — the memory + context wins are platform-ind Reproduce with `scripts/research/k3_e2e_gpu_bench.py` + `k3_specdecode_gpu_bench.py` (CUDA) and the `k3-beta-scorecard` / `k3-fused-allmlx-code-trim` Mac-bridge presets. +### Agent-connection capacity & cross-host topology ([ADR 0014](docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md)) + +**Agent connections (gRPC `RuntimeService`, Mac mini M4).** A connection load +test (`scripts/research/grpc_agent_capacity_loadtest.py`, preset +`agent-capacity-loadtest`) ramps N concurrent agents — an independent gRPC +channel + session each — against one runtime: + +| | result | +| --- | --- | +| Max concurrent agents | **256 / 256, zero errors** (the configured capacity — a clean floor, not a failure point) | +| Per-session resident KV | **bounded** (sink+window; ~7.8 MB @ window 64, ~30 MB @ window 256) | +| Node KV upper bound | **capacity × per-session bound** (≈2.0 GB @ cap 256) — independent of context length / churn | +| Server RSS vs agents | **flat** (3825 → 3850 MB across 1 → 256) — adding agents costs ~0 memory | + +Caveat: v0.3 is **single-tenant** (one shared verifier, RPCs serialized on one +asyncio loop — per-session binding is a v0.4 / PR-A3c item), so create/generate +latency is linear in N and "256" is the max concurrent connections *served*, not +parallel inferences. Pushing further (preset `agent-capacity-stress`, FD raised +to 100k / hard unlimited on the Mac) shows the true ceilings: **FD is not the +limit**; **memory** scales with `capacity × window` (capacity 2048 @ window 256 +→ ~11 GB RSS, theoretical node bound ~61 GB > 24 GB RAM, so capacity must be +sized to RAM); and with a per-agent **context** prefill the binding constraint +is **single-tenant serialization** (concurrent heavy-prefill agents serialize +and time out well before any FD/connection limit). Bounded memory is structural: +light-session agent count does **not** grow RSS; the memory lever is the +resident **window**, not the number of agents. + +**Cross-host proposer/verifier.** A GPU proposer ⇄ Mac verifier *token-level +draft* data plane is **design-only** (no `CapabilityService` / `ProposeBlock` / +gossip) **and** ruled out by the WAN latency budget — now **measured** on real +H200 compute by injecting one proposer↔verifier round-trip per block: + +| per-block RTT | 0 (co-located) | 15 ms (LAN) | 30 ms | 60 ms | 100 ms | 150 ms | +| --- | --- | --- | --- | --- | --- | --- | +| vs AR | **2.20×** | 1.81× | 1.50× | 1.22× | **0.98×** (break-even) | 0.77× (loss) | + +**Break-even ≈100 ms/block**: a cloud↔desk WAN (30–150 ms) straddles/exceeds it, +while a LAN (≤15 ms) keeps the 1.8–2.2× win. So the realizable split is **WAN = +control + tool plane** (the Mac bridge) and **LAN = co-located data plane**. See +[ADR 0014](docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md) +for the full plan, evidence, and the served-MLX-gemma gap found during testing. + ## Kakeya Inference Engine for Mac — MLX speculative-decode port (K3 beta baseline) After the **CUDA** beta (PR #107: f_θ + S5 K/V-restoration verifier, **fused DFlash diff --git a/docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md b/docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md new file mode 100644 index 00000000..5c6a1cc2 --- /dev/null +++ b/docs/adr/0014-agent-connection-capacity-and-cross-host-topology-tests.md @@ -0,0 +1,226 @@ +# ADR 0014 — Agent-connection capacity & cross-host proposer/verifier topology: test plan & results + +- **Status**: Accepted (test record + topology decision) +- **Date**: 2026-06-14 +- **Relates to**: [ADR 0008](0008-session-bound-runtime-and-grpc-protocol.md) + (session-bound gRPC runtime), [`docs/mac-bridge.md`](../mac-bridge.md) + + [`docs/design/mac-bridge-cloud-agent-access.md`](../design/mac-bridge-cloud-agent-access.md) + (the cloud-agent ⇄ Mac soft link). +- **Implementation**: `scripts/research/grpc_agent_capacity_loadtest.py`, + manifest preset `agent-capacity-loadtest` + (`inference_engine/bridge/manifest.py`). +- **Evidence**: `results/research/k3_agent_capacity_mac.json`. + +> Note on numbering: `main`'s ADR index stops at 0008; the README references +> 0009/0012/0013, which were authored on branches that never merged to `main` +> (a known doc gap, out of scope here). This ADR uses 0014 to avoid collision. + +## 1. Context + +Two test cases were requested against the AR-verifier + dLLM-proposer +architecture, using the Mac bridge as the cloud-agent ⇄ Mac mini M4 link: + +1. **Case 1 — agent connection capacity.** Simulate many agents connecting to + the Kakeya inference engine on the Mac mini and find the **maximum + concurrent agent connections**, plus the bounded KV residency. +2. **Case 2 — cross-host proposer/verifier.** Run the CUDA proposer on a GPU, + have it **discover** and submit **drafts** to the verifier on the Mac mini, + and measure **token throughput / max agent connections / Mac KV upper + bound** under that topology. + +Ground truth from a code audit of `main` (`9d5e6b4` lineage) determined what is +runnable vs design-only and shaped the test plan below. + +## 2. Test environment + +- **Mac mini M4** (24 GB unified memory), self-hosted Actions runner + `[self-hosted, macOS, ARM64, kakeya-mac-m4]`, reached via the **Mac bridge** + git-bus plane (no inbound path; allowlisted presets only). +- **Cloud agent**: Linux x86 VM (no Metal). Orchestrates via the bridge. +- **GPU**: H200 NVL (vast.ai) — runs the CUDA proposer + verifier for the + co-located reference (fused **2.06–2.20× AR**, recall 1.0) and the §4.3 + cross-host WAN-penalty sweep. +- **Engine**: gRPC `RuntimeService` (`scripts/start_grpc_runtime_server.py`), + Python SDK clients (`sdks/python/kakeya`). + +## 3. Case 1 — agent connection capacity (RUN, real evidence) + +### 3.1 Implementation + +`scripts/research/grpc_agent_capacity_loadtest.py` launches one +`RuntimeService` subprocess and ramps `N` concurrent **agents**, each an +independent gRPC channel + session that creates a session, appends a short +prompt, holds the session open while all `N` are established (true concurrent +peak), then generates and reads `GetSessionInfo.kv_live_bytes`. It records, per +level: created/generate success, create & generate latency p50/p95, +per-session bounded KV, and server RSS. Run on the Mac via the +`agent-capacity-loadtest` bridge preset. + +Verifier: **cpu `Qwen/Qwen3-0.6B`** (the integration-gate model). Connection / +admission scaling is **model-independent**, so this isolates the connection +behavior; the served **MLX gemma** path is a separate v0.4 gap (§6). + +### 3.2 Results (Mac mini M4, capacity=256, sink=4 window=64) + +| agents | created | errors | create p95 (s) | gen p95 (s) | per-session KV | server RSS | +| --- | --- | --- | --- | --- | --- | --- | +| 1 | 1/1 | — | 0.78 | 0.10 | 1.38 MB | 3825 MB | +| 16 | 16/16 | — | 1.33 | 1.66 | 7.80 MB | 3835 MB | +| 64 | 64/64 | — | 5.66 | 6.85 | 7.80 MB | 3840 MB | +| 128 | 128/128 | — | 11.27 | 13.64 | 7.80 MB | 3845 MB | +| **256** | **256/256** | **—** | 22.61 | 26.44 | 7.80 MB | 3850 MB | + +- **Max concurrent agents: 256 / 256, zero errors.** 256 was the configured + `--capacity` and was sustained completely — i.e. **256 is a clean floor on + the connection ceiling, not a failure point**. The true resource ceiling is + higher (not probed past the configured capacity). +- **Per-session KV is bounded at 7.80 MB** (plateaus from 16 agents up): the + `sink+window` (68-token) ceiling holds regardless of agent count. +- **Node KV upper bound = capacity × per-session bound = 256 × 7.80 MB ≈ + 2.0 GB** — the whole-node resident-KV ceiling, independent of context length + or agent churn. This is the bounded-memory guarantee at the fleet level. +- **Server RSS is flat** (3825 → 3850 MB across 1 → 256 agents): adding agents + costs ~0 memory beyond the bounded slab; model weights dominate. + +### 3.2b Stress beyond 256 — the real ceilings (preset `agent-capacity-stress`) + +Pushing further with the FD limit raised (`RLIMIT_NOFILE` soft 100k, hard +unlimited on the Mac) and a **per-agent context prefill** (window 256, +`--context-len 256`, capacity 2048): + +| agents | created | create p95 | per-session KV | server RSS | +| --- | --- | --- | --- | --- | +| 1 | 1/1 | 3.07 s | 29.8 MB | 11 477 MB | +| 8 | 8/8 | 25.2 s | 29.8 MB | 11 343 MB | +| 16 | 15/16 (1 `RpcCancelled`) | 44.6 s | 29.8 MB | 10 781 MB | + +- **FD is not the ceiling** (raised to 100k; Mac hard limit is unlimited). +- **Memory** scales with `capacity × window`: capacity 2048 @ window 256 → + **~11.5 GB RSS**, and the theoretical node bound is **~61 GB > 24 GB RAM** — + so capacity must be **sized to RAM** (it is the memory knob, not agent count). +- The binding constraint with real per-agent context is **single-tenant + serialization**: create latency is purely linear (3 → 12 → 25 → 45 s as + N = 1 → 4 → 8 → 16) because every session's prefill serializes through the one + shared verifier, so clean concurrency tops out at **~8 heavy-context agents** + before RPCs time out — versus **256 light-session agents** (§3.2). Per-session + KV stays bounded (29.8 MB @ window 256) throughout. + +### 3.3 Honest caveat — v0.3 is single-tenant + +Create/generate latency scales **linearly** with `N` (256 agents → gen p95 +26 s). That is the single-tenant signature: one shared verifier, RPC handlers +serialized on one asyncio loop (per-session verifier binding is deferred to +v0.4 / PR-A3c, see ADR 0008). So **256 = max concurrent connections admitted +and served**, *not* 256 parallel inferences. The capacity cap + LRU eviction +(`SessionStore`) + slab pool (`PoolExhausted → RESOURCE_EXHAUSTED`) are the +admission-control levers; `--max-concurrent-rpcs` caps in-flight handlers. + +## 4. Case 2 — cross-host proposer/verifier (FEASIBILITY VERDICT) + +### 4.1 Verdict: the requested topology is not implementable today, and is architecturally bounded out + +A code audit found the cross-host discovery + draft plane is **design-only**: + +- **No `distributed.proto`, no `CapabilityService`/`ProposerService`, no + `ProposeBlock` RPC, no gossip/registry/TTL** — zero runnable cross-process + wiring (the ADR 0009 file is itself absent from `main`). +- The **only implemented cross-machine plane is the Mac-bridge git-bus** + (async, batch, allowlisted presets) — a **tool/control plane**, not a + token-level data plane. +- Speculative decoding (proposer + verifier) is implemented **in-process + only** (`kv_cache_proposer/speculative.py`, `inference_engine/v04/`). + +Even if built, **per-block draft submission over WAN is ruled out by the +latency budget** (design doc §4.2): a Gemma-4-26B M4 verify of an 8-token block +is ~50–100 ms; a cloud↔desk RTT is 30–150 ms **per block**, i.e. 30–300 % +overhead that consumes any acceptance gain. **Proposer and verifier must share +a LAN** for the data plane. + +### 4.2 What the topology decomposes into (and the measurable proxies) + +| Plane | Crosses WAN? | Status | Measured | +| --- | --- | --- | --- | +| Discovery / capability advertise | yes (seconds-scale) | bridge proxy only | bridge dispatch ~10 s + queue; one Mac, serialized (`concurrency: mac-bridge`) | +| Job/tool dispatch (eval/bench) | yes | implemented (bridge) | this ADR's Case-1 run is itself an instance | +| **Token-level draft (data plane)** | **no — must be LAN** | not implemented | **measured penalty curve §4.3** (break-even ~100 ms/block) | +| Co-located spec-decode (the feasible data plane) | n/a (same host) | implemented | **GPU H200 2.06–2.20× AR**; **Mac 0.93× AR** (PR #118) | + +So the answers to Case 2's three metrics, under the **realizable** topology: + +- **Token throughput**: spec-decode is a *co-located* win — **2.06–2.20× AR on + the GPU** (recall 1.0) and **≈AR parity (0.93×) on the Mac**. The measured + WAN-penalty curve (§4.3) shows the cross-host draft loop falls to **break-even + at ~100 ms/block and a net loss at 150 ms**, i.e. slower than running AR + locally — so it is not a throughput strategy. +- **Max agent connections**: governed by the *serving* node (Case 1): **256+ + concurrent agents** on the Mac via `RuntimeService`. +- **Mac KV upper bound**: bounded — **capacity × per-session `sink+window`** + (≈2.0 GB at capacity 256 for Qwen3-0.6B; for the gemma S5 production config + the per-agent resident KV is ~133 MB at 5.8k ctx, dominated by the 5 exact + full-attention layers — see the README beta scorecard). + +### 4.3 Measured WAN-penalty curve (H200, real models, `--rtt-sweep`) + +Rather than rest on the latency *estimate*, we measured it: the fused engine was +re-timed with one injected proposer↔verifier round-trip **per block** on the real +Gemma-4-26B verifier + DFlash drafter (H200 NVL), sweeping per-block RTT across +the cloud↔desk range (`scripts/research/k3_specdecode_gpu_bench.py --rtt-sweep`, +`results/research/k3_crosshost_rtt_gpu.json`): + +| per-block RTT | decode tok/s | vs AR | regime | +| --- | --- | --- | --- | +| 0 ms (co-located) | 52.4 | **2.20×** | the win | +| 5 ms (LAN) | 47.0 | 1.97× | LAN keeps it | +| 15 ms | 43.3 | 1.81× | LAN keeps it | +| 30 ms | 35.9 | 1.50× | WAN edge | +| 60 ms | 29.2 | 1.22× | shrinking | +| 100 ms | 23.5 | **0.98×** | **break-even** | +| 150 ms | 18.4 | 0.77× | net **loss** | + +AR baseline = 23.8 tok/s. **Break-even is ~100 ms/block**: beyond it, cross-host +spec-decode is *slower than running AR locally*. A cloud↔desk WAN (30–150 ms RTT) +straddles or exceeds break-even, while a LAN/Thunderbolt link (≤15 ms) preserves +the 1.8–2.2× win. This is the architecture's prediction (design doc §4.2), +**now quantified on real compute** — and it is why the data plane must be LAN. + +## 5. Decision + +1. **Case 1 is validated**: the session-bound gRPC runtime admits and serves + **≥256 concurrent agent connections** on an M4 with **flat memory** and a + **bounded ~2.0 GB node KV ceiling**, with the documented single-tenant + latency-serialization caveat. +2. **Case 2's WAN data plane is rejected** as a throughput strategy and is + unbuilt: cross-host token-level draft must not cross the cloud↔desk + boundary. The correct topology is **WAN = control + tool plane (bridge), + LAN = co-located data plane (spec-decode)** — the same conclusion as the + Mac-bridge design doc §4, now backed by the audit and the co-located + throughput evidence. + +## 6. Consequences & follow-ups + +- **Served MLX gemma gap (found during this test)**: `MLXSinkWindowVerifier` + reads a flat `cfg.num_hidden_layers`, but the gemma-4 MLX model nests its + config → `AttributeError` when starting `--backend mlx` with the gemma + verifier. The served gRPC path is wired for the torch/HF verifier (and + Qwen3-MLX), not gemma-4 MLX. Tracked as a v0.4 item (alongside per-session + binding); Case 1 therefore used the cpu verifier. +- **Multi-tenant (PR-A3c)**: per-session verifier binding would lift the + serialization caveat and turn "256 connections" into "256 *concurrent + inferences*"; until then, capacity sizing should reflect serialized service. +- **M3 (fleet capability plane)**: if/when built, placement must treat + `ring_address`/RTT class as a hard constraint so data-plane (draft) pairings + never span WAN — a one-line filter in the placement candidate set. + +## 7. Alternatives considered + +- **Build the cross-host gRPC draft plane now and benchmark it.** Rejected: it + is a large unimplemented feature (proto + services + discovery) whose result + is already known to be *worse than co-located* by the latency budget — it + would confirm a negative at high cost. +- **Run Case 1 against the MLX gemma verifier.** Blocked by the served-MLX gap + (§6); connection scaling is model-independent, so cpu Qwen3-0.6B gives the + same admission/capacity answer with the production KV bound reported + analytically. +- **Hold live cloud→Mac gRPC sessions for Case 1.** Impossible: the Mac has no + inbound path (the reason the bridge exists). The load test runs co-located on + the Mac, dispatched via the bridge. diff --git a/docs/adr/README.md b/docs/adr/README.md index 60ffdd83..53254167 100644 --- a/docs/adr/README.md +++ b/docs/adr/README.md @@ -40,6 +40,7 @@ reader what was *not* chosen. | 0006 | [Project positioning as local agent infrastructure](0006-local-agent-infrastructure-positioning.md) | Accepted | | 0007 | [Cross-request KV cache reuse for long sessions](0007-cross-request-kv-reuse.md) | Superseded by 0008 | | 0008 | [Session-bound runtime + gRPC protocol](0008-session-bound-runtime-and-grpc-protocol.md) | Accepted | +| 0014 | [Agent-connection capacity & cross-host proposer/verifier topology: test plan & results](0014-agent-connection-capacity-and-cross-host-topology-tests.md) | Accepted | Note: ADR numbering is monotonically increasing; in-flight or planned numbers (0005) appear in the index so readers can diff --git a/inference_engine/bridge/manifest.py b/inference_engine/bridge/manifest.py index 64b6a3bb..9a5887cf 100644 --- a/inference_engine/bridge/manifest.py +++ b/inference_engine/bridge/manifest.py @@ -131,6 +131,56 @@ def _harness_preset( ), timeout_minutes=60, ), + Preset( + name="agent-capacity-loadtest", + description="Test case 1: ramp concurrent agent connections " + "(independent gRPC channel + session each) against a " + "single RuntimeService; report max concurrent agents, " + "per-session bounded KV, node KV upper bound, latency " + "curve, server RSS. Uses the cpu Qwen3-0.6B verifier " + "(the integration-gate model; connection/admission " + "scaling is model-independent — the served MLX gemma " + "path is a separate v0.4 item).", + command_templates=( + ( + "python3", "scripts/research/grpc_agent_capacity_loadtest.py", + "--backend", "cpu", + "--verifier-id", "Qwen/Qwen3-0.6B", + "--capacity", "256", + "--sink", "4", "--window", "64", + "--levels", "1,2,4,8,16,32,64,128,256", + "--gen-tokens", "4", + "--output", + "results/research/k3_mac_bridge_agent_capacity.json", + ), + ), + timeout_minutes=90, + validate_reports=False, + ), + Preset( + name="agent-capacity-stress", + description="Test case 1 (stress): push concurrent agents to 2048 " + "with a per-agent prefilled context (window 256), " + "raised FD limit, to probe the true connection ceiling " + "and the bounded-memory behavior (RSS vs agents) on the " + "Mac. cpu Qwen3-0.6B verifier.", + command_templates=( + ( + "python3", "scripts/research/grpc_agent_capacity_loadtest.py", + "--backend", "cpu", + "--verifier-id", "Qwen/Qwen3-0.6B", + "--capacity", "2048", + "--sink", "4", "--window", "256", + "--context-len", "256", + "--levels", "1,4,8,16,32,48,64,96", + "--gen-tokens", "1", + "--output", + "results/research/k3_mac_bridge_agent_capacity_stress.json", + ), + ), + timeout_minutes=120, + validate_reports=False, + ), _harness_preset( "k3-step1-incremental", "PR #109 Step-1 evidence: incremental restored decode.", diff --git a/results/research/k3_agent_capacity_mac.json b/results/research/k3_agent_capacity_mac.json new file mode 100644 index 00000000..f848757b --- /dev/null +++ b/results/research/k3_agent_capacity_mac.json @@ -0,0 +1,188 @@ +{ + "kind": "grpc_agent_capacity_loadtest", + "schema_version": 1, + "config": { + "backend": "cpu", + "verifier_id": "Qwen/Qwen3-0.6B", + "capacity": 256, + "sink": 4, + "window": 64, + "gen_tokens": 4, + "prompt_len": 8, + "levels": [ + 1, + 2, + 4, + 8, + 16, + 32, + 64, + 128, + 256 + ], + "single_tenant_note": "v0.3 single-tenant: shared verifier, RPCs serialized on one asyncio loop; this measures connection/session admission scaling, not parallel inference." + }, + "results": [ + { + "agents": 1, + "created_ok": 1, + "generate_ok": 1, + "errors": {}, + "create_latency_s": { + "p50": 0.7768, + "p95": 0.7768 + }, + "generate_latency_s": { + "p50": 0.097, + "p95": 0.097 + }, + "per_session_kv_bytes": 1376256, + "server_rss_mb": 3825.0, + "wall_s": 0.87 + }, + { + "agents": 2, + "created_ok": 2, + "generate_ok": 2, + "errors": {}, + "create_latency_s": { + "p50": 0.0791, + "p95": 0.157 + }, + "generate_latency_s": { + "p50": 0.1981, + "p95": 0.1982 + }, + "per_session_kv_bytes": 1835008, + "server_rss_mb": 3828.4, + "wall_s": 0.36 + }, + { + "agents": 4, + "created_ok": 4, + "generate_ok": 4, + "errors": {}, + "create_latency_s": { + "p50": 0.2667, + "p95": 0.3537 + }, + "generate_latency_s": { + "p50": 0.3802, + "p95": 0.4059 + }, + "per_session_kv_bytes": 2752512, + "server_rss_mb": 3829.0, + "wall_s": 0.76 + }, + { + "agents": 8, + "created_ok": 8, + "generate_ok": 8, + "errors": {}, + "create_latency_s": { + "p50": 0.4326, + "p95": 0.7292 + }, + "generate_latency_s": { + "p50": 0.7434, + "p95": 0.8227 + }, + "per_session_kv_bytes": 4587520, + "server_rss_mb": 3830.5, + "wall_s": 1.56 + }, + { + "agents": 16, + "created_ok": 16, + "generate_ok": 16, + "errors": {}, + "create_latency_s": { + "p50": 0.8042, + "p95": 1.3275 + }, + "generate_latency_s": { + "p50": 1.1834, + "p95": 1.664 + }, + "per_session_kv_bytes": 7798784, + "server_rss_mb": 3834.8, + "wall_s": 3.09 + }, + { + "agents": 32, + "created_ok": 32, + "generate_ok": 32, + "errors": {}, + "create_latency_s": { + "p50": 1.5013, + "p95": 2.6569 + }, + "generate_latency_s": { + "p50": 2.0461, + "p95": 3.3688 + }, + "per_session_kv_bytes": 7798784, + "server_rss_mb": 3836.0, + "wall_s": 6.25 + }, + { + "agents": 64, + "created_ok": 64, + "generate_ok": 64, + "errors": {}, + "create_latency_s": { + "p50": 3.053, + "p95": 5.6604 + }, + "generate_latency_s": { + "p50": 3.7894, + "p95": 6.8495 + }, + "per_session_kv_bytes": 7798784, + "server_rss_mb": 3839.5, + "wall_s": 12.89 + }, + { + "agents": 128, + "created_ok": 128, + "generate_ok": 128, + "errors": {}, + "create_latency_s": { + "p50": 6.0018, + "p95": 11.2716 + }, + "generate_latency_s": { + "p50": 7.3445, + "p95": 13.6421 + }, + "per_session_kv_bytes": 7798784, + "server_rss_mb": 3845.4, + "wall_s": 25.9 + }, + { + "agents": 256, + "created_ok": 256, + "generate_ok": 256, + "errors": {}, + "create_latency_s": { + "p50": 12.0809, + "p95": 22.6077 + }, + "generate_latency_s": { + "p50": 14.1342, + "p95": 26.4388 + }, + "per_session_kv_bytes": 7798784, + "server_rss_mb": 3849.5, + "wall_s": 51.47 + } + ], + "summary": { + "max_concurrent_agents_clean": 256, + "per_session_kv_bytes": 7798784, + "per_session_kv_mb": 7.7988, + "node_kv_upper_bound_mb": 1996.49, + "node_kv_upper_bound_note": "capacity * per-session bounded KV \u2014 the whole-node resident-KV ceiling, independent of context length or agent churn.", + "server_peak_rss_mb": 3849.5 + } +} \ No newline at end of file diff --git a/results/research/k3_agent_capacity_stress_mac.json b/results/research/k3_agent_capacity_stress_mac.json new file mode 100644 index 00000000..d8aa896e --- /dev/null +++ b/results/research/k3_agent_capacity_stress_mac.json @@ -0,0 +1,117 @@ +{ + "kind": "grpc_agent_capacity_loadtest", + "schema_version": 1, + "config": { + "backend": "cpu", + "verifier_id": "Qwen/Qwen3-0.6B", + "capacity": 2048, + "sink": 4, + "window": 256, + "gen_tokens": 1, + "prompt_len": 8, + "context_len": 256, + "prefill_len": 256, + "fd_limit": { + "requested": 100000, + "before": [ + 10240, + 9223372036854775807 + ], + "after": [ + 100000, + 9223372036854775807 + ] + }, + "levels": [ + 1, + 4, + 8, + 16, + 32, + 48, + 64, + 96 + ], + "single_tenant_note": "v0.3 single-tenant: shared verifier, RPCs serialized on one asyncio loop; this measures connection/session admission scaling, not parallel inference." + }, + "results": [ + { + "agents": 1, + "created_ok": 1, + "generate_ok": 1, + "errors": {}, + "create_latency_s": { + "p50": 3.0748, + "p95": 3.0748 + }, + "generate_latency_s": { + "p50": 0.0296, + "p95": 0.0296 + }, + "per_session_kv_bytes": 29474816, + "server_rss_mb": 11476.5, + "wall_s": 3.11 + }, + { + "agents": 4, + "created_ok": 4, + "generate_ok": 4, + "errors": {}, + "create_latency_s": { + "p50": 9.255, + "p95": 12.3516 + }, + "generate_latency_s": { + "p50": 0.0867, + "p95": 0.1155 + }, + "per_session_kv_bytes": 29818880, + "server_rss_mb": 11474.8, + "wall_s": 12.47 + }, + { + "agents": 8, + "created_ok": 8, + "generate_ok": 8, + "errors": {}, + "create_latency_s": { + "p50": 15.7695, + "p95": 25.2241 + }, + "generate_latency_s": { + "p50": 0.1739, + "p95": 0.2352 + }, + "per_session_kv_bytes": 29818880, + "server_rss_mb": 11342.6, + "wall_s": 25.46 + }, + { + "agents": 16, + "created_ok": 15, + "generate_ok": 15, + "errors": { + "RpcCancelledError": 1 + }, + "create_latency_s": { + "p50": 25.6965, + "p95": 44.5888 + }, + "generate_latency_s": { + "p50": 0.2797, + "p95": 0.4459 + }, + "per_session_kv_bytes": 29818880, + "server_rss_mb": 10780.7, + "wall_s": 63.64 + } + ], + "summary": { + "max_concurrent_agents_clean": 8, + "per_session_kv_bytes": 29818880, + "per_session_kv_mb": 29.8189, + "node_kv_upper_bound_mb": 61069.07, + "node_kv_upper_bound_note": "capacity * per-session bounded KV \u2014 the whole-node resident-KV ceiling, independent of context length or agent churn.", + "server_peak_rss_mb": 11476.5 + } +} \ No newline at end of file diff --git a/results/research/k3_crosshost_rtt_gpu.json b/results/research/k3_crosshost_rtt_gpu.json new file mode 100644 index 00000000..d184a70c --- /dev/null +++ b/results/research/k3_crosshost_rtt_gpu.json @@ -0,0 +1,308 @@ +{ + "kind": "k3_specdecode_gpu_bench", + "config": { + "verifier_id": "google/gemma-4-26B-A4B-it", + "drafter_id": "z-lab/gemma-4-26B-A4B-it-DFlash", + "f_theta_dir": "results/research/f_theta_v5_s5_sliding", + "haystack_lines": 160, + "n_samples": 2, + "max_new_tokens": 64, + "block_size": 16, + "sink": 4, + "window": 64, + "seed": 0, + "skip_unfused": true, + "rtt_sweep": "0,5,15,30,60,100,150", + "output": "results/research/k3_crosshost_rtt_gpu3.json" + }, + "env": { + "gpu": "NVIDIA H200 NVL", + "torch": "2.12.0+cu130" + }, + "prompt_tokens": { + "min": 3238, + "max": 3238 + }, + "ar_incremental": { + "decode_tokens_per_s_mean": 23.842, + "recall": 1.0 + }, + "restored_pertoken": { + "decode_tokens_per_s_mean": 24.331, + "recall": 1.0 + }, + "restored_specdecode": { + "skipped": true, + "decode_tokens_per_s_mean": null, + "mean_accept_len": 0.0, + "recall": 0.0, + "per_sample": [ + { + "decode_tokens_per_s": null, + "mean_accept_len": 0.0, + "time_breakdown_s": { + "aux_clean_forward": 0.0, + "drafter": 0.0, + "incremental_verify": 0.0 + }, + "tokens": [] + }, + { + "decode_tokens_per_s": null, + "mean_accept_len": 0.0, + "time_breakdown_s": { + "aux_clean_forward": 0.0, + "drafter": 0.0, + "incremental_verify": 0.0 + }, + "tokens": [] + } + ] + }, + "restored_specdecode_fused": { + "decode_tokens_per_s_mean": 44.971, + "mean_accept_len": 3.49, + "time_breakdown_s_mean": { + "drafter_cached": 0.165, + "incremental_verify": 1.246, + "ctx_kv_extend": 0.025 + }, + "recall": 1.0, + "per_sample": [ + { + "tokens": [ + 818, + 6789, + 3393, + 563, + 5213, + 28487, + 1618, + 236772, + 236832, + 236828, + 236819, + 236771, + 84750, + 106, + 106, + 107, + 45518, + 107, + 101, + 818, + 6789, + 3393, + 563, + 5213, + 28487, + 1618, + 236772, + 236832, + 236828, + 236819, + 236771, + 84750, + 106, + 106, + 107, + 45518, + 107, + 101, + 818, + 6789, + 3393, + 563, + 5213, + 28487, + 1618, + 236772, + 236832, + 236828, + 236819, + 236771, + 84750, + 106, + 106, + 106, + 45518, + 107, + 101, + 818, + 6789, + 3393, + 563, + 5213, + 28487, + 1618 + ], + "decode_s": 1.2984125055372715, + "prefill_s": 0.812, + "decode_tokens_per_s": 49.291, + "time_breakdown_s": { + "drafter_cached": 0.056, + "incremental_verify": 1.218, + "ctx_kv_extend": 0.024, + "network_rtt": 0.0 + }, + "blocks": 14, + "mean_accept_len": 3.64, + "decode_tokens": 64, + "block_rtt_ms": 0.0 + }, + { + "tokens": [ + 818, + 6789, + 3393, + 563, + 5213, + 236777, + 59790, + 236772, + 236828, + 236819, + 236825, + 236770, + 84750, + 106, + 106, + 45518, + 107, + 101, + 818, + 6789, + 3393, + 563, + 5213, + 236777, + 59790, + 236772, + 236828, + 236819, + 236825, + 236770, + 84750, + 106, + 106, + 106, + 107, + 45518, + 107, + 101, + 818, + 6789, + 3393, + 563, + 5213, + 236777, + 59790, + 236772, + 236828, + 236819, + 236825, + 236770, + 84750, + 106, + 106, + 106, + 107, + 45518, + 107, + 101, + 818, + 6789, + 3393, + 563, + 5213, + 236777 + ], + "decode_s": 1.5743505377322435, + "prefill_s": 0.808, + "decode_tokens_per_s": 40.652, + "time_breakdown_s": { + "drafter_cached": 0.274, + "incremental_verify": 1.274, + "ctx_kv_extend": 0.026, + "network_rtt": 0.0 + }, + "blocks": 15, + "mean_accept_len": 3.33, + "decode_tokens": 64, + "block_rtt_ms": 0.0 + } + ], + "speedup_over_ar_x": 1.89 + }, + "crosshost_rtt_sweep": { + "ar_baseline_tps": 23.842, + "colocated_fused_tps": 44.971, + "sweep": [ + { + "block_rtt_ms": 0.0, + "decode_tokens_per_s": 52.38, + "vs_ar_x": 2.197, + "blocks": 14, + "mean_accept_len": 3.64, + "network_s": 0.0, + "decode_s": 1.2218368761241436 + }, + { + "block_rtt_ms": 5.0, + "decode_tokens_per_s": 47.005, + "vs_ar_x": 1.972, + "blocks": 14, + "mean_accept_len": 3.64, + "network_s": 0.083, + "decode_s": 1.3615625966340303 + }, + { + "block_rtt_ms": 15.0, + "decode_tokens_per_s": 43.266, + "vs_ar_x": 1.815, + "blocks": 14, + "mean_accept_len": 3.64, + "network_s": 0.214, + "decode_s": 1.4792107436805964 + }, + { + "block_rtt_ms": 30.0, + "decode_tokens_per_s": 35.864, + "vs_ar_x": 1.504, + "blocks": 14, + "mean_accept_len": 3.64, + "network_s": 0.424, + "decode_s": 1.7845264580100775 + }, + { + "block_rtt_ms": 60.0, + "decode_tokens_per_s": 29.148, + "vs_ar_x": 1.223, + "blocks": 14, + "mean_accept_len": 3.64, + "network_s": 0.845, + "decode_s": 2.1957101207226515 + }, + { + "block_rtt_ms": 100.0, + "decode_tokens_per_s": 23.453, + "vs_ar_x": 0.984, + "blocks": 14, + "mean_accept_len": 3.64, + "network_s": 1.407, + "decode_s": 2.7288324255496264 + }, + { + "block_rtt_ms": 150.0, + "decode_tokens_per_s": 18.4, + "vs_ar_x": 0.772, + "blocks": 14, + "mean_accept_len": 3.64, + "network_s": 2.103, + "decode_s": 3.4781875386834145 + } + ], + "max_rtt_ms_at_or_above_ar": 60.0, + "note": "one proposer<->verifier round-trip per block; cloud<->desk RTT is typically 30-150 ms. Quantifies why the cross-host token-level draft data plane is WAN-infeasible." + } +} \ No newline at end of file diff --git a/scripts/research/grpc_agent_capacity_loadtest.py b/scripts/research/grpc_agent_capacity_loadtest.py new file mode 100644 index 00000000..c8525055 --- /dev/null +++ b/scripts/research/grpc_agent_capacity_loadtest.py @@ -0,0 +1,336 @@ +"""Agent-connection capacity load test for the Kakeya gRPC RuntimeService. + +Test case 1 (cloud-agent ⇄ Mac mini via the Mac bridge): simulate N +concurrent "agents" — each an independent gRPC channel + session — against a +single running ``RuntimeService`` and find the maximum number of concurrent +agent connections the node sustains, plus the bounded per-session KV residency +and the resulting whole-node KV upper bound. + +What this measures (and what it does NOT): + +* Measures: concurrent **session/connection admission** scaling — how many + independent gRPC channels + open sessions the node holds at once, the + create/generate latency curve vs concurrency, server RSS growth, the + per-session bounded KV (``GetSessionInfo.kv_live_bytes``), and the admission + semantics at ``--capacity`` (LRU eviction / ``RESOURCE_EXHAUSTED``). +* Does NOT measure parallel-inference throughput: v0.3 is single-tenant — one + shared verifier, RPC handlers serialized on one asyncio loop (per-session + verifier binding is deferred to v0.4 / PR-A3c). Concurrent ``Generate`` calls + therefore serialize; the latency curve reflects that and is reported as such. + +The server is launched as a subprocess (mirrors real deployment); clients are +threads using the Python SDK. Self-contained: one process, one JSON report. +""" + +from __future__ import annotations + +import argparse +import json +import socket +import subprocess +import sys +import threading +import time +from pathlib import Path +from typing import Any, Dict, List, Optional + + +def _raise_fd_limit(target: int = 100_000) -> Dict[str, Any]: + """Best-effort raise of RLIMIT_NOFILE so the parent (and the server + subprocess it spawns, which inherits the soft limit) can hold many + concurrent gRPC channels. Returns the before/after for the report.""" + info: Dict[str, Any] = {"requested": target} + try: + import resource + + soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE) + info["before"] = [soft, hard] + new_soft = hard if hard not in (-1, resource.RLIM_INFINITY) else target + new_soft = min(new_soft, target) if new_soft > 0 else target + new_hard = hard + try: + resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, new_hard)) + except (ValueError, OSError): + # Some platforms forbid raising; fall back to the hard cap. + resource.setrlimit(resource.RLIMIT_NOFILE, (min(soft, hard), hard)) + info["after"] = list(resource.getrlimit(resource.RLIMIT_NOFILE)) + except Exception as exc: # noqa: BLE001 + info["error"] = str(exc) + return info + + +def _free_port() -> int: + s = socket.socket(socket.AF_INET, socket.SOCK_STREAM) + s.bind(("127.0.0.1", 0)) + port = s.getsockname()[1] + s.close() + return port + + +def _rss_mb(pid: int) -> Optional[float]: + """Resident set size in MB for a pid, via ``ps`` (Linux + macOS).""" + try: + out = subprocess.run( + ["ps", "-o", "rss=", "-p", str(pid)], + capture_output=True, text=True, timeout=10, + ) + kb = int(out.stdout.strip().split()[0]) + return round(kb / 1024.0, 1) + except Exception: + return None + + +def _pctl(xs: List[float], q: float) -> Optional[float]: + if not xs: + return None + xs = sorted(xs) + k = max(0, min(len(xs) - 1, int(round(q * (len(xs) - 1))))) + return round(xs[k], 4) + + +def _wait_ready(address: str, timeout_s: float) -> bool: + """Poll until a CreateSession+close round-trips (server is serving).""" + from kakeya import Client + from kakeya.errors import KakeyaError + + deadline = time.time() + timeout_s + while time.time() < deadline: + try: + c = Client(address) + s = c.create_session(client_label="readyprobe") + s.close() + c.close() + return True + except (KakeyaError, Exception): # noqa: BLE001 - readiness poll + time.sleep(2.0) + return False + + +class _AgentResult: + __slots__ = ("created", "error", "create_s", "gen_s", "kv_bytes", "gen_tokens") + + def __init__(self) -> None: + self.created = False + self.error: Optional[str] = None + self.create_s: Optional[float] = None + self.gen_s: Optional[float] = None + self.kv_bytes: Optional[int] = None + self.gen_tokens: int = 0 + + +def _run_level( + address: str, n: int, prompt_ids: List[int], gen_tokens: int, + seed: int, +) -> List[_AgentResult]: + """Open ``n`` concurrent agents; hold all sessions open simultaneously, + then generate concurrently. Returns per-agent results.""" + from kakeya import Client + from kakeya.errors import KakeyaError + + results = [_AgentResult() for _ in range(n)] + created_barrier = threading.Barrier(n, timeout=max(60.0, n * 1.0)) + clients: List[Any] = [None] * n + sessions: List[Any] = [None] * n + + def worker(i: int) -> None: + r = results[i] + try: + t0 = time.time() + c = Client(address) + sess = c.create_session(client_label=f"agent-{i}") + # Prefill a per-agent context (chunked) so each session carries a + # realistic KV footprint up to the verifier's bounded window. + for off in range(0, len(prompt_ids), 256): + sess.append(prompt_ids[off:off + 256]) + r.create_s = time.time() - t0 + r.created = True + clients[i] = c + sessions[i] = sess + except KakeyaError as exc: + r.error = type(exc).__name__ + return + except Exception as exc: # noqa: BLE001 + r.error = f"{type(exc).__name__}:{exc}"[:120] + return + # Hold until every agent in this level has its session open, so the + # node is genuinely holding N concurrent connections at the peak. + try: + created_barrier.wait() + except threading.BrokenBarrierError: + pass + try: + t1 = time.time() + toks = list(sess.generate(max_tokens=gen_tokens, seed=seed)) + r.gen_s = time.time() - t1 + r.gen_tokens = len(toks) + r.kv_bytes = sess.info().kv_live_bytes + except Exception as exc: # noqa: BLE001 + r.error = (r.error or "") + f"|gen:{type(exc).__name__}" + + threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)] + for t in threads: + t.start() + for t in threads: + t.join() + + for i in range(n): + try: + if sessions[i] is not None: + sessions[i].close() + if clients[i] is not None: + clients[i].close() + except Exception: # noqa: BLE001 + pass + return results + + +def main() -> int: + ap = argparse.ArgumentParser(description=__doc__) + ap.add_argument("--backend", default="cpu", choices=["cpu", "mlx"]) + ap.add_argument("--verifier-id", default="Qwen/Qwen3-0.6B") + ap.add_argument("--capacity", type=int, default=512, + help="server SessionStore + SlabPool size (the admission cap)") + ap.add_argument("--sink", type=int, default=4) + ap.add_argument("--window", type=int, default=64) + ap.add_argument("--levels", default="1,2,4,8,16,32,64,128,256", + help="comma-separated concurrent-agent counts to ramp") + ap.add_argument("--gen-tokens", type=int, default=4) + ap.add_argument("--prompt-len", type=int, default=8) + ap.add_argument("--context-len", type=int, default=0, + help="per-agent prefill length (tokens); overrides " + "--prompt-len when >0. Fills each session's KV up to " + "the bounded window to probe the memory ceiling.") + ap.add_argument("--seed", type=int, default=0) + ap.add_argument("--server-ready-timeout", type=float, default=600.0) + ap.add_argument("--output", default=None) + args = ap.parse_args() + + fd_info = _raise_fd_limit() + print(f"[loadtest] RLIMIT_NOFILE: {fd_info}", flush=True) + + port = _free_port() + address = f"127.0.0.1:{port}" + server_cmd = [ + sys.executable, "scripts/start_grpc_runtime_server.py", + "--backend", args.backend, + "--verifier-id", args.verifier_id, + "--bind", address, + "--capacity", str(args.capacity), + "--sink", str(args.sink), + "--window", str(args.window), + "--skip-cache-check", + "--log-level", "WARNING", + ] + print(f"[loadtest] launching server: {' '.join(server_cmd)}", flush=True) + server = subprocess.Popen(server_cmd) + levels = [int(x) for x in args.levels.split(",") if x.strip()] + prefill_len = args.context_len if args.context_len > 0 else args.prompt_len + # Cycle small valid token ids to reach the requested prefill length. + prompt_ids = [1 + (j % 64) for j in range(prefill_len)] + rows: List[Dict[str, Any]] = [] + peak_rss = 0.0 + try: + if not _wait_ready(address, args.server_ready_timeout): + print("[loadtest] ERROR: server never became ready", flush=True) + return 2 + print(f"[loadtest] server ready on {address} " + f"(backend={args.backend}, capacity={args.capacity})", flush=True) + + for n in levels: + t0 = time.time() + res = _run_level(address, n, prompt_ids, args.gen_tokens, args.seed) + wall = time.time() - t0 + rss = _rss_mb(server.pid) + if rss: + peak_rss = max(peak_rss, rss) + created = sum(1 for r in res if r.created) + gen_ok = sum(1 for r in res if r.gen_tokens > 0) + errs: Dict[str, int] = {} + for r in res: + if r.error: + key = r.error.split("|")[0].split(":")[0] + errs[key] = errs.get(key, 0) + 1 + create_lat = [r.create_s for r in res if r.create_s is not None] + gen_lat = [r.gen_s for r in res if r.gen_s is not None] + kvs = [r.kv_bytes for r in res if r.kv_bytes is not None] + row = { + "agents": n, + "created_ok": created, + "generate_ok": gen_ok, + "errors": errs, + "create_latency_s": {"p50": _pctl(create_lat, 0.5), + "p95": _pctl(create_lat, 0.95)}, + "generate_latency_s": {"p50": _pctl(gen_lat, 0.5), + "p95": _pctl(gen_lat, 0.95)}, + "per_session_kv_bytes": (max(kvs) if kvs else None), + "server_rss_mb": rss, + "wall_s": round(wall, 2), + } + rows.append(row) + print(f"[loadtest] agents={n:5d} created={created}/{n} " + f"gen_ok={gen_ok} errs={errs} " + f"create_p95={row['create_latency_s']['p95']}s " + f"gen_p95={row['generate_latency_s']['p95']}s " + f"kv/sess={row['per_session_kv_bytes']}B rss={rss}MB", flush=True) + if created < n: + print(f"[loadtest] admission/resource ceiling hit at n={n} " + f"(created {created}); stopping ramp.", flush=True) + break + finally: + server.terminate() + try: + server.wait(timeout=15) + except Exception: # noqa: BLE001 + server.kill() + + full = [r for r in rows if r["created_ok"] == r["agents"] and not r["errors"]] + max_conc = max((r["agents"] for r in full), default=0) + per_sess_kv = next((r["per_session_kv_bytes"] for r in reversed(rows) + if r["per_session_kv_bytes"]), None) + report = { + "kind": "grpc_agent_capacity_loadtest", + "schema_version": 1, + "config": { + "backend": args.backend, + "verifier_id": args.verifier_id, + "capacity": args.capacity, + "sink": args.sink, + "window": args.window, + "gen_tokens": args.gen_tokens, + "prompt_len": args.prompt_len, + "context_len": args.context_len, + "prefill_len": prefill_len, + "fd_limit": fd_info, + "levels": levels, + "single_tenant_note": ( + "v0.3 single-tenant: shared verifier, RPCs serialized on one " + "asyncio loop; this measures connection/session admission " + "scaling, not parallel inference."), + }, + "results": rows, + "summary": { + "max_concurrent_agents_clean": max_conc, + "per_session_kv_bytes": per_sess_kv, + "per_session_kv_mb": (round(per_sess_kv / 1e6, 4) if per_sess_kv else None), + "node_kv_upper_bound_mb": ( + round(args.capacity * per_sess_kv / 1e6, 2) if per_sess_kv else None), + "node_kv_upper_bound_note": ( + "capacity * per-session bounded KV — the whole-node resident-KV " + "ceiling, independent of context length or agent churn."), + "server_peak_rss_mb": peak_rss or None, + }, + } + if args.output: + Path(args.output).parent.mkdir(parents=True, exist_ok=True) + Path(args.output).write_text(json.dumps(report, indent=2)) + print(f"[loadtest] wrote {args.output}", flush=True) + s = report["summary"] + print(f"[loadtest] DONE max_concurrent_agents_clean={s['max_concurrent_agents_clean']} " + f"per_session_kv={s['per_session_kv_mb']}MB " + f"node_kv_upper_bound={s['node_kv_upper_bound_mb']}MB " + f"peak_rss={s['server_peak_rss_mb']}MB", flush=True) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/scripts/research/k3_specdecode_gpu_bench.py b/scripts/research/k3_specdecode_gpu_bench.py index bed07520..c1b28f6c 100644 --- a/scripts/research/k3_specdecode_gpu_bench.py +++ b/scripts/research/k3_specdecode_gpu_bench.py @@ -169,6 +169,7 @@ def restored_specdecode( def restored_specdecode_fused( adapter, drafter, verifier, aux_layer_ids, embed_fn, lm_head_fn, prompt, gen_tokens, block_size, device, eos_ids, + block_rtt_ms: float = 0.0, ) -> Dict[str, Any]: """FUSED spec-decode engine (A+B+C): per-block O(L). @@ -195,12 +196,21 @@ def restored_specdecode_fused( generated: List[int] = [] accepts: List[int] = [] - t_draft = t_verify = t_extend = 0.0 + t_draft = t_verify = t_extend = t_network = 0.0 + rtt_s = max(0.0, block_rtt_ms) / 1000.0 torch.cuda.synchronize(device) t0 = time.perf_counter() while len(generated) < gen_tokens: L = min(block_size, gen_tokens - len(generated)) cstart = adapter._past_len # committed length at block start + # Cross-host model: one proposer<->verifier round-trip per block + # (proposer ships the draft block to the verifier host, gets the + # accept/reject + bonus back). Injected as a per-block delay so the + # measured decode throughput reflects the WAN penalty on real compute. + if rtt_s: + tn = time.perf_counter() + time.sleep(rtt_s) + t_network += time.perf_counter() - tn bonus = int(adapter.next_token_logits.argmax().item()) td = time.perf_counter() drafts = drafter.draft_block_cached( @@ -252,10 +262,12 @@ def restored_specdecode_fused( "drafter_cached": round(t_draft, 3), "incremental_verify": round(t_verify, 3), "ctx_kv_extend": round(t_extend, 3), + "network_rtt": round(t_network, 3), }, "blocks": len(accepts), "mean_accept_len": round(sum(accepts) / len(accepts), 2) if accepts else 0.0, "decode_tokens": len(generated), + "block_rtt_ms": block_rtt_ms, } @@ -275,6 +287,11 @@ def main() -> int: help="Skip the un-fused restored spec-decode baseline " "(already characterized; removes GPU contention for a " "clean fused-vs-AR steady-state measurement).") + ap.add_argument("--rtt-sweep", default=None, + help="Comma-separated per-block RTT values (ms) to model a " + "cross-host proposer<->verifier draft loop. When set, " + "after the co-located run the fused path is re-timed on " + "prompt[0] at each RTT — the WAN-penalty curve (Case 2).") ap.add_argument("--output", default=None) args = ap.parse_args() @@ -461,6 +478,41 @@ def recall(tokens, ans): fu_tps = report["restored_specdecode_fused"]["decode_tokens_per_s_mean"] report["restored_specdecode_fused"]["speedup_over_ar_x"] = ( round(fu_tps / ar_mean, 2) if ar_mean else None) + + # --- Case 2: cross-host proposer<->verifier WAN-penalty curve --- + if args.rtt_sweep: + rtts = [float(x) for x in args.rtt_sweep.split(",") if x.strip()] + prompt0 = ids_list[0][0].tolist() + sweep = [] + for rtt in rtts: + r = restored_specdecode_fused( + adapter, drafter, verifier, aux_layer_ids, embed_fn, lm_head_fn, + prompt0, args.max_new_tokens, args.block_size, device, eos_ids, + block_rtt_ms=rtt) + tps = r["decode_tokens_per_s"] + sweep.append({ + "block_rtt_ms": rtt, + "decode_tokens_per_s": tps, + "vs_ar_x": round(tps / ar_mean, 3) if ar_mean else None, + "blocks": r["blocks"], + "mean_accept_len": r["mean_accept_len"], + "network_s": r["time_breakdown_s"]["network_rtt"], + "decode_s": r["decode_s"], + }) + print(f"[sd][rtt] {rtt:6.1f} ms/block -> {tps} tok/s " + f"({sweep[-1]['vs_ar_x']}x AR, blocks={r['blocks']})", + file=sys.stderr, flush=True) + # break-even: highest RTT still >= AR (vs_ar_x >= 1.0) + over_ar = [s["block_rtt_ms"] for s in sweep if (s["vs_ar_x"] or 0) >= 1.0] + report["crosshost_rtt_sweep"] = { + "ar_baseline_tps": ar_mean, + "colocated_fused_tps": fu_tps, + "sweep": sweep, + "max_rtt_ms_at_or_above_ar": (max(over_ar) if over_ar else 0.0), + "note": ("one proposer<->verifier round-trip per block; cloud<->desk " + "RTT is typically 30-150 ms. Quantifies why the cross-host " + "token-level draft data plane is WAN-infeasible."), + } out_path = Path(args.output) if args.output else Path( f"results/research/k3_specdecode_gpu_bench_{int(time.time())}.json") out_path.parent.mkdir(parents=True, exist_ok=True) diff --git a/tests/inference_engine/bridge/test_manifest.py b/tests/inference_engine/bridge/test_manifest.py index ffe43c99..d2889be4 100644 --- a/tests/inference_engine/bridge/test_manifest.py +++ b/tests/inference_engine/bridge/test_manifest.py @@ -55,6 +55,8 @@ def _manifest(**overrides): def test_allowlist_contains_exactly_the_documented_presets(): assert sorted(PRESETS) == [ + "agent-capacity-loadtest", + "agent-capacity-stress", "integration-tests", "k3-beta-scorecard", "k3-drafter-parity",