test: agent-connection capacity (Case 1, run) + cross-host topology feasibility (Case 2) — ADR 0014 by FluffyAIcode · Pull Request #123 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-14T02:50:00Z

What

Runs and records both requested tests against the AR-verifier + dLLM-proposer architecture, plus syncs the conclusions into the README. Adds the load-test harness, bridge presets, the cross-host RTT-sweep mode, evidence JSONs, and ADR 0014.

Case 1 — agent connection capacity (RAN on Mac mini M4)

scripts/research/grpc_agent_capacity_loadtest.py ramps N concurrent agents (independent gRPC channel + session each) against one RuntimeService. Two presets: agent-capacity-loadtest (light) and agent-capacity-stress (per-agent context prefill + raised FD limit).
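Each agent holds its own gRPC channel, so every concurrent agent costs the server at least one socket file descriptor. Below is a minimal sketch of the kind of FD-limit raise the stress preset relies on, using the standard-library `resource` module (Unix only). The helper name `raise_fd_limit` and its fallback behaviour are illustrative assumptions, not the harness's actual code.

```python
import resource


def raise_fd_limit(target: int = 100_000) -> tuple:
    """Raise the soft RLIMIT_NOFILE toward `target`, capped by the hard limit.

    Hypothetical helper for illustration; returns the (soft, hard) pair in
    effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY
    if soft == unlimited:
        return soft, hard  # already unlimited, nothing to do

    # An unprivileged process may raise its soft limit only up to the hard limit.
    new_soft = target if hard == unlimited else min(target, hard)
    if new_soft > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
        except (ValueError, OSError):
            # Some kernels reject values above their own per-process cap;
            # in that case keep the existing limit.
            return soft, hard
        soft = new_soft
    return soft, hard


if __name__ == "__main__":
    print("RLIMIT_NOFILE (soft, hard):", raise_fd_limit())
```

Falling back to the old limit rather than failing keeps a load test runnable on hosts where the requested value is not permitted; the ramp then stops at whatever the host allows.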

Light sessions: 256/256 concurrent agents, zero errors; per-session KV bounded 7.80 MB; node KV bound ≈2.0 GB; server RSS flat ~3.85 GB.
Stress (window 256, per-agent ctx prefill, FD→100k): the FD limit is not the ceiling. Memory scales with capacity × window (cap 2048 → ~11.5 GB RSS; theoretical node bound ~61 GB > 24 GB RAM, so capacity must be sized to RAM). The binding limit is single-tenant serialization: create latency grows linearly (3→12→25→45 s for N=1→4→8→16), so clean heavy-context concurrency tops out at ~8 agents (vs 256 light).
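The two Case-1 figures above can be reproduced with simple arithmetic. The sketch below recomputes the node KV bound as capacity × per-session bound and fits a least-squares line to the reported create latencies. The function names are illustrative, and treating 1 GB as 1000 MB is an assumption about the units used in the report.

```python
def node_kv_bound_gb(capacity: int, per_session_mb: float) -> float:
    """Upper bound on node KV memory: capacity * per-session bound (decimal GB)."""
    return capacity * per_session_mb / 1000.0


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Light sessions: 256 agents, 7.80 MB bounded KV each (figures from the PR).
print(f"light node KV bound: {node_kv_bound_gb(256, 7.80):.2f} GB")

# Stress run: create latency (s) at N concurrent heavy-context agents.
agents = [1, 4, 8, 16]
create_latency_s = [3, 12, 25, 45]
slope, intercept = least_squares_line(agents, create_latency_s)
print(f"create latency ~ {slope:.2f} s per agent + {intercept:.2f} s")
```

A slope of a few seconds per additional agent is what serialized session creation looks like: each new agent waits behind all the others, so latency grows with N rather than staying flat.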

Case 2 — cross-host proposer/verifier (MEASURED on H200)

The GPU-proposer ⇄ Mac-verifier token-level draft plane is design-only (no CapabilityService/ProposeBlock) — so I measured the WAN penalty directly: the fused engine re-timed with one injected proposer↔verifier round-trip per block on the real Gemma-4-26B + DFlash (H200 NVL):

| per-block RTT | 0 (co-located) | 15 ms (LAN) | 30 ms | 60 ms | 100 ms | 150 ms |
| --- | --- | --- | --- | --- | --- | --- |
| vs AR | 2.20× | 1.81× | 1.50× | 1.22× | 0.98× (break-even) | 0.77× (loss) |

Break-even ≈100 ms/block. Cloud↔desk WAN (30–150 ms) straddles/exceeds it; LAN (≤15 ms) keeps the 1.8–2.2× win. Quantifies, on real compute, why the cross-host draft data plane must be LAN — WAN = control + tool plane (the Mac bridge), LAN = co-located data plane.
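As a cross-check on the break-even figure, the crossing point can be estimated from the measured sweep by linear interpolation between the two points that bracket 1.0×. This is a sketch over the six values reported in this PR; the helper name is illustrative, and linear interpolation between sweep points is an assumption rather than something the benchmark itself does.

```python
from typing import Optional, Sequence

# Measured sweep from this PR: injected per-block RTT (ms) -> throughput vs AR.
RTT_MS = [0, 15, 30, 60, 100, 150]
SPEEDUP_VS_AR = [2.20, 1.81, 1.50, 1.22, 0.98, 0.77]


def break_even_rtt_ms(
    rtts: Sequence[float], speedups: Sequence[float], target: float = 1.0
) -> Optional[float]:
    """Interpolate the RTT at which the speedup curve crosses `target`.

    Returns None if the curve never crosses `target` within the sweep.
    """
    points = list(zip(rtts, speedups))
    for (r0, s0), (r1, s1) in zip(points, points[1:]):
        if s0 >= target >= s1 and s0 != s1:
            # Fraction of the way from (r0, s0) to (r1, s1) where s == target.
            frac = (s0 - target) / (s0 - s1)
            return r0 + frac * (r1 - r0)
    return None


estimate = break_even_rtt_ms(RTT_MS, SPEEDUP_VS_AR)
print(f"interpolated break-even: {estimate:.1f} ms per block")
```

The interpolated value lands between the 60 ms and 100 ms sweep points, just under 100 ms, which is consistent with the ≈100 ms/block break-even stated above.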

Also records the served-MLX-gemma gap (MLXSinkWindowVerifier can't resolve the gemma-4 nested config — a v0.4 item).

Files

scripts/research/grpc_agent_capacity_loadtest.py (+ --context-len, FD raise)
scripts/research/k3_specdecode_gpu_bench.py (+ --rtt-sweep cross-host mode)
inference_engine/bridge/manifest.py + test (agent-capacity-loadtest, agent-capacity-stress presets)
docs/adr/0014-…md + docs/adr/README.md; README.md (synced)
results/research/k3_agent_capacity_{mac,stress_mac}.json, k3_crosshost_rtt_gpu.json

Testing

✅ pytest tests/inference_engine/bridge/test_manifest.py (24 passed)
✅ Harness validated locally; Case-1 light + stress run on Mac M4 via bridge (conclusion=success)
✅ Case-2 RTT sweep + co-located baseline run on H200 NVL (exit 0; co-located 2.06–2.20× AR, recall 1.0)
✅ ADR/README markdown links resolve; fences balanced

…eset scripts/research/grpc_agent_capacity_loadtest.py launches a RuntimeService subprocess and ramps N concurrent agents (independent gRPC channel + session each), reporting max concurrent agents, per-session bounded KV (GetSessionInfo), node KV upper bound (capacity * per-session bound), create/generate latency curve, and server RSS. Honest about v0.3 single-tenant (shared verifier, RPCs serialized) — measures connection/admission scaling, not parallel inference. New manifest preset 'agent-capacity-loadtest' runs it on the Mac's real MLX gemma verifier. Validated locally on the cloud agent (cpu Qwen3-1.7B). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…pology test record Case 1 (RUN): gRPC RuntimeService on Mac M4 sustains 256/256 concurrent agent connections, zero errors; per-session KV bounded 7.80 MB; node KV upper bound ~2.0 GB; server RSS flat ~3.85 GB. Single-tenant caveat: generate serializes (latency linear in N) -> 256 = max concurrent connections served, not parallel inferences. Evidence: results/research/k3_agent_capacity_mac.json. Case 2 (FEASIBILITY): cross-host GPU-proposer<->Mac-verifier discovery+draft is design-only (no distributed.proto / CapabilityService / ProposeBlock) AND WAN-bounded out by latency. Realizable topology: WAN=control/tool plane (bridge), LAN=co-located data plane. Proxies: GPU 1.79x AR (#119), Mac 0.93x (#118), max conns 256+ (Case 1), bounded Mac KV. Also records served-MLX-gemma gap. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… curve Case 1 stress (Mac M4): FD not the limit (100k); memory scales with capacity x window (cap 2048 -> 11.5GB, node bound 61GB > 24GB RAM); single-tenant serialization caps heavy-context concurrency at ~8 (vs 256 light). Case 2 (H200 real models): injected per-block proposer<->verifier RTT -> measured WAN-penalty curve. Co-located 2.20x AR; break-even ~100ms/block; 150ms -> 0.77x (net loss). LAN (<=15ms) keeps 1.8-2.2x. Confirms WAN data plane infeasible. Updates ADR 0014 + README with measured curves + evidence JSONs. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 3 commits June 14, 2026 02:31

test(case1): use cpu Qwen3-0.6B for capacity preset (served MLX gemma…

c335d97

… path is a v0.4 gap; connection scaling is model-independent) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

github-actions Bot added the needs-mac-m4 label Jun 14, 2026

cursoragent and others added 5 commits June 14, 2026 03:04

test(case1): harness --context-len + FD-limit raise; add agent-capaci…

fafb854

…ty-stress preset (ramp to 2048, probe connection + memory ceiling) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

test(case1): finer stress levels to pinpoint single-tenant serializat…

dae8487

…ion knee; save cap-2048 stress evidence Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

test(case2): cross-host RTT-sweep mode — inject per-block proposer<->…

fe1a55e

…verifier round-trip into the fused engine to measure the WAN-penalty throughput curve Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

docs: sync ADR 0014 (agent-connection capacity + cross-host topology)…

cb29fbd

… into README Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursor Bot marked this pull request as ready for review June 14, 2026 03:32

cursor Bot merged commit fb472a5 into main Jun 14, 2026
7 of 8 checks passed

FluffyAIcode mentioned this pull request Jun 19, 2026

feat(distributed): multi-host capability exchange + distributed speculative decoding (rebase of #105 onto main) #157

Merged

cursor[bot] merged 8 commits into main from AgentMemory/agent-capacity-crosshost-test-2815
