test: agent-connection capacity (Case 1, run) + cross-host topology feasibility (Case 2) — ADR 0014 #123
Merged
cursor[bot] merged 8 commits on Jun 14, 2026
…eset scripts/research/grpc_agent_capacity_loadtest.py launches a RuntimeService subprocess and ramps N concurrent agents (independent gRPC channel + session each), reporting max concurrent agents, per-session bounded KV (GetSessionInfo), node KV upper bound (capacity * per-session bound), create/generate latency curve, and server RSS. Honest about v0.3 single-tenant (shared verifier, RPCs serialized) — measures connection/admission scaling, not parallel inference. New manifest preset 'agent-capacity-loadtest' runs it on the Mac's real MLX gemma verifier. Validated locally on the cloud agent (cpu Qwen3-1.7B). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… path is a v0.4 gap; connection scaling is model-independent) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…pology test record Case 1 (RUN): gRPC RuntimeService on Mac M4 sustains 256/256 concurrent agent connections, zero errors; per-session KV bounded 7.80 MB; node KV upper bound ~2.0 GB; server RSS flat ~3.85 GB. Single-tenant caveat: generate serializes (latency linear in N) -> 256 = max concurrent connections served, not parallel inferences. Evidence: results/research/k3_agent_capacity_mac.json. Case 2 (FEASIBILITY): cross-host GPU-proposer<->Mac-verifier discovery+draft is design-only (no distributed.proto / CapabilityService / ProposeBlock) AND WAN-bounded out by latency. Realizable topology: WAN=control/tool plane (bridge), LAN=co-located data plane. Proxies: GPU 1.79x AR (#119), Mac 0.93x (#118), max conns 256+ (Case 1), bounded Mac KV. Also records served-MLX-gemma gap. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ty-stress preset (ramp to 2048, probe connection + memory ceiling) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ion knee; save cap-2048 stress evidence Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…verifier round-trip into the fused engine to measure the WAN-penalty throughput curve Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… into README Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… curve Case 1 stress (Mac M4): FD not the limit (100k); memory scales with capacity x window (cap 2048 -> 11.5GB, node bound 61GB > 24GB RAM); single-tenant serialization caps heavy-context concurrency at ~8 (vs 256 light). Case 2 (H200 real models): injected per-block proposer<->verifier RTT -> measured WAN-penalty curve. Co-located 2.20x AR; break-even ~100ms/block; 150ms -> 0.77x (net loss). LAN (<=15ms) keeps 1.8-2.2x. Confirms WAN data plane infeasible. Updates ADR 0014 + README with measured curves + evidence JSONs. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
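The memory figures in the commit messages above follow from the stated bound: node KV upper bound = capacity × per-session bound. Below is a minimal sketch of that arithmetic. It assumes decimal units (1 GB = 1000 MB) and ignores model weights and other resident memory. The function names are hypothetical and not part of grpc_agent_capacity_loadtest.py, and the stress-run per-session figure is back-derived from the ~61 GB bound rather than measured.

```python
import math

MB_PER_GB = 1000.0  # assumption: decimal units


def node_kv_upper_bound_gb(capacity: int, per_session_mb: float) -> float:
    """Worst case: every admitted session fills its bounded KV window."""
    return capacity * per_session_mb / MB_PER_GB


def max_capacity_for_ram(ram_gb: float, per_session_mb: float) -> int:
    """Largest session capacity whose worst-case KV still fits in ram_gb."""
    return math.floor(ram_gb * MB_PER_GB / per_session_mb)


# Light run: 256 sessions at 7.80 MB each (figures from the Case 1 record).
light_bound_gb = node_kv_upper_bound_gb(256, 7.80)

# Stress run: a ~61 GB bound at capacity 2048 implies this per-session window.
stress_per_session_mb = 61.0 * MB_PER_GB / 2048

# Capacity that would keep the stress-run bound inside the Mac's 24 GB RAM.
ram_sized_capacity = max_capacity_for_ram(24.0, stress_per_session_mb)

print(f"light node bound   ~{light_bound_gb:.1f} GB")
print(f"stress per-session ~{stress_per_session_mb:.1f} MB")
print(f"RAM-sized capacity <= {ram_sized_capacity}")
```

The RAM-sized figure is only an upper limit on KV: the verifier weights and process overhead (server RSS ~3.85 GB in the light run) also have to fit.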
What
Runs and records both requested tests against the AR-verifier + dLLM-proposer architecture, plus syncs the conclusions into the README. Adds the load-test harness, bridge presets, the cross-host RTT-sweep mode, evidence JSONs, and ADR 0014.
Case 1 — agent connection capacity (RAN on Mac mini M4)
scripts/research/grpc_agent_capacity_loadtest.py ramps N concurrent agents (independent gRPC channel + session each) against one RuntimeService. Two presets: agent-capacity-loadtest (light) and agent-capacity-stress (per-agent context prefill + raised FD limit).
Memory scales with capacity × window (cap 2048 → ~11.5 GB RSS; theoretical node bound ~61 GB > 24 GB RAM, so capacity must be RAM-sized). The binding limit is single-tenant serialization: create latency grows linearly (3→12→25→45 s for N=1→4→8→16), so clean heavy-context concurrency tops out at ~8 agents (vs 256 light).
Case 2 — cross-host proposer/verifier (MEASURED on H200)
The GPU-proposer ⇄ Mac-verifier token-level draft plane is design-only (no CapabilityService/ProposeBlock), so I measured the WAN penalty directly: the fused engine was re-timed with one injected proposer↔verifier round-trip per block on the real Gemma-4-26B + DFlash (H200 NVL). Co-located, it reaches 2.20× AR; at 150 ms/block it drops to 0.77× (a net loss).
Break-even ≈100 ms/block. Cloud↔desk WAN (30–150 ms) straddles or exceeds it; LAN (≤15 ms) keeps the 1.8–2.2× win. This quantifies, on real compute, why the cross-host draft data plane must be LAN: WAN = control + tool plane (the Mac bridge), LAN = co-located data plane.
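One way to read the break-even figure is a simple additive model: if each block costs a fixed compute time and the cross-host hop adds one round-trip on top, the speedup shrinks by block_time / (block_time + rtt). The sketch below is an assumption for illustration, not the computation in k3_specdecode_gpu_bench.py. The per-block time is back-derived from the recorded 2.20× co-located speedup and the ≈100 ms break-even, not measured, and all function names are hypothetical.

```python
def implied_block_time_ms(colocated_speedup: float, break_even_ms: float) -> float:
    """Per-block compute time that makes break_even_ms the 1.0x crossover."""
    return break_even_ms / (colocated_speedup - 1.0)


def speedup_with_rtt(colocated_speedup: float, block_time_ms: float,
                     rtt_ms: float) -> float:
    """Speedup over AR when every block pays one extra round-trip."""
    return colocated_speedup * block_time_ms / (block_time_ms + rtt_ms)


def break_even_rtt_ms(colocated_speedup: float, block_time_ms: float) -> float:
    """Round-trip time at which the speedup falls to exactly 1.0x."""
    return block_time_ms * (colocated_speedup - 1.0)


S0 = 2.20  # co-located speedup over AR, from the H200 record
BLOCK_MS = implied_block_time_ms(S0, 100.0)  # back-derived, not measured

# Sweep a few LAN- and WAN-like round-trip times (values chosen for illustration).
for rtt in (0.0, 15.0, 30.0, 100.0, 150.0):
    s = speedup_with_rtt(S0, BLOCK_MS, rtt)
    verdict = "win" if s > 1.0 else "no gain"
    print(f"rtt={rtt:6.1f} ms/block -> {s:.2f}x AR ({verdict})")
```

Under this model the 15 ms point lands near 1.86× and the 150 ms point near 0.79×, in line with the reported LAN range and the measured 0.77×.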
Also records the served-MLX-gemma gap (MLXSinkWindowVerifier can't resolve the gemma-4 nested config; a v0.4 item).
Files
scripts/research/grpc_agent_capacity_loadtest.py (+ --context-len, FD raise)
scripts/research/k3_specdecode_gpu_bench.py (+ --rtt-sweep cross-host mode)
inference_engine/bridge/manifest.py + test (agent-capacity-loadtest, agent-capacity-stress presets)
docs/adr/0014-…md + docs/adr/README.md; README.md (synced)
results/research/k3_agent_capacity_{mac,stress_mac}.json, k3_crosshost_rtt_gpu.json
Testing
pytest tests/inference_engine/bridge/test_manifest.py (24 passed)
… conclusion=success)