
test: agent-connection capacity (Case 1, run) + cross-host topology feasibility (Case 2) — ADR 0014 #123

Merged: cursor[bot] merged 8 commits into main from AgentMemory/agent-capacity-crosshost-test-2815 on Jun 14, 2026
FluffyAIcode (Owner) commented Jun 14, 2026

What

Runs and records both requested tests against the AR-verifier + dLLM-proposer architecture, plus syncs the conclusions into the README. Adds the load-test harness, bridge presets, the cross-host RTT-sweep mode, evidence JSONs, and ADR 0014.

Case 1 — agent connection capacity (RAN on Mac mini M4)

scripts/research/grpc_agent_capacity_loadtest.py ramps N concurrent agents (independent gRPC channel + session each) against one RuntimeService. Two presets: agent-capacity-loadtest (light) and agent-capacity-stress (per-agent context prefill + raised FD limit).
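
The ramp pattern described above can be illustrated with a minimal, self-contained sketch. This is not the harness itself: there is no gRPC here, the names `fake_create_session` and `ramp` are hypothetical, and an `asyncio.Lock` stands in for a service that handles one request at a time (an assumption made for illustration).

```python
import asyncio
import time


async def fake_create_session(service_lock: asyncio.Lock, service_time_s: float) -> float:
    """Hypothetical stand-in for one agent's create call; returns its latency in seconds."""
    start = time.perf_counter()
    async with service_lock:                 # assumed: the service handles one request at a time
        await asyncio.sleep(service_time_s)  # simulated per-request work
    return time.perf_counter() - start


async def ramp(levels, service_time_s=0.01):
    """For each concurrency level N, launch N agents at once and record the worst latency."""
    worst_latency = {}
    for n in levels:
        lock = asyncio.Lock()  # fresh simulated service per level
        latencies = await asyncio.gather(
            *(fake_create_session(lock, service_time_s) for _ in range(n))
        )
        worst_latency[n] = max(latencies)
    return worst_latency


if __name__ == "__main__":
    for n, worst in asyncio.run(ramp([1, 4, 8, 16])).items():
        print(f"N={n:>2}  worst create latency = {worst * 1000:.1f} ms")
```

Under this serialized model every agent connects without error, but the last agent in the queue waits for all the others, so worst-case latency grows roughly in proportion to N.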

  • Light sessions: 256/256 concurrent agents, zero errors; per-session KV bounded 7.80 MB; node KV bound ≈2.0 GB; server RSS flat ~3.85 GB.
  • Stress (window 256, per-agent ctx prefill, FD limit raised to 100k): the FD limit is not the ceiling. Memory scales with capacity × window (cap 2048 → ~11.5 GB RSS; theoretical node bound ~61 GB > 24 GB RAM, so capacity must be RAM-sized). The binding limit is single-tenant serialization: create latency grows linearly (3→12→25→45 s for N=1→4→8→16), so clean heavy-context concurrency tops out at ~8 agents (vs 256 light).
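
The memory figures in the two bullets follow from node KV bound = capacity × per-session bound. A rough sketch of that arithmetic is below. The helper names are hypothetical, decimal units (1 GB = 1000 MB) are an assumption, and the stress-run per-session bound is not stated in the PR, so it is back-derived here from the ~61 GB node bound at capacity 2048.

```python
MB_PER_GB = 1000.0  # assumption: decimal units; the PR does not say which convention it uses


def node_kv_bound_gb(capacity: int, per_session_mb: float) -> float:
    """Upper bound on node KV memory: every session slot filled to its per-session bound."""
    return capacity * per_session_mb / MB_PER_GB


def ram_sized_capacity(ram_gb: float, per_session_mb: float, reserved_gb: float = 0.0) -> int:
    """Largest capacity whose KV bound fits in RAM after a (hypothetical) reserved share."""
    usable_mb = (ram_gb - reserved_gb) * MB_PER_GB
    return max(0, int(usable_mb // per_session_mb))


if __name__ == "__main__":
    # Light run: 256 sessions at 7.80 MB each (figures from the PR), roughly 2.0 GB.
    print(f"light node bound: {node_kv_bound_gb(256, 7.80):.2f} GB")

    # Stress run: per-session bound implied by ~61 GB at capacity 2048 (derived, not measured).
    implied_mb = 61 * MB_PER_GB / 2048
    print(f"implied per-session bound: {implied_mb:.1f} MB")
    print(f"capacity that fits 24 GB RAM: {ram_sized_capacity(24, implied_mb)}")
```

The `reserved_gb` parameter is there because model weights and the runtime also occupy RAM; what value to use for it is left open.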

Case 2 — cross-host proposer/verifier (MEASURED on H200)

The GPU-proposer ⇄ Mac-verifier token-level draft plane is design-only (no CapabilityService/ProposeBlock) — so I measured the WAN penalty directly: the fused engine re-timed with one injected proposer↔verifier round-trip per block on the real Gemma-4-26B + DFlash (H200 NVL):

| per-block RTT | 0 (co-located) | 15 ms (LAN) | 30 ms | 60 ms | 100 ms | 150 ms |
| --- | --- | --- | --- | --- | --- | --- |
| vs AR | 2.20× | 1.81× | 1.50× | 1.22× | 0.98× (break-even) | 0.77× (loss) |

Break-even ≈100 ms/block. Cloud↔desk WAN (30–150 ms) straddles/exceeds it; LAN (≤15 ms) keeps the 1.8–2.2× win. Quantifies, on real compute, why the cross-host draft data plane must be LAN — WAN = control + tool plane (the Mac bridge), LAN = co-located data plane.
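
As a sanity check on the break-even figure, the measured points can be linearly interpolated to find where the speedup crosses 1.0× AR. This is a sketch over the numbers reported in this PR, not the benchmark's own method; the function name `break_even_rtt_ms` is hypothetical, and linear interpolation between adjacent samples is an assumption.

```python
# Measured points from the RTT sweep reported in this PR.
RTT_MS = [0, 15, 30, 60, 100, 150]
SPEEDUP_VS_AR = [2.20, 1.81, 1.50, 1.22, 0.98, 0.77]


def break_even_rtt_ms(rtts, speedups, threshold=1.0):
    """Return the RTT at which the speedup first drops below `threshold`.

    Interpolates linearly between the two adjacent samples that bracket the
    threshold; returns None if the curve never crosses it.
    """
    points = list(zip(rtts, speedups))
    for (r0, s0), (r1, s1) in zip(points, points[1:]):
        if s0 >= threshold > s1:
            fraction = (s0 - threshold) / (s0 - s1)
            return r0 + fraction * (r1 - r0)
    return None


if __name__ == "__main__":
    rtt = break_even_rtt_ms(RTT_MS, SPEEDUP_VS_AR)
    print(f"interpolated break-even: {rtt:.1f} ms per block")
```

The crossing lands between the 60 ms and 100 ms samples, just under 100 ms, consistent with the stated break-even of ≈100 ms/block.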

Also records the served-MLX-gemma gap (MLXSinkWindowVerifier can't resolve the gemma-4 nested config — a v0.4 item).

Files

  • scripts/research/grpc_agent_capacity_loadtest.py (+ --context-len, FD raise)
  • scripts/research/k3_specdecode_gpu_bench.py (+ --rtt-sweep cross-host mode)
  • inference_engine/bridge/manifest.py + test (agent-capacity-loadtest, agent-capacity-stress presets)
  • docs/adr/0014-…md + docs/adr/README.md; README.md (synced)
  • results/research/k3_agent_capacity_{mac,stress_mac}.json, k3_crosshost_rtt_gpu.json

Testing

  • pytest tests/inference_engine/bridge/test_manifest.py (24 passed)
  • ✅ Harness validated locally; Case-1 light + stress run on Mac M4 via bridge (conclusion=success)
  • ✅ Case-2 RTT sweep + co-located baseline run on H200 NVL (exit 0; co-located 2.06–2.20× AR, recall 1.0)
  • ✅ ADR/README markdown links resolve; fences balanced

cursoragent and others added 3 commits June 14, 2026 02:31
…eset

scripts/research/grpc_agent_capacity_loadtest.py launches a RuntimeService
subprocess and ramps N concurrent agents (independent gRPC channel + session
each), reporting max concurrent agents, per-session bounded KV (GetSessionInfo),
node KV upper bound (capacity * per-session bound), create/generate latency
curve, and server RSS. Honest about v0.3 single-tenant (shared verifier, RPCs
serialized) — measures connection/admission scaling, not parallel inference.

New manifest preset 'agent-capacity-loadtest' runs it on the Mac's real MLX
gemma verifier. Validated locally on the cloud agent (cpu Qwen3-1.7B).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… path is a v0.4 gap; connection scaling is model-independent)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…pology test record

Case 1 (RUN): gRPC RuntimeService on Mac M4 sustains 256/256 concurrent agent
connections, zero errors; per-session KV bounded 7.80 MB; node KV upper bound
~2.0 GB; server RSS flat ~3.85 GB. Single-tenant caveat: generate serializes
(latency linear in N) -> 256 = max concurrent connections served, not parallel
inferences. Evidence: results/research/k3_agent_capacity_mac.json.

Case 2 (FEASIBILITY): cross-host GPU-proposer<->Mac-verifier discovery+draft is
design-only (no distributed.proto / CapabilityService / ProposeBlock) AND
WAN-bounded out by latency. Realizable topology: WAN=control/tool plane (bridge),
LAN=co-located data plane. Proxies: GPU 1.79x AR (#119), Mac 0.93x (#118),
max conns 256+ (Case 1), bounded Mac KV. Also records served-MLX-gemma gap.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursoragent and others added 5 commits June 14, 2026 03:04
…ty-stress preset (ramp to 2048, probe connection + memory ceiling)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ion knee; save cap-2048 stress evidence

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…verifier round-trip into the fused engine to measure the WAN-penalty throughput curve

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… into README

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… curve

Case 1 stress (Mac M4): FD not the limit (100k); memory scales with capacity x
window (cap 2048 -> 11.5GB, node bound 61GB > 24GB RAM); single-tenant
serialization caps heavy-context concurrency at ~8 (vs 256 light).

Case 2 (H200 real models): injected per-block proposer<->verifier RTT -> measured
WAN-penalty curve. Co-located 2.20x AR; break-even ~100ms/block; 150ms -> 0.77x
(net loss). LAN (<=15ms) keeps 1.8-2.2x. Confirms WAN data plane infeasible.

Updates ADR 0014 + README with measured curves + evidence JSONs.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot marked this pull request as ready for review June 14, 2026 03:32
@cursor cursor Bot merged commit fb472a5 into main Jun 14, 2026
7 of 8 checks passed