evidence: GPU beta scorecard — Kakeya vs standalone AR on H200 (main #117) #119

Merged
cursor[bot] merged 1 commit into main from AgentMemory/gpu-beta-scorecard-2815
Jun 13, 2026

@FluffyAIcode

Owner

What

Pulls the GPU beta from main (9d5e6b4, #107 fused engine + #117 consolidation) onto an H200 (vast.ai) and runs the Kakeya inference engine vs standalone Gemma-4 26B AR scorecard — the GPU analog of the Mac "mlx-only" comparison. Commits the two evidence JSONs.

Verifier google/gemma-4-26B-A4B-it (bf16), drafter z-lab/gemma-4-26B-A4B-it-DFlash, f_theta_v5_s5_sliding, S5 (5 exact full-attn layers).

Scorecard (NVIDIA H200, main @ 9d5e6b4)

1) Memory bounded — resident KV

| context | AR full-KV | Kakeya restored | saving |
|---|---|---|---|
| 3238-tok prompt | 733.06 MB | 16.71 MB | 43.9× |
| 6438-tok prompt | 1453.96 MB | 16.71 MB | 87.0× |

Kakeya KV is constant 16.71 MB (68-tok sink+window) regardless of context; AR grows linearly → saving scales with context.
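
The saving column can be re-derived from the two KV sizes. The sketch below is only a sanity check on the numbers in the scorecard; the function and variable names are illustrative and do not come from the benchmark scripts.

```python
# Resident KV sizes in MB, copied from the scorecard in this PR description.
AR_KV_MB = {3238: 733.06, 6438: 1453.96}
KAKEYA_KV_MB = 16.71  # constant resident KV reported for the 68-tok sink+window


def kv_saving(ar_mb: float, kakeya_mb: float = KAKEYA_KV_MB) -> float:
    """Ratio of AR full-KV size to Kakeya resident KV size."""
    return ar_mb / kakeya_mb


def ar_mb_per_token(prompt_tokens: int, ar_mb: float) -> float:
    """AR KV cost per prompt token; roughly constant if growth is linear."""
    return ar_mb / prompt_tokens


for tokens, ar_mb in AR_KV_MB.items():
    print(
        f"{tokens}-tok prompt: saving {kv_saving(ar_mb):.1f}x, "
        f"AR {ar_mb_per_token(tokens, ar_mb):.4f} MB/token"
    )
```

Both prompts come out near 0.226 MB per token on the AR side, which is what linear growth predicts, so the saving ratio roughly doubles when the prompt doubles.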

2) Context length — bounded window vs effective context

| context | resident window | effective ctx | compression | recall |
|---|---|---|---|---|
| 3238-tok | 68 tok | 3254 tok | 47.9× | 1.0 == AR |
| 6438-tok | 68 tok | 6454 tok | 94.9× | 1.0 == AR |
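
The compression column appears to be the effective context divided by the resident window. A minimal check under that assumption (the helper name is hypothetical):

```python
RESIDENT_WINDOW_TOK = 68  # sink + window size from the scorecard


def context_compression(effective_ctx_tok: int,
                        resident_tok: int = RESIDENT_WINDOW_TOK) -> float:
    """Effective context tokens per resident KV token."""
    return effective_ctx_tok / resident_tok


for effective in (3254, 6454):
    print(f"{effective} tok effective: {context_compression(effective):.1f}x")
```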

3) Token throughput (decode tok/s, 3238-tok prompt)

| path | tok/s | vs AR | recall |
|---|---|---|---|
| standalone AR | 16.125 | 1.00× | 1.0 |
| restored per-token (Gap A) | 16.297 | 1.01× | 1.0 |
| Kakeya FUSED spec-decode | 28.937 | 1.79× | 1.0 |

block-16, accept_len 3.32, byte-identical output.
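
The "vs AR" column is each path's decode rate divided by the standalone AR rate. The sketch below recomputes it and converts the rates to per-token latency; the names are illustrative and the derived latencies are not part of the evidence JSONs.

```python
# Decode throughput in tok/s, copied from the scorecard (3238-tok prompt).
DECODE_TOK_S = {
    "standalone AR": 16.125,
    "restored per-token (Gap A)": 16.297,
    "Kakeya FUSED spec-decode": 28.937,
}


def speedup_vs_ar(path: str) -> float:
    """Throughput of `path` relative to standalone AR."""
    return DECODE_TOK_S[path] / DECODE_TOK_S["standalone AR"]


def ms_per_token(path: str) -> float:
    """Mean decode latency per token, derived from throughput."""
    return 1000.0 / DECODE_TOK_S[path]


for path in DECODE_TOK_S:
    print(f"{path}: {speedup_vs_ar(path):.2f}x AR, {ms_per_token(path):.1f} ms/token")
```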

Net (GPU): bounded memory (44–87× KV saving, constant 16.71 MB) + full-context recall (48–95× compression) + 1.79× AR throughput, all at AR-identical correctness. Confirms the platform fork: spec-decode value lands on GPU (cheap verify-batch) vs Mac ~0.93× (26B verify(L) dominates).

Files

  • results/research/k3_gpu_beta_e2e_memory_context.json: k3_e2e_gpu_bench (memory/context/recall, rungs 160/320).
  • results/research/k3_gpu_beta_fused_throughput.json: k3_specdecode_gpu_bench (AR / restored / fused throughput).

Testing

  • k3_e2e_gpu_bench.py --haystack-lines 160,320 --incremental (H200, exit 0, recall 1.0 both rungs)
  • k3_specdecode_gpu_bench.py --block-size 16 --skip-unfused (H200, exit 0, fused 1.79× AR, recall 1.0)

…emory/context/throughput

k3_e2e_gpu_bench (NIAH 160/320 lines, incremental restored vs AR):
- memory: AR KV 733/1454 MB vs Kakeya constant 16.71 MB -> 43.9x/87.0x saving
- context: 68-tok window reconstructs 3254/6454 tok -> 47.9x/94.9x, recall 1.0==AR
- decode: restored 16.5 vs AR 16.3-16.6 tok/s (~1.01x parity)

k3_specdecode_gpu_bench (block-16, 3238-tok prompt):
- AR 16.125 | restored-pertoken 16.297 (1.01x) | FUSED 28.937 tok/s (1.79x AR)
- recall 1.0 across AR/pertoken/fused (byte-identical), accept_len 3.32

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot marked this pull request as ready for review June 13, 2026 13:37
@cursor cursor Bot merged commit d31e409 into main Jun 13, 2026
7 of 8 checks passed
cursor Bot pushed a commit that referenced this pull request Jun 14, 2026
…pology test record

Case 1 (RUN): gRPC RuntimeService on Mac M4 sustains 256/256 concurrent agent
connections, zero errors; per-session KV bounded 7.80 MB; node KV upper bound
~2.0 GB; server RSS flat ~3.85 GB. Single-tenant caveat: generate serializes
(latency linear in N) -> 256 = max concurrent connections served, not parallel
inferences. Evidence: results/research/k3_agent_capacity_mac.json.
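
The node KV upper bound quoted in Case 1 is consistent with capacity times the per-session bound. A quick check, assuming decimal units (1 GB = 1000 MB) and using a hypothetical helper name:

```python
def node_kv_upper_bound_mb(capacity: int, per_session_kv_mb: float) -> float:
    """Worst-case resident KV if every admitted session holds its full bound."""
    return capacity * per_session_kv_mb


bound_mb = node_kv_upper_bound_mb(capacity=256, per_session_kv_mb=7.80)
print(f"node KV upper bound: {bound_mb:.1f} MB (~{bound_mb / 1000:.1f} GB)")
```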

Case 2 (FEASIBILITY): cross-host GPU-proposer<->Mac-verifier discovery+draft is
design-only (no distributed.proto / CapabilityService / ProposeBlock) AND
WAN-bounded out by latency. Realizable topology: WAN=control/tool plane (bridge),
LAN=co-located data plane. Proxies: GPU 1.79x AR (#119), Mac 0.93x (#118),
max conns 256+ (Case 1), bounded Mac KV. Also records served-MLX-gemma gap.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request Jun 14, 2026
…easibility (Case 2) — ADR 0014 (#123)

* test(case1): gRPC agent-connection capacity load test + Mac-bridge preset

scripts/research/grpc_agent_capacity_loadtest.py launches a RuntimeService
subprocess and ramps N concurrent agents (independent gRPC channel + session
each), reporting max concurrent agents, per-session bounded KV (GetSessionInfo),
node KV upper bound (capacity * per-session bound), create/generate latency
curve, and server RSS. Honest about v0.3 single-tenant (shared verifier, RPCs
serialized) — measures connection/admission scaling, not parallel inference.
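
To illustrate why serialized RPCs make latency grow linearly with the number of agents, here is a toy simulation. It is not the load-test harness: there is no gRPC, an `asyncio.Lock` stands in for the shared verifier, and `service_time_s` is an arbitrary assumed value.

```python
import asyncio
import time


async def run_agents(n_agents: int, service_time_s: float = 0.01) -> list[float]:
    """Launch n concurrent agents against one serialized 'generate' call."""
    verifier = asyncio.Lock()  # stand-in for the single shared verifier

    async def agent() -> float:
        start = time.perf_counter()
        async with verifier:  # only one generate runs at a time
            await asyncio.sleep(service_time_s)  # pretend inference work
        return time.perf_counter() - start

    return list(await asyncio.gather(*(agent() for _ in range(n_agents))))


def worst_latency_s(n_agents: int, service_time_s: float = 0.01) -> float:
    """Latency seen by the last agent to get through the verifier."""
    return max(asyncio.run(run_agents(n_agents, service_time_s)))


if __name__ == "__main__":
    for n in (1, 4, 16):
        print(f"{n:>2} agents: worst latency {worst_latency_s(n) * 1000:.0f} ms")
```

All agents are admitted concurrently, but the last one waits for every request ahead of it, so worst-case latency is about `n_agents * service_time_s`. That is the sense in which the test measures connection and admission scaling rather than parallel inference.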

New manifest preset 'agent-capacity-loadtest' runs it on the Mac's real MLX
gemma verifier. Validated locally on the cloud agent (cpu Qwen3-1.7B).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): use cpu Qwen3-0.6B for capacity preset (served MLX gemma path is a v0.4 gap; connection scaling is model-independent)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case2)+adr: ADR 0014 — agent-connection capacity & cross-host topology test record

Case 1 (RUN): gRPC RuntimeService on Mac M4 sustains 256/256 concurrent agent
connections, zero errors; per-session KV bounded 7.80 MB; node KV upper bound
~2.0 GB; server RSS flat ~3.85 GB. Single-tenant caveat: generate serializes
(latency linear in N) -> 256 = max concurrent connections served, not parallel
inferences. Evidence: results/research/k3_agent_capacity_mac.json.

Case 2 (FEASIBILITY): cross-host GPU-proposer<->Mac-verifier discovery+draft is
design-only (no distributed.proto / CapabilityService / ProposeBlock) AND
WAN-bounded out by latency. Realizable topology: WAN=control/tool plane (bridge),
LAN=co-located data plane. Proxies: GPU 1.79x AR (#119), Mac 0.93x (#118),
max conns 256+ (Case 1), bounded Mac KV. Also records served-MLX-gemma gap.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): harness --context-len + FD-limit raise; add agent-capacity-stress preset (ramp to 2048, probe connection + memory ceiling)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): finer stress levels to pinpoint single-tenant serialization knee; save cap-2048 stress evidence

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case2): cross-host RTT-sweep mode — inject per-block proposer<->verifier round-trip into the fused engine to measure the WAN-penalty throughput curve

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* docs: sync ADR 0014 (agent-connection capacity + cross-host topology) into README

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test: Case-1 stress ceilings + Case-2 measured cross-host WAN-penalty curve

Case 1 stress (Mac M4): FD not the limit (100k); memory scales with capacity x
window (cap 2048 -> 11.5GB, node bound 61GB > 24GB RAM); single-tenant
serialization caps heavy-context concurrency at ~8 (vs 256 light).

Case 2 (H200 real models): injected per-block proposer<->verifier RTT -> measured
WAN-penalty curve. Co-located 2.20x AR; break-even ~100ms/block; 150ms -> 0.77x
(net loss). LAN (<=15ms) keeps 1.8-2.2x. Confirms WAN data plane infeasible.
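
A rough first-order model of the WAN penalty, offered only as a sketch: it assumes each speculative block pays one extra round trip and yields a fixed number of accepted tokens. The 2.20x co-located speedup is from this commit message; the AR rate (16.125 tok/s) and tokens per block (accept_len 3.32) are borrowed from the #119 scorecard and may not match the RTT-sweep run. The measured curve in the evidence JSONs is authoritative.

```python
def speedup_with_rtt(rtt_ms: float, colocated_speedup: float,
                     ar_tok_s: float, tokens_per_block: float) -> float:
    """Speedup vs AR when every block pays an extra round trip of rtt_ms."""
    ar_ms_per_token = 1000.0 / ar_tok_s
    # Time one block takes when proposer and verifier are co-located.
    colocated_block_ms = tokens_per_block * ar_ms_per_token / colocated_speedup
    ms_per_token = (colocated_block_ms + rtt_ms) / tokens_per_block
    return ar_ms_per_token / ms_per_token


def break_even_rtt_ms(colocated_speedup: float, ar_tok_s: float,
                      tokens_per_block: float) -> float:
    """RTT per block at which the modelled speedup drops to exactly 1.0x."""
    ar_ms_per_token = 1000.0 / ar_tok_s
    return tokens_per_block * ar_ms_per_token * (1.0 - 1.0 / colocated_speedup)


# Assumed inputs (see the caveats above); not taken from the RTT-sweep JSON.
PARAMS = dict(colocated_speedup=2.20, ar_tok_s=16.125, tokens_per_block=3.32)

for rtt in (0, 15, 100, 150):
    print(f"RTT {rtt:>3} ms/block -> {speedup_with_rtt(rtt, **PARAMS):.2f}x AR")
print(f"model break-even: {break_even_rtt_ms(**PARAMS):.0f} ms/block")
```

With these inputs the model puts break-even on the order of 100 ms per block and shows a net loss at 150 ms, in line with the qualitative conclusion here, though it does not reproduce the measured points exactly.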

Updates ADR 0014 + README with measured curves + evidence JSONs.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>