evidence: GPU beta scorecard — Kakeya vs standalone AR on H200 (main #117) by FluffyAIcode · Pull Request #119 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-13T11:45:24Z

What

Pulls the GPU beta from main (9d5e6b4, #107 fused engine + #117 consolidation) onto an H200 (vast.ai) and runs the scorecard comparing the Kakeya inference engine against standalone Gemma-4 26B autoregressive (AR) decoding, the GPU analog of the Mac "mlx-only" comparison. Commits the two evidence JSONs.

Verifier google/gemma-4-26B-A4B-it (bf16), drafter z-lab/gemma-4-26B-A4B-it-DFlash, f_theta_v5_s5_sliding, S5 (5 exact full-attn layers).

Scorecard (NVIDIA H200, `main` @ `9d5e6b4`)

1) Memory bounded — resident KV

| context | AR full-KV | Kakeya restored | saving |
|---|---|---|---|
| 3238-tok prompt | 733.06 MB | 16.71 MB | 43.9× |
| 6438-tok prompt | 1453.96 MB | 16.71 MB | 87.0× |

Kakeya's resident KV stays constant at 16.71 MB (a 68-token sink + window) regardless of context length, while the AR KV cache grows linearly with context, so the saving scales with context.
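The saving column can be rechecked directly from the two resident-KV columns. This is a minimal sketch using the MB figures copied from the table above; the names `AR_FULL_KV_MB`, `KAKEYA_KV_MB`, and `kv_saving` are illustrative and are not part of the Kakeya benchmark scripts.

```python
# Resident KV sizes in MB, copied from the scorecard table (not re-measured).
AR_FULL_KV_MB = {3238: 733.06, 6438: 1453.96}  # prompt tokens -> AR full-KV MB
KAKEYA_KV_MB = 16.71  # constant 68-token sink + window


def kv_saving(ar_mb: float, kakeya_mb: float = KAKEYA_KV_MB) -> float:
    """Return how many times smaller the Kakeya resident KV is than AR's."""
    return ar_mb / kakeya_mb


for prompt_tokens, ar_mb in AR_FULL_KV_MB.items():
    print(f"{prompt_tokens}-tok prompt: {kv_saving(ar_mb):.1f}x saving")
```

Because the denominator is fixed, the ratio roughly doubles when the prompt length doubles, which is the scaling the table shows.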

2) Context length — bounded window vs effective context

| context | resident window | effective ctx | compression | recall |
|---|---|---|---|---|
| 3238-tok | 68 tok | 3254 tok | 47.9× | 1.0 == AR |
| 6438-tok | 68 tok | 6454 tok | 94.9× | 1.0 == AR |
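The compression column appears to be the effective context divided by the resident window. A short sketch under that assumption, with the token counts copied from the table; `WINDOW_TOKENS`, `EFFECTIVE_CTX`, and `compression` are hypothetical names used only for illustration.

```python
# Token counts copied from the scorecard table.
WINDOW_TOKENS = 68  # resident sink + window
EFFECTIVE_CTX = {3238: 3254, 6438: 6454}  # prompt tokens -> effective context tokens


def compression(effective_tokens: int, window_tokens: int = WINDOW_TOKENS) -> float:
    """Ratio of tokens the model can recall to tokens actually resident."""
    return effective_tokens / window_tokens


for prompt_tokens, effective in EFFECTIVE_CTX.items():
    print(f"{prompt_tokens}-tok: {compression(effective):.1f}x compression")
```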

3) Token throughput (decode tok/s, 3238-tok prompt)

| path | tok/s | vs AR | recall |
|---|---|---|---|
| standalone AR | 16.125 | 1.00× | 1.0 |
| restored per-token (Gap A) | 16.297 | 1.01× | 1.0 |
| Kakeya FUSED spec-decode | 28.937 | 1.79× | 1.0 |

Fused run: block size 16, accept_len 3.32, output byte-identical to AR.
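The "vs AR" column is each path's decode rate divided by the standalone AR rate. The sketch below recomputes it and also converts each rate to per-token latency; the tok/s values are copied from the table, and `DECODE_TOK_S`, `speedup_vs_ar`, and `ms_per_token` are illustrative names, not identifiers from `k3_specdecode_gpu_bench.py`.

```python
# Decode throughput in tokens/second, copied from the scorecard table.
DECODE_TOK_S = {
    "standalone AR": 16.125,
    "restored per-token (Gap A)": 16.297,
    "Kakeya FUSED spec-decode": 28.937,
}
AR_TOK_S = DECODE_TOK_S["standalone AR"]


def speedup_vs_ar(tok_s: float, ar_tok_s: float = AR_TOK_S) -> float:
    """Throughput relative to the standalone AR baseline."""
    return tok_s / ar_tok_s


def ms_per_token(tok_s: float) -> float:
    """Average wall-clock milliseconds spent per decoded token."""
    return 1000.0 / tok_s


for path, tok_s in DECODE_TOK_S.items():
    print(f"{path}: {speedup_vs_ar(tok_s):.2f}x AR, {ms_per_token(tok_s):.1f} ms/token")
```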

Net (GPU): bounded memory (44–87× KV saving, constant 16.71 MB), full-context recall (48–95× compression), and 1.79× AR throughput, all at AR-identical correctness. This confirms the platform fork: speculative decoding pays off on GPU, where the verify batch is cheap, whereas on Mac it lands at ~0.93× because the 26B verify(L) dominates.

Files

results/research/k3_gpu_beta_e2e_memory_context.json — k3_e2e_gpu_bench (memory/context/recall, rungs 160/320).
results/research/k3_gpu_beta_fused_throughput.json — k3_specdecode_gpu_bench (AR / restored / fused throughput).

Testing

✅ k3_e2e_gpu_bench.py --haystack-lines 160,320 --incremental (H200, exit 0, recall 1.0 both rungs)
✅ k3_specdecode_gpu_bench.py --block-size 16 --skip-unfused (H200, exit 0, fused 1.79× AR, recall 1.0)

…emory/context/throughput

k3_e2e_gpu_bench (NIAH 160/320 lines, incremental restored vs AR):
- memory: AR KV 733/1454 MB vs Kakeya constant 16.71 MB -> 43.9x/87.0x saving
- context: 68-tok window reconstructs 3254/6454 tok -> 47.9x/94.9x, recall 1.0==AR
- decode: restored 16.5 vs AR 16.3-16.6 tok/s (~1.01x parity)

k3_specdecode_gpu_bench (block-16, 3238-tok prompt):
- AR 16.125 | restored-pertoken 16.297 (1.01x) | FUSED 28.937 tok/s (1.79x AR)
- recall 1.0 across AR/pertoken/fused (byte-identical), accept_len 3.32

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…pology test record

Case 1 (RUN): gRPC RuntimeService on Mac M4 sustains 256/256 concurrent agent connections, zero errors; per-session KV bounded 7.80 MB; node KV upper bound ~2.0 GB; server RSS flat ~3.85 GB. Single-tenant caveat: generate serializes (latency linear in N) -> 256 = max concurrent connections served, not parallel inferences. Evidence: results/research/k3_agent_capacity_mac.json.

Case 2 (FEASIBILITY): cross-host GPU-proposer<->Mac-verifier discovery+draft is design-only (no distributed.proto / CapabilityService / ProposeBlock) AND WAN-bounded out by latency. Realizable topology: WAN=control/tool plane (bridge), LAN=co-located data plane. Proxies: GPU 1.79x AR (#119), Mac 0.93x (#118), max conns 256+ (Case 1), bounded Mac KV. Also records served-MLX-gemma gap.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…easibility (Case 2) - ADR 0014 (#123)

* test(case1): gRPC agent-connection capacity load test + Mac-bridge preset

  scripts/research/grpc_agent_capacity_loadtest.py launches a RuntimeService subprocess and ramps N concurrent agents (independent gRPC channel + session each), reporting max concurrent agents, per-session bounded KV (GetSessionInfo), node KV upper bound (capacity * per-session bound), create/generate latency curve, and server RSS. Honest about v0.3 single-tenant (shared verifier, RPCs serialized): measures connection/admission scaling, not parallel inference. New manifest preset 'agent-capacity-loadtest' runs it on the Mac's real MLX gemma verifier. Validated locally on the cloud agent (cpu Qwen3-1.7B).

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): use cpu Qwen3-0.6B for capacity preset (served MLX gemma path is a v0.4 gap; connection scaling is model-independent)

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case2)+adr: ADR 0014 - agent-connection capacity & cross-host topology test record

  Case 1 (RUN): gRPC RuntimeService on Mac M4 sustains 256/256 concurrent agent connections, zero errors; per-session KV bounded 7.80 MB; node KV upper bound ~2.0 GB; server RSS flat ~3.85 GB. Single-tenant caveat: generate serializes (latency linear in N) -> 256 = max concurrent connections served, not parallel inferences. Evidence: results/research/k3_agent_capacity_mac.json.

  Case 2 (FEASIBILITY): cross-host GPU-proposer<->Mac-verifier discovery+draft is design-only (no distributed.proto / CapabilityService / ProposeBlock) AND WAN-bounded out by latency. Realizable topology: WAN=control/tool plane (bridge), LAN=co-located data plane. Proxies: GPU 1.79x AR (#119), Mac 0.93x (#118), max conns 256+ (Case 1), bounded Mac KV. Also records served-MLX-gemma gap.

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): harness --context-len + FD-limit raise; add agent-capacity-stress preset (ramp to 2048, probe connection + memory ceiling)

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): finer stress levels to pinpoint single-tenant serialization knee; save cap-2048 stress evidence

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case2): cross-host RTT-sweep mode - inject per-block proposer<->verifier round-trip into the fused engine to measure the WAN-penalty throughput curve

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* docs: sync ADR 0014 (agent-connection capacity + cross-host topology) into README

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test: Case-1 stress ceilings + Case-2 measured cross-host WAN-penalty curve

  Case 1 stress (Mac M4): FD not the limit (100k); memory scales with capacity x window (cap 2048 -> 11.5GB, node bound 61GB > 24GB RAM); single-tenant serialization caps heavy-context concurrency at ~8 (vs 256 light).

  Case 2 (H200 real models): injected per-block proposer<->verifier RTT -> measured WAN-penalty curve. Co-located 2.20x AR; break-even ~100ms/block; 150ms -> 0.77x (net loss). LAN (<=15ms) keeps 1.8-2.2x. Confirms WAN data plane infeasible. Updates ADR 0014 + README with measured curves + evidence JSONs.

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
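The Case 2 figures reported above are roughly what a simple additive-latency model would produce. The sketch below is an assumption for illustration, not the RTT-sweep harness from #123: it supposes every block costs a fixed co-located time plus one injected round trip, and it calibrates that fixed time from two reported numbers (co-located 2.20x AR and a break-even near 100 ms/block). All names are hypothetical.

```python
# Reported figures from the Case 2 commit message (H200, real models).
CO_LOCATED_SPEEDUP = 2.20   # fused vs AR with no injected RTT
BREAK_EVEN_RTT_MS = 100.0   # approximate per-block RTT where speedup falls to 1.0x

# Assumed model: speedup(rtt) = s0 * T / (T + rtt), where T is the co-located
# time per block. Setting speedup(break_even) = 1 gives T = break_even / (s0 - 1).
BLOCK_TIME_MS = BREAK_EVEN_RTT_MS / (CO_LOCATED_SPEEDUP - 1.0)


def modeled_speedup(rtt_ms: float) -> float:
    """Speedup vs AR when each block pays one extra round trip of rtt_ms."""
    return CO_LOCATED_SPEEDUP * BLOCK_TIME_MS / (BLOCK_TIME_MS + rtt_ms)


for rtt_ms in (0.0, 15.0, 100.0, 150.0):
    print(f"RTT {rtt_ms:5.1f} ms/block -> {modeled_speedup(rtt_ms):.2f}x AR")
```

Under these assumptions the model gives about 1.86x at 15 ms and about 0.79x at 150 ms, close to the reported LAN range of 1.8-2.2x and the measured 0.77x. The fit is only approximate, so the evidence JSONs remain the reference for the actual curve.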

FluffyAIcode mentioned this pull request Jun 13, 2026

README: design philosophy + MLX & CUDA beta scorecards (Kakeya vs standalone model) #120

Merged

cursor Bot marked this pull request as ready for review June 13, 2026 13:37

cursor Bot merged commit d31e409 into main Jun 13, 2026
7 of 8 checks passed

FluffyAIcode mentioned this pull request Jun 14, 2026

test: agent-connection capacity (Case 1, run) + cross-host topology feasibility (Case 2) — ADR 0014 #123

Merged

cursor[bot] merged 1 commit into main from AgentMemory/gpu-beta-scorecard-2815
