bridge: k3-beta-scorecard preset — Kakeya vs MLX-only on main (#117) by FluffyAIcode · Pull Request #118 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-13T11:18:57Z

What

Adds a reusable Mac-bridge preset k3-beta-scorecard (NIAH ctx280, all-MLX fused spec-decode + CUDA-trim, S5) so the post-#117 main beta can be benchmarked head-to-head against the MLX-only oracle in one harness run: bounded KV, recall, context length, and decode tok/s.

This is the preset used to produce the Kakeya-vs-MLX-only scorecard requested after #117 landed on main.

Scorecard (Mac mini M4, Gemma-4 26B-A4B-it 4-bit, `main` @ `9d5e6b4`)

1) Memory bounded (NIAH ctx280, T=5810 tok)

| | Kakeya (S5) | MLX-only (naive full-KV) |
|---|---|---|
| resident KV @5810 | 132.92 MB | 1308.88 MB → 89.8% saved |
| KV growth / token | 20.0 KB | 220.0 KB → Kakeya grows 11× slower |
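The two derived figures in the table (percentage saved and growth ratio) follow directly from the measured values. A quick check, with illustrative variable names that are not taken from the harness:

```python
# Measured values copied from the memory table of the scorecard.
kakeya_kv_mb = 132.92     # resident KV at 5810 tokens, Kakeya (S5)
mlx_kv_mb = 1308.88       # resident KV at 5810 tokens, MLX-only full-KV
kakeya_growth_kb = 20.0   # KV growth per token, Kakeya
mlx_growth_kb = 220.0     # KV growth per token, MLX-only

# Fraction of resident KV saved relative to the naive full-KV baseline.
saved_fraction = 1.0 - kakeya_kv_mb / mlx_kv_mb

# How many times faster the baseline KV grows per generated token.
growth_ratio = mlx_growth_kb / kakeya_growth_kb

print(f"KV saved: {saved_fraction:.1%}")
print(f"growth ratio: {growth_ratio:.0f}x")
```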

5 full-attention layers (5,11,17,23,29) hold all 5810 positions exact; sliding layers stay bounded to 68 resident positions.
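The residency rule in the sentence above can be sketched as a per-layer position count. This is a minimal illustration, not the engine's code: `resident_positions` is a hypothetical helper, and it counts positions only, since per-layer byte sizes are not stated here.

```python
from typing import Iterable


def resident_positions(seq_len: int, layer: int,
                       full_layers: Iterable[int], window: int) -> int:
    """Positions one layer keeps resident under the bounded-KV rule."""
    if layer in set(full_layers):
        return seq_len           # full-attention layer: every position kept exact
    return min(seq_len, window)  # sliding layer: capped at the resident window


FULL_LAYERS = (5, 11, 17, 23, 29)  # full-attention layer indices from the scorecard
WINDOW = 68                        # resident positions per sliding layer, from the scorecard

# Layer 5 is full-attention; layer 6 is used here as an example sliding layer.
for seq_len in (50, 5810):
    full = resident_positions(seq_len, 5, FULL_LAYERS, WINDOW)
    sliding = resident_positions(seq_len, 6, FULL_LAYERS, WINDOW)
    print(seq_len, full, sliding)
```

Only the full-attention layers grow with the prompt, which is why the per-token KV growth stays small.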

2) Context length — prompts 4406–5810 tok handled; recall 1.0 (5/5) == MLX-only 1.0 (5/5), byte-identical outputs. Full 5810-tok window kept exact on the 5 full-attn layers.

3) Token throughput (code workload, 128-tok decode, long samples)

| | Kakeya fused | MLX-only AR | ratio |
|---|---|---|---|
| e2e mean | 21.68 tok/s | 23.26 tok/s | 0.93× (~parity) |
| decode-only | ~24–27 tok/s | — | best 0.99× |

Recall 1.0 (8/8) == MLX-only, byte-identical.
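The e2e ratio in the throughput table is simply the quotient of the two mean rates. A short check, with illustrative variable names:

```python
# Mean end-to-end decode rates from the throughput table (tokens per second).
kakeya_tok_s = 21.68
mlx_tok_s = 23.26

# Ratio below 1.0 means Kakeya fused is slower than MLX-only autoregressive decode.
ratio = kakeya_tok_s / mlx_tok_s
gap_percent = (1.0 - ratio) * 100

print(f"ratio: {ratio:.2f}x, gap: {gap_percent:.1f}%")
```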

Net: bounded memory (~90% KV saving) + full-context recall at MLX-only-identical output, at ~AR-parity throughput on Mac (the 26B verify(L) compute per block is the throughput floor; >AR remains CUDA-favored — H200 1.27×).

Changes

inference_engine/bridge/manifest.py: add k3-beta-scorecard preset (validate_reports=False; NIAH ctx280, fused all-MLX, --cuda-trim, block-8 default).
tests/inference_engine/bridge/test_manifest.py: extend the strict allowlist.
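As a rough picture of how a preset entry and a strict allowlist test fit together, here is a minimal sketch. The `Preset` dataclass, the `PRESETS` registry, and `ALLOWED_PRESETS` are hypothetical names for illustration; the real schema in `manifest.py` and the real test may look different. Only the preset name, `validate_reports=False`, and the `--cuda-trim` flag come from this PR.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Preset:
    """Hypothetical preset record; the actual manifest schema may differ."""
    name: str
    validate_reports: bool
    extra_args: Tuple[str, ...] = ()


# Hypothetical registry keyed by preset name.
PRESETS = {
    p.name: p
    for p in (
        Preset("k3-beta-scorecard", validate_reports=False,
               extra_args=("--cuda-trim",)),
    )
}

# A strict allowlist: the test fails if a preset is added or removed
# without updating this set in the same change.
ALLOWED_PRESETS = frozenset({"k3-beta-scorecard"})


def test_presets_match_allowlist() -> None:
    assert set(PRESETS) == ALLOWED_PRESETS


test_presets_match_allowlist()
```

The point of comparing with `==` rather than a subset check is that a new preset cannot land without the allowlist being extended alongside it.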

Testing

✅ pytest tests/inference_engine/bridge/test_manifest.py (24 passed)
✅ Mac-bridge runs (GH Actions, kakeya-mac-m4 runner): k3-beta-scorecard + k3-fused-allmlx-code-trim both conclusion=success, evidence gate PASS.

…easibility (Case 2) - ADR 0014 (#123)

* test(case1): gRPC agent-connection capacity load test + Mac-bridge preset

  scripts/research/grpc_agent_capacity_loadtest.py launches a RuntimeService subprocess and ramps N concurrent agents (independent gRPC channel + session each), reporting max concurrent agents, per-session bounded KV (GetSessionInfo), node KV upper bound (capacity * per-session bound), create/generate latency curve, and server RSS. Honest about v0.3 single-tenant (shared verifier, RPCs serialized): measures connection/admission scaling, not parallel inference. New manifest preset 'agent-capacity-loadtest' runs it on the Mac's real MLX gemma verifier. Validated locally on the cloud agent (cpu Qwen3-1.7B).

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): use cpu Qwen3-0.6B for capacity preset (served MLX gemma path is a v0.4 gap; connection scaling is model-independent)

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case2)+adr: ADR 0014 - agent-connection capacity & cross-host topology test record

  Case 1 (RUN): gRPC RuntimeService on Mac M4 sustains 256/256 concurrent agent connections, zero errors; per-session KV bounded 7.80 MB; node KV upper bound ~2.0 GB; server RSS flat ~3.85 GB. Single-tenant caveat: generate serializes (latency linear in N) -> 256 = max concurrent connections served, not parallel inferences. Evidence: results/research/k3_agent_capacity_mac.json.

  Case 2 (FEASIBILITY): cross-host GPU-proposer<->Mac-verifier discovery+draft is design-only (no distributed.proto / CapabilityService / ProposeBlock) AND WAN-bounded out by latency. Realizable topology: WAN=control/tool plane (bridge), LAN=co-located data plane. Proxies: GPU 1.79x AR (#119), Mac 0.93x (#118), max conns 256+ (Case 1), bounded Mac KV. Also records served-MLX-gemma gap.

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): harness --context-len + FD-limit raise; add agent-capacity-stress preset (ramp to 2048, probe connection + memory ceiling)

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case1): finer stress levels to pinpoint single-tenant serialization knee; save cap-2048 stress evidence

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test(case2): cross-host RTT-sweep mode - inject per-block proposer<->verifier round-trip into the fused engine to measure the WAN-penalty throughput curve

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* docs: sync ADR 0014 (agent-connection capacity + cross-host topology) into README

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

* test: Case-1 stress ceilings + Case-2 measured cross-host WAN-penalty curve

  Case 1 stress (Mac M4): FD not the limit (100k); memory scales with capacity x window (cap 2048 -> 11.5GB, node bound 61GB > 24GB RAM); single-tenant serialization caps heavy-context concurrency at ~8 (vs 256 light).

  Case 2 (H200 real models): injected per-block proposer<->verifier RTT -> measured WAN-penalty curve. Co-located 2.20x AR; break-even ~100ms/block; 150ms -> 0.77x (net loss). LAN (<=15ms) keeps 1.8-2.2x. Confirms WAN data plane infeasible.

  Updates ADR 0014 + README with measured curves + evidence JSONs.

  Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

bridge: add k3-beta-scorecard preset (NIAH ctx280 fused all-mlx CUDA-trim, Kakeya vs MLX-only)

7465a27

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

github-actions Bot added the needs-mac-m4 label Jun 13, 2026

FluffyAIcode mentioned this pull request Jun 13, 2026

README: design philosophy + MLX & CUDA beta scorecards (Kakeya vs standalone model) #120

Merged

cursor Bot marked this pull request as ready for review June 13, 2026 13:37

cursor Bot merged commit 997f7e4 into main Jun 13, 2026
7 of 8 checks passed

FluffyAIcode mentioned this pull request Jun 14, 2026

test: agent-connection capacity (Case 1, run) + cross-host topology feasibility (Case 2) — ADR 0014 #123

Merged

cursor[bot] merged 1 commit into main from AgentMemory/mac-beta-scorecard-2815
