README: design philosophy + MLX & CUDA beta scorecards (Kakeya vs standalone model)#120

Merged
cursor[bot] merged 4 commits into main from AgentMemory/readme-beta-scorecards-2815
Jun 13, 2026
@FluffyAIcode FluffyAIcode commented Jun 13, 2026

Owner

What

Updates main's README.md with (1) the design philosophy of the AR-verifier + dLLM-proposer architecture, (2) the MLX (Mac) and CUDA (H200) beta scorecards as summary tables, and (3) the full verbatim raw scorecard reports for both platforms (collapsible <details> blocks under each table).

Design philosophy

Frames the whole engine around one inequality:

Make Gemma-4 26B-A4B-it memory-bounded without trading away model intelligence (recall), token throughput, or context length.

The proposer's first role is that of a history reconstructor: because a dLLM has no KV cache, it restores evicted verifier K/V on demand via f_θ/S5. KV restoration keeps only a bounded sink+window resident while the full effective context is reconstructed. Because restored/spec-decoded K/V is byte-checked against the AR cache, output is identical to the standalone model (recall 1.0), so the memory win costs zero intelligence.
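As a rough illustration of the "bounded sink+window resident" idea, the sketch below computes which token positions would keep their K/V in memory and which would be evicted and later need restoring. The function names, the `sink` and `window` parameters, and the sizes used are all hypothetical; this PR does not specify the actual eviction policy, and the restoration step (f_θ/S5) is not modeled here. The byte check is shown only as a plain equality on raw bytes.

```python
def resident_positions(seq_len: int, sink: int, window: int) -> list[int]:
    """Positions whose K/V stay resident: the first `sink` tokens plus the
    most recent `window` tokens (hypothetical policy, for illustration)."""
    if sink < 0 or window < 0 or seq_len < 0:
        raise ValueError("seq_len, sink and window must be non-negative")
    head = range(min(sink, seq_len))
    # Start the tail after the sink so no position is counted twice.
    tail = range(max(sink, seq_len - window), seq_len)
    return [*head, *tail]


def evicted_positions(seq_len: int, sink: int, window: int) -> list[int]:
    """Positions that are not resident and would need restoring on demand."""
    resident = set(resident_positions(seq_len, sink, window))
    return [p for p in range(seq_len) if p not in resident]


def byte_identical(restored: bytes, reference: bytes) -> bool:
    """Byte-level comparison of a restored K/V buffer against a reference."""
    return restored == reference


if __name__ == "__main__":
    # Resident size stays at most sink + window however long the context is.
    for n in (10, 1_000, 100_000):
        print(n, len(resident_positions(n, sink=4, window=64)))
```

The point of the sketch is that the resident set is capped at `sink + window` entries regardless of `seq_len`, which is what makes the memory footprint bounded while the evicted span grows with context length.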

Scorecards added (main @ 9d5e6b4)

Mac (MLX) vs mlx_lm AR oracle — 89.8 % KV saved (132.92 vs 1308.88 MB), recall 1.0, throughput 0.93× (≈parity).
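The 89.8 % figure follows directly from the two memory sizes quoted above. A minimal check, where the function name is just an illustrative helper and not part of the repository:

```python
def kv_saved_percent(resident_mb: float, baseline_mb: float) -> float:
    """Percentage of KV memory saved relative to the baseline."""
    return 100.0 * (1.0 - resident_mb / baseline_mb)


if __name__ == "__main__":
    # Figures taken from the Mac (MLX) scorecard summary.
    saved = kv_saved_percent(132.92, 1308.88)
    print(f"{saved:.1f}% KV saved")  # prints "89.8% KV saved"
```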

CUDA (H200) vs standalone Gemma-4 26B AR — 43.9×/87.0× KV saving (constant 16.71 MB), 47.9×/94.9× context compression, recall 1.0, fused spec-decode 1.79× AR.
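The Mac scorecard reports KV saving as a percentage while the CUDA scorecard reports it as a multiple, so a small conversion makes the two comparable. The helper names below are illustrative only, and the "implied baseline" is a back-of-envelope product of the reported ratio and the constant 16.71 MB, not a number taken from the scorecard.

```python
def ratio_to_saved_percent(ratio: float) -> float:
    """Convert an N-times saving into the equivalent percentage saved."""
    return 100.0 * (1.0 - 1.0 / ratio)


def implied_baseline_mb(ratio: float, resident_mb: float) -> float:
    """Baseline size implied by a saving ratio and a resident size."""
    return ratio * resident_mb


if __name__ == "__main__":
    for ratio in (43.9, 87.0):  # the two KV-saving ratios reported for H200
        pct = ratio_to_saved_percent(ratio)
        base = implied_baseline_mb(ratio, 16.71)
        print(f"{ratio}x -> {pct:.1f}% saved, implied baseline ~{base:.0f} MB")
```

Under this conversion the H200 ratios correspond to roughly 97.7 % and 98.9 % saved, against 89.8 % on the Mac.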

Both scorecards are at byte-identical output; throughput is the only axis on which the platforms fork (CUDA's cheap verify-batch puts it above AR; Mac's 26B verify(L) floor holds it at parity). Each platform's table is followed by a collapsible raw scorecard report with the exact reproducible evidence. One stale H200 1.27× reference in the Mac raw report was reconciled to the fresh 1.79× so the two reports agree.

Files

  • README.md: new ## Design philosophy … section + ### Beta scorecards … (tables + collapsible raw reports for Mac and CUDA).

Testing


cursoragent and others added 4 commits June 13, 2026 11:54
…n → memory-bounded Gemma-4 26B) + MLX & CUDA beta scorecards

- Add 'Design philosophy' section: memory-bounded Gemma-4 26B without trading
  intelligence (recall 1.0), throughput, or context length; KV restoration as
  the mechanism (bounded sink+window resident, full effective context restored).
- Add 'Beta scorecards' with Kakeya-vs-standalone tables on both platforms:
  Mac MLX (89.8% KV saved, recall 1.0, 0.93x ~parity) and
  CUDA H200 (43.9x/87.0x KV saving, 47.9x/94.9x ctx compression, 1.79x AR fused).
- Reconcile honest-ceiling reference to fresh main 1.79x (H200, block-16).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ible)

Adds the full raw scorecard reports as <details> code blocks under each
platform's summary table, so the exact reproducible evidence sits alongside
the condensed tables. Reconciled the Mac report's trailing H200 reference
1.27x -> 1.79x to match the fresh main GPU scorecard (avoids contradicting
the CUDA report in the same section).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ta-scorecards-2815

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot marked this pull request as ready for review June 13, 2026 13:35
@cursor cursor Bot merged commit a2f5086 into main Jun 13, 2026
8 checks passed
2 participants