README: Kakeya Inference Engine for Mac — MLX spec-decode port journey (K3 beta baseline) by FluffyAIcode · Pull Request #116 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-13T10:22:27Z

What

Adds a baseline record of the MLX-port throughput journey to README.md (your task 2), placed after the v0.3 GA evidence section.

The journey table — decode-only ×AR, the binding problem at each stage, and the fix:
- ~0.09× O(T²) collapse → Gap-A incremental decode (native cache + generate_step)
- ~0.2× cross-runtime bridge → move toward all-MLX drafter
- ~0.5× unsound RotatingKVCache rollback → CUDA-DynamicCache parity (all-KVCache + native trim, keep accepted)
- ~0.7× block-4 CUDA-trim
- ~1.0× block-8 tuned (AR parity; best long-code samples just over)
Honest ceiling + ruled-out non-levers: >AR remains CUDA-favoured (H200 1.27×). The binding constraint is 26B verify(L) compute. It is not rollback, sync count (the single-fused probe is stable; the 143 s pathology is large-cache-specific), acceptance, quantization, context length, or a missing alignment asset. The "2.13 low acceptance" was a forced-over-gen artifact. Remaining lever: the 4.5→7.7 drafter accept-len gap.
Recall is 1.0 throughout, with bounded S5 KV. The section cross-links ADR 0009/0012/0013.
Evaluation environment: the Mac bridge (git-bus + self-hosted kakeya-mac-m4 runner), the evidence gate, and the H200 GPU side.
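
For readers unfamiliar with the "native trim, keep accepted" rollback named in the ~0.5× stage, here is a minimal sketch of the general idea under greedy verification. It is not the repository's code: `ToyKVCache` and `accept_and_rollback` are hypothetical names, and the cache is a plain list standing in for a non-rotating KV cache. The assumption is that the verify pass has already written every drafted token into the cache, so rejected positions must be trimmed off the end.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for a non-rotating KV cache: one entry per token."""
    entries: List[int] = field(default_factory=list)

    def append(self, tokens: List[int]) -> None:
        self.entries.extend(tokens)

    def trim(self, n: int) -> int:
        """Drop the last n entries and return how many were dropped."""
        n = min(n, len(self.entries))
        if n:
            del self.entries[-n:]
        return n


def accept_and_rollback(cache: ToyKVCache, draft: List[int], target: List[int]) -> List[int]:
    """Keep the longest draft prefix that matches the verifier, trim the rest.

    `target` holds the verifier's greedy token at each draft position plus one
    extra token, so len(target) == len(draft) + 1.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must be one token longer than draft")
    accepted = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted += 1
    # Rejected draft tokens were cached during verify; remove only those.
    cache.trim(len(draft) - accepted)
    # Emit the accepted tokens plus the verifier's own next token.
    return draft[:accepted] + [target[accepted]]


cache = ToyKVCache([1, 2, 3])          # prompt already cached
draft = [10, 11, 12, 13]               # block of 4 drafted tokens
cache.append(draft)                    # verify pass caches all of them
emitted = accept_and_rollback(cache, draft, [10, 11, 99, 5, 6])
print(emitted, cache.entries)
```

In this toy run two drafted tokens are accepted, two are trimmed, and the verifier's token at the first mismatch is emitted; that token is not yet in the cache and would be fed in on the next step.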

Note on task 1 (beta merge + tag)

I could not merge PR #115 or create the kakeya inference engine for mac tag from the agent: gh is read-only, ManagePullRequest has no merge action, and I won't push directly to main. Also, PR #115 targets the b876 branch and stacks on the b876 all-MLX drafter (PR #112), and neither is on main. Getting the Mac beta into main is therefore a multi-PR consolidation (b876 #112 + #109/#110/#115) that the merge owner must perform before tagging the merge commit. Happy to prepare a consolidated beta-integration branch + PR-to-main if you want.

Testing

Documentation-only — no code paths affected. (The numbers cited are from the real-Mac bridge runs recorded in PRs #109/#110/#115.)
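
For reference, the ×AR figures above are decode-only throughput ratios against plain autoregressive decoding. A minimal sketch of that arithmetic follows. `times_ar` is a hypothetical helper, not part of the repository, and the timings in the usage line are placeholders rather than measurements from the bridge runs.

```python
def times_ar(spec_tokens: int, spec_seconds: float,
             ar_tokens: int, ar_seconds: float) -> float:
    """Return speculative decode throughput as a multiple of AR throughput.

    Both timings are assumed to cover the decode phase only (no prefill).
    """
    if spec_seconds <= 0 or ar_seconds <= 0 or ar_tokens <= 0:
        raise ValueError("timings and AR token count must be positive")
    spec_tps = spec_tokens / spec_seconds   # tokens/s with speculative decode
    ar_tps = ar_tokens / ar_seconds         # tokens/s with plain AR decode
    return spec_tps / ar_tps


# Placeholder numbers: same token count, speculative path takes twice as long.
print(times_ar(256, 20.0, 256, 10.0))  # 0.5
```

A value of 1.0 corresponds to AR parity, and anything below 1.0 means the speculative path is slower than plain decoding on that hardware.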

…y (K3 beta baseline)

Records the decode-throughput journey from ~0.09x AR (O(T^2) collapse) -> ~0.2x (cross-runtime bridge) -> ~0.5x (all-MLX + CUDA-parity trim rollback) -> ~0.7x (block-4) -> ~1.0x (block-8, AR parity) on Gemma-4-26B-A4B / Mac M4, with each binding problem + fix, the ruled-out non-levers (quant/length/alignment/sync/forced-over-gen artifact), the honest >AR-is-CUDA-favoured ceiling, and the evaluation environment (Mac bridge git-bus + self-hosted runner + evidence gate + H200). Recall 1.0 throughout; bounded S5 KV. Cross-links ADR 0009/0012/0013.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode mentioned this pull request Jun 13, 2026

Kakeya Inference Engine for Mac — MLX speculative-decode beta (consolidated → main) #117

Merged

FluffyAIcode closed this in #117 Jun 13, 2026

FluffyAIcode wants to merge 1 commit into main from AgentMemory/readme-mlx-mac-port-journey-2815
