README: Kakeya Inference Engine for Mac — MLX spec-decode port journey (K3 beta baseline)#116

Closed
FluffyAIcode wants to merge 1 commit into main from AgentMemory/readme-mlx-mac-port-journey-2815

@FluffyAIcode

What

Adds a baseline record of the MLX-port throughput journey (your task 2) to README.md, placed after the v0.3 GA evidence section.

Contents

  • The journey table — decode-only ×AR, the binding problem at each stage, and the fix:
    • ~0.09× O(T²) collapse → Gap-A incremental decode (native cache + generate_step)
    • ~0.2× cross-runtime bridge → move toward all-MLX drafter
    • ~0.5× unsound RotatingKVCache rollback → CUDA-DynamicCache parity (all-KVCache + native trim, keep accepted)
    • ~0.7× block-4 CUDA-trim
    • ~1.0× block-8 tuned (AR parity; best long-code samples just over)
  • Honest ceiling + ruled-out non-levers:
    • >AR remains CUDA-favoured (H200 1.27×).
    • The binding constraint is 26B verify(L) compute, not rollback, sync count (single-fused probe stable; the 143 s pathology is large-cache-specific), acceptance, quantization, context length, or a missing alignment asset.
    • The "2.13 low acceptance" was a forced-over-gen artifact.
    • Remaining lever: the 4.5→7.7 drafter accept-len gap.
  • Recall 1.0 throughout + bounded S5 KV; cross-links ADR 0009/0012/0013.
  • Evaluation environment: the Mac bridge (git-bus + self-hosted kakeya-mac-m4 runner), the evidence gate, and the H200 GPU side.
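
For readers unfamiliar with the rollback fix in the ~0.5× row, here is a minimal sketch of the "keep accepted, trim rejected" idea, assuming greedy verification. It is not the project's code. The names `target_next`, `verify_block`, and `autoregressive` are hypothetical, the toy target model is an arbitrary arithmetic function, and a plain Python list of token ids stands in for the KV cache (deleting the list tail stands in for a native cache trim).

```python
from typing import List, Tuple


def target_next(ctx: List[int]) -> int:
    """Hypothetical toy target model: a deterministic next-token rule."""
    return (sum(ctx) + len(ctx)) % 7


def verify_block(cache: List[int], draft: List[int]) -> Tuple[int, int]:
    """Verify one drafted block against the target and roll back rejects.

    `cache` is mutated in place. Returns (accepted_count, correction_token).
    """
    base = len(cache)
    # A verify pass writes cache entries for every drafted token.
    cache.extend(draft)

    # Greedy acceptance: keep the longest prefix the target agrees with.
    # A real engine would score the whole block in one forward call;
    # this loop only mimics the comparison.
    accepted = 0
    for i, tok in enumerate(draft):
        if target_next(cache[: base + i]) != tok:
            break
        accepted += 1

    # Rollback: drop only the rejected tail, keep the accepted entries.
    rejected = len(draft) - accepted
    if rejected:
        del cache[-rejected:]

    # The target's own token at the first mismatch (or a bonus token
    # when the whole block was accepted) is always emitted.
    correction = target_next(cache)
    cache.append(correction)
    return accepted, correction


def autoregressive(prompt: List[int], n: int) -> List[int]:
    """Reference decode: one target call per generated token."""
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out


if __name__ == "__main__":
    cache = [1, 2, 3]
    print(verify_block(cache, [2, 5, 9, 9]), cache)
```

In this toy, the drafted block `[2, 5, 9, 9]` has its first two tokens accepted and its last two trimmed, and the resulting sequence matches `autoregressive([1, 2, 3], 3)` token for token. That is the property the rollback has to preserve. The sketch does not model throughput.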

Note on task 1 (beta merge + tag)

I could not merge PR #115 or create the kakeya inference engine for mac tag from the agent: gh is read-only, ManagePullRequest has no merge action, and I won't push directly to main. Also, PR #115 targets the b876 branch and stacks on the b876 all-MLX drafter (PR #112), and neither is on main. Bringing the Mac beta to main is therefore a multi-PR consolidation (b876 #112 + #109/#110/#115) that the merge owner must perform before tagging the merge commit. I'm happy to prepare a consolidated beta-integration branch and a PR to main if you want.

Testing

Documentation-only — no code paths affected. (The numbers cited are from the real-Mac bridge runs recorded in PRs #109/#110/#115.)

…y (K3 beta baseline)

Records the decode-throughput journey from ~0.09x AR (O(T^2) collapse) -> ~0.2x
(cross-runtime bridge) -> ~0.5x (all-MLX + CUDA-parity trim rollback) -> ~0.7x
(block-4) -> ~1.0x (block-8, AR parity) on Gemma-4-26B-A4B / Mac M4, with each
binding problem + fix, the ruled-out non-levers (quant/length/alignment/sync/
forced-over-gen artifact), the honest >AR-is-CUDA-favoured ceiling, and the
evaluation environment (Mac bridge git-bus + self-hosted runner + evidence gate +
H200). Recall 1.0 throughout; bounded S5 KV. Cross-links ADR 0009/0012/0013.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>