Kakeya Inference Engine for Mac — MLX speculative-decode beta (consolidated → main) by FluffyAIcode · Pull Request #117 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-13T10:39:19Z

What

Consolidated Mac MLX beta for merge to main + tagging. Brings the full all-MLX speculative-decode stack onto main, purely additive on ce911bb (the CUDA #107 beta). 39 files, +10.7k, recall 1.0 throughout.

Ready for you to merge + tag. I cannot merge or push tags from the agent (read-only gh, no merge action). After merging, tag the merge commit, e.g.:
git tag -a kakeya-inference-engine-for-mac-beta <merge-sha> -m "Kakeya Inference Engine for Mac — MLX spec-decode beta"
git push origin kakeya-inference-engine-for-mac-beta

What's included (consolidated)

All-MLX DFlash drafter + fused spec-decode engine (b876 PR #112, the Step-2 rescue: parity-proven, 17× over the hybrid fused path, at 0.476× AR): native MLX drafter (zero per-block MLX↔PyTorch bridge), fused_specdecode_generate_mlx, the v3 rollback, parity/kv-quant evals.
Gap-A incremental restored decode + MLXRestoredIncrementalVerifier (my PR #109, the MLX port of #107: incremental decode (Step 1) + fused DFlash spec-decode engine (Step 2)): prefill→native cache→generate_step.
CUDA-DynamicCache-parity rollback (my PR #115): make_full_kv_prompt_cache + fused_specdecode_generate_mlx_trim (keep accepted K/V, trim only rejected); +33% over v3, ~AR parity on block-8 long-code.
Mac bridge (git-bus + self-hosted kakeya-mac-m4 runner) + evidence gate (k3_report_gate.py) + presets (k3-fused-allmlx-code[-trim], -natural, -singlefused-probe, …).
README baseline: the 0.09× → ~1.0× AR journey (companion to README PR #116 on the MLX spec-decode port journey, K3 beta baseline; included here so the beta is self-contained).

Deliberately EXCLUDED

native_restored_cache.py (PR #110, the MLX native restored-cache primitive proposed as a systemic fix for the Mac throughput collapse, i.e. the native bypass): per the 2026-06-13 directive it's a product bypass that doesn't exercise the architecture, so it is left out of the beta. (PR #110 stays open, labelled forbidden-for-validation.)

Evidence (real Mac, all-MLX fused, recall 1.0)

CUDA-trim block-4 0.68× AR (+33% vs v3 0.51×); block-8 ~1.0–1.05× (AR parity, best long-code); output bit-identical to v3.
Acceptance/throughput investigation closed: the binding constraint is the 26B verify(L) compute per block — not rollback, sync count, acceptance, quantization, or context length. >AR remains CUDA-favoured (H200 1.27×). Remaining lever: the 4.5→7.7 drafter accept-len gap.
Bounded S5 KV (~133 MB vs ~1309 MB naïve @ 5.8k ctx; ~48 MB affine-4).
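The ratios quoted above can be re-derived from the raw figures. The snippet below is only a worked-arithmetic sketch: the variable names are illustrative, and it assumes the ~48 MB figure is the same S5 KV footprint stored with affine 4-bit quantization.

```python
# Ratios recomputed from the figures quoted in the Evidence section.
v3_block4 = 0.51      # v3 rollback, block-4, as a fraction of AR throughput
trim_block4 = 0.68    # CUDA-trim rollback, block-4

gain = trim_block4 / v3_block4 - 1.0
print(f"trim vs v3: +{gain:.0%}")

# KV footprint in MB at ~5.8k context (naive, bounded S5, S5 + affine-4).
naive_mb, s5_mb, affine4_mb = 1309, 133, 48
s5_ratio = naive_mb / s5_mb
affine4_ratio = naive_mb / affine4_mb
print(f"S5 vs naive: {s5_ratio:.1f}x smaller")
print(f"affine-4 vs naive: {affine4_ratio:.1f}x smaller")
```

So the bounded S5 cache is roughly an order of magnitude smaller than the naive layout, and the affine-4 variant shrinks it by a further factor of about 2.8.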

Testing

✅ Linux: pytest tests/inference_engine/bridge/ tests/inference_engine/bench/ → 92 passed; MLX backend tests pass except 4 pre-existing b876 test_fused_specdecode.py fixture failures (present on PR #112's base, stash-verified; not introduced here; flagged for a fixture refresh).
✅ Real Mac (bridge, from this branch): mlx-env-probe success (branch is bridge-runnable); the engine paths were validated in PRs #109, #110 and #115.

Notes for the merge owner

Supersedes/absorbs the README PR #116 (included here): close #116 or merge this instead.
The b876 chain (#112 Step-2 rescue; #111 Mac bridge M1: cloud-agent access to the self-hosted kakeya-mac-m4 (git-bus) + distributed-inference integration evaluation; #105 v0.5-M1 milestone: agent capability exchange + distributed spec decode on multi-host fleets (ADR 0009)) and my #109 (MLX port of #107) and #115 (CUDA-parity rollback) are the sources; this is their integration onto main.
Pre-existing follow-up: the all-MLX fused is 6/8 byte-exact vs greedy AR (bf16 drift) — worth fp32 verify accumulation before promoting beyond beta.

…lapse
Port CUDA Gap-A to MLX. The existing MLX restored-attention dispatch already calls cache.update_and_fetch/cache.offset, so the per-token re-forward collapse is fixed by prefilling WITH a cache then decoding incrementally:
- restored_prefill_cache: prefill once with restored-K/V injection into the model's native hybrid cache (full/global layers -> exact own K/V (S5); sliding -> f_theta-restored, window-bounded by RotatingKVCache).
- restored_incremental_generate: greedy decode via mlx_lm generate_step over the prefilled cache (O(L)/token, async-pipelined). Recall carried by S5 full-attn.
- k3_integrated_niah_eval_mac.py: --incremental flag selects the new path.
- docs/mlx-port-lessons.md: Step 1 marked implemented + Mac validation command.
Linux: compiles, funcs import (mlx lazy), MLX helper tests pass. End-to-end decode requires Apple Silicon -> Mac validation.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
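To see why prefilling with a cache removes the per-token re-forward collapse, it helps to count how many token positions are pushed through the model under each strategy. This is a toy cost model, not the engine's code: the function names are made up for illustration, and it counts positions processed rather than FLOPs (attention per position still grows with context length).

```python
def reforward_cost(prompt_len: int, new_tokens: int) -> int:
    """Positions processed when every decode step re-runs the whole sequence."""
    return sum(prompt_len + step for step in range(new_tokens))


def incremental_cost(prompt_len: int, new_tokens: int) -> int:
    """Positions processed with one cached prefill, then one position per step.

    The prefill yields the first token, so only new_tokens - 1 extra
    single-position forwards are needed.
    """
    return prompt_len + max(new_tokens - 1, 0)


slow = reforward_cost(280, 32)
fast = incremental_cost(280, 32)
print(slow, fast, round(slow / fast, 1))
```

The re-forward total grows quadratically with the number of generated tokens, while the cached path grows linearly.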

Inject fake mlx/mlx_lm modules (monkeypatch.setitem, auto-reverted) to exercise the wrapper control flow on Linux without Apple Silicon:
- restored_prefill_cache: inject-config targets only has_kv source layers with restored K/V (sharers/missing skipped), make_prompt_cache threaded + returned, evicted-position clamping, attention class restored + configs cleared on exit.
- restored_incremental_generate: argmax first token, max_tokens<=1 early-exit, first-token EOS stop, stream-until-EOS, stream-until-max_tokens.
restored_prefill_cache (371-423) and restored_incremental_generate (425-455) are now 100% line-covered. MLX-kernel paths (dispatch internals, capture_own_kv, restored_logits forwards) remain Mac-validated. 16/16 MLX tests pass.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
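The fake-module technique can be illustrated with the standard library alone. The commit uses pytest's monkeypatch.setitem; the sketch below swaps in unittest.mock.patch.dict, which likewise reverts sys.modules on exit. The module name fake_accel and the function first_token are hypothetical stand-ins, not names from the repository.

```python
import sys
import types
from unittest import mock


def first_token(logits):
    # Lazy import: the dependency is only needed when the function runs,
    # which is what lets a stub be injected beforehand.
    import fake_accel  # hypothetical stand-in for an Apple-only dependency
    return fake_accel.argmax(logits)


# Build a stub module exposing just the attribute the wrapper touches.
stub = types.ModuleType("fake_accel")
stub.argmax = lambda xs: max(range(len(xs)), key=xs.__getitem__)

with mock.patch.dict(sys.modules, {"fake_accel": stub}):
    picked = first_token([0.1, 2.5, 0.3])

# After the context manager exits, the stub is gone again.
restored = "fake_accel" not in sys.modules
print(picked, restored)
```

This only exercises control flow around the stubbed calls; anything that depends on real kernel behaviour still needs the hardware.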

Full port of #107's fused spec-decode to the hybrid MLX-verifier + PyTorch-drafter path (inference_engine/backends/mlx/fused_specdecode.py):
- Component A: capture_aux_hidden + MLXRestoredIncrementalVerifier.forward_block patch Gemma-4 DecoderLayer.__call__ to record aux-layer outputs (no MLX output_hidden_states), bridged to torch for the drafter.
- Component B: reuse the PyTorch drafter make/extend_context_kv + draft_block_cached.
- Component C (MLXRestoredIncrementalVerifier): prefill = Gap-A restored cache; commit_or_truncate rolls back rejected tokens via mlx_lm trim_prompt_cache.
- fused_specdecode_generate: per-block O(L) accept/reject loop.
- make_bridge_embed_lm_head: Gap-B unscaled drafting embed + softcapped lm_head.
- k3_integrated_niah_eval_mac.py: --fused-specdecode + --block-size.
- docs/mlx-port-lessons.md: Steps 3-4 marked implemented + Mac command.
Linux: compiles; fused_specdecode.py 100% line-covered by new UTs (engine loop accept/reject/commit/extend, aux indexing, adapter prefill/verify/trim/append, bridge embed/lm_head). 47 MLX tests pass. MLX-kernel paths need Mac validation.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

- scripts/review_mlx_port_on_mac.sh: one-shot Step 1 (incremental) + Step 2 (fused) Mac validation; prints recall vs oracle, tok/s + speedup_vs_AR, KV savings, and PASS/FAIL gates. All knobs env-overridable.
- k3_integrated_niah_eval_mac.py: report now includes throughput.oracle_native_ar and throughput.cross_model_speedup_vs_oracle_ar so the AR comparison is in JSON.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Route the Mac S5 adaptive path through native MLX cache behavior when DFlash acceptance is too low, preserving the forced fused path while removing avoidable restoration and bridge overhead from the default smoke path. Co-authored-by: Cursor <cursoragent@cursor.com>

Seed the Gemma4 content channel for direct-answer NIAH smokes so short generations measure retrieval instead of spending their budget in the thought channel, and record cross/oracle parity evidence. Co-authored-by: Cursor <cursoragent@cursor.com>

Warm the MLX decode path before cross/oracle comparisons and stop on Gemma4 turn-end tokens so the Mac validation gate measures steady decode behavior without wasting budget past the answer. Co-authored-by: Cursor <cursoragent@cursor.com>

Use fair e2e prefill+decode timing for cross/oracle comparisons, chunk long-context MLX prefill paths, and record ctx280 n=5/gen32 evidence showing Step 2 recall parity and speedup under the corrected gate. Co-authored-by: Cursor <cursoragent@cursor.com>

- docs/design/mac-bridge-cloud-agent-access.md: three-transport design (M1 git-bus implemented; M2 tailnet SSH + M3 fleet membership designed) + evaluation of folding the bridge into the ADR 0009 distributed-inference plane (WAN = control/tool plane, LAN = data plane; remote-executor as CAPABILITY_ROLE_TOOL)
- inference_engine/bridge/manifest.py: preset allowlist (8 presets, typed+bounded params, ${ENV:} placeholders resolved on the runner, argv-only, no shell), manifest schema + validation
- scripts/mac_bridge/: run_preset.py executor (logs, summary, evidence-gate pass on K3 reports), request_run.py git-bus client (branch+manifest+overlay+push), fetch_results.py read-only poller
- .github/workflows/mac-bridge.yaml: push-on-mac-bridge/** executor on [self-hosted, macOS, ARM64, kakeya-mac-m4], serialized, commits results back to the request branch + uploads artifacts
- CI: bridge tests in the Linux gate, inference_engine/bridge/* at 100% coverage, import smoke
- docs/ops/mac-m4-runner-setup.md: bridge operator section
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
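The preset-allowlist idea (typed and bounded parameters, ${ENV:} placeholders resolved late, argv list instead of a shell string) might look roughly like the following. Everything here is a hypothetical sketch: the preset name demo-eval, the "@n" slot syntax, the PRESETS layout and build_argv are invented for illustration and are not the schema in inference_engine/bridge/manifest.py.

```python
import os
import re

# Hypothetical allowlist: preset name -> argv template plus bounded int params.
# "@n" marks a parameter slot; "${ENV:NAME}" is resolved on the runner.
PRESETS = {
    "demo-eval": {
        "argv": ["python3", "eval.py", "--model", "${ENV:MODEL_DIR}", "--n", "@n"],
        "params": {"n": (1, 16)},
    },
}
ENV_SLOT = re.compile(r"\$\{ENV:([A-Z_][A-Z0-9_]*)\}")


def build_argv(preset, params, env=None):
    """Validate a request against the allowlist and return an argv list."""
    env = os.environ if env is None else env
    spec = PRESETS.get(preset)
    if spec is None:
        raise ValueError(f"preset not in allowlist: {preset!r}")
    if set(params) != set(spec["params"]):
        raise ValueError("unexpected or missing parameters")
    for name, (low, high) in spec["params"].items():
        value = params[name]
        if type(value) is not int or not low <= value <= high:
            raise ValueError(f"{name} must be an int in [{low}, {high}]")
    argv = []
    for token in spec["argv"]:
        if token.startswith("@"):          # parameter slot
            token = str(params[token[1:]])
        else:                              # late-bound environment placeholder
            token = ENV_SLOT.sub(lambda m: env[m.group(1)], token)
        argv.append(token)
    return argv  # intended for subprocess.run(argv), never shell=True


argv = build_argv("demo-eval", {"n": 5}, env={"MODEL_DIR": "/models/x"})
print(argv)
```

Because the caller can only pick a preset name and supply bounded values, a request cannot smuggle arbitrary commands onto the runner.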

- scripts/mac_bridge/setup_mac.sh: idempotent Mac-side installer (host shape, deps, Actions runner install/registration with kakeya-mac-m4 labels, model-location + HF-cache checks, bridge self-test, optional --with-tailscale for M2)
- scripts/mac_bridge/kakeya_mac.py: cloud-agent front door (doctor / run --wait / status); auto-detects AgentMemory branch policy and requests via AgentMemory/mac-bridge-<preset>-<nonce>-<sfx>
- workflow accepts both mac-bridge/** and AgentMemory/mac-bridge-*
- request_run.py: --branch-prefix/--branch-suffix; returns worktree to the original branch after pushing (one-click UX)
- docs: one-click sections in design doc + runner runbook
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Live testing caught both: a leading-dash branch suffix (-b876) was parsed as an option flag, and request_run's 'git add -A' silently absorbed unrelated uncommitted edits into the request branch (they vanished from the original branch on switch-back). Requests are now always built from a committed state. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
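The leading-dash failure is easy to reproduce with Python's argparse, which treats a separate argument that starts with "-" as an option rather than a value. The sketch below assumes an argparse-style CLI; whether request_run.py actually uses argparse, and how it was fixed, is not stated in the commit, and the accepts helper is illustrative only.

```python
import argparse


def accepts(argv):
    """Return the parsed suffix, or None if argparse rejects the command line."""
    parser = argparse.ArgumentParser(prog="request_run_sketch")
    parser.add_argument("--branch-suffix")
    try:
        return parser.parse_args(argv).branch_suffix
    except SystemExit:  # argparse reports usage errors by exiting
        return None


# Passed as a separate argument, "-b876" looks like another flag,
# so --branch-suffix is left without a value and parsing fails.
split_form = accepts(["--branch-suffix", "-b876"])

# Bound with "=", the same text is taken as the option's value.
joined_form = accepts(["--branch-suffix=-b876"])
print(split_form, joined_form)
```

Binding the value with "=" (or validating and normalising the suffix before it reaches the parser) avoids the ambiguity.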

scripts/setup_mac.sh enforced transformers <5.0 ('hard-pin to 4.x') from the legacy Qwen3 MDLM era, while requirements.txt had already dropped the upper bound; the K3 critical path (Gemma 4 verifier, DFlash drafter, current mlx-lm) requires transformers >= 5.0. On a current Mac install (transformers 5.11.0) verify_imports aborted with '5.11.0 >= forbidden upper 5.0'.
- verify_imports: transformers bound is now (>=4.45, no upper); comment points legacy-MDLM users at a dedicated 4.x venv (same guidance as requirements.txt)
- header/docs updated to the real venv rationale
- scripts/mac_bridge/setup_mac.sh: install deps into the runner's plain python3 (the interpreter Actions jobs actually use; the .venv-mac built by scripts/setup_mac.sh is for interactive dev) and report transformers K3-readiness explicitly
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

First live k3 preset run failed fast (2.3s, logs round-tripped): the repo-relative verifier default does not exist in the runner workspace and HF_HUB_OFFLINE turned the fallback lookup into a hard error. Default resolution is now: repo Actions variable > ~/kakeya-models/<name> (documented symlink convention on the runner host) > repo-relative. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
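The three-level default resolution described above can be sketched as an ordered list of candidate directories. This is an assumption-laden illustration: the environment variable name KAKEYA_VERIFIER_DIR, the repo-relative models/ layout and the function resolve_model_dir are all hypothetical, and the real code may not check for directory existence at each step.

```python
import os
import tempfile
from pathlib import Path


def resolve_model_dir(name, repo_root, env=None, home=None):
    """Pick the first existing candidate: override > ~/kakeya-models > repo."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    candidates = []
    override = env.get("KAKEYA_VERIFIER_DIR")  # hypothetical variable name
    if override:
        candidates.append(Path(override))
    candidates.append(home / "kakeya-models" / name)       # symlink convention
    candidates.append(Path(repo_root) / "models" / name)   # hypothetical layout
    for path in candidates:
        if path.is_dir():
            return path
    tried = ", ".join(str(p) for p in candidates)
    raise FileNotFoundError(f"no model directory found; tried: {tried}")


# Demo in a throwaway directory: only the home-level convention exists.
with tempfile.TemporaryDirectory() as tmp:
    fake_home = Path(tmp) / "home"
    (fake_home / "kakeya-models" / "verifier").mkdir(parents=True)
    picked = resolve_model_dir("verifier", Path(tmp) / "repo", env={}, home=fake_home)
    picked_rel = picked.relative_to(fake_home).as_posix()
print(picked_rel)
```

Listing every path that was tried in the error message keeps a fast failure diagnosable from the round-tripped logs.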

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

checkout@v4 lfs:true is not sufficient on a reused self-hosted workspace: a prior non-LFS checkout leaves pointer-content files that git does not re-smudge (blob unchanged), observed live as torch.load 'Unsupported operand 118'. git lfs pull + a pointer scan make k3 checkpoint loading deterministic. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
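A pointer scan of this kind can be quite small, because Git LFS pointer files are short text files that begin with a fixed version line. The sketch below is a minimal illustration, not the workflow's actual scan; the function names and the suffix list are assumptions. Note that the first byte of a pointer file is "v" (ASCII 118), which is consistent with the 'Unsupported operand 118' error quoted above.

```python
import tempfile
from pathlib import Path

# Every Git LFS pointer file starts with this line.
LFS_MAGIC = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """True if the file still holds pointer text instead of the real blob."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_MAGIC)) == LFS_MAGIC


def find_pointers(root, suffixes=(".pt", ".safetensors")):
    """Return checkpoint-like files under root that were never smudged."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes and is_lfs_pointer(p)
    )


# Demo: one real-looking binary and one un-smudged pointer.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "real.pt").write_bytes(b"PK\x03\x04 not a pointer")
    (root / "stub.pt").write_bytes(LFS_MAGIC + b"\noid sha256:0000\nsize 123\n")
    stale = [p.name for p in find_pointers(root)]
print(stale, ord("v"))
```

Failing the job when the scan returns anything turns a confusing deserialisation error into an explicit "LFS content missing" message.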

Live run exposed the overlay gap: validate_k3_reports.py + k3_report_gate.py existed only on the PR #109 branch, so requests built from a client checkout without them produced a request branch whose on-Mac evidence-gate step crashed (exit 2, file not found). The gate is part of the bridge's evidence discipline (BRIDGE_FILES lists it) — it now lives on this branch with its 68-test suite. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…68308-dc400e-b876

…after work) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

iterC (PR #109) proved the hybrid fused engine correct (recall 5/5 @ctx280, accept_len 2.1-2.9/4) but 0.028x decode-only: each block paid 4+ mx<->torch crossings plus a float32 CPU-torch drafter forward.
- inference_engine/backends/mlx/dflash_drafter.py: 1:1 MLX port of the torch DFlashDrafter fast path (same DFlashConfig, same checkpoint weights via mx.load, explicit fp32 RoPE tables, GQA via mx.fast.scaled_dot_product_attention, fc/hidden_norm/norm fusion, make/extend_context_kv + draft_block_cached) + native embed/lm_head (Gap-B preserved: no sqrt(hidden) scale; softcap on logits)
- fused_specdecode_generate: accepted-path aux expansion now routes through cat_aux_fn (runtime-agnostic; torch semantics unchanged)
- harness --all-mlx-drafter: native drafter + native embed/lm_head + identity aux bridge; requires --s5-exact-full-attn; drafter_runtime recorded per sample
- scripts/research/k3_mlx_drafter_parity.py: token-parity gate vs the torch reference on real verifier aux (blocks throughput claims)
- bridge presets: k3-drafter-parity, k3-step2-fused-allmlx
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

First Mac parity run: bf16 MLX vs fp32 torch = 94.79% (91/96) token agreement, prefix-consistent mismatches only, MLX draft 3.2x faster already. fp32-vs-fp32 must be exact to rule out port bugs vs dtype near-tie flips. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…r hybrid), parity-proven Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… recall Five arms over the SAME captured full-attn own K/V at ctx280 scale: identity (machinery control), affine 8/4-bit (mx.quantize, group 64 — the QuantizedKVCache storage format), KL D4/E8 (torch codec round trip, eval-time only). Per arm: measured bits/value, energy-weighted rel_mse, and REAL recall via lossy injection + incremental restored decode. Printed verdict: KL justifies an MLX port only if it beats affine4 on rel_mse at <= its rate without losing recall. Bridge preset: k3-kv-quant-eval. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
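For readers unfamiliar with the affine arms, the following pure-Python sketch shows group-wise affine quantization and the two per-arm metrics. It is a generic illustration, not mx.quantize or the evaluation script: the rel_mse definition (squared error over signal energy) is an interpretation of "energy-weighted", and the 32 metadata bits per group assume a 16-bit scale plus a 16-bit offset.

```python
import random


def affine_roundtrip(values, bits=4, group=64):
    """Quantize each group to 2**bits levels between its min and max, then decode."""
    levels = 2 ** bits - 1
    out = []
    for start in range(0, len(values), group):
        chunk = values[start:start + group]
        low, high = min(chunk), max(chunk)
        scale = (high - low) / levels or 1.0   # avoid a zero step on flat groups
        out.extend(round((v - low) / scale) * scale + low for v in chunk)
    return out


def rel_mse(reference, approx):
    """Squared error divided by the reference signal's energy."""
    err = sum((a - b) ** 2 for a, b in zip(reference, approx))
    return err / sum(a * a for a in reference)


def bits_per_value(bits=4, group=64, meta_bits=32):
    """Payload bits plus per-group metadata amortised over the group."""
    return bits + meta_bits / group


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(1024)]
err4 = rel_mse(data, affine_roundtrip(data, bits=4))
err8 = rel_mse(data, affine_roundtrip(data, bits=8))
print(bits_per_value(4), bits_per_value(8), err8 < err4)
```

Under these assumptions the 4-bit arm costs 4.5 bits per value and the 8-bit arm 8.5, which is the kind of rate a competing codec would have to match or beat.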

…LX port shelved Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

fused_specdecode_generate_mlx, with one host sync per block:
- (2) draft ids stay lazy mx tensors and feed the verifier forward in-graph (drafter.draft_block_ids + adapter.forward_block_lazy)
- (1) in-graph greedy acceptance (cumprod leading-match) + lazy gather of the next-position logits row; per block mx.eval materialises only the accept count and candidate ids; drafter-context extensions go through mx.async_eval
- (3) no correction forward: the gathered next-row makes the verifier's correction the next block's carried bonus, verified (and aux-captured) as position 0 of the next batched forward; it is guaranteed-accepted by construction, so every block commits >= 1 token and the loop can never run below AR pace
Harness uses the new loop automatically on --all-mlx-drafter.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
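The "cumprod leading-match" acceptance rule can be shown on plain Python lists. This is a sketch of the general technique under an assumed layout (one verifier prediction per draft position plus one extra row), not the MLX in-graph implementation; accept_length and commit are illustrative names.

```python
from itertools import accumulate
from operator import mul


def accept_length(draft_ids, verifier_ids):
    """Count leading positions where the draft equals the verifier's greedy pick.

    A running product of the 0/1 match flags stays 1 until the first
    mismatch and is 0 afterwards, so its sum is the accepted prefix length.
    """
    matches = [int(d == v) for d, v in zip(draft_ids, verifier_ids)]
    return sum(accumulate(matches, mul))


def commit(draft_ids, verifier_ids):
    """Return the accepted draft prefix and the verifier's next token.

    Assumes verifier_ids holds one prediction per draft position plus one
    extra row, so indexing at the accept count is always valid.
    """
    k = accept_length(draft_ids, verifier_ids[:len(draft_ids)])
    return draft_ids[:k], verifier_ids[k]


accepted, bonus = commit([5, 7, 9, 2], [5, 7, 4, 2, 8])
print(accepted, bonus)
```

Note that the later match at the fourth position does not count: once the prefix breaks, everything after it is rejected, and the verifier's own token at the break point is what gets carried forward.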

Live block-1 diagnostic proved levers (1)(3) correct (64/64 tokens, 17.5 tok/s carried-greedy); the fully fused drafter+26B graph was the failure (Metal command-buffer pathology: 143s evals, stream divergence). Materialise the small drafter graph first, keep in-graph acceptance + carried correction; 2 syncs/block vs eager 6+L. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Root cause of the v2 stream divergence (and retroactively iterC's 23-token sample): trim_prompt_cache is unsound on Gemma-4's hybrid cache once the sliding RotatingKVCache has wrapped — rejected draft K/V linger in the ring. v3: O(1) reference snapshot before each verify forward; on partial acceptance the WHOLE forward rolls back and the stream-committed tokens carry into the next candidate (guaranteed re-accept, K/V+aux recomputed correctly). Happy path (full accept) costs nothing. block-1 live diagnostic validated the carried-bonus machinery (64/64, 17.5 tok/s). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…fused at ~0.6-0.7x Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

All-MLX fused but WITHOUT --ignore-turn-stop, so generation ends at the real answer. For comparing mean_accept_len (natural-stop) vs the forced over-generation of k3-step2-fused-allmlx, to confirm on the real Mac that the low '2.13' accept is a forced-over-gen artifact, not a drafter/quant/restoration deficiency. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…preset Honest spec-decode throughput probe: all-MLX fused on naturally-long, predictable code-completion prompts (the spec-decode sweet spot), natural stop. Reports decode-only tok/s (fused vs oracle AR) + acceptance. --code-prompts skips the NIAH recall gate (recall N/A by design). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…pted, drop rejected)
Eliminates the v3 carry re-forward. Root cause: RotatingKVCache not trimmable once wrapped (is_trimmable -> offset<max_size), so v3 rolls the block back + re-forwards carried accepted tokens. Fix: prefill all-KVCache layout (sliding on full KVCache too -- byte-exact, window mask applies regardless of capacity) -> trim_prompt_cache is a sound O(1) slice on every layer.
- restored_prefill_cache: +cache_factory; fused_specdecode.make_full_kv_prompt_cache; fused_specdecode_generate_mlx_trim (forward L, keep accepted, trim L-k, no carry); adapter.prefill +full_kv; harness --cuda-trim; manifest k3-fused-allmlx-code-trim.
Linux: compiles; +1 UT; 4 pre-existing b876 failures unchanged.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
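The difference between trimming a growing cache and trimming a wrapped ring can be seen with two toy classes. These are deliberately simplified stand-ins, not mlx_lm's KVCache or RotatingKVCache: GrowingCache, RingCache and the string "rows" are invented for illustration.

```python
class GrowingCache:
    """Append-only store: trimming is a tail slice, so rejected rows vanish."""

    def __init__(self):
        self.rows = []

    def append(self, items):
        self.rows.extend(items)

    def trim(self, count):
        if count:
            del self.rows[-count:]


class RingCache:
    """Fixed-size ring: once wrapped, new rows overwrite the oldest ones."""

    def __init__(self, size):
        self.rows = [None] * size
        self.offset = 0

    def append(self, items):
        for item in items:
            self.rows[self.offset % len(self.rows)] = item
            self.offset += 1

    def trim(self, count):
        # Rewinding the counter cannot bring back rows that were overwritten.
        self.offset -= count


context = [f"c{i}" for i in range(6)]
drafts = ["d0", "d1", "d2", "d3"]   # verify a block of L = 4
accepted = 1                        # only d0 survives verification

grow, ring = GrowingCache(), RingCache(size=6)
for cache in (grow, ring):
    cache.append(context)           # the ring is exactly full here
    cache.append(drafts)            # forward all L positions
    cache.trim(len(drafts) - accepted)  # keep accepted, trim L - k

print(grow.rows)
print(ring.rows)
```

In the growing cache the result is the context followed by the single accepted row. In the ring, the rejected rows d1 to d3 are still present and c1 to c3 are gone, which mirrors the "rejected draft K/V linger in the ring" symptom and motivates running the sliding layers on a full cache during fused decode.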

fused_specdecode_generate_mlx_trim(single_fused=True): skip the two-phase eval so drafter+26B fuse into ONE graph (the b876-pathological path); report per-block eval times (first8/max/mean). Harness --single-fused + preset k3-fused-singlefused-probe (n=2,gen=16 so a pathological block is bounded). Classifies fundamental command-buffer limit (eval scales w/ graph) vs fixable SDPA fallback (eval huge even at small scale). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…y (K3 beta baseline)
Records the decode-throughput journey from ~0.09x AR (O(T^2) collapse) -> ~0.2x (cross-runtime bridge) -> ~0.5x (all-MLX + CUDA-parity trim rollback) -> ~0.7x (block-4) -> ~1.0x (block-8, AR parity) on Gemma-4-26B-A4B / Mac M4, with each binding problem + fix, the ruled-out non-levers (quant/length/alignment/sync/forced-over-gen artifact), the honest >AR-is-CUDA-favoured ceiling, and the evaluation environment (Mac bridge git-bus + self-hosted runner + evidence gate + H200). Recall 1.0 throughout; bounded S5 KV. Cross-links ADR 0009/0012/0013.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…esets Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 30 commits June 11, 2026 16:02

Fix Gemma4 NIAH recall smoke prompting

491d460

Stabilize MLX Step 2 throughput gate

c0c5d3c

Mac evidence: k3 gate sync iterC block4 ignoreturn n5 gen64

de6ed9e

Co-authored-by: Cursor <cursoragent@cursor.com>

mac-bridge: checkout with lfs:true (k3 presets load LFS checkpoints)

a16650d

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Sync Mac-local hardened harness (schema 2 + ignore-turn) used for iterC

d79e57a

Co-authored-by: Cursor <cursoragent@cursor.com>

k3 presets: --ignore-turn-stop so evidence runs decode the full budget

9bbe190

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

mac-bridge results: AgentMemory/mac-bridge-k3-step1-incremental-17812…

932ff48

…68308-dc400e-b876

Merge mac-bridge tooling (dev->Mac validation loop for the all-MLX dr…

6340e13

…after work) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

lessons: Step-2 rescue status — all-MLX drafter at 0.476x AR (17x ove…

a555e62

…r hybrid), parity-proven Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

lessons: kv-quant verdict — affine4 passes recall at 25x margin; KL M…

7cf1988

…LX port shelved Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

lessons: levers 1-3 verdict — trim bug exposed; true acceptance caps …

18e2fc4

…fused at ~0.6-0.7x Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 6 commits June 13, 2026 08:36

Beta: update preset allowlist test for the code/trim/natural/probe pr…

003b2bb

…esets Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

github-actions Bot added the needs-mac-m4 label Jun 13, 2026

FluffyAIcode marked this pull request as ready for review June 13, 2026 10:54

FluffyAIcode merged commit 9d5e6b4 into main Jun 13, 2026
7 of 8 checks passed

This was referenced Jun 13, 2026

bridge: k3-beta-scorecard preset — Kakeya vs MLX-only on main (#117) #118

Merged

evidence: GPU beta scorecard — Kakeya vs standalone AR on H200 (main #117) #119

Merged

ci: pin grpcio-tools==1.81.1 to fix proto stub drift (red CI badge on main) #121

Merged

FluffyAIcode merged 36 commits into main from AgentMemory/mac-beta-consolidated-2815
