Skip to content

MLX native restored-cache primitive — systemic fix for the Mac throughput collapse#110

Draft
FluffyAIcode wants to merge 4 commits into
AgentMemory/v04-mlx-port-incremental-decode-2815from
AgentMemory/v04-mlx-native-cache-primitive-2815
Draft

MLX native restored-cache primitive — systemic fix for the Mac throughput collapse#110
FluffyAIcode wants to merge 4 commits into
AgentMemory/v04-mlx-port-incremental-decode-2815from
AgentMemory/v04-mlx-native-cache-primitive-2815

Conversation

@FluffyAIcode

@FluffyAIcode FluffyAIcode commented Jun 11, 2026

Copy link
Copy Markdown
Owner

⛔ FORBIDDEN FOR ARCHITECTURE VALIDATION — product/throughput bypass only (2026-06-13 directive)

This native-cache path's recall comes from Gemma-4's native retained full-attention layers + native sliding-window eviction — it never exercises f_θ or proposer K/V restoration (ADR 0008 §11). It is therefore structurally incapable of failing in a way that tests the restoration architecture, so its "recall" results must not be cited as evidence that the K/V-Restoration architecture works (see the ADR 0012 revision 2026-06-13).

Do not use this primitive (or the Step-1 incremental path) for any architecture-validation attempt. The restoration architecture must be validated on a pure sliding-window model (Qwen3, the K1/K2 path) where recall is mathematically impossible without proposer/f_θ restoration. Gemma-4 may be used as a product model, never as the validation vehicle for architectural integrity.

Status of this PR: kept open as a labelled product bypass; the memory/throughput numbers are real, but it is not architecture evidence. (Close it in the GitHub UI if you'd prefer it withdrawn entirely — I can't close PRs from the agent.)


Why

Per the architecture review, the first MLX port made decode native but the cache lifecycle wasn't — and the naive Step-0 was unusable. Root cause (now confirmed against the authoritative mlx_lm 0.31.3 source): build_native_prefill_cache ran the whole ~5k-token NIAH prompt through one forward → O(T_q×T_k) SDPA materialization → Mac OOM (the same failure as the earlier ctx280 runs). mlx_lm's own generate_step avoids this by chunked prefill (prefill_step_size, generate.py ~430-443).

This PR makes the whole cache lifecycle native and fixes the OOM.

What

inference_engine/backends/mlx/native_restored_cache.py:

  • native_generate — pure end-to-end path: hands the whole prompt to mlx_lm's own generate_step (chunked prefill + async decode + optional kv_bits) over a native cache. mlx_lm's validated loop verbatim → cannot OOM on prefill. The default Step-0 path (no quantization).
  • build_native_prefill_cachechunked native prefill (mirrors generate_step) → (cache, last_logits); full-attn KVCache (exact own K/V → S5 recall), sliding RotatingKVCache (bounded). For the selective-quantization path.
  • quantize_full_attn_layers — full-attn KVCache → native QuantizedKVCache (real memory reduction, native quantized decode); sliding untouched.
  • set_kv_cache_state / inject_restored_into_native_cache — write K/V directly into native layout via .state.
  • native_restored_decode, cache_resident_bytes.

Verified against real source: multimodal gemma4.Model.make_cache delegates to language_model (KVCache + RotatingKVCache) → make_prompt_cache types are correct.

Key reframe (now qualified by the banner above): recall rides S5's exact full-attention K/V (native prefill gives it free) → no f_θ/drafter/patch/bridge in the loop. This is the collapse fix, not architecture validation.

Files

  • inference_engine/backends/mlx/native_restored_cache.py
  • scripts/research/k3_integrated_niah_eval_mac.py--native-cache, --quantize-full-attn-bits, --prefill-step-size
  • scripts/review_mlx_port_on_mac.sh — Step 0
  • tests/backends/mlx/test_native_restored_cache.py
  • docs/mlx-port-lessons.md

Validation

  • ✅ Linux: compiles; native_restored_cache.py 100% line-covered; 58 MLX tests pass.
  • ⚠️ MLX execution needs Apple Silicon. Note: any Mac "recall" from this path is Gemma-4 native behaviour, not architecture evidence (see banner).
Open in Web Open in Cursor 

cursoragent and others added 3 commits June 11, 2026 17:15
…collapse

New backends/mlx/native_restored_cache.py makes the whole cache lifecycle native
(addresses the review: split-SDPA was a local kernel win; the bottleneck was
prefill materialization / Python patch / lazy-eval sync / MLX-MPS bridge / non-
native cache shape):

- build_native_prefill_cache: ONE native prefill populates the model's own native
  cache with exact own K/V (full-attn KVCache -> S5 recall for free; sliding
  RotatingKVCache bounded). No attention patch, no separate capture_own_kv, no
  Python cache reconstruction. Optional one-shot aux tap in the same forward.
- set_kv_cache_state / inject_restored_into_native_cache: write K/V directly into
  the native layout via the cache .state setter (no wrapper object).
- quantize_full_attn_layers: full-attn KVCache -> native QuantizedKVCache (real
  resident-memory reduction + native quantized decode). cache_resident_bytes
  reports live nbytes.
- native_restored_decode: native generate_step over the native cache.

Recall rides S5's exact full-attn K/V (native prefill gives it free) -> NO f_theta
/drafter/patch/bridge in the loop -> the collapse fix. Wired into the Mac harness
via --native-cache [--quantize-full-attn-bits N]; review_mlx_port_on_mac.sh runs
it as Step 0; report adds oracle(native-AR) tok/s + speedup. docs updated.

Linux: compiles; native_restored_cache.py 100% line-covered; 56 MLX tests pass.
Remaining for fused>AR: single-runtime DFlash drafter in MLX (kill MLX<->MPS
bridge). MLX execution validated on Mac.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…1.3)

Downloaded the authoritative mlx_lm 0.31.3 source and verified generate_step does
CHUNKED prefill (prefill_step_size, lines ~430-443). The first build_native_prefill_cache
ran the whole ~5k-token NIAH prompt through ONE forward -> O(T_q x T_k) SDPA
materialization -> Mac OOM (the same failure as the earlier ctx280 runs), which
is why Step-0 was unusable.

- build_native_prefill_cache now processes the prompt in chunks of
  prefill_step_size (default 512), mx.eval per chunk to bound peak memory; mirrors
  generate_step's chunked prefill. Returns (cache, last_logits).
- Confirmed multimodal gemma4.Model.make_cache delegates to language_model
  (KVCache full-attn + RotatingKVCache sliding) -> make_prompt_cache types correct.
- Dropped the prefill aux-tap coupling (fused uses capture_aux_hidden separately).
- Harness: --prefill-step-size; --native-cache uses chunked prefill.
- Tests updated (chunked forwards [2,2,1]); native module 100% covered.

Linux: compiles; MLX tests pass. Mac validation still required (Apple Silicon).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…path

Add native_generate: hands the whole prompt to mlx_lm's own generate_step
(chunked prefill + async decode + optional kv_bits) over a native cache -> the
safest collapse-fix path (mlx_lm's validated loop verbatim, cannot OOM on
prefill). Harness --native-cache uses it when --quantize-full-attn-bits==0; uses
the selective chunked-prefill + full-attn-only quantize + decode path otherwise.

native_restored_cache.py 100% covered; 58 MLX tests pass.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…chitecture validation

Per 2026-06-13 directive: this native path's recall comes from Gemma-4's native
full-attn layers + sliding eviction, never exercising f_theta/proposer KV
restoration. Structurally incapable of testing the restoration architecture;
recall results must not be cited as architecture evidence (see ADR 0012 rev).
Validate restoration on a pure sliding-window model (Qwen3) instead.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants