fix(mlx-fused): long-decode degeneration past RotatingKVCache wrap + correct quality gate by FluffyAIcode · Pull Request #146 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-17T13:45:14Z

Summary

Fixes the long-decode degeneration in the Mac (MLX) fused spec-decode engine, where long answers (more than roughly 1024 tokens) collapsed into a runaway repeat (`由于由于…`, i.e. "because because…"). The root cause was found via a hypothesis-driven debug loop with runtime evidence on the Mac M4. The initial hypothesis was disproven, which also exposed a false-positive quality gate that this PR corrects.

Root cause (confirmed on Mac M4, prompt `请详细解释POW的工作原理`, "Please explain in detail how POW works")

- The decode cache is the model's native hybrid cache: sliding layers are `RotatingKVCache` (`max_size` ≈ 1024) and full layers are `KVCache`. The S5 `--window-size 64` setting only feeds the analytical memory math; it does not bound the decode cache.
- At 128 and 800 tokens the fused output is coherent; the bug appears only past the ring wrap (`offset ≥ max_size`, gen ≈ 1017).
- Once the ring wraps, `mlx_lm.trim_prompt_cache` is all-or-nothing and refuses to trim, because a rotating layer reports `is_trimmable` only while `offset < max_size`. The fused speculative loop's rejected-draft rollback then silently fails (15 `trim short:true` events), so `cache.offset` runs +8 ahead of the committed `past_len` on every post-wrap block → RoPE/causal misalignment → logit corruption → collapse.
- Decisive A/B: a native-greedy control on the same prompt stayed fully coherent past the wrap (clean termination at gen 1247), proving that the model handles >1024 tokens fine and that the fused engine was at fault.
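
The desync described above can be reproduced with a toy model. The sketch below is not the mlx_lm implementation: `ToyRotatingLayer`, `toy_trim`, and `run_block` are hypothetical stand-ins that encode only the two behaviours the investigation reports (a rotating layer is trimmable only while `offset < max_size`, and the trim is all-or-nothing). The block size of 9 with 1 accepted token is an illustrative choice that happens to produce the observed +8 drift.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyRotatingLayer:
    """Hypothetical stand-in for a sliding-window cache layer."""
    max_size: int
    offset: int = 0

    def is_trimmable(self) -> bool:
        # Assumption taken from the PR text: trimmable only before the wrap.
        return self.offset < self.max_size

    def append(self, n: int) -> None:
        self.offset += n


def toy_trim(layers: List[ToyRotatingLayer], n: int) -> int:
    """All-or-nothing trim: if any layer refuses, nothing is trimmed."""
    if not all(layer.is_trimmable() for layer in layers):
        return 0
    for layer in layers:
        layer.offset -= n
    return n


def run_block(layers: List[ToyRotatingLayer], past_len: int,
              block: int, accepted: int) -> Tuple[int, int]:
    """Write `block` tokens, keep `accepted`, try to roll back the rest.

    Returns the new committed length and the number of tokens trimmed.
    """
    for layer in layers:
        layer.append(block)
    rejected = block - accepted
    trimmed = toy_trim(layers, rejected) if rejected else 0
    return past_len + accepted, trimmed


# Before the wrap the rollback succeeds and offset tracks past_len.
pre = [ToyRotatingLayer(max_size=1024, offset=1000)]
pre_len, pre_trimmed = run_block(pre, 1000, block=9, accepted=1)

# At the wrap the trim is refused and offset runs ahead of past_len.
post = [ToyRotatingLayer(max_size=1024, offset=1024)]
post_len, post_trimmed = run_block(post, 1024, block=9, accepted=1)
drift = post[0].offset - post_len  # rejected K/V left in the cache
```

In this toy run the pre-wrap block ends with `offset == past_len`, while the post-wrap block leaves `drift` equal to the number of rejected tokens, which is the misalignment the native-greedy control never experiences because it has no rollback step.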

The fix (correctness-first)

Add `_sliding_ring_would_wrap()` plus an `if wrap_l1: L = 1` guard in `fused_specdecode_generate`: detect the impending wrap and commit single-token blocks past it. With `L = 1` the bonus token is always accepted (it is `argmax(next_token_logits)`), so there is never a rejected tail to trim and `offset` stays equal to `past_len`. Cost: the speculative speedup is forgone past `max_size`; a sound wrapped-ring rollback that would recover the performance is noted as an optional follow-up.
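
A minimal sketch of what such a wrap guard could look like follows. It is an illustration under assumptions, not the code in `fused_specdecode.py`: the function names, the `block_len` parameter, and the exact comparison are hypothetical. Only the `getattr(adapter, "_cache", None)` access and the "None means no wrap" behaviour are taken from the PR text.

```python
from typing import Any, Iterable, Optional


def sliding_ring_would_wrap(cache: Optional[Iterable[Any]], block_len: int) -> bool:
    """Return True if writing `block_len` tokens would leave any rotating
    layer at or past its max_size, where a rollback trim would be refused.

    A missing or empty cache is treated as "no wrap".
    """
    if not cache:
        return False
    for layer in cache:
        max_size = getattr(layer, "max_size", None)
        offset = getattr(layer, "offset", None)
        if max_size is None or offset is None:
            continue  # non-rotating layer, or no size metadata: cannot wrap
        if offset + block_len >= max_size:
            return True
    return False


def choose_block_len(adapter: Any, requested: int) -> int:
    """Fall back to single-token blocks once the ring would wrap."""
    cache = getattr(adapter, "_cache", None)  # tolerate adapters without a cache
    return 1 if sliding_ring_would_wrap(cache, requested) else requested
```

The design choice is to trade speed for correctness: a block of length 1 contains only the always-accepted token, so no rollback is ever requested in the region where the cache cannot honour one.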

Validation (re-run, 1300 tokens, Mac M4)

| signal | before | after |
| --- | --- | --- |
| `trim short:true` events | 15 | 0 |
| post-wrap offset desync | 76/76 blocks | 0/225 |
| post-wrap max `cyc_frac` | 1.0 (collapse) | 0.158 |
| fused final text | `由于由于…` runaway | coherent, clean stop @ gen 1241 (matches native control) |

kakeya_mlx_wrap_degeneration_fix.txt

Quality-gate correction (`k3_report_gate.py`)

The runtime evidence disproved the RESTORATION_COVERAGE theory ("restoration covers only ≤ window decode tokens"): a 1300-token run with 332 evicted-unrestored positions stayed coherent once the trim-desync was fixed. So "tokens > window" and even "evicted > 0" are not degeneration signals — the old rule was a pure false-positive (it would fail every coherent answer > 64 tokens).

- Remove the `RESTORATION_COVERAGE` rule.
- Add `_has_runaway_substring` so that `OUTPUT_DEGENERATE` catches the newline-free `由于由于…` ("because because…") collapse that the line-based detector missed. The check is conservative (a short unit tiled ≥8×) so that templated 矿工 ("miner") A/B/C enumerations do not false-fire.
- Gate module stays at 100% coverage.
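
One way to detect a tiled short unit at character level is sketched below. This is an assumption-laden illustration, not the `_has_runaway_substring` in `k3_report_gate.py`: the function name, the `max_unit` bound of 8 characters, and the scanning strategy are hypothetical. Only the "at least 8 back-to-back repeats of a short unit" criterion comes from the PR text.

```python
def has_runaway_substring(text: str, max_unit: int = 8, min_repeats: int = 8) -> bool:
    """Return True if some unit of 1..max_unit characters is tiled
    back-to-back at least `min_repeats` times anywhere in `text`."""
    n = len(text)
    for unit_len in range(1, max_unit + 1):
        needed = unit_len * min_repeats  # length of the shortest qualifying tiling
        run = 0  # consecutive positions i where text[i] == text[i - unit_len]
        for i in range(unit_len, n):
            if text[i] == text[i - unit_len]:
                run += 1
                # A run of matches plus the first unit is one periodic stretch.
                if run + unit_len >= needed:
                    return True
            else:
                run = 0
    return False


collapsed = "答案是" + "由于" * 8
enumerated = "矿工A 计算哈希；矿工B 计算哈希；矿工C 计算哈希；"
flags = (has_runaway_substring(collapsed), has_runaway_substring(enumerated))
```

Because the check needs no newlines, it covers the case a line-based detector cannot see. Note that this simple version also matches long runs of a single character such as `--------`, so a production gate would likely need to exempt such separators.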

Skill doc — reusable debugging template

Added the whole investigation to docs/kakeyainferenceenginebuildskill.md as a worked case-study template (§7): reproduce at increasing scale → add a native-greedy A/B control → instrument the indicated mechanism → fix correctness-first → re-validate, plus two generalizable lessons (runtime evidence overrides plausible hypotheses; gate on observed outcomes, not theorized proxies).

Changes

- `inference_engine/backends/mlx/fused_specdecode.py`: `_sliding_ring_would_wrap` + single-token commit past the ring wrap (uses `getattr(adapter, "_cache", None)`).
- `inference_engine/bench/k3_report_gate.py`: drop disproven `RESTORATION_COVERAGE`; add char-level runaway detector to `OUTPUT_DEGENERATE`.
- `scripts/research/k3_integrated_niah_eval_mac.py`: keep `--chat-native-ref` as a permanent A/B coherence control (now captures native text); Phase-1 KDBG instrumentation removed.
- `inference_engine/bridge/manifest.py`: `mlx-kakeya-degen-probe` repurposed as a long-decode regression guard.
- Tests: unit-cover `_sliding_ring_would_wrap`, `_has_runaway_substring`, and the corrected gate.
- Docs: `kakeyainferenceenginebuildskill.md` §5 row + §7 case-study template; `kakeya-autonomous-iteration-and-self-correction.md` confirmed root cause + "verify, don't trust the comment" lesson.

Testing

✅ pytest tests/inference_engine/bench/test_k3_report_gate.py (50 passed, gate 100% coverage)
✅ pytest tests/backends/mlx/test_fused_specdecode.py -k wrap (new helper tests)
✅ pytest tests/inference_engine/bridge/test_manifest.py (31 passed)
✅ Mac M4 1300-token validation run (table above)
⚠️ 4 tests in test_fused_specdecode.py (test_fused_loop_*, test_adapter_prefill_forward_commit) fail identically on origin/main — pre-existing stale fakes/expectations in the aux-capture path, unrelated to this fix.

Add stderr NDJSON 'KDBG' instrumentation to the fused spec-decode decode loops and the restored-prefill adapter to characterize the long-generation degeneration bug:
* prefill: log prompt_len, restored coverage (evicted range + layers), full_kv layout, and per-layer cache class/max_size/keep.
* per block (torch f_theta + mlx_trim loops): block idx, generated-token count, past_len, accepted, dt_ms, a cheap repetition signal (unique fraction + longest single-token run over last 32 tokens), the count of sliding-layer positions evicted DURING decode that have no restored K/V (lost), and a per-layer cache state summary.

Temporary debug-only; reverted after the fix is verified.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… long-decode degeneration
- cyc_frac/cyc_p phrase-cycle detection (max_run was blind to phrase loops)
- _kdbg_sync: per-block sliding-vs-full offset divergence (H2 trim-desync)
- commit_or_truncate trim event logging (short-trim smoking gun)
- final token-id dump for offline divergence comparison
- --chat-native-ref: plain native greedy control on identical prompt (H1 vs engine)
- degen-probe preset now runs the native control

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…vent offset desync

Root cause: once the ms=1024 sliding RotatingKVCache ring wraps (offset>=ms), mlx_lm.trim_prompt_cache refuses the rejected-draft rollback (all-or-nothing, is_trimmable requires offset<max_size). Un-trimmed rejected K/V leave cache.offset ahead of committed past_len (+8 observed), misaligning RoPE/mask -> logit corruption -> runaway repetition (由于由于...) onset ~gen1064.

Fix (Option A, correctness-first): detect when the sliding ring would wrap and commit single-token blocks (L=1). With L=1 the bonus token is always accepted (it is argmax(next_token_logits)), so drop==0 and trim is never called while wrapped -> offset stays == past_len, matching the coherent native AR path.

Validated by the native-greedy control: native (no spec rollback) stays coherent past 1024; only the fused trim path degenerated.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…test regressions) + unit-test wrap detector

The wrap guard accessed adapter._cache directly, which AttributeError'd on adapters/fakes without a cache (regressing the 3 test_fused_loop_* tests). Use getattr(adapter,'_cache',None); _sliding_ring_would_wrap already treats None as 'no wrap'. Add focused unit tests for the detector (wrap True/False, non-rotating, missing max_size, empty/None).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…PUT_DEGENERATE

Runtime evidence (Mac, 1300-tok run) disproved the 'restoration covers only <= window decode tokens' theory: the decode cache is the native RotatingKVCache (max_size~1024, not S5 window=64), and a run with 332 evicted-unrestored positions stayed fully coherent once the trim-desync bug was fixed. So tokens>window and even evicted>0 are NOT degeneration signals; the old rule was a pure false-positive (would fail every coherent answer >64 tokens).
- Remove the RESTORATION_COVERAGE token>window rule.
- Add _has_runaway_substring: catches the newline-free 由于由于… collapse the line-based _looks_degenerate missed; conservative (>=8x tiled short unit) so templated 矿工 A/B/C enumerations do not false-fire.
- OUTPUT_DEGENERATE now = line-wall OR runaway-substring (empirical signal only).
- Update tests (100% coverage) and the self-correction methodology doc with the confirmed root cause + the verify-don't-trust-the-comment lesson.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…eusable template (§7)
- §5: add the RotatingKVCache-wrap degeneration bug row.
- New §7: worked hypothesis-driven debugging template (reproduce at increasing scale, native-greedy A/B control, instrument the indicated mechanism, fix correctness-first, re-validate) + two generalizable lessons (runtime evidence overrides plausible hypotheses; gate on observed outcomes not theorized proxies).
- Renumber Validation §7->§8 and Pointers §8->§9; update cross-refs.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…codegen presets
- Remove the KAKEYA_KDBG-gated debug instrumentation (helpers + per-block emission + prefill_state/turn_compare) from fused_specdecode.py and k3_integrated_niah_eval_mac.py. Investigation complete.
- Keep the production fix (runaway-loop guard) + the --chat-scripted-file / --fused-no-loop-guard / --chat-native-ref flags.
- Repoint the two codegen presets to the multi-turn 'explain||code' chat (guard-off probe + guard-on validate), accurate descriptions; drop the now-unused pow_codegen_longprompt.txt fixture.

On-device (Mac M4): across short/long/multi-turn regimes the engine is coherent (fused==native); guard-on and guard-off outputs are byte-identical on the multi-turn code scenario -> the guard is inert on healthy output (no regression) and the systematic degeneration was already resolved by the wrap fix (#146).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 9 commits June 17, 2026 11:46

debug(mac-bridge): mlx-kakeya-degen-probe preset (Phase-1 long-decode…

0d1daa7

… degeneration characterization) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

debug(mlx-fused): capture native-ref text for coherence A/B in >wrap …

b56c6b4

…probe Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

chore(mlx-fused): remove Phase-1 KDBG instrumentation; keep wrap fix …

34c2d2f

…+ --chat-native-ref Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

docs(manifest): degen-probe is now a long-decode regression guard (dr…

7208318

…op stale KDBG mention) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

github-actions Bot added the needs-mac-m4 label Jun 17, 2026

FluffyAIcode mentioned this pull request Jun 17, 2026

test(mlx-fused): refresh stale fused-loop test fakes and expectations #147

Merged

FluffyAIcode marked this pull request as ready for review June 17, 2026 15:05

FluffyAIcode merged commit be0a35e into main Jun 17, 2026
7 of 8 checks passed

FluffyAIcode mentioned this pull request Jun 17, 2026

feat(mac-launcher): long-answer-safe defaults + full-mode validation preset #148

Merged

cursor Bot mentioned this pull request Jun 18, 2026

fix(mlx-fused): runaway-loop guard for greedy markdown-marker collapse on code prompts #149

Merged

FluffyAIcode merged 10 commits into main from AgentMemory/mac-continuous-decode-restoration-2815
