fix(mlx-fused): long-decode degeneration past RotatingKVCache wrap + correct quality gate #146
Merged
FluffyAIcode merged 10 commits on Jun 17, 2026
Add stderr NDJSON 'KDBG' instrumentation to the fused spec-decode decode loops and the restored-prefill adapter to characterize the long-generation degeneration bug:
* prefill: log prompt_len, restored coverage (evicted range + layers), full_kv layout, and per-layer cache class/max_size/keep.
* per block (torch f_theta + mlx_trim loops): block idx, generated-token count, past_len, accepted, dt_ms, a cheap repetition signal (unique fraction + longest single-token run over last 32 tokens), the count of sliding-layer positions evicted DURING decode that have no restored K/V (lost), and a per-layer cache state summary.
Temporary debug-only; reverted after the fix is verified.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
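The "cheap repetition signal" in the per-block log could look roughly like the sketch below. The function name `repetition_signal` and its return shape are assumptions made for illustration; only the two quantities (unique fraction and longest single-token run over the last 32 tokens) come from the commit message.

```python
from typing import Sequence, Tuple


def repetition_signal(token_ids: Sequence[int], window: int = 32) -> Tuple[float, int]:
    """Return (unique_fraction, longest_single_token_run) over the last `window` tokens.

    Hypothetical helper: name and signature are illustrative, not the engine's.
    """
    tail = list(token_ids[-window:])
    if not tail:
        return 1.0, 0
    # Fraction of distinct token ids in the tail; low values hint at repetition.
    unique_frac = len(set(tail)) / len(tail)
    # Longest run of one token id repeated back to back.
    longest = run = 1
    for prev, cur in zip(tail, tail[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return unique_frac, longest
```

Note that a two-token loop such as `[1, 2, 1, 2, ...]` drives the unique fraction down while the longest run stays at 1, which is the blind spot a later commit addresses with phrase-cycle detection.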
… degeneration characterization) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… long-decode degeneration
- cyc_frac/cyc_p phrase-cycle detection (max_run was blind to phrase loops)
- _kdbg_sync: per-block sliding-vs-full offset divergence (H2 trim-desync)
- commit_or_truncate trim event logging (short-trim smoking gun)
- final token-id dump for offline divergence comparison
- --chat-native-ref: plain native greedy control on identical prompt (H1 vs engine)
- degen-probe preset now runs the native control
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
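The commit does not spell out how cyc_frac and cyc_p are computed. One plausible reading, sketched below under that assumption, is a lag-match score: for each candidate period p, count how often a token equals the token p positions earlier, and report the best fraction together with its period. The function name, window, and maximum period are all hypothetical.

```python
from typing import Sequence, Tuple


def phrase_cycle(token_ids: Sequence[int], window: int = 64,
                 max_period: int = 16) -> Tuple[float, int]:
    """Return (cyc_frac, cyc_p) over the last `window` tokens.

    Assumed definition: cyc_frac is the best fraction of tail positions i with
    tail[i] == tail[i - p]; cyc_p is the smallest period p achieving it.
    """
    tail = list(token_ids[-window:])
    best_frac, best_p = 0.0, 0
    for p in range(1, min(max_period, len(tail) - 1) + 1):
        matches = sum(1 for i in range(p, len(tail)) if tail[i] == tail[i - p])
        frac = matches / (len(tail) - p)
        if frac > best_frac:  # strict '>' keeps the smallest winning period
            best_frac, best_p = frac, p
    return best_frac, best_p
```

Unlike a single-token run counter, this score reaches 1.0 on a repeated multi-token phrase, which is the case the commit says max_run missed.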
…probe Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…vent offset desync
Root cause: once the ms=1024 sliding RotatingKVCache ring wraps (offset>=ms), mlx_lm.trim_prompt_cache refuses the rejected-draft rollback (all-or-nothing, is_trimmable requires offset<max_size). Un-trimmed rejected K/V leave cache.offset ahead of committed past_len (+8 observed), misaligning RoPE/mask -> logit corruption -> runaway repetition (由于由于...) onset ~gen1064.
Fix (Option A, correctness-first): detect when the sliding ring would wrap and commit single-token blocks (L=1). With L=1 the bonus token is always accepted (it is argmax(next_token_logits)), so drop==0 and trim is never called while wrapped -> offset stays == past_len, matching the coherent native AR path.
Validated by the native-greedy control: native (no spec rollback) stays coherent past 1024; only the fused trim path degenerated.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
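The desync described above can be reproduced with a toy model that tracks only offsets. This is a sketch, not the mlx_lm implementation: `ToyRotatingCache` and `commit_block` are hypothetical names, and the block size of 9 drafted tokens is an assumption chosen so that one accepted token leaves the +8 gap mentioned in the commit.

```python
class ToyRotatingCache:
    """Toy stand-in for a sliding-window cache; tracks offset and max_size only."""

    def __init__(self, max_size: int, offset: int = 0):
        self.max_size = max_size
        self.offset = offset

    def append(self, n: int) -> None:
        self.offset += n

    def is_trimmable(self) -> bool:
        # Per the commit message: trimmable only while offset < max_size.
        return self.offset < self.max_size

    def trim(self, n: int) -> int:
        if not self.is_trimmable():
            return 0  # rollback refused; rejected entries stay counted
        n = min(n, self.offset)
        self.offset -= n
        return n


def commit_block(cache: ToyRotatingCache, past_len: int,
                 drafted: int, accepted: int) -> int:
    """Append `drafted` tokens, keep `accepted`, try to roll back the rest."""
    cache.append(drafted)
    drop = drafted - accepted
    if drop > 0:
        cache.trim(drop)
    return past_len + accepted


# Before the wrap the rollback succeeds and offset == past_len.
pre = ToyRotatingCache(max_size=1024, offset=1000)
pre_len = commit_block(pre, 1000, drafted=9, accepted=1)

# After the wrap the rollback is refused and offset runs ahead of past_len.
post = ToyRotatingCache(max_size=1024, offset=1020)
post_len = commit_block(post, 1020, drafted=9, accepted=1)

# With single-token blocks there is nothing to roll back, so no gap appears.
fixed = ToyRotatingCache(max_size=1024, offset=1020)
fixed_len = commit_block(fixed, 1020, drafted=1, accepted=1)
```

In this model `post.offset - post_len` is 8 while `fixed.offset - fixed_len` is 0, which is the property the L=1 fix relies on.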
…+ --chat-native-ref Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…test regressions) + unit-test wrap detector
The wrap guard accessed adapter._cache directly, which raised AttributeError on adapters/fakes without a cache (regressing the 3 test_fused_loop_* tests). Use getattr(adapter,'_cache',None); _sliding_ring_would_wrap already treats None as 'no wrap'. Add focused unit tests for the detector (wrap True/False, non-rotating, missing max_size, empty/None).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
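A minimal sketch of a detector with the behavior listed above follows. It is not the code in fused_specdecode.py: the function name, the duck-typed check on `max_size`/`offset`, and the `>=` threshold are assumptions, and the layers are plain `SimpleNamespace` fakes rather than real cache objects.

```python
from types import SimpleNamespace
from typing import Any, Optional, Sequence


def sliding_ring_would_wrap(cache: Optional[Sequence[Any]], block_len: int) -> bool:
    """True if appending `block_len` tokens would take any rotating layer to
    or past its max_size (assumed threshold: offset + block_len >= max_size)."""
    if not cache:  # None or empty: treated as 'no wrap'
        return False
    for layer in cache:
        max_size = getattr(layer, "max_size", None)
        offset = getattr(layer, "offset", None)
        if max_size is None or offset is None:
            continue  # non-rotating layer or missing max_size: cannot wrap
        if offset + block_len >= max_size:
            return True
    return False


def choose_block_len(adapter: Any, block_len: int) -> int:
    """Fall back to single-token blocks once the ring would wrap."""
    cache = getattr(adapter, "_cache", None)  # tolerate adapters without a cache
    return 1 if sliding_ring_would_wrap(cache, block_len) else block_len


# Fakes: one rotating layer near its limit, one plain layer without max_size.
near_wrap = SimpleNamespace(_cache=[SimpleNamespace(offset=1020, max_size=1024),
                                    SimpleNamespace(offset=1020)])
far_from_wrap = SimpleNamespace(_cache=[SimpleNamespace(offset=100, max_size=1024)])
no_cache = SimpleNamespace()
```

Reading the cache through `getattr(..., None)` keeps the guard a no-op for fakes that never define `_cache`, which is the regression this commit fixes.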
…PUT_DEGENERATE
Runtime evidence (Mac, 1300-tok run) disproved the 'restoration covers only <= window decode tokens' theory: the decode cache is the native RotatingKVCache (max_size~1024, not S5 window=64), and a run with 332 evicted-unrestored positions stayed fully coherent once the trim-desync bug was fixed. So tokens>window and even evicted>0 are NOT degeneration signals; the old rule was a pure false-positive (would fail every coherent answer >64 tokens).
- Remove the RESTORATION_COVERAGE token>window rule.
- Add _has_runaway_substring: catches the newline-free 由于由于… collapse the line-based _looks_degenerate missed; conservative (>=8x tiled short unit) so templated 矿工 A/B/C enumerations do not false-fire.
- OUTPUT_DEGENERATE now = line-wall OR runaway-substring (empirical signal only).
- Update tests (100% coverage) and the self-correction methodology doc with the confirmed root cause + the verify-don't-trust-the-comment lesson.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
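A character-level runaway check of the kind described here can be sketched with a backreference regex. Everything beyond the ">=8x tiled short unit" rule is an assumption: the function name, the 12-character cap on the unit, and the requirement that the unit contain a letter or digit (added so separator lines such as `========` do not fire).

```python
import re

# A short unit (1..12 chars, shortest first) followed by at least 7 more
# back-to-back copies, i.e. 8 or more tiles in total. The cap of 12 is a guess.
_TILED = re.compile(r"(.{1,12}?)\1{7,}", re.DOTALL)


def has_runaway_substring(text: str) -> bool:
    """True if `text` contains a short unit tiled 8 or more times in a row."""
    for match in _TILED.finditer(text):
        unit = match.group(1)
        # Ignore pure punctuation/whitespace tiles (assumed refinement).
        if any(ch.isalnum() for ch in unit):
            return True
    return False


collapsed = "由于" * 10                      # the newline-free collapse
templated = "矿工 A 挖到区块。矿工 B 验证区块。矿工 C 广播区块。"  # varied enumeration
```

Because the enumeration varies between repeats, no single short unit tiles eight times, so it passes while the collapsed string is flagged.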
…op stale KDBG mention) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…eusable template (§7)
- §5: add the RotatingKVCache-wrap degeneration bug row.
- New §7: worked hypothesis-driven debugging template (reproduce at increasing scale, native-greedy A/B control, instrument the indicated mechanism, fix correctness-first, re-validate) + two generalizable lessons (runtime evidence overrides plausible hypotheses; gate on observed outcomes not theorized proxies).
- Renumber Validation §7->§8 and Pointers §8->§9; update cross-refs.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor (bot) pushed a commit that referenced this pull request on Jun 18, 2026
…codegen presets
- Remove the KAKEYA_KDBG-gated debug instrumentation (helpers + per-block emission + prefill_state/turn_compare) from fused_specdecode.py and k3_integrated_niah_eval_mac.py. Investigation complete.
- Keep the production fix (runaway-loop guard) + the --chat-scripted-file / --fused-no-loop-guard / --chat-native-ref flags.
- Repoint the two codegen presets to the multi-turn 'explain||code' chat (guard-off probe + guard-on validate), accurate descriptions; drop the now-unused pow_codegen_longprompt.txt fixture.
On-device (Mac M4): across short/long/multi-turn regimes the engine is coherent (fused==native); guard-on and guard-off outputs are byte-identical on the multi-turn code scenario -> the guard is inert on healthy output (no regression) and the systematic degeneration was already resolved by the wrap fix (#146).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary
Fixes the long-decode degeneration in the Mac (MLX) fused spec-decode engine, where long answers (>~1024 tokens) collapsed into a runaway repeat (由于由于…, i.e. "because because…"). The root cause was found via a hypothesis-driven debug loop with runtime evidence on the Mac M4. The initial hypothesis was disproven, which also exposed a false-positive quality gate that this PR corrects.

Root cause (confirmed on Mac M4, prompt 请详细解释POW的工作原理, "Please explain in detail how PoW works")
- Sliding layers use RotatingKVCache (max_size≈1024); full layers are KVCache. The S5 --window-size 64 only feeds the analytical memory math; it does not bound the decode cache.
- Once the sliding ring wraps (offset ≥ max_size, gen≈1017), mlx_lm.trim_prompt_cache, which is all-or-nothing, refuses to trim (a rotating layer is is_trimmable only while offset < max_size). The fused speculative loop's rejected-draft rollback then silently fails (15 trim short:true events), so cache.offset runs +8 ahead of committed past_len on every post-wrap block → RoPE/causal misalignment → logit corruption → collapse.

The fix (correctness-first)
_sliding_ring_would_wrap() + if wrap_l1: L = 1 in fused_specdecode_generate: detect the impending wrap and commit single-token blocks past it. With L=1 the bonus token is always accepted (it is argmax(next_token_logits)), so there is never a rejected tail to trim and offset stays == past_len. Cost: the speculative speedup is forgone past max_size (a sound wrapped-ring rollback for perf is noted as an optional follow-up).

Validation (re-run, 1300 tokens, Mac M4)
Metrics reported: trim short:true events, cyc_frac, 由于由于… runaway.
Attachment: kakeya_mlx_wrap_degeneration_fix.txt

Quality-gate correction (k3_report_gate.py)
The runtime evidence disproved the RESTORATION_COVERAGE theory ("restoration covers only ≤ window decode tokens"): a 1300-token run with 332 evicted-unrestored positions stayed coherent once the trim-desync was fixed. So "tokens > window" and even "evicted > 0" are not degeneration signals; the old rule was a pure false positive (it would fail every coherent answer > 64 tokens).
- Removed the RESTORATION_COVERAGE rule.
- Added _has_runaway_substring so OUTPUT_DEGENERATE catches the newline-free 由于由于… collapse the line-based detector missed; conservative (≥8× tiled short unit) so templated 矿工 A/B/C ("Miner A/B/C") enumerations do not false-fire.

Skill doc: reusable debugging template
Added the whole investigation to docs/kakeyainferenceenginebuildskill.md as a worked case-study template (§7): reproduce at increasing scale → add a native-greedy A/B control → instrument the indicated mechanism → fix correctness-first → re-validate, plus two generalizable lessons (runtime evidence overrides plausible hypotheses; gate on observed outcomes, not theorized proxies).

Changes
- inference_engine/backends/mlx/fused_specdecode.py: _sliding_ring_would_wrap + single-token commit past the ring wrap (uses getattr(adapter, "_cache", None)).
- inference_engine/bench/k3_report_gate.py: drop disproven RESTORATION_COVERAGE; add char-level runaway detector to OUTPUT_DEGENERATE.
- scripts/research/k3_integrated_niah_eval_mac.py: keep --chat-native-ref as a permanent A/B coherence control (now captures native text); Phase-1 KDBG instrumentation removed.
- inference_engine/bridge/manifest.py: mlx-kakeya-degen-probe repurposed as a long-decode regression guard.
- Tests: _sliding_ring_would_wrap, _has_runaway_substring, and the corrected gate.
- Docs: kakeyainferenceenginebuildskill.md §5 row + §7 case-study template; kakeya-autonomous-iteration-and-self-correction.md confirmed root cause + "verify, don't trust the comment" lesson.

Testing
- pytest tests/inference_engine/bench/test_k3_report_gate.py (50 passed, gate 100% coverage)
- pytest tests/backends/mlx/test_fused_specdecode.py -k wrap (new helper tests)
- pytest tests/inference_engine/bridge/test_manifest.py (31 passed)
- The failing tests in test_fused_specdecode.py (test_fused_loop_*, test_adapter_prefill_forward_commit) fail identically on origin/main: pre-existing stale fakes/expectations in the aux-capture path, unrelated to this fix.