fix(mlx-fused): long-decode degeneration past RotatingKVCache wrap + correct quality gate#146

Merged: FluffyAIcode merged 10 commits into main from AgentMemory/mac-continuous-decode-restoration-2815 on Jun 17, 2026

@FluffyAIcode FluffyAIcode commented Jun 17, 2026


Summary

Fixes the long-decode degeneration in the Mac (MLX) fused spec-decode engine, where long answers (>~1024 tokens) collapsed into a runaway repeat (由于由于…, i.e. "because because…"). The root cause was found via a hypothesis-driven debug loop with runtime evidence on the Mac M4. The initial hypothesis was disproven, which also exposed a false-positive quality gate that this PR corrects.

Root cause (confirmed on Mac M4, prompt 请详细解释POW的工作原理, "Please explain in detail how PoW works")

  • The decode cache is the model's native hybrid cache: sliding layers are RotatingKVCache (max_size≈1024), full layers are KVCache. The S5 --window-size 64 only feeds the analytical memory math; it does not bound the decode cache.
  • At 128 and 800 tokens the fused output is coherent; the bug only appears past the ring wrap (offset ≥ max_size, gen≈1017).
  • Once the ring wraps, mlx_lm.trim_prompt_cache, which is all-or-nothing, refuses to trim (a rotating layer is is_trimmable only while offset < max_size). The fused speculative loop's rejected-draft rollback then fails silently (15 trim short:true events), so cache.offset runs +8 ahead of the committed past_len on every post-wrap block → RoPE/causal misalignment → logit corruption → collapse.
  • Decisive A/B: a native-greedy control on the same prompt stayed fully coherent past the wrap (clean termination at gen 1247), proving the model handles >1024 fine and the fused engine was at fault.
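The desync mechanism in the bullets above can be illustrated with a toy model. This is a minimal sketch, not the engine or mlx_lm code: `ToyRingCache` and `speculative_block` are hypothetical names, and the block length of 9 with 1 accepted token is an assumption chosen only to show a rejected tail being left in the cache.

```python
from dataclasses import dataclass


@dataclass
class ToyRingCache:
    """Toy stand-in for one sliding-layer cache (hypothetical, not mlx_lm)."""
    max_size: int
    offset: int = 0

    def is_trimmable(self) -> bool:
        # Behaviour described in the PR: trimmable only before the ring wraps.
        return self.offset < self.max_size

    def trim(self, n: int) -> int:
        # All-or-nothing: once wrapped, the rollback is refused outright.
        if not self.is_trimmable():
            return 0
        n = min(n, self.offset)
        self.offset -= n
        return n


def speculative_block(cache, past_len, block_len, accepted):
    """Write block_len positions, keep `accepted`, try to roll back the rest."""
    cache.offset += block_len
    trimmed = cache.trim(block_len - accepted)
    return past_len + accepted, trimmed


# Before the wrap the rollback succeeds and offset tracks past_len.
pre = ToyRingCache(max_size=1024, offset=100)
pre_len, _ = speculative_block(pre, past_len=100, block_len=9, accepted=1)

# At the wrap the rollback is refused and the rejected tail stays in the cache.
post = ToyRingCache(max_size=1024, offset=1024)
post_len, _ = speculative_block(post, past_len=1024, block_len=9, accepted=1)

print(pre.offset - pre_len, post.offset - post_len)  # 0 8
```

In the toy, the only difference between the two blocks is whether `trim` is honoured, which is the same distinction the native-greedy control isolates: a path that never rolls back never depends on the refused trim.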

The fix (correctness-first)

_sliding_ring_would_wrap() + if wrap_l1: L = 1 in fused_specdecode_generate: detect the impending wrap and commit single-token blocks past it. With L=1 the bonus token is always accepted (it is argmax(next_token_logits)), so there is never a rejected tail to trim and offset stays == past_len. Cost: the speculative speedup is forgone past max_size (a sound wrapped-ring rollback for perf is noted as optional follow-up).
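As a rough illustration of the guard described above, the following sketch shows one way such a wrap check could look. It is not the code from fused_specdecode.py: the function names, the signature, the `offset + block_len >= max_size` threshold, and the duck-typed detection of rotating layers via a `max_size` attribute are all assumptions.

```python
from types import SimpleNamespace


def ring_would_wrap_sketch(cache, block_len):
    """Hypothetical check: would writing block_len positions wrap any ring?"""
    if not cache:
        # None or an empty cache list is treated as "no wrap".
        return False
    for layer in cache:
        max_size = getattr(layer, "max_size", None)
        offset = getattr(layer, "offset", None)
        if max_size is None or offset is None:
            # Assumed: layers without max_size are non-rotating and skipped.
            continue
        if offset + block_len >= max_size:
            return True
    return False


def choose_block_len(cache, requested_len):
    """Fall back to single-token blocks once a wrap is impending."""
    return 1 if ring_would_wrap_sketch(cache, requested_len) else requested_len


full = SimpleNamespace(offset=1020)                    # no max_size
sliding = SimpleNamespace(offset=1020, max_size=1024)  # close to the wrap
early = SimpleNamespace(offset=500, max_size=1024)

print(choose_block_len([full, sliding], 8))  # 1
print(choose_block_len([full, early], 8))    # 8
print(choose_block_len(None, 8))             # 8
```

Under this threshold, once the offset reaches max_size the check stays true for every later block, so the loop keeps committing one token at a time and never needs a rollback.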

Validation (re-run, 1300 tokens, Mac M4)

| Signal | Before | After |
| --- | --- | --- |
| trim short:true events | 15 | 0 |
| post-wrap offset desync | 76/76 blocks | 0/225 |
| post-wrap max cyc_frac | 1.0 (collapse) | 0.158 |
| fused final text | 由于由于… runaway | coherent, clean stop @ gen 1241 (matches native control) |

kakeya_mlx_wrap_degeneration_fix.txt

Quality-gate correction (k3_report_gate.py)

The runtime evidence disproved the RESTORATION_COVERAGE theory ("restoration covers only ≤ window decode tokens"): a 1300-token run with 332 evicted-unrestored positions stayed coherent once the trim-desync was fixed. So "tokens > window" and even "evicted > 0" are not degeneration signals — the old rule was a pure false-positive (it would fail every coherent answer > 64 tokens).

  • Remove the RESTORATION_COVERAGE rule.
  • Add _has_runaway_substring so OUTPUT_DEGENERATE catches the newline-free 由于由于… collapse the line-based detector missed; conservative (≥8× tiled short unit) so templated 矿工 A/B/C ("miner A/B/C") enumerations do not false-fire.
  • Gate module stays at 100% coverage.
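The "≥8× tiled short unit" rule can be sketched as below. This is an illustrative guess at the idea, not the implementation in k3_report_gate.py: the function name, the `max_unit=8` cap on unit length, and the whitespace skip are assumptions.

```python
def has_runaway_substring_sketch(text, max_unit=8, min_repeats=8):
    """Hypothetical detector: True if a short unit is tiled back to back
    at least min_repeats times anywhere in text."""
    n = len(text)
    for unit_len in range(1, max_unit + 1):
        span = unit_len * min_repeats
        for start in range(0, n - span + 1):
            unit = text[start:start + unit_len]
            if not unit.strip():
                # Ignore runs made purely of whitespace.
                continue
            if text[start:start + span] == unit * min_repeats:
                return True
    return False


collapse = "由于" * 10
enumeration = "矿工A 计算哈希。矿工B 计算哈希。矿工C 计算哈希。"

print(has_runaway_substring_sketch(collapse))     # True
print(has_runaway_substring_sketch(enumeration))  # False
```

The enumeration repeats a template, but each item differs in one character, so no short unit tiles eight times in a row. Note that this sketch would also fire on long runs of a single punctuation character, which a production detector may want to exempt.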

Skill doc — reusable debugging template

Added the whole investigation to docs/kakeyainferenceenginebuildskill.md as a worked case-study template (§7): reproduce at increasing scale → add a native-greedy A/B control → instrument the indicated mechanism → fix correctness-first → re-validate, plus two generalizable lessons (runtime evidence overrides plausible hypotheses; gate on observed outcomes, not theorized proxies).

Changes

  • inference_engine/backends/mlx/fused_specdecode.py: _sliding_ring_would_wrap + single-token commit past the ring wrap (uses getattr(adapter, "_cache", None)).
  • inference_engine/bench/k3_report_gate.py: drop disproven RESTORATION_COVERAGE; add char-level runaway detector to OUTPUT_DEGENERATE.
  • scripts/research/k3_integrated_niah_eval_mac.py: keep --chat-native-ref as a permanent A/B coherence control (now captures native text); Phase-1 KDBG instrumentation removed.
  • inference_engine/bridge/manifest.py: mlx-kakeya-degen-probe repurposed as a long-decode regression guard.
  • Tests: unit-cover _sliding_ring_would_wrap, _has_runaway_substring, and the corrected gate.
  • Docs: kakeyainferenceenginebuildskill.md §5 row + §7 case-study template; kakeya-autonomous-iteration-and-self-correction.md confirmed root cause + "verify, don't trust the comment" lesson.

Testing

  • pytest tests/inference_engine/bench/test_k3_report_gate.py (50 passed, gate 100% coverage)
  • pytest tests/backends/mlx/test_fused_specdecode.py -k wrap (new helper tests)
  • pytest tests/inference_engine/bridge/test_manifest.py (31 passed)
  • ✅ Mac M4 1300-token validation run (table above)
  • ⚠️ 4 tests in test_fused_specdecode.py (test_fused_loop_*, test_adapter_prefill_forward_commit) fail identically on origin/main — pre-existing stale fakes/expectations in the aux-capture path, unrelated to this fix.


cursoragent and others added 9 commits June 17, 2026 11:46
Add stderr NDJSON 'KDBG' instrumentation to the fused spec-decode decode
loops and the restored-prefill adapter to characterize the long-generation
degeneration bug:

* prefill: log prompt_len, restored coverage (evicted range + layers),
  full_kv layout, and per-layer cache class/max_size/keep.
* per block (torch f_theta + mlx_trim loops): block idx, generated-token
  count, past_len, accepted, dt_ms, a cheap repetition signal (unique
  fraction + longest single-token run over last 32 tokens), the count of
  sliding-layer positions evicted DURING decode that have no restored K/V
  (lost), and a per-layer cache state summary.

Temporary debug-only; reverted after the fix is verified.
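
For reference, the cheap repetition signal described above (unique fraction plus longest single-token run over the last 32 tokens) could be computed roughly as follows. This is a sketch under assumptions, not the removed KDBG code: the function name and the return shape are hypothetical.

```python
def repetition_signal_sketch(token_ids, window=32):
    """Return (unique_fraction, longest_single_token_run) over the tail."""
    tail = list(token_ids)[-window:]
    if not tail:
        return 1.0, 0
    unique_frac = len(set(tail)) / len(tail)
    longest = run = 1
    for prev, cur in zip(tail, tail[1:]):
        # Extend the current run on a repeat, otherwise start a new one.
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return unique_frac, longest


print(repetition_signal_sketch([7] * 40))     # (0.03125, 32)
print(repetition_signal_sketch([1, 2] * 20))  # (0.0625, 1)
```

The second call shows why a longest-run metric alone misses a two-token phrase loop: the run length stays at 1 while the unique fraction drops sharply.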

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… degeneration characterization)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… long-decode degeneration

- cyc_frac/cyc_p phrase-cycle detection (max_run was blind to phrase loops)
- _kdbg_sync: per-block sliding-vs-full offset divergence (H2 trim-desync)
- commit_or_truncate trim event logging (short-trim smoking gun)
- final token-id dump for offline divergence comparison
- --chat-native-ref: plain native greedy control on identical prompt (H1 vs engine)
- degen-probe preset now runs the native control

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…probe

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…vent offset desync

Root cause: once the ms=1024 sliding RotatingKVCache ring wraps (offset>=ms),
mlx_lm.trim_prompt_cache refuses the rejected-draft rollback (all-or-nothing,
is_trimmable requires offset<max_size). Un-trimmed rejected K/V leave
cache.offset ahead of committed past_len (+8 observed), misaligning RoPE/mask
-> logit corruption -> runaway repetition (由于由于...) onset ~gen1064.

Fix (Option A, correctness-first): detect when the sliding ring would wrap and
commit single-token blocks (L=1). With L=1 the bonus token is always accepted
(it is argmax(next_token_logits)), so drop==0 and trim is never called while
wrapped -> offset stays == past_len, matching the coherent native AR path.

Validated by the native-greedy control: native (no spec rollback) stays
coherent past 1024; only the fused trim path degenerated.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…+ --chat-native-ref

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…test regressions) + unit-test wrap detector

The wrap guard accessed adapter._cache directly, which AttributeError'd on
adapters/fakes without a cache (regressing the 3 test_fused_loop_* tests).
Use getattr(adapter,'_cache',None); _sliding_ring_would_wrap already treats
None as 'no wrap'. Add focused unit tests for the detector (wrap True/False,
non-rotating, missing max_size, empty/None).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…PUT_DEGENERATE

Runtime evidence (Mac, 1300-tok run) disproved the 'restoration covers only
<= window decode tokens' theory: the decode cache is the native RotatingKVCache
(max_size~1024, not S5 window=64), and a run with 332 evicted-unrestored
positions stayed fully coherent once the trim-desync bug was fixed. So
tokens>window and even evicted>0 are NOT degeneration signals; the old rule was
a pure false-positive (would fail every coherent answer >64 tokens).

- Remove the RESTORATION_COVERAGE token>window rule.
- Add _has_runaway_substring: catches the newline-free 由于由于… collapse the
  line-based _looks_degenerate missed; conservative (>=8x tiled short unit) so
  templated 矿工 A/B/C enumerations do not false-fire.
- OUTPUT_DEGENERATE now = line-wall OR runaway-substring (empirical signal only).
- Update tests (100% coverage) and the self-correction methodology doc with the
  confirmed root cause + the verify-don't-trust-the-comment lesson.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…op stale KDBG mention)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…eusable template (§7)

- §5: add the RotatingKVCache-wrap degeneration bug row.
- New §7: worked hypothesis-driven debugging template (reproduce at increasing
  scale, native-greedy A/B control, instrument the indicated mechanism, fix
  correctness-first, re-validate) + two generalizable lessons (runtime evidence
  overrides plausible hypotheses; gate on observed outcomes not theorized
  proxies).
- Renumber Validation §7->§8 and Pointers §8->§9; update cross-refs.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@FluffyAIcode FluffyAIcode marked this pull request as ready for review June 17, 2026 15:05
@FluffyAIcode FluffyAIcode merged commit be0a35e into main Jun 17, 2026
7 of 8 checks passed
cursor Bot pushed a commit that referenced this pull request Jun 18, 2026
…codegen presets

- Remove the KAKEYA_KDBG-gated debug instrumentation (helpers + per-block
  emission + prefill_state/turn_compare) from fused_specdecode.py and
  k3_integrated_niah_eval_mac.py. Investigation complete.
- Keep the production fix (runaway-loop guard) + the --chat-scripted-file /
  --fused-no-loop-guard / --chat-native-ref flags.
- Repoint the two codegen presets to the multi-turn 'explain||code' chat
  (guard-off probe + guard-on validate), accurate descriptions; drop the now-
  unused pow_codegen_longprompt.txt fixture.

On-device (Mac M4): across short/long/multi-turn regimes the engine is coherent
(fused==native); guard-on and guard-off outputs are byte-identical on the
multi-turn code scenario -> the guard is inert on healthy output (no regression)
and the systematic degeneration was already resolved by the wrap fix (#146).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>