fix(mlx-fused): runaway-loop guard for greedy markdown-marker collapse on code prompts #149
Merged
cursor[bot] merged 8 commits on Jun 18, 2026
…ive-control probe KAKEYA_KDBG-gated per-block logging (sampled/committed ids, cyc_frac/cyc_p, cache offsets) in fused_specdecode_generate, and a turn_compare_fused_vs_native record (first_divergence_idx + both tails) in _run_fused_chat. New bridge preset mlx-kakeya-codegen-degen-probe runs the C-code prompt with --chat-native-ref to decide greedy-pathology vs engine bug. Instrumentation only; reverted after fix. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…efill) + multi-turn degen preset KAKEYA_KDBG-gated prefill_state_fused / prefill_state_native records in _run_fused_chat: per-turn prompt_len, evicted_count, rot/full cache offsets, any_wrapped, would_wrap_block0, plus a turn index on turn_compare. Repoints mlx-kakeya-codegen-degen-probe to the multi-turn repro (turn-1 PoW explanation pushes the turn-2 code prompt's prefill past the sliding window) at 1200 tok. Instrumentation only; reverted after fix. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…t prefill) Multi-turn+native at 1200x2 OOM'd the Mac runner. Per debug analysis, the cheapest test of H-C' (long-prompt prefill corrupts logits) vs H-A' (bounded- greedy pathology) is a single-turn LONG prompt that wraps the ring AT prefill (would_wrap_block0) with a tiny 192-tok budget. Add --chat-scripted-file so the ~2k-char context is a committed fixture (pow_codegen_longprompt.txt) instead of a giant manifest argv; repoint mlx-kakeya-codegen-degen-probe to it. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Repro evidence: single-turn fused decode is TOKEN-IDENTICAL to native greedy (first_divergence_idx=None) and coherent through 1200 tokens, so the engine is faithful — the user's '由于...'/'**/.2/*' collapse is greedy-decoding pathology on code/markdown-heavy prompts that the fused path (pure argmax, unlike chat_mlx_kakeya.py) had no mitigation for. Once a loop starts the drafter trivially predicts the repeats and the greedy verifier accepts them (high accept_len), so it walls indefinitely. Fix: _trailing_runaway_drop detects a 1..8-token unit repeated >=12x at the tail (conservative; never trims legit lists/enumerations/code) and the three fused loops stop generation, keeping a short clean tail instead of an unbounded wall. Default ON (stop_on_runaway=True); --fused-no-loop-guard disables it for degeneration probes. Adds stopped_on_runaway to the result. Also: --chat-scripted-file (long prompt as committed fixture) + repoint the codegen-degen probe to a single-turn long prompt that wraps the ring at prefill (cheap; the multi-turn+native variant OOM'd the Mac runner). KAKEYA_KDBG probe instrumentation retained (inert unless the env var is set) for the pending on-device H-C'-vs-H-A' confirmation. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…_lm); add guard-ON validation preset The 'env KAKEYA_KDBG=1 python3' prefix resolved a python3 without mlx_lm on the runner (ModuleNotFoundError). Drop it (KDBG instrumentation is now inert, which is also what we want for the final PR). The native_ref/text/stopped_on_runaway signals in the JSON are sufficient to characterize + validate. Add mlx-kakeya-codegen-guard-validate (guard ON) to prove the clean stop. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…get 1100) to reach the ~978-tok collapse onset Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ive-ref, budget 900 (matches the user's high-accept regime) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…codegen presets - Remove the KAKEYA_KDBG-gated debug instrumentation (helpers + per-block emission + prefill_state/turn_compare) from fused_specdecode.py and k3_integrated_niah_eval_mac.py. Investigation complete. - Keep the production fix (runaway-loop guard) + the --chat-scripted-file / --fused-no-loop-guard / --chat-native-ref flags. - Repoint the two codegen presets to the multi-turn 'explain||code' chat (guard-off probe + guard-on validate), accurate descriptions; drop the now- unused pow_codegen_longprompt.txt fixture. On-device (Mac M4): across short/long/multi-turn regimes the engine is coherent (fused==native); guard-on and guard-off outputs are byte-identical on the multi-turn code scenario -> the guard is inert on healthy output (no regression) and the systematic degeneration was already resolved by the wrap fix (#146). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Summary
Adds a conservative runaway-loop guard to the Mac fused spec-decode engine for the class of failure the user hit when asking for "实现一个PoW的代码,用c语言完成" ("implement PoW code, written in C"): the answer collapsed into a **/.2/* marker wall.
What the on-device investigation established (Mac M4)
- explain PoW || write PoW in C @900/turn (the user's exact flow): the fused output is coherent and fused==native (first_divergence_idx=None). No prefill/cache corruption.
- The 3.609)*-wall did not reproduce in any controlled run; it is a rare/intermittent greedy-pathology event, not a systematic engine bug.
- The fused path is pure argmax with no repetition penalty/loop guard (unlike chat_mlx_kakeya.py), so a rare greedy loop could wall unbounded.
Fix
_trailing_runaway_drop(ids) detects a 1–8-token unit repeated ≥12× back-to-back at the tail (conservative: never trims legit lists/enumerations/code), keeps 3 instances, and all three fused loops stop there. Default ON (stop_on_runaway=True); --fused-no-loop-guard disables it; the result gains stopped_on_runaway. Also adds --chat-scripted-file (long prompt as a file).
Validation
- Unit tests: _trailing_runaway_drop detect/trim + conservative non-trigger; test_fused_loop_stops_on_runaway_repeat (cuts a runaway to a clean short tail); test_fused_loop_runaway_guard_can_be_disabled.
- pytest tests/backends/mlx/test_fused_specdecode.py tests/inference_engine/bridge/test_manifest.py → 51 passed; branch CI green.
- mlx-kakeya-codegen-* regression presets (guard-off probe + guard-on validate) retained with accurate descriptions.
- pr149_guard_ondevice_validation.txt
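For readers who want the detection rule from the Fix section in concrete form, here is a minimal sketch. It is not the code in fused_specdecode.py: the function names, the "return the number of tokens to drop" convention, and the one-token-per-step loop are assumptions made for illustration (the real fused loops commit speculative blocks). Only the thresholds (1–8-token unit, ≥12 repeats, keep 3) come from the description above.

```python
from typing import Callable, Dict, List, Sequence


def trailing_runaway_drop_sketch(
    ids: Sequence[int],
    max_unit: int = 8,      # longest repeating unit considered (from the PR text)
    min_repeats: int = 12,  # back-to-back repeats needed to call it a runaway
    keep: int = 3,          # instances of the unit left in the output
) -> int:
    """Return how many trailing tokens to drop, or 0 if the tail is not a runaway."""
    ids = list(ids)
    n = len(ids)
    for unit in range(1, max_unit + 1):
        if n < unit * min_repeats:
            break  # longer units would need even more tokens
        # Walk backwards while the preceding chunk equals the chunk after it.
        repeats, pos = 1, n - unit
        while pos - unit >= 0 and ids[pos - unit:pos] == ids[pos:pos + unit]:
            repeats += 1
            pos -= unit
        if repeats >= min_repeats:
            return (repeats - keep) * unit
    return 0


def generate_with_guard_sketch(
    next_token: Callable[[List[int]], int],  # hypothetical greedy step function
    max_new_tokens: int,
    stop_on_runaway: bool = True,
) -> Dict[str, object]:
    """Toy greedy loop: stop and trim as soon as the tail becomes a runaway."""
    out: List[int] = []
    stopped = False
    for _ in range(max_new_tokens):
        out.append(next_token(out))
        if stop_on_runaway:
            drop = trailing_runaway_drop_sketch(out)
            if drop:
                del out[-drop:]  # keep a short clean tail
                stopped = True
                break
    return {"ids": out, "stopped_on_runaway": stopped}


# A degenerate "model" that always emits token 7.
guarded = generate_with_guard_sketch(lambda out: 7, max_new_tokens=50)
unguarded = generate_with_guard_sketch(lambda out: 7, max_new_tokens=50, stop_on_runaway=False)
```

In this sketch the guarded run stops once twelve identical tokens have accumulated and keeps three of them, while the unguarded run fills the whole budget. The ≥12 threshold is what makes the rule conservative: short repeats such as list markers or repeated code lines stay below it.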
Honest caveat
The guard could not be demonstrated engaging on a live collapse because the collapse no longer reproduces on current code (the systematic cause was fixed by #146). Its correctness is guaranteed by deterministic unit tests, and its safety (never harming coherent output) is confirmed on-device by the byte-identical guard-on/off A/B.
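The fused==native check and the guard-on/off A/B both reduce to finding the first position at which two outputs differ. The helper below is a hypothetical sketch of such a comparison, not the PR's implementation; in particular, reporting a length mismatch as a divergence at the shorter length is an assumption.

```python
from typing import Optional, Sequence


def first_divergence_idx_sketch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Index of the first differing position, or None if the sequences are identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # Assumption: one stream ending early counts as diverging at that point.
        return min(len(a), len(b))
    return None


fused_ids = [11, 42, 42, 9, 3]
native_ids = [11, 42, 42, 9, 3]
identical = first_divergence_idx_sketch(fused_ids, native_ids)
diverged = first_divergence_idx_sketch(fused_ids, [11, 42, 5, 9, 3])
```

A result of None corresponds to the first_divergence_idx=None reported above. The same function works for a byte-level A/B if the two outputs are passed as bytes objects.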
Changes
- inference_engine/backends/mlx/fused_specdecode.py: _trailing_runaway_drop + stop_on_runaway guard in all 3 fused loops; stopped_on_runaway result field.
- scripts/research/k3_integrated_niah_eval_mac.py: --chat-scripted-file, --fused-no-loop-guard; pass stop_on_runaway through.
- inference_engine/bridge/manifest.py: mlx-kakeya-codegen-degen-probe (guard off) + mlx-kakeya-codegen-guard-validate (guard on) regression presets.
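As a rough illustration of how a default-ON guard with an opt-out flag can be wired, the sketch below maps the two flag names from this PR onto a stop_on_runaway value using argparse. The parser layout, help strings, and function names are assumptions; the actual wiring in k3_integrated_niah_eval_mac.py may differ.

```python
import argparse
from typing import List, Optional


def build_parser_sketch() -> argparse.ArgumentParser:
    # Hypothetical parser; only the two flag names are taken from the PR.
    parser = argparse.ArgumentParser(description="flag wiring sketch")
    parser.add_argument(
        "--chat-scripted-file",
        default=None,
        help="path to a file holding the (long) scripted prompt",
    )
    parser.add_argument(
        "--fused-no-loop-guard",
        action="store_true",
        help="disable the runaway-loop guard (for degeneration probes)",
    )
    return parser


def resolve_stop_on_runaway(argv: Optional[List[str]] = None) -> bool:
    """Guard is ON unless --fused-no-loop-guard is passed."""
    args = build_parser_sketch().parse_args(argv)
    return not args.fused_no_loop_guard


default_guard = resolve_stop_on_runaway([])
probe_guard = resolve_stop_on_runaway(["--fused-no-loop-guard"])
```

Using a negative store_true flag keeps the safe behaviour as the default, so only the guard-off probe preset has to pass anything extra.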