fix(mlx-fused): runaway-loop guard for greedy markdown-marker collapse on code prompts by FluffyAIcode · Pull Request #149 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-17T17:31:19Z

Summary

Adds a conservative runaway-loop guard to the Mac fused spec-decode engine. It targets the class of failure the user hit when asking for "实现一个PoW的代码，用c语言完成" ("implement PoW code, written in C"), where the answer collapsed into a **/.2/* marker wall.

What the on-device investigation established (Mac M4)

The engine is faithful to native greedy: across every regime tested on current code — short code prompt @1200, long prompt (ring wrapped at prefill) @192 and @810, and the multi-turn explain PoW || write PoW in C @900/turn (the user's exact flow) — the fused output is coherent and fused==native (first_divergence_idx=None). No prefill/cache corruption.
The systematic degeneration was already resolved by the wrap fix (fix(mlx-fused): long-decode degeneration past RotatingKVCache wrap + correct quality gate #146). The user's specific high-acceptance (3.609) *-wall did not reproduce in any controlled run — it's a rare/intermittent greedy-pathology event, not a systematic engine bug.
The guard is the right safety net for that pathology class: the fused path uses pure argmax with no repetition penalty or loop guard (unlike chat_mlx_kakeya.py), so a rare greedy loop could otherwise produce an unbounded wall of repeated tokens.
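The fused==native check above reduces to finding the first position where two token streams differ. A minimal sketch follows, assuming both runs are available as lists of token ids. The function name mirrors the `first_divergence_idx` field reported in the results, but this body is illustrative and not the PR's code. Treating a length mismatch as a divergence at the shorter length is also an assumption.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_divergence_idx(fused: Sequence[int], native: Sequence[int]) -> Optional[int]:
    """Return the first index where the two token streams differ.

    Returns None when the streams are identical. ASSUMPTION: a length
    mismatch counts as a divergence at the end of the shorter stream.
    """
    missing = object()  # sentinel that never equals a real token id
    pairs = zip_longest(fused, native, fillvalue=missing)
    for i, (a, b) in enumerate(pairs):
        if a != b:
            return i
    return None


if __name__ == "__main__":
    print(first_divergence_idx([1, 2, 3], [1, 2, 3]))  # identical streams
    print(first_divergence_idx([1, 2, 3], [1, 9, 3]))  # differ at index 1
```

A result of None on every regime is what the investigation reports as "the engine is faithful to native greedy".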

Fix

_trailing_runaway_drop(ids) detects a 1–8-token unit repeated ≥12× back-to-back at the tail (conservative — never trims legit lists/enumerations/code), keeps 3 instances, and all three fused loops stop there. Default ON (stop_on_runaway=True); --fused-no-loop-guard disables it; result gains stopped_on_runaway. Also adds --chat-scripted-file (long prompt as a file).
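The detection rule described above can be sketched as follows. This is an illustration built only from the numbers stated in this description (unit length 1 to 8, at least 12 back-to-back repeats, keep 3 instances). The function name, parameters, and the "return how many trailing tokens to drop" convention are assumptions, not the actual `_trailing_runaway_drop` implementation.

```python
from typing import Sequence


def trailing_runaway_drop_sketch(
    ids: Sequence[int],
    max_unit: int = 8,      # longest repeating unit considered (from the PR text)
    min_repeats: int = 12,  # repeats required before it counts as a runaway
    keep: int = 3,          # instances of the unit left in the output
) -> int:
    """Return how many trailing tokens to drop (0 means no runaway found)."""
    ids = list(ids)
    n = len(ids)
    for unit in range(1, max_unit + 1):
        if n < unit * min_repeats:
            break  # longer units need even more tokens, so stop here
        tail = ids[n - unit:]
        repeats = 1
        pos = n - 2 * unit
        # Walk backwards one unit at a time while the unit keeps repeating.
        while pos >= 0 and ids[pos:pos + unit] == tail:
            repeats += 1
            pos -= unit
        if repeats >= min_repeats:
            return (repeats - keep) * unit
    return 0


if __name__ == "__main__":
    runaway = [5, 6, 7] + [1, 2] * 15
    drop = trailing_runaway_drop_sketch(runaway)
    print(drop, runaway[: len(runaway) - drop])
    print(trailing_runaway_drop_sketch(list(range(1, 41))))  # enumeration, no repeat
```

The high repeat threshold is what keeps the check conservative: a numbered list or a block of similar code lines rarely repeats the same short token unit 12 times in a row, so healthy output is left untouched.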

Validation

✅ On-device A/B (Mac M4), user's exact multi-turn scenario: guard-OFF and guard-ON runs are byte-identical and coherent (turn-1 explanation + turn-2 valid C code) — proving the guard is inert on healthy output (no regression) while standing ready to cut a runaway.
✅ Deterministic unit tests: _trailing_runaway_drop detect/trim + conservative non-trigger; test_fused_loop_stops_on_runaway_repeat (cuts a runaway to a clean short tail); test_fused_loop_runaway_guard_can_be_disabled.
✅ pytest tests/backends/mlx/test_fused_specdecode.py tests/inference_engine/bridge/test_manifest.py → 51 passed; branch CI green.
KDBG probe instrumentation added during the investigation has been fully removed (0 references); two mlx-kakeya-codegen-* regression presets (guard-off probe + guard-on validate) retained with accurate descriptions.

pr149_guard_ondevice_validation.txt

Honest caveat

The guard could not be demonstrated engaging on a live collapse because the collapse no longer reproduces on current code (the systematic cause was fixed by #146). Its correctness is verified by deterministic unit tests, and its safety (never harming coherent output) is confirmed on-device by the byte-identical guard-on/off A/B.

Changes

inference_engine/backends/mlx/fused_specdecode.py: _trailing_runaway_drop + stop_on_runaway guard in all 3 fused loops; stopped_on_runaway result field.
scripts/research/k3_integrated_niah_eval_mac.py: --chat-scripted-file, --fused-no-loop-guard; pass stop_on_runaway through.
inference_engine/bridge/manifest.py: mlx-kakeya-codegen-degen-probe (guard off) + mlx-kakeya-codegen-guard-validate (guard on) regression presets.
Tests for the guard + presets.
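As an illustration of how the two new flags could map onto the `stop_on_runaway` option, here is a minimal argparse sketch. The flag names and the default-ON behaviour come from this description. The parser layout, help strings, and the helper name `stop_on_runaway_from_args` are assumptions and do not reproduce the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical wiring for the two flags named in the PR description."""
    parser = argparse.ArgumentParser(description="flag wiring sketch (illustrative)")
    parser.add_argument(
        "--chat-scripted-file",
        default=None,
        help="path to a file holding a long scripted prompt",
    )
    parser.add_argument(
        "--fused-no-loop-guard",
        action="store_true",
        help="disable the runaway-loop guard (for degeneration probes)",
    )
    return parser


def stop_on_runaway_from_args(args: argparse.Namespace) -> bool:
    """The guard defaults to ON; the flag only turns it off."""
    return not args.fused_no_loop_guard


if __name__ == "__main__":
    args = build_parser().parse_args(["--fused-no-loop-guard"])
    print(stop_on_runaway_from_args(args))
```

Expressing the option as an opt-out flag keeps the safe behaviour as the default, so only deliberate probe runs see unguarded output.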

…ive-control probe KAKEYA_KDBG-gated per-block logging (sampled/committed ids, cyc_frac/cyc_p, cache offsets) in fused_specdecode_generate, and a turn_compare_fused_vs_native record (first_divergence_idx + both tails) in _run_fused_chat. New bridge preset mlx-kakeya-codegen-degen-probe runs the C-code prompt with --chat-native-ref to decide greedy-pathology vs engine bug. Instrumentation only; reverted after fix. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…efill) + multi-turn degen preset KAKEYA_KDBG-gated prefill_state_fused / prefill_state_native records in _run_fused_chat: per-turn prompt_len, evicted_count, rot/full cache offsets, any_wrapped, would_wrap_block0, plus a turn index on turn_compare. Repoints mlx-kakeya-codegen-degen-probe to the multi-turn repro (turn-1 PoW explanation pushes the turn-2 code prompt's prefill past the sliding window) at 1200 tok. Instrumentation only; reverted after fix. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…t prefill) Multi-turn+native at 1200x2 OOM'd the Mac runner. Per debug analysis, the cheapest test of H-C' (long-prompt prefill corrupts logits) vs H-A' (bounded-greedy pathology) is a single-turn LONG prompt that wraps the ring AT prefill (would_wrap_block0) with a tiny 192-tok budget. Add --chat-scripted-file so the ~2k-char context is a committed fixture (pow_codegen_longprompt.txt) instead of a giant manifest argv; repoint mlx-kakeya-codegen-degen-probe to it. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Repro evidence: single-turn fused decode is TOKEN-IDENTICAL to native greedy (first_divergence_idx=None) and coherent through 1200 tokens, so the engine is faithful. The user's '由于...' ("because...") / '**/.2/*' collapse is a greedy-decoding pathology on code/markdown-heavy prompts that the fused path (pure argmax, unlike chat_mlx_kakeya.py) had no mitigation for. Once a loop starts, the drafter trivially predicts the repeats and the greedy verifier accepts them (high accept_len), so it walls indefinitely.

Fix: _trailing_runaway_drop detects a 1..8-token unit repeated >=12x at the tail (conservative; never trims legit lists/enumerations/code) and the three fused loops stop generation, keeping a short clean tail instead of an unbounded wall. Default ON (stop_on_runaway=True); --fused-no-loop-guard disables it for degeneration probes. Adds stopped_on_runaway to the result.

Also: --chat-scripted-file (long prompt as committed fixture) + repoint the codegen-degen probe to a single-turn long prompt that wraps the ring at prefill (cheap; the multi-turn+native variant OOM'd the Mac runner). KAKEYA_KDBG probe instrumentation retained (inert unless the env var is set) for the pending on-device H-C'-vs-H-A' confirmation.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
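Why a loop shows up as a high accept_len can be seen in a toy version of greedy speculative acceptance. The sketch below assumes the textbook scheme: draft tokens are accepted while they match the verifier's argmax, then the verifier contributes one token of its own. The function name and the "verifier supplies one extra position" layout are assumptions; the engine's actual verification code is not shown in this PR.

```python
from typing import List, Sequence, Tuple


def greedy_accept(
    draft: Sequence[int], verifier_argmax: Sequence[int]
) -> Tuple[int, List[int]]:
    """Toy greedy speculative-decode acceptance.

    ASSUMPTION: verifier_argmax has len(draft) + 1 entries, where entry i is
    the verifier's greedy choice for the position of draft[i] and the last
    entry is the bonus token used when every draft token is accepted.
    """
    accept_len = 0
    for d, v in zip(draft, verifier_argmax):
        if d != v:
            break
        accept_len += 1
    # Accepted draft prefix plus the verifier's own token at the stop point.
    committed = list(draft[:accept_len]) + [verifier_argmax[accept_len]]
    return accept_len, committed


if __name__ == "__main__":
    # Inside a repetition loop the drafter guesses the cycle and the greedy
    # verifier agrees, so the whole block is accepted.
    print(greedy_accept([1, 2, 1, 2], [1, 2, 1, 2, 1]))
    # On ordinary text the drafter is wrong sooner and less is accepted.
    print(greedy_accept([1, 2, 3, 4], [1, 2, 9, 4, 5]))
```

Under this scheme every committed token equals the verifier's greedy choice, so a loop in native greedy decoding is reproduced faithfully and quickly. That is consistent with the guard acting on the output tail rather than on the acceptance step.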

…_lm); add guard-ON validation preset The 'env KAKEYA_KDBG=1 python3' prefix resolved a python3 without mlx_lm on the runner (ModuleNotFoundError). Drop it (KDBG instrumentation is now inert, which is also what we want for the final PR). The native_ref/text/stopped_on_runaway signals in the JSON are sufficient to characterize + validate. Add mlx-kakeya-codegen-guard-validate (guard ON) to prove the clean stop. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…codegen presets
- Remove the KAKEYA_KDBG-gated debug instrumentation (helpers + per-block emission + prefill_state/turn_compare) from fused_specdecode.py and k3_integrated_niah_eval_mac.py. Investigation complete.
- Keep the production fix (runaway-loop guard) + the --chat-scripted-file / --fused-no-loop-guard / --chat-native-ref flags.
- Repoint the two codegen presets to the multi-turn 'explain||code' chat (guard-off probe + guard-on validate), accurate descriptions; drop the now-unused pow_codegen_longprompt.txt fixture.

On-device (Mac M4): across short/long/multi-turn regimes the engine is coherent (fused==native); guard-on and guard-off outputs are byte-identical on the multi-turn code scenario -> the guard is inert on healthy output (no regression) and the systematic degeneration was already resolved by the wrap fix (#146).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 4 commits June 17, 2026 15:55

github-actions Bot added the needs-mac-m4 label Jun 17, 2026

cursoragent and others added 4 commits June 18, 2026 04:23

debug(probe): long single-decode A/B (drop native-ref for memory, bud…

f8a7a9a

…get 1100) to reach the ~978-tok collapse onset Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

debug(probe): multi-turn (explanation->code) guard-off/on A/B, no nat…

85abe81

…ive-ref, budget 900 (matches the user's high-accept regime) Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursor Bot merged commit 5c1bc29 into main Jun 18, 2026
7 of 8 checks passed

cursor[bot] merged 8 commits into main from AgentMemory/fused-codegen-degeneration-fix-2815
