
fix(mlx-fused): runaway-loop guard for greedy markdown-marker collapse on code prompts #149

Merged
cursor[bot] merged 8 commits into main from AgentMemory/fused-codegen-degeneration-fix-2815 on Jun 18, 2026

@FluffyAIcode (Owner) commented Jun 17, 2026

Summary

Adds a conservative runaway-loop guard to the Mac fused spec-decode engine. It targets the class of failure the user hit when asking for "实现一个PoW的代码,用c语言完成" ("Implement PoW code, written in C"), where the answer collapsed into a wall of **/.2/* markers.

What the on-device investigation established (Mac M4)

  • The engine is faithful to native greedy: across every regime tested on current code — short code prompt @1200, long prompt (ring wrapped at prefill) @192 and @810, and the multi-turn explain PoW || write PoW in C @900/turn (the user's exact flow) — the fused output is coherent and fused==native (first_divergence_idx=None). No prefill/cache corruption.
  • The systematic degeneration was already resolved by the wrap fix (fix(mlx-fused): long-decode degeneration past RotatingKVCache wrap + correct quality gate #146). The user's specific high-acceptance (3.609) *-wall did not reproduce in any controlled run — it's a rare/intermittent greedy-pathology event, not a systematic engine bug.
  • The guard is the right safety net for that pathology class: the fused path uses pure argmax with no repetition penalty or loop guard (unlike chat_mlx_kakeya.py), so a rare greedy loop could produce an unbounded wall of repeated tokens.

Fix

_trailing_runaway_drop(ids) detects a 1–8-token unit repeated ≥12× back-to-back at the tail. The threshold is conservative, so it never trims legitimate lists, enumerations, or code. On detection the guard keeps 3 instances of the unit, and all three fused loops stop there. The guard is ON by default (stop_on_runaway=True); --fused-no-loop-guard disables it, and the result gains a stopped_on_runaway field. The PR also adds --chat-scripted-file (long prompt as a file).
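The detection rule above can be illustrated with a minimal sketch. This is not the code in fused_specdecode.py: the function name `trailing_runaway_drop_sketch`, its parameters, and the choice to return a drop count are assumptions made for illustration. Only the thresholds (unit of 1 to 8 tokens, at least 12 repeats, keep 3) come from the description.

```python
from typing import Sequence


def trailing_runaway_drop_sketch(
    ids: Sequence[int],
    max_unit: int = 8,      # longest repeating unit considered (from the PR text)
    min_repeats: int = 12,  # back-to-back repeats required to trigger
    keep: int = 3,          # instances of the unit left in the output
) -> int:
    """Return how many trailing tokens to drop, or 0 if the tail looks healthy.

    Hypothetical helper: the real function's signature and return value
    are not shown in the PR.
    """
    n = len(ids)
    for unit in range(1, max_unit + 1):
        if n < unit * min_repeats:
            break  # longer units would need even more tokens
        pattern = list(ids[n - unit:])
        repeats = 1
        start = n - 2 * unit
        # Walk backwards one unit at a time while the unit keeps matching.
        while start >= 0 and list(ids[start:start + unit]) == pattern:
            repeats += 1
            start -= unit
        if repeats >= min_repeats:
            return (repeats - keep) * unit
    return 0
```

For example, with `ids = [1, 2, 3] + [7, 8] * 15` the sketch returns 24, and `ids[:len(ids) - 24]` is `[1, 2, 3, 7, 8, 7, 8, 7, 8]`: the prefix plus three instances of the unit. A 22-token tail such as `[1, 2] * 11` stays below the 12-repeat threshold and is left alone.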

Validation

  • On-device A/B (Mac M4), user's exact multi-turn scenario: guard-OFF and guard-ON runs are byte-identical and coherent (turn-1 explanation + turn-2 valid C code) — proving the guard is inert on healthy output (no regression) while standing ready to cut a runaway.
  • Deterministic unit tests: _trailing_runaway_drop detect/trim + conservative non-trigger; test_fused_loop_stops_on_runaway_repeat (cuts a runaway to a clean short tail); test_fused_loop_runaway_guard_can_be_disabled.
  • pytest tests/backends/mlx/test_fused_specdecode.py tests/inference_engine/bridge/test_manifest.py → 51 passed; branch CI green.
  • KDBG probe instrumentation added during the investigation has been fully removed (0 references); two mlx-kakeya-codegen-* regression presets (guard-off probe + guard-on validate) retained with accurate descriptions.
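The guard-ON versus guard-OFF comparison can be mimicked on a toy scale. The sketch below is a plain greedy loop, not the fused speculative-decode loops, and every name in it (`single_token_runaway`, `greedy_decode`, `healthy`, `looping`) is hypothetical. The detector is deliberately simplified to a 1-token unit; only the thresholds and the `stop_on_runaway` / `stopped_on_runaway` names are taken from the PR.

```python
from typing import Callable, List, Tuple


def single_token_runaway(ids: List[int], min_repeats: int = 12) -> bool:
    """Simplified stand-in detector: only a 1-token unit at the tail."""
    return len(ids) >= min_repeats and len(set(ids[-min_repeats:])) == 1


def greedy_decode(
    next_token: Callable[[List[int]], int],
    prompt: List[int],
    max_new_tokens: int,
    stop_on_runaway: bool = True,
    keep: int = 3,
    min_repeats: int = 12,
) -> Tuple[List[int], bool]:
    """Toy greedy loop; returns (generated ids, stopped_on_runaway)."""
    out: List[int] = []
    for _ in range(max_new_tokens):
        out.append(next_token(prompt + out))
        if stop_on_runaway and single_token_runaway(out, min_repeats):
            # Keep `keep` copies of the repeated token, then stop generating.
            return out[: len(out) - (min_repeats - keep)], True
    return out, False


def healthy(ctx: List[int]) -> int:
    """Toy 'model' that never repeats a token back-to-back."""
    return len(ctx) % 50


def looping(ctx: List[int]) -> int:
    """Toy 'model' that falls into a single-token loop after a short prefix."""
    return 7 if len(ctx) > 5 else len(ctx)


# Guard is inert on healthy output: ON and OFF runs are identical.
assert greedy_decode(healthy, [0], 100, True) == greedy_decode(healthy, [0], 100, False)
# Guard cuts the runaway to a short tail and reports it.
assert greedy_decode(looping, [0], 100, True) == ([1, 2, 3, 4, 5, 7, 7, 7], True)
```

With the guard off, the looping toy model fills the whole 100-token budget with the repeated token, which is the unbounded-wall behaviour the guard is meant to stop.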

pr149_guard_ondevice_validation.txt

Honest caveat

The guard could not be demonstrated engaging on a live collapse because the collapse no longer reproduces on current code (the systematic cause was fixed by #146). Its correctness is covered by deterministic unit tests, and its safety (never harming coherent output) is confirmed on-device by the byte-identical guard-on/off A/B.

Changes

  • inference_engine/backends/mlx/fused_specdecode.py: _trailing_runaway_drop + stop_on_runaway guard in all 3 fused loops; stopped_on_runaway result field.
  • scripts/research/k3_integrated_niah_eval_mac.py: --chat-scripted-file, --fused-no-loop-guard; pass stop_on_runaway through.
  • inference_engine/bridge/manifest.py: mlx-kakeya-codegen-degen-probe (guard off) + mlx-kakeya-codegen-guard-validate (guard on) regression presets.
  • Tests for the guard + presets.
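How the two new flags might map onto the guard setting can be sketched with argparse. The flag names and the `stop_on_runaway` keyword come from this PR. The parser layout, help strings, and the helper names `build_parser` and `guard_kwargs` are assumptions, and the actual wiring in k3_integrated_niah_eval_mac.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser showing only the two flags added in this PR."""
    parser = argparse.ArgumentParser(description="Flag wiring sketch")
    parser.add_argument(
        "--fused-no-loop-guard",
        action="store_true",
        help="Disable the runaway-loop guard (for degeneration probes).",
    )
    parser.add_argument(
        "--chat-scripted-file",
        default=None,
        help="Path to a file holding the scripted chat prompt.",
    )
    return parser


def guard_kwargs(args: argparse.Namespace) -> dict:
    """Translate the opt-out flag into the default-ON keyword argument."""
    return {"stop_on_runaway": not args.fused_no_loop_guard}


if __name__ == "__main__":
    # With no flags the guard stays on; the opt-out flag turns it off.
    print(guard_kwargs(build_parser().parse_args([])))
    print(guard_kwargs(build_parser().parse_args(["--fused-no-loop-guard"])))
```

Making the flag an opt-out keeps the guard on for ordinary runs, while a degeneration probe can still observe the raw greedy output.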


cursoragent and others added 4 commits June 17, 2026 15:55
…ive-control probe

KAKEYA_KDBG-gated per-block logging (sampled/committed ids, cyc_frac/cyc_p,
cache offsets) in fused_specdecode_generate, and a turn_compare_fused_vs_native
record (first_divergence_idx + both tails) in _run_fused_chat. New bridge preset
mlx-kakeya-codegen-degen-probe runs the C-code prompt with --chat-native-ref to
decide greedy-pathology vs engine bug. Instrumentation only; reverted after fix.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…efill) + multi-turn degen preset

KAKEYA_KDBG-gated prefill_state_fused / prefill_state_native records in
_run_fused_chat: per-turn prompt_len, evicted_count, rot/full cache offsets,
any_wrapped, would_wrap_block0, plus a turn index on turn_compare. Repoints
mlx-kakeya-codegen-degen-probe to the multi-turn repro (turn-1 PoW explanation
pushes the turn-2 code prompt's prefill past the sliding window) at 1200 tok.
Instrumentation only; reverted after fix.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…t prefill)

Multi-turn+native at 1200x2 OOM'd the Mac runner. Per debug analysis, the
cheapest test of H-C' (long-prompt prefill corrupts logits) vs H-A' (bounded-
greedy pathology) is a single-turn LONG prompt that wraps the ring AT prefill
(would_wrap_block0) with a tiny 192-tok budget. Add --chat-scripted-file so the
~2k-char context is a committed fixture (pow_codegen_longprompt.txt) instead of
a giant manifest argv; repoint mlx-kakeya-codegen-degen-probe to it.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Repro evidence: single-turn fused decode is TOKEN-IDENTICAL to native greedy
(first_divergence_idx=None) and coherent through 1200 tokens, so the engine is
faithful; the user's '由于...' ("because ...") / '**/.2/*' collapse is greedy-decoding pathology
on code/markdown-heavy prompts that the fused path (pure argmax, unlike
chat_mlx_kakeya.py) had no mitigation for. Once a loop starts the drafter
trivially predicts the repeats and the greedy verifier accepts them (high
accept_len), so it walls indefinitely.

Fix: _trailing_runaway_drop detects a 1..8-token unit repeated >=12x at the tail
(conservative; never trims legit lists/enumerations/code) and the three fused
loops stop generation, keeping a short clean tail instead of an unbounded wall.
Default ON (stop_on_runaway=True); --fused-no-loop-guard disables it for
degeneration probes. Adds stopped_on_runaway to the result.

Also: --chat-scripted-file (long prompt as committed fixture) + repoint the
codegen-degen probe to a single-turn long prompt that wraps the ring at prefill
(cheap; the multi-turn+native variant OOM'd the Mac runner). KAKEYA_KDBG probe
instrumentation retained (inert unless the env var is set) for the pending
on-device H-C'-vs-H-A' confirmation.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursoragent and others added 4 commits June 18, 2026 04:23
…_lm); add guard-ON validation preset

The 'env KAKEYA_KDBG=1 python3' prefix resolved a python3 without mlx_lm on the
runner (ModuleNotFoundError). Drop it (KDBG instrumentation is now inert, which
is also what we want for the final PR). The native_ref/text/stopped_on_runaway
signals in the JSON are sufficient to characterize + validate. Add
mlx-kakeya-codegen-guard-validate (guard ON) to prove the clean stop.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…get 1100) to reach the ~978-tok collapse onset

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…ive-ref, budget 900 (matches the user's high-accept regime)

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…codegen presets

- Remove the KAKEYA_KDBG-gated debug instrumentation (helpers + per-block
  emission + prefill_state/turn_compare) from fused_specdecode.py and
  k3_integrated_niah_eval_mac.py. Investigation complete.
- Keep the production fix (runaway-loop guard) + the --chat-scripted-file /
  --fused-no-loop-guard / --chat-native-ref flags.
- Repoint the two codegen presets to the multi-turn 'explain||code' chat
  (guard-off probe + guard-on validate), accurate descriptions; drop the now-
  unused pow_codegen_longprompt.txt fixture.

On-device (Mac M4): across short/long/multi-turn regimes the engine is coherent
(fused==native); guard-on and guard-off outputs are byte-identical on the
multi-turn code scenario -> the guard is inert on healthy output (no regression)
and the systematic degeneration was already resolved by the wrap fix (#146).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>