fix(mac-chat): stream tokens live so the CLI doesn't look frozen on long answers #152
Merged
…n long answers

Root cause (from the code): the interactive chat REPL is fully non-streaming: _gen_turn runs the entire generation (up to max_new_tokens) before printing anything. On a code-gen prompt the answer is long and the f_θ path is slow (~3-5 tok/s, single-token past the wrap), so the terminal stays silent for minutes, which is indistinguishable from a freeze (user report, translated from Chinese: 'it hangs as soon as I enter, with no output at all'). This is not a deadlock: prior scripted code-gen runs completed.

Fix: add an on_commit streaming callback to the 3 fused generate loops (via a safe _emit wrapper that never breaks decode) and have the chat REPL decode incrementally and print the delta live, plus a per-block '[stream] blk=.. t=..s' stderr line that also proves the engine is progressing, not hung. A new mlx-kakeya-chat-stream-probe preset runs the user's exact prompt to validate.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
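As a rough illustration of the fix described in the commit message, the sketch below shows a guarded commit callback feeding an incremental decoder that prints only the new text. It is a minimal sketch, not the PR's code: `make_safe_emit`, `toy_generate`, `DeltaPrinter`, and the toy vocabulary are hypothetical names invented here, and the real fused loops and tokenizer are assumed to behave differently in detail.

```python
import sys
from typing import Callable, Iterable, List, Optional

TokenCallback = Callable[[List[int]], None]


def make_safe_emit(on_commit: Optional[TokenCallback]) -> TokenCallback:
    """Wrap a user callback so that an error in it cannot abort decoding."""
    def _emit(tokens: List[int]) -> None:
        if on_commit is None:
            return
        try:
            on_commit(list(tokens))
        except Exception:
            # Hypothetical policy: swallow display errors, keep generating.
            pass
    return _emit


def toy_generate(token_blocks: Iterable[List[int]],
                 on_commit: Optional[TokenCallback] = None) -> List[int]:
    """Stand-in for a generate loop that commits tokens block by block."""
    emit = make_safe_emit(on_commit)
    committed: List[int] = []
    for block in token_blocks:
        committed.extend(block)
        emit(block)  # stream as soon as the block is committed
    return committed


class DeltaPrinter:
    """Re-decode all tokens so far and write only the unseen suffix."""

    def __init__(self, decode: Callable[[List[int]], str],
                 write: Callable[[str], object]) -> None:
        self.decode = decode
        self.write = write
        self.tokens: List[int] = []
        self.printed = ""

    def __call__(self, new_tokens: List[int]) -> None:
        self.tokens.extend(new_tokens)
        text = self.decode(self.tokens)
        # Hold back if the tail is an incomplete character, or if the
        # decoder rewrote text that was already printed.
        if text.endswith("\ufffd") or not text.startswith(self.printed):
            return
        self.write(text[len(self.printed):])
        self.printed = text


# Toy usage: a four-entry vocabulary instead of a real tokenizer.
vocab = {0: "def ", 1: "pow", 2: "(", 3: ")"}
printer = DeltaPrinter(lambda ids: "".join(vocab[i] for i in ids),
                       sys.stdout.write)
toy_generate([[0, 1], [2], [3]], on_commit=printer)
sys.stdout.write("\n")
```

Re-decoding the full token list on every block is quadratic in answer length, which is usually acceptable at a few tokens per second; the point of the design is that the printed text always matches what a single final decode would produce.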
This was referenced Jun 18, 2026
Summary
Fixes the Mac chat CLI on long answers (e.g. code generation): it (1) looked frozen and (2) once streaming was added, the output format was mangled by interleaved progress lines.
Issue 1 — looked frozen (non-streaming)
The interactive REPL ran the entire generation before printing anything. On a long answer at ~5 tok/s that's minutes of silence. Fix: an `on_commit` streaming callback in all three fused generate loops (safe `_emit` wrapper) plus live incremental decode in the REPL, so the answer builds on screen from the first token.
Issue 2 — mangled format (interleaving)
The per-block `[stream] blk=.. t=..s` progress lines went to stderr while the answer delta went to stdout; on a shared terminal they interleaved into the text (e.g. `在计算机[stream] blk=1 tok=2 t=13.6s`, where the progress line lands in the middle of the Chinese answer). Fix: emit the `[stream]` timing line only on the non-interactive (scripted/bridge) path; the interactive CLI streams only the clean answer delta. Added `--chat-stream-stdout` so a non-tty bridge run can capture the exact clean live format.
Validation (Mac M4)
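- `[stream]` blocks emitted continuously (first tokens at t=8.4s, then ~1 block/0.4s), i.e. steady generation.
- `--chat-stream-stdout` captured the answer as continuous markdown with 0 `[stream]` lines in the output:
  gemma-4 [根据pow的机制…]> 在计算机科学中,实现 `pow(base, exp)` 函数通常有两种主要的机制:快速幂算法…
  (English gloss of the captured answer: "In computer science, there are usually two main mechanisms for implementing the `pow(base, exp)` function: fast exponentiation…")
mac_chat_format_fix.txt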
mac_chat_streaming_fix.txt
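The Issue 2 fix and the zero-`[stream]`-lines check above can be sketched as follows. This is an illustrative sketch under stated assumptions: `make_delta_sink` and its parameters are hypothetical and not taken from the PR; only the `[stream] blk=.. tok=.. t=..s` line shape is copied from the description.

```python
import io
import sys
import time
from typing import Callable, Optional, TextIO


def make_delta_sink(interactive: bool,
                    answer_out: Optional[TextIO] = None,
                    progress_out: Optional[TextIO] = None
                    ) -> Callable[[str, int], None]:
    """Return a callback that writes answer text and, on the scripted
    (non-interactive) path only, a per-block timing line."""
    answer_out = answer_out if answer_out is not None else sys.stdout
    progress_out = progress_out if progress_out is not None else sys.stderr
    start = time.monotonic()
    state = {"blk": 0, "tok": 0}

    def on_delta(delta: str, n_tokens: int) -> None:
        state["blk"] += 1
        state["tok"] += n_tokens
        answer_out.write(delta)
        answer_out.flush()  # show the delta immediately
        if not interactive:
            elapsed = time.monotonic() - start
            progress_out.write(
                f"[stream] blk={state['blk']} tok={state['tok']} "
                f"t={elapsed:.1f}s\n"
            )
            progress_out.flush()

    return on_delta


# Scripted run captured into buffers: the answer stays clean, and the
# timing lines go to the separate progress stream.
answer, progress = io.StringIO(), io.StringIO()
sink = make_delta_sink(interactive=False,
                       answer_out=answer, progress_out=progress)
sink("def pow(base, exp):", 6)
sink("\n    return base ** exp", 7)
assert "[stream]" not in answer.getvalue()
```

Keeping the timing line off the interactive path entirely, rather than relying on stderr being a different stream, matters because both streams normally share one terminal.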
Changes
- `inference_engine/backends/mlx/fused_specdecode.py`: `on_commit` + `_emit` in all 3 fused loops.
- `scripts/research/k3_integrated_niah_eval_mac.py`: live clean streaming in the REPL; `[stream]` timing only on the scripted path; `--chat-stream-stdout` flag.
- `inference_engine/bridge/manifest.py`: `mlx-kakeya-chat-stream-probe` preset.
Testing
`pytest tests/backends/mlx/test_fused_specdecode.py tests/inference_engine/bridge/test_manifest.py` (47 passed)
Independent of #148/#149/#150; based on `main`.