fix(mac-chat): stream tokens live so the CLI doesn't look frozen on long answers by FluffyAIcode · Pull Request #152 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-18T09:21:07Z

Summary

Fixes two problems in the Mac chat CLI on long answers (e.g. code generation): (1) the CLI looked frozen, and (2) once streaming was added, interleaved progress lines mangled the output format.

Issue 1 — looked frozen (non-streaming)

The interactive REPL ran the entire generation before printing anything. On a long answer at ~5 tok/s that's minutes of silence. Fix: an on_commit streaming callback in all three fused generate loops (safe _emit wrapper) + live incremental decode in the REPL, so the answer builds on screen from the first token.
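
The callback pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in fused_specdecode.py: the names on_commit and _emit come from this PR, but their signatures here are assumed, and generate, LivePrinter, and the toy vocabulary are hypothetical stand-ins. It also assumes the decoder is prefix-stable (decoding more ids only appends text).

```python
import sys
from typing import Callable, Iterable, List, Optional, Sequence

TokenCallback = Callable[[List[int]], None]


def _emit(on_commit: Optional[TokenCallback], token_ids: Sequence[int]) -> None:
    """Invoke the streaming callback; swallow its errors so decode continues."""
    if on_commit is None:
        return
    try:
        on_commit(list(token_ids))
    except Exception:
        # A display bug must never abort generation.
        pass


def generate(blocks: Iterable[Sequence[int]],
             on_commit: Optional[TokenCallback] = None) -> List[int]:
    """Hypothetical stand-in for a fused generate loop.

    Each item of `blocks` plays the role of one committed block of token ids.
    """
    committed: List[int] = []
    for block in blocks:
        committed.extend(block)
        _emit(on_commit, block)  # stream the block as soon as it is committed
    return committed


class LivePrinter:
    """Re-decodes all ids seen so far and writes only the unseen suffix."""

    def __init__(self, decode: Callable[[List[int]], str],
                 write: Callable[[str], None]) -> None:
        self._decode = decode
        self._write = write
        self._ids: List[int] = []
        self._shown = 0  # number of characters already written

    def __call__(self, new_ids: List[int]) -> None:
        self._ids.extend(new_ids)
        text = self._decode(self._ids)
        delta = text[self._shown:]
        if delta:
            self._write(delta)
            self._shown = len(text)


if __name__ == "__main__":
    # Toy vocabulary standing in for a real tokenizer.
    vocab = {0: "Hello", 1: ",", 2: " world", 3: "!"}

    def decode(ids: List[int]) -> str:
        return "".join(vocab[i] for i in ids)

    def write(s: str) -> None:
        sys.stdout.write(s)
        sys.stdout.flush()  # flush so each delta appears immediately

    generate([[0, 1], [2], [3]], on_commit=LivePrinter(decode, write))
    sys.stdout.write("\n")
```

Decoding the full id list each time and printing only the new suffix avoids emitting partial text when a single token does not map to a complete character on its own, at the cost of repeated decode work per block.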

Issue 2 — mangled format (interleaving)

The per-block [stream] blk=.. t=..s progress lines went to stderr while the answer delta went to stdout; on a shared terminal the two interleaved in the text (e.g. 在计算机[stream] blk=1 tok=2 t=13.6s, where 在计算机, "In computer", is the start of the answer). Fix: emit the [stream] timing line only on the non-interactive (scripted/bridge) path; the interactive CLI streams only the clean answer delta. Added --chat-stream-stdout so a non-tty bridge run can capture the exact clean live format.
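
The routing rule above might look something like the sketch below. The helper emit_block and its parameters are hypothetical, and how the real script decides that a run is interactive is not shown in this PR, so the caller passes that in explicitly. Only the [stream] line format is taken from the example above.

```python
import io
import sys
from typing import Optional, TextIO


def emit_block(delta: str, blk: int, tok: int, elapsed_s: float, *,
               interactive: bool,
               out: Optional[TextIO] = None,
               err: Optional[TextIO] = None) -> None:
    """Write one block's answer delta, plus a timing line when scripted.

    Hypothetical helper: the answer text always goes to `out`; the
    '[stream] ...' progress line goes to `err` only on the non-interactive
    path, so it can never be interleaved into a live terminal answer.
    """
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err

    out.write(delta)
    out.flush()
    if not interactive:
        err.write(f"[stream] blk={blk} tok={tok} t={elapsed_s:.1f}s\n")
        err.flush()


if __name__ == "__main__":
    # Interactive path: only the clean delta is written.
    answer, progress = io.StringIO(), io.StringIO()
    emit_block("Hello", 1, 2, 13.6, interactive=True, out=answer, err=progress)
    print(repr(answer.getvalue()), repr(progress.getvalue()))

    # Scripted path: the delta plus one timing line on the error stream.
    answer, progress = io.StringIO(), io.StringIO()
    emit_block("Hello", 1, 2, 13.6, interactive=False, out=answer, err=progress)
    print(repr(answer.getvalue()), repr(progress.getvalue()))
```

Keeping the two streams as injectable parameters makes the behaviour easy to check without a real terminal.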

Validation (Mac M4)

Streaming works / not a deadlock: first run showed 72 [stream] blocks emitted continuously (first tokens at t=8.4s, then ~1 block/0.4s) — steady generation.
Format clean: re-run with --chat-stream-stdout captured the answer as continuous markdown with 0 [stream] lines in the output:
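gemma-4 [根据pow的机制…]> 在计算机科学中，实现 `pow(base, exp)` 函数通常有两种主要的机制：快速幂算法…
(English: gemma-4 [Based on the mechanism of pow…]> In computer science, there are usually two main mechanisms for implementing the `pow(base, exp)` function: fast exponentiation…)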

mac_chat_format_fix.txt
mac_chat_streaming_fix.txt

Changes

inference_engine/backends/mlx/fused_specdecode.py: on_commit + _emit in all 3 fused loops.
scripts/research/k3_integrated_niah_eval_mac.py: live clean streaming in the REPL; [stream] timing only on the scripted path; --chat-stream-stdout flag.
inference_engine/bridge/manifest.py: mlx-kakeya-chat-stream-probe preset.

Testing

✅ pytest tests/backends/mlx/test_fused_specdecode.py tests/inference_engine/bridge/test_manifest.py (47 passed)
✅ On-device probe: streaming steady + clean format (evidence above)

Independent of #148/#149/#150; based on main.


…n long answers

Root cause (from the code): the interactive chat REPL is fully non-streaming. _gen_turn runs the entire generation (up to max_new_tokens) before printing anything. On a code-gen prompt the answer is long and the f_θ path is slow (~3-5 tok/s, single-token past the wrap), so the terminal stays silent for minutes, which is indistinguishable from a freeze (user report, translated: 'it hangs as soon as I enter, with no output at all'). It is not a deadlock: prior scripted code-gen runs completed.

Fix: add an on_commit streaming callback to the 3 fused generate loops (a safe _emit wrapper that never breaks decode) and have the chat REPL decode incrementally and print the delta live, plus a per-block '[stream] blk=.. t=..s' stderr line that also proves the engine is progressing, not hung. The new mlx-kakeya-chat-stream-probe preset runs the user's exact prompt to validate.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

github-actions Bot added the needs-mac-m4 label Jun 18, 2026

FluffyAIcode marked this pull request as ready for review June 18, 2026 09:26

FluffyAIcode merged commit d7bbcaf into main Jun 18, 2026
8 checks passed

FluffyAIcode deleted the AgentMemory/mac-chat-streaming-output-2815 branch June 18, 2026 09:26

This was referenced Jun 18, 2026

fix(mac-chat): actually stop [stream] lines interleaving (the part #152 squash-dropped) #153

Merged

merge-train: #150 + #154 + #148 + #149 onto current main (conflicts resolved, tests green) #155

Merged

FluffyAIcode merged 1 commit into main from AgentMemory/mac-chat-streaming-output-2815
