fix(mac-chat): stream tokens live so the CLI doesn't look frozen on long answers #152

Merged
FluffyAIcode merged 1 commit into main from AgentMemory/mac-chat-streaming-output-2815 on Jun 18, 2026

@FluffyAIcode FluffyAIcode commented Jun 18, 2026


Summary

Fixes two problems with the Mac chat CLI on long answers (e.g. code generation): (1) it looked frozen, and (2) once streaming was added, interleaved progress lines mangled the output format.

Issue 1 — looked frozen (non-streaming)

The interactive REPL ran the entire generation before printing anything. On a long answer at ~5 tok/s that's minutes of silence. Fix: add an on_commit streaming callback to all three fused generate loops (invoked through a safe _emit wrapper) and decode incrementally in the REPL, so the answer builds on screen from the first token.
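The callback-plus-incremental-decode pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from fused_specdecode.py: only the names on_commit and _emit come from this PR, while generate, toy_decode, and IncrementalDecoder are hypothetical stand-ins, and the byte-level "tokenizer" is an assumption made so the example is self-contained.

```python
from typing import Callable, List, Optional, Sequence

CommitFn = Callable[[List[int]], None]


def _emit(on_commit: Optional[CommitFn], token_ids: Sequence[int]) -> None:
    """Invoke the streaming callback; a failing callback must never break decode."""
    if on_commit is None:
        return
    try:
        on_commit(list(token_ids))
    except Exception:
        # Assumption: display errors are swallowed so generation keeps going.
        pass


def generate(blocks: Sequence[Sequence[int]],
             on_commit: Optional[CommitFn] = None) -> List[int]:
    """Hypothetical stand-in for a fused generate loop: commits one block at a time."""
    committed: List[int] = []
    for block in blocks:
        committed.extend(block)
        _emit(on_commit, block)  # stream as soon as the block is committed
    return committed


def toy_decode(ids: Sequence[int]) -> str:
    """Toy tokenizer (assumption): each token id is one UTF-8 byte."""
    return bytes(ids).decode("utf-8", errors="replace")


class IncrementalDecoder:
    """Re-decodes all committed ids and returns only the text not yet printed."""

    def __init__(self, decode: Callable[[Sequence[int]], str]) -> None:
        self._decode = decode
        self._ids: List[int] = []
        self._printed = ""

    def push(self, new_ids: Sequence[int]) -> str:
        self._ids.extend(new_ids)
        text = self._decode(self._ids)
        if text.endswith("\ufffd"):
            # Incomplete multi-byte character: hold back until more tokens arrive.
            return ""
        delta = text[len(self._printed):]
        self._printed = text
        return delta
```

In a REPL, the callback would write each non-empty delta to the terminal and flush, for example `generate(blocks, on_commit=lambda ids: print(dec.push(ids), end="", flush=True))`. Re-decoding the full id list on every block is quadratic in the answer length; it is used here only because it keeps the delta logic obviously correct.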

Issue 2 — mangled format (interleaving)

The per-block [stream] blk=.. t=..s progress lines went to stderr while the answer delta went to stdout; on a shared terminal they interleaved into the text (在计算机[stream] blk=1 tok=2 t=13.6s, where 在计算机 is the start of the answer, "In computer…"). Fix: emit the [stream] timing line only on the non-interactive (scripted/bridge) path; the interactive CLI streams only the clean answer delta. Added --chat-stream-stdout so a non-tty bridge run can capture the exact clean live format.
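The routing rule above can be illustrated with a small sketch. The function name make_stream_sinks and its parameters are hypothetical, and how the real script combines the interactive path with --chat-stream-stdout is an assumption here: this version treats either one as "clean mode" and suppresses the timing line entirely in that mode.

```python
import io
import sys
from typing import Callable, Optional, TextIO, Tuple


def make_stream_sinks(
    interactive: bool,
    stream_stdout: bool,
    out: Optional[TextIO] = None,
    err: Optional[TextIO] = None,
) -> Tuple[Callable[[str], None], Callable[[int, int, float], None]]:
    """Return (on_delta, on_block) writers for one chat turn.

    Assumption: clean mode (interactive REPL or the stream-stdout flag) prints
    only the answer delta; the scripted path prints only [stream] timing lines.
    """
    out = sys.stdout if out is None else out
    err = sys.stderr if err is None else err
    clean = interactive or stream_stdout

    def on_delta(text: str) -> None:
        if clean:
            out.write(text)
            out.flush()  # make the answer grow on screen immediately

    def on_block(blk: int, tok: int, elapsed_s: float) -> None:
        if not clean:
            err.write(f"[stream] blk={blk} tok={tok} t={elapsed_s:.1f}s\n")
            err.flush()

    return on_delta, on_block
```

Because the two writers are never both active in the same mode, progress lines cannot land in the middle of the answer even when stdout and stderr share a terminal.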

Validation (Mac M4)

  • Streaming works / not a deadlock: first run showed 72 [stream] blocks emitted continuously (first tokens at t=8.4s, then ~1 block/0.4s) — steady generation.
  • Format clean: re-run with --chat-stream-stdout captured the answer as continuous markdown with 0 [stream] lines in the output:
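    gemma-4 [根据pow的机制…]> 在计算机科学中,实现 `pow(base, exp)` 函数通常有两种主要的机制:快速幂算法…
    (English: prompt tag "According to the mechanism of pow…"; answer "In computer science, there are usually two main mechanisms for implementing the `pow(base, exp)` function: fast exponentiation…")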

mac_chat_format_fix.txt
mac_chat_streaming_fix.txt

Changes

  • inference_engine/backends/mlx/fused_specdecode.py: on_commit + _emit in all 3 fused loops.
  • scripts/research/k3_integrated_niah_eval_mac.py: live clean streaming in the REPL; [stream] timing only on the scripted path; --chat-stream-stdout flag.
  • inference_engine/bridge/manifest.py: mlx-kakeya-chat-stream-probe preset.

Testing

  • pytest tests/backends/mlx/test_fused_specdecode.py tests/inference_engine/bridge/test_manifest.py (47 passed)
  • ✅ On-device probe: streaming steady + clean format (evidence above)

Independent of #148/#149/#150; based on main.


…n long answers

Root cause (from the code): the interactive chat REPL is fully NON-streaming —
_gen_turn runs the entire generation (up to max_new_tokens) before printing
anything. On a code-gen prompt the answer is long and the f_θ path is slow
(~3-5 tok/s, single-token past the wrap), so the terminal stays silent for
minutes, indistinguishable from a freeze (user: '一进入就卡死,完全没有输出',
i.e. "it hangs as soon as I start it, with no output at all"). Not a deadlock:
prior scripted code-gen runs completed.

Fix: add an on_commit streaming callback to the 3 fused generate loops (safe
_emit wrapper, never breaks decode) and have the chat REPL decode incrementally
and print the delta LIVE (+ a per-block '[stream] blk=.. t=..s' stderr line that
also proves the engine is progressing, not hung). New
mlx-kakeya-chat-stream-probe preset runs the user's exact prompt to validate.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@FluffyAIcode FluffyAIcode marked this pull request as ready for review June 18, 2026 09:26
@FluffyAIcode FluffyAIcode merged commit d7bbcaf into main Jun 18, 2026
8 checks passed
@FluffyAIcode FluffyAIcode deleted the AgentMemory/mac-chat-streaming-output-2815 branch June 18, 2026 09:26