fix(src): prevent boo ui freeze from daemon write deadlock#89
Merged
Conversation
The daemon ran a single-threaded poll loop and wrote to the attached client with blocking writes (conn.send -> protocol.writeMsg -> writeAll). While blocked on a client write it stopped reading the PTY and stopped servicing control connections. boo ui drains its view socket only between its own work and, every 250ms, makes synchronous info round-trips to every session. A freshly started full-screen app (helix, neovim) bursts output on attach: the daemon's write to the ui backed up at the same moment the ui blocked on an info round-trip to that daemon, and each waited on the other. boo froze. Make per-connection output non-blocking instead. Accepted sockets are O_NONBLOCK; conn.send queues frames and flushes what the socket takes now, with the remainder going out on POLLOUT. A client that backs up past a generous cap is dropped (it reconnects and gets a fresh repaint) rather than buffered without bound or allowed to wedge the loop. Final frames (detach, exit, replies) drain through a bounded shutdown flush, and sweep recomputes passthrough so the window answers its own queries once a dropped client is gone. Fixes #85.
The sidebar polls every session over its control socket on a 250ms cadence (info) and also issues cwd/rename/quit. These reads had no deadline, so a daemon that stopped answering could hang the ui indefinitely. The non-blocking daemon writes already break that deadlock; this adds a client-side safety net so a stuck daemon can never freeze the ui again. client.control gains an optional timeout_ms: it polls the socket with a deadline before each read and returns error.Timeout if no reply arrives. The ui passes control_timeout_ms (3s); the CLI passes null to preserve its blocking semantics (boo wait already polls client-side). A timeout never deletes the socket, since a slow daemon is not a dead one, so sessionInfo and killSession keep the session instead of orphaning a live one.
Merged
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #85.
Summary
boo uifroze when opening helix/nvim in a freshly created session. The root cause is a daemon-side blocking-write deadlock, now confirmed on Linux with kernel evidence from a live frozen instance (below).Root cause (confirmed)
The session daemon (
src/daemon.zig) runs a single-threadedpoll(2)loop and wrote to clients with blocking writes (conn.send->protocol.writeMsg->writeAll->posix.write). Meanwhileboo uidoes synchronous, un-timedclient.control("info")round-trips for every session every 250ms (refreshSessions), with a blocking read and no deadline.When a session produces a large startup repaint (helix/nvim on an 85x144 viewport, x3 UIs), the daemon's socket send buffer to a ui fills before the ui drains it, and the daemon blocks in
write(). That ui is simultaneously blocked reading aninforeply from the same daemon. Neither side makes progress: a mutual deadlock.boo lshangs for the same reason (itinfos every session).Kernel evidence from the reporter's live frozen instance (NixOS):
D/Swrite(fd 7)->unix_stream_sendmsg->sock_alloc_send_pskb(send buffer full, peer not draining)boo uix3Sread(...)->unix_stream_recvmsg(theinforound-trip)This did not reproduce with small test viewports (tiny repaints). The real setup had large viewports and three concurrent UIs, which is enough to fill the send buffer.
The fix
Two changes; the first removes the deadlock, the second is a client-side safety net.
1. Non-blocking daemon writes (
src/daemon.zig,src/protocol.zig)O_NONBLOCK.conn.sendqueues frames (protocol.appendMsg) and flushes what the socket accepts now; the remainder goes out onPOLLOUT.sweeprecomputes passthrough after dropping a client so the window answers its own queries once no client remains.The daemon never blocks on a slow client again; it keeps draining the PTY and answering control commands.
2. Time-bound
boo uicontrol round-trips (src/client.zig,src/main.zig,src/ui.zig)client.controlgains an optionaltimeout_ms: it polls the socket with a deadline before each read and returnserror.Timeoutif no reply arrives.control_timeout_msfor its sidebar polls (info,cwd,rename,quit); the CLI passesnullto preserve its blocking semantics (boo waitalready polls client-side).sessionInfoandkillSessionkeep the session rather than orphaning a live one.Even if a daemon ever stops answering for another reason, the ui now gives up and stays interactive instead of freezing.
Testing
zig build test-allin Debug and-Doptimize=ReleaseSafe: 214/214 pass, including 3 new unit tests (non-blocking queue/flush order, cap-drop, and control timeout).zig fmt --checkclean.boo new -d,ls,peek,killall work.Implementation notes
protocol.appendMsgframes a message into a growable buffer (the queue), parallel towriteMsg.Conngainsout(queue),out_cap, and ashutdownflag so final frames drain before close;flushdoes partial, non-blocking writes and compacts the buffer;dropdiscards and marks the conn dead.POLLOUTonly when output is queued and flushes before handling input.drainOutboundgives queued finals a bounded (250ms) chance to land at teardown.client.control's deadline usespoll()per read; the new test binds a listening unix socket that neveraccept()s, so the reply read blocks and must time out.Generated by Coder Agents on behalf of @kylecarbs.