Phase 0 / IPC / IPC complete (SCM_RIGHTS primary attach + Windows + replay + nightly fuzz)#22
Merged
Merged
Conversation
E1 of M0.7: the POSIX cross-process shm attach pivots to SCM_RIGHTS fd-passing as the primary path (engine-ipc.md §4.8). - protocol: WELD_IPC_PROTOCOL_VERSION 2 -> 3 - messages: ShmRegionsHandoff + ShmRegionDesc (msg_type 14) - connection: recvFrameWithHandles (frame + ancillary fds) - editor: send ShmRegionsHandoff after the handshake; reap orphan sockets/regions at startup (cleanup.zig, is_alive PID check) - runtime: attach the viewport via ShmViewport.fromFd, no shm_open - crash_recovery: send the handoff; un-gated to macOS since the pivot removes the BSD shm cross-process EACCES
Review fix (E1 addendum). The runtime accepted any handoff with at least one fd and region_count >= 1; tighten to the §8.3 contract. - connection.acceptShmHandoff: region_count must be in [1, MAX_SHM_REGIONS] and equal the received fd count exactly, else error.InvalidHandoff. - On rejection, every received fd is closed; on success only the viewport (regions[0]) is kept and any further region fd is closed — no descriptor leak from a malformed or multi-region handoff. - transport.closeHandle: platform-abstracted single-handle close. - handoff_fd.zig: negative tests (count mismatch, over-cap) + multi-region success, asserting the unmapped/excess fds are actually closed.
E2 of M0.7. Adds the 9 messages the brief scopes (engine-ipc.md §3.3), hand-written extern-struct POD like the S6 catalogue (no .etch codegen exists in-repo). MsgType range 1..23. - Play / Pause / Stop (fire-and-forget): the runtime tracks play_state; the render loop advances the mire only while playing (default playing, so S6 behaviour and crash_recovery are unchanged). - LoadScene / HotReloadScript (fire-and-forget): handlers decode + act; an empty LoadScene path emits a non-fatal RuntimeError event. - SaveScene: declared with NO wired handler (scene granularity, deferred). - SaveProject -> ProjectSaved: transactional, ack carries the same seq_id. - RuntimeError: runtime->editor non-fatal event (distinct from CrashReport). - tests/ipc/catalogue.zig: framing round-trips for every new message + end-to-end handler tests against the spawned runtime (SaveProject ack, LoadScene->RuntimeError, Play/Pause/Stop sync). schema_hash uniqueness extended to all 23 messages.
E3 of M0.7. The editor and runtime leave error.Unimplemented on Windows; the IPC round-trip now uses the existing named-pipe transport + by-name shm attach. - platform/process.zig: spawn_process implements CreateProcessW (UTF-16 command line, STARTUPINFOW + PROCESS_INFORMATION), replacing the stub. - editor/main.zig: drops the early error.Unimplemented; OS-correct endpoints (transport.buildSocketPath -> `\\.\pipe\weld-<pid>`; shm name `Local\weld-shm-viewport-<pid>`); the SCM_RIGHTS handoff is gated to POSIX (Windows attaches the named mapping by name, §2.2). - runtime/main.zig: drops the early error.Unimplemented; branches the viewport attach POSIX (fromFd via handoff) vs Windows (open by name). - runtime reader buffer (review point 1): sized over the FULL incoming editor->runtime catalogue, not @max(Echo, LoadScene). Frame sizes (24 + @sizeof): Heartbeat 32, Shutdown 25, Echo 88, SpawnEntity 28, ModifyComponent 80, Play/Pause/Stop 25, LoadScene 280, SaveScene 280, HotReloadScript 32, SaveProject 25 -> max 280. No undersize today, but the explicit enumeration is regression-proof. Windows compile verified by cross-compiling editor + runtime to x86_64-windows (the full `zig build` cross is blocked by a native-tool toolchain quirk, so a direct build-exe per binary is used). Runtime behaviour on Windows is validated by CI + the E4 crash_recovery un-gate.
Review fix (E3 addendum). The naive `"arg"` wrapping in spawn_process corrupts any argument ending in backslashes (a `\` before the closing `"` is read as an escaped quote, swallowing the next arg) or containing a `"`. Replace it with the standard ArgvQuote / CommandLineToArgvW algorithm in a reusable, testable `quoteArg`. - platform/process.zig: `pub fn quoteArg` (UTF-8, ASCII-transparent): verbatim when no whitespace/quote; else wrap in `"` and double runs of backslashes before a `"` (literal or closing). spawn_process uses it per arg (argv[0] included, by convention). - tests/ipc/process.zig: golden cases (empty, verbatim, space, trailing backslash with/without space) + a round-trip through a reference CommandLineToArgvW parser over the nasty cases (internal quote, backslash+quote, tab, multiple trailing backslashes). Pure + cross-platform — runs without Windows. Failure-injection confirmed the assertions actually run.
Two Windows compile errors found on Guy's real Windows build (E3 addendum) — my cross-compile had missed them (see the method note below). 1. parseArgs (editor + runtime): `std.process.Args.Iterator.init` is a `@compileError` on Windows (no POSIX argv). Switch to the allocator variant `Iterator.initAllocator(init.args, gpa)` (preserves the Juicy Main `init.args`; `deinit` frees the Windows buffer). Works on POSIX too. 2. spawn_process: `utf8ToUtf16LeAllocZ` can return `error.InvalidUtf8`, which is outside the process `Error` set. New `utf8ToUtf16Z` helper remaps it to `error.InvalidArgument` (non-UTF-8 argv = invalid input) rather than widening `Error` with a Unicode member. Verification-method fix: a `zig build-exe -target x86_64-windows` with `-target` placed AFTER the `-M` module definitions silently compiled the modules for the NATIVE (macOS) target — so the Windows-gated paths were never analyzed (that is why the earlier "cross GREEN" missed both bugs). With `-target` BEFORE the `-M` flags, `builtin.os.tag == .windows` holds and the broken `Iterator.init` is correctly rejected; both binaries then compile clean for Windows. Real `zig build` on Windows remains the authority.
E3 addendum 2 — CreateProcessW failed (error.SpawnFailed) on Guy's real Windows run, the first real exercise of the Windows runtime path. 1. Diagnostic: on CreateProcessW failure, log path + GetLastError() before returning error.SpawnFailed (e.g. 2 = ERROR_FILE_NOT_FOUND), so the cause is confirmable instead of opaque. 2. Root fix: the default runtime_path was the POSIX-inherited "zig-out/bin/weld-runtime" — no `.exe`, CWD-relative. CreateProcessW with lpApplicationName needs the exact path and does not add `.exe` nor search PATH. The editor now derives it from its own executable directory (std.process.executableDirPathAlloc) + "weld-runtime" + ".exe" on Windows. The `--runtime=` override is kept (tests pass it). The editor switches to the full `std.process.Init` (Juicy Main for dev tools, §2) to obtain `io` for the lookup. 3. Anti-regression: native Windows spawn_process test (cmd.exe /c exit 0 → wait_nonblock == 0), gated Windows — covers the path the interactive run revealed. Verified: editor + runtime + tests/ipc/process.zig compile clean for x86_64-windows (corrected target-first cross method); POSIX test-ipc + fmt + lint green. Guy re-tests the live CreateProcessW run on Windows.
E4 of M0.7. src/core/ipc/command_log.zig: a 1024-entry FIFO ring of the editor->runtime commands sent, each retaining its encoded frame for verbatim replay (engine-ipc.md §7, engine-tools-editor.md §2.7.3). last_clean_line advances to head on markCleanLine (called when the ProjectSaved ack arrives); replaySince iterates the post-clean-line entries still pending (never acked/nacked), in send order, clamped to the retained window. droppedUnsaved flags ring overflow since the last save. Rooted in core/root.zig (§13); 6 inline tests.
SaveProject now writes a minimal fixed-size binary snapshot (magic + version + frame_id marker) as the best-effort-replay reload point (engine-ipc.md §7.1, brief E4 option 1) — not a .scene.etch writer and no project-settings serialization (out of Phase 0). The runtime reloads the marker on restart to resume the mire; a write failure surfaces via ProjectSaved.ok = 0 so the editor does not advance its clean line on a save that did not persist. File I/O goes through std.Io.Dir + io (0.16 API; std.fs.cwd() is gone). Snapshot .bin artifacts are gitignored.
replayCommands re-sends each CommandLog entry pending since the last clean line and awaits a reply carrying the same seq_id; the first nack, timeout, or seq desync stops the pass hard with complete = false (no idempotence, engine-ipc.md §7.2/§7.3). The caller arms the per-command timeout via the socket recv timeout. Never raises — a failure ends the pass cleanly so the editor can surface it without crashing.
crash_recovery now compiles + runs on Windows too (was POSIX-only): the per-OS differences are isolated in spawnAndHandshake (POSIX hands the viewport fd off via SCM_RIGHTS; Windows opens the named mapping by name) and the cleanup helpers; clock/sleep go through the cross-platform platform-time wrapper (std.time.milliTimestamp / std.Thread.sleep are gone in 0.16). Adds a fourth case: SaveProject establishes a clean line, 3 post-save commands are queued unacked, the runtime is killed, then a restart + replayCommands re-acks all 3 in < 500 ms aggregate.
fuzz_1h sent only Echo frames; it now picks a random message type each iteration across a 15-type catalogue slice (incl. ShmRegionsHandoff and every E2 command). Interleaving heterogeneous frame sizes is the real test — it stresses the length-prefixed framing's delimiting over tens of millions of back-to-back frames (the no-desync gate), which a fixed-size Echo stream never exercised. A counting allocator over page_allocator turns leaks into a non-zero exit (alloc/free + byte tallies must balance); sent == recv and a clean reader confirm no desync. A deterministic bad-magic teardown sentinel (published after the stop flag) unblocks the reader's blocking recv instead of relying on a recv timeout. Duration is overridable via argv (env vars are gone in 0.16): zig build test-ipc-fuzz-1h -- --duration-ms=N.
Scheduled (04:00 UTC) + workflow_dispatch job that runs the 1 h full-catalogue IPC fuzz in ReleaseSafe on ubuntu-24.04 and windows-2025, uploading each run's stdout digest as an artifact (G3 gate). Scheduled runs fire only from the default branch, so this activates once M0.7 is squash-merged to main; workflow_dispatch (with a duration_ms input) allows manual runs in the meantime.
The bad-magic teardown sentinel reclassified error.InvalidMagic as benign once stop == 1, opening a masking window: a real framing desync on the last backlog data frames — sent before the writer published stop but received after — would be scored as teardown instead of a fault, blunting the no-desync gate. The sentinel is now a well-formed frame of a catalogue type kept out of fuzz_types (ShutdownAck), recognised by the reader via its msg_type. The clean end-of-stream is thus a valid frame, never an error — so InvalidMagic / UnknownMsgType / version / size mismatches are unconditional faults at any time, regardless of stop. Only a post-stop socket EOF stays tolerated (outside the targeted bug class). A comptime guard keeps the sentinel type out of fuzz_types. Prod transport unchanged. 60 s smoke: sent == recv == 20440075, fault=0, alloc/free balanced.
Review note (c): make explicit that replayCommands is synchronous and drains each replayed command (send → recv same-seq_id ack → markAcked) before the editor resumes emitting new commands. That serialization is what keeps a replayed seq_id from ever colliding with a fresh one in the seq_id->callback map (§3.4); making replay async or pipelined would break it. Comment only — no behavior change.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
M0.7 — IPC complete (SCM_RIGHTS primary attach + Windows + replay + nightly fuzz)
Completes and hardens the S6 IPC skeleton and absorbs the carried-forward Phase −1 IPC debts. No Tier 0 surface widening beyond what the SCM_RIGHTS pivot requires. Advances criterion C0.4 (editor↔runtime IPC stable).
Brief (source of truth):
briefs/M0.7-ipc-scm-rights-windows-fuzz.md— Status CLOSED (2026-06-06).Closing notes (copy from the brief)
What shipped (E1→E4)
shm_open(O_RDWR)(refusedEACCESfor non-creator siblings on macOS BSD shm) is gone. The editor creates each region, keeps the fd, and hands the viewport fd to the runtime viaIpcSocket.sendWithHandlesin aShmRegionsHandoffright afterProtocolHelloAck; the runtime maps it withShmRegion.fromFd/ShmViewport.fromFd.openis demoted to intra-process. Protocol bumped to v3. Startup orphan reaping added. EINTR loops attested non-regressed (M0.5-owned, not re-implemented).ShmRegionsHandoff/ShmRegionDesc,Play/Pause/Stop/LoadScene/HotReloadScript,SaveScene,SaveProject→ProjectSaved(transactional, sameseq_id),RuntimeError(non-fatal event).msgTypeOf+isKnownextended to1..23;schema_hashuniqueness guard kept complete.SaveScenedeclared but intentionally not wired (out of scope).src/editor/main.zigout oferror.Unimplemented:CreateProcessWspawn + named pipe + the S2 Win32 window. Editor switched to fullstd.process.Init(Juicy Main) forinit.io/init.arena.CommandLogring (1024);last_clean_lineset on theProjectSavedack.SaveProjectpersists a minimal binary snapshot (option 1; no.scene.etchwriter), reloaded on restart.connection.replayCommandsre-sends pending post-clean-line frames and awaits matching acks, stopping hard on nack/timeout/desync (no idempotence, §7.2/§7.3).crash_recovery.zigun-gated to Windows (detection < 100 ms, replay < 500 ms aggregate).fuzz_1h.zigpromoted to a nightly cron (Linux + Windows, artifact), rewritten to fuzz the full catalogue with leak detection.Review addendums acted
8859ff9) —acceptShmHandoffrejectsregion_count == 0,> MAX_SHM_REGIONS, or an fd-count mismatch; closes every received fd on rejection and every non-viewport fd on success. Negative tests added.32e4fbb+1c3e31d+34a057e) — surfaced by Guy's real Windows runs, invisible to compile checks:parseArgs→Args.Iterator.initAllocator(.initis a@compileErroron Windows);spawn_processerror set remappedInvalidUtf8→InvalidArgument;CreateProcessWerror.SpawnFailedfixed by resolving the runtime path from the editor's exe dir +.exe(with aGetLastErrordiagnostic); MSVCRT-correct command-line quoting (ArgvQuote).3b53414) — the bad-magic teardown sentinel had reclassifiedInvalidMagicas benign oncestop == 1, a narrow window that could mask a real desync on the last backlog frames. Replaced by a well-formed sentinel frame of a type kept out offuzz_types(ShutdownAck);InvalidMagic/UnknownMsgType/ size-mismatch are now unconditional faults at any time. Harness only — prod transport unchanged.Acted deviations (recap)
src/ipc/ipc_messages.etch(no.etchcodegen exists) →src/core/ipc/messages.zig; the<platform>/ipc.zigwrapper →src/core/ipc/{shm,shm_posix,shm_windows,transport,transport_windows,viewport}.zig.mains:cleanup.zig(E1),command_log.zig+snapshot.zig(E4) — reusable, unit-tested Tier 0 primitives, rooted per §13.connection.replayCommandsat the connection layer (not the editormain) — keeps replay unit-testable.src/runtime/main.zig+src/core/platform/process.zigWindows paths beyond the E3 Files list — a green Windows handshake required the runtime to leaveerror.Unimplementedandspawn_processto implementCreateProcessW. Confirmed acted with Guy..gitignore(snapshot artifacts),build.zig(handoff_fd/catalogueregistration + nightly arg forwarding),schema_hash.zig/process.zig(guard + native Windows spawn test).Validation checklist
zig build test/test-ipcgreen incl. all 4crash_recoverycases; fuzz 60 s smokesent == recv == 20 440 075,fault=0, alloc/free balanced;fmt+lint+ pre-push (build + test + test-release + tsan-wayland) green.ubuntu-24.04, Debug + ReleaseSafe): green.zig build+test-ipc+testgreen; interactive editor run ("ipc demo completed cleanly":CreateProcessW+ handshake + round-trip + clean shutdown);crash_recoverygreen; fuzz 60 s smoke green.windows-2025, Debug + ReleaseSafe): green..github/workflows/nightly-fuzz.yml) — activates on squash-merge tomain(scheduled runs fire only from the default branch);workflow_dispatchavailable meanwhile.IN PROGRESSmarker;git diff main..HEADfiles all in the brief's Files list or journaled as acted deviations.Doc files touched (in this repo)
briefs/M0.7-ipc-scm-rights-windows-fuzz.md— milestone brief (journal E1→E4 + addendums, acted deviations, Closing notes; Status CLOSED).CLAUDE.mdis intentionally not touched here — milestone-state table + tag are Guy's closing gestures.Out-of-repo (Claude.ai KB — NOT in this PR)
engine-ipc.md(§4.7/§4.8 fd-passing, §7 replay) andengine-phase-0-plan.mdare patched by Claude.ai in the knowledge base. They live outside the repo and are not part of this PR.Review notes (non-blocking)
ShutdownAck. The well-formed teardown sentinel reuses the catalogue'sShutdownAcktype, recognised bymsg_typeand kept out offuzz_typesvia acomptimeguard. This is a harness-internal semantic (no dedicated test-only msg_type was added to the prod enum, to avoid widening the catalogue).SO_RCVTIMEO.replayCommandsrelies on the socket recv timeout the caller arms; the timeout code path itself has no dedicated test (the happy path + nack/desync stop are covered by thecrash_recoveryreplay case).50fa16f). The synchronous-replayseq_idinvariant is now documented explicitly inreplayCommands(doc comment + inline note): the pass drains each command (send→ same-seq_idack →markAcked) before the editor resumes emitting new commands, which is what keeps a replayedseq_idfrom colliding with a fresh one in theseq_id→callback map (§3.4); making replay async or pipelined would break it. Comment only — no behavior change.ProjectSaved.ok = 0(snapshot write failure) path is untested. The runtime surfaces a snapshot write failure viaok = 0so the editor does not advance its clean line, but there is no test injecting a write failure.🤖 Generated with Claude Code