feat(distributed): remote DFlash+f_θ proposer (F3 data plane) — gemma-4 verifier on host A, DFlash+f_θ on host B #158
Draft
FluffyAIcode wants to merge 14 commits into
… data plane Numpy-backed WireTensor <-> proto Tensor (dtype/shape/raw-bytes), with torch and mlx bridges (bfloat16 carried as uint16 bits so an MLX verifier and a torch DFlash proposer interoperate). No torch/mlx dependency in the codec itself; mlx bridges are pragma-excluded (no mlx in CI). 17 unit tests. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
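The dtype/shape/raw-bytes idea with bfloat16 carried as uint16 bits can be sketched as follows. This is a minimal illustration, not the PR's codec: `WireTensorSketch`, `encode`, `decode`, and the two bit-conversion helpers are hypothetical names, the float32 to bfloat16 step uses plain truncation (the real codec may round), and a little-endian host is assumed.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class WireTensorSketch:
    """Hypothetical stand-in for a dtype/shape/raw-bytes wire tensor."""
    dtype: str
    shape: tuple
    raw: bytes


def f32_to_bf16_bits(x: np.ndarray) -> np.ndarray:
    """Keep the upper 16 bits of each float32 (truncation, no rounding)."""
    u = np.ascontiguousarray(x, dtype=np.float32).view(np.uint32)
    return (u >> np.uint32(16)).astype(np.uint16)


def bf16_bits_to_f32(bits: np.ndarray) -> np.ndarray:
    """Widen uint16 bfloat16 bit patterns back to float32."""
    return (bits.astype(np.uint32) << np.uint32(16)).view(np.float32)


def encode(arr: np.ndarray, dtype_name: str) -> WireTensorSketch:
    # Native byte order is used here; little-endian is an assumption.
    arr = np.ascontiguousarray(arr)
    return WireTensorSketch(dtype_name, tuple(arr.shape), arr.tobytes())


def decode(wt: WireTensorSketch) -> np.ndarray:
    # bfloat16 has no NumPy dtype, so it travels as uint16 bit patterns.
    np_dtype = np.uint16 if wt.dtype == "bfloat16" else np.dtype(wt.dtype)
    return np.frombuffer(wt.raw, dtype=np_dtype).reshape(wt.shape)


x = np.array([[1.0, -2.5], [0.15625, 3.0]], dtype=np.float32)
wire = encode(f32_to_bf16_bits(x), "bfloat16")
restored = bf16_bits_to_f32(decode(wire))
```

Because the payload is only bit patterns plus a dtype tag, either side can reinterpret it in its own framework without the codec importing torch or mlx.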
Stateful service splitting the engine across hosts: gemma-4 verifier on host A, DFlash drafter + f_θ on host B. RPCs: Restore (prompt -> f_θ-projected verifier K/V), SeedContext (verifier aux hidden -> drafter ctx K/V), DraftBlock (bonus -> drafts), ExtendContext (committed aux -> grow ctx), CloseSession. Adds framework-neutral Tensor + LayerKV messages. Regenerated Python + TS stubs. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…wire glue) Framework-neutral RestorationDraftEngine contract (WireTensor in/out) behind an async grpc.aio servicer; sync client for the spec-decode loop. Engine KeyError -> NOT_FOUND, ValueError -> INVALID_ARGUMENT. 7 wire-contract tests (roundtrip, error mapping, dead-address wrap, draft-count refusal). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
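The error mapping named in this commit (engine KeyError -> NOT_FOUND, ValueError -> INVALID_ARGUMENT) might look roughly like the sketch below. It is dependency-free on purpose: `Status` stands in for gRPC status codes, and `RpcAbort`, `map_engine_errors`, and the toy `draft_block` are hypothetical names, not the servicer's actual code.

```python
from enum import Enum
from functools import wraps


class Status(Enum):
    """Stand-in for the two gRPC status codes named in the commit."""
    NOT_FOUND = "NOT_FOUND"
    INVALID_ARGUMENT = "INVALID_ARGUMENT"


class RpcAbort(Exception):
    """Hypothetical exception carrying the status the servicer would send."""

    def __init__(self, status: Status, detail: str):
        super().__init__(f"{status.name}: {detail}")
        self.status = status
        self.detail = detail


_ERROR_MAP = {KeyError: Status.NOT_FOUND, ValueError: Status.INVALID_ARGUMENT}


def map_engine_errors(fn):
    """Translate engine exceptions into wire-level statuses."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except (KeyError, ValueError) as exc:
            status = next(s for t, s in _ERROR_MAP.items() if isinstance(exc, t))
            raise RpcAbort(status, str(exc)) from exc
    return wrapper


# Toy engine state: session id -> maximum drafts per block (illustrative only).
_sessions = {"s1": 4}


@map_engine_errors
def draft_block(session_id: str, num_drafts: int) -> list:
    max_drafts = _sessions[session_id]  # unknown session raises KeyError
    if not 0 < num_drafts <= max_drafts:
        raise ValueError(f"num_drafts={num_drafts} outside 1..{max_drafts}")
    return [0] * num_drafts
```

Keeping the mapping in one decorator means the engine can raise ordinary Python exceptions and stay unaware of the transport.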
…oop) Client-side fused spec-decode driving a RemoteDFlashProposer + a local RestoringVerifier: restore+seed per turn, draft+verify+commit+extend per block. Framework-agnostic (verifier behind a Protocol, aux as WireTensor) so it is fully unit-tested. 10 tests prove byte-identical-to-greedy output for BOTH perfect and wrong remote drafts (correctness containment), plus EOS/max-new/edge cases. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
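Why wrong remote drafts cannot change the output can be seen in a toy version of the loop. The sketch below is not the PR's `DistributedFusedDecoder`: `toy_next`, `greedy_decode`, `spec_decode`, and both proposers are hypothetical, and verification runs token by token for clarity where a real verifier would score a block in one pass.

```python
from typing import Callable, List, Sequence


def toy_next(ctx: Sequence[int]) -> int:
    """Deterministic stand-in for the verifier's greedy next-token choice."""
    return (sum(ctx) * 31 + len(ctx)) % 17


def greedy_decode(next_token: Callable, prompt: List[int], max_new: int, eos: int) -> List[int]:
    out, new = list(prompt), []
    while len(new) < max_new:
        t = next_token(out)
        out.append(t)
        new.append(t)
        if t == eos:
            break
    return new


def spec_decode(next_token: Callable, propose: Callable, prompt: List[int],
                max_new: int, eos: int, block_size: int) -> List[int]:
    out, new = list(prompt), []
    while len(new) < max_new:
        drafts = list(propose(out, block_size))[:block_size]
        # The trailing None is the bonus position after a fully accepted block.
        for draft in drafts + [None]:
            t = next_token(out)  # the verifier's token is always what gets committed
            out.append(t)
            new.append(t)
            if t == eos or len(new) >= max_new:
                return new
            if draft != t:  # mismatch: drop the remaining drafts in this block
                break
    return new


def perfect_proposer(ctx: Sequence[int], k: int) -> List[int]:
    ctx, drafts = list(ctx), []
    for _ in range(k):
        t = toy_next(ctx)
        drafts.append(t)
        ctx.append(t)
    return drafts


def wrong_proposer(ctx: Sequence[int], k: int) -> List[int]:
    return [99] * k  # never matches a token in 0..16


baseline = greedy_decode(toy_next, [3, 1, 4], max_new=20, eos=0)
with_perfect = spec_decode(toy_next, perfect_proposer, [3, 1, 4], 20, 0, block_size=4)
with_wrong = spec_decode(toy_next, wrong_proposer, [3, 1, 4], 20, 0, block_size=4)
```

In this sketch a draft only determines how many verifier tokens are committed per block, never which tokens, so a bad proposer costs throughput rather than correctness.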
…e recipe Documents the per-turn/per-block protocol, wire payloads, what landed (tested machinery), and the precise construction recipe for the next-phase MLX server engine + verifier adapter, plus the in-process + cross-host validation plan. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…r + E2E MLXRestorationDraftEngine (host B: torch DFlash + f_θ + verifier embed/lm_head), MLXRestoringVerifierAdapter (host A: wraps MLXRestoredIncrementalVerifier), and InProcessDFlashProposer. scripts/research/k3_distributed_dflash_e2e_mac.py loads the real models once and asserts the distributed path is byte-identical to greedy (in-process or loopback gRPC). Bridge presets mlx-distributed-dflash-e2e-inproc/-grpc. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… DFlash E2E _TimingProposer wraps the proposer to report mean/p50 RTT for restore/seed/draft/extend + WireTensor payload bytes (DraftBlock O(1) vs ExtendContext O(block aux)). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
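A wrapper of this kind can be sketched as a thin proxy that times selected methods and reports mean and p50. This is an illustration only: `TimingProxy`, `FakeProposer`, and the method names passed in are hypothetical, and payload-byte accounting is left out.

```python
import statistics
import time
from collections import defaultdict


class TimingProxy:
    """Forward calls to `inner`, recording wall-clock milliseconds per method."""

    def __init__(self, inner, methods):
        self._inner = inner
        self._methods = set(methods)
        self.samples_ms = defaultdict(list)

    def __getattr__(self, name):
        # Only reached for names not defined on the proxy itself.
        attr = getattr(self._inner, name)
        if name not in self._methods or not callable(attr):
            return attr

        def timed(*args, **kwargs):
            start = time.perf_counter()
            try:
                return attr(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1e3
                self.samples_ms[name].append(elapsed_ms)

        return timed

    def report(self):
        return {
            name: {
                "n": len(values),
                "mean_ms": statistics.fmean(values),
                "p50_ms": statistics.median(values),
            }
            for name, values in self.samples_ms.items()
        }


class FakeProposer:
    """Toy proposer standing in for a remote client."""

    def draft_block(self, bonus, k):
        return [bonus + i for i in range(k)]

    def close(self):
        return "closed"


proposer = TimingProxy(FakeProposer(), methods=["draft_block"])
for bonus in range(3):
    proposer.draft_block(bonus, 4)
stats = proposer.report()
```

Timing at the proposer boundary measures the full round trip as the decode loop sees it, which is the quantity that bounds per-block latency.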
…akdown + cross-host analysis Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…host E2E TorchRestorationDraftEngine (inference_engine/v04/dflash_distributed_engine.py): the pure-torch RestorationDraftEngine for a GPU host, reusing the CUDA fused machinery (CrossModelDLMRestoredVerifier.project_drafter_kv, Gap-B torch embed). k3_dflash_proposer_server.py serves it. E2E script gains --remote-addr (true cross-host) and uses block_size=1 as the greedy baseline. MLX adapter now filters restored layers to the verifier's KV-source layers (gemma-4 cross-layer sharing). Preset mlx-distributed-dflash-e2e-crosshost (Mac verifier <-> GPU proposer via vast-mapped port). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…SH -L tunnel) vast external ports are Caddy-gated (no raw-TCP passthrough), so the live Mac<->GPU run uses an SSH -L tunnel to the H200's :6006 (the GPU DFlashProposerService). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…reset to :50070 Deployed TorchRestorationDraftEngine on an H200; measured the real DFlash+f_θ data plane cross-host: DraftBlock p50 108ms (vs 232ms Mac-CPU loopback — GPU offload cuts draft compute), ExtendContext 140ms/0.27MB, per-block ~248ms over an SSH tunnel. Caddy occupies the portal ports, so the link uses :50070. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…esult gemma-4-mlx-4bit verifier @ Mac mini <-> torch DFlash+f_θ @ H200 over SSH tunnel: block=4 = 3.70 tok/s, acceptance 0.863, PASS byte-identical to greedy. Per-RPC RTT: Restore 3.2s/11.5MB, Seed 412ms, DraftBlock 268ms, ExtendContext 316ms/0.27MB. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… A/B deploy scripts
- docs/skills/distributed-dflash-ftheta-inference-skill.md: reusable SOP (two-layer design, build order, the byte-identical validation ladder, the expensive gotchas: MLX-Apple-only/torch-embed, transformers 5.x, gemma-4 KV-source-layer filtering, vast Caddy ports + SSH -L, /dev/shm cache).
- scripts/deploy/dflash_proposer_server_gpu.sh: one-command host-B (GPU) deploy (transformers 5.x + fetch gemma-4/DFlash to /dev/shm + serve DFlashProposerService).
- scripts/deploy/dflash_verifier_client.sh: host-A (verifier) launcher (open SSH -L tunnel + probe + run the byte-identical + RTT E2E).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request Jun 19, 2026
…+f_θ proposer (F3 data plane)
#157 feat(distributed): multi-host capability exchange + distributed speculative decoding
#158 feat(distributed): remote DFlash+f_θ proposer (F3 bulk-tensor data plane) + SOP skill + deploy scripts
Validated: 111 distributed unit tests; real-model byte-identical E2E (in-process, loopback gRPC, live Mac<->H200 cross-host); RTT/throughput/bounded-memory report.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
What & why
Brings the production Kakeya config onto the ADR 0009 distributed path: a gemma-4 verifier on host A (Mac mini, MLX) driving a remote DFlash drafter + f_θ K/V projection on host B (a GPU), over a real gRPC data plane — the v0.5-GA "F3" deferred by ADR 0009 §4. Stacked on #157. Correctness containment is structural (greedy verify decides every token) → byte-identical to local greedy regardless of remote drafts.
Components
- proto DFlashProposerService + Tensor/LayerKV; tensor_codec; dflash_service (RestorationDraftEngine, servicer, RemoteDFlashProposer); fused_decode (DistributedFusedDecoder + RestoringVerifier).
- backends/mlx/dflash_distributed.py (host A + Mac host B) and v04/dflash_distributed_engine.py (CUDA host B TorchRestorationDraftEngine).
- scripts/deploy/dflash_proposer_server_gpu.sh (host B), scripts/deploy/dflash_verifier_client.sh (host A), servers/harness k3_dflash_proposer_server.py / k3_distributed_dflash_e2e_mac.py; presets mlx-distributed-dflash-e2e-{inproc,grpc,crosshost}.
- SOP skill: docs/skills/distributed-dflash-ftheta-inference-skill.md. Design + report: docs/design/distributed-dflash-ftheta-data-plane.md.
Testing: all real models (gemma-4-mlx-4bit + DFlash + f_θ_v5_s5_sliding)
pytest tests/inference_engine/distributed/: 111 passed (34 new; byte-identical for perfect AND wrong drafts).
Notes
:50070 via an SSH -L tunnel from the Mac.