feat(distributed): remote DFlash+f_θ proposer (F3 data plane) — gemma-4 verifier on host A, DFlash+f_θ on host B by FluffyAIcode · Pull Request #158 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-19T12:46:10Z

What & why

Brings the production Kakeya config onto the ADR 0009 distributed path: a gemma-4 verifier on host A (Mac mini, MLX) drives a remote DFlash drafter + f_θ K/V projection on host B (a GPU) over a real gRPC data plane. This is the v0.5-GA "F3" item deferred by ADR 0009 §4. Stacked on #157. Correctness containment is structural: the greedy verify step decides every committed token, so the output is byte-identical to local greedy decoding regardless of what the remote drafter proposes.
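A minimal sketch of why containment holds, assuming a toy deterministic verifier. `greedy_next`, `spec_decode`, and both proposers are hypothetical stand-ins, not the engine's code; the point is only that every committed token comes from the verifier, so draft quality can change speed but not output.

```python
from typing import Callable, List, Sequence

def greedy_next(context: Sequence[int]) -> int:
    """Toy deterministic verifier: stands in for an argmax over model logits."""
    return (sum(context) * 31 + len(context)) % 97

def greedy_decode(prompt: Sequence[int], max_new: int) -> List[int]:
    """Plain greedy baseline: one verifier step per token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(greedy_next(out))
    return out[len(prompt):]

def spec_decode(prompt: Sequence[int], max_new: int,
                propose: Callable[[Sequence[int], int], List[int]],
                block_size: int = 4) -> List[int]:
    """Speculative loop: drafts are only compared against the verifier's token."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        drafts = propose(out, block_size)
        mismatch = False
        for draft in drafts[:block_size]:
            if len(out) - len(prompt) >= max_new:
                break
            target = greedy_next(out)
            out.append(target)          # always commit the verifier's token
            if draft != target:
                mismatch = True         # discard the rest of this block
                break
        if not mismatch and len(out) - len(prompt) < max_new:
            out.append(greedy_next(out))  # bonus token after a fully accepted block
    return out[len(prompt):]

def perfect_propose(context: Sequence[int], k: int) -> List[int]:
    """Drafter that happens to agree with the verifier on every token."""
    tmp, drafts = list(context), []
    for _ in range(k):
        token = greedy_next(tmp)
        drafts.append(token)
        tmp.append(token)
    return drafts

def wrong_propose(context: Sequence[int], k: int) -> List[int]:
    """Drafter that proposes a constant, almost always wrong, token."""
    return [0] * k

baseline = greedy_decode([1, 2, 3], 16)
assert spec_decode([1, 2, 3], 16, perfect_propose) == baseline
assert spec_decode([1, 2, 3], 16, wrong_propose) == baseline
```

In this sketch a bad drafter only lowers the number of tokens accepted per block; the emitted sequence is unchanged.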

Components

Machinery (unit-tested): proto DFlashProposerService + Tensor/LayerKV; tensor_codec; dflash_service (RestorationDraftEngine, servicer, RemoteDFlashProposer); fused_decode (DistributedFusedDecoder + RestoringVerifier).
Real-model engines: backends/mlx/dflash_distributed.py (host A + Mac host B) and v04/dflash_distributed_engine.py (CUDA host B TorchRestorationDraftEngine).
Deploy + SOP: scripts/deploy/dflash_proposer_server_gpu.sh (host B), scripts/deploy/dflash_verifier_client.sh (host A), servers/harness k3_dflash_proposer_server.py / k3_distributed_dflash_e2e_mac.py; presets mlx-distributed-dflash-e2e-{inproc,grpc,crosshost}. SOP skill: docs/skills/distributed-dflash-ftheta-inference-skill.md. Design + report: docs/design/distributed-dflash-ftheta-data-plane.md.

Testing — all real models (gemma-4-mlx-4bit + DFlash + f_θ_v5_s5_sliding)

✅ pytest tests/inference_engine/distributed/ — 111 passed (34 new; byte-identical for perfect AND wrong drafts).
✅ In-process E2E → byte-identical, 7.89 tok/s, acceptance 0.863.
✅ Loopback gRPC E2E → byte-identical, 8.78 tok/s.
✅ LIVE Mac↔H200 cross-host (production topology) → PASS byte-identical, block=4 3.70 tok/s, acceptance 0.863–0.892. Per-RPC p50: Restore 3.2 s/11.5 MB (one-time), SeedContext 0.4 s, DraftBlock 268 ms, ExtendContext 316 ms/0.27 MB-per-block; per-block ~584 ms. (Re-validated against the script-deployed server: 3.09 tok/s, byte-identical.)
✅ Bounded memory (verifier-side invariant, unchanged by split): ~235.7 MB resident KV, constant over a 1241-token generation.
DraftBlock's DFlash forward is offloaded to the H200 (a VM→H200 probe shows 108 ms, roughly the network RTT, vs 232 ms of Mac-CPU compute); cross-host cost is therefore bound by network RTT plus per-block aux bandwidth. GA levers: aux quantization/compression + same-rack placement.
⚠️ Local 100% coverage is unmeasurable (torch+coverage segfault); MLX/v04 modules are not coverage-gated; CI is authoritative once #157 (feat(distributed): multi-host capability exchange + distributed speculative decoding, a rebase of #105 onto main) merges.
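The per-block figures quoted in this PR are the sum of the two per-block RPCs (DraftBlock + ExtendContext). A small arithmetic check, with hypothetical helper names; the "ceiling" counts RPC time only and ignores the verifier's own compute, so it is an upper bound rather than a measured rate.

```python
def per_block_rpc_ms(draft_block_ms: float, extend_context_ms: float) -> float:
    """RPC time spent per speculative block: one DraftBlock plus one ExtendContext."""
    return draft_block_ms + extend_context_ms

def block_rate_ceiling(per_block_ms: float) -> float:
    """Upper bound on blocks per second if RPC time were the only cost."""
    return 1000.0 / per_block_ms

live = per_block_rpc_ms(268, 316)   # live Mac<->H200 run: 584 ms per block
probe = per_block_rpc_ms(108, 140)  # earlier probe figures: 248 ms per block
print(live, probe, round(block_rate_ceiling(live), 2))
```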

Notes

vast portal ports are Caddy-gated, so the link uses :50070 via an SSH -L tunnel from the Mac.

… data plane Numpy-backed WireTensor <-> proto Tensor (dtype/shape/raw-bytes), with torch and mlx bridges (bfloat16 carried as uint16 bits so an MLX verifier and a torch DFlash proposer interoperate). No torch/mlx dependency in the codec itself; mlx bridges are pragma-excluded (no mlx in CI). 17 unit tests. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
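A minimal sketch of the "bfloat16 carried as uint16 bits" idea, using only the standard library rather than the NumPy-backed codec the commit describes. The function names and the dict layout (`dtype`, `shape`, `data`) are assumptions for illustration, and the conversion truncates rather than rounds.

```python
import struct
from typing import Dict, List, Sequence

def f32_to_bf16_bits(value: float) -> int:
    """Keep the top 16 bits of the float32 pattern (truncation, no rounding)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", value))
    return u32 >> 16

def bf16_bits_to_f32(bits: int) -> float:
    """Re-expand a bfloat16 bit pattern to float32 by zero-filling the low 16 bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return value

def encode_bf16(values: Sequence[float], shape: Sequence[int]) -> Dict[str, object]:
    """Hypothetical wire form: dtype tag, shape, and little-endian uint16 raw bytes."""
    bits = [f32_to_bf16_bits(v) for v in values]
    return {
        "dtype": "bfloat16",
        "shape": list(shape),
        "data": struct.pack(f"<{len(bits)}H", *bits),
    }

def decode_bf16(msg: Dict[str, object]) -> List[float]:
    """Inverse of encode_bf16: read uint16 bit patterns and widen them to floats."""
    data = msg["data"]
    bits = struct.unpack(f"<{len(data) // 2}H", data)
    return [bf16_bits_to_f32(b) for b in bits]

values = [1.0, -2.5, 0.15625, 3.0]   # all exactly representable in bfloat16
msg = encode_bf16(values, shape=(2, 2))
assert decode_bf16(msg) == values
```

Because only bit patterns cross the wire, neither side needs the other's framework to have a native bfloat16 type; each bridge reinterprets the uint16 payload locally.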

Stateful service splitting the engine across hosts: gemma-4 verifier on host A, DFlash drafter + f_θ on host B. RPCs: Restore (prompt -> f_θ-projected verifier K/V), SeedContext (verifier aux hidden -> drafter ctx K/V), DraftBlock (bonus -> drafts), ExtendContext (committed aux -> grow ctx), CloseSession. Adds framework-neutral Tensor + LayerKV messages. Regenerated Python + TS stubs. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…wire glue) Framework-neutral RestorationDraftEngine contract (WireTensor in/out) behind an async grpc.aio servicer; sync client for the spec-decode loop. Engine KeyError -> NOT_FOUND, ValueError -> INVALID_ARGUMENT. 7 wire-contract tests (roundtrip, error mapping, dead-address wrap, draft-count refusal). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
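The error mapping above can be sketched without a gRPC dependency. `Status` is a stand-in for `grpc.StatusCode`, and `call_engine` and `draft_block` are hypothetical; exceptions other than the two named in the commit are left to propagate, since the source does not say how they are mapped.

```python
from enum import Enum
from typing import Any, Callable, Dict, List, Tuple

class Status(Enum):
    """Stand-in for the gRPC status codes named in the commit message."""
    OK = "OK"
    NOT_FOUND = "NOT_FOUND"
    INVALID_ARGUMENT = "INVALID_ARGUMENT"

def call_engine(fn: Callable[..., Any], *args: Any) -> Tuple[Status, Any]:
    """Run an engine call and translate its exceptions into status codes."""
    try:
        return Status.OK, fn(*args)
    except KeyError as exc:       # e.g. unknown session id
        return Status.NOT_FOUND, str(exc)
    except ValueError as exc:     # e.g. malformed request
        return Status.INVALID_ARGUMENT, str(exc)

# Hypothetical engine method used only to exercise the mapping.
sessions: Dict[str, object] = {"s1": object()}

def draft_block(session_id: str, n: int) -> List[int]:
    _ = sessions[session_id]      # raises KeyError for an unknown session
    if n <= 0:
        raise ValueError("draft count must be positive")
    return [0] * n

assert call_engine(draft_block, "s1", 2) == (Status.OK, [0, 0])
assert call_engine(draft_block, "missing", 2)[0] is Status.NOT_FOUND
assert call_engine(draft_block, "s1", 0)[0] is Status.INVALID_ARGUMENT
```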

…oop) Client-side fused spec-decode driving a RemoteDFlashProposer + a local RestoringVerifier: restore+seed per turn, draft+verify+commit+extend per block. Framework-agnostic (verifier behind a Protocol, aux as WireTensor) so it is fully unit-tested. 10 tests prove byte-identical-to-greedy output for BOTH perfect and wrong remote drafts (correctness containment), plus EOS/max-new/edge cases. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
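The per-turn and per-block call order described here can be traced with a recording stub. All method names and arguments below are hypothetical, the local verify and commit step is omitted, and calling CloseSession at the end of the turn is an assumption, since the commit does not say when the session is closed.

```python
from typing import List

class RecordingProposer:
    """Stub proposer that records which RPC would be issued, in order."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def restore(self, prompt: List[int]) -> None:
        self.calls.append("Restore")

    def seed_context(self, aux: bytes) -> None:
        self.calls.append("SeedContext")

    def draft_block(self, bonus: int, n: int) -> List[int]:
        self.calls.append("DraftBlock")
        return [0] * n

    def extend_context(self, aux: bytes) -> None:
        self.calls.append("ExtendContext")

    def close(self) -> None:
        self.calls.append("CloseSession")

def run_turn(proposer: RecordingProposer, num_blocks: int, block_size: int = 4) -> List[str]:
    """One turn: restore + seed once, then draft + (verify, commit) + extend per block."""
    proposer.restore(prompt=[1, 2, 3])
    proposer.seed_context(aux=b"")
    for _ in range(num_blocks):
        proposer.draft_block(bonus=0, n=block_size)
        # The verifier on host A would verify and commit tokens here.
        proposer.extend_context(aux=b"")
    proposer.close()
    return proposer.calls

print(run_turn(RecordingProposer(), num_blocks=2))
```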

…e recipe Documents the per-turn/per-block protocol, wire payloads, what landed (tested machinery), and the precise construction recipe for the next-phase MLX server engine + verifier adapter, plus the in-process + cross-host validation plan. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…r + E2E MLXRestorationDraftEngine (host B: torch DFlash + f_θ + verifier embed/lm_head), MLXRestoringVerifierAdapter (host A: wraps MLXRestoredIncrementalVerifier), and InProcessDFlashProposer. scripts/research/k3_distributed_dflash_e2e_mac.py loads the real models once and asserts the distributed path is byte-identical to greedy (in-process or loopback gRPC). Bridge presets mlx-distributed-dflash-e2e-inproc/-grpc. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… DFlash E2E _TimingProposer wraps the proposer to report mean/p50 RTT for restore/seed/draft/extend + WireTensor payload bytes (DraftBlock O(1) vs ExtendContext O(block aux)). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
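A minimal sketch of a timing wrapper in the spirit of `_TimingProposer`, assuming a generic proxy that times every method call. `TimingProxy` and `DummyProposer` are hypothetical names; payload byte accounting is left out.

```python
import statistics
import time
from collections import defaultdict
from typing import Any, Dict, List

class TimingProxy:
    """Wraps any object and records wall-clock duration per method name."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner
        self.samples: Dict[str, List[float]] = defaultdict(list)

    def __getattr__(self, name: str) -> Any:
        # Only reached for attributes not defined on the proxy itself.
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        def timed(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return attr(*args, **kwargs)
            finally:
                self.samples[name].append(time.perf_counter() - start)

        return timed

    def report(self) -> Dict[str, Dict[str, float]]:
        """Mean and p50 (median) seconds per method, plus the sample count."""
        return {
            name: {"mean": statistics.fmean(v), "p50": statistics.median(v), "n": len(v)}
            for name, v in self.samples.items()
        }

class DummyProposer:
    """Hypothetical proposer used only to exercise the wrapper."""

    def draft_block(self, n: int) -> List[int]:
        return [0] * n

proxy = TimingProxy(DummyProposer())
proxy.draft_block(4)
proxy.draft_block(4)
print(proxy.report())
```

Reporting the median alongside the mean keeps a single slow call, such as a one-time restore, from dominating the per-RPC picture.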

…akdown + cross-host analysis Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…host E2E TorchRestorationDraftEngine (inference_engine/v04/dflash_distributed_engine.py): the pure-torch RestorationDraftEngine for a GPU host, reusing the CUDA fused machinery (CrossModelDLMRestoredVerifier.project_drafter_kv, Gap-B torch embed). k3_dflash_proposer_server.py serves it. E2E script gains --remote-addr (true cross-host) and uses block_size=1 as the greedy baseline. MLX adapter now filters restored layers to the verifier's KV-source layers (gemma-4 cross-layer sharing). Preset mlx-distributed-dflash-e2e-crosshost (Mac verifier <-> GPU proposer via vast-mapped port). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…SH -L tunnel) vast external ports are Caddy-gated (no raw-TCP passthrough), so the live Mac<->GPU run uses an SSH -L tunnel to the H200's :6006 (the GPU DFlashProposerService). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…reset to :50070 Deployed TorchRestorationDraftEngine on an H200; measured the real DFlash+f_θ data plane cross-host: DraftBlock p50 108ms (vs 232ms Mac-CPU loopback — GPU offload cuts draft compute), ExtendContext 140ms/0.27MB, per-block ~248ms over an SSH tunnel. Caddy occupies the portal ports, so the link uses :50070. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>


…esult gemma-4-mlx-4bit verifier @Mac mini <-> torch DFlash+f_θ @h200 over SSH tunnel: block=4 = 3.70 tok/s, acceptance 0.863, PASS byte-identical to greedy. Per-RPC RTT: Restore 3.2s/11.5MB, Seed 412ms, DraftBlock 268ms, ExtendContext 316ms/0.27MB. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

… A/B deploy scripts
- docs/skills/distributed-dflash-ftheta-inference-skill.md: reusable SOP (two-layer design, build order, the byte-identical validation ladder, the expensive gotchas: MLX-Apple-only/torch-embed, transformers 5.x, gemma-4 KV-source-layer filtering, vast Caddy ports + SSH -L, /dev/shm cache).
- scripts/deploy/dflash_proposer_server_gpu.sh: one-command host-B (GPU) deploy (transformers 5.x + fetch gemma-4/DFlash to /dev/shm + serve DFlashProposerService).
- scripts/deploy/dflash_verifier_client.sh: host-A (verifier) launcher (open SSH -L tunnel + probe + run the byte-identical + RTT E2E).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…+f_θ proposer (F3 data plane)
#157 feat(distributed): multi-host capability exchange + distributed speculative decoding
#158 feat(distributed): remote DFlash+f_θ proposer (F3 bulk-tensor data plane) + SOP skill + deploy scripts
Validated: 111 distributed unit tests; real-model byte-identical E2E (in-process, loopback gRPC, live Mac<->H200 cross-host); RTT/throughput/bounded-memory report.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 14 commits June 19, 2026 12:27

test(distributed): cover RemoteDFlashProposer context-manager close path

811115d

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

docs(distributed): record real-model DFlash+f_θ E2E results + RTT bre…

653364d

…akdown + cross-host analysis Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode wants to merge 14 commits into AgentMemory/distributed-spec-decode-rebased-2815 from AgentMemory/distributed-dflash-ftheta-2815
