feat(distributed): remote DFlash+f_θ proposer (F3 data plane) — gemma-4 verifier on host A, DFlash+f_θ on host B #158

Draft
FluffyAIcode wants to merge 14 commits into AgentMemory/distributed-spec-decode-rebased-2815 from AgentMemory/distributed-dflash-ftheta-2815

FluffyAIcode commented Jun 19, 2026

Owner

What & why

Brings the production Kakeya config onto the ADR 0009 distributed path: a gemma-4 verifier on host A (Mac mini, MLX) driving a remote DFlash drafter + f_θ K/V projection on host B (a GPU), over a real gRPC data plane — the v0.5-GA "F3" deferred by ADR 0009 §4. Stacked on #157. Correctness containment is structural (greedy verify decides every token) → byte-identical to local greedy regardless of remote drafts.

Components

  • Machinery (unit-tested): proto DFlashProposerService + Tensor/LayerKV; tensor_codec; dflash_service (RestorationDraftEngine, servicer, RemoteDFlashProposer); fused_decode (DistributedFusedDecoder + RestoringVerifier).
  • Real-model engines: backends/mlx/dflash_distributed.py (host A + Mac host B) and v04/dflash_distributed_engine.py (CUDA host B TorchRestorationDraftEngine).
  • Deploy + SOP: scripts/deploy/dflash_proposer_server_gpu.sh (host B), scripts/deploy/dflash_verifier_client.sh (host A), servers/harness k3_dflash_proposer_server.py / k3_distributed_dflash_e2e_mac.py; presets mlx-distributed-dflash-e2e-{inproc,grpc,crosshost}. SOP skill: docs/skills/distributed-dflash-ftheta-inference-skill.md. Design + report: docs/design/distributed-dflash-ftheta-data-plane.md.

Testing — all real models (gemma-4-mlx-4bit + DFlash + f_θ_v5_s5_sliding)

  • pytest tests/inference_engine/distributed/ → 111 passed (34 new; byte-identical for perfect AND wrong drafts).
  • ✅ In-process E2E → byte-identical, 7.89 tok/s, acceptance 0.863.
  • ✅ Loopback gRPC E2E → byte-identical, 8.78 tok/s.
  • LIVE Mac↔H200 cross-host (production topology) → PASS, byte-identical; block=4 at 3.70 tok/s, acceptance 0.863–0.892. Per-RPC p50: Restore 3.2 s/11.5 MB (one-time), SeedContext 0.4 s, DraftBlock 268 ms, ExtendContext 316 ms/0.27 MB-per-block; per-block ~584 ms. (Re-validated against the script-deployed server: 3.09 tok/s, byte-identical.)
  • ✅ Bounded memory (verifier-side invariant, unchanged by split): ~235.7 MB resident KV, constant over a 1241-token generation.
  • DraftBlock's DFlash forward is offloaded to the H200 (a VM→H200 probe shows 108 ms ≈ net-RTT vs the Mac-CPU 232 ms compute); cross-host cost is then network-RTT + per-block aux bandwidth bound. GA levers: aux quantization/compression + same-rack placement.
  • ⚠️ Local 100% coverage is unmeasurable (torch+coverage segfault); MLX/v04 modules are not coverage-gated; CI is authoritative once #157 (feat(distributed): multi-host capability exchange + distributed speculative decoding, rebase of #105 onto main) merges.
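
As a back-of-envelope check on the cross-host figures above, the sketch below adds the two per-block RPC p50s and derives an effective payload rate for ExtendContext. The helper names are hypothetical, and summing p50s only approximates a true per-block latency; the derived rate is a floor on link throughput, since the RTT also includes remote compute.

```python
# p50 figures quoted in the Testing list (live Mac <-> H200 run).
DRAFT_BLOCK_MS = 268.0
EXTEND_CONTEXT_MS = 316.0
EXTEND_PAYLOAD_MB = 0.27


def per_block_rpc_ms(draft_ms: float, extend_ms: float) -> float:
    """Steady-state RPC time per block: one DraftBlock plus one ExtendContext."""
    return draft_ms + extend_ms


def payload_rate_mb_s(payload_mb: float, rtt_ms: float) -> float:
    """Effective payload rate if the whole RPC time were spent moving bytes."""
    return payload_mb / (rtt_ms / 1000.0)


block_ms = per_block_rpc_ms(DRAFT_BLOCK_MS, EXTEND_CONTEXT_MS)
extend_rate = payload_rate_mb_s(EXTEND_PAYLOAD_MB, EXTEND_CONTEXT_MS)
print(f"per-block RPC time: {block_ms:.0f} ms")
print(f"ExtendContext effective rate: {extend_rate:.2f} MB/s")
```

Restore and SeedContext are left out because they are paid once per turn rather than once per block.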

Notes

  • vast portal ports are Caddy-gated, so the link uses :50070 via an SSH -L tunnel from the Mac.

cursoragent and others added 14 commits June 19, 2026 12:27
… data plane

Numpy-backed WireTensor <-> proto Tensor (dtype/shape/raw-bytes), with torch and
mlx bridges (bfloat16 carried as uint16 bits so an MLX verifier and a torch
DFlash proposer interoperate). No torch/mlx dependency in the codec itself; mlx
bridges are pragma-excluded (no mlx in CI). 17 unit tests.
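
A minimal sketch of the "bfloat16 carried as uint16 bits" idea follows. It is not the PR's tensor_codec: the real codec is NumPy-backed, while this version uses only the standard library, and the `WireTensor` fields and all function names here are assumptions made for illustration.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class WireTensor:
    """Hypothetical stand-in for the PR's WireTensor: dtype tag, shape, raw bytes."""
    dtype: str
    shape: Tuple[int, ...]
    data: bytes


def f32_to_bf16_bits(value: float) -> int:
    """Keep the top 16 bits of the float32 pattern (truncation, no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


def bf16_bits_to_f32(bits16: int) -> float:
    """Widen a bfloat16 bit pattern to float32 by zero-filling the low half."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


def encode_bf16(values: List[float], shape: Tuple[int, ...]) -> WireTensor:
    # The payload is little-endian uint16 words; the dtype tag says how to read them.
    words = [f32_to_bf16_bits(v) for v in values]
    return WireTensor("bfloat16", shape, struct.pack(f"<{len(words)}H", *words))


def decode_bf16(wire: WireTensor) -> List[float]:
    if wire.dtype != "bfloat16":
        raise ValueError(f"unsupported dtype: {wire.dtype}")
    count = len(wire.data) // 2
    return [bf16_bits_to_f32(w) for w in struct.unpack(f"<{count}H", wire.data)]


values = [1.0, -2.5, 0.15625, 3.0]  # all exactly representable in bfloat16
wire = encode_bf16(values, shape=(2, 2))
roundtrip = decode_bf16(wire)
```

Because the bytes are plain 16-bit words, either side can reinterpret them in its own framework without the codec importing torch or mlx.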

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Stateful service splitting the engine across hosts: gemma-4 verifier on host A,
DFlash drafter + f_θ on host B. RPCs: Restore (prompt -> f_θ-projected verifier
K/V), SeedContext (verifier aux hidden -> drafter ctx K/V), DraftBlock (bonus ->
drafts), ExtendContext (committed aux -> grow ctx), CloseSession. Adds framework-
neutral Tensor + LayerKV messages. Regenerated Python + TS stubs.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…wire glue)

Framework-neutral RestorationDraftEngine contract (WireTensor in/out) behind an
async grpc.aio servicer; sync client for the spec-decode loop. Engine KeyError ->
NOT_FOUND, ValueError -> INVALID_ARGUMENT. 7 wire-contract tests (roundtrip,
error mapping, dead-address wrap, draft-count refusal).
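
The error mapping described above can be sketched as a small translation layer. This is an illustration only: `StatusCode` is a local stand-in for the gRPC status codes named in the commit message, and `call_engine`, `SESSIONS`, and `draft_block` are hypothetical names, not the PR's servicer code.

```python
from enum import Enum
from typing import Callable, Dict, List, Tuple


class StatusCode(Enum):
    """Local stand-in for the gRPC status codes named in the commit message."""
    OK = "OK"
    NOT_FOUND = "NOT_FOUND"
    INVALID_ARGUMENT = "INVALID_ARGUMENT"


def call_engine(handler: Callable[[], object]) -> Tuple[StatusCode, object]:
    """Run an engine call and translate its exceptions into a status code."""
    try:
        return StatusCode.OK, handler()
    except KeyError as exc:    # e.g. unknown session id
        return StatusCode.NOT_FOUND, str(exc)
    except ValueError as exc:  # e.g. malformed tensor or bad draft count
        return StatusCode.INVALID_ARGUMENT, str(exc)


# Hypothetical engine state: session id -> committed tokens.
SESSIONS: Dict[str, List[int]] = {"s1": [1, 2, 3]}


def draft_block(session_id: str, num_drafts: int) -> List[int]:
    context = SESSIONS[session_id]  # raises KeyError for unknown sessions
    if num_drafts <= 0:
        raise ValueError("num_drafts must be positive")
    return [context[-1]] * num_drafts  # placeholder drafts, not a real model


ok = call_engine(lambda: draft_block("s1", 2))
missing = call_engine(lambda: draft_block("nope", 2))
invalid = call_engine(lambda: draft_block("s1", 0))
```

In an actual async servicer the same branches would typically end the RPC with the matching status rather than return a tuple.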

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…oop)

Client-side fused spec-decode driving a RemoteDFlashProposer + a local
RestoringVerifier: restore+seed per turn, draft+verify+commit+extend per block.
Framework-agnostic (verifier behind a Protocol, aux as WireTensor) so it is fully
unit-tested. 10 tests prove byte-identical-to-greedy output for BOTH perfect and
wrong remote drafts (correctness containment), plus EOS/max-new/edge cases.
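
The containment property can be illustrated with a toy loop in which the verifier's greedy choice is the only token ever committed, so draft quality affects how long a block runs but not the output. Everything below is a hypothetical sketch: `toy_verifier` stands in for the real model, the function names are not the PR's, and a real verifier would score a whole block in one forward pass instead of token by token.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]
Drafter = Callable[[Sequence[int], int], List[int]]


def toy_verifier(prefix: Sequence[int]) -> int:
    """Deterministic stand-in for the verifier's greedy next token."""
    return (31 * sum(prefix) + len(prefix)) % 97


def greedy_decode(next_token: NextToken, prompt: Sequence[int], max_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(max_new):
        out.append(next_token(out))
    return out[len(prompt):]


def speculative_decode(next_token: NextToken, drafter: Drafter,
                       prompt: Sequence[int], max_new: int, block_size: int = 4) -> List[int]:
    out = list(prompt)
    produced = 0
    while produced < max_new:
        # An empty proposal still advances by one verified token.
        drafts = drafter(out, block_size) or [None]
        for draft in drafts:
            target = next_token(out)  # the verifier decides every token
            out.append(target)
            produced += 1
            if draft != target or produced == max_new:
                break                 # first mismatch ends the block
    return out[len(prompt):]


def perfect_drafter(prefix: Sequence[int], k: int) -> List[int]:
    ctx, drafts = list(prefix), []
    for _ in range(k):
        drafts.append(toy_verifier(ctx))
        ctx.append(drafts[-1])
    return drafts


def wrong_drafter(prefix: Sequence[int], k: int) -> List[int]:
    return [-1] * k  # never matches: toy_verifier only emits 0..96


prompt = [5, 7, 11]
baseline = greedy_decode(toy_verifier, prompt, max_new=10)
with_perfect = speculative_decode(toy_verifier, perfect_drafter, prompt, max_new=10)
with_wrong = speculative_decode(toy_verifier, wrong_drafter, prompt, max_new=10)
```

With perfect drafts each block commits up to `block_size` tokens; with wrong drafts each block commits one. Both sequences equal the greedy baseline.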

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…e recipe

Documents the per-turn/per-block protocol, wire payloads, what landed (tested
machinery), and the precise construction recipe for the next-phase MLX server
engine + verifier adapter, plus the in-process + cross-host validation plan.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…r + E2E

MLXRestorationDraftEngine (host B: torch DFlash + f_θ + verifier embed/lm_head),
MLXRestoringVerifierAdapter (host A: wraps MLXRestoredIncrementalVerifier), and
InProcessDFlashProposer. scripts/research/k3_distributed_dflash_e2e_mac.py loads
the real models once and asserts the distributed path is byte-identical to
greedy (in-process or loopback gRPC). Bridge presets mlx-distributed-dflash-e2e-
inproc/-grpc.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… DFlash E2E

_TimingProposer wraps the proposer to report mean/p50 RTT for restore/seed/draft/
extend + WireTensor payload bytes (DraftBlock O(1) vs ExtendContext O(block aux)).
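
A timing wrapper of this kind could look roughly like the sketch below. It is not the PR's `_TimingProposer`: the class name, the attribute-forwarding approach, and the fake proposer are assumptions, and payload-byte accounting is omitted.

```python
import statistics
import time
from collections import defaultdict
from typing import Any, Dict, List


class TimingProxy:
    """Forwards every method call to `inner` and records its wall-clock time."""

    def __init__(self, inner: Any) -> None:
        self._inner = inner
        self.samples_ms: Dict[str, List[float]] = defaultdict(list)

    def __getattr__(self, name: str) -> Any:
        # Only reached for names not defined on the proxy itself.
        attr = getattr(self._inner, name)
        if not callable(attr):
            return attr

        def timed(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            try:
                return attr(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1e3
                self.samples_ms[name].append(elapsed_ms)

        return timed

    def report(self) -> Dict[str, Dict[str, float]]:
        return {
            name: {
                "calls": len(samples),
                "mean_ms": statistics.fmean(samples),
                "p50_ms": statistics.median(samples),
            }
            for name, samples in self.samples_ms.items()
        }


class FakeProposer:
    """Hypothetical proposer used only to exercise the proxy."""

    def draft_block(self, bonus: int, k: int) -> List[int]:
        return [bonus + i for i in range(1, k + 1)]


proposer = TimingProxy(FakeProposer())
drafts = [proposer.draft_block(10, 4) for _ in range(3)]
stats = proposer.report()
```

Recording in a `finally` block means failed RPCs are timed as well, which is usually what a latency report should include.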

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…akdown + cross-host analysis

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…host E2E

TorchRestorationDraftEngine (inference_engine/v04/dflash_distributed_engine.py):
the pure-torch RestorationDraftEngine for a GPU host, reusing the CUDA fused
machinery (CrossModelDLMRestoredVerifier.project_drafter_kv, Gap-B torch embed).
k3_dflash_proposer_server.py serves it. E2E script gains --remote-addr (true
cross-host) and uses block_size=1 as the greedy baseline. MLX adapter now filters
restored layers to the verifier's KV-source layers (gemma-4 cross-layer sharing).
Preset mlx-distributed-dflash-e2e-crosshost (Mac verifier <-> GPU proposer via
vast-mapped port).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…SH -L tunnel)

vast external ports are Caddy-gated (no raw-TCP passthrough), so the live Mac<->GPU
run uses an SSH -L tunnel to the H200's :6006 (the GPU DFlashProposerService).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…reset to :50070

Deployed TorchRestorationDraftEngine on an H200; measured the real DFlash+f_θ data
plane cross-host: DraftBlock p50 108ms (vs 232ms Mac-CPU loopback — GPU offload
cuts draft compute), ExtendContext 140ms/0.27MB, per-block ~248ms over an SSH
tunnel. Caddy occupies the portal ports, so the link uses :50070.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…esult

gemma-4-mlx-4bit verifier @ Mac mini <-> torch DFlash+f_θ @ H200 over SSH tunnel:
block=4 = 3.70 tok/s, acceptance 0.863, PASS byte-identical to greedy. Per-RPC
RTT: Restore 3.2s/11.5MB, Seed 412ms, DraftBlock 268ms, ExtendContext 316ms/0.27MB.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
… A/B deploy scripts

- docs/skills/distributed-dflash-ftheta-inference-skill.md: reusable SOP (two-layer
  design, build order, the byte-identical validation ladder, the expensive gotchas:
  MLX-Apple-only/torch-embed, transformers 5.x, gemma-4 KV-source-layer filtering,
  vast Caddy ports + SSH -L, /dev/shm cache).
- scripts/deploy/dflash_proposer_server_gpu.sh: one-command host-B (GPU) deploy
  (transformers 5.x + fetch gemma-4/DFlash to /dev/shm + serve DFlashProposerService).
- scripts/deploy/dflash_verifier_client.sh: host-A (verifier) launcher (open SSH -L
  tunnel + probe + run the byte-identical + RTT E2E).
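
The "probe" step of the host-A launcher is not shown on this page; one plausible form is a plain TCP connect to the local end of the tunnel, sketched below. The function name is hypothetical, and 127.0.0.1:50070 is assumed from the Notes section.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


if __name__ == "__main__":
    # Assumed local tunnel endpoint; adjust to the port actually forwarded.
    print("tunnel reachable:", port_is_open("127.0.0.1", 50070))
```

A successful connect shows only that the local SSH listener is accepting connections. Confirming that the remote DFlashProposerService is up still takes a real RPC.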

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request Jun 19, 2026
…+f_θ proposer (F3 data plane)

#157 feat(distributed): multi-host capability exchange + distributed speculative decoding
#158 feat(distributed): remote DFlash+f_θ proposer (F3 bulk-tensor data plane) + SOP skill + deploy scripts

Validated: 111 distributed unit tests; real-model byte-identical E2E (in-process,
loopback gRPC, live Mac<->H200 cross-host); RTT/throughput/bounded-memory report.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>