feat(distributed): multi-host capability exchange + distributed speculative decoding (rebase of #105 onto main) by FluffyAIcode · Pull Request #157 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-19T10:04:38Z

What

Revives the v0.5-M1 multi-host distributed-inference milestone (originally #105, closed un-merged, 107 commits behind), rebased onto current main with conflicts resolved and tests green. Adds:

ADR 0009 — mlx.distributed spec-decode + capability exchange.
inference_engine/distributed/ — capability registry, gossip exchange, mlx.distributed ring probe, model-free NGramProposer, RemoteProposer + servicer, pure accept_block, DistributedSpeculativeDecoder, placement planner.
proto: proto/kakeya/v1/distributed.proto (+ regenerated stubs) defines CapabilityService + ProposerService.
gRPC integration — create_grpc_server optional capability_registry/proposers/default_proposer_model_id planes (off by default); fleet flags; scripts/demo_distributed_spec_decode.py.
Unit tests (tests/inference_engine/distributed/*) + an integration test.
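
As an illustration of what a pure accept step for greedy speculative decoding can look like, here is a minimal sketch. The signature below is hypothetical (the PR does not show the real `accept_block` API), and it assumes the verifier returns one greedy token per draft position plus one extra slot.

```python
from typing import List, Sequence, Tuple


def accept_block(draft: Sequence[int], verified: Sequence[int]) -> Tuple[List[int], int]:
    """Greedy speculative acceptance (illustrative signature, not the repo's).

    draft:    k tokens proposed by the (possibly remote) proposer.
    verified: k + 1 greedy (argmax) tokens from the verifier, where verified[i]
              is the verifier's choice given the context plus draft[:i].
    Returns (tokens_to_emit, number_of_accepted_draft_tokens).
    """
    if len(verified) != len(draft) + 1:
        raise ValueError("verifier must score every draft position plus one extra slot")
    accepted = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        accepted += 1
    # Emit the accepted prefix plus the verifier's own token at the first
    # mismatch (or the extra token when every draft token matched).
    return list(verified[: accepted + 1]), accepted


tokens, n_ok = accept_block([11, 12, 13], [11, 12, 99, 14])
print(tokens, n_ok)  # [11, 12, 99] 2
```

Every emitted token is one the verifier itself chose, which is why a scheme of this shape can stay byte-identical to local greedy decoding regardless of draft quality.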

Conflicts resolved (union of both sides)

ci.yaml (test+cover bridge/+distributed/), docs/adr/README.md (ADR 0009 in order), README.md (roadmap + v0.5-M1 rows), grpc_app.py (union signature), start_grpc_runtime_server.py (union kwargs).

Follow-up CI fixes (this revision)

proto stub drift (red→green): regenerated distributed_pb2_grpc.py with the pinned grpcio-tools==1.81.1 (#105 shipped 1.81.0).
transformers 4.x/5.x compat: apply_chat_template returns a dict ({input_ids, attention_mask}) by default on 5.x (the Mac engine's runtime), so verifier.prefill iterated the dict's string keys and raised 'str' object cannot be interpreted as an integer. Adopted the proven tokenize=True, return_dict=False convention (kv_cache_proposer.proposer.encode_chat) in the demo + the distributed integration-test fixture. Found by the real on-device runs.
integration.yaml install (pre-existing bug): pip install -e . always failed (repo is not a pip package; runs via PYTHONPATH) — switched to pip install -r requirements.txt.
Added on-device tooling: scripts/run_distributed_demo.sh + scripts/bench_distributed_spec_decode.py + bridge presets mlx-distributed-spec-decode-demo / -bench.
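
The chat-template fix above can be sketched as follows. This is not the body of `kv_cache_proposer.proposer.encode_chat` (which the PR does not show); `encode_chat_ids` is a hypothetical helper, `add_generation_prompt=True` is an assumption, and the unwrapping steps are extra defensive checks rather than something the PR describes.

```python
from collections.abc import Mapping
from typing import Any, Dict, List


def encode_chat_ids(tokenizer: Any, messages: List[Dict[str, str]]) -> List[int]:
    """Return a flat list of token ids on both transformers 4.x and 5.x."""
    ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # assumption: the caller wants a generation prompt
        return_dict=False,           # 5.x would otherwise return a dict by default
    )
    if isinstance(ids, Mapping):     # defensive: a dict slipped through anyway
        ids = ids["input_ids"]
    if hasattr(ids, "tolist"):       # tensor or ndarray
        ids = ids.tolist()
    ids = list(ids)
    if ids and isinstance(ids[0], (list, tuple)):  # batch of one
        ids = list(ids[0])
    if not all(isinstance(t, int) for t in ids):
        raise TypeError("expected a flat list of int token ids")
    return ids
```

Failing fast with a `TypeError` here surfaces a wrong return type at the call site instead of deep inside the verifier's prefill.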

Testing — CI

✅ unit tests + 100% coverage (3.12), proto stub drift, proto lint, TypeScript SDK, docker build, package import smoke — all green.
✅ pytest -m integration on Mac M4 — the 2 distributed integration tests pass. The other 13 failures are all the documented attention_type transformers-5.x known issue (requirements.txt:14-25, legacy Qwen3 dLM, deferred to "K2.B Qwen backport," "NOT on the K3 critical path") — pre-existing, unrelated to distributed inference, only visible because the install fix let the long-broken job run (69 pass / 13 known-issue fail).

Testing — real multi-host (byte-identical to local greedy)

2-node gossip → placement (colocated=False) → remote n-gram drafts + LOCAL greedy verify → assert byte-identical. Qwen/Qwen3-0.6B, 48 tok, acceptance 0.127 (23/181) — identical across all three: GPU H200 (2-proc localhost), Mac mini (2-proc, via bridge, transformers 5.x), cross-host (verifier@cloud-vm ↔ proposer@H200 over SSH -L tunnel).
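
For readers unfamiliar with model-free n-gram (prompt-lookup) drafting, the idea can be sketched as below. `propose_ngram` and its parameters are hypothetical names for illustration; the PR's `NGramProposer` may differ in matching order, n-gram length, and tie-breaking.

```python
from typing import List, Sequence


def propose_ngram(context: Sequence[int], block: int = 4, max_n: int = 3) -> List[int]:
    """Prompt-lookup drafting: find the most recent earlier occurrence of the
    context's trailing n-gram and propose the tokens that followed it."""
    ctx = list(context)
    for n in range(min(max_n, len(ctx) - 1), 0, -1):  # prefer longer matches
        suffix = ctx[-n:]
        # Scan right to left over earlier positions, excluding the suffix itself.
        for start in range(len(ctx) - n - 1, -1, -1):
            if ctx[start:start + n] == suffix:
                return ctx[start + n:start + n + block]
    return []  # no match: the caller can fall back to a plain greedy step


print(propose_ngram([1, 2, 3, 4, 1, 2]))  # [3, 4, 1, 2]
```

Because the drafts come only from repeated token patterns in the context, acceptance depends heavily on how repetitive the prompt and output are, which is consistent with the low acceptance rates reported here.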

Performance comparison (`scripts/bench_distributed_spec_decode.py`)

verifier Qwen3-0.6B CPU/bf16 sink=4/window=64, block=4, RTT n=300:

| Env | gRPC RTT p50 | gRPC RTT p99 | baseline | distributed | bounded-KV |
| --- | --- | --- | --- | --- | --- |
| Mac mini (localhost) | 0.219 ms | 0.264 ms | 31.13 tok/s | 15.76 tok/s | 7.80 MB const |
| GPU H200 (localhost) | 0.323 ms | 0.539 ms | 15.29 tok/s | 8.47 tok/s | 7.80 MB const |
| cross-host VM→H200 (SSH `-L`) | 51.6 ms | 140.1 ms | 15.56 tok/s | 7.41 tok/s | 7.80 MB const |
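
RTT percentiles like the p50/p99 columns above could be gathered with a loop along these lines. This is a generic sketch, not the contents of `scripts/bench_distributed_spec_decode.py`; `measure_rtt_ms` is a hypothetical helper and `call` stands in for one blocking ProposeBlock round trip.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_rtt_ms(call: Callable[[], object], n: int = 300) -> Dict[str, float]:
    """Time n blocking calls and report p50/p99 latency in milliseconds."""
    samples: List[float] = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()  # assumption: one synchronous round trip per invocation
        samples.append((time.perf_counter() - t0) * 1000.0)
    # 99 cut points; the inclusive method never extrapolates past observed values.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p99": cuts[98]}


result = measure_rtt_ms(lambda: None, n=300)
print(sorted(result))  # ['p50', 'p99']
```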

Bounded memory (headline): at a 1072-token context the verifier's resident K/V stays at the 68-slot sink+window bound = 7.80 MB, vs 122.9 MB full-attention → 15.8× smaller, O(1) in context length (identical on all three).
RTT: localhost ProposeBlock is sub-ms (0.22–0.32 ms p50); cross-host over the tunnel is ~52 ms p50 (~160× the H200 localhost p50, ~235× the Mac mini's), which is the network cost of remote drafts.
Throughput (honest): with the model-free n-gram at low acceptance (0.097 on this prompt), distributed throughput is below local baseline everywhere — remote-draft RTT + a full block forward with mostly-rejected drafts (no CPU batching win). Throughput gains require higher acceptance / a GPU-batched verifier; this PR delivers the correctness-containment + bounded-memory guarantees, with throughput left to v0.5-GA (dLM/DFlash-over-ring, ADR 0009).
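
The bounded-memory figures above can be reproduced with back-of-the-envelope arithmetic. The layer count, KV-head count, and head dimension below are assumptions taken from the public Qwen3-0.6B config rather than from this PR, and MB is taken as 10^6 bytes.

```python
def kv_bytes(slots: int, layers: int = 28, kv_heads: int = 8,
             head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Resident K/V bytes for `slots` cached positions (bf16 = 2 bytes)."""
    # K and V each store layers * kv_heads * head_dim values per cached position.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * slots


bounded = kv_bytes(4 + 64)  # sink=4 + window=64 -> 68 slots
full = kv_bytes(1072)       # full attention at a 1072-token context
print(f"{bounded / 1e6:.2f} MB vs {full / 1e6:.1f} MB -> {full / bounded:.1f}x")
# 7.80 MB vs 122.9 MB -> 15.8x
```

Under these assumptions the ratio is simply 1072 / 68, so the gap keeps growing with context length while the bounded cache stays fixed.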

distributed_perf_comparison.md

Draft. The mlx.distributed data-plane / DFlash-over-ring is left as v0.5-GA hardening (per ADR 0009).


…0009) - ADR 0009: evaluate AR-verifier/dLM-proposer on mlx.distributed; decide hybrid (gRPC control plane, optional mlx ring data plane) - Design doc: agent capability exchange platform across Mac mini hosts - proto/kakeya/v1/distributed.proto: CapabilityService (gossip push-pull) + ProposerService (remote draft blocks); stubs regenerated - inference_engine/distributed/: capability registry (LWW + TTL), deterministic placement, exchange client/servicer, NGramProposer (prompt-lookup), RemoteProposer + servicer, accept_block + DistributedSpeculativeDecoder, mlx.distributed ring probe - create_grpc_server: optional capability/proposer planes (off by default); start_grpc_runtime_server.py fleet flags (--node-id, --peer, --serve-ngram-proposer, ...) - scripts/demo_distributed_spec_decode.py: two-node demo with greedy byte-identity check - tests: Linux gate suite (77 tests, 100% coverage on the new package) + Mac integration test for real-verifier byte-identity - CI: distributed tests + coverage + import smoke wired in Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…capability-exchange-b876' into AgentMemory/distributed-spec-decode-rebased-2815 # Conflicts: # .github/workflows/ci.yaml # README.md # docs/adr/README.md # inference_engine/server/grpc_app.py # scripts/start_grpc_runtime_server.py Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…s (1.81.0->1.81.1) The distributed_pb2_grpc.py committed with #105 carried GRPC_GENERATED_VERSION 1.81.0; current toolchain (matching CI) emits 1.81.1. Regenerated via scripts/regenerate_proto_stubs.sh — fixes the proto-stub-drift CI check. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…-decode-demo bridge preset Two-process (proposer + Qwen3-0.6B verifier) distributed spec-decode over real gRPC sockets, asserting byte-identical-to-greedy. Lets the Mac bridge validate the ADR 0009 distributed engine on-device (mirrors the GPU-host run). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…nsformers 4.x AND 5.x safe #105's demo assumed apply_chat_template returns token ids (transformers 4.x); on transformers 5.x it returns a string -> verifier.prefill got a str and raised 'str object cannot be interpreted as an integer' (hit on the Mac, whose venv runs transformers 5.x for gemma-4). Pass tokenize=True and coerce str/BatchEncoding/ nested-list to a flat List[int]. Verified on the GPU host (transformers 4.57) and re-validated on the Mac (5.x). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…tall -e .' The repo is not a pip package (no setup.py/pyproject.toml; it runs via PYTHONPATH, as ci.yaml documents), so integration.yaml's 'pip install -e .' always errored ('does not appear to be a Python project') whenever the Mac M4 integration job actually ran. #157 triggered that job (it adds an integration test), exposing the pre-existing break. Install runtime deps from requirements.txt instead; the suite already runs with PYTHONPATH=.:sdks/python. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…False) for cross-version token ids transformers 5.x returns a dict {input_ids, attention_mask} from apply_chat_template by default -> verifier.prefill got dict keys and raised 'str object cannot be interpreted as an integer' on the Mac (5.x). Adopt the proven kv_cache_proposer.proposer.encode_chat convention (return_dict=False) in the distributed integration test fixture AND the demo, replacing the demo's ad-hoc coercion. Both distributed integration tests pass locally (2 passed). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…lx-distributed-spec-decode-bench preset scripts/bench_distributed_spec_decode.py measures the three axes the distributed spec-decode path is judged on; run_distributed_bench.sh starts a local proposer and benches against it; bridge preset runs it on-device. Used to produce the GPU-host / Mac / cross-host comparison. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…bench preset mlx-distributed-spec-decode-bench's rtt_samples needs a higher cap than n_samples (50); add MAX_RTT_SAMPLES=5000 and the int:rtt_samples validator entry. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…+f_θ proposer (F3 data plane) #157 feat(distributed): multi-host capability exchange + distributed speculative decoding #158 feat(distributed): remote DFlash+f_θ proposer (F3 bulk-tensor data plane) + SOP skill + deploy scripts Validated: 111 distributed unit tests; real-model byte-identical E2E (in-process, loopback gRPC, live Mac<->H200 cross-host); RTT/throughput/bounded-memory report. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

cursoragent and others added 3 commits June 10, 2026 03:48

demo: default to no-thinking template for echo-style spec decode demo

ccb8e09

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

github-actions Bot added the needs-mac-m4 label Jun 19, 2026

cursoragent and others added 7 commits June 19, 2026 10:16

FluffyAIcode mentioned this pull request Jun 19, 2026

feat(distributed): remote DFlash+f_θ proposer (F3 data plane) — gemma-4 verifier on host A, DFlash+f_θ on host B #158

Draft

cursor Bot merged commit 25a77b1 into main Jun 19, 2026
7 of 8 checks passed

cursor[bot] merged 10 commits into main from AgentMemory/distributed-spec-decode-rebased-2815
