feat(distributed): multi-host capability exchange + distributed speculative decoding (rebase of #105 onto main)#157
Merged
cursor[bot] merged 10 commits intoJun 19, 2026
Conversation
…0009) - ADR 0009: evaluate AR-verifier/dLM-proposer on mlx.distributed; decide hybrid (gRPC control plane, optional mlx ring data plane) - Design doc: agent capability exchange platform across Mac mini hosts - proto/kakeya/v1/distributed.proto: CapabilityService (gossip push-pull) + ProposerService (remote draft blocks); stubs regenerated - inference_engine/distributed/: capability registry (LWW + TTL), deterministic placement, exchange client/servicer, NGramProposer (prompt-lookup), RemoteProposer + servicer, accept_block + DistributedSpeculativeDecoder, mlx.distributed ring probe - create_grpc_server: optional capability/proposer planes (off by default); start_grpc_runtime_server.py fleet flags (--node-id, --peer, --serve-ngram-proposer, ...) - scripts/demo_distributed_spec_decode.py: two-node demo with greedy byte-identity check - tests: Linux gate suite (77 tests, 100% coverage on the new package) + Mac integration test for real-verifier byte-identity - CI: distributed tests + coverage + import smoke wired in Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…capability-exchange-b876' into AgentMemory/distributed-spec-decode-rebased-2815 # Conflicts: # .github/workflows/ci.yaml # README.md # docs/adr/README.md # inference_engine/server/grpc_app.py # scripts/start_grpc_runtime_server.py Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…s (1.81.0->1.81.1) The distributed_pb2_grpc.py committed with #105 carried GRPC_GENERATED_VERSION 1.81.0; current toolchain (matching CI) emits 1.81.1. Regenerated via scripts/regenerate_proto_stubs.sh — fixes the proto-stub-drift CI check. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…-decode-demo bridge preset Two-process (proposer + Qwen3-0.6B verifier) distributed spec-decode over real gRPC sockets, asserting byte-identical-to-greedy. Lets the Mac bridge validate the ADR 0009 distributed engine on-device (mirrors the GPU-host run). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…nsformers 4.x AND 5.x safe #105's demo assumed apply_chat_template returns token ids (transformers 4.x); on transformers 5.x it returns a string -> verifier.prefill got a str and raised 'str object cannot be interpreted as an integer' (hit on the Mac, whose venv runs transformers 5.x for gemma-4). Pass tokenize=True and coerce str/BatchEncoding/ nested-list to a flat List[int]. Verified on the GPU host (transformers 4.57) and re-validated on the Mac (5.x). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…tall -e .'
The repo is not a pip package (no setup.py/pyproject.toml; it runs via
PYTHONPATH, as ci.yaml documents), so integration.yaml's 'pip install -e .'
always errored ('does not appear to be a Python project') whenever the Mac M4
integration job actually ran. #157 triggered that job (it adds an integration
test), exposing the pre-existing break. Install runtime deps from
requirements.txt instead; the suite already runs with PYTHONPATH=.:sdks/python.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…False) for cross-version token ids
transformers 5.x returns a dict {input_ids, attention_mask} from
apply_chat_template by default -> verifier.prefill got dict keys and raised
'str object cannot be interpreted as an integer' on the Mac (5.x). Adopt the
proven kv_cache_proposer.proposer.encode_chat convention (return_dict=False) in
the distributed integration test fixture AND the demo, replacing the demo's
ad-hoc coercion. Both distributed integration tests pass locally (2 passed).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…lx-distributed-spec-decode-bench preset scripts/bench_distributed_spec_decode.py measures the three axes the distributed spec-decode path is judged on; run_distributed_bench.sh starts a local proposer and benches against it; bridge preset runs it on-device. Used to produce the GPU-host / Mac / cross-host comparison. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…bench preset mlx-distributed-spec-decode-bench's rtt_samples needs a higher cap than n_samples (50); add MAX_RTT_SAMPLES=5000 and the int:rtt_samples validator entry. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot
pushed a commit
that referenced
this pull request
Jun 19, 2026
…+f_θ proposer (F3 data plane) #157 feat(distributed): multi-host capability exchange + distributed speculative decoding #158 feat(distributed): remote DFlash+f_θ proposer (F3 bulk-tensor data plane) + SOP skill + deploy scripts Validated: 111 distributed unit tests; real-model byte-identical E2E (in-process, loopback gRPC, live Mac<->H200 cross-host); RTT/throughput/bounded-memory report. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What
Revives the v0.5-M1 multi-host distributed-inference milestone (originally #105, closed un-merged, 107 commits behind), rebased onto current
mainwith conflicts resolved and tests green. Adds:mlx.distributedspec-decode + capability exchange.inference_engine/distributed/— capability registry, gossip exchange,mlx.distributedring probe, model-freeNGramProposer,RemoteProposer+ servicer, pureaccept_block,DistributedSpeculativeDecoder, placement planner.proto/kakeya/v1/distributed.proto(+ regenerated stubs):CapabilityService+ProposerService.create_grpc_serveroptionalcapability_registry/proposers/default_proposer_model_idplanes (off by default); fleet flags;scripts/demo_distributed_spec_decode.py.tests/inference_engine/distributed/*) + an integration test.Conflicts resolved (union of both sides)
ci.yaml(test+coverbridge/+distributed/),docs/adr/README.md(ADR 0009 in order),README.md(roadmap + v0.5-M1 rows),grpc_app.py(union signature),start_grpc_runtime_server.py(union kwargs).Follow-up CI fixes (this revision)
distributed_pb2_grpc.pywith the pinnedgrpcio-tools==1.81.1(v0.5-M1 milestone: agent capability exchange + distributed spec decode on multi-host fleets (ADR 0009) #105 shipped1.81.0).apply_chat_templatereturns a dict on 5.x (the Mac engine's runtime) →verifier.prefillraised'str' object cannot be interpreted as an integer. Adopted the proventokenize=True, return_dict=Falseconvention (kv_cache_proposer.proposer.encode_chat) in the demo + the distributed integration-test fixture. Found by the real on-device runs.integration.yamlinstall (pre-existing bug):pip install -e .always failed (repo is not a pip package; runs viaPYTHONPATH) — switched topip install -r requirements.txt.scripts/run_distributed_demo.sh+scripts/bench_distributed_spec_decode.py+ bridge presetsmlx-distributed-spec-decode-demo/-bench.Testing — CI
unit tests + 100% coverage (3.12),proto stub drift,proto lint,TypeScript SDK,docker build,package import smoke— all green.pytest -m integration on Mac M4— the 2 distributed integration tests pass. The other 13 failures are all the documentedattention_typetransformers-5.x known issue (requirements.txt:14-25, legacy Qwen3 dLM, deferred to "K2.B Qwen backport," "NOT on the K3 critical path") — pre-existing, unrelated to distributed inference, only visible because the install fix let the long-broken job run (69 pass / 13 known-issue fail).Testing — real multi-host (byte-identical to local greedy)
2-node gossip → placement (
colocated=False) → remote n-gram drafts + LOCAL greedy verify → assert byte-identical.Qwen/Qwen3-0.6B, 48 tok, acceptance 0.127 (23/181) — identical across all three: GPU H200 (2-proc localhost), Mac mini (2-proc, via bridge, transformers 5.x), cross-host (verifier@cloud-vm ↔ proposer@H200 over SSH-Ltunnel).Performance comparison (
scripts/bench_distributed_spec_decode.py)verifier
Qwen3-0.6BCPU/bf16 sink=4/window=64, block=4, RTT n=300:-L)ProposeBlockis sub-ms (0.22–0.32 ms); cross-host over the tunnel is ~52 ms p50 (~160× higher) — the network cost of remote drafts.distributed_perf_comparison.md
Draft. The
mlx.distributeddata-plane / DFlash-over-ring is left as v0.5-GA hardening (per ADR 0009).To show artifacts inline, enable in settings.