feat(distributed): multi-host capability exchange + distributed speculative decoding (rebase of #105 onto main)#157

Merged: cursor[bot] merged 10 commits into main from AgentMemory/distributed-spec-decode-rebased-2815 on Jun 19, 2026
FluffyAIcode (Owner) commented Jun 19, 2026

What

Revives the v0.5-M1 multi-host distributed-inference milestone (originally #105, closed un-merged, 107 commits behind), rebased onto current main with conflicts resolved and tests green. Adds:

  • ADR 0009: mlx.distributed spec-decode + capability exchange.
  • inference_engine/distributed/ — capability registry, gossip exchange, mlx.distributed ring probe, model-free NGramProposer, RemoteProposer + servicer, pure accept_block, DistributedSpeculativeDecoder, placement planner.
  • proto: proto/kakeya/v1/distributed.proto (+ regenerated stubs): CapabilityService + ProposerService.
  • gRPC integration: create_grpc_server gains optional capability_registry / proposers / default_proposer_model_id planes (off by default); fleet flags; scripts/demo_distributed_spec_decode.py.
  • Unit tests (tests/inference_engine/distributed/*) + an integration test.
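To make "model-free NGramProposer" concrete, here is a minimal sketch of prompt-lookup drafting, the technique the commit log names for this proposer. The function name `propose_ngram` and the `block_size` / `max_ngram` parameters are hypothetical and chosen for illustration; the PR's actual class and signature may differ.

```python
from typing import List, Sequence


def propose_ngram(context: Sequence[int], block_size: int = 4,
                  max_ngram: int = 3) -> List[int]:
    """Draft tokens by prompt lookup (illustrative, not the PR's code).

    Find the most recent earlier occurrence of the context's trailing
    n-gram and propose the tokens that followed it. No model is involved.
    """
    ctx = list(context)
    # Prefer the longest suffix; fall back to shorter ones.
    for n in range(min(max_ngram, len(ctx) - 1), 0, -1):
        suffix = ctx[-n:]
        # Scan right to left, excluding the suffix's own position at the end.
        for start in range(len(ctx) - n - 1, -1, -1):
            if ctx[start:start + n] == suffix:
                # start + n <= len(ctx) - 1, so at least one token follows.
                return ctx[start + n:start + n + block_size]
    return []  # no match: the verifier decodes a single token on its own
```

An empty draft is a valid answer: the decoder then falls back to ordinary one-token greedy steps, which is consistent with the low acceptance rates reported below for this proposer.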

Conflicts resolved (union of both sides)

ci.yaml (test+cover bridge/+distributed/), docs/adr/README.md (ADR 0009 in order), README.md (roadmap + v0.5-M1 rows), grpc_app.py (union signature), start_grpc_runtime_server.py (union kwargs).

Follow-up CI fixes (this revision)

  • proto stub drift (red→green): regenerated distributed_pb2_grpc.py with the pinned grpcio-tools==1.81.1; the stubs committed with #105 were generated with 1.81.0.
  • transformers 4.x/5.x compat: on 5.x (the Mac engine's runtime) apply_chat_template returns a dict {input_ids, attention_mask} by default, so verifier.prefill received the dict's string keys and raised 'str' object cannot be interpreted as an integer. Adopted the proven tokenize=True, return_dict=False convention (kv_cache_proposer.proposer.encode_chat) in the demo and the distributed integration-test fixture. Found by the real on-device runs.
  • integration.yaml install (pre-existing bug): pip install -e . always failed (repo is not a pip package; runs via PYTHONPATH) — switched to pip install -r requirements.txt.
  • Added on-device tooling: scripts/run_distributed_demo.sh + scripts/bench_distributed_spec_decode.py + bridge presets mlx-distributed-spec-decode-demo / -bench.
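The tokenize=True, return_dict=False convention from the transformers fix can be sketched as below. This is not the repository's `encode_chat`; it is a hypothetical helper with the same intent. The `add_generation_prompt=True` argument and the final type check are assumptions added for this sketch.

```python
from typing import Any, Dict, List


def encode_chat(tokenizer: Any, messages: List[Dict[str, str]]) -> List[int]:
    """Return flat prompt token ids on both transformers 4.x and 5.x."""
    ids = tokenizer.apply_chat_template(
        messages,
        tokenize=True,               # ask for token ids, not the rendered string
        add_generation_prompt=True,  # assumption: the demo prompts for a reply
        return_dict=False,           # bare ids, not {input_ids, attention_mask}
    )
    # Defensive check (added in this sketch): fail loudly here instead of
    # handing a str or dict keys to the verifier's prefill.
    if not isinstance(ids, list) or not all(isinstance(t, int) for t in ids):
        raise TypeError(f"expected a flat list of ints, got {type(ids).__name__}")
    return ids
```

Passing both flags explicitly avoids depending on either major version's defaults, which is the point of the fix described above.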

Testing — CI

  • unit tests + 100% coverage (3.12), proto stub drift, proto lint, TypeScript SDK, docker build, package import smoke — all green.
  • pytest -m integration on Mac M4 — the 2 distributed integration tests pass. The other 13 failures are all the documented attention_type transformers-5.x known issue (requirements.txt:14-25, legacy Qwen3 dLM, deferred to "K2.B Qwen backport," "NOT on the K3 critical path") — pre-existing, unrelated to distributed inference, only visible because the install fix let the long-broken job run (69 pass / 13 known-issue fail).

Testing — real multi-host (byte-identical to local greedy)

2-node gossip → placement (colocated=False) → remote n-gram drafts + LOCAL greedy verify → assert byte-identical. Qwen/Qwen3-0.6B, 48 tok, acceptance 0.127 (23/181) — identical across all three: GPU H200 (2-proc localhost), Mac mini (2-proc, via bridge, transformers 5.x), cross-host (verifier@cloud-vm ↔ proposer@H200 over SSH -L tunnel).
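Why remote drafts cannot change the output under greedy verification can be shown with a small sketch. The names `accept_block` and `decode` mirror the PR's vocabulary, but the signatures here are hypothetical, and the verifier is modelled as a plain `next_token` function called once per position; a real verifier would score the whole block in one forward pass.

```python
from typing import Callable, List, Sequence

# Greedy verifier: maps a token prefix to its argmax next token.
NextToken = Callable[[Sequence[int]], int]
Proposer = Callable[[Sequence[int]], Sequence[int]]


def accept_block(context: Sequence[int], draft: Sequence[int],
                 next_token: NextToken) -> List[int]:
    """Keep the longest draft prefix that matches greedy decoding, then
    append the verifier's own token, so each call emits at least one token."""
    out: List[int] = []
    for tok in draft:
        expected = next_token(list(context) + out)
        if tok != expected:
            return out + [expected]  # first mismatch: take the verifier's token
        out.append(tok)
    return out + [next_token(list(context) + out)]  # all accepted: bonus token


def decode(prompt: Sequence[int], n_tokens: int, next_token: NextToken,
           propose: Proposer) -> List[int]:
    """Speculative loop; the proposer may be local, remote, or useless."""
    seq = list(prompt)
    while len(seq) - len(prompt) < n_tokens:
        seq += accept_block(seq, propose(seq), next_token)
    return seq[len(prompt):len(prompt) + n_tokens]
```

Every emitted token equals the verifier's greedy choice for its prefix, so the proposer only affects how many tokens each step yields, never which tokens. Running `decode` with an empty proposer gives the plain greedy baseline to compare against.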

Performance comparison (scripts/bench_distributed_spec_decode.py)

verifier Qwen3-0.6B CPU/bf16 sink=4/window=64, block=4, RTT n=300:

| Env | gRPC RTT p50 | gRPC RTT p99 | baseline | distributed | bounded-KV |
| --- | --- | --- | --- | --- | --- |
| Mac mini (localhost) | 0.219 ms | 0.264 ms | 31.13 tok/s | 15.76 tok/s | 7.80 MB const |
| GPU H200 (localhost) | 0.323 ms | 0.539 ms | 15.29 tok/s | 8.47 tok/s | 7.80 MB const |
| cross-host VM→H200 (SSH -L) | 51.6 ms | 140.1 ms | 15.56 tok/s | 7.41 tok/s | 7.80 MB const |

  • Bounded memory (headline): at a 1072-token context the verifier's resident K/V stays at the 68-slot sink+window bound = 7.80 MB, vs 122.9 MB full-attention → 15.8× smaller, O(1) in context length (identical on all three).
  • RTT: localhost ProposeBlock is sub-ms (0.22–0.32 ms p50); cross-host over the tunnel is ~52 ms p50, about 160× the H200 localhost figure and about 235× the Mac mini's. That gap is the network cost of remote drafts.
  • Throughput (honest): with the model-free n-gram at low acceptance (0.097 on this prompt), distributed throughput is below local baseline everywhere — remote-draft RTT + a full block forward with mostly-rejected drafts (no CPU batching win). Throughput gains require higher acceptance / a GPU-batched verifier; this PR delivers the correctness-containment + bounded-memory guarantees, with throughput left to v0.5-GA (dLM/DFlash-over-ring, ADR 0009).
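The bounded-memory figures can be reproduced with simple arithmetic. The attention geometry below (28 layers, 8 KV heads, head dimension 128) is an assumption taken from the public Qwen3-0.6B configuration, not something this PR states; the sink, window and context sizes come from the benchmark setup above.

```python
# Assumed Qwen3-0.6B geometry (public model config, not stated in this PR).
LAYERS, KV_HEADS, HEAD_DIM = 28, 8, 128
BYTES_PER_ELEM = 2  # bf16


def kv_bytes(slots: int) -> int:
    """Bytes of resident K and V for `slots` cached positions."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM * slots  # 2 = K and V


SINK, WINDOW, CONTEXT = 4, 64, 1072  # from the benchmark setup

bounded = kv_bytes(min(CONTEXT, SINK + WINDOW))  # capped at 68 slots
full = kv_bytes(CONTEXT)                         # grows with context length

print(f"{bounded / 1e6:.2f} MB vs {full / 1e6:.1f} MB ({full / bounded:.1f}x)")
# prints: 7.80 MB vs 122.9 MB (15.8x)
```

Under these assumptions the ratio is just 1072 / 68, so it keeps growing with context length while the bounded cache stays at 68 slots.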

distributed_perf_comparison.md

Draft. The mlx.distributed data-plane / DFlash-over-ring is left as v0.5-GA hardening (per ADR 0009).


cursoragent and others added 3 commits June 10, 2026 03:48
…0009)

- ADR 0009: evaluate AR-verifier/dLM-proposer on mlx.distributed;
  decide hybrid (gRPC control plane, optional mlx ring data plane)
- Design doc: agent capability exchange platform across Mac mini hosts
- proto/kakeya/v1/distributed.proto: CapabilityService (gossip
  push-pull) + ProposerService (remote draft blocks); stubs regenerated
- inference_engine/distributed/: capability registry (LWW + TTL),
  deterministic placement, exchange client/servicer, NGramProposer
  (prompt-lookup), RemoteProposer + servicer, accept_block +
  DistributedSpeculativeDecoder, mlx.distributed ring probe
- create_grpc_server: optional capability/proposer planes (off by
  default); start_grpc_runtime_server.py fleet flags (--node-id,
  --peer, --serve-ngram-proposer, ...)
- scripts/demo_distributed_spec_decode.py: two-node demo with greedy
  byte-identity check
- tests: Linux gate suite (77 tests, 100% coverage on the new
  package) + Mac integration test for real-verifier byte-identity
- CI: distributed tests + coverage + import smoke wired in

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
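The "capability registry (LWW + TTL)" with "gossip push-pull" named in the commit above can be illustrated with a minimal sketch. The `Record` and `Registry` types, their fields, and the use of a per-node version counter as the last-writer-wins key are all assumptions for illustration; the PR's registry may order writes differently.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Record:
    node_id: str
    capabilities: Tuple[str, ...]
    version: int       # assumed LWW key: per-node, monotonically increasing
    expires_at: float  # absolute time after which the record is stale (TTL)


@dataclass
class Registry:
    records: Dict[str, Record] = field(default_factory=dict)

    def merge(self, incoming: Iterable[Record]) -> None:
        """Last-writer-wins: per node, keep the record with the higher version."""
        for rec in incoming:
            cur = self.records.get(rec.node_id)
            if cur is None or rec.version > cur.version:
                self.records[rec.node_id] = rec

    def live(self, now: float) -> List[Record]:
        """Unexpired records in a deterministic (node_id) order."""
        return sorted((r for r in self.records.values() if r.expires_at > now),
                      key=lambda r: r.node_id)


def push_pull(a: Registry, b: Registry) -> None:
    """One gossip round: both sides end up with the same merged view."""
    snapshot_a = list(a.records.values())
    a.merge(list(b.records.values()))
    b.merge(snapshot_a)
```

Because `merge` is idempotent and order-independent for distinct versions, repeated or reordered gossip rounds converge to the same state, and a deterministic `live` ordering gives a placement planner the same input on every node.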
…capability-exchange-b876' into AgentMemory/distributed-spec-decode-rebased-2815

# Conflicts:
#	.github/workflows/ci.yaml
#	README.md
#	docs/adr/README.md
#	inference_engine/server/grpc_app.py
#	scripts/start_grpc_runtime_server.py

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursoragent and others added 7 commits June 19, 2026 10:16
…s (1.81.0->1.81.1)

The distributed_pb2_grpc.py committed with #105 carried GRPC_GENERATED_VERSION
1.81.0; current toolchain (matching CI) emits 1.81.1. Regenerated via
scripts/regenerate_proto_stubs.sh — fixes the proto-stub-drift CI check.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…-decode-demo bridge preset

Two-process (proposer + Qwen3-0.6B verifier) distributed spec-decode over real
gRPC sockets, asserting byte-identical-to-greedy. Lets the Mac bridge validate
the ADR 0009 distributed engine on-device (mirrors the GPU-host run).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…nsformers 4.x AND 5.x safe

#105's demo assumed apply_chat_template returns token ids (transformers 4.x);
on transformers 5.x it returns a string -> verifier.prefill got a str and raised
'str object cannot be interpreted as an integer' (hit on the Mac, whose venv runs
transformers 5.x for gemma-4). Pass tokenize=True and coerce str/BatchEncoding/
nested-list to a flat List[int]. Verified on the GPU host (transformers 4.57) and
re-validated on the Mac (5.x).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…tall -e .'

The repo is not a pip package (no setup.py/pyproject.toml; it runs via
PYTHONPATH, as ci.yaml documents), so integration.yaml's 'pip install -e .'
always errored ('does not appear to be a Python project') whenever the Mac M4
integration job actually ran. #157 triggered that job (it adds an integration
test), exposing the pre-existing break. Install runtime deps from
requirements.txt instead; the suite already runs with PYTHONPATH=.:sdks/python.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…False) for cross-version token ids

transformers 5.x returns a dict {input_ids, attention_mask} from
apply_chat_template by default -> verifier.prefill got dict keys and raised
'str object cannot be interpreted as an integer' on the Mac (5.x). Adopt the
proven kv_cache_proposer.proposer.encode_chat convention (return_dict=False) in
the distributed integration test fixture AND the demo, replacing the demo's
ad-hoc coercion. Both distributed integration tests pass locally (2 passed).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
…lx-distributed-spec-decode-bench preset

scripts/bench_distributed_spec_decode.py measures the three axes the distributed
spec-decode path is judged on; run_distributed_bench.sh starts a local proposer
and benches against it; bridge preset runs it on-device. Used to produce the
GPU-host / Mac / cross-host comparison.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
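For readers who want to reproduce the RTT p50/p99 style of measurement, a minimal sketch follows. It is not the code in scripts/bench_distributed_spec_decode.py; the helper names and the nearest-rank percentile definition are assumptions, and `call` stands in for one ProposeBlock round trip.

```python
import math
import time
from typing import Callable, List


def percentile(samples: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def measure_rtt_ms(call: Callable[[], object], n: int = 300) -> List[float]:
    """Time `call` n times; return per-call latency in milliseconds."""
    latencies = []
    for _ in range(n):
        t0 = time.perf_counter()
        call()  # e.g. one blocking RPC to the remote proposer
        latencies.append((time.perf_counter() - t0) * 1e3)
    return latencies


if __name__ == "__main__":
    rtts = measure_rtt_ms(lambda: None, n=300)  # no-op stands in for the RPC
    print(f"p50={percentile(rtts, 50):.3f} ms  p99={percentile(rtts, 99):.3f} ms")
```

With n=300 the nearest-rank p99 is the 297th smallest sample, so it is sensitive to a handful of slow calls; that is worth keeping in mind when reading the cross-host p99.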
…bench preset

mlx-distributed-spec-decode-bench's rtt_samples needs a higher cap than
n_samples (50); add MAX_RTT_SAMPLES=5000 and the int:rtt_samples validator entry.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request Jun 19, 2026
…+f_θ proposer (F3 data plane)

#157 feat(distributed): multi-host capability exchange + distributed speculative decoding
#158 feat(distributed): remote DFlash+f_θ proposer (F3 bulk-tensor data plane) + SOP skill + deploy scripts

Validated: 111 distributed unit tests; real-model byte-identical E2E (in-process,
loopback gRPC, live Mac<->H200 cross-host); RTT/throughput/bounded-memory report.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@cursor cursor Bot merged commit 25a77b1 into main Jun 19, 2026
7 of 8 checks passed