feat: expose ParameterServer metas via HTTP for cross-process join by ddmm2020 · Pull Request #90 · MoonshotAI/checkpoint-engine

ddmm2020 · 2026-06-15T08:49:02Z

Add two HTTP endpoints (GET `/v1/checkpoints/{name}/metas` and POST `/v1/checkpoints/{name}/load-metas`) and a standalone `python -m checkpoint_engine.join_cli` entrypoint, so a new ParameterServer instance can join an existing P2P weight world over mooncake RDMA without re-reading checkpoints from disk.
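As a rough illustration of the metadata exchange, a joining process could resolve the pickled metas from either a local file or the GET endpoint. This is a minimal sketch using only the standard library; the helper names `read_metas_bytes` and `load_metas_from` are hypothetical and are not taken from `join_cli.py`.

```python
import pickle
import urllib.request
from pathlib import Path


def read_metas_bytes(source: str, timeout: float = 30.0) -> bytes:
    """Return raw pickled metas from an HTTP(S) URL or a local file path."""
    if source.startswith(("http://", "https://")):
        # Assumed URL shape: http://<host>:<port>/v1/checkpoints/<name>/metas
        # urlopen raises urllib.error.HTTPError on non-2xx responses.
        with urllib.request.urlopen(source, timeout=timeout) as resp:
            return resp.read()
    return Path(source).read_bytes()


def load_metas_from(source: str):
    """Unpickle metas fetched from `source`.

    pickle can execute arbitrary code while loading, so only point this
    at files or servers you trust.
    """
    return pickle.loads(read_metas_bytes(source))
```

The resulting object would then be handed to `ps.load_metas()` on the joining side.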

Motivation: in elastic-rollout setups (e.g. mshrl), a long-running training job already holds pinned CPU weight buffers registered with the mooncake P2PStore. Newly-started inference replicas should be able to pull these weights over RDMA instead of re-converting the checkpoint from disk.

Changes:

* `api.py`: GET `/v1/checkpoints/{name}/metas` returns `pickle.dumps(ps.get_metas())` as `application/octet-stream`; POST `/v1/checkpoints/{name}/load-metas` accepts the same bytes and feeds them into `ps.load_metas()`. Bad pickle is rejected with 400; PS errors are surfaced as 500.
* `join_cli.py`: `python -m checkpoint_engine.join_cli`, the `join()` flow from `examples/update.py`, packaged as a first-class CLI under the published package so consumers can invoke it without checking out the source tree. Accepts metas from either a local pickle file or a remote HTTP URL.
* `tests/test_api.py`: 6 CPU-only tests covering pickle round-trip, PS-error propagation, bad-input rejection, and a GET-then-POST chain that validates the new endpoints are mutually consistent.
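The status-code behavior described for the two endpoints can be sketched independently of any web framework. The code below is an assumption-laden illustration, not the code in `api.py`: `FakeParameterServer`, `handle_get_metas`, and `handle_load_metas` are hypothetical names, and the only behavior taken from the description is the pickle round-trip, 400 on a bad pickle, and 500 on a ParameterServer error.

```python
import pickle
from typing import Any, Tuple


class FakeParameterServer:
    """Hypothetical stand-in exposing only get_metas() and load_metas()."""

    def __init__(self, metas: Any = None) -> None:
        self._metas = metas

    def get_metas(self) -> Any:
        return self._metas

    def load_metas(self, metas: Any) -> None:
        self._metas = metas


def handle_get_metas(ps) -> Tuple[int, bytes]:
    """GET .../metas: return (status, body) with the pickled metas."""
    try:
        return 200, pickle.dumps(ps.get_metas())
    except Exception as exc:  # PS failure -> server error
        return 500, str(exc).encode()


def handle_load_metas(ps, body: bytes) -> Tuple[int, bytes]:
    """POST .../load-metas: unpickle the body and pass it to the PS."""
    try:
        metas = pickle.loads(body)
    except Exception as exc:  # malformed payload -> client error
        return 400, f"invalid pickle payload: {exc}".encode()
    try:
        ps.load_metas(metas)
    except Exception as exc:  # PS failure -> server error
        return 500, str(exc).encode()
    return 200, b""
```

Separating the decode step from the `load_metas()` call is what lets a malformed body map to 400 while a failure inside the ParameterServer maps to 500. A GET-then-POST chain then amounts to passing the body returned by `handle_get_metas(source_ps)` to `handle_load_metas(target_ps, body)`.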

Verified end-to-end on a 2-node launchpad job: 14.5 GiB Qwen2.5-7B weights transferred from main to elastic in 1.49s over real RDMA (4 mlx5_bond HCAs) vs 6.49s over TCP fallback in environments without RDMA passthrough.
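For reference, the reported timings translate into effective throughput as follows. This is plain arithmetic on the numbers above and assumes the full 14.5 GiB was moved within each reported time; the function name is illustrative.

```python
def effective_throughput(size_gib: float, seconds: float) -> float:
    """Effective transfer rate in GiB/s for a payload moved in `seconds`."""
    return size_gib / seconds


SIZE_GIB = 14.5                 # Qwen2.5-7B weights, as reported
RDMA_S, TCP_S = 1.49, 6.49      # reported wall-clock times

rdma = effective_throughput(SIZE_GIB, RDMA_S)
tcp = effective_throughput(SIZE_GIB, TCP_S)
speedup = TCP_S / RDMA_S

print(f"RDMA: {rdma:.2f} GiB/s, TCP: {tcp:.2f} GiB/s, speedup: {speedup:.2f}x")
# prints "RDMA: 9.73 GiB/s, TCP: 2.23 GiB/s, speedup: 4.36x"
```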

ddmm2020 wants to merge 1 commit into MoonshotAI:main from ddmm2020:feat/expose-metas-endpoints
