feat: expose ParameterServer metas via HTTP for cross-process join #90

Open
ddmm2020 wants to merge 1 commit into MoonshotAI:main from ddmm2020:feat/expose-metas-endpoints

@ddmm2020

Add two HTTP endpoints (`GET /v1/checkpoints/{name}/metas` and `POST /v1/checkpoints/{name}/load-metas`) and a standalone `python -m checkpoint_engine.join_cli` entrypoint, so that a new ParameterServer instance can join an existing P2P weight world over mooncake RDMA without re-reading checkpoints from disk.

Motivation: in elastic-rollout setups (e.g. mshrl), a long-running training job already holds pinned CPU weight buffers registered with the mooncake P2PStore. Newly started inference replicas should be able to pull these weights over RDMA instead of re-converting the checkpoint from disk.

Changes:

  • api.py: `GET /v1/checkpoints/{name}/metas` returns `pickle.dumps(ps.get_metas())` as `application/octet-stream`; `POST /v1/checkpoints/{name}/load-metas` accepts the same bytes, unpickles them, and passes the result to `ps.load_metas()`. A malformed pickle is rejected with 400; ParameterServer errors are surfaced as 500.
  • join_cli.py: `python -m checkpoint_engine.join_cli` packages the `join()` flow from examples/update.py as a first-class CLI inside the published package, so consumers can invoke it without checking out the source tree. It accepts metas from either a local pickle file or a remote HTTP URL.
  • tests/test_api.py: six CPU-only tests covering the pickle round-trip, ParameterServer error propagation, bad-input rejection, and a GET-then-POST chain that validates that the two new endpoints are mutually consistent.
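
For reviewers, the request handling described above can be sketched roughly as follows. This is a minimal, framework-free illustration and not the code in api.py: `FakeParameterServer`, `handle_get_metas`, `handle_load_metas`, and `read_metas_bytes` are hypothetical names, the dict-shaped metas and the type check inside the fake `load_metas()` are assumptions, and only `get_metas()` / `load_metas()` come from the description. Since `pickle.loads` can execute arbitrary code, the sketch assumes the endpoint is reachable only from trusted peers.

```python
import pickle
from pathlib import Path
from urllib.request import urlopen


class FakeParameterServer:
    """Hypothetical stand-in exposing only the two calls named in this PR."""

    def __init__(self, metas=None):
        self._metas = metas if metas is not None else {}

    def get_metas(self):
        return self._metas

    def load_metas(self, metas):
        # Assumed validation, present only so the 500 path can be exercised.
        if not isinstance(metas, dict):
            raise TypeError("metas must be a dict")
        self._metas = metas


def handle_get_metas(ps):
    """GET .../metas: return (status, body) carrying the pickled metas."""
    try:
        return 200, pickle.dumps(ps.get_metas())
    except Exception as exc:  # ParameterServer errors surface as 500
        return 500, str(exc).encode()


def handle_load_metas(ps, body):
    """POST .../load-metas: unpickle the body and hand it to the server."""
    try:
        metas = pickle.loads(body)
    except Exception as exc:  # malformed pickle is a client error
        return 400, f"invalid pickle: {exc}".encode()
    try:
        ps.load_metas(metas)
    except Exception as exc:  # ParameterServer errors surface as 500
        return 500, str(exc).encode()
    return 200, b"ok"


def read_metas_bytes(source):
    """Read pickled metas from an http(s) URL or from a local file path."""
    if source.startswith(("http://", "https://")):
        with urlopen(source) as resp:
            return resp.read()
    return Path(source).read_bytes()
```

Chaining `handle_get_metas` on one instance into `handle_load_metas` on another mirrors the GET-then-POST consistency test, and `read_metas_bytes` mirrors the two metas sources the CLI accepts.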

Verified end-to-end on a 2-node launchpad job: 14.5 GiB of Qwen2.5-7B weights transferred from main to elastic in 1.49 s over real RDMA (4 mlx5_bond HCAs), versus 6.49 s over the TCP fallback used in environments without RDMA passthrough.
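
To put those timings in perspective, the average throughput and the speedup can be derived directly from the reported figures. The helper name below is hypothetical, and the calculation assumes the reported times cover the whole 14.5 GiB transfer and nothing else.

```python
def throughput_gib_per_s(size_gib: float, seconds: float) -> float:
    """Average throughput over the whole transfer, in GiB/s."""
    return size_gib / seconds


SIZE_GIB = 14.5   # reported weight size
RDMA_S = 1.49     # reported RDMA transfer time
TCP_S = 6.49      # reported TCP-fallback transfer time

rdma = throughput_gib_per_s(SIZE_GIB, RDMA_S)
tcp = throughput_gib_per_s(SIZE_GIB, TCP_S)
speedup = TCP_S / RDMA_S

print(f"RDMA: {rdma:.2f} GiB/s")     # RDMA: 9.73 GiB/s
print(f"TCP:  {tcp:.2f} GiB/s")      # TCP:  2.23 GiB/s
print(f"speedup: {speedup:.2f}x")    # speedup: 4.36x
```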
