feat: expose ParameterServer metas via HTTP for cross-process join #90
Open
ddmm2020 wants to merge 1 commit into
Add two HTTP endpoints (GET/POST /v1/checkpoints/{name}/{metas,load-metas})
and a standalone `python -m checkpoint_engine.join_cli` entrypoint, so a
new ParameterServer instance can join an existing P2P weight world over
mooncake RDMA without re-reading checkpoints from disk.
Motivation: in elastic-rollout setups (e.g. mshrl), a long-running training
job already holds pinned CPU weight buffers registered with the mooncake
P2PStore. Newly-started inference replicas should be able to pull these
weights over RDMA instead of re-converting the checkpoint from disk.
Changes:
* api.py: GET /v1/checkpoints/{name}/metas returns pickle.dumps(ps.get_metas())
as application/octet-stream; POST /v1/checkpoints/{name}/load-metas accepts
the same bytes, unpickles them, and passes the result to ps.load_metas().
Malformed pickle data is rejected with 400; ParameterServer errors are
surfaced as 500.
* join_cli.py: `python -m checkpoint_engine.join_cli` -- the join() flow from
examples/update.py, packaged as a first-class CLI under the published
package so consumers can invoke it without checking out the source tree.
Accepts metas from either a local pickle file or a remote HTTP URL.
* tests/test_api.py: 6 CPU-only tests covering pickle round-trip, ps-error
propagation, bad-input rejection, and a GET-then-POST chain that validates
the new endpoints are mutually consistent.
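The endpoint contract and the two metas sources described in the list above can be sketched without any web framework. This is a minimal illustration, not the code in api.py or join_cli.py: `FakeParameterServer`, `handle_get_metas`, `handle_load_metas`, and `read_metas_bytes` are hypothetical names, and only the `get_metas()` / `load_metas()` method names and the 200/400/500 status mapping come from the description.

```python
import pickle
import urllib.request


class FakeParameterServer:
    """Hypothetical stand-in exposing only get_metas() and load_metas()."""

    def __init__(self, metas=None, fail=False):
        self._metas = metas or {}
        self._fail = fail

    def get_metas(self):
        return self._metas

    def load_metas(self, metas):
        if self._fail:
            raise RuntimeError("simulated ParameterServer failure")
        self._metas = metas


def handle_get_metas(ps):
    """GET .../metas: return (status, body, content_type)."""
    try:
        body = pickle.dumps(ps.get_metas())
    except Exception as exc:  # ParameterServer-side failure -> 500
        return 500, str(exc).encode(), "text/plain"
    return 200, body, "application/octet-stream"


def handle_load_metas(ps, body):
    """POST .../load-metas: return (status, body, content_type)."""
    # pickle.loads can execute arbitrary code, so this assumes trusted peers.
    try:
        metas = pickle.loads(body)
    except Exception as exc:  # malformed payload -> 400
        return 400, f"invalid pickle: {exc}".encode(), "text/plain"
    try:
        ps.load_metas(metas)
    except Exception as exc:  # ParameterServer-side failure -> 500
        return 500, str(exc).encode(), "text/plain"
    return 200, b"", "text/plain"


def read_metas_bytes(source):
    """Read pickled metas from an http(s) URL or from a local file path."""
    if source.startswith(("http://", "https://")):
        with urllib.request.urlopen(source) as resp:
            return resp.read()
    with open(source, "rb") as f:
        return f.read()
```

Chaining `handle_get_metas` on one instance into `handle_load_metas` on another mirrors the GET-then-POST consistency check mentioned in the test bullet.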
Verified end-to-end on a 2-node launchpad job: 14.5 GiB Qwen2.5-7B weights
transferred from main to elastic in 1.49s over real RDMA (4 mlx5_bond HCAs)
vs 6.49s over TCP fallback in environments without RDMA passthrough.
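
For scale, the reported timings can be converted to effective throughput. The snippet below uses only the three figures quoted above (14.5 GiB, 1.49s, 6.49s); the variable names are illustrative, and the results are end-to-end averages, not link-level measurements.

```python
size_gib = 14.5   # Qwen2.5-7B weights, as reported
rdma_s = 1.49     # transfer time over RDMA
tcp_s = 6.49      # transfer time over TCP fallback

rdma_gib_per_s = size_gib / rdma_s
tcp_gib_per_s = size_gib / tcp_s
speedup = tcp_s / rdma_s

print(f"RDMA: {rdma_gib_per_s:.2f} GiB/s")  # RDMA: 9.73 GiB/s
print(f"TCP:  {tcp_gib_per_s:.2f} GiB/s")   # TCP:  2.23 GiB/s
print(f"Speedup: {speedup:.2f}x")           # Speedup: 4.36x
```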