feat(client): download-style deferred retry for merkle uploads#110
Conversation
bd8ccf5 to
9c1bfde
Compare
Review: #110 — Download-style deferred retry for merkle uploads (V2-466)VerdictReady to merge once the four E2E jobs (E2E + Merkle E2E × {ubuntu, macos}) turn green. The refactor is well-scoped, the in-wave vs. post-wave split is the right shape, and the failure-preservation contract is preserved end-to-end. Production evidence (PROD-UL-01) is internally consistent and confirms the mechanism fires exactly where it should. CI is green on every static / unit / build / audit / doc check; the only outstanding jobs are the four E2E suites, which are still in progress at the time of review. Summary
Correctness / boundary check ✅
Tests ✅
Lint / format ✅
CI status
E2E + Merkle E2E on both platforms still running at the time of review. Re-check before flipping to "approved"; nothing in the diff looks likely to fail them. Suggestions (non-blocking, not gating RC)
Files changed
Reviewed by Hermes Agent — locally built and tested on PR head against |
mickvandijke
left a comment
There was a problem hiding this comment.
Findings
- High: deferred retry can blow the upload memory bound.
The PR accumulates all quorum-short chunks across the whole file into deferred and then passes the entire set into one deferred retry call (file.rs: deferred.extend(...) then merkle_deferred_retry(deferred, ...)). Inside each retry round, merkle_deferred_retry builds round_addrs for all remaining chunks and immediately calls read_bodies(&round_addrs). The CLI caller wires that to spill.read_wave(addrs), and read_wave materializes every requested chunk body into a Vec.
That violates the stated peak-memory contract. The wave path is bounded to about 64 chunks / ~256 MB, but the deferred path is now bounded by the total number of deferred chunks in the whole file. On a large upload with hundreds or thousands of quorum-short chunks, this can re-read gigabytes before store concurrency applies. Please process deferred chunks in bounded batches, or stream body reads into the store futures so only store_concurrency or UPLOAD_WAVE_SIZE bodies are resident.
- Medium: fatal store errors produce incomplete
PartialUploadaccounting.
merkle_store_with_retry records successes internally, but on a non-quorum error it returns Err immediately. The wave caller then builds PartialUpload from only previously recorded state, so a fatal error in a wave with no missing proofs can report failed_count = 0 and omit same-wave successes. The deferred path has the same shape: if a retry round hits a fatal error, it marks the entire pre-round remaining set as failed and cannot preserve any successes from the same concurrent pass.
This can under-report stored chunks and produce misleading CLI JSON/resume metadata. The helper should return a partial outcome plus fatal address/reason, or the caller should use a one-pass collector that preserves successes before surfacing the fatal.
Verification run locally
git diff --check origin/rc-2026.6.2...origin/pr/110cargo fmt --all -- --checkcargo clippy -p ant-core --all-targets --all-features -- -D warningscargo test -p ant-core --lib-> 336 passed
|
Follow-up after re-reading Mick's review comments against the current diff: I agree both findings are real, with different severity. 1. Deferred retry memory bound — blocker for this PRThe PR rustdoc/intent says peak memory stays bounded because bodies are re-read from disk rather than pinned. The wave path is still bounded, but the new deferred path is not. Trace:
So peak resident body memory in the deferred round is bounded by That matters because the deferred path is exactly where large uploads can accumulate a large file-level retry set. Example shape: a 50 GB upload with ~20% quorum-short chunks can put roughly 10 GB of bodies resident before throttling helps. Suggested RC-sized fix: process deferred chunks in 2. Fatal-error accounting — real, smaller blast radiusThe second issue is also real, though narrower and partly pre-existing. On a non- In the new deferred wrapper, the The old wave path had the same under-accounting shape on fatal errors, so this PR did not introduce the whole class of bug; it adds another place where it can happen. Since the helper is shared, a small helper-level fix could correct both paths: surface a partial Merge recommendationI'd block this PR on fixing issue 1 because it breaks the stated bounded-memory guarantee and is likely mechanical to fix. Issue 2 is worth taking in the same PR if the helper-level refactor stays small; otherwise it should at least be tracked as a follow-up before merge. Note: my earlier review comment said peak RAM was unchanged. That was wrong after this closer read of the deferred path; Mick's memory finding is correct. |
|
Thanks @mickvandijke — both findings were spot on. Addressed in 2228a92. 1. Memory bound (high). 2. Fatal-error accounting (medium). Fixed at the root in
This also fixes the pre-existing wave-path under-reporting you flagged (a fatal wave with no missing proofs no longer reports
|
…ounting Addresses review on the deferred merkle-upload retry path. 1. Memory bound (high): the deferred pass read every quorum-short chunk in the whole file into one Vec per round before storing, so peak resident bodies scaled with the file-wide deferred count rather than the wave path's ~UPLOAD_WAVE_SIZE / ~256 MB bound. merkle_deferred_retry now takes a batch_size and processes each round in batches of that size, re-reading only one batch of bodies from the spill at a time. The CLI caller passes UPLOAD_WAVE_SIZE. 2. Fatal-abort accounting (medium): merkle_store_with_retry returned Err on a non-quorum error, discarding the successes already recorded in that pass; the wave/deferred callers then built PartialUpload from stale state (could report failed_count = 0 and omit same-pass stores). The store helper now preserves same-pass successes (stored/stored_addresses), records the fatal chunk as failed, and surfaces the error via a new MerkleStoreOutcome::fatal field instead of Err. The external-signer path re-raises fatal as Err to keep its all-or-nothing contract; the CLI wave and deferred paths fold it into a PartialUpload whose failed set is derived authoritatively as every input chunk not in stored_addresses (shared partial_upload_after_fatal helper), so stored_count + failed_count accounts for the whole file. This also fixes the pre-existing wave-path under-reporting the review noted. Tests: same-pass successes preserved on fatal; deferred reads bounded to batch_size; updated the non-quorum-error test to assert fatal-in-outcome. cargo test -p ant-core --lib -> 338 passed; clippy and fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2228a92 to
1d9916d
Compare
Summary
Port the download path's deferred-retry design to the CLI merkle upload path so a few quorum-short
chunks no longer stall the whole pipeline behind in-wave backoffs (V2-466).
Previously
upload_waves_merklestored each wave of 64 chunks throughmerkle_store_with_retrywithup to 4 attempts and 30s jittered backoffs, behind a hard barrier: wave N+1 could not start until
wave N's retry loop fully drained. A handful of quorum-short chunks therefore parked the wave's other
~63 slots idle through multiple backoffs — the single biggest throughput sink on the PROD-UL-01 run
(one wave alone burned 34 minutes).
max_attempts = 1, no backoff) so a wave never blocks on aslow chunk.
immediately.
[0, 15, 45]sdelays (matching the download path), re-reading each chunk's body from the spill at retry time
(peak RAM unchanged, ~256 MB) and reusing its already-paid merkle proof.
PartialUpload;a non-quorum error aborts as
PartialUploadwhile preserving earlier progress. Stats and progressnumbering are carried across rounds, with each deferred round's successes recorded in its own
histogram slot (wave first-pass → slot 0, deferred round r → slot r+1).
Adds
merkle_deferred_retry,DeferredRetryOutcome,deferred_round_histogram_slot,DEFERRED_ROUND_DELAYS_SECS, and unit tests. Scoped to the CLI/spill path only; the external-signermerkle_upload_chunks(in-memory, no spill) is unchanged.Test plan (CI)
cargo test -p ant-core --lib→ 336 passed, 0 failed (incl. 5 new deferred-loop tests:histogram-slot mapping, succeeds-on-a-later-round, leftovers-become-failed,
fatal-error-preserves-prior-progress, empty-set-noop)
cargo clippy -p ant-core --all-targets --all-features -- -D warnings→ cleancargo fmt -p ant-core -- --check→ cleanProduction validation (PROD-UL-01, 5-file merkle upload)
A patched
antbinary was run against production (PROD-UL-01-ant-client-upload-lon1-1,payment=merkle). This is a client-only change, and production naturally supplies the conditionsthat drive quorum shortfalls (disk-full community nodes, closeness divergence).
The new mechanism fired on the two files that hit a quorum-short chunk (files 1 and 4):
Both showed only round 1/3 — the chunk recovered on the immediate (0s) first round and the loop
exited early. The very next log line is a network query at the same millisecond
(
00:10:19.420787→.426822), confirming no backoff sleep before the retry.The structured stdout JSON confirms a chunk stored via the deferred path, counted exactly once:
[924, **1**, 0, 0][386, **1**, 0, 0]Slot
[1]is exactly the slot the first deferred round writes to. The accounting is internallyconsistent and free of double-counting (file 1: 1 preflight already-stored + 924 wave-pass + 1
deferred = 926;
chunk_attempts_total926 = 925 wave attempts incl. the one shortfall + 1 deferredattempt).
The old in-wave backoff loop is gone —
"retrying after backoff"(emitted only insidemerkle_store_with_retrybetween attempts) appears 0 times across all 5 files, confirming eachwave now runs a single pass and never sleeps mid-wave.
Recovery vs. the previous release: in the prior release run, the single quorum-short chunk in the
two affected files ended as
partial(770/771 and 968/969, 1 failed each). With this change theequivalent quorum-short chunk recovered — both files finished
okwith 0 failed — withoutparking the wave's other slots behind 30s backoffs.
Wall-time is suggestive but not strictly apples-to-apples (the two runs uploaded different files):
the largest file here (926 chunks / 3.59 GB) took 112m, versus the prior run's 843 chunks / 3.27 GB
at 190m and 969 chunks / 3.77 GB at 167m. This run only deferred 1 chunk per file, so it validates
correctness of the new path well; the larger throughput win scales with how many chunks defer per
wave (and gets cheaper once V2-411's disk pre-check lands).
Closes V2-466
🤖 Generated with Claude Code