Skip to content

feat(client): download-style deferred retry for merkle uploads#110

Merged
jacderida merged 1 commit into
WithAutonomi:rc-2026.6.2from
jacderida:feat-deferred_retry_for_merkle_uploads
Jun 9, 2026
Merged

feat(client): download-style deferred retry for merkle uploads#110
jacderida merged 1 commit into
WithAutonomi:rc-2026.6.2from
jacderida:feat-deferred_retry_for_merkle_uploads

Conversation

@jacderida

Copy link
Copy Markdown
Contributor

Summary

Port the download path's deferred-retry design to the CLI merkle upload path so a few quorum-short
chunks no longer stall the whole pipeline behind in-wave backoffs (V2-466).

Previously upload_waves_merkle stored each wave of 64 chunks through merkle_store_with_retry with
up to 4 attempts and 30s jittered backoffs, behind a hard barrier: wave N+1 could not start until
wave N's retry loop fully drained. A handful of quorum-short chunks therefore parked the wave's other
~63 slots idle through multiple backoffs — the single biggest throughput sink on the PROD-UL-01 run
(one wave alone burned 34 minutes).

  • Store each wave in a single pass (max_attempts = 1, no backoff) so a wave never blocks on a
    slow chunk.
  • Collect quorum-short chunks into a file-level deferred set and advance to the next wave
    immediately.
  • After the last wave, retry the whole deferred set in concurrent rounds with [0, 15, 45]s
    delays (matching the download path), re-reading each chunk's body from the spill at retry time
    (peak RAM unchanged, ~256 MB) and reusing its already-paid merkle proof.
  • Failure semantics preserved: chunks still short after the final round surface as PartialUpload;
    a non-quorum error aborts as PartialUpload while preserving earlier progress. Stats and progress
    numbering are carried across rounds, with each deferred round's successes recorded in its own
    histogram slot (wave first-pass → slot 0, deferred round r → slot r+1).
  • Total per-chunk retry budget is unchanged (1 wave pass + 3 deferred rounds).

Adds merkle_deferred_retry, DeferredRetryOutcome, deferred_round_histogram_slot,
DEFERRED_ROUND_DELAYS_SECS, and unit tests. Scoped to the CLI/spill path only; the external-signer
merkle_upload_chunks (in-memory, no spill) is unchanged.

Test plan (CI)

  • cargo test -p ant-core --lib → 336 passed, 0 failed (incl. 5 new deferred-loop tests:
    histogram-slot mapping, succeeds-on-a-later-round, leftovers-become-failed,
    fatal-error-preserves-prior-progress, empty-set-noop)
  • cargo clippy -p ant-core --all-targets --all-features -- -D warnings → clean
  • cargo fmt -p ant-core -- --check → clean

Production validation (PROD-UL-01, 5-file merkle upload)

A patched ant binary was run against production (PROD-UL-01-ant-client-upload-lon1-1,
payment=merkle). This is a client-only change, and production naturally supplies the conditions
that drive quorum shortfalls (disk-full community nodes, closeness divergence).

The new mechanism fired on the two files that hit a quorum-short chunk (files 1 and 4):

INFO ant_core::data::client::file:   Deferring 1 merkle chunk(s) short of quorum for concurrent retry after final wave
INFO ant_core::data::client::merkle: Deferred merkle retry round 1/3: 1 chunk(s) short of quorum

Both showed only round 1/3 — the chunk recovered on the immediate (0s) first round and the loop
exited early. The very next log line is a network query at the same millisecond
(00:10:19.420787.426822), confirming no backoff sleep before the retry.

The structured stdout JSON confirms a chunk stored via the deferred path, counted exactly once:

File chunks_stored / failed retries_histogram chunk_attempts_total
1 (3.59 GB, 926 chunks) 926 / 0 [924, **1**, 0, 0] 926
4 (1.50 GB, 388 chunks) 388 / 0 [386, **1**, 0, 0] 388

Slot [1] is exactly the slot the first deferred round writes to. The accounting is internally
consistent and free of double-counting (file 1: 1 preflight already-stored + 924 wave-pass + 1
deferred = 926; chunk_attempts_total 926 = 925 wave attempts incl. the one shortfall + 1 deferred
attempt).

The old in-wave backoff loop is gone"retrying after backoff" (emitted only inside
merkle_store_with_retry between attempts) appears 0 times across all 5 files, confirming each
wave now runs a single pass and never sleeps mid-wave.

Recovery vs. the previous release: in the prior release run, the single quorum-short chunk in the
two affected files ended as partial (770/771 and 968/969, 1 failed each). With this change the
equivalent quorum-short chunk recovered — both files finished ok with 0 failed — without
parking the wave's other slots behind 30s backoffs.

Wall-time is suggestive but not strictly apples-to-apples (the two runs uploaded different files):
the largest file here (926 chunks / 3.59 GB) took 112m, versus the prior run's 843 chunks / 3.27 GB
at 190m and 969 chunks / 3.77 GB at 167m. This run only deferred 1 chunk per file, so it validates
correctness of the new path well; the larger throughput win scales with how many chunks defer per
wave (and gets cheaper once V2-411's disk pre-check lands).

Closes V2-466

🤖 Generated with Claude Code

@jacderida jacderida force-pushed the feat-deferred_retry_for_merkle_uploads branch from bd8ccf5 to 9c1bfde Compare June 8, 2026 13:59
@dirvine

dirvine commented Jun 8, 2026

Copy link
Copy Markdown

Review: #110 — Download-style deferred retry for merkle uploads (V2-466)

Verdict

Ready to merge once the four E2E jobs (E2E + Merkle E2E × {ubuntu, macos}) turn green.

The refactor is well-scoped, the in-wave vs. post-wave split is the right shape, and the failure-preservation contract is preserved end-to-end. Production evidence (PROD-UL-01) is internally consistent and confirms the mechanism fires exactly where it should. CI is green on every static / unit / build / audit / doc check; the only outstanding jobs are the four E2E suites, which are still in progress at the time of review.


Summary

  • What changes: upload_waves_merkle no longer retries in-wave. Each wave now runs a single pass (max_attempts = 1, Duration::ZERO); quorum-short chunks are collected into a file-level deferred set and retried after the final wave by the new merkle_deferred_retry helper, which mirrors the download path's [0, 15, 45]s round delays. Same per-chunk retry budget (1 wave pass + 3 deferred rounds = 4 attempts); same total wall-time in the worst case; far better worst-case fan-out because no wave blocks behind a slow chunk's backoff.
  • Public-API surface: no change. The change is entirely inside Client::upload_waves_merkle and the new merkle_deferred_retry/DeferredRetryOutcome/deferred_round_histogram_slot/DEFERRED_ROUND_DELAYS_SECS items in ant-core/src/data/client/merkle.rs (all pub(crate)). The external-signer merkle_upload_chunks (in-memory, no spill) is explicitly untouched.
  • Accountability preserved: progress numbering stays continuous (stored_offset is threaded through), the retries_histogram is zipped so per-round counts remain meaningful, and a deferred round's chunk_attempts_total is added to the file-level aggregate. Leftovers after the final round still surface as PartialUpload; a non-quorum store error in any round still aborts as PartialUpload while preserving earlier progress.

Correctness / boundary check ✅

  • The key thing to verify on a retry refactor like this is caller-boundary semantics, not just the helper. I checked:
    • Partial-success accounting: the deferred set is collected per wave and only joined to failed after the final round (or on fatal). No wave's earlier successes are lost. Verified against the new deferred_retry_fatal_error_preserves_prior_progress test and the file.rs merging logic (stored_addresses.extend + dr.fatal → PartialUpload with earlier stores).
    • Non-retryable errors: the only retryable error inside merkle_deferred_retry is InsufficientPeers (via merkle_store_with_retry); any other error becomes Err(...), which is converted into outcome.fatal = Some(...) with remaining recorded as failed_addresses. The caller's fatal path preserves stored_addresses and total_stored. ✅
    • Progress events: the helper takes the same progress: Option<&mpsc::Sender<UploadEvent>> and total; it passes stored_offset through unchanged, so ChunkStored { stored, total } events stay correctly numbered across rounds. ✅
    • Histogram mapping: the wave first pass lands in slot 0 (set by merkle_store_with_retry for attempt == 0); each deferred round r redirects its single-pass successes to slot r + 1 (clamped to the last slot). The deferred_round_histogram_slot_maps_and_clamps unit test pins this. The merging at the call site uses zip, so a deferred round that lands in slot 3 (clamped) and a later wave pass into slot 3 don't double-count because the wave helper and the deferred helper each own a disjoint attempt count. This is internally consistent and matches the prod-run numbers in the PR body. ✅
    • Peak RAM: bodies are re-read from the spill at retry time via the injected read_bodies closure (calls spill.read_wave); nothing is held across rounds. ✅
    • store_limiter snapshot: store_limiter.current() is re-evaluated on each round (it's cheap, just .current() on the Semaphore), so a routing-table-driven limiter doesn't strand concurrency on a stale value. ✅

Tests ✅

  • 5 new unit tests in merkle::tests:
    • deferred_round_histogram_slot_maps_and_clamps — slot math + clamping
    • deferred_retry_succeeds_on_a_later_round — exactly-once accounting, slot placement
    • deferred_retry_leftovers_become_failed — exhausted retries → failed, no fatal
    • deferred_retry_fatal_error_preserves_prior_progress — round-N fatal preserves round-M success
    • deferred_retry_empty_set_is_a_noopstored_offset carried through unchanged
  • All 37 tests in data::client::merkle pass locally (including the 5 new ones). I ran cargo test -p ant-core --lib data::client::merkle:: against the PR head on rc-2026.6.2.

Lint / format ✅

  • cargo fmt -p ant-core -- --check clean.
  • cargo clippy -p ant-core --all-targets --all-features -- -D warnings clean.

CI status

Check Status
Format ✅ SUCCESS
Clippy ✅ SUCCESS
Unit Tests (ubuntu / macos) ✅ SUCCESS / ✅ SUCCESS
E2E Tests (ubuntu / macos) ⏳ IN_PROGRESS / ⏳ IN_PROGRESS
Merkle E2E (ubuntu / macos) ⏳ IN_PROGRESS / ⏳ IN_PROGRESS
Documentation ✅ SUCCESS
Build (ubuntu / macos) ✅ SUCCESS / ✅ SUCCESS
Security Audit ✅ SUCCESS

E2E + Merkle E2E on both platforms still running at the time of review. Re-check before flipping to "approved"; nothing in the diff looks likely to fail them.

Suggestions (non-blocking, not gating RC)

  1. merkle_store_with_retry is now only ever called with max_attempts = 1 from the wave path (the other in-tree call, merkle_upload_chunks, is unchanged). The max_attempts and backoff parameters of merkle_store_with_retry are no longer used by the CLI/spill path. Worth a comment in merkle_store_with_retry's rustdoc noting that the wave path is single-pass and the retry knob now lives in the deferred helper, so future readers don't re-add in-wave backoff. Cosmetic.
  2. [0, 15, 45] is duplicated between download and upload. Fine for RC; if the team wants to keep the two in lockstep, a single shared constant (e.g. in ant-protocol or a small client consts module) would prevent drift. Not blocking.
  3. Production data is thin: only 1 chunk deferred per affected file, recovered on round 0/3 (so the 15s/45s delays never actually slept). The change is validated for correctness on real data, but the throughput win — which scales with how many chunks defer per wave — is not directly demonstrated. The author is upfront about this in the body. Suggestion only: if V2-411 (disk pre-check) or a follow-up run defers more chunks, capture a before/after wall-time on a real quorum-degraded run and link it in the PR description.

Files changed

  • ant-core/src/data/client/file.rs+101 / -21 (wave loop, deferred call, error handling)
  • ant-core/src/data/client/merkle.rs+362 / -1 (merkle_deferred_retry, helpers, tests)

Reviewed by Hermes Agent — locally built and tested on PR head against rc-2026.6.2.

@mickvandijke mickvandijke left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Findings

  1. High: deferred retry can blow the upload memory bound.

The PR accumulates all quorum-short chunks across the whole file into deferred and then passes the entire set into one deferred retry call (file.rs: deferred.extend(...) then merkle_deferred_retry(deferred, ...)). Inside each retry round, merkle_deferred_retry builds round_addrs for all remaining chunks and immediately calls read_bodies(&round_addrs). The CLI caller wires that to spill.read_wave(addrs), and read_wave materializes every requested chunk body into a Vec.

That violates the stated peak-memory contract. The wave path is bounded to about 64 chunks / ~256 MB, but the deferred path is now bounded by the total number of deferred chunks in the whole file. On a large upload with hundreds or thousands of quorum-short chunks, this can re-read gigabytes before store concurrency applies. Please process deferred chunks in bounded batches, or stream body reads into the store futures so only store_concurrency or UPLOAD_WAVE_SIZE bodies are resident.

  1. Medium: fatal store errors produce incomplete PartialUpload accounting.

merkle_store_with_retry records successes internally, but on a non-quorum error it returns Err immediately. The wave caller then builds PartialUpload from only previously recorded state, so a fatal error in a wave with no missing proofs can report failed_count = 0 and omit same-wave successes. The deferred path has the same shape: if a retry round hits a fatal error, it marks the entire pre-round remaining set as failed and cannot preserve any successes from the same concurrent pass.

This can under-report stored chunks and produce misleading CLI JSON/resume metadata. The helper should return a partial outcome plus fatal address/reason, or the caller should use a one-pass collector that preserves successes before surfacing the fatal.

Verification run locally

  • git diff --check origin/rc-2026.6.2...origin/pr/110
  • cargo fmt --all -- --check
  • cargo clippy -p ant-core --all-targets --all-features -- -D warnings
  • cargo test -p ant-core --lib -> 336 passed

@dirvine

dirvine commented Jun 9, 2026

Copy link
Copy Markdown

Follow-up after re-reading Mick's review comments against the current diff: I agree both findings are real, with different severity.

1. Deferred retry memory bound — blocker for this PR

The PR rustdoc/intent says peak memory stays bounded because bodies are re-read from disk rather than pinned. The wave path is still bounded, but the new deferred path is not.

Trace:

  • file.rs accumulates quorum-short chunks into a file-level deferred set across all waves.
  • The caller then invokes merkle_deferred_retry(deferred, ...) once for the whole file-level set.
  • In each retry round, merkle_deferred_retry builds round_addrs from all remaining chunks, then calls read_bodies(&round_addrs).
  • The CLI/spill caller wires that callback to spill.read_wave(addrs).
  • ChunkSpill::read_wave materializes every requested body into a Vec<(Bytes, [u8; 32])> before calling merkle_store_with_retry.
  • merkle_store_with_retry then holds the whole chunks argument for the duration of the round; buffer_unordered(concurrency) only limits active store futures, not the already-materialized queued bodies.

So peak resident body memory in the deferred round is bounded by deferred_count * chunk_size, not by UPLOAD_WAVE_SIZE * chunk_size or by store_concurrency * chunk_size.

That matters because the deferred path is exactly where large uploads can accumulate a large file-level retry set. Example shape: a 50 GB upload with ~20% quorum-short chunks can put roughly 10 GB of bodies resident before throttling helps.

Suggested RC-sized fix: process deferred chunks in UPLOAD_WAVE_SIZE-sized batches, in series, or make the deferred store path stream read_chunk into store futures so only store_concurrency bodies are resident. For the RC, batching is probably the smallest/easiest review.

2. Fatal-error accounting — real, smaller blast radius

The second issue is also real, though narrower and partly pre-existing.

On a non-InsufficientPeers store error inside a round, merkle_store_with_retry returns Err(e) immediately. Any same-round store successes that completed before the fatal error are accumulated only inside the helper's local MerkleStoreOutcome, which is then discarded because the function returns Err rather than a partial outcome.

In the new deferred wrapper, the Err arm then treats the whole pre-round remaining set as failed. That means same-round successes are both missing from stored accounting and can be reported as failed.

The old wave path had the same under-accounting shape on fatal errors, so this PR did not introduce the whole class of bug; it adds another place where it can happen. Since the helper is shared, a small helper-level fix could correct both paths: surface a partial MerkleStoreOutcome containing already-landed successes plus fatal: Some(...), instead of throwing the partial outcome away via Err(e).

Merge recommendation

I'd block this PR on fixing issue 1 because it breaks the stated bounded-memory guarantee and is likely mechanical to fix. Issue 2 is worth taking in the same PR if the helper-level refactor stays small; otherwise it should at least be tracked as a follow-up before merge.

Note: my earlier review comment said peak RAM was unchanged. That was wrong after this closer read of the deferred path; Mick's memory finding is correct.

@jacderida

Copy link
Copy Markdown
Contributor Author

Thanks @mickvandijke — both findings were spot on. Addressed in 2228a92.

1. Memory bound (high). merkle_deferred_retry now takes a batch_size and processes each round in batches of that size, re-reading only one batch of bodies from the spill at a time (the CLI caller passes UPLOAD_WAVE_SIZE). Peak resident bodies is back to the wave path's batch_size × MAX_CHUNK_SIZE (~256 MB) bound regardless of how many chunks were deferred file-wide. New test deferred_retry_reads_bodies_in_bounded_batches asserts read_bodies is never handed more than batch_size addresses at once.

2. Fatal-error accounting (medium). Fixed at the root in merkle_store_with_retry: instead of returning Err (and discarding the successes recorded in that pass), it now preserves same-pass successes in stored/stored_addresses, records the fatal chunk as failed, and surfaces the error via a new MerkleStoreOutcome::fatal field. Callers:

  • external-signer (merkle_upload_chunks) re-raises fatal as Err to keep its all-or-nothing contract (behaviour unchanged);
  • the CLI wave and deferred paths fold it into PartialUpload whose failed set is derived authoritatively as every input chunk not in stored_addresses (shared partial_upload_after_fatal helper), so stored_count + failed_count always accounts for the whole file.

This also fixes the pre-existing wave-path under-reporting you flagged (a fatal wave with no missing proofs no longer reports failed_count = 0 or omits same-wave stores). New test store_with_retry_fatal_preserves_same_pass_successes covers it.

cargo test -p ant-core --lib → 338 passed; clippy (--all-targets --all-features -D warnings) and fmt clean.

…ounting

Addresses review on the deferred merkle-upload retry path.

1. Memory bound (high): the deferred pass read every quorum-short chunk in the
   whole file into one Vec per round before storing, so peak resident bodies
   scaled with the file-wide deferred count rather than the wave path's
   ~UPLOAD_WAVE_SIZE / ~256 MB bound. merkle_deferred_retry now takes a
   batch_size and processes each round in batches of that size, re-reading only
   one batch of bodies from the spill at a time. The CLI caller passes
   UPLOAD_WAVE_SIZE.

2. Fatal-abort accounting (medium): merkle_store_with_retry returned Err on a
   non-quorum error, discarding the successes already recorded in that pass; the
   wave/deferred callers then built PartialUpload from stale state (could report
   failed_count = 0 and omit same-pass stores). The store helper now preserves
   same-pass successes (stored/stored_addresses), records the fatal chunk as
   failed, and surfaces the error via a new MerkleStoreOutcome::fatal field
   instead of Err. The external-signer path re-raises fatal as Err to keep its
   all-or-nothing contract; the CLI wave and deferred paths fold it into a
   PartialUpload whose failed set is derived authoritatively as every input
   chunk not in stored_addresses (shared partial_upload_after_fatal helper), so
   stored_count + failed_count accounts for the whole file. This also fixes the
   pre-existing wave-path under-reporting the review noted.

Tests: same-pass successes preserved on fatal; deferred reads bounded to
batch_size; updated the non-quorum-error test to assert fatal-in-outcome.
cargo test -p ant-core --lib -> 338 passed; clippy and fmt clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jacderida jacderida force-pushed the feat-deferred_retry_for_merkle_uploads branch from 2228a92 to 1d9916d Compare June 9, 2026 14:09
@jacderida jacderida merged commit 671c32f into WithAutonomi:rc-2026.6.2 Jun 9, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants