feat(client): download-style deferred retry for merkle uploads by jacderida · Pull Request #110 · WithAutonomi/ant-client

jacderida · 2026-06-08T13:24:36Z

Summary

Port the download path's deferred-retry design to the CLI merkle upload path so a few quorum-short
chunks no longer stall the whole pipeline behind in-wave backoffs (V2-466).

Previously upload_waves_merkle stored each wave of 64 chunks through merkle_store_with_retry with
up to 4 attempts and 30s jittered backoffs, behind a hard barrier: wave N+1 could not start until
wave N's retry loop fully drained. A handful of quorum-short chunks therefore parked the wave's other
~63 slots idle through multiple backoffs — the single biggest throughput sink on the PROD-UL-01 run
(one wave alone burned 34 minutes).

Store each wave in a single pass (max_attempts = 1, no backoff) so a wave never blocks on a
slow chunk.
Collect quorum-short chunks into a file-level deferred set and advance to the next wave
immediately.
After the last wave, retry the whole deferred set in concurrent rounds with [0, 15, 45]s
delays (matching the download path), re-reading each chunk's body from the spill at retry time
(peak RAM unchanged, ~256 MB) and reusing its already-paid merkle proof.
Failure semantics preserved: chunks still short after the final round surface as PartialUpload;
a non-quorum error aborts as PartialUpload while preserving earlier progress. Stats and progress
numbering are carried across rounds, with each deferred round's successes recorded in its own
histogram slot (wave first-pass → slot 0, deferred round r → slot r+1).
Total per-chunk retry budget is unchanged (1 wave pass + 3 deferred rounds).

Adds merkle_deferred_retry, DeferredRetryOutcome, deferred_round_histogram_slot,
DEFERRED_ROUND_DELAYS_SECS, and unit tests. Scoped to the CLI/spill path only; the external-signer
merkle_upload_chunks (in-memory, no spill) is unchanged.

Test plan (CI)

cargo test -p ant-core --lib → 336 passed, 0 failed (incl. 5 new deferred-loop tests:
histogram-slot mapping, succeeds-on-a-later-round, leftovers-become-failed,
fatal-error-preserves-prior-progress, empty-set-noop)
cargo clippy -p ant-core --all-targets --all-features -- -D warnings → clean
cargo fmt -p ant-core -- --check → clean

Production validation (PROD-UL-01, 5-file merkle upload)

A patched ant binary was run against production (PROD-UL-01-ant-client-upload-lon1-1,
payment=merkle). This is a client-only change, and production naturally supplies the conditions
that drive quorum shortfalls (disk-full community nodes, closeness divergence).

The new mechanism fired on the two files that hit a quorum-short chunk (files 1 and 4):

INFO ant_core::data::client::file:   Deferring 1 merkle chunk(s) short of quorum for concurrent retry after final wave
INFO ant_core::data::client::merkle: Deferred merkle retry round 1/3: 1 chunk(s) short of quorum

Both showed only round 1/3 — the chunk recovered on the immediate (0s) first round and the loop
exited early. The very next log line is a network query at the same millisecond
(00:10:19.420787 → .426822), confirming no backoff sleep before the retry.

The structured stdout JSON confirms a chunk stored via the deferred path, counted exactly once:

File	chunks_stored / failed	retries_histogram	chunk_attempts_total
1 (3.59 GB, 926 chunks)	926 / 0	`[924, 1, 0, 0]`	926
4 (1.50 GB, 388 chunks)	388 / 0	`[386, 1, 0, 0]`	388

Slot [1] is exactly the slot the first deferred round writes to. The accounting is internally
consistent and free of double-counting (file 1: 1 preflight already-stored + 924 wave-pass + 1
deferred = 926; chunk_attempts_total 926 = 925 wave attempts incl. the one shortfall + 1 deferred
attempt).

The old in-wave backoff loop is gone — "retrying after backoff" (emitted only inside
merkle_store_with_retry between attempts) appears 0 times across all 5 files, confirming each
wave now runs a single pass and never sleeps mid-wave.

Recovery vs. the previous release: in the prior release run, the single quorum-short chunk in the
two affected files ended as partial (770/771 and 968/969, 1 failed each). With this change the
equivalent quorum-short chunk recovered — both files finished ok with 0 failed — without
parking the wave's other slots behind 30s backoffs.

Wall-time is suggestive but not strictly apples-to-apples (the two runs uploaded different files):
the largest file here (926 chunks / 3.59 GB) took 112m, versus the prior run's 843 chunks / 3.27 GB
at 190m and 969 chunks / 3.77 GB at 167m. This run only deferred 1 chunk per file, so it validates
correctness of the new path well; the larger throughput win scales with how many chunks defer per
wave (and gets cheaper once V2-411's disk pre-check lands).

Closes V2-466

🤖 Generated with Claude Code

dirvine · 2026-06-08T14:17:05Z

Review: #110 — Download-style deferred retry for merkle uploads (V2-466)

Verdict

Ready to merge once the four E2E jobs (E2E + Merkle E2E × {ubuntu, macos}) turn green.

The refactor is well-scoped, the in-wave vs. post-wave split is the right shape, and the failure-preservation contract is preserved end-to-end. Production evidence (PROD-UL-01) is internally consistent and confirms the mechanism fires exactly where it should. CI is green on every static / unit / build / audit / doc check; the only outstanding jobs are the four E2E suites, which are still in progress at the time of review.

Summary

What changes: upload_waves_merkle no longer retries in-wave. Each wave now runs a single pass (max_attempts = 1, Duration::ZERO); quorum-short chunks are collected into a file-level deferred set and retried after the final wave by the new merkle_deferred_retry helper, which mirrors the download path's [0, 15, 45]s round delays. Same per-chunk retry budget (1 wave pass + 3 deferred rounds = 4 attempts); same total wall-time in the worst case; far better worst-case fan-out because no wave blocks behind a slow chunk's backoff.
Public-API surface: no change. The change is entirely inside Client::upload_waves_merkle and the new merkle_deferred_retry/DeferredRetryOutcome/deferred_round_histogram_slot/DEFERRED_ROUND_DELAYS_SECS items in ant-core/src/data/client/merkle.rs (all pub(crate)). The external-signer merkle_upload_chunks (in-memory, no spill) is explicitly untouched.
Accountability preserved: progress numbering stays continuous (stored_offset is threaded through), the retries_histogram is zipped so per-round counts remain meaningful, and a deferred round's chunk_attempts_total is added to the file-level aggregate. Leftovers after the final round still surface as PartialUpload; a non-quorum store error in any round still aborts as PartialUpload while preserving earlier progress.

Correctness / boundary check ✅

The key thing to verify on a retry refactor like this is caller-boundary semantics, not just the helper. I checked:
- Partial-success accounting: the deferred set is collected per wave and only joined to failed after the final round (or on fatal). No wave's earlier successes are lost. Verified against the new deferred_retry_fatal_error_preserves_prior_progress test and the file.rs merging logic (stored_addresses.extend + dr.fatal → PartialUpload with earlier stores).
- Non-retryable errors: the only retryable error inside merkle_deferred_retry is InsufficientPeers (via merkle_store_with_retry); any other error becomes Err(...), which is converted into outcome.fatal = Some(...) with remaining recorded as failed_addresses. The caller's fatal path preserves stored_addresses and total_stored. ✅
- Progress events: the helper takes the same progress: Option<&mpsc::Sender<UploadEvent>> and total; it passes stored_offset through unchanged, so ChunkStored { stored, total } events stay correctly numbered across rounds. ✅
- Histogram mapping: the wave first pass lands in slot 0 (set by merkle_store_with_retry for attempt == 0); each deferred round r redirects its single-pass successes to slot r + 1 (clamped to the last slot). The deferred_round_histogram_slot_maps_and_clamps unit test pins this. The merging at the call site uses zip, so a deferred round that lands in slot 3 (clamped) and a later wave pass into slot 3 don't double-count because the wave helper and the deferred helper each own a disjoint attempt count. This is internally consistent and matches the prod-run numbers in the PR body. ✅
- Peak RAM: bodies are re-read from the spill at retry time via the injected read_bodies closure (calls spill.read_wave); nothing is held across rounds. ✅
- store_limiter snapshot: store_limiter.current() is re-evaluated on each round (it's cheap, just .current() on the Semaphore), so a routing-table-driven limiter doesn't strand concurrency on a stale value. ✅

Tests ✅

5 new unit tests in merkle::tests:
- deferred_round_histogram_slot_maps_and_clamps — slot math + clamping
- deferred_retry_succeeds_on_a_later_round — exactly-once accounting, slot placement
- deferred_retry_leftovers_become_failed — exhausted retries → failed, no fatal
- deferred_retry_fatal_error_preserves_prior_progress — round-N fatal preserves round-M success
- deferred_retry_empty_set_is_a_noop — stored_offset carried through unchanged
All 37 tests in data::client::merkle pass locally (including the 5 new ones). I ran cargo test -p ant-core --lib data::client::merkle:: against the PR head on rc-2026.6.2.

Lint / format ✅

cargo fmt -p ant-core -- --check clean.
cargo clippy -p ant-core --all-targets --all-features -- -D warnings clean.

CI status

Check	Status
Format	✅ SUCCESS
Clippy	✅ SUCCESS
Unit Tests (ubuntu / macos)	✅ SUCCESS / ✅ SUCCESS
E2E Tests (ubuntu / macos)	⏳ IN_PROGRESS / ⏳ IN_PROGRESS
Merkle E2E (ubuntu / macos)	⏳ IN_PROGRESS / ⏳ IN_PROGRESS
Documentation	✅ SUCCESS
Build (ubuntu / macos)	✅ SUCCESS / ✅ SUCCESS
Security Audit	✅ SUCCESS

E2E + Merkle E2E on both platforms still running at the time of review. Re-check before flipping to "approved"; nothing in the diff looks likely to fail them.

Suggestions (non-blocking, not gating RC)

merkle_store_with_retry is now only ever called with max_attempts = 1 from the wave path (the other in-tree call, merkle_upload_chunks, is unchanged). The max_attempts and backoff parameters of merkle_store_with_retry are no longer used by the CLI/spill path. Worth a comment in merkle_store_with_retry's rustdoc noting that the wave path is single-pass and the retry knob now lives in the deferred helper, so future readers don't re-add in-wave backoff. Cosmetic.
[0, 15, 45] is duplicated between download and upload. Fine for RC; if the team wants to keep the two in lockstep, a single shared constant (e.g. in ant-protocol or a small client consts module) would prevent drift. Not blocking.
Production data is thin: only 1 chunk deferred per affected file, recovered on round 0/3 (so the 15s/45s delays never actually slept). The change is validated for correctness on real data, but the throughput win — which scales with how many chunks defer per wave — is not directly demonstrated. The author is upfront about this in the body. Suggestion only: if V2-411 (disk pre-check) or a follow-up run defers more chunks, capture a before/after wall-time on a real quorum-degraded run and link it in the PR description.

Files changed

ant-core/src/data/client/file.rs — +101 / -21 (wave loop, deferred call, error handling)
ant-core/src/data/client/merkle.rs — +362 / -1 (merkle_deferred_retry, helpers, tests)

Reviewed by Hermes Agent — locally built and tested on PR head against rc-2026.6.2.

mickvandijke

Findings

High: deferred retry can blow the upload memory bound.

The PR accumulates all quorum-short chunks across the whole file into deferred and then passes the entire set into one deferred retry call (file.rs: deferred.extend(...) then merkle_deferred_retry(deferred, ...)). Inside each retry round, merkle_deferred_retry builds round_addrs for all remaining chunks and immediately calls read_bodies(&round_addrs). The CLI caller wires that to spill.read_wave(addrs), and read_wave materializes every requested chunk body into a Vec.

That violates the stated peak-memory contract. The wave path is bounded to about 64 chunks / ~256 MB, but the deferred path is now bounded by the total number of deferred chunks in the whole file. On a large upload with hundreds or thousands of quorum-short chunks, this can re-read gigabytes before store concurrency applies. Please process deferred chunks in bounded batches, or stream body reads into the store futures so only store_concurrency or UPLOAD_WAVE_SIZE bodies are resident.

Medium: fatal store errors produce incomplete PartialUpload accounting.

merkle_store_with_retry records successes internally, but on a non-quorum error it returns Err immediately. The wave caller then builds PartialUpload from only previously recorded state, so a fatal error in a wave with no missing proofs can report failed_count = 0 and omit same-wave successes. The deferred path has the same shape: if a retry round hits a fatal error, it marks the entire pre-round remaining set as failed and cannot preserve any successes from the same concurrent pass.

This can under-report stored chunks and produce misleading CLI JSON/resume metadata. The helper should return a partial outcome plus fatal address/reason, or the caller should use a one-pass collector that preserves successes before surfacing the fatal.

Verification run locally

git diff --check origin/rc-2026.6.2...origin/pr/110
cargo fmt --all -- --check
cargo clippy -p ant-core --all-targets --all-features -- -D warnings
cargo test -p ant-core --lib -> 336 passed

dirvine · 2026-06-09T11:14:49Z

Follow-up after re-reading Mick's review comments against the current diff: I agree both findings are real, with different severity.

1. Deferred retry memory bound — blocker for this PR

The PR rustdoc/intent says peak memory stays bounded because bodies are re-read from disk rather than pinned. The wave path is still bounded, but the new deferred path is not.

Trace:

file.rs accumulates quorum-short chunks into a file-level deferred set across all waves.
The caller then invokes merkle_deferred_retry(deferred, ...) once for the whole file-level set.
In each retry round, merkle_deferred_retry builds round_addrs from all remaining chunks, then calls read_bodies(&round_addrs).
The CLI/spill caller wires that callback to spill.read_wave(addrs).
ChunkSpill::read_wave materializes every requested body into a Vec<(Bytes, [u8; 32])> before calling merkle_store_with_retry.
merkle_store_with_retry then holds the whole chunks argument for the duration of the round; buffer_unordered(concurrency) only limits active store futures, not the already-materialized queued bodies.

So peak resident body memory in the deferred round is bounded by deferred_count * chunk_size, not by UPLOAD_WAVE_SIZE * chunk_size or by store_concurrency * chunk_size.

That matters because the deferred path is exactly where large uploads can accumulate a large file-level retry set. Example shape: a 50 GB upload with ~20% quorum-short chunks can put roughly 10 GB of bodies resident before throttling helps.

Suggested RC-sized fix: process deferred chunks in UPLOAD_WAVE_SIZE-sized batches, in series, or make the deferred store path stream read_chunk into store futures so only store_concurrency bodies are resident. For the RC, batching is probably the smallest/easiest review.

2. Fatal-error accounting — real, smaller blast radius

The second issue is also real, though narrower and partly pre-existing.

On a non-InsufficientPeers store error inside a round, merkle_store_with_retry returns Err(e) immediately. Any same-round store successes that completed before the fatal error are accumulated only inside the helper's local MerkleStoreOutcome, which is then discarded because the function returns Err rather than a partial outcome.

In the new deferred wrapper, the Err arm then treats the whole pre-round remaining set as failed. That means same-round successes are both missing from stored accounting and can be reported as failed.

The old wave path had the same under-accounting shape on fatal errors, so this PR did not introduce the whole class of bug; it adds another place where it can happen. Since the helper is shared, a small helper-level fix could correct both paths: surface a partial MerkleStoreOutcome containing already-landed successes plus fatal: Some(...), instead of throwing the partial outcome away via Err(e).

Merge recommendation

I'd block this PR on fixing issue 1 because it breaks the stated bounded-memory guarantee and is likely mechanical to fix. Issue 2 is worth taking in the same PR if the helper-level refactor stays small; otherwise it should at least be tracked as a follow-up before merge.

Note: my earlier review comment said peak RAM was unchanged. That was wrong after this closer read of the deferred path; Mick's memory finding is correct.

jacderida · 2026-06-09T13:17:44Z

Thanks @mickvandijke — both findings were spot on. Addressed in 2228a92.

1. Memory bound (high). merkle_deferred_retry now takes a batch_size and processes each round in batches of that size, re-reading only one batch of bodies from the spill at a time (the CLI caller passes UPLOAD_WAVE_SIZE). Peak resident bodies is back to the wave path's batch_size × MAX_CHUNK_SIZE (~256 MB) bound regardless of how many chunks were deferred file-wide. New test deferred_retry_reads_bodies_in_bounded_batches asserts read_bodies is never handed more than batch_size addresses at once.

2. Fatal-error accounting (medium). Fixed at the root in merkle_store_with_retry: instead of returning Err (and discarding the successes recorded in that pass), it now preserves same-pass successes in stored/stored_addresses, records the fatal chunk as failed, and surfaces the error via a new MerkleStoreOutcome::fatal field. Callers:

external-signer (merkle_upload_chunks) re-raises fatal as Err to keep its all-or-nothing contract (behaviour unchanged);
the CLI wave and deferred paths fold it into PartialUpload whose failed set is derived authoritatively as every input chunk not in stored_addresses (shared partial_upload_after_fatal helper), so stored_count + failed_count always accounts for the whole file.

This also fixes the pre-existing wave-path under-reporting you flagged (a fatal wave with no missing proofs no longer reports failed_count = 0 or omits same-wave stores). New test store_with_retry_fatal_preserves_same_pass_successes covers it.

cargo test -p ant-core --lib → 338 passed; clippy (--all-targets --all-features -D warnings) and fmt clean.

…ounting Addresses review on the deferred merkle-upload retry path. 1. Memory bound (high): the deferred pass read every quorum-short chunk in the whole file into one Vec per round before storing, so peak resident bodies scaled with the file-wide deferred count rather than the wave path's ~UPLOAD_WAVE_SIZE / ~256 MB bound. merkle_deferred_retry now takes a batch_size and processes each round in batches of that size, re-reading only one batch of bodies from the spill at a time. The CLI caller passes UPLOAD_WAVE_SIZE. 2. Fatal-abort accounting (medium): merkle_store_with_retry returned Err on a non-quorum error, discarding the successes already recorded in that pass; the wave/deferred callers then built PartialUpload from stale state (could report failed_count = 0 and omit same-pass stores). The store helper now preserves same-pass successes (stored/stored_addresses), records the fatal chunk as failed, and surfaces the error via a new MerkleStoreOutcome::fatal field instead of Err. The external-signer path re-raises fatal as Err to keep its all-or-nothing contract; the CLI wave and deferred paths fold it into a PartialUpload whose failed set is derived authoritatively as every input chunk not in stored_addresses (shared partial_upload_after_fatal helper), so stored_count + failed_count accounts for the whole file. This also fixes the pre-existing wave-path under-reporting the review noted. Tests: same-pass successes preserved on fatal; deferred reads bounded to batch_size; updated the non-quorum-error test to assert fatal-in-outcome. cargo test -p ant-core --lib -> 338 passed; clippy and fmt clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jacderida force-pushed the feat-deferred_retry_for_merkle_uploads branch from bd8ccf5 to 9c1bfde Compare June 8, 2026 13:59

jacderida mentioned this pull request Jun 8, 2026

fix(client): stop verification latency and app-rejections suppressing store AIMD #112

Merged

mickvandijke requested changes Jun 9, 2026

View reviewed changes

jacderida force-pushed the feat-deferred_retry_for_merkle_uploads branch from 2228a92 to 1d9916d Compare June 9, 2026 14:09

jacderida merged commit 671c32f into WithAutonomi:rc-2026.6.2 Jun 9, 2026
12 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(client): download-style deferred retry for merkle uploads#110

feat(client): download-style deferred retry for merkle uploads#110
jacderida merged 1 commit into
WithAutonomi:rc-2026.6.2from
jacderida:feat-deferred_retry_for_merkle_uploads

jacderida commented Jun 8, 2026

Uh oh!

dirvine commented Jun 8, 2026

Uh oh!

mickvandijke left a comment

Uh oh!

dirvine commented Jun 9, 2026

Uh oh!

jacderida commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jacderida commented Jun 8, 2026

Summary

Test plan (CI)

Production validation (PROD-UL-01, 5-file merkle upload)

Uh oh!

dirvine commented Jun 8, 2026

Review: #110 — Download-style deferred retry for merkle uploads (V2-466)

Verdict

Summary

Correctness / boundary check ✅

Tests ✅

Lint / format ✅

CI status

Suggestions (non-blocking, not gating RC)

Files changed

Uh oh!

mickvandijke left a comment

Choose a reason for hiding this comment

Uh oh!

dirvine commented Jun 9, 2026

1. Deferred retry memory bound — blocker for this PR

2. Fatal-error accounting — real, smaller blast radius

Merge recommendation

Uh oh!

jacderida commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants