fix(client): stop verification latency and app-rejections suppressing store AIMD#112
Conversation
The CLI merkle upload path stored each wave of 64 chunks through `merkle_store_with_retry` with up to 4 attempts and 30s jittered backoffs, and a hard barrier: wave N+1 could not start until wave N's retry loop fully drained. A handful of quorum-short chunks therefore parked the wave's other ~63 slots idle through multiple backoffs — the single biggest throughput sink on the PROD-UL-01 run (one wave alone burned 34 minutes). Port the download path's deferred-retry design to the upload path: - Store each wave in a single pass (`max_attempts = 1`, no backoff) so a wave never blocks on a slow chunk. - Collect quorum-short chunks into a file-level deferred set and advance to the next wave immediately. - After the last wave, retry the whole deferred set in concurrent rounds with `[0, 15, 45]s` delays (matching the download path), re-reading each chunk's body from the spill at retry time (peak RAM unchanged) and reusing its proof. Failure semantics are preserved: chunks still short after the final round surface as `PartialUpload`; a non-quorum error aborts as `PartialUpload` while preserving earlier progress. Stats and progress numbering are carried across rounds, with each deferred round's successes recorded in its own histogram slot. Total per-chunk retry budget is unchanged (1 wave pass + 3 deferred rounds). Adds `merkle_deferred_retry`, `DeferredRetryOutcome`, `deferred_round_histogram_slot`, `DEFERRED_ROUND_DELAYS_SECS`, and unit tests. V2-466 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… store AIMD
The adaptive store concurrency limiter never ramped from its cold-start of
8 and got crushed to a +1-per-window crawl because two non-capacity
signals polluted its health input on the merkle upload path:
- Node-side PUT latency is dominated by the ~28s synchronous merkle
closeness lookup, inflating client-observed p95/median to 3-6x and
tripping the latency-vs-baseline Decrease even though nothing about it
is local congestion.
- Remote application rejections (pool-rejected, disk-full, quote-stale)
arrived as Error::Protocol / flattened Error::InsufficientPeers and were
classified as NetworkError, counting against success_target and driving
multiplicative decrease. With the default slow_start_ramp_threshold of 0,
a single such Decrease permanently exited slow-start.
Apply the fetch-channel precedent to the store channel (the situation is
structurally identical — verification variance instead of retry variance),
plus preserve the structured remote rejection reason so it classifies
correctly. The cold-start floor of 8 is deliberately unchanged.
- adaptive.rs: store_cfg.latency_decrease_enabled = false and
store_cfg.slow_start_ramp_threshold = usize::MAX, so a transient Decrease
halves but the next healthy window re-doubles. Genuine store congestion
still surfaces via the timeout-rate ceiling.
- error.rs/chunk.rs: new Error::RemotePut { address, source: ProtocolError }
carrying the structured upstream discriminant instead of stringifying it
into Error::Protocol. A ChunkPutResponse::Error means the transport
round-trip succeeded and the node declined at the application layer.
- chunk.rs: chunk_put_to_close_group surfaces a representative RemotePut for
app-only quorum shortfalls; any genuine transport failure keeps it
InsufficientPeers so real congestion still cuts the cap.
- mod.rs: classify_error maps RemotePut to ApplicationError.
- merkle.rs: merkle_store_with_retry treats RemotePut as recoverable
(defer/retry) like InsufficientPeers, so transient rejections don't abort
the upload.
Adds unit coverage: store ramps/recovers under the new tuning while a
timeout burst still cuts it; remote app-rejections don't move the cap;
RemotePut is recoverable in the retry path.
Linear: V2-468
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review: fix(client): stop verification latency and app-rejections suppressing store AIMD (V2-468)Verdict: Approve with minor nits. The fix is correctly scoped, the production evidence is strong, and the change is conservative — it leaves the genuine-congestion guard (timeout-rate ceiling) and the cold-start floor of 8 untouched. Stacking on #110 is fine because the V2-468 commit is self-contained and the rationale explicitly excludes #110's diff from the review. Focused on the V2-468 commit ( ✅ What's right
|
| Check | Status |
|---|---|
| Format | ✅ SUCCESS |
| Clippy | ✅ SUCCESS |
| Unit Tests (ubuntu) | ✅ SUCCESS |
| Unit Tests (macos) | ✅ SUCCESS |
| Build (ubuntu) | ✅ SUCCESS |
| Documentation | ✅ SUCCESS |
| Security Audit | ✅ SUCCESS |
| E2E (ubuntu) | 🟡 in progress |
| E2E (macos) | 🟡 in progress |
| Merkle E2E (ubuntu) | 🟡 in progress |
| Merkle E2E (macos) | 🟡 in progress |
| Build (macos) | 🟡 in progress |
Core unit/clippy/build/security gates are green. E2E and macOS Build still in flight at time of review — no failures visible.
Stacked-on-#110 note
PR body correctly says #110's commit also appears in the diff. I did not re-review that part. The V2-468 commit (e04c0ee1) is self-contained and the rationale explicitly carves it out, so the stacked-base concern does not block this review. Final merge ordering will need both #110 and #112 to land together (or #110 to land first) — caller's call.
Recommended next steps for the author
- None blocking. Optionally address nits 1, 3, and 5 above before merge.
- The V2-482 follow-up (node returns
verification_ms; client subtracts it) is a clean protocol-level cleanup that would obsolete nit 2's concern — worth keeping that ticket visible. - Consider running
cargo build -p ant-node -p ant-protocolagainst this branch pre-merge as a cheap SemVer-belt-and-braces check for nit 5.
— Hermes (advisory; awaiting human maintainer sign-off)
What & why
The client's adaptive store concurrency limiter (
adaptive.rs) starts at a deliberately low cold-start of 8 (correct for low-bandwidth uplinks) but, on the merkle upload path, it never ramped and got crushed to a +1-per-window crawl. Two signals that are not local-capacity signals were polluting its health input:Error::Protocol(or flattenedError::InsufficientPeers), classifiedNetworkError, counting againstsuccess_targetand driving multiplicative Decrease. With the defaultslow_start_ramp_threshold = 0, a single such Decrease permanently exited slow-start → +1 crawl for the rest of the file.This applies the fetch-channel precedent to the store channel (the situations are structurally identical — verification variance instead of retry variance) and preserves the structured remote-rejection reason so it classifies correctly. The cold-start floor of 8 is unchanged; the timeout-rate ceiling (genuine congestion guard) is untouched.
Linear: V2-468.
Changes
adaptive.rsstore_cfg.latency_decrease_enabled = falseandstore_cfg.slow_start_ramp_threshold = usize::MAX, so a transient Decrease halves but the next healthy window re-doubles instead of crawling. Genuine store congestion still surfaces via the timeout-rate ceiling.error.rs/chunk.rsError::RemotePut { address, source: ant_protocol::ProtocolError }carrying the structured upstream discriminant, instead of stringifying it intoError::Protocol. AChunkPutResponse::Errormeans the transport round-trip succeeded and the node declined at the application layer.chunk.rschunk_put_to_close_groupsurfaces a representativeRemotePutfor app-only quorum shortfalls; any genuine transport failure keeps itInsufficientPeersso real congestion still cuts the cap.mod.rsclassify_errormapsRemotePut → ApplicationError(recorded, but not a capacity signal).merkle.rsmerkle_store_with_retrytreatsRemotePutas recoverable (defer/retry) likeInsufficientPeers, so transient rejections don't abort the upload.Design notes
ApplicationErrorwhen every contributing failure was a structured remote rejection. AnyNetwork/Timeout/Iokeeps it aNetworkErrorcapacity signal, so genuine local congestion still cuts the cap.RemotePutspecifically (not anyApplicationError) so aPaymentRequiredshortfall stays fatal in the retry path rather than being silently retried.Tests
New unit coverage plus updates to the existing controller/classification guards: store ramps and recovers under the new tuning while a timeout burst still cuts it; remote app-rejections don't move the cap;
RemotePutis recoverable in the retry path; the cold-start floor stays 8.Production validation
Validated against the live network with a patched
anton two connection profiles (both binaries confirmed to carry this change — the logs show the newremote PUT rejected …DisplayofError::RemotePut). Production organically produces both polluting signals: the ~28 s verification latency on every cache-miss PUT, and a steady stream of disk-full / pool-rejected rejections from community nodes.1. High-bandwidth link —
PROD-UL-01, 5.23 GB / 1345 chunksStore-limiter cap trajectory (
adaptive: increase/decreasedebug lines):ApplicationError(viaRemotePut) and correctly excluded from the limiter's denominator.2. Low-bandwidth / residential link —
PROD-LOCAL-UL-01Store-limiter cap trajectory:
write_progress_timeouttransport failures). Because store latency-decrease is disabled, this backoff is provably driven by the real transport signal, not the (huge) per-chunk latency.RemotePut→ApplicationErrorand did not contribute to the backoff.What the two runs jointly demonstrate
A follow-up (V2-482) captures the optional protocol change (node returns
verification_ms; client subtracts it to keep a clean latency signal) — deferred to a later release; the high-bandwidth run shows the ramp is beneficial where bandwidth exists, so a denoised latency signal there would be safe.🤖 Generated with Claude Code