Skip to content

fix(client): stop verification latency and app-rejections suppressing store AIMD#112

Merged
jacderida merged 2 commits into
WithAutonomi:rc-2026.6.2from
jacderida:v2-468
Jun 9, 2026
Merged

fix(client): stop verification latency and app-rejections suppressing store AIMD#112
jacderida merged 2 commits into
WithAutonomi:rc-2026.6.2from
jacderida:v2-468

Conversation

@jacderida

Copy link
Copy Markdown
Contributor

What & why

The client's adaptive store concurrency limiter (adaptive.rs) starts at a deliberately low cold-start of 8 (correct for low-bandwidth uplinks) but, on the merkle upload path, it never ramped and got crushed to a +1-per-window crawl. Two signals that are not local-capacity signals were polluting its health input:

  1. Node-side verification latency looked like congestion. Merkle PUT latency is dominated by the ~28 s synchronous closeness lookup, inflating client-observed p95/median to 3–6× and tripping the latency-vs-baseline Decrease even though nothing about it is local congestion.
  2. Remote application rejections were counted as network errors. Disk-full / pool-rejected / quote-stale rejections came back as Error::Protocol (or flattened Error::InsufficientPeers), classified NetworkError, counting against success_target and driving multiplicative Decrease. With the default slow_start_ramp_threshold = 0, a single such Decrease permanently exited slow-start → +1 crawl for the rest of the file.

This applies the fetch-channel precedent to the store channel (the situations are structurally identical — verification variance instead of retry variance) and preserves the structured remote-rejection reason so it classifies correctly. The cold-start floor of 8 is unchanged; the timeout-rate ceiling (genuine congestion guard) is untouched.

Linear: V2-468.

Stacked on #110 (deferred-retry / V2-466), which is still in review. Base is rc-2026.6.2. Until #110 merges, its commit (feat(client): download-style deferred retry for merkle uploads) also appears in this PR's diff; the only change owned by this PR is the single fix(client): stop verification latency and app-rejections suppressing store AIMD commit (the 5 files below).

Changes

Area Change
adaptive.rs store_cfg.latency_decrease_enabled = false and store_cfg.slow_start_ramp_threshold = usize::MAX, so a transient Decrease halves but the next healthy window re-doubles instead of crawling. Genuine store congestion still surfaces via the timeout-rate ceiling.
error.rs / chunk.rs New Error::RemotePut { address, source: ant_protocol::ProtocolError } carrying the structured upstream discriminant, instead of stringifying it into Error::Protocol. A ChunkPutResponse::Error means the transport round-trip succeeded and the node declined at the application layer.
chunk.rs chunk_put_to_close_group surfaces a representative RemotePut for app-only quorum shortfalls; any genuine transport failure keeps it InsufficientPeers so real congestion still cuts the cap.
mod.rs classify_error maps RemotePut → ApplicationError (recorded, but not a capacity signal).
merkle.rs merkle_store_with_retry treats RemotePut as recoverable (defer/retry) like InsufficientPeers, so transient rejections don't abort the upload.

Design notes

  • Reclassification is conservative: a quorum shortfall only becomes ApplicationError when every contributing failure was a structured remote rejection. Any Network/Timeout/Io keeps it a NetworkError capacity signal, so genuine local congestion still cuts the cap.
  • The app-only path is narrowed to RemotePut specifically (not any ApplicationError) so a PaymentRequired shortfall stays fatal in the retry path rather than being silently retried.

Tests

New unit coverage plus updates to the existing controller/classification guards: store ramps and recovers under the new tuning while a timeout burst still cuts it; remote app-rejections don't move the cap; RemotePut is recoverable in the retry path; the cold-start floor stays 8.


Production validation

Validated against the live network with a patched ant on two connection profiles (both binaries confirmed to carry this change — the logs show the new remote PUT rejected … Display of Error::RemotePut). Production organically produces both polluting signals: the ~28 s verification latency on every cache-miss PUT, and a steady stream of disk-full / pool-rejected rejections from community nodes.

1. High-bandwidth link — PROD-UL-01, 5.23 GB / 1345 chunks

Store-limiter cap trajectory (adaptive: increase/decrease debug lines):

store:  8 → 16 → 32 → 64 (ceiling, reached in 22 min)   then HELD at 64 — zero decreases — for ~114 min
quote: 32 → 64 → 128 (ceiling)
  • 0 decreases across the whole run despite 427 remote rejections — all classified ApplicationError (via RemotePut) and correctly excluded from the limiter's denominator.
  • The store cap ramped straight to its ceiling and stayed pegged. This is the intended behaviour and the direct inverse of the pre-fix incident, where the same workload sat at an effective ~5–9 the whole run.
  • (For completeness: total wall-time was bounded by the merkle wave-barrier architecture + node-side verification latency — effective concurrency ~12 against a wide-open cap of 64 — not by the limiter. That bottleneck is tracked separately and is out of scope here.)

2. Low-bandwidth / residential link — PROD-LOCAL-UL-01

Store-limiter cap trajectory:

store: 8 → 16 (17:31) → 32 (17:44)  │  → 16 → 8 → 4 → 2 (17:57–18:01)   settled at 2
  • Ramped via slow-start doubling, then the genuine-congestion guard pulled it back to a sane low cap of 2 as the thin uplink saturated (single 4 MB chunk sends taking 199 s / 762 s / 926 s; write_progress_timeout transport failures). Because store latency-decrease is disabled, this backoff is provably driven by the real transport signal, not the (huge) per-chunk latency.
  • The 26 disk-full / pool-rejected rejections on this run were RemotePutApplicationError and did not contribute to the backoff.
  • Upload completed successfully.

What the two runs jointly demonstrate

  • App-rejections no longer suppress the limiter (issue finding 1 / fix 3) — 427 + 26 rejections, 0 attributable decreases.
  • Verification latency no longer suppresses the limiter (finding 2 / fixes 1–2) — high-bandwidth run pegged at the ceiling through the entire verification-heavy workload.
  • The genuine congestion guard is preserved (validation step 4) — the residential run still backed off to a sane low cap on real uplink saturation.
  • Low-bandwidth floor behaviour is correct (step 5) — it settled low (2) and did not stay pinned high.

A follow-up (V2-482) captures the optional protocol change (node returns verification_ms; client subtracts it to keep a clean latency signal) — deferred to a later release; the high-bandwidth run shows the ramp is beneficial where bandwidth exists, so a denoised latency signal there would be safe.

🤖 Generated with Claude Code

jacderida and others added 2 commits June 8, 2026 14:59
The CLI merkle upload path stored each wave of 64 chunks through
`merkle_store_with_retry` with up to 4 attempts and 30s jittered backoffs,
and a hard barrier: wave N+1 could not start until wave N's retry loop fully
drained. A handful of quorum-short chunks therefore parked the wave's other
~63 slots idle through multiple backoffs — the single biggest throughput sink
on the PROD-UL-01 run (one wave alone burned 34 minutes).

Port the download path's deferred-retry design to the upload path:

- Store each wave in a single pass (`max_attempts = 1`, no backoff) so a wave
  never blocks on a slow chunk.
- Collect quorum-short chunks into a file-level deferred set and advance to the
  next wave immediately.
- After the last wave, retry the whole deferred set in concurrent rounds with
  `[0, 15, 45]s` delays (matching the download path), re-reading each chunk's
  body from the spill at retry time (peak RAM unchanged) and reusing its proof.

Failure semantics are preserved: chunks still short after the final round
surface as `PartialUpload`; a non-quorum error aborts as `PartialUpload` while
preserving earlier progress. Stats and progress numbering are carried across
rounds, with each deferred round's successes recorded in its own histogram slot.
Total per-chunk retry budget is unchanged (1 wave pass + 3 deferred rounds).

Adds `merkle_deferred_retry`, `DeferredRetryOutcome`,
`deferred_round_histogram_slot`, `DEFERRED_ROUND_DELAYS_SECS`, and unit tests.

V2-466

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… store AIMD

The adaptive store concurrency limiter never ramped from its cold-start of
8 and got crushed to a +1-per-window crawl because two non-capacity
signals polluted its health input on the merkle upload path:

- Node-side PUT latency is dominated by the ~28s synchronous merkle
  closeness lookup, inflating client-observed p95/median to 3-6x and
  tripping the latency-vs-baseline Decrease even though nothing about it
  is local congestion.
- Remote application rejections (pool-rejected, disk-full, quote-stale)
  arrived as Error::Protocol / flattened Error::InsufficientPeers and were
  classified as NetworkError, counting against success_target and driving
  multiplicative decrease. With the default slow_start_ramp_threshold of 0,
  a single such Decrease permanently exited slow-start.

Apply the fetch-channel precedent to the store channel (the situation is
structurally identical — verification variance instead of retry variance),
plus preserve the structured remote rejection reason so it classifies
correctly. The cold-start floor of 8 is deliberately unchanged.

- adaptive.rs: store_cfg.latency_decrease_enabled = false and
  store_cfg.slow_start_ramp_threshold = usize::MAX, so a transient Decrease
  halves but the next healthy window re-doubles. Genuine store congestion
  still surfaces via the timeout-rate ceiling.
- error.rs/chunk.rs: new Error::RemotePut { address, source: ProtocolError }
  carrying the structured upstream discriminant instead of stringifying it
  into Error::Protocol. A ChunkPutResponse::Error means the transport
  round-trip succeeded and the node declined at the application layer.
- chunk.rs: chunk_put_to_close_group surfaces a representative RemotePut for
  app-only quorum shortfalls; any genuine transport failure keeps it
  InsufficientPeers so real congestion still cuts the cap.
- mod.rs: classify_error maps RemotePut to ApplicationError.
- merkle.rs: merkle_store_with_retry treats RemotePut as recoverable
  (defer/retry) like InsufficientPeers, so transient rejections don't abort
  the upload.

Adds unit coverage: store ramps/recovers under the new tuning while a
timeout burst still cuts it; remote app-rejections don't move the cap;
RemotePut is recoverable in the retry path.

Linear: V2-468

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dirvine

dirvine commented Jun 8, 2026

Copy link
Copy Markdown

Review: fix(client): stop verification latency and app-rejections suppressing store AIMD (V2-468)

Verdict: Approve with minor nits. The fix is correctly scoped, the production evidence is strong, and the change is conservative — it leaves the genuine-congestion guard (timeout-rate ceiling) and the cold-start floor of 8 untouched. Stacking on #110 is fine because the V2-468 commit is self-contained and the rationale explicitly excludes #110's diff from the review.

Focused on the V2-468 commit (e04c0ee1, +266/-11, 5 files). Excludes the parent #110 deferred-retry commit per the PR description.

✅ What's right

  • The diagnosis is sharp and the fix is minimal. Two non-capacity signals (28 s closeness-lookup latency, structured remote rejections) were being folded into the limiter's health input. Disabling latency_decrease_enabled and lifting slow_start_ramp_threshold to usize::MAX on the store channel is the exact mirror of the existing fetch-channel precedent, so behaviour is now consistent across the two polluted channels.
  • The reclassification is conservative. chunk_put_to_close_group only surfaces a representative RemotePut when every failure was a structured remote rejection; any Network/Timeout/Io keeps InsufficientPeers, so real local congestion still cuts the cap. This is the correct call — the asymmetric risk (false-negative = noisy limiter, false-positive = masking real congestion) clearly favours conservatism.
  • RemotePut is narrowly typed. Keeping it as ApplicationError (not just any app error) means PaymentRequired shortfalls stay fatal in the retry path rather than being silently retried. The comment on merkle.rs:982-987 makes this contract explicit.
  • Production validation is convincing. 0 decreases across 427 rejections on the high-bandwidth run is direct evidence the reclassification works end-to-end, and the low-bandwidth run backing off to cap=2 on real transport timeouts proves the genuine-congestion guard is preserved.
  • The cold-start floor stays at 8. Explicitly asserted by the updated test controller_sets_fetch_channel_download_tuning (renamed mentally to "store+fetch" in the new assertion), and re-asserted in the new end-to-end test.
  • Test coverage is well-targeted. The three new tests cover: (a) verification latency p95 alone does not shrink the cap, (b) a timeout burst still cuts it, (c) the cap re-doubles on the next healthy window (proves the +1 crawl pathology is gone via the multiplicative-recovery path), plus (d) app rejections don't move the cap, and (e) RemotePut is recoverable in the retry path.
  • Tight scope. RemotePut is only constructed at the PUT-response handler (chunk.rs:432); the GET sweep at chunk.rs:840 keeps the old Error::Protocol flattening, which is correct because the GET path's per-peer-sweep semantics do not need the symmetric classification.

⚠️ Nits / non-blocking

  1. slow_start_ramp_threshold = usize::MAX is a blunt instrument. The code comment acknowledges this is "the structural twin of the fetch override" — fine for parity, but usize::MAX will be opaque to anyone reading the limiter's own field docs without the local comment. Consider a named constant (e.g. LIM_THRESHOLD_NEVER_EXIT: usize) at the top of adaptive.rs with a one-line pointer to V2-468, or a short rustdoc on the field itself. Optional — the local comment block is detailed enough that this is genuinely cosmetic.

  2. The 28 s closeness-lookup latency still pollutes RTT/spawn observers elsewhere. The PR correctly addresses the limiter (which is what the user-visible pathology was), but the same signal also flows into any per-call latency aggregator (peer-scoring RTT, per-peer spawn observers, etc.). Out of scope for V2-468, but worth a Linear follow-up: is there a single "node-side work" observer we can subtract from client-observed PUT latency, the way the V2-482 protocol change proposes? The PR's V2-482 mention is exactly this — good — but it is deferred. Consider a Linear ticket: "audit other latency consumers of client-observed PUT latency in adaptive.rs" so the debt is tracked, even if it stays open.

  3. Test store_channel_ramps_and_recovers_under_v2_468_tuning step (c) is a clever reachability proof, but the comment is dense. The "Over this many windows additive +1 recovery could not climb back to before_stress from the stressed floor — only multiplicative doubling can" reasoning is the right test, but the reader has to do a small back-of-envelope calculation to see why. A two-line comment ("window_ops8 healthy windows from a stressed floor near 2/4 could only reach before_stress via multiplicative doubling — the additive +1 crawl would be ~win8 increments short of before_stress") would make the test self-evidently correct.

  4. Doc comment on classify_error was updated but the order of variants in the doc table does not match the match arm order. The new bullet for RemotePut is at the end, but in the match arm it is appended to the ApplicationError block (which is correct). The doc ordering is fine functionally, but minorly inconsistent for grep-ability. Pure cosmetics.

  5. Error::RemotePut is a new public variant. Adding a variant to a pub enum is a SemVer-breaking change for any downstream consumer that exhaustively matches. I searched this repo and the only exhaustive match is the new classify_error arm you added, plus the test in mod.rs:623. Worth a one-line check that no in-tree sister crate (ant-node, ant-protocol) matches on Error exhaustively — I did not see any, but a cargo build -p ant-node against this branch would be cheap insurance before merge.

❌ Real concerns

None.

Production validation

The two reported runs jointly demonstrate all four claims in the PR body:

  • App-rejections excluded: 427 + 26 rejections, 0 attributable decreases.
  • Verification latency excluded: high-bandwidth run pegged at the ceiling through the entire verification-heavy workload.
  • Genuine congestion preserved: residential run backed off to cap=2 on real uplink saturation (single 4 MB sends at 199 s / 762 s / 926 s, write_progress_timeout failures).
  • Low-bandwidth floor behaviour correct: settled at 2, did not stay pinned high.

The reported effective concurrency ~12 against a cap of 64 on the high-bandwidth run is the next bottleneck and is correctly out of scope (tracked separately).

CI

Check Status
Format ✅ SUCCESS
Clippy ✅ SUCCESS
Unit Tests (ubuntu) ✅ SUCCESS
Unit Tests (macos) ✅ SUCCESS
Build (ubuntu) ✅ SUCCESS
Documentation ✅ SUCCESS
Security Audit ✅ SUCCESS
E2E (ubuntu) 🟡 in progress
E2E (macos) 🟡 in progress
Merkle E2E (ubuntu) 🟡 in progress
Merkle E2E (macos) 🟡 in progress
Build (macos) 🟡 in progress

Core unit/clippy/build/security gates are green. E2E and macOS Build still in flight at time of review — no failures visible.

Stacked-on-#110 note

PR body correctly says #110's commit also appears in the diff. I did not re-review that part. The V2-468 commit (e04c0ee1) is self-contained and the rationale explicitly carves it out, so the stacked-base concern does not block this review. Final merge ordering will need both #110 and #112 to land together (or #110 to land first) — caller's call.

Recommended next steps for the author

  • None blocking. Optionally address nits 1, 3, and 5 above before merge.
  • The V2-482 follow-up (node returns verification_ms; client subtracts it) is a clean protocol-level cleanup that would obsolete nit 2's concern — worth keeping that ticket visible.
  • Consider running cargo build -p ant-node -p ant-protocol against this branch pre-merge as a cheap SemVer-belt-and-braces check for nit 5.

— Hermes (advisory; awaiting human maintainer sign-off)

@jacderida jacderida merged commit 17bab41 into WithAutonomi:rc-2026.6.2 Jun 9, 2026
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants