fix: continue single-node uploads on partial waves and bound store memory (V2-461) #116

Merged
jacderida merged 2 commits into WithAutonomi:rc-2026.6.2 from jacderida:v2-461
Jun 12, 2026
@jacderida

Contributor

Summary

  • Continue-on-partial (V2-461 core): the single-node (--no-merkle) upload path no longer aborts the whole file on the first wave containing a failed chunk. Failed chunks are accumulated and the file makes maximum progress; the result is surfaced as Error::PartialUpload after all waves are attempted — matching the merkle path's semantics. Genuinely fatal errors (wallet/payment-infrastructure failures, spill reads) still abort immediately. The recoverable-vs-fatal decision is factored into a pure fold_single_wave helper.
  • Cost-on-partial reporting: Error::PartialUpload now carries the on-chain spend (storage atto + gas), reported in both the CLI's human and --json output, so a partial upload shows what it actually cost.
  • Bound store memory: single-node store concurrency is now capped by combined in-flight body bytes (STORE_INFLIGHT_BYTE_BUDGET, 64 MB) instead of chunk count, so a wave of large (~4 MB) chunks can't pin multiple GB and OOM small hosts.
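
The recoverable-vs-fatal classification in the first bullet can be sketched as a pure fold over wave results. This is a minimal illustration and not the code from this PR: `WaveError`, `FileProgress`, `fold_wave`, and `run_waves` are hypothetical names, chunk addresses are stood in for by `u64`, and only the behavior (accumulate on a partial wave, abort on a fatal error) is taken from the summary above.

```rust
/// Hypothetical per-wave error, standing in for the crate's real error type.
#[derive(Debug, PartialEq)]
enum WaveError {
    /// Some chunks fell short of quorum; the rest of the wave was stored.
    Partial { stored: Vec<u64>, failed: Vec<u64> },
    /// Wallet, payment-infrastructure, or spill-read failure.
    Fatal(String),
}

/// Hypothetical accumulator for the whole file.
#[derive(Debug, Default, PartialEq)]
struct FileProgress {
    stored: Vec<u64>,
    failed: Vec<u64>,
}

/// Fold one wave's result into the running progress.
/// `Ok(())` means "keep going"; `Err` means "abort the file now".
fn fold_wave(
    progress: &mut FileProgress,
    wave: Result<Vec<u64>, WaveError>,
) -> Result<(), String> {
    match wave {
        Ok(stored) => {
            progress.stored.extend(stored);
            Ok(())
        }
        // Recoverable: record what failed and continue with later waves.
        Err(WaveError::Partial { stored, failed }) => {
            progress.stored.extend(stored);
            progress.failed.extend(failed);
            Ok(())
        }
        // Fatal: stop immediately.
        Err(WaveError::Fatal(msg)) => Err(msg),
    }
}

/// Attempt every wave in order, stopping only on a fatal error.
fn run_waves(waves: Vec<Result<Vec<u64>, WaveError>>) -> Result<FileProgress, String> {
    let mut progress = FileProgress::default();
    for wave in waves {
        fold_wave(&mut progress, wave)?;
    }
    Ok(progress)
}
```

In this sketch a caller would inspect `progress.failed` after `run_waves` returns and raise a single partial-upload error if it is non-empty, which mirrors the "after all waves are attempted" behavior described above.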

Why

Large single-node uploads on a degraded production network were OOM-killing small (4 GB) hosts and aborting entire files on the first failed wave (see V2-461). These changes let single-node uploads make maximum progress with bounded memory.

Test plan

  • cargo check --workspace, cargo clippy --all-targets --all-features -- -D warnings, cargo fmt --all -- --check — all clean.
  • New unit tests for the recoverable-vs-fatal wave classification (fold_single_wave: ok / partial / fatal).
  • Validated on production PROD-UL-02 runs: 3.90 GB and 4.02 GB single-node uploads completed with bounded RSS (~2 GB peak) and no OOM, where prior builds OOM-killed at ~3.5 GB / ~11 min. Confirmed via host telemetry (procstat memory_rss, cgroup memory.peak).

Notes

  • Pure ant-core / ant-cli change — no saorsa-core or other crate changes.
  • Branch is currently 5 commits behind rc-2026.6.2; may want a rebase/merge before landing.

@mickvandijke mickvandijke left a comment

Contributor

I found three issues worth addressing before landing this.

  1. P2: Partial spend can be over-counted across single-node waves.

ant-core/src/data/client/batch.rs:428 reloads cumulative cached spend and seeds the returned cost with it, but ant-core/src/data/client/file.rs:2039 calls batch_upload_chunks_with_events once per wave and then adds each wave outcome at file.rs:2061. For a 3-wave upload with costs A/B/C, this can report A + (A+B) + (A+B+C) instead of A+B+C. Since this PR surfaces partial spend to users, the new failure cost can be materially wrong. The fix should make the batch call return a per-call delta, or have the outer wave loop load/cache aggregate once and avoid summing cumulative values.

  2. P2: External-signer partial uploads report zero spend even though the payment intent has the storage amount.

finalize_upload_with_progress discards payment_intent at ant-core/src/data/client/file.rs:1437, then fills PartialUploadSpend with "0"/0 at file.rs:1464. PaymentIntent already exposes total_amount in ant-core/src/data/client/batch.rs:146, so at least storage spend can be reported. Gas may still be unknown, but reporting zero storage contradicts the new PartialUploadSpend contract.

  3. P3: PartialUpload.stored can omit preflight already-stored addresses when stored_offset > 0.

ant-core/src/data/client/file.rs:1991 seeds total_stored from stored_offset, but file.rs:2002 starts stored_addresses empty and later returns it in the error. Callers pass nonzero offsets from merkle preflight fallback, for example file.rs:1783. Programmatic callers can then see stored_count > stored.len() and lose the addresses that were already confirmed. Pass the already-stored address slice into this helper and seed stored_addresses with it, like the merkle path does.
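
The over-count in the first issue can be checked with a small arithmetic sketch. The function names here are hypothetical and the costs are plain integers; the point is only that summing values which are already cumulative yields A + (A+B) + (A+B+C), while summing per-call deltas yields A+B+C.

```rust
/// What each batch call would return if it seeded its cost with the
/// cumulative cached spend (the behavior flagged in the review).
fn cumulative_returns(wave_costs: &[u128]) -> Vec<u128> {
    let mut running = 0u128;
    wave_costs
        .iter()
        .map(|cost| {
            running += cost;
            running
        })
        .collect()
}

/// The outer wave loop: add up whatever each call returned.
fn sum_reported(per_call_returns: &[u128]) -> u128 {
    per_call_returns.iter().sum()
}
```

With wave costs of 10, 20, and 30, the cumulative returns are 10, 30, and 60, which sum to 100, whereas the per-call deltas sum to the true spend of 60.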

Verification performed locally on the PR branch:

  • cargo fmt --all -- --check
  • cargo test -p ant-core fold_single_wave --all-features
  • cargo check --workspace
  • cargo clippy --all-targets --all-features -- -D warnings
  • Synthetic merge into current rc-2026.6.2: cargo check --workspace

jacderida and others added 2 commits June 12, 2026 13:33
…461)

The single-node payment path aborted the entire file on the first wave
with any chunk short of quorum: `upload_spill_addresses_single`
`?`-propagated the per-wave `PartialUpload` from
`batch_upload_chunks_with_events`, so later waves — already
self-encrypted, spilled, and sometimes already paid — were never
attempted. In PROD-UL-02 this turned ~85% per-chunk success into 0%
per-file success, killing every upload at wave 1 of N.

Align it with the merkle path (`upload_waves_merkle`): a wave short of
quorum records its failed chunks and continues; after all waves are
attempted the file returns a single `Error::PartialUpload` with the full
stored/failed breakdown. Genuinely fatal errors (wallet/payment
infrastructure, missing proofs, spill reads) still abort immediately.
The recoverable-vs-fatal decision is factored into a pure `fold_single_wave`
helper with unit tests. Because `UPLOAD_WAVE_SIZE == PAYMENT_WAVE_SIZE`,
each batch call is exactly one payment wave, so folding its `PartialUpload`
leaves nothing un-attempted within the wave.

Also surface on-chain spend on a partial upload: a partial still pays for
the chunks it paid for, but the spend was silently dropped. Add a boxed
`PartialUploadSpend` (storage_cost_atto + gas_cost_wei) to
`Error::PartialUpload`, populate it at every raise site (single-node,
merkle, external-signer), and report it in the CLI (human + JSON). Boxed
to keep `Error` under clippy's `result_large_err` threshold.
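
The effect of boxing on the error type's size can be illustrated as follows. This is a sketch under assumptions: the field names follow the commit message, but the field types and the surrounding enum shapes (`InlineError`, `BoxedError`) are hypothetical and are not the crate's real `Error`.

```rust
use std::mem::size_of;

/// Hypothetical spend payload; the field types are assumptions.
#[allow(dead_code)]
struct PartialUploadSpend {
    storage_cost_atto: String,
    gas_cost_wei: u128,
}

/// Variant carries the payload inline, so every `Result<T, InlineError>`
/// reserves room for it.
#[allow(dead_code)]
enum InlineError {
    PartialUpload {
        stored_count: usize,
        failed_count: usize,
        spend: PartialUploadSpend,
    },
    Other,
}

/// Variant carries only a pointer; the payload lives on the heap and is
/// allocated only when the error is actually raised.
#[allow(dead_code)]
enum BoxedError {
    PartialUpload {
        stored_count: usize,
        failed_count: usize,
        spend: Box<PartialUploadSpend>,
    },
    Other,
}

/// Returns (inline size, boxed size) in bytes for the current target.
fn error_sizes() -> (usize, usize) {
    (size_of::<InlineError>(), size_of::<BoxedError>())
}
```

The exact byte counts depend on the target, but the boxed form is smaller because the variant stores one pointer instead of the whole payload.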

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…V2-461)

Large-file single-node (--no-merkle) uploads OOM'd on small hosts: store
concurrency could ramp to the wave size (64) and the send path holds each
~4 MB chunk body in flight, so a wave of large chunks pinned several GB.

Cap store concurrency in store_paid_chunks_with_events by combined in-flight
body bytes (STORE_INFLIGHT_BYTE_BUDGET, 64 MB) instead of chunk count, so
~4 MB chunks drop to ~16 concurrent stores while small chunks are unaffected.
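
A byte-budget admission check of this kind can be sketched as below. This is not the implementation in `store_paid_chunks_with_events`: the `ByteBudget` type and `admitted_count` helper are hypothetical, the budget is assumed to be 64 MiB, and admitting a chunk when nothing is in flight (so an oversized chunk cannot stall forever) is a design assumption of the sketch.

```rust
/// Assumed value: 64 MiB. The constant name comes from the commit message.
const STORE_INFLIGHT_BYTE_BUDGET: usize = 64 * 1024 * 1024;

/// Hypothetical tracker for bytes currently held by in-flight stores.
struct ByteBudget {
    limit: usize,
    in_flight: usize,
}

impl ByteBudget {
    fn new(limit: usize) -> Self {
        Self { limit, in_flight: 0 }
    }

    /// Admit a chunk if it fits in the remaining budget, or if nothing is
    /// in flight (assumption: always let at least one chunk through).
    fn try_acquire(&mut self, len: usize) -> bool {
        if self.in_flight == 0 || self.in_flight + len <= self.limit {
            self.in_flight += len;
            true
        } else {
            false
        }
    }

    /// Return a chunk's bytes to the budget once its store completes.
    fn release(&mut self, len: usize) {
        self.in_flight = self.in_flight.saturating_sub(len);
    }
}

/// How many of `offered` equally sized chunks start concurrently.
fn admitted_count(chunk_len: usize, offered: usize) -> usize {
    let mut budget = ByteBudget::new(STORE_INFLIGHT_BYTE_BUDGET);
    (0..offered).filter(|_| budget.try_acquire(chunk_len)).count()
}
```

Under these assumptions a wave of 64 chunks of 4 MiB each admits 16 at a time, while 64 chunks of 4 KiB are all admitted at once, which matches the behavior described above.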

This is the standalone memory fix; no saorsa-core change is required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@jacderida

Contributor Author

Thanks for the careful review — all three addressed and folded into the relevant commit (43a8749, the continue-on-partial + cost commit), with C kept separate.

1. Partial spend over-counted across single-node waves (P2): batch_upload_chunks_with_events now returns a per-call delta: it loads only the cached proofs and seeds the returned cost with zero instead of the cumulative cache. The per-wave driver sums deltas, so the total is A+B+C, not A+(A+B)+(A+B+C). This also corrects the same over-count on the success path. (batch_upload_chunks, the other caller, passes resume_key=None, so its single-call delta == cumulative and its behavior is unchanged.)

2. External-signer partial uploads reported zero spend (P2): finalize_upload_with_progress now keeps payment_intent and reports payment_intent.total_amount as storage spend on both the partial (PartialUploadSpend) and the success (FileUploadResult) returns, so they're consistent. Gas stays 0 because it's paid by the external signer out-of-band.

3. PartialUpload.stored omitted preflight already-stored addresses (P3): upload_spill_addresses_single now takes the already-stored address slice (replacing the stored_offset count), seeds stored_addresses with it, and derives total_stored from its length, so stored_count == stored.len() holds for programmatic callers. The three call sites pass &merkle_plan.already_stored (merkle fallback) / &[] (full single-node).
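
The invariant in item 3 can be sketched as follows. The `Address` alias, `PartialSummary` struct, and `summarize` function are hypothetical stand-ins; the sketch only shows why seeding the address list from the already-stored slice, and deriving the count from that list, keeps the count and the list in agreement.

```rust
/// Hypothetical chunk address type.
type Address = [u8; 32];

/// Hypothetical summary carried by a partial-upload error.
struct PartialSummary {
    stored: Vec<Address>,
    stored_count: usize,
}

/// Seed the stored list with preflight-confirmed addresses, append the
/// addresses stored during this run, and derive the count from the list.
fn summarize(already_stored: &[Address], newly_stored: &[Address]) -> PartialSummary {
    let mut stored = already_stored.to_vec();
    stored.extend_from_slice(newly_stored);
    let stored_count = stored.len();
    PartialSummary { stored, stored_count }
}
```

Because the count is computed from the list rather than tracked separately, a nonempty preflight slice can no longer make `stored_count` exceed `stored.len()`.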

Verified locally: cargo fmt --all -- --check, cargo clippy --all-targets --all-features -- -D warnings (clippy 1.96), cargo test -p ant-core --lib (348 pass).

@jacderida jacderida merged commit 192200e into WithAutonomi:rc-2026.6.2 Jun 12, 2026
12 checks passed
2 participants