fix: continue single-node uploads on partial waves and bound store memory (V2-461) #116
mickvandijke
left a comment
I found three issues worth addressing before landing this.
- P2: Partial spend can be over-counted across single-node waves.
ant-core/src/data/client/batch.rs:428 reloads cumulative cached spend and seeds the returned cost with it, but ant-core/src/data/client/file.rs:2039 calls batch_upload_chunks_with_events once per wave and then adds each wave outcome at file.rs:2061. For a 3-wave upload with costs A/B/C, this can report A + (A+B) + (A+B+C) instead of A+B+C. Since this PR surfaces partial spend to users, the new failure cost can be materially wrong. The fix should make the batch call return a per-call delta, or have the outer wave loop load/cache aggregate once and avoid summing cumulative values.
- P2: External-signer partial uploads report zero spend even though the payment intent has the storage amount.
finalize_upload_with_progress discards payment_intent at ant-core/src/data/client/file.rs:1437, then fills PartialUploadSpend with "0"/0 at file.rs:1464. PaymentIntent already exposes total_amount in ant-core/src/data/client/batch.rs:146, so at least storage spend can be reported. Gas may still be unknown, but reporting zero storage contradicts the new PartialUploadSpend contract.
- P3: PartialUpload.stored can omit preflight already-stored addresses when stored_offset > 0.
ant-core/src/data/client/file.rs:1991 seeds total_stored from stored_offset, but file.rs:2002 starts stored_addresses empty and later returns it in the error. Callers pass nonzero offsets from merkle preflight fallback, for example file.rs:1783. Programmatic callers can then see stored_count > stored.len() and lose the addresses that were already confirmed. Pass the already-stored address slice into this helper and seed stored_addresses with it, like the merkle path does.
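The over-count described in the first P2 item can be reproduced in isolation. The sketch below is not the ant-core code: both function names are hypothetical, and a running `cached` value stands in for the cumulative cached spend that the batch call is said to reload. It only shows why summing cumulative values across waves inflates the total, and why returning a per-call delta does not.

```rust
/// Hypothetical model of the reported bug: every wave call returns the
/// cumulative cached spend, and the outer loop sums those return values.
pub fn total_from_cumulative(wave_costs: &[u128]) -> u128 {
    let mut cached = 0u128; // stands in for the cumulative cached spend
    let mut reported_total = 0u128;
    for cost in wave_costs {
        cached += *cost; // the call seeds its result with the running total
        reported_total += cached; // the wave loop then adds that result again
    }
    reported_total
}

/// Hypothetical model of the suggested fix: every wave call returns only
/// what that wave spent, so the outer sum is the true total.
pub fn total_from_deltas(wave_costs: &[u128]) -> u128 {
    wave_costs.iter().sum()
}
```

With wave costs of 10, 20 and 30, the cumulative version reports 10 + 30 + 60 = 100, while the delta version reports 60.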
Verification performed locally on the PR branch:
- cargo fmt --all -- --check
- cargo test -p ant-core fold_single_wave --all-features
- cargo check --workspace
- cargo clippy --all-targets --all-features -- -D warnings
- Synthetic merge into current rc-2026.6.2: cargo check --workspace
…461)

The single-node payment path aborted the entire file on the first wave with any chunk short of quorum: `upload_spill_addresses_single` `?`-propagated the per-wave `PartialUpload` from `batch_upload_chunks_with_events`, so later waves (already self-encrypted, spilled, and sometimes already paid) were never attempted. In PROD-UL-02 this turned ~85% per-chunk success into 0% per-file success, killing every upload at wave 1 of N.

Align it with the merkle path (`upload_waves_merkle`): a wave short of quorum records its failed chunks and continues; after all waves are attempted the file returns a single `Error::PartialUpload` with the full stored/failed breakdown. Genuinely fatal errors (wallet/payment infrastructure, missing proofs, spill reads) still abort immediately. The recoverable-vs-fatal decision is factored into a pure `fold_single_wave` helper with unit tests.

Because `UPLOAD_WAVE_SIZE == PAYMENT_WAVE_SIZE`, each batch call is exactly one payment wave, so folding its `PartialUpload` leaves nothing un-attempted within the wave.

Also surface on-chain spend on a partial upload: a partial still pays for the chunks it paid for, but the spend was silently dropped. Add a boxed `PartialUploadSpend` (storage_cost_atto + gas_cost_wei) to `Error::PartialUpload`, populate it at every raise site (single-node, merkle, external-signer), and report it in the CLI (human + JSON). Boxed to keep `Error` under clippy's `result_large_err` threshold.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
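The recoverable-vs-fatal fold described in this commit message can be sketched as a pure function. Everything below is an illustrative assumption rather than the actual helper: `WaveError`, `Acc`, `fold_wave_sketch` and `upload_all_sketch` are hypothetical names, chunks are reduced to `u32` ids, and the real `Error::PartialUpload` carries more fields (including the boxed spend).

```rust
/// Hypothetical stand-in for the per-wave error; not the ant-core `Error`.
#[derive(Debug, PartialEq)]
pub enum WaveError {
    /// Some chunks fell short of quorum; the rest of the wave was stored.
    Partial { stored: Vec<u32>, failed: Vec<u32> },
    /// Wallet/payment infrastructure, missing proofs, spill reads.
    Fatal(String),
}

/// Hypothetical accumulator carried across waves.
#[derive(Debug, Default, PartialEq)]
pub struct Acc {
    pub stored: Vec<u32>,
    pub failed: Vec<u32>,
}

/// Pure fold step: merge one wave's outcome into the accumulator,
/// or hand back the fatal error so the caller aborts immediately.
pub fn fold_wave_sketch(
    mut acc: Acc,
    outcome: Result<Vec<u32>, WaveError>,
) -> Result<Acc, WaveError> {
    match outcome {
        Ok(stored) => acc.stored.extend(stored),
        Err(WaveError::Partial { stored, failed }) => {
            acc.stored.extend(stored);
            acc.failed.extend(failed);
        }
        Err(fatal) => return Err(fatal),
    }
    Ok(acc)
}

/// Attempt every wave, then raise a single partial error if anything failed.
pub fn upload_all_sketch(
    waves: Vec<Result<Vec<u32>, WaveError>>,
) -> Result<Vec<u32>, WaveError> {
    let mut acc = Acc::default();
    for outcome in waves {
        acc = fold_wave_sketch(acc, outcome)?; // only fatal errors propagate here
    }
    if acc.failed.is_empty() {
        Ok(acc.stored)
    } else {
        Err(WaveError::Partial { stored: acc.stored, failed: acc.failed })
    }
}
```

Keeping the fold step free of I/O is what makes the ok / partial / fatal cases easy to unit test: each case is a plain value passed in and a plain value returned.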
…V2-461)

Large-file single-node (--no-merkle) uploads OOM'd on small hosts: store concurrency could ramp to the wave size (64) and the send path holds each ~4 MB chunk body in flight, so a wave of large chunks pinned several GB.

Cap store concurrency in store_paid_chunks_with_events by combined in-flight body bytes (STORE_INFLIGHT_BYTE_BUDGET, 64 MB) instead of chunk count, so ~4 MB chunks drop to ~16 concurrent stores while small chunks are unaffected.

This is the standalone memory fix; no saorsa-core change is required.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
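The arithmetic behind the byte budget can be checked with a small sketch. It assumes uniform chunk sizes and treats 64 MB as 64 * 1024 * 1024 bytes; `store_concurrency_sketch` and `WAVE_SIZE` are hypothetical names, and the real `store_paid_chunks_with_events` may account for in-flight bytes differently (for example per chunk rather than by a single divisor).

```rust
/// Assumed value from the commit message; the constant's real type and
/// exact byte count in ant-core are not shown in this PR text.
pub const STORE_INFLIGHT_BYTE_BUDGET: usize = 64 * 1024 * 1024;

/// Hypothetical name for the wave size (64) mentioned in the commit message.
pub const WAVE_SIZE: usize = 64;

/// Number of concurrent stores allowed when every chunk body is
/// `chunk_len` bytes: bounded by the byte budget, never above the wave
/// size, and never below one so the upload always makes progress.
pub fn store_concurrency_sketch(chunk_len: usize) -> usize {
    let by_bytes = STORE_INFLIGHT_BYTE_BUDGET / chunk_len.max(1);
    by_bytes.clamp(1, WAVE_SIZE)
}
```

Under these assumptions a 4 MB chunk yields 64 / 4 = 16 concurrent stores, while a 64 KB chunk stays at the full wave size of 64 because the byte budget is not the binding limit.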
Thanks for the careful review — all three addressed and folded into the relevant commit ( 1. Partial spend over-counted across single-node waves (P2) — 2. External-signer partial uploads reported zero spend (P2) — 3. Verified locally: |
Summary
- The single-node (--no-merkle) upload path no longer aborts the whole file on the first wave containing a failed chunk. Failed chunks are accumulated and the file makes maximum progress; the result is surfaced as Error::PartialUpload after all waves are attempted, matching the merkle path's semantics. Genuinely fatal errors (wallet/payment-infrastructure failures, spill reads) still abort immediately. The recoverable-vs-fatal decision is factored into a pure fold_single_wave helper.
- Error::PartialUpload now carries the on-chain spend (storage atto + gas), reported in both the CLI's human and --json output, so a partial upload shows what it actually cost.
- Store concurrency is capped by combined in-flight body bytes (STORE_INFLIGHT_BYTE_BUDGET, 64 MB) instead of chunk count, so a wave of large (~4 MB) chunks can't pin multiple GB and OOM small hosts.
Why
Large single-node uploads on a degraded production network were OOM-killing small (4 GB) hosts and aborting entire files on the first failed wave (see V2-461). These changes let single-node uploads make maximum progress with bounded memory.
Test plan
- cargo check --workspace, cargo clippy --all-targets --all-features -- -D warnings, cargo fmt --all -- --check: all clean.
- fold_single_wave: ok / partial / fatal).
- procstat memory_rss, cgroup memory.peak).
Notes
- ant-core/ant-cli change; no saorsa-core or other crate changes.
- rc-2026.6.2; may want a rebase/merge before landing.