Feat/streaming prover#647
Draft
diegokingston wants to merge 10 commits into
Draft
Conversation
Add LAMBDA_STREAM_LDE=1 opt-in mode that drops main/aux LDE columns right after committing (Phase A/C) instead of caching them, and recomputes each table's LDE on demand in Rounds 2-4 via reconstruct_round1 (rebuilt from the still-resident trace — no VM re-execution). Trades one extra LDE expansion (iFFT+coset+FFT) per table for dropping the O(N x cols x lde_size) cache of all tables' LDE columns held simultaneously between Phase A/C and Rounds 2-4. Default (flag off) is unchanged. - streaming_retire_lde() env helper - ungate reconstruct_round1 (was debug-checks only) - Phase A/C: drop cached LDE when streaming - Rounds 2-4: reconstruct vs build_round1 branch Validated: stark lib 128/128 with flag on; prover bus-completeness prove+verify (aux/LogUp reconstruction) pass; multi-table roundtrip passes; baseline unchanged.
Split the Round 2 (composition) + Round 3 (OOD) part of prove_rounds_2_to_4 into a standalone prove_rounds_2_to_3 that returns (Round2, Round3, z) without running FRI. prove_rounds_2_to_4 now calls it then Round 4 as before — proofs are byte-identical (pure extraction). This is step 1a toward batched FRI within a chunk (Approach 1, M2 option B): the per-table Rounds 2-3 are now separable from Round 4, so a later step can collect them per-chunk and run one batched FRI per lde_size bucket instead of per-table FRI. instruments: round-3 OOD timing now carried via a R3_OOD thread-local (store_r3_ood/take_r3_ood) since the inline local moved into the new fn. Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK; prover bus-completeness OK; instruments feature compiles.
Restructure the per-chunk Rounds 2-4 loop in multi_prove into two passes: Pass 1 runs prove_rounds_2_to_3 for all K tables in the chunk (advancing each table's forked transcript through the OOD evaluations, collecting the intermediates), then Pass 2 runs Round 4 per table. Round 4 is extracted into prove_round_4 (the former prove_rounds_2_to_4, now consuming the Round2/Round3 produced by Pass 1). Still one FRI per table, so proofs are byte-identical to the single-pass version. This establishes the barrier needed for the next step (Approach 1, M2 option B): with all of a chunk's tables through Rounds 2-3, Pass 2 can be replaced by a batched FRI per lde_size bucket instead of per-table FRI. Instruments: per-table sub-op timings are now captured at the end of each pass (same thread as the work — r2/r3 in Pass 1, r4 in Pass 2) and combined in the sequential collection, since the two passes may run on different threads. Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK; prover bus-completeness OK (flag off + on); instruments feature compiles.
… change) Move the DEEP composition challenge gamma + coefficient expansion out of Round 4 into a new sample_deep_coeffs helper, called at the end of prove_rounds_2_to_3 (Pass 1). gamma is sampled from the same fork at the same transcript position as before, so proofs are byte-identical. prove_rounds_2_to_3 now returns (Round2, Round3, z, DeepCoeffs); Round 4 (round_4_compute... / prove_round_4) takes the DeepCoeffs and only builds the DEEP LDE + runs FRI. The two-pass loop threads DeepCoeffs through the intermediate. This is step 1c toward batched FRI (Approach 1, M2 option B): Pass 2 is now reduced to DEEP-LDE + FRI, so the next step can replace it with one batched FRI per lde_size bucket (folding each table's DEEP LDE with delta powers) while the per-table gamma/coeffs stay in Pass 1. Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK; prover bus-completeness OK (off + on); instruments feature compiles.
Replace the per-table Round 4 FRI in multi_prove with ONE batched FRI per
(chunk, lde_size) bucket. Each bucket derives a single delta_fri from a
chunk-shared bucket_seed, folds every member's DEEP-composition LDE with
successive powers of delta_fri into one polynomial, runs one
commit_phase_from_evaluations + grinding + query_phase, and opens each member's
per-table DEEP openings at the bucket-shared iotas.
Proof struct (proof/stark.rs):
- Add ChunkBucketFri<E> { lde_size, members (chunk-local indices), layer_roots,
last_value, decommitments, nonce }.
- MultiProof: add fri_chunk_buckets (per chunk, per bucket) + chunk_size.
- StarkProof: drop per-table fri_layers_merkle_roots, fri_last_value,
query_list, nonce. Keep composition_poly_root, OOD evals, deep_poly_openings.
Prover (prover.rs):
- Capture pre-fork transcript; Pass 2 buckets the chunk by lde_size, builds the
bucket_seed, samples delta_fri, streams each member's DEEP LDE (peak 2x|LDE|),
runs the batched FRI, and assembles per-table StarkProofs + ChunkBucketFri.
- Remove dead per-table Round4 path (prove_round_4, round_4_compute..., Round4).
- Single-table prove() now returns a one-element MultiProof (bucket data lives
at the multi-proof level).
Verifier (verifier.rs):
- replay_rounds_2_to_3 derives per-table Round 2-3 challenges (no FRI).
- verify_chunks_phase_c_d rebuilds each bucket_seed, samples delta_fri + zetas +
grinding + iotas, reconstructs each member's D_i at the shared iotas, combines
with delta_fri powers, verifies the batched FRI fold (verify_bucket_fri_query,
MMCS-free) and each member's per-table openings.
- verify() takes a one-element MultiProof. Remove dead per-table FRI paths.
bucket_seed byte order (prover == verifier): pre-fork shared state, then for
each chunk-local index ascending: table_contribution (if any), then
composition_poly_root, then OOD evals (trace_ood columns column-major, then
composition_poly_parts_ood).
Validated: cargo check -p stark (+instruments), stark --lib 128/128,
VM completeness 14/14 with flag off and LAMBDA_STREAM_LDE=1.
Halve per-table Merkle-tree memory in streaming "retire-LDE" mode by
dropping the leaf half of the main/aux trees after commit and
regenerating the one leaf-level sibling each query needs on demand.
merkle.rs:
- Store `leaves_len` explicitly (after drop, nodes.len() == leaves_len-1,
so the old (node_count+1)/2 recovery no longer applies).
- `drop_leaves()`: truncate to inner nodes only; idempotent; no-op for
single-leaf trees and disk-spill mmap backing.
- `get_proof_by_pos_with_leaf_sibling()` + `sibling_leaf_position()`:
build an auth path whose bottom node is a caller-supplied regenerated
leaf hash and whose higher nodes come from the retained inner nodes.
Byte-identical to the full-tree `get_proof_by_pos`.
- `get_proof_by_pos` (full tree) unchanged.
prover.rs:
- Folded into the existing LAMBDA_STREAM_LDE flag: in streaming mode,
TableCommit::{plain,preprocessed} call drop_leaves() before wrapping
the tree in Arc (Phase A main + precomputed, Phase C aux).
- open_polys_with: when tree.leaves_dropped(), regenerate the sibling
leaf by gathering the recomputed LDE row at sibling_leaf_position and
hashing it with keccak_leaf_from_row (matches keccak_leaves_bit_reversed
byte layout), then build the path with the leaf-dropped opener.
- Composition tree left FULL (transient per-table, no cross-table peak);
documented on streaming_retire_lde.
Verifier unchanged (proofs are byte-identical).
Tests:
- crypto: leaf-dropped opening is byte-identical to full-tree opening;
drop_leaves idempotent.
- stark: prove with LAMBDA_STREAM_LDE=0 vs =1 yields byte-identical
serialized MultiProof.
Add Executor::snapshot() -> VmSnapshot and Executor::from_snapshot(elf, snap),
capturing the mutable VM state {memory, registers, pc} so execution can be
resumed byte-identically from a saved point. The immutable instruction cache is
rebuilt from the ELF, not stored. Determinism holds because all nondeterministic
input (private inputs) is pre-loaded into memory, so replay from a snapshot
reproduces identical logs.
This is the foundation (Approach 1 step B) for retiring resident traces:
re-execution from a checkpoint avoids re-running from cycle 0, bounding the cost
to one segment.
- derive Clone on Memory and Registers; PartialEq/Eq on Log (for test assertions)
- VmSnapshot { memory, registers, pc } + snapshot/from_snapshot on Executor
- hermetic checkpoint_tests: a hand-built 100_005xADDI + JALR-to-0 program (no
ELF fixture, no syscalls) proves snapshot mid-execution (across a resume()
100k-chunk boundary) + resume yields logs identical to a straight run
Validated: executor lib 14/14; stark/prover still build (derives are additive).
First step of "trace retire" (Approach 1, step C): make it possible to build a single table's trace on demand from a shared routed intermediate, so the prover can later build tables one at a time instead of all-at-once. Splits the monolithic `build_traces` (PHASES 1-5) into: - `RoutedTraceData<'a>`: holds the output of PHASES 1-4 — the per-table routed op-lists, the FINAL accumulated bitwise lookup multiplicities (folded in from every table, incl. padding/page/commit/keccak), the MEMW->LT extended lt_ops, num_padding_rows, halt_timestamp, public_output_bytes, decode trace + pc->row, the finalized register state map, and the borrowed elf/memory_state/private_input used only by the PAGE fill. - `route(...)`: PHASES 3-4 — finishes all cross-table coupling once, before any per-table fill reads it. - `TableKind` selector (CPU, LT, SHIFT, MEMW, MEMW_A, LOAD, MUL, DVRM, BRANCH, MEMW_R, matching the prover's table ordering) and `RoutedTraceData::build_table(which, ...)`: the PHASE-5 fill for ONE table. `build_traces` is now a thin orchestrator: `route(...)` then `build_table(...)` per chunked execution table, with the fixed/preprocessed tables (BITWISE, DECODE, REGISTER, HALT, COMMIT, KECCAK*, PAGE) assembled exactly as before. All existing callers (`from_elf_and_logs`, `from_logs`) are unchanged in behaviour. No prover, lib, or proof-structure changes — this only enables per-table build. Cross-table multiplicity coupling: all bitwise multiplicity accumulation happens in `route` and is complete before any `build_table` call, so the BITWISE table (assembled by the orchestrator) sees the full counts. Validation: - `cargo check -p lambda-vm-prover` (default + disk-spill features) clean. - New trace-equality test `test_route_then_build_table_matches_monolithic`: builds every log-derived table via `route` + per-table `build_table` and compares against `from_logs`. Compared as a multiset of rows because several `generate_*_trace` fns dedup via `HashMap<Op, mult>` and emit `into_iter()`, whose order is pre-existing per-run nondeterminism (two back-to-back `from_logs` already disagree on LT row order). Row content + dims are byte-identical. - `cargo test -p lambda-vm-prover --lib trace_builder_tests`: 24/24 pass. - `cargo test -p stark --lib`: 129/129 pass (prove+verify roundtrips unchanged). - Pre-existing 79 ELF-fixture failures (No such file / UnknownSyscall(5)) are identical on the base commit; this change adds exactly one passing test and zero new failures.
The LT/MUL/DVRM/BRANCH trace builders deduplicate rows via HashMap<Op, mult> and emitted op_map.into_iter() order, which std HashMap randomizes per instance -> two builds of the same logs disagreed on row order (content identical). This did not affect the single-build prover, but it is a hard prerequisite for streaming on-demand trace rebuild (C.2b): the Phase-A commit and the Rounds-2-4 rebuild must produce the same trace, or the committed Merkle root won't match the reconstructed trace. Fix: derive Ord on each *Operation and sort unique_ops by the canonical op key after the HashMap dedup. Repeated builds now produce byte-identical traces. NOTE: this changes the proof OUTPUT (row order is now sorted, was HashMap-order) - still a valid proof, now reproducible. Validated: new trace_build_is_deterministic_across_builds test passes (and FAILS if any sort is reverted, confirming it catches the non-determinism); trace_builder 25/25; stark 129/129; prover completeness pass (flag off + LAMBDA_STREAM_LDE=1).
Retire the resident log-derived traces under the streaming flag (LAMBDA_STREAM_LDE=1): build each table's main+aux trace on demand from the compact routed intermediate at the point the prover needs it, then drop it. Composes with M1 (retire LDE), M2 (batched FRI), T1 (leaf-drop). stark: - Add `TraceProvider` trait: on-demand main-trace source for retired tables, with `is_retired` / `num_rows` / `build_main`. - `multi_prove` delegates to new `multi_prove_with_provider(provider)`; `provider = None` is byte-for-byte the old path. When a provider is given AND streaming is on, retired tables are (re)built on demand at Phase A (main commit), Phase C (aux build + commit), and Rounds 2-4 (reconstruct), and dropped after each. `run_debug_checks` is made retire-aware. Off-path is unchanged. prover: - `route_for_streaming` runs PHASES 0-4 + builds the preprocessed/PAGE tables (extracted into `build_preprocessed_tables`), returning a `StreamingProvider` (routed intermediate + air-order slot map) and a `StreamingResident` (air-ordered resident traces with empty placeholders for retired slots, + metadata). The slot map reproduces the exact `air_trace_pairs` ordering (8 preprocessed, then cpu/lt/ shift/memw/memw_a/load/mul/dvrm/branch, PAGE, memw_r); PAGE and the 8 preprocessed tables stay resident. - `RoutedTraceData::build_chunk` rebuilds exactly one chunk on demand (byte-identical to `build_table(..)[chunk]`); `num_rows` builds the chunk to read its true padded row count (LT/MUL/DVRM/BRANCH pad to their deduplicated op count, which a raw op-count cannot predict). - `prove_with_options_and_inputs` branches on `streaming_retire_lde()`: streaming uses `route_for_streaming` + `multi_prove_with_provider`; the default path is unchanged. Validation: - Byte-identical proof test (streaming ON vs OFF, grinding disabled so the proof is deterministic — the parallel grinding nonce search is non-deterministic for grinding>0 independent of this change): PASSES for fib_iterative_1200k (multi-chunk CPU/MEMW_R, PAGE interleaved). - stark --lib 129 pass (flag off and on); prover --lib completeness 14 pass (off and on). prover --lib: 283 pass / 77 fail — the 77 are the pre-existing UnknownSyscall(5) ELF-fixture failures (identical off and on), not a regression of this change. - Compiles with default, instruments, disk-spill, and combined; no new clippy warnings.
Collaborator
Author
|
/bench 5 |
Benchmark — fib_iterative_8M (median of 5)Table parallelism: auto (cores / 3)
Commit: a7eabd3 · Baseline: cached · Runner: self-hosted bench |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.