Feat/streaming prover by diegokingston · Pull Request #647 · yetanotherco/lambda_vm

diegokingston · 2026-06-03T17:32:31Z

No description provided.

Add LAMBDA_STREAM_LDE=1 opt-in mode that drops main/aux LDE columns right after committing (Phase A/C) instead of caching them, and recomputes each table's LDE on demand in Rounds 2-4 via reconstruct_round1 (rebuilt from the still-resident trace — no VM re-execution). Trades one extra LDE expansion (iFFT+coset+FFT) per table for dropping the O(N x cols x lde_size) cache of all tables' LDE columns held simultaneously between Phase A/C and Rounds 2-4. Default (flag off) is unchanged. - streaming_retire_lde() env helper - ungate reconstruct_round1 (was debug-checks only) - Phase A/C: drop cached LDE when streaming - Rounds 2-4: reconstruct vs build_round1 branch Validated: stark lib 128/128 with flag on; prover bus-completeness prove+verify (aux/LogUp reconstruction) pass; multi-table roundtrip passes; baseline unchanged.

Split the Round 2 (composition) + Round 3 (OOD) part of prove_rounds_2_to_4 into a standalone prove_rounds_2_to_3 that returns (Round2, Round3, z) without running FRI. prove_rounds_2_to_4 now calls it then Round 4 as before — proofs are byte-identical (pure extraction). This is step 1a toward batched FRI within a chunk (Approach 1, M2 option B): the per-table Rounds 2-3 are now separable from Round 4, so a later step can collect them per-chunk and run one batched FRI per lde_size bucket instead of per-table FRI. instruments: round-3 OOD timing now carried via a R3_OOD thread-local (store_r3_ood/take_r3_ood) since the inline local moved into the new fn. Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK; prover bus-completeness OK; instruments feature compiles.

Restructure the per-chunk Rounds 2-4 loop in multi_prove into two passes: Pass 1 runs prove_rounds_2_to_3 for all K tables in the chunk (advancing each table's forked transcript through the OOD evaluations, collecting the intermediates), then Pass 2 runs Round 4 per table. Round 4 is extracted into prove_round_4 (the former prove_rounds_2_to_4, now consuming the Round2/Round3 produced by Pass 1). Still one FRI per table, so proofs are byte-identical to the single-pass version. This establishes the barrier needed for the next step (Approach 1, M2 option B): with all of a chunk's tables through Rounds 2-3, Pass 2 can be replaced by a batched FRI per lde_size bucket instead of per-table FRI. Instruments: per-table sub-op timings are now captured at the end of each pass (same thread as the work — r2/r3 in Pass 1, r4 in Pass 2) and combined in the sequential collection, since the two passes may run on different threads. Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK; prover bus-completeness OK (flag off + on); instruments feature compiles.

… change) Move the DEEP composition challenge gamma + coefficient expansion out of Round 4 into a new sample_deep_coeffs helper, called at the end of prove_rounds_2_to_3 (Pass 1). gamma is sampled from the same fork at the same transcript position as before, so proofs are byte-identical. prove_rounds_2_to_3 now returns (Round2, Round3, z, DeepCoeffs); Round 4 (round_4_compute... / prove_round_4) takes the DeepCoeffs and only builds the DEEP LDE + runs FRI. The two-pass loop threads DeepCoeffs through the intermediate. This is step 1c toward batched FRI (Approach 1, M2 option B): Pass 2 is now reduced to DEEP-LDE + FRI, so the next step can replace it with one batched FRI per lde_size bucket (folding each table's DEEP LDE with delta powers) while the per-table gamma/coeffs stay in Pass 1. Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK; prover bus-completeness OK (off + on); instruments feature compiles.

Replace the per-table Round 4 FRI in multi_prove with ONE batched FRI per (chunk, lde_size) bucket. Each bucket derives a single delta_fri from a chunk-shared bucket_seed, folds every member's DEEP-composition LDE with successive powers of delta_fri into one polynomial, runs one commit_phase_from_evaluations + grinding + query_phase, and opens each member's per-table DEEP openings at the bucket-shared iotas. Proof struct (proof/stark.rs): - Add ChunkBucketFri<E> { lde_size, members (chunk-local indices), layer_roots, last_value, decommitments, nonce }. - MultiProof: add fri_chunk_buckets (per chunk, per bucket) + chunk_size. - StarkProof: drop per-table fri_layers_merkle_roots, fri_last_value, query_list, nonce. Keep composition_poly_root, OOD evals, deep_poly_openings. Prover (prover.rs): - Capture pre-fork transcript; Pass 2 buckets the chunk by lde_size, builds the bucket_seed, samples delta_fri, streams each member's DEEP LDE (peak 2x|LDE|), runs the batched FRI, and assembles per-table StarkProofs + ChunkBucketFri. - Remove dead per-table Round4 path (prove_round_4, round_4_compute..., Round4). - Single-table prove() now returns a one-element MultiProof (bucket data lives at the multi-proof level). Verifier (verifier.rs): - replay_rounds_2_to_3 derives per-table Round 2-3 challenges (no FRI). - verify_chunks_phase_c_d rebuilds each bucket_seed, samples delta_fri + zetas + grinding + iotas, reconstructs each member's D_i at the shared iotas, combines with delta_fri powers, verifies the batched FRI fold (verify_bucket_fri_query, MMCS-free) and each member's per-table openings. - verify() takes a one-element MultiProof. Remove dead per-table FRI paths. bucket_seed byte order (prover == verifier): pre-fork shared state, then for each chunk-local index ascending: table_contribution (if any), then composition_poly_root, then OOD evals (trace_ood columns column-major, then composition_poly_parts_ood). Validated: cargo check -p stark (+instruments), stark --lib 128/128, VM completeness 14/14 with flag off and LAMBDA_STREAM_LDE=1.

Halve per-table Merkle-tree memory in streaming "retire-LDE" mode by dropping the leaf half of the main/aux trees after commit and regenerating the one leaf-level sibling each query needs on demand. merkle.rs: - Store `leaves_len` explicitly (after drop, nodes.len() == leaves_len-1, so the old (node_count+1)/2 recovery no longer applies). - `drop_leaves()`: truncate to inner nodes only; idempotent; no-op for single-leaf trees and disk-spill mmap backing. - `get_proof_by_pos_with_leaf_sibling()` + `sibling_leaf_position()`: build an auth path whose bottom node is a caller-supplied regenerated leaf hash and whose higher nodes come from the retained inner nodes. Byte-identical to the full-tree `get_proof_by_pos`. - `get_proof_by_pos` (full tree) unchanged. prover.rs: - Folded into the existing LAMBDA_STREAM_LDE flag: in streaming mode, TableCommit::{plain,preprocessed} call drop_leaves() before wrapping the tree in Arc (Phase A main + precomputed, Phase C aux). - open_polys_with: when tree.leaves_dropped(), regenerate the sibling leaf by gathering the recomputed LDE row at sibling_leaf_position and hashing it with keccak_leaf_from_row (matches keccak_leaves_bit_reversed byte layout), then build the path with the leaf-dropped opener. - Composition tree left FULL (transient per-table, no cross-table peak); documented on streaming_retire_lde. Verifier unchanged (proofs are byte-identical). Tests: - crypto: leaf-dropped opening is byte-identical to full-tree opening; drop_leaves idempotent. - stark: prove with LAMBDA_STREAM_LDE=0 vs =1 yields byte-identical serialized MultiProof.

Add Executor::snapshot() -> VmSnapshot and Executor::from_snapshot(elf, snap), capturing the mutable VM state {memory, registers, pc} so execution can be resumed byte-identically from a saved point. The immutable instruction cache is rebuilt from the ELF, not stored. Determinism holds because all nondeterministic input (private inputs) is pre-loaded into memory, so replay from a snapshot reproduces identical logs. This is the foundation (Approach 1 step B) for retiring resident traces: re-execution from a checkpoint avoids re-running from cycle 0, bounding the cost to one segment. - derive Clone on Memory and Registers; PartialEq/Eq on Log (for test assertions) - VmSnapshot { memory, registers, pc } + snapshot/from_snapshot on Executor - hermetic checkpoint_tests: a hand-built 100_005xADDI + JALR-to-0 program (no ELF fixture, no syscalls) proves snapshot mid-execution (across a resume() 100k-chunk boundary) + resume yields logs identical to a straight run Validated: executor lib 14/14; stark/prover still build (derives are additive).

First step of "trace retire" (Approach 1, step C): make it possible to build a single table's trace on demand from a shared routed intermediate, so the prover can later build tables one at a time instead of all-at-once. Splits the monolithic `build_traces` (PHASES 1-5) into: - `RoutedTraceData<'a>`: holds the output of PHASES 1-4 — the per-table routed op-lists, the FINAL accumulated bitwise lookup multiplicities (folded in from every table, incl. padding/page/commit/keccak), the MEMW->LT extended lt_ops, num_padding_rows, halt_timestamp, public_output_bytes, decode trace + pc->row, the finalized register state map, and the borrowed elf/memory_state/private_input used only by the PAGE fill. - `route(...)`: PHASES 3-4 — finishes all cross-table coupling once, before any per-table fill reads it. - `TableKind` selector (CPU, LT, SHIFT, MEMW, MEMW_A, LOAD, MUL, DVRM, BRANCH, MEMW_R, matching the prover's table ordering) and `RoutedTraceData::build_table(which, ...)`: the PHASE-5 fill for ONE table. `build_traces` is now a thin orchestrator: `route(...)` then `build_table(...)` per chunked execution table, with the fixed/preprocessed tables (BITWISE, DECODE, REGISTER, HALT, COMMIT, KECCAK*, PAGE) assembled exactly as before. All existing callers (`from_elf_and_logs`, `from_logs`) are unchanged in behaviour. No prover, lib, or proof-structure changes — this only enables per-table build. Cross-table multiplicity coupling: all bitwise multiplicity accumulation happens in `route` and is complete before any `build_table` call, so the BITWISE table (assembled by the orchestrator) sees the full counts. Validation: - `cargo check -p lambda-vm-prover` (default + disk-spill features) clean. - New trace-equality test `test_route_then_build_table_matches_monolithic`: builds every log-derived table via `route` + per-table `build_table` and compares against `from_logs`. Compared as a multiset of rows because several `generate_*_trace` fns dedup via `HashMap<Op, mult>` and emit `into_iter()`, whose order is pre-existing per-run nondeterminism (two back-to-back `from_logs` already disagree on LT row order). Row content + dims are byte-identical. - `cargo test -p lambda-vm-prover --lib trace_builder_tests`: 24/24 pass. - `cargo test -p stark --lib`: 129/129 pass (prove+verify roundtrips unchanged). - Pre-existing 79 ELF-fixture failures (No such file / UnknownSyscall(5)) are identical on the base commit; this change adds exactly one passing test and zero new failures.

The LT/MUL/DVRM/BRANCH trace builders deduplicate rows via HashMap<Op, mult> and emitted op_map.into_iter() order, which std HashMap randomizes per instance -> two builds of the same logs disagreed on row order (content identical). This did not affect the single-build prover, but it is a hard prerequisite for streaming on-demand trace rebuild (C.2b): the Phase-A commit and the Rounds-2-4 rebuild must produce the same trace, or the committed Merkle root won't match the reconstructed trace. Fix: derive Ord on each *Operation and sort unique_ops by the canonical op key after the HashMap dedup. Repeated builds now produce byte-identical traces. NOTE: this changes the proof OUTPUT (row order is now sorted, was HashMap-order) - still a valid proof, now reproducible. Validated: new trace_build_is_deterministic_across_builds test passes (and FAILS if any sort is reverted, confirming it catches the non-determinism); trace_builder 25/25; stark 129/129; prover completeness pass (flag off + LAMBDA_STREAM_LDE=1).

Retire the resident log-derived traces under the streaming flag (LAMBDA_STREAM_LDE=1): build each table's main+aux trace on demand from the compact routed intermediate at the point the prover needs it, then drop it. Composes with M1 (retire LDE), M2 (batched FRI), T1 (leaf-drop). stark: - Add `TraceProvider` trait: on-demand main-trace source for retired tables, with `is_retired` / `num_rows` / `build_main`. - `multi_prove` delegates to new `multi_prove_with_provider(provider)`; `provider = None` is byte-for-byte the old path. When a provider is given AND streaming is on, retired tables are (re)built on demand at Phase A (main commit), Phase C (aux build + commit), and Rounds 2-4 (reconstruct), and dropped after each. `run_debug_checks` is made retire-aware. Off-path is unchanged. prover: - `route_for_streaming` runs PHASES 0-4 + builds the preprocessed/PAGE tables (extracted into `build_preprocessed_tables`), returning a `StreamingProvider` (routed intermediate + air-order slot map) and a `StreamingResident` (air-ordered resident traces with empty placeholders for retired slots, + metadata). The slot map reproduces the exact `air_trace_pairs` ordering (8 preprocessed, then cpu/lt/ shift/memw/memw_a/load/mul/dvrm/branch, PAGE, memw_r); PAGE and the 8 preprocessed tables stay resident. - `RoutedTraceData::build_chunk` rebuilds exactly one chunk on demand (byte-identical to `build_table(..)[chunk]`); `num_rows` builds the chunk to read its true padded row count (LT/MUL/DVRM/BRANCH pad to their deduplicated op count, which a raw op-count cannot predict). - `prove_with_options_and_inputs` branches on `streaming_retire_lde()`: streaming uses `route_for_streaming` + `multi_prove_with_provider`; the default path is unchanged. Validation: - Byte-identical proof test (streaming ON vs OFF, grinding disabled so the proof is deterministic — the parallel grinding nonce search is non-deterministic for grinding>0 independent of this change): PASSES for fib_iterative_1200k (multi-chunk CPU/MEMW_R, PAGE interleaved). - stark --lib 129 pass (flag off and on); prover --lib completeness 14 pass (off and on). prover --lib: 283 pass / 77 fail — the 77 are the pre-existing UnknownSyscall(5) ELF-fixture failures (identical off and on), not a regression of this change. - Compiles with default, instruments, disk-spill, and combined; no new clippy warnings.

diegokingston · 2026-06-03T17:49:14Z

/bench 5

github-actions · 2026-06-03T17:52:17Z

Benchmark — fib_iterative_8M (median of 5)

_{Table parallelism: auto (cores / 3)}

Metric	main	PR	Δ
Peak heap	51779 MB	51319 MB	-460 MB (-0.9%) ⚪
Prove time	24.898s	28.404s	+3.506s (+14.1%) 🔴

⚠️ Regression detected — heap or time increased by more than 5%.

✅ Low variance (time: 1.0%, heap: 0.6%)

_{Commit: a7eabd3 · Baseline: cached · Runner: self-hosted bench}

diegokingston added 10 commits June 2, 2026 14:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feat/streaming prover#647

Feat/streaming prover#647
diegokingston wants to merge 10 commits into
mainfrom
feat/streaming-prover

diegokingston commented Jun 3, 2026

Uh oh!

diegokingston commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

diegokingston commented Jun 3, 2026

Uh oh!

diegokingston commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

Benchmark — fib_iterative_8M (median of 5)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant