Skip to content

Feat/streaming prover#647

Draft
diegokingston wants to merge 10 commits into
mainfrom
feat/streaming-prover
Draft

Feat/streaming prover#647
diegokingston wants to merge 10 commits into
mainfrom
feat/streaming-prover

Conversation

@diegokingston
Copy link
Copy Markdown
Collaborator

No description provided.

Add LAMBDA_STREAM_LDE=1 opt-in mode that drops main/aux LDE columns right
after committing (Phase A/C) instead of caching them, and recomputes each
table's LDE on demand in Rounds 2-4 via reconstruct_round1 (rebuilt from the
still-resident trace — no VM re-execution).

Trades one extra LDE expansion (iFFT+coset+FFT) per table for dropping the
O(N x cols x lde_size) cache of all tables' LDE columns held simultaneously
between Phase A/C and Rounds 2-4. Default (flag off) is unchanged.

- streaming_retire_lde() env helper
- ungate reconstruct_round1 (was debug-checks only)
- Phase A/C: drop cached LDE when streaming
- Rounds 2-4: reconstruct vs build_round1 branch

Validated: stark lib 128/128 with flag on; prover bus-completeness prove+verify
(aux/LogUp reconstruction) pass; multi-table roundtrip passes; baseline unchanged.
Split the Round 2 (composition) + Round 3 (OOD) part of prove_rounds_2_to_4
into a standalone prove_rounds_2_to_3 that returns (Round2, Round3, z) without
running FRI. prove_rounds_2_to_4 now calls it then Round 4 as before — proofs
are byte-identical (pure extraction).

This is step 1a toward batched FRI within a chunk (Approach 1, M2 option B):
the per-table Rounds 2-3 are now separable from Round 4, so a later step can
collect them per-chunk and run one batched FRI per lde_size bucket instead of
per-table FRI.

instruments: round-3 OOD timing now carried via a R3_OOD thread-local
(store_r3_ood/take_r3_ood) since the inline local moved into the new fn.

Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK;
prover bus-completeness OK; instruments feature compiles.
Restructure the per-chunk Rounds 2-4 loop in multi_prove into two passes:
Pass 1 runs prove_rounds_2_to_3 for all K tables in the chunk (advancing each
table's forked transcript through the OOD evaluations, collecting the
intermediates), then Pass 2 runs Round 4 per table. Round 4 is extracted into
prove_round_4 (the former prove_rounds_2_to_4, now consuming the Round2/Round3
produced by Pass 1).

Still one FRI per table, so proofs are byte-identical to the single-pass
version. This establishes the barrier needed for the next step (Approach 1, M2
option B): with all of a chunk's tables through Rounds 2-3, Pass 2 can be
replaced by a batched FRI per lde_size bucket instead of per-table FRI.

Instruments: per-table sub-op timings are now captured at the end of each pass
(same thread as the work — r2/r3 in Pass 1, r4 in Pass 2) and combined in the
sequential collection, since the two passes may run on different threads.

Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK;
prover bus-completeness OK (flag off + on); instruments feature compiles.
… change)

Move the DEEP composition challenge gamma + coefficient expansion out of Round 4
into a new sample_deep_coeffs helper, called at the end of prove_rounds_2_to_3
(Pass 1). gamma is sampled from the same fork at the same transcript position as
before, so proofs are byte-identical.

prove_rounds_2_to_3 now returns (Round2, Round3, z, DeepCoeffs); Round 4
(round_4_compute... / prove_round_4) takes the DeepCoeffs and only builds the
DEEP LDE + runs FRI. The two-pass loop threads DeepCoeffs through the
intermediate.

This is step 1c toward batched FRI (Approach 1, M2 option B): Pass 2 is now
reduced to DEEP-LDE + FRI, so the next step can replace it with one batched FRI
per lde_size bucket (folding each table's DEEP LDE with delta powers) while the
per-table gamma/coeffs stay in Pass 1.

Validated: stark lib 128/128 (flag off); streaming roundtrip/completeness OK;
prover bus-completeness OK (off + on); instruments feature compiles.
Replace the per-table Round 4 FRI in multi_prove with ONE batched FRI per
(chunk, lde_size) bucket. Each bucket derives a single delta_fri from a
chunk-shared bucket_seed, folds every member's DEEP-composition LDE with
successive powers of delta_fri into one polynomial, runs one
commit_phase_from_evaluations + grinding + query_phase, and opens each member's
per-table DEEP openings at the bucket-shared iotas.

Proof struct (proof/stark.rs):
- Add ChunkBucketFri<E> { lde_size, members (chunk-local indices), layer_roots,
  last_value, decommitments, nonce }.
- MultiProof: add fri_chunk_buckets (per chunk, per bucket) + chunk_size.
- StarkProof: drop per-table fri_layers_merkle_roots, fri_last_value,
  query_list, nonce. Keep composition_poly_root, OOD evals, deep_poly_openings.

Prover (prover.rs):
- Capture pre-fork transcript; Pass 2 buckets the chunk by lde_size, builds the
  bucket_seed, samples delta_fri, streams each member's DEEP LDE (peak 2x|LDE|),
  runs the batched FRI, and assembles per-table StarkProofs + ChunkBucketFri.
- Remove dead per-table Round4 path (prove_round_4, round_4_compute..., Round4).
- Single-table prove() now returns a one-element MultiProof (bucket data lives
  at the multi-proof level).

Verifier (verifier.rs):
- replay_rounds_2_to_3 derives per-table Round 2-3 challenges (no FRI).
- verify_chunks_phase_c_d rebuilds each bucket_seed, samples delta_fri + zetas +
  grinding + iotas, reconstructs each member's D_i at the shared iotas, combines
  with delta_fri powers, verifies the batched FRI fold (verify_bucket_fri_query,
  MMCS-free) and each member's per-table openings.
- verify() takes a one-element MultiProof. Remove dead per-table FRI paths.

bucket_seed byte order (prover == verifier): pre-fork shared state, then for
each chunk-local index ascending: table_contribution (if any), then
composition_poly_root, then OOD evals (trace_ood columns column-major, then
composition_poly_parts_ood).

Validated: cargo check -p stark (+instruments), stark --lib 128/128,
VM completeness 14/14 with flag off and LAMBDA_STREAM_LDE=1.
Halve per-table Merkle-tree memory in streaming "retire-LDE" mode by
dropping the leaf half of the main/aux trees after commit and
regenerating the one leaf-level sibling each query needs on demand.

merkle.rs:
- Store `leaves_len` explicitly (after drop, nodes.len() == leaves_len-1,
  so the old (node_count+1)/2 recovery no longer applies).
- `drop_leaves()`: truncate to inner nodes only; idempotent; no-op for
  single-leaf trees and disk-spill mmap backing.
- `get_proof_by_pos_with_leaf_sibling()` + `sibling_leaf_position()`:
  build an auth path whose bottom node is a caller-supplied regenerated
  leaf hash and whose higher nodes come from the retained inner nodes.
  Byte-identical to the full-tree `get_proof_by_pos`.
- `get_proof_by_pos` (full tree) unchanged.

prover.rs:
- Folded into the existing LAMBDA_STREAM_LDE flag: in streaming mode,
  TableCommit::{plain,preprocessed} call drop_leaves() before wrapping
  the tree in Arc (Phase A main + precomputed, Phase C aux).
- open_polys_with: when tree.leaves_dropped(), regenerate the sibling
  leaf by gathering the recomputed LDE row at sibling_leaf_position and
  hashing it with keccak_leaf_from_row (matches keccak_leaves_bit_reversed
  byte layout), then build the path with the leaf-dropped opener.
- Composition tree left FULL (transient per-table, no cross-table peak);
  documented on streaming_retire_lde.

Verifier unchanged (proofs are byte-identical).

Tests:
- crypto: leaf-dropped opening is byte-identical to full-tree opening;
  drop_leaves idempotent.
- stark: prove with LAMBDA_STREAM_LDE=0 vs =1 yields byte-identical
  serialized MultiProof.
Add Executor::snapshot() -> VmSnapshot and Executor::from_snapshot(elf, snap),
capturing the mutable VM state {memory, registers, pc} so execution can be
resumed byte-identically from a saved point. The immutable instruction cache is
rebuilt from the ELF, not stored. Determinism holds because all nondeterministic
input (private inputs) is pre-loaded into memory, so replay from a snapshot
reproduces identical logs.

This is the foundation (Approach 1 step B) for retiring resident traces:
re-execution from a checkpoint avoids re-running from cycle 0, bounding the cost
to one segment.

- derive Clone on Memory and Registers; PartialEq/Eq on Log (for test assertions)
- VmSnapshot { memory, registers, pc } + snapshot/from_snapshot on Executor
- hermetic checkpoint_tests: a hand-built 100_005xADDI + JALR-to-0 program (no
  ELF fixture, no syscalls) proves snapshot mid-execution (across a resume()
  100k-chunk boundary) + resume yields logs identical to a straight run

Validated: executor lib 14/14; stark/prover still build (derives are additive).
First step of "trace retire" (Approach 1, step C): make it possible to build a
single table's trace on demand from a shared routed intermediate, so the prover
can later build tables one at a time instead of all-at-once.

Splits the monolithic `build_traces` (PHASES 1-5) into:

- `RoutedTraceData<'a>`: holds the output of PHASES 1-4 — the per-table routed
  op-lists, the FINAL accumulated bitwise lookup multiplicities (folded in from
  every table, incl. padding/page/commit/keccak), the MEMW->LT extended lt_ops,
  num_padding_rows, halt_timestamp, public_output_bytes, decode trace + pc->row,
  the finalized register state map, and the borrowed elf/memory_state/private_input
  used only by the PAGE fill.
- `route(...)`: PHASES 3-4 — finishes all cross-table coupling once, before any
  per-table fill reads it.
- `TableKind` selector (CPU, LT, SHIFT, MEMW, MEMW_A, LOAD, MUL, DVRM, BRANCH,
  MEMW_R, matching the prover's table ordering) and
  `RoutedTraceData::build_table(which, ...)`: the PHASE-5 fill for ONE table.

`build_traces` is now a thin orchestrator: `route(...)` then `build_table(...)`
per chunked execution table, with the fixed/preprocessed tables (BITWISE, DECODE,
REGISTER, HALT, COMMIT, KECCAK*, PAGE) assembled exactly as before. All existing
callers (`from_elf_and_logs`, `from_logs`) are unchanged in behaviour.

No prover, lib, or proof-structure changes — this only enables per-table build.

Cross-table multiplicity coupling: all bitwise multiplicity accumulation happens
in `route` and is complete before any `build_table` call, so the BITWISE table
(assembled by the orchestrator) sees the full counts.

Validation:
- `cargo check -p lambda-vm-prover` (default + disk-spill features) clean.
- New trace-equality test `test_route_then_build_table_matches_monolithic`:
  builds every log-derived table via `route` + per-table `build_table` and
  compares against `from_logs`. Compared as a multiset of rows because several
  `generate_*_trace` fns dedup via `HashMap<Op, mult>` and emit `into_iter()`,
  whose order is pre-existing per-run nondeterminism (two back-to-back
  `from_logs` already disagree on LT row order). Row content + dims are
  byte-identical.
- `cargo test -p lambda-vm-prover --lib trace_builder_tests`: 24/24 pass.
- `cargo test -p stark --lib`: 129/129 pass (prove+verify roundtrips unchanged).
- Pre-existing 79 ELF-fixture failures (No such file / UnknownSyscall(5)) are
  identical on the base commit; this change adds exactly one passing test and
  zero new failures.
The LT/MUL/DVRM/BRANCH trace builders deduplicate rows via HashMap<Op, mult>
and emitted op_map.into_iter() order, which std HashMap randomizes per instance
-> two builds of the same logs disagreed on row order (content identical). This
did not affect the single-build prover, but it is a hard prerequisite for
streaming on-demand trace rebuild (C.2b): the Phase-A commit and the Rounds-2-4
rebuild must produce the same trace, or the committed Merkle root won't match
the reconstructed trace.

Fix: derive Ord on each *Operation and sort unique_ops by the canonical op key
after the HashMap dedup. Repeated builds now produce byte-identical traces.

NOTE: this changes the proof OUTPUT (row order is now sorted, was HashMap-order)
- still a valid proof, now reproducible.

Validated: new trace_build_is_deterministic_across_builds test passes (and FAILS
if any sort is reverted, confirming it catches the non-determinism); trace_builder
25/25; stark 129/129; prover completeness pass (flag off + LAMBDA_STREAM_LDE=1).
Retire the resident log-derived traces under the streaming flag
(LAMBDA_STREAM_LDE=1): build each table's main+aux trace on demand from
the compact routed intermediate at the point the prover needs it, then
drop it. Composes with M1 (retire LDE), M2 (batched FRI), T1 (leaf-drop).

stark:
- Add `TraceProvider` trait: on-demand main-trace source for retired
  tables, with `is_retired` / `num_rows` / `build_main`.
- `multi_prove` delegates to new `multi_prove_with_provider(provider)`;
  `provider = None` is byte-for-byte the old path. When a provider is
  given AND streaming is on, retired tables are (re)built on demand at
  Phase A (main commit), Phase C (aux build + commit), and Rounds 2-4
  (reconstruct), and dropped after each. `run_debug_checks` is made
  retire-aware. Off-path is unchanged.

prover:
- `route_for_streaming` runs PHASES 0-4 + builds the preprocessed/PAGE
  tables (extracted into `build_preprocessed_tables`), returning a
  `StreamingProvider` (routed intermediate + air-order slot map) and a
  `StreamingResident` (air-ordered resident traces with empty
  placeholders for retired slots, + metadata). The slot map reproduces
  the exact `air_trace_pairs` ordering (8 preprocessed, then cpu/lt/
  shift/memw/memw_a/load/mul/dvrm/branch, PAGE, memw_r); PAGE and the 8
  preprocessed tables stay resident.
- `RoutedTraceData::build_chunk` rebuilds exactly one chunk on demand
  (byte-identical to `build_table(..)[chunk]`); `num_rows` builds the
  chunk to read its true padded row count (LT/MUL/DVRM/BRANCH pad to
  their deduplicated op count, which a raw op-count cannot predict).
- `prove_with_options_and_inputs` branches on `streaming_retire_lde()`:
  streaming uses `route_for_streaming` + `multi_prove_with_provider`;
  the default path is unchanged.

Validation:
- Byte-identical proof test (streaming ON vs OFF, grinding disabled so
  the proof is deterministic — the parallel grinding nonce search is
  non-deterministic for grinding>0 independent of this change): PASSES
  for fib_iterative_1200k (multi-chunk CPU/MEMW_R, PAGE interleaved).
- stark --lib 129 pass (flag off and on); prover --lib completeness 14
  pass (off and on). prover --lib: 283 pass / 77 fail — the 77 are the
  pre-existing UnknownSyscall(5) ELF-fixture failures (identical off and
  on), not a regression of this change.
- Compiles with default, instruments, disk-spill, and combined; no new
  clippy warnings.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 5

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

Benchmark — fib_iterative_8M (median of 5)

Table parallelism: auto (cores / 3)

Metric main PR Δ
Peak heap 51779 MB 51319 MB -460 MB (-0.9%) ⚪
Prove time 24.898s 28.404s +3.506s (+14.1%) 🔴

⚠️ Regression detected — heap or time increased by more than 5%.

✅ Low variance (time: 1.0%, heap: 0.6%)

Commit: a7eabd3 · Baseline: cached · Runner: self-hosted bench

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant