fix: re-tick immediately when on_tick overruns its interval boundary by MegaRedHand · Pull Request #416 · lambdaclass/ethlambda

MegaRedHand · 2026-06-03T21:02:52Z

Closes #413

Problem

handle_tick scheduled the next tick after on_tick returned, with a delay computed from the timestamp captured at entry. If a duty overran the 800ms interval (block building proves inline, ~1.3s observed), the next tick fired past the following boundary and that interval's duty was silently skipped.

Concrete symptom: a proposing node never attests on any slot it proposes (interval-0 overrun skips interval 1). This alone makes 1- and 2-node devnets unable to justify, and loses ~1/N of attestations per slot on N-node devnets.

Fix

Minimal change, contained to handle_tick:

Re-read the clock after on_tick and compare elapsed time against the interval slack remaining at entry (remaining_at_entry).
If a boundary was crossed (elapsed >= remaining_at_entry), tick again immediately so the missed interval's duty still runs.
Otherwise schedule for the next boundary, computed from the post-tick timestamp.

The comparison keys off the boundary actually being crossed rather than a fixed 800ms threshold, so it also handles scheduling jitter (a late-firing tick whose on_tick crosses the boundary in <800ms) and exact-boundary entry.

Scope / tradeoff

Recovers a single missed interval, which covers the reported symptom (interval-0 overrun of ~1.3s skips one interval). Overruns spanning more than one full interval are not individually recovered; that case isn't triggered in practice.
Block building still proves inline on the actor thread; moving it off-thread (direction 2 in the issue) remains a possible follow-up.

Testing

Existing blockchain unit tests, cargo fmt, and clippy -D warnings pass.

github-actions · 2026-06-03T21:03:59Z

🤖 Kimi Code Review

The PR correctly fixes issue #413 (silent skipping of overrun duties) and implements proper fast-forward logic for offline catch-up. The state machine logic is sound: ticks are now tagged with their intended interval time, and late ticks fire immediately to process missed duties rather than skipping them.

Minor issues & suggestions:

Overflow safety (theoretical)
crates/blockchain/src/lib.rs:67
```
let boundary_ms = genesis_time_ms + time * MILLISECONDS_PER_INTERVAL;
```
While practically impossible (would require Ethereum to run for millions of years), this multiplication/addition could overflow u64 in release mode (which wraps). Consider checked_add/checked_mul or a saturating variant if you want absolute robustness against corrupted wall-clock inputs.
Redundant system clock calls
crates/blockchain/src/lib.rs:727-735
unix_now_ms() is called twice in handle_tick (once for clamp_tick_time, once for delay calculation). Cache the result in a local variable to avoid the second syscall and ensure consistency if the timer ticks over between calls:
```
let now = unix_now_ms();
let mut tick_time = clamp_tick_time(msg.time, now, genesis_time_ms);
// ... later ...
let delay = ms_until_tick(next_time, now, genesis_time_ms);
```
Truncation of u128
crates/blockchain/src/lib.rs:84
```
.as_millis() as u64
```
This is safe until year ~584 million, but worth documenting if this code is expected to run in embedded environments with 32-bit usize (though u64 is fixed width).

Consensus correctness verification:

Item 1: The clamp_tick_time logic correctly mirrors store::on_tick by bounding replay to one slot behind wall clock, preventing DoS via excessive catch-up iterations.
Item 2: The loop in handle_tick (lines 731-750) correctly implements "run missed interval immediately" semantics. The log warning at line 744 aids observability when duties overrun.
Item 3: Deriving timestamp_ms from the scheduled tick_time rather than wall clock (line 730) ensures attestations and block processing use the correct slot/epoch for state transition, preventing misattribution to future intervals.

Testing:
The unit tests comprehensively cover boundary conditions, pre-genesis states, and the specific overrun scenario from issue #413.

Verdict: Approved with minor nits. The logic is correct and secure for consensus operations.

Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

github-actions · 2026-06-03T21:05:29Z

🤖 Codex Code Review

crates/blockchain/src/lib.rs: handle_tick now catches up missed intervals in a tight inline loop. If tick work is consistently slower than 800ms for even a short stretch, delay remains 0 and the actor never yields back to its mailbox. In this actor, that means inbound blocks/attestations stop being processed while catch-up keeps running, which can stall fork choice and attestation handling under load. I’d re-enqueue one overdue tick at Duration::ZERO (or otherwise cap the loop) instead of looping indefinitely inside a single handler.

The late-tick attribution itself looks directionally correct, and the added unit tests cover the helper math well. I did not find a direct security/XMSS/SSZ issue in this diff. I could not run cargo test here because the environment blocks writes under the default Rust toolchain/cache locations.

Automated review by OpenAI Codex · gpt-5.4 · custom prompt

github-actions · 2026-06-03T21:06:08Z

🤖 Claude Code Review

Here is the review:

PR 416: fix: run skipped tick intervals instead of silently dropping them

Overview

The PR addresses a real correctness bug: block building (~1.3 s) overran the 800 ms interval, causing the proposer to never attest on any slot it proposed. The fix carries the intended interval count in the Tick message so that a late-firing tick still runs its correct duties, and adds a catch-up loop for consecutive overruns.

The root-cause diagnosis is correct, the solution is well-motivated, and the code is clean. One medium issue worth addressing before merge.

Medium: Loop in `handle_tick` has no iteration cap

File: crates/blockchain/src/lib.rs, lines 728–753

clamp_tick_time bounds the entry gap to INTERVALS_PER_SLOT (5), so in the happy path the loop runs at most 5 catch-up iterations before next_time is in the future.

However, clamp_tick_time is called once before the loop and never again inside it. If on_tick itself takes longer than MILLISECONDS_PER_INTERVAL (800 ms) on every iteration, wall clock advances faster than tick_time, and the loop never breaks:

tick_time = wall_time − 5          # after clamp
iteration 1: on_tick (1300 ms),    # wall_time += 1.625 intervals
             next_time = wall_time − 4,  delay = 0 → continue
iteration 2: on_tick (1300 ms),    # wall_time += 3.25 intervals total
             next_time = wall_time − 3,  delay = 0 → continue
…                                  # never catches up

Block building is the only consistently slow operation (and only on proposing intervals), so in practice the loop runs 1–2 extra iterations and terminates. But the actor thread is not guarded against a pathological scenario, and there is no diagnostic that reveals if we enter a runaway state.

Recommendation: Add a small iteration cap (e.g. 2 * INTERVALS_PER_SLOT) and fall back to scheduling from wall clock if exceeded:

let max_catchup = 2 * INTERVALS_PER_SLOT;
let mut catchup = 0u64;
loop {
    let timestamp_ms = genesis_time_ms + tick_time * MILLISECONDS_PER_INTERVAL;
    self.on_tick(timestamp_ms, ctx).await;

    let next_time = tick_time + 1;
    let delay = ms_until_tick(next_time, unix_now_ms(), genesis_time_ms);
    if delay > 0 || catchup >= max_catchup {
        if catchup >= max_catchup {
            warn!(catchup, "Catch-up limit reached; rescheduling from wall clock");
        }
        send_after(Duration::from_millis(delay), ctx.clone(),
                   block_chain_protocol::Tick { time: next_time });
        break;
    }
    ...
    catchup += 1;
}

Minor: `clamp_tick_time` boundary is off-by-one vs. its comment

File: crates/blockchain/src/lib.rs, line 76

if wall_time.saturating_sub(intended_time) > INTERVALS_PER_SLOT {

The comment says "more than a slot behind". The condition triggers at gap = INTERVALS_PER_SLOT + 1, so a gap of exactly INTERVALS_PER_SLOT is not clamped. The test clamp_tick_time_late_within_a_slot_is_unchanged confirms this is intentional ("Late by exactly one slot: still replay every missed interval"). The code is correct, but the comment "more than a slot behind" could be tightened to "strictly more than a slot (> INTERVALS_PER_SLOT intervals)" to match the test's prose and avoid misreading.

Minor: `genesis_time * 1000` duplicated with different variable name

Files: lib.rs lines 721 and ~108

genesis_time_ms is computed in both new() and handle_tick with the same expression. Not wrong—just noting it as a future extraction opportunity if the pattern grows.

Observations / Good

Carrying time (absolute interval count) in the Tick message is the correct fix for the root cause. Wall-clock derivation of slot/interval was always fragile on an overloaded thread.
ms_until_tick returning 0 (not underflowing) for past boundaries is exactly right, and saturating_sub makes the intent clear.
unix_now_ms() helper eliminates the four-line boilerplate duplication.
Test coverage is solid. overrunning_interval_0_still_schedules_interval_1 directly encodes the bug scenario from the issue, which is excellent.
Warning log at overrun includes %slot and %interval—useful for diagnosing timing issues in devnets.
The PR description is detailed and honest about what is and isn't addressed (block building still proves inline; forkchoice spec test failures are pre-existing).

Automated review by Claude (Anthropic) · sonnet · custom prompt

greptile-apps · 2026-06-03T21:08:48Z

Greptile Summary

This PR fixes a silent duty-skipping bug (#413) where an overlong interval-0 duty (e.g. inline block building at ~1.3s) caused the next tick to be scheduled past the interval-1 boundary, silently dropping attestation duties. Each Tick message now carries the interval-count it was scheduled for; slot/interval are derived from that instead of the wall clock, so a late tick still runs its intended interval's duties. A catch-up loop replays any missed intervals in place (with a warning) before scheduling the next future tick, and clamp_tick_time fast-forwards to one slot before wall-clock time when the lag exceeds a full slot.

handle_tick rewritten: drives a catch-up loop that processes every overrun interval before scheduling the next future one, preventing silent skips at the cost of ~500ms extra delay (vs. the old complete miss).
New helpers: ms_until_tick (returns 0 when boundary already passed), clamp_tick_time (one-slot fast-forward bound after a long stall), and unix_now_ms (shared accessor); seven unit tests cover all edge-cases including the exact Tick scheduler silently skips intervals when a duty overruns 800ms (proposer never attests on its own slot) #413 scenario.

Confidence Score: 4/5

The change is safe to merge. The catch-up loop is correctly bounded (clamp_tick_time at entry + natural break when a future interval hasn't yet been reached), the tick-tagging approach eliminates the silent duty-skip, and the unit tests cover the exact failure scenario from issue #413.

The core scheduling logic is sound and the test suite is thorough. The only non-trivial observation is that the last_tick_instant metric will report misleading durations during consecutive catch-up iterations, which could obscure overrun events in monitoring. This does not affect correctness of duty execution.

crates/blockchain/src/lib.rs — the handle_tick catch-up loop and its interaction with the last_tick_instant metric are the only areas worth a second look.

Important Files Changed

Filename	Overview
crates/blockchain/src/lib.rs	Core fix: handle_tick now carries an interval-count in Tick, runs a catch-up loop for overrun intervals, and uses clamp_tick_time / ms_until_tick helpers. Logic is correct and well-tested; one subtle metric side-effect noted (see comment).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Tick message arrives with time=N] --> B[clamp_tick_time: fast-forward if more than one slot behind]
    B --> C[on_tick with timestamp from tick_time]
    C --> D{ms_until_tick next boundary}
    D -->|delay > 0| E[send_after delay, Tick time=next]
    E --> F[Done]
    D -->|delay == 0 boundary already passed| G[warn: tick overran interval]
    G --> H[increment tick_time]
    H --> C

Comments Outside Diff (1)

crates/blockchain/src/lib.rs, line 196-199 (link)

last_tick_instant metric skewed during catch-up loop iterations

on_tick calls observe_tick_interval_duration(prev_instant.elapsed()) and then resets last_tick_instant on every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent.

Prompt To Fix With AI

This is a comment left during a code review.
Path: crates/blockchain/src/lib.rs
Line: 196-199

Comment:
**`last_tick_instant` metric skewed during catch-up loop iterations**

`on_tick` calls `observe_tick_interval_duration(prev_instant.elapsed())` and then resets `last_tick_instant` on every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent.

How can I resolve this? If you propose a fix, please make it concise.

Prompt To Fix All With AI

Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/blockchain/src/lib.rs:196-199
**`last_tick_instant` metric skewed during catch-up loop iterations**

`on_tick` calls `observe_tick_interval_duration(prev_instant.elapsed())` and then resets `last_tick_instant` on every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent.

_{Reviews (1): Last reviewed commit: "fix: run skipped tick intervals instead ..." | Re-trigger Greptile}

handle_tick scheduled the next tick after on_tick returned, with a delay computed from the timestamp captured at entry. A duty overrunning the 800ms interval (e.g. inline block-building proofs taking ~1.3s) pushed the next tick past the following boundary, so that interval's duty never ran: a proposing node never attested on its own slot, which alone prevents 1- and 2-node devnets from ever justifying. Drive tick processing from store.time() (the last processed interval) instead of the wall clock at handling time: each tick loops over every interval boundary that has passed but was not processed yet, running missed intervals in place (with a warning) so their duties still execute. When more than a slot behind (e.g. a long stall), jump straight to the wall clock, matching store::on_tick's existing fast-forward. Closes #413

…them" This reverts commit 15750d7.

handle_tick scheduled the next tick from the timestamp captured at entry, so a duty overrunning the 800ms interval (e.g. inline block-building proofs ~1.3s) pushed the next tick past the following boundary and that interval's duty never ran: a proposing node never attested on its own slot, which alone prevents 1- and 2-node devnets from ever justifying. Re-read the clock after on_tick and compare elapsed time against the interval slack remaining at entry. If a boundary was crossed, tick again immediately so the missed interval's duty still runs; otherwise schedule for the next boundary from the post-tick timestamp. The comparison keys off the boundary actually being crossed rather than a fixed 800ms threshold, so it also handles scheduling jitter and exact-boundary entry. Closes #413

MegaRedHand force-pushed the fix-tick-scheduler-skip branch from b20c5e7 to 15750d7 Compare June 3, 2026 21:16

MegaRedHand added 2 commits June 3, 2026 18:43

Revert "fix: run skipped tick intervals instead of silently dropping …

baf8c67

…them" This reverts commit 15750d7.

MegaRedHand changed the title ~~fix: run skipped tick intervals instead of silently dropping them~~ fix: re-tick immediately when on_tick overruns its interval boundary Jun 4, 2026

Merge branch 'main' into fix-tick-scheduler-skip

ad02251

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: re-tick immediately when on_tick overruns its interval boundary#416

fix: re-tick immediately when on_tick overruns its interval boundary#416
MegaRedHand wants to merge 4 commits into
mainfrom
fix-tick-scheduler-skip

MegaRedHand commented Jun 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

github-actions Bot commented Jun 3, 2026

Uh oh!

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading

Important Files Changed

Comments Outside Diff (1)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MegaRedHand commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Scope / tradeoff

Testing

Uh oh!

github-actions Bot commented Jun 3, 2026

🤖 Kimi Code Review

Uh oh!

github-actions Bot commented Jun 3, 2026

🤖 Codex Code Review

Uh oh!

github-actions Bot commented Jun 3, 2026

🤖 Claude Code Review

PR 416: fix: run skipped tick intervals instead of silently dropping them

Overview

Medium: Loop in handle_tick has no iteration cap

Minor: clamp_tick_time boundary is off-by-one vs. its comment

Minor: genesis_time * 1000 duplicated with different variable name

Observations / Good

Uh oh!

greptile-apps Bot commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Comments Outside Diff (1)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

MegaRedHand commented Jun 3, 2026 •

edited

Loading

Medium: Loop in `handle_tick` has no iteration cap

Minor: `clamp_tick_time` boundary is off-by-one vs. its comment

Minor: `genesis_time * 1000` duplicated with different variable name

greptile-apps Bot commented Jun 3, 2026 •

edited

Loading