Skip to content

fix: re-tick immediately when on_tick overruns its interval boundary#416

Open
MegaRedHand wants to merge 4 commits into
mainfrom
fix-tick-scheduler-skip
Open

fix: re-tick immediately when on_tick overruns its interval boundary#416
MegaRedHand wants to merge 4 commits into
mainfrom
fix-tick-scheduler-skip

Conversation

@MegaRedHand
Copy link
Copy Markdown
Collaborator

@MegaRedHand MegaRedHand commented Jun 3, 2026

Closes #413

Problem

handle_tick scheduled the next tick after on_tick returned, with a delay computed from the timestamp captured at entry. If a duty overran the 800ms interval (block building proves inline, ~1.3s observed), the next tick fired past the following boundary and that interval's duty was silently skipped.

Concrete symptom: a proposing node never attests on any slot it proposes (interval-0 overrun skips interval 1). This alone makes 1- and 2-node devnets unable to justify, and loses ~1/N of attestations per slot on N-node devnets.

Fix

Minimal change, contained to handle_tick:

  • Re-read the clock after on_tick and compare elapsed time against the interval slack remaining at entry (remaining_at_entry).
  • If a boundary was crossed (elapsed >= remaining_at_entry), tick again immediately so the missed interval's duty still runs.
  • Otherwise schedule for the next boundary, computed from the post-tick timestamp.

The comparison keys off the boundary actually being crossed rather than a fixed 800ms threshold, so it also handles scheduling jitter (a late-firing tick whose on_tick crosses the boundary in <800ms) and exact-boundary entry.

Scope / tradeoff

  • Recovers a single missed interval, which covers the reported symptom (interval-0 overrun of ~1.3s skips one interval). Overruns spanning more than one full interval are not individually recovered; that case isn't triggered in practice.
  • Block building still proves inline on the actor thread; moving it off-thread (direction 2 in the issue) remains a possible follow-up.

Testing

Existing blockchain unit tests, cargo fmt, and clippy -D warnings pass.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

🤖 Kimi Code Review

The PR correctly fixes issue #413 (silent skipping of overrun duties) and implements proper fast-forward logic for offline catch-up. The state machine logic is sound: ticks are now tagged with their intended interval time, and late ticks fire immediately to process missed duties rather than skipping them.

Minor issues & suggestions:

  1. Overflow safety (theoretical)
    crates/blockchain/src/lib.rs:67

    let boundary_ms = genesis_time_ms + time * MILLISECONDS_PER_INTERVAL;

    While practically impossible (would require Ethereum to run for millions of years), this multiplication/addition could overflow u64 in release mode (which wraps). Consider checked_add/checked_mul or a saturating variant if you want absolute robustness against corrupted wall-clock inputs.

  2. Redundant system clock calls
    crates/blockchain/src/lib.rs:727-735
    unix_now_ms() is called twice in handle_tick (once for clamp_tick_time, once for delay calculation). Cache the result in a local variable to avoid the second syscall and ensure consistency if the timer ticks over between calls:

    let now = unix_now_ms();
    let mut tick_time = clamp_tick_time(msg.time, now, genesis_time_ms);
    // ... later ...
    let delay = ms_until_tick(next_time, now, genesis_time_ms);
  3. Truncation of u128
    crates/blockchain/src/lib.rs:84

    .as_millis() as u64

    This is safe until year ~584 million, but worth documenting if this code is expected to run in embedded environments with 32-bit usize (though u64 is fixed width).

Consensus correctness verification:

  • Item 1: The clamp_tick_time logic correctly mirrors store::on_tick by bounding replay to one slot behind wall clock, preventing DoS via excessive catch-up iterations.
  • Item 2: The loop in handle_tick (lines 731-750) correctly implements "run missed interval immediately" semantics. The log warning at line 744 aids observability when duties overrun.
  • Item 3: Deriving timestamp_ms from the scheduled tick_time rather than wall clock (line 730) ensures attestations and block processing use the correct slot/epoch for state transition, preventing misattribution to future intervals.

Testing:
The unit tests comprehensively cover boundary conditions, pre-genesis states, and the specific overrun scenario from issue #413.

Verdict: Approved with minor nits. The logic is correct and secure for consensus operations.


Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

🤖 Codex Code Review

  1. crates/blockchain/src/lib.rs: handle_tick now catches up missed intervals in a tight inline loop. If tick work is consistently slower than 800ms for even a short stretch, delay remains 0 and the actor never yields back to its mailbox. In this actor, that means inbound blocks/attestations stop being processed while catch-up keeps running, which can stall fork choice and attestation handling under load. I’d re-enqueue one overdue tick at Duration::ZERO (or otherwise cap the loop) instead of looping indefinitely inside a single handler.

The late-tick attribution itself looks directionally correct, and the added unit tests cover the helper math well. I did not find a direct security/XMSS/SSZ issue in this diff. I could not run cargo test here because the environment blocks writes under the default Rust toolchain/cache locations.


Automated review by OpenAI Codex · gpt-5.4 · custom prompt

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 3, 2026

🤖 Claude Code Review

Here is the review:


PR 416: fix: run skipped tick intervals instead of silently dropping them

Overview

The PR addresses a real correctness bug: block building (~1.3 s) overran the 800 ms interval, causing the proposer to never attest on any slot it proposed. The fix carries the intended interval count in the Tick message so that a late-firing tick still runs its correct duties, and adds a catch-up loop for consecutive overruns.

The root-cause diagnosis is correct, the solution is well-motivated, and the code is clean. One medium issue worth addressing before merge.


Medium: Loop in handle_tick has no iteration cap

File: crates/blockchain/src/lib.rs, lines 728–753

clamp_tick_time bounds the entry gap to INTERVALS_PER_SLOT (5), so in the happy path the loop runs at most 5 catch-up iterations before next_time is in the future.

However, clamp_tick_time is called once before the loop and never again inside it. If on_tick itself takes longer than MILLISECONDS_PER_INTERVAL (800 ms) on every iteration, wall clock advances faster than tick_time, and the loop never breaks:

tick_time = wall_time − 5          # after clamp
iteration 1: on_tick (1300 ms),    # wall_time += 1.625 intervals
             next_time = wall_time − 4,  delay = 0 → continue
iteration 2: on_tick (1300 ms),    # wall_time += 3.25 intervals total
             next_time = wall_time − 3,  delay = 0 → continue
…                                  # never catches up

Block building is the only consistently slow operation (and only on proposing intervals), so in practice the loop runs 1–2 extra iterations and terminates. But the actor thread is not guarded against a pathological scenario, and there is no diagnostic that reveals if we enter a runaway state.

Recommendation: Add a small iteration cap (e.g. 2 * INTERVALS_PER_SLOT) and fall back to scheduling from wall clock if exceeded:

let max_catchup = 2 * INTERVALS_PER_SLOT;
let mut catchup = 0u64;
loop {
    let timestamp_ms = genesis_time_ms + tick_time * MILLISECONDS_PER_INTERVAL;
    self.on_tick(timestamp_ms, ctx).await;

    let next_time = tick_time + 1;
    let delay = ms_until_tick(next_time, unix_now_ms(), genesis_time_ms);
    if delay > 0 || catchup >= max_catchup {
        if catchup >= max_catchup {
            warn!(catchup, "Catch-up limit reached; rescheduling from wall clock");
        }
        send_after(Duration::from_millis(delay), ctx.clone(),
                   block_chain_protocol::Tick { time: next_time });
        break;
    }
    ...
    catchup += 1;
}

Minor: clamp_tick_time boundary is off-by-one vs. its comment

File: crates/blockchain/src/lib.rs, line 76

if wall_time.saturating_sub(intended_time) > INTERVALS_PER_SLOT {

The comment says "more than a slot behind". The condition triggers at gap = INTERVALS_PER_SLOT + 1, so a gap of exactly INTERVALS_PER_SLOT is not clamped. The test clamp_tick_time_late_within_a_slot_is_unchanged confirms this is intentional ("Late by exactly one slot: still replay every missed interval"). The code is correct, but the comment "more than a slot behind" could be tightened to "strictly more than a slot (> INTERVALS_PER_SLOT intervals)" to match the test's prose and avoid misreading.


Minor: genesis_time * 1000 duplicated with different variable name

Files: lib.rs lines 721 and ~108

genesis_time_ms is computed in both new() and handle_tick with the same expression. Not wrong—just noting it as a future extraction opportunity if the pattern grows.


Observations / Good

  • Carrying time (absolute interval count) in the Tick message is the correct fix for the root cause. Wall-clock derivation of slot/interval was always fragile on an overloaded thread.
  • ms_until_tick returning 0 (not underflowing) for past boundaries is exactly right, and saturating_sub makes the intent clear.
  • unix_now_ms() helper eliminates the four-line boilerplate duplication.
  • Test coverage is solid. overrunning_interval_0_still_schedules_interval_1 directly encodes the bug scenario from the issue, which is excellent.
  • Warning log at overrun includes %slot and %interval—useful for diagnosing timing issues in devnets.
  • The PR description is detailed and honest about what is and isn't addressed (block building still proves inline; forkchoice spec test failures are pre-existing).

Automated review by Claude (Anthropic) · sonnet · custom prompt

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Jun 3, 2026

Greptile Summary

This PR fixes a silent duty-skipping bug (#413) where an overlong interval-0 duty (e.g. inline block building at ~1.3s) caused the next tick to be scheduled past the interval-1 boundary, silently dropping attestation duties. Each Tick message now carries the interval-count it was scheduled for; slot/interval are derived from that instead of the wall clock, so a late tick still runs its intended interval's duties. A catch-up loop replays any missed intervals in place (with a warning) before scheduling the next future tick, and clamp_tick_time fast-forwards to one slot before wall-clock time when the lag exceeds a full slot.

Confidence Score: 4/5

The change is safe to merge. The catch-up loop is correctly bounded (clamp_tick_time at entry + natural break when a future interval hasn't yet been reached), the tick-tagging approach eliminates the silent duty-skip, and the unit tests cover the exact failure scenario from issue #413.

The core scheduling logic is sound and the test suite is thorough. The only non-trivial observation is that the last_tick_instant metric will report misleading durations during consecutive catch-up iterations, which could obscure overrun events in monitoring. This does not affect correctness of duty execution.

crates/blockchain/src/lib.rs — the handle_tick catch-up loop and its interaction with the last_tick_instant metric are the only areas worth a second look.

Important Files Changed

Filename Overview
crates/blockchain/src/lib.rs Core fix: handle_tick now carries an interval-count in Tick, runs a catch-up loop for overrun intervals, and uses clamp_tick_time / ms_until_tick helpers. Logic is correct and well-tested; one subtle metric side-effect noted (see comment).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Tick message arrives with time=N] --> B[clamp_tick_time: fast-forward if more than one slot behind]
    B --> C[on_tick with timestamp from tick_time]
    C --> D{ms_until_tick next boundary}
    D -->|delay > 0| E[send_after delay, Tick time=next]
    E --> F[Done]
    D -->|delay == 0 boundary already passed| G[warn: tick overran interval]
    G --> H[increment tick_time]
    H --> C
Loading

Comments Outside Diff (1)

  1. crates/blockchain/src/lib.rs, line 196-199 (link)

    P2 last_tick_instant metric skewed during catch-up loop iterations

    on_tick calls observe_tick_interval_duration(prev_instant.elapsed()) and then resets last_tick_instant on every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: crates/blockchain/src/lib.rs
    Line: 196-199
    
    Comment:
    **`last_tick_instant` metric skewed during catch-up loop iterations**
    
    `on_tick` calls `observe_tick_interval_duration(prev_instant.elapsed())` and then resets `last_tick_instant` on every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent.
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
crates/blockchain/src/lib.rs:196-199
**`last_tick_instant` metric skewed during catch-up loop iterations**

`on_tick` calls `observe_tick_interval_duration(prev_instant.elapsed())` and then resets `last_tick_instant` on every invocation. When the catch-up loop runs multiple iterations back-to-back (delay == 0), each successive call measures only the previous iteration's processing time rather than the ~800 ms wall-clock interval. Monitoring dashboards will show near-zero or processing-time-only values for catch-up ticks, making the metric misleading precisely when an overrun is occurring. Consider skipping the observation (or tagging it as a catch-up sample) for in-place iterations to keep the metric semantically consistent.

Reviews (1): Last reviewed commit: "fix: run skipped tick intervals instead ..." | Re-trigger Greptile

handle_tick scheduled the next tick after on_tick returned, with a delay
computed from the timestamp captured at entry. A duty overrunning the
800ms interval (e.g. inline block-building proofs taking ~1.3s) pushed
the next tick past the following boundary, so that interval's duty never
ran: a proposing node never attested on its own slot, which alone
prevents 1- and 2-node devnets from ever justifying.

Drive tick processing from store.time() (the last processed interval)
instead of the wall clock at handling time: each tick loops over every
interval boundary that has passed but was not processed yet, running
missed intervals in place (with a warning) so their duties still
execute. When more than a slot behind (e.g. a long stall), jump straight
to the wall clock, matching store::on_tick's existing fast-forward.

Closes #413
@MegaRedHand MegaRedHand force-pushed the fix-tick-scheduler-skip branch from b20c5e7 to 15750d7 Compare June 3, 2026 21:16
handle_tick scheduled the next tick from the timestamp captured at entry,
so a duty overrunning the 800ms interval (e.g. inline block-building
proofs ~1.3s) pushed the next tick past the following boundary and that
interval's duty never ran: a proposing node never attested on its own
slot, which alone prevents 1- and 2-node devnets from ever justifying.

Re-read the clock after on_tick and compare elapsed time against the
interval slack remaining at entry. If a boundary was crossed, tick again
immediately so the missed interval's duty still runs; otherwise schedule
for the next boundary from the post-tick timestamp. The comparison keys
off the boundary actually being crossed rather than a fixed 800ms
threshold, so it also handles scheduling jitter and exact-boundary entry.

Closes #413
@MegaRedHand MegaRedHand changed the title fix: run skipped tick intervals instead of silently dropping them fix: re-tick immediately when on_tick overruns its interval boundary Jun 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tick scheduler silently skips intervals when a duty overruns 800ms (proposer never attests on its own slot)

1 participant