Reorg leaves gap in accepted chain as new blocks are not downloaded #77

@rnbrady

When a node announces a reorg via headers, the agent removes the stale block's node_block row but never rolls back SyncState for the reorged height, so the replacement blocks are never downloaded.

The result is a hole in the accepted chain: the orphan block remains in block with no node_block row, the canonical block at that height is entirely absent from the database, and the child block is saved and accepted normally. The hole persists for as long as the agent process stays up, and silently corrupts spent/unspent queries: outputs spent by transactions in the missing canonical block appear unspent.

A restart of the agent fixes this because it pulls all block heights from the DB and syncs from the first height not found.

Example

Environment: production instance, mainnet, single trusted node, BCHN 29.0.0

Two damaged heights observed:

Height 953896 (reorg on 2026-06-04)

  • In DB, orphan with accepted_by: []: 000000000000000001e0c1c70eff0bee1c958f4e9499166dceca3357dedcb045 (34 txs)
  • Canonical block, absent from DB: 0000000000000000004b32671f255b4cc6e0a1bead73616aded4e2d6d303391d (37 txs)

Height 951556 (reorg on 2026-05-18)

  • In DB, orphan with accepted_by: []: 00000000000000000127178b4dc9a30742aa3f505373bf24b2812fd877b6a6c7 (1 tx)
  • Canonical block, absent from DB: 000000000000000000e70f428e1b05869697093a26d951a8202d1258cd9e42fb (3 txs)

In both cases the neighboring heights (e.g. 953895 and 953897) are saved and accepted by the node, so the accepted chain has a one-block gap — a state that should be impossible.

For contrast, height 936520 (reorg on 2026-02-02) shows the repaired outcome: both the orphan (…ed411c59, unaccepted) and the canonical block (…15d4a0d0, accepted) are present. Notably the canonical 936520's node_block.accepted_at is NULL — the signature of a block saved more than 2 hours after its timestamp (saveBlock, src/agent.ts ~1735) — while its neighbor 936521 was accepted live 37 seconds after mining. So the hole formed at 936520 too and was only backfilled by a later agent restart.

Downstream impact: a CashToken category we track had 66 UTXOs reported unspent by Chaingraph that Fulcrum/Electrum report as spent. All 66 are spent by transactions whose only block_inclusions row points at the unaccepted orphan at 953896. One of those spenders (68ea268e6ebc61440b56b3bfdacd79cc161fbe5e2612cff497f29d56953a9344) is confirmed on the real chain in exactly the missing canonical block …d303391d. The transactions were re-mined in the replacement block but Chaingraph never indexed it.

Root cause

  1. handleStaleBlocks does not roll back SyncState (src/agent.ts ~1839–1854). It calls removeStaleBlocksForNode (deletes the orphan's node_block row, src/db.ts ~988–1001) but never calls syncState.blockReorganizationAtHeight(firstHeight) — which exists for exactly this purpose (src/components/sync-state.ts ~122–137). SyncState therefore still reports the reorged height as synced, so selectNextBlockToDownload never selects the canonical replacement for download. The child block syncs normally, and fullySyncedUpToHeight advances past the hole.

  2. Silent failure modes hide the problem. acceptBlocksViaHeaders (src/db.ts ~935–972) joins incoming hashes against block and silently inserts zero node_block rows for any hash with no block row — no error, no warning. And the in-memory blockDb set is populated once at startup (src/agent.ts ~430–432) and never updated at runtime, so catchUpViaHeaders (src/agent.ts ~1367–1430) reasons from stale knowledge of what is saved.

Sequence

  1. Node announces a one-block reorg at height N via headers; BlockTree.updateHeaders splices in the canonical hash and fires onStaleBlocks (src/components/block-tree.ts ~208–218).
  2. handleStaleBlocks deletes the orphan's node_block row. SyncState is untouched and still considers height N synced.
  3. No download of canonical N is ever scheduled (selectNextBlockToDownload skips "synced" heights). Block N+1 arrives, is saved, and is accepted.
  4. The database is now: orphan at N (unaccepted), no canonical N, accepted N+1. Permanent until restart.
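The four steps above can be reproduced with a toy model. This is only a sketch: ToySyncState, rollbackTo, nextToDownload and simulateOneBlockReorg are hypothetical names that stand in for the agent's real SyncState, blockReorganizationAtHeight and selectNextBlockToDownload, whose actual implementations are not shown in this issue.

```typescript
// Toy model only; NOT the agent's real SyncState. All names are hypothetical.
class ToySyncState {
  private synced = new Set<number>();

  markSynced(height: number): void {
    this.synced.add(height);
  }

  // Stand-in for blockReorganizationAtHeight: forget every height >= firstHeight.
  rollbackTo(firstHeight: number): void {
    for (const h of Array.from(this.synced)) {
      if (h >= firstHeight) this.synced.delete(h);
    }
  }

  // Stand-in for selectNextBlockToDownload: lowest unsynced height up to the header tip.
  nextToDownload(fromHeight: number, tip: number): number | undefined {
    for (let h = fromHeight; h <= tip; h++) {
      if (!this.synced.has(h)) return h;
    }
    return undefined;
  }
}

// Returns the heights that get downloaded after a one-block reorg at N = 100.
function simulateOneBlockReorg(rollBack: boolean): number[] {
  const N = 100;
  const state = new ToySyncState();
  state.markSynced(N - 1);
  state.markSynced(N); // the block that is about to become an orphan
  // Steps 1-2: reorg at N is announced; the node_block deletion is not modelled.
  if (rollBack) state.rollbackTo(N);
  // Step 3: headers now extend to N + 1; download until nothing is left.
  const downloaded: number[] = [];
  let next = state.nextToDownload(N - 1, N + 1);
  while (next !== undefined) {
    downloaded.push(next);
    state.markSynced(next);
    next = state.nextToDownload(N - 1, N + 1);
  }
  return downloaded;
}
```

In this model, simulateOneBlockReorg(false) returns [101] (the canonical block at 100 is never fetched), while simulateOneBlockReorg(true) returns [100, 101].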

The sequence above describes a one-block reorg (the case observed in production), but the mechanism generalizes:

  • Deeper reorgs: for a depth-d reorg replacing already-synced heights N…N+d−1, onStaleBlocks fires with the full stale chain, all d node_block rows are deleted, SyncState still reports all d heights as synced, and selectNextBlockToDownload schedules only the net-new heights above the old tip, leaving a d-block hole. Reorgs deeper than 8 blocks arrive via inv rather than headers, but that handler (src/agent.ts ~656–665) just calls requestHeaders, funneling into the same updateHeaders → onStaleBlocks path.
  • Flip-flop reorgs (chain reorgs away from a block and later back to it) produce a variant hole: the canonical block row survives from its first acceptance, but its node_block row was deleted on the reorg-away and nothing re-inserts it — catchUpViaHeaders never reaches that height because SyncState reports it synced. Result: block present but unaccepted.
  • The only case with no hole is a reorg at heights the agent had not yet synced (still catching up below the fork point) — SyncState is below the fork, so the canonical blocks download in normal course. This is why the bug specifically bites in steady-state operation at tip.
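The two damaged outcomes described in the bullets above (canonical block missing versus block present but unaccepted) can be told apart with a small diagnostic. The following is a hypothetical sketch, not Chaingraph code: classifyHeights and its three inputs are assumptions, standing for the node's canonical header chain, the hashes that have a block row, and the hashes that have a node_block row for that node.

```typescript
type HeightStatus = "ok" | "canonical-missing" | "present-unaccepted";

// Hypothetical diagnostic; inputs are assumed to be loaded by the caller.
function classifyHeights(
  canonical: Map<number, string>, // height -> canonical hash per the node's headers
  savedBlocks: Set<string>,       // hashes with a row in `block`
  acceptedBlocks: Set<string>,    // hashes with a `node_block` row for this node
): Map<number, HeightStatus> {
  const result = new Map<number, HeightStatus>();
  canonical.forEach((hash, height) => {
    if (!savedBlocks.has(hash)) {
      // Plain or deeper reorg: the replacement block was never downloaded.
      result.set(height, "canonical-missing");
    } else if (!acceptedBlocks.has(hash)) {
      // Flip-flop reorg: the block row survived but its acceptance was deleted.
      result.set(height, "present-unaccepted");
    } else {
      result.set(height, "ok");
    }
  });
  return result;
}
```

For a one-block reorg at height 10 where only the orphan was saved, height 10 is classified as canonical-missing; for a flip-flop back to a previously saved block, it is classified as present-unaccepted.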

Why a restart repairs it (workaround — verified)

On restart, registerTrustedNodeWithDb (src/db.ts ~460–504) rebuilds syncedHeaderHashChain from block joined to node_block; the damaged height yields no row, producing a null in the chain (blockArrayToHashChain, src/db.ts ~62–80). restoreChainForNode then sets fullySyncedUpToHeight = N−1, and after header sync fillBlockBuffer/selectNextBlockToDownload schedules the canonical block at N for download; it is saved with accepted_at = NULL (timestamp older than 2 hours, src/agent.ts ~1735). We verified this end-to-end: restarting the damaged instance repaired both holes within the session, and the backfilled node_block rows carry the NULL signature. Note that repairIncompleteBlocks does not catch this case: it only repairs blocks that exist with mismatched transaction counts.
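The effect of the null entry can be illustrated with a minimal sketch. It assumes the restored chain is an array of hashes indexed from some start height, with null marking a height that has no accepted block; lastContiguousHeight is a hypothetical helper and is not claimed to match how restoreChainForNode is actually written.

```typescript
// Hypothetical stand-in for the restore step; the array shape is an assumption.
function lastContiguousHeight(
  startHeight: number,
  hashChain: (string | null)[],
): number {
  // Index of the first height with no accepted block, or -1 if the chain is complete.
  const firstGap = hashChain.findIndex((hash) => hash === null);
  const contiguous = firstGap === -1 ? hashChain.length : firstGap;
  // Last height below the first gap; everything from the gap upward is re-examined.
  return startHeight + contiguous - 1;
}
```

With a chain starting at 953894 and a null at the third position (height 953896), the function returns 953895, that is N−1, so the download of N is scheduled again.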

Suggested fix

In handleStaleBlocks, roll back sync state before/alongside removing stale blocks, and re-trigger the buffer fill:

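handleStaleBlocks(staleChain: string[], firstHeight: number, nodeName: string) {
  // Roll back sync state first so the reorged heights become eligible for download again.
  this.nodes[nodeName]?.syncState?.blockReorganizationAtHeight(firstHeight);
  // Then delete the stale node_block rows for this node.
  removeStaleBlocksForNode(this.nodes[nodeName]!.internalId!, staleChain)
    .then(() => {
      this.logger.info(staleChain, `${nodeName}: re-organization at height ${firstHeight}`);
      // Re-trigger the buffer fill so the canonical replacements are scheduled.
      this.scheduleBlockBufferFill();
    })
    .catch((err) => this.logger.error(err));
}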

Hardening, secondarily:

  • Add saved hashes to this.blockDb when saveBlock succeeds, so catchUpViaHeaders reasons from current state.
  • Make acceptBlocksViaHeaders log a warning (or fail loudly) when fewer node_block rows are inserted than hashes supplied — today it silently drops hashes that have no block row.
  • Optionally extend the periodic audit to detect accepted-chain holes (height H accepted at H−1 and H+1 but not H), which would self-heal instances already damaged.
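As a rough illustration of the last two hardening items, the checks could look something like the sketch below. Both function names are hypothetical, and the inputs (a list of accepted heights for one node, and the hashes supplied versus actually inserted) are assumed to be gathered by the caller; how the real audit or acceptBlocksViaHeaders would obtain them is not covered here. The hole finder reports every missing height between accepted neighbors, so it also covers holes deeper than one block.

```typescript
// Hypothetical audit helper: heights missing between accepted neighbors.
function findAcceptedChainHoles(acceptedHeights: number[]): number[] {
  const sorted = Array.from(new Set(acceptedHeights)).sort((a, b) => a - b);
  const holes: number[] = [];
  let previous: number | undefined;
  for (const height of sorted) {
    if (previous !== undefined) {
      // Every height strictly between two accepted heights is a hole.
      for (let h = previous + 1; h < height; h++) holes.push(h);
    }
    previous = height;
  }
  return holes;
}

// Hypothetical check: hashes that were supplied for acceptance but produced no row.
function findDroppedHashes(supplied: string[], inserted: string[]): string[] {
  const insertedSet = new Set(inserted);
  return supplied.filter((hash) => !insertedSet.has(hash));
}
```

A caller could log a warning whenever findDroppedHashes returns a non-empty array, and roll sync state back to the lowest height returned by findAcceptedChainHoles.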

Tests and line numbers are based on commit 535e41b.

Co-authored with claude-fable-5.
