Commit e35cdb6

refactor leader election around DB-issued terms
The old elector treated leadership as a lease held by one client ID and renewed over time. That was simple, but it left too much implicit. One leadership term was not clearly separated from the next, so reelection and resignation were not scoped to a specific term. That made same-client reacquisition harder to reason about, made it easy for stale work or cleanup to target the wrong lease, and left the elector carrying more responsibility for edge cases than it should have.

This change makes the database issue explicit leadership terms using the columns we already have. `leader_id` remains the stable client ID, while `(leader_id, elected_at)` identifies one specific term. Elect, reelect, and resign now all operate on that exact term and return the leader row from the database.

The elector keeps a bounded local trust window for its last successful confirmation, but that window is anchored to the attempt that produced it, not to when the response happened to arrive. That keeps slow successful reelections from stretching leadership past its real lease budget while still avoiding direct app-vs-database clock comparisons in the state machine.

The notification and test story is also clearer after the rewrite. Slow subscribers now receive each leadership transition in order without blocking the elector, resignation wakeups are coalesced safely, and the poll-only coverage uses isolated fixtures so it can exercise real handoff behavior without shared-schema flakiness.

The shared driver suite now covers term-scoped elect, reelect, and resign behavior across PostgreSQL and both SQLite backends, including same-client term replacement and stale-term rejection. The elector tests focus on the observable behaviors that matter: gaining leadership, handing it off, responding to resign requests, and stepping down cleanly when its trust window expires.

This also rolls up the branch's earlier flake investigation and keeps the original CI reference for the shared-schema failures that led to the redesign: https://github.com/riverqueue/river/actions/runs/24406465152
1 parent a678c97 commit e35cdb6

14 files changed

Lines changed: 1462 additions & 730 deletions

CHANGELOG.md

Lines changed: 4 additions & 0 deletions
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Fixed
+
+- Fixed leader election to track explicit database-issued leadership terms, reducing handoff flakiness and same-client reacquisition edge cases while making reelection and resign target the current leadership lease instead of a stale one. [PR #1213](https://github.com/riverqueue/river/pull/1213).
+
 ## [0.34.0] - 2026-04-08
 
 ### Added

internal/leadership/doc.go

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
+// Package leadership implements leader election for River clients sharing a
+// database schema.
+//
+// The database records at most one current leadership term at a time. The
+// elected client runs distributed maintenance work such as queue management,
+// job scheduling, and reindexing that should not be duplicated across clients.
+//
+// # Overview
+//
+// Leadership is modeled as a database-backed lease with an explicit term
+// identity.
+//
+// A term is identified by:
+//   - `leader_id`: the stable client identity
+//   - `elected_at`: the database-issued timestamp for that specific term
+//
+// The database is authoritative for:
+//   - which client currently holds the leadership row
+//   - whether a term can be renewed
+//   - whether a term has already been replaced
+//
+// The process uses local time only to bound how long it trusts its last
+// successful elect or reelect result. If it cannot renew a term in time, it
+// steps down conservatively instead of continuing to act as leader on stale
+// information.
+//
+// # State Model
+//
+// At a high level, an elector alternates between follower and leader states:
+//
+//                     Start
+//                       │
+//                       ▼
+//    ┌─────────────────────────────────────┐
+//    │              Follower               │
+//    │                                     │
+//    │ Retries election on timer or wakeup │
+//    └──────────────────┬──────────────────┘
+//                       │ won election
+//                       ▼
+//    ┌─────────────────────────────────────┐
+//    │               Leader                │
+//    │                                     │
+//    │ Renews before trust window expires  │
+//    └──────────────────┬──────────────────┘
+//                       │ replaced / expired / resign requested /
+//                       │ renewal failed for too long / shutdown
+//                       ▼
+//                   Follower
+//
+// Followers attempt election periodically and can wake early when they learn
+// that the previous leader resigned. Leaders renew their current term
+// periodically. If renewal fails, the term is replaced, or the local trust
+// window expires, the process stops acting as leader and returns to follower
+// behavior.
+//
+// # Trust Window
+//
+// After each successful election or renewal, the elector computes a local
+// trust deadline:
+//
+//    trustedUntil = attemptStarted + TTL - safetyMargin
+//
+// This trust window has two important properties:
+//   - it is anchored to when the elect or reelect attempt started, so a slow
+//     successful database round trip cannot stretch leadership longer than the
+//     attempt budget allows
+//   - it ends before the database lease should expire, giving the process time
+//     to step down before it risks acting on a stale term
+//
+// The local trust window is a conservative stop condition, not an alternative
+// source of truth. A client may step down while the database row is still
+// present, but it should not continue acting as leader after it no longer
+// trusts its last successful renewal.
+//
+// # Term-Scoped Operations
+//
+// Renewing and resigning are scoped to the exact term identified by
+// `(leader_id, elected_at)`.
+//
+// That means:
+//   - an old term cannot accidentally renew a newer term for the same client
+//   - a delayed resign from an old term cannot delete a newer term for the
+//     same client
+//   - when the database says a term is gone, the elector can step down without
+//     ambiguity about which term it held
+//
+// # Notifications and Subscribers
+//
+// When a notifier is available, the elector listens for leadership-related
+// events so followers can wake promptly and leaders can honor explicit
+// resignation requests.
+//
+// Notification delivery is intentionally non-blocking:
+//   - wakeups may coalesce, because multiple rapid resignations only need to
+//     prompt another election attempt
+//   - polling remains the fallback when notifications are unavailable or missed
+//
+// Consumers inside the process can subscribe to leadership transitions. Those
+// subscriptions preserve ordered `true`/`false` transitions so downstream
+// maintenance components can reliably start and stop work, while still keeping
+// slow subscribers from blocking the elector itself.
+//
+// # Failure Handling
+//
+// The system is intentionally conservative under failures:
+//   - if renewal errors persist until the trust window is exhausted, the leader
+//     steps down
+//   - if the database reports that the current term no longer exists, the
+//     leader steps down immediately
+//   - if resignation fails during shutdown or after a local timeout, the
+//     database lease expiry remains the safety net that eventually allows a new
+//     election
+//
+// This design keeps leadership decisions centered on the database while using
+// local time only to stop trusting stale state sooner rather than later.
+package leadership

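
The coalescing, non-blocking wakeups mentioned under "Notifications and Subscribers" are commonly built in Go from a buffered channel of capacity one and a `select` with a `default` branch. The sketch below shows that general idiom; the `wakeup` type and its methods are hypothetical names, and nothing here is taken from the elector's actual implementation.

```go
package main

import "fmt"

// wakeup is a hypothetical coalescing signal. The channel holds at most one
// pending wakeup, so a burst of signals collapses into a single one.
type wakeup chan struct{}

func newWakeup() wakeup { return make(wakeup, 1) }

// Signal never blocks. It reports true if a wakeup was queued and false if
// one was already pending and this signal was coalesced into it.
func (w wakeup) Signal() bool {
	select {
	case w <- struct{}{}:
		return true
	default:
		return false
	}
}

func main() {
	w := newWakeup()

	queued := 0
	for i := 0; i < 3; i++ { // three rapid resignation notices
		if w.Signal() {
			queued++
		}
	}
	fmt.Println(queued) // 1: only one wakeup is pending

	<-w                 // the follower consumes the wakeup and retries election
	fmt.Println(len(w)) // 0: nothing left pending
}
```

Dropping the extra signals is safe in this setting because, as the doc comment notes, several rapid resignations only need to prompt one more election attempt.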