Skip to content

[FEATURE]: Valkey SessionStore: bounded retry + circuit breaker #75

@araujof

Description

@araujof

Summary

Follow-up to #73 / #74 (Valkey-backed SessionStore).

v0 of apl-session-valkey fails closed on the first backend error — correct and safe, but with no resilience smoothing. This issue tracks adding a bounded retry + circuit breaker so transient blips don't turn into per-request denials, without amplifying load on a sick primary.

Deliberately deferred from #74 because it needs its own fault-injection test infra and a blast-radius tuning decision that shouldn't ride in on the feature PR.

Scope

  • Error classification (prerequisite). Stop flattening every failure to SessionStoreError::Backend(String) in store.rs; classify the redis::RedisError kind so retry fires only on transient errors (is_timeout / is_connection_dropped / is_io_error), never on deterministic ones (WRONGTYPE, auth, parse).
  • Bounded retry. Wrap acquire + command as a unit (re-acquire on a stale pooled connection), max_retries ≤ 1 by default, full jitter, bounded by a total per-request deadline so retries don't blow the latency ceiling. Re-introduce the max_retries config knob (removed in feat: Valkey-backed SessionStore (cross-node / cross-restart session labels) #74 to avoid shipping it dead).
  • Token-bucket retry budget. Cross-request budget so retries can't become a retry storm against a degraded primary. Ships with retry, not separately (retry without it is worse than no retry).
  • Circuit breaker. Consecutive-failure threshold → open → fail-closed fast (no dial) for a cooldown, with half-open probing. Build as a standalone clock-injected state machine (breaker.rs) so it's unit-testable without real time.

Behavioral decisions to settle (need sign-off, not just code)

  • Breaker blast radius: when open, all session-bearing traffic fails closed for the cooldown. A brief Valkey blip could trip it and deny the fleet for that window — potentially widening an outage vs. per-request timeouts that recover as soon as Valkey does. Need deliberate threshold/cooldown defaults.
  • Latency budget: per-command timeout bounds one attempt; retry multiplies it. The total-deadline budget must cap worst-case p99 for requests that fail closed anyway.

Testing requirements

  • Breaker + classifier: pure unit tests (injected clock, synthetic RedisErrors) — cheap, high-confidence.
  • End-to-end retry path (transient-then-succeeds): needs fault injection — a toxiproxy sidecar in the integration suite, or a mock ConnectionLike seam. This infra is the main reason this is a separate PR; do not ship the retry layer without an end-to-end test that actually exercises a transient failure.

Rough size

~200–250 LOC across store.rs + a new breaker.rs, a few config knobs, likely 1 new dep (rand for jitter or governor for the token bucket), plus the toxiproxy test wiring.

Cheap subset (optional partial)

Error-classification + the standalone clock-injected breaker are fully unit-testable with no new test infra and protect a sick primary — could land first if a full pass is too large. The retry loop is the piece that genuinely needs toxiproxy.

Refs: docs/plans/2026-06-17-001-feat-valkey-session-store-plan.md (Key Technical Decisions — committed retry/breaker defaults), docs/operations/valkey-session-store.md (§5 availability tradeoff, §6 alarms).

Metadata

Metadata

Assignees

No one assigned

    Type

    No fields configured for Task.

    Projects

    Status
    Backlog

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions