Skip to content

feat: Valkey-backed SessionStore (cross-node / cross-restart session labels)#74

Merged
araujof merged 15 commits into
devfrom
feat/valkey_kv_store
Jun 18, 2026
Merged

feat: Valkey-backed SessionStore (cross-node / cross-restart session labels)#74
araujof merged 15 commits into
devfrom
feat/valkey_kv_store

Conversation

@araujof

@araujof araujof commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a config-selectable, Valkey-backed SessionStore alongside the in-process MemorySessionStore, so session security labels (extensions.security.labels, the monotonic taint set that drives information-flow authorization) persist across process restarts and are shared across gateway nodes. The backend is fail-closed: any store error denies the request rather than letting it proceed with missing taint.

Closes #73.
Plan: docs/plans/2026-06-17-001-feat-valkey-session-store-plan.md
Requirements: docs/brainstorms/valkey-session-store-requirements.md.

Changes

Core (apl-cpex)

  • SessionStore trait methods now return Result<_, SessionStoreError>; MemorySessionStore adapts (infallible → Ok). (R4, R15)
  • Fail-closed wiring: a load error fails the request closed before the policy decision; an append error flips an Allow to Deny (continue_processing is computed after persist_session). Merge precedence: Deny + append-error keeps the original policy violation and only alarms. (R5, R18 — route_handler.rs, cmf_invoker.rs)
  • New SessionStoreFactory + global.apl.session_store config seam: the visitor builds the store during visit_global and swaps its own RwLock-held store before route handlers capture it — no per-request indirection. Default stays MemorySessionStore when no block is present. (R2, R3 — visitor.rs, register.rs)

New crate (apl-session-valkey)

  • ValkeySessionStore over redis-rs 1.x + deadpool-redis on rustls (no openssl). Labels are a Redis SET keyed taint:v1:<sha256(session_id)>; append is a single atomic SADD(+EXPIRE) pipeline (R16); key-miss → Ok(empty), backend/timeout/WRONGTYPE → Err (R5/R15); sliding-TTL refresh on load is fail-open (R7).
  • Config enforces TLS for non-localhost endpoints, percent-encodes credentials, rejects the tls:true+redis:// contradiction, redacts credentials from errors, and warns when TTL < session lifetime (R8/R10/R17).
  • Wired into cpex-ffi behind the optional valkey feature; excluded from default-members. The default build links no Valkey object code. (R13)

Docs/ops: operator runbook (docs/operations/valkey-session-store.md) and deploy/valkey-compose.yml.

Testing

  • apl-cpex: fail-closed unit/integration tests — load-error→deny (AE1), append-error→deny + alarm (AE6), Deny+append-error preserves policy violation (R18), sessionless traffic unaffected, config-driven selection + unknown-kind rejection (AE3/AE5). Full suite green (84 tests).
  • apl-session-valkey: 10 config/security unit tests (TLS rules, credential encoding/redaction). Integration tests against a real valkey/valkey container (#[ignore] by default; CI-gated via REQUIRE_VALKEY_TESTS=1, or VALKEY_TEST_URL) — cross-node union (R16/AE4), unknown→empty (R15), WRONGTYPE→fail-closed and unreachable→fail-closed (R5), TTL set+refresh (R7/AE2). All 5 pass against live Valkey 8.
  • cargo clippy clean on both crates; default and --features valkey FFI builds verified.

Post-Deploy Monitoring & Validation

This backend is opt-in (valkey feature + global.apl.session_store config); with neither, behavior is unchanged.

  • Healthy signals: session labels survive a gateway restart and are visible cross-node; evicted_keys == 0 on the Valkey instance.
  • Failure signals / alarms (structured tracing): alarm="session_store_failure" (load/append failed → request denied), alarm="session_store_ttl_refresh_failed" (risk of silent key expiry), alarm="session_store_ttl_unsound" (TTL < session lifetime). A rise in fail-closed denials correlates with Valkey health.
  • Rollback trigger: sustained session_store_failure denials → disable the session_store config block (reverts to in-process memory store) while Valkey is investigated.
  • Operator preconditions (runbook): maxmemory-policy noeviction, TLS/mTLS, least-privilege ACL. See docs/operations/valkey-session-store.md.
  • Window/owner: watch denial-rate + evicted_keys for the first deploy cycle after enabling Valkey in any environment.

Follow-ups (deferred, not blocking)

  • Bounded retry + circuit-breaker on the hot path (v0 fails closed on first error — safe, just no resilience layer; the max_retries knob was intentionally omitted rather than shipped dead).
  • Sentinel/Cluster support, replica reads, and application-level HMAC of stored labels remain out of scope (see plan Scope Boundaries).

Known Residuals

None — the three code-review findings (dead connect_timeout_ms, dead max_retries, implicit TLS feature) were fixed in this branch.

@araujof araujof requested review from jonpspri and terylt as code owners June 18, 2026 04:46
@araujof araujof added enhancement New feature or request framework Rust labels Jun 18, 2026
@araujof araujof changed the base branch from main to dev June 18, 2026 12:29
@araujof araujof added this to CPEX Jun 18, 2026
@github-project-automation github-project-automation Bot moved this to Backlog in CPEX Jun 18, 2026
@araujof araujof moved this from Backlog to In progress in CPEX Jun 18, 2026
@terylt

terylt commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review: Valkey-backed SessionStore (feat/valkey_kv_store)

Thanks Fred — this is a strong PR. The fail-closed semantics are the right
centerpiece and they're implemented with real care, the data-model fit
(SADD for atomic monotonic union) is exactly right for taint, and the
feature-gating keeps it cleanly opt-in. Verified locally: builds with
--features cpex-ffi/valkey, clippy clean on both crates, and the full suite
is green (apl-cpex 39+ integration/unit tests, valkey unit tests; container
tests correctly #[ignore]d).

Verdict: approve, with one substantive doc/config ask and two minor code
nits — none blocking.


What's strong

  • Fail-closed is correct end-to-end. Load failure denies before any
    decision (session.load_failed); append failure after an Allow flips to
    Deny (session.persist_failed); append failure on an already-Deny preserves
    the original policy violation. All three precedence cases plus the
    sessionless-unaffected case have dedicated tests.
  • Atomic server-side union (SADD) instead of client-side
    read-modify-write — no lost-update race under concurrent cross-node appends,
    with a container test proving it.
  • Security hygiene: session IDs SHA-256'd out of the keyspace; credentials
    percent-encoded via the url crate; endpoints redacted in every error/log;
    TLS mandatory off-localhost; the tls:true + plaintext redis://
    contradiction explicitly rejected with a regression test.
  • No dead config surface — retries/circuit-breaker deliberately absent
    rather than present-but-unimplemented, with a comment saying why.
  • Runbook is above averagenoeviction rationale, least-privilege ACL,
    the TTL soundness/downgrade-by-waiting rule, blast-radius/availability
    tradeoff, and the "never read-split to replicas" constraint are all there.

Substantive ask: document and configure persistence/durability

This is the one real gap. We've promoted Valkey from a cache to the system of
record for a security control
, which inverts its default durability
assumptions — and the runbook doesn't address on-disk persistence at all.

The runbook already closes two ways a label can silently vanish:

  • eviction under memory pressure → §2 (noeviction)
  • sliding-TTL expiry → §4

But it's silent on the third, which is the one most specific to an in-memory
store:

A SADD is acknowledged to the gateway, the node crashes before the write is
fsync'd to disk, and the label is gone. On restart (or replica failover)
the next read returns a normal Ok(empty)not an error — so fail-closed
never trips. The request proceeds with less taint than actually
accumulated
: a silent downgrade.

This is invisible to all the machinery we've built, because nothing errors.
It's governed entirely by Valkey's persistence config, which we neither
document nor recommend a baseline for:

  • appendonly no (RDB-only) → lose everything since the last snapshot
    (minutes).
  • appendfsync everysec → ~1s loss window on crash.
  • appendfsync always → effectively no loss, at a per-write latency cost.

It also interacts with §5: the "fail over to a healthy primary" guidance
inherits Valkey's asynchronous replication, so a failover can promote a
replica that's missing the most recent un-replicated appends — the same
downgrade by a different path.

Requested changes:

  1. A new runbook section — "Persistence & durability" — stating that the
    label keyspace is a security system-of-record, the three fsync options with
    their crash-loss windows, and a recommended baseline (suggest: AOF on,
    appendfsync everysec as the floor, always where the threat model can't
    tolerate the ~1s window). Note the failover interaction with §5.
  2. A matching requirement (next R-number) in
    valkey-session-store-requirements.md so this is a recorded decision, not
    an implicit default — peer to the R8 TTL-soundness rule.
  3. (Optional, follow-up) a best-effort startup CONFIG GET appendonly /
    appendfsync self-check that warns when persistence looks cache-shaped,
    mirroring the existing noeviction self-check.

No code change to the store itself is required — this is operator contract +
docs, the same category as noeviction.


Minor code nits (non-blocking)

1. username without password is silently dropped.
In config.rs::connection_url (bare host:port branch), the
if let Some(password) guard means a config with username: set but no
password: never applies the username at all:

if let Some(password) = &self.password {
    url.set_username(self.username.as_deref().unwrap_or(""))?;
    url.set_password(Some(password))?;
}

A passwordless ACL user is rare but valid; today it connects as the default
user with no signal. Suggest applying the username whenever either credential
is present (or erroring if username is set without password).

2. Embedded-URL endpoint + separate username/password are silently
ignored.
When endpoint is a full redis:///rediss:// URL,
connection_url returns early and never folds in the separate username/
password fields. It's documented as "trust the operator's embedded
credentials," but setting both is a plausible misconfig — a load-time warning
(or error) when both an in-URL userinfo and separate creds are supplied would
fail louder.


Already well-handled — just confirming I looked

  • Consistency / read-staleness — §5's single-endpoint, primary-only,
    never-read-split design gives linearizable reads (single Valkey primary
    serializes commands), so an append on one gateway node is immediately visible
    to another. Good, and the "no read-split" warning correctly fences off the
    one way to break it.
  • TTL refresh fail-open — deliberate (the read succeeded) and alarmed; §4
    covers the persistent-failure downgrade risk.
  • visit_globalvisit_route ordering — confirmed in manager.rs that
    global is walked before routes, so the config-selected store is the one
    handlers capture. The RwLock<Arc<dyn SessionStore>> swap is sound.

@terylt terylt left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I made some small comments above that could be incorporated.

araujof added 12 commits June 18, 2026 16:20
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
…rrors

U1+U2: SessionStore trait methods now return Result with a crate-local
SessionStoreError. MemorySessionStore adapts (infallible -> Ok). The CMF
invoker (for_request, persist_session) and route_handler propagate:
- a load error fails the request closed before any decision (R5)
- an append error flips the outcome to Deny with merge precedence
  (Allow+Err -> session.persist_failed; Deny+Err -> keep policy violation),
  with a distinguished alarm (R18)

Sessionless traffic never touches the store. Adds fail-closed tests
(AE1, AE6) plus a sessionless carve-out.

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
U3: Add SessionStoreFactory (mirrors PdpFactory) and a global.apl.session_store
config block. The visitor builds the store during visit_global and swaps its
own (now RwLock-held) session_store field before visit_route clones it into
handlers — no per-request indirection. Default MemorySessionStore stays active
when no block is present (R3). AplOptions gains session_store_factories; all
struct-literal sites updated. Adds config-selection and unknown-kind tests
(AE3, AE5).

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
…I wiring

U4/U5/U6: New apl-session-valkey crate (redis-rs 1.x + deadpool-redis over
rustls, no openssl), excluded from default-members. ValkeySessionStore stores
labels as a Redis SET keyed by taint:v1:<sha256(session_id)>; append is a
single atomic SADD(+EXPIRE) pipeline (R16); load maps key-miss->Ok(empty) and
backend/timeout->Err (R5/R15); sliding-TTL refresh on load is fail-open (R7).
Config parsing enforces TLS for non-localhost endpoints (R10), commits safe
timeout defaults, and warns when TTL < session lifetime (R17). Wired into
cpex-ffi behind the optional 'valkey' feature; default build links no Valkey
object code (R13).

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
U7: Integration tests (testcontainers valkey, or VALKEY_TEST_URL escape hatch;
#[ignore]d by default, CI-gated via REQUIRE_VALKEY_TESTS) covering cross-node
union (R16/AE4), unknown->empty (R15), WRONGTYPE->fail-closed and unreachable->
fail-closed (R5), and TTL set+refresh (R7/AE2). deploy/valkey-compose.yml runs
a noeviction-configured Valkey for local dev.

Also hardens config.rs per security review: connection_url() percent-encodes
credentials via the url crate and always reflects tls_enabled() in the scheme;
rejects the tls:true + plaintext redis:// contradiction; redacts credentials
from all error/log output. Adds regression tests.

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
U8: documents the operator-owned controls the backend depends on but cannot
enforce — noeviction (R9) + monitoring, TLS/mTLS + least-privilege ACL (R10),
the TTL soundness rule and refresh-failure alarm (R8/R17), the single-endpoint
primary-only topology and accepted fail-closed availability tradeoff (with the
sessionless carve-out), the v0 no-live-reload limitation, the alarm catalog,
and local-dev setup.

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Revert formatting-only changes to apl-pdp-cel / apl-pdp-cedar-direct / apl-cpex
files that a crate-wide `cargo fmt` reformatted (pre-existing drift from #68)
and `git add -A` swept into earlier commits. Keeps this PR focused on the
Valkey session store; those files now match main exactly.

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
… TLS feature

Addresses code-review findings:
- connect_timeout_ms now bounds connection acquisition (distinct from the
  per-command timeout) instead of being parsed-but-ignored.
- Remove the max_retries config knob: no retry/circuit-breaker exists in v0
  (deferred follow-up), so the store fails closed on first error — config no
  longer advertises behavior the code lacks.
- Forward tokio-rustls-comp through deadpool-redis explicitly so the rediss://
  TLS path doesn't depend on incidental feature unification.
- Add Deny+append-failure regression test: the original policy violation is
  preserved (not overwritten by session.persist_failed) per R18 merge rules.

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
ASCII '->' inside sequence-diagram message text was tokenized as an arrow.
Replace with 'yields' / Unicode → (matching the requirements-doc diagrams).

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Mermaid sequence message text terminates at ';', so 'yields ...; swap ...'
cut the message early and the parser choked on the next line. Replace the
semicolon with ', then' and drop the Arc<dyn ...> angle brackets. (The earlier
'->' fix was needed too but not the actual break.)

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
@araujof araujof force-pushed the feat/valkey_kv_store branch from 2756427 to 4b1afbd Compare June 18, 2026 20:21
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Addresses the PR #74 review (terylt).

Substantive ask — persistence/durability as a security contract:
- Runbook gains a "Persistence and durability" section (§5): the label
  keyspace is a system-of-record, not a cache. A SADD acked then lost to a
  crash before fsync returns Ok(empty) on the next read (not an error), so
  fail-closed never trips — a silent downgrade invisible to all alarms.
  Documents the three fsync options, a recommended AOF baseline, and the
  async-replication failover interaction with the topology section.
- Adds R19 to the requirements doc, peer to the R9 noeviction contract.

Doc fix found while verifying:
- The runbook claimed a CONFIG GET maxmemory-policy self-check that does not
  exist in code (the pool is lazy and never dials at config-load). Corrected
  §2 to present noeviction purely as operator contract; the actual self-check
  (noeviction + persistence) is deferred to #76.

Code nits — hard-error ambiguous credential config at load, consistent with
the existing tls:true + redis:// rejection:
- `username` without `password` is rejected (previously dropped silently,
  connecting as the default user).
- A full URL endpoint combined with separate `username`/`password` fields is
  rejected (those fields are ignored for URL endpoints).
- Bare host:port credential application now applies the username when either
  credential is present; a lone password (default-user AUTH) stays valid.

Adds regression tests for all three. apl-session-valkey unit tests green,
clippy clean, cpex-ffi --features valkey builds.

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
@araujof

araujof commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @terylt — all addressed in 6e1bfab. Summary:

Persistence/durability

  • New runbook section §5 "Persistence and durability": the label keyspace as a security system-of-record, the crash-before-fsync silent-downgrade (acked SADD lost → next read returns Ok(empty), not an error → fail-closed never trips → invisible to alarms), the three fsync options with crash-loss windows, a recommended AOF baseline (everysec floor, always where the ~1s window is intolerable), and the async-replication failover interaction with the topology section.
  • New R19 in the requirements doc, peer to the R9 noeviction contract.

Code nits (consistent with the existing tls:true + redis:// rejection):

  1. username without password is now rejected rather than silently dropped.
  2. A full URL endpoint combined with separate username/password is now rejected (those fields are ignored for URL endpoints). Slightly broader than the in-URL-userinfo-and-separate-creds case you flagged — separate creds are always ignored for URL endpoints, so rejecting the whole combination forces credentials into one place. Bare host:port now also applies the username when either credential is present; a lone password (default-user AUTH) stays valid. Regression tests added for all three.

Additional finding The runbook §2 claimed the backend issues a CONFIG GET maxmemory-policy self-check. It doesn't — there's no such check in store.rs/connection.rs, and the pool is lazy (never dials at config-load). So item 3 ("mirror the existing noeviction self-check") rested on a check that wasn't there. I corrected §2 to present noeviction purely as operator contract, and filed #76 to actually build both self-checks (noeviction + persistence) — it's a deliberate design change since it requires dialing at startup, and must stay best-effort so it never becomes a new availability dependency.

Verified: apl-session-valkey unit tests green (3 new), clippy clean, cargo build -p cpex-ffi --features valkey builds.

`apl_subblock` lifts APL terms written directly on a section (no `apl:`
wrapper) into a synthetic block, but only for the keys in FLAT_APL_KEYS.
`pdp` was listed; `session_store` was not — so a flat `global.session_store`
block was silently dropped, while `global.pdp` and the `apl:`-wrapped
`global.apl.session_store` both worked. Asymmetric and a silent-config
footgun.

- Add `session_store` to FLAT_APL_KEYS so the flat form is honored,
  symmetric with `pdp` and with the wrapped form.
- Introduce GLOBAL_ONLY_NON_DSL_KEYS (`pdp`, `session_store`) as the single
  source of truth for the keys that are CPEX wiring (acted on only by
  visit_global) and stripped before policy compilation. strip_non_dsl_keys
  now iterates it.
- Generalize warn_if_pdp_at_nonglobal_scope ->
  warn_if_global_only_key_at_nonglobal_scope so a `session_store:` written
  at route/default/policy-bundle scope (where it is inert) is flagged, the
  same way `pdp:` already is — closing the new silent-no-op the flat key
  would otherwise introduce.

Adds tests: flat session_store is collected by apl_subblock; the renamed
warning helper is a safe no-op for both keys. apl-cpex tests green, clippy
clean, cpex-ffi --features valkey builds.

Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
@araujof araujof merged commit 67de0ae into dev Jun 18, 2026
@github-project-automation github-project-automation Bot moved this from In progress to Done in CPEX Jun 18, 2026
@araujof araujof deleted the feat/valkey_kv_store branch June 18, 2026 22:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request framework Rust

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Valkey-backed SessionStore (cross-node / cross-restart session labels)

2 participants