feat: Valkey-backed SessionStore (cross-node / cross-restart session labels)#74
Conversation
Review: Valkey-backed
|
terylt
left a comment
There was a problem hiding this comment.
LGTM. I made some small comments above that could be incorporated.
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
…rrors U1+U2: SessionStore trait methods now return Result with a crate-local SessionStoreError. MemorySessionStore adapts (infallible -> Ok). The CMF invoker (for_request, persist_session) and route_handler propagate: - a load error fails the request closed before any decision (R5) - an append error flips the outcome to Deny with merge precedence (Allow+Err -> session.persist_failed; Deny+Err -> keep policy violation), with a distinguished alarm (R18) Sessionless traffic never touches the store. Adds fail-closed tests (AE1, AE6) plus a sessionless carve-out. Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
U3: Add SessionStoreFactory (mirrors PdpFactory) and a global.apl.session_store config block. The visitor builds the store during visit_global and swaps its own (now RwLock-held) session_store field before visit_route clones it into handlers — no per-request indirection. Default MemorySessionStore stays active when no block is present (R3). AplOptions gains session_store_factories; all struct-literal sites updated. Adds config-selection and unknown-kind tests (AE3, AE5). Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
…I wiring U4/U5/U6: New apl-session-valkey crate (redis-rs 1.x + deadpool-redis over rustls, no openssl), excluded from default-members. ValkeySessionStore stores labels as a Redis SET keyed by taint:v1:<sha256(session_id)>; append is a single atomic SADD(+EXPIRE) pipeline (R16); load maps key-miss->Ok(empty) and backend/timeout->Err (R5/R15); sliding-TTL refresh on load is fail-open (R7). Config parsing enforces TLS for non-localhost endpoints (R10), commits safe timeout defaults, and warns when TTL < session lifetime (R17). Wired into cpex-ffi behind the optional 'valkey' feature; default build links no Valkey object code (R13). Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
U7: Integration tests (testcontainers valkey, or VALKEY_TEST_URL escape hatch; #[ignore]d by default, CI-gated via REQUIRE_VALKEY_TESTS) covering cross-node union (R16/AE4), unknown->empty (R15), WRONGTYPE->fail-closed and unreachable-> fail-closed (R5), and TTL set+refresh (R7/AE2). deploy/valkey-compose.yml runs a noeviction-configured Valkey for local dev. Also hardens config.rs per security review: connection_url() percent-encodes credentials via the url crate and always reflects tls_enabled() in the scheme; rejects the tls:true + plaintext redis:// contradiction; redacts credentials from all error/log output. Adds regression tests. Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
U8: documents the operator-owned controls the backend depends on but cannot enforce — noeviction (R9) + monitoring, TLS/mTLS + least-privilege ACL (R10), the TTL soundness rule and refresh-failure alarm (R8/R17), the single-endpoint primary-only topology and accepted fail-closed availability tradeoff (with the sessionless carve-out), the v0 no-live-reload limitation, the alarm catalog, and local-dev setup. Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Revert formatting-only changes to apl-pdp-cel / apl-pdp-cedar-direct / apl-cpex files that a crate-wide `cargo fmt` reformatted (pre-existing drift from #68) and `git add -A` swept into earlier commits. Keeps this PR focused on the Valkey session store; those files now match main exactly. Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
… TLS feature Addresses code-review findings: - connect_timeout_ms now bounds connection acquisition (distinct from the per-command timeout) instead of being parsed-but-ignored. - Remove the max_retries config knob: no retry/circuit-breaker exists in v0 (deferred follow-up), so the store fails closed on first error — config no longer advertises behavior the code lacks. - Forward tokio-rustls-comp through deadpool-redis explicitly so the rediss:// TLS path doesn't depend on incidental feature unification. - Add Deny+append-failure regression test: the original policy violation is preserved (not overwritten by session.persist_failed) per R18 merge rules. Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
ASCII '->' inside sequence-diagram message text was tokenized as an arrow. Replace with 'yields' / Unicode → (matching the requirements-doc diagrams). Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Mermaid sequence message text terminates at ';', so 'yields ...; swap ...' cut the message early and the parser choked on the next line. Replace the semicolon with ', then' and drop the Arc<dyn ...> angle brackets. (The earlier '->' fix was needed too but not the actual break.) Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
2756427 to
4b1afbd
Compare
Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
4b1afbd to
e0d129b
Compare
Addresses the PR #74 review (terylt). Substantive ask — persistence/durability as a security contract: - Runbook gains a "Persistence and durability" section (§5): the label keyspace is a system-of-record, not a cache. A SADD acked then lost to a crash before fsync returns Ok(empty) on the next read (not an error), so fail-closed never trips — a silent downgrade invisible to all alarms. Documents the three fsync options, a recommended AOF baseline, and the async-replication failover interaction with the topology section. - Adds R19 to the requirements doc, peer to the R9 noeviction contract. Doc fix found while verifying: - The runbook claimed a CONFIG GET maxmemory-policy self-check that does not exist in code (the pool is lazy and never dials at config-load). Corrected §2 to present noeviction purely as operator contract; the actual self-check (noeviction + persistence) is deferred to #76. Code nits — hard-error ambiguous credential config at load, consistent with the existing tls:true + redis:// rejection: - `username` without `password` is rejected (previously dropped silently, connecting as the default user). - A full URL endpoint combined with separate `username`/`password` fields is rejected (those fields are ignored for URL endpoints). - Bare host:port credential application now applies the username when either credential is present; a lone password (default-user AUTH) stays valid. Adds regression tests for all three. apl-session-valkey unit tests green, clippy clean, cpex-ffi --features valkey builds. Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
|
Thanks @terylt — all addressed in 6e1bfab. Summary: Persistence/durability
Code nits (consistent with the existing
Additional finding The runbook §2 claimed the backend issues a Verified: |
`apl_subblock` lifts APL terms written directly on a section (no `apl:` wrapper) into a synthetic block, but only for the keys in FLAT_APL_KEYS. `pdp` was listed; `session_store` was not — so a flat `global.session_store` block was silently dropped, while `global.pdp` and the `apl:`-wrapped `global.apl.session_store` both worked. Asymmetric and a silent-config footgun. - Add `session_store` to FLAT_APL_KEYS so the flat form is honored, symmetric with `pdp` and with the wrapped form. - Introduce GLOBAL_ONLY_NON_DSL_KEYS (`pdp`, `session_store`) as the single source of truth for the keys that are CPEX wiring (acted on only by visit_global) and stripped before policy compilation. strip_non_dsl_keys now iterates it. - Generalize warn_if_pdp_at_nonglobal_scope -> warn_if_global_only_key_at_nonglobal_scope so a `session_store:` written at route/default/policy-bundle scope (where it is inert) is flagged, the same way `pdp:` already is — closing the new silent-no-op the flat key would otherwise introduce. Adds tests: flat session_store is collected by apl_subblock; the renamed warning helper is a safe no-op for both keys. apl-cpex tests green, clippy clean, cpex-ffi --features valkey builds. Signed-off-by: Frederico Araujo <frederico.araujo@ibm.com>
Summary
Adds a config-selectable, Valkey-backed
SessionStorealongside the in-processMemorySessionStore, so session security labels (extensions.security.labels, the monotonic taint set that drives information-flow authorization) persist across process restarts and are shared across gateway nodes. The backend is fail-closed: any store error denies the request rather than letting it proceed with missing taint.Closes #73.
Plan:
docs/plans/2026-06-17-001-feat-valkey-session-store-plan.mdRequirements:
docs/brainstorms/valkey-session-store-requirements.md.Changes
Core (
apl-cpex)SessionStoretrait methods now returnResult<_, SessionStoreError>;MemorySessionStoreadapts (infallible →Ok). (R4, R15)AllowtoDeny(continue_processingis computed afterpersist_session). Merge precedence:Deny+ append-error keeps the original policy violation and only alarms. (R5, R18 —route_handler.rs,cmf_invoker.rs)SessionStoreFactory+global.apl.session_storeconfig seam: the visitor builds the store duringvisit_globaland swaps its ownRwLock-held store before route handlers capture it — no per-request indirection. Default staysMemorySessionStorewhen no block is present. (R2, R3 —visitor.rs,register.rs)New crate (
apl-session-valkey)ValkeySessionStoreoverredis-rs 1.x+deadpool-redison rustls (no openssl). Labels are a Redis SET keyedtaint:v1:<sha256(session_id)>; append is a single atomicSADD(+EXPIRE) pipeline (R16); key-miss →Ok(empty), backend/timeout/WRONGTYPE →Err(R5/R15); sliding-TTL refresh on load is fail-open (R7).tls:true+redis://contradiction, redacts credentials from errors, and warns when TTL < session lifetime (R8/R10/R17).cpex-ffibehind the optionalvalkeyfeature; excluded fromdefault-members. The default build links no Valkey object code. (R13)Docs/ops: operator runbook (
docs/operations/valkey-session-store.md) anddeploy/valkey-compose.yml.Testing
apl-cpex: fail-closed unit/integration tests — load-error→deny (AE1), append-error→deny + alarm (AE6), Deny+append-error preserves policy violation (R18), sessionless traffic unaffected, config-driven selection + unknown-kind rejection (AE3/AE5). Full suite green (84 tests).apl-session-valkey: 10 config/security unit tests (TLS rules, credential encoding/redaction). Integration tests against a realvalkey/valkeycontainer (#[ignore]by default; CI-gated viaREQUIRE_VALKEY_TESTS=1, orVALKEY_TEST_URL) — cross-node union (R16/AE4), unknown→empty (R15), WRONGTYPE→fail-closed and unreachable→fail-closed (R5), TTL set+refresh (R7/AE2). All 5 pass against live Valkey 8.cargo clippyclean on both crates; default and--features valkeyFFI builds verified.Post-Deploy Monitoring & Validation
This backend is opt-in (
valkeyfeature +global.apl.session_storeconfig); with neither, behavior is unchanged.evicted_keys == 0on the Valkey instance.tracing):alarm="session_store_failure"(load/append failed → request denied),alarm="session_store_ttl_refresh_failed"(risk of silent key expiry),alarm="session_store_ttl_unsound"(TTL < session lifetime). A rise in fail-closed denials correlates with Valkey health.session_store_failuredenials → disable thesession_storeconfig block (reverts to in-process memory store) while Valkey is investigated.maxmemory-policy noeviction, TLS/mTLS, least-privilege ACL. Seedocs/operations/valkey-session-store.md.evicted_keysfor the first deploy cycle after enabling Valkey in any environment.Follow-ups (deferred, not blocking)
max_retriesknob was intentionally omitted rather than shipped dead).Known Residuals
None — the three code-review findings (dead
connect_timeout_ms, deadmax_retries, implicit TLS feature) were fixed in this branch.