feat(db) resource version cas#1292
Conversation
Add Compare-And-Swap (CAS) infrastructure for safe concurrent object mutations and migrate critical paths to use it. This prevents lost updates in HA deployments with multiple gateway replicas. Core infrastructure: - Add resource_version field to ObjectMeta proto (uint64) - Add resource_version column to objects table (SQLite: INTEGER, Postgres: BIGINT) - Add WriteCondition enum (MustCreate, MatchResourceVersion, Unconditional) - Add PersistenceError::Conflict variant for version mismatch - Add Store::put_if() and Store::delete_if() CAS methods - Add Store::update_message_cas() with bounded retry for mutations - Implement CAS operations for both SQLite and Postgres backends - Hydrate resource_version on all typed reads (defaults to 1 for backfill) Migrations: - Migrate policy mutations to CAS (draft operations, settings) - Migrate provider updates to CAS (credentials, config merging) - Migrate sandbox updates to CAS (phase transitions, status reconciliation) - Migrate compute status updates to CAS (driver watch event handling) Database migrations backfill existing rows with resource_version = 1. CAS updates increment atomically: resource_version = resource_version + 1. gRPC handlers map PersistenceError::Conflict to ABORTED status code to signal clients to retry with fresh data. Server-side retries use bounded retry (5 attempts) with fresh reads on each iteration. Test coverage includes concurrent update scenarios and handler-level resource_version round-trip tests. Signed-off-by: Derek Carr <decarr@redhat.com>
Add documentation for the optimistic concurrency control (CAS) pattern implemented in the persistence layer. Updates architecture/gateway.md to explain: - resource_version field purpose and behavior - When to use update_message_cas vs put_message - Bounded retry pattern with 5 attempt limit - Conflict error handling and ABORTED status mapping - Current mutation paths using CAS - Settings mutex exception This ensures future developers understand when and how to use CAS for safe concurrent mutations in HA deployments. Signed-off-by: Derek Carr <decarr@redhat.com>
30e8540 to
46c93c3
Compare
| .map_err(|e| Status::internal(format!("persist provider failed: {e}")))?; | ||
|
|
||
| // Read back from DB to get correct resource_version | ||
| let provider = store |
There was a problem hiding this comment.
This CAS migration still leaves create exposed to a lost-create race. The handler prechecks by name, but the preceding write still uses put_message, whose backend implementation upserts on (object_type, name). Two replicas can both pass the precheck and the later write can replace the payload for the existing name, leaving row identity and protobuf metadata identity inconsistent.
Please switch create to put_if(..., WriteCondition::MustCreate) and map the unique/conflict case to AlreadyExists. The same issue exists in crates/openshell-server/src/compute/mod.rs around sandbox creation, where create_sandbox also uses put_message after a precheck.
| let incoming_config = provider.config; | ||
|
|
||
| // Use CAS to merge and update provider with bounded retry | ||
| let updated = store |
There was a problem hiding this comment.
This writes the merged provider before validating the merged result. If the incoming map introduces an oversized key/value or pushes the merged provider over the configured limits, update_message_cas has already persisted the invalid provider by the time validate_provider_fields(&updated) returns InvalidArgument below. That leaves invalid stored state after a failed request.
Please validate the merged candidate before the CAS write commits. One way is a fallible/manual CAS loop that loads the latest provider, applies the merge, runs validate_provider_fields(&candidate)?, and only then calls put_if with the candidate's current resource_version.
| // the standard 5, but 100 was excessive. 20 retries = ~20ms max latency under heavy contention. | ||
| for attempt in 0..20 { | ||
| match store | ||
| .put_if(object_type, &id, name, &payload, None, condition) |
There was a problem hiding this comment.
This retry loop still reuses the payload serialized before the first attempt. On a CAS conflict, the code refreshes only resource_version and then retries the same stale full settings object. In HA, two replicas updating different settings can both eventually succeed while the later retry overwrites the other setting.
Please reload the latest StoredSettings and reapply the semantic mutation on each retry before serializing, or do not retry here and return ABORTED so callers retry from a fresh read.
Summary
Add Compare-And-Swap (CAS) infrastructure for safe concurrent object mutations
and migrate critical paths to use it. This prevents lost updates in HA
deployments with multiple gateway replicas.
Core infrastructure:
Migrations:
Database migrations backfill existing rows with resource_version = 1.
CAS updates increment atomically: resource_version = resource_version + 1.
gRPC handlers map PersistenceError::Conflict to ABORTED status code
to signal clients to retry with fresh data. Server-side retries use
bounded retry (5 attempts) with fresh reads on each iteration.
Test coverage includes concurrent update scenarios and handler-level
resource_version round-trip tests.
Related Issue
Fixes #1255
Changes
Testing
mise run pre-commitpassesChecklist