feat(db) resource version cas by derekwaynecarr · Pull Request #1292 · NVIDIA/OpenShell

derekwaynecarr · 2026-05-09T13:59:25Z

Summary

Add Compare-And-Swap (CAS) infrastructure for safe concurrent object mutations
and migrate critical paths to use it. This prevents lost updates in HA
deployments with multiple gateway replicas.

Core infrastructure:

Add resource_version field to ObjectMeta proto (uint64)
Add resource_version column to objects table (SQLite: INTEGER, Postgres: BIGINT)
Add WriteCondition enum (MustCreate, MatchResourceVersion, Unconditional)
Add PersistenceError::Conflict variant for version mismatch
Add Store::put_if() and Store::delete_if() CAS methods
Add Store::update_message_cas() with bounded retry for mutations
Implement CAS operations for both SQLite and Postgres backends
Hydrate resource_version on all typed reads (defaults to 1 for backfill)

Migrations:

Migrate policy mutations to CAS (draft operations, settings)
Migrate provider updates to CAS (credentials, config merging)
Migrate sandbox updates to CAS (phase transitions, status reconciliation)
Migrate compute status updates to CAS (driver watch event handling)

Database migrations backfill existing rows with resource_version = 1.
CAS updates increment atomically: resource_version = resource_version + 1.

gRPC handlers map PersistenceError::Conflict to ABORTED status code
to signal clients to retry with fresh data. Server-side retries use
bounded retry (5 attempts) with fresh reads on each iteration.

Test coverage includes concurrent update scenarios and handler-level
resource_version round-trip tests.

Related Issue

Fixes #1255

Changes

Testing

[x ] mise run pre-commit passes
[ x] Unit tests added/updated
[ x] E2E tests added/updated (if applicable)

Checklist

[ x] Follows Conventional Commits
[ x] Commits are signed off (DCO)
[ x] Architecture docs updated (if applicable)

copy-pr-bot · 2026-05-09T13:59:29Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Add Compare-And-Swap (CAS) infrastructure for safe concurrent object mutations and migrate critical paths to use it. This prevents lost updates in HA deployments with multiple gateway replicas. Core infrastructure: - Add resource_version field to ObjectMeta proto (uint64) - Add resource_version column to objects table (SQLite: INTEGER, Postgres: BIGINT) - Add WriteCondition enum (MustCreate, MatchResourceVersion, Unconditional) - Add PersistenceError::Conflict variant for version mismatch - Add Store::put_if() and Store::delete_if() CAS methods - Add Store::update_message_cas() with bounded retry for mutations - Implement CAS operations for both SQLite and Postgres backends - Hydrate resource_version on all typed reads (defaults to 1 for backfill) Migrations: - Migrate policy mutations to CAS (draft operations, settings) - Migrate provider updates to CAS (credentials, config merging) - Migrate sandbox updates to CAS (phase transitions, status reconciliation) - Migrate compute status updates to CAS (driver watch event handling) Database migrations backfill existing rows with resource_version = 1. CAS updates increment atomically: resource_version = resource_version + 1. gRPC handlers map PersistenceError::Conflict to ABORTED status code to signal clients to retry with fresh data. Server-side retries use bounded retry (5 attempts) with fresh reads on each iteration. Test coverage includes concurrent update scenarios and handler-level resource_version round-trip tests. Signed-off-by: Derek Carr <decarr@redhat.com>

Add documentation for the optimistic concurrency control (CAS) pattern implemented in the persistence layer. Updates architecture/gateway.md to explain: - resource_version field purpose and behavior - When to use update_message_cas vs put_message - Bounded retry pattern with 5 attempt limit - Conflict error handling and ABORTED status mapping - Current mutation paths using CAS - Settings mutex exception This ensures future developers understand when and how to use CAS for safe concurrent mutations in HA deployments. Signed-off-by: Derek Carr <decarr@redhat.com>

johntmyers · 2026-05-12T18:55:30Z

        .map_err(|e| Status::internal(format!("persist provider failed: {e}")))?;

+    // Read back from DB to get correct resource_version
+    let provider = store


This CAS migration still leaves create exposed to a lost-create race. The handler prechecks by name, but the preceding write still uses put_message, whose backend implementation upserts on (object_type, name). Two replicas can both pass the precheck and the later write can replace the payload for the existing name, leaving row identity and protobuf metadata identity inconsistent.

Please switch create to put_if(..., WriteCondition::MustCreate) and map the unique/conflict case to AlreadyExists. The same issue exists in crates/openshell-server/src/compute/mod.rs around sandbox creation, where create_sandbox also uses put_message after a precheck.

johntmyers · 2026-05-12T18:55:44Z

+    let incoming_config = provider.config;
+
+    // Use CAS to merge and update provider with bounded retry
+    let updated = store


This writes the merged provider before validating the merged result. If the incoming map introduces an oversized key/value or pushes the merged provider over the configured limits, update_message_cas has already persisted the invalid provider by the time validate_provider_fields(&updated) returns InvalidArgument below. That leaves invalid stored state after a failed request.

Please validate the merged candidate before the CAS write commits. One way is a fallible/manual CAS loop that loads the latest provider, applies the merge, runs validate_provider_fields(&candidate)?, and only then calls put_if with the candidate's current resource_version.

johntmyers · 2026-05-12T18:55:58Z

+    // the standard 5, but 100 was excessive. 20 retries = ~20ms max latency under heavy contention.
+    for attempt in 0..20 {
+        match store
+            .put_if(object_type, &id, name, &payload, None, condition)


This retry loop still reuses the payload serialized before the first attempt. On a CAS conflict, the code refreshes only resource_version and then retries the same stale full settings object. In HA, two replicas updating different settings can both eventually succeed while the later retry overwrites the other setting.

Please reload the latest StoredSettings and reapply the semantic mutation on each retry before serializing, or do not retry here and return ABORTED so callers retry from a fresh read.

derekwaynecarr requested review from a team, maxamillion and mrunalp as code owners May 9, 2026 13:59

derekwaynecarr changed the title ~~Feat/db resource version cas~~ feat(db) resource version cas May 9, 2026

derekwaynecarr added 2 commits May 9, 2026 21:31

derekwaynecarr force-pushed the feat/db-resource-version-cas branch from 30e8540 to 46c93c3 Compare May 10, 2026 01:38

johntmyers reviewed May 12, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(db) resource version cas#1292

feat(db) resource version cas#1292
derekwaynecarr wants to merge 2 commits into
NVIDIA:mainfrom
derekwaynecarr:feat/db-resource-version-cas

derekwaynecarr commented May 9, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 9, 2026

Uh oh!

johntmyers May 12, 2026

Uh oh!

johntmyers May 12, 2026

Uh oh!

johntmyers May 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

derekwaynecarr commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Testing

Checklist

Uh oh!

copy-pr-bot Bot commented May 9, 2026

Uh oh!

johntmyers May 12, 2026

Choose a reason for hiding this comment

Uh oh!

johntmyers May 12, 2026

Choose a reason for hiding this comment

Uh oh!

johntmyers May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

derekwaynecarr commented May 9, 2026 •

edited

Loading