diskquota/jdbc: retry SERIALIZABLE aborts (#1527) #1530

Open
groldan wants to merge 2 commits into GeoWebCache:main from groldan:fix/1527/main

@groldan groldan commented May 15, 2026

Closes #1527.
Requires #1528 #1529

JDBCQuotaStore runs every transaction at SERIALIZABLE isolation to defend the quota ledger against lost updates, but it never implemented the retry that SERIALIZABLE is defined in terms of. When two writers contend on the same TILESET or TILEPAGE row (e.g. several consumer threads in one JVM under load, or any clustered GeoServer deployment), the database aborts one of the transactions and Spring throws a DataAccessException (a RuntimeException). The exception propagates to QueuedQuotaUpdatesConsumer.call(), which catches the RuntimeException, logs at FINE, and continues; the aggregated batch of pending quota updates is silently dropped.

That silent drop is itself the lost-update path SERIALIZABLE was meant to prevent. Over time the in-DB ledger drifts out of sync with what's on disk, which is the original symptom that motivated isolating quota writes in the first place. The drift can be hard to diagnose: if drops happen only sporadically, the size of a large cache on disk (e.g. computed with du -sh) will not deviate from the computed disk quota enough to draw attention, and the FINE log level hides the dropped batches.

Fix

Wrap every tt.execute(...) call in JDBCQuotaStore with a bounded-retry helper. Default policy: 10 attempts, 10 ms -> 500 ms exponential backoff with full jitter, FINE-level logging on each retry and WARNING only when the cap is exhausted. The retry predicate matches three families of abort signals:

| Family | What it covers |
| --- | --- |
| PessimisticLockingFailureException (Spring) | Postgres SSI aborts (SQLSTATE 40001, translated by Spring to CannotAcquireLockException) |
| SQLException with SQLState class "40" | HSQL's SQLTransactionRollbackException, which Spring translates to a bare ConcurrencyFailureException rather than to one of the pessimistic-locking subclasses |
| SQLException with Oracle vendor code 8176 or 8177 | ORA-08176 (consistent read failure) and ORA-08177 (can't serialize access). Spring leaves 08176 uncategorised and routes 08177 to the deprecated CannotSerializeTransactionException (a sibling of PessimisticLockingFailureException, not a subclass), so vendor-code matching keeps the dialect-specific knowledge local to this predicate |
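
For illustration, the two SQLException-based families in the table could be matched roughly as follows. This is a minimal sketch, not the code in this PR: the class and method names are hypothetical, it uses only java.sql so it stays self-contained, and it leaves out the Spring PessimisticLockingFailureException check, which the real predicate would need for the first family.

```java
import java.sql.SQLException;

/** Hypothetical stand-in for the retry predicate; names are illustrative only. */
final class SerializationAbortPredicate {

    private SerializationAbortPredicate() {}

    /** Walks the cause chain looking for a SQLException that signals a serialization abort. */
    static boolean isSerializationAbort(Throwable t) {
        // Bound the walk so a cyclic cause chain cannot loop forever.
        for (int depth = 0; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof SQLException) {
                SQLException e = (SQLException) t;
                // SQLState class "40" is the standard "transaction rollback" class (e.g. 40001).
                String state = e.getSQLState();
                if (state != null && state.startsWith("40")) {
                    return true;
                }
                // Oracle vendor codes for ORA-08176 and ORA-08177.
                int code = e.getErrorCode();
                if (code == 8176 || code == 8177) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Walking the cause chain matters because the SQLException usually arrives wrapped in a Spring DataAccessException rather than as the top-level throwable.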

The SERIALIZABLE isolation level is not touched. The maintainer added it deliberately to defend the ledger; this PR is the missing application-level retry that brings the code in line with how SERIALIZABLE is meant to be used.
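
As a rough picture of what such an application-level retry looks like, here is a minimal sketch of a bounded retry with exponential backoff and full jitter, using the default policy described above (10 attempts, 10 ms growing to a 500 ms cap). The class, method, and constant names are hypothetical, the retry predicate is passed in as a plain Predicate, and the FINE/WARNING logging is omitted; the helper in this PR may be structured differently.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Hypothetical bounded-retry helper; not the PR's actual class. */
final class BoundedRetry {
    static final int MAX_ATTEMPTS = 10;
    static final long BASE_DELAY_MS = 10;
    static final long MAX_DELAY_MS = 500;

    private BoundedRetry() {}

    /** Upper bound of the jitter window before retry number {@code retry} (1-based). */
    static long backoffCeilingMs(int retry) {
        long exponential = BASE_DELAY_MS << Math.min(retry - 1, 20); // 10, 20, 40, ...
        return Math.min(MAX_DELAY_MS, exponential);
    }

    /** Runs {@code work}, retrying while {@code retryable} accepts the failure and attempts remain. */
    static <T> T execute(Supplier<T> work, Predicate<Throwable> retryable) {
        for (int attempt = 1; ; attempt++) {
            try {
                return work.get();
            } catch (RuntimeException e) {
                // Give up on non-concurrency failures or once the cap is exhausted.
                if (!retryable.test(e) || attempt >= MAX_ATTEMPTS) {
                    throw e;
                }
                // Full jitter: sleep a uniformly random time in [0, ceiling].
                long sleepMs = ThreadLocalRandom.current().nextLong(backoffCeilingMs(attempt) + 1);
                try {
                    Thread.sleep(sleepMs);
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag and surface the original failure.
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}
```

Full jitter spreads contending writers across the whole backoff window, so threads that aborted together are less likely to collide again on the next attempt.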

Two important refinements once the retry loop was in place

  • Skip retry when nested inside an active transaction. Spring's PROPAGATION_REQUIRED reuses the outer transaction, so an inner executeWithRetry (e.g. createLayerInternal called from initialize) that loops can only re-fail against the same stale snapshot. The helper now short-circuits with TransactionSynchronizationManager.isActualTransactionActive() and lets aborts bubble up to the outermost retry, which starts a fresh transaction. On Oracle this cut full-suite cap-exhaustion WARNINGs from 22 to 0 in OracleQuotaStoreIT/OracleQuotaStoreConcurrencyIT.
  • Run DDL outside the wrapping transaction in initialize(). Oracle auto-commits DDL and leaves the post-DDL SCN bookkeeping in a state where the first SERIALIZABLE read across recently-created indexes aborts with ORA-08176; running schema setup before the wrapped block lets the snapshot be taken cleanly past that auto-commit. HSQLDB also forces an implicit commit immediately before and after each DDL statement but isn't affected by the snapshot-read failure mode; Postgres, in contrast, has fully transactional DDL. The refactor is portably safe either way, since initialize() is run-once startup work, and it narrows the wrapped block to just the layer-reconciliation reads/writes that actually need retry.
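
The first refinement can be pictured with a small stand-alone model. This is a sketch under loud assumptions: a ThreadLocal flag stands in for Spring's TransactionSynchronizationManager.isActualTransactionActive(), IllegalStateException stands in for a serialization abort, the cap is 3 instead of 10, and backoff is omitted. All names are hypothetical and none of this is the PR's code.

```java
import java.util.function.Supplier;

/** Hypothetical model of "retry only at the outermost transaction". */
final class NestedAwareRetry {
    // Stand-in for TransactionSynchronizationManager.isActualTransactionActive().
    private static final ThreadLocal<Boolean> TX_ACTIVE = ThreadLocal.withInitial(() -> false);
    static final int MAX_ATTEMPTS = 3;

    private NestedAwareRetry() {}

    static <T> T executeWithRetry(Supplier<T> work) {
        if (TX_ACTIVE.get()) {
            // Nested call: the outer transaction is reused, so looping here
            // could only re-fail. Run once and let aborts bubble up.
            return work.get();
        }
        for (int attempt = 1; ; attempt++) {
            TX_ACTIVE.set(true); // models beginning a fresh transaction
            try {
                return work.get();
            } catch (IllegalStateException abort) { // models a serialization abort
                if (attempt >= MAX_ATTEMPTS) {
                    throw abort;
                }
            } finally {
                TX_ACTIVE.set(false); // models commit or rollback
            }
        }
    }
}
```

In this model, an abort raised inside a nested call causes the outer block to be re-run from the start, rather than the inner block spinning against a transaction that has already failed.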

Oracle-specific completeness

Oracle cannot support ON UPDATE CASCADE, so the existing FK migration path (introduced in #1526 for the other dialects) is overridden:

  • The TILEPAGE -> TILESET FK is declared DEFERRABLE INITIALLY DEFERRED on fresh installs and migrated to that shape on upgrades, removing the snapshot read against TILESET that fires on every plain INSERT INTO TILEPAGE (the main remaining ORA-08176 trigger).
  • getRenameLayerStatement is rewritten as a PL/SQL anonymous block that rewrites both TILESET.KEY and TILEPAGE.TILESET_ID inside a single transaction. The deferred FK is verified once at commit with both updates already in place. This fixes the two OracleQuotaStoreIT.testRenameLayer{,2} cases that the stricter assertions from #1526 ("JDBCQuotaStore.renameLayer only updates LAYER_NAME, leaves TILESET.KEY and TILEPAGE.TILESET_ID stale") left failing.

The SQLDialect migrate scaffolding from #1526 is refactored to expose two protected hooks: tilepageFkIsMigrated(rs) and tilepageFkAddSql(table, prefix). This lets the Oracle dialect override just those two pieces without duplicating the scan/race-recovery loop.

Tests

A new AbstractJDBCQuotaStoreConcurrencyTest holds the shared scenarios. Three concrete subclasses run them against the dialects we support:

| Class | Phase | Engine |
| --- | --- | --- |
| HSQLQuotaStoreConcurrencyTest | surefire | In-memory HSQL |
| PostgreSQLQuotaStoreConcurrencyIT | failsafe | Postgres testcontainer |
| OracleQuotaStoreConcurrencyIT | failsafe | Oracle XE testcontainer |

JDBCQuotaStoreRetryTest (offline) covers the retry helper itself: cap exhaustion, interrupt mid-backoff, and non-concurrency exceptions skipping the retry path entirely. The existing JDBCQuotaStoreTest suite passes on all dialects.

Verification

Running mvn -Ponline verify on the diskquota/jdbc module:

Surefire:  90 run, 0 failures, 38 skipped (fixture-based PG/Oracle without ~/.geowebcache/*.properties)
Failsafe:  45 run, 0 failures
  PostgreSQLQuotaStoreIT                 19/19
  OracleQuotaStoreIT                     19/19    (~46 s)
  PostgreSQLForeignKeyMigrationIT         3/3
  PostgreSQLQuotaStoreConcurrencyIT       2/2
  OracleQuotaStoreConcurrencyIT           2/2     (~33 s)
  HSQLQuotaStoreConcurrencyTest           2/2     (surefire)

Cap-exhaustion WARNs in test logs: 0
ORA-08176 occurrences in logs:     0
ORA-08177 occurrences in logs:     0

For comparison, the same Oracle IT suite took ~120 s and produced 66 ORA-08176 stack traces in the logs before this PR.

Out of scope

  • Dropping SERIALIZABLE. Deliberately kept. The retry is the missing piece, not the isolation model.
  • A Postgres-native ON CONFLICT upsert in PostgreSQLDialect. It could reduce the conflict rate but does not replace the retry layer, and I haven't checked that it would work for all cases.

@groldan groldan force-pushed the fix/1527/main branch 2 times, most recently from a17b5b8 to 89886fd on May 18, 2026 17:30

@aaime aaime left a comment

Overall looks good, two pieces of feedback:

  • I think the foreign key fix got merged into the PR by accident?
  • Some of the explanatory javadocs are very long-winded; while reviewing I am repeatedly tempted to jump to the code rather than keep on reading. I would try to summarize a bit and only comment on bits that are less than obvious.

groldan commented May 18, 2026

hi, thanks for reviewing.

> I think the foreign key fix got merged into the PR by accident?

no, I just rebased it after #1529 was merged

> Some of the explanatory javadocs are very long-winded

hear you, shrinking... brb

Concurrent writers hitting the same TILESET or TILEPAGE row produced
serialization aborts that escaped JDBCQuotaStore and reached
QueuedQuotaUpdatesConsumer, which logs at FINE and drops the
aggregated batch, silently losing pending quota updates and letting
the on-disk ledger drift out of sync. SERIALIZABLE isolation is kept
as-is; the missing piece was the application-level retry that
SERIALIZABLE assumes.

Wrap every JDBCQuotaStore transaction in a bounded-retry helper whose
predicate recognises serialization aborts from Postgres SSI, HSQL
(SQLState class "40"), and Oracle (vendor codes 8176/8177). Oracle
additionally gets a DEFERRABLE INITIALLY DEFERRED TILEPAGE -> TILESET
foreign key, the matching migrate-on-upgrade path, and a PL/SQL
renameLayer so TILESET.KEY and TILEPAGE.TILESET_ID rewrite atomically
at commit. Verified end-to-end against PostgreSQL and Oracle XE
testcontainers.

on-behalf-of: @camptocamp <info@camptocamp.com>

Concurrent migrateForeignKeys callers could make
OracleForeignKeyMigrationIT.migrateIsConcurrentStartupSafe fail
sporadically. Oracle's ALTER TABLE uses NOWAIT, and a thread that
gets ORA-00054 on DROP CONSTRAINT could re-check the FK state while
its peer was still between its drop and its add, observe an
in-between state, and rethrow.

on-behalf-of: @camptocamp <info@camptocamp.com>
@groldan groldan requested a review from aaime May 18, 2026 21:36

Successfully merging this pull request may close these issues.

JDBCQuotaStore hits CannotSerializeTransactionException in clustered deployments
