diskquota/jdbc: retry SERIALIZABLE aborts (#1527)#1530
Open
groldan wants to merge 2 commits into
Open
Conversation
a17b5b8 to
89886fd
Compare
aaime
requested changes
May 18, 2026
Member
aaime
left a comment
There was a problem hiding this comment.
Overall looks good, two feedbacks:
- I think the foreign key fix got merged into the PR by accident?
- Some of the explanations javadocs are very long-winded, while reviewing I am repeatedly tempted to jump to the code rather than keep on reading. I would try to summarize a bit and only comment on bits that are less than obvious.
Member
Author
|
hi, thanks for reviewing.
no, I jus rebased it after #1529 was merged
hear you, shriking... brb |
Concurrent writers hitting the same TILESET or TILEPAGE row produced serialization aborts that escaped JDBCQuotaStore and reached QueuedQuotaUpdatesConsumer, which logs at FINE and drops the aggregated batch, silently losing pending quota updates and letting the on-disk ledger drift out of sync. SERIALIZABLE isolation is kept as-is; the missing piece was the application-level retry that SERIALIZABLE assumes. Wrap every JDBCQuotaStore transaction in a bounded-retry helper whose predicate recognises serialization aborts from Postgres SSI, HSQL (SQLState class "40"), and Oracle (vendor codes 8176/8177). Oracle additionally gets a DEFERRABLE INITIALLY DEFERRED TILEPAGE -> TILESET foreign key, the matching migrate-on-upgrade path, and a PL/SQL renameLayer so TILESET.KEY and TILEPAGE.TILESET_ID rewrite atomically at commit. Verified end-to-end against PostgreSQL and Oracle XE testcontainers. on-behalf-of: @camptocamp <info@camptocamp.com>
Concurrent migrateForeignKeys callers could fail OracleForeignKey MigrationIT.migrateIsConcurrentStartupSafe sporadically. Oracle's ALTER TABLE uses NOWAIT, and a peer thread that gets ORA-00054 on DROP CONSTRAINT could re-check the FK state while the peer was still between its drop and its add, observe an in-between state, and rethrow. on-behalf-of: @camptocamp <info@camptocamp.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1527.
Requires #1528 #1529
JDBCQuotaStoreruns every transaction atSERIALIZABLEisolation to defend the quota ledger against lost updates, but it never implemented the retry thatSERIALIZABLEis defined in terms of. When two writers contended on the sameTILESETorTILEPAGErow (e.g. several consumer threads in one JVM under load, or any clustered GeoServer deployment) the database aborts one of the transactions and Spring throwsDataAccessException(aRuntimeException). The exception propagates toQueuedQuotaUpdatesConsumer.call(), which catches theRuntimeException, logs atFINE, and continues; the aggregated batch of pending quota updates was silently drop.That silent drop is itself the lost-update path
SERIALIZABLEwas meant to prevent. Over time the in-DB ledger drifts out of sync with what's on disk the original symptom that motivated isolating quota writes in the first place, which might be hard to diagnose because if it happens sporadically, a large cache disk size (e.g. computed withdu -sh) will not deviate enough from the computed disk quota enough to draw attention, and the FINE log level would just hide it.Fix
Wrap every
tt.execute(...)call inJDBCQuotaStorewith a bounded-retry helper. Default policy: 10 attempts, 10 ms -> 500 ms exponential backoff with full jitter,FINE-level logging on each retry andWARNINGonly when the cap is exhausted. The retry predicate matches three families of abort signals:PessimisticLockingFailureException(Spring)SQLSTATE 40001, translated by Spring toCannotAcquireLockException)SQLExceptionwithSQLStateclass"40"SQLTransactionRollbackException, which Spring translates to a bareConcurrencyFailureExceptionrather than to one of the pessimistic-locking subclassesSQLExceptionwith Oracle vendor code8176or8177consistent read failure) and ORA-08177 (can't serialize access). Spring leaves 08176 uncategorised and routes 08177 to the deprecatedCannotSerializeTransactionException(a sibling ofPessimisticLockingFailureException, not a subclass), so vendor-code matching keeps the dialect-specific knowledge local to this predicateThe
SERIALIZABLEisolation level is not touched. The maintainer added it deliberately to defend the ledger; this PR is the missing application-level retry that brings the code in line with howSERIALIZABLEis meant to be used.Two important refinements once the retry loop was in place
PROPAGATION_REQUIREDreuses the outer transaction, so an innerexecuteWithRetry(e.g.createLayerInternalcalled frominitialize) that loops can only re-fail against the same stale snapshot. The helper now short-circuits withTransactionSynchronizationManager.isActualTransactionActive()and lets aborts bubble up to the outermost retry, which starts a fresh transaction. On Oracle this cut full-suite cap-exhaustionWARNINGs from 22 to 0 inOracleQuotaStoreIT/OracleQuotaStoreConcurrencyIT.initialize(). Oracle auto-commits DDL and leaves the post-DDL SCN bookkeeping in a state where the firstSERIALIZABLEread across recently-created indexes aborts with ORA-08176; running schema setup before the wrapped block lets the snapshot be taken cleanly past that auto-commit. HSQLDB also forces an implicit commit immediately before and after each DDL statement but isn't affected by the snapshot-read failure mode; Postgres, in contrast, has fully transactional DDL. The refactor is portably safe either way,initialize()is run-once startup work, and narrows the wrapped block to just the layer-reconciliation reads/writes that actually need retry.Oracle-specific completeness
Oracle cannot support
ON UPDATE CASCADE, so the existing FK migration path (introduced in #1526 for the other dialects) is overridden:TILEPAGE -> TILESETFK is declaredDEFERRABLE INITIALLY DEFERREDon fresh installs and migrated to that shape on upgrades, removing the snapshot read againstTILESETthat fires on every plainINSERT INTO TILEPAGE(the main remaining ORA-08176 trigger).getRenameLayerStatementis rewritten as a PL/SQL anonymous block that rewrites bothTILESET.KEYandTILEPAGE.TILESET_IDinside a single transaction. The deferred FK is verified once at commit with both updates already in place. This fixes the twoOracleQuotaStoreIT.testRenameLayer{,2}cases that JDBCQuotaStore.renameLayer only updates LAYER_NAME, leaves TILESET.KEY and TILEPAGE.TILESET_ID stale #1526's stricter assertions left failing.The
SQLDialectmigrate scaffolding from #1526 is refactored to expose two protected hooks:tilepageFkIsMigrated(rs)andtilepageFkAddSql(table, prefix). This way Oracle's overrides without duplicating the scan/race-recovery loop.Tests
A new
AbstractJDBCQuotaStoreConcurrencyTestholds the shared scenarios. Three concrete subclasses run them against the dialects we support:HSQLQuotaStoreConcurrencyTestPostgreSQLQuotaStoreConcurrencyITOracleQuotaStoreConcurrencyITJDBCQuotaStoreRetryTest(offline) covers the retry helper itself: cap exhaustion, interrupt mid-backoff, and non-concurrency exceptions skipping the retry path entirely. The existingJDBCQuotaStoreTestsuite passes on all dialects.Verification
Running
mvn -Ponline verifyon the diskquota/jdbc module:For comparison, the same Oracle IT suite took ~120 s and produced 66 ORA-08176 stack traces in the logs before this PR.
Out of scope
SERIALIZABLE. Deliberately kept. The retry is the missing piece, not the isolation model.ON CONFLICTupsert inPostgreSQLDialect. Could reduce the conflict rate but does not replace the retry layer and haven't checked it'd work for all cases.