PHOENIX-7859 Make ParallelPhoenixConnectionFallbackIT deterministic#2482
PHOENIX-7859 Make ParallelPhoenixConnectionFallbackIT deterministic#2482mnpoonia wants to merge 1 commit into
Conversation
ce41869 to
fb1bfd5
Compare
…y checking queue state directly The test ParallelPhoenixConnectionFallbackIT.testParallelConnectionBackoff times out intermittently in CI when polling hasCapacity() to detect when executor queues fill up. Root cause: hasCapacity() performs a multi-step calculation (read queue size, read capacity, divide, compare threshold) which creates a race condition. Tasks can enter queues during calculation steps, causing the check to miss state transitions. Solution: Check queue.size() >= 1 directly (single atomic operation), then verify hasCapacity() matches expected state as an assertion. Benefits: - Eliminates race condition (atomic read vs multi-step calculation) - More deterministic (checks actual state, not derived value) - Maintains 5s timeout (no increase needed) - Validates both queue state and hasCapacity() logic - Adds debug logging for troubleshooting Testing: Passed locally with HBase 2.6.5. Queues filled in ~105ms (2 checks), well under 5s timeout. State transition [0,0] → [1,1] detected reliably. hasCapacity() correctly returned [false, false]. Related: PHOENIX-6840 (flaky ParallelPhoenix tests) Co-authored-by: Claude Code <claude-code@anthropic.com>
fb1bfd5 to
2b1341c
Compare
There was a problem hiding this comment.
Pull request overview
This PR makes ParallelPhoenixConnectionFallbackIT.testParallelConnectionBackoff deterministic by removing a race-prone polling condition and instead waiting on the executor queues’ observed state before asserting hasCapacity() behavior.
Changes:
- Replaces polling on
PhoenixHAExecutorServiceProvider.hasCapacity()with polling on the underlying executor queue sizes. - Adds explicit assertions that
hasCapacity()reports the expected “no capacity” state once queues are filled. - Adds debug logging to show queue sizes while waiting.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| int queueSize1 = ((ThreadPoolExecutor) services.get(0).getExecutorService()).getQueue().size(); | ||
| int queueSize2 = ((ThreadPoolExecutor) services.get(1).getExecutorService()).getQueue().size(); |
| assertFalse("Cluster 1 should have no capacity after queues filled", capacity.get(0).booleanValue()); | ||
| assertFalse("Cluster 2 should have no capacity after queues filled", capacity.get(1).booleanValue()); |
There was a problem hiding this comment.
The stated root cause is at least partly wrong?
Even if one hasCapacity() call returns a stale answer because of the non-atomic read pair, the next poll will see the update. The test polls 50 times over 5s. Maybe detection is delayed by one tick (~100ms). It cannot cause a 5s timeout. There must be some other reason?
Be sure to run mvn spotless:apply before posting an update or this will fail precommit or cause precommits to fail post merge.
| LOG.debug("Waiting for queues to fill: cluster1 queue={}, cluster2 queue={}", | ||
| queueSize1, queueSize2); | ||
|
|
||
| return queueSize1 >= 1 && queueSize2 >= 1; |
There was a problem hiding this comment.
If any future change adjusts HA_MAX_QUEUE_SIZE this test may pass when it should fail (the wait completes earlier than the fallback signal actually flips) or fail in confusing ways. Consider deriving the trigger from the configured values, e.g.:
int maxQueue = Integer.parseInt(
PROPERTIES.getProperty(PhoenixHAExecutorServiceProvider.HA_MAX_QUEUE_SIZE));
double threshold = Double.parseDouble(
PROPERTIES.getProperty(
PhoenixHAExecutorServiceProvider.HA_THREADPOOL_QUEUE_BACKOFF_THRESHOLD));
int trigger = (int) Math.ceil(threshold * maxQueue);... or just keep hasCapacity() as the wait predicate
|
FWIW @mnpoonia I put this to my robot and it suggests CI host CPU starvation is the cause because the timeouts are too tight.
|
What is the purpose of this change?
Fix intermittent timeout in
ParallelPhoenixConnectionFallbackIT.testParallelConnectionBackoffby eliminating race condition in queue capacity check.Background
The test polls
hasCapacity()to detect when executor queues fill up. However,hasCapacity()performs a multi-step calculation (read size, read capacity, divide, compare) that creates a race condition with queue state changes. Tasks can enter queues during the calculation, causing the test to miss the state transition and timeout.This has caused intermittent CI failures and is part of a pattern of flakiness in
ParallelPhoenix*tests (see PHOENIX-6840, @W-15906980).What changes did I make?
Changed from polling
hasCapacity()(calculated value) to directly checkingqueue.size()(actual state), then verifyinghasCapacity()returns the expected result as an assertion.Before:
After:
Why is this better?
Race Condition Eliminated
hasCapacity() calculation window:
Direct queue check (atomic):
Performance
Benefits
hasCapacity()logic is correctTesting
✅ Local Testing:
[0,0] → [1,1]as expectedhasCapacity()correctly returned[false, false]✅ Expected CI Behavior:
Related Issues
Checklist
List,ThreadPoolExecutor)