fix: cost autoscaler flaky test#19594
Conversation
|
|
||
| cluster.callApi().postSupervisor(supervisor.createSuspendedSpec()); | ||
| cluster.callApi().waitForAllSegmentsToBeAvailable(dataSource, coordinator, broker); | ||
| Assertions.assertEquals("10000", cluster.runSql("SELECT COUNT(*) FROM %s", dataSource)); |
There was a problem hiding this comment.
Assert final row count from the actual number of published records.
| ); | ||
| } | ||
| finally { | ||
| keepPublishing.set(false); |
There was a problem hiding this comment.
Clean up publisher and suspend supervisor
| overlord.latchableEmitter().waitForEvent( | ||
| event -> event.hasMetricName(OPTIMAL_TASK_COUNT_METRIC) | ||
| .hasDimension(DruidMetrics.SUPERVISOR_ID, supervisor.getId()) | ||
| .hasValueMatching(Matchers.greaterThan(1L)) |
There was a problem hiding this comment.
wait until the cost-based autoscaler computes that the optimal task count is greater than 1. This is only the autoscaler recommendation.
| overlord.latchableEmitter().waitForEvent( | ||
| event -> event.hasMetricName("task/autoScaler/updatedCount") | ||
| .hasDimension(DruidMetrics.SUPERVISOR_ID, supervisor.getId()) | ||
| .hasValueMatching(Matchers.greaterThan(1L)) |
There was a problem hiding this comment.
This is the applied scale-up event.
| final AtomicInteger totalRecords = new AtomicInteger(); | ||
| final ExecutorService publisher = Executors.newSingleThreadExecutor(); | ||
| final Future<?> publisherFuture = publisher.submit(() -> { | ||
| for (int i = 0; i < MAX_SCALE_UP_RECORD_BATCHES && keepPublishing.get(); ++i) { |
There was a problem hiding this comment.
instead of publishing 10k records upfront and then waiting, the test keeps publishing records in the background while the autoscaler is running. This gives the autoscaler a
stable lag signal to observe.
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 2 |
| P3 | 0 |
| Total | 2 |
Reviewed 1 of 1 changed files.
This is an automated review by Codex GPT-5.5
| .hasValueMatching(Matchers.greaterThan(1L)) | ||
| ); | ||
| keepPublishing.set(false); | ||
| publisherFuture.get(30, TimeUnit.SECONDS); |
There was a problem hiding this comment.
[P2] Surface publisher failures before waiting for scaler metrics
The background publisher future is only observed after both autoscaler metric waits succeed. If publish1kRecords throws before creating enough lag, the test now waits for scaler metrics that may never arrive and then exits through finally without ever calling publisherFuture.get(), masking the real producer failure. The previous synchronous publish path failed immediately. Check the future while waiting, or observe it in the failure path so producer exceptions fail the test directly.
| ); | ||
| keepPublishing.set(false); | ||
| publisherFuture.get(30, TimeUnit.SECONDS); | ||
| ITRetryUtil.retryUntilTrue( |
There was a problem hiding this comment.
[P2] Avoid unbounded-length retries in this integration test
ITRetryUtil.retryUntilTrue uses the default 240 retries with 5 seconds between attempts, so this newly added check can add up to 20 minutes to a failing run, and the method itself has no @timeout. Since this test already uses 600-second latch waits, a regression can now occupy CI for much longer before failing. Use a bounded retry tuned for this test or add an explicit method timeout.
FrankChen021
left a comment
There was a problem hiding this comment.
I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.
Reviewed 1 of 1 changed files.
This is an automated review by Codex GPT-5.5
|
@Shekharrajak , thanks for the PR. It seems that this PR is rewriting the logic of the test itself. It would probably make more sense to target only the failing event and identify why that event fails to arrive in time sometimes. |
|
@Shekharrajak, fully agree with @kfaraz opinion. |
Ref. #19517
Description
Fixes flaky
CostBasedAutoScalerIntegrationTestscale-up wait.This PR has: