fix: cost autoscaler flaky test by Shekharrajak · Pull Request #19594 · apache/druid

Shekharrajak · 2026-06-17T17:52:52Z

Ref. #19517

Description

Fixes flaky CostBasedAutoScalerIntegrationTest scale-up wait.

Shekharrajak · 2026-06-17T17:53:14Z


-    cluster.callApi().postSupervisor(supervisor.createSuspendedSpec());
    cluster.callApi().waitForAllSegmentsToBeAvailable(dataSource, coordinator, broker);
-    Assertions.assertEquals("10000", cluster.runSql("SELECT COUNT(*) FROM %s", dataSource));


Assert final row count from the actual number of published records.

Shekharrajak · 2026-06-17T17:53:26Z

+      );
+    }
+    finally {
+      keepPublishing.set(false);


Clean up the publisher and suspend the supervisor.

Shekharrajak · 2026-06-17T17:56:49Z

+      overlord.latchableEmitter().waitForEvent(
+          event -> event.hasMetricName(OPTIMAL_TASK_COUNT_METRIC)
+                        .hasDimension(DruidMetrics.SUPERVISOR_ID, supervisor.getId())
+                        .hasValueMatching(Matchers.greaterThan(1L))


Wait until the cost-based autoscaler computes an optimal task count greater than 1. This is only the autoscaler's recommendation.

Shekharrajak · 2026-06-17T17:56:58Z

+      overlord.latchableEmitter().waitForEvent(
+          event -> event.hasMetricName("task/autoScaler/updatedCount")
+                        .hasDimension(DruidMetrics.SUPERVISOR_ID, supervisor.getId())
+                        .hasValueMatching(Matchers.greaterThan(1L))


This is the applied scale-up event.

Shekharrajak · 2026-06-17T17:58:27Z

+    final AtomicInteger totalRecords = new AtomicInteger();
+    final ExecutorService publisher = Executors.newSingleThreadExecutor();
+    final Future<?> publisherFuture = publisher.submit(() -> {
+      for (int i = 0; i < MAX_SCALE_UP_RECORD_BATCHES && keepPublishing.get(); ++i) {


Instead of publishing 10k records upfront and then waiting, the test keeps publishing records in the background while the autoscaler is running. This gives the autoscaler a stable lag signal to observe.
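A minimal sketch of the background-publisher pattern described above, using only the JDK. The constants, the `publishUntilObserved` name, and the sleep-based stand-ins for publishing and for the metric wait are hypothetical; the real test publishes to a stream and waits on emitter events instead.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

final class BackgroundPublisherSketch
{
  // Hypothetical values; the PR's actual batch limit is not shown in the diff.
  static final int MAX_BATCHES = 50;
  static final int BATCH_SIZE = 1_000;

  /** Publishes in the background until enough records are observed; returns the total published. */
  static int publishUntilObserved(int minRecords) throws Exception
  {
    final AtomicBoolean keepPublishing = new AtomicBoolean(true);
    final AtomicInteger totalRecords = new AtomicInteger();
    final ExecutorService publisher = Executors.newSingleThreadExecutor();
    final Future<?> publisherFuture = publisher.submit(() -> {
      for (int i = 0; i < MAX_BATCHES && keepPublishing.get(); ++i) {
        Thread.sleep(1); // stand-in for publishing one batch
        totalRecords.addAndGet(BATCH_SIZE);
      }
      return null; // returning a value makes this a Callable, so checked exceptions are allowed
    });
    try {
      // Stand-in for waiting on the scale-up event, bounded by a deadline.
      final long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(10);
      while (totalRecords.get() < minRecords && System.nanoTime() < deadline) {
        Thread.sleep(1);
      }
      keepPublishing.set(false);
      publisherFuture.get(30, TimeUnit.SECONDS); // let the in-flight batch finish
      return totalRecords.get();
    }
    finally {
      // Always stop the publisher, even if the wait above fails.
      keepPublishing.set(false);
      publisher.shutdownNow();
    }
  }

  public static void main(String[] args) throws Exception
  {
    System.out.println("published=" + publishUntilObserved(3 * BATCH_SIZE));
  }
}
```

Because the total is read only after the publisher has finished, it can serve as the expected row count in a final assertion rather than a hard-coded value.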

FrankChen021

Severity	Findings
P0	0
P1	0
P2	2
P3	0
Total	2

Reviewed 1 of 1 changed files.

This is an automated review by Codex GPT-5.5

FrankChen021 · 2026-06-18T12:25:55Z

+                        .hasValueMatching(Matchers.greaterThan(1L))
+      );
+      keepPublishing.set(false);
+      publisherFuture.get(30, TimeUnit.SECONDS);


[P2] Surface publisher failures before waiting for scaler metrics

The background publisher future is only observed after both autoscaler metric waits succeed. If publish1kRecords throws before creating enough lag, the test now waits for scaler metrics that may never arrive and then exits through finally without ever calling publisherFuture.get(), masking the real producer failure. The previous synchronous publish path failed immediately. Check the future while waiting, or observe it in the failure path so producer exceptions fail the test directly.
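One way to "check the future while waiting" is sketched below with plain JDK types. The `awaitOrFailFast` helper and its polling interval are hypothetical and are not part of Druid's test utilities; the sketch only shows the idea of polling the background future inside the wait loop so that a producer exception fails the test directly.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

final class PublisherFailureSketch
{
  /** Waits for the condition, but rethrows a background failure as soon as it is visible. */
  static void awaitOrFailFast(Future<?> background, BooleanSupplier condition, long timeoutMillis)
      throws InterruptedException, ExecutionException, TimeoutException
  {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (!condition.getAsBoolean()) {
      if (background.isDone()) {
        // Throws ExecutionException if the task failed; returns normally if it succeeded.
        background.get();
      }
      if (System.nanoTime() >= deadline) {
        throw new TimeoutException("Condition not met within " + timeoutMillis + " ms");
      }
      Thread.sleep(10); // hypothetical polling interval
    }
  }

  public static void main(String[] args) throws Exception
  {
    final CompletableFuture<Void> failedPublisher = new CompletableFuture<>();
    failedPublisher.completeExceptionally(new IllegalStateException("publish failed"));
    try {
      awaitOrFailFast(failedPublisher, () -> false, 5_000);
    }
    catch (ExecutionException e) {
      System.out.println("surfaced: " + e.getCause().getMessage());
    }
  }
}
```

The alternative mentioned in the comment, observing the future in the failure path, would instead call `get()` from a `catch` or `finally` block and attach any producer exception to the test failure.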

FrankChen021 · 2026-06-18T12:25:55Z

+      );
+      keepPublishing.set(false);
+      publisherFuture.get(30, TimeUnit.SECONDS);
+      ITRetryUtil.retryUntilTrue(


[P2] Avoid unbounded-length retries in this integration test

ITRetryUtil.retryUntilTrue uses the default 240 retries with 5 seconds between attempts, so this newly added check can add up to 20 minutes to a failing run, and the method itself has no @Timeout. Since this test already uses 600-second latch waits, a regression can now occupy CI for much longer before failing. Use a bounded retry tuned for this test or add an explicit method timeout.
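The 20-minute figure follows from 240 retries × 5 s = 1200 s. A bounded retry can be sketched with plain JDK code as below. The `BoundedRetrySketch` class and its method names are hypothetical and do not reflect the actual `ITRetryUtil` signatures; the retry count and delay are explicit parameters so the worst case is visible at the call site.

```java
import java.util.function.BooleanSupplier;

final class BoundedRetrySketch
{
  /** Upper bound on the time spent sleeping between attempts, in milliseconds. */
  static long worstCaseMillis(int maxRetries, long sleepMillis)
  {
    return maxRetries * sleepMillis;
  }

  /** Retries until the condition holds; returns the number of attempts used. */
  static int retryUntilTrue(BooleanSupplier condition, int maxRetries, long sleepMillis)
      throws InterruptedException
  {
    for (int attempt = 1; attempt <= maxRetries; ++attempt) {
      if (condition.getAsBoolean()) {
        return attempt;
      }
      if (attempt < maxRetries) {
        Thread.sleep(sleepMillis); // no sleep after the final attempt
      }
    }
    throw new IllegalStateException("Condition not met after " + maxRetries + " attempts");
  }

  public static void main(String[] args) throws InterruptedException
  {
    // Defaults quoted in the review comment: 240 retries, 5 seconds apart.
    System.out.println(worstCaseMillis(240, 5_000) / 60_000 + " minutes");
    final int[] calls = {0};
    System.out.println(retryUntilTrue(() -> ++calls[0] >= 3, 5, 1) + " attempts");
  }
}
```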

FrankChen021

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 1 of 1 changed files.

This is an automated review by Codex GPT-5.5

kfaraz · 2026-06-23T07:14:29Z

@Shekharrajak, thanks for the PR.
What was the root cause of the flakiness? I suppose the LatchableEmitter timed out. If yes, then which event did it never receive?

It seems that this PR is rewriting the logic of the test itself. It would probably make more sense to target only the failing event and identify why that event fails to arrive in time sometimes.

Fly-Style · 2026-06-23T08:21:20Z

@Shekharrajak, I fully agree with @kfaraz's opinion.

Shekharrajak added 2 commits June 17, 2026 23:11

Wait for autoscaler task counts

d0355e1

Keep autoscaler backlog active

c9f6978

Shekharrajak changed the title ~~Fix cost autoscaler flaky test~~ fix: cost autoscaler flaky test Jun 17, 2026

FrankChen021 reviewed Jun 18, 2026

Shekharrajak added 2 commits June 18, 2026 20:44

Bound autoscaler task count retries

67de675

Surface autoscaler publisher failures

b2e6ec9

FrankChen021 reviewed Jun 19, 2026

kfaraz requested a review from Fly-Style June 23, 2026 07:14

Shekharrajak wants to merge 4 commits into apache:master from Shekharrajak:fix-cost-autoscaler-flaky-test

4 participants
