Skip to content

fix: cost autoscaler flaky test#19594

Open
Shekharrajak wants to merge 4 commits into
apache:masterfrom
Shekharrajak:fix-cost-autoscaler-flaky-test
Open

fix: cost autoscaler flaky test#19594
Shekharrajak wants to merge 4 commits into
apache:masterfrom
Shekharrajak:fix-cost-autoscaler-flaky-test

Conversation

@Shekharrajak

Copy link
Copy Markdown
Contributor

Ref. #19517

Description

Fixes flaky CostBasedAutoScalerIntegrationTest scale-up wait.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Shekharrajak Shekharrajak changed the title Fix cost autoscaler flaky test fix: cost autoscaler flaky test Jun 17, 2026

cluster.callApi().postSupervisor(supervisor.createSuspendedSpec());
cluster.callApi().waitForAllSegmentsToBeAvailable(dataSource, coordinator, broker);
Assertions.assertEquals("10000", cluster.runSql("SELECT COUNT(*) FROM %s", dataSource));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assert final row count from the actual number of published records.

);
}
finally {
keepPublishing.set(false);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean up publisher and suspend supervisor

overlord.latchableEmitter().waitForEvent(
event -> event.hasMetricName(OPTIMAL_TASK_COUNT_METRIC)
.hasDimension(DruidMetrics.SUPERVISOR_ID, supervisor.getId())
.hasValueMatching(Matchers.greaterThan(1L))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait until the cost-based autoscaler computes that the optimal task count is greater than 1. This is only the autoscaler recommendation.

overlord.latchableEmitter().waitForEvent(
event -> event.hasMetricName("task/autoScaler/updatedCount")
.hasDimension(DruidMetrics.SUPERVISOR_ID, supervisor.getId())
.hasValueMatching(Matchers.greaterThan(1L))

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the applied scale-up event.

final AtomicInteger totalRecords = new AtomicInteger();
final ExecutorService publisher = Executors.newSingleThreadExecutor();
final Future<?> publisherFuture = publisher.submit(() -> {
for (int i = 0; i < MAX_SCALE_UP_RECORD_BATCHES && keepPublishing.get(); ++i) {

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of publishing 10k records upfront and then waiting, the test keeps publishing records in the background while the autoscaler is running. This gives the autoscaler a
stable lag signal to observe.

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 0
P2 2
P3 0
Total 2

Reviewed 1 of 1 changed files.


This is an automated review by Codex GPT-5.5

.hasValueMatching(Matchers.greaterThan(1L))
);
keepPublishing.set(false);
publisherFuture.get(30, TimeUnit.SECONDS);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Surface publisher failures before waiting for scaler metrics

The background publisher future is only observed after both autoscaler metric waits succeed. If publish1kRecords throws before creating enough lag, the test now waits for scaler metrics that may never arrive and then exits through finally without ever calling publisherFuture.get(), masking the real producer failure. The previous synchronous publish path failed immediately. Check the future while waiting, or observe it in the failure path so producer exceptions fail the test directly.

);
keepPublishing.set(false);
publisherFuture.get(30, TimeUnit.SECONDS);
ITRetryUtil.retryUntilTrue(

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P2] Avoid unbounded-length retries in this integration test

ITRetryUtil.retryUntilTrue uses the default 240 retries with 5 seconds between attempts, so this newly added check can add up to 20 minutes to a failing run, and the method itself has no @timeout. Since this test already uses 600-second latch waits, a regression can now occupy CI for much longer before failing. Use a bounded retry tuned for this test or add an explicit method timeout.

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reviewed the code for correctness, edge cases, concurrency, and integration risks; no issues found.

Reviewed 1 of 1 changed files.


This is an automated review by Codex GPT-5.5

@kfaraz

kfaraz commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

@Shekharrajak , thanks for the PR.
What was the root cause of the flakiness? I suppose the LatchableEmitter timed out. If yes, then which event did it never receive?

It seems that this PR is rewriting the logic of the test itself. It would probably make more sense to target only the failing event and identify why that event fails to arrive in time sometimes.

@kfaraz kfaraz requested a review from Fly-Style June 23, 2026 07:14
@Fly-Style

Copy link
Copy Markdown
Contributor

@Shekharrajak, fully agree with @kfaraz opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants