fix: fix broker segment metadata cache refresh#19625

Open
jtuglu1 wants to merge 1 commit into apache:master from jtuglu1:fix-broker-metadata-refresh-timeout

@jtuglu1 jtuglu1 commented Jun 24, 2026

Contributor

Description

Brokers can maintain a schema cache via segment metadata queries. Currently, if any of these queries times out, the remaining queries are aborted until the next refresh. With a huge datasource delta (think 500k+ segments being scanned), such a query can fail or time out and cause broker schema discovery for other, unrelated datasources to fail. Without centralized schema through the Coordinator, there is no intra-datasource atomicity guarantee w.r.t. schema discovery (it is just ASAP), so this PR decouples that error dependency and instead emits a metric per datasource when failures occur.

Introduces a segment/schemaCache/refresh/failed metric with a dataSource dimension, emitted when a refresh for that datasource fails. Alternatively, failures could be aggregated and emitted once at the end of the refresh cycle. Also open to keeping this as a warning/error log instead.
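
The isolation described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `IsolatedRefreshSketch`, `DataSourceRefresher`, and the `BiConsumer` used as a metric emitter are hypothetical stand-ins for the broker's refresh loop, the per-datasource segment metadata query, and Druid's metric emitter. Only the metric name and the dataSource dimension come from the description.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiConsumer;

final class IsolatedRefreshSketch {
    /** Hypothetical stand-in for the per-datasource segment metadata query. */
    interface DataSourceRefresher {
        Set<String> refresh(String dataSource, Set<String> segmentIds);
    }

    /**
     * Refreshes each datasource independently. A failure in one datasource is
     * reported through metricEmitter (metric name, dataSource) and does not
     * prevent the remaining datasources from being refreshed.
     */
    static Set<String> refreshAll(
            Map<String, Set<String>> segmentsByDataSource,
            DataSourceRefresher refresher,
            BiConsumer<String, String> metricEmitter) {
        Set<String> updatedSegmentIds = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> entry : segmentsByDataSource.entrySet()) {
            String dataSource = entry.getKey();
            try {
                updatedSegmentIds.addAll(refresher.refresh(dataSource, entry.getValue()));
            } catch (RuntimeException e) {
                // Assumption: any runtime failure here is an ordinary query failure,
                // so it is recorded for this datasource and the loop continues.
                metricEmitter.accept("segment/schemaCache/refresh/failed", dataSource);
            }
        }
        return updatedSegmentIds;
    }
}
```

The alternative mentioned above (aggregate and emit once at the end) would collect the failed datasource names in the catch block and emit after the loop.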

Release note

Fix broker segment metadata cache refresh


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@jtuglu1 jtuglu1 force-pushed the fix-broker-metadata-refresh-timeout branch from 2540cf2 to ae024bd Compare June 24, 2026 05:15
@jtuglu1 jtuglu1 requested review from clintropolis and kfaraz June 24, 2026 05:18
@jtuglu1 jtuglu1 force-pushed the fix-broker-metadata-refresh-timeout branch from ae024bd to 862d3de Compare June 24, 2026 05:27
@jtuglu1 jtuglu1 added the Bug label Jun 24, 2026
@jtuglu1 jtuglu1 added this to the 38.0.0 milestone Jun 24, 2026
@jtuglu1 jtuglu1 force-pushed the fix-broker-metadata-refresh-timeout branch from 862d3de to c49c2c5 Compare June 24, 2026 07:14

@FrankChen021 FrankChen021 left a comment

Member

| Severity | Findings |
|----------|----------|
| P0       | 0        |
| P1       | 0        |
| P2       | 1        |
| P3       | 0        |
| Total    | 1        |

Reviewed 4 of 4 changed files.


This is an automated review by Codex GPT-5.5

try {
  updatedSegmentIds.addAll(refreshSegmentsForDataSource(dataSource, entry.getValue()));
}
catch (QueryInterruptedException e) {

[P2] Do not treat every QueryInterruptedException as shutdown

This catch rethrows all QueryInterruptedException instances before the per-datasource failure handler. In Druid, QueryInterruptedException is also used as a generic client-side wrapper for failed queries, including remote QueryExceptions with unknown/runtime failures, not only actual thread interruption. When one datasource's SegmentMetadataQuery fails through that wrapper, refreshSegments still aborts the whole cycle, later datasources are skipped, and no segment/schemaCache/refresh/failed metric is emitted for the offending datasource. Please distinguish real local interruption/cancellation from ordinary query failure, for example by checking the cause/errorCode, and let non-interruption query failures follow the isolated datasource path.
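
One way the suggested distinction might look is sketched below. It is illustrative only and makes several assumptions: `WrappedQueryException` is a hypothetical stand-in for a wrapper exception that carries an error code, and the two error-code strings are placeholders, not confirmed Druid constants. The check treats the failure as a real interruption only if an `InterruptedException` appears in the cause chain or the error code signals interruption or cancellation. Everything else would follow the isolated per-datasource failure path.

```java
final class InterruptionCheckSketch {
    /** Hypothetical stand-in for a query-failure wrapper that carries an error code. */
    static final class WrappedQueryException extends RuntimeException {
        private final String errorCode;

        WrappedQueryException(String errorCode, Throwable cause) {
            super(errorCode, cause);
            this.errorCode = errorCode;
        }

        String getErrorCode() {
            return errorCode;
        }
    }

    // Placeholder error codes, assumed for illustration only.
    static final String INTERRUPTED_CODE = "Query interrupted";
    static final String CANCELLED_CODE = "Query cancelled";

    /**
     * Returns true only when the wrapper looks like a real interruption or
     * cancellation, judged by its cause chain or its error code.
     */
    static boolean isLocalInterruption(WrappedQueryException e) {
        for (Throwable t = e.getCause(); t != null; t = t.getCause()) {
            if (t instanceof InterruptedException) {
                return true;
            }
        }
        return INTERRUPTED_CODE.equals(e.getErrorCode())
            || CANCELLED_CODE.equals(e.getErrorCode());
    }
}
```

In the catch block, a caller could rethrow when `isLocalInterruption` returns true and otherwise emit the per-datasource failure metric and continue with the next datasource.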

@jtuglu1 jtuglu1 force-pushed the fix-broker-metadata-refresh-timeout branch from c49c2c5 to 6559cb7 Compare June 24, 2026 15:56
2 participants