fix: fix broker segment metadata cache refresh #19625
FrankChen021
left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 0 |
| P2 | 1 |
| P3 | 0 |
| Total | 1 |
Reviewed 4 of 4 changed files.
This is an automated review by Codex GPT-5.5
    try {
      updatedSegmentIds.addAll(refreshSegmentsForDataSource(dataSource, entry.getValue()));
    }
    catch (QueryInterruptedException e) {
[P2] Do not treat every QueryInterruptedException as shutdown
This catch rethrows all QueryInterruptedException instances before the per-datasource failure handler. In Druid, QueryInterruptedException is also used as a generic client-side wrapper for failed queries, including remote QueryExceptions with unknown/runtime failures, not only actual thread interruption. When one datasource's SegmentMetadataQuery fails through that wrapper, refreshSegments still aborts the whole cycle, later datasources are skipped, and no segment/schemaCache/refresh/failed metric is emitted for the offending datasource. Please distinguish real local interruption/cancellation from ordinary query failure, for example by checking the cause/errorCode, and let non-interruption query failures follow the isolated datasource path.
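One way to read the suggestion above is as a small classification helper. The following is only a sketch: `WrappedQueryException` is a hypothetical stand-in for Druid's `QueryInterruptedException`, and the error-code strings are assumptions chosen for illustration, so the real class and constants should be checked before adopting anything like this.

```java
import java.util.concurrent.CancellationException;

// Hypothetical stand-in for the client-side wrapper exception discussed above.
// It is NOT the Druid class; it only carries an error code and a cause.
final class WrappedQueryException extends RuntimeException {
  private final String errorCode;

  WrappedQueryException(String errorCode, Throwable cause) {
    super(errorCode, cause);
    this.errorCode = errorCode;
  }

  String getErrorCode() {
    return errorCode;
  }
}

final class InterruptionCheck {
  // Assumed error-code values, for illustration only.
  static final String INTERRUPTED = "Query interrupted";
  static final String CANCELLED = "Query cancelled";

  // Returns true only when the wrapper represents a real interruption or
  // cancellation; any other wrapped failure is an ordinary query failure.
  static boolean isLocalInterruption(WrappedQueryException e) {
    Throwable cause = e.getCause();
    if (cause instanceof InterruptedException || cause instanceof CancellationException) {
      return true;
    }
    return INTERRUPTED.equals(e.getErrorCode()) || CANCELLED.equals(e.getErrorCode());
  }
}
```

With a check like this, the catch block could rethrow only when `isLocalInterruption` returns true and otherwise fall through to the per-datasource failure handler.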
Description
Brokers can maintain a schema cache via segment metadata queries. Currently, if any of these queries times out, the remaining queries are aborted until the next refresh. If you have a huge datasource delta (think 500k+ segments being scanned), such a query can fail or time out and cause broker schema discovery for other, unrelated datasources to fail. Without centralized schema through the coordinator, there is no intra-datasource atomicity guarantee w.r.t. schema discovery (it is just ASAP), so this PR decouples this error dependency and instead emits a metric per datasource when failures occur.
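The decoupling described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `IsolatedRefreshSketch`, `refreshAll`, `refreshOne`, and `onFailure` are hypothetical names, segments are represented as plain strings, and handling of real thread interruption is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;
import java.util.function.Function;

final class IsolatedRefreshSketch {
  // Refreshes each datasource independently. A failure in one datasource is
  // reported through onFailure and does not stop the remaining datasources.
  static List<String> refreshAll(
      Map<String, List<String>> segmentsByDataSource,
      Function<List<String>, List<String>> refreshOne,
      BiConsumer<String, Exception> onFailure
  ) {
    List<String> refreshed = new ArrayList<>();
    for (Map.Entry<String, List<String>> entry : segmentsByDataSource.entrySet()) {
      try {
        refreshed.addAll(refreshOne.apply(entry.getValue()));
      }
      catch (RuntimeException e) {
        // Hypothetical hook: a per-datasource failure metric or log line
        // would be emitted here, keyed by the datasource name.
        onFailure.accept(entry.getKey(), e);
      }
    }
    return refreshed;
  }
}
```

Passing the failure handler as a callback keeps the loop agnostic about whether failures become a metric, a log line, or an aggregate emitted at the end of the cycle.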
Introduces the segment/schemaCache/refresh/failed metric with a dataSource dimension, emitted when a refresh fails. Alternatively, failures could be aggregated and emitted once at the end. Also open to keeping this as a warning/error log.
Release note
Fix broker segment metadata cache refresh
This PR has: