Move delta record/collect coordination from instrument to series level#8313

Open
jack-berg wants to merge 11 commits into open-telemetry:main from jack-berg:delta-aggregator-handle-coordination-1

Conversation

@jack-berg
Member

Alternative to #8077. #8077 improves performance by striping the AtomicInteger used to coordinate between record and collect threads, but this comes at the cost of single-threaded performance, because selecting the striped AtomicInteger instance is additional work that a single-threaded caller doesn't need.

With this PR, I also increase the number of coordinator AtomicInteger instances to reduce contention, but do so by moving the AtomicInteger from the instrument level (i.e. DeltaSynchronousMetricStorage) to the timeseries level (i.e. AggregatorHandle).

  • Each timeseries gets its own AtomicInteger
  • At record time, the record thread increments / decrements the timeseries' AtomicInteger instance
  • At collect time, the collect thread marks all timeseries AtomicInteger instances for collect, then waits for all active records to complete, then collects

This means there's no additional work in the single-threaded case, but callers recording against multiple series concurrently get to distribute the contention across as many AtomicInteger instances as there are series. You see this manifest in the benchmark diff: the single-threaded cases don't change, and there are significant improvements in the threads=4, temporality=delta, cardinality=100 cases.

One downside of this approach is that multiple threads concurrently recording to a single timeseries are still bottlenecked by a single AtomicInteger. You see this manifest in the benchmark diff: the threads=4, temporality=delta, cardinality=1 cases don't improve.
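The per-series coordination described above can be sketched as follows. This is a minimal, hedged sketch, not the SDK's actual AggregatorHandle: the class name SeriesHandle and the exact state encoding (even = idle, odd = locked for collect, each in-flight recorder adds 2) are assumptions based on the description.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of per-series record/collect coordination.
// Even state = idle, odd state = locked for collect; increments of 2
// count in-flight recorders without disturbing the parity bit.
class SeriesHandle {
    private final AtomicInteger state = new AtomicInteger(0);

    /** Record thread: returns false if the series is locked for collect. */
    boolean tryAcquireForRecord() {
        while (true) {
            int current = state.get();
            if ((current & 1) == 1) {
                return false; // collect in progress; caller must retry elsewhere
            }
            if (state.compareAndSet(current, current + 2)) {
                return true; // registered as an active recorder
            }
        }
    }

    void releaseAfterRecord() {
        state.addAndGet(-2);
    }

    /** Collect thread: flip parity to block new recorders on this series. */
    void lockForCollect() {
        state.incrementAndGet(); // state becomes odd
    }

    /** Collect thread: spin until all in-flight recorders drain. */
    void awaitRecorders() {
        while (state.get() != 1) {
            Thread.yield(); // release the core to a possibly-preempted recorder
        }
    }

    void unlockAfterCollect() {
        state.decrementAndGet(); // back to even/idle
    }
}
```

Because each series owns its own gate, recorders touching different series never contend on the same AtomicInteger, which is exactly where the threads=4/cardinality=100 wins come from.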

Note, the benchmarks below incorporate the changes from #8308, which leads to much less variance/noise than previously.

| Threads | Temporality | Cardinality | Instrument | 0_baseline (ops/s) | 2_aggregator_handle_tracking_minamlist (ops/s) | Δ ops/s | Δ % |
|---|---|---|---|---|---|---|---|
| 1 | DELTA | 1 | COUNTER_SUM | 139,309,882 | 126,772,929 | -12,536,953 | -9.0% |
| 1 | DELTA | 1 | UP_DOWN_COUNTER_SUM | 138,953,746 | 133,244,856 | -5,708,890 | -4.1% |
| 1 | DELTA | 1 | GAUGE_LAST_VALUE | 37,205,058 | 41,952,132 | +4,747,074 | +12.8% |
| 1 | DELTA | 1 | HISTOGRAM_EXPLICIT | 76,174,648 | 72,036,559 | -4,138,089 | -5.4% |
| 1 | DELTA | 1 | HISTOGRAM_BASE2_EXPONENTIAL | 43,956,180 | 41,842,933 | -2,113,247 | -4.8% |
| 1 | DELTA | 100 | COUNTER_SUM | 88,390,625 | 101,415,374 | +13,024,749 | +14.7% |
| 1 | DELTA | 100 | UP_DOWN_COUNTER_SUM | 100,682,536 | 104,228,731 | +3,546,195 | +3.5% |
| 1 | DELTA | 100 | GAUGE_LAST_VALUE | 31,129,986 | 30,153,053 | -976,933 | -3.1% |
| 1 | DELTA | 100 | HISTOGRAM_EXPLICIT | 79,728,720 | 71,241,352 | -8,487,368 | -10.6% |
| 1 | DELTA | 100 | HISTOGRAM_BASE2_EXPONENTIAL | 41,119,222 | 42,712,082 | +1,592,860 | +3.9% |
| 1 | CUMULATIVE | 1 | COUNTER_SUM | 156,207,611 | 156,051,490 | -156,121 | -0.1% |
| 1 | CUMULATIVE | 1 | UP_DOWN_COUNTER_SUM | 164,723,673 | 164,464,216 | -259,457 | -0.2% |
| 1 | CUMULATIVE | 1 | GAUGE_LAST_VALUE | 63,483,186 | 67,329,671 | +3,846,485 | +6.1% |
| 1 | CUMULATIVE | 1 | HISTOGRAM_EXPLICIT | 102,708,149 | 102,593,029 | -115,120 | -0.1% |
| 1 | CUMULATIVE | 1 | HISTOGRAM_BASE2_EXPONENTIAL | 46,786,722 | 47,702,621 | +915,899 | +2.0% |
| 1 | CUMULATIVE | 100 | COUNTER_SUM | 93,234,874 | 92,660,085 | -574,789 | -0.6% |
| 1 | CUMULATIVE | 100 | UP_DOWN_COUNTER_SUM | 102,773,896 | 93,401,286 | -9,372,610 | -9.1% |
| 1 | CUMULATIVE | 100 | GAUGE_LAST_VALUE | 90,948,347 | 87,231,953 | -3,716,394 | -4.1% |
| 1 | CUMULATIVE | 100 | HISTOGRAM_EXPLICIT | 78,204,809 | 77,987,516 | -217,293 | -0.3% |
| 1 | CUMULATIVE | 100 | HISTOGRAM_BASE2_EXPONENTIAL | 44,678,570 | 44,894,584 | +216,014 | +0.5% |
| 4 | DELTA | 1 | COUNTER_SUM | 20,959,707 | 16,469,448 | -4,490,259 | -21.4% |
| 4 | DELTA | 1 | UP_DOWN_COUNTER_SUM | 20,991,865 | 16,210,248 | -4,781,617 | -22.8% |
| 4 | DELTA | 1 | GAUGE_LAST_VALUE | 19,358,172 | 18,777,835 | -580,337 | -3.0% |
| 4 | DELTA | 1 | HISTOGRAM_EXPLICIT | 11,868,012 | 11,091,512 | -776,500 | -6.5% |
| 4 | DELTA | 1 | HISTOGRAM_BASE2_EXPONENTIAL | 10,236,804 | 10,606,743 | +369,939 | +3.6% |
| 4 | DELTA | 100 | COUNTER_SUM | 24,494,401 | 73,211,762 | +48,717,361 | +198.9% |
| 4 | DELTA | 100 | UP_DOWN_COUNTER_SUM | 24,520,580 | 67,246,731 | +42,726,151 | +174.2% |
| 4 | DELTA | 100 | GAUGE_LAST_VALUE | 22,992,790 | 47,030,541 | +24,037,751 | +104.5% |
| 4 | DELTA | 100 | HISTOGRAM_EXPLICIT | 23,253,146 | 56,965,510 | +33,712,364 | +145.0% |
| 4 | DELTA | 100 | HISTOGRAM_BASE2_EXPONENTIAL | 21,707,199 | 57,301,938 | +35,594,739 | +164.0% |
| 4 | CUMULATIVE | 1 | COUNTER_SUM | 79,758,391 | 75,378,143 | -4,380,248 | -5.5% |
| 4 | CUMULATIVE | 1 | UP_DOWN_COUNTER_SUM | 73,199,378 | 73,274,059 | +74,681 | +0.1% |
| 4 | CUMULATIVE | 1 | GAUGE_LAST_VALUE | 30,208,943 | 30,232,185 | +23,242 | +0.1% |
| 4 | CUMULATIVE | 1 | HISTOGRAM_EXPLICIT | 18,515,768 | 18,585,553 | +69,785 | +0.4% |
| 4 | CUMULATIVE | 1 | HISTOGRAM_BASE2_EXPONENTIAL | 18,665,724 | 16,009,460 | -2,656,264 | -14.2% |
| 4 | CUMULATIVE | 100 | COUNTER_SUM | 108,295,650 | 99,551,354 | -8,744,296 | -8.1% |
| 4 | CUMULATIVE | 100 | UP_DOWN_COUNTER_SUM | 107,571,455 | 101,067,120 | -6,504,335 | -6.0% |
| 4 | CUMULATIVE | 100 | GAUGE_LAST_VALUE | 95,965,166 | 97,450,271 | +1,485,105 | +1.5% |
| 4 | CUMULATIVE | 100 | HISTOGRAM_EXPLICIT | 78,957,306 | 76,630,707 | -2,326,599 | -2.9% |
| 4 | CUMULATIVE | 100 | HISTOGRAM_BASE2_EXPONENTIAL | 73,614,813 | 73,852,367 | +237,554 | +0.3% |

@jack-berg jack-berg requested a review from a team as a code owner April 21, 2026 20:37
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;

class CumulativeSynchronousMetricStorage<T extends PointData>
Member Author


With this, the code paths for cumulative vs. delta are different enough that including them in the same file just makes everything harder to read / understand.

Separately, just look how simple the cumulative code paths are. With delta we jump through extra hoops required for record/collect coordination, and despite that still can't match the cumulative performance, which is just dirt simple.

Aggregation aggregation,
MemoryMode memoryMode,
InstrumentValueType instrumentValueType) {
for (int repetition = 0; repetition < 50; repetition++) {
Member Author


TODO: revert or make configurable before merging

Increasing the reps of this test really helps build confidence that experiments are correct.

A useful tool when iterating experiments, but increases the test runtime.

@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 98.20359% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.89%. Comparing base (df8063b) to head (d743384).
⚠️ Report is 24 commits behind head on main.

Files with missing lines Patch % Lines
.../internal/state/DeltaSynchronousMetricStorage.java 98.41% 0 Missing and 2 partials ⚠️
...rnal/state/CumulativeSynchronousMetricStorage.java 97.50% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #8313      +/-   ##
============================================
+ Coverage     90.31%   90.89%   +0.58%     
- Complexity     7725     8002     +277     
============================================
  Files           850      900      +50     
  Lines         23259    24087     +828     
  Branches       2364     2402      +38     
============================================
+ Hits          21007    21895     +888     
+ Misses         1528     1454      -74     
- Partials        724      738      +14     


instrument.record(measurement, attr);
}
Uninterruptibles.sleepUninterruptibly(ONE_MICROSECOND);
Thread.yield();
Member Author


TIL that Java 8, 11, and 17 don't have the precision to sleep for periods of less than 1ms, so every 1us sleep was actually sleeping for ~1ms. After I changed the test to run 50 times to increase confidence of correctness, it took so long that it essentially timed out.

So to make it run fast, I switched to a strategy where I start all the record and collect threads, but gate their activity on a latch that indicates the collect thread has started. After each record, the thread yields to allow interleaving. This allows the concurrent collect and record we want to simulate without needing to sleep.
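That strategy can be sketched roughly like this. It is a hedged illustration, not the actual test code: the helper name StressSketch, the recorder count, and the iteration count are all assumptions.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical sketch: gate recorder threads on a latch so record and
// collect genuinely overlap, and yield instead of sleeping, since on
// some JDKs sub-millisecond sleeps round up to ~1ms.
class StressSketch {
    static void run(Runnable record, Runnable collect, int recorders) throws InterruptedException {
        CountDownLatch collectStarted = new CountDownLatch(1);
        Thread[] threads = new Thread[recorders];
        for (int i = 0; i < recorders; i++) {
            threads[i] = new Thread(() -> {
                try {
                    collectStarted.await(); // don't record until collect is underway
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int j = 0; j < 1_000; j++) {
                    record.run();
                    Thread.yield(); // encourage interleaving with the collect thread
                }
            });
            threads[i].start();
        }
        Thread collector = new Thread(() -> {
            collectStarted.countDown(); // release the recorders
            collect.run();
        });
        collector.start();
        collector.join();
        for (Thread t : threads) {
            t.join();
        }
    }
}
```

The key point is that no thread ever sleeps: the latch forces the overlap the test wants to exercise, and yielding after each record maximizes interleaving without the ~1ms sleep penalty.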

Both spin loops in DeltaSynchronousMetricStorage burned the collect
thread's core while waiting for a recorder that may have been preempted
mid-section. On constrained CPU (single-CPU containers, pinned cores),
the recorder could not be rescheduled until the OS forced a quantum,
multiplied per stuck handle.

Add Thread.yield() in both loops so the collect thread releases its
core. Matches the existing yield in acquireHandleForRecord.
Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
@zeitlinger
Member

Reviewed against #8077. Confirms this is Pareto-optimal: #8077's striped ReentrantLocks gave large multi-threaded wins but regressed single-threaded performance by 26-42% on DELTA hot paths. This PR keeps the single-threaded baseline (mostly within ±5%, worst -10.6%) while still capturing the 100-200% multi-threaded wins at cardinality=100. The acknowledged miss at threads=4/cardinality=1 is the right tradeoff: that workload is rare in practice (real instruments have labels), and single-threaded recording is not.

I also compared the design with prometheus/client_java's Buffer.java. It uses the same even/odd AtomicInteger + busy-wait protocol, but stripes per-CPU rather than per-series. That makes sense for client_java because their hot path is histogram.observe() on a single series; it would not fit the OTel goal here (per-CPU striping would re-introduce the single-threaded regression that closed #8077). Per-series striping is the right axis for this workload.

Walked the happens-before chain for the subtle late-recorder case (recorder passes tryAcquireForRecord after awaitRecordersAndUnlock has decremented state back to even). The holder.isLockedForCollect() re-check catches it because the holder gate is locked first and never unlocks within a collection. The comment at line ~660 accurately describes this. Pool growth is bounded because new-series creation is gated during collect, so the previous trim step is correctly removed.
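The late-recorder guard walked through above can be illustrated with a hedged sketch. The names (LateRecorderGuard, holderLockedForCollect, handleState) and the exact gate layout are assumptions simplified from the description; the real code lives in DeltaSynchronousMetricStorage and AggregatorHandle.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the late-recorder guard: the outer holder gate
// is locked before any per-handle gate and stays locked for the whole
// collection, so a recorder that slips past a just-unlocked handle gate
// is still caught by re-checking the holder.
class LateRecorderGuard {
    final AtomicBoolean holderLockedForCollect = new AtomicBoolean(false);
    final AtomicInteger handleState = new AtomicInteger(0); // even = idle

    boolean tryRecord() {
        int current = handleState.get();
        if ((current & 1) == 1 || !handleState.compareAndSet(current, current + 2)) {
            return false; // handle gate closed, or lost the CAS race
        }
        // Re-check the outer gate: the handle may have been unlocked by a
        // collection that is still in progress elsewhere.
        if (holderLockedForCollect.get()) {
            handleState.addAndGet(-2); // back out of the record
            return false;
        }
        return true;
    }
}
```

The ordering matters: because the holder gate is set before any handle gate and cleared after all of them, a recorder that observes an unlocked handle mid-collection must still observe the locked holder on the re-check.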

Issues

1. Busy-spin without yield in the collect-thread loops (most important).
awaitRecordersAndUnlock() (line ~354) and lockForCollectAndAwait() (line ~296) will burn the collect thread's core if a recorder is preempted mid-section. On constrained CPU (1-CPU containers, pinned cores, oversubscribed hosts), the recorder cannot be rescheduled until the OS forces a quantum, multiplied per stuck handle. Stacked fix here: jack-berg#27 — adds Thread.yield() to both loops, consistent with the existing yield in acquireHandleForRecord.

2. getDeltaAggregatorHandle name is misleading.
It has side effects (state += 2, may return null) but reads like a plain getter. Suggest renaming to tryAcquireHandleForRecord.

3. Stress test 50× repetitions.
SynchronousInstrumentStressTest.stressTest now runs 50× per parameter combo. Justified by the algorithm's subtlety, but worth gating behind a system property or moving to a dedicated stress profile so PR CI does not pay for it on every run.

@laurit
Contributor

laurit commented May 8, 2026

Stacked fix here: jack-berg#27 — adds Thread.yield() to both loops, consistent with the existing yield in acquireHandleForRecord.

alternatively could consider Thread.onSpinWait that was introduced in jdk9
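For illustration, a sketch of what the onSpinWait variant could look like (hypothetical method and class name; Thread.onSpinWait() is a JDK 9+ hint that typically lowers to an instruction like x86 PAUSE, easing pipeline and power pressure, but unlike Thread.yield() it does not give up the core to the scheduler):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical spin loop using Thread.onSpinWait (JDK 9+). A CPU-level
// busy-wait hint; on a constrained/oversubscribed host, Thread.yield
// may still be needed to let a preempted recorder run at all.
class SpinWaitSketch {
    static void awaitIdle(AtomicInteger state, int idleValue) {
        while (state.get() != idleValue) {
            Thread.onSpinWait();
        }
    }
}
```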

Yield collect-thread spin loops to unblock preempted recorders
@jack-berg
Member Author

  1. Busy-spin without yield in the collect-thread loops (most important).

Merged

alternatively could consider Thread.onSpinWait that was introduced in jdk9

Will explore separately: #8382

  2. getDeltaAggregatorHandle name is misleading.

Will improve

  3. Stress test 50× repetitions.

#8313 (comment)
