Move delta record/collect coordination from instrument to series level #8313
jack-berg wants to merge 11 commits into open-telemetry:main from delta-aggregator-handle-coordination-1
Conversation
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Level;

class CumulativeSynchronousMetricStorage<T extends PointData>
With this, the code paths for cumulative vs. delta are different enough that including them in the same file just makes everything harder to read / understand.
Separately, just look how simple the cumulative code paths are. With delta we jump through extra hoops required for record/collect coordination, and despite that still can't match the cumulative performance, which is just dirt simple.
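To illustrate why the cumulative path needs no record/collect coordination, here is a hypothetical standalone sketch (invented names, not the actual SDK classes): collect is a plain read that never resets state, so a racing record simply lands in the next collection.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of the simple cumulative path: collect never
// resets state, so there is nothing to coordinate with in-flight records.
class CumulativeSeries {
    private final AtomicLong sum = new AtomicLong();

    void record(long value) {
        sum.addAndGet(value); // lock-free add, no coordination needed
    }

    long collect() {
        return sum.get(); // a plain read; racing records show up next collect
    }
}

public class CumulativeSimple {
    public static void main(String[] args) {
        CumulativeSeries s = new CumulativeSeries();
        s.record(5);
        s.record(2);
        System.out.println(s.collect()); // 7
        s.record(1);
        System.out.println(s.collect()); // 8
    }
}
```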
Aggregation aggregation,
MemoryMode memoryMode,
InstrumentValueType instrumentValueType) {

for (int repetition = 0; repetition < 50; repetition++) {
TODO: revert or make configurable before merging
Increasing the reps of this test really helps build confidence that experiments are correct. It's a useful tool when iterating on experiments, but it increases the test runtime.
Codecov Report
❌ Patch coverage is … Additional details and impacted files:

@@            Coverage Diff             @@
##               main    #8313    +/-   ##
============================================
+ Coverage     90.31%   90.89%   +0.58%
- Complexity     7725     8002     +277
============================================
  Files           850      900      +50
  Lines         23259    24087     +828
  Branches       2364     2402      +38
============================================
+ Hits          21007    21895     +888
+ Misses         1528     1454      -74
- Partials        724      738      +14

☔ View full report in Codecov by Sentry.
…y-java into delta-aggregator-handle-coordination-1
instrument.record(measurement, attr);
}
Uninterruptibles.sleepUninterruptibly(ONE_MICROSECOND);
Thread.yield();
TIL that Java 8, 11, and 17 don't have the precision to sleep for periods shorter than 1ms, so every 1µs sleep was actually sleeping for ~1ms. After I changed the test to run 50 times to increase confidence of correctness, it took so long that it essentially timed out.
So to make it run fast, I switched to a strategy where I start all the record and collect threads, but gate their activity on a latch that indicates the collect thread has started. After each record, the thread yields to allow interleaving. This allows the concurrent collect and record we want to simulate without needing to sleep.
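The latch-and-yield pattern described above could look roughly like this (a hypothetical standalone sketch with invented names, not the actual test code): recorders block on a `CountDownLatch` until the collect thread has started, then yield after each record so the two sides interleave without any sleeping.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;

public class LatchGatedStress {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong total = new AtomicLong();
        CountDownLatch collectStarted = new CountDownLatch(1);

        Thread recorder = new Thread(() -> {
            try {
                collectStarted.await(); // gate: don't record until collect has begun
            } catch (InterruptedException e) {
                return;
            }
            for (int i = 0; i < 10_000; i++) {
                total.incrementAndGet(); // stand-in for instrument.record(...)
                Thread.yield();          // allow the collect thread to interleave
            }
        });

        Thread collector = new Thread(() -> {
            collectStarted.countDown();    // signal recorders to start
            while (total.get() < 10_000) { // stand-in for repeated collect()
                Thread.yield();
            }
        });

        recorder.start();
        collector.start();
        recorder.join();
        collector.join();
        System.out.println(total.get()); // 10000
    }
}
```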
Both spin loops in DeltaSynchronousMetricStorage burned the collect thread's core while waiting for a recorder that may have been preempted mid-section. On constrained CPUs (single-CPU containers, pinned cores), the recorder could not be rescheduled until the OS forced a quantum, multiplied per stuck handle. Add Thread.yield() in both loops so the collect thread releases its core. Matches the existing yield in acquireHandleForRecord.

Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
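The fix in that commit amounts to the following pattern (a minimal hypothetical sketch; `activeRecorders` and `awaitRecordersDrained` are invented names standing in for the real spin loops): the collect thread yields its core on each spin iteration so a recorder that was preempted inside its critical section can be rescheduled and finish.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class YieldSpin {
    static final AtomicInteger activeRecorders = new AtomicInteger();

    static void awaitRecordersDrained() {
        // Spin until in-flight record() calls complete; yield so the collect
        // thread doesn't hog a core a preempted recorder needs to finish.
        while (activeRecorders.get() > 0) {
            Thread.yield();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        activeRecorders.set(1); // pretend one record() is in flight
        Thread recorder = new Thread(() -> {
            // simulate the recorder finishing its critical section
            activeRecorders.decrementAndGet();
        });
        recorder.start();
        awaitRecordersDrained(); // returns once the recorder drains
        recorder.join();
        System.out.println("drained");
    }
}
```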
Reviewed against #8077. Confirms this is pareto-optimal: 8077's striped … I also compared the design with … Walked the happens-before chain for the subtle late-recorder case (recorder passes …).

Issues
1. Busy-spin without yield in the collect-thread loops (most important).
2. …
3. Stress test 50× repetitions.
alternatively could consider
Yield collect-thread spin loops to unblock preempted recorders
Merged
Will explore separately: #8382
Will improve
Alternative to #8077. Where #8077 improves performance by striping the `AtomicInteger` used to coordinate between record and collect threads, this comes at the cost of single-threaded performance, because selecting the striped instance of `AtomicInteger` is additional work that is not necessary.

With this PR, I also increase the number of coordinator `AtomicInteger` instances to reduce contention, but do so by moving the `AtomicInteger` from the instrument level (i.e. `DeltaSynchronousMetricStorage`) to the timeseries level (i.e. `AggregatorHandle`).

This means that there's no additional work for the single-threaded case, but callers recording against multiple series concurrently get to distribute the contention across as many `AtomicInteger` instances as there are series. You see this manifest in the benchmark diff: the single-threaded cases don't change, and there are significant improvements in the threads=4, temporality=delta, cardinality=100 cases.

One downside of this approach is that multiple threads concurrently recording to a single timeseries are still bottlenecked by a single `AtomicInteger`. You see this manifest in the benchmark diff: the threads=4, temporality=delta, cardinality=1 cases don't improve.
Note, the benchmarks below incorporate the changes from #8308, which leads to much less variance/noise than previously.
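The per-series coordination the PR describes could be sketched roughly like this (a hypothetical simplification with invented names; the real `AggregatorHandle` is more involved): each series owns its own coordinator, so recorders on different series never contend with each other, while collect only waits for recorders on the series it is reading.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: one coordinator per series (handle), not per instrument.
class SeriesHandle {
    private final AtomicInteger activeRecorders = new AtomicInteger();
    private final AtomicLong sum = new AtomicLong();

    void record(long value) {
        activeRecorders.incrementAndGet(); // announce an in-flight record
        try {
            sum.addAndGet(value);
        } finally {
            activeRecorders.decrementAndGet();
        }
    }

    long collectDelta() {
        // Wait only for recorders on THIS series, yielding the core while we spin.
        while (activeRecorders.get() > 0) {
            Thread.yield();
        }
        return sum.getAndSet(0); // read and reset the delta
    }
}

public class PerSeriesCoordination {
    public static void main(String[] args) {
        SeriesHandle handle = new SeriesHandle();
        handle.record(3);
        handle.record(4);
        System.out.println(handle.collectDelta()); // 7
        System.out.println(handle.collectDelta()); // 0
    }
}
```

Two threads recording to different `SeriesHandle` instances touch disjoint `AtomicInteger`s (the threads=4, cardinality=100 win), while two threads on the same handle still serialize on one counter (the cardinality=1 non-improvement).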