Yield collect-thread spin loops to unblock preempted recorders #27

Merged
jack-berg merged 1 commit into jack-berg:delta-aggregator-handle-coordination-1 from zeitlinger:stacked/delta-spin-wait on May 8, 2026
@zeitlinger

Stacked on top of open-telemetry#8313 — targets your branch directly so it merges in if/when 8313 lands.

Summary

Two spin loops in DeltaSynchronousMetricStorage busy-wait without yielding:

  • AggregatorHolder.lockForCollectAndAwait() — waits for in-flight new-series operations
  • DeltaAggregatorHandle.awaitRecordersAndUnlock() — waits for in-flight recorders

Recorder critical sections are short (a LongAdder.add for counters, bucket lookup + add for histograms), so on a roomy multi-core box the spin terminates in hundreds of cycles. But if the OS preempts a recorder mid-section, the collect thread keeps spinning on its core, and on a constrained host the recorder cannot be rescheduled until the scheduler takes that core away from the collect thread. This failure mode shows up on:

  • Single-CPU containers (cpus=1)
  • Pinned-core deployments
  • Oversubscribed hosts where collect and recorder threads compete for the same core

In the worst case, each handle stalls for roughly one OS scheduler quantum (~10 ms), and that cost is multiplied across all in-flight handles in the collection.

Fix

Add Thread.yield() in both loops. Matches the existing Thread.yield() already used in acquireHandleForRecord for the holder-swap retry, so the suppression and pattern are consistent.
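
For readers without the diff open, here is a minimal sketch of the yielding spin-wait pattern. It is not the SDK's code: the class name, the `state` encoding (low bit as collect lock, +2 per in-flight recorder), and the `runDemo` harness are assumptions made for illustration, and a single collect thread is assumed.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;

/** Illustrative stand-in for a delta handle; names and state encoding are assumptions. */
public class YieldingSpinSketch {
  // Low bit set = collect thread holds the lock; each in-flight recorder adds 2.
  private final AtomicInteger state = new AtomicInteger();
  private final LongAdder sum = new LongAdder();

  /** Records a value, backing off with a yield while a collection is in progress. */
  void record(long value) {
    while (true) {
      int s = state.addAndGet(2); // announce this recorder
      if ((s & 1) == 0) {
        try {
          sum.add(value); // short critical section
        } finally {
          state.addAndGet(-2); // recorder is done
        }
        return;
      }
      state.addAndGet(-2); // locked: undo the announcement and retry
      Thread.yield();
    }
  }

  /** Locks out new recorders, waits for in-flight ones, then returns and resets the delta. */
  long collect() {
    state.incrementAndGet(); // set the lock bit (single collector assumed)
    while (state.get() != 1) {
      // Without this yield, a preempted recorder sharing our core could not
      // make progress until the scheduler took the core away from us.
      Thread.yield();
    }
    long delta = sum.sumThenReset();
    state.decrementAndGet(); // clear the lock bit
    return delta;
  }

  /** Runs recorders against a concurrent collector; returns the sum of all collected deltas. */
  static long runDemo(int recorderThreads, int recordsPerThread) {
    YieldingSpinSketch handle = new YieldingSpinSketch();
    Thread[] recorders = new Thread[recorderThreads];
    for (int i = 0; i < recorderThreads; i++) {
      recorders[i] = new Thread(() -> {
        for (int j = 0; j < recordsPerThread; j++) {
          handle.record(1);
        }
      });
      recorders[i].start();
    }
    long total = 0;
    boolean running = true;
    while (running) {
      running = false;
      for (Thread t : recorders) {
        running |= t.isAlive();
      }
      // The last pass collects after every recorder has terminated.
      total += handle.collect();
    }
    return total;
  }

  public static void main(String[] args) {
    System.out.println(runDemo(4, 10_000));
  }
}
```

Because every recorded value ends up in exactly one collected delta, `runDemo(4, 10_000)` totals 40000 regardless of how the threads interleave. The yield only changes how long the collect thread waits, not the result.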

Thread.onSpinWait() would be the textbook choice (cheap PAUSE hint, JIT-friendly), but it is Java 9+. The SDK targets Java 8 via --release 8, and animalsniffer would reject it. Thread.yield() is heavier but works on Java 8 and actually deschedules, which is what the constrained-CPU case needs anyway.
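
To make the trade-off concrete, the two loop shapes can be sketched side by side. This is an illustrative comparison only: the class and method names are hypothetical, and the `spinWithHint` variant needs a Java 9+ compiler target, so it could not ship in a Java 8 build.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical side-by-side of the two spin-loop bodies discussed above. */
public class SpinHintComparison {
  /** Java 9+ only: the thread stays runnable on its core and only emits a spin hint. */
  static long spinWithHint(BooleanSupplier done) {
    long iterations = 0;
    while (!done.getAsBoolean()) {
      Thread.onSpinWait();
      iterations++;
    }
    return iterations;
  }

  /** Java 8 compatible: hints to the scheduler that another thread may take the core. */
  static long spinWithYield(BooleanSupplier done) {
    long iterations = 0;
    while (!done.getAsBoolean()) {
      Thread.yield();
      iterations++;
    }
    return iterations;
  }

  public static void main(String[] args) {
    AtomicBoolean flag = new AtomicBoolean();
    Thread setter = new Thread(() -> flag.set(true));
    setter.start();
    long n = spinWithYield(flag::get);
    System.out.println("yield loop finished after " + n + " iterations");
  }
}
```

Both loops terminate once the condition flips. The difference is what happens to the core in the meantime, and the iteration count printed by `main` varies from run to run.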

Test plan

  • Existing SynchronousInstrumentStressTest continues to pass (50× repetitions)
  • Spotless + compile clean

Both spin loops in DeltaSynchronousMetricStorage burned the collect
thread's core while waiting for a recorder that may have been preempted
mid-section. On constrained CPU (single-CPU containers, pinned cores),
the recorder could not be rescheduled until the OS forced a quantum,
multiplied per stuck handle.

Add Thread.yield() in both loops so the collect thread releases its
core. Matches the existing yield in acquireHandleForRecord.

Signed-off-by: Gregor Zeitlinger <gregor.zeitlinger@grafana.com>
@jack-berg jack-berg merged commit d743384 into jack-berg:delta-aggregator-handle-coordination-1 May 8, 2026
@zeitlinger zeitlinger deleted the stacked/delta-spin-wait branch May 8, 2026 14:31