[PROF-15201] fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle#614
[PROF-15201] fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle#614r1viollet wants to merge 2 commits into
Conversation
… lifecycle The CPU profiler sends SIGPROF to all threads via per-thread kernel timers. The signal handler checks _enabled and, if true, calls recordSample() which accesses JFR buffers. Two races existed around the recording cycle transition (default every 60 s) where JFR structures could be in mid-init or mid-teardown while the handler was active: Race 1 — stop() side (TOCTOU on _enabled vs _jfr.stop()): A handler that passed the _enabled=true check could still be executing inside recordSample() when disableEngines() set _enabled=false and _jfr.stop() freed JFR buffers — use-after-free → SIGSEGV. Fix: add an _inflight counter (incremented on handler entry, decremented on all exits). CTimer::stop() calls drainInflight() after deleting per- thread timers, spinning until _inflight==0 before returning to the caller that proceeds to _jfr.stop(). Any handler that fires after disableEngines() sees _enabled=false and returns early without touching JFR. Race 2 — start() side (enableEngines() before _jfr.start()): enableEngines() set _enabled=true before _jfr.start() had completed. A SIGPROF in that window would see _enabled=true and call recordSample() on partially-initialized JFR structures. Fix: move enableEngines() to after _jfr.start() and _cpu_engine->start() have both returned successfully (just before _state.store(RUNNING)). Validated empirically: a controlled reproducer in DataDog/profiling-backend (AnalysisEndpointTest.testResourceExhausted with a 60 s recording period) showed ~60% failure rate without the fix (SIGSEGV / hang), 0% with both fixes applied (12/12 iterations clean). Each fix alone only partially addressed the failures, confirming both races were independently active. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
CI Test ResultsRun: #28084764170 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Failed Testsmusl-aarch64/debug / 8-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 25-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 21-librcaJob: View logs No detailed failure information available. Check the job logs. musl-amd64/debug / 11-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 11-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 17-librcaJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 17Job: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 17-graalJob: View logs No detailed failure information available. Check the job logs. musl-amd64/debug / 21-librcaJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 25-graalJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 25Job: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 21-graalJob: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 17-j9Job: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 21Job: View logs No detailed failure information available. Check the job logs. glibc-aarch64/debug / 11-j9Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 25Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 17-j9Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 17-graalJob: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 11Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 21Job: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 25-graalJob: View logs No detailed failure information available. Check the job logs. glibc-amd64/debug / 21-graalJob: View logs No detailed failure information available. Check the job logs. Summary: Total: 32 | Passed: 10 | Failed: 22 Updated: 2026-06-24 08:29:18 UTC |
Reliability & Chaos Results❌ 2 failure(s) detected Pipeline: https://gitlab.ddbuild.io/DataDog/java-profiler/-/pipelines/120504164 ❌ chaos: profiler gmalloc amd64 21 0 3 temXchaos❌ chaos: profiler jemalloc aarch64 25 0 3 temXchaos |
…iveness Without a timeout, drainInflight() spins indefinitely if a SIGPROF handler is stuck (e.g. deadlock inside recordSample()). This would prevent _jfr.stop() from running and hang JVM shutdown. Use clock_gettime(CLOCK_MONOTONIC) for a real wall-clock bound of 200ms. If the deadline fires, log a warning and proceed; in that pathological case the caller's JFR teardown may race with the stuck handler, but the JVM is not permanently deadlocked. In normal operation (handlers complete in microseconds) the timeout is never reached. Refs: PROF-15201 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 81184abc5c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // recordSample() on partially-initialized JFR structures. | ||
| // Paired with drainInflight() in CTimer::stop() which closes the | ||
| // symmetric race on the stop side. | ||
| enableEngines(); |
There was a problem hiding this comment.
Enable wall engine before launching its worker
When wall profiling is requested, _wall_engine->start() has already created the BaseWallClock pthread before this call runs; that thread enters timerLoopCommon() and returns immediately if _enabled is still false. On a normal first start or restart _enabled is false, so if the wall thread is scheduled before line 1478, wall-clock sampling silently dies for the whole recording instead of being enabled by this later call.
Useful? React with 👍 / 👎.
| } | ||
| Counters::increment(CTIMER_SIGNAL_OWN); | ||
|
|
||
| __atomic_fetch_add(&_inflight, 1, __ATOMIC_ACQUIRE); |
There was a problem hiding this comment.
Balance JVMTI in-flight counter on init-window return
With jvmtistacks on Linux, this increment is not balanced when the handler takes the current->inInitWindow() early return a few lines below. A single sample delivered during that thread-init window permanently leaves _inflight positive, causing every later CTimer::stop() to wait until the 200 ms timeout and then proceed without knowing whether any real handler is still in flight, which defeats the drain this patch adds.
Useful? React with 👍 / 👎.
What does this PR do?:
Closes two TOCTOU races between the SIGPROF signal handler and JFR lifecycle transitions that could cause SIGSEGV or hangs in the test JVM during the 60-second recording cycle rotation.
Race 1 — stop() side (
ctimer_linux.cpp):disableEngines()sets_enabled=false, but a handler that already passed the_enabled=truecheck could still be executing insiderecordSample()when_jfr.stop()freed JFR buffers → use-after-free → SIGSEGV (or hang if the crash is caught by crashtracking).Fix: add an
_inflightcounter, incremented on every handler entry before the_enabledcheck, decremented on every exit path.CTimer::stop()callsdrainInflight()after deleting per-thread timers, spinning until_inflight==0before returning. The caller (Profiler::stop) then proceeds to_jfr.stop()only once all handlers have fully exited.Race 2 — start() side (
profiler.cpp):enableEngines()set_enabled=truebefore_jfr.start()had completed. A SIGPROF delivered in that window would see_enabled=trueand callrecordSample()on partially-initialized JFR structures.Fix: move
enableEngines()to after both_jfr.start()and_cpu_engine->start()have returned successfully (immediately before_state.store(RUNNING)).Motivation:
Discovered while investigating intermittent SIGSEGV (exit 139) and hang failures in DataDog/profiling-backend CI. Bisected to a dd-trace-java commit that changed instrumentation initialization timing, shifting when the 60-second recording cycle boundary fell relative to test thread activity — exposing both races reliably enough to isolate.
How to test the change?:
Controlled reproducer in DataDog/profiling-backend using
AnalysisEndpointTest.testResourceExhaustedwith the bad dd-trace-java agent (0e13e90dac) and a patchedlibjavaProfiler.so:drainInflight): ~20% failure rate — Race 2 still activeenableEngines): ~40% failure rate — Race 1 still activev_1.44.0baselineAdditional Notes:
drainInflight()is an unbounded spin. In practicerecordSample()completes in microseconds so this is safe, but a bounded spin with a log warning could be added as a follow-up._inflightcounter is incremented even whenCriticalSectionfails (handler returns early without touching JFR). This is intentional: it makes the drain conservative and guarantees the counter reaches zero only after all code paths between the counter increment and any potential JFR access have completed.For Datadog employees: