[PROF-15201] fix(profiler): close two TOCTOU races between SIGPROF handler and JFR lifecycle by r1viollet · Pull Request #614 · DataDog/java-profiler

r1viollet · 2026-06-23T13:25:58Z

What does this PR do?:

Closes two TOCTOU races between the SIGPROF signal handler and JFR lifecycle transitions that could cause SIGSEGV or hangs in the test JVM during the 60-second recording cycle rotation.

Race 1 — stop() side (ctimer_linux.cpp):

disableEngines() sets _enabled=false, but a handler that already passed the _enabled=true check could still be executing inside recordSample() when _jfr.stop() freed JFR buffers → use-after-free → SIGSEGV (or hang if the crash is caught by crashtracking).

Fix: add an _inflight counter, incremented on every handler entry before the _enabled check, decremented on every exit path. CTimer::stop() calls drainInflight() after deleting per-thread timers, spinning until _inflight==0 before returning. The caller (Profiler::stop) then proceeds to _jfr.stop() only once all handlers have fully exited.
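The drain pattern described above can be sketched as follows. This is a minimal illustration, not the code from the PR: `InflightSketch`, `onSignal()`, `stopAndDrain()` and `demoLifecycle()` are hypothetical names, and the sketch uses `std::atomic` with its default sequentially consistent ordering rather than the `__atomic` builtins that appear in the diff.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the CPU timer engine. Names echo the PR text,
// but every body here is illustrative only.
class InflightSketch {
public:
    // Models the SIGPROF handler. Returns true if a sample was recorded.
    bool onSignal() {
        // Count the handler as in flight BEFORE reading _enabled, so a
        // draining thread cannot miss a handler that passed the check.
        _inflight.fetch_add(1);
        bool recorded = false;
        if (_enabled.load()) {
            _samples.fetch_add(1);  // stands in for recordSample()
            recorded = true;
        }
        // Every exit path decrements, including the early-return path.
        _inflight.fetch_sub(1);
        return recorded;
    }

    void enable() { _enabled.store(true); }

    // Models disableEngines() followed by drainInflight().
    void stopAndDrain() {
        _enabled.store(false);
        while (_inflight.load() != 0) {
            // Spin: handlers are expected to finish within microseconds.
        }
        // Only after this point may the caller free the JFR buffers.
    }

    int inflight() const { return _inflight.load(); }
    long samples() const { return _samples.load(); }

private:
    std::atomic<bool> _enabled{false};
    std::atomic<int> _inflight{0};
    std::atomic<long> _samples{0};
};

// Single-threaded walk-through; returns the number of samples recorded.
long demoLifecycle() {
    InflightSketch s;
    bool before = s.onSignal();  // not started: handler returns early
    assert(!before);
    s.enable();
    bool running = s.onSignal(); // running: one sample recorded
    assert(running);
    s.stopAndDrain();            // nothing in flight, returns at once
    bool late = s.onSignal();    // late signal sees _enabled == false
    assert(!late);
    assert(s.inflight() == 0);
    return s.samples();
}
```

The key design point is the order inside `onSignal()`: because the increment precedes the `_enabled` check, any handler that could still reach the JFR access is already visible to the draining thread.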

Race 2 — start() side (profiler.cpp):

enableEngines() set _enabled=true before _jfr.start() had completed. A SIGPROF delivered in that window would see _enabled=true and call recordSample() on partially-initialized JFR structures.

Fix: move enableEngines() to after both _jfr.start() and _cpu_engine->start() have returned successfully (immediately before _state.store(RUNNING)).
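The reordering can be pictured with a small model. The sketch below is an assumption-laden illustration, not the contents of profiler.cpp: `StartOrderSketch`, `jfrStart()`, `cpuEngineStart()`, `handlerWouldRecord()` and the `fail_jfr` knob are hypothetical. Only the ordering of the calls reflects the fix.

```cpp
#include <atomic>
#include <cassert>

enum class State { IDLE, RUNNING };

// Hypothetical model of the start sequence; only the call order matters.
struct StartOrderSketch {
    std::atomic<bool> enabled{false};
    std::atomic<State> state{State::IDLE};
    bool jfr_ready = false;
    bool fail_jfr = false;  // knob to simulate a failing _jfr.start()

    bool jfrStart() {
        if (fail_jfr) return false;
        jfr_ready = true;  // JFR structures are fully initialized
        return true;
    }

    // Stands in for _cpu_engine->start(): once timers are armed, a
    // SIGPROF may arrive at any moment.
    bool cpuEngineStart() { return true; }

    // What a handler firing right now would decide: true means it would
    // go on to call recordSample().
    bool handlerWouldRecord() const { return enabled.load(); }

    // The property the fix restores: enabled implies JFR is ready.
    bool invariantHolds() const { return !enabled.load() || jfr_ready; }

    bool start() {
        if (!jfrStart()) return false;
        if (!cpuEngineStart()) return false;
        // A signal arriving here still sees enabled == false.
        assert(!handlerWouldRecord());
        enabled.store(true);  // enableEngines(), now after both starts
        assert(invariantHolds());
        state.store(State::RUNNING);
        return true;
    }
};
```

A side effect of this ordering, at least in the model, is that a failed start leaves the engines disabled, so no handler can observe `enabled == true` for a recording that never came up.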

Motivation:

Discovered while investigating intermittent SIGSEGV (exit 139) and hang failures in DataDog/profiling-backend CI. Bisected to a dd-trace-java commit that changed instrumentation initialization timing, shifting when the 60-second recording cycle boundary fell relative to test thread activity — exposing both races reliably enough to isolate.

How to test the change?:

Controlled reproducer in DataDog/profiling-backend using AnalysisEndpointTest.testResourceExhausted with the bad dd-trace-java agent (0e13e90dac) and a patched libjavaProfiler.so:

Without fix: ~60% failure rate per iteration (SIGSEGV / hang)
Race 1 fix only (drainInflight): ~20% failure rate — Race 2 still active
Race 2 fix only (move enableEngines): ~40% failure rate — Race 1 still active
Both fixes together: 12/12 iterations clean against v_1.44.0 baseline
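To gauge how much weight 12 clean iterations carry, one can compute the chance of seeing 12/12 passes if a given per-iteration failure rate still applied. The helper below is a back-of-the-envelope sketch; `allCleanProbability` is a hypothetical name, and treating iterations as independent is an assumption of the sketch, not something the measurements establish.

```cpp
#include <cassert>
#include <cmath>

// Probability that n iterations all pass when each one fails with
// probability p_fail, ASSUMING the iterations are independent.
double allCleanProbability(double p_fail, int n) {
    assert(p_fail >= 0.0 && p_fail <= 1.0 && n >= 0);
    return std::pow(1.0 - p_fail, n);
}
```

Under that assumption, 12 clean runs would occur with probability of about 1.7e-5 at a 60% failure rate, about 0.2% at 40%, and about 6.9% at 20%. The result is therefore strong evidence against the unfixed and Race-2-only rates, and somewhat weaker evidence against a small residual rate.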

Additional Notes:

drainInflight() is an unbounded spin. In practice recordSample() completes in microseconds so this is safe, but a bounded spin with a log warning could be added as a follow-up.
The _inflight counter is incremented even when CriticalSection fails (handler returns early without touching JFR). This is intentional: it makes the drain conservative and guarantees the counter reaches zero only after all code paths between the counter increment and any potential JFR access have completed.
Related: Revert "Ignore capturing connection continuation for armeria (#11657)" dd-trace-java#11685 (revert of the dd-trace-java commit that exposed these races).
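The bounded-spin follow-up mentioned in the notes above could look roughly like this. It is a sketch under stated assumptions: `drainInflightBounded`, its timeout parameter, and logging to stderr are all hypothetical choices, and the real profiler may need a different logging path.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>

// Hypothetical bounded variant of drainInflight(). Returns true if the
// counter reached zero before the timeout, false otherwise.
bool drainInflightBounded(const std::atomic<int>& inflight,
                          std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (inflight.load(std::memory_order_acquire) != 0) {
        if (std::chrono::steady_clock::now() >= deadline) {
            // Warn instead of spinning forever.
            std::fprintf(stderr,
                         "drainInflight: %d handler(s) still in flight "
                         "after %lld ms\n",
                         inflight.load(),
                         static_cast<long long>(timeout.count()));
            return false;
        }
        std::this_thread::yield();  // let the handler's thread run
    }
    return true;
}
```

A timeout only reports the problem. If the caller went on to free the JFR buffers after a `false` return, the use-after-free from Race 1 could come back, so the caller would likely need to defer or skip teardown in that case.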

For Datadog employees:

JIRA: [PROF-XXXX]

… lifecycle

The CPU profiler sends SIGPROF to all threads via per-thread kernel timers. The signal handler checks _enabled and, if true, calls recordSample(), which accesses JFR buffers. Two races existed around the recording cycle transition (default every 60 s) where JFR structures could be in mid-init or mid-teardown while the handler was active:

Race 1, stop() side (TOCTOU on _enabled vs _jfr.stop()):

A handler that passed the _enabled=true check could still be executing inside recordSample() when disableEngines() set _enabled=false and _jfr.stop() freed JFR buffers: use-after-free → SIGSEGV.

Fix: add an _inflight counter (incremented on handler entry, decremented on all exits). CTimer::stop() calls drainInflight() after deleting per-thread timers, spinning until _inflight==0 before returning to the caller that proceeds to _jfr.stop(). Any handler that fires after disableEngines() sees _enabled=false and returns early without touching JFR.

Race 2, start() side (enableEngines() before _jfr.start()):

enableEngines() set _enabled=true before _jfr.start() had completed. A SIGPROF in that window would see _enabled=true and call recordSample() on partially-initialized JFR structures.

Fix: move enableEngines() to after _jfr.start() and _cpu_engine->start() have both returned successfully (just before _state.store(RUNNING)).

Validated empirically: a controlled reproducer in DataDog/profiling-backend (AnalysisEndpointTest.testResourceExhausted with a 60 s recording period) showed ~60% failure rate without the fix (SIGSEGV / hang), 0% with both fixes applied (12/12 iterations clean). Each fix alone only partially addressed the failures, confirming both races were independently active.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

datadog-prod-us1-3 · 2026-06-23T13:32:57Z

⚠️ Warnings

🚦 36 Pipeline jobs failed

CI Run | test-matrix / test-linux-glibc-aarch64 (11-j9, debug)

CI Run | test-matrix / test-linux-glibc-aarch64 (17-j9, debug)

CI Run | test-matrix / test-linux-glibc-aarch64 (21, debug)

Commit SHA: 81184ab

dd-octo-sts · 2026-06-23T13:45:31Z

CI Test Results

Run: #28084764170 | Commit: e68bab6 | Duration: 13m 11s (longest job)

❌ 22 of 32 test jobs failed

Status Overview

JDK	glibc-aarch64/debug	glibc-amd64/debug	musl-aarch64/debug	musl-amd64/debug
8	-	✅	-	-
8-ibm	-	✅	-	-
8-j9	✅	✅	-	-
8-librca	-	-	❌	✅
8-orcl	-	✅	-	-
11	-	❌	-	-
11-j9	❌	✅	-	-
11-librca	-	-	❌	❌
17	❌	✅	-	-
17-graal	❌	❌	-	-
17-j9	❌	❌	-	-
17-librca	-	-	❌	✅
21	❌	❌	-	-
21-graal	❌	❌	-	-
21-librca	-	-	❌	❌
25	❌	❌	-	-
25-graal	❌	❌	-	-
25-librca	-	-	❌	✅

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Failed Tests

musl-aarch64/debug / 8-librca

+                  // recordSample() on partially-initialized JFR structures.
+                  // Paired with drainInflight() in CTimer::stop() which closes the
+                  // symmetric race on the stop side.
+                  enableEngines();

                 }
                 Counters::increment(CTIMER_SIGNAL_OWN);
+                __atomic_fetch_add(&_inflight, 1, __ATOMIC_ACQUIRE);

dd-octo-sts Bot commented Jun 23, 2026

Reliability & Chaos Results

r1viollet commented Jun 24, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review
