feat(profiler): wall-clock signal suppression for idle threads#560
feat(profiler): wall-clock signal suppression for idle threads#560kaahos wants to merge 30 commits into
Conversation
|
CI Test ResultsRun: #28086338511 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled Failed Testsmusl-aarch64/debug / 21-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 11-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 25-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 17-librcaJob: View logs No detailed failure information available. Check the job logs. musl-aarch64/debug / 8-librcaJob: View logs No detailed failure information available. Check the job logs. Summary: Total: 32 | Passed: 27 | Failed: 5 Updated: 2026-06-24 08:59:15 UTC |
3f4fe1e to
bcf405b
Compare
bcf405b to
b1cb73f
Compare
|
@copilot review this PR |
Reviewed the changes; no blocking issues found in the wall-clock precheck logic or tests. |
|
[Sphinx Review — HIGH · CONSENSUS]
Suggestion: Either (a) add the same once-per-run precheck logic to
|
|
[Sphinx Review — MEDIUM]
Suggestion: Either strip |
|
[Sphinx Review — LOW · CONSENSUS] The comment justifying the CAS RMW states Suggestion: Replace with: "CAS RMW to update only TYPE_MASK bits without clobbering FLAG_PARKED, which may be written concurrently by |
|
@copilot review this PR. |
|
most of the previous comments have been addressed @jbachorik! The branch now also includes the new features coming from main. |
rkennke
left a comment
There was a problem hiding this comment.
This is a nice improvement! I have some questions...
0940fbe to
3df9af6
Compare
rkennke
left a comment
There was a problem hiding this comment.
Two follow-up accuracy/design questions on the weighted-sampling logic (not blockers). Thanks for addressing all my earlier points thoroughly.
Reliability & Chaos Results❌ 2 failure(s) detected Pipeline: https://gitlab.ddbuild.io/DataDog/java-profiler/-/pipelines/120561457 ❌ chaos: profiler jemalloc amd64 21 0 3 temXchaos❌ chaos: profiler tracer tcmalloc amd64 21 0 3 temXchaos |
rkennke
left a comment
There was a problem hiding this comment.
Thanks for making the changes. The CI failures appear to be infra issues (failure to upload artifacts). Good to go! Thanks!
What does this PR do?:
Before sending
SIGVTALRM, the wall-clock timer checks whether the target thread is already in a skippable OS state and has already produced a sample during this blocking run. If so, the signal is suppressed. The first sample of every blocking run is always collected.Motivation:
High-frequency wall-clock sampling on threads spending most of their time idle generates many redundant samples of identical stacks at high CPU cost. The precheck lets us emit one representative sample per blocking run and skip the rest, recovering signal throughput for threads that are actually running.
This is the first of two PRs splitting
paul.fournillon/wallclock_precheck. This PR ships the suppression infrastructure. The follow-up PR addsdatadog.TaskBlockevent recording.Additional Notes:
How to test the change?:
New unit tests:
ddprof-lib/src/test/cpp/park_state_ut.cppddprof-lib/src/test/cpp/wallprecheck_args_ut.cppNew integration tests:
PrecheckTest: verifies suppression fires and reduces signal count.PrecheckEfficiencyTest: measures overhead reduction under sustained park.WallclockMitigationsCombinedTest: end-to-end combined mitigations scenario.Run locally with:
./gradlew :ddprof-test:test --tests '*Precheck*' --tests '*WallclockMitigations*' --tests '*WallclockTimer*'For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance.Unsure? Have a question? Request a review!