feat(taskworker): Emit worker occupancy metric for autoscaling#734
Merged
Conversation
untitaker
approved these changes
Jun 29, 2026
| WORKER_SERVICE_NAME = "sentry_protos.taskbroker.v1.WorkerService" | ||
|
|
||
|
|
||
| class WorkerPrometheusMetrics: |
Member
There was a problem hiding this comment.
do you think it would be possible to use the existing metrics abstraction, but expose a subset of those metrics to prometheus in addition to the configured backend?
Contributor
Author
There was a problem hiding this comment.
Good point, I think it's definitely possible. As I was thinking how to do that, two things came up:
- app.metrics is built per process, each child re-imports the app through
import_app, so it's not one shared instance. Since Prometheus is pull-based, a Prometheus backend's would start an HTTP server in every child, not just the parent. We can sidestep that by keeping DD as the app backend and wrapping Prometheus only in the parent, but then the wrap is a parent-only special case rather than a uniform backend, so we don't really get the cleanliness the shared abstraction would suggest. - The MetricsBackend API is statsd-shaped so dynamic metric names with free-form tags are created per call. On the other hand, Prometheus needs each metric declared up front with a fixed label set, so we can't forward arbitrary calls. Even exposing just a subset, we'd still hand-declare each metric's name and labels, so routing it through the backend mostly adds a layer without removing that work. This might make sense to do down the line, but for now, we only need one metric (occupancy).
markstory
approved these changes
Jun 30, 2026
| ) | ||
| @click.option( | ||
| "--prometheus-port", | ||
| help="Expose occupancy on this port for Prometheus scraping. Unset = disabled.", |
Member
There was a problem hiding this comment.
Suggested change
| help="Expose occupancy on this port for Prometheus scraping. Unset = disabled.", | |
| help="Expose prometheus metrics on this port for scraping. Unset = disabled.", |
Could generalize this.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Refs: STREAM-1114
Adds a worker-local occupancy signal (
busy_child_processes / concurrency) so taskworkers can be autoscaled on how hard they're actually working. Occupancy is the replacement signal for kafka lag. This PR is the worker-side instrumentation only. The KEDA/HPA wiring lands separately in ops. When this PR is merged, prometheus server will not be enabled yet.Changes:
busy_counteris incremented when a child picks up a task and decremented in afinallyafter it completes.taskworker.worker.occupancyfrom the existing 1s metrics thread. This is on regardless of Prometheus, so occupancy is visible in dashboards for validation before anything scales on it.WorkerPrometheusMetricswhich owns the prometheus registry, HTTP server and exposestaskworker_worker_occupancyfor scraping. Opt-in viaprometheus_portarg on the workerTesting:
taskworker_worker_occupancyon/metricstracked load (rose toward 1.0 under saturation, fell to 0 when idle).