feat(taskworker): Emit worker occupancy metric for autoscaling by enochtangg · Pull Request #734 · getsentry/taskbroker

enochtangg · 2026-06-29T20:05:20Z

Refs: STREAM-1114

Adds a worker-local occupancy signal (busy_child_processes / concurrency) so taskworkers can be autoscaled on how hard they're actually working. Occupancy is the replacement signal for Kafka lag. This PR is the worker-side instrumentation only; the KEDA/HPA wiring lands separately in ops. When this PR is merged, the Prometheus server will not be enabled yet.

Changes:

Introduces a shared busy_counter that is incremented when a child picks up a task and decremented in a finally block after it completes.
Emits a new DD gauge, taskworker.worker.occupancy, from the existing 1s metrics thread. This is on regardless of Prometheus, so occupancy is visible in dashboards for validation before anything scales on it.
Adds a new WorkerPrometheusMetrics class, which owns the Prometheus registry and HTTP server and exposes taskworker_worker_occupancy for scraping. Opt-in via the prometheus_port arg on the worker.
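
The busy counter and the occupancy ratio described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: it assumes the shared counter is a `multiprocessing.Value`, and `track_busy`, `occupancy`, and `run_task` are hypothetical names chosen for the example.

```python
import multiprocessing
from contextlib import contextmanager


@contextmanager
def track_busy(busy_counter):
    """Count the calling child as busy for the duration of one task."""
    with busy_counter.get_lock():
        busy_counter.value += 1
    try:
        yield
    finally:
        # Runs even if the task raises, so the counter always comes back down.
        with busy_counter.get_lock():
            busy_counter.value -= 1


def occupancy(busy_counter, concurrency: int) -> float:
    """Return busy_child_processes / concurrency, or 0.0 if concurrency is not positive."""
    if concurrency <= 0:
        return 0.0
    return busy_counter.value / concurrency


def run_task(busy_counter, task):
    """Hypothetical child-side wrapper: run one task while counted as busy."""
    with track_busy(busy_counter):
        return task()
```

In this sketch the parent would create the counter once (for example `multiprocessing.Value("i", 0)`), pass it to each child, and read `occupancy(counter, concurrency)` from its metrics thread.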

Testing:

Unit test: busy counter returns to 0 after a task completes.
End to end locally: ran the broker, worker, spawned timed tasks, and confirmed taskworker_worker_occupancy on /metrics tracked load (rose toward 1.0 under saturation, fell to 0 when idle).
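
A check like the end-to-end one above could be scripted by reading the gauge out of a scraped /metrics body. The helper below is only a sketch: `parse_gauge` is a hypothetical name, it assumes the metric is exposed as an unlabeled sample in the Prometheus text exposition format, and it leaves fetching the body to the caller.

```python
from typing import Optional

METRIC_NAME = "taskworker_worker_occupancy"


def parse_gauge(metrics_text: str, name: str = METRIC_NAME) -> Optional[float]:
    """Return the value of the unlabeled sample called `name`, or None if absent."""
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comment lines
        parts = line.split()
        # Labeled samples look like name{label="x"} and are deliberately not matched here.
        if parts[0] == name and len(parts) >= 2:
            return float(parts[1])
    return None
```

Polling this while tasks run would show whether the value approaches 1.0 under saturation and returns to 0 when the worker is idle.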

linear-code · 2026-06-29T20:05:24Z

STREAM-1114

untitaker · 2026-06-29T20:29:04Z

 WORKER_SERVICE_NAME = "sentry_protos.taskbroker.v1.WorkerService"


+class WorkerPrometheusMetrics:


do you think it would be possible to use the existing metrics abstraction, but expose a subset of those metrics to prometheus in addition to the configured backend?

enochtangg · 2026-06-30

Good point, I think it's definitely possible. As I was thinking about how to do that, two things came up:

app.metrics is built per process: each child re-imports the app through import_app, so it's not one shared instance. Since Prometheus is pull-based, a Prometheus backend would start an HTTP server in every child, not just the parent. We can sidestep that by keeping DD as the app backend and wrapping Prometheus only in the parent, but then the wrap is a parent-only special case rather than a uniform backend, so we don't really get the cleanliness the shared abstraction would suggest.

The MetricsBackend API is statsd-shaped, so metric names and free-form tags are supplied dynamically on each call. Prometheus, on the other hand, needs each metric declared up front with a fixed label set, so we can't forward arbitrary calls. Even exposing just a subset, we'd still hand-declare each metric's name and labels, so routing it through the backend mostly adds a layer without removing that work. This might make sense to do down the line, but for now we only need one metric (occupancy).
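
The mismatch between the two models can be illustrated with toy classes. Neither is the real MetricsBackend or the prometheus_client API: `StatsdShapedBackend`, `DeclaredGauge`, and `forward` are hypothetical stand-ins that assume only what the comment above states, namely that a statsd-style call accepts any name and tags while a Prometheus-style metric fixes its name and label set at declaration.

```python
from typing import Dict, Optional, Tuple


class StatsdShapedBackend:
    """Accepts any metric name and any tags at call time."""

    def __init__(self) -> None:
        self.samples: Dict[Tuple[str, Tuple[Tuple[str, str], ...]], float] = {}

    def gauge(self, name: str, value: float, tags: Optional[Dict[str, str]] = None) -> None:
        key = (name, tuple(sorted((tags or {}).items())))
        self.samples[key] = value


class DeclaredGauge:
    """Name and label names are fixed when the metric is declared."""

    def __init__(self, name: str, label_names: Tuple[str, ...] = ()) -> None:
        self.name = name
        self.label_names = label_names
        self.samples: Dict[Tuple[str, ...], float] = {}

    def set(self, value: float, **labels: str) -> None:
        # Reject any call whose labels differ from the declared label set.
        if set(labels) != set(self.label_names):
            raise ValueError(f"{self.name} expects labels {self.label_names}, got {tuple(labels)}")
        self.samples[tuple(labels[n] for n in self.label_names)] = value


def forward(
    gauges: Dict[str, DeclaredGauge],
    name: str,
    value: float,
    tags: Optional[Dict[str, str]] = None,
) -> bool:
    """Forward a statsd-shaped call only if a matching gauge was declared."""
    gauge = gauges.get(name)
    if gauge is None:
        return False  # undeclared metric: there is nothing to forward to
    gauge.set(value, **(tags or {}))
    return True
```

Even in this toy version, every forwarded metric needs a hand-written `DeclaredGauge` entry, which is the work the comment says the shared abstraction would not remove.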

markstory · 2026-06-30T14:38:11Z

 )
+@click.option(
+    "--prometheus-port",
+    help="Expose occupancy on this port for Prometheus scraping. Unset = disabled.",


Suggested change

-    help="Expose occupancy on this port for Prometheus scraping. Unset = disabled.",
+    help="Expose prometheus metrics on this port for scraping. Unset = disabled.",

Could generalize this.

enochtangg added 3 commits June 29, 2026 14:23

Expose occupancy metric and prom server in taskworker

6d4c2dc

fix

bcf91cd

remove extra file

50e72f5

enochtangg requested a review from a team as a code owner June 29, 2026 20:05

untitaker approved these changes Jun 29, 2026

markstory approved these changes Jun 30, 2026

generalize comment

2afd074

sentry Bot reviewed Jun 30, 2026

Comment thread clients/python/src/taskbroker_client/worker/worker.py

Merge remote-tracking branch 'origin' into taskworker-autoscaler

c80c630

enochtangg merged commit ee14aca into main Jun 30, 2026
29 checks passed

enochtangg deleted the taskworker-autoscaler branch June 30, 2026 15:32

sentry-release-bot Bot mentioned this pull request Jun 30, 2026

publish: getsentry/taskbroker/clients@0.20.8 getsentry/publish#8750

Closed

3 tasks

enochtangg merged 5 commits into main from taskworker-autoscaler
