feat(taskworker): Emit worker occupancy metric for autoscaling #734

Merged
enochtangg merged 5 commits into main from taskworker-autoscaler on Jun 30, 2026

@enochtangg enochtangg commented Jun 29, 2026

Contributor

Refs: STREAM-1114

Adds a worker-local occupancy signal (busy_child_processes / concurrency) so taskworkers can be autoscaled on how hard they're actually working. Occupancy is the replacement signal for Kafka lag. This PR is the worker-side instrumentation only; the KEDA/HPA wiring lands separately in ops. Merging this PR does not enable the Prometheus server yet.

Changes:

  • Introduces a shared busy_counter that is incremented when a child picks up a task and decremented in a finally block after it completes.
  • Emits a new DD gauge, taskworker.worker.occupancy, from the existing 1s metrics thread. This is on regardless of Prometheus, so occupancy is visible in dashboards for validation before anything scales on it.
  • Adds a new WorkerPrometheusMetrics class, which owns the Prometheus registry and HTTP server and exposes taskworker_worker_occupancy for scraping. Opt-in via the prometheus_port arg on the worker.
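
For readers skimming the description, the counter-plus-ratio idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_task` and `occupancy` are hypothetical names, and the use of `multiprocessing.Value` as the shared counter is an assumption about how a counter could be shared with child processes.

```python
import multiprocessing as mp


def occupancy(busy: int, concurrency: int) -> float:
    """Fraction of child processes currently executing a task."""
    if concurrency <= 0:
        return 0.0
    return busy / concurrency


def run_task(busy_counter, task) -> None:
    """Hypothetical child-side wrapper: count this child as busy while task() runs."""
    with busy_counter.get_lock():
        busy_counter.value += 1
    try:
        task()
    finally:
        # Runs even when the task raises, so the counter cannot leak upward.
        with busy_counter.get_lock():
            busy_counter.value -= 1


if __name__ == "__main__":
    concurrency = 4  # assumed number of child processes
    busy_counter = mp.Value("i", 0)  # shared integer, assumed to be handed to each child
    # Inside the task one child is busy; afterwards the counter is back at zero.
    run_task(busy_counter, lambda: print(occupancy(busy_counter.value, concurrency)))
    print(occupancy(busy_counter.value, concurrency))
```

Decrementing in `finally` is what keeps a failing task from permanently inflating the ratio.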

Testing:

  • Unit test: busy counter returns to 0 after a task completes.
  • End to end locally: ran the broker, worker, spawned timed tasks, and confirmed taskworker_worker_occupancy on /metrics tracked load (rose toward 1.0 under saturation, fell to 0 when idle).
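
The end-to-end check reads the gauge from /metrics. A dependency-free sketch of such a scrape target and the check against it might look like the code below. It is not the PR's WorkerPrometheusMetrics, which is not shown in this conversation: `render_occupancy`, `serve_metrics`, and `scrape` are hypothetical helpers, and the body is hand-written in the Prometheus text exposition format rather than produced by a client library.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def render_occupancy(value: float) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# TYPE taskworker_worker_occupancy gauge\n"
        f"taskworker_worker_occupancy {value}\n"
    )


def serve_metrics(port: int, read_occupancy: Callable[[], float]) -> ThreadingHTTPServer:
    """Serve /metrics from a daemon thread; port 0 lets the OS pick a free port."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            if self.path != "/metrics":
                self.send_error(404)
                return
            # The value is read at scrape time, which is what pull-based means here.
            body = render_occupancy(read_occupancy()).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args) -> None:
            pass  # keep the sketch quiet

    server = ThreadingHTTPServer(("127.0.0.1", port), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(port: int) -> str:
    """Fetch /metrics from localhost, bypassing any configured HTTP proxy."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://127.0.0.1:{port}/metrics", timeout=5) as resp:
        return resp.read().decode()
```

Passing a callable rather than a number means each scrape sees the current occupancy instead of a value captured at startup.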

@enochtangg enochtangg requested a review from a team as a code owner June 29, 2026 20:05

linear-code Bot commented Jun 29, 2026

STREAM-1114

WORKER_SERVICE_NAME = "sentry_protos.taskbroker.v1.WorkerService"


class WorkerPrometheusMetrics:

Member

Do you think it would be possible to use the existing metrics abstraction, but expose a subset of those metrics to Prometheus in addition to the configured backend?

@enochtangg enochtangg Jun 30, 2026

Contributor Author

Good point, I think it's definitely possible. As I was thinking about how to do that, two things came up:

  1. app.metrics is built per process: each child re-imports the app through import_app, so it's not one shared instance. Since Prometheus is pull-based, a Prometheus backend would start an HTTP server in every child, not just the parent. We can sidestep that by keeping DD as the app backend and wrapping Prometheus only in the parent, but then the wrap is a parent-only special case rather than a uniform backend, so we don't really get the cleanliness the shared abstraction would suggest.
  2. The MetricsBackend API is statsd-shaped, so metric names and free-form tags are supplied dynamically on each call. Prometheus, on the other hand, needs each metric declared up front with a fixed label set, so we can't forward arbitrary calls. Even exposing just a subset, we'd still hand-declare each metric's name and labels, so routing it through the backend mostly adds a layer without removing that work. This might make sense to do down the line, but for now, we only need one metric (occupancy).
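
To make the second point concrete, here is a toy contrast between the two shapes. Both classes are hypothetical stand-ins written for this comment; neither is the real MetricsBackend nor a real Prometheus client, and the strict label check is only an approximation of the declare-up-front constraint.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple


class StatsdShapedBackend:
    """Hypothetical push-style backend: any name and any tags are accepted per call."""

    def __init__(self) -> None:
        self.emitted: List[Tuple[str, float, Dict[str, str]]] = []

    def gauge(self, name: str, value: float, tags: Optional[Dict[str, str]] = None) -> None:
        self.emitted.append((name, value, dict(tags or {})))


class DeclaredGaugeRegistry:
    """Hypothetical pull-style registry: gauges are declared first with fixed label names."""

    def __init__(self) -> None:
        self._labels: Dict[str, FrozenSet[str]] = {}
        self.samples: Dict[Tuple[str, Tuple[Tuple[str, str], ...]], float] = {}

    def declare(self, name: str, label_names: Tuple[str, ...] = ()) -> None:
        self._labels[name] = frozenset(label_names)

    def set(self, name: str, value: float, labels: Optional[Dict[str, str]] = None) -> None:
        labels = labels or {}
        if name not in self._labels:
            raise KeyError(f"metric {name!r} was never declared")
        if frozenset(labels) != self._labels[name]:
            raise ValueError(
                f"labels for {name!r} must be exactly {sorted(self._labels[name])}"
            )
        self.samples[(name, tuple(sorted(labels.items())))] = value


if __name__ == "__main__":
    statsd = StatsdShapedBackend()
    statsd.gauge("anything.at.all", 1.0, tags={"free": "form"})  # accepted as-is

    registry = DeclaredGaugeRegistry()
    registry.declare("taskworker_worker_occupancy")  # the hand-declaration step
    registry.set("taskworker_worker_occupancy", 0.75)
    try:
        registry.set("anything_at_all", 1.0)  # an arbitrary forwarded call
    except KeyError as exc:
        print(f"rejected: {exc}")
```

A forwarding layer between the two would still need one `declare` per exposed metric, which is the work a shared backend would not remove.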

Comment thread clients/python/src/examples/cli.py Outdated
)
@click.option(
"--prometheus-port",
help="Expose occupancy on this port for Prometheus scraping. Unset = disabled.",

Member

Suggested change
- help="Expose occupancy on this port for Prometheus scraping. Unset = disabled.",
+ help="Expose prometheus metrics on this port for scraping. Unset = disabled.",

Could generalize this.

Comment thread clients/python/src/taskbroker_client/worker/worker.py
@enochtangg enochtangg merged commit ee14aca into main Jun 30, 2026
29 checks passed
@enochtangg enochtangg deleted the taskworker-autoscaler branch June 30, 2026 15:32