feat(taskbroker): Better Child Management by george-sentry · Pull Request #732 · getsentry/taskbroker

george-sentry · 2026-06-29T03:20:29Z

Linear

Description

In #731, we deferred setting the status to SERVING for workers in push mode until all threads were warmed up because before, Kubernetes thought workers were ready to receive activations even though they weren't, causing significant throughput declines during worker redeployments.

Something similar happens when child processes recycle, or exit voluntarily because they have executed a certain, configurable number of tasks. This is meant to prevent memory leaks.

In some cases, child processes recycle at the same time, decreasing throughput significantly until all child processes are ready again. We can solve this problem by improving the child management process as follows.

Spawn children and wait until they are all ready before advertising SERVING
When a process has executed the maximum number of tasks specified, it does NOT exit right away - instead, it tells the parent that it's ready to exit and continues working
The parent spawns a new child process to replace the old one
Once the newly spawned child is ready, the one that wanted to exit is killed by the parent

This way, after startup, there are always exactly --concurrency child processes at all times.

Before

After

linear-code · 2026-06-29T03:20:33Z

STREAM-1269

untitaker · 2026-06-29T08:39:40Z

When a process has executed the maximum number of tasks specified, it does NOT exit right away - instead, it tells the parent that it's ready to exit and continues working

I believe this may result in higher and less predictable memory usage. Isn't there a way to avoid all child processes restarting at the same time by adding some jitter to the restart interval?

george-sentry · 2026-06-29T15:36:05Z

I believe this may result in higher and less predictable memory usage. Isn't there a way to avoid all child processes restarting at the same time by adding some jitter to the restart interval?

You are right, that is a possible effect. And yes, it is possible to add jitter.

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 12b2d19. Configure here.}

sentry · 2026-06-29T16:14:19Z

            logger.debug("taskworker.worker.spawn_children_thread.started")
+
+            # Queue of incoming message from children
+            messages: multiprocessing.Queue[ChildMessage] = self._mp_context.Queue()


Bug: The messages queue is a local variable in spawn_children_thread and can be garbage-collected upon thread exit, causing child processes to fail when they try to use it.
_{Severity: MEDIUM}

Suggested Fix

Move the messages queue to a class attribute to ensure its lifetime exceeds that of the child processes that use it. Alternatively, explicitly manage the queue's lifecycle by calling messages.close() and messages.join_thread() within a try/finally block in the shutdown flow to guarantee proper cleanup.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: clients/python/src/taskbroker_client/worker/worker.py#L969 Potential issue: The `messages` queue is created as a local variable within the `spawn_children_thread` function. When this thread terminates during shutdown, the `messages` queue object goes out of scope and can be garbage collected by Python. However, child processes, which are terminated shortly after, may still hold references to this queue. If a child attempts to send a message via `messages.put_nowait()` after the queue has been garbage collected, it could lead to a `BrokenPipeError` or a deadlock, creating a race condition during the shutdown sequence.

george-sentry · 2026-06-30T21:32:34Z

Closed in favor of #735.

george-sentry added 2 commits June 28, 2026 18:29

Don't Quit Children Until Replaced

024240e

Add Fake Sleep to Child Process

78f0fce

george-sentry requested a review from a team as a code owner June 29, 2026 03:20

cursor Bot reviewed Jun 29, 2026

View reviewed changes

Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated

sentry Bot reviewed Jun 29, 2026

View reviewed changes

Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated

Remove Fake Sleep

8904e3c

sentry Bot reviewed Jun 29, 2026

View reviewed changes