Skip to content

feat(taskbroker): Better Child Management#732

Closed
george-sentry wants to merge 13 commits into
mainfrom
george/push-taskbroker/better-child-management
Closed

feat(taskbroker): Better Child Management#732
george-sentry wants to merge 13 commits into
mainfrom
george/push-taskbroker/better-child-management

Conversation

@george-sentry

Copy link
Copy Markdown
Member

Linear

Refs STREAM-1269

Description

In #731, we deferred setting the status to SERVING for workers in push mode until all threads were warmed up because before, Kubernetes thought workers were ready to receive activations even though they weren't, causing significant throughput declines during worker redeployments.

Something similar happens when child processes recycle, or exit voluntarily because they have executed a certain, configurable number of tasks. This is meant to prevent memory leaks.

In some cases, child processes recycle at the same time, decreasing throughput significantly until all child processes are ready again. We can solve this problem by improving the child management process as follows.

  • Spawn children and wait until they are all ready before advertising SERVING
  • When a process has executed the maximum number of tasks specified, it does NOT exit right away - instead, it tells the parent that it's ready to exit and continues working
  • The parent spawns a new child process to replace the old one
  • Once the newly spawned child is ready, the one that wanted to exit is killed by the parent

This way, after startup, there are always exactly --concurrency child processes at all times.

image

Before

image

After

@george-sentry george-sentry requested a review from a team as a code owner June 29, 2026 03:20
@linear-code

linear-code Bot commented Jun 29, 2026

Copy link
Copy Markdown

STREAM-1269

Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated
Comment thread clients/python/src/taskbroker_client/worker/workerchild.py Outdated
Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated
Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated
@untitaker

Copy link
Copy Markdown
Member

When a process has executed the maximum number of tasks specified, it does NOT exit right away - instead, it tells the parent that it's ready to exit and continues working

I believe this may result in higher and less predictable memory usage. Isn't there a way to avoid all child processes restarting at the same time by adding some jitter to the restart interval?

Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated
Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated
@george-sentry

Copy link
Copy Markdown
Member Author

I believe this may result in higher and less predictable memory usage. Isn't there a way to avoid all child processes restarting at the same time by adding some jitter to the restart interval?

You are right, that is a possible effect. And yes, it is possible to add jitter.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 12b2d19. Configure here.

Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated
Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated
Comment thread clients/python/src/taskbroker_client/worker/worker.py Outdated
logger.debug("taskworker.worker.spawn_children_thread.started")

# Queue of incoming message from children
messages: multiprocessing.Queue[ChildMessage] = self._mp_context.Queue()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: The messages queue is a local variable in spawn_children_thread and can be garbage-collected upon thread exit, causing child processes to fail when they try to use it.
Severity: MEDIUM

Suggested Fix

Move the messages queue to a class attribute to ensure its lifetime exceeds that of the child processes that use it. Alternatively, explicitly manage the queue's lifecycle by calling messages.close() and messages.join_thread() within a try/finally block in the shutdown flow to guarantee proper cleanup.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: clients/python/src/taskbroker_client/worker/worker.py#L969

Potential issue: The `messages` queue is created as a local variable within the
`spawn_children_thread` function. When this thread terminates during shutdown, the
`messages` queue object goes out of scope and can be garbage collected by Python.
However, child processes, which are terminated shortly after, may still hold references
to this queue. If a child attempts to send a message via `messages.put_nowait()` after
the queue has been garbage collected, it could lead to a `BrokenPipeError` or a
deadlock, creating a race condition during the shutdown sequence.

@george-sentry

Copy link
Copy Markdown
Member Author

Closed in favor of #735.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants