Skip to content

(7) threading rewrite#1555

Open
daniel-noland wants to merge 4 commits into
pr/daniel-noland/drop-dpdk-scaffoldfrom
pr/daniel-noland/threading-rewrite
Open

(7) threading rewrite#1555
daniel-noland wants to merge 4 commits into
pr/daniel-noland/drop-dpdk-scaffoldfrom
pr/daniel-noland/threading-rewrite

Conversation

@daniel-noland
Copy link
Copy Markdown
Collaborator

@daniel-noland daniel-noland commented May 21, 2026

This basically rebuild main in prep for DPDK ACL work to land.

We simply can't manage such an anarchic and nonsensical threading model anymore. Not when we need principled memory mangement, organized shutdown, and an extensible design.

Warning

This step is necessary but in no way sufficient to get all of DPDK to land. This is basically the least drastic thing I could contrive which should still tolerate DPDK ACLs. I'm going to turn around and kick this subsystem much, MUCH harder in the next few weeks as I get DPDK proper to land.

@daniel-noland daniel-noland self-assigned this May 21, 2026
@daniel-noland daniel-noland added the ci:+vlab Enable VLAB tests label May 21, 2026
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch from 82d1188 to 1c14a51 Compare May 21, 2026 05:43
@daniel-noland daniel-noland requested a review from Copilot May 21, 2026 06:00
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces a new lifecycle crate and refactors dataplane threading/shutdown to use structured cancellation (CancellationToken) and task tracking (TaskTracker) instead of ad-hoc stop channels and “private” Tokio runtimes per subsystem. It aligns mgmt/BMP/metrics with a caller-owned mgmt runtime and migrates kernel driver threads to std::thread::scope-owned, cancellation-aware workers.

Changes:

  • Add dataplane-lifecycle crate providing Shutdown/Subsystem primitives, signal handling, and drain semantics.
  • Refactor mgmt/BMP/metrics to spawn onto a provided Tokio runtime and exit via subsystem cancellation tokens.
  • Update router RIO and kernel driver workers to observe cancellation tokens and fit the new shutdown model.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
routing/src/router/rio.rs RIO loop now exits via subsystem cancellation token; handle triggers cancellation and joins thread.
routing/src/router/mod.rs Router construction now threads mgmt/router subsystems + runtime handle; BMP spawn updated.
routing/src/router/ctl.rs Removes RouterCtlMsg::Finish shutdown path in favor of cancellation.
routing/src/frr/test.rs Updates router construction to pass lifecycle subsystems/runtime handle.
routing/src/bmp/mod.rs BMP no longer creates/leaks a runtime; spawns on provided handle and observes cancellation.
routing/Cargo.toml Adds lifecycle dependency.
mgmt/src/tests/mgmt.rs Updates router construction to pass lifecycle subsystems/runtime handle.
mgmt/src/processor/launch.rs Replaces dedicated mgmt thread/runtime with run_mgmt spawning tracked tasks on provided runtime.
mgmt/src/lib.rs Re-exports run_mgmt instead of start_mgmt.
mgmt/Cargo.toml Adds lifecycle dependency.
lifecycle/src/lib.rs New lifecycle primitives: Shutdown, Subsystem, drain deadlines, and signal handler install.
lifecycle/Cargo.toml New crate manifest for lifecycle utilities.
dataplane/src/statistics/mod.rs Replaces dedicated metrics thread/runtime with spawn_metrics on mgmt runtime + cancellation.
dataplane/src/packet_processor/mod.rs Threads mgmt/router subsystems + runtime handle into router startup.
dataplane/src/main.rs Introduces Shutdown and mgmt runtime ownership; unified shutdown signaling and drain flow.
dataplane/src/drivers/tokio_util.rs Removes helper for local runtime setup (now inlined where needed).
dataplane/src/drivers/mod.rs Removes tokio_util module reference.
dataplane/src/drivers/kernel/worker.rs Workers now use scoped threads, build per-thread local runtimes, and observe cancellation in reader loops.
dataplane/src/drivers/kernel/mod.rs Kernel driver now uses scoped worker threads + supervisor reporting fatal via lifecycle subsystem.
dataplane/docs/threading-invariants.md Adds threading invariants document used to guide the rewrite.
dataplane/Cargo.toml Adds lifecycle dependency; adjusts Tokio features for mgmt runtime ownership.
Cargo.toml Adds lifecycle workspace member and dependency alias.
Cargo.lock Adds dataplane-lifecycle and updates dependency graph accordingly.

Comment thread mgmt/src/processor/launch.rs Outdated
Comment thread dataplane/src/main.rs Outdated
Comment thread routing/src/router/rio.rs Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

dataplane/src/drivers/kernel/mod.rs:63

confidence: 9
tags: [logic]

`spawn_workers_scoped` logs and then ignores worker thread spawn failures, so `DriverKernel::start` can return `Ok(())` with fewer than `num_workers` (including zero) workers running. That leaves the dataplane up but non-functional and may never trigger shutdown.

Propagate the failure (e.g., return an error if any `worker.start(...)` fails or if `workers.len() != num_workers`), or explicitly `report_fatal`/cancel the workers subsystem when a worker cannot be spawned.
        match worker.start(scope, builder, interfaces) {
            Ok(handle) => workers.push(handle),
            Err(e) => {
                error!("Failed to start worker {wid}: {e}");
            }
</details>

Comment thread lifecycle/src/lib.rs Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 22 out of 23 changed files in this pull request and generated 3 comments.

Comment thread mgmt/src/processor/launch.rs Outdated
Comment thread dataplane/src/main.rs Outdated
Comment thread dataplane/src/main.rs Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 2 comments.

Comment thread lifecycle/src/lib.rs
Comment thread lifecycle/src/lib.rs Outdated
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch 3 times, most recently from 33be1c2 to 1c3360b Compare May 23, 2026 04:15
@daniel-noland daniel-noland changed the base branch from main to bump/cargo-upgrades May 23, 2026 04:15
@daniel-noland daniel-noland force-pushed the bump/cargo-upgrades branch from 0edba91 to 75bd274 Compare May 23, 2026 04:23
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch 2 times, most recently from e80a331 to 6b9faae Compare May 23, 2026 05:17
@daniel-noland daniel-noland changed the title (sketch) threading rewrite (5) threading rewrite May 23, 2026
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch 7 times, most recently from c410888 to bea93e7 Compare May 23, 2026 19:07
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/drop-dpdk-scaffold branch from a0fc5d8 to b9782bb Compare May 23, 2026 19:24
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch from bea93e7 to f5902c4 Compare May 23, 2026 19:24
@daniel-noland daniel-noland requested a review from Copilot May 23, 2026 19:35
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 1 comment.

Comment thread lifecycle/Cargo.toml Outdated
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch 2 times, most recently from 34adf74 to 17e9fa7 Compare May 23, 2026 19:54
@daniel-noland daniel-noland requested a review from Copilot May 23, 2026 19:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

mgmt/src/processor/launch.rs:182

confidence: 9
tags: [logic]

Same lifetime issue as above: passing `&mgmt.root_token()` across `.await` borrows a temporary `CancellationToken` and is likely to fail to compile. Store `mgmt.root_token()` in a local variable and pass a reference to that instead.
k8s_mgmt_init(&k8s_client, &mgmt.root_token()).await?;
</details>

Comment thread dataplane/src/runtime.rs
Comment thread mgmt/src/processor/launch.rs
Comment thread dataplane/Cargo.toml
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch 5 times, most recently from 554b95e to c6b8f53 Compare May 23, 2026 21:00
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/drop-dpdk-scaffold branch 2 times, most recently from f6f7ab9 to 30e7a6d Compare May 23, 2026 21:02
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch from c6b8f53 to e8061b0 Compare May 23, 2026 21:02
@daniel-noland daniel-noland changed the title (6) threading rewrite (7) threading rewrite May 23, 2026
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/drop-dpdk-scaffold branch from 30e7a6d to 190ea59 Compare May 23, 2026 21:17
daniel-noland and others added 4 commits May 23, 2026 15:17
Add a `dataplane-lifecycle` crate with `Shutdown` and `Subsystem`
primitives, signal-handler installation, and a process-wide shutdown
watchdog.

`Shutdown` bundles a root `CancellationToken` and one `Subsystem` per
long-lived component (workers, router, mgmt, metrics). Each subsystem
exposes a per-subsystem cancel token, a `TaskTracker`, and a shared
fatal flag. Subsystems drain in topological order with per-subsystem
deadlines; the detached watchdog enforces an absolute upper bound on
total shutdown duration.

No consumers yet -- wired up in follow-on commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Plumb lifecycle Subsystems into the routing crate as the first step of
the threading rewrite. The rest of `main`'s shutdown signaling (ctrlc +
mpsc<i32> + start_mgmt + MetricsServer + old DriverKernel::start) stays
in place; follow-on commits migrate each.

- `Router::new` takes `(mgmt, mgmt_handle, router)` Subsystems +
  runtime handle. Plumbed through `packet_processor::start_router`.
- `start_rio` takes `&Subsystem`; the IO loop observes its cancel
  between poll cycles (worst-case exit latency = 1s poll). Adds an
  ExitGuard so panic-unwind or unexpected loop exit reports fatal.
- `RouterCtlMsg::Finish` removed; `RioHandle::finish` becomes idempotent.
- `bmp::spawn_background` spawns onto the caller-provided runtime handle
  tracked under `mgmt`; no more leaked runtime.
- `runtime.rs` builds a multi-thread mgmt runtime (only BMP tenants it
  in this commit) and a `Shutdown` for plumbing into Router::new.
- `dataplane` and `mgmt` Cargo.toml gain `lifecycle` dep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Move mgmt + metrics from dedicated OS threads with private runtimes to
tenants of the multi-thread mgmt runtime introduced in the prior commit.
The kernel driver still uses the legacy ctrlc + mpsc<i32> path; the
final unification commit migrates that.

- `run_mgmt` (renamed from `start_mgmt`): synchronous init on the
  caller-provided handle, then spawns the three long-lived tasks (config
  processor, status updater, config watcher) via
  `Subsystem::spawn_fatal_on_exit` so their unexpected exit flips fatal.
  Init observes `mgmt.root_token()` so SIGINT during k8s retries returns
  `LaunchError::Cancelled` within cancel latency.
- `LaunchError::Cancelled` is a clean-shutdown signal; the call site in
  `runtime.rs` forwards the existing mpsc stop channel with code 0 so
  the legacy shutdown path stays consistent.
- `spawn_metrics` replaces `MetricsServer`: HTTP endpoint, upkeep ticker,
  and stats collector all spawn onto `mgmt_handle` tracked under
  `metrics`. Uses plain `spawn_on` (not `spawn_fatal_on_exit`) — a dead
  metrics endpoint should not take down the dataplane.
- Drop stale `LaunchError` variants no longer constructed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Replace the last legacy shutdown signaling (ctrlc handler + mpsc<i32>
exit code + dedicated controller thread) with the single
`lifecycle::Shutdown` path, and migrate the kernel driver to scoped
threads with cancellation observation. After this commit there is one
signaling path: SIGINT/SIGTERM -> shutdown.root, or any subsystem's
report_fatal -> shutdown.root, with the watchdog as the absolute
upper bound.

- `main`: install `spawn_signal_handler` and `spawn_shutdown_watchdog`,
  run everything inside `concurrency::thread::scope`, block on
  `root.cancelled()`, then drain subsystems in canonical order
  (workers -> router -> metrics -> mgmt). Exit code from `is_fatal()`.
- `DriverKernel::start`: takes `&Scope` and `&Subsystem`; workers spawn
  via `spawn_scoped` with an `ExitGuard` Drop pattern that reports
  fatal on panic-unwind, early `?`-return, and unexpected normal exit.
  Reader loops observe cancel between reads. Supervisor joins-and-logs.
- Drop `dataplane/src/drivers/tokio_util.rs` and its
  `run_in_local_tokio_runtime` helper (inlined where needed).
- Drop `ctrlc` and `mio` from `dataplane` dependencies; drop `ctrlc`
  from the workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Daniel Noland <daniel@githedgehog.com>
@daniel-noland daniel-noland force-pushed the pr/daniel-noland/threading-rewrite branch from e8061b0 to d69246d Compare May 23, 2026 21:17
@daniel-noland daniel-noland marked this pull request as ready for review May 23, 2026 22:31
@daniel-noland daniel-noland requested a review from a team as a code owner May 23, 2026 22:31
@daniel-noland daniel-noland requested review from Fredi-raspall and removed request for a team May 23, 2026 22:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci:+vlab Enable VLAB tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants