(7) threading rewrite#1555
Conversation
82d1188 to
1c14a51
Compare
There was a problem hiding this comment.
Pull request overview
This PR introduces a new lifecycle crate and refactors dataplane threading/shutdown to use structured cancellation (CancellationToken) and task tracking (TaskTracker) instead of ad-hoc stop channels and “private” Tokio runtimes per subsystem. It aligns mgmt/BMP/metrics with a caller-owned mgmt runtime and migrates kernel driver threads to std::thread::scope-owned, cancellation-aware workers.
Changes:
- Add
dataplane-lifecyclecrate providingShutdown/Subsystemprimitives, signal handling, and drain semantics. - Refactor mgmt/BMP/metrics to spawn onto a provided Tokio runtime and exit via subsystem cancellation tokens.
- Update router RIO and kernel driver workers to observe cancellation tokens and fit the new shutdown model.
Reviewed changes
Copilot reviewed 22 out of 23 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| routing/src/router/rio.rs | RIO loop now exits via subsystem cancellation token; handle triggers cancellation and joins thread. |
| routing/src/router/mod.rs | Router construction now threads mgmt/router subsystems + runtime handle; BMP spawn updated. |
| routing/src/router/ctl.rs | Removes RouterCtlMsg::Finish shutdown path in favor of cancellation. |
| routing/src/frr/test.rs | Updates router construction to pass lifecycle subsystems/runtime handle. |
| routing/src/bmp/mod.rs | BMP no longer creates/leaks a runtime; spawns on provided handle and observes cancellation. |
| routing/Cargo.toml | Adds lifecycle dependency. |
| mgmt/src/tests/mgmt.rs | Updates router construction to pass lifecycle subsystems/runtime handle. |
| mgmt/src/processor/launch.rs | Replaces dedicated mgmt thread/runtime with run_mgmt spawning tracked tasks on provided runtime. |
| mgmt/src/lib.rs | Re-exports run_mgmt instead of start_mgmt. |
| mgmt/Cargo.toml | Adds lifecycle dependency. |
| lifecycle/src/lib.rs | New lifecycle primitives: Shutdown, Subsystem, drain deadlines, and signal handler install. |
| lifecycle/Cargo.toml | New crate manifest for lifecycle utilities. |
| dataplane/src/statistics/mod.rs | Replaces dedicated metrics thread/runtime with spawn_metrics on mgmt runtime + cancellation. |
| dataplane/src/packet_processor/mod.rs | Threads mgmt/router subsystems + runtime handle into router startup. |
| dataplane/src/main.rs | Introduces Shutdown and mgmt runtime ownership; unified shutdown signaling and drain flow. |
| dataplane/src/drivers/tokio_util.rs | Removes helper for local runtime setup (now inlined where needed). |
| dataplane/src/drivers/mod.rs | Removes tokio_util module reference. |
| dataplane/src/drivers/kernel/worker.rs | Workers now use scoped threads, build per-thread local runtimes, and observe cancellation in reader loops. |
| dataplane/src/drivers/kernel/mod.rs | Kernel driver now uses scoped worker threads + supervisor reporting fatal via lifecycle subsystem. |
| dataplane/docs/threading-invariants.md | Adds threading invariants document used to guide the rewrite. |
| dataplane/Cargo.toml | Adds lifecycle dependency; adjusts Tokio features for mgmt runtime ownership. |
| Cargo.toml | Adds lifecycle workspace member and dependency alias. |
| Cargo.lock | Adds dataplane-lifecycle and updates dependency graph accordingly. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 22 out of 23 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
dataplane/src/drivers/kernel/mod.rs:63
confidence: 9
tags: [logic]
`spawn_workers_scoped` logs and then ignores worker thread spawn failures, so `DriverKernel::start` can return `Ok(())` with fewer than `num_workers` (including zero) workers running. That leaves the dataplane up but non-functional and may never trigger shutdown.
Propagate the failure (e.g., return an error if any `worker.start(...)` fails or if `workers.len() != num_workers`), or explicitly `report_fatal`/cancel the workers subsystem when a worker cannot be spawned.
match worker.start(scope, builder, interfaces) {
Ok(handle) => workers.push(handle),
Err(e) => {
error!("Failed to start worker {wid}: {e}");
}
</details>
33be1c2 to
1c3360b
Compare
0edba91 to
75bd274
Compare
e80a331 to
6b9faae
Compare
c410888 to
bea93e7
Compare
a0fc5d8 to
b9782bb
Compare
bea93e7 to
f5902c4
Compare
34adf74 to
17e9fa7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
mgmt/src/processor/launch.rs:182
confidence: 9
tags: [logic]
Same lifetime issue as above: passing `&mgmt.root_token()` across `.await` borrows a temporary `CancellationToken` and is likely to fail to compile. Store `mgmt.root_token()` in a local variable and pass a reference to that instead.
k8s_mgmt_init(&k8s_client, &mgmt.root_token()).await?;
</details>
554b95e to
c6b8f53
Compare
f6f7ab9 to
30e7a6d
Compare
c6b8f53 to
e8061b0
Compare
30e7a6d to
190ea59
Compare
Add a `dataplane-lifecycle` crate with `Shutdown` and `Subsystem` primitives, signal-handler installation, and a process-wide shutdown watchdog. `Shutdown` bundles a root `CancellationToken` and one `Subsystem` per long-lived component (workers, router, mgmt, metrics). Each subsystem exposes a per-subsystem cancel token, a `TaskTracker`, and a shared fatal flag. Subsystems drain in topological order with per-subsystem deadlines; the detached watchdog enforces an absolute upper bound on total shutdown duration. No consumers yet -- wired up in follow-on commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Plumb lifecycle Subsystems into the routing crate as the first step of the threading rewrite. The rest of `main`'s shutdown signaling (ctrlc + mpsc<i32> + start_mgmt + MetricsServer + old DriverKernel::start) stays in place; follow-on commits migrate each. - `Router::new` takes `(mgmt, mgmt_handle, router)` Subsystems + runtime handle. Plumbed through `packet_processor::start_router`. - `start_rio` takes `&Subsystem`; the IO loop observes its cancel between poll cycles (worst-case exit latency = 1s poll). Adds an ExitGuard so panic-unwind or unexpected loop exit reports fatal. - `RouterCtlMsg::Finish` removed; `RioHandle::finish` becomes idempotent. - `bmp::spawn_background` spawns onto the caller-provided runtime handle tracked under `mgmt`; no more leaked runtime. - `runtime.rs` builds a multi-thread mgmt runtime (only BMP tenants it in this commit) and a `Shutdown` for plumbing into Router::new. - `dataplane` and `mgmt` Cargo.toml gain `lifecycle` dep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Move mgmt + metrics from dedicated OS threads with private runtimes to tenants of the multi-thread mgmt runtime introduced in the prior commit. The kernel driver still uses the legacy ctrlc + mpsc<i32> path; the final unification commit migrates that. - `run_mgmt` (renamed from `start_mgmt`): synchronous init on the caller-provided handle, then spawns the three long-lived tasks (config processor, status updater, config watcher) via `Subsystem::spawn_fatal_on_exit` so their unexpected exit flips fatal. Init observes `mgmt.root_token()` so SIGINT during k8s retries returns `LaunchError::Cancelled` within cancel latency. - `LaunchError::Cancelled` is a clean-shutdown signal; the call site in `runtime.rs` forwards the existing mpsc stop channel with code 0 so the legacy shutdown path stays consistent. - `spawn_metrics` replaces `MetricsServer`: HTTP endpoint, upkeep ticker, and stats collector all spawn onto `mgmt_handle` tracked under `metrics`. Uses plain `spawn_on` (not `spawn_fatal_on_exit`) — a dead metrics endpoint should not take down the dataplane. - Drop stale `LaunchError` variants no longer constructed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
Replace the last legacy shutdown signaling (ctrlc handler + mpsc<i32> exit code + dedicated controller thread) with the single `lifecycle::Shutdown` path, and migrate the kernel driver to scoped threads with cancellation observation. After this commit there is one signaling path: SIGINT/SIGTERM -> shutdown.root, or any subsystem's report_fatal -> shutdown.root, with the watchdog as the absolute upper bound. - `main`: install `spawn_signal_handler` and `spawn_shutdown_watchdog`, run everything inside `concurrency::thread::scope`, block on `root.cancelled()`, then drain subsystems in canonical order (workers -> router -> metrics -> mgmt). Exit code from `is_fatal()`. - `DriverKernel::start`: takes `&Scope` and `&Subsystem`; workers spawn via `spawn_scoped` with an `ExitGuard` Drop pattern that reports fatal on panic-unwind, early `?`-return, and unexpected normal exit. Reader loops observe cancel between reads. Supervisor joins-and-logs. - Drop `dataplane/src/drivers/tokio_util.rs` and its `run_in_local_tokio_runtime` helper (inlined where needed). - Drop `ctrlc` and `mio` from `dataplane` dependencies; drop `ctrlc` from the workspace. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Daniel Noland <daniel@githedgehog.com>
e8061b0 to
d69246d
Compare
This basically rebuild main in prep for DPDK ACL work to land.
We simply can't manage such an anarchic and nonsensical threading model anymore. Not when we need principled memory mangement, organized shutdown, and an extensible design.
Warning
This step is necessary but in no way sufficient to get all of DPDK to land. This is basically the least drastic thing I could contrive which should still tolerate DPDK ACLs. I'm going to turn around and kick this subsystem much, MUCH harder in the next few weeks as I get DPDK proper to land.