compute: fix "cannot clone ReadHold" panic during shutdown by def- · Pull Request #37014 · MaterializeInc/materialize

def- · 2026-06-12T12:13:07Z

During process shutdown the tokio runtime drops tasks in arbitrary order, so the StorageCollections background task can disappear while an Instance task still processes a queued maintain(). Rehydrating a failed replica then panics in ReadHold::clone when registering the clone with the hung-up issuer (seen in sqllogictest, which tears down a server per file).

Add a fallible ReadHold::try_clone and use it in add_replica_state: on a hung-up issuer, stop populating per-replica collection state but still insert the replica to keep bookkeeping consistent with the controller. Clone keeps panicking for all other callers, since a clone that fails to register is not a real read hold.

Fixes database-issues#9964

During process shutdown the tokio runtime drops tasks in arbitrary order, so the `StorageCollections` background task can disappear while an `Instance` task still processes a queued `maintain()`. Rehydrating a failed replica then panics in `ReadHold::clone` when registering the clone with the hung-up issuer (seen in sqllogictest, which tears down a server per file). Add a fallible `ReadHold::try_clone` and use it in `add_replica_state`: on a hung-up issuer, stop populating per-replica collection state but still insert the replica to keep bookkeeping consistent with the controller. `Clone` keeps panicking for all other callers, since a clone that fails to register is not a real read hold. Fixes database-issues#9964 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

antiguru

I left some comments, but the main hesitation I have with this change is that it leaves the replica and in turn the controller state in an invalid configuration. I'd prefer if we bubble up the error instead of swallowing it, to protect us from genuine storage dependency failures that now would turn into info logs.

antiguru · 2026-06-12T12:14:50Z

+            // collection state, so we stop trying. We still add the replica itself, to keep the
+            // bookkeeping consistent with the controller's, but it will never hydrate.
+            let mut input_read_holds = Vec::with_capacity(collection.storage_dependencies.len());
+            let mut hung_up = None;


Should this be a vector? It'd name all collections in the error message.

antiguru · 2026-06-12T12:17:42Z

+    /// the tokio runtime drops tasks in arbitrary order. Callers that may run
+    /// concurrently with shutdown can use this method to handle that case
+    /// gracefully, instead of panicking via [Clone::clone].
+    pub fn try_clone(&self) -> Result<Self, SendError<(GlobalId, ChangeBatch<Timestamp>)>> {


The SendError is not helpful for the caller, as it's just an indication that the other end has hung up. Either return a concrete error, or an option.

antiguru · 2026-06-12T12:20:10Z

+            // Cloning a `ReadHold` fails when its issuer has hung up. For these holds the issuer
+            // is the `StorageCollections`, which doesn't hang up as long as the `Instance` exists,
+            // except during process shutdown, when the tokio runtime drops tasks in arbitrary
+            // order. In that case there is no way of correctly initializing the per-replica
+            // collection state, so we stop trying. We still add the replica itself, to keep the
+            // bookkeeping consistent with the controller's, but it will never hydrate.


The issue with the below code is that it leaves the replica in an invalid state, even if it's just momentarily. Ideally we return an error to indicate an unrecoverable error that can only happen during shutdown, which in turn should shutdown the compute controller.

This is a risk since the log is an info message, which can easily be overlooked. Consider making it an error message.

Return a concrete `ReadHoldIssuerHungUp` from `ReadHold::try_clone` instead of leaking `SendError`. Collect all hung-up input ids and log them at error level. Propagate the error from `add_replica_state` so `add_replica` terminates the instance task, instead of running on with half-initialized replica state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

def-

lgtm. Keeps failing in CI

antiguru · 2026-06-13T20:54:13Z

Thanks for the re-review! ;-)

def- requested review from a team as code owners June 12, 2026 12:13

antiguru requested changes Jun 12, 2026

View reviewed changes

def- commented Jun 13, 2026

View reviewed changes

antiguru approved these changes Jun 13, 2026

View reviewed changes

antiguru merged commit 248f143 into MaterializeInc:main Jun 13, 2026
117 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compute: fix "cannot clone ReadHold" panic during shutdown#37014

compute: fix "cannot clone ReadHold" panic during shutdown#37014
antiguru merged 2 commits into
MaterializeInc:mainfrom
def-:fix-readhold-clone-panic-shutdown

def- commented Jun 12, 2026

Uh oh!

antiguru left a comment

Uh oh!

antiguru Jun 12, 2026

Uh oh!

antiguru Jun 12, 2026

Uh oh!

antiguru Jun 12, 2026

Uh oh!

def- left a comment

Uh oh!

Uh oh!

antiguru commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

def- commented Jun 12, 2026

Uh oh!

antiguru left a comment

Choose a reason for hiding this comment

Uh oh!

antiguru Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

antiguru Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

antiguru Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

def- left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

antiguru commented Jun 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants