Skip to content

compute: fix "cannot clone ReadHold" panic during shutdown#37014

Merged
antiguru merged 2 commits into
MaterializeInc:mainfrom
def-:fix-readhold-clone-panic-shutdown
Jun 13, 2026
Merged

compute: fix "cannot clone ReadHold" panic during shutdown#37014
antiguru merged 2 commits into
MaterializeInc:mainfrom
def-:fix-readhold-clone-panic-shutdown

Conversation

@def-

@def- def- commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

During process shutdown the tokio runtime drops tasks in arbitrary order, so the StorageCollections background task can disappear while an Instance task still processes a queued maintain(). Rehydrating a failed replica then panics in ReadHold::clone when registering the clone with the hung-up issuer (seen in sqllogictest, which tears down a server per file).

Add a fallible ReadHold::try_clone and use it in add_replica_state: on a hung-up issuer, stop populating per-replica collection state but still insert the replica to keep bookkeeping consistent with the controller. Clone keeps panicking for all other callers, since a clone that fails to register is not a real read hold.

Fixes database-issues#9964

During process shutdown the tokio runtime drops tasks in arbitrary order,
so the `StorageCollections` background task can disappear while an
`Instance` task still processes a queued `maintain()`. Rehydrating a
failed replica then panics in `ReadHold::clone` when registering the
clone with the hung-up issuer (seen in sqllogictest, which tears down a
server per file).

Add a fallible `ReadHold::try_clone` and use it in `add_replica_state`:
on a hung-up issuer, stop populating per-replica collection state but
still insert the replica to keep bookkeeping consistent with the
controller. `Clone` keeps panicking for all other callers, since a
clone that fails to register is not a real read hold.

Fixes database-issues#9964

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@def- def- requested review from a team as code owners June 12, 2026 12:13

@antiguru antiguru left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left some comments, but the main hesitation I have with this change is that it leaves the replica and in turn the controller state in an invalid configuration. I'd prefer if we bubble up the error instead of swallowing it, to protect us from genuine storage dependency failures that now would turn into info logs.

// collection state, so we stop trying. We still add the replica itself, to keep the
// bookkeeping consistent with the controller's, but it will never hydrate.
let mut input_read_holds = Vec::with_capacity(collection.storage_dependencies.len());
let mut hung_up = None;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be a vector? It'd name all collections in the error message.

Comment thread src/storage-types/src/read_holds.rs Outdated
/// the tokio runtime drops tasks in arbitrary order. Callers that may run
/// concurrently with shutdown can use this method to handle that case
/// gracefully, instead of panicking via [Clone::clone].
pub fn try_clone(&self) -> Result<Self, SendError<(GlobalId, ChangeBatch<Timestamp>)>> {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SendError is not helpful for the caller, as it's just an indication that the other end has hung up. Either return a concrete error, or an option.

Comment on lines +381 to +386
// Cloning a `ReadHold` fails when its issuer has hung up. For these holds the issuer
// is the `StorageCollections`, which doesn't hang up as long as the `Instance` exists,
// except during process shutdown, when the tokio runtime drops tasks in arbitrary
// order. In that case there is no way of correctly initializing the per-replica
// collection state, so we stop trying. We still add the replica itself, to keep the
// bookkeeping consistent with the controller's, but it will never hydrate.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue with the below code is that it leaves the replica in an invalid state, even if it's just momentarily. Ideally we return an error to indicate an unrecoverable error that can only happen during shutdown, which in turn should shutdown the compute controller.

This is a risk since the log is an info message, which can easily be overlooked. Consider making it an error message.

Return a concrete `ReadHoldIssuerHungUp` from `ReadHold::try_clone`
instead of leaking `SendError`. Collect all hung-up input ids and log
them at error level. Propagate the error from `add_replica_state` so
`add_replica` terminates the instance task, instead of running on with
half-initialized replica state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@def- def- left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Keeps failing in CI

@antiguru antiguru merged commit 248f143 into MaterializeInc:main Jun 13, 2026
117 checks passed
@antiguru

Copy link
Copy Markdown
Member

Thanks for the re-review! ;-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants