compute: fix "cannot clone ReadHold" panic during shutdown#37014
Conversation
During process shutdown the tokio runtime drops tasks in arbitrary order, so the `StorageCollections` background task can disappear while an `Instance` task still processes a queued `maintain()`. Rehydrating a failed replica then panics in `ReadHold::clone` when registering the clone with the hung-up issuer (seen in sqllogictest, which tears down a server per file). Add a fallible `ReadHold::try_clone` and use it in `add_replica_state`: on a hung-up issuer, stop populating per-replica collection state but still insert the replica to keep bookkeeping consistent with the controller. `Clone` keeps panicking for all other callers, since a clone that fails to register is not a real read hold. Fixes database-issues#9964 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
antiguru
left a comment
There was a problem hiding this comment.
I left some comments, but the main hesitation I have with this change is that it leaves the replica and in turn the controller state in an invalid configuration. I'd prefer if we bubble up the error instead of swallowing it, to protect us from genuine storage dependency failures that now would turn into info logs.
| // collection state, so we stop trying. We still add the replica itself, to keep the | ||
| // bookkeeping consistent with the controller's, but it will never hydrate. | ||
| let mut input_read_holds = Vec::with_capacity(collection.storage_dependencies.len()); | ||
| let mut hung_up = None; |
There was a problem hiding this comment.
Should this be a vector? It'd name all collections in the error message.
| /// the tokio runtime drops tasks in arbitrary order. Callers that may run | ||
| /// concurrently with shutdown can use this method to handle that case | ||
| /// gracefully, instead of panicking via [Clone::clone]. | ||
| pub fn try_clone(&self) -> Result<Self, SendError<(GlobalId, ChangeBatch<Timestamp>)>> { |
There was a problem hiding this comment.
The SendError is not helpful for the caller, as it's just an indication that the other end has hung up. Either return a concrete error, or an option.
| // Cloning a `ReadHold` fails when its issuer has hung up. For these holds the issuer | ||
| // is the `StorageCollections`, which doesn't hang up as long as the `Instance` exists, | ||
| // except during process shutdown, when the tokio runtime drops tasks in arbitrary | ||
| // order. In that case there is no way of correctly initializing the per-replica | ||
| // collection state, so we stop trying. We still add the replica itself, to keep the | ||
| // bookkeeping consistent with the controller's, but it will never hydrate. |
There was a problem hiding this comment.
The issue with the below code is that it leaves the replica in an invalid state, even if it's just momentarily. Ideally we return an error to indicate an unrecoverable error that can only happen during shutdown, which in turn should shutdown the compute controller.
This is a risk since the log is an info message, which can easily be overlooked. Consider making it an error message.
Return a concrete `ReadHoldIssuerHungUp` from `ReadHold::try_clone` instead of leaking `SendError`. Collect all hung-up input ids and log them at error level. Propagate the error from `add_replica_state` so `add_replica` terminates the instance task, instead of running on with half-initialized replica state. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
def-
left a comment
There was a problem hiding this comment.
lgtm. Keeps failing in CI
|
Thanks for the re-review! ;-) |
During process shutdown the tokio runtime drops tasks in arbitrary order, so the
StorageCollectionsbackground task can disappear while anInstancetask still processes a queuedmaintain(). Rehydrating a failed replica then panics inReadHold::clonewhen registering the clone with the hung-up issuer (seen in sqllogictest, which tears down a server per file).Add a fallible
ReadHold::try_cloneand use it inadd_replica_state: on a hung-up issuer, stop populating per-replica collection state but still insert the replica to keep bookkeeping consistent with the controller.Clonekeeps panicking for all other callers, since a clone that fails to register is not a real read hold.Fixes database-issues#9964