Problem
The test test_hintedhandoff_rebalance in scylla-dtest was found to hang indefinitely during cluster.connect() (reported in scylladb/scylla-dtest#7046). The root cause is a cyclic thread deadlock in the driver triggered during session initialization when some cluster nodes are dead.
Root cause
The deadlock occurs when connecting to a cluster where some nodes are dead (e.g. connection attempts fail with ECONNREFUSED) and others are alive. The following race condition leads to a Netty I/O thread blocking on itself:
SessionManager.initAsync() chains updateCreatedPools() via MoreExecutors.directExecutor(), meaning it executes on whichever thread completes the last pool-creation future — which can be a Netty I/O thread.
triggerOnDown() submits DOWN processing to a background executor asynchronously. If that executor hasn't processed the conviction yet, dead nodes still appear UP in the driver's topology state.
updateCreatedPools() therefore attempts to create a pool for a dead node. This flows through HostConnectionPool.initAsync() → Connection.Factory.open() → connection.initAsync().get() — a synchronous blocking wait on the calling Netty I/O thread.
Netty uses a round-robin GenericEventExecutorChooser to assign new channels to I/O threads. If the new channel is assigned to the same thread that is already blocked on .get(), the connection's initialization can never complete, which results in a permanent cyclic deadlock.
Connect timeouts do not break the deadlock, because the timeout tasks are also queued on the blocked thread and therefore never run.
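
The self-blocking pattern described above can be reproduced in isolation. The following is a minimal sketch using only the JDK, not the driver or Netty: a single-threaded ScheduledExecutorService stands in for one Netty I/O thread, and the class and method names (SelfBlockingLoopDemo, blockOnOwnLoop) are hypothetical. The outer task queues its own "connect" work and a 50 ms "connect timeout" on the same thread, then blocks on the result. A bounded get() is used as a safety net so the demo terminates; the unbounded .get() in the scenario above would wait forever.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class SelfBlockingLoopDemo {

    /**
     * Runs a task on a single-threaded loop that queues more work on the same
     * loop and then blocks waiting for it. Returns a description of what happened.
     */
    static String blockOnOwnLoop(long safetyNetMillis) throws Exception {
        // Assumption: one thread models one I/O thread of an event loop group.
        ScheduledExecutorService loop = Executors.newSingleThreadScheduledExecutor();
        try {
            Future<String> outer = loop.submit(() -> {
                CompletableFuture<String> connect = new CompletableFuture<>();

                // "Connection initialization": queued on the same (busy) thread.
                loop.execute(() -> connect.complete("connected"));

                // "Connect timeout": also queued on the same (busy) thread.
                loop.schedule(() -> {
                    connect.completeExceptionally(new TimeoutException("connect timeout"));
                }, 50, TimeUnit.MILLISECONDS);

                try {
                    // Blocking wait on the loop thread itself. The bound exists
                    // only so that this demo can finish.
                    return connect.get(safetyNetMillis, TimeUnit.MILLISECONDS);
                } catch (ExecutionException e) {
                    return "connect timeout fired";
                } catch (TimeoutException e) {
                    return "still blocked";
                }
            });
            return outer.get();
        } finally {
            loop.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(blockOnOwnLoop(300)); // prints "still blocked"
    }
}
```

Neither the queued completion nor the 50 ms timeout task runs while the only thread is parked in get(), so the safety net is the only thing that ends the wait. Moving the blocking wait off the loop thread, or composing the futures asynchronously instead of calling get(), removes the cycle in this sketch.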