
Java driver may get stuck while processing dead nodes #920

@dawmd

Problem

The test test_hintedhandoff_rebalance in scylla-dtest was found to hang indefinitely during cluster.connect() (reported in scylladb/scylla-dtest#7046). The root cause is a cyclic thread deadlock in the driver triggered during session initialization when some cluster nodes are dead.

Root cause

The deadlock occurs when connecting to a cluster where some nodes are dead (e.g. connection attempts fail with ECONNREFUSED) and others are alive. The following race condition leads to a Netty I/O thread blocking on work that only it can complete:

  1. SessionManager.initAsync() chains updateCreatedPools() via MoreExecutors.directExecutor(), meaning it executes on whichever thread completes the last pool-creation future — which can be a Netty I/O thread.
  2. triggerOnDown() submits DOWN processing to a background executor asynchronously. If that executor hasn't processed the conviction yet, dead nodes still appear UP in the driver's topology state.
  3. updateCreatedPools() therefore attempts to create a pool for a dead node. This flows through HostConnectionPool.initAsync() → Connection.Factory.open() → connection.initAsync().get(), which is a synchronous blocking wait on the calling Netty I/O thread.
  4. Netty uses a round-robin GenericEventExecutorChooser to assign new channels to I/O threads. If the new channel is assigned to the same thread that is already blocked on .get(), it can never complete — a permanent cyclic deadlock.
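The blocking pattern in steps 3 and 4 can be reproduced in isolation. The following is a minimal sketch, not driver code: it assumes a single-threaded `ExecutorService` as a stand-in for one Netty I/O thread, and the class and method names (`SelfDeadlockSketch`, `innerTaskStarved`) are hypothetical. The outer task plays the role of `updateCreatedPools()` running on the I/O thread; the inner task plays the role of the new channel's initialization landing on that same thread. A bounded `get()` is used only so the demo terminates; the driver's unbounded `.get()` would wait forever.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class SelfDeadlockSketch {
    // Returns true if the inner task could not complete while the outer task
    // was blocked waiting for it on the same (only) thread.
    static boolean innerTaskStarved(long waitMillis) throws Exception {
        // Assumption: one thread stands in for a single Netty I/O thread.
        ExecutorService ioThread = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> outer = ioThread.submit(() -> {
                // The "connection init" is queued on the thread we are running on.
                Future<String> inner = ioThread.submit(() -> "connected");
                try {
                    // Blocking wait on the I/O thread itself.
                    inner.get(waitMillis, TimeUnit.MILLISECONDS);
                    return false;
                } catch (TimeoutException e) {
                    // The inner task never ran: its only thread is busy waiting.
                    return true;
                }
            });
            return outer.get();
        } finally {
            ioThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("inner task starved: " + innerTaskStarved(200));
    }
}
```

With more than one I/O thread the same outcome depends on which thread the new channel is assigned to, which is why the hang is intermittent.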

Connect timeouts do not break the deadlock, because the timeout tasks are queued on the same blocked thread and therefore never run.
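Why the timeout cannot fire can be illustrated the same way. This is a hedged sketch using a single-threaded `ScheduledExecutorService` as a stand-in for an event loop that runs both I/O work and scheduled tasks; it does not use Netty or the driver, and `StarvedTimeoutSketch` and `timeoutFiredWhileBlocked` are hypothetical names. `Thread.sleep` stands in for the blocking `.get()`.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class StarvedTimeoutSketch {
    // Returns whether a timeout scheduled on the event loop managed to fire
    // while that same event loop thread was blocked.
    static boolean timeoutFiredWhileBlocked(long timeoutMillis, long blockMillis) throws Exception {
        // Assumption: one thread handles both regular tasks and scheduled timeouts.
        ScheduledExecutorService eventLoop = Executors.newSingleThreadScheduledExecutor();
        try {
            AtomicBoolean fired = new AtomicBoolean(false);
            Future<Boolean> outer = eventLoop.submit(() -> {
                // Schedule the "connect timeout" on the thread we are running on.
                eventLoop.schedule(() -> fired.set(true), timeoutMillis, TimeUnit.MILLISECONDS);
                // Block well past the timeout, as the blocking wait would.
                Thread.sleep(blockMillis);
                // Read the flag before releasing the thread.
                return fired.get();
            });
            return outer.get();
        } finally {
            eventLoop.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("timeout fired while blocked: " + timeoutFiredWhileBlocked(50, 200));
    }
}
```

The timeout is due after 50 ms, but it can only run once the blocked task returns; in the driver's case that never happens.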
