---
title: "DDLWorker and DDL queue problems"
linkTitle: "DDLWorker and DDL queue problems"
weight: 100
description: >
  Finding and troubleshooting problems in the `distributed_ddl_queue`
keywords:
- clickhouse ddl
- clickhouse replication queue
- distributed_ddl_queue
- DDLWorker
---

`DDLWorker` is a thread inside `clickhouse-server` that executes `ON CLUSTER`
tasks on the local node.

When a DDL is run with `ON CLUSTER mycluster`, the initiator node reads the
`mycluster` definition from `system.clusters` and writes a single task znode
`/clickhouse/task_queue/ddl/query-NNNNNNNNNN` in ZooKeeper. Its value contains
the query and the list of target hosts. Each target's `DDLWorker` polls
`/clickhouse/task_queue/ddl/`, claims tasks addressed to its own host name,
registers itself under the task's `active/` child while executing, then
writes its result under the task's `finished/` child when done.
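
The raw queue those workers poll can be listed through the `system.zookeeper`
table (a minimal sketch, assuming the default `<distributed_ddl><path>`):

```sql
SELECT name, ctime, numChildren
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl'
ORDER BY name;
```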

The most frequent failure mode is a hostname or IP mismatch between the
cluster definition and what each node thinks its own name is — a host never
picks up tasks addressed to it under a name it doesn't recognize. See
[Hostname / IP mismatch](#hostname--ip-mismatch) below.

For deep-dive symptoms see
[There are N unfinished hosts](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/there-are-n-unfinished-hosts-0-of-them-are-currently-active/).
For the underlying ZooKeeper layer see
[ZooKeeper](/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/).

## Inspecting the queue: `system.distributed_ddl_queue`

Start here before reaching for raw `system.zookeeper` queries — the system
table joins state from ZooKeeper and the local executor and answers the typical
"who is stuck and why" question:

```sql
SELECT entry, host, port, status, exception_code, exception_text,
       query_create_time, query_finish_time, query
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY entry DESC, host
LIMIT 50;
```

For per-task znode digs (children of `finished/`, `active/`, raw task body) see
the SQL recipes in
[There are N unfinished hosts](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/there-are-n-unfinished-hosts-0-of-them-are-currently-active/).
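
A minimal sketch of that kind of dig; the task name `query-0000000123` is made
up, substitute an `entry` value from `system.distributed_ddl_queue`:

```sql
-- Raw task body (query text plus the list of target hosts).
SELECT value
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl' AND name = 'query-0000000123';

-- Hosts that have already reported a result for this task.
SELECT name, value
FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000000123/finished';
```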

## Hostname / IP mismatch

The initiator addresses tasks to a host using the name it has in
`system.clusters`. If the target host's `system.clusters.is_local = 0` for its
own row, `DDLWorker` won't claim those tasks — it's waiting for tasks addressed
to a different name (often `localhost`, an internal Docker hostname like
`xdgt634678d`, or a different IP family).

Checklist on the host that isn't picking up tasks:

```sql
-- Should return is_local = 1 for the row matching this node.
SELECT cluster, host_name, host_address, port, is_local
FROM system.clusters
WHERE cluster = 'mycluster';
```

```bash
hostname --fqdn
cat /etc/hostname
cat /etc/hosts
getent hosts $(hostname --fqdn)
```
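
You can also ask ClickHouse itself which name it answers to and compare that
with the cluster definition (both functions are built in):

```sql
-- Expect these to line up with host_name / host_address in system.clusters.
SELECT hostName(), FQDN();
```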

On Debian/Ubuntu the FQDN often resolves to `127.0.1.1`, which doesn't match
any real interface and trips this exact failure — see
[ClickHouse#23504](https://github.com/ClickHouse/ClickHouse/issues/23504).

## DDLWorker thread crashed

If the thread dies, `ON CLUSTER` tasks stop executing on this node.

Check that both threads are alive:

```bash
ps -eL|grep DDL
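# 18829 is the PID of the clickhouse-server process that owns the DDL threads;
# substitute the PID from your own output.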
ps -ef|grep 18829|grep -v grep
clickho+ 18829 18828 1 Feb09 ? 00:55:00 /usr/bin/clickhouse-server --con...
```

Two threads should be present: `DDLWorker` (executes tasks) and `DDLWorkerClnr`
(cleans old tasks from `task_queue/ddl/`).
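
The same check can be done from SQL, assuming a release where
`system.stack_trace` exposes the `thread_name` column and you are allowed to
read that table:

```sql
SELECT thread_name, thread_id
FROM system.stack_trace
WHERE thread_name LIKE 'DDLWorker%';
```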

If either is missing, the only reliable recovery is a `clickhouse-server`
restart. Capture
`/var/log/clickhouse-server/clickhouse-server.err.log` and the matching
`clickhouse-server.log` window first — the crash reason is usually visible
there and you'll want it to file a bug.

You can tune the cleaner from `config.xml`:

```xml
<clickhouse>
    <distributed_ddl>
        <path>/clickhouse/task_queue/ddl</path>
        <pool_size>1</pool_size>
        <max_tasks_in_queue>1000</max_tasks_in_queue>
        <task_max_lifetime>604800</task_max_lifetime>
        <cleanup_delay_period>60</cleanup_delay_period>
    </distributed_ddl>
</clickhouse>
```

Defaults:

- **cleanup_delay_period** = `60` seconds — how often the cleaner runs.
- **task_max_lifetime** = `604800` seconds (1 week) — older tasks are deleted.
- **max_tasks_in_queue** = `1000` — soft cap on retained tasks.
- **pool_size** = `1` — how many `ON CLUSTER` queries run concurrently.

## Too intensive stream of ON CLUSTER commands

Generally this is a design problem, but `pool_size` can be raised so more
DDLs run in parallel on each node (the default is `1`). Raise it gradually
and watch ZooKeeper write rate and per-node memory — every additional
concurrent DDL can trigger heavy operations (mutations, ALTERs) that compete
for memory and replication queue slots.

If raising `pool_size` doesn't keep up, the fix is upstream: batch the DDLs,
replace cluster-wide `DELETE WHERE …` with lightweight deletes or partition
drops, or use `CREATE TEMPORARY TABLE` for transient intermediates so the
per-session table is dropped automatically.
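
For example, a cluster-wide delete of one day of data is usually better
expressed as a partition drop; the table name and partitioning scheme below
are illustrative:

```sql
-- Heavy: schedules a mutation on every replica of every shard.
ALTER TABLE events ON CLUSTER mycluster DELETE WHERE toYYYYMMDD(event_date) = 20240101;

-- Cheap: metadata-only, assuming the table is PARTITION BY toYYYYMMDD(event_date).
ALTER TABLE events ON CLUSTER mycluster DROP PARTITION 20240101;
```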

## Stuck DDL tasks in the `distributed_ddl_queue`

`ON CLUSTER` tasks can pile up when many DDLs (thousands of
CREATE/DROP/ALTER) hit the cluster at once — common in heavy ETL jobs. They
show up in `system.distributed_ddl_queue` as long-`query_create_time` rows
that aren't moving.
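
To see how long the unfinished entries have been waiting:

```sql
SELECT entry, host, status, query_create_time,
       dateDiff('minute', query_create_time, now()) AS waiting_min
FROM system.distributed_ddl_queue
WHERE status != 'Finished'
ORDER BY query_create_time
LIMIT 20;
```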

If the DDL finished on some replicas but failed on others, the simplest fix is
to rerun the failed statement on the missing replicas **without** `ON
CLUSTER`. If most failed, check `system.distributed_ddl_queue` on every node —
the backlog is often in the thousands.

Snapshot the queue first so you don't lose the state:

```sql
CREATE TABLE default.system_distributed_ddl_queue
AS SELECT * FROM system.distributed_ddl_queue;
```
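
From the snapshot you can then list, per task, which hosts never reported
success (the table name comes from the statement above):

```sql
SELECT entry,
       any(query) AS query,
       groupArray(host) AS hosts_not_finished
FROM default.system_distributed_ddl_queue
WHERE status != 'Finished'
GROUP BY entry
ORDER BY entry;
```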

Then work through the snapshot, executing the missing statements locally and
fixing the pipeline that's spamming `ON CLUSTER`. `CREATE TEMPORARY TABLE`
plus `ALTER TABLE final ATTACH PARTITION <part_expr> FROM temp` is a common way to
avoid cluster-wide DDLs for staging.
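
A sketch of that staging pattern, assuming a recent server version that allows
MergeTree temporary tables and a target table `final` partitioned by
`toYYYYMM(event_date)`; all names are illustrative:

```sql
-- Session-local staging table with the same layout as `final`; it is dropped
-- automatically when the session ends, so no ON CLUSTER DDL is needed for it.
CREATE TEMPORARY TABLE staging
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, id)
AS SELECT * FROM final WHERE 0;

-- Heavy preparation happens only in the temporary table.
INSERT INTO staging SELECT * FROM raw_events WHERE toYYYYMM(event_date) = 202401;

-- Move the prepared partition into the real table in one cheap metadata step.
ALTER TABLE final ATTACH PARTITION 202401 FROM staging;
```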