diff --git a/content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/_index.md b/content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/_index.md
index 0722b0021c..3ae7c4c34e 100644
--- a/content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/_index.md
+++ b/content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/_index.md
@@ -1,27 +1,89 @@
---
title: "DDLWorker and DDL queue problems"
linkTitle: "DDLWorker and DDL queue problems"
-description: >
+weight: 100
+description: >
Finding and troubleshooting problems in the `distributed_ddl_queue`
-keywords:
- - clickhouse ddl
- - clickhouse replication queue
+keywords:
+ - clickhouse ddl
+ - clickhouse replication queue
+ - distributed_ddl_queue
+ - DDLWorker
---
-DDLWorker is a subprocess (thread) of `clickhouse-server` that executes `ON CLUSTER` tasks at the node.
-When you execute a DDL query with `ON CLUSTER mycluster` section, the query executor at the current node reads the cluster `mycluster` definition (remote_servers / system.clusters) and places tasks into Zookeeper znode `task_queue/ddl/...` for members of the cluster `mycluster`.
+`DDLWorker` is a thread inside `clickhouse-server` that executes `ON CLUSTER`
+tasks on the local node.
-DDLWorker at all ClickHouse® nodes constantly check this `task_queue` for their tasks, executes them locally, and reports about the results back into `task_queue`.
+When a DDL is run with `ON CLUSTER mycluster`, the initiator node reads the
+`mycluster` definition from `system.clusters` and writes a single task znode
+`/clickhouse/task_queue/ddl/query-NNNNNNNNNN` in ZooKeeper. Its value contains
+the query and the list of target hosts. Each target's `DDLWorker` polls
+`/clickhouse/task_queue/ddl/`, claims tasks addressed to its own host name,
+registers itself under the task's `active/` child while executing, then
+writes its result under the task's `finished/` child when done.
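+
+You can watch this lifecycle directly. A quick peek at the raw queue
+(assuming the default `task_queue` path):
+
+```sql
+-- Each child znode is one ON CLUSTER task; its value holds the query
+-- text and the list of target hosts.
+SELECT name, value
+FROM system.zookeeper
+WHERE path = '/clickhouse/task_queue/ddl';
+```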
-The common issue is the different hostnames/IPAddresses in the cluster definition and locally.
+The most frequent failure mode is a hostname or IP mismatch between the
+cluster definition and what each node thinks its own name is — a host never
+picks up tasks addressed to it under a name it doesn't recognize. See
+[Hostname / IP mismatch](#hostname--ip-mismatch) below.
-So if the initiator node puts tasks for a host named Host1. But the Host1 thinks about own name as localhost or **xdgt634678d** (internal docker hostname) and never sees tasks for the Host1 because is looking tasks for **xdgt634678d.** The same with internal VS external IP addresses.
+For deep-dive symptoms see
+[There are N unfinished hosts](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/there-are-n-unfinished-hosts-0-of-them-are-currently-active/).
+For the underlying ZooKeeper layer see
+[ZooKeeper](/altinity-kb-setup-and-maintenance/altinity-kb-zookeeper/).
+
+## Inspecting the queue: `system.distributed_ddl_queue`
+
+Start here before reaching for raw `system.zookeeper` queries — the system
+table joins state from ZooKeeper and the local executor and answers the typical
+"who is stuck and why" question:
+
+```sql
+SELECT entry, host, port, status, exception_code, exception_text,
+ query_create_time, query_finish_time, query
+FROM system.distributed_ddl_queue
+WHERE status != 'Finished'
+ORDER BY entry DESC, host
+LIMIT 50;
+```
+
+For per-task znode inspection (children of `finished/`, `active/`, the raw task body) see
+the SQL recipes in
+[There are N unfinished hosts](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/there-are-n-unfinished-hosts-0-of-them-are-currently-active/).
+
+## Hostname / IP mismatch
+
+The initiator addresses tasks to a host using the name it has in
+`system.clusters`. If the target host's `system.clusters.is_local = 0` for its
+own row, `DDLWorker` won't claim those tasks — it's waiting for tasks addressed
+to a different name (often `localhost`, an internal Docker hostname like
+`xdgt634678d`, or a different IP family).
+
+Checklist on the host that isn't picking up tasks:
+
+```sql
+-- Should return is_local = 1 for the row matching this node.
+SELECT cluster, host_name, host_address, port, is_local
+FROM system.clusters
+WHERE cluster = 'mycluster';
+```
+
+```bash
+hostname --fqdn
+cat /etc/hostname
+cat /etc/hosts
+getent hosts $(hostname --fqdn)
+```
+
+On Debian/Ubuntu the FQDN often resolves to `127.0.1.1`, which doesn't match
+any real interface and trips this exact failure — see
+[ClickHouse#23504](https://github.com/ClickHouse/ClickHouse/issues/23504).
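+
+A quick way to see the names this node matches against the cluster
+definition's `host_name`:
+
+```sql
+-- hostName() is the name the server resolved at startup,
+-- FQDN() the fully qualified one.
+SELECT hostName(), FQDN();
+```
+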
## DDLWorker thread crashed
-That causes ClickHouse to stop executing `ON CLUSTER` tasks.
+If the thread dies, `ON CLUSTER` tasks stop executing on this node.
-Check that DDLWorker is alive:
+Check that both threads are alive:
```bash
ps -eL|grep DDL
@@ -32,13 +94,19 @@ ps -ef|grep 18829|grep -v grep
clickho+ 18829 18828 1 Feb09 ? 00:55:00 /usr/bin/clickhouse-server --con...
```
-As you can see there are two threads: `DDLWorker` and `DDLWorkerClnr`.
+Two threads should be present: `DDLWorker` (executes tasks) and `DDLWorkerClnr`
+(cleans old tasks from `task_queue/ddl/`).
-The second thread – `DDLWorkerCleaner` cleans old tasks from `task_queue`. You can configure how many recent tasks to store:
+If either is missing, the only reliable recovery is a `clickhouse-server`
+restart. Capture
+`/var/log/clickhouse-server/clickhouse-server.err.log` and the matching
+`clickhouse-server.log` window first — the crash reason is usually visible
+there and you'll want it to file a bug.
-```markup
-config.xml
-
+You can tune the cleaner from `config.xml`:
+
+```xml
+<distributed_ddl>
     <path>/clickhouse/task_queue/ddl</path>
     <pool_size>1</pool_size>
@@ -46,35 +114,49 @@ config.xml
     <task_max_lifetime>604800</task_max_lifetime>
     <cleanup_delay_period>60</cleanup_delay_period>
-</distributed_ddl>
+</distributed_ddl>
```
-Default values:
-
-**cleanup_delay_period** = 60 seconds – Sets how often to start cleanup to remove outdated data.
-
-**task_max_lifetime** = 7 \* 24 \* 60 \* 60 (in seconds = week) – Delete task if its age is greater than that.
+Defaults:
-**max_tasks_in_queue** = 1000 – How many tasks could be in the queue.
+- **cleanup_delay_period** = `60` seconds — how often the cleaner runs.
+- **task_max_lifetime** = `604800` seconds (1 week) — older tasks are deleted.
+- **max_tasks_in_queue** = `1000` — soft cap on retained tasks.
+- **pool_size** = `1` — how many `ON CLUSTER` queries run concurrently.
-**pool_size** = 1 - How many ON CLUSTER queries can be run simultaneously.
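+
+To see how close the queue is to `max_tasks_in_queue`, count the retained
+task znodes (a rough check, default path assumed; the cleaner may lag):
+
+```sql
+SELECT count()
+FROM system.zookeeper
+WHERE path = '/clickhouse/task_queue/ddl';
+```
+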
+## Too intensive stream of ON CLUSTER commands
-## Too intensive stream of ON CLUSTER command
+Generally this is a design problem, but `pool_size` can be raised so more
+DDLs run in parallel on each node (the default is `1`). Raise it gradually
+and watch ZooKeeper write rate and per-node memory — every additional
+concurrent DDL can trigger heavy operations (mutations, ALTERs) that compete
+for memory and replication queue slots.
-Generally, it's a bad design, but you can increase pool_size setting
+If raising `pool_size` doesn't keep up, the fix is upstream: batch the DDLs,
+replace cluster-wide `DELETE WHERE …` with lightweight deletes or partition
+drops, or use `CREATE TEMPORARY TABLE` for transient intermediates so the
+per-session table is dropped automatically.
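+
+For example (hypothetical table and partition; lightweight deletes need
+ClickHouse 22.8+), the cheaper alternatives look like:
+
+```sql
+-- Lightweight delete: replicates on its own, no ON CLUSTER needed
+-- on a Replicated*MergeTree table.
+DELETE FROM db.table_local WHERE id = 1;
+
+-- Dropping a whole partition is cheaper still.
+ALTER TABLE db.table_local DROP PARTITION '2024-01-01';
+```
+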
-## Stuck DDL tasks in the distributed_ddl_queue
+## Stuck DDL tasks in the `distributed_ddl_queue`
-Sometimes [DDL tasks](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/) (the ones that use ON CLUSTER) can get stuck in the `distributed_ddl_queue` because the replicas can overload if multiple DDLs (thousands of CREATE/DROP/ALTER) are executed at the same time. This is very normal in heavy ETL jobs.This can be detected by checking the `distributed_ddl_queue` table and see if there are tasks that are not moving or are stuck for a long time.
+`ON CLUSTER` tasks can pile up when many DDLs (thousands of
+CREATE/DROP/ALTER) hit the cluster at once — common in heavy ETL jobs. They
+show up in `system.distributed_ddl_queue` as rows with an old
+`query_create_time` that never reach `Finished`.
-If these DDLs are completed in some replicas but failed in others, the simplest way to solve this is to execute the failed command in the missed replicas without ON CLUSTER. If most of the DDLs failed, then check the number of unfinished records in `distributed_ddl_queue` on the other nodes, because most probably it will be as high as thousands.
+If the DDL finished on some replicas but failed on others, the simplest fix is
+to rerun the failed statement on the missing replicas **without** `ON
+CLUSTER`. If most failed, check `system.distributed_ddl_queue` on every node —
+the backlog is often in the thousands.
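+
+A one-shot look at the backlog on every node (cluster name is illustrative):
+
+```sql
+SELECT FQDN() AS host, count() AS unfinished
+FROM clusterAllReplicas('mycluster', system.distributed_ddl_queue)
+WHERE status != 'Finished'
+GROUP BY host
+ORDER BY unfinished DESC;
+```
+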
-First, backup the `distributed_ddl_queue` into a table so you will have a snapshot of the table with the states of the tasks. You can do this with the following command:
+Snapshot the queue first so you don't lose the state:
```sql
-CREATE TABLE default.system_distributed_ddl_queue AS SELECT * FROM system.distributed_ddl_queue;
+CREATE TABLE default.system_distributed_ddl_queue
+AS SELECT * FROM system.distributed_ddl_queue;
```
-After this, we need to check from the backup table which tasks are not finished and execute them manually in the missed replicas, and review the pipeline which do `ON CLUSTER` command and does not abuse them. There is a new `CREATE TEMPORARY TABLE` command that can be used to avoid the `ON CLUSTER` command in some cases, where you need an intermediate table to do some operations and after that you can `INSERT INTO` the final table or do `ALTER TABLE final ATTACH PARTITION FROM TABLE temp` and this temp table will be dropped automatically after the session is closed.
-
-
+Then work through the snapshot, executing the missing statements locally and
+fixing the pipeline that's spamming `ON CLUSTER`. `CREATE TEMPORARY TABLE`
+plus `ALTER TABLE final ATTACH PARTITION FROM TABLE temp` is a common way to
+avoid cluster-wide DDLs for staging.
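+
+A minimal sketch of that staging pattern (all names hypothetical; temporary
+MergeTree tables require a recent ClickHouse release):
+
+```sql
+-- Session-scoped staging table: no ON CLUSTER DDL is ever issued for it,
+-- and it is dropped automatically when the session closes.
+CREATE TEMPORARY TABLE temp
+ENGINE = MergeTree ORDER BY id
+AS SELECT number AS id FROM numbers(10);
+
+-- Attach the staged data into the replicated target; both tables must
+-- share structure and partitioning (unpartitioned tables use ID 'all'):
+-- ALTER TABLE final ATTACH PARTITION ID 'all' FROM temp;
+```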
diff --git a/content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/there-are-n-unfinished-hosts-0-of-them-are-currently-active.md b/content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/there-are-n-unfinished-hosts-0-of-them-are-currently-active.md
index ca02a38cf7..9699555d02 100644
--- a/content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/there-are-n-unfinished-hosts-0-of-them-are-currently-active.md
+++ b/content/en/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/there-are-n-unfinished-hosts-0-of-them-are-currently-active.md
@@ -1,36 +1,56 @@
---
title: "There are N unfinished hosts (0 of them are currently active)."
-linkTitle: "There are N unfinished hosts (0 of them are currently active)."
+linkTitle: "Unfinished hosts"
+weight: 100
description: >
- There are N unfinished hosts (0 of them are currently active).
+ Diagnosing `Distributed DDL` queries stuck with unfinished, inactive hosts.
---
-Sometimes your Distributed DDL queries are being stuck, and not executing on all or subset of nodes, there are a lot of possible reasons for that kind of behavior, so it would take some time and effort to investigate.
+
+When a `Distributed DDL` query is "stuck" on one or more nodes, the initiator
+typically reports `There are N unfinished hosts (0 of them are currently
+active).` Several distinct root causes produce the same message, so the
+investigation usually means narrowing down to one of them.
+
+Background and config knobs live in
+[DDLWorker and DDL queue problems](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/).
+
+The fastest first look is the system table — it joins ZooKeeper and local
+executor state in one query:
+
+```sql
+SELECT entry, host, port, status, exception_code, exception_text,
+ query_create_time, query_finish_time, query
+FROM system.distributed_ddl_queue
+WHERE status != 'Finished'
+ORDER BY entry DESC, host;
+```
+
+If that doesn't make it obvious, work through the possible reasons below.
## Possible reasons
### ClickHouse® node can't recognize itself
```sql
-SELECT * FROM system.clusters; -- check is_local column, it should have 1 for itself
+SELECT * FROM system.clusters; -- check is_local column, it should be 1 for itself
```
```bash
-getent hosts clickhouse.local.net # or other name which should be local
+getent hosts clickhouse.local.net # or whichever name should resolve to this host
hostname --fqdn
cat /etc/hosts
cat /etc/hostname
```
-### Debian / Ubuntu
+On Debian/Ubuntu images the FQDN often maps to `127.0.1.1`, which doesn't
+match any network interface and ClickHouse® fails to detect this address as
+local — see
+[ClickHouse#23504](https://github.com/ClickHouse/ClickHouse/issues/23504).
-There is an issue in Debian based images, when hostname being mapped to 127.0.1.1 address which doesn't literally match network interface and ClickHouse fails to detect this address as local.
+### Previous task is being executed and taking some time
-[https://github.com/ClickHouse/ClickHouse/issues/23504](https://github.com/ClickHouse/ClickHouse/issues/23504)
-
-#### Previous task is being executed and taking some time
-
-It's usually some heavy operations like merges, mutations, alter columns, so it make sense to check those tables:
+Usually a heavy operation — large merge, mutation, or `ALTER COLUMN`:
```sql
SHOW PROCESSLIST;
@@ -38,11 +58,11 @@ SELECT * FROM system.merges;
SELECT * FROM system.mutations;
```
-In that case, you can just wait completion of previous task.
+In that case, wait for the previous task to finish.
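+
+Progress of the blocking work is visible in the same tables; for mutations,
+a focused view over standard `system.mutations` columns:
+
+```sql
+-- parts_to_do shrinking toward 0 means the mutation is advancing.
+SELECT database, table, mutation_id, parts_to_do, is_done,
+       latest_fail_reason
+FROM system.mutations
+WHERE NOT is_done;
+```
+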
-### Previous task is stuck because of some error
+### Previous task is stuck because of an error
-In that case, the first step is to understand which exact task is stuck and why. There are some queries which can help with that.
+Identify the exact task and figure out why. Useful queries:
```sql
-- list of all distributed ddl queries, path can be different in your installation
@@ -52,11 +72,11 @@ SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/';
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/';
SELECT * FROM system.zookeeper WHERE path = '/clickhouse/task_queue/ddl/' AND name = 'query-0000001000';
-- 22.3
-SELECT * FROM system.zookeeper WHERE path like '/clickhouse/task_queue/ddl/query-0000001000/%'
+SELECT * FROM system.zookeeper WHERE path like '/clickhouse/task_queue/ddl/query-0000001000/%'
ORDER BY ctime, path SETTINGS allow_unrestricted_reads_from_keeper='true'
-- 22.6
-SELECT path, name, value, ctime, mtime
-FROM system.zookeeper WHERE path like '/clickhouse/task_queue/ddl/query-0000001000/%'
+SELECT path, name, value, ctime, mtime
+FROM system.zookeeper WHERE path like '/clickhouse/task_queue/ddl/query-0000001000/%'
ORDER BY ctime, path SETTINGS allow_unrestricted_reads_from_keeper='true'
-- How many nodes executed this task
@@ -68,101 +88,93 @@ WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/' AND name = 'finished
└──────────┴────────────────┘
-- The nodes that are running the task
-SELECT name, value, ctime, mtime FROM system.zookeeper
+SELECT name, value, ctime, mtime FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/active/';
--- What was the result for the finished nodes
-SELECT name, value, ctime, mtime FROM system.zookeeper
+-- What was the result for the finished nodes
+SELECT name, value, ctime, mtime FROM system.zookeeper
WHERE path = '/clickhouse/task_queue/ddl/query-0000001000/finished/';
--- Latest successfull executed tasks from query_log.
+-- Latest successfully executed tasks from query_log.
SELECT query FROM system.query_log WHERE query LIKE '%ddl_entry%' AND type = 2 ORDER BY event_time DESC LIMIT 5;
-SELECT
- FQDN(),
- *
+-- Compare highest processed DDL entry across every replica.
+SELECT FQDN(), *
FROM clusterAllReplicas('cluster', system.metrics)
-WHERE metric LIKE '%MaxDDLEntryID%'
-
-┌─FQDN()───────────────────┬─metric────────┬─value─┬─description───────────────────────────┐
-│ chi-ab.svc.cluster.local │ MaxDDLEntryID │ 1468 │ Max processed DDL entry of DDLWorker. │
-└──────────────────────────┴───────────────┴───────┴───────────────────────────────────────┘
-┌─FQDN()───────────────────┬─metric────────┬─value─┬─description───────────────────────────┐
-│ chi-ab.svc.cluster.local │ MaxDDLEntryID │ 1468 │ Max processed DDL entry of DDLWorker. │
-└──────────────────────────┴───────────────┴───────┴───────────────────────────────────────┘
-┌─FQDN()───────────────────┬─metric────────┬─value─┬─description───────────────────────────┐
-│ chi-ab.svc.cluster.local │ MaxDDLEntryID │ 1468 │ Max processed DDL entry of DDLWorker. │
-└──────────────────────────┴───────────────┴───────┴───────────────────────────────────────┘
+WHERE metric LIKE '%MaxDDLEntryID%';
+┌─FQDN()──────────────────────┬─metric────────┬─value─┬─description───────────────────────────┐
+│ chi-ab-r1.svc.cluster.local │ MaxDDLEntryID │  1468 │ Max processed DDL entry of DDLWorker. │
+│ chi-ab-r2.svc.cluster.local │ MaxDDLEntryID │  1432 │ Max processed DDL entry of DDLWorker. │
+│ chi-ab-r3.svc.cluster.local │ MaxDDLEntryID │  1468 │ Max processed DDL entry of DDLWorker. │
+└─────────────────────────────┴───────────────┴───────┴────────────────────────────────────────┘
-- Information about task execution from logs.
grep -C 40 "ddl_entry" /var/log/clickhouse-server/clickhouse-server*.log
```
+A replica whose `MaxDDLEntryID` lags the others is the one to investigate.
### Issues that can prevent task execution
-#### Obsolete Replicas
+#### Obsolete replicas
-Obsolete replicas left in zookeeper.
+Old replicas left in ZooKeeper that never come back online block tasks that
+expect them:
```sql
-SELECT database, table, zookeeper_path, replica_path zookeeper FROM system.replicas WHERE total_replicas != active_replicas;
+SELECT database, table, zookeeper_path, replica_path
+FROM system.replicas
+WHERE total_replicas != active_replicas;
-SELECT * FROM system.zookeeper WHERE path = '/clickhouse/cluster/tables/01/database/table/replicas';
+SELECT * FROM system.zookeeper
+WHERE path = '/clickhouse/cluster/tables/01/database/table/replicas';
SYSTEM DROP REPLICA 'replica_name';
-
-SYSTEM STOP REPLICATION QUEUES;
-SYSTEM START REPLICATION QUEUES;
```
-[https://clickhouse.tech/docs/en/sql-reference/statements/system/\#query_language-system-drop-replica](https://clickhouse.tech/docs/en/sql-reference/statements/system/\#query_language-system-drop-replica)
+See [SYSTEM DROP REPLICA](https://clickhouse.com/docs/en/sql-reference/statements/system/#query_language-system-drop-replica).
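+
+The drop can be scoped if only one table is affected (names illustrative):
+
+```sql
+-- Remove the dead replica's metadata for a single table; run this on a
+-- healthy replica, never on the replica being dropped.
+SYSTEM DROP REPLICA 'replica_name' FROM TABLE db.table_local;
+```
+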
-#### Tasks manually removed from DDL queue
+#### Tasks manually removed from DDL queue
-Task were removed from DDL queue, but left in Replicated\*MergeTree table queue.
+Task was removed from the DDL queue but is still referenced by a
+`Replicated*MergeTree` table's replication queue:
```bash
grep -C 40 "ddl_entry" /var/log/clickhouse-server/clickhouse-server*.log
/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:28.956888 [ 599 ] {} DDLWorker: Processing task query-0000211211 (ALTER TABLE db.table_local ON CLUSTER `all-replicated` DELETE WHERE id = 1)
/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:29.053555 [ 599 ] {} DDLWorker: ZooKeeper error: Code: 999, e.displayText() = Coordination::Exception: No node, Stack trace (when copying this message, always include the lines below):
-/var/log/clickhouse-server/clickhouse-server.log-
-/var/log/clickhouse-server/clickhouse-server.log-0. Coordination::Exception::Exception(std::__1::basic_string, std::__1::allocator > const&, Coordination::Error, int) @ 0xfb2f6b3 in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log-1. Coordination::Exception::Exception(Coordination::Error) @ 0xfb2fb56 in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log:2. DB::DDLWorker::createStatusDirs(std::__1::basic_string, std::__1::allocator > const&, std::__1::shared_ptr const&) @ 0xeb3127a in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log:3. DB::DDLWorker::processTask(DB::DDLTask&) @ 0xeb36c96 in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log:4. DB::DDLWorker::enqueueTask(std::__1::unique_ptr >) @ 0xeb35f22 in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log-5. ? @ 0xeb47aed in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log-6. ThreadPoolImpl::worker(std::__1::__list_iterator) @ 0x8633bcd in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log-7. ThreadFromGlobalPool::ThreadFromGlobalPool::scheduleImpl(std::__1::function, int, std::__1::optional)::'lambda1'()>(void&&, void ThreadPoolImpl::scheduleImpl(std::__1::function, int, std::__1::optional)::'lambda1'()&&...)::'lambda'()::operator()() @ 0x863612f in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log-8. ThreadPoolImpl::worker(std::__1::__list_iterator) @ 0x8630ffd in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log-9. ? @ 0x8634bb3 in /usr/bin/clickhouse
-/var/log/clickhouse-server/clickhouse-server.log-10. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
-/var/log/clickhouse-server/clickhouse-server.log-11. __clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
+/var/log/clickhouse-server/clickhouse-server.log-0. Coordination::Exception::Exception(...) @ ... in /usr/bin/clickhouse
+/var/log/clickhouse-server/clickhouse-server.log-1. Coordination::Exception::Exception(Coordination::Error) @ ... in /usr/bin/clickhouse
+/var/log/clickhouse-server/clickhouse-server.log:2. DB::DDLWorker::createStatusDirs(...) @ ... in /usr/bin/clickhouse
+/var/log/clickhouse-server/clickhouse-server.log:3. DB::DDLWorker::processTask(DB::DDLTask&) @ ... in /usr/bin/clickhouse
+/var/log/clickhouse-server/clickhouse-server.log- ...
/var/log/clickhouse-server/clickhouse-server.log- (version 21.1.8.30 (official build))
/var/log/clickhouse-server/clickhouse-server.log:2021.05.04 12:41:29.053951 [ 599 ] {} DDLWorker: Processing task query-0000211211 (ALTER TABLE db.table_local ON CLUSTER `all-replicated` DELETE WHERE id = 1)
```
-Context of this problem is:
-* Constant pressure of cheap ON CLUSTER DELETE queries.
-* One replica was down for a long amount of time (multiple days).
-* Because of pressure on the DDL queue, it purged old records due to the `task_max_lifetime` setting.
-* When a lagging replica comes up, it's fail's execute old queries from DDL queue, because at this point they were purged from it.
+Context:
+* Constant pressure of cheap `ON CLUSTER DELETE` queries.
+* One replica was down for a long time (multiple days).
+* Because of pressure on the DDL queue, old records were purged via `task_max_lifetime`.
+* When the lagging replica came back, it failed to execute the old queries from the DDL queue — they no longer existed.
Solution:
-* Reload/Restore this replica from scratch.
+* Reload/restore that replica from scratch.
-#### DDL path was changed in Zookeeper without restarting ClickHouse
+#### DDL path was changed in ZooKeeper without restarting ClickHouse
-Changing the DDL queue path in Zookeeper without restarting ClickHouse will make ClickHouse confused. If you need to do this ensure that you restart ClickHouse before submitting additional distributed DDL commands. Here's an example.
+Changing the DDL queue path in ZooKeeper without restarting ClickHouse leaves
+the server confused — it keeps polling the old path. Avoid path changes if at
+all possible; if it must be done, restart ClickHouse before submitting any
+further `ON CLUSTER` commands.
```sql
-- Path before change:
SELECT *
FROM system.zookeeper
-WHERE path = '/clickhouse/clickhouse101/task_queue'
+WHERE path = '/clickhouse/clickhouse101/task_queue';
┌─name─┬─value─┬─path─────────────────────────────────┐
│ ddl │ │ /clickhouse/clickhouse101/task_queue │
@@ -171,11 +183,21 @@ WHERE path = '/clickhouse/clickhouse101/task_queue'
-- Path after change
SELECT *
FROM system.zookeeper
-WHERE path = '/clickhouse/clickhouse101/task_queue'
+WHERE path = '/clickhouse/clickhouse101/task_queue';
┌─name─┬─value─┬─path─────────────────────────────────┐
│ ddl2 │ │ /clickhouse/clickhouse101/task_queue │
└──────┴───────┴──────────────────────────────────────┘
```
-The reason is that ClickHouse will not "see" this change and will continue to look for tasks in the old path. Altering paths in Zookeeper should be avoided if at all possible. If necessary it must be done *very carefully*.
+## Still stuck?
+
+If the task can't be made to progress and is blocking everything else:
+
+- Rerun the original DDL statement on each missing replica directly (without
+ `ON CLUSTER`) once the queue is unblocked.
+- For obsolete replicas, `SYSTEM DROP REPLICA 'replica_name'` removes their
+  metadata from ZooKeeper so pending tasks stop waiting for them.
+- If the queue itself is corrupt, capture the relevant
+ `system.distributed_ddl_queue` rows and ZooKeeper paths before any
+ remediation so you can reconstruct what happened.