`content/en/altinity-kb-diagnostics-runbook/_index.md` (new file, 99 additions)
---
title: "ClickHouse® Cluster Diagnostics Runbook"
linkTitle: "Diagnostics Runbook"
weight: 110
description: >
A query library and scenario-based diagnostic flows for triaging
ClickHouse® clusters during incidents.
keywords:
- clickhouse diagnostics
- clickhouse troubleshooting
- clickhouse runbook
- replication queue
- async inserts
- keeper
- host skew
---

A reference for diagnosing problems on a running ClickHouse® cluster: a
catalogue of cluster-wide queries you can run, organised by subsystem, plus
scenario playbooks that walk you from a symptom to the queries that resolve
it.

The intended reader is an on-call or support engineer who has cluster-wide
read access and needs to identify *which subsystem* is misbehaving as quickly
as possible.

## How this runbook is organised

| Section | What's in it |
|---|---|
| [Quick reference](/altinity-kb-diagnostics-runbook/quick-reference/) | One-page symptom → query map and the gotchas every diagnosis depends on. **Start here.** |
| [Investigation methods](/altinity-kb-diagnostics-runbook/investigation-methods/) | Process reminders — how to avoid common misdiagnoses. |
| [Query library](/altinity-kb-diagnostics-runbook/query-library/) | 54 cluster-wide queries grouped by subsystem (replication, parts, async inserts, Keeper, etc.). Reference material. |
| [Scenarios](/altinity-kb-diagnostics-runbook/scenarios/) | Step-by-step diagnostic flows for specific failure modes. |

## How the queries are written

Every query in the library fans out across the cluster using
`clusterAllReplicas('{cluster}', system.<table>)`. Replace these placeholders
before running:

- `{cluster}` — your cluster name (the value used in `remote_servers` /
`system.clusters.cluster`).
- `{database}`, `{table}`, `{mv_name}`, `{target_table_pattern}` — appear in
queries that drill into a specific object.

Most queries include `hostName() AS host` as the first column so you can see
per-replica behaviour at a glance. Replication and metric tables vary slightly
across ClickHouse versions — when in doubt, inspect the columns first with
`SELECT name FROM system.columns WHERE database='system' AND table='<name>'`.
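
As a shape reference, a typical library query looks like this (a minimal
sketch, not one of the numbered queries; assumes `system.replicas` exposes
`absolute_delay` on your version):

```sql
-- Typical query shape: fan out with clusterAllReplicas(), put hostName()
-- first so per-replica skew is visible at a glance.
SELECT
    hostName() AS host,
    database,
    table,
    absolute_delay
FROM clusterAllReplicas('{cluster}', system.replicas)
WHERE absolute_delay > 0
ORDER BY absolute_delay DESC
LIMIT 20;
```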

## Patterns that recur

These are the misreads that account for a large share of wrong diagnoses.
Read them once before drilling into a specific scenario.

1. **Host-skewed failures with a balanced workload.** Settings identical,
workload balanced, but failure rates differ wildly across replicas. The
cause is usually entry-point routing (HAProxy / ingress) directing most
traffic to a subset of hosts — not a ClickHouse misconfiguration. See
[scenarios → host-skewed failures](/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures/).

2. **`tables[]` in `query_log` is not the writer.** A failed insert's
   `query_log` row lists many tables because `tables[]` includes the entire
   MV dependency chain. The actual physical writer is in the INSERT query
   text — not the first element of `tables[]`.
   See the [insert load and host skew queries](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) and
   [scenarios → async insert issues](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/).

3. **Cumulative vs current state.** `system.events` accumulates counters
   since process start; ratios computed from those totals can show stale
   peak-load skew that no longer exists. Always cross-check with
   `system.metric_log` over a recent window before concluding "host X is
   slow".

4. **ProfileEvents reveal "waited not worked".** A failed insert with
`RealTimeMicroseconds ≈ timeout` and `UserTimeMicroseconds < 10ms` means
the query never executed. The bottleneck is a lock or queue, not work.
Look upstream for what is blocking.

5. **Same settings + different behaviour ⇒ upstream cause.** When
`system.settings` is identical across hosts and behaviour is still
skewed, the cause is outside ClickHouse: entry-point routing, pod
resource contention, or leader-coordination concentration. Stop looking
inside ClickHouse.
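
To make pattern 3 concrete, here is a minimal sketch of the cross-check
(assumes `metric_log` is enabled and that your version exposes the
`ProfileEvent_Query` column — check `system.columns` if it errors):

```sql
-- Cumulative counter: everything since process start, possibly stale.
SELECT hostName() AS host, value AS queries_since_start
FROM clusterAllReplicas('{cluster}', system.events)
WHERE event = 'Query';

-- Current state: the same counter over the last 15 minutes only.
SELECT hostName() AS host, sum(ProfileEvent_Query) AS queries_last_15m
FROM clusterAllReplicas('{cluster}', system.metric_log)
WHERE event_time >= now() - INTERVAL 15 MINUTE
GROUP BY host;
```

If the skew appears in the first query but not the second, you are looking
at history, not a live problem.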

## Where to start

- "Customer says something is wrong, I don't know what" → run
[Scenario 10: General triage](/altinity-kb-diagnostics-runbook/scenarios/general-triage/).
- "I have a specific symptom" → open the
[quick reference](/altinity-kb-diagnostics-runbook/quick-reference/).
- "I need a specific query" → browse the
[query library](/altinity-kb-diagnostics-runbook/query-library/) by subsystem.

## Related KB pages

- [Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/) — focused memory diagnostics.
- [Who ate my CPU?](/altinity-kb-setup-and-maintenance/who-ate-my-cpu/) — focused CPU diagnostics.
- [DDLWorker and DDL queue problems](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/) — `ON CLUSTER` task troubleshooting.
- [System tables eat my disk](/altinity-kb-setup-and-maintenance/altinity-kb-system-tables-eat-my-disk/) — when `*_log` tables grow too large.
`content/en/altinity-kb-diagnostics-runbook/investigation-methods.md` (new file, 147 additions)
---
title: "Investigation methods"
linkTitle: "Investigation methods"
weight: 20
description: >
Process reminders that prevent the most common misdiagnoses.
keywords:
- clickhouse troubleshooting
- clickhouse diagnostics
- tables array
- profileevents
- metric_log
---

These reminders are about *how* to investigate — they prevent the kinds of
wrong reads that send a diagnosis in the wrong direction for hours. Each one
maps to a specific query or pattern elsewhere in the runbook.

## Verify before committing to a cause

When the evidence points to more than one plausible cause, run one more
verification query before you state a conclusion. The cost of an extra
`SELECT` is seconds; the cost of unwinding a wrong root-cause analysis can
be days of lost time and trust.

## `tables[]` in `query_log` is not the writer

The `query_log.tables` array contains every table touched by the query,
including the entire MV dependency chain. The actual physical INSERT target
is in the query text, not in `tables[0]`.

To find the real writer behind a failing insert, extract from the query
text:

```sql
SELECT regexpExtract(query, 'INSERT INTO\\s+([\\w\\.`]+)') AS target, …
```

(Note the doubled backslashes: in ClickHouse string literals an unrecognised
escape like `\s` is collapsed to `s`, silently breaking the pattern.)

See [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection)
and the dedicated [scenario](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/).

## Cumulative metrics hide current state

`system.events` accumulates counters since process start. Ratios computed
from those totals can reflect a peak-load period that happened days ago and
is no longer relevant.

When comparing per-host behaviour right now, use `system.metric_log` with a
recent window (5–30 minutes):

- [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log)
— per-second profile activity by host.
- [Q49](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q49-tail-latency-for-keeper-operations)
— p50/p95/p99 of Keeper transactions, by host.

If someone reports "host X has Nx higher Keeper waits", reproduce it with
Q49 over the last 30 minutes before treating it as a current problem.
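
A minimal sketch of that reproduction (not Q49 itself; assumes your
`metric_log` schema has a `ProfileEvent_ZooKeeperWaitMicroseconds` column —
inspect `system.columns` first if it errors):

```sql
-- Per-host tail latency of Keeper waits over the last 30 minutes.
SELECT
    hostName() AS host,
    quantiles(0.5, 0.95, 0.99)(ProfileEvent_ZooKeeperWaitMicroseconds)
        AS keeper_wait_us_p50_p95_p99
FROM clusterAllReplicas('{cluster}', system.metric_log)
WHERE event_time >= now() - INTERVAL 30 MINUTE
GROUP BY host
ORDER BY host;
```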

## Same settings + different behaviour ⇒ upstream cause

If `system.settings` is identical across hosts (see
[Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection))
and behaviour is still skewed across replicas, the cause is outside
ClickHouse. Likely sources:

- Entry-point routing (HAProxy, ingress, or client library load balancing)
concentrating traffic on a subset of replicas.
- Pod-level resource contention (CPU throttling, memory pressure on the
node, page cache flushes from a noisy neighbour).
- Coordination work concentrated on a subset of hosts (leader concentration,
see [Q51](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q51-leader-distribution-across-hosts)).

Stop looking inside ClickHouse — the answer is upstream.
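
One quick way to confirm the premise — a per-host fingerprint of
non-default settings (a sketch, not Q52; remote hosts report their default
profile, so session-level overrides will not show here):

```sql
-- Identical fingerprints across hosts = settings are not the cause.
SELECT
    hostName() AS host,
    count() AS changed_count,
    cityHash64(arraySort(groupArray((name, value)))) AS settings_fingerprint
FROM clusterAllReplicas('{cluster}', system.settings)
WHERE changed
GROUP BY host;
```

If the fingerprints match, stop tuning settings and look upstream.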

## Distinguish workload from failure

"Volume is balanced" and "failures are balanced" answer different questions.
Either can be skewed independently. To resolve a host-skew report, look at
both:

- Workload — [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log)
(`ProfileEvent_AsyncInsertQuery` per host).
- Failure rate — [Q53](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q53-failure-rate-per-host)
(failures normalised by attempts).

Together they let you say "host A receives 4× more attempts" or "host A
fails at 5× the rate at equal volume" — those are very different problems
with different fixes.

## ProfileEvents reveal "waited not worked"

A failed query with `RealTimeMicroseconds ≈ timeout` and
`UserTimeMicroseconds` near zero means the query never executed. It sat in
a queue or on a lock. This rules out "the work itself is slow" and points
to "the wait is the problem".

Before theorising about a slow MV chain or slow merge as the cause of a
failed insert, inspect ProfileEvents on representative failed queries:

```sql
SELECT
query_id,
query_duration_ms,
ProfileEvents['RealTimeMicroseconds'] AS real_us,
ProfileEvents['UserTimeMicroseconds'] AS user_us,
ProfileEvents['SystemTimeMicroseconds'] AS sys_us
FROM clusterAllReplicas('{cluster}', system.query_log)
WHERE event_time >= now() - INTERVAL 30 MINUTE
AND type = 'ExceptionWhileProcessing'
AND exception ILIKE '%async insert%timeout%'
LIMIT 20;
```

If `user_us` is in single-digit milliseconds while `real_us` is at the
timeout ceiling, the work never ran. Find the lock or queue, not the slow
operator.

## Routing settings to know about

A short glossary of the settings that determine *where* a query lands and
*how* its MVs execute. Confirm them with
[Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection)
before tuning anything.

- **`load_balancing`** — picks the replica for a Distributed table read or
insert. `hostname_levenshtein_distance` concentrates by hostname
similarity (often pinning to self), which can imbalance routing
unexpectedly. `random` or `round_robin` spreads work evenly.
- **`prefer_localhost_replica`** — when `1`, the local replica is preferred
regardless of `load_balancing`. Useful for read locality, risky for
insert balance.
- **`distributed_foreground_insert`** — when `1`, INSERTs into a
Distributed table wait synchronously for remote acks. Slower but no
silent loss.
- **`parallel_view_processing`** — when `0` (historical default on many
versions), MVs on a target table execute serially per insert. With a
deep MV chain, this turns each insert into a long sequential pipeline.
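
To read these settings on every host in one shot (a sketch along the lines
of Q52; `distributed_foreground_insert` was named `insert_distributed_sync`
on older releases, so both are listed):

```sql
SELECT
    hostName() AS host,
    name,
    value
FROM clusterAllReplicas('{cluster}', system.settings)
WHERE name IN ('load_balancing', 'prefer_localhost_replica',
               'distributed_foreground_insert', 'insert_distributed_sync',
               'parallel_view_processing')
ORDER BY name, host;
```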

## Sidecar Keeper means co-located, not shared

If `system.zookeeper_connection.host == hostName()` (see
[Q50](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q50-keeper-connection-topology)),
the replica connects to a Keeper running on the same pod. "Slow Keeper
follower" theories don't apply in this topology — there is no shared
follower to be slow. Issues here are about pod-level contention (CPU, page
cache, disk), not Keeper network routing.
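
A minimal check for this topology (a sketch of the Q50 idea; column names
on `system.zookeeper_connection` vary slightly across versions):

```sql
-- is_sidecar = 1 means the replica talks to a Keeper on its own host.
-- The equality may need relaxing (e.g. startsWith) where Keeper reports
-- an FQDN but hostName() returns a short name.
SELECT
    hostName() AS replica,
    zc.host AS keeper_host,
    zc.host = hostName() AS is_sidecar
FROM clusterAllReplicas('{cluster}', system.zookeeper_connection) AS zc;
```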
`content/en/altinity-kb-diagnostics-runbook/query-library/_index.md` (new file, 40 additions)
---
title: "Query library"
linkTitle: "Query library"
weight: 30
description: >
Reference catalogue of cluster-wide diagnostic queries, grouped by subsystem.
keywords:
- clickhouse system tables
- clickhouse diagnostics
- clusterAllReplicas
---

54 cluster-wide queries grouped by the subsystem they probe. Every query
fans out via `clusterAllReplicas('{cluster}', system.<table>)`. Replace
`{cluster}` / `{database}` / `{table}` / `{mv_name}` /
`{target_table_pattern}` with values from your environment before running.

Queries are referenced from the
[scenarios](/altinity-kb-diagnostics-runbook/scenarios/) by their numeric
IDs (`Q1`, `Q2`, …). Numbering is stable across the runbook, so a
shorthand like "run Q48" means the same query to everyone.

| Page | Queries | Purpose |
|---|---|---|
| [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/) | Q1–Q5, Q31, Q32 | Replication queue depth, postpone reasons, replica lag, fetches in flight |
| [Parts and merges](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/) | Q6–Q10, Q42 | Parts per host/partition, active merges, merge throughput |
| [Disk and storage](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/) | Q11, Q12 | Per-disk free space, TTL move activity |
| [Pools and resources](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/) | Q13–Q15, Q54 | Background pool saturation, memory pressure, cgroup limits |
| [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) | Q16–Q19 | Recent query load, active queries, OOM/exception queries, stuck mutations |
| [Async inserts](/altinity-kb-diagnostics-runbook/query-library/async-inserts/) | Q20–Q28, Q38 | Flush errors, latency, MV chain inspection, timeout patterns |
| [Keeper and coordination](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/) | Q29, Q30, Q33, Q49–Q51 | Connection state, exception patterns, wait-time percentiles, topology, leader distribution |
| [Insert load and host skew](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) | Q34–Q37, Q40, Q41, Q46–Q48, Q52, Q53 | Insert rate/volume, per-host duration, routing settings, failure rate |
| [Dictionaries and Kafka](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/) | Q43–Q45 | Dictionary health, Kafka consumer vs pool size, consumer errors |

## A note on version drift

Several system tables changed schema between ClickHouse releases — column
names on `system.replicated_fetches`, the `views`-related columns on
`query_log`, and whether `system.zookeeper_log` exists at all. Each query
page calls out the columns to check first when a query errors out on a
specific cluster.
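
When a library query fails with an unknown-column error, inspect the schema
on that cluster before editing the query — for example, for
`replicated_fetches`:

```sql
SELECT name, type
FROM system.columns
WHERE database = 'system' AND table = 'replicated_fetches'
ORDER BY name;
```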