`content/en/altinity-kb-diagnostics-runbook/_index.md` (new file, 99 additions)
---
title: "ClickHouse® Cluster Diagnostics Runbook"
linkTitle: "Diagnostics Runbook"
weight: 110
description: >
A query library and scenario-based diagnostic flows for triaging
ClickHouse® clusters during incidents.
keywords:
- clickhouse diagnostics
- clickhouse troubleshooting
- clickhouse runbook
- replication queue
- async inserts
- keeper
- host skew
---

A reference for diagnosing problems on a running ClickHouse® cluster: a
catalogue of cluster-wide queries you can run, organised by subsystem, plus
scenario playbooks that walk you from a symptom to the queries that resolve
it.

The intended reader is an on-call or support engineer who has cluster-wide
read access and needs to identify *which subsystem* is misbehaving as quickly
as possible.

## How this runbook is organised

| Section | What's in it |
|---|---|
| [Quick reference](/altinity-kb-diagnostics-runbook/quick-reference/) | One-page symptom → query map and the gotchas every diagnosis depends on. **Start here.** |
| [Investigation methods](/altinity-kb-diagnostics-runbook/investigation-methods/) | Process reminders — how to avoid common misdiagnoses. |
| [Query library](/altinity-kb-diagnostics-runbook/query-library/) | 54 cluster-wide queries grouped by subsystem (replication, parts, async inserts, Keeper, etc.). Reference material. |
| [Scenarios](/altinity-kb-diagnostics-runbook/scenarios/) | Step-by-step diagnostic flows for specific failure modes. |

## How the queries are written

Every query in the library fans out across the cluster using
`clusterAllReplicas('{cluster}', system.<table>)`. Replace these placeholders
before running:

- `{cluster}` — your cluster name (the value used in `remote_servers` /
`system.clusters.cluster`).
- `{database}`, `{table}`, `{mv_name}`, `{target_table_pattern}` — appear in
queries that drill into a specific object.

Most queries include `hostName() AS host` as the first column so you can see
per-replica behaviour at a glance. Replication and metric tables vary slightly
across ClickHouse versions — when in doubt, inspect the columns first with
`SELECT name FROM system.columns WHERE database='system' AND table='<name>'`.
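
As a shape reference, a typical library query looks like this (a minimal
sketch, not one of the numbered queries; assumes `system.replicas` exposes
`absolute_delay` on your version):

```sql
-- Typical query shape: fan out with clusterAllReplicas(), put hostName()
-- first so per-replica skew is visible at a glance.
SELECT
    hostName() AS host,
    database,
    table,
    absolute_delay
FROM clusterAllReplicas('{cluster}', system.replicas)
WHERE absolute_delay > 0
ORDER BY absolute_delay DESC
LIMIT 20;
```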

## Patterns that recur

These are the misreads that account for a large share of wrong diagnoses.
Read them once before drilling into a specific scenario.

1. **Host-skewed failures with a balanced workload.** Settings identical,
workload balanced, but failure rates differ wildly across replicas. The
cause is usually entry-point routing (HAProxy / ingress) directing most
traffic to a subset of hosts — not a ClickHouse misconfiguration. See
[scenarios → host-skewed failures](/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures/).

2. **`tables[]` in `query_log` is not the writer.** A failed insert's
   `query_log` row lists many tables because `tables[]` includes the entire
   MV dependency chain. The actual physical writer is in the INSERT query
   text — not the first element of `tables[]`.
   See the [insert load and host skew queries](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) and
   [scenarios → async insert issues](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/).

3. **Cumulative vs current state.** `system.events` accumulates counters
   since process start; ratios computed from those totals can show stale
   peak-load skew that no longer exists. Always cross-check with
   `system.metric_log` over a recent window before concluding "host X is
   slow".

4. **ProfileEvents reveal "waited not worked".** A failed insert with
`RealTimeMicroseconds ≈ timeout` and `UserTimeMicroseconds < 10ms` means
the query never executed. The bottleneck is a lock or queue, not work.
Look upstream for what is blocking.

5. **Same settings + different behaviour ⇒ upstream cause.** When
`system.settings` is identical across hosts and behaviour is still
skewed, the cause is outside ClickHouse: entry-point routing, pod
resource contention, or leader-coordination concentration. Stop looking
inside ClickHouse.
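
To make pattern 3 concrete, here is a minimal sketch of the cross-check
(assumes `metric_log` is enabled and that your version exposes the
`ProfileEvent_Query` column — check `system.columns` if it errors):

```sql
-- Cumulative counter: everything since process start, possibly stale.
SELECT hostName() AS host, value AS queries_since_start
FROM clusterAllReplicas('{cluster}', system.events)
WHERE event = 'Query';

-- Current state: the same counter over the last 15 minutes only.
SELECT hostName() AS host, sum(ProfileEvent_Query) AS queries_last_15m
FROM clusterAllReplicas('{cluster}', system.metric_log)
WHERE event_time >= now() - INTERVAL 15 MINUTE
GROUP BY host;
```

If the skew appears in the first query but not the second, you are looking
at history, not a live problem.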

## Where to start

- "Customer says something is wrong, I don't know what" → run
[Scenario 10: General triage](/altinity-kb-diagnostics-runbook/scenarios/general-triage/).
- "I have a specific symptom" → open the
[quick reference](/altinity-kb-diagnostics-runbook/quick-reference/).
- "I need a specific query" → browse the
[query library](/altinity-kb-diagnostics-runbook/query-library/) by subsystem.

## Related KB pages

- [Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/) — focused memory diagnostics.
- [Who ate my CPU?](/altinity-kb-setup-and-maintenance/who-ate-my-cpu/) — focused CPU diagnostics.
- [DDLWorker and DDL queue problems](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/) — `ON CLUSTER` task troubleshooting.
- [System tables eat my disk](/altinity-kb-setup-and-maintenance/altinity-kb-system-tables-eat-my-disk/) — when `*_log` tables grow too large.
`content/en/altinity-kb-diagnostics-runbook/investigation-methods.md` (new file, 147 additions)
---
title: "Investigation methods"
linkTitle: "Investigation methods"
weight: 20
description: >
Process reminders that prevent the most common misdiagnoses.
keywords:
- clickhouse troubleshooting
- clickhouse diagnostics
- tables array
- profileevents
- metric_log
---

These reminders are about *how* to investigate — they prevent the kinds of
wrong reads that send a diagnosis in the wrong direction for hours. Each one
maps to a specific query or pattern elsewhere in the runbook.

## Verify before committing to a cause

When the evidence points to more than one plausible cause, run one more
verification query before you state a conclusion. The cost of an extra
`SELECT` is seconds; the cost of unwinding a wrong root-cause analysis can
be days of lost time and trust.

## `tables[]` in `query_log` is not the writer

The `query_log.tables` array contains every table touched by the query,
including the entire MV dependency chain. The actual physical INSERT target
is in the query text, not in `tables[0]`.

To find the real writer behind a failing insert, extract from the query
text:

```sql
SELECT regexpExtract(query, 'INSERT INTO\\s+([\\w\\.`]+)') AS target, …
```

(Note the doubled backslashes: in ClickHouse string literals an unrecognised
escape like `\s` is collapsed to `s`, silently breaking the pattern.)

See [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection)
and the dedicated [scenario](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/).

## Cumulative metrics hide current state

`system.events` accumulates counters since process start. Ratios computed
from those totals can reflect a peak-load period that happened days ago and
is no longer relevant.

When comparing per-host behaviour right now, use `system.metric_log` with a
recent window (5–30 minutes):

- [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log)
— per-second profile activity by host.
- [Q49](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q49-tail-latency-for-keeper-operations)
— p50/p95/p99 of Keeper transactions, by host.

If someone reports "host X has Nx higher Keeper waits", reproduce it with
Q49 over the last 30 minutes before treating it as a current problem.
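
A minimal sketch of that reproduction (not Q49 itself; assumes your
`metric_log` schema has a `ProfileEvent_ZooKeeperWaitMicroseconds` column —
inspect `system.columns` first if it errors):

```sql
-- Per-host tail latency of Keeper waits over the last 30 minutes.
SELECT
    hostName() AS host,
    quantiles(0.5, 0.95, 0.99)(ProfileEvent_ZooKeeperWaitMicroseconds)
        AS keeper_wait_us_p50_p95_p99
FROM clusterAllReplicas('{cluster}', system.metric_log)
WHERE event_time >= now() - INTERVAL 30 MINUTE
GROUP BY host
ORDER BY host;
```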

## Same settings + different behaviour ⇒ upstream cause

If `system.settings` is identical across hosts (see
[Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection))
and behaviour is still skewed across replicas, the cause is outside
ClickHouse. Likely sources:

- Entry-point routing (HAProxy, ingress, or client library load balancing)
concentrating traffic on a subset of replicas.
- Pod-level resource contention (CPU throttling, memory pressure on the
node, page cache flushes from a noisy neighbour).
- Coordination work concentrated on a subset of hosts (leader concentration,
see [Q51](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q51-leader-distribution-across-hosts)).

Stop looking inside ClickHouse — the answer is upstream.
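
One quick way to confirm the premise — a per-host fingerprint of
non-default settings (a sketch, not Q52; remote hosts report their default
profile, so session-level overrides will not show here):

```sql
-- Identical fingerprints across hosts = settings are not the cause.
SELECT
    hostName() AS host,
    count() AS changed_count,
    cityHash64(arraySort(groupArray((name, value)))) AS settings_fingerprint
FROM clusterAllReplicas('{cluster}', system.settings)
WHERE changed
GROUP BY host;
```

If the fingerprints match, stop tuning settings and look upstream.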

## Distinguish workload from failure

"Volume is balanced" and "failures are balanced" answer different questions.
Either can be skewed independently. To resolve a host-skew report, look at
both:

- Workload — [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log)
(`ProfileEvent_AsyncInsertQuery` per host).
- Failure rate — [Q53](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q53-failure-rate-per-host)
(failures normalised by attempts).

Together they let you say "host A receives 4× more attempts" or "host A
fails at 5× the rate at equal volume" — those are very different problems
with different fixes.

## ProfileEvents reveal "waited not worked"

A failed query with `RealTimeMicroseconds ≈ timeout` and
`UserTimeMicroseconds` near zero means the query never executed. It sat in
a queue or on a lock. This rules out "the work itself is slow" and points
to "the wait is the problem".

Before theorising about a slow MV chain or slow merge as the cause of a
failed insert, inspect ProfileEvents on representative failed queries:

```sql
SELECT
query_id,
query_duration_ms,
ProfileEvents['RealTimeMicroseconds'] AS real_us,
ProfileEvents['UserTimeMicroseconds'] AS user_us,
ProfileEvents['SystemTimeMicroseconds'] AS sys_us
FROM clusterAllReplicas('{cluster}', system.query_log)
WHERE event_time >= now() - INTERVAL 30 MINUTE
AND type = 'ExceptionWhileProcessing'
AND exception ILIKE '%async insert%timeout%'
LIMIT 20;
```

If `user_us` is in single-digit milliseconds while `real_us` is at the
timeout ceiling, the work never ran. Find the lock or queue, not the slow
operator.

## Routing settings to know about

A short glossary of the settings that determine *where* a query lands and
*how* its MVs execute. Confirm them with
[Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection)
before tuning anything.

- **`load_balancing`** — picks the replica for a Distributed table read or
insert. `hostname_levenshtein_distance` concentrates by hostname
similarity (often pinning to self), which can imbalance routing
unexpectedly. `random` or `round_robin` spreads work evenly.
- **`prefer_localhost_replica`** — when `1`, the local replica is preferred
regardless of `load_balancing`. Useful for read locality, risky for
insert balance.
- **`distributed_foreground_insert`** — when `1`, INSERTs into a
Distributed table wait synchronously for remote acks. Slower but no
silent loss.
- **`parallel_view_processing`** — when `0` (historical default on many
versions), MVs on a target table execute serially per insert. With a
deep MV chain, this turns each insert into a long sequential pipeline.
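
To read these settings on every host in one shot (a sketch along the lines
of Q52; `distributed_foreground_insert` was named `insert_distributed_sync`
on older releases, so both are listed):

```sql
SELECT
    hostName() AS host,
    name,
    value
FROM clusterAllReplicas('{cluster}', system.settings)
WHERE name IN ('load_balancing', 'prefer_localhost_replica',
               'distributed_foreground_insert', 'insert_distributed_sync',
               'parallel_view_processing')
ORDER BY name, host;
```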

## Sidecar Keeper means co-located, not shared

If `system.zookeeper_connection.host == hostName()` (see
[Q50](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q50-keeper-connection-topology)),
the replica connects to a Keeper running on the same pod. "Slow Keeper
follower" theories don't apply in this topology — there is no shared
follower to be slow. Issues here are about pod-level contention (CPU, page
cache, disk), not Keeper network routing.
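
A minimal check for this topology (a sketch of the Q50 idea; column names
on `system.zookeeper_connection` vary slightly across versions):

```sql
-- is_sidecar = 1 means the replica talks to a Keeper on its own host.
-- The equality may need relaxing (e.g. startsWith) where Keeper reports
-- an FQDN but hostName() returns a short name.
SELECT
    hostName() AS replica,
    zc.host AS keeper_host,
    zc.host = hostName() AS is_sidecar
FROM clusterAllReplicas('{cluster}', system.zookeeper_connection) AS zc;
```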
`content/en/altinity-kb-diagnostics-runbook/query-library/_index.md` (new file, 40 additions)
---
title: "Query library"
linkTitle: "Query library"
weight: 30
description: >
Reference catalogue of cluster-wide diagnostic queries, grouped by subsystem.
keywords:
- clickhouse system tables
- clickhouse diagnostics
- clusterAllReplicas
---

54 cluster-wide queries grouped by the subsystem they probe. Every query
fans out via `clusterAllReplicas('{cluster}', system.<table>)`. Replace
`{cluster}` / `{database}` / `{table}` / `{mv_name}` /
`{target_table_pattern}` with values from your environment before running.

Queries are referenced from the
[scenarios](/altinity-kb-diagnostics-runbook/scenarios/) by their numeric
IDs (`Q1`, `Q2`, …). Numbering is stable across the runbook, so a
shorthand like "run Q48" means the same query to everyone.

| Page | Queries | Purpose |
|---|---|---|
| [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/) | Q1–Q5, Q31, Q32 | Replication queue depth, postpone reasons, replica lag, fetches in flight |
| [Parts and merges](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/) | Q6–Q10, Q42 | Parts per host/partition, active merges, merge throughput |
| [Disk and storage](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/) | Q11, Q12 | Per-disk free space, TTL move activity |
| [Pools and resources](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/) | Q13–Q15, Q54 | Background pool saturation, memory pressure, cgroup limits |
| [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) | Q16–Q19 | Recent query load, active queries, OOM/exception queries, stuck mutations |
| [Async inserts](/altinity-kb-diagnostics-runbook/query-library/async-inserts/) | Q20–Q28, Q38 | Flush errors, latency, MV chain inspection, timeout patterns |
| [Keeper and coordination](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/) | Q29, Q30, Q33, Q49–Q51 | Connection state, exception patterns, wait-time percentiles, topology, leader distribution |
| [Insert load and host skew](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) | Q34–Q37, Q40, Q41, Q46–Q48, Q52, Q53 | Insert rate/volume, per-host duration, routing settings, failure rate |
| [Dictionaries and Kafka](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/) | Q43–Q45 | Dictionary health, Kafka consumer vs pool size, consumer errors |

## A note on version drift

Several system tables changed schema between ClickHouse releases — column
names on `system.replicated_fetches`, the `views`-related columns on
`query_log`, and whether `system.zookeeper_log` exists at all. Each query
page calls out the columns to check first when a query errors out on a
specific cluster.
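
When a library query fails with an unknown-column error, inspect the schema
on that cluster before editing the query — for example, for
`replicated_fetches`:

```sql
SELECT name, type
FROM system.columns
WHERE database = 'system' AND table = 'replicated_fetches'
ORDER BY name;
```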