diff --git a/content/en/altinity-kb-diagnostics-runbook/_index.md b/content/en/altinity-kb-diagnostics-runbook/_index.md new file mode 100644 index 0000000000..06de8dd5bc --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/_index.md @@ -0,0 +1,99 @@ +--- +title: "ClickHouse® Cluster Diagnostics Runbook" +linkTitle: "Diagnostics Runbook" +weight: 110 +description: > + A query library and scenario-based diagnostic flows for triaging + ClickHouse® clusters during incidents. +keywords: + - clickhouse diagnostics + - clickhouse troubleshooting + - clickhouse runbook + - replication queue + - async inserts + - keeper + - host skew +--- + +A reference for diagnosing problems on a running ClickHouse® cluster: a +catalogue of cluster-wide queries you can run, organised by subsystem, plus +scenario playbooks that walk you from a symptom to the queries that resolve +it. + +The intended reader is an on-call or support engineer who has cluster-wide +read access and needs to identify *which subsystem* is misbehaving as quickly +as possible. + +## How this runbook is organised + +| Section | What's in it | +|---|---| +| [Quick reference](/altinity-kb-diagnostics-runbook/quick-reference/) | One-page symptom → query map and the gotchas every diagnosis depends on. **Start here.** | +| [Investigation methods](/altinity-kb-diagnostics-runbook/investigation-methods/) | Process reminders — how to avoid common misdiagnoses. | +| [Query library](/altinity-kb-diagnostics-runbook/query-library/) | 54 cluster-wide queries grouped by subsystem (replication, parts, async inserts, Keeper, etc.). Reference material. | +| [Scenarios](/altinity-kb-diagnostics-runbook/scenarios/) | Step-by-step diagnostic flows for specific failure modes. | + +## How the queries are written + +Every query in the library fans out across the cluster using +`clusterAllReplicas('{cluster}', system.)`. Replace these placeholders +before running: + +- `{cluster}` — your cluster name (the value used in `remote_servers` / + `system.clusters.cluster`). +- `{database}`, `{table}`, `{mv_name}`, `{target_table_pattern}` — appear in + queries that drill into a specific object. + +Most queries include `hostName() AS host` as the first column so you can see +per-replica behaviour at a glance. Replication and metric tables vary slightly +across ClickHouse versions — when in doubt, inspect the columns first with +`SELECT name FROM system.columns WHERE database='system' AND table=''`. + +## Patterns that recur + +These are the misreads that account for a large share of wrong diagnoses. +Read them once before drilling into a specific scenario. + +1. **Host-skewed failures with a balanced workload.** Settings identical, + workload balanced, but failure rates differ wildly across replicas. The + cause is usually entry-point routing (HAProxy / ingress) directing most + traffic to a subset of hosts — not a ClickHouse misconfiguration. See + [scenarios → host-skewed failures](/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures/). + +2. **`tables[]` in `query_log` is not the writer.** Failed inserts list many + tables. The actual physical writer is in the INSERT query text — not the + first element of `tables[]`, which also includes the MV dependency chain. + See the [insert load and host skew queries](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) and + [scenarios → async insert issues](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/). + +3. 
**Cumulative vs current state.** `system.events` totals since process + start; ratios computed from those totals can show stale peak-load skew + that no longer exists. Always cross-check with `system.metric_log` over a + recent window before concluding "host X is slow". + +4. **ProfileEvents reveal "waited not worked".** A failed insert with + `RealTimeMicroseconds ≈ timeout` and `UserTimeMicroseconds < 10ms` means + the query never executed. The bottleneck is a lock or queue, not work. + Look upstream for what is blocking. + +5. **Same settings + different behaviour ⇒ upstream cause.** When + `system.settings` is identical across hosts and behaviour is still + skewed, the cause is outside ClickHouse: entry-point routing, pod + resource contention, or leader-coordination concentration. Stop looking + inside ClickHouse. + +## Where to start + +- "Customer says something is wrong, I don't know what" → run + [Scenario 10: General triage](/altinity-kb-diagnostics-runbook/scenarios/general-triage/). +- "I have a specific symptom" → open the + [quick reference](/altinity-kb-diagnostics-runbook/quick-reference/). +- "I need a specific query" → browse the + [query library](/altinity-kb-diagnostics-runbook/query-library/) by subsystem. + +## Related KB pages + +- [Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/) — focused memory diagnostics. +- [Who ate my CPU?](/altinity-kb-setup-and-maintenance/who-ate-my-cpu/) — focused CPU diagnostics. +- [DDLWorker and DDL queue problems](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/) — `ON CLUSTER` task troubleshooting. +- [System tables eat my disk](/altinity-kb-setup-and-maintenance/altinity-kb-system-tables-eat-my-disk/) — when `*_log` tables grow too large. diff --git a/content/en/altinity-kb-diagnostics-runbook/investigation-methods.md b/content/en/altinity-kb-diagnostics-runbook/investigation-methods.md new file mode 100644 index 0000000000..aaaccb6066 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/investigation-methods.md @@ -0,0 +1,147 @@ +--- +title: "Investigation methods" +linkTitle: "Investigation methods" +weight: 20 +description: > + Process reminders that prevent the most common misdiagnoses. +keywords: + - clickhouse troubleshooting + - clickhouse diagnostics + - tables array + - profileevents + - metric_log +--- + +These reminders are about *how* to investigate — they prevent the kinds of +wrong reads that send a diagnosis in the wrong direction for hours. Each one +maps to a specific query or pattern elsewhere in the runbook. + +## Verify before committing to a cause + +When the evidence points to more than one plausible cause, run one more +verification query before you state a conclusion. A wrong RCA costs more +trust and more time than the verification step would have. The cost of an +extra `SELECT` is seconds; the cost of unwinding a wrong diagnosis can be +days. + +## `tables[]` in `query_log` is not the writer + +The `query_log.tables` array contains every table touched by the query, +including the entire MV dependency chain. The actual physical INSERT target +is in the query text, not in `tables[0]`. 
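+
+A quick way to see this in practice is to print the array next to the query
+text for a few recent inserts (a minimal sketch; substitute `{cluster}` as
+usual, and the table names in the comment are hypothetical):
+
+```sql
+-- For a hypothetical insert into raw.events that feeds mv_hourly and its
+-- target agg.events_hourly, tables[] lists all three objects, while the
+-- query text names only raw.events, the actual physical writer.
+SELECT
+    hostName() AS host,
+    tables,
+    substring(query, 1, 200) AS query_text
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 1 HOUR
+  AND query_kind = 'Insert'
+  AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
+ORDER BY event_time DESC
+LIMIT 10;
+```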
+ +To find the real writer behind a failing insert, extract from the query +text: + +```sql +SELECT regexpExtract(query, 'INSERT INTO\s+([\w\.`]+)') AS target, … +``` + +See [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection) +and the dedicated [scenario](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/). + +## Cumulative metrics hide current state + +`system.events` integrates since process start. Ratios computed from those +totals can reflect a peak-load period that happened days ago and is no +longer relevant. + +When comparing per-host behaviour right now, use `system.metric_log` with a +recent window (5–30 minutes): + +- [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log) + — per-second profile activity by host. +- [Q49](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q49-tail-latency-for-keeper-operations) + — p50/p95/p99 of Keeper transactions, by host. + +If someone reports "host X has Nx higher Keeper waits", reproduce it with +Q49 over the last 30 minutes before treating it as a current problem. + +## Same settings + different behaviour ⇒ upstream cause + +If `system.settings` is identical across hosts (see +[Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection)) +and behaviour is still skewed across replicas, the cause is outside +ClickHouse. Likely sources: + +- Entry-point routing (HAProxy, ingress, or client library load balancing) + concentrating traffic on a subset of replicas. +- Pod-level resource contention (CPU throttling, memory pressure on the + node, page cache flushes from a noisy neighbour). +- Coordination work concentrated on a subset of hosts (leader concentration, + see [Q51](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q51-leader-distribution-across-hosts)). + +Stop looking inside ClickHouse — the answer is upstream. + +## Distinguish workload from failure + +"Volume is balanced" and "failures are balanced" answer different questions. +Either can be skewed independently. To resolve a host-skew report, look at +both: + +- Workload — [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log) + (`ProfileEvent_AsyncInsertQuery` per host). +- Failure rate — [Q53](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q53-failure-rate-per-host) + (failures normalised by attempts). + +Together they let you say "host A receives 4× more attempts" or "host A +fails at 5× the rate at equal volume" — those are very different problems +with different fixes. + +## ProfileEvents reveal "waited not worked" + +A failed query with `RealTimeMicroseconds ≈ timeout` and +`UserTimeMicroseconds` near zero means the query never executed. It sat in +a queue or on a lock. This rules out "the work itself is slow" and points +to "the wait is the problem". 
+ +Before theorising about a slow MV chain or slow merge as the cause of a +failed insert, inspect ProfileEvents on representative failed queries: + +```sql +SELECT + query_id, + query_duration_ms, + ProfileEvents['RealTimeMicroseconds'] AS real_us, + ProfileEvents['UserTimeMicroseconds'] AS user_us, + ProfileEvents['SystemTimeMicroseconds'] AS sys_us +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 30 MINUTE + AND type = 'ExceptionWhileProcessing' + AND exception ILIKE '%async insert%timeout%' +LIMIT 20; +``` + +If `user_us` is in single-digit milliseconds while `real_us` is at the +timeout ceiling, the work never ran. Find the lock or queue, not the slow +operator. + +## Routing settings to know about + +A short glossary of the settings that determine *where* a query lands and +*how* its MVs execute. Confirm them with +[Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection) +before tuning anything. + +- **`load_balancing`** — picks the replica for a Distributed table read or + insert. `hostname_levenshtein_distance` concentrates by hostname + similarity (often pinning to self), which can imbalance routing + unexpectedly. `random` or `round_robin` spreads work evenly. +- **`prefer_localhost_replica`** — when `1`, the local replica is preferred + regardless of `load_balancing`. Useful for read locality, risky for + insert balance. +- **`distributed_foreground_insert`** — when `1`, INSERTs into a + Distributed table wait synchronously for remote acks. Slower but no + silent loss. +- **`parallel_view_processing`** — when `0` (historical default on many + versions), MVs on a target table execute serially per insert. With a + deep MV chain, this turns each insert into a long sequential pipeline. + +## Sidecar Keeper means co-located, not shared + +If `system.zookeeper_connection.host == hostName()` (see +[Q50](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q50-keeper-connection-topology)), +the replica connects to a Keeper running on the same pod. "Slow Keeper +follower" theories don't apply in this topology — there is no shared +follower to be slow. Issues here are about pod-level contention (CPU, page +cache, disk), not Keeper network routing. diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/_index.md b/content/en/altinity-kb-diagnostics-runbook/query-library/_index.md new file mode 100644 index 0000000000..b38a916758 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/_index.md @@ -0,0 +1,40 @@ +--- +title: "Query library" +linkTitle: "Query library" +weight: 30 +description: > + Reference catalogue of cluster-wide diagnostic queries, grouped by subsystem. +keywords: + - clickhouse system tables + - clickhouse diagnostics + - clusterAllReplicas +--- + +54 cluster-wide queries grouped by the subsystem they probe. Every query +fans out via `clusterAllReplicas('{cluster}', system.
)`. Replace +`{cluster}` / `{database}` / `{table}` / `{mv_name}` / +`{target_table_pattern}` with values from your environment before running. + +Queries are referenced from the +[scenarios](/altinity-kb-diagnostics-runbook/scenarios/) by their numeric +IDs (`Q1`, `Q2`, …). Numbering is stable across the runbook so you can copy +shortcuts between teammates. + +| Page | Queries | Purpose | +|---|---|---| +| [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/) | Q1–Q5, Q31, Q32 | Replication queue depth, postpone reasons, replica lag, fetches in flight | +| [Parts and merges](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/) | Q6–Q10, Q42 | Parts per host/partition, active merges, merge throughput | +| [Disk and storage](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/) | Q11, Q12 | Per-disk free space, TTL move activity | +| [Pools and resources](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/) | Q13–Q15, Q54 | Background pool saturation, memory pressure, cgroup limits | +| [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) | Q16–Q19 | Recent query load, active queries, OOM/exception queries, stuck mutations | +| [Async inserts](/altinity-kb-diagnostics-runbook/query-library/async-inserts/) | Q20–Q28, Q38 | Flush errors, latency, MV chain inspection, timeout patterns | +| [Keeper and coordination](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/) | Q29, Q30, Q33, Q49–Q51 | Connection state, exception patterns, wait-time percentiles, topology, leader distribution | +| [Insert load and host skew](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) | Q34–Q37, Q40, Q41, Q46–Q48, Q52, Q53 | Insert rate/volume, per-host duration, routing settings, failure rate | +| [Dictionaries and Kafka](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/) | Q43–Q45 | Dictionary health, Kafka consumer vs pool size, consumer errors | + +## A note on version drift + +Several system tables changed schema between ClickHouse releases — column +names on `replicated_fetches`, the view columns on `query_log`, and the +existence of `zookeeper_log`. Each query page calls out the columns to +check first when a query errors out on a specific cluster. diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/async-inserts.md b/content/en/altinity-kb-diagnostics-runbook/query-library/async-inserts.md new file mode 100644 index 0000000000..24e76d9a94 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/async-inserts.md @@ -0,0 +1,238 @@ +--- +title: "Async inserts queries" +linkTitle: "Async inserts" +weight: 60 +description: > + Cluster-wide queries for async insert flush errors, latency, MV chain + inspection, and timeout patterns. +keywords: + - clickhouse async insert + - asynchronous_insert_log + - materialized view + - flush errors +--- + +Queries for diagnosing the async insert subsystem: schema variations, +flush errors and latency, materialized-view chain inspection, and the +specific timeout pattern that signals MV-chain saturation. All queries fan +out across the cluster — replace `{cluster}` / `{database}` / `{mv_name}` +with values from your environment. + +## Q20. Async insert log — schema check + +Column names on `asynchronous_insert_log` have shifted across versions. 
+Run this once when investigating a new cluster so the rest of the queries +on this page match the actual schema. + +```sql +SELECT name, type +FROM system.columns +WHERE database = 'system' + AND table = 'asynchronous_insert_log' +ORDER BY position; +``` + +## Q21. Async insert flush errors + +Recent failed flushes with the exception text, target database/table, +rows, size, and how long the flush waited. The starting point for "inserts +return 200 OK but the data isn't there". + +```sql +SELECT + hostname AS host, + event_time, + status, + exception, + database, + table, + rows, + round(bytes / 1e6, 1) AS size_MB, + flush_time, + dateDiff('second', event_time, flush_time) AS buffer_wait_sec +FROM clusterAllReplicas('{cluster}', system.asynchronous_insert_log) +WHERE status != 'Ok' + AND event_time >= now() - INTERVAL 4 HOUR +ORDER BY event_time DESC +LIMIT 30; +``` + +## Q22. Async insert impact aggregation + +Aggregates the last 12 hours of `FlushError` rows by host/table — total +rows, total size, first-error and last-error timestamps. Tells you "how +much data is affected and over what window". + +```sql +SELECT + hostname, + database, + table, + status, + count() AS flush_attempts, + sum(rows) AS total_rows_affected, + round(sum(bytes) / 1e9, 2) AS total_GB, + min(event_time) AS first_error, + max(event_time) AS last_error +FROM clusterAllReplicas('{cluster}', system.asynchronous_insert_log) +WHERE status = 'FlushError' + AND event_time >= now() - INTERVAL 12 HOUR +GROUP BY hostname, database, table, status +ORDER BY total_rows_affected DESC; +``` + +## Q23. Async insert flush latency by table/status + +Average and max buffer wait time, plus average flush size. Compare `Ok` +rows to `FlushError` rows for the same table — a divergence in flush size +or buffer wait is a strong hint about the cause. + +```sql +SELECT + hostname AS host, + database, + table, + status, + count() AS count, + sum(rows) AS total_rows, + round(sum(bytes) / 1e9, 2) AS total_GB, + avg(dateDiff('second', event_time, flush_time)) AS avg_buffer_wait_sec, + max(dateDiff('second', event_time, flush_time)) AS max_buffer_wait_sec, + round(avg(bytes) / 1e6, 1) AS avg_flush_MB +FROM clusterAllReplicas('{cluster}', system.asynchronous_insert_log) +WHERE event_time >= now() - INTERVAL 4 HOUR +GROUP BY hostname, database, table, status +ORDER BY hostname, status, count DESC; +``` + +## Q24. Slowest AsyncInsertFlush queries + +The slowest flush *queries* (`query_kind = 'AsyncInsertFlush'`) in the last +four hours. Each flush execution is a query in `query_log` — this lets you +see memory, exception, and full query text for the slowest ones. + +```sql +SELECT + hostName() AS host, + query_id, + event_time, + query_duration_ms, + round(memory_usage / 1e9, 1) AS memory_GB, + read_rows, + written_rows, + exception, + substr(query, 1, 500) AS query_text +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 4 HOUR + AND query_kind = 'AsyncInsertFlush' + AND (database = '{database}' OR query ILIKE '%{database}%') + AND type IN ('QueryFinish', 'ExceptionWhileProcessing') +ORDER BY query_duration_ms DESC +LIMIT 20; +``` + +## Q25. MV appearances in failed flushes + +For a specific MV, list every failed flush where the MV appears in +`views`. Quantifies the impact of one MV on flush failures. 
+ +```sql +SELECT + hostName() AS host, + query_duration_ms, + round(memory_usage / 1e9, 1) AS memory_GB, + read_rows, + written_rows, + views, + exception, + event_time +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 4 HOUR + AND query_kind = 'AsyncInsertFlush' + AND has(views, '{database}.{mv_name}') + AND type IN ('QueryFinish', 'ExceptionWhileProcessing') +ORDER BY query_duration_ms DESC +LIMIT 20; +``` + +## Q26. MV frequency in errors + +Counts how often each MV appears across `ExceptionWhileProcessing` rows in +the last four hours. The MV with the highest `appearances` is the prime +suspect for the chain bottleneck. + +```sql +SELECT + hostName() AS host, + arrayJoin(views) AS mv_name, + count() AS appearances, + avg(query_duration_ms) / 1000 AS avg_sec, + max(query_duration_ms) / 1000 AS max_sec +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 4 HOUR + AND query_kind = 'AsyncInsertFlush' + AND type = 'ExceptionWhileProcessing' +GROUP BY host, mv_name +ORDER BY appearances DESC; +``` + +## Q27. MV definitions — chain inspection + +The `as_select` text for every MV in a database. Use after Q26 to inspect +the MV that's appearing most often in failures. + +```sql +SELECT name, as_select +FROM system.tables +WHERE database = '{database}' + AND engine = 'MaterializedView' +ORDER BY name; +``` + +## Q28. Live async insert health check (last 5 minutes) + +A rolling status summary — counts and average row count by `status` for +the last five minutes. Useful as a poll during incident response: "are we +still failing right now?". + +```sql +SELECT + hostname, + status, + count() AS cnt, + avg(rows) AS avg_rows_per_flush, + max(rows) AS max_rows_per_flush, + max(event_time) AS latest +FROM clusterAllReplicas('{cluster}', system.asynchronous_insert_log) +WHERE event_time >= now() - INTERVAL 5 MINUTE +GROUP BY hostname, status +ORDER BY hostname, status; +``` + +## Q38. Async insert timeout failures by table ⭐ + +Direct culprit identification — pulls failures whose exception matches +`async insert%timeout%` and groups by `arrayJoin(tables)`. The table at +the top of the result is the timing-out target. + +```sql +SELECT + hostName() AS host, + arrayJoin(tables) AS table_name, + count() AS failures, + round(avg(query_duration_ms), 0) AS avg_ms, + max(event_time) AS last_fail, + substring(any(exception), 1, 200) AS sample_exception +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 4 HOUR + AND type = 'ExceptionWhileProcessing' + AND exception ILIKE '%async insert%timeout%' +GROUP BY host, table_name +ORDER BY failures DESC +LIMIT 20; +``` + +`arrayJoin(tables)` exposes the full MV blast radius — including non-writer +dependencies. Always cross-check the actual physical INSERT target with +[Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection) +before recommending a fix on one of these tables. 
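+
+If you want that cross-check in one pass, a variant of Q38 groups the same
+timeout failures by the target extracted from the query text instead of by
+`tables[]`. This is a sketch, not a numbered library query; it assumes the
+failing statements are plain `INSERT INTO db.table ...` text, so anything
+else extracts as empty:
+
+```sql
+SELECT
+    hostName() AS host,
+    extract(query, 'INSERT INTO\s+([\w\.`]+)') AS insert_target,
+    count() AS failures,
+    round(avg(query_duration_ms), 0) AS avg_ms,
+    max(event_time) AS last_fail
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 4 HOUR
+  AND type = 'ExceptionWhileProcessing'
+  AND exception ILIKE '%async insert%timeout%'
+GROUP BY host, insert_target
+ORDER BY failures DESC
+LIMIT 20;
+```
+
+A table that ranks high here *and* in Q38 is both the writer and the
+bottleneck; a table that appears only in Q38 is an MV dependency caught in
+the blast radius.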
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka.md b/content/en/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka.md new file mode 100644 index 0000000000..17ca4fe587 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka.md @@ -0,0 +1,88 @@ +--- +title: "Dictionaries and Kafka queries" +linkTitle: "Dictionaries and Kafka" +weight: 90 +description: > + Cluster-wide queries for dictionary health and Kafka consumer state. +keywords: + - clickhouse dictionaries + - clickhouse kafka + - kafka consumers + - max.poll.interval.ms +--- + +Queries for two related concerns: dictionary load state (often consumed by +MVs and therefore on the insert hot path) and Kafka consumer activity +(starvation manifests as `max.poll.interval.ms` violations). All queries +fan out across the cluster — replace `{cluster}` with your cluster name. + +## Q43. Dictionary health check + +First — only the dictionaries that are not loaded or have an exception: + +```sql +SELECT + name, status, last_exception, + loading_duration AS load_sec, + element_count, + round(bytes_allocated / 1e6, 1) AS MB +FROM clusterAllReplicas('{cluster}', system.dictionaries) +WHERE status != 'LOADED' OR last_exception != '' +ORDER BY name; +``` + +Then — every dictionary, sorted by load time. A long-loading dictionary +on the insert hot path (e.g., used inside `dictGet` in an MV) is a common +source of unexpected MV slowness. + +```sql +SELECT + name, status, element_count, + round(loading_duration, 2) AS load_sec, + round(bytes_allocated / 1e6, 1) AS MB +FROM clusterAllReplicas('{cluster}', system.dictionaries) +ORDER BY load_sec DESC +LIMIT 30; +``` + +## Q44. Kafka consumer count vs pool size ⭐ + +Compares the number of Kafka consumers to the configured message-broker +pool size and the current pool activity. The first query for +"`max.poll.interval.ms` exceeded" errors and Kafka consumer rebalance +storms. + +```sql +SELECT + hostName() AS host, + (SELECT count() FROM system.kafka_consumers) AS consumers, + (SELECT value FROM system.server_settings + WHERE name = 'background_message_broker_schedule_pool_size') AS mb_pool_size, + (SELECT value FROM system.metrics + WHERE metric = 'BackgroundMessageBrokerSchedulePoolTask') AS mb_pool_active; +``` + +Rule of thumb: if `consumers > mb_pool_size`, poll-interval violations are +all but guaranteed. Aim for `mb_pool_size >= consumers * 1.25`. + +## Q45. Kafka consumer error inspection + +Per-consumer last exception, last poll time, message count, and rebalance +counters. After Q44 confirms starvation, this tells you which consumers +are hitting it. + +```sql +SELECT + hostName() AS host, + database, table, + consumer_id, + last_exception, + num_messages_read, + last_poll_time, + num_rebalance_revocations, + num_rebalance_assignments +FROM clusterAllReplicas('{cluster}', system.kafka_consumers) +WHERE last_exception != '' +ORDER BY last_poll_time DESC +LIMIT 30; +``` diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/disk-and-storage.md b/content/en/altinity-kb-diagnostics-runbook/query-library/disk-and-storage.md new file mode 100644 index 0000000000..346bc471ef --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/disk-and-storage.md @@ -0,0 +1,60 @@ +--- +title: "Disk and storage queries" +linkTitle: "Disk and storage" +weight: 30 +description: > + Cluster-wide queries for disk usage and TTL move activity. 
+keywords: + - clickhouse disk usage + - clickhouse ttl + - NOT_ENOUGH_SPACE +--- + +Queries for inspecting per-disk free space across the cluster and recent +TTL movement / mutation activity. All queries fan out across the cluster — +replace `{cluster}` with your cluster name. + +## Q11. Disk usage per host ⭐ + +Per-host, per-disk free space, total space, and used percentage. The first +query when `NOT_ENOUGH_SPACE` appears in `last_exception`, or when merges +fail and `Q1`'s exception column points at disk. + +```sql +SELECT + hostName() AS host, + name AS disk_name, + type, + round(free_space / 1e9, 1) AS free_GB, + round(total_space / 1e9, 1) AS total_GB, + round((1 - free_space / total_space) * 100, 1) AS used_pct +FROM clusterAllReplicas('{cluster}', system.disks) +GROUP BY host, disk_name, type, free_space, total_space +ORDER BY host, used_pct DESC; +``` + +## Q12. TTL move / mutation activity + +`MovePart` and `MutatePart` events from `part_log` over the last hour. +Useful when investigating whether TTL moves to a cold tier are actually +running, and whether they're succeeding. + +```sql +SELECT + hostName() AS host, + event_time, + event_type, + database, table, part_name, + rows, + formatReadableSize(size_in_bytes) AS size, + error +FROM clusterAllReplicas('{cluster}', system.part_log) +WHERE event_time >= now() - INTERVAL 1 HOUR + AND event_type IN ('MovePart', 'MutatePart') +ORDER BY event_time DESC +LIMIT 50; +``` + +A non-empty `error` column with `S3 access denied`, `connection`, or +`credentials` keywords points at the cold-tier disk policy, not at +ClickHouse itself. diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew.md b/content/en/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew.md new file mode 100644 index 0000000000..cd4c7858ee --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew.md @@ -0,0 +1,259 @@ +--- +title: "Insert load and host skew queries" +linkTitle: "Insert load and host skew" +weight: 80 +description: > + Cluster-wide queries for insert volume, per-host duration, routing + settings, and failure-rate quantification. +keywords: + - clickhouse insert rate + - host skew + - load_balancing + - metric_log + - failure rate +--- + +Queries for profiling insert workload and detecting host-skewed behaviour. +The set here lets you answer "is the workload balanced", "is the duration +balanced", "is the failure rate balanced", and "are the routing settings +balanced" — four independent questions that together pinpoint host-skew +root causes. + +All queries fan out across the cluster — replace `{cluster}` / +`{database}` / `{table_pattern}` / `{target_table_pattern}` with values +from your environment. + +## Q34. Active insert sources by user + +Live insert activity grouped by user — quick "who's inserting right now". + +```sql +SELECT hostName(), user, query_kind, count() +FROM clusterAllReplicas('{cluster}', system.processes) +WHERE query_kind = 'Insert' +GROUP BY hostName(), user, query_kind; +``` + +## Q35. Insert volume by user (last 24 hours) + +Insert volume, error count, and time window per user across the last +day — identifies the heavy clients and the failing ones. 
+ +```sql +SELECT + hostName() AS host, + user, + count() AS insert_count, + sum(written_rows) AS total_rows, + round(sum(written_bytes) / 1e9, 2) AS total_GB, + round(avg(query_duration_ms), 0) AS avg_dur_ms, + countIf(type = 'ExceptionWhileProcessing') AS errors, + min(event_time) AS first_seen, + max(event_time) AS last_seen +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 24 HOUR + AND query_kind = 'Insert' + AND type IN ('QueryFinish', 'ExceptionWhileProcessing') +GROUP BY host, user +ORDER BY total_rows DESC +LIMIT 30; +``` + +## Q36. Insert volume by target table (last 24 hours) + +Extracts the target table from the query text with a regex, then +aggregates by it. Cross-check against +[Q47](#q47-failed-insert-query-text-inspection) before treating any table +as the "actual writer" — `tables[]` includes the MV chain. + +```sql +SELECT + hostName() AS host, + extract(query, 'INTO\s+([\w\.`]+)') AS target_table, + count() AS inserts, + sum(written_rows) AS rows, + round(sum(written_bytes) / 1e9, 2) AS GB, + countIf(type = 'ExceptionWhileProcessing') AS errors +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 24 HOUR + AND query_kind = 'Insert' +GROUP BY host, target_table +ORDER BY rows DESC +LIMIT 30; +``` + +## Q37. Insert rate per minute (spike detection) + +Per-minute insert counts and error counts across the last 24 hours. The +shape of the distribution tells you "spike" vs "sustained" — the +remediation differs. + +```sql +SELECT + toStartOfMinute(event_time) AS minute, + count() AS inserts, + sum(written_rows) AS rows, + countIf(type = 'ExceptionWhileProcessing') AS errors +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 24 HOUR + AND query_kind = 'Insert' +GROUP BY minute +ORDER BY inserts DESC +LIMIT 30; +``` + +## Q40. Active inserts confirmation (per-table specific) + +Counts new parts in the last 24 hours for tables matching +`{table_pattern}`. An empty result confirms a table is *not* being written +to — useful for verifying that an old/archived table is frozen. + +```sql +SELECT + hostName() AS host, + database, `table`, + sum(rows) AS rows_inserted, + count() AS insert_events, + max(event_time) AS last_insert +FROM clusterAllReplicas('{cluster}', system.part_log) +WHERE event_time >= now() - INTERVAL 24 HOUR + AND event_type = 'NewPart' + AND `table` LIKE '%{table_pattern}%' +GROUP BY host, database, `table` +ORDER BY rows_inserted DESC; +``` + +## Q41. Partition schema check (preventive) + +Lists the partition key and sorting key for tables matching the pattern. +Use ahead of partition fragmentation diagnosis to confirm what the schema +actually is. + +```sql +SELECT + database, `table`, partition_key, sorting_key +FROM clusterAllReplicas('{cluster}', system.tables) +WHERE `table` LIKE '%{table_pattern}%' +GROUP BY database, `table`, partition_key, sorting_key +ORDER BY database, `table`; +``` + +## Q46. Per-host insert duration profile ⭐ + +Per-host average, p95, and p99 insert duration over the last five minutes. +The first query to confirm "failures concentrate on some hosts but volume +looks similar" — if `avg_ms` or `p95_ms` differ by orders of magnitude +across hosts on identical workloads, the bottleneck is host-specific. 
+ +```sql +SELECT + hostName() AS host, + count() AS query_count, + round(avg(query_duration_ms), 0) AS avg_ms, + round(quantile(0.95)(query_duration_ms), 0) AS p95_ms, + round(quantile(0.99)(query_duration_ms), 0) AS p99_ms +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 5 MINUTE + AND type = 'QueryFinish' + AND query_kind = 'Insert' +GROUP BY host +ORDER BY host; +``` + +## Q47. Failed insert query text inspection ⭐ + +The query text contains the actual physical INSERT target — not just the +MV chain that `tables[]` exposes. Use this before blaming any specific +table for a timeout. + +```sql +SELECT + hostName() AS host, + event_time, + query_duration_ms, + substring(exception, 1, 200) AS exception_text, + user, client_hostname, initial_address, + substring(query, 1, 500) AS query_text +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 30 MINUTE + AND type = 'ExceptionWhileProcessing' + AND exception ILIKE '%async insert%timeout%' +ORDER BY event_time DESC +LIMIT 5 FORMAT Vertical; +``` + +The `INSERT INTO database.table` statement in the query text reveals the +real writer. Any other tables that show up in +[Q38](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q38-async-insert-timeout-failures-by-table)'s +`arrayJoin(tables)` are MV dependencies, not direct writes. + +## Q48. Per-second activity from metric_log + +Per-host averages and sums of profile events over a recent window: +active queries, message-broker pool task count, disk-write microseconds, +ZooKeeper wait microseconds, async insert attempts, and failed insert +attempts. The right tool for "is host-X currently doing more or less +work than the others?". + +```sql +SELECT + hostName() AS host, + count() AS samples, + avg(CurrentMetric_Query) AS avg_active_queries, + max(CurrentMetric_Query) AS max_active_queries, + avg(CurrentMetric_BackgroundMessageBrokerSchedulePoolTask) AS avg_mb_pool, + sum(ProfileEvent_DiskWriteElapsedMicroseconds) AS disk_write_us, + sum(ProfileEvent_ZooKeeperWaitMicroseconds) AS zk_wait_us, + sum(ProfileEvent_AsyncInsertQuery) AS async_inserts, + sum(ProfileEvent_FailedInsertQuery) AS failed_inserts +FROM clusterAllReplicas('{cluster}', system.metric_log) +WHERE event_time >= now() - INTERVAL 5 MINUTE +GROUP BY host +ORDER BY host; +``` + +`system.metric_log` stores metrics as **columns** (`CurrentMetric_*`, +`ProfileEvent_*`), not rows. You can't filter with +`WHERE metric IN (...)` — `SELECT` the specific columns. + +## Q52. Routing settings inspection + +Per-host inspection of the settings that control where INSERTs land and +how MVs execute. When these are identical across hosts but behaviour is +still skewed, the cause is upstream (entry-point routing, not ClickHouse). + +```sql +SELECT + hostName() AS host, + name, value +FROM clusterAllReplicas('{cluster}', system.settings) +WHERE name IN ('load_balancing', 'parallel_view_processing', + 'prefer_localhost_replica', 'distributed_foreground_insert', + 'async_insert', 'async_insert_busy_timeout_ms', + 'async_insert_busy_timeout_max_ms', 'async_insert_threads', + 'wait_for_async_insert') +ORDER BY host, name; +``` + +See +[Investigation methods → routing settings to know about](/altinity-kb-diagnostics-runbook/investigation-methods/#routing-settings-to-know-about) +for what each setting does. + +## Q53. Failure rate per host ⭐ + +Failure rate as a percentage of attempts — the workload-normalised view +of "which hosts are actually failing more". 
Pair with Q46 (duration) and +Q48 (volume) for the full picture. + +```sql +SELECT + hostName() AS host, + sum(ProfileEvent_AsyncInsertQuery) AS total_attempts, + sum(ProfileEvent_FailedInsertQuery) AS failures, + round(sum(ProfileEvent_FailedInsertQuery) * 100.0 / + nullIf(sum(ProfileEvent_AsyncInsertQuery), 0), 1) AS failure_rate_pct +FROM clusterAllReplicas('{cluster}', system.metric_log) +WHERE event_time >= now() - INTERVAL 5 MINUTE +GROUP BY host +ORDER BY failure_rate_pct DESC; +``` diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination.md b/content/en/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination.md new file mode 100644 index 0000000000..7e080507c0 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination.md @@ -0,0 +1,139 @@ +--- +title: "Keeper and coordination queries" +linkTitle: "Keeper and coordination" +weight: 70 +description: > + Cluster-wide queries for ClickHouse Keeper / ZooKeeper connection + state, wait-time percentiles, topology, and leader distribution. +keywords: + - clickhouse keeper + - zookeeper + - zookeeper_connection + - keeper latency +--- + +Queries for ClickHouse Keeper / ZooKeeper visibility: connection state, +recent exceptions, cumulative wait events, current-window tail latency, +sidecar vs centralized topology, and per-host leader counts. + +All queries fan out across the cluster — replace `{cluster}` with your +cluster name. + +## Q29. Keeper connection status + +Connection state per replica — which Keeper node it's connected to, +session age, expiry flag, API version. + +```sql +SELECT + hostName() AS host, + name, value +FROM clusterAllReplicas('{cluster}', system.zookeeper_connection); +``` + +## Q30. Keeper errors (last hour) + +Recent exceptions mentioning ZooKeeper / Keeper / code 999. Useful when a +replica goes readonly and you suspect a Keeper session loss. + +```sql +SELECT + hostName() AS host, + event_time, + exception_code, + substring(exception, 1, 200) AS exception_short +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 1 HOUR + AND type = 'ExceptionWhileProcessing' + AND (exception ILIKE '%zookeeper%' OR exception ILIKE '%keeper%' OR exception_code = 999) +ORDER BY event_time DESC +LIMIT 30; +``` + +## Q33. Keeper wait time and activity (cumulative) + +Cumulative Keeper-related event counters since process start. Useful for a +quick "what does Keeper see" snapshot — but read the warning below before +computing ratios. + +```sql +SELECT hostName() AS host, event, value +FROM clusterAllReplicas('{cluster}', system.events) +WHERE event LIKE '%ZooKeeper%' OR event LIKE '%Keeper%' +ORDER BY value DESC +LIMIT 30; +``` + +Key events: + +- `ZooKeeperWaitMicroseconds` — total wait time on Keeper responses. +- `ZooKeeperTransactions` — total transactions. +- `ZooKeeperList` — directory listings (high during many-parts + coordination). +- `ZooKeeperHardwareExceptions` / `ZooKeeperUserExceptions` — error counts. + +> **Warning.** These are cumulative since process start. A ratio like +> `ZooKeeperWaitMicroseconds / ZooKeeperTransactions` reflects everything +> the process has seen, including peaks from days ago. For current state, +> use Q49 instead. + +## Q49. Tail latency for Keeper operations ⭐ + +p50 / p95 / p99 of microseconds-per-transaction from `metric_log` over a +recent window. 
The right tool for "is host X slow on Keeper right now", +because it ignores stale peaks baked into the cumulative counters. + +```sql +SELECT + hostName() AS host, + quantile(0.50)(ProfileEvent_ZooKeeperWaitMicroseconds / nullIf(ProfileEvent_ZooKeeperTransactions, 0)) AS p50_us_per_txn, + quantile(0.95)(ProfileEvent_ZooKeeperWaitMicroseconds / nullIf(ProfileEvent_ZooKeeperTransactions, 0)) AS p95_us_per_txn, + quantile(0.99)(ProfileEvent_ZooKeeperWaitMicroseconds / nullIf(ProfileEvent_ZooKeeperTransactions, 0)) AS p99_us_per_txn +FROM clusterAllReplicas('{cluster}', system.metric_log) +WHERE event_time >= now() - INTERVAL 30 MINUTE + AND ProfileEvent_ZooKeeperTransactions > 0 +GROUP BY host +ORDER BY host; +``` + +If Q33 shows a per-host ratio but Q49 doesn't, the ratio is an artefact of +historical peak load — not a current problem. + +## Q50. Keeper connection topology + +Tells you whether each replica connects to a co-located Keeper (sidecar: +`keeper_address == hostName()`) or to a central Keeper cluster. The +"slow Keeper follower" hypothesis only applies in the central topology. + +```sql +SELECT + hostName() AS host, + name AS keeper_node, + host AS keeper_address, + port, + connected_time, + session_uptime_elapsed_seconds, + is_expired, + keeper_api_version +FROM clusterAllReplicas('{cluster}', system.zookeeper_connection) +ORDER BY host; +``` + +## Q51. Leader distribution across hosts + +Per-host counts of `is_leader = 1` vs `is_leader = 0` rows in +`system.replicas`. In a healthy multi-replica cluster, leader counts +should be roughly balanced. In a sidecar Keeper layout where every replica +is leader of its local copy, you'll see `leader_count == total_replicas` — +expected, not a concern. + +```sql +SELECT + hostName() AS host, + countIf(is_leader = 1) AS leader_count, + countIf(is_leader = 0) AS non_leader_count, + count() AS total_replicas +FROM clusterAllReplicas('{cluster}', system.replicas) +GROUP BY host +ORDER BY host; +``` diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/parts-and-merges.md b/content/en/altinity-kb-diagnostics-runbook/query-library/parts-and-merges.md new file mode 100644 index 0000000000..75a262c054 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/parts-and-merges.md @@ -0,0 +1,153 @@ +--- +title: "Parts and merges queries" +linkTitle: "Parts and merges" +weight: 20 +description: > + Cluster-wide queries for parts health, partition fragmentation, and + merge throughput. +keywords: + - clickhouse parts + - clickhouse merges + - parts_to_delay_insert + - too_many_parts +--- + +Queries for diagnosing part counts, partition fragmentation, active merges, +and the part-creation-vs-merge rate. All queries fan out across the +cluster — replace `{cluster}` with your cluster name and +`{table_pattern}` where indicated. + +## Q6. Parts health per host + +Per-host, per-table active part count, total rows, on-disk size, and the +most recent modification time. The starting point when investigating high +part counts cluster-wide. + +```sql +SELECT + hostName() AS host, + database, + table, + count() AS active_parts, + sum(rows) AS total_rows, + round(sum(bytes_on_disk) / 1e9, 2) AS size_GB, + max(modification_time) AS last_modified +FROM clusterAllReplicas('{cluster}', system.parts) +WHERE active = 1 +GROUP BY host, database, table +ORDER BY host, active_parts DESC; +``` + +## Q7. Parts count per partition ⭐ + +`parts_to_delay_insert` and `parts_to_throw_insert` are **per partition**, +not per table. 
A table with a thousand parts spread across a hundred +partitions is fine; a partition with three hundred parts is in trouble. +Use this when diagnosing `TOO_MANY_PARTS` (code 252) or "Delaying inserts +by N ms" warnings. + +```sql +SELECT + hostName() AS host, + database, table, partition, + count() AS parts, + sum(rows) AS rows, + round(sum(bytes_on_disk) / 1e9, 2) AS size_GB +FROM clusterAllReplicas('{cluster}', system.parts) +WHERE active = 1 +GROUP BY host, database, table, partition +HAVING parts > 100 +ORDER BY parts DESC +LIMIT 50; +``` + +## Q8. Active merges + +Currently-executing merges by host and table, with progress, elapsed time, +total merge size, and memory in use. Lets you see whether merges are +running and how much memory they hold. + +```sql +SELECT + hostName() AS host, + database, + table, + count() AS active_merges, + round(avg(progress) * 100, 1) AS avg_progress_pct, + max(elapsed) AS max_elapsed_sec, + round(sum(total_size_bytes_compressed) / 1e9, 2) AS total_merge_GB, + round(sum(memory_usage) / 1e9, 1) AS merge_memory_GB +FROM clusterAllReplicas('{cluster}', system.merges) +GROUP BY host, database, table +ORDER BY host, active_merges DESC; +``` + +## Q9. Part creation vs merge rate (last 30 minutes) + +Counts `NewPart`, `MergeParts`, `MutatePart`, and `RemovePart` events in a +recent window. When `new_parts` is growing faster than `merged_parts`, the +merge pool is not keeping up — back-pressure is imminent. + +```sql +SELECT + hostName() AS host, + database, table, + sum(if(event_type = 'NewPart', 1, 0)) AS new_parts, + sum(if(event_type = 'MergeParts', 1, 0)) AS merged_parts, + sum(if(event_type = 'MergeParts', rows, 0)) AS rows_merged, + sum(if(event_type = 'MutatePart', 1, 0)) AS mutations, + sum(if(event_type = 'RemovePart', 1, 0)) AS removed_parts +FROM clusterAllReplicas('{cluster}', system.part_log) +WHERE event_time >= now() - INTERVAL 30 MINUTE +GROUP BY host, database, table +ORDER BY new_parts DESC +LIMIT 30; +``` + +## Q10. Merge settings check + +Confirms the threshold settings before recommending a tuning change. These +are the values the engine actually uses, not what's in the running config. + +```sql +SELECT name, value +FROM system.merge_tree_settings +WHERE name IN ( + 'max_bytes_to_merge_at_max_space_in_pool', + 'number_of_free_entries_in_pool_to_lower_max_size_of_merge', + 'max_number_of_merges_with_ttl_in_pool', + 'parts_to_delay_insert', + 'parts_to_throw_insert', + 'inactive_parts_to_delay_insert', + 'inactive_parts_to_throw_insert' +); +``` + +## Q42. Partition count health + +Per-table partition count, active part count, and the ratio between them. +A high `partition_count` usually means a high-cardinality partition key +(e.g., partitioning by minute or hour on a dataset that doesn't need it). +A high `avg_parts_per_partition` means merges can't keep up with inserts. + +```sql +SELECT + hostName() AS host, + database, `table`, + count(DISTINCT partition) AS partition_count, + count() AS active_parts, + round(active_parts / partition_count, 1) AS avg_parts_per_partition, + sum(rows) AS total_rows, + round(sum(bytes_on_disk) / 1e9, 2) AS size_GB +FROM clusterAllReplicas('{cluster}', system.parts) +WHERE active = 1 AND `table` LIKE '%{table_pattern}%' +GROUP BY host, database, `table` +ORDER BY partition_count DESC; +``` + +Flag thresholds: + +- `partition_count > 500` per table → schema problem (partition key + cardinality is too high). +- `avg_parts_per_partition > 50` → merge pool can't keep up. 
+- `partition_count = 12` for a year of monthly data → correct. diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/pools-and-resources.md b/content/en/altinity-kb-diagnostics-runbook/query-library/pools-and-resources.md new file mode 100644 index 0000000000..721e85acb6 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/pools-and-resources.md @@ -0,0 +1,114 @@ +--- +title: "Pools and resources queries" +linkTitle: "Pools and resources" +weight: 40 +description: > + Cluster-wide queries for background pool saturation and memory pressure. +keywords: + - clickhouse background pool + - memory pressure + - cgroup memory + - jemalloc +--- + +Queries for inspecting background thread pool activity, configured pool +sizes, and memory pressure (process, jemalloc, cgroup, OS). All queries +fan out across the cluster — replace `{cluster}` with your cluster name. + +## Q13. Pool saturation metrics + +Current activity in each background pool. When a pool counter equals its +configured size (Q14), the pool is saturated — additional work will queue +behind it. + +```sql +SELECT + hostName() AS host, + metric, + value +FROM clusterAllReplicas('{cluster}', system.metrics) +WHERE metric IN ( + 'BackgroundFetchesPoolTask', + 'BackgroundMergesAndMutationsPoolTask', + 'BackgroundCommonPoolTask', + 'BackgroundSchedulePoolTask', + 'BackgroundMessageBrokerSchedulePoolTask', + 'ReplicatedFetch', + 'ReplicatedSend', + 'ReplicatedChecks', + 'Merge', + 'PartMutation', + 'Query' +) +ORDER BY host, metric; +``` + +## Q14. Pool sizes (server settings) + +The configured upper bound for each pool. Pair with Q13: when a Q13 value +matches the Q14 value for the same pool, that pool is the bottleneck. + +```sql +SELECT + hostName() AS host, + name, value +FROM clusterAllReplicas('{cluster}', system.server_settings) +WHERE name IN ( + 'background_pool_size', + 'background_fetches_pool_size', + 'background_merges_mutations_concurrency_ratio', + 'background_common_pool_size', + 'background_schedule_pool_size', + 'background_message_broker_schedule_pool_size' +); +``` + +## Q15. Memory pressure + +Process RSS, the ClickHouse memory tracker, jemalloc resident/active, OS +available/total, and cgroup used/total. The first query when investigating +OOM-kills, `MEMORY_LIMIT_EXCEEDED` (code 241), or pod restarts. + +```sql +SELECT + hostName() AS host, + metric, + formatReadableSize(value) AS val +FROM clusterAllReplicas('{cluster}', system.asynchronous_metrics) +WHERE metric IN ( + 'MemoryResident', + 'MemoryTracking', + 'jemalloc.resident', + 'jemalloc.active', + 'OSMemoryAvailable', + 'OSMemoryTotal', + 'CGroupMemoryUsed', + 'CGroupMemoryTotal' +) +ORDER BY host, metric; +``` + +If `MemoryResident` is far above `MemoryTracking`, the gap is jemalloc +retained pages and OS page cache. See +[Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/) +for attribution. + +## Q54. Memory pressure per host (compact) + +Same as Q15 but limited to the three numbers you compare across hosts. +Use this to detect cluster-wide memory pressure (every host at >90%) vs a +single-host issue. 
+ +```sql +SELECT + hostName() AS host, + metric, formatReadableSize(value) AS val +FROM clusterAllReplicas('{cluster}', system.asynchronous_metrics) +WHERE metric IN ('MemoryResident', 'OSMemoryAvailable', + 'CGroupMemoryUsed', 'CGroupMemoryTotal') +ORDER BY host, metric; +``` + +When `CGroupMemoryUsed / CGroupMemoryTotal > 90%` on every host, the +cluster is memory-constrained globally — workload-level tuning helps +marginally, but the real fix is more RAM per node or less work per node. diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations.md b/content/en/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations.md new file mode 100644 index 0000000000..1c8516f4aa --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations.md @@ -0,0 +1,110 @@ +--- +title: "Queries and mutations queries" +linkTitle: "Queries and mutations" +weight: 50 +description: > + Per-host query load, active queries, OOM/exception patterns, and stuck + mutations. +keywords: + - clickhouse query_log + - clickhouse processes + - stuck mutations + - OOM +--- + +Queries for the live and recent state of the query system: load by kind, +what's running right now, recent exceptions, and stuck mutations. All +queries fan out across the cluster — replace `{cluster}` with your cluster +name. + +## Q16. Query load per host (last 30 minutes) + +Per-host query counts by `query_kind`, average duration, peak memory, read +and written rows, and error count. Useful for spotting load imbalance and +error spikes by query type. + +```sql +SELECT + hostName() AS host, + query_kind, + count() AS query_count, + round(avg(query_duration_ms), 0) AS avg_duration_ms, + round(max(memory_usage) / 1e9, 1) AS max_memory_GB, + sum(read_rows) AS total_read_rows, + sum(written_rows) AS total_written_rows, + countIf(type = 'ExceptionWhileProcessing') AS errors +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 30 MINUTE + AND type IN ('QueryFinish', 'ExceptionWhileProcessing') +GROUP BY host, query_kind +ORDER BY host, query_count DESC; +``` + +## Q17. Active queries right now + +Live snapshot of running queries — elapsed, memory, rows read, plus a +query snippet. The fastest way to see "what's pinning the cluster right +now". + +```sql +SELECT + hostName() AS host, + query_id, user, elapsed, + round(memory_usage / 1e9, 2) AS memory_GB, + read_rows, + formatReadableSize(read_bytes) AS read_bytes, + query_kind, + substring(query, 1, 200) AS query_snippet +FROM clusterAllReplicas('{cluster}', system.processes) +ORDER BY elapsed DESC +LIMIT 30; +``` + +## Q18. Recent OOM / exception queries + +Failed queries in the last four hours with their exception code, exception +text, memory usage, and query snippet. Read after Q15 — gives you the +queries responsible for memory pressure spikes. + +```sql +SELECT + hostName() AS host, + event_time, + query_id, + round(memory_usage / 1e9, 1) AS memory_GB, + query_duration_ms, + exception_code, + substring(exception, 1, 300) AS exception_short, + substring(query, 1, 200) AS query_snippet +FROM clusterAllReplicas('{cluster}', system.query_log) +WHERE event_time >= now() - INTERVAL 4 HOUR + AND type = 'ExceptionWhileProcessing' +ORDER BY event_time DESC +LIMIT 30; +``` + +## Q19. Stuck mutations ⭐ + +All not-done mutations with their command, age, parts-to-do count, and +latest failure reason. The starting point for `ALTER TABLE … UPDATE/DELETE` +not completing. 
+ +```sql +SELECT + hostName() AS host, + database, table, + mutation_id, + command, + create_time, + is_done, + parts_to_do, + latest_fail_reason, + latest_fail_time +FROM clusterAllReplicas('{cluster}', system.mutations) +WHERE NOT is_done +ORDER BY host, create_time; +``` + +Mutations share the merge pool, so a stuck mutation often means the merge +pool is saturated (see Q13). A mutation that references a column that +no longer exists fails immediately with a clear `latest_fail_reason`. diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/replication-and-queue.md b/content/en/altinity-kb-diagnostics-runbook/query-library/replication-and-queue.md new file mode 100644 index 0000000000..e43301aea4 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/query-library/replication-and-queue.md @@ -0,0 +1,178 @@ +--- +title: "Replication and queue queries" +linkTitle: "Replication and queue" +weight: 10 +description: > + Cluster-wide queries for inspecting the replication queue, replica + status, and active fetches. +keywords: + - clickhouse replication queue + - replicated_fetches + - postpone_reason + - system.replicas +--- + +Queries for diagnosing replication queue depth, postpone reasons, replica +lag, readonly mode, and in-flight fetches. All queries fan out across the +cluster — replace `{cluster}` with your cluster name. + +These queries are referenced from the +[scenarios](/altinity-kb-diagnostics-runbook/scenarios/) by their numeric +IDs (`Q1`, `Q2`, …). The numbering is stable across the runbook. + +## Q1. Replication queue overview + +Per-host, per-table queue depth, currently-executing entries, max retries, +and the oldest entry. The starting point when "the queue isn't draining". + +```sql +SELECT + hostName() AS host, + database, + table, + count() AS queue_depth, + countIf(is_currently_executing) AS executing, + max(num_tries) AS max_retries, + max(last_exception) AS last_error, + min(create_time) AS oldest_entry +FROM clusterAllReplicas('{cluster}', system.replication_queue) +GROUP BY host, database, table +ORDER BY host, queue_depth DESC; +``` + +## Q2. Replication queue — postpone reasons ⭐ + +The smoking-gun query for merge↔fetch cycles. The `postpone_reason` text +names the actual cause; see the patterns table in +[quick reference](/altinity-kb-diagnostics-runbook/quick-reference/#common-postpone_reason-patterns). + +```sql +SELECT + hostName() AS host, + database, table, type, + new_part_name, + is_currently_executing, + num_tries, + num_postponed, + postpone_reason, + last_exception, + create_time +FROM clusterAllReplicas('{cluster}', system.replication_queue) +WHERE num_postponed > 0 OR last_exception != '' +ORDER BY num_postponed DESC, num_tries DESC +LIMIT 50; +``` + +## Q3. Queue entry type breakdown + +Splits the queue by entry type (`GET_PART`, `MERGE_PARTS`, `MUTATE_PART`, +etc.) so you can tell whether the backlog is fetches, merges, or mutations. + +```sql +SELECT + hostName() AS host, + database, table, type, + count() AS entries, + countIf(is_currently_executing) AS executing, + avg(num_tries) AS avg_tries, + sum(num_postponed) AS total_postponed +FROM clusterAllReplicas('{cluster}', system.replication_queue) +GROUP BY host, database, table, type +ORDER BY entries DESC; +``` + +## Q4. Replica status — lag and readonly per host + +Drills into a specific replica's state: leader flag, readonly flag, absolute +delay in seconds, queue size split, and how far the log pointer is behind +the leader. 
+ +```sql +SELECT + hostName() AS host, + database, + table, + is_leader, + is_readonly, + absolute_delay AS replica_lag_sec, + queue_size, + inserts_in_queue, + merges_in_queue, + log_max_index - log_pointer AS log_behind, + active_replicas, + total_replicas +FROM clusterAllReplicas('{cluster}', system.replicas) +ORDER BY host, replica_lag_sec DESC; +``` + +## Q5. Replication summary per host ⭐ + +One row per host — readonly count, lag, queue depth, insert/merge backlog. +The fastest first look at cluster-wide replication health and the first +query in the general-triage flow. + +```sql +SELECT + hostName() AS host, + count() AS total_tables, + countIf(is_readonly) AS readonly_tables, + countIf(absolute_delay > 300) AS lagging_tables, + max(absolute_delay) AS max_lag_sec, + sum(queue_size) AS total_queue_depth, + sum(inserts_in_queue) AS total_inserts_queued, + sum(merges_in_queue) AS total_merges_queued +FROM clusterAllReplicas('{cluster}', system.replicas) +GROUP BY host +ORDER BY max_lag_sec DESC, readonly_tables DESC; +``` + +## Q31. Replicated fetches in flight + +Active fetch tasks with their source replica, progress, elapsed time, and +bytes transferred. Distinguishes pool *exhaustion* from pool slots *claimed +by stuck tasks*. + +```sql +SELECT + hostName() AS host, + database, `table`, + source_replica_hostname, + elapsed, + progress, + round(total_size_bytes_compressed / 1e6, 1) AS total_MB, + round(bytes_read_compressed / 1e6, 1) AS read_MB, + result_part_name, + partition_id, + thread_id +FROM clusterAllReplicas('{cluster}', system.replicated_fetches) +ORDER BY host, elapsed DESC; +``` + +The column for the source replica varies by ClickHouse version. If the +above errors with "unknown identifier", inspect the schema first: + +```sql +SELECT name FROM system.columns +WHERE database = 'system' AND table = 'replicated_fetches'; +``` + +If `BackgroundFetchesPoolTask` is at the configured pool size but Q31 +returns few rows, the slots are claimed by tasks that are *waiting*, not +*transferring* — Keeper saturation is the usual cause. + +## Q32. Source replica distribution for active fetches + +Aggregates Q31 by source replica — useful when one replica is acting as the +fetch source for everyone and saturating its outbound bandwidth. + +```sql +SELECT + hostName() AS host, + source_replica_hostname, + count() AS active_fetches, + round(avg(progress) * 100, 1) AS avg_progress_pct, + max(elapsed) AS max_elapsed_sec +FROM clusterAllReplicas('{cluster}', system.replicated_fetches) +GROUP BY host, source_replica_hostname +ORDER BY host, active_fetches DESC; +``` diff --git a/content/en/altinity-kb-diagnostics-runbook/quick-reference.md b/content/en/altinity-kb-diagnostics-runbook/quick-reference.md new file mode 100644 index 0000000000..c33cc31d70 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/quick-reference.md @@ -0,0 +1,95 @@ +--- +title: "Quick reference: symptom to query" +linkTitle: "Quick reference" +weight: 10 +description: > + One-page lookup: pick a symptom, jump to the query that diagnoses it. +keywords: + - clickhouse triage + - clickhouse diagnostics + - postpone_reason + - replication queue + - async insert +--- + +When you have a specific symptom, run the indicated query first. When you +don't know what's wrong, run **Q5 → Q11 → Q15 → Q17** in that order — it +gives you 80% of the cluster's state in about ten seconds. + +All query IDs (`Q1`, `Q2`, …) link into the +[query library](/altinity-kb-diagnostics-runbook/query-library/). 
+ +## Symptom → first query + +| Symptom | Run first | Section | +|---|---|---| +| Queue not draining | Q2 — postpone reasons | [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/) | +| Background pool pinned, no progress | Q31 — active fetches | [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/) | +| Async insert timeout | Q38 — failure tables | [Async inserts](/altinity-kb-diagnostics-runbook/query-library/async-inserts/) | +| Kafka consumer kicks (`max.poll.interval.ms`) | Q44 — consumers vs pool | [Dictionaries and Kafka](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/) | +| Memory low | Q15 then Q8 — merges holding RAM | [Pools and resources](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/) | +| OOM / `MEMORY_LIMIT_EXCEEDED` | Q15 + Q17 + Q18 | [Pools and resources](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/), [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) | +| Disk full / `NOT_ENOUGH_SPACE` | Q11 | [Disk and storage](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/) | +| Mutations stuck | Q19 | [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) | +| Replica readonly | Q4 + Q29 | [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/), [Keeper and coordination](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/) | +| Slow queries | Q17 + Q16 | [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) | +| Insert backpressure ("delayed by X ms") | Q7 — parts per partition | [Parts and merges](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/) | +| Failures concentrated on a subset of hosts | Q46 + Q53 + Q48 | [Insert load and host skew](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) | +| Don't know what's wrong | Q5 → Q11 → Q15 → Q17 | mixed | + +## Common `postpone_reason` patterns + +From `system.replication_queue.postpone_reason` (Q2). The text is what to +match on: + +| Pattern in `postpone_reason` | What it means | +|---|---| +| `N fetches already executing, max N` | Fetch pool pinned. Check Q31 for actual transfers — if `replicated_fetches` is near-empty while the pool counter is at the limit, the symptom is Keeper saturation, not pool exhaustion. | +| `source parts size … greater than current maximum` | Fetch waiting on a merge to produce a smaller source part. Look at Q8 for the upstream merge. | +| `covering parts list` | Merge is waiting on child fetches to land. | +| `another log entry for same part is being processed` | Normal serialisation. Only a problem if persistent (the same entry stuck for tens of minutes). | +| Anything mentioning `timeout`, `S3`, or `network` | Infrastructure-layer issue — investigate the storage/network path, not ClickHouse internals. | + +## "Trust but verify" — pitfalls that hide root causes + +- **Empty `system.replicated_fetches` despite a high + `BackgroundFetchesPoolTask` counter** means tasks are stuck claiming slots + but not transferring. The pool isn't the bottleneck — Keeper or another + coordinator usually is. +- **`query_log.tables` is an array** that includes every table touched — + inserts, MV dependencies, and read-side joins. Use `arrayJoin(tables)` for + per-table grouping, never `tables[1]` as "the writer". 
The actual physical + INSERT target is in the query text. Always inspect the query text before + blaming a specific table. +- **`system.query_log` has no `database` or `table` column** — they live in + `databases[]` and `tables[]`. +- **`part_log` is the source of truth for "is this table being written to?"** + It covers both direct inserts and MV writes, while `query_log` only sees + the originating query. +- **`avg_ms ≈ async_insert_busy_timeout_ms`** is the signature of an MV-chain + timeout (the insert is *waiting*, not *working*). A genuinely slow insert + has a distribution; a queue timeout is a hard ceiling. +- **`system.metric_log` stores metrics as columns, not rows** + (`CurrentMetric_*`, `ProfileEvent_*`). You cannot filter with + `WHERE metric IN (…)` — `SELECT` the specific columns. +- **`system.events` uses an `event` column, not a `metric` column.** Easy + thinko when you switch between `metric_log`/`metrics` and `events`. +- **`system.zookeeper_log` does not exist on every version.** Run + `EXISTS TABLE system.zookeeper_log` before assuming it's available. +- **`EXPLAIN PIPELINE graph=1`** uses lowercase `graph=1`. Older syntax + (`GRAPH = 1`) does not parse. +- **The `views`/`view_durations` columns on `query_log` vary by version.** + When in doubt: + `SELECT name FROM system.columns WHERE database='system' AND table='query_log' AND name ILIKE '%view%'`. +- **Cumulative `system.events` totals integrate since process start.** Ratios + computed from them can reflect a peak-load period from days ago. Use + `system.metric_log` over a recent window when comparing live host + behaviour. See + [Investigation methods → cumulative metrics hide current state](/altinity-kb-diagnostics-runbook/investigation-methods/#cumulative-metrics-hide-current-state). + +## Priority heatmap + +If you can only run one query for a given scenario, the scenario page marks +it with **⭐**. For broad triage where you don't know the scenario yet: +`Q5 → Q11 → Q15 → Q17` covers replication, disk, memory, and active queries +in four queries. diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/_index.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/_index.md new file mode 100644 index 0000000000..407730673e --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/_index.md @@ -0,0 +1,32 @@ +--- +title: "Scenarios" +linkTitle: "Scenarios" +weight: 40 +description: > + Step-by-step diagnostic flows for common ClickHouse® failure modes. +keywords: + - clickhouse troubleshooting + - diagnostic playbook + - ClickHouse scenarios +--- + +Each scenario lists triggering symptoms, an ordered diagnostic flow +(queries to run, in order, with "what to look for"), common root causes, +and resolution paths. Queries are referenced by their numeric ID — follow +the link to the +[query library](/altinity-kb-diagnostics-runbook/query-library/) for the +full SQL. + +| Scenario | When to use | +|---|---| +| [General triage](/altinity-kb-diagnostics-runbook/scenarios/general-triage/) | "Something is wrong" — no specific symptom yet. Start here. | +| [Merge–fetch and pool issues](/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues/) | Queue not draining, pool counters pinned, replicated_fetches near-empty. | +| [Too many parts and backpressure](/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure/) | `TOO_MANY_PARTS`, "Delaying inserts by N ms", cascading insert slowdown. 
| +| [Replica readonly](/altinity-kb-diagnostics-runbook/scenarios/replica-readonly/) | One or more replicas in readonly mode, growing `absolute_delay`. | +| [Memory and disk pressure](/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure/) | OOM, `MEMORY_LIMIT_EXCEEDED`, `NOT_ENOUGH_SPACE`, cluster-wide pressure. | +| [Stuck mutations](/altinity-kb-diagnostics-runbook/scenarios/stuck-mutations/) | `ALTER UPDATE/DELETE` not completing. | +| [Async insert issues](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/) | Flush errors, MV chain timeouts, stuck async insert queue. | +| [Slow queries](/altinity-kb-diagnostics-runbook/scenarios/slow-queries/) | Dashboard timeouts, query latency complaints. | +| [Kafka consumer issues](/altinity-kb-diagnostics-runbook/scenarios/kafka-consumer-issues/) | `max.poll.interval.ms` violations, consumer rebalance storms. | +| [Frozen historical tables](/altinity-kb-diagnostics-runbook/scenarios/frozen-historical-tables/) | Old tables adding permanent background load. | +| [Host-skewed failures](/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures/) | Failures concentrate on a subset of hosts; settings and workload look identical. | diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues.md new file mode 100644 index 0000000000..079cbbdca8 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues.md @@ -0,0 +1,122 @@ +--- +title: "Async insert issues" +linkTitle: "Async insert issues" +weight: 70 +description: > + Diagnosing async insert flush failures, MV-chain timeouts, and stuck + async insert queues. +keywords: + - async insert + - asynchronous_insert_log + - flusherror + - MV timeout + - async_insert_busy_timeout_ms +--- + +Three failure modes share async-insert symptoms but differ in their cause +and fix. The MV-chain timeout case is the most commonly misdiagnosed — a +flush that looks slow is actually waiting in a queue. + +## Async insert flush failures + +### Symptoms + +- Inserts succeed at the HTTP layer but data is missing or delayed. +- `FlushError` rows in `system.asynchronous_insert_log`. +- Reports of "silent data loss". + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q28](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q28-live-async-insert-health-check-last-5-minutes) | Live snapshot — is it happening right now? | +| 2 | [Q21](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q21-async-insert-flush-errors) | Recent flush errors with exception text. | +| 3 | [Q22](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q22-async-insert-impact-aggregation) | Impact aggregation — total rows / bytes affected, time window. | +| 4 | [Q23](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q23-async-insert-flush-latency-by-tablestatus) | Latency patterns — are flushes timing out? | +| 5 | [Q24](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q24-slowest-asyncinsertflush-queries) | Slowest flush queries — what's making them slow? | +| 6 | [Q26](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q26-mv-frequency-in-errors) | Is one specific MV showing up in errors? | +| 7 | [Q25](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q25-mv-appearances-in-failed-flushes) | Drill into that MV's failure pattern. 
| +| 8 | [Q27](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q27-mv-definitions--chain-inspection) | Inspect the MV definition. | + +### Common root causes + +- MV in the chain hitting a memory limit or a slow JOIN. +- Target table on the MV chain has `TOO_MANY_PARTS`. +- Async insert buffer too large — flush exceeds query memory. +- MV using non-deterministic functions or external dictionaries that are + slow / failing to refresh. + +## MV chain timeout on async inserts + +### Symptoms + +- `Code: 159. DB::Exception: Wait for async insert timeout (120000 ms) exceeded`. +- `avg_ms` exactly at `async_insert_busy_timeout_ms` (default 120000) — + the signature of a *wait*, not a *slow work*. +- Specific target tables in the failure list, not all of them. +- Persistent failures, not bursty. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q38](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q38-async-insert-timeout-failures-by-table) ⭐ | Which tables are timing out. | +| 2 | [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection) ⭐ | The **actual** physical INSERT target from the query text — not just `tables[]`. | +| 3 | Q39 (`as_select` for MVs writing into those tables) | MV chain depth feeding the failing tables. | +| 4 | [Q42](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q42-partition-count-health) | Are target tables fragmented? | +| 5 | [Q43](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/#q43-dictionary-health-check) | Are dictionaries used in MVs healthy? | +| 6 | [Q16](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q16-query-load-per-host-last-30-minutes) | Other queries running heavily on those tables? | + +### Common root causes + +1. MV doing batch-ETL work (heavy joins, many `dictGet`, aggregations) at + insert time. +2. Target table has too many parts — the MV's writes back into it are + slow. +3. A dictionary used in an MV is slow or stale. +4. MV chain depth too deep (`MV → table → MV → table`). + +### Resolution path + +1. **Quick relief**: raise `async_insert_busy_timeout_ms` for the + user/table. +2. **Real fix**: simplify the MV — move heavy work to a scheduled + Refreshable MV or a batch job. +3. If a dictionary is slow → fix its source or refresh policy. +4. If the target is fragmented → fix the part count first + ([Too many parts and backpressure](/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure/)). + +## Stuck async insert queue (buffers don't drain) + +### Symptoms + +- `system.metrics.PendingAsyncInsert` very high (hundreds+) on some hosts, + low on others. +- Failed async inserts piling up. +- `async_insert_threads` already adequately sized. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection) | Confirm the actual writers. | +| 2 | [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log) | Active query count per host. | +| 3 | [Q53](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q53-failure-rate-per-host) | Failure rate per host. | +| 4 | [Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection) | `async_insert_busy_timeout_ms` and related settings. 
| +| 5 | Inspect the failing query's `Settings` | Is `wait_for_async_insert=1`? Client is waiting for flush completion. | + +### Key signature + +`query_duration_ms ≈ async_insert_busy_timeout_ms` with +`UserTimeMicroseconds` in single-digit milliseconds. The insert sat in a +queue for the full timeout, doing no CPU work. See +[Investigation methods → ProfileEvents reveal "waited not worked"](/altinity-kb-diagnostics-runbook/investigation-methods/#profileevents-reveal-waited-not-worked). + +### Resolution path + +- Raise `async_insert_busy_timeout_ms` (the wait ceiling) — buys time per + insert, treats the symptom. +- Lower `async_insert_max_data_size` — smaller, more frequent flushes. +- Find and fix the upstream cause of queue concentration — + [Host-skewed failures](/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures/) + is the usual next stop. diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/frozen-historical-tables.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/frozen-historical-tables.md new file mode 100644 index 0000000000..50747ef699 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/frozen-historical-tables.md @@ -0,0 +1,45 @@ +--- +title: "Frozen historical tables adding background load" +linkTitle: "Frozen historical tables" +weight: 100 +description: > + Identifying old, no-longer-written tables whose partition count adds + permanent Keeper coordination load. +keywords: + - clickhouse partitions + - keeper load + - historical tables + - partition cardinality +--- + +## Symptoms + +- Old tables (previous-year or archive tables) showing high partition + counts. +- Part counts high but stable — not growing. +- Background merge / Keeper traffic disproportionate to the active + workload. + +## Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q42](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q42-partition-count-health) ⭐ | Tables with extreme partition counts. | +| 2 | [Q40](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q40-active-inserts-confirmation-per-table-specific) | Confirm no recent writes — an empty result means the table is frozen. | +| 3 | [Q33](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q33-keeper-wait-time-and-activity-cumulative) | If `ZooKeeperList` is very high → confirms Keeper coordination overhead is the load source. | + +## Resolution path + +Frozen high-cardinality-partition tables don't cause acute incidents but +add permanent load. Options, ordered by lowest disruption: + +1. **Drop** if the data is archived elsewhere. +2. **Detach old partitions** and **re-attach** them to a re-partitioned + table with a sane partition key (`toYYYYMM(date)` for monthly, + `toYYYYMMDD(date)` for daily on small datasets). +3. **Rebuild** the table with the sane partition key — only when neither + of the above is feasible. Costly in time and disk. + +The partition key choice is the schema-level fix; see +[How to pick an ORDER BY / PRIMARY KEY / PARTITION BY](/engines/mergetree-table-engine-family/pick-keys/) +for guidance. 
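
A rough sketch of the rebuild path (option 3 above), with placeholder
database, table, and column names. It assumes an Atomic database (required
for `EXCHANGE TABLES`) and a plain `MergeTree` engine; adapt accordingly for
`Replicated*` tables:

```sql
-- 1. New table with a saner (monthly) partition key. The schema is a placeholder.
CREATE TABLE db.events_archive_monthly
(
    event_date Date,
    id         UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, id);

-- 2. Copy the data. The table is frozen, so there are no concurrent writes
--    to race against.
INSERT INTO db.events_archive_monthly SELECT * FROM db.events_archive;

-- 3. Verify row counts match, then swap names atomically. After the swap the
--    original (over-partitioned) data sits under the _monthly name; drop it
--    once you are satisfied.
EXCHANGE TABLES db.events_archive AND db.events_archive_monthly;
DROP TABLE db.events_archive_monthly;
```

For a very large archive, copying partition by partition
(`WHERE _partition_id = '…'`) keeps the intermediate disk footprint
predictable.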
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/general-triage.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/general-triage.md new file mode 100644 index 0000000000..79f60c19d7 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/general-triage.md @@ -0,0 +1,37 @@ +--- +title: "General cluster triage" +linkTitle: "General triage" +weight: 10 +description: > + The four-query first look when you don't yet know what's wrong. +keywords: + - clickhouse triage + - clickhouse health check +--- + +When the only information you have is "something is wrong", four queries +in order give you 80% of the cluster's state in about ten seconds. Use +this when you can't yet pick a more specific scenario. + +## Diagnostic flow + +| Step | Query | Purpose | +|---|---|---| +| 1 | [Q5](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q5-replication-summary-per-host) | One row per host — readonly tables, lag, queue depth. | +| 2 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) | Per-disk free space across the cluster. | +| 3 | [Q15](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q15-memory-pressure) | Memory headroom (process, jemalloc, cgroup, OS) everywhere. | +| 4 | [Q17](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q17-active-queries-right-now) | Active queries right now — what's running and how heavy. | +| 5 | [Q18](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q18-recent-oom--exception-queries) | Recent exceptions across the last 4 hours. | +| 6 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | Parts count overview. | +| 7 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Background pool saturation. | + +After these, branch into a specific scenario based on what surfaced: + +- Readonly tables in Q5 → [Replica readonly](/altinity-kb-diagnostics-runbook/scenarios/replica-readonly/). +- High lag or queue in Q5 → [Merge–fetch and pool issues](/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues/). +- Disk near full in Q11 → [Memory and disk pressure](/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure/). +- Memory pressure in Q15 → [Memory and disk pressure](/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure/). +- Long-running queries in Q17 → [Slow queries](/altinity-kb-diagnostics-runbook/scenarios/slow-queries/). +- `TOO_MANY_PARTS` in Q18 → [Too many parts and backpressure](/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure/). +- Async insert timeouts in Q18 → [Async insert issues](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/). +- Pool counters pinned in Q13 → [Merge–fetch and pool issues](/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues/). diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures.md new file mode 100644 index 0000000000..e228e89101 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures.md @@ -0,0 +1,112 @@ +--- +title: "Host-skewed failures" +linkTitle: "Host-skewed failures" +weight: 110 +description: > + Diagnosing situations where failures concentrate on a subset of hosts + even though workload and configuration look identical. 
+keywords: + - host skew + - load_balancing + - haproxy + - parallel_view_processing + - cumulative metrics +--- + +Three related cases live here: host-skewed failures with a balanced +workload, "stale skew" complaints based on cumulative metrics, and the +misattribution of failure tables when `tables[]` is read as the writer. +All three share the same root pattern — surface appearances disagree with +what's actually happening — and the same investigative tools resolve them. + +## Host-skewed insert failures (workload balanced, failures not) + +### Symptoms + +- Multiple replicas in the cluster. +- Async insert failure rate is wildly different across hosts. +- Question is some variation of "why are some hosts broken while others + work fine?". + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q46](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q46-per-host-insert-duration-profile) ⭐ | Per-host insert duration imbalance. | +| 2 | [Q53](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q53-failure-rate-per-host) | Failure rate per host, workload-normalised. | +| 3 | [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log) | Workload volume per host; active query pile-up. | +| 4 | [Q54](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q54-memory-pressure-per-host-compact) | Memory pressure — concentrated or cluster-wide? | +| 5 | [Q33](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q33-keeper-wait-time-and-activity-cumulative) + [Q49](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q49-tail-latency-for-keeper-operations) | Confirm Keeper isn't the imbalance source. | +| 6 | [Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection) | Verify routing settings are identical across hosts. | + +### Decision tree + +- **Settings identical + workload balanced + failures host-skewed** → + upstream entry-point routing (HAProxy / ingress directing traffic to a + subset of hosts). +- **Settings identical + memory pressure on bad hosts only** → resource + contention on those pods (CPU throttling, page-cache pressure). +- **`parallel_view_processing = 0` + MV chains on slow hosts** → serial + MV execution queue, exacerbated by entry-point routing. + +### Resolution path + +- Raise `async_insert_busy_timeout_ms` for immediate relief. +- Enable `parallel_view_processing = 1` to cut MV-chain wall time on each + insert (be aware this can change MV ordering semantics — confirm the + application is tolerant). +- Change `load_balancing` from a hostname-affine policy to `round_robin` + or `random`. +- Investigate the ingress / load balancer to spread client connections + evenly across replicas. + +## Stale skew: a "ratio" computed from cumulative metrics + +### Symptoms + +- Someone reports a metric ratio ("host X has Nx higher Keeper waits") + and asks for investigation. +- The supporting evidence is `system.events` totals — cumulative since + process start. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | — | Ask which window the ratio was computed over. Cumulative `system.events` values include all historical peaks since process start. 
| +| 2 | [Q49](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q49-tail-latency-for-keeper-operations) | p50/p95/p99 from `metric_log` over a recent window (10–30 min). | +| 3 | [Q33](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q33-keeper-wait-time-and-activity-cumulative) | Wait by host on a recent window — confirm whether imbalance is current or historical. | +| 4 | [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log) | Whether watches and inflight requests are balanced now. | +| 5 | [Q50](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q50-keeper-connection-topology) + [Q51](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q51-leader-distribution-across-hosts) | Verify Keeper topology and leadership are uniform. | + +### Decision tree + +- **Cumulative shows Nx skew, recent 10-min window shows balanced** → + historical incident artefact, already resolved. +- **Cumulative and recent window agree** → real ongoing imbalance; dig + into per-host root cause. +- **Recent window shows a different host as outlier** → the original + observation is stale. Explain the data carefully when reporting back. + +## Misattributed failure tables + +### Symptoms + +- Failed inserts list many target tables in `system.query_log.tables[]`. +- Several look like the culprit. +- Raising timeouts on the suspected tables doesn't help. + +### Diagnostic flow + +1. Run [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection) + to get the actual INSERT query text. +2. The `INSERT INTO database.table` statement reveals the **physical** + target — not the MV chain. +3. Compare with the `tables[]` array — additional entries are MV + dependencies, not direct writes. +4. Apply the fix on the actual physical target table, not on MV + dependencies. + +The `tables[]` array tells you the full MV blast radius, not the specific +writer. Always run Q47 before deciding "the slow table is X". See +[Investigation methods → `tables[]` in query_log is not the writer](/altinity-kb-diagnostics-runbook/investigation-methods/#tables-in-query_log-is-not-the-writer). diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/kafka-consumer-issues.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/kafka-consumer-issues.md new file mode 100644 index 0000000000..4a70d79ecd --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/kafka-consumer-issues.md @@ -0,0 +1,43 @@ +--- +title: "Kafka consumer issues" +linkTitle: "Kafka consumer issues" +weight: 90 +description: > + Diagnosing Kafka consumer thread starvation and rebalance storms. +keywords: + - clickhouse kafka + - max.poll.interval.ms + - kafka rebalance + - background_message_broker_schedule_pool_size +--- + +## Symptoms + +- `Maximum application poll interval (max.poll.interval.ms) exceeded` + errors. +- Kafka consumers getting kicked and rejoining frequently. +- Drip-fire pattern: 1–10 kicks per minute, sustained. + +## Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q44](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/#q44-kafka-consumer-count-vs-pool-size) ⭐ | `consumers > mb_pool_size` confirms starvation. | +| 2 | [Q45](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/#q45-kafka-consumer-error-inspection) | Per-consumer error inspection. 
| +| 3 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Is `BackgroundMessageBrokerSchedulePoolTask` pinned at pool size? | + +## Resolution + +Raise `background_message_broker_schedule_pool_size` to at least +`consumers * 1.25`. Requires a server restart — the setting is +server-level, not user-level. + +If the consumer count itself is excessive, also review whether +`kafka_num_consumers` per table is over-provisioned. Each +`Kafka` table contributes consumers based on this setting; multiplying +across many tables explodes the total quickly. + +Related setup guidance: + +- [background_message_broker_schedule_pool_size](/altinity-kb-integrations/altinity-kb-kafka/04-operations-troubleshooting/background_message_broker_schedule_pool_size/) +- [Kafka parallel consuming](/altinity-kb-integrations/altinity-kb-kafka/02-consumption-patterns/altinity-kb-kafka-parallel-consuming/) diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure.md new file mode 100644 index 0000000000..1716f9bfb4 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure.md @@ -0,0 +1,101 @@ +--- +title: "Memory and disk pressure" +linkTitle: "Memory and disk pressure" +weight: 50 +description: > + Diagnosing OOM, `MEMORY_LIMIT_EXCEEDED`, `NOT_ENOUGH_SPACE`, and + cluster-wide memory pressure that aggravates other failures. +keywords: + - clickhouse OOM + - MEMORY_LIMIT_EXCEEDED + - NOT_ENOUGH_SPACE + - cgroup memory +--- + +Three closely-related modes: per-query OOM, disk-full conditions blocking +merges, and the cluster-wide memory pressure that turns a marginal +workload into one that fails. + +## OOM / memory pressure + +### Symptoms + +- Code 241 (`MEMORY_LIMIT_EXCEEDED`). +- `OvercommitTracker` killing queries. +- ClickHouse pod restarts / OOMKilled. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q15](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q15-memory-pressure) | `MemoryResident` vs `CGroupMemoryTotal` — actual headroom. | +| 2 | [Q17](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q17-active-queries-right-now) | Active queries — large aggregations holding GB of memory. | +| 3 | [Q18](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q18-recent-oom--exception-queries) | Recent OOM patterns — same query? Same user? Same time? | +| 4 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Active merges — large merges hold memory too. | +| 5 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | High parts count → metadata overhead in RAM. | +| 6 | [Q16](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q16-query-load-per-host-last-30-minutes) | Too many concurrent queries? | +| 7 | [Q14](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q14-pool-sizes-server-settings) + [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Background pools consuming memory unnecessarily? | + +### Common root causes + +- Cluster genuinely undersized for the workload. +- Query without `max_memory_usage` doing a large `GROUP BY` without an + `max_bytes_before_external_group_by` spill threshold. +- Many parts → metadata pressure. 
+- Concurrent large merges of wide parts. +- Async insert buffers oversized. + +See [Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/) +for per-subsystem RAM attribution. + +## Disk full / NOT_ENOUGH_SPACE + +### Symptoms + +- Merges failing with "Not enough space" in `last_exception`. +- Insert errors. +- One disk in the storage policy full. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) ⭐ | Disk usage — which disk on which host. | +| 2 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | Largest tables by size — cleanup candidates. | +| 3 | [Q12](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q12-ttl-move--mutation-activity) | TTL move activity — are parts moving to cold tier? | +| 4 | [Q19](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q19-stuck-mutations) | Stuck mutations adding to disk usage (mutations rewrite parts). | +| 5 | [Q1](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q1-replication-queue-overview) | Queue entries with disk-related exceptions. | + +### Common root causes + +- TTL move not configured, or the cold-tier disk policy failing (S3 + credentials, network). +- Backup volumes filling local disk. +- Detached parts not cleaned up. +- A single huge partition. + +## Cluster-wide memory pressure as an aggravator + +### Symptoms + +- No single host is OOM, but every host shows `CGroupMemoryUsed > 90%` of + `CGroupMemoryTotal`. +- Slow inserts, slow merges, page-cache thrashing — and the failures move + around the cluster rather than concentrating on one host. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q54](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q54-memory-pressure-per-host-compact) ⭐ | Confirm pressure is cluster-wide, not concentrated. | +| 2 | [Q15](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q15-memory-pressure) | Full memory breakdown. | +| 3 | [Q9](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q9-part-creation-vs-merge-rate-last-30-minutes) | Whether merge throughput is degraded (a sign of pressure). | +| 4 | `system.asynchronous_metrics.MemoryCacheFiles` (if available) | Page-cache size proxy. | + +### Resolution path + +With sustained 95%+ utilisation, large MV processing or merge bursts will +stall under pressure. Workload-level tuning helps marginally; the real +fix is more RAM per node or reducing the workload (fewer MVs, smaller +batches, less concurrent work). Tighten `max_memory_usage` per query as a +guard. diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues.md new file mode 100644 index 0000000000..4e2f6e9efa --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues.md @@ -0,0 +1,98 @@ +--- +title: "Merge–fetch and pool issues" +linkTitle: "Merge–fetch and pool issues" +weight: 20 +description: > + Diagnosing replication queues that stop draining, including merge–fetch + cycles and fetch-pool deadlocks where slots are claimed but no transfers + happen. 
+keywords: + - replication queue + - postpone_reason + - replicated_fetches + - background_fetches_pool_size +--- + +Two distinct failure modes share these symptoms but need different fixes. +The first is a merge↔fetch cycle (work blocked behind itself). The second +is a fetch-pool deadlock where the pool counter is pinned but +`replicated_fetches` is near-empty — typically a Keeper saturation under a +fragmentation-driven coordination load. + +## Merge↔fetch cycle / merge stall + +### Symptoms + +- Replication queue not draining even with ingestion stopped. +- `merges_in_queue` high, but few active merges. +- Reports of "merges waiting for fetches, fetches waiting for merges". +- Parts count climbing despite no or low writes. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q5](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q5-replication-summary-per-host) | Which hosts have readonly tables, max lag, largest queues. | +| 2 | [Q4](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q4-replica-status--lag-and-readonly-per-host) | Specific tables — is one replica lagging while others are fine? | +| 3 | [Q2](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q2-replication-queue--postpone-reasons) ⭐ | `postpone_reason` text — look for "source parts size … greater than current maximum", "another log entry for same part is being processed", "covering parts list". | +| 4 | [Q3](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q3-queue-entry-type-breakdown) | Entry type breakdown — `GET_PART` (fetches) vs `MERGE_PARTS` ratio. | +| 5 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Are `BackgroundFetchesPoolTask` and `BackgroundMergesAndMutationsPoolTask` pinned at their pool size? | +| 6 | [Q14](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q14-pool-sizes-server-settings) | Confirm configured pool sizes — has the cluster been pre-tuned? | +| 7 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Are merges making progress, or stuck for hours on huge parts? | +| 8 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) | Disk full ruled out? `NOT_ENOUGH_SPACE` looks like a merge stall but is separate. | + +### Common root causes + +- Pool sizes too small for the workload (especially + `background_fetches_pool_size`). +- Wide imbalance — one replica not serving fetches (S3, network, or + credentials) so peers cannot pull. +- Disk full on one node blocks merges, cascading into a fetch backlog on + peers. +- Merge throughput collapsed because of 100+ GiB merges on slow storage. + +## Distributed fetch deadlock (pool pinned, no transfers) + +### Symptoms + +- `BackgroundFetchesPoolTask` at pool size on all hosts. +- Replication queue is 99%+ `GET_PART` (not `MERGE_PARTS`). +- Queue does not drain even with ingestion stopped. +- [Q31](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q31-replicated-fetches-in-flight) + returns very few rows compared to the claimed pool slots. +- `postpone_reason` mentions *"Not executing fetch of part X because N + fetches already executing, max N"*. + +This is **not** a merge↔fetch cycle. Pool slots are claimed but transfers +aren't happening. 
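
A minimal sketch that puts the two numbers side by side: pool slots claimed
versus fetches actually transferring. It uses only `system.metrics` and
`system.replicated_fetches`; replace `{cluster}` as elsewhere in the runbook.

```sql
-- Slots claimed per host vs fetches actually moving bytes. In the deadlock
-- case, pool_claimed sits at the configured pool size while transferring is
-- near zero.
SELECT
    host,
    pool_claimed,
    transferring
FROM
(
    SELECT hostName() AS host, value AS pool_claimed
    FROM clusterAllReplicas('{cluster}', system.metrics)
    WHERE metric = 'BackgroundFetchesPoolTask'
) AS pool
LEFT JOIN
(
    SELECT hostName() AS host, count() AS transferring
    FROM clusterAllReplicas('{cluster}', system.replicated_fetches)
    GROUP BY host
) AS fetches USING (host)
ORDER BY pool_claimed DESC;
```
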
+ +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q5](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q5-replication-summary-per-host) | Queue depth per host — usually concentrated on a subset. | +| 2 | [Q2](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q2-replication-queue--postpone-reasons) | `postpone_reason` mentioning "fetches already executing, max". | +| 3 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | `BackgroundFetchesPoolTask = pool_size` on all hosts but `ReplicatedFetch` near zero. | +| 4 | [Q31](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q31-replicated-fetches-in-flight) ⭐ | Actual fetches transferring — should be hundreds, will be single digits. | +| 5 | [Q33](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q33-keeper-wait-time-and-activity-cumulative) | `ZooKeeperWaitMicroseconds` extremely high → Keeper saturation. | +| 6 | [Q42](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q42-partition-count-health) | Find the table with massive part count driving Keeper load. | +| 7 | [Q34](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q34-active-insert-sources-by-user) | Confirm ingestion is actually stopped. | + +### Common root cause + +Part fragmentation on one or more high-volume tables saturates Keeper +coordinating replication. Fetch tasks block waiting on Keeper responses; +the pool fills with waiting tasks while no transfers happen. + +### Resolution path + +1. Stop ingestion to the offending table. +2. Wait for merges to reduce part count (hours, not minutes). +3. Once parts collapse, Keeper pressure drops, fetches resume, queue + drains. +4. Before resuming ingestion, fix the insert pattern — async inserts, + larger batches, less granular partitioning. + +**Do not** raise `background_fetches_pool_size`. The pool is not the +bottleneck — it's saturated by tasks waiting on Keeper, not by genuine +work. Adding pool slots adds more waiters. diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/replica-readonly.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/replica-readonly.md new file mode 100644 index 0000000000..0aa865b894 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/replica-readonly.md @@ -0,0 +1,48 @@ +--- +title: "Replica readonly / high lag" +linkTitle: "Replica readonly" +weight: 40 +description: > + Diagnosing replicas stuck in readonly mode or with growing absolute_delay. +keywords: + - clickhouse readonly replica + - absolute_delay + - clickhouse keeper session +--- + +## Symptoms + +- One or more replicas in readonly mode. +- `absolute_delay` increasing on specific replicas. +- Failover not behaving as expected. + +## Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q4](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q4-replica-status--lag-and-readonly-per-host) ⭐ | Which replicas are readonly, which tables, lag in seconds. | +| 2 | [Q5](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q5-replication-summary-per-host) | Is this isolated or cluster-wide? | +| 3 | [Q29](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q29-keeper-connection-status) | Keeper/ZK connection — readonly is often a Keeper-session issue. 
|
| 4 | [Q30](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q30-keeper-errors-last-hour) | Recent Keeper exceptions. |
| 5 | [Q1](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q1-replication-queue-overview) | Queue depth on the affected replica — accumulating or stuck? |
| 6 | [Q2](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q2-replication-queue--postpone-reasons) | If queue is stuck — `postpone_reason` and `last_exception`. |
| 7 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) | Disk space on affected replica (full disk → readonly). |
| 8 | [Q18](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q18-recent-oom--exception-queries) | Recent exceptions on that host. |

## Common root causes

- Keeper session lost or Keeper unreachable.
- Disk full.
- Metadata mismatch with Keeper (e.g., after a restore from backup).
- Manual `SYSTEM RESTART REPLICA` needed after a transient Keeper issue.

## Resolution path

- Confirm Keeper connectivity is healthy first (Q29 + Q30); once Keeper is
  fixed, the replica self-recovers in most cases.
- If disk is full, free space first — the replica may auto-recover.
- If metadata is mismatched, `SYSTEM RESTART REPLICA {database}.{table}
` + reinitialises the replica's view of the ZooKeeper state. +- For persistent failures, see + [DDLWorker and DDL queue problems](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/) + for related cluster-coordination diagnostics. diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/slow-queries.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/slow-queries.md new file mode 100644 index 0000000000..8ca67f6f9a --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/slow-queries.md @@ -0,0 +1,32 @@ +--- +title: "Slow queries / high query load" +linkTitle: "Slow queries" +weight: 80 +description: > + Diagnosing query timeouts and dashboard latency complaints. +keywords: + - clickhouse slow query + - dashboard timeout + - query load +--- + +## Symptoms + +- Query timeouts reported by clients. +- Dashboards slow. + +## Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q17](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q17-active-queries-right-now) | What's running right now — how long, how much memory. | +| 2 | [Q16](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q16-query-load-per-host-last-30-minutes) | Query mix in the last 30 minutes — error rate and average duration by `query_kind`. | +| 3 | [Q18](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q18-recent-oom--exception-queries) | Recent exceptions. | +| 4 | [Q4](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q4-replica-status--lag-and-readonly-per-host) | Are reads hitting a lagging or readonly replica? | +| 5 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Background pool stealing CPU/IO from queries? | +| 6 | [Q15](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q15-memory-pressure) | Memory pressure forcing spill or kills? | +| 7 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | Are scanned tables fragmented (many small parts)? | + +For deeper per-query investigation, see +[Who ate my CPU?](/altinity-kb-setup-and-maintenance/who-ate-my-cpu/) and +[Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/). diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/stuck-mutations.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/stuck-mutations.md new file mode 100644 index 0000000000..fa80cd09f5 --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/stuck-mutations.md @@ -0,0 +1,46 @@ +--- +title: "Stuck mutations" +linkTitle: "Stuck mutations" +weight: 60 +description: > + Diagnosing `ALTER TABLE … UPDATE/DELETE` mutations that won't complete. +keywords: + - clickhouse mutations + - alter update + - alter delete + - is_done +--- + +## Symptoms + +- `ALTER TABLE … UPDATE / DELETE` not completing. +- `system.mutations.is_done = 0` for hours. + +## Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q19](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q19-stuck-mutations) ⭐ | All stuck mutations with `latest_fail_reason`. | +| 2 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Active merges (mutations share the merge pool). | +| 3 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Pool saturation — mutations queued behind merges. 
| +| 4 | [Q1](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q1-replication-queue-overview) | Queue entries — `MUTATE_PART` types. | +| 5 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) | Disk space (mutations rewrite parts → need ~2× space). | + +## Common root causes + +- Insufficient merge-pool slots. +- Mutation references a column that no longer exists (look at + `latest_fail_reason` — the error is explicit). +- Disk space insufficient for the rewrite. +- Mutation blocked behind a merge of the same part. + +## Resolution + +- For pool-bound stalls, raising the merge pool size (Q14) restores + progress; review whether the workload genuinely needs that much + concurrent mutation. +- A mutation whose `latest_fail_reason` is a missing column is fatal — + `KILL MUTATION WHERE …` is the only path forward. +- For disk-bound stalls, free space (see + [Memory and disk pressure](/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure/)) + before retrying. diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure.md new file mode 100644 index 0000000000..e73292196b --- /dev/null +++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure.md @@ -0,0 +1,92 @@ +--- +title: "Too many parts and backpressure" +linkTitle: "Too many parts and backpressure" +weight: 30 +description: > + Diagnosing `TOO_MANY_PARTS` (code 252), insert delays, and the + sustained insert pressure that causes cascading issues. +keywords: + - TOO_MANY_PARTS + - parts_to_delay_insert + - parts_to_throw_insert + - clickhouse backpressure +--- + +Three related failure modes appear here: hard `TOO_MANY_PARTS` rejections, +soft "Delaying inserts by N ms" warnings, and the sustained high insert +rate that causes multiple symptoms at once. + +## TOO_MANY_PARTS / part explosion + +### Symptoms + +- Inserts failing with code 252 (`TOO_MANY_PARTS`). +- Or inserts delayed with "Delaying inserts by N ms" warnings in the log. +- Parts count per partition exceeds ~300. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | Tables with highest active part count — single offender or cluster-wide? | +| 2 | [Q7](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q7-parts-count-per-partition) ⭐ | Parts per partition — `parts_to_delay_insert` is **per partition**, not per table. | +| 3 | [Q9](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q9-part-creation-vs-merge-rate-last-30-minutes) | New parts vs merged parts in the last 30 minutes — is merge throughput below insert rate? | +| 4 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Are merges actually running, or queued and idle? | +| 5 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Merge pool saturated? | +| 6 | [Q10](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q10-merge-settings-check) | Confirm `parts_to_delay_insert` / `parts_to_throw_insert` thresholds. 
| +| 7 | [Q22](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q22-async-insert-impact-aggregation), [Q23](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q23-async-insert-flush-latency-by-tablestatus) | If async inserts are in use — are flushes producing many small parts? | + +### Common root causes + +- Insert batch size too small (sync inserts without client-side batching — + one part per insert). +- Async inserts not enabled, or buffer thresholds too small. +- Partitioning too granular (e.g., per-hour partitioning on a dataset that + could be per-day). +- Merge pool too small for the insert rate. +- Excessive `Nullable` columns slowing merges. + +## Insert backpressure ("delayed inserts") + +### Symptoms + +- Inserts not failing, just very slow. +- Server logs show "Delaying inserts by N ms". + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q7](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q7-parts-count-per-partition) ⭐ | Partition with > `parts_to_delay_insert` (default 150) parts. | +| 2 | [Q10](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q10-merge-settings-check) | Confirm threshold values. | +| 3 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Are merges keeping up? | +| 4 | [Q9](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q9-part-creation-vs-merge-rate-last-30-minutes) | New-parts vs merged-parts ratio. | +| 5 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Merge pool capacity. | + +## Sustained high insert rate causing cascading issues + +### Symptoms + +- Multiple symptoms at once: timeouts, Kafka kicks, part growth. +- "The same issues come back after fixing X." +- No single clear root cause. + +### Diagnostic flow + +| Step | Query | What to look for | +|---|---|---| +| 1 | [Q37](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q37-insert-rate-per-minute-spike-detection) | Insert rate per minute — sustained or spike? | +| 2 | [Q36](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q36-insert-volume-by-target-table-last-24-hours) | Insert volume by target table — biggest contributors. | +| 3 | [Q35](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q35-insert-volume-by-user-last-24-hours) | Insert volume by user — which clients. | +| 4 | [Q38](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q38-async-insert-timeout-failures-by-table) | Currently failing tables. | +| 5 | [Q42](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q42-partition-count-health) | Part fragmentation per table. | +| 6 | Q39 (in MV chain — see `system.tables`) | MV chains on the failing tables. | + +### How to read the result + +- **Few inserts/minute with huge row counts** → bulk loads; MV chain + bottleneck is the likely cause. +- **Many inserts/minute with small row counts** → batch size problem; + fix at the producer or via async insert configuration. +- **Spike pattern** → identify the specific user or process responsible. +- **Flat pattern** → baseline load multiplied by a config issue.
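
To separate the "many inserts, small row counts" case from the bulk-load case
quickly, a sketch in the style of the library queries (not one of the
numbered queries) that profiles rows per INSERT over the last hour:

```sql
-- Low medians with high insert counts point at client-side batching (or
-- async insert configuration); high medians point back at the MV chain.
SELECT
    hostName() AS host,
    count() AS inserts,
    round(quantile(0.5)(written_rows)) AS median_rows_per_insert,
    round(quantile(0.95)(written_rows)) AS p95_rows_per_insert
FROM clusterAllReplicas('{cluster}', system.query_log)
WHERE event_time >= now() - INTERVAL 1 HOUR
  AND type = 'QueryFinish'
  AND query_kind = 'Insert'
GROUP BY host
ORDER BY median_rows_per_insert ASC;
```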