diff --git a/content/en/altinity-kb-diagnostics-runbook/_index.md b/content/en/altinity-kb-diagnostics-runbook/_index.md
new file mode 100644
index 0000000000..06de8dd5bc
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/_index.md
@@ -0,0 +1,99 @@
+---
+title: "ClickHouse® Cluster Diagnostics Runbook"
+linkTitle: "Diagnostics Runbook"
+weight: 110
+description: >
+ A query library and scenario-based diagnostic flows for triaging
+ ClickHouse® clusters during incidents.
+keywords:
+ - clickhouse diagnostics
+ - clickhouse troubleshooting
+ - clickhouse runbook
+ - replication queue
+ - async inserts
+ - keeper
+ - host skew
+---
+
+A reference for diagnosing problems on a running ClickHouse® cluster: a
+catalogue of cluster-wide queries you can run, organised by subsystem, plus
+scenario playbooks that walk you from a symptom to the queries that resolve
+it.
+
+The intended reader is an on-call or support engineer who has cluster-wide
+read access and needs to identify *which subsystem* is misbehaving as quickly
+as possible.
+
+## How this runbook is organised
+
+| Section | What's in it |
+|---|---|
+| [Quick reference](/altinity-kb-diagnostics-runbook/quick-reference/) | One-page symptom → query map and the gotchas every diagnosis depends on. **Start here.** |
+| [Investigation methods](/altinity-kb-diagnostics-runbook/investigation-methods/) | Process reminders — how to avoid common misdiagnoses. |
+| [Query library](/altinity-kb-diagnostics-runbook/query-library/) | 54 cluster-wide queries grouped by subsystem (replication, parts, async inserts, Keeper, etc.). Reference material. |
+| [Scenarios](/altinity-kb-diagnostics-runbook/scenarios/) | Step-by-step diagnostic flows for specific failure modes. |
+
+## How the queries are written
+
+Most queries in the library fan out across the cluster using
+`clusterAllReplicas('{cluster}', system.<table>)`. Replace these placeholders
+before running:
+
+- `{cluster}` — your cluster name (the value used in `remote_servers` /
+ `system.clusters.cluster`).
+- `{database}`, `{table}`, `{mv_name}`, `{target_table_pattern}` — appear in
+ queries that drill into a specific object.
+
+Most queries include `hostName() AS host` as the first column so you can see
+per-replica behaviour at a glance. Replication and metric tables vary slightly
+across ClickHouse versions — when in doubt, inspect the columns first with
+`SELECT name FROM system.columns WHERE database='system' AND table='<table>'`.
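+
+For example, a minimal fan-out with the placeholders filled in (the cluster
+name `prod` below is hypothetical; substitute your own):
+
+```sql
+-- Hypothetical cluster name; every library query follows this shape.
+SELECT
+    hostName() AS host,
+    count() AS active_queries
+FROM clusterAllReplicas('prod', system.processes)
+GROUP BY host
+ORDER BY host;
+```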
+
+## Patterns that recur
+
+These are the misreads that account for a large share of wrong diagnoses.
+Read them once before drilling into a specific scenario.
+
+1. **Host-skewed failures with a balanced workload.** Settings identical,
+ workload balanced, but failure rates differ wildly across replicas. The
+ cause is usually entry-point routing (HAProxy / ingress) directing most
+ traffic to a subset of hosts — not a ClickHouse misconfiguration. See
+ [scenarios → host-skewed failures](/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures/).
+
+2. **`tables[]` in `query_log` is not the writer.** Failed inserts list many
+ tables. The actual physical writer is in the INSERT query text — not the
+ first element of `tables[]`, which also includes the MV dependency chain.
+ See the [insert load and host skew queries](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) and
+ [scenarios → async insert issues](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/).
+
+3. **Cumulative vs current state.** `system.events` totals since process
+ start; ratios computed from those totals can show stale peak-load skew
+ that no longer exists. Always cross-check with `system.metric_log` over a
+ recent window before concluding "host X is slow".
+
+4. **ProfileEvents reveal "waited not worked".** A failed insert with
+ `RealTimeMicroseconds ≈ timeout` and `UserTimeMicroseconds < 10ms` means
+ the query never executed. The bottleneck is a lock or queue, not work.
+ Look upstream for what is blocking.
+
+5. **Same settings + different behaviour ⇒ upstream cause.** When
+ `system.settings` is identical across hosts and behaviour is still
+ skewed, the cause is outside ClickHouse: entry-point routing, pod
+ resource contention, or leader-coordination concentration. Stop looking
+ inside ClickHouse.
+
+## Where to start
+
+- "Customer says something is wrong, I don't know what" → run
+ [Scenario 10: General triage](/altinity-kb-diagnostics-runbook/scenarios/general-triage/).
+- "I have a specific symptom" → open the
+ [quick reference](/altinity-kb-diagnostics-runbook/quick-reference/).
+- "I need a specific query" → browse the
+ [query library](/altinity-kb-diagnostics-runbook/query-library/) by subsystem.
+
+## Related KB pages
+
+- [Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/) — focused memory diagnostics.
+- [Who ate my CPU?](/altinity-kb-setup-and-maintenance/who-ate-my-cpu/) — focused CPU diagnostics.
+- [DDLWorker and DDL queue problems](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/) — `ON CLUSTER` task troubleshooting.
+- [System tables eat my disk](/altinity-kb-setup-and-maintenance/altinity-kb-system-tables-eat-my-disk/) — when `*_log` tables grow too large.
diff --git a/content/en/altinity-kb-diagnostics-runbook/investigation-methods.md b/content/en/altinity-kb-diagnostics-runbook/investigation-methods.md
new file mode 100644
index 0000000000..aaaccb6066
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/investigation-methods.md
@@ -0,0 +1,147 @@
+---
+title: "Investigation methods"
+linkTitle: "Investigation methods"
+weight: 20
+description: >
+ Process reminders that prevent the most common misdiagnoses.
+keywords:
+ - clickhouse troubleshooting
+ - clickhouse diagnostics
+ - tables array
+ - profileevents
+ - metric_log
+---
+
+These reminders are about *how* to investigate — they prevent the kinds of
+wrong reads that send a diagnosis in the wrong direction for hours. Each one
+maps to a specific query or pattern elsewhere in the runbook.
+
+## Verify before committing to a cause
+
+When the evidence points to more than one plausible cause, run one more
+verification query before you state a conclusion. A wrong RCA costs more
+trust and more time than the verification step would have. The cost of an
+extra `SELECT` is seconds; the cost of unwinding a wrong diagnosis can be
+days.
+
+## `tables[]` in `query_log` is not the writer
+
+The `query_log.tables` array contains every table touched by the query,
+including the entire MV dependency chain. The actual physical INSERT target
+is in the query text, not in `tables[0]`.
+
+To find the real writer behind a failing insert, extract from the query
+text:
+
+```sql
+SELECT regexpExtract(query, 'INSERT INTO\s+([\w\.`]+)') AS target, …
+```
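+
+A fuller sketch of the same extraction, grouped by target. The 30-minute
+window and the `Insert` filter are assumptions; adjust both to the incident
+at hand.
+
+```sql
+-- Group recent failed INSERTs by the table named in the query text itself.
+SELECT
+    regexpExtract(query, 'INSERT INTO\s+([\w\.`]+)') AS target,
+    count() AS failed_inserts,
+    max(event_time) AS last_failure
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 30 MINUTE
+  AND type = 'ExceptionWhileProcessing'
+  AND query_kind = 'Insert'
+GROUP BY target
+ORDER BY failed_inserts DESC
+LIMIT 20;
+```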
+
+See [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection)
+and the dedicated [scenario](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/).
+
+## Cumulative metrics hide current state
+
+`system.events` integrates since process start. Ratios computed from those
+totals can reflect a peak-load period that happened days ago and is no
+longer relevant.
+
+When comparing per-host behaviour right now, use `system.metric_log` with a
+recent window (5–30 minutes):
+
+- [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log)
+ — per-second profile activity by host.
+- [Q49](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q49-tail-latency-for-keeper-operations)
+ — p50/p95/p99 of Keeper transactions, by host.
+
+If someone reports "host X has Nx higher Keeper waits", reproduce it with
+Q49 over the last 30 minutes before treating it as a current problem.
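+
+For contrast, a sketch of the same ratio computed from the cumulative
+counters in `system.events`. It exists only to show how different the
+since-start number can look from the recent-window number in Q49.
+
+```sql
+-- Cumulative since process start: old peaks stay baked into this ratio.
+SELECT
+    hostName() AS host,
+    sumIf(value, event = 'ZooKeeperWaitMicroseconds') AS total_wait_us,
+    sumIf(value, event = 'ZooKeeperTransactions') AS total_txns,
+    round(total_wait_us / nullIf(total_txns, 0), 1) AS avg_us_per_txn_since_start
+FROM clusterAllReplicas('{cluster}', system.events)
+WHERE event IN ('ZooKeeperWaitMicroseconds', 'ZooKeeperTransactions')
+GROUP BY host
+ORDER BY avg_us_per_txn_since_start DESC;
+```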
+
+## Same settings + different behaviour ⇒ upstream cause
+
+If `system.settings` is identical across hosts (see
+[Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection))
+and behaviour is still skewed across replicas, the cause is outside
+ClickHouse. Likely sources:
+
+- Entry-point routing (HAProxy, ingress, or client library load balancing)
+ concentrating traffic on a subset of replicas.
+- Pod-level resource contention (CPU throttling, memory pressure on the
+ node, page cache flushes from a noisy neighbour).
+- Coordination work concentrated on a subset of hosts (leader concentration,
+ see [Q51](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q51-leader-distribution-across-hosts)).
+
+Stop looking inside ClickHouse — the answer is upstream.
+
+## Distinguish workload from failure
+
+"Volume is balanced" and "failures are balanced" answer different questions.
+Either can be skewed independently. To resolve a host-skew report, look at
+both:
+
+- Workload — [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log)
+ (`ProfileEvent_AsyncInsertQuery` per host).
+- Failure rate — [Q53](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q53-failure-rate-per-host)
+ (failures normalised by attempts).
+
+Together they let you say "host A receives 4× more attempts" or "host A
+fails at 5× the rate at equal volume" — those are very different problems
+with different fixes.
+
+## ProfileEvents reveal "waited not worked"
+
+A failed query with `RealTimeMicroseconds ≈ timeout` and
+`UserTimeMicroseconds` near zero means the query never executed. It sat in
+a queue or on a lock. This rules out "the work itself is slow" and points
+to "the wait is the problem".
+
+Before theorising about a slow MV chain or slow merge as the cause of a
+failed insert, inspect ProfileEvents on representative failed queries:
+
+```sql
+SELECT
+ query_id,
+ query_duration_ms,
+ ProfileEvents['RealTimeMicroseconds'] AS real_us,
+ ProfileEvents['UserTimeMicroseconds'] AS user_us,
+ ProfileEvents['SystemTimeMicroseconds'] AS sys_us
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 30 MINUTE
+ AND type = 'ExceptionWhileProcessing'
+ AND exception ILIKE '%async insert%timeout%'
+LIMIT 20;
+```
+
+If `user_us` is in single-digit milliseconds while `real_us` is at the
+timeout ceiling, the work never ran. Find the lock or queue, not the slow
+operator.
+
+## Routing settings to know about
+
+A short glossary of the settings that determine *where* a query lands and
+*how* its MVs execute. Confirm them with
+[Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection)
+before tuning anything.
+
+- **`load_balancing`** — picks the replica for a Distributed table read or
+ insert. `hostname_levenshtein_distance` concentrates by hostname
+ similarity (often pinning to self), which can imbalance routing
+ unexpectedly. `random` or `round_robin` spreads work evenly.
+- **`prefer_localhost_replica`** — when `1`, the local replica is preferred
+ regardless of `load_balancing`. Useful for read locality, risky for
+ insert balance.
+- **`distributed_foreground_insert`** — when `1`, INSERTs into a
+ Distributed table wait synchronously for remote acks. Slower but no
+ silent loss.
+- **`parallel_view_processing`** — when `0` (historical default on many
+ versions), MVs on a target table execute serially per insert. With a
+ deep MV chain, this turns each insert into a long sequential pipeline.
+
+## Sidecar Keeper means co-located, not shared
+
+If `system.zookeeper_connection.host == hostName()` (see
+[Q50](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q50-keeper-connection-topology)),
+the replica connects to a Keeper running on the same pod. "Slow Keeper
+follower" theories don't apply in this topology — there is no shared
+follower to be slow. Issues here are about pod-level contention (CPU, page
+cache, disk), not Keeper network routing.
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/_index.md b/content/en/altinity-kb-diagnostics-runbook/query-library/_index.md
new file mode 100644
index 0000000000..b38a916758
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/_index.md
@@ -0,0 +1,40 @@
+---
+title: "Query library"
+linkTitle: "Query library"
+weight: 30
+description: >
+ Reference catalogue of cluster-wide diagnostic queries, grouped by subsystem.
+keywords:
+ - clickhouse system tables
+ - clickhouse diagnostics
+ - clusterAllReplicas
+---
+
+54 cluster-wide queries grouped by the subsystem they probe. Most queries
+fan out via `clusterAllReplicas('{cluster}', system.<table>)`. Replace
+`{cluster}` / `{database}` / `{table}` / `{mv_name}` /
+`{target_table_pattern}` with values from your environment before running.
+
+Queries are referenced from the
+[scenarios](/altinity-kb-diagnostics-runbook/scenarios/) by their numeric
+IDs (`Q1`, `Q2`, …). Numbering is stable across the runbook, so a query ID
+shared with a teammate always points at the same query.
+
+| Page | Queries | Purpose |
+|---|---|---|
+| [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/) | Q1–Q5, Q31, Q32 | Replication queue depth, postpone reasons, replica lag, fetches in flight |
+| [Parts and merges](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/) | Q6–Q10, Q42 | Parts per host/partition, active merges, merge throughput |
+| [Disk and storage](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/) | Q11, Q12 | Per-disk free space, TTL move activity |
+| [Pools and resources](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/) | Q13–Q15, Q54 | Background pool saturation, memory pressure, cgroup limits |
+| [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) | Q16–Q19 | Recent query load, active queries, OOM/exception queries, stuck mutations |
+| [Async inserts](/altinity-kb-diagnostics-runbook/query-library/async-inserts/) | Q20–Q28, Q38 | Flush errors, latency, MV chain inspection, timeout patterns |
+| [Keeper and coordination](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/) | Q29, Q30, Q33, Q49–Q51 | Connection state, exception patterns, wait-time percentiles, topology, leader distribution |
+| [Insert load and host skew](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) | Q34–Q37, Q40, Q41, Q46–Q48, Q52, Q53 | Insert rate/volume, per-host duration, routing settings, failure rate |
+| [Dictionaries and Kafka](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/) | Q43–Q45 | Dictionary health, Kafka consumer vs pool size, consumer errors |
+
+## A note on version drift
+
+Several system tables changed schema between ClickHouse releases — column
+names on `replicated_fetches`, the view columns on `query_log`, and the
+existence of `zookeeper_log`. Each query page calls out the columns to
+check first when a query errors out on a specific cluster.
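+
+For example, before re-running a query that failed with an unknown-identifier
+error against `replicated_fetches` (the same check works for any system
+table):
+
+```sql
+-- List the columns this server actually has before adapting the query.
+SELECT name, type
+FROM system.columns
+WHERE database = 'system' AND table = 'replicated_fetches'
+ORDER BY position;
+```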
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/async-inserts.md b/content/en/altinity-kb-diagnostics-runbook/query-library/async-inserts.md
new file mode 100644
index 0000000000..24e76d9a94
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/async-inserts.md
@@ -0,0 +1,238 @@
+---
+title: "Async inserts queries"
+linkTitle: "Async inserts"
+weight: 60
+description: >
+ Cluster-wide queries for async insert flush errors, latency, MV chain
+ inspection, and timeout patterns.
+keywords:
+ - clickhouse async insert
+ - asynchronous_insert_log
+ - materialized view
+ - flush errors
+---
+
+Queries for diagnosing the async insert subsystem: schema variations,
+flush errors and latency, materialized-view chain inspection, and the
+specific timeout pattern that signals MV-chain saturation. Most queries fan
+out across the cluster — replace `{cluster}` / `{database}` / `{mv_name}`
+with values from your environment.
+
+## Q20. Async insert log — schema check
+
+Column names on `asynchronous_insert_log` have shifted across versions.
+Run this once when investigating a new cluster so the rest of the queries
+on this page match the actual schema.
+
+```sql
+SELECT name, type
+FROM system.columns
+WHERE database = 'system'
+ AND table = 'asynchronous_insert_log'
+ORDER BY position;
+```
+
+## Q21. Async insert flush errors
+
+Recent failed flushes with the exception text, target database/table,
+rows, size, and how long the flush waited. The starting point for "inserts
+return 200 OK but the data isn't there".
+
+```sql
+SELECT
+ hostname AS host,
+ event_time,
+ status,
+ exception,
+ database,
+ table,
+ rows,
+ round(bytes / 1e6, 1) AS size_MB,
+ flush_time,
+ dateDiff('second', event_time, flush_time) AS buffer_wait_sec
+FROM clusterAllReplicas('{cluster}', system.asynchronous_insert_log)
+WHERE status != 'Ok'
+ AND event_time >= now() - INTERVAL 4 HOUR
+ORDER BY event_time DESC
+LIMIT 30;
+```
+
+## Q22. Async insert impact aggregation
+
+Aggregates the last 12 hours of `FlushError` rows by host/table — total
+rows, total size, first-error and last-error timestamps. Tells you "how
+much data is affected and over what window".
+
+```sql
+SELECT
+ hostname,
+ database,
+ table,
+ status,
+ count() AS flush_attempts,
+ sum(rows) AS total_rows_affected,
+ round(sum(bytes) / 1e9, 2) AS total_GB,
+ min(event_time) AS first_error,
+ max(event_time) AS last_error
+FROM clusterAllReplicas('{cluster}', system.asynchronous_insert_log)
+WHERE status = 'FlushError'
+ AND event_time >= now() - INTERVAL 12 HOUR
+GROUP BY hostname, database, table, status
+ORDER BY total_rows_affected DESC;
+```
+
+## Q23. Async insert flush latency by table/status
+
+Average and max buffer wait time, plus average flush size. Compare `Ok`
+rows to `FlushError` rows for the same table — a divergence in flush size
+or buffer wait is a strong hint about the cause.
+
+```sql
+SELECT
+ hostname AS host,
+ database,
+ table,
+ status,
+ count() AS count,
+ sum(rows) AS total_rows,
+ round(sum(bytes) / 1e9, 2) AS total_GB,
+ avg(dateDiff('second', event_time, flush_time)) AS avg_buffer_wait_sec,
+ max(dateDiff('second', event_time, flush_time)) AS max_buffer_wait_sec,
+ round(avg(bytes) / 1e6, 1) AS avg_flush_MB
+FROM clusterAllReplicas('{cluster}', system.asynchronous_insert_log)
+WHERE event_time >= now() - INTERVAL 4 HOUR
+GROUP BY hostname, database, table, status
+ORDER BY hostname, status, count DESC;
+```
+
+## Q24. Slowest AsyncInsertFlush queries
+
+The slowest flush *queries* (`query_kind = 'AsyncInsertFlush'`) in the last
+four hours. Each flush execution is a query in `query_log` — this lets you
+see memory, exception, and full query text for the slowest ones.
+
+```sql
+SELECT
+ hostName() AS host,
+ query_id,
+ event_time,
+ query_duration_ms,
+ round(memory_usage / 1e9, 1) AS memory_GB,
+ read_rows,
+ written_rows,
+ exception,
+ substr(query, 1, 500) AS query_text
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 4 HOUR
+ AND query_kind = 'AsyncInsertFlush'
+  AND (has(databases, '{database}') OR query ILIKE '%{database}%')
+ AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
+ORDER BY query_duration_ms DESC
+LIMIT 20;
+```
+
+## Q25. MV appearances in failed flushes
+
+For a specific MV, list every failed flush where the MV appears in
+`views`. Quantifies the impact of one MV on flush failures.
+
+```sql
+SELECT
+ hostName() AS host,
+ query_duration_ms,
+ round(memory_usage / 1e9, 1) AS memory_GB,
+ read_rows,
+ written_rows,
+ views,
+ exception,
+ event_time
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 4 HOUR
+ AND query_kind = 'AsyncInsertFlush'
+ AND has(views, '{database}.{mv_name}')
+ AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
+ORDER BY query_duration_ms DESC
+LIMIT 20;
+```
+
+## Q26. MV frequency in errors
+
+Counts how often each MV appears across `ExceptionWhileProcessing` rows in
+the last four hours. The MV with the highest `appearances` is the prime
+suspect for the chain bottleneck.
+
+```sql
+SELECT
+ hostName() AS host,
+ arrayJoin(views) AS mv_name,
+ count() AS appearances,
+ avg(query_duration_ms) / 1000 AS avg_sec,
+ max(query_duration_ms) / 1000 AS max_sec
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 4 HOUR
+ AND query_kind = 'AsyncInsertFlush'
+ AND type = 'ExceptionWhileProcessing'
+GROUP BY host, mv_name
+ORDER BY appearances DESC;
+```
+
+## Q27. MV definitions — chain inspection
+
+The `as_select` text for every MV in a database. Use after Q26 to inspect
+the MV that's appearing most often in failures.
+
+```sql
+SELECT name, as_select
+FROM system.tables
+WHERE database = '{database}'
+ AND engine = 'MaterializedView'
+ORDER BY name;
+```
+
+## Q28. Live async insert health check (last 5 minutes)
+
+A rolling status summary — counts and average row count by `status` for
+the last five minutes. Useful as a poll during incident response: "are we
+still failing right now?".
+
+```sql
+SELECT
+ hostname,
+ status,
+ count() AS cnt,
+ avg(rows) AS avg_rows_per_flush,
+ max(rows) AS max_rows_per_flush,
+ max(event_time) AS latest
+FROM clusterAllReplicas('{cluster}', system.asynchronous_insert_log)
+WHERE event_time >= now() - INTERVAL 5 MINUTE
+GROUP BY hostname, status
+ORDER BY hostname, status;
+```
+
+## Q38. Async insert timeout failures by table ⭐
+
+Direct culprit identification — pulls failures whose exception matches
+`async insert%timeout%` and groups by `arrayJoin(tables)`. The table at
+the top of the result is the timing-out target.
+
+```sql
+SELECT
+ hostName() AS host,
+ arrayJoin(tables) AS table_name,
+ count() AS failures,
+ round(avg(query_duration_ms), 0) AS avg_ms,
+ max(event_time) AS last_fail,
+ substring(any(exception), 1, 200) AS sample_exception
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 4 HOUR
+ AND type = 'ExceptionWhileProcessing'
+ AND exception ILIKE '%async insert%timeout%'
+GROUP BY host, table_name
+ORDER BY failures DESC
+LIMIT 20;
+```
+
+`arrayJoin(tables)` exposes the full MV blast radius — including non-writer
+dependencies. Always cross-check the actual physical INSERT target with
+[Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection)
+before recommending a fix on one of these tables.
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka.md b/content/en/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka.md
new file mode 100644
index 0000000000..17ca4fe587
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka.md
@@ -0,0 +1,88 @@
+---
+title: "Dictionaries and Kafka queries"
+linkTitle: "Dictionaries and Kafka"
+weight: 90
+description: >
+ Cluster-wide queries for dictionary health and Kafka consumer state.
+keywords:
+ - clickhouse dictionaries
+ - clickhouse kafka
+ - kafka consumers
+ - max.poll.interval.ms
+---
+
+Queries for two related concerns: dictionary load state (often consumed by
+MVs and therefore on the insert hot path) and Kafka consumer activity
+(starvation manifests as `max.poll.interval.ms` violations). Most queries
+fan out across the cluster — replace `{cluster}` with your cluster name.
+
+## Q43. Dictionary health check
+
+First — only the dictionaries that are not loaded or have an exception:
+
+```sql
+SELECT
+ name, status, last_exception,
+ loading_duration AS load_sec,
+ element_count,
+ round(bytes_allocated / 1e6, 1) AS MB
+FROM clusterAllReplicas('{cluster}', system.dictionaries)
+WHERE status != 'LOADED' OR last_exception != ''
+ORDER BY name;
+```
+
+Then — every dictionary, sorted by load time. A long-loading dictionary
+on the insert hot path (e.g., used inside `dictGet` in an MV) is a common
+source of unexpected MV slowness.
+
+```sql
+SELECT
+ name, status, element_count,
+ round(loading_duration, 2) AS load_sec,
+ round(bytes_allocated / 1e6, 1) AS MB
+FROM clusterAllReplicas('{cluster}', system.dictionaries)
+ORDER BY load_sec DESC
+LIMIT 30;
+```
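+
+To see which MVs actually sit on that hot path, a quick sketch that
+text-matches `dictGet` in the stored MV definitions (a plain text match, so
+it can miss indirect uses):
+
+```sql
+-- MVs whose SELECT calls dictGet anywhere in the definition.
+SELECT DISTINCT database, name
+FROM clusterAllReplicas('{cluster}', system.tables)
+WHERE engine = 'MaterializedView'
+  AND as_select ILIKE '%dictGet%'
+ORDER BY database, name;
+```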
+
+## Q44. Kafka consumer count vs pool size ⭐
+
+Compares the number of Kafka consumers to the configured message-broker
+pool size and the current pool activity. The first query for
+"`max.poll.interval.ms` exceeded" errors and Kafka consumer rebalance
+storms.
+
+```sql
+SELECT
+ hostName() AS host,
+ (SELECT count() FROM system.kafka_consumers) AS consumers,
+ (SELECT value FROM system.server_settings
+ WHERE name = 'background_message_broker_schedule_pool_size') AS mb_pool_size,
+ (SELECT value FROM system.metrics
+ WHERE metric = 'BackgroundMessageBrokerSchedulePoolTask') AS mb_pool_active;
+```
+
+Rule of thumb: if `consumers > mb_pool_size`, poll-interval violations are
+all but guaranteed. Aim for `mb_pool_size >= consumers * 1.25`.
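+
+Note that Q44 as written runs against the local server only (there is no
+`clusterAllReplicas` fan-out). A cluster-wide sketch of the same check,
+joining per-host consumer counts to the configured pool size:
+
+```sql
+-- Flag hosts where Kafka consumers outnumber the broker scheduling pool.
+SELECT
+    host,
+    consumers,
+    mb_pool_size,
+    consumers > mb_pool_size AS starvation_risk
+FROM
+(
+    SELECT hostName() AS host, count() AS consumers
+    FROM clusterAllReplicas('{cluster}', system.kafka_consumers)
+    GROUP BY host
+) AS c
+INNER JOIN
+(
+    SELECT hostName() AS host, toUInt64(value) AS mb_pool_size
+    FROM clusterAllReplicas('{cluster}', system.server_settings)
+    WHERE name = 'background_message_broker_schedule_pool_size'
+) AS s USING (host)
+ORDER BY host;
+```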
+
+## Q45. Kafka consumer error inspection
+
+Per-consumer last exception, last poll time, message count, and rebalance
+counters. After Q44 confirms starvation, this tells you which consumers
+are hitting it.
+
+```sql
+SELECT
+ hostName() AS host,
+ database, table,
+ consumer_id,
+ last_exception,
+ num_messages_read,
+ last_poll_time,
+ num_rebalance_revocations,
+ num_rebalance_assignments
+FROM clusterAllReplicas('{cluster}', system.kafka_consumers)
+WHERE last_exception != ''
+ORDER BY last_poll_time DESC
+LIMIT 30;
+```
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/disk-and-storage.md b/content/en/altinity-kb-diagnostics-runbook/query-library/disk-and-storage.md
new file mode 100644
index 0000000000..346bc471ef
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/disk-and-storage.md
@@ -0,0 +1,60 @@
+---
+title: "Disk and storage queries"
+linkTitle: "Disk and storage"
+weight: 30
+description: >
+ Cluster-wide queries for disk usage and TTL move activity.
+keywords:
+ - clickhouse disk usage
+ - clickhouse ttl
+ - NOT_ENOUGH_SPACE
+---
+
+Queries for inspecting per-disk free space across the cluster and recent
+TTL movement / mutation activity. All queries fan out across the cluster —
+replace `{cluster}` with your cluster name.
+
+## Q11. Disk usage per host ⭐
+
+Per-host, per-disk free space, total space, and used percentage. The first
+query when `NOT_ENOUGH_SPACE` appears in `last_exception`, or when merges
+fail and `Q1`'s exception column points at disk.
+
+```sql
+SELECT
+ hostName() AS host,
+ name AS disk_name,
+ type,
+ round(free_space / 1e9, 1) AS free_GB,
+ round(total_space / 1e9, 1) AS total_GB,
+ round((1 - free_space / total_space) * 100, 1) AS used_pct
+FROM clusterAllReplicas('{cluster}', system.disks)
+GROUP BY host, disk_name, type, free_space, total_space
+ORDER BY host, used_pct DESC;
+```
+
+## Q12. TTL move / mutation activity
+
+`MovePart` and `MutatePart` events from `part_log` over the last hour.
+Useful when investigating whether TTL moves to a cold tier are actually
+running, and whether they're succeeding.
+
+```sql
+SELECT
+ hostName() AS host,
+ event_time,
+ event_type,
+ database, table, part_name,
+ rows,
+ formatReadableSize(size_in_bytes) AS size,
+ error
+FROM clusterAllReplicas('{cluster}', system.part_log)
+WHERE event_time >= now() - INTERVAL 1 HOUR
+ AND event_type IN ('MovePart', 'MutatePart')
+ORDER BY event_time DESC
+LIMIT 50;
+```
+
+A non-empty `error` column with `S3 access denied`, `connection`, or
+`credentials` keywords points at the cold-tier disk policy, not at
+ClickHouse itself.
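+
+As a follow-up, a sketch that maps the failing move to its storage policy by
+listing which policies and volumes reference each disk:
+
+```sql
+-- Which storage policies / volumes use which disks, per host.
+SELECT
+    hostName() AS host,
+    policy_name,
+    volume_name,
+    disks
+FROM clusterAllReplicas('{cluster}', system.storage_policies)
+ORDER BY host, policy_name, volume_name;
+```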
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew.md b/content/en/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew.md
new file mode 100644
index 0000000000..cd4c7858ee
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew.md
@@ -0,0 +1,259 @@
+---
+title: "Insert load and host skew queries"
+linkTitle: "Insert load and host skew"
+weight: 80
+description: >
+ Cluster-wide queries for insert volume, per-host duration, routing
+ settings, and failure-rate quantification.
+keywords:
+ - clickhouse insert rate
+ - host skew
+ - load_balancing
+ - metric_log
+ - failure rate
+---
+
+Queries for profiling insert workload and detecting host-skewed behaviour.
+The set here lets you answer "is the workload balanced", "is the duration
+balanced", "is the failure rate balanced", and "are the routing settings
+balanced" — four independent questions that together pinpoint host-skew
+root causes.
+
+All queries fan out across the cluster — replace `{cluster}` /
+`{database}` / `{table_pattern}` / `{target_table_pattern}` with values
+from your environment.
+
+## Q34. Active insert sources by user
+
+Live insert activity grouped by user — quick "who's inserting right now".
+
+```sql
+SELECT hostName(), user, query_kind, count()
+FROM clusterAllReplicas('{cluster}', system.processes)
+WHERE query_kind = 'Insert'
+GROUP BY hostName(), user, query_kind;
+```
+
+## Q35. Insert volume by user (last 24 hours)
+
+Insert volume, error count, and time window per user across the last
+day — identifies the heavy clients and the failing ones.
+
+```sql
+SELECT
+ hostName() AS host,
+ user,
+ count() AS insert_count,
+ sum(written_rows) AS total_rows,
+ round(sum(written_bytes) / 1e9, 2) AS total_GB,
+ round(avg(query_duration_ms), 0) AS avg_dur_ms,
+ countIf(type = 'ExceptionWhileProcessing') AS errors,
+ min(event_time) AS first_seen,
+ max(event_time) AS last_seen
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 24 HOUR
+ AND query_kind = 'Insert'
+ AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
+GROUP BY host, user
+ORDER BY total_rows DESC
+LIMIT 30;
+```
+
+## Q36. Insert volume by target table (last 24 hours)
+
+Extracts the target table from the query text with a regex, then
+aggregates by it. Cross-check against
+[Q47](#q47-failed-insert-query-text-inspection) before treating any table
+as the "actual writer" — `tables[]` includes the MV chain.
+
+```sql
+SELECT
+ hostName() AS host,
+ extract(query, 'INTO\s+([\w\.`]+)') AS target_table,
+ count() AS inserts,
+ sum(written_rows) AS rows,
+ round(sum(written_bytes) / 1e9, 2) AS GB,
+ countIf(type = 'ExceptionWhileProcessing') AS errors
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 24 HOUR
+ AND query_kind = 'Insert'
+GROUP BY host, target_table
+ORDER BY rows DESC
+LIMIT 30;
+```
+
+## Q37. Insert rate per minute (spike detection)
+
+Per-minute insert counts and error counts across the last 24 hours. The
+shape of the distribution tells you "spike" vs "sustained" — the
+remediation differs.
+
+```sql
+SELECT
+ toStartOfMinute(event_time) AS minute,
+ count() AS inserts,
+ sum(written_rows) AS rows,
+ countIf(type = 'ExceptionWhileProcessing') AS errors
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 24 HOUR
+ AND query_kind = 'Insert'
+GROUP BY minute
+ORDER BY inserts DESC
+LIMIT 30;
+```
+
+## Q40. Active inserts confirmation (per-table specific)
+
+Counts new parts in the last 24 hours for tables matching
+`{table_pattern}`. An empty result confirms a table is *not* being written
+to — useful for verifying that an old/archived table is frozen.
+
+```sql
+SELECT
+ hostName() AS host,
+ database, `table`,
+ sum(rows) AS rows_inserted,
+ count() AS insert_events,
+ max(event_time) AS last_insert
+FROM clusterAllReplicas('{cluster}', system.part_log)
+WHERE event_time >= now() - INTERVAL 24 HOUR
+ AND event_type = 'NewPart'
+ AND `table` LIKE '%{table_pattern}%'
+GROUP BY host, database, `table`
+ORDER BY rows_inserted DESC;
+```
+
+## Q41. Partition schema check (preventive)
+
+Lists the partition key and sorting key for tables matching the pattern.
+Use ahead of partition fragmentation diagnosis to confirm what the schema
+actually is.
+
+```sql
+SELECT
+ database, `table`, partition_key, sorting_key
+FROM clusterAllReplicas('{cluster}', system.tables)
+WHERE `table` LIKE '%{table_pattern}%'
+GROUP BY database, `table`, partition_key, sorting_key
+ORDER BY database, `table`;
+```
+
+## Q46. Per-host insert duration profile ⭐
+
+Per-host average, p95, and p99 insert duration over the last five minutes.
+The first query to confirm "failures concentrate on some hosts but volume
+looks similar" — if `avg_ms` or `p95_ms` differ by orders of magnitude
+across hosts on identical workloads, the bottleneck is host-specific.
+
+```sql
+SELECT
+ hostName() AS host,
+ count() AS query_count,
+ round(avg(query_duration_ms), 0) AS avg_ms,
+ round(quantile(0.95)(query_duration_ms), 0) AS p95_ms,
+ round(quantile(0.99)(query_duration_ms), 0) AS p99_ms
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 5 MINUTE
+ AND type = 'QueryFinish'
+ AND query_kind = 'Insert'
+GROUP BY host
+ORDER BY host;
+```
+
+## Q47. Failed insert query text inspection ⭐
+
+The query text contains the actual physical INSERT target — not just the
+MV chain that `tables[]` exposes. Use this before blaming any specific
+table for a timeout.
+
+```sql
+SELECT
+ hostName() AS host,
+ event_time,
+ query_duration_ms,
+ substring(exception, 1, 200) AS exception_text,
+ user, client_hostname, initial_address,
+ substring(query, 1, 500) AS query_text
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 30 MINUTE
+ AND type = 'ExceptionWhileProcessing'
+ AND exception ILIKE '%async insert%timeout%'
+ORDER BY event_time DESC
+LIMIT 5 FORMAT Vertical;
+```
+
+The `INSERT INTO database.table` statement in the query text reveals the
+real writer. Any other tables that show up in
+[Q38](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q38-async-insert-timeout-failures-by-table)'s
+`arrayJoin(tables)` are MV dependencies, not direct writes.
+
+## Q48. Per-second activity from metric_log
+
+Per-host averages and sums of profile events over a recent window:
+active queries, message-broker pool task count, disk-write microseconds,
+ZooKeeper wait microseconds, async insert attempts, and failed insert
+attempts. The right tool for "is host-X currently doing more or less
+work than the others?".
+
+```sql
+SELECT
+ hostName() AS host,
+ count() AS samples,
+ avg(CurrentMetric_Query) AS avg_active_queries,
+ max(CurrentMetric_Query) AS max_active_queries,
+ avg(CurrentMetric_BackgroundMessageBrokerSchedulePoolTask) AS avg_mb_pool,
+ sum(ProfileEvent_DiskWriteElapsedMicroseconds) AS disk_write_us,
+ sum(ProfileEvent_ZooKeeperWaitMicroseconds) AS zk_wait_us,
+ sum(ProfileEvent_AsyncInsertQuery) AS async_inserts,
+ sum(ProfileEvent_FailedInsertQuery) AS failed_inserts
+FROM clusterAllReplicas('{cluster}', system.metric_log)
+WHERE event_time >= now() - INTERVAL 5 MINUTE
+GROUP BY host
+ORDER BY host;
+```
+
+`system.metric_log` stores metrics as **columns** (`CurrentMetric_*`,
+`ProfileEvent_*`), not rows. You can't filter with
+`WHERE metric IN (...)` — `SELECT` the specific columns.
+
+## Q52. Routing settings inspection
+
+Per-host inspection of the settings that control where INSERTs land and
+how MVs execute. When these are identical across hosts but behaviour is
+still skewed, the cause is upstream (entry-point routing, not ClickHouse).
+
+```sql
+SELECT
+ hostName() AS host,
+ name, value
+FROM clusterAllReplicas('{cluster}', system.settings)
+WHERE name IN ('load_balancing', 'parallel_view_processing',
+ 'prefer_localhost_replica', 'distributed_foreground_insert',
+ 'async_insert', 'async_insert_busy_timeout_ms',
+ 'async_insert_busy_timeout_max_ms', 'async_insert_threads',
+ 'wait_for_async_insert')
+ORDER BY host, name;
+```
+
+See
+[Investigation methods → routing settings to know about](/altinity-kb-diagnostics-runbook/investigation-methods/#routing-settings-to-know-about)
+for what each setting does.
+
+## Q53. Failure rate per host ⭐
+
+Failure rate as a percentage of attempts — the workload-normalised view
+of "which hosts are actually failing more". Pair with Q46 (duration) and
+Q48 (volume) for the full picture.
+
+```sql
+SELECT
+ hostName() AS host,
+ sum(ProfileEvent_AsyncInsertQuery) AS total_attempts,
+ sum(ProfileEvent_FailedInsertQuery) AS failures,
+ round(sum(ProfileEvent_FailedInsertQuery) * 100.0 /
+ nullIf(sum(ProfileEvent_AsyncInsertQuery), 0), 1) AS failure_rate_pct
+FROM clusterAllReplicas('{cluster}', system.metric_log)
+WHERE event_time >= now() - INTERVAL 5 MINUTE
+GROUP BY host
+ORDER BY failure_rate_pct DESC;
+```
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination.md b/content/en/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination.md
new file mode 100644
index 0000000000..7e080507c0
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination.md
@@ -0,0 +1,139 @@
+---
+title: "Keeper and coordination queries"
+linkTitle: "Keeper and coordination"
+weight: 70
+description: >
+ Cluster-wide queries for ClickHouse Keeper / ZooKeeper connection
+ state, wait-time percentiles, topology, and leader distribution.
+keywords:
+ - clickhouse keeper
+ - zookeeper
+ - zookeeper_connection
+ - keeper latency
+---
+
+Queries for ClickHouse Keeper / ZooKeeper visibility: connection state,
+recent exceptions, cumulative wait events, current-window tail latency,
+sidecar vs centralized topology, and per-host leader counts.
+
+All queries fan out across the cluster — replace `{cluster}` with your
+cluster name.
+
+## Q29. Keeper connection status
+
+Connection state per replica — which Keeper node it's connected to,
+session age, expiry flag, API version.
+
+```sql
+SELECT
+    hostName() AS replica_host,  -- the column `host` below is the Keeper endpoint
+    name AS keeper_node,
+    host AS keeper_address,
+    port,
+    session_uptime_elapsed_seconds,
+    is_expired,
+    keeper_api_version
+FROM clusterAllReplicas('{cluster}', system.zookeeper_connection)
+ORDER BY replica_host;
+```
+
+## Q30. Keeper errors (last hour)
+
+Recent exceptions mentioning ZooKeeper / Keeper / code 999. Useful when a
+replica goes readonly and you suspect a Keeper session loss.
+
+```sql
+SELECT
+ hostName() AS host,
+ event_time,
+ exception_code,
+ substring(exception, 1, 200) AS exception_short
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 1 HOUR
+ AND type = 'ExceptionWhileProcessing'
+ AND (exception ILIKE '%zookeeper%' OR exception ILIKE '%keeper%' OR exception_code = 999)
+ORDER BY event_time DESC
+LIMIT 30;
+```
+
+## Q33. Keeper wait time and activity (cumulative)
+
+Cumulative Keeper-related event counters since process start. Useful for a
+quick "what does Keeper see" snapshot — but read the warning below before
+computing ratios.
+
+```sql
+SELECT hostName() AS host, event, value
+FROM clusterAllReplicas('{cluster}', system.events)
+WHERE event LIKE '%ZooKeeper%' OR event LIKE '%Keeper%'
+ORDER BY value DESC
+LIMIT 30;
+```
+
+Key events:
+
+- `ZooKeeperWaitMicroseconds` — total wait time on Keeper responses.
+- `ZooKeeperTransactions` — total transactions.
+- `ZooKeeperList` — directory listings (high during many-parts
+ coordination).
+- `ZooKeeperHardwareExceptions` / `ZooKeeperUserExceptions` — error counts.
+
+> **Warning.** These are cumulative since process start. A ratio like
+> `ZooKeeperWaitMicroseconds / ZooKeeperTransactions` reflects everything
+> the process has seen, including peaks from days ago. For current state,
+> use Q49 instead.
+
+## Q49. Tail latency for Keeper operations ⭐
+
+p50 / p95 / p99 of microseconds-per-transaction from `metric_log` over a
+recent window. The right tool for "is host X slow on Keeper right now",
+because it ignores stale peaks baked into the cumulative counters.
+
+```sql
+SELECT
+ hostName() AS host,
+ quantile(0.50)(ProfileEvent_ZooKeeperWaitMicroseconds / nullIf(ProfileEvent_ZooKeeperTransactions, 0)) AS p50_us_per_txn,
+ quantile(0.95)(ProfileEvent_ZooKeeperWaitMicroseconds / nullIf(ProfileEvent_ZooKeeperTransactions, 0)) AS p95_us_per_txn,
+ quantile(0.99)(ProfileEvent_ZooKeeperWaitMicroseconds / nullIf(ProfileEvent_ZooKeeperTransactions, 0)) AS p99_us_per_txn
+FROM clusterAllReplicas('{cluster}', system.metric_log)
+WHERE event_time >= now() - INTERVAL 30 MINUTE
+ AND ProfileEvent_ZooKeeperTransactions > 0
+GROUP BY host
+ORDER BY host;
+```
+
+If Q33 shows a per-host ratio but Q49 doesn't, the ratio is an artefact of
+historical peak load — not a current problem.
+
+## Q50. Keeper connection topology
+
+Tells you whether each replica connects to a co-located Keeper (sidecar:
+`keeper_address == hostName()`) or to a central Keeper cluster. The
+"slow Keeper follower" hypothesis only applies in the central topology.
+
+```sql
+SELECT
+    hostName() AS replica_host,  -- aliased away from the `host` column so it isn't shadowed
+    name AS keeper_node,
+    host AS keeper_address,
+    port,
+    connected_time,
+    session_uptime_elapsed_seconds,
+    is_expired,
+    keeper_api_version
+FROM clusterAllReplicas('{cluster}', system.zookeeper_connection)
+ORDER BY replica_host;
+```
+
+## Q51. Leader distribution across hosts
+
+Per-host counts of `is_leader = 1` vs `is_leader = 0` rows in
+`system.replicas`. In a healthy multi-replica cluster, leader counts
+should be roughly balanced. In a sidecar Keeper layout where every replica
+is leader of its local copy, you'll see `leader_count == total_replicas` —
+expected, not a concern.
+
+```sql
+SELECT
+ hostName() AS host,
+ countIf(is_leader = 1) AS leader_count,
+ countIf(is_leader = 0) AS non_leader_count,
+ count() AS total_replicas
+FROM clusterAllReplicas('{cluster}', system.replicas)
+GROUP BY host
+ORDER BY host;
+```
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/parts-and-merges.md b/content/en/altinity-kb-diagnostics-runbook/query-library/parts-and-merges.md
new file mode 100644
index 0000000000..75a262c054
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/parts-and-merges.md
@@ -0,0 +1,153 @@
+---
+title: "Parts and merges queries"
+linkTitle: "Parts and merges"
+weight: 20
+description: >
+ Cluster-wide queries for parts health, partition fragmentation, and
+ merge throughput.
+keywords:
+ - clickhouse parts
+ - clickhouse merges
+ - parts_to_delay_insert
+ - too_many_parts
+---
+
+Queries for diagnosing part counts, partition fragmentation, active merges,
+and the part-creation-vs-merge rate. All queries fan out across the
+cluster — replace `{cluster}` with your cluster name and
+`{table_pattern}` where indicated.
+
+## Q6. Parts health per host
+
+Per-host, per-table active part count, total rows, on-disk size, and the
+most recent modification time. The starting point when investigating high
+part counts cluster-wide.
+
+```sql
+SELECT
+ hostName() AS host,
+ database,
+ table,
+ count() AS active_parts,
+ sum(rows) AS total_rows,
+ round(sum(bytes_on_disk) / 1e9, 2) AS size_GB,
+ max(modification_time) AS last_modified
+FROM clusterAllReplicas('{cluster}', system.parts)
+WHERE active = 1
+GROUP BY host, database, table
+ORDER BY host, active_parts DESC;
+```
+
+## Q7. Parts count per partition ⭐
+
+`parts_to_delay_insert` and `parts_to_throw_insert` are **per partition**,
+not per table. A table with a thousand parts spread across a hundred
+partitions is fine; a partition with three hundred parts is in trouble.
+Use this when diagnosing `TOO_MANY_PARTS` (code 252) or "Delaying inserts
+by N ms" warnings.
+
+```sql
+SELECT
+ hostName() AS host,
+ database, table, partition,
+ count() AS parts,
+ sum(rows) AS rows,
+ round(sum(bytes_on_disk) / 1e9, 2) AS size_GB
+FROM clusterAllReplicas('{cluster}', system.parts)
+WHERE active = 1
+GROUP BY host, database, table, partition
+HAVING parts > 100
+ORDER BY parts DESC
+LIMIT 50;
+```
+
+## Q8. Active merges
+
+Currently-executing merges by host and table, with progress, elapsed time,
+total merge size, and memory in use. Lets you see whether merges are
+running and how much memory they hold.
+
+```sql
+SELECT
+ hostName() AS host,
+ database,
+ table,
+ count() AS active_merges,
+ round(avg(progress) * 100, 1) AS avg_progress_pct,
+ max(elapsed) AS max_elapsed_sec,
+ round(sum(total_size_bytes_compressed) / 1e9, 2) AS total_merge_GB,
+ round(sum(memory_usage) / 1e9, 1) AS merge_memory_GB
+FROM clusterAllReplicas('{cluster}', system.merges)
+GROUP BY host, database, table
+ORDER BY host, active_merges DESC;
+```
+
+## Q9. Part creation vs merge rate (last 30 minutes)
+
+Counts `NewPart`, `MergeParts`, `MutatePart`, and `RemovePart` events in a
+recent window. When `new_parts` is growing faster than `merged_parts`, the
+merge pool is not keeping up — back-pressure is imminent.
+
+```sql
+SELECT
+ hostName() AS host,
+ database, table,
+ sum(if(event_type = 'NewPart', 1, 0)) AS new_parts,
+ sum(if(event_type = 'MergeParts', 1, 0)) AS merged_parts,
+ sum(if(event_type = 'MergeParts', rows, 0)) AS rows_merged,
+ sum(if(event_type = 'MutatePart', 1, 0)) AS mutations,
+ sum(if(event_type = 'RemovePart', 1, 0)) AS removed_parts
+FROM clusterAllReplicas('{cluster}', system.part_log)
+WHERE event_time >= now() - INTERVAL 30 MINUTE
+GROUP BY host, database, table
+ORDER BY new_parts DESC
+LIMIT 30;
+```
+
+## Q10. Merge settings check
+
+Confirms the threshold settings before recommending a tuning change. These
+are the values the engine actually uses, not what's in the running config.
+
+```sql
+SELECT name, value
+FROM system.merge_tree_settings
+WHERE name IN (
+ 'max_bytes_to_merge_at_max_space_in_pool',
+ 'number_of_free_entries_in_pool_to_lower_max_size_of_merge',
+ 'max_number_of_merges_with_ttl_in_pool',
+ 'parts_to_delay_insert',
+ 'parts_to_throw_insert',
+ 'inactive_parts_to_delay_insert',
+ 'inactive_parts_to_throw_insert'
+);
+```
+
+## Q42. Partition count health
+
+Per-table partition count, active part count, and the ratio between them.
+A high `partition_count` usually means a high-cardinality partition key
+(e.g., partitioning by minute or hour on a dataset that doesn't need it).
+A high `avg_parts_per_partition` means merges can't keep up with inserts.
+
+```sql
+SELECT
+ hostName() AS host,
+ database, `table`,
+ count(DISTINCT partition) AS partition_count,
+ count() AS active_parts,
+ round(active_parts / partition_count, 1) AS avg_parts_per_partition,
+ sum(rows) AS total_rows,
+ round(sum(bytes_on_disk) / 1e9, 2) AS size_GB
+FROM clusterAllReplicas('{cluster}', system.parts)
+WHERE active = 1 AND `table` LIKE '%{table_pattern}%'
+GROUP BY host, database, `table`
+ORDER BY partition_count DESC;
+```
+
+Flag thresholds (a filtered variant of Q42 follows the list):
+
+- `partition_count > 500` per table → schema problem (partition key
+ cardinality is too high).
+- `avg_parts_per_partition > 50` → merge pool can't keep up.
+- `partition_count = 12` for a year of monthly data → correct.
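+
+To surface only the offenders, a filtered variant of Q42 that drops the
+`{table_pattern}` filter and keeps rows breaching either threshold (the
+thresholds are the rules of thumb above, not hard limits):
+
+```sql
+-- Tables whose partitioning or merge backlog crosses the flag thresholds.
+SELECT
+    hostName() AS host,
+    database, `table`,
+    count(DISTINCT partition) AS partition_count,
+    count() AS active_parts,
+    round(count() / count(DISTINCT partition), 1) AS avg_parts_per_partition
+FROM clusterAllReplicas('{cluster}', system.parts)
+WHERE active = 1
+GROUP BY host, database, `table`
+HAVING partition_count > 500 OR avg_parts_per_partition > 50
+ORDER BY partition_count DESC;
+```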
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/pools-and-resources.md b/content/en/altinity-kb-diagnostics-runbook/query-library/pools-and-resources.md
new file mode 100644
index 0000000000..721e85acb6
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/pools-and-resources.md
@@ -0,0 +1,114 @@
+---
+title: "Pools and resources queries"
+linkTitle: "Pools and resources"
+weight: 40
+description: >
+ Cluster-wide queries for background pool saturation and memory pressure.
+keywords:
+ - clickhouse background pool
+ - memory pressure
+ - cgroup memory
+ - jemalloc
+---
+
+Queries for inspecting background thread pool activity, configured pool
+sizes, and memory pressure (process, jemalloc, cgroup, OS). All queries
+fan out across the cluster — replace `{cluster}` with your cluster name.
+
+## Q13. Pool saturation metrics
+
+Current activity in each background pool. When a pool counter equals its
+configured size (Q14), the pool is saturated — additional work will queue
+behind it.
+
+```sql
+SELECT
+ hostName() AS host,
+ metric,
+ value
+FROM clusterAllReplicas('{cluster}', system.metrics)
+WHERE metric IN (
+ 'BackgroundFetchesPoolTask',
+ 'BackgroundMergesAndMutationsPoolTask',
+ 'BackgroundCommonPoolTask',
+ 'BackgroundSchedulePoolTask',
+ 'BackgroundMessageBrokerSchedulePoolTask',
+ 'ReplicatedFetch',
+ 'ReplicatedSend',
+ 'ReplicatedChecks',
+ 'Merge',
+ 'PartMutation',
+ 'Query'
+)
+ORDER BY host, metric;
+```
+
+## Q14. Pool sizes (server settings)
+
+The configured upper bound for each pool. Pair with Q13: when a Q13 value
+matches the Q14 value for the same pool, that pool is the bottleneck.
+
+```sql
+SELECT
+ hostName() AS host,
+ name, value
+FROM clusterAllReplicas('{cluster}', system.server_settings)
+WHERE name IN (
+ 'background_pool_size',
+ 'background_fetches_pool_size',
+ 'background_merges_mutations_concurrency_ratio',
+ 'background_common_pool_size',
+ 'background_schedule_pool_size',
+ 'background_message_broker_schedule_pool_size'
+);
+```
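+
+A convenience sketch pairing one metric with its configured size, shown here
+for the fetches pool (the metric-to-setting pairing for the other pools
+follows the same pattern):
+
+```sql
+-- Per host: active fetch tasks next to the configured fetches pool size.
+SELECT
+    host,
+    active_fetch_tasks,
+    fetches_pool_size,
+    active_fetch_tasks >= fetches_pool_size AS fetch_pool_saturated
+FROM
+(
+    SELECT hostName() AS host, value AS active_fetch_tasks
+    FROM clusterAllReplicas('{cluster}', system.metrics)
+    WHERE metric = 'BackgroundFetchesPoolTask'
+) AS m
+INNER JOIN
+(
+    SELECT hostName() AS host, toInt64(value) AS fetches_pool_size
+    FROM clusterAllReplicas('{cluster}', system.server_settings)
+    WHERE name = 'background_fetches_pool_size'
+) AS s USING (host)
+ORDER BY host;
+```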
+
+## Q15. Memory pressure
+
+Process RSS, the ClickHouse memory tracker, jemalloc resident/active, OS
+available/total, and cgroup used/total. The first query when investigating
+OOM-kills, `MEMORY_LIMIT_EXCEEDED` (code 241), or pod restarts.
+
+```sql
+SELECT
+ hostName() AS host,
+ metric,
+ formatReadableSize(value) AS val
+FROM clusterAllReplicas('{cluster}', system.asynchronous_metrics)
+WHERE metric IN (
+ 'MemoryResident',
+ 'MemoryTracking',
+ 'jemalloc.resident',
+ 'jemalloc.active',
+ 'OSMemoryAvailable',
+ 'OSMemoryTotal',
+ 'CGroupMemoryUsed',
+ 'CGroupMemoryTotal'
+)
+ORDER BY host, metric;
+```
+
+If `MemoryResident` is far above `MemoryTracking`, the gap is jemalloc
+retained pages and OS page cache. See
+[Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/)
+for attribution.
+
+## Q54. Memory pressure per host (compact)
+
+Same as Q15 but limited to the three numbers you compare across hosts.
+Use this to detect cluster-wide memory pressure (every host at >90%) vs a
+single-host issue.
+
+```sql
+SELECT
+ hostName() AS host,
+ metric, formatReadableSize(value) AS val
+FROM clusterAllReplicas('{cluster}', system.asynchronous_metrics)
+WHERE metric IN ('MemoryResident', 'OSMemoryAvailable',
+ 'CGroupMemoryUsed', 'CGroupMemoryTotal')
+ORDER BY host, metric;
+```
+
+When `CGroupMemoryUsed / CGroupMemoryTotal > 90%` on every host, the
+cluster is memory-constrained globally — workload-level tuning helps
+marginally, but the real fix is more RAM per node or less work per node.
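+
+A sketch that computes the cgroup percentage directly, so the 90% check does
+not have to be done by eye:
+
+```sql
+-- Cgroup memory usage as a percentage, per host.
+SELECT
+    hostName() AS host,
+    maxIf(value, metric = 'CGroupMemoryUsed') AS cgroup_used_bytes,
+    maxIf(value, metric = 'CGroupMemoryTotal') AS cgroup_total_bytes,
+    round(cgroup_used_bytes * 100 / nullIf(cgroup_total_bytes, 0), 1) AS cgroup_used_pct
+FROM clusterAllReplicas('{cluster}', system.asynchronous_metrics)
+WHERE metric IN ('CGroupMemoryUsed', 'CGroupMemoryTotal')
+GROUP BY host
+ORDER BY cgroup_used_pct DESC;
+```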
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations.md b/content/en/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations.md
new file mode 100644
index 0000000000..1c8516f4aa
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations.md
@@ -0,0 +1,110 @@
+---
+title: "Queries and mutations queries"
+linkTitle: "Queries and mutations"
+weight: 50
+description: >
+ Per-host query load, active queries, OOM/exception patterns, and stuck
+ mutations.
+keywords:
+ - clickhouse query_log
+ - clickhouse processes
+ - stuck mutations
+ - OOM
+---
+
+Queries for the live and recent state of the query system: load by kind,
+what's running right now, recent exceptions, and stuck mutations. All
+queries fan out across the cluster — replace `{cluster}` with your cluster
+name.
+
+## Q16. Query load per host (last 30 minutes)
+
+Per-host query counts by `query_kind`, average duration, peak memory, read
+and written rows, and error count. Useful for spotting load imbalance and
+error spikes by query type.
+
+```sql
+SELECT
+ hostName() AS host,
+ query_kind,
+ count() AS query_count,
+ round(avg(query_duration_ms), 0) AS avg_duration_ms,
+ round(max(memory_usage) / 1e9, 1) AS max_memory_GB,
+ sum(read_rows) AS total_read_rows,
+ sum(written_rows) AS total_written_rows,
+ countIf(type = 'ExceptionWhileProcessing') AS errors
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 30 MINUTE
+ AND type IN ('QueryFinish', 'ExceptionWhileProcessing')
+GROUP BY host, query_kind
+ORDER BY host, query_count DESC;
+```
+
+## Q17. Active queries right now
+
+Live snapshot of running queries — elapsed, memory, rows read, plus a
+query snippet. The fastest way to see "what's pinning the cluster right
+now".
+
+```sql
+SELECT
+ hostName() AS host,
+ query_id, user, elapsed,
+ round(memory_usage / 1e9, 2) AS memory_GB,
+ read_rows,
+ formatReadableSize(read_bytes) AS read_bytes,
+ query_kind,
+ substring(query, 1, 200) AS query_snippet
+FROM clusterAllReplicas('{cluster}', system.processes)
+ORDER BY elapsed DESC
+LIMIT 30;
+```
+
+## Q18. Recent OOM / exception queries
+
+Failed queries in the last four hours with their exception code, exception
+text, memory usage, and query snippet. Read after Q15 — gives you the
+queries responsible for memory pressure spikes.
+
+```sql
+SELECT
+ hostName() AS host,
+ event_time,
+ query_id,
+ round(memory_usage / 1e9, 1) AS memory_GB,
+ query_duration_ms,
+ exception_code,
+ substring(exception, 1, 300) AS exception_short,
+ substring(query, 1, 200) AS query_snippet
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE event_time >= now() - INTERVAL 4 HOUR
+ AND type = 'ExceptionWhileProcessing'
+ORDER BY event_time DESC
+LIMIT 30;
+```
+
+## Q19. Stuck mutations ⭐
+
+All not-done mutations with their command, age, parts-to-do count, and
+latest failure reason. The starting point for `ALTER TABLE … UPDATE/DELETE`
+not completing.
+
+```sql
+SELECT
+ hostName() AS host,
+ database, table,
+ mutation_id,
+ command,
+ create_time,
+ is_done,
+ parts_to_do,
+ latest_fail_reason,
+ latest_fail_time
+FROM clusterAllReplicas('{cluster}', system.mutations)
+WHERE NOT is_done
+ORDER BY host, create_time;
+```
+
+Mutations share the merge pool, so a stuck mutation often means the merge
+pool is saturated (see Q13). A mutation that references a column that
+no longer exists fails immediately with a clear `latest_fail_reason`.
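+
+A small sketch that pairs the two views per host: the stuck-mutation count
+next to the current merge/mutation pool activity from Q13.
+
+```sql
+-- Hosts with stuck mutations alongside their merge/mutation pool activity.
+SELECT
+    host,
+    stuck_mutations,
+    merge_pool_active
+FROM
+(
+    SELECT hostName() AS host, count() AS stuck_mutations
+    FROM clusterAllReplicas('{cluster}', system.mutations)
+    WHERE NOT is_done
+    GROUP BY host
+) AS m
+LEFT JOIN
+(
+    SELECT hostName() AS host, value AS merge_pool_active
+    FROM clusterAllReplicas('{cluster}', system.metrics)
+    WHERE metric = 'BackgroundMergesAndMutationsPoolTask'
+) AS p USING (host)
+ORDER BY stuck_mutations DESC;
+```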
diff --git a/content/en/altinity-kb-diagnostics-runbook/query-library/replication-and-queue.md b/content/en/altinity-kb-diagnostics-runbook/query-library/replication-and-queue.md
new file mode 100644
index 0000000000..e43301aea4
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/query-library/replication-and-queue.md
@@ -0,0 +1,178 @@
+---
+title: "Replication and queue queries"
+linkTitle: "Replication and queue"
+weight: 10
+description: >
+ Cluster-wide queries for inspecting the replication queue, replica
+ status, and active fetches.
+keywords:
+ - clickhouse replication queue
+ - replicated_fetches
+ - postpone_reason
+ - system.replicas
+---
+
+Queries for diagnosing replication queue depth, postpone reasons, replica
+lag, readonly mode, and in-flight fetches. All queries fan out across the
+cluster — replace `{cluster}` with your cluster name.
+
+These queries are referenced from the
+[scenarios](/altinity-kb-diagnostics-runbook/scenarios/) by their numeric
+IDs (`Q1`, `Q2`, …). The numbering is stable across the runbook.
+
+## Q1. Replication queue overview
+
+Per-host, per-table queue depth, currently-executing entries, max retries,
+and the oldest entry. The starting point when "the queue isn't draining".
+
+```sql
+SELECT
+ hostName() AS host,
+ database,
+ table,
+ count() AS queue_depth,
+ countIf(is_currently_executing) AS executing,
+ max(num_tries) AS max_retries,
+ max(last_exception) AS last_error,
+ min(create_time) AS oldest_entry
+FROM clusterAllReplicas('{cluster}', system.replication_queue)
+GROUP BY host, database, table
+ORDER BY host, queue_depth DESC;
+```
+
+## Q2. Replication queue — postpone reasons ⭐
+
+The smoking-gun query for merge↔fetch cycles. The `postpone_reason` text
+names the actual cause; see the patterns table in
+[quick reference](/altinity-kb-diagnostics-runbook/quick-reference/#common-postpone_reason-patterns).
+
+```sql
+SELECT
+ hostName() AS host,
+ database, table, type,
+ new_part_name,
+ is_currently_executing,
+ num_tries,
+ num_postponed,
+ postpone_reason,
+ last_exception,
+ create_time
+FROM clusterAllReplicas('{cluster}', system.replication_queue)
+WHERE num_postponed > 0 OR last_exception != ''
+ORDER BY num_postponed DESC, num_tries DESC
+LIMIT 50;
+```
+
+## Q3. Queue entry type breakdown
+
+Splits the queue by entry type (`GET_PART`, `MERGE_PARTS`, `MUTATE_PART`,
+etc.) so you can tell whether the backlog is fetches, merges, or mutations.
+
+```sql
+SELECT
+ hostName() AS host,
+ database, table, type,
+ count() AS entries,
+ countIf(is_currently_executing) AS executing,
+ avg(num_tries) AS avg_tries,
+ sum(num_postponed) AS total_postponed
+FROM clusterAllReplicas('{cluster}', system.replication_queue)
+GROUP BY host, database, table, type
+ORDER BY entries DESC;
+```
+
+## Q4. Replica status — lag and readonly per host
+
+Drills into a specific replica's state: leader flag, readonly flag, absolute
+delay in seconds, queue size split, and how far the log pointer is behind
+the leader.
+
+```sql
+SELECT
+ hostName() AS host,
+ database,
+ table,
+ is_leader,
+ is_readonly,
+ absolute_delay AS replica_lag_sec,
+ queue_size,
+ inserts_in_queue,
+ merges_in_queue,
+ log_max_index - log_pointer AS log_behind,
+ active_replicas,
+ total_replicas
+FROM clusterAllReplicas('{cluster}', system.replicas)
+ORDER BY host, replica_lag_sec DESC;
+```
+
+## Q5. Replication summary per host ⭐
+
+One row per host — readonly count, lag, queue depth, insert/merge backlog.
+The fastest first look at cluster-wide replication health and the first
+query in the general-triage flow.
+
+```sql
+SELECT
+ hostName() AS host,
+ count() AS total_tables,
+ countIf(is_readonly) AS readonly_tables,
+ countIf(absolute_delay > 300) AS lagging_tables,
+ max(absolute_delay) AS max_lag_sec,
+ sum(queue_size) AS total_queue_depth,
+ sum(inserts_in_queue) AS total_inserts_queued,
+ sum(merges_in_queue) AS total_merges_queued
+FROM clusterAllReplicas('{cluster}', system.replicas)
+GROUP BY host
+ORDER BY max_lag_sec DESC, readonly_tables DESC;
+```
+
+## Q31. Replicated fetches in flight
+
+Active fetch tasks with their source replica, progress, elapsed time, and
+bytes transferred. Distinguishes pool *exhaustion* from pool slots *claimed
+by stuck tasks*.
+
+```sql
+SELECT
+ hostName() AS host,
+ database, `table`,
+ source_replica_hostname,
+ elapsed,
+ progress,
+ round(total_size_bytes_compressed / 1e6, 1) AS total_MB,
+ round(bytes_read_compressed / 1e6, 1) AS read_MB,
+ result_part_name,
+ partition_id,
+ thread_id
+FROM clusterAllReplicas('{cluster}', system.replicated_fetches)
+ORDER BY host, elapsed DESC;
+```
+
+The column for the source replica varies by ClickHouse version. If the
+above errors with "unknown identifier", inspect the schema first:
+
+```sql
+SELECT name FROM system.columns
+WHERE database = 'system' AND table = 'replicated_fetches';
+```
+
+If `BackgroundFetchesPoolTask` is at the configured pool size but Q31
+returns few rows, the slots are claimed by tasks that are *waiting*, not
+*transferring* — Keeper saturation is the usual cause.
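+
+As a quick side-by-side counter check (a cut-down view of what Q13 reports),
+a minimal sketch, assuming the standard `BackgroundFetchesPoolTask` and
+`ReplicatedFetch` metric names exist on your version:
+
+```sql
+SELECT
+    hostName() AS host,
+    maxIf(value, metric = 'BackgroundFetchesPoolTask') AS fetch_pool_slots_claimed,
+    maxIf(value, metric = 'ReplicatedFetch')           AS fetches_transferring
+FROM clusterAllReplicas('{cluster}', system.metrics)
+WHERE metric IN ('BackgroundFetchesPoolTask', 'ReplicatedFetch')
+GROUP BY host
+ORDER BY host;
+```
+
+A large gap between the two columns on the same host is the "slots claimed,
+nothing transferring" signature described above.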
+
+## Q32. Source replica distribution for active fetches
+
+Aggregates Q31 by source replica — useful when one replica is acting as the
+fetch source for everyone and saturating its outbound bandwidth.
+
+```sql
+SELECT
+ hostName() AS host,
+ source_replica_hostname,
+ count() AS active_fetches,
+ round(avg(progress) * 100, 1) AS avg_progress_pct,
+ max(elapsed) AS max_elapsed_sec
+FROM clusterAllReplicas('{cluster}', system.replicated_fetches)
+GROUP BY host, source_replica_hostname
+ORDER BY host, active_fetches DESC;
+```
diff --git a/content/en/altinity-kb-diagnostics-runbook/quick-reference.md b/content/en/altinity-kb-diagnostics-runbook/quick-reference.md
new file mode 100644
index 0000000000..c33cc31d70
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/quick-reference.md
@@ -0,0 +1,95 @@
+---
+title: "Quick reference: symptom to query"
+linkTitle: "Quick reference"
+weight: 10
+description: >
+ One-page lookup: pick a symptom, jump to the query that diagnoses it.
+keywords:
+ - clickhouse triage
+ - clickhouse diagnostics
+ - postpone_reason
+ - replication queue
+ - async insert
+---
+
+When you have a specific symptom, run the indicated query first. When you
+don't know what's wrong, run **Q5 → Q11 → Q15 → Q17** in that order — it
+gives you 80% of the cluster's state in about ten seconds.
+
+All query IDs (`Q1`, `Q2`, …) link into the
+[query library](/altinity-kb-diagnostics-runbook/query-library/).
+
+## Symptom → first query
+
+| Symptom | Run first | Section |
+|---|---|---|
+| Queue not draining | Q2 — postpone reasons | [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/) |
+| Background pool pinned, no progress | Q31 — active fetches | [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/) |
+| Async insert timeout | Q38 — failure tables | [Async inserts](/altinity-kb-diagnostics-runbook/query-library/async-inserts/) |
+| Kafka consumer kicks (`max.poll.interval.ms`) | Q44 — consumers vs pool | [Dictionaries and Kafka](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/) |
+| Memory low | Q15 then Q8 — merges holding RAM | [Pools and resources](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/) |
+| OOM / `MEMORY_LIMIT_EXCEEDED` | Q15 + Q17 + Q18 | [Pools and resources](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/), [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) |
+| Disk full / `NOT_ENOUGH_SPACE` | Q11 | [Disk and storage](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/) |
+| Mutations stuck | Q19 | [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) |
+| Replica readonly | Q4 + Q29 | [Replication and queue](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/), [Keeper and coordination](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/) |
+| Slow queries | Q17 + Q16 | [Queries and mutations](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/) |
+| Insert backpressure ("delayed by X ms") | Q7 — parts per partition | [Parts and merges](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/) |
+| Failures concentrated on a subset of hosts | Q46 + Q53 + Q48 | [Insert load and host skew](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/) |
+| Don't know what's wrong | Q5 → Q11 → Q15 → Q17 | mixed |
+
+## Common `postpone_reason` patterns
+
+From `system.replication_queue.postpone_reason` (Q2). The text is what to
+match on:
+
+| Pattern in `postpone_reason` | What it means |
+|---|---|
+| `N fetches already executing, max N` | Fetch pool pinned. Check Q31 for actual transfers — if `replicated_fetches` is near-empty while the pool counter is at the limit, the symptom is Keeper saturation, not pool exhaustion. |
+| `source parts size … greater than current maximum` | Fetch waiting on a merge to produce a smaller source part. Look at Q8 for the upstream merge. |
+| `covering parts list` | Merge is waiting on child fetches to land. |
+| `another log entry for same part is being processed` | Normal serialisation. Only a problem if persistent (the same entry stuck for tens of minutes). |
+| Anything mentioning `timeout`, `S3`, or `network` | Infrastructure-layer issue — investigate the storage/network path, not ClickHouse internals. |
+
+## "Trust but verify" — pitfalls that hide root causes
+
+- **Empty `system.replicated_fetches` despite a high
+ `BackgroundFetchesPoolTask` counter** means tasks are stuck claiming slots
+ but not transferring. The pool isn't the bottleneck — Keeper or another
+ coordinator usually is.
+- **`query_log.tables` is an array** that includes every table touched —
+  inserts, MV dependencies, and read-side joins. Use `arrayJoin(tables)` for
+  per-table grouping, never `tables[1]` as "the writer". The actual physical
+  INSERT target is in the query text. Always inspect the query text before
+  blaming a specific table. A sketch follows this list.
+- **`system.query_log` has no `database` or `table` column** — they live in
+ `databases[]` and `tables[]`.
+- **`part_log` is the source of truth for "is this table being written to?"**
+ It covers both direct inserts and MV writes, while `query_log` only sees
+ the originating query.
+- **`avg_ms ≈ async_insert_busy_timeout_ms`** is the signature of an MV-chain
+ timeout (the insert is *waiting*, not *working*). A genuinely slow insert
+ has a distribution; a queue timeout is a hard ceiling.
+- **`system.metric_log` stores metrics as columns, not rows**
+ (`CurrentMetric_*`, `ProfileEvent_*`). You cannot filter with
+ `WHERE metric IN (…)` — `SELECT` the specific columns.
+- **`system.events` uses an `event` column, not a `metric` column.** Easy
+ thinko when you switch between `metric_log`/`metrics` and `events`.
+- **`system.zookeeper_log` does not exist on every version.** Run
+ `EXISTS TABLE system.zookeeper_log` before assuming it's available.
+- **`EXPLAIN PIPELINE graph=1`** takes the setting name in lowercase;
+  uppercase (`GRAPH = 1`) does not parse.
+- **The `views`/`view_durations` columns on `query_log` vary by version.**
+ When in doubt:
+ `SELECT name FROM system.columns WHERE database='system' AND table='query_log' AND name ILIKE '%view%'`.
+- **Cumulative `system.events` totals integrate since process start.** Ratios
+ computed from them can reflect a peak-load period from days ago. Use
+ `system.metric_log` over a recent window when comparing live host
+ behaviour. See
+ [Investigation methods → cumulative metrics hide current state](/altinity-kb-diagnostics-runbook/investigation-methods/#cumulative-metrics-hide-current-state).
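+
+A minimal grouping sketch for the `arrayJoin(tables)` point above, assuming
+the standard `query_log` columns; remember it shows the MV blast radius per
+failed insert, not the writer:
+
+```sql
+SELECT
+    hostName() AS host,
+    arrayJoin(tables) AS touched_table,
+    count() AS failed_queries
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE query_kind = 'Insert'
+  AND type = 'ExceptionWhileProcessing'
+  AND event_time > now() - INTERVAL 1 HOUR
+GROUP BY host, touched_table
+ORDER BY failed_queries DESC
+LIMIT 20;
+```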
+
+## Priority heatmap
+
+If you can only run one query for a given scenario, the scenario page marks
+it with **⭐**. For broad triage where you don't know the scenario yet:
+`Q5 → Q11 → Q15 → Q17` covers replication, disk, memory, and active queries
+in four queries.
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/_index.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/_index.md
new file mode 100644
index 0000000000..407730673e
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/_index.md
@@ -0,0 +1,32 @@
+---
+title: "Scenarios"
+linkTitle: "Scenarios"
+weight: 40
+description: >
+ Step-by-step diagnostic flows for common ClickHouse® failure modes.
+keywords:
+ - clickhouse troubleshooting
+ - diagnostic playbook
+ - ClickHouse scenarios
+---
+
+Each scenario lists triggering symptoms, an ordered diagnostic flow
+(queries to run, in order, with "what to look for"), common root causes,
+and resolution paths. Queries are referenced by their numeric ID — follow
+the link to the
+[query library](/altinity-kb-diagnostics-runbook/query-library/) for the
+full SQL.
+
+| Scenario | When to use |
+|---|---|
+| [General triage](/altinity-kb-diagnostics-runbook/scenarios/general-triage/) | "Something is wrong" — no specific symptom yet. Start here. |
+| [Merge–fetch and pool issues](/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues/) | Queue not draining, pool counters pinned, replicated_fetches near-empty. |
+| [Too many parts and backpressure](/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure/) | `TOO_MANY_PARTS`, "Delaying inserts by N ms", cascading insert slowdown. |
+| [Replica readonly](/altinity-kb-diagnostics-runbook/scenarios/replica-readonly/) | One or more replicas in readonly mode, growing `absolute_delay`. |
+| [Memory and disk pressure](/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure/) | OOM, `MEMORY_LIMIT_EXCEEDED`, `NOT_ENOUGH_SPACE`, cluster-wide pressure. |
+| [Stuck mutations](/altinity-kb-diagnostics-runbook/scenarios/stuck-mutations/) | `ALTER UPDATE/DELETE` not completing. |
+| [Async insert issues](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/) | Flush errors, MV chain timeouts, stuck async insert queue. |
+| [Slow queries](/altinity-kb-diagnostics-runbook/scenarios/slow-queries/) | Dashboard timeouts, query latency complaints. |
+| [Kafka consumer issues](/altinity-kb-diagnostics-runbook/scenarios/kafka-consumer-issues/) | `max.poll.interval.ms` violations, consumer rebalance storms. |
+| [Frozen historical tables](/altinity-kb-diagnostics-runbook/scenarios/frozen-historical-tables/) | Old tables adding permanent background load. |
+| [Host-skewed failures](/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures/) | Failures concentrate on a subset of hosts; settings and workload look identical. |
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues.md
new file mode 100644
index 0000000000..079cbbdca8
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues.md
@@ -0,0 +1,122 @@
+---
+title: "Async insert issues"
+linkTitle: "Async insert issues"
+weight: 70
+description: >
+ Diagnosing async insert flush failures, MV-chain timeouts, and stuck
+ async insert queues.
+keywords:
+ - async insert
+ - asynchronous_insert_log
+ - flusherror
+ - MV timeout
+ - async_insert_busy_timeout_ms
+---
+
+Three failure modes share async-insert symptoms but differ in their cause
+and fix. The MV-chain timeout case is the most commonly misdiagnosed — a
+flush that looks slow is actually waiting in a queue.
+
+## Async insert flush failures
+
+### Symptoms
+
+- Inserts succeed at the HTTP layer but data is missing or delayed.
+- `FlushError` rows in `system.asynchronous_insert_log`.
+- Reports of "silent data loss".
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q28](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q28-live-async-insert-health-check-last-5-minutes) | Live snapshot — is it happening right now? |
+| 2 | [Q21](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q21-async-insert-flush-errors) | Recent flush errors with exception text. |
+| 3 | [Q22](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q22-async-insert-impact-aggregation) | Impact aggregation — total rows / bytes affected, time window. |
+| 4 | [Q23](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q23-async-insert-flush-latency-by-tablestatus) | Latency patterns — are flushes timing out? |
+| 5 | [Q24](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q24-slowest-asyncinsertflush-queries) | Slowest flush queries — what's making them slow? |
+| 6 | [Q26](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q26-mv-frequency-in-errors) | Is one specific MV showing up in errors? |
+| 7 | [Q25](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q25-mv-appearances-in-failed-flushes) | Drill into that MV's failure pattern. |
+| 8 | [Q27](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q27-mv-definitions--chain-inspection) | Inspect the MV definition. |
+
+### Common root causes
+
+- MV in the chain hitting a memory limit or a slow JOIN.
+- Target table on the MV chain has `TOO_MANY_PARTS`.
+- Async insert buffer too large — flush exceeds query memory.
+- MV using non-deterministic functions or external dictionaries that are
+ slow / failing to refresh.
+
+## MV chain timeout on async inserts
+
+### Symptoms
+
+- `Code: 159. DB::Exception: Wait for async insert timeout (120000 ms) exceeded`.
+- `avg_ms` pinned exactly at the configured wait ceiling
+  (`async_insert_busy_timeout_ms`; 120000 ms in the error above): the
+  signature of a *wait*, not of *slow work*.
+- Specific target tables in the failure list, not all of them.
+- Persistent failures, not bursty.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q38](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q38-async-insert-timeout-failures-by-table) ⭐ | Which tables are timing out. |
+| 2 | [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection) ⭐ | The **actual** physical INSERT target from the query text — not just `tables[]`. |
+| 3 | Q39 (`as_select` for MVs writing into those tables) | MV chain depth feeding the failing tables. |
+| 4 | [Q42](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q42-partition-count-health) | Are target tables fragmented? |
+| 5 | [Q43](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/#q43-dictionary-health-check) | Are dictionaries used in MVs healthy? |
+| 6 | [Q16](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q16-query-load-per-host-last-30-minutes) | Other queries running heavily on those tables? |
+
+### Common root causes
+
+1. MV doing batch-ETL work (heavy joins, many `dictGet`, aggregations) at
+ insert time.
+2. Target table has too many parts — the MV's writes back into it are
+ slow.
+3. A dictionary used in an MV is slow or stale.
+4. MV chain depth too deep (`MV → table → MV → table`).
+
+### Resolution path
+
+1. **Quick relief**: raise `async_insert_busy_timeout_ms` for the
+   user/table (sketch after this list).
+2. **Real fix**: simplify the MV — move heavy work to a scheduled
+ Refreshable MV or a batch job.
+3. If a dictionary is slow → fix its source or refresh policy.
+4. If the target is fragmented → fix the part count first
+ ([Too many parts and backpressure](/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure/)).
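+
+A hedged sketch of the quick-relief step, assuming SQL-driven access control;
+the user name is illustrative, and a settings profile or the INSERT's own
+`SETTINGS` clause works just as well:
+
+```sql
+-- Illustrative only: raise the wait ceiling for one ingest user.
+-- New sessions for this user pick up the value; no server restart needed.
+ALTER USER ingest_writer SETTINGS async_insert_busy_timeout_ms = 300000;
+```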
+
+## Stuck async insert queue (buffers don't drain)
+
+### Symptoms
+
+- `system.metrics.PendingAsyncInsert` very high (hundreds+) on some hosts,
+ low on others.
+- Failed async inserts piling up.
+- `async_insert_threads` already adequately sized.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection) | Confirm the actual writers. |
+| 2 | [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log) | Active query count per host. |
+| 3 | [Q53](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q53-failure-rate-per-host) | Failure rate per host. |
+| 4 | [Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection) | `async_insert_busy_timeout_ms` and related settings. |
+| 5 | Inspect the failing query's `Settings` | Is `wait_for_async_insert=1`? Client is waiting for flush completion. |
+
+### Key signature
+
+`query_duration_ms ≈ async_insert_busy_timeout_ms` with
+`UserTimeMicroseconds` in single-digit milliseconds. The insert sat in a
+queue for the full timeout, doing no CPU work. See
+[Investigation methods → ProfileEvents reveal "waited not worked"](/altinity-kb-diagnostics-runbook/investigation-methods/#profileevents-reveal-waited-not-worked).
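+
+To confirm the signature from `query_log`, a minimal sketch assuming the
+Map-typed `ProfileEvents` and `Settings` columns of recent versions (older
+releases expose them as `ProfileEvents.Names`/`Values` arrays instead):
+
+```sql
+SELECT
+    hostName() AS host,
+    query_duration_ms,
+    ProfileEvents['UserTimeMicroseconds'] / 1000 AS user_time_ms,
+    Settings['async_insert_busy_timeout_ms']     AS busy_timeout_ms,
+    substring(query, 1, 80)                      AS query_head
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE query_kind = 'Insert'
+  AND exception_code != 0
+  AND event_time > now() - INTERVAL 1 HOUR
+ORDER BY query_duration_ms DESC
+LIMIT 20;
+```
+
+Durations at the timeout with `user_time_ms` in single digits confirm the
+insert waited rather than worked.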
+
+### Resolution path
+
+- Raise `async_insert_busy_timeout_ms` (the wait ceiling) — buys time per
+ insert, treats the symptom.
+- Lower `async_insert_max_data_size` — smaller, more frequent flushes.
+- Find and fix the upstream cause of queue concentration —
+ [Host-skewed failures](/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures/)
+ is the usual next stop.
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/frozen-historical-tables.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/frozen-historical-tables.md
new file mode 100644
index 0000000000..50747ef699
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/frozen-historical-tables.md
@@ -0,0 +1,45 @@
+---
+title: "Frozen historical tables adding background load"
+linkTitle: "Frozen historical tables"
+weight: 100
+description: >
+ Identifying old, no-longer-written tables whose partition count adds
+ permanent Keeper coordination load.
+keywords:
+ - clickhouse partitions
+ - keeper load
+ - historical tables
+ - partition cardinality
+---
+
+## Symptoms
+
+- Old tables (previous-year or archive tables) showing high partition
+ counts.
+- Part counts high but stable — not growing.
+- Background merge / Keeper traffic disproportionate to the active
+ workload.
+
+## Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q42](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q42-partition-count-health) ⭐ | Tables with extreme partition counts. |
+| 2 | [Q40](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q40-active-inserts-confirmation-per-table-specific) | Confirm no recent writes — an empty result means the table is frozen. |
+| 3 | [Q33](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q33-keeper-wait-time-and-activity-cumulative) | If `ZooKeeperList` is very high → confirms Keeper coordination overhead is the load source. |
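+
+As a supplement to step 2, a minimal check of when the table last produced a
+part, assuming `system.parts.modification_time` is present on your version:
+
+```sql
+SELECT
+    hostName() AS host,
+    count() AS active_parts,
+    max(modification_time) AS last_part_written
+FROM clusterAllReplicas('{cluster}', system.parts)
+WHERE database = '{database}' AND table = '{table}' AND active
+GROUP BY host
+ORDER BY host;
+```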
+
+## Resolution path
+
+Frozen high-cardinality-partition tables don't cause acute incidents but
+add permanent load. Options, ordered by lowest disruption:
+
+1. **Drop** if the data is archived elsewhere.
+2. **Detach old partitions** and **re-attach** them to a re-partitioned
+ table with a sane partition key (`toYYYYMM(date)` for monthly,
+ `toYYYYMMDD(date)` for daily on small datasets).
+3. **Rebuild** the table with the sane partition key — only when neither
+ of the above is feasible. Costly in time and disk.
+
+The partition key choice is the schema-level fix; see
+[How to pick an ORDER BY / PRIMARY KEY / PARTITION BY](/engines/mergetree-table-engine-family/pick-keys/)
+for guidance.
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/general-triage.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/general-triage.md
new file mode 100644
index 0000000000..79f60c19d7
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/general-triage.md
@@ -0,0 +1,37 @@
+---
+title: "General cluster triage"
+linkTitle: "General triage"
+weight: 10
+description: >
+ The four-query first look when you don't yet know what's wrong.
+keywords:
+ - clickhouse triage
+ - clickhouse health check
+---
+
+When the only information you have is "something is wrong", four queries
+in order give you 80% of the cluster's state in about ten seconds. Use
+this when you can't yet pick a more specific scenario.
+
+## Diagnostic flow
+
+| Step | Query | Purpose |
+|---|---|---|
+| 1 | [Q5](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q5-replication-summary-per-host) | One row per host — readonly tables, lag, queue depth. |
+| 2 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) | Per-disk free space across the cluster. |
+| 3 | [Q15](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q15-memory-pressure) | Memory headroom (process, jemalloc, cgroup, OS) everywhere. |
+| 4 | [Q17](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q17-active-queries-right-now) | Active queries right now — what's running and how heavy. |
+| 5 | [Q18](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q18-recent-oom--exception-queries) | Recent exceptions across the last 4 hours. |
+| 6 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | Parts count overview. |
+| 7 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Background pool saturation. |
+
+After these, branch into a specific scenario based on what surfaced:
+
+- Readonly tables in Q5 → [Replica readonly](/altinity-kb-diagnostics-runbook/scenarios/replica-readonly/).
+- High lag or queue in Q5 → [Merge–fetch and pool issues](/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues/).
+- Disk near full in Q11 → [Memory and disk pressure](/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure/).
+- Memory pressure in Q15 → [Memory and disk pressure](/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure/).
+- Long-running queries in Q17 → [Slow queries](/altinity-kb-diagnostics-runbook/scenarios/slow-queries/).
+- `TOO_MANY_PARTS` in Q18 → [Too many parts and backpressure](/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure/).
+- Async insert timeouts in Q18 → [Async insert issues](/altinity-kb-diagnostics-runbook/scenarios/async-insert-issues/).
+- Pool counters pinned in Q13 → [Merge–fetch and pool issues](/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues/).
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures.md
new file mode 100644
index 0000000000..e228e89101
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/host-skewed-failures.md
@@ -0,0 +1,112 @@
+---
+title: "Host-skewed failures"
+linkTitle: "Host-skewed failures"
+weight: 110
+description: >
+ Diagnosing situations where failures concentrate on a subset of hosts
+ even though workload and configuration look identical.
+keywords:
+ - host skew
+ - load_balancing
+ - haproxy
+ - parallel_view_processing
+ - cumulative metrics
+---
+
+Three related cases live here: host-skewed failures with a balanced
+workload, "stale skew" complaints based on cumulative metrics, and the
+misattribution of failure tables when `tables[]` is read as the writer.
+All three share the same root pattern — surface appearances disagree with
+what's actually happening — and the same investigative tools resolve them.
+
+## Host-skewed insert failures (workload balanced, failures not)
+
+### Symptoms
+
+- Multiple replicas in the cluster.
+- Async insert failure rate is wildly different across hosts.
+- Question is some variation of "why are some hosts broken while others
+ work fine?".
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q46](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q46-per-host-insert-duration-profile) ⭐ | Per-host insert duration imbalance. |
+| 2 | [Q53](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q53-failure-rate-per-host) | Failure rate per host, workload-normalised. |
+| 3 | [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log) | Workload volume per host; active query pile-up. |
+| 4 | [Q54](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q54-memory-pressure-per-host-compact) | Memory pressure — concentrated or cluster-wide? |
+| 5 | [Q33](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q33-keeper-wait-time-and-activity-cumulative) + [Q49](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q49-tail-latency-for-keeper-operations) | Confirm Keeper isn't the imbalance source. |
+| 6 | [Q52](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q52-routing-settings-inspection) | Verify routing settings are identical across hosts. |
+
+### Decision tree
+
+- **Settings identical + workload balanced + failures host-skewed** →
+ upstream entry-point routing (HAProxy / ingress directing traffic to a
+ subset of hosts).
+- **Settings identical + memory pressure on bad hosts only** → resource
+ contention on those pods (CPU throttling, page-cache pressure).
+- **`parallel_view_processing = 0` + MV chains on slow hosts** → serial
+ MV execution queue, exacerbated by entry-point routing.
+
+### Resolution path
+
+- Raise `async_insert_busy_timeout_ms` for immediate relief.
+- Enable `parallel_view_processing = 1` to cut MV-chain wall time on each
+ insert (be aware this can change MV ordering semantics — confirm the
+ application is tolerant).
+- Change `load_balancing` from a hostname-affine policy to `round_robin`
+ or `random`.
+- Investigate the ingress / load balancer to spread client connections
+ evenly across replicas.
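+
+A hedged sketch of the two setting changes above; in production put them in
+the relevant settings profile rather than setting them per session:
+
+```sql
+SET parallel_view_processing = 1;    -- run an insert's MV chain in parallel rather than serially
+SET load_balancing = 'round_robin';  -- replaces a hostname-affine policy such as nearest_hostname
+```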
+
+## Stale skew: a "ratio" computed from cumulative metrics
+
+### Symptoms
+
+- Someone reports a metric ratio ("host X has Nx higher Keeper waits")
+ and asks for investigation.
+- The supporting evidence is `system.events` totals — cumulative since
+ process start.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | — | Ask which window the ratio was computed over. Cumulative `system.events` values include all historical peaks since process start. |
+| 2 | [Q49](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q49-tail-latency-for-keeper-operations) | p50/p95/p99 from `metric_log` over a recent window (10–30 min). |
+| 3 | [Q33](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q33-keeper-wait-time-and-activity-cumulative) | Wait by host on a recent window — confirm whether imbalance is current or historical. |
+| 4 | [Q48](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q48-per-second-activity-from-metric_log) | Whether watches and inflight requests are balanced now. |
+| 5 | [Q50](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q50-keeper-connection-topology) + [Q51](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q51-leader-distribution-across-hosts) | Verify Keeper topology and leadership are uniform. |
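+
+For steps 2–3, a minimal recent-window sketch from `metric_log`; the
+`ProfileEvent_*` column names vary slightly by version, so check
+`system.columns` if either one errors:
+
+```sql
+SELECT
+    hostName() AS host,
+    round(sum(ProfileEvent_ZooKeeperWaitMicroseconds) / 1e6, 1) AS keeper_wait_sec,
+    sum(ProfileEvent_ZooKeeperTransactions)                     AS keeper_transactions
+FROM clusterAllReplicas('{cluster}', system.metric_log)
+WHERE event_time > now() - INTERVAL 30 MINUTE
+GROUP BY host
+ORDER BY keeper_wait_sec DESC;
+```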
+
+### Decision tree
+
+- **Cumulative shows Nx skew, recent 10-min window shows balanced** →
+ historical incident artefact, already resolved.
+- **Cumulative and recent window agree** → real ongoing imbalance; dig
+ into per-host root cause.
+- **Recent window shows a different host as outlier** → the original
+ observation is stale. Explain the data carefully when reporting back.
+
+## Misattributed failure tables
+
+### Symptoms
+
+- Failed inserts list many target tables in `system.query_log.tables[]`.
+- Several look like the culprit.
+- Raising timeouts on the suspected tables doesn't help.
+
+### Diagnostic flow
+
+1. Run [Q47](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q47-failed-insert-query-text-inspection)
+ to get the actual INSERT query text.
+2. The `INSERT INTO database.table` statement reveals the **physical**
+ target — not the MV chain.
+3. Compare with the `tables[]` array — additional entries are MV
+ dependencies, not direct writes.
+4. Apply the fix on the actual physical target table, not on MV
+ dependencies.
+
+The `tables[]` array tells you the full MV blast radius, not the specific
+writer. Always run Q47 before deciding "the slow table is X". See
+[Investigation methods → `tables[]` in query_log is not the writer](/altinity-kb-diagnostics-runbook/investigation-methods/#tables-in-query_log-is-not-the-writer).
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/kafka-consumer-issues.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/kafka-consumer-issues.md
new file mode 100644
index 0000000000..4a70d79ecd
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/kafka-consumer-issues.md
@@ -0,0 +1,43 @@
+---
+title: "Kafka consumer issues"
+linkTitle: "Kafka consumer issues"
+weight: 90
+description: >
+ Diagnosing Kafka consumer thread starvation and rebalance storms.
+keywords:
+ - clickhouse kafka
+ - max.poll.interval.ms
+ - kafka rebalance
+ - background_message_broker_schedule_pool_size
+---
+
+## Symptoms
+
+- `Maximum application poll interval (max.poll.interval.ms) exceeded`
+ errors.
+- Kafka consumers getting kicked and rejoining frequently.
+- Drip-fire pattern: 1–10 kicks per minute, sustained.
+
+## Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q44](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/#q44-kafka-consumer-count-vs-pool-size) ⭐ | `consumers > mb_pool_size` confirms starvation. |
+| 2 | [Q45](/altinity-kb-diagnostics-runbook/query-library/dictionaries-and-kafka/#q45-kafka-consumer-error-inspection) | Per-consumer error inspection. |
+| 3 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Is `BackgroundMessageBrokerSchedulePoolTask` pinned at pool size? |
+
+## Resolution
+
+Raise `background_message_broker_schedule_pool_size` to at least
+`consumers * 1.25`. Requires a server restart — the setting is
+server-level, not user-level.
+
+If the consumer count itself is excessive, also review whether
+`kafka_num_consumers` per table is over-provisioned. Each
+`Kafka` table contributes consumers based on this setting; multiplying
+across many tables explodes the total quickly.
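+
+A cut-down sketch of the Q44 comparison; `system.kafka_consumers` and
+`system.server_settings` exist only on recent releases, so fall back to Q44
+if either table is missing:
+
+```sql
+-- Total Kafka consumers per host.
+SELECT hostName() AS host, count() AS consumers
+FROM clusterAllReplicas('{cluster}', system.kafka_consumers)
+GROUP BY host;
+
+-- Configured broker-schedule pool size per host.
+SELECT hostName() AS host, value AS mb_pool_size
+FROM clusterAllReplicas('{cluster}', system.server_settings)
+WHERE name = 'background_message_broker_schedule_pool_size';
+```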
+
+Related setup guidance:
+
+- [background_message_broker_schedule_pool_size](/altinity-kb-integrations/altinity-kb-kafka/04-operations-troubleshooting/background_message_broker_schedule_pool_size/)
+- [Kafka parallel consuming](/altinity-kb-integrations/altinity-kb-kafka/02-consumption-patterns/altinity-kb-kafka-parallel-consuming/)
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure.md
new file mode 100644
index 0000000000..1716f9bfb4
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure.md
@@ -0,0 +1,101 @@
+---
+title: "Memory and disk pressure"
+linkTitle: "Memory and disk pressure"
+weight: 50
+description: >
+ Diagnosing OOM, `MEMORY_LIMIT_EXCEEDED`, `NOT_ENOUGH_SPACE`, and
+ cluster-wide memory pressure that aggravates other failures.
+keywords:
+ - clickhouse OOM
+ - MEMORY_LIMIT_EXCEEDED
+ - NOT_ENOUGH_SPACE
+ - cgroup memory
+---
+
+Three closely related modes: per-query OOM, disk-full conditions blocking
+merges, and the cluster-wide memory pressure that turns a marginal
+workload into one that fails.
+
+## OOM / memory pressure
+
+### Symptoms
+
+- Code 241 (`MEMORY_LIMIT_EXCEEDED`).
+- `OvercommitTracker` killing queries.
+- ClickHouse pod restarts / OOMKilled.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q15](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q15-memory-pressure) | `MemoryResident` vs `CGroupMemoryTotal` — actual headroom. |
+| 2 | [Q17](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q17-active-queries-right-now) | Active queries — large aggregations holding GB of memory. |
+| 3 | [Q18](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q18-recent-oom--exception-queries) | Recent OOM patterns — same query? Same user? Same time? |
+| 4 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Active merges — large merges hold memory too. |
+| 5 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | High parts count → metadata overhead in RAM. |
+| 6 | [Q16](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q16-query-load-per-host-last-30-minutes) | Too many concurrent queries? |
+| 7 | [Q14](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q14-pool-sizes-server-settings) + [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Background pools consuming memory unnecessarily? |
+
+### Common root causes
+
+- Cluster genuinely undersized for the workload.
+- A query with no `max_memory_usage` cap running a large `GROUP BY` with no
+  `max_bytes_before_external_group_by` spill threshold (see the sketch after
+  this list).
+- Many parts → metadata pressure.
+- Concurrent large merges of wide parts.
+- Async insert buffers oversized.
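+
+For the unbounded `GROUP BY` case, a hedged sketch of the per-query
+guardrails; the values are illustrative and should be sized to your nodes:
+
+```sql
+SET max_memory_usage = 10000000000,                   -- ~10 GiB hard cap per query
+    max_bytes_before_external_group_by = 5000000000;  -- spill aggregation state to disk past ~5 GiB
+```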
+
+See [Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/)
+for per-subsystem RAM attribution.
+
+## Disk full / NOT_ENOUGH_SPACE
+
+### Symptoms
+
+- Merges failing with "Not enough space" in `last_exception`.
+- Insert errors.
+- One disk in the storage policy full.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) ⭐ | Disk usage — which disk on which host. |
+| 2 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | Largest tables by size — cleanup candidates. |
+| 3 | [Q12](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q12-ttl-move--mutation-activity) | TTL move activity — are parts moving to cold tier? |
+| 4 | [Q19](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q19-stuck-mutations) | Stuck mutations adding to disk usage (mutations rewrite parts). |
+| 5 | [Q1](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q1-replication-queue-overview) | Queue entries with disk-related exceptions. |
+
+### Common root causes
+
+- TTL move not configured, or the cold-tier disk policy failing (S3
+ credentials, network).
+- Backup volumes filling local disk.
+- Detached parts not cleaned up.
+- A single huge partition.
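+
+For the detached-parts case, a minimal count per table; newer versions also
+expose `bytes_on_disk` here if you want sizes:
+
+```sql
+SELECT
+    hostName() AS host,
+    database, table,
+    count() AS detached_parts
+FROM clusterAllReplicas('{cluster}', system.detached_parts)
+GROUP BY host, database, table
+ORDER BY detached_parts DESC;
+```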
+
+## Cluster-wide memory pressure as an aggravator
+
+### Symptoms
+
+- No single host is OOM, but every host shows `CGroupMemoryUsed > 90%` of
+ `CGroupMemoryTotal`.
+- Slow inserts, slow merges, page-cache thrashing — and the failures move
+ around the cluster rather than concentrating on one host.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q54](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q54-memory-pressure-per-host-compact) ⭐ | Confirm pressure is cluster-wide, not concentrated. |
+| 2 | [Q15](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q15-memory-pressure) | Full memory breakdown. |
+| 3 | [Q9](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q9-part-creation-vs-merge-rate-last-30-minutes) | Whether merge throughput is degraded (a sign of pressure). |
+| 4 | `system.asynchronous_metrics.MemoryCacheFiles` (if available) | Page-cache size proxy. |
+
+### Resolution path
+
+With sustained 95%+ utilisation, large MV processing or merge bursts will
+stall under pressure. Workload-level tuning helps marginally; the real
+fix is more RAM per node or reducing the workload (fewer MVs, smaller
+batches, less concurrent work). Tighten `max_memory_usage` per query as a
+guard.
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues.md
new file mode 100644
index 0000000000..4e2f6e9efa
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/merge-fetch-and-pool-issues.md
@@ -0,0 +1,98 @@
+---
+title: "Merge–fetch and pool issues"
+linkTitle: "Merge–fetch and pool issues"
+weight: 20
+description: >
+ Diagnosing replication queues that stop draining, including merge–fetch
+ cycles and fetch-pool deadlocks where slots are claimed but no transfers
+ happen.
+keywords:
+ - replication queue
+ - postpone_reason
+ - replicated_fetches
+ - background_fetches_pool_size
+---
+
+Two distinct failure modes share these symptoms but need different fixes.
+The first is a merge↔fetch cycle (work blocked behind itself). The second
+is a fetch-pool deadlock where the pool counter is pinned but
+`replicated_fetches` is near-empty — typically Keeper saturation under a
+fragmentation-driven coordination load.
+
+## Merge↔fetch cycle / merge stall
+
+### Symptoms
+
+- Replication queue not draining even with ingestion stopped.
+- `merges_in_queue` high, but few active merges.
+- Reports of "merges waiting for fetches, fetches waiting for merges".
+- Parts count climbing despite no or low writes.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q5](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q5-replication-summary-per-host) | Which hosts have readonly tables, max lag, largest queues. |
+| 2 | [Q4](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q4-replica-status--lag-and-readonly-per-host) | Specific tables — is one replica lagging while others are fine? |
+| 3 | [Q2](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q2-replication-queue--postpone-reasons) ⭐ | `postpone_reason` text — look for "source parts size … greater than current maximum", "another log entry for same part is being processed", "covering parts list". |
+| 4 | [Q3](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q3-queue-entry-type-breakdown) | Entry type breakdown — `GET_PART` (fetches) vs `MERGE_PARTS` ratio. |
+| 5 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Are `BackgroundFetchesPoolTask` and `BackgroundMergesAndMutationsPoolTask` pinned at their pool size? |
+| 6 | [Q14](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q14-pool-sizes-server-settings) | Confirm configured pool sizes — has the cluster been pre-tuned? |
+| 7 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Are merges making progress, or stuck for hours on huge parts? |
+| 8 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) | Disk full ruled out? `NOT_ENOUGH_SPACE` looks like a merge stall but is separate. |
+
+### Common root causes
+
+- Pool sizes too small for the workload (especially
+ `background_fetches_pool_size`).
+- Wide imbalance — one replica not serving fetches (S3, network, or
+ credentials) so peers cannot pull.
+- Disk full on one node blocks merges, cascading into a fetch backlog on
+ peers.
+- Merge throughput collapsed because of 100+ GiB merges on slow storage.
+
+## Distributed fetch deadlock (pool pinned, no transfers)
+
+### Symptoms
+
+- `BackgroundFetchesPoolTask` at pool size on all hosts.
+- Replication queue is 99%+ `GET_PART` (not `MERGE_PARTS`).
+- Queue does not drain even with ingestion stopped.
+- [Q31](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q31-replicated-fetches-in-flight)
+ returns very few rows compared to the claimed pool slots.
+- `postpone_reason` mentions *"Not executing fetch of part X because N
+ fetches already executing, max N"*.
+
+This is **not** a merge↔fetch cycle. Pool slots are claimed but transfers
+aren't happening.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q5](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q5-replication-summary-per-host) | Queue depth per host — usually concentrated on a subset. |
+| 2 | [Q2](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q2-replication-queue--postpone-reasons) | `postpone_reason` mentioning "fetches already executing, max". |
+| 3 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | `BackgroundFetchesPoolTask = pool_size` on all hosts but `ReplicatedFetch` near zero. |
+| 4 | [Q31](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q31-replicated-fetches-in-flight) ⭐ | Actual fetches transferring — should be hundreds, will be single digits. |
+| 5 | [Q33](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q33-keeper-wait-time-and-activity-cumulative) | `ZooKeeperWaitMicroseconds` extremely high → Keeper saturation. |
+| 6 | [Q42](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q42-partition-count-health) | Find the table with massive part count driving Keeper load. |
+| 7 | [Q34](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q34-active-insert-sources-by-user) | Confirm ingestion is actually stopped. |
+
+### Common root cause
+
+Part fragmentation on one or more high-volume tables saturates Keeper
+coordinating replication. Fetch tasks block waiting on Keeper responses;
+the pool fills with waiting tasks while no transfers happen.
+
+### Resolution path
+
+1. Stop ingestion to the offending table.
+2. Wait for merges to reduce part count (hours, not minutes).
+3. Once parts collapse, Keeper pressure drops, fetches resume, queue
+ drains.
+4. Before resuming ingestion, fix the insert pattern — async inserts,
+ larger batches, less granular partitioning.
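+
+While waiting on step 2, a minimal progress check for the offending table (a
+narrower view of what Q6 and Q42 report):
+
+```sql
+SELECT
+    hostName() AS host,
+    count() AS active_parts
+FROM clusterAllReplicas('{cluster}', system.parts)
+WHERE database = '{database}' AND table = '{table}' AND active
+GROUP BY host
+ORDER BY active_parts DESC;
+```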
+
+**Do not** raise `background_fetches_pool_size`. The pool is not the
+bottleneck — it's saturated by tasks waiting on Keeper, not by genuine
+work. Adding pool slots adds more waiters.
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/replica-readonly.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/replica-readonly.md
new file mode 100644
index 0000000000..0aa865b894
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/replica-readonly.md
@@ -0,0 +1,48 @@
+---
+title: "Replica readonly / high lag"
+linkTitle: "Replica readonly"
+weight: 40
+description: >
+ Diagnosing replicas stuck in readonly mode or with growing absolute_delay.
+keywords:
+ - clickhouse readonly replica
+ - absolute_delay
+ - clickhouse keeper session
+---
+
+## Symptoms
+
+- One or more replicas in readonly mode.
+- `absolute_delay` increasing on specific replicas.
+- Failover not behaving as expected.
+
+## Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q4](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q4-replica-status--lag-and-readonly-per-host) ⭐ | Which replicas are readonly, which tables, lag in seconds. |
+| 2 | [Q5](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q5-replication-summary-per-host) | Is this isolated or cluster-wide? |
+| 3 | [Q29](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q29-keeper-connection-status) | Keeper/ZK connection — readonly is often a Keeper-session issue. |
+| 4 | [Q30](/altinity-kb-diagnostics-runbook/query-library/keeper-and-coordination/#q30-keeper-errors-last-hour) | Recent Keeper exceptions. |
+| 5 | [Q1](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q1-replication-queue-overview) | Queue depth on the affected replica — accumulating or stuck? |
+| 6 | [Q2](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q2-replication-queue--postpone-reasons) | If queue is stuck — `postpone_reason` and `last_exception`. |
+| 7 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) | Disk space on affected replica (full disk → readonly). |
+| 8 | [Q18](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q18-recent-oom--exception-queries) | Recent exceptions on that host. |
+
+## Common root causes
+
+- Keeper session lost or Keeper unreachable.
+- Disk full.
+- Metadata mismatch with Keeper (e.g., after a restore from backup).
+- Manual `SYSTEM RESTART REPLICA` needed after a transient Keeper issue.
+
+## Resolution path
+
+- Confirm Keeper connectivity is healthy first (Q29 + Q30); once Keeper is
+  reachable again, the replica self-recovers in most cases.
+- If disk is full, free space first — the replica may auto-recover.
+- If metadata is mismatched, `SYSTEM RESTART REPLICA {database}.{table}`
+  (shown below) reinitialises the replica's view of the ZooKeeper state.
+- For persistent failures, see
+ [DDLWorker and DDL queue problems](/altinity-kb-setup-and-maintenance/altinity-kb-ddlworker/)
+ for related cluster-coordination diagnostics.
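+
+The restart command, with the runbook's usual placeholders; run it on the
+affected replica only:
+
+```sql
+-- Reinitialises the table's ZooKeeper session state on this replica.
+SYSTEM RESTART REPLICA {database}.{table};
+```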
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/slow-queries.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/slow-queries.md
new file mode 100644
index 0000000000..8ca67f6f9a
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/slow-queries.md
@@ -0,0 +1,32 @@
+---
+title: "Slow queries / high query load"
+linkTitle: "Slow queries"
+weight: 80
+description: >
+ Diagnosing query timeouts and dashboard latency complaints.
+keywords:
+ - clickhouse slow query
+ - dashboard timeout
+ - query load
+---
+
+## Symptoms
+
+- Query timeouts reported by clients.
+- Dashboards slow.
+
+## Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q17](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q17-active-queries-right-now) | What's running right now — how long, how much memory. |
+| 2 | [Q16](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q16-query-load-per-host-last-30-minutes) | Query mix in the last 30 minutes — error rate and average duration by `query_kind`. |
+| 3 | [Q18](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q18-recent-oom--exception-queries) | Recent exceptions. |
+| 4 | [Q4](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q4-replica-status--lag-and-readonly-per-host) | Are reads hitting a lagging or readonly replica? |
+| 5 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Background pool stealing CPU/IO from queries? |
+| 6 | [Q15](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q15-memory-pressure) | Memory pressure forcing spill or kills? |
+| 7 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | Are scanned tables fragmented (many small parts)? |
+
+For deeper per-query investigation, see
+[Who ate my CPU?](/altinity-kb-setup-and-maintenance/who-ate-my-cpu/) and
+[Who ate my memory?](/altinity-kb-setup-and-maintenance/altinity-kb-who-ate-my-memory/).
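+
+If you need the individual heaviest statements rather than the aggregate view
+the library queries give you, a minimal `query_log` sketch over the last 30
+minutes:
+
+```sql
+SELECT
+    hostName() AS host,
+    user,
+    query_duration_ms,
+    formatReadableSize(memory_usage) AS peak_memory,
+    formatReadableSize(read_bytes)   AS read,
+    substring(query, 1, 120)         AS query_head
+FROM clusterAllReplicas('{cluster}', system.query_log)
+WHERE type = 'QueryFinish'
+  AND event_time > now() - INTERVAL 30 MINUTE
+ORDER BY query_duration_ms DESC
+LIMIT 20;
+```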
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/stuck-mutations.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/stuck-mutations.md
new file mode 100644
index 0000000000..fa80cd09f5
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/stuck-mutations.md
@@ -0,0 +1,46 @@
+---
+title: "Stuck mutations"
+linkTitle: "Stuck mutations"
+weight: 60
+description: >
+ Diagnosing `ALTER TABLE … UPDATE/DELETE` mutations that won't complete.
+keywords:
+ - clickhouse mutations
+ - alter update
+ - alter delete
+ - is_done
+---
+
+## Symptoms
+
+- `ALTER TABLE … UPDATE / DELETE` not completing.
+- `system.mutations.is_done = 0` for hours.
+
+## Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q19](/altinity-kb-diagnostics-runbook/query-library/queries-and-mutations/#q19-stuck-mutations) ⭐ | All stuck mutations with `latest_fail_reason`. |
+| 2 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Active merges (mutations share the merge pool). |
+| 3 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Pool saturation — mutations queued behind merges. |
+| 4 | [Q1](/altinity-kb-diagnostics-runbook/query-library/replication-and-queue/#q1-replication-queue-overview) | Queue entries — `MUTATE_PART` types. |
+| 5 | [Q11](/altinity-kb-diagnostics-runbook/query-library/disk-and-storage/#q11-disk-usage-per-host) | Disk space (mutations rewrite parts → need ~2× space). |
+
+## Common root causes
+
+- Insufficient merge-pool slots.
+- Mutation references a column that no longer exists (look at
+ `latest_fail_reason` — the error is explicit).
+- Disk space insufficient for the rewrite.
+- Mutation blocked behind a merge of the same part.
+
+## Resolution
+
+- For pool-bound stalls, raising the merge pool size (Q14) restores
+ progress; review whether the workload genuinely needs that much
+ concurrent mutation.
+- A mutation whose `latest_fail_reason` is a missing column is fatal —
+ `KILL MUTATION WHERE …` is the only path forward.
+- For disk-bound stalls, free space (see
+ [Memory and disk pressure](/altinity-kb-diagnostics-runbook/scenarios/memory-and-disk-pressure/))
+ before retrying.
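+
+For the fatal-mutation case, a hedged sketch; the `mutation_id` below is
+illustrative, take the real value from Q19:
+
+```sql
+-- Terminates a mutation that can never succeed (e.g. it references a dropped column).
+KILL MUTATION
+WHERE database = '{database}' AND table = '{table}'
+  AND mutation_id = '0000000042';  -- illustrative id, copy the real one from Q19
+```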
diff --git a/content/en/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure.md b/content/en/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure.md
new file mode 100644
index 0000000000..e73292196b
--- /dev/null
+++ b/content/en/altinity-kb-diagnostics-runbook/scenarios/too-many-parts-and-backpressure.md
@@ -0,0 +1,92 @@
+---
+title: "Too many parts and backpressure"
+linkTitle: "Too many parts and backpressure"
+weight: 30
+description: >
+ Diagnosing `TOO_MANY_PARTS` (code 252), insert delays, and the
+ sustained insert pressure that causes cascading issues.
+keywords:
+ - TOO_MANY_PARTS
+ - parts_to_delay_insert
+ - parts_to_throw_insert
+ - clickhouse backpressure
+---
+
+Three related failure modes appear here: hard `TOO_MANY_PARTS` rejections,
+soft "Delaying inserts by N ms" warnings, and the sustained high insert
+rate that causes multiple symptoms at once.
+
+## TOO_MANY_PARTS / part explosion
+
+### Symptoms
+
+- Inserts failing with code 252 (`TOO_MANY_PARTS`).
+- Or inserts delayed with "Delaying inserts by N ms" warnings in the log.
+- Parts count per partition exceeds ~300.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q6](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q6-parts-health-per-host) | Tables with highest active part count — single offender or cluster-wide? |
+| 2 | [Q7](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q7-parts-count-per-partition) ⭐ | Parts per partition — `parts_to_delay_insert` is **per partition**, not per table. |
+| 3 | [Q9](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q9-part-creation-vs-merge-rate-last-30-minutes) | New parts vs merged parts in the last 30 minutes — is merge throughput below insert rate? |
+| 4 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Are merges actually running, or queued and idle? |
+| 5 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Merge pool saturated? |
+| 6 | [Q10](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q10-merge-settings-check) | Confirm `parts_to_delay_insert` / `parts_to_throw_insert` thresholds. |
+| 7 | [Q22](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q22-async-insert-impact-aggregation), [Q23](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q23-async-insert-flush-latency-by-tablestatus) | If async inserts are in use — are flushes producing many small parts? |
+
+### Common root causes
+
+- Insert batch size too small (sync inserts without client-side batching —
+ one part per insert).
+- Async inserts not enabled, or buffer thresholds too small.
+- Partitioning too granular (e.g., per-hour partitioning on a dataset that
+ could be per-day).
+- Merge pool too small for the insert rate.
+- Excessive `Nullable` columns slowing merges.
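+
+For the batching causes above, a hedged sketch of enabling async inserts on
+the insert itself (the same settings can live in the user profile; the values
+are illustrative):
+
+```sql
+INSERT INTO {database}.{table}
+SETTINGS
+    async_insert = 1,                       -- buffer server-side instead of one part per INSERT
+    wait_for_async_insert = 1,              -- client returns only after the buffer is flushed
+    async_insert_max_data_size = 10485760,  -- flush at ~10 MiB buffered ...
+    async_insert_busy_timeout_ms = 1000     -- ... or after 1 s, whichever comes first
+VALUES (...);                               -- your rows here
+```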
+
+## Insert backpressure ("delayed inserts")
+
+### Symptoms
+
+- Inserts not failing, just very slow.
+- Server logs show "Delaying inserts by N ms".
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q7](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q7-parts-count-per-partition) ⭐ | Partition with > `parts_to_delay_insert` (default 150) parts. |
+| 2 | [Q10](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q10-merge-settings-check) | Confirm threshold values. |
+| 3 | [Q8](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q8-active-merges) | Are merges keeping up? |
+| 4 | [Q9](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q9-part-creation-vs-merge-rate-last-30-minutes) | New-parts vs merged-parts ratio. |
+| 5 | [Q13](/altinity-kb-diagnostics-runbook/query-library/pools-and-resources/#q13-pool-saturation-metrics) | Merge pool capacity. |
+
+## Sustained high insert rate causing cascading issues
+
+### Symptoms
+
+- Multiple symptoms at once: timeouts, Kafka kicks, part growth.
+- "The same issues come back after fixing X."
+- No single clear root cause.
+
+### Diagnostic flow
+
+| Step | Query | What to look for |
+|---|---|---|
+| 1 | [Q37](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q37-insert-rate-per-minute-spike-detection) | Insert rate per minute — sustained or spike? |
+| 2 | [Q36](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q36-insert-volume-by-target-table-last-24-hours) | Insert volume by target table — biggest contributors. |
+| 3 | [Q35](/altinity-kb-diagnostics-runbook/query-library/insert-load-and-host-skew/#q35-insert-volume-by-user-last-24-hours) | Insert volume by user — which clients. |
+| 4 | [Q38](/altinity-kb-diagnostics-runbook/query-library/async-inserts/#q38-async-insert-timeout-failures-by-table) | Currently failing tables. |
+| 5 | [Q42](/altinity-kb-diagnostics-runbook/query-library/parts-and-merges/#q42-partition-count-health) | Part fragmentation per table. |
+| 6 | Q39 (in MV chain — see `system.tables`) | MV chains on the failing tables. |
+
+### How to read the result
+
+- **Few inserts/minute with huge row counts** → bulk loads; MV chain
+ bottleneck is the likely cause.
+- **Many inserts/minute with small row counts** → batch size problem;
+ fix at the producer or via async insert configuration.
+- **Spike pattern** → identify the specific user or process responsible.
+- **Flat pattern** → baseline load multiplied by a config issue.