Add diagnostics runbook section #185

Open: TheraniAA wants to merge 1 commit into main from ashwini-diagnostics-runbook

@TheraniAA (Contributor)

New top-level KB section at content/en/altinity-kb-diagnostics-runbook/ with 25 pages of cluster-wide diagnostic queries and scenario-based playbooks. Aimed at on-call engineers triaging running ClickHouse clusters.

Structure:

  • _index.md, quick-reference.md, investigation-methods.md (top level)
  • query-library/ — 54 queries grouped into 9 subsystem pages (replication, parts/merges, disk, pools, queries/mutations, async inserts, keeper, insert load + host skew, dictionaries/kafka)
  • scenarios/ — 11 scenario playbooks (general triage, merge/fetch, too many parts, replica readonly, memory/disk pressure, stuck mutations, async insert issues, slow queries, kafka, frozen historical tables, host-skewed failures)

All queries use clusterAllReplicas('{cluster}', ...) placeholders, so each one runs against every replica in the cluster. Queries are referenced by stable numeric IDs (Q1-Q54) for cross-linking. Scenarios and the query library are heavily cross-linked to support both top-down (start from a symptom) and bottom-up (grab a known query) workflows.
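
For readers unfamiliar with the pattern, here is a minimal sketch of what a query in this style might look like. It is illustrative only and is not one of Q1-Q54 from this PR. It assumes a `cluster` macro is defined on the server so that `'{cluster}'` expands to the cluster name, and the 60-second delay threshold is an arbitrary choice.

```sql
-- Illustrative sketch, not taken from the runbook.
-- Assumes the server defines a `cluster` macro that '{cluster}' expands to.
SELECT
    hostName() AS host,      -- which replica produced the row
    database,
    table,
    is_readonly,             -- 1 when the replica is in read-only mode
    absolute_delay,          -- replication lag in seconds
    queue_size               -- pending entries in the replication queue
FROM clusterAllReplicas('{cluster}', system.replicas)
WHERE is_readonly OR absolute_delay > 60   -- 60 s is an arbitrary threshold
ORDER BY absolute_delay DESC;
```

Including `hostName()` in the output is what makes a cluster-wide result usable for triage, since it shows which host each row came from.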

I have read the CLA Document and I hereby sign the CLA

TheraniAA requested a review from mkmkme on May 14, 2026 at 04:46.