Scrape, analyze, and explore discussions from any Discourse forum.
- Scrape — Download all topics, posts, and metadata as structured JSON. Supports delta sync.
- Analyze — DuckDB analytics (tag distribution, top contributors, activity trends, keyword search, SQL REPL).
- Discover — Derive an entity-type vocabulary tailored to your forum by sampling topics with an LLM. Drives extraction quality in step 4.
- Query — Ask natural-language questions using a local GraphRAG knowledge graph (LightRAG + OpenAI/Ollama).
- Visualize — Interactive HTML graph explorer: entities, relationships, communities.
Everything runs locally. The only network access is the initial scrape and, if you opt for OpenAI instead of Ollama, the LLM calls.
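To make the "delta sync" idea concrete, here is a minimal sketch of one common approach: compare each topic's `bumped_at` timestamp (a field Discourse includes in its topic listings) against the value recorded on the previous run, and re-download only the topics that are new or changed. This is an illustration under assumptions, not a description of how the scraper is actually implemented; the function name `topics_needing_sync` and the shape of `local_state` are hypothetical.

```python
def topics_needing_sync(remote_topics: list[dict], local_state: dict[int, str]) -> list[int]:
    """Return IDs of topics that are new or have changed since the last run.

    remote_topics: topic summaries from a fresh listing, each with "id" and "bumped_at".
    local_state:   hypothetical mapping of topic id -> "bumped_at" seen on the previous run.
    """
    stale = []
    for topic in remote_topics:
        # A missing entry means the topic is new; a different timestamp means it changed.
        if local_state.get(topic["id"]) != topic["bumped_at"]:
            stale.append(topic["id"])
    return stale


if __name__ == "__main__":
    remote = [
        {"id": 1, "bumped_at": "2024-01-02T00:00:00.000Z"},  # unchanged
        {"id": 2, "bumped_at": "2024-01-05T00:00:00.000Z"},  # changed since last run
        {"id": 3, "bumped_at": "2024-01-06T00:00:00.000Z"},  # new topic
    ]
    local = {
        1: "2024-01-02T00:00:00.000Z",
        2: "2024-01-03T00:00:00.000Z",
    }
    print(topics_needing_sync(remote, local))  # [2, 3]
```

Comparing timestamps for equality rather than ordering keeps the sketch independent of date parsing; a real implementation may track more state than this.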
Requires Python ≥ 3.10 and uv. For GraphRAG features, also install Ollama or bring an OpenAI key.
uv sync

A 588 KB committed fixture under sample/fixtures/seed42-tiny/ carries a full deterministic forum (33 topics / 116 posts) plus the minimum GraphRAG artefacts the offline tools need. Try the analyzer + visualizer end-to-end without scraping anything:
uv run discourse-explorer stats --path sample/fixtures/seed42-tiny categories
uv run discourse-explorer visualize sample/fixtures/seed42-tiny --open

The fixture comes from the synthetic-forum seeder under sample/; see sample/README.md for the Docker-stack path that lets you regenerate it locally and test the live init / extend paths against a real Discourse instance.
A single checkout supports multiple forums: the project root holds a one-line selector, and each forum has its own config directory.
# 1. Selector at project root (one line, points at whichever forum is "active")
echo 'DISCOURSE_DATA_DIR=./data/my-forum' > .env
# 2. Per-forum config (URL, auth, models, gleaning — all env vars for this corpus)
mkdir -p ./data/my-forum/config
cp discourse_explorer/config/env.example ./data/my-forum/config/.env
# edit ./data/my-forum/config/.env

Priority when both dotenv files set the same key: data-dir wins. Shell exports override both. CLI flags override everything.
Full env-var reference and layering rules: docs/analysis/vocabulary-and-config.md.
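The layering described above (project-root .env, then data-dir .env, then shell exports, then CLI flags) can be sketched as a simple ordered merge. This is a minimal illustration, not the project's actual loader: `parse_dotenv` and `resolve_config` are hypothetical helpers, the parser handles only plain `KEY=VALUE` lines, and `MODEL` is a made-up key used for the example.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes, if any (simplified; no escape handling).
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_config(root_env: dict, data_dir_env: dict, shell_env: dict, cli_flags: dict) -> dict:
    """Merge the four layers, lowest priority first, so later layers win."""
    merged: dict = {}
    for layer in (root_env, data_dir_env, shell_env, cli_flags):
        # Unset values (None) never override a lower layer.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


if __name__ == "__main__":
    root = parse_dotenv("DISCOURSE_DATA_DIR=./data/my-forum\nMODEL=a\n")
    data = parse_dotenv("# per-forum settings\nMODEL=b\n")
    print(resolve_config(root, data, {}, {})["MODEL"])                # b (data-dir wins)
    print(resolve_config(root, data, {"MODEL": "c"}, {})["MODEL"])    # c (shell export wins)
    print(resolve_config(root, data, {"MODEL": "c"}, {"MODEL": "d"})["MODEL"])  # d (CLI flag wins)
```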
Edit <data-dir>/config/.env and pick one:
| Method | Env vars | Notes |
|---|---|---|
| API key (preferred) | DISCOURSE_API_KEY + DISCOURSE_API_USERNAME | Generate at Discourse Admin → API → New API Key. |
| Session cookie (fallback) | DISCOURSE_COOKIE | F12 → Cookies → copy _t value. Expires in a few weeks. |
| OIDC / Keycloak | DISCOURSE_USERNAME + DISCOURSE_PASSWORD | Automated SSO. May not work with all setups. |
Priority at runtime: API key ≻ cookie ≻ OIDC.
Also set DISCOURSE_URL=https://discourse.example.com in the same file for unflagged scraper runs.
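The runtime priority (API key, then cookie, then OIDC) amounts to checking the env vars in a fixed order and taking the first complete set. The sketch below illustrates that selection; `pick_auth_method` and `auth_headers` are hypothetical names, not the tool's own functions. The `Api-Key` / `Api-Username` headers are the ones the Discourse API expects; sending `DISCOURSE_COOKIE` as the `_t` cookie is an assumption based on the table above.

```python
from typing import Optional


def pick_auth_method(env: dict[str, str]) -> Optional[str]:
    """Return the first fully configured method in priority order, or None."""
    if env.get("DISCOURSE_API_KEY") and env.get("DISCOURSE_API_USERNAME"):
        return "api_key"
    if env.get("DISCOURSE_COOKIE"):
        return "cookie"
    if env.get("DISCOURSE_USERNAME") and env.get("DISCOURSE_PASSWORD"):
        return "oidc"
    return None


def auth_headers(env: dict[str, str]) -> dict[str, str]:
    """Build request headers for the header-based methods.

    OIDC needs an interactive login flow rather than static headers,
    so it (like "no auth configured") yields an empty dict here.
    """
    method = pick_auth_method(env)
    if method == "api_key":
        return {
            "Api-Key": env["DISCOURSE_API_KEY"],
            "Api-Username": env["DISCOURSE_API_USERNAME"],
        }
    if method == "cookie":
        # Assumption: DISCOURSE_COOKIE holds the raw value of the _t cookie.
        return {"Cookie": "_t=" + env["DISCOURSE_COOKIE"]}
    return {}


if __name__ == "__main__":
    env = {
        "DISCOURSE_API_KEY": "example-key",
        "DISCOURSE_API_USERNAME": "system",
        "DISCOURSE_COOKIE": "example-cookie",
    }
    print(pick_auth_method(env))  # api_key (takes priority over the cookie)
```

Note that a half-configured method (for example, an API key without a username) is skipped rather than treated as an error in this sketch, so the next method in line is tried.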
| Tool | Purpose | Reference |
|---|---|---|
| `scrape` | Download topics + posts + metadata; delta sync | Manual §1 |
| `stats` | DuckDB analytics + SQL REPL | Manual §2 |
| `discover-types` | Distill an entity-type vocabulary from sampled topics | Manual §3: Discover |
| `query` | Build the knowledge graph (--index) and ask questions | Manual §3: Build · Ask |
| `visualize` | Render the interactive HTML graph explorer | Manual §4 |
| Claude Code skills | Slash commands for end-to-end workflows | Manual: Guided workflows |
- `docs/MANUAL.md`: per-tool usage reference (CLI flags, env vars, examples, the end-to-end workflow).
- `CLAUDE.md`: maintainer-facing map of the codebase and invariants.
- `docs/analysis/`: deep-dives on indexing, canonicalization, visualization, configuration.
- `docs/lightrag/`: read before editing `query.py` or `discover_types.py`.
- `docs/discourse/`: Discourse JSON shape + terminology.
- `docs/ideas/`: forward-looking proposals.
