mbackschat/discourse-explorer

Discourse Explorer

Scrape, analyze, and explore discussions from any Discourse forum.

What it does

  1. Scrape — Download all topics, posts, and metadata as structured JSON. Supports delta sync.
  2. Analyze — DuckDB analytics (tag distribution, top contributors, activity trends, keyword search, SQL REPL).
  3. Discover — Derive an entity-type vocabulary tailored to your forum by sampling topics with an LLM. Drives extraction quality in step 4.
  4. Query — Ask natural-language questions using a local GraphRAG knowledge graph (LightRAG + OpenAI/Ollama).
  5. Visualize — Interactive HTML graph explorer: entities, relationships, communities.

Everything runs locally. The only network access is the initial scrape and, if you opt for OpenAI, the LLM calls.

[Screenshot: Discourse Explorer graph view, with nodes colored by entity type, a filter sidebar on the left, and a per-node detail panel on the right]

Setup

Requires Python ≥ 3.10 and uv. For GraphRAG features, also install Ollama or bring an OpenAI key.

uv sync

Try the demo (no Discourse, no LLM)

A 588 KB committed fixture under sample/fixtures/seed42-tiny/ carries a full deterministic forum (33 topics / 116 posts) plus the minimum GraphRAG artefacts the offline tools need. Try the analyzer + visualizer end-to-end without scraping anything:

uv run discourse-explorer stats --path sample/fixtures/seed42-tiny categories
uv run discourse-explorer visualize sample/fixtures/seed42-tiny --open

The fixture comes from the synthetic-forum seeder under sample/ — see sample/README.md for the Docker-stack path that lets you regenerate it locally and test the live init / extend paths against a real Discourse instance.

Configuration in two tiers

A single checkout supports multiple forums: the project root holds a one-line selector, and each forum has its own config directory.

# 1. Selector at project root (one line, points at whichever forum is "active")
echo 'DISCOURSE_DATA_DIR=./data/my-forum' > .env

# 2. Per-forum config (URL, auth, models, gleaning — all env vars for this corpus)
mkdir -p ./data/my-forum/config
cp discourse_explorer/config/env.example ./data/my-forum/config/.env
# edit ./data/my-forum/config/.env

Priority when both dotenv files set the same key: data-dir wins. Shell exports override both. CLI flags override everything.

Full env-var reference and layering rules: docs/analysis/vocabulary-and-config.md.
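The layering rule above can be illustrated with a small sketch. This is not the project's actual loader: `parse_dotenv` and `resolve_config` are hypothetical helper names, and the parser handles only plain KEY=VALUE lines. It shows one way to get "data-dir beats root, shell beats both, CLI flags beat everything" using the standard library.

```python
from collections import ChainMap


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_config(
    root_env: dict[str, str],
    data_dir_env: dict[str, str],
    shell_env: dict[str, str],
    cli_flags: dict[str, str],
) -> dict[str, str]:
    """Merge the four layers (hypothetical helper, not the project's API).

    ChainMap looks keys up left to right, so the highest-priority
    layer is listed first: CLI flags, shell exports, data-dir dotenv,
    project-root dotenv.
    """
    return dict(ChainMap(cli_flags, shell_env, data_dir_env, root_env))


# Both dotenv files set DISCOURSE_URL; the data-dir file should win.
root = parse_dotenv(
    "DISCOURSE_DATA_DIR=./data/my-forum\n"
    "DISCOURSE_URL=https://root.example.com\n"
)
forum = parse_dotenv("DISCOURSE_URL=https://discourse.example.com\n")

config = resolve_config(root, forum, shell_env={}, cli_flags={})
print(config["DISCOURSE_URL"])  # https://discourse.example.com
```

Keys set in only one layer, such as DISCOURSE_DATA_DIR in the root file, pass through unchanged.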

Authentication

Edit <data-dir>/config/.env and pick one:

Method | Env vars | Notes
API key (preferred) | DISCOURSE_API_KEY + DISCOURSE_API_USERNAME | Generate at Discourse Admin → API → New API Key.
Session cookie (fallback) | DISCOURSE_COOKIE | F12 → Cookies → copy _t value. Expires in a few weeks.
OIDC / Keycloak | DISCOURSE_USERNAME + DISCOURSE_PASSWORD | Automated SSO. May not work with all setups.

Priority at runtime: API key ≻ cookie ≻ OIDC.
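As a rough illustration of that priority order, the sketch below picks a method from a dict of env vars. `pick_auth` is a hypothetical function, not the scraper's real code. The `Api-Key` / `Api-Username` header names are the ones Discourse documents for API requests, and the cookie branch assumes DISCOURSE_COOKIE holds the raw `_t` value. The OIDC branch only reports the choice, since the login flow itself depends on the SSO setup.

```python
def pick_auth(env: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Return (method, request headers) following API key > cookie > OIDC.

    Hypothetical helper for illustration; empty values count as unset.
    """
    api_key = env.get("DISCOURSE_API_KEY")
    api_user = env.get("DISCOURSE_API_USERNAME")
    if api_key and api_user:
        # Header-based API authentication as documented by Discourse.
        return "api_key", {"Api-Key": api_key, "Api-Username": api_user}

    cookie = env.get("DISCOURSE_COOKIE")
    if cookie:
        # Assumes the variable stores the value of the _t session cookie.
        return "cookie", {"Cookie": f"_t={cookie}"}

    if env.get("DISCOURSE_USERNAME") and env.get("DISCOURSE_PASSWORD"):
        # The SSO login would run here; no static headers are known up front.
        return "oidc", {}

    raise ValueError("no Discourse credentials configured")


# A cookie is present, but the API key pair takes precedence.
method, headers = pick_auth({
    "DISCOURSE_API_KEY": "abc123",
    "DISCOURSE_API_USERNAME": "system",
    "DISCOURSE_COOKIE": "session-token",
})
print(method)  # api_key
```

Requiring both halves of the API key pair before choosing that method means a half-filled config falls through to the cookie or OIDC credentials instead of sending an incomplete request.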

Also set DISCOURSE_URL=https://discourse.example.com in the same file so the scraper can run without the URL being passed as a flag.

Tools at a glance

Tool | Purpose | Reference
scrape | Download topics + posts + metadata; delta sync | Manual §1
stats | DuckDB analytics + SQL REPL | Manual §2
discover-types | Distill an entity-type vocabulary from sampled topics | Manual §3 — Discover
query | Build the knowledge graph (--index) and ask questions | Manual §3 — Build · Ask
visualize | Render the interactive HTML graph explorer | Manual §4
Claude Code skills | Slash commands for end-to-end workflows | Manual — Guided workflows

Documentation

  • docs/MANUAL.md — per-tool usage reference: CLI flags, env vars, examples, the end-to-end workflow.
  • CLAUDE.md — maintainer-facing map of the codebase and invariants.
  • docs/analysis/ — deep-dives on indexing, canonicalization, visualization, configuration.
  • docs/lightrag/ — read before editing query.py or discover_types.py.
  • docs/discourse/ — Discourse JSON shape + terminology.
  • docs/ideas/ — forward-looking proposals.
