A small internal tool that turns PDF codebooks into structured, machine-readable data.
Research datasets often come with PDF codebooks that describe what variables are available and when. These PDFs are great for humans to read, but terrible for computers. This tool extracts the tables from those PDFs and turns them into structured data (CSVW format) that can be used in data catalogs, APIs, or pipelines.
- Upload a PDF — Drag and drop or click to select your codebook PDF
- Run extraction — The tool reads the PDF, finds tables, and uses an AI model to understand the structure
- Explore results — Browse the extracted data, see where it came from in the PDF, and download as CSVW
pip install -r requirements.txt
python app.py

Opens automatically at http://localhost:8000.
- Upload area (drag & drop or click)
- Page navigation (previous / next)
- Re-upload button (↻ top-right) — upload a different PDF without restarting
- Settings drawer (⚙ top-left) — hover to open
After extraction, you can switch between 4 views:
| View | What you see |
|---|---|
| Variable Cards (default) | One card per row, with tags showing which columns it's available in. Click a card to jump to its location in the PDF. |
| Table View | A scrollable matrix showing all rows × columns with ✓ or — for availability. Good for spotting patterns. |
| CSVW Metadata | The technical schema (column names, data types) plus a preview of the first 5 data rows. |
| Raw JSON | The full extraction output as pretty-printed JSON. Useful for debugging. |
Hover over any variable card to temporarily highlight its location in the PDF. Click to keep the highlight.
Open the settings drawer (⚙) to configure:
| Setting | What it does |
|---|---|
| Provider | Which AI model to use: Claude, Mistral, Kimi, or local HuggingFace |
| API Key | Your key for the chosen provider (not needed for HuggingFace) |
| Model | Specific model version (e.g. claude-sonnet-4, mistral-small) |
| Device | For HuggingFace only: CPU, CUDA (NVIDIA GPU), or MPS (Apple Silicon) |
| PDF render scale | How sharp the PDF looks (1×–3×). Higher = clearer but slower. |
| Detect green checkmark icons | Off by default. Turn on if your PDF uses green ✅ symbols instead of text "X" marks. This scans each table cell for green pixels, which is slower but catches visual-only indicators. |
Settings are saved in your browser (localStorage).
| Provider | Needs API key? | Speed | Quality | Notes |
|---|---|---|---|---|
| Claude (Anthropic) | Yes (paid) | Fast | Not validated | |
| Mistral | Yes (free tier) | Fast | Not validated | Recommended, as the API has a free tier. |
| Kimi (Moonshot) | Yes (paid) | Fast | Not validated | Chinese provider. |
| HuggingFace | No | Slow (depends on hardware) | Not validated | Runs entirely on your machine. No data leaves your computer. Requires more setup (PyTorch, etc.). |
The tool is designed for tabular codebooks — PDFs with tables that show which variables are available for which conditions or timepoints.
Works well:
- Tables with text marks ("X", "yes", numbers)
- Tables with green checkmark icons (if icon detection is enabled)
- Multi-page tables
- Mixed layouts (some tables, some text)
Does not work:
- Scanned images (these need OCR, which is not enabled)
- Free-form text without tables
- Highly complex nested tables
Each extracted table becomes a CSVW group containing:
- Metadata (`metadata.json`): Describes the table structure in a standard W3C format. Includes column definitions, data types, and descriptions.
- Data (`data.csv`): The actual rows, with one boolean column per codebook column showing whether that variable is available.
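
As a rough illustration of how the metadata and data files relate, here is a minimal sketch using only the Python standard library. It is not the code in `llm_extractor.py`: the `build_csvw` helper, the shape of `rows`, the `has_` prefix rule, and the sample variables are all assumptions made for this example.

```python
import csv
import io
import json


def build_csvw(title, csv_url, rows, availability_columns):
    """Return (metadata_dict, csv_text) for one extracted table.

    `rows` is a list of dicts with "variable_name", "variable_label" and an
    "available_in" set naming the codebook columns the variable appears in.
    This input shape is hypothetical, chosen only for the sketch.
    """
    columns = [
        {"name": "variable_name", "datatype": "string"},
        {"name": "variable_label", "datatype": "string"},
    ]
    # One boolean column per codebook column, prefixed with "has_".
    columns += [
        {"name": f"has_{col}", "datatype": "boolean"}
        for col in availability_columns
    ]
    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "dc:title": title,
        "tables": [{"url": csv_url, "tableSchema": {"columns": columns}}],
    }

    # Write the CSV into memory; the header row mirrors the schema order.
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([c["name"] for c in columns])
    for row in rows:
        flags = [
            "true" if col in row["available_in"] else "false"
            for col in availability_columns
        ]
        writer.writerow([row["variable_name"], row["variable_label"], *flags])
    return metadata, buffer.getvalue()


# Hypothetical sample variables, not taken from a real codebook.
rows = [
    {"variable_name": "age", "variable_label": "Age at diagnosis",
     "available_in": {"NHL", "HL"}},
    {"variable_name": "stage", "variable_label": "Disease stage",
     "available_in": {"HL"}},
]
metadata, csv_text = build_csvw(
    "Patient variables", "patient_variables.csv", rows, ["NHL", "HL"]
)
print(json.dumps(metadata, indent=2))
print(csv_text)
```

The CSV header is derived from the same column list as the schema, so the two files cannot drift apart in column order.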
Example:
{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:title": "Patient variables",
  "tables": [{
    "url": "patient_variables.csv",
    "tableSchema": {
      "columns": [
        {"name": "variable_name", "datatype": "string"},
        {"name": "variable_label", "datatype": "string"},
        {"name": "has_NHL", "datatype": "boolean"},
        {"name": "has_HL", "datatype": "boolean"}
      ]
    }
  }]
}

app.py                  # FastAPI backend (serves UI + handles extraction)
frontend/
  index.html            # UI markup
  style.css             # Dark theme styling
  app.js                # All interactivity (views, hover, navigation)
pipeline/
  pdf_processor.py      # PDF parsing: Docling (primary) + PyMuPDF (fallback)
  llm_extractor.py      # LLM calls, JSON parsing, CSVW conversion
| Problem | Likely cause | Fix |
|---|---|---|
| "Upload failed" | File too large or not a PDF | Check that the file is under 50MB and has a .pdf extension |
| "Enter an API key" | Using Claude/Mistral/Kimi without a key | Add a key in Settings, or switch to HuggingFace |
| Extraction hangs | LLM is slow to respond or the model returned an error | Click the stop button (■) and try a different model |
| Missing checkmarks | PDF uses visual icons, not text | Enable "Detect green checkmark icons" in Settings |
| Wrong table structure | Complex or unusual layout | Try a different model (Claude usually handles these best) |
| UI looks old / missing buttons | Browser cache | Hard-refresh: Ctrl+Shift+R (or Cmd+Shift+R on Mac) |
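
To check a file before uploading, the two conditions from the "Upload failed" row can be sketched as a small pre-flight function. This is a hypothetical helper, not the validation in `app.py`. The `check_upload` name and the reading of "50MB" as 50 × 1024 × 1024 bytes are assumptions.

```python
from pathlib import Path
from typing import Optional

# Assumed interpretation of the 50MB limit; the backend may count differently.
MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def check_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return a reason the upload would likely fail, or None if it looks fine."""
    # Extension check is case-insensitive, so "codebook.PDF" is accepted.
    if Path(filename).suffix.lower() != ".pdf":
        return "not a PDF (expected a .pdf extension)"
    if size_bytes >= MAX_UPLOAD_BYTES:
        return "file too large (limit is 50MB)"
    return None


print(check_upload("codebook.pdf", 2_000_000))
print(check_upload("codebook.docx", 2_000_000))
print(check_upload("codebook.PDF", 80 * 1024 * 1024))
```

For a file on disk, `Path(path).stat().st_size` gives the byte count to pass as `size_bytes`.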
- This is a proof of concept — not production software. Expect rough edges.
- No data is stored on disk or sent to external services (except the LLM provider you choose).
- The visual icon detection is heuristic-based (green pixel counting). It may miss some icons or produce false positives on green-themed PDFs.
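
To make the green pixel counting idea concrete, here is a minimal sketch in plain Python. It is not the tool's actual detector. The function names, the RGB thresholds, and the 5% cutoff are assumptions picked for illustration, and a cell is modeled as a flat list of `(r, g, b)` tuples.

```python
def is_green(pixel, min_green=120, margin=40):
    """True if the green channel is bright and clearly dominates red and blue."""
    r, g, b = pixel
    return g >= min_green and g - r >= margin and g - b >= margin


def cell_has_green_icon(pixels, min_fraction=0.05):
    """Flag a cell when at least `min_fraction` of its pixels look green.

    `pixels` is a flat list of (r, g, b) tuples for one rendered table cell.
    """
    if not pixels:
        return False
    green = sum(1 for p in pixels if is_green(p))
    return green / len(pixels) >= min_fraction


white = (255, 255, 255)
check_green = (46, 160, 67)         # a saturated icon-like green
pale_green = (220, 240, 220)        # a light tinted background
theme_green = (0, 180, 0)           # a solid green cell fill

empty_cell = [white] * 100
icon_cell = [white] * 90 + [check_green] * 10
tinted_cell = [pale_green] * 100
themed_cell = [theme_green] * 100

print(cell_has_green_icon(empty_cell))   # no green pixels
print(cell_has_green_icon(icon_cell))    # 10% of pixels are icon green
print(cell_has_green_icon(tinted_cell))  # tint is below the dominance margin
print(cell_has_green_icon(themed_cell))  # solid green fill is flagged as well
```

The last case shows the false-positive risk: a cell filled with a saturated theme color passes the same test as a real checkmark, so a pure pixel count cannot tell them apart.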