Codebook Extractor

A small internal tool that turns PDF codebooks into structured, machine-readable data.

What is this for?

Research datasets often come with PDF codebooks that describe what variables are available and when. These PDFs are great for humans to read, but terrible for computers. This tool extracts the tables from those PDFs and turns them into structured data (CSVW format) that can be used in data catalogs, APIs, or pipelines.

How it works (3 steps)

Upload a PDF — Drag and drop or click to select your codebook PDF
Run extraction — The tool reads the PDF, finds tables, and uses an AI model to understand the structure
Explore results — Browse the extracted data, see where it came from in the PDF, and download as CSVW

Quick start

pip install -r requirements.txt
python app.py

The app opens automatically at http://localhost:8000.

The interface

Left side — PDF viewer

Upload area (drag & drop or click)
Page navigation (previous / next)
Re-upload button (↻ top-right) — upload a different PDF without restarting
Settings drawer (⚙ top-left) — hover to open

Right side — Results

After extraction, you can switch between 4 views:

View	What you see
Variable Cards (default)	One card per variable (one table row), with tags showing the columns in which it is available. Click a card to jump to its location in the PDF.
Table View	A scrollable matrix showing all rows × columns with ✓ or — for availability. Good for spotting patterns.
CSVW Metadata	The technical schema (column names, data types) plus a preview of the first 5 data rows.
Raw JSON	The full extraction output as pretty-printed JSON. Useful for debugging.

Hover to preview

Hover over any variable card to temporarily highlight its location in the PDF. Click to keep the highlight.

Settings

Open the settings drawer (⚙) to configure:

Setting	What it does
Provider	Which AI model to use: Claude, Mistral, Kimi, or local HuggingFace
API Key	Your key for the chosen provider (not needed for HuggingFace)
Model	Specific model version (e.g. claude-sonnet-4, mistral-small)
Device	For HuggingFace only: CPU, CUDA (NVIDIA GPU), or MPS (Apple Silicon)
PDF render scale	How sharp the PDF looks (1×–3×). Higher = clearer but slower.
Detect green checkmark icons	Off by default. Turn on if your PDF uses green ✅ symbols instead of text "X" marks. This scans each table cell for green pixels, which is slower but catches visual-only indicators.
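
The green-pixel scan described in the last row can be pictured with a small sketch. This is not the tool's actual code: the function names, the colour thresholds, and the 5% cutoff are all assumptions chosen for illustration, and a cell is modelled as a flat list of RGB tuples.

```python
from typing import Iterable, Tuple

RGB = Tuple[int, int, int]


def is_green(pixel: RGB, min_green: int = 120, margin: int = 40) -> bool:
    """True when the green channel is bright and clearly dominates red and blue.

    Both thresholds are illustrative assumptions, not values from the tool.
    """
    r, g, b = pixel
    return g >= min_green and g - r >= margin and g - b >= margin


def green_fraction(cell_pixels: Iterable[RGB]) -> float:
    """Share of pixels in one table cell that count as green."""
    pixels = list(cell_pixels)
    if not pixels:
        return 0.0
    return sum(is_green(p) for p in pixels) / len(pixels)


def cell_has_checkmark(cell_pixels: Iterable[RGB], min_fraction: float = 0.05) -> bool:
    """Treat a cell as checked when enough of its pixels are green."""
    return green_fraction(cell_pixels) >= min_fraction


# A mostly white cell containing a small green icon.
white, green = (255, 255, 255), (0, 180, 0)
print(cell_has_checkmark([white] * 90 + [green] * 10))
```

A cell with a solid green background would also pass this test, which is the kind of false positive to expect on green-themed PDFs.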

Settings are saved in your browser (localStorage).

Models

Provider	Needs API key?	Speed	Quality	Notes
Claude (Anthropic)	Yes (Paid)	Fast	Not validated
Mistral	Yes (Free)	Fast	Not validated	Recommended because the API has a free tier.
Kimi (Moonshot)	Yes (Paid)	Fast	Not validated	Chinese provider.
HuggingFace	No	Slow (Depends on hardware)	Not validated	Runs entirely on your machine. No data leaves your computer. Requires more setup (PyTorch, etc.).
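
The key and device rules in the Settings and Models tables can be summarised as a small validation sketch. The class, field names, and lowercase provider identifiers below are hypothetical and do not come from the tool's code; only the rules themselves (hosted providers need a key, HuggingFace does not, Device applies to HuggingFace only) are taken from the tables.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical identifiers for the providers listed in the table above.
HOSTED_PROVIDERS = {"claude", "mistral", "kimi"}
LOCAL_PROVIDERS = {"huggingface"}
DEVICES = {"cpu", "cuda", "mps"}


@dataclass
class ExtractionSettings:
    provider: str
    model: str
    api_key: Optional[str] = None
    device: str = "cpu"  # only meaningful for the local HuggingFace provider


def validate(settings: ExtractionSettings) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: List[str] = []
    provider = settings.provider.lower()
    if provider not in HOSTED_PROVIDERS | LOCAL_PROVIDERS:
        problems.append(f"Unknown provider: {settings.provider}")
    elif provider in HOSTED_PROVIDERS and not (settings.api_key or "").strip():
        problems.append("Enter an API key")
    if provider in LOCAL_PROVIDERS and settings.device.lower() not in DEVICES:
        problems.append(f"Unknown device: {settings.device}")
    return problems


print(validate(ExtractionSettings(provider="mistral", model="mistral-small")))
```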

What kind of PDFs work?

The tool is designed for tabular codebooks — PDFs with tables that show which variables are available for which conditions or timepoints.

Works well:

Tables with text marks ("X", "yes", numbers)
Tables with green checkmark icons (if icon detection is enabled)
Multi-page tables
Mixed layouts (some tables, some text)

Does not work:

Scanned PDFs (these need OCR, which is not enabled)
Free-form text without tables
Highly complex nested tables
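
To check in advance whether a PDF is a scan without a text layer, a rough test is to see whether any page yields extractable text. The sketch below is an assumption about how one might do this and is not part of the tool. The helper names and the 20-character threshold are made up, and reading pages relies on PyMuPDF (imported as fitz), which is imported lazily so the decision function works without it.

```python
from typing import Iterable, List


def looks_scanned(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """True when no page carries a meaningful amount of extractable text."""
    texts = list(page_texts)
    if not texts:
        return True
    return all(len(text.strip()) < min_chars_per_page for text in texts)


def read_page_texts(pdf_path: str) -> List[str]:
    """Extract the plain text of every page using PyMuPDF."""
    import fitz  # PyMuPDF; imported here so looks_scanned has no dependencies

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


# Usage with a real file (the path is a placeholder):
# print(looks_scanned(read_page_texts("codebook.pdf")))
print(looks_scanned(["", "  \n"]))
```

A PDF that mixes scanned and text pages is reported as not scanned, because a single page with text is enough to fail the `all(...)` check.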

Output: CSVW (CSV on the Web)

Each extracted table becomes a CSVW group containing:

Metadata (metadata.json): Describes the table structure in a standard W3C format. Includes column definitions, data types, and descriptions.
Data (data.csv): The actual rows, with one boolean column per codebook column showing whether that variable is available.

Example:

{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:title": "Patient variables",
  "tables": [{
    "url": "patient_variables.csv",
    "tableSchema": {
      "columns": [
        {"name": "variable_name", "datatype": "string"},
        {"name": "variable_label", "datatype": "string"},
        {"name": "has_NHL", "datatype": "boolean"},
        {"name": "has_HL", "datatype": "boolean"}
      ]
    }
  }]
}
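
A downstream script can use the metadata to decide how to parse each column of the CSV. The following is a minimal sketch using only the Python standard library. The function names and the sample data row are invented for illustration, and it assumes the boolean columns are written in CSVW's default lexical forms (true/false or 1/0); the actual output of the tool may differ.

```python
import csv
import io
import json
from typing import Any, Dict, List

EXAMPLE_METADATA = """
{
  "@context": "http://www.w3.org/ns/csvw",
  "tables": [{
    "url": "patient_variables.csv",
    "tableSchema": {
      "columns": [
        {"name": "variable_name", "datatype": "string"},
        {"name": "variable_label", "datatype": "string"},
        {"name": "has_NHL", "datatype": "boolean"},
        {"name": "has_HL", "datatype": "boolean"}
      ]
    }
  }]
}
"""

# Made-up sample row, not taken from a real codebook.
EXAMPLE_CSV = (
    "variable_name,variable_label,has_NHL,has_HL\n"
    "age,Age at diagnosis,true,false\n"
)


def parse_boolean(value: str) -> bool:
    """Parse true/false or 1/0; case is ignored as a leniency."""
    text = value.strip().lower()
    if text in ("true", "1"):
        return True
    if text in ("false", "0"):
        return False
    raise ValueError(f"Not a boolean: {value!r}")


def load_table(metadata_text: str, csv_text: str) -> List[Dict[str, Any]]:
    """Read CSV rows and cast columns declared as boolean in the metadata."""
    metadata = json.loads(metadata_text)
    columns = metadata["tables"][0]["tableSchema"]["columns"]
    datatypes = {col["name"]: col.get("datatype", "string") for col in columns}
    rows: List[Dict[str, Any]] = []
    for raw in csv.DictReader(io.StringIO(csv_text)):
        row: Dict[str, Any] = {}
        for name, value in raw.items():
            is_bool = datatypes.get(name) == "boolean"
            row[name] = parse_boolean(value) if is_bool else value
        rows.append(row)
    return rows


rows = load_table(EXAMPLE_METADATA, EXAMPLE_CSV)
print(rows[0])
```

Reading the datatypes from the metadata rather than hard-coding column names keeps the loader usable for any extracted table, whatever its columns are called.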

File layout

app.py                  # FastAPI backend (serves UI + handles extraction)
frontend/
  index.html            # UI markup
  style.css             # Dark theme styling
  app.js                # All interactivity (views, hover, navigation)
pipeline/
  pdf_processor.py      # PDF parsing: Docling (primary) + PyMuPDF (fallback)
  llm_extractor.py      # LLM calls, JSON parsing, CSVW conversion

Troubleshooting

Problem	Likely cause	Fix
"Upload failed"	File too large or not a PDF	Check that the file is under 50 MB and has a .pdf extension
"Enter an API key"	Using Claude/Mistral/Kimi without a key	Add a key in Settings, or switch to HuggingFace
Extraction hangs	The LLM is slow or the model returned an error	Click the stop button (■) and try a different model
Missing checkmarks	PDF uses visual icons, not text	Enable "Detect green checkmark icons" in Settings
Wrong table structure	Complex or unusual layout	Try a different model (Claude usually handles these best)
UI looks old / missing buttons	Browser cache	Hard-refresh: Ctrl+Shift+R (or Cmd+Shift+R on Mac)
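
The "Upload failed" checks can be run locally before uploading. This is a sketch under assumptions: the function name is hypothetical, the 50 MB limit is interpreted as 50 × 1024 × 1024 bytes, and the `%PDF-` signature test is an extra sanity check that the tool itself may or may not perform.

```python
from pathlib import Path
from typing import List

MAX_BYTES = 50 * 1024 * 1024  # assumed interpretation of the 50 MB limit


def upload_problems(path: str) -> List[str]:
    """Return reasons why a file would likely be rejected; empty means OK."""
    file = Path(path)
    if not file.is_file():
        return [f"{file} does not exist or is not a regular file"]
    problems: List[str] = []
    if file.suffix.lower() != ".pdf":
        problems.append("file does not have a .pdf extension")
    if file.stat().st_size >= MAX_BYTES:
        problems.append("file is 50 MB or larger")
    with file.open("rb") as handle:
        # PDF files normally begin with this signature.
        if handle.read(5) != b"%PDF-":
            problems.append("file does not start with the %PDF- signature")
    return problems


# Usage (the path is a placeholder):
# for problem in upload_problems("codebook.pdf"):
#     print(problem)
```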

Notes

This is a proof of concept — not production software. Expect rough edges.
No data is stored on disk or sent to external services (except the LLM provider you choose).
The visual icon detection is heuristic-based (green pixel counting). It may miss some icons or produce false positives on green-themed PDFs.
