Health-RI/codebook_pdf_extractor

Codebook Extractor

A small internal tool that turns PDF codebooks into structured, machine-readable data.

What is this for?

Research datasets often come with PDF codebooks that describe what variables are available and when. These PDFs are great for humans to read, but terrible for computers. This tool extracts the tables from those PDFs and turns them into structured data (CSVW format) that can be used in data catalogs, APIs, or pipelines.

How it works (3 steps)

  1. Upload a PDF — Drag and drop or click to select your codebook PDF
  2. Run extraction — The tool reads the PDF, finds tables, and uses an AI model to understand the structure
  3. Explore results — Browse the extracted data, see where it came from in the PDF, and download as CSVW

Quick start

pip install -r requirements.txt
python app.py

The app then opens automatically at http://localhost:8000.

The interface

Left side — PDF viewer

  • Upload area (drag & drop or click)
  • Page navigation (previous / next)
  • Re-upload button (↻ top-right) — upload a different PDF without restarting
  • Settings drawer (⚙ top-left) — hover to open

Right side — Results

After extraction, you can switch between 4 views:

View | What you see
Variable Cards (default) | One card per row, with tags showing which columns it's available in. Click a card to jump to its location in the PDF.
Table View | A scrollable matrix showing all rows × columns with ✓ or — for availability. Good for spotting patterns.
CSVW Metadata | The technical schema (column names, data types) plus a preview of the first 5 data rows.
Raw JSON | The full extraction output as pretty-printed JSON. Useful for debugging.

Hover to preview

Hover over any variable card to temporarily highlight its location in the PDF. Click to keep the highlight.

Settings

Open the settings drawer (⚙) to configure:

Setting | What it does
Provider | Which AI model to use: Claude, Mistral, Kimi, or local HuggingFace
API Key | Your key for the chosen provider (not needed for HuggingFace)
Model | Specific model version (e.g. claude-sonnet-4, mistral-small)
Device | For HuggingFace only: CPU, CUDA (NVIDIA GPU), or MPS (Apple Silicon)
PDF render scale | How sharp the PDF looks (1×–3×). Higher = clearer but slower.
Detect green checkmark icons | Off by default. Turn on if your PDF uses green ✅ symbols instead of text "X" marks. This scans each table cell for green pixels, which is slower but catches visual-only indicators.

Settings are saved in your browser (localStorage).

Models

Provider | Needs API key? | Speed | Quality | Notes
Claude (Anthropic) | Yes (paid) | Fast | Not validated |
Mistral | Yes (free) | Fast | Not validated | Recommended because the API has a free tier.
Kimi (Moonshot) | Yes (paid) | Fast | Not validated | Chinese provider.
HuggingFace | No | Slow (depends on hardware) | Not validated | Runs entirely on your machine. No data leaves your computer. Requires more setup (PyTorch, etc.).

What kind of PDFs work?

The tool is designed for tabular codebooks — PDFs with tables that show which variables are available for which conditions or timepoints.

Works well:

  • Tables with text marks ("X", "yes", numbers)
  • Tables with green checkmark icons (if icon detection is enabled)
  • Multi-page tables
  • Mixed layouts (some tables, some text)
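
As an illustration of how text marks like these could be turned into availability flags, here is a minimal sketch. The function name `mark_to_bool` and the set of "empty" marks are assumptions made for this example; they are not taken from the tool's code.

```python
from typing import Optional

# Cell contents treated as "not available" (an assumed list for this sketch).
EMPTY_MARKS = {"", "-", "–", "—", "no", "n"}


def mark_to_bool(cell: Optional[str]) -> bool:
    """Interpret one table cell as 'variable available' (True) or not (False)."""
    if cell is None:
        return False
    text = cell.strip().lower()
    # Anything that is not an explicit empty/negative mark counts as available,
    # which covers "X", "yes", and numeric entries alike.
    return text not in EMPTY_MARKS


print([mark_to_bool(c) for c in ["X", " yes ", "12", "", "—", None]])
# → [True, True, True, False, False, False]
```

Treating every non-empty cell as available is a deliberately permissive choice; a stricter variant could accept only a known list of positive marks.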

Does not work:

  • Scanned PDFs (image-only pages need OCR, which is not enabled)
  • Free-form text without tables
  • Highly complex nested tables

Output: CSVW (CSV on the Web)

Each extracted table becomes a CSVW group containing:

  • Metadata (metadata.json): Describes the table structure in a standard W3C format. Includes column definitions, data types, and descriptions.
  • Data (data.csv): The actual rows, with one boolean column per codebook column showing whether that variable is available.

Example:

{
  "@context": "http://www.w3.org/ns/csvw",
  "dc:title": "Patient variables",
  "tables": [{
    "url": "patient_variables.csv",
    "tableSchema": {
      "columns": [
        {"name": "variable_name", "datatype": "string"},
        {"name": "variable_label", "datatype": "string"},
        {"name": "has_NHL", "datatype": "boolean"},
        {"name": "has_HL", "datatype": "boolean"}
      ]
    }
  }]
}
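
To show how metadata of this shape and the matching data.csv relate, here is a minimal sketch using only the Python standard library. The function name `build_csvw` and the input row shape (`name`, `label`, `available_in`) are hypothetical and chosen for this example; the tool's actual internal format may differ.

```python
import csv
import io
import json


def build_csvw(title, csv_name, rows, condition_columns):
    """Return (metadata_json, data_csv) for one extracted table."""
    columns = [
        {"name": "variable_name", "datatype": "string"},
        {"name": "variable_label", "datatype": "string"},
    ] + [{"name": f"has_{c}", "datatype": "boolean"} for c in condition_columns]

    metadata = {
        "@context": "http://www.w3.org/ns/csvw",
        "dc:title": title,
        "tables": [{"url": csv_name, "tableSchema": {"columns": columns}}],
    }

    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([c["name"] for c in columns])  # header matches the schema
    for row in rows:
        flags = ["true" if c in row["available_in"] else "false"
                 for c in condition_columns]
        writer.writerow([row["name"], row["label"]] + flags)

    return json.dumps(metadata, indent=2), buf.getvalue()


metadata_json, data_csv = build_csvw(
    "Patient variables",
    "patient_variables.csv",
    [{"name": "age", "label": "Age at diagnosis", "available_in": {"NHL"}}],
    ["NHL", "HL"],
)
print(data_csv)
# variable_name,variable_label,has_NHL,has_HL
# age,Age at diagnosis,true,false
```

The CSV header is generated from the same column list as the schema, so the two files cannot drift apart.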

File layout

app.py                  # FastAPI backend (serves UI + handles extraction)
frontend/
  index.html            # UI markup
  style.css             # Dark theme styling
  app.js                # All interactivity (views, hover, navigation)
pipeline/
  pdf_processor.py      # PDF parsing: Docling (primary) + PyMuPDF (fallback)
  llm_extractor.py      # LLM calls, JSON parsing, CSVW conversion
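
The primary/fallback arrangement noted for pdf_processor.py can be sketched generically as below. This is an assumption-laden illustration: the function `extract_tables`, the `Table` type, and the rule "fall back on an error or an empty result" are all hypothetical, and the parsers are passed in as plain callables rather than calling Docling or PyMuPDF directly.

```python
import logging
from typing import Callable, List

Table = List[List[str]]  # one table = list of rows, each a list of cell strings
Parser = Callable[[str], List[Table]]


def extract_tables(pdf_path: str, primary: Parser, fallback: Parser) -> List[Table]:
    """Try the primary parser; use the fallback if it fails or finds nothing."""
    try:
        tables = primary(pdf_path)
        if tables:
            return tables
        logging.warning("primary parser found no tables in %s", pdf_path)
    except Exception as exc:  # broad on purpose: any parser failure triggers the fallback
        logging.warning("primary parser failed on %s: %r", pdf_path, exc)
    return fallback(pdf_path)
```

Passing the parsers as arguments keeps the sketch testable without either library installed.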

Troubleshooting

Problem | Likely cause | Fix
"Upload failed" | File too large or not a PDF | Check that the file is under 50 MB and has a .pdf extension
"Enter an API key" | Using Claude/Mistral/Kimi without a key | Add a key in Settings, or switch to HuggingFace
Extraction hangs | LLM taking a long time, or a model error | Click the stop button (■) and try a different model
Missing checkmarks | PDF uses visual icons, not text | Enable "Detect green checkmark icons" in Settings
Wrong table structure | Complex or unusual layout | Try a different model (Claude usually handles these best)
UI looks old / missing buttons | Browser cache | Hard-refresh: Ctrl+Shift+R (or Cmd+Shift+R on Mac)
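
If "Upload failed" keeps appearing, a quick local check of the file can help narrow down the cause. The sketch below is not the tool's own validation: the function name `upload_problem` is hypothetical, and reading "50MB" as 50 × 1024 × 1024 bytes is an assumption. The `%PDF-` check relies on the standard PDF file header.

```python
from typing import Optional

MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # assumed interpretation of the 50 MB limit


def upload_problem(filename: str, size_bytes: int, head: bytes) -> Optional[str]:
    """Return a short reason the upload would likely fail, or None if it looks fine.

    `head` is the first few bytes of the file.
    """
    if not filename.lower().endswith(".pdf"):
        return "file does not have a .pdf extension"
    if not head.startswith(b"%PDF-"):
        return "file does not start with a PDF header"
    if size_bytes >= MAX_UPLOAD_BYTES:
        return "file is 50 MB or larger"
    return None


print(upload_problem("codebook.pdf", 2_000_000, b"%PDF-1.7"))
# → None
```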

Notes

  • This is a proof of concept — not production software. Expect rough edges.
  • No data is stored on disk or sent to external services (except the LLM provider you choose).
  • The visual icon detection is heuristic-based (green pixel counting). It may miss some icons or produce false positives on green-themed PDFs.
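
A green-pixel-counting heuristic of the kind described above might look roughly like this. It is a sketch under stated assumptions, not the tool's implementation: the function names and all thresholds (`min_green`, `margin`, `min_fraction`) are invented for illustration, and the cell is represented as a plain nested list of RGB tuples rather than a rendered image.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0-255


def is_green(pixel: Pixel, min_green: int = 120, margin: int = 40) -> bool:
    """A pixel counts as green if G is bright and clearly dominates R and B."""
    r, g, b = pixel
    return g >= min_green and g - r >= margin and g - b >= margin


def cell_has_green_icon(pixels: List[List[Pixel]], min_fraction: float = 0.05) -> bool:
    """Flag a cell when at least `min_fraction` of its pixels are green."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        return False
    green = sum(1 for row in pixels for p in row if is_green(p))
    return green / total >= min_fraction


white = (255, 255, 255)
check = (40, 180, 60)
empty_cell = [[white] * 10 for _ in range(10)]
icon_cell = [[check] * 10] + [[white] * 10 for _ in range(9)]
print(cell_has_green_icon(empty_cell), cell_has_green_icon(icon_cell))
# → False True
```

The `margin` requirement is what keeps white and pale backgrounds from counting as green; a strongly green-themed page can still exceed the threshold, which matches the false-positive caveat above.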
