Using the Parxy Command Line Interface (CLI)

Parxy provides a command-line interface (CLI) for parsing documents, converting them to Markdown, manipulating and inspecting PDFs, and managing configuration files directly from your terminal, without writing any Python code.

Once installed, you can run the CLI via the parxy command.

Overview

The Parxy CLI provides the following commands:

| Command | Description |
| --- | --- |
| parxy parse | Extract text content from documents with customizable detail levels and output formats. Process files or folders with multiple drivers. |
| parxy markdown | Convert documents to Markdown files, with support for multiple drivers and folder processing |
| parxy pdf:merge | Merge multiple PDF files into one, with support for page ranges |
| parxy pdf:split | Split a PDF into individual pages, with optional page range and single-file extraction |
| parxy pdf:outline | Print or export a PDF's outline (bookmarks / table of contents) |
| parxy pdf:tags | Inspect and extract the tag (structure) tree of a tagged, accessible PDF |
| parxy pdf:xmp | Read and extract XMP metadata from a PDF |
| parxy drivers | List available document processing drivers |
| parxy env | Generate a default .env configuration file |
| parxy docker | Create a Docker Compose configuration for running Parxy-related services |

Parsing Documents

The parse command extracts text from documents. It processes individual files or entire folders, supports multiple output formats, and can run several drivers on the same input for comparison.

Basic Usage

Parse a single document using the default settings (PyMuPDF driver, JSON output):

parxy parse document.pdf

This creates a pymupdf-document.json file in the same directory as the source file. Parxy always prefixes the output file name with the driver name.

Processing Multiple Files and Folders

Parse multiple files at once:

parxy parse doc1.pdf doc2.pdf doc3.pdf

Process all PDFs in a folder (recursively):

parxy parse /path/to/folder

Mix files and folders:

parxy parse document.pdf /path/to/folder

Output Formats

Control the output format with the --mode (-m) option:

# Markdown format
parxy parse document.pdf -m markdown

# Plain text
parxy parse document.pdf -m plain

# JSON (full document structure, default)
parxy parse document.pdf -m json

The file extension is automatically set based on the output mode (.md, .txt, or .json).

Output Directory

Specify where to save the output files with --output (-o):

parxy parse document.pdf -o output/

If not specified, files are saved in the same directory as the source files.
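
For scripts that need to locate a result file afterwards, the naming rules described so far (driver prefix, mode-dependent extension, optional output directory) can be sketched as a small shell function. The name parxy_output_path is hypothetical and not part of Parxy. The mode-to-extension mapping (markdown to .md, plain to .txt, json to .json) and the assumption that -o places files directly inside the given directory are inferred from the text above, not confirmed behavior.

```shell
# Hypothetical helper, not part of Parxy: predict where `parxy parse` is
# described as writing its result for one source file.
# Usage: parxy_output_path DRIVER MODE SOURCE [OUTPUT_DIR]
parxy_output_path() (
  driver=$1 mode=$2 src=$3 outdir=${4:-}

  # Assumed mapping from --mode to file extension.
  case $mode in
    markdown) ext=md ;;
    plain)    ext=txt ;;
    json)     ext=json ;;
    *) echo "unknown mode: $mode" >&2; exit 1 ;;
  esac

  name=$(basename -- "$src")
  name=${name%.*}                    # strip the source extension

  if [ -n "$outdir" ]; then
    dir=${outdir%/}                  # assumed: files land directly in -o
  else
    dir=$(dirname -- "$src")         # default: next to the source file
  fi

  printf '%s/%s-%s.%s\n' "$dir" "$driver" "$name" "$ext"
)
```

For example, `parxy_output_path pymupdf json /data/report.pdf` prints `/data/pymupdf-report.json`.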

Extraction Levels

Adjust the extraction level with the --level (-l) option:

parxy parse --level line document.pdf

Supported levels are (depending on the driver):

  • page
  • block (default)
  • line
  • span
  • character

Using Different Drivers

Specify a driver with the --driver (-d) option:

parxy parse --driver llamaparse document.pdf
# output will be saved as llamaparse-document.json

Using Multiple Drivers for Comparison

Parse the same document(s) with multiple drivers by specifying --driver (or -d for short) multiple times:

parxy parse document.pdf -d pymupdf -d llamaparse

Parxy prepends the driver name to each output filename (e.g. pymupdf-document.json and llamaparse-document.json), which makes it easy to compare extraction quality across parsers.
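
One way to make such a comparison concrete is to parse in plain-text mode and diff the two results. The sketch below is only an illustration: compare_drivers is a hypothetical name, and it assumes that -o writes <driver>-<name>.txt files directly into the given directory.

```shell
# Hypothetical helper, not part of Parxy: parse one file with two drivers in
# plain-text mode and print a unified diff of the two results.
# Usage: compare_drivers SOURCE FIRST_DRIVER SECOND_DRIVER
compare_drivers() (
  src=$1 first=$2 second=$3
  out=$(mktemp -d) || exit 1
  name=$(basename -- "$src")
  name=${name%.*}

  # Assumed output names: <driver>-<name>.txt inside the -o directory.
  parxy parse "$src" -d "$first" -d "$second" -m plain -o "$out" || {
    rm -rf "$out"
    exit 1
  }

  rc=0
  diff -u "$out/$first-$name.txt" "$out/$second-$name.txt" || rc=$?
  rm -rf "$out"
  exit "$rc"                 # 0 when identical, 1 when the outputs differ
)
```

A call such as `compare_drivers document.pdf pymupdf llamaparse` then returns a non-zero status whenever the two drivers disagree.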

Progress Tracking

When processing multiple files, Parxy displays a progress bar showing:

  • Files being processed
  • Driver being used
  • Output file location
  • Number of pages extracted

Complete Example

Process all PDFs in a folder with two drivers, output as JSON, and save to a specific directory:

parxy parse /path/to/pdfs -d pymupdf -d llamaparse -m json -o output/

Converting to Markdown

The markdown command converts documents to Markdown format, preserving structure such as headings and lists. It follows the same conventions as the parse command: output files are prefixed with the driver name and saved next to the source file by default.

Basic Usage

parxy markdown document.pdf

This creates a pymupdf-document.md file in the same directory as the source file.

Processing Multiple Files and Folders

# Parse multiple files
parxy markdown doc1.pdf doc2.pdf doc3.pdf

# Parse all PDFs in a folder (non-recursive by default)
parxy markdown /path/to/folder

# Parse recursively
parxy markdown /path/to/folder --recursive

# Limit recursion depth
parxy markdown /path/to/folder --recursive --max-depth 2

Output Directory

parxy markdown document.pdf -o output/

Using Multiple Drivers

Run the same documents through multiple drivers for comparison:

parxy markdown document.pdf -d pymupdf -d llamaparse

This produces pymupdf-document.md and llamaparse-document.md.

Converting Pre-parsed JSON Results

If you have a JSON file produced by parxy parse -m json, you can convert it to Markdown directly without re-parsing:

parxy markdown result.json

This loads the Document model from the JSON and converts it immediately — no driver or API call required. You can mix JSON files and PDF files in the same invocation:

parxy markdown result.json document.pdf -d pymupdf -o output/

Page Separator Comments

Use --page-separators to insert HTML comments before each page's content:

parxy markdown document.pdf --page-separators

Output will contain markers like:

<!-- page: 1 -->

First page content...

<!-- page: 2 -->

Second page content...

This is useful for post-processing scripts that need to identify page boundaries.
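
As an illustration of such post-processing, the following sketch splits a Markdown file into one file per page. The function name split_pages and the page-N.md naming are hypothetical, and the sketch assumes every marker sits on its own line in exactly the form shown above.

```shell
# Hypothetical helper, not part of Parxy: split Markdown produced with
# --page-separators into one file per page.
# Usage: split_pages INPUT.md OUTPUT_DIR
split_pages() (
  input=$1 outdir=$2
  mkdir -p "$outdir" || exit 1
  awk -v dir="$outdir" '
    /^<!-- page: [0-9]+ -->$/ {
      if (out != "") close(out)
      out = dir "/page-" $3 ".md"  # third field of the marker is the number
      printf "" > out              # create the file even if the page is empty
      next
    }
    out != "" { print > out }      # text before the first marker is ignored
  ' "$input"
)
```

Running `split_pages pymupdf-document.md pages` would then leave pages/page-1.md, pages/page-2.md, and so on.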

Inline Output

Use --inline with a single file to print the Markdown directly to stdout, preceded by a YAML frontmatter header. This is useful for shell pipelines:

parxy markdown document.pdf --inline
parxy markdown document.pdf --inline | your-tool

Output format:

---
file: "document.pdf"
pages: 10
---

# Document heading
...
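
A pipeline can read values from that header without a YAML library. The sketch below handles only simple `key: value` lines like the ones in the sample above; frontmatter_value is a hypothetical name, and this is not a general YAML parser.

```shell
# Hypothetical helper, not part of Parxy: print the value of one key from the
# frontmatter header of Markdown read on stdin.
# Usage: parxy markdown document.pdf --inline | frontmatter_value pages
frontmatter_value() {
  awk -v key="$1" '
    NR == 1 && $0 != "---" { exit }      # no frontmatter at all
    NR > 1 && $0 == "---"  { exit }      # end of the frontmatter block
    NR > 1 {
      prefix = key ": "
      if (index($0, prefix) == 1) {
        value = substr($0, length(prefix) + 1)
        gsub(/^"|"$/, "", value)         # strip surrounding double quotes
        print value
        exit
      }
    }
  '
}
```

With the sample output above, the key pages yields 10 and the key file yields document.pdf.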

Manipulating PDFs

Parxy provides two commands for PDF manipulation: pdf:merge combines multiple PDFs into one, and pdf:split divides a single PDF into multiple files.

Merging PDFs

The pdf:merge command combines multiple PDF files into a single output file. You can merge entire files, specific page ranges, or folders of PDFs.

Basic merge:

parxy pdf:merge file1.pdf file2.pdf -o merged.pdf

Merge with page ranges:

parxy pdf:merge doc1.pdf[1:5] doc2.pdf[3:7] -o combined.pdf

Page range syntax (1-based indexing):

  • file.pdf[1] - Single page (page 1)
  • file.pdf[1:5] - Pages 1 through 5
  • file.pdf[:3] - First 3 pages
  • file.pdf[5:] - From page 5 to the end
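
To make the range semantics concrete, the sketch below resolves a range into its first and last page for a document of known length. It mirrors the four forms listed above and is not Parxy's own parser: page_range_bounds is a hypothetical name, and rejecting out-of-range values is a choice made for this example.

```shell
# Hypothetical helper, not part of Parxy: turn "1:5", ":3", "5:" or "4" into
# "FIRST LAST" (1-based, inclusive) for a document with TOTAL pages.
# Usage: page_range_bounds RANGE TOTAL
page_range_bounds() (
  spec=$1 total=$2
  case $spec in
    *:*) first=${spec%%:*}; last=${spec#*:} ;;
    *)   first=$spec;       last=$spec ;;
  esac
  first=${first:-1}          # ":3" starts at the first page
  last=${last:-$total}       # "5:" runs to the last page

  case $first$last in
    *[!0-9]*) echo "invalid range: $spec" >&2; exit 1 ;;
  esac
  if [ "$first" -lt 1 ] || [ "$last" -gt "$total" ] || [ "$first" -gt "$last" ]; then
    echo "range out of bounds: $spec" >&2
    exit 1
  fi
  echo "$first $last"
)
```

For a 12-page file, `page_range_bounds 5: 12` prints `5 12` and `page_range_bounds :3 12` prints `1 3`.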

Merge entire folders:

parxy pdf:merge /path/to/pdfs -o combined.pdf

Mix files, folders, and page ranges:

parxy pdf:merge cover.pdf /chapters doc.pdf[10:20] appendix.pdf -o book.pdf

Splitting PDFs

The pdf:split command divides a PDF file into individual pages, with optional page range extraction and single-file output.

Split into individual pages:

parxy pdf:split document.pdf

This creates a document_split/ folder containing document_page_1.pdf, document_page_2.pdf, etc.

Specify output directory and prefix:

parxy pdf:split report.pdf -o ./pages -p page

Extract a page range as individual files:

parxy pdf:split document.pdf --pages 2:5 -o ./pages

Combine a page range into a single PDF:

# Auto-named output next to the input file
parxy pdf:split document.pdf --pages 2:5 --combine

# Custom output path
parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf

Page range formats (1-based): 3 · 2:5 · :5 · 3:

For more detailed examples and use cases, see the Merge and split PDFs guide.

Inspecting PDFs

Beyond text extraction, Parxy can inspect a PDF's structure and metadata: its outline (bookmarks), its accessibility tag tree, and its XMP metadata. Each command prints a human-readable view by default and can emit JSON with --json (to stdout) or --output (to a file).

Outline (bookmarks)

The pdf:outline command prints the table of contents as a tree:

parxy pdf:outline document.pdf

Use --flat for an indented list instead of a tree, or export the structure:

# Flat listing
parxy pdf:outline document.pdf --flat

# Export as JSON (flat entries + nested tree)
parxy pdf:outline document.pdf -o outline.json

The command exits with code 2 when the PDF has no bookmarks, which is handy in scripts.
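
A script can use that exit code to tell "no bookmarks" apart from a real failure. The wrapper below is a minimal sketch with a hypothetical name, export_outline. Apart from codes 0 and 2, which are described above, it simply passes any other status through.

```shell
# Hypothetical wrapper, not part of Parxy: export the outline when there is
# one and treat "no bookmarks" (exit code 2) as a normal outcome.
# Usage: export_outline INPUT.pdf OUTPUT.json
export_outline() {
  rc=0
  # Capture the status without tripping `set -e` in the calling script.
  parxy pdf:outline "$1" -o "$2" || rc=$?
  case $rc in
    0) echo "outline written to $2" ;;
    2) echo "no bookmarks in $1" ;;
    *) echo "pdf:outline failed for $1 (exit $rc)" >&2
       return "$rc" ;;
  esac
}
```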

Tags (accessibility structure)

A tagged PDF carries a logical structure tree (/StructTreeRoot) that makes it accessible. Start by checking whether a PDF is tagged:

parxy pdf:tags-check document.pdf

This reports whether the content is marked, whether a structure tree is present, the document language, and the number of structure elements. It exits with 0 for a tagged PDF and 2 otherwise.
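
Because the result is available as an exit code, a batch audit needs no output parsing. The sketch below prints every PDF that does not pass the check. The name list_untagged is hypothetical, and the loop treats any non-zero status as "not tagged", so unreadable files are reported as well.

```shell
# Hypothetical helper, not part of Parxy: print the PDFs among the arguments
# for which `parxy pdf:tags-check` does not exit with 0.
# Usage: list_untagged FILE.pdf...
list_untagged() {
  for pdf in "$@"; do
    if ! parxy pdf:tags-check "$pdf" >/dev/null 2>&1; then
      printf '%s\n' "$pdf"
    fi
  done
}
```

A typical call would be `list_untagged reports/*.pdf > untagged.txt`.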

Extract the tag tree itself with pdf:tags:

# Print the structure tree (with page references and alt text)
parxy pdf:tags document.pdf

# Include the visible text of each element (rebuilt per page)
parxy pdf:tags document.pdf --text

# Export the full nested structure as JSON
parxy pdf:tags document.pdf -o tags.json

The default view walks the document-wide structure tree and shows accessibility attributes (alt text, titles, page references) but not body text, which lives in the page content streams. The --text view reconstructs the structure per page including each element's visible text, but without the accessibility attributes.

Two companion commands help with accessibility work:

# Copy a tagged PDF keeping its tags but removing visible content
parxy pdf:tag-skeleton document.pdf -o tags-only.pdf

# Create an empty tagged PDF skeleton from scratch
parxy pdf:tag-template -o template.pdf --pages 3 --lang en-US

XMP metadata

The pdf:xmp command reads the XMP metadata packet (an RDF/XML block holding properties such as dc:title, dc:creator, and pdf:Producer) and prints the parsed properties alongside the classic /Info dictionary:

parxy pdf:xmp document.pdf

You can view the original packet or export the metadata:

# Print the raw XMP XML packet
parxy pdf:xmp document.pdf --raw

# Export parsed metadata as JSON
parxy pdf:xmp document.pdf --json

# Save the raw XMP packet (a .xml path writes the raw packet,
# any other extension writes parsed JSON)
parxy pdf:xmp document.pdf -o metadata.xml

Managing Drivers

To view the list of supported document parsing drivers:

parxy drivers

This displays all available backends (e.g., pymupdf, pdfact, llamaparse).

Environment Configuration

To create a default .env configuration file for Parxy:

parxy env

If a .env file already exists, you'll be prompted before overwriting it. You can then edit this file to adjust driver settings, API keys, or other environment variables.

Running with Docker

Parxy can generate a ready-to-use Docker Compose configuration for self-hosted services (e.g., parsers available via an HTTP-based API):

parxy docker

This creates a compose.yaml file in your working directory. To start the services, run:

docker compose pull
docker compose up -d

Full Command Reference

Run the following to see all available commands and options:

parxy --help

Each command also supports --help for detailed usage, for example:

parxy parse --help

Summary

With the CLI, you can use Parxy as a standalone document parsing tool, ideal for quick experiments, batch conversions, or integration into shell-based pipelines.

| Command | Purpose |
| --- | --- |
| parxy parse | Extract text from documents with multiple formats & drivers |
| parxy markdown | Generate Markdown files; accepts JSON results and supports --page-separators |
| parxy pdf:merge | Merge multiple PDF files with page range support |
| parxy pdf:split | Split PDF into individual pages; supports --pages and --combine |
| parxy pdf:outline | Print or export a PDF's outline (bookmarks) |
| parxy pdf:tags | Inspect and extract a tagged PDF's structure tree; supports --text |
| parxy pdf:xmp | Read and extract XMP metadata; supports --raw and JSON export |
| parxy drivers | List supported drivers |
| parxy env | Create default configuration file |
| parxy docker | Generate Docker Compose setup |