Install
$ agentstack add mcp-nameetp-pdfmux
Open-source listing, not yet scanned by AgentStack. Follow the source repository for install instructions.
About
pdfmux
Self-healing PDF extraction with per-page confidence scoring. Open-source LlamaParse alternative for RAG pipelines, MCP server for Claude Desktop, LangChain + LlamaIndex loaders. Ranked #2 on opendataloader-bench (0.900).
The only PDF extractor that audits its own output. Catches blank pages, scrambled columns, broken tables — re-extracts them with a stronger backend. So your LLM gets clean data, not silent garbage. Routes each page to the best of 5 rule-based backends + BYOK LLM fallback (Gemini / Claude / GPT-4o / Ollama). One CLI. One API. Zero config.
PDF ──> pdfmux router ──> best extractor per page ──> audit ──> re-extract failures ──> Markdown / JSON / chunks
|
├─ PyMuPDF (digital text, 0.01s/page)
├─ OpenDataLoader (complex layouts, 0.05s/page)
├─ RapidOCR (scanned pages, CPU-only)
├─ Docling (tables, 97.9% TEDS)
├─ Surya (heavy OCR fallback)
├─ Marker (academic papers, neural)
├─ Mistral OCR ($0.002/page, 96.6% tables)
└─ YOUR LLM (Gemini / Gemma 4 / Claude / GPT-4o / Ollama / Mistral — BYOK via YAML)
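The extract, audit, re-extract flow in the diagram can be sketched in a few lines. This is an illustration only, not pdfmux's implementation: the `Backend` type, the `audit` heuristic, the 0.5 threshold, and the two stand-in backends are all assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Backend:
    name: str
    extract: Callable[[str], str]  # hypothetical: page payload -> extracted text


def audit(text: str) -> float:
    """Toy confidence: share of alphanumeric or whitespace characters."""
    if not text.strip():
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() for ch in text)
    return ok / len(text)


def extract_page(page: str, backends: list[Backend], threshold: float = 0.5) -> tuple[str, str, float]:
    """Try backends in order, keep the best result, stop once one passes the audit."""
    best = ("", "none", 0.0)
    for backend in backends:
        text = backend.extract(page)
        score = audit(text)
        if score > best[2]:
            best = (text, backend.name, score)
        if score >= threshold:
            break
    return best


# Two stand-in backends: the "fast" one returns nothing for scanned pages.
fast = Backend("fast", lambda page: "" if page.startswith("scan:") else page)
ocr = Backend("ocr", lambda page: page.removeprefix("scan:"))

pages = ["Invoice 42 total 100", "scan:Signed on page two"]
results = [extract_page(p, [fast, ocr]) for p in pages]
for text, name, score in results:
    print(f"{name}: {score:.2f} {text!r}")
```

The digital page is accepted from the first backend, while the scanned page fails the audit and falls through to the second one.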
Install
pip install pdfmux
That handles digital PDFs. For any real-world batch, install pdfmux[ocr] too — almost every directory of PDFs has at least one scan, and without OCR those pages return empty text:
pip install "pdfmux[ocr]" # ⭐ recommended — RapidOCR for scanned pages (~200MB, CPU)
Other backends, by document type:
pip install "pdfmux[tables]" # Docling — table-heavy docs (~500MB)
pip install "pdfmux[opendataloader]" # OpenDataLoader — complex layouts (Java 11+)
pip install "pdfmux[marker]" # Marker — neural extraction for academic papers
pip install "pdfmux[llm]" # Gemini fallback (default LLM)
pip install "pdfmux[llm-claude]" # Claude (Sonnet / Opus)
pip install "pdfmux[llm-openai]" # GPT-4o family
pip install "pdfmux[llm-ollama]" # Ollama (any local model)
pip install "pdfmux[llm-mistral]" # Mistral OCR API ($0.002/page)
pip install "pdfmux[llm-all]" # all LLM providers (incl. Gemma 4 via Gemini key)
pip install "pdfmux[watch]" # `pdfmux watch` auto-convert on change
pip install "pdfmux[all]" # everything
Requires Python 3.11+.
Quick Start
CLI
# zero config — just works
pdfmux convert invoice.pdf
# invoice.pdf -> invoice.md (2 pages, 95% confidence, via pymupdf4llm)
# RAG-ready chunks with token limits
pdfmux convert report.pdf --chunk --max-tokens 500
# cost-aware extraction with budget cap
pdfmux convert report.pdf --mode economy --budget 0.50
# schema-guided structured extraction (5 built-in presets)
pdfmux convert invoice.pdf --schema invoice
# BYOK any LLM for hardest pages
pdfmux convert scan.pdf --llm-provider claude
# use a built-in or saved profile (invoices, receipts, papers, contracts, bulk-rag)
pdfmux convert invoice.pdf --profile invoices
# predict cost before running anything
pdfmux estimate big-report.pdf --llm-provider gemini
# stream pages as NDJSON as they finish (great for long documents)
pdfmux stream report.pdf --quality high
# auto-convert any new PDFs that land in a folder
pdfmux watch ./inbox/ -o ./output/
# diff two extractions side-by-side
pdfmux diff old.pdf new.pdf
# batch a directory — writes manifest.json with per-doc confidence
pdfmux convert ./docs/ -o ./output/
# CI mode: fail the run if any document is below 0.20 confidence
pdfmux convert ./docs/ -o ./output/ --strict --min-confidence 0.20
# pre-flight a directory: which extras do you actually need for THIS batch?
pdfmux doctor --check ./docs/
# results are cached by file hash — re-runs are instant; bypass with --no-cache
pdfmux convert report.pdf --no-cache
pdfmux convert report.pdf --clear-cache
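Caching by file hash means the key depends on the bytes of the PDF, not on its name or path. A minimal sketch of that idea, assuming SHA-256 and an in-memory dict; pdfmux's actual hash function, key layout, and storage are not described here, and `cache_key` and `cached_convert` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


def cache_key(path: Path, options: str = "") -> str:
    """Hash the file contents plus any options that affect the output."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    digest.update(options.encode("utf-8"))
    return digest.hexdigest()


_cache: dict[str, str] = {}
stats = {"misses": 0}


def cached_convert(path: Path, options: str = "") -> str:
    """Return the cached result when the same bytes were converted before."""
    key = cache_key(path, options)
    if key not in _cache:
        stats["misses"] += 1  # stands in for the expensive extraction step
        _cache[key] = f"converted {path.stat().st_size} bytes"
    return _cache[key]


with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.pdf"
    b = Path(tmp) / "renamed-copy.pdf"
    a.write_bytes(b"%PDF-1.7 same bytes")
    b.write_bytes(b"%PDF-1.7 same bytes")
    cached_convert(a)
    cached_convert(b)          # same content, different name: cache hit
    cached_convert(a, "high")  # different options: cache miss
    same_key = cache_key(a) == cache_key(b)

print(stats["misses"], same_key)
```

Folding the options into the key is a design choice of this sketch: it keeps a `fast` result from being served when a `high` run is requested.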
Python
For batch processing, use batch_extract() — not a subprocess.run(['pdfmux', ...]) loop. Same pipeline, no per-file process spawn, handles non-ASCII filenames:
import pdfmux
from pathlib import Path
# Batch extract — yields (path, result) tuples as each PDF completes.
pdfs = list(Path("./inbox").glob("*.pdf"))
for path, result in pdfmux.batch_extract(pdfs, quality="standard"):
    if isinstance(result, Exception):
        print(f"FAILED {path.name}: {result}")
        continue
    if result.confidence …
**Don't wrap pdfmux with your own pypdf/pdfplumber fallback.** pdfmux already routes per page through PyMuPDF → RapidOCR → vision LLM. PyMuPDF tolerates malformed PDFs that pypdf rejects ("Stream has ended unexpectedly"), so a downstream pypdf fallback turns recoverable PDFs into failures. Trust the router; check the confidence score on the result.
## Architecture
                         ┌─────────────────────────────┐
                         │      Segment Detector       │
                         │  text / tables / images /   │
                         │ formulas / headers per page │
                         └─────────────┬───────────────┘
                                       │
                  ┌────────────────────────────────────────┐
                  │             Router Engine              │
                  │                                        │
                  │    economy  ──  balanced  ──  premium  │
                  │  (minimize $)  (default)  (max quality)│
                  │  budget caps: --budget 0.50            │
                  └────────────────────┬───────────────────┘
                                       │
        ┌──────────┬──────────┬────────┴────────┬──────────┐
        │          │          │                 │          │
     PyMuPDF   OpenData   RapidOCR           Docling      LLM
     digital    Loader     scanned           tables     (BYOK)
    0.01s/pg    complex   CPU-only            97.9%  any provider
                layouts                       TEDS
        │          │          │                 │          │
        └──────────┴──────────┴────────┬────────┴──────────┘
                                       │
                  ┌────────────────────────────────────────┐
                  │            Quality Auditor             │
                  │                                        │
                  │  4-signal dynamic confidence scoring   │
                  │  per-page: good / bad / empty          │
                  │  if bad -> re-extract with next backend│
                  └────────────────────┬───────────────────┘
                                       │
                  ┌────────────────────────────────────────┐
                  │            Output Pipeline             │
                  │                                        │
                  │  heading injection (font-size analysis)│
                  │  table extraction + normalization      │
                  │  text cleanup + merge                  │
                  │ confidence score (honest, not inflated)│
                  └────────────────────────────────────────┘
### Key design decisions
- **Router, not extractor.** pdfmux does not compete with PyMuPDF or Docling. It picks the best one per page.
- **Agentic multi-pass.** Extract, audit confidence, re-extract failures with a stronger backend. Bad pages get retried automatically.
- **Segment-level detection.** Each page is classified by content type (text, tables, images, formulas, headers) before routing.
- **4-signal confidence.** Dynamic quality scoring from character density, OCR noise ratio, table integrity, and heading structure. Not hardcoded thresholds.
- **Document cache.** Each PDF is opened once, not once per extractor. Shared across the full pipeline.
- **Data flywheel.** Local telemetry tracks which extractors win per document type. Routing improves with usage.
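To make the 4-signal idea concrete, here is one way four per-page signals could be folded into a single score. The signal names come from the list above; the 0..1 scaling, the weights, and the `PageSignals` and `confidence` names are assumptions for illustration. pdfmux describes its scoring as dynamic, so treat the fixed weights below as a simplification, not as its method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSignals:
    char_density: float       # 0..1, amount of text relative to the page
    ocr_noise: float          # 0..1, share of characters that look like OCR junk
    table_integrity: float    # 0..1, how consistent table rows and columns are
    heading_structure: float  # 0..1, how plausible the heading hierarchy is


# Assumed weights for illustration only; they sum to 1.0.
WEIGHTS = {
    "char_density": 0.4,
    "ocr_noise": 0.3,
    "table_integrity": 0.2,
    "heading_structure": 0.1,
}


def confidence(s: PageSignals) -> float:
    """Weighted mean of the four signals; noise counts against the page."""
    score = (
        WEIGHTS["char_density"] * s.char_density
        + WEIGHTS["ocr_noise"] * (1.0 - s.ocr_noise)
        + WEIGHTS["table_integrity"] * s.table_integrity
        + WEIGHTS["heading_structure"] * s.heading_structure
    )
    return round(min(1.0, max(0.0, score)), 3)


clean = PageSignals(0.9, 0.05, 1.0, 0.8)
garbled = PageSignals(0.3, 0.7, 0.2, 0.1)
print(confidence(clean), confidence(garbled))
```

A page with dense text, little noise, and intact tables scores well above one with sparse, noisy output, which is the property a re-extraction trigger needs.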
## Features
| Feature | What it does | Command |
|---------|-------------|---------|
| Zero-config extraction | Routes to best backend automatically | `pdfmux convert file.pdf` |
| RAG chunking | Section-aware chunks with token estimates | `pdfmux convert file.pdf --chunk --max-tokens 500` |
| Cost modes | economy / balanced / premium with budget caps | `pdfmux convert file.pdf --mode economy --budget 0.50` |
| Schema extraction | 5 built-in presets (invoice, receipt, contract, resume, paper) | `pdfmux convert file.pdf --schema invoice` |
| Profiles | Save and re-use config; built-ins for invoices/receipts/papers/contracts/bulk-rag | `pdfmux convert file.pdf --profile invoices` |
| BYOK LLM | Gemini, Gemma 4, Claude, GPT-4o, Ollama, Mistral, any OpenAI-compatible API | `pdfmux convert file.pdf --llm-provider claude` |
| Cost estimate | Predict spend before running | `pdfmux estimate file.pdf --llm-provider gemini` |
| Streaming output | NDJSON events page-by-page for long docs | `pdfmux stream file.pdf` |
| Smart cache | Hash-keyed result cache, 30-day TTL, 1 GB LRU | `pdfmux convert file.pdf` (auto), `--no-cache` to bypass |
| Watch mode | Auto-convert any PDF added to a folder | `pdfmux watch ./inbox/` |
| Diff | Compare two extractions | `pdfmux diff a.pdf b.pdf` |
| Benchmark | Eval all installed extractors against ground truth | `pdfmux benchmark` |
| Doctor | Show installed backends, coverage gaps, recommendations | `pdfmux doctor` |
| MCP server | AI agents read PDFs via stdio or HTTP | `pdfmux serve` |
| Batch processing | Convert entire directories | `pdfmux convert ./docs/` |
| Page-level streaming API | Bounded-memory page iteration for large files | `for page in ext.extract("500pg.pdf")` |
| Retry with backoff | Every LLM provider auto-retries with exponential backoff + `Retry-After` | (built-in) |
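The retry row in the table describes exponential backoff that honors `Retry-After`. A generic sketch of that pattern follows; it is not pdfmux's code. `RateLimited` and `call_with_retry` are hypothetical names, and the base delay, attempt count, and jitter are assumed values.

```python
import random
import time
from typing import Callable, Optional


class RateLimited(Exception):
    """Hypothetical error carrying the server's Retry-After value in seconds."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_retry(
    fn: Callable[[], str],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry fn on RateLimited, preferring Retry-After over the computed backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; a little jitter avoids synchronized retries.
            backoff = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            sleep(err.retry_after if err.retry_after is not None else backoff)
    raise ValueError("max_attempts must be at least 1")


# Simulated provider: fails twice (once with Retry-After), then succeeds.
waits: list[float] = []
responses = iter([RateLimited(retry_after=2.0), RateLimited(), "ok"])


def flaky() -> str:
    item = next(responses)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_retry(flaky, sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run: the recorded waits show the server-supplied delay used first, then the computed backoff.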
## CLI Reference
### `pdfmux convert`
```bash
pdfmux convert [options]

Options:
  -o, --output PATH         Output file or directory
  -f, --format FORMAT       markdown | json | csv | llm (default: markdown)
  -q, --quality QUALITY     fast | standard | high (default: standard)
  -s, --schema SCHEMA       JSON schema file or preset (invoice, receipt, contract, resume, paper)
  --chunk                   Output RAG-ready chunks
  --max-tokens N            Max tokens per chunk (default: 500)
  --mode MODE               economy | balanced | premium (default: balanced)
  --budget AMOUNT           Max spend per document in USD
  --llm-provider PROVIDER   LLM backend: gemini | claude | openai | ollama
  --confidence              Include confidence score in output
  --stdout                  Print to stdout instead of file
```
### `pdfmux serve`
Start the MCP server for AI agent integration.
pdfmux serve # stdio mode (Claude Desktop, Cursor)
pdfmux serve --http 8080 # HTTP mode
### `pdfmux doctor`
pdfmux doctor
# ┌──────────────────┬─────────────┬─────────┬──────────────────────────────────┐
# │ Extractor │ Status │ Version │ Install │
# ├──────────────────┼─────────────┼─────────┼──────────────────────────────────┤
# │ PyMuPDF │ installed │ 1.25.3 │ │
# │ OpenDataLoader │ installed │ 0.3.1 │ │
# │ RapidOCR │ installed │ 3.0.6 │ │
# │ Docling │ missing │ -- │ pip install pdfmux[tables] │
# │ Surya │ missing │ -- │ pip install pdfmux[ocr-heavy] │
# │ LLM (Gemini) │ configured │ -- │ GEMINI_API_KEY set │
# └──────────────────┴─────────────┴─────────┴──────────────────────────────────┘
### `pdfmux benchmark`
pdfmux benchmark report.pdf
# ┌──────────────────┬────────┬────────────┬─────────────┬──────────────────────┐
# │ Extractor │ Time │ Confidence │ Output │ Status │
# ├──────────────────┼────────┼────────────┼─────────────┼──────────────────────┤
# │ PyMuPDF │ 0.02s │ 95% │ 3,241 chars │ all pages good │
# │ Multi-pass │ 0.03s │ 95% │ 3,241 chars │ all pages good │
# │ RapidOCR │ 4.20s │ 88% │ 2,891 chars │ ok │
# │ OpenDataLoader │ 0.12s │ 97% │ 3,310 chars │ best │
# └──────────────────┴────────┴────────────┴─────────────┴──────────────────────┘
### `pdfmux estimate`
Predict spend (and which backends will run) before processing.
pdfmux estimate report.pdf --quality high --llm-provider gemini
# Pages : 47
# Extractors : pymupdf4llm + gemini-2.5-flash on 9 pages
# Estimated : $0.0234
# Cache hit? : no (first run for this file)
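The arithmetic behind an estimate like this can be checked by hand. The sketch below assumes that only the LLM-routed pages cost money and that cost is linear per page; the per-page rate is simply what the sample output implies ($0.0234 over 9 pages), not a published price, and both helper names are hypothetical.

```python
from decimal import Decimal


def estimate_cost(llm_pages: int, rate_per_page: Decimal) -> Decimal:
    """Linear cost model: pages sent to the LLM times the per-page rate."""
    return llm_pages * rate_per_page


def pages_within_budget(budget: Decimal, rate_per_page: Decimal) -> int:
    """How many LLM-routed pages fit under a budget cap such as --budget 0.50."""
    return int(budget // rate_per_page)


# Rate implied by the sample output: $0.0234 for 9 LLM pages.
implied_rate = Decimal("0.0234") / 9

print(implied_rate)
print(estimate_cost(9, implied_rate))
print(pages_within_budget(Decimal("0.50"), implied_rate))
```

`Decimal` is used instead of `float` so that small dollar amounts divide and compare exactly.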
### `pdfmux stream`
Emit NDJSON events as pages complete — useful for very long PDFs and live UIs.
pdfmux stream long.pdf --quality high
# {"event":"classified","page_count":312,"plan":"pymupdf+gemini-fallback"}
# {"event":"page","page_num":0,"confidence":0.97,"chars":1842}
# {"event":"page","page_num":1,"confidence":0.92,"chars":1611,"ocr":true}
# ...
# {"event":"complete","confidence":0.94,"cost_usd":0.0712}
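Because each event is one JSON object per line, a consumer only needs `json.loads` per line. The sketch below summarizes the sample events shown above; the field names are taken from that sample, while `summarize` and the 0.95 review threshold are assumptions for the example.

```python
import json

# Sample events copied from the output shown above.
SAMPLE = """\
{"event":"classified","page_count":312,"plan":"pymupdf+gemini-fallback"}
{"event":"page","page_num":0,"confidence":0.97,"chars":1842}
{"event":"page","page_num":1,"confidence":0.92,"chars":1611,"ocr":true}
{"event":"complete","confidence":0.94,"cost_usd":0.0712}
"""


def summarize(lines, min_confidence=0.95):
    """Fold a stream of NDJSON events into a small summary dict."""
    summary = {
        "page_count": None,
        "pages": 0,
        "ocr_pages": [],
        "low_confidence": [],
        "cost_usd": None,
    }
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between events
        event = json.loads(raw)
        kind = event.get("event")
        if kind == "classified":
            summary["page_count"] = event.get("page_count")
        elif kind == "page":
            summary["pages"] += 1
            if event.get("ocr"):
                summary["ocr_pages"].append(event["page_num"])
            if event["confidence"] < min_confidence:
                summary["low_confidence"].append(event["page_num"])
        elif kind == "complete":
            summary["cost_usd"] = event.get("cost_usd")
    return summary


summary = summarize(SAMPLE.splitlines())
print(summary)
```

In a real run the lines would come from the standard output of `pdfmux stream` rather than a string, and `summarize` accepts any iterable of lines.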
### `pdfmux watch`
Auto-convert any PDFs that land in a directory. Runs until you press Ctrl+C.
pdfmux watch ./inbox/ -o ./output/ --profile bulk-rag
### `pdfmux diff`
Side-by-side extraction comparison (quality, content, cost).
pdfmux diff a.pdf b.pdf --quality standard
### `pdfmux profiles`
Saved configs live at `~/.config/pdfmux/profiles.yaml`. Built-ins ship for the common shapes; save your own for project defaults.
pdfmux profiles list
# invoices quality=standard, schema=invoice, format=json
# receipts quality=fast, schema=receipt, format=json
# papers quality=high, chunk=true, max_tokens=500
# contracts quality=high, schema=contract
# bulk-rag quality=standard, format=llm, chunk=true
pdfmux profiles show invoices
pdfmux profiles save my-default --quality high --format llm --chunk
pdfmux profiles delete my-default
# use a profile when converting
pdfmux convert file.pdf --profile invoices
## Python API
### Text extraction
import pdfmux
text = pdfmux.extract_text("report.pdf") # -> str (markdown)
text = pdfmux.extract_text("report.pdf", quality="fast") # PyMuPDF only, instant
text = pdfmux.extract_text("report.pdf", quality="high") # LLM-assisted
### Structured extraction
data = pdfmux.extract_json("report.pdf")
# data["page_count"] -> 12
# data["confidence"] -> 0.91
# data["ocr_pages"] -> [2, 5, 8]
# data["pages"][0]["key_values"] -> [{"key": "Date", "value": "2026-02-28"}]
# data["pages"][0]["tables"] -> [{"headers": [...], "rows": [...]}]
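The `tables` entries shown above map naturally onto CSV. The sketch below works on a hand-written dict in the same shape, so it runs without pdfmux installed; it assumes each row is a list aligned with `headers`, which the README does not spell out, and `tables_to_csv` is a hypothetical helper.

```python
import csv
import io


def tables_to_csv(data: dict) -> list[str]:
    """Render every table in an extract_json-style dict as CSV text."""
    out = []
    for page in data.get("pages", []):
        for table in page.get("tables", []):
            buf = io.StringIO()
            writer = csv.writer(buf, lineterminator="\n")
            writer.writerow(table["headers"])
            writer.writerows(table["rows"])
            out.append(buf.getvalue())
    return out


# Hand-written sample in the documented shape (not real pdfmux output).
sample = {
    "page_count": 1,
    "pages": [
        {
            "tables": [
                {
                    "headers": ["Item", "Qty"],
                    "rows": [["Widget, large", "2"], ["Bolt", "10"]],
                }
            ]
        }
    ],
}

csv_texts = tables_to_csv(sample)
print(csv_texts[0])
```

Using the `csv` module rather than joining on commas keeps cells that contain commas correctly quoted.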
### RAG chunking
chunks = pdfmux.chunk("report.pdf", max_tokens=500)
for c in chunks:
    print(f"{c['title']}: {c['tokens']} tokens (pages {c['page_start']}-{c['page_end']})")
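Once chunks carry token counts, fitting them into a prompt is a small packing problem. A minimal sketch, using hand-written chunk dicts with the keys shown above (`title`, `tokens`, `page_start`, `page_end`); `pack_chunks` and the greedy first-fit strategy are assumptions for the example, not part of pdfmux.

```python
def pack_chunks(chunks: list[dict], budget: int) -> tuple[list[dict], int]:
    """Greedily keep chunks, in document order, that still fit the token budget."""
    selected, used = [], 0
    for chunk in chunks:
        if used + chunk["tokens"] > budget:
            continue  # skip chunks that would overflow; later ones may still fit
        selected.append(chunk)
        used += chunk["tokens"]
    return selected, used


# Hand-written sample chunks (not real pdfmux output).
chunks = [
    {"title": "Summary", "tokens": 480, "page_start": 1, "page_end": 2},
    {"title": "Methods", "tokens": 350, "page_start": 3, "page_end": 5},
    {"title": "Appendix", "tokens": 200, "page_start": 6, "page_end": 6},
]

selected, used = pack_chunks(chunks, budget=700)
print([c["title"] for c in selected], used)
```

First-fit in document order is the simplest policy; a retrieval pipeline would normally rank chunks by relevance before packing.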
### Schema-guided extraction
data = pdfmux.extract_json("invoice.pdf", schema="invoice")
# Uses built-in invoice preset: extracts date, vendor, line items, totals
# Also accepts a path to a custom JSON Schema file
### Streaming (bounded memory)
from pdfmux.extractors import get_extractor
ext = get_extractor("fast")
for page in ext.extract("large-500-pages.pdf"): # Iterator[PageResult]
    process(page.text) # constant memory, even on 500-page PDFs
### Types and errors
from pdfmux import (
# Enums
Quality, # FAST, STANDARD, HIGH
OutputFormat, # MARKDOWN, JSON, CSV, LLM
PageQuality, # GOOD, BAD, EMPTY
# Data objects (frozen dataclasses)
PageResult, # page: text, page_num, c
…
## Source & license
This open-source MCP server is cataloged on AgentStack and links to its original source — we do not rehost the code.
- **Author:** [NameetP](https://github.com/NameetP)
- **Source:** [NameetP/pdfmux](https://github.com/NameetP/pdfmux)
- **License:** MIT
- **Homepage:** https://pdfmux.com
Install and usage instructions live in the source repository linked above.
Versions
- v1.5.1 Imported from the upstream source.