Welcome to your first experience with Parxy — a unified Python interface for document parsing.
This tutorial will guide you step-by-step through:
- Parsing your first document with Parxy
- Loading a PDF and extracting its text
- Understanding the unified Document model returned by all parsers
By the end of this tutorial, you'll be able to:
- Install and use Parxy as a Python library
- Parse documents with a single function call
- Access structured text through Parxy's unified data model
- Convert parsed documents to plain text or Markdown
Install Parxy from PyPI (or your development package):
pip install parxy
You can also install optional parser backends depending on your needs (e.g. PyMuPDF, Unstructured, LlamaParse):
pip install parxy[llama]
Let's start by parsing a simple PDF.
The easiest way is to use the Parxy.parse() method, which automatically selects the default parser (usually pymupdf).
from parxy_core.facade.parxy import Parxy
# Parse a document from a local file path
doc = Parxy.parse("samples/example.pdf")
# Print basic information
print(f"Pages: {len(doc.pages)}")
print(f"Title: {doc.metadata.title}")
You can also specify a parser explicitly:
doc = Parxy.parse("samples/example.pdf", driver_name=Parxy.PYMUPDF)
Or even pass an in-memory file:
import io
with open("samples/example.pdf", "rb") as f:
    pdf_bytes = io.BytesIO(f.read())
doc = Parxy.parse(pdf_bytes)
Each parser requires a configuration that can be specified through environment variables. Refer to config.py for details.
Once parsed, the returned object is a Document model — a structured representation of your file.
You can access its text content in different ways:
Get all text as a single string
text = doc.text()
print(text[:500])  # print first 500 characters
Convert the document to Markdown
markdown = doc.markdown()
print(markdown[:500])
This method preserves headings, paragraphs, and lists (when identified by the parser).
Every parser in Parxy returns the same structure, built with Pydantic:
Document
├── Metadata
├── Page[]
│   ├── TextBlock[]
│   │   ├── Line[]
│   │   │   ├── Span[]
│   │   │   │   ├── Character[]
│   │   │   │   └── ...
│   ├── ImageBlock[]
│   └── TableBlock[]
└── Outline[]
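To make the hierarchy concrete, here is a small stand-in sketch. It uses plain dataclasses rather than Parxy's actual Pydantic models, and every name in it (`SketchDocument`, `SketchPage`, `SketchTextBlock`, and the `text()` helpers) is an assumption made for illustration, not Parxy's real definition. It only shows how a nested structure of this shape can be flattened into a single string.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SketchTextBlock:
    """Stand-in for a text block; field names are illustrative only."""
    text: str
    bbox: Optional[Tuple[float, float, float, float]] = None
    category: Optional[str] = None


@dataclass
class SketchPage:
    """Stand-in for a page holding an ordered list of blocks."""
    number: int
    blocks: List[SketchTextBlock] = field(default_factory=list)

    def text(self) -> str:
        # Join block texts in order, one block per line.
        return "\n".join(block.text for block in self.blocks)


@dataclass
class SketchDocument:
    """Stand-in for the top-level document."""
    pages: List[SketchPage] = field(default_factory=list)

    def text(self) -> str:
        # Separate pages with a blank line.
        return "\n\n".join(page.text() for page in self.pages)


sketch_doc = SketchDocument(
    pages=[
        SketchPage(1, [
            SketchTextBlock("Title", category="heading"),
            SketchTextBlock("First paragraph."),
        ]),
        SketchPage(2, [SketchTextBlock("Second page text.")]),
    ]
)

print(len(sketch_doc.pages))
print(sketch_doc.text())
```

Because the traversal lives on the model rather than in the parser, any backend that fills in the same structure gets text extraction for free. That is presumably the motivation behind a unified model, though the real classes may organise this differently.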
Example:
page = doc.pages[0]
first_block = page.blocks[0]
print(first_block.text)
print(first_block.bbox)
print(first_block.category)
When you call:
doc = Parxy.parse("file.pdf")
Parxy performs the following steps:
- Initializes a singleton DriverFactory
- Selects the appropriate driver (e.g. PyMuPDF)
- Invokes the driver's .parse() method
- Returns a normalized Document object with consistent structure
This means you can switch parsers (e.g., from PyMuPDF to LlamaParse) without changing how you handle the output.
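The flow described above can be sketched as follows. This is not Parxy's implementation: `SketchDriverFactory`, `SketchDriver`, the two toy drivers, and `sketch_parse` are hypothetical stand-ins, and the normalized result is a plain dictionary instead of a Document. The point is only that callers depend on one output shape while the driver behind it changes.

```python
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class SketchDriver(ABC):
    """Hypothetical driver interface: every driver returns the same shape."""

    @abstractmethod
    def parse(self, source: str) -> dict:
        ...


class UpperDriver(SketchDriver):
    def parse(self, source: str) -> dict:
        return {"driver": "upper", "pages": [source.upper()]}


class ReverseDriver(SketchDriver):
    def parse(self, source: str) -> dict:
        return {"driver": "reverse", "pages": [source[::-1]]}


class SketchDriverFactory:
    """Hypothetical singleton that creates and caches driver instances."""

    _instance: Optional["SketchDriverFactory"] = None

    def __init__(self) -> None:
        self._registry: Dict[str, Type[SketchDriver]] = {
            "upper": UpperDriver,
            "reverse": ReverseDriver,
        }
        self._drivers: Dict[str, SketchDriver] = {}

    @classmethod
    def instance(cls) -> "SketchDriverFactory":
        # Create the factory on first use, then reuse it.
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

    def driver(self, name: str = "upper") -> SketchDriver:
        if name not in self._registry:
            raise ValueError(f"Unknown driver: {name!r}")
        if name not in self._drivers:
            self._drivers[name] = self._registry[name]()
        return self._drivers[name]


def sketch_parse(source: str, driver_name: str = "upper") -> dict:
    """Mirror the four steps: get factory, select driver, parse, return."""
    return SketchDriverFactory.instance().driver(driver_name).parse(source)


print(sketch_parse("abc"))
print(sketch_parse("abc", driver_name="reverse"))
```

Both calls return a dictionary with the same keys, so code that consumes the result does not need to know which driver produced it.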
In this tutorial you:
- Installed and imported Parxy
- Parsed a document with a single line of code
- Extracted text and Markdown
- Explored the unified document model
You're now ready to try more advanced use cases, such as:
- Using Parxy from the command line
- Processing multiple documents in parallel
- Comparing different parsers on the same document
- Extending Parxy with a custom driver
- Monitoring document processing with OpenTelemetry
Tip
If your parsed text seems incomplete or misaligned, try a different driver:
doc = Parxy.parse("file.pdf", driver_name=Parxy.UNSTRUCTURED_LIBRARY)
Each backend may specialize in different document types.