PDF Structured Output Server: Technical Overview
There's no getting away from it: AI is everywhere, and most businesses are experimenting with it to boost productivity and gain a competitive advantage.
I work for an AI startup called BookWyrm. We provide a set of API endpoints and a Python client that let customers build agentic pipelines for automation, workflows, and accurate chat tools based on business documents. You can learn about the endpoints in the BookWyrm documentation.
The aim of BookWyrm is to take the pain out of extracting data from business documents, so devs don't have to spend countless hours preparing data. They get to build straight away.
This article focuses on the PDF Structured Output Server we built, initially to enrich product data for OpenAI's Agentic Commerce Protocol. On reflection, we've since extended its use: by setting the Pydantic model yourself, you can get any structured output you want from your documents.
The premise is simple: set up the server, and you have a service that handles document processing to your requirements. Let the front-end create its own schema, or predefine schemas with an agreed output.
There are video walkthroughs of two cloneable examples to play with: Invoice Processing and Product Data Enrichment. Sorry about the AI voice; it sure beats my dulcet tones!
You can clone and play with the server and front-end examples from GitHub:
Server Repo: https://github.com/scidonia/pdf-structured-output-server
Invoice Processing Front-end: https://github.com/scidonia/demo-frontend-invoice-processing
Product Enrichment Front-end: https://github.com/scidonia/demo-frontend-product-enrichment
Intro to the Tutorial/Tech Overview
The PDF Structured Output Server is a FastAPI service that extracts structured data from PDFs using the BookWyrm API. It processes product brochures, catalogs, and invoices (and any other PDFs you need to process), returning structured JSON via a streaming API.
Architecture Overview
The server is a FastAPI application that orchestrates a three-step pipeline using BookWyrm's API endpoints. The architecture is designed for flexibility. You define the output schema, and the server handles the document processing.
Core Components
- FastAPI Server (server.py): Handles HTTP requests, manages parallel processing, and streams results via Server-Sent Events (SSE)
- ProductFeedGenerator (product_feed_generator.py): Orchestrates the BookWyrm API workflow. (It probably should be named something else; we initially started building a product enrichment example before expanding it to other uses.)
- Pydantic Models (models.py): Define your structured output schema
The BookWyrm Processing Pipeline
The server uses three BookWyrm endpoints in sequence to transform raw PDFs into structured data:
Step 1: PDF Text Extraction
stream = self.client.stream_extract_pdf(
    pdf_bytes=pdf_bytes,
    filename=pdf_path.name,
    start_page=1,
    num_pages=None  # Extract all pages
)
BookWyrm's /extract_pdf endpoint takes care of PDF parsing, extracting text while preserving structure and layout. It copes with complex layouts, tables, and multi-column formats.
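On the client side, the streamed responses get gathered into a single string for the next step. Here's a minimal sketch of that, assuming each streamed response carries a text payload; the text attribute is hypothetical, so check the BookWyrm client docs for the actual response shape.
# Sketch: accumulate streamed extraction output into one string.
# NOTE: the `text` attribute is an assumption, not the documented response shape.
text_parts = []
for response in stream:
    text = getattr(response, "text", None)
    if text:
        text_parts.append(text)
text_content = "\n".join(text_parts)  # Raw text handed to Step 2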
Step 2: Text to Phrasal Processing
stream = self.client.stream_process_text(
    text=text_content,
    response_format="WITH_OFFSETS"
)
The /process_text endpoint converts raw text into a phrasal representation: semantic chunks that preserve context and relationships. This improves extraction accuracy.
Step 3: Structured Extraction
stream = self.client.stream_summarize(
    phrases=phrases,
    summary_class=ProductExtractionModel,  # Your Pydantic model
    model_strength="wise",
    debug=False
)
The /summarize/sse endpoint extracts structured data matching your Pydantic model or JSON schema. BookWyrm uses the schema to identify and extract relevant fields.
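Chained together inside ProductFeedGenerator, the pipeline looks roughly like the sketch below. The _collect_* helpers are hypothetical stand-ins for however you gather each stream's output (as in the Step 1 sketch); this is not a copy of the repo code.
from pathlib import Path

def extract_structured(self, pdf_path: Path) -> dict:
    """Run the three-step BookWyrm pipeline for one PDF (sketch)."""
    pdf_bytes = pdf_path.read_bytes()

    # Step 1: PDF bytes -> raw text
    text_content = self._collect_text(self.client.stream_extract_pdf(
        pdf_bytes=pdf_bytes, filename=pdf_path.name, start_page=1, num_pages=None))

    # Step 2: raw text -> phrasal chunks with offsets
    phrases = self._collect_phrases(self.client.stream_process_text(
        text=text_content, response_format="WITH_OFFSETS"))

    # Step 3: phrases -> structured data matching the schema
    return self._collect_summary(self.client.stream_summarize(
        phrases=phrases, summary_class=ProductExtractionModel,
        model_strength="wise", debug=False))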
Flexible Schema Definition
You can define your output schema in two ways:
Option 1: Pydantic Models (CLI Mode)
Define a Pydantic model in models/models.py:
from typing import Optional

from pydantic import BaseModel, Field

class ProductExtractionModel(BaseModel):
    title: Optional[str] = Field(
        None,
        max_length=150,
        description="Product title as mentioned in the document"
    )
    price: Optional[str] = Field(
        None,
        description="Regular price with currency code (e.g., '79.99 USD')"
    )
    dimensions: Optional[str] = Field(
        None,
        description="Overall dimensions with units (e.g., '12x8x5 in')"
    )
Option 2: JSON Schema (API Mode)
Pass a JSON schema directly in the API request:
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Product name or title"
    },
    "price": {
      "type": "number",
      "description": "Product price as a numeric value"
    }
  },
  "required": ["title"]
}
This lets front-end applications define schemas dynamically, without server-side code changes.
Note that within the schema, each description matters. It tells BookWyrm exactly what you want to extract, so treat it like a prompt; it has a direct bearing on your app's extraction quality.
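For instance, a description written like a prompt leaves far less room for ambiguity than a bare field name. The weight field here is purely illustrative:
{
  "weight": {
    "type": "string",
    "description": "Shipping weight including units, e.g. '2.4 kg'. If several weights appear, prefer the shipping weight over the net product weight."
  }
}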
API Usage
Starting the Server
uv run pdf-server serve --host 0.0.0.0 --port 8000
Processing PDFs via API
The /process endpoint accepts multiple PDFs and returns streaming results:
curl -X POST "http://localhost:8000/process" \
  -H "Authorization: Bearer $BOOKWYRM_API_KEY" \
  -F "files=@product_brochure.pdf" \
  -F "schema_name=ProductSummary" \
  -F 'json_schema={"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}},"required":["title"]}'
Streaming Response Format
The API returns Server-Sent Events (SSE) with different message types:
// Status update
{"type": "status", "message": "Starting processing of 2 files"}
// Progress update
{"type": "progress", "completed": 1, "total": 2, "percentage": 50.0}
// Result
{"type": "result", "filename": "product_brochure.pdf", "data": {...}}
// Completion
{"type": "complete", "message": "Processed 2 of 2 files successfully"}
This enables real-time progress updates in frontend applications.
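If you're consuming the stream from Python rather than a browser, something like the sketch below works with the requests library. It assumes the server frames events as standard "data:" lines; treat that framing detail as an assumption and check it against the repo.
import json
import requests

# Hypothetical consumer for the /process SSE stream.
with requests.post(
    "http://localhost:8000/process",
    headers={"Authorization": "Bearer YOUR_BOOKWYRM_API_KEY"},
    files={"files": open("product_brochure.pdf", "rb")},
    data={
        "schema_name": "ProductSummary",
        "json_schema": '{"type":"object","properties":{"title":{"type":"string"}},"required":["title"]}',
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue  # Skip blank keep-alives and non-data SSE fields
        event = json.loads(line[len(b"data:"):])
        if event["type"] == "progress":
            print(f"{event['percentage']:.0f}% complete")
        elif event["type"] == "result":
            print(event["filename"], event["data"])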
Parallel Processing
The server processes multiple PDFs concurrently using a ThreadPoolExecutor:
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
    tasks = [
        executor.submit(self._process_single_pdf, generator, pdf_path, schema_name, schema_dict)
        for pdf_path in pdf_paths
    ]
Results stream as they complete, improving throughput for batch processing.
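Yielding in completion order rather than submission order is what keeps the stream responsive. With the standard library that looks roughly like this sketch (the _stream_results name and the future-to-path mapping are assumptions, not the repo's exact code):
from concurrent.futures import ThreadPoolExecutor, as_completed

def _stream_results(self, generator, pdf_paths, schema_name, schema_dict):
    """Sketch: yield (path, result) pairs as each PDF finishes."""
    with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
        futures = {
            executor.submit(self._process_single_pdf, generator, pdf_path, schema_name, schema_dict): pdf_path
            for pdf_path in pdf_paths
        }
        # as_completed yields each future the moment it finishes,
        # so fast PDFs aren't stuck behind slow ones
        for future in as_completed(futures):
            yield futures[future], future.result()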
Error Handling & Resilience
The server includes error handling, sketched in code after this list:
- Individual PDF failures don't stop batch processing
- Graceful fallbacks when extraction fails
- Clear error messages in the streaming response
- Validation of input files and schemas
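Concretely, the first two points boil down to a try/except around each file's result, extending the _stream_results sketch above. The error event shape here is illustrative, not the server's exact contract:
    # Inside the same _stream_results sketch:
    for future in as_completed(futures):
        pdf_path = futures[future]
        try:
            yield {"type": "result", "filename": pdf_path.name, "data": future.result()}
        except Exception as exc:
            # One bad PDF emits an error event without aborting the batch
            yield {"type": "error", "filename": pdf_path.name, "message": str(exc)}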
Use Cases
1. Product Data Enrichment
Extract product information from brochures and catalogs for e-commerce feeds. The example model includes fields for title, price, dimensions, specifications, and more.
2. Invoice Processing
Extract structured data from invoices—vendor, date, line items, totals—by defining an invoice-specific schema.
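A minimal invoice schema along those lines might look like this; the exact fields are illustrative:
from typing import List, Optional

from pydantic import BaseModel, Field

class LineItem(BaseModel):
    description: Optional[str] = Field(None, description="What was billed, as written on the invoice")
    quantity: Optional[float] = Field(None, description="Quantity billed for this line")
    amount: Optional[str] = Field(None, description="Line total with currency code (e.g., '120.00 EUR')")

class InvoiceExtractionModel(BaseModel):
    vendor: Optional[str] = Field(None, description="Name of the vendor issuing the invoice")
    invoice_date: Optional[str] = Field(None, description="Invoice date as printed (e.g., '2024-03-01')")
    line_items: List[LineItem] = Field(default_factory=list, description="Individual billed line items")
    total: Optional[str] = Field(None, description="Grand total with currency code")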
3. Custom Document Processing
Define any Pydantic model or JSON schema to extract data from contracts, reports, or forms.
Benefits for Developers
- No prompt engineering: Define your schema; BookWyrm handles extraction
- Type safety: Pydantic models provide validation and IDE support
- Streaming API: Real-time progress for better UX
- Parallel processing: Efficient batch handling
- Flexible schemas: JSON schema support for dynamic front-end needs
- Production-ready: Error handling, validation, and logging
Getting Started
- Clone the repository:
git clone https://github.com/scidonia/pdf-structured-output-server
cd pdf-structured-output-server
- Install dependencies:
uv sync
- Set your BookWyrm API key (not needed if sending authorization headers via the front-end):
export BOOKWYRM_API_KEY=your_api_key_here
- Start the server:
uv run pdf-server serve
- Process your first PDF:
curl -X POST "http://localhost:8000/process" \
  -H "Authorization: Bearer $BOOKWYRM_API_KEY" \
  -F "files=@your_document.pdf" \
  -F "schema_name=MySchema" \
  -F 'json_schema={"type":"object","properties":{"title":{"type":"string"}}}'
Try it for yourself and let us know what you think
The PDF Structured Output Server demonstrates how BookWyrm's API endpoints can be combined to build production-ready document processing pipelines. By abstracting away the complexity of PDF parsing and AI extraction, developers can focus on defining their data requirements and building applications.
The streaming API, parallel processing, and flexible schema system make it suitable for both manual document processing workflows and automated pipelines. Whether you're enriching product catalogs, processing invoices, or extracting custom data from business documents, this server provides a solid foundation.
Check out the GitHub repository to explore the code, and visit BookWyrm's documentation to learn more about the API endpoints powering this solution.
The actual server was built in half a day with assistance from Aider and our AI-integration docs.