PDF Structured Output Server: Technical Overview
There's no getting away from it: AI is everywhere, and most businesses are experimenting with it to boost productivity and gain a competitive advantage.
I work for an AI startup called BookWyrm. We provide a set of API endpoints and a Python client that let customers build agentic pipelines for automation, workflows, and accurate chat tools based on business documents. You can learn about the endpoints in the BookWyrm documentation.
The aim of BookWyrm is to take the pain out of extracting data from business documents, so devs don't have to spend countless hours preparing data. They get to build straight away.
This article focuses on the PDF Structured Output Server we built, initially to enrich product data for OpenAI's Agentic Commerce Protocol. On reflection, we've since extended its use: by setting the Pydantic model yourself, you can get any structured output you want from your documents.
The premise is simple: set up the server, and you have a service that handles document processing to your requirements. Let the front-end create its own schema, or predefine schemas with an agreed output.
There are video walkthroughs of two cloneable examples to play with: Invoice Processing and Product Data Enrichment. Sorry about the AI voice; it sure beats my dulcet tones!
You can clone and play with the server and front-end examples from GitHub:
Server Repo: https://github.com/scidonia/pdf-structured-output-server
Invoice Processing Front-end: https://github.com/scidonia/demo-frontend-invoice-processing
Product Enrichment Front-end: https://github.com/scidonia/demo-frontend-product-enrichment
Intro to the Tutorial/Tech Overview
The PDF Structured Output Server is a FastAPI service that extracts structured data from PDFs using the BookWyrm API. It processes product brochures, catalogs, and invoices (and any other PDFs you need to process), returning structured JSON via a streaming API.
Architecture Overview
The server is a FastAPI application that orchestrates a three-step pipeline using BookWyrm's API endpoints. The architecture is designed for flexibility. You define the output schema, and the server handles the document processing.
Core Components
- FastAPI Server (server.py): Handles HTTP requests, manages parallel processing, and streams results via Server-Sent Events (SSE)
- ProductFeedGenerator (product_feed_generator.py): Orchestrates the BookWyrm API workflow. (It probably should be named something else; we initially started building a product enrichment example before expanding it to other uses.)
- Pydantic Models (models.py): Define your structured output schema
The BookWyrm Processing Pipeline
The server uses three BookWyrm endpoints in sequence to transform raw PDFs into structured data:
Step 1: PDF Text Extraction
stream = self.client.stream_extract_pdf(
    pdf_bytes=pdf_bytes,
    filename=pdf_path.name,
    start_page=1,
    num_pages=None  # Extract all pages
)
BookWyrm's /extract_pdf endpoint takes care of PDF parsing, extracting text while preserving structure and layout. It copes with complex layouts, tables, and multi-column formats.
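On the client side, the streamed responses get gathered into a single string for the next step. Here's a minimal sketch of that, assuming each streamed response carries a text payload; the text attribute is hypothetical, so check the BookWyrm client docs for the actual response shape.
# Sketch: accumulate streamed extraction output into one string.
# NOTE: the `text` attribute is an assumption, not the documented response shape.
text_parts = []
for response in stream:
    text = getattr(response, "text", None)
    if text:
        text_parts.append(text)
text_content = "\n".join(text_parts)  # Raw text handed to Step 2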
Step 2: Text to Phrasal Processing
stream = self.client.stream_process_text(
    text=text_content,
    response_format="WITH_OFFSETS"
)
The /process_text endpoint converts raw text into a phrasal representation: semantic chunks that preserve context and relationships. This improves extraction accuracy.
Step 3: Structured Extraction
stream = self.client.stream_summarize(
    phrases=phrases,
    summary_class=ProductExtractionModel,  # Your Pydantic model
    model_strength="wise",
    debug=False
)
The /summarize/sse endpoint extracts structured data matching your Pydantic model or JSON schema. BookWyrm uses the schema to identify and extract relevant fields.
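Chained together inside ProductFeedGenerator, the pipeline looks roughly like the sketch below. The _collect_* helpers are hypothetical stand-ins for however you gather each stream's output (as in the Step 1 sketch); this is not a copy of the repo code.
from pathlib import Path

def extract_structured(self, pdf_path: Path) -> dict:
    """Run the three-step BookWyrm pipeline for one PDF (sketch)."""
    pdf_bytes = pdf_path.read_bytes()

    # Step 1: PDF bytes -> raw text
    text_content = self._collect_text(self.client.stream_extract_pdf(
        pdf_bytes=pdf_bytes, filename=pdf_path.name, start_page=1, num_pages=None))

    # Step 2: raw text -> phrasal chunks with offsets
    phrases = self._collect_phrases(self.client.stream_process_text(
        text=text_content, response_format="WITH_OFFSETS"))

    # Step 3: phrases -> structured data matching the schema
    return self._collect_summary(self.client.stream_summarize(
        phrases=phrases, summary_class=ProductExtractionModel,
        model_strength="wise", debug=False))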
Flexible Schema Definition
You can define your output schema in two ways:
Option 1: Pydantic Models (CLI Mode)
Define a Pydantic model in models/models.py:
from typing import Optional

from pydantic import BaseModel, Field

class ProductExtractionModel(BaseModel):
    title: Optional[str] = Field(
        None,
        max_length=150,
        description="Product title as mentioned in the document"
    )
    price: Optional[str] = Field(
        None,
        description="Regular price with currency code (e.g., '79.99 USD')"
    )
    dimensions: Optional[str] = Field(
        None,
        description="Overall dimensions with units (e.g., '12x8x5 in')"
    )
Option 2: JSON Schema (API Mode)
Pass a JSON schema directly in the API request:
{
  "type": "object",
  "properties": {
    "title": {
      "type": "string",
      "description": "Product name or title"
    },
    "price": {
      "type": "number",
      "description": "Product price as a numeric value"
    }
  },
  "required": ["title"]
}
This lets front-end applications define schemas dynamically, without server-side code changes.
Note that within the schema, each description matters. It tells BookWyrm exactly what you want to extract, so treat it like a prompt; it has a direct bearing on your app's extraction quality.
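For instance, a description written like a prompt leaves far less room for ambiguity than a bare field name. The weight field here is purely illustrative:
{
  "weight": {
    "type": "string",
    "description": "Shipping weight including units, e.g. '2.4 kg'. If several weights appear, prefer the shipping weight over the net product weight."
  }
}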
API Usage
Starting the Server
uv run pdf-server serve --host 0.0.0.0 --port 8000
Processing PDFs via API
The /process endpoint accepts multiple PDFs and returns streaming results:
curl -X POST "http://localhost:8000/process" \
  -H "Authorization: Bearer $BOOKWYRM_API_KEY" \
  -F "files=@product_brochure.pdf" \
  -F "schema_name=ProductSummary" \
  -F 'json_schema={"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}},"required":["title"]}'
Streaming Response Format
The API returns Server-Sent Events (SSE) with different message types:
// Status update
{"type": "status", "message": "Starting processing of 2 files"}
// Progress update
{"type": "progress", "completed": 1, "total": 2, "percentage": 50.0}
// Result
{"type": "result", "filename": "product_brochure.pdf", "data": {...}}
// Completion
{"type": "complete", "message": "Processed 2 of 2 files successfully"}
This enables real-time progress updates in frontend applications.
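If you're consuming the stream from Python rather than a browser, something like the sketch below works with the requests library. It assumes the server frames events as standard "data:" lines; treat that framing detail as an assumption and check it against the repo.
import json
import requests

# Hypothetical consumer for the /process SSE stream.
with requests.post(
    "http://localhost:8000/process",
    headers={"Authorization": "Bearer YOUR_BOOKWYRM_API_KEY"},
    files={"files": open("product_brochure.pdf", "rb")},
    data={
        "schema_name": "ProductSummary",
        "json_schema": '{"type":"object","properties":{"title":{"type":"string"}},"required":["title"]}',
    },
    stream=True,
) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue  # Skip blank keep-alives and non-data SSE fields
        event = json.loads(line[len(b"data:"):])
        if event["type"] == "progress":
            print(f"{event['percentage']:.0f}% complete")
        elif event["type"] == "result":
            print(event["filename"], event["data"])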
Parallel Processing
The server processes multiple PDFs concurrently using a ThreadPoolExecutor:
with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
    tasks = [
        executor.submit(self._process_single_pdf, generator, pdf_path, schema_name, schema_dict)
        for pdf_path in pdf_paths
    ]
Results stream as they complete, improving throughput for batch processing.
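Yielding in completion order rather than submission order is what keeps the stream responsive. With the standard library that looks roughly like this sketch (the _stream_results name and the future-to-path mapping are assumptions, not the repo's exact code):
from concurrent.futures import ThreadPoolExecutor, as_completed

def _stream_results(self, generator, pdf_paths, schema_name, schema_dict):
    """Sketch: yield (path, result) pairs as each PDF finishes."""
    with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
        futures = {
            executor.submit(self._process_single_pdf, generator, pdf_path, schema_name, schema_dict): pdf_path
            for pdf_path in pdf_paths
        }
        # as_completed yields each future the moment it finishes,
        # so fast PDFs aren't stuck behind slow ones
        for future in as_completed(futures):
            yield futures[future], future.result()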
Error Handling & Resilience
The server includes error handling, sketched in code after this list:
- Individual PDF failures don't stop batch processing
- Graceful fallbacks when extraction fails
- Clear error messages in the streaming response
- Validation of input files and schemas
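Concretely, the first two points boil down to a try/except around each file's result, extending the _stream_results sketch above. The error event shape here is illustrative, not the server's exact contract:
    # Inside the same _stream_results sketch:
    for future in as_completed(futures):
        pdf_path = futures[future]
        try:
            yield {"type": "result", "filename": pdf_path.name, "data": future.result()}
        except Exception as exc:
            # One bad PDF emits an error event without aborting the batch
            yield {"type": "error", "filename": pdf_path.name, "message": str(exc)}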
Use Cases
1. Product Data Enrichment
Extract product information from brochures and catalogs for e-commerce feeds. The example model includes fields for title, price, dimensions, specifications, and more.
2. Invoice Processing
Extract structured data from invoices—vendor, date, line items, totals—by defining an invoice-specific schema.
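A minimal invoice schema along those lines might look like this; the exact fields are illustrative:
from typing import List, Optional

from pydantic import BaseModel, Field

class LineItem(BaseModel):
    description: Optional[str] = Field(None, description="What was billed, as written on the invoice")
    quantity: Optional[float] = Field(None, description="Quantity billed for this line")
    amount: Optional[str] = Field(None, description="Line total with currency code (e.g., '120.00 EUR')")

class InvoiceExtractionModel(BaseModel):
    vendor: Optional[str] = Field(None, description="Name of the vendor issuing the invoice")
    invoice_date: Optional[str] = Field(None, description="Invoice date as printed (e.g., '2024-03-01')")
    line_items: List[LineItem] = Field(default_factory=list, description="Individual billed line items")
    total: Optional[str] = Field(None, description="Grand total with currency code")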
3. Custom Document Processing
Define any Pydantic model or JSON schema to extract data from contracts, reports, or forms.
Benefits for Developers
- No prompt engineering: Define your schema; BookWyrm handles extraction
- Type safety: Pydantic models provide validation and IDE support
- Streaming API: Real-time progress for better UX
- Parallel processing: Efficient batch handling
- Flexible schemas: JSON schema support for dynamic front-end needs
- Production-ready: Error handling, validation, and logging
Getting Started
- Clone the repository:
git clone https://github.com/scidonia/pdf-structured-output-server
cd pdf-structured-output-server
- Install dependencies:
uv sync
- Set your BookWyrm API key (not needed if sending authorization headers via the front-end):
export BOOKWYRM_API_KEY=your_api_key_here
- Start the server:
uv run pdf-server serve
- Process your first PDF:
curl -X POST "http://localhost:8000/process" \
  -H "Authorization: Bearer $BOOKWYRM_API_KEY" \
  -F "files=@your_document.pdf" \
  -F "schema_name=MySchema" \
  -F 'json_schema={"type":"object","properties":{"title":{"type":"string"}}}'
Try it for yourself and let us know what you think
The PDF Structured Output Server demonstrates how BookWyrm's API endpoints can be combined to build production-ready document processing pipelines. By abstracting away the complexity of PDF parsing and AI extraction, developers can focus on defining their data requirements and building applications.
The streaming API, parallel processing, and flexible schema system make it suitable for both manual document processing workflows and automated pipelines. Whether you're enriching product catalogs, processing invoices, or extracting custom data from business documents, this server provides a solid foundation.
Check out the GitHub repository to explore the code, and visit BookWyrm's documentation to learn more about the API endpoints powering this solution.
The actual server was built in half a day with assistance from Aider and our AI-integration docs.