Stefan Vitoria

Building a Page-Level PDF Processing Pipeline for Smarter RAG Systems

How I solved the granularity problem in document-based AI applications using Wasp, pdf2pic, and smart architecture decisions


The Problem: When Your RAG System Can't Point to the Right Page

Picture this: You're building an AI-powered document analysis system. Users upload PDFs, ask questions, and expect precise answers with exact references. But there's a frustrating problem—your RAG (Retrieval-Augmented Generation) system can tell users what information exists but not where to find it.

❌ User: "Where can I find the tax information?"
❌ AI: "The tax rate is 15%"
❌ User: "But which page?"
❌ AI: 🤷‍♂️

The challenge: existing solutions didn't give me the control and page-level granularity my RAG system needed. Most approaches either lacked precise page referencing, were overly complex for my use case, or didn't integrate well with my existing technology stack.

This becomes a dealbreaker for professional applications where users need to verify information, cite sources, or navigate large documents efficiently.

The Challenge: Finding the Right Fit for My Requirements

When building my document-based AI system, I had specific requirements that existing solutions couldn't meet:

What I Needed:

  • Page-level granularity: Each page processed and referenced individually
  • Cost-effective scaling: Handle volume processing without breaking the bank
  • Simple integration: Work seamlessly with my Wasp application stack
  • Full control: Customize the processing pipeline for my specific needs
  • Reliable quality: Consistent results across various PDF formats

What I Tried:

  • Cloud-based solutions: Often too expensive and complex for my specific use case
  • Basic PDF parsing: Inconsistent results with complex layouts and images
  • Simple OCR libraries: Required too much manual optimization and tuning

I needed a solution that was purpose-built and cost-effective, and that gave me complete control over the processing pipeline.

The Solution: PDF-to-Image Pipeline with Smart Processing

After experimenting with various approaches, I developed a pipeline that converts PDFs into individual page images, processes each page separately, and maintains precise page references throughout the RAG system.

Architecture Overview

PDF Upload → Page Images → Individual OCR → RAG with Page References
     ↓              ↓              ↓              ↓
  [file.pdf]  [page_1.png]  [text + page #]  "Found on page 3"
             [page_2.png]
             [page_3.png]
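The page-reference step in this flow can be sketched as a pure helper. pdf2pic names its output files `<saveFilename>.<page>.<format>` (visible in the response example later in this post); the helper below is illustrative, not part of the article's pipeline:

```typescript
interface PageRef {
  pageNumber: number;
  imagePath: string;
}

// Illustrative helper: recover ordered page references from
// pdf2pic-style filenames like "doc_page.1.png", "doc_page.2.png".
function toPageRefs(imageFiles: string[]): PageRef[] {
  return imageFiles
    .map((imagePath) => {
      // Extract the trailing ".<n>." page index from the filename
      const match = imagePath.match(/\.(\d+)\.\w+$/);
      return match ? { pageNumber: Number(match[1]), imagePath } : null;
    })
    .filter((ref): ref is PageRef => ref !== null)
    .sort((a, b) => a.pageNumber - b.pageNumber);
}
```

Ordering by the parsed page number (rather than trusting array order) keeps references stable even if the conversion library returns results out of sequence.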

Implementation: Building with Wasp and pdf2pic

I chose Wasp for its full-stack TypeScript approach and built-in file handling capabilities. Here's how I implemented the solution:

Step 1: Setting Up File Upload with Multer

First, I configured multer for handling PDF uploads directly in memory:

// src/server/data_ingestion/apis/ingest-pdf.ts
import path from "path";
import fs from "fs/promises"; // used below for the image output directory
import multer from "multer";
import { HttpError, type MiddlewareConfigFn } from "wasp/server";
import { fromBuffer } from "pdf2pic";

const MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024; // 10MB limit

export const ingestPdfMiddlewareConfigFn: MiddlewareConfigFn = (middlewareConfig) => {
  const upload = multer({
    storage: multer.memoryStorage(), // Keep in memory for processing
    limits: { fileSize: MAX_FILE_SIZE_BYTES },
    fileFilter: (_req, file, cb) => {
      if (file.mimetype === 'application/pdf') {
        cb(null, true);
      } else {
        cb(new Error('Only PDF files are allowed'));
      }
    },
  });

  middlewareConfig.set('multer.single', upload.single('pdf'));
  return middlewareConfig;
};

Step 2: Wasp API Configuration

In main.wasp, I used the new apiNamespace feature for cleaner middleware management:

// main.wasp
apiNamespace fileUploadMiddleware {
  middlewareConfigFn: import { ingestPdfMiddlewareConfigFn } from "@src/server/data_ingestion/apis/ingest-pdf",
  path: "/api/v1/data-ingestion"
}

api ingestPdf {
  httpRoute: (POST, "/api/v1/data-ingestion/pdf"),
  fn: import { ingestPdf } from "@src/server/data_ingestion/apis/ingest-pdf",
  entities: [User]
}

Step 3: The Heart of the System - PDF Processing Function

This is where the magic happens. The function takes a PDF buffer and converts each page to an individual image:

interface PageInfo {
  pageNumber: number;
  imagePath: string;
}

interface FileMetadata {
  name: string;
  originalName: string;
  type: string;
  size: number;
}

async function processPdfToImages(
  fileBuffer: Buffer, 
  fileMetadata: FileMetadata
): Promise<PageInfo[]> {
  try {
    // Validate the buffer
    if (!fileBuffer || fileBuffer.length === 0) {
      throw new HttpError(400, "Invalid PDF file buffer");
    }

    // Create storage directory
    const imagesDir = path.join(
      process.cwd(), 
      'src', 'server', 'data_ingestion', 'pdf_to_pic_images'
    );
    await fs.mkdir(imagesDir, { recursive: true });

    // Sanitize filename for file system compatibility
    const baseName = path.parse(fileMetadata.originalName)
      .name.replace(/[^a-zA-Z0-9_-]/g, '_');

    // Configure pdf2pic with optimal settings
    const convertPdfOptions = {
      density: 150,           // 150 DPI - sweet spot for OCR accuracy
      saveFilename: `${baseName}_page`,
      savePath: imagesDir,
      format: "png" as const, // PNG for lossless text clarity
      width: 800,             // Target width in px (keeps file sizes small)
      height: 1200            // Target height in px for a portrait page
    };

    const convert = fromBuffer(fileBuffer, convertPdfOptions);

    // Convert all pages (-1 means all pages)
    const results = await convert.bulk(-1, { responseType: "image" });

    // Process results into our page info structure
    const pageInfos: PageInfo[] = [];

    if (results && Array.isArray(results)) {
      results.forEach((result, index) => {
        if (result.path) {
          const fileName = path.basename(result.path);
          pageInfos.push({
            pageNumber: index + 1,
            imagePath: fileName
          });
        }
      });
    }

    if (pageInfos.length === 0) {
      throw new HttpError(400, "No pages could be extracted from PDF");
    }

    return pageInfos;

  } catch (error) {
    if (error instanceof HttpError) throw error;

    console.error("PDF processing error:", error);
    throw new HttpError(500, "Failed to process PDF pages");
  }
}

Step 4: Main API Handler with Enhanced Response

The main API handler orchestrates the entire process:

export const ingestPdf: IngestPdf = async (req, res, _context) => {
  try {
    if (!req.file) {
      throw new HttpError(400, "No PDF file provided");
    }

    // Extract metadata
    const fileMetadata: FileMetadata = {
      name: req.file.filename || req.file.originalname, // memoryStorage sets no filename, so this falls back to originalname
      originalName: req.file.originalname,
      type: req.file.mimetype,
      size: req.file.size,
    };

    // Validate file type and size
    if (req.file.mimetype !== 'application/pdf') {
      throw new HttpError(400, "File must be a PDF");
    }

    if (req.file.size > MAX_FILE_SIZE_BYTES) {
      throw new HttpError(400, `File exceeds ${MAX_FILE_SIZE_BYTES / 1024 / 1024}MB limit`);
    }

    // Process PDF to images
    const pages = await processPdfToImages(req.file.buffer, fileMetadata);

    // Return enhanced response with page information
    res.json({
      success: true,
      message: "PDF processed successfully",
      file: fileMetadata,
      pages: pages,
      totalPages: pages.length
    });

  } catch (error) {
    if (error instanceof HttpError) throw error;
    console.error("Error processing PDF upload:", error);
    throw new HttpError(500, "Internal server error processing PDF");
  }
};

Error Handling: Production-Ready Considerations

Real-world PDF processing is messy. Here are the error scenarios I handle:

1. Corrupted or Invalid PDFs

try {
  results = await convert.bulk(-1, { responseType: "image" });
} catch (conversionError) {
  console.error("PDF conversion error:", conversionError);
  throw new HttpError(400, "Failed to convert PDF - file may be corrupted or password-protected");
}

2. File System Issues

try {
  await fs.mkdir(imagesDir, { recursive: true });
} catch (dirError) {
  console.error("Failed to create images directory:", dirError);
  throw new HttpError(500, "Failed to create storage directory");
}

3. Resource Limits

// In multer configuration
limits: { 
  fileSize: MAX_FILE_SIZE_BYTES,
  files: 1 // Only one file at a time
}
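When multer rejects a file, the error carries a `code` such as `LIMIT_FILE_SIZE`. A small helper can translate those into user-facing responses; this sketch is hypothetical (the status codes and messages are my own choices, not something Wasp or multer prescribes):

```typescript
interface UploadErrorResponse {
  status: number;
  message: string;
}

// Hypothetical mapping from multer-style upload errors to HTTP responses.
// The `code` values are the ones multer sets; statuses/messages are my own.
function mapUploadError(err: { code?: string; message: string }): UploadErrorResponse {
  switch (err.code) {
    case "LIMIT_FILE_SIZE":
      return { status: 413, message: "PDF exceeds the 10MB upload limit" };
    case "LIMIT_FILE_COUNT":
      return { status: 400, message: "Only one PDF may be uploaded at a time" };
    default:
      return { status: 400, message: err.message };
  }
}
```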

API Usage: From Upload to Page References

Making a Request

# Using cURL
curl -X POST \
  -F "pdf=@document.pdf" \
  http://localhost:3001/api/v1/data-ingestion/pdf

// Using JavaScript fetch
const formData = new FormData();
formData.append('pdf', fileInput.files[0]);

const response = await fetch('/api/v1/data-ingestion/pdf', {
  method: 'POST',
  body: formData
});

const result = await response.json();

Enhanced Response Structure

{
  "success": true,
  "message": "PDF processed successfully",
  "file": {
    "name": "tax_document.pdf",
    "originalName": "tax_document.pdf",
    "type": "application/pdf",
    "size": 2048576
  },
  "pages": [
    { "pageNumber": 1, "imagePath": "tax_document_page.1.png" },
    { "pageNumber": 2, "imagePath": "tax_document_page.2.png" },
    { "pageNumber": 3, "imagePath": "tax_document_page.3.png" }
  ],
  "totalPages": 3
}
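On the client, the `pages` array converts naturally into a page-number lookup for jumping straight to a cited page. A minimal sketch assuming the response shape above:

```typescript
interface IngestResponse {
  success: boolean;
  pages: { pageNumber: number; imagePath: string }[];
  totalPages: number;
}

// Build a pageNumber -> imagePath map so the UI can resolve
// "found on page 3" to the corresponding page image.
function buildPageIndex(result: IngestResponse): Map<number, string> {
  return new Map(result.pages.map((p) => [p.pageNumber, p.imagePath]));
}
```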

Integration with RAG Systems

Now comes the payoff. With individual page images, I can:

1. Process Each Page with High-Accuracy OCR

// Example: Process with Mistral or other OCR service
async function processPageWithOCR(imagePath: string, pageNumber: number) {
  const imageBuffer = await fs.readFile(imagePath);

  const ocrResult = await mistralOCR.processImage(imageBuffer);

  return {
    pageNumber,
    content: ocrResult.text,
    confidence: ocrResult.confidence,
    imagePath
  };
}

2. Index Content with Page References

// Example: Store in vector database with page metadata
// (documentId is assumed to come from the surrounding ingestion flow)
async function indexPageContent(pageData) {
  const embedding = await openai.embeddings.create({
    input: pageData.content,
    model: "text-embedding-3-small"
  });

  await vectorDB.upsert({
    id: `${documentId}_page_${pageData.pageNumber}`,
    values: embedding.data[0].embedding,
    metadata: {
      documentId,
      pageNumber: pageData.pageNumber,
      imagePath: pageData.imagePath,
      content: pageData.content
    }
  });
}

3. Provide Precise Query Responses

// Example: RAG query with page references
async function queryWithPageReferences(question: string) {
  const results = await vectorDB.query({
    vector: await embedQuestion(question),
    topK: 3,
    includeMetadata: true
  });

  return {
    answer: await generateAnswer(question, results),
    sources: results.map(r => ({
      pageNumber: r.metadata.pageNumber,
      imagePath: r.metadata.imagePath,
      relevanceScore: r.score
    }))
  };
}

Results: The Transformation in Action

Before Implementation

❌ User: "What's the tax rate mentioned in the document?"
❌ AI: "The tax rate is 15%"
❌ User: "Where exactly?"
❌ AI: "I found that information in the document"

After Implementation

✅ User: "What's the tax rate mentioned in the document?"
✅ AI: "The tax rate is 15%, found on page 3 of the document"
✅ User: "Can you show me the exact page?"
✅ AI: [Returns page 3 image with highlighted relevant section]

Performance Considerations

File Size and Processing Time

  • Small PDFs (1-5 pages): ~2-3 seconds processing time
  • Medium PDFs (10-20 pages): ~8-12 seconds processing time
  • Large PDFs (50+ pages): Consider async processing with job queues

Storage Requirements

  • Images are ~200-500KB each at 150 DPI
  • 10-page PDF: ~2-5MB in images
  • Consider cleanup strategies for old processed files
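These numbers are easy to sanity-check with a quick estimator. The 200-500KB per-page range is the observation above, not a guarantee for every document:

```typescript
// Rough storage estimate for converted pages at 150 DPI,
// using the observed 200-500KB-per-page range.
function estimateStorageMB(pageCount: number): { minMB: number; maxMB: number } {
  const MIN_KB_PER_PAGE = 200;
  const MAX_KB_PER_PAGE = 500;
  return {
    minMB: (pageCount * MIN_KB_PER_PAGE) / 1024,
    maxMB: (pageCount * MAX_KB_PER_PAGE) / 1024,
  };
}
```

For a 10-page PDF this lands at roughly 2-5MB, matching the figure above.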

Memory Usage

  • PDF buffer in memory: Temporary during processing
  • Images written to disk: Permanent storage
  • Peak memory usage: ~3x the PDF file size during processing

Production Deployment Tips

1. Async Processing for Large Documents

// Consider using a job queue for large PDFs
if (pages.length > 20) {
  await jobQueue.add('process-large-pdf', { 
    fileBuffer, 
    fileMetadata, 
    userId: context.user.id 
  });

  return { message: "Large PDF queued for processing" };
}

2. Storage Optimization

// Implement cleanup for old images (assumes the fs/promises and path imports)
async function cleanupOldImages(imagesDir: string) {
  const cutoffMs = Date.now() - 7 * 24 * 60 * 60 * 1000; // 7 days
  for (const fileName of await fs.readdir(imagesDir)) {
    const filePath = path.join(imagesDir, fileName);
    const { mtimeMs } = await fs.stat(filePath);
    if (mtimeMs < cutoffMs) await fs.unlink(filePath); // remove stale page image
  }
}

3. Rate Limiting

// Add rate limiting for PDF processing (using express-rate-limit)
import rateLimit from 'express-rate-limit';

const pdfUploadLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 5, // 5 PDFs per window
  message: 'Too many PDF uploads, please try again later'
});

Key Takeaways and Future Possibilities

What I Learned

  1. Page-level granularity is crucial for professional document AI applications
  2. pdf2pic provides excellent quality with reasonable performance
  3. Buffer-based processing is more secure than temporary file storage
  4. Error handling is critical - PDFs are unpredictable
  5. Wasp's middleware system makes complex integrations clean

Future Enhancements

  • OCR confidence scoring per page
  • Layout analysis to identify tables, charts, headers
  • Text highlighting on source images
  • Multi-format support (Word docs, PowerPoint, etc.)
  • Real-time collaboration on document analysis

The Bottom Line

Building page-level PDF processing transformed my RAG system from a basic Q&A tool into a professional document analysis platform. Users can now get precise answers with exact page references, making the AI system trustworthy and practical for real-world use cases.

The combination of Wasp's full-stack capabilities, pdf2pic's reliable conversion, and thoughtful error handling created a robust solution that scales from prototype to production.


Have you implemented similar document processing pipelines? I'd love to hear about your experiences and challenges in the comments below.
