Stefan Vitoria

Building a Page-Level PDF Processing Pipeline for Smarter RAG Systems

How I solved the granularity problem in document-based AI applications using Wasp, pdf2pic, and smart architecture decisions


The Problem: When Your RAG System Can't Point to the Right Page

Picture this: You're building an AI-powered document analysis system. Users upload PDFs, ask questions, and expect precise answers with exact references. But there's a frustrating problem—your RAG (Retrieval-Augmented Generation) system can tell users what information exists but not where to find it.

❌ User: "Where can I find the tax information?"
❌ AI: "The tax rate is 15%"
❌ User: "But which page?"
❌ AI: 🤷‍♂️

The challenge: existing solutions didn't give me the control and page-level granularity my RAG system needed. Most approaches either lacked precise page referencing, were overly complex for my use case, or didn't integrate well with my existing technology stack.

This becomes a dealbreaker for professional applications where users need to verify information, cite sources, or navigate large documents efficiently.

The Challenge: Finding the Right Fit for My Requirements

When building my document-based AI system, I had specific requirements that existing solutions couldn't meet:

What I Needed:

  • Page-level granularity: Each page processed and referenced individually
  • Cost-effective scaling: Handle volume processing without breaking the bank
  • Simple integration: Work seamlessly with my Wasp application stack
  • Full control: Customize the processing pipeline for my specific needs
  • Reliable quality: Consistent results across various PDF formats

What I Tried:

  • Cloud-based solutions: Often too expensive and complex for my specific use case
  • Basic PDF parsing: Inconsistent results with complex layouts and images
  • Simple OCR libraries: Required too much manual optimization and tuning

I needed a solution that was purpose-built and cost-effective, and that gave me complete control over the processing pipeline.

The Solution: PDF-to-Image Pipeline with Smart Processing

After experimenting with various approaches, I developed a pipeline that converts PDFs into individual page images, processes each page separately, and maintains precise page references throughout the RAG system.

Architecture Overview

PDF Upload → Page Images → Individual OCR → RAG with Page References
     ↓              ↓              ↓              ↓
  [file.pdf]  [page_1.png]  [text + page #]  "Found on page 3"
             [page_2.png]
             [page_3.png]
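The page-reference step in this flow can be sketched as a pure helper. pdf2pic names its output files `<saveFilename>.<page>.<format>` (visible in the response example later in this post); the helper below is illustrative, not part of the article's pipeline:

```typescript
interface PageRef {
  pageNumber: number;
  imagePath: string;
}

// Illustrative helper: recover ordered page references from
// pdf2pic-style filenames like "doc_page.1.png", "doc_page.2.png".
function toPageRefs(imageFiles: string[]): PageRef[] {
  return imageFiles
    .map((imagePath) => {
      // Extract the trailing ".<n>." page index from the filename
      const match = imagePath.match(/\.(\d+)\.\w+$/);
      return match ? { pageNumber: Number(match[1]), imagePath } : null;
    })
    .filter((ref): ref is PageRef => ref !== null)
    .sort((a, b) => a.pageNumber - b.pageNumber);
}
```

Ordering by the parsed page number (rather than trusting array order) keeps references stable even if the conversion library returns results out of sequence.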

Implementation: Building with Wasp and pdf2pic

I chose Wasp for its full-stack TypeScript approach and built-in file handling capabilities. Here's how I implemented the solution:

Step 1: Setting Up File Upload with Multer

First, I configured multer for handling PDF uploads directly in memory:

// src/server/data_ingestion/apis/ingest-pdf.ts
import path from "path";
import fs from "fs/promises"; // used below for the image output directory
import multer from "multer";
import { HttpError, type MiddlewareConfigFn } from "wasp/server";
import { fromBuffer } from "pdf2pic";

const MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024; // 10MB limit

export const ingestPdfMiddlewareConfigFn: MiddlewareConfigFn = (middlewareConfig) => {
  const upload = multer({
    storage: multer.memoryStorage(), // Keep in memory for processing
    limits: { fileSize: MAX_FILE_SIZE_BYTES },
    fileFilter: (_req, file, cb) => {
      if (file.mimetype === 'application/pdf') {
        cb(null, true);
      } else {
        cb(new Error('Only PDF files are allowed'));
      }
    },
  });

  middlewareConfig.set('multer.single', upload.single('pdf'));
  return middlewareConfig;
};

Step 2: Wasp API Configuration

In main.wasp, I used the new apiNamespace feature for cleaner middleware management:

// main.wasp
apiNamespace fileUploadMiddleware {
  middlewareConfigFn: import { ingestPdfMiddlewareConfigFn } from "@src/server/data_ingestion/apis/ingest-pdf",
  path: "/api/v1/data-ingestion"
}

api ingestPdf {
  httpRoute: (POST, "/api/v1/data-ingestion/pdf"),
  fn: import { ingestPdf } from "@src/server/data_ingestion/apis/ingest-pdf",
  entities: [User]
}

Step 3: The Heart of the System - PDF Processing Function

This is where the magic happens. The function takes a PDF buffer and converts each page to an individual image:

interface PageInfo {
  pageNumber: number;
  imagePath: string;
}

interface FileMetadata {
  name: string;
  originalName: string;
  type: string;
  size: number;
}

async function processPdfToImages(
  fileBuffer: Buffer, 
  fileMetadata: FileMetadata
): Promise<PageInfo[]> {
  try {
    // Validate the buffer
    if (!fileBuffer || fileBuffer.length === 0) {
      throw new HttpError(400, "Invalid PDF file buffer");
    }

    // Create storage directory
    const imagesDir = path.join(
      process.cwd(), 
      'src', 'server', 'data_ingestion', 'pdf_to_pic_images'
    );
    await fs.mkdir(imagesDir, { recursive: true });

    // Sanitize filename for file system compatibility
    const baseName = path.parse(fileMetadata.originalName)
      .name.replace(/[^a-zA-Z0-9_-]/g, '_');

    // Configure pdf2pic with optimal settings
    const convertPdfOptions = {
      density: 150,           // 150 DPI - sweet spot for OCR accuracy
      saveFilename: `${baseName}_page`,
      savePath: imagesDir,
      format: "png" as const, // PNG for lossless text clarity
      width: 800,             // Target width in px (keeps file sizes small)
      height: 1200            // Target height in px for a portrait page
    };

    const convert = fromBuffer(fileBuffer, convertPdfOptions);

    // Convert all pages (-1 means all pages)
    const results = await convert.bulk(-1, { responseType: "image" });

    // Process results into our page info structure
    const pageInfos: PageInfo[] = [];

    if (results && Array.isArray(results)) {
      results.forEach((result, index) => {
        if (result.path) {
          const fileName = path.basename(result.path);
          pageInfos.push({
            pageNumber: index + 1,
            imagePath: fileName
          });
        }
      });
    }

    if (pageInfos.length === 0) {
      throw new HttpError(400, "No pages could be extracted from PDF");
    }

    return pageInfos;

  } catch (error) {
    if (error instanceof HttpError) throw error;

    console.error("PDF processing error:", error);
    throw new HttpError(500, "Failed to process PDF pages");
  }
}

Step 4: Main API Handler with Enhanced Response

The main API handler orchestrates the entire process:

export const ingestPdf: IngestPdf = async (req, res, _context) => {
  try {
    if (!req.file) {
      throw new HttpError(400, "No PDF file provided");
    }

    // Extract metadata
    const fileMetadata: FileMetadata = {
      name: req.file.filename || req.file.originalname, // memoryStorage sets no filename, so this falls back to originalname
      originalName: req.file.originalname,
      type: req.file.mimetype,
      size: req.file.size,
    };

    // Validate file type and size
    if (req.file.mimetype !== 'application/pdf') {
      throw new HttpError(400, "File must be a PDF");
    }

    if (req.file.size > MAX_FILE_SIZE_BYTES) {
      throw new HttpError(400, `File exceeds ${MAX_FILE_SIZE_BYTES / 1024 / 1024}MB limit`);
    }

    // Process PDF to images
    const pages = await processPdfToImages(req.file.buffer, fileMetadata);

    // Return enhanced response with page information
    res.json({
      success: true,
      message: "PDF processed successfully",
      file: fileMetadata,
      pages: pages,
      totalPages: pages.length
    });

  } catch (error) {
    if (error instanceof HttpError) throw error;
    console.error("Error processing PDF upload:", error);
    throw new HttpError(500, "Internal server error processing PDF");
  }
};

Error Handling: Production-Ready Considerations

Real-world PDF processing is messy. Here are the error scenarios I handle:

1. Corrupted or Invalid PDFs

try {
  results = await convert.bulk(-1, { responseType: "image" });
} catch (conversionError) {
  console.error("PDF conversion error:", conversionError);
  throw new HttpError(400, "Failed to convert PDF - file may be corrupted or password-protected");
}

2. File System Issues

try {
  await fs.mkdir(imagesDir, { recursive: true });
} catch (dirError) {
  console.error("Failed to create images directory:", dirError);
  throw new HttpError(500, "Failed to create storage directory");
}

3. Resource Limits

// In multer configuration
limits: { 
  fileSize: MAX_FILE_SIZE_BYTES,
  files: 1 // Only one file at a time
}
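When multer rejects a file, the error carries a `code` such as `LIMIT_FILE_SIZE`. A small helper can translate those into user-facing responses; this sketch is hypothetical (the status codes and messages are my own choices, not something Wasp or multer prescribes):

```typescript
interface UploadErrorResponse {
  status: number;
  message: string;
}

// Hypothetical mapping from multer-style upload errors to HTTP responses.
// The `code` values are the ones multer sets; statuses/messages are my own.
function mapUploadError(err: { code?: string; message: string }): UploadErrorResponse {
  switch (err.code) {
    case "LIMIT_FILE_SIZE":
      return { status: 413, message: "PDF exceeds the 10MB upload limit" };
    case "LIMIT_FILE_COUNT":
      return { status: 400, message: "Only one PDF may be uploaded at a time" };
    default:
      return { status: 400, message: err.message };
  }
}
```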

API Usage: From Upload to Page References

Making a Request

# Using cURL
curl -X POST \
  -F "pdf=@document.pdf" \
  http://localhost:3001/api/v1/data-ingestion/pdf

// Using JavaScript fetch
const formData = new FormData();
formData.append('pdf', fileInput.files[0]);

const response = await fetch('/api/v1/data-ingestion/pdf', {
  method: 'POST',
  body: formData
});

const result = await response.json();

Enhanced Response Structure

{
  "success": true,
  "message": "PDF processed successfully",
  "file": {
    "name": "tax_document.pdf",
    "originalName": "tax_document.pdf",
    "type": "application/pdf",
    "size": 2048576
  },
  "pages": [
    { "pageNumber": 1, "imagePath": "tax_document_page.1.png" },
    { "pageNumber": 2, "imagePath": "tax_document_page.2.png" },
    { "pageNumber": 3, "imagePath": "tax_document_page.3.png" }
  ],
  "totalPages": 3
}
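On the client, the `pages` array converts naturally into a page-number lookup for jumping straight to a cited page. A minimal sketch assuming the response shape above:

```typescript
interface IngestResponse {
  success: boolean;
  pages: { pageNumber: number; imagePath: string }[];
  totalPages: number;
}

// Build a pageNumber -> imagePath map so the UI can resolve
// "found on page 3" to the corresponding page image.
function buildPageIndex(result: IngestResponse): Map<number, string> {
  return new Map(result.pages.map((p) => [p.pageNumber, p.imagePath]));
}
```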

Integration with RAG Systems

Now comes the payoff. With individual page images, I can:

1. Process Each Page with High-Accuracy OCR

// Example: Process with Mistral or other OCR service
async function processPageWithOCR(imagePath: string, pageNumber: number) {
  const imageBuffer = await fs.readFile(imagePath);

  const ocrResult = await mistralOCR.processImage(imageBuffer);

  return {
    pageNumber,
    content: ocrResult.text,
    confidence: ocrResult.confidence,
    imagePath
  };
}

2. Index Content with Page References

// Example: Store in vector database with page metadata
// (documentId is assumed to come from the surrounding ingestion flow)
async function indexPageContent(pageData) {
  const embedding = await openai.embeddings.create({
    input: pageData.content,
    model: "text-embedding-3-small"
  });

  await vectorDB.upsert({
    id: `${documentId}_page_${pageData.pageNumber}`,
    values: embedding.data[0].embedding,
    metadata: {
      documentId,
      pageNumber: pageData.pageNumber,
      imagePath: pageData.imagePath,
      content: pageData.content
    }
  });
}

3. Provide Precise Query Responses

// Example: RAG query with page references
async function queryWithPageReferences(question: string) {
  const results = await vectorDB.query({
    vector: await embedQuestion(question),
    topK: 3,
    includeMetadata: true
  });

  return {
    answer: await generateAnswer(question, results),
    sources: results.map(r => ({
      pageNumber: r.metadata.pageNumber,
      imagePath: r.metadata.imagePath,
      relevanceScore: r.score
    }))
  };
}

Results: The Transformation in Action

Before Implementation

❌ User: "What's the tax rate mentioned in the document?"
❌ AI: "The tax rate is 15%"
❌ User: "Where exactly?"
❌ AI: "I found that information in the document"

After Implementation

✅ User: "What's the tax rate mentioned in the document?"
✅ AI: "The tax rate is 15%, found on page 3 of the document"
✅ User: "Can you show me the exact page?"
✅ AI: [Returns page 3 image with highlighted relevant section]

Performance Considerations

File Size and Processing Time

  • Small PDFs (1-5 pages): ~2-3 seconds processing time
  • Medium PDFs (10-20 pages): ~8-12 seconds processing time
  • Large PDFs (50+ pages): Consider async processing with job queues

Storage Requirements

  • Images are ~200-500KB each at 150 DPI
  • 10-page PDF: ~2-5MB in images
  • Consider cleanup strategies for old processed files
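These numbers are easy to sanity-check with a quick estimator. The 200-500KB per-page range is the observation above, not a guarantee for every document:

```typescript
// Rough storage estimate for converted pages at 150 DPI,
// using the observed 200-500KB-per-page range.
function estimateStorageMB(pageCount: number): { minMB: number; maxMB: number } {
  const MIN_KB_PER_PAGE = 200;
  const MAX_KB_PER_PAGE = 500;
  return {
    minMB: (pageCount * MIN_KB_PER_PAGE) / 1024,
    maxMB: (pageCount * MAX_KB_PER_PAGE) / 1024,
  };
}
```

For a 10-page PDF this lands at roughly 2-5MB, matching the figure above.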

Memory Usage

  • PDF buffer in memory: Temporary during processing
  • Images written to disk: Permanent storage
  • Peak memory usage: ~3x the PDF file size during processing

Production Deployment Tips

1. Async Processing for Large Documents

// Consider using a job queue for large PDFs
if (pages.length > 20) {
  await jobQueue.add('process-large-pdf', { 
    fileBuffer, 
    fileMetadata, 
    userId: context.user.id 
  });

  return { message: "Large PDF queued for processing" };
}

2. Storage Optimization

// Implement cleanup for old images (assumes the fs/promises and path imports)
async function cleanupOldImages(imagesDir: string) {
  const cutoffMs = Date.now() - 7 * 24 * 60 * 60 * 1000; // 7 days
  for (const fileName of await fs.readdir(imagesDir)) {
    const filePath = path.join(imagesDir, fileName);
    const { mtimeMs } = await fs.stat(filePath);
    if (mtimeMs < cutoffMs) await fs.unlink(filePath); // remove stale page image
  }
}

3. Rate Limiting

// Add rate limiting for PDF processing (using express-rate-limit)
import rateLimit from 'express-rate-limit';

const pdfUploadLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 5, // 5 PDFs per window
  message: 'Too many PDF uploads, please try again later'
});

Key Takeaways and Future Possibilities

What I Learned

  1. Page-level granularity is crucial for professional document AI applications
  2. pdf2pic provides excellent quality with reasonable performance
  3. Buffer-based processing is more secure than temporary file storage
  4. Error handling is critical - PDFs are unpredictable
  5. Wasp's middleware system makes complex integrations clean

Future Enhancements

  • OCR confidence scoring per page
  • Layout analysis to identify tables, charts, headers
  • Text highlighting on source images
  • Multi-format support (Word docs, PowerPoint, etc.)
  • Real-time collaboration on document analysis

The Bottom Line

Building page-level PDF processing transformed my RAG system from a basic Q&A tool into a professional document analysis platform. Users can now get precise answers with exact page references, making the AI system trustworthy and practical for real-world use cases.

The combination of Wasp's full-stack capabilities, pdf2pic's reliable conversion, and thoughtful error handling created a robust solution that scales from prototype to production.


Have you implemented similar document processing pipelines? I'd love to hear about your experiences and challenges in the comments below.
