How I solved the granularity problem in document-based AI applications using Wasp, pdf2pic, and smart architecture decisions
The Problem: When Your RAG System Can't Point to the Right Page
Picture this: You're building an AI-powered document analysis system. Users upload PDFs, ask questions, and expect precise answers with exact references. But there's a frustrating problem—your RAG (Retrieval-Augmented Generation) system can tell users what information exists but not where to find it.
❌ User: "Where can I find the tax information?"
❌ AI: "The tax rate is 15%"
❌ User: "But which page?"
❌ AI: 🤷‍♂️
The challenge I faced was that existing solutions didn't give me the control and page-level granularity my RAG system needed. Most approaches either lacked precise page referencing, were overly complex for my use case, or didn't integrate well with my existing technology stack.
This becomes a dealbreaker for professional applications where users need to verify information, cite sources, or navigate large documents efficiently.
The Challenge: Finding the Right Fit for My Requirements
When building my document-based AI system, I had specific requirements that existing solutions couldn't meet:
What I Needed:
- Page-level granularity: Each page processed and referenced individually
- Cost-effective scaling: Handle volume processing without breaking the bank
- Simple integration: Work seamlessly with my Wasp application stack
- Full control: Customize the processing pipeline for my specific needs
- Reliable quality: Consistent results across various PDF formats
What I Tried:
- Cloud-based solutions: Often too expensive and complex for my specific use case
- Basic PDF parsing: Inconsistent results with complex layouts and images
- Simple OCR libraries: Required too much manual optimization and tuning
I needed a solution that was purpose-built and cost-effective, and that gave me complete control over the processing pipeline.
The Solution: PDF-to-Image Pipeline with Smart Processing
After experimenting with various approaches, I developed a pipeline that converts PDFs into individual page images, processes each page separately, and maintains precise page references throughout the RAG system.
Architecture Overview
PDF Upload   →   Page Images    →   Individual OCR    →   RAG with Page References
     ↓                ↓                    ↓                         ↓
[file.pdf]      [page_1.png]       [text + page #]         "Found on page 3"
                [page_2.png]
                [page_3.png]
Implementation: Building with Wasp and pdf2pic
I chose Wasp for its full-stack TypeScript approach and built-in file handling capabilities, paired with pdf2pic for conversion (note that pdf2pic relies on GraphicsMagick and Ghostscript being installed on the host). Here's how I implemented the solution:
Step 1: Setting Up File Upload with Multer
First, I configured multer for handling PDF uploads directly in memory:
// src/server/data_ingestion/apis/ingest-pdf.ts
import path from "path";
import fs from "fs/promises";
import multer from "multer";
import { HttpError, type MiddlewareConfigFn } from "wasp/server";
import { type IngestPdf } from "wasp/server/api"; // API type generated by Wasp from main.wasp
import { fromBuffer } from "pdf2pic";

const MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024; // 10MB limit

export const ingestPdfMiddlewareConfigFn: MiddlewareConfigFn = (middlewareConfig) => {
  const upload = multer({
    storage: multer.memoryStorage(), // Keep in memory for processing
    limits: { fileSize: MAX_FILE_SIZE_BYTES },
    fileFilter: (_req, file, cb) => {
      if (file.mimetype === 'application/pdf') {
        cb(null, true);
      } else {
        cb(new Error('Only PDF files are allowed'));
      }
    },
  });

  middlewareConfig.set('multer.single', upload.single('pdf'));
  return middlewareConfig;
};
Step 2: Wasp API Configuration
In main.wasp, I used the new apiNamespace feature for cleaner middleware management:
// main.wasp
apiNamespace fileUploadMiddleware {
  middlewareConfigFn: import { ingestPdfMiddlewareConfigFn } from "@src/server/data_ingestion/apis/ingest-pdf",
  path: "/api/v1/data-ingestion"
}

api ingestPdf {
  httpRoute: (POST, "/api/v1/data-ingestion/pdf"),
  fn: import { ingestPdf } from "@src/server/data_ingestion/apis/ingest-pdf",
  entities: [User]
}
Step 3: The Heart of the System - PDF Processing Function
This is where the magic happens. The function takes a PDF buffer and converts each page to an individual image:
interface PageInfo {
  pageNumber: number;
  imagePath: string;
}

interface FileMetadata {
  name: string;
  originalName: string;
  type: string;
  size: number;
}

async function processPdfToImages(
  fileBuffer: Buffer,
  fileMetadata: FileMetadata
): Promise<PageInfo[]> {
  try {
    // Validate the buffer
    if (!fileBuffer || fileBuffer.length === 0) {
      throw new HttpError(400, "Invalid PDF file buffer");
    }

    // Create storage directory
    const imagesDir = path.join(
      process.cwd(),
      'src', 'server', 'data_ingestion', 'pdf_to_pic_images'
    );
    await fs.mkdir(imagesDir, { recursive: true });

    // Sanitize filename for file system compatibility
    const baseName = path.parse(fileMetadata.originalName)
      .name.replace(/[^a-zA-Z0-9_-]/g, '_');

    // Configure pdf2pic with optimal settings
    const convertPdfOptions = {
      density: 150,             // 150 DPI - sweet spot for OCR accuracy
      saveFilename: `${baseName}_page`,
      savePath: imagesDir,
      format: "png" as const,   // PNG for lossless text clarity
      width: 800,               // Reasonable file size
      height: 1200              // Maintain aspect ratio
    };

    const convert = fromBuffer(fileBuffer, convertPdfOptions);

    // Convert all pages (-1 means all pages)
    const results = await convert.bulk(-1, { responseType: "image" });

    // Process results into our page info structure
    const pageInfos: PageInfo[] = [];
    if (results && Array.isArray(results)) {
      results.forEach((result, index) => {
        if (result.path) {
          const fileName = path.basename(result.path);
          pageInfos.push({
            pageNumber: index + 1,
            imagePath: fileName
          });
        }
      });
    }

    if (pageInfos.length === 0) {
      throw new HttpError(400, "No pages could be extracted from PDF");
    }

    return pageInfos;
  } catch (error) {
    if (error instanceof HttpError) throw error;
    console.error("PDF processing error:", error);
    throw new HttpError(500, "Failed to process PDF pages");
  }
}
Step 4: Main API Handler with Enhanced Response
The main API handler orchestrates the entire process:
export const ingestPdf: IngestPdf = async (req, res, _context) => {
  try {
    if (!req.file) {
      throw new HttpError(400, "No PDF file provided");
    }

    // Extract metadata
    const fileMetadata: FileMetadata = {
      name: req.file.filename || req.file.originalname,
      originalName: req.file.originalname,
      type: req.file.mimetype,
      size: req.file.size,
    };

    // Validate file type and size
    if (req.file.mimetype !== 'application/pdf') {
      throw new HttpError(400, "File must be a PDF");
    }
    if (req.file.size > MAX_FILE_SIZE_BYTES) {
      throw new HttpError(400, `File exceeds ${MAX_FILE_SIZE_BYTES / 1024 / 1024}MB limit`);
    }

    // Process PDF to images
    const pages = await processPdfToImages(req.file.buffer, fileMetadata);

    // Return enhanced response with page information
    res.json({
      success: true,
      message: "PDF processed successfully",
      file: fileMetadata,
      pages: pages,
      totalPages: pages.length
    });
  } catch (error) {
    if (error instanceof HttpError) throw error;
    console.error("Error processing PDF upload:", error);
    throw new HttpError(500, "Internal server error processing PDF");
  }
};
Error Handling: Production-Ready Considerations
Real-world PDF processing is messy. Here are the error scenarios I handle:
1. Corrupted or Invalid PDFs
// Inside processPdfToImages, wrapping the bulk() call
try {
  results = await convert.bulk(-1, { responseType: "image" });
} catch (conversionError) {
  console.error("PDF conversion error:", conversionError);
  throw new HttpError(400, "Failed to convert PDF - file may be corrupted or password-protected");
}
2. File System Issues
try {
  await fs.mkdir(imagesDir, { recursive: true });
} catch (dirError) {
  console.error("Failed to create images directory:", dirError);
  throw new HttpError(500, "Failed to create storage directory");
}
3. Resource Limits
// In multer configuration
limits: {
  fileSize: MAX_FILE_SIZE_BYTES,
  files: 1 // Only one file at a time
}
API Usage: From Upload to Page References
Making a Request
# Using cURL
curl -X POST \
  -F "pdf=@document.pdf" \
  http://localhost:3001/api/v1/data-ingestion/pdf

// Using JavaScript fetch
const formData = new FormData();
formData.append('pdf', fileInput.files[0]);

const response = await fetch('/api/v1/data-ingestion/pdf', {
  method: 'POST',
  body: formData
});
const result = await response.json();
Enhanced Response Structure
{
  "success": true,
  "message": "PDF processed successfully",
  "file": {
    "name": "tax_document.pdf",
    "originalName": "tax_document.pdf",
    "type": "application/pdf",
    "size": 2048576
  },
  "pages": [
    { "pageNumber": 1, "imagePath": "tax_document_page.1.png" },
    { "pageNumber": 2, "imagePath": "tax_document_page.2.png" },
    { "pageNumber": 3, "imagePath": "tax_document_page.3.png" }
  ],
  "totalPages": 3
}
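On the client, it helps to mirror this shape with a small type. Here's a minimal sketch (the name IngestPdfResponse is mine, not something Wasp generates):

// Hypothetical client-side type mirroring the API response above
interface IngestPdfResponse {
  success: boolean;
  message: string;
  file: {
    name: string;
    originalName: string;
    type: string;
    size: number;
  };
  pages: { pageNumber: number; imagePath: string }[];
  totalPages: number;
}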
Integration with RAG Systems
Now comes the payoff. With individual page images, I can:
1. Process Each Page with High-Accuracy OCR
// Example: Process with Mistral or other OCR service
async function processPageWithOCR(imagePath: string, pageNumber: number) {
  // Note: processPdfToImages stores only the file name, so resolve the
  // full path against the images directory before calling this function
  const imageBuffer = await fs.readFile(imagePath);
  const ocrResult = await mistralOCR.processImage(imageBuffer);

  return {
    pageNumber,
    content: ocrResult.text,
    confidence: ocrResult.confidence,
    imagePath
  };
}
2. Index Content with Page References
// Example: Store in vector database with page metadata
async function indexPageContent(documentId: string, pageData) {
  const embedding = await openai.embeddings.create({
    input: pageData.content,
    model: "text-embedding-3-small"
  });

  await vectorDB.upsert({
    id: `${documentId}_page_${pageData.pageNumber}`,
    values: embedding.data[0].embedding,
    metadata: {
      documentId,
      pageNumber: pageData.pageNumber,
      imagePath: pageData.imagePath,
      content: pageData.content
    }
  });
}
3. Provide Precise Query Responses
// Example: RAG query with page references
async function queryWithPageReferences(question: string) {
  const results = await vectorDB.query({
    vector: await embedQuestion(question),
    topK: 3,
    includeMetadata: true
  });

  return {
    answer: await generateAnswer(question, results),
    sources: results.map(r => ({
      pageNumber: r.metadata.pageNumber,
      imagePath: r.metadata.imagePath,
      relevanceScore: r.score
    }))
  };
}
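Calling it from an answer endpoint then looks roughly like this (a sketch; the generated wording will vary, but the page references come straight from the stored metadata):

const { answer, sources } = await queryWithPageReferences("What is the tax rate?");

console.log(answer);
// e.g. "The tax rate is 15%" plus whatever context generateAnswer adds

for (const source of sources) {
  console.log(`See page ${source.pageNumber} (${source.imagePath}), score ${source.relevanceScore}`);
}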
Results: The Transformation in Action
Before Implementation
❌ User: "What's the tax rate mentioned in the document?"
❌ AI: "The tax rate is 15%"
❌ User: "Where exactly?"
❌ AI: "I found that information in the document"
After Implementation
✅ User: "What's the tax rate mentioned in the document?"
✅ AI: "The tax rate is 15%, found on page 3 of the document"
✅ User: "Can you show me the exact page?"
✅ AI: [Returns page 3 image with highlighted relevant section]
Performance Considerations
File Size and Processing Time
- Small PDFs (1-5 pages): ~2-3 seconds processing time
- Medium PDFs (10-20 pages): ~8-12 seconds processing time
- Large PDFs (50+ pages): Consider async processing with job queues
Storage Requirements
- Images are ~200-500KB each at 150 DPI
- 10-page PDF: ~2-5MB in images
- Consider cleanup strategies for old processed files
Memory Usage
- PDF buffer in memory: Temporary during processing
- Images written to disk: Permanent storage
- Peak memory usage: ~3x the PDF file size during processing
Production Deployment Tips
1. Async Processing for Large Documents
// Consider using a job queue for large PDFs
// (ideally check the page count or file size before the synchronous conversion)
if (pages.length > 20) {
  await jobQueue.add('process-large-pdf', {
    fileBuffer,
    fileMetadata,
    userId: context.user.id
  });
  return res.json({ message: "Large PDF queued for processing" });
}
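The snippet above only enqueues the work; the consuming side isn't shown. With a BullMQ-style queue it might look roughly like this (assumptions: BullMQ with a local Redis connection, and that the buffer arrives JSON-serialized as { type: 'Buffer', data: [...] }):

import { Worker } from "bullmq";

// Sketch of a worker that picks up the queued job and reuses processPdfToImages
const pdfWorker = new Worker(
  "process-large-pdf",
  async (job) => {
    const { fileBuffer, fileMetadata, userId } = job.data;
    // The queue JSON-serializes Buffers, so rebuild one before processing
    const buffer = Buffer.from(fileBuffer.data ?? fileBuffer);
    const pages = await processPdfToImages(buffer, fileMetadata);
    // Kick off OCR/indexing for `pages` and notify `userId` from here
    return { userId, totalPages: pages.length };
  },
  { connection: { host: "localhost", port: 6379 } }
);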
2. Storage Optimization
// Implement cleanup for old images
async function cleanupOldImages(imagesDir: string) {
  const cutoffMs = Date.now() - 7 * 24 * 60 * 60 * 1000; // 7 days
  // Remove images older than the cutoff date
  for (const file of await fs.readdir(imagesDir)) {
    const filePath = path.join(imagesDir, file);
    const { mtimeMs } = await fs.stat(filePath);
    if (mtimeMs < cutoffMs) await fs.unlink(filePath);
  }
}
3. Rate Limiting
// Add rate limiting for PDF processing (using express-rate-limit)
import rateLimit from 'express-rate-limit';

const pdfUploadLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 5, // 5 PDFs per window
  message: 'Too many PDF uploads, please try again later'
});
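Since a Wasp app doesn't expose the Express app directly, I'd register the limiter through the same middlewareConfigFn from Step 1 (a sketch; 'rateLimit.pdfUpload' is just an arbitrary key):

export const ingestPdfMiddlewareConfigFn: MiddlewareConfigFn = (middlewareConfig) => {
  // ...multer setup from Step 1 stays as-is...
  // The limiter runs for every route in this apiNamespace
  middlewareConfig.set('rateLimit.pdfUpload', pdfUploadLimiter);
  return middlewareConfig;
};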
Key Takeaways and Future Possibilities
What I Learned
- Page-level granularity is crucial for professional document AI applications
- pdf2pic provides excellent quality with reasonable performance
- Buffer-based processing is more secure than temporary file storage
- Error handling is critical - PDFs are unpredictable
- Wasp's middleware system makes complex integrations clean
Future Enhancements
- OCR confidence scoring per page
- Layout analysis to identify tables, charts, headers
- Text highlighting on source images
- Multi-format support (Word docs, PowerPoint, etc.)
- Real-time collaboration on document analysis
The Bottom Line
Building page-level PDF processing transformed my RAG system from a basic Q&A tool into a professional document analysis platform. Users can now get precise answers with exact page references, making the AI system trustworthy and practical for real-world use cases.
The combination of Wasp's full-stack capabilities, pdf2pic's reliable conversion, and thoughtful error handling created a robust solution that scales from prototype to production.
Have you implemented similar document processing pipelines? I'd love to hear about your experiences and challenges in the comments below.