DEV Community: Stefan Vitoria

Building Vroom AI: A Multi-Agent Architecture for Intelligent Driving Education

Stefan Vitoria — Mon, 29 Dec 2025 19:15:20 +0000

How we're transforming driving education with a sophisticated AI system powered by specialized agents, intelligent tools, and orchestrated workflows

The Vision: Reimagining Driving Education

Learning to drive shouldn't be a one-size-fits-all experience. Every student has different strengths, weaknesses, and learning preferences. Some struggle with traffic signs, others with complex regulations, and many need personalized study paths that adapt to their unique pace and style.

Vroom AI is our answer to this challenge—an intelligent tutoring system that doesn't just teach driving rules, but understands each learner individually and adapts its approach accordingly.

The Architecture: A Symphony of AI Agents

At the heart of Vroom AI lies a sophisticated multi-agent architecture built on Mastra AI framework. Instead of a single monolithic AI, we've designed a ecosystem of specialized agents, each with distinct expertise, working together to create a truly personalized learning experience.

┌─────────────────────────────────────────────────────────┐
│                    VROOM AI SYSTEM                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────────┐    ┌─────────────────┐            │
│  │   USER INPUT    │───▶│ Learning        │            │
│  │ "I'm struggling │    │ Orchestrator    │            │
│  │ with signs"     │    │ (Main Agent)    │            │
│  └─────────────────┘    └────────┬────────┘            │
│                                   │                     │
│                                   ▼                     │
│  ┌─────────────────────────────────────────────────────┤
│  │               SPECIALIST AGENTS                     │
│  │                                                     │
│  │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │
│  │ │    Quiz     │ │    Sign     │ │    Rule     │    │
│  │ │  Generator  │ │  Explainer  │ │ Interpreter │    │
│  │ └─────────────┘ └─────────────┘ └─────────────┘    │
│  │                                                     │
│  │        ┌─────────────┐    ┌─────────────┐          │
│  │        │ Study Path  │    │ RAG System  │          │
│  │        │   Agent     │    │ & Content   │          │
│  │        └─────────────┘    └─────────────┘          │
│  └─────────────────────────────────────────────────────┘
│                                                         │
│  ┌─────────────────────────────────────────────────────┤
│  │                    TOOL ARSENAL                     │
│  │                                                     │
│  │ Document Search • Quiz Creator • Progress Tracker  │
│  │ Learning Analytics • Content Recommender           │
│  └─────────────────────────────────────────────────────┘
└─────────────────────────────────────────────────────────┘

The Agent Ecosystem: Specialized Intelligence

1. Learning Orchestrator - The Conductor

Think of this as the "brain" of the system. When you ask "What does that no-parking sign mean?", the Learning Orchestrator doesn't just answer—it analyzes your intent, considers your learning history, and decides which specialist can help you best.

Example Interaction:

Student: "I keep failing questions about regulatory signs"
Orchestrator: 
  → Analyzes: Quiz performance + learning gaps
  → Routes to: Sign Explainer (for concept understanding) 
  → Routes to: Quiz Generator (for targeted practice)
  → Routes to: Study Path Agent (for systematic improvement)

2. Quiz Generator - The Adaptive Assessor

This agent doesn't just pull random questions from a database. It creates personalized quizzes by analyzing your weak spots, searching through our content repository, and generating questions that challenge you at exactly the right level.

How it works:

Detects you struggle with speed limit signs → 60% basic, 30% intermediate, 10% advanced questions
Finds real-world scenarios → "You're approaching a school zone at 3 PM on a Tuesday..."
Creates intelligent distractors → Wrong answers that teach common misconceptions

3. Sign Explainer - The Visual Expert

Every traffic sign has a story—when to use it, why it matters, what happens if you ignore it. The Sign Explainer doesn't just define signs; it provides rich, contextual understanding.

Example Deep Dive:

Input: "V15 sign"
Output:
  ✓ Official Definition: "No Overtaking"
  ✓ Real-world Context: "Typically found on hills, curves, or narrow roads"
  ✓ Related Signs: "Works with V16 (End of No Overtaking)"
  ✓ Safety Rationale: "Prevents accidents where visibility is limited"
  ✓ Penalties: "Fine + 2 points on license"

4. Rule Interpreter - The Legal Translator

Traffic laws are written in dense legal language. The Rule Interpreter transforms complex regulations into clear, actionable understanding.

Before: "Vehicles shall not exceed the prescribed speed limit on designated thoroughfares during specified temporal periods as determined by municipal authorities..."

After: "Don't drive faster than the speed limit shown on signs. In residential areas, this is usually 50 km/h unless posted otherwise."

5. Study Path Agent - The Personal Tutor

This agent creates your personalized learning journey. It analyzes your goals, available time, and performance data to build a curriculum that's uniquely yours.

Sample Study Plan:

Week 1: Foundation Building
├─ Day 1-2: Traffic Signs Overview (2 hours)
├─ Day 3-4: Basic Right-of-Way Rules (2 hours)  
├─ Day 5-6: Speed Limits & Regulations (2 hours)
└─ Day 7: Practice Quiz + Review (1 hour)

Week 2: Advanced Concepts
├─ Day 8-9: Complex Intersections (2 hours)
├─ Day 10-11: Highway Driving Rules (2 hours)
├─ Day 12-13: Emergency Situations (2 hours)
└─ Day 14: Comprehensive Assessment (1 hour)

The Tool Arsenal: Powering Intelligence

Document Search - The Knowledge Retriever

Using RAG (Retrieval-Augmented Generation), this tool searches through our entire content library—processed road code documents, sign databases, and quiz archives—to find exactly the right information for any question.

Smart Search Example:

Query: "parking rules near schools"
Retrieved Context:
  → Regulation Chapter 4.2: School Zone Parking
  → Signs P3, P4: Time-restricted parking
  → Real-world scenarios: Drop-off vs. all-day parking
  → Penalty information: Fines and enforcement

Quiz Creator - The Question Architect

This tool transforms raw content into engaging, educational questions. It analyzes document chunks, identifies key concepts, and creates multiple-choice questions with realistic wrong answers based on common student mistakes.

Progress Tracker - The Learning Analytics Engine

Every interaction generates valuable data: Which topics you master quickly? Where do you struggle? How does your performance change over time? The Progress Tracker captures and analyzes all of this to continually improve your experience.

Learning Analytics - The Insight Generator

This tool goes beyond simple tracking—it identifies patterns, predicts learning outcomes, and provides actionable insights for both students and the system itself.

Content Recommender - The Personalization Engine

Like Netflix for driving education, this tool suggests what you should study next based on your performance, goals, and successful patterns from similar learners.

Workflow Orchestra: Complex Tasks Made Simple

Adaptive Quiz Generation Workflow

When you request practice questions, here's what happens behind the scenes:

1. KNOWLEDGE ASSESSMENT
   ├─ Analyze your quiz history
   ├─ Identify weak areas (e.g., 40% accuracy on regulatory signs)
   └─ Set difficulty target: 70% basic, 20% medium, 10% hard

2. CONTENT RETRIEVAL  
   ├─ Search documents for regulatory sign content
   ├─ Pull related sign data from database
   └─ Gather successful question patterns

3. QUESTION GENERATION
   ├─ Create scenario: "You see this sign while driving..."
   ├─ Generate 4 answer options with 1 correct + 3 intelligent distractors
   └─ Add detailed explanation for learning

4. QUIZ ASSEMBLY
   ├─ Randomize question order
   ├─ Configure progress tracking
   └─ Deliver to your device with streaming updates

Study Path Creation Workflow

Creating your personalized curriculum involves:

1. USER ANALYSIS
   ├─ Current knowledge level assessment
   ├─ Learning style identification
   ├─ Goal setting (exam date, daily time available)
   └─ Preference mapping

2. CURRICULUM GENERATION
   ├─ Topic sequencing based on dependencies
   ├─ Time allocation per subject
   ├─ Difficulty progression planning
   └─ Milestone definition

3. CONTENT MAPPING
   ├─ Resource selection for each topic
   ├─ Practice session configuration  
   ├─ Assessment scheduling
   └─ Review cycle planning

4. DELIVERY & ADAPTATION
   ├─ Progress tracking setup
   ├─ Performance monitoring
   ├─ Dynamic replanning triggers
   └─ Achievement system activation

RAG Integration: Making Content Intelligent

One of our key innovations is leveraging our existing content processing pipeline. We already convert PDF documents to markdown, extract sign information, and process quiz data. The RAG system transforms this raw content into intelligent, searchable knowledge.

EXISTING CONTENT PIPELINE → INTELLIGENT KNOWLEDGE BASE
════════════════════════════════════════════════════

PDF Documents ──► OCR Processing ──► Markdown Content
     │                                      │
     │                                      ▼
     │                             Semantic Chunking
     │                                      │
     │                                      ▼
     │                            Vector Embeddings
     │                                      │
     │                                      ▼
     └──► Sign Database ──────────► LibSQL Vector Database
              │                             │
              ▼                             ▼
        Quiz Content ──────────► Intelligent Search & Retrieval

This means when you ask about overtaking rules, the system doesn't just match keywords—it understands context, finds related concepts, and provides comprehensive answers drawing from multiple sources.

Real-World Learning Scenarios

Scenario 1: The Visual Learner

Sarah prefers learning through examples and scenarios rather than memorizing rules.

Sarah: "I don't understand when I can and can't overtake"

Learning Orchestrator: Identifies preference for scenario-based learning
↓
Rule Interpreter: Explains overtaking rules in simple language
↓  
Sign Explainer: Shows V15 (No Overtaking) and V16 (End No Overtaking) signs
↓
Quiz Generator: Creates scenario questions:
  "You're driving behind a slow truck on a two-lane road. 
   You see a V15 sign ahead. What should you do?"
↓
Content Recommender: Suggests related topics like lane discipline

Scenario 2: The Last-Minute Crammer

João has his driving test in 2 weeks and needs an intensive study plan.

João: "I need to pass my test in 2 weeks, studying 2 hours daily"

Study Path Agent: Creates intensive 14-day curriculum
↓
Progress Tracker: Monitors daily goals and completion rates
↓
Learning Analytics: Identifies João learns best in morning sessions
↓
Quiz Generator: Creates daily assessments targeting exam-style questions
↓
Adaptive adjustments: Speeds up mastered topics, extends time on weak areas

Scenario 3: The Concept Connector

Maria struggles to see how different rules relate to each other.

Maria: "Why are there so many different speed limit signs?"

Sign Explainer: Shows speed limit sign family (C1, C2, C3...)
↓
Rule Interpreter: Explains speed limit hierarchy and contexts
↓
Document Search: Finds related regulations about speed enforcement
↓
Content Recommender: Suggests studying road classification and sign categories
↓
Quiz Creator: Generates comparison questions between different speed contexts

The Learning Benefits

For Students:

Personalized Experience: No two learning paths are identical
Adaptive Difficulty: Always challenged but never overwhelmed
Contextual Understanding: Learn not just what, but why and when
Efficient Preparation: Focus time on areas that need improvement
Real-world Application: Practice with scenarios you'll actually encounter

For Educators:

Performance Insights: Detailed analytics on learning patterns
Content Optimization: Data-driven improvements to educational materials
Intervention Detection: Early identification of struggling students
Curriculum Planning: Evidence-based study program development

Technical Innovation Highlights

1. Multi-Agent Coordination

Instead of a single AI trying to do everything, specialized agents collaborate, each bringing deep expertise to specific educational challenges.

2. Intelligent Content Reuse

We transform existing educational content into a dynamic, searchable knowledge base that powers personalized learning experiences.

3. Adaptive Learning Engine

The system continuously learns about each student and adjusts its teaching approach in real-time.

4. Contextual AI Responses

Every answer considers not just the question, but the student's history, goals, and learning style.

The Future of Driving Education

This architecture represents more than just a technical achievement—it's a fundamental shift toward truly personalized education. By combining AI sophistication with educational expertise, we're creating learning experiences that adapt to each student rather than forcing students to adapt to rigid curricula.

The multi-agent approach allows us to scale specialized expertise that would be impossible with human tutors alone, while the RAG integration ensures our AI stays grounded in authoritative, up-to-date content.

Conclusion: Building Intelligent Education Systems

Vroom AI demonstrates how thoughtful architecture can transform traditional education. By breaking complex AI tasks into specialized agents, providing them with intelligent tools, and orchestrating their collaboration through well-designed workflows, we've created a system that's both sophisticated and maintainable.

The key insights from this architecture:

Specialization over Generalization: Multiple focused agents outperform single general-purpose AI
Content Leveraging: Transform existing materials into intelligent knowledge bases
Workflow Orchestration: Complex educational tasks need structured, multi-step processes
Adaptive Intelligence: Systems must learn and evolve with each user interaction

As we continue building and refining Vroom AI, this architectural foundation ensures we can add new capabilities, improve existing ones, and scale to serve learners with diverse needs and goals.

The future of education isn't just AI-powered—it's AI-architected, with systems designed from the ground up to understand and adapt to human learning.

Want to learn more about building AI-powered education systems? Follow my journey as I document the development process, technical decisions, and lessons learned building Vroom AI.

Connect with me:

How Strategic Image Cropping Transforms Data Ingestion Pipelines

Stefan Vitoria — Fri, 26 Dec 2025 13:38:36 +0000

Picture this: You're running an OCR pipeline processing thousands of PDF documents daily. Your accuracy is decent, costs are climbing, and processing times are slower than you'd like. Sound familiar?

The culprit might be hiding in plain sight – those pesky document margins, headers, footers, and irrelevant graphics that your OCR engine is desperately trying to make sense of. What if I told you that a simple preprocessing step could significantly reduce your costs while improving accuracy?

Enter strategic image cropping – the unsung hero of efficient data ingestion.

The Hidden Cost of Processing Everything

When most developers implement document processing pipelines, they feed entire page/images to their OCR engines. This seems logical, but it's incredibly wasteful:

The Accuracy Problem: OCR engines struggle with mixed content. When your algorithm tries to extract meaningful text from a document containing logos, decorative borders, page numbers, and watermarks, accuracy plummets. Research shows that general-purpose OCR typically achieves only 95% accuracy, but this drops significantly when processing noisy, unstructured content.

The Cost Problem: Every pixel you send to an LLM-based OCR service costs tokens. Processing irrelevant content like headers, footers, and margins can substantially inflate your token usage. With LLM OCR processing typically costing $15-25 per 1,000 documents, this waste adds up quickly.

The Speed Problem: Larger images with mixed content take longer to process. Your pipeline becomes bottlenecked by the sheer volume of irrelevant data being analyzed.

The Solution: Intelligent Image Cropping

Strategic cropping is about surgical precision – removing everything except the content that matters. Instead of sending a full document page to your OCR engine, you extract only the text-heavy regions where meaningful information lives.

How It Works

The process involves three key steps:

Content Analysis: Identify regions of interest within the document
Precise Extraction: Crop to focus areas with configurable coordinates
Optimized Processing: Feed clean, focused images to your OCR engine

Here's a practical implementation using Sharp for image processing:

export async function cropImage(
  inputPath: string, 
  outputPath: string, 
  cropOptions: CropOptions
): Promise<string> {
  const { left, top, width, height } = cropOptions;

  // Remove headers, footers, and margins
  await sharp(inputPath)
    .extract({ left, top, width, height })
    .png({ quality: 90 })
    .toFile(outputPath);

  return outputPath;
}

The Measurable Impact

Accuracy Improvements

Recent research demonstrates meaningful improvements when preprocessing is properly implemented:

AI-enhanced OCR systems can achieve up to 30% accuracy improvement over traditional methods
Modern systems typically achieve 1-5% error rates on standard documents
Elimination of visual noise reduces false positive text detection

Cost Savings

The financial benefits can be substantial:

Token Reduction: Context compression through cropping can significantly reduce token usage
Processing Efficiency: Clean, focused input can reduce processing costs substantially
Model Selection: Smaller, focused images allow you to use more cost-effective OCR models for routine tasks

Processing Efficiency

Smaller images reduce computational overhead
Less irrelevant content to analyze
Fewer manual corrections needed due to improved accuracy

Real-World Case Study: PDF Processing Pipeline

In my recent implementation of a PDF data ingestion engine, I integrated strategic cropping into the OCR workflow. Here's how it transformed the pipeline:

Before Cropping:

Processing full PDF pages including margins, headers, footers
OCR struggling with mixed content types
Higher token costs for irrelevant content processing

After Implementation:

// Static crop configuration targeting main content area
const STATIC_CROP_OPTIONS = {
    left: 100,    // Skip left margin
    top: 50,      // Skip header area  
    width: 600,   // Focus on content width
    height: 800   // Exclude footer area
};

// Integrate into OCR pipeline
const croppedImagePath = generateCroppedImagePath(fullImagePath, "cropped");
await cropImage(fullImagePath, croppedImagePath, STATIC_CROP_OPTIONS);
const markdownContent = await performOcrOnImage(croppedImagePath);

Results:

Focused OCR analysis on relevant content areas only
Cleaner text extraction with fewer artifacts
Reduced processing overhead and faster pipeline execution

Best Practices for Implementation

1. Start with Static Coordinates

Begin with fixed crop coordinates based on your document types. Most business documents have predictable layouts – invoices, contracts, and reports typically follow standard formatting patterns.

2. Add Dynamic Detection Later

As your pipeline matures, implement intelligent content detection:

Use computer vision to identify text regions
Implement margin detection algorithms
Add automatic header/footer boundary detection

3. Validate Crop Areas

Always verify that your crop coordinates don't exclude important content:

Implement boundary checking to ensure crops stay within image dimensions
Add logging to track crop effectiveness
Monitor OCR accuracy metrics after implementing cropping

4. Cache Cropped Images

Store preprocessed images to avoid re-processing:

Cache cropped versions for reuse
Implement intelligent cache invalidation
Consider the storage vs. processing cost trade-off

5. Make Coordinates Configurable

Design your system for flexibility:

interface CropOptions {
  left: number;
  top: number; 
  width: number;
  height: number;
}

This allows easy adjustments without code changes as you encounter new document formats.

The ROI Story

Organizations implementing strategic cropping in their data ingestion pipelines often see:

Significant cost reduction through optimized token usage
Measurable accuracy improvements from focused content analysis
Improved processing efficiency due to smaller, cleaner inputs
Reduced manual review requirements due to better extraction quality

The potential is compelling: if you're processing large volumes of documents, proper cropping could deliver meaningful cost savings while improving results. Results will vary based on document types and implementation quality.

Getting Started

Ready to optimize your data ingestion pipeline? Start simple:

Analyze Your Documents: Identify common layout patterns in your document types
Implement Basic Cropping: Add configurable crop coordinates to remove obvious waste areas
Measure Results: Track accuracy, speed, and cost metrics before and after implementation
Iterate and Improve: Refine coordinates based on real-world performance data

Remember, the goal isn't perfect cropping from day one – it's eliminating the most obvious inefficiencies in your current pipeline. Even basic margin removal can yield significant improvements.

Strategic image cropping transforms data ingestion from a brute-force operation into a precision tool. In an era where API costs and processing efficiency directly impact your bottom line, can you afford not to implement this optimization?

The question isn't whether you should implement cropping – it's how quickly you can get started.

Building a PDF Ingestion Pipeline with TypeScript, Wasp, and AI OCR

Stefan Vitoria — Wed, 24 Dec 2025 14:27:02 +0000

How we built a scalable document processing system that converts PDFs to searchable text using modern web technologies

The Problem: Turning Static PDFs into Actionable Data

Picture this: You have thousands of PDF documents containing valuable information, but they're essentially digital paperweights. Users can't search through them effectively, extract insights, or build applications on top of the content. This is exactly the challenge we faced when building VROOM, a driving education platform for Cape Verde.

Our platform needed to ingest government traffic regulation PDFs and make them searchable and interactive for students. The documents contained crucial information about traffic signs, rules, and regulations, but in their static PDF format, they were practically unusable for modern web applications.

The requirements were clear:

Accept PDF uploads through a web API
Convert documents to searchable text while preserving structure
Handle large documents (dozens of pages) without blocking the main application
Provide real-time processing status and error handling
Scale to handle multiple concurrent document uploads

After researching various solutions, we decided to build our own PDF data ingestion pipeline using TypeScript, the Wasp full-stack framework, and AI-powered OCR. Here's the complete story of how we built it.

Why We Chose This Tech Stack

Wasp Framework: The Game Changer

Wasp is a declarative DSL that generates React + Node.js + Prisma applications. What makes it perfect for this use case:

Built-in job queues with PgBoss for background processing
Type-safe operations between frontend and backend
Integrated database operations with Prisma ORM
Zero-config deployment and development setup

The Supporting Cast

TypeScript: Type safety across the entire pipeline
pdf2pic: Reliable PDF-to-image conversion
Mistral AI OCR: State-of-the-art text extraction
PostgreSQL: Robust data persistence with JSONB support
PgBoss: Production-ready job queue built on PostgreSQL

This combination gave us enterprise-grade reliability with startup-level development speed.

System Architecture: A Bird's Eye View

Our PDF data ingestion pipeline follows a three-stage architecture designed for reliability, scalability, and maintainability:

┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   PDF Upload    │───▶│  Background Jobs │───▶│   Database      │
│   API Endpoint  │    │    (PgBoss)      │    │  (PostgreSQL)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
    File Validation        Image Processing        Content Storage
   & Initial Storage      & OCR Pipeline          (Structured Data)

The Three-Phase Processing Pipeline

Phase 1: Upload & Validation

Immediate PDF upload via REST API
File validation (type, size, format)
Database record creation with PROCESSING status
Background job submission for heavy processing
Instant response to client with document tracking ID

Phase 2: Image Generation

PDF pages converted to high-quality PNG images using pdf2pic
Images stored in organized file system structure
Document metadata updated with total page count
Individual OCR jobs queued for each page

Phase 3: Content Extraction

Mistral AI OCR processes each image independently
Extracted markdown content saved to database
Progress tracking across all pages
Document status updated to COMPLETED when all pages finish

Database Schema Design

Our schema is optimized for both write performance during processing and read performance for content queries:

-- Document tracking table
CREATE TABLE "Document" (
    "id" TEXT PRIMARY KEY,
    "name" TEXT NOT NULL,
    "path" TEXT NOT NULL,
    "totalPages" INTEGER NOT NULL,
    "status" "DocumentStatus" NOT NULL, -- QUEUED/PROCESSING/COMPLETED/FAILED
    "createdAt" TIMESTAMP DEFAULT NOW(),
    "updatedAt" TIMESTAMP NOT NULL
);

-- Individual page content storage
CREATE TABLE "DocumentPage" (
    "id" TEXT PRIMARY KEY,
    "documentId" TEXT REFERENCES "Document"("id"),
    "number" INTEGER NOT NULL,
    "path" TEXT, -- Image file path
    "markdown" TEXT NOT NULL, -- OCR extracted content
    "createdAt" TIMESTAMP DEFAULT NOW()
);

-- Performance indexes
CREATE INDEX "idx_doc_pages" ON "DocumentPage"("documentId", "number");
CREATE INDEX "idx_doc_status" ON "Document"("status");

This design enables us to:

Track processing status in real-time
Handle partial failures gracefully
Query content efficiently
Support document versioning (future enhancement)

Implementation Deep Dive

Let's walk through the key components of our implementation, starting with the API endpoint and moving through the background processing pipeline.

1. The Upload API Endpoint

Our upload endpoint is built with Express middleware and Wasp's type-safe operations:

// main.wasp - API definition
api ingestPdf {
  httpRoute: (POST, "/api/v1/data-ingestion/pdf"),
  fn: import { ingestPdf } from "@src/server/data_ingestion/apis/ingest-pdf",
  entities: [User]
}

// ingest-pdf.ts - Implementation
export const ingestPdf: IngestPdf = async (req, res, _context) => {
    try {
        // File validation
        if (!req.file || req.file.mimetype !== 'application/pdf') {
            throw new HttpError(400, "Valid PDF file required");
        }

        // Extract metadata
        const fileMetadata = {
            name: req.file.originalname,
            type: req.file.mimetype,
            size: req.file.size,
        };

        // Create database record immediately
        const documentId = await createDocument({
            name: fileMetadata.name,
            path: fileMetadata.name,
            totalPages: 0 // Updated after processing
        });

        // Return success immediately
        res.json({
            success: true,
            message: "PDF uploaded successfully. Processing started.",
            documentId: documentId
        });

        // Submit background job (non-blocking)
        await processPdfToImages.submit({
            fileBufferString: req.file.buffer.toString('base64'),
            fileMetadata: fileMetadata,
            documentId: documentId
        });

    } catch (error) {
        console.error("Upload failed:", error);
        throw new HttpError(500, "PDF processing failed");
    }
};

Key Design Decisions:

Immediate response: Client gets instant feedback with tracking ID
Base64 encoding: File buffer serialized for job queue persistence
Comprehensive validation: Multiple layers of file checking
Error isolation: Upload failures don't affect background processing

2. Stage 1: PDF-to-Image Conversion

The first background job converts PDF pages to high-quality images:

// pdf-to-image.job.ts
export const processPdfToImages: ProcessPdfToImages<ProcessPdfArgs, void> = 
async (args, context) => {
    const { fileBufferString, fileMetadata, documentId } = args;
    const fileBuffer = Buffer.from(fileBufferString, 'base64');

    try {
        // Configure pdf2pic for optimal quality/performance balance
        const convertOptions = {
            density: 150,           // DPI - good quality without huge files
            saveFilename: `${baseName}_page`,
            savePath: imagesDir,
            format: "png" as const,
            width: 800,            // Max width for web display
            height: 1200           // Preserve aspect ratio
        };

        const convert = fromBuffer(fileBuffer, convertOptions);
        const results = await convert.bulk(-1, { responseType: "image" });

        // Update document with actual page count
        await context.entities.Document.update({
            where: { id: documentId },
            data: { 
                totalPages: results.length,
                status: DocumentStatus.PROCESSING 
            }
        });

        // Submit OCR job for each page independently
        for (const [index, result] of results.entries()) {
            await extractAndProcessPageContent.delay(index).submit({
                pageNumber: index + 1,
                imagePath: path.basename(result.path),
                documentId: documentId
            });
        }

        console.log(`✅ Generated ${results.length} images for ${fileMetadata.name}`);

    } catch (error) {
        // Mark document as failed and log for monitoring
        await updateDocumentStatus(documentId, DocumentStatus.FAILED);
        throw error;
    }
};

Performance Optimizations:

Delayed job submission: Pages processed with staggered delays to prevent OCR API rate limiting
Optimal image settings: Balanced quality/file size for web applications
Atomic operations: Database updates happen only after successful image generation

3. Stage 2: OCR Content Extraction

Each page gets processed independently by our OCR pipeline:

// extract-and-process-page-content.job.ts
export const extractAndProcessPageContent: ExtractAndProcessPageContent = 
async (args, context) => {
    const { pageNumber, imagePath, documentId } = args;
    const fullImagePath = path.join(BASE_PATH, imagePath);

    try {
        // Perform OCR using Mistral AI
        const markdownContent = await performOcrOnImage(fullImagePath);

        // Save extracted content to database
        await saveDocumentPage({
            documentId: documentId,
            number: pageNumber,
            path: imagePath,
            markdown: markdownContent
        });

        // Check if all pages are complete
        const document = await context.entities.Document.findFirst({
            where: { id: documentId },
            include: { pages: { where: { deletedAt: null } } }
        });

        // Mark document complete when all pages processed
        if (document && document.pages.length === document.totalPages) {
            await updateDocumentStatus(documentId, DocumentStatus.COMPLETED);
            console.log(`🎉 Document ${documentId} fully processed`);
        }

    } catch (error) {
        console.error(`OCR failed for page ${pageNumber}:`, error);
        await updateDocumentStatus(documentId, DocumentStatus.FAILED);
        throw error;
    }
};

4. OCR Integration with Mistral AI

Our OCR service provides high-quality text extraction:

// mistral-ai.ts
export async function performOcrOnImage(imagePath: string): Promise<string> {
    const mistral = new Mistral({
        apiKey: process.env.MISTRAL_API_KEY ?? "",
    });

    const base64Image = await loadImageAsBase64(imagePath);

    const ocrResponse = await mistral.ocr.process({
        model: "mistral-ocr-latest",
        document: {
            imageUrl: base64Image,
            type: "image_url",
        },
    });

    if (ocrResponse.pages && ocrResponse.pages.length > 0) {
        return ocrResponse.pages[0].markdown;
    }

    throw new Error("No content extracted from OCR");
}

Why Mistral AI OCR?

High accuracy: Superior text recognition compared to traditional OCR
Markdown output: Structured format preserving document formatting
Multi-language support: Handles Portuguese content excellently
API reliability: Enterprise-grade uptime and rate limiting

Challenges We Encountered (And How We Solved Them)

Building a production-ready document processing pipeline isn't just about happy path scenarios. Here are the real-world challenges we faced and our solutions:

Challenge 1: Memory Management with Large PDFs

Problem: Large PDF files (>5MB) were causing memory issues when converting to images, especially with high DPI settings.

Solution: Implemented streaming buffer processing and optimized pdf2pic configuration:

// Before: Memory spikes with large files
const convert = fromBuffer(fileBuffer, { density: 300, ... });

// After: Balanced approach for web applications
const convertOptions = {
    density: 150,        // Reduced from 300 DPI
    width: 800,          // Max width constraint
    height: 1200,        // Preserve aspect ratio
    quality: 85          // Slight compression for file size
};

Result: 60% reduction in memory usage while maintaining text readability.

Challenge 2: OCR Rate Limiting and Failures

Problem: Mistral AI has rate limits, and simultaneous OCR requests for multi-page documents were hitting API limits.

Solution: Implemented intelligent job delays and retry logic:

// Stagger OCR jobs to respect rate limits
for (const [index, result] of results.entries()) {
    await extractAndProcessPageContent
        .delay(index * 2) // 2-second delays between pages
        .submit({
            pageNumber: index + 1,
            imagePath: path.basename(result.path),
            documentId: documentId
        });
}

Result: Zero rate limit errors and improved overall processing reliability.

Challenge 3: Partial Processing Failures

Problem: If one page failed OCR, the entire document would be marked as failed, even if other pages processed successfully.

Solution: Implemented granular error handling and recovery:

// Allow partial success with detailed error tracking
try {
    const markdownContent = await performOcrOnImage(fullImagePath);
    await saveDocumentPage({ /* ... */ });
} catch (error) {
    // Log error but don't fail entire document
    console.error(`Page ${pageNumber} failed, continuing with other pages`);

    // Save page with error status for manual review
    await saveDocumentPage({
        documentId,
        number: pageNumber,
        path: imagePath,
        markdown: `[OCR_ERROR]: ${error.message}`,
        status: 'FAILED'
    });
}

Result: Documents with partial failures can still be useful, with clear indication of problematic pages.

Challenge 4: File System Organization and Cleanup

Problem: Generated images were accumulating without cleanup, and file paths were hardcoded.

Solution: Implemented organized file structure and cleanup jobs:

// Organized file naming with document context
const baseName = path.parse(fileMetadata.originalName).name.replace(/[^a-zA-Z0-9_-]/g, '_');
const imagesDir = path.join(process.env.STORAGE_PATH, 'documents', documentId);

// Future cleanup job (in backlog)
job cleanupProcessedImages {
  executor: PgBoss,
  perform: { fn: import { cleanupOldImages } from "@src/jobs/cleanup" },
  schedule: { cron: "0 2 * * *" } // Daily at 2 AM
}

Result: Better file organization and foundation for automated cleanup.

Performance and Scalability Considerations

Our pipeline handles real-world production loads with these performance characteristics:

Current Performance Metrics

Processing Speed:

Small PDFs (1-5 pages): ~30-60 seconds end-to-end
Medium PDFs (10-20 pages): ~2-4 minutes
Large PDFs (30+ pages): ~5-10 minutes
Concurrent documents: Up to 10 simultaneous processing jobs

Resource Usage:

Memory: ~200MB peak per document during image generation
Storage: ~1.5MB per page (PNG images + database)
CPU: Moderate usage, mostly I/O bound waiting on OCR API

Scalability Design Decisions

Horizontal Scaling Ready:

// Job queue handles distribution across multiple workers
job processPdfToImages {
  executor: PgBoss,
  perform: { fn: import { processPdfToImages } from "@src/jobs/pdf-processing" },
  // PgBoss automatically distributes across multiple Node.js instances
}

Database Optimization:

-- Indexes for common query patterns
CREATE INDEX "idx_document_status_created" ON "Document"("status", "createdAt");
CREATE INDEX "idx_page_content_search" ON "DocumentPage" USING gin(to_tsvector('english', "markdown"));

-- Partitioning strategy for large deployments
CREATE TABLE "DocumentPage_2024" PARTITION OF "DocumentPage" 
FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');

Rate Limiting and Circuit Breakers:

// OCR service protection
const ocrWithRetry = async (imagePath: string, retries = 3): Promise<string> => {
    for (let attempt = 1; attempt <= retries; attempt++) {
        try {
            return await performOcrOnImage(imagePath);
        } catch (error) {
            if (attempt === retries) throw error;

            // Exponential backoff: 2s, 4s, 8s
            await new Promise(resolve => setTimeout(resolve, 2000 * attempt));
            console.warn(`OCR retry attempt ${attempt} for ${imagePath}`);
        }
    }
};

Monitoring and Observability

Key Metrics We Track:

Job queue length and processing times
OCR API success/failure rates
Storage usage growth
Database query performance
Memory usage patterns during processing

Health Checks:

// API endpoint for system health
api systemHealth {
  httpRoute: (GET, "/api/v1/health"),
  fn: import { getSystemHealth } from "@src/monitoring/health"
}

export const getSystemHealth = async () => ({
    database: await checkDatabaseConnection(),
    jobQueue: await checkJobQueueHealth(), 
    ocrService: await checkOcrServiceStatus(),
    storage: await checkStorageAvailability(),
    timestamp: new Date().toISOString()
});

Scaling Bottlenecks and Solutions

Current Bottleneck: OCR API Rate Limits

Problem: Mistral AI limits concurrent requests
Solution: Intelligent request queuing with backoff
Future: Multi-provider OCR with automatic failover

Future Bottleneck: Database Writes

Anticipated: High-volume concurrent page saves
Solution: Write-optimized database configuration and potential read replicas

Storage Considerations:

Current: Local file system for development
Production: S3-compatible storage with CDN for image serving
Cleanup: Automated deletion of processed images after content extraction

Lessons Learned and Key Takeaways

Building this PDF processing pipeline taught us valuable lessons about modern web application architecture:

1. Choose the Right Level of Abstraction

Lesson: Wasp's declarative approach eliminated huge amounts of boilerplate while maintaining flexibility.

// This simple Wasp declaration...
job processPdfToImages {
  executor: PgBoss,
  perform: { fn: import { processPdfToImages } from "@src/jobs/pdf-processing" },
  entities: [Document, DocumentPage]
}

// ...generates type-safe job submission, queue management, and error handling
await processPdfToImages.submit({ documentId, fileBuffer, metadata });

Impact: Reduced development time by ~40% compared to setting up raw Express + PgBoss + Prisma.

2. Design for Observability from Day One

Lesson: Comprehensive logging and status tracking saved countless debugging hours.

// Every major operation logs structured data
console.log(`📤 OCR job submitted - Doc: ${documentId}, Page: ${pageNumber}`, {
    documentId,
    pageNumber,
    jobId: result.id,
    timestamp: new Date().toISOString()
});

Impact: Issues can be traced through the entire pipeline with searchable, structured logs.

3. Embrace Async Processing for Better UX

Lesson: Immediate API responses with background processing creates much better user experience than blocking operations.

Before: 60-second API timeouts for large documents

After: Sub-second API responses with real-time status updates

4. Error Handling Should Be Granular and Recoverable

Lesson: Failing fast on individual pages while continuing document processing provides better user value.

// Don't let one bad page kill the entire document
try {
    await processPage(pageInfo);
} catch (error) {
    await logPageError(pageInfo, error);
    // Continue processing other pages
}

Impact: 85% of documents with partial page failures still provide valuable content.

5. AI APIs Are Powerful But Require Defensive Programming

Lesson: External AI services need circuit breakers, retries, and graceful degradation.

Key Strategies:

Exponential backoff for rate limits
Structured error responses for debugging
Fallback mechanisms for service outages
Cost monitoring for API usage

Future Enhancements and Roadmap

Our current implementation handles the core use case well, but there's always room for improvement:

Short Term (Next Month)

Progress Tracking API: Real-time status updates for frontend integration
Batch Processing: Handle multiple PDFs in a single upload
Content Validation: Quality checks on extracted text
File Cleanup Jobs: Automated removal of temporary images

Medium Term (3-6 Months)

Multi-Provider OCR: Automatic failover between OCR services
Content Search: Full-text search across all processed documents
Document Versioning: Handle updates to existing documents
Advanced Error Recovery: Automatic retry of failed pages

Long Term (6+ Months)

Real-time Processing: WebSocket updates for live status
AI Content Enhancement: Automatic tagging and categorization
Multi-format Support: Word documents, PowerPoint, images
Enterprise Features: User permissions, audit trails, API keys

Code Quality Improvements

// Current backlog items from our TODO list:
const backlogItems = [
    "Add progress tracking query endpoint",
    "Implement file cleanup operations", 
    "Enhanced error recovery with exponential backoff",
    "Remove hardcoded file paths",
    "Add structured logging with proper levels",
    "Implement job queue monitoring and dead letter handling"
];

The Bottom Line

Building a production-ready PDF data ingestion pipeline taught us that the architecture matters more than the individual technologies. By choosing tools that work well together (Wasp + TypeScript + PgBoss + Mistral AI), we built something that's both powerful and maintainable.

Key Success Factors:

Async-first design for better user experience
Comprehensive error handling for production reliability
Structured logging for operational visibility
Type safety across the entire stack
Gradual optimization based on real usage patterns

The entire pipeline processes documents reliably in production, handling everything from single-page forms to 50-page government manuals. More importantly, it's maintainable and extensible for future requirements.

Want to Build Something Similar?

If you're working on document processing or considering a similar architecture:

Start with the Wasp framework - The productivity boost is real
Design your job queue strategy early - Async processing is crucial for good UX
Choose your OCR provider carefully - Quality varies dramatically between services
Plan for partial failures - Documents are messy, and your system should handle that gracefully

Questions or want to dive deeper into any part of this architecture? Drop a comment below - I love discussing technical architecture and lessons learned from real-world implementations.

Building something similar? I'd love to hear about your approach and any challenges you're facing. The developer community thrives on sharing these kinds of implementation stories!

This post is part of our ongoing series on building modern SaaS applications with Wasp. Follow for more deep dives into full-stack TypeScript development and production architecture patterns.

Tags: #WebDev #TypeScript #PDF #OCR #Architecture #Wasp #FullStack #BackgroundJobs #AI

Building a Page-Level PDF Processing Pipeline for Smarter RAG Systems

Stefan Vitoria — Sat, 20 Dec 2025 17:59:55 +0000

How I solved the granularity problem in document-based AI applications using Wasp, pdf2pic, and smart architecture decisions

The Problem: When Your RAG System Can't Point to the Right Page

Picture this: You're building an AI-powered document analysis system. Users upload PDFs, ask questions, and expect precise answers with exact references. But there's a frustrating problem—your RAG (Retrieval-Augmented Generation) system can tell users what information exists but not where to find it.

❌ User: "Where can I find the tax information?"
❌ AI: "The tax rate is 15%"
❌ User: "But which page?"
❌ AI: 🤷‍♂️

The challenge I faced was that existing solutions didn't provide the exact level of control and page-level granularity I needed for my RAG system. Most approaches either lacked the precise page referencing I required, were overly complex for my use case, or didn't integrate well with my existing technology stack.

This becomes a dealbreaker for professional applications where users need to verify information, cite sources, or navigate large documents efficiently.

The Challenge: Finding the Right Fit for My Requirements

When building my document-based AI system, I had specific requirements that existing solutions couldn't meet:

What I Needed:

Page-level granularity: Each page processed and referenced individually
Cost-effective scaling: Handle volume processing without breaking the bank
Simple integration: Work seamlessly with my Wasp application stack
Full control: Customize the processing pipeline for my specific needs
Reliable quality: Consistent results across various PDF formats

What I Tried:

Cloud-based solutions: Often too expensive and complex for my specific use case
Basic PDF parsing: Inconsistent results with complex layouts and images
Simple OCR libraries: Required too much manual optimization and tuning

I needed a solution that was: Purpose-built, cost-effective, and gave me complete control over the processing pipeline.

The Solution: PDF-to-Image Pipeline with Smart Processing

After experimenting with various approaches, I developed a pipeline that converts PDFs into individual page images, processes each page separately, and maintains precise page references throughout the RAG system.

Architecture Overview

PDF Upload → Page Images → Individual OCR → RAG with Page References
     ↓              ↓              ↓              ↓
  [file.pdf]  [page_1.png]  [text + page #]  "Found on page 3"
             [page_2.png]
             [page_3.png]

Implementation: Building with Wasp and pdf2pic

I chose Wasp for its full-stack TypeScript approach and built-in file handling capabilities. Here's how I implemented the solution:

Step 1: Setting Up File Upload with Multer

First, I configured multer for handling PDF uploads directly in memory:

// src/server/data_ingestion/apis/ingest-pdf.ts
import multer from "multer";
import { HttpError, type MiddlewareConfigFn } from "wasp/server";
import { fromBuffer } from "pdf2pic";

const MAX_FILE_SIZE_BYTES = 10 * 1024 * 1024; // 10MB limit

export const ingestPdfMiddlewareConfigFn: MiddlewareConfigFn = (middlewareConfig) => {
  const upload = multer({
    storage: multer.memoryStorage(), // Keep in memory for processing
    limits: { fileSize: MAX_FILE_SIZE_BYTES },
    fileFilter: (_req, file, cb) => {
      if (file.mimetype === 'application/pdf') {
        cb(null, true);
      } else {
        cb(new Error('Only PDF files are allowed'));
      }
    },
  });

  middlewareConfig.set('multer.single', upload.single('pdf'));
  return middlewareConfig;
};

Step 2: Wasp API Configuration

In main.wasp, I used the new apiNamespace feature for cleaner middleware management:

// main.wasp
apiNamespace fileUploadMiddleware {
  middlewareConfigFn: import { ingestPdfMiddlewareConfigFn } from "@src/server/data_ingestion/apis/ingest-pdf",
  path: "/api/v1/data-ingestion"
}

api ingestPdf {
  httpRoute: (POST, "/api/v1/data-ingestion/pdf"),
  fn: import { ingestPdf } from "@src/server/data_ingestion/apis/ingest-pdf",
  entities: [User]
}

Step 3: The Heart of the System - PDF Processing Function

This is where the magic happens. The function takes a PDF buffer and converts each page to an individual image:

interface PageInfo {
  pageNumber: number;
  imagePath: string;
}

interface FileMetadata {
  name: string;
  originalName: string;
  type: string;
  size: number;
}

async function processPdfToImages(
  fileBuffer: Buffer, 
  fileMetadata: FileMetadata
): Promise<PageInfo[]> {
  try {
    // Validate the buffer
    if (!fileBuffer || fileBuffer.length === 0) {
      throw new HttpError(400, "Invalid PDF file buffer");
    }

    // Create storage directory
    const imagesDir = path.join(
      process.cwd(), 
      'src', 'server', 'data_ingestion', 'pdf_to_pic_images'
    );
    await fs.mkdir(imagesDir, { recursive: true });

    // Sanitize filename for file system compatibility
    const baseName = path.parse(fileMetadata.originalName)
      .name.replace(/[^a-zA-Z0-9_-]/g, '_');

    // Configure pdf2pic with optimal settings
    const convertPdfOptions = {
      density: 150,           // 150 DPI - sweet spot for OCR accuracy
      saveFilename: `${baseName}_page`,
      savePath: imagesDir,
      format: "png" as const, // PNG for lossless text clarity
      width: 800,            // Reasonable file size
      height: 1200           // Maintain aspect ratio
    };

    const convert = fromBuffer(fileBuffer, convertPdfOptions);

    // Convert all pages (-1 means all pages)
    const results = await convert.bulk(-1, { responseType: "image" });

    // Process results into our page info structure
    const pageInfos: PageInfo[] = [];

    if (results && Array.isArray(results)) {
      results.forEach((result, index) => {
        if (result.path) {
          const fileName = path.basename(result.path);
          pageInfos.push({
            pageNumber: index + 1,
            imagePath: fileName
          });
        }
      });
    }

    if (pageInfos.length === 0) {
      throw new HttpError(400, "No pages could be extracted from PDF");
    }

    return pageInfos;

  } catch (error) {
    if (error instanceof HttpError) throw error;

    console.error("PDF processing error:", error);
    throw new HttpError(500, "Failed to process PDF pages");
  }
}

Step 4: Main API Handler with Enhanced Response

The main API handler orchestrates the entire process:

export const ingestPdf: IngestPdf = async (req, res, _context) => {
  try {
    if (!req.file) {
      throw new HttpError(400, "No PDF file provided");
    }

    // Extract metadata
    const fileMetadata: FileMetadata = {
      name: req.file.filename || req.file.originalname,
      originalName: req.file.originalname,
      type: req.file.mimetype,
      size: req.file.size,
    };

    // Validate file type and size
    if (req.file.mimetype !== 'application/pdf') {
      throw new HttpError(400, "File must be a PDF");
    }

    if (req.file.size > MAX_FILE_SIZE_BYTES) {
      throw new HttpError(400, `File exceeds ${MAX_FILE_SIZE_BYTES / 1024 / 1024}MB limit`);
    }

    // Process PDF to images
    const pages = await processPdfToImages(req.file.buffer, fileMetadata);

    // Return enhanced response with page information
    res.json({
      success: true,
      message: "PDF processed successfully",
      file: fileMetadata,
      pages: pages,
      totalPages: pages.length
    });

  } catch (error) {
    if (error instanceof HttpError) throw error;
    console.error("Error processing PDF upload:", error);
    throw new HttpError(500, "Internal server error processing PDF");
  }
};

Error Handling: Production-Ready Considerations

Real-world PDF processing is messy. Here are the error scenarios I handle:

1. Corrupted or Invalid PDFs

try {
  results = await convert.bulk(-1, { responseType: "image" });
} catch (conversionError) {
  console.error("PDF conversion error:", conversionError);
  throw new HttpError(400, "Failed to convert PDF - file may be corrupted or password-protected");
}

2. File System Issues

try {
  await fs.mkdir(imagesDir, { recursive: true });
} catch (dirError) {
  console.error("Failed to create images directory:", dirError);
  throw new HttpError(500, "Failed to create storage directory");
}

3. Resource Limits

// In multer configuration
limits: { 
  fileSize: MAX_FILE_SIZE_BYTES,
  files: 1 // Only one file at a time
}

API Usage: From Upload to Page References

Making a Request

# Using cURL
curl -X POST \
  -F "pdf=@document.pdf" \
  http://localhost:3001/api/v1/data-ingestion/pdf

# Using JavaScript fetch
const formData = new FormData();
formData.append('pdf', fileInput.files[0]);

const response = await fetch('/api/v1/data-ingestion/pdf', {
  method: 'POST',
  body: formData
});

const result = await response.json();

Enhanced Response Structure

{
  "success": true,
  "message": "PDF processed successfully",
  "file": {
    "name": "tax_document.pdf",
    "originalName": "tax_document.pdf",
    "type": "application/pdf",
    "size": 2048576
  },
  "pages": [
    { "pageNumber": 1, "imagePath": "tax_document_page.1.png" },
    { "pageNumber": 2, "imagePath": "tax_document_page.2.png" },
    { "pageNumber": 3, "imagePath": "tax_document_page.3.png" }
  ],
  "totalPages": 3
}

Integration with RAG Systems

Now comes the payoff. With individual page images, I can:

1. Process Each Page with High-Accuracy OCR

// Example: Process with Mistral or other OCR service
async function processPageWithOCR(imagePath: string, pageNumber: number) {
  const imageBuffer = await fs.readFile(imagePath);

  const ocrResult = await mistralOCR.processImage(imageBuffer);

  return {
    pageNumber,
    content: ocrResult.text,
    confidence: ocrResult.confidence,
    imagePath
  };
}

2. Index Content with Page References

// Example: Store in vector database with page metadata
async function indexPageContent(pageData) {
  const embedding = await openai.embeddings.create({
    input: pageData.content,
    model: "text-embedding-3-small"
  });

  await vectorDB.upsert({
    id: `${documentId}_page_${pageData.pageNumber}`,
    values: embedding.data[0].embedding,
    metadata: {
      documentId,
      pageNumber: pageData.pageNumber,
      imagePath: pageData.imagePath,
      content: pageData.content
    }
  });
}

3. Provide Precise Query Responses

// Example: RAG query with page references
async function queryWithPageReferences(question: string) {
  const results = await vectorDB.query({
    vector: await embedQuestion(question),
    topK: 3,
    includeMetadata: true
  });

  return {
    answer: await generateAnswer(question, results),
    sources: results.map(r => ({
      pageNumber: r.metadata.pageNumber,
      imagePath: r.metadata.imagePath,
      relevanceScore: r.score
    }))
  };
}

Results: The Transformation in Action

Before Implementation

❌ User: "What's the tax rate mentioned in the document?"
❌ AI: "The tax rate is 15%"
❌ User: "Where exactly?"
❌ AI: "I found that information in the document"

After Implementation

✅ User: "What's the tax rate mentioned in the document?"
✅ AI: "The tax rate is 15%, found on page 3 of the document"
✅ User: "Can you show me the exact page?"
✅ AI: [Returns page 3 image with highlighted relevant section]

Performance Considerations

File Size and Processing Time

Small PDFs (1-5 pages): ~2-3 seconds processing time
Medium PDFs (10-20 pages): ~8-12 seconds processing time
Large PDFs (50+ pages): Consider async processing with job queues

Storage Requirements

Images are ~200-500KB each at 150 DPI
10-page PDF: ~2-5MB in images
Consider cleanup strategies for old processed files

Memory Usage

PDF buffer in memory: Temporary during processing
Images written to disk: Permanent storage
Peak memory usage: ~3x the PDF file size during processing

Production Deployment Tips

1. Async Processing for Large Documents

// Consider using a job queue for large PDFs
if (pages.length > 20) {
  await jobQueue.add('process-large-pdf', { 
    fileBuffer, 
    fileMetadata, 
    userId: context.user.id 
  });

  return { message: "Large PDF queued for processing" };
}

2. Storage Optimization

// Implement cleanup for old images
async function cleanupOldImages() {
  const cutoffDate = new Date(Date.now() - 7 * 24 * 60 * 60 * 1000); // 7 days
  // Remove images older than cutoff date
}

3. Rate Limiting

// Add rate limiting for PDF processing
const rateLimit = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 5, // 5 PDFs per window
  message: 'Too many PDF uploads, please try again later'
});

Key Takeaways and Future Possibilities

What I Learned

Page-level granularity is crucial for professional document AI applications
pdf2pic provides excellent quality with reasonable performance
Buffer-based processing is more secure than temporary file storage
Error handling is critical - PDFs are unpredictable
Wasp's middleware system makes complex integrations clean

Future Enhancements

OCR confidence scoring per page
Layout analysis to identify tables, charts, headers
Text highlighting on source images
Multi-format support (Word docs, PowerPoint, etc.)
Real-time collaboration on document analysis

The Bottom Line

Building page-level PDF processing transformed my RAG system from a basic Q&A tool into a professional document analysis platform. Users can now get precise answers with exact page references, making the AI system trustworthy and practical for real-world use cases.

The combination of Wasp's full-stack capabilities, pdf2pic's reliable conversion, and thoughtful error handling created a robust solution that scales from prototype to production.

Have you implemented similar document processing pipelines? I'd love to hear about your experiences and challenges in the comments below.

How I’m Building a Job Application Bot for Developers

Stefan Vitoria — Fri, 20 Jun 2025 22:01:14 +0000

Hi, I’m Stefan, a software engineer building a tool I wish existed earlier.

I recently started working on a side project called Devapply, a bot that automatically finds developer jobs, creates personalized application documents using AI, and even sends the applications for you—while you sleep.

This wasn’t a sudden idea. It came from watching the market and the community.

Even though I’m not actively looking for a job right now, I’ve been seeing developers all around me—especially in Portugal—sharing their struggles: fewer openings, more competition, and the burnout of endless job applications.

So I thought, maybe there’s something I can build that actually helps. A tool to spot the best job openings early (only listing with less then 24 hours), prepare a personalized application instantly, and give developers a head start before the rest of the market even knows the role exists.

And it’s also a great excuse to practice and apply some real-world software engineering and AI architecture concepts:

Scalability & parallel job processing
Queue-based architecture (BullMQ)
Agentic AI document generation workflows
Multimodal AI (different models used for different steps based on strengths)
Email automation & delivery tracking
Background workers and event-driven systems
Personalization at scale
GDPR-aware user data handling

And yes—this is built for developers in Portugal, and only scrapes job opportunities from Portuguese sources for now.

😓 The Problem: Applying for Jobs Sucks

If you’ve been job hunting as a developer, you know the pain:

Rewriting your resume 10 times for similar roles
Trying to stand out with generic cover letters
Spending hours copying/pasting job requirements
Writing awkward emails to recruiters

It feels like a full-time job in itself.

And it’s especially bad if you’re:

Already working full-time and trying to switch
Freelancing but want stability
Getting ghosted despite doing all the right things

So I thought:

What if I automate all of it?

🤖 The Idea: Automate the Entire Developer Job Application Pipeline (Portuguese market only)

Devapply is a tool built specifically for developers. Here's how it works:

Finds New Dev Jobs in Portugal:

It looks for new jobs daily from sources inside Portugal.
Filters them based on your preferences (location, stack, remote/flex).

Generates Tailored Applications:

Uses a multimodal AI workflow to build your resume, cover letter, and application email.
Each model is used based on its specific strengths (e.g., summarizing vs. writing).

Sends the Application for You:

For job posts with recruiter emails, it applies in your name.
You wake up to a notification: “3 applications sent. Docs ready.”

🏛️ Why I Chose to Focus on Developers First

There are plenty of resume tools out there, but:

None are focused on developers
None handle the full application lifecycle
Most stop at: "Here's a resume template. Good luck!"

I wanted to create something that understands developer workflows, tech stacks, and expectations. Something that doesn’t just offer tips — it takes action.

Also, devs are:

Open to automation
Often juggling multiple opportunities
Willing to pay for tools that save time and mental energy

So Devapply is developer-first by design.

🚀 What I’ve Built So Far

The scraper for Portuguese job boards is live and working
The AI document builder (resume + cover letter + email) is up and running
The email sender uses dynamic reply-to setup for user identity
The pipeline is queue-based with BullMQ for reliability and speed
Most pieces are built — now I’m integrating them into one seamless flow

It already works for me.

Now I want it to work for others.

🚀 What’s Next

Finish the user dashboard (see jobs, generated kits, application status)
Write more technical articles about how I built each part (this is the first)

I'm planning to charge later, but early adopters will get free access during the beta. My goal is to learn with real users, refine the system, and see if it truly solves the problem.

🌐 Why I’m Sharing This

If you’re a dev in Portugal, recruiter, indie hacker, or just curious — I’d love your feedback.

Would you use something like this?
Do you think its something worth building?
What would make it better?

This is not just a product. It’s an experiment — and your input helps shape what it becomes.

📅 Follow the Journey

Over the next few weeks, I’ll be publishing more details on:

How I scrape developer jobs in Portugal
Queue-based automation and scaling workers
Multimodal AI workflows for document generation
Email delivery + identity handling
GDPR compliance considerations
Real feedback from users (and what breaks!)

If that sounds interesting, follow me.

You can also:

Join the waitlist
DM me on X (@Ste_ravy)
Or drop your thoughts in the comments — I’m open :)

Thanks for reading,
Stefan