<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stefan Vitoria</title>
    <description>The latest articles on DEV Community by Stefan Vitoria (@steravy).</description>
    <link>https://dev.to/steravy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1148238%2F5cfee3b4-d6e2-45ba-9db9-c1b9e64d2766.jpg</url>
      <title>DEV Community: Stefan Vitoria</title>
      <link>https://dev.to/steravy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/steravy"/>
    <language>en</language>
    <item>
      <title>Building Vroom AI: A Multi-Agent Architecture for Intelligent Driving Education</title>
      <dc:creator>Stefan Vitoria</dc:creator>
      <pubDate>Mon, 29 Dec 2025 19:15:20 +0000</pubDate>
      <link>https://dev.to/steravy/building-vroom-ai-a-multi-agent-architecture-for-intelligent-driving-education-4h79</link>
      <guid>https://dev.to/steravy/building-vroom-ai-a-multi-agent-architecture-for-intelligent-driving-education-4h79</guid>
      <description>&lt;p&gt;&lt;em&gt;How we're transforming driving education with a sophisticated AI system powered by specialized agents, intelligent tools, and orchestrated workflows&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Vision: Reimagining Driving Education
&lt;/h2&gt;

&lt;p&gt;Learning to drive shouldn't be a one-size-fits-all experience. Every student has different strengths, weaknesses, and learning preferences. Some struggle with traffic signs, others with complex regulations, and many need personalized study paths that adapt to their unique pace and style.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vroom AI&lt;/strong&gt; is our answer to this challenge—an intelligent tutoring system that doesn't just teach driving rules, but understands each learner individually and adapts its approach accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture: A Symphony of AI Agents
&lt;/h2&gt;

&lt;p&gt;At the heart of Vroom AI lies a sophisticated multi-agent architecture built on &lt;strong&gt;Mastra AI framework&lt;/strong&gt;. Instead of a single monolithic AI, we've designed a ecosystem of specialized agents, each with distinct expertise, working together to create a truly personalized learning experience.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                    VROOM AI SYSTEM                      │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────────┐    ┌─────────────────┐            │
│  │   USER INPUT    │───▶│ Learning        │            │
│  │ "I'm struggling │    │ Orchestrator    │            │
│  │ with signs"     │    │ (Main Agent)    │            │
│  └─────────────────┘    └────────┬────────┘            │
│                                   │                     │
│                                   ▼                     │
│  ┌─────────────────────────────────────────────────────┤
│  │               SPECIALIST AGENTS                     │
│  │                                                     │
│  │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐    │
│  │ │    Quiz     │ │    Sign     │ │    Rule     │    │
│  │ │  Generator  │ │  Explainer  │ │ Interpreter │    │
│  │ └─────────────┘ └─────────────┘ └─────────────┘    │
│  │                                                     │
│  │        ┌─────────────┐    ┌─────────────┐          │
│  │        │ Study Path  │    │ RAG System  │          │
│  │        │   Agent     │    │ &amp;amp; Content   │          │
│  │        └─────────────┘    └─────────────┘          │
│  └─────────────────────────────────────────────────────┘
│                                                         │
│  ┌─────────────────────────────────────────────────────┤
│  │                    TOOL ARSENAL                     │
│  │                                                     │
│  │ Document Search • Quiz Creator • Progress Tracker  │
│  │ Learning Analytics • Content Recommender           │
│  └─────────────────────────────────────────────────────┘
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Agent Ecosystem: Specialized Intelligence
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Learning Orchestrator - The Conductor
&lt;/h3&gt;

&lt;p&gt;Think of this as the "brain" of the system. When you ask "What does that no-parking sign mean?", the Learning Orchestrator doesn't just answer—it analyzes your intent, considers your learning history, and decides which specialist can help you best.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Interaction:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Student: "I keep failing questions about regulatory signs"
Orchestrator: 
  → Analyzes: Quiz performance + learning gaps
  → Routes to: Sign Explainer (for concept understanding) 
  → Routes to: Quiz Generator (for targeted practice)
  → Routes to: Study Path Agent (for systematic improvement)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Quiz Generator - The Adaptive Assessor
&lt;/h3&gt;

&lt;p&gt;This agent doesn't just pull random questions from a database. It creates personalized quizzes by analyzing your weak spots, searching through our content repository, and generating questions that challenge you at exactly the right level.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detects you struggle with speed limit signs → 60% basic, 30% intermediate, 10% advanced questions&lt;/li&gt;
&lt;li&gt;Finds real-world scenarios → "You're approaching a school zone at 3 PM on a Tuesday..."&lt;/li&gt;
&lt;li&gt;Creates intelligent distractors → Wrong answers that teach common misconceptions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Sign Explainer - The Visual Expert
&lt;/h3&gt;

&lt;p&gt;Every traffic sign has a story—when to use it, why it matters, what happens if you ignore it. The Sign Explainer doesn't just define signs; it provides rich, contextual understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example Deep Dive:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "V15 sign"
Output:
  ✓ Official Definition: "No Overtaking"
  ✓ Real-world Context: "Typically found on hills, curves, or narrow roads"
  ✓ Related Signs: "Works with V16 (End of No Overtaking)"
  ✓ Safety Rationale: "Prevents accidents where visibility is limited"
  ✓ Penalties: "Fine + 2 points on license"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Rule Interpreter - The Legal Translator
&lt;/h3&gt;

&lt;p&gt;Traffic laws are written in dense legal language. The Rule Interpreter transforms complex regulations into clear, actionable understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; &lt;em&gt;"Vehicles shall not exceed the prescribed speed limit on designated thoroughfares during specified temporal periods as determined by municipal authorities..."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; &lt;em&gt;"Don't drive faster than the speed limit shown on signs. In residential areas, this is usually 50 km/h unless posted otherwise."&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Study Path Agent - The Personal Tutor
&lt;/h3&gt;

&lt;p&gt;This agent creates your personalized learning journey. It analyzes your goals, available time, and performance data to build a curriculum that's uniquely yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Study Plan:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week 1: Foundation Building
├─ Day 1-2: Traffic Signs Overview (2 hours)
├─ Day 3-4: Basic Right-of-Way Rules (2 hours)  
├─ Day 5-6: Speed Limits &amp;amp; Regulations (2 hours)
└─ Day 7: Practice Quiz + Review (1 hour)

Week 2: Advanced Concepts
├─ Day 8-9: Complex Intersections (2 hours)
├─ Day 10-11: Highway Driving Rules (2 hours)
├─ Day 12-13: Emergency Situations (2 hours)
└─ Day 14: Comprehensive Assessment (1 hour)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Tool Arsenal: Powering Intelligence
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Document Search - The Knowledge Retriever
&lt;/h3&gt;

&lt;p&gt;Using RAG (Retrieval-Augmented Generation), this tool searches through our entire content library—processed road code documents, sign databases, and quiz archives—to find exactly the right information for any question.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Smart Search Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Query: "parking rules near schools"
Retrieved Context:
  → Regulation Chapter 4.2: School Zone Parking
  → Signs P3, P4: Time-restricted parking
  → Real-world scenarios: Drop-off vs. all-day parking
  → Penalty information: Fines and enforcement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quiz Creator - The Question Architect
&lt;/h3&gt;

&lt;p&gt;This tool transforms raw content into engaging, educational questions. It analyzes document chunks, identifies key concepts, and creates multiple-choice questions with realistic wrong answers based on common student mistakes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Progress Tracker - The Learning Analytics Engine
&lt;/h3&gt;

&lt;p&gt;Every interaction generates valuable data: Which topics you master quickly? Where do you struggle? How does your performance change over time? The Progress Tracker captures and analyzes all of this to continually improve your experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning Analytics - The Insight Generator
&lt;/h3&gt;

&lt;p&gt;This tool goes beyond simple tracking—it identifies patterns, predicts learning outcomes, and provides actionable insights for both students and the system itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content Recommender - The Personalization Engine
&lt;/h3&gt;

&lt;p&gt;Like Netflix for driving education, this tool suggests what you should study next based on your performance, goals, and successful patterns from similar learners.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow Orchestra: Complex Tasks Made Simple
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Adaptive Quiz Generation Workflow
&lt;/h3&gt;

&lt;p&gt;When you request practice questions, here's what happens behind the scenes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. KNOWLEDGE ASSESSMENT
   ├─ Analyze your quiz history
   ├─ Identify weak areas (e.g., 40% accuracy on regulatory signs)
   └─ Set difficulty target: 70% basic, 20% medium, 10% hard

2. CONTENT RETRIEVAL  
   ├─ Search documents for regulatory sign content
   ├─ Pull related sign data from database
   └─ Gather successful question patterns

3. QUESTION GENERATION
   ├─ Create scenario: "You see this sign while driving..."
   ├─ Generate 4 answer options with 1 correct + 3 intelligent distractors
   └─ Add detailed explanation for learning

4. QUIZ ASSEMBLY
   ├─ Randomize question order
   ├─ Configure progress tracking
   └─ Deliver to your device with streaming updates
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Study Path Creation Workflow
&lt;/h3&gt;

&lt;p&gt;Creating your personalized curriculum involves:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. USER ANALYSIS
   ├─ Current knowledge level assessment
   ├─ Learning style identification
   ├─ Goal setting (exam date, daily time available)
   └─ Preference mapping

2. CURRICULUM GENERATION
   ├─ Topic sequencing based on dependencies
   ├─ Time allocation per subject
   ├─ Difficulty progression planning
   └─ Milestone definition

3. CONTENT MAPPING
   ├─ Resource selection for each topic
   ├─ Practice session configuration  
   ├─ Assessment scheduling
   └─ Review cycle planning

4. DELIVERY &amp;amp; ADAPTATION
   ├─ Progress tracking setup
   ├─ Performance monitoring
   ├─ Dynamic replanning triggers
   └─ Achievement system activation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  RAG Integration: Making Content Intelligent
&lt;/h2&gt;

&lt;p&gt;One of our key innovations is leveraging our existing content processing pipeline. We already convert PDF documents to markdown, extract sign information, and process quiz data. The RAG system transforms this raw content into intelligent, searchable knowledge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXISTING CONTENT PIPELINE → INTELLIGENT KNOWLEDGE BASE
════════════════════════════════════════════════════

PDF Documents ──► OCR Processing ──► Markdown Content
     │                                      │
     │                                      ▼
     │                             Semantic Chunking
     │                                      │
     │                                      ▼
     │                            Vector Embeddings
     │                                      │
     │                                      ▼
     └──► Sign Database ──────────► LibSQL Vector Database
              │                             │
              ▼                             ▼
        Quiz Content ──────────► Intelligent Search &amp;amp; Retrieval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means when you ask about overtaking rules, the system doesn't just match keywords—it understands context, finds related concepts, and provides comprehensive answers drawing from multiple sources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Learning Scenarios
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario 1: The Visual Learner
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sarah&lt;/strong&gt; prefers learning through examples and scenarios rather than memorizing rules.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sarah: "I don't understand when I can and can't overtake"

Learning Orchestrator: Identifies preference for scenario-based learning
↓
Rule Interpreter: Explains overtaking rules in simple language
↓  
Sign Explainer: Shows V15 (No Overtaking) and V16 (End No Overtaking) signs
↓
Quiz Generator: Creates scenario questions:
  "You're driving behind a slow truck on a two-lane road. 
   You see a V15 sign ahead. What should you do?"
↓
Content Recommender: Suggests related topics like lane discipline
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 2: The Last-Minute Crammer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;João&lt;/strong&gt; has his driving test in 2 weeks and needs an intensive study plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;João: "I need to pass my test in 2 weeks, studying 2 hours daily"

Study Path Agent: Creates intensive 14-day curriculum
↓
Progress Tracker: Monitors daily goals and completion rates
↓
Learning Analytics: Identifies João learns best in morning sessions
↓
Quiz Generator: Creates daily assessments targeting exam-style questions
↓
Adaptive adjustments: Speeds up mastered topics, extends time on weak areas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scenario 3: The Concept Connector
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Maria&lt;/strong&gt; struggles to see how different rules relate to each other.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Maria: "Why are there so many different speed limit signs?"

Sign Explainer: Shows speed limit sign family (C1, C2, C3...)
↓
Rule Interpreter: Explains speed limit hierarchy and contexts
↓
Document Search: Finds related regulations about speed enforcement
↓
Content Recommender: Suggests studying road classification and sign categories
↓
Quiz Creator: Generates comparison questions between different speed contexts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Learning Benefits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Students:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Personalized Experience&lt;/strong&gt;: No two learning paths are identical&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Difficulty&lt;/strong&gt;: Always challenged but never overwhelmed
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual Understanding&lt;/strong&gt;: Learn not just what, but why and when&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Preparation&lt;/strong&gt;: Focus time on areas that need improvement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-world Application&lt;/strong&gt;: Practice with scenarios you'll actually encounter&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  For Educators:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance Insights&lt;/strong&gt;: Detailed analytics on learning patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Optimization&lt;/strong&gt;: Data-driven improvements to educational materials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intervention Detection&lt;/strong&gt;: Early identification of struggling students&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Curriculum Planning&lt;/strong&gt;: Evidence-based study program development&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical Innovation Highlights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Multi-Agent Coordination
&lt;/h3&gt;

&lt;p&gt;Instead of a single AI trying to do everything, specialized agents collaborate, each bringing deep expertise to specific educational challenges.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Intelligent Content Reuse
&lt;/h3&gt;

&lt;p&gt;We transform existing educational content into a dynamic, searchable knowledge base that powers personalized learning experiences.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Adaptive Learning Engine
&lt;/h3&gt;

&lt;p&gt;The system continuously learns about each student and adjusts its teaching approach in real-time.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Contextual AI Responses
&lt;/h3&gt;

&lt;p&gt;Every answer considers not just the question, but the student's history, goals, and learning style.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of Driving Education
&lt;/h2&gt;

&lt;p&gt;This architecture represents more than just a technical achievement—it's a fundamental shift toward truly personalized education. By combining AI sophistication with educational expertise, we're creating learning experiences that adapt to each student rather than forcing students to adapt to rigid curricula.&lt;/p&gt;

&lt;p&gt;The multi-agent approach allows us to scale specialized expertise that would be impossible with human tutors alone, while the RAG integration ensures our AI stays grounded in authoritative, up-to-date content.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Building Intelligent Education Systems
&lt;/h2&gt;

&lt;p&gt;Vroom AI demonstrates how thoughtful architecture can transform traditional education. By breaking complex AI tasks into specialized agents, providing them with intelligent tools, and orchestrating their collaboration through well-designed workflows, we've created a system that's both sophisticated and maintainable.&lt;/p&gt;

&lt;p&gt;The key insights from this architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Specialization over Generalization&lt;/strong&gt;: Multiple focused agents outperform single general-purpose AI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Leveraging&lt;/strong&gt;: Transform existing materials into intelligent knowledge bases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Workflow Orchestration&lt;/strong&gt;: Complex educational tasks need structured, multi-step processes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Intelligence&lt;/strong&gt;: Systems must learn and evolve with each user interaction&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As we continue building and refining Vroom AI, this architectural foundation ensures we can add new capabilities, improve existing ones, and scale to serve learners with diverse needs and goals.&lt;/p&gt;

&lt;p&gt;The future of education isn't just AI-powered—it's AI-architected, with systems designed from the ground up to understand and adapt to human learning.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Want to learn more about building AI-powered education systems? Follow my journey as I document the development process, technical decisions, and lessons learned building Vroom AI.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect with me:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.linkedin.com/in/stefan-vit%C3%B3ria-391924261" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Steravy" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://x.com/Ste_ravy" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>edtech</category>
      <category>rag</category>
      <category>agents</category>
    </item>
    <item>
      <title>How Strategic Image Cropping Transforms Data Ingestion Pipelines</title>
      <dc:creator>Stefan Vitoria</dc:creator>
      <pubDate>Fri, 26 Dec 2025 13:38:36 +0000</pubDate>
      <link>https://dev.to/steravy/how-strategic-image-cropping-transforms-data-ingestion-pipelines-3kd5</link>
      <guid>https://dev.to/steravy/how-strategic-image-cropping-transforms-data-ingestion-pipelines-3kd5</guid>
      <description>&lt;p&gt;Picture this: You're running an OCR pipeline processing thousands of PDF documents daily. Your accuracy is decent, costs are climbing, and processing times are slower than you'd like. Sound familiar? &lt;/p&gt;

&lt;p&gt;The culprit might be hiding in plain sight – those pesky document margins, headers, footers, and irrelevant graphics that your OCR engine is desperately trying to make sense of. What if I told you that a simple preprocessing step could significantly reduce your costs while improving accuracy?&lt;/p&gt;

&lt;p&gt;Enter strategic image cropping – the unsung hero of efficient data ingestion.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Cost of Processing Everything
&lt;/h2&gt;

&lt;p&gt;When most developers implement document processing pipelines, they feed entire page/images to their OCR engines. This seems logical, but it's incredibly wasteful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Accuracy Problem&lt;/strong&gt;: OCR engines struggle with mixed content. When your algorithm tries to extract meaningful text from a document containing logos, decorative borders, page numbers, and watermarks, accuracy plummets. Research shows that general-purpose OCR typically achieves only 95% accuracy, but this drops significantly when processing noisy, unstructured content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Cost Problem&lt;/strong&gt;: Every pixel you send to an LLM-based OCR service costs tokens. Processing irrelevant content like headers, footers, and margins can substantially inflate your token usage. With LLM OCR processing typically costing $15-25 per 1,000 documents, this waste adds up quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Speed Problem&lt;/strong&gt;: Larger images with mixed content take longer to process. Your pipeline becomes bottlenecked by the sheer volume of irrelevant data being analyzed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: Intelligent Image Cropping
&lt;/h2&gt;

&lt;p&gt;Strategic cropping is about surgical precision – removing everything except the content that matters. Instead of sending a full document page to your OCR engine, you extract only the text-heavy regions where meaningful information lives.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;The process involves three key steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Content Analysis&lt;/strong&gt;: Identify regions of interest within the document&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precise Extraction&lt;/strong&gt;: Crop to focus areas with configurable coordinates
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized Processing&lt;/strong&gt;: Feed clean, focused images to your OCR engine&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here's a practical implementation using Sharp for image processing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;cropImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;inputPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;outputPath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;cropOptions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;CropOptions&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;height&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;cropOptions&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// Remove headers, footers, and margins&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;sharp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;inputPath&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;left&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;top&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;width&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;height&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;png&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;90&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;outputPath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;outputPath&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Measurable Impact
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Accuracy Improvements
&lt;/h3&gt;

&lt;p&gt;Recent research demonstrates meaningful improvements when preprocessing is properly implemented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AI-enhanced OCR systems can achieve up to 30% accuracy improvement over traditional methods&lt;/li&gt;
&lt;li&gt;Modern systems typically achieve 1-5% error rates on standard documents&lt;/li&gt;
&lt;li&gt;Elimination of visual noise reduces false positive text detection&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost Savings
&lt;/h3&gt;

&lt;p&gt;The financial benefits can be substantial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Token Reduction&lt;/strong&gt;: Context compression through cropping can significantly reduce token usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing Efficiency&lt;/strong&gt;: Clean, focused input can reduce processing costs substantially&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Selection&lt;/strong&gt;: Smaller, focused images allow you to use more cost-effective OCR models for routine tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Processing Efficiency
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Smaller images reduce computational overhead&lt;/li&gt;
&lt;li&gt;Less irrelevant content to analyze&lt;/li&gt;
&lt;li&gt;Fewer manual corrections needed due to improved accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Real-World Case Study: PDF Processing Pipeline
&lt;/h2&gt;

&lt;p&gt;In my recent implementation of a &lt;a href="https://dev.to/steravy/building-a-pdf-ingestion-pipeline-with-typescript-wasp-and-ai-ocr-1lpp"&gt;PDF data ingestion engine&lt;/a&gt;, I integrated strategic cropping into the OCR workflow. Here's how it transformed the pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before Cropping:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing full PDF pages including margins, headers, footers&lt;/li&gt;
&lt;li&gt;OCR struggling with mixed content types&lt;/li&gt;
&lt;li&gt;Higher token costs for irrelevant content processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Static crop configuration targeting main content area&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;STATIC_CROP_OPTIONS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;left&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Skip left margin&lt;/span&gt;
    &lt;span class="na"&gt;top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// Skip header area  &lt;/span&gt;
    &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;// Focus on content width&lt;/span&gt;
    &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;   &lt;span class="c1"&gt;// Exclude footer area&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Integrate into OCR pipeline&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;croppedImagePath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generateCroppedImagePath&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fullImagePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cropped&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;cropImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fullImagePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;croppedImagePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;STATIC_CROP_OPTIONS&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;markdownContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;performOcrOnImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;croppedImagePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focused OCR analysis on relevant content areas only&lt;/li&gt;
&lt;li&gt;Cleaner text extraction with fewer artifacts&lt;/li&gt;
&lt;li&gt;Reduced processing overhead and faster pipeline execution&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Best Practices for Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Start with Static Coordinates
&lt;/h3&gt;

&lt;p&gt;Begin with fixed crop coordinates based on your document types. Most business documents have predictable layouts – invoices, contracts, and reports typically follow standard formatting patterns.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Add Dynamic Detection Later
&lt;/h3&gt;

&lt;p&gt;As your pipeline matures, implement intelligent content detection:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use computer vision to identify text regions&lt;/li&gt;
&lt;li&gt;Implement margin detection algorithms&lt;/li&gt;
&lt;li&gt;Add automatic header/footer boundary detection&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Validate Crop Areas
&lt;/h3&gt;

&lt;p&gt;Always verify that your crop coordinates don't exclude important content:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Implement boundary checking to ensure crops stay within image dimensions&lt;/li&gt;
&lt;li&gt;Add logging to track crop effectiveness&lt;/li&gt;
&lt;li&gt;Monitor OCR accuracy metrics after implementing cropping&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Cache Cropped Images
&lt;/h3&gt;

&lt;p&gt;Store preprocessed images to avoid re-processing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache cropped versions for reuse&lt;/li&gt;
&lt;li&gt;Implement intelligent cache invalidation&lt;/li&gt;
&lt;li&gt;Consider the storage vs. processing cost trade-off&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  5. Make Coordinates Configurable
&lt;/h3&gt;

&lt;p&gt;Design your system for flexibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;CropOptions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;left&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;top&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; 
  &lt;span class="nl"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows easy adjustments without code changes as you encounter new document formats.&lt;/p&gt;

&lt;h2&gt;
  
  
  The ROI Story
&lt;/h2&gt;

&lt;p&gt;Organizations implementing strategic cropping in their data ingestion pipelines often see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Significant cost reduction&lt;/strong&gt; through optimized token usage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurable accuracy improvements&lt;/strong&gt; from focused content analysis
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved processing efficiency&lt;/strong&gt; due to smaller, cleaner inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced manual review&lt;/strong&gt; requirements due to better extraction quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The potential is compelling: if you're processing large volumes of documents, proper cropping could deliver meaningful cost savings while improving results. Results will vary based on document types and implementation quality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;Ready to optimize your data ingestion pipeline? Start simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Analyze Your Documents&lt;/strong&gt;: Identify common layout patterns in your document types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement Basic Cropping&lt;/strong&gt;: Add configurable crop coordinates to remove obvious waste areas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure Results&lt;/strong&gt;: Track accuracy, speed, and cost metrics before and after implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate and Improve&lt;/strong&gt;: Refine coordinates based on real-world performance data&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Remember, the goal isn't perfect cropping from day one – it's eliminating the most obvious inefficiencies in your current pipeline. Even basic margin removal can yield significant improvements.&lt;/p&gt;

&lt;p&gt;Strategic image cropping transforms data ingestion from a brute-force operation into a precision tool. In an era where API costs and processing efficiency directly impact your bottom line, can you afford not to implement this optimization?&lt;/p&gt;

&lt;p&gt;The question isn't whether you should implement cropping – it's how quickly you can get started.&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>machinelearning</category>
      <category>performance</category>
    </item>
    <item>
      <title>Building a PDF Ingestion Pipeline with TypeScript, Wasp, and AI OCR</title>
      <dc:creator>Stefan Vitoria</dc:creator>
      <pubDate>Wed, 24 Dec 2025 14:27:02 +0000</pubDate>
      <link>https://dev.to/steravy/building-a-pdf-ingestion-pipeline-with-typescript-wasp-and-ai-ocr-1lpp</link>
      <guid>https://dev.to/steravy/building-a-pdf-ingestion-pipeline-with-typescript-wasp-and-ai-ocr-1lpp</guid>
      <description>&lt;p&gt;&lt;em&gt;How we built a scalable document processing system that converts PDFs to searchable text using modern web technologies&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: Turning Static PDFs into Actionable Data
&lt;/h2&gt;

&lt;p&gt;Picture this: You have thousands of PDF documents containing valuable information, but they're essentially digital paperweights. Users can't search through them effectively, extract insights, or build applications on top of the content. This is exactly the challenge we faced when building &lt;strong&gt;VROOM&lt;/strong&gt;, a driving education platform for Cape Verde.&lt;/p&gt;

&lt;p&gt;Our platform needed to ingest government traffic regulation PDFs and make them searchable and interactive for students. The documents contained crucial information about traffic signs, rules, and regulations, but in their static PDF format, they were practically unusable for modern web applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The requirements were clear:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accept PDF uploads through a web API&lt;/li&gt;
&lt;li&gt;Convert documents to searchable text while preserving structure&lt;/li&gt;
&lt;li&gt;Handle large documents (dozens of pages) without blocking the main application&lt;/li&gt;
&lt;li&gt;Provide real-time processing status and error handling&lt;/li&gt;
&lt;li&gt;Scale to handle multiple concurrent document uploads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After researching various solutions, we decided to build our own PDF data ingestion pipeline using TypeScript, the Wasp full-stack framework, and AI-powered OCR. Here's the complete story of how we built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why We Chose This Tech Stack
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Wasp Framework: The Game Changer
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://wasp-lang.dev" rel="noopener noreferrer"&gt;Wasp&lt;/a&gt; is a declarative DSL that generates React + Node.js + Prisma applications. What makes it perfect for this use case:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Built-in job queues&lt;/strong&gt; with PgBoss for background processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type-safe operations&lt;/strong&gt; between frontend and backend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integrated database operations&lt;/strong&gt; with Prisma ORM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-config deployment&lt;/strong&gt; and development setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Supporting Cast
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript&lt;/strong&gt;: Type safety across the entire pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pdf2pic&lt;/strong&gt;: Reliable PDF-to-image conversion&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistral AI OCR&lt;/strong&gt;: State-of-the-art text extraction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt;: Robust data persistence with JSONB support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PgBoss&lt;/strong&gt;: Production-ready job queue built on PostgreSQL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This combination gave us enterprise-grade reliability with startup-level development speed.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture: A Bird's Eye View
&lt;/h2&gt;

&lt;p&gt;Our PDF data ingestion pipeline follows a three-stage architecture designed for reliability, scalability, and maintainability:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   PDF Upload    │───▶│  Background Jobs │───▶│   Database      │
│   API Endpoint  │    │    (PgBoss)      │    │  (PostgreSQL)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
    File Validation        Image Processing        Content Storage
   &amp;amp; Initial Storage      &amp;amp; OCR Pipeline          (Structured Data)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Three-Phase Processing Pipeline
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Phase 1: Upload &amp;amp; Validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate PDF upload via REST API&lt;/li&gt;
&lt;li&gt;File validation (type, size, format)&lt;/li&gt;
&lt;li&gt;Database record creation with &lt;code&gt;PROCESSING&lt;/code&gt; status&lt;/li&gt;
&lt;li&gt;Background job submission for heavy processing&lt;/li&gt;
&lt;li&gt;Instant response to client with document tracking ID&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 2: Image Generation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PDF pages converted to high-quality PNG images using pdf2pic&lt;/li&gt;
&lt;li&gt;Images stored in organized file system structure&lt;/li&gt;
&lt;li&gt;Document metadata updated with total page count&lt;/li&gt;
&lt;li&gt;Individual OCR jobs queued for each page&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Phase 3: Content Extraction&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mistral AI OCR processes each image independently&lt;/li&gt;
&lt;li&gt;Extracted markdown content saved to database&lt;/li&gt;
&lt;li&gt;Progress tracking across all pages&lt;/li&gt;
&lt;li&gt;Document status updated to &lt;code&gt;COMPLETED&lt;/code&gt; when all pages finish&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Database Schema Design
&lt;/h3&gt;

&lt;p&gt;Our schema is optimized for both write performance during processing and read performance for content queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Document tracking table&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;"Document"&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;"id"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"name"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"path"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"totalPages"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"status"&lt;/span&gt; &lt;span class="nv"&gt;"DocumentStatus"&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;-- QUEUED/PROCESSING/COMPLETED/FAILED&lt;/span&gt;
    &lt;span class="nv"&gt;"createdAt"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nv"&gt;"updatedAt"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Individual page content storage&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;"DocumentPage"&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;"id"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"documentId"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;REFERENCES&lt;/span&gt; &lt;span class="nv"&gt;"Document"&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nv"&gt;"number"&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;"path"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;-- Image file path&lt;/span&gt;
    &lt;span class="nv"&gt;"markdown"&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;-- OCR extracted content&lt;/span&gt;
    &lt;span class="nv"&gt;"createdAt"&lt;/span&gt; &lt;span class="nb"&gt;TIMESTAMP&lt;/span&gt; &lt;span class="k"&gt;DEFAULT&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;-- Performance indexes&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="nv"&gt;"idx_doc_pages"&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="nv"&gt;"DocumentPage"&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"documentId"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"number"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="nv"&gt;"idx_doc_status"&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="nv"&gt;"Document"&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This design enables us to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track processing status in real-time&lt;/li&gt;
&lt;li&gt;Handle partial failures gracefully&lt;/li&gt;
&lt;li&gt;Query content efficiently&lt;/li&gt;
&lt;li&gt;Support document versioning (future enhancement)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Deep Dive
&lt;/h2&gt;

&lt;p&gt;Let's walk through the key components of our implementation, starting with the API endpoint and moving through the background processing pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Upload API Endpoint
&lt;/h3&gt;

&lt;p&gt;Our upload endpoint is built with Express middleware and Wasp's type-safe operations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// main.wasp - API definition&lt;/span&gt;
&lt;span class="nx"&gt;api&lt;/span&gt; &lt;span class="nx"&gt;ingestPdf&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;httpRoute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;POST&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/data-ingestion/pdf&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;ingestPdf&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@src/server/data_ingestion/apis/ingest-pdf&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;User&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ingest-pdf.ts - Implementation&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ingestPdf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;IngestPdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// File validation&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mimetype&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/pdf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Valid PDF file required&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="c1"&gt;// Extract metadata&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileMetadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;originalname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mimetype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="c1"&gt;// Create database record immediately&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;createDocument&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;totalPages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="c1"&gt;// Updated after processing&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// Return success immediately&lt;/span&gt;
        &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PDF uploaded successfully. Processing started.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// Submit background job (non-blocking)&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;processPdfToImages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;fileBufferString&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="na"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Upload failed:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PDF processing failed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key Design Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Immediate response&lt;/strong&gt;: Client gets instant feedback with tracking ID&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Base64 encoding&lt;/strong&gt;: File buffer serialized for job queue persistence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive validation&lt;/strong&gt;: Multiple layers of file checking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error isolation&lt;/strong&gt;: Upload failures don't affect background processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Stage 1: PDF-to-Image Conversion
&lt;/h3&gt;

&lt;p&gt;The first background job converts PDF pages to high-quality images:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// pdf-to-image.job.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;processPdfToImages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ProcessPdfToImages&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;ProcessPdfArgs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;fileBufferString&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileBufferString&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;base64&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Configure pdf2pic for optimal quality/performance balance&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;convertOptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;density&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// DPI - good quality without huge files&lt;/span&gt;
            &lt;span class="na"&gt;saveFilename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_page`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;savePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;imagesDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;png&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;// Max width for web display&lt;/span&gt;
            &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;           &lt;span class="c1"&gt;// Preserve aspect ratio&lt;/span&gt;
        &lt;span class="p"&gt;};&lt;/span&gt;

        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;convert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fromBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;convertOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;responseType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;image&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// Update document with actual page count&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
                &lt;span class="na"&gt;totalPages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DocumentStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PROCESSING&lt;/span&gt; 
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// Submit OCR job for each page independently&lt;/span&gt;
        &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;extractAndProcessPageContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="na"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="na"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="na"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt;
            &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`✅ Generated &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; images for &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Mark document as failed and log for monitoring&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;updateDocumentStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DocumentStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FAILED&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Performance Optimizations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delayed job submission&lt;/strong&gt;: Pages processed with staggered delays to prevent OCR API rate limiting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimal image settings&lt;/strong&gt;: Balanced quality/file size for web applications
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomic operations&lt;/strong&gt;: Database updates happen only after successful image generation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Stage 2: OCR Content Extraction
&lt;/h3&gt;

&lt;p&gt;Each page gets processed independently by our OCR pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// extract-and-process-page-content.job.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;extractAndProcessPageContent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ExtractAndProcessPageContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; 
&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fullImagePath&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;BASE_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Perform OCR using Mistral AI&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;markdownContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;performOcrOnImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fullImagePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

        &lt;span class="c1"&gt;// Save extracted content to database&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveDocumentPage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;markdownContent&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// Check if all pages are complete&lt;/span&gt;
        &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findFirst&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="na"&gt;include&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;where&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;deletedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;

        &lt;span class="c1"&gt;// Mark document complete when all pages processed&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;document&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;totalPages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;updateDocumentStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DocumentStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;COMPLETED&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`🎉 Document &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; fully processed`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`OCR failed for page &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;updateDocumentStatus&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DocumentStatus&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;FAILED&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. OCR Integration with Mistral AI
&lt;/h3&gt;

&lt;p&gt;Our OCR service provides high-quality text extraction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// mistral-ai.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;performOcrOnImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mistral&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Mistral&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MISTRAL_API_KEY&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base64Image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;loadImageAsBase64&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ocrResponse&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;mistral&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ocr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;mistral-ocr-latest&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;document&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="na"&gt;imageUrl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base64Image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;image_url&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ocrResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;ocrResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;ocrResponse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No content extracted from OCR&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Mistral AI OCR?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High accuracy&lt;/strong&gt;: Superior text recognition compared to traditional OCR&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markdown output&lt;/strong&gt;: Structured format preserving document formatting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-language support&lt;/strong&gt;: Handles Portuguese content excellently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API reliability&lt;/strong&gt;: Enterprise-grade uptime and rate limiting&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges We Encountered (And How We Solved Them)
&lt;/h2&gt;

&lt;p&gt;Building a production-ready document processing pipeline isn't just about happy path scenarios. Here are the real-world challenges we faced and our solutions:&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge 1: Memory Management with Large PDFs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Large PDF files (&amp;gt;5MB) were causing memory issues when converting to images, especially with high DPI settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Implemented streaming buffer processing and optimized pdf2pic configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Before: Memory spikes with large files&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;convert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fromBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;density&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// After: Balanced approach for web applications&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;convertOptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;density&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Reduced from 300 DPI&lt;/span&gt;
    &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// Max width constraint&lt;/span&gt;
    &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;        &lt;span class="c1"&gt;// Preserve aspect ratio&lt;/span&gt;
    &lt;span class="na"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;          &lt;span class="c1"&gt;// Slight compression for file size&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: 60% reduction in memory usage while maintaining text readability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge 2: OCR Rate Limiting and Failures
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Mistral AI has rate limits, and simultaneous OCR requests for multi-page documents were hitting API limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Implemented intelligent job delays and retry logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Stagger OCR jobs to respect rate limits&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;entries&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;extractAndProcessPageContent&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delay&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// 2-second delays between pages&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="na"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt;
        &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Zero rate limit errors and improved overall processing reliability.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge 3: Partial Processing Failures
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: If one page failed OCR, the entire document would be marked as failed, even if other pages processed successfully.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Implemented granular error handling and recovery:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Allow partial success with detailed error tracking&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;markdownContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;performOcrOnImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fullImagePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveDocumentPage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Log error but don't fail entire document&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Page &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; failed, continuing with other pages`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Save page with error status for manual review&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveDocumentPage&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;number&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`[OCR_ERROR]: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;FAILED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Documents with partial failures can still be useful, with clear indication of problematic pages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Challenge 4: File System Organization and Cleanup
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problem&lt;/strong&gt;: Generated images were accumulating without cleanup, and file paths were hardcoded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: Implemented organized file structure and cleanup jobs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Organized file naming with document context&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;originalName&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;a-zA-Z0-9_-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imagesDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;STORAGE_PATH&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;documents&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Future cleanup job (in backlog)&lt;/span&gt;
&lt;span class="nx"&gt;job&lt;/span&gt; &lt;span class="nx"&gt;cleanupProcessedImages&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PgBoss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;cleanupOldImages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@src/jobs/cleanup&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;schedule&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;cron&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0 2 * * *&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="c1"&gt;// Daily at 2 AM&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result&lt;/strong&gt;: Better file organization and foundation for automated cleanup.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance and Scalability Considerations
&lt;/h2&gt;

&lt;p&gt;Our pipeline handles real-world production loads with these performance characteristics:&lt;/p&gt;

&lt;h3&gt;
  
  
  Current Performance Metrics
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Processing Speed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small PDFs (1-5 pages): ~30-60 seconds end-to-end&lt;/li&gt;
&lt;li&gt;Medium PDFs (10-20 pages): ~2-4 minutes
&lt;/li&gt;
&lt;li&gt;Large PDFs (30+ pages): ~5-10 minutes&lt;/li&gt;
&lt;li&gt;Concurrent documents: Up to 10 simultaneous processing jobs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resource Usage:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory: ~200MB peak per document during image generation&lt;/li&gt;
&lt;li&gt;Storage: ~1.5MB per page (PNG images + database)&lt;/li&gt;
&lt;li&gt;CPU: Moderate usage, mostly I/O bound waiting on OCR API&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scalability Design Decisions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Horizontal Scaling Ready:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Job queue handles distribution across multiple workers&lt;/span&gt;
&lt;span class="nx"&gt;job&lt;/span&gt; &lt;span class="nx"&gt;processPdfToImages&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PgBoss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;processPdfToImages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@src/jobs/pdf-processing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="c1"&gt;// PgBoss automatically distributes across multiple Node.js instances&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Database Optimization:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Indexes for common query patterns&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="nv"&gt;"idx_document_status_created"&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="nv"&gt;"Document"&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"createdAt"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="nv"&gt;"idx_page_content_search"&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="nv"&gt;"DocumentPage"&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;gin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;to_tsvector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'english'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;"markdown"&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;-- Partitioning strategy for large deployments&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="nv"&gt;"DocumentPage_2024"&lt;/span&gt; &lt;span class="k"&gt;PARTITION&lt;/span&gt; &lt;span class="k"&gt;OF&lt;/span&gt; &lt;span class="nv"&gt;"DocumentPage"&lt;/span&gt; 
&lt;span class="k"&gt;FOR&lt;/span&gt; &lt;span class="k"&gt;VALUES&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2024-01-01'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;TO&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2025-01-01'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rate Limiting and Circuit Breakers:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// OCR service protection&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ocrWithRetry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;retries&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="nx"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;performOcrOnImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="nx"&gt;retries&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

            &lt;span class="c1"&gt;// Exponential backoff: 2s, 4s, 8s&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`OCR retry attempt &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;attempt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; for &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Monitoring and Observability
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Metrics We Track:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Job queue length and processing times&lt;/li&gt;
&lt;li&gt;OCR API success/failure rates&lt;/li&gt;
&lt;li&gt;Storage usage growth&lt;/li&gt;
&lt;li&gt;Database query performance&lt;/li&gt;
&lt;li&gt;Memory usage patterns during processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Health Checks:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// API endpoint for system health&lt;/span&gt;
&lt;span class="nx"&gt;api&lt;/span&gt; &lt;span class="nx"&gt;systemHealth&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;httpRoute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;GET&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;/api/v1/health&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nx"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;getSystemHealth&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@src/monitoring/health&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;getSystemHealth&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;database&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkDatabaseConnection&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;jobQueue&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkJobQueueHealth&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
    &lt;span class="na"&gt;ocrService&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkOcrServiceStatus&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;checkStorageAvailability&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Scaling Bottlenecks and Solutions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Current Bottleneck: OCR API Rate Limits&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem&lt;/strong&gt;: Mistral AI limits concurrent requests&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Intelligent request queuing with backoff&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Future&lt;/strong&gt;: Multi-provider OCR with automatic failover&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Future Bottleneck: Database Writes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Anticipated&lt;/strong&gt;: High-volume concurrent page saves&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution&lt;/strong&gt;: Write-optimized database configuration and potential read replicas&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Storage Considerations:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Current&lt;/strong&gt;: Local file system for development&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production&lt;/strong&gt;: S3-compatible storage with CDN for image serving&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cleanup&lt;/strong&gt;: Automated deletion of processed images after content extraction&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons Learned and Key Takeaways
&lt;/h2&gt;

&lt;p&gt;Building this PDF processing pipeline taught us valuable lessons about modern web application architecture:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Choose the Right Level of Abstraction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: Wasp's declarative approach eliminated huge amounts of boilerplate while maintaining flexibility.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// This simple Wasp declaration...&lt;/span&gt;
&lt;span class="nx"&gt;job&lt;/span&gt; &lt;span class="nx"&gt;processPdfToImages&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;executor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PgBoss&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;perform&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;processPdfToImages&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@src/jobs/pdf-processing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;entities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DocumentPage&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ...generates type-safe job submission, queue management, and error handling&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;processPdfToImages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;submit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: Reduced development time by ~40% compared to setting up raw Express + PgBoss + Prisma.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;Design for Observability from Day One&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: Comprehensive logging and status tracking saved countless debugging hours.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Every major operation logs structured data&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`📤 OCR job submitted - Doc: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, Page: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;jobId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: Issues can be traced through the entire pipeline with searchable, structured logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Embrace Async Processing for Better UX&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: Immediate API responses with background processing creates much better user experience than blocking operations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt;: 60-second API timeouts for large documents&lt;br&gt;&lt;br&gt;
&lt;strong&gt;After&lt;/strong&gt;: Sub-second API responses with real-time status updates&lt;/p&gt;
&lt;h3&gt;
  
  
  4. &lt;strong&gt;Error Handling Should Be Granular and Recoverable&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: Failing fast on individual pages while continuing document processing provides better user value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Don't let one bad page kill the entire document&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageInfo&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;logPageError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageInfo&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// Continue processing other pages&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Impact&lt;/strong&gt;: 85% of documents with partial page failures still provide valuable content.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;AI APIs Are Powerful But Require Defensive Programming&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Lesson&lt;/strong&gt;: External AI services need circuit breakers, retries, and graceful degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exponential backoff for rate limits&lt;/li&gt;
&lt;li&gt;Structured error responses for debugging&lt;/li&gt;
&lt;li&gt;Fallback mechanisms for service outages&lt;/li&gt;
&lt;li&gt;Cost monitoring for API usage&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Future Enhancements and Roadmap
&lt;/h2&gt;

&lt;p&gt;Our current implementation handles the core use case well, but there's always room for improvement:&lt;/p&gt;

&lt;h3&gt;
  
  
  Short Term (Next Month)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Progress Tracking API&lt;/strong&gt;: Real-time status updates for frontend integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch Processing&lt;/strong&gt;: Handle multiple PDFs in a single upload&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Validation&lt;/strong&gt;: Quality checks on extracted text&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Cleanup Jobs&lt;/strong&gt;: Automated removal of temporary images&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Medium Term (3-6 Months)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Provider OCR&lt;/strong&gt;: Automatic failover between OCR services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Search&lt;/strong&gt;: Full-text search across all processed documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Document Versioning&lt;/strong&gt;: Handle updates to existing documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Error Recovery&lt;/strong&gt;: Automatic retry of failed pages&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Long Term (6+ Months)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Processing&lt;/strong&gt;: WebSocket updates for live status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Content Enhancement&lt;/strong&gt;: Automatic tagging and categorization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-format Support&lt;/strong&gt;: Word documents, PowerPoint, images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Features&lt;/strong&gt;: User permissions, audit trails, API keys&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Code Quality Improvements
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Current backlog items from our TODO list:&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;backlogItems&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Add progress tracking query endpoint&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Implement file cleanup operations&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Enhanced error recovery with exponential backoff&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Remove hardcoded file paths&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Add structured logging with proper levels&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Implement job queue monitoring and dead letter handling&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;Building a production-ready PDF data ingestion pipeline taught us that &lt;strong&gt;the architecture matters more than the individual technologies&lt;/strong&gt;. By choosing tools that work well together (Wasp + TypeScript + PgBoss + Mistral AI), we built something that's both powerful and maintainable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Success Factors:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Async-first design&lt;/strong&gt; for better user experience&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comprehensive error handling&lt;/strong&gt; for production reliability
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structured logging&lt;/strong&gt; for operational visibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type safety&lt;/strong&gt; across the entire stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradual optimization&lt;/strong&gt; based on real usage patterns&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The entire pipeline processes documents reliably in production, handling everything from single-page forms to 50-page government manuals. More importantly, it's maintainable and extensible for future requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  Want to Build Something Similar?
&lt;/h2&gt;

&lt;p&gt;If you're working on document processing or considering a similar architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Start with the Wasp framework&lt;/strong&gt; - The productivity boost is real&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design your job queue strategy early&lt;/strong&gt; - Async processing is crucial for good UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose your OCR provider carefully&lt;/strong&gt; - Quality varies dramatically between services&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan for partial failures&lt;/strong&gt; - Documents are messy, and your system should handle that gracefully&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Questions or want to dive deeper into any part of this architecture?&lt;/strong&gt; Drop a comment below - I love discussing technical architecture and lessons learned from real-world implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Building something similar?&lt;/strong&gt; I'd love to hear about your approach and any challenges you're facing. The developer community thrives on sharing these kinds of implementation stories!&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This post is part of our ongoing series on building modern SaaS applications with Wasp. Follow for more deep dives into full-stack TypeScript development and production architecture patterns.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags&lt;/strong&gt;: #WebDev #TypeScript #PDF #OCR #Architecture #Wasp #FullStack #BackgroundJobs #AI&lt;/p&gt;

</description>
      <category>ai</category>
      <category>typescript</category>
      <category>ocr</category>
      <category>backend</category>
    </item>
    <item>
      <title>Building a Page-Level PDF Processing Pipeline for Smarter RAG Systems</title>
      <dc:creator>Stefan Vitoria</dc:creator>
      <pubDate>Sat, 20 Dec 2025 17:59:55 +0000</pubDate>
      <link>https://dev.to/steravy/building-a-page-level-pdf-processing-pipeline-for-smarter-rag-systems-3bgm</link>
      <guid>https://dev.to/steravy/building-a-page-level-pdf-processing-pipeline-for-smarter-rag-systems-3bgm</guid>
      <description>&lt;p&gt;&lt;em&gt;How I solved the granularity problem in document-based AI applications using Wasp, pdf2pic, and smart architecture decisions&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem: When Your RAG System Can't Point to the Right Page
&lt;/h2&gt;

&lt;p&gt;Picture this: You're building an AI-powered document analysis system. Users upload PDFs, ask questions, and expect precise answers with exact references. But there's a frustrating problem—your RAG (Retrieval-Augmented Generation) system can tell users &lt;em&gt;what&lt;/em&gt; information exists but not &lt;em&gt;where&lt;/em&gt; to find it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ User: "Where can I find the tax information?"
❌ AI: "The tax rate is 15%"
❌ User: "But which page?"
❌ AI: 🤷‍♂️
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The challenge I faced was that existing solutions didn't provide the exact level of control and page-level granularity I needed for my RAG system. Most approaches either lacked the precise page referencing I required, were overly complex for my use case, or didn't integrate well with my existing technology stack.&lt;/p&gt;

&lt;p&gt;This becomes a dealbreaker for professional applications where users need to verify information, cite sources, or navigate large documents efficiently.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge: Finding the Right Fit for My Requirements
&lt;/h2&gt;

&lt;p&gt;When building my document-based AI system, I had specific requirements that existing solutions couldn't meet:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What I Needed:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Page-level granularity&lt;/strong&gt;: Each page processed and referenced individually&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-effective scaling&lt;/strong&gt;: Handle volume processing without breaking the bank&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple integration&lt;/strong&gt;: Work seamlessly with my Wasp application stack&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full control&lt;/strong&gt;: Customize the processing pipeline for my specific needs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reliable quality&lt;/strong&gt;: Consistent results across various PDF formats&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What I Tried:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cloud-based solutions&lt;/strong&gt;: Often too expensive and complex for my specific use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Basic PDF parsing&lt;/strong&gt;: Inconsistent results with complex layouts and images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple OCR libraries&lt;/strong&gt;: Required too much manual optimization and tuning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I needed a solution that was: &lt;strong&gt;Purpose-built, cost-effective, and gave me complete control over the processing pipeline&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: PDF-to-Image Pipeline with Smart Processing
&lt;/h2&gt;

&lt;p&gt;After experimenting with various approaches, I developed a pipeline that converts PDFs into individual page images, processes each page separately, and maintains precise page references throughout the RAG system.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture Overview
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PDF Upload → Page Images → Individual OCR → RAG with Page References
     ↓              ↓              ↓              ↓
  [file.pdf]  [page_1.png]  [text + page #]  "Found on page 3"
             [page_2.png]
             [page_3.png]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Implementation: Building with Wasp and pdf2pic
&lt;/h2&gt;

&lt;p&gt;I chose &lt;a href="https://wasp-lang.dev/" rel="noopener noreferrer"&gt;Wasp&lt;/a&gt; for its full-stack TypeScript approach and built-in file handling capabilities. Here's how I implemented the solution:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting Up File Upload with Multer
&lt;/h3&gt;

&lt;p&gt;First, I configured multer for handling PDF uploads directly in memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/server/data_ingestion/apis/ingest-pdf.ts&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;multer&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;multer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;MiddlewareConfigFn&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;wasp/server&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;fromBuffer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;pdf2pic&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;MAX_FILE_SIZE_BYTES&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// 10MB limit&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ingestPdfMiddlewareConfigFn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MiddlewareConfigFn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;middlewareConfig&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;upload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;multer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;storage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;multer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memoryStorage&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="c1"&gt;// Keep in memory for processing&lt;/span&gt;
    &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;fileSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MAX_FILE_SIZE_BYTES&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;fileFilter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;_req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mimetype&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/pdf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;cb&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Only PDF files are allowed&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;middlewareConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;multer.single&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;upload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;single&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pdf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;middlewareConfig&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Wasp API Configuration
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;main.wasp&lt;/code&gt;, I used the new &lt;code&gt;apiNamespace&lt;/code&gt; feature for cleaner middleware management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// main.wasp
apiNamespace fileUploadMiddleware {
  middlewareConfigFn: import { ingestPdfMiddlewareConfigFn } from "@src/server/data_ingestion/apis/ingest-pdf",
  path: "/api/v1/data-ingestion"
}

api ingestPdf {
  httpRoute: (POST, "/api/v1/data-ingestion/pdf"),
  fn: import { ingestPdf } from "@src/server/data_ingestion/apis/ingest-pdf",
  entities: [User]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: The Heart of the System - PDF Processing Function
&lt;/h3&gt;

&lt;p&gt;This is where the magic happens. The function takes a PDF buffer and converts each page to an individual image:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;PageInfo&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;FileMetadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;originalName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processPdfToImages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;fileBuffer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;Buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
  &lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;FileMetadata&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;PageInfo&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Validate the buffer&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;fileBuffer&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;fileBuffer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Invalid PDF file buffer&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Create storage directory&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imagesDir&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cwd&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; 
      &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;src&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;server&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;data_ingestion&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pdf_to_pic_images&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imagesDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Sanitize filename for file system compatibility&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;baseName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;originalName&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;a-zA-Z0-9_-&lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;_&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Configure pdf2pic with optimal settings&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;convertPdfOptions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;density&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;// 150 DPI - sweet spot for OCR accuracy&lt;/span&gt;
      &lt;span class="na"&gt;saveFilename&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;baseName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_page`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;savePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;imagesDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;png&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// PNG for lossless text clarity&lt;/span&gt;
      &lt;span class="na"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;800&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;// Reasonable file size&lt;/span&gt;
      &lt;span class="na"&gt;height&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1200&lt;/span&gt;           &lt;span class="c1"&gt;// Maintain aspect ratio&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;convert&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fromBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fileBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;convertPdfOptions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Convert all pages (-1 means all pages)&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;responseType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;image&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

    &lt;span class="c1"&gt;// Process results into our page info structure&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;pageInfos&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;PageInfo&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[];&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fileName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;basename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;path&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
          &lt;span class="nx"&gt;pageInfos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="na"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;index&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="na"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fileName&lt;/span&gt;
          &lt;span class="p"&gt;});&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageInfos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No pages could be extracted from PDF&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;pageInfos&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PDF processing error:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Failed to process PDF pages&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: Main API Handler with Enhanced Response
&lt;/h3&gt;

&lt;p&gt;The main API handler orchestrates the entire process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ingestPdf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;IngestPdf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;_context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;No PDF file provided&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Extract metadata&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="na"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;FileMetadata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;filename&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;originalname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;originalName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;originalname&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mimetype&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;

    &lt;span class="c1"&gt;// Validate file type and size&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mimetype&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;application/pdf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;File must be a PDF&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;MAX_FILE_SIZE_BYTES&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;`File exceeds &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;MAX_FILE_SIZE_BYTES&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;MB limit`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;// Process PDF to images&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processPdfToImages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Return enhanced response with page information&lt;/span&gt;
    &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;success&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PDF processed successfully&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;totalPages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Error processing PDF upload:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Internal server error processing PDF&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Error Handling: Production-Ready Considerations
&lt;/h2&gt;

&lt;p&gt;Real-world PDF processing is messy. Here are the error scenarios I handle:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Corrupted or Invalid PDFs&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;responseType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;image&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;conversionError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;PDF conversion error:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;conversionError&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Failed to convert PDF - file may be corrupted or password-protected&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;File System Issues&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mkdir&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imagesDir&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;recursive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;dirError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Failed to create images directory:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dirError&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Failed to create storage directory&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Resource Limits&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In multer configuration&lt;/span&gt;
&lt;span class="nx"&gt;limits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
  &lt;span class="nl"&gt;fileSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;MAX_FILE_SIZE_BYTES&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;files&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="c1"&gt;// Only one file at a time&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  API Usage: From Upload to Page References
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Making a Request
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using cURL&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"pdf=@document.pdf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  http://localhost:3001/api/v1/data-ingestion/pdf

&lt;span class="c"&gt;# Using JavaScript fetch&lt;/span&gt;
const formData &lt;span class="o"&gt;=&lt;/span&gt; new FormData&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
formData.append&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'pdf'&lt;/span&gt;, fileInput.files[0]&lt;span class="o"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

const response &lt;span class="o"&gt;=&lt;/span&gt; await fetch&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'/api/v1/data-ingestion/pdf'&lt;/span&gt;, &lt;span class="o"&gt;{&lt;/span&gt;
  method: &lt;span class="s1"&gt;'POST'&lt;/span&gt;,
  body: formData
&lt;span class="o"&gt;})&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

const result &lt;span class="o"&gt;=&lt;/span&gt; await response.json&lt;span class="o"&gt;()&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Enhanced Response Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PDF processed successfully"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tax_document.pdf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"originalName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tax_document.pdf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"application/pdf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2048576&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"pageNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"imagePath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tax_document_page.1.png"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"pageNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"imagePath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tax_document_page.2.png"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"pageNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"imagePath"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tax_document_page.3.png"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"totalPages"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Integration with RAG Systems
&lt;/h2&gt;

&lt;p&gt;Now comes the payoff. With individual page images, I can:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Process Each Page with High-Accuracy OCR&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Process with Mistral or other OCR service&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;processPageWithOCR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;imageBuffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ocrResult&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;mistralOCR&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;processImage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;imageBuffer&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ocrResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ocrResult&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;imagePath&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Index Content with Page References&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Store in vector database with page metadata&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;indexPageContent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pageData&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text-embedding-3-small&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;vectorDB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_page_&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;pageData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;values&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;documentId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;pageData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Provide Precise Query Responses&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: RAG query with page references&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;queryWithPageReferences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;vectorDB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;embedQuestion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;topK&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;includeMetadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generateAnswer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;pageNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;imagePath&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;relevanceScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;score&lt;/span&gt;
    &lt;span class="p"&gt;}))&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Results: The Transformation in Action
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Before Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ User: "What's the tax rate mentioned in the document?"
❌ AI: "The tax rate is 15%"
❌ User: "Where exactly?"
❌ AI: "I found that information in the document"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ User: "What's the tax rate mentioned in the document?"
✅ AI: "The tax rate is 15%, found on page 3 of the document"
✅ User: "Can you show me the exact page?"
✅ AI: [Returns page 3 image with highlighted relevant section]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Performance Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  File Size and Processing Time
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small PDFs (1-5 pages)&lt;/strong&gt;: ~2-3 seconds processing time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medium PDFs (10-20 pages)&lt;/strong&gt;: ~8-12 seconds processing time&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large PDFs (50+ pages)&lt;/strong&gt;: Consider async processing with job queues&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Storage Requirements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Images are ~200-500KB each&lt;/strong&gt; at 150 DPI&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10-page PDF&lt;/strong&gt;: ~2-5MB in images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consider cleanup strategies&lt;/strong&gt; for old processed files&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Memory Usage
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PDF buffer in memory&lt;/strong&gt;: Temporary during processing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Images written to disk&lt;/strong&gt;: Permanent storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Peak memory usage&lt;/strong&gt;: ~3x the PDF file size during processing&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Production Deployment Tips
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Async Processing for Large Documents&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Consider using a job queue for large PDFs&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;jobQueue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;process-large-pdf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; 
    &lt;span class="nx"&gt;fileBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="nx"&gt;fileMetadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; 
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Large PDF queued for processing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. &lt;strong&gt;Storage Optimization&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Implement cleanup for old images&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;cleanupOldImages&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;cutoffDate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// 7 days&lt;/span&gt;
  &lt;span class="c1"&gt;// Remove images older than cutoff date&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. &lt;strong&gt;Rate Limiting&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Add rate limiting for PDF processing&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rateLimit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;rateLimit&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;windowMs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;60&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// 15 minutes&lt;/span&gt;
  &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;// 5 PDFs per window&lt;/span&gt;
  &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Too many PDF uploads, please try again later&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Key Takeaways and Future Possibilities
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What I Learned
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Page-level granularity is crucial&lt;/strong&gt; for professional document AI applications&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pdf2pic provides excellent quality&lt;/strong&gt; with reasonable performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Buffer-based processing&lt;/strong&gt; is more secure than temporary file storage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error handling is critical&lt;/strong&gt; - PDFs are unpredictable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wasp's middleware system&lt;/strong&gt; makes complex integrations clean&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Future Enhancements
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OCR confidence scoring&lt;/strong&gt; per page&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layout analysis&lt;/strong&gt; to identify tables, charts, headers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Text highlighting&lt;/strong&gt; on source images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-format support&lt;/strong&gt; (Word docs, PowerPoint, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time collaboration&lt;/strong&gt; on document analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Bottom Line
&lt;/h3&gt;

&lt;p&gt;Building page-level PDF processing transformed my RAG system from a basic Q&amp;amp;A tool into a professional document analysis platform. Users can now get precise answers with exact page references, making the AI system trustworthy and practical for real-world use cases.&lt;/p&gt;

&lt;p&gt;The combination of Wasp's full-stack capabilities, pdf2pic's reliable conversion, and thoughtful error handling created a robust solution that scales from prototype to production.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have you implemented similar document processing pipelines? I'd love to hear about your experiences and challenges in the comments below.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>rag</category>
      <category>architecture</category>
      <category>tutorial</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I’m Building a Job Application Bot for Developers</title>
      <dc:creator>Stefan Vitoria</dc:creator>
      <pubDate>Fri, 20 Jun 2025 22:01:14 +0000</pubDate>
      <link>https://dev.to/steravy/how-im-building-a-job-application-bot-for-developers-4372</link>
      <guid>https://dev.to/steravy/how-im-building-a-job-application-bot-for-developers-4372</guid>
      <description>&lt;p&gt;&lt;strong&gt;Hi, I’m Stefan, a software engineer building a tool I wish existed earlier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I recently started working on a side project called &lt;strong&gt;Devapply&lt;/strong&gt;, a bot that automatically finds developer jobs, creates personalized application documents using AI, and even sends the applications for you—while you sleep.&lt;/p&gt;

&lt;p&gt;This wasn’t a sudden idea. It came from watching the market and the community.&lt;/p&gt;

&lt;p&gt;Even though I’m not actively looking for a job right now, I’ve been seeing developers all around me—especially in Portugal—sharing their struggles: fewer openings, more competition, and the burnout of endless job applications.&lt;/p&gt;

&lt;p&gt;So I thought, maybe there’s something I can build that actually helps. A tool to spot the best job openings early (only listing with less then 24 hours), prepare a personalized application instantly, and give developers a head start before the rest of the market even knows the role exists.&lt;/p&gt;

&lt;p&gt;And it’s also a great excuse to practice and apply some real-world software engineering and AI architecture concepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scalability &amp;amp; parallel job processing&lt;/li&gt;
&lt;li&gt;Queue-based architecture (BullMQ)&lt;/li&gt;
&lt;li&gt;Agentic AI document generation workflows&lt;/li&gt;
&lt;li&gt;Multimodal AI (different models used for different steps based on strengths)&lt;/li&gt;
&lt;li&gt;Email automation &amp;amp; delivery tracking&lt;/li&gt;
&lt;li&gt;Background workers and event-driven systems&lt;/li&gt;
&lt;li&gt;Personalization at scale&lt;/li&gt;
&lt;li&gt;GDPR-aware user data handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And yes—&lt;strong&gt;this is built for developers in Portugal&lt;/strong&gt;, and &lt;strong&gt;only scrapes job opportunities from Portuguese sources&lt;/strong&gt; for now.&lt;/p&gt;




&lt;h3&gt;
  
  
  😓 The Problem: Applying for Jobs Sucks
&lt;/h3&gt;

&lt;p&gt;If you’ve been job hunting as a developer, you know the pain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rewriting your resume 10 times for similar roles&lt;/li&gt;
&lt;li&gt;Trying to stand out with generic cover letters&lt;/li&gt;
&lt;li&gt;Spending hours copying/pasting job requirements&lt;/li&gt;
&lt;li&gt;Writing awkward emails to recruiters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It feels like a full-time job in itself.&lt;/p&gt;

&lt;p&gt;And it’s especially bad if you’re:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Already working full-time and trying to switch&lt;/li&gt;
&lt;li&gt;Freelancing but want stability&lt;/li&gt;
&lt;li&gt;Getting ghosted despite doing all the right things&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt; So I thought:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if I automate all of it?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  🤖 The Idea: Automate the Entire Developer Job Application Pipeline (Portuguese market only)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Devapply&lt;/strong&gt; is a tool built specifically for developers. Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Finds New Dev Jobs in Portugal&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;It looks for new jobs daily from sources inside Portugal.&lt;/li&gt;
&lt;li&gt;Filters them based on your preferences (location, stack, remote/flex).&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generates Tailored Applications&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Uses a multimodal AI workflow to build your resume, cover letter, and application email.&lt;/li&gt;
&lt;li&gt;Each model is used based on its specific strengths (e.g., summarizing vs. writing).&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Sends the Application for You&lt;/strong&gt;:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;For job posts with recruiter emails, it applies in your name.&lt;/li&gt;
&lt;li&gt;You wake up to a notification: “3 applications sent. Docs ready.”&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  🏛️ Why I Chose to Focus on Developers First
&lt;/h3&gt;

&lt;p&gt;There are plenty of resume tools out there, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;None are focused on &lt;strong&gt;developers&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;None handle &lt;strong&gt;the full application lifecycle&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Most stop at: "Here's a resume template. Good luck!"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wanted to create something that understands developer workflows, tech stacks, and expectations. Something that doesn’t just offer tips — it takes action.&lt;/p&gt;

&lt;p&gt;Also, devs are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open to automation&lt;/li&gt;
&lt;li&gt;Often juggling multiple opportunities&lt;/li&gt;
&lt;li&gt;Willing to pay for tools that save time and mental energy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So Devapply is developer-first by design.&lt;/p&gt;




&lt;h3&gt;
  
  
  🚀 What I’ve Built So Far
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The scraper for Portuguese job boards is live and working&lt;/li&gt;
&lt;li&gt;The AI document builder (resume + cover letter + email) is up and running&lt;/li&gt;
&lt;li&gt;The email sender uses dynamic reply-to setup for user identity&lt;/li&gt;
&lt;li&gt;The pipeline is queue-based with BullMQ for reliability and speed&lt;/li&gt;
&lt;li&gt;Most pieces are built — now I’m integrating them into one seamless flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It already works for me.&lt;/p&gt;

&lt;p&gt;Now I want it to work for others.&lt;/p&gt;




&lt;h3&gt;
  
  
  🚀 What’s Next
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Finish the user dashboard&lt;/strong&gt; (see jobs, generated kits, application status)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write more technical articles&lt;/strong&gt; about how I built each part (this is the first)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I'm planning to charge later, but &lt;strong&gt;early adopters will get free access during the beta&lt;/strong&gt;. My goal is to learn with real users, refine the system, and see if it truly solves the problem.&lt;/p&gt;




&lt;h3&gt;
  
  
  🌐 Why I’m Sharing This
&lt;/h3&gt;

&lt;p&gt;If you’re a dev in Portugal, recruiter, indie hacker, or just curious — I’d love your feedback.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Would you use something like this?&lt;/li&gt;
&lt;li&gt;Do you think its something worth building?&lt;/li&gt;
&lt;li&gt;What would make it better?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not just a product. It’s an experiment — and your input helps shape what it becomes.&lt;/p&gt;




&lt;h3&gt;
  
  
  📅 Follow the Journey
&lt;/h3&gt;

&lt;p&gt;Over the next few weeks, I’ll be publishing more details on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How I scrape developer jobs in Portugal&lt;/li&gt;
&lt;li&gt;Queue-based automation and scaling workers&lt;/li&gt;
&lt;li&gt;Multimodal AI workflows for document generation&lt;/li&gt;
&lt;li&gt;Email delivery + identity handling&lt;/li&gt;
&lt;li&gt;GDPR compliance considerations&lt;/li&gt;
&lt;li&gt;Real feedback from users (and what breaks!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If that sounds interesting, follow me.&lt;/p&gt;

&lt;p&gt;You can also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Join the waitlist&lt;/li&gt;
&lt;li&gt;DM me on X (@Ste_ravy)&lt;/li&gt;
&lt;li&gt;Or drop your thoughts in the comments — I’m open :)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for reading,&lt;br&gt;
&lt;strong&gt;Stefan&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>ai</category>
      <category>automation</category>
      <category>career</category>
    </item>
  </channel>
</rss>
