<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: BLESSEDEFEM</title>
    <description>The latest articles on DEV Community by BLESSEDEFEM (@blessedefem).</description>
    <link>https://dev.to/blessedefem</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3678454%2F33621605-a0a6-4db7-a0b4-06d89d7e9649.png</url>
      <title>DEV Community: BLESSEDEFEM</title>
      <link>https://dev.to/blessedefem</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/blessedefem"/>
    <language>en</language>
    <item>
      <title>I Built a Production RAG System in 3 Weeks - Here's What Actually Broke</title>
      <dc:creator>BLESSEDEFEM</dc:creator>
      <pubDate>Fri, 09 Jan 2026 13:37:46 +0000</pubDate>
      <link>https://dev.to/blessedefem/i-built-a-production-rag-system-in-3-weeks-heres-what-actually-broke-556m</link>
      <guid>https://dev.to/blessedefem/i-built-a-production-rag-system-in-3-weeks-heres-what-actually-broke-556m</guid>
      <description>&lt;p&gt;&lt;strong&gt;The Wake-Up Call&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I deployed my first AI system to production. Within the first hour, it crashed several times.&lt;/p&gt;

&lt;p&gt;The error logs were a nightmare: relation "documents" does not exist, Dense vectors must contain at least one non-zero value, 429 Too Many Requests, CORS policy: No 'Access-Control-Allow-Origin'. Every fix revealed three new problems.&lt;/p&gt;

&lt;p&gt;Most RAG tutorials end at "it works on localhost." They skip the brutal reality: rate limits, CORS hell, database migrations, API quota exhaustion, and the 3 AM debugging sessions that come with real production systems.&lt;/p&gt;

&lt;p&gt;This isn't that kind of tutorial.&lt;/p&gt;

&lt;p&gt;I'm Blessing, a junior AI engineer from Lagos, Nigeria. This was my first production AI system, and I documented every failure, every panic moment, and every "why didn't the tutorial mention THIS?" frustration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Here's what you'll learn:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why my embeddings worked locally but failed in production&lt;br&gt;
The cascade of failures that happens when one service hits quota&lt;br&gt;
How I went from "no relevant information found" on every query to 90% success rate&lt;br&gt;
Real code and architecture decisions (not theory)&lt;br&gt;
Actual production metrics and costs&lt;/p&gt;

&lt;p&gt;If you're building your first production AI system, this post might save you 47 crashes and countless hours of debugging.&lt;br&gt;
Let's dive into what actually happened.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I Built (And Why It Matters)&lt;/strong&gt;&lt;br&gt;
The System: A RAG (Retrieval-Augmented Generation) Document Q&amp;amp;A application where users upload PDFs, DOCX, or TXT files, then ask questions in plain English and get AI-generated answers with source citations.&lt;br&gt;
Why RAG? Traditional LLMs hallucinate - they confidently make things up. RAG solves this by grounding responses in YOUR actual documents. Upload your company's 500-page policy manual, ask "What's our remote work policy?" and get an accurate answer with the exact page reference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt; Instead of Ctrl+F through dozens of files, users get conversational answers in 2-4 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it live: &lt;a href="https://rag-document-qa-system.vercel.app" rel="noopener noreferrer"&gt;rag-document-qa-system.vercel.app&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Tech Stack (And Why I Chose Each)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;React + TypeScript + Tailwind CSS&lt;br&gt;
Deployed on Vercel&lt;br&gt;
Why: Fast dev experience, automatic deployments, global CDN&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;FastAPI (Python)&lt;br&gt;
Deployed on Railway&lt;br&gt;
Why: Async support, automatic API docs, simpler than AWS&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databases:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;PostgreSQL (document metadata)&lt;br&gt;
Pinecone (vector embeddings)&lt;br&gt;
Why: Pinecone serverless = no infrastructure management&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI Services:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google Gemini 2.0 Flash (answer generation)&lt;br&gt;
Cohere embed-v3 (embeddings)&lt;br&gt;
Why: Gemini's free tier (15K requests/month) vs OpenAI's limited free trial&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Authentication:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clerk (JWT-based)&lt;br&gt;
Why: Drop-in solution, handles edge cases&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Architecture&lt;/strong&gt;&lt;br&gt;
┌─────────────┐&lt;br&gt;
│    User     │&lt;br&gt;
└──────┬──────┘&lt;br&gt;
       │&lt;br&gt;
       ▼&lt;br&gt;
┌──────────────────────┐&lt;br&gt;
│  React Frontend      │ ← Vercel&lt;br&gt;
│  (TypeScript)        │&lt;br&gt;
└────────┬─────────────┘&lt;br&gt;
         │ HTTPS + JWT&lt;br&gt;
         ▼&lt;br&gt;
┌──────────────────────┐&lt;br&gt;
│  FastAPI Backend     │ ← Railway&lt;br&gt;
│  (Async Python)      │&lt;br&gt;
└────┬──────┬──────┬───┘&lt;br&gt;
     │      │      │&lt;br&gt;
     ▼      ▼      ▼&lt;br&gt;
┌─────────┐┌──────────┐┌──────────┐&lt;br&gt;
│Pinecone ││PostgreSQL││VirusTotal│&lt;br&gt;
│ Vectors ││   Docs   ││  Scanner │&lt;br&gt;
└─────────┘└──────────┘└──────────┘&lt;br&gt;
     │&lt;br&gt;
     ▼&lt;br&gt;
┌─────────────────────┐&lt;br&gt;
│ Gemini (primary)    │&lt;br&gt;
│ Cohere (fallback)   │&lt;br&gt;
└─────────────────────┘&lt;br&gt;
&lt;strong&gt;The Flow:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;User uploads document → Virus scan → PostgreSQL record&lt;br&gt;
Background task extracts text → Chunks (1000 chars, 100 overlap)&lt;br&gt;
Gemini generates embeddings (768-dim vectors)&lt;br&gt;
Store in Pinecone with metadata&lt;br&gt;
User asks question → Gemini embeds query&lt;br&gt;
Pinecone finds top 5 similar chunks (cosine similarity)&lt;br&gt;
Gemini generates answer from retrieved context&lt;br&gt;
Return answer with source citations&lt;/p&gt;

&lt;p&gt;Simple in theory. Brutal in practice.&lt;/p&gt;
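The flow above can be sketched in plain Python. This is a minimal illustration, not the production code: `chunk_text` mirrors the 1000-char / 100-overlap chunking of step 2, and `top_k` shows the cosine-similarity ranking that Pinecone performs in step 6 (`cosine` and `top_k` are hypothetical helper names).

```python
import math
from typing import List, Tuple

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into overlapping character windows (step 2 of the flow)."""
    step = chunk_size - overlap  # each new chunk starts 900 chars after the last
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors; 0.0 for degenerate inputs."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec: List[float],
          chunk_vecs: List[List[float]],
          k: int = 5) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks (step 6)."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]
```

In the real system the similarity search happens inside Pinecone; this sketch only shows what "top 5 similar chunks (cosine similarity)" means mechanically.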

&lt;p&gt;&lt;strong&gt;Crash #1: "Dense Vectors Must Contain Non-Zero Values"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What happened: My first upload to Pinecone failed instantly.&lt;br&gt;
Error: Dense vectors must contain at least one non-zero value&lt;/p&gt;

&lt;p&gt;The mistake: I was using dummy embeddings for testing:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ❌ WRONG - what I did initially
embeddings = [[0.0] * 768 for _ in chunks]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Pinecone rejected them because zero vectors have no semantic meaning - you can't calculate similarity with nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I tried:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Used Google Gemini embeddings → Hit quota limit (1500/day free tier had... 0 available)&lt;br&gt;
Switched to Cohere → Hit their 96 text limit per request&lt;br&gt;
Tried batch processing → Hit 100K tokens/minute rate limit&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_embeddings(self, texts: List[str]) -&amp;gt; List[List[float]]:
    """Generate embeddings with batching and rate limiting"""
    all_embeddings = []
    batch_size = 96  # Cohere's limit

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        response = self.cohere_client.embed(
            texts=batch,
            model='embed-english-v3.0',
            input_type='search_document',
            embedding_types=['float']
        )

        all_embeddings.extend(response.embeddings.float_)

        # Rate limiting: 6 second delay between batches
        if i + batch_size &amp;lt; len(texts):
            time.sleep(6)

    return all_embeddings
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Successfully processed 1000-chunk documents in ~60 seconds.&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Always test with real API responses, not mocked data. Dummy values that work locally will fail in production.&lt;/p&gt;
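A fixed 6-second sleep worked here; a common alternative for 429s (my addition, not part of the original system) is exponential backoff, which only slows down when the API actually pushes back. A minimal sketch, with `with_backoff` as a hypothetical wrapper:

```python
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call` on failure, doubling the wait each attempt (1s, 2s, 4s, ...)."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

Wrapping each embed batch in `with_backoff(lambda: client.embed(...))` keeps throughput high when the API is healthy and backs off only on real errors.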

&lt;p&gt;&lt;strong&gt;Crash #2: "No Relevant Information Found" (The Cascade)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What happened: Every single query returned "no relevant information found" despite successful uploads.&lt;br&gt;
This was the most frustrating bug. Documents uploaded fine. No errors. But queries found... nothing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The investigation:&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Step 1: Checked Pinecone console&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Result: 0 vectors stored&lt;br&gt;
Realization: Embeddings weren't being saved!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Checked upload logs&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Found this in my code:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;embedding = embedding_service.generate_embedding(text)  # ❌ WRONG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;I was calling the singular method (for one text) instead of the plural method (for batches).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Fixed the method, still failed&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: 403 Your API key was reported as leaked
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;My Gemini key had been exposed (hardcoded in .env.example that I committed to GitHub). Google auto-blocked it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Regenerated all API keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemini: 768-dim embeddings&lt;br&gt;
Cohere: 1024-dim embeddings&lt;br&gt;
Pinecone index: created at 1024-dim (for Cohere)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: New error&lt;/strong&gt;&lt;br&gt;
Vector dimension 768 does not match the dimension of the index 1024&lt;br&gt;
The Pinecone index was created for Cohere (1024-dim), but I was now using Gemini (768-dim). They're incompatible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Deleted Pinecone index&lt;br&gt;
Created new index with 768 dimensions (for Gemini)&lt;br&gt;
Implemented dual-fallback embedding system&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_embedding(self, text: str) -&amp;gt; List[float]:
    """Generate embedding - Gemini first, Cohere fallback"""
    # Try Gemini (15K free/month)
    if self.gemini_api_key:
        try:
            result = genai.embed_content(
                model="models/text-embedding-004",
                content=text,
                task_type="retrieval_query"
            )
            return result['embedding']
        except Exception as e:
            logger.warning(f"Gemini failed: {e}, trying Cohere...")

    # Fallback to Cohere (100 free/month)
    if self.cohere_api_key:
        try:
            response = self.cohere_client.embed(
                texts=[text],
                model="embed-english-v3.0",
                input_type="search_query",
                embedding_types=["float"]
            )
            return response.embeddings.float_[0]
        except Exception as e:
            logger.error(f"Both services failed: {e}")
            return None

    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Query success rate jumped from 0% to 90%.&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; API quotas will hit you when you least expect it. Always have a fallback provider. Never commit API keys, even in example files.&lt;/p&gt;
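One cheap safeguard against this class of bug (my suggestion, not part of the original system) is to validate vector dimensions before every upsert, so a provider switch fails loudly at upload time instead of silently at query time. `check_embeddings` and `INDEX_DIMENSION` are assumed names:

```python
from typing import List

INDEX_DIMENSION = 768  # must match the dimension the Pinecone index was created with

def check_embeddings(embeddings: List[List[float]],
                     expected: int = INDEX_DIMENSION) -> None:
    """Raise before upserting if any vector would be rejected by the index."""
    for i, vec in enumerate(embeddings):
        if len(vec) != expected:
            raise ValueError(
                f"vector {i} has dimension {len(vec)}, index expects {expected}"
            )
        if not any(vec):
            # Pinecone rejects all-zero dense vectors (see Crash #1)
            raise ValueError(f"vector {i} is all zeros")
```

Calling this right before `upsert` turns a "no relevant information found" mystery into an immediate, readable error.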

&lt;p&gt;&lt;strong&gt;Crash #3: "Relation 'documents' Does Not Exist"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What happened: Deployed to Railway. Backend started. Made first API call. Instant crash.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;psycopg2.errors.UndefinedTable: relation "documents" does not exist
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;The mistake: I assumed Railway would auto-create my database tables like my local SQLite did.&lt;br&gt;
It didn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I learned:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Local development: SQLAlchemy created tables automatically&lt;br&gt;
Production PostgreSQL: Fresh database, zero tables&lt;br&gt;
Alembic migrations: Not configured for Railway deployment&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt;&lt;br&gt;
Manually created tables via Railway's PostgreSQL CLI:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(255) NOT NULL,
    filename VARCHAR(255) NOT NULL,
    original_filename VARCHAR(255),
    file_path VARCHAR(500),
    file_size INTEGER,
    file_type VARCHAR(50),
    extracted_text TEXT,
    page_count INTEGER,
    chunks JSON,
    chunk_count INTEGER,
    embedding_model VARCHAR(100),
    embedding_dimension INTEGER,
    status VARCHAR(50) DEFAULT 'processing',
    upload_date TIMESTAMP DEFAULT NOW(),
    processed_date TIMESTAMP,
    is_deleted BOOLEAN DEFAULT FALSE
);

CREATE INDEX idx_documents_user_id ON documents(user_id);
CREATE INDEX idx_documents_status ON documents(status);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Better solution (learned after):&lt;/strong&gt;&lt;br&gt;
Set up Alembic migrations properly:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# alembic/env.py
from app.models import Base

target_metadata = Base.metadata
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Then in Railway:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alembic upgrade head
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Database tables created, app started successfully.&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Always test database migrations in a staging environment that mirrors production. Don't assume cloud providers work like localhost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Crash #4: "Failed to Fetch" (CORS Hell)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happened:&lt;/strong&gt; Frontend deployed to Vercel. Backend on Railway. They couldn't talk to each other.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chrome console:&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Access to fetch at '&lt;a href="https://backend.railway.app/api/documents/list" rel="noopener noreferrer"&gt;https://backend.railway.app/api/documents/list&lt;/a&gt;' &lt;br&gt;
from origin '&lt;a href="https://frontend.vercel.app" rel="noopener noreferrer"&gt;https://frontend.vercel.app&lt;/a&gt;' has been blocked by CORS policy: &lt;br&gt;
No 'Access-Control-Allow-Origin' header is present&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mistake:&lt;/strong&gt; My CORS configuration only allowed localhost:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ❌ WRONG - only worked locally
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ✅ CORRECT - works in production
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",
        "http://localhost:5173",
        "https://rag-document-qa-system.vercel.app",  # Production frontend
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Even better solution (learned later):&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Use environment variables:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALLOWED_ORIGINS = os.getenv(
"ALLOWED_ORIGINS",
"http://localhost:5173,https://rag-document-qa-system.vercel.app"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;).split(",")&lt;/p&gt;

&lt;p&gt;app.add_middleware(&lt;br&gt;
    CORSMiddleware,&lt;br&gt;
    allow_origins=ALLOWED_ORIGINS,&lt;br&gt;
    allow_credentials=True,&lt;br&gt;
    allow_methods=["&lt;em&gt;"],&lt;br&gt;
    allow_headers=["&lt;/em&gt;"],&lt;br&gt;
)&lt;br&gt;
&lt;strong&gt;Result:&lt;/strong&gt; Frontend successfully connected to backend.&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Configure CORS on day 1, not day 20. Test with production URLs before deploying. Use environment variables for flexibility.&lt;/p&gt;
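One gotcha with the comma-separated env var approach (my note, not from the original post): a value like "http://a, https://b" yields " https://b" with a leading space, which never matches a browser's Origin header and fails silently. Stripping each entry avoids that; `parse_origins` is a hypothetical helper:

```python
from typing import List

def parse_origins(raw: str) -> List[str]:
    """Split a comma-separated origin list, dropping whitespace and empty entries."""
    return [origin.strip() for origin in raw.split(",") if origin.strip()]
```

Then `allow_origins=parse_origins(os.getenv("ALLOWED_ORIGINS", ...))` is robust to however the env var was typed into the dashboard.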

&lt;p&gt;&lt;strong&gt;Crash #5: Background Tasks Timing Out&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What happened: Large documents (1000+ chunks) failed with timeout errors.&lt;br&gt;
504 Gateway Timeout&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem:&lt;/strong&gt; Processing was synchronous - upload endpoint waited for:&lt;/p&gt;

&lt;p&gt;Text extraction (5-10 seconds)&lt;br&gt;
Chunking (2-3 seconds)&lt;br&gt;
Embedding generation (45-60 seconds for 1000 chunks)&lt;br&gt;
Pinecone upload (5-10 seconds)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total:&lt;/strong&gt; 60-80 seconds. Railway's timeout: 30 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Move processing to background tasks&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastapi import BackgroundTasks

async def process_document_background(
    document_id: int,
    file_path: str,
    file_extension: str
):
    """Process document asynchronously"""
    from app.database import SessionLocal

    db = SessionLocal()
    try:
        document = db.query(Document).filter(
            Document.id == document_id
        ).first()

        # Extract text
        extraction_result = await text_extraction.extract_text(
            file_path, file_extension
        )

        if extraction_result["success"]:
            # Chunk text
            chunks = chunk_text(
                extraction_result["text"],
                chunk_size=1000,
                overlap=100
            )

            # Generate embeddings
            embeddings = embedding_service.generate_embeddings(chunks)

            # Store in Pinecone
            pinecone_service.upsert_embeddings(
                document_id=document_id,
                chunks=chunks,
                embeddings=embeddings
            )

            document.status = "ready"
        else:
            document.status = "failed"

        db.commit()
    finally:
        db.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@router.post("/upload")
async def upload_document(
    background_tasks: BackgroundTasks,
    file: UploadFile = File(...),
    db: Session = Depends(get_db),
    user: dict = Depends(get_current_user)
):
    # Save file and create database record
    file_path = await file_storage.save_uploaded_file(file)

    document = Document(
        user_id=user["sub"],
        filename=file.filename,
        status="processing"
    )
    db.add(document)
    db.commit()

    # Queue background processing
    background_tasks.add_task(
        process_document_background,
        document.id,
        file_path,
        file.filename.split(".")[-1]
    )

    return {
        "message": "Document uploaded. Processing in background...",
        "document_id": document.id,
        "status": "processing"
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Upload endpoint returns in &amp;lt;1 second. Processing happens in background. No timeouts.&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Any operation taking &amp;gt;5 seconds should be a background task in production. Return immediately, process asynchronously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Security Audit That Changed Everything&lt;/strong&gt;&lt;br&gt;
After getting it "working," I ran CodeRabbit's security review.&lt;br&gt;
&lt;strong&gt;Result:&lt;/strong&gt; 17 vulnerabilities found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2 CRITICAL:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hardcoded database password in code&lt;br&gt;
CORS wildcard (allow_origins=["*"])&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5 HIGH:&lt;/strong&gt;&lt;br&gt;
No rate limiting (DoS vulnerability)&lt;br&gt;
No virus scanning on uploads&lt;br&gt;
No input sanitization&lt;br&gt;
Missing pagination (could load 10K documents at once)&lt;br&gt;
SQL injection potential (even with ORM)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fixes:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate Limiting:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address, default_limits=["200/minute"])
app.state.limiter = limiter
# Without these two lines, slowapi never actually enforces the limits
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Virus Scanning:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Integrated VirusTotal API
async def scan_file(file_path: str) -&amp;gt; Dict[str, Any]:
    response = requests.get(VIRUSTOTAL_URL, ...)

    if response.json()["data"]["attributes"]["stats"]["malicious"] &amp;gt; 0:
        return {"is_safe": False}

    return {"is_safe": True}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Input Sanitization:&lt;/strong&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import bleach

query = bleach.clean(request.query.strip(), tags=[], strip=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Pagination:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@router.get("/list")
async def list_documents(
    skip: int = 0,
    limit: int = 100,
    db: Session = Depends(get_db)
):
    documents = db.query(Document).offset(skip).limit(limit).all()
    total = db.query(Document).count()

    return {
        "documents": documents,
        "total": total,
        "skip": skip,
        "limit": limit
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; All 17 vulnerabilities fixed. System production-hardened.&lt;br&gt;
&lt;strong&gt;Lesson:&lt;/strong&gt; Security isn't optional. Code reviews catch what you miss. Production means thinking about malicious users, not just happy paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Production Metrics (The Real Numbers)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;System Performance:&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Metric&lt;/th&gt;&lt;th&gt;Value&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Average Query Time&lt;/td&gt;&lt;td&gt;2.4 seconds&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Upload Processing (100 chunks)&lt;/td&gt;&lt;td&gt;12 seconds&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Upload Processing (1000 chunks)&lt;/td&gt;&lt;td&gt;68 seconds&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Embedding Generation (per chunk)&lt;/td&gt;&lt;td&gt;0.25 seconds&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Database Query Time&lt;/td&gt;&lt;td&gt;45 ms average&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pinecone Query Time&lt;/td&gt;&lt;td&gt;180 ms average&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;API Costs (Monthly):&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Service&lt;/th&gt;&lt;th&gt;Free Tier&lt;/th&gt;&lt;th&gt;My Usage&lt;/th&gt;&lt;th&gt;Cost&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;Gemini&lt;/td&gt;&lt;td&gt;15K requests&lt;/td&gt;&lt;td&gt;~200/month&lt;/td&gt;&lt;td&gt;$0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Cohere&lt;/td&gt;&lt;td&gt;100 requests&lt;/td&gt;&lt;td&gt;~50/month&lt;/td&gt;&lt;td&gt;$0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Pinecone&lt;/td&gt;&lt;td&gt;1 index, 1M vectors&lt;/td&gt;&lt;td&gt;~5K vectors&lt;/td&gt;&lt;td&gt;$0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Railway&lt;/td&gt;&lt;td&gt;500 hours&lt;/td&gt;&lt;td&gt;~720 hours&lt;/td&gt;&lt;td&gt;$5&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;Vercel&lt;/td&gt;&lt;td&gt;Unlimited&lt;/td&gt;&lt;td&gt;N/A&lt;/td&gt;&lt;td&gt;$0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;strong&gt;Total:&lt;/strong&gt; $5/month for a production AI system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Success Rates:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Document uploads: 95% (failures = corrupted files)&lt;br&gt;
Query responses: 90% (10% = no relevant chunks found)&lt;br&gt;
Background processing: 92% (8% = text extraction failures)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Feedback (First Week):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;17 documents uploaded&lt;br&gt;
118 queries processed&lt;br&gt;
5 users (mostly testing)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd Do Differently&lt;/strong&gt;&lt;br&gt;
If I started over tomorrow:&lt;/p&gt;

&lt;p&gt;Check API quotas FIRST - Not after hitting them. Gemini's "free tier" had 0 requests available. Cohere saved me.&lt;br&gt;
Set up CORS early - Don't wait until deployment fails. Test with production URLs locally.&lt;br&gt;
Database migrations from the start - Alembic configuration before first deployment, not after.&lt;br&gt;
Implement background tasks immediately - Any operation &amp;gt;5 seconds should be async from the beginning.&lt;br&gt;
Security review before deployment - Not after. CodeRabbit would've caught issues in development.&lt;br&gt;
Use environment variables everywhere - No hardcoded values. Even in development.&lt;br&gt;
Test with corrupted files - Users will upload anything. Test with 1-byte PDFs, empty files, and non-UTF8 text.&lt;/p&gt;
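Point 7 can start as a few cheap pre-flight checks before the expensive extraction/embedding pipeline runs. A sketch under my own assumptions (`validate_upload` is illustrative, not the app's actual validator):

```python
PDF_MAGIC = b"%PDF-"  # every well-formed PDF begins with this header

def validate_upload(filename: str, data: bytes) -> str:
    """Reject obviously broken files before extraction and embedding."""
    if not data:
        return "rejected: empty file"
    name = filename.lower()
    if name.endswith(".pdf") and not data.startswith(PDF_MAGIC):
        return "rejected: not a valid PDF (missing %PDF- header)"
    if name.endswith(".txt"):
        try:
            data.decode("utf-8")
        except UnicodeDecodeError:
            return "rejected: text file is not valid UTF-8"
    return "ok"
```

Checks like these catch the 1-byte PDFs and non-UTF8 text files in milliseconds, instead of letting them fail deep inside a background task.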

&lt;p&gt;&lt;strong&gt;Current Limitations &amp;amp; Future Improvements&lt;br&gt;
Known Issues:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Scanned PDFs return 0 characters (needs OCR)&lt;br&gt;
Large documents take 60+ seconds to process&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planned Features:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Streaming responses for better UX&lt;br&gt;
OCR for scanned PDFs&lt;br&gt;
Excel and PowerPoint support&lt;br&gt;
Semantic caching to reduce API costs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;br&gt;
Production AI is 20% algorithms, 80% infrastructure.&lt;br&gt;
&lt;strong&gt;The biggest lessons:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Free tiers lie - "15K requests/month" doesn't mean you get 15K. Check actual quotas.&lt;br&gt;
Always have fallbacks - Gemini fails → Cohere backup. Saved my deployment multiple times.&lt;br&gt;
Background tasks are non-negotiable - Anything &amp;gt;5 seconds will timeout in production.&lt;br&gt;
Security can't wait - One hardcoded password = complete compromise. Fix it before deploying.&lt;br&gt;
CORS will break you - Configure it early, test with production URLs.&lt;br&gt;
Test with real, messy data - Corrupted PDFs, empty files, non-UTF8 text. Users will upload anything.&lt;br&gt;
Dimension mismatches are silent killers - 768 vs 1024 dimensions broke everything with no clear error.&lt;/p&gt;

&lt;p&gt;The truth about production AI: Tutorials show the happy path. Production is 90% edge cases, rate limits, and error handling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try It Yourself&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://rag-document-qa-system.vercel.app" rel="noopener noreferrer"&gt;rag-document-qa-system.vercel.app&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/BLESSEDEFEM/-rag-document-qa-system" rel="noopener noreferrer"&gt;@BLESSEDEFEM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;To build something similar:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Start with document upload + text extraction (get this working first)&lt;br&gt;
Add embeddings locally (test with small files)&lt;br&gt;
Deploy backend before frontend (easier to debug)&lt;br&gt;
Implement CORS from day 1&lt;br&gt;
Monitor API quotas obsessively&lt;br&gt;
Add background tasks early&lt;br&gt;
Security audit before deployment&lt;/p&gt;

&lt;p&gt;Questions? Open an issue on GitHub or connect with me on LinkedIn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;About the Author&lt;/strong&gt;&lt;br&gt;
Blessing Nejo - Junior Software &amp;amp; AI Engineer from Lagos, Nigeria&lt;br&gt;
I build production AI systems and document the messy parts that tutorials skip. This RAG system was a hands-on learning adventure that taught me more in 3 weeks than months of tutorials.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Currently seeking:&lt;/strong&gt; Software/AI Engineer roles (remote-first)&lt;br&gt;
&lt;strong&gt;Skills:&lt;/strong&gt; Python, TypeScript, FastAPI, React, PostgreSQL, Vector Databases, Production AI Systems&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Connect:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;LinkedIn:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/blessing-nejo-195673134" rel="noopener noreferrer"&gt;Blessing Nejo&lt;/a&gt;&lt;br&gt;
🐙 &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/BLESSEDEFEM/-rag-document-qa-system" rel="noopener noreferrer"&gt;@BLESSEDEFEM&lt;/a&gt;&lt;br&gt;
📧 [&lt;a href="mailto:nejoblessing72@gmail.com"&gt;nejoblessing72@gmail.com&lt;/a&gt;]&lt;br&gt;
📍 Lagos, Nigeria&lt;/p&gt;

&lt;p&gt;Found this helpful? Drop a comment below - I read and respond to every one.&lt;br&gt;
Building something similar? I'm happy to review your architecture or debug issues. DM me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #AI #MachineLearning #RAG #Python #FastAPI #React #TypeScript #ProductionAI #VectorDatabases #Pinecone #LLM #SoftwareEngineering&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
