I Built a Production RAG System in 3 Weeks - Here's What Actually Broke

The Wake-Up Call

I deployed my first AI system to production. Within the first hour, it crashed several times.

The error logs were a nightmare: relation "documents" does not exist, Dense vectors must contain at least one non-zero value, 429 Too Many Requests, CORS policy: No 'Access-Control-Allow-Origin'. Every fix revealed three new problems.

Most RAG tutorials end at "it works on localhost." They skip the brutal reality: rate limits, CORS hell, database migrations, API quota exhaustion, and the 3 AM debugging sessions that come with real production systems.

This isn't that kind of tutorial.

I'm Blessing, a junior AI engineer from Lagos, Nigeria. This was my first production AI system, and I documented every failure, every panic moment, and every "why didn't the tutorial mention THIS?" frustration.

Here's what you'll learn:

Why my embeddings worked locally but failed in production
The cascade of failures that happens when one service hits quota
How I went from "no relevant information found" on every query to 90% success rate
Real code and architecture decisions (not theory)
Actual production metrics and costs

If you're building your first production AI system, this post might save you 47 crashes and countless hours of debugging.
Let's dive into what actually happened.

What I Built (And Why It Matters)
The System: A RAG (Retrieval-Augmented Generation) Document Q&A application where users upload PDFs, DOCX, or TXT files, then ask questions in plain English and get AI-generated answers with source citations.
Why RAG? Traditional LLMs hallucinate - they confidently make things up. RAG solves this by grounding responses in YOUR actual documents. Upload your company's 500-page policy manual, ask "What's our remote work policy?" and get an accurate answer with the exact page reference.

Real-world impact: Instead of Ctrl+F through dozens of files, users get conversational answers in 2-4 seconds.

Try it live: @URL

The Tech Stack (And Why I Chose Each)
Frontend:

React + TypeScript + Tailwind CSS
Deployed on Vercel
Why: Fast dev experience, automatic deployments, global CDN

Backend:

FastAPI (Python)
Deployed on Railway
Why: Async support, automatic API docs, simpler than AWS

Databases:

PostgreSQL (document metadata)
Pinecone (vector embeddings)
Why: Pinecone serverless = no infrastructure management

AI Services:

Google Gemini 2.0 Flash (answer generation)
Cohere embed-v3 (embeddings)
Why: Gemini's free tier (15K requests/month) vs OpenAI's limited free trial

Authentication:

Clerk (JWT-based)
Why: Drop-in solution, handles edge cases

The Architecture
┌─────────────┐
│    User     │
└──────┬──────┘
       │
       ▼
┌──────────────────────┐
│   React Frontend     │  ← Vercel
│   (TypeScript)       │
└────────┬─────────────┘
         │ HTTPS + JWT
         ▼
┌──────────────────────┐
│   FastAPI Backend    │  ← Railway
│   (Async Python)     │
└────┬──────┬──────┬───┘
     │      │      │
     ▼      ▼      ▼
┌─────────┐┌──────────┐┌──────────┐
│Pinecone ││PostgreSQL││VirusTotal│
│ Vectors ││   Docs   ││ Scanner  │
└─────────┘└──────────┘└──────────┘

┌─────────────────────┐
│  Gemini (primary)   │
│  Cohere (fallback)  │
└─────────────────────┘
The Flow:

User uploads document → Virus scan → PostgreSQL record
Background task extracts text → Chunks (1000 chars, 100 overlap)
Gemini generates embeddings (768-dim vectors)
Store in Pinecone with metadata
User asks question → Gemini embeds query
Pinecone finds top 5 similar chunks (cosine similarity)
Gemini generates answer from retrieved context
Return answer with source citations

Simple in theory. Brutal in practice.
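
Before the crash log, one helper worth pinning down: the chunk_text used in step 2 (and again in the background-task code later) never appears in the snippets below. A minimal sketch, assuming plain character-based chunking with the 1000/100 parameters above:

from typing import List

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into fixed-size character chunks, overlapping neighbours by `overlap` chars."""
    chunks = []
    step = chunk_size - overlap  # each new chunk starts 900 chars after the previous one
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only tails
            chunks.append(chunk)
    return chunks

Smarter splitters respect sentence or paragraph boundaries, but this is enough to follow the rest of the post.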

Crash #1: "Dense Vectors Must Contain Non-Zero Values"

What happened: My first upload to Pinecone failed instantly.
Error: Dense vectors must contain at least one non-zero value

The mistake: I was using dummy embeddings for testing:

❌ WRONG - What I did initially

embeddings = [[0.0] * 768 for _ in chunks]
Pinecone rejected them because zero vectors have no semantic meaning - you can't calculate similarity with nothing.
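
A few lines of numpy show why: cosine similarity divides by the vector norms, and a zero vector has norm 0, so every comparison is undefined.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # denominator is 0 whenever either vector is all zeros
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.random.rand(768)
zero_chunk = np.zeros(768)
print(cosine_similarity(query, zero_chunk))  # nan (with a division warning)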

What I tried:

Used Google Gemini embeddings → Hit quota limit (1500/day free tier had... 0 available)
Switched to Cohere → Hit their 96 text limit per request
Tried batch processing → Hit 100K tokens/minute rate limit

The solution:

def generate_embeddings(self, texts: List[str]) -> List[List[float]]:
    """Generate embeddings with batching and rate limiting"""
    all_embeddings = []
    batch_size = 96  # Cohere's limit

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        response = self.cohere_client.embed(
            texts=batch,
            model='embed-english-v3.0',
            input_type='search_document',
            embedding_types=['float']
        )

        all_embeddings.extend(response.embeddings.float_)

        # Rate limiting: 6 second delay between batches
        if i + batch_size < len(texts):
            time.sleep(6)

    return all_embeddings

Result: Successfully processed 1000-chunk documents in ~60 seconds.
Lesson: Always test with real API responses, not mocked data. Dummy values that work locally will fail in production.

Crash #2: "No Relevant Information Found" (The Cascade)

What happened: Every single query returned "no relevant information found" despite successful uploads.
This was the most frustrating bug. Documents uploaded fine. No errors. But queries found... nothing.

The investigation:
Step 1: Checked Pinecone console

Result: 0 vectors stored
Realization: Embeddings weren't being saved!

Step 2: Checked upload logs

Found this in my code:

embedding = embedding_service.generate_embedding(text) # ❌ WRONG
I was calling the SINGULAR method (for one text) instead of plural method (for batches).

Step 3: Fixed the method, still failed

Error: 403 Your API key was reported as leaked

My Gemini key had been exposed (hardcoded in .env.example that I committed to GitHub). Google auto-blocked it.

Step 4: Regenerated all API keys

Gemini: 768-dim embeddings
Cohere: 1024-dim embeddings
Pinecone index: 1024-dim (still sized for Cohere)

Step 5: New error
Vector dimension 768 does not match the dimension of the index 1024
The Pinecone index was created for Cohere (1024-dim), but I was now using Gemini (768-dim). They're incompatible.

The solution:

Deleted Pinecone index
Created new index with 768 dimensions (for Gemini)
Implemented dual-fallback embedding system
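
For step 2, recreating the index with Pinecone's Python client looks roughly like this - a sketch, where the index name, cloud, and region are my assumptions, not the project's actual config:

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=PINECONE_API_KEY)  # key loaded from the environment

# Dimension must match the embedding model: 768 for Gemini's text-embedding-004
pc.create_index(
    name="rag-documents",  # hypothetical index name
    dimension=768,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # assumed serverless config
)

Step 3's dual-fallback helper is the next snippet: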

def generate_embedding(self, text: str) -> List[float]:
    """Generate embedding - Gemini first, Cohere fallback"""

    # Try Gemini (15K free/month)
    if self.gemini_api_key:
        try:
            result = genai.embed_content(
                model="models/text-embedding-004",
                content=text,
                task_type="retrieval_query"
            )
            return result['embedding']
        except Exception as e:
            logger.warning(f"Gemini failed: {e}, trying Cohere...")

    # Fallback to Cohere (100 free/month)
    if self.cohere_api_key:
        try:
            response = self.cohere_client.embed(
                texts=[text],
                model="embed-english-v3.0",
                input_type="search_query",
                embedding_types=["float"]
            )
            return response.embeddings.float_[0]
        except Exception as e:
            logger.error(f"Both services failed: {e}")
            return None

    return None

Result: Query success rate jumped from 0% to 90%.
Lesson: API quotas will hit you when you least expect it. Always have a fallback provider. Never commit API keys, even in example files.
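
With embeddings flowing again, the rest of the query path (flow steps 5-7) is short. A sketch under the same setup - index is the Pinecone index handle, generate_embedding is the fallback helper above, and the prompt wording, metadata field, and model handle are my assumptions:

import google.generativeai as genai  # assumes genai.configure(api_key=...) ran at startup

def answer_question(question: str, index, top_k: int = 5) -> str:
    # 1. Embed the query (Gemini first, Cohere fallback)
    query_vector = embedding_service.generate_embedding(question)

    # 2. Retrieve the most similar chunks from Pinecone (cosine similarity)
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    context = "\n\n".join(match.metadata["text"] for match in results.matches)

    # 3. Ask Gemini to answer strictly from the retrieved context
    prompt = (
        "Answer the question using only the context below. "
        "If the context is not relevant, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return genai.GenerativeModel("gemini-2.0-flash").generate_content(prompt).text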

Crash #3: "Relation 'documents' Does Not Exist"

What happened: Deployed to Railway. Backend started. Made first API call. Instant crash.
psycopg2.errors.UndefinedTable: relation "documents" does not exist
The mistake: I assumed Railway would auto-create my database tables like my local SQLite did.
It didn't.

What I learned:

Local development: SQLAlchemy created tables automatically
Production PostgreSQL: Fresh database, zero tables
Alembic migrations: Not configured for Railway deployment
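
That first bullet is worth making concrete: locally, a create_all call at startup is what made the tables "just appear". A sketch of that stopgap, with module paths assumed from the imports used elsewhere in this post:

from app.database import engine  # assumed; the post only shows SessionLocal from this module
from app.models import Base

# Creates any missing tables from the SQLAlchemy models. Fine for local SQLite,
# but production schemas should come from Alembic migrations instead.
Base.metadata.create_all(bind=engine)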

The solution:
Manually created tables via Railway's PostgreSQL CLI:
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(255) NOT NULL,
    filename VARCHAR(255) NOT NULL,
    original_filename VARCHAR(255),
    file_path VARCHAR(500),
    file_size INTEGER,
    file_type VARCHAR(50),
    extracted_text TEXT,
    page_count INTEGER,
    chunks JSON,
    chunk_count INTEGER,
    embedding_model VARCHAR(100),
    embedding_dimension INTEGER,
    status VARCHAR(50) DEFAULT 'processing',
    upload_date TIMESTAMP DEFAULT NOW(),
    processed_date TIMESTAMP,
    is_deleted BOOLEAN DEFAULT FALSE
);

CREATE INDEX idx_documents_user_id ON documents(user_id);
CREATE INDEX idx_documents_status ON documents(status);

Better solution (learned later):
Set up Alembic migrations properly:

# alembic/env.py
from app.models import Base

target_metadata = Base.metadata

Then run in Railway:

alembic upgrade head

Result: Database tables created, app started successfully.
Lesson: Always test database migrations in a staging environment that mirrors production. Don't assume cloud providers work like localhost.

Crash #4: "Failed to Fetch" (CORS Hell)

What happened: Frontend deployed to Vercel. Backend on Railway. They couldn't talk to each other.

Chrome console:

Access to fetch at 'https://backend.railway.app/api/documents/list'
from origin 'https://frontend.vercel.app' has been blocked by CORS policy:
No 'Access-Control-Allow-Origin' header is present

The mistake: My CORS configuration only allowed localhost:

❌ WRONG - Only worked locally

app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

The solution:

✅ CORRECT - Works in production

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",
        "http://localhost:5173",
        "https://rag-document-qa-system.vercel.app",  # Production frontend
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

Even better solution (learned later):
Use environment variables:

ALLOWED_ORIGINS = os.getenv(
    "ALLOWED_ORIGINS",
    "http://localhost:5173,https://rag-document-qa-system.vercel.app"
).split(",")

app.add_middleware(
    CORSMiddleware,
    allow_origins=ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
Result: Frontend successfully connected to backend.
Lesson: Configure CORS on day 1, not day 20. Test with production URLs before deploying. Use environment variables for flexibility.

Crash #5: Background Tasks Timing Out
What happened: Large documents (1000+ chunks) failed with timeout errors.
504 Gateway Timeout

The problem: Processing was synchronous - upload endpoint waited for:

Text extraction (5-10 seconds)
Chunking (2-3 seconds)
Embedding generation (45-60 seconds for 1000 chunks)
Pinecone upload (5-10 seconds)

Total: 60-80 seconds. Railway's timeout: 30 seconds.

The solution: Move processing to background tasks

from fastapi import BackgroundTasks

async def process_document_background(
    document_id: int,
    file_path: str,
    file_extension: str
):
    """Process document asynchronously"""
    from app.database import SessionLocal

    db = SessionLocal()
    try:
        document = db.query(Document).filter(
            Document.id == document_id
        ).first()

        # Extract text
        extraction_result = await text_extraction.extract_text(
            file_path, file_extension
        )

        if extraction_result["success"]:
            # Chunk text
            chunks = chunk_text(
                extraction_result["text"],
                chunk_size=1000,
                overlap=100
            )

            # Generate embeddings
            embeddings = embedding_service.generate_embeddings(chunks)

            # Store in Pinecone
            pinecone_service.upsert_embeddings(
                document_id=document_id,
                chunks=chunks,
                embeddings=embeddings
            )

            document.status = "ready"
        else:
            document.status = "failed"

        db.commit()
    finally:
        db.close()

@router.post("/upload")
async def upload_document(
background_tasks: BackgroundTasks,
file: UploadFile = File(...),
db: Session = Depends(get_db),
user: dict = Depends(get_current_user)
):
# Save file and create database record
file_path = await file_storage.save_uploaded_file(file)

document = Document(
    user_id=user["sub"],
    filename=file.filename,
    status="processing"
)
db.add(document)
db.commit()

# Queue background processing
background_tasks.add_task(
    process_document_background,
    document.id,
    file_path,
    file.filename.split(".")[-1]
)

return {
    "message": "Document uploaded. Processing in background...",
    "document_id": document.id,
    "status": "processing"
}

Result: Upload endpoint returns in <1 second. Processing happens in background. No timeouts.
Lesson: Any operation taking >5 seconds should be a background task in production. Return immediately, process asynchronously.
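
The pinecone_service.upsert_embeddings call above is the one piece this post doesn't show. A minimal sketch, assuming chunk text is stored as metadata so answers can cite their sources:

from typing import List

def upsert_embeddings(index, document_id: int, chunks: List[str], embeddings: List[List[float]]) -> None:
    """Upsert one vector per chunk, batched to stay under Pinecone's request size limits."""
    vectors = [
        {
            "id": f"{document_id}-{i}",  # stable per-chunk id
            "values": embedding,
            "metadata": {"document_id": document_id, "chunk_index": i, "text": chunk},
        }
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
    ]
    for start in range(0, len(vectors), 100):  # 100 vectors per upsert call
        index.upsert(vectors=vectors[start:start + 100])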

The Security Audit That Changed Everything
After getting it "working," I ran CodeRabbit's security review.
Result: 17 vulnerabilities found.

2 CRITICAL:

Hardcoded database password in code
CORS wildcard (allow_origins=["*"])

5 HIGH:
No rate limiting (DoS vulnerability)
No virus scanning on uploads
No input sanitization
Missing pagination (could load 10K documents at once)
SQL injection potential (even with ORM)

The fixes:

Rate Limiting:

from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address, default_limits=["200/minute"])
app.state.limiter = limiter
# Without these two registrations the default limits are never actually enforced
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)
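
slowapi also supports tighter per-route limits for expensive endpoints; a sketch, where the route path and handler name are hypothetical:

from fastapi import Request

@router.post("/query")  # hypothetical path
@limiter.limit("10/minute")  # slowapi needs the handler to accept a `request: Request` argument
async def query_documents(request: Request):
    ...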

Virus Scanning:

Integrated VirusTotal API

async def scan_file(file_path: str) -> Dict[str, Any]:
    response = requests.get(VIRUSTOTAL_URL, ...)

    if response.json()["data"]["attributes"]["stats"]["malicious"] > 0:
        return {"is_safe": False}

    return {"is_safe": True}

Input Sanitization:

import bleach

query = bleach.clean(request.query.strip(), tags=[], strip=True)

Pagination:

@router.get("/list")
async def list_documents(
skip: int = 0,
limit: int = 100,
db: Session = Depends(get_db)
):
documents = db.query(Document).offset(skip).limit(limit).all()
total = db.query(Document).count()

return {
    "documents": documents,
    "total": total,
    "skip": skip,
    "limit": limit
}

Result: All 17 vulnerabilities fixed. System production-hardened.
Lesson: Security isn't optional. Code reviews catch what you miss. Production means thinking about malicious users, not just happy paths.

Production Metrics (The Real Numbers)
System Performance:

Metric                               Value
Average Query Time                   2.4 seconds
Upload Processing (100 chunks)       12 seconds
Upload Processing (1000 chunks)      68 seconds
Embedding Generation (per chunk)     0.25 seconds
Database Query Time                  45ms average
Pinecone Query Time                  180ms average

API Costs (Monthly):

Service     Free Tier             My Usage       Cost
Gemini      15K requests          ~200/month     $0
Cohere      100 requests          ~50/month      $0
Pinecone    1 index, 1M vectors   ~5K vectors    $0
Railway     500 hours             ~720 hours     $5
Vercel      Unlimited             N/A            $0
Total: $5/month for a production AI system.

Success Rates:

Document uploads: 95% (failures = corrupted files)
Query responses: 90% (10% = no relevant chunks found)
Background processing: 92% (8% = text extraction failures)

User Feedback (First Week):

17 documents uploaded
118 queries processed
5 users (mostly testing)

What I'd Do Differently
If I started over tomorrow:

Check API quotas FIRST - Not after hitting them. Gemini's "free tier" had 0 requests available. Cohere saved me.
Set up CORS early enough - Don't wait until deployment fails. Test with production URLs locally.
Database migrations from the start - Alembic configuration before first deployment, not after.
Implement background tasks immediately - Any operation >5 seconds should be async from the beginning.
Security review before deployment - Not after. CodeRabbit would've caught issues in development.
Use environment variables everywhere - No hardcoded values. Even in development.
Test with corrupted files - Users will upload anything. Test with 1-byte PDFs, empty files, and non-UTF8 text.
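
For point 6, a settings object makes "environment variables everywhere" the default rather than a discipline. A sketch with pydantic-settings; the field names simply mirror the services used in this post:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str
    gemini_api_key: str = ""
    cohere_api_key: str = ""
    pinecone_api_key: str = ""
    allowed_origins: str = "http://localhost:5173"

settings = Settings()  # every value comes from the environment or .env - nothing hardcoded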

Current Limitations & Future Improvements
Known Issues:

Scanned PDFs return 0 characters (needs OCR)
Large documents take 60+ seconds to process

Planned Features:

Streaming responses for better UX
OCR for scanned PDFs
Excel and PowerPoint support
Semantic caching to reduce API costs
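
The semantic caching item reuses machinery already in place: cache each answered query's embedding, and when a new query is close enough, return the cached answer instead of calling Gemini again. A rough sketch; the 0.95 threshold and in-memory store are assumptions:

import numpy as np

class SemanticCache:
    """Reuse answers for near-duplicate questions instead of paying for another LLM call."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, answer) pairs

    def get(self, query_embedding):
        q = np.asarray(query_embedding)
        for cached, answer in self.entries:
            similarity = float(np.dot(q, cached) / (np.linalg.norm(q) * np.linalg.norm(cached)))
            if similarity >= self.threshold:
                return answer
        return None

    def add(self, query_embedding, answer: str) -> None:
        self.entries.append((np.asarray(query_embedding), answer))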

Key Takeaways
Production AI is 20% algorithms, 80% infrastructure.
The biggest lessons:

Free tiers lie - "15K requests/month" doesn't mean you get 15K. Check actual quotas.
Always have fallbacks - Gemini fails → Cohere backup. Saved my deployment multiple times.
Background tasks are non-negotiable - Anything >5 seconds will timeout in production.
Security can't wait - One hardcoded password = complete compromise. Fix it before deploying.
CORS will break you - Configure it early, test with production URLs.
Test with real, messy data - Corrupted PDFs, empty files, non-UTF8 text. Users will upload anything.
Dimension mismatches are silent killers - 768 vs 1024 dimensions broke everything with no clear error.

The truth about production AI: Tutorials show the happy path. Production is 90% edge cases, rate limits, and error handling.

Try It Yourself
Live Demo: @URL
GitHub: @BLESSEDEFEM

To build something similar:

Start with document upload + text extraction (get this working first)
Add embeddings locally (test with small files)
Deploy backend before frontend (easier to debug)
Implement CORS from day 1
Monitor API quotas obsessively
Add background tasks early
Security audit before deployment

Questions? Open an issue on GitHub or connect with me on LinkedIn.

About the Author
Blessing Nejo - Junior Software & AI Engineer from Lagos, Nigeria
I build production AI systems and document the messy parts that tutorials skip. This project was a hands-on learning adventure: it taught me more in 3 weeks than months of tutorials did.

Currently seeking: Software/AI Engineer roles (remote-first)
Skills: Python, TypeScript, FastAPI, React, PostgreSQL, Vector Databases, Production AI Systems

Connect:

🔗 LinkedIn: Blessing Nejo
🐙 GitHub: @BLESSEDEFEM
📧 nejoblessing72@gmail.com
📍 Lagos, Nigeria

Found this helpful? Drop a comment below - I read and respond to every one.
Building something similar? I'm happy to review your architecture or debug issues. DM me.

Tags: #AI #MachineLearning #RAG #Python #FastAPI #React #TypeScript #ProductionAI #VectorDatabases #Pinecone #LLM #SoftwareEngineering
