The Wake-Up Call
I deployed my first AI system to production. Within the first hour, it crashed several times.
The error logs were a nightmare: relation "documents" does not exist, Dense vectors must contain at least one non-zero value, 429 Too Many Requests, CORS policy: No 'Access-Control-Allow-Origin'. Every fix revealed three new problems.
Most RAG tutorials end at "it works on localhost." They skip the brutal reality: rate limits, CORS hell, database migrations, API quota exhaustion, and the 3 AM debugging sessions that come with real production systems.
This isn't that kind of tutorial.
I'm Blessing, a junior AI engineer from Lagos, Nigeria. This was my first production AI system, and I documented every failure, every panic moment, and every "why didn't the tutorial mention THIS?" frustration.
Here's what you'll learn:
Why my embeddings worked locally but failed in production
The cascade of failures that happens when one service hits quota
How I went from "no relevant information found" on every query to 90% success rate
Real code and architecture decisions (not theory)
Actual production metrics and costs
If you're building your first production AI system, this post might save you 47 crashes and countless hours of debugging.
Let's dive into what actually happened.
What I Built (And Why It Matters)
The System: A RAG (Retrieval-Augmented Generation) Document Q&A application where users upload PDFs, DOCX, or TXT files, then ask questions in plain English and get AI-generated answers with source citations.
Why RAG? Traditional LLMs hallucinate - they confidently make things up. RAG solves this by grounding responses in YOUR actual documents. Upload your company's 500-page policy manual, ask "What's our remote work policy?" and get an accurate answer with the exact page reference.
Real-world impact: Instead of Ctrl+F through dozens of files, users get conversational answers in 2-4 seconds.
Try it live: @URL
The Tech Stack (And Why I Chose Each)
Frontend:
React + TypeScript + Tailwind CSS
Deployed on Vercel
Why: Fast dev experience, automatic deployments, global CDN
Backend:
FastAPI (Python)
Deployed on Railway
Why: Async support, automatic API docs, simpler than AWS
Databases:
PostgreSQL (document metadata)
Pinecone (vector embeddings)
Why: Pinecone serverless = no infrastructure management
AI Services:
Google Gemini 2.0 Flash (answer generation)
Cohere embed-v3 (embeddings)
Why: Gemini's free tier (15K requests/month) vs OpenAI's limited free trial
Authentication:
Clerk (JWT-based)
Why: Drop-in solution, handles edge cases
The Architecture
┌─────────────┐
│ User │
└──────┬──────┘
│
▼
┌──────────────────────┐
│ React Frontend │ ← Vercel
│ (TypeScript) │
└────────┬─────────────┘
│ HTTPS + JWT
▼
┌──────────────────────┐
│ FastAPI Backend │ ← Railway
│ (Async Python) │
└────┬──────┬──────┬───┘
│ │ │
▼ ▼ ▼
┌─────────┐┌──────────┐┌──────────┐
│Pinecone ││PostgreSQL││VirusTotal│
│ Vectors ││ Docs ││ Scanner │
└─────────┘└──────────┘└──────────┘
│
▼
┌─────────────────────┐
│ Gemini (primary) │
│ Cohere (fallback) │
└─────────────────────┘
The Flow:
User uploads document → Virus scan → PostgreSQL record
Background task extracts text → Chunks (1000 chars, 100 overlap; see the sketch below)
Gemini generates embeddings (768-dim vectors)
Store in Pinecone with metadata
User asks question → Gemini embeds query
Pinecone finds top 5 similar chunks (cosine similarity)
Gemini generates answer from retrieved context
Return answer with source citations
Simple in theory. Brutal in practice.
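That chunking step is worth seeing concretely. Here's a minimal sketch, assuming a plain character window with overlap; the real splitter may be smarter (sentence boundaries, token counts), but the shape of the data is the same.

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows (1000 chars, 100 overlap)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():  # skip whitespace-only tails
            chunks.append(chunk)
    return chunks

The overlap matters: without it, an answer that straddles a chunk boundary can't be retrieved as a single piece of context.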
Crash #1: "Dense Vectors Must Contain Non-Zero Values"
What happened: My first upload to Pinecone failed instantly.
Error: Dense vectors must contain at least one non-zero value
The mistake: I was using dummy embeddings for testing:
# ❌ WRONG - What I did initially
embeddings = [[0.0] * 768 for _ in chunks]
Pinecone rejected them because zero vectors have no semantic meaning - you can't calculate similarity with nothing.
What I tried:
Used Google Gemini embeddings → Hit quota limit (1500/day free tier had... 0 available)
Switched to Cohere → Hit their 96 text limit per request
Tried batch processing → Hit 100K tokens/minute rate limit
The solution:
def generate_embeddings(self, texts: List[str]) -> List[List[float]]:
    """Generate embeddings with batching and rate limiting"""
    all_embeddings = []
    batch_size = 96  # Cohere's limit

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        response = self.cohere_client.embed(
            texts=batch,
            model='embed-english-v3.0',
            input_type='search_document',
            embedding_types=['float']
        )
        all_embeddings.extend(response.embeddings.float_)

        # Rate limiting: 6 second delay between batches
        if i + batch_size < len(texts):
            time.sleep(6)

    return all_embeddings
Result: Successfully processed 1000-chunk documents in ~60 seconds.
Lesson: Always test with real API responses, not mocked data. Dummy values that work locally will fail in production.
Crash #2: "No Relevant Information Found" (The Cascade)
What happened: Every single query returned "no relevant information found" despite successful uploads.
This was the most frustrating bug. Documents uploaded fine. No errors. But queries found... nothing.
The investigation:
Step 1: Checked Pinecone console
Result: 0 vectors stored
Realization: Embeddings weren't being saved!
Step 2: Checked upload logs
Found this in my code:
embedding = embedding_service.generate_embedding(text) # ❌ WRONG
I was calling the SINGULAR method (for one text) instead of plural method (for batches).
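The fix itself was a one-word change, calling the batched method from Crash #1:

# ✅ CORRECT - the plural, batched method
embeddings = embedding_service.generate_embeddings(chunks)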
Step 3: Fixed the method, still failed
Error: 403 Your API key was reported as leaked
My Gemini key had been exposed (hardcoded in .env.example that I committed to GitHub). Google auto-blocked it.
Step 4: Regenerated all API keys
Gemini: 768-dim embeddings
Cohere: 1024-dim embeddings
Pinecone index: 1024-dim (created earlier for Cohere)
Step 5: New error
Vector dimension 768 does not match the dimension of the index 1024
The Pinecone index was created for Cohere (1024-dim), but I was now using Gemini (768-dim). They're incompatible.
The solution:
Deleted Pinecone index
Created new index with 768 dimensions for Gemini (see the sketch after the code below)
Implemented dual-fallback embedding system
def generate_embedding(self, text: str) -> List[float]:
    """Generate embedding - Gemini first, Cohere fallback"""
    # Try Gemini (15K free/month)
    if self.gemini_api_key:
        try:
            result = genai.embed_content(
                model="models/text-embedding-004",
                content=text,
                task_type="retrieval_query"
            )
            return result['embedding']
        except Exception as e:
            logger.warning(f"Gemini failed: {e}, trying Cohere...")

    # Fallback to Cohere (100 free/month)
    if self.cohere_api_key:
        try:
            response = self.cohere_client.embed(
                texts=[text],
                model="embed-english-v3.0",
                input_type="search_query",
                embedding_types=["float"]
            )
            return response.embeddings.float_[0]
        except Exception as e:
            logger.error(f"Both services failed: {e}")
            return None

    return None
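The index recreation itself is a one-off call. Here's a sketch using the current Pinecone serverless client; the index name, cloud, and region are placeholders, not the production values.

import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# The dimension must match the embedding model in use:
# Gemini text-embedding-004 -> 768, Cohere embed-english-v3.0 -> 1024
pc.create_index(
    name="documents",
    dimension=768,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)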
Result: Query success rate jumped from 0% to 90%.
Lesson: API quotas will hit you when you least expect it. Always have a fallback provider. Never commit API keys, even in example files.
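For completeness, the query side of the flow (embed the question, retrieve the nearest chunks, generate the answer) ends up looking roughly like this. It's a sketch under assumptions: the metadata fields ("text", "filename"), the prompt wording, and the model id are illustrative, not the exact production code.

def answer_question(self, question: str, top_k: int = 5) -> dict:
    """Embed the query, retrieve similar chunks from Pinecone, answer with Gemini."""
    # Assumes `import google.generativeai as genai` and a Pinecone Index on self.index
    query_vector = self.generate_embedding(question)  # Gemini first, Cohere fallback
    if query_vector is None:
        return {"answer": "No relevant information found.", "sources": []}

    # Top-k nearest chunks by cosine similarity
    results = self.index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    if not results.matches:
        return {"answer": "No relevant information found.", "sources": []}

    context = "\n\n".join(m.metadata["text"] for m in results.matches)
    prompt = (
        "Answer the question using only the context below, and cite the source.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    answer = genai.GenerativeModel("gemini-2.0-flash").generate_content(prompt)
    return {
        "answer": answer.text,
        "sources": [m.metadata.get("filename") for m in results.matches],
    }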
Crash #3: "Relation 'documents' Does Not Exist"
What happened: Deployed to Railway. Backend started. Made first API call. Instant crash.
psycopg2.errors.UndefinedTable: relation "documents" does not exist
The mistake: I assumed Railway would auto-create my database tables like my local SQLite did.
It didn't.
What I learned:
Local development: SQLAlchemy created tables automatically (see the sketch after this list)
Production PostgreSQL: Fresh database, zero tables
Alembic migrations: Not configured for Railway deployment
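About that first item: locally, tables "just appeared" almost certainly because of a SQLAlchemy create_all call at startup, something like the sketch below (the module paths mirror the ones used elsewhere in this post; the `engine` import is an assumption). It's convenient against local SQLite, but it's not a migration strategy.

from app.database import engine
from app.models import Base

# Creates any tables that don't exist yet. It never alters or versions
# existing tables - that's what Alembic is for.
Base.metadata.create_all(bind=engine)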
The solution:
Manually created tables via Railway's PostgreSQL CLI:
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    user_id VARCHAR(255) NOT NULL,
    filename VARCHAR(255) NOT NULL,
    original_filename VARCHAR(255),
    file_path VARCHAR(500),
    file_size INTEGER,
    file_type VARCHAR(50),
    extracted_text TEXT,
    page_count INTEGER,
    chunks JSON,
    chunk_count INTEGER,
    embedding_model VARCHAR(100),
    embedding_dimension INTEGER,
    status VARCHAR(50) DEFAULT 'processing',
    upload_date TIMESTAMP DEFAULT NOW(),
    processed_date TIMESTAMP,
    is_deleted BOOLEAN DEFAULT FALSE
);

CREATE INDEX idx_documents_user_id ON documents(user_id);
CREATE INDEX idx_documents_status ON documents(status);
Better solution (learned after):
Set up Alembic migrations properly:
# alembic/env.py
from app.models import Base

target_metadata = Base.metadata
Then in Railway:
alembic upgrade head
Result: Database tables created, app started successfully.
Lesson: Always test database migrations in a staging environment that mirrors production. Don't assume cloud providers work like localhost.
Crash #4: "Failed to Fetch" (CORS Hell)
What happened: Frontend deployed to Vercel. Backend on Railway. They couldn't talk to each other.
Chrome console:
Access to fetch at 'https://backend.railway.app/api/documents/list'
from origin 'https://frontend.vercel.app' has been blocked by CORS policy:
No 'Access-Control-Allow-Origin' header is present
The mistake: My CORS configuration only allowed localhost:
# ❌ WRONG - Only worked locally
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:5173"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
The solution:
# ✅ CORRECT - Works in production
app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "http://localhost:3000",
        "http://localhost:5173",
        "https://rag-document-qa-system.vercel.app",  # Production frontend
    ],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
Even better solution (learned later):
Use environment variables:
ALLOWED_ORIGINS = os.getenv(
    "ALLOWED_ORIGINS",
    "http://localhost:5173,https://rag-document-qa-system.vercel.app"
).split(",")

app.add_middleware(
    CORSMiddleware,
    allow_origins=ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
Result: Frontend successfully connected to backend.
Lesson: Configure CORS on day 1, not day 20. Test with production URLs before deploying. Use environment variables for flexibility.
Crash #5: Background Tasks Timing Out
What happened: Large documents (1000+ chunks) failed with timeout errors.
504 Gateway Timeout
The problem: Processing was synchronous - upload endpoint waited for:
Text extraction (5-10 seconds)
Chunking (2-3 seconds)
Embedding generation (45-60 seconds for 1000 chunks)
Pinecone upload (5-10 seconds)
Total: 60-80 seconds. Railway's timeout: 30 seconds.
The solution: Move processing to background tasks
from fastapi import BackgroundTasks

async def process_document_background(
    document_id: int,
    file_path: str,
    file_extension: str
):
    """Process document asynchronously"""
    from app.database import SessionLocal
    db = SessionLocal()
    try:
        document = db.query(Document).filter(
            Document.id == document_id
        ).first()

        # Extract text
        extraction_result = await text_extraction.extract_text(
            file_path, file_extension
        )

        if extraction_result["success"]:
            # Chunk text
            chunks = chunk_text(
                extraction_result["text"],
                chunk_size=1000,
                overlap=100
            )

            # Generate embeddings
            embeddings = embedding_service.generate_embeddings(chunks)

            # Store in Pinecone
            pinecone_service.upsert_embeddings(
                document_id=document_id,
                chunks=chunks,
                embeddings=embeddings
            )

            document.status = "ready"
        else:
            document.status = "failed"

        db.commit()
    finally:
        db.close()

@router.post("/upload")
async def upload_document(
    background_tasks: BackgroundTasks,
    file: UploadFile = File(...),
    db: Session = Depends(get_db),
    user: dict = Depends(get_current_user)
):
    # Save file and create database record
    file_path = await file_storage.save_uploaded_file(file)

    document = Document(
        user_id=user["sub"],
        filename=file.filename,
        status="processing"
    )
    db.add(document)
    db.commit()

    # Queue background processing
    background_tasks.add_task(
        process_document_background,
        document.id,
        file_path,
        file.filename.split(".")[-1]
    )

    return {
        "message": "Document uploaded. Processing in background...",
        "document_id": document.id,
        "status": "processing"
    }
Result: Upload endpoint returns in <1 second. Processing happens in background. No timeouts.
Lesson: Any operation taking >5 seconds should be a background task in production. Return immediately, process asynchronously.
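One consequence of "return immediately": the frontend now has to poll for completion. Here's a minimal status-endpoint sketch; the route path and response shape are assumptions, not the production API.

from fastapi import Depends, HTTPException

@router.get("/{document_id}/status")
async def get_document_status(
    document_id: int,
    db: Session = Depends(get_db),
    user: dict = Depends(get_current_user)
):
    document = db.query(Document).filter(
        Document.id == document_id,
        Document.user_id == user["sub"]  # users only see their own documents
    ).first()
    if document is None:
        raise HTTPException(status_code=404, detail="Document not found")
    # status is one of: processing, ready, failed
    return {"document_id": document.id, "status": document.status}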
The Security Audit That Changed Everything
After getting it "working," I ran CodeRabbit's security review.
Result: 17 vulnerabilities found.
2 CRITICAL:
Hardcoded database password in code
CORS wildcard (allow_origins=["*"])
5 HIGH:
No rate limiting (DoS vulnerability)
No virus scanning on uploads
No input sanitization
Missing pagination (could load 10K documents at once)
SQL injection potential (even with ORM)
The fixes:
Rate Limiting:
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address, default_limits=["200/minute"])
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)
app.add_middleware(SlowAPIMiddleware)  # applies the default limit to every route
Virus Scanning:
Integrated VirusTotal API
async def scan_file(file_path: str) -> Dict[str, Any]:
    response = requests.get(VIRUSTOTAL_URL, ...)
    if response.json()["data"]["attributes"]["stats"]["malicious"] > 0:
        return {"is_safe": False}
    return {"is_safe": True}
Input Sanitization:
import bleach
query = bleach.clean(request.query.strip(), tags=[], strip=True)
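In context, that line sits behind request validation. Here's a sketch assuming a Pydantic request model; the length bounds are illustrative.

import bleach
from pydantic import BaseModel, Field

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=1, max_length=2000)  # illustrative bounds

def sanitize_query(raw: str) -> str:
    """Strip HTML tags and surrounding whitespace before the text reaches the LLM."""
    return bleach.clean(raw.strip(), tags=[], strip=True)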
Pagination:
@router.get("/list")
async def list_documents(
skip: int = 0,
limit: int = 100,
db: Session = Depends(get_db)
):
documents = db.query(Document).offset(skip).limit(limit).all()
total = db.query(Document).count()
return {
"documents": documents,
"total": total,
"skip": skip,
"limit": limit
}
Result: All 17 vulnerabilities fixed. System production-hardened.
Lesson: Security isn't optional. Code reviews catch what you miss. Production means thinking about malicious users, not just happy paths.
Production Metrics (The Real Numbers)
System Performance:
| Metric | Value |
| --- | --- |
| Average Query Time | 2.4 seconds |
| Upload Processing (100 chunks) | 12 seconds |
| Upload Processing (1000 chunks) | 68 seconds |
| Embedding Generation (per chunk) | 0.25 seconds |
| Database Query Time | 45 ms average |
| Pinecone Query Time | 180 ms average |
API Costs (Monthly):
| Service | Free Tier | My Usage | Cost |
| --- | --- | --- | --- |
| Gemini | 15K requests | ~200/month | $0 |
| Cohere | 100 requests | ~50/month | $0 |
| Pinecone | 1 index, 1M vectors | ~5K vectors | $0 |
| Railway | 500 hours | ~720 hours | $5 |
| Vercel | Unlimited | N/A | $0 |
Total: $5/month for a production AI system.
Success Rates:
Document uploads: 95% (failures = corrupted files)
Query responses: 90% (10% = no relevant chunks found)
Background processing: 92% (8% = text extraction failures)
User Feedback (First Week):
17 documents uploaded
118 queries processed
5 users (mostly testing)
What I'd Do Differently
If I started over tomorrow:
Check API quotas FIRST - Not after hitting them. Gemini's "free tier" had 0 requests available. Cohere saved me.
Set up CORS early - Don't wait until deployment fails. Test with production URLs locally.
Database migrations from the start - Alembic configuration before first deployment, not after.
Implement background tasks immediately - Any operation >5 seconds should be async from the beginning.
Security review before deployment - Not after. CodeRabbit would've caught issues in development.
Use environment variables everywhere - No hardcoded values, even in development (see the settings sketch after this list).
Test with corrupted files - Users will upload anything. Test with 1-byte PDFs, empty files, and non-UTF8 text.
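For the environment-variables point, one way to centralise this is a settings module. A sketch using pydantic-settings (not necessarily what this project uses), with illustrative field names:

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    database_url: str
    gemini_api_key: str = ""
    cohere_api_key: str = ""
    pinecone_api_key: str = ""
    allowed_origins: str = "http://localhost:5173"

settings = Settings()  # reads from the environment, falling back to .env

Every hardcoded value that later bit me (the database password, the CORS origins, the API keys) would have been a single field here.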
Current Limitations & Future Improvements
Known Issues:
Scanned PDFs return 0 characters (needs OCR)
Large documents take 60+ seconds to process
Planned Features:
Streaming responses for better UX
OCR for scanned PDFs
Excel and PowerPoint support
Semantic caching to reduce API costs
Key Takeaways
Production AI is 20% algorithms, 80% infrastructure.
The biggest lessons:
Free tiers lie - "15K requests/month" doesn't mean you get 15K. Check actual quotas.
Always have fallbacks - Gemini fails → Cohere backup. Saved my deployment multiple times.
Background tasks are non-negotiable - Anything >5 seconds will timeout in production.
Security can't wait - One hardcoded password = complete compromise. Fix it before deploying.
CORS will break you - Configure it early, test with production URLs.
Test with real, messy data - Corrupted PDFs, empty files, non-UTF8 text. Users will upload anything.
Dimension mismatches are silent killers - 768 vs 1024 dimensions broke everything with no clear error.
The truth about production AI: Tutorials show the happy path. Production is 90% edge cases, rate limits, and error handling.
Try It Yourself
Live Demo: @URL
GitHub: @BLESSEDEFEM
To build something similar:
Start with document upload + text extraction (get this working first)
Add embeddings locally (test with small files)
Deploy backend before frontend (easier to debug)
Implement CORS from day 1
Monitor API quotas obsessively
Add background tasks early
Security audit before deployment
Questions? Open an issue on GitHub or connect with me on LinkedIn.
About the Author
Blessing Nejo - Junior Software & AI Engineer from Lagos, Nigeria
I build production AI systems and document the messy parts that tutorials skip. This was an adventure in learning through a hands-on project, and this RAG system taught me more in 3 weeks than months of tutorials did.
Currently seeking: Software/AI Engineer roles (remote-first)
Skills: Python, TypeScript, FastAPI, React, PostgreSQL, Vector Databases, Production AI Systems
Connect:
🔗 LinkedIn: Blessing Nejo
🐙 GitHub: @BLESSEDEFEM
📧 nejoblessing72@gmail.com
📍 Lagos, Nigeria
Found this helpful? Drop a comment below - I read and respond to every one.
Building something similar? I'm happy to review your architecture or debug issues. DM me.
Tags: #AI #MachineLearning #RAG #Python #FastAPI #React #TypeScript #ProductionAI #VectorDatabases #Pinecone #LLM #SoftwareEngineering