Build a Production-Ready AI Document Brain: A No-Nonsense Guide to RAG SaaS

Let’s be real: most companies are drowning in a sea of PDFs. Contracts, handbooks, ancient policy docs—it’s a mess. Usually, employees waste half their day hunting for one specific clause. And if you try to just throw a generic ChatGPT at the problem? You get "hallucinations" (AI-speak for "making stuff up") that could get someone fired.

That’s the exact pain we’re solving today. We’re building a RAG-powered (Retrieval-Augmented Generation) SaaS backend.

The goal: Users upload their own docs, and the AI only answers based on that data. No external guessing, just fast, accurate, cited answers. If you’re a dev looking to move past "Hello World" AI tutorials and build something that actually survives a production environment, you're in the right place.


The "Battle-Tested" Tech Stack

I’ve built enough of these to know what breaks. Here’s what we’re using to keep things scalable:

  • The Engine: FastAPI. It’s fast, handles async like a champ, and the auto-docs save you hours of debugging.
  • The Brain (Vectors): PostgreSQL + pgvector. Don't get distracted by "trendy" vector-only DBs for your first SaaS. pgvector lets you keep your user data and your embeddings in one place. It’s persistent, SQL-friendly, and scales beautifully.
  • The Muscle: Redis + Celery. Generating embeddings is "heavy lifting." You don't want your API hanging while you process a 50-page PDF. Celery handles the dirty work in the background.
  • The Intelligence: OpenAI (text-embedding-3-small + GPT-4o-mini). It’s the gold standard for a reason, though you can swap in Gemini if you're feeling adventurous.
  • The Glue: Unstructured.io for parsing those messy PDFs and JWT for keeping user data private.

Let’s Build It: Step-by-Step

1. The Foundation

First, grab your virtual env (I’m a fan of uv or poetry these days—standard pip is a bit "last decade" for prod).

pip install fastapi uvicorn sqlalchemy asyncpg "psycopg[binary]" pgvector openai redis celery python-dotenv python-multipart PyPDF2 "unstructured[all-docs]" slowapi "python-jose[cryptography]" "passlib[bcrypt]"


Pro Tip: Create your .env file immediately. OPENAI_API_KEY, DATABASE_URL, REDIS_URL. If I see these hardcoded in your repo, we’re gonna have a talk.
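Something like this, with placeholder values (the database name and ports here are assumptions that match the Docker setup below):

OPENAI_API_KEY=sk-...
DATABASE_URL=postgresql+asyncpg://postgres:your-secret@localhost:5432/postgres
REDIS_URL=redis://localhost:6379/0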

2. Setting Up the Vector Vault

Spin up Postgres with pgvector using Docker. It’s the fastest way to get moving.

docker run -d --name pgvector -e POSTGRES_PASSWORD=your-secret -p 5432:5432 ankane/pgvector
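One gotcha: the image ships pgvector, but the extension still has to be enabled in your database. A one-off snippet you could run at startup (the connection URL is an assumption matching the container above):

# Run once at startup: enable the pgvector extension (idempotent).
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg://postgres:your-secret@localhost:5432/postgres")
with engine.begin() as conn:
    conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))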


In your models, define a Chunk table with a Vector(1536) column. Trust me: keeping your vectors inside Postgres makes joining metadata (like "which user owns this doc?") a breeze.
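Here's a minimal sketch of that model with SQLAlchemy 2.0 and the pgvector Python package (table and column names are my own; adapt them to your schema):

# Minimal model: each chunk belongs to a document and, for fast filtering, a user.
from pgvector.sqlalchemy import Vector
from sqlalchemy import ForeignKey, Integer, Text
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Chunk(Base):
    __tablename__ = "chunks"

    id: Mapped[int] = mapped_column(Integer, primary_key=True)
    document_id: Mapped[int] = mapped_column(ForeignKey("documents.id"), index=True)
    user_id: Mapped[int] = mapped_column(Integer, index=True)  # denormalized for per-user filtering
    content: Mapped[str] = mapped_column(Text)
    embedding = mapped_column(Vector(1536))  # matches text-embedding-3-small's dimension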

3. Privacy is Non-Negotiable

This is a SaaS, not a personal script. Every document and every text chunk must be tied to a user_id.

  • The Rule: Always filter queries with WHERE user_id = current_user.id.
  • The Level Up: Use Postgres Row-Level Security (RLS) to ensure one user can never peek at another's data. Both levels are sketched right after this list.
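A rough sketch of both, reusing the Chunk model from step 2 (the app.user_id session setting is an assumption you'd wire into your connection handling):

# The Rule in practice: every chunk query is scoped to the current user.
from sqlalchemy import select

async def get_user_chunks(session, current_user_id: int):
    stmt = select(Chunk).where(Chunk.user_id == current_user_id)
    return (await session.execute(stmt)).scalars().all()

# The Level Up: enable RLS once, in a migration. Assumes the app sets
# the app.user_id setting on each connection it checks out.
RLS_MIGRATION = """
ALTER TABLE chunks ENABLE ROW LEVEL SECURITY;
CREATE POLICY chunks_owner_only ON chunks
    USING (user_id = current_setting('app.user_id')::int);
"""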

4. The Processing Pipeline

When a user hits POST /upload, don't make them wait.

  1. Parse: Use Unstructured—it’s way better than PyPDF2 at handling tables.
  2. Chunk: Don't just cut text every 500 characters. Use a recursive splitter with overlap (e.g., 800 tokens with 150 token overlap) so you don't lose the context mid-sentence.
  3. Embed: Send those chunks to OpenAI in batches.
  4. Offload: Use Celery. Your API should just say "Got it, I'm working on it!" while the background worker does the heavy lifting (see the worker sketch after this list).
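Here's a rough sketch of that worker. Names like process_document and chunk_text are my own, and the character-based splitter is a stand-in for a proper token-aware recursive splitter:

# Background ingestion task: parse -> chunk -> embed -> store.
from celery import Celery
from openai import OpenAI
from unstructured.partition.auto import partition

celery_app = Celery("worker", broker="redis://localhost:6379/0")
client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, size: int = 800, overlap: int = 150) -> list[str]:
    # Naive character-based splitter with overlap; swap in a token-aware
    # recursive splitter for production.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

@celery_app.task
def process_document(document_id: int, user_id: int, file_path: str):
    elements = partition(filename=file_path)  # Unstructured handles PDFs, DOCX, tables...
    full_text = "\n\n".join(el.text for el in elements if el.text)
    chunks = chunk_text(full_text)

    # Embed in batches to stay under request-size limits.
    for i in range(0, len(chunks), 100):
        batch = chunks[i:i + 100]
        resp = client.embeddings.create(model="text-embedding-3-small", input=batch)
        # Persist each chunk + embedding with your own session/ORM here, e.g.
        # Chunk(document_id=document_id, user_id=user_id, content=c, embedding=e.embedding)
        # for c, e in zip(batch, resp.data).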

5. The Magic "/ask" Endpoint

This is where the RAG happens:

  1. Embed the Question: Turn the user's query into a vector.
  2. Semantic Search: Use pgvector to find the 5-10 most relevant chunks.
  3. The Prompt: "Answer ONLY using this context. If it's not there, say you don't know. Cite your sources."
  4. Cache: If someone asks the same thing twice, serve it from Redis. It’s cheaper and faster. (The whole flow is sketched below.)
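Putting those four steps together, a sketch of the core logic. It reuses the Chunk model from step 2 and assumes an async SQLAlchemy session; the function name ask is my own:

# The /ask flow: cache check -> embed -> vector search -> grounded prompt.
import hashlib
import redis.asyncio as redis
from openai import AsyncOpenAI
from sqlalchemy import select

client = AsyncOpenAI()
cache = redis.from_url("redis://localhost:6379/0")

SYSTEM_PROMPT = (
    "Answer ONLY using the provided context. If the answer is not in the "
    "context, say you don't know. Cite the source chunks you used."
)

async def ask(session, current_user_id: int, question: str) -> str:
    key = f"ask:{current_user_id}:{hashlib.sha256(question.encode()).hexdigest()}"
    if (cached := await cache.get(key)) is not None:
        return cached.decode()

    q_emb = (await client.embeddings.create(
        model="text-embedding-3-small", input=question)).data[0].embedding

    # pgvector cosine-distance search, always scoped to the current user.
    stmt = (select(Chunk)
            .where(Chunk.user_id == current_user_id)
            .order_by(Chunk.embedding.cosine_distance(q_emb))
            .limit(8))
    chunks = (await session.execute(stmt)).scalars().all()
    context = "\n---\n".join(c.content for c in chunks)

    resp = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"}])
    answer = resp.choices[0].message.content
    await cache.set(key, answer, ex=3600)  # cache for an hour
    return answer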

Lessons Learned (The Hard Way)

I’ve made the mistakes so you don't have to:

  • The "Prototype Trap": Don't use FAISS for a multi-user app. It lives in RAM. If your server restarts, your "brain" disappears. Use pgvector.
  • The "Spinning Wheel of Death": Never embed synchronously. If a user uploads a book, your API will timeout. Always use background tasks.
  • The "Hallucination Headache": Be aggressive with your system prompt. Tell the AI: "If you aren't 100% sure based on the provided text, don't guess."

The Payoff

When you're done, you have a system where retrieval usually takes under 200ms, and full, cited answers pop up in less than 3 seconds. It looks incredible in a portfolio because it shows you understand async flows, data security, and cost management.

(This is usually where you'd drop a screenshot of your Swagger UI showing those clean /upload and /ask endpoints in action!)


Want This Built for Your Business?

I’m a freelance developer who lives and breathes this stuff. If you need a custom RAG platform, a high-performance FastAPI backend, or just want to turn your company's messy documentation into a searchable superpower, let's talk.
