How I Structured a Production-Grade AI Platform From Scratch
I stopped doing tutorials.
Not because tutorials are bad. But because after finishing one, I could follow code. I couldn't explain why the code was written that way.
So I decided to build something real from scratch. No copy-paste. No shortcuts. Every decision justified. Every file explained.
This is the first article in a series documenting how I build a production-grade Agentic RAG Document Intelligence System — phase by phase, file by file.
What We're Building
The GenAI DocQA Platform is a system where users upload documents (PDF, DOCX, CSV, PPTX) and ask complex natural language questions. A 10-node LangGraph agent retrieves relevant chunks, reasons over them, self-corrects, and streams sourced answers back to the user.
Think: mini Perplexity AI + Notion AI + an OpenAI API platform. Built entirely from scratch.
Total cost to run: $0 — all free tiers.
The Full Stack
| Layer | Technology |
|---|---|
| API Framework | FastAPI + async SQLAlchemy 2.0 |
| AI Agent | LangGraph (10-node ReAct workflow) |
| RAG | pgvector + BM25 hybrid search + Cohere reranking |
| LLMs | Groq (free) → OpenAI → Anthropic (fallback chain) |
| Embeddings | Sentence-Transformers (local, free, CPU) |
| Cache | Redis (rate limiting + query cache + embeddings) |
| Database | PostgreSQL 16 + pgvector extension |
| Monitoring | LangSmith + Prometheus + Grafana + RAGAS |
| Security | JWT + bcrypt + AES-256-GCM + Presidio PII |
| Infrastructure | Docker Compose + GitHub Actions CI/CD |
13 Phases
| Phase | What Gets Built |
|---|---|
| 01 | Project scaffold — this article |
| 02 | JWT auth, bcrypt, AES encryption, rate limiting |
| 03 | Document parsers, chunking, WebSocket progress |
| 04 | Embeddings, pgvector, hybrid search, reranking |
| 05 | LLM router — 7 providers, fallback chain, cost tracking |
| 06 | RAG pipeline — prompt engineering, query rewriting, CRAG |
| 07 | LangGraph 10-node agent, self-correction loops |
| 08 | MCP integration — external tool calling |
| 09 | SSE streaming, conversation memory |
| 10 | Safety layer — PII masking, RAGAS evaluation, CI gates |
| 11-13 | React frontend, production deployment |
This article covers Phase 1 — the complete project scaffold. No AI yet. Just the foundation that everything else sits on.
The Philosophy: Why This Matters
Before writing a single line of code, I made one decision:
Every file gets a reason. Every decision gets a justification. Nothing exists "because the tutorial said so."
This forces architectural clarity. When you know WHY each piece exists, you can adapt it. When you only know WHAT it does, you're stuck.
Phase 1 File Structure
genai-platform/
├── .gitignore                 # what git never tracks
├── .env.example               # env var template — committed
├── README.md                  # project front door
├── .github/
│   └── workflows/
│       ├── ci.yml             # runs on every push
│       └── deploy.yml         # runs on main merge only
├── infrastructure/
│   └── docker-compose.yml     # PostgreSQL + Redis + Backend
└── backend/
    ├── requirements.txt       # pinned Python dependencies
    ├── pyproject.toml         # ruff + mypy + pytest config
    ├── Dockerfile             # multi-stage production build
    ├── alembic.ini            # migration configuration
    ├── alembic/
    │   ├── env.py             # async migration bridge
    │   └── versions/          # migration files (Phase 2+)
    └── app/
        ├── __init__.py        # package marker + version
        ├── config.py          # Pydantic Settings — all env vars
        ├── main.py            # FastAPI app + lifespan + middleware
        ├── dependencies.py    # get_db, get_redis, get_current_user
        ├── db/
        │   ├── database.py    # async engine + session factory
        │   └── init_db.py     # pgvector extension + admin seed
        ├── monitoring/
        │   ├── logger.py      # structured JSON logging (structlog)
        │   └── metrics.py     # 12 Prometheus metrics defined
        └── api/v1/
            └── health.py      # /health (liveness) + /ready (readiness)
24 files. Let's go through the key decisions.
Decision 1: .gitignore — What Never Gets Committed
The .gitignore has one critical pattern most developers miss:
# The pattern every production codebase uses
.env # real secrets — never committed
.env.* # covers .env.local, .env.production, etc.
!.env.example # ← the ! means EXCEPTION — template IS committed
.env.example is committed. It has placeholder values. When someone clones the repo they do:
cp .env.example .env
# then fill in real values
Zero guessing about what variables are needed.
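The template itself isn't shown in this article, so here's a hedged sketch of what a minimal `.env.example` might contain. `DATABASE_URL`, `SECRET_KEY`, and `GROQ_API_KEY` come from the project's config; the exact placeholder values are illustrative:

```bash
# .env.example — placeholders only, never real secrets
DATABASE_URL=postgresql+asyncpg://postgres:postgres@localhost:5432/genai
SECRET_KEY=change-me-generate-with-openssl-rand-hex-32
GROQ_API_KEY=gsk_your_key_here
DEBUG=false
```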
We also ignore uploaded documents:
uploads/
*.pdf
*.docx
*.pptx
*.csv
In production, files go to Cloudflare R2 object storage — not git. Git is for code. Not user data.
Decision 2: Environment Variables Done Right
The beginner approach:
# scattered across 15 files — dangerous
db_url = os.getenv("DATABASE_URL") # returns None silently if missing
secret = os.getenv("SECRIT_KEY") # typo — also None, no warning
expire = int(os.getenv("TOKEN_EXPIRE")) # crashes here if None
Three problems: typos return None silently, no types, missing variables don't surface until runtime — deep inside a failing request.
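The silent-None failure is easy to demonstrate with nothing but the standard library:

```python
import os

# Simulate a variable that was set correctly
os.environ["SECRET_KEY"] = "real-value"

# A typo'd name does not raise, it silently returns None
assert os.getenv("SECRIT_KEY") is None
assert os.getenv("SECRET_KEY") == "real-value"

# Everything os.getenv returns is a string, so no types either
os.environ["TOKEN_EXPIRE"] = "15"
assert os.getenv("TOKEN_EXPIRE") == "15"  # str, not int
```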
The production approach — Pydantic Settings:
# app/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field, field_validator
from functools import lru_cache
class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        extra="ignore",
    )

    # Required — app refuses to start without these
    DATABASE_URL: str = Field(...)
    SECRET_KEY: str = Field(...)
    GROQ_API_KEY: str = Field(...)

    # Optional — typed, with defaults
    ACCESS_TOKEN_EXPIRE_MINUTES: int = 15  # "15" → int automatically
    DEBUG: bool = False                    # "true" → bool automatically

    # Custom validators — fail fast with clear messages
    @field_validator("SECRET_KEY")
    @classmethod
    def validate_secret_key(cls, v: str) -> str:
        if len(v) < 32:
            raise ValueError(
                "SECRET_KEY must be at least 32 characters. "
                "Generate with: openssl rand -hex 32"
            )
        return v

    @field_validator("DATABASE_URL")
    @classmethod
    def validate_database_url(cls, v: str) -> str:
        if not v.startswith("postgresql+asyncpg://"):
            raise ValueError(
                "DATABASE_URL must use asyncpg driver. "
                "Change postgresql:// to postgresql+asyncpg://"
            )
        return v

@lru_cache()
def get_settings() -> Settings:
    return Settings()

settings = get_settings()  # read once, cached forever
If DATABASE_URL is missing:
ValidationError: DATABASE_URL field required
If SECRET_KEY is too short:
ValidationError: SECRET_KEY must be at least 32 characters.
Generate with: openssl rand -hex 32
Clear. Specific. Actionable. At startup — not runtime.
The @lru_cache() means the .env file is read once at startup. Not on every request. Not on every import. Once. Cached forever. This also ensures immutable configuration — the app has one consistent config for its entire lifetime.
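The caching behaviour is plain `functools`. A stripped-down stand-in for the Settings class (hypothetical, for illustration only) shows it:

```python
from functools import lru_cache

class FakeSettings:
    """Stand-in for the real Pydantic Settings class (illustration only)."""
    instances = 0

    def __init__(self) -> None:
        FakeSettings.instances += 1  # each construction = one .env read

@lru_cache()
def get_settings() -> FakeSettings:
    return FakeSettings()

a = get_settings()
b = get_settings()
assert a is b                       # every caller gets the same object
assert FakeSettings.instances == 1  # constructed exactly once
```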
Decision 3: Async SQLAlchemy 2.0 — Why It Changes Everything
# app/db/database.py
from sqlalchemy.ext.asyncio import (
    create_async_engine,
    async_sessionmaker,
    AsyncSession,
)
from sqlalchemy.orm import DeclarativeBase

from app.config import settings  # the cached Settings instance

engine = create_async_engine(
    url=settings.DATABASE_URL,  # postgresql+asyncpg:// — MUST be asyncpg
    pool_size=20,               # 20 connections in pool
    max_overflow=10,            # 10 extra when pool is full
    pool_pre_ping=True,         # test before using (prevents stale connections)
    echo=settings.DEBUG,        # log SQL in development
)

AsyncSessionLocal = async_sessionmaker(
    bind=engine,
    expire_on_commit=False,  # ← CRITICAL — more on this below
    class_=AsyncSession,
    autoflush=False,
)

class Base(DeclarativeBase):
    pass
The expire_on_commit=False Trap
This is the most common async SQLAlchemy mistake. By default, after commit(), SQLAlchemy marks all objects as "expired". The next attribute access triggers a new DB query. In sync code — fine. In async code:
user = await db.get(User, user_id)
await db.commit()
print(user.email) # ← CRASH in async: MissingGreenlet error
With expire_on_commit=False:
user = await db.get(User, user_id)
await db.commit()
print(user.email) # ✅ works — values kept in memory
When you actually need fresh data, use await db.refresh(user) explicitly. Explicit is better than implicit.
Why postgresql+asyncpg:// Not postgresql://?
A small difference in the URL. Completely different behaviour.
postgresql:// → sync driver → blocks the event loop → your server handles one request at a time during DB calls.
postgresql+asyncpg:// → async driver → non-blocking → server handles hundreds of concurrent requests during DB calls.
Our validator in config.py catches the wrong driver at startup:
ValidationError: DATABASE_URL must use asyncpg driver.
Decision 4: FastAPI Application Structure
# app/main.py
from contextlib import asynccontextmanager

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

@asynccontextmanager
async def lifespan(app: FastAPI):
    # ── STARTUP ──────────────────────────────────────────────
    setup_logging()     # 1. logging first — everything else logs
    setup_prometheus()  # 2. metrics
    await init_db()     # 3. DB — needs logging ready
    connect_redis()     # 4. Redis

    yield  # ← app handles requests here

    # ── SHUTDOWN ─────────────────────────────────────────────
    await redis.aclose()
    await engine.dispose()

app = FastAPI(lifespan=lifespan)
Why lifespan Instead of @app.on_event?
@app.on_event("startup") is deprecated since FastAPI 0.93. It's still in most tutorials. Don't use it.
The lifespan pattern:
- Startup and shutdown in one function — paired naturally
- A `finally` block ensures cleanup even on crashes
- Testable — can be mocked cleanly
- No deprecation warnings

The Startup Order
Order matters. Logging must be first so everything after it can produce logs.
Logging → Prometheus → Database → Redis → Ready
If database fails at startup — we raise the exception. An app that starts without a database looks healthy but serves broken responses. Fail fast. Fail loudly.
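The article doesn't show `init_db` itself, so here's a minimal sketch of the fail-fast idea, with `connect` as a stand-in for the real async database call (the retry count and delay are assumptions):

```python
import asyncio

async def init_db(connect, retries: int = 3, delay: float = 0.05) -> None:
    """Fail-fast startup sketch: retry briefly, then raise.

    `connect` stands in for the real async database call.
    """
    for attempt in range(1, retries + 1):
        try:
            await connect()
            return
        except ConnectionError:
            if attempt == retries:
                raise  # fail loudly, never start half-broken
            await asyncio.sleep(delay)

async def always_down():
    raise ConnectionError("db unreachable")

try:
    asyncio.run(init_db(always_down))
    ok = True
except ConnectionError:
    ok = False

assert ok is False  # the app refuses to come up without a database
```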
Middleware — Three Layers
# Layer 1: CORS — browsers need this to call your API
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "PATCH", "DELETE", "OPTIONS"],
    allow_headers=["*"],
)

# Layer 2: Request ID — every request gets a unique ID
@app.middleware("http")
async def add_request_id(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    structlog.contextvars.bind_contextvars(request_id=request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    structlog.contextvars.clear_contextvars()
    return response

# Layer 3: Request Logging — every request logged automatically
@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration_ms = (time.perf_counter() - start) * 1000
    log.info(
        "request_completed",
        method=request.method,
        path=request.url.path,
        status_code=response.status_code,
        duration_ms=round(duration_ms, 2),
    )
    return response
CORS is registered first because its headers need to be on every response, including error responses. If an error response goes out without CORS headers, the browser masks the real error behind a confusing network failure.
Decision 5: Dependency Injection
# app/dependencies.py
async def get_db() -> AsyncGenerator[AsyncSession, None]:
    async with AsyncSessionLocal() as session:
        try:
            yield session  # route runs with this session
            await session.commit()
        except Exception:
            await session.rollback()
            raise
        # session closes automatically — even on exception

async def get_redis(request: Request):
    return request.app.state.redis
In every route:
@router.get("/documents")
async def list_documents(
    db: AsyncSession = Depends(get_db),
    current_user: User = Depends(get_current_user),
):
    # db is ready, current_user is verified
    # no setup code needed here
    ...
The yield pattern ensures the session always closes, even if the route raises an exception. No leaked connections.
For testing:
app.dependency_overrides[get_db] = override_get_db # swap real DB for test DB
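`dependency_overrides` is literally a dict keyed by the original dependency function. A tiny stdlib model of the lookup (an illustration of the mechanism, not FastAPI's actual internals) makes the swap concrete:

```python
# Minimal model: the real dependency function is the dict key,
# the override is the value. (Not FastAPI's actual internals.)
dependency_overrides: dict = {}

def get_db():
    return "real-db-session"

def resolve(dep):
    # FastAPI consults app.dependency_overrides before calling a dependency
    return dependency_overrides.get(dep, dep)()

assert resolve(get_db) == "real-db-session"

def override_get_db():
    return "test-db-session"

dependency_overrides[get_db] = override_get_db
assert resolve(get_db) == "test-db-session"  # tests now get the fake DB
```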
Decision 6: Liveness vs Readiness Probes
Two endpoints. Two completely different questions.
# GET /api/v1/health — liveness: is the process alive?
# NEVER checks external services
@router.get("/health")
async def health_check():
    return {"status": "ok", "version": __version__}

# GET /api/v1/ready — readiness: can this instance handle traffic?
# Checks ALL dependencies
@router.get("/ready")
async def readiness_check(request: Request):
    checks = {}
    all_ready = True

    try:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))
        checks["database"] = {"status": "ok"}
    except Exception as e:
        all_ready = False
        checks["database"] = {"status": "error", "error": str(e)}

    try:
        await request.app.state.redis.ping()
        checks["redis"] = {"status": "ok"}
    except Exception:
        all_ready = False
        checks["redis"] = {"status": "error"}

    return JSONResponse(
        status_code=200 if all_ready else 503,
        content={
            "status": "ready" if all_ready else "not_ready",
            "checks": checks,
        },
    )
Real scenario — PostgreSQL restarts for 30 seconds:
With one combined endpoint:
- `/health` returns 503
- Kubernetes thinks the process is dead
- Kubernetes kills and restarts the container
- Restart doesn't fix PostgreSQL
- Kubernetes keeps restarting
- This is a crash loop — users see errors for minutes

With two separate endpoints:
- `/health` returns 200 (process is alive)
- `/ready` returns 503 (can't reach DB)
- Load balancer stops routing to this instance
- Traffic goes to other healthy instances
- Users see nothing
- PostgreSQL comes back → `/ready` returns 200 → traffic resumes
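This split is exactly what orchestrators expect. A hypothetical Kubernetes wiring for the two endpoints (not part of Phase 1's files; ports and thresholds are assumptions) makes the distinction concrete:

```yaml
# Hypothetical probe wiring — shown for illustration only
livenessProbe:
  httpGet:
    path: /api/v1/health
    port: 8000
  periodSeconds: 10
  failureThreshold: 3   # only restart after repeated liveness failures
readinessProbe:
  httpGet:
    path: /api/v1/ready
    port: 8000
  periodSeconds: 5      # pull out of rotation quickly when deps fail
```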
Decision 7: Structured Logging
# app/monitoring/logger.py
import structlog

def setup_logging():
    structlog.configure(
        processors=[
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.contextvars.merge_contextvars,  # includes request_id
            (
                structlog.processors.JSONRenderer()
                if not settings.DEBUG
                else structlog.dev.ConsoleRenderer(colors=True)
            ),
        ],
        wrapper_class=structlog.stdlib.BoundLogger,
        logger_factory=structlog.stdlib.LoggerFactory(),
    )
Development output (colored, human-readable):
2025-03-14 10:30:00 [info] document_uploaded filename=report.pdf size_mb=2.4
Production output (JSON, machine-readable):
{"event":"document_uploaded","filename":"report.pdf","size_mb":2.4,
"request_id":"abc-123","level":"info","timestamp":"2025-03-14T10:30:00Z"}
Same logging call. Different format. Zero code changes.
Usage anywhere in the app:
log = structlog.get_logger(__name__)
log.info("chunks_created", count=47, strategy="parent_child", duration_ms=234)
log.error("llm_failed", provider="groq", error=str(e), exc_info=True)
The request_id bound in middleware flows through every log line automatically. When something breaks, search request_id = "abc-123" and see the complete story of that request.
Decision 8: Multi-Stage Docker Build
# Stage 1 — Builder (large, temporary)
FROM python:3.12-slim AS builder
WORKDIR /build
RUN apt-get update && apt-get install -y gcc python3-dev libpq-dev \
&& rm -rf /var/lib/apt/lists/*
# requirements.txt BEFORE app code — enables layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2 — Production (small, deployed)
FROM python:3.12-slim AS production
# Runtime dependencies only — no build tools
RUN apt-get update && apt-get install -y libpq5 tesseract-ocr curl \
&& rm -rf /var/lib/apt/lists/*
# Copy ONLY compiled packages — not gcc, not make, not compilers
COPY --from=builder /install /usr/local
# Non-root user — least privilege principle
RUN groupadd -r appgroup && useradd -r -g appgroup appuser
RUN mkdir -p /app/uploads && chown -R appuser:appgroup /app

WORKDIR /app
USER appuser
COPY --chown=appuser:appgroup ./app /app/app

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
Result: 812MB → 298MB. Same application. Same functionality.
Layer caching rule:
Things that change RARELY → top of Dockerfile
Things that change OFTEN → bottom of Dockerfile
requirements.txt changes rarely. App code changes constantly. Copy requirements first → pip install is cached → code changes don't trigger pip install.
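Layer caching only helps if the build context itself stays small and stable. A `.dockerignore` isn't listed in the Phase 1 tree, so this is an assumed addition, but a sketch of one keeps volatile files out of the context entirely:

```
# .dockerignore — hypothetical sketch; keeps churn out of the build context
.git/
.env
__pycache__/
*.pyc
uploads/
```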
Decision 9: Alembic Async Bridge
Standard Alembic is synchronous. Our app uses async SQLAlchemy. The bridge:
# alembic/env.py
async def run_async_migrations() -> None:
    connectable = create_async_engine(settings.DATABASE_URL)
    async with connectable.connect() as connection:
        # run_sync() extracts a sync connection from async
        # Alembic runs inside that sync connection
        await connection.run_sync(do_run_migrations)
    await connectable.dispose()

def run_migrations_online() -> None:
    asyncio.run(run_async_migrations())
connection.run_sync(do_run_migrations) is the bridge.
- `connection` is async
- `run_sync()` extracts a sync version
- Alembic runs normally inside `do_run_migrations()`

This is the official Alembic async pattern.
Decision 10: GitHub Actions CI
Every push triggers four jobs in parallel:
jobs:
  lint:       # ruff — style + security issues
  typecheck:  # mypy — type errors
  test:       # pytest with real postgres + redis
    services:
      postgres:
        image: pgvector/pgvector:pg16
      redis:
        image: redis:7-alpine
  docker:     # verify image builds
    needs: [lint, typecheck, test]  # only if all pass
Key decisions:
- Real PostgreSQL and Redis in test job — not mocks
- Docker build only runs after all three pass — fail fast on cheap checks
- `cache: "pip"` in setup-python — subsequent runs are 10× faster
- `cache-from: type=gha` for Docker — layer caching across CI runs
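The full `ci.yml` isn't reproduced in this article; a hedged sketch of what the lint job alone might look like (action versions and the exact ruff invocation are assumptions):

```yaml
# Hypothetical sketch of the lint job
lint:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: actions/setup-python@v5
      with:
        python-version: "3.12"
        cache: "pip"       # the pip-cache speedup mentioned above
    - run: pip install ruff
    - run: ruff check backend/
```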
Running Phase 1
# 1. Clone and set up
git clone https://github.com/digvijaysingh21/genai-docqa.git
cd genai-docqa
cp .env.example .env
# 2. Generate secrets
openssl rand -hex 32 # paste as SECRET_KEY
openssl rand -hex 32 # paste as ENCRYPTION_KEY
# Get free Groq key at: console.groq.com → paste as GROQ_API_KEY
# 3. Start everything
cd infrastructure
docker compose up
# 4. Test
curl http://localhost:8000/api/v1/health
# {"status":"ok","version":"1.0.0","environment":"development"}
curl http://localhost:8000/api/v1/ready
# {"status":"ready","checks":{"database":{"status":"ok"},"redis":{"status":"ok"}}}
# 5. Open Swagger UI
open http://localhost:8000/docs
What Phase 1 Establishes
Before a single AI feature is built, we have:
- ✅ Deterministic builds (pinned versions)
- ✅ Fail-fast configuration (Pydantic validators)
- ✅ Async database with connection pooling
- ✅ Structured JSON logging with request tracing
- ✅ 12 Prometheus metrics defined
- ✅ Liveness + readiness health probes
- ✅ Multi-stage Docker build (300MB vs 800MB)
- ✅ Non-root container user
- ✅ Layer-optimised Dockerfile (10s rebuild vs 6min)
- ✅ Async Alembic migration bridge
- ✅ FastAPI DI system (get_db, get_redis)
- ✅ CI pipeline (lint + typecheck + tests + docker build)

This is the foundation. Everything from Phase 2 through Phase 13 sits on top of this.
Next Article
Phase 2: Auth and Security — JWT tokens with 15-minute access + 7-day refresh, bcrypt password hashing, AES-256-GCM encryption for LLM API keys (BYOK pattern), and Redis sliding window rate limiting.
If you're building along, the full code repo link will be added soon.
Building this project phase by phase. Every decision explained. Follow along for Phase 2.