Digvijay Singh
I Built a Production-Grade AI Platform From Scratch (Here’s the Exact Folder Structure)

How I Structured a Production-Grade AI Platform From Scratch

I stopped doing tutorials.

Not because tutorials are bad. But because after finishing one, I could follow code. I couldn't explain why the code was written that way.

So I decided to build something real from scratch. No copy-paste. No shortcuts. Every decision justified. Every file explained.

This is the first article in a series documenting how I build a production-grade Agentic RAG Document Intelligence System — phase by phase, file by file.

What We're Building

The GenAI DocQA Platform is a system where users upload documents (PDF, DOCX, CSV, PPTX) and ask complex natural language questions. A 10-node LangGraph agent retrieves relevant chunks, reasons over them, self-corrects, and streams sourced answers back to the user.

Think: mini Perplexity AI + Notion AI + an OpenAI API platform. Built entirely from scratch.

Total cost to run: $0 — all free tiers.

The Full Stack

| Layer | Technology |
|---|---|
| API Framework | FastAPI + async SQLAlchemy 2.0 |
| AI Agent | LangGraph (10-node ReAct workflow) |
| RAG | pgvector + BM25 hybrid search + Cohere reranking |
| LLMs | Groq (free) → OpenAI → Anthropic (fallback chain) |
| Embeddings | Sentence-Transformers (local, free, CPU) |
| Cache | Redis (rate limiting + query cache + embeddings) |
| Database | PostgreSQL 16 + pgvector extension |
| Monitoring | LangSmith + Prometheus + Grafana + RAGAS |
| Security | JWT + bcrypt + AES-256-GCM + Presidio PII |
| Infrastructure | Docker Compose + GitHub Actions CI/CD |

13 Phases

| Phase | What Gets Built |
|---|---|
| 01 | Project scaffold — this article |
| 02 | JWT auth, bcrypt, AES encryption, rate limiting |
| 03 | Document parsers, chunking, WebSocket progress |
| 04 | Embeddings, pgvector, hybrid search, reranking |
| 05 | LLM router — 7 providers, fallback chain, cost tracking |
| 06 | RAG pipeline — prompt engineering, query rewriting, CRAG |
| 07 | LangGraph 10-node agent, self-correction loops |
| 08 | MCP integration — external tool calling |
| 09 | SSE streaming, conversation memory |
| 10 | Safety layer — PII masking, RAGAS evaluation, CI gates |
| 11–13 | React frontend, production deployment |

This article covers Phase 1 — the complete project scaffold. No AI yet. Just the foundation that everything else sits on.


The Philosophy: Why This Matters

Before writing a single line of code, I made one decision:

Every file gets a reason. Every decision gets a justification. Nothing exists "because the tutorial said so."

This forces architectural clarity. When you know WHY each piece exists, you can adapt it. When you only know WHAT it does, you're stuck.


Phase 1 File Structure

genai-platform/
├── .gitignore                     # what git never tracks
├── .env.example                   # env var template — committed
├── README.md                      # project front door
├── .github/
│   └── workflows/
│       ├── ci.yml                 # runs on every push
│       └── deploy.yml             # runs on main merge only
├── infrastructure/
│   └── docker-compose.yml         # PostgreSQL + Redis + Backend
└── backend/
    ├── requirements.txt           # pinned Python dependencies
    ├── pyproject.toml             # ruff + mypy + pytest config
    ├── Dockerfile                 # multi-stage production build
    ├── alembic.ini                # migration configuration
    ├── alembic/
    │   ├── env.py                 # async migration bridge
    │   └── versions/              # migration files (Phase 2+)
    └── app/
        ├── __init__.py            # package marker + version
        ├── config.py              # Pydantic Settings — all env vars
        ├── main.py                # FastAPI app + lifespan + middleware
        ├── dependencies.py        # get_db, get_redis, get_current_user
        ├── db/
        │   ├── database.py        # async engine + session factory
        │   └── init_db.py         # pgvector extension + admin seed
        ├── monitoring/
        │   ├── logger.py          # structured JSON logging (structlog)
        │   └── metrics.py         # 12 Prometheus metrics defined
        └── api/v1/
            └── health.py          # /health (liveness) + /ready (readiness)

24 files. Let's go through the key decisions.


Decision 1: .gitignore — What Never Gets Committed

The .gitignore has one critical pattern most developers miss:

# The pattern every production codebase uses
.env        # real secrets — never committed
.env.*      # covers .env.local, .env.production, etc.
!.env.example  # ← the ! means EXCEPTION — template IS committed

.env.example is committed. It has placeholder values. When someone clones the repo they do:

cp .env.example .env
# then fill in real values

Zero guessing about what variables are needed.
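To keep that guarantee honest, a tiny stdlib-only helper can diff the two files and complain about anything `.env` is missing. This is a hypothetical convenience script, not a file in the repo:

```python
# check_env.py (hypothetical helper): verify your local .env defines
# every variable listed in .env.example.
from pathlib import Path


def env_keys(path: str) -> set[str]:
    """Collect variable names from a dotenv-style file, skipping comments."""
    keys = set()
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_keys(example: str = ".env.example", actual: str = ".env") -> set[str]:
    """Names present in the template but absent from the real .env."""
    return env_keys(example) - env_keys(actual)


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        raise SystemExit(f"Missing in .env: {sorted(missing)}")
    print("All variables present.")
```

Run it after `cp .env.example .env` and again whenever you pull changes that add new variables.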

We also ignore uploaded documents:

uploads/
*.pdf
*.docx
*.pptx
*.csv

In production, files go to Cloudflare R2 object storage — not git. Git is for code. Not user data.


Decision 2: Environment Variables Done Right

The beginner approach:

# scattered across 15 files — dangerous
db_url = os.getenv("DATABASE_URL")         # returns None silently if missing
secret = os.getenv("SECRIT_KEY")           # typo — also None, no warning
expire = int(os.getenv("TOKEN_EXPIRE"))    # crashes here if None

Three problems: typos return None silently, no types, missing variables don't surface until runtime — deep inside a failing request.
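All three failure modes are easy to reproduce with nothing but the standard library. The variable names mirror the snippet above; none of this touches a real app:

```python
import os

# 1. Typo'd names fail silently: getenv returns None instead of raising.
os.environ["SECRET_KEY"] = "real-value"
assert os.getenv("SECRIT_KEY") is None  # typo goes completely unnoticed

# 2. No types: every env var is a string, and string truthiness lies.
os.environ["DEBUG"] = "false"
assert bool(os.getenv("DEBUG")) is True  # any non-empty string is truthy!

# 3. Missing vars only blow up at the point of use, deep inside a request.
os.environ.pop("TOKEN_EXPIRE", None)
try:
    int(os.getenv("TOKEN_EXPIRE"))  # int(None) raises at runtime
except TypeError as e:
    print("crashed late:", e)
```

The `bool("false") is True` trap alone justifies typed settings: Pydantic parses `"false"` to `False`, `os.getenv` does not.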

The production approach — Pydantic Settings:

# app/config.py
from pydantic_settings import BaseSettings, SettingsConfigDict
from pydantic import Field, field_validator
from functools import lru_cache

class Settings(BaseSettings):
    model_config = SettingsConfigDict(
        env_file=".env",
        extra="ignore",
    )

    # Required — app refuses to start without these
    DATABASE_URL: str = Field(...)
    SECRET_KEY: str = Field(...)
    GROQ_API_KEY: str = Field(...)

    # Optional — typed, with defaults
    ACCESS_TOKEN_EXPIRE_MINUTES: int = 15   # "15" → int automatically
    DEBUG: bool = False                      # "true" → bool automatically

    # Custom validators — fail fast with clear messages
    @field_validator("SECRET_KEY")
    @classmethod
    def validate_secret_key(cls, v: str) -> str:
        if len(v) < 32:
            raise ValueError(
                "SECRET_KEY must be at least 32 characters. "
                "Generate with: openssl rand -hex 32"
            )
        return v

    @field_validator("DATABASE_URL")
    @classmethod
    def validate_database_url(cls, v: str) -> str:
        if not v.startswith("postgresql+asyncpg://"):
            raise ValueError(
                "DATABASE_URL must use asyncpg driver. "
                "Change postgresql:// to postgresql+asyncpg://"
            )
        return v

@lru_cache()
def get_settings() -> Settings:
    return Settings()

settings = get_settings()  # read once, cached forever

If DATABASE_URL is missing:

ValidationError: DATABASE_URL field required

If SECRET_KEY is too short:

ValidationError: SECRET_KEY must be at least 32 characters.
Generate with: openssl rand -hex 32

Clear. Specific. Actionable. At startup — not runtime.
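If you don't have `openssl` handy, Python's stdlib `secrets` module produces an equivalent key. A quick sketch, not part of the project code:

```python
import secrets

# Equivalent of `openssl rand -hex 32`: 32 random bytes as 64 hex characters,
# comfortably past the 32-character validator above.
key = secrets.token_hex(32)
print(key)
```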

The @lru_cache() means the .env file is read once at startup. Not on every request. Not on every import. Once. Cached forever. This also ensures immutable configuration — the app has one consistent config for its entire lifetime.
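The caching behaviour is easy to see with a stand-in function. `get_settings_stub` here is illustrative, not the real `Settings` class; it just counts how often its body runs:

```python
from functools import lru_cache

calls = 0


@lru_cache()
def get_settings_stub() -> dict:
    """Stand-in for get_settings(): imagine the body re-reading .env."""
    global calls
    calls += 1
    return {"DEBUG": False}


a = get_settings_stub()
b = get_settings_stub()
assert a is b      # every caller gets the same cached object
assert calls == 1  # the "file read" happened exactly once
```

This is also why tests use `app.dependency_overrides` or `get_settings_stub.cache_clear()` rather than mutating a live settings object.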


Decision 3: Async SQLAlchemy 2.0 — Why It Changes Everything

# app/db/database.py
from sqlalchemy.ext.asyncio import (
    create_async_engine,
    async_sessionmaker,
    AsyncSession,
)
from sqlalchemy.orm import DeclarativeBase

engine = create_async_engine(
    url=settings.DATABASE_URL,  # postgresql+asyncpg:// — MUST be asyncpg
    pool_size=20,                # 20 connections in pool
    max_overflow=10,             # 10 extra when pool is full
    pool_pre_ping=True,          # test before using (prevents stale connections)
    echo=settings.DEBUG,         # log SQL in development
)

AsyncSessionLocal = async_sessionmaker(
    bind=engine,
    expire_on_commit=False,  # ← CRITICAL — more on this below
    class_=AsyncSession,
    autoflush=False,
)

class Base(DeclarativeBase):
    pass

The expire_on_commit=False Trap

This is the most common async SQLAlchemy mistake. By default, after commit(), SQLAlchemy marks all objects as "expired". The next attribute access triggers a new DB query. In sync code — fine. In async code:

user = await db.get(User, user_id)
await db.commit()
print(user.email)   # ← CRASH in async: MissingGreenlet error

With expire_on_commit=False:

user = await db.get(User, user_id)
await db.commit()
print(user.email)   # ✅ works — values kept in memory

When you actually need fresh data, use await db.refresh(user) explicitly. Explicit is better than implicit.

Why postgresql+asyncpg:// Not postgresql://?

One small difference in the connection URL. Completely different behaviour.

postgresql:// → sync driver → blocks the event loop → your server handles one request at a time during DB calls.

postgresql+asyncpg:// → async driver → non-blocking → server handles hundreds of concurrent requests during DB calls.
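The difference is easy to measure with a toy benchmark, where `time.sleep` stands in for a blocking sync driver and `asyncio.sleep` for a non-blocking one (illustrative only, no real database involved):

```python
import asyncio
import time


async def sync_style_query():
    time.sleep(0.05)  # blocking call: freezes the whole event loop


async def async_style_query():
    await asyncio.sleep(0.05)  # yields: other "requests" keep running


async def handle_many(query, n=10):
    """Simulate 10 concurrent requests each doing one DB call."""
    start = time.perf_counter()
    await asyncio.gather(*(query() for _ in range(n)))
    return time.perf_counter() - start


blocking = asyncio.run(handle_many(sync_style_query))
concurrent = asyncio.run(handle_many(async_style_query))
print(f"blocking: {blocking:.2f}s, non-blocking: {concurrent:.2f}s")
```

The blocking version runs its ten "queries" one after another; the async version overlaps them, finishing in roughly the time of a single query.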

Our validator in config.py catches the wrong driver at startup:

ValidationError: DATABASE_URL must use asyncpg driver.

Decision 4: FastAPI Application Structure

# app/main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

@asynccontextmanager
async def lifespan(app: FastAPI):
    # ── STARTUP ──────────────────────────────────────────────
    setup_logging()        # 1. logging first — everything else logs
    setup_prometheus()     # 2. metrics
    await init_db()        # 3. DB — needs logging ready
    connect_redis()        # 4. Redis

    yield  # ← app handles requests here

    # ── SHUTDOWN ─────────────────────────────────────────────
    await redis.aclose()
    await engine.dispose()

app = FastAPI(lifespan=lifespan)

Why lifespan Instead of @app.on_event?

@app.on_event("startup") is deprecated since FastAPI 0.93. It's still in most tutorials. Don't use it.

The lifespan pattern:

  • Startup and shutdown in one function — paired naturally
  • Code after yield runs at shutdown; wrap the yield in try/finally if cleanup must also run when startup fails partway
  • Testable — the lifespan can be mocked cleanly
  • No deprecation warnings

The Startup Order

Order matters. Logging must be first so everything after it can produce logs.

Logging → Prometheus → Database → Redis → Ready

If database fails at startup — we raise the exception. An app that starts without a database looks healthy but serves broken responses. Fail fast. Fail loudly.
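The order-plus-fail-fast idea can be sketched without FastAPI. The step names here are illustrative stand-ins for the real setup functions:

```python
# Sketch of fail-fast, ordered startup. Step names are hypothetical.
started = []


def setup_logging_step():
    started.append("logging")


def setup_metrics_step():
    started.append("metrics")


def init_db_step():
    raise ConnectionError("database unreachable")  # simulate a dead DB


def connect_redis_step():
    started.append("redis")


def run_startup():
    # Order matters: logging first, and any failure aborts the whole boot.
    for step in (setup_logging_step, setup_metrics_step,
                 init_db_step, connect_redis_step):
        step()  # no try/except: a broken dependency must crash the app


try:
    run_startup()
except ConnectionError as e:
    print("refusing to start:", e)
```

With the database step failing, Redis is never even attempted; the process exits instead of serving half-broken responses.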

Middleware — Three Layers

# Layer 1: CORS — browsers need this to call your API
app.add_middleware(
    CORSMiddleware,
    allow_origins=settings.ALLOWED_ORIGINS,
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "PATCH", "DELETE", "OPTIONS"],
    allow_headers=["*"],
)

# Layer 2: Request ID — every request gets a unique ID
@app.middleware("http")
async def add_request_id(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    structlog.contextvars.bind_contextvars(request_id=request_id)
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    structlog.contextvars.clear_contextvars()
    return response

# Layer 3: Request Logging — every request logged automatically
@app.middleware("http")
async def log_requests(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration_ms = (time.perf_counter() - start) * 1000
    log.info("request_completed",
             method=request.method,
             path=request.url.path,
             status_code=response.status_code,
             duration_ms=round(duration_ms, 2))
    return response

CORS is registered first so its headers end up on every response, including error responses. If an error response goes out without CORS headers, the browser hides the real status code and reports an opaque network failure — confusing to debug.


Decision 5: Dependency Injection

# app/dependencies.py
async def get_db() -> AsyncGenerator[AsyncSession, None]:
    async with AsyncSessionLocal() as session:
        try:
            yield session        # route runs with this session
            await session.commit()
        except Exception:
            await session.rollback()
            raise
        # session closes automatically — even on exception

async def get_redis(request: Request):
    return request.app.state.redis

In every route:

@router.get("/documents")
async def list_documents(
    db: AsyncSession = Depends(get_db),
    current_user: User = Depends(get_current_user),
):
    # db is ready, current_user is verified
    # no setup code needed here
    ...

The yield pattern ensures the session always closes, even if the route raises an exception. No leaked connections.
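The lifecycle is the same as a plain Python context manager, which makes it easy to demonstrate without FastAPI or a database. `get_db_stub` is a hypothetical stand-in that records each phase:

```python
from contextlib import contextmanager

events = []


@contextmanager
def get_db_stub():
    """Framework-free sketch of get_db: setup, commit/rollback, close."""
    events.append("open")
    try:
        yield "session"
        events.append("commit")
    except Exception:
        events.append("rollback")
        raise
    finally:
        events.append("close")  # always runs, success or failure


# Happy path: commit, then close.
with get_db_stub():
    pass
assert events == ["open", "commit", "close"]

# Failing route: rollback, close, and the error still propagates.
events.clear()
try:
    with get_db_stub():
        raise RuntimeError("route blew up")
except RuntimeError:
    pass
assert events == ["open", "rollback", "close"]
```

FastAPI drives yield-dependencies through essentially this machinery, which is why the session can never leak.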

For testing:

app.dependency_overrides[get_db] = override_get_db  # swap real DB for test DB

Decision 6: Liveness vs Readiness Probes

Two endpoints. Two completely different questions.

# GET /api/v1/health — liveness: is the process alive?
# NEVER checks external services
@router.get("/health")
async def health_check():
    return {"status": "ok", "version": __version__}

# GET /api/v1/ready — readiness: can this instance handle traffic?
# Checks ALL dependencies
@router.get("/ready")
async def readiness_check(request: Request):
    checks = {}
    all_ready = True

    try:
        async with engine.connect() as conn:
            await conn.execute(text("SELECT 1"))
        checks["database"] = {"status": "ok"}
    except Exception as e:
        all_ready = False
        checks["database"] = {"status": "error", "error": str(e)}

    try:
        await request.app.state.redis.ping()
        checks["redis"] = {"status": "ok"}
    except Exception as e:
        all_ready = False
        checks["redis"] = {"status": "error"}

    return JSONResponse(
        status_code=200 if all_ready else 503,
        content={"status": "ready" if all_ready else "not_ready",
                 "checks": checks},
    )

Real scenario — PostgreSQL restarts for 30 seconds:

With one combined endpoint:

  • /health returns 503
  • Kubernetes thinks the process is dead
  • Kubernetes kills and restarts the container
  • The restart doesn't fix PostgreSQL
  • Kubernetes keeps restarting
  • This is a crash loop — users see errors for minutes

With two separate endpoints:

  • /health returns 200 (process is alive)
  • /ready returns 503 (can't reach DB)
  • Load balancer stops routing to this instance
  • Traffic goes to other healthy instances
  • Users see nothing
  • PostgreSQL comes back → /ready returns 200 → traffic resumes

Decision 7: Structured Logging

# app/monitoring/logger.py
import structlog

def setup_logging():
    structlog.configure(
        processors=[
            structlog.stdlib.add_log_level,
            structlog.stdlib.add_logger_name,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.contextvars.merge_contextvars,  # includes request_id
            structlog.processors.JSONRenderer() if not settings.DEBUG
            else structlog.dev.ConsoleRenderer(colors=True),
        ],
        wrapper_class=structlog.stdlib.BoundLogger,
        logger_factory=structlog.stdlib.LoggerFactory(),
    )

Development output (colored, human-readable):

2025-03-14 10:30:00 [info] document_uploaded  filename=report.pdf size_mb=2.4

Production output (JSON, machine-readable):

{"event":"document_uploaded","filename":"report.pdf","size_mb":2.4,
 "request_id":"abc-123","level":"info","timestamp":"2025-03-14T10:30:00Z"}

Same logging call. Different format. Zero code changes.

Usage anywhere in the app:

log = structlog.get_logger(__name__)
log.info("chunks_created", count=47, strategy="parent_child", duration_ms=234)
log.error("llm_failed", provider="groq", error=str(e), exc_info=True)

The request_id bound in middleware flows through every log line automatically. When something breaks, search request_id = "abc-123" and see the complete story of that request.
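What merge_contextvars does can be sketched with the stdlib alone. `log_info` here is a hypothetical stand-in for structlog, not its real API:

```python
import contextvars
import json

# A context variable survives across await points within one request.
request_id_var = contextvars.ContextVar("request_id", default=None)


def log_info(event: str, **fields) -> str:
    """Merge the bound request_id into every log line, structlog-style."""
    record = {"event": event, "request_id": request_id_var.get(), **fields}
    line = json.dumps(record)
    print(line)
    return line


# Middleware binds the id once per request...
request_id_var.set("abc-123")

# ...and every later log call in that context carries it automatically,
# without any function threading request_id through its arguments.
line = log_info("document_uploaded", filename="report.pdf", size_mb=2.4)
```

Because the binding lives in a `ContextVar` rather than a global, concurrent requests each see their own id.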


Decision 8: Multi-Stage Docker Build

# Stage 1 — Builder (large, temporary)
FROM python:3.12-slim AS builder
WORKDIR /build

RUN apt-get update && apt-get install -y gcc python3-dev libpq-dev \
    && rm -rf /var/lib/apt/lists/*

# requirements.txt BEFORE app code — enables layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2 — Production (small, deployed)
FROM python:3.12-slim AS production

# Runtime dependencies only — no build tools
RUN apt-get update && apt-get install -y libpq5 tesseract-ocr curl \
    && rm -rf /var/lib/apt/lists/*

# Copy ONLY compiled packages — not gcc, not make, not compilers
COPY --from=builder /install /usr/local

# Non-root user — least privilege principle
RUN groupadd -r appgroup && useradd -r -g appgroup appuser
RUN mkdir -p /app/uploads && chown -R appuser:appgroup /app

USER appuser
COPY --chown=appuser:appgroup ./app /app/app

CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]

Result: 812MB → 298MB. Same application. Same functionality.

Layer caching rule:

Things that change RARELY → top of Dockerfile
Things that change OFTEN  → bottom of Dockerfile

requirements.txt changes rarely. App code changes constantly. Copy requirements first → pip install is cached → code changes don't trigger pip install.


Decision 9: Alembic Async Bridge

Standard Alembic is synchronous. Our app uses async SQLAlchemy. The bridge:

# alembic/env.py
async def run_async_migrations() -> None:
    connectable = create_async_engine(settings.DATABASE_URL)

    async with connectable.connect() as connection:
        # run_sync() extracts a sync connection from async
        # Alembic runs inside that sync connection
        await connection.run_sync(do_run_migrations)

    await connectable.dispose()

def run_migrations_online() -> None:
    asyncio.run(run_async_migrations())

connection.run_sync(do_run_migrations) is the bridge.

  • connection is async
  • run_sync() extracts a sync version
  • Alembic runs normally inside do_run_migrations()

This is the official Alembic async pattern.

Decision 10: GitHub Actions CI

Every push triggers four jobs in parallel:

jobs:
  lint:      # ruff — style + security issues
  typecheck: # mypy — type errors
  test:      # pytest with real postgres + redis
    services:
      postgres:
        image: pgvector/pgvector:pg16
      redis:
        image: redis:7-alpine

  docker:    # verify image builds
    needs: [lint, typecheck, test]  # only if all pass

Key decisions:

  • Real PostgreSQL and Redis in test job — not mocks
  • Docker build only runs after all three pass — fail fast on cheap checks
  • cache: "pip" in setup-python — subsequent runs are 10× faster

  • cache-from: type=gha for Docker — layer caching across CI runs

Running Phase 1

# 1. Clone and set up
git clone https://github.com/digvijaysingh21/genai-docqa.git
cd genai-docqa
cp .env.example .env

# 2. Generate secrets
openssl rand -hex 32  # paste as SECRET_KEY
openssl rand -hex 32  # paste as ENCRYPTION_KEY
# Get free Groq key at: console.groq.com → paste as GROQ_API_KEY

# 3. Start everything
cd infrastructure
docker compose up

# 4. Test
curl http://localhost:8000/api/v1/health
# {"status":"ok","version":"1.0.0","environment":"development"}

curl http://localhost:8000/api/v1/ready
# {"status":"ready","checks":{"database":{"status":"ok"},"redis":{"status":"ok"}}}

# 5. Open Swagger UI
open http://localhost:8000/docs

What Phase 1 Establishes

Before a single AI feature is built, we have:

  • ✅ Deterministic builds (pinned versions)
  • ✅ Fail-fast configuration (Pydantic validators)
  • ✅ Async database with connection pooling
  • ✅ Structured JSON logging with request tracing
  • ✅ 12 Prometheus metrics defined
  • ✅ Liveness + readiness health probes
  • ✅ Multi-stage Docker build (300MB vs 800MB)
  • ✅ Non-root container user
  • ✅ Layer-optimised Dockerfile (10s rebuild vs 6min)
  • ✅ Async Alembic migration bridge
  • ✅ FastAPI DI system (get_db, get_redis)
  • ✅ CI pipeline (lint + typecheck + tests + docker build)

This is the foundation. Everything from Phase 2 through Phase 13 sits on top of this.

Next Article

Phase 2: Auth and Security — JWT tokens with 15-minute access + 7-day refresh, bcrypt password hashing, AES-256-GCM encryption for LLM API keys (BYOK pattern), and Redis sliding window rate limiting.

If you're building along: the link to the full code repo will be added soon.


Building this project phase by phase. Every decision explained. Follow along for Phase 2.
