Building InternFlow (Part 3): Lessons Learned Deploying a Multi-Service AI Application

#devops #docker #nginx #tutorial

Taking InternFlow from Docker Compose on a laptop to a production server exposed problems that no tutorial prepares you for.

Deploying a single web application is straightforward. You push code, a server runs it, done.

Deploying seven interconnected services — with AI models, background crawlers, databases, and a reverse proxy — is an entirely different challenge.

The production stack

[ Internet ]
     │
[ Nginx reverse proxy + SSL ]
     │
┌────┴────────────────────────┐
│  Next.js  │  FastAPI API    │
├───────────┴─────────────────┤
│  AI Service │ RAG Service   │
│  Job Crawler                │
├─────────────────────────────┤
│  PostgreSQL │ Redis         │
│  Docker volumes (persistent)│
└─────────────────────────────┘

Every layer had at least one thing that surprised me.

Lesson 1: Separate infrastructure from application logic

The biggest mistake I made early on was mixing infrastructure configuration with application code. Values that belonged in environment variables were hardcoded in places. Database connection strings were duplicated across services, so changing one meant hunting down every copy.

The fix was treating infrastructure as a completely separate concern:

# docker-compose.yml — environment from files, not hardcoded
services:
  api:
    env_file:
      - .env.production
    depends_on:
      postgres:
        condition: service_healthy

# .env.production — never committed to git
DATABASE_URL=postgresql://user:pass@postgres:5432/internflow
REDIS_URL=redis://redis:6379
SECRET_KEY=your-secret-here
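On the application side, the same separation can be enforced by reading configuration in exactly one place and failing fast when something is missing. The sketch below is one way to do that, not InternFlow's actual code: the `Settings` class and `load_settings` function are hypothetical names, and only the three variable names come from the .env.production file above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Immutable view of the configuration a service needs."""
    database_url: str
    redis_url: str
    secret_key: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read required settings from the environment, failing fast if any is missing."""
    env = os.environ if env is None else env
    required = ("DATABASE_URL", "REDIS_URL", "SECRET_KEY")
    # Treat empty strings as missing so a blank line in the env file is caught too.
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        secret_key=env["SECRET_KEY"],
    )
```

Calling `load_settings()` once at startup means a misconfigured container dies immediately with a readable message instead of failing on its first request.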

Lesson 2: Logs are your only debugger in production

Before: Service fails silently. No error surfaces. User sees a blank response. I SSH into the server and run docker ps to find a container has been OOM-killed with no trace of why.

After: Structured logging with timestamps and service names. Every request logged with duration. Every error logged with full stack trace and context.

import logging
import time

logger = logging.getLogger(__name__)

async def generate_resume(repo_id: str):
    # perf_counter is monotonic, so durations are unaffected by wall-clock changes
    start = time.perf_counter()
    logger.info("[resume] starting generation for repo=%s", repo_id)

    try:
        # `pipeline` is the application's generation pipeline, defined elsewhere
        result = await pipeline.run(repo_id)
        logger.info("[resume] completed repo=%s duration=%.2fs", repo_id, time.perf_counter() - start)
        return result
    except Exception:
        # logger.exception logs at ERROR level and appends the full stack trace
        logger.exception("[resume] failed repo=%s", repo_id)
        raise

Good logs are not a nice-to-have. In a multi-service architecture, they're the only way to know what your system is doing.
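To get the timestamps and service names mentioned above onto every line, one option is a small JSON formatter installed on the root logger. This is a minimal sketch using only the standard library; `JsonFormatter`, `configure_logging`, and the field names are illustrative choices, not something the original setup is known to use.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "service": self.service,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Include the stack trace when the record carries exception info.
        if record.exc_info:
            entry["traceback"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def configure_logging(service: str, level: int = logging.INFO) -> None:
    """Send all logs to stdout in JSON, tagged with the service name."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter(service))
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

Each service would call something like `configure_logging("ai-service")` once at startup. Writing to stdout keeps the output visible through `docker logs`, and one JSON object per line is easy to filter by service when several containers are interleaved.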

Lesson 3: Health checks are load-bearing

Docker's healthcheck directive sounds like a monitoring nicety. In practice it's critical for startup ordering.

By default, Compose waits only for a dependency's container to start, not for the service inside it to be ready. A FastAPI service that starts before PostgreSQL accepts connections will crash on its first database connection, and the failure looks like a broken application when it's actually just a race condition.

services:
  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  api:
    depends_on:
      postgres:
        condition: service_healthy  # waits for postgres to pass health check

This single change eliminated an entire class of random startup failures.
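The health check covers startup ordering, but it does not help if PostgreSQL restarts while the API is already running. A complementary measure is to retry the connection with backoff inside the application. The helper below is a hedged sketch: `retry_async` and its parameters are hypothetical, and it retries only on `OSError` (which includes `ConnectionError`), so a real database driver's own exception types would need to be passed in via `retry_on`.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def retry_async(operation, attempts=5, base_delay=0.5, max_delay=8.0,
                      retry_on=(OSError,)):
    """Await operation(), retrying with exponential backoff on the given errors."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except retry_on as exc:
            # Out of attempts: surface the last error to the caller.
            if attempt == attempts:
                raise
            # Double the delay each time, capped at max_delay.
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            logger.warning(
                "attempt %d/%d failed (%s); retrying in %.1fs",
                attempt, attempts, exc, delay,
            )
            await asyncio.sleep(delay)
```

Usage would look like `await retry_async(connect_to_db)`, where `connect_to_db` stands in for whatever coroutine function opens the connection.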

Lesson 4: Persistent storage requires deliberate planning

The FAISS vector indexes are large. Model weights are large. PostgreSQL data must survive container restarts.

Early on I had all of this in ephemeral container filesystems. One docker compose down wiped everything — including indexed repositories.

volumes:
  postgres_data:
  faiss_indexes:
  model_cache:

services:
  postgres:
    volumes:
      - postgres_data:/var/lib/postgresql/data

  rag_service:
    volumes:
      - faiss_indexes:/app/indexes
      - model_cache:/app/models

Named volumes persist across docker compose down and up, unless you pass -v (--volumes) to down, which deletes them. Pair this with a nightly backup script and you have a real persistence story.
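As a rough idea of what such a backup script might look like for the index directory, here is a sketch that writes a timestamped archive and keeps only the newest few. The function name, the retention count, and the paths in the usage note are assumptions. It is suitable for file-based data such as the FAISS indexes; for PostgreSQL, a logical dump with pg_dump is the safer route than copying a live data directory.

```python
import tarfile
from datetime import datetime, timezone
from pathlib import Path


def backup_directory(source: Path, dest_dir: Path, keep: int = 7) -> Path:
    """Archive `source` into dest_dir as a timestamped .tar.gz and prune old archives."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = dest_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the source folder.
        tar.add(source, arcname=source.name)
    # The timestamp format sorts lexicographically, so this is oldest first.
    archives = sorted(dest_dir.glob(f"{source.name}-*.tar.gz"))
    for old in archives[:-keep]:
        old.unlink()
    return archive
```

A cron job could then call `backup_directory(Path("/app/indexes"), Path("/backups"))` each night, assuming `/backups` is a hypothetical mount that lives outside the volume being archived.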

Lesson 5: Nginx config is part of your application

I treated Nginx as an afterthought — something I'd configure "at the end." That was wrong.

server {
    listen 80;
    server_name intern-flow.in;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name intern-flow.in;

    ssl_certificate /etc/letsencrypt/live/intern-flow.in/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/intern-flow.in/privkey.pem;

    # Frontend
    location / {
        proxy_pass http://frontend:3000;
        proxy_set_header Host $host;
    }

    # API
    location /api/ {
        proxy_pass http://api:8000;
        proxy_read_timeout 120s;  # AI endpoints take longer
    }
}

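Note the proxy_read_timeout 120s for AI endpoints. The directive defaults to 60 seconds, measured between two successive reads from the upstream rather than over the whole response. AI generation regularly went longer than that without sending anything back, which caused cryptic 504 errors that took me hours to trace.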
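Raising the proxy timeout works best when the application enforces its own, slightly shorter deadline, so a slow generation ends in a clear application error rather than a 504 from Nginx. The sketch below shows one way to do this with `asyncio.wait_for`. The `GenerationTimeout` exception and the `run_with_deadline` helper are hypothetical names, and any specific deadline (say, 110 seconds against the 120-second proxy setting) is an assumption to tune.

```python
import asyncio


class GenerationTimeout(Exception):
    """Raised when a generation call exceeds its application-level deadline."""


async def run_with_deadline(coro, seconds: float):
    """Await coro; cancel it and raise GenerationTimeout if it runs past `seconds`."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the underlying task at this point.
        raise GenerationTimeout(f"generation exceeded {seconds}s") from None
```

An endpoint could wrap its slow call as `await run_with_deadline(pipeline.run(repo_id), 110)` and translate `GenerationTimeout` into an explicit error response.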

The key lessons, distilled

Infrastructure is application code — treat Dockerfiles, Nginx configs, and Compose files with the same care as your Python or TypeScript
Observability before features — logging and health checks should be set up before any feature goes to production
Assume services will fail and design for clean restarts, rather than counting on every service staying up
Persistence is separate from compute — containers are disposable, data is not
Read the timeout docs — every proxy, every client, every service has a default timeout that will surprise you in production

In Part 4, I'll cover where InternFlow is going next — beyond an internship tool into a full developer career platform.

InternFlow helps B.Tech and engineering students turn their GitHub projects into internship offers. AI code reviews on every commit, ATS resume generation, and daily internship listings.

→ Try it free at intern-flow.in