Taking InternFlow from Docker Compose on a laptop to a production server exposed problems that no tutorial prepares you for.
Deploying a single web application is straightforward. You push code, a server runs it, done.
Deploying seven interconnected services — with AI models, background crawlers, databases, and a reverse proxy — is an entirely different challenge.
The production stack
          [ Internet ]
               │
 [ Nginx reverse proxy + SSL ]
               │
┌──────────────┴──────────────┐
│   Next.js   │  FastAPI API  │
├─────────────┴───────────────┤
│  AI Service  │  RAG Service │
│         Job Crawler         │
├─────────────────────────────┤
│   PostgreSQL   │    Redis   │
│ Docker volumes (persistent) │
└─────────────────────────────┘
Every layer had at least one thing that surprised me.
Lesson 1: Separate infrastructure from application logic
The biggest mistake I made early on was mixing infrastructure configuration with application code. Configuration values were hardcoded in places instead of being read from environment variables. Database connection strings were duplicated across services.
The fix was treating infrastructure as a completely separate concern:
# docker-compose.yml: environment from files, not hardcoded
services:
  api:
    env_file:
      - .env.production
    depends_on:
      postgres:
        condition: service_healthy

# .env.production: never committed to git
DATABASE_URL=postgresql://user:pass@postgres:5432/internflow
REDIS_URL=redis://redis:6379
SECRET_KEY=your-secret-here
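On the application side, each service then only needs to read those variables at startup. Here is a minimal sketch of what that could look like. The Settings class and load_settings function are hypothetical names for illustration, not InternFlow's actual code, and the only assumption is that the three variables from .env.production are required.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# The three keys defined in .env.production above.
REQUIRED_KEYS = ("DATABASE_URL", "REDIS_URL", "SECRET_KEY")


@dataclass(frozen=True)
class Settings:
    """Hypothetical settings container; one instance per process."""
    database_url: str
    redis_url: str
    secret_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from the environment, failing fast on missing keys."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        # Crash at startup with a clear message instead of failing on the first request.
        raise RuntimeError(
            f"missing required environment variables: {', '.join(missing)}"
        )
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        secret_key=env["SECRET_KEY"],
    )
```

Calling load_settings() once at startup means a missing variable shows up as one obvious error in the logs rather than as a confusing failure deep inside a request handler.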
Lesson 2: Logs are your only debugger in production
Before: A service fails silently. No error surfaces. The user sees a blank response. I SSH into the server and run docker ps, only to find that a container has been OOM-killed with no trace of why.
After: Structured logging with timestamps and service names. Every request logged with duration. Every error logged with full stack trace and context.
import logging
import time

logger = logging.getLogger(__name__)

async def generate_resume(repo_id: str):
    start = time.perf_counter()
    logger.info("[resume] starting generation for repo=%s", repo_id)
    try:
        # `pipeline` is the resume-generation pipeline, defined elsewhere in the service
        result = await pipeline.run(repo_id)
        logger.info("[resume] completed repo=%s duration=%.2fs", repo_id, time.perf_counter() - start)
        return result
    except Exception as e:
        # exc_info=True attaches the full stack trace to the log record
        logger.error("[resume] failed repo=%s error=%s", repo_id, e, exc_info=True)
        raise
Good logs are not a nice-to-have. In a multi-service architecture, they're the only way to know what your system is doing.
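To get the timestamps and service names into every line without repeating them in each message, one option is a custom formatter on the root logger. The sketch below uses only the standard library. JsonFormatter and configure_logging are hypothetical names, and the JSON field names are my own choice rather than any standard.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "service": self.service,
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the stack trace when the call passed exc_info=True.
            payload["traceback"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(service: str, level: int = logging.INFO) -> None:
    """Route all logging through one handler that tags lines with the service name."""
    handler = logging.StreamHandler()  # writes to stderr, which `docker logs` captures
    handler.setFormatter(JsonFormatter(service))
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

Each service would call something like configure_logging("ai-service") once at startup. One JSON object per line is easy to grep and easy to feed into a log aggregator later.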
Lesson 3: Health checks are load-bearing
Docker's healthcheck directive sounds like a monitoring nicety. In practice it's critical for startup ordering.
Services come up in unpredictable order. A FastAPI service that starts before PostgreSQL is ready will crash on first database connection — silently, in a way that looks like the application is broken when it's actually just a race condition.
services:
  postgres:
    image: postgres:15
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  api:
    depends_on:
      postgres:
        condition: service_healthy  # waits for postgres to pass health check
This single change eliminated an entire class of random startup failures.
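The health check handles ordering at startup. A complementary guard, which I'm adding here as an illustration rather than something the Compose file requires, is for the service itself to retry its dependencies before giving up. In this sketch tcp_ready and wait_until are hypothetical helpers, and an open TCP port is a weaker signal than pg_isready, so treat it as a rough approximation.

```python
import socket
import time
from typing import Callable


def tcp_ready(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until(
    check: Callable[[], bool],
    retries: int = 5,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() up to `retries` times, backing off linearly between attempts."""
    for attempt in range(1, retries + 1):
        if check():
            return True
        if attempt < retries:
            sleep(delay * attempt)  # waits 1x, 2x, 3x ... the base delay
    return False
```

At startup the API could run wait_until(lambda: tcp_ready("postgres", 5432)) and exit with a clear log message if it returns False, so a dependency outage is easy to tell apart from an application bug.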
Lesson 4: Persistent storage requires deliberate planning
The FAISS vector indexes are large. Model weights are large. PostgreSQL data must survive container restarts.
Early on I had all of this in ephemeral container filesystems. One docker compose down wiped everything — including indexed repositories.
volumes:
  postgres_data:
  faiss_indexes:
  model_cache:

services:
  postgres:
    volumes:
      - postgres_data:/var/lib/postgresql/data

  rag_service:
    volumes:
      - faiss_indexes:/app/indexes
      - model_cache:/app/models
Named volumes persist across docker compose down and up, as long as you don't pass -v (or --volumes), which deletes them. Pair this with a nightly backup script and you have a real persistence story.
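A nightly backup script can be quite small. The following is a sketch under stated assumptions, not the script I actually run: the function names are hypothetical, dumps are assumed to be named <database>-<YYYY-MM-DD>.dump, and each backup directory is assumed to hold dumps for a single database.

```python
from pathlib import Path
from typing import Iterable, List


def dump_command(database: str, user: str, service: str = "postgres") -> List[str]:
    """Command that streams a custom-format pg_dump of `database` to stdout.

    -T disables TTY allocation so the binary output can be redirected to a file.
    """
    return ["docker", "compose", "exec", "-T", service,
            "pg_dump", "-U", user, "-Fc", database]


def stale_backups(paths: Iterable[Path], keep: int = 7) -> List[Path]:
    """Return every dump except the newest `keep`, oldest first.

    Relies on ISO dates in the filename sorting chronologically.
    """
    ordered = sorted(paths, key=lambda p: p.name)
    return ordered[:-keep] if keep > 0 else ordered


def prune_old_backups(backup_dir: Path, keep: int = 7) -> List[Path]:
    """Delete stale dumps in backup_dir and return the deleted paths."""
    stale = stale_backups(backup_dir.glob("*.dump"), keep)
    for path in stale:
        path.unlink()
    return stale
```

A cron job would run the command from dump_command with subprocess.run, redirect its stdout to the dated file, and then call prune_old_backups. Copying the dumps off the server is a separate step that this sketch does not cover.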
Lesson 5: Nginx config is part of your application
I treated Nginx as an afterthought — something I'd configure "at the end." That was wrong.
server {
    listen 80;
    server_name intern-flow.in;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    server_name intern-flow.in;

    ssl_certificate /etc/letsencrypt/live/intern-flow.in/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/intern-flow.in/privkey.pem;

    # Frontend
    location / {
        proxy_pass http://frontend:3000;
        proxy_set_header Host $host;
    }

    # API
    location /api/ {
        proxy_pass http://api:8000;
        proxy_read_timeout 120s;  # AI endpoints take longer
    }
}
Note the proxy_read_timeout 120s for AI endpoints. The Nginx default is 60 seconds, measured between two successive reads from the upstream. AI generation regularly ran longer than that without sending anything back, which caused cryptic 504 errors that took me hours to trace.
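One way to make these failures less cryptic is to give the application its own deadline, slightly shorter than the proxy's, so the API can return a clear error before Nginx gives up. This is a sketch of that idea with asyncio. with_deadline, UpstreamTimeout, and the 10-second margin are all assumptions for illustration, and it applies to endpoints that send nothing until generation finishes.

```python
import asyncio
from typing import Awaitable, TypeVar

T = TypeVar("T")

PROXY_READ_TIMEOUT_S = 120.0  # matches proxy_read_timeout in the Nginx config
APP_TIMEOUT_S = PROXY_READ_TIMEOUT_S - 10.0  # assumed safety margin


class UpstreamTimeout(Exception):
    """Raised when a slow operation exceeds the application's own deadline."""


async def with_deadline(work: Awaitable[T], timeout: float = APP_TIMEOUT_S) -> T:
    """Await `work`, cancelling it and raising UpstreamTimeout after `timeout` seconds."""
    try:
        return await asyncio.wait_for(work, timeout=timeout)
    except asyncio.TimeoutError:
        raise UpstreamTimeout(f"operation exceeded {timeout:.0f}s") from None
```

A handler could wrap the slow call as await with_deadline(pipeline.run(repo_id)) and translate UpstreamTimeout into an explicit error response, which is far easier to trace than a bare 504 from the proxy.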
The key lessons, distilled
- Infrastructure is application code: treat Dockerfiles, Nginx configs, and Compose files with the same care as your Python or TypeScript
- Observability before features: logging and health checks should be set up before any feature goes to production
- Assume services will fail: design for clean, fast restarts instead of counting on nothing ever going down
- Persistence is separate from compute: containers are disposable, data is not
- Read the timeout docs: every proxy, every client, every service has a default timeout that will surprise you in production
In Part 4, I'll cover where InternFlow is going next — beyond an internship tool into a full developer career platform.
InternFlow helps B.Tech and engineering students turn their GitHub projects into internship offers. AI code reviews on every commit, ATS resume generation, and daily internship listings.
→ Try it free at intern-flow.in