How we went from "ask someone for the .env" to one-click local development for our entire microservice stack.
If you're running microservices, you've probably been here:
A new developer joins. You point them at the repos. Then begins the ritual. Clone this. Run migrations on that. Ask Slack for the latest .env. Debug why nginx isn't routing. Realize they're on Node 20 but this service needs Node 23. Spend two hours figuring out why the queue consumer isn't connecting.
Two weeks later, they write their first line of actual code.
We had 15 NestJS microservices. Each with its own repo, its own .env, its own database schema, migrations, queues, and inter-service dependencies. Every developer had their own frankensteined local setup — commented-out code, hardcoded URLs, an nginx config held together with hope.
Integration testing? People just tested directly against the shared dev database. New joiners spent their first week or two just getting things running.
I'm the CTO. I'm hands-on, but lately I only code two or three times a month. The last time I tried to pick up a feature, I had to pull the code, run migrations, ask someone for the latest env vars, debug why things weren't connecting, fix my local nginx config — and by the time I had a working setup, I'd lost half a day and gotten pulled into something else.
That weekend, I decided to fix this. For everyone. Forever.
The Problem Isn't Microservices. It's Environment Chaos.
Here's what we found when we audited our 15 repos:
1. Env variable naming was a mess. The same database connection string was called DB_HOST in one repo, DATABASE_HOST in another, and POSTGRES_HOST in a third. Some were just plural changes — QUEUE_URL vs QUEUES_URL. One service used DB_HOST_CREDENTIALS for a secondary database, another used DB_HOST_CREDENTIAL (singular). Multiply this across 15 repos and you get a combinatorial nightmare.
2. No single source of truth. Each repo had its own .env.example that was perpetually outdated. Developers copied .env files from each other over Slack. Some had AWS credentials hardcoded. Others had localhost URLs that only worked on one person's machine.
3. Node version drift. Some services were on Node 20, others on Node 23. The package.json didn't enforce this, so things would break silently.
4. AWS services in local dev. Some services connected to real AWS SQS queues locally. Others mocked them. There was no standard.
5. Nginx configuration hell. Every developer maintained their own nginx config to route between services. One person's config looked nothing like another's. New joiners spent days getting this right.
Step 1: A Shared Env Schema with Zod (The Hard Part)
The first thing we built was a centralized env schema package — a single source of truth for every environment variable across all services.
Sounds simple. It wasn't.
We had to map every .env file across 15 repos, find the overlaps, resolve the naming conflicts, and split variables into shared building blocks and service-specific schemas.
This is where AI agents saved us hours. I spawned multiple agents to do a retrospective across all repos — mapping every env variable, finding common ones, identifying naming conflicts, and generating a unified schema. What would have taken a team days of grep-and-spreadsheet work took a couple of hours.
The result: a shared npm package using Zod for runtime validation. Here's the actual pattern:
// shared.schema.ts — Reusable building blocks
import { z } from 'zod';

export const DatabaseConfigSchema = z.object({
  DB_HOST: z.string().default('localhost'),
  DB_PORT: z.coerce.number().default(5432),
  DB_USER: z.string().default('postgres'),
  DB_PASSWORD: z.string(),
  DB_NAME: z.string(),
});

export const RedisConfigSchema = z.object({
  REDIS_HOST: z.string().default('localhost'),
  REDIS_PORT: z.coerce.number().default(6379),
  REDIS_PASSWORD: z.string().optional(),
});

export const QueueConfigSchema = z.object({
  SQS_ENDPOINT: z.string().default('http://localhost:9324'),
  SQS_REGION: z.string().default('us-east-1'),
  AWS_ACCESS_KEY_ID: z.string().default('local'),
  AWS_SECRET_ACCESS_KEY: z.string().default('local'),
});

export const JWTConfigSchema = z.object({
  JWT_SECRET: z.string(),
  JWT_ACCESS_TOKEN_EXPIRY: z.string().default('15m'),
});

export const InterServiceAuthSchema = z.object({
  INTER_SERVICE_SECRET: z.string(),
});

// Base schema every backend service inherits
export const SharedBackendSchema = z
  .object({
    NODE_ENV: z.enum(['dev-local', 'dev', 'uat', 'production']),
    PORT: z.coerce.number(),
  })
  .merge(DatabaseConfigSchema)
  .merge(RedisConfigSchema)
  .merge(JWTConfigSchema);
Each service composes its schema from these shared blocks:
// services/payments.schema.ts
export const PaymentsEnvSchema = SharedBackendSchema
  .merge(QueueConfigSchema)
  .merge(InterServiceAuthSchema)
  .merge(
    z.object({
      PAYMENT_PROVIDER_API_KEY: z.string(),
      PAYMENT_ENCRYPTION_KEY: z.string(),
      WEBHOOK_SIGNING_SECRET: z.string(),
    }),
  );
The key insight: composition via .merge(). When we renamed DATABASE_HOST to DB_HOST, we only changed it in one place. Every service that imports DatabaseConfigSchema gets the fix automatically.
We published this as an internal npm package. Each service declares it as a dependency and validates on startup:
// Any service's index.ts
import { validateEnv } from '@company/env-schema';
const env = validateEnv('payments');
// Throws with clear error messages if anything is missing
// Returns a frozen, type-safe env object
Environment-aware strictness was crucial. In dev-local mode, missing optional vars log warnings but don't block startup — so developers can run just the services they need. In dev, uat, and production, missing required vars call process.exit(1). No silent failures in deployed environments.
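The real validateEnv is built on the Zod schemas above, but the strictness rule itself is simple. Here is a minimal sketch of just that rule — the VarSpec shape and function body are illustrative, not the package's actual implementation:

```typescript
// Simplified sketch of environment-aware strictness. The real validateEnv
// works on Zod schemas; VarSpec here is a stand-in for illustration.
type VarSpec = { required: boolean; default?: string };

export function validateEnv(
  schema: Record<string, VarSpec>,
  env: Record<string, string | undefined>,
): Readonly<Record<string, string>> {
  const missing: string[] = [];
  const resolved: Record<string, string> = {};

  for (const [key, spec] of Object.entries(schema)) {
    const value = env[key] ?? spec.default;
    if (value === undefined) {
      if (spec.required) missing.push(key);
    } else {
      resolved[key] = value;
    }
  }

  if (missing.length > 0) {
    if (env.NODE_ENV === 'dev-local') {
      // Local dev: warn but keep going, so partial stacks still boot.
      console.warn(`Missing env vars (dev-local, continuing): ${missing.join(', ')}`);
    } else {
      // Deployed environments: fail fast with a clear error.
      console.error(`Missing required env vars: ${missing.join(', ')}`);
      process.exit(1);
    }
  }

  return Object.freeze(resolved);
}
```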
Step 2: Auto-Generate .env Files (The CLI)
Having a schema is useless if developers still have to manually create .env files. So we built a CLI that generates them:
# Generate .env for all 15 services
npx env-schema init --all --base-path ~/code
# Generate for a single service
npx env-schema init --service payments
# Preview without writing
npx env-schema init --service payments --stdout
The generator:
- Fills in safe local defaults (localhost URLs, local Redis passwords, sandbox API keys)
- Reuses shared secrets across services (same JWT secret, same inter-service auth token)
- Comments out optional fields so developers know they exist
- Is idempotent — safe to re-run, merges new keys without overwriting existing values
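The core of the generator is mechanical: walk the schema, emit defaults, and comment out anything optional. A hedged sketch — the real CLI reads the Zod registry, but this SpecEntry shape shows the idea:

```typescript
// Sketch of the .env generator's rendering step. SpecEntry is illustrative;
// the actual CLI derives this information from the Zod schemas.
type SpecEntry = { default?: string; optional?: boolean };

export function renderEnvFile(schema: Record<string, SpecEntry>): string {
  return Object.entries(schema)
    .map(([key, { default: def = '', optional }]) =>
      // Optional vars are emitted commented out, so developers can see
      // they exist without being forced to set them.
      optional ? `# ${key}=${def}` : `${key}=${def}`,
    )
    .join('\n');
}
```

Idempotency then comes from parsing the existing .env first and only appending keys that aren't already present.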
No more Slack messages asking "can someone send me the .env for the notification service?"
Step 3: Prevent Future Drift (The Regex Scanner)
Fixing the current mess was one thing. Preventing it from coming back was another.
We built a drift checker that scans source code for process.env references and compares them against the schema registry:
// check-drift.ts — simplified version
import * as fs from 'fs';

function extractEnvVars(filePath: string): string[] {
  const content = fs.readFileSync(filePath, 'utf-8');
  const matches = [
    ...content.matchAll(/process\.env\.(\w+)/g),
    ...content.matchAll(/process\.env\['(\w+)'\]/g),
  ];
  return matches.map((m) => m[1]);
}

function checkDrift(serviceId: string) {
  const schemaKeys = Object.keys(schemaRegistry[serviceId].shape);
  const codeKeys = walkDir('src/').flatMap(extractEnvVars);
  const unregistered = codeKeys.filter(
    (k) => !schemaKeys.includes(k) && !IGNORED_VARS.includes(k),
  );
  if (unregistered.length > 0) {
    console.error(`Env drift detected! Unregistered vars: ${unregistered.join(', ')}`);
    process.exit(1);
  }
}
This runs as:
- Pre-commit hook — blocks commits with unregistered env vars
- CI check — PRs can't merge if drift is detected
- Pre-startup check — each service runs npm run check-env before starting
// package.json of any service
{
  "scripts": {
    "check-env": "npx env-schema check payments",
    "check-infra": "npx env-schema infra",
    "start:local": "npm run check-env && npm run check-infra && cross-env NODE_ENV=dev-local tsnd src/index.ts"
  }
}
We know how teams work. Lint rules get ignored, pre-commit hooks get bypassed with --no-verify. That's why the same check runs in CI. The PR won't merge if there's env drift. No exceptions.
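Wired into CI, the check is one small job. A hypothetical example using GitHub Actions syntax (your CI and repo names will differ):

```yaml
# .github/workflows/env-drift.yml — illustrative; adapt to your CI system
name: env-drift
on: pull_request
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: .nvmrc
      - run: npm ci
      # Same command developers run locally — one source of truth
      - run: npx env-schema check ${{ github.event.repository.name }}
```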
Step 4: Kill Nginx with Traefik
This was the game-changer.
Every developer had a custom nginx config to route API calls between services locally. /api/payments -> port 3001, /api/users -> port 3002, and so on. When a new service was added, everyone had to update their nginx config manually. Nobody's config was the same.
We replaced all of it with Traefik v3.
Traefik is a reverse proxy that auto-discovers services. We use a file-based dynamic provider that watches a config directory for changes — hot reload, no restart needed.
# docker-compose.yml
services:
  traefik:
    image: traefik:v3.0
    ports:
      - "9090:9090" # API Gateway
      - "8080:8080" # Dashboard
    volumes:
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml
      - ./traefik/dynamic:/etc/traefik/dynamic # Hot-reload configs
    networks:
      - app-network
No more per-developer nginx configs. One shared Traefik config in the repo. Add a new service? Add 5 lines to services.yml. Traefik picks it up automatically via hot reload. Everyone gets the same routing.
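For the file-based dynamic provider, those few lines per service look roughly like this (router rule, service name, and port are illustrative):

```yaml
# traefik/dynamic/services.yml — illustrative entry for one service
http:
  routers:
    payments:
      rule: "PathPrefix(`/api/payments`)"
      service: payments
  services:
    payments:
      loadBalancer:
        servers:
          - url: "http://host.docker.internal:3001"
```

Because Traefik watches the directory, saving this file is enough — the route is live without restarting anything.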
The dashboard at localhost:8080 gives you a visual map of every route, middleware, and service — something nginx never offered out of the box.
Step 5: One Command to Rule Them All
With the env schema, Traefik, and local service mocking in place, we built the orchestration layer.
Bootstrap for new developers — a single script that handles everything from zero:
# New developer runs this on day one
./bootstrap.sh
This 10-step wizard:
- Checks prerequisites (git, Docker, Node.js, VS Code)
- Collects git identity
- Configures workspace directory
- Clones all 15 repos in parallel (4 concurrent)
- Sets up git config in each repo
- Configures npm registry for private packages
- Runs npm install in parallel (3 concurrent)
- Generates all .env files from the shared schema
- Provisions infrastructure (Docker containers, databases, migrations)
- Installs the VS Code extension + generates workspace file
For existing developers — the daily startup:
npm run start
--- Infrastructure Check ---
[OK] PostgreSQL is responding on port 5432
[OK] Redis/Valkey is responding on port 6379
[OK] ElasticMQ (SQS) is responding on port 9324
[OK] Traefik (API Gateway) is responding on port 9090
Select services to start (SPACE=toggle, A=all, N=none, ENTER=confirm):
The infrastructure check does TCP port scanning with 2-second timeouts. If something's down, it offers to auto-start it via Docker. Then you select which services you need.
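A TCP liveness probe like this is a few lines of Node. A minimal sketch (function name and timeout are illustrative, not the script's actual code):

```typescript
import * as net from 'net';

// Check whether a TCP port is accepting connections, with a timeout.
// Resolves true on connect, false on timeout or connection error.
export function isPortOpen(
  port: number,
  host = 'localhost',
  timeoutMs = 2000,
): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    const done = (result: boolean) => {
      socket.destroy(); // always clean up the socket
      resolve(result);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => done(true));
    socket.once('timeout', () => done(false));
    socket.once('error', () => done(false));
  });
}
```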
Smart terminal detection — the startup script auto-detects your terminal and adapts:
- tmux: Grid layout with split panes
- iTerm2: Native AppleScript-driven split panes (up to 8 per tab)
- Terminal.app: Opens tabs per service
- Fallback: Color-coded concurrent output in a single terminal
Each service gets a color-coded label. Health monitoring polls every service in real-time — green when healthy, yellow when starting, red when unhealthy.
The Team Took It Further
I built the core over a weekend and handed it to my tech lead. "Check and deploy," I said.
What they shipped blew me away. They didn't just deploy it — they built a VS Code extension on top:
A welcome page with a 5-step onboarding flow:
- Run Preflight Checks -> Start Your First Service -> Manage Branches -> Explore Utilities -> Keyboard Shortcuts
A Services Dashboard (Cmd+Alt+S):
- Init All Envs, Start All (Dev), Start All (Build), Stop All
- Real-time status: 0/15 running | 0/15 healthy | 0 missing env
A Preflight Diagnostics panel (Cmd+Alt+P):
- A dependency graph visualization — 237 checks passing across all services
- Shows which services depend on which, what infrastructure they need
A Branch Manager (Cmd+Alt+B):
- View and switch branches across all 15 repos from one UI
- No more cd-ing into each repo to check what branch you're on
A Web Portal (Vue 3 + Vite):
- Swagger UI aggregator for all service APIs
- ElasticMQ queue inspector
- Real-time service status monitoring
Now anyone — including our product managers — can run all 15 services with millions of lines of code in under 5 minutes. They can test features end-to-end on their local machine. They ask AI to check if a design is practical. They run the code and see for themselves.
A new developer's first day? Clone, click, code. Not clone, cry, configure.
How to Avoid This at Your Startup (Before It's Too Late)
If you're at 3-5 microservices, here's what to do now before it becomes a 15-service nightmare:
1. Start with a shared env schema from day one. Use Zod (or Joi, or JSON Schema). Even with 2 services, standardize your variable names. DB_HOST everywhere, not DATABASE_HOST in some and POSTGRES_HOST in others. Compose shared blocks with .merge() so naming changes propagate automatically.
2. Pin your runtimes. .nvmrc + engines in package.json. Enforce in CI. It takes 5 minutes and saves weeks of debugging.
3. Mock external services locally. Use ElasticMQ instead of real SQS, MinIO instead of real S3. Your env schema should auto-switch endpoints based on NODE_ENV=dev-local.
4. Use Traefik instead of nginx from the start. File-based dynamic provider + hot reload beats editing nginx.conf every time a service changes. Your future self will thank you.
5. Add env drift detection to CI. A regex scanner that checks process.env references against your schema catches problems before they spread. Run it in pre-commit hooks AND CI — belt and suspenders.
6. Invest in the "first 5 minutes" experience. If a new developer can't run your entire stack in 5 minutes, you have a problem. It will only get worse. Build a bootstrap script. Make it idempotent. Make it parallel.
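Tip 3 in code form — a hypothetical helper that picks the queue endpoint from NODE_ENV (the port and function name are illustrative):

```typescript
// Hypothetical endpoint switch: local mocks in dev-local, real AWS elsewhere.
export function queueEndpoint(nodeEnv: string): string | undefined {
  // ElasticMQ speaks the SQS wire protocol on a local port; returning
  // undefined lets the AWS SDK resolve the real regional endpoint.
  return nodeEnv === 'dev-local' ? 'http://localhost:9324' : undefined;
}
```

The same pattern applies to S3-compatible mocks like MinIO: one function, one place to change, no per-service conditionals.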
The Before and After
| | Before | After |
|---|---|---|
| Onboarding | 1-2 weeks | 5 minutes |
| Env setup | Ask on Slack, copy-paste | Auto-generated from schema |
| Env validation | Crash at runtime | Fail fast on startup with clear errors |
| Routing | Manual nginx per developer | Traefik with hot-reload config |
| Integration testing | Against shared dev DB | Full local stack, end-to-end |
| Starting services | Manual, per-service, per-developer | One command, interactive selection |
| Node version | Whatever was installed | Pinned in .nvmrc, enforced in CI |
| New service added | Update everyone's nginx, share new .env | Add 5 lines to Traefik config, schema auto-generates .env |
| Drift prevention | None (hope-based) | Pre-commit + CI drift checks |
| Who can run the stack | Senior devs only | Anyone, including PMs |
The tools matter less than the principle: your local development environment is a product. Treat it like one. Your developers are the users. If the onboarding experience is painful, every day after that is a little painful too.
We used NestJS, TypeScript, Zod, Traefik, ElasticMQ, Docker, and VS Code. You might use different tools. The pattern is the same: centralize config, validate on startup, auto-generate defaults, prevent drift, make it one click.
Build it once. Fix it for everyone. Forever.
I'm Arun, CTO at a fintech startup. We're a team of 15 engineers in India building payment infrastructure for the UK. I write about the messy reality of scaling engineering teams and systems. Find me on X @mickyarun.