How we went from "ask someone for the .env" to one-click local development for our entire microservice stack.
If you're running microservices, you've probably been here:
A new developer joins. You point them at the repos. Then begins the ritual. Clone this. Run migrations on that. Ask Slack for the latest .env. Debug why nginx isn't routing. Realize they're on Node 20 but this service needs Node 23. Spend two hours figuring out why the queue consumer isn't connecting.
Two weeks later, they write their first line of actual code.
We had 15 NestJS microservices. Each with its own repo, its own .env, its own database schema, migrations, queues, and inter-service dependencies. Every developer had their own frankensteined local setup — commented-out code, hardcoded URLs, an nginx config held together with hope.
Integration testing? People just tested directly against the shared dev database. New joiners spent their first week or two just getting things running.
I'm the CTO. I'm hands-on, but lately I only code two or three times a month. The last time I tried to pick up a feature, I had to pull the code, run migrations, ask someone for the latest env vars, debug why things weren't connecting, fix my local nginx config — and by the time I had a working setup, I'd lost half a day and gotten pulled into something else.
That weekend, I decided to fix this. For everyone. Forever.
The Problem Isn't Microservices. It's Environment Chaos.
Here's what we found when we audited our 15 repos:
1. Env variable naming was a mess. The same database connection string was called DB_HOST in one repo, DATABASE_HOST in another, and POSTGRES_HOST in a third. Some were just plural changes — QUEUE_URL vs QUEUES_URL. One service used DB_HOST_CREDENTIALS for a secondary database, another used DB_HOST_CREDENTIAL (singular). Multiply this across 15 repos and you get a combinatorial nightmare.
2. No single source of truth. Each repo had its own .env.example that was perpetually outdated. Developers copied .env files from each other over Slack. Some had AWS credentials hardcoded. Others had localhost URLs that only worked on one person's machine.
3. Node version drift. Some services were on Node 20, others on Node 23. The package.json didn't enforce this, so things would break silently.
4. AWS services in local dev. Some services connected to real AWS SQS queues locally. Others mocked them. There was no standard.
5. Nginx configuration hell. Every developer maintained their own nginx config to route between services. One person's config looked nothing like another's. New joiners spent days getting this right.
Step 1: A Shared Env Schema with Zod (The Hard Part)
The first thing we built was a centralized env schema package — a single source of truth for every environment variable across all services.
Sounds simple. It wasn't.
We had to map every .env file across 15 repos, find the overlaps, resolve the naming conflicts, and split variables into shared building blocks and service-specific schemas.
This is where AI agents saved us hours. I spawned multiple agents to do a retrospective across all repos — mapping every env variable, finding common ones, identifying naming conflicts, and generating a unified schema. What would have taken a team days of grep-and-spreadsheet work took a couple of hours.
The result: a shared npm package using Zod for runtime validation. Here's the actual pattern:
// shared.schema.ts — Reusable building blocks
import { z } from 'zod';

export const DatabaseConfigSchema = z.object({
  DB_HOST: z.string().default('localhost'),
  DB_PORT: z.coerce.number().default(5432),
  DB_USER: z.string().default('postgres'),
  DB_PASSWORD: z.string(),
  DB_NAME: z.string(),
});

export const RedisConfigSchema = z.object({
  REDIS_HOST: z.string().default('localhost'),
  REDIS_PORT: z.coerce.number().default(6379),
  REDIS_PASSWORD: z.string().optional(),
});

export const QueueConfigSchema = z.object({
  SQS_ENDPOINT: z.string().default('http://localhost:9324'),
  SQS_REGION: z.string().default('us-east-1'),
  AWS_ACCESS_KEY_ID: z.string().default('local'),
  AWS_SECRET_ACCESS_KEY: z.string().default('local'),
});

export const JWTConfigSchema = z.object({
  JWT_SECRET: z.string(),
  JWT_ACCESS_TOKEN_EXPIRY: z.string().default('15m'),
});

export const InterServiceAuthSchema = z.object({
  INTER_SERVICE_SECRET: z.string(),
});

// Base schema every backend service inherits
export const SharedBackendSchema = z
  .object({
    NODE_ENV: z.enum(['dev-local', 'dev', 'uat', 'production']),
    PORT: z.coerce.number(),
  })
  .merge(DatabaseConfigSchema)
  .merge(RedisConfigSchema)
  .merge(JWTConfigSchema);
Each service composes its schema from these shared blocks:
// services/payments.schema.ts
export const PaymentsEnvSchema = SharedBackendSchema
  .merge(QueueConfigSchema)
  .merge(InterServiceAuthSchema)
  .merge(
    z.object({
      PAYMENT_PROVIDER_API_KEY: z.string(),
      PAYMENT_ENCRYPTION_KEY: z.string(),
      WEBHOOK_SIGNING_SECRET: z.string(),
    }),
  );
The key insight: composition via .merge(). When we renamed DATABASE_HOST to DB_HOST, we only changed it in one place. Every service that imports DatabaseConfigSchema gets the fix automatically.
We published this as an internal npm package. Each service declares it as a dependency and validates on startup:
// Any service's index.ts
import { validateEnv } from '@company/env-schema';
const env = validateEnv('payments');
// Throws with clear error messages if anything is missing
// Returns a frozen, type-safe env object
Environment-aware strictness was crucial. In dev-local mode, missing optional vars log warnings but don't block startup — so developers can run just the services they need. In dev, uat, and production, missing required vars call process.exit(1). No silent failures in deployed environments.
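The real validateEnv is built on the Zod schemas above, but the strictness rule itself is simple. Here is a minimal sketch of just that rule — the VarSpec shape and function body are illustrative, not the package's actual implementation:

```typescript
// Simplified sketch of environment-aware strictness. The real validateEnv
// works on Zod schemas; VarSpec here is a stand-in for illustration.
type VarSpec = { required: boolean; default?: string };

export function validateEnv(
  schema: Record<string, VarSpec>,
  env: Record<string, string | undefined>,
): Readonly<Record<string, string>> {
  const missing: string[] = [];
  const resolved: Record<string, string> = {};

  for (const [key, spec] of Object.entries(schema)) {
    const value = env[key] ?? spec.default;
    if (value === undefined) {
      if (spec.required) missing.push(key);
    } else {
      resolved[key] = value;
    }
  }

  if (missing.length > 0) {
    if (env.NODE_ENV === 'dev-local') {
      // Local dev: warn but keep going, so partial stacks still boot.
      console.warn(`Missing env vars (dev-local, continuing): ${missing.join(', ')}`);
    } else {
      // Deployed environments: fail fast with a clear error.
      console.error(`Missing required env vars: ${missing.join(', ')}`);
      process.exit(1);
    }
  }

  return Object.freeze(resolved);
}
```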
Step 2: Auto-Generate .env Files (The CLI)
Having a schema is useless if developers still have to manually create .env files. So we built a CLI that generates them:
# Generate .env for all 15 services
npx env-schema init --all --base-path ~/code
# Generate for a single service
npx env-schema init --service payments
# Preview without writing
npx env-schema init --service payments --stdout
The generator:
- Fills in safe local defaults (localhost URLs, local Redis passwords, sandbox API keys)
- Reuses shared secrets across services (same JWT secret, same inter-service auth token)
- Comments out optional fields so developers know they exist
- Is idempotent — safe to re-run, merges new keys without overwriting existing values
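The core of the generator is mechanical: walk the schema, emit defaults, and comment out anything optional. A hedged sketch — the real CLI reads the Zod registry, but this SpecEntry shape shows the idea:

```typescript
// Sketch of the .env generator's rendering step. SpecEntry is illustrative;
// the actual CLI derives this information from the Zod schemas.
type SpecEntry = { default?: string; optional?: boolean };

export function renderEnvFile(schema: Record<string, SpecEntry>): string {
  return Object.entries(schema)
    .map(([key, { default: def = '', optional }]) =>
      // Optional vars are emitted commented out, so developers can see
      // they exist without being forced to set them.
      optional ? `# ${key}=${def}` : `${key}=${def}`,
    )
    .join('\n');
}
```

Idempotency then comes from parsing the existing .env first and only appending keys that aren't already present.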
No more Slack messages asking "can someone send me the .env for the notification service?"
Step 3: Prevent Future Drift (The Regex Scanner)
Fixing the current mess was one thing. Preventing it from coming back was another.
We built a drift checker that scans source code for process.env references and compares them against the schema registry:
// check-drift.ts — simplified version
import * as fs from 'fs';

function extractEnvVars(filePath: string): string[] {
  const content = fs.readFileSync(filePath, 'utf-8');
  const matches = [
    ...content.matchAll(/process\.env\.(\w+)/g),
    ...content.matchAll(/process\.env\['(\w+)'\]/g),
  ];
  return matches.map((m) => m[1]);
}

function checkDrift(serviceId: string) {
  const schemaKeys = Object.keys(schemaRegistry[serviceId].shape);
  const codeKeys = walkDir('src/').flatMap(extractEnvVars);
  const unregistered = codeKeys.filter(
    (k) => !schemaKeys.includes(k) && !IGNORED_VARS.includes(k),
  );
  if (unregistered.length > 0) {
    console.error(`Env drift detected! Unregistered vars: ${unregistered.join(', ')}`);
    process.exit(1);
  }
}
This runs as:
- Pre-commit hook — blocks commits with unregistered env vars
- CI check — PRs can't merge if drift is detected
- Pre-startup check — each service runs npm run check-env before starting
// package.json of any service
{
  "scripts": {
    "check-env": "npx env-schema check payments",
    "check-infra": "npx env-schema infra",
    "start:local": "npm run check-env && npm run check-infra && cross-env NODE_ENV=dev-local tsnd src/index.ts"
  }
}
We know how teams work. Lint rules get ignored, pre-commit hooks get bypassed with --no-verify. That's why the same check runs in CI. The PR won't merge if there's env drift. No exceptions.
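Wired into CI, the check is one small job. A hypothetical example using GitHub Actions syntax (your CI and repo names will differ):

```yaml
# .github/workflows/env-drift.yml — illustrative; adapt to your CI system
name: env-drift
on: pull_request
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version-file: .nvmrc
      - run: npm ci
      # Same command developers run locally — one source of truth
      - run: npx env-schema check ${{ github.event.repository.name }}
```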
Step 4: Kill Nginx with Traefik
This was the game-changer.
Every developer had a custom nginx config to route API calls between services locally. /api/payments -> port 3001, /api/users -> port 3002, and so on. When a new service was added, everyone had to update their nginx config manually. Nobody's config was the same.
We replaced all of it with Traefik v3.
Traefik is a reverse proxy that auto-discovers services. We use a file-based dynamic provider that watches a config directory for changes — hot reload, no restart needed.
# docker-compose.yml
services:
  traefik:
    image: traefik:v3.0
    ports:
      - "9090:9090" # API Gateway
      - "8080:8080" # Dashboard
    volumes:
      - ./traefik/traefik.yml:/etc/traefik/traefik.yml
      - ./traefik/dynamic:/etc/traefik/dynamic # Hot-reload configs
    networks:
      - app-network
No more per-developer nginx configs. One shared Traefik config in the repo. Add a new service? Add 5 lines to services.yml. Traefik picks it up automatically via hot reload. Everyone gets the same routing.
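For the file-based dynamic provider, those few lines per service look roughly like this (router rule, service name, and port are illustrative):

```yaml
# traefik/dynamic/services.yml — illustrative entry for one service
http:
  routers:
    payments:
      rule: "PathPrefix(`/api/payments`)"
      service: payments
  services:
    payments:
      loadBalancer:
        servers:
          - url: "http://host.docker.internal:3001"
```

Because Traefik watches the directory, saving this file is enough — the route is live without restarting anything.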
The dashboard at localhost:8080 gives you a visual map of every route, middleware, and service — something nginx never offered out of the box.
Step 5: One Command to Rule Them All
With the env schema, Traefik, and local service mocking in place, we built the orchestration layer.
Bootstrap for new developers — a single script that handles everything from zero:
# New developer runs this on day one
./bootstrap.sh
This 10-step wizard:
- Checks prerequisites (git, Docker, Node.js, VS Code)
- Collects git identity
- Configures workspace directory
- Clones all 15 repos in parallel (4 concurrent)
- Sets up git config in each repo
- Configures npm registry for private packages
- Runs npm install in parallel (3 concurrent)
- Generates all .env files from the shared schema
- Provisions infrastructure (Docker containers, databases, migrations)
- Installs the VS Code extension + generates workspace file
For existing developers — the daily startup:
npm run start
--- Infrastructure Check ---
[OK] PostgreSQL is responding on port 5432
[OK] Redis/Valkey is responding on port 6379
[OK] ElasticMQ (SQS) is responding on port 9324
[OK] Traefik (API Gateway) is responding on port 9090
Select services to start (SPACE=toggle, A=all, N=none, ENTER=confirm):
The infrastructure check does TCP port scanning with 2-second timeouts. If something's down, it offers to auto-start it via Docker. Then you select which services you need.
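A TCP liveness probe like this is a few lines of Node. A minimal sketch (function name and timeout are illustrative, not the script's actual code):

```typescript
import * as net from 'net';

// Check whether a TCP port is accepting connections, with a timeout.
// Resolves true on connect, false on timeout or connection error.
export function isPortOpen(
  port: number,
  host = 'localhost',
  timeoutMs = 2000,
): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.connect({ port, host });
    const done = (result: boolean) => {
      socket.destroy(); // always clean up the socket
      resolve(result);
    };
    socket.setTimeout(timeoutMs);
    socket.once('connect', () => done(true));
    socket.once('timeout', () => done(false));
    socket.once('error', () => done(false));
  });
}
```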
Smart terminal detection — the startup script auto-detects your terminal and adapts:
- tmux: Grid layout with split panes
- iTerm2: Native AppleScript-driven split panes (up to 8 per tab)
- Terminal.app: Opens tabs per service
- Fallback: Color-coded concurrent output in a single terminal
Each service gets a color-coded label. Health monitoring polls every service in real-time — green when healthy, yellow when starting, red when unhealthy.
The Team Took It Further
I built the core over a weekend and handed it to my tech lead. "Check and deploy," I said.
What they shipped blew me away. They didn't just deploy it — they built a VS Code extension on top:
A welcome page with a 5-step onboarding flow:
- Run Preflight Checks -> Start Your First Service -> Manage Branches -> Explore Utilities -> Keyboard Shortcuts
A Services Dashboard (Cmd+Alt+S):
- Init All Envs, Start All (Dev), Start All (Build), Stop All
- Real-time status: 0/15 running | 0/15 healthy | 0 missing env
A Preflight Diagnostics panel (Cmd+Alt+P):
- A dependency graph visualization — 237 checks passing across all services
- Shows which services depend on which, what infrastructure they need
A Branch Manager (Cmd+Alt+B):
- View and switch branches across all 15 repos from one UI
- No more cd-ing into each repo to check what branch you're on
A Web Portal (Vue 3 + Vite):
- Swagger UI aggregator for all service APIs
- ElasticMQ queue inspector
- Real-time service status monitoring
Now anyone — including our product managers — can run all 15 services with millions of lines of code in under 5 minutes. They can test features end-to-end on their local machine. They ask AI to check if a design is practical. They run the code and see for themselves.
A new developer's first day? Clone, click, code. Not clone, cry, configure.
How to Avoid This at Your Startup (Before It's Too Late)
If you're at 3-5 microservices, here's what to do now before it becomes a 15-service nightmare:
1. Start with a shared env schema from day one. Use Zod (or Joi, or JSON Schema). Even with 2 services, standardize your variable names. DB_HOST everywhere, not DATABASE_HOST in some and POSTGRES_HOST in others. Compose shared blocks with .merge() so naming changes propagate automatically.
2. Pin your runtimes. .nvmrc + engines in package.json. Enforce in CI. It takes 5 minutes and saves weeks of debugging.
3. Mock external services locally. Use ElasticMQ instead of real SQS, MinIO instead of real S3. Your env schema should auto-switch endpoints based on NODE_ENV=dev-local.
4. Use Traefik instead of nginx from the start. File-based dynamic provider + hot reload beats editing nginx.conf every time a service changes. Your future self will thank you.
5. Add env drift detection to CI. A regex scanner that checks process.env references against your schema catches problems before they spread. Run it in pre-commit hooks AND CI — belt and suspenders.
6. Invest in the "first 5 minutes" experience. If a new developer can't run your entire stack in 5 minutes, you have a problem. It will only get worse. Build a bootstrap script. Make it idempotent. Make it parallel.
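Tip 3 in code form — a hypothetical helper that picks the queue endpoint from NODE_ENV (the port and function name are illustrative):

```typescript
// Hypothetical endpoint switch: local mocks in dev-local, real AWS elsewhere.
export function queueEndpoint(nodeEnv: string): string | undefined {
  // ElasticMQ speaks the SQS wire protocol on a local port; returning
  // undefined lets the AWS SDK resolve the real regional endpoint.
  return nodeEnv === 'dev-local' ? 'http://localhost:9324' : undefined;
}
```

The same pattern applies to S3-compatible mocks like MinIO: one function, one place to change, no per-service conditionals.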
The Before and After
| | Before | After |
|---|---|---|
| Onboarding | 1-2 weeks | 5 minutes |
| Env setup | Ask on Slack, copy-paste | Auto-generated from schema |
| Env validation | Crash at runtime | Fail fast on startup with clear errors |
| Routing | Manual nginx per developer | Traefik with hot-reload config |
| Integration testing | Against shared dev DB | Full local stack, end-to-end |
| Starting services | Manual, per-service, per-developer | One command, interactive selection |
| Node version | Whatever was installed | Pinned in .nvmrc, enforced in CI |
| New service added | Update everyone's nginx, share new .env | Add 5 lines to Traefik config, schema auto-generates .env |
| Drift prevention | None (hope-based) | Pre-commit + CI drift checks |
| Who can run the stack | Senior devs only | Anyone, including PMs |
The tools matter less than the principle: your local development environment is a product. Treat it like one. Your developers are the users. If the onboarding experience is painful, every day after that is a little painful too.
We used NestJS, TypeScript, Zod, Traefik, ElasticMQ, Docker, and VS Code. You might use different tools. The pattern is the same: centralize config, validate on startup, auto-generate defaults, prevent drift, make it one click.
Build it once. Fix it for everyone. Forever.
I'm Arun, CTO at a fintech startup. We're a team of 15 engineers in India building payment infrastructure for the UK. I write about the messy reality of scaling engineering teams and systems. Find me on X @mickyarun.