Chaos Engineering for Node.js Without the Infrastructure

#api #node #sre #testing

Chaos engineering sounds expensive. Netflix built Chaos Monkey to randomly kill production servers. Google runs DiRT (Disaster Recovery Testing) across their entire infrastructure. Amazon does game days where they intentionally take down services.

You're building a Node.js API. You don't have a platform team. You don't have chaos infrastructure. But you still need to know: what happens when your dependencies get slow?

The good news is that most of the value of chaos engineering comes from one question, and you can start answering it locally in five minutes.

The one question that matters

What does my application do when a dependency responds slowly or not at all?

Not "what if the server catches fire"; that's infrastructure chaos. This is application chaos: the database is slow, the payment API is timing out, Redis is having a bad day. These happen constantly in production and they're almost never tested.

The failure modes look like this:

Your DB gets slow under load → your API response times climb → your timeout fires → you retry → now you're sending twice the load to an already-slow DB
Your Redis cache goes down → every request hits Postgres directly → Postgres gets slow → same cascade
Stripe's API takes 3 seconds instead of 200ms → your checkout route times out → users get errors → you're losing revenue

Every one of these is a latency failure, not a crash. The service is still up. It's just slow. And slow is the hardest failure mode to test because your local environment is fast.

Why local testing misses this

When you test locally, your "database" is either:

A real local Postgres running on the same machine (sub-millisecond latency, not production-like)
A mock that returns instantly (jest.fn().mockResolvedValue(data))
A fake with a flat delay (await sleep(200))

None of these produce realistic latency. A real production database has:

Fast responses most of the time (p50 ~5ms)
Occasional slowdowns (p95 ~50ms)
Rare but real spikes (p99 ~200ms, p99.9 ~2000ms)

The spikes are what break things. And the spikes only happen when the conditions are right — high load, cold cache, GC pause, noisy neighbor. You can't predict when they'll occur, only that they will.

Simulating realistic failure locally

The key insight: you don't need to kill servers to do chaos engineering. You need to inject realistic latency into your dependency calls.

Here's the minimal version you can write yourself:

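function lognormalSample(p50, p99) {
  // fit a lognormal: median = p50, and p99 sits 2.326 standard deviations above it in log space
  const mu = Math.log(p50);
  const sigma = (Math.log(p99) - mu) / 2.326;
  // Box-Muller transform; 1 - Math.random() is in (0, 1], so Math.log(u1) stays finite
  const u1 = 1 - Math.random(), u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.exp(mu + sigma * z);
}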

function withChaos(fn, { p50, p99, errorRate = 0 }) {
  return async function(...args) {
    const delay = lognormalSample(p50, p99);
    await new Promise(r => setTimeout(r, delay));

    if (Math.random() < errorRate) {
      throw new Error('simulated transient error');
    }

    return fn.apply(this, args);
  };
}
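To convince yourself the sampler does what it claims, you can draw a large batch and check the percentiles. The sigma line works because the 99th percentile of a standard normal sits about 2.326 standard deviations above the mean, so ln(p99) = mu + 2.326 * sigma. Here's a rough check; `percentile` and `measure` are helper names made up for this sketch.

```javascript
// Sampler repeated from above (with a guard against Math.log(0))
// so this block runs on its own.
function lognormalSample(p50, p99) {
  const mu = Math.log(p50);
  const sigma = (Math.log(p99) - mu) / 2.326;
  const u1 = 1 - Math.random(); // in (0, 1], so Math.log(u1) is finite
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.exp(mu + sigma * z);
}

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sorted, p) {
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

// Draw n samples and report the observed p50 and p99 in ms.
function measure(p50, p99, n = 100000) {
  const samples = Array.from({ length: n }, () => lognormalSample(p50, p99));
  samples.sort((a, b) => a - b);
  return { p50: percentile(samples, 50), p99: percentile(samples, 99) };
}

// The observed values should land near the requested 5ms and 200ms.
console.log(measure(5, 200));
```

The numbers wobble from run to run, especially the p99, since only 1% of the samples sit above it.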

Or use slowdep, which packages this with built-in presets:

npm install slowdep

import { withLatency, withLatencyAll } from 'slowdep';

// wrap individual functions
const slowQuery = withLatency(db.findUser, 'postgres');

// or wrap entire clients
const slowDB = withLatencyAll(prisma, 'postgres');
const slowCache = withLatencyAll(redis, 'redis');

The chaos test playbook

Here are the three tests every Node.js service should run before shipping:

Test 1: Timeout calibration

import { withLatency } from 'slowdep';

const findById = withLatency(realDBCall, 'postgres'); // p99: 200ms

async function getUser(id) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('DB timeout')), 250); // your timeout
  });

  try {
    // race the query against the timer; whichever settles first wins
    // (the losing query is not cancelled, it is just ignored)
    return await Promise.race([findById(id), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// run 1000 times and check your timeout hit rate
let timeouts = 0;
for (let i = 0; i < 1000; i++) {
  try {
    await getUser(i);
  } catch (err) {
    if (err.message === 'DB timeout') timeouts++;
  }
}

console.log(`timeout rate: ${(timeouts / 1000 * 100).toFixed(2)}%`);
// if this is above 1%, your timeout is too tight
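If the measured rate is too high, you can estimate a starting timeout from the same lognormal model instead of guessing. This sketch assumes the dependency's latency really is roughly lognormal, which real tails often are not, so treat the result as a first guess to validate with the loop above. `lognormalQuantile` and the `Z` table are names invented for this example.

```javascript
// Standard normal z-scores for a few common quantiles.
const Z = { 0.5: 0, 0.95: 1.645, 0.99: 2.326, 0.999: 3.09 };

// Latency (ms) at quantile q for a lognormal fitted to p50 and p99.
function lognormalQuantile(p50, p99, q) {
  if (!(q in Z)) throw new RangeError(`no z-score tabulated for q=${q}`);
  const mu = Math.log(p50);
  const sigma = (Math.log(p99) - mu) / Z[0.99];
  return Math.exp(mu + sigma * Z[q]);
}

// A timeout at the modeled p99.9 should fire on roughly 0.1% of calls,
// if the model holds.
const timeoutMs = Math.ceil(lognormalQuantile(5, 200, 0.999));
console.log(`suggested timeout: ${timeoutMs}ms`);
```

The gap between the p99 and the p99.9 is the point: a timeout set just above the p99 still fires on close to 1% of calls.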

Test 2: Retry storm detection

import { withLatency } from 'slowdep';

// simulate DB under stress: higher p50 and p99, plus 5% errors
const stressedDB = withLatency(realDBCall, { p50: 20, p99: 800, errorRate: 0.05 });

let totalCalls = 0;

// count how many actual DB calls your retry logic makes
const instrumentedDB = async (...args) => {
  totalCalls++;
  return stressedDB(...args);
};

// run 100 user requests through your retry logic
// (withRetry is whatever retry helper your app already uses)
for (let i = 0; i < 100; i++) {
  try {
    await withRetry(() => instrumentedDB(i));
  } catch {
    // requests that still fail after retries are fine here; we only count calls
  }
}

console.log(`100 requests → ${totalCalls} DB calls`);
// if this is above 150, your retry logic is amplifying load
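`withRetry` in the test above stands for your own retry helper. If you don't have one yet, here is one possible shape: capped exponential backoff with full jitter. The option names and defaults are assumptions for illustration, so tune them to your own dependency.

```javascript
// Hypothetical retry helper: capped exponential backoff with full jitter.
// Makes at most `retries + 1` calls to fn, then rethrows the last error.
async function withRetry(fn, { retries = 2, baseMs = 50, maxMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // out of attempts, give up
      // full jitter: wait a random time in [0, min(maxMs, baseMs * 2^attempt))
      const ceiling = Math.min(maxMs, baseMs * 2 ** attempt);
      await new Promise((resolve) => setTimeout(resolve, Math.random() * ceiling));
    }
  }
  throw lastError;
}
```

With `retries: 2`, each request makes at most 3 calls, so 100 requests can never exceed 300 DB calls. At a 5% independent error rate you'd expect about 1 + 0.05 + 0.0025 ≈ 1.05 calls per request, comfortably under the 150 threshold.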

Test 3: Cascade failure simulation

import { withLatencyAll } from 'slowdep';

// simulate what happens when your cache is down
// and every request falls through to the DB
const noCache = withLatencyAll(redis, { p50: 500, p99: 3000, errorRate: 0.3 });
const db = withLatencyAll(prisma, 'postgres');

async function getUser(id) {
  try {
    const cached = await noCache.get(`user:${id}`);
    if (cached) return JSON.parse(cached);
  } catch {
    // cache error: treat it as a miss and fall through to the DB
  }

  return db.user.findUnique({ where: { id } });
}

// does your app stay responsive when cache is degraded?
const start = Date.now();
await Promise.all(Array.from({ length: 50 }, (_, i) => getUser(i)));
console.log(`50 concurrent requests: ${Date.now() - start}ms`);
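If that number is dominated by the cache rather than the DB, it's because every request waits out the slow cache call before falling through. One common mitigation is to give the cache read its own short deadline. A minimal sketch, where `withDeadline`, `getWithCacheDeadline`, and the 50ms budget are all assumptions for illustration:

```javascript
// Hypothetical helper: reject if `promise` has not settled within `ms`.
// The underlying operation is not cancelled, only ignored.
function withDeadline(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} exceeded ${ms}ms`)), ms);
  });
  // Whichever settles first wins; the timer is always cleared afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Read-through lookup that treats a slow or failing cache as a miss.
// `cache.get` and `loadFromDb` are stand-ins for your own clients.
async function getWithCacheDeadline(cache, loadFromDb, key, cacheBudgetMs = 50) {
  try {
    const cached = await withDeadline(cache.get(key), cacheBudgetMs, 'cache read');
    if (cached) return JSON.parse(cached);
  } catch {
    // cache error or deadline exceeded: fall through to the database
  }
  return loadFromDb(key);
}
```

Swap this into `getUser` and rerun the 50-request test. The tradeoff is that a short cache budget pushes more traffic onto the database, so watch the DB side of the test too.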

What you'll find

When you run these tests for the first time, you'll typically discover:

Your timeouts are too tight. Most developers set timeouts based on p50 latency. Your DB responds in 5ms usually, so you set a 100ms timeout. That's fine until p99 hits 200ms, at which point more than 1% of requests time out and retry.

Your retry logic increases load. Naive retry without backoff sends the same request again immediately. Under stress, a single retry can double your DB load at exactly the wrong moment, and multiple retries make it worse.

Your circuit breaker never opens. If you have one. Most local tests never trigger it because the fake is too well-behaved.

Your error handling has gaps. Timeout errors, transient errors, and permanent errors need different handling. Flat fakes always produce the same error type.
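If you don't have a circuit breaker to trigger, the core idea fits in a few lines. This is a deliberately small sketch, not a production implementation: it counts consecutive failures only and lets any number of trial calls through once the cooldown has passed. `circuitBreaker`, its options, and the injectable `now` clock are names chosen for this example.

```javascript
// Hypothetical minimal circuit breaker: opens after `threshold` consecutive
// failures, then rejects calls immediately until `cooldownMs` has passed.
function circuitBreaker(fn, { threshold = 5, cooldownMs = 10000, now = Date.now } = {}) {
  let failures = 0;
  let openedAt = null;

  return async function (...args) {
    if (openedAt !== null && now() - openedAt < cooldownMs) {
      throw new Error('circuit open'); // fail fast without touching the dependency
    }
    // Either closed, or the cooldown elapsed and this call is a trial.
    try {
      const result = await fn.apply(this, args);
      failures = 0;
      openedAt = null; // success closes the circuit
      return result;
    } catch (err) {
      failures++;
      if (failures >= threshold) openedAt = now(); // open, or re-open after a failed trial
      throw err;
    }
  };
}
```

Wrap it around a dependency call that has a high simulated error rate, and check that the number of calls reaching the dependency stops growing once the breaker opens.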

This is local chaos engineering

You didn't need Chaos Monkey. You didn't need a dedicated chaos platform. You needed realistic latency in your test environment, which is just math and an 80-line package.

The infrastructure chaos stuff matters too, eventually. But if you haven't tested your application behavior under slow dependencies first, you haven't done the basics.

Start there.

slowdep on npm · github.com/arnnnavvvvv/slowdep