binadit

Posted on • Originally published at binadit.com

Benchmarking API reliability under load: when zero downtime migration becomes critical

When APIs break: load testing reveals the truth about infrastructure limits

Here's a reality check: most teams discover their API's breaking point when users are already hitting errors, not during careful testing. By then, you're fighting fires instead of preventing them.

We decided to get real data. How much concurrent load can different infrastructure setups actually handle before things fall apart? The results surprised us.

The experiment: same API, different infrastructure

We built a straightforward e-commerce API with three endpoints:

  • GET /products (product browsing)
  • POST /auth/login (authentication)
  • POST /orders (order placement)

Test stack:

Runtime: Node.js 18.17.0 + Express 4.18.2
Database: PostgreSQL 15.3 (2GB RAM)
Hardware: 4 cores, 8GB RAM, NVMe storage
Cache: Redis 7.0.11
Load testing: Artillery.io

We tested four infrastructure patterns:

  1. Single server: everything on one machine
  2. Database separation: dedicated DB server
  3. Load balanced: 2 app servers + shared database + Redis cluster
  4. Auto-scaling: 2-6 servers with horizontal scaling
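The 2-6 server range in the auto-scaling pattern can be pictured as a clamped scaling rule. The sketch below is only an illustration: the article does not say which metric or threshold drove scaling, so `USERS_PER_SERVER` is a hypothetical value, picked because it happens to line up with the active-server counts reported in the results.

```javascript
// Hypothetical scaling rule: one server per 300 concurrent users,
// clamped to the 2-6 server range used in the test.
const MIN_SERVERS = 2;
const MAX_SERVERS = 6;
const USERS_PER_SERVER = 300; // assumption, not a measured capacity

function desiredServers(concurrentUsers) {
  const needed = Math.ceil(concurrentUsers / USERS_PER_SERVER);
  // Never go below the floor or above the ceiling of the scaling group.
  return Math.min(MAX_SERVERS, Math.max(MIN_SERVERS, needed));
}

console.log([10, 1000, 1500, 2000].map(desiredServers)); // [ 2, 4, 5, 6 ]
```

Real auto-scalers usually react to CPU, latency, or queue depth rather than a user count, and they add cooldown periods so the group does not flap.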

The load profile ramped from 10 to 2,000 concurrent users over 40 minutes, mimicking real e-commerce traffic patterns.
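For a rough sense of what that ramp means at any point in the run, here is a minimal sketch that assumes the ramp was linear. The article does not state the ramp shape, and `targetUsers` is a hypothetical helper, not part of Artillery.

```javascript
// Load profile from the test: 10 to 2,000 concurrent users over 40 minutes.
const RAMP = { startUsers: 10, endUsers: 2000, durationMinutes: 40 };

// Target concurrency at a given minute, assuming a linear ramp.
function targetUsers(minute, ramp = RAMP) {
  // Clamp the time so values outside the test window stay on the endpoints.
  const t = Math.min(Math.max(minute, 0), ramp.durationMinutes);
  const fraction = t / ramp.durationMinutes;
  return Math.round(ramp.startUsers + fraction * (ramp.endUsers - ramp.startUsers));
}

console.log(targetUsers(0));  // 10
console.log(targetUsers(20)); // 1005
console.log(targetUsers(40)); // 2000
```

Under this assumption the load grows by roughly 50 users per minute, so a configuration that breaks at 500 users would start failing about ten minutes into the run.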

Results: reliability doesn't decline gracefully

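Here's what we found for three of the four configurations (results for the database-separation setup are not listed here):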

Single server configuration

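| Concurrent Users | P50 Response | P95 Response | Error Rate |
|------------------|--------------|--------------|------------|
| 100              | 178ms        | 456ms        | 1.2%       |
| 250              | 456ms        | 1,234ms      | 8.7%       |
| 500              | 1,234ms      | 4,567ms      | 23.4%      |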

Breaking point: 500 concurrent users

Load balanced configuration

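| Concurrent Users | P50 Response | P95 Response | Error Rate |
|------------------|--------------|--------------|------------|
| 500              | 234ms        | 678ms        | 1.1%       |
| 1,000            | 456ms        | 1,234ms      | 5.7%       |
| 1,500            | 890ms        | 2,456ms      | 15.3%      |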

Breaking point: 1,500 concurrent users

Auto-scaling configuration

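| Concurrent Users | P50 Response | P95 Response | Error Rate | Active Servers |
|------------------|--------------|--------------|------------|----------------|
| 1,000            | 189ms        | 457ms        | 0.8%       | 4              |
| 1,500            | 234ms        | 567ms        | 2.1%       | 5              |
| 2,000            | 289ms        | 678ms        | 3.9%       | 6              |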

Breaking point: not reached. It handled the full 2,000 users with a 3.9% error rate.
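The article does not spell out what counts as a breaking point. One reading that reproduces the figures above is "the first load level where the error rate exceeds 10%". The sketch below applies that assumed threshold to the reported numbers; `breakingPoint` and `ERROR_THRESHOLD_PERCENT` are hypothetical names.

```javascript
// Error rates (in percent) as reported in the results tables.
const results = {
  singleServer: [
    { users: 100, errorRate: 1.2 },
    { users: 250, errorRate: 8.7 },
    { users: 500, errorRate: 23.4 },
  ],
  loadBalanced: [
    { users: 500, errorRate: 1.1 },
    { users: 1000, errorRate: 5.7 },
    { users: 1500, errorRate: 15.3 },
  ],
  autoScaling: [
    { users: 1000, errorRate: 0.8 },
    { users: 1500, errorRate: 2.1 },
    { users: 2000, errorRate: 3.9 },
  ],
};

const ERROR_THRESHOLD_PERCENT = 10; // assumption: not stated in the article

// Lowest tested load whose error rate exceeds the threshold, or null if none does.
function breakingPoint(rows, threshold = ERROR_THRESHOLD_PERCENT) {
  const sorted = [...rows].sort((a, b) => a.users - b.users);
  const hit = sorted.find((row) => row.errorRate > threshold);
  return hit ? hit.users : null;
}

for (const [name, rows] of Object.entries(results)) {
  console.log(name, breakingPoint(rows));
}
// singleServer 500
// loadBalanced 1500
// autoScaling null
```

A stricter threshold moves the breaking points earlier: at 5%, the single server already fails at 250 users and the load-balanced setup at 1,000.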

The database bottleneck pattern

In every configuration, the database connection pool became the limiting factor. This happened before CPU hit 60% utilization.

Why? PostgreSQL's default settings (max_connections is 100 out of the box) aren't tuned for high concurrency. Each application server opens its own pool, so adding servers only adds more clients competing for the same limited set of database connections.

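// Typical connection pool config that fails under load
const { Pool } = require('pg');

const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 20, // Per-process cap (node-postgres defaults to 10): too low for high concurrency
  idleTimeoutMillis: 30000, // Close clients that sit idle for 30 seconds
});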
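To see why more application servers do not help, it is worth doing the connection arithmetic. This is a minimal sketch that assumes one pool per application server with `max: 20` as in the config above, and stock PostgreSQL limits (`max_connections = 100`, with 3 connections reserved for superusers). `connectionBudget` and `safePoolMax` are hypothetical helpers.

```javascript
const PG_MAX_CONNECTIONS = 100; // PostgreSQL default max_connections
const RESERVED_CONNECTIONS = 3; // PostgreSQL default superuser_reserved_connections

// How many connections the app tier may request versus what the database offers.
function connectionBudget(appServers, poolMaxPerServer) {
  const requested = appServers * poolMaxPerServer;
  const available = PG_MAX_CONNECTIONS - RESERVED_CONNECTIONS;
  return { requested, available, fits: requested <= available };
}

// Largest per-server pool size that keeps the whole tier within the limit.
function safePoolMax(appServers) {
  return Math.floor((PG_MAX_CONNECTIONS - RESERVED_CONNECTIONS) / appServers);
}

console.log(connectionBudget(2, 20)); // { requested: 40, available: 97, fits: true }
console.log(connectionBudget(6, 20)); // { requested: 120, available: 97, fits: false }
console.log(safePoolMax(6));          // 16
```

At six servers the pools can ask for more connections than the database accepts, so new connections are refused no matter how idle the CPUs are. A connection pooler in front of the database, or a deliberately tuned `max_connections`, changes this budget.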

Key insights for production systems

Cliff-edge failures are real. Systems work fine until they completely don't. On the single server, the error rate went from 1.2% at 100 users to 8.7% at 250 and 23.4% at 500, so there's only a narrow band between "acceptable performance" and "total failure."

Database scaling matters more than app scaling. Adding application servers won't help if they can't get database connections. Plan for connection pooling, read replicas, and database optimization from day one.
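One way to picture the read-replica part is to send read-only requests such as GET /products to a replica and keep writes such as POST /orders on the primary. This is a hedged sketch, not what the test setup did: `poolFor` is a hypothetical helper, and the `pools` object stands in for two node-postgres `Pool` instances.

```javascript
// pools: { primary, replica }. In a real app these would be two pg Pool
// instances pointing at different hosts (hypothetical wiring).
function poolFor(method, pools) {
  const readOnly = method.toUpperCase() === 'GET';
  // Fall back to the primary when no replica is configured.
  return readOnly && pools.replica ? pools.replica : pools.primary;
}

// Stand-in objects so the sketch runs without a database.
const pools = { primary: { name: 'primary' }, replica: { name: 'replica' } };

console.log(poolFor('GET', pools).name);  // replica
console.log(poolFor('POST', pools).name); // primary
```

Routing by HTTP method is deliberately crude. Replicas lag behind the primary, so a read that must see a just-written order still belongs on the primary.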

Infrastructure changes under load are dangerous. The single server broke at 500 concurrent users while the auto-scaled setup handled 2,000, so moving from one to the other is a large architectural change. Plan your migration strategy during quiet periods, not during outages.

Testing limitations

Our setup was simplified compared to production systems:

  • Used default PostgreSQL settings (real systems are usually optimized)
  • Synthetic load patterns (real traffic is more unpredictable)
  • Single region testing (global users add complexity)
  • Basic CRUD operations (real apps have more complex logic)

Your mileage will vary, but the patterns should carry over.

Bottom line

Know your infrastructure limits before your users do. Load testing during development is cheaper than debugging during peak traffic.

The numbers show that architectural decisions have massive performance implications: in our tests, the auto-scaled setup handled four times the concurrent users of the single server. A single server might handle your current load fine, but what about Black Friday?

Plan your zero downtime migration strategy when you don't need it yet. Future you will thank present you.
