Most systems don't collapse under load because of bad code.
They collapse because the architecture was never designed for order-of-magnitude growth.
Here's a breakdown of every layer, with the patterns, tradeoffs, and code that bridge the gap between 10k and 1M requests/day.
⚙️ Layer 1: The Data Layer Breaks First
At 10k req/day, a single RDBMS instance handles reads and writes without complaint.
At 1M, you hit lock contention, connection exhaustion, and query latency that compounds across every endpoint.
Read Replicas
Route all SELECT queries to replicas. Keep writes on primary.
-- On primary (write)
INSERT INTO orders (user_id, total) VALUES (42, 199.99);
-- On replica (read)
SELECT * FROM orders WHERE user_id = 42;
Watch for replication lag under heavy write load: a lagging replica can serve stale reads. Monitor with:
-- PostgreSQL: check replica lag
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
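In application code, read/write routing can be as simple as picking a pool by statement type. A minimal sketch with stub objects standing in for real connection pools (the actual pg wiring and hostnames are left out):

```javascript
// Stub pools standing in for real connection pools (e.g. pg's Pool);
// the hostnames and driver setup are assumptions, not shown here.
const pools = {
  primary: { name: 'primary' },
  replica: { name: 'replica' },
};

// Route by the statement's leading verb. Sniffing SQL is a heuristic;
// a real data layer should let callers declare read vs write intent,
// since SELECTs inside a write transaction must stay on the primary.
function routeQuery(sql) {
  const verb = sql.trim().split(/\s+/)[0].toUpperCase();
  return verb === 'SELECT' ? pools.replica : pools.primary;
}

console.log(routeQuery('SELECT * FROM orders WHERE user_id = 42').name); // replica
console.log(routeQuery('INSERT INTO orders (user_id, total) VALUES (42, 199.99)').name); // primary
```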
Connection Pooling with PgBouncer
Raw TCP connections to Postgres are expensive. At scale, hundreds of app instances opening direct connections will exhaust max_connections.
With PgBouncer in transaction mode, each transaction borrows a connection from the pool and releases it the moment it commits or rolls back. This cuts the server-side connection count dramatically.
# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
CQRS: Separate Read and Write Models
Command Query Responsibility Segregation separates the write model (normalized, transactional) from the read model (denormalized, optimized for queries).
Write path: API → Command Handler → Write DB (normalized)
↓ (event/sync)
Read path: API → Query Handler → Read DB (denormalized view)
Read models can be precomputed materialized views or even a separate datastore (e.g., Elasticsearch for search, Redis for leaderboards).
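The event/sync arrow above is just a projection that folds write-side events into the read model. A minimal in-memory sketch (store shapes and handler names are illustrative; totals are in cents to avoid float drift):

```javascript
// Write model: normalized rows. Read model: one denormalized document
// per user, precomputed so queries never join or aggregate at request time.
const writeDb = { orders: [] };
const readDb = new Map(); // userId -> { userId, orderCount, lifetimeTotalCents }

// Command handler: transactional write, then emit an event.
function placeOrder(userId, totalCents) {
  const order = { id: writeDb.orders.length + 1, userId, totalCents };
  writeDb.orders.push(order);
  onOrderPlaced(order); // in production: publish to a queue, not a direct call
  return order;
}

// Projection: keeps the read model in sync with write-side events.
function onOrderPlaced(order) {
  const view = readDb.get(order.userId) ??
    { userId: order.userId, orderCount: 0, lifetimeTotalCents: 0 };
  view.orderCount += 1;
  view.lifetimeTotalCents += order.totalCents;
  readDb.set(order.userId, view);
}

// Query handler: a single key lookup.
function getUserStats(userId) {
  return readDb.get(userId);
}

placeOrder(42, 19999);
placeOrder(42, 5000);
console.log(getUserStats(42)); // { userId: 42, orderCount: 2, lifetimeTotalCents: 24999 }
```

The same projection function can feed a materialized view, Elasticsearch, or Redis instead of an in-memory map; only the write target changes.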
Composite and Partial Indexes
-- Composite index: covers multi-column WHERE + ORDER BY
CREATE INDEX idx_orders_user_created
ON orders (user_id, created_at DESC);
-- Partial index: only indexes rows matching a condition
-- Much smaller, faster for filtered queries
CREATE INDEX idx_active_users
ON users (email)
WHERE status = 'active';
Always validate with EXPLAIN ANALYZE:
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 42 AND created_at > now() - interval '30 days';
Look for Seq Scan → usually a missing index (at least on large tables). Index Scan → the planner is using yours.
⚡ Layer 2: Caching as an Architectural Layer
Every cache hit is a request your database, compute, and network never have to handle.
Cache-Aside Pattern (Lazy Loading)
The application checks the cache first. On a miss, it fetches from the DB and populates the cache.
async function getUser(userId) {
const cacheKey = `user:${userId}`;
// 1. Check cache
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
// 2. Cache miss: fetch from DB
const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
// 3. Populate cache with TTL
await redis.setex(cacheKey, 3600, JSON.stringify(user)); // 1 hour TTL
return user;
}
Write-Through vs Write-Behind
| Strategy | How it works | Best for |
|---|---|---|
| Write-through | Write to cache + DB synchronously | Read-heavy, consistency critical |
| Write-behind | Write to cache, async flush to DB | Write-heavy, latency sensitive |
// Write-through
async function updateUser(userId, data) {
await db.query('UPDATE users SET name=$1 WHERE id=$2', [data.name, userId]);
await redis.setex(`user:${userId}`, 3600, JSON.stringify(data)); // sync
}
// Write-behind (async flush via queue)
async function updateUser(userId, data) {
await redis.setex(`user:${userId}`, 3600, JSON.stringify(data));
await queue.publish('db.write', { table: 'users', id: userId, data }); // async
}
Thundering Herd Prevention
When a hot cache key expires, thousands of requests can simultaneously hit the DB. This is the thundering herd problem.
Solution: Mutex on cache miss
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
async function getUserWithLock(userId) {
const cacheKey = `user:${userId}`;
const lockKey = `lock:${cacheKey}`;
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
// Acquire lock: only one request hits the DB
const lock = await redis.set(lockKey, '1', 'NX', 'EX', 5);
if (!lock) {
// Another request is fetching: wait and retry
await sleep(100);
return getUserWithLock(userId);
}
const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
await redis.setex(cacheKey, 3600, JSON.stringify(user));
await redis.del(lockKey);
return user;
}
HTTP Caching Headers
Underused but extremely powerful. A properly cached API response never hits your origin.
HTTP/1.1 200 OK
Cache-Control: public, max-age=60, stale-while-revalidate=300
ETag: "abc123"
Last-Modified: Sun, 22 Mar 2026 10:00:00 GMT
- max-age=60: serve from cache for 60 seconds
- stale-while-revalidate=300: serve stale while refreshing in the background (zero perceived latency)
- ETag: client sends If-None-Match: "abc123" on the next request; the server returns 304 Not Modified if unchanged
🔀 Layer 3: Async-First Architecture
The user doesn't need to wait for your email to send, your analytics to log, or your thumbnail to generate. Synchronous request-response for every operation is a design choice that doesn't survive scale.
Message Queue Pattern (SQS / RabbitMQ / Kafka)
// Producer: HTTP handler returns immediately
app.post('/orders', async (req, res) => {
const order = await db.createOrder(req.body);
// Don't wait: publish to queue and respond
await sqs.sendMessage({
QueueUrl: process.env.ORDER_QUEUE_URL,
MessageBody: JSON.stringify({ orderId: order.id }),
});
res.status(202).json({ orderId: order.id }); // 202 Accepted
});
// Consumer: runs separately, processes async
async function processOrder(message) {
const { orderId } = JSON.parse(message.Body);
await sendConfirmationEmail(orderId);
await updateInventory(orderId);
await triggerAnalyticsEvent(orderId);
}
Dead Letter Queue (DLQ)
Failed messages should never be silently dropped. Route them to a DLQ for inspection and replay.
Normal Queue → Consumer fails 3x → Dead Letter Queue
↓
Alert + manual replay
// SQS queue with DLQ configured
{
"RedrivePolicy": {
"deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:orders-dlq",
"maxReceiveCount": 3 // Move to DLQ after 3 failures
}
}
Idempotency Keys
At-least-once delivery means consumers may process the same message twice. Idempotent consumers handle this safely.
async function processOrder(message) {
const { orderId, idempotencyKey } = JSON.parse(message.Body);
// Check if already processed
const alreadyProcessed = await redis.get(`processed:${idempotencyKey}`);
if (alreadyProcessed) return; // Skip duplicate
await fulfillOrder(orderId);
// Mark as processed with TTL
await redis.setex(`processed:${idempotencyKey}`, 86400, '1');
}
The Saga Pattern for Distributed Transactions
Traditional 2-phase commit (2PC) doesn't work across microservices. The Saga pattern replaces it with a sequence of local transactions, each with a compensating rollback action.
PlaceOrder Saga:
1. Reserve inventory → compensate: release inventory
2. Charge payment → compensate: refund payment
3. Create shipment → compensate: cancel shipment
4. Send confirmation email → (no compensation needed)
If step 3 fails → run compensating transactions for steps 2 and 1.
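A generic saga executor is small: run the steps in order, and on failure run the compensations of completed steps in reverse. A sketch with stand-in handlers (the step names mirror the saga above; the real handlers would call your services):

```javascript
// Run steps in order; on failure, compensate completed steps in reverse.
async function runSaga(steps, ctx) {
  const completed = [];
  for (const step of steps) {
    try {
      await step.run(ctx);
      completed.push(step);
    } catch (err) {
      for (const done of completed.reverse()) {
        if (done.compensate) await done.compensate(ctx); // best-effort rollback
      }
      return { ok: false, failedAt: step.name };
    }
  }
  return { ok: true };
}

// Illustrative steps: createShipment fails, so payment and inventory roll back.
const log = [];
const steps = [
  { name: 'reserveInventory', run: async () => log.push('reserve'), compensate: async () => log.push('release') },
  { name: 'chargePayment', run: async () => log.push('charge'), compensate: async () => log.push('refund') },
  { name: 'createShipment', run: async () => { throw new Error('carrier down'); } },
];

runSaga(steps, {}).then((result) => {
  console.log(result); // { ok: false, failedAt: 'createShipment' }
  console.log(log); // ['reserve', 'charge', 'refund', 'release']
});
```

Note the compensations themselves should be idempotent: a crashed orchestrator may replay them.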
🌐 Layer 4: Infrastructure That Scales Horizontally
Vertical scaling has a ceiling and a single point of failure. Horizontal scaling is the only path forward.
Stateless Services
Session state stored server-side makes horizontal scaling impossible. Any instance must be able to handle any request.
// ❌ Stateful: breaks with multiple instances
app.post('/login', (req, res) => {
req.session.userId = user.id; // Stored in memory, instance-specific
});
// ✅ Stateless: works across any instance
app.post('/login', async (req, res) => {
const token = jwt.sign({ userId: user.id }, process.env.JWT_SECRET, { expiresIn: '1h' });
res.json({ token }); // Client holds state
});
Circuit Breaker Pattern
When a downstream service degrades, stop sending requests to it. Fail fast, return a fallback, allow recovery.
const CircuitBreaker = require('opossum');
const options = {
timeout: 3000, // If function takes > 3s, trigger failure
errorThresholdPercentage: 50, // Open circuit if 50% of requests fail
resetTimeout: 30000, // After 30s, try again (half-open state)
};
const breaker = new CircuitBreaker(callPaymentService, options);
breaker.fallback(() => ({ status: 'queued', message: 'Payment queued for retry' }));
breaker.on('open', () => console.warn('Circuit OPEN: payment service degraded'));
breaker.on('halfOpen', () => console.info('Circuit HALF-OPEN: testing recovery'));
breaker.on('close', () => console.info('Circuit CLOSED: payment service recovered'));
// Usage
const result = await breaker.fire(paymentData);
States: CLOSED (normal) → OPEN (failing fast) → HALF-OPEN (testing recovery) → CLOSED
Bulkhead Pattern
Isolate resource pools per service or tenant. One overloaded consumer doesn't starve the rest of the system.
// Separate thread pools / connection pools per downstream service
const paymentPool = new ConnectionPool({ max: 10 }); // Max 10 connections to payment service
const inventoryPool = new ConnectionPool({ max: 20 }); // Max 20 to inventory service
// If payment service is slow and exhausts its pool,
// inventory service pool is completely unaffected.
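Without a pooling library, a bulkhead is just a semaphore. A minimal sketch (a maintained library bulkhead, e.g. cockatiel's, is the more battle-tested choice in production):

```javascript
// A tiny semaphore: at most `max` tasks in flight per downstream service.
class Bulkhead {
  constructor(max) {
    this.max = max;
    this.active = 0;
    this.waiting = [];
  }
  async run(task) {
    // Queue until a slot frees; re-check after waking in case of races.
    while (this.active >= this.max) {
      await new Promise((resolve) => this.waiting.push(resolve));
    }
    this.active += 1;
    try {
      return await task();
    } finally {
      this.active -= 1;
      const next = this.waiting.shift();
      if (next) next(); // wake one queued caller
    }
  }
}

// One bulkhead per downstream: a slow payment service can only occupy
// its own slots, never inventory's.
const paymentBulkhead = new Bulkhead(10);
const inventoryBulkhead = new Bulkhead(20);

// Usage (callPaymentService is a stand-in for your client call):
// const result = await paymentBulkhead.run(() => callPaymentService(data));
```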
📊 Layer 5: Observability is Infrastructure
You cannot debug a distributed system with console.log. At 1M req/day, observability is a prerequisite.
The Three Pillars
| Pillar | Tool | Purpose |
|---|---|---|
| Metrics | Prometheus + Grafana | Aggregated numbers over time |
| Logs | ELK / Loki | Discrete events with context |
| Traces | Jaeger / OpenTelemetry | End-to-end request flow |
RED Method for Services
Track Rate, Errors, Duration for every service.
const { Counter, Histogram } = require('prom-client');
const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status_code'],
});
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request latency',
labelNames: ['method', 'route'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5], // p50, p95, p99 visible here
});
app.use((req, res, next) => {
const end = httpRequestDuration.startTimer({ method: req.method, route: req.path });
res.on('finish', () => {
httpRequestsTotal.inc({ method: req.method, route: req.path, status_code: res.statusCode });
end();
});
next();
});
p95 and p99 latency matter more than averages at scale. An average of 50ms can hide that 1% of users are waiting 5 seconds.
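A quick synthetic check of that claim (nearest-rank percentiles; here 2 of 100 requests are slow, so the tail shows up at p99):

```javascript
// 98 fast requests at 50ms, 2 slow requests at 5000ms.
const latencies = [...Array(98).fill(50), 5000, 5000];

const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;

// Nearest-rank percentile: the value at or below which p% of samples fall.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.ceil((p / 100) * sorted.length) - 1];
}

console.log(avg); // 149 -> looks tolerable on a dashboard
console.log(percentile(latencies, 95)); // 50
console.log(percentile(latencies, 99)); // 5000 -> the users actually waiting 5 seconds
```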
Distributed Tracing with Context Propagation
Trace IDs must propagate across every service boundary (HTTP headers, queues, async jobs) so the full request flow can be reconstructed.
const { trace, context, propagation } = require('@opentelemetry/api');
// Outgoing HTTP request: inject trace context into headers
async function callInventoryService(orderId, headers = {}) {
propagation.inject(context.active(), headers); // Injects traceparent header
return fetch(`http://inventory-service/reserve/${orderId}`, { headers });
}
// Incoming request in inventory service: extract and continue trace
app.use((req, res, next) => {
const ctx = propagation.extract(context.active(), req.headers);
context.with(ctx, next); // All spans created in next() are children of caller's trace
});
SLO-Based Alerting
Alert on error budget burn rate, not raw error counts. This reduces alert fatigue and focuses on user impact.
# Prometheus alerting rule
- alert: HighErrorBudgetBurnRate
expr: |
(
rate(http_requests_total{status_code=~"5.."}[1h]) /
rate(http_requests_total[1h])
) > 0.01 # More than 1% error rate (if SLO is 99% success)
for: 5m
labels:
severity: critical
annotations:
summary: "Error budget burning too fast"
🔒 Layer 6: Security Surface Grows With Traffic
Scale makes you a target. The attack surface grows proportionally.
Rate Limiting: Token Bucket vs Leaky Bucket
- Token bucket: allows bursting (user has 100 tokens, each request costs 1, refills at 10/sec)
- Leaky bucket: enforces a steady output rate regardless of burst
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');
// Fixed-window counter: a simple approximation of token-bucket bursting
const limiter = rateLimit({
windowMs: 60 * 1000, // 1 minute window
max: 100, // 100 requests per window per IP
standardHeaders: true,
store: new RedisStore({ client: redis }), // Distributed: works across instances
keyGenerator: (req) => req.headers['x-api-key'] || req.ip, // Per API key, fallback to IP
});
app.use('/api/', limiter);
Apply rate limiting at the edge (CDN/WAF level) before bad traffic reaches your app servers.
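The token bucket itself is only a few lines if you ever need one outside a library. A sketch with an injectable clock for testability (capacity and refill rate mirror the bullet above):

```javascript
// Minimal token bucket: capacity 100, refills at 10 tokens/sec.
// The injectable `now` clock is a testing convenience, not a requirement.
class TokenBucket {
  constructor(capacity, refillPerSec, now = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // start full: allows an initial burst
    this.now = now;
    this.last = now();
  }
  take(cost = 1) {
    const t = this.now();
    // Refill proportionally to elapsed time, capped at capacity.
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((t - this.last) / 1000) * this.refillPerSec
    );
    this.last = t;
    if (this.tokens < cost) return false; // over limit: caller should respond 429
    this.tokens -= cost;
    return true;
  }
}

// A burst of 100 requests passes immediately; the 101st is rejected
// until enough refill time has elapsed.
const bucket = new TokenBucket(100, 10);
let allowed = 0;
for (let i = 0; i < 101; i++) if (bucket.take()) allowed++;
console.log(allowed); // 100
```

Back this with Redis (tokens + last-refill timestamp per key, updated atomically in a Lua script) to make it work across instances, like the store-based limiter above.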
JWT Short-Lived Tokens + Refresh Rotation
Long-lived tokens are liabilities at scale. Stolen tokens remain valid for their full lifetime.
// Issue short-lived access token + longer-lived refresh token
function issueTokens(userId) {
const accessToken = jwt.sign(
{ userId, type: 'access' },
process.env.JWT_SECRET,
{ expiresIn: '15m' } // Short-lived
);
const refreshToken = jwt.sign(
{ userId, type: 'refresh', jti: crypto.randomUUID() }, // jti = unique ID for rotation
process.env.REFRESH_SECRET,
{ expiresIn: '7d' }
);
return { accessToken, refreshToken };
}
// On refresh-rotate: invalidate old, issue new
async function rotateRefreshToken(oldToken) {
const decoded = jwt.verify(oldToken, process.env.REFRESH_SECRET);
// Check token hasn't been used before (detect reuse attacks)
const isRevoked = await redis.get(`revoked:${decoded.jti}`);
if (isRevoked) throw new Error('Refresh token reuse detected');
// Revoke old token
await redis.setex(`revoked:${decoded.jti}`, 7 * 86400, '1');
return issueTokens(decoded.userId);
}
The Architectural Mindset Shift
| | 10k req/day | 1M req/day |
|---|---|---|
| Consistency | Strong consistency fine | Eventual consistency worth the tradeoff |
| Failures | Rare, handle manually | Expected, handle programmatically |
| Primary bottleneck | Application logic | I/O, network, database |
| Deployment | Downtime acceptable | Zero-downtime mandatory |
| Debugging | Logs are enough | Distributed tracing required |
| Scaling axis | Vertical | Horizontal + auto |
| Caching | Nice to have | Core architectural layer |
| Async processing | Optional | Default pattern |
The gap between 10k and 1M isn't just infrastructure.
It's a shift from building features to designing for failure. Redundancy, idempotency, backpressure, and observability aren't overengineering at this scale; they're the baseline.
Which layer tends to break first in your experience? data, async, or infra? Drop it in the comments.
Top comments (2)
The idempotency key section is the one most teams implement too late. What I've seen work well is scoping keys per operation type so you don't get cross-operation collisions (payment:${idempotencyKey} vs just ${idempotencyKey}). Also worth noting that the 24h TTL for processed keys can create subtle bugs in financial systems where the same key gets reused after it expires. Tying it to the business event's own retention policy (rather than a fixed TTL) tends to be safer in practice.

The thundering herd lock implementation has a subtle footgun: if the DB query takes longer than the lock TTL (5s in the example), you can get a double-fetch anyway. Worth either matching the TTL to your actual DB timeout, or using a lease extension. The stale-while-revalidate directive in the HTTP caching section is also underrated for exactly this problem: most thundering herd scenarios can be absorbed at the CDN edge before they ever reach your origin.