Scaling from 10k to 1M Requests/Day (Every Layer That Matters)

shipra Shankhwar

Most systems don't collapse under load because of bad code.

They collapse because the architecture was never designed for order-of-magnitude growth.

Here's a breakdown of every layer, with the patterns, tradeoffs, and code that bridge the gap between 10k and 1M requests/day.


⚙️ Layer 1: The Data Layer Breaks First

At 10k req/day, a single RDBMS instance handles reads and writes without complaint.

At 1M, you hit lock contention, connection exhaustion, and query latency that compounds across every endpoint.

Read Replicas

Route all SELECT queries to replicas. Keep writes on primary.

-- On primary (write)
INSERT INTO orders (user_id, total) VALUES (42, 199.99);

-- On replica (read)
SELECT * FROM orders WHERE user_id = 42;
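In application code, that routing rule can be captured in a small helper. This is a sketch, assuming `primary` and `replicas` are pg-style connection pools; the names and shape are illustrative, not a specific library's API:

```javascript
// Route writes (and anything not clearly a read) to the primary;
// round-robin plain SELECTs across replicas. Note that
// SELECT ... FOR UPDATE takes locks and must stay on the primary.
function createRouter(primary, replicas) {
  let next = 0;
  return {
    pick(sql) {
      const isRead = /^\s*select\b/i.test(sql) && !/for\s+update/i.test(sql);
      if (!isRead || replicas.length === 0) return primary;
      return replicas[next++ % replicas.length];
    },
  };
}
```

Flows that are sensitive to staleness (read-your-own-writes) can also be pinned to the primary.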

Watch for replication lag under heavy write load: a lagging replica can serve stale reads. Monitor with:

-- PostgreSQL: check replica lag
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;

Connection Pooling with PgBouncer

Raw TCP connections to Postgres are expensive. At scale, hundreds of app instances opening direct connections will exhaust max_connections.

Run PgBouncer in transaction mode: each transaction borrows a connection from the pool and releases it as soon as it completes. This dramatically reduces the number of real connections Postgres has to sustain.

# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20

CQRS: Separate Read and Write Models

Command Query Responsibility Segregation separates the write model (normalized, transactional) from the read model (denormalized, optimized for queries).

Write path:  API → Command Handler → Write DB (normalized)
                                          ↓ (event/sync)
Read path:   API → Query Handler  → Read DB (denormalized view)
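A toy end-to-end version of this flow, with both stores as in-memory Maps standing in for real databases (the event shape and handler names are illustrative):

```javascript
// Write model: normalized, one row per order.
const writeDb = new Map(); // orderId -> { userId, total }
// Read model: denormalized per-user summary, updated by a projector.
const readDb = new Map();  // userId -> { orderCount, lifetimeTotal }

// Command handler: local transaction on the write model, then emit an event.
function handlePlaceOrder(orderId, userId, total) {
  writeDb.set(orderId, { userId, total });
  project({ type: 'OrderPlaced', userId, total }); // the event/sync step
}

// Projector: applies the event to the denormalized view.
function project(event) {
  const view = readDb.get(event.userId) || { orderCount: 0, lifetimeTotal: 0 };
  view.orderCount += 1;
  view.lifetimeTotal += event.total;
  readDb.set(event.userId, view);
}

// Query handler: touches only the read model.
function getUserSummary(userId) {
  return readDb.get(userId);
}
```

In production the projector typically consumes events from a queue, which makes the read model eventually consistent.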

Read models can be precomputed materialized views or even a separate datastore (e.g., Elasticsearch for search, Redis for leaderboards).


Composite and Partial Indexes

-- Composite index: covers multi-column WHERE + ORDER BY
CREATE INDEX idx_orders_user_created 
ON orders (user_id, created_at DESC);

-- Partial index: only indexes rows matching a condition
-- Much smaller, faster for filtered queries
CREATE INDEX idx_active_users 
ON users (email) 
WHERE status = 'active';

Always validate with EXPLAIN ANALYZE:

EXPLAIN ANALYZE 
SELECT * FROM orders 
WHERE user_id = 42 AND created_at > now() - interval '30 days';

Seq Scan on a large table usually means a missing index. Index Scan means the planner is using yours.


⚡ Layer 2: Caching as an Architectural Layer

Every cache hit is a request your database, compute, and network never have to handle.

Cache-Aside Pattern (Lazy Loading)

The application checks the cache first. On a miss, it fetches from the DB and populates the cache.

async function getUser(userId) {
  const cacheKey = `user:${userId}`;

  // 1. Check cache
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // 2. Cache miss: fetch from DB
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);

  // 3. Populate cache with TTL
  await redis.setex(cacheKey, 3600, JSON.stringify(user)); // 1 hour TTL

  return user;
}

Write-Through vs Write-Behind

Strategy      | How it works                      | Best for
------------- | --------------------------------- | --------------------------------
Write-through | Write to cache + DB synchronously | Read-heavy, consistency critical
Write-behind  | Write to cache, async flush to DB | Write-heavy, latency sensitive

// Write-through
async function updateUser(userId, data) {
  await db.query('UPDATE users SET name=$1 WHERE id=$2', [data.name, userId]);
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(data)); // sync
}

// Write-behind (async flush via queue)
async function updateUser(userId, data) {
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(data));
  await queue.publish('db.write', { table: 'users', id: userId, data }); // async
}

Thundering Herd Prevention

When a hot cache key expires, thousands of requests can simultaneously hit the DB. This is the thundering herd problem.

Solution: Mutex on cache miss

async function getUserWithLock(userId) {
  const cacheKey = `user:${userId}`;
  const lockKey = `lock:${cacheKey}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Acquire lock: only one request hits the DB
  const lock = await redis.set(lockKey, '1', 'NX', 'EX', 5);
  if (!lock) {
    // Another request is fetching: wait and retry
    await sleep(100);
    return getUserWithLock(userId);
  }

  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  await redis.setex(cacheKey, 3600, JSON.stringify(user));
  await redis.del(lockKey);

  return user;
}

HTTP Caching Headers

Underused but extremely powerful. A properly cached API response never hits your origin.

HTTP/1.1 200 OK
Cache-Control: public, max-age=60, stale-while-revalidate=300
ETag: "abc123"
Last-Modified: Sun, 22 Mar 2026 10:00:00 GMT
  • max-age=60: serve from cache for 60 seconds
  • stale-while-revalidate=300: serve stale while refreshing in the background (zero perceived latency)
  • ETag: client sends If-None-Match: "abc123" on next request; server returns 304 Not Modified if unchanged
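Server-side, the ETag handshake is only a few lines. A framework-agnostic sketch of the conditional-GET logic (header names follow the HTTP spec; the response shape here is illustrative):

```javascript
// Compare the client's If-None-Match header against the current ETag.
// On a match, short-circuit with 304 and no body; otherwise send the
// full response with caching headers.
function conditionalResponse(reqHeaders, currentEtag, body) {
  if (reqHeaders['if-none-match'] === currentEtag) {
    return { status: 304, headers: { ETag: currentEtag }, body: null };
  }
  return {
    status: 200,
    headers: {
      ETag: currentEtag,
      'Cache-Control': 'public, max-age=60, stale-while-revalidate=300',
    },
    body,
  };
}
```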

🔀 Layer 3: Async-First Architecture

The user doesn't need to wait for your email to send, your analytics to log, or your thumbnail to generate. Synchronous request-response for every operation is a design choice that doesn't survive scale.

Message Queue Pattern (SQS / RabbitMQ / Kafka)

// Producer: HTTP handler returns immediately
app.post('/orders', async (req, res) => {
  const order = await db.createOrder(req.body);

  // Don't wait: publish to queue and respond
  await sqs.sendMessage({
    QueueUrl: process.env.ORDER_QUEUE_URL,
    MessageBody: JSON.stringify({ orderId: order.id }),
  });

  res.status(202).json({ orderId: order.id }); // 202 Accepted
});

// Consumer: runs separately, processes async
async function processOrder(message) {
  const { orderId } = JSON.parse(message.Body);
  await sendConfirmationEmail(orderId);
  await updateInventory(orderId);
  await triggerAnalyticsEvent(orderId);
}

Dead Letter Queue (DLQ)

Failed messages should never be silently dropped. Route them to a DLQ for inspection and replay.

Normal Queue → Consumer fails 3x → Dead Letter Queue
                                          ↓
                                    Alert + manual replay
// SQS queue with DLQ configured
{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:orders-dlq",
    "maxReceiveCount": 3  // Move to DLQ after 3 failures
  }
}
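SQS enforces `maxReceiveCount` server-side; for other brokers the semantics can be mirrored in-process. A queue-agnostic sketch (the shapes of `message` and `onDeadLetter` are assumptions, not a real broker API):

```javascript
// Track receive counts per message id; after maxReceiveCount failures,
// hand the message to the DLQ callback instead of retrying again.
function createRedrive({ maxReceiveCount, onDeadLetter }) {
  const receives = new Map();
  return function consume(message, handler) {
    const count = (receives.get(message.id) || 0) + 1;
    receives.set(message.id, count);
    try {
      handler(message);
      receives.delete(message.id); // success: forget the message
    } catch (err) {
      if (count >= maxReceiveCount) {
        receives.delete(message.id);
        onDeadLetter(message, err); // alert + manual replay from here
      }
      // Otherwise the message stays on the queue and is redelivered.
    }
  };
}
```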

Idempotency Keys

At-least-once delivery means consumers may process the same message twice. Idempotent consumers handle this safely.

async function processOrder(message) {
  const { orderId, idempotencyKey } = JSON.parse(message.Body);

  // Check if already processed
  const alreadyProcessed = await redis.get(`processed:${idempotencyKey}`);
  if (alreadyProcessed) return; // Skip duplicate

  await fulfillOrder(orderId);

  // Mark as processed with TTL
  await redis.setex(`processed:${idempotencyKey}`, 86400, '1');
}

The Saga Pattern for Distributed Transactions

Traditional 2-phase commit (2PC) doesn't work across microservices. The Saga pattern replaces it with a sequence of local transactions, each with a compensating rollback action.

PlaceOrder Saga:
  1. Reserve inventory       → compensate: release inventory
  2. Charge payment          → compensate: refund payment
  3. Create shipment         → compensate: cancel shipment
  4. Send confirmation email → (no compensation needed)

If step 3 fails → run compensating transactions for steps 2 and 1.
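A minimal saga runner makes the flow above concrete: execute steps in order and, on failure, run the compensations of the already-completed steps in reverse (a sketch; real orchestrators also persist saga state so they survive crashes):

```javascript
// Each step is { action, compensate? }. On any failure, unwind what
// has been done so far, newest first.
async function runSaga(steps) {
  const completed = [];
  try {
    for (const step of steps) {
      await step.action();
      completed.push(step);
    }
    return { ok: true };
  } catch (err) {
    for (const step of completed.reverse()) {
      if (step.compensate) await step.compensate();
    }
    return { ok: false, error: err.message };
  }
}
```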

🌐 Layer 4: Infrastructure That Scales Horizontally

Vertical scaling has a ceiling and a single point of failure. Horizontal scaling is the only path forward.

Stateless Services

Session state stored server-side makes horizontal scaling impossible. Any instance must be able to handle any request.

// ❌ Stateful: breaks with multiple instances
app.post('/login', (req, res) => {
  req.session.userId = user.id; // Stored in memory, instance-specific
});

// ✅ Stateless: works across any instance
app.post('/login', async (req, res) => {
  const token = jwt.sign({ userId: user.id }, process.env.JWT_SECRET, { expiresIn: '1h' });
  res.json({ token }); // Client holds state
});

Circuit Breaker Pattern

When a downstream service degrades, stop sending requests to it. Fail fast, return a fallback, allow recovery.

const CircuitBreaker = require('opossum');

const options = {
  timeout: 3000,         // If function takes > 3s, trigger failure
  errorThresholdPercentage: 50, // Open circuit if 50% of requests fail
  resetTimeout: 30000,   // After 30s, try again (half-open state)
};

const breaker = new CircuitBreaker(callPaymentService, options);

breaker.fallback(() => ({ status: 'queued', message: 'Payment queued for retry' }));

breaker.on('open', () => console.warn('Circuit OPEN: payment service degraded'));
breaker.on('halfOpen', () => console.info('Circuit HALF-OPEN: testing recovery'));
breaker.on('close', () => console.info('Circuit CLOSED: payment service recovered'));

// Usage
const result = await breaker.fire(paymentData);

States: CLOSED (normal) → OPEN (failing fast) → HALF-OPEN (testing recovery) → CLOSED
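The same state machine can be hand-rolled in a few lines to make those transitions concrete (a sketch, not a replacement for a hardened library like opossum; the clock is injectable so it can be tested):

```javascript
// CLOSED: requests flow. After `failureThreshold` consecutive failures,
// OPEN: fail fast. After `resetTimeoutMs`, one probe runs in HALF-OPEN;
// success closes the circuit, failure re-opens it.
function createBreaker({ failureThreshold, resetTimeoutMs, now = Date.now }) {
  let state = 'CLOSED';
  let failures = 0;
  let openedAt = 0;
  return {
    state: () => state,
    async fire(fn) {
      if (state === 'OPEN') {
        if (now() - openedAt < resetTimeoutMs) throw new Error('circuit open');
        state = 'HALF-OPEN'; // let one probe through
      }
      try {
        const result = await fn();
        state = 'CLOSED';
        failures = 0;
        return result;
      } catch (err) {
        failures += 1;
        if (state === 'HALF-OPEN' || failures >= failureThreshold) {
          state = 'OPEN';
          openedAt = now();
        }
        throw err;
      }
    },
  };
}
```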


Bulkhead Pattern

Isolate resource pools per service or tenant. One overloaded consumer doesn't starve the rest of the system.

// Separate thread pools / connection pools per downstream service
const paymentPool = new ConnectionPool({ max: 10 });   // Max 10 connections to payment service
const inventoryPool = new ConnectionPool({ max: 20 });  // Max 20 to inventory service

// If payment service is slow and exhausts its pool,
// inventory service pool is completely unaffected.
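The simplest bulkhead is a counting semaphore that fails fast once a pool is saturated. A sketch (one instance per downstream service):

```javascript
// At most `max` calls in flight; beyond that, reject immediately
// instead of queueing and starving the rest of the process.
function createBulkhead(max) {
  let inFlight = 0;
  return async function run(fn) {
    if (inFlight >= max) throw new Error('bulkhead full');
    inFlight += 1;
    try {
      return await fn();
    } finally {
      inFlight -= 1;
    }
  };
}
```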

📊 Layer 5: Observability is Infrastructure

You cannot debug a distributed system with console.log. At 1M req/day, observability is a prerequisite.

The Three Pillars

Pillar  | Tool                   | Purpose
------- | ---------------------- | -----------------------------
Metrics | Prometheus + Grafana   | Aggregated numbers over time
Logs    | ELK / Loki             | Discrete events with context
Traces  | Jaeger / OpenTelemetry | End-to-end request flow

RED Method for Services

Track Rate, Errors, Duration for every service.

const { Counter, Histogram } = require('prom-client');

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5], // p50, p95, p99 visible here
});

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method, route: req.path });
  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, route: req.path, status_code: res.statusCode });
    end();
  });
  next();
});

p95 and p99 latency matter more than averages at scale. An average of 50ms can hide that 1% of users are waiting 5 seconds.
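A quick nearest-rank illustration of why: with 2% of requests taking 5 seconds, the average still looks almost healthy while p99 exposes the tail (the numbers are made up for the example):

```javascript
// Nearest-rank percentile: sort, then take the ceil(p% * n)-th sample.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```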


Distributed Tracing with Context Propagation

Trace IDs must propagate across every service boundary (HTTP headers, queues, async jobs) so the full request flow can be reconstructed.

const { trace, context, propagation } = require('@opentelemetry/api');

// Outgoing HTTP request: inject trace context into headers
async function callInventoryService(orderId, headers = {}) {
  propagation.inject(context.active(), headers); // Injects traceparent header

  return fetch(`http://inventory-service/reserve/${orderId}`, { headers });
}

// Incoming request in inventory service: extract and continue trace
app.use((req, res, next) => {
  const ctx = propagation.extract(context.active(), req.headers);
  context.with(ctx, next); // All spans created in next() are children of caller's trace
});

SLO-Based Alerting

Alert on error budget burn rate, not raw error counts. This reduces alert fatigue and focuses on user impact.

# Prometheus alerting rule
- alert: HighErrorBudgetBurnRate
  expr: |
    (
      rate(http_requests_total{status_code=~"5.."}[1h]) /
      rate(http_requests_total[1h])
    ) > 0.01   # More than 1% error rate (if SLO is 99% success)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning too fast"
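The burn rate behind that alert is just the observed error rate divided by the error budget. A minimal version of the arithmetic:

```javascript
// With a 99% SLO the budget is 1%. Burn rate 1 means the budget is
// consumed exactly over the SLO window; >1 means it runs out early.
function burnRate(errorCount, totalCount, sloTarget) {
  const errorBudget = 1 - sloTarget;     // e.g. 0.01 for a 99% SLO
  const errorRate = errorCount / totalCount;
  return errorRate / errorBudget;
}
```

Multi-window burn-rate alerts (e.g. fast burn over 5m and slow burn over 1h) are the usual refinement once the basic rule is in place.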

🔒 Layer 6: Security Surface Grows With Traffic

Scale makes you a target. The attack surface grows proportionally.

Rate Limiting: Token Bucket vs Leaky Bucket

  • Token bucket: allows bursting (user has 100 tokens, each request costs 1, refills at 10/sec)
  • Leaky bucket: enforces a steady output rate regardless of burst
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');

// Fixed-window limiter; approximates token-bucket bursting within each window
const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute window
  max: 100,            // 100 requests per window per IP
  standardHeaders: true,
  store: new RedisStore({ client: redis }), // Distributed: works across instances
  keyGenerator: (req) => req.headers['x-api-key'] || req.ip, // Per API key, fallback to IP
});

app.use('/api/', limiter);
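A true token bucket, as described above, is itself only a few lines; the clock is injectable here so the refill behavior can be tested:

```javascript
// Start full at `capacity`; refill at `refillPerSec`. Each allowed
// request spends one token; an empty bucket rejects the request.
function createTokenBucket({ capacity, refillPerSec, now = () => Date.now() / 1000 }) {
  let tokens = capacity;
  let last = now();
  return function allow() {
    const t = now();
    tokens = Math.min(capacity, tokens + (t - last) * refillPerSec);
    last = t;
    if (tokens >= 1) {
      tokens -= 1;
      return true;
    }
    return false;
  };
}
```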

Apply rate limiting at the edge (CDN/WAF level) before bad traffic reaches your app servers.


JWT Short-Lived Tokens + Refresh Rotation

Long-lived tokens are liabilities at scale. Stolen tokens remain valid for their full lifetime.

// Issue short-lived access token + longer-lived refresh token
function issueTokens(userId) {
  const accessToken = jwt.sign(
    { userId, type: 'access' },
    process.env.JWT_SECRET,
    { expiresIn: '15m' }  // Short-lived
  );

  const refreshToken = jwt.sign(
    { userId, type: 'refresh', jti: crypto.randomUUID() }, // jti = unique ID for rotation
    process.env.REFRESH_SECRET,
    { expiresIn: '7d' }
  );

  return { accessToken, refreshToken };
}

// On refresh-rotate: invalidate old, issue new
async function rotateRefreshToken(oldToken) {
  const decoded = jwt.verify(oldToken, process.env.REFRESH_SECRET);

  // Check token hasn't been used before (detect reuse attacks)
  const isRevoked = await redis.get(`revoked:${decoded.jti}`);
  if (isRevoked) throw new Error('Refresh token reuse detected');

  // Revoke old token
  await redis.setex(`revoked:${decoded.jti}`, 7 * 86400, '1');

  return issueTokens(decoded.userId);
}

The Architectural Mindset Shift

                   | 10k req/day             | 1M req/day
------------------ | ----------------------- | ----------------------------------------
Consistency        | Strong consistency fine | Eventual consistency worth the tradeoff
Failures           | Rare, handle manually   | Expected, handle programmatically
Primary bottleneck | Application logic       | I/O, network, database
Deployment         | Downtime acceptable     | Zero-downtime mandatory
Debugging          | Logs are enough         | Distributed tracing required
Scaling axis       | Vertical                | Horizontal + auto
Caching            | Nice to have            | Core architectural layer
Async processing   | Optional                | Default pattern

The gap between 10k and 1M isn't just infrastructure.

It's a shift from building features to designing for failure.

Redundancy, idempotency, backpressure, and observability aren't overengineering at this scale; they're the baseline.


Which layer tends to break first in your experience: data, async, or infra? Drop it in the comments.

Top comments (2)

Andre Cytryn
the idempotency key section is the one most teams implement too late. what I've seen work well is scoping keys per operation type so you don't get cross-operation collisions (payment:${idempotencyKey} vs just ${idempotencyKey}). also worth noting that the 24h TTL for processed keys can create subtle bugs in financial systems where the same key gets reused after it expires. tying it to the business event's own retention policy (rather than a fixed TTL) tends to be safer in practice.

Andre Cytryn

the thundering herd lock implementation has a subtle footgun: if the DB query takes longer than the lock TTL (5s in the example), you can get a double-fetch anyway. worth either matching the TTL to your actual DB timeout, or using a lease extension. the stale-while-revalidate directive in the HTTP caching section is also underrated for exactly this problem - most thundering herd scenarios can be absorbed at the CDN edge before they ever reach your origin.