Scaling from 10k to 1M Requests/Day (Every Layer That Matters)

shipra Shankhwar

Most systems don't collapse under load because of bad code.

They collapse because the architecture was never designed for order-of-magnitude growth.

Here's a breakdown of every layer, with the patterns, tradeoffs, and code that bridge the gap between 10k and 1M requests/day.


⚙️ Layer 1: The Data Layer Breaks First

At 10k req/day, a single RDBMS instance handles reads and writes without complaint.

At 1M, you hit lock contention, connection exhaustion, and query latency that compounds across every endpoint.

Read Replicas

Route all SELECT queries to replicas. Keep writes on primary.

-- On primary (write)
INSERT INTO orders (user_id, total) VALUES (42, 199.99);

-- On replica (read)
SELECT * FROM orders WHERE user_id = 42;
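In application code, that routing rule can be captured in a small helper. This is a sketch, assuming `primary` and `replicas` are pg-style connection pools; the names and shape are illustrative, not a specific library's API:

```javascript
// Route writes (and anything not clearly a read) to the primary;
// round-robin plain SELECTs across replicas. Note that
// SELECT ... FOR UPDATE takes locks and must stay on the primary.
function createRouter(primary, replicas) {
  let next = 0;
  return {
    pick(sql) {
      const isRead = /^\s*select\b/i.test(sql) && !/for\s+update/i.test(sql);
      if (!isRead || replicas.length === 0) return primary;
      return replicas[next++ % replicas.length];
    },
  };
}
```

Flows that are sensitive to staleness (read-your-own-writes) can also be pinned to the primary.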

Watch for replication lag under heavy write load: a lagging replica can serve stale reads. Monitor with:

-- PostgreSQL: check replica lag
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;

Connection Pooling with PgBouncer

Raw TCP connections to Postgres are expensive. At scale, hundreds of app instances opening direct connections will exhaust max_connections.

Run PgBouncer in transaction mode: each transaction borrows a connection from the pool and releases it as soon as it completes. This dramatically reduces the number of real connections Postgres has to sustain.

# pgbouncer.ini
[databases]
mydb = host=127.0.0.1 port=5432 dbname=mydb

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20

CQRS: Separate Read and Write Models

Command Query Responsibility Segregation separates the write model (normalized, transactional) from the read model (denormalized, optimized for queries).

Write path:  API → Command Handler → Write DB (normalized)
                                          ↓ (event/sync)
Read path:   API → Query Handler  → Read DB (denormalized view)
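A toy end-to-end version of this flow, with both stores as in-memory Maps standing in for real databases (the event shape and handler names are illustrative):

```javascript
// Write model: normalized, one row per order.
const writeDb = new Map(); // orderId -> { userId, total }
// Read model: denormalized per-user summary, updated by a projector.
const readDb = new Map();  // userId -> { orderCount, lifetimeTotal }

// Command handler: local transaction on the write model, then emit an event.
function handlePlaceOrder(orderId, userId, total) {
  writeDb.set(orderId, { userId, total });
  project({ type: 'OrderPlaced', userId, total }); // the event/sync step
}

// Projector: applies the event to the denormalized view.
function project(event) {
  const view = readDb.get(event.userId) || { orderCount: 0, lifetimeTotal: 0 };
  view.orderCount += 1;
  view.lifetimeTotal += event.total;
  readDb.set(event.userId, view);
}

// Query handler: touches only the read model.
function getUserSummary(userId) {
  return readDb.get(userId);
}
```

In production the projector typically consumes events from a queue, which makes the read model eventually consistent.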

Read models can be precomputed materialized views or even a separate datastore (e.g., Elasticsearch for search, Redis for leaderboards).


Composite and Partial Indexes

-- Composite index: covers multi-column WHERE + ORDER BY
CREATE INDEX idx_orders_user_created 
ON orders (user_id, created_at DESC);

-- Partial index: only indexes rows matching a condition
-- Much smaller, faster for filtered queries
CREATE INDEX idx_active_users 
ON users (email) 
WHERE status = 'active';

Always validate with EXPLAIN ANALYZE:

EXPLAIN ANALYZE 
SELECT * FROM orders 
WHERE user_id = 42 AND created_at > now() - interval '30 days';

Seq Scan on a large table usually means a missing index. Index Scan means the planner is using yours.


⚡ Layer 2: Caching as an Architectural Layer

Every cache hit is a request your database, compute, and network never have to handle.

Cache-Aside Pattern (Lazy Loading)

The application checks the cache first. On a miss, it fetches from the DB and populates the cache.

async function getUser(userId) {
  const cacheKey = `user:${userId}`;

  // 1. Check cache
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // 2. Cache miss: fetch from DB
  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);

  // 3. Populate cache with TTL
  await redis.setex(cacheKey, 3600, JSON.stringify(user)); // 1 hour TTL

  return user;
}

Write-Through vs Write-Behind

Strategy      | How it works                      | Best for
------------- | --------------------------------- | --------------------------------
Write-through | Write to cache + DB synchronously | Read-heavy, consistency critical
Write-behind  | Write to cache, async flush to DB | Write-heavy, latency sensitive

// Write-through
async function updateUser(userId, data) {
  await db.query('UPDATE users SET name=$1 WHERE id=$2', [data.name, userId]);
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(data)); // sync
}

// Write-behind (async flush via queue)
async function updateUser(userId, data) {
  await redis.setex(`user:${userId}`, 3600, JSON.stringify(data));
  await queue.publish('db.write', { table: 'users', id: userId, data }); // async
}

Thundering Herd Prevention

When a hot cache key expires, thousands of requests can simultaneously hit the DB. This is the thundering herd problem.

Solution: Mutex on cache miss

async function getUserWithLock(userId) {
  const cacheKey = `user:${userId}`;
  const lockKey = `lock:${cacheKey}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  // Acquire lock: only one request hits the DB
  const lock = await redis.set(lockKey, '1', 'NX', 'EX', 5);
  if (!lock) {
    // Another request is fetching: wait and retry
    await sleep(100);
    return getUserWithLock(userId);
  }

  const user = await db.query('SELECT * FROM users WHERE id = $1', [userId]);
  await redis.setex(cacheKey, 3600, JSON.stringify(user));
  await redis.del(lockKey);

  return user;
}

HTTP Caching Headers

Underused but extremely powerful. A properly cached API response never hits your origin.

HTTP/1.1 200 OK
Cache-Control: public, max-age=60, stale-while-revalidate=300
ETag: "abc123"
Last-Modified: Sun, 22 Mar 2026 10:00:00 GMT
  • max-age=60: serve from cache for 60 seconds
  • stale-while-revalidate=300: serve stale while refreshing in the background (zero perceived latency)
  • ETag: client sends If-None-Match: "abc123" on next request; server returns 304 Not Modified if unchanged
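Server-side, the ETag handshake is only a few lines. A framework-agnostic sketch of the conditional-GET logic (header names follow the HTTP spec; the response shape here is illustrative):

```javascript
// Compare the client's If-None-Match header against the current ETag.
// On a match, short-circuit with 304 and no body; otherwise send the
// full response with caching headers.
function conditionalResponse(reqHeaders, currentEtag, body) {
  if (reqHeaders['if-none-match'] === currentEtag) {
    return { status: 304, headers: { ETag: currentEtag }, body: null };
  }
  return {
    status: 200,
    headers: {
      ETag: currentEtag,
      'Cache-Control': 'public, max-age=60, stale-while-revalidate=300',
    },
    body,
  };
}
```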

🔀 Layer 3: Async-First Architecture

The user doesn't need to wait for your email to send, your analytics to log, or your thumbnail to generate. Synchronous request-response for every operation is a design choice that doesn't survive scale.

Message Queue Pattern (SQS / RabbitMQ / Kafka)

// Producer: HTTP handler returns immediately
app.post('/orders', async (req, res) => {
  const order = await db.createOrder(req.body);

  // Don't wait: publish to queue and respond
  await sqs.sendMessage({
    QueueUrl: process.env.ORDER_QUEUE_URL,
    MessageBody: JSON.stringify({ orderId: order.id }),
  });

  res.status(202).json({ orderId: order.id }); // 202 Accepted
});

// Consumer: runs separately, processes async
async function processOrder(message) {
  const { orderId } = JSON.parse(message.Body);
  await sendConfirmationEmail(orderId);
  await updateInventory(orderId);
  await triggerAnalyticsEvent(orderId);
}

Dead Letter Queue (DLQ)

Failed messages should never be silently dropped. Route them to a DLQ for inspection and replay.

Normal Queue → Consumer fails 3x → Dead Letter Queue
                                          ↓
                                    Alert + manual replay
// SQS queue with DLQ configured
{
  "RedrivePolicy": {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:orders-dlq",
    "maxReceiveCount": 3  // Move to DLQ after 3 failures
  }
}
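SQS enforces `maxReceiveCount` server-side; for other brokers the semantics can be mirrored in-process. A queue-agnostic sketch (the shapes of `message` and `onDeadLetter` are assumptions, not a real broker API):

```javascript
// Track receive counts per message id; after maxReceiveCount failures,
// hand the message to the DLQ callback instead of retrying again.
function createRedrive({ maxReceiveCount, onDeadLetter }) {
  const receives = new Map();
  return function consume(message, handler) {
    const count = (receives.get(message.id) || 0) + 1;
    receives.set(message.id, count);
    try {
      handler(message);
      receives.delete(message.id); // success: forget the message
    } catch (err) {
      if (count >= maxReceiveCount) {
        receives.delete(message.id);
        onDeadLetter(message, err); // alert + manual replay from here
      }
      // Otherwise the message stays on the queue and is redelivered.
    }
  };
}
```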

Idempotency Keys

At-least-once delivery means consumers may process the same message twice. Idempotent consumers handle this safely.

async function processOrder(message) {
  const { orderId, idempotencyKey } = JSON.parse(message.Body);

  // Check if already processed
  const alreadyProcessed = await redis.get(`processed:${idempotencyKey}`);
  if (alreadyProcessed) return; // Skip duplicate

  await fulfillOrder(orderId);

  // Mark as processed with TTL
  await redis.setex(`processed:${idempotencyKey}`, 86400, '1');
}

The Saga Pattern for Distributed Transactions

Traditional 2-phase commit (2PC) doesn't work across microservices. The Saga pattern replaces it with a sequence of local transactions, each with a compensating rollback action.

PlaceOrder Saga:
  1. Reserve inventory       → compensate: release inventory
  2. Charge payment          → compensate: refund payment
  3. Create shipment         → compensate: cancel shipment
  4. Send confirmation email → (no compensation needed)

If step 3 fails → run compensating transactions for steps 2 and 1.
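A minimal saga runner makes the flow above concrete: execute steps in order and, on failure, run the compensations of the already-completed steps in reverse (a sketch; real orchestrators also persist saga state so they survive crashes):

```javascript
// Each step is { action, compensate? }. On any failure, unwind what
// has been done so far, newest first.
async function runSaga(steps) {
  const completed = [];
  try {
    for (const step of steps) {
      await step.action();
      completed.push(step);
    }
    return { ok: true };
  } catch (err) {
    for (const step of completed.reverse()) {
      if (step.compensate) await step.compensate();
    }
    return { ok: false, error: err.message };
  }
}
```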

🌐 Layer 4: Infrastructure That Scales Horizontally

Vertical scaling has a ceiling and a single point of failure. Horizontal scaling is the only path forward.

Stateless Services

Session state stored server-side makes horizontal scaling impossible. Any instance must be able to handle any request.

// ❌ Stateful: breaks with multiple instances
app.post('/login', (req, res) => {
  req.session.userId = user.id; // Stored in memory, instance-specific
});

// ✅ Stateless: works across any instance
app.post('/login', async (req, res) => {
  const token = jwt.sign({ userId: user.id }, process.env.JWT_SECRET, { expiresIn: '1h' });
  res.json({ token }); // Client holds state
});

Circuit Breaker Pattern

When a downstream service degrades, stop sending requests to it. Fail fast, return a fallback, allow recovery.

const CircuitBreaker = require('opossum');

const options = {
  timeout: 3000,         // If function takes > 3s, trigger failure
  errorThresholdPercentage: 50, // Open circuit if 50% of requests fail
  resetTimeout: 30000,   // After 30s, try again (half-open state)
};

const breaker = new CircuitBreaker(callPaymentService, options);

breaker.fallback(() => ({ status: 'queued', message: 'Payment queued for retry' }));

breaker.on('open', () => console.warn('Circuit OPEN: payment service degraded'));
breaker.on('halfOpen', () => console.info('Circuit HALF-OPEN: testing recovery'));
breaker.on('close', () => console.info('Circuit CLOSED: payment service recovered'));

// Usage
const result = await breaker.fire(paymentData);

States: CLOSED (normal) → OPEN (failing fast) → HALF-OPEN (testing recovery) → CLOSED
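The same state machine can be hand-rolled in a few lines to make those transitions concrete (a sketch, not a replacement for a hardened library like opossum; the clock is injectable so it can be tested):

```javascript
// CLOSED: requests flow. After `failureThreshold` consecutive failures,
// OPEN: fail fast. After `resetTimeoutMs`, one probe runs in HALF-OPEN;
// success closes the circuit, failure re-opens it.
function createBreaker({ failureThreshold, resetTimeoutMs, now = Date.now }) {
  let state = 'CLOSED';
  let failures = 0;
  let openedAt = 0;
  return {
    state: () => state,
    async fire(fn) {
      if (state === 'OPEN') {
        if (now() - openedAt < resetTimeoutMs) throw new Error('circuit open');
        state = 'HALF-OPEN'; // let one probe through
      }
      try {
        const result = await fn();
        state = 'CLOSED';
        failures = 0;
        return result;
      } catch (err) {
        failures += 1;
        if (state === 'HALF-OPEN' || failures >= failureThreshold) {
          state = 'OPEN';
          openedAt = now();
        }
        throw err;
      }
    },
  };
}
```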


Bulkhead Pattern

Isolate resource pools per service or tenant. One overloaded consumer doesn't starve the rest of the system.

// Separate thread pools / connection pools per downstream service
const paymentPool = new ConnectionPool({ max: 10 });   // Max 10 connections to payment service
const inventoryPool = new ConnectionPool({ max: 20 });  // Max 20 to inventory service

// If payment service is slow and exhausts its pool,
// inventory service pool is completely unaffected.
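The simplest bulkhead is a counting semaphore that fails fast once a pool is saturated. A sketch (one instance per downstream service):

```javascript
// At most `max` calls in flight; beyond that, reject immediately
// instead of queueing and starving the rest of the process.
function createBulkhead(max) {
  let inFlight = 0;
  return async function run(fn) {
    if (inFlight >= max) throw new Error('bulkhead full');
    inFlight += 1;
    try {
      return await fn();
    } finally {
      inFlight -= 1;
    }
  };
}
```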

📊 Layer 5: Observability is Infrastructure

You cannot debug a distributed system with console.log. At 1M req/day, observability is a prerequisite.

The Three Pillars

Pillar  | Tool                   | Purpose
------- | ---------------------- | -----------------------------
Metrics | Prometheus + Grafana   | Aggregated numbers over time
Logs    | ELK / Loki             | Discrete events with context
Traces  | Jaeger / OpenTelemetry | End-to-end request flow

RED Method for Services

Track Rate, Errors, Duration for every service.

const { Counter, Histogram } = require('prom-client');

const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});

const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5], // p50, p95, p99 visible here
});

app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer({ method: req.method, route: req.path });
  res.on('finish', () => {
    httpRequestsTotal.inc({ method: req.method, route: req.path, status_code: res.statusCode });
    end();
  });
  next();
});

p95 and p99 latency matter more than averages at scale. An average of 50ms can hide that 1% of users are waiting 5 seconds.
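A quick nearest-rank illustration of why: with 2% of requests taking 5 seconds, the average still looks almost healthy while p99 exposes the tail (the numbers are made up for the example):

```javascript
// Nearest-rank percentile: sort, then take the ceil(p% * n)-th sample.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```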


Distributed Tracing with Context Propagation

Trace IDs must propagate across every service boundary (HTTP headers, queues, async jobs) so the full request flow can be reconstructed.

const { trace, context, propagation } = require('@opentelemetry/api');

// Outgoing HTTP request: inject trace context into headers
async function callInventoryService(orderId, headers = {}) {
  propagation.inject(context.active(), headers); // Injects traceparent header

  return fetch(`http://inventory-service/reserve/${orderId}`, { headers });
}

// Incoming request in inventory service: extract and continue trace
app.use((req, res, next) => {
  const ctx = propagation.extract(context.active(), req.headers);
  context.with(ctx, next); // All spans created in next() are children of caller's trace
});

SLO-Based Alerting

Alert on error budget burn rate, not raw error counts. This reduces alert fatigue and focuses on user impact.

# Prometheus alerting rule
- alert: HighErrorBudgetBurnRate
  expr: |
    (
      rate(http_requests_total{status_code=~"5.."}[1h]) /
      rate(http_requests_total[1h])
    ) > 0.01   # More than 1% error rate (if SLO is 99% success)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning too fast"
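The burn rate behind that alert is just the observed error rate divided by the error budget. A minimal version of the arithmetic:

```javascript
// With a 99% SLO the budget is 1%. Burn rate 1 means the budget is
// consumed exactly over the SLO window; >1 means it runs out early.
function burnRate(errorCount, totalCount, sloTarget) {
  const errorBudget = 1 - sloTarget;     // e.g. 0.01 for a 99% SLO
  const errorRate = errorCount / totalCount;
  return errorRate / errorBudget;
}
```

Multi-window burn-rate alerts (e.g. fast burn over 5m and slow burn over 1h) are the usual refinement once the basic rule is in place.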

🔒 Layer 6: Security Surface Grows With Traffic

Scale makes you a target. The attack surface grows proportionally.

Rate Limiting: Token Bucket vs Leaky Bucket

  • Token bucket: allows bursting (user has 100 tokens, each request costs 1, refills at 10/sec)
  • Leaky bucket: enforces a steady output rate regardless of burst
const rateLimit = require('express-rate-limit');
const RedisStore = require('rate-limit-redis');

// Fixed-window limiter; approximates token-bucket bursting within each window
const limiter = rateLimit({
  windowMs: 60 * 1000, // 1 minute window
  max: 100,            // 100 requests per window per IP
  standardHeaders: true,
  store: new RedisStore({ client: redis }), // Distributed: works across instances
  keyGenerator: (req) => req.headers['x-api-key'] || req.ip, // Per API key, fallback to IP
});

app.use('/api/', limiter);
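A true token bucket, as described above, is itself only a few lines; the clock is injectable here so the refill behavior can be tested:

```javascript
// Start full at `capacity`; refill at `refillPerSec`. Each allowed
// request spends one token; an empty bucket rejects the request.
function createTokenBucket({ capacity, refillPerSec, now = () => Date.now() / 1000 }) {
  let tokens = capacity;
  let last = now();
  return function allow() {
    const t = now();
    tokens = Math.min(capacity, tokens + (t - last) * refillPerSec);
    last = t;
    if (tokens >= 1) {
      tokens -= 1;
      return true;
    }
    return false;
  };
}
```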

Apply rate limiting at the edge (CDN/WAF level) before bad traffic reaches your app servers.


JWT Short-Lived Tokens + Refresh Rotation

Long-lived tokens are liabilities at scale. Stolen tokens remain valid for their full lifetime.

// Issue short-lived access token + longer-lived refresh token
function issueTokens(userId) {
  const accessToken = jwt.sign(
    { userId, type: 'access' },
    process.env.JWT_SECRET,
    { expiresIn: '15m' }  // Short-lived
  );

  const refreshToken = jwt.sign(
    { userId, type: 'refresh', jti: crypto.randomUUID() }, // jti = unique ID for rotation
    process.env.REFRESH_SECRET,
    { expiresIn: '7d' }
  );

  return { accessToken, refreshToken };
}

// On refresh-rotate: invalidate old, issue new
async function rotateRefreshToken(oldToken) {
  const decoded = jwt.verify(oldToken, process.env.REFRESH_SECRET);

  // Check token hasn't been used before (detect reuse attacks)
  const isRevoked = await redis.get(`revoked:${decoded.jti}`);
  if (isRevoked) throw new Error('Refresh token reuse detected');

  // Revoke old token
  await redis.setex(`revoked:${decoded.jti}`, 7 * 86400, '1');

  return issueTokens(decoded.userId);
}

The Architectural Mindset Shift

                   | 10k req/day             | 1M req/day
------------------ | ----------------------- | ----------------------------------------
Consistency        | Strong consistency fine | Eventual consistency worth the tradeoff
Failures           | Rare, handle manually   | Expected, handle programmatically
Primary bottleneck | Application logic       | I/O, network, database
Deployment         | Downtime acceptable     | Zero-downtime mandatory
Debugging          | Logs are enough         | Distributed tracing required
Scaling axis       | Vertical                | Horizontal + auto
Caching            | Nice to have            | Core architectural layer
Async processing   | Optional                | Default pattern

The gap between 10k and 1M isn't just infrastructure.

It's a shift from building features to designing for failure.

Redundancy, idempotency, backpressure, and observability aren't overengineering at this scale; they're the baseline.


Which layer tends to break first in your experience: data, async, or infra? Drop it in the comments.

Top comments (2)

Andre Cytryn
the idempotency key section is the one most teams implement too late. what I've seen work well is scoping keys per operation type so you don't get cross-operation collisions (payment:${idempotencyKey} vs just ${idempotencyKey}). also worth noting that the 24h TTL for processed keys can create subtle bugs in financial systems where the same key gets reused after it expires. tying it to the business event's own retention policy (rather than a fixed TTL) tends to be safer in practice.

Andre Cytryn

the thundering herd lock implementation has a subtle footgun: if the DB query takes longer than the lock TTL (5s in the example), you can get a double-fetch anyway. worth either matching the TTL to your actual DB timeout, or using a lease extension. the stale-while-revalidate directive in the HTTP caching section is also underrated for exactly this problem - most thundering herd scenarios can be absorbed at the CDN edge before they ever reach your origin.