From 500 to 50,000 Concurrent Users: Architecture Decisions That Actually Scale

#aws #postgres #redis #webdev

TL;DR: Black Friday on a 2M-product platform. Before adding a single server, EXPLAIN ANALYZE found three queries doing sequential scans — fixing the indexes dropped query times from 800ms to 12ms. PgBouncer multiplexed 8,000 connections down to 150 and gave us 4x throughput. Redis cached the right three things and nothing else. Here's the full picture.

Black Friday at 12,000 concurrent users on a platform you migrated from Magento six months ago is a different kind of stress test. The team that ran it before me had a sensible instinct when load spiked: add app servers. That instinct is almost always wrong, and I spent the three weeks before the event proving it.

Start with the database, not the app servers

Every scaling conversation jumps to horizontal app server scaling. The database is almost always the first bottleneck. Before touching anything else, run:

-- Find slow queries in production (pg_stat_statements must be enabled)
SELECT
  query,
  calls,
  mean_exec_time,
  total_exec_time,
  rows
FROM pg_stat_statements
WHERE mean_exec_time > 50   -- anything over 50ms is a candidate
ORDER BY mean_exec_time DESC
LIMIT 20;

We found three queries. One of them, a product listing query joining categories and inventory:

-- Before: full sequential scan, 800ms at scale
EXPLAIN ANALYZE
SELECT p.*, c.name AS category_name, i.quantity
FROM products p
JOIN categories c ON c.id = p.category_id
JOIN inventory i ON i.product_id = p.id
WHERE p.status = 'active'
  AND c.slug = 'electronics'
ORDER BY p.created_at DESC
LIMIT 50;

-- Result: Seq Scan on products (cost=0.00..48234.12 rows=2089432)
--         Execution Time: 847.392 ms

The fix was a composite index that matched the filter and sort together:

-- Covering index for the status + category_id + created_at access pattern
CREATE INDEX CONCURRENTLY idx_products_status_category_created
  ON products (status, category_id, created_at DESC)
  INCLUDE (id, title, price, slug);

-- Separate index for the categories join on slug
CREATE INDEX CONCURRENTLY idx_categories_slug ON categories (slug);

-- After: index scan, 12ms
-- Result: Index Scan using idx_products_status_category_created
--         Execution Time: 11.843 ms

800ms to 12ms. No new servers, no code changes, no cost increase. Index analysis before Black Friday is table stakes.

Connection pooling is not optional

At 12,000 concurrent users, your Node.js app servers will each maintain a connection pool. Ten ECS tasks, 100 connections each: you're already at 1,000 connections against Postgres. Scale to 20 tasks under load and PostgreSQL starts refusing connections entirely — it degrades sharply above roughly 200 active connections.

The fix is a connection pooler sitting between your app tier and Postgres. We used PgBouncer in transaction pooling mode:

# pgbouncer.ini
[databases]
production = host=your-rds-endpoint port=5432 dbname=production

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5432
pool_mode = transaction         # key: connection released after each transaction
max_client_conn = 10000         # app-side connections (generous ceiling)
default_pool_size = 150         # actual Postgres connections (stay under 200)
reserve_pool_size = 10
reserve_pool_timeout = 5
server_idle_timeout = 600
log_connections = 0
log_disconnections = 0

Before PgBouncer: 8,000 app-side connections → Postgres melting.
After PgBouncer: 8,000 app-side connections → 150 Postgres connections → 4x throughput increase.

One caveat with transaction pooling mode: you cannot use SET statements, advisory locks, or LISTEN/NOTIFY across transactions — the connection you get back may not be the same one. Keep that in mind if you rely on session-level state.

Redis: cache the right data and nothing else

Redis is not a general caching layer. The wrong data in Redis creates distributed consistency problems that are much harder to debug than slow queries.

We cached exactly three things:

1. Product page data          TTL: 5 minutes
2. User session data          TTL: 30 minutes (sliding)
3. Rate limit counters        TTL: 1 minute (sliding window)

We deliberately did not cache inventory counts. Here is why that matters:

// Wrong: cached inventory leads to overselling
async function getProductPage(productId) {
  const cached = await redis.get(`product:${productId}`);
  if (cached) return JSON.parse(cached); // inventory count may be stale

  const product = await db.query(
    `SELECT p.*, i.quantity FROM products p
     JOIN inventory i ON i.product_id = p.id
     WHERE p.id = $1`,
    [productId]
  );

  await redis.setex(`product:${productId}`, 300, JSON.stringify(product.rows[0]));
  return product.rows[0];
}

// Right: cache everything except the inventory count
async function getProductPage(productId) {
  const [cachedProduct, liveInventory] = await Promise.all([
    redis.get(`product:${productId}`),
    db.query(`SELECT quantity FROM inventory WHERE product_id = $1`, [productId])
  ]);

  const product = cachedProduct
    ? JSON.parse(cachedProduct)
    : await fetchAndCacheProduct(productId); // separate function

  return { ...product, quantity: liveInventory.rows[0].quantity };
}

Stale stock counts cost conversions and cause overselling. Read inventory directly from the database, optimised by index. The indexed inventory read takes 3ms — that is a fine trade for correctness.

Serverless for spiky, stateful for baseline

Not everything needs to survive sustained load. We split the workload by its traffic shape:

Kept on ECS (pre-warmed instances):

Checkout pipeline — latency-sensitive, stateful, must not cold-start
Product search — sustained load, stateful Elasticsearch client
Session API — low latency requirement, high call frequency

Moved to Lambda:

Email notifications — spiky (order confirmation bursts), fully async
PDF receipt generation — CPU-intensive but async, nobody waits for it
Inventory update event processing — event-driven, SQS-triggered

ECS (checkout, search, session)      ← baseline traffic, pre-warmed
Lambda (email, PDF, inventory)       ← burst traffic, event-driven

This cut our baseline EC2 spend by 35% while giving us effectively unlimited burst capacity for async workloads. The checkout service never competes for resources with a PDF generation spike.

Load test before you need to

Three weeks before Black Friday we ran k6 against a production-sized staging environment:

// k6 load test script (simplified)
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 5000 },    // ramp up
    { duration: '10m', target: 15000 },  // hold at 15k (above our target)
    { duration: '5m', target: 0 },       // ramp down
  ],
  thresholds: {
    http_req_duration: ['p95<500'],       // 95th percentile under 500ms
    http_req_failed: ['rate<0.01'],       // less than 1% errors
  },
};

export default function () {
  const res = http.get(`https://staging.yourdomain.com/products/${randomProductId()}`);
  check(res, { 'status 200': r => r.status === 200 });
  sleep(1);
}

The test revealed one non-obvious problem: our CloudFront cache behaviour headers were wrong. Browsers were re-fetching product images on every page load because Cache-Control: no-cache had been set during development and never changed.

Before fix: ~40% of image requests reached origin
After fix:  CDN cache hit rate → 94%

One CloudFront distribution setting. Under actual Black Friday load, the CDN absorbed 94% of all traffic. Origin never saw the full load.

What actually happened on Black Friday

Peak concurrent users: ~12,000
p95 response time on product pages: 210ms
Error rate: 0.3%
Database connections at peak: 148 (PgBouncer keeping us safely under 150)
Zero rollbacks, zero incidents

The database index changes, PgBouncer, and the CDN cache fix together did more work than any amount of horizontal scaling would have. We added zero new ECS tasks during the event.

The checklist before your next traffic event

[ ] Run pg_stat_statements analysis — fix any query over 50ms
[ ] Deploy PgBouncer in transaction pooling mode between app and Postgres
[ ] Audit Redis cache keys — evict anything that must be consistent
[ ] Split traffic-shaped workloads: stateful/latency-sensitive on ECS, async/spiky on Lambda
[ ] Run a k6 load test at 125% of your peak estimate, three weeks out
[ ] Verify CDN cache headers with curl -I on key asset URLs
[ ] Set a CloudWatch alarm on Postgres connection count (alert at 80% of max_connections)

Originally published at https://shubhamkansal.com/blog/scaling-from-500-to-50000-concurrent-users. I'm Shubham Kansal, a freelance Full Stack & DevOps engineer — more at https://shubhamkansal.com.