TL;DR: Black Friday on a 2M-product platform. Before adding a single server, EXPLAIN ANALYZE found three queries doing sequential scans — fixing the indexes dropped query times from 800ms to 12ms. PgBouncer multiplexed 8,000 connections down to 150 and gave us 4x throughput. Redis cached the right three things and nothing else. Here's the full picture.
Black Friday at 12,000 concurrent users on a platform you migrated from Magento six months ago is a different kind of stress test. The team that ran it before me had a sensible instinct when load spiked: add app servers. That instinct is almost always wrong, and I spent the three weeks before the event proving it.
Start with the database, not the app servers
Every scaling conversation jumps to horizontal app server scaling. The database is almost always the first bottleneck. Before touching anything else, run:
-- Find slow queries in production (pg_stat_statements must be enabled)
SELECT
query,
calls,
mean_exec_time,
total_exec_time,
rows
FROM pg_stat_statements
WHERE mean_exec_time > 50 -- anything over 50ms is a candidate
ORDER BY mean_exec_time DESC
LIMIT 20;
We found three queries. One of them, a product listing query joining categories and inventory:
-- Before: full sequential scan, 800ms at scale
EXPLAIN ANALYZE
SELECT p.*, c.name AS category_name, i.quantity
FROM products p
JOIN categories c ON c.id = p.category_id
JOIN inventory i ON i.product_id = p.id
WHERE p.status = 'active'
AND c.slug = 'electronics'
ORDER BY p.created_at DESC
LIMIT 50;
-- Result: Seq Scan on products (cost=0.00..48234.12 rows=2089432)
-- Execution Time: 847.392 ms
The fix was a composite index that matched the filter and sort together:
-- Covering index for the status + category_id + created_at access pattern
CREATE INDEX CONCURRENTLY idx_products_status_category_created
ON products (status, category_id, created_at DESC)
INCLUDE (id, title, price, slug);
-- Separate index for the categories join on slug
CREATE INDEX CONCURRENTLY idx_categories_slug ON categories (slug);
-- After: index scan, 12ms
-- Result: Index Scan using idx_products_status_category_created
-- Execution Time: 11.843 ms
800ms to 12ms. No new servers, no code changes, no cost increase. Index analysis before Black Friday is table stakes.
Connection pooling is not optional
At 12,000 concurrent users, your Node.js app servers will each maintain a connection pool. Ten ECS tasks, 100 connections each: you're already at 1,000 connections against Postgres. Scale to 20 tasks under load and PostgreSQL starts refusing connections entirely — it degrades sharply above roughly 200 active connections.
The fix is a connection pooler sitting between your app tier and Postgres. We used PgBouncer in transaction pooling mode:
# pgbouncer.ini
[databases]
production = host=your-rds-endpoint port=5432 dbname=production
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 5432
pool_mode = transaction # key: connection released after each transaction
max_client_conn = 10000 # app-side connections (generous ceiling)
default_pool_size = 150 # actual Postgres connections (stay under 200)
reserve_pool_size = 10
reserve_pool_timeout = 5
server_idle_timeout = 600
log_connections = 0
log_disconnections = 0
Before PgBouncer: 8,000 app-side connections → Postgres melting.
After PgBouncer: 8,000 app-side connections → 150 Postgres connections → 4x throughput increase.
One caveat with transaction pooling mode: you cannot use SET statements, advisory locks, or LISTEN/NOTIFY across transactions — the connection you get back may not be the same one. Keep that in mind if you rely on session-level state.
Redis: cache the right data and nothing else
Redis is not a general caching layer. The wrong data in Redis creates distributed consistency problems that are much harder to debug than slow queries.
We cached exactly three things:
1. Product page data TTL: 5 minutes
2. User session data TTL: 30 minutes (sliding)
3. Rate limit counters TTL: 1 minute (sliding window)
We deliberately did not cache inventory counts. Here is why that matters:
// Wrong: cached inventory leads to overselling
async function getProductPage(productId) {
const cached = await redis.get(`product:${productId}`);
if (cached) return JSON.parse(cached); // inventory count may be stale
const product = await db.query(
`SELECT p.*, i.quantity FROM products p
JOIN inventory i ON i.product_id = p.id
WHERE p.id = $1`,
[productId]
);
await redis.setex(`product:${productId}`, 300, JSON.stringify(product.rows[0]));
return product.rows[0];
}
// Right: cache everything except the inventory count
async function getProductPage(productId) {
const [cachedProduct, liveInventory] = await Promise.all([
redis.get(`product:${productId}`),
db.query(`SELECT quantity FROM inventory WHERE product_id = $1`, [productId])
]);
const product = cachedProduct
? JSON.parse(cachedProduct)
: await fetchAndCacheProduct(productId); // separate function
return { ...product, quantity: liveInventory.rows[0].quantity };
}
Stale stock counts cost conversions and cause overselling. Read inventory directly from the database, optimised by index. The indexed inventory read takes 3ms — that is a fine trade for correctness.
Serverless for spiky, stateful for baseline
Not everything needs to survive sustained load. We split the workload by its traffic shape:
Kept on ECS (pre-warmed instances):
- Checkout pipeline — latency-sensitive, stateful, must not cold-start
- Product search — sustained load, stateful Elasticsearch client
- Session API — low latency requirement, high call frequency
Moved to Lambda:
- Email notifications — spiky (order confirmation bursts), fully async
- PDF receipt generation — CPU-intensive but async, nobody waits for it
- Inventory update event processing — event-driven, SQS-triggered
ECS (checkout, search, session) ← baseline traffic, pre-warmed
Lambda (email, PDF, inventory) ← burst traffic, event-driven
This cut our baseline EC2 spend by 35% while giving us effectively unlimited burst capacity for async workloads. The checkout service never competes for resources with a PDF generation spike.
Load test before you need to
Three weeks before Black Friday we ran k6 against a production-sized staging environment:
// k6 load test script (simplified)
import http from 'k6/http';
import { check, sleep } from 'k6';
export const options = {
stages: [
{ duration: '5m', target: 5000 }, // ramp up
{ duration: '10m', target: 15000 }, // hold at 15k (above our target)
{ duration: '5m', target: 0 }, // ramp down
],
thresholds: {
http_req_duration: ['p95<500'], // 95th percentile under 500ms
http_req_failed: ['rate<0.01'], // less than 1% errors
},
};
export default function () {
const res = http.get(`https://staging.yourdomain.com/products/${randomProductId()}`);
check(res, { 'status 200': r => r.status === 200 });
sleep(1);
}
The test revealed one non-obvious problem: our CloudFront cache behaviour headers were wrong. Browsers were re-fetching product images on every page load because Cache-Control: no-cache had been set during development and never changed.
Before fix: ~40% of image requests reached origin
After fix: CDN cache hit rate → 94%
One CloudFront distribution setting. Under actual Black Friday load, the CDN absorbed 94% of all traffic. Origin never saw the full load.
What actually happened on Black Friday
- Peak concurrent users: ~12,000
- p95 response time on product pages: 210ms
- Error rate: 0.3%
- Database connections at peak: 148 (PgBouncer keeping us safely under 150)
- Zero rollbacks, zero incidents
The database index changes, PgBouncer, and the CDN cache fix together did more work than any amount of horizontal scaling would have. We added zero new ECS tasks during the event.
The checklist before your next traffic event
- [ ] Run
pg_stat_statementsanalysis — fix any query over 50ms - [ ] Deploy PgBouncer in transaction pooling mode between app and Postgres
- [ ] Audit Redis cache keys — evict anything that must be consistent
- [ ] Split traffic-shaped workloads: stateful/latency-sensitive on ECS, async/spiky on Lambda
- [ ] Run a k6 load test at 125% of your peak estimate, three weeks out
- [ ] Verify CDN cache headers with
curl -Ion key asset URLs - [ ] Set a CloudWatch alarm on Postgres connection count (alert at 80% of
max_connections)
Originally published at https://shubhamkansal.com/blog/scaling-from-500-to-50000-concurrent-users. I'm Shubham Kansal, a freelance Full Stack & DevOps engineer — more at https://shubhamkansal.com.
Top comments (0)