Nobody warned me. I deployed at 4:57 PM on a Friday, closed my laptop, and woke up Saturday to 47 Slack messages, a database on fire, and users who could not log in.
This is not a theory post. This is the exact breakdown of what failed, why it failed, and the step-by-step fixes I now run on every project before it ever touches production again. If you ship Node.js or React apps, read this before your Friday deploy.
What Actually Broke (And Why)
Here is the damage report from that weekend:
- Login endpoint returned 503 after 200 concurrent users
- Dashboard loaded in 14 seconds instead of 1.2 seconds
- One rogue API route had no rate limiting and got hammered by a bot within 6 hours of launch
- App crashed on restart because sessions were stored in memory, not persisted
- Zero visibility into what was happening because logging was just console.log
Every single one of these is fixable in an afternoon. I just did not know to fix them before launch.
Fix 1: Stop Your Login Endpoint From Dying Under Load
The login route was doing three database calls per request with zero caching. At 200 concurrent users, the database gave up.
First, find your slow routes before users find them for you:
// middleware/timing.js
export function timingMiddleware(req, res, next) {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const ms = Number(process.hrtime.bigint() - start) / 1_000_000;
    if (ms > 300) {
      console.warn(`SLOW: [${req.method}] ${req.path} -- ${ms.toFixed(1)}ms`);
    }
  });
  next();
}

// app.js
import { timingMiddleware } from './middleware/timing.js';
app.use(timingMiddleware);
Any route consistently above 300ms is a problem waiting to go public. Fix it before launch, not during it.
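To tell a route that is consistently slow from one that had a single bad request, it helps to keep a short window of timings per route and look at a percentile instead of individual warnings. What follows is a minimal sketch, assuming an in-memory window is good enough for a pre-launch check. `LatencyTracker`, the window size, and the p95 cutoff are illustrative choices, not part of the middleware above.

```javascript
// Hypothetical helper: keeps the last `windowSize` timings per route in memory.
class LatencyTracker {
  constructor(windowSize = 100) {
    this.windowSize = windowSize;
    this.samples = new Map(); // route key -> array of recent durations in ms
  }

  record(routeKey, ms) {
    const list = this.samples.get(routeKey) ?? [];
    list.push(ms);
    if (list.length > this.windowSize) list.shift(); // drop the oldest sample
    this.samples.set(routeKey, list);
  }

  // Nearest-rank percentile over the current window; null if no samples yet.
  percentile(routeKey, p) {
    const list = this.samples.get(routeKey);
    if (!list || list.length === 0) return null;
    const sorted = [...list].sort((a, b) => a - b);
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[Math.max(rank, 1) - 1];
  }

  // Routes whose p95 over the window is above the threshold.
  slowRoutes(thresholdMs = 300) {
    return [...this.samples.keys()].filter(
      (key) => this.percentile(key, 95) > thresholdMs
    );
  }
}
```

One way to wire it in would be to call `tracker.record(...)` inside the `finish` handler with a key built from the method and the route pattern, so that `/api/dashboard/1` and `/api/dashboard/2` land in the same bucket.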
Fix 2: Add a Cache Layer That Actually Makes Sense
Most developers throw everything into one Redis instance. That is wrong. There are four tiers and mixing them up costs you both performance and correctness.
CDN tier handles static files and public marketing pages. Your server should never touch these at all.
Application tier handles expensive computed data like user dashboards and analytics summaries.
Query tier handles repeated identical database lookups.
Browser tier handles files users already downloaded once.
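For the CDN and browser tiers, most of the work is done by the `Cache-Control` response header rather than by any server-side store. Here is a rough sketch of how content types might map to header values. The policy names, the `cacheControlFor` helper, and the specific lifetimes are assumptions to adjust per project.

```javascript
// Hypothetical mapping from a kind of content to a Cache-Control value.
const CACHE_POLICIES = {
  // Fingerprinted static assets (e.g. app.3f9a1c.js): safe to cache for a year.
  staticAsset: 'public, max-age=31536000, immutable',
  // Public marketing pages: short browser cache, longer shared (CDN) cache.
  publicPage: 'public, max-age=60, s-maxage=600',
  // Per-user API responses: keep them out of browser and CDN caches entirely.
  userData: 'private, no-store',
};

function cacheControlFor(kind) {
  const policy = CACHE_POLICIES[kind];
  if (!policy) throw new Error(`Unknown cache policy: ${kind}`);
  return policy;
}
```

In an Express handler this would be applied with something like `res.set('Cache-Control', cacheControlFor('publicPage'))` before sending the response.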
Here is the application tier cache that cut our database load by 80%:
// lib/cache.js
import { createClient } from 'redis';

export const redis = createClient({ url: process.env.REDIS_URL });
// Without an 'error' listener, a dropped connection can crash the process.
redis.on('error', (err) => console.error('Redis error:', err));
await redis.connect();

export async function getOrSet(key, ttlSeconds, fetchFn) {
  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached);
  const fresh = await fetchFn();
  await redis.setEx(key, ttlSeconds, JSON.stringify(fresh));
  return fresh;
}

// routes/dashboard.js
import { getOrSet } from '../lib/cache.js';

app.get('/api/dashboard/:userId', async (req, res) => {
  const data = await getOrSet(
    `dashboard:${req.params.userId}`,
    60, // cache each dashboard for 60 seconds
    () => buildDashboard(req.params.userId)
  );
  res.json(data);
});
Five database calls became one Redis call. The dashboard went from 14 seconds to 380ms.
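To see the cache-aside behaviour without a Redis instance, the same pattern can be sketched with an in-memory `Map`. `MemoryCache` and its injectable `now` clock are illustrative stand-ins for experimenting locally, not a replacement for Redis in production.

```javascript
// Hypothetical in-memory cache with per-key expiry.
class MemoryCache {
  constructor(now = () => Date.now()) {
    this.now = now;           // injectable clock, handy for tests
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key, ttlSeconds, value) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }

  // Cache-aside: return the cached value, or fetch, store, and return it.
  async getOrSet(key, ttlSeconds, fetchFn) {
    const cached = this.get(key);
    if (cached !== undefined) return cached;
    const fresh = await fetchFn();
    this.set(key, ttlSeconds, fresh);
    return fresh;
  }
}
```

With a 60-second TTL, the expensive function runs at most about once a minute per key, and the trade is that a dashboard can be up to a minute stale.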
Fix 3: Your Database Is Probably Doing Full Table Scans Right Now
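Run this on your Postgres instance right now. It needs the pg_stat_statements extension enabled, and mean_exec_time and total_exec_time are the Postgres 13+ column names (older versions call them mean_time and total_time):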
SELECT
  query,
  calls,
  mean_exec_time,
  total_exec_time
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
If any query averages above 100ms, look at it. Then check whether it is using indexes:
EXPLAIN ANALYZE
SELECT * FROM users WHERE email = 'someone@example.com';
If you see Seq Scan in the output, your database is reading the entire users table every single time someone tries to log in. Add the index:
CREATE INDEX CONCURRENTLY idx_users_email ON users(email);
CONCURRENTLY builds the index without blocking writes to your table. It takes longer and cannot run inside a transaction block, but it is the safe default in production. We went from 340ms average login query time to 4ms with this one line.
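Reading plans by eye gets old once you have more than a handful of queries, so one option is to scan the EXPLAIN text for sequential scans automatically. Below is a small sketch of that idea. `findSeqScans` is a hypothetical helper, and the sample plan is illustrative text in the usual EXPLAIN shape, not output captured from a real database.

```javascript
// Returns the table names that appear in "Seq Scan on <table>" plan nodes.
function findSeqScans(planText) {
  const tables = new Set();
  for (const match of planText.matchAll(/Seq Scan on (\S+)/g)) {
    tables.add(match[1]);
  }
  return [...tables];
}

// Illustrative plan text, not output captured from a real database.
const samplePlan = [
  'Seq Scan on users  (cost=0.00..1834.00 rows=1 width=72)',
  "  Filter: (email = 'someone@example.com'::text)",
].join('\n');

console.log(findSeqScans(samplePlan)); // [ 'users' ]
```

A sequential scan is not always a problem: on a very small table the planner may pick one on purpose, so treat the result as a list of queries to look at, not a list of bugs.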
Fix 4: Rate Limit Everything That Faces the Internet
A bot found our unsecured file upload endpoint within six hours of launch and sent 4,000 requests in 20 minutes. Here is the limiter that now goes on every single project I touch:
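import rateLimit from 'express-rate-limit';

// Behind a load balancer or reverse proxy, configure Express's 'trust proxy'
// setting first; otherwise every request appears to come from the proxy's IP.

const apiLimiter = rateLimit({
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100,
  standardHeaders: true,
  legacyHeaders: false,
  message: { error: 'Too many requests. Try again in 15 minutes.' }
});

const authLimiter = rateLimit({
  windowMs: 60 * 60 * 1000, // 1 hour
  max: 10,
  standardHeaders: true,
  legacyHeaders: false,
  message: { error: 'Too many login attempts.' }
});

app.use('/api/', apiLimiter);
app.use('/api/auth/', authLimiter);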
Auth endpoints get the strict limiter on top of the general one. Everything else gets the generous one. No exceptions.
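If the `windowMs` and `max` options feel abstract, the general idea behind this kind of limiter is a counter per client that resets when its window expires. This is a minimal sketch of a fixed-window counter, written from scratch for illustration. `FixedWindowLimiter` is a hypothetical class and is not how express-rate-limit is implemented internally.

```javascript
// Hypothetical in-memory limiter: at most `max` hits per key per window.
class FixedWindowLimiter {
  constructor({ windowMs, max, now = () => Date.now() }) {
    this.windowMs = windowMs;
    this.max = max;
    this.now = now;           // injectable clock, handy for tests
    this.windows = new Map(); // key (e.g. client IP) -> { count, resetAt }
  }

  // Records one hit for `key` and reports whether it is allowed.
  hit(key) {
    const t = this.now();
    let w = this.windows.get(key);
    if (!w || t >= w.resetAt) {
      w = { count: 0, resetAt: t + this.windowMs }; // start a new window
      this.windows.set(key, w);
    }
    w.count += 1;
    return {
      allowed: w.count <= this.max,
      remaining: Math.max(this.max - w.count, 0),
      resetAt: w.resetAt,
    };
  }
}
```

Counters like these live in one process's memory, so with several servers each one keeps its own count. If that matters for your limits, look at the `store` option of express-rate-limit, which lets the counts live in a shared backend.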
Fix 5: Move Sessions Out of Memory or You Cannot Scale
When I added a second server behind a load balancer, half my users got randomly logged out. The reason: sessions were stored in the server's memory, and the second server had no idea they existed.
The fix is one npm install away:
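import session from 'express-session';
// Default export in connect-redis v7; check the README for the version you
// install, since the import form has changed between major versions.
import RedisStore from 'connect-redis';
// Assumes lib/cache.js exports its connected client as `redis`.
import { redis } from './lib/cache.js';

app.use(session({
  store: new RedisStore({ client: redis }),
  secret: process.env.SESSION_SECRET,
  resave: false,
  saveUninitialized: false,
  cookie: {
    secure: true, // cookie is only sent over HTTPS
    httpOnly: true,
    maxAge: 1000 * 60 * 60 * 24 // 24 hours
  }
}));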
Now every server shares the same session store. You can run ten instances and every user stays logged in across all of them.
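The failure mode is easy to reproduce in a few lines. The following toy model is only an illustration of the reasoning: the `createServer` helper and the plain `Map` standing in for Redis are made up for this example.

```javascript
// Toy model: each "server" looks sessions up in whatever store it was given.
function createServer(store) {
  return {
    login(sessionId, userId) {
      store.set(sessionId, { userId });
    },
    whoAmI(sessionId) {
      return store.get(sessionId)?.userId ?? null;
    },
  };
}

// In-memory sessions: every server has its own private Map.
const serverA = createServer(new Map());
const serverB = createServer(new Map());
serverA.login('sid-1', 'user-42');
console.log(serverB.whoAmI('sid-1')); // null: server B never saw the login

// Shared store (the role Redis plays): both servers use the same Map.
const sharedStore = new Map();
const serverC = createServer(sharedStore);
const serverD = createServer(sharedStore);
serverC.login('sid-1', 'user-42');
console.log(serverD.whoAmI('sid-1')); // user-42
```

Whether a user looked logged in depended entirely on which server the load balancer happened to pick, which is why the logouts seemed random.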
Fix 6: Add a Health Check That Actually Tells the Truth
A health check that just returns 200 OK is lying to your infrastructure. Here is one that checks what actually matters:
app.get('/health', async (req, res) => {
  const result = {
    status: 'ok',
    timestamp: new Date().toISOString(),
    services: {}
  };

  try {
    await db.query('SELECT 1');
    result.services.database = 'ok';
  } catch {
    result.services.database = 'error';
    result.status = 'degraded';
  }

  try {
    await redis.ping();
    result.services.cache = 'ok';
  } catch {
    result.services.cache = 'error';
    result.status = 'degraded';
  }

  res.status(result.status === 'ok' ? 200 : 503).json(result);
});
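Kubernetes, ECS, Railway, and Render can all poll an endpoint like this to decide whether to send traffic to your app or restart it. Give them the real answer. On Kubernetes, point the readiness probe at it rather than the liveness probe, so a database outage takes the pod out of rotation instead of restart-looping it.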
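One gap worth closing: if the database hangs instead of failing, the health check hangs with it. A possible guard is to race each dependency check against a timer. This is a sketch under that assumption: `withTimeout`, `checkService`, and the 2-second default are illustrative, not part of any library.

```javascript
// Rejects if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms, label = 'check') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Runs one dependency check and maps the outcome to 'ok' or 'error'.
async function checkService(checkFn, ms = 2000) {
  try {
    await withTimeout(Promise.resolve().then(checkFn), ms);
    return 'ok';
  } catch {
    return 'error';
  }
}
```

In the handler, each try/catch pair could then shrink to a line such as `result.services.database = await checkService(() => db.query('SELECT 1'));`.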
Fix 7: Handle Errors Like a Senior Dev, Not a Tutorial
The default behavior in most codebases is to either crash silently or send a raw stack trace to the browser. Both are terrible. Here is the global error handler that goes in every project:
app.use((err, req, res, next) => {
  // If the response has already started, let Express's default handler finish it.
  if (res.headersSent) return next(err);

  const isDev = process.env.NODE_ENV === 'development';

  console.error({
    message: err.message,
    stack: err.stack,
    url: req.url,
    method: req.method,
    userId: req.user?.id ?? 'unauthenticated'
  });

  res.status(err.status || 500).json({
    error: isDev
      ? err.message
      : 'Something went wrong. Our team has been notified.',
    ...(isDev && { stack: err.stack })
  });
});
Structured logs internally. Safe message externally. Never both to the same place.
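The handler only sees errors that Express actually receives. In Express 4, a rejected promise from an `async` route is not forwarded automatically (Express 5 does forward it), so a small wrapper is a common companion. This is a sketch of that idea. `asyncHandler` and `HttpError` are hypothetical names, with `HttpError` carrying the `status` field the handler reads.

```javascript
// Error that carries an HTTP status for the global handler to read.
class HttpError extends Error {
  constructor(status, message) {
    super(message);
    this.name = 'HttpError';
    this.status = status;
  }
}

// Wraps a route so rejections (and synchronous throws) reach next(err).
function asyncHandler(fn) {
  return (req, res, next) => {
    Promise.resolve()
      .then(() => fn(req, res, next))
      .catch(next);
  };
}
```

A route would then be registered as `app.get('/api/thing', asyncHandler(async (req, res) => { ... }))`, and throwing `new HttpError(404, 'Not found')` inside it would end up in the global handler with the right status.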
Fix 8: Graceful Shutdown So Deploys Do Not 502 Your Users
Every time you deploy without graceful shutdown, in-flight requests get cut off mid-response. Users see a sudden error for no reason. Here is the fix:
const server = app.listen(3000);

async function shutdown(signal) {
  console.log(`${signal} received. Shutting down...`);

  server.close(async () => {
    await db.end();
    await redis.quit();
    console.log('Clean shutdown complete.');
    process.exit(0);
  });

  setTimeout(() => {
    console.error('Shutdown timeout. Forcing exit.');
    process.exit(1);
  }, 10_000);
}

process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));
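Add this once, and most of the mysterious 502s during deployments go away. Keep the 10-second timeout shorter than the grace period your platform allows before it force-kills the process.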
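Two edge cases are worth a thought: a second signal arriving while shutdown is already running, and a cleanup step that throws. One way to handle both is sketched below, with the server, the cleanup functions, and the exit call passed in so the logic can be exercised without a real process. `createShutdown` and its option names are assumptions made for this example.

```javascript
// Builds a shutdown function. `server`, `closers`, and `exit` are injected
// so the logic can run without a real HTTP server or process.exit.
function createShutdown({ server, closers = [], exit, timeoutMs = 10_000 }) {
  let shuttingDown = false;

  return function shutdown(signal) {
    if (shuttingDown) return; // ignore repeated signals
    shuttingDown = true;
    console.log(`${signal} received. Shutting down...`);

    // Force exit if draining takes too long.
    const timer = setTimeout(() => exit(1), timeoutMs);

    // close() stops accepting new connections and waits for open ones.
    server.close(async () => {
      let code = 0;
      for (const close of closers) {
        try {
          await close();
        } catch (err) {
          console.error('Error while closing a resource:', err);
          code = 1; // keep closing the rest, but report the failure
        }
      }
      clearTimeout(timer);
      exit(code);
    });
  };
}
```

Wiring it up might look like `createShutdown({ server, closers: [() => db.end(), () => redis.quit()], exit: (code) => process.exit(code) })`, with the returned function registered for both SIGTERM and SIGINT.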
The Pre-Launch Checklist I Now Run on Every Project
Print this. Stick it somewhere. Do not deploy without it.
- Slow route detection middleware added
- Redis caching layer for expensive queries
- Database indexes verified with EXPLAIN ANALYZE
- Rate limiting on all public and auth endpoints
- Sessions stored in Redis, not server memory
- Health check that tests real dependencies
- Global error handler with structured logging
- Graceful shutdown wired to SIGTERM and SIGINT
- No secrets in committed code
- Environment variables in a proper secrets manager
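For the last two items, a cheap safety net is to fail at startup when a required variable is missing, instead of discovering it on the first login. Here is a minimal sketch, assuming configuration arrives through environment variables. `requireEnv` is a hypothetical helper.

```javascript
// Returns the requested variables, or throws listing every missing name.
// Only names appear in the error message, never values.
function requireEnv(env, names) {
  const missing = names.filter(
    (name) => typeof env[name] !== 'string' || env[name].trim() === ''
  );
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(', ')}`
    );
  }
  return Object.fromEntries(names.map((name) => [name, env[name]]));
}
```

Called once at the top of the entry file, for example `const config = requireEnv(process.env, ['REDIS_URL', 'SESSION_SECRET']);`, it turns a misconfigured deploy into an immediate crash that the health check and the platform can catch.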
When the Project Outgrows the Checklist
These eight fixes will carry most apps from zero to a few hundred thousand users without breaking a sweat. But when you are dealing with microservices architecture, advanced cloud and DevOps infrastructure, multi-region deployments, or a mobile app that needs to stay in sync with a complex backend, the decisions get harder and the cost of getting them wrong gets much higher.
That is the point where working with a team that offers dedicated full-stack web development, mobile app development, and cloud architecture services pays for itself fast. Getting the infrastructure decisions right at the start is dramatically cheaper than rebuilding them after a production fire.
The Real Lesson
I deployed on a Friday and paid for it all weekend.
You do not have to.
Every fix in this post is a few hours of work total. The checklist at the end is fifteen minutes before any deploy. The difference between a launch that becomes a growth story and one that becomes a postmortem is rarely intelligence or talent. It is just knowing what to check before you push.
Now you know.
Which one of these are you missing in your current project? Drop it in the comments. I read all of them.