Chirag Patel

⚠️ The SQL Query That Nearly Crashed Our Production Server

Launch day.
Traffic was peaking.
Dashboards were green.
Everything looked perfect… for about 10 minutes.

Then—boom.
Our site slowed to a crawl and went completely offline.


🚨 The Villain? A Single SQL Query

Here’s the innocent-looking line that brought our system to its knees:

SELECT * FROM orders WHERE status = 'pending';

In our development setup, this ran instantly.
Why wouldn’t it? We only had about 500 dummy records.

But in production, the story was very different:

  • Over 60,000 orders
  • Each row around 12 KB (including customer info, items, metadata, and logs)
  • That’s roughly 720 MB being pulled into memory

And that one query triggered a chain reaction:

  • 🧠 Memory pressure: DB tried to load hundreds of MB into RAM
  • 🌐 Network choke: 700MB+ transferred across the wire = 30–40s latency
  • 🔒 Connection lock: Slow queries kept DB connections occupied
  • 💥 Concurrency collapse: Dozens of users triggered the same query = total overload

All because of one tiny SELECT *.

Why SELECT * Is a Production Trap

It feels convenient, but it’s a ticking time bomb under load.

Here’s why:

  1. You fetch way more than needed: Images, logs, metadata — all pulled even if you don’t use them.

  2. Schemas evolve silently: Add a new column later, and every query gets heavier without warning.

  3. Indexes become less effective: With SELECT *, the planner can rarely answer the query from an index alone, so every matching row still has to be fetched from the table (see the sketch after this list).

  4. Memory + bandwidth waste: Every unnecessary byte eats CPU cycles and RAM.

  5. Concurrency death spiral: Multiply those inefficiencies by hundreds of simultaneous users = meltdown.
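
On point 3 specifically, here’s a minimal sketch of what an explicit column list buys you (PostgreSQL-flavored; the index name and column choice are illustrative, not our actual schema):

-- A covering index for the pending-orders lookup. With an explicit column
-- list, the planner can answer the query from the index alone
-- (an index-only scan in PostgreSQL).
CREATE INDEX idx_orders_pending_cover
ON orders (status, id, customer_id, total_price);

-- Served from the index, no table heap access needed:
SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending';

-- SELECT * still has to visit every matching 12 KB row on disk, because the
-- wide columns (items, metadata, logs) can never live in the index:
SELECT * FROM orders WHERE status = 'pending';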


How We Fixed It (and What You Should Do Instead)

1. Select only what you need

SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending'
LIMIT 50 OFFSET 0;

Always be explicit. Pull only the fields you’ll actually use.

2. Paginate aggressively

Never fetch thousands of rows in a single request.
Use LIMIT + OFFSET, or better, keyset pagination for large tables (sketched below).
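
A keyset version of our fixed query might look like this (assuming id is a monotonically increasing primary key; the cursor value is made up):

-- Keyset (cursor) pagination: remember the last id from the previous page
-- and seek past it, instead of scanning and discarding OFFSET rows.
SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending'
  AND id > 10500   -- last id seen on the previous page (illustrative)
ORDER BY id
LIMIT 50;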

3. Cache frequent queries

If certain data is read-heavy (e.g., pending orders, popular products), store it in Redis or Memcached.
You’ll offload 70–80% of traffic from your database instantly.
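
We went with Redis, but if you want to sketch the same idea without leaving the database, a materialized view can act as a precomputed cache for a hot query. This is an alternative technique, not what we actually deployed, and the view name is made up (PostgreSQL syntax):

-- Materialized view as a database-side "cache" of the hot listing.
CREATE MATERIALIZED VIEW pending_orders_summary AS
SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending';

-- Reads hit the small precomputed result instead of the wide base table:
SELECT id, customer_id, total_price FROM pending_orders_summary LIMIT 50;

-- Refresh on a schedule (cron, pg_cron, etc.), not on every request:
REFRESH MATERIALIZED VIEW pending_orders_summary;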

4. Inspect with EXPLAIN

Before deploying, always check query plans:

EXPLAIN SELECT ...

You’ll quickly spot missing indexes or full table scans.
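
For example, on the fixed query (PostgreSQL shown; EXPLAIN output and keywords vary by engine):

-- EXPLAIN ANALYZE runs the query and reports the actual plan and timings.
EXPLAIN ANALYZE
SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending'
LIMIT 50;
-- A "Seq Scan" on a large table in the output usually means a missing index.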

5. Enable slow query logs

Set up slow query logging in MySQL/Postgres.
Anything over 200ms deserves your attention.
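
These are the knobs we mean (the 200 ms threshold is just our rule of thumb; tune it to your workload):

-- MySQL: log anything slower than 200 ms
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.2;

-- PostgreSQL: log statements slower than 200 ms, then reload the config
ALTER SYSTEM SET log_min_duration_statement = '200ms';
SELECT pg_reload_conf();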

6. Test at scale

Your dev DB with 500 records won’t expose real bottlenecks.
Clone anonymized production data or use synthetic generators.
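
If anonymized production data isn’t available, even a quick synthetic load beats 500 dummy rows. A rough PostgreSQL sketch (column list simplified to match the examples above):

-- Seed ~100,000 synthetic orders so dev queries face realistic row counts.
INSERT INTO orders (customer_id, status, total_price)
SELECT
    (random() * 10000)::int,                                     -- fake customer
    (ARRAY['pending', 'shipped', 'cancelled'])[1 + floor(random() * 3)::int],
    round((random() * 500)::numeric, 2)                          -- fake total
FROM generate_series(1, 100000);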

7. Enforce query timeouts

Never allow one rogue query to monopolize resources forever.
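
Exact syntax depends on the engine; the 5-second budget below is a placeholder, not a recommendation:

-- PostgreSQL: abort any statement in this session that runs longer than 5s
SET statement_timeout = '5s';

-- MySQL 5.7+: cap SELECT execution time at 5,000 ms for this session
SET SESSION max_execution_time = 5000;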


Pre-Deployment Checklist 🧩

Before any SQL hits production, ask yourself:

  1. Are you selecting only the columns you need?
  2. Is pagination in place?
  3. Are your WHERE/JOIN clauses indexed?
  4. Have you tested on production-scale data?
  5. Do you have slow query logs + timeouts enabled?

Bonus tip → Cache hot queries and watch your DB load drop by 80%+.


🧠 Key Takeaway

In production, every byte matters.
That one harmless-looking SELECT * might seem fine locally…
…but at scale, it can choke your entire system.

Build and test as if you already have a million users.
That mindset will save you more than any optimization later.


Have you ever faced a similar “small bug, massive blast radius” moment?
Drop your story below — let’s learn from each other’s war stories. 💬

Top comments (9)

Hashbyt

Great post, and a classic "rite of passage" for many developers! We had a nearly identical incident a few years back. A dashboard for internal analytics used a SELECT * on a user_events table. It worked fine for months until the table grew large enough that the query would time out, taking the dashboard down every morning when the first manager logged in.

Chirag Patel

Haha yes — a rite of passage indeed! 😅
Dashboards are especially sneaky since they start small and “just work”... until the data snowballs.
It’s wild how invisible performance debt can be until it suddenly costs uptime.
Curious — did you end up fixing it by optimizing the query, caching results, or redesigning the dashboard?

Alex Chen

been there with a JOIN that wasn't indexed -- took down checkout for 8min during Black Friday. 2,300 orders queued up, boss wasn't happy. now I basically panic-test everything against production-sized data before shipping.

Chirag Patel

Oof, I can feel that pain. 😅
Unindexed JOINs under heavy traffic are nightmare fuel.
Testing with production-sized data honestly changes everything — it’s the only way to catch those hidden time bombs before they explode.
Glad to hear you made that part of your process!

Pascal CESCATO

Quite right! SELECT COUNT(*) is the only query where the asterisk is canonical and most performant, as the engine is uniquely optimized just for row counting.

Chirag Patel

100% true — great point!
COUNT(*) is the one exception where the engine is optimized internally, especially in PostgreSQL and MySQL.
I should’ve mentioned that nuance in the post — thanks for highlighting it! 🙌

Pascal CESCATO

Particularly in PostgreSQL and MySQL: Very true! Each engine has its own optimizations and its own way of working.

Dipankar Shaw

Noted. Will avoid this thing. Thanks

Chirag Patel

Awesome! Glad it helped 🙂
Once you start being intentional with SELECT columns, you’ll notice how much faster your queries get — especially at scale.