Chirag Patel

⚠️ The SQL Query That Nearly Crashed Our Production Server

Launch day.
Traffic was peaking.
Dashboards were green.
Everything looked perfect… for about 10 minutes.

Then—boom.
Our site slowed to a crawl and went completely offline.


🚨 The Villain? A Single SQL Query

Here’s the innocent-looking line that brought our system to its knees:

SELECT * FROM orders WHERE status = 'pending';

In our development setup, this ran instantly.
Why wouldn’t it? We only had about 500 dummy records.

But in production, the story was very different:

  • Over 60,000 orders
  • Each row around 12 KB (including customer info, items, metadata, and logs)
  • That’s roughly 720 MB being pulled into memory

And that one query triggered a chain reaction:

  • 🧠 Memory pressure: DB tried to load hundreds of MB into RAM
  • 🌐 Network choke: 700MB+ transferred across the wire = 30–40s latency
  • 🔒 Connection lock: Slow queries kept DB connections occupied
  • 💥 Concurrency collapse: Dozens of users triggered the same query = total overload

All because of one tiny SELECT *.

Why SELECT * Is a Production Trap

It feels convenient, but it’s a ticking time bomb under load.

Here’s why:

  1. You fetch way more than needed: Images, logs, metadata — all pulled even if you don’t use them.

  2. Schemas evolve silently: Add a new column later, and every query gets heavier without warning.

  3. Indexes become less effective: With SELECT *, the planner can rarely answer the query from an index alone, so every matching row still has to be fetched from the table (see the sketch after this list).

  4. Memory + bandwidth waste: Every unnecessary byte eats CPU cycles and RAM.

  5. Concurrency death spiral: Multiply those inefficiencies by hundreds of simultaneous users = meltdown.
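
On point 3 specifically, here’s a minimal sketch of what an explicit column list buys you (PostgreSQL-flavored; the index name and column choice are illustrative, not our actual schema):

-- A covering index for the pending-orders lookup. With an explicit column
-- list, the planner can answer the query from the index alone
-- (an index-only scan in PostgreSQL).
CREATE INDEX idx_orders_pending_cover
ON orders (status, id, customer_id, total_price);

-- Served from the index, no table heap access needed:
SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending';

-- SELECT * still has to visit every matching 12 KB row on disk, because the
-- wide columns (items, metadata, logs) can never live in the index:
SELECT * FROM orders WHERE status = 'pending';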


How We Fixed It (and What You Should Do Instead)

1. Select only what you need

SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending'
LIMIT 50 OFFSET 0;

Always be explicit. Pull only the fields you’ll actually use.

2. Paginate aggressively

Never fetch thousands of rows in a single request.
Use LIMIT + OFFSET, or better, keyset pagination for large tables (sketched below).
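
A keyset version of our fixed query might look like this (assuming id is a monotonically increasing primary key; the cursor value is made up):

-- Keyset (cursor) pagination: remember the last id from the previous page
-- and seek past it, instead of scanning and discarding OFFSET rows.
SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending'
  AND id > 10500   -- last id seen on the previous page (illustrative)
ORDER BY id
LIMIT 50;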

3. Cache frequent queries

If certain data is read-heavy (e.g., pending orders, popular products), store it in Redis or Memcached.
You’ll offload 70–80% of traffic from your database instantly.
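
We went with Redis, but if you want to sketch the same idea without leaving the database, a materialized view can act as a precomputed cache for a hot query. This is an alternative technique, not what we actually deployed, and the view name is made up (PostgreSQL syntax):

-- Materialized view as a database-side "cache" of the hot listing.
CREATE MATERIALIZED VIEW pending_orders_summary AS
SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending';

-- Reads hit the small precomputed result instead of the wide base table:
SELECT id, customer_id, total_price FROM pending_orders_summary LIMIT 50;

-- Refresh on a schedule (cron, pg_cron, etc.), not on every request:
REFRESH MATERIALIZED VIEW pending_orders_summary;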

4. Inspect with EXPLAIN

Before deploying, always check query plans:

EXPLAIN SELECT ...

You’ll quickly spot missing indexes or full table scans.
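
For example, on the fixed query (PostgreSQL shown; EXPLAIN output and keywords vary by engine):

-- EXPLAIN ANALYZE runs the query and reports the actual plan and timings.
EXPLAIN ANALYZE
SELECT id, customer_id, total_price
FROM orders
WHERE status = 'pending'
LIMIT 50;
-- A "Seq Scan" on a large table in the output usually means a missing index.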

5. Enable slow query logs

Set up slow query logging in MySQL/Postgres.
Anything over 200ms deserves your attention.
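
These are the knobs we mean (the 200 ms threshold is just our rule of thumb; tune it to your workload):

-- MySQL: log anything slower than 200 ms
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0.2;

-- PostgreSQL: log statements slower than 200 ms, then reload the config
ALTER SYSTEM SET log_min_duration_statement = '200ms';
SELECT pg_reload_conf();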

6. Test at scale

Your dev DB with 500 records won’t expose real bottlenecks.
Clone anonymized production data or use synthetic generators.
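
If anonymized production data isn’t available, even a quick synthetic load beats 500 dummy rows. A rough PostgreSQL sketch (column list simplified to match the examples above):

-- Seed ~100,000 synthetic orders so dev queries face realistic row counts.
INSERT INTO orders (customer_id, status, total_price)
SELECT
    (random() * 10000)::int,                                     -- fake customer
    (ARRAY['pending', 'shipped', 'cancelled'])[1 + floor(random() * 3)::int],
    round((random() * 500)::numeric, 2)                          -- fake total
FROM generate_series(1, 100000);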

7. Enforce query timeouts

Never allow one rogue query to monopolize resources forever.
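
Exact syntax depends on the engine; the 5-second budget below is a placeholder, not a recommendation:

-- PostgreSQL: abort any statement in this session that runs longer than 5s
SET statement_timeout = '5s';

-- MySQL 5.7+: cap SELECT execution time at 5,000 ms for this session
SET SESSION max_execution_time = 5000;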


Pre-Deployment Checklist 🧩

Before any SQL hits production, ask yourself:

  1. Are you selecting only the columns you need?
  2. Is pagination in place?
  3. Are your WHERE/JOIN clauses indexed?
  4. Have you tested on production-scale data?
  5. Do you have slow query logs + timeouts enabled?

Bonus tip → Cache hot queries and watch your DB load drop by 80%+.


🧠 Key Takeaway

In production, every byte matters.
That one harmless-looking SELECT * might seem fine locally…
…but at scale, it can choke your entire system.

Build and test as if you already have a million users.
That mindset will save you more than any optimization later.


Have you ever faced a similar “small bug, massive blast radius” moment?
Drop your story below — let’s learn from each other’s war stories. 💬

Top comments (9)

Hashbyt

Great post, and a classic "rite of passage" for many developers! We had a nearly identical incident a few years back. A dashboard for internal analytics used a SELECT * on a user_events table. It worked fine for months until the table grew large enough that the query would time out, taking the dashboard down every morning when the first manager logged in.

Chirag Patel

Haha yes — a rite of passage indeed! 😅
Dashboards are especially sneaky since they start small and “just work”... until the data snowballs.
It’s wild how invisible performance debt can be until it suddenly costs uptime.
Curious — did you end up fixing it by optimizing the query, caching results, or redesigning the dashboard?

Alex Chen

been there with a JOIN that wasn't indexed -- took down checkout for 8min during Black Friday. 2,300 orders queued up, boss wasn't happy. now I basically panic-test everything against production-sized data before shipping.

Chirag Patel

Oof, I can feel that pain. 😅
Unindexed JOINs under heavy traffic are nightmare fuel.
Testing with production-sized data honestly changes everything — it’s the only way to catch those hidden time bombs before they explode.
Glad to hear you made that part of your process!

Pascal CESCATO

Quite right! SELECT COUNT(*) is the only query where the asterisk is canonical and most performant, as the engine is uniquely optimized just for row counting.

Chirag Patel

100% true — great point!
COUNT(*) is the one exception where the engine is optimized internally, especially in PostgreSQL and MySQL.
I should’ve mentioned that nuance in the post — thanks for highlighting it! 🙌

Pascal CESCATO

Particularly in PostgreSQL and MySQL: Very true! Each engine has its own optimizations and its own way of working.

Dipankar Shaw

Noted. Will avoid this thing. Thanks

Chirag Patel

Awesome! Glad it helped 🙂
Once you start being intentional with SELECT columns, you’ll notice how much faster your queries get — especially at scale.