⚠️ The SQL Query That Nearly Crashed Our Production Server

Chirag Patel on October 28, 2025

Launch day. Traffic was peaking. Dashboards were green. Everything looked perfect… for about 10 minutes. Then—boom. Our site slowed to a crawl a...
Igor Nosatov

The author brilliantly breaks down a real-world case where one "innocent" SELECT * nearly took down the entire production system. The launch day story with green dashboards and a sudden crash reads like a devops thriller. From euphoria to panic in 10 minutes? Classic!
What really stood out:

  • Gold-tier practical advice: From explicitly selecting only needed columns to pagination, Redis caching, and EXPLAIN — all crystal clear with examples. I can already picture applying this in my own project.
  • Pre-deployment checklist: A must-have for any team. "Test like you already have a million users" — golden rule! 📋
  • Humor and relatability: "Ticking bomb under load" — laughed out loud. And the bonus about dropping DB load by 80%+ with caching? Instant motivation for tomorrow’s refactoring!

This article is the perfect reminder that in production, every byte counts. That one harmless-looking SELECT * might seem fine locally... but at scale, it can choke your entire system.
Thanks for the lesson without the pain (well, almost 😉). Definitely following for more war stories!
Ever had a similar “tiny bug, massive blast radius” moment? Drop your story below! 💬 #database #webdev #productivity
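
For readers skimming the comments, here is a minimal sketch of the advice listed above (explicit columns, a hard limit, EXPLAIN); the orders table and its columns are hypothetical stand-ins, not the article's actual schema:

```sql
-- Hypothetical schema: "orders" and its columns are illustrative, not the article's.
-- The risky pattern: every column of every matching row comes back over the wire.
SELECT *
FROM orders
WHERE customer_id = 42;

-- The safer pattern: name only the columns the page actually renders,
-- and cap the result set so one hot customer can't drag tens of thousands of rows back.
SELECT id, total, created_at
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 50;

-- EXPLAIN (EXPLAIN ANALYZE in PostgreSQL) shows whether an index is used at all.
EXPLAIN
SELECT id, total, created_at
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 50;
```

The LIMIT is what bounds the worst case: however large the table grows, a single request can never pull back more than one page.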

Chirag Patel

Wow — really appreciate the thoughtful breakdown and encouragement 🙌
Love how you phrased it: “devops thriller” — that’s exactly what it felt like in the moment 😅

Totally agree: these kinds of issues feel tiny in dev, invisible in staging, and then become boss-summoning emergencies in production.

And yes — testing like you already have a million users is such an underrated mindset shift. The earlier we think about scale, the fewer panic-patches we end up shipping later.

Thanks again for the kind words — feedback like this fuels writing more engineering war stories!

By the way, I’m curious — what’s the most “small bug, huge blast radius” issue you’ve run into?

Randall

This contains a lot of good advice. I have a few critiques and I think there may be some more good lessons you can still learn from this experience.

  1. An orders table shouldn't contain "customer info, items, metadata, and logs"; that suggests the schema design can benefit from normalization.
  2. Generally speaking, for row-oriented databases, SELECTing specific columns doesn't reduce how much data the database reads into RAM. The database always pulls entire pages into RAM, even if you only select a few columns in one row. However, there are exceptions, and indeed your use case may have triggered one. PostgreSQL for example uses "TOAST (The Oversized-Attribute Storage Technique)" for large columns, and this can lead to some column values not being read into memory, depending on the columns you select (but the pointer to the actual value still gets read into memory).
  3. I think what really killed you here was trying to pull 60k rows across the network. The LIMIT 50 alone probably would have fixed the issue. Limiting the SELECTed columns, while helpful, probably doesn't make as big of a difference. But this can all be tested. You can import the database into a local instance and write some test scripts against it to measure the effects of changes.
  4. About "indexes become less effective", this deserves some elaboration. The major factor here will be whether or not the query is covered by the index. A covered query will allow a fast index-only scan. But a SELECT * can still be covered by an index (if every column in the table is in the index) and allow an index-only scan. So as a binary decision, whether or not you limit the columns selected doesn't improve index performance per-se. To benefit from that optimization you have to make sure you only select columns that are in the index, and understanding "covered queries" is the key to using that optimization effectively.
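
To make point 4 concrete, here is a minimal sketch of a covered query versus an uncovered one; the orders table, column names, and index name are hypothetical, and the INCLUDE clause is PostgreSQL (11+) syntax:

```sql
-- Hypothetical table, columns, and index name. INCLUDE is PostgreSQL 11+ syntax;
-- in MySQL you'd add "total" as a trailing key column instead.
CREATE INDEX idx_orders_customer_created
    ON orders (customer_id, created_at)
    INCLUDE (total);

-- Covered: every column the query touches lives in the index,
-- so the planner can choose an index-only scan and skip the heap
-- (PostgreSQL still consults the visibility map, so keep the table vacuumed).
EXPLAIN
SELECT customer_id, created_at, total
FROM orders
WHERE customer_id = 42;

-- Not covered: SELECT * references columns the index doesn't have,
-- so every match also requires a heap fetch (or the planner just scans the table).
EXPLAIN
SELECT *
FROM orders
WHERE customer_id = 42;
```

The trade-off: the wider you make an index to cover queries, the more expensive writes and the index itself become, so it pays to cover only the hot read paths.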
Chirag Patel

Thanks for the solid breakdown — really appreciate it.

  • Agreed on normalization; that was definitely a lesson learned.
  • Good call on row-store behavior — my point was based on TOAST-style cases, but you're right that it's not generally true.
  • Fair point on the network cost being the real bottleneck; limiting rows was the biggest win.
  • Index coverage explanation was helpful — I oversimplified that part.

Overall, thanks — this adds valuable clarity to the nuances I glossed over.

Neurolov AI

Such a relatable post, and a perfect reminder that "works on my machine" doesn’t mean ready for production. The breakdown of root cause and recovery steps is spot on: practical, clear, and packed with lessons every engineer should bookmark.

Chirag Patel

Thank you! 🙌

Exactly — “works on my machine” is one of the most dangerous forms of false confidence in engineering 😅
Local success ≠ production readiness, especially when data volume & traffic scale kick in.

Glad the breakdown felt practical — that’s the goal: real lessons that actually apply to day-to-day systems.

Curious — have you ever had a moment where everything worked locally but production said NOPE? 😄

Hashbyt

Great post, and a classic "rite of passage" for many developers! We had a nearly identical incident a few years back. A dashboard for internal analytics used a SELECT * on a user_events table. It worked fine for months until the table grew large enough that the query would time out, taking the dashboard down every morning when the first manager logged in.

Chirag Patel

Haha yes — a rite of passage indeed! 😅
Dashboards are especially sneaky since they start small and “just work”... until the data snowballs.
It’s wild how invisible performance debt can be until it suddenly costs uptime.
Curious — did you end up fixing it by optimizing the query, caching results, or redesigning the dashboard?

Augusts Bautra

Thanks for the top tier post, king. Kudos for mentioning keyset pagination. OFFSET is so 90s :D

Limiting the potential size of a query is something every developer has to grok. I always ask myself "what if the DB grows to 10 million rows of this?". Limiting, paging, batch-looping and, of course, targeted selecting are the ways. Plus, plentiful and smart indexing.
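
For anyone who hasn't switched yet, a minimal keyset (seek) pagination sketch against the same hypothetical orders table, assuming an indexed id column:

```sql
-- OFFSET pagination: the engine still reads (and throws away) all skipped rows,
-- so page 2,000 is far more expensive than page 1.
SELECT id, total, created_at
FROM orders
ORDER BY id
LIMIT 50 OFFSET 100000;

-- Keyset pagination: remember the last id of the previous page and seek past it.
-- With an index on id this costs roughly the same at any depth.
SELECT id, total, created_at
FROM orders
WHERE id > 100050          -- last id returned on the previous page
ORDER BY id
LIMIT 50;
```

The trade-off is that keyset pagination only supports "next page from here", not jumping to an arbitrary page number, which is usually fine for feeds and infinite scroll.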

Chirag Patel

Thanks! 🙌 And yep — keyset pagination for the win 😄 Thinking in 10M-row scale really changes how we design queries. Smart limits + indexing = life saver.

grant horwood

i obsess over the slow query log for the first month or so a project is running on prod and it's one of the first things i check when performance issues arise.

i did do a writeup on some of the shell awk/sed kludgery i do to get meaningful data out of a crowded slow log:
dev.to/gbhorwood/mysql-using-the-s...
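
For anyone who hasn't turned it on yet, enabling MySQL's slow query log is only a couple of settings (the threshold below is just an example; tune it to your workload and persist the values in my.cnf so they survive a restart):

```sql
-- MySQL: enable the slow query log at runtime (values here are examples).
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                    -- log anything slower than 1 second
SET GLOBAL log_queries_not_using_indexes = 'ON';   -- optional, can get noisy

-- Find out where the log file lives before grepping/awking it.
SHOW VARIABLES LIKE 'slow_query_log%';
```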

Chirag Patel

That's awesome! 🔥 Slow query log obsession early in production is a superpower.
Thanks for sharing the link — will check it out! 🙌

Alex Chen

been there with a JOIN that wasn't indexed -- took down checkout for 8min during Black Friday. 2,300 orders queued up, boss wasn't happy. now I basically panic-test everything against production-sized data before shipping.
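
A minimal sketch of that failure mode and the fix, with hypothetical orders / order_items tables standing in for the real checkout schema:

```sql
-- Hypothetical checkout schema: order_items.order_id references orders.id.
-- Check the plan first: without an index on the join key, each order forces
-- a scan of order_items, which is exactly what melts down under peak traffic.
EXPLAIN
SELECT o.id, o.total, i.product_id, i.quantity
FROM orders o
JOIN order_items i ON i.order_id = o.id
WHERE o.customer_id = 42;

-- Index the foreign key so the join becomes a cheap index lookup per order.
CREATE INDEX idx_order_items_order_id ON order_items (order_id);
```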

Chirag Patel

Oof, I can feel that pain. 😅
Unindexed JOINs under heavy traffic are nightmare fuel.
Testing with production-sized data honestly changes everything — it’s the only way to catch those hidden time bombs before they explode.
Glad to hear you made that part of your process!

bartue-dev

asdas

Chirag Patel

Looks like that might’ve been a typo 😄 but thanks for stopping by!

lulu rahma

jo777. gacor

Chirag Patel

Looks like that might’ve been a typo 😄 but thanks for stopping by!

Pascal CESCATO

Quite right! SELECT COUNT(*) is the only query where the asterisk is canonical and most performant, as the engine is uniquely optimized just for row counting.

Chirag Patel

100% true — great point!
COUNT(*) is the one place where the asterisk is harmless: the engine only counts rows and never materializes the column values (though PostgreSQL and MySQL's InnoDB still have to scan, so it isn't free on huge tables).
I should’ve mentioned that nuance in the post — thanks for highlighting it! 🙌
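
A small illustration of the nuance, with a hypothetical shipped_at column standing in for any nullable column:

```sql
-- COUNT(*) counts rows; the asterisk never materializes column values,
-- so this is not the SELECT * problem from the article.
SELECT COUNT(*) FROM orders;

-- COUNT(column) is different: it counts only rows where that column is NOT NULL.
SELECT COUNT(shipped_at) FROM orders;   -- hypothetical column: counts shipped orders only

-- Caveat: PostgreSQL and InnoDB still scan an index or the table to count,
-- so COUNT(*) on a huge table is cheap per row but not instant.
```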

Pascal CESCATO

Very true, particularly in PostgreSQL and MySQL! Each engine has its own optimizations and its own way of working.

Dipankar Shaw

Noted. Will avoid this thing. Thanks

Chirag Patel

Awesome! Glad it helped 🙂
Once you start being intentional with SELECT columns, you’ll notice how much faster your queries get — especially at scale.