Originally published at makroumi.hashnode.dev
You push the code on Friday afternoon. Tests pass. Staging looks fine. You deploy and go home.
At 2am your phone rings.
Response times are 8 seconds. Users are abandoning. The on-call engineer is staring at a dashboard full of red. And somewhere in the codebase is a query that ran in 4 milliseconds in development and is now bringing down production on a table with 50 million rows.
This is not a rare story. It happens every week at companies of every size. The query was never dangerous in development because development doesn't look like production. Different data volumes, different indexes, different query plans. The bug was always there. You just couldn't see it yet.
Why development lies to you
Your development database has maybe 10,000 rows. Your production database has 50 million. The query planner makes completely different decisions at different scales.
A full table scan on 10,000 rows takes 2 milliseconds. The same full table scan on 50 million rows takes 45 seconds. The query is identical. The environment is not.
This is why code review fails to catch these bugs. The reviewer runs the query locally, it's fast, they approve it. Nobody in the review process ever sees production-scale data. The bug ships because the feedback loop is broken.
The patterns that are invisible in development
The leading wildcard
SELECT * FROM products WHERE name LIKE '%wireless%'
On 1,000 products this returns instantly. On 10 million products this performs a full table scan because a leading wildcard cannot use a B-tree index. It has to read every single row.
In development you have 1,000 products. In production you have 10 million. The query plan is completely different and you will never know until it's too late.
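You don't need production data to see the two plans diverge; you can ask the database for its plan directly. A minimal SQLite sketch (table and index names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE INDEX idx_products_name ON products(name)")
# SQLite only applies its LIKE-prefix optimization when LIKE is
# case-sensitive (or the index is NOCASE-collated), so enable that.
con.execute("PRAGMA case_sensitive_like = ON")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

prefix_plan = plan("SELECT * FROM products WHERE name LIKE 'wireless%'")
leading_plan = plan("SELECT * FROM products WHERE name LIKE '%wireless%'")
print(prefix_plan)   # SEARCH ... USING ... INDEX idx_products_name ...
print(leading_plan)  # SCAN ...
```

The anchored pattern 'wireless%' becomes an index range search; the leading wildcard cannot. If you genuinely need substring matching at scale, a full-text index (MySQL FULLTEXT, Postgres pg_trgm or tsvector) is the usual escape hatch.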
The function on an indexed column
SELECT * FROM orders WHERE YEAR(created_at) = 2025
You have an index on created_at. This query doesn't use it. Wrapping an indexed column in a function forces the database to evaluate the function for every row before it can apply the filter. The index is completely bypassed.
In development with 500 orders this is instant. In production with 20 million orders this is a full table scan that nobody saw coming.
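The fix is to rewrite the filter as a range on the bare column so the index applies. A small SQLite sketch (SQLite has no YEAR(), so strftime stands in for it; names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
con.execute("CREATE INDEX idx_orders_created ON orders(created_at)")

def plan(sql):
    # EXPLAIN QUERY PLAN rows are (id, parent, notused, detail)
    return con.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Function wrapped around the column: the index cannot be used.
fn_plan = plan(
    "SELECT * FROM orders WHERE strftime('%Y', created_at) = '2025'")
# Equivalent range on the bare column: same rows, index-friendly.
range_plan = plan(
    "SELECT * FROM orders "
    "WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'")
print(fn_plan)     # SCAN ...
print(range_plan)  # SEARCH ... USING ... INDEX ...
```

The same rewrite works for MySQL's YEAR(): `WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01'` returns the identical result set and keeps the index in play.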
The implicit type conversion
SELECT * FROM users WHERE user_id = '12345'
user_id is an integer. You're comparing it to a string. On many engines the database resolves the mismatch by casting every value in the column before comparing, and your index on user_id is ignored entirely.
This works fine in development. It works fine in production until the table grows large enough that the full scan takes longer than your timeout allows.
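The fix is to make the types match before the query ever reaches the database, typically by binding a correctly typed parameter instead of interpolating a string. A minimal Python sketch with sqlite3 (SQLite's type affinity is forgiving here, so this shows the habit, not the scan):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, email TEXT)")
con.execute("INSERT INTO users VALUES (12345, 'ada@example.com')")

# Bind a Python int so the comparison is integer-to-integer; never
# splice '12345' into the SQL text as a quoted string literal.
row = con.execute(
    "SELECT email FROM users WHERE user_id = ?", (12345,)).fetchone()
print(row)  # ('ada@example.com',)
```

Parameterized queries with correctly typed values also sidestep SQL injection, so this one habit pays twice.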
The N+1 hiding in the ORM
orders = Order.objects.all()
for order in orders:
    print(order.customer.name)
This looks like two queries. It's actually one query to fetch all orders and then one additional query per order to fetch the customer. With 10 orders in development that's 11 queries. With 10,000 orders in production that's 10,001 queries and your database is on its knees.
ORMs make N+1 invisible. The code looks clean. The query count is catastrophic.
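You can watch the query count explode without any ORM at all. A small sqlite3 sketch that counts executed statements (the schema is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1), (2, 2), (3, 1);
""")
executed = []
con.set_trace_callback(executed.append)  # record every statement run

# N+1: one query for the orders, then one more per order.
executed.clear()
orders = con.execute("SELECT id, customer_id FROM orders").fetchall()
names = [con.execute("SELECT name FROM customers WHERE id = ?",
                     (cid,)).fetchone()[0]
         for _, cid in orders]
n_plus_one = len(executed)  # 1 + 3 = 4 statements

# Eager version: one JOIN returns the same data in a single query.
executed.clear()
rows = con.execute(
    "SELECT o.id, c.name FROM orders o "
    "JOIN customers c ON c.id = o.customer_id").fetchall()
single = len(executed)  # 1 statement
```

In Django the fix is Order.objects.select_related('customer'), which folds the customer lookup into a single JOIN; most ORMs have an equivalent eager-loading mechanism.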
SELECT * on a wide table
SELECT * FROM events WHERE user_id = 123
Your events table has 47 columns including several TEXT and JSON columns. You need 3 of them. SELECT * fetches all 47 columns for every matching row, transfers all of that data over the network, and loads it all into memory.
In development your events table has 200 rows. In production it has 200 million rows and this query is reading gigabytes of data you never needed.
On cloud warehouses like BigQuery or Athena you pay per byte scanned. SELECT * doesn't just slow you down. It charges you for every column you didn't need.
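The fix is mechanical: name only the columns the code path actually reads. The column names below are hypothetical, since the article's events table isn't spelled out:

```sql
-- Only the three columns this code path actually uses
SELECT event_type, created_at, status
FROM events
WHERE user_id = 123
```

On column-oriented warehouses this matters even more: an explicit column list means the engine never touches the storage for the other 44 columns, and never bills you for them.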
The moment everything changes
The reason these patterns are so dangerous is not just that they're slow. It's that they're invisible until they aren't.
Your query runs fine for months. Then the table crosses a threshold — 1 million rows, 10 million rows, 50 million rows — and the query plan changes. What was fast yesterday is catastrophic today. Nothing in the code changed. Only the data.
The on-call engineer at 2am is not dealing with a new bug. They're dealing with a bug that shipped months ago and finally found the right conditions to detonate.
The feedback loop is broken by design
The fundamental problem is that static code review happens at development scale and production incidents happen at production scale. These two environments will never be the same and no amount of careful review can bridge that gap if the reviewer is running queries against a development database.
The only way to catch these patterns before they ship is to analyze the query itself, not the query's performance on a small dataset. Leading wildcards are always dangerous regardless of table size. Functions on indexed columns always bypass the index regardless of row count. SELECT * is always fetching more than you need.
These patterns don't need production data to be identified as dangerous. They need static analysis.
Catching these before they ship
SlowQL runs against your SQL files before they merge. It doesn't execute queries. It doesn't need a database connection. It reads the query itself and flags the patterns that are universally dangerous regardless of scale.
Leading wildcards. Functions on indexed columns. SELECT * in production queries. Implicit type conversions. Unbounded queries with no LIMIT. All of them flagged before the code ever reaches review.
pip install slowql
slowql --input-file sql/ --fail-on high
Add it to your CI pipeline and the query that would have paged you at 2am gets caught before it merges on Friday afternoon.
- uses: makroumi/slowql-action@v1
  with:
    input: sql/
    fail-on: high
171 rules across performance, security, reliability, compliance and cost. Completely offline. Your SQL never leaves your machine.
github.com/makroumi/slowql
The query is already in your codebase
The pattern that will page you at 2am is probably already there. It shipped in a PR that looked fine in review. It ran fast in staging. It's been in production for months waiting for the table to grow large enough.
Static analysis won't catch everything. It won't replace EXPLAIN ANALYZE or query plan monitoring. But it catches the patterns that are universally dangerous before they ever reach a database, before they ever reach review, before they ever get the chance to detonate at scale.
The 2am page is optional. The query that causes it is not.