It started, as most tech crises do, with an announcement and a pastry.
You were three bites into a blueberry muffin when the CTO burst into the dev pit, eyes wide, voice too loud, radiating the kind of giddy terror usually reserved for space launches and wedding proposals.
“We did it. We landed GigaGym.”
A hush fell over the room. Someone from Sales whispered, “No way,” like they were invoking a forbidden name.
You set down your muffin, dreading the next words.
“They’re onboarding next month. They’re bringing 100,000 concurrent users.”
Applause erupted.
People hugged.
Marketing began updating the pitch deck with fireworks emojis.
But not you.
Because you know the truth: your poor database, let’s call him Postgres Pete, is already sweating through his metaphorical t-shirt handling 50 users during peak lunch traffic. The last time someone ran a report and clicked “export CSV,” Pete let out a wheeze and crashed like a Windows 98 desktop.
And now?
100,000 users. Concurrent. From a fitness company that livestreams biometric yoga data to AI coaches and 12K smart mirrors.
GigaGym isn’t just a client. It’s a stress test wrapped in venture funding and Bluetooth-enabled shame.
So congratulations. You’ve entered the Scaling Gauntlet™.
Welcome to Database Scaling, Part One, where we explore the ancient ruins of your query planner, tune connection pools like it’s F1 season, and prepare your system to survive a tidal wave of abs and analytics.
Chapter 1: Reading the Ruins
"EXPLAIN. Like I'm five."
It all begins here, in the wreckage of a slow-loading dashboard and a pile of unexplained EXPLAIN outputs.
Your system just got hit with the news that GigaGym is coming, bringing 100,000 concurrent users to a database that's already wheezing at 50. Panic is setting in. But deep down, you know what you have to do:
You must descend into the ancient ruins of your queries and uncover what sins past developers have committed.
You run your first EXPLAIN ANALYZE, expecting insight. (It's generally safe to run in production, but be careful with write-heavy or long-running mutation queries: ANALYZE actually executes the statement.) Instead, the output reads like a debug log from a sentient compiler having a minor panic attack:
Nested Loop  (cost=0.85..15204.13 rows=14 width=48) (actual time=0.049..404.375 rows=1109 loops=1)
  ->  Seq Scan on users u  (cost=0.00..35.50 rows=2550 width=4) (actual time=0.008..1.074 rows=2550 loops=1)
  ->  Index Scan using idx_orders_user_id on orders o  (cost=0.43..5.90 rows=1 width=44) (actual time=0.051..0.149 rows=1 loops=2550)
You don't need to understand it all yet. Just know this: when you see Seq Scan or Nested Loop with inflated row counts or a ballooning execution time, your query is doing far more work than it should.
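For context, a plan like that typically comes from a join between users and orders. The original query isn't shown here, so treat this as a hypothetical reconstruction of its shape:

-- Hypothetical query shape only: every user scanned sequentially, then
-- orders probed one user at a time through idx_orders_user_id.
EXPLAIN ANALYZE
SELECT u.id, o.total
FROM users u
JOIN orders o ON o.user_id = u.id
WHERE o.status = 'pending';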
The PostgreSQL Decoder Ring:
- Seq Scan on a large table: Your DB is scanning every row. This is fine for tiny tables, not for joins across millions of rows.
- Nested Loop with high row counts: You may be joining two large sets without indexes. Watch out for multiplying costs.
- Sort spilling to disk: Sort operations that don’t fit in memory slow everything down. Tune work_mem or refactor.
- Hash Join with disk I/O: Hash joins are fast in memory, but once they spill, it’s slog city.
Example: If your query plan says:
Sort (cost=104.33..104.84 rows=204 width=56) (actual time=42.173..42.257 rows=300 loops=1)
Sort Key: orders.created_at
Sort Method: quicksort Memory: 38kB
That’s fine.
But if you see:
Sort Method: external merge Disk: 560MB
You’ve got a problem: you’re sorting more data than fits in the memory you’ve given it. Fix the query or tune your settings. Consider adding a LIMIT clause, filtering earlier in the query, or increasing work_mem in your DB configuration so the sort can stay in memory.
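If you go the work_mem route, you can test a bigger value per session before touching the server-wide config (the 64MB below is illustrative, not a recommendation):

-- Raise the sort memory budget for this session only, then re-run the
-- offending query under EXPLAIN ANALYZE and check the Sort Method line.
SET work_mem = '64MB';

-- If it clearly helps, persist it server-wide (needs superuser or
-- equivalent privileges) and reload the configuration.
ALTER SYSTEM SET work_mem = '64MB';
SELECT pg_reload_conf();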
Index Design That Doesn't Suck (and When to Skip Them)
Not all indexes are created equal. Let's look at three that actually help.
1. The "Covering Index" – Bring What You Need
CREATE INDEX idx_user_orders_covering
ON orders (user_id, created_at)
INCLUDE (total, status, product_id);
Why? The query gets everything it needs from the index itself. No need to go back to the main table.
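For instance, a lookup that only touches those columns (names taken from the index definition above) can typically be answered as an Index Only Scan, never visiting the table itself:

-- Every referenced column lives in idx_user_orders_covering,
-- so the heap never needs to be touched. (42 is just a placeholder id.)
SELECT created_at, total, status, product_id
FROM orders
WHERE user_id = 42
ORDER BY created_at DESC;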
2. The "Partial Index" – Don't Index Trash
CREATE INDEX idx_active_orders
ON orders (user_id, created_at)
WHERE status IN ('pending', 'processing');
Why? If 90% of rows are completed orders you'll never query, this keeps the index lean and fast.
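One caveat worth knowing: the planner only uses a partial index when the query’s WHERE clause implies the index predicate, so the lookup should filter on the same statuses, roughly like this:

-- The status filter matches the index predicate, so idx_active_orders applies.
SELECT user_id, created_at
FROM orders
WHERE status IN ('pending', 'processing')
  AND user_id = 42
ORDER BY created_at;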
3. The "Expression Index" – For the Creative WHERE Clause
CREATE INDEX idx_user_email_lower
ON users (LOWER(email));
Why? Case-insensitive lookups are fast now, instead of soul-crushing.
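The lookup just has to use the same expression the index was built on, something like this (the email value is only a placeholder):

-- LOWER(email) matches the indexed expression, so this becomes an index
-- lookup instead of a full scan.
SELECT id, name
FROM users
WHERE LOWER(email) = LOWER('Pete@GigaGym.com');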
⚠️ But Wait. When NOT to Index
Adding indexes isn't free. Every insert, update, or delete now has to update those indexes too. Indexes cost storage and write performance.
Skip the index if:
- The column is low-cardinality (e.g. status = 'active') and queried infrequently.
- You write far more often than you read.
- You're indexing a column just because “we might need it later.” (You probably won’t.)
Choose wisely. Indexes are powerful, but they’re not coupons. You don’t need to collect them all.
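If you’re unsure whether an existing index is pulling its weight, PostgreSQL’s pg_stat_user_indexes view tracks how often each one is actually scanned; treat low numbers as a prompt to investigate, not an automatic drop order:

-- Indexes near the top of this list are rarely (or never) used.
SELECT relname       AS table_name,
       indexrelname  AS index_name,
       idx_scan      AS times_used
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;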
Query Rewriting Kung Fu
Instead of this query that makes your DB cry:
SELECT DISTINCT u.name,
  (SELECT COUNT(*) FROM orders WHERE user_id = u.id) as order_count
FROM users u
WHERE u.created_at > '2024-01-01';
Try this:
SELECT u.name, COALESCE(o.order_count, 0) as order_count
FROM users u
LEFT JOIN (
  SELECT user_id, COUNT(*) as order_count
  FROM orders
  GROUP BY user_id
) o ON u.id = o.user_id
WHERE u.created_at > '2024-01-01';
It’s cleaner, faster, and your database won’t develop abandonment issues.
The N+1 Problem: Death by Papercuts
You’re fetching 100 users. Then 100 more queries to get their orders. You're making your database do burpees for no reason.
# This looks fine. It is not fine.
users = User.objects.filter(active=True)
for user in users:
    print(user.orders.count())
Instead:
users = User.objects.filter(active=True).prefetch_related('orders')
for user in users:
    print(user.orders.count())
One query for users. One for orders. That’s it. Your DB breathes a sigh of relief.
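If the count is all you need, an aggregate annotation (Django’s Count) can do it in a single query and skip loading the orders entirely. A minimal sketch, assuming the same User model and the 'orders' related name used above:

from django.db.models import Count

# One query: users joined to an aggregated order count, computed in the database
users = User.objects.filter(active=True).annotate(order_count=Count('orders'))
for user in users:
    print(user.order_count)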
TL;DR: Your First Scalability Wins
- Avoid full table scans unless you're absolutely sure it’s cheap
- Use covering or partial indexes that match your query pattern
- Rewrite nested subqueries into joins when possible
- Avoid N+1 queries through prefetching or eager loading
- Use EXPLAIN ANALYZE to verify query plans, not guess them
Next up: Connection Pooling and how to stop your app from DDoSing your own database.
But for now, take a breath. You just started the journey from query chaos to performance Zen.
And remember: every slow dashboard is just a poorly indexed story waiting to be rewritten.