The Bug: How a Missing Database Index Cost Us Real Money

It was 3 AM when my phone exploded with notifications.

Our IoT dashboard was down. 500+ devices weren't reporting data.
Customers were calling support. And our cloud bill was climbing
faster than my heart rate.

The culprit? A single missing database index.

Here's the story of how one overlooked optimization decision cost
us approximately $10,000 in cloud costs, customer trust, and about
72 hours of my life I'll never get back.

Let me paint the picture. I was three months into my role at an
IoT company. We had just onboarded a major client - a chain of
smart buildings with 200 devices across multiple locations. Our
platform monitored temperature, humidity, occupancy, you name it.

During testing with 50-100 devices, everything looked great.
Response times were decent. No red flags. We pushed to production
feeling confident.

Big mistake.

The new client went live on a Friday afternoon. (First lesson:
never deploy on Friday. But that's another story.)

Everything seemed fine... for about 4 hours.

Then our monitoring dashboard started throwing warnings:

  • API response times creeping up: 500ms... 1000ms... 2000ms
  • Database CPU usage spiking to 90%
  • Memory consumption climbing steadily
  • Customer complaints starting to roll in

By Saturday morning, the system was essentially unusable.

Saturday, 9 AM. Tea. My weekend plans are already toast.

I started where any sensible developer would - checking the logs.
Nothing obviously broken. No errors. No crashes. The system was
just... slow. Painfully slow.

Then I checked our cloud provider dashboard. My stomach dropped.

Our database instance was working overtime:

  • CPU: 95% constantly
  • IOPS (disk operations): Through the roof
  • Network throughput: Maxed out
  • Memory: Swapping to disk

The AWS bill was already showing $300 for the day. Normal daily
cost? About $20.

I pulled up MongoDB Compass and ran the profiler:

// Level 2 profiles every operation (fine for a short debugging session)
db.setProfilingLevel(2)
// Show every profiled operation slower than 100ms, slowest first
db.system.profile.find({millis: {$gt: 100}}).sort({millis: -1})

What I saw made my blood run cold.

One query was running thousands of times per minute, and each
execution was taking 3-5 seconds:

{
  "op": "query",
  "ns": "iot_db.sensor_readings",
  "query": {
    "deviceId": "...",
    "timestamp": { "$gte": ISODate("..."), "$lte": ISODate("...") }
  },
  "millis": 4823,
  "planSummary": "COLLSCAN",
  "docsExamined": 2847392,
  "nreturned": 288
}

See that "COLLSCAN"? That's MongoDB speak for "I'm checking
every single document in your collection because I have no idea
how to find what you want efficiently."

We were scanning 2.8 MILLION documents to return 288 results.

Every. Single. Time.

Our sensor_readings collection looked innocent enough:

{
  "_id": ObjectId("..."),
  "deviceId": "DEVICE_123",
  "timestamp": ISODate("2024-11-10T10:30:00Z"),
  "temperature": 22.5,
  "humidity": 45,
  "metadata": { ... }
}

The dashboard query was simple - get readings for a device in
a time range. We did this constantly for every device on every
dashboard refresh.
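
In query form, it looked roughly like this (a sketch with illustrative
values; the real dashboard plugged in whatever device and time window
the user was viewing):

db.sensor_readings.find({
  deviceId: "DEVICE_123",
  timestamp: {
    $gte: ISODate("2024-11-10T00:00:00Z"),
    $lte: ISODate("2024-11-11T00:00:00Z")
  }
})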

With 50 test devices and a few thousand readings, MongoDB's
default behavior worked fine. It could scan everything quickly
enough.

But with 200+ devices and millions of readings? Disaster.

Here's what was happening behind the scenes:

  1. User opens dashboard
  2. Frontend requests data for 20 devices (visible on screen)
  3. Backend makes 20 database queries
  4. Each query scans millions of documents
  5. Database CPU hits 100%
  6. Queries queue up behind each other
  7. Response times balloon to 30+ seconds
  8. Users refresh impatiently, creating MORE queries
  9. Everything grinds to a halt

The solution was embarrassingly simple.

I created a compound index:

db.sensor_readings.createIndex({
  deviceId: 1,
  timestamp: -1
})

That's it. One line of code.

I ran it from my laptop on the production database. It took about
12 minutes to build the index across our millions of documents.

Then I watched the monitoring dashboard.

Within 60 seconds:

  • Query time dropped from 4000ms to 15ms
  • Database CPU fell from 95% to 12%
  • API response times went from 30s to 300ms
  • Customer complaints stopped

It felt like magic. Terrible, "this should have been here all along"
magic.
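
If you want to verify a fix like this yourself, explain() tells you
immediately whether the plan changed (a rough check; the values here
are illustrative):

db.sensor_readings.find({
  deviceId: "DEVICE_123",
  timestamp: { $gte: ISODate("2024-11-10T00:00:00Z") }
}).explain("executionStats")

// Before the index: a COLLSCAN stage and totalDocsExamined in the millions.
// After the index: an IXSCAN stage on { deviceId: 1, timestamp: -1 },
// with totalDocsExamined roughly equal to nReturned.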

Let's talk numbers, because this hurt.

Direct Costs:

  • Extra cloud infrastructure: ~$2,800 for the weekend
  • Emergency scaling of database instances: $1,200
  • On-call developer time (me + senior dev): ~$3,000
  • Total direct cost: ~$7,000

Indirect Costs:

  • Customer support time handling complaints: ~10 hours
  • Engineering time investigating: ~16 hours
  • Trust with new client: Priceless

And all of this could have been prevented by adding one index
during development.

What I Learned (The Hard Way)

  1. "Works in development" means nothing

    • Test with production-scale data
    • Always simulate realistic load
    • 100 records ≠ 1,000,000 records
  2. Indexes aren't optional for production

    • Profile queries BEFORE deploying
    • Index your query patterns, not random fields
    • Use db.collection.explain("executionStats") religiously
  3. Monitor from day one

    • Set up query performance monitoring
    • Alert on slow queries (>100ms for critical paths)
    • Track database metrics continuously (a profiler sketch follows this list)
  4. The "it's fast enough" trap

    • What's fast at 10 requests/minute breaks at 1000
    • Performance problems compound exponentially
    • Optimization isn't premature if you know you'll scale
  5. Cloud costs can spiral FAST

    • Set up billing alerts
    • Inefficient code = expensive infrastructure
    • $20/day can become $300/day overnight
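
The monitoring piece doesn't have to be fancy. This is roughly the
baseline I set up now with MongoDB's built-in profiler (the 100ms
threshold is a starting point, not a universal rule):

// Level 1 records only operations slower than slowms, so it's far
// cheaper to leave on than the level 2 "profile everything" mode
db.setProfilingLevel(1, { slowms: 100 })

// Review what got caught, slowest first
db.system.profile.find({ op: "query" }).sort({ millis: -1 }).limit(10)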

The MongoDB Indexing Cheat Sheet

Here's what I wish I knew:

Common Query Patterns → Index Strategy

  1. Exact match on single field:
    Query: {userId: "123"}
    Index: {userId: 1}

  2. Range query on single field:
    Query: {timestamp: {$gte: date}}
    Index: {timestamp: 1} or {timestamp: -1}

  3. Multiple exact matches:
    Query: {userId: "123", status: "active"}
    Index: {userId: 1, status: 1}

  4. Exact match + range:
    Query: {deviceId: "ABC", timestamp: {$gte: date}}
    Index: {deviceId: 1, timestamp: -1}
    (Exact match field FIRST! See the sketch after this list.)

  5. Sorting:
    Query: find({}).sort({createdAt: -1})
    Index: {createdAt: -1}
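
Patterns 4 and 5 also combine: one compound index can cover both the
filter and the sort, which is exactly what our dashboard query needed
(values below are illustrative):

db.sensor_readings.createIndex({ deviceId: 1, timestamp: -1 })

// The equality match on deviceId plus the descending timestamp order
// lets MongoDB walk the index in sorted order instead of sorting
// results in memory.
db.sensor_readings.find({
  deviceId: "DEVICE_123",
  timestamp: { $gte: ISODate("2024-11-10T00:00:00Z") }
}).sort({ timestamp: -1 })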

Red Flags to Watch For:

  • COLLSCAN in explain() output
  • docsExamined >> nreturned (scanning way more than returning)
  • Query time consistently >100ms
  • Database CPU >50% with normal load

Key Takeaways

If you remember nothing else from this post:

✅ Always index your query patterns BEFORE production
✅ Test with production-scale data, not toy datasets
✅ Monitor query performance from day one
✅ Set up cloud billing alerts (seriously)
✅ "Fast enough" in development can be "disaster" in production
✅ One missing index can cost thousands of dollars

And maybe don't deploy on Friday afternoons. Just a thought.
