It was 3 AM when my phone exploded with notifications.
Our IoT dashboard was down. 500+ devices weren't reporting data.
Customers were calling support. And our cloud bill was climbing
faster than my heart rate.
The culprit? A single missing database index.
Here's the story of how one overlooked optimization decision cost
us approximately $10,000 in cloud costs, customer trust, and about
72 hours of my life I'll never get back.
Let me paint the picture. I was three months into my role at an
IoT company. We had just onboarded a major client - a chain of
smart buildings with 200 devices across multiple locations. Our
platform monitored temperature, humidity, occupancy, you name it.
During testing with 50-100 devices, everything looked great.
Response times were decent. No red flags. We pushed to production
feeling confident.
Big mistake.
The new client went live on a Friday afternoon. (First lesson:
never deploy on Friday. But that's another story.)
Everything seemed fine... for about 4 hours.
Then our monitoring dashboard started throwing warnings:
- API response times creeping up: 500ms... 1000ms... 2000ms
- Database CPU usage spiking to 90%
- Memory consumption climbing steadily
- Customer complaints starting to roll in
By Saturday morning, the system was essentially unusable.
Saturday, 9 AM. Tea. My weekend plans are already toast.
I started where any sensible developer would - checking the logs.
Nothing obviously broken. No errors. No crashes. The system was
just... slow. Painfully slow.
Then I checked our cloud provider dashboard. My stomach dropped.
Our database instance was working overtime:
- CPU: 95% constantly
- IOPS (disk operations): Through the roof
- Network throughput: Maxed out
- Memory: Swapping to disk
The AWS bill was already showing $300 for the day. Normal daily
cost? About $20.
I pulled up MongoDB Compass and ran the profiler:

```javascript
// Level 2: record every operation, then list the slowest ones
db.setProfilingLevel(2)
db.system.profile.find({ millis: { $gt: 100 } }).sort({ millis: -1 })
```
What I saw made my blood run cold.
One query was running thousands of times per minute, and each
execution was taking 3-5 seconds:
```javascript
{
  "op": "query",
  "ns": "iot_db.sensor_readings",
  "query": {
    "deviceId": "...",
    "timestamp": { "$gte": ISODate("..."), "$lte": ISODate("...") }
  },
  "millis": 4823,
  "planSummary": "COLLSCAN",
  "docsExamined": 2847392,
  "nreturned": 288
}
```
See that "COLLSCAN"? That's MongoDB speak for "I'm checking
every single document in your collection because I have no idea
how to find what you want efficiently."
We were scanning 2.8 MILLION documents to return 288 results.
Every. Single. Time.
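To see what that ratio of docsExamined to nreturned actually means, here's a toy Node.js simulation (the collection and all the numbers are invented for illustration): a full scan examines every document in the collection, while an index-style lookup only touches the documents it returns.

```javascript
// Fake collection: 100 devices x 1,000 readings each = 100,000 documents.
const readings = [];
for (let d = 0; d < 100; d++) {
  for (let i = 0; i < 1000; i++) {
    readings.push({ deviceId: `DEVICE_${d}`, timestamp: i, temperature: 20 + (i % 5) });
  }
}

// COLLSCAN: examine every document to find one device's readings.
function collScan(target) {
  let examined = 0;
  const out = [];
  for (const doc of readings) {
    examined++;
    if (doc.deviceId === target) out.push(doc);
  }
  return { examined, returned: out.length };
}

// "Index": a Map from deviceId to its documents, built once up front.
const byDevice = new Map();
for (const doc of readings) {
  if (!byDevice.has(doc.deviceId)) byDevice.set(doc.deviceId, []);
  byDevice.get(doc.deviceId).push(doc);
}

function indexedLookup(target) {
  const out = byDevice.get(target) ?? [];
  return { examined: out.length, returned: out.length };
}

console.log(collScan("DEVICE_42"));      // { examined: 100000, returned: 1000 }
console.log(indexedLookup("DEVICE_42")); // { examined: 1000, returned: 1000 }
```

Same results, two orders of magnitude less work. A real B-tree index is more subtle than a Map, but the shape of the win is the same.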
Our sensor_readings collection looked innocent enough:
```javascript
{
  "_id": ObjectId("..."),
  "deviceId": "DEVICE_123",
  "timestamp": ISODate("2024-11-10T10:30:00Z"),
  "temperature": 22.5,
  "humidity": 45,
  "metadata": { ... }
}
```
The dashboard query was simple - get readings for a device in
a time range. We did this constantly for every device on every
dashboard refresh.
With 50 test devices and a few thousand readings, MongoDB's
default behavior worked fine. It could scan everything quickly
enough.
But with 200+ devices and millions of readings? Disaster.
Here's what was happening behind the scenes:
- User opens dashboard
- Frontend requests data for 20 devices (visible on screen)
- Backend makes 20 database queries
- Each query scans millions of documents
- Database CPU hits 100%
- Queries queue up behind each other
- Response times balloon to 30+ seconds
- Users refresh impatiently, creating MORE queries
- Everything grinds to a halt

The solution was embarrassingly simple.
I created a compound index:
```javascript
db.sensor_readings.createIndex({
  deviceId: 1,   // equality field first
  timestamp: -1  // then the range/sort field, newest first
})
```
That's it. One line of code.
I ran it from my laptop on the production database. It took about
12 minutes to build the index across our millions of documents.
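One caveat worth knowing before you do this on your own production database (this is version-dependent, so treat it as a pointer to check, not gospel): on MongoDB versions before 4.2, a default foreground createIndex locked the collection for the entire build, and you had to opt into a slower background build to keep serving traffic. From 4.2 onward, index builds yield to other operations and the option is ignored.

```javascript
// Only relevant on MongoDB < 4.2: build without blocking reads/writes.
// On 4.2+ this option is accepted but ignored; builds no longer block.
db.sensor_readings.createIndex(
  { deviceId: 1, timestamp: -1 },
  { background: true }
)
```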
Then I watched the monitoring dashboard.
Within 60 seconds:
- Query time dropped from 4000ms to 15ms
- Database CPU fell from 95% to 12%
- API response times went from 30s to 300ms
- Customer complaints stopped
It felt like magic. Terrible, "this should have been here all along"
magic.
Let's talk numbers, because this hurt.
Direct Costs:
- Extra cloud infrastructure: ~$2,800 for the weekend
- Emergency scaling of database instances: $1,200
- On-call developer time (me + senior dev): ~$3,000
- Total direct cost: ~$7,000
Indirect Costs:
- Customer support time handling complaints: ~10 hours
- Engineering time investigating: ~16 hours
- Trust with new client: Priceless
And all of this could have been prevented by adding one index
during development.
What I Learned (The Hard Way)

1. "Works in development" means nothing
   - Test with production-scale data
   - Always simulate realistic load
   - 100 records ≠ 1,000,000 records
2. Indexes aren't optional for production
   - Profile queries BEFORE deploying
   - Index your query patterns, not random fields
   - Use db.collection.explain("executionStats") religiously
3. Monitor from day one
   - Set up query performance monitoring
   - Alert on slow queries (>100ms for critical paths)
   - Track database metrics continuously
4. The "it's fast enough" trap
   - What's fast at 10 requests/minute breaks at 1000
   - Performance problems compound exponentially
   - Optimization isn't premature if you know you'll scale
5. Cloud costs can spiral FAST
   - Set up billing alerts
   - Inefficient code = expensive infrastructure
   - $20/day can become $300/day overnight
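On the monitoring point: MongoDB's built-in profiler gives you a cheap slow-query log without any extra tooling. A sketch (the 100ms threshold is the one from this post; tune it to your own latency budget):

```javascript
// Level 1: record only operations slower than slowms (here 100 ms)
db.setProfilingLevel(1, { slowms: 100 })

// Review the worst recent offenders
db.system.profile.find().sort({ ts: -1 }).limit(10)
```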
The MongoDB Indexing Cheat Sheet

Here's what I wish I knew:

Common Query Patterns → Index Strategy

Exact match on single field:
- Query: {userId: "123"}
- Index: {userId: 1}

Range query on single field:
- Query: {timestamp: {$gte: date}}
- Index: {timestamp: 1} or {timestamp: -1}

Multiple exact matches:
- Query: {userId: "123", status: "active"}
- Index: {userId: 1, status: 1}

Exact match + range:
- Query: {deviceId: "ABC", timestamp: {$gte: date}}
- Index: {deviceId: 1, timestamp: -1} (Exact match field FIRST!)

Sorting:
- Query: find({}).sort({createdAt: -1})
- Index: {createdAt: -1}
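That "exact match field FIRST" note deserves emphasis. Here's a toy model of why field order matters (it treats an index as a flat list of entries and counts how many a bounded scan must touch; all numbers are invented):

```javascript
// 10 devices x 100 timestamped readings each = 1,000 index entries.
const entries = [];
for (let d = 0; d < 10; d++)
  for (let t = 0; t < 100; t++)
    entries.push({ deviceId: `D_${d}`, timestamp: t });

// Query: deviceId == "D_3" AND 10 <= timestamp <= 19 (10 matching docs).

// Index {deviceId: 1, timestamp: -1}: both fields bound the scan, so it
// touches only entries satisfying the device AND the time range.
const examinedGood = entries.filter(
  e => e.deviceId === "D_3" && e.timestamp >= 10 && e.timestamp <= 19
).length;

// Index {timestamp: -1, deviceId: 1}: only the leading timestamp field can
// bound the scan; every device's entries in the window get examined.
const examinedBad = entries.filter(
  e => e.timestamp >= 10 && e.timestamp <= 19
).length;

console.log(examinedGood); // 10 entries examined for 10 results
console.log(examinedBad);  // 100 entries examined for the same 10 results
```

Put the equality field first and the range field after it, and the scan stays tight; reverse the order and you're back to examining far more than you return.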
Red Flags to Watch For:
- COLLSCAN in explain() output
- docsExamined >> nreturned (scanning way more than returning)
- Query time consistently >100ms
- Database CPU >50% with normal load
Key Takeaways
If you remember nothing else from this post:
✅ Always index your query patterns BEFORE production
✅ Test with production-scale data, not toy datasets
✅ Monitor query performance from day one
✅ Set up cloud billing alerts (seriously)
✅ "Fast enough" in development can be "disaster" in production
✅ One missing index can cost thousands of dollars
And maybe don't deploy on Friday afternoons. Just a thought.