
Milinda Biswas


We Hit 6 Billion MongoDB Documents (And Lived to Tell the Tale)


So here's the thing about running a database at scale – nobody tells you about the weird stuff until you're already knee-deep in it. At Avluz.com, we crossed 6 billion documents in our MongoDB cluster this year, and honestly? It was equal parts terrifying and fascinating.

We started on AWS like everyone else. Three years in, our monthly bill hit $7,500 and our CFO was giving me that look in every meeting. We moved everything to OVH, spent four months optimizing with help from GenSpark AI, and now we're paying $2,180/month for better performance.

Here's what actually happened.

The "Oh Crap" Moment

Picture this: It's 2AM, I'm getting Slack alerts that queries are timing out, and our main collection just hit 4.8 billion documents. Our carefully tuned indexes? Useless. Our query times went from "pretty good" to "are you even trying?" in about two weeks.

That's when we knew we had to do something drastic. The AWS bills were bad enough, but watching our p95 query times climb to 8 seconds? That was the real pain.

Why We Ditched AWS for OVH

Look, AWS is great. Their managed services are top-notch. But when you're running dedicated MongoDB instances and you know what you're doing, the cost difference is just… brutal.

Cost Comparison

Check out these numbers:

  • AWS: $7,500/month for our replica set
  • GCP: $6,800/month (we checked)
  • OVH: $2,180/month for the same specs

That's not a typo. Same hardware – 32 cores, 256GB RAM per node, NVMe storage. The catch? OVH doesn't hold your hand as much. But that's fine because we were already managing everything manually anyway.

The real kicker? OVH's vRack private network between servers is completely free. AWS was charging us $459/month just for replication traffic between nodes. That's $5,500 a year on network transfers alone. For data that never even leaves the datacenter!

Our Sharding Disaster (And Recovery)

When you hit billions of documents, sharding isn't optional. But man, can you screw it up.

Our first attempt:

sh.shardCollection("avluz.events", { user_id: 1 })

Seemed logical, right? Every query filters by user_id, so let's shard on that. Except we have power users who generate 10x more data than normal users. Within a week, some shards were at 900GB while others were chilling at 120GB.


Queries on the hot shards were dying. The MongoDB balancer was moving chunks around constantly, making everything worse. It was a mess.
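If you want to see that kind of skew on your own cluster, mongosh will show it to you directly. This is just the generic way to check, run against our events collection; swap in your own names:

// Data size, document count, and chunk count per shard for one collection
db.events.getShardDistribution()

// Cluster-wide view: shards, chunk counts, balancer state, recent migrations
sh.status()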

I spent three days reading documentation, blog posts, and MongoDB's forums. Then I just asked GenSpark:

"I have 6B MongoDB documents with user_id and timestamp. Queries filter by user_id and date range. My shards are super unbalanced. What do I do?"

It came back with a compound sharding key strategy I hadn't even considered:

sh.shardCollection("avluz.events", { 
  user_id: "hashed", 
  timestamp: 1 
})

Hashing the user_id distributes power users evenly across shards. The timestamp as secondary key helps with our range queries.
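The nice part is that queries with an equality match on user_id are still routed to a single shard, because mongos can hash the value itself. Something like this stays targeted (the user id is made up, obviously):

// Equality on user_id (the hashed prefix) keeps this on one shard;
// the timestamp range then bounds the scan inside that shard.
db.events.find({
  user_id: 48291,
  timestamp: { $gte: ISODate("2024-01-01"), $lt: ISODate("2024-02-01") }
})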

Resharding 6 billion documents took 72 hours of nail-biting, but the results were immediate:

  • Shard sizes went from all over the place to within 85GB of each other
  • Query latency dropped from 2.4 seconds to 78ms
  • The balancer finally calmed down
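For anyone wondering about the mechanics: on MongoDB 5.0+ there's a built-in live resharding command, which is one way to make a key change like this in place. A sketch against our collection looks like the following; read the docs on spare disk and oplog headroom before you try it:

// MongoDB 5.0+: reshard in place to the new compound key (heavy operation)
sh.reshardCollection("avluz.events", { user_id: "hashed", timestamp: 1 })

// Watch progress via $currentOp on the admin database
db.getSiblingDB("admin").aggregate([
  { $currentOp: { allUsers: true, localOps: false } },
  { $match: { type: "op", "originatingCommand.reshardCollection": "avluz.events" } }
])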

The Index Audit From Hell

We had 47 indexes across our collections. I knew some were probably useless, but which ones? Going through MongoDB logs manually would've taken weeks.

So I dumped our index stats and slow query logs into a file and asked GenSpark to analyze it. Twenty minutes later, it told me:

  • 12 indexes had literally zero accesses in 30 days
  • 8 were redundant (covered by other compound indexes)
  • 6 had fields in the wrong order for our query patterns
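If you want to run the same audit yourself, the raw usage numbers come from $indexStats. The counters reset on restart and are per node, so check every replica set member:

// Per-index access counts, least-used first
db.events.aggregate([
  { $indexStats: {} },
  { $project: { name: 1, ops: "$accesses.ops", since: "$accesses.since" } },
  { $sort: { ops: 1 } }
])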


I'll be honest – I felt pretty dumb. Some of these were obvious in hindsight. But when you're managing dozens of indexes across multiple collections, obvious things stop being obvious.

We dropped the useless indexes and reordered the problematic ones. Results:

  • Index storage: 840GB → 380GB (55% reduction!)
  • Write performance: +28% faster
  • My stress level: way down

The whole process took maybe four hours of actual work. Doing this manually would've been a multi-week project involving spreadsheets, meetings, and probably a few arguments.

That Time We Melted Our Cache

MongoDB's WiredTiger storage engine cache is supposed to be magical. By default it grabs roughly 50% of your RAM (technically 50% of RAM minus 1GB). We have 256GB per server, so that's about 128GB of cache. Should be plenty, right?

Wrong.

Our cache hit ratio was stuck at 72%. That means 28% of reads were going to disk. With billions of documents and NVMe storage, it wasn't terrible, but it wasn't great either. Queries were averaging 145ms when we knew they could be faster.
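For reference, the ratio itself comes straight out of serverStatus(). A rough mongosh sketch, keeping in mind both counters are cumulative since the node started:

// Approximate cache hit ratio from WiredTiger's own counters
const cache = db.serverStatus().wiredTiger.cache
const requested = cache["pages requested from the cache"]
const readIn = cache["pages read into cache"]
print("cache hit ratio:", (100 * (1 - readIn / requested)).toFixed(1) + "%")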

I mentioned this to GenSpark while troubleshooting something else, and it suggested bumping the cache to 80% of RAM. I was skeptical – doesn't the OS need memory? But the logic made sense: we're not running anything else heavy on these boxes, and Linux is pretty good at managing tight memory.

Changed one config line:

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 205

Restarted the nodes one by one (because downtime is scary), and watched the metrics:

  • Cache hit ratio: 72% → 94%
  • Disk IOPS: dropped by 77%
  • Query latency: 145ms → 78ms

Sometimes the simple fixes are the best ones. The MongoDB documentation recommends 50% but notes you can go higher if your workload benefits from it. Ours definitely did.

Connection Pool Shenanigans

This one almost took us down completely. We have 200 application servers connecting to MongoDB. Each one had a connection pool of 100 connections because… honestly? That's what some blog post recommended three years ago and nobody questioned it.

Do the math: 200 servers × 100 connections = 20,000 connections trying to hit MongoDB.

MongoDB started refusing connections around 15,000. Things got weird. Queries would randomly fail. Connections would hang. Our on-call person (me) was having a bad week.

GenSpark suggested dropping the pool size way down:

const { MongoClient } = require("mongodb")   // Node.js driver

const client = new MongoClient(uri, {        // uri = your connection string
  maxPoolSize: 25,      // was 100
  minPoolSize: 5,
  maxIdleTimeMS: 30000
})

And bumping MongoDB's connection limit:

net:
  maxIncomingConnections: 5000

Now we're at about 5,000 total connections across all apps. MongoDB's CPU usage dropped 40%. Connection errors disappeared. Query latency improved 35%.
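If you want to keep an eye on the same thing, MongoDB will tell you exactly how many connections it's holding:

// Current vs. available connections as the server sees them
db.serverStatus().connections
// => { current, available, totalCreated, active, ... }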

Turns out you don't need a massive connection pool per instance. You just need enough to handle your concurrent queries. Who knew? (Everyone who actually read the documentation properly, probably.)

The Dashboard That Actually Matters


After dealing with all this, I realized we were monitoring way too much noise. Most metrics don't matter until they're already broken. We rebuilt our primary dashboard to show just five things:

  1. Query time (p95): Currently 78ms. If it hits 200ms, something's wrong
  2. Cache hit ratio: Sitting at 94%. Below 90% means we're thrashing disk
  3. Active connections: 3,847 right now. Over 4,500 and we start investigating
  4. Replication lag: 2.1 seconds. Over 10 seconds means a node is struggling
  5. Disk space per shard: Alert at 15% free (learned this one the hard way)

Everything else is in a secondary dashboard that we check during incidents. But this five-metric view? It tells us 95% of what we need to know at a glance.
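Most of these numbers come straight from the server, no fancy agent required. Replication lag, for example, is just the primary's optime minus each secondary's; a quick mongosh sketch (rs.printSecondaryReplicationInfo() prints roughly the same thing):

const status = rs.status()
const primary = status.members.find(m => m.stateStr === "PRIMARY")
status.members
  .filter(m => m.stateStr === "SECONDARY")
  .forEach(m => print(m.name + ": " + ((primary.optimeDate - m.optimeDate) / 1000) + "s behind"))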

GenSpark helped us design this after I fed it six months of incident logs and asked "which metrics actually predicted our outages?" Turns out most of them didn't. These five did.

How AI Actually Saved Us Weeks

Let me be real about the GenSpark thing. It didn't write our code. It didn't magically fix our database. But here's what it did:

Index optimization: Would've taken me two weeks of analyzing logs, testing changes, measuring results. With GenSpark analyzing patterns? Four hours.

Schema redesign: We were restructuring our product catalog. Normally this is weeks of research, testing different approaches, measuring performance. GenSpark gave us three solid approaches with pros/cons in minutes. We tested the best one, it worked, done in three days.

Query optimization: Our analytics queries were slow. I'd spend a day staring at EXPLAIN output trying to figure out why. Now I paste the EXPLAIN into GenSpark, it tells me "you're doing a collection scan on 2M documents, add this index," and I'm done in an hour.
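Getting that output is one line; the filter here is made up, but "executionStats" is the verbosity level you want:

// Shows what the planner actually did, not just which plan it picked
db.events.find({ user_id: 48291, event_type: "click" }).explain("executionStats")
// Look at winningPlan (COLLSCAN vs IXSCAN) and executionStats.totalDocsExamined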

Connection tuning: This would've been pure trial and error. Test a pool size, monitor for a day, try another. GenSpark gave us a sensible starting point based on our query patterns and we only had to tweak it once.

Total time saved on this project? About 7-8 weeks of work compressed into a week and a half. That's not hype – that's me being able to ship this whole migration in four months instead of six.

The key is knowing what to ask and when. GenSpark isn't magic. But it's like having a senior database engineer available 24/7 to sanity-check your ideas and point out things you missed.

The Weird Stuff Nobody Warns You About

Aggregation queries that bite: MongoDB will happily try to aggregate billions of documents in memory, right up until a blocking stage like $sort or $group blows past the 100MB per-stage memory limit and the whole pipeline errors out. Always use { allowDiskUse: true } on big aggregations. Always.
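What that looks like in practice; the pipeline itself is just an example, the option is the point:

// Let stages that exceed the in-memory limit spill to temporary files on disk
db.events.aggregate(
  [
    { $match: { timestamp: { $gte: ISODate("2024-01-01") } } },
    { $group: { _id: "$user_id", events: { $sum: 1 } } }
  ],
  { allowDiskUse: true }
)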

The balancer is chaos: When resharding, the balancer moves chunks between shards automatically. Sounds great! Except it hammers your cluster and makes everything slow. Schedule balancing windows or you'll get surprise performance hits at random times.
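Setting a window is a one-time change on the config database, run through mongos. Something like this, with the times adjusted to your quiet hours:

// Restrict chunk migrations to a nightly window (times on the config server clock)
db.getSiblingDB("config").settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "01:00", stop: "05:00" } } },
  { upsert: true }
)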

Backups at this scale are weird: Our full backup is 14.5TB. Restoring from backup takes 8 hours. We test this quarterly because the one time you don't test is the one time you'll need it and discover it's broken.

OVH support is… different: They're helpful, but you need to know what you're doing. If you're used to AWS holding your hand, OVH will make you Google some stuff. That's the tradeoff for paying a third of the price.

What We'd Do Differently

If I could go back and redo this whole thing?

  1. Start with hashed compound sharding keys. Don't wait until you have problems.
  2. Audit indexes every quarter. They accumulate like junk in a garage. Use $indexStats regularly.
  3. Set up GenSpark AI queries earlier. Would've saved us from at least two mistakes.
  4. OVH from day one? Maybe. AWS was nice for the early days when we didn't know what we were doing. But once we hit any real scale, the cost difference is just too big to ignore.

The Numbers That Matter

Just to close this out with some actual data:

  • Total documents: 6.2 billion (and growing 85M/day)
  • Storage: 14.5 TB compressed with WiredTiger
  • Queries per day: 1.2 million
  • p95 query time: 78ms
  • Monthly cost: $2,180 on OVH vs $7,500 on AWS
  • Time to migrate: 4 months
  • Times we thought it would never work: at least 6
  • Times we almost gave up and just paid AWS: 2
  • Current stress level: manageable

Running 6 billion documents at Avluz.com taught us that scaling isn't about perfect architecture or having infinite budget. It's about making smart tradeoffs, knowing when to ask for help (even from AI), and being willing to spend a weekend resharding your database when you have to.

Also, monitor your cache hit ratio. Seriously.


Originally published at Avluz.com
