Om Narayan

Posted on • Originally published at devicelab.hashnode.dev

We Architected CockroachDB the Wrong Way (And It Works)

Disclaimer: This is likely a bad idea. We're probably missing something obvious. Smarter engineers would find a better solution. But here we are.

When we started evaluating CockroachDB for DeviceLab, we had high hopes. The promise of a distributed SQL database that could scale horizontally while maintaining consistency seemed perfect for our SaaS platform. After all, who doesn't want the scalability of NoSQL with the familiarity of PostgreSQL?

But then reality hit us like a poorly configured database timeout.

The Journey Into Madness 🀯

Our first attempt was with CockroachDB's cloud offering. Simple queries were taking over 2 seconds. Table creation? Nearly 2 minutes. We raised a support ticket, hoping for some magical configuration fix. Instead, we got two responses that made us question everything:

  1. First, they suggested deleting our entire cluster and starting fresh
  2. When that didn't work, they essentially told us that performance issues weren't really issues unless something was completely "broken"

We thought maybe we were doing something wrong, so we decided to run our own tests. That's when things got weird.

We spun up CockroachDB ourselves on Google Cloud (GCE) and ran the same queries under load:

  • GCE cluster: 200 milliseconds βœ…
  • Single-node on tiny VM (2 vCPUs, 2GB RAM): 20 milliseconds πŸš€

How was a resource-constrained single node outperforming a properly provisioned cluster by an order of magnitude?
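
For what it's worth, the comparison was nothing fancier than timing the same query repeatedly against each endpoint and looking at the averages. A rough sketch of that kind of harness, with placeholder hosts and credentials rather than our real config:

// Time the same query against one endpoint and report average and p99 latency.
// Host, port, user, and database below are placeholders, not our real setup.
import { Pool } from "pg";

async function measure(host: string, port: number, runs = 200): Promise<void> {
  const pool = new Pool({ host, port, user: "root", database: "defaultdb" });
  const samples: number[] = [];

  for (let i = 0; i < runs; i++) {
    const start = process.hrtime.bigint();
    await pool.query("SELECT 1"); // stand-in for the real application queries
    samples.push(Number(process.hrtime.bigint() - start) / 1e6);
  }

  samples.sort((a, b) => a - b);
  const avg = samples.reduce((sum, v) => sum + v, 0) / samples.length;
  const p99 = samples[Math.floor(samples.length * 0.99)];
  console.log(`${host}:${port} avg=${avg.toFixed(1)}ms p99=${p99.toFixed(1)}ms`);
  await pool.end();
}

// e.g. measure("10.0.0.5", 26257) for the cluster vs. measure("localhost", 26257)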

When Physics Stopped Making Sense 🌌

The real head-scratcher came when we tested cross-region connectivity. We had application servers on AWS and a database cluster on GCE. Logic would suggest that keeping everything within the same cloud provider and region would be faster, right?

Wrong. (These numbers are also from load testing.)

AWS β†’ GCE (cross-cloud): 124ms ⚑
GCE β†’ GCE (same region): 2.65 seconds 🐌

Yes, you read that correctly. Cross-cloud communication was more than 20 times faster than same-region communication within GCE.

At this point, we started questioning our understanding of basic networking. Maybe we were measuring wrong? Maybe we had misconfigured something fundamental? We ran the tests again. And again. The results were consistent.

The Multi-Node Penalty πŸ“Š

As we dug deeper, we discovered something that should have been obvious in hindsight: CockroachDB's distributed nature comes with a cost.

  • Every write needs consensus from the majority of nodes
  • Every query might need to hop between nodes to find the leaseholder for the data
  • The more nodes we added, the slower things got

This makes sense for CockroachDB's intended use case - globally distributed applications where surviving region failures is more important than raw speed. But for our use case, where most customers are regional and latency matters more than multi-region survival, we were paying a heavy price for benefits we didn't need.
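
If you want to see where those extra hops come from on your own tables, CockroachDB lets you inspect range and leaseholder placement through SQL. Something along these lines worked for us (the table name is made up, and the SHOW RANGES output columns vary between versions, so treat it as a starting point rather than gospel):

// Print where the ranges and leaseholders for a table currently live.
// "users" is an illustrative table name; check the SHOW RANGES docs for your version.
import { Pool } from "pg";

async function showLeaseholders(): Promise<void> {
  const pool = new Pool({ host: "localhost", port: 26257, user: "root", database: "defaultdb" });
  const { rows } = await pool.query("SHOW RANGES FROM TABLE users");
  for (const range of rows) {
    console.log(range); // look for the lease_holder / replicas columns
  }
  await pool.end();
}

showLeaseholders().catch(console.error);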

The Lazy Solution πŸ’‘

Now, a competent engineering team would have taken this learning and designed a proper architecture. They might have:

  • Implemented intelligent query routing
  • Used read replicas
  • Maybe even questioned whether CockroachDB was the right choice at all

We are not that team.

Instead, we looked at:

  • Our application servers: 150MB RAM, <10% CPU usage
  • Our database nodes: Needing dedicated instances

And we had a terrible, horrible, no-good idea.

What if we just... put them together?

The Setup Nobody Should Copy ⚠️

So that's what we did. Each of our instances now runs:

  • Our application server
  • A CockroachDB node

The application connects to localhost:26257. That's it. That's our entire database architecture.

# Our "architecture" in a nutshell
instance:
  - app_server: βœ…
  - cockroachdb_node: βœ…
  - connection: localhost:26257
  - complexity: none
  - best_practices: ignored
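
In code, the whole "integration" is a connection pool pointed at the local node, plus a startup check that waits for that node to accept connections, since the app and the database now share a lifecycle. A hedged sketch (user, database name, and retry numbers are all illustrative):

// Connect to the CockroachDB node running on this same instance.
// Credentials and database name are illustrative, not a recommendation.
import { Pool } from "pg";

const db = new Pool({ host: "127.0.0.1", port: 26257, user: "app", database: "devicelab" });

// The app and the database share an instance (and a lifecycle), so wait
// for the local node to come up before serving traffic.
async function waitForLocalNode(retries = 30, delayMs = 1000): Promise<void> {
  for (let i = 0; i < retries; i++) {
    try {
      await db.query("SELECT 1");
      return;
    } catch {
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw new Error("local CockroachDB node never became reachable");
}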

We know this violates every principle of proper system design:

  • Separation of concerns? Thrown out the window
  • Independent scaling? Not possible
  • Maintenance windows? They affect everything

Any architect looking at our setup would immediately fail us in a design review.

But here's the embarrassing truth: it works better than our "proper" setup did.

How It Actually Works

When our application makes a query, one of two things happens:

  1. Data is local β†’ Sub-millisecond response (no network hop)
  2. Data is on another node β†’ Local CockroachDB forwards it (one network hop)

Compare this to the traditional setup:

App Server β†’ Load Balancer β†’ Random CockroachDB Node β†’ Actual Leaseholder

That's potentially two network hops before the query even reaches the right node.

Result: We eliminated an entire network round trip from every database query.
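
One cheap way to check whether a particular query stayed on the local node or had to hop is EXPLAIN ANALYZE, which reports which nodes participated in execution. A sketch (the users table is again made up, and the exact plan output depends on your CockroachDB version):

// Run EXPLAIN ANALYZE through the local gateway and dump the plan,
// which includes the set of nodes that took part in the query.
import { Pool } from "pg";

const local = new Pool({ host: "localhost", port: 26257, user: "root", database: "defaultdb" });

async function explainQuery(): Promise<void> {
  const { rows } = await local.query("EXPLAIN ANALYZE SELECT * FROM users WHERE id = 42");
  for (const line of rows) {
    console.log(line); // look for the nodes / distribution fields in the plan
  }
  await local.end();
}

explainQuery().catch(console.error);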

Why This Is Still a Bad Idea 🚨

Let me be clear: this is probably wrong. We're almost certainly creating problems we haven't discovered yet. Real engineers separate application and database tiers for good reasons that we're too lazy or ignorant to fully appreciate.

Potential Issues We're Ignoring:

  • Resource competition: What happens during a CockroachDB compaction? Or during a backup?
  • Wasted distributed query capabilities: The query optimizer assumes all nodes are equally accessible - we're using a Ferrari to do grocery runs
  • Our solution if problems arise: Just throw more RAM at it. Hardware is cheap. Thinking is expensive.

The Economics of Incompetence πŸ’°

The economics actually make more sense than they first appear:

Traditional Setup Scaling:

  1. Figure out which tier needs scaling (app or database?)
  2. Coordinate the changes
  3. Rebalance your load
  4. Debug why connection pooling is suddenly acting weird

Our Setup Scaling:

  1. Add another instance
  2. That's it
// Our scaling strategy
if (needMoreCapacity) {
  addInstance();  // Gets both app server AND database node
}
// Done. Go home. Sleep well.

The real kicker? Our single app server can handle 10,000 requests per second. So when we add instances, we're really adding them for database distribution, not application capacity. But we get both anyway.

What We Should Have Done πŸ€”

Looking back, we probably should have:

  1. Used PostgreSQL with streaming replication - Simpler, faster, more appropriate
  2. Hired someone who actually understands distributed systems - To configure CockroachDB properly
  3. Stuck with the cloud offering - And figured out what we were doing wrong

But we didn't. We took the lazy path, combining things that shouldn't be combined, ignoring best practices that smarter people developed for good reasons.

The worst part? It's been running in production for months without issues.

  • P99 latency is better than ever
  • Customers are happy
  • Ops burden is minimal

We keep waiting for it to blow up spectacularly, to teach us the lesson we deserve for our architectural sins. But it keeps working.

Conclusion 🎯

I'm not advocating for this approach. Please don't read this and think "those DeviceLab folks are onto something." We're not. We're just lazy engineers who found a local maximum that works for our very specific, probably unusual case.

If you meet ALL these criteria:

  • βœ… Your application is tiny
  • βœ… Your workload is read-heavy
  • βœ… You're allergic to operational complexity
  • βœ… You're willing to accept that you're probably doing it wrong

Then our terrible setup might work for you too.

But probably not. You should probably do it properly. Unlike us.


P.S. - If you know why this is going to blow up in our faces, please tell us in the comments. We're genuinely curious about what we're missing. There has to be something, right? 🀷

If you're looking for a solution where questionable architecture decisions can't compromise your data, check out DeviceLab - we built a zero-trust distributed device lab platform. We don't promise not to look at your data; we architecturally can't. Everything runs on your devices, connected peer-to-peer. Your tests, your data, your security. Unlike our database setup, we actually thought this one through.

AI was used to help structure this post
