Naman Vashistha

Posted on Oct 11

Why I'm Building My Own Distributed Database

#database #distributedsystems #java #springboot

As a backend developer, I've worked with Redis, PostgreSQL, MongoDB, and countless other databases. But I always felt like there was something missing – a deeper understanding of how these systems actually work under the hood. So I decided to embark on a journey to build my own distributed key-value database from scratch.

Meet LimeDB – a distributed key-value store I'm currently building with Java 21 and Spring Boot. My goal is to create a truly custom database system that starts with PostgreSQL as a foundation but evolves into something much more ambitious, all wrapped in a horizontally scalable coordinator-shard architecture.

GitHub: namanvashistha/limedb

🤔 Why Build Another Database?

You might be thinking: "Why reinvent the wheel? Redis and PostgreSQL already exist!" And you're absolutely right. But here's the thing – as backend developers, we often treat databases as black boxes. We know how to use them, but not how they work.

Building LimeDB is already teaching me more about distributed systems, consistency, partitioning, and database internals than years of just using existing solutions. It's like the difference between driving a car and understanding how the engine works.

🎯 The Learning Goals

When I started this project, I had several learning objectives:

Understand Distributed System Patterns - How do you route requests across multiple nodes?
Grasp Database Internals - What happens when you store and retrieve data?
Learn About Horizontal Scaling - How do systems like Redis Cluster actually work?
Master Modern Java - Put Java 21 features and Spring Boot to real use
Build Something Production-Adjacent - Not just a toy, but something that could theoretically scale

🏗️ Architecture Decisions

The Coordinator-Shard Pattern

Instead of a peer-to-peer system (like Cassandra) or a single-node system (like a standalone Redis instance), I chose a coordinator-shard architecture:

Client → Coordinator → Shard 1, 2, 3...

Why this pattern?

Simplicity: Clients only need to know about one endpoint
Routing Logic: Centralized decision-making about where data lives
Operational Ease: Easy to monitor and debug
Familiar: Similar to how many real systems work (think of the mongos router in a sharded MongoDB cluster)

Hash-Based Routing (For Now)

// Simple but effective; floorMod keeps the index non-negative even when hashCode() is Integer.MIN_VALUE
int shardIndex = Math.floorMod(key.hashCode(), numberOfShards);

This is deliberately simple. Consistent hashing handles rebalancing better, because adding or removing a shard remaps only a fraction of the keys while modulo remaps most of them, but I wanted to start with something I could fully understand and implement correctly. You can see this decision in the ShardRegistryService:

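public String getShardByKey(String key) {
    // floorMod, unlike Math.abs(...) % n, never yields a negative index
    // (Math.abs(Integer.MIN_VALUE) is still negative)
    int index = Math.floorMod(key.hashCode(), shards.size());
    return shards.get(index);
}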

Perfect? No. Educational? Absolutely.
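To make the rebalancing cost of modulo routing concrete, here is a small sketch (not LimeDB's actual code; the class and method names are made up for illustration). It counts how many keys change shards when the shard count grows from 3 to 4. For evenly spread hashes, a key stays put only when its hash modulo 12 is 0, 1, or 2, so roughly three quarters of the keys would be expected to move.

```java
import java.util.stream.IntStream;

public class ModuloRemapDemo {

    // Hash-modulo routing; floorMod is used so that a hash of
    // Integer.MIN_VALUE cannot produce a negative index.
    static int shardFor(String key, int numberOfShards) {
        return Math.floorMod(key.hashCode(), numberOfShards);
    }

    // Fraction of keys key_0..key_(keyCount-1) whose shard changes when
    // the shard count goes from oldShards to newShards.
    static double movedFraction(int keyCount, int oldShards, int newShards) {
        long moved = IntStream.range(0, keyCount)
                .mapToObj(i -> "key_" + i)
                .filter(k -> shardFor(k, oldShards) != shardFor(k, newShards))
                .count();
        return (double) moved / keyCount;
    }

    public static void main(String[] args) {
        System.out.printf("Keys that move going from 3 to 4 shards: %.0f%%%n",
                100 * movedFraction(10_000, 3, 4));
    }
}
```

Every moved key would have to be copied between shards before the new layout is usable, which is the main reason modulo routing is usually treated as a starting point rather than a long-term scheme.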

PostgreSQL as a Starting Point

Each shard currently uses its own PostgreSQL database (limedb_shard_1, limedb_shard_2, etc.). But here's the key - PostgreSQL is just my Phase 1 storage engine, not the final destination.

Why start with PostgreSQL?

Quick Validation: Get the distributed architecture working first
ACID Guarantees: Data survives restarts while I focus on routing logic
Familiar Tooling: Easy to inspect and debug during development
Stepping Stone: Proven foundation before building custom storage

The plan is to eventually replace PostgreSQL with custom storage engines optimized for key-value workloads. Think LSM trees, custom file formats, and memory-mapped storage - but PostgreSQL lets me focus on the distributed systems challenges first.

💡 What I'm Learning Building This

1. Distributed Systems Are Hard

Even with this simple architecture, I'm already running into fascinating problems:

What happens when a shard goes down?
How do you handle network timeouts?
What about data consistency across shards?

These aren't academic questions anymore – they're real problems I need to solve as I build this system.
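As one possible way to approach the timeout and shard-down questions, the sketch below uses the JDK's built-in java.net.http.HttpClient with explicit connect and request timeouts. It is an illustration under assumptions, not LimeDB's implementation: the class name, the ShardUnavailableException type, the timeout values, and the shard-side /get/{key} path are all hypothetical. The point it tries to show is that "key not found" and "shard unreachable" are different outcomes and are worth keeping apart.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

public class ShardClientSketch {

    /** Signals that a shard could not be reached, timed out, or failed. */
    public static class ShardUnavailableException extends RuntimeException {
        public ShardUnavailableException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(500)) // arbitrary example value
            .build();

    // Returns the value on 200, Optional.empty() on 404, and throws
    // ShardUnavailableException for timeouts, connection errors, or other statuses.
    // Assumes keys are URL-safe; a real client would encode them.
    public Optional<String> get(String shardBaseUrl, String key) {
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(shardBaseUrl + "/get/" + key)) // hypothetical shard endpoint
                .timeout(Duration.ofSeconds(1))                // per-request deadline
                .GET()
                .build();
        try {
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                return Optional.of(response.body());
            }
            if (response.statusCode() == 404) {
                return Optional.empty();
            }
            throw new ShardUnavailableException(
                    "Unexpected status " + response.statusCode() + " from " + shardBaseUrl, null);
        } catch (IOException e) {
            // Covers HttpTimeoutException and ConnectException, both IOExceptions
            throw new ShardUnavailableException("Shard unreachable: " + shardBaseUrl, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new ShardUnavailableException("Interrupted calling " + shardBaseUrl, e);
        }
    }
}
```

With the two outcomes separated, a coordinator could return 404 for a missing key and something like 503 when the owning shard is down, instead of reporting both as "not found".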

2. The Power of Good Abstractions

The Spring Boot framework is letting me focus on the distributed systems logic rather than HTTP parsing and dependency injection. My controllers are staying clean:

@GetMapping("/get/{key}")
public ResponseEntity<String> get(@PathVariable String key) {
    String value = routingService.get(key);
    return value != null ? ResponseEntity.ok(value) : ResponseEntity.notFound().build();
}

3. Testing Distributed Systems is Different

You can't just unit test individual methods. You need to:

Start multiple services
Test network failures
Verify data consistency
Check routing logic

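import requests

def set_values():
    # Write 1,000 key-value pairs through the coordinator's SET endpoint.
    for i in range(1_000):
        payload = {"key": f"key_{i}", "value": f"value_{i}"}
        response = requests.post("http://localhost:8080/api/v1/set", json=payload)
        response.raise_for_status()  # fail loudly on any non-2xx response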
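The "check routing logic" item can also be exercised without starting any services. The sketch below is a stand-alone illustration, not code from the repository: the class name and shard names are hypothetical, and the routing rule is a stand-in for the hash-modulo rule described earlier. It routes the same key_0 to key_999 range and counts how many keys land on each shard.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RoutingCheck {

    // Stand-in for the hash-modulo routing rule.
    static String shardFor(String key, List<String> shards) {
        return shards.get(Math.floorMod(key.hashCode(), shards.size()));
    }

    // Counts how many of key_0..key_(keyCount-1) land on each shard.
    static Map<String, Integer> distribution(int keyCount, List<String> shards) {
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i < keyCount; i++) {
            counts.merge(shardFor("key_" + i, shards), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> shards = List.of("shard-1", "shard-2", "shard-3"); // hypothetical names
        System.out.println(distribution(1_000, shards));
    }
}
```

A badly skewed count here would point to a routing or hashing problem before any network or database is involved.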

4. Configuration Management is Crucial

With multiple nodes, configuration becomes complex. Each shard needs to know:

Which database to connect to
What port to run on
Its shard ID

./gradlew bootRun --args='--node.type=shard --server.port=7001 --shard.id=1'

🚀 Current Progress

LimeDB currently supports:

✅ GET/SET/DELETE operations (Redis-like API)
✅ Hash-based routing across 3 shards
✅ PostgreSQL persistence per shard
✅ REST API with proper error handling
✅ Health monitoring endpoints

Performance? It's not going to beat Redis. But it's already handling operations smoothly and teaching me why Redis is so fast.

🎯 What's Next?

The roadmap is ambitious and includes features I'm excited to tackle:

Phase 2: Better Distribution

Consistent Hashing: Replace modulo with a proper hash ring
Health Checks: Automatic failover when shards go down
Replication: Primary-replica setup for high availability
Metrics: Monitoring and observability
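For readers curious what the hash ring might look like, here is a minimal sketch of consistent hashing with virtual nodes. It is only an illustration under assumptions, not the planned LimeDB design: the class name, the choice of MD5 as the spreading hash, and the 100 virtual nodes per shard are all arbitrary.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.TreeMap;

public class HashRingSketch {

    private static final int VIRTUAL_NODES = 100; // points per shard; arbitrary choice
    private final TreeMap<Long, String> ring = new TreeMap<>();

    // Places VIRTUAL_NODES points for this shard on the ring.
    public void addShard(String shard) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(shard + "#" + i), shard);
        }
    }

    // Removes every point that belongs to this shard.
    public void removeShard(String shard) {
        ring.values().removeIf(shard::equals);
    }

    // Walks clockwise from the key's position to the first shard point,
    // wrapping around to the start of the ring if needed.
    public String shardFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no shards registered");
        }
        Map.Entry<Long, String> entry = ring.ceilingEntry(hash(key));
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    // First 8 bytes of an MD5 digest as a long; used for spreading only, not security.
    private static long hash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required JDK algorithm
        }
    }
}
```

When a fourth shard is added to a three-shard ring, only the keys that fall just before the new shard's points change owner, and they all move to the new shard. That is roughly a quarter of the keys, compared with most of them under modulo routing.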

Phase 3: Custom Storage Engine

LSM Trees: Replace PostgreSQL with custom key-value storage
Memory-Mapped Files: Direct file system control
Custom Serialization: Optimized data formats
WAL Implementation: Write-ahead logging from scratch

Phase 4: Advanced Features

Custom Binary Protocol: Move beyond HTTP/REST
Compression: Custom compression algorithms
Cache Layers: Multi-level caching strategies
Transaction Support: ACID across multiple shards

Each phase represents deeper database internals knowledge - PostgreSQL is just the beginning!

💭 Why You Should Build One Too

Building your own database isn't about competing with PostgreSQL or Redis. It's about:

Deep Learning: Understanding systems from the ground up
Interview Prep: Nothing impresses like saying "I built a distributed database"
Problem-Solving Skills: Real distributed systems problems
Technology Mastery: Push your programming language skills
Portfolio Project: Something unique that stands out

🛠️ Getting Started

If this inspired you to build your own database, here's my advice:

Start Simple: Don't try to build Redis on day one
Pick Your Language: Use something you're comfortable with
Choose One Feature: GET/SET is enough to start
Add Gradually: Persistence, then distribution, then optimizations
Document Everything: Future you will thank you

🔗 Follow the Journey

Want to see the code as I build it? It's all open source:

GitHub: namanvashistha/limedb
Tech Stack: Java 21, Spring Boot, PostgreSQL
Current Status: Basic coordinator-shard architecture working

The README has setup instructions, and I'm trying to make the code as readable as possible for learning purposes. Feel free to star the repo and follow along as I tackle more distributed systems challenges!

🎉 Final Thoughts

Building LimeDB is turning out to be one of the most educational projects I've undertaken as a backend developer. It's not going to be the fastest database, or the most feature-complete, but it's mine. I understand every line of code, every architectural decision, and every trade-off I'm making along the way.

In a world of microservices and cloud abstractions, there's something deeply satisfying about building a system from first principles. I'm already looking at Redis, PostgreSQL, and MongoDB differently after just starting this journey.

So grab your favorite programming language, pick a simple data structure, and start building. The distributed systems knowledge you'll gain is worth its weight in gold.

What do you think? Have you ever built your own database or distributed system? What did you learn? Drop a comment below!

Top comments (5)

OssiDev • Oct 12 • Edited

While I applaud you for this project (I was thinking about doing something similar in Go just for fun), you released this project as something people should use, while saying only this:

Building LimeDB is already teaching me more about distributed systems, consistency, partitioning, and database internals than years of just using existing solutions. It's like the difference between driving a car and understanding how the engine works.

So you built a database engine to learn about databases, and now we should use it because of that? This makes absolutely no sense. What problem does your LimeDB solve that other key-value storages don't? I doubt it'll teach me about how database engines work unless I look at the code, and I can do that with any other database engine. They're actually open source.

I would much rather use an established, well maintained and backed, database than your pet project. No offense here, I think what you did is a brilliant project as a learning experience, but you should not have created a logo and started promoting it. You got way too serious with this for absolutely the wrong reasons. Unless your tool solves a particular problem that might be helpful for others, don't promote it. You'll let people down if you end up giving up on the project and they're already bought into it.

Maybe I'm just ringing alarm bells for no reason as people should be able to make this determination themselves, but I'm just saying as I see it.

Naman Vashistha • Oct 13

Hey, I get what you’re saying - but I think there’s a bit of a misunderstanding here.

LimeDB isn’t being “promoted” as a production-ready system. It’s an open-source learning project, and I’ve been very clear about that. The fact that it has a logo, proper documentation, and structure doesn’t suddenly make it a product I’m trying to sell - it just means I want it to look good and serious. Some people learn by reading papers and code; I learn best by building and implementing end-to-end.

And honestly, having something open source with a logo doesn’t mean people are dumb enough to adopt it blindly. Developers are smart - they know how to evaluate what’s experimental and what’s production-ready. Sharing something well-organized just helps others explore and maybe even learn from it, not trick them into using it.

Open-source isn’t only about releasing polished, production-grade tools. It’s also about sharing your journey, and for me, this is my way of learning by doing: not just talking about distributed systems, but actually implementing them.

I appreciate your feedback, truly - but I think it’s unfair to equate enthusiasm and effort with misplaced seriousness. LimeDB was built to learn, share, and inspire, and if it sparks curiosity in even a few developers, I’d say it’s already done its job.

That said, you’re absolutely right that established databases are the way to go for any real use case. LimeDB’s goal is educational, not competitive. It’s more like a playground for curious developers who want to understand the internals by running and experimenting with code that’s simple and open.

Thanks again for taking the time to share your thoughts - I genuinely appreciate you engaging with it. 🙏

shemith mohanan • Oct 13

This is such a great initiative! I love that you’re building LimeDB not just as a project, but as a way to truly understand distributed systems from the inside out. The coordinator–shard pattern choice is clean and practical — especially starting with PostgreSQL before going custom. Respect 👏

Naman Vashistha • Oct 13

Thanks a lot! Appreciate you taking the time to check it out 🙌
