speed engineer

Posted on May 27 • Originally published at Medium

I Migrated Redis to KeyDB — Same Protocol, 5x Throughput, $0 Rewrite

#backend #webdev #programming #systemdesign

Our Redis cluster was maxing out at 180k ops/sec across 12 nodes. KeyDB handled 850k ops/sec on 3 nodes. Same commands, same clients, zero application changes.

KeyDB’s multi-threaded architecture transforms Redis’s single-threaded bottleneck into parallel execution — same interface, fundamentally different performance characteristics under load.

Our cache layer hit 160k requests per second during normal traffic. We were running 12 Redis instances behind a proxy. CPU usage sat at 85% constantly. Any traffic spike meant scrambling to add more nodes.

Then I read about KeyDB. Redis fork. Multi-threaded. Drop-in replacement.

I didn’t believe it. Nothing is a drop-in replacement. There’s always a catch.

Spun up a test cluster. Pointed our staging traffic at it. Watched the metrics.

Same Redis protocol. Same client libraries. 5x throughput on 1/4 the nodes.

The catch? There wasn’t one. At least not the one I expected.

Why This Actually Mattered (The Dollar Impact)

We were spending $8,400/month on Redis infrastructure:

12 r6g.2xlarge instances ($340/month each)
3 read replicas per primary for high availability
Cross-AZ replication eating network costs
Ops team spending 20 hours/month on capacity planning

Our traffic was growing 15% month-over-month. At that rate, we’d need 18 nodes within three months. Costs climbing to $12k+.
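The projection is plain compound growth. A minimal sketch of the arithmetic, assuming node count and cost both scale linearly with traffic (that assumption and the function name are illustrative, not taken from our capacity model):

```python
def project(value: float, monthly_growth: float, months: int) -> float:
    """Compound a starting value by a fixed monthly growth rate."""
    return value * (1 + monthly_growth) ** months

nodes_in_3_months = project(12, 0.15, 3)     # starting from 12 nodes
cost_in_3_months = project(8400, 0.15, 3)    # starting from $8,400/month

print(f"nodes: {nodes_in_3_months:.1f}, cost: ${cost_in_3_months:,.0f}/month")
```

Under those assumptions the numbers land at roughly 18 nodes and a bit under $13k/month after three months.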

But the real pain: latency variance. Redis is single-threaded. One slow command blocks everything behind it. We’d see P99 latencies spike from 2ms to 50ms randomly because someone ran a KEYS * command or a large ZRANGE.

I couldn’t predict when spikes would hit. Couldn’t prevent them without severely restricting what commands clients could use.

That’s not a cache. That’s a liability.
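To see why one slow command hurts everyone behind it, here is a toy model of a single-threaded queue. It assumes all commands arrive at the same instant and uses made-up service times; it is an illustration of head-of-line blocking, not a measurement of Redis.

```python
def completion_times_ms(service_times_ms: list[float]) -> list[float]:
    """Latency seen by each command when one thread runs them in arrival order.

    All commands are assumed to arrive at t=0, so each one waits for
    everything queued ahead of it.
    """
    finished = []
    clock = 0.0
    for service in service_times_ms:
        clock += service
        finished.append(clock)
    return finished

# 10 fast commands, then one slow scan, then 10 more fast commands.
queue = [0.1] * 10 + [50.0] + [0.1] * 10
latencies = completion_times_ms(queue)

print(f"last command before the slow one: {latencies[9]:.1f} ms")
print(f"first command after the slow one: {latencies[11]:.1f} ms")
```

Every command queued behind the 50 ms scan inherits its full cost, which is the same shape as the 2ms-to-50ms P99 jumps described above.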

The Misconception That Survived Until Production

I assumed Redis’s single-threaded model was a fundamental design choice — that multi-threading would break something core about its semantics.

It doesn’t.

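KeyDB maintains Redis compatibility because it keeps the same event loop and runs it on several worker threads. Each connection is assigned to a worker thread when it is accepted. Network I/O and command parsing run in parallel across those threads. Access to the shared keyspace is guarded by a short-held lock, so every command still executes atomically (as it should, since consistency matters).

The architecture is simple: worker threads each running an event loop → parallel I/O and parsing → a brief lock around keyspace access.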

Redis chose single-threaded for simplicity. KeyDB proved you can have both threading and correctness.

I was wrong about the trade-off existing.

What KeyDB Actually Changed (Under The Hood)

Redis processes commands sequentially:

Client sends command
Main thread receives it
Main thread executes it
Main thread sends response
Repeat

With 1M connections doing 100k ops/sec, the main thread becomes the bottleneck. Doesn’t matter how fast your CPU is — one thread can only process so much.

KeyDB’s model:

Server starts a fixed pool of worker threads (the server-threads setting)
Each new connection is assigned to one worker when it is accepted
Workers read and parse commands from their connections in parallel
Each command takes a short lock on the keyspace while it executes
Workers send responses in parallel

Command execution against the keyspace is still serialized by that lock, but it is usually the cheapest part of a request. Socket reads, protocol parsing, and response writes for 10,000 clients are spread across CPU cores instead of queuing behind one thread.

import threading

# Redis (simplified sketch, single event loop)
def redis_main_loop(event_loop, execute):
    while True:
        client, command = event_loop.next_command()  # blocks until a command is ready
        result = execute(command)                     # one thread runs every command
        client.send(result)

# KeyDB (simplified sketch, one event loop per worker thread)
keyspace_lock = threading.Lock()  # stand-in for KeyDB's keyspace spinlock

def keydb_worker_loop(event_loop, execute):
    while True:
        client, command = event_loop.next_command()  # I/O and parsing happen per worker
        with keyspace_lock:                           # held only while touching the keyspace
            result = execute(command)
        client.send(result)

# One worker per server thread; connections are assigned to a worker on accept
def start_workers(event_loops, execute):
    for loop in event_loops:
        threading.Thread(target=keydb_worker_loop, args=(loop, execute), daemon=True).start()

This matters because Redis’s event loop serializes everything, including network I/O. KeyDB spreads connections across worker threads and serializes only the keyspace access itself. You get concurrency without sacrificing correctness.
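A runnable toy version of the worker-thread idea, assuming a single lock around a shared keyspace (KeyDB’s README describes a spinlock guarding the core hash table). The names are hypothetical, and because of Python’s GIL this shows correctness under concurrent writers, not a throughput gain.

```python
import threading
from collections import Counter

keyspace: Counter = Counter()
keyspace_lock = threading.Lock()  # stand-in for the keyspace lock

def worker(commands: list[tuple[str, str]]) -> None:
    """Run one worker's commands; only the keyspace update holds the lock."""
    for op, key in commands:
        if op == "INCR":
            with keyspace_lock:
                keyspace[key] += 1

def run(num_workers: int, incrs_per_worker: int) -> int:
    """Start the workers, wait for them, and return the final counter value."""
    keyspace.clear()
    batches = [[("INCR", "hits")] * incrs_per_worker for _ in range(num_workers)]
    threads = [threading.Thread(target=worker, args=(batch,)) for batch in batches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return keyspace["hits"]

total = run(num_workers=4, incrs_per_worker=10_000)
print(total)
```

Four workers hammering the same key still end with exactly 40,000 increments, because each update runs under the lock.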

The Numbers That Changed My Mind

We ran production-realistic load tests. Same dataset (500GB), same operation mix (70% reads, 30% writes), same client code.

Redis cluster (12 nodes):

Throughput: 180k ops/sec total
P50 latency: 0.8ms
P99 latency: 12ms (spikes to 50ms under heavy write load)
CPU per node: 85% average
Memory per node: 32GB used of 64GB allocated

KeyDB cluster (3 nodes):

Throughput: 850k ops/sec total
P50 latency: 0.4ms
P99 latency: 2ms (stable even under write-heavy load)
CPU per node: 60% average (distributed across all cores)
Memory per node: 38GB used of 64GB allocated

The P99 stability was the real win. No more latency spikes from queue buildup.
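The headline ratio hides how large the per-node difference is. A quick calculation using only the throughput and node counts reported above:

```python
redis_cluster = {"nodes": 12, "ops_per_sec": 180_000}
keydb_cluster = {"nodes": 3, "ops_per_sec": 850_000}

def per_node(cluster: dict) -> float:
    """Average throughput carried by each node in the cluster."""
    return cluster["ops_per_sec"] / cluster["nodes"]

total_ratio = keydb_cluster["ops_per_sec"] / redis_cluster["ops_per_sec"]
per_node_ratio = per_node(keydb_cluster) / per_node(redis_cluster)

print(f"Redis per node: {per_node(redis_cluster):,.0f} ops/sec")
print(f"KeyDB per node: {per_node(keydb_cluster):,.0f} ops/sec")
print(f"total: {total_ratio:.1f}x, per node: {per_node_ratio:.1f}x")
```

Cluster-wide that is about 4.7x, which the title rounds to 5x; per node it is closer to 19x.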

Scale Changes Everything

At 10k requests per second, Redis is fine. Single-threaded execution handles that easily.

At 100k requests per second, you’re running multiple Redis instances and sharding keys across them. Managing that sharding logic, handling failovers, rebalancing data.

At 500k requests per second, you’re running dozens of Redis instances. The operational overhead becomes your main problem. Monitoring 40 instances. Planning capacity across them. Debugging which shard is hot.

Connection handling is where the real complexity at scale lives. Each Redis instance has a connection limit (maxclients). Hit that limit and clients start failing. You add more instances, which means more sharding complexity, which means more failure modes.

In practice, connection pooling at scale is often harder than the caching itself.

KeyDB changed the math. Instead of 12 instances each handling 15k ops/sec single-threaded, we ran 3 instances each handling roughly 280k ops/sec multi-threaded.

Fewer instances. Simpler topology. Same reliability.

When The Migration Actually Happened

I didn’t trust it enough to switch production immediately. Too many horror stories about “drop-in replacements” that break subtle edge cases.

Rolled it out in stages:

Week 1: Deployed KeyDB shadow cluster. Dual-wrote to both Redis and KeyDB. Compared responses.

Found zero discrepancies across 2B operations.
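The shadow comparison can be sketched as a thin wrapper that sends each command to both backends and records any difference. This is an illustration, not our production code (which used Node clients): the class names are hypothetical, and the only interface assumed is an `execute_command(*args)` method, which redis-py clients expose. The in-memory `FakeServer` exists only so the sketch runs without a server.

```python
from typing import Any, Protocol

class CommandClient(Protocol):
    def execute_command(self, *args: Any) -> Any: ...

class ShadowCompare:
    """Send each command to both backends, serve the primary's reply, record diffs."""

    def __init__(self, primary: CommandClient, shadow: CommandClient) -> None:
        self.primary = primary
        self.shadow = shadow
        self.total = 0
        self.mismatches: list[tuple[tuple, Any, Any]] = []

    def execute_command(self, *args: Any) -> Any:
        expected = self.primary.execute_command(*args)
        try:
            actual = self.shadow.execute_command(*args)
        except Exception as exc:  # a shadow failure must never break production
            actual = exc
        self.total += 1
        if actual != expected:
            self.mismatches.append((args, expected, actual))
        return expected

class FakeServer:
    """Tiny in-memory stand-in supporting only SET and GET."""

    def __init__(self) -> None:
        self.data: dict[str, str] = {}

    def execute_command(self, *args: Any) -> Any:
        name, *rest = args
        if name == "SET":
            self.data[rest[0]] = rest[1]
            return "OK"
        if name == "GET":
            return self.data.get(rest[0])
        raise ValueError(f"unsupported command: {name}")

client = ShadowCompare(FakeServer(), FakeServer())
client.execute_command("SET", "user:1", "alice")
print(client.execute_command("GET", "user:1"), client.total, len(client.mismatches))
```

Commands with non-deterministic replies (TTL, RANDOMKEY, and similar) would need to be excluded or compared loosely, otherwise they show up as false mismatches.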

Week 2: Migrated read-heavy workloads (session storage, cached API responses).

Performance gains immediate. Latency dropped 60%.

Week 3: Migrated read-write workloads (rate limiting counters, leaderboards).

This is where I expected problems. Didn’t find any.

Week 4: Migrated critical path (user authentication cache, feature flags).

Still no issues. Shut down Redis cluster.

The “migration” was literally updating a config file to point at different hostnames. Our Redis client libraries (node-redis, ioredis) worked unchanged.
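For illustration, here is what a hostname-only switch looks like when the cache endpoint comes from configuration. This is a sketch in Python rather than our Node setup; the environment variable names and hostnames are hypothetical.

```python
import os
from typing import Mapping, Optional

def cache_connection_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Build connection kwargs from environment variables (names are hypothetical)."""
    env = os.environ if env is None else env
    return {
        "host": env.get("CACHE_HOST", "localhost"),
        "port": int(env.get("CACHE_PORT", "6379")),
    }

# Before: CACHE_HOST=redis.internal   After: CACHE_HOST=keydb.internal
settings = cache_connection_settings({"CACHE_HOST": "keydb.internal"})
print(settings)
# With redis-py installed, the client would be built as: redis.Redis(**settings)
```

Nothing in the application code names the server implementation, so rolling back is the same one-value change in reverse.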

The One Thing That Bit Us

I didn’t plan for Active-Active replication.

Redis has a clear primary-replica model. Writes go to primary, replicate to replicas. Simple.

KeyDB supports Active-Active replication where multiple nodes accept writes simultaneously. Sounds amazing — no single write bottleneck.

I enabled it without thinking through conflict resolution.

Two datacenters, both accepting writes for the same keys. Concurrent increments on rate limit counters. Last-write-wins semantics meant we were undercounting rate limits.

Users who should’ve been rate-limited weren’t. Our abuse detection broke for 6 hours.

Fixed by:

Disabling Active-Active for counters (back to primary-replica)
Using KeyDB’s CRDT support for conflict-free counters where appropriate
Actually reading the documentation on consistency models

This cost us 6 hours of elevated abuse traffic and taught me: just because a feature exists doesn’t mean you should enable it without understanding the trade-offs.
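The undercount is easy to reproduce on paper. The sketch below compares a last-write-wins merge with a textbook grow-only counter (one count per site, merged by taking the maximum). It assumes two sites and made-up numbers, and it is not KeyDB’s replication code.

```python
def lww_merge(a: tuple[int, float], b: tuple[int, float]) -> tuple[int, float]:
    """Last-write-wins: keep the (value, timestamp) pair with the newer timestamp."""
    return a if a[1] >= b[1] else b

def gcounter_merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Grow-only counter: keep the per-site maximum; the total is the sum."""
    return {site: max(a.get(site, 0), b.get(site, 0)) for site in a.keys() | b.keys()}

# Both datacenters start at 10 and each serves 5 requests before syncing.
dc_east = (10 + 5, 1000.0)   # (value, timestamp of last local write)
dc_west = (10 + 5, 1000.2)
lww_total = lww_merge(dc_east, dc_west)[0]

# Same traffic tracked as per-site counts; the 10 earlier hits are attributed to east.
east = {"east": 10 + 5, "west": 0}
west = {"east": 10, "west": 5}
crdt_total = sum(gcounter_merge(east, west).values())

print(lww_total, crdt_total)
```

The true count is 20. Last-write-wins keeps one side’s 15 and silently drops the other side’s 5 increments, which is exactly how rate-limited users slipped through.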

The Cascade I Didn’t Predict

Fewer nodes changed our entire infrastructure:

Before (12-node Redis cluster):

Load balancer distributing across nodes
Consistent hashing for key distribution
Client-side sharding logic
Complex failover procedures (which node owns which keys?)
12 nodes × 3 replicas = 36 instances to monitor

After (3-node KeyDB cluster):

Simple round-robin connection distribution
No sharding needed (each node handles all keys via replication)
Standard Redis primary-replica failover (well-understood, well-tooled)
3 nodes × 3 replicas = 9 instances to monitor

The number of instances to monitor dropped by 75% (36 to 9). Our on-call engineers stopped getting paged for “Redis shard rebalancing” issues because there was no sharding.

Reducing node count simplified everything downstream.
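For a sense of what the client-side sharding logic in the “Before” list involves, here is a minimal consistent-hash ring with virtual nodes. It is a generic sketch, not the code we ran; the node names and the choice of SHA-256 are assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring with virtual nodes."""

    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self._ring: list[tuple[int, str]] = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        """Return the first node clockwise from the key's position on the ring."""
        index = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[index][1]

nodes = [f"redis-{i}" for i in range(12)]   # hypothetical node names
ring = HashRing(nodes)
keys = [f"session:{i}" for i in range(10_000)]
before = {key: ring.node_for(key) for key in keys}

# Removing one node should move only the keys that lived on it.
smaller = HashRing(nodes[:-1])
moved = sum(1 for key in keys if smaller.node_for(key) != before[key])
on_removed = sum(1 for owner in before.values() if owner == nodes[-1])
print(moved == on_removed)
```

Even this small version has to be kept identical across every client and updated on every topology change, which is the overhead that disappeared once there was nothing to shard.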

When KeyDB Makes Sense (And When It Doesn’t)

Migrate to KeyDB when:

You’re running 6+ Redis instances for throughput (not memory)
CPU on Redis nodes consistently >70%
You’re hitting connection limits per instance
P99 latencies spike due to queue buildup
Operational overhead of managing many Redis instances outweighs benefits

Stay on Redis when:

You’re running 1–3 instances and CPU is fine
Your bottleneck is memory, not CPU (KeyDB won’t help)
You’re using Redis modules heavily (KeyDB module support is limited)
You need Redis 7.0+ features (KeyDB lags Redis releases by ~6 months)
Your organization has strict requirements for “standard” tech only

The decision point: if you’re adding Redis nodes because you’re CPU-bound, KeyDB will save you money and complexity. If you’re adding nodes for memory capacity, stick with Redis.
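The two lists can be condensed into a small check. This is a sketch of the rules of thumb above, with hypothetical field names; the thresholds (6+ instances, CPU above 70%) come straight from the lists, and a real decision would weigh more than these inputs.

```python
from dataclasses import dataclass

@dataclass
class CacheFleet:
    instances: int
    avg_cpu_pct: float
    bottleneck: str            # "cpu" or "memory"
    uses_modules: bool = False

def recommend(fleet: CacheFleet) -> str:
    """Apply the rules of thumb: memory-bound or module-heavy fleets stay put."""
    if fleet.bottleneck == "memory" or fleet.uses_modules:
        return "stay on Redis"
    if fleet.instances >= 6 and fleet.avg_cpu_pct > 70:
        return "evaluate KeyDB"
    return "stay on Redis"

print(recommend(CacheFleet(instances=12, avg_cpu_pct=85, bottleneck="cpu")))
print(recommend(CacheFleet(instances=2, avg_cpu_pct=30, bottleneck="cpu")))
```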

The Moment I Knew It Worked

Two months post-migration, we had a traffic surge. Product launch went viral. 10x normal load within an hour.

With Redis, this would’ve meant:

Emergency capacity planning meeting
Spinning up more instances
Rebalancing keys across the cluster
Probably still seeing some latency degradation
Post-incident cleanup and cost review

With KeyDB:

Watched CPU climb from 60% to 85%
Watched it handle the load without issues
Went back to what I was doing

The 3-node cluster had headroom. We didn’t need to do anything.

That’s when I understood: KeyDB didn’t just improve performance. It changed the operational model from “constantly managing capacity” to “occasionally checking if we need more capacity.”

The Trade-Offs Nobody Mentions

KeyDB advantages:

Multi-threaded execution (5x throughput in our tests)
Flash storage support (cheap SSD storage for cold data)
Active-Active replication option (when you understand the trade-offs)
Drop-in Redis compatibility

KeyDB disadvantages:

Smaller community (fewer Stack Overflow answers)
Module ecosystem lags behind Redis
Some Redis 7.0 features not implemented yet
Less mature monitoring tools (had to adapt our Datadog dashboards)
Fewer managed service options (AWS ElastiCache doesn’t support it)

For us, the trade-off was worth it. We’re comfortable running our own infrastructure. We don’t use Redis modules. The community size didn’t matter because the protocol compatibility meant existing Redis resources still applied.

If you’re on a managed Redis service and happy with it, migration costs might outweigh benefits.

The Design Decision That Followed

KeyDB solved our throughput problem. But it created a new question: if we can run fewer instances with more power, should we consolidate other datastores too?

We started examining our PostgreSQL setup. Running 8 read replicas to distribute query load. Could we use fewer, more powerful instances?

Started testing vertical scaling vs horizontal scaling across our entire stack. KeyDB proved that sometimes the “scale out” approach isn’t the only answer. Sometimes “scale up” with better software makes more sense.

That mindset shift changed how we approach infrastructure. We default to powerful instances with efficient software, only sharding when we hit actual resource limits.

Try This Tomorrow

Check your Redis CPU usage across all instances. If any instance is consistently >70% CPU, you’re likely hitting single-thread bottlenecks.

# SSH to Redis instance and run:
redis-cli INFO stats | grep instantaneous_ops_per_sec

# If seeing >40k ops/sec per instance, you're approaching limits
# Multiply that by the number of cores you wish you could use
# That's a rough estimate of your potential KeyDB throughput on the same hardware
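If you want to run that check across many instances, the INFO output is easy to parse. A minimal sketch, assuming you already have the text of `redis-cli INFO stats` for each instance; the sample string is made up and the 40k threshold is the rule of thumb from above.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse INFO output: key:value lines, with '#' lines as section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields

def near_single_thread_limit(info_text: str, threshold: int = 40_000) -> bool:
    """True when instantaneous_ops_per_sec exceeds the 40k rule of thumb."""
    ops = int(parse_info(info_text).get("instantaneous_ops_per_sec", "0"))
    return ops > threshold

sample = "# Stats\r\ntotal_connections_received:1200\r\ninstantaneous_ops_per_sec:45210\r\n"
print(near_single_thread_limit(sample))
```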

If the math shows you could consolidate nodes, spin up a KeyDB instance, point a test client at it, and run your actual workload. Don’t trust benchmarks — run your queries with your data.

If it works, you’ll know within a day. If it doesn’t, you’re out a few hours of testing time.

The migration risk is near zero. Same protocol means worst case, you roll back by changing a config value.

We went from 12 Redis nodes to 3 KeyDB nodes. Same reliability. Better performance. 70% cost reduction. Zero application changes.

That’s not a common outcome in infrastructure migrations. But when your bottleneck is specifically single-threaded execution, and someone’s already solved multi-threading while maintaining compatibility, the win is free.

You just have to be willing to try it.
