DEV Community

Ken C. Demanawa


Redis Isn't PostgreSQL: Building a Hybrid Change Data Capture Runtime in Ruby

I Built Commercial Redis CDC Source Drivers for Ruby — Here's What I Learned

For the past couple of years I've been building a Change Data Capture (CDC) ecosystem for Ruby.

Like many CDC projects, it started with PostgreSQL. PostgreSQL's Write-Ahead Log (WAL) is an excellent source of truth: durable, ordered, replayable, and well understood. It provides exactly the properties you want when you're building reliable event pipelines.

But the deeper I went into distributed systems, the more I realized something important.

Many systems don't observe change from PostgreSQL first.

They observe it from Redis.

Redis often sits at the front of modern architectures:

  • Redis Streams carry application events.
  • Pub/Sub distributes transient state changes.
  • Keyspace notifications signal cache invalidation and key expiry.
  • Redis Cluster routes events across multiple primaries.

In many systems, Redis sees a change before PostgreSQL ever commits it.

That raised an interesting question:

Can Redis become a first-class Change Data Capture source?

The obvious answer is "yes."

The interesting answer is "yes—but not in the same way PostgreSQL does."

That distinction eventually became cdc-redis-pro, a commercial Redis source driver for the Ruby CDC ecosystem.

This article isn't a product announcement.

It's an engineering write-up about the architectural decisions behind the project, the tradeoffs Redis forces you to make, and the execution model that ultimately emerged.


Redis Doesn't Have One CDC Interface

One misconception I frequently encounter is the assumption that Redis has an equivalent of PostgreSQL's WAL.

It doesn't.

Instead, Redis exposes several completely different mechanisms for observing change.

Source                   Delivery        Replay
Streams                  At-least-once   Yes
Pub/Sub                  At-most-once    No
Sharded Pub/Sub          At-most-once    No
Keyspace Notifications   At-most-once    No

At first glance they all look like "events."

Operationally they're completely different systems.

Streams are durable.

Pub/Sub isn't.

Keyspace notifications exist primarily as operational signals.

Sharded Pub/Sub introduces routing constraints that don't exist elsewhere.

Treating them all as the same abstraction inevitably hides important guarantees—and hidden guarantees eventually become production incidents.

Instead of pretending every Redis source behaves identically, I wanted the API to expose those differences explicitly.

If a source cannot replay missed messages, the API should say so.

If a reconnect creates a loss window, operators should know exactly when it happened.

Infrastructure software shouldn't hide reality.

It should make reality easier to reason about.
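To make "the API should say so" concrete, here is a minimal sketch of delivery guarantees carried as data instead of documentation. The delivery and replay values come from the table above; SourceGuarantee, SOURCE_GUARANTEES, ReplayUnsupported and require_replay! are hypothetical names for illustration, not cdc-redis-pro's API.

```ruby
# Hypothetical names throughout: a sketch of guarantees carried as data,
# not cdc-redis-pro's actual API.
SourceGuarantee = Data.define(:delivery, :replay)

SOURCE_GUARANTEES = {
  streams:        SourceGuarantee.new(delivery: :at_least_once, replay: true),
  pubsub:         SourceGuarantee.new(delivery: :at_most_once,  replay: false),
  sharded_pubsub: SourceGuarantee.new(delivery: :at_most_once,  replay: false),
  keyspace:       SourceGuarantee.new(delivery: :at_most_once,  replay: false)
}.freeze

class ReplayUnsupported < StandardError; end

# Fail loudly when a caller asks for replay from a source that can't provide it.
def require_replay!(source)
  guarantee = SOURCE_GUARANTEES.fetch(source)
  return guarantee if guarantee.replay

  raise ReplayUnsupported,
        "#{source} is #{guarantee.delivery}: missed messages cannot be replayed"
end
```

A pipeline that needs replay calls `require_replay!(:streams)` at boot and gets a guarantee back; asking the same of `:pubsub` raises immediately instead of failing quietly after the first reconnect.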


Redis and PostgreSQL Solve Different Problems

A common question is:

"If Redis can generate change events, why not replace PostgreSQL CDC entirely?"

Because they solve different problems.

PostgreSQL's WAL is the durable history of your system.

Redis is often the earliest signal that something is happening.

One tells you what committed.

The other tells you what is happening right now.

They're complementary.

Not competing.

Conceptually, I think about them like this:

          PostgreSQL WAL
                 │
                 ▼
      Durable Record of Truth

Redis Streams / Pub/Sub / Keyspace
                 │
                 ▼
      Fast Operational Signal

The goal isn't choosing one over the other.

The goal is allowing both to participate in the same downstream processing pipeline.

That required another architectural boundary.


A Common Language for Change Events

One of the design goals of the broader CDC ecosystem is that downstream processors shouldn't care where an event originated.

Whether a change comes from PostgreSQL logical replication or Redis Streams, the downstream processing model should remain identical.

That boundary is CDC::Core::ChangeEvent.

Instead of exposing PostgreSQL-specific or Redis-specific payloads to processors, each source is normalized into a common event model.

Conceptually the pipeline looks like this:

         PostgreSQL WAL
               │
        pgoutput-client
               │
               ▼
          ChangeEvent
               ▲
               │
         cdc-redis-pro
               │
  Streams / Pub/Sub / Keyspace

Everything downstream consumes the same normalized event.

A webhook processor doesn't need to know whether the event came from WAL or Redis.

A search indexing pipeline doesn't care.

An audit sink doesn't care.

Even the execution runtime doesn't care.

That separation between source acquisition and event processing became one of the defining architectural decisions of the ecosystem.

As the project grew, it became clear that acquiring events efficiently and processing them efficiently are two different problems—and they scale independently.

That realization eventually led to a separate execution engine: cdc-orchestrator-pro.

We'll come back to that shortly.

First, let's look at what makes each Redis source fundamentally different.

Redis Isn't One Event System. It's Four.

The first surprise when building a Redis CDC source is that there isn't a single Redis change stream.

There are four.

Each has different delivery guarantees.

Each behaves differently during failures.

Each recovers differently after reconnects.

And each answers a different operational question.

Treating them as interchangeable would have made the implementation simpler—but it also would have hidden the exact information operators need during production incidents.

Instead, cdc-redis-pro embraces those differences.


Redis Streams: The Durable Path

Redis Streams is the closest thing Redis has to a traditional CDC source.

Messages are persisted.

Consumers maintain checkpoints.

Consumer groups coordinate work.

Failed consumers leave pending entries behind for recovery.

In many ways, Streams feels familiar to anyone coming from Kafka or PostgreSQL logical replication.

That made it the natural foundation for the recoverable side of the driver.

The Streams implementation supports:

  • XREAD
  • XREADGROUP
  • Consumer Groups
  • Pending-entry inspection
  • XAUTOCLAIM
  • Duplicate suppression
  • Optional dead-letter streams
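Consumer-group recovery is easier to see in a toy model. The sketch below is an in-memory imitation of the bookkeeping Redis keeps for a group (a delivery cursor plus a pending entries list). It isn't a Redis client and it isn't the driver's code; ToyConsumerGroup and its methods are hypothetical names that only loosely mirror XREADGROUP with the `>` ID, XACK and XAUTOCLAIM.

```ruby
# Toy, in-memory model of consumer-group bookkeeping. Hypothetical names;
# not a Redis client and not cdc-redis-pro's implementation.
class ToyConsumerGroup
  Entry   = Struct.new(:id, :fields)
  Pending = Struct.new(:consumer, :delivered_at, :deliveries)

  def initialize(clock:)
    @clock   = clock # callable returning "now" in seconds
    @log     = []    # every entry ever added, in ID order
    @cursor  = 0     # index of the first never-delivered entry
    @pending = {}    # id => Pending, like the pending entries list (PEL)
  end

  def add(id, fields)
    @log << Entry.new(id, fields)
  end

  # Deliver up to +count+ never-delivered entries and record them as pending.
  def read_new(consumer, count)
    batch = @log[@cursor, count] || []
    @cursor += batch.size
    now = @clock.call
    batch.each { |entry| @pending[entry.id] = Pending.new(consumer, now, 1) }
    batch
  end

  # Acknowledge an entry; it leaves the pending list for good.
  def ack(id)
    !@pending.delete(id).nil?
  end

  # Hand entries idle for at least +min_idle+ seconds to another consumer.
  # This redelivery is what makes the source at-least-once.
  def autoclaim(consumer, min_idle)
    now = @clock.call
    stale = @pending.select { |_id, p| now - p.delivered_at >= min_idle }
    stale.each_value do |p|
      p.consumer = consumer
      p.delivered_at = now
      p.deliveries += 1
    end
    @log.select { |entry| stale.key?(entry.id) }
  end

  def pending_ids
    @pending.keys
  end
end

now = 0
group = ToyConsumerGroup.new(clock: -> { now })
group.add("1-0", { "order" => "1" })
group.add("2-0", { "order" => "2" })
group.read_new("worker-a", 10).map(&:id)  # => ["1-0", "2-0"]
group.ack("1-0")                          # worker-a then crashes before acking 2-0
now = 60
group.autoclaim("worker-b", 30).map(&:id) # => ["2-0"]
```

Entry `2-0` ends up delivered twice, once to each worker. That is the at-least-once contract in miniature, and it is why duplicate suppression sits right next to XAUTOCLAIM in the feature list.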

Operationally, Streams is the only Redis source that provides genuine replay.

If a downstream worker crashes halfway through a batch, processing resumes from the last committed checkpoint rather than silently dropping work.

Conceptually, it looks like this:


             Producer
                │
                ▼
          Redis Stream
                │
          Consumer Group
                │
                ▼
          cdc-redis-pro
                │
           ChangeEvent
                │
                ▼
         Downstream Runtime

This is the strongest consistency story Redis offers.

It isn't PostgreSQL's WAL—but it isn't trying to be.

It's a durable event log designed for application-level workflows.


Pub/Sub: Fast, But Ephemeral

Pub/Sub solves a completely different problem.

Messages exist only while subscribers are connected.

Disconnect for five seconds.

Any message published during those five seconds is gone forever.

That isn't a bug.

It's the contract.

Many libraries attempt to hide this by automatically reconnecting.

The problem is that reconnecting doesn't recover missed messages.

It only resumes receiving future ones.

Pretending otherwise creates false confidence.

Instead, cdc-redis-pro treats Pub/Sub as an explicitly at-most-once source.

Reconnects are measured.

Loss windows are reported.

Operators can immediately see:

  • when the disconnect occurred,
  • how long the subscriber was offline,
  • and exactly where message loss became possible.
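Measuring a loss window doesn't take much machinery. Here is a minimal sketch, assuming a callable clock injected so the example stays deterministic; LossWindow and LossWindowTracker are hypothetical names, not the driver's API.

```ruby
# Hypothetical names throughout; a sketch, not cdc-redis-pro's API.
LossWindow = Data.define(:channel, :disconnected_at, :reconnected_at) do
  # Seconds during which published messages may have been missed.
  def duration
    reconnected_at - disconnected_at
  end
end

class LossWindowTracker
  attr_reader :windows

  def initialize(clock:)
    @clock = clock # callable returning "now" in seconds
    @open = {}     # channel => time the subscriber went away
    @windows = []  # closed windows, oldest first
  end

  # Keep the earliest disconnect time if several failures pile up.
  def disconnected(channel)
    @open[channel] ||= @clock.call
  end

  # Close the window on resubscribe: the gap is reported, never hidden.
  def reconnected(channel)
    started = @open.delete(channel)
    return nil unless started

    window = LossWindow.new(channel: channel, disconnected_at: started,
                            reconnected_at: @clock.call)
    @windows << window
    window
  end
end
```

The tracker can't recover anything; it only turns "we reconnected" into a timestamped record an operator can correlate with upstream publishers.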

That distinction matters.

Infrastructure software shouldn't promise guarantees the underlying system doesn't provide.


Sharded Pub/Sub Changes the Topology

Redis Cluster introduces another variation.

Sharded Pub/Sub distributes channels across multiple primaries.

That improves scalability, but it also means subscriptions become topology-aware.

A reconnect doesn't always land on the same node.

During resharding, ownership of a channel may move entirely.

Handling that correctly requires continuously tracking cluster topology rather than assuming a fixed server layout.

The driver automatically discovers topology through CLUSTER SHARDS and transparently rebinds subscriptions as ownership changes.

To downstream processors, events continue arriving normally.

To operators, topology changes remain observable.
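Rebinding starts with knowing which slot a channel lives in. Redis documents that shard channels are assigned to hash slots with the same rule as keys: CRC16 (XMODEM variant) modulo 16384, hashing only the `{...}` hash tag when one is present. The sketch below implements that rule; the `shard_map` format (an already-parsed view of CLUSTER SHARDS as `[slot_range, node_id]` pairs) and the helper names are hypothetical.

```ruby
# CRC16-CCITT, XMODEM variant (polynomial 0x1021, initial value 0),
# the checksum Redis Cluster uses for slot assignment.
def crc16_xmodem(data)
  crc = 0
  data.each_byte do |byte|
    crc ^= byte << 8
    8.times do
      crc = crc.anybits?(0x8000) ? ((crc << 1) ^ 0x1021) : (crc << 1)
      crc &= 0xFFFF
    end
  end
  crc
end

# Hash only the first non-empty {...} section when present (hash tags).
def hash_slot(name)
  tag_start = name.index("{")
  if tag_start
    tag_end = name.index("}", tag_start + 1)
    name = name[(tag_start + 1)...tag_end] if tag_end && tag_end > tag_start + 1
  end
  crc16_xmodem(name) % 16_384
end

# shard_map is a hypothetical, already-parsed view of CLUSTER SHARDS:
# an array of [slot_range, node_id] pairs.
def owner_for(slot, shard_map)
  shard_map.find { |range, _node| range.cover?(slot) }&.last
end

# Channels whose owning node changed between two topology snapshots need
# to be resubscribed (SSUBSCRIBE) on their new owner.
def subscriptions_to_rebind(channels, old_map, new_map)
  channels.select do |channel|
    slot = hash_slot(channel)
    owner_for(slot, old_map) != owner_for(slot, new_map)
  end
end
```

Diffing two topology snapshots keeps rebinding targeted: only channels whose slots actually moved get resubscribed, and each of those moves is a natural event to surface to operators.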


Keyspace Notifications Aren't Really CDC

Keyspace notifications are probably the easiest Redis feature to misunderstand.

They're incredibly useful.

They're also incredibly easy to misuse.

Keyspace notifications exist to announce that Redis itself performed an operation:

  • a key expired,
  • a value changed,
  • a key was deleted,
  • a hash was updated.

They're operational signals.

They're not durable history.

They're not replayable.

And by the time you receive an expiration notification, the value may already be gone.

That's simply how Redis works.

Rather than pretending every notification contains complete information, the driver offers optional best-effort value enrichment whenever the value still exists.

If it doesn't, the event still proceeds.

The guarantee remains explicit.
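Here is a minimal sketch of parsing plus best-effort enrichment. The two channel formats are the ones Redis documents for keyspace notifications (`__keyspace@<db>__:<key>` carries the event name as the message, `__keyevent@<db>__:<event>` carries the key name); `fetch` is a hypothetical callable standing in for something like a GET, and the helper names are mine.

```ruby
# Channel formats documented for Redis keyspace notifications.
KEYSPACE_CHANNEL = /\A__keyspace@(\d+)__:(.+)\z/m # message is the event name
KEYEVENT_CHANNEL = /\A__keyevent@(\d+)__:(.+)\z/m # message is the key name

# Events after which there is no value left to read.
GONE_EVENTS = %w[expired evicted del].freeze

def parse_notification(channel, message)
  if (m = KEYSPACE_CHANNEL.match(channel))
    { db: m[1].to_i, key: m[2], event: message }
  elsif (m = KEYEVENT_CHANNEL.match(channel))
    { db: m[1].to_i, key: message, event: m[2] }
  end
end

# Best-effort: attach the current value if it still exists, and never hold
# the event back when it doesn't. +fetch+ is a hypothetical callable that
# returns nil for a missing key.
def enrich(notification, fetch)
  value = GONE_EVENTS.include?(notification[:event]) ? nil : fetch.call(notification[:key])
  notification.merge(value: value, enriched: !value.nil?)
end
```

The `enriched` flag is the point: downstream code can tell "this key had no value" apart from "we never looked".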


Delivery Guarantees Should Stay Visible

One design principle shaped almost every API in the project.

I didn't want to normalize away delivery semantics.

Instead, I wanted them to remain visible all the way to the operator.

Think of it like a database transaction.

You wouldn't want a library to silently convert an eventually-consistent operation into something that merely looks transactional.

The same idea applies here.

Different Redis sources have different operational characteristics.

The API should preserve them.

That philosophy can be summarized like this:

Source                   Replay   Delivery        Typical Use
Streams                  Yes      At-least-once   Durable workflows
Pub/Sub                  No       At-most-once    Live events
Sharded Pub/Sub          No       At-most-once    Cluster-scale broadcasts
Keyspace Notifications   No       At-most-once    Operational signals

None of these are "better."

They're simply optimized for different workloads.


Topology Matters More Than Features

Supporting Redis isn't just about supporting commands.

It's about supporting deployments.

A surprising amount of complexity came not from Streams or Pub/Sub themselves, but from the environments they run in.

The driver currently supports:

  • Standalone Redis
  • Redis Sentinel
  • Redis Cluster
  • TLS
  • ACL authentication

Cluster support turned out to be particularly interesting.

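A single XREAD or XREADGROUP across several streams must keep all of those stream keys in one hash slot, usually by giving them a shared hash tag.

Cross-slot reads fail with a CROSSSLOT error.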

Sharded Pub/Sub subscriptions have to follow their channels when slots migrate during resharding.

Connections disappear during primary failover.

Those aren't edge cases.

They're normal operating conditions in production.

Every supported topology is continuously exercised using Docker-based integration tests covering failover, node restarts, resharding, authentication, and TLS.

I wanted the implementation to reflect how Redis is actually deployed—not just how it behaves on a laptop.


Acquiring Events Is Only Half the Problem

By this point, the source layer was capable of reliably acquiring events from every major Redis deployment model.

The next question became much harder.

How do you process them efficiently?

One worker?

Ten workers?

Hundreds?

How do you preserve ordering where it's required while still exploiting modern Ruby's parallelism?

It turned out that reading events from Redis wasn't the difficult part.

Scheduling what happened after they were read became the real engineering challenge.

That challenge eventually became HybridRuntime, the execution engine inside cdc-orchestrator-pro.

And surprisingly, the solution wasn't built around threads.

It was built around ownership.

The Architecture I'm Most Proud Of


Modern Ruby gives us two powerful concurrency primitives:

  • Ractors for parallel CPU execution
  • Fibers for concurrent I/O

Most systems choose one.

I wanted both.

That eventually became HybridRuntime, the execution engine inside cdc-orchestrator-pro.

Its job isn't tied to Redis.

Redis simply happened to be the workload that exposed the problem first.


Event Acquisition and Event Processing Are Different Problems

One architectural realization changed the direction of the project.

Reading events from a source and processing those events are two completely different concerns.

They're limited by different bottlenecks.

They scale independently.

A PostgreSQL logical replication connection is fundamentally serial.

A single Redis Stream reader is similarly sequential: it receives entries in ID order.

But once an event has been acquired and normalized into a CDC::Core::ChangeEvent, most downstream processing can run in parallel, limited only by the ordering each processor actually requires.

That naturally separates the pipeline into two halves.

                    Source Layer
                         │
         PostgreSQL WAL / Redis Streams
                         │
                         ▼
                CDC::Core::ChangeEvent
                         │
                         ▼
                  Execution Layer

Once an event reaches the execution layer, its origin no longer matters.

Redis.

PostgreSQL.

A future Kafka adapter.

A future S3 replay.

The runtime simply processes ChangeEvent.

That separation turned out to be one of the most valuable architectural decisions in the ecosystem.


HybridRuntime

HybridRuntime combines two existing execution engines from the CDC ecosystem.

  • cdc-parallel provides pools of prewarmed Ractors for true CPU parallelism.
  • cdc-concurrent provides asynchronous Fiber pools for overlapping I/O within each Ractor.

Together they form a nested execution model.

                 HybridRuntime
                        │
        ┌───────────────┴───────────────┐
        ▼                               ▼
  Ractor Pool                    Ractor Pool
        │                               │
        ▼                               ▼
   Fiber Pool                     Fiber Pool
        │                               │
        ▼                               ▼
Redis Connections              Redis Connections

The interesting observation is that parallelism and concurrency solve different problems.

Ractors increase throughput by executing work simultaneously.

Fibers increase throughput by avoiding idle time while waiting for I/O.

The runtime deliberately uses both.


The Inception Pool

As the architecture evolved, I noticed something amusing.

Every layer owned another pool.

The runtime owns a pool of Ractors.

Each Ractor owns a LocalResourcePool.

Each LocalResourcePool owns a pool of Fibers.

Each Fiber owns a live Redis connection.

It looked like this:

HybridRuntime
     │
     ▼
Prewarmed Ractor Pool
     │
     ▼
LocalResourcePool
     │
     ▼
Fiber Pool
     │
     ▼
Redis Connections

Internally I started calling it the Inception Pool.

A pool containing pools containing pools.

The name stuck.


Ownership Instead of Synchronization

Most concurrent systems solve shared state by protecting it.

Threads
  │
  ▼
Mutex
  │
  ▼
Shared Connection Pool

The more workers you add, the more frequently they compete for the same resources.

Locks become unavoidable.

HybridRuntime takes a different approach.

Instead of synchronizing ownership...

...it avoids sharing ownership entirely.

Every Redis client is created inside the Ractor that will use it.

It never leaves that Ractor.

Nothing is borrowed.

Nothing is shared.

Nothing requires a mutex.

Conceptually it looks like this.

Ractor 1
   │
   ├── Redis Connection A
   ├── Redis Connection B
   └── Fiber Scheduler

Ractor 2
   │
   ├── Redis Connection A
   ├── Redis Connection B
   └── Fiber Scheduler

The only thing that crosses a Ractor boundary is an immutable ChangeEvent.

Everything else remains local.

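This aligns naturally with Ruby's Ractor isolation model: only shareable (deeply frozen) objects can be referenced from more than one Ractor, and everything else is copied or moved when it is sent.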

Mutable state belongs somewhere.

Rather than fighting that constraint, the runtime embraces it.
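Here is a minimal sketch of that ownership rule with plain Ractors (Ruby 3.2+ for `Data`). ChangeEvent is a hypothetical `Data` stand-in for CDC::Core::ChangeEvent, FakeRedisClient is a placeholder for a real client, and the modulo split is just a simple way to divide work; none of this is HybridRuntime's code.

```ruby
# Hypothetical stand-in for CDC::Core::ChangeEvent: an immutable value.
ChangeEvent = Data.define(:source, :key, :payload)

# Hypothetical stand-in for a Redis client. It is mutable (it records writes),
# so it is not shareable and must stay inside the Ractor that created it.
class FakeRedisClient
  attr_reader :writes

  def initialize(owner)
    @owner = owner
    @writes = []
  end

  def set(key, value)
    @writes << [key, value]
    "#{@owner} SET #{key}"
  end
end

# Fan events out to +worker_count+ Ractors. Only the deeply frozen event list
# (plus two integers) crosses the Ractor boundary; each Ractor builds its own
# client. Note: make_shareable freezes the caller's array in place.
def process_with_owned_clients(events, worker_count)
  shared_events = Ractor.make_shareable(events)
  workers = Array.new(worker_count) do |index|
    Ractor.new(shared_events, index, worker_count) do |evs, me, count|
      client = FakeRedisClient.new("ractor-#{me}") # created here, never sent out
      evs.each_with_index.filter_map do |event, n|
        client.set(event.key, event.payload) if n % count == me
      end
    end
  end
  # Ruby 3.x returns a finished Ractor's result via #take; newer releases use #value.
  workers.flat_map { |r| r.respond_to?(:value) ? r.value : r.take }
end
```

Nothing here needs a Mutex: the client is reachable from exactly one Ractor, and the only values leaving a Ractor are its results.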


Why LocalResourcePool Exists

That ownership model eventually led to another component:
CDC::Orchestrator::Pro::LocalResourcePool.

Unlike traditional connection pools, a LocalResourcePool never shares its live resources across Ractors.

The pool object itself is shared, but only as an immutable coordinator.

The connections it hands out are not.

Instead, every Ractor lazily creates and owns its own resource pool the first time it needs one.

             LocalResourcePool
                    │
      ┌─────────────┴─────────────┐
      ▼                           ▼
  Ractor A                   Ractor B
      │                           │
 Resource Pool               Resource Pool
      │                           │
 Redis Connections          Redis Connections

Each Ractor owns its resources for their entire lifetime.

No resource crosses a Ractor boundary.

Nothing requires synchronization.

The work moves.

The connections don't.

This turns out to be a natural fit for long-lived resources such as:

  • Redis clients
  • PostgreSQL connections
  • HTTP clients
  • Elasticsearch clients
  • S3 clients

Every Ractor operates independently using resources it owns locally.

Rather than coordinating access to a shared pool, the runtime coordinates immutable ChangeEvents while leaving the underlying resources exactly where they were created.

The result is a simpler ownership model, reduced contention, and an execution architecture that scales naturally with additional Ractors.
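A minimal sketch of the "shared coordinator, local resources" idea: the shareable object holds only a storage key, a size and a class, and each Ractor lazily builds its own array of resources in Ractor-local storage. PoolSketch and FakeConnection are hypothetical names; this is not CDC::Orchestrator::Pro::LocalResourcePool's real interface.

```ruby
# Hypothetical resource: each instance stands in for one live connection.
class FakeConnection
  def initialize
    @opened_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

# Immutable, shareable coordinator (hypothetical). Its members are a Symbol,
# an Integer and a Class, all shareable; the live resources are created
# lazily in the storage of whichever Ractor calls #checkout.
PoolSketch = Data.define(:storage_key, :size, :resource_class) do
  def checkout
    pool = (Ractor.current[storage_key] ||= Array.new(size) { resource_class.new })
    resource = pool.pop or raise "pool #{storage_key} exhausted in this Ractor"
    begin
      yield resource
    ensure
      pool.push(resource) # always return it to this Ractor's own pool
    end
  end
end

POOL = Ractor.make_shareable(
  PoolSketch.new(storage_key: :redis, size: 2, resource_class: FakeConnection)
)
```

The same `POOL` constant can be handed to any Ractor; what differs per Ractor is the array behind `Ractor.current[:redis]`, so two Ractors calling `POOL.checkout` never touch the same connection.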


Two Independent Scaling Axes

Another consequence of this architecture is that acquisition and processing no longer have to scale together.

Suppose a Redis deployment only needs three acquisition workers.

That says nothing about how many processing workers you need.

You might run:

Acquisition    3 Ractors, 5 Fibers each
      │
      ▼
Processing     7 Ractors, 20 Fibers each

Each side can be tuned independently.

Adding more downstream workers doesn't require adding Redis Stream consumers.

Adding more source readers doesn't require changing the execution topology.

The two halves of the pipeline evolve independently.

That separation proved invaluable during benchmarking because it exposed where the real bottlenecks actually lived.
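Expressed as configuration, the two axes are just two independent records. StageConfig is a hypothetical name, and the numbers mirror the example above; `max_in_flight` is simply Ractors times Fibers, an upper bound on concurrently running work per stage.

```ruby
# Hypothetical configuration object, not cdc-orchestrator-pro's config format.
StageConfig = Data.define(:stage, :ractors, :fibers_per_ractor) do
  # Upper bound on units of work this stage can have in flight at once.
  def max_in_flight
    ractors * fibers_per_ractor
  end
end

ACQUISITION = StageConfig.new(stage: "acquisition", ractors: 3, fibers_per_ractor: 5)
PROCESSING  = StageConfig.new(stage: "processing",  ractors: 7, fibers_per_ractor: 20)

ACQUISITION.max_in_flight # => 15
PROCESSING.max_in_flight  # => 140
```

Nothing ties 15 to 140. Acquisition is sized to what the source can serve; processing is sized to how long each sink call waits.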


Beyond Redis

One realization surprised me.

HybridRuntime wasn't solving a Redis problem.

It was solving an event-processing problem.

Redis happened to be the first source.

The same execution model works for:

  • PostgreSQL logical replication
  • Redis Streams
  • Webhook delivery
  • Search indexing
  • Object storage sinks
  • Future Kafka adapters
  • Future message brokers

Anything capable of producing a CDC::Core::ChangeEvent automatically inherits the same execution engine.

That ultimately justified extracting the runtime into its own commercial component: cdc-orchestrator-pro.

Originally it lived inside another project.

Eventually it became obvious that it wasn't a Redis runtime.

It wasn't a Sidekiq runtime.

It wasn't even a PostgreSQL runtime.

It was an execution fabric for normalized change events.

Redis simply happened to be the benchmark that inspired it.


Parallelism Isn't Free

One thing the benchmarks made very clear is that parallelism isn't magic.

Adding more Ractors doesn't produce linear speedups.

It introduces coordination costs.

Partition routing.

Mailbox communication.

Ordering constraints.

Preserving correctness means accepting those costs.

Understanding where those tradeoffs appear became just as interesting as the throughput numbers themselves.

We'll look at what those benchmarks actually measured later in this article.


Where This Actually Fits

After spending so much time discussing architecture, it's worth asking a simple question.

Who actually needs this?

The honest answer is:

Not every Rails application.

If Redis is simply a cache sitting beside your database, this project is probably unnecessary.

Likewise, if every important state transition already commits to PostgreSQL before anything else happens, PostgreSQL logical replication alone may be all the CDC infrastructure you need.

cdc-redis-pro exists for a much narrower class of systems.

Systems where Redis is part of the application's event architecture rather than merely its cache.


Redis Streams as an Event Bus

This is probably the most natural fit.

Many distributed systems already use Redis Streams as their internal event bus.

Order Service
    │
    ▼
Redis Stream
    │
    ▼
Consumers

Once Redis becomes the place where work is coordinated, durability suddenly matters.

Consumers crash.

Deployments restart.

Networks partition.

A consumer needs to know where to resume.

Redis Streams already provides those building blocks.

Consumer Groups.

Pending Entries.

Checkpoint IDs.

XAUTOCLAIM.

The job of cdc-redis-pro isn't replacing those mechanisms.

It's integrating them into a larger event-processing pipeline while preserving their semantics.


Fast Signals Before Durable State

Many systems generate transient events before anything reaches PostgreSQL.

Examples include:

  • inventory availability
  • market data
  • IoT telemetry
  • collaborative editing
  • multiplayer game state

These events often exist for milliseconds.

Some are never intended to become permanent records.

Waiting for a database commit before reacting introduces unnecessary latency.

Redis already has the signal.

The application simply needs a reliable way to observe it.

That's exactly where Redis becomes a valuable CDC source.

Not because it replaces the database.

Because it observes change sooner.


Redis and PostgreSQL Together

The architecture becomes much more interesting when both sources exist simultaneously.

Imagine an order-processing pipeline.

      Customer clicks Buy
               │
               ▼
         Redis Stream
               │
               ▼
Immediate downstream processing
               │
               ▼
    PostgreSQL Transaction
               │
               ▼
      Logical Replication

Redis carries the operational signal.

PostgreSQL records the durable history.

Eventually both are normalized into the same event model.

Redis Streams
        │
        ▼
   ChangeEvent
        ▲
        │
PostgreSQL WAL

Once normalized, downstream processing becomes identical.

That separation allows each technology to do what it does best.

Redis optimizes for responsiveness.

PostgreSQL optimizes for durability.

Neither replaces the other.


Event Processing Shouldn't Care About the Source

One of the design goals of the CDC ecosystem is that processors shouldn't know—or care—where an event originated.

A webhook dispatcher shouldn't behave differently because the event came from Redis instead of PostgreSQL.

Neither should:

  • search indexing
  • audit sinks
  • analytics
  • cache invalidation
  • AI pipelines
  • object storage
  • future message brokers

Every processor consumes exactly the same event model.

Redis                PostgreSQL
  │                      │
  └──────────┬───────────┘
             ▼
        ChangeEvent
             │
             ▼
         Processor
             │
   ┌───────┬─┴─────┬───────┬───────┐
   ▼       ▼       ▼       ▼       ▼
Webhook  Search  Audit   Redis Future...

That separation is what allows the runtime to remain completely source-agnostic.


Ordered Workloads

Not every workload benefits equally from parallelism.

Suppose an application updates customer balances.

+100
-20
+15

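Processing those out of order can produce incorrect state. The deltas sum to the same total in any order, but the intermediate balances differ, so any rule that inspects the running balance, such as an overdraft check, can reach a different result.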

Ordering matters.
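A small worked example makes the point, assuming a hypothetical business rule that rejects any debit which would take the balance below zero. `apply_deltas` is an illustrative name; the deltas are the three from above.

```ruby
# Apply deltas in the given order under a hypothetical overdraft rule:
# a delta that would make the balance negative is rejected.
# Returns [final_balance, rejected_deltas].
def apply_deltas(deltas, start: 0)
  rejected = []
  final = deltas.reduce(start) do |balance, delta|
    if (balance + delta).negative?
      rejected << delta # rule fires; balance unchanged
      balance
    else
      balance + delta
    end
  end
  [final, rejected]
end

apply_deltas([100, -20, 15]) # => [95, []]
apply_deltas([-20, 100, 15]) # => [115, [-20]]
```

Same three events, two different final balances, purely because of arrival order.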

Other workloads don't have that constraint.

Search indexing.

Webhook fan-out.

Telemetry aggregation.

Independent cache updates.

Those can often execute concurrently.

One of the runtime's responsibilities is recognizing that not every processor requires the same ordering guarantees.
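A common way to get both properties is key-based partition routing: events that share a key always land on the same sequential queue, while different keys spread across queues. This sketch assumes hash-of-key routing (my illustration, not necessarily the runtime's exact scheme); `partition_for` and `route` are hypothetical names.

```ruby
require "zlib"

# Stable key -> partition mapping. CRC32 is deterministic across processes,
# unlike String#hash, which is seeded per process.
def partition_for(key, partitions)
  Zlib.crc32(key) % partitions
end

# ordered: true  -> events sharing a key go to one queue, in arrival order
# ordered: false -> plain round-robin, with no ordering promise at all
def route(events, partitions, ordered:)
  queues = Array.new(partitions) { [] }
  events.each_with_index do |event, i|
    index = ordered ? partition_for(event[:key], partitions) : i % partitions
    queues[index] << event
  end
  queues
end
```

If each queue is drained by one worker, per-key order holds while unrelated accounts still proceed in parallel. Unordered processors such as search indexing can take the round-robin path and skip the constraint entirely.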

Correctness always comes first.

Throughput comes second.


Why Not Just Use Sidekiq?

This is probably the question Ruby developers ask most often.

After all, Sidekiq already provides a robust distributed job system.

The answer is that jobs and change streams solve different scheduling problems.

A job queue answers:

"What work should execute next?"

A CDC runtime answers:

"How should related events flow through the system while preserving their correctness?"

Those are similar questions.

They're not the same question.

Jobs are independent.

Change events frequently aren't.

Ordering.

Checkpoints.

Replay.

Transaction boundaries.

Partition routing.

Those become first-class concerns in CDC systems.

Rather than replacing Sidekiq, the runtime sits at a different layer.

Sidekiq remains an excellent execution engine for background jobs.

HybridRuntime focuses on ordered event pipelines.

The two complement one another rather than compete.


Lessons Learned

Building cdc-redis-pro changed how I think about event-driven systems.

A few observations kept appearing throughout development.

Redis isn't PostgreSQL.

Trying to force Redis into a WAL-shaped abstraction usually hides important operational behavior.

Delivery guarantees matter more than APIs.

Two systems exposing similar methods may have completely different recovery characteristics.

Ownership scales better than synchronization.

Keeping mutable resources inside a single Ractor proved simpler than sharing them across many workers.

Acquisition and processing are independent problems.

The bottleneck for reading events is rarely the same bottleneck for processing them.

Treating those concerns separately made both architectures significantly cleaner.

Most importantly...

Infrastructure shouldn't hide tradeoffs.

It should make them explicit.

That's the philosophy behind the entire project.

The benchmark results ended up reflecting exactly those design decisions.

What the Benchmarks Actually Mean

Benchmark numbers are easy to misunderstand.

They're also surprisingly easy to exaggerate.

I wanted to avoid both.

Rather than publishing a single headline number, I built a benchmark matrix that explored how the runtime behaves under different execution strategies.

The goal wasn't to find the biggest number.

The goal was to understand where the architecture stops scaling—and why.


Measuring Different Parts of the Pipeline

Not every benchmark measures the same thing.

Some benchmarks measure source acquisition.

Others measure downstream execution.

Others measure the orchestration layer itself.

Treating those numbers as interchangeable would be misleading.

I ended up thinking about the benchmarks as three different phases.

Redis Source
      │
      ▼
ChangeEvent Acquisition
      │
      ▼
HybridRuntime
      │
      ▼
Downstream Sink

Each phase has different bottlenecks.

Acquisition is constrained by Redis.

Processing is constrained by CPU, I/O latency, ordering requirements, and scheduling overhead.

Understanding which phase you're measuring is more important than the final throughput number.


The Synthetic Benchmark

The largest number observed was approximately 54,500 events per second.

That's intentionally not presented as an end-to-end Redis benchmark.

It measures the execution capacity of the orchestration layer after events have already been acquired.

In other words:

ChangeEvent
      │
      ▼
HybridRuntime
      │
      ▼
Processor

This benchmark answers a very specific question:

"How quickly can the runtime schedule and execute already-available work?"

That's useful.

It just isn't the same as measuring an entire Redis pipeline.


End-to-End Pipelines

Real systems spend time doing real work.

Reading from Redis.

Writing to PostgreSQL.

Calling HTTP services.

Updating search indexes.

Those operations introduce latency that no scheduler can eliminate.

When measured end-to-end, the results naturally become lower.

Current peak observations include:

  • Redis Streams → Runtime: approximately 17,600 events/sec
  • PostgreSQL WAL → Redis: approximately 20,000 events/sec

Those numbers include actual I/O rather than isolated scheduling.

Personally, I find them more interesting than the synthetic benchmark because they reflect complete pipelines.


Scaling Isn't Linear

One result immediately stood out.

Adding more Ractors did not produce proportional speedups.

That's exactly what I expected.

Parallelism always introduces coordination costs.

Events must be routed.

Partitions must remain consistent.

Workers communicate through Ractor mailboxes.

Ordering constraints occasionally delay otherwise-complete work.

The runtime spends part of its time doing useful work...

...and part of its time coordinating that work.

That coordination isn't overhead to eliminate.

It's the cost of preserving correctness.

The benchmark matrix made those tradeoffs visible.

Rather than chasing perfect scaling, the goal became identifying the point where additional parallelism stopped producing meaningful throughput gains.

For the current implementation, that sweet spot consistently appeared around:

  • 3 prewarmed Ractors
  • 5 Redis connections per Ractor
  • 50 Fibers

That balance delivered high throughput without introducing excessive scheduling overhead.


Ordering Has a Cost

One benchmark compared ordered and unordered execution.

The difference wasn't dramatic.

Ordered execution consistently performed slightly slower.

That's expected.

Maintaining ordering means the runtime occasionally waits for earlier work to complete before later work can safely continue.

Event 1
Event 2
Event 3

cannot become:

Event 2
Event 3
Event 1

simply because Event 2 happened to finish first.

Preserving correctness sometimes requires sacrificing a little throughput.

That's a tradeoff I consider worthwhile.

Correctness scales better than debugging race conditions.
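The "wait for earlier work" cost comes from something like a reorder buffer: results may finish in any order, but they are only released once every earlier sequence number has finished. This is a generic sketch of that technique, with ReorderBuffer as a hypothetical name rather than HybridRuntime's class.

```ruby
# Releases results strictly in sequence order, holding any that finish early.
class ReorderBuffer
  def initialize(first_seq = 1)
    @next_seq = first_seq # the sequence number that must be emitted next
    @held = {}            # finished results waiting on earlier ones
  end

  # Record that event +seq+ finished and return every result that may now be
  # emitted, in sequence order (possibly none).
  def complete(seq, result)
    @held[seq] = result
    released = []
    while @held.key?(@next_seq)
      released << @held.delete(@next_seq)
      @next_seq += 1
    end
    released
  end

  # How many finished results are parked; this is the throughput cost.
  def waiting
    @held.size
  end
end

buffer = ReorderBuffer.new(1)
buffer.complete(2, "Event 2") # => [] (Event 1 is still running)
buffer.complete(3, "Event 3") # => []
buffer.complete(1, "Event 1") # => ["Event 1", "Event 2", "Event 3"]
```

While Event 1 is outstanding, Events 2 and 3 sit in `waiting`. That idle, already-finished work is the small throughput gap between ordered and unordered runs.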


The Interesting Bottleneck

The benchmark wasn't really about Redis.

It was about coordination.

At low parallelism, workers spend most of their time processing events.

At high parallelism, workers spend increasingly more time coordinating with one another.

Eventually another Ractor contributes more scheduling overhead than useful work.

Finding that point was considerably more valuable than finding the largest throughput number.

It answered a much more practical question:

"How should I actually configure this in production?"


Chaos Matters More Than Throughput

Raw throughput is only one characteristic of an event pipeline.

Recovery behavior is arguably more important.

The benchmark suite includes failure scenarios covering:

  • Redis restarts
  • PostgreSQL restarts
  • connection interruption
  • checkpoint recovery
  • consumer recovery

Streams resumed processing from checkpoints.

Pub/Sub sources reported explicit loss windows.

Recovery behavior remained consistent with each source's documented guarantees.

That consistency mattered more to me than achieving another few thousand events per second.


Long-Running Stability

Short benchmarks rarely expose operational problems.

Memory leaks.

Connection exhaustion.

Scheduler starvation.

Queue growth.

Those usually appear over time.

The runtime was therefore exercised continuously using soak tests.

One representative run processed approximately 1.34 million events over five minutes, an average of roughly 4,500 events per second.

No processing failures were observed.

Median throughput degraded by roughly 2% over the duration of the run.

That's encouraging, although much longer overnight and multi-day soak tests remain on my roadmap.

Operational confidence comes from sustained behavior—not just impressive graphs.
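Comparing medians from the start and end of a run is one simple way to quantify the kind of drift reported above. A minimal sketch, assuming per-second throughput samples and a hypothetical one-minute comparison window; `median` and `throughput_drift` are illustrative helpers, not part of any published tooling, and the samples are synthetic.

```ruby
# Median of a numeric array (average of the two middle values for even sizes).
def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid].to_f : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Percentage change between the medians of the first and last `window`
# samples. Negative values mean throughput degraded over the run.
def throughput_drift(samples, window:)
  raise ArgumentError, "need at least #{2 * window} samples" if samples.size < 2 * window

  early = median(samples.first(window))
  late = median(samples.last(window))
  (late - early) / early * 100.0
end

# Synthetic per-second samples for a five-minute run, not benchmark data.
samples = Array.new(150) { 1000 } + Array.new(150) { 980 }
puts format("median drift: %.1f%%", throughput_drift(samples, window: 60))
```

Medians are used rather than means so that a few slow seconds (for example, a GC pause) do not dominate the comparison.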


What I Learned

Perhaps the most surprising outcome of the benchmarking work was this:

The execution runtime wasn't the limiting factor.

The limiting factor was almost always the surrounding system.

Network latency.

Redis.

HTTP endpoints.

Disk.

Database writes.

The scheduler spent most of its time waiting for external systems.

That reinforced one of the central architectural decisions behind HybridRuntime.

Fibers overlap waiting.

Ractors overlap computation.

Neither attempts to eliminate latency.

They simply ensure latency in one part of the system doesn't unnecessarily stall everything else.

The result isn't infinite scalability.

It's predictable scalability.

And for infrastructure software, predictability is usually the more valuable property.
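The idea that Fibers overlap waiting can be shown without any I/O library. The toy scheduler below is a sketch, not how HybridRuntime or Ruby's own Fiber scheduler works. Each job is a Fiber that "waits" by yielding the time it wants to wake up, and a loop sleeps only until the earliest deadline. In real Ruby code, non-blocking Fibers are usually driven by a scheduler registered with `Fiber.set_scheduler` (for example, the one provided by the `async` gem), so ordinary blocking I/O yields automatically.

```ruby
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Runs each job as a Fiber. A job "waits" by yielding the time at which it
# wants to be resumed; the loop sleeps until the earliest wake-up time.
def run_overlapped(waits)
  fibers = waits.map do |seconds|
    Fiber.new do
      Fiber.yield(monotonic_now + seconds) # hand control back while "waiting"
      seconds                              # result once the wait is over
    end
  end

  # The first resume runs each fiber up to Fiber.yield and returns its wake-up time.
  pending = fibers.map { |fiber| [fiber.resume, fiber] }
  results = []
  until pending.empty?
    pending.sort_by!(&:first)
    wake_at, fiber = pending.shift
    delay = wake_at - monotonic_now
    sleep(delay) if delay.positive?
    results << fiber.resume # finish the job whose wait has elapsed
  end
  results
end

start = monotonic_now
results = run_overlapped([0.3, 0.1, 0.2])
puts "completed #{results.inspect} in #{(monotonic_now - start).round(2)}s"
```

Three waits of 0.3, 0.1 and 0.2 seconds finish in roughly 0.3 seconds rather than 0.6, because the waits overlap. None of the latency disappears; it simply stops stacking up.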


The complete benchmark reports—including raw CSV data, SVG charts, chaos-recovery artifacts, and soak-test results—are published alongside the documentation.

I'd much rather readers inspect the raw data than rely on a single headline number.

Benchmarks are most useful when they're reproducible.

What's Next

cdc-redis-pro is only one piece of a much larger ecosystem.

The long-term goal was never to build "yet another Redis client."

The goal was to build a source-agnostic Change Data Capture platform for Ruby.

Today, PostgreSQL logical replication and Redis happen to be the two primary sources.

Tomorrow, that could just as easily include:

  • Kafka
  • NATS
  • Amazon SQS
  • Webhooks
  • Object storage
  • Search indexes
  • Other databases

The important observation is that the runtime doesn't need to change.

As long as a source can be normalized into a CDC::Core::ChangeEvent, everything downstream already knows how to process it.

That was the motivation behind separating source acquisition from execution.

        Source
           │
           ▼
   CDC::Core::ChangeEvent
           │
           ▼
    cdc-orchestrator-pro
           │
           ▼
        Processors

Every new source becomes an adapter.

Not a new runtime.
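A minimal sketch of that adapter boundary, assuming a simplified event shape. `ChangeEvent` here is a hypothetical stand-in for `CDC::Core::ChangeEvent` (whose real fields may differ), and `StreamEntryAdapter`, `KeyspaceAdapter` and `describe` are illustrative names. The point is that each adapter knows one source's wire format, while downstream code sees only the normalized event.

```ruby
# Hypothetical stand-in for CDC::Core::ChangeEvent; the real class's
# fields and constructor may differ.
ChangeEvent = Struct.new(:source, :key, :operation, :payload, keyword_init: true)

# Hypothetical adapter: a Redis Stream entry (ID plus field/value hash).
module StreamEntryAdapter
  def self.call(stream, entry_id, fields)
    ChangeEvent.new(source: "stream:#{stream}", key: entry_id,
                    operation: fields.fetch("op", "append"), payload: fields)
  end
end

# Hypothetical adapter: a keyspace notification arrives on a channel such as
# "__keyspace@0__:user:42" with the command name (e.g. "set") as the message.
module KeyspaceAdapter
  def self.call(channel, message)
    key = channel.split(":", 2).last # drop the "__keyspace@<db>__" prefix
    ChangeEvent.new(source: "keyspace", key: key, operation: message, payload: nil)
  end
end

# Downstream code depends only on ChangeEvent, never on the source.
def describe(event)
  "#{event.operation} #{event.key} (#{event.source})"
end

events = [
  StreamEntryAdapter.call("orders", "1700000000000-0", { "op" => "insert", "id" => "7" }),
  KeyspaceAdapter.call("__keyspace@0__:user:42", "expired")
]
events.each { |event| puts describe(event) }
```

Adding another source in this design means writing another small adapter; `describe` and anything else downstream stay unchanged.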


Why Split the Runtime?

One architectural decision deserves a brief explanation.

Originally the execution engine lived inside another project.

As the ecosystem evolved, I realized something important.

The runtime wasn't solving a Redis problem.

It wasn't solving a PostgreSQL problem.

It wasn't even solving a Sidekiq problem.

It was solving an event-processing problem.

That realization led to extracting the execution engine into its own commercial component:

cdc-orchestrator-pro

Today it powers Redis CDC.

Tomorrow it can power any source capable of producing normalized change events.

Separating those concerns keeps both halves of the system simpler.

Source adapters acquire events.

HybridRuntime processes them.

Each evolves independently.
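The acquisition/processing boundary can be sketched with a bounded queue: an acquirer only pushes events, workers only pop and process them, and neither knows how the other is implemented. This is a minimal illustration under stated assumptions (Ruby threads instead of Fibers and Ractors, `String#upcase` standing in for real processing, and a hypothetical `run_pipeline` method), not the cdc-orchestrator-pro API.

```ruby
# Acquisition and processing connected only by a bounded queue. Threads are
# used for brevity; the article's runtime uses Fibers and Ractors instead.
def run_pipeline(events, processors: 3, capacity: 10)
  queue = SizedQueue.new(capacity) # bounded, so a fast source is slowed down
  results = Queue.new
  stop = Object.new                # unique sentinel that no event can equal

  acquirer = Thread.new do
    events.each { |event| queue << event } # source side: acquire only
    processors.times { queue << stop }     # one stop signal per worker
  end

  workers = Array.new(processors) do
    Thread.new do
      until (event = queue.pop).equal?(stop)
        results << event.upcase            # runtime side: process only
      end
    end
  end

  acquirer.join
  workers.each(&:join)
  Array.new(results.size) { results.pop }
end

p run_pipeline(%w[a b c d e]).sort # => ["A", "B", "C", "D", "E"]
```

Either side can be swapped or scaled (more workers, a different source) without touching the other. Because several workers pull from the same queue, output order is not preserved, which is why an ordered mode needs the extra coordination discussed earlier.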


Open Source First

Although cdc-redis-pro and cdc-orchestrator-pro are commercial products, the ecosystem they're built upon remains open source.

That includes:

  • cdc-core
  • cdc-parallel
  • cdc-concurrent
  • pgoutput-client
  • pgoutput-parser
  • pgoutput-decoder
  • pgoutput-source-adapter
  • Mammoth

Those projects define the common event model, execution primitives, and PostgreSQL integration that everything else builds upon.

The commercial components focus on operational capabilities rather than replacing the open-source foundation.

That separation is intentional.

I believe infrastructure ecosystems become valuable through adoption and trust—not artificial feature restrictions.


Looking Ahead

Redis replication remains one of the larger pieces still on the roadmap.

Today, cdc-redis-pro consumes Redis event sources such as Streams, Pub/Sub, and Keyspace Notifications.

A future version will move further upstream by treating Redis itself as a replication source.

That's a significantly more ambitious problem.

I'd rather stabilize the current architecture before expanding its scope.

There are also areas where I think the execution engine itself can continue to improve.

Adaptive scheduling.

Smarter partition routing.

Better observability.

Long-running soak tests.

More topology-aware execution.

Those improvements belong to the runtime rather than any particular source adapter—which is exactly why separating acquisition from execution turned out to be such a useful architectural boundary.


Final Thoughts

I started this project thinking I was building Redis CDC.

Somewhere along the way I realized I was really building an execution model.

Redis happened to expose the problem first.

PostgreSQL reinforced it.

Future source adapters will probably validate it again.

The most interesting lesson wasn't about Redis at all.

It was this:

Acquiring events and processing events are different problems.

They have different bottlenecks.

They scale differently.

They deserve different architectures.

Once those responsibilities are separated, the rest of the system becomes remarkably composable.

Redis becomes another source.

PostgreSQL becomes another source.

Tomorrow's adapters become just that—adapters.

The runtime stays the same.

For me, that's the most exciting part of the entire project.

Not because it produced the largest benchmark numbers.

Not because it uses Ractors or Fibers.

But because it led to an architecture that's easier to reason about, easier to extend, and honest about the tradeoffs of the systems it builds upon.

The benchmark reports are public.

The documentation is public.

The implementation is commercial.

If you're building event-driven systems in Ruby—or you're wrestling with Redis and PostgreSQL in the same architecture—I'd genuinely love to hear how you're approaching those problems.

I'm convinced there's still a lot left to explore.
