When you first look at a distributed architecture diagram with services scattered across multiple cloud providers, regions, and continents, it's easy to feel overwhelmed. Network partitions, timeouts, SSL handshake failures, connection drops—the list of things that can go wrong seems endless. And they do go wrong, constantly. In our globally distributed application running across 5 regions on Fly.io, we see these failures every single day. Phoenix.PubSub disconnecting from Redis. PostgreSQL protocols timing out mid-query. RabbitMQ brokers closing connections unexpectedly. It looks scary, and honestly, it should be. But here's the thing: with the right architectural choices, the right technology stack, and a healthy respect for the fallacies of distributed computing, you can build systems that not only survive these failures but thrive because of how they handle them.
The Network Is Not Reliable (And Never Will Be)
Peter Deutsch's famous "8 Fallacies of Distributed Computing" starts with the most dangerous assumption: the network is reliable. It's not. When you're orchestrating services across Redis on Upstash, LavinMQ on CloudAMQP, Postgres on Neon, and storage on Backblaze B2—all while serving users globally through BunnyCDN—you're essentially building on top of organized chaos. The cost of communication is never zero, and latency isn't just a number on a dashboard; it's a real constraint that shapes your entire architecture.
Looking at our Sentry error logs tells the story plainly: DBConnection.ConnectionError: ssl recv (idle): timeout, Phoenix.PubSub disconnected from Redis with reason :ssl_closed, Cannot connect to RabbitMQ broker: :timeout. These aren't exceptional circumstances; they're Tuesday. The errors show up, they resolve themselves, and the application keeps running. That's not luck—that's design.
Let It Crash: More Than Just a Slogan
Elixir's "let it crash" philosophy, inherited from Erlang's decades of building fault-tolerant telecom systems, is often misunderstood. It doesn't mean writing careless code or ignoring errors. It means designing systems where failures are isolated, supervised, and automatically recovered from. When a PostgreSQL connection times out after 4 days of idle time, the connection process crashes, and the supervisor immediately spawns a new one. When Phoenix.PubSub loses its Redis connection, it gracefully reconnects without taking down the entire application.
This is visible in our error patterns. Notice how the Postgrex protocol disconnections show up with (No error message) but are all marked as Resolved. The system detected the failure, handled it, and moved on. No manual intervention. No emergency pages at 3 AM. The BEAM VM's process isolation means that a failing database connection doesn't cascade into a system-wide outage—it's contained, logged, and recovered.
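To make that concrete, here's a minimal sketch of the kind of supervision tree that produces this behavior. The child names (MyApp.Repo, MyApp.CacheListener, and so on) are illustrative, not our actual modules:

```elixir
defmodule MyApp.Application do
  # Minimal sketch of an application supervision tree. MyApp.CacheListener
  # stands in for any worker that holds a flaky external connection.
  use Application

  @impl true
  def start(_type, _args) do
    children = [
      MyApp.Repo,
      {Phoenix.PubSub, name: MyApp.PubSub},
      MyApp.CacheListener
    ]

    # :one_for_one means a crashing child (say, after an idle connection
    # times out) is restarted on its own; its siblings keep running.
    Supervisor.start_link(children, strategy: :one_for_one, name: MyApp.Supervisor)
  end
end
```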
Event-Driven Architecture: Decoupling for Resilience
At the heart of our architecture lies event-driven design, and it's what allows us to scale without losing data. When a user uploads a file, that action triggers events that flow through the system asynchronously. The web request completes quickly, returning control to the user, while the actual processing happens in the background through LavinMQ for inter-service communication and Oban for background job processing.
Here's why this matters: when LavinMQ has a momentary connection issue (as we see in the logs with RuntimeError: unexpected error when connecting to RabbitMQ broker), the messages aren't lost. They're persisted in PostgreSQL through Oban's reliable job queue. The system retries with exponential backoff—starting with short delays and gradually increasing them to avoid overwhelming a recovering service. This retry strategy is visible in our error resolution times: some errors resolve in minutes, others take hours, but they all resolve without manual intervention.
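As a sketch of how that looks with Oban (the worker, queue, and helper names here are hypothetical), a worker can cap its retries and override Oban's backoff/1 callback to spread them out exponentially with jitter:

```elixir
defmodule MyApp.Workers.ProcessUpload do
  # Hypothetical worker; max_attempts bounds the retry budget.
  use Oban.Worker, queue: :uploads, max_attempts: 10

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"upload_id" => upload_id}}) do
    # Raising or returning {:error, _} marks the attempt failed; Oban keeps
    # the job in Postgres and schedules the next attempt.
    MyApp.Uploads.process(upload_id)
  end

  @impl Oban.Worker
  def backoff(%Oban.Job{attempt: attempt}) do
    # Delay in seconds: 2, 4, 8, ... plus random jitter, so retries against a
    # recovering broker or database don't arrive in lockstep.
    trunc(:math.pow(2, attempt)) + :rand.uniform(10)
  end
end
```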
Using Postgres as Oban's backend is a deliberate choice. While LavinMQ handles the real-time, high-throughput message passing between services, Oban manages critical background tasks that absolutely cannot be lost. Database-backed job queues give us transactional guarantees—if the job is enqueued, it will eventually be processed, even if it takes multiple retries across connection failures.
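A minimal sketch of that guarantee using an Ecto.Multi (the Upload schema and context module are hypothetical): the upload row and the job that will process it commit in the same transaction, or not at all.

```elixir
defmodule MyApp.Uploads do
  # Hypothetical context module.
  alias Ecto.Multi
  alias MyApp.{Repo, Upload}

  def create_upload(attrs) do
    Multi.new()
    |> Multi.insert(:upload, Upload.changeset(%Upload{}, attrs))
    # Oban.insert/3 adds the job to the same transaction as the upload row.
    |> Oban.insert(:process_job, fn %{upload: upload} ->
      MyApp.Workers.ProcessUpload.new(%{"upload_id" => upload.id})
    end)
    |> Repo.transaction()
  end
end
```

If the transaction rolls back there is no orphaned job; if it commits, the worker will eventually run, however many retries that takes.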
The Technology Stack: Intentional Choices

Every piece of our stack was chosen with resilience and global distribution in mind:
Fly.io provides globally distributed compute with applications running in 5 regions simultaneously. When a region has issues, traffic automatically routes to healthy regions. The edge-first architecture means users always connect to the nearest available instance.
Upstash Redis gives us globally replicated cache and pub/sub capabilities. Even when connections drop (as they inevitably do), the application degrades gracefully, fetching from the database instead of serving stale cache or crashing.
CloudAMQP's LavinMQ offers a lightweight, high-performance message broker. Its RabbitMQ compatibility means we get the proven AMQP protocol with better performance characteristics. Message persistence ensures that even during connection issues, messages wait in queues rather than disappearing into the void.
Neon's Postgres provides serverless Postgres with branching and point-in-time recovery. The connection pooling and automatic scaling mean we can handle traffic spikes without manual database provisioning. When we see those ssl recv (idle): timeout errors, it's often because a connection has been idle during low-traffic periods—Neon's serverless nature shuts down idle resources, and our connection pooler handles reconnection transparently.
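For illustration, the knobs involved live in the Ecto repo configuration. This is a sketch with placeholder values, not our production settings:

```elixir
# config/runtime.exs (values are illustrative)
config :my_app, MyApp.Repo,
  url: System.get_env("DATABASE_URL"),
  ssl: true,
  pool_size: 10,
  # How long a checkout may queue before the pool starts shedding load.
  queue_target: 500,
  queue_interval: 5_000,
  # Per-query timeout, so a hung TLS socket fails fast and lets the
  # supervisor replace the connection instead of hanging indefinitely.
  timeout: 15_000
```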
Backblaze B2 with S3 protocol gives us cost-effective, reliable object storage. The S3 compatibility means we can use battle-tested client libraries, and the global CDN integration through BunnyCDN ensures low-latency access worldwide.
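As one example of that wiring (assuming ExAws as the S3 client; the endpoint and environment variable names are illustrative):

```elixir
# config/runtime.exs — point an S3-compatible client at Backblaze B2.
config :ex_aws,
  access_key_id: System.get_env("B2_KEY_ID"),
  secret_access_key: System.get_env("B2_APPLICATION_KEY")

config :ex_aws, :s3,
  scheme: "https://",
  host: System.get_env("B2_S3_ENDPOINT"),
  region: System.get_env("B2_REGION")
```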
CircleCI handles our continuous deployment pipeline, automatically deploying code changes across all regions. This is crucial because fixing issues often means deploying new code, and we need that process to be fast and reliable.
Sentry is our observability layer, showing us not just when things fail, but how they fail and how they recover. The error patterns in Sentry guided many of our retry strategies and timeout configurations.
The Reactive Manifesto in Practice
Our architecture embodies the principles of the Reactive Manifesto—responsive, resilient, elastic, and message-driven:
Responsive: Users get fast responses because we don't block on slow operations. File uploads return immediately while processing happens asynchronously.
Resilient: Failures are isolated and don't cascade. A database timeout doesn't crash the application; it crashes a single connection process that's immediately restarted.
Elastic: Our globally distributed compute and serverless database scale up and down based on demand without manual intervention.
Message-driven: Events flow through message queues, decoupling services and allowing them to process work at their own pace while maintaining backpressure.
Practical Patterns for Resilience
Several concrete patterns emerge from our experience:
Exponential backoff with jitter spaces out retries intelligently. The first retry happens quickly (maybe the network hiccup was momentary), but subsequent retries back off exponentially, with random jitter to prevent thundering herd problems.
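Outside of Oban, the same idea applies to any ad-hoc call to an external service. A generic helper might look like this sketch (not a specific library; the delays are arbitrary):

```elixir
defmodule MyApp.Retry do
  # Retry a zero-arity function with exponential backoff plus jitter.
  def with_backoff(fun, attempt \\ 1, max_attempts \\ 5) do
    case fun.() do
      {:ok, result} ->
        {:ok, result}

      {:error, _reason} when attempt < max_attempts ->
        # 1s, 2s, 4s, ... plus up to 250ms of random jitter.
        base_ms = trunc(:math.pow(2, attempt - 1) * 1_000)
        Process.sleep(base_ms + :rand.uniform(250))
        with_backoff(fun, attempt + 1, max_attempts)

      {:error, reason} ->
        {:error, reason}
    end
  end
end
```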
Dead letter queues capture messages that fail repeatedly after all retries. This prevents poison messages from blocking queue processing while ensuring we can investigate and manually recover them later.
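With an AMQP client such as the amqp library, dead-lettering is mostly queue topology. A sketch with illustrative queue names:

```elixir
# Assumes AMQP_URL is set; queue names are illustrative.
{:ok, conn} = AMQP.Connection.open(System.get_env("AMQP_URL"))
{:ok, chan} = AMQP.Channel.open(conn)

# Where repeatedly failing messages end up, for inspection and manual replay.
{:ok, _} = AMQP.Queue.declare(chan, "uploads.dead_letter", durable: true)

# The working queue routes rejected messages to the dead-letter queue
# (via the default exchange) instead of dropping them.
{:ok, _} =
  AMQP.Queue.declare(chan, "uploads",
    durable: true,
    arguments: [
      {"x-dead-letter-exchange", :longstr, ""},
      {"x-dead-letter-routing-key", :longstr, "uploads.dead_letter"}
    ]
  )
```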
Idempotent operations ensure that retries don't cause duplicate side effects. When LavinMQ redelivers a message after a connection issue, we can safely process it again without corrupting data.
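One way to get that property is a Postgres upsert through Ecto, sketched here with a hypothetical Thumbnail schema (it assumes a unique index on upload_id):

```elixir
defmodule MyApp.Thumbnails do
  # Hypothetical context: writing the same thumbnail twice replaces the row
  # rather than duplicating it, so redelivered messages are harmless.
  alias MyApp.{Repo, Thumbnail}

  def store(upload_id, url) do
    %Thumbnail{upload_id: upload_id, url: url}
    |> Repo.insert(
      on_conflict: {:replace, [:url]},
      conflict_target: :upload_id
    )
  end
end
```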
Graceful degradation means falling back to reduced functionality rather than failing completely. If Redis is down, we serve from a stale in-memory cache rather than throwing errors.
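A sketch of that degraded read path, assuming Redix as the Redis client and a small ETS table as the local copy (all names are illustrative):

```elixir
defmodule MyApp.Settings do
  # Call init_fallback/0 once at boot to create the local table.
  def init_fallback do
    :ets.new(:settings_fallback, [:named_table, :public, :set])
  end

  def fetch(key) do
    case Redix.command(:redix, ["GET", key]) do
      {:ok, value} when not is_nil(value) ->
        # Keep the local copy fresh on every successful read.
        :ets.insert(:settings_fallback, {key, value})
        {:ok, value}

      _miss_or_error ->
        # Redis is down or the key is gone: serve the possibly stale copy.
        case :ets.lookup(:settings_fallback, key) do
          [{^key, stale_value}] -> {:ok, stale_value}
          [] -> {:error, :unavailable}
        end
    end
  end
end
```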
The Cost of Resilience
Building resilient distributed systems isn't free. There's operational complexity—more services to monitor, more failure modes to understand, more logs to analyze. There's performance overhead—retries mean higher latency for failed operations, and circuit breakers reject some requests that might eventually have succeeded.
But the alternative is worse. Without proper retry logic, timeout handling, and failure isolation, that DBConnection.ConnectionError doesn't just log to Sentry and resolve itself—it crashes your application. That momentary Redis connection drop doesn't trigger a graceful reconnection—it takes down your real-time features until someone manually restarts the service.
Our error logs show the price we pay for resilience: hundreds of timeout errors, connection failures, and protocol disconnections every week. But they also show something more important: they all resolve themselves. The system heals, automatically, without human intervention. That's what resilience looks like in production.
Conclusion: Embrace the Chaos
Distributed systems are inherently chaotic. Networks partition, services time out, connections close unexpectedly. You can't prevent these failures, but you can design for them. Elixir's supervisor trees give you automatic recovery. Event-driven architecture gives you decoupling and scalability. Proper retry strategies give you resilience. Message queues give you durability.
When we look at our Sentry dashboard and see all those resolved errors—PostgreSQL timeouts, Redis disconnections, RabbitMQ failures—we don't see problems. We see evidence that the system is working as designed. Each of those errors represents a moment where the system detected a failure, handled it gracefully, and recovered automatically.
Building distributed systems is still scary. The architecture diagrams are still complex. The failure modes are still numerous. But with the right tools, the right patterns, and a healthy respect for the network's unreliability, you can build systems that are truly resilient—systems that bend but don't break, that stumble but don't fall, that meet the challenges of global distribution head-on and emerge stronger for it.
The network will fail. Your services will time out. Connections will drop. Build accordingly.