Samuel Owolabi
Designing for Failure: Chaos Engineering Principles in System Design

A practical guide to chaos engineering principles that transform fragile architectures into resilient, self-healing systems.

Recently, I wrote an article titled “What if you are to build for one million daily active users?”. In that article, we explored a point where a monolithic system could no longer scale and began to break. We discussed scalability, availability, and observability, and why they become critical as systems grow. This article builds directly on that discussion.
Here, the focus is designing for failure: what Chaos Engineering actually is, how to simulate chaos in a system, how to measure the impact, and how to handle and mitigate the failures it exposes.

The reality is that 100% uptime is not something you can realistically promise. What you can design for is fault tolerance and resilient infrastructure. That difference matters.

A simple way to understand this is the spare tire in your car. You do not expect a flat tire every day, but you still keep a spare, and you might even check occasionally to make sure it is still inflated. The reason is straightforward: the cost of being unprepared when failure happens, especially in a bad situation like a breakdown on a highway at night, is very high. I have experienced this before on the Lekki highway in Lagos, and that night wasn’t funny (LOL).

Netflix famously built Chaos Monkey, a tool that randomly terminates production instances. The goal was to force their systems to survive common types of failures before those failures happened unexpectedly. This is the core mindset behind Chaos Engineering.

To design for failure, we must understand how the system behaves when failure inevitably happens. What is the cost? What is the impact? How do we mitigate it? How do we still maintain over 99% uptime? This requires treating failure as a default state, not an exception.


Core Principles of Chaos Engineering

1. Define What “Normal” Looks Like

The first step is defining steady-state behavior. Without this, there is no baseline to measure against.
Examples include:

  • P95 latency less than 300 milliseconds
  • Error rate below 0.5%
  • Successful payment completion rate above 99.9%

These metrics represent what healthy behavior means for your system.

2. Form a Hypothesis

Once normal behavior is defined, you form a hypothesis.
The idea is simple: if a specific failure happens, the system should still be able to perform a critical function.
For example, if Service X fails, users should still be able to complete checkout within acceptable latency. This keeps chaos experiments intentional and measurable.
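To make the hypothesis checkable, it can be written down as data with a small assertion around it. Below is a minimal sketch in Python; the experiment name, metric names, thresholds, and the get_metric helper are illustrative assumptions standing in for queries against your monitoring system.

# A chaos hypothesis expressed as data plus a check against steady state.
HYPOTHESIS = {
    "experiment": "terminate one instance of the checkout service",
    "steady_state": {
        "checkout_p95_latency_ms": 300,   # P95 latency must stay under 300 ms
        "checkout_error_rate": 0.005,     # error rate must stay below 0.5%
    },
}

def get_metric(name: str) -> float:
    """Hypothetical: fetch the current value of a metric from monitoring."""
    raise NotImplementedError

def steady_state_holds() -> bool:
    """True only if every steady-state metric is within its threshold."""
    return all(
        get_metric(metric) <= threshold
        for metric, threshold in HYPOTHESIS["steady_state"].items()
    )

If steady_state_holds() is true before, during, and after the injected failure, the hypothesis survives; if not, you have found a weakness worth fixing.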

3. Introduce Realistic Failures

Next, you introduce failures that closely resemble real-world incidents.
Examples include:

  • Killing instances or Kubernetes pods
  • Introducing network latency or packet loss
  • Brute-forcing APIs to test rate limits
  • Simulating DDoS-like traffic spikes
  • Simulating DNS failures or dependency outages
  • Testing third-party data mismatches or breaking changes

The goal is realism, not randomness.
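As a concrete example of killing Kubernetes pods, the sketch below deletes one random pod in a namespace with kubectl. It assumes kubectl is installed and authenticated, and the namespace name is an illustrative placeholder; start with something non-critical to keep the blast radius small.

import random
import subprocess

def kill_random_pod(namespace: str = "staging") -> None:
    """Pick one pod in the namespace at random and delete it."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "name"],
        capture_output=True, text=True, check=True,
    )
    pods = result.stdout.split()
    if not pods:
        print(f"No pods found in namespace {namespace!r}")
        return
    victim = random.choice(pods)              # e.g. "pod/checkout-7f9c..."
    print(f"Terminating {victim}")
    subprocess.run(["kubectl", "delete", victim, "-n", namespace], check=True)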

4. Run Experiments in Production

Chaos experiments are most valuable in production. This is where real traffic patterns, real user behavior, and real data shapes exist.
That said, experiments must be controlled.

  • Start small to limit blast radius.
  • Run during business hours when teams are available.
  • Have clear rollback strategies such as feature flags, traffic shifting, or instance replacement.
  • Ensure strong monitoring and observability before running experiments.

I want you to know that running chaos experiments without observability leads to outages, not learning.

5. Automate and Continuously Improve

Chaos Engineering is not a one-off exercise.
Systems evolve. Dependencies change. Teams rotate.
Experiments should be automated, repeatable, and run continuously, either as scheduled jobs or integrated into CI/CD pipelines. Over time, experiments can be expanded to test higher-impact scenarios.
The feedback loop remains consistent:
Design → Build → Chaos Test → Iterate


Types of Failures to Design For and How to Address Them

Section 1: Single Points of Failure

A single point of failure is any component whose failure can bring down the entire system.

1. Database Failures

Scenario: The primary database goes down.
Solutions:

  • Primary–standby setup with synchronous replication and automatic failover
  • Read replicas for distributing read queries (asynchronous replication)

Additional considerations include health checks, failover timing, and data consistency. Strong consistency simplifies reasoning but reduces availability. Eventual consistency improves availability but introduces complexity and potential inconsistency windows.
Trade-offs: Increased cost, operational complexity, and consistency challenges.
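To make the primary and replica split concrete, here is a minimal sketch of application-level routing that sends writes to the primary and reads to a randomly chosen replica. The connection strings are illustrative assumptions, and in practice this logic usually lives in a driver, proxy, or connection pooler rather than hand-rolled code.

import random

# Hypothetical connection strings; real values come from configuration.
PRIMARY_DSN = "postgres://primary.internal:5432/app"
REPLICA_DSNS = [
    "postgres://replica-1.internal:5432/app",
    "postgres://replica-2.internal:5432/app",
]

def dsn_for(query: str) -> str:
    """Route writes to the primary and reads to a replica.

    Replicas are asynchronous here, so anything that needs
    read-after-write consistency should also go to the primary.
    """
    first_word = query.strip().split(maxsplit=1)[0].upper()
    is_write = first_word in {"INSERT", "UPDATE", "DELETE"}
    return PRIMARY_DSN if is_write else random.choice(REPLICA_DSNS)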

2. Cloud Server Outages

Using a single availability zone creates unnecessary risk.
Solutions:

  • Multi-AZ architecture for high availability
  • Multi-region architecture for disaster recovery

Trade-offs: Higher latency, data replication complexity, and increased cost.

3. Application and Web Server Failures

Scenario: API servers crash or become unhealthy.
Solutions:

  • Horizontal scaling using Auto Scaling Groups
  • Load balancer health checks and automatic deregistration
  • Stateless application design
  • External session storage or token-based session management

Trade-offs: More infrastructure and stricter architectural discipline.
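Health checks are only useful if the application exposes something meaningful to check. The sketch below uses Flask as an assumed framework, with a shallow liveness endpoint and a readiness endpoint that also verifies a dependency; check_database is a hypothetical helper.

from flask import Flask, jsonify

app = Flask(__name__)

def check_database() -> bool:
    """Hypothetical: run a cheap query (e.g. SELECT 1) against the database."""
    return True

@app.route("/healthz")
def healthz():
    # Liveness: the process is up and able to serve requests.
    return jsonify(status="ok"), 200

@app.route("/readyz")
def readyz():
    # Readiness: dependencies are reachable. The load balancer keeps an
    # instance out of rotation while this returns a non-200 response.
    if check_database():
        return jsonify(status="ready"), 200
    return jsonify(status="degraded"), 503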

4. Message Queue Failures

Common failure modes include broker unavailability and consumer processing timeouts.
Solutions:

  • Dead-letter queues for failed messages
  • Temporary fallback storage such as NoSQL databases
  • Clustered queue setups with replication

Trade-offs: Increased operational overhead and reprocessing complexity.
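As a rough illustration of the dead-letter pattern, the sketch below caps retries per message and parks repeated failures on a DLQ for later inspection. The process and publish functions, the queue names, and the attempt limit are hypothetical stand-ins for your broker client and business logic.

MAX_ATTEMPTS = 3

def process(message: dict) -> None:
    """Hypothetical business logic; raises on failure."""
    raise NotImplementedError

def publish(queue: str, message: dict) -> None:
    """Hypothetical publish call against your broker."""
    raise NotImplementedError

def handle(message: dict) -> None:
    """Process a message, retry a few times, then dead-letter it."""
    attempts = message.get("attempts", 0)
    try:
        process(message)
    except Exception as exc:
        if attempts + 1 >= MAX_ATTEMPTS:
            # Park it for inspection instead of blocking the queue forever.
            publish("orders.dlq", {**message, "error": str(exc)})
        else:
            publish("orders.retry", {**message, "attempts": attempts + 1})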

5. DNS Failures

Relying on a single DNS provider or registrar is a common oversight.
Solutions:

  • Multiple DNS providers
  • Secondary DNS configurations
  • Anycast DNS

6. Third-Party API Failures

Depending on a single third-party API for critical functionality is risky, especially in domains like fintech.
Solutions:

  • Multiple providers
  • Cached responses
  • Graceful fallback data or degraded flows

Trade-offs: Higher integration complexity and reconciliation challenges.
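A sketch of what a fallback chain for a third-party dependency can look like: try the primary provider, then a secondary, then a recent cached value before giving up. The provider functions, the FX-rate example, and the cache TTL are illustrative assumptions.

import time

def fetch_fx_rate_primary(pair: str) -> float:
    raise NotImplementedError  # hypothetical: call provider A

def fetch_fx_rate_secondary(pair: str) -> float:
    raise NotImplementedError  # hypothetical: call provider B

_cache = {}                    # pair -> (rate, fetched_at)
CACHE_TTL_SECONDS = 600

def get_fx_rate(pair: str) -> float:
    """Primary provider, then secondary, then a recent cached value."""
    for provider in (fetch_fx_rate_primary, fetch_fx_rate_secondary):
        try:
            rate = provider(pair)
            _cache[pair] = (rate, time.time())
            return rate
        except Exception:
            continue
    rate, fetched_at = _cache.get(pair, (None, 0.0))
    if rate is not None and time.time() - fetched_at < CACHE_TTL_SECONDS:
        return rate            # degraded mode: serve a slightly stale rate
    raise RuntimeError(f"No FX rate available for {pair}")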

Every solution discussed here introduces trade-offs. Designing for failure is not about eliminating risk entirely. It is about understanding where risk exists and making deliberate decisions about how much of it you are willing to accept.


Section 2: Network Failures

Network failures are unavoidable in distributed systems. Latency spikes, packets get dropped, DNS fails, and sometimes the network splits entirely. Many system outages are not caused by servers crashing, but by slow or unreliable communication between otherwise healthy components.

This is where several of the classic fallacies of distributed computing show up, especially the assumption that the network is reliable and has zero latency.

Common Network Behaviors and Risk Patterns We Often Ignore

1. Are All Network Calls Protected with Timeouts?

Timeouts are one of the most important and most frequently overlooked safeguards in distributed systems.

A slow dependency is often worse than a failed one. Without timeouts, requests pile up, threads get exhausted, and failures spread to otherwise healthy services.
Timeouts should exist at every layer:

  • Client-side requests
  • API gateway timeouts
  • Service-to-service calls
  • Database query execution
  • External API requests

A practical rule of thumb is that timeouts should get shorter as requests move deeper into the system.
Example timeout chain:
Client (10s) → Gateway (8s) → Service (6s) → Database (4s)
Timeouts should also be monitored. A sudden increase often indicates upstream degradation long before a full outage occurs.
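One way to enforce shorter timeouts deeper in the system is to pass a deadline down the call chain and give each hop only the remaining budget. A minimal sketch using the requests library; the URL and the amount of headroom kept for local work are illustrative.

import time
import requests

def call_with_budget(url: str, deadline: float) -> requests.Response:
    """Spend no more than the remaining time budget on this call.

    deadline is an absolute time taken from time.monotonic() by the caller.
    Each hop passes a smaller remaining budget downstream, so the deepest
    call (for example, the database) ends up with the shortest timeout.
    """
    remaining = deadline - time.monotonic()
    if remaining <= 0:
        raise TimeoutError("request budget exhausted before the call was made")
    return requests.get(url, timeout=remaining)

# Example: the client allows 10 seconds end to end; this service keeps
# 2 seconds of headroom for its own work and hands the rest downstream.
deadline = time.monotonic() + 10.0
# response = call_with_budget("https://inventory.internal/stock", deadline - 2.0)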

2. Do We Have Retry Logic with Exponential Backoff?

Retries exist to handle transient failures such as brief network interruptions or temporary service unavailability.
However, retries without limits or delays can make outages worse.
It is important to distinguish between:

  • Transient failures, which may succeed on retry
  • Permanent failures, which will not

Exponential backoff helps control retry behavior by increasing the wait time between attempts.

// Basic formula:
wait_time = base_delay × (2 ^ attempt_number)

This reduces pressure on failing services and prevents retry storms. Retry counts should always be capped, and retries should be combined with timeouts and circuit breakers.
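A minimal sketch of capped retries with exponential backoff and full jitter, following the formula above. The attempt count and delay bounds are illustrative and should be tuned per dependency.

import random
import time

def retry_with_backoff(operation, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry a callable with exponential backoff and full jitter.

    Only worth doing for transient failures; permanent failures should be
    surfaced immediately rather than retried.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                               # attempts exhausted
            wait = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, wait))     # jitter avoids retry storms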

3. Are Operations Idempotent?

Idempotency means that performing the same operation multiple times produces the same result.
This becomes critical when retries are involved. If a request is retried due to a timeout, the system must not process it twice in a harmful way.

Common implementation strategies include:

  • Unique transaction or request IDs
  • Idempotency keys stored with deduplication windows
  • Server-side request tracking

Payment systems are a common example where idempotency is mandatory. Charging a user twice because of a retry is unacceptable.
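A minimal sketch of idempotency-key handling, using an in-memory store purely for illustration; a real payment system would persist keys in a database or cache with a deduplication window (TTL).

import uuid

# Processed idempotency keys and their original results.
_processed = {}

def charge(idempotency_key: str, amount_cents: int) -> dict:
    """Process a payment at most once per idempotency key.

    A retried request carrying the same key gets the original result back
    instead of charging the customer a second time.
    """
    if idempotency_key in _processed:
        return _processed[idempotency_key]
    result = {"charge_id": str(uuid.uuid4()), "amount_cents": amount_cents, "status": "captured"}
    _processed[idempotency_key] = result
    return result

# The client generates the key once and reuses it on every retry.
key = str(uuid.uuid4())
assert charge(key, 5000) == charge(key, 5000)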


Section 3: Cascading Failures

In complex systems, failures rarely stay isolated. A single degraded service can trigger failures in upstream services, which then overload others. This is how small issues turn into large outages.

1. Are Circuit Breakers in Place?

Circuit breakers prevent repeated calls to failing dependencies.
They typically operate in three states:

  • Closed: requests flow normally
  • Open: requests are blocked
  • Half-open: limited test requests are allowed

Key configuration parameters include:

  • Failure threshold, such as a 50% error rate
  • Open-state timeout duration
  • Number of test requests in half-open state
  • Visibility into circuit breaker state through monitoring

Circuit breakers protect systems by failing fast and giving dependencies time to recover.
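For illustration, here is a deliberately simplified circuit breaker: it trips after a run of consecutive failures rather than tracking an error rate, allows a single probe after a cooldown, and is not thread-safe. Production systems usually reach for an existing library rather than hand-rolling this.

import time

class CircuitBreaker:
    """Closed -> open after N consecutive failures, half-open after a
    cooldown, closed again once a probe request succeeds."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")  # open state
            # cooldown elapsed: half-open, let this request through as a probe
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()                 # trip or re-open
            raise
        self.failures = 0
        self.opened_at = None                                     # back to closed
        return result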

2. Can One Service’s Failure Take Down the Entire System?

This question reveals how tightly coupled a system is.
Important steps include:

  • Identifying critical path services
  • Mapping service dependencies
  • Understanding which failures are acceptable and which are not

Implementation strategies:

  • Prefer asynchronous communication where possible
  • Use event-driven patterns to reduce tight coupling
  • Provide fallback responses when dependencies fail
  • Serve stale or cached data when appropriate

Chaos experiments are useful here. For example, intentionally disabling a downstream database and observing how the system behaves can reveal hidden coupling and unexpected dependencies.


Section 4: Data Consistency

Distributed systems must make trade-offs between consistency, availability, and partition tolerance. Network partitions are not hypothetical events; they are guaranteed to happen at some point.
This is where understanding consistency models becomes essential. The spectrum ranges from strong consistency, which is easier to reason about but less available, to eventual consistency, which improves availability but requires careful conflict handling.
(This ties directly into the earlier discussion on strong versus eventual consistency. I wrote an article on that here; check it out for a deeper look at consistency.)

Common Risk Pattern: Consistency Mismatch

1. Is the Consistency Model Appropriate for the Use Case?

Different parts of a system often require different consistency guarantees.

Strong consistency means that once a write succeeds, every subsequent read will immediately return that updated value. From the user’s point of view, the system behaves as if there is only one copy of the data. This makes the system easier to reason about, but it usually comes at the cost of higher latency and lower availability, especially during failures or network issues.

Eventual consistency means that after a write, different parts of the system may temporarily see different values, but given enough time and no new updates, all copies will converge to the same state. This model favors availability and performance, but it requires the system and sometimes the application to handle stale data and resolve conflicts.

In practice, strong consistency optimizes for correctness and simplicity, while eventual consistency optimizes for availability and scale. Neither is universally better. The right choice depends on what the data represents and how critical immediate correctness is to the user experience.

Strong consistency is suitable for: Payments, Inventory updates, Account balances
Eventual consistency works well for: Recommendations, Activity feeds, Analytics

Hybrid approaches are common and often necessary.
Examples:

  • Strong consistency for checkout flows
  • Eventual consistency for product recommendations
  • Per-operation consistency levels
  • Strong guarantees on critical paths, relaxed guarantees elsewhere

2. How Do We Handle Replication Lag?

Replication lag occurs when followers fall behind leaders.
Common causes include:

  • Network latency
  • High write throughput
  • Slower follower nodes

Lag becomes dangerous when users expect read-after-write consistency but do not receive it.
Mitigation strategies include:

  • Reading from the leader immediately after writes
  • Client-side caching of recent writes
  • Versioning and monotonic reads

Replication lag should be measured and monitored. When lag grows beyond acceptable limits, it becomes a correctness issue, not just a performance concern.
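One of the mitigations above, reading from the leader immediately after a write, can be sketched as pinning a user's reads to the leader for a short window after they write. The window length and the in-process dictionary are illustrative; a real system would track recent writes in a shared cache so every application instance sees them.

import time

_last_write_at = {}              # user_id -> time of most recent write

# Reads go to the leader for this long after a user's write; it should
# comfortably exceed the replication lag you actually observe.
READ_YOUR_WRITES_WINDOW = 5.0

def record_write(user_id: str) -> None:
    _last_write_at[user_id] = time.monotonic()

def choose_read_target(user_id: str) -> str:
    """Return 'leader' if the user wrote recently, otherwise 'follower'."""
    last = _last_write_at.get(user_id, float("-inf"))
    if time.monotonic() - last < READ_YOUR_WRITES_WINDOW:
        return "leader"
    return "follower"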


Section 5: Recovery

Failures will happen. What matters next is how quickly and safely the system recovers.
This is where recovery metrics become practical tools rather than abstract concepts.
Key metrics include:

  • RTO (Recovery Time Objective): how long recovery can take
  • RPO (Recovery Point Objective): how much data loss is acceptable
  • MTTR (Mean Time to Recovery): how long it actually takes, on average, to restore service
  • The business cost of downtime

Common Risk Pattern: Slow or Manual Recovery

1. How Quickly Are Failures Detected?

Detection speed directly affects recovery time.
Effective detection relies on:

  • Well-designed health checks
  • Monitoring and alerting
  • Observability across logs, metrics, and traces

Examples include infrastructure and application monitoring tools, both within cloud providers and external platforms. Faster detection leads to faster mitigation.

2. Can the System Heal Itself?

Self-healing reduces reliance on manual intervention.
Examples include:

  • Auto Scaling Groups replacing failed instances
  • Kubernetes restarting unhealthy containers
  • Horizontal Pod Autoscaling
  • Automated DNS failover

Self-healing does not eliminate the need for humans, but it reduces response time significantly.

3. What Is the Backup and Recovery Strategy?

Backups exist to meet RPO and RTO goals.

RPO (Recovery Point Objective) answers the question:
How much data can we afford to lose?
It defines the maximum acceptable amount of data loss measured in time.
For example, an RPO of 5 minutes means losing up to 5 minutes of data is acceptable if a failure occurs. Anything beyond that is a problem.
RPO directly influences how often you back up data and how replication is designed.

RTO (Recovery Time Objective) answers the question:
How long can the system be down?
It defines the maximum acceptable time it should take to restore the system after a failure.
For example, an RTO of 30 minutes means the system must be back online within 30 minutes of an outage.
RTO affects automation, failover strategies, and disaster recovery architecture.

(Diagram illustrating RPO and RTO on a failure timeline. Image credit: Stephane Maarek.)

Key questions:

  • How much data loss is acceptable?
  • How fast must recovery occur?

Common strategies include:

  • Database snapshots
  • File system backups
  • Object storage versioning
  • Backup frequency aligned with RPO requirements

Bonus: Disaster Recovery Strategies

(Diagram comparing disaster recovery strategies by cost and recovery speed. Image credit: Stephane Maarek.)

Disaster recovery approaches vary by cost and recovery speed.

  • Backup and Restore: lowest cost, slowest recovery
  • Pilot Light: minimal services running, faster recovery
  • Warm Standby: scaled-down but functional environment
  • Hot Site / Multi-Site: fastest recovery, highest cost

Choosing a strategy is a business decision informed by technical constraints.


Designing Systems That Expect Reality

Designing for failure is not about pessimism. It is about acknowledging how real systems behave under real conditions.
Each section in this article addresses a different failure category:

  • Single points of failure
  • Network unreliability
  • Failure chain reactions
  • Data consistency trade-offs
  • Recovery and self-healing

The common thread is intentional design.
Resilient systems are not accidental. They are built by teams who assume failure will happen, test for it deliberately, and learn continuously.
The goal is not zero failure.
The goal is controlled failure, fast recovery, and minimal impact on users.
That is what designing for failure truly means.
