- 2008 — database corruption, 3 days of darkness, entire DVD operation halted
- 2011 — Chaos Monkey deployed; instance-killing runs every business day in production
- 10+ members of the Simian Army — from instance kills (Chaos Monkey) to full region failures (Chaos Kong)
- Business hours only — the essential design constraint that makes chaos safe and pedagogical
- September 2014 — AWS reboots 10% of EC2 instances without warning; Netflix serves customers without interruption
- Chaos Monkey spawned an entire engineering discipline now practiced at LinkedIn, Google, Amazon, and Twilio
It was 2011 and Netflix had just migrated hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. The only way to know if a system could survive failures was to cause failures — constantly, deliberately, during business hours, and in production. So they built a monkey.
The Story
The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.
— Yury Izrailevsky & Ariel Tseitlin, The Netflix Simian Army, Netflix Tech Blog, July 19 2011
The origin of Chaos Monkey is not a clever engineering insight — it is a three-day disaster. In August 2008, Netflix was still primarily a DVD-by-mail business, running on vertically scaled servers in its own datacentres. A major database corruption took down the entire system. For three days, Netflix could not ship DVDs to its customers. It was a single point of failure (a component whose failure brings down the entire system — the exact opposite of a fault-tolerant distributed architecture) at the most basic level: one database, one failure mode, total outage.
Netflix's engineering leadership concluded that the only path forward was to move toward highly reliable, horizontally scalable, distributed systems in the cloud. They chose AWS. The migration presented a new problem. Netflix was moving from a monolith with a small number of catastrophic failure points to a microservices architecture: an application broken into many small, independently deployable services that communicate over a network, which improves scalability at the cost of added distributed-systems complexity. With hundreds of services, each could fail in its own way. The engineers designed graceful degradation: if recommendations failed, show popular titles instead; if search was slow, streaming should still work. They wrote the code, reviewed it, and tested it in staging, and then realised they had no way to know whether the fault tolerance actually worked without experiencing real failures.
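The graceful degradation described above can be sketched in a few lines. This is a minimal illustration, not Netflix's code: `get_home_rows`, `POPULAR_TITLES`, and the two fake recommendation functions are hypothetical names invented for the example.

```python
from typing import Callable, List

# Hypothetical static fallback; a real system would more likely serve a cached list.
POPULAR_TITLES: List[str] = ["Popular Title A", "Popular Title B", "Popular Title C"]


def get_home_rows(user_id: str,
                  fetch_recommendations: Callable[[str], List[str]]) -> List[str]:
    """Return personalised titles, or popular titles if the dependency fails."""
    try:
        return fetch_recommendations(user_id)
    except Exception:
        # Degrade instead of failing the whole page.
        return POPULAR_TITLES


def healthy_recommendations(user_id: str) -> List[str]:
    """Stand-in for a working recommendations service."""
    return [f"Pick for {user_id}"]


def broken_recommendations(user_id: str) -> List[str]:
    """Stand-in for the recommendations service being unreachable."""
    raise ConnectionError("recommendations service unavailable")
```

Calling `get_home_rows("u1", broken_recommendations)` returns the popular titles rather than raising. The `except` branch only runs when the dependency really fails, which is the code path that staging rarely exercises.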
Problem
August 2008: Database Corruption, Three Days of Darkness
Netflix's vertically scaled infrastructure suffered a major database corruption that halted DVD shipping for three days. The root cause was architectural: a single relational database instance, a single point of failure, no recovery path faster than manual intervention. The outage made the problem concrete: this architecture couldn't support Netflix's growth.
Cause
Distributed Systems Are Only Theoretically Resilient
Moving to hundreds of microservices on AWS solved the single-point-of-failure problem in theory, but it raised a new question: did the code actually implement the graceful degradation it was designed for? Staging environments couldn't answer this. Code review couldn't answer this. The only honest answer required production failures, and those were the very thing Netflix was trying to avoid.
Solution
Chaos Monkey: Production Failure on a Schedule
Netflix built Chaos Monkey — a script that randomly terminates EC2 instances during business hours — and deployed it in all production environments. Engineers came in every day knowing Chaos Monkey was running, knowing their services might get an instance killed at any moment, and knowing they had to build recovery mechanisms or face a very bad afternoon. The tool made fault tolerance a daily engineering discipline, not a theoretical design principle.
Result
September 2014: AWS Reboots 10% of Its Servers. Netflix Shrugs.
On September 25 2014, AWS rebooted approximately 10% of its EC2 instances without warning. Netflix's systems handled it without customer impact. Netflix explicitly credited Chaos Monkey: the engineers had been building and proving recovery mechanisms every day for years. When AWS created an unplanned failure event at scale, Netflix's systems responded automatically, gracefully, and without requiring an emergency war room.
The Fix
Building a Fault-Tolerant Culture
The most important thing Chaos Monkey fixed was not a technical system — it was an organisational incentive. Before Chaos Monkey, engineers could ship theoretically fault-tolerant but practically fragile code without facing immediate consequences. The fragility only became visible during a real, unplanned outage — at which point it was someone else's problem. After Chaos Monkey, the consequences were immediate and personal: if your service didn't handle instance failures gracefully, Chaos Monkey would expose this during your working hours, while you were at your desk, with your team watching.
- 2011 — year Chaos Monkey publicly announced — three years after the 2008 database outage that triggered the AWS migration
- 10+ — members of the Simian Army at peak, each targeting a different failure category
- Business hours — the scheduling constraint that made Chaos Monkey safe; failures during working hours with engineers present
- September 2014 — the real-world validation: AWS reboots 10% of EC2 instances; Netflix serves customers without interruption
```python
# Simplified version of what Chaos Monkey does.
# Real implementation: originally Java, rebuilt in Go for v2.0 (2016).
# Runs continuously during configurable business hours.
# `aws_client` is a hypothetical wrapper object, not a real SDK client.
import random
import time
from datetime import datetime


class ChaosMonkey:
    def __init__(self, aws_client, excluded_clusters=None,
                 termination_interval_seconds=3600):
        self.aws = aws_client
        self.excluded = excluded_clusters or []
        # Seconds to wait between passes (illustrative default).
        self.termination_interval_seconds = termination_interval_seconds

    def is_business_hours(self) -> bool:
        """Only run during business hours, when engineers are present.

        This is the key safety constraint of Chaos Monkey's original design.
        """
        now = datetime.now()
        return (
            now.weekday() < 5 and  # Monday to Friday
            9 <= now.hour < 17     # 9am to 5pm local time
        )

    def run(self):
        while True:
            if self.is_business_hours():
                clusters = self.aws.get_all_clusters()
                for cluster in clusters:
                    if cluster.name in self.excluded:
                        continue
                    # Pick one instance at random from each cluster.
                    instances = cluster.get_running_instances()
                    if not instances:
                        continue
                    victim = random.choice(instances)
                    # Terminate it. No warning. No coordination.
                    # If the system doesn't survive this, engineers know
                    # immediately and fix it before it becomes a 3am incident.
                    self.aws.terminate_instance(victim.id)
                    print(f"[Chaos Monkey] Terminated {victim.id} "
                          f"in cluster {cluster.name}")
            # Wait before the next pass, inside or outside business hours.
            time.sleep(self.termination_interval_seconds)
```
Failure Injection Testing (FIT): the evolution beyond instance kills
In 2014, Netflix engineers (including Kolton Andrus, who later co-founded Gremlin) introduced FIT — Failure Injection Testing. Where Chaos Monkey operated at the infrastructure level (kill an EC2 instance), FIT operated at the application level: injecting failure metadata through Zuul (Netflix's edge proxy handling all requests from devices to backend services) to simulate specific service failures with surgical precision. FIT could say "for this specific user's request, pretend the recommendations service is timing out" without actually degrading the recommendations service for everyone. This precision made chaos experiments far more targeted and safer to run continuously — and became the pattern that tools like Gremlin later commercialised.
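The request-scoped idea behind FIT can be sketched as follows. This is an illustrative model only: the header name `x-inject-failure`, the `Request` class, and the `edge_proxy` and `call_service` functions are assumptions made for the example, not Zuul's or FIT's real interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical header name used only for this sketch.
FAILURE_HEADER = "x-inject-failure"


@dataclass
class Request:
    user_id: str
    headers: Dict[str, str] = field(default_factory=dict)


class InjectedFailure(Exception):
    """Raised when the request carries a failure instruction for a service."""


def edge_proxy(request: Request, experiment: Dict[str, str]) -> Request:
    """Tag requests from selected users with the service that should 'fail'."""
    target = experiment.get(request.user_id)
    if target is not None:
        request.headers[FAILURE_HEADER] = target
    return request


def call_service(name: str, request: Request) -> str:
    """Injection point: fail only if this request is tagged for `name`."""
    if request.headers.get(FAILURE_HEADER) == name:
        raise InjectedFailure(f"simulated timeout calling {name}")
    return f"{name}: ok"


def home_page(request: Request) -> List[str]:
    """Build a page from two dependencies, degrading if recommendations fail."""
    try:
        recs = call_service("recommendations", request)
    except InjectedFailure:
        recs = "recommendations: fallback to popular titles"
    return [recs, call_service("playback", request)]
```

With `experiment = {"test-user": "recommendations"}`, only requests from `test-user` see the simulated failure. Every other request passes through untouched, which is what keeps the experiment's scope small.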
Chaos Monkey 2.0: open-sourced and rebuilt in Go
Chaos Monkey was open-sourced in 2012 and rebuilt in 2016 as version 2.0. The new version was written in Go, depended on Spinnaker as its deployment platform, and scheduled kills by a configured mean time between terminations (rather than a raw probability setting) for more predictable test coverage. Version 2.0 also added Trackers: Go objects that report instance terminations to external monitoring systems, so Chaos Monkey events can be correlated with application metrics and alerts. The Spinnaker dependency became a significant constraint: teams unwilling to adopt Spinnaker found Chaos Monkey 2.0 inaccessible, which opened market space for alternatives like Gremlin.
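One way to reason about a mean-time-between-terminations setting is to translate it into a chance of termination per work day. The sketch below assumes a deliberately simple model, an independent daily draw with probability 1/mean, and checks it by simulation. It is an illustration of the arithmetic, not Chaos Monkey's actual scheduler, and both function names are hypothetical.

```python
import random
from typing import List


def daily_kill_probability(mean_work_days_between_kills: float) -> float:
    """Chance of a termination on any given work day for one group."""
    if mean_work_days_between_kills < 1:
        raise ValueError("mean must be at least one work day")
    return 1.0 / mean_work_days_between_kills


def simulate_gaps(mean_days: float, work_days: int, seed: int = 0) -> List[int]:
    """Draw once per work day and return the gaps (in days) between kills."""
    rng = random.Random(seed)
    p = daily_kill_probability(mean_days)
    gaps: List[int] = []
    since_last = 0
    for _ in range(work_days):
        since_last += 1
        if rng.random() < p:
            gaps.append(since_last)
            since_last = 0
    return gaps
```

Under this model a mean of 5 work days corresponds to a 20% chance per day, and the average simulated gap settles near 5 over a long run. Expressing the setting as a mean is easier for a service owner to read than a bare probability.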
Architecture
Netflix's architecture in 2011 was organised around a principle that Chaos Monkey enforced: every service must be independently deployable, independently scalable, and independently recoverable. The microservices were connected through REST APIs, each service maintaining its own data store and exposing a versioned interface to its consumers. Chaos Monkey operated at the EC2 instance layer. When an instance was terminated, the load balancer in front of that cluster detected the unhealthy instance and stopped routing traffic to it. If the cluster had sufficient redundancy, other instances absorbed the traffic without degradation. If not, the service degraded — and the engineers learned something they needed to know.
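The routing behaviour described above can be modelled with a toy load balancer. This is a sketch under simplifying assumptions: health is a boolean flag rather than a real health-check probe, and `Instance`, `LoadBalancer`, and `terminate` are hypothetical names, not AWS APIs.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


class LoadBalancer:
    """Round-robin balancer that skips instances failing their health check."""

    def __init__(self, instances: List[Instance]):
        self.instances = instances
        self._ring = cycle(instances)

    def route(self) -> str:
        """Return the ID of the next healthy instance."""
        # Try each instance at most once per request.
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate.healthy:
                return candidate.instance_id
        raise RuntimeError("no healthy instances: the service is down")


def terminate(instance: Instance) -> None:
    """Stand-in for a Chaos Monkey kill: the instance stops passing health checks."""
    instance.healthy = False
```

With three instances and one terminated, `route()` keeps returning the two survivors. Terminate all three and it raises, which is the "not enough redundancy" case the paragraph describes.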
Diagram: The Simian Army: Failure Coverage Across Infrastructure Layers (interactive version on TechLogStack)
Diagram: How Netflix's Architecture Handles Chaos Monkey Instance Loss (interactive version on TechLogStack)
What Chaos Monkey doesn't test
Chaos Monkey's instance-termination model is powerful but deliberately narrow. It does not test network partitions (instances visible but unreachable), latency degradation (Latency Monkey's job), data corruption, or slow memory leaks that cause gradual performance degradation over hours. Chaos Monkey's successors in the Simian Army, and later tools like Gremlin, were created to cover these gaps. The original insight — failing constantly builds resilience — generalises to all failure types, but the specific mechanism must match the specific failure mode being tested. A chaos engineering programme that only kills instances is missing most of the failure surface.
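To illustrate why the mechanism has to match the failure mode, here is a minimal latency-injection sketch. A slow dependency exercises the caller's timeout path, which killing an instance never touches. The helper names (`with_injected_latency`, `call_with_timeout`, `search`) are hypothetical and this is not how Latency Monkey was implemented.

```python
import concurrent.futures
import time
from typing import Callable


def with_injected_latency(fn: Callable[[], str], delay_s: float) -> Callable[[], str]:
    """Wrap a dependency call so it responds slowly instead of failing outright."""
    def slow() -> str:
        time.sleep(delay_s)
        return fn()
    return slow


def call_with_timeout(fn: Callable[[], str], timeout_s: float, fallback: str) -> str:
    """Caller-side protection: give up after `timeout_s` and use the fallback."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn).result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return fallback
    finally:
        # Do not block on the slow call; let the worker finish in the background.
        pool.shutdown(wait=False)


def search() -> str:
    """Stand-in for a healthy search dependency."""
    return "search results"
```

Calling `call_with_timeout(with_injected_latency(search, 0.3), 0.05, "cached results")` returns the fallback, while the unwrapped `search` returns normally. A caller without a timeout would simply hang for the injected delay, and that is the gap this kind of experiment is meant to expose.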
Lessons
Designing for fault tolerance is not the same as having fault tolerance. Netflix's engineers wrote graceful degradation code. Chaos Monkey tested whether it actually worked. Until production failure exercises the code path, you don't know whether your fault tolerance design survived contact with reality. Chaos Monkey converts theoretical resilience into empirical evidence.
Chaos engineering must be practised during business hours with humans present. The discipline means deliberately injecting controlled failures into production systems to expose resilience gaps before they become unplanned outages. The purpose is learning, not destruction. Chaos experiments run at 3am, when no one is available to respond, create exactly the incidents that chaos engineering is supposed to prevent.
Align incentives with the behaviour you want. Chaos Monkey made the cost of fragile code immediate and personal — the engineer whose service broke during business hours paid the cost of fixing it right then. Without this alignment, resilience engineering is aspirational. With it, resilience engineering is survival instinct.
Blast radius, the scope of impact when a single component fails, is only measurable through testing. A microservices architecture where every service failure cascades to every other provides less reliability than a monolith, not more. Chaos Monkey surfaces these cascade dependencies so they can be eliminated before a real failure exposes them at scale.
Start at the instance level and escalate gradually. Netflix began with Chaos Monkey (instances), expanded to Chaos Gorilla (availability zones), then Chaos Kong (regions). Each level was only attempted after the previous level produced a stable, confident result. Expand scope only when you're confident you've solved the current scope.
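The blast-radius lesson above can be made concrete with a small dependency graph. The sketch assumes a hypothetical map of hard dependencies (services a caller cannot work without) and computes everything that goes down when one service fails. The service names and `HARD_DEPS` are invented for illustration.

```python
from typing import Dict, Set

# Hypothetical dependency map: service -> services it cannot work without.
HARD_DEPS: Dict[str, Set[str]] = {
    "home_page": {"playback_api"},
    "playback_api": {"licensing"},
    "licensing": set(),
    "recommendations": set(),
    "search": {"recommendations"},
}


def blast_radius(failed: str, hard_deps: Dict[str, Set[str]]) -> Set[str]:
    """Return every service that stops working when `failed` goes down."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for service, deps in hard_deps.items():
            # A service is affected if any hard dependency is already affected.
            if service not in affected and deps & affected:
                affected.add(service)
                changed = True
    return affected
```

In this map, `home_page` does not list `recommendations` as a hard dependency because it is assumed to fall back gracefully, so a recommendations failure reaches only `search`. Turning a hard dependency into a soft one with a fallback is how the blast radius shrinks, and a chaos experiment is what confirms the map matches reality.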
Engineering Glossary
Blast radius — the scope of impact when a single component fails. Chaos engineering is designed to continuously measure and minimise blast radius by forcing service-level isolation. A microservices architecture where every service failure cascades to all others has a blast radius equivalent to a monolith.
Chaos Engineering — the discipline of deliberately injecting controlled failures into production systems during business hours, with engineers present, in order to proactively expose resilience gaps before they become unplanned outages. Formalised as a named discipline in the 2015 Principles of Chaos Engineering document by Netflix's Casey Rosenthal.
Chaos Kong — the most extreme Simian Army tool, simulating the complete failure of an entire AWS region. Built after Netflix had proven resilience to instance failures (Chaos Monkey) and AZ failures (Chaos Gorilla). Tests active-active multi-region deployment under full regional failure conditions.
FIT (Failure Injection Testing) — a Netflix evolution beyond Chaos Monkey that operates at the application layer rather than the infrastructure layer, injecting failure metadata through Zuul to simulate specific service failures for specific users without degrading the service for everyone.
Microservices architecture — a system design where an application is broken into many small, independently deployable services communicating over a network. Improves scalability and team autonomy at the cost of increased distributed systems complexity and new categories of failure.
Rambo Architecture — Netflix's internal term for the design philosophy Chaos Monkey enforced: each service must be able to succeed no matter what, even on its own. If a dependent service is down, handle it gracefully. Every service is both a potential failure source and a potential victim of failures, and must be designed for both roles simultaneously.
Simian Army — the suite of failure-injection and resilience-verification tools Netflix built following Chaos Monkey's success, each targeting a different failure class: Latency Monkey (network degradation), Conformity Monkey (best practice enforcement), Doctor Monkey (health checks), Janitor Monkey (resource cleanup), Chaos Gorilla (AZ failure), Chaos Kong (region failure).
Single point of failure — a component whose failure causes the entire system to stop working. The 2008 database corruption that triggered Netflix's cloud migration was a single point of failure at the most basic level. Eliminating single points of failure through distributed architecture is the goal; Chaos Monkey tests whether that goal was actually achieved.
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack →
(Interactive diagrams, source links, and the full reader experience)