TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 18

Netflix Unleashed a Monkey With a Weapon in Its Own Data Center — On Purpose

#devops #reliability #cloud #programming

2008 — database corruption, 3 days of darkness, entire DVD operation halted
2011 — Chaos Monkey deployed; instance-killing runs every business day in production
10+ members of the Simian Army — from instance kills (Chaos Monkey) to full region failures (Chaos Kong)
Business hours only — the essential design constraint that makes chaos safe and pedagogical
September 2014 — AWS reboots 10% of EC2 instances without warning; Netflix serves customers without interruption
Chaos Monkey spawned an entire engineering discipline now practiced at LinkedIn, Google, Amazon, and Twilio

It was 2011 and Netflix had just migrated hundreds of microservices to AWS. Their architecture was distributed, horizontally scaled, and theoretically fault-tolerant. But theory and production are different things. The only way to know if a system could survive failures was to cause failures — constantly, deliberately, during business hours, and in production. So they built a monkey.

The Story

The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables — all the while we continue serving our customers without interruption.

— Yury Izrailevsky & Ariel Tseitlin, The Netflix Simian Army, Netflix Tech Blog, July 19 2011

The origin of Chaos Monkey is not a clever engineering insight — it is a three-day disaster. In August 2008, Netflix was still primarily a DVD-by-mail business, running on vertically scaled servers in its own datacentres. A major database corruption took down the entire system. For three days, Netflix could not ship DVDs to its customers. It was a single point of failure (a component whose failure brings down the entire system — the exact opposite of a fault-tolerant distributed architecture) at the most basic level: one database, one failure mode, total outage.

Netflix's engineering leadership concluded that the only path forward was to move toward highly reliable, horizontally scalable, distributed systems in the cloud. They chose AWS. The migration presented a new problem: moving from a monolith with a small number of catastrophic failure points to a microservices architecture (a system design where an application is broken into many small, independently deployable services communicating over a network — improving scalability at the cost of increased distributed systems complexity) with hundreds of services, each potentially failing in its own unique way. The engineers designed graceful degradation: if recommendations failed, show popular titles instead; if search was slow, streaming should still work. They wrote the code, reviewed it, tested it in staging — and then realised they had no way to know if the fault tolerance actually worked without experiencing actual failures.
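
The graceful degradation described above can be sketched in a few lines. This is a minimal illustration, not Netflix's code: `get_home_rows`, `POPULAR_TITLES`, and the two stand-in fetch functions are hypothetical names invented for this example.

```python
from typing import Callable, List

# Hypothetical static fallback list, used when personalisation is unavailable.
POPULAR_TITLES = ["Title A", "Title B", "Title C"]


def get_home_rows(user_id: str,
                  fetch_recommendations: Callable[[str], List[str]]) -> List[str]:
    """Return personalised titles, or popular titles if the dependency fails."""
    try:
        return fetch_recommendations(user_id)
    except Exception:
        # A failing recommendations dependency degrades the page
        # instead of breaking it.
        return list(POPULAR_TITLES)


def broken_recommendations(user_id: str) -> List[str]:
    """Stand-in for a recommendations service that is timing out."""
    raise TimeoutError("recommendations service timed out")


def healthy_recommendations(user_id: str) -> List[str]:
    """Stand-in for a recommendations service that is working."""
    return [f"Picked for {user_id}"]


print(get_home_rows("u1", healthy_recommendations))
print(get_home_rows("u1", broken_recommendations))
```

The fallback branch is exactly the kind of code path that staging rarely exercises, which is why it needed real failures to prove it worked.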

The Core Insight: Fail Constantly

Netflix's founding philosophy for Chaos Engineering was radical in its simplicity: the best way to avoid failure is to fail constantly. If you only experience failures accidentally, in production, at 3am, your engineers have no muscle memory for responding to them and your systems have never been forced to prove their resilience claims. If you fail constantly, during business hours, with engineers present, your systems either prove they can recover or expose the gaps so engineers can fix them before those gaps become incidents.

Problem

August 2008: Database Corruption, Three Days of Darkness

Netflix's vertically scaled infrastructure suffered a major database corruption that halted DVD shipping for three days. The root cause was architectural: a single relational database instance, a single point of failure, no recovery path faster than manual intervention. The outage made the problem concrete: this architecture couldn't support Netflix's growth.

Cause

Distributed Systems Are Only Theoretically Resilient

Moving to hundreds of microservices on AWS solved the single-point-of-failure problem in theory, but it raised a new question: did the code actually implement the graceful degradation it was designed for? Staging environments couldn't answer this. Code review couldn't answer this. The only honest answer required production failures, which were exactly what Netflix was trying to avoid.

Solution

Chaos Monkey: Production Failure on a Schedule

Netflix built Chaos Monkey — a script that randomly terminates EC2 instances during business hours — and deployed it in all production environments. Engineers came in every day knowing Chaos Monkey was running, knowing their services might get an instance killed at any moment, and knowing they had to build recovery mechanisms or face a very bad afternoon. The tool made fault tolerance a daily engineering discipline, not a theoretical design principle.

Result

September 2014: AWS Reboots 10% of Its Servers. Netflix Shrugs.

On September 25 2014, AWS rebooted approximately 10% of its EC2 instances without warning. Netflix's systems handled it without customer impact. Netflix explicitly credited Chaos Monkey: the engineers had been building and proving recovery mechanisms every day for years. When AWS created an unplanned failure event at scale, Netflix's systems responded automatically, gracefully, and without requiring an emergency war room.

The Fix

Building a Fault-Tolerant Culture

The most important thing Chaos Monkey fixed was not a technical system — it was an organisational incentive. Before Chaos Monkey, engineers could ship theoretically fault-tolerant but practically fragile code without facing immediate consequences. The fragility only became visible during a real, unplanned outage — at which point it was someone else's problem. After Chaos Monkey, the consequences were immediate and personal: if your service didn't handle instance failures gracefully, Chaos Monkey would expose this during your working hours, while you were at your desk, with your team watching.

2011 — year Chaos Monkey publicly announced — three years after the 2008 database outage that triggered the AWS migration
10+ — members of the Simian Army at peak, each targeting a different failure category
Business hours — the scheduling constraint that made Chaos Monkey safe; failures during working hours with engineers present
September 2014 — the real-world validation: AWS reboots 10% of EC2 instances; Netflix serves customers without interruption

# Simplified version of what Chaos Monkey does
# Real implementation: originally Java, rebuilt in Go for v2.0 (2016)
# Runs continuously during configurable business hours

import random
import time
from datetime import datetime

class ChaosMonkey:
    def __init__(self, aws_client, excluded_clusters=None,
                 termination_interval_seconds=3600):
        # aws_client is a stand-in for a cloud API wrapper, not a real SDK object
        self.aws = aws_client
        self.excluded = excluded_clusters or []
        # Pause between termination passes (illustrative default: one hour)
        self.termination_interval_seconds = termination_interval_seconds

    def is_business_hours(self) -> bool:
        """Only run during business hours, when engineers are present.
        This is the key safety constraint of Chaos Monkey's original design."""
        now = datetime.now()
        return (
            now.weekday() < 5 and   # Monday–Friday
            9 <= now.hour < 17      # 9am–5pm local time
        )

    def run(self):
        while True:
            if self.is_business_hours():
                clusters = self.aws.get_all_clusters()

                for cluster in clusters:
                    if cluster.name in self.excluded:
                        continue

                    # Pick one instance at random from each cluster
                    instances = cluster.get_running_instances()
                    if not instances:
                        continue

                    victim = random.choice(instances)

                    # Terminate it. No warning. No coordination.
                    # If the system doesn't survive this, engineers know
                    # immediately and can fix it before it becomes a 3am incident.
                    self.aws.terminate_instance(victim.id)
                    print(f"[Chaos Monkey] Terminated {victim.id} "
                          f"in cluster {cluster.name}")

            # Wait before the next pass, whether or not anything was terminated
            time.sleep(self.termination_interval_seconds)

The Simian Army: Expanding Beyond Instance Kills

Chaos Monkey's success prompted an obvious follow-up question. If randomly killing instances built resilience to instance failures, what would it take to become resilient to other failure categories? Netflix announced the Simian Army in July 2011: a suite of failure-injection tools, each targeting a different failure class. Latency Monkey injected artificial delays to simulate network degradation. Conformity Monkey shut down instances not following engineering best practices. Doctor Monkey removed unhealthy instances from service. Janitor Monkey cleaned up unused cloud resources. Chaos Gorilla simulated the complete failure of an entire AWS availability zone. Above all of these sat Chaos Kong, which simulated the complete failure of an entire AWS region.
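
To make the latency-injection idea concrete, here is a minimal sketch of a wrapper that delays a fraction of calls. It is an illustration under assumptions, not Latency Monkey's implementation: `with_injected_latency` and its parameters are hypothetical, and the random source and sleep function are injectable so the behaviour can be checked without actually waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_injected_latency(call: Callable[..., T],
                          delay_seconds: float,
                          probability: float,
                          rng: Callable[[], float] = random.random,
                          sleep: Callable[[float], None] = time.sleep) -> Callable[..., T]:
    """Wrap `call` so that a fraction of invocations is artificially delayed."""
    def wrapped(*args, **kwargs):
        # Delay this invocation with the configured probability,
        # then run the real call unchanged.
        if rng() < probability:
            sleep(delay_seconds)
        return call(*args, **kwargs)
    return wrapped


# Usage: record the delays instead of sleeping, so the example runs instantly.
recorded_delays = []
slow_double = with_injected_latency(lambda x: x * 2, delay_seconds=0.5,
                                    probability=1.0, sleep=recorded_delays.append)
print(slow_double(3), recorded_delays)
```

A caller with a sensible timeout and fallback should tolerate the delayed calls; one without them will surface the gap quickly.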

Failure Injection Testing (FIT): the evolution beyond instance kills

In 2014, Netflix engineers (including Kolton Andrus, who later co-founded Gremlin) introduced FIT — Failure Injection Testing. Where Chaos Monkey operated at the infrastructure level (kill an EC2 instance), FIT operated at the application level: injecting failure metadata through Zuul (Netflix's edge proxy handling all requests from devices to backend services) to simulate specific service failures with surgical precision. FIT could say "for this specific user's request, pretend the recommendations service is timing out" without actually degrading the recommendations service for everyone. This precision made chaos experiments far more targeted and safer to run continuously — and became the pattern that tools like Gremlin later commercialised.
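
The request-scoped idea behind FIT can be sketched as follows. This is a simplified illustration, not the FIT or Zuul API: `Request`, `failed_services`, `call_service`, and `InjectedFailure` are hypothetical names, and the edge proxy is assumed to have attached the failure metadata before the request reaches this code.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Request:
    user_id: str
    # Hypothetical failure metadata attached at the edge:
    # services that should appear to fail for this request only.
    failed_services: FrozenSet[str] = frozenset()


class InjectedFailure(Exception):
    """Raised in place of a real call when the request asks for a failure."""


def call_service(request: Request, service_name: str,
                 handler: Callable[[Request], List[str]]) -> List[str]:
    # The real service is never touched for tagged requests,
    # so other users are unaffected.
    if service_name in request.failed_services:
        raise InjectedFailure(f"{service_name} failed by injection")
    return handler(request)


def recommendations(request: Request) -> List[str]:
    return [f"for-{request.user_id}"]


def home_page(request: Request) -> List[str]:
    try:
        return call_service(request, "recommendations", recommendations)
    except InjectedFailure:
        return ["popular-1", "popular-2"]  # fallback under test


normal = Request("alice")
experiment = Request("bob", frozenset({"recommendations"}))
print(home_page(normal), home_page(experiment))
```

Only the tagged request sees the failure, which is what keeps the blast radius of the experiment to a single user.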

Chaos Monkey 2.0: open-sourced and rebuilt in Go

Chaos Monkey was open-sourced in 2012 and rebuilt in 2016 as version 2.0. The new version was written in Go, depended on Spinnaker as its deployment platform, and scheduled terminations by mean time between terminations (rather than probabilistic scheduling) for more predictable test coverage. Version 2.0 added Trackers: Go objects that report instance terminations to external monitoring systems, enabling downstream correlation of Chaos Monkey events with application metrics and alerts. The Spinnaker dependency became a significant constraint: teams unwilling to adopt Spinnaker found Chaos Monkey 2.0 inaccessible, which opened market space for alternatives like Gremlin.
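
One way to read a mean-time setting is as a per-workday chance of 1/N: the operator thinks in "workdays between terminations" and the tool derives the daily odds. The sketch below illustrates that arithmetic only. It is an assumption made for this example, not code taken from Chaos Monkey, and all function names are hypothetical.

```python
import random
from typing import Callable


def daily_termination_probability(mean_workdays_between_kills: float) -> float:
    """Convert a mean gap of N workdays into a per-workday probability of 1/N."""
    if mean_workdays_between_kills < 1:
        raise ValueError("mean must be at least one workday")
    return 1.0 / mean_workdays_between_kills


def should_terminate_today(mean_workdays_between_kills: float,
                           rng: Callable[[], float] = random.random) -> bool:
    """Decide, once per workday, whether this group loses an instance today."""
    return rng() < daily_termination_probability(mean_workdays_between_kills)


def simulate_mean_gap(mean_workdays_between_kills: float,
                      days: int, seed: int = 0) -> float:
    """Simulate many workdays and report the observed average gap between kills."""
    rng = random.Random(seed)
    kills = sum(
        1 for _ in range(days)
        if should_terminate_today(mean_workdays_between_kills, rng.random)
    )
    return days / kills if kills else float("inf")


print(daily_termination_probability(5))
print(simulate_mean_gap(5, days=100_000))
```

Over a long run the observed gap settles close to the configured mean, which is the property that makes coverage easier to reason about than a raw probability.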

Architecture

Netflix's architecture in 2011 was organised around a principle that Chaos Monkey enforced: every service must be independently deployable, independently scalable, and independently recoverable. The microservices were connected through REST APIs, each service maintaining its own data store and exposing a versioned interface to its consumers. Chaos Monkey operated at the EC2 instance layer. When an instance was terminated, the load balancer in front of that cluster detected the unhealthy instance and stopped routing traffic to it. If the cluster had sufficient redundancy, other instances absorbed the traffic without degradation. If not, the service degraded — and the engineers learned something they needed to know.
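
Whether a cluster has "sufficient redundancy" is, at its simplest, a capacity calculation. The sketch below assumes traffic is spread evenly across instances and that each instance has a fixed capacity; both are simplifications, and the function names are hypothetical.

```python
import math


def survives_instance_loss(instance_count: int,
                           capacity_per_instance: float,
                           peak_load: float,
                           instances_lost: int = 1) -> bool:
    """True if the remaining instances can still carry peak load."""
    remaining = instance_count - instances_lost
    if remaining <= 0:
        return False
    return remaining * capacity_per_instance >= peak_load


def min_instances_for(peak_load: float,
                      capacity_per_instance: float,
                      instances_lost: int = 1) -> int:
    """Smallest cluster that still carries peak load after losing instances."""
    return math.ceil(peak_load / capacity_per_instance) + instances_lost


# Peak load of 300 requests/s with instances that each handle 100 requests/s.
print(survives_instance_loss(4, 100, 300))  # one spare instance
print(survives_instance_loss(3, 100, 300))  # no headroom
print(min_instances_for(300, 100))
```

A cluster sized exactly to peak load fails this check, and a single Chaos Monkey termination during busy hours will show it.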

Diagram: The Simian Army: Failure Coverage Across Infrastructure Layers (interactive version available on TechLogStack)

Diagram: How Netflix's Architecture Handles Chaos Monkey Instance Loss (interactive version available on TechLogStack)

The Behavioural Economics of Chaos Engineering

Chaos Monkey's deepest contribution to Netflix's culture was aligning incentives. Without it, the cost of fragile code was paid by whoever happened to be on-call when a real failure occurred — often not the engineer who wrote the fragile code. With Chaos Monkey, the cost was paid immediately and visibly by the team whose service broke. Engineers who experienced a Chaos Monkey failure during business hours had a powerful motivator to invest in proper fault tolerance: they didn't want to experience it again. This is DevOps incentive design at its finest — not policy mandates, but a system where the right behaviour is the path of least resistance.

What Chaos Monkey doesn't test

Chaos Monkey's instance-termination model is powerful but deliberately narrow. It does not test network partitions (instances visible but unreachable), latency degradation (Latency Monkey's job), data corruption, or slow memory leaks that cause gradual performance degradation over hours. Chaos Monkey's successors in the Simian Army, and later tools like Gremlin, were created to cover these gaps. The original insight — failing constantly builds resilience — generalises to all failure types, but the specific mechanism must match the specific failure mode being tested. A chaos engineering programme that only kills instances is missing most of the failure surface.

Lessons

Designing for fault tolerance is not the same as having fault tolerance. Netflix's engineers wrote graceful degradation code. Chaos Monkey tested whether it actually worked. Until production failure exercises the code path, you don't know whether your fault tolerance design survived contact with reality. Chaos Monkey converts theoretical resilience into empirical evidence.
Chaos Engineering (deliberately injecting controlled failures into production systems to proactively expose resilience gaps before they become unplanned outages) must be practised during business hours with humans present. The purpose is learning, not destruction. Chaos experiments run at 3am, when no one is available to respond, create exactly the incidents that chaos engineering is supposed to prevent.
Align incentives with the behaviour you want. Chaos Monkey made the cost of fragile code immediate and personal — the engineer whose service broke during business hours paid the cost of fixing it right then. Without this alignment, resilience engineering is aspirational. With it, resilience engineering is survival instinct.
The blast radius (the scope of impact when a single component fails) of individual failures is only measurable through testing. A microservices architecture where every service failure cascades to every other service provides less reliability than a monolith, not more. Chaos Monkey surfaces these cascade dependencies so they can be eliminated before a real failure exposes them at scale.
Start at the instance level and escalate gradually. Netflix began with Chaos Monkey (instances), expanded to Chaos Gorilla (availability zones), then Chaos Kong (regions). Each level was only attempted after the previous level produced a stable, confident result. Expand scope only when you're confident you've solved the current scope.
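
The blast radius lesson can be illustrated on paper with a dependency graph, although only real failure injection shows whether the graph matches reality. The sketch below assumes every listed dependency is a hard one with no fallback; the service names and the `blast_radius` function are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def blast_radius(depends_on: Dict[str, List[str]], failed: str) -> Set[str]:
    """Return every service that transitively depends on the failed one."""
    # Invert the graph: for each service, who depends on it?
    dependents: Dict[str, List[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(service)

    # Breadth-first walk outward from the failed service.
    affected: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service != failed and service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


services = {
    "home": ["recommendations", "search"],
    "recommendations": ["ratings"],
    "search": [],
    "ratings": [],
    "playback": [],
}
print(sorted(blast_radius(services, "ratings")))
print(sorted(blast_radius(services, "playback")))
```

Each fallback added to a caller removes an edge from this graph, which is how graceful degradation shrinks the blast radius.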

Engineering Glossary

Blast radius — the scope of impact when a single component fails. Chaos engineering is designed to continuously measure and minimise blast radius by forcing service-level isolation. A microservices architecture where every service failure cascades to all others has a blast radius equivalent to a monolith.

Chaos Engineering — the discipline of deliberately injecting controlled failures into production systems during business hours, with engineers present, in order to proactively expose resilience gaps before they become unplanned outages. Formalised as a named discipline in the 2015 Principles of Chaos Engineering document by Netflix's Casey Rosenthal.

Chaos Kong — the most extreme Simian Army tool, simulating the complete failure of an entire AWS region. Built after Netflix had proven resilience to instance failures (Chaos Monkey) and AZ failures (Chaos Gorilla). Tests active-active multi-region deployment under full regional failure conditions.

FIT (Failure Injection Testing) — a Netflix evolution beyond Chaos Monkey that operates at the application layer rather than the infrastructure layer, injecting failure metadata through Zuul to simulate specific service failures for specific users without degrading the service for everyone.

Microservices architecture — a system design where an application is broken into many small, independently deployable services communicating over a network. Improves scalability and team autonomy at the cost of increased distributed systems complexity and new categories of failure.

Rambo Architecture — Netflix's internal term for the design philosophy Chaos Monkey enforced: each service must be able to succeed no matter what, even on its own. If a dependent service is down, handle it gracefully. Every service is both a potential failure source and a potential victim of failures, and must be designed for both roles simultaneously.

Simian Army — the suite of failure-injection and resilience-verification tools Netflix built following Chaos Monkey's success, each targeting a different failure class: Latency Monkey (network degradation), Conformity Monkey (best practice enforcement), Doctor Monkey (health checks), Janitor Monkey (resource cleanup), Chaos Gorilla (AZ failure), Chaos Kong (region failure).

Single point of failure — a component whose failure causes the entire system to stop working. The 2008 database corruption that triggered Netflix's cloud migration was a single point of failure at the most basic level. Eliminating single points of failure through distributed architecture is the goal; Chaos Monkey tests whether that goal was actually achieved.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.