Year: 2008 · Crisis: DVD-to-streaming transition and single point of failure
The Problem: A 3-Day Outage
In August 2008, a database corruption hit Netflix's DVD distribution system, and the company couldn't ship DVDs to customers for 3 days. Netflix CEO Reed Hastings slammed his fist on the table:
"If the entire company goes down because one database crashes, we're not a technology company. We're a DVD rental shop."
At the time, Netflix was a single monolithic Java application: monolith.war. All business logic — orders, billing, catalog, user management — lived in a single WAR file. Every new feature required a full system redeploy.
Architectural Decision: Migration to Microservices (2008–2016)
Netflix spent 8 years transforming their monolith into 700+ microservices. But they didn't do it as a "Big Bang Rewrite" — they used the Strangler Fig Pattern:
- New features were written in new microservices first.
- Old features were gradually extracted from the monolith and moved to microservices.
- End users never noticed the transition.
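The routing idea behind the Strangler Fig pattern can be sketched in a few lines. This is a minimal illustration, not Netflix's actual edge layer: the class name StranglerRouter, the path prefixes, and the string-based handlers are all assumptions made for the example. A facade sits in front of everything, sends already-migrated paths to the new services, and lets everything else fall through to the monolith.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical facade: routes each request to an extracted service or to the legacy monolith. */
final class StranglerRouter {
    // Path prefix -> handler for functionality already extracted from the monolith.
    private final Map<String, Function<String, String>> migrated = new LinkedHashMap<>();
    private final Function<String, String> monolith;

    StranglerRouter(Function<String, String> monolith) {
        this.monolith = monolith;
    }

    /** Registers a new service as the owner of every path under the given prefix. */
    void migrate(String pathPrefix, Function<String, String> service) {
        migrated.put(pathPrefix, service);
    }

    /** Dispatches to the first matching migrated prefix, otherwise falls through to the monolith. */
    String handle(String path) {
        for (Map.Entry<String, Function<String, String>> entry : migrated.entrySet()) {
            if (path.startsWith(entry.getKey())) {
                return entry.getValue().apply(path);
            }
        }
        return monolith.apply(path);
    }
}

final class StranglerDemo {
    public static void main(String[] args) {
        StranglerRouter router = new StranglerRouter(path -> "monolith:" + path);
        router.migrate("/billing", path -> "billing-service:" + path);

        System.out.println(router.handle("/billing/invoices")); // billing-service:/billing/invoices
        System.out.println(router.handle("/catalog/42"));       // monolith:/catalog/42
    }
}
```

Each extraction is then a one-line routing change, and callers keep using the same URLs, which is why end users don't notice the transition.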
By early 2016, the last pieces still outside the new architecture, billing and customer data management, had been migrated. Netflix was now fully microservice-based.
Chaos Engineering: The Simian Army
But microservices brought a new problem: in a system with 700+ services, something will break every single day. At that scale, failure is a statistical certainty, not an edge case.
Netflix's answer was radical: embrace the chaos.
In 2011, Chaos Monkey was born. This bot randomly killed servers in the production environment. Yes, you read that right: in production, serving real users.
// Chaos Monkey: a bot that randomly kills production instances
// NOTE: This code is illustrative. awsClient, Filters, Instance, and slack are
// hypothetical helpers, and the real Chaos Monkey (originally part of
// Netflix's Simian Army, now a standalone tool integrated with Spinnaker)
// works differently. But the core logic is the same: pick a random instance
// and kill it.
import java.util.List;
import java.util.Random;

public class ChaosMonkey {
    private final Random random = new Random();

    @Scheduled(cron = "0 0 9-15 * * MON-FRI") // On the hour, 9am-3pm, weekdays
    public void terminateRandomInstance() {
        List<Instance> instances = awsClient.describeInstances(
            Filters.builder()
                .name("tag:Environment").values("production")
                .name("instance-state-name").values("running")
                .build()
        );
        if (instances.isEmpty()) {
            return; // Nothing running, nothing to kill
        }
        Instance victim = instances.get(random.nextInt(instances.size()));
        awsClient.terminateInstance(victim.getId());
        slack.notify(":monkey: Killed instance " + victim.getId() + "!");
    }
}
Why? Because if engineers know their servers can die at any moment, they write resilient systems. Circuit breakers, retries, fallbacks, timeouts — these are no longer "nice to have," they're mandatory.
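Of the techniques just listed, the retry is the simplest to show. The following is a minimal sketch of retrying with exponential backoff, assuming a hypothetical helper named Retry.withBackoff; it is not taken from any Netflix library, and production code would usually add jitter and a cap on the delay.

```java
import java.util.function.Supplier;

/** Hypothetical helper: retries a failing call, doubling the pause after each failure. */
final class Retry {
    private Retry() {}

    /**
     * Calls the operation up to maxAttempts times.
     * Rethrows the last exception if every attempt fails.
     */
    static <T> T withBackoff(Supplier<T> operation, int maxAttempts, long initialDelayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return operation.get();
            } catch (RuntimeException e) {
                last = e; // remember the failure and try again
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(delay);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
                delay *= 2; // exponential backoff
            }
        }
        throw last;
    }
}

final class RetryDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated dependency that fails twice, then recovers.
        String result = Retry.withBackoff(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("service unavailable");
            }
            return "recommendations";
        }, 5, 100);
        System.out.println(result + " after " + calls[0] + " attempts"); // recommendations after 3 attempts
    }
}
```

Retries only help with short-lived failures. When a dependency stays down, repeated retries add load to it, which is where circuit breakers and fallbacks come in.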
After Chaos Monkey came the Simian Army:
- Latency Monkey: Injects delays into services
- Conformity Monkey: Finds instances that don't follow best practices and shuts them down
- Chaos Kong: Takes down an entire AWS region (once a year)
- Chaos Gorilla: Takes down an Availability Zone
Netflix's Hystrix Library
To survive this chaos, Netflix built Hystrix — the implementation that became the industry standard for the Circuit Breaker pattern:
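// Resilient service calls with Hystrix
// NOTE: Hystrix entered maintenance mode in 2018.
// Modern alternatives: Resilience4j (Java), Polly (.NET), Istio (service mesh).
// But the concept is the same: circuit breaker + fallback.
// RecommendationClient, RecommendationService, PopularMoviesCache, and Movie
// are illustrative types.
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;
import java.util.List;

public class RecommendationClient {
    private final RecommendationService recommendationService;
    private final PopularMoviesCache cache;

    public RecommendationClient(RecommendationService recommendationService,
                                PopularMoviesCache cache) {
        this.recommendationService = recommendationService;
        this.cache = cache;
    }

    @HystrixCommand(
        fallbackMethod = "fallbackRecommendations",
        commandProperties = {
            @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds",
                             value = "1000"),
            @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold",
                             value = "20"),
            @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage",
                             value = "50")
        }
    )
    public List<Movie> getRecommendations(String userId) {
        return recommendationService.call(userId);
    }

    // Fallback: never show the user an empty page when the main service is down
    public List<Movie> fallbackRecommendations(String userId) {
        return cache.getPopularMovies(); // Static but functional response
    }
}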
Circuit Breaker Logic:
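- Once at least 20 requests have arrived in the rolling window (10 seconds by default) and 50% or more of them have failed → Circuit opens
- For 5 seconds (the default circuitBreaker.sleepWindowInMilliseconds), all requests go directly to the fallback
- After 5 seconds, one trial request is attempted → If it succeeds, the circuit closes; if it fails, the circuit stays open for another 5 seconds
- Separately, any call that takes longer than 1000 ms times out and goes to the fallback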
This way, even if the recommendation service crashes, the homepage still shows popular movies. The failure degrades one feature instead of taking down the whole page.
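To make the open, half-open, and closed transitions concrete, here is a minimal circuit breaker state machine. It is a simplified sketch and not how Hystrix is implemented: it counts calls in fixed batches instead of a rolling time window, and the class name SimpleCircuitBreaker and the injectable clock are assumptions made so the example runs without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical, simplified breaker: counts calls in batches rather than a rolling time window. */
final class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold;   // minimum calls before the breaker may trip
    private final int errorThresholdPercentage; // failure percentage that trips it
    private final long sleepWindowMillis;       // how long to stay open before a trial call
    private final LongSupplier clock;           // injectable time source, in milliseconds

    private State state = State.CLOSED;
    private int calls;
    private int failures;
    private long openedAt;

    SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                         long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized State state() {
        return state;
    }

    /** Runs primary if the circuit allows it; uses fallback when it is open or primary throws. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get();
        }
        try {
            T result = primary.get();
            onResult(false);
            return result;
        } catch (RuntimeException e) {
            onResult(true);
            return fallback.get();
        }
    }

    private synchronized boolean allowRequest() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= sleepWindowMillis) {
            state = State.HALF_OPEN; // let exactly one trial request through
            return true;
        }
        return state == State.CLOSED;
    }

    private synchronized void onResult(boolean failed) {
        if (state == State.HALF_OPEN) {
            if (failed) { trip(); } else { close(); }
            return;
        }
        if (state == State.OPEN) {
            return; // late result from a call that started before the trip
        }
        calls++;
        if (failed) {
            failures++;
        }
        if (calls >= requestVolumeThreshold) {
            if (failures * 100 >= errorThresholdPercentage * calls) {
                trip();
            } else {
                calls = 0;    // healthy batch: start counting again
                failures = 0;
            }
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = clock.getAsLong();
        calls = 0;
        failures = 0;
    }

    private void close() {
        state = State.CLOSED;
        calls = 0;
        failures = 0;
    }
}

final class CircuitBreakerDemo {
    public static void main(String[] args) {
        long[] now = {0L}; // manual clock so the demo needs no sleeping
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(20, 50, 5_000L, () -> now[0]);
        for (int i = 0; i < 20; i++) {
            breaker.<String>call(() -> { throw new IllegalStateException("down"); },
                                 () -> "popular movies");
        }
        System.out.println(breaker.state()); // OPEN
        now[0] = 5_000L; // the sleep window has elapsed
        System.out.println(breaker.call(() -> "personalized", () -> "popular movies")); // personalized
        System.out.println(breaker.state()); // CLOSED
    }
}
```

The constructor arguments mirror the thresholds used in the Hystrix example: 20 requests, 50% errors, and a 5-second sleep window.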
Trade-offs
✅ Gains:
- Thousands of deployments per day (vs. a few per month with the monolith)
- One service crashing doesn't take down the rest (smaller blast radius)
- Each team can choose its own technology stack (polyglot)
❌ Costs:
- Distributed system complexity: Debugging a single error now means digging through logs of 10 services
- Network overhead: What used to be a method call is now an HTTP/RPC call
- Operational burden: Monitoring, updating, and securing 700 services → requires a DevOps army
🛠️ Takeaways
- Netflix ran a monolith for 10 years. A monolith is the right choice at the startup stage; microservices should only come when you genuinely hit a scale problem.
- If you're not running "random" tests in production, the problems will find you in front of real users.
- Trusting a remote service without a fallback is suicide: every remote call needs one.
- Most "Big Bang Rewrite" projects fail. Migrate incrementally with the Strangler Fig pattern.
Next up — Chapter 2: Uber and the 100 billion events per day problem. 🚗