Scale Wars #1 — Netflix: The Company That Killed the Monolith

#systemdesign #architecture #programming #microservices

Year: 2008 · Crisis: DVD-to-streaming transition and single point of failure

The Problem: A 3-Day Outage

In August 2008, a major database corruption left Netflix unable to ship DVDs to customers for 3 days. As the story goes, Netflix CEO Reed Hastings slammed his fist on the table:

"If the entire company goes down because one database crashes, we're not a technology company. We're a DVD rental shop."

At the time, Netflix was a single monolithic Java application: monolith.war. All business logic — orders, billing, catalog, user management — lived in a single WAR file. Every new feature required a full system redeploy.

Architectural Decision: Migration to Microservices (2008–2016)

Netflix spent 8 years transforming their monolith into 700+ microservices. But they didn't do it as a "Big Bang Rewrite" — they used the Strangler Fig Pattern:

New features were written in new microservices first.
Old features were gradually extracted from the monolith and moved to microservices.
End users never noticed the transition.
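
The routing idea behind those three steps can be sketched in a few lines. This is a minimal illustration, not Netflix's actual edge layer: `StranglerRouter`, the path prefixes, and the internal URLs are all made up for the example. A facade sits in front of everything and sends each request either to the monolith or to whichever service has already taken over that path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical routing facade: decides, per request path, whether the
// legacy monolith or an extracted microservice should handle the call.
final class StranglerRouter {
    private final String monolithUrl;
    // Path prefix -> base URL of the service that now owns that prefix.
    private final Map<String, String> migrated = new LinkedHashMap<>();

    StranglerRouter(String monolithUrl) {
        this.monolithUrl = monolithUrl;
    }

    // Call this each time a feature has been extracted from the monolith.
    StranglerRouter migrate(String pathPrefix, String serviceUrl) {
        migrated.put(pathPrefix, serviceUrl);
        return this;
    }

    // Returns the full upstream URL for a request path.
    String resolve(String path) {
        for (Map.Entry<String, String> entry : migrated.entrySet()) {
            if (path.startsWith(entry.getKey())) {
                return entry.getValue() + path;
            }
        }
        return monolithUrl + path; // Not migrated yet: the monolith still serves it.
    }
}
```

Callers only ever talk to the facade, so moving a feature is a one-line routing change (`router.migrate("/billing", "http://billing.internal")`) and the client never sees the switch.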

By 2016, the last monolith component — "user management" — had been migrated. Netflix was now fully microservice-based.

Chaos Engineering: The Simian Army

But microservices brought a new problem: in a system with 700+ services, something will break every single day. At that scale, failure is a statistical certainty, not an exception.

Netflix's answer was radical: embrace the chaos.

In 2011, Chaos Monkey was born. This bot randomly killed servers in the production environment. Yes, you read that right: in production, serving real users.

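// Chaos Monkey: a bot that randomly kills production instances
// NOTE: This code is illustrative. The real Chaos Monkey (born in Netflix's
// Simian Army, now a standalone open-source project) works differently,
// but the core logic is the same: pick a random instance and kill it.
// Assumes AWS SDK for Java v2 and Spring scheduling; Notifier is hypothetical.

import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import org.springframework.scheduling.annotation.Scheduled;
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.*;

public class ChaosMonkey {
    interface Notifier { void send(String message); } // hypothetical chat hook

    private final Ec2Client ec2;
    private final Notifier slack;

    public ChaosMonkey(Ec2Client ec2, Notifier slack) {
        this.ec2 = ec2;
        this.slack = slack;
    }

    @Scheduled(cron = "0 0 9-15 * * MON-FRI") // Hourly, weekdays 9am-3pm
    public void terminateRandomInstance() {
        DescribeInstancesRequest request = DescribeInstancesRequest.builder()
            .filters(
                Filter.builder().name("tag:Environment").values("production").build(),
                Filter.builder().name("instance-state-name").values("running").build())
            .build();

        List<Instance> instances = ec2.describeInstances(request).reservations().stream()
            .flatMap(reservation -> reservation.instances().stream())
            .toList();
        if (instances.isEmpty()) return; // Nothing to terminate

        Instance victim = instances.get(ThreadLocalRandom.current().nextInt(instances.size()));
        ec2.terminateInstances(
            TerminateInstancesRequest.builder().instanceIds(victim.instanceId()).build());

        slack.send(":monkey: Killed instance " + victim.instanceId() + "!");
    }
}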

Why? Because if engineers know their servers can die at any moment, they write resilient systems. Circuit breakers, retries, fallbacks, timeouts — these are no longer "nice to have," they're mandatory.
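
To make three of those four concrete (the circuit breaker gets its own section), here is a minimal sketch of a remote call wrapped in a timeout, a bounded number of retries, and a fallback. `ResilientCall` and `callWithRetry` are names invented for this example, and it assumes the remote call is exposed as a plain `Supplier`. A production version would also add backoff between attempts and cancel work that timed out.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: timeout + bounded retries + fallback around one call.
final class ResilientCall {
    private ResilientCall() {}

    static <T> T callWithRetry(Supplier<T> remoteCall, int maxAttempts,
                               Duration timeout, Supplier<T> fallback) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // Run the call on a pool thread and bound how long we wait for it.
                return CompletableFuture.supplyAsync(remoteCall)
                        .get(timeout.toMillis(), TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // Preserve the interrupt and stop retrying.
                break;
            } catch (Exception e) {
                // The call failed or timed out: loop around and try again.
            }
        }
        return fallback.get(); // Every attempt failed: degrade instead of erroring out.
    }
}
```

Usage might look like `callWithRetry(() -> client.fetch(id), 3, Duration.ofSeconds(1), () -> cachedValue)`, where `client` and `cachedValue` stand in for whatever your service actually has.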

After Chaos Monkey came the Simian Army:

Latency Monkey: Injects delays into services
Conformity Monkey: Finds and kills non-compliant services
Chaos Kong: Takes down an entire AWS region (once a year)
Chaos Gorilla: Takes down an Availability Zone
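
Latency injection is the easiest of these to try at home. The sketch below is not how Latency Monkey is implemented; it is a hypothetical in-process decorator (`LatencyInjector` is an invented name) that delays a configurable fraction of calls, which is enough to find out whether your timeouts and fallbacks actually kick in.

```java
import java.util.Random;
import java.util.function.Supplier;

// Hypothetical decorator that slows down a fraction of calls on purpose.
final class LatencyInjector {
    private final Random random;
    private final double probability; // Chance in [0, 1] that a call is delayed.
    private final long delayMillis;   // How long a delayed call is held back.

    LatencyInjector(Random random, double probability, long delayMillis) {
        this.random = random;
        this.probability = probability;
        this.delayMillis = delayMillis;
    }

    // Wraps a call so that it is sometimes delayed before it runs.
    <T> Supplier<T> wrap(Supplier<T> call) {
        return () -> {
            if (random.nextDouble() < probability) {
                sleep(delayMillis);
            }
            return call.get();
        };
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // Preserve the interrupt for the caller.
        }
    }
}
```

Wrapping a client call with `new LatencyInjector(new Random(), 0.1, 2000).wrap(call)` would slow roughly one request in ten by two seconds.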

Netflix's Hystrix Library

To survive this chaos, Netflix built Hystrix — the implementation that became the industry standard for the Circuit Breaker pattern:

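// Resilient service calls with Hystrix
// NOTE: Hystrix entered maintenance mode in 2018.
// Modern alternatives: Resilience4j (Java), Polly (.NET), Istio (service mesh).
// But the concept is the same: circuit breaker + fallback.
// recommendationService and cache are assumed to be fields of the enclosing class.

import java.util.List;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

@HystrixCommand(
    fallbackMethod = "fallbackRecommendations",
    commandProperties = {
        // Give up on a call that takes longer than 1 second
        @HystrixProperty(name = "execution.isolation.thread.timeoutInMilliseconds",
                         value = "1000"),
        // Minimum number of requests in the rolling window before the circuit can trip
        @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold",
                         value = "20"),
        // Failure percentage at which the circuit opens
        @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage",
                         value = "50"),
        // How long the circuit stays open before a trial request (5000 is the default)
        @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds",
                         value = "5000")
    }
)
public List<Movie> getRecommendations(String userId) {
    return recommendationService.call(userId);
}

// Fallback: don't show the user nothing when the main service crashes
public List<Movie> fallbackRecommendations(String userId) {
    return cache.getPopularMovies();  // Static but functional response
}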

Circuit Breaker Logic:

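If at least 20 requests arrive within the rolling window (10 seconds by default) and 50% or more of them fail → Circuit opens
For 5 seconds (the default sleep window), all requests go directly to the fallback
After 5 seconds, one trial request is let through → If it succeeds, the circuit closes; if it fails, the circuit stays open for another 5 seconds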

This way, even if the recommendation service crashes, the homepage still shows popular movies. The failure degrades one feature instead of taking the whole page down.
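
If you want to see the state machine without pulling in a library, here is a deliberately simplified sketch. It is not Hystrix's implementation: `SimpleCircuitBreaker` is an invented class, it keeps cumulative counters since the circuit last closed rather than a rolling time window, and it holds a lock for the duration of each call. The clock is injected so the open-then-recover cycle can be tested without waiting.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical, simplified circuit breaker: closed -> open -> trial call -> closed.
final class SimpleCircuitBreaker {
    private final int requestVolumeThreshold;   // Minimum calls before the breaker may trip.
    private final int errorThresholdPercentage; // Failure percentage that trips it.
    private final long sleepWindowMillis;       // How long to stay open before a trial call.
    private final LongSupplier clock;           // Injectable time source, in milliseconds.

    private int total;
    private int failures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int requestVolumeThreshold, int errorThresholdPercentage,
                         long sleepWindowMillis, LongSupplier clock) {
        this.requestVolumeThreshold = requestVolumeThreshold;
        this.errorThresholdPercentage = errorThresholdPercentage;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    synchronized boolean isOpen() {
        return open;
    }

    synchronized <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (open && clock.getAsLong() - openedAt < sleepWindowMillis) {
            return fallback.get(); // Open: skip the remote service entirely.
        }
        boolean trial = open; // Sleep window elapsed: let this one request through.
        try {
            T result = primary.get();
            if (trial) {
                // Trial succeeded: close the circuit and start counting afresh.
                open = false;
                total = 0;
                failures = 0;
            }
            total++;
            return result;
        } catch (RuntimeException e) {
            total++;
            failures++;
            boolean overThreshold = total >= requestVolumeThreshold
                    && failures * 100 >= errorThresholdPercentage * total;
            if (trial || overThreshold) {
                open = true; // Trip, or re-open after a failed trial.
                openedAt = clock.getAsLong();
            }
            return fallback.get();
        }
    }
}
```

With `new SimpleCircuitBreaker(20, 50, 5000, System::currentTimeMillis)` the numbers line up with the Hystrix settings above: 20 calls minimum, 50% failures, 5 seconds of fallback-only before the next attempt.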

Trade-offs

✅ Gains:

Thousands of deployments per day (vs. a few per month with the monolith)
One service crashing doesn't take down the rest (blast radius)
Each team can choose its own technology stack (polyglot)

❌ Costs:

Distributed system complexity: Debugging a single error now means digging through logs of 10 services
Network overhead: What used to be a method call is now an HTTP/RPC call
Operational burden: Monitoring, updating, and securing 700 services → requires a DevOps army

🛠️ Takeaways

Start with a monolith: Netflix ran one for roughly 10 years. It is the right choice at the startup stage, and microservices should only come when you genuinely hit a scale problem.
Inject failures in production on purpose. If you don't, the failures will find you on their own, in front of real users.
Never trust a remote service without a fallback: every remote call needs one.
Avoid the "Big Bang Rewrite": most of them fail. Migrate incrementally with the Strangler Fig pattern.