DEV Community

Josh Ellis

What a Beginner Can Learn from Netflix's Chaos

👋 Intro

💬 Docendo discimus: "the best way to learn is to teach"

You've probably heard the old idea that the best way to learn something is to teach it. I love watching videos on YouTube that have (seemingly) nothing to do with my current projects. So I'm starting a series here that's dedicated to explaining takeaways from these videos.

📺 Today’s video is “Mastering Chaos - A Netflix Guide to Microservices”, a talk by Josh Evans, former Director of Operations Engineering at Netflix.

It’s a great video, but I'm mostly working solo on side projects that have one or zero backend servers. Netflix’s microservice infrastructure is well beyond my needs in the immediate future. (Unless, of course, Netflix wants to hire me. DMs are open 😉)

Even so, there are plenty of practical concepts that beginners can take away from a video like this.

  1. Look for solutions in nature
  2. Have a plan B
  3. Try to break things
  4. Automate best practices

👀 Look for solutions in nature

The Netflix codebase used to be monolithic, meaning all the code lived together in one application. Whenever something broke, it often brought down the whole system with it.

Netflix took inspiration from how evolution set up the body as a system of microservices. As an example, your visual and digestive systems are mostly independent. If you lost your sight, you’d still be able to eat and survive.

In the same way, Netflix builds many small services (microservices) that each handle only one small part of the system. That way, if one crashes, it's less likely to affect the whole system.

My Takeaway: When you’re setting up a system or solving a problem, see if nature already came up with a solution.

Have a plan B

Separating their system into microservices solved a lot of problems. But there were still times when one failing microservice caused cascading issues in others.

Their solution was to add a middleware that sets up fallbacks and timeouts for when things go wrong. What does that mean?

Here’s an example (note: this is 100% fiction, the speaker didn’t give a real-world example):

  • User tries to load Netflix
  • The Show Options service asks the Creepily-Personal Recommendations service for a list of Shows to recommend to the User
  • The Creepily-Personal Recommendations service is currently broken/overwhelmed and can’t respond
  • The Show Options service doesn’t get a response fast enough, so it times out and tries its first fallback
  • The Show Options service asks the General Recommendations service what to recommend
  • The General Recommendations service responds successfully
  • The Show Options service shows the recommendations
  • User is happy because they don’t know anything went wrong

Now, that’s a very contrived example, but you get the idea.
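Here's a minimal sketch of that flow in Python. Just like the story above, it's fiction: the service functions, the `user-123` ID, and the 0.1 second timeout are all names and values I made up for illustration, not anything from the talk or from Netflix's actual middleware.

```python
import asyncio


async def personal_recommendations(user_id: str) -> list[str]:
    """Hypothetical personalized service that is currently overwhelmed."""
    await asyncio.sleep(5)  # simulate a response that arrives far too late
    return ["Show picked just for you"]


async def general_recommendations() -> list[str]:
    """Hypothetical general service that responds right away."""
    return ["Popular Show A", "Popular Show B"]


async def show_options(user_id: str, timeout_seconds: float = 0.1) -> list[str]:
    """Try the personalized service first, then fall back to the general one."""
    try:
        # Give up on the slow service once the timeout passes
        return await asyncio.wait_for(
            personal_recommendations(user_id), timeout=timeout_seconds
        )
    except (asyncio.TimeoutError, ConnectionError):
        # Plan B: the user still gets something to watch
        return await general_recommendations()


if __name__ == "__main__":
    print(asyncio.run(show_options("user-123")))
```

The caller of `show_options` never sees the timeout. It just gets a list of shows either way, which is the whole point of the fallback.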

My Takeaway: Try to hide errors from your users by coding a backup plan into your app. The user will be happy because everything keeps working, and you’ll be happy because your system doesn’t crash due to a small part of it failing.

💣 Try to break things

Setting up tests is a great way to make sure your app works in ideal situations, but how do you find out what happens when something unexpected occurs? Enter Chaos Engineering.

Principlesofchaos.org defines chaos engineering as “the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production”.

By definition, you can’t truly plan for an unexpected event. But you can still try to break your system. This is how Netflix predicts whether their system will survive events like cyber-attacks and server outages.

My Takeaway: For those of us not running hundreds of microservices on servers around the world, we can still remember to try to break things. When you’re using your app, make sure to test more than just the “ideal” user actions.
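For a solo project, one small-scale way to "try to break things" is to inject failures into a dependency on purpose and check that your app still has something to show. This is only a sketch of that idea, not Netflix's tooling. `flaky`, `fetch_recommendations`, and `load_home_page` are hypothetical names I invented for the example, and the 30% failure rate is arbitrary.

```python
import random


def flaky(func, failure_rate: float, rng: random.Random):
    """Wrap func so it raises ConnectionError for a fraction of calls."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations() -> list[str]:
    """Hypothetical dependency that normally works."""
    return ["Personal Show"]


def load_home_page(fetch) -> list[str]:
    """Code under test: it should always return something to display."""
    try:
        return fetch()
    except ConnectionError:
        return ["Popular Show"]  # fallback when the dependency fails


def run_experiment(trials: int = 1000, failure_rate: float = 0.3, seed: int = 42) -> int:
    """Return how many trials ended with a crash or an empty page."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    chaotic_fetch = flaky(fetch_recommendations, failure_rate, rng)
    broken = 0
    for _ in range(trials):
        try:
            if not load_home_page(chaotic_fetch):
                broken += 1
        except Exception:
            broken += 1
    return broken


if __name__ == "__main__":
    print("broken page loads:", run_experiment())
```

If you delete the `except ConnectionError` branch in `load_home_page`, the count of broken page loads goes above zero, which is exactly the kind of weakness this experiment is meant to surface.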

🤖 Automate best practices

One of the big problems with a microservice architecture is the extra overhead of needing many services to get simple tasks done. That overhead makes you less likely to follow best practices and more likely to fall into ‘hacky’ solutions.

Netflix solves this issue by automating as much of the overhead as possible. When something goes wrong, they figure out how to fix it, then automate the fix so that doing the right thing becomes the default in the future.

My Takeaway: Automating best practices could be as simple as using a generator to add consistency to your files and functions. I encourage my own best practices by having my portfolio pull project data directly from GitHub. Because of that, I'm reminded to put good tags, descriptions, commits, and titles on my projects.
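To make that concrete, here's a small sketch of a check that flags projects with missing metadata. It isn't the code behind my portfolio. The `Project` fields and the sample data are assumptions made up for this example, and the projects are hardcoded instead of being pulled from GitHub.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    """Hypothetical shape of the project data a portfolio might display."""
    name: str
    description: str = ""
    tags: list[str] = field(default_factory=list)


def find_problems(project: Project) -> list[str]:
    """List the best practices this project is not following yet."""
    problems = []
    if not project.description.strip():
        problems.append("missing description")
    if not project.tags:
        problems.append("missing tags")
    return problems


def audit(projects: list[Project]) -> dict[str, list[str]]:
    """Map each project that has problems to its list of problems."""
    report = {}
    for project in projects:
        problems = find_problems(project)
        if problems:
            report[project.name] = problems
    return report


if __name__ == "__main__":
    sample = [
        Project("portfolio", "My personal site", ["react"]),
        Project("weekend-hack"),  # no description or tags yet
    ]
    for name, problems in audit(sample).items():
        print(f"{name}: {', '.join(problems)}")
```

Run something like this before publishing and the reminder becomes automatic instead of relying on memory.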

🤔 Final Thoughts

I hope some of those takeaways were useful!

I’d love to hear from you. What are some of the ways you’ve implemented looking for solutions in nature, having a plan B, trying to break things, or automating best practices?

See what I'm working on: imjoshellis.com
