Jesse Warden

Learnings from Chaos Engineering in Node.js

I've been working on Chaos Engineering at the code level, not the infrastructure level, and have some learnings to share.

Infra vs Software

Googling chaos engineering turns up a ton of infrastructure & open source tooling for breaking that infrastructure. That's not what I'm doing.

Little seems aimed at the software side. I'm still new at this, but most of what I found is focused on EC2s/servers, the networking & support services around them, & some on Serverless. Code-level chaos only seems to appear in the Site Reliability or Cyber realm.

Googling is Hard

2nd, googling is hard. "How to cause memory leaks in Node.js" returns 20 results on how NOT to cause memory leaks. You have to dig through the articles for good code on how to cause them quickly, or in a controlled way.
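
For what it's worth, here's the kind of snippet I was hunting for: a minimal sketch of my own (not from any particular article) that leaks memory on purpose by holding references in a module-level array, then stops after a few seconds so the chaos stays controlled. The chunk size and timings are arbitrary.

```javascript
// Deliberate, controlled memory leak: keep references in a module-level
// array so the garbage collector can never reclaim them.
const leaked = [];

const leakTimer = setInterval(() => {
  // Each tick allocates a 1-million-element array and holds onto it.
  leaked.push(new Array(1_000_000).fill('leak'));
  const { heapUsed, rss } = process.memoryUsage();
  console.log(
    `heapUsed: ${(heapUsed / 1048576).toFixed(1)} MB, rss: ${(rss / 1048576).toFixed(1)} MB`
  );
}, 100);

// Stop after 5 seconds so the chaos stays controlled.
setTimeout(() => clearInterval(leakTimer), 5000);
```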

How To Safely Test?

3rd, integration testing this stuff is super hard & new for me. For example, overloading your CPU, even on an older Intel Mac, is harder than I thought. Given the CPU is shared with the test runner, your chaos can lag your test suite, or crash it outright. That's not "controlled" chaos.

I learned to use a Worker, let it spaz out, then shut it down. It's a safe place to do heavy CPU work. It also taught me how hard Node's native cluster API is to use compared to its "thread" (i.e. Worker) API. For a runtime with such amazing concurrency via Promises, cluster feels like a bad API.
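
Here's a rough sketch of that pattern, assuming the worker_threads module and arbitrary durations I picked for illustration: burn CPU on a Worker thread so the main thread stays responsive, then terminate it when you're done.

```javascript
// Peg a CPU core from a Worker thread, then shut it down.
const { Worker, isMainThread, workerData } = require('worker_threads');

if (isMainThread) {
  const worker = new Worker(__filename, { workerData: { durationMs: 2000 } });
  worker.on('exit', () => console.log('CPU chaos finished, back to steady state.'));

  // Safety valve: force-kill the worker if it runs too long.
  setTimeout(() => worker.terminate(), 5000).unref();
} else {
  // Worker thread: spin in a tight loop to keep one core busy.
  const end = Date.now() + workerData.durationMs;
  while (Date.now() < end) {
    Math.sqrt(Math.random()); // pointless math, just to burn cycles
  }
}
```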

It gave me a much better appreciation for Erlang/Elixir's or Akka's supervisor pattern. It just feels like a much more natural & easy way to contain things. Node.js Worker and cluster feel like "writing scripts", not code or self-contained objects.

What is Steady State vs. On Fire?

4th, and this is the boring part, you have to be able to identify steady state first; what is considered "working". What's your CPU or memory usage right now? Except Node doesn't give you a simple number like that; it's all these complicated things like "resident set size", swap, etc.

In true Node.js fashion, I just use libraries to translate that into English. Things like memory & disk, though, are hard to track when things start catching fire. I have 20% memory left now. When I'm down to 2%, my machine is unresponsive and can't tell me I have 2%.
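
For reference, here's a sketch of what the raw numbers look like using only Node's built-ins (the 5-second sample window is an arbitrary choice); this is exactly the "resident set size" style output that sent me looking for friendlier libraries.

```javascript
// Capture a rough steady state baseline with Node's built-in APIs.
const os = require('os');

function snapshot() {
  const { rss, heapUsed } = process.memoryUsage();
  return {
    rssMB: (rss / 1048576).toFixed(1),           // resident set size
    heapUsedMB: (heapUsed / 1048576).toFixed(1), // V8 heap in use
    freeMemPercent: ((os.freemem() / os.totalmem()) * 100).toFixed(1),
    load1m: os.loadavg()[0],                     // 1-minute CPU load average
  };
}

console.log('steady state:', snapshot());
setTimeout(() => console.log('5 seconds later:', snapshot()), 5000);
```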

Testing Sadness

5th & unexpected, these REST API libraries are hard to test. Side effects EVERYWHERE! It's no wonder Test Driven Development & testing in general still hasn't become the norm outside of companies that mandate it. Testing these library boundaries is awful. Spies/mocks galore.
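
To show what "spies/mocks galore" means in practice, here's a small sketch using Jest (my choice of test runner; the listPeople handler and fake HTTP client are hypothetical): to test one handler you end up faking both the HTTP client and the response object.

```javascript
// Hypothetical Express-style handler that calls an HTTP client (a side effect).
const listPeople = (httpClient) => async (req, res) => {
  const people = await httpClient.get('/people');
  res.json(people.data);
};

test('listPeople sends people as JSON', async () => {
  // Mock the side-effecting boundary: the HTTP client.
  const httpClient = {
    get: jest.fn().mockResolvedValue({ data: [{ name: 'Jesse' }] }),
  };
  // Spy on the response object we hand to the handler.
  const res = { json: jest.fn() };

  await listPeople(httpClient)({}, res);

  expect(httpClient.get).toHaveBeenCalledWith('/people');
  expect(res.json).toHaveBeenCalledWith([{ name: 'Jesse' }]);
});
```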

Race Conditions

6th, even if testing inside a monolith, the race conditions/concurrency make my head hurt. "If I add latency to all requests, then /ping takes a long time, which means my /people/list call takes an extra 4 seconds, but only if I call it while the server is pinging." Wat.
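
The latency injection itself is the easy part; a sketch of it as Express middleware (my own illustration; the routes come from the quote above, the handlers are placeholders) looks like this. It's the timing interactions between the delayed routes that get confusing.

```javascript
// Inject a fixed delay into every request before it reaches its handler.
const express = require('express');
const app = express();

const LATENCY_MS = 4000; // mirrors the "extra 4 seconds" above

app.use((req, res, next) => setTimeout(next, LATENCY_MS));

app.get('/ping', (req, res) => res.send('pong'));
app.get('/people/list', (req, res) => res.json([{ name: 'Jesse' }]));

app.listen(3000, () => console.log('latency chaos on http://localhost:3000'));
```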

Conclusions

It's not a career path I want to pursue, but I recommend trying it once for 2 reasons.

First, your google fu is immediately worthless. You have to change your perspective to find what you need.

Second, you spend your life avoiding bad practices, and suddenly having to not only make them happen, but make them happen well, in a controlled environment, super hurts the brain.

This all makes me like Serverless more. Some of these problems are more easily mitigated there. Some are worse (e.g. latency).
