(Or: on how to deploy on a Friday afternoon and sleep tight all weekend)
One of the strongest and most helpful stances that working on Observability has taught me is this: by default, anything you don't observe is broken.
No 'theoretically', no 'maybe', no 'if': it's broken right now, for the simple reason that you can't trust that it isn't. If it were broken, you wouldn't know it anyway; so for all intents and purposes you might as well treat it as broken from the start.
So the baseline for a robust product (especially a cloud-native, distributed one) is to have observability in place that lets you trust, beyond reasonable doubt, that if something breaks you'll know as soon as it happens, and preferably a little sooner than that.
Now we also know that no application survives contact with the user (paraphrasing Helmuth von Moltke); that's why we do load-testing with tools like k6. By simulating load and system stress we can convince ourselves that the thing won't break when we expose it to thundering herds of live users.
We simulate load to convince ourselves that an application is robust enough to tolerate production usage. But heavy production usage is not the only cause of instability and outages; infrastructure stability needs to be accounted for too. What if your network starts dropping packets, pods go down, or a service your application depends on stops replying or starts returning faulty data?
To give a second spin to Helmuth's quote: no application survives contact with your cloud infrastructure. We use load testing to verify resilience under production-like usage; and we use chaos-testing to verify resilience under production-like deployment. By simulating infrastructure failures (network jitter, pod crash, worker collapse...) we convince ourselves that the system is robust not only under 'usage pressure', but also when deployed on top of inherently wobbly infrastructure.
| Concern | Practice that addresses the concern |
|---|---|
| Will it work on my machine? | Unit, integration (feature …) tests |
| Will it work on my cloud? | |
| - Will it work under production load? | Load testing |
| - Will it work on unstable infrastructure? | Chaos testing |
| - Would I know it if it didn’t work? | Observability |
At Canonical we’re working on day 2 ops automation through the whole (open-source) cloud ops stack. From getting something to deploy on a cloud and (inter)operating nicely, all the way to observing it and load- and chaos-testing it to ensure we can deploy it on a Friday afternoon and sleep tight. Not because we know it won’t break (it will!) but because we trust that we’ll receive an alert before it does and as soon as that happens, we’ll have enough data at our fingertips to start remediating right away.
Juju primer
Juju is Canonical's cloud operator driver. You can read all about it (and find more technical definitions of what I'm about to summarize) at canonical.com/juju. I personally like to think about Juju as an operating system for the cloud. "Cloud" is a pile of abstracted storage, networking, and compute resources (think: any public/private/dev cloud; and yes, your home lab is also a cloud!). Juju is the abstracted operating system for it. Charms are the applications that can be executed on Juju (juju deploy postgres...).
If you come from Kubernetes, a charm is a sidecar operator running in your regular application/service pod.
If you come from VM clouds, a charm is a co-located process operating on your regular application/service package (snap, deb, hand-built binary...).
Juju is model-driven in that you declaratively define the topology of your deployment ("I want a Tempo instance and a Ceph cluster with S3, and I want Tempo to use that Ceph cluster as storage") and Juju 'makes it so', abstracting away the details of the substrate you're on.
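To make the idea concrete, here is a minimal sketch of that Tempo + Ceph/S3 example expressed with the Juju Terraform provider. The charm names, channels, and endpoint names below are assumptions picked for illustration (and the provider schema may differ between versions); check Charmhub and the provider docs for the real ones.

```hcl
terraform {
  required_providers {
    juju = {
      source = "juju/juju" # the Juju Terraform provider
    }
  }
}

# A Juju model to hold the deployment (on Kubernetes, this maps to a namespace).
resource "juju_model" "staging" {
  name = "staging"
}

# The Tempo application; charm name and channel are illustrative.
resource "juju_application" "tempo" {
  name  = "tempo"
  model = juju_model.staging.name

  charm {
    name    = "tempo-coordinator-k8s" # assumed charm name
    channel = "latest/edge"           # assumed channel
  }
}

# An application exposing the S3-compatible storage endpoint (e.g. backed by Ceph).
resource "juju_application" "s3" {
  name  = "s3"
  model = juju_model.staging.name

  charm {
    name = "s3-integrator" # assumed charm name
  }
}

# "I want Tempo to use that storage": declare the relation and let Juju make it so.
resource "juju_integration" "tempo_storage" {
  model = juju_model.staging.name

  application {
    name     = juju_application.tempo.name
    endpoint = "s3" # assumed endpoint name
  }
  application {
    name = juju_application.s3.name
  }
}
```

Run terraform apply and Juju reconciles the deployment to match the declared model, whatever cloud it sits on.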
Where on a regular OS the application developer encodes business logic ("this is how you edit images", "this is how you serve a database") in a binary and distributes it as a package, in Juju the charm developer encodes operations logic ("this is how you scale postgres", "this is how you add TLS to this server") in charm code and distributes it as a charm package (on Charmhub). In practice, a charm is a YAML spec with some Python on top that describes how a given application should be installed, configured, and operated in response to Juju model state (and Juju model changes).
If you’re curious about Juju and want to read more, see this intro tutorial.
Observability in Juju
For the past 4+ years, the Observability team at Canonical has been busy developing COS (Canonical Observability Stack), an opinionated, scalable solution for monitoring cloud workloads. Read more about it here.
COS can ingest all OpenTelemetry signals (logs, metrics, profiles, traces) and use them to populate dashboards and emit alerts to suit your monitoring needs.
Anyone (including our customers and our internal services) can bootstrap the substrate, Juju, and COS using Terraform; then Juju takes over and the charms handle all operations like scaling, backups, upgrades, and integration logic.
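As a rough sketch of the pattern (the module source and inputs below are placeholders, not the published interface):

```hcl
# Hypothetical plan: bootstrap COS into its own Juju model via a Terraform module.
# The source path and variable names are placeholders; consult the published
# COS Terraform module for the real interface.
module "cos" {
  source = "path/to/cos-terraform-module" # placeholder

  model = "cos" # the Juju model (Kubernetes namespace) COS is deployed into
}
```

From there on, day 2 operations are the charms' job rather than the plan's.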
Load-testing in Juju
Want to do load tests on your staging cloud? Add the k6-k8s charm (and its integrations to any load-bearing applications in your deployment, pun intended) to your declarative Terraform plan and hit apply. Then juju run k6-k8s start and watch many thousands of virtual users pound your APIs with random data.
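In Terraform terms that could look roughly like the sketch below, extending the earlier example; the k6-k8s charm name comes straight from the post, but the channel and the target application are placeholders to swap for your own.

```hcl
# Add the k6-k8s load generator to the same model as the system under test.
resource "juju_application" "k6" {
  name  = "k6-k8s"
  model = juju_model.staging.name

  charm {
    name    = "k6-k8s"
    channel = "latest/edge" # illustrative channel
  }
}

# Point it at a load-bearing application already in the model.
resource "juju_integration" "k6_target" {
  model = juju_model.staging.name

  application {
    name = juju_application.k6.name
  }
  application {
    name = "frontend" # placeholder: your own API / web front end
  }
}
```

After terraform apply, the juju run command above kicks off the load.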
Chaos testing with LitmusChaos on Juju
Here is where things get interesting.
Last year we charmed the Litmus control plane and made it the first building block of our Canonical Chaos Engineering Platform. This means that you can add the litmus-operators module to your Terraform plan and obtain a control plane that you can already use as-is (on any Kubernetes-based cloud!).
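For illustration, pulling the control plane into the earlier plan could look like this (the module source and inputs are placeholders; refer to the litmus-operators module documentation for the real interface):

```hcl
# Hypothetical: add the Litmus control plane via the litmus-operators module.
module "litmus" {
  source = "path/to/litmus-operators" # placeholder for the published module source

  model = juju_model.staging.name # hypothetical input: the Juju model to deploy into
}
```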
You can then boot up the ChaosCenter and start spicing things up for your deployments. Define experiments and probes, and let them wreak havoc on your infrastructure. With COS on the side gathering telemetry, you know exactly what breaks and how, and you can verify that you're receiving all the alerts you'd hope to receive if one of your critical pods started churning inexplicably often.
Add k6 to the mix and you have a perfect storm: packets dropping left and right and pods giving up while an unprecedented stampede of virtual users is desperately trying to buy a new pair of shoes through your shiny new platform. All the while you, on the side, are watching and taking notes:
- how does the system react?
- how does the observability stack cope?
- am I missing signals?
- did I receive unnecessary alerts?
- did I not receive alerts which I would have wanted to receive?
If this test passes and your observability stack proves robust, you're going to be much more comfortable pressing the 'deploy' button on a Friday afternoon.
Chaos-testing in Juju: 26.04
At Canonical, the cycles of life align with the Ubuntu releases. This cycle, 26.04, we added a new charm to our Litmus collection: the litmus-infrastructure-k8s charm. It allows Juju users to declaratively manage not only the deployment and operation of the Litmus control plane, but also the provisioning of Chaos Infrastructure. Add this charm to your Terraform plan in the same Juju model (== K8s namespace) as the system under test, integrate it with the ChaosCenter charm, and your Litmus deployment now contains a software-defined chaos infrastructure ready to launch experiments on.
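Sketched against the earlier plan, it could look like the snippet below. The charm name comes from the post; the channel, the endpoint wiring, and the chaoscenter application name are assumptions standing in for whatever the litmus-operators module actually exposes.

```hcl
# The new 26.04 charm: provisions Chaos Infrastructure in the same model
# (i.e. the same Kubernetes namespace) as the system under test.
resource "juju_application" "litmus_infra" {
  name  = "litmus-infrastructure"
  model = juju_model.staging.name

  charm {
    name    = "litmus-infrastructure-k8s"
    channel = "latest/edge" # illustrative channel
  }
}

# Register the infrastructure with the control plane.
resource "juju_integration" "infra_to_chaoscenter" {
  model = juju_model.staging.name

  application {
    name = juju_application.litmus_infra.name
  }
  application {
    name = "chaoscenter" # placeholder: the ChaosCenter application from the litmus-operators module
  }
}
```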
The future of chaos testing in Juju
What's next? This is hot off the press, so we're looking for user feedback, but we anticipate a number of exciting directions in which this might evolve.
First of all, we're fantasizing about a Juju fault injector which would allow us to define experiments that operate at the Juju level instead of at the cloud substrate level (remove/add relations, scale applications up/down, run actions, change config options...). That would allow us to wield higher-order chaos, if you like, to validate our products.
Secondly, we're thinking it might be interesting to attach chaos experiment definitions to our charms, so that one can declare not just the topology of the deployment + observability + load/chaos-testing stack, but also the chaos experiments themselves.
Essentially this would mean teaching a charm how to tell the Litmus infrastructure "this is how you should chaos-test me". So charms would ship with built-in, opinionated fault and probe definitions that would enable any cloud admin to quickly get started validating their deployments with Litmus.
Acknowledgements
We’d like to say thank you to the Litmus team for developing a great open-source chaos engineering tool! We’d also like to thank the CNCF mentorship programme, which is contributing observability instrumentation for Litmus itself.