<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karthik Satchitanand</title>
    <description>The latest articles on DEV Community by Karthik Satchitanand (@ksatchit).</description>
    <link>https://dev.to/ksatchit</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F405355%2Fd8cd976f-14a5-4a9a-94e2-4d1ec0d3f8a0.jpeg</url>
      <title>DEV Community: Karthik Satchitanand</title>
      <link>https://dev.to/ksatchit</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ksatchit"/>
    <language>en</language>
    <item>
      <title>Chaos Engineering With Harness CI/CD Pipelines</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Fri, 06 May 2022 14:57:02 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/chaos-engineering-with-harness-cicd-pipelines-1dn0</link>
      <guid>https://dev.to/litmus-chaos/chaos-engineering-with-harness-cicd-pipelines-1dn0</guid>
      <description>&lt;h1&gt;
  
  
  Left-Shift Of Chaos Engineering
&lt;/h1&gt;

&lt;p&gt;The emergence of the cloud-native paradigm and the ever-increasing adoption of Kubernetes have brought with them the challenges of re-architecting applications to be distributed, loosely coupled, and containerized (read: &lt;a href="https://en.wikipedia.org/wiki/Microservices" rel="noopener noreferrer"&gt;microservices&lt;/a&gt;). They have also brought new ways of dealing with “operational” aspects such as deployments and upgrades, storage provisioning (in the case of stateful components), maintaining high availability, scaling, recovery, etc. Kubernetes, as the overarching orchestration system, provides multiple choices and approaches for implementing these; however, this brings in significant complexity. &lt;/p&gt;

&lt;p&gt;Organizations migrating to Kubernetes typically face the need to test different kinds of failure scenarios and to learn deeply about their systems and processes through repeated experimentation - to build confidence before these systems hit production. This has contributed to a “shift-left” approach in chaos engineering, with increased usage of chaos within CI/CD pipelines. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkavrq8lmgy37sfsljh9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkavrq8lmgy37sfsljh9.png" alt=" " width="800" height="475"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  LitmusChaos Improves Support For CI/CD
&lt;/h1&gt;

&lt;p&gt;The Litmus community has seen a significant upsurge in the usage of chaos experiments within pipelines - in some cases even before the build/deployment artifacts are merged into the main source of truth (executed on transient PR environments), and in most cases as part of the continuous deployment process, wherein the build artifacts are pushed to a staging environment and subjected to “sanity chaos”.  &lt;/p&gt;

&lt;p&gt;While the project did provide some initial (limited) support and presented use cases in this area as part of its 1.x releases, viz., GitLab templates &amp;amp; GitHub chaos actions, there was still a pending need to support fully featured chaos scenarios in CI/CD pipelines - ones that allow all experiment tunables along with hypothesis validation using Litmus probes. In other words, support for the execution of workflows that can be managed and viewed from the Chaos-Center. &lt;/p&gt;

&lt;p&gt;Recent versions of Litmus (v2.8.0+) provide a set of refactored/standardized APIs to trigger workflows from a pipeline, while also allowing the visualization and management of chaos workflows launched with specific metadata (&lt;code&gt;cluster_id&lt;/code&gt; &amp;amp; &lt;code&gt;controller-instanceid&lt;/code&gt; labels) in the Chaos-Center. &lt;/p&gt;

&lt;p&gt;In this post, let's take a quick peek at an example of chaos being executed from a Harness CI/CD pipeline. &lt;/p&gt;

&lt;h1&gt;
  
  
  Harness CI/CD
&lt;/h1&gt;

&lt;p&gt;The Harness Continuous Delivery-as-a-Service platform provides a simple and secure way for engineering and DevOps teams to release applications faster and more safely. Harness automates the entire CI/CD process, which helps build, test, and deploy improved features more quickly. It uses machine learning to detect the quality of deployments and automatically roll back failed ones, saving time and reducing custom scripting and manual oversight. It provides enterprise-grade security at every step of the CI/CD pipeline. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9x2s8gtn4hv5cw1l623r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9x2s8gtn4hv5cw1l623r.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is a quick reference to how you can set up a &lt;a href="https://ngdocs.harness.io/article/x0d77ktjw8-ci-pipeline-quickstart" rel="noopener noreferrer"&gt;build&lt;/a&gt; and &lt;a href="https://ngdocs.harness.io/article/knunou9j30-kubernetes-cd-quickstart" rel="noopener noreferrer"&gt;deploy&lt;/a&gt; pipeline using Harness. &lt;/p&gt;

&lt;p&gt;As explained in detail in the aforementioned reference, Harness carries out the build and deploy steps via jobs executed through a Delegate - a service you run in your VPC. It speaks to the Harness manager, the main control plane managing the individual pipeline instances. You could use the Harness SaaS platform or its on-premise installation to implement the pipelines. &lt;/p&gt;

&lt;h1&gt;
  
  
  Chaos Experiments In Harness Pipelines
&lt;/h1&gt;

&lt;p&gt;Chaos within a Harness pipeline can be added as a dedicated stage after the deployment stage, with the results of the chaos experiment determining the success of your pipeline via an appropriately defined custom failure strategy:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74m1qk0xpwa98ivwz2z7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F74m1qk0xpwa98ivwz2z7.png" alt=" " width="711" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx8b8eboo8kfnfsiip3o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx8b8eboo8kfnfsiip3o.png" alt=" " width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;or as a step within the deployment stage itself, with the ability to roll back the deployment in case a chaos run against it does not produce the desired results. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj0q9houhy1n2bhkkqq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwj0q9houhy1n2bhkkqq6.png" alt=" " width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv5wvz9i644ac80e72zt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwv5wvz9i644ac80e72zt.png" alt=" " width="800" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within the chaos steps defined in the above pipelines, one could invoke the &lt;a href="https://litmuschaos.github.io/litmus/graphql/v2.9.0/api.html" rel="noopener noreferrer"&gt;Litmus API&lt;/a&gt; to trigger a chaos workflow that is configured with the right fault (experiment) &amp;amp; validation (probe) combinations, apart from the desired runtime properties (duration, node selectors, tolerations, resources, etc.). The prerequisite is to extract the authentication tokens before constructing the API invocation in the pipeline. &lt;/p&gt;
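&lt;p&gt;As a rough sketch of what such a pipeline step might assemble - note that the endpoint path, the Authorization header, and the mutation/field names below are illustrative placeholders rather than the verified Litmus GraphQL schema (see the linked API reference for the real one):&lt;/p&gt;

```python
# Hypothetical sketch of preparing an authenticated Litmus API call from a
# pipeline step. The endpoint path ("/api/query"), the Authorization header,
# and the mutation/field names are illustrative placeholders -- consult the
# Litmus GraphQL API reference for the actual schema.
import json

def build_trigger_request(server_url, access_token, project_id, workflow_id):
    """Assemble URL, headers, and JSON body to re-run a pre-constructed workflow."""
    query = """
    mutation reRunWorkflow($projectID: String!, $workflowID: String!) {
      reRunChaosWorkFlow(projectID: $projectID, workflowID: $workflowID)
    }
    """
    payload = {
        "query": query,
        "variables": {"projectID": project_id, "workflowID": workflow_id},
    }
    headers = {
        "Authorization": access_token,  # token extracted earlier in the pipeline
        "Content-Type": "application/json",
    }
    return server_url + "/api/query", headers, json.dumps(payload)

url, headers, body = build_trigger_request(
    "https://chaos-center.example.com", "ACCESS_TOKEN", "my-project", "wf-1234")
```

&lt;p&gt;A pipeline step would then POST the returned body to the URL with those headers and gate the stage on the response. &lt;/p&gt;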

&lt;p&gt;Such a chaos workflow template can either be hand-crafted and used as a payload within the API call, or, simpler still, be pre-constructed on the Chaos-Center so that it can be re-triggered from the pipeline using the &lt;code&gt;workflow_id&lt;/code&gt; reference. &lt;br&gt;
An alternative mechanism is to perform a &lt;code&gt;kubectl apply&lt;/code&gt; of the workflow manifest downloaded after its creation from the Chaos-Center (this model is intended to support users who prefer keeping a golden copy of the chaos scenario definition alongside the deployment artifacts). The status and success of such workflow executions can be tracked via kubectl commands or via the Litmus API. &lt;/p&gt;
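&lt;p&gt;The kubectl-based mechanism can be sketched as a thin wrapper; the manifest filename, the &lt;code&gt;litmus&lt;/code&gt; namespace, and the Argo &lt;code&gt;workflow&lt;/code&gt; resource kind used below are assumptions for illustration:&lt;/p&gt;

```python
# Illustrative sketch: apply a chaos workflow manifest downloaded from the
# Chaos-Center and poll its status with kubectl. The manifest path, namespace,
# and the Argo "workflow" resource kind are assumptions for illustration.
import subprocess

def apply_workflow_cmd(manifest_path, namespace="litmus"):
    """Build the kubectl command that applies the workflow manifest."""
    return ["kubectl", "apply", "-f", manifest_path, "-n", namespace]

def workflow_phase_cmd(workflow_name, namespace="litmus"):
    """Build the kubectl command that reads the workflow's current phase."""
    return ["kubectl", "get", "workflow", workflow_name, "-n", namespace,
            "-o", "jsonpath={.status.phase}"]

def run(cmd):
    """Execute a command and return its stdout; raises on a non-zero exit."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

# In a pipeline step, one might do:
#   run(apply_workflow_cmd("chaos-workflow.yaml"))
#   phase = run(workflow_phase_cmd("my-chaos-workflow"))
#   ...and fail the stage unless the reported phase indicates success
```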

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g8olnyrfnp1oomtjh9t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9g8olnyrfnp1oomtjh9t.png" alt=" " width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;While chaos experimentation as part of SRE-driven gamedays will continue as the chaos engineering practice is adopted by more organizations, its consumption in the developer community is set to increase manifold on account of its inclusion in CI/CD pipelines, given the growing culture of developers owning build-to-deploy for their code changes. &lt;/p&gt;

&lt;p&gt;Through this post, we have shown an example of how popular CI/CD platforms like Harness are embracing the concept of chaos experimentation and bringing it into the ambit of the pipeline world. You can see the explanation in action in this demo by &lt;a href="https://twitter.com/neelanjanmanna" rel="noopener noreferrer"&gt;Neelanjan&lt;/a&gt;, provided during the &lt;a href="https://youtu.be/NbI_UuENZYU?t=4520" rel="noopener noreferrer"&gt;Cloud Native Scale meetup&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>ChaosNative Joins Harness!</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Tue, 22 Mar 2022 13:03:28 +0000</pubDate>
      <link>https://dev.to/ksatchit/chaosnative-joins-harness-1efh</link>
      <guid>https://dev.to/ksatchit/chaosnative-joins-harness-1efh</guid>
      <description>&lt;p&gt;Today, we are thrilled to announce to the LitmusChaos community that ChaosNative, the primary sponsor of the LitmusChaos project, is joining &lt;a href="https://harness.io/"&gt;Harness&lt;/a&gt; to extend scalable chaos engineering to enterprises.&lt;/p&gt;

&lt;p&gt;LitmusChaos was born out of a need to improve resiliency and build confidence in cloud-native services in production. What started out as an effort to provide out-of-the-box chaos experiments for Kubernetes-based microservices, eventually grew into an end-to-end framework to carry out chaos engineering on a wide variety of application and infrastructure targets, with support for multi-tenancy, SLO validation, and custom workflows, amongst other features. In the last 4 years since the project started, Litmus saw a continuous growth of community involvement in usage, code contributions, and feedback. Achieving the CNCF incubating status is a testament to Litmus’ maturity in adoption, and it is now at a stage where there is a massive opportunity to explore and identify new use cases for chaos engineering in cloud-native DevOps. &lt;/p&gt;

&lt;p&gt;All along, we believed that there is a lot to be done to make chaos engineering commonplace in DevOps. We kept saying - It is not just for Ops. Litmus has seen tremendous growth in the past year mainly in its use in CI/CD pipelines. The sentiment of shift-left has received significant validation within the community, with several related use cases involving Litmus being presented in conferences such as KubeCon, CDCon &amp;amp; Chaos Carnival! We believe that by joining hands with the Harness team - who bring in deep expertise and a rich understanding of the CI/CD domain, we shall take this movement forward with increased focus and help developers identify issues much quicker and earlier in the delivery cycle. By being part of Harness, our efforts to take chaos engineering to developers are going to be accelerated.   &lt;/p&gt;

&lt;p&gt;Having said that, we are ever-more committed to improving the core, i.e., the chaos orchestration services, and fault library to support our existing users and cater to the SRE community. We expect to contribute many new features to Litmus that aid SREs, including better support for disconnected or air-gapped environments, improved observability via events &amp;amp; audit logs, chaos user mapping with Kubernetes/platform service accounts, integration with secrets management platforms, hooks to incident response systems and guided gameday orchestration. &lt;/p&gt;

&lt;p&gt;As part of Harness, we will continue to help maintain the community-first values of Litmus, while delivering on important roadmap items around integrations, newer fault injections/experiments, and an improved chaos dashboard. &lt;/p&gt;

&lt;p&gt;The team will remain active advocates of cloud-native chaos engineering, with no changes to the monthly meetups, workshops, and participation in CNCF events. What is more, we are even working towards a certification around Litmus, as part of efforts to spread chaos engineering knowledge, so stay tuned! &lt;/p&gt;

&lt;p&gt;We are thankful to the LitmusChaos users, our early adopter customers at ChaosNative, CNCF, and the chaos engineering community at large for their support and Harness for giving us the opportunity to continue our mission together! &lt;/p&gt;

&lt;p&gt;(Uma Mukkara &amp;amp; Karthik Satchitanand: Co-creators and Maintainers of LitmusChaos &amp;amp; Co-Founders of ChaosNative)&lt;/p&gt;

</description>
      <category>litmuschaos</category>
      <category>devops</category>
      <category>kubernetes</category>
      <category>chaosengineering</category>
    </item>
    <item>
      <title>LitmusChaos in 2021: The Year In Review</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Wed, 29 Dec 2021 04:10:08 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/litmuschaos-in-2021-the-year-in-review-38cl</link>
      <guid>https://dev.to/litmus-chaos/litmuschaos-in-2021-the-year-in-review-38cl</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Year-end retrospectives are an interesting exercise. Much as they help us dwell on things accomplished, they also generate excitement (and nervous energy) about what is to come. In this short post (there is much to write individually on the topics referenced), I shall attempt to encapsulate the progress that LitmusChaos as a project &amp;amp; the Litmus community as a whole made over the past year. Before we get there, my heartfelt &lt;strong&gt;thank you&lt;/strong&gt; to all our users &amp;amp; adopters, contributors, as well as CNCF for helping us along in our journey. Your feedback, involvement, and mentorship are what keep us going.  &lt;/p&gt;

&lt;p&gt;I have tried to bucket some of the important milestones we achieved, based on their nature, and, in the process, paint an overall picture of the project’s growth. These wouldn’t have been possible without valid criticism and healthy debates from/with the community. &lt;/p&gt;

&lt;h1&gt;
  
  
  Features &amp;amp; Releases
&lt;/h1&gt;

&lt;h2&gt;
  
  
  LitmusChaos 2.0
&lt;/h2&gt;

&lt;p&gt;At the end of last year, we had begun to play with the idea of a “Litmus Portal”, essentially a dashboard that would help orchestrate chaos and simplify the learning curve in preparing chaos custom resources (CRs) for a given scenario. Over time, the scope of the portal went beyond this initial requirement to become a full-blown control plane that can: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Help create complex scenarios using workflows &lt;/li&gt;
&lt;li&gt;Manage chaos across target environments (other clusters, namespaces)&lt;/li&gt;
&lt;li&gt;Delegate chaos operations to teams &lt;/li&gt;
&lt;li&gt;Leverage dedicated chaos artifact sources (you could use git to load/commit workflows from/into the chaos center)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This precipitated a major release (2.0) in August, after remaining in beta for nearly 6 months, a phase during which we learned more about user expectations around such capabilities. The decision to bump up the version was taken after we observed a collective acceptance and alignment of the community towards the newer way of approaching chaos compared to what was in use until then. Having said that, 2.0 is built on the same core, with users free to directly consume the chaos operator as before. &lt;/p&gt;

&lt;h2&gt;
  
  
  Support for Containerd, CRI-O
&lt;/h2&gt;

&lt;p&gt;As the Kubernetes community moves on from Docker as the container runtime, there was a growing need to support network, resource-stress, and other experiments on other popular runtimes (the growing usage of Litmus in OpenShift environments was another trigger). Today, Litmus natively supports choosing the desired runtime as an experiment tunable. &lt;/p&gt;

&lt;h2&gt;
  
  
  Better Blast Radius Control
&lt;/h2&gt;

&lt;p&gt;As chaos adoption grew last year, so did the use cases and requirements around making fault injection ever more granular, especially on Kubernetes. While randomized failures are at the core of the traditional chaos engineering practice, fault injection now often operates against a heavily filtered dataset. Litmus added support for percentage-based selection of pods (satisfying given namespace, workload-label, and node-label constraints), nodes &amp;amp; cloud resources, apart from the ability to pick out a chaos target by name. &lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Hypothesis Validation via richer Probe Schema
&lt;/h2&gt;

&lt;p&gt;Automated chaos experiments becoming more mainstream resulted in several enhancement requests around the basic probe functionality Litmus provided in earlier versions. Newer capabilities &amp;amp; thereby a richer schema have been added to the HTTP, command, K8s &amp;amp; Prometheus probes, with more in the works. They are being used for performing custom validations within experiments, making the experiment verdicts more meaningful. &lt;/p&gt;
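&lt;p&gt;As a rough illustration of the kind of validation these probes enable - the field names below only approximate the Litmus probe schema and should be checked against the official docs - an HTTP probe entry might look like this (expressed here as a Python dict mirroring the YAML):&lt;/p&gt;

```python
# Rough illustration of an HTTP probe definition as it might appear (in YAML)
# under a ChaosEngine's probe list; the field names here only approximate the
# Litmus probe schema and should be verified against the official docs.
http_probe = {
    "name": "check-frontend-availability",  # hypothetical probe name
    "type": "httpProbe",
    "mode": "Continuous",                   # validate throughout the chaos run
    "httpProbe/inputs": {
        "url": "http://frontend.app.svc:8080/healthz",
        "method": {"get": {"criteria": "==", "responseCode": "200"}},
    },
    "runProperties": {"probeTimeout": 5, "interval": 2, "retry": 1},
}
```

&lt;p&gt;A probe of this shape lets the experiment verdict reflect an application-level hypothesis (the service stays reachable during chaos) rather than just the fault’s completion. &lt;/p&gt;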

&lt;h2&gt;
  
  
  Non-Kubernetes Experiments
&lt;/h2&gt;

&lt;p&gt;Multiple organizations that adopted chaos engineering invariably had a mixed or hybrid environment, with services residing in different substrates - Kubernetes (self-hosted, managed/cloud), vanilla VMware VMs or cloud instances, and in some cases bare-metal servers. When Litmus was picked by their teams (teams that are increasingly Kubernetes-native/aware), the ask was to use the same platform, with a similar UX, to perform chaos against non-Kubernetes targets, so as to have a centralized view of chaos and resilience across different services. This resulted in an increased focus on experiments targeting VMs and disks on different cloud providers (VMware, Azure, GCP, AWS). This area is a work in progress and should see more features/experiments come the new year. &lt;/p&gt;

&lt;h1&gt;
  
  
  Community
&lt;/h1&gt;

&lt;p&gt;At the heart of any open-source project is the community, and Litmus witnessed great community growth in 2021. We added 18 new adopters, across cloud-native end users, vendors, solution providers, and other open-source projects. The end-of-month Saturday meetups saw more interest and attendance, and the Slack channel grew to nearly five times its size around this time last year. Contributions (of all types: code, docs, tests, helm charts) started coming in bigger numbers - leading to four new maintainers (including folks from the adopting organizations). We partnered with members from the user community to deliver talks featuring Litmus at &lt;a href="https://www.youtube.com/watch?v=_DgCc4-BLW8" rel="noopener noreferrer"&gt;KubeCon EU ‘21&lt;/a&gt; &amp;amp; &lt;a href="https://www.youtube.com/watch?v=fOgDZfZrFGg&amp;amp;t=1s" rel="noopener noreferrer"&gt;KubeCon NA ‘21&lt;/a&gt;. We were pleasantly surprised to view presentations about it from &lt;a href="https://youtu.be/btGFt5-37hs?t=738" rel="noopener noreferrer"&gt;other chaos advocates&lt;/a&gt; and see it featured in a &lt;a href="https://www.youtube.com/watch?v=KGbw2c40q9Q&amp;amp;list=PLj6h78yzYM2Nd1U4RMhv7v88fdiFqeYAP&amp;amp;t=823s" rel="noopener noreferrer"&gt;Keynote&lt;/a&gt; that discussed reliability! &lt;/p&gt;

&lt;p&gt;The year also saw the birth of a CNCF initiative to further the field of chaos engineering in a cloud-native environment: the &lt;a href="https://github.com/cncf/chaosengineering-wg/" rel="noopener noreferrer"&gt;Chaos Engineering Working Group&lt;/a&gt;, of which Litmus maintainers and community members are an important part. &lt;/p&gt;

&lt;h1&gt;
  
  
  Chaos Carnival
&lt;/h1&gt;

&lt;p&gt;As the community grew, we realized it was also an indicator of the increasing adoption of chaos as a practice. Reliability is no longer an afterthought, and while Litmus is playing an important role in this area, it isn’t the only one. We also felt a need to tap into the community to bring out more facets of chaos engineering and resilience, especially aspects such as culture and processes, and allied technical topics such as observability, security, et al. We decided to bring people to a common forum to discuss these, and thereby spread knowledge that equips newer entrants to chaos engineering &amp;amp; strengthens practitioners with more data points. Thus was born the &lt;a href="https://www.youtube.com/channel/UCDoH57pQWHU0MCm6Y5Y3LhA/playlists" rel="noopener noreferrer"&gt;Chaos Carnival&lt;/a&gt;. Organized largely by members of the Litmus community, with generous help from sponsors, the event brought together some of the best minds in the SRE, DevOps &amp;amp; chaos engineering space to share their expertise. &lt;/p&gt;

&lt;h1&gt;
  
  
  Looking Ahead: What 2022 Brings
&lt;/h1&gt;

&lt;p&gt;For starters, we are putting together the &lt;a href="https://chaoscarnival.io/" rel="noopener noreferrer"&gt;second edition&lt;/a&gt; of the chaos carnival with an awesome list of topics and speakers. We hope you will enjoy it! &lt;/p&gt;

&lt;p&gt;Coming back to the project, increased support for diverse non-Kubernetes targets and simplified integrations with observability &amp;amp; CI/CD platforms remain a priority, not to mention more chaos types within Kubernetes. However, we also aim to place emphasis on security-related features (we got started last year by defining PSPs &amp;amp; Kyverno policies you could use with Litmus, apart from moving to more secure container images), which are an important enterprise need for adoption. Stay tuned to this space for more information on what these are!&lt;/p&gt;

&lt;p&gt;Finally, we are working towards moving further (beyond sandbox) along the ladder within the CNCF ecosystem and making the community more vibrant - so that the Litmus user can experience a richer chaoslib, better chaos orchestration &amp;amp; integrations, higher quality, faster patches/enhancements, and better support.  &lt;/p&gt;

&lt;p&gt;Thanks for reading this article! &lt;/p&gt;

</description>
      <category>litmuschaos</category>
      <category>testing</category>
      <category>devops</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>LitmusChaos at Kubecon EU '21</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Wed, 19 May 2021 15:48:27 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/litmuschaos-at-kubecon-eu-21-5963</link>
      <guid>https://dev.to/litmus-chaos/litmuschaos-at-kubecon-eu-21-5963</guid>
      <description>&lt;p&gt;KubeCon is veritably a festival for the tech community nowadays, and this is a sentiment that I have expressed several times before, but cannot tire of repeating! The sheer diversity of content (organized into a neat set of tracks), the announcements &amp;amp; launches, predictions, and panel discussions, all add to the excitement. I even managed to get my young nephew interested in the Cloud Native world by ordering the Phippy &amp;amp; friends buildable set with my Speaker giveaway! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowfs7o8h7yuzgrv9ikdb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowfs7o8h7yuzgrv9ikdb.png" alt="Alt Text" width="361" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For us, as members of the LitmusChaos project &amp;amp; the broader cloud native chaos engineering community, this edition of KubeCon EU '21 turned out to be sweeter still.  &lt;/p&gt;

&lt;p&gt;The predictions made ahead of the event by Cheryl Hung, VP of Ecosystem at CNCF, in which she noted the growing interest in chaos, set the ball rolling (and this came on the back of similar observations made by Liz Rice, chair of CNCF’s Technical Oversight Committee (ToC), at the November 2020 KubeCon): &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5ym4sj0ng3fpkxms86l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn5ym4sj0ng3fpkxms86l.png" alt="image" width="585" height="456"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;From inside CNCF I have a unique insight into cloud native tech, DevOps and ecosystem. Here’s what I see coming next!&lt;a href="https://t.co/KeqcQCC3E7" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/KeqcQCC3E7" rel="noopener noreferrer"&gt;https://t.co/KeqcQCC3E7&lt;/a&gt; &lt;a href="https://t.co/e7uoq0aXXz" rel="noopener noreferrer"&gt;pic.twitter.com/e7uoq0aXXz&lt;/a&gt;&lt;/p&gt;— Cheryl Hung (@oicheryl) &lt;a href="https://twitter.com/oicheryl/status/1374419739489886212?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;March 23, 2021&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;p&gt;This added to the organic interest that Litmus as a project has gathered in the CNCF community, and resulted in some wonderful responses to the various presentations/talks it featured in. All this led to new Slack members, appreciative messages from those who grokked the project, more questions on what’s coming &amp;amp; the roadmap, quick feedback from folks who tried it, and some cool issues and PRs in the process! &lt;/p&gt;

&lt;p&gt;In this blog post, let us do a quick round-up of the talks/sessions where Litmus “appeared”. Some of the perspectives expressed are instructive and definitely hold value for us at the project. I have also tried to capture all the buzz (via tweets, of course) around Litmus and provide links to the awesome presentations around the project.&lt;/p&gt;

&lt;h1&gt;
  
  
  The Round-Up
&lt;/h1&gt;

&lt;p&gt;First up was the whirlwind sandbox projects tour by Justin Cormack, CTO of Docker, in which he &lt;a href="https://youtu.be/XmMbzBuCuYw?t=462" rel="noopener noreferrer"&gt;introduced Litmus&lt;/a&gt; as a way to package &amp;amp; perform your own custom chaos experiments. The talk prompted some awesome cloud-native advocates &amp;amp; tech gurus to recall some deep dives on the project performed in recent times! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2nyw1ifkqioxoqzjn2i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2nyw1ifkqioxoqzjn2i.png" alt="image" width="596" height="579"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;This year we've been taking a deep dive into the CNCF Sandbox on our blog, and so far we've covered &lt;a href="https://twitter.com/kubevirt?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@kubevirt&lt;/a&gt;, &lt;a href="https://twitter.com/LitmusChaos?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@LitmusChaos&lt;/a&gt; and &lt;a href="https://twitter.com/telepresenceio?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@telepresenceio&lt;/a&gt;. But &lt;a href="https://twitter.com/justincormack?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@justincormack&lt;/a&gt; just blew us out of the water running through all 50 projects in 15min! &lt;a href="https://twitter.com/hashtag/KubeCon?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#KubeCon&lt;/a&gt;&lt;a href="https://t.co/qh9eVE4q3N" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/qh9eVE4q3N" rel="noopener noreferrer"&gt;https://t.co/qh9eVE4q3N&lt;/a&gt;&lt;/p&gt;— LiveWyer (@LiveWyerUK) &lt;a href="https://twitter.com/LiveWyerUK/status/1389891093605634051?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;May 5, 2021&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsya2xo9tfausglwj4xe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbsya2xo9tfausglwj4xe.png" alt="image" width="578" height="247"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://twitter.com/LitmusChaos?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@LitmusChaos&lt;/a&gt; &lt;a href="https://t.co/U405GfALZZ" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/U405GfALZZ" rel="noopener noreferrer"&gt;https://t.co/U405GfALZZ&lt;/a&gt;&lt;/p&gt;— David McKay (&lt;a class="mentioned-user" href="https://dev.to/rawkode"&gt;@rawkode&lt;/a&gt;) &lt;a href="https://twitter.com/rawkode/status/1389873545497522176?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;May 5, 2021&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;p&gt;Our KubeCon talk on &lt;a href="https://www.youtube.com/watch?v=_DgCc4-BLW8" rel="noopener noreferrer"&gt;Putting Chaos into Continuous Delivery to increase Application resilience&lt;/a&gt;, delivered along with the fantastic team at Keptn, was well attended, with many questions on the philosophy of automated chaos. For those interested, there is more coming on the integration with Keptn, including some very cool use cases around auto-remediation!  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv6nhb95njhy3c7p34vm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhv6nhb95njhy3c7p34vm.png" alt="image" width="594" height="466"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;Key takeaways from the talk! 😍&lt;br&gt;Glad to have &lt;a href="https://twitter.com/KitopiHQ?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@KitopiHQ&lt;/a&gt; as an adopter of both &lt;a href="https://twitter.com/keptnProject?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@keptnProject&lt;/a&gt; &amp;amp; &lt;a href="https://twitter.com/LitmusChaos?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@LitmusChaos&lt;/a&gt;.&lt;br&gt;&lt;br&gt;This establishes a process of continuously evaluating resiliency! &lt;a href="https://t.co/Rs8QvoKogk" rel="noopener noreferrer"&gt;pic.twitter.com/Rs8QvoKogk&lt;/a&gt;&lt;/p&gt;— LitmusChaos (Chaos Engineering for Kubernetes) (&lt;a class="mentioned-user" href="https://dev.to/litmuschaos"&gt;@litmuschaos&lt;/a&gt;) &lt;a href="https://twitter.com/LitmusChaos/status/1390245923633078275?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;May 6, 2021&lt;/a&gt;
&lt;/blockquote&gt;  

&lt;p&gt;Alois Reitbauer (&lt;a href="https://twitter.com/AloisReitbauer" rel="noopener noreferrer"&gt;https://twitter.com/AloisReitbauer&lt;/a&gt;), co-chair of the CNCF SIG App Delivery, presented &lt;a href="https://youtu.be/74VpVe-pZx0?t=1481" rel="noopener noreferrer"&gt;some interesting findings in this space&lt;/a&gt;, with chaos engineering &amp;amp; Litmus getting a special mention. He discussed some common experimentation trends and how organizations are embracing a shift-left approach with chaos. And what better validation of these thoughts than from the experts at GitLab! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb26krhjk0q8o24iyhmt0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb26krhjk0q8o24iyhmt0.png" alt="image" width="591" height="456"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;Chaos Engineering. I'm here for it. Shift left in CI/CD. 🙌&lt;a href="https://twitter.com/hashtag/KubeCon?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#KubeCon&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/CloudNativeCon?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#CloudNativeCon&lt;/a&gt; &lt;a href="https://t.co/YeS8YFJkgr" rel="noopener noreferrer"&gt;pic.twitter.com/YeS8YFJkgr&lt;/a&gt;&lt;/p&gt;— Michael Friedrich, go SLO ... conf 📈🇦🇹 (&lt;a class="mentioned-user" href="https://dev.to/dnsmichi"&gt;@dnsmichi&lt;/a&gt;) &lt;a href="https://twitter.com/dnsmichi/status/1389936157853618182?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;May 5, 2021&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;p&gt;Next up was probably the &lt;a href="https://youtu.be/btGFt5-37hs?t=740" rel="noopener noreferrer"&gt;most interesting demo&lt;/a&gt; involving Litmus at KubeCon (one that followed a nice intro/whiteboard session). Saiyam (Civo) &amp;amp; Karthik (Verica) presented how one could leverage the GitOps &amp;amp; event-triggered chaos features in the Litmus platform to test the sanity of changes made to your application deployments on the cluster. It also had some cool takeaways on how you should approach chaos engineering &amp;amp; up your reliability game.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v2g9wtlxlztuzk91r6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v2g9wtlxlztuzk91r6i.png" alt="image" width="597" height="710"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;The demo included -&amp;gt; &lt;a href="https://twitter.com/LitmusChaos?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@LitmusChaos&lt;/a&gt;, &lt;a href="https://twitter.com/CivoCloud?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@CivoCloud&lt;/a&gt; Kubernetes, &lt;a href="https://twitter.com/PrometheusIO?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@PrometheusIO&lt;/a&gt; and black box exporter &lt;a href="https://twitter.com/grafana?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@grafana&lt;/a&gt;, &lt;a href="https://twitter.com/fluxcd?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@fluxcd&lt;/a&gt;, &lt;a href="https://twitter.com/github?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@github&lt;/a&gt; actions!&lt;br&gt;&lt;br&gt;Repo -&amp;gt; &lt;a href="https://t.co/dHrXYESCd9" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/dHrXYESCd9" rel="noopener noreferrer"&gt;https://t.co/dHrXYESCd9&lt;/a&gt; &lt;a href="https://t.co/7Gd35fmmj1" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/7Gd35fmmj1" rel="noopener noreferrer"&gt;https://t.co/7Gd35fmmj1&lt;/a&gt;&lt;/p&gt;— Saiyam Pathak (@SaiyamPathak) &lt;a href="https://twitter.com/SaiyamPathak/status/1390290962748903434?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;May 6, 2021&lt;/a&gt;
&lt;/blockquote&gt;  

&lt;p&gt;In between the various talks, we had the opportunity to hold the Office Hours (thanks a ton to the CNCF for organizing this), where community members and first-time users got a chance to learn about the project, and the advanced ones got to probe us on the roadmap :). What better than waking up to messages like this from folks who attended the office hours: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj85heqmmdf7nbj0kaxu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpj85heqmmdf7nbj0kaxu.png" alt="image" width="589" height="295"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://twitter.com/hashtag/Chaos?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#Chaos&lt;/a&gt; &lt;a href="https://twitter.com/LitmusChaos?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@LitmusChaos&lt;/a&gt; was a big surprise for me in &lt;a href="https://twitter.com/hashtag/kubeconEU?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#kubeconEU&lt;/a&gt;. They have an excellent solution for controlled and scheduled chaos validation in production environments for many services running on K8s. Great solution for Disaster Recovery and self healing tests&lt;/p&gt;— Matheus Fidelis @ 127.0.0.1 (&lt;a class="mentioned-user" href="https://dev.to/fidelissauro"&gt;@fidelissauro&lt;/a&gt;) &lt;a href="https://twitter.com/fidelissauro/status/1389564772866220033?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;May 4, 2021&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;p&gt;Amidst all this, we had our SIG-Docs lead &lt;a href="https://twitter.com/Divya_Mohan02" rel="noopener noreferrer"&gt;@Divya_Mohan02&lt;/a&gt; &lt;a href="https://www.youtube.com/watch?v=445UZFpbG90&amp;amp;list=PLHgdNuGxrJt0ZHXZlIxDi6DTdfJbXGFo1&amp;amp;index=15" rel="noopener noreferrer"&gt;participating&lt;/a&gt; in the awesome DoKC Day (Data on Kubernetes Community), where she shared her contribution journey in LitmusChaos, what it means to break things intentionally &amp;amp; how folks can get involved with the project. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu18f4r4mrivyvmig6gsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu18f4r4mrivyvmig6gsg.png" alt="image" width="584" height="289"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;"The art of intentionally breaking things"=chaos engineering/the title of &lt;a href="https://twitter.com/Divya_Mohan02?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@Divya_Mohan02&lt;/a&gt;'s talk about her experience working with &lt;a href="https://twitter.com/LitmusChaos?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@LitmusChaos&lt;/a&gt; &lt;a href="https://t.co/T3BlkKqXEX" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/T3BlkKqXEX" rel="noopener noreferrer"&gt;https://t.co/T3BlkKqXEX&lt;/a&gt;&lt;/p&gt;— Data on K8s Community (&lt;a class="mentioned-user" href="https://dev.to/dokcommunity"&gt;@dokcommunity&lt;/a&gt;) &lt;a href="https://twitter.com/DoKcommunity/status/1390222135851950083?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;May 6, 2021&lt;/a&gt;
&lt;/blockquote&gt;  
&lt;h1&gt;
  
  
  Looking Ahead
&lt;/h1&gt;

&lt;p&gt;While we feel chaos engineering &amp;amp; the LitmusChaos project’s approach towards it have been validated, &lt;strong&gt;there is much work to do &amp;amp; improve upon&lt;/strong&gt;. The impending Litmus 2.0 launch is expected to improve the user experience and bring a host of useful features. Having said that, what aspects are you most interested in seeing Litmus get better at? Which features would you like included, or integrations would you like to see? &lt;/p&gt;

&lt;p&gt;Feel free to get involved in these discussions on our Slack channel, or post your thoughts in the GitHub Discussions! &lt;/p&gt;




&lt;p&gt;Are you an SRE, developer, or Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community on &lt;a href="https://slack.litmuschaos.io/" rel="noopener noreferrer"&gt;Slack&lt;/a&gt; for detailed discussions &amp;amp; regular updates on Chaos Engineering for Kubernetes.&lt;/p&gt;

&lt;p&gt;Check out the LitmusChaos &lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and do share your feedback. Submit a pull request if you identify any necessary changes.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>cloudnative</category>
      <category>litmuschaos</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Observability Considerations in Chaos: The Metrics Story</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Wed, 03 Mar 2021 18:03:39 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/observability-considerations-in-chaos-the-metrics-story-6cb</link>
      <guid>https://dev.to/litmus-chaos/observability-considerations-in-chaos-the-metrics-story-6cb</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Some of the recent queries from members of the LitmusChaos community have been around how they can observe the impact of chaos. The frequency of these queries, coupled with the feedback we got from several Litmus users (&lt;a href="https://github.com/litmuschaos/litmus/issues/1902" rel="noopener noreferrer"&gt;like the SREs at F5 Networks&lt;/a&gt;), resulted in some interesting features and also triggered several meetup appearances discussing precisely this: “Observability Considerations in Chaos Engineering”. This blog (and a few that will follow) is a summary of those discussions and an attempt to define what observability means in the context of chaos &amp;amp; what hooks Litmus provides in this area. &lt;/p&gt;

&lt;p&gt;What we will not do is define either observability (read this &lt;a href="https://copyconstruct.medium.com/monitoring-in-the-time-of-cloud-native-c87c7a5bfa3e" rel="noopener noreferrer"&gt;extremely insightful blog&lt;/a&gt; by Cindy Sridharan) or chaos (we have done this ad nauseam in other Litmus blogs) individually :) &lt;/p&gt;

&lt;h2&gt;
  
  
  Observability in Chaos Engineering
&lt;/h2&gt;

&lt;p&gt;Observability ties in with chaos engineering in a couple of major ways: &lt;/p&gt;

&lt;h5&gt;
  
  
  (A) Chaos testing of your Observability infrastructure
&lt;/h5&gt;

&lt;p&gt;Are you getting the right metrics? Is all information logged correctly? Do you have notifications/alerts arriving as expected? Chaos testing the observability stack is often the easiest way to introduce a chaos engineering practice in an organization, and this is a view shared by many DevOps engineers we have spoken to over the years. The value demonstrated here gives stakeholders the confidence to try it on real business applications sooner rather than later. Having said that, if the business app happens to be an observability solution - even better ;) &lt;/p&gt;

&lt;h5&gt;
  
  
  (B) Observability infra as supportive tooling to visualize chaos &amp;amp; validate the hypothesis
&lt;/h5&gt;

&lt;p&gt;Chaos experimentation is largely about hypothesizing around application and/or infrastructure behavior, controlling the blast radius, &amp;amp; measuring SLOs. SREs love to visualize the impact of chaos - either actively (live) or recorded (as with automated chaos tests).&lt;/p&gt;

&lt;p&gt;Point (A) is more about process, culture, and engineering ethos, while (B) is a technical requirement asked of the chaos framework with respect to what hooks it provides for generating observability data &amp;amp; also consuming it from the environment it is deployed into. &lt;/p&gt;

&lt;p&gt;We have had some interesting use cases of (A) in the litmus community - notably Zebrium using it to verify if their ML-based autonomous monitoring solution functions effectively and Intuit using litmus to test the resilience of their Prometheus &amp;amp; Wavefront cluster add-ons. &lt;/p&gt;

&lt;p&gt;As far as (B) goes, until quite recently (a few months ago) Chaos (K8s) Events were the only observability element generated by the Litmus microservices (metrics were mostly rudimentary), and users needed additional aids such as the Heptio event router to convert these Kubernetes events into metrics in order to bring them onto their dashboards. There have been some cool developments since. &lt;/p&gt;

&lt;p&gt;Let me explain them in more detail. &lt;/p&gt;

&lt;h2&gt;
  
  
  Improved Litmus Chaos Exporter to create &lt;em&gt;Chaos-Interleaved&lt;/em&gt; dashboards
&lt;/h2&gt;

&lt;p&gt;Chaos-interleaved dashboards (i.e., instrumented app dashboards/panels with Prometheus functions applied to chaos metrics in order to indicate the period of chaos injection/experiment execution) are something that &lt;a href="https://dev.to/ksatchit/monitoring-litmus-chaos-experiments-198a"&gt;has been in practice&lt;/a&gt; and, as mentioned previously, used the event router. While this did help, it was by no means a perfect solution - upstream implementations of the event router didn’t have the desired event-filtering capabilities (leading to increased space consumption on the TSDB), and events are relatively ephemeral entities that disappear from etcd after a specified period of time. As a result, the &lt;a href="https://github.com/litmuschaos/chaos-exporter" rel="noopener noreferrer"&gt;chaos exporter&lt;/a&gt; was improved to generate more meaningful metrics indicating the state &amp;amp; results of the chaos experiments - the source being a richer (schema-wise) ChaosResult CR. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph7wjm221gjzmww5ad2p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fph7wjm221gjzmww5ad2p.png" alt="image" width="800" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These metrics (especially &lt;em&gt;litmuschaos_awaited_experiments&lt;/em&gt;) can be used to annotate your dashboards to view the application behavior during the chaos period. Here is a sample Kafka dashboard interleaved with chaos metrics: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j4csvvb80mc631vynsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2j4csvvb80mc631vynsw.png" alt="image" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find references for this dashboard &lt;a href="https://github.com/chaoscarnival/bootcamps/blob/main/day1-kafkaChaos/grafana/kafka-jmx.json" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;
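&lt;p&gt;As an illustrative sketch of the interleaving itself, a Grafana annotation backed by a chaos exporter metric can shade the chaos window on any panel. The datasource name and label selector below are assumptions, not values prescribed by Litmus:&lt;/p&gt;

```json
{
  "annotations": {
    "list": [
      {
        "name": "chaos-period",
        "datasource": "Prometheus",
        "expr": "litmuschaos_awaited_experiments{job='chaos-exporter'} > 0",
        "enable": true,
        "step": "10s",
        "titleFormat": "chaos injection",
        "iconColor": "rgba(255, 96, 96, 1)"
      }
    ]
  }
}
```

Any sample whose value is non-zero (i.e., an experiment is still being awaited) produces an annotation, visually marking the chaos period against the app's own panels.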

&lt;h2&gt;
  
  
  Automated Validation of Hypothesis
&lt;/h2&gt;

&lt;p&gt;Oftentimes, chaos experiments are automated, and there is a need to record application behavior (and any other environmental factors worth observing, depending on the use case) to determine whether an experiment was successful in proving a “Hypothesis”. The hypothesis can be defined as the expectation you, the chaos practitioner (developer, SRE), have of your system under the chaotic condition it is subjected to. Typically, it involves the following: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;(a) Defining a Steady State for your application/infrastructure &lt;/li&gt;
&lt;li&gt;(b) Anticipated deviation from Steady State &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It may also involve other expectations around (a) &amp;amp; (b), such as the expected MTTD (mean time to detect failures and begin the recovery flow within a self-healing infrastructure) and MTTR (mean time to recover and get back to steady state and optimal operational characteristics). These details are typically expressed as application metrics via the respective exporters/plugins. &lt;/p&gt;
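&lt;p&gt;As a toy illustration of these two expectations (the helper name, inputs, and thresholds here are ours for the sketch, not part of Litmus), an automated check might compute the observed durations from event timestamps and compare them against the hypothesized bounds:&lt;/p&gt;

```python
# Hypothetical helper (not part of Litmus): compare observed MTTD/MTTR
# against the bounds declared in the steady-state hypothesis.

def check_recovery_hypothesis(fault_injected_at: float,
                              failure_detected_at: float,
                              steady_state_restored_at: float,
                              max_mttd_s: float,
                              max_mttr_s: float) -> dict:
    """Return observed detection/recovery durations and whether both bounds hold."""
    mttd = failure_detected_at - fault_injected_at          # time to detect
    mttr = steady_state_restored_at - failure_detected_at   # time to recover
    return {
        "mttd_s": mttd,
        "mttr_s": mttr,
        "hypothesis_holds": mttd <= max_mttd_s and mttr <= max_mttr_s,
    }

# Fault injected at t=0s, detected at t=12s, steady state restored at t=45s;
# hypothesis: detection within 30s, recovery within 60s.
result = check_recovery_hypothesis(0.0, 12.0, 45.0, max_mttd_s=30.0, max_mttr_s=60.0)
print(result["hypothesis_holds"])  # True: 12s <= 30s and 33s <= 60s
```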

&lt;p&gt;It is also possible to have other notions of steady state - the range can be as diverse as the availability of a downstream microservice, custom app status retrieved via a dedicated CLI, values of different status fields on Kubernetes Custom Resources (especially with operators doing lifecycle management of apps), etc. (you can find more details about this &lt;a href="https://dev.to/ispeakc0de/declarative-approach-to-chaos-hypothesis-using-litmus-probes-5157"&gt;here&lt;/a&gt;). However, for a majority of cases, these tend to be auxiliary checks over the course of an experiment, with the core constraints of the steady-state hypothesis revolving around app metrics.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://docs.litmuschaos.io/docs/litmus-probe/#promprobe" rel="noopener noreferrer"&gt;promProbe&lt;/a&gt; was introduced to address this very requirement. With this probe, users can define an expected value (or range) for app metrics (that are queried from a Prometheus end-point) via promQL queries at various stages of the experiment execution (before/after chaos OR through the chaos period). The success of the probe determines the eventual “verdict” of the experiment. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59spmwv3ocdjrun2wjy5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59spmwv3ocdjrun2wjy5.png" alt="image" width="800" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find references for this probe definition &lt;a href="https://github.com/chaoscarnival/bootcamps/blob/3b62d5e8f4b4d5034b8627935a840096055db775/day1-kafkaChaos/chaos-workflow/kafka-wf-probe.yaml#L62" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
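&lt;p&gt;To make the shape of such a probe concrete, here is a minimal sketch of a promProbe entry as it would appear in an experiment spec. The endpoint, query, and threshold are illustrative values; refer to the linked probe docs for the authoritative schema:&lt;/p&gt;

```yaml
# Fragment of a chaos experiment spec - a promProbe that asserts the
# 5xx error rate stays low throughout the chaos window.
# (endpoint, query, and threshold are illustrative, not prescribed values)
probe:
  - name: check-error-rate
    type: promProbe
    promProbe/inputs:
      endpoint: "http://prometheus-k8s.monitoring.svc.cluster.local:9090"
      query: "sum(rate(http_requests_total{status=~'5..'}[1m]))"
      comparator:
        criteria: "<="     # probe passes while the queried value stays within bounds
        value: "2"
    mode: Continuous       # evaluate periodically through the chaos period
    runProperties:
      probeTimeout: 5
      interval: 5
      retry: 3
```

With `mode: Continuous`, a sustained breach of the threshold fails the probe and flips the experiment verdict, automating the hypothesis validation described above.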

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Observability has many more facets than just metrics, such as logs, traces, and events. Analytics, too, is slowly being drafted in as an element of (or at least at the periphery of) observability in recent times. In subsequent blog posts, we will discuss what Litmus has to offer in these areas and what improvements we are looking at. &lt;/p&gt;




&lt;p&gt;Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community’s #litmus channel in &lt;a href="https://kubernetes.slack.com/messages/CNXNB0ZTN" rel="noopener noreferrer"&gt;Kubernetes Slack&lt;/a&gt;&lt;br&gt;
Contribute to LitmusChaos and share your feedback on &lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;br&gt;
If you like LitmusChaos, become one of the many stargazers &lt;a href="https://github.com/litmuschaos/litmus/stargazers" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>litmuschaos</category>
      <category>devops</category>
    </item>
    <item>
      <title>Learn more with Chaos Engineering Bootcamps</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Tue, 09 Feb 2021 17:34:09 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/learn-more-with-chaos-engineering-bootcamps-47k8</link>
      <guid>https://dev.to/litmus-chaos/learn-more-with-chaos-engineering-bootcamps-47k8</guid>
      <description>&lt;p&gt;There is an unmistakable buzz and excitement that the &lt;a href="https://chaoscarnival.io" rel="noopener noreferrer"&gt;chaos carnival&lt;/a&gt; is generating, and I just can’t wait for it to begin. While tons of material has been generated about chaos engineering - the what, why &amp;amp; how over the last few years, nothing beats the experience of hearing actual practitioners talk about their learnings. The chaos carnival has an impressive line-up of such speakers and their experiences &amp;amp; accounts will add to the abundant wealth of info out there. As members of the organizing team, and more importantly advocates of cloud-native chaos engineering, we felt this is also a good opportunity to ease people into chaos engineering via simple bootcamps during the break period. &lt;/p&gt;

&lt;p&gt;As the attendees flit through different tracks, they will also have an opportunity to take a low-level, hands-on intro to chaos use cases - that we will accomplish using LitmusChaos. Unlike most demos/intro sessions we have conducted in the past during meetups &amp;amp; conferences, the idea behind these bootcamps is to take an actual real-world application (like Kafka, Percona MySQL) - not a hello service :) - and inject chaos with a well-defined hypothesis around the failure. &lt;/p&gt;

&lt;p&gt;During the session, you will also be introduced to the observability elements in chaos engineering &amp;amp; how experiments both generate and consume observability info that helps you validate the hypothesis. Considering these bootcamps are scheduled during breaks, they will be split into two parts: an instructor-led tutorial explaining each step, followed by an optional guided lab where participants are provided access to a pre-packaged Kubernetes cluster environment, bootstrapped with the necessary dependencies, so they can recreate and experience the chaos steps themselves. &lt;/p&gt;

&lt;p&gt;During the tutorial, we will also discuss the KPIs you can use to measure progress in your chaos engineering efforts. &lt;/p&gt;

&lt;p&gt;To participate in the bootcamp on Days 1 &amp;amp; 2 of the event, please join &amp;amp; track the announcements in the chaos carnival Slack channel. If you’d also like to perform the optional lab exercise, please reach out via email (&lt;a href="mailto:karthik.s@mayadata.io"&gt;Karthik S&lt;/a&gt;, &lt;a href="mailto:udit.gaurav@mayadata.io"&gt;Udit Gaurav&lt;/a&gt; or &lt;a href="mailto:shubham.chaudhary@mayadata.io"&gt;Shubham Chaudhary&lt;/a&gt;) or via DM on the &lt;a href="https://chaoscarnival.slack.com/archives/C018Q4T4F2A" rel="noopener noreferrer"&gt;chaos carnival&lt;/a&gt; Slack channel. There is a limited set of pre-packaged environments available - access to which will be emailed on a first-come-first-served basis and will stay active for the duration of the lab! &lt;/p&gt;

&lt;p&gt;We hope that these sessions are fun, pique your interest in chaos engineering, reduce your apprehension and inspire you to dig more so you take those bold steps towards application resilience. They would also serve as a nice indicator of the prerequisites needed to kickstart these efforts in your organization. &lt;/p&gt;

</description>
      <category>litmuschaos</category>
      <category>kubernetes</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>LitmusChaos: A Reflection On The Past Six Months</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Mon, 23 Nov 2020 18:21:46 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/litmuschaos-a-reflection-on-the-past-six-months-d6a</link>
      <guid>https://dev.to/litmus-chaos/litmuschaos-a-reflection-on-the-past-six-months-d6a</guid>
      <description>&lt;p&gt;This blog’s title suggests that it may have arrived a month early. After all, reflections are mostly written at year-end with predictions and goals for the new year. But to those of us accustomed to compartmentalizing our lives around the festival called “KubeCon”, this doesn’t seem out of place. Also, the choice of six months is thanks to the fact that we have been a CNCF sandbox project for (nearly) this duration. While some of us have been lucky to witness first-hand the ongoings in the project from a vantage point as maintainers, we felt the need for us to share them with the larger litmus community. So, this article is going to be a mix of updates as well as opinions/trends we are observing in the cloud-native chaos engineering space, with generous references to specific docs/blog posts/videos highlighting them. &lt;/p&gt;

&lt;h1&gt;
  
  
  The Project Has Grown...
&lt;/h1&gt;

&lt;p&gt;...And how! Thanks in no small part to being accepted as a CNCF sandbox project. While we were convinced of the need for a cloud-native chaos engineering solution (which is what motivated us to start Litmus in the first place), the reach and vibrant nature of the CNCF developer &amp;amp; end-user community amplified interest in the project, leading to increased contributions from developers across different organizations (including RedHat, Intuit, ContainerSolutions, Microsoft, etc.) as well as new adopters (Okteto, WeScale, NetApp). It also paved the way for greater collaboration &amp;amp; integrations with other projects on the CNCF landscape that are focused on solving different kinds of challenges around application delivery (for example: Okteto, Argo, Keptn, Spinnaker), which we shall delve into a bit more in subsequent sections. More importantly, it has helped generate more dialogue with diverse folks at various stages of the cloud-native journey - right from those that have crossed the chasm &amp;amp; turned into mature, well-oiled adopters of the paradigm, to those that are in the process of re-architecting their applications into microservices and migrating them onto their first Kubernetes staging clusters. The Litmus Slack channel has never been more active (it saw a 70%+ increase in members), and we are also grateful for all those on-demand Zoom calls and pages’ worth of Slack conversations - it would be an understatement to say that more than 90% of the features, enhancements, and fixes that went in during this period were direct results of this dialogue. Talk about being community-driven! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3do1hmvjxyaqcp13ad4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3do1hmvjxyaqcp13ad4c.png" alt="Alt Text" width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Evolution Of (Cloud-Native) Chaos Engineering
&lt;/h1&gt;

&lt;p&gt;While the principles around which we helped define the &lt;a href="https://www.cncf.io/blog/2019/11/06/cloud-native-chaos-engineering-enhancing-kubernetes-application-resiliency/" rel="noopener noreferrer"&gt;&lt;em&gt;sub-category&lt;/em&gt;&lt;/a&gt; of cloud-native chaos engineering (a little over a year ago) continue to hold true (validated by the emergence of other projects sharing similar values), we have noticed some developments in this space since. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;These are also things that influenced us as we developed and iterated on the litmus framework, resulting in several new features &amp;amp; integrations.&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Prominent amongst those has been the call for a “left-shift” of chaos - which is now increasingly viewed as part of the delivery pipelines, as against something run only in production environments by the SRE or Ops personas. This is definitely an interesting departure from traditionally held views about chaos engineering. While the “exploratory” model of chaos engineering, with controlled failures on prod &amp;amp; well-defined hypotheses in a gameday setting, is still the Nirvana of a mature DevOps practice, we found an urgent need amongst the community to do it much earlier, in an automated way (read: CI/CD pipelines), in the wake of overhauls to application architectures (making them microservices-oriented) &amp;amp; their impending/ongoing migration to Kubernetes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note that this is not to say chaos engineering doesn’t apply to monolithic apps or non-Kubernetes environments, just that the recent shift to a cloud-native mindset has brought about new practices.&lt;/em&gt;&lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;Where do you use or intend to use Chaos Engineering in your Kubernetes architecture?&lt;a href="https://twitter.com/hashtag/chaosengineering?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#chaosengineering&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/Kubernetes?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#Kubernetes&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/cncf?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#cncf&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/DevOps?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#DevOps&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/k8s?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#k8s&lt;/a&gt;&lt;/p&gt;— LitmusChaos (Chaos Engineering for Kubernetes) (&lt;a class="mentioned-user" href="https://dev.to/litmuschaos"&gt;@litmuschaos&lt;/a&gt;) &lt;a href="https://twitter.com/LitmusChaos/status/1274978096106532864?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;June 22, 2020&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;p&gt;A &lt;a href="https://youtu.be/s-8HwXMdeMk?t=786" rel="noopener noreferrer"&gt;panel discussion&lt;/a&gt; we curated recently to dig more perspective around this topic and other trends in this space was especially enlightening. So have been the conversations with different users (the persona ranging from application developers to people classifying themselves as “devops engineers” focused on delivery tooling, QA architects, SREs, VPs &amp;amp; CTOs). Here is a summarization: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reliability verification is needed right from the earliest stages of application/platform development, something that is being termed the “Chaos First” principle. &lt;/li&gt;
&lt;/ul&gt;


&lt;blockquote&gt;
&lt;p&gt;Chaos first is a really important principle. It’s too hard to add resilience to a finished project. Needs to be a gate for deployment in the first place. &lt;a href="https://t.co/xi98y8JZpJ" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/xi98y8JZpJ" rel="noopener noreferrer"&gt;https://t.co/xi98y8JZpJ&lt;/a&gt;&lt;/p&gt;— adrian cockcroft (&lt;a class="mentioned-user" href="https://dev.to/adrianco"&gt;@adrianco&lt;/a&gt;) &lt;a href="https://twitter.com/adrianco/status/1201703004014907392?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;December 3, 2019&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Chaos to “test” or unearth issues in the observability frameworks is increasingly seen as a way to lower the barrier to entry for chaos engineering in organizations that are generally apprehensive about the idea. Litmus has seen adoption in multiple organizations with this use case; in some cases, it is even the main motivation (for example, testing log-based anomaly detection in &lt;a href="https://github.com/litmuschaos/litmus/blob/master/adopters/organizations/zebrium.md#how-do-we-use-litmus" rel="noopener noreferrer"&gt;Zebrium&lt;/a&gt;!)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A definite need has been identified for methods to define the steady state of systems &amp;amp; hypothesize about the impact of chaos in a declarative way, without which the advantage of a Kubernetes-native/YAML-based approach is nullified. This is a prerequisite for automated SLO validation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Organic integrations are emerging between cloud-native chaos toolsets &amp;amp; observability frameworks, with native support for chaos events, metrics (and other elements). Observability was &amp;amp; continues to be a prerequisite for chaos engineering. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The “run-everywhere” drive in chaos engineering has brought it to the developer’s doorstep, and it has become a much-used strategy in dev tests. This, combined with the self-service model where developers are provided “namespaces” with “quotas”, has resulted in the need for chaos frameworks to be “multi-tenant” aware, with implications ranging from the RBACs needed for experimentation to resource consumption at scale.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
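
&lt;p&gt;As a sketch of what such a declarative hypothesis can look like, here is a Litmus “probe” snippet of the kind that can be embedded in an experiment spec. The service URL and probe name are hypothetical, and exact field names may vary across Litmus versions:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical httpProbe: the experiment verdict depends on the
# target service answering 200 OK throughout the chaos duration.
probe:
  - name: check-frontend-availability
    type: httpProbe
    httpProbe/inputs:
      url: http://frontend.demo.svc.cluster.local:8080/health
      method:
        get:
          criteria: ==          # compare the response code
          responseCode: "200"
    mode: Continuous            # evaluate throughout the chaos run
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the steady state is expressed as data rather than code, the same hypothesis can be versioned, reviewed, and reused across pipelines like any other Kubernetes manifest.&lt;/p&gt;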


&lt;blockquote&gt;
&lt;p&gt;As a Kubernetes-native Chaos Engineering practitioner, how do you believe experiments should be?&lt;br&gt;.&lt;br&gt;.&lt;br&gt;Thoughts from &lt;a href="https://twitter.com/kubernauts?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@kubernauts&lt;/a&gt; ?&lt;a href="https://twitter.com/hashtag/ChaosEngineering?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#ChaosEngineering&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/Kubernetes?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#Kubernetes&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/OSS?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#OSS&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/DevOps?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#DevOps&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/SRE?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#SRE&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/CloudNative?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#CloudNative&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/Docker?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#Docker&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/k8s?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#k8s&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/k8sjp?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#k8sjp&lt;/a&gt;&lt;/p&gt;— LitmusChaos (Chaos Engineering for Kubernetes) (&lt;a class="mentioned-user" href="https://dev.to/litmuschaos"&gt;@litmuschaos&lt;/a&gt;) &lt;a href="https://twitter.com/LitmusChaos/status/1285587689148829696?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;July 21, 2020&lt;/a&gt;
&lt;/blockquote&gt; 

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The control offered by the chaos frameworks is extremely important to the SRE. That encompasses the flexibility to choose security configuration, isolate chaos resources, abort/stop chaos when needed, and even the simplicity of defining new experiments (hear Andreas Krivas from Container Solutions talk about it, based on production experience, in this &lt;a href="https://youtu.be/C57WqBU-Vts?t=647" rel="noopener noreferrer"&gt;user-interview&lt;/a&gt;)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another trend, still nascent, is the application of chaos engineering to the edge. Kubernetes for the edge has caught on in recent times, and chaos engineering is a natural extension! This calls for newer experiments, multi-arch support &amp;amp; resource-optimized deployments. This &lt;a href="https://github.com/litmuschaos/litmus/blob/master/adopters/organizations/raspbernetes.md" rel="noopener noreferrer"&gt;adoption story&lt;/a&gt; from Michael Fornaro (founder/maintainer of &lt;a href="https://github.com/raspbernetes" rel="noopener noreferrer"&gt;Raspbernetes&lt;/a&gt;) is an early indicator, and one that has driven the Litmus e2e team to prioritize platforms such as K3s as part of its e2e suite. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;While Kubernetes-native chaos is appreciated and has caught on, the need to inject failures into infrastructure components falling “outside” the ambit of the Kube API continues to be in demand, and is probably more important than ever. This includes support for node-level and disk-level failures, with growing requests for interoperability with different cloud providers. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;blockquote&gt;
&lt;p&gt;Which of these &lt;a href="https://twitter.com/hashtag/Kubernetes?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#Kubernetes&lt;/a&gt; Cluster would you prefer to use in your local development or CI?&lt;br&gt;---&lt;a href="https://twitter.com/LitmusChaos?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@LitmusChaos&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/kind?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#kind&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/microk8s?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#microk8s&lt;/a&gt; &lt;a href="https://twitter.com/kubernetesio?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@kubernetesio&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/Docker?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#Docker&lt;/a&gt;&lt;br&gt; &lt;a href="https://twitter.com/hashtag/k3s?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#k3s&lt;/a&gt; &lt;a href="https://twitter.com/Rancher_Labs?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@Rancher_Labs&lt;/a&gt;&lt;/p&gt;— Udit Gaurav (@udit_gaurav15) &lt;a href="https://twitter.com/udit_gaurav15/status/1306938070466310144?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;September 18, 2020&lt;/a&gt;
&lt;/blockquote&gt; 
&lt;h1&gt;
  
  
  Integrations with Other Projects in the CNCF Landscape
&lt;/h1&gt;

&lt;p&gt;With chaos engineering being accepted as an important cog in the application delivery process (Litmus is part of the App-Delivery SIG in CNCF), it was important that we interface with the standard cloud-native mechanisms enabling it, right from dev-test through CI pipelines, to automated validation upon continuous deployment into pre-prod environments (CD), and eventually in production. This resulted in the following integrations, which are expected to undergo a formal release process just like the core chaos framework. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Okteto (DevTest): Okteto simplifies the application development experience and reduces the cycles spent in getting a test image validated on the Kubernetes cluster. It shortens this “inner loop of development” by allowing “in-cluster development”, i.e., it helps spin up a complete dev environment on the cluster itself, with access to all the code in your workspace. It has a SaaS offering (Okteto Cloud) that provides Kubernetes namespaces with the option to pull in the LitmusChaos control plane, so that developers can gauge the impact of failures right away and fix them before hitting CI. You can leverage Okteto’s &lt;a href="https://dev.to/okteto/chaos-engineering-with-litmus-and-okteto-cloud-2p3l"&gt;litmus-enabled preconfigured development&lt;/a&gt; environments to deploy Litmus along with your application with a single click. On the other hand, Okteto is also being actively used for the &lt;a href="https://dev.to/ksatchit/litmus-sdk-devtest-your-chaos-experiments-with-okteto-4dkj"&gt;development of the chaos experiments&lt;/a&gt; themselves!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gitlab (CI): Running chaos experiments with “stricter” validations, not unlike failure tests, is something that is catching on. Litmus provides &lt;a href="https://dev.to/uditgaurav/litmuschaos-gitlab-remote-templates-6l2"&gt;chaos templates&lt;/a&gt; for Gitlab, an abstraction over the chaos CRDs, to extend your regular CI pipelines with new stages that inject various types of faults and validate a specific application or infra behavior. The usage models here are varied: some teams opt for quick Litmus-based e2e on low-footprint ephemeral clusters such as KIND that can be brought up within the build environment itself, thereby not needing to push test images to a registry, versus more elaborate “retrospective” pipelines running against cloud/hosted K8s clusters after the images are pushed to a test repository. &lt;a href="https://openebs.ci" rel="noopener noreferrer"&gt;openebs.ci&lt;/a&gt; is a great example of the latter.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Argo (Workflows): Argo is an incubating project in CNCF with a suite of sub-projects under its banner (workflows, CD, events, rollouts). Amongst those, the community (notably Intuit) foresaw great value in creating what we are now calling &lt;a href="https://dev.to/ksatchit/chaos-workflows-with-argo-and-litmuschaos-2po5"&gt;“chaos workflows”&lt;/a&gt;, which are essentially Argo workflows running one or more experiments, often in combination with tools to generate “real-world load”. This enables the creation of complex chaos scenarios (for example, you could simulate parallel multi-component failures, or chained chaos, i.e., cases where a sequence of certain failures causes unique issues) and imparts greater power to developers and SREs. Chaos workflows differ from vanilla Argo workflows in that the status of a workflow is determined by the success of the chaos experiments. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Keptn (CD): Keptn is an open-source cloud-native application life-cycle orchestration tool. An entrant into the CNCF sandbox alongside Litmus, a major use case of Keptn within Kubernetes involves defining a pipeline with one or more stages for deployment, testing, and remediation strategies. It provides a mechanism to implement a Quality Gate for the promotion of applications based on SLO validation. Litmus &lt;a href="https://dev.to/ksatchit/part-1-evaluating-resiliency-with-keptn-and-litmuschaos-30jo"&gt;integrates&lt;/a&gt; with Keptn (via a dedicated &lt;a href="https://github.com/keptn-sandbox/litmus-service" rel="noopener noreferrer"&gt;litmus-service&lt;/a&gt; in the Keptn control plane) to inject chaos into any stage of the Keptn pipeline, especially with background load to simulate real-world behavior, thereby lending greater strength to the SLO evaluation process. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Spinnaker (CD): In recent times, Spinnaker has introduced the &lt;a href="https://www.armory.io/blog/spinnaker-extensibility-new-heights-plugins/" rel="noopener noreferrer"&gt;plugin&lt;/a&gt; model to extend the capabilities of this popular continuous delivery platform while keeping a lean core. With an increasing number of organizations using Spinnaker for their CD needs, we introduced the &lt;a href="https://www.armory.io/blog/litmuschaos-in-your-spinnaker-pipeline/" rel="noopener noreferrer"&gt;Litmus plugin&lt;/a&gt;, which enables the creation of a custom stage for chaos, with the ability to inject different failures into the deployed application and create opinionated exit checks. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
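
&lt;p&gt;To make the “ephemeral cluster in the build environment” pattern concrete, here is a minimal sketch of a GitLab CI job that brings up a KIND cluster, installs the Litmus operator, and gates the pipeline on an experiment verdict. The job name, image, manifest paths, operator version, and resource names are illustrative assumptions, not the actual chaos templates linked above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# .gitlab-ci.yml (sketch): a dedicated chaos stage after the build
stages:
  - build
  - chaos

chaos-test:
  stage: chaos
  image: illustrative/kind-kubectl:latest   # assumed image bundling kind + kubectl
  script:
    # Ephemeral in-build cluster; no test images pushed to a registry
    - kind create cluster --name ci-chaos
    # Install the Litmus chaos operator (version is illustrative)
    - kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.13.0.yaml
    # Deploy the app under test, then a ChaosEngine manifest from the repo
    - kubectl apply -f deploy/app.yaml
    - kubectl apply -f chaos/pod-delete-engine.yaml
    # Crude wait for the experiment to complete (illustrative)
    - sleep 120
    # Fail the job unless the verdict is "Pass"
    # (the ChaosResult is named after the engine and experiment)
    - test "$(kubectl get chaosresult nginx-chaos-pod-delete -o jsonpath='{.status.experimentStatus.verdict}')" = "Pass"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the job exits non-zero on a failed verdict, the chaos stage behaves like any other test stage in the pipeline.&lt;/p&gt;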

&lt;h1&gt;
  
  
  Notable Features We Added
&lt;/h1&gt;

&lt;p&gt;As mentioned earlier, a lot of the new enhancements &amp;amp; features that we added during this period were a result of community requests and roadmap items crafted based on the trends observed. Here is a high-level list of some notable improvements: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The existing experiments were migrated to Golang, and 12 newer ones were added (including node-level chaos experiments). We also simplified the experiment development process via the &lt;a href="https://github.com/litmuschaos/litmus-go/tree/master/contribute/developer-guide" rel="noopener noreferrer"&gt;litmus-sdk&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://docs.litmuschaos.io/docs/chaosengine/" rel="noopener noreferrer"&gt;ChaosEngine&lt;/a&gt; CRD schema has been upgraded significantly to support various properties for the chaos pods, leading to granular control of the experiment (including pod/container-security context, image overrides &amp;amp; secrets, resource requests-limits, pod scheduling via selectors &amp;amp; tolerations, ability to inject custom labels &amp;amp; annotations, etc.,)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for chaos on container runtimes other than Docker (containerd, CRI-O) &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/ispeakc0de/declarative-approach-to-chaos-hypothesis-using-litmus-probes-5157"&gt;Litmus Probes&lt;/a&gt; were introduced to enable declarative steady-state definition and construct hypotheses &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://docs.litmuschaos.io/docs/scheduling/" rel="noopener noreferrer"&gt;Chaos-Scheduler&lt;/a&gt; was made available to help with the execution of continuous background chaos. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Support for newer &lt;a href="https://dev.to/ksatchit/deployment-modes-in-litmuschaos-1bpa"&gt;modes of operation&lt;/a&gt; (admin, namespaced) to enable centralized chaos management and operation in multi-tenant environments, respectively &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The CI &amp;amp; e2e practices in the Litmus project have improved over time (though there is much more to be done), with support for multi-arch builds, test runs for pull-request validation, as well as increased &lt;a href="https://litmuschaos.github.io/litmus-e2e/index" rel="noopener noreferrer"&gt;integration test coverage&lt;/a&gt; for the Litmus control plane and chaos experiments. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
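
&lt;p&gt;A trimmed ChaosEngine manifest illustrating some of the granular controls described above. The target app, namespace, selectors, and values are hypothetical, and the fields shown follow the schema current at the time of writing (it may differ across releases):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=nginx        # hypothetical target deployment
    appkind: deployment
  chaosServiceAccount: pod-delete-sa
  engineState: active
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
          # Granular control over the chaos (experiment) pods:
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
          nodeSelector:
            kubernetes.io/arch: amd64
          tolerations:
            - key: dedicated
              operator: Equal
              value: chaos
              effect: NoSchedule
&lt;/code&gt;&lt;/pre&gt;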

&lt;p&gt;Last but not least, we also introduced the Chaos Bird as our new mascot! &lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;Hola k8s folks, Unveiling the brand new mascot for Litmus, the Chaos Bird in grand fashion &lt;a href="https://twitter.com/KubeCon_?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@KubeCon_&lt;/a&gt;. The all new website (&lt;a href="https://t.co/hDS6S1xp9d" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/hDS6S1xp9d" rel="noopener noreferrer"&gt;https://t.co/hDS6S1xp9d&lt;/a&gt;) and ChaosHub (&lt;a href="https://t.co/BA4angVyNu" rel="noopener noreferrer"&gt;&lt;/a&gt;&lt;a href="https://t.co/BA4angVyNu" rel="noopener noreferrer"&gt;https://t.co/BA4angVyNu&lt;/a&gt;) are all about the Chaos Bird unfurling its magic in the cloud-native world!&lt;a href="https://twitter.com/hashtag/chaosengineering?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#chaosengineering&lt;/a&gt; &lt;a href="https://t.co/B3tVsqzwGR" rel="noopener noreferrer"&gt;pic.twitter.com/B3tVsqzwGR&lt;/a&gt;&lt;/p&gt;— Ishan Gupta (&lt;a class="mentioned-user" href="https://dev.to/code_igx"&gt;@code_igx&lt;/a&gt;) &lt;a href="https://twitter.com/code_igx/status/1295323815522656257?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;August 17, 2020&lt;/a&gt;
&lt;/blockquote&gt; 
&lt;h1&gt;
  
  
  Introducing the Litmus Portal
&lt;/h1&gt;

&lt;p&gt;Once we established, to ourselves and the community, that chaos workflows are the way forward in terms of “scaling” chaos (i.e., increasing injections with different “ordering” schemes to form complex scenarios), we decided to simplify this experience for users. After all, constructing YAMLs spanning hundreds of lines by hand is universally detested! Over time, we also learned that “scaling” chaos has other connotations too, such as being able to inject chaos across several “target” clusters from a single “control plane”, extending chaos execution &amp;amp; analysis to a team or group of engineers, etc. Not just that: chaos engineering is as much about visualization, introspection, and analysis of results as it is about fault injection. &lt;/p&gt;

&lt;p&gt;All this resulted in the creation of the &lt;a href="https://github.com/litmuschaos/litmus/tree/master/litmus-portal" rel="noopener noreferrer"&gt;Litmus Portal&lt;/a&gt; which offers a centralized management console to construct, schedule, and visualize chaos workflows on multiple clusters, connected to it via an agent/subscriber. It allows selecting chaos experiments from a trusted/specific git source and offers insights into application/infrastructure resiliency via a “resilience grading” mechanism that allows setting weights to experiment results. It also supports the generation of chaos reports and a git-based auth mechanism to aid chaos execution by teams. The portal is alpha-2 at the time of writing this article and is expected to evolve to include many other features over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr3i5l31z3vpbjot4jbtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr3i5l31z3vpbjot4jbtt.png" alt="Alt Text" width="800" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6amc2ubbdvy7vuxct9v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw6amc2ubbdvy7vuxct9v.png" alt="Alt Text" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  State of Governance
&lt;/h1&gt;

&lt;p&gt;One of the areas where we made significant strides in this period was project governance. While Litmus had become collaborative and “truly” open source when the maintainer/project-leadership group grew to include &lt;a href="https://github.com/sumitnagal" rel="noopener noreferrer"&gt;Sumit Nagal&lt;/a&gt; (Intuit), &lt;a href="https://github.com/k8s-dev" rel="noopener noreferrer"&gt;Jayesh Kumar&lt;/a&gt; (AWS) &amp;amp; &lt;a href="https://github.com/Jasstkn" rel="noopener noreferrer"&gt;Maria Kotlyarevskaya&lt;/a&gt; (Wrike) apart from the team at MayaData, we were looking for ways to decentralize the planning &amp;amp; maintenance activities of the various sub-projects within Litmus. We also realized that, over time, Litmus was being perceived and used in a variety of use cases: while some folks in the community were focused on integrations with other tools/frameworks, a few were interested in improving observability hooks, and yet others wanted to focus on improving the chaos orchestration logic to improve the chaos experience. We also gauged interest among some members in learning about and contributing to the project by improving the documentation. &lt;/p&gt;

&lt;p&gt;This led us to create &lt;a href="https://github.com/litmuschaos/litmus/wiki/Special-Interest-Groups" rel="noopener noreferrer"&gt;Special Interest Groups (SIGs)&lt;/a&gt; within the Litmus project (something that CNCF &amp;amp; Kubernetes have implemented to great effect) to allow community members to align themselves with like-minded individuals and improve the project by defining the direction in which the area/sub-projects should go. Each SIG is composed of a group of community members led by at least two SIG chairs, who also have the commit bits to a set of GitHub repositories that come under the purview of the area/sub-projects. Each SIG has a set of well-defined goals and non-goals &amp;amp; is also responsible for proposing deliverables for the monthly releases. The SIG teams typically meet once every two weeks to discuss progress and roadmap items, and to engage in cool demos/presentations. While some SIGs are formally operational (Documentation, Observability, Orchestration, Integrations), a few others (Deployment, Testing, CI) are yet to take off. We hope these will soon be functional! &lt;/p&gt;

&lt;h1&gt;
  
  
  Looking Ahead
&lt;/h1&gt;

&lt;p&gt;We recognize that adoption of chaos engineering as a practice is increasing across organizations, and sustained improvement is key to ensuring the cloud-native world is able to leverage Litmus successfully for its resilience needs. The project &lt;a href="https://github.com/litmuschaos/litmus/blob/master/ROADMAP.md" rel="noopener noreferrer"&gt;roadmap&lt;/a&gt; is continuously updated to capture newer requirements and stability improvements. To this effect, we have been planning a steering committee composed of people from the resilience, observability &amp;amp; application-delivery world who can help prioritize features, enrich project/product management practices, and generally bring real-world experience to help Litmus become a well-rounded solution! If you are interested, do reach out to us on the &lt;a href="https://slack.litmuschaos.io" rel="noopener noreferrer"&gt;litmus slack&lt;/a&gt; channel. &lt;/p&gt;

&lt;p&gt;So, to summarize, we are looking at the community getting involved in a bigger way in driving the project - via feedback and contributions. Needless to say, we are also focused on moving further along the path as a CNCF project which would accelerate achieving these goals. The next Kubecon is 6 months away, but we &lt;em&gt;will&lt;/em&gt; keep you all posted on what's new in Litmus.&lt;/p&gt;

&lt;p&gt;Before we end this post, it is worth looking at what Liz Rice, the TOC (Technical Oversight Committee) chair at CNCF, had to say about chaos engineering! &lt;/p&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://twitter.com/hashtag/CNCF?src=hash&amp;amp;ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;#CNCF&lt;/a&gt; TOC chair &lt;a href="https://twitter.com/lizrice?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@lizrice&lt;/a&gt; is sharing the 5 technologies to watch in 2021 according to the TOC:&lt;br&gt;1. Chaos engineering&lt;br&gt;2. &lt;a href="https://twitter.com/kubernetesio?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;@kubernetesio&lt;/a&gt; for the edge&lt;br&gt;3. Service mesh&lt;br&gt;4. Web assembly and eBPF&lt;br&gt;5. Developer + operator experience &lt;a href="https://t.co/aSRDTB0piN" rel="noopener noreferrer"&gt;pic.twitter.com/aSRDTB0piN&lt;/a&gt;&lt;/p&gt;— CNCF (@CloudNativeFdn) &lt;a href="https://twitter.com/CloudNativeFdn/status/1329863326428499971?ref_src=twsrc%5Etfw" rel="noopener noreferrer"&gt;November 20, 2020&lt;/a&gt;
&lt;/blockquote&gt; 



&lt;p&gt;Are you an SRE, developer, or Kubernetes enthusiast? Does chaos engineering excite you? Join our community on &lt;a href="//slack.litmuschaos.io"&gt;Slack&lt;/a&gt; for detailed discussions &amp;amp; regular updates on Chaos Engineering for Kubernetes.&lt;/p&gt;

&lt;p&gt;Check out the LitmusChaos &lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and do share your feedback. Submit a pull request if you identify any necessary changes.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>sre</category>
      <category>litmuschaos</category>
    </item>
    <item>
      <title>Part-1: Evaluating Resiliency with Keptn and LitmusChaos</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Thu, 19 Nov 2020 17:41:55 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/part-1-evaluating-resiliency-with-keptn-and-litmuschaos-30jo</link>
      <guid>https://dev.to/litmus-chaos/part-1-evaluating-resiliency-with-keptn-and-litmuschaos-30jo</guid>
      <description>&lt;h1&gt;
  
  
  Evolution of Chaos Engineering: From Shift-Right to Run-Everywhere
&lt;/h1&gt;

&lt;p&gt;When chaos engineering started gaining ground a few years back as an essential practice to ensure reliability, it brought about a change in the mindset of traditional software practitioners by forcing them to think about “shifting right”, i.e., testing in production. While this was not intended to replace traditional “shift-left” approaches to testing, it was more about the need to find out how applications behave amidst the vagaries that only a prod environment can offer. And chaos engineering, with its scientific hypothesis-based approach and blast-radius principles to control “the extent of chaos”, was (and is) seen as the way to go about it. After all, testing in production is not only about fault injection. It helps unearth issues beyond application misbehavior: around deployment practices, observability mechanisms, incident response, recovery practices, etc. As the SRE role has grown, chaos engineering has grown with it as an integral function, accelerated in no small part by organizations like Netflix &amp;amp; Amazon publicly sharing their stories and technologies. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnrgdckuaeisrznuuq78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpnrgdckuaeisrznuuq78.png" alt="Alt Text" width="456" height="325"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the emergence of the cloud-native paradigm and the ever-increasing adoption of Kubernetes has brought with it the challenges of re-architecting applications: to be distributed, loosely coupled (read: microservices), and containerized. It has also brought in new ways of dealing with “operational” aspects such as deployments and upgrades, storage (in the case of stateful components), maintaining HA, scaling, recovery, etc. Kubernetes, as the overarching orchestration system, provides multiple choices and approaches for implementing these. All this contributes to significant complexity, and with it, a lot of apprehension. &lt;/p&gt;

&lt;p&gt;Organizations that are consciously adopting cloud-native design principles and migrating to Kubernetes as the deployment platform of choice typically face the need to test different kinds of failure scenarios (including those of the Kube ecosystem components themselves) and to learn deeply about their application behavior and deployment practices through hypothesizing &amp;amp; repeated experimentation, often to build confidence before they hit production. In recent times, this has contributed to a “shift-left” approach in chaos engineering, with more and more organizations planning for chaos as part of their software delivery process. By that, we mean that reliability verification as a practice and end-goal, and by extension chaos experimentation, is no longer an SRE-only or Ops-only responsibility: developers are getting involved too. (Note that this doesn’t in any way undermine or replace testing in prod; that is still the ultimate goal of chaos engineering.) &lt;/p&gt;

&lt;h1&gt;
  
  
  How is Chaos as part of Continuous Delivery different from traditional Failure Testing?
&lt;/h1&gt;

&lt;p&gt;One question that arises here is how this differs from the failure testing one would expect QA teams (or developers wearing a QA hat) to perform. The differences are subtle, and how the chaos tooling is employed depends on the persona. Some prominent ones are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;With chaos engineering, a lot of emphasis is placed on the “what”, i.e., service-level objectives (SLOs), over the “how”, i.e., application functionality. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Chaos is expected to be done against systems mimicking production (typically called “staging”) environments. Kubernetes helps here: today it is the de-facto development platform as much as it is a deployment platform of choice, which makes it easier to achieve some degree of similarity with prod (via the right mix of scale, simulated traffic, and in cases where permissible - datasets cloned from prod) &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The focus is more about observation and inference rather than a pre-defined/strict “validation”. Having said this, the boundaries are a bit blurred here, and typically “chaos efforts” end up being a mix of both, with a nuanced practice seen to get developed as the organization matures in its journey. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One of the immediate impacts of this culture is the practice of integrating chaos into continuous delivery (CD) pipelines, with a dedicated stage for running experiments against a suitable pre-prod environment, and the results/findings from the experiment deciding on the promotion of the change (i.e., build artifact - often container images or a deployment/resource specification) into production. The process of gauging these findings is typically around consuming data from different service level indicators (often metrics from sources like Prometheus) and examining it against a predefined SLO. &lt;/p&gt;
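
&lt;p&gt;By way of illustration, the SLO against which such findings are examined can itself be declared as config. The snippet below follows the shape of Keptn’s slo.yaml format; the indicator name and thresholds are hypothetical:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# slo.yaml (sketch): gate promotion on a p95 latency SLI
spec_version: "1.0"
comparison:
  compare_with: single_result
  include_result_with_score: pass
objectives:
  - sli: response_time_p95      # SLI sourced from, e.g., Prometheus
    pass:
      - criteria:
          - "&amp;lt;=+10%"            # at most 10% worse than the last passing run
          - "&amp;lt;600"              # absolute ceiling: 600 ms
    warning:
      - criteria:
          - "&amp;lt;=800"
total_score:
  pass: "90%"
  warning: "75%"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;A build that scores below the warning threshold during the chaos stage is simply not promoted, which is exactly the “gating” behavior described above.&lt;/p&gt;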

&lt;p&gt;In this blog, we introduce you to a means of implementing the above, i.e., executing chaos in CD pipelines with the ability to “gate” your deployments to production based on checks against SLOs. This we achieve using the cloud-native control plane for CD: &lt;a href="https://keptn.sh" rel="noopener noreferrer"&gt;Keptn&lt;/a&gt; &amp;amp; &lt;a href="https://litmuschaos.io" rel="noopener noreferrer"&gt;LitmusChaos&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;(You can read more about Litmus &lt;a href="https://dev.to/t/litmuschaos"&gt;here&lt;/a&gt;)&lt;/p&gt;

&lt;h1&gt;
  
  
  Keptn: Cloud-Native Application Life-Cycle Orchestration
&lt;/h1&gt;

&lt;p&gt;Keptn is an open-source, enterprise-grade, cloud-native application life-cycle orchestration tool. Keptn orchestrates continuous delivery and operations of your applications and was built to help create an “Autonomous Cloud”, which essentially means enabling your organization to become more autonomous when deploying and operating your apps &amp;amp; services on the new multi-hybrid cloud platforms, something that is termed “No-Ops” in the broader community. It can integrate with monitoring and observability platforms thanks to its event-driven architecture, with the communication being achieved using &lt;a href="https://cloudevents.io/" rel="noopener noreferrer"&gt;CloudEvents&lt;/a&gt; (itself an incubating CNCF project). &lt;/p&gt;

&lt;p&gt;A major use case of Keptn within Kubernetes involves defining a pipeline with one or more stages for deployment, testing, and remediation. A sequence in Keptn typically starts with the deployment of an application into a pre-prod namespace (the images or spec for this deployment can be the artifact of a CI process, with Keptn integrating with popular CI frameworks like Jenkins or Travis CI). Subsequent stages trigger tests and evaluate the operational characteristics of the application before subjecting it to a quality-gate evaluation, in which predefined SLOs (essentially rules created against SLIs derived from Prometheus, Dynatrace, or other sources) are validated. Success in this evaluation “promotes” the application to the next phase, say, deployment into production. Keptn also supports remediation based on observations in production, and it offers the flexibility of being installed/used for a specific use case (CD, quality gating, etc.). &lt;/p&gt;

&lt;p&gt;At the core of Keptn’s philosophy are GitOps &amp;amp; code/config generation: it stores all of its configuration in an internal Git repository that can be connected to GitHub or another Git-based SCM, and it applies changes from the Git repository to your environment. All of the stages and artifacts powering the pipeline, from applications and core services to SLOs, are Git-controlled, and much of the complexity is abstracted away from users via simple CLI operations (APIs are provided too) that generate the config specs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Litmus Service in Keptn
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffjnw43qpd3natg28lxbs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffjnw43qpd3natg28lxbs.png" alt="Alt Text" width="436" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Litmus community has been witnessing a steady increase in the number of “chaos left-shift” use cases. This led us to share ways of introducing Litmus chaos experiments into &lt;a href="https://dev.to/uditgaurav/litmuschaos-gitlab-remote-templates-6l2"&gt;GitLab&lt;/a&gt; &amp;amp; &lt;a href="https://www.armory.io/blog/litmuschaos-in-your-spinnaker-pipeline/" rel="noopener noreferrer"&gt;Spinnaker&lt;/a&gt; pipelines. As members of a community working on a framework that is &lt;em&gt;(a) built ground-up to be GitOps friendly&lt;/em&gt; and &lt;em&gt;(b) aligned with the “principles of chaos”&lt;/em&gt;, we were committed to finding a means of bringing the “hypothesis” &amp;amp; “SLO” elements into “pipeline-led” chaos. The &lt;a href="https://dev.to/ispeakc0de/declarative-approach-to-chaos-hypothesis-using-litmus-probes-5157"&gt;Litmus Probes&lt;/a&gt; feature was one of the first results of this introspection; today it provides a way to define “expectations” of your infrastructure or application under chaos, with definitions ranging from service availability to metrics &amp;amp; Kubernetes resource states. &lt;/p&gt;
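&lt;p&gt;As an illustrative sketch (field names follow the probe schema in the Litmus docs; the probe name, URL &amp;amp; values here are hypothetical), an &lt;code&gt;httpProbe&lt;/code&gt; declared in a ChaosEngine's experiment spec might look like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;probe:
  - name: check-frontend-availability    # hypothetical probe name
    type: httpProbe
    httpProbe/inputs:
      url: http://frontend.myapp.svc.cluster.local   # assumed service URL
      method:
        get:
          criteria: ==
          responseCode: "200"
    mode: Continuous          # evaluated throughout the chaos duration
    runProperties:
      probeTimeout: 5
      interval: 2
      retry: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;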

&lt;p&gt;It was during this time that we learned about the Keptn project. We were immediately impressed by its capabilities, especially its well-defined approach to “quality gating”. While probes are limited in scope to individual experiments, there was a need to evaluate SLOs at a broader level, across “periods” of chaotic activity: that is, the period in which applications are subjected to &lt;em&gt;real-world load&lt;/em&gt; and &lt;em&gt;real-world failures&lt;/em&gt; (one or more, run in sequence or in parallel). Keptn’s event-driven infrastructure provides just that. It evaluates SLOs (described by a simple, easy-to-understand configuration file) for the period in which the “test”, in this case the &lt;em&gt;real-world load&lt;/em&gt;, is run, while also allowing chaos injections to happen in the background to simulate the &lt;em&gt;real-world failures&lt;/em&gt;. Because all the control data is maintained as CloudEvents, it is convenient to visualize &amp;amp; consume, with the Keptn Bridge (dashboard) offering a useful view of how each event has played out &amp;amp; displaying the evaluation results. &lt;/p&gt;
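&lt;p&gt;For reference, a minimal Keptn SLO file of the kind evaluated by the quality gate could look like the following sketch (the SLI name and thresholds are purely illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spec_version: "1.0"
comparison:
  aggregate_function: avg
  compare_with: single_result
objectives:
  - sli: response_time_p95        # assumed SLI defined against Prometheus
    pass:
      - criteria:
          - "&amp;lt;=+10%"              # no worse than 10% over the baseline
          - "&amp;lt;600"                # and under 600ms in absolute terms
    warning:
      - criteria:
          - "&amp;lt;=800"
total_score:
  pass: "90%"
  warning: "75%"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;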

&lt;p&gt;Another feature worth mentioning here is that the Keptn control plane runs a single dedicated service for each integration, which acts on all pipelines executed in the system. This brings a low-touch approach to managing pipelines &amp;amp; abstracts away the complexity of “integrating” new chaos tests. &lt;/p&gt;

&lt;p&gt;Considering all this, &amp;amp; with excellent support from the Keptn team, we wrote a &lt;strong&gt;&lt;a href="https://github.com/keptn-sandbox/litmus-service" rel="noopener noreferrer"&gt;litmus service&lt;/a&gt;&lt;/strong&gt; that can now inject “chaos” into any stage of a Keptn pipeline. Functionally, it acts on the “deployment finished” CloudEvent and triggers chaos against the app deployed in a pre-prod namespace, while, at the same time, out-of-the-box stress tools generate a user-defined load profile against it, leading to the insights and benefits discussed earlier. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2kwcjfitv35e5lxeg446.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2kwcjfitv35e5lxeg446.png" alt="Alt Text" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Trying out the Litmus Service with a Demo Pipeline On Keptn
&lt;/h1&gt;

&lt;p&gt;Once we had the integration working, we showcased it to the community via a webinar to share our learnings and encourage everyone to try it out! During this presentation, we demonstrated a simple Keptn pipeline that verifies the resiliency of a “hello-service” using Litmus.&lt;/p&gt;

&lt;p&gt;You can find the presentation here: &lt;a href="https://docs.google.com/presentation/d/1ZFEwXqFIkpicM5-aRkLinWh8AWjombkg1d0Y-Z1DC68/edit#slide=id.p1" rel="noopener noreferrer"&gt;https://docs.google.com/presentation/d/1ZFEwXqFIkpicM5-aRkLinWh8AWjombkg1d0Y-Z1DC68/edit#slide=id.p1&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In part 2 of this blog series, we will discuss how you can reproduce the demo steps and, along the way, revisit the concepts discussed in this article. &lt;/p&gt;

&lt;p&gt;Stay tuned!! &lt;/p&gt;




&lt;p&gt;Are you an SRE, developer, or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community on &lt;a href="//slack.litmuschaos.io"&gt;Slack&lt;/a&gt; for detailed discussions &amp;amp; regular updates on Chaos Engineering for Kubernetes.&lt;/p&gt;

&lt;p&gt;Check out the LitmusChaos &lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and do share your feedback. Submit a pull request if you identify any necessary changes.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>litmuschaos</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Deployment Modes in LitmusChaos</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Wed, 02 Sep 2020 03:14:00 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/deployment-modes-in-litmuschaos-1bpa</link>
      <guid>https://dev.to/litmus-chaos/deployment-modes-in-litmuschaos-1bpa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Chaos Engineering as a practice is steadily pervading various stages of the microservices development life cycle, with different personas (across the spectrum, from app developers to SREs, DevOps functions maintaining CI pipelines, and service owners) making use of it to ensure resilience, albeit with contextual differences. Consequently, chaos frameworks today are expected to lend themselves to operating in different modes, considering the permissions, security constraints &amp;amp; allowed blast radius these personas operate with. &lt;/p&gt;

&lt;p&gt;In this blog, we shall discuss the various deployment/operational modes within &lt;a href="https://dev.to/umamukkara/introduction-to-litmuschaos-4ibl"&gt;Litmus&lt;/a&gt; that take these different personas into consideration, and provide steps to install it in a specific mode using Helm. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: It is assumed that readers have had a chance to &lt;a href="https://dev.to/uditgaurav/get-started-with-litmuschaos-in-minutes-4ke1"&gt;try-out&lt;/a&gt; litmus chaos experiments and are aware of the chaos operator, chaos CRDs &amp;amp; the &lt;a href="https://hub.litmuschaos.io" rel="noopener noreferrer"&gt;chaoshub&lt;/a&gt;. The subsequent sections can be understood better with this knowledge.&lt;/em&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  Modes of Execution in LitmusChaos
&lt;/h2&gt;

&lt;p&gt;While the essential nature of the Litmus chaos operator (the ability to orchestrate chaos) and of a given experiment (the ability to inject a specific fault) remains the same across the different modes, their scope (cluster-wide vs. namespaced) and impact differ. As you may have guessed, this corresponds to the way RBAC is set up for the Litmus components &amp;amp; the operator's watch range. &lt;/p&gt;

&lt;h3&gt;
  
  
  Admin Mode
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Persona&lt;/strong&gt;: An SRE or a cluster admin with complete autonomy over the cluster who wants to centralize chaos operations and avoid placing the chaos components in multiple target namespaces. They are also the executor of the chaos, pulling the experiment CRs from the hub, tuning them &amp;amp; running them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Characteristics&lt;/strong&gt;: The chaos operator is installed in a central admin ns (typically &lt;code&gt;litmus&lt;/code&gt;) along with a &lt;em&gt;superset&lt;/em&gt; cluster-wide RBAC that can execute all supported chaos experiments, including those against node resources. The operator is set up to watch for ChaosEngine CRs created in the &lt;strong&gt;same&lt;/strong&gt; admin namespace, though the target application can reside in a different namespace (specified by &lt;code&gt;.spec.appinfo.appns&lt;/code&gt; in the ChaosEngine CR). This ensures that all the chaos resources (chaos experiment CRs, runner/experiment/helper pods, chaosresult CRs) are maintained within the same admin namespace. &lt;/p&gt;
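&lt;p&gt;As a sketch (the namespace, labels &amp;amp; service account names below are illustrative), a ChaosEngine created in the admin namespace can point at an application elsewhere via &lt;code&gt;.spec.appinfo.appns&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
  namespace: litmus             # engine lives in the central admin namespace
spec:
  appinfo:
    appns: app-ns               # target application resides in its own namespace
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;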

&lt;p&gt;&lt;strong&gt;Steps to Deploy&lt;/strong&gt;: The &lt;a href="https://hub.helm.sh/charts/litmuschaos/litmus" rel="noopener noreferrer"&gt;litmus&lt;/a&gt; helm chart consists of a flag in the values.yaml to specify the mode:  &lt;code&gt;operatorMode&lt;/code&gt; which can be set to &lt;code&gt;admin&lt;/code&gt; during install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmuschaos litmuschaos/litmus --version 1.7.0 --namespace litmus --set operatorMode=admin
kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.7.0?file=charts/generic/experiments.yaml -n litmus 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Standard Mode
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Persona&lt;/strong&gt;: Chaos execution is delegated to &lt;em&gt;service owners&lt;/em&gt; in staging/prod environments with varying degrees of permissions, while the operator continues to reside in a central/admin namespace. The expectation in this mode is for the SRE/cluster-admin to install the litmus infra components (operator, CRDs) beforehand, with the operator being set up to watch ChaosEngine CRs across namespaces. This is done considering that individual service owners can create chaos resources (ChaosExperiment CRs installed from the hub, ChaosEngine CRs) in the (app/service) namespaces where they have access. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Characteristics&lt;/strong&gt;: While technically quite similar to the admin mode (the operator continues to remain in a central ns &amp;amp; orchestrate chaos across apps), here the executor of chaos is different &amp;amp; the service owners are expected to use their own ChaosServiceAccounts (or use the recommended per-experiment RBAC available on the chaoshub) for chaos execution. The chaos pods (runner/experiment/helper) too are created in the service namespaces to aid in better visibility/debug. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps to Deploy&lt;/strong&gt;: The &lt;code&gt;operatorMode&lt;/code&gt; should be set to &lt;code&gt;standard&lt;/code&gt; during install.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create ns litmus
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmuschaos litmuschaos/litmus --version 1.7.0 --namespace litmus --set operatorMode=standard
kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.7.0?file=charts/generic/experiments.yaml -n &amp;lt;service/app-namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Namespaced Mode
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Persona&lt;/strong&gt;: All chaos operations (infra setup, orchestration, and chaos execution) are managed by developers/DevOps engineers in their respective namespaces in a strictly multi-tenant environment, where the persona doesn't have access to cluster-wide resources such as nodes and CRDs and is generally restricted from running with elevated privileges, mounting hostPath/file volumes, etc. This mode of operation is especially useful in SaaS environments that provide Kubernetes namespaces for use, for example, &lt;a href="https://dev.to/okteto/chaos-engineering-with-litmus-and-okteto-cloud-2p3l"&gt;Okteto Cloud&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Characteristics&lt;/strong&gt;: In this mode, the operator is installed in the user's namespace and is set up to target applications it co-resides with, i.e., in the same namespace. Needless to say, the operator also expects &amp;amp; creates the chaos CRs (ChaosExperiment &amp;amp; ChaosEngine) and pods (runner/experiment/helper) in that same namespace. However, the namespaced mode requires the custom resource definitions (CRDs) to be pre-installed on the cluster by the admin before users attempt the operator installation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps to Deploy&lt;/strong&gt;: The &lt;code&gt;operatorMode&lt;/code&gt; should be set to &lt;code&gt;namespaced&lt;/code&gt; during install with the &lt;code&gt;--skip-crds&lt;/code&gt; flag used to ensure CRD install is not attempted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmuschaos litmuschaos/litmus --version 1.7.0 --set operatorMode=namespaced --namespace &amp;lt;developer-namespace&amp;gt; --skip-crds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: In the namespaced mode, only pod-level chaos experiments can be executed. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As of today, usage is spread across the different modes in the Litmus community, with &lt;code&gt;standard&lt;/code&gt; being the most commonly used (which is why it is set as the default mode). That said, this is one aspect of the framework we expect to keep changing, considering the rapid improvements in the Kubernetes ecosystem around role-based access and security &amp;amp; the constant evolution of deployment practices. For now, though, we would love to get your feedback on the current options and which one you prefer most! &lt;/p&gt;




&lt;p&gt;Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community on &lt;a href="//slack.litmuschaos.io"&gt;Slack&lt;/a&gt; for detailed discussions &amp;amp; regular updates on Chaos Engineering for Kubernetes.&lt;/p&gt;

&lt;p&gt;Check out the LitmusChaos &lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and do share your feedback. Submit a pull request if you identify any necessary changes.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>helm</category>
      <category>litmuschaos</category>
      <category>devops</category>
    </item>
    <item>
      <title>Litmus SDK: DevTest your chaos experiments with Okteto</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Fri, 14 Aug 2020 12:16:04 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/litmus-sdk-devtest-your-chaos-experiments-with-okteto-4dkj</link>
      <guid>https://dev.to/litmus-chaos/litmus-sdk-devtest-your-chaos-experiments-with-okteto-4dkj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The Litmus developer community is no stranger to the scaffolding utilities that the project provides to bootstrap chaos experiment code. There is a cool &lt;a href="https://dev.to/ispeakc0de/getting-started-with-litmus-sdk-358l"&gt;blog&lt;/a&gt; about it too. While the procedure helped developers get started quickly by generating all the necessary artifacts &amp;amp; laying out the standard experiment sequence, performing developer testing (popularly called DevTest) before the experiment got added to the &lt;a href="https://litmuschaos.github.io/litmus-e2e/" rel="noopener noreferrer"&gt;litmus-e2e&lt;/a&gt; suite for regular pipeline runs was still a &lt;em&gt;tad&lt;/em&gt; cumbersome. &lt;/p&gt;

&lt;h2&gt;
  
  
  Why Dev Testing On the Cluster is Necessary
&lt;/h2&gt;

&lt;p&gt;As with any application for Kubernetes, chaos experiments need to be tested, as we code along, in a Kubernetes cluster. While one can still run the experiment binaries locally by providing the right config (or the KUBECONFIG env variable), the developer is bound to make multiple runs with the experiment business logic running from a pod in the cluster, as that is the eventual execution mode. This becomes especially important when the experiments need to be injected with config information in the form of env variables, ConfigMaps, or Secrets, or when they make use of host files at runtime. &lt;/p&gt;

&lt;p&gt;Sometimes, the experiment may need to be run with a specific set of permissions or security policies, in which case the operational characteristics and experiment stability can be known only when it is run on an appropriate cluster environment. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem Statement
&lt;/h2&gt;

&lt;p&gt;Until now, this was achieved by building Kubernetes Job manifests with all the run characteristics burned into the spec &amp;amp; running a private Docker image from the developer's repository. However, in this model, any fixes, corrections, or enhancements mean repeating the cycle: &lt;em&gt;Fix Code -&amp;gt; Build Experiment -&amp;gt; Build Docker Image -&amp;gt; Push Docker Image -&amp;gt; Re-run Kubernetes Experiment Job&lt;/em&gt;. While one could churn out a quick script to do this, you would still need to maintain it &amp;amp; update it for different experiments' needs. I would like to avoid this additional complexity, wouldn't you? &lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Okteto
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/okteto/okteto" rel="noopener noreferrer"&gt;Okteto&lt;/a&gt; is a popular open-source project that enables you to perform what is called as "in-cluster development" by spinning up a dev environment (essentially a Kubernetes deployment on the cluster with the right base image replete with your favorite dev tools, described by a simple &lt;em&gt;okteto.yaml&lt;/em&gt; manifest) consisting of the application source code. The terminal for this development container is provided right within your workspace, with any code changes made on your IDE getting reflected on the dev containers via a sync mechanism. This enables developers to run the latest code directly in the cluster, thereby accelerating what the Okteto community calls the "inner loop of development". &lt;/p&gt;

&lt;h2&gt;
  
  
  Okteto for Litmus Chaos Experiments
&lt;/h2&gt;

&lt;p&gt;While Okteto is pretty nifty &amp;amp; helps people develop in the cluster, as the Litmus dev team we needed a mechanism where the development container is spun up with exactly the dependencies and config parameters that the &lt;em&gt;specific&lt;/em&gt; experiment requires. For example, the network chaos experiments need a different set of configuration params from, say, a disk-based experiment. This is where the &lt;strong&gt;swap&lt;/strong&gt; mode of Okteto helps: while Okteto can launch custom dev environments for your code, it can also swap container images on &lt;em&gt;existing deployments&lt;/em&gt;, thereby inheriting all their properties. Here is a nice &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-develop-applications-on-kubernetes-with-okteto" rel="noopener noreferrer"&gt;blog&lt;/a&gt; explaining this with a simple demo app.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe19seq815posxhuzhuno.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe19seq815posxhuzhuno.jpg" alt="Alt Text" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on this knowledge, we decided to tweak the scaffold utilities to generate a &lt;em&gt;test&lt;/em&gt; deployment, with the standard chaos experiment ENVs/other config details burned into the spec (such as the experiment-specific RBAC/service account) and a busybox image that does nothing, instead of the experiment Job spec that you have been used to seeing. With this, you can update the manifest to include any missing info, deploy it on the cluster, and eventually use Okteto to swap the image on this deployment for that of the development container to kickstart your DevTest process. &lt;/p&gt;
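&lt;p&gt;Conceptually, the generated test deployment is just a placeholder carrying the experiment's runtime context. The sketch below conveys the idea (the actual spec comes from the scaffold; the service account &amp;amp; ENV names here are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: apps/v1
kind: Deployment
metadata:
  name: litmus-experiment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: litmus-experiment
  template:
    metadata:
      labels:
        app: litmus-experiment
    spec:
      serviceAccountName: pod-delete-sa    # experiment-specific RBAC
      containers:
        - name: gotest
          image: busybox                   # does nothing; okteto swaps it later
          command: ["sleep", "3600"]
          env:
            - name: APP_NAMESPACE
              value: default
            - name: APP_LABEL
              value: app=nginx
            - name: TOTAL_CHAOS_DURATION
              value: "30"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;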

&lt;h2&gt;
  
  
  Steps to Perform DevTest of Litmus Chaos Experiments
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Refer to the previous (part-1) &lt;a href="https://dev.to/ispeakc0de/getting-started-with-litmus-sdk-358l"&gt;blog&lt;/a&gt; for the steps to bootstrap your experiment code. The subsequent instructions assume you have gone through those initial steps. By this point, you must: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Have already cloned the &lt;a href="https://github.com/litmuschaos/litmus-go" rel="noopener noreferrer"&gt;litmus-go&lt;/a&gt; repository (your fork, that is.)&lt;/li&gt;
&lt;li&gt;Be at the path &lt;code&gt;litmus-go/contribute/developer-guide&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Have generated the experiment artifacts &amp;amp; written the first-cut business logic of the chaos experiment &lt;/li&gt;
&lt;li&gt;Have access to a dev cluster (minikube or the like) with the kubeconfig set up from your workspace. (You could also use &lt;a href="https://cloud.okteto.com/" rel="noopener noreferrer"&gt;Okteto Cloud&lt;/a&gt;, which gives you a free namespace with enough resources and even SSL endpoints for your apps!) &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, let's test our changes in the cluster!! &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install the Okteto CLI
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  curl https://get.okteto.com -sSfL | sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;(Optional) Create a sample nginx deployment that can be used as 
the application under test (AUT).
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  kubectl create deployment nginx --image=nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Setup the RBAC necessary for execution of this experiment by 
applying the generated &lt;code&gt;rbac.yaml&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  kubectl apply -f rbac.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Modify the &lt;code&gt;test/test.yml&lt;/code&gt; with the desired values (app &amp;amp; chaos info) in the ENV and the appropriate &lt;code&gt;chaosServiceAccount&lt;/code&gt;, along with any other dependencies, if applicable (ConfigMaps, volumes, etc.), &amp;amp; create this deployment
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  kubectl apply -f test/test.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Go to the root of this repository (litmuschaos/litmus-go) &amp;amp; launch the Okteto development environment in your workspace. This takes you to a bash prompt in the dev container, into which the content of the litmus-go repo is loaded.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  root@test:~/okteto/litmus-go# okteto up 

  Deployment litmus-go doesn't exist in namespace litmus. Do you want to create a new one? [y/n]: y
  ✓  Development container activated
  ✓  Files synchronized

  The value of /proc/sys/fs/inotify/max_user_watches in your 
  cluster nodes is too low. This can affect file synchronization 
  performance.

  Visit https://okteto.com/docs/reference/known-issues/index.html 
  for more information.
      Namespace: default
      Name:      litmus-experiment
      Forward:   2345 -&amp;gt; 2345
                 8080 -&amp;gt; 8080

  Welcome to your development container. Happy coding!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dev container inherits the env, serviceaccount &amp;amp; other properties specified on the test deployment &amp;amp; is now suitable for running the experiment.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute the experiment against the sample app chosen &amp;amp; verify 
the steps via logs printed on the console.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  go run experiments/&amp;lt;chaos-category&amp;gt;/&amp;lt;experiment 
  name&amp;gt;/&amp;lt;experiment-name&amp;gt;.go
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;In parallel, observe the experiment execution via the changes to 
the chaos &amp;amp; application pods
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  watch -n 1 kubectl get pods
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If there are necessary changes to the code based on the run, make them via your favorite IDE. These changes are automatically reflected in the dev container. Re-run the experiment to confirm the changes. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once the experiment code is validated, stop/remove the development environment&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  root@test:~/okteto/litmus-go# okteto down
  ✓  Development container deactivated
  i  Run 'okteto push' to deploy your code changes to the cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;You could also run a final test using the ChaosExperiment CR with the final (pushed) image &amp;amp; a ChaosEngine CR mapping the experiment to the sample app we used earlier. For this, you need the chaos operator running on the cluster along with the chaos CRDs installed. This is a breeze on Okteto Cloud, which provides a &lt;a href="https://cloud.okteto.com/#/spaces/@personal?deploy=litmuschaos" rel="noopener noreferrer"&gt;single-click&lt;/a&gt; install of the Litmus infra. &lt;/li&gt;
&lt;/ul&gt;
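&lt;p&gt;Such a ChaosEngine could look like the sketch below, mapping a &lt;code&gt;pod-delete&lt;/code&gt; experiment to the nginx deployment created earlier (the service account and ENV values are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos
spec:
  appinfo:
    appns: default
    applabel: "app=nginx"
    appkind: deployment
  chaosServiceAccount: pod-delete-sa   # illustrative; use your experiment's SA
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "30"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;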

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As the Litmus community, we are always on the lookout for collaborations with cool projects &amp;amp; for building/adopting techniques that simplify your life as chaos engineers &amp;amp; Litmus developers. Give this a try &amp;amp; let us know your feedback!! &lt;/p&gt;




&lt;p&gt;Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you? Join our community on &lt;a href="//slack.litmuschaos.io"&gt;Slack&lt;/a&gt; for detailed discussions &amp;amp; regular updates on Chaos Engineering for Kubernetes.&lt;/p&gt;

&lt;p&gt;Check out the LitmusChaos &lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; and do share your feedback. Submit a pull request if you identify any necessary changes.&lt;/p&gt;

</description>
      <category>litmuschaos</category>
      <category>kubernetes</category>
      <category>testing</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monitoring Litmus Chaos Experiments</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Wed, 22 Jul 2020 19:46:09 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/monitoring-litmus-chaos-experiments-198a</link>
      <guid>https://dev.to/litmus-chaos/monitoring-litmus-chaos-experiments-198a</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Without Observability, there is no Chaos Engineering&lt;/em&gt;&lt;/strong&gt;. This is a line that I picked from a nice &lt;a href="https://devopsinstitute.com/2020/06/11/the-practice-of-chaos-engineering-observability/" rel="noopener noreferrer"&gt;article&lt;/a&gt; on chaos, and I couldn't agree more. The very nature of voluntary chaos injection demands that we have the right monitoring aids to validate our experiment's hypothesis around application/microservices behavior. Typically, that can be mapped to &lt;a href="https://blog.appoptics.com/the-four-golden-signals-for-monitoring-distributed-systems/" rel="noopener noreferrer"&gt;The Four Golden Signals&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Having said that, observability in itself has many facets, just as chaos engineering does. Chaos not only helps test resiliency in terms of service availability (HA); it is also a means to refine alerting and notification mechanisms, streamline the incident-response structure, and measure key performance indicators (KPIs) such as the mean time to detect an anomaly (MTTD), the mean time to recovery, say, to optimal performance (MTTR), and sometimes even the time to resolve (another MTTR!), whether via self-healing or manual effort, in cases where the chaos experiment is deliberately executed with a high blast radius. There are several tools one could employ today to obtain and visualize this data, which is the other facet of observability I mentioned earlier. Some tools can even help with automated root-cause analysis: check out this &lt;a href="https://www.youtube.com/watch?v=ju4OrEoNr50" rel="noopener noreferrer"&gt;cool demo&lt;/a&gt; by the folks from &lt;a href="https://www.zebrium.com/" rel="noopener noreferrer"&gt;Zebrium&lt;/a&gt;, which demonstrates automated detection of incidents induced via Litmus chaos experiments.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;While there is a lot to discuss and learn about the Whys &amp;amp; Hows of observability with chaos engineering, in this blog we shall get started with a simple means of mapping application behavior to chaos activity, i.e., a way to juxtapose application metrics with chaos events. And to do that, we will make use of the de-facto open-source monitoring stack of Prometheus &amp;amp; Grafana. This is intended to get you rocking on your chaos observability journey, which will get more exciting with continuous enhancements being added to the LitmusChaos framework.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test Bed
&lt;/h2&gt;

&lt;p&gt;What better than the sock-shop demo application to learn about microservices behavior? A quick set of commands should get you started. A Kubernetes cluster is all you need! &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Obtain the demo artefacts
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/litmuschaos/chaos-observability.git
cd chaos-observability/sample-application/sock-shop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Setup Sock-Shop Microservices Application
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create ns sock-shop
kubectl apply -f deploy/sock-shop/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Verify that the sock-shop microservices are running
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n sock-shop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Setup the LitmusChaos Infrastructure
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.6.0.yaml
kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml
kubectl apply -f https://hub.litmuschaos.io/api/chaos/1.6.0?file=charts/generic/experiments.yaml -n litmus 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Monitoring Aids
&lt;/h2&gt;

&lt;p&gt;The LitmusChaos framework generates various Kubernetes events against the ChaosEngine &amp;amp; ChaosResult custom resources, from the pre-chaos validation checks, through chaos injection, to the post-chaos health checks, so that the progress of a chaos experiment can be traced. Converting these events into metrics is a great way to integrate with existing off-the-shelf application dashboards and gain a clear understanding of application behavior through chaos injection and revert actions.  &lt;/p&gt;

&lt;p&gt;In this exercise, we make use of Heptio's &lt;a href="https://github.com/heptiolabs/eventrouter" rel="noopener noreferrer"&gt;event router&lt;/a&gt; to convert the chaos events into metrics and then instrument the standard sock-shop application's Grafana dashboard with appropriate queries to achieve our goal. &lt;/p&gt;
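To make the juxtaposition concrete, a dashboard panel can overlay a chaos-event metric on the application graphs. The metric and label names below are assumptions about what the event router's Prometheus sink exposes, not values taken from this post; check your own /metrics endpoint for the exact names:

```
# Hypothetical PromQL overlay: rate of Kubernetes events involving
# ChaosEngine objects, plotted alongside the sock-shop latency/QPS panels
sum(rate(heptio_eventrouter_normal_total{involved_object_kind="ChaosEngine"}[1m]))
```

A spike in this series on the same time axis as a latency panel visually marks when chaos was in progress.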

&lt;h2&gt;
  
  
  Setup the Monitoring Infrastructure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Step-1: Let's set up the event router with the HTTP sink to convert Kubernetes cluster events into metrics.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f deploy/litmus-metrics/01-event-router-cm.yaml
kubectl apply -f deploy/litmus-metrics/02-event-router.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Step-2: We will set up the Prometheus &amp;amp; Grafana deployments, with services of type NodePort (you could change them to LoadBalancer if you prefer)
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f deploy/monitoring/01-monitoring-ns.yaml
kubectl apply -f deploy/monitoring/02-prometheus-rbac.yaml
kubectl apply -f deploy/monitoring/03-prometheus-configmap.yaml
kubectl apply -f deploy/monitoring/04-prometheus-alert-rules.yaml
kubectl apply -f deploy/monitoring/05-prometheus-deployment.yaml
kubectl apply -f deploy/monitoring/06-prometheus-svc.yaml
kubectl apply -f deploy/monitoring/07-grafana-deployment.yaml
kubectl apply -f deploy/monitoring/08-grafana-svc.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Step-3: Access the Grafana dashboard via the NodePort (or LoadBalancer) service IP &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: To change the service type to LoadBalancer, run &lt;br&gt;
 &lt;code&gt;kubectl edit svc prometheus -n monitoring&lt;/code&gt; and replace &lt;br&gt;
 &lt;code&gt;type: NodePort&lt;/code&gt; with &lt;code&gt;type: LoadBalancer&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  kubectl get svc -n monitoring 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Default username/password credentials: admin/admin&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step-4: Add Prometheus as a data source for Grafana via the Grafana Settings menu&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F90xfvsqchfe80mhlc8mf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F90xfvsqchfe80mhlc8mf.png" alt="Alt Text" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Step-5: Import the Grafana dashboard "Sock-Shop Performance" provided &lt;a href="https://raw.githubusercontent.com/ksatchit/sock-shop/master/deploy/monitoring/10-grafana-dashboard.json" rel="noopener noreferrer"&gt;here&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F510lxfokwkwv7lvhpw85.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F510lxfokwkwv7lvhpw85.png" alt="Alt Text" width="800" height="468"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Execute the Chaos Experiments
&lt;/h2&gt;

&lt;p&gt;For the sake of illustration, let us execute a CPU hog experiment on the catalogue microservice &amp;amp; a memory hog experiment on the orders microservice in a staggered manner&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f chaos/catalogue/catalogue-cpu-hog.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wait for ~60s&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f chaos/orders/orders-memory-hog.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify execution of chaos experiments&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe chaosengine catalogue-cpu-hog -n litmus
kubectl describe chaosengine orders-memory-hog -n litmus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
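The contents of the ChaosEngine manifests are not reproduced in this post; a CPU hog ChaosEngine for the catalogue service typically has the following shape. All field values here are illustrative assumptions, so check chaos/catalogue/catalogue-cpu-hog.yaml in the cloned repository for the actual file:

```yaml
# Illustrative ChaosEngine sketch -- values are assumptions, not the
# exact contents of chaos/catalogue/catalogue-cpu-hog.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: catalogue-cpu-hog
  namespace: litmus
spec:
  appinfo:
    appns: sock-shop
    applabel: "name=catalogue"   # assumed label of the target deployment
    appkind: deployment
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION  # seconds of CPU stress
              value: "60"
            - name: CPU_CORES             # cores to hog per target pod
              value: "1"
```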



&lt;h2&gt;
  
  
  Visualize Chaos Impact
&lt;/h2&gt;

&lt;p&gt;Observe the impact of chaos injection through increased Latency &amp;amp; reduced QPS (queries per second) on the microservices under test.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2f0x6f6yxngnb6vnobea.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2f0x6f6yxngnb6vnobea.png" alt="Alt Text" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F065z5nlcfqn6t2ugejzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F065z5nlcfqn6t2ugejzr.png" alt="Alt Text" width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;As you can see, this is an attempt to correlate application metrics with the failures injected, and hence a good starting point in your chaos monitoring journey. Try this out &amp;amp; share your feedback! A lot more can be packed into the dashboards to make the visualization more intuitive. Join us in this effort and be part of SIG-Observability within LitmusChaos!!&lt;/p&gt;

&lt;p&gt;Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?&lt;br&gt;
Join Our Community On Slack For Detailed Discussion, Feedback &amp;amp; Regular Updates On Chaos Engineering For Kubernetes: &lt;a href="https://kubernetes.slack.com/messages/CNXNB0ZTN" rel="noopener noreferrer"&gt;https://kubernetes.slack.com/messages/CNXNB0ZTN&lt;/a&gt;&lt;br&gt;
(#litmus channel on the Kubernetes workspace)&lt;br&gt;
Check out the Litmus Chaos GitHub repo and do share your feedback: &lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;https://github.com/litmuschaos/litmus&lt;/a&gt;&lt;br&gt;
Submit a pull request if you identify any necessary changes.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>microservices</category>
      <category>litmuschaos</category>
      <category>observability</category>
    </item>
    <item>
      <title>Chaos Workflows with Argo and LitmusChaos</title>
      <dc:creator>Karthik Satchitanand</dc:creator>
      <pubDate>Tue, 23 Jun 2020 17:31:27 +0000</pubDate>
      <link>https://dev.to/litmus-chaos/chaos-workflows-with-argo-and-litmuschaos-2po5</link>
      <guid>https://dev.to/litmus-chaos/chaos-workflows-with-argo-and-litmuschaos-2po5</guid>
      <description>&lt;p&gt;This previous &lt;a href="https://dev.to/umamukkara/chaos-engineering-for-cloud-native-systems-2fjn"&gt;post&lt;/a&gt; introduced LitmusChaos as a cloud-native chaos engineering framework, that provides both, native off-the-shelf chaos experiments as well as the ability to orchestrate experiments written in the BYOC (bring-your-own-chaos) mode. You may also have tried your hand with this quick &lt;a href="https://dev.to/uditgaurav/get-started-with-litmuschaos-in-minutes-4ke1"&gt;litmus demo&lt;/a&gt;. Exciting as it already is, we have seen one more usage pattern evolve in the Litmus community: Chaos Workflows. Does this sound like some word-play between two popular dev(git)ops practices? Let me explain in detail. &lt;/p&gt;

&lt;h2&gt;
  
  
  Is it sufficient to &lt;em&gt;just&lt;/em&gt; inject a failure?
&lt;/h2&gt;

&lt;p&gt;One of the common reasons for injecting chaos (or, as it is commonly known, running a chaos experiment) in a microservices environment is to validate one’s hypothesis about system behavior during an unexpected failure. Today, this is a well-established practice, with a multitude of chaos injection tools built for the container (read: Kubernetes) ecosystem, enabling SREs to verify resilience in pre-production and production environments. &lt;/p&gt;

&lt;p&gt;However, when &lt;em&gt;simulating&lt;/em&gt; real-world failures via chaos injection on development/staging environments as part of a left-shifted, continuous validation strategy, it is preferable to construct a potential &lt;em&gt;failure sequence&lt;/em&gt; or &lt;em&gt;chaos workflow&lt;/em&gt; rather than execute standalone chaos injection actions. Often, this translates into failures &lt;em&gt;during&lt;/em&gt; a certain workload condition (such as, say, a percentage load), multiple (parallel) failures of dependent &amp;amp; independent services, failures under (already) degraded infrastructure, etc. The observations and inferences from these exercises are invaluable in determining the overall resilience of the applications/microservices in question.&lt;/p&gt;

&lt;h2&gt;
  
  
  LitmusChaos + Argo = Chaos Workflows
&lt;/h2&gt;

&lt;p&gt;While this is already practiced in some form, manually, by developers &amp;amp; SREs via gamedays and similar methodologies, there is a need to automate this, thereby enabling repetition of these complex workflows with different variables (maybe a product fix, a change to deployment environment, etc.). One of the early adopters of the Litmus project, &lt;a href="https://www.youtube.com/watch?v=Uwqop-s99LA&amp;amp;feature=youtu.be&amp;amp;t=720" rel="noopener noreferrer"&gt;Intuit&lt;/a&gt;, used the container-native workflow engine, Argo, to execute their chaos experiments (in BYOC mode via &lt;a href="https://hub.litmuschaos.io/chaostoolkit/k8-pod-delete" rel="noopener noreferrer"&gt;chaostoolkit&lt;/a&gt;) orchestrated by LitmusChaos to achieve precisely this. The community recognized this as an extremely useful pattern, thereby giving rise to &lt;strong&gt;Chaos Workflows&lt;/strong&gt;. &lt;/p&gt;

&lt;h2&gt;
  
  
  Using Chaos Workflows as an aid for benchmark tests
&lt;/h2&gt;

&lt;p&gt;In this blog, let's look at one use-case of chaos workflows. We shall examine how chaos impacts an Nginx server's performance characteristics, using a workflow that executes a standard benchmark job with a pod-kill chaos operation running in parallel. &lt;/p&gt;

&lt;h1&gt;
  
  
  Prepare the Chaos Environment
&lt;/h1&gt;

&lt;p&gt;In the next few sections, we shall lay the base for executing this workflow by setting up the infrastructure components.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Argo Workflow Infrastructure
&lt;/h3&gt;

&lt;p&gt;The Argo workflow infrastructure consists of the Argo workflow CRDs, the Workflow Controller, the associated RBAC &amp;amp; the Argo CLI. The steps below install Argo in the standard cluster-wide mode, where the workflow controller operates on all namespaces. Ensure that you have the right permissions to create the said resources.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create argo namespace
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~/chaos-workflows# kubectl create ns argo
namespace/argo created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Create the CRDs, workflow controller deployment with associated RBAC
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~/chaos-workflows# kubectl apply -f https://raw.githubusercontent.com/argoproj/argo/stable/manifests/install.yaml -n argo

customresourcedefinition.apiextensions.k8s.io/clusterworkflowtemplates.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/cronworkflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflows.argoproj.io created
customresourcedefinition.apiextensions.k8s.io/workflowtemplates.argoproj.io created
serviceaccount/argo created
serviceaccount/argo-server created
role.rbac.authorization.k8s.io/argo-role created
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-admin configured
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-edit configured
clusterrole.rbac.authorization.k8s.io/argo-aggregate-to-view configured
clusterrole.rbac.authorization.k8s.io/argo-cluster-role configured
clusterrole.rbac.authorization.k8s.io/argo-server-cluster-role configured
rolebinding.rbac.authorization.k8s.io/argo-binding created
clusterrolebinding.rbac.authorization.k8s.io/argo-binding unchanged
clusterrolebinding.rbac.authorization.k8s.io/argo-server-binding unchanged
configmap/workflow-controller-configmap created
service/argo-server created
service/workflow-controller-metrics created
deployment.apps/argo-server created
deployment.apps/workflow-controller created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Install the argo CLI on the test harness machine (where the kubeconfig is available)
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~# curl -sLO https://github.com/argoproj/argo/releases/download/v2.8.0/argo-linux-amd64

root@demo:~# chmod +x argo-linux-amd64

root@demo:~# mv ./argo-linux-amd64 /usr/local/bin/argo

root@demo:~# argo version
argo: v2.8.0
BuildDate: 2020-05-11T22:55:16Z
GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
GitTreeState: clean
GitTag: v2.8.0
GoVersion: go1.13.4
Compiler: gc
Platform: linux/amd64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Install a Sample Application: Nginx
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Install a simple multi-replica stateless Nginx deployment, with its service exposed over NodePort
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~# kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/App/nginx.yaml

deployment.extensions/nginx created
root@demo:~# kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/App/service.yaml 
service/nginx created

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Install Litmus Infrastructure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apply the LitmusChaos Operator manifest:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f https://litmuschaos.github.io/litmus/litmus-operator-v1.5.0.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Install the litmus-admin service account to be used by the chaos-operator while executing the experiment.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f https://litmuschaos.github.io/litmus/litmus-admin-rbac.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Install the Chaos experiment of choice (in this example, we pick a pod-delete experiment)
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl apply -f https://hub.litmuschaos.io/api/chaos/master?file=charts/generic/pod-delete/experiment.yaml -n litmus
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Create the Argo Access ServiceAccount
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create the service account and associated RBAC, which will be used by the Argo workflow controller to execute the actions specified in the workflow. In our case, this corresponds to launching the Nginx benchmark job and creating the chaosengine to trigger the pod-delete chaos action. We place these resources in the namespace where the Litmus chaos resources reside, i.e., litmus.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~# kubectl apply -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/Argo/argo-access.yaml -n litmus

serviceaccount/argo-chaos created
clusterrole.rbac.authorization.k8s.io/chaos-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/chaos-cluster-role-binding created

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h1&gt;
  
  
  Nginx traffic characteristics during a non-chaotic benchmark run
&lt;/h1&gt;

&lt;p&gt;Before proceeding with the chaos workflows, let us first look at how the benchmark run performs under normal circumstances &amp;amp; which properties are of note.&lt;/p&gt;

&lt;p&gt;To achieve this: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Let us run a simple Kubernetes job that internally executes an apache-bench test on the Nginx service with a standard input of 10000000 requests over a 300s period.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~# kubectl create -f https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/App/nginx-bench.yaml

job.batch/nginx-bench-c9m42 created
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ul&gt;
&lt;li&gt;Observe the output after the 5-minute duration &amp;amp; note the failed request count. Usually it is 0, i.e., there was no disruption in Nginx traffic.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~# kubectl logs -f nginx-bench-zq689-6mnrm

2020/06/23 01:42:29 Running: ab -r -c10 -t300 -n 10000000 http://nginx.default.svc.cluster.local:80/
2020/06/23 01:47:35 This is ApacheBench, Version 2.3 &amp;lt;$Revision: 1706008 $&amp;gt;
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking nginx.default.svc.cluster.local (be patient)
Finished 808584 requests


Server Software:        nginx/1.19.0
Server Hostname:        nginx.default.svc.cluster.local
Server Port:            80

Document Path:          /
Document Length:        612 bytes

Concurrency Level:      10
Time taken for tests:   300.001 seconds
Complete requests:      808584
Failed requests:        0
Total transferred:      683259395 bytes
HTML transferred:       494857692 bytes
Requests per second:    2695.27 [#/sec] (mean)
Time per request:       3.710 [ms] (mean)
Time per request:       0.371 [ms] (mean, across all concurrent requests)
Transfer rate:          2224.14 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   0.7      0      25
Processing:     0    3   2.0      3      28
Waiting:        0    3   1.9      2      28
Total:          0    4   2.2      3      33
WARNING: The median and mean for the initial connection time are not within a normal deviation
        These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
  50%      3
  66%      4
  75%      5
  80%      5
  90%      7
  95%      8
  98%      9
  99%     11
 100%     33 (longest request)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
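The figure to track across runs is the "Failed requests" count. As a small aid (not part of the article's tooling; just a sketch), it can be extracted from the raw ab output like so:

```python
import re

def failed_requests(ab_output: str) -> int:
    """Extract the 'Failed requests' count from apache-bench output."""
    match = re.search(r"Failed requests:\s+(\d+)", ab_output)
    if match is None:
        raise ValueError("no 'Failed requests' line in output")
    return int(match.group(1))

# Excerpt from the baseline run above
sample = """\
Complete requests:      808584
Failed requests:        0
Total transferred:      683259395 bytes
"""

print(failed_requests(sample))  # 0
```

Capturing this number per run makes it easy to compare baseline and chaotic iterations mechanically instead of eyeballing logs.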

&lt;h1&gt;
  
  
  Formulating a Hypothesis
&lt;/h1&gt;

&lt;p&gt;Typically, in most production deployments, the Nginx service is set up to guarantee specific SLAs in terms of tolerated errors, etc. While the server performs as expected under normal circumstances, it is also necessary to gauge how much degradation is seen for different levels of failure &amp;amp; what the cascading impact on other applications may be. The results obtained by inducing chaos may give us an idea of how best to manage the deployment (improved high-availability configuration, resources allocated, replica counts, etc.) to continue to meet the SLA despite a certain degree of failure (while that is an interesting topic for another day, we shall restrict the scope of this blog to demonstrating how workflows can be used!)&lt;/p&gt;

&lt;p&gt;In the next step, we shall execute a chaos workflow that runs the same benchmark job while a random pod-delete (Nginx replica failure) occurs and observe the degradation in the attributes we have noted: &lt;strong&gt;failed_requests&lt;/strong&gt;.&lt;/p&gt;
&lt;h1&gt;
  
  
  Create the Chaos Workflow
&lt;/h1&gt;

&lt;p&gt;Applying the workflow manifest performs the following actions in parallel:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starts an Nginx benchmark job for the specified duration (300s)&lt;/li&gt;
&lt;li&gt;Triggers a random pod-kill of the Nginx replica by creating the chaosengine CR. Cleans up after chaos.
&lt;/li&gt;
&lt;/ul&gt;
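The workflow manifest linked below achieves this parallelism with an Argo step group: steps listed in the same group (a single `- -` block) run concurrently. The following heavily abridged sketch shows the overall shape; template names and values are illustrative assumptions, not the exact contents of argowf-native-pod-delete.yaml:

```yaml
# Abridged, illustrative chaos workflow sketch
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: argowf-chaos-
spec:
  entrypoint: argowf-chaos
  serviceAccountName: argo-chaos
  templates:
    - name: argowf-chaos
      steps:
        - - name: run-benchmark          # both steps sit in one group,
            template: nginx-benchmark    # so Argo runs them in parallel
          - name: run-chaos
            template: pod-delete-chaosengine
```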
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~# argo submit https://raw.githubusercontent.com/litmuschaos/chaos-workflows/master/Argo/argowf-native-pod-delete.yaml -n litmus
Name:                argowf-chaos-sl2cn
Namespace:           litmus
ServiceAccount:      argo-chaos
Status:              Pending
Created:             Fri May 15 15:31:45 +0000 (now)
Parameters:
  appNamespace:      default
  adminModeNamespace: litmus
  appLabel:          nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Visualize the Chaos Workflow
&lt;/h3&gt;

&lt;p&gt;You can visualize the progress of the chaos workflow via the Argo UI. Convert the argo-server service to type NodePort &amp;amp; view the dashboard at &lt;code&gt;https://&amp;lt;node-ip&amp;gt;:&amp;lt;nodeport&amp;gt;&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~# kubectl patch svc argo-server -n argo -p '{"spec": {"type": "NodePort"}}'
service/argo-server patched

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3v2aibag1ybar68ojyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3v2aibag1ybar68ojyu.png" alt="image" width="346" height="558"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Observe the Nginx benchmark results
&lt;/h3&gt;

&lt;p&gt;Observing the Nginx benchmark results over 300s with a single random pod kill shows an increased count of failed requests.&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;root@demo:~# kubectl logs -f nginx-bench-7pnvv

2020/06/23 07:00:34 Running: ab -r -c10 -t300 -n 10000000 http://nginx.default.svc.cluster.local:80/
2020/06/23 07:05:37 This is ApacheBench, Version 2.3 &amp;lt;$Revision: 1706008 $&amp;gt;
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking nginx.default.svc.cluster.local (be patient)
Finished 802719 requests


Server Software:        nginx/1.19.0
Server Hostname:        nginx.default.svc.cluster.local
Server Port:            80

Document Path:          /
Document Length:        612 bytes

Concurrency Level:      10
Time taken for tests:   300.000 seconds
Complete requests:      802719
Failed requests:        866
   (Connect: 0, Receive: 289, Length: 289, Exceptions: 288)
Total transferred:      678053350 bytes
HTML transferred:       491087160 bytes
Requests per second:    2675.73 [#/sec] (mean)
Time per request:       3.737 [ms] (mean)
Time per request:       0.374 [ms] (mean, across all concurrent requests)
Transfer rate:          2207.20 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0  11.3      0    3044
Processing:     0    3  57.2      3   16198
Waiting:        0    3  54.2      2   16198
Total:          0    4  58.3      3   16199

Percentage of the requests served within a certain time (ms)
  50%      3
  66%      4
  75%      4
  80%      5
  90%      6
  95%      7
  98%      9
  99%     11
 100%  16199 (longest request)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Further iterations of these tests with increased pod-kill instances over the benchmark period, or an increased kill count (i.e., number of replicas killed at a time), can give more insights into the behavior of the service, in turn pointing us to mitigation procedures. &lt;/p&gt;
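Using the numbers from the two runs above (0 failed out of 808584 baseline requests vs. 866 failed out of 802719 under chaos), the degradation can be expressed as a simple error rate. A quick sketch, with the figures taken from this article's runs:

```python
def error_rate(failed: int, complete: int) -> float:
    """Failed requests as a fraction of completed requests."""
    return failed / complete

baseline = error_rate(0, 808584)    # non-chaotic benchmark run
chaotic = error_rate(866, 802719)   # run with a random pod kill

print(f"baseline error rate: {baseline:.4%}")  # 0.0000%
print(f"chaotic error rate:  {chaotic:.4%}")   # 0.1079%
```

Tracking this rate across iterations gives a single comparable number per run when varying the chaos parameters.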

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: To test with different variables, edit the ChaosEngine spec in the workflow manifest before re-submission.&lt;/p&gt;
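For reference, the tunables most worth editing in the embedded ChaosEngine spec are the experiment env vars. The env names below follow the Litmus 1.x pod-delete experiment; the values are assumptions to adjust for your own runs:

```yaml
# Illustrative env tweaks for the pod-delete ChaosEngine spec
# inside the workflow manifest (values are examples, not defaults)
experiments:
  - name: pod-delete
    spec:
      components:
        env:
          - name: TOTAL_CHAOS_DURATION  # total chaos window, in seconds
            value: "120"
          - name: CHAOS_INTERVAL        # gap between successive pod kills
            value: "30"
          - name: FORCE                 # forced (grace-period-0) deletion
            value: "true"
```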
&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;You can use Argo with LitmusChaos to construct complex chaos workflows, with pre-conditioning &amp;amp; dependencies built in. The parallel nature of execution can help you simulate multi-service/component failures to verify application behavior under worst-case scenarios. You can even stitch in recovery procedures based on error conditions. &lt;/p&gt;

&lt;p&gt;Do try this out &amp;amp; let us know what kind of workflows you would like to see being built within litmus! &lt;/p&gt;

&lt;p&gt;Are you an SRE or a Kubernetes enthusiast? Does Chaos Engineering excite you?&lt;/p&gt;

&lt;p&gt;Join Our Community On Slack For Detailed Discussion, Feedback &amp;amp; Regular Updates On Chaos Engineering For Kubernetes: &lt;em&gt;&lt;a href="https://kubernetes.slack.com/messages/CNXNB0ZTN" rel="noopener noreferrer"&gt;https://kubernetes.slack.com/messages/CNXNB0ZTN&lt;/a&gt;&lt;/em&gt; &lt;br&gt;
(#litmus channel on the Kubernetes workspace)&lt;br&gt;
Check out the Litmus Chaos GitHub repo and do share your feedback: &lt;em&gt;&lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;https://github.com/litmuschaos/litmus&lt;/a&gt;&lt;/em&gt;&lt;br&gt;
Submit a pull request if you identify any necessary changes.&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fassets.dev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/litmuschaos" rel="noopener noreferrer"&gt;
        litmuschaos
      &lt;/a&gt; / &lt;a href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;
        litmus
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Litmus helps  SREs and developers practice chaos engineering in a Cloud-native way. Chaos experiments are published at the ChaosHub  (https://hub.litmuschaos.io). Community notes is at https://hackmd.io/a4Zu_sH4TZGeih-xCimi3Q
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;&lt;a href="https://litmuschaos.io/" rel="nofollow noopener noreferrer"&gt;LitmusChaos&lt;/a&gt;&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://avatars.githubusercontent.com/u/49853472?s=200&amp;amp;v=4"&gt;&lt;img alt="LitmusChaos" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Favatars.githubusercontent.com%2Fu%2F49853472%3Fs%3D200%26v%3D4" width="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Open Source Chaos Engineering Platform&lt;/h3&gt;
&lt;/div&gt;

&lt;p&gt;&lt;a href="https://slack.litmuschaos.io" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/48844a2a28b222b400e3fbeb358770903bdcb4db693ac67fae37db71869352f6/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f536c61636b2d4a6f696e2d707572706c65" alt="Slack Channel"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer" href="https://github.com/litmuschaos/litmus/actions/workflows/push.yml/badge.svg?branch=master"&gt;&lt;img src="https://github.com/litmuschaos/litmus/actions/workflows/push.yml/badge.svg?branch=master" alt="GitHub Workflow"&gt;&lt;/a&gt;
&lt;a href="https://hub.docker.com/r/litmuschaos/chaos-operator" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2a2ae2b0d21ae55811e5da33f40b76a4aa18cd37d027ab13c5f3a4b6230ef55b/68747470733a2f2f696d672e736869656c64732e696f2f646f636b65722f70756c6c732f6c69746d75736368616f732f6368616f732d6f70657261746f722e737667" alt="Docker Pulls"&gt;&lt;/a&gt;
&lt;a href="https://github.com/litmuschaos/litmus/stargazers" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/e87b930f5b0b40039776ad0d255d62735311c7d7228bb7615d7408161915c0b2/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f6c69746d75736368616f732f6c69746d75733f7374796c653d736f6369616c" alt="GitHub stars"&gt;&lt;/a&gt;
&lt;a href="https://github.com/litmuschaos/litmus/issues" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/2f66885a8b6af96fe90f93b8aa9085f6065fb43e9c0a4a602219a01e7a142548/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f6973737565732f6c69746d75736368616f732f6c69746d7573" alt="GitHub issues"&gt;&lt;/a&gt;
&lt;a href="https://twitter.com/LitmusChaos" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6425eb889e05cdadcf0db2a952db7e9dda44ca3df091c5ca9feea8515a2ac44f/68747470733a2f2f696d672e736869656c64732e696f2f747769747465722f666f6c6c6f772f6c69746d75736368616f733f7374796c653d736f6369616c" alt="Twitter Follow"&gt;&lt;/a&gt;
&lt;a href="https://www.bestpractices.dev/projects/3202" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/a8d5ed826be562a3339ce503c6ff3f4ab2cc63227b18ba5f65648fdb3ca04eec/68747470733a2f2f7777772e626573747072616374696365732e6465762f70726f6a656374732f333230322f6261646765" alt="OpenSSF Best Practices"&gt;&lt;/a&gt;
&lt;a href="https://app.fossa.io/projects/git%2Bgithub.com%2Flitmuschaos%2Flitmus?ref=badge_shield" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/848851b5714a120bc84e06ebdb121f455a4a472a782a90d6f4efb681fa9be392/68747470733a2f2f6170702e666f7373612e696f2f6170692f70726f6a656374732f6769742532426769746875622e636f6d2532466c69746d75736368616f732532466c69746d75732e7376673f747970653d736869656c64" alt="FOSSA Status"&gt;&lt;/a&gt;
&lt;a href="https://www.youtube.com/channel/UCa57PMqmz_j0wnteRa9nCaw" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/8420bd8ea7dfc02f130445cdcf5f3adec631ea57f9682afc56af5940cc5ca044/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f596f75547562652d5375627363726962652d726564" alt="YouTube Channel"&gt;&lt;/a&gt;
&lt;a href="https://gurubase.io/g/litmuschaos" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d14b84f34d827a00ea3da041c025456fa77e2143b06396c17636f4531500eb23/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f47757275626173652d41736b2532304c69746d75734368616f73253230477572752d303036424646" alt="Gurubase"&gt;&lt;/a&gt;
&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;&lt;em&gt;Read this in &lt;a href="https://github.com/litmuschaos/litmus/translations/TRANSLATIONS.md" rel="noopener noreferrer"&gt;other languages&lt;/a&gt;.&lt;/em&gt;&lt;/h4&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href="https://github.com/litmuschaos/litmus/translations/README-ko.md" rel="noopener noreferrer"&gt;🇰🇷&lt;/a&gt; &lt;a href="https://github.com/litmuschaos/litmus/translations/README-chn.md" rel="noopener noreferrer"&gt;🇨🇳&lt;/a&gt; &lt;a href="https://github.com/litmuschaos/litmus/translations/README-pt-br.md" rel="noopener noreferrer"&gt;🇧🇷&lt;/a&gt; &lt;a href="https://github.com/litmuschaos/litmus/translations/README-hi.md" rel="noopener noreferrer"&gt;🇮🇳&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Overview&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;LitmusChaos is an open source Chaos Engineering platform that enables teams to identify weaknesses &amp;amp; potential outages in their infrastructure by
running chaos experiments in a controlled way. Developers &amp;amp; SREs can practice Chaos Engineering with LitmusChaos because it is easy to use, built on modern
Chaos Engineering principles, &amp;amp; community-driven. It is 100% open source &amp;amp; a CNCF project.&lt;/p&gt;
&lt;p&gt;LitmusChaos takes a cloud-native approach to creating, managing, and monitoring chaos. The platform itself runs as a set of microservices and uses Kubernetes
custom resources (CRs) to define both the chaos intent and the steady-state hypothesis.&lt;/p&gt;
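&lt;p&gt;As an illustrative sketch (not part of the original README), a minimal ChaosEngine custom resource expressing such chaos intent might look like the following; the engine name, namespace, and application label are hypothetical placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: nginx-chaos        # hypothetical engine name
  namespace: default       # hypothetical target namespace
spec:
  appinfo:
    appns: default
    applabel: app=nginx    # hypothetical label of the target app
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete     # an experiment published on the ChaosHub
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Applying such a CR asks the Litmus operator to run the listed experiment against the selected application and evaluate the steady-state hypothesis defined in the experiment's probes.&lt;/p&gt;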
&lt;p&gt;At a high level, Litmus comprises:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Control Plane&lt;/strong&gt;: A centralized chaos management tool called chaos-center, which helps construct, schedule and visualize Litmus chaos workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Execution Plane Services&lt;/strong&gt;: Made up of a…&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/litmuschaos/litmus" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


</description>
      <category>sre</category>
      <category>litmuschaos</category>
      <category>kubernetes</category>
      <category>chaosengineering</category>
    </item>
  </channel>
</rss>
