Chaos Engineering in AWS with FIS

Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production. It is not about creating chaos; it is about making the inherent chaos in real-world applications visible to you.

One of the best-known Chaos Engineering tools is Netflix's Chaos Monkey, which shuts down random machines in an environment so that the effect on availability can be observed.

In the AWS Well-Architected Framework, Chaos Engineering is part of the Reliability pillar.

The design principles of the Reliability pillar are:

  1. Automatically recover from failure
  2. Test recovery procedures
  3. Scale horizontally to increase aggregate workload availability
  4. Stop guessing capacity
  5. Manage change through automation

AWS FIS (Fault Injection Service, formerly Fault Injection Simulator) corresponds directly to the second principle and indirectly to the first.

FIS allows you to create chaos experiments and test whether your workload is designed reliably. It shows whether your recovery procedures work and whether you can automatically recover from failures. It also gives you an idea of the downtime to expect with the DR strategy you have chosen.

We recently had this question from a customer. Their architecture uses only two AZs, with a Multi-AZ RDS database and a cross-region read replica. If one of the AZs goes down, will the cross-region replica continue to function? Will it stay in sync with the DB after the Multi-AZ failover? For how long will the RDS DB and the read replica be inaccessible? We would like to answer "yes" to the first two questions and "a very short time" to the last. But how can we be sure?

You create an FIS experiment and test it out.

Key Features of AWS FIS

Managed Chaos Experiments: AWS FIS provides templates for creating and managing chaos experiments, allowing teams to focus on results rather than the intricacies of setup and execution.

Broad Fault Injection Capabilities: Simulate a wide range of failures, including server outages, network latency, unavailability of EC2 resources, and throttled database access, to understand their impact on your application.

Integration with AWS Services: Seamlessly integrate with other AWS services such as Amazon EC2, Amazon RDS, Amazon ECS, AWS Lambda, and Amazon VPC, enabling a comprehensive testing environment.

Safety and Security: AWS FIS is built with safety in mind, offering mechanisms to limit the blast radius of experiments and ensure that your production environments remain secure.

Experiment: An experiment in AWS Fault Injection Simulator (FIS) is a controlled procedure designed to assess the resilience of your AWS infrastructure by intentionally introducing faults or disruptions.

Template: An experiment template defines the actions (faults) to be executed, the targets (AWS resources) those actions will affect, and any conditions or constraints. Templates ensure experiments are reproducible and standardized.
An experiment is a template in action.

Actions: Actions are the specific faults or disruptions you want to introduce. AWS FIS supports a variety of actions, such as stopping an EC2 instance, injecting latency into a network, and throttling database I/O operations, among others.

Targets: Targets are the AWS resources upon which actions will be performed. Targets can be specified explicitly or selected dynamically based on tags or other identifiers, allowing for flexibility in defining the scope of the experiment. For example: which RDS instance to reboot, or which subnet should experience the network disruption.

Stop Conditions: To ensure safety and prevent unintended consequences, experiments can include stop conditions. These are criteria that, when met, automatically stop the experiment. A stop condition is defined as a CloudWatch alarm.
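
How these concepts fit together is easier to see when a template is written out as data. The sketch below is a minimal illustration, not the template built later in this article: it uses the EC2 stop action mentioned above and follows the request shape that boto3's `create_experiment_template` accepts. The tag `chaos-ready`, the target and action names, and both ARNs are placeholder assumptions.

```python
def build_minimal_template(alarm_arn: str, role_arn: str) -> dict:
    """Return a template that stops one EC2 instance selected by tag."""
    return {
        "description": "Stop one tagged EC2 instance",
        # Targets: which resources may be affected (selected dynamically by tag).
        "targets": {
            "oneInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {"chaos-ready": "true"},  # hypothetical tag
                "selectionMode": "COUNT(1)",  # pick a single matching instance
            }
        },
        # Actions: which fault to inject, and on which named target.
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "targets": {"Instances": "oneInstance"},
            }
        },
        # Stop conditions: a CloudWatch alarm that halts the experiment.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "roleArn": role_arn,
    }
```

Running an experiment from such a template is what turns the definition into "a template in action".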

Creating an Experiment Template

In the AWS FIS console, navigate to Experiment templates and create a new template.

Select the account this experiment will run on.

After giving the template a name and description, select the actions you want to include. I want a network disruption action that disrupts all network connectivity in one AZ, us-east-1a.

Action type: Network, aws:network:disrupt-connectivity
Target: (a target node is created automatically for the action; you will edit it later)
Duration: 2 minutes
Scope: all (all types of network connectivity are disrupted, including regional services like S3)

Now edit the Targets node.

You can select targets using resource tags, filters, or resource IDs directly. Here I select the subnet by its ID.
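
For readers who prefer to define the template outside the console, the network action and its subnet target could be expressed roughly as follows. This is a sketch under assumptions: the target name `az-subnet` is made up for illustration, the subnet ARN is supplied by the caller, and the 2-minute duration is written in the ISO 8601 form (`PT2M`) that FIS action parameters use.

```python
def network_disruption(subnet_arn: str, minutes: int = 2):
    """Return (targets, actions) fragments for the network disruption step."""
    target = {
        "resourceType": "aws:ec2:subnet",
        "resourceArns": [subnet_arn],  # explicit selection by resource ARN
        "selectionMode": "ALL",
    }
    action = {
        "actionId": "aws:network:disrupt-connectivity",
        "parameters": {
            "duration": f"PT{minutes}M",  # ISO 8601 duration
            "scope": "all",               # every type of connectivity
        },
        "targets": {"Subnets": "az-subnet"},  # refers to the target name below
    }
    return {"az-subnet": target}, {"network-disruption": action}
```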

Next, I create the action that takes the DB instance down:

Action type: RDS, aws:rds:reboot-db-instances
Start after: network-disruption (I want the DB to go down after the first action)
Target: (a target node is created automatically for the action; you will edit it later)
Force failover: Yes (this causes a failover to the standby instance)

Now edit the Targets node for this action and select the DB instance.
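
The same step can be sketched in code. The fragment below assumes the first action is named `network-disruption`, as in the walkthrough; the target name `primary-db` and the action name `db-failover` are illustrative. The small `start_order` helper is not part of FIS; it only shows how `startAfter` sequences the actions.

```python
def rds_failover(db_arn: str, after: str = "network-disruption"):
    """Return (targets, actions) fragments for the DB reboot with failover."""
    target = {
        "resourceType": "aws:rds:db",
        "resourceArns": [db_arn],
        "selectionMode": "ALL",
    }
    action = {
        "actionId": "aws:rds:reboot-db-instances",
        "parameters": {"forceFailover": "true"},  # fail over to the standby
        "targets": {"DBInstances": "primary-db"},
        "startAfter": [after],  # run only after the named action
    }
    return {"primary-db": target}, {"db-failover": action}


def start_order(actions: dict) -> list:
    """Order action names so each follows everything in its startAfter list."""
    ordered, done = [], set()
    pending = dict(actions)
    while pending:
        ready = sorted(
            name for name, spec in pending.items()
            if set(spec.get("startAfter", [])) <= done
        )
        if not ready:
            raise ValueError("startAfter references are cyclic or unknown")
        for name in ready:
            ordered.append(name)
            done.add(name)
            del pending[name]
    return ordered
```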

With the actions defined, specify the additional options:

Empty target resolution mode: What should happen if you select targets by tags and, at runtime, no targets are found? I am specifying that the experiment should fail in that case.

Service Access: Which IAM role will the experiment use? The role needs CloudWatch Logs permissions if you want to send experiment logs to a CloudWatch log group. The default IAM role for FIS does not include them, so edit its policy to add them.
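
As a rough guide to what "add that" means, the statement below lists the CloudWatch Logs actions that, to my understanding of the FIS documentation, the experiment role needs for log delivery. Treat the action list and the wildcard resource as assumptions to verify against the current documentation and to tighten for your account.

```python
import json


def fis_logging_policy() -> dict:
    """Return an IAM policy document granting FIS log delivery to CloudWatch Logs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogDelivery",
                    "logs:PutResourcePolicy",
                    "logs:DescribeResourcePolicies",
                    "logs:DescribeLogGroups",
                ],
                "Resource": "*",  # assumption: narrow this where possible
            }
        ],
    }


# Print the JSON to paste into the role's inline policy.
print(json.dumps(fis_logging_policy(), indent=2))
```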

Stop Condition: Suppose you wanted to bring down one AZ but ended up bringing down everything, and now you need to stop the experiment quickly. If you specify a CloudWatch alarm as a stop condition, the experiment stops when that alarm goes into the ALARM state.
For example, the alarm could be set to trigger when there are more than X messages in a queue.

You can also stop the experiment from the console by clicking "stop experiment" if you have access.
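
A queue-depth alarm of that kind, plus the two ways of halting an experiment, might look like this with boto3. It is a sketch: the alarm name prefix and the queue name are hypothetical, and the threshold is whatever "X" means for your workload. boto3 is imported lazily so the alarm definition can be built and inspected without AWS access.

```python
def queue_depth_alarm(queue_name: str, threshold: int) -> dict:
    """Build put_metric_alarm arguments: alarm when visible messages exceed threshold."""
    return {
        "AlarmName": f"fis-stop-{queue_name}",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


def create_alarm(queue_name: str, threshold: int) -> None:
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(**queue_depth_alarm(queue_name, threshold))


def trip_alarm(alarm_name: str) -> None:
    """Force the alarm into ALARM so the stop condition fires."""
    import boto3
    boto3.client("cloudwatch").set_alarm_state(
        AlarmName=alarm_name,
        StateValue="ALARM",
        StateReason="Manual stop of FIS experiment",
    )


def stop_now(experiment_id: str) -> None:
    """Equivalent of clicking 'stop experiment' in the console."""
    import boto3
    boto3.client("fis").stop_experiment(id=experiment_id)
```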

Logs: You can send the logs of the experiment to an S3 bucket or a CloudWatch Log Group.
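
Put together, the options chosen in this section map onto the template request roughly as shown below. The function is a sketch that takes the action and target definitions as arguments; the description text is illustrative, and the role, alarm, and log group ARNs are assumptions supplied by the caller.

```python
def template_request(actions: dict, targets: dict, role_arn: str,
                     alarm_arn: str, log_group_arn: str) -> dict:
    """Build keyword arguments for the FIS create_experiment_template call."""
    return {
        "description": "Lose one AZ, then fail over the Multi-AZ DB",
        "actions": actions,
        "targets": targets,
        "roleArn": role_arn,  # Service Access
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        "experimentOptions": {
            "accountTargeting": "single-account",
            "emptyTargetResolutionMode": "fail",  # fail if no targets resolve
        },
        "logConfiguration": {
            "logSchemaVersion": 2,
            "cloudWatchLogsConfiguration": {"logGroupArn": log_group_arn},
        },
    }
```

With credentials configured, the result could be passed as `boto3.client("fis").create_experiment_template(**template_request(...))`.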

Running the Experiment

Now let's look at our steady state. I have a Multi-AZ RDS MySQL DB instance whose primary is currently in us-east-1a.

This DB spans two AZs, as is evident from its DB subnet group, which includes us-east-1a and us-east-1b.

Please note: Both subnets in the group must have the same network accessibility, for example by being associated with the same route table. Otherwise, after failover, the DB may end up in a subnet that your clients cannot reach.

At this point, both the DB and the replica are in sync.

Once you have saved the experiment template, start the experiment by clicking Start experiment.

You can watch the status of each action change as the experiment runs.
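
The same start-and-watch loop can be scripted. The sketch below takes an FIS client (for example `boto3.client("fis")`) and a template ID, both supplied by the caller, and polls until the experiment reaches a terminal status. The polling interval and the set of terminal statuses are my assumptions.

```python
import time

# Statuses after which the experiment will not change any more (assumed set).
TERMINAL = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis, template_id: str, poll_seconds: float = 15.0) -> str:
    """Start an experiment from a template and block until it finishes."""
    started = fis.start_experiment(experimentTemplateId=template_id)
    experiment_id = started["experiment"]["id"]
    while True:
        experiment = fis.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        # Per-action status, mirroring what the console shows.
        per_action = {
            name: action.get("state", {}).get("status")
            for name, action in experiment.get("actions", {}).items()
        }
        print(experiment_id, status, per_action)
        if status in TERMINAL:
            return status
        time.sleep(poll_seconds)
```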

In the logs, you can see the targets that have been resolved: the subnets and the DB instance.

While the network disruption is in progress, your MySQL connection will appear to hang.

When the DB action starts, you will see the DB instance rebooting.

While it is rebooting, you won't be able to insert into the DB.

When the console shows the DB as "Available" again, create a new connection to the DB.

Note that the console may not show the AZ change yet; it takes some time to reflect. But once the status is Available, the DB has already failed over to the other AZ. Behind the scenes, the failover is a DNS change. With the new connection, inserts work again.
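
Because the failover is a DNS change, one way to notice it without waiting for the console is to watch what the DB endpoint resolves to. This is a minimal sketch: it assumes you run it from a host that can resolve the endpoint (typically inside the VPC), and local DNS caching may delay what it sees. The resolver is injectable so the logic can be exercised without a real endpoint.

```python
import socket
import time


def watch_endpoint(hostname, resolve=socket.gethostbyname,
                   attempts=60, delay=5.0):
    """Return (old_ip, new_ip) once the hostname resolves to a new address.

    Returns None if the address did not change within the given attempts.
    """
    first = resolve(hostname)
    for _ in range(attempts):
        time.sleep(delay)
        current = resolve(hostname)
        if current != first:
            return first, current
    return None
```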

We can also query the replica, and it is still in sync!
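
Besides querying the replica by hand, its lag can be read from CloudWatch. The sketch below only builds the arguments for `get_metric_statistics` on the `ReplicaLag` metric; the replica identifier and the look-back window are assumptions, and the CloudWatch client must be created in the replica's own region, since this is a cross-region replica.

```python
from datetime import datetime, timedelta, timezone


def replica_lag_query(replica_id: str, minutes: int = 15, now=None) -> dict:
    """Build get_metric_statistics arguments for the replica's ReplicaLag metric."""
    now = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": "ReplicaLag",  # seconds behind the source DB
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": replica_id}],
        "StartTime": now - timedelta(minutes=minutes),
        "EndTime": now,
        "Period": 60,
        "Statistics": ["Maximum"],
    }
```

A call could then look like `boto3.client("cloudwatch", region_name=replica_region).get_metric_statistics(**replica_lag_query("my-replica"))`, where `replica_region` and `my-replica` are placeholders.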

We observed a downtime of about 2 minutes before the RDS DB came back up on our tiny db.t3.micro instance. More powerful instance classes may recover faster.
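
To put a number on the downtime rather than eyeballing it, a simple probe can record when the DB port stops and starts accepting connections. This is a sketch with assumptions: it treats a successful TCP connection to port 3306 as "up", which is only a proxy for being able to write, and the probe interval limits the precision.

```python
import socket
import time


def is_reachable(host: str, port: int = 3306, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def probe(host: str, seconds: float, interval: float = 1.0) -> list:
    """Collect (timestamp, reachable) samples for the given duration."""
    samples = []
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        samples.append((time.monotonic(), is_reachable(host)))
        time.sleep(interval)
    return samples


def longest_outage(samples) -> float:
    """Longest span, in seconds, from a first failed probe to the next success."""
    longest, down_since = 0.0, None
    for timestamp, ok in samples:
        if not ok and down_since is None:
            down_since = timestamp
        elif ok and down_since is not None:
            longest = max(longest, timestamp - down_since)
            down_since = None
    return longest
```

Start `probe` against the DB endpoint shortly before the experiment and pass its samples to `longest_outage` afterwards.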

After a few minutes, the console shows the new AZ for the primary: us-east-1b.

The actions should now show as completed, and the experiment itself is completed.

Conclusion

By using FIS, we can test the resiliency of our applications in AWS. In this demo, we tested the resilience of a Multi-AZ DB and its read replica in the face of losing one of the two AZs.

More info can be found here: https://aws.amazon.com/fis/
Great demo at re:Invent: https://www.youtube.com/watch?v=N0aZZVVZiUw

Hope this was helpful!

Top comments (2)

Darryl Ruggles

This is a really cool tool! Thanks for putting this article out.

manojlingala

Well Explained 🙌