
Chaos Engineering on AWS: Using Fault Injection Simulator (FIS) for Resilience

Part I: Building Resilient Systems on AWS: EC2 service and Auto Scaling Group


Introduction to Chaos Engineering

Chaos Engineering: Introduction to Resilient Systems

Why do we need Chaos Engineering?

Historically, disaster preparedness focused on catastrophic events like earthquakes or power outages, with organizations investing in disaster recovery (DR) plans to restore services from backup data centers after major disruptions.

While effective for large-scale outages, this approach fails to address the frequent, smaller-scale failures prevalent in modern systems, rendering traditional DR insufficient as infrastructure evolves.

The shift from monolithic applications to distributed, microservices-based, cloud-native architectures on platforms like AWS and Kubernetes has brought scalability and agility but also increased fragility. A single misbehaving microservice can trigger cascading failures, a misconfigured network route can isolate critical components, or a faulty deployment can exhaust resources like CPU or memory on a single Kubernetes node, disrupting entire applications.

Imagine proactively simulating failures in a controlled environment: injecting latency into an AWS RDS database, terminating a Kubernetes pod mid-operation, or mimicking an AWS region outage.

By deliberately introducing faults, engineers can identify weaknesses, redesign for fault tolerance, measure blast radius, and validate monitoring, alerting, and auto-remediation mechanisms.

Chaos Engineering is the practice of intentionally injecting controlled failures to study system behavior and enhance resilience. It involves safely introducing faults, limiting their impact, observing outcomes, and iteratively applying architectural improvements.

This is not reckless disruption but a disciplined, test-driven approach to hardening systems.

Downtime carries significant financial and operational costs across industries like e-commerce, healthcare, and finance. In cloud-native environments, unpredictability is inherent.

While chaos cannot be eliminated, Chaos Engineering enables teams to prepare, test, and adapt systems to be self-healing and robust, ensuring resilience in the face of inevitable failures.


AWS FIS service

Overview of the AWS Fault Injection Service (FIS)

Chaos Engineering has evolved into a critical practice for building resilient cloud systems, with various tools emerging to support controlled failure testing. The concept gained prominence with Netflix’s Chaos Monkey, which randomly terminated instances to validate system robustness. Today, the ecosystem includes tools like Gremlin, Azure Chaos Studio, and, within the AWS platform, the AWS Fault Injection Service (FIS).

In this guide, we focus on AWS FIS, Amazon’s managed solution for conducting Chaos Engineering experiments in AWS environments.

FIS enables teams to simulate real-world failures, identify vulnerabilities, and strengthen system resilience before issues impact users.

AWS Fault Injection Service is a fully managed platform designed to execute controlled fault injection experiments across AWS workloads.

By introducing failures like instance terminations or network disruptions, FIS helps engineers assess how systems handle unexpected conditions, allowing them to address weaknesses proactively and enhance reliability.

Why should we use AWS FIS?

AWS FIS empowers teams to:

  • Simulate failures such as EC2 instance crashes, EKS pod evictions, or RDS failovers.
  • Test complex scenarios, including Availability Zone (AZ) outages or cross-region disruptions.
  • Validate recovery mechanisms like auto-scaling, failover, and monitoring alerts.
  • Build confidence in system resilience, reducing Mean Time to Recovery (MTTR) and ensuring robust performance.

How FIS Works: Architecture Overview

FIS integrates seamlessly with AWS services to provide comprehensive observability and control:

  • Amazon CloudWatch: Monitors metrics and triggers rollbacks if experiment thresholds are exceeded.
  • AWS X-Ray: Traces the impact of failures across services for detailed analysis.
  • AWS IAM: Enforces granular permissions to manage who can run or configure experiments.

FIS offers predefined failure scenarios and experiment templates for services like:

  • EC2: Instance termination, CPU or memory stress.
  • EKS: Pod or node disruptions for Kubernetes workloads.
  • RDS: Database reboot or failover simulations.
  • S3: Network latency or access disruptions.
  • Multi-AZ/Region: Simulating large-scale outages.

Managing FIS

Experiments (an experiment is a controlled test that uses FIS to simulate a real-world failure) can be configured and executed via:

  • AWS Management Console for interactive setup.
  • AWS CLI for scripted workflows.
  • AWS CloudFormation or AWS SDKs for infrastructure-as-code integration.

This flexibility allows FIS to fit into both manual testing processes and automated CI/CD pipelines, embedding Chaos Engineering into the Software Development Lifecycle (SDLC).
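
As a minimal CLI sketch of that workflow (assuming your credentials and default region are already configured, and that the template ID below is only a placeholder):

# List the experiment templates available in this account/region
aws fis list-experiment-templates

# Start an experiment from an existing template (EXT... is a placeholder ID)
aws fis start-experiment \
  --experiment-template-id EXTxxxxxxxxxxxx \
  --tags Purpose=resilience-test

# Follow up on running and past experiments
aws fis list-experiments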


Pre-Requirements: IAM Roles for AWS FIS

Before you can run experiments with AWS Fault Injection Service (FIS), you must configure two essential IAM roles, each serving a distinct purpose in the security model.

About the roles

1. User Role: Who Can Control FIS Service

This is the IAM role or IAM identity (user/group/role) used to log into the AWS Console or interact with the AWS CLI. This role defines who can view, create, modify, or start FIS experiments.

Because FIS experiments may impact availability or cause downtime, it is critical to strictly control who can access these capabilities. You should assign FIS-related permissions only to trusted users with chaos engineering or platform engineering responsibilities.

2. FIS Service Role: What FIS Can Do

This is the IAM role assumed by AWS FIS itself when running an experiment. It governs what actions the FIS engine is allowed to perform on AWS resources.

For example, it defines whether FIS can:

  • Terminate an EC2 instance
  • Reboot or failover an RDS database
  • Inject faults into EKS or simulate an Availability Zone (AZ) outage

This role must include the exact permissions required to perform those actions, and must also trust the FIS service to assume the role.

This is called the “FIS Service Role.”

Creating the AWS FIS Service Role

Scope: IAM Role for FIS experiments

In this walkthrough, we’ll create the AWS FIS service role, an IAM role that the Fault Injection Simulator (FIS) will assume when performing its experiments (such as EC2 termination).

This role controls what actions FIS can take on AWS resources during a fault injection.

1. Navigate to IAM in AWS Console

  • Go to the IAM (Identity and Access Management) section of the AWS Console.
  • Click on “Roles” from the left menu.
  • Choose “Create role”.

2. Set Trusted Entity Type

  • Under Trusted entity type, select: ➤ “AWS service”
  • This is because the role will be assumed by an AWS-managed service.
  • Use case: In the list of services, choose: ➤ “FIS: Fault Injection Simulator”

This tells IAM that only AWS FIS will be allowed to assume this role when performing experiments.

3. Select Use Case for experiment

AWS offers predefined permission sets depending on what type of experiment you’re planning to run:

  • For EC2 instance termination: choose the EC2 use case
  • For network-level disruptions: choose VPC/network-related permissions
  • For RDS failover testing: choose RDS
  • For ECS/EKS faults: select the corresponding container service
  • etc.

In this guide, we’ll focus on terminating EC2 instances, so select the EC2 use case: “AWSFaultInjectionSimulatorEC2Access”.

4. Review Permissions and Trust Policy

After selecting the EC2 scenario:

  • IAM automatically attaches a predefined AWS managed policy that grants permissions for EC2-related fault injection actions.
  • Review the Trust policy, which should contain:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "fis.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

This policy allows FIS to assume the role during the experiment execution.

5. Name the Role

  • Provide a meaningful name like: FIS-EC2
  • Click “Create role” to finalize the creation.
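
If you prefer scripting this setup, a rough AWS CLI equivalent might look like the sketch below (the role name FIS-EC2 matches the example above, trust-policy.json is assumed to contain the trust policy shown earlier, and you should verify the managed policy ARN in the IAM console):

# Create the FIS service role with the trust policy shown above
aws iam create-role \
  --role-name FIS-EC2 \
  --assume-role-policy-document file://trust-policy.json

# Attach the AWS managed policy for EC2 fault injection actions
aws iam attach-role-policy \
  --role-name FIS-EC2 \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSFaultInjectionSimulatorEC2Access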

6. Add CloudWatch Logging Permissions

To allow FIS to write logs, metrics, and diagnostic output, attach an additional permission policy for CloudWatch access:

  • Go to the newly created role
  • Click “Add permissions” → “Attach policies”
  • Search for and attach: ➤ CloudWatchLogsFullAccess (or define a scoped custom policy if preferred)

This allows the FIS experiments to send logging information to Amazon CloudWatch Logs, enabling you to monitor experiment results and rollbacks.
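
The same step via the CLI could look roughly like this (broad managed policy for the lab; a scoped custom policy limited to your log group is the safer choice in production):

# Grant the FIS service role permission to write to CloudWatch Logs
aws iam attach-role-policy \
  --role-name FIS-EC2 \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsFullAccess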


Understanding the Components

In this section, we will analyze the key components of the first fault injection experiment that we’ll execute using AWS FIS.

This experiment is designed to simulate EC2 instance termination within an AWS environment managed by an Auto Scaling Group (ASG).

1. The “Given”, Our Known Architecture

The first component of any FIS experiment is the “given”, a clear understanding of the system’s current behavior and architecture under normal (steady-state) conditions.

In this case, we know the following:

  • The application is hosted on EC2 instances
  • These instances are deployed across multiple Availability Zones (AZs)
  • An Auto Scaling Group (ASG) is configured to manage these instances
  • The ASG is expected to maintain a defined minimum capacity and automatically replace any unhealthy or terminated instance

This architectural context is essential. It establishes what we expect to happen before any fault is introduced, and serves as the baseline.

2. The Hypothesis, What We Expect to Happen

The second component is the hypothesis, a prediction of how the application should behave when a specific failure occurs.

For this experiment, our hypothesis is:

“If an EC2 instance is terminated, the Auto Scaling Group will detect the loss and automatically provision a new instance. As a result, the application will continue running without any disruption.”

This hypothesis is based on the expected behavior of ASGs in AWS, which are designed to maintain the desired capacity at all times.

Objective of the Experiment

By executing this AWS FIS experiment, we aim to test whether this hypothesis holds true under real conditions.

We want to observe:

  • Whether the ASG replaces the instance fast enough
  • Whether there is any downtime during the replacement
  • How other application components behave during this replacement window

This test will give us a clear understanding of the actual resilience of the EC2 + ASG setup.


Preparing the AWS Architecture

Affected services: EC2 + ASG

Before we initiate our first AWS Fault Injection Service (FIS) experiment, let’s define and build a simple and controlled architecture in your AWS account to support the experiment scenario.

To keep things straightforward and beginner-friendly, we will use the AWS Management Console for this setup in this case.

Architecture Components Overview

We will create three essential components in this setup:

1. EC2 Launch Template

We begin by creating an EC2 Launch Template, which defines the configuration for EC2 instances launched by the Auto Scaling Group.

This includes:

  • AMI ID (for example: Amazon Linux 2023 AMI, Free tier eligible)
  • Instance type (any type works; t3.micro is fine)
  • Security groups (no changes needed)
  • Subnets: any subnet of your VPC
  • Key pair (optional): not needed in this case
  • User data (optional)

As part of this configuration, we’ll add a Resource tag:

Key: fis

Value: true

  • This tag can later be used in FIS filters to target specific instances for termination.
  • Ensure that “Instances” is selected under “Resource types” (a CLI sketch of this launch template follows below).
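
For reference, a CLI sketch of the same launch template (the template name and AMI ID are placeholders; adjust them to your region and account):

# Create a launch template whose instances carry the tag fis=true
aws ec2 create-launch-template \
  --launch-template-name fis-demo-template \
  --launch-template-data '{
    "ImageId": "ami-xxxxxxxxxxxxxxxxx",
    "InstanceType": "t3.micro",
    "TagSpecifications": [
      {
        "ResourceType": "instance",
        "Tags": [ { "Key": "fis", "Value": "true" } ]
      }
    ]
  }'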

2. Auto Scaling Group (ASG)

Next, we will create an Auto Scaling Group that uses the above launch template. On the launch template confirmation page, click the “Create an Auto Scaling group from your template” link and create the ASG (or navigate to EC2 / Auto Scaling Groups in the AWS Console and select the created launch template from the list).

Key configuration:

  • Select the affected VPC (the VPC that contains the subnet chosen in the launch template)
  • Spread instances across at least two Availability Zones (AZs): select the same subnet plus at least one more
  • Leave the other options at their defaults, click Next, leave the next page at its defaults as well, click Next again, and

Set the values like this:

  • Desired capacity = 1
  • Minimum capacity = 1
  • Maximum capacity = 4

Click Next through the remaining pages, and on the “Review” page click “Create Auto Scaling group”.

This setup ensures that when an EC2 instance is terminated, the ASG will automatically launch a new one to maintain the desired capacity.
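
A scripted equivalent, as a sketch (the ASG name, launch template name, and subnet IDs are placeholders; use the subnets of your own VPC):

# Create the Auto Scaling Group across at least two AZs
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name fis-demo-asg \
  --launch-template 'LaunchTemplateName=fis-demo-template,Version=$Latest' \
  --min-size 1 --max-size 4 --desired-capacity 1 \
  --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222"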

3. CloudWatch Log Group for FIS

Lastly, we will create a CloudWatch Log Group named:

test-fs

  • Navigate to CloudWatch in the AWS Console and create a new log group under “Log groups”. Set the retention (expiration) to 1 day and leave the other options at their defaults. (A CLI equivalent is shown after the list below.)

This log group will be used by AWS FIS to write execution logs, helping you:

  • Track fault injection events
  • Monitor system responses
  • Audit the sequence of actions
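
The same log group can be created from the CLI, for example:

# Create the log group FIS will write to, and expire events after one day
aws logs create-log-group --log-group-name test-fs
aws logs put-retention-policy --log-group-name test-fs --retention-in-days 1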

Once these three components are in place, we’ll be ready to define and run our FIS experiment template targeting the EC2 instance within this Auto Scaling setup.


Creating the AWS FIS Experiment Template

Use case: “controlled” EC2 Termination in ASG

As part of this lab, we will now create our first AWS Fault Injection Service (FIS) experiment template. This experiment is designed to test the behavior of an Auto Scaling Group (ASG) when 50% of its EC2 instances are terminated.

1. Navigate to FIS Experiment Templates

  • Open the AWS Management Console
  • Go to AWS Fault Injection Simulator (FIS)
  • Click on “Experiment templates” in the left navigation pane
  • Click “Create an experiment template”

2. General Settings

  • Experiment Template Name: ➤ Example: fis-ec2
  • AWS Account: Select your current AWS account
  • Description: ➤ "Terminate 50% of instances in the Auto Scaling Group"

3. Define the Action

An action in AWS FIS defines what fault is injected during the experiment. In our case, this will be terminating EC2 instances. Click “Add action” under “Actions and targets”.

  • Action Name: TerminateEC2Instances
  • Action Description: "Terminate EC2 instance(s) to test ASG recovery"
  • Service: EC2
  • Action Type: select EC2 and choose aws:ec2:terminate-instances
  • Start after: optional. Note: you may also define action sequencing here (for multi-step experiments), but we will skip this for now.
  • Target: leave as default for now
  • Click “Save”

4. Target Definition

We now specify which EC2 instances the action should apply to.

Click on “Instances-Target-1” on the created “diagram” and configure which instances will be affected

  • Click on the dots of the “Instances-Target-1” and edit the target
  • Target Name: fis-ec2-target
  • Resource Type: aws:ec2:instance
  • Target method: select “Resource tags, filters and parameters” and configure the tag (as you did in the launch template). Under “Resource tags” add the same key and value:
    Key: fis, Value: true ➤ This ensures only EC2 instances explicitly tagged fis=true are targeted.

Narrow the targets further:

We configure this because we want to terminate only 50% of the running instances, which means we need both a resource filter and a selection mode.

Under “Resource filters”, configure this filter:

Attribute path: State.Name, Value: running

➤ This ensures the experiment only affects running instances, skipping stopped, initializing, or terminated ones.

  • Then set the Selection mode to PERCENT(50) → this ensures only 50% of all matching instances are affected.

Click on “Save”.

5. Assign IAM Role

Select the IAM role that AWS FIS will assume when executing the experiment:

  • Example: FIS-EC2
  • This role (which has been created earlier) must have permissions to terminate EC2 instances and write to CloudWatch Logs

Leave the other options at their defaults and click Next.

6. Add Stop Conditions (Optional but Recommended, especially in Production/Live environment)

Important:
Although we will not define stop conditions in this lab, it’s considered best practice in production environments.

Stop conditions allow you to configure CloudWatch alarms that will immediately stop the FIS experiment if a critical threshold is breached (e.g., CPU drops below a threshold or latency spikes).

In a production/live environment you should consider configuring it: “AWS FIS helps you run experiments on your workloads safely. You can set a limit, known as a stop condition, to end the experiment if it reaches the threshold defined by a CloudWatch alarm. If a stop condition is reached during an experiment, you can’t resume the experiment.”
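
As a hedged example, adding a stop condition to an existing template from the CLI might look like this (the template ID and alarm ARN are placeholders for your own resources):

# Stop the experiment automatically if the referenced CloudWatch alarm fires
aws fis update-experiment-template \
  --id EXTxxxxxxxxxxxx \
  --stop-conditions source=aws:cloudwatch:alarm,value=arn:aws:cloudwatch:eu-west-1:111122223333:alarm:app-latency-high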

7. CloudWatch Logging

Enable log delivery to an existing CloudWatch Log Group and configure the created log group:

  • Log Group Name: test-fs

This allows you to monitor:

  • Action start/end times
  • Success/failure of the experiments
  • Target resource IDs and outcomes

8. Create the Template

  • Scroll to the bottom and click “Create experiment template”
  • AWS may show a warning that no stop condition was defined; acknowledge this for the demo
  • Type create to confirm

The template is now ready to use.
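
For reference, a roughly equivalent template expressed as JSON and created via the CLI could look like the sketch below (account ID, region, ARNs, and names are placeholders; the field names follow the FIS CreateExperimentTemplate API, but verify them against the current documentation before use):

cat > fis-ec2-template.json <<'EOF'
{
  "clientToken": "fis-ec2-template-1",
  "description": "Terminate 50% of instances in the Auto Scaling Group",
  "targets": {
    "fis-ec2-target": {
      "resourceType": "aws:ec2:instance",
      "resourceTags": { "fis": "true" },
      "filters": [
        { "path": "State.Name", "values": [ "running" ] }
      ],
      "selectionMode": "PERCENT(50)"
    }
  },
  "actions": {
    "TerminateEC2Instances": {
      "actionId": "aws:ec2:terminate-instances",
      "description": "Terminate EC2 instance(s) to test ASG recovery",
      "targets": { "Instances": "fis-ec2-target" }
    }
  },
  "stopConditions": [ { "source": "none" } ],
  "roleArn": "arn:aws:iam::111122223333:role/FIS-EC2",
  "logConfiguration": {
    "logSchemaVersion": 2,
    "cloudWatchLogsConfiguration": {
      "logGroupArn": "arn:aws:logs:eu-west-1:111122223333:log-group:test-fs:*"
    }
  }
}
EOF

aws fis create-experiment-template --cli-input-json file://fis-ec2-template.json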


Running Your First AWS FIS Experiment

Affected components: EC2 + ASG (EC2 Termination)

In this section, we will execute the AWS FIS experiment that was created earlier. The goal is to test how the Auto Scaling Group (ASG) handles EC2 instance termination and validate whether our hypothesis holds true.

Step 1: Review the Target Instances

Before triggering the experiment, it’s important to confirm which instances will be targeted:

  1. Navigate to the EC2 Dashboard
  2. Apply two filters: Instance state = running, and Tag: Key=fis, Value=true

In our example, a single EC2 instance meets both conditions. This would be the affected instance if the experiment proceeds.
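
The same check from the CLI (the filters match the tag and state used in the template):

# Show the instance IDs that currently match the experiment's target filters
aws ec2 describe-instances \
  --filters "Name=tag:fis,Values=true" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId"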

Because the FIS template targets 50% of the matching instances, we need to increase the number of instances from one to two.

Adjusting ASG Capacity

The current setup means there’s only one EC2 instance running. If that instance is terminated, it takes time for a replacement to launch, leading to potential downtime.

Modify the affected Auto Scaling Group:

  • Set Desired capacity = 2
  • Set Minimum capacity = 2

This ensures that two EC2 instances are always running, and terminating one will not interrupt service.

Wait for the second instance to launch before continuing. Ensure that both instances are up and running.
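
A CLI sketch of the capacity change (the ASG name is a placeholder for your own group):

# Raise minimum and desired capacity so one instance always survives the test
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name fis-demo-asg \
  --min-size 2 --desired-capacity 2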

Step 2: Start the Experiment

  1. Go to AWS FIS in the Console
  2. Select Experiment templates
  3. Locate the experiment template created earlier and have a look at the “Targets” tab. Under “Preview” you can see an EC2 instance (as “Resource”); this shows which actual resource(s) will be targeted when you start this experiment. The “Target information” will be something like: arn:aws:ec2:-x:xxxxxxxxxx:instance/i-xxxxxxxxxxxx. If you can’t see anything, click “Generate preview” on the right and wait a few seconds.
  4. Click Start experiment
  5. Confirm the action (since it may cause disruption)

At this point, the experiment transitions to:

  • State: initiating
  • Status: pending
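
You can follow the same state transitions from the CLI (the experiment ID is a placeholder returned by start-experiment):

# Poll the experiment state while it runs (e.g. initiating, running, completed)
aws fis get-experiment --id EXP123exampleId --query "experiment.state"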

Step 3: Validation

Now, navigate to the EC2 Console and check the affected instance(s). The targeted instance should now be terminated.

You should now see:

  • State: Running → Shutting-down → Terminated
  • The instance count drops from 2 to 1 (one is terminated)

Note:

  • Because one instance remains running, the application should continue functioning without interruption.
  • Due to the ASG’s configured capacity values, it will launch a new instance (it “replaces” the terminated one).
  • Under EC2 / Auto Scaling groups / test-fs, open the “Instance management” and “Activity” tabs to check the information about the instance(s); you can find the related events there.
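
To see the replacement events from the CLI as well (the ASG name is a placeholder):

# Show recent scaling activity, including the launch that replaces the terminated instance
aws autoscaling describe-scaling-activities \
  --auto-scaling-group-name fis-demo-asg \
  --max-items 5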

Best Practices and Takeaways

Implementing fault injection with AWS FIS is not just about testing for failure, it’s about building confidence in recovery.

Based on this covered scenario, here are the most important lessons and recommendations:

FIS Experiment Planning

  • Define a Clear Hypothesis Every FIS experiment must start with a clearly defined expected behavior based on your architecture. Avoid blind testing.
  • Understand Your Architecture’s Limits Knowing that ASGs maintain desired capacity isn’t enough; understand boot time, warm-up latency, and how they affect service availability.
  • Start Small Begin with safe experiments that target only a portion of your resources (e.g., 50%) and gradually expand the scope once confidence builds.

Targeting and Blast Radius Control

  • Use Tags and Filters for Target Isolation Always apply precise filters (like tag fis=true and state=running) to limit the impact of experiments to approved resources.
  • Leverage Selection Modes Use PERCENT or COUNT to control how many instances are affected; this is crucial for minimizing unintended disruptions.

Security and Access

  • Separate IAM Roles Use two distinct IAM roles: - One for users/automation to manage/run experiments - One for FIS to assume during execution (with minimal required permissions)
  • Enable Least Privilege The FIS service role should only have the permissions it needs to execute the specific fault scenario (e.g., ec2:TerminateInstances).

Observability and Monitoring

  • Use CloudWatch Logs and Alarms Send all FIS logs to a dedicated log group for traceability. In production, always define stop conditions tied to CloudWatch alarms.
  • Preview Before Running Use the target preview feature to verify that FIS has resolved the correct resources and catch misconfigurations before runtime.

Iteration and Learning

  • Rerun with Adjustments Use each failed experiment as a learning opportunity. Adjust configurations (e.g., ASG size), then rerun and observe improvements.
  • Document Everything Maintain a knowledge base of all FIS templates, their assumptions, outcomes, and resulting architecture changes.

Conclusion

This initial FIS experiment with EC2 instances managed by an Auto Scaling Group demonstrates how structured fault injection reveals architectural weaknesses, and helps teams improve system resilience with confidence.

By applying best practices in targeting, IAM, observability, and post-experiment review, chaos engineering becomes a controlled, safe, and highly effective practice in modern AWS environments.

This is just the start. In future posts, I would like to explore:

  • Testing more complex architectures (like EKS, RDS, or multi-AZ setups)
  • Using stop conditions and CloudWatch alarms
  • Automating chaos experiments in CI/CD workflows

Originally published on Medium: Chaos Engineering on AWS - Part 1
