DEV Community: Adrian Hornsby

AWS re:Invent 2020 digest — Part 2

Adrian Hornsby — Thu, 07 Jan 2021 12:22:56 +0000

Curated list of my favorite AWS updates from re:Invent 2020

https://reinvent.awsevents.com/

re:Invent 2020 is coming to an end. A lot of new launches have happened since I published Part 1 of this series. Because digesting all the different updates takes time and a lot of coffee, I thought I’d help you out a little.

Following is a curated list of things that I found most important; matters related to architecture, scalability, reliability, performance, resiliency, devops, and security — anything that caught my eye, and I hope will satisfy yours.

AWS Fault Injection Simulator (coming in 2021)

AWS Fault Injection Simulator is a fully managed chaos engineering service that makes it easier for teams to discover an application’s weaknesses at scale in order to improve performance, observability, and resiliency. Chaos engineering is the process of stressing an application in testing or production environments by creating disruptive events, such as server outages or API throttling, observing how the system responds, and implementing improvements. Chaos engineering helps teams create the real-world conditions needed to uncover the hidden issues, monitoring blind spots, and performance bottlenecks that are difficult to find in distributed systems.

As you can imagine, this is my favorite launch and it looks like I am not the only one thinking like that :)

Go watch Laura Thomson, Sr. Product Mgr for AWS FIS, launching the service live on Twitch!

You can also check my re:Invent session here.

To learn more about chaos engineering, check my collection of articles.

AWS Lambda now supports self-managed Apache Kafka as an event source

If you love event-driven architecture, this one is for you! AWS Lambda lets customers build applications that can be triggered by messages in an Apache Kafka cluster hosted on any infrastructure. It is now easier than ever to build Kafka consumer applications with Lambda without needing to worry about provisioning or managing servers.
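To make this concrete, here is a minimal sketch of what a Python consumer function could look like. It assumes the Kafka event reaches the function as a dictionary whose `records` field groups messages by topic-partition, with base64-encoded values; check the Lambda documentation for the exact event shape. The function names, the topic name, and the sample event are made up for illustration.

```python
import base64


def decode_kafka_records(event):
    """Return (topic, partition, offset, value) tuples from a Kafka-style event.

    Assumption: event["records"] maps "topic-partition" keys to lists of
    records whose "value" field is base64-encoded.
    """
    decoded = []
    for batch in event.get("records", {}).values():
        for record in batch:
            value = base64.b64decode(record["value"]).decode("utf-8")
            decoded.append(
                (record["topic"], record["partition"], record["offset"], value)
            )
    return decoded


def handler(event, context):
    # Hypothetical Lambda entry point: print each message, report a count.
    messages = decode_kafka_records(event)
    for topic, partition, offset, value in messages:
        print(f"{topic}[{partition}]@{offset}: {value}")
    return {"processed": len(messages)}


# Made-up sample event for trying the handler locally.
SAMPLE_EVENT = {
    "records": {
        "orders-0": [
            {
                "topic": "orders",
                "partition": 0,
                "offset": 15,
                "value": base64.b64encode(b"hello").decode("ascii"),
            }
        ]
    }
}
```

Calling `handler(SAMPLE_EVENT, None)` locally is a quick way to check the decoding logic before wiring the function to a real cluster.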

AWS announces Amazon Managed Service for Grafana and Prometheus in Preview

Amazon Managed Service for Grafana is a fully managed and secure data visualization service that lets customers instantly query, correlate, and visualize operational metrics, logs, and traces for their applications from multiple data sources. Developed in partnership with Grafana Labs, Amazon Managed Service for Grafana manages the provisioning, setup, scaling, and maintenance of Grafana servers, eliminating the need for customers to do this themselves.

Amazon Managed Service for Prometheus (AMP) is a fully managed Prometheus**-compatible monitoring service that makes it easy to monitor containerized applications at scale by automatically scaling the ingestion, storage, and querying of operational metrics.

** The Cloud Native Computing Foundation’s Prometheus project is a popular open source monitoring and alerting solution optimized for container environments.

Customers can now use the open source Prometheus Query Language (PromQL) to monitor the performance of containerized workloads on AWS or on-premises, without having to manage the underlying infrastructure for scalability, availability, and security.

AWS Global Accelerator launches custom routing

AWS Global Accelerator released the custom routing accelerator, a new type of accelerator that lets you use your own application logic to route user traffic to a specific Amazon EC2 destination.

With a custom routing accelerator, you can route multiple users to a specific EC2 destination in one or more AWS Regions by directing them to a unique port on your accelerator. This feature makes it easier to integrate Global Accelerator with your application logic, such as matchmaking servers or session border controllers (network devices that protect and regulate IP traffic flows for real-time communication workflows).

With custom routing accelerators, you can now leverage AWS Global Accelerator as the single point of entry for your application while deterministically sending your user traffic to specific EC2 destinations in any AWS Region.

And customers are already embracing this feature to build multiplayer game architectures!

Other notable launches

-Adrian

AWS re:Invent 2020 digest — Part 1

Adrian Hornsby — Thu, 10 Dec 2020 07:01:49 +0000

Curated list of my favorite AWS updates from re:Invent 2020

https://reinvent.awsevents.com/

re:Invent has only just started, and the first keynote from Andy Jassy already included a lot of new launches. I know that digesting all the updates takes time and a lot of coffee, so let me help you.

Following is a curated list of things that I found most important; matters related to architecture, scalability, reliability, performance, resiliency, DevOps, and security — anything that caught my eye, and I hope will satisfy yours.

Amazon S3 now delivers strong read-after-write consistency automatically for all applications

This is hands-down my favorite launch!

Amazon S3 now delivers strong read-after-write consistency automatically for all applications for any storage request, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost.

OK — but what does strong read-after-write consistency mean?

After successfully writing a new object or overwriting an existing one, any subsequent read request immediately receives the object’s latest version. S3 also provides strong consistency for list operations, so after a write, you can directly perform a listing of the objects in a bucket with any changes reflected.
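In code, the guarantee means a write can be followed directly by a read and a listing, with no polling or retry loop in between. The sketch below shows that pattern using the `put_object`, `get_object`, and `list_objects_v2` calls of the boto3 S3 client. To keep it runnable without an AWS account, it uses a hypothetical in-memory stand-in (`InMemoryS3`) that mimics only those three calls; the bucket and key names are invented.

```python
import io


class InMemoryS3:
    """Hypothetical stand-in mimicking three boto3 S3 client calls."""

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

    def list_objects_v2(self, Bucket, Prefix=""):
        keys = sorted(
            k for (b, k) in self._objects if b == Bucket and k.startswith(Prefix)
        )
        return {"Contents": [{"Key": k} for k in keys]}


def write_then_read(s3, bucket, key, body):
    """Overwrite an object, then immediately read it back and list it."""
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    latest = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    listing = s3.list_objects_v2(Bucket=bucket, Prefix=key)
    keys = [item["Key"] for item in listing.get("Contents", [])]
    return latest, keys


s3 = InMemoryS3()  # with real AWS this would be boto3.client("s3")
s3.put_object(Bucket="demo-bucket", Key="report.csv", Body=b"v1")
latest, keys = write_then_read(s3, "demo-bucket", "report.csv", b"v2")
```

The stand-in is trivially consistent, so this only illustrates the calling pattern; the point of the launch is that the same code against real S3 now returns the latest version too.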

To learn more, hear the GM of Amazon S3, Kevin Miller, and Ashish Gandhi with Dropbox, discuss the benefits of strong consistency for S3.

Continuing with S3, there were a couple more updates that you might find useful for your DR or multi-region strategy:

AWS Lambda now supports container images as a packaging format

This one is interesting, even controversial, because I know some of the serverless purists out there are feeling betrayed :) But to me, it is a testament to AWS’ obsession with listening to customers. And customers wanted that.

You can now package and deploy AWS Lambda functions as a container image of up to 10 GB.

It means that you can now build Lambda-based applications using your familiar container tooling & workflows, using either a set of AWS base images for Lambda, or using your preferred community or enterprise images.

If you are familiar with container development tools such as the Docker CLI, you can build and test your Lambda-based application locally and push your container image to Amazon ECR. You can then deploy your Lambda function by specifying your Amazon ECR image tag or digest from the repository.
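As a rough sketch of the deploy step, the snippet below builds the arguments for the `create_function` call of the boto3 Lambda client, using `PackageType="Image"` and an ECR image URI. The function name, role ARN, account ID, and image URI are placeholders, and the actual AWS call is left as a comment so the snippet runs without credentials.

```python
def container_function_request(name, role_arn, image_uri):
    """Build keyword arguments for boto3's Lambda `create_function` call."""
    return {
        "FunctionName": name,
        "Role": role_arn,
        "PackageType": "Image",          # deploy from a container image
        "Code": {"ImageUri": image_uri},  # ECR image, by tag or digest
    }


# Placeholder values, for illustration only.
request = container_function_request(
    name="my-container-function",
    role_arn="arn:aws:iam::123456789012:role/my-lambda-role",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",
)

# With credentials configured, the call would look like:
#   import boto3
#   boto3.client("lambda").create_function(**request)
```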

And by the way, Amazon ECR just launched Amazon ECR Public. This fully managed registry makes it easy for a developer to share container software worldwide for anyone to download publicly.

For more information and a deep dive on container image support for Lambda, please read this very detailed post from Danilo.

Babelfish for Amazon Aurora PostgreSQL is Available for Preview

Babelfish for Amazon Aurora is a new translation layer for Amazon Aurora that enables Aurora to understand queries from applications written for Microsoft SQL Server.

By using Babelfish, your applications running on SQL Server can now run directly on Aurora PostgreSQL with little to no code changes. Babelfish understands the SQL Server wire-protocol and T-SQL, the Microsoft SQL Server query language, so you don’t have to switch database drivers or re-write all of your application queries.

This announcement is huge for many of our customers!

And by the way, AWS is open-sourcing Babelfish in 2021. Until then, you can use Babelfish on Amazon Aurora in a preview to see how it works and to get a sense of whether this is the right approach for you.

Here is a full write-up of the launch by Matt Asay!

Want more PostgreSQL? You just might like Babelfish | Amazon Web Services

Introducing the next version of Amazon Aurora Serverless in preview

No secrets here — I love Amazon Aurora, so I am biased.

For those who don’t know Amazon Aurora, it is a MySQL and PostgreSQL-compatible relational database built for the cloud.

Amazon Aurora Serverless is, as the name implies, the serverless version of Aurora. AWS is now releasing version 2, with support for the full breadth of Aurora features, including Global Database, Multi-AZ deployments, and read replicas.

Amazon Aurora Serverless v2, currently in preview, scales instantly from hundreds to hundreds-of-thousands of transactions in a fraction of a second. As Aurora Serverless scales, it adjusts its capacity in fine-grained increments to provide just the right amount of database resources that the application needs. There is no database capacity for you to manage; you pay only for the capacity your application consumes.

Note: Aurora Serverless v2 is currently available in preview for Aurora with MySQL compatibility.

Amazon EKS adds support for EC2 Spot Instances in managed node groups

First, for those that don’t know what a Spot Instance is, a Spot Instance is an unused EC2 instance that is available for less than the on-demand price, often at steep discounts, which lets you lower your EC2 bill significantly. Amazon EC2 sets each instance type’s spot price in each Availability Zone and adjusts it gradually based on the long-term supply and demand for Spot Instances.

Second, Amazon EKS is a managed service that makes it easy for you to run Kubernetes on AWS.

And now, Amazon EKS supports creating and managing Amazon EC2 Spot Instances using Amazon EKS managed node groups. This lets you take advantage of the steep savings and scale that Spot Instances provide.

Until now, Amazon EKS customers had to configure Amazon EC2 Auto Scaling groups manually, manage graceful draining of Spot nodes, and upgrade the Spot nodes to the latest Kubernetes versions. With managed node groups, customers get native support for Spot Instances.

Speaking of EKS, Amazon EKS Distro, the open-source Kubernetes distribution used by Amazon EKS, was launched too!

Introducing Amazon EKS Distro - an open source Kubernetes distribution used by Amazon EKS.

Other notable launches

-Adrian

The Resilient Architecture Collection

Adrian Hornsby — Thu, 12 Nov 2020 07:28:03 +0000

A list of my resiliency-related blog posts.

Series on Resilient Architecture

Resilient systems embrace the idea that failures are typical, and that it’s entirely OK to run applications in what we call partially failing mode. While not suitable for life-critical applications, running in a partially failing mode is a viable option for most web applications. Of course, I’m not saying it doesn’t matter if your system fails. It does, and it might result in lost revenue. But, it’s probably not life-critical.

Building resilient architectures has had its ups and downs, some 1 am wake-up calls, some Christmases spent debugging, some “I’m done, I quit” … but most of all, it’s been an incredible learning experience and journey.

This blog post is a collection of tips and tricks that have served me well throughout this journey, and I hope they will serve you well too.

Part 1: Embracing failure at scale

In part 1 of this series, I focus on the infrastructure layer, redundancy, immutability, and the concept of infrastructure as code.

Patterns for Resilient Architecture — Part 1

Part 2 — Avoiding Cascading Failures

In part 2, I focus on cascading failure prevention. Cascading failures happen when one part of a system experiences a local failure and takes down the entire system through inter-connections and failure propagation.

Patterns for Resilient Architecture — Part 2

Part 3 — Preventing Service Failures with Health Check

In part 3, I discuss the importance and the challenge of health checks — striking a balance between failure detection and reaction.

Patterns for Resilient Architecture — Part 3

Part 4 — Caching for Resiliency

In part 4, I talk about caching. While caching is often associated with accelerating content delivery, it is also essential from a resiliency standpoint.

Patterns for Resilient Architecture — Part 4

The Operational Excellence Collection

Adrian Hornsby — Thu, 12 Nov 2020 07:05:21 +0000

A list of my operational excellence-related blog posts.

Series on Operational Excellence

It takes three interconnecting elements to operate the technology we build successfully. First, you need to have the right culture. Second, you need great tools. And third, you need complete processes.

Part 1 of the series covers the cultural side of Operational Excellence (OE) and examines Amazon’s culture in the context of its Leadership Principles (LPs). Part 2 discusses the role that tools play in achieving OE. Part 3 covers the final aspect of operational excellence: processes, or what we call mechanisms.

Below is the AWS Summit 2020 recording of my presentation on this topic. For more details, read the blog posts :)

Incident Postmortem Template

Operational Readiness Review Template

Adrian Hornsby — Thu, 12 Nov 2020 06:58:29 +0000

Towards Operational Excellence

I want to express my gratitude to my colleagues and friends Ricardo Sueiras, Matt Fitzgerald, and Boaz Ziniman for their valuable feedback.

Since I published my blog series Towards Operational Excellence, I have received a fair amount of feedback and requests. One, in particular, stood out:

“Can you share an operational excellence review template?”

Operational Readiness Review

In this blog post, I will share with you my “lightweight” (but not so lightweight) Operational Readiness Review (ORR) template.

An ORR is a rigorous, evidence-based assessment that evaluates a particular service’s operational state. It is often tailored to a particular company, its culture, and its tools. Yet, ORRs all have the same goal: help you find blind spots in your operations.

This template, which I hope will help you get started, is based on my two decades of experience writing application software, deploying servers, and managing large-scale architectures. I have refined it over the years while helping customers operate software systems in the AWS cloud.

This ORR template is by no means a complete one. Instead, treat it as a starting point for you and your company to get the ball rolling. The most important thing is to make you think about the different aspects of software operations to minimize the risks of failure once the code hits production.

How to use this ORR template?

As mentioned previously, this is not THE template — it is A template — so treat it more as a mechanism for regularly evaluating your workloads, identifying high-risk issues, and recording your improvements.

More importantly, make it yours. Add your own experience to it. Adapt it to your culture, to your needs.

Can you have the right answers to all questions?

Very unlikely at first, but over time it should be your goal. Again, it is more a learning path to support continuous improvement. Having ORR reviews makes it easy to save point-in-time milestones and track improvements to your operations.

Who should do an ORR?

An ORR should preferably be done with the entire service team: the product owner, the technical product manager, backend and frontend developers, designers, architects, and so on. That means everyone who was involved in one way or another with the service. The more diversity, the better. We want to avoid confirmation bias as much as possible.

When should you do an ORR?

A formal ORR should be done before the initial service launch and after any significant technological change. It should be repeated periodically (about once per year) to ensure that things haven’t drifted away from operational expectations but instead improved over time.

How does an ORR differ from an AWS Well-Architected review?

While there are some overlaps, the AWS Well-Architected review provides customers and partners a means to evaluate architectures and implement designs that can scale over time. It describes the key concepts, design principles, and architectural best practices for designing and running workloads in the cloud. ORR addresses and focuses on the operational aspect of a particular service.

Operational Readiness Review Template

The ORR template is organized as follows:

1 — Service Definition and Goals

2 — Architecture

3 — Failures and Impact

4 — Risk Assessment

5 — Monitoring, Metrics & Alarms

6 — Testing

7 — Deployment

8 — Operations

9 — Disaster Recovery

NOTE: As you may have noticed, I didn’t include security in there! And for a good reason: security must have its own in-depth review.

1 — Service Definition and Goals

Describe what your service does from the customer’s point of view.

Describe your operational goals for the service.

What is the SLA of the service?

What are the business scaling drivers correlated with your services? (e.g. number of users, sales, marketing, ad-hoc, …)

Are you conducting an in-depth security review of your service?

2 — Architecture

Describe the architecture of your service. Call out the critical functionalities. Identify the different components of the system and how they interact with one another.

Describe each component of your system.

Does your service support auto-scaling? Describe the mechanisms and expectations.

Does your architecture handle a sudden surge of traffic?

What parts of your architectural design reduce the blast radius of failures? (discuss bulkheads, cells, shards, etc.)

Do you have any single-points of failure? If you do, explain why and what is done to minimize the impact of failure.

Explain the different database and storage choices.

List all customer-facing endpoints, explain what each does and what components and dependencies they have.

List all dependencies that your service takes.

What is the anticipated request volume for each component and dependency of your system?

3 — Failures and Impact

Explain how your service will be impacted based on the failure of each of your components and dependencies.

What is the failure mode for each of the components? (fail-open vs. fail-closed)

Explain the impact on customer experience for the failure of each component and each dependency.

What are the limits imposed on your service by your dependencies? How are these limits tracked?

Do you communicate your scaling requirements to teams that own services you’ve taken dependencies on?

Does your service impose limits on customer resources?

Can you increase limits without making a deployment?

Can you increase limits on a per-customer basis?

Describe the resilience to failure of each of your components (discuss in particular multi-AZ, self-healing, retries, timeouts, back-off, throttles, and limits put in place)

Can the service tolerate an availability zone (AZ) failure without impact?

Can your service sustain production traffic with one AZ down? (ref. static stability)

What is the retry/back-off strategy for each of your dependencies?

What happens when your customers hit limits and get throttled? Can they raise them? How?

4 — Risk Assessment

What are your operational risks?

What scalability concerns are you worried about?

What features did you cut to meet your deadline?

What are the top three things that you believe will catch fire first?

Do you keep track of your dependencies and their criticality? Do you review them regularly?

Do you understand the cost/economics relationship of the service to scaling?

5 — Monitoring, Metrics & Alarms

How do you measure and monitor the end-to-end customer experience?

Do you monitor for single-customer experience?

Do you alarm on poor overall customer experience?

Do you alarm on poor single-customer experience?

How do you trace customer requests in your system?

What are you alarming on? List all of your alarms, with period and threshold, and the severity of each.

Are your dashboards clear? Does everyone know what to look at?

Are there metrics you monitor that don’t have alarms? Which? Why?

What kind of health-checks does your system monitor? (discuss in particular whether they are shallow or deep, whether they use a cache, async vs. sync, etc., and the associated risks)

Do you monitor each external dependency and alarm on failure conditions?

Do you monitor your dependency usage and remaining allowance?

Do you monitor the hosts for disk failure?

Do you monitor disk space utilization?

Do you have log-rotation in place?

Do you monitor for host CPU and memory utilization?

Do you monitor for certificate expiration?

Do you monitor the latency of synchronous and asynchronous calls?

Do you auto-cut tickets on alarms?

6 — Testing

Describe the overall test strategy your service follows.

When do you run tests? Do you have tests before and after conducting code review? Do they run automatically, or are developers running tests manually?

Do you test using “fake” accounts?

What’s the percentage of public-facing APIs covered by tests?

Do you test your dependencies? What assumptions do you make on these?

How do you verify that your service’s monitoring and alarming function as expected?

7 — Deployment

How does your deployment procedure work? List the actions and estimated time in the deployment pipeline.

What are the manual touch-points in your system? Why aren’t they automated? What are the risks associated with each of the touch-points?

What is your procedure to define and approve a change in production?

Do you have a mandatory code review for each change? How do these changes get approved? Do you have several people approving changes?

How do you roll back a change?

Do you test the rollback procedures before deployment?

How do you deploy the configuration to different stages?

Do you error-and-syntax check your configuration before deployment?

What are the dependencies for deployment?

Does your deployment support immutability? Does your deployment update/upgrade software in-place?

Do you perform load testing before deploying to production?

8 — Operations

Describe what the on-call rotation for your service looks like.

Do you have easily available and complete links to the documentation for the service?

Do you have well defined, documented, and accessible recovery procedures?

Describe the escalation path in the event of an outage (include timing expectations).

Does the trouble-ticketing system integrate with the monitoring system?

Does the paging system integrate with the monitoring system?

9 — Disaster Recovery

Do your on-calls have full access to connect to, debug, and configure the service?

Are you preventing/discouraging your team from using full-admin access roles except when absolutely necessary?

Do you have read-only roles for your team to use for non-critical situations?

Do you have up-to-date escalation policies easily accessible by anyone in the company?

How do you keep the escalation policies up-to-date?

Do you have platform-wide locks that prevent or delay routine tasks in case of an active disaster?

Do you have a well-defined process for DR situations? (e.g., war rooms, isolation, calls, internal & external communication)

Are you practicing your disaster recovery procedure?

Do you have measured and verified RTO and RPO?

Are your DNS TTLs set to sane values?

Do you have verified and tested tools deployed to query logs to measure the impact on customers?

Do you have a process for identifying the causes of outages? (e.g., postmortem, correction-of-error, etc.)

Do you back up critical data?

Do you practice backup restoration regularly?

Do you regularly practice fail-overs?

Can your on-call team enable throttles to protect the service from user load?

Can your on-call team increase limits in case of emergencies?

Do you update run-books as the service changes?

That’s all for now, folks. If you want to download, fork, or suggest some changes, this template is on my GitHub account here. Please contribute and help me improve it.

adhorn/operational-excellence

I hope you’ve enjoyed this post. I would love to hear what works and what doesn’t work for you, so please don’t hesitate to share your feedback and opinions. Thanks a lot for reading :-)

— Adrian

Building resilient services at Prime Video with chaos engineering

Adrian Hornsby — Tue, 25 Aug 2020 07:22:08 +0000

Large-scale distributed software systems are composed of several individual sub-systems (such as CDNs, load balancers, and databases) and their interactions. These interactions sometimes have unpredictable outcomes caused by unforeseen turbulent events (for example, a network failure). These events can lead to system-wide failures.

Chaos engineering is the discipline of experimenting on a distributed system to build confidence in the system’s capability to withstand turbulent events. It requires adopting practices to proactively identify interactions in distributed systems and the related failures, and then implementing and validating countermeasures. The key to chaos engineering is injecting failure in a controlled manner.

In this post, we present a simple approach for fault injection in systems utilizing Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Elastic Container Service (Amazon ECS), and its integration with a load-testing suite to validate the countermeasures put in place to prevent dependency and resource exhaustion failures. A typical chaos experiment could be generating baseline load (traffic) against the system, adding latency to all network calls to the underlying database, and then validating timeouts and retries. We will explain how to inject such failure (addition of latency to database calls), why validating countermeasures (timeouts and retries) under load is essential, and how to execute it in an Amazon EC2-based system.

We will start with a brief introduction to chaos engineering, then dive deep into failure injection using AWS Systems Manager. We will then present our open source library, AWSSSMChaosRunner, which was inspired by Adrian Hornsby’s “Injecting Chaos to Amazon EC2 using AWS System Manager” blog post.

Finally, we will provide an example of integration and explain how Prime Video used this library to prevent potentially customer-impacting outages.

Chaos engineering introduction

Software testing commonly involves implementing and automating unit tests, integration tests, and end-to-end tests. Although these tests are critical, they do not encompass the broader spectrum of disruptions possible in a distributed system (e.g., Availability Zone outage, dependency failure, network outage, etc.).

Generally, the behavior of software systems in these scenarios remains unknown. For example, what happens if an Amazon EC2 instance in the service fleet sustains high CPU consumption? Such a situation can occur because of an unexpected increase in traffic or an incorrectly implemented loop in the code. Building confidence in software systems is hard without putting them under stress. Questions to consider:

Have you tested how the system behaves when the underlying instances have a sustained CPU spike?
Is the system behavior understood under different stress?
Is there sufficient monitoring?
Have the alarms been validated?
Are there any countermeasures implemented? For example, is auto-scaling set up, and does it behave as expected? Are timeouts and retries appropriate?

As mentioned previously, chaos engineering requires adopting practices to proactively identify interactions in distributed systems and the related failures, and then implementing and validating countermeasures. Chaos engineering experiments are the way to do both.

Typical chaos engineering experiments are:

Resource exhaustion: For example, exhaustion of CPU, virtual memory, disk space, and so on. These failures occur frequently and are often caused by failed deployments, memory leaks, or unexpected traffic spikes. Chaos experiments that control resource exhaustion verify that there is sufficient monitoring to detect such failures and proper countermeasures (for example, auto-scaling, auto-restart, etc.) for the system to recover automatically.
Failing or slow network dependency: For example, a database accessed over the network is slow to respond, or its failure rate is high. These failures can happen when the network is experiencing intermittent issues or when dependencies are in a degraded state. Timeouts, retry policies, and circuit breakers are typical countermeasures to these failures; however, they are rarely adequately tested, as unit or integration tests generally can’t validate them with high confidence. Chaos experiments that inject latency or faults in the dependency code path are good at proving the effectiveness of these countermeasures (timeouts, retries, and circuit breakers).
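To illustrate the shape of the second kind of experiment, here is a small in-process sketch: wrap a dependency call so that it becomes slow, then check that a timeout-and-retry countermeasure reacts as expected. This is a toy model, not how AWSSSMChaosRunner injects failures; `query_database`, the delay values, and the helper names are all hypothetical.

```python
import concurrent.futures
import time


def query_database():
    """Stand-in for a real network call to a database (hypothetical)."""
    return "row"


def with_injected_latency(func, delay_seconds):
    """Wrap `func` so every call is delayed, simulating a slow dependency."""
    def slow(*args, **kwargs):
        time.sleep(delay_seconds)
        return func(*args, **kwargs)
    return slow


def call_with_timeout_and_retries(func, timeout_seconds, max_attempts):
    """Call `func`, abandoning each attempt after `timeout_seconds`.

    Returns (result, attempts_used); raises TimeoutError if every
    attempt times out.
    """
    # One worker per attempt, so a retry never queues behind a slow call.
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_attempts) as pool:
        for attempt in range(1, max_attempts + 1):
            future = pool.submit(func)
            try:
                return future.result(timeout=timeout_seconds), attempt
            except concurrent.futures.TimeoutError:
                continue
    raise TimeoutError(f"dependency did not answer in {max_attempts} attempts")


# Baseline: the healthy dependency answers on the first attempt.
baseline = call_with_timeout_and_retries(query_database, 0.5, 3)

# Experiment: inject 0.3 s of latency, well above the 0.05 s timeout.
slow_query = with_injected_latency(query_database, 0.3)
try:
    call_with_timeout_and_retries(slow_query, 0.05, 3)
    outcome = "succeeded"
except TimeoutError:
    outcome = "timed out after retries"
```

In a real experiment the latency would be injected outside the process and the system would be under load, but the hypothesis is the same: with the fault present, the caller fails fast after a bounded number of retries instead of hanging.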

For a more in-depth review of chaos engineering, please see the resources at the end of this article.

AWSSSMChaosRunner: Library for failure injection using AWS Systems Manager

Next, let’s review essential AWS Systems Manager concepts: the AWS Systems Manager Agent (SSM Agent), the SendCommand API, and the AWS Systems Manager documents.

AWS Systems Manager

AWS Systems Manager is a service used to view operational data from multiple AWS services and to automate operational tasks across your AWS resources. A full list of Systems Manager capabilities can be found in the user guide.

For Amazon EC2 instances, AWS Systems Manager offers the SSM Agent to perform actions inside instances or servers. This capability is generally used on most Amazon EC2 instances for operating system patching and for managing SSH sessions.

AWS Systems Manager Agent

SSM Agent is open source Amazon software, released under the Apache License 2.0, that can be installed and configured on an Amazon EC2 instance. SSM Agent makes it possible for Systems Manager to update, manage, and configure these resources. SSM Agent is preinstalled by default on instances created from the following Amazon Machine Images (AMIs): Windows Server 2008–2012 R2 AMIs published in November 2016 or later, Windows Server 2016 and 2019, Amazon Linux, Amazon Linux 2, Ubuntu Server 16.04, Ubuntu Server 18.04, and Amazon ECS-Optimized.

For installation and configuration instructions, refer to the user guide.

SendCommand API

AWS SSM SendCommand API enables running commands programmatically on one or more instances through the SSM Agent.

Example: ‘Hello, World!’ SendCommand using the AWS CLI

The specified instance, instanceid=i-1234567890abcdef0, will run "echo Hello, World!" as a shell script. Targets can be used to specify single instances or groups of instances by using instance tags (for example, Auto Scaling group).
The SendCommand execution will time out in 10 seconds.
Any logs from the command will be sent to the CloudWatch log group named test.

aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --parameters 'commands=["echo Hello, World!"]' \
    --targets "Key=instanceids,Values=i-1234567890abcdef0" \
    --comment "echo Hello, World!" \
    --timeout-seconds 10 \
    --cloud-watch-output-config "CloudWatchOutputEnabled=true,CloudWatchLogGroupName=test"

SSM command documents

An AWS Systems Manager document (SSM document) can be used to specify complex commands in the form of shell scripts to be executed on an instance or groups of instances. You can run SSM documents via the AWS Systems Manager console or the SendCommand API.

Example: An SSM document for black hole routing all outgoing traffic on a given UDP or TCP port

This document is specified in YAML format, but it can also be written in JSON.
Command parameters are defined separately, as variables.
action: aws:runShellScript specifies that the steps (mainSteps) are a part of a shell script.

---
schemaVersion: '2.2'
description: Blackhole a protocol/port on an instance
parameters:
  prtl:
    type: String
    description: Specify the protocol to blackhole. (Required)
    allowedValues:
      - tcp
      - udp
  port:
    type: String
    description: Specify the port to blackhole. (Required)
  duration:
    type: String
    description: The duration - in seconds - of the blackhole. (Required)
    default: "60"
mainSteps:
- action: aws:runShellScript
  name: ChaosBlackholeAttack
  inputs:
    runCommand:
    - iptables -A OUTPUT -p {{ prtl }} --dport {{ port }} -j DROP
    - sleep {{ duration }}
    - iptables -D OUTPUT -p {{ prtl }} --dport {{ port }} -j DROP
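
To see what the document above would run on an instance for a given set of parameters, the following Python sketch substitutes the `{{ name }}` placeholders locally. It only approximates the substitution for illustration and is not Systems Manager’s own implementation; the `render_commands` helper and the example port `6379` are assumptions.

```python
import re

# Matches {{ name }} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_commands(commands, parameters):
    """Replace each {{ name }} placeholder with its parameter value.

    Raises KeyError when a placeholder has no matching parameter."""
    def substitute(match):
        return str(parameters[match.group(1)])
    return [PLACEHOLDER.sub(substitute, line) for line in commands]


# The runCommand entries from the document above.
run_command = [
    "iptables -A OUTPUT -p {{ prtl }} --dport {{ port }} -j DROP",
    "sleep {{ duration }}",
    "iptables -D OUTPUT -p {{ prtl }} --dport {{ port }} -j DROP",
]

# Example values only: block outgoing TCP traffic to port 6379 for 60 seconds.
rendered = render_commands(run_command, {"prtl": "tcp", "port": "6379", "duration": "60"})
for line in rendered:
    print(line)
```

Reading the rendered commands makes the shape of the attack clear: add a DROP rule, wait for the duration, then delete the same rule so the instance recovers.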

AWSSSMChaosRunner

Assuming that SSM Agent is installed on the Amazon EC2 instances and configured with correct permissions, AWS Systems Manager can be used for failure injection on Amazon EC2 instances in the following way:

1. Create the SSM document via the AWS Systems Manager Console or the AWS CLI.

The shell script included in the SSM document must be executable on the underlying instances.

2. Call the SSM SendCommand API via the AWS Systems Manager Console or the AWS CLI.

The Amazon EC2 fleet can be defined by passing appropriate tags in the targets parameter.
The parameters of the underlying shell script must be specified (duration/port/protocol in the above example).
The CloudWatch log group must be configured and specified to view logs from the whole Amazon EC2 fleet in a single location.

If the above steps are successful, all specified Amazon EC2 hosts will have the failure injected. For example, EC2 hosts will black-hole outgoing traffic to a given UDP/TCP port. However, no requests may be hitting the service you are injecting failure into, either because it is a period of low traffic or because the fleet is a development fleet. In that case, the effect of the failure injection might be minimal or, worse, not perceived at all, and it will be difficult to validate the countermeasures put in place. A third step is needed.

3. Generate traffic to the service using load generators to simulate real-life high traffic on the system.

Running the above steps manually is error-prone, risky, and time-consuming. These steps can be automated with the recently released AWSSSMChaosRunner library, as illustrated in the image below.

This library abstracts the creation of SSM documents and calling the SSM SendCommand, and provides tried and tested SSM documents for your chaos experiments. This library is open sourced under the Apache-2.0 License and is available on GitHub and Maven Central.

amzn/awsssmchaosrunner

Failure injections

The failure injections currently available in the AWSSSMChaosRunner library are:

NetworkInterfaceLatency: Adds latency to all inbound/outbound calls to a given network interface.
DependencyLatency: Adds latency to inbound/outbound calls to a given external dependency.
DependencyPacketLossAttack: Drops packets on inbound/outbound calls to a given external dependency.
MemoryHog: Hogs virtual memory on the fleet.
CPUHog: Hogs CPU on the fleet.
DiskHog: Hogs disk space on the fleet.
AWSServiceLatencyAttack: Adds latency to an AWS service using the CIDR ranges returned from ip-ranges.amazonaws.com. This is necessary for services such as Amazon Simple Storage Service (Amazon S3) or Amazon DynamoDB, where the resolved IP address can change during the chaos experiment.
AWSServicePacketLossAttack: Drops packets to an AWS service using the CIDR ranges returned from ip-ranges.amazonaws.com. This is necessary for services such as Amazon S3 or Amazon DynamoDB, where the resolved IP address can change during the chaos experiment.
MultiIPAddressLatencyAttack: Adds latency to all calls to a list of IP addresses. This could be useful for a router → host kind of setup.
MultiIPAddressPacketLossAttack: Drops packets from all calls to a list of IP addresses. This could be useful for a router → host kind of setup.
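
The two AWS service attacks above rely on the CIDR ranges that AWS publishes. As a rough illustration of that idea, the Python sketch below filters a document in the shape of ip-ranges.json (a `prefixes` list whose entries carry `ip_prefix`, `region`, and `service`) down to one service in one region. How the library itself does this is not shown in this article, so treat the `service_cidrs` helper as an assumption; the sample addresses are reserved documentation ranges, not real AWS ranges.

```python
import ipaddress


def service_cidrs(ip_ranges, service, region):
    """Return the IPv4 CIDR blocks listed for one service in one region."""
    cidrs = [
        entry["ip_prefix"]
        for entry in ip_ranges.get("prefixes", [])
        if entry["service"] == service and entry["region"] == region
    ]
    # Parsing each block validates it; sorting keeps the output stable.
    return sorted(cidrs, key=ipaddress.ip_network)


# Hand-written sample in the shape of ip-ranges.json (illustrative values only).
sample = {
    "prefixes": [
        {"ip_prefix": "203.0.113.0/24", "region": "eu-west-1", "service": "S3"},
        {"ip_prefix": "198.51.100.0/24", "region": "eu-west-1", "service": "S3"},
        {"ip_prefix": "192.0.2.0/24", "region": "us-east-1", "service": "DYNAMODB"},
    ]
}

print(service_cidrs(sample, "S3", "eu-west-1"))
```

Targeting whole CIDR blocks rather than a single resolved address is what keeps the attack effective when the service’s IP address changes mid-experiment.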

Chaos testing an EC2 service

Take, for example, a service running in Amazon EC2. (Commonly recommended components, such as CDNs, load balancers, and VPCs have been omitted for simplification).

This service receives client requests, applies business logic, and accesses a database (or any external dependency). Let’s learn how to apply the AWSSSMChaosRunner library to this service.

Prerequisites

Familiarity with IAM concepts, such as IAM policies, roles, and users.
Tests for the service are written in Java, Kotlin, or Scala. The AWSSSMChaosRunner library is only available for these languages.
Service health and behavior must be instrumented and monitored with metrics or logs. Without monitoring, the effect of failure injections cannot be observed.
Some baseline traffic (load) is generated to the service from the tests while the chaos experiment is executed. Generating traffic will help validate the experiment hypothesis.

Step 1. Set up permissions for calling AWS Systems Manager from the tests package.

Although implementing this part in different ways is possible, the approach described here generates temporary credentials for AWS Systems Manager on each run of the tests.

First you must create an IAM user and an IAM role it can assume. The following IAM policy must be attached to this role.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "sts:AssumeRole",
                "ssm:CancelCommand",
                "ssm:CreateDocument",
                "ssm:DeleteDocument",
                "ssm:DescribeDocument",
                "ssm:DescribeInstanceInformation",
                "ssm:DescribeDocumentParameters",
                "ssm:DescribeInstanceProperties",
                "ssm:GetDocument",
                "ssm:ListTagsForResource",
                "ssm:ListDocuments",
                "ssm:ListDocumentVersions",
                "ssm:SendCommand"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "ec2:DescribeInstances",
                "iam:PassRole",
                "iam:ListRoles"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "ssm:StopAutomationExecution",
                "ssm:StartAutomationExecution",
                "ssm:DescribeAutomationExecutions",
                "ssm:GetAutomationExecution"
            ],
            "Resource": [
                "*"
            ],
            "Effect": "Allow"
        }
    ]
}

Step 2. Initialize the AWS Systems Manager client.

This code should be invoked during the initialization of the tests (i.e., wherever the singletons are created).

//Kotlin
@Bean
open fun awsSecurityTokenService(
   credentialsProvider: AWSCredentialsProvider, 
   awsRegion: String
   ): AWSSecurityTokenService {
    return AWSSecurityTokenServiceClientBuilder.standard()
        .withCredentials(credentialsProvider)
        .withRegion(awsRegion)
        .build()
}

@Bean
open fun awsSimpleSystemsManagement(
   securityTokenService: AWSSecurityTokenService,
   awsAccountId: String,
   chaosRunnerRoleName: String
   ): AWSSimpleSystemsManagement {
    val chaosRunnerRoleArn = "arn:aws:iam::$awsAccountId:role/$chaosRunnerRoleName"
    val credentialsProvider = STSAssumeRoleSessionCredentialsProvider
        .Builder(chaosRunnerRoleArn, "ChaosRunnerSession")
        .withStsClient(securityTokenService).build()

    return AWSSimpleSystemsManagementClientBuilder.standard()
        .withCredentials(credentialsProvider)
        .build()
}

Step 3. Start the fault injection attack before starting the test, and stop it after the test.

The given test sends traffic to the service.

//Kotlin
@Before
override fun initialise(args: Array<String>) {
    // Start the failure injection only when chaos testing is enabled for this run.
    if (shouldExecuteChaosRunner()) {
        ssm = applicationContext.getBean(AWSSimpleSystemsManagement::class.java)
        ssmAttack = getAttack(ssm, attackConfiguration)
        command = ssmAttack.start()
    }
}

@Test
fun `given failure injection generate calls to the service`(duration: Int) {
    // This test should call an endpoint of the service and keep repeating this for the duration of the test.
    // Additional logging can be added or service dashboards can be monitored for an overview.
    val startTime = LocalDateTime.now()
    while (getElapsedSeconds(startTime) <= duration) {
        serviceClient.callEndpoint()
    }
}

@After
override fun destroy() {
    // Stop the attack only if it was started in initialise().
    if (shouldExecuteChaosRunner()) {
        ssmAttack.stop(command)
    }
}

Step 4. Run the test.

Execute the command to run the above test.

Note: AWSSSMChaosRunner can also be used for an EC2+ECS based service with one setup step prior to the above steps. Please see the GitHub README for more details.

Prime Video uses AWSSSMChaosRunner to prevent a potential outage

In March 2020, Prime Video launched Prime Video profiles, which give each user separate recommendations, season progress, and Watchlist based on individual profile activity. This new customer experience required the design and implementation of new services using Amazon EC2.

Prime Video Profiles

These services are part of a distributed system, and they call other internal Amazon services over the network. Testing the timeouts, retries, and circuit-breaker configurations used by this service was considered critical because:

These code paths are hard to validate through unit, integration, and end-to-end tests.
Issues in configurations are usually discovered during an outage, when these countermeasures (timeouts, retries, and circuit breakers) would be needed.

Prime Video implemented this chaos engineering experiment using the AWSSSMChaosRunner’s DependencyLatency attack, and by generating load against the service, thus simulating traffic when dependencies exhibit high latency.

The service-to-service call metrics were observed and, as a result, timeouts, retries, and circuit-breaker configuration were validated.

Now let’s review the result of one of these chaos experiments and find out how it helped us proactively discover a potentially customer-impacting issue.

Experiment: Validate ElastiCache timeout

The chaos experiment is set up as follows:

Experiment hypothesis: The timeout for the Service → ElastiCache call is set to 40 milliseconds. This will be validated by observing the Service → ElastiCache latency metric during the experiment.
Failure injection: Two seconds of latency is added to the Service → ElastiCache call using AWSSSMChaosRunner.
Generate baseline load against the service: 1000 requests per second are generated against the service. As discussed previously, running chaos engineering experiments while loading the system is critical.

Experiment outcome

The above image shows that the Service → ElastiCache latency is going beyond the configured 40ms timeout. Thus, the ElastiCache timeout configuration is failing.

Following these results, we fixed a bug in the timeout configuration.

To validate our fix, we subsequently re-ran the same experiment.

The illustration shows that the maximum of Service → ElastiCache latency is capped at 40 milliseconds, the configured timeout value. This happens despite the extra latency of two seconds injected into this call path by the experiment. This result validates that the service will time out quickly if ElastiCache is slow to respond or if that network path has some issue.

Running this chaos experiment led to the discovery of a bug in the countermeasure for dependency degradation (i.e., ElastiCache timeout). The bug fix prevented a potential customer-impacting failure from happening.
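
The check behind this experiment can be written down in a few lines. The Python sketch below is only an illustration: the `timeout_hypothesis_holds` helper, the 5 ms tolerance, and both latency series are made-up assumptions, not Prime Video’s tooling or data.

```python
def timeout_hypothesis_holds(latencies_ms, timeout_ms, tolerance_ms=5):
    """True when no observed call took longer than the timeout plus a small tolerance."""
    if not latencies_ms:
        raise ValueError("no latency samples: generate load during the experiment")
    return max(latencies_ms) <= timeout_ms + tolerance_ms


# Made-up latency samples in milliseconds, for illustration only.
before_fix = [12, 38, 2004, 2011, 1998]   # timeout not applied: calls wait for the injected 2 s
after_fix = [11, 40, 40, 39, 40]          # timeout applied: calls are cut off at 40 ms

print(timeout_hypothesis_holds(before_fix, timeout_ms=40))
print(timeout_hypothesis_holds(after_fix, timeout_ms=40))
```

The empty-sample guard reflects the point made earlier: without load on the service, there is nothing to observe and the hypothesis can be neither confirmed nor rejected.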

Conclusion

Testing service dependency timeouts, retries, and circuit-breaker configurations is essential. In this post, we presented an open source approach to failure injection on Amazon EC2 using AWS Systems Manager, and we demonstrated how Prime Video combines it with load testing to achieve higher levels of resiliency. This Prime Video case study shows how chaos engineering helps prevent potentially customer-impacting issues that are difficult to pinpoint using traditional testing methods.

Resources

PrinciplesOfChaos.org
Chaos Engineering: The art of breaking things purposefully blog collection on Medium
Awesome chaos engineering collection of reading resources on GitHub

Originally published at https://aws.amazon.com on August 18, 2020.

Ten lessons from twelve years of AWS

Adrian Hornsby — Tue, 07 Jul 2020 15:05:02 +0000

Recording and redacted transcript from my keynote at the AWS Community Day Australia and New Zealand.

First of all, I would like to thank everyone in the AWS Community in Australia and New Zealand and the AWS Heroes who have helped put this event together.

In particular, Augustino (aka Gus), Peter, and Nathan, the AWS Heroes, and John, our OBS Ninja. All have done amazing work building this Community.

Thank you!

A little reminder that if you plan to share your day with others on social media, please use the hashtag:

#AWSCommunityDayANZ

A few years ago, I did a talk called ten lessons from ten years on AWS.

That was at the Community Day in Bangalore, India. Back then, there wasn’t any COVID virus, so I traveled there — and I loved it.

I wish I could be on site with you today, but instead, I am virtual with you this early morning, live from Finland.

It is 4 am here, so bear with me if I am a little slow :)

So, I did that talk two years ago, but it feels like it was already ten years ago.

J. R. Rim said:

“In the information age, one tech year is equivalent to one person’s lifetime.”

While that is true, I recently found myself highly influenced by a dog.

Her name is Hilma, and as you can see, she likes kissing.

And so, I now see tech years like dog years, with a ratio of approximately 1 to 7.

I am trying to say that time flies — fast — and for today, I thought it would be interesting to look back at the past again, and see what went well, and what didn’t.

So, as I said, I did a similar talk two and a half years ago, and looking at it now, I feel like I need to make some corrections.

That is the summary slide of it.

Do you notice the problem?

The biggest mistake I made back then was to focus mostly on technology.

Not that what is on that slide is wrong, quite the opposite, but these lessons don’t all qualify as life lessons — lessons that make you grow as a person.

So today, I will try to do a better job and share some of the lessons that changed me as a person.

#1 — It’s the customers and the information you give them that’s important, not the technology.

WAP, ASP, Instant messaging, Grid computing, Portals, Biometrics, AR, VR, Web2.0, P2P, IoT, 4G, 5G, InternetTV, DVB, NFC, BigData, wearable, NLP, ML, Autonomous Vehicle, Connected homes, Deep learning, Neural network, digital twin, crypto, serverless, k8s, blockchain, …

These are all examples of hypes I have lived through since 2000.

While hypes come and go, the bigger picture remains the same.

More important than the technology itself are the customers, especially how good an experience you give them.

Technology is irrelevant if your customers aren’t happy!

Whether you use Ruby, Python, Node, Haskell, or Clojure, K8S or serverless, a monolith or micro-services, customers’ happiness makes the final call.

I have seen fantastic customer experiences, and genuinely successful businesses, run on less-than-polished software systems.

And honestly, technology is moving too fast to always make sound technological decisions.

How many front-end frameworks will be released during this talk? :)

Most of the time, successful teams guess and take a shot at it. What they do differently is adjust course quickly.

And to me, this has become the best measure for success:

“How fast can you correct course once you have customer feedback?”

Notice that first, you need to have that feedback, and only then can you correct course. Otherwise, you go blind.

So, focus on that feedback loop. And it means a few things:

Listen to your customers.
Forget your ego.
Invest in automation — early! (CI/CD).
Repeat.

Most of what we build in the cloud is organized around Data, with a big D.

We store Data.
We process it.
And we move it around.

Easy, right? :)

How fast you process it and move it around influences the price and the accuracy of the information you extract from Data.

The rest is mostly operations. How much time you spend on operations influences the price but, most importantly, the amount of time you have left for listening to your customers and adjusting course. So choose wisely.

#2 — You can’t know everything.

Technology changes too fast to keep track of all of it while maintaining a balanced life.

Trying to keep up with technological changes is like trying to make sense of AWS services naming convention — it’s impossible :)

As a developer advocate for AWS, I often get asked

“How do you keep up with everything?”

Here is some news for you: I actually don’t :)

Everyone is struggling to keep up, worried about being left behind the waves of technological evolution.

Even people who seem to have it all together. They are more like you and me than you would believe.

There is a weird paradox to knowledge:

“the more you know, the more you realize you don’t know.”

And of course, we feel bad about it. We feel like imposters — I certainly felt and continue to feel like one.

To me, imposter syndrome was and continues to be one of the hardest things to deal with, and I am a 40-year-old white male, so imagine what it must be for women in tech or other minorities.

Communities like this one have helped me a lot in overcoming my fears: communities where I could discuss and practice, often without being judged.

Another weird paradox exists:

“As you become more senior, you know less and less about new technologies.”

Yet, the most inspiring engineering reviews I have participated in while working at Amazon were with very senior engineers — the senior principals. There are about 70 or so of them at Amazon.

What’s their secret?

They listen.
They ask questions — a lot of them.
And, they challenge ideas and biases.

I will come back to biases later — but do you notice something?

These are all people skills — they listen, are curious, and they dive deep.

Senior engineers are people engineers more than technology engineers, which leads me to lesson three.

#3 — Invest in people skills

… as much as you invest time learning about the latest NodeJS framework, k8s or serverless.

No one’s technical skill is irreplaceable. I learned it the hard way.

One day, after three years of working heart and soul for a startup where I was employee number two, I was fired.

I arrived one morning at work with a smile on my face. A coffee later, I was walking out of the door, with all my books, but without a job.

And it was terrifying.

Sure, we had financial difficulties, but I had built everything from scratch. I knew every line of code, and I never thought that I would be the first to go.

But I was, and the technical skills that I thought were irreplaceable were, in fact, quickly covered by the new, very good, and cheaper workforce just hired out of college.

The only option you have to stand out and stay relevant is to invest in people skills.

And I get it, for many of us it is difficult, even more now — but it is critical.

There are many skills necessary, of course — but one stands out: Empathy.

Bill Bullard correctly said:

“Opinion is really the lowest form of human knowledge. It requires no accountability, no understanding. The highest form of knowledge … is empathy, for it requires us to suspend our egos and live in another’s world. It requires profound purpose larger than the self kind of understanding.”

When I got fired over a coffee cup, I had a lot of opinions. I was a lead developer, so I had a lot of empathy for the team of developers I managed but a lot of opinions towards the management team.

Opinions that cost me my job.

While I understood our developers, I failed to communicate upward and failed to understand the different business needs.

Like most things in life, you have to find the right balance to be successful.

But the more people skills you have, the easier it will be.

#4 — Embrace failure.

I got that one right the first time! :)

While we have learned that almost everything will work again if you reboot it, including you, sometimes, things just fail.

Ask Murphy about it!

Brian Tracy, beautifully, said:

“It is not failure itself that holds you back; it is the fear of failure that paralyzes you.”

Let me start by saying that scared developers:

Won’t try things out.
Won’t innovate as fast as your business needs them to.
Won’t dare to jump in and fix things when (pardon my French) shit hits the fan.
Won’t do more than asked.
And won’t stay long in the job.

I know!! I was one of them.

Failure should not be seen as “you are a failure” but merely as moving along the path of experimentation.

If you don’t fail, you are probably not trying hard enough or pushing the limits of the “adjacent possible.”

Innovations flourish in a community where ideas are exchanged, discussed, tried, and improved over time — a community like this one.

But most of all, innovations flourish in an environment that embraces failure.

Remember, Thomas Edison tested more than 6000 different materials before settling on a light bulb using carbonized bamboo.

Failure tends to teach you lessons that reading books or blog posts can’t teach you.

The most successful teams I have worked with are those I failed the hardest with, but we were prepared to fail, and especially we were not afraid of losing our jobs.

The best way to learn from failure is to practice failure — you know I had to mention chaos engineering at least once in this talk :)

You also have to work on minimizing the blast radius of failures. Techniques like bulkheads, sharding, isolation, load shedding, graceful degradation, immutability, etc., will come in handy, so you should probably get familiar with them.

But again, technology itself won’t make your system more robust; people will — which brings me to point five.

#5 — Don’t blame people when things fail.

Sidney Dekker, one of the most important writers in the field of safety engineering, rightly said:

“The question […] is not who is responsible for failure; rather, it asks what is responsible for things going wrong. What is the set of engineered and organized circumstances that is responsible for putting people in a position where they end up doing things that go wrong?”

Everyone screws up, and one day, so will you!

I screwed up several times — big time!!

I deleted databases in production — twice — while trying to recover from an outage. But that is the topic for another talk :)

Do you think I woke up that morning thinking:

“Today, I will come to work, delete some databases, and do a shitty job!”

Of course not — Everyone has good intentions, even when we screw up.

So don’t blame individuals or teams. Similarly, don’t assign or imply blame to others, individuals, groups, or organizations. Instead, identify what happened and question why those things happened.

Stopping at people’s errors isn’t right. It is a sign that you haven’t gone deep enough.

Think about the situation that led the operator to trigger the event. Why was the operator able to do such a thing? Was it a lack of proper tools, a problem in the culture, or a missing process?

[…]

Continuing on databases —

#6 — Route53 is NOT a database.

… regardless of what QuinnyPig says on Twitter or Reddit.

Nor are tags :)

#7 — Watch out for heuristics and biases!

Being aware of cognitive biases when performing your job is a superpower that will often make the difference between becoming an inspiring leader or not.

Cognitive biases impact our perception of reality, driving us into making incorrect conclusions and often, irrational decisions.

While I don’t think removing biases is possible, it definitely will help you if you can identify them and properly adjust your perception and question your assumptions.

Some of the biases and heuristics to watch out for:

The confirmation bias — “the tendency to search for, interpret, favor, and recall information that confirms or supports one’s prior personal beliefs or values.”
The sunk cost fallacy — “the tendency for people to believe that investments (i.e., sunk costs) justify further expenditures.”
The common belief fallacy — “If many believe so, it is so.” You know, blockchain …
The hindsight bias — “the tendency for people to perceive events that have already occurred as having been more predictable than they were before the events took place.”
The fundamental attribution error — “the tendency to believe that what people do reflects who they are.”

And there are more!

You get the idea — we are full of imperfections :)

So, spend some time reading about it — it will make you a better developer, a more understanding manager, and eventually an inspiring leader.

It will make you a better person overall.

#8 — Writing.

I started writing my blog on Jan 9, 2018 — after I gave the first version of this talk in India — two and a half years ago.

Julien Simon, the ML principal developer advocate, tried to convince me for a year before I wrote my first words down.

I was worried I had nothing to say. So afraid that I deleted words as fast as I was writing them down.

You know, the fear of failing does paralyze you.

But eventually, I took Julien’s advice, published my first blog post, and the second, the third, etc.

Today, with approximately 60 blog posts published and nearly 300 thousand visitors a year, I can easily say that writing that first blog post was one of the most important things I did in my career.

One crucial thing I learned along the way:

Every writer you know writes terrible first drafts. And second drafts. And third.

Every. Single. Writer.

Eventually, after tens of iterations, words start to make sense.

But it takes time! So don’t give up.

The funny thing is that I hated writing most of my school life, and long after.

It still takes me an absurd amount of time to put my thoughts into words. But in the process, it also clarifies them, puts them in order, crystallizes them.

The key to writing is to start, and then do it word by word.

If you are like me, you are probably often wondering what to write about.

One of my favorite writers, Anne Lamott, said:

“Remember that every single thing that happened to you is yours, and you get to tell it.”

And that is so true.

What problem did you solve lately? Why? How?

What happened to you a few days, months, or years ago, is probably happening to someone today.

So, tell your stories, for only you, with your own words, can tell them!

My favorite part of writing is the review process.

As soon as I have some sort of draft, I select several victims in my team or broader network, and I ask them to review it — to challenge my ideas, my opinions, and surface biases.

The review process was initially very challenging, as I had to deal with criticism.

You know, the same criticism a developer gets when her or his code is reviewed in a pull request.

“How dare you criticize my baby?”

Getting criticism is humbling, very humbling, and for the better: it forces you to listen, learn, and eventually improve.

Of course, you need to find competent reviewers, reviewers that care about your success. Reviewers that aren’t scared of telling you the truth.

So, make sure you have some of them in your network of influence.

And, if one day you decide to start writing, I would happily become one of them.

#9 — Mentoring.

Throughout my career, I’ve had the chance to have amazing mentors.

Whether in my work or outside of my career aspirations, I have had a few key people who have helped me move forward.

Sometimes, this is as simple as bouncing my ideas back-and-forth to see a clearer picture.

Other times, it’s getting encouragement, a supportive tap on the shoulder, and advice on what to do next.

A good mentor can help you be your best self.

Everyone needs mentors. All my mentors have mentors themselves.

Mentoring others has been really important to me, especially in the past few years.

Most people enter their professional lives with little understanding of the complex landscape and expectations for excellence required for a successful career.

I certainly had no idea what I was doing when I started!

It is not a problem per se, for it is the normal unrolling of life, but it is scary at times, and a rather excellent opportunity for mentoring.

Mentoring is essential, not only because of the knowledge and skills one can learn from mentors but also because mentoring provides professional and personal support that facilitates success.

Research shows that people with good mentors have a higher chance of success and more significant career advancement potential.

So, open yourself to mentoring others, and look for a personal mentor of your own as well.

#10 — Learn from others.

Learning from others is the single most important thing I have learned. And I have to admit, sometimes I still have to remind myself to shut up and listen.

There is always someone in the room that knows more than you do. That person is just not necessarily broadcasting it.

Be open and ready to be challenged and to change your opinion.

Share your ideas with others, challenge them, and, above all, let others challenge yours.

Fortunately, there are myriad ways to do this. Participating in this community is, of course, one of them.

A few weeks ago, while preparing for this talk, I asked the developer advocate team to share their lessons with me, and a few of them kindly answered.

But as I didn’t really know what I was going to talk about a few weeks ago, I asked my question with a heavy bias towards getting technical answers — so despite my first lesson not to focus on technology, the following is mainly about technology — but I am the one to blame for that :)

Javier Ramirez :

Don’t use AWS as a traditional data center. Have a consistent naming/tagging strategy early on, especially for everything that has unique names (S3 buckets, for example).

Alex Casalboni:

Learn IAM before you do anything serious. Master Infrastructure as Code (IaC), either CloudFormation or Terraform. Use the management console only to build prototypes, or the first time you try out a new service, switch to IaC for anything else.

Boaz Ziniman:

Account security — don’t use the root account. Always enable MFA. Set IAM users for every developer, with different roles. Tag everything!

Enrique Duvos:

Security, Security, Security.

Dennis Traub:

Turn on CloudTrail. Turn on GuardDuty. Don’t assume that development teams will consider security when building on AWS. So, I’m with Enrique here: think security.

Marcia Villalba:

Try to see if there is a managed service before building it yourself.

Danilo Poccia:

Adopt the right mindset: With on-prem virtualization and hosting, you have a finite set of resources where you try to squeeze as many things as possible. With the cloud, you have access to a virtually unlimited set of resources, and you should use the minimum you need at any point in time.

Sebastien Stormacq:

Use EC2 only if you have exhausted all other possibilities. This is not because of EC2, but because no machine to manage is better than one machine to manage. So, go serverless as much as possible. And serverless is not only Lambda, but it is also RDS, Cognito, S3, API Gateway, etc.

Ricardo Sueiras:

Shift the conversation around AWS from technology to business outcomes (Agility, etc.). It will help you get the exec sponsor/support required for success. From a technical point of view, bet on automation. Resist being obsessed with the console.

Isabel Huerga Ayza:

Governance: everything that is not clearly defined will be done by no one. Or, best case, it will be done, but not consistently. It comes down to accountability and ownership, which should not be left to good faith when what is at risk is your business. Don’t wait until a production disruption to enable support. Set up budget alerts!

Steven Bryen:

My advice is to think differently about things like the dynamic allocation of resources. An example would be security groups. You can create a rule referencing another security group, which means new instances automatically match the rule and have access as they scale. It is very different from the traditional on-prem mindset but is so valuable.

Cobus Bernard:

Read the pricing page for a service and set up billing alerts.

Feel free to connect with them and challenge them on their ideas. I am sure they would love that!

That’s it for now, folks! Again, thanks for inviting me and have a lovely rest of the day.

-Adrian

https://medium.com/media/4971e7d7ca309f5a367c1cfbbd5651ee/href

Incident Postmortem Template

Adrian Hornsby — Fri, 26 Jun 2020 10:58:44 +0000

Since I published my blog series Towards Operational Excellence, I received a relatively large amount of feedback. But one question, in particular, stood out.

“Can you share an incident postmortem template?”

In this blog post, I will share an example incident postmortem template, which I hope will help you get started. I will also share some DOs and DON’Ts that I have seen work across a wide variety of customers — both internally in Amazon, and externally.

What is a postmortem?

A postmortem is a process where a team reflects on a problem — for example, an unexpected loss of redundancy, or perhaps a failed software deployment — and documents what the problem was and how to avoid it in the future.

“Postmortems are not about figuring out who to blame for an incident that happened. They are about figuring out, through data and analysis, what happened, why it happened, and how it can be stopped from happening again.”

At Amazon, we call that process Correction-Of-Errors (COE), and we use it to learn from our mistakes, whether they’re flaws in tools, processes, or the organization.

We use the COE to identify contributing factors to failures and, more importantly, drive continuous improvement.

To learn more about our COE process, please check out my favorite re:Invent 2019 talk from Becky Weiss, a senior principal engineer at AWS.

Incident Postmortem Template

Below is an example of an incident postmortem template.

I do not claim that this template is perfect — just that it’s an example that can help get started.

If you think something is missing, if you agree or disagree strongly about a particular part of that template, please share your feedback with me by leaving a comment below.

Bare-bone version

For all the let-me-get-straight-to-the-point champions out there — here is a bare-bone template.

https://medium.com/media/3af3e1c22e9bc3f9a51b18a56145b994/href

Extended-cut version

In this extended-cut version, I will expand on each of the different parts of the template, suggesting what could belong to each section.

Title:

Descriptive title (Service XYZ failed, affecting customers in the EU region)

Incident date:

Date of the event.

Owner :

Name of the owner of the postmortem process.

Peer-review committee :

List of people that will verify the quality of the postmortem before publishing it.

Tags :

List of tags or keywords to classify the event and facilitate future search and analysis.

Example: Configuration, Database, Dependency, Latent

Summary:

A summary of the event.

Supporting data:

Metric graphs, tables, or other data, that best illustrate the impact of this event.

Customer Impact:

Discuss customer-impact during the event. Explicitly mention the number of impacted customers.

Incident Response Analysis:

Example of questions you could address:

Was the event detected within the expected time?

How was it detected? (e.g., alarm, customer ticket)

How could time to detection be improved?

Did the escalation work appropriately?

Would earlier escalation have reduced or prevented the event?

How did you know how to mitigate the event?

How could time to mitigation be improved?

How did you confirm the event was entirely mitigated?

Post-Incident Analysis:

Example of questions you could address:

How were the contributing factors diagnosed?

How could time to diagnosis be improved?

Did you have an actual backlog item that could’ve prevented or reduced the impact of this event? If yes, why was this item not done?

Could a programmatic verification rule (e.g., AWS Config) be used to prevent this event?

Did a change trigger this event?

How was that change deployed — automatically or manually?

Could safeguards in the deployment have prevented or reduced the impact of this event?

Could this have been caught and rolled back during the deployment?

Was this tested in a staging environment? If yes, why did this pass through? Could more tests have prevented or reduced the impact of this event?

If this change was manual, was there a playbook? Was that playbook practiced, tested, and reviewed recently?

Did a specific tool/command trigger the event? Could safeguards have prevented or reduced the impact of this event? Was there any safeguard triggered? If not, why were none in place?

Was a production operation readiness or well-architected review performed on the system(s)? If not, why? When was the last evaluation done?

Could a review have prevented or reduced the impact of the event?

Timeline:

Detail all major event points with their time (including the time zone) and a short description.

Example: 09:19 EEST, database ran out of connections. Link to graph and log.

Diving deep on contributing factors:

Start with the problem.

Keep asking questions (e.g., why?) until you get to multiple contributing factors. There is no single cause of failure. So, keep going!

Probe into different directions — tools, culture, and processes. NEVER stop at human errors (e.g., if an operator enters a wrong command, ask why no safeguards were in place, or why wasn’t the action peer-reviewed, and why didn’t that command have roll-back?)

Define action items against all contributing factors.

Lessons Learned:

Describe what your team is taking away from this event.

What did you learn that will help you in the future to prevent similar events?

What unexpected things happened?

What process broke down?

Lessons learned should correlate directly, if possible, with an action item.

Action items:

List of action items with a title, an owner, due date, a priority, and a link to the backlog item created to follow up.

Example: Evaluate shorter timeout for GET API 123, adhorn, July 3rd, 2020, high priority, link to a backlog item.
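
If you want to codify action items rather than keep them as free text, the fields listed above map naturally onto a small record. The following is only a minimal Python sketch: the class name ActionItem, its field names, and the backlog URL are assumptions made for illustration, not part of the template.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    """One postmortem action item (class and field names are illustrative)."""
    title: str
    owner: str
    due_date: date
    priority: str
    backlog_link: str

    def missing_fields(self) -> list:
        """Return the names of the text fields that were left empty."""
        names = ("title", "owner", "priority", "backlog_link")
        return [name for name in names if not getattr(self, name).strip()]


item = ActionItem(
    title="Evaluate shorter timeout for GET API 123",
    owner="adhorn",
    due_date=date(2020, 7, 3),
    priority="high",
    backlog_link="https://backlog.example.com/items/123",  # hypothetical URL
)
print(item.missing_fields())
```

A check like missing_fields could run before the peer review, so that reviewers never see an action item without an owner or a backlog link.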

Things to do when doing a postmortem

Generally, select senior, experienced owners and reviewers to ensure the high-quality completion of the postmortem.
Proper postmortems dive deep into the issues. Nothing is left unanswered unless it becomes an action item.
Question your assumptions, be aware of heuristics, and fight biases ** (see below).
Reviewers should be fully empowered to reject a postmortem for not meeting a high-quality bar.
Review recent postmortems in meetings with the broader organization.
Be smart about what can be accomplished in the short-term, don’t over-promise.
Use existing postmortems and previous lessons learned to design new “best practice” patterns, and set mechanisms to share the knowledge with the rest of the organization (e.g., present postmortems in weekly operational reviews)
Codify and automate lessons learned when possible.
Don’t let postmortems drag on for a long time.

** Heuristics and biases to watch out for (in no particular order):

The confirmation bias — “the tendency to search for, interpret, favor, and recall information that confirms or supports one’s prior personal beliefs or values.”
The sunk cost fallacy — “the tendency for people to believe that investments (i.e., sunk costs) justify further expenditures.”
The common belief fallacy — “If many believe so, it is so.”
The hindsight bias — “the tendency for people to perceive events that have already occurred as having been more predictable than they actually were before the events took place.”
The fundamental attribution error — “the tendency to believe that what people do reflects who they are.”

Things to avoid when doing a postmortem

Don’t blame individuals or teams. Similarly, don’t assign or imply blame to others, individuals, teams, or organizations. Instead, identify what happened and question why those things happened.
Stopping at an operator error isn’t right. It is a sign that you haven’t gone deep enough. Think about the situation that led the operator to trigger the event. Why was the operator able to do such a thing? Was it a lack of proper tools, a problem in the culture, or a missing process?
Don’t do postmortems punitively. Don’t do a postmortem if no one is going to get value and find improvements.
Avoid open-ended questions or action items. Action items such as “create training” and “improve documentation” aren’t useful. Either you didn’t go deep enough, or you didn’t need a postmortem.
Action items should focus on what can be done in the short term to mitigate the event.
Don’t try to fix everything in your system in a single postmortem. “We need to change the overall architecture of our system now” or “we need to move to Fortran” aren’t the right action items.
Do not spend an unreasonable amount of time on writing postmortems. They should be done relatively fast and with a high-quality bar.
Do not write postmortems on weekends, or in a hurry. It can generally wait until the next Monday.

That’s all for now, folks. I hope you’ve enjoyed this post. I would love to hear what works and what doesn’t work for you, so please don’t hesitate to share your feedback and opinions. Thanks a lot for reading :-)

— Adrian

The Chaos Engineering Collection

Adrian Hornsby — Tue, 16 Jun 2020 05:34:44 +0000

A list of my chaos engineering related blog posts and open-source projects.

Series on chaos engineering

This is a collection of three articles on chaos engineering that present and discuss the different phases of the chaos engineering process.

Part 1: The art of breaking things purposefully

In Part 1 of this series, I introduce chaos engineering and explain how it helps uncover and fix unknowns in your system before they become outages in production; and also how it fosters positive cultural change inside organizations.

Chaos Engineering — Part 1

Part 2: Planning your first experiment

In Part 2, I discuss areas to invest in to start designing your first chaos engineering experiments and pick the right hypothesis.

Chaos Engineering — Part 2

Part 3: Failure Injection — Tools and Methods

In Part 3, I focus on the experiment itself and present a collection of tools and methods that cover the broad spectrum of failure injection necessary for running chaos engineering experiments.

Chaos Engineering — Part 3

Practical Chaos Engineering

A set of articles presenting practical implementations of chaos engineering experiments.

Injecting Chaos to Amazon EC2 using AWS System Manager

In this article, I show how to inject failure into your application using AWS Systems Manager, and I have open-sourced plenty of ready-made failure injection scripts to get you started. Try it; it’s pretty awesome!

Injecting Chaos to AWS Lambda functions using Lambda Layers

In this article, I explain how to use AWS Lambda Layers to conduct chaos engineering experiments on Lambda functions.

Original post:

Injecting Chaos to AWS Lambda functions using Lambda Layers

Update:

Collection of python scripts to run failure injection on AWS infrastructure

adhorn/aws-chaos-scripts

Creating your own Chaos Monkey with AWS Systems Manager Automation

Adrian Hornsby — Tue, 16 Jun 2020 05:32:22 +0000

Chaos Engineering on AWS

I’d like to express my gratitude to my colleagues and friends Jason Byrne and Matt Fitzgerald for their valuable feedback.

In a recent post, I explained how to use AWS SSM Run Command to inject failures on EC2 instances. SSM Run Command is well-suited to executing custom scripts on EC2 instances, especially to inject latency or blackouts on the network interface, or to exhaust resources such as CPU, memory, and IO.

However, we need more than that. Failure injection should target resources, network characteristics and dependencies, applications, processes and services, and also the infrastructure.

We also need to have a broad set of controls and capabilities to perform chaos experiments safely. We might want to:

Execute commands and scripts directly into EC2 instances.
Invoke Lambda functions to run custom scripts.
Orchestrate several failure injections to form chaos scenarios.
Schedule them for execution at specific times.
Have automatic cancellations if errors are detected.
Have safety measures in place with approvals.
Apply velocity controls to limit the blast radius of experiments.

That is where AWS Systems Manager Automation (SSM** Automation) comes in. So, let’s take a look!

** Note: AWS Systems Manager was formerly known as Amazon Simple Systems Manager (SSM). The original abbreviated name of the service, SSM, is still used and reflected in various AWS resources.

What is SSM Automation?

SSM Automation was launched to simplify frequent maintenance and deployment tasks of AWS resources and, especially, codify them.

SSM Automation in a nutshell

SSM Automation uses documents (defined in YAML or JSON) to enable resource management across multiple accounts and AWS regions. You can execute AWS API calls as part of a document in combination with other SSM Automation actions such as running commands on your EC2 instances, invoking Lambda functions, and executing custom Python or PowerShell scripts.

SSM Automation document

While these documents can be executed directly via the console, the CLI, and SDKs, you can also schedule and trigger them through CloudWatch Events. This scheduling capability makes the integration with CI/CD pipelines trivial.

SSM Automation Action types

Action types let you automate a wide variety of operations. For example, the aws:executeAwsApi action type used above enables you to run any API operation on any AWS service, including creating or deleting AWS resources, starting processes, triggering notifications, etc.

While SSM Automation supports a wide variety of actions, the most notable ones for chaos engineering are the following:

aws:executeAwsApi — Call and run AWS API actions
aws:changeInstanceState — Change instance state
aws:runCommand — Run a command on an EC2 instance
aws:executeScript — Run a Python or PowerShell script
aws:invokeLambdaFunction — Invoke an AWS Lambda function
aws:assertAwsResourceProperty — Assert a resource state or event state
aws:waitForAwsResourceProperty — Wait on a resource property
aws:pause — Pause an SSM Automation execution
aws:sleep — Delay an SSM Automation execution
aws:approve — Pause an SSM Automation execution for manual approval

SSM Automation also includes safety and velocity features that help you control the execution and the roll-out of these documents across large groups of instances by using tags, limits, and error thresholds you define.

As you can probably guess by now, SSM Automation is also well-suited to execute chaos engineering experiments safely.

“Hello, World!”

Let’s take a look at the “Hello, World!” of chaos engineering experiments — Randomly stopping EC2 instances.

This experiment is famously known as Chaos Monkey, and was created by Netflix to enforce strong architectural guidelines: applications launched on the AWS cloud must be stateless, auto-scaled microservices. That means that applications running at Netflix should tolerate random EC2 instance failures.

Following is an SSM Automation document (described in YAML) randomly failing an EC2 instance in a particular AWS availability zone.

To open that SSM Automation document in your favorite IDE, click here.

Okay — so what do we have here?

Note: For readability purposes, I will now collapse irrelevant sections of the SSM Automation document.

The top section of this document is simple. It starts with a description, the schemaVersion (currently at 0.3), and assumeRole, which is the IAM role that SSM Automation needs to assume to run the actions defined below in the document.

The parameters section defines AvailabilityZone, TagName, TagValue, and AutomationAssumeRole, which operators need to input for each execution of the experiment. The first three parameters are used in the first step, listInstances, to filter EC2 instances, while the last one is the IAM role required to execute the actions described in the document.

These parameters are the inputs of the experiment execution, passed through the --parameters option of the AWS CLI start-automation-execution command below:

> aws ssm start-automation-execution --document-name "StopRandomInstances-API" --document-version "\$DEFAULT" --parameters '{"AvailabilityZone":["eu-west-1c"],"TagName":["SSMTag"],"TagValue":["chaos-ready"],"AutomationAssumeRole":["arn:aws:iam::01234567890:role/SSMAutomationChaosRole"]}' --region eu-west-1
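
The --parameters value is a JSON map in which every value is a list of strings, which is easy to get wrong when typing it by hand. Here is a minimal Python sketch that assembles the same command from plain values. The helper name build_start_command is my own invention for this example; it only builds a string and does not call AWS.

```python
import json
import shlex


def build_start_command(document_name, region, **params):
    """Build the AWS CLI call; every parameter value is wrapped in a list."""
    parameters = {name: [value] for name, value in params.items()}
    args = [
        "aws", "ssm", "start-automation-execution",
        "--document-name", document_name,
        "--document-version", "$DEFAULT",
        "--parameters", json.dumps(parameters),
        "--region", region,
    ]
    # shlex.quote protects the JSON and "$DEFAULT" from shell expansion.
    return " ".join(shlex.quote(arg) for arg in args)


command = build_start_command(
    "StopRandomInstances-API",
    "eu-west-1",
    AvailabilityZone="eu-west-1c",
    TagName="SSMTag",
    TagValue="chaos-ready",
    AutomationAssumeRole="arn:aws:iam::01234567890:role/SSMAutomationChaosRole",
)
print(command)
```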

mainSteps

The mainSteps section defines actions that SSM performs on AWS resources. In this document, there are six steps that run in sequential order, namely listInstances, SelectRandomInstance, verifyInstanceStateRunning, stopInstances, forceStopInstances, and verifyInstanceStateStopped.

Each of these steps defines a single SSM Automation action type. The output from one step can be used as input in the following step.

mainSteps (collapsed)

First step — listInstances

Let’s take a look at the first step listInstances. This first step uses an action type aws:executeAwsApi to query the EC2 service for a list of instances filtered by availability-zone, the state of the EC2 instance, and its tags.
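
To give a feel for the shape of such a step, here is a rough Python sketch that builds it as a dictionary and prints it as JSON (SSM documents can be written in JSON as well as YAML). The filter layout and the output name InstanceIds are assumptions on my part; the actual document in the repository may differ in its details.

```python
import json

# Sketch of a listInstances step; the output name "InstanceIds" and the
# exact filters are assumptions, not copied from the original document.
list_instances_step = {
    "name": "listInstances",
    "action": "aws:executeAwsApi",
    "inputs": {
        "Service": "ec2",
        "Api": "DescribeInstances",
        "Filters": [
            {"Name": "availability-zone", "Values": ["{{ AvailabilityZone }}"]},
            {"Name": "instance-state-name", "Values": ["running"]},
            {"Name": "tag:{{ TagName }}", "Values": ["{{ TagValue }}"]},
        ],
    },
    "outputs": [
        {
            "Name": "InstanceIds",
            "Selector": "$.Reservations..Instances..InstanceId",
            "Type": "StringList",
        }
    ],
}

print(json.dumps(list_instances_step, indent=2))
```

The double curly braces are how an Automation document refers to its own parameters, so the three filters line up with the AvailabilityZone, TagName, and TagValue inputs.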

Outputs

As explained earlier, the output from one step can be used as input in the following step. SSM Automation uses a JSONPath expression in the selector to help select the proper output.

A JSONPath expression is a string beginning with “$.” used to select one or more components within a JSON element (e.g., the output of the DescribeInstances API call). The JSONPath operators that are supported by SSM Automation are:

Dot-notated child (.): This operator selects the value of a specific key from a JSON object.
Deep-scan (..): This operator scans a JSON element level by level and selects a list of values with the specific key. The return type of this operator is always a JSON array. As an automation action output type, it can be either StringList or MapList.
Array-Index ([]): This operator gets the value of a specific index from a JSON array.

In this first step, the output “$.Reservations..Instances..InstanceId” returns a list of InstanceIds filtered by availability-zone, state, and tag.
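
If the deep-scan operator feels abstract, the following Python sketch mimics what that selector does on a trimmed-down DescribeInstances response. It is a simplified model written for this example, not how SSM Automation implements JSONPath, and the instance IDs are made up.

```python
def deep_scan(element, key):
    """Collect every value stored under `key`, at any depth, as a list."""
    found = []
    if isinstance(element, dict):
        for name, value in element.items():
            if name == key:
                found.append(value)
            else:
                found.extend(deep_scan(value, key))
    elif isinstance(element, list):
        for item in element:
            found.extend(deep_scan(item, key))
    return found


# Trimmed-down, made-up response in the shape returned by DescribeInstances.
response = {
    "Reservations": [
        {"Instances": [{"InstanceId": "i-0aaa", "State": {"Name": "running"}}]},
        {"Instances": [
            {"InstanceId": "i-0bbb", "State": {"Name": "running"}},
            {"InstanceId": "i-0ccc", "State": {"Name": "running"}},
        ]},
    ]
}

# $.Reservations..Instances..InstanceId, one operator at a time.
instances = deep_scan(response["Reservations"], "Instances")
instance_ids = deep_scan(instances, "InstanceId")
print(instance_ids)
```

Note how the result is a flat list even though the IDs sit in two separate reservations, which is why the output type of the step is a StringList.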

Second step — SelectRandomInstance

The second step of the document uses an action type aws:executeScript that executes an inline Python script, which returns a random InstanceId from a list of InstanceIds.

Note: The function defined in the handler must have two parameters, events and context.

The output of script execution is a Payload object on which you can execute the JSONPath selector. In this example, $.Payload.InstanceId.
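
For illustration, a handler of that kind could look like the Python sketch below. The function name get_random_instance and the InstanceIds input key are assumptions I made for this example; only the two-parameter signature and the InstanceId output key come from the description above.

```python
import random


def get_random_instance(events, context):
    """Pick one InstanceId at random from the list passed in by the step."""
    instance_ids = events["InstanceIds"]
    # Return a one-item list so later steps can keep treating it as a list.
    return {"InstanceId": random.sample(instance_ids, 1)}


# Local dry run with made-up instance IDs; context is unused here.
result = get_random_instance({"InstanceIds": ["i-0aaa", "i-0bbb", "i-0ccc"]}, None)
print(result)
```

Because the handler is plain Python, you can dry-run it locally like this before pasting it into the document.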

Third step — verifyInstanceStateRunning

The third step of the document uses another type of action, aws:waitForAwsResourceProperty, that asserts the state of the random InstanceId returned from step two.

In that step, the selector checks the state of the instances to make sure they are running. I want to make sure all instances are running before messing with them.

Note: As you may have noticed, the input is a StringList, but with a single item, InstanceId. That allows us to easily modify the random function from the previous step to return several items instead, without having to change anything else in the document.

Fourth and Fifth step — stopInstances and forceStopInstances

The fourth and fifth steps of the document use the action type aws:changeInstanceState. As you have probably guessed, these steps change the state of EC2 instances, in this example to stopped. The input is again the InstanceId from step two.

Why use stopInstances and forceStopInstances steps?

In the stopInstances step, the EC2 control plane attempts to gracefully shut down the selected EC2 instance, allowing it to flush its file system caches or file system metadata. However, sometimes, there may be an issue with the underlying host computer, and the instance might get stuck in the stopping state. That is why the forceStopInstances step sets Force to true, which forces the instance to stop.

Note 1: The second of these steps, forceStopInstances, is not recommended for EC2 instances running Windows Server.

Note 2: The default timeout value for the aws:changeInstanceState action is 3600 seconds (one hour). You can limit or extend the timeout by specifying the timeoutSeconds parameter.

For more information on EC2 stop-instances API, click here. For troubleshooting errors, click here.
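
Here is a hedged Python sketch of how the two stop steps could be expressed, again as dictionaries dumped to JSON. The timeout values and the onFailure setting are assumptions chosen for the example, not values taken from the original document.

```python
import json


def stop_step(name, force, timeout_seconds):
    """Build one aws:changeInstanceState step that stops the chosen instance."""
    return {
        "name": name,
        "action": "aws:changeInstanceState",
        "timeoutSeconds": timeout_seconds,  # assumed value, overrides the one-hour default
        "onFailure": "Continue",  # assumption: move on to the next step on failure
        "inputs": {
            "InstanceIds": "{{ SelectRandomInstance.InstanceId }}",
            "DesiredState": "stopped",
            "Force": force,
        },
    }


steps = [
    stop_step("stopInstances", force=False, timeout_seconds=60),
    stop_step("forceStopInstances", force=True, timeout_seconds=60),
]
print(json.dumps(steps, indent=2))
```

The only difference between the two steps is the Force flag, which makes the graceful attempt and the forced fallback easy to compare side by side.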

Last step — verifyInstanceStateStopped

Finally, the last step of this document is to verify the state of the instances to be stopped or terminated. This step is arguably redundant since aws:changeInstanceState also asserts on the desired value. However, for the sake of this example, I preferred to make that step explicit.

Nuff said — Let’s demo this!

For this example, I will assume that you already have some EC2 instances launched in your AWS account with appropriate tags (I use SSMTag:chaos-ready for the demo).

1- Create an IAM role for SSM Automation

By default, SSM doesn’t have permission to perform actions on your AWS resources. Start by creating a role — e.g., SSMAutomationChaosRole with the following policy.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:*:*:function:ChaosAutomation*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:StartInstances",
                "ec2:RunInstances",
                "ec2:StopInstances",
                "ec2:TerminateInstances",
                "ec2:DescribeInstances",
                "ec2:DescribeInstanceStatus"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:*"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "sns:Publish"
            ],
            "Resource": [
                "arn:aws:sns:*:*:ChaosAutomation*"
            ]
        }
    ]
}

It should give you enough to get started with actions calling EC2, SSM Run Command, and AWS Lambda. You should, of course, extend or restrict this policy to your own needs.
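
When you get around to restricting the policy, a good first pass is to list where it relies on wildcards. The Python sketch below does that on a shortened copy of the policy. The helper find_wildcards is hypothetical and deliberately naive: it only flags a bare "*" or a "service:*" entry, not partial wildcards inside an ARN.

```python
def find_wildcards(policy):
    """Yield (statement index, field, value) for every broad wildcard entry."""
    for index, statement in enumerate(policy["Statement"]):
        for field in ("Action", "Resource"):
            values = statement.get(field, [])
            if isinstance(values, str):  # IAM also allows a bare string
                values = [values]
            for value in values:
                if value == "*" or value.endswith(":*"):
                    yield index, field, value


# Shortened copy of the policy above, for illustration only.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["ec2:StopInstances", "ec2:DescribeInstances"],
         "Resource": ["*"]},
        {"Effect": "Allow", "Action": ["ssm:*"], "Resource": ["*"]},
        {"Effect": "Allow",
         "Action": ["sns:Publish"],
         "Resource": ["arn:aws:sns:*:*:ChaosAutomation*"]},
    ],
}

for index, field, value in find_wildcards(policy):
    print(f"statement {index}: {field} = {value}")
```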

2- Fault injection documents

To get you started, I created a few ready-to-use SSM Automation documents.

https://github.com/adhorn/chaos-ssm-documents/

Currently, the following chaos experiments are available — feel free to ask or contribute for more!

1- Randomly stopping instances using EC2 API

2- Randomly stopping instances using AWS Lambda

3- Injecting multiple CPU stresses on EC2 instances using AWS Run Command

To use any of them, you need to create an SSM Automation document using the AWS CLI as follows:

> aws ssm create-document --content file://stop_random_instance_api.yml --name "StopRandomInstances-API" --document-type "Automation" --document-format YAML

After uploading the document, you should see it under the Owned by me tab in AWS Systems Manager Documents, filtered by Document type: Automation.
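
If you would rather upload the document from a script than from the CLI, the same call is available through the AWS SDK. The following Python sketch assumes boto3 is installed and credentials are configured; the helper names are mine, and only create_document_request runs in the example so that nothing touches your account by accident.

```python
def create_document_request(name, content):
    """Build the keyword arguments for the SSM CreateDocument call."""
    return {
        "Name": name,
        "Content": content,
        "DocumentType": "Automation",
        "DocumentFormat": "YAML",
    }


def upload_document(name, path):
    """Upload a local YAML file as an Automation document (needs boto3)."""
    import boto3  # imported here so the sketch loads without boto3 installed

    with open(path, encoding="utf-8") as handle:
        request = create_document_request(name, handle.read())
    return boto3.client("ssm").create_document(**request)


# Dry run: build the request from a placeholder document body.
request = create_document_request("StopRandomInstances-API", "schemaVersion: '0.3'\n")
print(sorted(request))
```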

3- Executing the fault injection document

Go to the Automation dashboard in AWS Systems Manager and click Execute automation.

SSM Automation dashboard

Filter the documents by Owner: Owned by me, and you should see your newly uploaded document(s).

Select the StopRandomInstances-API automation document and click Next.

Note: If you prefer using the AWS CLI, notice that the console outputs the AWS CLI command execution equivalent.

You enter the input parameters defined in the automation document here, namely AvailabilityZone, TagName, and TagValue (I use SSMTag:chaos-ready). Remember to select the correct role created earlier, in this demo SSMAutomationChaosRole, to allow the execution of the experiment.

Before running the experiment, let’s take a look at my instances currently running in eu-west-1.

As you can see, I have four instances in eu-west-1a but only three with the correct tag SSMTag:chaos-ready. I will use that information to verify that my filters are working correctly.

Let’s execute the experiment.

You can follow the execution of each step from the AWS Console. Each step gets a Step ID that you can monitor independently. Following is a zoom on Step 1: listInstances.

We can now check and verify that our filters work. And indeed, we have three instances with the correct set of tags in eu-west-1a.

A zoom on the second step shows us the randomly selected instance: i-01f069058c584b2bc.

Once all the steps completed successfully, we can verify that the correct instance stopped — i-01f069058c584b2bc

As you can see, our EC2 fault injection worked.

4- Cancelling Executions

You might have noticed the Cancel execution button on the execution status page.

Yes — that’s our Big Red Button right there!

CAUTION: You can only attempt to cancel an execution since SSM cannot guarantee that actions can be stopped or reverted. For example, you can’t undo an activity that is already happening, e.g., stopping and terminating an instance.

As always, with chaos engineering, be extra careful with your experiments — plan carefully!

5 — Continuous Chaos testing

What made Chaos Monkey so unique was that it was continuously running in Netflix’s environment, shutting down EC2 instances at a regular interval. It wasn’t just a one-off.

Now that you have successfully executed your EC2 failure injection with SSM Automation, you might want to turn that into a continuous chaos test, or continuous verification.

Continuous chaos testing simply means that you regularly execute the failure injection to verify that the application repeatedly withstands failures.

Luckily, it is straightforward to do!

You can execute the above SSM Automation by specifying our SSM document as the target of an Amazon CloudWatch event.

Amazon CloudWatch — Create Rule

Open the CloudWatch console, choose Events in the left navigation pane, and click Create rule.
Choose Schedule and specify the recurrence by using the cron format. For demo purposes, I chose to execute the SSM Automation document every 5 minutes, which is represented by the cron expression 0/5 * * * ? *.
Then click Add target and choose SSM Automation from the Select target type list. Choose the Automation document created above, StopRandomInstances-API, as your target.
Expand Configure automation parameter(s), and enter each of the required values — AvailabilityZone, TagName, TagValue and AutomationAssumeRole.
In the permissions section, let CloudWatch create a new role to call SSM Automation Execution, or select an existing one.
Click Configure details, add a name and a description. Select Enabled state and click Create rule. Make sure you add a distinct name with an accurate description; you want to make it apparent that it is a chaos engineering rule!
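
The cron expression used in the schedule has six fields (minutes, hours, day of month, month, day of week, year), which is one more than the classic Unix format. As a small sanity check, the Python sketch below splits the expression and expands the minutes field. The function names are my own, and the parser only understands the start/step form used in this demo.

```python
FIELDS = ("minutes", "hours", "day_of_month", "month", "day_of_week", "year")


def parse_schedule(expression):
    """Split a six-field CloudWatch cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    fields = dict(zip(FIELDS, parts))
    # One of the two day fields has to be "?" in this cron dialect.
    if "?" not in (fields["day_of_month"], fields["day_of_week"]):
        raise ValueError("day_of_month or day_of_week must be '?'")
    return fields


def expand_minutes(field):
    """Expand a 'start/step' minutes field into the minutes it matches."""
    start, step = field.split("/")
    return list(range(int(start), 60, int(step)))


schedule = parse_schedule("0/5 * * * ? *")
print(expand_minutes(schedule["minutes"]))
```

Every five minutes is fine for a demo, but for a real environment you would likely pick a much gentler schedule.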

You can verify, change, or disable the rule from the CloudWatch console afterward.

After a while, you should start seeing executions of the SSM Automation document every 5 min.

As you can see, the last four executions differ and hold the IAM role assumed by the CloudWatch event calling SSM Automation execution.

That’s it — We have successfully built our custom Chaos Monkey using SSM Automation! Hopefully, this blog post will inspire you to start your journey with chaos engineering. Feel free to comment, share your ideas, or submit pull-requests if you want to add new functionalities to this collection of SSM documents.

Note for serverless fans: If you are interested in doing the same experiment but with actions using AWS Lambda, use this document with this lambda function.

-Adrian

Chaos Engineering — How to safely inject failure?

Adrian Hornsby — Mon, 11 May 2020 07:37:16 +0000

Chaos Engineering — How to safely inject failure?

Answering questions from my webinar

I recently did a two-hour webinar dedicated to chaos engineering and got a lot of great questions from the audience. In this mini-series of posts, I take some time to answer them.

If you missed the webinar, you could access it on-demand from the link below. And if you have questions you would like me to address, feel free to ask me directly on Twitter :-)

Dev Connect - Applying chaos engineering principles for building fault-tolerant applications

These are some of the questions I was asked:

Is it good practice to conduct a test for the whole system at once or segregate tests?

Is it best to do tests in a “copy-of-prod” like environment?

Is there a structured approach to experiment safely?

How do you experiment without risking breaking production?

How can chaos testing be conducted for Lambda (or serverless) services?

If I understand correctly, the synthesized question is:

“How to deploy chaos experiments and safely inject failure in your environment.”

Notice the deliberate usage of the word deploy.

That is probably the most important question out there. And the answer is simple:

The safest way to inject failure in the environment is by using the canary deployment pattern.

Yes! And it is one of the essential things about experimenting safely:

Chaos engineering experiments should be treated as a deployment pattern.

The difference with a traditional deployment is that once the experiment is over, we bring back the initial environment — in other words, we roll back the experiment.

Let’s rewind a bit — What is a canary (deployment) experiment?

The canary deployment pattern is a technique used to reduce the risk of failure when new versions of applications enter production by creating a new environment with the latest version of the software. You then gradually roll out the change to a small subset of users, slowly making it available to everybody if no deployment errors are detected.

The canary deployment pattern is one of the basic designs in immutable infrastructure, a model by which no updates, security patches, or configuration changes happen “in-place” on production systems. If any change is needed, a new version of the architecture is built and deployed instead.

Immutable infrastructures are more consistent, reliable, and predictable, and they simplify many aspects of software development and operations by preventing common issues related to mutability. Learn more about immutable infrastructures here.

Canary Deployment applied to Chaos Engineering experiment.

Why is the canary pattern important for chaos engineering experiments?

First, by isolating the chaos experiment from the primary production environment and progressively ramping up the traffic sent to it, you can better control the potential blast radius of failure.

Second, having a dedicated environment to run your experiment makes it easier to deal with logs and monitoring information.

Third, you can gradually increase the percentage of requests handled by the new canary chaos experiment and roll back if errors are detected. It gives us near-instant rollback — the big red button.

Finally, you can more easily control what traffic is sent to the canary running the chaos experiment, further limiting the potential risk of customer nuisance.
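
The gradual ramp-up with a "big red button" described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production controller: `set_canary_percent` and `error_rate` are hypothetical callbacks that you would wire to your own routing layer and monitoring, and the 1% error threshold is an arbitrary example value.

```python
from typing import Callable, Iterable


def run_canary_ramp(
    steps: Iterable[int],
    set_canary_percent: Callable[[int], None],
    error_rate: Callable[[], float],
    max_error_rate: float = 0.01,
) -> str:
    """Ramp traffic to the chaos canary step by step, rolling back on errors."""
    for percent in steps:
        set_canary_percent(percent)
        if error_rate() > max_error_rate:
            # Big red button: send all traffic back to the unmodified environment.
            set_canary_percent(0)
            return "rolled_back"
    # The experiment is over: restore the initial environment as well.
    set_canary_percent(0)
    return "completed"


# Fake callbacks standing in for a routing layer and a metrics source.
history = []
observed_rates = iter([0.0, 0.002, 0.05])
outcome = run_canary_ramp([1, 5, 25, 50], history.append, lambda: next(observed_rates))
print(outcome)  # rolled_back
print(history)  # [1, 5, 25, 0]
```

Note that the canary share returns to zero on both paths, which matches the idea that a chaos experiment is a deployment you always roll back.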

Consider several routing or partitioning mechanisms for your experiment:

Internal teams vs. customers
Paying customers vs. non-paying customers
Geographic-based routing
Feature flags (FeatureToggle)
Random
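
As a rough illustration of how some of the partitioning mechanisms above can be combined, here is a minimal Python sketch. The `Request` fields, the function names, and the hashing scheme are all assumptions made for this example; feature flags are left out because they usually live in a dedicated service.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Request:
    user_id: str
    is_internal: bool = False
    is_paying: bool = True
    country: str = "US"


def bucket(user_id: str) -> int:
    """Map a user id to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_experiment(
    req: Request,
    percent: int,
    internal_only: bool = False,
    exclude_paying: bool = False,
    countries: Optional[Set[str]] = None,
) -> bool:
    """Decide whether a request is routed to the canary running the experiment."""
    if internal_only and not req.is_internal:
        return False
    if exclude_paying and req.is_paying:
        return False
    if countries is not None and req.country not in countries:
        return False
    # Pseudo-random but deterministic: the same user always lands in the same bucket.
    return bucket(req.user_id) < percent


employee = Request("user-42", is_internal=True)
customer = Request("user-43")
print(in_experiment(employee, percent=100, internal_only=True))  # True
print(in_experiment(customer, percent=100, internal_only=True))  # False
```

Hashing the user id instead of drawing a random number per request keeps a given user consistently inside or outside the experiment, which makes their experience and your logs easier to reason about.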

How to do canary chaos experiments on AWS?

Following are my favorite ways:

(1) Canary with Route 53 weighted routing policy

Route 53 lets you use a weighted routing policy to split the traffic between the different versions of the application you are deploying, one with the experiment, the other one without.

Weighted routing enables you to associate multiple resources with a single domain name or subdomain name and choose how much traffic is routed to each resource.

Canary with Route 53 weighted routing policy

To configure weighted routing for your canary chaos experiment, you assign each record a relative weight that corresponds with how much traffic you want to send to each resource — one of the resources being the application with the chaos experiment. Route 53 sends traffic to a resource based on the weight that you assign to the record as a proportion of the total weight for all records.

For example, if you want to send a tiny portion of the traffic to one resource and the rest to another resource, you might specify weights of 1 and 255. The resource with a weight of 1 gets 1/256th of the traffic (1/(1+255)), and the other resource gets 255/256ths (255/(1+255)).
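
The arithmetic behind the weights of 1 and 255 can be checked with a short Python sketch. The function and record names are made up for illustration.

```python
from fractions import Fraction


def traffic_shares(weights: dict) -> dict:
    """Return each record's share of traffic: its weight over the total weight."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be greater than zero")
    return {name: Fraction(weight, total) for name, weight in weights.items()}


shares = traffic_shares({"chaos-canary": 1, "production": 255})
print(shares["chaos-canary"])       # 1/256
print(float(shares["production"]))  # 0.99609375
```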

To me, it is probably the simplest and safest way to deploy your chaos experiment since it separates the experiment from the rest of the production environment.

The downside is that since you have to duplicate the entire environment, it is also the more expensive option.

Note: The speed of the rollback is directly related to the DNS TTL value. So, watch out for the default TTL values, and shorten them.
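
To make the weight and TTL settings concrete, here is a hedged Python sketch that builds a change batch shaped like the `ChangeBatch` argument of the boto3 Route 53 `change_resource_record_sets` call. The domain name, IP addresses, set identifiers, and helper names are placeholders for this example, the 60-second TTL is only one possible "short" value, and the sketch builds data without calling AWS.

```python
def weighted_record_change(name, set_identifier, ip_address, weight, ttl=60):
    """Build one UPSERT change for a weighted A record."""
    if not 0 <= weight <= 255:
        raise ValueError("Route 53 weights must be between 0 and 255")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": set_identifier,  # distinguishes records sharing a name
            "Weight": weight,
            "TTL": ttl,  # a short TTL keeps the rollback fast
            "ResourceRecords": [{"Value": ip_address}],
        },
    }


def canary_change_batch(name, production_ip, canary_ip, canary_weight):
    """Split traffic between production and the environment running the experiment."""
    return {
        "Comment": "chaos experiment canary",
        "Changes": [
            weighted_record_change(name, "production", production_ip, 255),
            weighted_record_change(name, "chaos-canary", canary_ip, canary_weight),
        ],
    }


# Placeholder domain and documentation-range IP addresses.
start = canary_change_batch("app.example.com", "192.0.2.10", "198.51.100.20", 1)
# Rolling back: a weight of 0 stops routing new lookups to the canary record.
rollback = canary_change_batch("app.example.com", "192.0.2.10", "198.51.100.20", 0)
print(start["Changes"][1]["ResourceRecordSet"]["Weight"])     # 1
print(rollback["Changes"][1]["ResourceRecordSet"]["Weight"])  # 0
```

Even after the rollback change is applied, resolvers that cached the canary record keep using it until the TTL expires, which is why the TTL is set explicitly here.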

(2) Canary with Application Load Balancer and Weighted Target Groups

When creating an Application Load Balancer (ALB), you create one or more listeners and configure listener rules to direct the traffic to one target group. A target group tells a load balancer where to direct traffic to, e.g., EC2 instances, Lambda functions, etc.

Canary with Application Load Balancer and Weighted Target Groups

To do canary chaos experiments with the ALB, you can use forward actions to route requests to one or more target groups. If you specify multiple target groups for a forward action, you must specify a weight for each target group.

Each target group’s weight is a value from 0 to 999. Requests that match a listener rule with weighted target groups are distributed to these target groups based on their weights. For example, if you specify two target groups, one with a weight of 10 and the other with a weight of 100, the target group with a weight of 100 receives ten times more requests than the other target group.
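
As a sketch of what such a forward action could look like, the following Python builds a dictionary shaped like the `DefaultActions` / `Actions` entries accepted by the boto3 `elbv2` listener and rule calls. The target group ARNs and helper names are placeholders, and no request is sent.

```python
def weighted_forward_action(weights: dict) -> dict:
    """Build a forward action that splits requests across weighted target groups."""
    for arn, weight in weights.items():
        if not 0 <= weight <= 999:
            raise ValueError(f"weight for {arn} must be between 0 and 999")
    return {
        "Type": "forward",
        "ForwardConfig": {
            "TargetGroups": [
                {"TargetGroupArn": arn, "Weight": weight}
                for arn, weight in weights.items()
            ]
        },
    }


def share(weights: dict, arn: str) -> float:
    """Fraction of matching requests that a target group receives."""
    return weights[arn] / sum(weights.values())


# Placeholder ARNs; real ones come from your own target groups.
weights = {
    "arn:aws:elasticloadbalancing:region:account-id:targetgroup/production/1111": 100,
    "arn:aws:elasticloadbalancing:region:account-id:targetgroup/chaos-canary/2222": 10,
}
action = weighted_forward_action(weights)
canary_arn = list(weights)[1]
print(round(share(weights, canary_arn) * 100, 1))  # 9.1
```

With weights of 100 and 10, the canary target group receives 10 out of every 110 requests, roughly 9.1%, which is the "ten times more requests" ratio described above seen from the other side.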

(3) Canary with API Gateway release deployments

For your serverless applications, you have the option of using Amazon API Gateway since it supports canary release deployments.

Using canaries, you can set the percentage of API requests that are handled by new API deployments to a stage. When canary settings are enabled for a stage, API Gateway will generate a new CloudWatch Logs group and CloudWatch metrics for the requests handled by the canary deployment API. You can use these metrics to monitor the performance and errors of the new API and react to them.

Please note that currently, API Gateway canary deployment only works for REST APIs, not the new HTTP APIs.
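
For illustration, here is a minimal Python sketch that builds keyword arguments shaped like those of the boto3 `apigateway` `create_deployment` call, with its `canarySettings` parameter. The REST API id, stage name, and helper name are placeholders, and the sketch only constructs data.

```python
def canary_deployment_request(rest_api_id: str, stage_name: str, percent_traffic: float) -> dict:
    """Build arguments for a deployment that sends a share of requests to a canary."""
    if not 0.0 <= percent_traffic <= 100.0:
        raise ValueError("percent_traffic must be between 0 and 100")
    return {
        "restApiId": rest_api_id,
        "stageName": stage_name,
        "description": "chaos experiment canary",
        "canarySettings": {
            "percentTraffic": percent_traffic,  # share of requests handled by the canary
            "useStageCache": False,
        },
    }


# Placeholder API id and stage name.
request = canary_deployment_request("a1b2c3d4e5", "prod", 5.0)
print(request["canarySettings"]["percentTraffic"])  # 5.0
```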

(4) Canary with AWS Lambda alias traffic shifting

The second option for serverless applications is to use the AWS Lambda alias traffic shifting feature. Update the version weights on a particular alias, and the traffic will be routed to new function versions based on the specified weight. You can easily monitor the health of that new version using CloudWatch metrics for that alias and roll back if errors are detected.
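
A hedged Python sketch of the alias weights follows. It builds keyword arguments shaped like those of the boto3 Lambda `update_alias` call, where `RoutingConfig.AdditionalVersionWeights` maps a version to the fraction of traffic it receives. The function name, alias, version numbers, and helper name are hypothetical, and nothing is sent to AWS.

```python
def shift_traffic(function_name, alias, stable_version, canary_version, canary_weight):
    """Build arguments that route a fraction of alias traffic to the canary version."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0.0 and 1.0")
    # An empty mapping sends all traffic to the stable version (the rollback case).
    additional = {canary_version: canary_weight} if canary_weight > 0 else {}
    return {
        "FunctionName": function_name,
        "Name": alias,
        "FunctionVersion": stable_version,
        "RoutingConfig": {"AdditionalVersionWeights": additional},
    }


# Version "2" is assumed to be the one carrying the failure injection.
start = shift_traffic("checkout", "live", "1", "2", 0.05)
rollback = shift_traffic("checkout", "live", "1", "2", 0.0)
print(start["RoutingConfig"])     # {'AdditionalVersionWeights': {'2': 0.05}}
print(rollback["RoutingConfig"])  # {'AdditionalVersionWeights': {}}
```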

Canary with AWS Lambda alias traffic shifting

AWS CodeDeploy can help with this feature, as it can automatically update function alias weights based on a predefined set of preferences and automatically roll back if needed. Check out the AWS SAM or the serverless.com framework integration with CodeDeploy to automate alias traffic shifting.

Wrapping up

Using the canary pattern to perform chaos engineering experiments is a great way to deploy and gain confidence in your experiment, control the potential blast radius of its failure, have a fast rollback, and better understand its impact on the application.

Of course it is not the only option, but it is the safest.

Chaos Engineering — What and who is a chaos engineer?

Adrian Hornsby — Wed, 29 Apr 2020 06:01:07 +0000

Chaos Engineering Q&A — What and who is a chaos engineer?

Answering questions from my webinar

I recently did a two-hour webinar dedicated to chaos engineering and got a lot of great questions from the audience. In this mini-series of posts, I will take some time to answer them.

If you missed the webinar, you can access it on-demand from the link below. And if you have questions you would like me to address, feel free to ask me directly on Twitter :-)

Dev Connect - Applying chaos engineering principles for building fault-tolerant applications

Questions

Who’s the best set of people to start looking into chaos engineering in a team?

How can performance engineers drive chaos engineering ideas?

In general, whose responsibility is chaos engineering? Would this fall to the solutions architect/engineering team, a Business Continuity team, or a ‘virtual’ team that spans all teams involved in the application?

Great set of first questions! I grouped them since they are very similar to one another.

First of all, let’s debunk a myth. The myth of the chaos engineer going around service teams and surprising them by breaking things randomly, without notifying them, and hoping developers will keep smiling.

It is a myth!

Chaos engineers are more likely to be advocates, helping teams understand what chaos engineering is and how to prepare for it, explaining and even demoing how to do it, and in most cases, coordinating the execution of experiments and GameDays*. But they work WITH the teams, not against them.

I like to think of chaos engineers as program managers instead, with a strong background in software engineering, a good understanding of resiliency patterns, and, more importantly, a passion for the practice of chaos engineering — a contagious passion. Driving the adoption of chaos engineering practices happens through technical presentations and workshops, writing and sharing ideas, support meetings, brainstorming sessions, running GameDays, celebrating wins, etc. The chaos engineer is an evangelist of the discipline, not necessarily the one that pulls the trigger.

Chaos engineering is a practice more than a job definition, and thus everyone in the software engineering or operations teams can use the chaos engineering methodology to improve their systems. Often the best people to inject faults into a software system are the ones most intimate with it. Yes, I am talking about the developers!

The best way to start a chaos engineering practice is thus to start a chaos engineering program** and elect a champion for the job. That champion can be a new hire or not; what matters is that the champion has a strong background in software engineering and a passion for chaos engineering. The rest, like everything, can be learned.

If you can’t afford to hire someone dedicated to the role, you will still need a program and someone managing it. A program gives substance to an idea, something to show progress and hold onto when things get harder. The program needs some goals. Without goals, there isn’t accountability. However, setting goals requires full awareness of the possible biases associated with setting goals and capturing metrics [1].

Goals like “reducing the number of sev1 tickets” are not suitable as they don’t focus on learning and can easily be gamed by merely not raising ticket severity (which will have a negative impact).

Goals such as “conducting one GameDay a month, with each team” are better since they focus on the action, not the result. Remember, we are trying to set up a new practice and learn new ways of thinking about systems, and the outcome of that is hard to measure directly. Sure, you will see some short-term and long-term benefits, but they often differ between organizations.

Ask yourself this simple question: “What do we want to learn?” Then, create the program and goals around that simple idea. Have realistic goals too — chaos engineering will never remove all the risks and potential failures in your system.

— Adrian

[1] https://hbr.org/2019/09/dont-let-metrics-undermine-your-business

* The term GameDay was coined by Jesse Robbins when he worked at Amazon. A GameDay is an exercise during which teams practice responding to an incident in a “safe” environment by purposefully injecting failures in order to increase the availability of software systems. A GameDay is like a fire drill. His talk from 2011 is still my all-time favorite talk.

** I will address the question “how to start a chaos engineering program” in a later post since it deserves its own post.

AWS re:Invent 2020 digest — Part 2

Curated list of my favorite AWS updates from re:Invent 2020

AWS announces Amazon Managed Service for Grafana and Prometheus in Preview

Other noticeable launches

AWS re:Invent 2020 digest — Part 1

Curated list of my favorite AWS updates from re:Invent 2020

OK — but what does strong read-after-write consistency mean?

Other noticeable launches

The Resilient Architecture Collection

Series on Resilient Architecture

The Operational Excellence Collection

Series on Operational Excellence

Part 1 — Customers, Culture, and why you should care.

Part 2 — On the importance of tools

Part 3 — Mechanisms

Incident Postmortem Template

Operational Readiness Review Template

Towards Operational Excellence

Operational Readiness Review

How to use this ORR template?

Can you have the right answers to all questions?

Who should do an ORR?

When should you do an ORR?

How does an ORR differ from an AWS Well-Architected review?

Operational Readiness Review Template

1 — Service Definition and Goals

2 — Architecture

3 — Failures and Impact

4 — Risk Assessment

5 — Monitoring, Metrics & Alarms

6 — Testing

7 — Deployment

8 — Operations

9 — Disaster Recovery

Building resilient services at Prime Video with chaos engineering

Chaos engineering introduction

AWSSSMChaosRunner: Library for failure injection using AWS Systems Manager

AWS Systems Manager

AWS Systems Manager Agent

SendCommand API

SSM command documents

AWSSSMChaosRunner

Failure injections

Chaos testing an EC2 service

Prerequisites

Prime Video uses AWSSSMChaosRunner to prevent a potential outage

Experiment: Validate ElastiCache timeout

Experiment outcome

Conclusion

Resources

Ten lessons from twelve years of AWS

Recording and redacted transcript from my keynote at the AWS Community Day Australia and New Zealand.

#1 — It’s the customers and the information you give them that’s important, not the technology.

#2 — You can’t know everything.

#3 — Invest in people skills

#4 — Embrace failure.

#5 — Don’t blame people when things fail.

#6 — Route53 is NOT a database.

#7 — Watch out for heuristics and biases!

#8 — Writing.

#9 — Mentoring.

#10 — Learn from others.

Incident Postmortem Template

What is a postmortem?

Incident Postmortem Template

Bare-bone version

Extended-cut version

Title:

Incident date:

Owner :

Peer-review committee :

Tags :

Summary:

Supporting data:

Customer Impact:

Incident Response Analysis: