Gerald Stewart for AWS Community Builders

Improving Operational Excellence using Game Days

A game day is a practice event that simulates a failure or outage so that you can test your systems, your processes, and your team's ability to respond to and remediate that particular event.

AWS defines operational excellence as:

The Operational Excellence pillar includes the ability to support development and run workloads effectively, gain insight into their operations, and continuously improve supporting processes and procedures to deliver business value.

To me, game days go hand-in-hand with the Operational Excellence pillar of the Well-Architected framework. They are even explicitly called out under the anticipate failure design principle.


I ran a few game-day scenarios earlier this year with my team to get them comfortable with the prospect of providing 24/7 on-call support for our applications.

However, I found it difficult to identify scenarios and even harder to accurately reproduce them in a not-so-obvious way.

At re:Invent this year, my favourite session of the week was on this exact topic. It wasn't recorded, so I'd like to share with you all what I took away from that session in this blog.


Examples of Game Day Events/Types

  • Security Incidents
  • Someone leaving the company/team
  • Infrastructure outage
  • Application outage

When should I run a Game Day?

Regular Cadence
Game days should be run regularly to help your team develop muscle memory for incident response. Your company likely runs regular safety drills to ensure employees know what to do in the event of a fire; this is no different.

Architectural Changes
If you make substantial architecture or design changes to your application, it is also worth running a game day to ensure all team members are aware of the application's changing points of failure.

Team Changes
Someone joining or leaving a team can have a big impact on the team's ability to resolve incidents. Often, individual team members become a "crutch" in these types of scenarios. If your star incident resolver leaves, you need to ensure the rest of the team is capable and confident in supporting your application.

The same can be said about a new joiner. Getting them up to speed quickly by running game days in a controlled environment is a great way to build confidence and should be considered part of the onboarding process for any team supporting applications in production.


Building Effective Game Days

Part 1: How can someone triaging this scenario effectively detect that it is happening?

In order for a scenario to be considered valid, there must be an effective way of detecting that something is actually happening. Let's look at an example:

Scenario: A Lambda function invoked through an API Gateway is producing a 5XX response code
In this scenario, an effective detection method may be alerting on 5XX errors at the API Gateway level.
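
As a sketch of what that detection could look like, the snippet below creates a CloudWatch alarm on the API Gateway 5XXError metric using boto3. The API name, threshold and SNS topic are placeholders you would swap for your own resources.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the API returns more than 5 server-side errors in a 5-minute window.
# "my-api" and the SNS topic ARN are placeholders for your own resources.
cloudwatch.put_metric_alarm(
    AlarmName="my-api-5xx-errors",
    Namespace="AWS/ApiGateway",
    MetricName="5XXError",
    Dimensions=[{"Name": "ApiName", "Value": "my-api"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=5,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:on-call-alerts"],
)
```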

If you are identifying scenarios but struggling to find effective ways of detecting them, consider creating the monitoring or processes needed to detect them.

Some more examples:

  • IP Address exhaustion - Monitored using CloudWatch alarm
  • Denial of Service - Abnormal traffic pattern alarm
  • Lambda timeouts - Alerting off CloudWatch metric

Part 2: How can someone resolve this scenario in an effective and efficient manner?

After identifying a scenario, we need to make sure that the team has the relevant skills and information to resolve the scenario in an effective and efficient manner. This can be as simple as knowing where to look at the status of alerts or ensuring they have the correct permissions to view logs.
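
For example, an on-call engineer's first step is often pulling recent error logs. Here is a minimal sketch with boto3; the log group name and filter pattern are assumptions for illustration, and running it also doubles as a check that the responder has the permissions they need.

```python
import time
import boto3

logs = boto3.client("logs")

# Fetch the last 15 minutes of ERROR events from a Lambda's log group.
# The log group name is a placeholder; use your function's actual group.
response = logs.filter_log_events(
    logGroupName="/aws/lambda/my-function",
    filterPattern="ERROR",
    startTime=int((time.time() - 15 * 60) * 1000),  # milliseconds since epoch
)

for event in response["events"]:
    print(event["timestamp"], event["message"])
```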

For more specific application-level errors, runbooks can be useful for understanding the meaning behind error messages, or for knowing who to page at 3 am when it's a pesky downstream service causing the error.

Scenario: A team member is called for an out-of-hours system outage
Upon receiving a call about a system outage, the team member knows the location of the alerting and runbook so that they can effectively respond to the presented issue. They can resolve this on their own without the need to escalate further.

If gaps have been identified here, make sure the necessary documentation exists and the entire team is aware of it before proceeding with the game day scenario.

Part 3: As a "game master" how can this scenario be reproduced in a simulated and controlled environment?

In order to trigger a scenario, it needs to be reproducible.

We'll look at some examples of scenarios and how they can be reproduced in the example game day plan later in this post.

This is often the piece I find most challenging, and you may need to get creative. Currently, AWS FIS only supports EC2, ECS, EKS and RDS. As someone who works mostly with AWS-managed serverless services, I would love to see support for Lambda, API Gateway, CloudFront and DynamoDB so that this is less of a headache!
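
In the absence of FIS support, one possible workaround is to inject failures yourself. The sketch below is a hypothetical Lambda handler that raises an error whenever a CHAOS_MODE environment variable (my own name, not a standard) is set, which a game master could toggle on the deployed function to start the scenario and switch off once the maximum run time is reached.

```python
import os


def chaos_check():
    """Raise a simulated failure when the game master has enabled chaos mode."""
    if os.environ.get("CHAOS_MODE") == "enabled":
        raise RuntimeError("Simulated failure injected for game day")


def handler(event, context):
    # Hypothetical handler: fail on purpose if the chaos flag is set,
    # otherwise carry on with normal processing.
    chaos_check()
    return {"statusCode": 200, "body": "ok"}
```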

Part 4: What is the category/classification of this scenario?

Scenarios can be broadly grouped into two categories:

  • AWS outages/issues
  • Application level outages/issues

It's a good idea to run a mix of both types of scenarios so that team members understand troubleshooting, resolutions or escalation paths for both types.

Part 5: How long should a trained person take to resolve this scenario?

We should assign each scenario two time values:

  • Expected resolution time: The time within which a trained individual should be able to resolve the issue.
  • Maximum run time: The maximum amount of time the scenario is allowed to run for.

The expected resolution time is a rough number that should be adjusted if, on repeat runs of a particular scenario, it proves either too generous or too harsh. It serves as an indicator that the training, processes and documentation available are sufficient to meet the demands of supporting the application.

The maximum run time should include enough buffer to give someone ample time to attempt to resolve the scenario. We assign this value to avoid the frustration of an exercise dragging on endlessly. If certain scenarios regularly hit this limit, they should be re-assessed.
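
Pulling parts 1 to 5 together, each scenario can be written down in a consistent shape before the game day. Below is a minimal sketch of one way to record them; the field names are my own, not a standard.

```python
from dataclasses import dataclass


@dataclass
class GameDayScenario:
    """One entry in the game day plan, covering parts 1-5 above."""
    name: str
    detection_method: str          # Part 1: how responders should notice it
    reproduction_steps: str        # Part 3: how the game master triggers it
    classification: str            # Part 4: "AWS" or "Application"
    expected_resolution_mins: int  # Part 5: rough target for a trained responder
    maximum_runtime_mins: int      # Part 5: hard stop to avoid frustration


scenario = GameDayScenario(
    name="API Gateway returning 5XX",
    detection_method="CloudWatch alarm on the 5XXError metric",
    reproduction_steps="Deploy a deliberately broken Lambda version",
    classification="Application",
    expected_resolution_mins=10,
    maximum_runtime_mins=20,
)
```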


Example Game Day Plan

Game Day reference architecture

Let's take a look at this reference architecture to identify some potential game-day scenarios. The diagram depicts a CloudFront distribution with a Route 53 domain and an ACM certificate, in front of an API Gateway with two Lambda functions: one making HTTP calls to an external service and one accessing an RDS instance.

Game day reference architecture with identified scenarios
The diagram above depicts some example game day scenarios in red. I left some blank to give you (the reader) a chance to think of some potential scenarios.

Let's break down some of the identified scenarios for this architecture:

| Scenario | Detection Method | Reproduction Steps | Classification | Resolution Time / Max Runtime |
| --- | --- | --- | --- | --- |
| Misconfigured Route 53 record | Traffic anomaly detection | Facilitator alters configuration | Application Level | 5/10 |
| Expiring ACM certificate | Certificate expiry alert | - | Application Level | 10/20 |
| Third-party service outage | Lambda error alerting | Mock server | Application Level | 10/20 |

Scenario 1: A misconfigured Route 53 record. This is a pretty common occurrence when making DNS changes; it would be a user error that causes this type of outage. To reproduce this, whoever is running the scenario could go in and make a change. The goal here is to troubleshoot and revert the change in a timely manner.
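
Altering the record as the facilitator, or reverting it as the responder, is a single change_resource_record_sets call. A sketch with boto3 follows; the hosted zone ID, record name and target value are placeholders.

```python
import boto3

route53 = boto3.client("route53")

# UPSERT the record back to its known-good value to resolve the scenario.
# The hosted zone ID, record name and target are placeholders.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000000000000000",
    ChangeBatch={
        "Comment": "Game day: revert misconfigured record",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "app.example.com",
                    "Type": "CNAME",
                    "TTL": 300,
                    "ResourceRecords": [
                        {"Value": "d111111abcdef8.cloudfront.net"}
                    ],
                },
            }
        ],
    },
)
```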

Scenario 2: An expiring ACM certificate. AWS will notify you of this well in advance, and simulating it might be difficult as certificate renewals usually happen on a yearly basis. For this, it could be a "made-up" scenario where you walk through the steps you would take to remediate the issue.

Scenario 3: A third-party service outage. This could be simulated with a mock server standing in for the real service. In this situation, identifying the correct downstream service and (as part of the exercise) mocking out contacting the team that owns it would be a good outcome.
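
A mock server for this can be as small as a stub that always fails. Below is a minimal sketch using Python's standard library that the facilitator could point the application's third-party endpoint configuration at for the duration of the exercise; the port and response body are arbitrary choices.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class OutageHandler(BaseHTTPRequestHandler):
    """Pretend third-party API that is having a very bad day."""

    def do_GET(self):
        # Every request gets a 503 to simulate a full outage.
        self.send_response(503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(b'{"message": "Service Unavailable"}')

    do_POST = do_GET  # fail POSTs the same way


if __name__ == "__main__":
    # Run locally or on a test host and point the application at it.
    HTTPServer(("0.0.0.0", 8080), OutageHandler).serve_forever()
```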

I have intentionally left some boxes blank so that you can think about this architecture and what other possible application-level or AWS-level outages could occur. I'd love to hear about them in the comments.


Conclusion

You've run your game day; what's next, you might ask?

I would recommend collecting feedback on:

  • Tooling and processes
  • Education and knowledge sharing
  • Application-level changes that could help mitigate the scenario
  • The overall running of the game day

I would expect the following outputs of a successful game day:

  • Increased team confidence
  • Improvements in documentation
  • Improvements in alerting
  • Potential architecture changes to make applications more resilient

The entire process of game days is centred around continuous improvement: bulletproof your plans, and increase your engineers' confidence in and knowledge of your systems.

If you found any of this useful, please let me know. Make sure to attempt to identify the extra scenarios in our example architecture, and leave a comment with how you would conduct a game day for them!
