What is the Operational Excellence pillar?
The Operational Excellence pillar focuses on how your organization supports your business objectives. It includes your ability to run workloads effectively, gain insight into their operations, and continuously improve supporting processes and procedures to deliver business value.
Why is Operational Excellence important to improving my architecture?
Operational Excellence helps you define success for your workloads, identify risks inherent in their operation, and make informed decisions. It helps your teams understand their roles in that success and helps you determine whether they have what they need to succeed. Its best practices support measuring success through the achievement of business outcomes, understanding workload and operations health, responding when outcomes are at risk, and continuous improvement.
What are the design principles of Operational Excellence?
There are five design principles for Operational Excellence in the cloud:
Perform operations as code
In the cloud, you can apply the same engineering discipline that you use for application code to your entire environment. You can define your entire workload (applications, infrastructure) as code, and update it with code. You can implement your operations procedures as code, and automate their operation by starting them in response to events. By performing operations as code, you limit human error and enable consistent responses to events.
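As a rough illustration of starting procedures in response to events, here is a minimal Python sketch. The event types, the handler functions, and the RUNBOOKS registry are all hypothetical names invented for this example; a real setup would call your own tooling or a managed automation service instead of returning strings.

```python
from typing import Callable, Dict

# Hypothetical registry: maps an event type to the procedure that handles it.
Runbook = Callable[[dict], str]
RUNBOOKS: Dict[str, Runbook] = {}


def runbook(event_type: str) -> Callable[[Runbook], Runbook]:
    """Register a function as the procedure for one event type."""
    def register(func: Runbook) -> Runbook:
        RUNBOOKS[event_type] = func
        return func
    return register


@runbook("disk_usage_high")
def rotate_logs(event: dict) -> str:
    # Assumption: a real procedure would invoke your tooling here.
    return f"rotated logs on {event['host']}"


@runbook("instance_unhealthy")
def replace_instance(event: dict) -> str:
    return f"replaced instance {event['host']}"


def handle(event: dict) -> str:
    """Run the registered procedure for an event, the same way every time."""
    procedure = RUNBOOKS.get(event["type"])
    if procedure is None:
        # No coded procedure exists yet, so hand the event to a person.
        return f"no runbook for {event['type']}; escalate to on-call"
    return procedure(event)


if __name__ == "__main__":
    print(handle({"type": "disk_usage_high", "host": "web-1"}))
```

Because each procedure is an ordinary function, it can be reviewed, versioned, and tested like any other code, which is the point of the principle.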
Make frequent, small, reversible changes
Design workloads so that components can be updated regularly, which increases the flow of beneficial changes into your workload. Make changes in small increments that can be reversed if they fail. Small, reversible changes make it easier to identify and resolve issues introduced into your environment, without affecting customers when possible.
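One way to picture a small, reversible change is a function that applies a single setting, checks health, and restores the previous value if the check fails. The sketch below assumes an in-memory dictionary as the configuration and a caller-supplied is_healthy check; both are stand-ins chosen for illustration, not part of any AWS API.

```python
from typing import Callable, Dict

_MISSING = object()  # marks a key that did not exist before the change


def apply_reversible(
    config: Dict[str, int],
    key: str,
    new_value: int,
    is_healthy: Callable[[Dict[str, int]], bool],
) -> bool:
    """Apply one small change; restore the previous state if the check fails."""
    previous = config.get(key, _MISSING)
    config[key] = new_value
    if is_healthy(config):
        return True
    # Roll back: put the old value back, or remove the key if it was new.
    if previous is _MISSING:
        del config[key]
    else:
        config[key] = previous
    return False


if __name__ == "__main__":
    settings = {"max_connections": 100}
    within_limit = lambda c: c["max_connections"] <= 200
    print(apply_reversible(settings, "max_connections", 150, within_limit))
    print(apply_reversible(settings, "max_connections", 500, within_limit))
    print(settings)
```

Changing one key at a time keeps the rollback trivial and makes it obvious which change caused a failed health check.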
Refine operations procedures frequently
As you use operations procedures, look for opportunities to improve them. As you evolve your workload, evolve your procedures appropriately. Set up regular Game Days to review and validate that all procedures are effective and that teams are familiar with them.
Anticipate failure
Perform "pre-mortem" exercises to identify potential sources of failure so that they can be removed or mitigated. Test your failure scenarios, and validate your understanding of their impact. Test your response procedures to ensure that they are effective and that teams are familiar with their activities. Set up regular Game Days to test workloads and team responses to simulated events.
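Testing a failure scenario can be as simple as forcing a dependency to fail and checking that the response procedure behaves as expected. The following Python sketch is only an illustration: FlakyDependency, get_data, and game_day_drill are hypothetical names, and the "outage" is simulated by a flag rather than by real fault injection.

```python
class FlakyDependency:
    """Stand-in for a downstream service whose failure can be switched on."""

    def __init__(self) -> None:
        self.fail = False

    def fetch(self) -> str:
        if self.fail:
            raise ConnectionError("simulated outage")
        return "live data"


def get_data(dep: FlakyDependency, cached: str = "cached data") -> str:
    """Response procedure under test: fall back to cached data on failure."""
    try:
        return dep.fetch()
    except ConnectionError:
        return cached


def game_day_drill(dep: FlakyDependency) -> bool:
    """Inject the failure, check the fallback, then restore normal operation."""
    dep.fail = True
    try:
        return get_data(dep) == "cached data"
    finally:
        dep.fail = False


if __name__ == "__main__":
    dependency = FlakyDependency()
    print("fallback works:", game_day_drill(dependency))
```

Running a drill like this on a schedule validates the fallback before a real outage does, and the finally block ensures the simulated failure never outlives the exercise.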
Learn from all operational failures
Drive improvement through lessons learned from all operational events and failures. Share what is learned across teams and through the entire organization.