DEV Community

Alec Dutcher


Operational Excellence Best Practices - AWS Well-Architected Framework Study Guide


Organization

  • Evaluate:
    • Internal and external customer needs
    • Threats to the business (liabilities, info security)
    • Impact of risks and tradeoffs between approaches
  • Understand:
    • Team members' roles in supporting the workload
    • Support required to achieve business outcomes
    • Team members' roles in the success of other teams (and vice versa)
    • Responsibility, ownership, how decisions are made, and who has authority to make decisions
  • Ensure:
    • There are identified owners for each application, workload, platform, and infrastructure component
    • Each process and procedure has an identified owner responsible for its definition, and identified owners responsible for carrying it out
    • Team members have the resources to be successful and scale to support your business outcomes
  • Define:
    • Guidelines or obligations based on organizational governance and external factors, such as regulatory compliance requirements and industry standards
    • Responsibilities of team members
    • Agreements between teams describing how they work together to support each other and your business outcomes
  • Ask:
    • How do you determine what your priorities are?
    • How do you structure your organization to support your business outcomes?
    • How does your organizational culture support your business outcomes?

Prepare

  • Design your workload to provide information necessary to understand its internal state
  • Capture a broad set of information to enable situational awareness
  • Adopt approaches that improve the flow of changes into production and that enable refactoring, fast feedback on quality, and bug fixing
  • Adopt approaches that provide fast feedback on quality and enable rapid recovery from changes that do not have desired outcomes
  • Plan for unsuccessful changes so that you can respond faster if necessary, and test and validate the changes you make
  • Evaluate the operational readiness of your workload, processes, procedures, and personnel to understand the operational risks related to your workload
  • Ask:
    • How do you design your workload so that you can understand its state?
    • How do you reduce defects, ease remediation, and improve flow into production?
    • How do you mitigate deployment risks?
    • How do you know that you are ready to support a workload?
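
As one illustration of designing a workload to expose its internal state, the sketch below emits each log event as a single JSON line with state fields attached. It is a minimal example, not an AWS-prescribed approach: the `JsonFormatter` class, the `build_logger` helper, the `orders` logger name, and the `state` key passed through `extra` are all hypothetical names chosen for this example.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": time.strftime(
                "%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)
            ),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Assumed convention: callers pass workload state via
        # extra={"state": {...}}, which logging attaches to the record.
        payload.update(getattr(record, "state", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str = "orders") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers to avoid duplicate output
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    # Hypothetical state fields for an order-processing workload.
    log.info("order_processed", extra={"state": {"queue_depth": 12, "latency_ms": 87}})
```

Machine-readable lines like these can be filtered and aggregated by whatever log tooling you already use, which supports the situational awareness described above.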

Operate

  • Define expected outcomes
  • Identify metrics to measure success
  • Establish metrics baselines for improvement, investigation, and intervention
  • Use established runbooks for well-understood events, and use playbooks to aid in investigation and resolution of issues
  • Communicate operational status of workloads through dashboards and notifications tailored to the target audience
  • Develop scripted responses to well-understood events and automate their execution when the event is detected
  • Ask:
    • How do you understand the health of your workload?
    • How do you understand the health of your operations?
    • How do you manage workload and operations events?
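
To make the ideas of a metrics baseline and a scripted response concrete, here is a small sketch under stated assumptions. It treats the baseline as the mean and standard deviation of recent samples and flags a value more than `k` standard deviations above the mean. That threshold rule is only one possible choice, and the `RUNBOOKS` table, the event names, and the `scale_out` function are hypothetical placeholders.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List, Optional, Tuple


def baseline(samples: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of historical samples."""
    return mean(samples), stdev(samples)


def needs_intervention(value: float, samples: List[float], k: float = 3.0) -> bool:
    """Flag a value more than k standard deviations above the baseline mean."""
    mu, sigma = baseline(samples)
    return value > mu + k * sigma


def scale_out() -> str:
    # Placeholder for a real scripted response (hypothetical).
    return "scale-out requested"


# Well-understood events map to automated runbook steps (assumed names).
RUNBOOKS: Dict[str, Callable[[], str]] = {"high_latency": scale_out}


def handle_event(event: str, runbooks: Dict[str, Callable[[], str]]) -> Optional[str]:
    """Run the scripted response if one exists; otherwise return None
    so the event can be investigated by a person using a playbook."""
    action = runbooks.get(event)
    return action() if action else None


if __name__ == "__main__":
    latency_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
    if needs_intervention(110.0, latency_ms):
        print(handle_event("high_latency", RUNBOOKS))
```

Events with no entry in the table return `None`, which reflects the split above: runbooks for well-understood events, playbooks for investigation.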

Evolve

  • Dedicate work cycles to making continuous incremental improvements
  • Perform post-incident analysis of all customer-impacting events
  • Identify the contributing factors and preventive actions to limit or prevent recurrence
  • Communicate contributing factors with affected communities as appropriate
  • Ask:
    • How do you evolve operations?
