DEV Community

Alec Dutcher


Reliability Best Practices - AWS Well-Architected Framework Study Guide

Return to Well-Architected Framework Guide

Foundations

  • Foundational requirements are those whose scope extends beyond a single workload or project
  • For example, it’s the responsibility of AWS to satisfy the requirement for sufficient networking and compute capacity, leaving you free to change resource size and allocation on demand
  • Service quotas (also called service limits) exist to prevent accidentally provisioning more resources than needed and to limit request rates on API operations, which protects services from abuse
  • Monitor and manage these quotas for all workload environments
  • Ask:
    • How do you manage service quotas and constraints?
    • How do you plan your network topology?
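As a rough illustration of monitoring quotas, the sketch below flags any quota whose utilization crosses a warning threshold. The `Quota` class, the `quotas_at_risk` helper, the 80% threshold, and the sample numbers are all hypothetical; in a real account the limit and usage values would come from whatever quota and metrics data you collect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    name: str      # hypothetical quota label
    limit: float   # the applied quota value
    usage: float   # current consumption, in the same unit as limit

    @property
    def utilization(self) -> float:
        # Fraction of the quota in use; a zero limit counts as fully used.
        return self.usage / self.limit if self.limit > 0 else 1.0


def quotas_at_risk(quotas, threshold=0.8):
    """Return quotas at or above the threshold, highest utilization first."""
    risky = [q for q in quotas if q.utilization >= threshold]
    return sorted(risky, key=lambda q: q.utilization, reverse=True)


# Illustrative numbers only, not real account data.
sample = [
    Quota("vpcs-per-region", limit=5, usage=5),
    Quota("elastic-ips", limit=5, usage=2),
    Quota("on-demand-vcpus", limit=64, usage=56),
]
at_risk = quotas_at_risk(sample)
for q in at_risk:
    print(f"{q.name}: {q.utilization:.0%} of quota used")
```

Running a check like this on a schedule for every workload environment leaves time to request an increase before a deployment or failover hits the limit.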

Workload Architecture

  • SDKs take the complexity out of coding by providing language-specific APIs for AWS services
  • Distributed systems rely on communications networks to interconnect components, such as servers or services
  • The workload must operate reliably despite data loss or latency in these networks
  • Components must operate in a way that does not negatively impact other components
  • Ask:
    • How do you design your workload service architecture?
    • How do you design interactions in a distributed system to prevent failures?
    • How do you design interactions in a distributed system to mitigate or withstand failures?
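One common way to withstand transient network faults is to retry with exponential backoff and jitter. The following is a minimal sketch under assumptions, not a prescribed implementation: `retry_with_backoff`, its parameters, and the simulated `flaky_call` are hypothetical names chosen for illustration.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a retryable error, wait with capped exponential
    backoff and full jitter, then try again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the failure to the caller
            # Double the ceiling on each attempt, cap it, then wait a random
            # time below the ceiling so clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Simulated dependency that fails twice before succeeding.
calls = {"count": 0}


def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network fault")
    return "ok"


# The sleep function is injected so the example runs without real delays.
result = retry_with_backoff(flaky_call, sleep=lambda seconds: None)
print(result, "after", calls["count"], "attempts")
```

Retries are only safe when the operation is idempotent, and the attempt limit keeps a struggling dependency from being overwhelmed by its own clients.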

Change Management

  • Anticipate and accommodate changes to achieve reliable operation
  • Changes include those imposed on your workload (e.g. spikes in demand) and those from within (e.g. feature deployments and security patches)
  • Monitor the behavior of a workload and automate the response to KPIs
  • Ask:
    • How do you monitor workload resources?
    • How do you design your workload to adapt to changes in demand?
    • How do you implement change?
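To make "automate the response to KPIs" concrete, here is a simplified proportional scaling rule: size the fleet so that a per-instance metric moves back toward its target. This is a sketch under assumptions and not how any particular AWS scaling policy is implemented; `desired_capacity` and its parameters are hypothetical.

```python
import math


def desired_capacity(current, metric_value, target_value, min_size=1, max_size=10):
    """Proportional scaling: pick a fleet size that brings the per-instance
    metric back toward the target, clamped to the allowed range."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    wanted = math.ceil(current * metric_value / target_value)
    return max(min_size, min(max_size, wanted))


# 4 instances at 90% average CPU with a 60% target: scale out.
scale_out = desired_capacity(current=4, metric_value=90, target_value=60)
# The same fleet at 30% CPU: scale in.
scale_in = desired_capacity(current=4, metric_value=30, target_value=60)
print(scale_out, scale_in)
```

The minimum and maximum bounds matter as much as the formula: they keep a noisy metric from scaling the workload to zero or past what quotas and budget allow.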

Failure Management

  • Be aware of failures as they occur and take action to avoid impact on availability
  • Take advantage of automation to react to monitoring data
  • Regularly back up your data and test your backup files to confirm that you can actually restore from them
  • Test failure response on a regular schedule and ensure that such testing is also triggered after significant workload changes
  • Actively track KPIs, as well as the recovery time objective (RTO), the maximum acceptable time to restore service, and the recovery point objective (RPO), the maximum acceptable window of data loss
  • Ask:
    • How do you back up data?
    • How do you use fault isolation to protect your workload?
    • How do you design your workload to withstand component failures?
    • How do you test reliability?
    • How do you plan for disaster recovery (DR)?
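A recovery drill can be scored against RTO and RPO with simple timestamp arithmetic. The sketch below assumes hypothetical targets and timestamps; `meets_rpo` and `meets_rto` are illustrative helper names.

```python
from datetime import datetime, timedelta


def meets_rpo(last_backup, failure_time, rpo):
    """Data written after the last backup is lost, so that gap must fit
    inside the RPO."""
    return failure_time - last_backup <= rpo


def meets_rto(failure_time, restored_time, rto):
    """Time from failure to restored service must fit inside the RTO."""
    return restored_time - failure_time <= rto


# Hypothetical drill: targets and timestamps are illustrative only.
rpo = timedelta(hours=1)
rto = timedelta(hours=4)
last_backup = datetime(2024, 1, 1, 11, 30)
failure = datetime(2024, 1, 1, 12, 15)
restored = datetime(2024, 1, 1, 17, 0)

rpo_ok = meets_rpo(last_backup, failure, rpo)  # 45 minutes of data at risk
rto_ok = meets_rto(failure, restored, rto)     # 4 hours 45 minutes of downtime
print(f"RPO met: {rpo_ok}, RTO met: {rto_ok}")
```

In this made-up drill the backup cadence satisfies the RPO, but the restore takes longer than the RTO allows, which is the kind of gap that regular failure testing is meant to expose.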
