DEV Community

Sebastian Torres

AWS Well-Architected Framework - Reliability Pillar

What is the Reliability Pillar?

The Reliability pillar focuses on the ability of a workload to perform its intended function correctly and consistently when it's expected to. This includes the ability to operate and test the workload through its total lifecycle.

Why is Reliability important to improving my architecture?

Reliability of a workload in the cloud depends on several factors, the primary of which is resiliency.

Resiliency

Resiliency is the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

Applying the best practices of the Reliability pillar helps workloads meet the availability goals that an organization's business objectives require. These best practices help mitigate the turbulent conditions of production, so that the workload keeps serving its users well.


What are the design principles of the Reliability pillar?

There are five design principles for Reliability in the cloud.

Automatically recover from failure

By monitoring a workload for key performance indicators (KPIs), you can trigger automation when a threshold is breached. These KPIs should be a measure of business value, not of the technical aspects of the operation. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it's possible to anticipate and remediate failures before they occur.
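As a rough illustration of the idea, the sketch below watches a business KPI and runs a recovery action once the value stays under a threshold for a set number of consecutive samples. Everything here is hypothetical and not taken from the framework: the KpiMonitor class, the orders_per_minute metric, the threshold values, and the recover callback are all assumptions. A real workload would normally delegate this to a managed monitoring and alarm service.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class KpiMonitor:
    """Hypothetical monitor: runs a recovery action when a business KPI stays below a threshold."""
    name: str
    threshold: float                 # minimum acceptable KPI value (assumed)
    breaches_required: int           # consecutive breaches before recovery runs
    recover: Callable[[], None]      # automated recovery action
    events: List[str] = field(default_factory=list)        # breach notifications
    _consecutive: int = field(default=0, init=False, repr=False)

    def observe(self, value: float) -> bool:
        """Record one KPI sample; return True if recovery was triggered."""
        if value >= self.threshold:
            self._consecutive = 0    # a healthy sample resets the breach streak
            return False
        self._consecutive += 1
        self.events.append(f"{self.name} breached: {value} < {self.threshold}")
        if self._consecutive >= self.breaches_required:
            self.recover()
            self._consecutive = 0
            return True
        return False


if __name__ == "__main__":
    actions: List[str] = []
    monitor = KpiMonitor(
        name="orders_per_minute",
        threshold=100.0,
        breaches_required=2,
        recover=lambda: actions.append("restart-order-service"),
    )
    for sample in (120.0, 90.0, 80.0, 130.0):
        monitor.observe(sample)
    print(actions, monitor.events)
```

Requiring several consecutive breaches is one simple way to avoid reacting to a single noisy sample; the right number depends on the workload.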

Test recovery procedures

In the cloud, you can test how your workload fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that previously led to failure. This exposes failure pathways that you can test and fix before a real failure scenario occurs, thus reducing risk.
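One small way to picture a recovery test is a drill that injects a failure into a fake dependency and checks that the recovery path actually works. This is a minimal sketch under stated assumptions: the dependency, the fallback to a replica, and all function names are made up for illustration, and a real drill would inject faults into actual infrastructure rather than in-process fakes.

```python
from typing import Callable, Dict


class DependencyDown(Exception):
    """Raised by the simulated dependency when a fault is injected."""


def make_dependency(fail: bool) -> Callable[[str], str]:
    """Build a fake dependency; fail=True simulates an outage."""
    def call(key: str) -> str:
        if fail:
            raise DependencyDown(key)
        return f"primary:{key}"
    return call


def fetch_with_fallback(
    key: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> str:
    """Recovery procedure under test: use the fallback when the primary fails."""
    try:
        return primary(key)
    except DependencyDown:
        return fallback(key)


def run_recovery_drill() -> Dict[str, str]:
    """Exercise the healthy path and the injected-outage path."""
    def replica(key: str) -> str:
        return f"replica:{key}"

    return {
        "healthy": fetch_with_fallback("k", make_dependency(fail=False), replica),
        "outage": fetch_with_fallback("k", make_dependency(fail=True), replica),
    }


if __name__ == "__main__":
    print(run_recovery_drill())
```

Running the drill on every change, rather than once, is what turns it into a validation of the recovery procedure instead of a one-off experiment.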

Scale horizontally to increase aggregate workload availability

Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across multiple, smaller resources to help ensure they don't share a common point of failure.
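The arithmetic behind this principle can be sketched as follows. The formula assumes that resources fail independently and that the workload stays available as long as at least one resource is up; both are simplifying assumptions that rarely hold exactly in practice, and the 0.99 figure is only an example value.

```python
def aggregate_availability(per_resource: float, count: int) -> float:
    """Availability of `count` redundant resources, assuming independent
    failures and that one surviving resource is enough to serve traffic."""
    if not 0.0 <= per_resource <= 1.0 or count < 1:
        raise ValueError("per_resource must be in [0, 1] and count >= 1")
    return 1.0 - (1.0 - per_resource) ** count


def capacity_lost_on_single_failure(count: int) -> float:
    """Fraction of total capacity lost when one of `count` equal resources fails."""
    if count < 1:
        raise ValueError("count must be >= 1")
    return 1.0 / count


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(n, aggregate_availability(0.99, n), capacity_lost_on_single_failure(n))
```

With one resource, a single failure removes all capacity; with four equal resources, it removes a quarter, which is the "reduced impact" the principle describes.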

Stop guessing capacity

In the cloud, you can monitor demand and workload utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning. There are still limits, but some quotas can be controlled and others can be managed.
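To make the idea concrete, here is a simplified proportional scaling rule: size the fleet so that measured utilization moves toward a target, then clamp the result to a minimum and a maximum. This is an illustrative assumption, not the algorithm of any particular autoscaling service, and the function name, parameters, and numbers are hypothetical. The maximum stands in for the quotas and limits mentioned above.

```python
import math


def desired_capacity(
    current: int,
    utilization: float,
    target: float,
    minimum: int,
    maximum: int,
) -> int:
    """Return the resource count that would bring utilization near `target`,
    clamped to [minimum, maximum]."""
    if current <= 0 or target <= 0:
        raise ValueError("current and target must be positive")
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    # Proportional rule: total load (current * utilization) spread at the target level.
    raw = math.ceil(current * utilization / target)
    return max(minimum, min(maximum, raw))


if __name__ == "__main__":
    print(desired_capacity(current=4, utilization=90.0, target=60.0, minimum=2, maximum=10))
    print(desired_capacity(current=4, utilization=15.0, target=60.0, minimum=2, maximum=10))
    print(desired_capacity(current=8, utilization=120.0, target=60.0, minimum=2, maximum=10))
```

The three calls cover scaling out, scaling in down to the floor, and hitting the ceiling, where demand exceeds what the configured limit allows.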

Manage change in automation

Changes to your infrastructure should be made using automation. The changes that need to be managed then include changes to the automation itself, which can be tracked and reviewed.
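A minimal sketch of what "tracked and reviewed" can look like: compute the difference between the current and the desired configuration, record it with its author, and only then apply it. The configuration keys, the audit log structure, and the function names are all hypothetical; in practice this role is usually filled by infrastructure-as-code tooling and version control.

```python
from typing import Any, Dict, List, Tuple

Config = Dict[str, Any]
Diff = Dict[str, Tuple[Any, Any]]


def plan(current: Config, desired: Config) -> Diff:
    """Return {key: (old, new)} for every key whose value would change."""
    keys = sorted(current.keys() | desired.keys())
    return {
        k: (current.get(k), desired.get(k))
        for k in keys
        if current.get(k) != desired.get(k)
    }


def apply_change(
    current: Config,
    desired: Config,
    author: str,
    audit_log: List[Dict[str, Any]],
) -> Config:
    """Record the planned diff for later review, then return the new configuration."""
    changes = plan(current, desired)
    if changes:
        audit_log.append({"author": author, "changes": changes})
    return dict(desired)


if __name__ == "__main__":
    log: List[Dict[str, Any]] = []
    state = {"instances": 2, "type": "small"}
    state = apply_change(state, {"instances": 3, "type": "small"}, "deploy-bot", log)
    print(state, log)
```

Because the diff is produced before anything is applied, the same plan step can also serve as a review gate.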


What are the best practice areas of Reliability?

  • Foundations
  • Workload Architecture
  • Change Management
  • Failure Management
