Sebastian Torres

Posted on Apr 23, 2022

AWS Well-Architected Framework - Reliability Pillar

#cloud #aws #architecture

What is the Reliability Pillar?

The Reliability pillar focuses on the ability of a workload to perform its intended function correctly and consistently when it's expected to. This includes the ability to operate and test the workload through its total lifecycle.

Why is Reliability important to improving my architecture?

Reliability of a workload in the cloud depends on several factors, the primary of which is resiliency.

Resiliency

Resiliency is the ability of a workload to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.

Using the best practices of the Reliability pillar can enable workloads to meet the availability goals that an organization's business objectives require. These best practices help mitigate the turbulent conditions of production, and therefore best serve your users.

What are the design principles of the Reliability pillar?

There are five design principles for Reliability in the cloud.

Automatically recover from failure

By monitoring a workload for key performance indicators (KPIs), you can trigger automation when a threshold is breached. These KPIs should be a measure of business value, not of the technical aspects of the operation. This allows for automatic notification and tracking of failures, and for automated recovery processes that work around or repair the failure. With more sophisticated automation, it's possible to anticipate and remediate failures before they occur.
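As a rough illustration of the idea, here is a minimal Python sketch of a KPI monitor that logs a breach and triggers a recovery action. Everything in it is an assumption made for the example: the `KpiMonitor` class, the `orders_per_minute` KPI, the threshold of 50, and the "restart" action are hypothetical and do not come from any AWS service or SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class KpiMonitor:
    """Watches one business KPI and triggers recovery when it drops below a threshold."""

    name: str
    threshold: float                 # minimum acceptable KPI value (assumed)
    recover: Callable[[], None]      # hypothetical automated recovery action
    events: List[str] = field(default_factory=list)  # simple failure-tracking log

    def observe(self, value: float) -> bool:
        """Record one KPI sample; return True if recovery was triggered."""
        if value >= self.threshold:
            return False
        # Track the failure first, then run the automated recovery.
        self.events.append(f"{self.name} breached: {value} < {self.threshold}")
        self.recover()
        return True


# Hypothetical wiring: the "recovery" just records which action would run.
restarts: List[str] = []
monitor = KpiMonitor(
    name="orders_per_minute",
    threshold=50.0,
    recover=lambda: restarts.append("restart-order-service"),
)

for sample in [120.0, 95.0, 30.0]:
    monitor.observe(sample)
```

Only the last sample falls below the threshold, so the recovery action runs once and one event is logged. In a real workload the samples would come from a monitoring service and the recovery action would call real automation.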

Test recovery procedures

In the cloud, you can test how your workload fails, and you can validate your recovery procedures. You can use automation to simulate different failures or to recreate scenarios that led to failures before. This exposes failure pathways that you can test and fix before a real failure scenario occurs, thus reducing risk.
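A simulated failure can be as simple as swapping a healthy dependency for one that always fails and checking that the recovery path takes over. The Python sketch below assumes a hypothetical `fetch_with_failover` recovery procedure and two stand-in endpoints; none of these names come from the Well-Architected Framework.

```python
from typing import Callable


class InjectedFailure(Exception):
    """Raised by the test harness to simulate a disrupted dependency."""


def fetch_with_failover(primary: Callable[[], str], secondary: Callable[[], str]) -> str:
    """Recovery procedure under test: fall back to the secondary when the primary fails."""
    try:
        return primary()
    except Exception:
        return secondary()


def failing_primary() -> str:
    # Fault injection: the primary endpoint is forced to fail.
    raise InjectedFailure("simulated outage of the primary endpoint")


def healthy_secondary() -> str:
    return "response from secondary"


# Run the recovery procedure against the injected failure.
result = fetch_with_failover(failing_primary, healthy_secondary)
```

Running the same check on a schedule, or in a deployment pipeline, is one way to keep validating the recovery procedure as the workload changes.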

Scale horizontally to increase aggregate workload availability

Replace one large resource with multiple small resources to reduce the impact of a single failure on the overall workload. Distribute requests across multiple, smaller resources to help ensure they don't share a common point of failure.
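The effect can be put into numbers with a small Python sketch. It relies on a strong assumption that is not stated in the text above: the resources fail independently and the workload stays available as long as at least one of them is up. The function names are made up for this example.

```python
def combined_availability(single: float, count: int) -> float:
    """Availability of `count` redundant resources, assuming independent failures.

    The workload is treated as unavailable only when every resource is down at once.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a probability between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 - (1.0 - single) ** count


def capacity_lost_by_one_failure(count: int) -> float:
    """Fraction of total capacity lost when one of `count` equal resources fails."""
    if count < 1:
        raise ValueError("count must be at least 1")
    return 1.0 / count


# One large resource versus four smaller ones, each assumed to be 99% available.
one_large = combined_availability(0.99, 1)
four_small = combined_availability(0.99, 4)
blast_radius = capacity_lost_by_one_failure(4)
```

With four equal resources, a single failure removes a quarter of the capacity instead of all of it. The independence assumption is the reason the resources should not share a common point of failure.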

Stop guessing capacity

In the cloud, you can monitor demand and workload utilization, and automate the addition or removal of resources to maintain the optimal level to satisfy demand without over- or under-provisioning. There are still limits, but some quotas can be controlled and others can be managed.
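To make the idea concrete, here is a minimal Python sketch of a proportional scaling rule: it resizes a fleet so that utilization moves back toward a target, and clamps the result to a minimum and a maximum. The rule, the `desired_capacity` name, and the numbers are assumptions for illustration and are not presented as how any AWS scaling policy is implemented. The `maximum` parameter stands in for a quota.

```python
import math


def desired_capacity(
    current: int,
    utilization: float,
    target: float,
    minimum: int,
    maximum: int,
) -> int:
    """Return the resource count that would bring utilization back to the target.

    `utilization` and `target` use the same unit (for example, average CPU percent).
    The result is clamped to [minimum, maximum]; `maximum` models a quota.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    if minimum > maximum:
        raise ValueError("minimum must not exceed maximum")
    raw = math.ceil(current * utilization / target)
    return max(minimum, min(maximum, raw))


# Demand rose: 4 resources at 90% utilization, target 60%.
scale_out = desired_capacity(4, 90.0, 60.0, minimum=1, maximum=10)

# Demand fell: 4 resources at 30% utilization, target 60%.
scale_in = desired_capacity(4, 30.0, 60.0, minimum=1, maximum=10)
```

The clamp is where the "there are still limits" caveat shows up: if the computed size exceeds the quota, the rule cannot add more resources until the quota is raised.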

Manage change in automation

Changes to your infrastructure should be made using automation. The changes that then need to be managed are changes to the automation itself, which can be tracked and reviewed.
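One way to picture a tracked, reviewable change is a plan that is computed from the difference between the current and the desired configuration before anything is applied. The Python sketch below is a toy version of that idea; `plan_changes` and the configuration keys are hypothetical and not tied to any specific infrastructure-as-code tool.

```python
from typing import Dict, List


def plan_changes(current: Dict[str, str], desired: Dict[str, str]) -> List[str]:
    """Return a human-readable, ordered list of changes needed to reach `desired`."""
    plan: List[str] = []
    # Sort the keys so the same inputs always produce the same plan for review.
    for key in sorted(set(current) | set(desired)):
        if key not in current:
            plan.append(f"add {key}={desired[key]}")
        elif key not in desired:
            plan.append(f"remove {key}")
        elif current[key] != desired[key]:
            plan.append(f"change {key}: {current[key]} -> {desired[key]}")
    return plan


# Hypothetical configuration, as it might be stored in version control.
current_config = {"instance_type": "t3.small", "min_size": "2"}
desired_config = {"instance_type": "t3.medium", "min_size": "2", "max_size": "6"}

plan = plan_changes(current_config, desired_config)
for step in plan:
    print(step)
```

Because the desired configuration and the planning code both live in files, a change to either one can go through the same review and history as any other code change.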

What are the best practice areas of Reliability?

Foundations
Workload Architecture
Change Management
Failure Management

DEV Community
