HA, FT, DR: What They Mean and Why They Matter

#aws #devops #cloud #architecture

It all starts with a thought: how can I improve my infrastructure? There are plenty of other things you should care about, but these three terms are probably the most important ones.

High Availability, Fault Tolerance and Disaster Recovery.

In this article, you'll hopefully find enough answers to start caring about resilience. Let's start by explaining the terms.

What is High Availability (HA)?

High Availability ensures that an agreed level of operational performance, usually uptime, is consistently maintained. Its aim is not to enhance the user experience; a better user experience is the result, not the goal.

It may sound a bit confusing, so let's explain it in much simpler terms:

High Availability is about maximizing the system's online time while minimizing outages. A system can fail at any time, but you can quickly replace the component that caused the failure and get it running again. You might say, "Hey Fatih, why don't we diagnose the issue and fix it?" The answer is simple: our aim is to maximize the UPTIME of the system. The time it takes to analyze the situation, diagnose the issue, and implement a fix all counts as downtime, so we restore service first and investigate afterwards. The approach is simple: "it is better alive than dead".
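To make the "replace first, diagnose later" idea concrete, here is a minimal sketch. The `Instance` class and the `ensure_serving` function are hypothetical names made up for illustration, not part of any real tool; in practice a load balancer or an auto scaling group usually does this job for you.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """A hypothetical server with a simple health flag."""
    name: str
    healthy: bool = True


def ensure_serving(active: Instance, standbys: List[Instance]) -> Instance:
    """Return the instance that should serve traffic right now.

    If the active instance is unhealthy, swap in the first healthy standby
    immediately. Diagnosing the failed instance is left for later.
    """
    if active.healthy:
        return active
    for candidate in standbys:
        if candidate.healthy:
            standbys.remove(candidate)  # the standby is now in use
            return candidate
    raise RuntimeError("no healthy standby available")


# Example: the primary has failed, so the standby takes over.
primary = Instance("web-1", healthy=False)
spares = [Instance("web-2")]
serving = ensure_serving(primary, spares)
print(f"Traffic is now served by {serving.name}")
```

Notice that nothing here tries to understand why `web-1` failed. That investigation can happen offline, after users are back online.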

How to measure High Availability?

With 9's, of course.

High Availability is measured as a percentage of uptime:

Three 9's (99.9%) -> The system is up 99.9% of the year. In other words, it is down for about 8.77 hours per year.

Five 9's (99.999%) -> The system is down for about 5.26 minutes per year.
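If you want to check these numbers yourself or try other targets, the arithmetic is short. This is a small sketch that assumes an average year of 365.25 days; the function name is just illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year, leap years included (assumption)


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the downtime, in hours per year, allowed by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for label, pct in [("Three 9's", 99.9), ("Five 9's", 99.999)]:
    hours = downtime_hours_per_year(pct)
    print(f"{label} ({pct}%): {hours:.2f} hours, or {hours * 60:.2f} minutes, per year")
```

Every extra 9 cuts the allowed downtime by a factor of ten, which is why each one is noticeably harder and more expensive to reach than the last.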

In summary:

HA is about maximizing the system UPTIME.
It requires additional costs and extra planning.

What is Fault Tolerance (FT)?

Fault Tolerance and High Availability often get confused with each other. While HA is about system uptime, Fault Tolerance ensures a system continues to operate correctly even when some components fail: a system that operates through failure. Fault Tolerance (FT) requires redundant systems that run simultaneously and take over instantly when the primary system fails, which multiplies the costs.

It is not a simple fail-over system; therefore, it is much more complex than HA. It is harder to design and implement and requires much more detailed planning and engineering. This is because we aim for 100% UPTIME now, where even a minute or second of downtime is TOO MUCH to lose.
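A rough way to see why redundancy pushes you toward 100%, and why it multiplies the bill, is to work out the combined availability of several copies running side by side. The sketch below assumes the copies fail independently of each other, which is a simplification (real failures are often correlated), and the function name is made up for this example.

```python
def combined_availability(single: float, replicas: int) -> float:
    """Availability of `replicas` parallel copies, each with availability `single` (0..1).

    Assumes independent failures: the system is down only when every copy is down.
    """
    if not 0 <= single <= 1 or replicas < 1:
        raise ValueError("single must be in [0, 1] and replicas must be >= 1")
    return 1 - (1 - single) ** replicas


for n in (1, 2, 3):
    pct = combined_availability(0.99, n) * 100
    print(f"{n} copy/copies at 99% each -> {pct:.4f}% combined")
```

Each extra copy adds roughly two more 9's in this idealized model, but it also adds the full cost of another system that has to be kept running and in sync.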

In April 2011, AWS faced a major outage when a failure in its Elastic Block Store (EBS) in the US East region disrupted services for many customers, in some cases for more than a day. The customers who came through it best were the ones who had built redundancy into their own architecture, running across multiple Availability Zones or regions, so healthy copies kept serving while the affected systems were down. It shows that fault tolerance comes from redundancy you design in, not from the platform alone.

What is Disaster Recovery (DR)?

Disaster Recovery refers to a set of policies, tools, and procedures designed to enable the recovery or continuation of critical infrastructure after a disaster. DR is a process. It is the ability to restore access and functionality to the infrastructure after a disaster.

Think of it as a pilot ejection system in a jet: you hope you never need it, but when everything else has failed, it is what saves you.

In many cases, Disaster Recovery can be automated for efficiency.
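As a toy illustration of what an automated recovery drill can look like, the sketch below backs up a folder, simulates a disaster by deleting it, and restores it from the backup. Everything runs inside a temporary directory, and the file and function names are invented for this example; a real DR plan would use off-site backups and proper tooling.

```python
import shutil
import tempfile
from pathlib import Path


def back_up(data_dir: Path, backup_dir: Path) -> Path:
    """Copy data_dir into a snapshot folder under backup_dir and return its path."""
    snapshot = backup_dir / "snapshot"
    if snapshot.exists():
        shutil.rmtree(snapshot)  # keep only the latest snapshot in this sketch
    shutil.copytree(data_dir, snapshot)
    return snapshot


def restore(snapshot: Path, data_dir: Path) -> None:
    """Rebuild data_dir from the snapshot after a disaster."""
    if data_dir.exists():
        shutil.rmtree(data_dir)
    shutil.copytree(snapshot, data_dir)


def run_drill() -> str:
    """Back up, destroy, and restore some sample data; return the restored content."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data = root / "data"
        data.mkdir()
        (data / "orders.txt").write_text("order-1\n")

        snapshot = back_up(data, root / "backups")
        shutil.rmtree(data)  # simulate the disaster
        restore(snapshot, data)
        return (data / "orders.txt").read_text()


print("Restored content:", run_drill().strip())
```

Running a drill like this on a schedule is the useful part: a backup you have never restored is only a hope, not a recovery plan.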

Conclusion

In the end, building a resilient system is all about connecting the dots. That easy, huh? HA, FT, and DR are the cornerstones of your infrastructure. Keep these principles in mind as you improve it, and you'll be better prepared for whatever comes your way.