Availability is the measure of a system’s ability to stay up and running despite failures of its parts. Today, I will explore this core trait of distributed systems. I will cover theory, challenges, tools, and best practices to ensure your system stays up and running against all odds.
Let’s start with theory.
What Is Availability?
Availability describes how our systems handle failures and determines the system’s uptime. Usually, we describe the availability of a system in “nines” notation. 99% availability allows a maximum of 14.4 minutes of downtime per day, while 99.999% (the so-called five nines) reduces this time to 864 milliseconds.
Most cloud services have an SLA with between three (99.9%) and five (99.999%) nines of availability guaranteed for end users.
Availability (%) | Downtime per day (~) | Downtime per month (~) | Downtime per year (~) |
---|---|---|---|
90 | 144 minutes (2.4 hours) | 73 hours | 36.53 days |
99 | 14 minutes | 7 hours | 3.65 days |
99.9 | 1.5 minutes | 44 minutes | 8.77 hours |
99.99 | 9 seconds | 4.4 minutes | 52.6 minutes |
99.999 | 864 milliseconds | 26 seconds | 5.3 minutes |
99.9999 | 86.40 milliseconds | 2.6 seconds | 31.5 seconds |
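If you want to turn an availability target into a concrete downtime budget, the arithmetic is simple enough to script. Here is a minimal Python sketch, assuming an average 730-hour month and a 365.25-day year (the same assumptions behind the table above); the function name is mine:

```python
# A rough sketch: convert an availability target into a downtime budget.
# Constants assume an average 730-hour month and a 365.25-day year.

SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
SECONDS_PER_MONTH = 730 * 60 * 60         # ~2,628,000 (average month)
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY


def downtime_budget(availability_pct: float) -> dict[str, float]:
    """Allowed downtime in seconds for a given availability percentage."""
    unavailability = 1 - availability_pct / 100
    return {
        "per_day_s": unavailability * SECONDS_PER_DAY,
        "per_month_s": unavailability * SECONDS_PER_MONTH,
        "per_year_s": unavailability * SECONDS_PER_YEAR,
    }


print(downtime_budget(99.999))
# -> roughly {'per_day_s': 0.864, 'per_month_s': 26.28, 'per_year_s': 315.58}
```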
Additionally, the term high availability (HA) is used to describe services that offer at least three nines of availability.
There is a famous tension between availability and consistency, formalized in the CAP theorem. The common notion is that, when a failure such as a network partition occurs, we can have either one or the other. While in most cases this is true, the topic as a whole is vastly more nuanced and complex. For example, CRDTs put this blanket statement into question; the same is true for Google’s Spanner.
Moreover, we can use various techniques to balance both of these traits. A system may favor one over the other in certain places and not in others. Just remember: this tension exists and is one of the most important areas of study in distributed systems research.
How To Measure Availability
Availability is probably the simplest trait to measure – at least for a single service. You probably already have uptime or downtime metrics in one of your dashboards. Just divide your measured daily uptime by the length of the day in the same unit: 24 (hours), 1,440 (minutes), or 86,400 (seconds). Et voilà, you have your service’s daily uptime percentage ready, and you can easily see how many nines you achieved.
Things get more complicated when our service has multiple dependencies, or when we want to measure availability at the scale of the whole system.
As an example, consider service A with two dependencies: a database (DB) and an email service.
- Service A has uptime of 99.99%.
- DB has uptime of 99.9%.
- Email Service has uptime of 99%.
Thus, the effective availability of service A is not 99.99% but in fact 98.89%:
0.9999 × 0.999 × 0.99 = 0.9889 => 98.89%.
In a more readable format:
Component | SLA (nines) | Availability (decimal) |
---|---|---|
Service A (front-end API) | 99.99 % | 0.9999 |
Database | 99.9 % | 0.9990 |
Email service | 99 % | 0.9900 |
Composite A | 0.9999 × 0.9990 × 0.9900 ≈ 98.89 % | 0.9889 |
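To keep this number honest as the dependency list grows, it helps to compute the product rather than eyeball it. A minimal Python sketch, with the availability figures from the example above (the dictionary keys are just illustrative labels):

```python
# A service's effective availability is (at best) the product of its own
# availability and that of every hard dependency.
from math import prod

availabilities = {
    "service_a": 0.9999,
    "database": 0.9990,
    "email_service": 0.9900,
}

composite = prod(availabilities.values())
print(f"Composite availability: {composite:.2%}")  # -> 98.89%
```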
While the final difference is not big, it clearly illustrates the point: the availability of a service is not a standalone number but the product of the availabilities of all its dependencies.
The same principle applies to the system as a whole: its availability is the product of all its services and tools. Even a single poorly available component can bring the whole system down.
Weakest link | Best composite availability you can ever reach |
---|---|
99 % (two nines) | < 99 % |
99.9 % (three nines) | < 99.9 % |
99.99 % (four nines) | < 99.99 % |
Every additional dependency pushes the composite further below the weakest link.
Here is a quick note on how you can structure your availability-related metrics:
Tier | Example in an availability context |
---|---|
SLI (Indicator) | http_request_success_ratio = successful requests ÷ total requests |
SLO (Objective) | http_request_success_ratio ≥ 99.95 % over 30 days |
SLA (Agreement) | “We guarantee 99.9 % monthly availability; otherwise you get service credits.” |
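To make the relationship between the three tiers concrete, here is a small Python sketch that derives an SLI from raw request counts (the counts are made up) and checks it against an SLO, including how much of the 30-day error budget has been burned:

```python
# SLI: measured success ratio; SLO: the target that ratio should meet.
# The request counts below are invented for illustration.

successful_requests = 2_996_850
total_requests = 2_998_500

sli = successful_requests / total_requests      # e.g. 0.99945 -> 99.945 %
slo = 0.9995                                    # 99.95 % over a 30-day window

error_budget = 1 - slo                          # allowed failure ratio
budget_burned = (1 - sli) / error_budget        # >100 % means the SLO is blown

print(f"SLI: {sli:.4%} | SLO met: {sli >= slo} | error budget burned: {budget_burned:.0%}")
```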
Signs That a System Has Poor Availability
There are a couple of behaviors we can notice that indicate availability problems in our service. Additionally, some of them are similar to the signs of poor scalability.
- Low uptime percentage – most obvious of all, directly shows that the service is down and users cannot access it.
- Service “flapping” – the service oscillates between up and down as automated restarts or failovers repeatedly flip the service in and out.
- Health-check failures – persistent probe timeouts under normal load mean the service is down or will be down in the near future.
- High Mean Time To Recover – outages last for hours before the team can resolve them and bring the system back online.
- Traffic suddenly drops to zero – the service is either down or users have given up trying to connect.
- Direct feedback – an important client is calling the CTO/CIO (or whoever else) complaining that everything is down, alerts start firing, and other interesting events.
The Availability Game Changers
In my opinion, the game changer for availability is automatic and graceful failover. While it sounds simple, it is actually quite complex. To achieve it, we need to combine multiple different concepts and make them work together. Nonetheless, it is crucial for providing a zero-downtime experience.
The anatomy of a state-of-the-art zero-downtime failover:
Stage | What happens | Typical target time |
---|---|---|
1. Detect | Health probe sees anomalies (5× timeouts/60 s). | ≤ 5 s |
2. Decide | Orchestrator marks node unhealthy, stops scheduling it. | ≤ 1 s |
3. Redirect | Load balancer removes endpoint from pool; sticky sessions migrate. | ≤ 2 s |
4. Restore | Replacement pod/VM starts and passes readiness checks. | ≤ 40 s (hot standby: ≈ 0 s) |
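To show how the Detect, Decide, and Redirect stages fit together, here is a deliberately simplified Python sketch of a probe loop that evicts a backend after a run of failed health checks. The endpoints, the /healthz path, and the thresholds are illustrative; in practice this job belongs to your load balancer or orchestrator rather than hand-written code:

```python
# Sketch of Detect -> Decide -> Redirect: a probe loop that marks a backend
# unhealthy after N consecutive failures and removes it from the pool.
import time
import urllib.request

FAILURE_THRESHOLD = 5          # consecutive failed probes before eviction
PROBE_INTERVAL_S = 10          # seconds between probe rounds
PROBE_TIMEOUT_S = 2            # per-probe timeout

backends = {"http://10.0.0.1:8080", "http://10.0.0.2:8080"}  # hypothetical endpoints
failures = {b: 0 for b in backends}
pool = set(backends)           # endpoints currently receiving traffic


def probe(url: str) -> bool:
    """Detect: return True if the backend answers its health endpoint in time."""
    try:
        with urllib.request.urlopen(f"{url}/healthz", timeout=PROBE_TIMEOUT_S) as resp:
            return resp.status == 200
    except OSError:
        return False


def run_once() -> None:
    for backend in backends:
        if probe(backend):
            failures[backend] = 0
            pool.add(backend)                  # Restore: healthy again, re-add to pool
        else:
            failures[backend] += 1
            if failures[backend] >= FAILURE_THRESHOLD:
                pool.discard(backend)          # Decide + Redirect: evict from pool


if __name__ == "__main__":
    while True:
        run_once()
        time.sleep(PROBE_INTERVAL_S)
```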
Of course, automatic failover is not a silver bullet and comes with drawbacks. The two most significant ones are higher design complexity and increased costs: redundancy is responsible for the increased costs, while the failover mechanism itself adds the complexity.
It may sound discouraging, but without such a mechanism we will not be able to provide high availability.
Tools For Availability
I have already covered automatic failover as a key tool for building available systems. However, it is not the only concept. There are more, and you can find them below.
Replication
Replication is a method to implement redundancy at the data layer. The key difference is that redundancy spans all layers of our system, from software to hardware, while replication mostly concerns the data layer.
We keep multiple up-to-date copies of the same data set, usually spread across multiple nodes. Thus, if one of the nodes fails, the data is still available to the user.
There are two main types of Replication:
- Single-master/single-leader – only one of the replica nodes, the leader, handles incoming writes. The rest of the nodes provide read access and can be used to offload part of the incoming traffic. The leader propagates changes to the other nodes, usually using some type of Write-Ahead Log (WAL). If the leader node fails or becomes unavailable for some reason, a leader election takes place and a new leader is selected from the up-and-running nodes.
- Multi-master/multi-leader – all the nodes accept both reads and writes at the same time. Writes are then propagated to the other nodes. The biggest problem in this case is that conflicting writes to the same data can land on two different nodes at the same time. Thus, it requires a separate conflict-resolution mechanism.
The concept of replication is a very extensive one. A thorough walkthrough and comparison of even these two approaches is out of scope for this article. However, I promise to dive deeper into replication in a separate article.
For now, remember the following table:
Single-master | Multi-master |
---|---|
Only one node accepts writes | Multiple nodes accept writes |
Changes propagate via WAL | Requires conflict resolution on top of propagation |
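To make the single-leader flow tangible, here is a toy Python sketch in which the leader applies each write locally, appends it to an in-memory log standing in for a WAL, and replays missing entries to its followers. Class and variable names are mine, and real systems add durability, acknowledgements, and failure handling on top:

```python
# Toy single-leader replication: the leader appends every write to a log and
# replays it to followers. In-memory only; not a real WAL implementation.
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    data: dict = field(default_factory=dict)
    applied: int = 0  # index of the last log entry applied


class Leader(Node):
    def __init__(self, name: str, followers: list[Node]):
        super().__init__(name)
        self.log: list[tuple[str, str]] = []   # append-only write log (the "WAL")
        self.followers = followers

    def write(self, key: str, value: str) -> None:
        self.log.append((key, value))          # 1. record the intent in the log
        self.data[key] = value                 # 2. apply locally
        self.replicate()                       # 3. propagate to followers

    def replicate(self) -> None:
        for follower in self.followers:
            while follower.applied < len(self.log):   # replay any missed entries
                key, value = self.log[follower.applied]
                follower.data[key] = value
                follower.applied += 1


followers = [Node("replica-1"), Node("replica-2")]
leader = Leader("primary", followers)
leader.write("user:42", "alice")
print(followers[0].data)  # {'user:42': 'alice'} - reads can be served by any replica
```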
Automatic Failover
An automatic and graceful (not noticeable by the user) failover mechanism is key to availability.
To build good automatic failover, we need to combine at least three concepts:
- Redundancy – we need more than one node to even start thinking of building any failover.
- Health checks – we need properly defined health checks to detect whether nodes are down or should not handle user requests.
- Load balancer/actual failover – we need a way to swap out failing components and redirect the traffic to up-and-running ones.
Each piece alone is insufficient; all must work together.
Isolating Failures
Another way to increase the availability of our system is to isolate failures. By doing so, we can ensure that the failure of one component will not cause a cascading failure of the other components involved in the same processing flow.
As with most concepts in this section, there is no single tool or method to achieve this. Instead, we can follow one of the patterns below, or mix several of them.
Let’s dive into them below:
- Circuit breaker – one of the most common microservices patterns in existence. It implements the fail-fast concept in a way similar to an electrical circuit breaker. If multiple consecutive calls to another service fail within a certain period, the circuit breaker trips. Then, for the duration of a timeout period, all attempts to invoke that service fail immediately. This reduces the load on the possibly faulty service and gives it time to recover, and also avoids introducing potential timeouts at other stages of the flow (see the sketch after this list).
- Bulkhead – according to this pattern, components and resources in our system should be compartmentalized. Partitioning should be done in such a way that components do not share any resources. For example, each partition should have its own thread pools, connection pools, and CPU or memory limits. Such a split decreases the chance that one component’s overuse of resources (high resource utilization) impacts the other components in the system.
- Error kernel – we split our system into two types of components: core and peripheral ones. The core ones must not fail, whatever the reason. The peripheral ones may fail, and we should be able to restart them easily. We then push the peripheral components to the “outskirts” of the system and end up with a reliable core surrounded by easy-to-restart leaf components.
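Of these three, the circuit breaker is the easiest to show in code. Below is a minimal, hand-rolled Python sketch with illustrative thresholds; in production you would typically reach for a library (for example Resilience4j on the JVM) rather than writing your own:

```python
# Minimal circuit-breaker sketch: after a number of consecutive failures the
# circuit opens and calls fail fast until a reset timeout elapses.
import time


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, reset_timeout_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.consecutive_failures = 0
        self.opened_at = None                  # monotonic time when the circuit tripped

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout_s:
                raise CircuitOpenError("circuit is open, failing fast")
            self.opened_at = None              # timeout elapsed: allow a trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.consecutive_failures = 0          # success closes the circuit again
        return result
```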
Multi-Region or Multi-Cloud Deployment
A Multi-Availability-Zone or Multi-Region deployment will protect us from the least expected types of failures: the ones that wipe out a whole data center, or multiple data centers located in a particular region – like the fire in the OVH data center in France or GCP’s electrical problems in Iowa.
We can go even further and build a multi-cloud failover. If your main cloud provider is down, you can switch to a backup one. While this adds a ton of extra complexity to your system, it drastically reduces the probability of a system-wide failure even further. Region-wide failures are rare by themselves, and provider-wide failures are even rarer. Nevertheless, both may happen. Being able to handle them will probably not be what decides between 99.99% and lower availability tiers.
However, being able to handle such events has a few advantages:
- You stay alive while others are down.
- It indicates how good your architecture is.
Chaos Engineering / Fault Injection
Chaos engineering will not, by itself, help you build an available system. Rather, it helps you verify that your system is in fact available. By introducing deliberate and trackable failures, you can identify weaknesses and problems that would not show up otherwise. I also mentioned this concept here.
Just remember that it is not entirely safe, so double-check that your system will be able to handle the injected failures. A minimal sketch of the idea follows below.
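The simplest possible form of fault injection is a wrapper that makes calls fail at random. The sketch below is a hypothetical Python decorator (names and probability are made up); dedicated tools such as Chaos Monkey or LitmusChaos work at the infrastructure level instead:

```python
# Fault-injection sketch: a decorator that makes a function fail with a
# configurable probability. Intended for test/staging environments only.
import functools
import random


def inject_faults(probability: float = 0.05, exc: type = TimeoutError):
    """Wrap a callable so that it randomly raises `exc` with the given probability."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                raise exc(f"injected fault in {fn.__name__}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@inject_faults(probability=0.1)
def fetch_user_profile(user_id: int) -> dict:
    # Imagine a real downstream call here.
    return {"id": user_id, "name": "example"}
```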
Why We Fail To Achieve High Availability
After the what, the how, and the why, it is time for why we fail. In my opinion and experience, there are a few factors that lead to failure in building available systems.
Some of the reasons are the same as in my article on scalability.
- Ignoring the trade-offs – every decision we make has short- and long-lasting consequences we have to be aware of. Of course, we can ignore them; still, we have to know them first and be conscious as to why we are ignoring some potential drawbacks.
- Incorrect health checks – they react either too slowly or too quickly. Restarting a service too early or too late increases the likelihood of users experiencing the failure.
- Lack of Redundancy – critical components do not have properly configured redundancy.
- Badly designed Failover – we are unable to redirect the traffic to the up-and-running nodes fast enough.
Below is a simple checklist that increases your chances of not failing at availability:
Do today | Impact |
---|---|
Add health check to every component. | 30 min work slashes 502 errors during deploys/failovers. |
Track the availability product | Makes hidden single points of failure painfully obvious. |
Set a written SLO | Aligns team on what “good enough” means. |
Run a failover drill. | Checks your design in practice. |
Summary
I have shared a number of concepts and approaches for building highly available systems.
Let’s do a quick recap of key takeaways:
- Making highly available systems requires mixing different concepts like redundancy, health checks, and failover.
- Proper health checks will help you keep up with the state of your components.
- Isolating failures and preventing their propagation will keep the system running even if some components fail.
- Multi-region deployment will save you at the most unexpected moment.
Some concepts discussed here can’t be implemented using a single tool. They require architectural thinking and coordination across layers of the stack.
Concept | Tool |
---|---|
Replication | Usually part of the database product you are using |
Automatic failover | K8s probes, Cloud autoscaling products |
Failure isolation | Resilience4j, K8s Namespaces |
Multi-AZ | Cloud providers’ Availability Zones |
🚀 High availability isn't just a metric – it is a mindset. Build for failure. Monitor everything. And treat availability as a first-class feature.
I wish you luck in your struggle with availability.
Thank you for your time.
Blog Availability – Theory, Problems, Tools and Best Practices from Pask Software.