
Kevin Mack

Originally published at welldocumentednerd.com

A Few Gotcha’s for Resiliency

So I’ve been doing a lot of work over the past few months around availability, SLAs, and Chaos Engineering, and in that time I’ve come across a few “gotchas” that are important to sidestep if you want to build stronger availability into your solutions. This is by no means meant to be an exhaustive list, but it’s at least a few tips from my experience that can help if you are starting down this road:

Gotcha #1 – Pulling in too many services into the calculation.

As I’ve discussed in many other posts, the first step of a resiliency review of your application is to figure out which functions are absolutely essential and governed by the SLA. The simple fact is that an SLA is a business decision and agreement, so just like any good contract, you start by figuring out the scope of what is covered.

But let’s boil this down to simple math; composite SLA calculations are described in depth here.

Take the following as an example:

| Service | SLA |
| --- | --- |
| App Service | 99.95% |
| Azure Redis | 99.9% |
| Azure SQL | 99.99% |

Based on the above, the calculation for the SLA gives us the following:

.9995 * .999 * .9999 = .9984 = 99.84%

Now, if I look at the above (more on the specifics in Gotcha #2) and remove Redis to lower the number of services involved, the calculation changes to the following:

.9995 * .9999 = .9994 = 99.94%

Notice how removing an item from the calculation causes the composite SLA to increase. Part of the reason here is that I removed the service with the lowest SLA, but every item in the calculation drags the final number down, so wherever possible we should scope our calculations to only the services that actually support the SLA.
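To make the math concrete, here is a minimal sketch of the composite SLA calculation. The service names and values are just the example figures from the table above.

```python
# Composite SLA is the product of each dependent service's SLA.
# Values taken from the example table above.
service_slas = {
    "App Service": 0.9995,
    "Azure Redis": 0.999,
    "Azure SQL": 0.9999,
}

def composite_sla(slas):
    """Multiply the individual SLAs together to get the composite SLA."""
    result = 1.0
    for sla in slas.values():
        result *= sla
    return result

# With Redis in scope: ~99.84%
print(f"{composite_sla(service_slas):.4%}")

# With Redis removed from the calculation: ~99.94%
service_slas.pop("Azure Redis")
print(f"{composite_sla(service_slas):.4%}")
```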

Gotcha #2 – Using a caching layer incorrectly

Caching tiers are an essential part of any solution. When Redis was first created, caching tiers were seen as something you would implement only if you had aggressive performance requirements. But these days the demands on software solutions are so high that I would argue all major solutions have a caching tier of some kind.

Now, to slightly contradict myself: those caching tiers, while important to the performance of an application, should not be required as part of your SLA or availability calculation if implemented correctly.

What I mean by that is that caching tiers are meant to be transient: they can be dropped at any time, and the application should be able to function without them rather than relying on them as a persistence store. The most common pattern that violates this recommendation is the following:

  • User takes an action that requests data.
  • Application reaches down to data store to retrieve data.
  • Application puts data in Redis cache.
  • Application returns requested data.

The above has no issues at all; that’s what Redis is for. The problem is when the next part looks like this:

  • User takes an action that requests data.
  • Application pulls data from Redis and returns.
  • If data is not available, application errors out.

Given the ephemeral nature of caches, and the fact that they can be very hard to replicate, your application should be smart enough that if the data isn’t in Redis, it will go get it from the data store.

By implementing that fallback, and configuring your application to use its cache only as a performance optimization, you can effectively remove the Redis cache from the SLA calculation, as sketched below.
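Here is a minimal sketch of that cache-aside fallback, assuming the redis-py client and a hypothetical `load_from_database` helper standing in for your real data access layer; it is illustrative rather than a definitive implementation.

```python
import json
import redis

# Assumed connection details; adjust for your environment.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def load_from_database(key):
    """Hypothetical helper that fetches the record from the system of record
    (e.g. Azure SQL). Stubbed out here for illustration."""
    raise NotImplementedError

def get_record(key, ttl_seconds=300):
    """Cache-aside read: try Redis first, fall back to the data store.

    The cache is a performance optimization only; if Redis is unavailable
    or the key is missing, the data store remains the source of truth."""
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        # A cache outage should not take the feature down.
        pass

    record = load_from_database(key)

    try:
        # Repopulate the cache with a TTL so entries stay transient.
        cache.setex(key, ttl_seconds, json.dumps(record))
    except redis.RedisError:
        pass  # Best effort; failing to cache is not an error.

    return record
```

With this shape, Redis going down degrades latency rather than availability, which is what lets you leave it out of the composite SLA math from Gotcha #1.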

Gotcha #3 – Using the right event store

Now, the other way I’ve seen Redis or caching tiers misused is as an event data store. A practice I’ve seen done over and over again is to leverage Redis to store JSON objects as an event store because of the performance benefits. There are more appropriate technologies that support this much better while helping you manage costs (see the sketch after the list):

  • Cosmos DB: Cosmos is designed exactly for this purpose, providing high performance and high availability for your applications. It does this by allowing you to configure the appropriate write strategy.
  • Blob Storage: Again, Blob Storage can be used as an event store by writing objects to blobs; although not my first choice, it is a viable option for managing costs.
  • Other database technologies: There are a myriad of potential options here, from MariaDB, PostgreSQL, and MySQL to SQL Server, all of which perform this operation better.
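As one example of the first option, here is a minimal sketch of appending an event document to Cosmos DB with the azure-cosmos Python SDK. The endpoint, key, database, and container names are placeholders, and the event shape is hypothetical.

```python
import uuid
from datetime import datetime, timezone

from azure.cosmos import CosmosClient

# Placeholder connection details; in practice pull these from configuration.
ENDPOINT = "https://<your-account>.documents.azure.com:443/"
KEY = "<your-key>"

client = CosmosClient(ENDPOINT, credential=KEY)
container = client.get_database_client("events-db").get_container_client("events")

def append_event(event_type, payload):
    """Append an event document; Cosmos DB handles durability and indexing,
    so the cache is free to stay a purely transient optimization."""
    event = {
        "id": str(uuid.uuid4()),          # Cosmos requires an 'id' field.
        "eventType": event_type,
        "occurredAt": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    container.create_item(body=event)
    return event
```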

Gotcha #4 – Mismanaged Configuration

I did a whole post on this, but the idea that configuration cannot be changed without causing an application event is always a concern. You should be able to change an application’s endpoint without any major hiccups in its operation.
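As a simple illustration of the idea, here is a minimal sketch assuming the endpoint lives in an environment variable named SERVICE_ENDPOINT (a hypothetical name): the value is resolved at call time rather than cached at import time, so an updated setting takes effect without redeploying the code.

```python
import os

DEFAULT_ENDPOINT = "https://example.internal/api"  # hypothetical fallback

def get_service_endpoint():
    """Resolve the downstream endpoint at call time instead of import time,
    so a configuration change does not require shipping new code; the app
    only needs whatever refresh the hosting platform applies to settings."""
    return os.environ.get("SERVICE_ENDPOINT", DEFAULT_ENDPOINT)

def call_downstream(path):
    # Build the URL from the current configuration on every request.
    return f"{get_service_endpoint().rstrip('/')}/{path.lstrip('/')}"
```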

