5 tips to avoid production incidents

#monitoring #devops #tutorial #beginners

It is hard to give a complete definition of a major service incident, yet it is usually intuitively clear whether a specific incident is major or not. The more users who cannot access the functionality the service provides, the more significant the incident is.

For example, if access to the service is completely lost (due to an expired SSL certificate or failure of key components of the service), the incident is major. Or if part of critical functionality stops working (such as adding items to the cart, searching the site, or the recommendations feed), the incident is also major.

But is it always that clear-cut? What if access is lost for a minute, or for only 1 second? What if only 0.1% of users are affected and a critical function stops working for only half an hour? The answer becomes less obvious.

What can be said for sure is that the number of affected users should be kept as low as possible. That means either fixing an incident quickly or preventing it altogether.

1. Monitor critical system resources such as CPU, RAM, etc. I thought long and hard about what to put first, and in the end I settled on this. An outage can become major if it is not detected and then prevented or repaired in time. Usually, a service stops working completely when its components enter a degraded state, and that most often happens when the resources the service needs are exhausted. This can result from a release or from an increase in traffic. If monitoring fires as usage approaches threshold values, it is possible to prevent the outage or at least to start repairing it quickly.
2. Use a horizontally scalable architecture in some form, so that no component of the service runs as a single instance. Servers fail regularly, and failure is a stochastic process. You do not want everything to break just because one old disk has failed, right?
3. Understand that if an outage is caused by a release of one of the service's components, the release should be rolled back, not fixed forward. If the service stops working for a known reason, return it to a working state by undoing the cause. I have seen many cases where teams try to fix the problem in production with new releases, but more often than not this does not work. The biggest problem with such attempts is that each fix depends on the specific malfunction and is not a universal way to restore availability. In a stressful situation, it is much better to stick to universal processes.
4. Design retries carefully. If the service or one of its components becomes unstable, poorly designed retries multiply the load until it crosses the threshold, and the retries then exhaust the service completely. In addition to configuring retries correctly in advance, you need traffic management tools: the ability to quickly change retry parameters and to quickly drop part of the traffic.
5. After an outage, think about what needs to be improved in the system so that the same outage does not happen again. These improvements are called action items. Ideally, they eliminate a whole class of similar problems. If that is difficult, then at a minimum learn how to avoid the exact same outage. Gradually, the robustness of the system will grow to the required level.
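
The monitoring tip above hinges on alerting before a resource is actually exhausted. A minimal sketch of that idea in Python follows. The `Threshold` class, the `classify` and `evaluate` functions, and the numeric limits are all hypothetical and chosen only for illustration; in practice the usage values would come from whatever metrics system you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    warn: float      # fraction of capacity at which to alert early
    critical: float  # fraction at which the component is expected to degrade


# Hypothetical limits; real values depend on the service and its hardware.
THRESHOLDS = {
    "cpu": Threshold(warn=0.75, critical=0.90),
    "memory": Threshold(warn=0.80, critical=0.95),
}


def classify(usage: float, threshold: Threshold) -> str:
    """Map a usage fraction (0.0 to 1.0) to an alert level."""
    if usage >= threshold.critical:
        return "critical"
    if usage >= threshold.warn:
        return "warning"
    return "ok"


def evaluate(samples: dict[str, float]) -> dict[str, str]:
    """Classify every sampled resource that has a configured threshold."""
    return {
        name: classify(usage, THRESHOLDS[name])
        for name, usage in samples.items()
        if name in THRESHOLDS
    }


if __name__ == "__main__":
    print(evaluate({"cpu": 0.78, "memory": 0.50}))
```

The point of the separate `warn` level is that someone is notified while there is still headroom, which is what makes prevention possible rather than only repair.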
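
For the retry tip above, one common way to keep retries from piling onto an unstable service is a capped number of attempts with exponential backoff and random jitter, plus a switch for shedding a fraction of traffic. The sketch below assumes that approach; it is not the only option, and `RetryPolicy`, `backoff_delay`, `call_with_retries`, and `should_shed` are hypothetical names invented for this example.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    # Kept in one mutable object so the values can be changed quickly at runtime.
    max_attempts: int = 3
    base_delay: float = 0.1  # seconds
    max_delay: float = 2.0   # seconds


def backoff_delay(attempt: int, policy: RetryPolicy, rng=random.random) -> float:
    """Full jitter: a random delay between 0 and the capped exponential bound."""
    bound = min(policy.max_delay, policy.base_delay * (2 ** attempt))
    return rng() * bound


def call_with_retries(operation, policy: RetryPolicy, sleep=time.sleep):
    """Call operation(), retrying on any exception up to policy.max_attempts times."""
    attempts = max(1, policy.max_attempts)
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt + 1 == attempts:
                raise  # attempts exhausted: surface the last error
            sleep(backoff_delay(attempt, policy))


def should_shed(shed_fraction: float, rng=random.random) -> bool:
    """Return True if this request should be dropped to relieve the service."""
    return rng() < shed_fraction
```

A real client would retry only on errors known to be transient rather than on every exception. Setting `max_attempts` to 1 turns retries off entirely, and raising `shed_fraction` above 0 drops that share of requests, which are the two levers the tip asks to have ready during an incident.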

Of course, a real system needs more monitoring and graphs of different kinds, and there are many other tips as well, but starting with these 5 points will already yield results. They are the minimum technical hygiene for a stable production environment.
