Karel Vanden Bussche

The Pyramid of Alerting

We have all come across it before. Your company is processing thousands of data points. All goes well on most days, but today the pipelines are saying "not today". You start digging and find out that the issue is caused by a single invalid or malformed message. You inspect the message, identify the breaking change (it's a new date format, who would have guessed) and implement a fail-safe for this specific issue.

Some people might say: "Well, you lost quite a bit of time digging for the root cause, why didn't you program the fail-safe in the first place?". As data engineers, we are well-trained in the art of balancing our efforts to stay efficient while still providing business value. There are hundreds of integrations to be made, but little time, and competitors don't wait. Secondly, data is so chaotic that it is impossible to create guards against every mutation. Lastly, each hour of programming has a business cost. As such, new integrations take priority over a few edge cases, such as faulty files or wrong assumptions. That is completely fine, as long as you weave your net around the weakest points. It's only when certain assumptions were never validated that the issues start.

This balance between efficiency and robustness is an important equilibrium: postponing certain complex fail-safe developments makes the team more efficient, even if the issue bites back later. The two cards that keep this precarious card stack from falling down are monitoring and alerting. It sounds boring to most people, but monitoring can really help identify silently failing processes in your infrastructure. These failures might grind your pipeline to a halt or silently destroy data integrity.

At OTA Insight, we actively work on moving issues from the silent category to the alerting category. Over the years, we have seen both types of impact arise, though we quickly worked to resolve them permanently.
Sometimes our messages fail to transform and, due to our aggressive retry strategy, we keep retrying these invalid or corrupt messages. As long as this issue is not resolved, the pipeline has significantly lower throughput and we lose precious CPU time.
Other times, our integrity checks failed us, meaning we did not realise we had not been loading any data for multiple days.
Thinking about the impact a new flow might have and brainstorming about alerting at scoping time helps resolve the biggest issues that are low-hanging fruit.

The levels of monitoring & alerts

Over the years, our team has learned valuable lessons on monitoring. Whenever we implement a new solution nowadays, there are a few key levels of monitoring we plan for. The image below shows the hierarchy of alerts we currently look at for new flows or business logic.

(Image: the pyramid of alerting — operational alerts at the base, data validation in the middle, business assumptions at the top.)

In the following sections, we'll go over each of them, bottom to top, and show what they entail and how they help us recover from certain disasters.

Operational alerts

The first type of alerts are the operational alerts. This category focuses on making sure that everything keeps running smoothly at the service/bare-metal level. These alerts monitor, for example, how many messages are in a certain queue and what the age of the oldest message is. This gives us a good idea of whether our processes are healthy.

Operational alerting should have an internal hierarchy. Each operation will encounter spikes at multiple points during its lifetime, which means that certain alerts might be raised falsely or their importance might be inflated. As such, we deem it worthwhile to define multiple threshold levels, each with its own importance. High-importance alerts should be acted on as soon as possible, while low-importance alerts should be looked into, but not necessarily right now. A low-importance alert can, however, evolve into a high-importance one, so it pays to look into low-importance alerts before they escalate and cripple the rest of the data pipelines.
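
To make this concrete, below is a minimal sketch of what such tiered thresholds could look like. The metric names, limits and example values are made up; in practice you would tune them per queue and wire the result into whatever alerting backend you use.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    LOW = "low"    # worth looking into, but not necessarily right now
    HIGH = "high"  # act as soon as possible


@dataclass
class Threshold:
    low: float
    high: float

    def evaluate(self, value: float) -> Optional[Severity]:
        """Return the severity this value triggers, or None if it is healthy."""
        if value >= self.high:
            return Severity.HIGH
        if value >= self.low:
            return Severity.LOW
        return None


# Illustrative limits; in practice these are tuned per queue and per flow.
QUEUE_DEPTH = Threshold(low=10_000, high=100_000)
OLDEST_MESSAGE_AGE_S = Threshold(low=15 * 60, high=2 * 60 * 60)


def check_queue(name: str, depth: int, oldest_age_s: float) -> list:
    """Collect alert lines for one queue; routing them (chat, paging, ...)
    is left to the alerting backend."""
    alerts = []
    for metric, value, threshold in [
        ("depth", depth, QUEUE_DEPTH),
        ("oldest_message_age_s", oldest_age_s, OLDEST_MESSAGE_AGE_S),
    ]:
        severity = threshold.evaluate(value)
        if severity is not None:
            alerts.append(f"[{severity.value}] {name}.{metric}={value}")
    return alerts


# A queue with 42k pending messages and a 30-minute-old head triggers two low alerts.
print(check_queue("ingest-queue", depth=42_000, oldest_age_s=30 * 60))
```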

Say we did not have these alerts. Then, when our processes became unhealthy, no alarm bells would start ringing. This would be a disaster, as an invisible issue is not acted on rapidly and might go on for days or weeks without anyone knowing.

Data validation

Data validation has a lot of definitions, so let me first explain what I mean by it. The previous level gave an overview of operational load & throughput. What it did not show is whether certain values were received and whether the amount of data was (more or less) correct. Other checks that fall under data validation are:

  • Did we receive (enough) files for a given date?
  • Was the amount of data relatively close to the average number of ingested rows on previous days?
  • Can the incoming files be unzipped/loaded correctly?
  • Did we receive the file at the correct timestamp?
  • ...

These validations are slightly more complex, as the implementer needs some prior knowledge of the dataset, whereas for operational alerts no prior knowledge was needed. That knowledge is still quite superficial, though, as we're only looking at the amount of data and not at the semantics of the data itself.

Data validations might save your skin when there is an issue with the ingestion or when messages are silently dropped in your infrastructure. As such, these validations are very valuable, because what is a data business without data?
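
As an illustration, here is a small sketch of what a couple of these checks could look like. The function, parameters and thresholds are hypothetical; the point is simply to compare today's ingestion against expectations derived from previous days.

```python
import statistics
from datetime import date


def validate_ingestion(
    day: date,
    files_received: int,
    rows_ingested: int,
    row_counts_previous_days: list,
    min_files: int = 1,
    max_deviation: float = 0.5,
) -> list:
    """Return human-readable warnings; an empty list means the day's
    ingestion looks plausible. Thresholds are illustrative."""
    warnings = []

    # Did we receive (enough) files for the given date?
    if files_received < min_files:
        warnings.append(f"{day}: expected at least {min_files} file(s), got {files_received}")

    # Is the row count close to the average of previous days?
    if row_counts_previous_days:
        baseline = statistics.mean(row_counts_previous_days)
        if baseline > 0 and abs(rows_ingested - baseline) / baseline > max_deviation:
            warnings.append(
                f"{day}: ingested {rows_ingested} rows, "
                f"more than {max_deviation:.0%} away from the recent average ({baseline:.0f})"
            )

    return warnings


# Example: previous days delivered ~100k rows, today only 20k arrived in a single file.
print(validate_ingestion(date(2022, 6, 1), files_received=1, rows_ingested=20_000,
                         row_counts_previous_days=[95_000, 102_000, 99_000]))
```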

Business Assumptions

Now that we're fairly certain the data is in our pipeline and flowing through correctly, we can start asking whether the data itself is correct. As this requires more in-depth knowledge of the semantics of the data, this piece of alerting should be owned by the team that owns the data. At OTA Insight, that's usually the team that ingests it, but sometimes it's the consuming team that knows much better how it should be handled.

Business assumptions come in all flavours. To give an idea, here are a few examples:

  • We assume the header of the file always has a certain date format, but we're not sure this will hold for newly received files.
  • We assume two values always add up to a third value. If this is incorrect, our downstream calculations propagate the error.
  • Are the values in a certain row correct according to our initial research?
  • Are values within a certain range without too many outliers?
  • ...

These alerts are a bit harder to implement, as they can trigger at different levels of your processing. This means that each data source should have a way of surfacing these broken assumptions.

Preferably, these fallbacks start very broad, warning on any assumption that is broken. As time goes by, assumptions can be (in)validated and either removed from our net or promoted to hard fails if their violation signals an actual error. After a while, our assumptions will converge to a realistic mirror of the actual underlying semantics.

Implementing this is sometimes hard, as for each assumption you need to decide between a warning and an error. Throwing too many or too few errors might skew the data or hurt data quality, while having too many warnings makes it hard to separate truth from error. As such, this should be implemented by someone with a good understanding of both the underlying data and the data source.
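
Here is a minimal sketch of how such a mechanism could look, under the assumption that each check starts as a warning and is flipped to a hard fail once it has proven itself. The field names and checks are purely illustrative.

```python
import logging
from datetime import date

logger = logging.getLogger("business_assumptions")


class AssumptionError(Exception):
    """Raised when a hardened assumption is violated."""


def check(assumption: str, holds: bool, hard_fail: bool = False) -> None:
    """Surface a broken assumption.

    New assumptions start as warnings (hard_fail=False), so they only show up
    in logs and dashboards. Once an assumption has proven itself, flip it to a
    hard fail so a violation stops the flow instead of corrupting data downstream.
    """
    if holds:
        return
    if hard_fail:
        raise AssumptionError(assumption)
    logger.warning("Assumption broken: %s", assumption)


def _is_iso_date(value: str) -> bool:
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


def process_row(row: dict) -> None:
    # Field names and checks are made up for the example.
    check("the date column uses ISO format", _is_iso_date(row.get("date", "")))
    check(
        "sold + available adds up to capacity",
        row.get("sold", 0) + row.get("available", 0) == row.get("capacity", 0),
        hard_fail=True,  # validated over time, a violation is now a real error
    )
```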

When implemented, these alerts give your pipelines a safety net against invalid data. They also provide a looking glass into your assumptions, which can help you root out the invalid assumptions and reinforce the valid ones.

Conclusion

People who work with Big Data know that it contains chaos. Without some kind of insight, finding & reducing edge cases is practically impossible. Adding different kinds of alerts helps us engineers cluster this chaos into smaller packets that can be solved independently. The result is more manageable errors and targeted fail-safes, which ultimately increases the robustness of both the processors and the data.

The hierarchy in alerts makes it easy to delegate certain responsibilities to different teams. It also gives an idea of what knowledge is required at each level of fallbacks, and of the implementation complexity at each depth. Depending on your data source, you might want to weigh the costs of adding a certain validation against its benefits.
