DEV Community

Cover image for What are you monitoring
Adedapo ajuwon
Adedapo ajuwon

Posted on

What are you monitoring

This post was initially posted on my medium blog @dapseen

Working as a DevSecOps/SRE could be really fun, you know. The different kinds of errors tell you exactly where to go, yeah? It’s a 404 error — I’ll have to check my logs, check ELB (Elastic Load Balancer) status, I’ll…, bla bla… and yea, with that, i’d be able to determine if it’s a code error or a network error.

In more technical terms, it’s very common for DevSecOps/SRE to write promQL that monitors status 4XX, 5XX etc. However, what happens when you get none of these statuses?

What I mean is…

WHAT DO YOU DO WHEN YOUR SERVICE IS NOT AVAILABLE?

Don’t get me wrong guys, this is completely different from your service not running. What I mean exactly is — your service is running perfectly, but oops! it’s just not reachable.

Service Architecture

To manage this kind of incident, checking logs for your regular 5XX or 4XX definitely wouldn’t work here. Why? Your service wouldn’t be logging any of these response statuses. Sadly.

Now let’s look on the brighter side. The question is, how can you monitor your service that is not reachable?

Simple!

Monitor SILENCE

Service A on a regular day receives 5000 RPM (requests per minute) in peak hours. SILENCE comes into play when Service A is getting 0 (zero) RPM. Obviously, you know something is totally wrong.
Now to the main business…

Writing a well structured PROMQL to understand this and plugging Grafana to alert your slack channel is something i’d recommend.

Grafana Alert

Setting if no data or all values are null to ALERTING will notify you when you don’t have any request hitting your service.

However, what i’m yet to figure out is — how can I prevent this from waking me up in the middle of the night when I get low traffic or zero traffic? That kind of SILENCE is valid, but peak hours SILENCEDefinitely NOT valid.

Top comments (0)