Requirement
- We have a counter named heartbeat_count that indicates whether an application is up. It has a label called application, which holds the application name.
- Each application sends this heartbeat metric every 15 seconds.
Now, we want to get an alert whenever any application stops pushing the heartbeat metric for 5 minutes.
Solution
count_over_time()
- This function counts how many samples a metric has within a given time range. So if an application sends the metric every 15 seconds, count_over_time(heartbeat_count{application="ABC"}[1m]) returns 4 (the metric has 4 samples in the last minute, since it is pushed every 15 seconds).
Now, over 10 minutes, count_over_time should be 40 for an application that is working fine. We can use this function to send an alert if the heartbeat is missing 20 times in the last 10 minutes. The query below prints the heartbeat count over the last 10 minutes by application. If the value for any application goes below 20, more than 20 of the 40 expected heartbeats are missing in the last 10 minutes, which adds up to at least 5 minutes without heartbeats (the 5 minutes might not be continuous, but that is a limitation of this solution).
sum by(application) (count_over_time(heartbeat_count{application!=""}[10m]))
The query below returns the applications that sent fewer than 20 heartbeats in the last 10 minutes. We can easily set up an alert on it.
sum by(application) (count_over_time(heartbeat_count{application!=""}[10m])) < 20
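As a rough sketch, the threshold query could be wired into a Prometheus alerting rule file along these lines. The group name, alert name, severity label, and annotation text are all placeholders chosen for illustration, not part of the original setup.

```yaml
groups:
  - name: heartbeat                # hypothetical group name
    rules:
      - alert: HeartbeatMissing    # hypothetical alert name
        # Fires for every application with fewer than 20 samples
        # of heartbeat_count in the last 10 minutes.
        expr: |
          sum by(application) (count_over_time(heartbeat_count{application!=""}[10m])) < 20
        labels:
          severity: warning        # assumed label, adjust to your routing
        annotations:
          summary: "{{ $labels.application }} sent only {{ $value }} heartbeats in the last 10 minutes"
```

Because the expression is grouped by application, each affected application produces its own alert, with the application label available to the notification template.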
Limitations
- Whenever a new application starts, an alert is sent because the new application does not yet have counter values for the full past 10 minutes. This alert can be ignored.
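One possible way to quiet the startup alert, offered only as an untested sketch, is to alert just for applications that already had at least one heartbeat in the previous 10-minute window (selected with the offset modifier). The alert name is again a placeholder, and this variant is an assumption rather than part of the original solution.

```yaml
groups:
  - name: heartbeat                          # hypothetical group name
    rules:
      - alert: HeartbeatMissingKnownApps     # hypothetical alert name
        # Left side: fewer than 20 heartbeats in the last 10 minutes.
        # Right side: at least one heartbeat between 20 and 10 minutes ago,
        # so an application that has only just started is skipped.
        expr: |
          (sum by(application) (count_over_time(heartbeat_count{application!=""}[10m])) < 20)
          and on(application)
          (sum by(application) (count_over_time(heartbeat_count{application!=""}[10m] offset 10m)) > 0)
```

The trade-off is that an application is only covered once it has been reporting for more than 10 minutes.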