Splunk is a tool for querying and summarizing unstructured or structured text. You can use Splunk to search, summarize, and alert on application logs.
We use Splunk to send alerts via email or PagerDuty when it detects a rise in certain error logs. Monitoring our error logs matters because it tells us when our users are experiencing issues with our products. Errors are one of the "Golden Signals" for application health, as described in the Google Site Reliability Engineering book online (link).
Here's a hypothetical log that we need to alert on:
2021-09-05T16:35:18+00:00 level=ERROR logger=com.enterprise.payments PAY - reason=null payeeId=aceja payStatus=FAILED host=appDeployment-alfzf
The application also emits other logs, such as success logs:
2021-09-05T13:35:18+00:00 level=INFO logger=com.enterprise.payments PAY - reason=accepted payeeId=baehg payStatus=SUCCESS host=appDeployment-alfzf
Or in-between logs, such as declined payments:
2021-09-05T17:35:18+00:00 level=WARN logger=com.enterprise.payments PAY - reason=decline payeeId=gaeaj payStatus=FAILED host=appDeployment-alfzf
In the past, we set up our alerts to trigger when we received x error logs, where x is some threshold between 1 and 100. This works most days, since we usually have steady traffic.
However, when the application receives high traffic, the alert can fire even when the system is operating normally. The goal of our alerting and monitoring is to tell us when something unusual or exceptional is happening, something we need to fix. Getting paged late at night just because traffic is high is nerve-racking, and these false positives quickly create pager burnout.
To avoid these false positives, I refactored one of our Splunk alerts from one based on the quantity of errors to one based on the percentage of requests that fail.
Here's an example of the query before:
index=example (PAY "FAILED" logger=com.enterprise.payments)
which would alert when it finds more than x results. This has the advantage of being simple, but it produces false positives during high traffic. It also tells us nothing about how many payments were successful, or whether a failure was just a declined card, which is a problem in the user's cash flow rather than an error in our server flow.
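For comparison, that count-based trigger can also be written explicitly in SPL, so the alert simply fires when the search returns any results; the threshold of 50 here is a hypothetical value, not our production setting:

```
index=example (PAY "FAILED" logger=com.enterprise.payments)
| stats count
| where count > 50
```

Either way, the trigger still measures raw error volume, which is exactly what breaks down under high traffic.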
Here's an example of the query after making it alert on percentage of errors:
index=example (PAY logger=com.enterprise.payments)
| eval failureRate=if(match(payStatus, "FAILED"), 100, 0)
| timechart avg(failureRate) as percentFail
| where percentFail > 5
By converting payStatus to a number with if(match(payStatus, "FAILED"), 100, 0), we can take its average later on. Averaging a field that is 100 for failures and 0 for everything else yields the failure percentage, and gives us a nice 0-100% timechart.
The where percentFail > 5 clause is critical: it limits the results to the time buckets where the failure percentage was high, which lets us configure the alert to fire whenever the search returns more than 0 results.
Another side benefit is that the failure percentage from Splunk appears directly in the PagerDuty alert, so we know what to expect before we click through from PagerDuty into Splunk.
The final query powering our alert also ties in other relevant transaction logs by payeeId, avoids alerting when the volume of events is too low, and surfaces the reason for each log in the results table:
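A sketch of the shape such a query might take follows; the per-payeeId grouping, the minimum-volume threshold of 100 events, and the output columns are illustrative assumptions rather than our actual production query:

```
index=example (PAY logger=com.enterprise.payments)
| eval failureRate=if(match(payStatus, "FAILED"), 100, 0)
| stats count as totalEvents, avg(failureRate) as percentFail, values(reason) as reasons by payeeId
| where totalEvents > 100 AND percentFail > 5
| table payeeId, totalEvents, percentFail, reasons
```

Grouping with stats by payeeId correlates each payee's events in one row, the totalEvents guard suppresses alerts in low-traffic windows where a single failure would skew the percentage, and values(reason) collects the distinct failure reasons into the results table.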