Discussion on: How do you wrap your head around observability?

Michael • Edited

I hadn't thought before about there being a mental model, or about developing one. I guess ours is built around the health of a system and how to check on it, like a doctor with a patient or a driver with a car: you want early signs of things going awry, plus enough detailed evidence to figure out the cause, the why.

Some of what I'll mention below will be obvious, but I hope it explains how we match our need for transparency into errors and reporting with the tools we have.

We did have a team book club around Observability.

In my team, and for the org, we have SLAs around uptime, so we want our software to be up and performant. We want New Relic (and now Datadog, which replaced it) to ping us when something is down or irregular so we can investigate.

A vague error is useless, so we add layers of monitoring and logging. Server-side logs, such as those from pods, the DB, or Apache, go somewhere we can inspect by hand (annoying) or, ideally, view in CloudWatch or Datadog. Then we can search, sort, and build dashboards.
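To make logs searchable rather than just readable, structured (JSON) logging helps a lot. Here's a minimal sketch in Python; the `JsonFormatter` class and the `"orders"` logger name are illustrative, not part of any vendor SDK:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so a backend can index the fields."""
    def format(self, record):
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Extra context makes the log filterable, not just readable.
            **getattr(record, "context", {}),
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The "context" dict rides along on the record via logging's `extra` mechanism.
logger.info("payment failed", extra={"context": {"order_id": "o-123", "status_code": 502}})
```

Because each line is one JSON object, a backend like CloudWatch or Datadog can index fields such as `order_id` and `status_code`, which is what makes the searching, sorting, and dashboards possible.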

Each of the monitoring tools tells us something about the health of the system.

APM and Kubernetes metrics tell us about server-side performance, error rates, and whether servers are restarting.

Synthetic monitoring tells us that a machine can reach our site or endpoint in a timely manner and get a success message. This is typically what we use to page ourselves.
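The idea behind a synthetic check is simple enough to sketch: hit an endpoint on a schedule, time it, and fail on a bad status or a slow response. This is a toy illustration in Python, not how New Relic or Datadog synthetics are actually implemented; `synthetic_check` and `fetch` are hypothetical names, and the fetcher is injected so the sketch stays self-contained:

```python
import time
from dataclasses import dataclass

@dataclass
class CheckResult:
    ok: bool
    latency_ms: float
    detail: str

def synthetic_check(fetch, timeout_ms=2000.0):
    """Run one synthetic probe.

    `fetch` is any callable returning an HTTP status code. We time it and
    flag failure on an exception, a non-2xx status, or a slow response --
    the same conditions that would page us.
    """
    start = time.monotonic()
    try:
        status = fetch()
    except Exception as exc:
        return CheckResult(False, (time.monotonic() - start) * 1000, f"error: {exc}")
    latency_ms = (time.monotonic() - start) * 1000
    if not 200 <= status < 300:
        return CheckResult(False, latency_ms, f"bad status {status}")
    if latency_ms > timeout_ms:
        return CheckResult(False, latency_ms, "too slow")
    return CheckResult(True, latency_ms, "ok")
```

A real synthetic product runs this from several locations on a fixed interval and pages when enough consecutive probes fail.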

RUM and JS errors are one product in New Relic but two in Datadog. We use those to tell us about the experience of all users on the site according to their actual browsers, not according to our test that runs every 5 min. We can see which pages are slow or most visited, which assets are loaded, and whether any page has an unusual number of JS console errors.

Michael

We also need to know when a service appears stable in one way but irregular in another.

For example, we are setting up anomaly detection to tell us when an error rate exceeds an acceptable threshold.

Or when the number of items on a queue or dead-letter queue stays above a certain amount for a period, because that means our system is too slow, or isn't recovering well when it hits errors or low volume.
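Both of those alerts (error rate over a threshold, queue depth too high for a period) follow the same pattern: only page when the breach is sustained, so a single spike doesn't wake anyone up. A rough sketch of that logic, with made-up numbers:

```python
def sustained_breach(samples, threshold, min_consecutive):
    """Return True when the metric stayed above `threshold` for at least
    `min_consecutive` consecutive samples. A one-off spike resets nothing
    permanently, but a sustained breach trips the alert."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False

# A single spike to 120 queued items is ignored...
assert not sustained_breach([10, 120, 15, 12], threshold=100, min_consecutive=3)
# ...but three-plus samples in a row over the threshold would page us.
assert sustained_breach([110, 130, 125, 140, 150], threshold=100, min_consecutive=3)
```

Monitoring products express this as an evaluation window ("above X for Y minutes") rather than raw sample counts, but the intent is the same.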

Oh, that's the other thing: we correlate metrics. We play detective and figure out: did the increase in volume correspond to an increase in error rate? Did one cause the other, or is it just coincidence?
Why is one synthetic test location, or one of the servers, consistently slower than the others? Why does one server get a lower volume of requests than the others, even though they are all weighted equally?
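For that detective work, a Pearson correlation coefficient over two metric series is a quick first check (the numbers below are made up): a value near +1 says the series move together, though it still can't tell you which one caused the other.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series.
    Near +1: they rise together; near -1: one rises as the other falls;
    near 0: no linear relationship. Assumes neither series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-minute samples of request volume and error count.
requests_per_min = [100, 200, 400, 800, 900]
errors_per_min = [1, 3, 9, 20, 24]
r = pearson(requests_per_min, errors_per_min)
# r close to 1.0 suggests errors track volume; correlation still isn't causation.
```

In practice we eyeball this by overlaying the two series on one dashboard graph, but the number is handy when the eyeball test is ambiguous.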