Monitoring/Observability

#o11y #principles #development

Complete coverage of all production systems.

No system should be active in production (i.e. providing a service to a customer or user) without being monitored.

All monitoring/logging is public, so that everyone in Coolblue has visibility of the vitality of the system and Tech Services can monitor specific aspects without exposing sensitive data.

Monitoring means tracking errors in critical workflows, health of critical dependencies and service KPIs.

Observability

Each application we build has to be observable. That means we need to know when something is wrong, and we need to be able to determine why this is.

To be able to tell when something is going wrong with our solutions we actively monitor them and we put in place alerts for our service level objectives.

To be able to find out why things are going wrong we make sure we have the logs to do so, combined with our monitoring data when needed.

This monitoring and logging principle describes two parts:

The first is monitoring: the practice that describes methods for gaining insight into our applications and stacks.

The second is logging: the practice that describes methods for registering log events, which give insight into the complexity of our applications and stacks.

Monitoring and Alerts

You and your team actively monitor your applications. First, you determine which metrics are relevant to measure for your application. Then you create dashboards and define alerts and service level objectives. Dashboards give insight into the recent and/or current state of your application; you use them to see at a glance what is happening. Alerts, with or without a Service Level Objective (SLO) tag, notify you when thresholds you have set are reached. SLO Alerts are also acted upon by our Tech Services department.
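As a rough illustration of the threshold idea described above, the sketch below checks a set of alert definitions against current metric values. The `Alert` class, the metric names, and the `evaluate` and `firing_alerts` helpers are all hypothetical and are not taken from our observability platform; in practice, alerts are defined in the platform itself.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    """Hypothetical alert definition: a threshold on a single metric."""
    title: str
    metric: str
    threshold: float
    is_slo: bool = False  # True when the alert carries an SLO tag


def evaluate(alert, metrics):
    """Return True when the metric value meets or exceeds the threshold."""
    value = metrics.get(alert.metric)
    if value is None:
        # Assumption: a missing metric does not fire this alert.
        # A real setup may prefer a separate "no data" alert.
        return False
    return value >= alert.threshold


def firing_alerts(alerts, metrics):
    """Return the alerts whose thresholds are currently met."""
    return [alert for alert in alerts if evaluate(alert, metrics)]


if __name__ == "__main__":
    alerts = [
        Alert("Checkout error rate", "checkout.error_rate", 0.05, is_slo=True),
        Alert("Queue depth", "orders.queue_depth", 1000),
    ]
    current = {"checkout.error_rate": 0.08, "orders.queue_depth": 120}
    for alert in firing_alerts(alerts, current):
        kind = "SLO alert" if alert.is_slo else "alert"
        print(f"{kind} firing: {alert.title}")
```

In this sketch only alerts with `is_slo=True` would be the ones Tech Services acts upon; the others stay with the owning team.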

Logging

Applications are hard and complex to write and manage. The problems we are solving, the abstractions we create, and the implementations we choose are all part of the complexity we are building. To shed light on that complexity, we use the practice of logging.

Logging gives us additional insight into the operations executed in the application, which helps us understand the sequence of events that might have led to a certain outcome (error or otherwise). It is an investigative tool that, if exercised correctly, can help piece together the application behaviour leading up to the outcome, potentially giving developers new insights into the emergent behaviour of their systems.
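One way to make a sequence of events easy to piece together afterwards is to log each step as a structured event that carries the same correlation identifier. The sketch below uses only the Python standard library; the `JsonFormatter` class, the `build_logger` helper, and the `context` field are illustrative assumptions, not a prescribed Coolblue format.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # "context" is a hypothetical field passed via logging's `extra`.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream=None):
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout")
    ctx = {"correlation_id": "abc-123"}
    log.info("stock reserved", extra={"context": {**ctx, "step": "reserve_stock"}})
    log.error("payment declined", extra={"context": {**ctx, "step": "payment"}})
```

Filtering the resulting events on one `correlation_id` then yields the ordered steps leading up to the outcome.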

Playbooks

To work together with those who help us act on SLO Alerts when they fire, even outside your own working day, we agree to have playbooks in place. These playbooks contain information on the SLO Alert itself and the potential underlying issue, and they should direct the reader towards actions that help resolve the issue. We have a template available for writing these playbooks. Please make sure the playbook can be found via the SLO Alert: make sure the SLO Alert title in our observability platform matches the SLO field in the playbook. You can also add a link to the playbook page in the Slack alert for easy access.
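The matching rule between an SLO Alert title and the SLO field of a playbook can be checked mechanically. The sketch below assumes playbooks are available as simple dictionaries with hypothetical `slo` and `url` keys; how alert titles and playbooks are actually exported is not specified here.

```python
def find_playbook(alert_title, playbooks):
    """Return the playbook whose SLO field equals the alert title, else None."""
    for playbook in playbooks:
        if playbook.get("slo") == alert_title:
            return playbook
    return None


def alerts_without_playbook(alert_titles, playbooks):
    """List the SLO Alert titles that have no matching playbook."""
    return [
        title for title in alert_titles
        if find_playbook(title, playbooks) is None
    ]


if __name__ == "__main__":
    # Hypothetical example data.
    playbooks = [
        {"slo": "Checkout availability", "url": "https://example.invalid/pb/1"},
    ]
    titles = ["Checkout availability", "Search latency"]
    print(alerts_without_playbook(titles, playbooks))
```

The comparison is deliberately an exact string match, so a renamed alert shows up as missing a playbook until the SLO field is updated.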

PII Data and Sensitive Data

We monitor and we log without exposing sensitive company data or PII data on our customers.

Definition of PII Data: Personally identifiable information (PII) is any data that can be used to identify a specific individual. Social Security numbers, mailing or email addresses, and phone numbers have most commonly been considered PII, but technology has expanded the scope of PII considerably. It can include an IP address, login IDs, social media posts, or digital images. Geolocation and biometric data are also classified as PII.
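A minimal sketch of keeping two of the PII types listed above (email addresses and IP addresses) out of log output, using a standard-library logging filter. The `redact` function, the `RedactPiiFilter` class, and the regular expressions are illustrative assumptions; they are deliberately simple, cover only IPv4, and will not catch every form of PII, so they do not replace avoiding PII at the call site.

```python
import logging
import re

# Simplified patterns for illustration; real-world matching needs more care.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def redact(text):
    """Replace email addresses and IPv4 addresses with placeholders."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return IPV4.sub("[REDACTED_IP]", text)


class RedactPiiFilter(logging.Filter):
    """Redact the fully formatted message before a handler emits it."""

    def filter(self, record):
        record.msg = redact(record.getMessage())
        record.args = ()  # arguments are already merged into msg
        return True


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.addFilter(RedactPiiFilter())
    log = logging.getLogger("accounts")
    log.setLevel(logging.INFO)
    log.addHandler(handler)
    log.info("login by %s from %s", "jane@example.com", "10.0.0.1")
```

Attaching the filter to the handler rather than to a single logger means every record passing through that handler is redacted, including records from child loggers.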