Elufioye Gboyega michael

What is Observability?


Observability is the ability to understand a system's internal state and behavior by examining its external outputs, such as metrics, logs, and traces.

It can be likened to a doctor's ability to diagnose a patient based on the patient's complaints and symptoms.


Observability and the Doctor Analogy

  1. The Patient's Complaints = External Outputs

    Just as a patient describes symptoms, a system emits metrics, logs, and traces as its external outputs.

  2. The Doctor's Diagnosis = Understanding the System's Internal State

    A doctor uses the patient's symptoms to diagnose the illness. In the same way, observability tools analyze a system's outputs to reveal its internal state and behavior.

  3. Medical Tests = Observability Tools

    A doctor may use X-rays, scans, and other tests to gather more data about the patient's underlying illness, just as we use monitoring tools, log aggregators, tracing, and visualization systems to gain more detailed insight into a system.


Why Observability Matters

A system's internal state determines its behavior. Observability gives us insight into that internal state, so we can explain why the system behaves the way it does.


The Three Pillars of Observability

1. Metrics

Figure: System metrics in a Grafana dashboard, showing several measurable data points.

Metrics are the data points used to measure a system's performance and resource usage over time.

  • Examples: CPU usage, memory utilization, and request latency.

Purpose: Metrics reflect the health and performance of our system. Tracking them over time points us toward the source of performance-related issues.
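
To make the idea of a metric concrete, here is a minimal sketch of recording a request count and request latency in plain Python. The `MetricsRegistry` class, the `timed_request` helper, and the metric names are all hypothetical and exist only for illustration. A real setup would normally use a metrics client library and a backend rather than an in-memory dictionary.

```python
import statistics
import time
from collections import defaultdict


class MetricsRegistry:
    """Tiny in-memory stand-in for a metrics backend (hypothetical)."""

    def __init__(self):
        self.counters = defaultdict(int)   # values that only go up, e.g. request counts
        self.samples = defaultdict(list)   # raw observations, e.g. latencies in seconds

    def increment(self, name, amount=1):
        self.counters[name] += amount

    def observe(self, name, value):
        self.samples[name].append(value)

    def summary(self, name):
        values = self.samples[name]
        return {
            "count": len(values),
            "mean": statistics.mean(values),
            "max": max(values),
        }


def timed_request(registry, handler):
    """Run handler and record one request plus how long it took."""
    start = time.perf_counter()
    try:
        return handler()
    finally:
        registry.increment("requests_total")
        registry.observe("request_latency_seconds", time.perf_counter() - start)


registry = MetricsRegistry()
for _ in range(3):
    # The lambda stands in for real request-handling work.
    timed_request(registry, lambda: sum(range(10_000)))

print(registry.counters["requests_total"])
print(registry.summary("request_latency_seconds"))
```

Plotting values like these over time is what turns them into the kind of dashboard a tool such as Grafana displays.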

2. Logs

Logs are a system's event record: detailed, time-stamped entries describing what occurred in the system.

Examples:

  • "User X cannot add a project on trackmention.com at 10:12:15."
  • "trackmention.com database connection timeout at 15:05:17."

Purpose: These logs provide information about what happened in the system at a specific time, helping pinpoint issues.
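
As a rough illustration, time-stamped log entries like the examples above could be produced with Python's standard `logging` module. The logger name `trackmention` and the `add_project` function are assumptions made for this sketch, not part of any real codebase. The output goes to an in-memory buffer only to keep the example self-contained.

```python
import io
import logging

# Log to an in-memory buffer so the example is self-contained.
# A real service would log to stdout or a file that a log aggregator collects.
log_buffer = io.StringIO()
handler = logging.StreamHandler(log_buffer)
# %(asctime)s adds the timestamp that makes each entry a time-stamped record.
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("trackmention")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep output out of the root logger


def add_project(user, can_add):
    """Hypothetical operation that logs its outcome."""
    if not can_add:
        logger.error("user=%s cannot add a project", user)
        return False
    logger.info("user=%s added a project", user)
    return True


add_project("X", can_add=False)
logger.warning("database connection timeout")

log_lines = log_buffer.getvalue().splitlines()
for line in log_lines:
    print(line)
```

Each printed line starts with a timestamp and a severity level, which is what lets us line up an event with the moment a problem was reported.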

3. Traces

Figure: Traces of system requests with the timestamp of each occurrence.

Figure: End-to-end trace of a request's journey, showing how long it took at each point.

Traces provide the end-to-end record of system requests. For example:

An endpoint that stores user registration data has 3 layers: the handler, the controller/service, and the store layer. Tracing gives us a record of how the request:

  • Hits the route/handler layer.
  • Gets passed to the controller/service layer.
  • Finally reaches the store layer where it is saved in the database.

Purpose: The tracing record provides insights such as how long each layer takes to process the request, so we can identify which part of our system is the performance bottleneck.


Tools for Observability

  • Metrics: Prometheus, Datadog, CloudWatch
  • Logs: Loki, ELK stack (Elasticsearch, Logstash, Kibana)
  • Traces: Jaeger, OpenTelemetry

Each tool has its unique strengths. For instance:

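  • Prometheus excels at collecting time-series metrics by scraping targets at regular intervals.
  • The ELK stack is well suited to centralized log management.
  • Jaeger specializes in distributed tracing, while OpenTelemetry provides vendor-neutral instrumentation for traces as well as metrics and logs.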

Summary

Observability is essential for understanding the internal workings of a system. By using metrics, logs, and traces along with tools like Prometheus, the ELK stack, and OpenTelemetry, we can diagnose system issues effectively, just as a doctor uses symptoms and tests to diagnose a patient.
