Ertan Felek

Understanding System Behavior with Observability in Distributed Systems

Why observability is more than collecting logs—and how OpenTelemetry, Grafana, Prometheus, Loki, and Tempo help you truly see your system.


Introduction

Imagine you’re managing an EV charging platform. Drivers tap “Start Charging,” but some sessions take 20 seconds longer than usual. Nothing’s broken, but something feels off. Where do you look first?

In today’s cloud-native and microservice-heavy systems, performance issues rarely have a single cause. Traditional monitoring—setting CPU alerts or error thresholds—only tells you that something’s wrong. Observability tells you why.

By combining logs, metrics, and traces, and using the OpenTelemetry ecosystem, you can uncover how your system actually behaves—even when you don’t know what you’re looking for.


Why Observability Matters

Observability is the ability to understand what’s happening inside your system based on the data it emits. It’s about turning signals into insight, not just collecting them.

In a distributed world full of “unknown unknowns,” you can’t predefine every alert. Observability lets you ask new questions on the fly—discovering issues you didn’t anticipate.

Goal: Stop reacting to alerts. Start understanding behavior.


The Three Pillars of Observability

Signal    What It Tells You                     Example
Logs      What happened (events & messages)     "PaymentService: Timeout calling Billing API"
Metrics   How often or how much                 "p95 latency increased by 40%"
Traces    How components interacted             "API → Kafka → Billing → DB (8s delay in Billing)"

Used together, these three signals form a feedback loop:

  • Metrics show symptoms (e.g., latency spikes).
  • Traces reveal where the delay happens.
  • Logs explain why it happened.

That’s the difference between monitoring and understanding.


The “Unknown Unknowns”

Monitoring handles known problems—“alert me when CPU > 80%.”

Observability helps with unknown unknowns—the subtle bugs, race conditions, or misconfigurations you couldn’t predict.

With rich telemetry, you can ask:

  • Why are requests slow only in one region?
  • Why did latency spike even though error rates look normal?

In other words, you can investigate, not guess.


OpenTelemetry: The Universal Language of Observability

Instead of wiring every library to a different monitoring tool, OpenTelemetry (OTel) provides one standard for emitting telemetry data. It’s language-agnostic, vendor-neutral, and built by the CNCF community.

In Go (Golang), OTel is lightweight and flexible:

import (
  "context"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/trace"
)

// tracer creates spans under the instrumentation name "charging-service".
var tracer trace.Tracer = otel.Tracer("charging-service")

func StartCharging(ctx context.Context) {
  // Start a span; the returned ctx carries it into downstream calls.
  ctx, span := tracer.Start(ctx, "StartCharging")
  defer span.End()
  // business logic, passing ctx to downstream calls...
}

That’s all it takes to begin tracing across your microservices.
With propagators and instrumentation libraries configured, OTel handles context propagation and span correlation for you, so your traces don't break across API calls or Kafka messages.
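
For hops that instrumentation libraries don't cover (say, a hand-rolled message producer), you can propagate the context yourself. Here is a minimal sketch, assuming the W3C TraceContext propagator is registered; the header map and function names are illustrative stand-ins for real Kafka headers:

import (
  "context"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/propagation"
)

func init() {
  // Register the W3C Trace Context propagator (it is not set by default).
  otel.SetTextMapPropagator(propagation.TraceContext{})
}

// PublishSessionEvent injects the current trace context into a header map.
// With an active span in ctx, the carrier gains a "traceparent" entry that
// you would attach to the outgoing Kafka message.
func PublishSessionEvent(ctx context.Context) map[string]string {
  carrier := propagation.MapCarrier{}
  otel.GetTextMapPropagator().Inject(ctx, carrier)
  return carrier
}

// ConsumeSessionEvent restores the trace context on the consumer side, so
// the next span it starts joins the same trace.
func ConsumeSessionEvent(headers map[string]string) context.Context {
  return otel.GetTextMapPropagator().Extract(context.Background(), propagation.MapCarrier(headers))
}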

A Minimal Observability Stack

A full observability setup doesn’t have to be complex.

One of the most popular open-source stacks combines:

  • Prometheus → Collects and stores metrics
  • Loki → Gathers logs efficiently (no heavy indexing)
  • Tempo → Stores distributed traces
  • Grafana → Visualizes everything together

These tools speak OpenTelemetry natively.

Your app sends telemetry via the OTel Collector, which routes:

  • Logs → Loki
  • Metrics → Prometheus
  • Traces → Tempo

Grafana becomes your “single pane of glass” for exploring data — metrics on top, traces below, logs one click away.

Diagram placeholder:

“Minimal OpenTelemetry Stack”

Application → OTel Collector → (Loki, Prometheus, Tempo) → Grafana
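
On the application side, pointing the Go SDK at the Collector takes only a few lines. A minimal sketch, assuming the Collector listens on the default OTLP/gRPC endpoint localhost:4317; the endpoint and service name are placeholders:

package main

import (
  "context"
  "log"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/attribute"
  "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
  "go.opentelemetry.io/otel/sdk/resource"
  sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
  ctx := context.Background()

  // Export spans over OTLP/gRPC to the local Collector.
  exp, err := otlptracegrpc.New(ctx,
    otlptracegrpc.WithEndpoint("localhost:4317"),
    otlptracegrpc.WithInsecure(),
  )
  if err != nil {
    log.Fatal(err)
  }

  // Batch spans and label them with the service name shown in Grafana/Tempo.
  tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exp),
    sdktrace.WithResource(resource.NewSchemaless(
      attribute.String("service.name", "charging-service"),
    )),
  )
  defer tp.Shutdown(ctx)

  // Make it the global provider so otel.Tracer(...) picks it up.
  otel.SetTracerProvider(tp)

  // ... run your service ...
}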


From Logs to Root Cause: A Real-World Flow

Let’s revisit the EV charging delay scenario:

  1. Metrics show latency increased for /start-charging.
  2. Traces reveal the request slowed in the Billing Service.
  3. Logs for that trace ID show repeated DB retries (the sketch below shows how the ID lands in the logs).

Root cause? A cold cache in the billing database.

Without observability, you’d be guessing for hours.

With it, you know exactly where and why.
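
The jump from trace to logs in step 3 only works because each service writes the trace ID into its log lines. A minimal sketch using Go's standard slog package (the helper name and message are illustrative):

import (
  "context"
  "log/slog"

  "go.opentelemetry.io/otel/trace"
)

// logWithTrace emits a structured log line tagged with the current trace and
// span IDs, so Loki queries can pivot from a Tempo trace to matching logs.
func logWithTrace(ctx context.Context, msg string) {
  sc := trace.SpanContextFromContext(ctx)
  slog.InfoContext(ctx, msg,
    slog.String("trace_id", sc.TraceID().String()),
    slog.String("span_id", sc.SpanID().String()),
  )
}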


Best Practices

  • Keep it lightweight: instrument what matters—business-critical paths, APIs, and message flows.
  • Correlate everything: use consistent trace IDs across logs, metrics, and traces.
  • Sample smartly: use tail-based sampling to retain slow or error traces, not every request.
  • Enrich your telemetry: add contextual attributes (e.g., station_id, region) for better filtering (see the sketch below).
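
For the last point, a minimal sketch of attaching such attributes to a span; the attribute keys here are examples, not a fixed schema:

import (
  "go.opentelemetry.io/otel/attribute"
  "go.opentelemetry.io/otel/trace"
)

// annotateSpan tags a span with domain attributes so traces can be filtered
// by station or region in Grafana. Keep these low-cardinality.
func annotateSpan(span trace.Span, stationID, region string) {
  span.SetAttributes(
    attribute.String("station_id", stationID),
    attribute.String("region", region),
  )
}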

⚠️ Avoid pitfalls:

  • Don’t log everything at debug.
  • Don’t tag metrics with high-cardinality labels (like user_id).
  • Don’t forget context propagation across async calls.

Wrapping Up

Observability isn’t about drowning in data—it’s about clarity.

When every service emits meaningful logs, metrics, and traces, you can see your system as a living, connected whole.

With OpenTelemetry handling instrumentation and Grafana + Prometheus + Loki + Tempo providing visibility, you’re equipped not just to monitor—but to understand.

From “What went wrong?” to “Why did it happen?” — that’s the power of observability.


