DEV Community

Rahul Joshi

Day 29 — 🔭 Monitoring & Observability Part Two

In Part 1, we covered:

  • Observability Fundamentals
  • Monitoring
  • Metrics
  • Prometheus
  • Grafana
  • Alerting

Monitoring tells us:

Something is wrong

But monitoring alone cannot answer:

Why is it wrong?
Which service failed?
Which request caused the problem?

This is where the remaining pillars of observability become critical:

Logging
     +
Tracing

Together they help engineers perform:

  • Root Cause Analysis
  • Incident Investigation
  • Distributed System Debugging
  • Performance Optimization


The Three Pillars Revisited

Observability consists of:

Metrics
Logs
Traces

Metrics answer:

What is happening?

Logs answer:

What happened?

Traces answer:

Why did it happen?

Why Monitoring Alone Is Not Enough

Example:

Prometheus Alert:

API Error Rate = 30%

Monitoring tells us:

Problem Exists

But not:

Which API?
Which User?
Which Request?
Which Database Query?

Logs and traces provide those answers.
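To make the gap concrete, here is a minimal Python sketch using made-up request records (the endpoints and status codes are illustrative assumptions, not data from a real system). The aggregate number matches the 30% alert, but only the per-request records show which API is failing.

```python
from collections import Counter

# Hypothetical request log records: (endpoint, HTTP status code).
requests = [
    ("/login", 200), ("/login", 200), ("/cart", 200), ("/cart", 200),
    ("/payments", 500), ("/payments", 500), ("/payments", 500),
    ("/search", 200), ("/search", 200), ("/search", 200),
]

# What a metric sees: one aggregate number.
errors = [endpoint for endpoint, status in requests if status >= 500]
error_rate = len(errors) / len(requests)

# What the logs add: the same errors broken down by endpoint.
errors_by_endpoint = Counter(errors)

print(f"error rate: {error_rate:.0%}")
print(f"errors by endpoint: {dict(errors_by_endpoint)}")
```

In this toy data set every error comes from one endpoint, which is exactly the detail the alert alone cannot show.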


What is Logging?

Logging is the process of recording events generated by applications, operating systems, and infrastructure.

Examples:

User Login Success
Payment Processed
Database Connection Failed
Pod Restarted
API Timeout

Logs are detailed records of system behavior.


Why Logging Matters

Imagine an application crash.

Monitoring shows:

CPU = Normal
Memory = Normal
Error Rate = High

But logs reveal:

Database Authentication Failed

Root cause found.


Types of Logs

Modern environments generate multiple log types.


Application Logs

Generated by application code.

Example:

{
  "timestamp":"2026-01-01T10:00:00Z",
  "service":"payment-api",
  "level":"ERROR",
  "message":"Payment processing failed"
}
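One way an application might emit logs in this shape is sketched below with Python's standard `logging` module. The `JsonFormatter` class is a hypothetical helper written for this example, not part of any library, and the field names simply mirror the JSON above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        })


handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter(service="payment-api"))

logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output via the root logger

logger.error("Payment processing failed")
```

Writing one JSON object per line keeps the logs easy for a collector to parse later without custom patterns.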

System Logs

Generated by operating systems.

Examples:

Kernel Events
Service Start
Authentication Events
System Reboots

Container Logs

Generated by containers.

Example:

kubectl logs pod-name

Kubernetes Logs

Generated by Kubernetes components.

Examples:

Kubelet
API Server
Scheduler
Controller Manager

Security Logs

Examples:

Failed Login Attempts
Privilege Escalation
Unauthorized Access

These are especially important for Security Operations Center (SOC) teams.


Challenges with Logging

Modern environments generate huge volumes of logs.

Example:

100 Microservices
      ↓
10 Pods Each
      ↓
Millions of Log Lines

Problems:

  • Storage
  • Search
  • Correlation
  • Cost

This is why centralized logging exists.
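A quick back-of-the-envelope calculation shows how fast the volume grows. The service and pod counts come from the example above; the per-pod log rate and the average line size are assumptions chosen only for illustration.

```python
# Back-of-the-envelope log volume estimate.
# The service and pod counts come from the example above; the per-pod
# log rate and average line size are illustrative assumptions.
services = 100
pods_per_service = 10
lines_per_second_per_pod = 10   # assumption
bytes_per_line = 200            # assumption

pods = services * pods_per_service
lines_per_day = pods * lines_per_second_per_pod * 86_400
gigabytes_per_day = lines_per_day * bytes_per_line / 1e9

print(f"{pods} pods")
print(f"{lines_per_day:,} log lines per day")
print(f"{gigabytes_per_day:.1f} GB per day (uncompressed)")
```

Under these assumptions the cluster produces hundreds of millions of lines per day, which is why storage, search, and cost all become problems at once.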


What is Centralized Logging?

Instead of:

Application A Logs
Application B Logs
Application C Logs

stored separately,

we collect everything into a central platform.

Applications
      ↓
Log Collector
      ↓
Central Storage
      ↓
Search & Analysis
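The flow above can be imitated in a few lines of Python. This is only a toy model: the three per-application lists and their contents are hypothetical, and a real platform would replace the in-memory list with a collector and a storage backend.

```python
import heapq

# Hypothetical per-application logs, each already sorted by timestamp.
app_a = [("2026-01-01T10:00:01Z", "app-a", "User login success")]
app_b = [("2026-01-01T10:00:00Z", "app-b", "Payment processed"),
         ("2026-01-01T10:00:03Z", "app-b", "API timeout")]
app_c = [("2026-01-01T10:00:02Z", "app-c", "Database connection failed")]

# "Central storage": one time-ordered stream built from all sources.
central = list(heapq.merge(app_a, app_b, app_c))

# "Search & analysis": a single query now covers every application.
matches = [entry for entry in central
           if "failed" in entry[2] or "timeout" in entry[2]]

for timestamp, app, message in matches:
    print(timestamp, app, message)
```

Once everything sits in one ordered stream, a failure in one application can be read next to what the others were doing at the same moment.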


Popular Logging Platforms

Commonly used platforms include:

ELK Stack
EFK Stack
Loki
Splunk
Datadog

Understanding ELK Stack

ELK stands for:

Elasticsearch
Logstash
Kibana

One of the most popular logging solutions.


ELK Architecture

Applications
      ↓
Logstash
      ↓
Elasticsearch
      ↓
Kibana

Elasticsearch

Stores logs.

Think of it as:

Searchable Log Database

Capabilities:

  • Full-text search
  • Indexing
  • Analytics
  • Aggregation

Logstash

Processes logs.

Responsibilities:

Collect
Transform
Parse
Enrich
Forward

Example:

Raw Log
     ↓
Structured JSON
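The raw-to-structured step can be sketched in Python as follows. This is not Logstash configuration; it only illustrates the idea of parsing and enriching. The raw line format, the `parse_line` helper, and the `environment` field are all assumptions made for this example.

```python
import json
import re

# Assumed raw format: "<timestamp> <LEVEL> [<service>] <message>".
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<level>[A-Z]+) \[(?P<service>[^\]]+)\] (?P<message>.*)"
)


def parse_line(raw: str, environment: str = "dev") -> dict:
    """Parse one raw log line and enrich it with an extra field."""
    match = LINE_PATTERN.fullmatch(raw.strip())
    if match is None:
        # Keep unparseable lines instead of dropping them.
        return {"message": raw.strip(), "parse_error": True}
    event = match.groupdict()
    event["environment"] = environment  # enrichment step
    return event


raw = "2026-01-01T10:00:00Z ERROR [payment-api] Payment processing failed"
print(json.dumps(parse_line(raw), indent=2))
```

Lines that do not match the pattern are kept and flagged rather than discarded, so a format change in one service does not silently lose logs.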

Kibana

Visualization layer.

Provides:

  • Dashboards
  • Search
  • Analytics
  • Visualizations

Example ELK Workflow

Application Log
      ↓
Logstash
      ↓
Elasticsearch
      ↓
Kibana Dashboard

What is EFK Stack?

EFK is a variant of ELK that is common on Kubernetes.

EFK:

Elasticsearch
Fluentd or Fluent Bit
Kibana

The F originally stood for Fluentd. Fluent Bit, its lighter-weight sibling, is now a common choice, and either one replaces Logstash.


Why Fluent Bit?

Benefits:

Lightweight
Fast
Kubernetes Native
Lower Resource Usage

EFK Architecture

Pods
 ↓
Fluent Bit
 ↓
Elasticsearch
 ↓
Kibana

What is Grafana Loki?

Loki is a modern log aggregation system developed by Grafana Labs.

Designed specifically for cloud-native environments.


Why Loki Became Popular

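ELK is powerful but can be expensive to run, because Elasticsearch builds a full-text index over every log line. Loki indexes only a small set of labels per log stream and stores the log content in compressed chunks.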

Loki offers:

Simpler Architecture
Lower Storage Cost
Grafana Integration
Kubernetes Friendly
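Much of the storage saving comes from indexing only a few labels per log stream instead of every word in every line. The Python sketch below is a toy model of that idea, not Loki's implementation; the `streams` data and the `query` helper are hypothetical.

```python
# Toy model of label-based log storage: streams are keyed by a small
# label set, and the log lines themselves are not indexed.
streams = {
    (("app", "payment-api"), ("namespace", "prod")): [
        "payment accepted order=1001",
        "error: card declined order=1002",
    ],
    (("app", "frontend"), ("namespace", "prod")): [
        "GET /checkout 200",
        "error: upstream timeout",
    ],
}


def query(selector: dict, contains: str) -> list:
    """Select streams by label, then scan only those streams' lines."""
    results = []
    for labels, lines in streams.items():
        if selector.items() <= dict(labels).items():
            results.extend(line for line in lines if contains in line)
    return results


# Roughly what the LogQL query {app="payment-api"} |= "error" expresses.
print(query({"app": "payment-api"}, "error"))
```

The trade-off is that text filters scan the selected streams at query time, so narrow label selectors matter for query speed.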

Loki Architecture

Applications
      ↓
Promtail
      ↓
Loki
      ↓
Grafana

Promtail

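Log collector. Grafana Labs has since moved Promtail into maintenance and recommends Grafana Alloy for new setups.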

Collects:

Container Logs
Pod Logs
System Logs

and sends them to Loki.


Loki Advantages

Benefits:

Low Cost
Easy Deployment
Cloud Native
Grafana Native


What is Distributed Tracing?

Logs tell us:

What happened?

But not:

Where did latency occur?

Tracing solves this problem.


Why Tracing Exists

In microservices:

User Request
      ↓
Frontend
      ↓
API Gateway
      ↓
Service A
      ↓
Service B
      ↓
Database

One slow service impacts the entire request.

Tracing helps locate it.


What is a Trace?

A trace represents a complete request journey.

Example:

Request Start
      ↓
Service A
      ↓
Service B
      ↓
Database
      ↓
Response

What is a Span?

A trace contains multiple spans.

Example:

Trace
 ├─ API Call
 ├─ Database Query
 ├─ Cache Lookup
 └─ External API Call

Each span records one operation, with its start time and duration, and every span in a trace carries the same trace ID.
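The relationship between a trace and its spans can be sketched with the Python standard library alone. This is a simplified model, not the OpenTelemetry API: the `Span` class and `span` helper are hypothetical, and the parent is tracked by name here, whereas real tracers use span IDs.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    name: str
    parent: Optional[str] = None
    duration_ms: float = 0.0


finished_spans = []


@contextmanager
def span(trace_id: str, name: str, parent: Optional[str] = None):
    """Time the wrapped block and record it as one span."""
    current = Span(trace_id=trace_id, name=name, parent=parent)
    start = time.perf_counter()
    try:
        yield current
    finally:
        current.duration_ms = (time.perf_counter() - start) * 1000
        finished_spans.append(current)


trace_id = uuid.uuid4().hex  # shared by every span in this trace

with span(trace_id, "API Call") as root:
    with span(trace_id, "Cache Lookup", parent=root.name):
        time.sleep(0.01)  # stand-in for real work
    with span(trace_id, "Database Query", parent=root.name):
        time.sleep(0.02)

for s in finished_spans:
    print(f"{s.name:<15} parent={s.parent} {s.duration_ms:.1f} ms")
```

Child spans finish before their parent, so the root span is recorded last and its duration covers both children.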



What is OpenTelemetry?

OpenTelemetry (OTel) is a vendor-neutral, open-source standard for observability instrumentation.

It is a CNCF project.

Provides:

Metrics
Logs
Traces

through one framework.


Why OpenTelemetry Matters

Before OpenTelemetry:

Vendor Specific Agents

After OpenTelemetry:

Single Standard

for observability.


OpenTelemetry Components


SDK

Embedded in applications.

Collects:

Metrics
Logs
Traces

Collector

Receives telemetry.

Processes:

Filter
Transform
Route
Export
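The four stages can be pictured as a small pipeline. The Python below is a conceptual sketch only and does not use the real OpenTelemetry Collector, which is configured in YAML; the record fields, the `ROUTES` table, and `run_pipeline` are hypothetical names invented for this example.

```python
# Toy pipeline mirroring the four stages above. It is a conceptual
# sketch only and does not use the real OpenTelemetry Collector API.
records = [
    {"signal": "trace", "service": "payment-api", "duration_ms": 300},
    {"signal": "log", "service": "payment-api", "level": "DEBUG"},
    {"signal": "log", "service": "payment-api", "level": "ERROR"},
    {"signal": "metric", "service": "frontend", "name": "http_requests"},
]

# Hypothetical routing table: signal type -> backend name.
ROUTES = {"trace": "jaeger", "log": "loki", "metric": "prometheus"}


def run_pipeline(batch: list) -> dict:
    exported = {backend: [] for backend in ROUTES.values()}
    for record in batch:
        if record.get("level") == "DEBUG":            # filter
            continue
        enriched = {**record, "cluster": "dev"}       # transform
        backend = ROUTES[enriched["signal"]]          # route
        exported[backend].append(enriched)            # export
    return exported


for backend, items in run_pipeline(records).items():
    print(backend, len(items))
```

Keeping this logic in a collector rather than in each application means the backends can change without touching application code.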

Exporters

Send data to:

Prometheus
Jaeger
Loki
Elastic
Datadog


What is Jaeger?

Jaeger is an open-source distributed tracing platform.

Originally developed by Uber.

It is now a graduated CNCF project.


Jaeger Architecture

Application
      ↓
OTel Collector
      ↓
Jaeger
      ↓
UI

Jaeger Features

Provides:

Trace Visualization
Dependency Mapping
Latency Analysis
Performance Troubleshooting

Example Trace

User Request
      ↓ 50ms
Frontend

      ↓ 20ms
API Gateway

      ↓ 300ms
Payment Service

      ↓ 10ms
Database

Problem identified:

Payment Service
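The same conclusion can be reached programmatically. This small Python sketch reuses the latencies from the example trace, reading each number as the time attributed to the service below its arrow (an interpretation assumed here).

```python
# Per-hop latencies from the example trace above, in milliseconds.
hops = {
    "Frontend": 50,
    "API Gateway": 20,
    "Payment Service": 300,
    "Database": 10,
}

total_ms = sum(hops.values())
slowest = max(hops, key=hops.get)
share = hops[slowest] / total_ms

print(f"total: {total_ms} ms")
print(f"slowest hop: {slowest} ({hops[slowest]} ms, {share:.0%} of total)")
```

A tracing UI such as Jaeger shows this comparison visually on a timeline, so the dominant span stands out without any calculation.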

Installing Loki in Development

Add repository:

helm repo add grafana \
https://grafana.github.io/helm-charts
helm repo update

Install:

helm install loki grafana/loki-stack

Installing Jaeger in Development

# 16686 = Jaeger UI, 4317 = OTLP over gRPC
docker run -d \
--name jaeger \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one

Access:

http://localhost:16686

Installing Loki in Pre-Prod Kubernetes

helm install loki \
grafana/loki-stack \
-n observability \
--create-namespace

Installing Jaeger in Kubernetes

helm repo add jaegertracing \
https://jaegertracing.github.io/helm-charts

Install:

helm install jaeger \
jaegertracing/jaeger \
-n observability \
--create-namespace

Modern Kubernetes Observability Stack

Prometheus
+
Grafana
+
Loki
+
Jaeger
+
OpenTelemetry

This combination is currently one of the most popular open-source stacks for cloud-native observability.


Final Thoughts

Observability is much more than monitoring.

Monitoring tells you:

Something is wrong

Logging tells you:

What happened

Tracing tells you:

Why it happened

Modern cloud-native platforms achieve full observability by combining:

Prometheus
+
Grafana
+
Loki
+
Jaeger
+
OpenTelemetry

Together these tools provide:

Metrics
Logs
Traces
Root Cause Analysis
Performance Optimization
Incident Response

which are essential for operating reliable Kubernetes and microservices platforms at scale.
