Rahul Joshi

Posted on Jun 9

Day 29 — 🔭 Monitoring & Observability Part Two

#masterclassdevsecops #devops #webdev #cicd

In Part 1, we covered:

Observability Fundamentals
Monitoring
Metrics
Prometheus
Grafana
Alerting

Monitoring tells us:

Something is wrong

But monitoring alone cannot answer:

Why is it wrong?
Which service failed?
Which request caused the problem?

This is where the remaining pillars of observability become critical:

Logging
     +
Tracing

Together they help engineers perform:

Root Cause Analysis
Incident Investigation
Distributed System Debugging
Performance Optimization

🔗 Resources

Support the Journey on GitHub: if you're following along, consider starring and forking the repo: https://github.com/17J/30-Days-Cloud-DevSecOps-Journey

The Three Pillars Revisited

Observability consists of:

Metrics
Logs
Traces

Metrics answer:

What is happening?

Logs answer:

What happened?

Traces answer:

Where in the request path did it happen, and why?

Why Monitoring Alone Is Not Enough

Example:

Prometheus Alert:

API Error Rate = 30%

Monitoring tells us:

Problem Exists

But not:

Which API?
Which User?
Which Request?
Which Database Query?

Logs and traces provide those answers.

What is Logging?

Logging is the process of recording events generated by applications, operating systems, and infrastructure.

Examples:

User Login Success
Payment Processed
Database Connection Failed
Pod Restarted
API Timeout

Logs are detailed records of system behavior.

Why Logging Matters

Imagine an application crash.

Monitoring shows:

CPU = Normal
Memory = Normal
Error Rate = High

But logs reveal:

Database Authentication Failed

Root cause found.

Types of Logs

Modern environments generate multiple log types.

Application Logs

Generated by application code.

Example:

{
  "timestamp":"2026-01-01T10:00:00Z",
  "service":"payment-api",
  "level":"ERROR",
  "message":"Payment processing failed"
}
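A minimal sketch of how an application could emit log lines in this shape, using only Python's standard logging module. The JsonFormatter class and build_logger helper are hypothetical names introduced for illustration; real services often use a logging library or framework integration instead.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # service name supplied by the caller

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to standard error."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("payment-api")
logger.error("Payment processing failed")
```

Keeping every line as one JSON object means a collector can parse fields such as service and level without custom patterns.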

System Logs

Generated by operating systems.

Examples:

Kernel Events
Service Start
Authentication Events
System Reboots

Container Logs

Generated by containers.

Example:

kubectl logs pod-name

Kubernetes Logs

Generated by Kubernetes components.

Examples:

Kubelet
API Server
Scheduler
Controller Manager

Security Logs

Examples:

Failed Login Attempts
Privilege Escalation
Unauthorized Access

These logs are essential for Security Operations Center (SOC) teams, who use them to detect and investigate incidents.

Challenges with Logging

Modern environments generate huge volumes of log data.

Example:

100 Microservices
      ↓
10 Pods Each
      ↓
Millions of Log Lines

Problems:

Storage
Search
Correlation
Cost

This is why centralized logging exists.
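To make the scale concrete, here is a rough back-of-the-envelope estimate. The service and pod counts come from the example above; the per-pod log rate and average line size are assumptions chosen only for illustration, not measured figures.

```python
SERVICES = 100               # from the example above
PODS_PER_SERVICE = 10        # from the example above
LINES_PER_POD_PER_SEC = 5    # assumption: a fairly quiet service
AVG_LINE_BYTES = 200         # assumption: short structured log lines
SECONDS_PER_DAY = 86_400

pods = SERVICES * PODS_PER_SERVICE
lines_per_day = pods * LINES_PER_POD_PER_SEC * SECONDS_PER_DAY
bytes_per_day = lines_per_day * AVG_LINE_BYTES

print(f"pods: {pods}")
print(f"log lines per day: {lines_per_day:,}")
print(f"raw volume per day: {bytes_per_day / 1e9:.1f} GB")
```

Even with these modest assumptions the total is hundreds of millions of lines per day, which is why storage, search, and cost show up in the list of problems.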

What is Centralized Logging?

Instead of:

Application A Logs
Application B Logs
Application C Logs

stored separately,

we collect everything into a central platform.

Applications
      ↓
Log Collector
      ↓
Central Storage
      ↓
Search & Analysis
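The flow above can be sketched as a toy in-memory model. LogEvent, CentralStore, and the request_id field are hypothetical names used only for illustration; a real platform adds durable storage, indexing, and retention on top of this idea.

```python
from dataclasses import dataclass, field


@dataclass
class LogEvent:
    service: str
    request_id: str  # hypothetical correlation field shared across services
    message: str


@dataclass
class CentralStore:
    """Toy stand-in for central storage: one searchable list for all services."""

    events: list[LogEvent] = field(default_factory=list)

    def ingest(self, event: LogEvent) -> None:
        """Play the role of the log collector handing an event to storage."""
        self.events.append(event)

    def search(self, request_id: str) -> list[LogEvent]:
        """Return every event for one request, regardless of source service."""
        return [e for e in self.events if e.request_id == request_id]


store = CentralStore()
store.ingest(LogEvent("frontend", "req-42", "checkout clicked"))
store.ingest(LogEvent("payment-api", "req-42", "payment processing failed"))
store.ingest(LogEvent("payment-api", "req-43", "payment processed"))

for event in store.search("req-42"):
    print(f"{event.service}: {event.message}")
```

Because all events land in one place, a single query follows one request across services, which is the correlation problem that separate per-application logs make hard.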

Popular Logging Platforms

Commonly used platforms include:

ELK Stack
EFK Stack
Loki
Splunk
Datadog

Understanding ELK Stack

ELK stands for:

Elasticsearch
Logstash
Kibana

It is one of the most widely used logging solutions.

ELK Architecture

Applications
      ↓
Logstash
      ↓
Elasticsearch
      ↓
Kibana

Elasticsearch

Stores logs.

Think of it as:

Searchable Log Database

Capabilities:

Full-text search
Indexing
Analytics
Aggregation
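Full-text search over logs typically rests on an inverted index, a map from each word to the log lines that contain it. The sketch below is a deliberately tiny illustration of that idea in plain Python; the InvertedIndex class is hypothetical and is not Elasticsearch's API or implementation.

```python
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    """Map each token to the set of document ids that contain it."""

    def __init__(self) -> None:
        self._docs: list[str] = []
        self._postings: dict[str, set[int]] = defaultdict(set)

    def add(self, text: str) -> int:
        doc_id = len(self._docs)
        self._docs.append(text)
        for token in tokenize(text):
            self._postings[token].add(doc_id)
        return doc_id

    def search(self, query: str) -> list[str]:
        """Return documents that contain every token in the query."""
        tokens = tokenize(query)
        if not tokens:
            return []
        ids = set.intersection(*(self._postings.get(t, set()) for t in tokens))
        return [self._docs[i] for i in sorted(ids)]


index = InvertedIndex()
index.add("Database connection failed")
index.add("Payment processed")
index.add("Database authentication failed")

print(index.search("database failed"))
```

Lookups touch only the entries for the query words instead of scanning every line, at the price of building and storing the index.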

Logstash

Processes logs.

Responsibilities:

Collect
Transform
Parse
Enrich
Forward

Example:

Raw Log
     ↓
Structured JSON
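As a rough analogy for the parse step, the sketch below turns one raw text line into a structured record. The line layout (timestamp, level, service, message) and the parse_line helper are assumptions made for this example; Logstash itself is configured with its own filter plugins rather than Python code.

```python
import json
import re
from typing import Optional

# Assumed raw layout: "<timestamp> <LEVEL> <service> <free-text message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def parse_line(raw: str) -> Optional[dict]:
    """Return the named fields of a raw log line, or None if it does not match."""
    match = LINE_PATTERN.match(raw.strip())
    return match.groupdict() if match else None


raw = "2026-01-01T10:00:00Z ERROR payment-api Payment processing failed"
print(json.dumps(parse_line(raw), indent=2))
```

Returning None for lines that do not match lets the caller decide whether to drop them or forward them unparsed.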

Kibana

Visualization layer.

Provides:

Dashboards
Search
Analytics
Visualizations

Example ELK Workflow

Application Log
      ↓
Logstash
      ↓
Elasticsearch
      ↓
Kibana Dashboard

What is EFK Stack?

A variant of ELK that is common on Kubernetes.

EFK:

Elasticsearch
Fluentd (or Fluent Bit)
Kibana

Fluentd, or its lighter-weight sibling Fluent Bit, replaces Logstash as the log collector.

Why Fluent Bit?

Benefits:

Lightweight
Fast
Kubernetes Native
Lower Resource Usage

EFK Architecture

Pods
 ↓
Fluent Bit
 ↓
Elasticsearch
 ↓
Kibana

What is Grafana Loki?

Loki is a modern log aggregation system developed by Grafana Labs.

Designed specifically for cloud-native environments.

Why Loki Became Popular

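ELK is powerful, but it can be expensive to run at scale because Elasticsearch indexes the full content of every log line. Loki indexes only a small set of labels per log stream and stores the log text itself as compressed chunks.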

Loki offers:

Simpler Architecture
Lower Storage Cost
Grafana Integration
Kubernetes Friendly
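A minimal sketch of the label-first lookup behind the lower storage cost, assuming (as in Loki's design) that only labels are indexed and the log text is scanned at query time. LabelIndexedStore is a hypothetical toy class, not Loki's API.

```python
from collections import defaultdict

Labels = tuple[tuple[str, str], ...]


class LabelIndexedStore:
    """Group log lines into streams keyed by their label set."""

    def __init__(self) -> None:
        self._streams: dict[Labels, list[str]] = defaultdict(list)

    def push(self, labels: dict[str, str], line: str) -> None:
        key = tuple(sorted(labels.items()))  # one stream per unique label set
        self._streams[key].append(line)

    def query(self, selector: dict[str, str], contains: str = "") -> list[str]:
        """Pick streams by label, then scan their lines for a substring."""
        results: list[str] = []
        for key, lines in self._streams.items():
            stream_labels = dict(key)
            if all(stream_labels.get(k) == v for k, v in selector.items()):
                results.extend(line for line in lines if contains in line)
        return results


store = LabelIndexedStore()
store.push({"app": "payment-api", "namespace": "prod"}, "API timeout calling bank")
store.push({"app": "payment-api", "namespace": "prod"}, "Payment processed")
store.push({"app": "frontend", "namespace": "prod"}, "API timeout rendering cart")

print(store.query({"app": "payment-api"}, contains="timeout"))
```

This mirrors the shape of a LogQL query such as {app="payment-api"} |= "timeout": labels narrow the search to a few streams, and only those streams are scanned.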

Loki Architecture

Applications
      ↓
Promtail
      ↓
Loki
      ↓
Grafana

Promtail

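Promtail is the log collection agent for Loki. Grafana Labs has announced its deprecation in favor of Grafana Alloy, so check the current documentation before using it in a new deployment.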

Collects:

Container Logs
Pod Logs
System Logs

and sends them to Loki.
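To show what "sends them to Loki" looks like on the wire, the sketch below builds the JSON body accepted by Loki's push endpoint (POST /loki/api/v1/push): a list of streams, each with a label set and [timestamp in nanoseconds, line] pairs. The build_push_payload helper and the label values are hypothetical, and nothing is sent over the network here. A real agent also handles target discovery, batching, and retries.

```python
import json
import time
from typing import Optional


def build_push_payload(
    labels: dict[str, str], lines: list[str], now_ns: Optional[int] = None
) -> dict:
    """Build a Loki push body for one stream of log lines."""
    start = now_ns if now_ns is not None else time.time_ns()
    # Timestamps are strings of nanoseconds since the Unix epoch;
    # adding the index keeps the lines in their original order.
    values = [[str(start + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": labels, "values": values}]}


payload = build_push_payload(
    {"app": "payment-api", "namespace": "observability-demo"},
    ["Payment processed", "API timeout"],
)
print(json.dumps(payload, indent=2))
```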

Loki Advantages

Benefits:

Low Cost
Easy Deployment
Cloud Native
Grafana Native

What is Distributed Tracing?

Logs tell us:

What happened?

But not:

Where did latency occur?

Tracing solves this problem.

Why Tracing Exists

In microservices:

User Request
      ↓
Frontend
      ↓
API Gateway
      ↓
Service A
      ↓
Service B
      ↓
Database

One slow service impacts the entire request.

Tracing helps locate it.

What is a Trace?

A trace represents a complete request journey.

Example:

Request Start
      ↓
Service A
      ↓
Service B
      ↓
Database
      ↓
Response

What is a Span?

A trace contains multiple spans.

Example:

Trace
 ├─ API Call
 ├─ Database Query
 ├─ Cache Lookup
 └─ External API Call

Each span measures duration.
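A minimal data model makes the relationship concrete. In this sketch every span carries the trace ID it belongs to, its own ID, and its parent's ID. The Span class, the IDs, the timings, and the root span named "request" are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: str             # shared by every span in the same trace
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def children(spans: list[Span], parent_id: str) -> list[Span]:
    """Return the spans whose parent is the given span."""
    return [s for s in spans if s.parent_id == parent_id]


trace = [
    Span("t1", "s1", None, "request", 0, 120),
    Span("t1", "s2", "s1", "API Call", 5, 40),
    Span("t1", "s3", "s1", "Database Query", 40, 90),
    Span("t1", "s4", "s1", "Cache Lookup", 90, 95),
    Span("t1", "s5", "s1", "External API Call", 95, 118),
]

for span in children(trace, "s1"):
    print(f"{span.name}: {span.duration_ms}ms")
```

The parent links are what let a tracing UI draw the tree shown above, and the per-span durations are what it uses to highlight slow steps.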

What is OpenTelemetry?

OpenTelemetry (OTel) is a vendor-neutral, widely adopted standard for observability instrumentation.

It is a Cloud Native Computing Foundation (CNCF) project.

Provides:

Metrics
Logs
Traces

through one framework.

Why OpenTelemetry Matters

Before OpenTelemetry:

Vendor Specific Agents

After OpenTelemetry:

Single Standard

for observability.

OpenTelemetry Components

SDK

Embedded in applications.

Collects:

Metrics
Logs
Traces

Collector

Receives telemetry.

Processes:

Filter
Transform
Route
Export
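The filter, transform, route, and export stages can be pictured as a small pipeline. The sketch below is a toy model in plain Python, not the Collector's actual interface: the real Collector is configured declaratively with receivers, processors, and exporters, and the record fields and function names here are hypothetical.

```python
from typing import Callable, Optional

Record = dict
Processor = Callable[[Record], Optional[Record]]


def drop_debug(record: Record) -> Optional[Record]:
    """Filter: discard DEBUG-level records."""
    return None if record.get("level") == "DEBUG" else record


def add_environment(record: Record) -> Optional[Record]:
    """Transform: attach a hypothetical environment attribute."""
    return {**record, "environment": "dev"}


def run_pipeline(
    records: list[Record],
    processors: list[Processor],
    exporters: dict[str, list[Record]],
) -> None:
    """Run each record through the processors, then route it by signal type."""
    for record in records:
        current: Optional[Record] = record
        for process in processors:
            current = process(current)
            if current is None:  # a processor dropped the record
                break
        if current is None:
            continue
        backend = exporters.get(current["signal"])  # route
        if backend is not None:
            backend.append(current)  # export (a list stands in for a backend)


exporters: dict[str, list[Record]] = {"logs": [], "traces": []}
run_pipeline(
    [
        {"signal": "logs", "level": "DEBUG", "body": "cache warm"},
        {"signal": "logs", "level": "ERROR", "body": "API timeout"},
        {"signal": "traces", "name": "GET /pay"},
    ],
    [drop_debug, add_environment],
    exporters,
)
print(exporters)
```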

Exporters

Send data to:

Prometheus
Jaeger
Loki
Elastic
Datadog

What is Jaeger?

Jaeger is an open-source distributed tracing platform.

It was originally developed at Uber.

It is now a graduated CNCF project.

Jaeger Architecture

Application
      ↓
OTel Collector
      ↓
Jaeger
      ↓
UI

Jaeger Features

Provides:

Trace Visualization
Dependency Mapping
Latency Analysis
Performance Troubleshooting

Example Trace

User Request
      ↓
Frontend (50ms)
      ↓
API Gateway (20ms)
      ↓
Payment Service (300ms)
      ↓
Database (10ms)

Problem identified:

Payment Service
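The same conclusion can be reached programmatically. This short sketch takes the four durations from the example trace and reports the slowest hop and its share of the total, treating the durations as sequential and non-overlapping, which is an assumption of this illustration.

```python
# (service, duration in ms) taken from the example trace above
hops = [
    ("Frontend", 50),
    ("API Gateway", 20),
    ("Payment Service", 300),
    ("Database", 10),
]

total_ms = sum(duration for _, duration in hops)
slowest_name, slowest_ms = max(hops, key=lambda hop: hop[1])
share = slowest_ms / total_ms

print(f"total: {total_ms}ms")
print(f"slowest: {slowest_name} at {slowest_ms}ms ({share:.0%} of the request)")
```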

Installing Loki in Development

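Add repository:

helm repo add grafana \
  https://grafana.github.io/helm-charts
helm repo update

Install:

helm install loki grafana/loki-stack

The loki-stack chart bundles Loki with Promtail, which makes it convenient for a quick trial.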

Installing Jaeger in Development

docker run -d \
--name jaeger \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one

Open the Jaeger UI in a browser (port 4317 is the OTLP gRPC endpoint that applications send traces to):

http://localhost:16686

Installing Loki in Pre-Prod Kubernetes

helm install loki \
grafana/loki-stack \
-n observability \
--create-namespace

Installing Jaeger in Kubernetes

Add repository:

helm repo add jaegertracing \
  https://jaegertracing.github.io/helm-charts
helm repo update

Install:

helm install jaeger \
  jaegertracing/jaeger \
  -n observability \
  --create-namespace

Modern Kubernetes Observability Stack

Prometheus
+
Grafana
+
Loki
+
Jaeger
+
OpenTelemetry

This combination is currently one of the most widely adopted open-source stacks for cloud-native observability.

Final Thoughts

Observability is much more than monitoring.

Monitoring tells you:

Something is wrong

Logging tells you:

What happened

Tracing tells you:

Where it happened, and why

Modern cloud-native platforms achieve full observability by combining:

Prometheus
+
Grafana
+
Loki
+
Jaeger
+
OpenTelemetry

Together these tools provide:

Metrics
Logs
Traces
Root Cause Analysis
Performance Optimization
Incident Response

which are essential for operating reliable Kubernetes and microservices platforms at scale.
