In Part 1, we covered:
- Observability Fundamentals
- Monitoring
- Metrics
- Prometheus
- Grafana
- Alerting
Monitoring tells us:
Something is wrong
But monitoring alone cannot answer:
Why is it wrong?
Which service failed?
Which request caused the problem?
This is where the remaining pillars of observability become critical:
Logging
+
Tracing
Together they help engineers perform:
- Root Cause Analysis
- Incident Investigation
- Distributed System Debugging
- Performance Optimization
🔗 Resources
- **Support the Journey on GitHub:** If you're following along, consider starring and forking the repo: https://github.com/17J/30-Days-Cloud-DevSecOps-Journey
The Three Pillars Revisited
Observability consists of:
Metrics
Logs
Traces
Metrics answer:
What is happening?
Logs answer:
What happened?
Traces answer:
Why did it happen?
Why Monitoring Alone Is Not Enough
Example:
Prometheus Alert:
API Error Rate = 30%
Monitoring tells us:
Problem Exists
But not:
Which API?
Which User?
Which Request?
Which Database Query?
Logs and traces provide those answers.
What is Logging?
Logging is the process of recording events generated by applications, operating systems, and infrastructure.
Examples:
User Login Success
Payment Processed
Database Connection Failed
Pod Restarted
API Timeout
Logs are detailed records of system behavior.
Why Logging Matters
Imagine an application crash.
Monitoring shows:
CPU = Normal
Memory = Normal
Error Rate = High
But logs reveal:
Database Authentication Failed
Root cause found.
Types of Logs
Modern environments generate multiple log types.
Application Logs
Generated by application code.
Example:
{
"timestamp":"2026-01-01T10:00:00Z",
"service":"payment-api",
"level":"ERROR",
"message":"Payment processing failed"
}
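One way an application could emit logs in this shape is with a custom formatter on Python's standard `logging` module. This is a minimal sketch, not a prescribed setup: the `JsonFormatter` class and `build_logger` helper are hypothetical names, and the field set simply mirrors the example above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service  # hypothetical service name added to every record

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": timestamp.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
        })


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    return logger


logger = build_logger("payment-api")
logger.error("Payment processing failed")
```

Because every line is valid JSON with the same keys, a collector can forward it without custom parsing rules.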
System Logs
Generated by operating systems.
Examples:
Kernel Events
Service Start
Authentication Events
System Reboots
Container Logs
Generated by containers.
Example:
kubectl logs pod-name
Kubernetes Logs
Generated by Kubernetes components.
Examples:
Kubelet
API Server
Scheduler
Controller Manager
Security Logs
Examples:
Failed Login Attempts
Privilege Escalation
Unauthorized Access
These are especially important for Security Operations Center (SOC) teams.
Challenges with Logging
Modern environments generate huge volumes of logs.
Example:
100 Microservices
↓
10 Pods Each
↓
Millions of Log Lines
Problems:
- Storage
- Search
- Correlation
- Cost
This is why centralized logging exists.
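To get a feel for the scale, here is a rough back-of-the-envelope calculation. The per-pod log rate (10 lines per second) and the average line size (200 bytes) are assumptions picked for illustration, not measurements.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_log_lines(services: int, pods_per_service: int, lines_per_pod_per_sec: int) -> int:
    """Total log lines produced per day across all pods."""
    return services * pods_per_service * lines_per_pod_per_sec * SECONDS_PER_DAY


def daily_log_gib(lines: int, avg_line_bytes: int) -> float:
    """Approximate uncompressed volume in GiB."""
    return lines * avg_line_bytes / 2**30


# 100 microservices x 10 pods each, assuming 10 lines/sec per pod
lines = daily_log_lines(services=100, pods_per_service=10, lines_per_pod_per_sec=10)
print(f"{lines:,} lines per day")
print(f"{daily_log_gib(lines, avg_line_bytes=200):.1f} GiB per day")
```

Even with these modest assumptions the total is in the hundreds of millions of lines per day, which is why storage, search, and cost become real concerns.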
What is Centralized Logging?
Instead of:
Application A Logs
Application B Logs
Application C Logs
stored separately,
we collect everything into a central platform.
Applications
↓
Log Collector
↓
Central Storage
↓
Search & Analysis
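The flow above can be sketched in a few lines. This is a toy in-memory model, assuming hypothetical names (`LogEvent`, `CentralLogStore`); a real platform adds durable storage, indexing, and retention on top of the same idea.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional, Tuple


@dataclass
class LogEvent:
    source: str   # which application produced the line
    level: str
    message: str


@dataclass
class CentralLogStore:
    """In-memory stand-in for a central log platform."""
    events: List[LogEvent] = field(default_factory=list)

    def collect(self, source: str, lines: Iterable[Tuple[str, str]]) -> None:
        """Ingest (level, message) pairs and tag them with their source."""
        for level, message in lines:
            self.events.append(LogEvent(source, level, message))

    def search(self, level: Optional[str] = None,
               contains: Optional[str] = None) -> List[LogEvent]:
        """Filter across every application's logs in one query."""
        results = self.events
        if level is not None:
            results = [e for e in results if e.level == level]
        if contains is not None:
            needle = contains.lower()
            results = [e for e in results if needle in e.message.lower()]
        return results


store = CentralLogStore()
store.collect("application-a", [("INFO", "User Login Success")])
store.collect("application-b", [("ERROR", "Database Connection Failed")])
store.collect("application-c", [("ERROR", "API Timeout")])

for event in store.search(level="ERROR"):
    print(event.source, event.message)
```

The point is the single `search` call: one query covers all three applications instead of three separate log locations.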
Popular Logging Platforms
Commonly used options include:
ELK Stack
EFK Stack
Loki
Splunk
Datadog
Understanding ELK Stack
ELK stands for:
Elasticsearch
Logstash
Kibana
It is one of the most widely used logging stacks.
ELK Architecture
Applications
↓
Logstash
↓
Elasticsearch
↓
Kibana
Elasticsearch
Stores logs.
Think of it as:
Searchable Log Database
Capabilities:
- Full-text search
- Indexing
- Analytics
- Aggregation
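Full-text search over logs generally rests on an inverted index: a map from each word to the log lines that contain it. The sketch below illustrates that idea only; the `InvertedIndex` class is a hypothetical name, and Elasticsearch's real implementation is far more elaborate.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Map each token to the set of log line IDs that contain it."""

    def __init__(self) -> None:
        self.lines = {}                    # line ID -> original text
        self.postings = defaultdict(set)   # token -> set of line IDs

    @staticmethod
    def tokenize(text: str) -> list:
        """Lowercase the text and split it into alphanumeric tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, text: str) -> int:
        """Index one log line and return its ID."""
        line_id = len(self.lines)
        self.lines[line_id] = text
        for token in self.tokenize(text):
            self.postings[token].add(line_id)
        return line_id

    def search(self, query: str) -> list:
        """Return lines containing every token in the query (AND semantics)."""
        tokens = self.tokenize(query)
        if not tokens:
            return []
        # .get avoids creating empty entries for unknown tokens
        matches = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        return [self.lines[i] for i in sorted(matches)]


index = InvertedIndex()
index.add("Payment processing failed")
index.add("Database Connection Failed")
index.add("User Login Success")
print(index.search("failed"))
```

Looking up a word is a dictionary access rather than a scan of every line, which is what makes search fast and also why the index itself takes significant space.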
Logstash
Processes logs.
Responsibilities:
Collect
Transform
Parse
Enrich
Forward
Example:
Raw Log
↓
Structured JSON
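As an illustration of the parse and enrich steps, the sketch below turns a raw text line into a structured record. The raw line layout (`timestamp level service message`) and the added `environment` field are assumptions for this example; Logstash itself is configured with its own pipeline syntax rather than Python.

```python
import json
import re
from typing import Optional

# Assumed raw layout: "<timestamp> <LEVEL> <service> <free-text message>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<service>\S+)\s+(?P<message>.*)$"
)


def parse_line(raw: str, environment: str = "dev") -> Optional[dict]:
    """Parse one raw log line into a dict, or return None if it does not match."""
    match = LINE_PATTERN.match(raw.strip())
    if match is None:
        return None
    event = match.groupdict()
    event["environment"] = environment  # enrichment: hypothetical extra field
    return event


raw = "2026-01-01T10:00:00Z ERROR payment-api Payment processing failed"
print(json.dumps(parse_line(raw), indent=2))
```

Lines that do not match return `None` here; a real pipeline would typically tag and keep them so nothing is silently lost.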
Kibana
Visualization layer.
Provides:
- Dashboards
- Search
- Analytics
- Visualizations
Example ELK Workflow
Application Log
↓
Logstash
↓
Elasticsearch
↓
Kibana Dashboard
What is EFK Stack?
A Kubernetes-focused variant of ELK.
EFK:
Elasticsearch
Fluentd (or Fluent Bit)
Kibana
Fluentd, or its lightweight sibling Fluent Bit, replaces Logstash.
Why Fluent Bit?
Benefits:
Lightweight
Fast
Kubernetes Native
Lower Resource Usage
EFK Architecture
Pods
↓
Fluent Bit
↓
Elasticsearch
↓
Kibana
What is Grafana Loki?
Loki is a modern log aggregation system developed by Grafana Labs.
Designed specifically for cloud-native environments.
Why Loki Became Popular
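ELK is powerful but can be expensive to run, because Elasticsearch indexes the full content of every log line. Loki indexes only a small set of labels per log stream, which keeps storage and memory needs lower.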
Loki offers:
Simpler Architecture
Lower Storage Cost
Grafana Integration
Kubernetes Friendly
Loki Architecture
Applications
↓
Promtail
↓
Loki
↓
Grafana
Promtail
Log collector. (Grafana Labs has deprecated Promtail in favor of Grafana Alloy, which fills the same role.)
Collects:
Container Logs
Pod Logs
System Logs
and sends them to Loki.
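To make the hand-off to Loki concrete, here is a minimal sketch of the JSON body accepted by Loki's HTTP push endpoint (`/loki/api/v1/push`): a list of streams, each with a label set and `[timestamp in nanoseconds as a string, log line]` pairs. The label names and values are assumptions for illustration, and the function only builds the payload; it does not send anything.

```python
import json
import time
from typing import Dict, List, Optional


def build_push_payload(labels: Dict[str, str], lines: List[str],
                       now_ns: Optional[int] = None) -> str:
    """Build a JSON body in the shape Loki's push API expects."""
    base = now_ns if now_ns is not None else time.time_ns()
    # Offset each line by 1 ns so entries in the stream keep their order
    values = [[str(base + offset), line] for offset, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


payload = build_push_payload(
    labels={"app": "payment-api", "namespace": "dev"},  # hypothetical labels
    lines=["Payment Processed", "API Timeout"],
)
print(payload)
```

Only the labels are indexed by Loki; the log lines themselves are stored as compressed chunks, which is where the lower storage cost comes from.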
Loki Advantages
Benefits:
Low Cost
Easy Deployment
Cloud Native
Grafana Native
What is Distributed Tracing?
Logs tell us:
What happened?
But not:
Where did latency occur?
Tracing solves this problem.
Why Tracing Exists
In microservices:
User Request
↓
Frontend
↓
API Gateway
↓
Service A
↓
Service B
↓
Database
One slow service impacts the entire request.
Tracing helps locate it.
What is a Trace?
A trace represents a complete request journey.
Example:
Request Start
↓
Service A
↓
Service B
↓
Database
↓
Response
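For one trace to cover every hop, each service has to pass the trace's identity to the next one. A common way to do this is the W3C Trace Context `traceparent` HTTP header, which has the form `version-traceid-parentid-flags`. The sketch below generates and propagates such a header by hand; the helper names are hypothetical, and in practice an instrumentation library would handle this.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(parent: str) -> str:
    """Keep the trace ID, mint a new span ID for the next hop."""
    match = TRACEPARENT.match(parent)
    if match is None:
        raise ValueError(f"invalid traceparent: {parent!r}")
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()            # created when the request starts
outgoing = child_traceparent(incoming)  # sent from Service A to Service B
print(incoming)
print(outgoing)
```

Both headers share the same trace ID, which is what lets a tracing backend stitch the hops into one request journey.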
What is a Span?
A trace contains multiple spans.
Example:
Trace
├─ API Call
├─ Database Query
├─ Cache Lookup
└─ External API Call
Each span measures duration.
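A span is essentially a named, timed unit of work with a link to its parent. The following is a simplified sketch of that idea using only the standard library; `Span` and `Tracer` here are hypothetical stand-ins, not the OpenTelemetry API, and the nesting of the operations is an assumption for the example.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    name: str
    parent: Optional[str]     # name of the enclosing span, if any
    duration_ms: float = 0.0


class Tracer:
    """Record spans for a single trace, tracking nesting with a stack."""

    def __init__(self) -> None:
        self.spans = []    # finished spans, in completion order
        self._stack = []   # currently open spans

    @contextmanager
    def span(self, name: str):
        parent = self._stack[-1].name if self._stack else None
        current = Span(name, parent)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_ms = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(current)


tracer = Tracer()
with tracer.span("API Call"):
    with tracer.span("Database Query"):
        time.sleep(0.01)  # simulated work
    with tracer.span("Cache Lookup"):
        pass

for s in tracer.spans:
    print(f"{s.name} (parent={s.parent}): {s.duration_ms:.2f} ms")
```

The parent link is what turns a flat list of timings into the tree shown above.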
What is OpenTelemetry?
OpenTelemetry (OTel) is the de facto open standard for observability instrumentation.
It is a CNCF project.
Provides:
Metrics
Logs
Traces
through one framework.
Why OpenTelemetry Matters
Before OpenTelemetry:
Vendor Specific Agents
After OpenTelemetry:
Single Standard
for observability.
OpenTelemetry Components
SDK
Embedded in applications.
Collects:
Metrics
Logs
Traces
Collector
Receives telemetry.
Processes:
Filter
Transform
Route
Export
Exporters
Send data to:
Prometheus
Jaeger
Loki
Elastic
Datadog
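The Collector's filter, transform, route, and export stages can be pictured as a small pipeline of functions. This is a conceptual sketch only: the real Collector is configured in YAML with receivers, processors, and exporters, and the record fields, the `cluster` attribute, and the backend mapping below are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List

Record = Dict[str, str]

# Hypothetical routing table: signal type -> backend name
BACKENDS = {"metrics": "prometheus", "logs": "loki", "traces": "jaeger"}


def filter_records(records: List[Record]) -> List[Record]:
    """Filter: drop noisy DEBUG-level records."""
    return [r for r in records if r.get("level") != "DEBUG"]


def transform_records(records: List[Record], cluster: str) -> List[Record]:
    """Transform: attach a cluster attribute to every record."""
    return [{**r, "cluster": cluster} for r in records]


def route_records(records: List[Record]) -> Dict[str, List[Record]]:
    """Route: group records into per-backend batches ready for export."""
    routed = defaultdict(list)
    for record in records:
        backend = BACKENDS.get(record["signal"])
        if backend is not None:
            routed[backend].append(record)
    return dict(routed)


def run_pipeline(records: List[Record], cluster: str = "dev-cluster") -> Dict[str, List[Record]]:
    return route_records(transform_records(filter_records(records), cluster))


telemetry = [
    {"signal": "logs", "level": "ERROR", "body": "Payment processing failed"},
    {"signal": "logs", "level": "DEBUG", "body": "cache miss"},
    {"signal": "traces", "level": "INFO", "body": "span: Database Query"},
    {"signal": "metrics", "level": "INFO", "body": "http_requests_total"},
]

for backend, batch in run_pipeline(telemetry).items():
    print(backend, len(batch))
```

The export step is represented by the returned batches; an actual exporter would send each batch to its backend over the network.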
What is Jaeger?
Jaeger is an open-source distributed tracing platform.
Originally developed by Uber.
Now a graduated CNCF project.
Jaeger Architecture
Application
↓
OTel Collector
↓
Jaeger
↓
UI
Jaeger Features
Provides:
Trace Visualization
Dependency Mapping
Latency Analysis
Performance Troubleshooting
Example Trace
User Request
↓ 50ms
Frontend
↓ 20ms
API Gateway
↓ 300ms
Payment Service
↓ 10ms
Database
Problem identified:
Payment Service
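The same conclusion can be reached programmatically. The sketch below assumes each latency in the example is attributed to the service shown beneath it, and the function names are hypothetical.

```python
from typing import List, Tuple

# (service, latency in ms), taken from the example trace
hops: List[Tuple[str, int]] = [
    ("Frontend", 50),
    ("API Gateway", 20),
    ("Payment Service", 300),
    ("Database", 10),
]


def slowest_hop(hops: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Return the hop with the highest latency."""
    return max(hops, key=lambda hop: hop[1])


def latency_share(hops: List[Tuple[str, int]], service: str) -> float:
    """Percentage of total request latency spent in the given service."""
    total = sum(ms for _, ms in hops)
    spent = sum(ms for name, ms in hops if name == service)
    return 100 * spent / total


name, ms = slowest_hop(hops)
print(f"Bottleneck: {name} ({ms} ms, {latency_share(hops, name):.1f}% of the request)")
```

A tracing UI such as Jaeger shows this visually as the longest bar in the trace timeline.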
Installing Loki in Development
Add repository:
helm repo add grafana \
https://grafana.github.io/helm-charts
Refresh the chart index, then install:
helm repo update
helm install loki grafana/loki-stack
Installing Jaeger in Development
docker run -d \
--name jaeger \
-p 16686:16686 \
-p 4317:4317 \
jaegertracing/all-in-one
Access:
http://localhost:16686
Installing Loki in Pre-Prod Kubernetes
helm install loki \
grafana/loki-stack \
-n observability \
--create-namespace
Installing Jaeger in Kubernetes
helm repo add jaegertracing \
https://jaegertracing.github.io/helm-charts
Install:
helm install jaeger \
jaegertracing/jaeger \
-n observability \
--create-namespace
Modern Kubernetes Observability Stack
Prometheus
+
Grafana
+
Loki
+
Jaeger
+
OpenTelemetry
This combination is currently one of the most popular cloud-native observability stacks.
Final Thoughts
Observability is much more than monitoring.
Monitoring tells you:
Something is wrong
Logging tells you:
What happened
Tracing tells you:
Why it happened
Modern cloud-native platforms achieve full observability by combining:
Prometheus
+
Grafana
+
Loki
+
Jaeger
+
OpenTelemetry
Together these tools provide:
Metrics
Logs
Traces
Root Cause Analysis
Performance Optimization
Incident Response
which are essential for operating reliable Kubernetes and microservices platforms at scale.



