Modern applications are no longer simple monolithic systems.
Today, organizations run:
- Microservices
- Kubernetes
- Containers
- Serverless Functions
- Multi-Cloud Platforms
- Distributed Systems
As infrastructure becomes more distributed, troubleshooting becomes significantly harder.
A single user request may travel through:
Frontend
↓
API Gateway
↓
Microservice A
↓
Microservice B
↓
Database
When something breaks, the biggest challenge becomes:
"What exactly happened?"
This is where Observability becomes critical.
🔗 Resources
- **Support the Journey on GitHub:** If you're following along, consider starring and forking the repo: https://github.com/17J/30-Days-Cloud-DevSecOps-Journey
What is Observability?
Observability is the ability to understand the internal state of a system by analyzing the data it produces.
In simple words:
Can we understand what is happening inside our systems?
Observability helps engineers answer:
- Why is the application slow?
- Which service is failing?
- Which request caused the issue?
- What changed recently?
- Where is latency occurring?
Without observability:
Problem Exists
↓
Guessing Begins
With observability:
Problem Exists
↓
Evidence Available
↓
Faster Resolution
Why Observability Matters
Modern cloud-native systems generate enormous amounts of data.
Example:
100 Microservices
↓
Millions of Requests
↓
Thousands of Containers
Traditional monitoring alone is no longer sufficient.
Organizations need:
Visibility
Insights
Correlation
Root Cause Analysis
Observability provides all of them.
Monitoring vs Observability
Many people confuse monitoring and observability.
Monitoring asks:
What is wrong?
Observability asks:
Why is it wrong?
Example:
Monitoring:
CPU Usage = 95%
Observability:
Which service?
Which request?
Which dependency?
Which deployment caused it?
Observability provides context.
The Three Pillars of Observability
Modern observability is built on three primary pillars.
Metrics
Logs
Traces
Or, named by the practice behind each signal:
Monitoring
Logging
Tracing
Together they provide a complete picture of system behavior.
Pillar 1: Monitoring (Metrics)
Monitoring focuses on numerical measurements collected over time, known as metrics.
Examples:
CPU Usage
Memory Usage
Request Rate
Error Rate
Latency
Disk Usage
Metrics answer:
How much?
How often?
How fast?
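As a rough illustration of those three questions, the sketch below keeps a request counter, an error counter, and a list of latencies in plain Python. The class name `RequestStats` and its fields are hypothetical and not part of any Prometheus library; real instrumentation would normally go through a client library instead.

```python
from dataclasses import dataclass, field


@dataclass
class RequestStats:
    """Hypothetical in-process metrics for one service."""
    requests: int = 0   # how often? (only ever increases)
    errors: int = 0     # how much is failing? (only ever increases)
    latencies: list = field(default_factory=list)  # how fast? (seconds)

    def observe(self, duration_s: float, ok: bool = True) -> None:
        """Record one finished request."""
        self.requests += 1
        if not ok:
            self.errors += 1
        self.latencies.append(duration_s)

    def error_rate(self) -> float:
        """Fraction of requests that failed (0.0 when nothing was recorded)."""
        return self.errors / self.requests if self.requests else 0.0

    def avg_latency(self) -> float:
        """Mean request duration in seconds."""
        return sum(self.latencies) / len(self.latencies) if self.latencies else 0.0


stats = RequestStats()
stats.observe(0.120)
stats.observe(0.300, ok=False)
stats.observe(0.180)
print(stats.requests, round(stats.error_rate(), 2), round(stats.avg_latency(), 2))
```

Each number is cheap to store and easy to compare against a threshold, which is what makes metrics a good fit for alerting.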
Pillar 2: Logging
Logs provide detailed event information.
Example:
User Login Success
Database Connection Failed
API Request Received
Logs answer:
What happened?
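Logs are easier to search and correlate when each event is structured. The sketch below, using only Python's standard `logging` module, writes the events above as one JSON object per line. The `JsonFormatter` class, the `payment` logger name, and the `fields` convention for extra context are assumptions made for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context passed via extra={"fields": {...}} is merged into the event.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

log = logging.getLogger("payment")  # hypothetical service name
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # keep the example output out of the root logger

log.info("User Login Success", extra={"fields": {"user_id": "u-42"}})
log.error("Database Connection Failed", extra={"fields": {"db_host": "db-1"}})

events = [json.loads(line) for line in stream.getvalue().splitlines()]
print(events[1]["level"], events[1]["message"])
```

Because every event carries named fields, a log backend can filter on them instead of relying on free-text search.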
Pillar 3: Tracing
Tracing follows a request across multiple services.
Example:
User Request
↓
Frontend
↓
API
↓
Payment Service
↓
Database
Tracing answers:
Where did the request spend time?
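To make that idea concrete, here is a minimal sketch of tracing with the standard library only. Each step records a span tagged with a shared trace ID. The `span` helper and the in-memory `spans` list are hypothetical; a real setup would use a tracing SDK and export spans to a backend.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # collected spans; a real tracer would export these


@contextmanager
def span(name: str, trace_id: str):
    """Time one step of a request and record it under the shared trace ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_s": time.perf_counter() - start,
        })


def handle_request() -> str:
    trace_id = uuid.uuid4().hex  # one ID follows the request everywhere
    with span("frontend", trace_id):
        with span("api", trace_id):
            with span("payment-service", trace_id):
                with span("database", trace_id):
                    time.sleep(0.01)  # simulated slow query
    return trace_id


trace_id = handle_request()
for s in spans:  # inner spans finish first, so they are listed first
    print(f'{s["name"]:<16} {s["duration_s"] * 1000:.1f} ms')
```

Comparing the durations of nested spans shows which hop accounts for most of the request time.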
Why Metrics Matter First
Among all observability signals, metrics are usually the first thing engineers implement.
Reasons:
- Lightweight
- Efficient
- Fast alerting
- Low storage cost
- Easy visualization
This is a large part of why Prometheus became a de facto standard for metrics in cloud-native environments.
What is Prometheus?
Prometheus is an open-source monitoring and alerting system originally developed at SoundCloud and now maintained as a graduated Cloud Native Computing Foundation (CNCF) project.
Prometheus collects metrics from applications and infrastructure.
Example:
CPU
Memory
Network
Latency
Errors
Why Prometheus Became Popular
Before Prometheus:
Monitoring Tools
↓
Complex
Expensive
Difficult to Scale
Prometheus introduced:
Pull-Based Collection
Powerful Query Language
Kubernetes Integration
Open Source
Understanding Prometheus Components
Prometheus Server
Core component.
Responsible for:
- Metric collection
- Storage
- Query processing
- Alert rule evaluation (notifications are delegated to Alertmanager)
Exporters
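Prometheus scrapes metrics over HTTP, either from applications that expose them directly or from exporters, which translate another system's statistics into the Prometheus format.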
Examples:
Node Exporter
MySQL Exporter
MongoDB Exporter
Redis Exporter
Blackbox Exporter
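At its core, an exporter is an HTTP endpoint that returns metrics as plain text. The sketch below serves a single counter in the Prometheus text exposition format using only the standard library, then fetches it the way a scraper would. The metric name `demo_requests_total` and its value are made up for illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

REQUESTS_TOTAL = 7  # hypothetical application counter


def render_metrics() -> str:
    """Build a response body in the Prometheus text exposition format."""
    return (
        "# HELP demo_requests_total Requests handled by the demo app.\n"
        "# TYPE demo_requests_total counter\n"
        f"demo_requests_total {REQUESTS_TOTAL}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass


# Port 0 asks the OS for any free port; a real exporter uses a fixed one.
server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Act like a scraper: pull the metrics over HTTP.
conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("GET", "/metrics")
scraped = conn.getresponse().read().decode("utf-8")
conn.close()
server.shutdown()
server.server_close()
print(scraped)
```

The pull model means the application only has to expose this endpoint; Prometheus decides when and how often to read it.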
Alertmanager
Receives alerts fired by Prometheus, then groups, deduplicates, and routes them to notification channels.
Example:
CPU > 90%
↓
Alertmanager
↓
Email
Slack
Teams
PagerDuty
Time-Series Database
Prometheus stores each metric as a time series, identified by a metric name and a set of labels, where every sample is:
Timestamp + Value
Example:
10:00 CPU=45%
10:01 CPU=48%
10:02 CPU=51%
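One way to picture this is a dictionary keyed by metric name and labels, where each value is a list of samples. The sketch below is only an in-memory model for intuition, not how Prometheus lays data out on disk; the `cpu_percent` name and the `host` label are assumptions.

```python
from collections import defaultdict

# Key: (metric name, sorted label pairs). Value: list of (timestamp, value).
series = defaultdict(list)


def append(name: str, labels: dict, timestamp: int, value: float) -> None:
    """Store one sample for the series identified by name + labels."""
    key = (name, tuple(sorted(labels.items())))
    series[key].append((timestamp, value))


# The samples from the example above, as seconds elapsed since 10:00.
append("cpu_percent", {"host": "web-1"}, 0, 45.0)
append("cpu_percent", {"host": "web-1"}, 60, 48.0)
append("cpu_percent", {"host": "web-1"}, 120, 51.0)
append("cpu_percent", {"host": "web-2"}, 0, 12.0)  # different labels: new series

web1 = series[("cpu_percent", (("host", "web-1"),))]
print(len(series), web1[-1])
```

Changing any label value produces a separate series, which is why label choices directly affect storage size.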
What is Grafana?
Grafana is an open-source visualization platform used to create dashboards from Prometheus metrics and other data sources.
Prometheus stores data.
Grafana visualizes data.
Relationship:
Prometheus
↓
Metrics
↓
Grafana
↓
Dashboards
Why Grafana is Popular
Grafana provides:
- Beautiful dashboards
- Alerting
- Multiple data sources
- Real-time visualization
Supported sources:
Prometheus
Elasticsearch
Loki
InfluxDB
CloudWatch
Azure Monitor
Prometheus + Grafana Architecture
Applications
↓
Exporters
↓
Prometheus
↓
Grafana
↓
Engineers
Common Metrics Monitored
Infrastructure:
CPU
Memory
Disk
Network
Application:
Request Rate
Response Time
Error Rate
Kubernetes:
Pod Count
Node Status
Container CPU
Container Memory
Installing Prometheus in Development Environment
For local development, Docker is easiest.
Run Prometheus Container
docker run -d \
--name prometheus \
-p 9090:9090 \
prom/prometheus
Verify:
http://localhost:9090
Check Targets
Navigate:
Status
↓
Targets
Installing Node Exporter
docker run -d \
--name node-exporter \
-p 9100:9100 \
prom/node-exporter
This exposes system metrics at http://localhost:9100/metrics, including:
CPU Metrics
Memory Metrics
Disk Metrics
Configure Prometheus
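Example prometheus.yml:
global:
  scrape_interval: 15s   # how often targets are scraped
scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          # localhost resolves to the Prometheus container itself. It reaches
          # Node Exporter only when Prometheus shares the host network; otherwise
          # put both containers on one Docker network and use the container
          # name instead (node-exporter:9100).
          - localhost:9100
Mount this file into the container at /etc/prometheus/prometheus.yml, then restart Prometheus.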
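After a restart, the Targets page shows whether each scrape succeeds. The same information is available from the Prometheus HTTP API at `/api/v1/targets`. The sketch below parses a trimmed, made-up response in that shape and lists the targets that are not healthy; the helper name `unhealthy_targets` is hypothetical.

```python
import json

# Trimmed sample of a GET /api/v1/targets response; the values are invented.
SAMPLE = """
{
  "status": "success",
  "data": {
    "activeTargets": [
      {"labels": {"job": "prometheus", "instance": "localhost:9090"},
       "scrapeUrl": "http://localhost:9090/metrics",
       "health": "up",
       "lastError": ""},
      {"labels": {"job": "node", "instance": "localhost:9100"},
       "scrapeUrl": "http://localhost:9100/metrics",
       "health": "down",
       "lastError": "connection refused"}
    ]
  }
}
"""


def unhealthy_targets(payload: dict) -> list:
    """Return (job, scrape URL, last error) for every target that is not up."""
    return [
        (t["labels"].get("job", ""), t["scrapeUrl"], t.get("lastError", ""))
        for t in payload["data"]["activeTargets"]
        if t["health"] != "up"
    ]


problems = unhealthy_targets(json.loads(SAMPLE))
for job, url, err in problems:
    print(f"{job}: {url} is not healthy ({err})")
```

A "connection refused" error on a localhost target is a common sign that the target address is not reachable from inside the Prometheus container.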
Installing Grafana in Development Environment
Run Grafana:
docker run -d \
--name grafana \
-p 3000:3000 \
grafana/grafana
Access:
http://localhost:3000
Default login:
admin/admin (Grafana prompts for a new password on first login)
Connect Grafana to Prometheus
Add Data Source:
Grafana
↓
Connections
↓
Data Sources
↓
Prometheus
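URL:
http://prometheus:9090
This hostname resolves only when the Grafana and Prometheus containers share a user-defined Docker network. Otherwise, use an address that is reachable from inside the Grafana container.
Click Save & Test.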
Creating First Dashboard
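Example panel query:
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
This shows CPU usage as a percentage per instance. node_cpu_seconds_total is a counter of CPU seconds per core and mode, so rate() on its own returns seconds per second for each mode rather than overall usage.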
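To see why a counter has to go through `rate()` before it means anything, here is a simplified sketch of the calculation. It handles counter resets but leaves out the extrapolation Prometheus applies at the window edges, so treat it as intuition rather than the exact algorithm. The sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: handles counter resets, skips window-edge extrapolation.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical idle-mode CPU seconds for one core, scraped every 15 seconds.
idle = [(0, 1000.0), (15, 1003.0), (30, 1006.0), (45, 1009.0), (60, 1012.0)]
idle_fraction = simple_rate(idle)              # idle seconds per second
cpu_usage_percent = 100 * (1 - idle_fraction)  # everything that was not idle
print(round(cpu_usage_percent, 1))
```

The raw counter only ever grows, so its absolute value says nothing about load. The slope over a time window is what carries the signal.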
Installing Prometheus in Pre-Production Kubernetes
Production-like environments typically use Helm.
Add Prometheus Community Repo
helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts
Update:
helm repo update
Install kube-prometheus-stack
helm install monitoring \
prometheus-community/kube-prometheus-stack \
-n monitoring \
--create-namespace
This installs:
Prometheus Operator
Prometheus
Grafana
Alertmanager
Node Exporter
Kube State Metrics
in a single Helm release.
Verify Installation
kubectl get pods -n monitoring
Expected pods in Running state, with names containing:
prometheus
grafana
alertmanager
node-exporter
kube-state-metrics
Access Grafana
kubectl port-forward svc/monitoring-grafana \
3000:80 \
-n monitoring
Open:
http://localhost:3000
Access Prometheus
kubectl port-forward svc/monitoring-kube-prometheus-prometheus \
9090:9090 \
-n monitoring
Open:
http://localhost:9090
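With the port-forward active, Prometheus can also be queried over its HTTP API instead of the web UI. The sketch below builds the URL for an instant query and parses a response body in the documented shape. The sample body is made up, and nothing here contacts a server, so it runs without a cluster.

```python
import json
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9090"  # matches the port-forward above


def instant_query_url(expr: str) -> str:
    """URL for the instant-query endpoint of the Prometheus HTTP API."""
    params = urlencode({"query": expr})
    return f"{PROMETHEUS}/api/v1/query?{params}"


def parse_vector(body: str) -> dict:
    """Map each series' sorted label pairs to its current value."""
    payload = json.loads(body)
    if payload["status"] != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return {
        tuple(sorted(item["metric"].items())): float(item["value"][1])
        for item in payload["data"]["result"]
    }


# Invented response body; sample values arrive as strings.
SAMPLE = """{"status": "success", "data": {"resultType": "vector", "result": [
  {"metric": {"__name__": "up", "job": "node-exporter"},
   "value": [1700000000, "1"]}
]}}"""

print(instant_query_url("up"))
print(parse_vector(SAMPLE))
```

Fetching the generated URL with any HTTP client returns JSON in this form, which is handy for quick scripted checks.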
Production Monitoring Stack
A typical enterprise monitoring stack looks like:
Kubernetes Cluster
↓
Node Exporter, Kube State Metrics, Application Metrics
↓
Prometheus
↓
Alertmanager (notifications) and Grafana (dashboards)
↓
Operations Team
Example Alert Rule
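CPU alert (fires when average CPU usage on an instance stays above 90% for 5 minutes):
groups:
  - name: cpu-alerts
    rules:
      - alert: HighCPUUsage
        # node_cpu_seconds_total is a counter, so derive a percentage from the idle rate
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m   # the condition must hold this long before the alert fires
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 90% on {{ $labels.instance }}"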
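The `for` clause is what keeps short spikes from paging anyone. A rough model of its behavior, assuming one evaluation per minute and invented CPU values, looks like this. The function name `alert_states` is hypothetical, and the real rule evaluator has more detail than shown here.

```python
def alert_states(samples, threshold, for_seconds):
    """Yield (timestamp, state) for each (timestamp, value) evaluation.

    inactive -> pending when the value first exceeds the threshold,
    pending  -> firing once it has stayed above for `for_seconds`,
    any      -> inactive as soon as the value drops back.
    """
    active_since = None
    for ts, value in samples:
        if value <= threshold:
            active_since = None
            yield ts, "inactive"
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        yield ts, "firing" if held >= for_seconds else "pending"


# Hypothetical CPU % evaluated once a minute; threshold 90, for: 5m.
cpu = [(0, 95), (60, 96), (120, 50), (180, 97), (240, 98),
       (300, 99), (360, 97), (420, 96), (480, 95)]
states = list(alert_states(cpu, threshold=90, for_seconds=300))
for ts, state in states:
    print(ts, state)
```

The dip at 120 seconds resets the timer, so the alert only fires after a later, uninterrupted five-minute stretch above the threshold.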
Grafana Dashboard Examples
Infrastructure Dashboard:
CPU Usage
Memory Usage
Disk Usage
Network Traffic
Kubernetes Dashboard:
Nodes
Pods
Deployments
Namespaces
Application Dashboard:
Request Rate
Error Rate
Latency
Availability
Monitoring Best Practices
Use Labels Properly
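Good:
environment=prod
team=platform
service=payment
Bad:
user_id=12345
Every distinct label value creates a new time series, so keep label values to a small, bounded set.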
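Consistent labels pay off at query time, because they let you slice the same metric by environment, team, or service. The sketch below mimics equality matching on labels in plain Python; the `select` helper and the sample values are assumptions for illustration only.

```python
def select(series, **matchers):
    """Return the series whose labels equal every given matcher."""
    return [
        s for s in series
        if all(s["labels"].get(k) == v for k, v in matchers.items())
    ]


# Hypothetical request counts tagged with the labels shown above.
series = [
    {"labels": {"environment": "prod", "team": "platform", "service": "payment"}, "value": 120},
    {"labels": {"environment": "prod", "team": "platform", "service": "cart"}, "value": 80},
    {"labels": {"environment": "dev", "team": "platform", "service": "payment"}, "value": 5},
]

prod_payment = select(series, environment="prod", service="payment")
prod_total = sum(s["value"] for s in select(series, environment="prod"))
print(len(prod_payment), prod_total)
```

If services used different label names for the same concept, for example `env` in one place and `environment` in another, a filter like this would silently miss data.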
Retain Metrics Wisely
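Avoid storing metrics forever. Prometheus keeps 15 days of data by default; adjust this with the --storage.tsdb.retention.time flag to match how far back you actually need to query.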
Create Actionable Alerts
Bad (fires on every short spike):
CPU > 80%
Good (fires only on sustained load):
CPU > 90% for 10 minutes
Separate Environments
Dev
QA
PreProd
Prod
should have independent monitoring.
Observability Tools Landscape
Monitoring:
Prometheus
Grafana
Datadog
New Relic
CloudWatch
Azure Monitor
Logging:
ELK Stack
EFK Stack
Loki
Splunk
Tracing:
Jaeger
Zipkin
Tempo
OpenTelemetry
What We'll Cover in Part Two
This article focused on:
Observability Fundamentals
Monitoring
Prometheus
Grafana
In Part Two we'll cover:
Logging
Centralized Log Management
ELK Stack
EFK Stack
Loki
Tracing
Jaeger
OpenTelemetry
Distributed Tracing
End-to-End Observability
Final Thoughts
Observability is one of the most important capabilities in modern cloud-native platforms.
Without observability:
Failures Become Guesswork
With observability:
Metrics
Logs
Traces
↓
Faster Troubleshooting
Better Reliability
Improved User Experience
For most organizations, the journey starts with:
Prometheus
+
Grafana
because they provide a powerful, scalable, and Kubernetes-native monitoring platform.
Once monitoring is established, the next step is adding:
Logging
+
Tracing
to achieve full-stack observability.

