Rahul Joshi

Posted on Jun 8

Day 28 — 🔭 Monitoring & Observability Part One

#masterclassdevsecops #devops #webdev #cicd

Modern applications are no longer simple monolithic systems.

Today organizations run:

Microservices
Kubernetes
Containers
Serverless Functions
Multi-Cloud Platforms
Distributed Systems

As infrastructure becomes more distributed, troubleshooting becomes significantly harder.

A single user request may travel through:

Frontend
    ↓
API Gateway
    ↓
Microservice A
    ↓
Microservice B
    ↓
Database

When something breaks, the biggest challenge becomes:

"What exactly happened?"

This is where Observability becomes critical.

🔗 Resources

Support the Journey on GitHub: If you're following along, consider starring and forking the repo: https://github.com/17J/30-Days-Cloud-DevSecOps-Journey

What is Observability?

Observability is the ability to understand the internal state of a system by analyzing the data it produces.

In simple words:

Can we understand
what is happening
inside our systems?

Observability helps engineers answer:

Why is the application slow?
Which service is failing?
Which request caused the issue?
What changed recently?
Where is latency occurring?

Without observability:

Problem Exists
      ↓
Guessing Begins

With observability:

Problem Exists
      ↓
Evidence Available
      ↓
Faster Resolution

Why Observability Matters

Modern cloud-native systems generate enormous amounts of data.

Example:

100 Microservices
      ↓
Millions of Requests
      ↓
Thousands of Containers

Traditional monitoring alone is no longer sufficient.

Organizations need:

Visibility
Insights
Correlation
Root Cause Analysis

Observability provides all of them.

Monitoring vs Observability

Many people confuse monitoring and observability.

Monitoring asks:

What is wrong?

Observability asks:

Why is it wrong?

Example:

Monitoring:

CPU Usage = 95%

Observability:

Which service?
Which request?
Which dependency?
Which deployment caused it?

Observability provides context.

The Three Pillars of Observability

Modern observability is built on three primary pillars.

Metrics
Logs
Traces

Or:

Monitoring
Logging
Tracing

Together they provide a complete picture of system behavior.

Pillar 1: Monitoring (Metrics)

Monitoring focuses on numerical measurements.

Examples:

CPU Usage
Memory Usage
Request Rate
Error Rate
Latency
Disk Usage

Metrics answer:

How much?
How often?
How fast?
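
To make those three questions concrete, here is a minimal sketch that reduces a handful of raw request records to a request rate, an error ratio, and a p95 latency. The sample data, the 60-second window, and the function name are assumptions made for illustration; a real metrics system computes these continuously rather than from a list in memory.

```python
import math

# Hypothetical request records: (duration in seconds, HTTP status code)
request_log = [(0.12, 200), (0.30, 200), (0.08, 500), (0.45, 200), (1.20, 503),
               (0.10, 200), (0.22, 200), (0.18, 200), (0.95, 200), (0.15, 200)]
WINDOW_SECONDS = 60  # assumed observation window


def summarize(records, window_seconds):
    """Reduce raw request records to rate, error ratio, and p95 latency."""
    durations = sorted(duration for duration, _ in records)
    errors = sum(1 for _, status in records if status >= 500)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it
    rank = math.ceil(0.95 * len(durations))
    return {
        "requests_per_second": len(records) / window_seconds,  # how often?
        "error_ratio": errors / len(records),                  # how much?
        "p95_latency_seconds": durations[rank - 1],            # how fast?
    }


summary = summarize(request_log, WINDOW_SECONDS)
print(summary)
```

Three numbers are far cheaper to store and alert on than ten raw records, which is the main appeal of metrics.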

Pillar 2: Logging

Logs provide detailed event information.

Example:

User Login Success
Database Connection Failed
API Request Received

Logs answer:

What happened?
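
Logs are easier to search and correlate when each event is emitted as structured data. The sketch below uses only the Python standard library to write the events above as JSON lines; the field names (service, request_id) and the logger name are assumptions, not a required schema.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed through `extra=`; the names are illustrative
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("observability-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("User Login Success", extra={"service": "auth", "request_id": "req-42"})
logger.error("Database Connection Failed", extra={"service": "payment", "request_id": "req-42"})

lines = stream.getvalue().splitlines()
print("\n".join(lines))
```

Because both events carry the same request_id, a log backend can show everything that happened during that one request.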

Pillar 3: Tracing

Tracing follows a request across multiple services.

Example:

User Request
      ↓
Frontend
      ↓
API
      ↓
Payment Service
      ↓
Database

Tracing answers:

Where did the request spend time?
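
The idea behind a trace can be sketched with nested timers that share a trace ID. This is not the API of Jaeger, Zipkin, or OpenTelemetry; the span helper, the service names, and the sleep calls are hypothetical stand-ins for real instrumentation and real work.

```python
import time
from contextlib import contextmanager

spans = []  # collected (trace_id, span name, duration in seconds) tuples


@contextmanager
def span(trace_id, name):
    """Record how long the wrapped block took, tagged with the trace ID."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append((trace_id, name, time.perf_counter() - start))


def handle_request(trace_id):
    # Each nested block represents one hop of the request path shown above
    with span(trace_id, "frontend"):
        with span(trace_id, "api"):
            with span(trace_id, "payment-service"):
                with span(trace_id, "database"):
                    time.sleep(0.05)  # stand-in for a slow query
                time.sleep(0.01)      # stand-in for work inside the payment service


handle_request("trace-1")
for trace_id, name, duration in spans:
    print(f"{trace_id} {name:<16} {duration * 1000:.1f} ms")
```

Spans are recorded as they finish, so the innermost hop appears first. Comparing a span with its child shows how much time each hop added on its own.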

Why Metrics Matter First

Among all observability signals:

Metrics

are usually the first thing engineers implement.

Reasons:

Lightweight
Efficient
Fast alerting
Low storage cost
Easy visualization

These properties are a large part of why Prometheus became a de facto standard for metrics in cloud-native environments.

What is Prometheus?

Prometheus is an open-source monitoring and alerting system originally developed at SoundCloud and now a graduated project of the Cloud Native Computing Foundation (CNCF).

Prometheus collects:

Metrics

from applications and infrastructure.

Example:

CPU
Memory
Network
Latency
Errors

Why Prometheus Became Popular

Before Prometheus:

Monitoring Tools
      ↓
Complex
Expensive
Difficult Scaling

Prometheus introduced:

Pull-Based Collection
Powerful Query Language
Kubernetes Integration
Open Source

Understanding Prometheus Components

Prometheus Server

Core component.

Responsible for:

Metric collection (scraping targets)
Storage
Query processing
Alert rule evaluation (notifications are delegated to Alertmanager)

Exporters

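Prometheus scrapes metrics over HTTP. Applications can expose metrics directly, while exporters translate the state of systems that do not speak the Prometheus format.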

Examples:

Node Exporter
MySQL Exporter
MongoDB Exporter
Redis Exporter
Blackbox Exporter
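
At its core, an exporter is an HTTP endpoint that returns metrics in the Prometheus text exposition format. The sketch below renders that format with the standard library only. The metric names, port, and helper names are hypothetical, label values are not escaped, and real exporters normally use an official client library instead.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(families):
    """Render metric families in the Prometheus text exposition format."""
    lines = []
    for name, help_text, metric_type, series in families:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        for labels, value in series:
            # Label values are not escaped here; fine for a sketch, not for production
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}" if label_str else f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metrics for an imaginary job queue
EXAMPLE_FAMILIES = [
    ("queue_depth", "Jobs waiting in the queue.", "gauge", [({"queue": "email"}, 7)]),
    ("jobs_processed_total", "Jobs processed since start.", "counter", [({}, 1024)]),
]


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(EXAMPLE_FAMILIES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the exporter; call this manually, it blocks until interrupted."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


output = render_metrics(EXAMPLE_FAMILIES)
print(output, end="")
```

Calling serve() and adding the host and port as a scrape target would let Prometheus pull these values on every scrape interval.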

Alertmanager

Receives alerts fired by Prometheus, then deduplicates, groups, and routes them to notification channels.

Example:

CPU > 90%
      ↓
Alertmanager
      ↓
Email
Slack
Teams
PagerDuty

Time-Series Database

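Prometheus stores each metric as a time series, identified by a metric name and a set of labels. Every series is a sequence of samples:

Timestamp + Value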

Example:

10:00 CPU=45%
10:01 CPU=48%
10:02 CPU=51%
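
A rough mental model of this storage is a map from a series identity (metric name plus labels) to an append-only list of samples. The sketch below is only that mental model, with hypothetical metric and label names; the real TSDB uses compressed on-disk blocks, not a Python dictionary.

```python
from collections import defaultdict

# Each series is keyed by metric name plus its sorted label pairs
series = defaultdict(list)


def append_sample(name, labels, timestamp, value):
    """Append one (timestamp, value) sample to the series identified by name and labels."""
    key = (name, tuple(sorted(labels.items())))
    series[key].append((timestamp, value))


# The three samples from the example above, recorded for one host
for timestamp, value in [("10:00", 45), ("10:01", 48), ("10:02", 51)]:
    append_sample("cpu_usage_percent", {"instance": "web-1"}, timestamp, value)

# Same metric name, different label value: this becomes a separate series
append_sample("cpu_usage_percent", {"instance": "web-2"}, "10:00", 12)

for (name, labels), samples in series.items():
    print(name, dict(labels), samples)
```

The second host does not extend the first series. Every distinct label combination gets its own list of samples.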

What is Grafana?

Grafana is a visualization platform used to create dashboards from Prometheus metrics and many other data sources.

Prometheus stores data.

Grafana visualizes data.

Relationship:

Prometheus
      ↓
Metrics
      ↓
Grafana
      ↓
Dashboards

Why Grafana is Popular

Grafana provides:

Beautiful dashboards
Alerting
Multiple data sources
Real-time visualization

Supported sources:

Prometheus
Elasticsearch
Loki
InfluxDB
CloudWatch
Azure Monitor

Prometheus + Grafana Architecture

Applications
      ↓
Exporters
      ↓
Prometheus
      ↓
Grafana
      ↓
Engineers

Common Metrics Monitored

Infrastructure:

CPU
Memory
Disk
Network

Application:

Request Rate
Response Time
Error Rate

Kubernetes:

Pod Count
Node Status
Container CPU
Container Memory

Installing Prometheus in Development Environment

For local development, Docker is easiest.

Run Prometheus Container

docker run -d \
--name prometheus \
-p 9090:9090 \
prom/prometheus

Verify:

http://localhost:9090

Check Targets

Navigate:

Status
   ↓
Targets
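
The same target information is available from the Prometheus HTTP API at /api/v1/targets, which is handy for scripted checks. The sketch below assumes Prometheus is reachable at http://localhost:9090 and filters a response for targets that are not up. The sample payload is a hand-written, trimmed illustration of the response shape, not captured output.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch target status from a running Prometheus (not called in this sketch)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """Return (job, scrape URL, last error) for every active target that is not up."""
    return [
        (target["labels"].get("job"), target["scrapeUrl"], target.get("lastError", ""))
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


# Hand-written example payload, trimmed to the fields used above
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "prometheus"}, "scrapeUrl": "http://localhost:9090/metrics",
             "health": "up", "lastError": ""},
            {"labels": {"job": "node"}, "scrapeUrl": "http://localhost:9100/metrics",
             "health": "down", "lastError": "connection refused"},
        ]
    },
}

problems = unhealthy_targets(sample)
print(problems)
```

Against a live server, unhealthy_targets(fetch_targets()) returns an empty list when every target is being scraped successfully.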

Installing Node Exporter

docker run -d \
--name node-exporter \
-p 9100:9100 \
prom/node-exporter

This exposes:

CPU Metrics
Memory Metrics
Disk Metrics

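Configure Prometheus

Example prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
        - node-exporter:9100

Because Prometheus runs in its own container, localhost would point at the Prometheus container itself, not at Node Exporter. The container name node-exporter resolves only when both containers share a user-defined Docker network: create one with docker network create monitoring and start each container with --network monitoring. On Docker Desktop, host.docker.internal:9100 is an alternative.

Mount the file into the container at /etc/prometheus/prometheus.yml (for example with -v "$(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml") and restart Prometheus.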

Installing Grafana in Development Environment

Run Grafana:

docker run -d \
--name grafana \
-p 3000:3000 \
grafana/grafana

Access:

http://localhost:3000

Default credentials (Grafana asks for a new password on first login):

admin/admin

Connect Grafana to Prometheus

Add Data Source:

Grafana
    ↓
Connections
    ↓
Data Sources
    ↓
Prometheus

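URL:

http://prometheus:9090

This hostname resolves only when the Grafana and Prometheus containers share a user-defined Docker network (docker network create monitoring, then --network monitoring on each docker run). Otherwise use an address reachable from inside the Grafana container, such as http://host.docker.internal:9090 on Docker Desktop.

Save and Test.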

Creating First Dashboard

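Example panel query:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Shows CPU usage as a percentage per instance. node_cpu_seconds_total is a counter of CPU seconds per core and mode, so the query takes the per-second idle rate, averages it across cores, and subtracts it from 100.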
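
Since node_cpu_seconds_total is a counter, a usage figure comes from how fast it grows, not from its raw value. The sketch below shows a simplified version of that calculation on hand-made idle-time samples. It is an approximation for intuition only: the real rate() function also extrapolates to the edges of the time window, and the sample values are invented.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples.

    Simplified: handles counter resets, but does not extrapolate like PromQL rate().
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the current value is all new
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Invented idle-mode samples for one CPU core, scraped every 15 seconds
idle_samples = [(0, 1000.0), (15, 1012.0), (30, 1024.0), (45, 1036.0), (60, 1048.0)]

idle_rate = simple_rate(idle_samples)       # idle seconds accumulated per second
cpu_usage_percent = 100 - idle_rate * 100   # everything that was not idle
print(f"idle rate: {idle_rate:.2f}, CPU usage: {cpu_usage_percent:.1f}%")
```

The core was idle for 48 of the 60 observed seconds, so roughly a fifth of its time went to real work.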

Installing Prometheus in Pre-Production Kubernetes

Production-like environments typically use Helm.

Add Prometheus Community Repo

helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts

Update:

helm repo update

Install kube-prometheus-stack

helm install monitoring \
prometheus-community/kube-prometheus-stack \
-n monitoring \
--create-namespace

This installs:

Prometheus
Grafana
Alertmanager
Node Exporter
Kube State Metrics

in one deployment.

Verify Installation

kubectl get pods -n monitoring

Expected (pods whose names contain):

prometheus
grafana
alertmanager
node-exporter
kube-state-metrics
operator

Access Grafana

kubectl port-forward svc/monitoring-grafana \
3000:80 \
-n monitoring

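Open:

http://localhost:3000

Log in as admin. The password is stored in the monitoring-grafana Secret under the admin-password key (kubectl get secret monitoring-grafana -n monitoring -o jsonpath="{.data.admin-password}" | base64 -d).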

Access Prometheus

kubectl port-forward svc/monitoring-kube-prometheus-prometheus \
9090:9090 \
-n monitoring

Open:

http://localhost:9090
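
With the port-forward running, Prometheus can also be queried programmatically through its HTTP API at /api/v1/query. The sketch below builds the request URL and extracts values from an instant-query response. It assumes the API is reachable at http://localhost:9090, and the sample payload is a hand-written illustration of the response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


def instant_query(base_url, promql):
    """Run the query against a live Prometheus (not called in this sketch)."""
    with urlopen(query_url(base_url, promql), timeout=5) as response:
        return json.load(response)


def vector_values(payload):
    """Map each instance label to its sample value for a vector result."""
    if payload["status"] != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return {
        result["metric"].get("instance", ""): float(result["value"][1])
        for result in payload["data"]["result"]
    }


# Hand-written example of an instant-query response for the expression `up`
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "up", "instance": "10.0.0.5:9100", "job": "node-exporter"},
             "value": [1718000000.0, "1"]},
        ],
    },
}

url = query_url("http://localhost:9090", "up")
values = vector_values(sample)
print(url)
print(values)
```

Sample values arrive as strings in the JSON response, which is why the sketch converts them with float().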

Production Monitoring Stack

A typical enterprise monitoring stack looks like:

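Kubernetes Cluster
       ↓
Node Exporter + Kube State Metrics
       ↓
Prometheus
       ↓
Alertmanager (notifications) + Grafana (dashboards)
       ↓
Operations Team

Alertmanager and Grafana both sit directly downstream of Prometheus: Prometheus pushes firing alerts to Alertmanager, while Grafana queries Prometheus for dashboard data.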

Example Alert Rule

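CPU Alert:

groups:
  - name: cpu-alerts
    rules:
      - alert: HighCPUUsage
        # Percentage of non-idle CPU time per instance over the last 5 minutes
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        # The condition must hold for 5 minutes before the alert fires
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"

node_cpu_seconds_total is an ever-increasing counter, so comparing its raw value with 90 would fire almost immediately and stay firing. The expression converts it into a usage percentage first.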

Grafana Dashboard Examples

Infrastructure Dashboard:

CPU Usage
Memory Usage
Disk Usage
Network Traffic

Kubernetes Dashboard:

Nodes
Pods
Deployments
Namespaces

Application Dashboard:

Request Rate
Error Rate
Latency
Availability

Monitoring Best Practices

Use Labels Properly

Good:

environment=prod
team=platform
service=payment
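
One reason labels like these work well is that each has a small, fixed set of values. Every distinct combination of label values becomes its own time series, so the worst-case series count is the product of the value counts. The sketch below illustrates that arithmetic with made-up label values; the user_id label is a hypothetical example of what to avoid.

```python
from math import prod


def series_count(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Bounded labels, in the spirit of the "good" example above
bounded = {
    "environment": ["dev", "qa", "preprod", "prod"],
    "team": ["platform", "payments"],
    "service": ["payment", "cart", "auth"],
}

# Adding an unbounded label such as a user ID multiplies the series count
unbounded = dict(bounded, user_id=[f"user-{i}" for i in range(10_000)])

bounded_count = series_count(bounded)
unbounded_count = series_count(unbounded)
print(bounded_count, unbounded_count)
```

Identifiers such as user IDs or request IDs generally belong in logs or traces, not in metric labels.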

Retain Metrics Wisely

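Avoid storing metrics forever. Prometheus keeps 15 days of data by default; set --storage.tsdb.retention.time to match how far back you actually query, and move long-term data to remote storage if you need it.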
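
A rough disk estimate helps when choosing a retention period. The Prometheus documentation gives the rule of thumb: retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies it. The series count and scrape interval are assumed example inputs, and the result is a planning estimate, not a guarantee.

```python
def estimate_disk_bytes(retention_days, active_series, scrape_interval_seconds,
                        bytes_per_sample=2.0):
    """Estimate TSDB disk usage from retention, series count, and scrape interval."""
    samples_per_second = active_series / scrape_interval_seconds
    retention_seconds = retention_days * 86_400
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed example: 100,000 active series scraped every 15s, kept for 15 days
estimate = estimate_disk_bytes(retention_days=15, active_series=100_000,
                               scrape_interval_seconds=15)
print(f"{estimate / 1e9:.2f} GB")
```

Doubling the retention period or halving the scrape interval doubles the estimate, which makes the trade-off easy to reason about.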

Create Actionable Alerts

Bad (fires on every short spike and needs no action):

CPU > 80%

Good (fires only when the condition is sustained):

CPU > 90% for 10 minutes
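
The difference between the two comes from requiring the condition to hold continuously before anyone is notified. The sketch below simulates that behavior on invented CPU readings taken every two minutes. It approximates how a `for` duration works in an alert rule; the state names mirror the ones Prometheus uses, but the function is illustrative and not the real implementation.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state after each evaluation of (timestamp_seconds, value) samples."""
    states = []
    active_since = None  # when the condition first became true in the current streak
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any dip below the threshold resets the streak
            states.append("inactive")
    return states


# Invented CPU readings: one short spike, then a sustained overload
cpu_readings = [(0, 95), (120, 50), (240, 92), (360, 93),
                (480, 95), (600, 96), (720, 94), (840, 97)]

states = alert_states(cpu_readings, threshold=90, for_seconds=600)
for (timestamp, value), state in zip(cpu_readings, states):
    print(f"t={timestamp:>3}s cpu={value}% -> {state}")
```

The spike at the start never gets past pending, while the sustained overload fires once it has lasted the full ten minutes.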

Separate Environments

Dev
QA
PreProd
Prod

should have independent monitoring.

Observability Tools Landscape

Monitoring:

Prometheus
Grafana
Datadog
New Relic
CloudWatch
Azure Monitor

Logging:

ELK Stack
EFK Stack
Loki
Splunk

Tracing:

Jaeger
Zipkin
Tempo
OpenTelemetry

What We'll Cover in Part Two

This article focused on:

Observability Fundamentals
Monitoring
Prometheus
Grafana

In Part Two we'll cover:

Logging
Centralized Log Management
ELK Stack
EFK Stack
Loki
Tracing
Jaeger
OpenTelemetry
Distributed Tracing
End-to-End Observability

Final Thoughts

Observability is one of the most important capabilities in modern cloud-native platforms.

Without observability:

Failures Become Guesswork

With observability:

Metrics
Logs
Traces
      ↓
Faster Troubleshooting
Better Reliability
Improved User Experience

For most organizations, the journey starts with:

Prometheus
+
Grafana

because they provide a powerful, scalable, and Kubernetes-native monitoring platform.

Once monitoring is established, the next step is adding:

Logging
+
Tracing

to achieve full-stack observability.