Rahul Joshi

Day 28 — 🔭 Monitoring & Observability Part One

Modern applications are no longer simple monolithic systems.

Today, organizations run:

  • Microservices
  • Kubernetes
  • Containers
  • Serverless Functions
  • Multi-Cloud Platforms
  • Distributed Systems

As infrastructure becomes more distributed, troubleshooting becomes significantly harder.

A single user request may travel through:

Frontend
    ↓
API Gateway
    ↓
Microservice A
    ↓
Microservice B
    ↓
Database

When something breaks, the biggest challenge becomes:

"What exactly happened?"

This is where Observability becomes critical.


What is Observability?

Observability is the ability to understand the internal state of a system by analyzing the data it produces.

In simple words:

Can we understand
what is happening
inside our systems?

Observability helps engineers answer:

  • Why is the application slow?
  • Which service is failing?
  • Which request caused the issue?
  • What changed recently?
  • Where is latency occurring?

Without observability:

Problem Exists
      ↓
Guessing Begins

With observability:

Problem Exists
      ↓
Evidence Available
      ↓
Faster Resolution

Why Observability Matters

Modern cloud-native systems generate enormous amounts of data.

Example:

100 Microservices
      ↓
Millions of Requests
      ↓
Thousands of Containers

Traditional monitoring alone is no longer sufficient.

Organizations need:

Visibility
Insights
Correlation
Root Cause Analysis

Observability provides all of them.


Monitoring vs Observability

Many people confuse monitoring and observability.

Monitoring asks:

What is wrong?

Observability asks:

Why is it wrong?

Example:

Monitoring:

CPU Usage = 95%

Observability:

Which service?
Which request?
Which dependency?
Which deployment caused it?

Observability provides context.


The Three Pillars of Observability

Modern observability is built on three primary pillars.

Metrics
Logs
Traces

Or:

Monitoring
Logging
Tracing

Together they provide a complete picture of system behavior.


Pillar 1: Monitoring (Metrics)

Monitoring focuses on numerical measurements.

Examples:

CPU Usage
Memory Usage
Request Rate
Error Rate
Latency
Disk Usage

Metrics answer:

How much?
How often?
How fast?
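
As a rough illustration of how those three questions map to numbers, the sketch below reduces a batch of raw request records to a request rate, an error rate, and a 95th-percentile latency. The Request record and the summarize function are hypothetical names made up for this example and are not part of any monitoring library.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One handled request (hypothetical record shape, for illustration only)."""
    duration_ms: float
    status: int


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Reduce raw requests to the kinds of numbers metrics usually carry."""
    if not requests:
        return {"request_rate": 0.0, "error_rate": 0.0, "p95_latency_ms": 0.0}
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = max(1, math.ceil(total * 95 / 100))
    return {
        "request_rate": total / window_seconds,  # how often (requests per second)
        "error_rate": errors / total,            # how much (share of failed requests)
        "p95_latency_ms": durations[rank - 1],   # how fast
    }


# Twenty synthetic requests observed over a 10-second window; every tenth one fails.
sample = [Request(duration_ms=10.0 * i, status=500 if i % 10 == 0 else 200)
          for i in range(1, 21)]
summary = summarize(sample, window_seconds=10.0)
print(summary)
```

Each result is a single number per time window, which is what makes metrics cheap to store and easy to alert on.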

Pillar 2: Logging

Logs provide detailed event information.

Example:

User Login Success
Database Connection Failed
API Request Received

Logs answer:

What happened?
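
Logs are easier to search and correlate when each event is emitted as structured data rather than free text. The sketch below uses Python's standard logging module to write one JSON object per event. The JsonFormatter class, the "auth" logger name, and the user_id and service fields are assumptions chosen for this example. A timestamp is left out only to keep the output deterministic.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context passed via `extra=` becomes attributes on the record.
        for key in ("user_id", "service"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("auth")  # hypothetical service name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example output in our stream only

logger.info("User Login Success", extra={"user_id": "u-123", "service": "auth"})
logger.error("Database Connection Failed", extra={"service": "auth"})

events = [json.loads(line) for line in stream.getvalue().splitlines()]
print(events)
```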

Pillar 3: Tracing

Tracing follows a request across multiple services.

Example:

User Request
      ↓
Frontend
      ↓
API
      ↓
Payment Service
      ↓
Database

Tracing answers:

Where did the request spend time?
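
To make that question concrete, here is a minimal sketch of a trace as a list of timed spans, one per hop in the example above. The Span class, the timings, and the self_time helper are all invented for illustration, and the calculation assumes each span has at most one child so child durations never overlap. Real tracing systems record much richer span data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """One timed step of a request (a simplified stand-in for a real trace span)."""
    name: str
    parent: Optional[str]
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def self_time(spans: list[Span]) -> dict[str, int]:
    """Time spent in each span itself, excluding time spent in its direct children."""
    result = {s.name: s.duration_ms for s in spans}
    for s in spans:
        if s.parent is not None:
            result[s.parent] -= s.duration_ms
    return result


# Made-up timings for a single request passing through four hops.
trace = [
    Span("Frontend", None, 0, 500),
    Span("API", "Frontend", 20, 480),
    Span("Payment Service", "API", 40, 460),
    Span("Database", "Payment Service", 60, 400),
]

times = self_time(trace)
slowest = max(times, key=times.get)
print(times, slowest)
```

In this made-up trace, most of the 500 ms is spent in the database call, which is the kind of answer neither a CPU graph nor a single log line gives directly.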

Why Metrics Matter First

Among all observability signals:

Metrics

are usually the first thing engineers implement.

Reasons:

  • Lightweight
  • Efficient
  • Fast alerting
  • Low storage cost
  • Easy visualization

This is a large part of why Prometheus became the de facto standard for metrics in Kubernetes environments.


What is Prometheus?

Prometheus is an open-source monitoring and alerting system originally developed at SoundCloud. It is now a graduated project of the Cloud Native Computing Foundation (CNCF).

Prometheus collects:

Metrics

from applications and infrastructure.

Example:

CPU
Memory
Network
Latency
Errors

Why Prometheus Became Popular

Before Prometheus:

Monitoring Tools
      ↓
Complex
Expensive
Difficult Scaling

Prometheus introduced:

Pull-Based Collection
Powerful Query Language
Kubernetes Integration
Open Source


Understanding Prometheus Components


Prometheus Server

Core component.

Responsible for:

  • Metric collection
  • Storage
  • Query processing
  • Alert rule evaluation (firing alerts are handed to Alertmanager)

Exporters

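Prometheus scrapes metrics over HTTP. Applications can expose metrics directly, and exporters translate metrics from systems that do not speak the Prometheus format.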

Examples:

Node Exporter
MySQL Exporter
MongoDB Exporter
Redis Exporter
Blackbox Exporter
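
What an exporter ultimately serves is plain text that Prometheus scrapes over HTTP. The sketch below renders a metric in that text exposition format. The render_metric function and the queue_depth metric are hypothetical, and the sketch skips details a real exporter handles, such as escaping special characters in label values, so treat it as an illustration of the format rather than a replacement for an official client library.

```python
def render_metric(
    name: str,
    help_text: str,
    metric_type: str,
    samples: list[tuple[dict[str, str], float]],
) -> str:
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        if labels:
            # Labels are written as key="value" pairs inside curly braces.
            label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical gauge: number of jobs waiting in two queues.
body = render_metric(
    "queue_depth",
    "Jobs waiting in the queue.",
    "gauge",
    [({"queue": "email"}, 7.0), ({"queue": "sms"}, 2.0)],
)
print(body)
```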

Alertmanager

Receives alerts fired by Prometheus, then deduplicates, groups, and routes them to notification channels.

Example:

CPU > 90%
      ↓
Alertmanager
      ↓
Email
Slack
Teams
PagerDuty

Time-Series Database

Prometheus stores each metric as a time series. A series is identified by a metric name plus its labels, and holds a sequence of samples:

Timestamp + Value

Example:

10:00 CPU=45%
10:01 CPU=48%
10:02 CPU=51%
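
The data model can be sketched in a few lines: each unique combination of metric name and labels is its own series, and each series is an append-only list of (timestamp, value) samples. The TinyTSDB class, the cpu_usage_percent metric, and the instance names below are invented for this example. The real storage engine is far more sophisticated (compression, on-disk blocks, indexing).

```python
from collections import defaultdict

# A series is identified by its metric name plus its sorted label pairs.
SeriesKey = tuple[str, tuple[tuple[str, str], ...]]


class TinyTSDB:
    """Toy in-memory time-series store, for illustration only."""

    def __init__(self) -> None:
        self._series: dict[SeriesKey, list[tuple[int, float]]] = defaultdict(list)

    @staticmethod
    def _key(name: str, labels: dict[str, str]) -> SeriesKey:
        return (name, tuple(sorted(labels.items())))

    def append(self, name: str, labels: dict[str, str], timestamp: int, value: float) -> None:
        self._series[self._key(name, labels)].append((timestamp, value))

    def query_range(self, name: str, labels: dict[str, str], start: int, end: int) -> list[tuple[int, float]]:
        samples = self._series.get(self._key(name, labels), [])
        return [(t, v) for t, v in samples if start <= t <= end]

    def series_count(self) -> int:
        return len(self._series)


db = TinyTSDB()
# Timestamps are seconds since an arbitrary start (0 = 10:00, 60 = 10:01, ...).
for ts, cpu in [(0, 45.0), (60, 48.0), (120, 51.0)]:
    db.append("cpu_usage_percent", {"instance": "web-1"}, ts, cpu)
db.append("cpu_usage_percent", {"instance": "web-2"}, 0, 12.0)

recent = db.query_range("cpu_usage_percent", {"instance": "web-1"}, 60, 120)
print(recent, db.series_count())
```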

What is Grafana?

Grafana is a visualization platform used to create dashboards from Prometheus metrics and other data sources.

Prometheus stores data.

Grafana visualizes data.

Relationship:

Prometheus
      ↓
Metrics
      ↓
Grafana
      ↓
Dashboards

Why Grafana is Popular

Grafana provides:

  • Beautiful dashboards
  • Alerting
  • Multiple data sources
  • Real-time visualization

Supported sources:

Prometheus
Elasticsearch
Loki
InfluxDB
CloudWatch
Azure Monitor

Prometheus + Grafana Architecture

Applications
      ↓
Exporters
      ↓
Prometheus
      ↓
Grafana
      ↓
Engineers

Common Metrics Monitored

Infrastructure:

CPU
Memory
Disk
Network

Application:

Request Rate
Response Time
Error Rate

Kubernetes:

Pod Count
Node Status
Container CPU
Container Memory

Installing Prometheus in Development Environment

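For local development, Docker is usually the easiest option. The containers in this section can reach each other by name (for example, http://prometheus:9090) only when they are attached to the same user-defined Docker network, which you can create with docker network create and pass to each docker run with --network.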


Run Prometheus Container

docker run -d \
--name prometheus \
-p 9090:9090 \
prom/prometheus

Verify:

http://localhost:9090

Check Targets

Navigate:

Status
   ↓
Targets
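
The same information shown on the Targets page is available from the Prometheus HTTP API at /api/v1/targets, which is handy for scripted checks. The sketch below lists scrape URLs of targets that are not healthy. The function names and the sample payload are made up for this example, and the payload is trimmed to the fields the sketch reads. fetch_targets is defined but not called here, since it needs a running Prometheus at the given base URL.

```python
import json
from urllib.request import urlopen


def fetch_targets(base_url: str) -> dict:
    """Fetch target status from a running Prometheus (requires network access)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as response:
        return json.load(response)


def unhealthy_targets(payload: dict) -> list[str]:
    """Return scrape URLs of active targets whose health is not 'up'."""
    return [
        target["scrapeUrl"]
        for target in payload["data"]["activeTargets"]
        if target["health"] != "up"
    ]


# Hand-written sample shaped like an API response (trimmed for brevity).
sample_payload = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"scrapeUrl": "http://localhost:9090/metrics", "health": "up"},
            {"scrapeUrl": "http://node-exporter:9100/metrics", "health": "down"},
        ]
    },
}

down = unhealthy_targets(sample_payload)
print(down)
```

Against a local instance, the call would look like unhealthy_targets(fetch_targets("http://localhost:9090")).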

Installing Node Exporter

docker run -d \
--name node-exporter \
-p 9100:9100 \
prom/node-exporter

This exposes:

CPU Metrics
Memory Metrics
Disk Metrics

Configure Prometheus

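Example prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
        # localhost inside the Prometheus container is the container itself,
        # so point at an address that reaches Node Exporter, such as its
        # container name on a shared Docker network.
        - node-exporter:9100

Mount this file into the container at /etc/prometheus/prometheus.yml (the image's default config path) and restart Prometheus.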


Installing Grafana in Development Environment

Run Grafana:

docker run -d \
--name grafana \
-p 3000:3000 \
grafana/grafana

Access:

http://localhost:3000

Default credentials (Grafana prompts you to change the password on first login):

admin/admin

Connect Grafana to Prometheus

Add Data Source:

Grafana
    ↓
Connections
    ↓
Data Sources
    ↓
Prometheus

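URL (the prometheus hostname resolves only when both containers share a Docker network; otherwise use an address that is reachable from inside the Grafana container):

http://prometheus:9090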

Save and Test.


Creating First Dashboard

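Example panel query:

100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Shows CPU usage per instance as a percentage. node_cpu_seconds_total is a counter of CPU seconds per mode, so usage is derived as 100 minus the idle share.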
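
To see why a counter needs rate() before it means anything, the sketch below computes a simplified per-second rate from counter samples and turns an idle-CPU counter into a usage percentage. The increase and simple_rate functions and the sample values are assumptions for this example. The real PromQL rate() also extrapolates to the edges of the time window, which this sketch does not attempt.

```python
def increase(samples: list[tuple[float, float]]) -> float:
    """Total growth of a counter across (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter only goes up; a drop means the process restarted from zero.
        total += (curr - prev) if curr >= prev else curr
    return total


def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Average per-second growth between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed


# Made-up idle-CPU counter for one core, sampled once a minute over 5 minutes.
idle_seconds = [(0, 1000.0), (60, 1045.0), (120, 1090.0),
                (180, 1135.0), (240, 1180.0), (300, 1225.0)]

idle_share = simple_rate(idle_seconds)   # idle seconds per wall-clock second
cpu_usage_percent = 100 - idle_share * 100

# A counter that resets between the second and third sample.
with_reset = [(0, 100.0), (60, 130.0), (120, 10.0), (180, 40.0)]
growth_despite_reset = increase(with_reset)

print(cpu_usage_percent, growth_despite_reset)
```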


Installing Prometheus in Pre-Production Kubernetes

Production-like environments typically use Helm.


Add Prometheus Community Repo

helm repo add prometheus-community \
https://prometheus-community.github.io/helm-charts

Update:

helm repo update

Install kube-prometheus-stack

helm install monitoring \
prometheus-community/kube-prometheus-stack \
-n monitoring \
--create-namespace

This installs:

Prometheus Operator
Prometheus
Grafana
Alertmanager
Node Exporter
Kube State Metrics

in a single Helm release.


Verify Installation

kubectl get pods -n monitoring

The pod list should include names containing:

prometheus
grafana
alertmanager
node-exporter
kube-state-metrics

Access Grafana

kubectl port-forward svc/monitoring-grafana \
3000:80 \
-n monitoring

Open:

http://localhost:3000

Access Prometheus

kubectl port-forward svc/monitoring-kube-prometheus-prometheus \
9090:9090 \
-n monitoring

Open:

http://localhost:9090

Production Monitoring Stack

A typical enterprise monitoring stack looks like:

Kubernetes Cluster
       ↓
Node Exporter
       ↓
Prometheus
       ↓
Alertmanager
       ↓
Grafana
       ↓
Operations Team

Example Alert Rule

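CPU Alert:

groups:
- name: cpu-alerts
  rules:
  - alert: HighCPUUsage
    # CPU usage in percent: 100 minus the idle share over the last 5 minutes.
    # node_cpu_seconds_total is a counter of seconds, so comparing it to 90
    # directly would not measure usage.
    expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m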

Grafana Dashboard Examples

Infrastructure Dashboard:

CPU Usage
Memory Usage
Disk Usage
Network Traffic

Kubernetes Dashboard:

Nodes
Pods
Deployments
Namespaces

Application Dashboard:

Request Rate
Error Rate
Latency
Availability

Monitoring Best Practices


Use Labels Properly

Good:

environment=prod
team=platform
service=payment
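
Labels like these work well because each has a small, bounded set of values. Every unique label combination becomes its own time series, so the worst-case series count for one metric is the product of the label cardinalities. The quick arithmetic below uses made-up counts, and user_id is a hypothetical example of a label that is best avoided.

```python
import math


def worst_case_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())


# Made-up counts of distinct values per label.
bounded = {"environment": 4, "team": 5, "service": 30}
unbounded = {**bounded, "user_id": 100_000}  # hypothetical high-cardinality label

bounded_series = worst_case_series(bounded)
unbounded_series = worst_case_series(unbounded)
print(bounded_series, unbounded_series)
```

Going from a few hundred possible series to tens of millions for a single metric is why identifiers such as user or request IDs usually belong in logs or traces rather than in metric labels.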

Retain Metrics Wisely

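Avoid storing metrics forever. Prometheus keeps 15 days of data by default. Adjust this with the --storage.tsdb.retention.time flag to match how far back you actually query.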


Create Actionable Alerts

Bad:

CPU > 80%

Good:

CPU > 90% for 10 minutes
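
The difference comes from requiring the condition to hold continuously before anyone is notified. The sketch below models that as a small state machine that moves from inactive to pending to firing, loosely mirroring how a rule with a "for" duration behaves. The ThresholdAlert class and the readings are hypothetical, and real rule evaluation has more detail than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fires only after the value stays above the threshold for `for_seconds`."""
    threshold: float
    for_seconds: int
    pending_since: Optional[int] = None

    def evaluate(self, timestamp: int, value: float) -> str:
        if value <= self.threshold:
            # Condition cleared: forget any pending state.
            self.pending_since = None
            return "inactive"
        if self.pending_since is None:
            self.pending_since = timestamp
        if timestamp - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


alert = ThresholdAlert(threshold=90.0, for_seconds=600)  # CPU > 90% for 10 minutes

# A one-minute spike, a recovery, then a sustained high-CPU period (one sample per minute).
readings = [(0, 95.0), (60, 50.0)] + [(t, 95.0) for t in range(120, 781, 60)]
states = [alert.evaluate(t, v) for t, v in readings]
print(states)
```

The short spike at the start never leaves the pending state, so nobody is paged for it. Only the sustained period eventually fires.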

Separate Environments

Dev
QA
PreProd
Prod

should have independent monitoring.


Observability Tools Landscape

Monitoring:

Prometheus
Grafana
Datadog
New Relic
CloudWatch
Azure Monitor

Logging:

ELK Stack
EFK Stack
Loki
Splunk

Tracing:

Jaeger
Zipkin
Tempo
OpenTelemetry

What We'll Cover in Part Two

This article focused on:

Observability Fundamentals
Monitoring
Prometheus
Grafana

In Part Two we'll cover:

Logging
Centralized Log Management
ELK Stack
EFK Stack
Loki
Tracing
Jaeger
OpenTelemetry
Distributed Tracing
End-to-End Observability

Final Thoughts

Observability is one of the most important capabilities in modern cloud-native platforms.

Without observability:

Failures Become Guesswork

With observability:

Metrics
Logs
Traces
      ↓
Faster Troubleshooting
Better Reliability
Improved User Experience

For most organizations, the journey starts with:

Prometheus
+
Grafana

because they provide a powerful, scalable, and Kubernetes-native monitoring platform.

Once monitoring is established, the next step is adding:

Logging
+
Tracing

to achieve full-stack observability.
