
πŸš€ Demystifying Observability: A Practical Guide with Python, Prometheus, and Grafana

🧐 Link to the practice repository:
PRESS ME


In the modern era of distributed systems and microservices, "monitoring" is no longer enough. We need Observability. But what exactly is the difference, and how can we implement it without getting lost in complex configurations?

In this article, we will explore the core concepts of observability and build a real-world example using Python (Flask), Prometheus, and Grafana.

What is Observability?

While monitoring tells you when something is wrong (e.g., "The server is down"), observability lets you understand why it is wrong by asking new questions of your system from the outside, without shipping new code.

It is often categorized into three pillars:

  1. Logs: A record of discrete events (e.g., "User X logged in").
  2. Metrics: Aggregated numerical data over time (e.g., "CPU usage is at 80%").
  3. Traces: The path of a request through your distributed system.

Today, we will focus heavily on Metrics using the industry-standard tool: Prometheus.

The Stack

For our practical exercise, we will use:

  • Application: A simple Python Flask API.
  • Instrumentation: prometheus-client library to expose metrics.
  • Storage & Scraping: Prometheus.
  • Visualization: Grafana.

Step 1: The Application Code

We need an application that doesn't just work, but talks to us. We will create a Flask app and instrument it to track:

  1. Request Count: How many requests we receive (broken down by status code and method).
  2. Latency: How long requests take to process.

Here is the core logic (app/main.py):

from flask import Flask, Response, request
import time
import random
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

# 1. Define Prometheus Metrics
REQUEST_COUNT = Counter(
    'app_request_count', 
    'Application Request Count', 
    ['method', 'endpoint', 'http_status']
)

REQUEST_LATENCY = Histogram(
    'app_request_latency_seconds', 
    'Application Request Latency', 
    ['method', 'endpoint']
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    request_latency = time.time() - request.start_time
    # 2. Record metrics after every request
    REQUEST_LATENCY.labels(request.method, request.path).observe(request_latency)
    REQUEST_COUNT.labels(request.method, request.path, response.status_code).inc()
    return response

@app.route('/metrics')
def metrics():
    # 3. Expose metrics for Prometheus to scrape
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

# ... routes for / and /error ...
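The / and /error routes are elided above. Here is a minimal sketch of what they might look like (hypothetical handlers, the practice repo's versions may differ): one gives the latency histogram some visible spread, the other produces non-200 samples for the counter.

```python
from flask import Flask
import random
import time

app = Flask(__name__)

@app.route('/')
def index():
    # Sleep a random amount so the latency histogram has visible spread
    time.sleep(random.uniform(0.05, 0.3))
    return "Hello, Observability!"

@app.route('/error')
def error():
    # Return a 500 on purpose so error-rate queries have data to show
    return "Something went wrong", 500
```

Combined with the before_request/after_request hooks above, every hit to these routes is recorded automatically, with no per-route instrumentation code.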

Step 2: Configuring Prometheus

Prometheus needs to know where to look for metrics. We create a prometheus.yml file:

global:
  scrape_interval: 5s

scrape_configs:
  - job_name: 'flask_app'
    static_configs:
      - targets: ['app:5000']

This tells Prometheus to fetch http://app:5000/metrics every 5 seconds and store the samples it finds there. The hostname app resolves through Docker Compose's internal network, which we set up in the next step.

Step 3: Orchestration with Docker Compose

To make this easy to run anywhere, we use Docker Compose to spin up the App, Prometheus, and Grafana simultaneously.

version: '3.8'
services:
  app:
    build: ./app
    ports: ["5000:5000"]
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
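The build: ./app line assumes a Dockerfile sits next to main.py. The repo has its own, but a minimal version might look like this (hypothetical, adjust to match your layout). Note the --host=0.0.0.0 bind: Flask's default of 127.0.0.1 would make the app unreachable from the Prometheus container.

```
FROM python:3.11-slim
WORKDIR /app
COPY main.py .
RUN pip install flask prometheus-client
# Bind to 0.0.0.0 so other containers (Prometheus) can reach the app
CMD ["flask", "--app", "main", "run", "--host=0.0.0.0", "--port=5000"]
```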

Seeing it in Action

Once the containers are running, we can generate some traffic to our app.

  1. Generate Traffic: Refresh http://localhost:5000/ and http://localhost:5000/error a few times.
  2. Check Prometheus: Go to http://localhost:9090 and search for app_request_count. You will see the raw data increasing.
  3. Visualize in Grafana:
    • Go to http://localhost:3000 (admin/admin).
    • Add Prometheus as a Data Source (http://prometheus:9090).
    • Create a dashboard to visualize rate(app_request_count[1m]).
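Beyond the basic rate query, the Histogram we defined unlocks percentile queries, because prometheus-client automatically exposes _bucket, _sum, and _count series for it. Two example PromQL expressions using our metric names (standard Prometheus functions; adjust the time windows to taste):

```
# Requests per second, broken down by status code
sum(rate(app_request_count[1m])) by (http_status)

# Approximate 95th-percentile request latency over the last 5 minutes
histogram_quantile(0.95, sum(rate(app_request_latency_seconds_bucket[5m])) by (le))
```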

Conclusion

Observability is not just for giant tech companies. With a few lines of code and open-source tools, you can gain deep insights into your application's health. Instead of guessing why your app is slow, you can look at the histogram metrics and know.

Happy Coding! πŸš€
