DEV Community

👁️ Stop Flying Blind: Implementing Observability Practices in Production (Python, Prometheus & Grafana)

Have you ever been woken up at 3:00 AM because "the server is down," only to spend the next four hours grepping through messy text files trying to figure out why?

In modern distributed systems, traditional monitoring (asking "Is the system working?") is no longer enough. We need Observability (asking "Why is the system not working, and where exactly is the bottleneck?").

In this article, we will break down the core practices of Observability and implement a real-world, automated solution using Python (FastAPI), Prometheus, and Grafana.


πŸ›οΈ The Three Pillars of Observability

To achieve true observability, your system must emit three types of telemetry data:

  1. Metrics: Numeric representations of data measured over time (e.g., CPU usage, request latency, error rates). They tell you that something is wrong.
  2. Logs: Immutable, timestamped records of discrete events (e.g., "User 123 failed authentication"). They tell you what went wrong.
  3. Traces: A representation of a series of causally related distributed events that encode the end-to-end request flow. They tell you where it went wrong.
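
To make the "Logs" pillar concrete, here is a minimal sketch that uses only the Python standard library to emit one JSON object per event. The logger name and the `user_id` and `trace_id` fields are illustrative assumptions, not part of the application built later; sharing a trace ID between logs and traces is one common way to link the two.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line (field names are hypothetical)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in ("user_id", "trace_id"):
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger
```

Calling `build_logger().info("authentication failed", extra={"user_id": 123, "trace_id": "abc123"})` would then produce a single machine-parseable line instead of free-form text.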

💡 Best Practice: The RED Method for Metrics

When instrumenting web services, always follow the RED method for your metrics:

  • Rate: The number of requests per second.
  • Errors: The number of those requests that are failing.
  • Duration: The amount of time those requests take.
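
As a rough illustration of what each letter measures, the sketch below computes the three RED numbers from a list of request samples collected over one time window. The `RequestSample` type, the `red_summary` function, and the choice of the 95th percentile for Duration are assumptions made for this example. In the stack built later in this article, Prometheus derives these values at query time from the counter and histogram rather than in application code.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestSample:
    status: int        # HTTP status code returned to the client
    duration_s: float  # time spent serving the request, in seconds


def red_summary(samples: list[RequestSample], window_s: float) -> dict[str, float]:
    """Summarize Rate, Errors, and Duration for the requests seen in one window."""
    total = len(samples)
    if total == 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_duration_s": 0.0}

    errors = sum(1 for s in samples if s.status >= 500)
    durations = sorted(s.duration_s for s in samples)
    if total >= 2:
        # n=20 yields 19 cut points; the last one is the 95th percentile.
        p95 = quantiles(durations, n=20, method="inclusive")[-1]
    else:
        p95 = durations[0]

    return {
        "rate_per_s": total / window_s,   # Rate
        "error_ratio": errors / total,    # Errors, as a fraction of all requests
        "p95_duration_s": p95,            # Duration
    }
```

Reporting a percentile rather than a mean is a deliberate choice here: a handful of very slow requests can hide behind a healthy-looking average.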

🚀 Real-World Example: Instrumenting a FastAPI Application

Let's build a simple e-commerce checkout API and instrument it to expose metrics to Prometheus.

1. The Application Code (main.py)

We will use the prometheus_client library to create custom metrics for our API.

# main.py
import time
import random
from fastapi import FastAPI, HTTPException
from prometheus_client import make_asgi_app, Counter, Histogram
import uvicorn

app = FastAPI(title="E-Commerce Checkout API")

# Mount the Prometheus ASGI app to expose the /metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

# Define our Metrics (Following the RED method)
REQUEST_COUNT = Counter(
    'checkout_requests_total', 
    'Total number of checkout requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'checkout_request_latency_seconds', 
    'Latency of checkout requests in seconds',
    ['endpoint']
)

@app.post("/checkout")
def process_checkout():
    start_time = time.time()

    # Simulate processing time (Duration)
    time.sleep(random.uniform(0.1, 0.5))

    # Simulate a random failure (Errors)
    if random.random() < 0.2:  # 20% failure rate
        REQUEST_COUNT.labels(method='POST', endpoint='/checkout', status='500').inc()
        REQUEST_LATENCY.labels(endpoint='/checkout').observe(time.time() - start_time)
        raise HTTPException(status_code=500, detail="Payment gateway timeout")

    # Success (Rate)
    REQUEST_COUNT.labels(method='POST', endpoint='/checkout', status='200').inc()
    REQUEST_LATENCY.labels(endpoint='/checkout').observe(time.time() - start_time)

    return {"status": "success", "message": "Checkout complete"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

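The handler above records its metrics in two places, once on the failure path and once on the success path. One possible way to avoid that duplication is a small context manager that times the block and reports the outcome exactly once. This is a sketch under assumptions: `track_request` and its `record` callback are hypothetical names, and the helper treats any exception without a `status_code` attribute as a 500.

```python
import time
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def track_request(record: Callable[[str, float], None]) -> Iterator[None]:
    """Time the wrapped block and report (status, duration in seconds) exactly once."""
    start = time.perf_counter()
    status = "200"
    try:
        yield
    except Exception as exc:
        # FastAPI's HTTPException carries a status_code; fall back to 500 otherwise.
        status = str(getattr(exc, "status_code", 500))
        raise  # re-raise so the framework still produces the error response
    finally:
        record(status, time.perf_counter() - start)
```

In main.py, `record` could be a two-line function that calls `REQUEST_COUNT.labels(method='POST', endpoint='/checkout', status=status).inc()` and `REQUEST_LATENCY.labels(endpoint='/checkout').observe(seconds)`, with the body of `process_checkout` wrapped in `with track_request(record):`.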

2. The Requirements (requirements.txt)

fastapi==0.103.1
uvicorn==0.23.2
prometheus-client==0.17.1


🐳 Infrastructure & Orchestration (Docker Compose)

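To see our metrics, we need Prometheus to scrape our app and Grafana to visualize the results. We can spin up the entire observability stack locally with docker-compose.yaml. Note that `build: .` expects a Dockerfile for the API in the project root, which is not shown here.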

# docker-compose.yaml
version: '3.8'

services:
  api:
    build: .
    ports:
      - "8000:8000"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus


Prometheus Config (prometheus.yml):

scrape_configs:
  - job_name: 'fastapi_app'
    scrape_interval: 5s
    static_configs:
      - targets: ['api:8000']


When you run docker-compose up, Prometheus will scrape http://api:8000/metrics every 5 seconds. In Grafana (http://localhost:3000, user admin with the password set above), add Prometheus as a data source at http://prometheus:9090, and you can build dashboards showing your API's request rate, error rate, and latency.
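
Before opening Grafana, it can help to confirm that the API really exposes the checkout metrics. The sketch below fetches the endpoint with the standard library and parses the Prometheus text format into a dictionary. The function names are hypothetical, and the parser is deliberately simplified: it assumes each sample line ends with its value and carries no timestamp, which is what prometheus_client emits by default.

```python
import urllib.request


def parse_samples(exposition: str) -> dict[str, float]:
    """Map each 'name{labels}' series to its value, skipping comments and blank lines."""
    samples: dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no sample
        # The value is the last space-separated token; label values may contain spaces.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> dict[str, float]:
    """Download and parse the metrics page of the running API."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_samples(response.read().decode("utf-8"))
```

With the stack running and a few POST requests sent to /checkout, `fetch_metrics()` should include series whose names start with `checkout_requests_total` and `checkout_request_latency_seconds`.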


☁️ Automation: Enforcing Code Quality with CI/CD

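Observability starts with good code. Let's ensure our instrumented app lints cleanly and builds before merging. Here is our GitHub Actions pipeline (.github/workflows/ci.yml), which checks main.py for syntax errors and undefined names and then builds the Docker image: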

name: CI/CD Observability API

on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]

jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install flake8

      - name: Lint Code
        run: flake8 main.py --count --select=E9,F63,F7,F82 --show-source --statistics

      - name: Build Docker Image
        run: docker build -t my-observable-api:latest .


🎯 Conclusion

By instrumenting our code from day one, we transition from reactive firefighting to proactive engineering. Using the RED method alongside Prometheus and Grafana ensures that when our API starts slowing down or failing, we don't have to guess: the dashboard shows us which endpoint is affected, how often it fails, and how slow it has become.

How do you handle observability in your projects? Let me know in the comments! 👇
