You can't fix what you can't see. Monitoring tells you when things break, why they broke, and how to prevent it. Here's how to set it up properly.
## The Three Pillars of Observability
- Metrics: Numeric measurements over time (request count, latency, CPU usage)
- Logs: Discrete events with context (errors, warnings, audit trails)
- Traces: Request flow across services (distributed tracing)
## Adding Prometheus Metrics to Python

```bash
pip install prometheus-client
```
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

# Use in your app
def handle_request(method, endpoint):
    ACTIVE_CONNECTIONS.inc()
    start = time.time()
    try:
        result = process_request(method, endpoint)
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='200').inc()
        return result
    except Exception:
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='500').inc()
        raise
    finally:
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(time.time() - start)
        ACTIVE_CONNECTIONS.dec()

# Expose metrics endpoint
start_http_server(9090)  # serves GET /metrics on port 9090
```
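To see why the `buckets=` parameter matters, here is a toy, stdlib-only sketch of how a Prometheus histogram records an observation: every bucket whose upper bound covers the value is incremented (buckets are cumulative), alongside a running count and sum. The class and names are illustrative only, not part of prometheus-client.

```python
import bisect

# Toy sketch of Prometheus histogram semantics: each observation increments
# every bucket whose upper bound is >= the value (cumulative buckets), plus
# a running count and sum. Illustrative only, not the real client library.
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]

class ToyHistogram:
    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = [0] * (len(bounds) + 1)  # +1 for the implicit +Inf bucket
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # First bucket whose bound covers the value (le is inclusive),
        # then increment it and every larger bucket.
        i = bisect.bisect_left(self.bounds, value)
        for j in range(i, len(self.counts)):
            self.counts[j] += 1
        self.total += 1
        self.sum += value

h = ToyHistogram(BUCKETS)
for latency in [0.02, 0.03, 0.2, 0.7, 3.0]:
    h.observe(latency)

print(h.counts)  # [0, 2, 2, 3, 3, 4, 4, 5, 5] — cumulative, last entry is +Inf
```

This cumulative shape is what `histogram_quantile()` consumes on the Prometheus side, which is why picking bucket bounds that bracket your SLO (e.g. 0.25 and 0.5 around a 300 ms target) gives much better quantile estimates.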
## FastAPI Middleware for Automatic Metrics
```python
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response
import time

app = FastAPI()

REQUEST_COUNT = Counter('requests_total', 'Total requests', ['method', 'path', 'status'])
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Request latency', ['method', 'path'])

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    REQUEST_COUNT.labels(
        method=request.method,
        path=request.url.path,
        status=str(response.status_code)
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        path=request.url.path
    ).observe(duration)
    return response

@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
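One caveat with labeling by raw `request.url.path`: paths that embed IDs (`/users/123`, `/orders/<uuid>`) create a new time series per ID and can bloat Prometheus memory. A hedged, stdlib-only sketch of one way to collapse those segments before labeling — `normalize_path` is a hypothetical helper, not part of FastAPI or prometheus-client:

```python
import re

# Hypothetical helper: collapse high-cardinality path segments (numeric IDs,
# UUIDs) into placeholders before using the path as a metric label value.
UUID_RE = re.compile(
    r'[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-'
    r'[0-9a-fA-F]{4}-[0-9a-fA-F]{12}'
)

def normalize_path(path: str) -> str:
    parts = []
    for segment in path.split('/'):
        if segment.isdigit():
            parts.append('{id}')          # e.g. /users/123 -> /users/{id}
        elif UUID_RE.fullmatch(segment):
            parts.append('{uuid}')        # e.g. /orders/<uuid> -> /orders/{uuid}
        else:
            parts.append(segment)
    return '/'.join(parts)

print(normalize_path('/users/123/orders/550e8400-e29b-41d4-a716-446655440000'))
# → /users/{id}/orders/{uuid}
```

In the middleware above you would call `normalize_path(request.url.path)` wherever the path is used as a label.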
## Docker Compose: Full Monitoring Stack
```yaml
services:
  app:
    build: .
    ports:
      - "8000:8000"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  grafana_data:
```
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['app:8000']
```
## Structured Logging
```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    logger_factory=structlog.PrintLoggerFactory(),
)

log = structlog.get_logger()

def process_order(order_id: str, user_id: str):
    log.info("processing_order", order_id=order_id, user_id=user_id)
    try:
        result = charge_payment(order_id)
        log.info("order_completed", order_id=order_id, amount=result["amount"])
    except Exception as e:
        log.error("order_failed", order_id=order_id, error=str(e))
        raise
```
Output:

```json
{"event": "processing_order", "order_id": "123", "user_id": "456", "level": "info", "timestamp": "2024-01-15T10:30:00Z"}
```
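If you can't add a dependency, the same one-JSON-object-per-line idea can be sketched with only the standard library. This is a minimal illustration of the pattern, not what structlog does internally — structlog adds bound context, processor pipelines, and much more:

```python
import json
import logging
from datetime import datetime, timezone

# Minimal stdlib sketch of structured logging: render each record as one
# JSON object per line. Extra key/value pairs ride along in a custom
# "fields" attribute passed via `extra=`.
class JSONFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        payload.update(getattr(record, "fields", {}))  # merge structured fields
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("processing_order", extra={"fields": {"order_id": "123"}})
```

The output shape matches the structlog example above, so downstream log parsers don't care which produced it.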
## Alerting Rules
```yaml
# alert_rules.yml
groups:
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(requests_total{status="500"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(request_latency_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency above 2s"
```
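The `histogram_quantile()` call in the latency alert estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. A simplified Python sketch of that idea (Prometheus's real implementation handles more edge cases):

```python
# Simplified sketch of quantile estimation from cumulative histogram buckets,
# the idea behind PromQL's histogram_quantile(). Not the actual algorithm.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count);
    the last bound must be float('inf')."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # can't interpolate into the +Inf bucket
            # Linear interpolation within this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 0.1s, 90 under 0.5s, 99 under 1.0s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float('inf'), 100)]
print(histogram_quantile(0.95, buckets))  # ~0.778s
```

This is why bucket choice matters for alerts: the estimate can only land between the bucket bounds that bracket the quantile.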
## Key Takeaways
- Use `Counter` for values that only go up (requests, errors)
- Use `Histogram` for latency and size distributions
- Use `Gauge` for values that go up and down (connections, queue size)
- Structured logging makes logs searchable and parseable
- Set up alerts for error rates and latency, not just uptime
- Start with RED metrics: Rate, Errors, Duration