郑沛沛
Application Monitoring with Prometheus and Grafana: A Developer's Guide

You can't fix what you can't see. Monitoring tells you when things break, why they broke, and how to prevent it. Here's how to set it up properly.

The Three Pillars of Observability

  1. Metrics: Numeric measurements over time (request count, latency, CPU usage)
  2. Logs: Discrete events with context (errors, warnings, audit trails)
  3. Traces: Request flow across services (distributed tracing)

Adding Prometheus Metrics to Python

pip install prometheus-client
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)

ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

# Use in your app
def handle_request(method, endpoint):
    ACTIVE_CONNECTIONS.inc()
    start = time.time()
    try:
        result = process_request(method, endpoint)
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=200).inc()
        return result
    except Exception:
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=500).inc()
        raise
    finally:
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(time.time() - start)
        ACTIVE_CONNECTIONS.dec()

# Expose metrics endpoint (pick a port that won't clash with Prometheus itself, which uses 9090)
start_http_server(8000)  # GET /metrics on port 8000
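Under the hood, start_http_server serves the Prometheus text exposition format. As a rough illustration of what a scrape actually returns, here is a stdlib-only sketch of how a labeled counter is rendered (render_counter is a made-up helper, for illustration only; prometheus-client does this for you):

```python
# Illustrative sketch of the Prometheus text exposition format served
# at /metrics. The render_counter helper is hypothetical -- in practice
# prometheus-client's generate_latest() produces this output.
def render_counter(name, help_text, samples):
    """Render a counter family. samples: list of (labels_dict, value) pairs."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"

print(render_counter(
    "http_requests_total",
    "Total HTTP requests",
    [({"method": "GET", "endpoint": "/users", "status": "200"}, 42)],
))
```

Prometheus parses these HELP/TYPE comments and sample lines on every scrape; the label set inside the braces is what you later filter on in PromQL.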

FastAPI Middleware for Automatic Metrics

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, generate_latest
from starlette.responses import Response
import time

app = FastAPI()

REQUEST_COUNT = Counter('requests_total', 'Total requests', ['method', 'path', 'status'])
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Request latency', ['method', 'path'])

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start

    REQUEST_COUNT.labels(
        method=request.method,
        path=request.url.path,
        status=response.status_code
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        path=request.url.path
    ).observe(duration)

    return response

@app.get("/metrics")
async def metrics():
    # Prometheus expects the versioned text content type on scrapes
    return Response(content=generate_latest(), media_type="text/plain; version=0.0.4; charset=utf-8")

Docker Compose: Full Monitoring Stack

services:
  app:
    build: .
    ports:
      - "8000:8000"

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin  # fine for local dev; change anywhere else
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  grafana_data:
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['app:8000']
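Rather than wiring up the data source by clicking through the Grafana UI, you can provision it from a file. A minimal sketch, assuming the `prometheus` service name from the compose file above (mount it under Grafana's provisioning directory, e.g. `/etc/grafana/provisioning/datasources/`):

```yaml
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

With this in place, Grafana comes up already pointed at Prometheus, so dashboards survive a fresh `docker compose up`.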

Structured Logging

import structlog
import logging

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    logger_factory=structlog.PrintLoggerFactory(),
)

log = structlog.get_logger()

def process_order(order_id: str, user_id: str):
    log.info("processing_order", order_id=order_id, user_id=user_id)
    try:
        result = charge_payment(order_id)
        log.info("order_completed", order_id=order_id, amount=result["amount"])
    except Exception as e:
        log.error("order_failed", order_id=order_id, error=str(e))
        raise

Output: {"event": "processing_order", "order_id": "123", "user_id": "456", "level": "info", "timestamp": "2024-01-15T10:30:00Z"}
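If you'd rather not take on a dependency, you can approximate the same one-JSON-object-per-line output with the standard library. A sketch (the `fields` record attribute is a convention made up here for passing structured context, not a stdlib feature):

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line,
    similar in spirit to the structlog config above (stdlib only)."""
    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        # Merge any structured context attached via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("orders")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("processing_order", extra={"fields": {"order_id": "123"}})
```

This loses structlog's processor pipeline and bound-context ergonomics, but the output is just as searchable and parseable.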

Alerting Rules

# alert_rules.yml
groups:
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(requests_total{status="500"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"

      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(request_latency_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency above 2s"
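The histogram_quantile function in the SlowResponses rule estimates a quantile from cumulative bucket counts by linear interpolation. A simplified Python model of that calculation (ignoring the +Inf bucket and multi-series aggregation; assumes the quantile falls inside a finite, non-empty bucket):

```python
def estimate_quantile(q, buckets):
    """Approximate PromQL's histogram_quantile.

    buckets: sorted list of (upper_bound, cumulative_count) pairs,
    as a Prometheus Histogram exports them.
    """
    total = buckets[-1][1]
    rank = q * total  # the rank-th observation determines the quantile
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            # Linearly interpolate within the bucket, as Prometheus does:
            # observations inside a bucket are assumed uniformly distributed.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]
```

For example, with 100 observations where 90 fell at or below 0.5s and all 100 at or below 1.0s, the p95 lands halfway into the 0.5-1.0s bucket. This is also why bucket boundaries matter: quantile estimates are only as precise as the buckets you chose when defining the Histogram.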

Key Takeaways

  1. Use Counter for things that only go up (requests, errors)
  2. Use Histogram for latency and size distributions
  3. Use Gauge for values that go up and down (connections, queue size)
  4. Structured logging makes logs searchable and parseable
  5. Set up alerts for error rates and latency, not just uptime
  6. Start with RED metrics: Rate, Errors, Duration
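The RED method maps directly onto the metrics defined earlier. Assuming the requests_total and request_latency_seconds names from the FastAPI example, the three dashboard queries look like:

```
# Rate: requests per second, averaged over 5 minutes
sum(rate(requests_total[5m]))

# Errors: fraction of requests returning 500
sum(rate(requests_total{status="500"}[5m])) / sum(rate(requests_total[5m]))

# Duration: 95th percentile latency
histogram_quantile(0.95, sum(rate(request_latency_seconds_bucket[5m])) by (le))
```

Put these three panels on one dashboard per service and you can tell at a glance whether a problem is load, failures, or slowness.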

