You can't fix what you can't see. Monitoring tells you when things break, why they broke, and how to prevent it. Here's how to set it up properly.
## The Three Pillars of Observability
- Metrics: Numeric measurements over time (request count, latency, CPU usage)
- Logs: Discrete events with context (errors, warnings, audit trails)
- Traces: Request flow across services (distributed tracing)
## Adding Prometheus Metrics to Python

```bash
pip install prometheus-client
```
```python
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
)
ACTIVE_CONNECTIONS = Gauge(
    'active_connections',
    'Number of active connections'
)

# Use in your app
def handle_request(method, endpoint):
    ACTIVE_CONNECTIONS.inc()
    start = time.time()
    try:
        result = process_request(method, endpoint)
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='200').inc()
        return result
    except Exception:
        REQUEST_COUNT.labels(method=method, endpoint=endpoint, status='500').inc()
        raise
    finally:
        REQUEST_LATENCY.labels(method=method, endpoint=endpoint).observe(time.time() - start)
        ACTIVE_CONNECTIONS.dec()

# Expose metrics endpoint
start_http_server(9090)  # serves GET /metrics on port 9090
```
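To see why the `buckets=` parameter matters, here is a toy, stdlib-only sketch of how a Prometheus histogram records an observation: every bucket whose upper bound covers the value is incremented (buckets are cumulative), alongside a running count and sum. The class and names are illustrative only, not part of prometheus-client.

```python
import bisect

# Toy sketch of Prometheus histogram semantics: each observation increments
# every bucket whose upper bound is >= the value (cumulative buckets), plus
# a running count and sum. Illustrative only, not the real client library.
BUCKETS = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]

class ToyHistogram:
    def __init__(self, bounds):
        self.bounds = bounds
        self.counts = [0] * (len(bounds) + 1)  # +1 for the implicit +Inf bucket
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # First bucket whose bound covers the value (le is inclusive),
        # then increment it and every larger bucket.
        i = bisect.bisect_left(self.bounds, value)
        for j in range(i, len(self.counts)):
            self.counts[j] += 1
        self.total += 1
        self.sum += value

h = ToyHistogram(BUCKETS)
for latency in [0.02, 0.03, 0.2, 0.7, 3.0]:
    h.observe(latency)

print(h.counts)  # [0, 2, 2, 3, 3, 4, 4, 5, 5] — cumulative, last entry is +Inf
```

This cumulative shape is what `histogram_quantile()` consumes on the Prometheus side, which is why picking bucket bounds that bracket your SLO (e.g. 0.25 and 0.5 around a 300 ms target) gives much better quantile estimates.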
## FastAPI Middleware for Automatic Metrics
```python
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response
import time

app = FastAPI()

REQUEST_COUNT = Counter('requests_total', 'Total requests', ['method', 'path', 'status'])
REQUEST_LATENCY = Histogram('request_latency_seconds', 'Request latency', ['method', 'path'])

@app.middleware("http")
async def metrics_middleware(request: Request, call_next):
    start = time.time()
    response = await call_next(request)
    duration = time.time() - start
    REQUEST_COUNT.labels(
        method=request.method,
        path=request.url.path,
        status=str(response.status_code)
    ).inc()
    REQUEST_LATENCY.labels(
        method=request.method,
        path=request.url.path
    ).observe(duration)
    return response

@app.get("/metrics")
async def metrics():
    return Response(content=generate_latest(), media_type=CONTENT_TYPE_LATEST)
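One caveat with labeling by raw `request.url.path`: paths that embed IDs (`/users/123`, `/orders/<uuid>`) create a new time series per ID and can bloat Prometheus memory. A hedged, stdlib-only sketch of one way to collapse those segments before labeling — `normalize_path` is a hypothetical helper, not part of FastAPI or prometheus-client:

```python
import re

# Hypothetical helper: collapse high-cardinality path segments (numeric IDs,
# UUIDs) into placeholders before using the path as a metric label value.
UUID_RE = re.compile(
    r'[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-'
    r'[0-9a-fA-F]{4}-[0-9a-fA-F]{12}'
)

def normalize_path(path: str) -> str:
    parts = []
    for segment in path.split('/'):
        if segment.isdigit():
            parts.append('{id}')          # e.g. /users/123 -> /users/{id}
        elif UUID_RE.fullmatch(segment):
            parts.append('{uuid}')        # e.g. /orders/<uuid> -> /orders/{uuid}
        else:
            parts.append(segment)
    return '/'.join(parts)

print(normalize_path('/users/123/orders/550e8400-e29b-41d4-a716-446655440000'))
# → /users/{id}/orders/{uuid}
```

In the middleware above you would call `normalize_path(request.url.path)` wherever the path is used as a label.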
## Docker Compose: Full Monitoring Stack
```yaml
services:
  app:
    build: .
    ports:
      - "8000:8000"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana

volumes:
  grafana_data:
```
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      - targets: ['app:8000']
```
## Structured Logging
```python
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ],
    wrapper_class=structlog.BoundLogger,
    logger_factory=structlog.PrintLoggerFactory(),
)

log = structlog.get_logger()

def process_order(order_id: str, user_id: str):
    log.info("processing_order", order_id=order_id, user_id=user_id)
    try:
        result = charge_payment(order_id)
        log.info("order_completed", order_id=order_id, amount=result["amount"])
    except Exception as e:
        log.error("order_failed", order_id=order_id, error=str(e))
        raise
```
Output:

```json
{"event": "processing_order", "order_id": "123", "user_id": "456", "level": "info", "timestamp": "2024-01-15T10:30:00Z"}
```
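If you can't add a dependency, the same one-JSON-object-per-line idea can be sketched with only the standard library. This is a minimal illustration of the pattern, not what structlog does internally — structlog adds bound context, processor pipelines, and much more:

```python
import json
import logging
from datetime import datetime, timezone

# Minimal stdlib sketch of structured logging: render each record as one
# JSON object per line. Extra key/value pairs ride along in a custom
# "fields" attribute passed via `extra=`.
class JSONFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "event": record.getMessage(),
            "level": record.levelname.lower(),
            "timestamp": datetime.now(timezone.utc).isoformat(),
        }
        payload.update(getattr(record, "fields", {}))  # merge structured fields
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
log = logging.getLogger("orders")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("processing_order", extra={"fields": {"order_id": "123"}})
```

The output shape matches the structlog example above, so downstream log parsers don't care which produced it.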
## Alerting Rules
```yaml
# alert_rules.yml
groups:
  - name: app_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(requests_total{status="500"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
      - alert: SlowResponses
        expr: histogram_quantile(0.95, rate(request_latency_seconds_bucket[5m])) > 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency above 2s"
```
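The `histogram_quantile()` call in the latency alert estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. A simplified Python sketch of that idea (Prometheus's real implementation handles more edge cases):

```python
# Simplified sketch of quantile estimation from cumulative histogram buckets,
# the idea behind PromQL's histogram_quantile(). Not the actual algorithm.
def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count);
    the last bound must be float('inf')."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if bound == float('inf'):
                return prev_bound  # can't interpolate into the +Inf bucket
            # Linear interpolation within this bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count

# 100 requests: 50 under 0.1s, 90 under 0.5s, 99 under 1.0s
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (float('inf'), 100)]
print(histogram_quantile(0.95, buckets))  # ~0.778s
```

This is why bucket choice matters for alerts: the estimate can only land between the bucket bounds that bracket the quantile.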
## Key Takeaways
- Use `Counter` for values that only go up (requests, errors)
- Use `Histogram` for latency and size distributions
- Use `Gauge` for values that go up and down (connections, queue size)
- Structured logging makes logs searchable and parseable
- Set up alerts for error rates and latency, not just uptime
- Start with RED metrics: Rate, Errors, Duration