Have you ever been woken up at 3:00 AM because "the server is down," only to spend the next four hours grepping through messy text files trying to figure out why?
In modern distributed systems, traditional monitoring (asking "Is the system working?") is no longer enough. We need Observability (asking "Why is the system not working, and where exactly is the bottleneck?").
In this article, we will break down the core practices of Observability and implement a real-world, automated solution using Python (FastAPI), Prometheus, and Grafana.
The Three Pillars of Observability
To achieve true observability, your system must emit three types of telemetry data:
- Metrics: Numeric representations of data measured over time (e.g., CPU usage, request latency, error rates). They tell you that something is wrong.
- Logs: Immutable, timestamped records of discrete events (e.g., "User 123 failed authentication"). They tell you what went wrong.
- Traces: A representation of a series of causally related distributed events that encode the end-to-end request flow. They tell you where it went wrong.
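To make the log and trace pillars a bit more concrete, here is a minimal sketch using only the Python standard library. The `make_log_line` helper and the field names (`trace_id`, `event`, `user_id`) are assumptions made up for illustration, not part of any specific logging or tracing library. The idea is simply that a structured, timestamped record carrying a shared trace ID lets you join a log line back to the request that produced it.

```python
import json
import uuid
from datetime import datetime, timezone


def make_log_line(event: str, trace_id: str, **fields) -> str:
    """Build one structured (JSON) log line for a discrete event."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "trace_id": trace_id,  # hypothetical field: ties this log to one request
        "event": event,
        **fields,
    }
    return json.dumps(record)


# One trace ID is generated per incoming request and reused for every log line
# emitted while handling that request.
trace_id = uuid.uuid4().hex
line = make_log_line("authentication_failed", trace_id, user_id=123)
print(line)
```

In a real system the trace ID would usually come from a tracing library or an incoming request header rather than being generated by hand like this.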
Best Practice: The RED Method for Metrics
When instrumenting web services, always follow the RED method for your metrics:
- Rate: The number of requests per second.
- Errors: The number of those requests that are failing.
- Duration: The amount of time those requests take.
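To see what these three numbers mean in practice, here is a small sketch that computes them by hand from a list of request records. The `RequestSample` class, the `red_summary` function, and the sample values are all assumptions for illustration; in the real setup Prometheus does this aggregation for us, and it estimates percentiles from histogram buckets rather than from raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    status: int        # HTTP status code returned
    duration_s: float  # time spent handling the request, in seconds


def red_summary(samples: list[RequestSample], window_s: float) -> dict:
    """Compute Rate, Errors, and Duration for requests seen in one time window."""
    total = len(samples)
    errors = sum(1 for s in samples if s.status >= 500)
    durations = sorted(s.duration_s for s in samples)
    # Nearest-rank 95th percentile over the raw samples.
    p95 = durations[max(0, math.ceil(0.95 * total) - 1)] if durations else 0.0
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_duration_s": p95,
    }


# Made-up data: 10 requests observed over a 5-second window, 2 of them failing.
durations = [0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 0.48]
statuses = [200] * 8 + [500] * 2
samples = [RequestSample(s, d) for s, d in zip(statuses, durations)]
summary = red_summary(samples, window_s=5.0)
print(summary)
```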
Real-World Example: Instrumenting a FastAPI Application
Let's build a simple e-commerce checkout API and instrument it to expose metrics to Prometheus.
1. The Application Code (main.py)
We will use the prometheus_client library to create custom metrics for our API.
# main.py
import time
import random

from fastapi import FastAPI, HTTPException
from prometheus_client import make_asgi_app, Counter, Histogram
import uvicorn

app = FastAPI(title="E-Commerce Checkout API")

# Mount the Prometheus ASGI app to expose the /metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

# Define our Metrics (Following the RED method)
REQUEST_COUNT = Counter(
    'checkout_requests_total',
    'Total number of checkout requests',
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'checkout_request_latency_seconds',
    'Latency of checkout requests in seconds',
    ['endpoint']
)


@app.post("/checkout")
def process_checkout():
    start_time = time.time()

    # Simulate processing time (Duration)
    time.sleep(random.uniform(0.1, 0.5))

    # Simulate a random failure (Errors)
    if random.random() < 0.2:  # 20% failure rate
        REQUEST_COUNT.labels(method='POST', endpoint='/checkout', status='500').inc()
        REQUEST_LATENCY.labels(endpoint='/checkout').observe(time.time() - start_time)
        raise HTTPException(status_code=500, detail="Payment gateway timeout")

    # Success (Rate)
    REQUEST_COUNT.labels(method='POST', endpoint='/checkout', status='200').inc()
    REQUEST_LATENCY.labels(endpoint='/checkout').observe(time.time() - start_time)
    return {"status": "success", "message": "Checkout complete"}


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
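It helps to know what the `Histogram` behind `REQUEST_LATENCY` actually stores. A Prometheus histogram keeps cumulative bucket counters (each labeled with an upper bound `le`), plus a running sum and a count. The sketch below models that in plain Python. The bucket boundaries and the `observe_all` function are assumptions chosen for readability; they are not the defaults or the internals of `prometheus_client`.

```python
import math

# Illustrative upper bounds in seconds (assumed for this sketch).
BUCKETS = [0.1, 0.25, 0.5, 1.0, math.inf]


def observe_all(durations: list[float]) -> dict:
    """Model a Prometheus-style histogram: cumulative buckets, sum, and count."""
    counts = {le: 0 for le in BUCKETS}
    for d in durations:
        for le in BUCKETS:
            if d <= le:
                counts[le] += 1  # cumulative: every bucket with le >= d is incremented
    return {"buckets": counts, "sum": sum(durations), "count": len(durations)}


hist = observe_all([0.12, 0.2, 0.31, 0.45, 0.05])
for le, n in hist["buckets"].items():
    print(f'le="{le}" -> {n}')
```

Because the buckets are cumulative, the last one (`+Inf`) always equals the total count, and percentiles can only be estimated to the resolution of the bucket boundaries. That is worth remembering when our simulated latencies all fall between 0.1 and 0.5 seconds.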
2. The Requirements (requirements.txt)
fastapi==0.103.1
uvicorn==0.23.2
prometheus-client==0.17.1
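Infrastructure & Orchestration (Docker Compose)
To see our metrics, we need Prometheus to scrape our app and Grafana to visualize the results. We can spin up the entire observability stack locally using docker-compose.yaml. Note that the api service uses build: ., so it expects a Dockerfile for the app in the project root (not shown here).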
# docker-compose.yaml
version: '3.8'
services:
  api:
    build: .
    ports:
      - "8000:8000"
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus
Prometheus Config (prometheus.yml):
scrape_configs:
  - job_name: 'fastapi_app'
    scrape_interval: 5s
    static_configs:
      - targets: ['api:8000']
When you run docker-compose up, Prometheus will automatically scrape http://api:8000/metrics every 5 seconds. In Grafana (http://localhost:3000, user admin with the password set above), add Prometheus as a data source at http://prometheus:9090, and you can build dashboards showing your API's request rate, error rate, and latency!
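Once data is flowing, the RED numbers can be expressed as PromQL queries over the two metrics we defined. The sketch below collects three candidate queries and builds instant-query URLs for the Prometheus HTTP API. The `RED_QUERIES` names, the `query_url` helper, and the 1-minute rate window are assumptions for illustration; the base URL assumes the 9090:9090 port mapping from the compose file.

```python
from urllib.parse import urlencode

# Assumed local address, based on the docker-compose port mapping (9090:9090).
PROMETHEUS_URL = "http://localhost:9090"

RED_QUERIES = {
    # Rate: requests per second, averaged over the last minute.
    "rate": 'sum(rate(checkout_requests_total[1m]))',
    # Errors: share of requests that ended with status 500.
    "errors": (
        'sum(rate(checkout_requests_total{status="500"}[1m]))'
        ' / sum(rate(checkout_requests_total[1m]))'
    ),
    # Duration: estimated 95th percentile latency from the histogram buckets.
    "duration_p95": (
        'histogram_quantile(0.95, '
        'sum by (le) (rate(checkout_request_latency_seconds_bucket[1m])))'
    ),
}


def query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expr})}"


for name, expr in RED_QUERIES.items():
    print(name, query_url(expr))
```

The same expressions can be pasted into a Grafana panel or the Prometheus web UI. One caveat: the error query returns no data until at least one request has failed, because the status="500" series does not exist before then.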
Automation: Enforcing Code Quality with CI/CD
Observability starts with good code. Let's ensure our instrumented app builds correctly before merging. Here is our GitHub Actions pipeline (.github/workflows/ci.yml):
name: CI/CD Observability API
on:
  push:
    branches: [ "main" ]
  pull_request:
    branches: [ "main" ]
jobs:
  build-and-test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repository
        uses: actions/checkout@v3
      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
          pip install flake8
      - name: Lint Code
        run: flake8 main.py --count --select=E9,F63,F7,F82 --show-source --statistics
      - name: Build Docker Image
        run: docker build -t my-observable-api:latest .
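This pipeline lints and builds the app, but it does not check that the metrics are actually exposed. One possible addition, sketched here under assumptions, is a small script that scans the text returned by /metrics and reports any expected metric names that are missing. The `missing_metrics` function and the `SAMPLE` text are hypothetical; the sample only imitates the general shape of the Prometheus text format and is not captured output.

```python
def missing_metrics(exposition_text: str, expected: set[str]) -> set[str]:
    """Return the expected metric names that never appear in the scrape output."""
    seen = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like: name{labels} value   or   name value
        name = line.split("{", 1)[0].split(" ", 1)[0]
        seen.add(name)
    return expected - seen


# Illustrative sample of what GET /metrics might return after a few requests.
SAMPLE = """\
# HELP checkout_requests_total Total number of checkout requests
# TYPE checkout_requests_total counter
checkout_requests_total{method="POST",endpoint="/checkout",status="200"} 4.0
checkout_request_latency_seconds_count{endpoint="/checkout"} 4.0
"""

EXPECTED = {"checkout_requests_total", "checkout_request_latency_seconds_count"}
print(missing_metrics(SAMPLE, EXPECTED))
```

A CI step could start the container, send a request to /checkout, fetch /metrics, and fail the build if this function returns a non-empty set.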
Conclusion
By instrumenting our code from day one, we transition from reactive firefighting to proactive engineering. Using the RED method alongside Prometheus and Grafana ensures that when our API starts slowing down or failing, we don't have to guess why: the dashboard will tell us exactly what is happening.
How do you handle observability in your projects? Let me know in the comments!