Hashira Belén Vargas Candia

Posted on Jul 1

Practical Modern Observability: FastAPI, OpenTelemetry, Prometheus, Jaeger, and Grafana

#microservices #monitoring #tutorial #python

In modern, distributed architectures (like microservices or serverless applications), understanding why a system failed or why it is running slowly is a massive engineering challenge.

When a user complains that a checkout operation failed, you cannot simply look at a single server’s log file. That request may have traversed a Gateway, an Order service, a Payment processor, an Inventory system, and a database. You need a way to correlate all of these events across network and process boundaries.

This is where Observability comes in.

In this article, we will cover the core concepts of modern observability practices, explain the OpenTelemetry standard, and walk through a complete, real-world Python implementation. We will build a containerized FastAPI application instrumented with OpenTelemetry that exports metrics to Prometheus, sends traces to Jaeger, and compiles everything into a unified Grafana dashboard.

Repository: https://github.com/Hashiravc/Observability-Practices.git

1. What is Observability?

Often confused with monitoring, observability is the measure of how well you can infer the internal states of a system based on its external outputs.

Monitoring tells you when a system is broken (e.g., "CPU usage is 99%" or "HTTP 500 error rate is > 5%"). It is about tracking pre-defined metrics.
Observability tells you why it is broken (e.g., "The payment service is slow because database query SELECT * FROM payment_methods took 3 seconds under a specific trace context"). It is about asking questions you didn't anticipate.

To achieve observability, we rely on the three pillars of telemetry:

Metrics: Structured, aggregated numerical data that tracks resource usage or business KPIs over time (e.g., CPU load, request count, revenue). Metrics are ideal for alerting.
Traces: Represent the end-to-end journey of a request as it flows through a distributed system. A trace is composed of one or more spans (individual units of work). Traces are crucial for isolating latency bottlenecks and database serialization errors.
Logs: Timestamped, text-based entries describing specific, discrete events. In highly observable systems, logs are structured (for example, JSON-formatted) and carry the active trace ID, so each log line can be correlated directly with its trace.
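
To make the three signals concrete, here is a minimal, standard-library-only sketch (it is not OpenTelemetry code) that records one checkout as a metric increment, a span, and a structured log line. The `Span` class and the `record_checkout` helper are hypothetical names invented for this illustration, and the timestamps are made up. The only point is that the trace ID is the key the log and the span share.

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical, simplified model)."""
    trace_id: str
    name: str
    start: float
    end: float
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


# Metric: an aggregated number with no per-request detail.
checkouts_total = Counter()


def record_checkout(trace_id: str, item: str, started: float, finished: float):
    """Emit one metric increment, one span, and one structured log line."""
    checkouts_total[item] += 1
    span = Span(trace_id, "checkout_transaction", started, finished,
                {"order.item": item})
    log_line = json.dumps({
        "ts": finished,
        "level": "INFO",
        "message": "checkout completed",
        "trace_id": trace_id,  # the key that links this log line to the span
    })
    return span, log_line


span, log_line = record_checkout(
    "4bf92f3577b34da6a3ce929d0e0e4736", "laptop", 10.0, 10.25
)
print(checkouts_total["laptop"], round(span.duration_ms))
print(json.loads(log_line)["trace_id"] == span.trace_id)
```

The metric answers "how many", the span answers "how long and where", and the log answers "what happened", all for the same request.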

2. The OpenTelemetry Standard

Historically, implementing observability meant using proprietary libraries from vendor platforms (like Datadog, New Relic, or Dynatrace). If you wanted to change platforms, you had to rewrite your application instrumentation.

OpenTelemetry (OTel) is an open-source, vendor-neutral collection of APIs, SDKs, and tools hosted by the Cloud Native Computing Foundation (CNCF). It provides a single standard for gathering metrics, logs, and traces.

By instrumenting your code with the OpenTelemetry API, you can swap telemetry backends (e.g., from local Prometheus/Jaeger to Datadog or AWS CloudWatch) by changing exporter configuration, such as environment variables, without rewriting the instrumentation in your application code.
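
As a rough sketch of what "changing configuration" means, the helper below picks a trace export target purely from environment variables. The function `resolve_trace_export` is hypothetical and written only for this article; the variable names `OTEL_TRACES_EXPORTER` and `OTEL_EXPORTER_OTLP_ENDPOINT` follow the OpenTelemetry environment variable specification, and the `collector.example.com` host is a placeholder.

```python
import os


def resolve_trace_export(env=None):
    """Return (exporter_name, endpoint) chosen only from environment variables.

    Hypothetical helper for illustration; a real SDK has its own
    configuration loading.
    """
    env = os.environ if env is None else env
    exporter = env.get("OTEL_TRACES_EXPORTER", "otlp")
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
    if exporter == "console":
        return exporter, None  # console output needs no endpoint
    return exporter, endpoint


# Same application code, two deployments, different backends.
local = resolve_trace_export(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://jaeger:4317"}
)
hosted = resolve_trace_export(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": "https://collector.example.com:4317"}
)
print(local)
print(hosted)
```

The application calls the same tracing API in both cases; only the environment decides where the spans go.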

3. Demo Application Architecture

To demonstrate observability practices in a real-world scenario, we will build a containerized E-Commerce application with the following architecture:

  [ Traffic Generator ]
           │
           ▼ (HTTP request)
     [ FastAPI Web ] ──────────────► [ SQLite DB ]
           │ (Exposes /metrics)
           ├───────────────────────► [ Prometheus ] ──────┐
           │                                              ▼
           ├─► (OTLP Traces gRPC) ──► [ Jaeger ] ────► [ Grafana Dashboard ]
           │
        (W3C Trace Headers)
           │
           ▼
   [ /inventory/deduct ]

When a client hits /checkout:

A parent span checkout_transaction is created.
A database entry is written inside a child span (db_create_order).
An HTTP request is made to the /inventory/deduct route. To trace this across network boundaries, we manually inject W3C Trace Context headers.
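The /inventory/deduct handler extracts the context and starts a child span, inventory_deduct_span, in which it deducts item stock.
For specific orders the handler simulates failures: smartphone orders of 3 or more units get an artificial delay of 1.0 to 2.5 seconds (sleep), and laptop orders of 2 or more units fail about 60% of the time with a simulated deadlock (HTTP 500).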
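
The header that carries the context across the HTTP hop is `traceparent`. For version `00` of the W3C Trace Context format it has four dash-separated fields: version, a 32-hex-character trace ID, a 16-hex-character parent span ID, and 2 hex characters of flags. The sketch below builds and parses that header with the standard library only. It is a simplified illustration, not the OpenTelemetry propagator, and the function names are our own.

```python
import re
import secrets

# version - trace-id - parent-id - flags (all lowercase hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Serialize a context into a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return the parsed fields, or None if the header is not valid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in the W3C format.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)

# The downstream span keeps the trace ID and gets a fresh span ID.
child_span_id = secrets.token_hex(8)
outgoing = build_traceparent(ctx["trace_id"], child_span_id, ctx["sampled"])
print(ctx["trace_id"], ctx["sampled"])
```

Because the trace ID stays constant while each hop contributes a new span ID, the backend can reassemble the spans into one tree.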

4. Code Walkthrough

Let us look at how this is implemented.

Dependency Configuration (`requirements.txt`)

We install FastAPI, Uvicorn, SQLAlchemy, and the official OpenTelemetry SDK packages, including the Prometheus metric reader and OTLP trace exporter:

fastapi==0.111.0
uvicorn==0.30.1
sqlalchemy==2.0.31
httpx==0.27.0
prometheus-client==0.20.0
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-instrumentation-fastapi==0.46b0
opentelemetry-exporter-otlp==1.25.0
opentelemetry-exporter-prometheus==0.46b0

Telemetry Configuration (`app/telemetry.py`)

This file initializes OpenTelemetry, registers exporters, and defines custom application-level metrics.

import os
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Define service metadata
SERVICE_NAME_VALUE = os.getenv("OTEL_SERVICE_NAME", "order-service")
resource = Resource.create({SERVICE_NAME: SERVICE_NAME_VALUE})

# 1. Tracing Setup
tracer_provider = TracerProvider(resource=resource)

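# Configure OTLP Exporter (sending traces to Jaeger via gRPC).
# Note: the gRPC channel connects lazily, so an unreachable endpoint usually
# shows up as export errors at runtime rather than as an exception here.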
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
try:
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
    span_processor = BatchSpanProcessor(otlp_exporter)
    tracer_provider.add_span_processor(span_processor)
except Exception as e:
    print(f"Could not initialize OTLP exporter: {e}. Falling back to console.")
    console_exporter = ConsoleSpanExporter()
    span_processor = BatchSpanProcessor(console_exporter)
    tracer_provider.add_span_processor(span_processor)

trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer("order-service-tracer")

# 2. Metrics Setup (Prometheus Pull Exporter)
prometheus_reader = PrometheusMetricReader()
meter_provider = MeterProvider(resource=resource, metric_readers=[prometheus_reader])
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter("order-service-meter")

# Custom business and performance metrics
checkout_counter = meter.create_counter(
    name="order_checkouts_total",
    description="Total number of checkout requests processed",
    unit="1"
)

revenue_counter = meter.create_counter(
    name="order_revenue_total",
    description="Total revenue generated from checkouts",
    unit="USD"
)

checkout_errors = meter.create_counter(
    name="order_checkout_errors_total",
    description="Total failed checkouts",
    unit="1"
)

def setup_telemetry(app: FastAPI):
    """
    Instruments the FastAPI application.
    """
    # Auto-instrumentation hooks FastAPI request durations and details
    FastAPIInstrumentor.instrument_app(
        app,
        tracer_provider=tracer_provider,
        meter_provider=meter_provider
    )
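
Prometheus scrapes counters like these as plain text. To show what that text looks like, the sketch below renders a counter in the Prometheus text exposition format using only the standard library. It is an illustration, not the exporter's implementation: `render_counter` is a hypothetical helper, `demo_checkouts_total` is a placeholder name (the names the OpenTelemetry exporter actually emits may differ, for example in suffixes), and label-value escaping is omitted.

```python
def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Label values are assumed not to need escaping in this sketch.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_text = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_text}}} {float(value)}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "demo_checkouts_total",
    "Total number of checkout requests processed",
    {(("item", "laptop"),): 3, (("item", "smartphone"),): 5},
)
print(text)
```

Each distinct combination of label values becomes its own line, and therefore its own time series in Prometheus.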

Application Implementation (`app/main.py`)

Here we set up our FastAPI endpoints. Notice how trace context is injected on /checkout and extracted on /inventory/deduct using W3C Trace Context propagation.

import os
import time
import random
import httpx
from fastapi import FastAPI, Depends, HTTPException, Header, Response, Request
from pydantic import BaseModel
from sqlalchemy.orm import Session
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

from app.database import init_db, get_db, Order, Inventory
from app.telemetry import (
    setup_telemetry,
    tracer,
    checkout_counter,
    revenue_counter,
    checkout_errors
)
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.trace import StatusCode

app = FastAPI(title="E-Commerce Observability Demo API")
setup_telemetry(app)

INVENTORY_URL = os.getenv("INVENTORY_URL", "http://localhost:8000")

class CheckoutRequest(BaseModel):
    item: str
    quantity: int
    price: float

class InventoryDeductRequest(BaseModel):
    item: str
    quantity: int

@app.on_event("startup")
def on_startup():
    init_db()

@app.get("/")
def home():
    return {"message": "Welcome to the E-Commerce Observability Demo API!"}

@app.post("/checkout")
async def checkout(request_data: CheckoutRequest, db: Session = Depends(get_db)):
    checkout_counter.add(1, {"item": request_data.item})

    # Start parent trace span
    with tracer.start_as_current_span("checkout_transaction") as span:
        span.set_attribute("order.item", request_data.item)
        span.set_attribute("order.quantity", request_data.quantity)
        span.set_attribute("order.price_per_unit", request_data.price)

        # Child database span
        with tracer.start_as_current_span("db_create_order") as db_span:
            db_order = Order(
                item=request_data.item,
                quantity=request_data.quantity,
                price=request_data.price,
                status="PENDING"
            )
            db.add(db_order)
            db.commit()
            db.refresh(db_order)
            db_span.set_attribute("db.order_id", db_order.id)
            order_id = db_order.id

        # Downstream HTTP client span + trace context injection
        with tracer.start_as_current_span("http_call_inventory_service") as http_span:
            http_span.set_attribute("http.url", f"{INVENTORY_URL}/inventory/deduct")
            # Inject inside the client span so the downstream span is parented to it
            headers = {}
            TraceContextTextMapPropagator().inject(headers)  # Adds the 'traceparent' header
            try:
                async with httpx.AsyncClient() as client:
                    response = await client.post(
                        f"{INVENTORY_URL}/inventory/deduct",
                        json={"item": request_data.item, "quantity": request_data.quantity},
                        headers=headers,
                        timeout=5.0
                    )
            except Exception as exc:
                checkout_errors.add(1, {"item": request_data.item, "reason": "inventory_network_error"})
                span.set_status(StatusCode.ERROR, f"Inventory communication failed: {exc}")
                http_span.record_exception(exc)
                db_order.status = "FAILED"
                db.commit()
                raise HTTPException(status_code=502, detail="Inventory service network error")

            if response.status_code != 200:
                checkout_errors.add(1, {"item": request_data.item, "reason": "inventory_rejection"})
                span.set_status(StatusCode.ERROR, f"Inventory check rejected: {response.text}")
                db_order.status = "FAILED"
                db.commit()
                raise HTTPException(status_code=response.status_code, detail=f"Inventory deduction rejected: {response.text}")

        # Success path
        db_order.status = "COMPLETED"
        db.commit()

        total_revenue = request_data.quantity * request_data.price
        revenue_counter.add(total_revenue, {"item": request_data.item})
        span.set_attribute("order.status", "COMPLETED")
        span.set_attribute("order.revenue", total_revenue)

        return {"order_id": order_id, "status": "COMPLETED", "item": request_data.item, "total_price": total_revenue}

@app.post("/inventory/deduct")
def deduct_inventory(request_data: InventoryDeductRequest, request: Request, db: Session = Depends(get_db)):
    # Extract trace parent header from HTTP client call
    carrier = {"traceparent": request.headers.get("traceparent", "")}
    extracted_context = TraceContextTextMapPropagator().extract(carrier=carrier)

    with tracer.start_as_current_span("inventory_deduct_span", context=extracted_context) as span:
        span.set_attribute("inventory.item", request_data.item)
        span.set_attribute("inventory.deduction_quantity", request_data.quantity)

        # Simulate network latency (1.0 to 2.5 seconds) for smartphone orders of 3+ units
        if request_data.item == "smartphone" and request_data.quantity >= 3:
            delay = random.uniform(1.0, 2.5)
            span.set_attribute("simulation.latency_added", delay)
            time.sleep(delay)

        # Simulate db lock/deadlock conflicts (HTTP 500, about 60% of the time) for laptop orders of 2+ units
        if request_data.item == "laptop" and request_data.quantity >= 2:
            if random.random() < 0.6:
                span.set_status(StatusCode.ERROR, "Simulated deadlock conflict")
                raise HTTPException(status_code=500, detail="Database deadlock conflict during write")

        db_item = db.query(Inventory).filter(Inventory.item == request_data.item).first()
        if not db_item or db_item.quantity < request_data.quantity:
            span.set_status(StatusCode.ERROR, "Insufficient stock")
            raise HTTPException(status_code=400, detail="Insufficient stock")

        db_item.quantity -= request_data.quantity
        db.commit()
        return {"status": "SUCCESS", "remaining_stock": db_item.quantity}

@app.get("/metrics")
def metrics_endpoint():
    """
    Exposes metrics scraped by Prometheus server.
    """
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
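
To exercise the simulated latency and deadlock branches, you need traffic that mixes items and quantities. The sketch below is one way to generate it with the standard library. It is not the repository's generate_traffic.py: the `PRICES` table, the "headphones" item, and the `BASE_URL` value are assumptions made for illustration, and `send_checkout` is defined but not called here because it needs the app to be running.

```python
import json
import random
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: the app is published on port 8000

# Hypothetical catalogue; the real inventory contents are not shown in this article.
PRICES = {"laptop": 1200.0, "smartphone": 800.0, "headphones": 150.0}


def build_payload(rng: random.Random) -> dict:
    """Build one random /checkout request body."""
    item = rng.choice(sorted(PRICES))
    return {"item": item, "quantity": rng.randint(1, 4), "price": PRICES[item]}


def expected_path(payload: dict) -> str:
    """Name the simulated branch of /inventory/deduct this payload can hit."""
    if payload["item"] == "smartphone" and payload["quantity"] >= 3:
        return "latency"
    if payload["item"] == "laptop" and payload["quantity"] >= 2:
        return "possible_deadlock"
    return "normal"


def send_checkout(payload: dict) -> int:
    """POST the payload to /checkout and return the HTTP status code."""
    request = urllib.request.Request(
        f"{BASE_URL}/checkout",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are expected in this demo


rng = random.Random(42)
sample = [build_payload(rng) for _ in range(5)]
for payload in sample:
    print(payload["item"], payload["quantity"], expected_path(payload))
```

With the stack running, calling `send_checkout(build_payload(rng))` in a loop with a short sleep between requests should produce a mix of fast, slow, and failed checkouts.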

5. Orchestrating the Stack with Docker Compose

To run this observability stack locally, we define a multi-container deployment using Docker Compose. We launch:

Our FastAPI App on port 8000.
Prometheus on port 9090 to scrape the app's /metrics endpoint.
Jaeger on port 16686 (Web UI) and 4317 (OTLP gRPC receiver) to receive trace spans.
Grafana on port 3000 to visualize both data sources.

Here is the docker-compose.yml config:

version: "3.8"

services:
  web:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OTEL_SERVICE_NAME=order-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
      - INVENTORY_URL=http://web:8000
    depends_on:
      - jaeger

  jaeger:
    image: jaegertracing/all-in-one:1.57
    ports:
      - "16686:16686" # Web UI
      - "4317:4317"   # OTLP gRPC receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  prometheus:
    image: prom/prometheus:v2.52.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:11.0.0
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    depends_on:
      - prometheus
      - jaeger

6. How it Looks in Action

When you run docker compose up --build, start the traffic generator (generate_traffic.py), and open Grafana on port 3000 (anonymous access is enabled in the Compose file above), you will see a unified, correlated interface.

High-Level Metric Monitoring: Grafana metrics charts track the overall request throughput, purchase revenue, and HTTP error rate.
Drill-down to Latency: When a spike appears in the average latency chart, you can see that the /checkout route is taking about 2.2 seconds.
Distributed Tracing Correlation: Because the trace context was propagated from /checkout down to /inventory/deduct, you can jump directly from Grafana into Jaeger to examine the trace hierarchy.
Root-Cause Isolation: In Jaeger, you see that the parent span checkout_transaction spent 95% of its execution time inside the child span inventory_deduct_span. Looking at the span details, you see simulation.latency_added: 2.15, pointing directly to the simulated network lag!
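
The "95% of its execution time" reading is simple arithmetic over span start and end times. The sketch below reproduces it with hypothetical timings (they are illustrative numbers, not measurements from the demo), using a simplified `SpanRecord` type of our own rather than Jaeger's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanRecord:
    name: str
    parent: Optional[str]
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


# Hypothetical timings for one slow checkout (not measured data).
spans = [
    SpanRecord("checkout_transaction", None, 0, 2260),
    SpanRecord("db_create_order", "checkout_transaction", 5, 40),
    SpanRecord("inventory_deduct_span", "checkout_transaction", 60, 2207),
]


def share_of_root(records, name: str) -> float:
    """Fraction of the root span's duration spent in the named span."""
    root = next(s for s in records if s.parent is None)
    target = next(s for s in records if s.name == name)
    return target.duration_ms / root.duration_ms


def slowest_child(records) -> SpanRecord:
    """The non-root span with the longest duration."""
    children = [s for s in records if s.parent is not None]
    return max(children, key=lambda s: s.duration_ms)


worst = slowest_child(spans)
print(worst.name, f"{share_of_root(spans, worst.name):.0%}")
# prints: inventory_deduct_span 95%
```

A trace viewer does this comparison visually, which is why the slow span stands out as soon as you open the trace.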

7. Observability Best Practices for Production

If you are implementing observability in production, keep these best practices in mind:

Use an OpenTelemetry Collector: In local development, exporting telemetry directly from the app to Prometheus/Jaeger is fine. In production, however, your application should stream telemetry data asynchronously via OTLP to a local OTel Collector daemon. The Collector processes, batches, and exports the data to your telemetry backend. This prevents application slowdowns during backend bottlenecks.
Inject Trace Context into Logs: Configure your logger (e.g., Python logging or structlog) to format logs as JSON and include the active trace ID. In the Python API, trace.get_current_span().get_span_context().trace_id returns an integer, so render it as 32 lowercase hex characters to match what Jaeger displays. This bridges the gap between logging and tracing.
Watch Out for Metric Cardinality: When defining metrics, do not use high-cardinality values (e.g., user IDs, order IDs, or session IDs) as attributes. Every distinct combination of attribute values creates a separate time series, so unbounded values can exhaust Prometheus memory. Use traces or logs for high-cardinality attributes.
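Enforce Sampling: Sending 100% of traces to your collector is expensive and unnecessary for high-traffic applications. Head-based sampling decides when a trace starts, before the outcome is known, so it can only keep a fixed share of traffic (e.g., 5%). To keep 100% of errors while sampling successful checkouts at 5%, use tail-based sampling, which decides after the trace completes and is typically configured in the Collector.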
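
For the log-correlation practice above, a minimal sketch with the standard logging module looks like this. The `JsonTraceFormatter` class and its `trace_id_provider` argument are hypothetical names for this example. In a real application the provider would be something like `lambda: trace.get_current_span().get_span_context().trace_id`; here a fixed integer stands in so the snippet runs without OpenTelemetry installed.

```python
import json
import logging


def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit integer trace ID as 32 lowercase hex characters."""
    return format(trace_id, "032x")


class JsonTraceFormatter(logging.Formatter):
    """Format log records as JSON and attach the active trace ID."""

    def __init__(self, trace_id_provider):
        super().__init__()
        # Callable returning the current trace ID as an int (assumption:
        # supplied by the tracing library in a real application).
        self._trace_id_provider = trace_id_provider

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": format_trace_id(self._trace_id_provider()),
        }
        return json.dumps(payload)


# Stand-in for the active trace ID.
FAKE_TRACE_ID = 0x4BF92F3577B34DA6A3CE929D0E0E4736

formatter = JsonTraceFormatter(lambda: FAKE_TRACE_ID)
record = logging.LogRecord(
    "checkout", logging.INFO, "demo.py", 0, "order %s completed", (42,), None
)
line = formatter.format(record)
print(line)
```

To use it, attach the formatter to a handler with `handler.setFormatter(formatter)`; every log line then carries a `trace_id` field you can paste into the Jaeger search box.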

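For the sampling practice, it helps to separate two decisions: a ratio decision that can be made from the trace ID alone when the trace starts, and an error-aware decision that can only be made once the trace has finished. The sketch below is a simplified, standard-library illustration of both. It is written in the spirit of trace-ID-ratio sampling, but it is not the OpenTelemetry SDK's or the Collector's implementation, and the function names are our own.

```python
import random

TRACE_ID_MASK = (1 << 64) - 1  # look only at the low 64 bits of the trace ID


def head_sample(trace_id: int, ratio: float) -> bool:
    """Deterministic ratio decision, made when the trace starts."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


def tail_sample(trace_id: int, has_error: bool, ratio: float) -> bool:
    """Decision made after the trace completes: keep errors, sample the rest."""
    return has_error or head_sample(trace_id, ratio)


rng = random.Random(0)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]

kept_by_head = sum(head_sample(t, 0.05) for t in trace_ids)
kept_errors = sum(tail_sample(t, True, 0.05) for t in trace_ids)

print(f"head-based kept about {kept_by_head / len(trace_ids):.1%} of traces")
print(f"tail-based kept {kept_errors / len(trace_ids):.0%} of error traces")
```

Because the head decision depends only on the trace ID, every service that sees the same ID reaches the same verdict, which keeps traces complete. The tail decision needs the finished trace, which is why it usually lives in a collector rather than in the application.
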
Conclusion

Observability is not just about installing software packages; it is an engineering discipline. Adopting OpenTelemetry ensures you decouple your application logic from any particular monitoring vendor.

Setting up unified tracing and metric aggregation, as shown in our E-Commerce demo, allows your team to go from spotting a high-level error spike to isolating the specific span where the failure or latency originates, instead of guessing from aggregate charts.

Top comments (1)

arun rajkumar • Jul 4

First comment here, so I'll add the production angle: propagating W3C trace context across the /checkout → /inventory hop is the part most teams skip, and it's exactly the part that saves you at 2am. On a payments stack the trace ID becomes the debugging currency — when a checkout fails, following one request across services beats stitching together five separate log files. The two best-practice points I'd underline hardest: inject the trace ID into structured logs so logs and traces are one click apart, and stay ruthless about metric cardinality — tagging with per-transaction IDs is the classic way to blow up Prometheus memory. What did you settle on for sampling — 100% of error traces with the happy path sampled, or something adaptive?