In modern, distributed architectures (like microservices or serverless applications), understanding why a system failed or why it is running slowly is a massive engineering challenge.
When a user complains that a checkout operation failed, you cannot simply look at a single server’s log file. That request may have traversed a Gateway, an Order service, a Payment processor, an Inventory system, and a database. You need a way to correlate all of these events across network and process boundaries.
This is where Observability comes in.
In this article, we will cover the core concepts of modern observability practices, explain the OpenTelemetry standard, and walk through a complete, real-world Python implementation. We will build a containerized FastAPI application instrumented with OpenTelemetry that exports metrics to Prometheus, sends traces to Jaeger, and compiles everything into a unified Grafana dashboard.
Repository: https://github.com/Hashiravc/Observability-Practices.git
1. What is Observability?
Often confused with monitoring, observability is the measure of how well you can infer the internal states of a system based on its external outputs.
- Monitoring tells you when a system is broken (e.g., "CPU usage is 99%" or "HTTP 500 error rate is > 5%"). It is about tracking pre-defined metrics.
- Observability tells you why it is broken (e.g., "The payment service is slow because the database query SELECT * FROM payment_methods took 3 seconds under a specific trace context"). It is about asking questions you didn't anticipate.
To achieve observability, we rely on the three pillars of telemetry:
- Metrics: Structured, aggregated numerical data that track resource usage or business KPIs over time. (e.g., CPU load, request count, revenue). Metrics are ideal for alerting.
- Traces: Represent the end-to-end journey of a request as it flows through a distributed system. A trace is composed of one or more spans (individual units of work). Traces are crucial for isolating latency bottlenecks and database serialization errors.
- Logs: Timestamped, text-based entries describing specific, discrete events. In highly observable systems, logs are structured (e.g., JSON-formatted) and include the active trace ID, so each log line can be correlated directly with its trace.
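To make the last pillar concrete, here is a minimal sketch of a structured log line that carries a trace ID, using only the standard library. The names JsonFormatter, format_trace_id, and build_logger are hypothetical, and the trace ID is a hard-coded example value; in an instrumented service it would be read from the active span instead.

```python
import io
import json
import logging


def format_trace_id(trace_id: int) -> str:
    """Render a 128-bit trace ID as a 32-character lowercase hex string."""
    return format(trace_id, "032x")


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that emits one JSON object per log record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id is attached per call via `extra`; it is None on untraced logs
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("checkout-demo")
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
# Assumed example value; a real ID would come from the current span context
example_trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
log.info("inventory rejected order", extra={"trace_id": format_trace_id(example_trace_id)})
logged = json.loads(buffer.getvalue())
print(logged)
```

Because the trace ID is emitted as the same hex string a tracing UI displays, a log line can be matched to its trace with a plain text search.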
2. The OpenTelemetry Standard
Historically, implementing observability meant using proprietary libraries from vendor platforms (like Datadog, New Relic, or Dynatrace). If you wanted to change platforms, you had to rewrite your application instrumentation.
OpenTelemetry (OTel) is an open-source, vendor-neutral collection of APIs, SDKs, and tools hosted by the Cloud Native Computing Foundation (CNCF). It provides a single standard for gathering metrics, logs, and traces.
By instrumenting your code with the OpenTelemetry API, you can swap out telemetry backends (e.g., from local Prometheus/Jaeger to Datadog or AWS CloudWatch) by changing simple configuration variables, without modifying a single line of application code.
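One way to picture "changing simple configuration variables" is a settings object populated entirely from the environment. The sketch below is illustrative only: ExporterSettings and load_exporter_settings are hypothetical names, and the defaults are assumptions. OTEL_SERVICE_NAME and OTEL_EXPORTER_OTLP_ENDPOINT are the variables used later in this article; OTEL_TRACES_EXPORTER is another standard OpenTelemetry variable.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ExporterSettings:
    service_name: str
    traces_exporter: str
    otlp_endpoint: str


def load_exporter_settings(env: Mapping[str, str] = os.environ) -> ExporterSettings:
    """Read backend choices from the environment; the defaults are illustrative."""
    return ExporterSettings(
        service_name=env.get("OTEL_SERVICE_NAME", "order-service"),
        traces_exporter=env.get("OTEL_TRACES_EXPORTER", "otlp"),
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317"),
    )


# Same code, two deployments: only the environment differs.
local = load_exporter_settings({})
hosted = load_exporter_settings(
    {"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector.example:4317"}  # made-up host
)
print(local.otlp_endpoint, hosted.otlp_endpoint)
```

The application code that creates spans and metrics never sees these values; only the setup code that wires exporters does.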
3. Demo Application Architecture
To demonstrate observability practices in a real-world scenario, we will build a containerized E-Commerce application with the following architecture:
[ Traffic Generator ]
         │
         ▼ (HTTP request)
[ FastAPI Web ] ──────────────► [ SQLite DB ]
         │ (Exposes /metrics)
         ├───────────────────────► [ Prometheus ] ──────┐
         │                                              ▼
         ├─► (OTLP Traces gRPC) ──► [ Jaeger ] ────► [ Grafana Dashboard ]
         │
 (W3C Trace Headers)
         │
         ▼
[ /inventory/deduct ]
When a client hits /checkout:
- A parent span checkout_transaction is created.
- A database entry is written inside a child span (db_create_order).
- An HTTP request is made to the /inventory/deduct route. To trace this across network boundaries, we manually inject W3C Trace Context headers.
- The inventory service extracts the context and begins a child span inventory_deduct_span to deduct item stock.
- If we request checkout of specific items, we simulate network latency (sleep) or write lock deadlocks (500 errors).
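The W3C Trace Context header that carries the trace across the HTTP call is a single traceparent value with four dash-separated hex fields: version, trace ID, parent span ID, and flags. The sketch below builds and parses that format with the standard library only. It is a simplified illustration (it accepts only the four-field layout), the helper names are hypothetical, and the demo application relies on OpenTelemetry's propagator rather than code like this.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str
    sampled: bool


def build_traceparent(trace_id: str, span_id: str, sampled: bool = True) -> str:
    """Serialize IDs into a version-00 traceparent header value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = TRACEPARENT_RE.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are defined as invalid
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceParent(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


# Caller side: generate random IDs and attach the header to the outgoing request
outgoing = {"traceparent": build_traceparent(secrets.token_hex(16), secrets.token_hex(8))}
# Callee side: recover the trace ID so the new span joins the same trace
parsed = parse_traceparent(outgoing["traceparent"])
print(parsed)
```

The callee starts its span with the extracted trace ID and uses the parent span ID as the new span's parent, which is what links the two sides into one trace.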
4. Code Walkthrough
Let us look at how this is implemented.
Dependency Configuration (requirements.txt)
We install FastAPI, Uvicorn, SQLAlchemy, and the official OpenTelemetry SDK packages, including the Prometheus metric reader and OTLP trace exporter:
fastapi==0.111.0
uvicorn==0.30.1
sqlalchemy==2.0.31
httpx==0.27.0
prometheus-client==0.20.0
opentelemetry-api==1.25.0
opentelemetry-sdk==1.25.0
opentelemetry-instrumentation-fastapi==0.46b0
opentelemetry-exporter-otlp==1.25.0
opentelemetry-exporter-prometheus==1.25.0
Telemetry Configuration (app/telemetry.py)
This file initializes OpenTelemetry, registers exporters, and defines custom application-level metrics.
import os
from fastapi import FastAPI
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
# Define service metadata
SERVICE_NAME_VALUE = os.getenv("OTEL_SERVICE_NAME", "order-service")
resource = Resource.create({SERVICE_NAME: SERVICE_NAME_VALUE})
# 1. Tracing Setup
tracer_provider = TracerProvider(resource=resource)
# Configure OTLP Exporter (sending traces to Jaeger via gRPC)
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
try:
    otlp_exporter = OTLPSpanExporter(endpoint=otlp_endpoint, insecure=True)
    span_processor = BatchSpanProcessor(otlp_exporter)
    tracer_provider.add_span_processor(span_processor)
except Exception as e:
    print(f"Could not initialize OTLP exporter: {e}. Falling back to console.")
    console_exporter = ConsoleSpanExporter()
    span_processor = BatchSpanProcessor(console_exporter)
    tracer_provider.add_span_processor(span_processor)
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer("order-service-tracer")
# 2. Metrics Setup (Prometheus Pull Exporter)
prometheus_reader = PrometheusMetricReader()
meter_provider = MeterProvider(resource=resource, metric_readers=[prometheus_reader])
metrics.set_meter_provider(meter_provider)
meter = metrics.get_meter("order-service-meter")
# Custom business and performance metrics
checkout_counter = meter.create_counter(
    name="order_checkouts_total",
    description="Total number of checkout requests processed",
    unit="1"
)
revenue_counter = meter.create_counter(
    name="order_revenue_total",
    description="Total revenue generated from checkouts",
    unit="USD"
)
checkout_errors = meter.create_counter(
    name="order_checkout_errors_total",
    description="Total failed checkouts",
    unit="1"
)

def setup_telemetry(app: FastAPI):
    """
    Instruments the FastAPI application.
    """
    # Auto-instrumentation hooks FastAPI request durations and details
    FastAPIInstrumentor.instrument_app(
        app,
        tracer_provider=tracer_provider,
        meter_provider=meter_provider
    )
Application Implementation (app/main.py)
Here we set up our FastAPI endpoints. Notice how trace context is injected on /checkout and extracted on /inventory/deduct using W3C Trace Context propagation.
import os
import time
import random
import httpx
from fastapi import FastAPI, Depends, HTTPException, Header, Response, Request
from pydantic import BaseModel
from sqlalchemy.orm import Session
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST
from app.database import init_db, get_db, Order, Inventory
from app.telemetry import (
setup_telemetry,
tracer,
checkout_counter,
revenue_counter,
checkout_errors
)
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
from opentelemetry.trace import StatusCode
app = FastAPI(title="E-Commerce Observability Demo API")
setup_telemetry(app)
INVENTORY_URL = os.getenv("INVENTORY_URL", "http://localhost:8000")
class CheckoutRequest(BaseModel):
    item: str
    quantity: int
    price: float

class InventoryDeductRequest(BaseModel):
    item: str
    quantity: int

@app.on_event("startup")
def on_startup():
    init_db()

@app.get("/")
def home():
    return {"message": "Welcome to the E-Commerce Observability Demo API!"}
@app.post("/checkout")
async def checkout(request_data: CheckoutRequest, db: Session = Depends(get_db)):
    checkout_counter.add(1, {"item": request_data.item})

    # Start parent trace span
    with tracer.start_as_current_span("checkout_transaction") as span:
        span.set_attribute("order.item", request_data.item)
        span.set_attribute("order.quantity", request_data.quantity)
        span.set_attribute("order.price_per_unit", request_data.price)

        # Child database span
        with tracer.start_as_current_span("db_create_order") as db_span:
            db_order = Order(
                item=request_data.item,
                quantity=request_data.quantity,
                price=request_data.price,
                status="PENDING"
            )
            db.add(db_order)
            db.commit()
            db.refresh(db_order)
            db_span.set_attribute("db.order_id", db_order.id)
            order_id = db_order.id

        # Downstream HTTP client span + Trace context injection
        headers = {}
        # Injects the 'traceparent' header; the current span here is checkout_transaction
        TraceContextTextMapPropagator().inject(headers)
        with tracer.start_as_current_span("http_call_inventory_service") as http_span:
            http_span.set_attribute("http.url", f"{INVENTORY_URL}/inventory/deduct")
            try:
                async with httpx.AsyncClient() as client:
                    response = await client.post(
                        f"{INVENTORY_URL}/inventory/deduct",
                        json={"item": request_data.item, "quantity": request_data.quantity},
                        headers=headers,
                        timeout=5.0
                    )
            except Exception as exc:
                checkout_errors.add(1, {"item": request_data.item, "reason": "inventory_network_error"})
                span.set_status(StatusCode.ERROR, f"Inventory communication failed: {exc}")
                http_span.record_exception(exc)
                db_order.status = "FAILED"
                db.commit()
                raise HTTPException(status_code=502, detail="Inventory service network error")

        if response.status_code != 200:
            checkout_errors.add(1, {"item": request_data.item, "reason": "inventory_rejection"})
            span.set_status(StatusCode.ERROR, f"Inventory check rejected: {response.text}")
            db_order.status = "FAILED"
            db.commit()
            raise HTTPException(status_code=response.status_code, detail=f"Inventory deduction rejected: {response.text}")

        # Success path
        db_order.status = "COMPLETED"
        db.commit()
        total_revenue = request_data.quantity * request_data.price
        revenue_counter.add(total_revenue, {"item": request_data.item})
        span.set_attribute("order.status", "COMPLETED")
        span.set_attribute("order.revenue", total_revenue)
        return {"order_id": order_id, "status": "COMPLETED", "item": request_data.item, "total_price": total_revenue}
@app.post("/inventory/deduct")
def deduct_inventory(request_data: InventoryDeductRequest, request: Request, db: Session = Depends(get_db)):
    # Extract trace parent header from HTTP client call
    carrier = {"traceparent": request.headers.get("traceparent", "")}
    extracted_context = TraceContextTextMapPropagator().extract(carrier=carrier)

    with tracer.start_as_current_span("inventory_deduct_span", context=extracted_context) as span:
        span.set_attribute("inventory.item", request_data.item)
        span.set_attribute("inventory.deduction_quantity", request_data.quantity)

        # Simulate network latency (1.0 to 2.5 seconds) for large smartphone orders
        if request_data.item == "smartphone" and request_data.quantity >= 3:
            delay = random.uniform(1.0, 2.5)
            span.set_attribute("simulation.latency_added", delay)
            time.sleep(delay)

        # Simulate db locks/deadlock conflicts (500 Server Error) for large laptop orders
        if request_data.item == "laptop" and request_data.quantity >= 2:
            if random.random() < 0.6:
                span.set_status(StatusCode.ERROR, "Simulated deadlock conflict")
                raise HTTPException(status_code=500, detail="Database deadlock conflict during write")

        db_item = db.query(Inventory).filter(Inventory.item == request_data.item).first()
        if not db_item or db_item.quantity < request_data.quantity:
            span.set_status(StatusCode.ERROR, "Insufficient stock")
            raise HTTPException(status_code=400, detail="Insufficient stock")

        db_item.quantity -= request_data.quantity
        db.commit()
        return {"status": "SUCCESS", "remaining_stock": db_item.quantity}
@app.get("/metrics")
def metrics_endpoint():
    """
    Exposes metrics scraped by Prometheus server.
    """
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
5. Orchestrating the Stack with Docker Compose
To run this observability stack locally, we define a multi-container deployment using Docker Compose. We launch:
- Our FastAPI App on port 8000.
- Prometheus on port 9090 to scrape the app's /metrics endpoint.
- Jaeger on port 16686 (Web UI) and 4317 (OTLP gRPC receiver) to receive tracing spans.
- Grafana on port 3000 to aggregate datasources.
Here is the docker-compose.yml config:
version: "3.8"
services:
  web:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OTEL_SERVICE_NAME=order-service
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
      - INVENTORY_URL=http://web:8000
    depends_on:
      - jaeger
  jaeger:
    image: jaegertracing/all-in-one:1.57
    ports:
      - "16686:16686" # Web UI
      - "4317:4317" # OTLP gRPC receiver
    environment:
      - COLLECTOR_OTLP_ENABLED=true
  prometheus:
    image: prom/prometheus:v2.52.0
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"
  grafana:
    image: grafana/grafana:11.0.0
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
    depends_on:
      - prometheus
      - jaeger
6. How it Looks in Action
When you run docker compose up --build, start the traffic generator (generate_traffic.py), and log in to Grafana, you will see a unified, correlated interface.
- High-Level Metric Monitoring: Grafana metrics charts track the overall request throughput, purchase revenue, and HTTP error rate.
- Drill-down to Latency: Under the average latency chart, when a spike occurs, you notice the /checkout route takes 2.2 seconds.
- Distributed Tracing Correlation: Because the trace context was propagated from /checkout down to /inventory/deduct, you can jump directly from Grafana into Jaeger to examine the trace hierarchy.
- Root-Cause Isolation: In Jaeger, you see that the parent span checkout_transaction spent 95% of its execution time inside the child span inventory_deduct_span. Looking at the span details, you see simulation.latency_added: 2.15, pointing directly to the simulated network lag!
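The root-cause step above amounts to comparing each child span's duration with its parent's. Here is a small sketch of that arithmetic over a hand-written trace. The Span class, the child_time_share helper, the span IDs, and all timings are made up for illustration, and only three of the trace's spans are included; Jaeger shows the same information visually in its timeline view.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Span:
    name: str
    span_id: str
    parent_id: Optional[str]
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def child_time_share(spans: List[Span], parent_name: str) -> Dict[str, float]:
    """Fraction of the parent's duration covered by each direct child span."""
    parent = next(s for s in spans if s.name == parent_name)
    return {
        s.name: s.duration_ms / parent.duration_ms
        for s in spans
        if s.parent_id == parent.span_id
    }


# Illustrative timings in milliseconds, not measurements from the demo
trace_spans = [
    Span("checkout_transaction", "a1", None, 0.0, 2260.0),
    Span("db_create_order", "b2", "a1", 5.0, 25.0),
    Span("inventory_deduct_span", "c3", "a1", 60.0, 2210.0),
]

shares = child_time_share(trace_spans, "checkout_transaction")
slowest = max(shares, key=shares.get)
print(slowest, f"{shares[slowest]:.0%}")
```

With these numbers the inventory span accounts for roughly 95% of the checkout, which is the kind of signal that tells you where to look next.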
7. Observability Best Practices for Production
If you are implementing observability in production, keep these best practices in mind:
- Use an OpenTelemetry Collector: In local development, exporting telemetry directly from the app to Prometheus/Jaeger is fine. In production, however, your application should stream telemetry data asynchronously via OTLP to a local OTel Collector daemon. The Collector processes, batches, and exports the data to your telemetry backend. This prevents application slowdowns during backend bottlenecks.
- Inject Trace Context into Logs: Configure your logger (e.g., Python logging or structlog) to format logs as JSON and inject the active trace ID (trace.get_current_span().get_span_context().trace_id, an integer that is normally rendered as a 32-character hex string). This bridges the gap between logging and tracing.
- Watch Out for Metric Cardinality: When defining metrics, do not add attributes with high cardinality (e.g., user IDs, order IDs, or session IDs) as tags. Every distinct combination of attribute values becomes its own time series, so unbounded values inflate Prometheus memory usage and can eventually exhaust it. Use traces or logs for high-cardinality attributes.
- Enforce Sampling: Sending 100% of traces to your backend is expensive and unnecessary for high-traffic applications. Configure head-based sampling to keep a fixed fraction of traces (e.g., 5%), or tail-based sampling in the Collector when the decision depends on the outcome (e.g., keep 5% of successful checkouts, but 100% of errors).
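As an illustration of the sampling advice, here is a sketch of a deterministic, ratio-based head sampling decision. It is similar in spirit to the ratio-based samplers shipped with tracing SDKs, but the should_sample function and its constants are written for this example and are not any library's API. Because the decision depends only on the trace ID, every service that sees the same trace reaches the same answer.

```python
import random

TRACE_ID_LIMIT = (1 << 64) - 1  # only the low 64 bits take part in the decision


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2**64."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_LIMIT) < bound


# Simulate 10,000 random 128-bit trace IDs with a fixed seed for repeatability
rng = random.Random(42)
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
kept = sum(should_sample(t, 0.05) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

A head-based decision like this is made when the trace starts, before the outcome is known, so it cannot favor failed requests. Keeping every error while dropping most successes requires deciding after the trace has completed, which is typically done in a collector tier.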
Conclusion
Observability is not just about installing software packages; it is an engineering discipline. Adopting OpenTelemetry ensures you decouple your application logic from any particular monitoring vendor.
Setting up unified tracing and metric aggregation, as shown in our E-Commerce demo, allows your team to go from identifying a high-level error spike to isolating the exact line of failing code or latency bottleneck in seconds.