There is a massive difference between "Monitoring" and "Observability," yet we often use the terms interchangeably.
- Monitoring is looking at a dashboard to see if the CPU usage is above 80%. It answers known questions: "Is the database healthy?"
- Observability is the ability to ask new questions about your system without deploying new code. It answers the terrifying questions: "Why are payments failing only for iOS users in Canada using a Visa card?"
Most tutorials show you how to set up a dashboard. This article will show you how to catch a ghost.
We are going to implement Structured Tracing with High Cardinality attributes using OpenTelemetry (OTel). OTel is a vendor-neutral standard, so the same instrumentation can export to backends such as Azure Monitor, AWS X-Ray, Datadog, Honeycomb, or Jaeger.
The Scenario: "The Zombie Transaction"
Imagine you run an e-commerce platform. A customer complains: "My credit card was charged, but I never got an order confirmation."
Your logs show 200 OK on the web server. Your database metrics look healthy. You are flying blind. This is a "Zombie Transaction": the state is inconsistent across microservices.
To solve this, we need Distributed Tracing with Context Propagation.
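Context propagation means every service-to-service call carries the trace identity, so spans created in different services join the same trace. With OpenTelemetry's default propagator this travels as the W3C Trace Context `traceparent` HTTP header. The sketch below uses only the standard library to show what that header contains. The helper names (`new_traceparent`, `child_traceparent`, `parse_traceparent`) are hypothetical and exist only for illustration; in a real service the OTel propagators build and read this header for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    match = TRACEPARENT_RE.match(header.strip().lower())
    if match is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
    }


def child_traceparent(incoming: str) -> str:
    """Header a service sends downstream: same trace ID, fresh span ID."""
    ctx = parse_traceparent(incoming)
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"


# Checkout starts the trace; the call to Payment carries the same trace ID.
checkout_header = new_traceparent()
payment_header = child_traceparent(checkout_header)
```

Because the trace ID survives every hop, a backend can stitch the Checkout, Payment, and Inventory spans into one waterfall even though each service reports its spans independently.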
The Architecture
We will simulate a microservices workflow using Python (FastAPI):
- Checkout Service: Initiates the process.
- Payment Service: Charges the card (simulated).
- Inventory Service: Decrements stock.
If the Payment succeeds but Inventory fails, we have a Zombie Transaction.
Step 1: The Code (Python + OpenTelemetry)
We aren't just logging text; we are creating "Spans". A Span represents a unit of work.
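To make "unit of work" concrete, here is a deliberately simplified, standard-library model of what a span records: a name, timing, attributes, and the IDs that link it to its parent and its trace. `MiniSpan`, `start_span`, and `finished_spans` are hypothetical names for this illustration only; the real OTel SDK span carries more (status, events, links) and should be used in actual code.

```python
import contextvars
import secrets
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional

# Tracks which span is "current" in this execution context.
_current_span: contextvars.ContextVar = contextvars.ContextVar(
    "current_span", default=None
)
finished_spans: list = []  # stands in for an exporter


@dataclass
class MiniSpan:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None
    attributes: dict = field(default_factory=dict)
    error: Optional[str] = None


@contextmanager
def start_span(name: str):
    parent = _current_span.get()
    span = MiniSpan(
        name=name,
        # A child inherits the trace ID; a root span creates a new one.
        trace_id=parent.trace_id if parent else secrets.token_hex(16),
        span_id=secrets.token_hex(8),
        parent_id=parent.span_id if parent else None,
    )
    token = _current_span.set(span)
    try:
        yield span
    except Exception as exc:
        span.error = repr(exc)  # remember the failure, then let it propagate
        raise
    finally:
        span.end = time.perf_counter()
        _current_span.reset(token)
        finished_spans.append(span)


with start_span("checkout_process") as root:
    root.attributes["app.user_id"] = "customer_4094"
    with start_span("payment_gateway_call") as child:
        child.attributes["app.payment.amount"] = 99.00
```

The nested `with` blocks are what produce the parent/child relationship: the inner span picks up the outer span's trace ID and stores the outer span's ID as its parent.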
First, install the OTel libraries:
pip install fastapi uvicorn opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi opentelemetry-exporter-otlp
Now, let's build app.py. Pay close attention to the set_attribute lines. This is the secret sauce: High Cardinality Data.
import time
import random
import logging
from fastapi import FastAPI, HTTPException
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
# 1. Setup OpenTelemetry
# In a real app, you would export to an OTLP endpoint (Jaeger, Datadog, Azure).
# Here we print to Console for demonstration transparency.
resource = Resource(attributes={"service.name": "checkout-service"})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)
app = FastAPI()
logger = logging.getLogger("checkout")
# 2. The Payment Service (Mock)
def process_payment(user_id, amount):
    with tracer.start_as_current_span("payment_gateway_call") as span:
        # OBSERVABILITY BEST PRACTICE:
        # Add "High Cardinality" attributes.
        # This allows us later to filter traces by specific user IDs.
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.payment.amount", amount)
        time.sleep(0.1)  # Simulate network latency
        if random.random() < 0.1:  # 10% chance of random failure
            span.set_attribute("error", True)
            span.record_exception(Exception("Payment Gateway Timeout"))
            raise Exception("Payment Failed")
        return "tx_12345abc"
# 3. The Inventory Service (Mock)
def reserve_inventory(sku):
    with tracer.start_as_current_span("inventory_reservation") as span:
        span.set_attribute("app.sku", sku)
        time.sleep(0.2)
        # Simulate a logic bug: We have a "Zombie" scenario where
        # payment succeeded, but inventory crashes.
        if sku == "buggy-item-001":
            raise HTTPException(status_code=500, detail="Inventory Database Locked")
        return True
@app.post("/checkout")
async def checkout(user_id: str, sku: str):
    # This is the "Root Span"
    with tracer.start_as_current_span("checkout_process") as span:
        span.set_attribute("app.user_id", user_id)
        try:
            # Step 1: Charge User
            transaction_id = process_payment(user_id, 99.00)
            span.set_attribute("app.payment.tx_id", transaction_id)
            # Step 2: Reserve Item
            reserve_inventory(sku)
            return {"status": "success", "tx_id": transaction_id}
        except Exception as e:
            # We catch the error, but did the payment happen?
            # The trace will show us exactly where it broke.
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Why is this unique?
Most developers log like this:
logger.error("Error in checkout")
That message is useless: it says nothing about who was affected or which step failed. In the code above, we attached app.user_id to every relevant span.
When the customer complains, you don't grep logs for "Error". You go to your Observability tool (Jaeger/Datadog/Grafana) and run a query:
app.user_id == "customer_4094"
You will immediately see a Waterfall Visualization showing:
- checkout_process (Started)
- payment_gateway_call (Success - Money Taken!)
- inventory_reservation (Failed - Error: Inventory Database Locked)
You have just proven the "Zombie Transaction" exists without guessing.
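To show what that lookup amounts to, here is a rough standard-library sketch that runs the same query over a handful of in-memory span records. The record layout, the sample data, and the function names (`traces_for_user`, `is_zombie`, `waterfall`) are all assumptions made for illustration; a real backend stores spans differently and has its own query language.

```python
# Hypothetical exported spans, flattened to plain dicts for illustration.
SPANS = [
    {"trace_id": "t1", "name": "checkout_process", "start_ms": 0,
     "duration_ms": 310, "status": "ERROR",
     "attributes": {"app.user_id": "customer_4094"}},
    {"trace_id": "t1", "name": "payment_gateway_call", "start_ms": 2,
     "duration_ms": 100, "status": "OK",
     "attributes": {"app.user_id": "customer_4094", "app.payment.amount": 99.0}},
    {"trace_id": "t1", "name": "inventory_reservation", "start_ms": 104,
     "duration_ms": 200, "status": "ERROR",
     "attributes": {"app.sku": "buggy-item-001"}},
    {"trace_id": "t2", "name": "checkout_process", "start_ms": 0,
     "duration_ms": 305, "status": "OK",
     "attributes": {"app.user_id": "customer_0001"}},
    {"trace_id": "t2", "name": "payment_gateway_call", "start_ms": 1,
     "duration_ms": 100, "status": "OK",
     "attributes": {"app.user_id": "customer_0001", "app.payment.amount": 99.0}},
    {"trace_id": "t2", "name": "inventory_reservation", "start_ms": 103,
     "duration_ms": 200, "status": "OK",
     "attributes": {"app.sku": "widget-123"}},
]


def traces_for_user(spans, user_id):
    """Return all spans of every trace that has a span tagged with user_id."""
    trace_ids = {
        s["trace_id"] for s in spans
        if s["attributes"].get("app.user_id") == user_id
    }
    matching = [s for s in spans if s["trace_id"] in trace_ids]
    return sorted(matching, key=lambda s: (s["trace_id"], s["start_ms"]))


def is_zombie(trace_spans):
    """True when the payment span succeeded but the inventory span failed."""
    status = {s["name"]: s["status"] for s in trace_spans}
    return (status.get("payment_gateway_call") == "OK"
            and status.get("inventory_reservation") == "ERROR")


def waterfall(trace_spans):
    """Render one line per span, ordered by start time."""
    return "\n".join(
        f"{s['start_ms']:>4} ms  {s['name']:<24}{s['duration_ms']:>4} ms  {s['status']}"
        for s in trace_spans
    )


if __name__ == "__main__":
    hits = traces_for_user(SPANS, "customer_4094")
    print(waterfall(hits))
    print("zombie transaction:", is_zombie(hits))
```

Note that the inventory span itself carries no user ID. The query still finds it because the match is made on any span in the trace, and the whole trace is then pulled in by trace ID.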
Connecting to a Platform (The "How-To")
The code above uses ConsoleSpanExporter so you can see the JSON structure in your terminal immediately without an account.
To send this to a real platform, you swap the exporter in code, or configure it purely through environment variables.
For Jaeger/Grafana/New Relic/Datadog (via OTLP):
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
# Replace the ConsoleSpanExporter processor with this, and register it on the provider:
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
)
Theory: The Three Pillars are Not Enough
We used to talk about Logs, Metrics, and Traces (The Three Pillars). But modern Observability is about Correlation.
In our example:
- The Metric (Error Rate) triggers the alert.
- The Trace (The Waterfall) shows where the request died.
- The Log (The attributes inside the span) tells us who it affected.
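The three steps above can be sketched as one small pipeline over the same data: compute the metric, and if it crosses a threshold, drill into the failing traces and read the attributes. Everything here is hypothetical (the record fields, the sample values, the 25% threshold, and the `investigate` function); it is meant only to show how the signals connect, not how any particular platform implements alerting.

```python
from collections import Counter

# Hypothetical per-request summaries derived from root spans.
ROOT_SPANS = [
    {"trace_id": "t1", "status": "ERROR",
     "failed_step": "inventory_reservation", "user_id": "customer_4094"},
    {"trace_id": "t2", "status": "OK",
     "failed_step": None, "user_id": "customer_0001"},
    {"trace_id": "t3", "status": "ERROR",
     "failed_step": "inventory_reservation", "user_id": "customer_7781"},
    {"trace_id": "t4", "status": "OK",
     "failed_step": None, "user_id": "customer_0002"},
]

ALERT_THRESHOLD = 0.25  # assumed value, chosen only for this example


def error_rate(spans):
    """The metric: fraction of requests whose root span ended in ERROR."""
    return sum(s["status"] == "ERROR" for s in spans) / len(spans)


def investigate(spans, threshold=ALERT_THRESHOLD):
    """Metric triggers the alert, traces locate the failure, attributes name the users."""
    rate = error_rate(spans)
    if rate <= threshold:
        return None  # no alert, nothing to investigate
    failed = [s for s in spans if s["status"] == "ERROR"]
    steps = Counter(s["failed_step"] for s in failed)
    return {
        "error_rate": rate,
        "failing_step": steps.most_common(1)[0][0],
        "affected_users": sorted(s["user_id"] for s in failed),
        "trace_ids": [s["trace_id"] for s in failed],
    }


report = investigate(ROOT_SPANS)
```

The point of correlation is that each step narrows the search using data the previous step already produced, so no one has to jump between unrelated tools and guess at timestamps.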
Conclusion
Observability is not just about having colorful dashboards. It is about debuggability.
By using OpenTelemetry to standardize your instrumentation and focusing on High Cardinality Attributes (like User IDs, Order IDs, SKU codes), you turn your debugging process from a murder mystery into a simple lookup.