AHMED HASAN AKHTAR OVIEDO

Posted on Nov 30, 2025

The Case of the Zombie Transaction: Solving 'Unknown Unknowns' with OpenTelemetry & High Cardinality

#observability #opentelemetry #devops #monitoring

There is a massive difference between "Monitoring" and "Observability," yet we often use the terms interchangeably.

Monitoring is looking at a dashboard to see if the CPU usage is above 80%. It answers known questions: "Is the database healthy?"
Observability is the ability to ask new questions about your system without deploying new code. It answers the terrifying questions: "Why are payments failing only for iOS users in Canada using a Visa card?"

Most tutorials show you how to set up a dashboard. This article will show you how to catch a ghost.

We are going to implement Structured Tracing with High Cardinality attributes using OpenTelemetry (OTel). OTel is a vendor-neutral standard, so the same instrumentation can feed most backends (Azure Monitor, AWS X-Ray, Datadog, Honeycomb, or Jaeger).

The Scenario: "The Zombie Transaction"

Imagine you run an e-commerce platform. A customer complains: "My credit card was charged, but I never got an order confirmation."

Your logs show 200 OK on the web server. Your database metrics look healthy. You are flying blind. This is a "Zombie Transaction"—the state is inconsistent across microservices.

To solve this, we need Distributed Tracing with Context Propagation.
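Context propagation is what lets spans from different services land in the same trace. As a rough sketch of the idea (not OTel's actual implementation), the caller writes a W3C `traceparent` header and the callee reads it back. The helper names below are hypothetical and the validation is simplified; in a real service you would rely on `inject` and `extract` from `opentelemetry.propagate` rather than hand-rolling this.

```python
import re
import secrets

# W3C Trace Context header layout: version-traceid-parentid-flags (lowercase hex).
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Return a random 8-byte span id as 16 hex characters."""
    return secrets.token_hex(8)


def inject_context(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> None:
    """Caller side: write the current trace context into the outgoing headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"


def extract_context(headers: dict):
    """Callee side: read the trace context, or return None if absent or malformed."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_span_id, flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "sampled": bool(int(flags, 16) & 1),  # bit 0 is the "sampled" flag
    }


# Checkout calls Payment: the trace id travels along, the span id becomes the parent.
outgoing_headers = {}
inject_context(outgoing_headers, secrets.token_hex(16), new_span_id())
incoming = extract_context(outgoing_headers)
payment_span = {
    "trace_id": incoming["trace_id"],
    "parent_span_id": incoming["parent_span_id"],
    "span_id": new_span_id(),
}
```

Because the Payment span reuses the caller's trace id, a backend can stitch both services into one waterfall.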

The Architecture

We will simulate a microservices workflow in a single Python (FastAPI) app, with each service represented by a function:

Checkout Service: Initiates the process.
Payment Service: Charges the card (simulated).
Inventory Service: Decrements stock.

If the Payment succeeds but Inventory fails, we have a Zombie Transaction.

Step 1: The Code (Python + OpenTelemetry)

We aren't just logging text; we are creating "Spans". A Span represents a unit of work.

First, install the OTel libraries:

pip install fastapi uvicorn opentelemetry-api opentelemetry-sdk \
opentelemetry-instrumentation-fastapi opentelemetry-exporter-otlp

Now, let's build app.py. Pay close attention to the set_attribute lines. This is the secret sauce: High Cardinality Data.

import time
import random
import logging
from fastapi import FastAPI, HTTPException
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# 1. Setup OpenTelemetry
# In a real app, you would export to an OTLP endpoint (Jaeger, Datadog, Azure).
# Here we print to Console for demonstration transparency.
resource = Resource(attributes={"service.name": "checkout-service"})
trace.set_tracer_provider(TracerProvider(resource=resource))
tracer = trace.get_tracer(__name__)
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(ConsoleSpanExporter())
)

app = FastAPI()
logger = logging.getLogger("checkout")

# 2. The Payment Service (Mock)
def process_payment(user_id, amount):
    with tracer.start_as_current_span("payment_gateway_call") as span:
        # OBSERVABILITY BEST PRACTICE:
        # Add "High Cardinality" attributes. 
        # This allows us later to filter traces by specific user IDs.
        span.set_attribute("app.user_id", user_id)
        span.set_attribute("app.payment.amount", amount)

        time.sleep(0.1) # Simulate network latency
        if random.random() < 0.1: # 10% chance of random failure
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            span.record_exception(Exception("Payment Gateway Timeout"))
            raise Exception("Payment Failed")

        return "tx_12345abc"

# 3. The Inventory Service (Mock)
def reserve_inventory(sku):
    with tracer.start_as_current_span("inventory_reservation") as span:
        span.set_attribute("app.sku", sku)
        time.sleep(0.2)
        # Simulate a logic bug: We have a "Zombie" scenario where 
        # payment succeeded, but inventory crashes.
        if sku == "buggy-item-001":
            raise HTTPException(status_code=500, detail="Inventory Database Locked")
        return True

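# Declared with plain `def` because the mocks call time.sleep(); FastAPI runs
# sync handlers in a threadpool, so the event loop is not blocked.
@app.post("/checkout")
def checkout(user_id: str, sku: str):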
    # This is the "Root Span"
    with tracer.start_as_current_span("checkout_process") as span:
        span.set_attribute("app.user_id", user_id)

        try:
            # Step 1: Charge User
            transaction_id = process_payment(user_id, 99.00)
            span.set_attribute("app.payment.tx_id", transaction_id)

            # Step 2: Reserve Item
            reserve_inventory(sku)

            return {"status": "success", "tx_id": transaction_id}

        except Exception as e:
            # We catch the error, but did the payment happen?
            # The trace will show us exactly where it broke.
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
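To see a Zombie Transaction in the console output, you need to hit the endpoint with the buggy SKU. Since `user_id` and `sku` are plain `str` parameters, FastAPI reads them from the query string. Here is a small standard-library client sketch; the helper names are made up for illustration, and it assumes the app is running on localhost:8000 as in the code above.

```python
import json
import urllib.error
import urllib.parse
import urllib.request


def checkout_url(base_url: str, user_id: str, sku: str) -> str:
    """Build the /checkout URL with both values as query parameters."""
    query = urllib.parse.urlencode({"user_id": user_id, "sku": sku})
    return f"{base_url.rstrip('/')}/checkout?{query}"


def post_checkout(base_url: str, user_id: str, sku: str):
    """POST to /checkout and return (status_code, parsed_json_body)."""
    request = urllib.request.Request(checkout_url(base_url, user_id, sku), method="POST")
    try:
        with urllib.request.urlopen(request, timeout=5) as response:
            return response.status, json.loads(response.read())
    except urllib.error.HTTPError as err:
        # FastAPI answers errors with a JSON body such as {"detail": "..."}.
        return err.code, json.loads(err.read())


# With the server running, this should trigger the inventory failure:
# post_checkout("http://localhost:8000", "customer_4094", "buggy-item-001")
```

Each call prints the spans for one trace to the server's terminal once the batch processor flushes.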

Why is this unique?

Most developers log like this:
logger.error("Error in checkout")

That message tells you nothing about who was affected or where the request broke. In the code above, we injected app.user_id.

When the customer complains, you don't grep logs for "Error". You go to your Observability tool (Jaeger/Datadog/Grafana) and run a query:

app.user_id == "customer_4094"

You will immediately see a Waterfall Visualization showing:

checkout_process (Started)
payment_gateway_call (Success - Money Taken!)
inventory_reservation (Failed - Error: Inventory Database Locked)

You have just proven the "Zombie Transaction" exists without guessing.
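If you wanted to automate that lookup, the logic might look roughly like this. It is a sketch over plain dictionaries, not any backend's query API: the span shape (`trace_id`, `name`, `status`, `attributes`) and the function names are assumptions made for illustration. Note that the inventory span in our app carries only `app.sku`, so we first find the user's traces and then inspect every span inside them. A span that finished without error normally reports `UNSET`, so "success" here means "not ERROR".

```python
from collections import defaultdict


def traces_for_user(spans, user_id):
    """Return the ids of all traces that contain a span tagged with this user."""
    return {s["trace_id"] for s in spans if s["attributes"].get("app.user_id") == user_id}


def find_zombie_traces(spans, user_id):
    """Return trace ids where payment did not fail but inventory did."""
    wanted = traces_for_user(spans, user_id)
    statuses = defaultdict(dict)
    for span in spans:
        if span["trace_id"] in wanted:
            statuses[span["trace_id"]][span["name"]] = span["status"]
    return sorted(
        trace_id
        for trace_id, by_name in statuses.items()
        if "payment_gateway_call" in by_name
        and by_name["payment_gateway_call"] != "ERROR"
        and by_name.get("inventory_reservation") == "ERROR"
    )


# Hypothetical exported spans: trace-a is a zombie, trace-b failed at payment,
# trace-c is a zombie that belongs to a different user.
SPANS = [
    {"trace_id": "trace-a", "name": "checkout_process", "status": "ERROR",
     "attributes": {"app.user_id": "customer_4094"}},
    {"trace_id": "trace-a", "name": "payment_gateway_call", "status": "UNSET",
     "attributes": {"app.user_id": "customer_4094"}},
    {"trace_id": "trace-a", "name": "inventory_reservation", "status": "ERROR",
     "attributes": {"app.sku": "buggy-item-001"}},
    {"trace_id": "trace-b", "name": "checkout_process", "status": "ERROR",
     "attributes": {"app.user_id": "customer_4094"}},
    {"trace_id": "trace-b", "name": "payment_gateway_call", "status": "ERROR",
     "attributes": {"app.user_id": "customer_4094"}},
    {"trace_id": "trace-c", "name": "payment_gateway_call", "status": "UNSET",
     "attributes": {"app.user_id": "customer_0001"}},
    {"trace_id": "trace-c", "name": "inventory_reservation", "status": "ERROR",
     "attributes": {"app.sku": "buggy-item-001"}},
]

print(find_zombie_traces(SPANS, "customer_4094"))  # ['trace-a']
```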

Connecting to a Platform (The "How-To")

The code above uses ConsoleSpanExporter so you can see the JSON structure in your terminal immediately without an account.

To send this to a real platform, you swap the exporter, which is a one-line change, or you configure it entirely through environment variables.

For Jaeger/Grafana/New Relic/Datadog (via OTLP):

from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Replace ConsoleSpanExporter with this (4317 is the default OTLP/gRPC port):
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317"))
trace.get_tracer_provider().add_span_processor(span_processor)
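For the environment-variable route, one possible arrangement is to pick the exporter at startup based on `OTEL_EXPORTER_OTLP_ENDPOINT`, the standard OTel variable for the OTLP endpoint. The two helper functions below are hypothetical, and the OTel imports happen lazily inside the builder, so treat this as a sketch rather than the canonical setup.

```python
import os


def pick_exporter_kind(env=os.environ) -> str:
    """Return "otlp" when an OTLP endpoint is configured, else "console"."""
    return "otlp" if env.get("OTEL_EXPORTER_OTLP_ENDPOINT") else "console"


def build_span_exporter(env=os.environ):
    """Create the span exporter that matches the environment (hypothetical helper)."""
    if pick_exporter_kind(env) == "otlp":
        from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

        # Passed explicitly here so the choice is visible in one place.
        return OTLPSpanExporter(endpoint=env["OTEL_EXPORTER_OTLP_ENDPOINT"])

    from opentelemetry.sdk.trace.export import ConsoleSpanExporter

    # No endpoint configured: keep printing spans to the terminal.
    return ConsoleSpanExporter()


# In app.py you would then write:
# trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(build_span_exporter()))
```

With this in place, the same code prints to the console on a laptop and ships to a collector wherever the variable is set.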

Theory: The Three Pillars are Not Enough

We used to talk about Logs, Metrics, and Traces (The Three Pillars). But modern Observability is about Correlation.

In our example:

The Metric (Error Rate) triggers the alert.
The Trace (The Waterfall) shows where the request died.
The Log (The attributes inside the span) tells us who it affected.
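One common way to get that correlation for ordinary log lines is to stamp each record with the trace id of the request in progress. Below is a standard-library sketch: the `current_trace_id` context variable is a hypothetical stand-in for the active span, which in an OTel app you would read via `trace.get_current_span().get_span_context().trace_id`. The logger name and format string are also just examples.

```python
import contextvars
import io
import logging

# Hypothetical stand-in for the active span's trace id (OTel tracks this itself).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace id to every log record passing through."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True


def make_logger(stream) -> logging.Logger:
    """Build a logger whose lines carry trace_id=... for later correlation."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
    handler.addFilter(TraceIdFilter())
    logger = logging.getLogger("checkout.correlated")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("Inventory Database Locked")
print(buffer.getvalue(), end="")
# ERROR trace_id=4bf92f3577b34da6a3ce929d0e0e4736 Inventory Database Locked
```

Once the id is in the log line, jumping from an alert to the trace to the matching logs becomes a search on a single value.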

Conclusion

Observability is not just about having colorful dashboards. It is about debuggability.

By using OpenTelemetry to standardize your instrumentation and focusing on High Cardinality Attributes (like User IDs, Order IDs, SKU codes), you turn your debugging process from a murder mystery into a simple lookup.

Top comments (3)

JOAN CRISTIAN MEDINA QUISPE • Dec 4 '25

It is fascinating to see how a hung transaction can slip under the radar of conventional dashboards. Enriching OTel Spans with business context is, in my opinion, what separates "monitoring" from true "observability".

As a constructive suggestion: it would be great if you could share (perhaps in a future post or snippet) how you configured the alert so that this does not happen again. Did you create a derived metric based on the duration of that specific span?

DAVID JORDAN ANAMPA PANCCA • Dec 4 '25

I think the way you explain the difference between monitoring and observability is excellent. Using OpenTelemetry with high-cardinality attributes to track down problems like "Zombie Transactions" makes everything much clearer. It is a practical, direct approach to solving complex problems without wasting time searching through logs.

Minerva Education AI SRL • Dec 4 '25

Great work — genuinely entertaining article!