DEV Community

SolvePath
I built a tool that analyzes OpenTelemetry traces and tells you what's wrong

I kept staring at Jaeger trace waterfalls trying to figure out why a request was slow. 20 spans across 5 services — which one is the problem? Is it the database? A downstream
timeout? An N+1 query hiding in plain sight?

So I built TraceLens — you paste an OpenTelemetry trace, and it tells you:

  • Root cause with confidence score
  • Bottlenecks ranked by impact (duration + percentage)
  • Fix recommendations with actual code examples

Try it now

👉 https://tracelens.dev

No signup. No API key. Click "Load sample trace" to see it analyze a real-world database constraint violation.

How it works

The naive approach would be to dump the entire OTLP JSON into an LLM and ask "what's wrong?" — but traces are verbose. A 10-span trace can be 15,000+ tokens of JSON.

TraceLens does three things before hitting the LLM:

  1. Parse and build a span tree

Raw OTLP JSON is a flat array of spans. TraceLens reconstructs the parent-child tree, computes the critical path (longest chain of sequential spans), and detects orphan
spans.
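This step can be sketched in plain Java. A minimal illustration, not TraceLens's actual code: the `Span` record, its field names, and the "always descend into the slowest child" shortcut for the critical path are all assumptions.

```java
import java.util.*;

// Minimal span-tree sketch. Field names (id, parentId, durationMs) are illustrative.
record Span(String id, String parentId, String name, long durationMs) {}

class SpanTree {
    final Map<String, Span> byId = new HashMap<>();
    final Map<String, List<Span>> children = new HashMap<>();
    final List<Span> orphans = new ArrayList<>();
    Span root;

    SpanTree(List<Span> spans) {
        for (Span s : spans) byId.put(s.id(), s);
        for (Span s : spans) {
            if (s.parentId() == null) root = s;            // trace root
            else if (byId.containsKey(s.parentId()))
                children.computeIfAbsent(s.parentId(), k -> new ArrayList<>()).add(s);
            else orphans.add(s);                           // parent missing from the trace
        }
    }

    // Approximates the critical path by always following the slowest child.
    List<Span> criticalPath() {
        List<Span> path = new ArrayList<>();
        for (Span cur = root; cur != null;
             cur = children.getOrDefault(cur.id(), List.of()).stream()
                     .max(Comparator.comparingLong(Span::durationMs)).orElse(null)) {
            path.add(cur);
        }
        return path;
    }
}
```

Orphan detection falls out for free here: any span whose parent id is absent from the trace never joins the tree.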

  2. Compress into compact notation

Instead of sending raw JSON, TraceLens compresses traces into a compact tree notation:

[order-service] POST /api/orders (3500ms, ERROR)
├── [order-service] SELECT orders WHERE customer_id = ? (15ms, OK)
├── [inventory-service] GET /api/inventory/check (800ms, OK)
│   └── [inventory-service] SELECT inventory WHERE sku = ? (45ms, OK)
└── [order-service] INSERT INTO orders (2300ms, ERROR)
    └── DataIntegrityViolationException: duplicate key (order_ref)

This is 5-7x fewer tokens than the raw OTLP JSON, while preserving all the diagnostic information.
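Rendering that notation is a small recursive walk. A hedged sketch, assuming a simplified `Node` type — the record and its field names are illustrative, not TraceLens internals:

```java
import java.util.*;

// Renders a span tree in the compact box-drawing notation shown above.
// The Node record is a stand-in for whatever TraceLens uses internally.
record Node(String service, String op, long ms, String status, List<Node> kids) {}

class CompactNotation {
    static String render(Node root) {
        StringBuilder sb = new StringBuilder(line(root)).append('\n');
        walk(root, "", sb);
        return sb.toString();
    }

    private static String line(Node n) {
        return "[%s] %s (%dms, %s)".formatted(n.service(), n.op(), n.ms(), n.status());
    }

    private static void walk(Node n, String prefix, StringBuilder sb) {
        List<Node> kids = n.kids();
        for (int i = 0; i < kids.size(); i++) {
            boolean last = (i == kids.size() - 1);
            Node k = kids.get(i);
            sb.append(prefix).append(last ? "└── " : "├── ").append(line(k)).append('\n');
            walk(k, prefix + (last ? "    " : "│   "), sb);  // indent continues under branch
        }
    }
}
```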

  3. Domain-tuned prompt

The system prompt is specifically tuned for distributed trace analysis. It knows about common patterns like:

  • N+1 queries (many identical child spans)
  • Context propagation failures
  • Timeout cascades
  • Connection pool exhaustion
  • Missing indexes (fast COUNT, slow SELECT)
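Several of these patterns are detectable structurally before the LLM ever sees the trace. As one hedged example for the N+1 case — the threshold and message wording are assumptions, not TraceLens internals — a heuristic that flags many identical child spans:

```java
import java.util.*;

// Flags a parent span whose children repeat the same operation name many times:
// the structural signature of an N+1 query. The threshold value is an assumption.
class NPlusOneDetector {
    static Optional<String> detect(List<String> childSpanNames, int threshold) {
        Map<String, Long> counts = new HashMap<>();
        for (String name : childSpanNames) counts.merge(name, 1L, Long::sum);
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .max(Map.Entry.comparingByValue())
                .map(e -> "Possible N+1: '%s' repeated %d times"
                        .formatted(e.getKey(), e.getValue()));
    }
}
```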

What the output looks like

For the sample trace (a database constraint violation), TraceLens returns:

Root Cause (95% confidence): The INSERT failed because order_ref 'ORD-2024-001' already exists — indicating missing idempotency handling or a race condition.

Bottlenecks:

  • INSERT INTO orders — 2300ms (65.7% of total time)
  • GET /api/inventory/check — 800ms (22.9%)

Recommendations (with code):

  1. Add idempotent order creation with ON CONFLICT handling
  2. Fix order_ref generation to ensure uniqueness (UUID instead of deterministic)
  3. Add duplicate check before INSERT
  4. Cache inventory lookups
  5. Add database index on order_ref

Each recommendation includes a code example you can adapt.
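For recommendation 1, the core idea is an atomic "insert unless this order_ref already exists". Here is a minimal self-contained sketch, with a ConcurrentHashMap standing in for the orders table; in a real system this would be a database unique constraint plus ON CONFLICT DO NOTHING, and the class and method names below are illustrative:

```java
import java.util.*;
import java.util.concurrent.*;

// Idempotent order creation sketch. A ConcurrentHashMap stands in for the
// orders table so the example runs without a database.
class OrderStore {
    private final ConcurrentMap<String, String> byRef = new ConcurrentHashMap<>();

    /** Returns the existing order id if orderRef was already used (idempotent replay). */
    String createOrder(String orderRef) {
        String id = UUID.randomUUID().toString();
        // putIfAbsent is atomic, mirroring what a unique constraint gives you in SQL.
        String prev = byRef.putIfAbsent(orderRef, id);
        return prev != null ? prev : id;
    }
}
```

Replaying `createOrder("ORD-2024-001")` returns the original order id instead of raising the duplicate-key error seen in the sample trace.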

Tech stack

  • Backend: Java 25, Spring Boot 4, Spring AI
  • Frontend: React 19, Tailwind CSS v4
  • LLM: Claude (via Spring AI) with domain-tuned prompt
  • Hosting: Railway (API) + Vercel (frontend)

What's next

I'd love feedback from anyone working with distributed tracing:

  • Is this useful? Would you actually use it?
  • What traces would you want to test it with?
  • What's missing?

Try it: https://tracelens.dev

Apache 2.0 licensed.
