DEV Community

SolvePath
I built a tool that analyzes OpenTelemetry traces and tells you what's wrong

I kept staring at Jaeger trace waterfalls trying to figure out why a request was slow. 20 spans across 5 services — which one is the problem? Is it the database? A downstream
timeout? An N+1 query hiding in plain sight?

So I built TraceLens — you paste an OpenTelemetry trace, and it tells you:

  • Root cause with confidence score
  • Bottlenecks ranked by impact (duration + percentage)
  • Fix recommendations with actual code examples

Try it now

👉 https://tracelens.dev

No signup. No API key. Click "Load sample trace" to see it analyze a real-world database constraint violation.

How it works

The naive approach would be to dump the entire OTLP JSON into an LLM and ask "what's wrong?" — but traces are verbose. A 10-span trace can be 15,000+ tokens of JSON.

TraceLens does three things before hitting the LLM:

  1. Parse and build a span tree

Raw OTLP JSON is a flat array of spans. TraceLens reconstructs the parent-child tree, computes the critical path (longest chain of sequential spans), and detects orphan
spans.
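This step can be sketched in plain Java. A minimal illustration, not TraceLens's actual code: the `Span` record, its field names, and the "always descend into the slowest child" shortcut for the critical path are all assumptions.

```java
import java.util.*;

// Minimal span-tree sketch. Field names (id, parentId, durationMs) are illustrative.
record Span(String id, String parentId, String name, long durationMs) {}

class SpanTree {
    final Map<String, Span> byId = new HashMap<>();
    final Map<String, List<Span>> children = new HashMap<>();
    final List<Span> orphans = new ArrayList<>();
    Span root;

    SpanTree(List<Span> spans) {
        for (Span s : spans) byId.put(s.id(), s);
        for (Span s : spans) {
            if (s.parentId() == null) root = s;            // trace root
            else if (byId.containsKey(s.parentId()))
                children.computeIfAbsent(s.parentId(), k -> new ArrayList<>()).add(s);
            else orphans.add(s);                           // parent missing from the trace
        }
    }

    // Approximates the critical path by always following the slowest child.
    List<Span> criticalPath() {
        List<Span> path = new ArrayList<>();
        for (Span cur = root; cur != null;
             cur = children.getOrDefault(cur.id(), List.of()).stream()
                     .max(Comparator.comparingLong(Span::durationMs)).orElse(null)) {
            path.add(cur);
        }
        return path;
    }
}
```

Orphan detection falls out for free here: any span whose parent id is absent from the trace never joins the tree.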

  2. Compress into compact notation

Instead of sending raw JSON, TraceLens compresses traces into a compact tree notation:

[order-service] POST /api/orders (3500ms, ERROR)
├── [order-service] SELECT orders WHERE customer_id = ? (15ms, OK)
├── [inventory-service] GET /api/inventory/check (800ms, OK)
│   └── [inventory-service] SELECT inventory WHERE sku = ? (45ms, OK)
└── [order-service] INSERT INTO orders (2300ms, ERROR)
    └── DataIntegrityViolationException: duplicate key (order_ref)

This is 5-7x fewer tokens than the raw OTLP JSON, while preserving all the diagnostic information.
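Rendering that notation is a small recursive walk. A hedged sketch, assuming a simplified `Node` type — the record and its field names are illustrative, not TraceLens internals:

```java
import java.util.*;

// Renders a span tree in the compact box-drawing notation shown above.
// The Node record is a stand-in for whatever TraceLens uses internally.
record Node(String service, String op, long ms, String status, List<Node> kids) {}

class CompactNotation {
    static String render(Node root) {
        StringBuilder sb = new StringBuilder(line(root)).append('\n');
        walk(root, "", sb);
        return sb.toString();
    }

    private static String line(Node n) {
        return "[%s] %s (%dms, %s)".formatted(n.service(), n.op(), n.ms(), n.status());
    }

    private static void walk(Node n, String prefix, StringBuilder sb) {
        List<Node> kids = n.kids();
        for (int i = 0; i < kids.size(); i++) {
            boolean last = (i == kids.size() - 1);
            Node k = kids.get(i);
            sb.append(prefix).append(last ? "└── " : "├── ").append(line(k)).append('\n');
            walk(k, prefix + (last ? "    " : "│   "), sb);  // indent continues under branch
        }
    }
}
```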

  3. Domain-tuned prompt

The system prompt is specifically tuned for distributed trace analysis. It knows about common patterns like:

  • N+1 queries (many identical child spans)
  • Context propagation failures
  • Timeout cascades
  • Connection pool exhaustion
  • Missing indexes (fast COUNT, slow SELECT)
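Several of these patterns are detectable structurally before the LLM ever sees the trace. As one hedged example for the N+1 case — the threshold and message wording are assumptions, not TraceLens internals — a heuristic that flags many identical child spans:

```java
import java.util.*;

// Flags a parent span whose children repeat the same operation name many times:
// the structural signature of an N+1 query. The threshold value is an assumption.
class NPlusOneDetector {
    static Optional<String> detect(List<String> childSpanNames, int threshold) {
        Map<String, Long> counts = new HashMap<>();
        for (String name : childSpanNames) counts.merge(name, 1L, Long::sum);
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= threshold)
                .max(Map.Entry.comparingByValue())
                .map(e -> "Possible N+1: '%s' repeated %d times"
                        .formatted(e.getKey(), e.getValue()));
    }
}
```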

What the output looks like

For the sample trace (a database constraint violation), TraceLens returns:

Root Cause (95% confidence): The INSERT failed because order_ref 'ORD-2024-001' already exists — indicating missing idempotency handling or a race condition.

Bottlenecks:

  • INSERT INTO orders — 2300ms (65.7% of total time)
  • GET /api/inventory/check — 800ms (22.9%)

Recommendations (with code):

  1. Add idempotent order creation with ON CONFLICT handling
  2. Fix order_ref generation to ensure uniqueness (UUID instead of deterministic)
  3. Add duplicate check before INSERT
  4. Cache inventory lookups
  5. Add database index on order_ref

Each recommendation includes a code example you can adapt.
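For recommendation 1, the core idea is an atomic "insert unless this order_ref already exists". Here is a minimal self-contained sketch, with a ConcurrentHashMap standing in for the orders table; in a real system this would be a database unique constraint plus ON CONFLICT DO NOTHING, and the class and method names below are illustrative:

```java
import java.util.*;
import java.util.concurrent.*;

// Idempotent order creation sketch. A ConcurrentHashMap stands in for the
// orders table so the example runs without a database.
class OrderStore {
    private final ConcurrentMap<String, String> byRef = new ConcurrentHashMap<>();

    /** Returns the existing order id if orderRef was already used (idempotent replay). */
    String createOrder(String orderRef) {
        String id = UUID.randomUUID().toString();
        // putIfAbsent is atomic, mirroring what a unique constraint gives you in SQL.
        String prev = byRef.putIfAbsent(orderRef, id);
        return prev != null ? prev : id;
    }
}
```

Replaying `createOrder("ORD-2024-001")` returns the original order id instead of raising the duplicate-key error seen in the sample trace.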

Tech stack

  • Backend: Java 25, Spring Boot 4, Spring AI
  • Frontend: React 19, Tailwind CSS v4
  • LLM: Claude (via Spring AI) with domain-tuned prompt
  • Hosting: Railway (API) + Vercel (frontend)

What's next

I'd love feedback from anyone working with distributed tracing:

  • Is this useful? Would you actually use it?
  • What traces would you want to test it with?
  • What's missing?

Try it: https://tracelens.dev

Apache 2.0 licensed.
