I kept staring at Jaeger trace waterfalls trying to figure out why a request was slow. 20 spans across 5 services — which one is the problem? Is it the database? A downstream
timeout? An N+1 query hiding in plain sight?
So I built TraceLens — you paste an OpenTelemetry trace, and it tells you:
- Root cause with confidence score
- Bottlenecks ranked by impact (duration + percentage)
- Fix recommendations with actual code examples
Try it now
No signup. No API key. Click "Load sample trace" to see it analyze a real-world database constraint violation.
How it works
The naive approach would be to dump the entire OTLP JSON into an LLM and ask "what's wrong?" — but traces are verbose. A 10-span trace can be 15,000+ tokens of JSON.
TraceLens does three things before hitting the LLM:
- Parse and build a span tree
Raw OTLP JSON is a flat array of spans. TraceLens reconstructs the parent-child tree, computes the critical path (longest chain of sequential spans), and detects orphan
spans.
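A minimal sketch of this step (the class and field names here are my own, not TraceLens internals; the critical path is approximated greedily by following the slowest child at each level):

```java
import java.util.*;

// Minimal span model: id/parentId as in OTLP, duration in ms.
record Span(String id, String parentId, String name, long durationMs) {}

class SpanTree {
    final Map<String, List<Span>> children = new HashMap<>();
    final List<Span> roots = new ArrayList<>();
    final List<Span> orphans = new ArrayList<>();

    SpanTree(List<Span> spans) {
        Set<String> ids = new HashSet<>();
        spans.forEach(s -> ids.add(s.id()));
        for (Span s : spans) {
            if (s.parentId() == null) {
                roots.add(s);            // trace root
            } else if (!ids.contains(s.parentId())) {
                orphans.add(s);          // parent span never arrived: likely broken context propagation
            } else {
                children.computeIfAbsent(s.parentId(), k -> new ArrayList<>()).add(s);
            }
        }
    }

    // Critical path, approximated by walking the slowest child chain from a root.
    List<Span> criticalPath(Span from) {
        List<Span> path = new ArrayList<>();
        Span cur = from;
        while (cur != null) {
            path.add(cur);
            cur = children.getOrDefault(cur.id(), List.of()).stream()
                    .max(Comparator.comparingLong(Span::durationMs))
                    .orElse(null);
        }
        return path;
    }
}
```

Orphans are worth surfacing on their own: a span whose parent ID points nowhere usually means a service dropped the trace context, which the LLM can then call out explicitly.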
- Compress into compact notation
Instead of sending raw JSON, TraceLens compresses traces into a compact tree notation:
[order-service] POST /api/orders (3500ms, ERROR)
├── [order-service] SELECT orders WHERE customer_id = ? (15ms, OK)
├── [inventory-service] GET /api/inventory/check (800ms, OK)
│   └── [inventory-service] SELECT inventory WHERE sku = ? (45ms, OK)
└── [order-service] INSERT INTO orders (2300ms, ERROR)
    └── DataIntegrityViolationException: duplicate key (order_ref)
This is 5-7x fewer tokens than the raw OTLP JSON, while preserving all the diagnostic information.
- Domain-tuned prompt
The system prompt is specifically tuned for distributed trace analysis. It knows about common patterns like:
- N+1 queries (many identical child spans)
- Context propagation failures
- Timeout cascades
- Connection pool exhaustion
- Missing indexes (fast COUNT, slow SELECT)
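Some of these patterns are cheap to pre-flag before the LLM ever sees the trace. N+1, for instance, is just many sibling spans with the same name under one parent. A rough heuristic (the threshold and example names are illustrative, not what TraceLens ships):

```java
import java.util.*;
import java.util.stream.*;

class NPlusOneDetector {
    // Flags span names that repeat more than `threshold` times under a single parent,
    // e.g. 25 identical "SELECT user WHERE id = ?" children of one HTTP handler span.
    static Map<String, Long> suspects(Map<String, List<String>> childNamesByParent, int threshold) {
        return childNamesByParent.values().stream()
                .flatMap(names -> names.stream()
                        .collect(Collectors.groupingBy(n -> n, Collectors.counting()))
                        .entrySet().stream())
                .filter(e -> e.getValue() > threshold)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue, Long::sum));
    }
}
```

Pre-flagging like this means the prompt can say "this looks like N+1 on span X" instead of hoping the model spots the repetition itself.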
What the output looks like
For the sample trace (a database constraint violation), TraceLens returns:
Root Cause (95% confidence): The INSERT failed because order_ref 'ORD-2024-001' already exists — indicating missing idempotency handling or a race condition.
Bottlenecks:
- INSERT INTO orders — 2300ms (65.7% of total time)
- GET /api/inventory/check — 800ms (22.9%)
Recommendations (with code):
- Add idempotent order creation with ON CONFLICT handling
- Fix order_ref generation to ensure uniqueness (UUID instead of deterministic)
- Add duplicate check before INSERT
- Cache inventory lookups
- Add database index on order_ref
Each recommendation includes a code example you can adapt.
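To give a flavor, the first two fixes look roughly like this (my own sketch, not TraceLens output; table and column names follow the sample trace, and the SQL assumes PostgreSQL-style ON CONFLICT):

```java
import java.util.UUID;

class OrderWriter {
    // UUID-based order_ref removes the collision risk of deterministic refs like "ORD-2024-001".
    static String newOrderRef() {
        return "ORD-" + UUID.randomUUID();
    }

    // Idempotent insert: a retry with the same order_ref becomes a no-op
    // instead of a DataIntegrityViolationException.
    static final String INSERT_ORDER = """
            INSERT INTO orders (order_ref, customer_id, total)
            VALUES (?, ?, ?)
            ON CONFLICT (order_ref) DO NOTHING
            """;
}
```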
Tech stack
- Backend: Java 25, Spring Boot 4, Spring AI
- Frontend: React 19, Tailwind CSS v4
- LLM: Claude (via Spring AI) with domain-tuned prompt
- Hosting: Railway (API) + Vercel (frontend)
What's next
I'd love feedback from anyone working with distributed tracing:
- Is this useful? Would you actually use it?
- What traces would you want to test it with?
- What's missing?
Try it: https://tracelens.dev
Apache 2.0 licensed.