Alexandr Bandurchin for Uptrace

Distributed Tracing: From 100% Error Rate to Root Cause in 60 Seconds

A step-by-step breakdown of microservice incident investigation: service graph, trace timeline, log correlation, and structured debugging using Uptrace


Introduction

In a microservice architecture, a single failing endpoint can hide a problem in an entirely different service. In this article, I'll walk through a step-by-step investigation of a real incident — from detecting a 100% error rate to pinpointing the exact root cause — in under a minute.

We'll use Uptrace, an OpenTelemetry-native platform for tracing and monitoring. All examples are based on a real demo application with microservices.

What you'll learn:

  • How to use the service graph to find bottlenecks
  • Span filtering techniques for deep, focused analysis
  • How timeline reconstruction helps isolate the problem
  • Correlating traces and logs for root cause analysis
  • Going from investigation to prevention through monitoring

📊 The Symptom: A Broken Endpoint

Let's say users report that recommendations aren't loading. We open the APM dashboard and immediately spot the problem:

Endpoint: GET /api/recommendations
Error rate: 100% 🔴
Latency p90: 10.6ms
Rate: 1.9 spans/min


APM Overview → "Slowest endpoints" tab. The GET /api/recommendations endpoint is highlighted: error_rate = 100%, p90 = 10.6ms

A 100% error rate is critical: every single request is failing. But why?


🔍 Phase 1: Service Graph — Where Is the Problem?

The first step is understanding where exactly in the service chain the failure is happening. For this, we use the service graph.

What Is a Service Graph?

A service graph visualizes how services call each other. Each node is a service, each edge is a call between services.

Our graph looks like this:

[frontend] ---> [recommendation] ---> [product-catalog]
   ✅               🔴 100% errors           ❓


Service graph: red nodes (frontend, recommendation) signal errors. Green nodes (product-reviews, cart) are healthy. Edges show latency and error rate.

Already clear:

  • frontend is making requests (red node: the errors propagate up to it)
  • recommendation is returning errors 🔴
  • Green nodes (product-reviews, cart) are working normally ✅
  • The problem is somewhere inside or further down the chain
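To make the idea concrete, here is a minimal stdlib sketch of how service-graph edges can be derived from raw spans. The field names (span_id, parent_id, service) are illustrative, not Uptrace's actual span schema:

```python
# Derive service-graph edges from a flat list of spans: an edge exists where
# a span's parent belongs to a different service (a cross-service call).
from collections import Counter

def service_edges(spans):
    """Count caller -> callee edges between services, based on parent links."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges

spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a",  "service": "recommendation"},
    {"span_id": "c", "parent_id": "b",  "service": "product-catalog"},
]
print(service_edges(spans))
# two edges: frontend -> recommendation, recommendation -> product-catalog
```

Attaching per-edge error counts the same way is what turns this into the red/green picture above.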

Incoming vs Outgoing Perspective

The service graph supports two views:

Incoming — which services call this service (its dependents)

Outgoing — which services this service calls (its dependencies)

Switching between them reveals:

  • Incoming: frontend → recommendation (6.1ms duration, errors)
  • Outgoing: recommendation → ??? (this is where the problem is)

Phase 1 conclusion: The problem is in recommendation — it can't reach its downstream dependencies.


🎯 Phase 2: Analyzing Problematic Spans

Now we need to look at specific requests. We navigate to spans.

Filtering: Cutting Through the Noise

In a production system there are millions of spans. We need precise filtering:

Step 1: Focus on one service

service_name = frontend

Step 2: Only server spans (handled requests)

_kind = server

Step 3: Only errors

_status_code = error

Now we see only the problematic requests from frontend.

Aggregations: Understanding the Scale

We group spans by operation and check aggregations:

Aggregations:
  - perMin(count())     # requests per minute
  - p99(_dur_ms)        # 99th percentile latency
  - _error_rate         # error percentage

Group by:
  - _group_id           # group identical operations

Result:

Operation                       Rate      p99 Latency   Status
GET /api/recommendations        2.1/min   16ms          🔴 Errors
POST /api/checkout              1.4/min   23ms          🔴 Errors
GET /api/products/{productId}   11/min    5.7ms         🔴 Errors


Spans → Groups: three active filters (service_name, _kind, _status_code) narrowed the list to 5 groups. Aggregations perMin(count()), p99(_dur_ms), and _error_rate show the scale of the problem.

The problem affects multiple operations, but GET /api/recommendations shows a consistent error pattern. Let's drill deeper into this operation and look at specific requests.

Phase 2 conclusion: The problem is isolated to a handful of operations, so we can focus on the clearest signal.
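For intuition, the aggregations above can be sketched in a few lines of Python over hypothetical span records. This uses the nearest-rank percentile method; Uptrace's exact percentile implementation may differ:

```python
# Stdlib sketch of p99 latency and error rate over a batch of span records.
import math

def p99(durations_ms):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    ordered = sorted(durations_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

def error_rate(statuses):
    """Fraction of spans whose status is 'error'."""
    return sum(s == "error" for s in statuses) / len(statuses)

spans = [
    {"dur_ms": 5.0,  "status": "error"},
    {"dur_ms": 16.0, "status": "error"},
    {"dur_ms": 4.0,  "status": "ok"},
    {"dur_ms": 23.0, "status": "error"},
]
print(p99([s["dur_ms"] for s in spans]))         # 23.0
print(error_rate([s["status"] for s in spans]))  # 0.75
```

The point of aggregating before drilling down is that a single slow or failed span proves nothing; the percentiles and rates tell you whether it is the pattern or the exception.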


📋 Phase 2.5: Selecting a Specific Span

We click on one of the GET /api/recommendations rows in the Groups table. A detail panel opens with information about that specific request.


Clicking GET /api/recommendations opens a side panel. The ATTRS tab shows all structured attributes as key-value pairs. The VIEW TRACE button (arrow) leads to the full request timeline.

Selecting a span reveals its attributes as structured key-value pairs, making it easy to filter and search for issues.

What we see in the attributes:

  • http_status_code: 500 — confirms a server error
  • http_route: /api/recommendations — the specific endpoint
  • http_target — full URL with query parameters
  • host_name: play-all-in-one — host running the service
  • next_span_name — next span in the call chain

This is structured logging in action — every field has a name and type, making debugging far more effective than working with plain text logs.
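A minimal stdlib sketch of the same idea: every log field is a named key-value pair instead of free text. The attribute names mirror the span attributes above; the JSON formatter is illustrative, not part of Uptrace:

```python
# Structured logging with the stdlib: attributes travel as a dict on the
# record and are serialized as JSON, so they stay filterable and typed.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {"message": record.getMessage()}
        payload.update(getattr(record, "attrs", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("frontend")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request failed", extra={"attrs": {
    "http_status_code": 500,
    "http_route": "/api/recommendations",
}})
# emits: {"message": "request failed", "http_status_code": 500, "http_route": "/api/recommendations"}
```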

Now we click VIEW TRACE to see the full picture of the request across all microservices.


📈 Phase 3: Trace Timeline — Where Is the Bottleneck?

Clicking VIEW TRACE opens the full trace for this request.

What Is a Trace?

A trace is the full history of a single request through all microservices, visualized as a waterfall timeline.

Our trace looks like this:

GET /api/recommendations                                [9.5ms total]
├─ user_get_recommendations (load-generator)           [0.7ms]
│  └─ GET http://frontend-proxy:8080/...               [1.3ms]
│     └─ Ingress (frontend-proxy)                      [0.2ms]
│        └─ router frontend egress                     [0.8ms]
│           └─ GET http://frontend-proxy:8080/...      [0.3ms]
│              └─ GET /api/recommendations (frontend)  [6.4ms] 🔴
│                 └─ rpc:recommendation                [failures]


Waterfall trace view: the request path from load-generator through proxy to frontend. The last span (6.4ms) is the clear culprit for both latency and errors.

Timeline Analysis:

  • Total duration: 9.5ms
  • Load generator: 0.7ms (request initiation)
  • HTTP calls through proxy: ~2ms
  • Frontend processing: 6.4ms — this is where most of the time goes! 🔴
    • A significant portion is consumed by rpc:recommendation (26% of total)
    • Errors are visible (red markers)

Phase 3 conclusion: Bottleneck is in GET /api/recommendations on frontend (6.4ms out of 9.5ms). The request passes through several hops (load-generator → proxy → frontend), but most of the time and all the errors occur at the final stage when calling the recommendation service.

Inspecting the Problematic Span

We click on the longest span, GET /api/recommendations (6.4ms), in the waterfall. A detail panel opens on the right with complete span information.


Attributes panel of the slowest span: status error, http_status_code = 500, duration 6.4ms. All service metadata is visible: host, OS, Next.js library version.

This span shows exactly where the error occurs — frontend is processing the /api/recommendations request but receiving a 500 error.

Key attributes:

  • http_status_code: 500 — server returned an internal server error
  • http_target — full URL with parameter productIds=L9ECAV7KIM
  • next_span_name: GET /api/recommendations — next call in the chain
  • next_span_type: BaseServer.handleRequest — type of request handler
  • otel_library_name: next.js — Next.js framework is being used

Now we need to understand why this span is failing. For that, we look at the logs associated with this trace.


📋 Phase 4: Logs — Root Cause

The trace showed where the problem is. Now we need to understand why. We look at the logs tied to the trace.

Trace-Log Correlation

In Uptrace, traces and logs are linked via trace_id. No need to manually search logs by time or service — just click the LOGS & ERRORS (7) tab while in the waterfall view, and all logs for that specific request appear automatically.
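This correlation works because instrumentation stamps the active trace_id onto every log record. A minimal stdlib sketch of the mechanism; here the hardcoded id stands in for the active OpenTelemetry span context:

```python
# Trace-log correlation sketch: a logging.Filter attaches the current
# trace_id to every record, so all logs for one request share one id.
import logging

CURRENT_TRACE_ID = "7f3a9c"  # stand-in for the real trace id from the active span

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = CURRENT_TRACE_ID
        return True  # never drop records, only annotate them

logger = logging.getLogger("recommendation")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("DNS resolution failed for product-catalog:3550")
# the log line now carries the trace id: "7f3a9c ERROR DNS resolution failed ..."
```

Once every record carries the id, "show me all logs for this trace" becomes a simple equality filter, which is exactly what the LOGS & ERRORS tab does.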


"LOGS & ERRORS (7)" tab: all 7 entries belong to one trace_id. The repeating message "DNS resolution failed for product-catalog:3550" with gRPC status UNAVAILABLE (code 14) — this is the root cause.

What we see in the logs:

A repeating error message appears multiple times with identical content:

Exception: _InactiveRpcError of RPC that terminated with:
  status = StatusCode.UNAVAILABLE
  details = "DNS resolution failed for product-catalog:3550;
            C-ares status is not ARES_SUCCESS
            qtype=A name=product-catalog is_balancer=0;
            DNS server returned general failure"
  grpc_status: 14
  grpc_message: "DNS resolution failed for product-catalog:3550"

We also see the HTTP access log confirming the outcome:

GET /api/recommendations?productIds=L9ECAV7KIM HTTP/1.1
Status: 500
User-Agent: python-requests/2.32.4


Each log entry is linked to the trace via trace_id. Clicking the link icon opens the full structured attributes of the entry in key-value format.

Root Cause Found! 🎯

Problem: DNS cannot resolve product-catalog:3550

Technical details:

  • gRPC Status: UNAVAILABLE (code 14)
  • DNS Query: A record for product-catalog
  • DNS Resolver: C-ares (async DNS resolver)
  • Result: "DNS server returned general failure"

Possible causes:

  • The product-catalog service is not running
  • Incorrect DNS record in Kubernetes/service mesh
  • Network policy blocking DNS access
  • Wrong hostname in config (should be product-catalog.namespace.svc.cluster.local)
  • DNS server overloaded or unreachable
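A quick triage sketch for the first few hypotheses: checking whether the hostname resolves at all separates "missing DNS record" from "service up but unreachable". The helper name is my own, not part of the demo app:

```python
# Hypothetical triage helper: does this hostname resolve from here at all?
import socket

def dns_resolves(host, port):
    """Return True if the resolver can produce an address for host:port."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:  # the Python equivalent of the C-ares failure above
        return False

print(dns_resolves("localhost", 3550))  # True on any normal machine
# dns_resolves("product-catalog", 3550) would return False outside the cluster,
# mirroring the StatusCode.UNAVAILABLE error in the logs
```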

Time from symptom to root cause: ~60 seconds ⏱️


🚨 Phase 5: Prevention Through Monitoring

The investigation is complete — we found the root cause. In a production environment the next step would be a fix (correct the DNS record, restart the service, update the configuration). But it's not enough to fix the current problem — we need to prevent recurrence. For that, we create a monitor.

From Trace to Monitor:

  1. Go back to the TRACE tab
  2. Click the problematic span GET /api/recommendations (6.4ms) in the waterfall
  3. The span detail panel opens on the right
  4. Click the MONITOR dropdown
  5. Select Monitor p99 duration


The MONITOR button in the span panel opens a menu with 7 options. Selecting "Monitor p99 duration" — Uptrace automatically fills all form fields from the current span's attributes.

After selecting Monitor p99 duration, Uptrace automatically creates a form pre-filled with data from the span.

What gets auto-populated:

Monitor: httpserver:frontend > GET /api/recommendations p99 duration

Metric: uptrace_tracing_spans
Aggregation: p99($spans)
Filter:
  - _group_id = "8037161693486813802"
  (this is the span group ID for this operation)

Alert threshold: p99 latency > 15ms
Check interval: last 5 points (5 minutes)
Notification: Email to EVERYONE

The green zone on the chart shows the acceptable range (up to 15ms). When p99 latency exceeds this threshold, the monitor fires an alert automatically.

Now if p99 latency goes above 15ms, the team receives an automatic alert and can respond before the problem becomes critical. The monitor can be configured to send notifications to Slack, Telegram, PagerDuty, or email.
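The firing rule can be sketched in a few lines. This is my reading of "last 5 points": alert when all five recent p99 samples exceed the threshold; Uptrace's exact alerting semantics may differ:

```python
# Sketch of a threshold monitor over a series of p99 samples (one per minute).
def should_alert(p99_series_ms, threshold_ms=15.0, points=5):
    """Fire only when the last `points` samples all exceed the threshold."""
    recent = p99_series_ms[-points:]
    return len(recent) == points and all(v > threshold_ms for v in recent)

print(should_alert([9, 11, 16, 17, 18, 19, 20]))  # True: last 5 all above 15ms
print(should_alert([9, 11, 16, 14, 18, 19, 20]))  # False: the 14ms sample is in range
```

Requiring several consecutive bad points is what keeps a single latency spike from paging the whole team.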


📚 Takeaways: Investigation Techniques

1. Structured Filtering

Don't look at all spans at once. Filter by:

  • service_name — one service at a time
  • _kind = server — only handled requests
  • _status_code = error — only errors

2. Timeline Thinking

The trace timeline is the story of a request. Look for:

  • Longest spans (the bottleneck)
  • Failed spans (where it breaks)
  • Unexpected calls (unnecessary hops)

3. Log-Trace Correlation

Logs without context are useless. trace_id connects:

  • One log entry → full request flow
  • Structured attributes → filtering and search

4. Aggregations Matter

Don't rely on a single span. Aggregations reveal:

  • perMin(count()) — frequency of the issue
  • p99(_dur_ms) — worst-case latency
  • _error_rate — percentage of failures

5. From Investigation to Prevention

An investigation doesn't end at the fix. Monitors turn knowledge into automation.


🛠️ Tools & Techniques Recap

Tool              Usage                               What It Gives You
Service Graph     Dependency visualization            Where in the chain the problem is
Span filtering    service_name, _kind, _status_code   Remove noise, focus on the problem
Aggregations      perMin, p99, _error_rate            Scale and frequency of the issue
Trace timeline    Waterfall view                      Bottleneck within the request
Log correlation   trace_id linking                    Root cause analysis
Monitors          Automated alerting                  Prevention

💬 Conclusion

Distributed tracing isn't just "looking at logs." It's a systematic approach to incident investigation:

  1. Service graph — find the problematic service
  2. Span filtering — isolate the failing requests
  3. Trace timeline — understand the bottleneck
  4. Log correlation — identify the root cause
  5. Monitors — prevent recurrence

With the right tools (OpenTelemetry + Uptrace), the path from symptom to resolution takes less than a minute instead of hours of debugging.



Questions? Comments? Drop them below! 👇
