Alexandr Bandurchin for Uptrace

Distributed Tracing: From 100% Error Rate to Root Cause in 60 Seconds

A step-by-step breakdown of microservice incident investigation: service graph, trace timeline, log correlation, and structured debugging using Uptrace


Introduction

In a microservice architecture, a single failing endpoint can hide a problem in an entirely different service. In this article, I'll walk through a step-by-step investigation of a real incident — from detecting a 100% error rate to pinpointing the exact root cause — in under a minute.

We'll use Uptrace, an OpenTelemetry-native platform for tracing and monitoring. All examples are based on a real demo application with microservices.

What you'll learn:

  • How to use the service graph to find bottlenecks
  • Span filtering techniques for deep, focused analysis
  • How timeline reconstruction helps isolate the problem
  • Correlating traces and logs for root cause analysis
  • Going from investigation to prevention through monitoring

📊 The Symptom: A Broken Endpoint

Let's say users report that recommendations aren't loading. We open the APM dashboard and immediately spot the problem:

Endpoint: GET /api/recommendations
Error rate: 100% 🔴
Latency p90: 10.6ms
Rate: 1.9 spans/min


APM Overview → "Slowest endpoints" tab. The GET /api/recommendations endpoint is highlighted: error_rate = 100%, p90 = 10.6ms

A 100% error rate is critical: every single request is failing. But why?


🔍 Phase 1: Service Graph — Where Is the Problem?

The first step is understanding where exactly in the service chain the failure is happening. For this, we use the service graph.

What Is a Service Graph?

A service graph visualizes how services call each other. Each node is a service, each edge is a call between services.

Our graph looks like this:

[frontend] ---> [recommendation] ---> [product-catalog]
   ✅               🔴 100% errors           ❓


Service graph: red nodes (frontend, recommendation) signal errors. Green nodes (product-reviews, cart) are healthy. Edges show latency and error rate.

Already clear:

  • frontend is making requests (red node: the errors propagate up to it)
  • recommendation is returning errors 🔴
  • Green nodes (product-reviews, cart) are working normally ✅
  • The problem is somewhere inside or further down the chain
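To make the idea concrete, here is a minimal stdlib sketch of how service-graph edges can be derived from raw spans. The field names (span_id, parent_id, service) are illustrative, not Uptrace's actual span schema:

```python
# Derive service-graph edges from a flat list of spans: an edge exists where
# a span's parent belongs to a different service (a cross-service call).
from collections import Counter

def service_edges(spans):
    """Count caller -> callee edges between services, based on parent links."""
    by_id = {s["span_id"]: s for s in spans}
    edges = Counter()
    for span in spans:
        parent = by_id.get(span.get("parent_id"))
        if parent and parent["service"] != span["service"]:
            edges[(parent["service"], span["service"])] += 1
    return edges

spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a",  "service": "recommendation"},
    {"span_id": "c", "parent_id": "b",  "service": "product-catalog"},
]
print(service_edges(spans))
# two edges: frontend -> recommendation, recommendation -> product-catalog
```

Attaching per-edge error counts the same way is what turns this into the red/green picture above.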

Incoming vs Outgoing Perspective

The service graph supports two views:

Incoming — which services call this service (its dependents)

Outgoing — which services this service calls (its dependencies)

Switching between them reveals:

  • Incoming: frontend → recommendation (6.1ms duration, errors)
  • Outgoing: recommendation → ??? (this is where the problem is)

Phase 1 conclusion: The problem is in recommendation — it can't reach its downstream dependencies.


🎯 Phase 2: Analyzing Problematic Spans

Now we need to look at specific requests. We navigate to spans.

Filtering: Cutting Through the Noise

In a production system there are millions of spans. We need precise filtering:

Step 1: Focus on one service

service_name = frontend

Step 2: Only server spans (handled requests)

_kind = server

Step 3: Only errors

_status_code = error

Now we see only the problematic requests from frontend.

Aggregations: Understanding the Scale

We group spans by operation and check aggregations:

Aggregations:
  - perMin(count())     # requests per minute
  - p99(_dur_ms)        # 99th percentile latency
  - _error_rate         # error percentage

Group by:
  - _group_id           # group identical operations

Result:

Operation                       Rate      p99 Latency   Status
GET /api/recommendations        2.1/min   16ms          🔴 Errors
POST /api/checkout              1.4/min   23ms          🔴 Errors
GET /api/products/{productId}   11/min    5.7ms         🔴 Errors


Spans → Groups: three active filters (service_name, _kind, _status_code) narrowed the list to 5 groups. Aggregations perMin(count()), p99(_dur_ms), and _error_rate show the scale of the problem.

The problem affects multiple operations, but GET /api/recommendations shows a consistent error pattern. Let's drill deeper into this operation and look at specific requests.

Phase 2 conclusion: The problem is isolated to a handful of operations, so we can focus on the clearest signal.
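For intuition, the aggregations above can be sketched in a few lines of Python over hypothetical span records. This uses the nearest-rank percentile method; Uptrace's exact percentile implementation may differ:

```python
# Stdlib sketch of p99 latency and error rate over a batch of span records.
import math

def p99(durations_ms):
    """Nearest-rank 99th percentile of a non-empty list of durations."""
    ordered = sorted(durations_ms)
    rank = math.ceil(0.99 * len(ordered))
    return ordered[rank - 1]

def error_rate(statuses):
    """Fraction of spans whose status is 'error'."""
    return sum(s == "error" for s in statuses) / len(statuses)

spans = [
    {"dur_ms": 5.0,  "status": "error"},
    {"dur_ms": 16.0, "status": "error"},
    {"dur_ms": 4.0,  "status": "ok"},
    {"dur_ms": 23.0, "status": "error"},
]
print(p99([s["dur_ms"] for s in spans]))         # 23.0
print(error_rate([s["status"] for s in spans]))  # 0.75
```

The point of aggregating before drilling down is that a single slow or failed span proves nothing; the percentiles and rates tell you whether it is the pattern or the exception.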


📋 Phase 2.5: Selecting a Specific Span

We click on one of the GET /api/recommendations rows in the Groups table. A detail panel opens with information about that specific request.


Clicking GET /api/recommendations opens a side panel. The ATTRS tab shows all structured attributes as key-value pairs. The VIEW TRACE button (arrow) leads to the full request timeline.

Selecting a span reveals its attributes as structured key-value pairs, making it easy to filter and search for issues.

What we see in the attributes:

  • http_status_code: 500 — confirms a server error
  • http_route: /api/recommendations — the specific endpoint
  • http_target — full URL with query parameters
  • host_name: play-all-in-one — host running the service
  • next_span_name — next span in the call chain

This is structured logging in action — every field has a name and type, making debugging far more effective than working with plain text logs.
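A minimal stdlib sketch of the same idea: every log field is a named key-value pair instead of free text. The attribute names mirror the span attributes above; the JSON formatter is illustrative, not part of Uptrace:

```python
# Structured logging with the stdlib: attributes travel as a dict on the
# record and are serialized as JSON, so they stay filterable and typed.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {"message": record.getMessage()}
        payload.update(getattr(record, "attrs", {}))
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("frontend")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("request failed", extra={"attrs": {
    "http_status_code": 500,
    "http_route": "/api/recommendations",
}})
# emits: {"message": "request failed", "http_status_code": 500, "http_route": "/api/recommendations"}
```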

Now we click VIEW TRACE to see the full picture of the request across all microservices.


📈 Phase 3: Trace Timeline — Where Is the Bottleneck?

Clicking VIEW TRACE opens the full trace for this request.

What Is a Trace?

A trace is the full history of a single request through all microservices, visualized as a waterfall timeline.

Our trace looks like this:

GET /api/recommendations                                [9.5ms total]
├─ user_get_recommendations (load-generator)           [0.7ms]
│  └─ GET http://frontend-proxy:8080/...               [1.3ms]
│     └─ Ingress (frontend-proxy)                      [0.2ms]
│        └─ router frontend egress                     [0.8ms]
│           └─ GET http://frontend-proxy:8080/...      [0.3ms]
│              └─ GET /api/recommendations (frontend)  [6.4ms] 🔴
│                 └─ rpc:recommendation                [failures]


Waterfall trace view: the request path from load-generator through proxy to frontend. The last span (6.4ms) is the clear culprit for both latency and errors.

Timeline Analysis:

  • Total duration: 9.5ms
  • Load generator: 0.7ms (request initiation)
  • HTTP calls through proxy: ~2ms
  • Frontend processing: 6.4ms — this is where most of the time goes! 🔴
    • A significant portion is consumed by rpc:recommendation (26% of total)
    • Errors are visible (red markers)

Phase 3 conclusion: Bottleneck is in GET /api/recommendations on frontend (6.4ms out of 9.5ms). The request passes through several hops (load-generator → proxy → frontend), but most of the time and all the errors occur at the final stage when calling the recommendation service.

Inspecting the Problematic Span

We click on the longest span, GET /api/recommendations (6.4ms), in the waterfall. A detail panel opens on the right with complete span information.


Attributes panel of the slowest span: status error, http_status_code = 500, duration 6.4ms. All service metadata is visible: host, OS, Next.js library version.

This span shows exactly where the error occurs — frontend is processing the /api/recommendations request but receiving a 500 error.

Key attributes:

  • http_status_code: 500 — server returned an internal server error
  • http_target — full URL with parameter productIds=L9ECAV7KIM
  • next_span_name: GET /api/recommendations — next call in the chain
  • next_span_type: BaseServer.handleRequest — type of request handler
  • otel_library_name: next.js — Next.js framework is being used

Now we need to understand why this span is failing. For that, we look at the logs associated with this trace.


📋 Phase 4: Logs — Root Cause

The trace showed where the problem is. Now we need to understand why. We look at the logs tied to the trace.

Trace-Log Correlation

In Uptrace, traces and logs are linked via trace_id. No need to manually search logs by time or service — just click the LOGS & ERRORS (7) tab while in the waterfall view, and all logs for that specific request appear automatically.
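This correlation works because instrumentation stamps the active trace_id onto every log record. A minimal stdlib sketch of the mechanism; here the hardcoded id stands in for the active OpenTelemetry span context:

```python
# Trace-log correlation sketch: a logging.Filter attaches the current
# trace_id to every record, so all logs for one request share one id.
import logging

CURRENT_TRACE_ID = "7f3a9c"  # stand-in for the real trace id from the active span

class TraceIdFilter(logging.Filter):
    def filter(self, record):
        record.trace_id = CURRENT_TRACE_ID
        return True  # never drop records, only annotate them

logger = logging.getLogger("recommendation")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(trace_id)s %(levelname)s %(message)s"))
handler.addFilter(TraceIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("DNS resolution failed for product-catalog:3550")
# the log line now carries the trace id: "7f3a9c ERROR DNS resolution failed ..."
```

Once every record carries the id, "show me all logs for this trace" becomes a simple equality filter, which is exactly what the LOGS & ERRORS tab does.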


"LOGS & ERRORS (7)" tab: all 7 entries belong to one trace_id. The repeating message "DNS resolution failed for product-catalog:3550" with gRPC status UNAVAILABLE (code 14) — this is the root cause.

What we see in the logs:

A repeating error message appears multiple times with identical content:

Exception: _InactiveRpcError of RPC that terminated with:
  status = StatusCode.UNAVAILABLE
  details = "DNS resolution failed for product-catalog:3550;
            C-ares status is not ARES_SUCCESS
            qtype=A name=product-catalog is_balancer=0;
            DNS server returned general failure"
  grpc_status: 14
  grpc_message: "DNS resolution failed for product-catalog:3550"

We also see the HTTP access log confirming the outcome:

GET /api/recommendations?productIds=L9ECAV7KIM HTTP/1.1
Status: 500
User-Agent: python-requests/2.32.4


Each log entry is linked to the trace via trace_id. Clicking the link icon opens the full structured attributes of the entry in key-value format.

Root Cause Found! 🎯

Problem: DNS cannot resolve product-catalog:3550

Technical details:

  • gRPC Status: UNAVAILABLE (code 14)
  • DNS Query: A record for product-catalog
  • DNS Resolver: C-ares (async DNS resolver)
  • Result: "DNS server returned general failure"

Possible causes:

  • The product-catalog service is not running
  • Incorrect DNS record in Kubernetes/service mesh
  • Network policy blocking DNS access
  • Wrong hostname in config (should be product-catalog.namespace.svc.cluster.local)
  • DNS server overloaded or unreachable
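A quick triage sketch for the first few hypotheses: checking whether the hostname resolves at all separates "missing DNS record" from "service up but unreachable". The helper name is my own, not part of the demo app:

```python
# Hypothetical triage helper: does this hostname resolve from here at all?
import socket

def dns_resolves(host, port):
    """Return True if the resolver can produce an address for host:port."""
    try:
        socket.getaddrinfo(host, port)
        return True
    except socket.gaierror:  # the Python equivalent of the C-ares failure above
        return False

print(dns_resolves("localhost", 3550))  # True on any normal machine
# dns_resolves("product-catalog", 3550) would return False outside the cluster,
# mirroring the StatusCode.UNAVAILABLE error in the logs
```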

Time from symptom to root cause: ~60 seconds ⏱️


🚨 Phase 5: Prevention Through Monitoring

The investigation is complete — we found the root cause. In a production environment the next step would be a fix (correct the DNS record, restart the service, update the configuration). But it's not enough to fix the current problem — we need to prevent recurrence. For that, we create a monitor.

From Trace to Monitor:

  1. Go back to the TRACE tab
  2. Click the problematic span GET /api/recommendations (6.4ms) in the waterfall
  3. The span detail panel opens on the right
  4. Click the MONITOR dropdown
  5. Select Monitor p99 duration


The MONITOR button in the span panel opens a menu with 7 options. Selecting "Monitor p99 duration" — Uptrace automatically fills all form fields from the current span's attributes.

After selecting Monitor p99 duration, Uptrace automatically creates a form pre-filled with data from the span.

What gets auto-populated:

Monitor: httpserver:frontend > GET /api/recommendations p99 duration

Metric: uptrace_tracing_spans
Aggregation: p99($spans)
Filter:
  - _group_id = "8037161693486813802"
  (this is the span group ID for this operation)

Alert threshold: p99 latency > 15ms
Check interval: last 5 points (5 minutes)
Notification: Email to EVERYONE

The green zone on the chart shows the acceptable range (up to 15ms). When p99 latency exceeds this threshold, the monitor fires an alert automatically.

Now if p99 latency goes above 15ms, the team receives an automatic alert and can respond before the problem becomes critical. The monitor can be configured to send notifications to Slack, Telegram, PagerDuty, or email.
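The firing rule can be sketched in a few lines. This is my reading of "last 5 points": alert when all five recent p99 samples exceed the threshold; Uptrace's exact alerting semantics may differ:

```python
# Sketch of a threshold monitor over a series of p99 samples (one per minute).
def should_alert(p99_series_ms, threshold_ms=15.0, points=5):
    """Fire only when the last `points` samples all exceed the threshold."""
    recent = p99_series_ms[-points:]
    return len(recent) == points and all(v > threshold_ms for v in recent)

print(should_alert([9, 11, 16, 17, 18, 19, 20]))  # True: last 5 all above 15ms
print(should_alert([9, 11, 16, 14, 18, 19, 20]))  # False: the 14ms sample is in range
```

Requiring several consecutive bad points is what keeps a single latency spike from paging the whole team.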


📚 Takeaways: Investigation Techniques

1. Structured Filtering

Don't look at all spans at once. Filter by:

  • service_name — one service at a time
  • _kind = server — only handled requests
  • _status_code = error — only errors

2. Timeline Thinking

The trace timeline is the story of a request. Look for:

  • Longest spans (the bottleneck)
  • Failed spans (where it breaks)
  • Unexpected calls (unnecessary hops)

3. Log-Trace Correlation

Logs without context are useless. trace_id connects:

  • One log entry → full request flow
  • Structured attributes → filtering and search

4. Aggregations Matter

Don't rely on a single span. Aggregations reveal:

  • perMin(count()) — frequency of the issue
  • p99(_dur_ms) — worst-case latency
  • _error_rate — percentage of failures

5. From Investigation to Prevention

An investigation doesn't end at the fix. Monitors turn knowledge into automation.


🛠️ Tools & Techniques Recap

Tool              Usage                               What It Gives You
Service Graph     Dependency visualization            Where in the chain the problem is
Span filtering    service_name, _kind, _status_code   Remove noise, focus on the problem
Aggregations      perMin, p99, _error_rate            Scale and frequency of the issue
Trace timeline    Waterfall view                      Bottleneck within the request
Log correlation   trace_id linking                    Root cause analysis
Monitors          Automated alerting                  Prevention

💬 Conclusion

Distributed tracing isn't just "looking at logs." It's a systematic approach to incident investigation:

  1. Service graph — find the problematic service
  2. Span filtering — isolate the failing requests
  3. Trace timeline — understand the bottleneck
  4. Log correlation — identify the root cause
  5. Monitors — prevent recurrence

With the right tools (OpenTelemetry + Uptrace), the path from symptom to resolution takes less than a minute instead of hours of debugging.



Questions? Comments? Drop them below! 👇
