A step-by-step breakdown of microservice incident investigation: service graph, trace timeline, log correlation, and structured debugging using Uptrace
Introduction
In a microservice architecture, a single failing endpoint can hide a problem in an entirely different service. In this article, I'll walk through a step-by-step investigation of a real incident — from detecting a 100% error rate to pinpointing the exact root cause — in under a minute.
We'll use Uptrace, an OpenTelemetry-native platform for tracing and monitoring. All examples are based on a real demo application with microservices.
What you'll learn:
- How to use the service graph to find bottlenecks
- Span filtering techniques for deep, focused analysis
- How timeline reconstruction helps isolate the problem
- Correlating traces and logs for root cause analysis
- Going from investigation to prevention through monitoring
📊 The Symptom: A Broken Endpoint
Let's say users report that recommendations aren't loading. We open the APM dashboard and immediately spot the problem:
Endpoint: GET /api/recommendations
Error rate: 100% 🔴
Latency p90: 10.6ms
Rate: 1.9 spans/min

APM Overview → "Slowest endpoints" tab. The GET /api/recommendations endpoint is highlighted: error_rate = 100%, p90 = 10.6ms
100% error rate is critical. Every single request is failing. But why?
🔍 Phase 1: Service Graph — Where Is the Problem?
The first step is understanding where exactly in the service chain the failure is happening. For this, we use the service graph.
What Is a Service Graph?
A service graph visualizes how services call each other. Each node is a service, each edge is a call between services.
Our graph looks like this:
[frontend] ---> [recommendation] ---> [product-catalog]
    ✅              🔴 100% errors        ❓

Service graph: red nodes (frontend, recommendation) signal errors. Green nodes (product-reviews, cart) are healthy. Edges show latency and error rate.
Already clear:
- frontend is making requests (red node — there are issues)
- recommendation is returning errors 🔴
- Green nodes (product-reviews, cart) are working normally ✅
- The problem is somewhere inside or further down the chain
Incoming vs Outgoing Perspective
The service graph supports two views:
Incoming — who calls this service (its upstream callers)
Outgoing — which services this service calls (its downstream dependencies)
Switching between them reveals:
- Incoming: frontend → recommendation (6.1ms duration, errors)
- Outgoing: recommendation → ??? (this is where the problem is)
Phase 1 conclusion: The problem is in recommendation — it can't reach its downstream dependencies.
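The "follow the red edges" reasoning above can be sketched in a few lines of plain Python. This is not the Uptrace API, just a hypothetical adjacency map where each edge carries an error rate; walking outgoing edges with a high error rate leads from the symptom to the deepest failing dependency.

```python
# Hypothetical service-graph edges: service -> [(callee, metrics), ...]
edges = {
    "frontend": [("recommendation", {"error_rate": 1.0})],
    "recommendation": [("product-catalog", {"error_rate": 1.0})],
    "cart": [("postgres", {"error_rate": 0.0})],
}

def failing_path(graph, start, threshold=0.5):
    """Follow outgoing edges whose error rate exceeds the threshold."""
    path, node = [start], start
    while node in graph:
        bad = [dst for dst, m in graph[node] if m["error_rate"] > threshold]
        if not bad:
            break
        node = bad[0]
        path.append(node)
    return path

print(failing_path(edges, "frontend"))
# → ['frontend', 'recommendation', 'product-catalog']
```

The walk ends at product-catalog, the service with no healthy outgoing edges, which is exactly the intuition the graph view gives you visually.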
🎯 Phase 2: Analyzing Problematic Spans
Now we need to look at specific requests. We navigate to spans.
Filtering: Cutting Through the Noise
In a production system there are millions of spans. We need precise filtering:
Step 1: Focus on one service
service_name = frontend
Step 2: Only server spans (handled requests)
_kind = server
Step 3: Only errors
_status_code = error
Now we see only the problematic requests from frontend.
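Conceptually, the three filters are just successive AND conditions over span attributes. As a rough sketch (plain Python over hypothetical span dicts, not the Uptrace query engine):

```python
# Hypothetical spans with the attributes the filters reference.
spans = [
    {"service_name": "frontend", "_kind": "server", "_status_code": "error",
     "display_name": "GET /api/recommendations"},
    {"service_name": "frontend", "_kind": "client", "_status_code": "ok",
     "display_name": "rpc:recommendation"},
    {"service_name": "cart", "_kind": "server", "_status_code": "ok",
     "display_name": "POST /api/cart"},
]

def narrow(spans, **filters):
    """Keep only spans whose attributes match every filter (AND semantics)."""
    return [s for s in spans if all(s.get(k) == v for k, v in filters.items())]

problematic = narrow(spans,
                     service_name="frontend",    # step 1: one service
                     _kind="server",             # step 2: only server spans
                     _status_code="error")       # step 3: only errors
print([s["display_name"] for s in problematic])
# → ['GET /api/recommendations']
```

Each filter you add can only shrink the result set, which is why starting broad (one service) and tightening (kind, then status) works so well.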
Aggregations: Understanding the Scale
We group spans by operation and check aggregations:
Aggregations:
- perMin(count()) # requests per minute
- p99(_dur_ms) # 99th percentile latency
- _error_rate # error percentage
Group by:
- _group_id # group identical operations
Result:
| Operation | Rate | p99 Latency | Status |
|---|---|---|---|
| GET /api/recommendations | 2.1/min | 16ms | 🔴 Errors |
| POST /api/checkout | 1.4/min | 23ms | 🔴 Errors |
| GET /api/products/{productId} | 11/min | 5.7ms | 🔴 Errors |

Spans → Groups: three active filters (service_name, _kind, _status_code) narrowed the list to 5 groups. Aggregations perMin(count()), p99(_dur_ms), and _error_rate show the scale of the problem.
The problem affects multiple operations, but GET /api/recommendations shows a consistent error pattern. The problem is now isolated — let's drill into this operation and look at specific requests.
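For intuition, the three aggregations can be reproduced with stdlib Python over hypothetical per-span data (Uptrace computes these server-side; the p99 here uses the simple nearest-rank method, which may differ slightly from the platform's estimator):

```python
import math

# Hypothetical span durations (ms) and error flags in a 5-minute window.
durations = [5.1, 6.0, 5.7, 16.0, 4.9, 5.3, 5.5, 6.2, 5.0, 5.8]
errors = [False, True, False, True, False, False, True, False, False, False]
window_minutes = 5

per_min = len(durations) / window_minutes      # perMin(count())
error_rate = sum(errors) / len(errors)         # _error_rate
# p99 via nearest rank: element at index ceil(0.99 * n) - 1 when sorted
ranked = sorted(durations)
p99 = ranked[math.ceil(0.99 * len(ranked)) - 1]  # p99(_dur_ms)

print(per_min, p99, round(error_rate, 2))
# → 2.0 16.0 0.3
```

Note how a single 16ms outlier dominates p99 while leaving the rate untouched — percentiles and rates answer different questions, which is why the Groups view shows both.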
📋 Phase 2.5: Selecting a Specific Span
We click on one of the GET /api/recommendations rows in the Groups table. A detail panel opens with information about that specific request.

Clicking GET /api/recommendations opens a side panel. The ATTRS tab shows all structured attributes as key-value pairs. The VIEW TRACE button (arrow) leads to the full request timeline.
Selecting a span reveals its attributes as structured key-value pairs, making it easy to filter and search for issues.
What we see in the attributes:
- http_status_code: 500 — confirms a server error
- http_route: /api/recommendations — the specific endpoint
- http_target — full URL with query parameters
- host_name: play-all-in-one — host running the service
- next_span_name — next span in the call chain
This is structured logging in action — every field has a name and type, making debugging far more effective than working with plain text logs.
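The difference is easy to see in miniature. Compare matching a free-text line against querying a typed field (both values below are illustrative, taken from the attributes above):

```python
# The same event as free text vs. structured attributes.
plain = "500 error on /api/recommendations from play-all-in-one"
structured = {
    "http_status_code": 500,            # typed: an int, not a substring
    "http_route": "/api/recommendations",
    "host_name": "play-all-in-one",
}

# Free text needs fragile substring matching ("500" could match anything):
print("500" in plain)
# Structured fields are named and typed, so filters are exact:
print(structured["http_status_code"] == 500)
# → True / True, but only the second query is unambiguous
```

With structured attributes, `http_status_code >= 500` is a precise query; against plain text it is a regex and a prayer.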
Now we click VIEW TRACE to see the full picture of the request across all microservices.
📈 Phase 3: Trace Timeline — Where Is the Bottleneck?
Clicking VIEW TRACE opens a waterfall timeline — the complete history of one request across all microservices.
What Is a Trace?
A trace is the full history of a single request through all microservices, visualized as a waterfall timeline.
Our trace looks like this:
GET /api/recommendations [9.5ms total]
├─ user_get_recommendations (load-generator) [0.7ms]
│ └─ GET http://frontend-proxy:8080/... [1.3ms]
│ └─ Ingress (frontend-proxy) [0.2ms]
│ └─ router frontend egress [0.8ms]
│ └─ GET http://frontend-proxy:8080/... [0.3ms]
│ └─ GET /api/recommendations (frontend) [6.4ms] 🔴
│ └─ rpc:recommendation [failures]

Waterfall trace view: the request path from load-generator through proxy to frontend. The last span (6.4ms) is the clear culprit for both latency and errors.
Timeline Analysis:
- Total duration: 9.5ms
- Load generator: 0.7ms (request initiation)
- HTTP calls through proxy: ~2ms
- Frontend processing: 6.4ms — this is where most of the time goes! 🔴
- A significant portion is consumed by rpc:recommendation (26% of total)
- Errors are visible (red markers)
Phase 3 conclusion: Bottleneck is in GET /api/recommendations on frontend (6.4ms out of 9.5ms). The request passes through several hops (load-generator → proxy → frontend), but most of the time and all the errors occur at the final stage when calling the recommendation service.
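Finding the bottleneck in a waterfall boils down to "which span's duration dominates the total". A sketch over a flattened version of the trace above (hypothetical dicts, durations from the timeline):

```python
# Flattened spans from the waterfall, with durations in ms.
trace = [
    {"name": "user_get_recommendations (load-generator)", "dur_ms": 0.7},
    {"name": "Ingress (frontend-proxy)", "dur_ms": 0.2},
    {"name": "GET /api/recommendations (frontend)", "dur_ms": 6.4},
]
total_ms = 9.5  # root span duration

bottleneck = max(trace, key=lambda s: s["dur_ms"])
share = round(bottleneck["dur_ms"] / total_ms * 100)
print(bottleneck["name"], f"{share}%")
# → GET /api/recommendations (frontend) 67%
```

One span accounting for roughly two thirds of the request is a strong signal; in real traces you would also subtract child time to get each span's self-time, but for this incident the dominant span is obvious either way.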
Inspecting the Problematic Span
We click on the longest span, GET /api/recommendations (6.4ms), in the waterfall. A detail panel opens on the right with complete span information.

Attributes panel of the slowest span: status error, http_status_code = 500, duration 6.4ms. All service metadata is visible: host, OS, Next.js library version.
This span shows exactly where the error occurs — frontend is processing the /api/recommendations request but receiving a 500 error.
Key attributes:
- http_status_code: 500 — server returned an internal server error
- http_target — full URL with parameter productIds=L9ECAV7KIM
- next_span_name: GET /api/recommendations — next call in the chain
- next_span_type: BaseServer.handleRequest — type of request handler
- otel_library_name: next.js — the Next.js framework is in use
Now we need to understand why this span is failing. For that, we look at the logs associated with this trace.
📋 Phase 4: Logs — Root Cause
The trace showed where the problem is. Now we need to understand why. We look at the logs tied to the trace.
Trace-Log Correlation
In Uptrace, traces and logs are linked via trace_id. No need to manually search logs by time or service — just click the LOGS & ERRORS (7) tab while in the waterfall view, and all logs for that specific request appear automatically.

"LOGS & ERRORS (7)" tab: all 7 entries belong to one trace_id. The repeating message "DNS resolution failed for product-catalog:3550" with gRPC status UNAVAILABLE (code 14) — this is the root cause.
What we see in the logs:
A repeating error message appears multiple times with identical content:
Exception: _InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "DNS resolution failed for product-catalog:3550;
C-ares status is not ARES_SUCCESS
qtype=A name=product-catalog is_balancer=0;
DNS server returned general failure"
grpc_status: 14
grpc_message: "DNS resolution failed for product-catalog:3550"
We also see the HTTP access log confirming the outcome:
GET /api/recommendations?productIds=L9ECAV7KIM HTTP/1.1
Status: 500
User-Agent: python-requests/2.32.4

Each log entry is linked to the trace via trace_id. Clicking the link icon opens the full structured attributes of the entry in key-value format.
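Under the hood, correlation is just grouping log records by the trace_id they carry. A stdlib sketch with hypothetical records (the messages echo the real error, the IDs are made up):

```python
from collections import defaultdict

# Hypothetical log records; each carries the trace_id of the request
# that produced it, injected by the OpenTelemetry instrumentation.
logs = [
    {"trace_id": "abc123", "msg": "DNS resolution failed for product-catalog:3550"},
    {"trace_id": "def456", "msg": "cart checkout ok"},
    {"trace_id": "abc123", "msg": "StatusCode.UNAVAILABLE"},
]

by_trace = defaultdict(list)
for rec in logs:
    by_trace[rec["trace_id"]].append(rec["msg"])

# Every log line for the failing request, no manual time/service search:
print(by_trace["abc123"])
```

This is exactly what the LOGS & ERRORS tab does for you: one click, all records sharing the current trace_id, nothing else.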
Root Cause Found! 🎯
Problem: DNS cannot resolve product-catalog:3550
Technical details:
- gRPC Status: UNAVAILABLE (code 14)
- DNS Query: A record for product-catalog
- DNS Resolver: C-ares (async DNS resolver)
- Result: "DNS server returned general failure"
Possible causes:
- The product-catalog service is not running
- Incorrect DNS record in Kubernetes/service mesh
- Network policy blocking DNS access
- Wrong hostname in config (should be product-catalog.namespace.svc.cluster.local)
- DNS server overloaded or unreachable
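A quick way to triage several of these causes at once is a resolution check from inside the failing pod. A minimal stdlib sketch (the gRPC stack resolves via C-ares rather than the system resolver, so this is a sanity check, not an exact reproduction):

```python
import socket

def can_resolve(host: str) -> bool:
    """Return True if the hostname resolves to at least one address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

# In-cluster, the short name resolves only if the Service exists in the
# same namespace; the FQDN form is unambiguous across namespaces.
print(can_resolve("localhost"))
```

If the short name fails but the FQDN works, the hostname in the config is wrong; if both fail, look at the Service itself or at network policies.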
Time from symptom to root cause: ~60 seconds ⏱️
🚨 Phase 5: Prevention Through Monitoring
The investigation is complete — we found the root cause. In a production environment the next step would be a fix (correct the DNS record, restart the service, update the configuration). But it's not enough to fix the current problem — we need to prevent recurrence. For that, we create a monitor.
From Trace to Monitor:
- Go back to the TRACE tab
- Click the problematic span GET /api/recommendations (6.4ms) in the waterfall
- The span detail panel opens on the right
- Click the MONITOR dropdown
- Select Monitor p99 duration

The MONITOR button in the span panel opens a menu with 7 options. Selecting "Monitor p99 duration" — Uptrace automatically fills all form fields from the current span's attributes.
After selecting Monitor p99 duration, Uptrace automatically creates a form pre-filled with data from the span.
What gets auto-populated:
Monitor: httpserver:frontend > GET /api/recommendations p99 duration
Metric: uptrace_tracing_spans
Aggregation: p99($spans)
Filter:
- _group_id = "8037161693486813802"
(this is the span group ID for this operation)
Alert threshold: p99 latency > 15ms
Check interval: last 5 points (5 minutes)
Notification: Email to EVERYONE
The green zone on the chart shows the acceptable range (up to 15ms). When p99 latency exceeds this threshold, the monitor fires an alert automatically.
Now if p99 latency goes above 15ms, the team receives an automatic alert and can respond before the problem becomes critical. The monitor can be configured to send notifications to Slack, Telegram, PagerDuty, or email.
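The check itself is simple to reason about. A sketch of the alerting rule as I understand it from the form (the assumption here is that the monitor fires when all of the last 5 points exceed the threshold; the exact firing semantics are configurable in Uptrace):

```python
THRESHOLD_MS = 15.0   # the green zone's upper bound
CHECK_POINTS = 5      # "last 5 points (5 minutes)"

def should_alert(p99_series, threshold=THRESHOLD_MS, points=CHECK_POINTS):
    """Fire when every one of the last N p99 samples exceeds the threshold."""
    recent = p99_series[-points:]
    return len(recent) == points and all(v > threshold for v in recent)

healthy = [12.0, 13.1, 11.8, 14.0, 12.5]
degraded = [12.0, 16.2, 17.0, 18.4, 19.1, 16.8]
print(should_alert(healthy), should_alert(degraded))
# → False True
```

Requiring several consecutive bad points instead of one keeps a single latency spike from paging the team at 3 a.m.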
📚 Takeaways: Investigation Techniques
1. Structured Filtering
Don't look at all spans at once. Filter by:
- service_name — one service at a time
- _kind = server — only handled requests
- _status_code = error — only errors
2. Timeline Thinking
The trace timeline is the story of a request. Look for:
- Longest spans (the bottleneck)
- Failed spans (where it breaks)
- Unexpected calls (unnecessary hops)
3. Log-Trace Correlation
Logs without context are useless. trace_id connects:
- One log entry → full request flow
- Structured attributes → filtering and search
4. Aggregations Matter
Don't rely on a single span. Aggregations reveal:
- perMin(count()) — frequency of the issue
- p99(_dur_ms) — worst-case latency
- _error_rate — percentage of failures
5. From Investigation to Prevention
An investigation doesn't end at the fix. Monitors turn knowledge into automation.
🛠️ Tools & Techniques Recap
| Tool | Usage | What It Gives You |
|---|---|---|
| Service Graph | Dependency visualization | Where in the chain the problem is |
| Span filtering | service_name, _kind, _status_code | Remove noise, focus on the problem |
| Aggregations | perMin, p99, error_rate | Scale and frequency of the issue |
| Trace timeline | Waterfall view | Bottleneck within the request |
| Log correlation | trace_id linking | Root cause analysis |
| Monitors | Automated alerting | Prevention |
💬 Conclusion
Distributed tracing isn't just "looking at logs." It's a systematic approach to incident investigation:
- Service graph — find the problematic service
- Span filtering — isolate the failing requests
- Trace timeline — understand the bottleneck
- Log correlation — identify the root cause
- Monitors — prevent recurrence
With the right tools (OpenTelemetry + Uptrace), the path from symptom to resolution takes less than a minute instead of hours of debugging.
What's next?
- Try Uptrace: play.uptrace.dev
- Open-source APM based on OpenTelemetry: uptrace.dev/get/hosted/open-source-apm
- Set up OTEL Collector: uptrace.dev/opentelemetry/collector
- Explore best practices: uptrace.dev/opentelemetry
Questions? Comments? Drop them below! 👇