Alexandr Bandurchin for Uptrace

Originally published at uptrace.dev

Debugging Microservices in Production with Distributed Tracing

Your production checkout flow just started returning 500 errors. Six microservices handle checkout. Logs show errors in three of them. Which service broke? Which error happened first? What caused the cascade?

Traditional debugging doesn't work. You can't attach a debugger to production. Searching logs across six services gives thousands of lines with no obvious connection. By the time you correlate timestamps and trace IDs manually, customers have abandoned their carts.

Distributed tracing solves this by showing the complete request path. One trace captures the entire checkout flow from API Gateway through all six services. You see which service failed, what it was doing, and what led to the failure. This article shows practical debugging scenarios using distributed traces in Uptrace.

Debugging Workflow

Production issues follow a pattern. An alert fires indicating elevated error rates or latency. You need to understand what changed and which component broke. Distributed tracing provides the investigation path.

Start with the alert context. If error rate spiked in the payment service, filter traces to that service with error status. Uptrace shows all failed payment requests with their complete context. Pick a representative error and examine its trace.

The trace timeline shows every service involved. Each span represents work done by one service. Span duration reveals bottlenecks. Error annotations point to failures. Attributes contain request details like user ID, order amount, and API endpoint.
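
Those attributes only show up if services record them. Here is a minimal sketch of how a checkout handler might set them with the OpenTelemetry Python SDK; the handler signature and attribute names are illustrative, not something Uptrace requires:

from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

def handle_checkout(user_id: str, items: list[dict], total: float) -> None:
    # One span per checkout request; the attributes make it searchable later.
    with tracer.start_as_current_span("checkout") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("order.amount", total)
        span.set_attribute("order.item_count", len(items))
        span.set_attribute("http.route", "/api/checkout")
        # ... call inventory, payment, and shipping services here ...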

Work backward from the error. The payment service failed with a timeout. Its span shows it called a fraud detection service. That service's span shows it queried a database. The database span reveals a slow query taking 8 seconds. You found the root cause without reading a single log line.

Scenario: Slow Checkout

Users report checkout taking 10+ seconds. Your SLO targets 2 seconds. You need to identify which service causes the delay and why it slowed down recently.

Open Uptrace and filter traces for the checkout endpoint. The trace list shows P50, P95, and P99 latencies. P95 jumped from 1.8s to 12s in the last hour. Click a slow trace from the recent spike.

The trace waterfall displays all services involved in checkout. The inventory service span takes 9 seconds while others complete in milliseconds. This service is the bottleneck.

Drill into the inventory service span. It contains child spans for each operation. Database queries complete in 50ms. Cache checks are instant. One span stands out: "validate_stock_levels" takes 8.5 seconds.

Click that span to see its attributes. It queries inventory for 47 items, and the SQL query scans millions of rows because there is no suitable index. Before this morning, orders averaged 3 items; a bulk-order feature launched today, which explains the sudden jump in query complexity.

-- Span attribute showing the problematic query
db.statement: SELECT * FROM inventory
  WHERE sku IN (47 items...)
  ORDER BY warehouse_id
-- Missing index on (sku, warehouse_id)

The fix is clear: add a composite index on SKU and warehouse columns. The trace also reveals an optimization opportunity—batch this query instead of running it synchronously during checkout.
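
For illustration, here is what that index could look like if you applied it with SQLAlchemy; in practice it would likely ship as a database migration, and the table and column names are taken from the trace above:

from sqlalchemy import create_engine, text

# Placeholder connection string; the real DSN lives in your deployment config.
engine = create_engine("postgresql://localhost/shop")

with engine.begin() as conn:
    # Covers WHERE sku IN (...) ORDER BY warehouse_id without a full table scan.
    conn.execute(text(
        "CREATE INDEX IF NOT EXISTS ix_inventory_sku_warehouse "
        "ON inventory (sku, warehouse_id)"
    ))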

Scenario: Failed Payment

Payment processing returns intermittent 401 errors. Some charges succeed while others fail with authentication errors. The error rate is 15%, concentrated in the last 30 minutes.

Filter Uptrace traces to the payment service with HTTP status 401. The trace list shows both successful and failed requests. Compare a failed trace to a successful one from 35 minutes ago.

The successful trace shows the payment service calling Stripe with valid API credentials. The failed trace shows the same call but with a 401 response. The span attributes reveal the difference.

// Successful request span attributes
stripe.api_version: "2023-10-16"
stripe.key: "sk_live_abc...xyz" (masked)

// Failed request span attributes
stripe.api_version: "2024-01-01"
stripe.key: "sk_live_abc...xyz" (masked)

The API version changed. Check when this happened by examining traces over the past hour. Successful requests used version 2023-10-16. Failed requests started at 10:47 AM using version 2024-01-01.

Search your deployment logs for 10:47 AM. A configuration update deployed new environment variables including an updated Stripe API version. The new version requires different authentication headers that your code doesn't send.
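
This comparison is only possible because the payment service records the API version and a masked key on its outbound spans. A minimal sketch of that instrumentation, where call_stripe is a stand-in for the real client call and the attribute names are just the conventions used in this example:

from opentelemetry import trace

tracer = trace.get_tracer("payment-service")

def call_stripe(api_key: str, api_version: str, amount_cents: int, currency: str) -> dict:
    # Placeholder for the real Stripe client call.
    return {"status": "succeeded"}

def mask_key(key: str) -> str:
    # Record only a recognizable prefix and suffix, never the full credential.
    return f"{key[:11]}...{key[-3:]}"

def charge(amount_cents: int, currency: str, api_version: str, api_key: str) -> dict:
    with tracer.start_as_current_span("stripe.charge") as span:
        span.set_attribute("stripe.api_version", api_version)
        span.set_attribute("stripe.key", mask_key(api_key))
        span.set_attribute("payment.amount_cents", amount_cents)
        return call_stripe(api_key, api_version, amount_cents, currency)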

Roll back the config change and payment processing recovers immediately. The trace comparison made the root cause obvious without reproducing the issue locally.

Scenario: Missing Events

Your order processing uses Kafka. Orders placed through the API should trigger warehouse notifications. Some orders never reach the warehouse. The API confirms order creation but the warehouse service shows no activity for those orders.

Start with a trace of a successful order. It shows the API service creating the order, publishing to Kafka, and the warehouse service consuming the event. The trace includes spans for both the producer and consumer sides.

Now examine a failed order. Filter by order ID in Uptrace. The trace shows the API service successfully creating the order and calling the Kafka producer. But there's no corresponding consumer span. The event disappeared.

Check the producer span attributes. They show which Kafka topic and partition the event was sent to. The consumer span for successful orders shows it reads from the same topic but a different partition.

# Producer span (successful order)
messaging.destination: orders
messaging.kafka.partition: 2

# Consumer span
messaging.destination: orders
messaging.kafka.partition: 2

# Producer span (failed order)
messaging.destination: orders
messaging.kafka.partition: 5

# No consumer span exists

The consumer isn't reading from partition 5. Check your consumer configuration. It's hard-coded to read partitions 0-4. When you scaled Kafka to six partitions last week, the consumer configuration didn't update.

The fix is simple: change the consumer to subscribe to the topic rather than specific partitions. Kafka will assign partitions automatically. Distributed tracing revealed an async boundary issue that logs alone wouldn't catch.
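
A sketch of that consumer fix using the confluent-kafka client; the broker address and group id are placeholders, and the buggy partition assignment is left in as a comment for contrast:

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",   # placeholder broker address
    "group.id": "warehouse-service",
    "auto.offset.reset": "earliest",
})

# Buggy version, hard-coded to the original five partitions, so anything
# routed to partition 5 is silently skipped:
#   consumer.assign([TopicPartition("orders", p) for p in range(5)])

# Fix: subscribe to the topic and let the consumer group protocol assign
# whatever partitions currently exist.
consumer.subscribe(["orders"])

while True:
    msg = consumer.poll(timeout=1.0)
    if msg is None or msg.error():
        continue
    print("warehouse notification for order:", msg.value())  # replace with real handling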

Scenario: Memory Leak

The recommendation service's memory usage grows steadily until the pod restarts. Memory increases 50MB per hour. After 8 hours, the pod hits its 2GB limit and crashes.

Memory leaks are hard to debug with traces alone, but traces provide context about what the service was doing when memory spiked. Start by correlating memory metrics with trace volume.

Uptrace shows memory usage alongside request rates. Memory climbs during bursts of high traffic but doesn't come back down during quiet periods. Something is accumulating in memory.

Filter traces during high memory periods. Look for patterns in the requests. Most requests complete normally, but one endpoint shows unusually large response payloads in its span attributes.

// Normal request span
http.response_content_length: 15KB
recommendation.items_returned: 10

// Suspicious request span
http.response_content_length: 2.4MB
recommendation.items_returned: 500

Certain queries request 500 recommendations instead of the normal 10. These large responses get cached in memory. The cache uses an LRU eviction policy, but its max size is set to 10GB—way more than the pod's 2GB limit.

Check when these large queries started. Filter spans by recommendation.items_returned > 100. They began two days ago when a new mobile app version deployed. The app has a bug causing it to request 500 items when the user scrolls.

Cap the cache at 500MB and add request validation to reject queries for more than 50 items. The memory leak stops. Traces showed which requests caused the memory growth without needing heap dumps.
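
Here is one way that request validation could look, assuming the recommendation service exposes a FastAPI endpoint; the route, parameter names, and limit are illustrative:

from fastapi import FastAPI, Query

app = FastAPI()
MAX_ITEMS = 50  # anything above this is almost certainly a client bug

def fetch_recommendations(user_id: str, limit: int) -> list[dict]:
    # Placeholder for the real recommendation lookup.
    return [{"rank": i} for i in range(limit)]

@app.get("/recommendations")
def recommendations(user_id: str, limit: int = Query(default=10, ge=1, le=MAX_ITEMS)):
    # Requests with limit > MAX_ITEMS are rejected with a 422 before any work
    # is done or an oversized response gets cached.
    return {"user_id": user_id, "items": fetch_recommendations(user_id, limit)}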

Trace Search Techniques

Finding relevant traces requires good search skills. Uptrace supports multiple filter dimensions. Combine them to narrow results quickly.

Filter by service name to see only traces touching specific components. Add time range filters to focus on when the issue occurred. Status filters show only errors or slow requests.

Attribute filters are powerful. Filter by any span attribute like HTTP status code, user ID, or custom business logic tags. Want to see all failed orders over $500? Filter by order.amount > 500 and error = true.

Duration filters identify performance issues. Show traces where the total duration exceeds your SLO. Or filter for specific spans that are slow, like db.operation.duration > 100ms.

Group traces by attributes to spot patterns. Group by HTTP status code to see error distribution. Group by region to identify geo-specific issues. Group by version to compare releases.

Save useful queries. When debugging production issues, you often need the same filters repeatedly. Save queries for common scenarios like "slow checkout", "payment failures", or "API errors".

Service Maps

Service maps visualize dependencies between microservices. Uptrace generates these automatically from trace data. Each node represents a service. Edges show request paths between services.

The map updates in real-time as traffic flows through your system. Color coding indicates health—green for normal, yellow for elevated errors, red for critical failures. This makes problem services immediately visible.

Click a service to see its metrics. Request rate, error percentage, and latency percentiles appear in the sidebar. Compare current metrics against historical baselines to identify anomalies.

Click an edge between services to see their interaction patterns. How many requests flow from API Gateway to the payment service? What's the error rate on that path? How long do those calls take?

Service maps reveal architectural issues. If the authentication service shows edges to 20 other services, it's a bottleneck and single point of failure. If the order service calls the recommendation service, but recommendations also call orders, there's a circular dependency risk.

Use the map during incidents. When alerts fire, check the service map to see the blast radius. Which services are affected? Which are still healthy? This guides mitigation strategy—do you need to roll back one service or several?

Span Analysis

Individual spans contain detailed execution context. Examining spans reveals what a service was doing during a request. Start with the span timeline to understand execution flow.

Spans can be synchronous or async. Synchronous spans block their parent until complete. Async spans run in parallel. The timeline shows this visually. Long synchronous spans block the entire request. Multiple async spans run concurrently, reducing total time.

Span attributes store request-specific data. HTTP spans include method, path, status code, and headers. Database spans show queries, row counts, and table names. Message queue spans contain topic, partition, and message size.

Custom attributes add business context. Tag spans with user tier, feature flags, or A/B test variant. During debugging, filter traces by these attributes to isolate issues affecting specific user segments.

Span events mark important moments within a span. A payment span might have events for "validation_complete", "fraud_check_started", "stripe_api_called", and "confirmation_sent". Events help pinpoint where in a long operation something went wrong.

Error spans deserve special attention. They include exception type, message, and stack trace. The stack trace shows the code path that led to the error. Combined with source code line numbers, you can jump directly to the problematic code.
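
A sketch of how a payment service might emit those events and record exceptions with the OpenTelemetry Python API; the span and event names follow the example above, and the provider call itself is elided:

from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("payment-service")

def process_payment(order_id: str, amount_cents: int) -> None:
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount_cents", amount_cents)
        span.add_event("validation_complete")
        span.add_event("fraud_check_started", {"order.id": order_id})
        try:
            # ... call the payment provider here ...
            span.add_event("stripe_api_called")
            span.add_event("confirmation_sent")
        except Exception as exc:
            # Attaches exception type, message, and stack trace to the span.
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR, str(exc)))
            raise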

Error Patterns

Production errors rarely occur in isolation. Similar errors cluster together when a root cause affects multiple requests. Spotting these patterns helps prioritize debugging efforts.

Group errors by exception type. If 1000 traces show "DatabaseConnectionTimeout" errors, that's one issue to fix. If you see 100 different exception types with 10 instances each, multiple unrelated problems exist.

Group by service to identify which components are unstable. If 80% of errors originate in the authentication service, focus there first. If errors distribute evenly across services, look for infrastructure issues like network problems.

Timeline visualization shows error bursts. Errors clustering within a 5-minute window suggest a deployment or configuration change. Errors spread evenly over hours indicate a gradual degradation like a memory leak.

Compare error traces to successful traces. What's different? Did the error trace take an unusual code path? Did it call different downstream services? Did it have different input parameters?

Look for correlations with deployments. Uptrace can overlay deployment markers on the timeline. Errors starting immediately after a deployment point to a code or configuration regression.

Performance Optimization

Distributed traces guide performance improvements. They show exactly where time is spent, allowing targeted optimization rather than guessing.

Start with the critical path. This is the sequence of spans that determines total request duration. Optimizing spans off the critical path won't improve latency. Focus on the longest span on the critical path first.

Database queries often dominate request time. Traces show which queries are slow, how often they run, and whether they could be parallelized or cached. One 500ms query run during every request has massive impact.

Third-party API calls are another common bottleneck. Traces reveal API latency and error rates. If a recommendation API averages 800ms response time, consider caching results or making the call async.

Identify unnecessary work. Sometimes services fetch data that is never used. Traces show the exact queries and downstream calls each request makes, which makes it easier to spot lookups whose results are discarded or wide SELECTs that only need a couple of columns. Removing that work improves performance without changing functionality.

Parallelization opportunities appear in the trace waterfall. If three spans run sequentially but don't depend on each other, run them concurrently. This can reduce total latency significantly.
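
A sketch of that parallelization with asyncio; the three lookups and their sleep calls are placeholders for real downstream requests:

import asyncio
from opentelemetry import trace

tracer = trace.get_tracer("checkout-service")

async def fetch_inventory(order_id: str) -> dict:
    with tracer.start_as_current_span("fetch_inventory"):
        await asyncio.sleep(0.1)  # stand-in for an HTTP or database call
        return {"in_stock": True}

async def fetch_pricing(order_id: str) -> dict:
    with tracer.start_as_current_span("fetch_pricing"):
        await asyncio.sleep(0.1)
        return {"total": 42.0}

async def fetch_shipping_options(order_id: str) -> dict:
    with tracer.start_as_current_span("fetch_shipping_options"):
        await asyncio.sleep(0.1)
        return {"options": ["standard", "express"]}

async def prepare_checkout(order_id: str) -> list:
    # The three lookups don't depend on each other, so run them concurrently;
    # the trace waterfall shows three overlapping spans instead of a sequence.
    return await asyncio.gather(
        fetch_inventory(order_id),
        fetch_pricing(order_id),
        fetch_shipping_options(order_id),
    )

asyncio.run(prepare_checkout("order-123"))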

Production Debugging

Production debugging requires different techniques than local development. You can't attach a debugger or add print statements. Traces provide the investigation data you need.

When an error occurs, the trace is your stack trace. It shows the sequence of service calls leading to the failure. Follow the trace backward from the error to find the root cause.

Trace context carries through async boundaries. When a service publishes a message to Kafka, the trace context goes with it. The consumer picks up that context and continues the trace. This maintains visibility across async operations.
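
A minimal sketch of that propagation over Kafka, using the confluent-kafka client and OpenTelemetry's propagation API; the broker address and message handling are placeholders, and instrumentation packages for popular Kafka clients can inject and extract these headers for you automatically:

import json
from confluent_kafka import Producer
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("order-service")
producer = Producer({"bootstrap.servers": "kafka:9092"})  # placeholder broker

def publish_order_created(order_id: str) -> None:
    with tracer.start_as_current_span("orders publish"):
        carrier: dict[str, str] = {}
        inject(carrier)  # writes the W3C traceparent header into the carrier
        producer.produce(
            "orders",
            value=json.dumps({"order_id": order_id}).encode(),
            headers=[(k, v.encode()) for k, v in carrier.items()],
        )
        producer.flush()

def handle_order_message(msg) -> None:
    # In the consumer poll loop: rebuild the context from the message headers
    # so the consumer span joins the producer's trace.
    carrier = {k: v.decode() for k, v in (msg.headers() or [])}
    with tracer.start_as_current_span("orders process", context=extract(carrier)):
        print("warehouse notification:", json.loads(msg.value()))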

Sampling doesn't have to compromise debugging. Head-based sampling makes the keep-or-drop decision once at the trace root, so any sampled trace is complete rather than partial. Tail-based sampling decides after the trace finishes, which lets you keep every trace that contains an error even at a 10% overall sample rate. Configured this way, you always have data for the errors that matter.
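
For example, a 10% head-based sampler in the OpenTelemetry Python SDK looks like the sketch below; error-aware tail-based policies are typically configured in the collector or backend rather than in application code:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# The keep/drop decision is made once at the trace root and every child span
# follows it, so sampled traces are always complete.
provider = TracerProvider(sampler=ParentBased(TraceIdRatioBased(0.10)))
trace.set_tracer_provider(provider)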

Compare problem traces to baselines. Find a similar request that succeeded recently. Diff the traces to see what changed. Different service versions? Different code paths? Different input parameters?

Uptrace persists traces for analysis. You can investigate production issues hours or days after they occurred. The complete trace history remains available for debugging and post-mortems.

Deployment Verification

Distributed tracing verifies deployments work correctly. After deploying new code, compare traces before and after the deployment to catch regressions.

Check error rates first. Filter traces to the deployed service with error status. Compare error counts before and after deployment. An increase indicates the new version introduces bugs.

Verify latency hasn't regressed. Compare P95 latency before and after. A 50ms increase might be acceptable. A 500ms increase needs investigation. Examine slow traces from the new version to find the cause.

Look for new error types. If the old version never threw "NullPointerException" but the new version does, the deployment introduced a bug. Check traces with this error to understand when it occurs.

Validate new features work end-to-end. Deploy a feature that adds a new service call. Verify traces show the new call with correct parameters and responses. Missing or malformed calls indicate integration issues.

Canary deployments benefit from trace comparison. Send 10% of traffic to the new version. Compare traces between old and new versions. If the new version shows higher errors or latency, abort the deployment.
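
Version comparison depends on spans carrying the deployed version. A sketch of tagging every span with it through OpenTelemetry resource attributes; the service name and version string are illustrative and would normally be injected by CI:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Every span from this process carries the deployed version, so traces can be
# grouped and compared by service.version during a canary rollout.
resource = Resource.create({
    "service.name": "payment-service",
    "service.version": "2024.06.1",            # e.g. set from the CI build
    "deployment.environment": "production",
})
trace.set_tracer_provider(TracerProvider(resource=resource))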

Implementation with Uptrace

Uptrace provides a complete debugging workflow for microservices. All traces export to a central platform where you can search, filter, and analyze them.
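
A minimal export setup with the standard OTLP exporter might look like the sketch below; the endpoint and the uptrace-dsn header value are placeholders to copy from your Uptrace project settings:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Placeholder endpoint and DSN; use the values from your Uptrace project.
exporter = OTLPSpanExporter(
    endpoint="https://your-uptrace-endpoint:4317",
    headers={"uptrace-dsn": "<your project DSN>"},
)

provider = TracerProvider(resource=Resource.create({"service.name": "checkout-service"}))
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)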

The trace explorer shows all traces with filtering options. Search by service, time range, duration, status, or any span attribute. Saved queries let you bookmark common debugging scenarios.

The waterfall view displays trace timelines. Each span appears with its duration and relationship to other spans. Click any span to see its attributes, events, and errors.

Service maps visualize architecture and dependencies. Watch traffic flow in real-time. Identify bottlenecks and failure points visually. Click services or connections to drill into their metrics.

Alerting detects issues automatically. Set alerts on error rates, latency percentiles, or custom metrics. When an alert fires, Uptrace shows relevant traces immediately. Jump from alert to root cause in seconds.

Team collaboration features enable shared debugging. Link to specific traces in chat or tickets. Everyone sees the same data without needing access to production systems. Post-mortems become easier with complete trace records.

Ready to debug with distributed tracing? Start with Uptrace.
