DEV Community

Nandish Dave
Why I Spent 5 Hours Finding a 20-Minute Fix: A Case for Structured Logging

This article is based on a real production incident I worked on recently. Names and specifics are changed, but the lessons are real.

The Incident

A production incident. Customers unable to complete payments in certain edge cases. Priority 1. All hands on deck.

The fix, once we found the root cause, took about 20 minutes. But getting there? That took hours. Not because the bug was complex — but because our observability was broken.

This post is about what went wrong and what I'd do differently.

Problem 1: Cost-Driven Logging Decisions

AWS CloudWatch is powerful, but querying logs at scale gets expensive fast. To manage costs, all application logs were being shipped to Splunk.

Splunk is a great tool — but it comes with its own query language (SPL), its own learning curve, and its own quirks. During a P1 incident, the last thing you want is engineers Googling "how to filter Splunk logs by timestamp."

The trade-off nobody talks about: You save money on log queries but pay with engineer time during incidents. That cost is invisible until it isn't.

Problem 2: No Structured Logging

This was the bigger issue. Developers were logging raw stack traces — unstructured, inconsistent, and scattered across services.

No centralized error-handling middleware. Every service handled errors differently. Some logged full traces, some logged one-liners, some didn't log errors at all.

What unstructured logs look like during an incident:

ERROR: NullReferenceException at PaymentService.Process()
   at PaymentService.cs:line 142
   at OrderHandler.cs:line 89
INFO: Request received for user 48291
ERROR: Connection timeout to payment gateway
INFO: Request received for user 50123
INFO: Health check OK
ERROR: NullReferenceException at PaymentService.Process()
   at PaymentService.cs:line 142

Which error belongs to which customer? Which request triggered which failure? Impossible to tell.

What structured logs should look like:

{
  "timestamp": "2026-02-09T14:23:01Z",
  "level": "ERROR",
  "correlationId": "req-abc-123",
  "service": "payment-service",
  "userId": "48291",
  "message": "Payment processing failed",
  "error": "NullReferenceException",
  "stackTrace": "PaymentService.Process() at line 142",
  "context": {
    "orderId": "ORD-9981",
    "amount": 149.99,
    "gateway": "stripe"
  }
}

One log entry. Everything you need. Searchable, filterable, traceable.
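Producing entries like that doesn't require a framework. Here's a bare-bones sketch of a logger that emits the shape shown above — the field names are my assumptions, and in practice a library like pino or winston handles this (plus levels, redaction, and transports) for you:

```javascript
// Emit one JSON log line per event. Caller-supplied fields (correlationId,
// userId, context, ...) are merged into the entry.
function logEvent(level, message, fields = {}) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    ...fields,
    message,
  };
  console.log(JSON.stringify(entry));
  return entry; // returned so callers/tests can inspect it
}

// Usage:
logEvent('ERROR', 'Payment processing failed', {
  correlationId: 'req-abc-123',
  service: 'payment-service',
  userId: '48291',
  context: { orderId: 'ORD-9981', amount: 149.99, gateway: 'stripe' },
});
```

One JSON line per event is all a log tool needs to make every field searchable.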

Problem 3: No Correlation IDs

Without a unique correlation ID attached to each incoming request and passed through every downstream service call, there was no way to trace a single customer's journey end-to-end.

Thousands of requests per second. Logs interleaved. One customer's payment attempt scattered across 5 services with nothing connecting them.

A simple middleware that generates a UUID and attaches it to every log entry would have saved hours:

// Express middleware example
const { v4: uuid } = require('uuid'); // or crypto.randomUUID() in Node 14.17+
app.use((req, res, next) => {
  req.correlationId = req.headers['x-correlation-id'] || uuid();
  res.setHeader('x-correlation-id', req.correlationId);
  next();
});

Then in Splunk or any log tool: correlationId="req-abc-123" — and you see the entire request lifecycle.
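The ID only traces the full journey if every downstream call forwards it. One hedged sketch, assuming an internal payments URL and a `buildHeaders` helper that I'm inventing for illustration:

```javascript
// Forward the correlation ID on every outgoing service call, so the
// downstream service's logs carry the same ID as ours.
function buildHeaders(req, extra = {}) {
  return {
    'x-correlation-id': req.correlationId,
    'content-type': 'application/json',
    ...extra,
  };
}

// e.g. inside a route handler (URL is hypothetical):
// await fetch('https://payments.internal/charge', {
//   method: 'POST',
//   headers: buildHeaders(req),
//   body: JSON.stringify(order),
// });
```

If even one service in the chain drops the header, the trace goes dark at that hop — which is why this belongs in shared middleware or an HTTP client wrapper, not in each handler.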

What I'd Do Differently

  1. Structured JSON logging from day one — not raw console.log or stack trace dumps
  2. Correlation ID middleware — generated at the API gateway, passed to every service
  3. Centralized error handling — one middleware that catches, formats, and logs all errors consistently
  4. Evaluate observability costs holistically — factor in engineer time during incidents, not just storage costs
  5. Log levels that mean something — ERROR for actual failures, WARN for degraded states, INFO for business events
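On that last point, "levels that mean something" also implies a threshold you can actually turn down in production. A tiny sketch — the severity mapping and env-var name are illustrative, not from the incident:

```javascript
// Map levels to numeric severity: lower number = more severe.
const LEVELS = { ERROR: 0, WARN: 1, INFO: 2 };

// Log only events at or above the configured severity threshold,
// e.g. LOG_LEVEL=WARN silences INFO noise without losing failures.
function shouldLog(level, threshold = process.env.LOG_LEVEL || 'INFO') {
  return LEVELS[level] <= LEVELS[threshold];
}
```

With consistent semantics, dropping the threshold to WARN during an incident cuts the noise floor without hiding the errors you're hunting.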

The Takeaway

The cheapest log storage means nothing if your team burns hours reading unreadable logs during a P1.

Observability isn't overhead. It's the difference between a 20-minute fix and a 5-hour firefight.

Why I Wrote This

I've spent 15+ years building and debugging enterprise systems. This incident reminded me that we often over-invest in features and under-invest in observability. The tools and patterns I described above aren't complex — correlation IDs, structured logging, error middleware — but they're skipped more often than you'd think.

I wrote this so the next engineer stuck in a P1 at 2 AM has one less problem to deal with.

I'd love to hear from you:

  • What does your observability stack look like?
  • Have you faced a similar situation where finding the bug took longer than fixing it?
  • What logging practices does your team follow?

Drop your questions or experiences in the comments — I'll reply to every one.


I'm Nandish Dave — Cloud Architect & Full Stack Engineer with 15+ years building enterprise solutions. I write about cloud architecture, DevOps, and lessons from production. Find me at nandishdave.world.
