At 3 AM on a Tuesday, our API threw an error that took me forty-five minutes to understand from the logs alone.
Not because the error was complex. Because our error handling was.
We had built what we thought was a sophisticated error handling system. Detailed error codes, extensive logging, custom exception hierarchies, contextual metadata attached to every failure. The kind of system that looks impressive in code review and feels like enterprise-grade engineering.
Then production hit, and I found myself scrolling through thousands of log lines, unable to quickly answer the simplest question: "What actually went wrong?"
That night, staring at logs that told me everything except what I needed to know, I realized we had optimized for the wrong thing. We had built error handling for the code's elegance, not for the human debugging it at 3 AM.
The Abstraction Trap
Our error handling started simple. Catch exceptions, log them, return appropriate HTTP status codes. Basic, functional, boring.
Then we started adding "improvements."
We created custom exception classes for every failure mode. DatabaseConnectionException, InvalidAuthTokenException, RateLimitExceededException, UpstreamServiceTimeoutException. Each with its own error code, severity level, and metadata schema.
We built middleware that caught these exceptions, transformed them into standardized error responses, logged them with rich context, and tracked them in our monitoring system. We had error hierarchies, error factories, error serializers.
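To make that concrete, here's roughly the shape of it. This is a simplified Python sketch for illustration, not our actual code, and the codes and severities are made up:

```python
# Simplified sketch of the old approach (illustrative only).
class ApiError(Exception):
    """Base class carrying an error code, a severity, and free-form metadata."""
    code = "GENERIC_000"
    severity = "MEDIUM"

    def __init__(self, message, **context):
        super().__init__(message)
        self.context = context  # arbitrary metadata attached to every failure


class DatabaseConnectionException(ApiError):
    code = "DB_CONN_001"
    severity = "HIGH"


class InvalidAuthTokenException(ApiError):
    code = "AUTH_001"
    severity = "MEDIUM"


class RateLimitExceededException(ApiError):
    code = "RATE_001"
    severity = "LOW"


class UpstreamServiceTimeoutException(ApiError):
    code = "UPSTREAM_001"
    severity = "HIGH"

# ...plus middleware that caught these, serialized them into standardized
# responses, enriched them with request context, and fanned them out to
# logging and monitoring.
```

Every new failure mode meant another class, another code, another schema to keep in sync.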
The code looked clean. The architecture felt robust. The error handling was thorough and type-safe.
And it was completely useless when trying to debug production issues.
The problem wasn't that our errors lacked information—they had too much. Every error logged twenty fields of context. Stack traces were pristine. Error codes were precise. But when scanning through logs at 3 AM trying to understand why the API was returning 500s, I couldn't quickly distinguish signal from noise.
Our sophisticated error system had created a new problem: information overload that masked the actual failures.
What Production Logs Revealed
After that 3 AM incident, I started actually reading our production logs. Not during incidents—during normal operation. What I found was humbling.
Most of our carefully crafted error context was never useful. The detailed metadata we attached to exceptions? Rarely relevant. The precise error codes mapping to specific failure modes? Nobody referenced them. The error hierarchies we'd designed? They didn't help anyone understand what was failing.
What actually helped during debugging was simple, direct information:
- What was the API trying to do?
- What went wrong?
- What should we do about it?
Everything else was noise.
I noticed patterns in how I actually debugged production issues. I'd grep for the endpoint that was failing, scan for error keywords, look for repeated failures, check for upstream service names. The sophisticated error handling we'd built didn't support this workflow—it fought against it.
Our logs looked like this:
[ERROR] Exception caught in middleware layer
Type: DatabaseConnectionException
Code: DB_CONN_001
Severity: HIGH
Message: Unable to establish connection to database
Context: {
  "request_id": "abc123",
  "user_id": "user_456",
  "endpoint": "/api/users/profile",
  "database_host": "prod-db-1.internal",
  "connection_pool_size": 50,
  "retry_attempt": 3,
  "timeout_ms": 5000,
  ...15 more fields
}
Stack trace: [50 lines]
What I actually needed:
[ERROR] /api/users/profile - Database connection failed after 3 retries (prod-db-1 timeout)
The first format was "complete." The second was useful.
The Simplification
I started rewriting our error handling with a new principle: optimize for the person reading logs, not the person writing code.
First change: Flatten the error hierarchy. Instead of custom exception classes for every failure mode, we went to three categories: client errors (4xx), server errors (5xx), and dependency failures (upstream services, databases, etc.). That's it.
This felt wrong at first. We were losing type safety. We were giving up precise error categorization. But in production logs, those distinctions didn't matter. What mattered was: Is this our fault or the client's fault? Is this a code bug or an infrastructure issue?
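In rough terms, the whole hierarchy collapsed into something like this (a minimal Python sketch, not our exact code):

```python
# Minimal sketch of the flattened categories (illustrative).
class ClientError(Exception):
    """The caller did something wrong: maps to a 4xx response."""
    status = 400


class ServerError(Exception):
    """Our code or configuration is at fault: maps to a 5xx response."""
    status = 500


class DependencyError(Exception):
    """An upstream service, database, or cache failed: surfaces as a 5xx."""
    status = 502
```

Three classes answer the only questions that matter at 3 AM: whose fault is it, and is it code or infrastructure?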
Second change: Structure logs for grep, not JSON parsers. We had been logging errors as structured JSON, thinking it would make them easier to query. In practice, it made them harder to read. When debugging, you scan logs visually. JSON objects spread across multiple lines are hard to scan.
We switched to a simple format: [LEVEL] endpoint - what happened (relevant context). One line per error. No nested objects. Critical information in predictable positions.
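Here's a minimal sketch of what that looks like with Python's standard logging (the helper name and setup are ours to pick, nothing special):

```python
import logging

# One-line, grep-friendly errors: [LEVEL] endpoint - what happened (context)
logging.basicConfig(format="[%(levelname)s] %(message)s")
logger = logging.getLogger("api")


def log_error(endpoint: str, what_happened: str, context: str = "") -> None:
    """Emit a single readable line with the critical pieces in fixed positions."""
    suffix = f" ({context})" if context else ""
    logger.error("%s - %s%s", endpoint, what_happened, suffix)


log_error("/api/users/profile",
          "Database connection failed after 3 retries",
          "prod-db-1 timeout")
# [ERROR] /api/users/profile - Database connection failed after 3 retries (prod-db-1 timeout)
```

One grep for the endpoint, one scan down the column of messages, and you know what's failing.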
Third change: Context only when it matters. We stopped attaching comprehensive metadata to every error. Instead, we included only the context that would help debug that specific failure type.
Database connection failed? Log which database and how many retries. Don't log request IDs, user IDs, or the entire request context—those are already in the access logs.
Rate limit exceeded? Log the endpoint and the limit. Don't log the client's entire request history.
Fourth change: Make errors actionable. Every error should suggest what to do next. Not in a user-facing message, but in the logs themselves.
Instead of InvalidAuthToken, we logged: Authentication failed - token expired (client should refresh).
Instead of UpstreamServiceTimeout, we logged: Payment service timeout after 5s - check payment-service health.
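In code, the hint gets written at the point where we actually know what to do next. A hedged sketch: the payment client and its charge() call are hypothetical stand-ins, and log_error() and DependencyError are the illustrative pieces from the sketches above.

```python
def charge_order(payment_client, order, timeout_s: int = 5):
    """Sketch: log the failure and the next action in the same line."""
    try:
        return payment_client.charge(order, timeout=timeout_s)
    except TimeoutError:
        log_error("/api/orders/charge",
                  f"Payment service timeout after {timeout_s}s",
                  "check payment-service health")
        raise DependencyError("payment service timeout")
```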
This changed how we thought about errors. They weren't just failures to categorize—they were signals for action.
The Tools That Actually Help
Once we simplified our error handling, we needed better ways to make sense of the patterns emerging in logs.
We started using AI to analyze log patterns when we noticed repeated errors. Not to replace human investigation, but to quickly surface correlations we might miss. "These three endpoints are failing at the same rate—probably the same root cause."
For complex debugging sessions, we'd use Claude Sonnet 4.5 to help structure our investigation. Paste in a sample of errors, ask it to identify the common pattern or suggest what to check next. The AI didn't debug for us, but it helped organize our thinking when we were overwhelmed.
When logs revealed issues with specific data transformations or validation logic, we'd use tools that could analyze and extract structured information from messy error patterns, helping us understand what types of inputs were causing failures.
The goal wasn't to automate debugging—it was to accelerate the pattern recognition that helps you form hypotheses about what's actually broken.
What We Gave Up (And Why It Didn't Matter)
Simplifying our error handling meant sacrificing things that felt important:
Detailed error taxonomies. We went from 50+ error types to basically three categories. This felt like a loss of precision. In practice, the precision wasn't helping anyone. Knowing the exact error type didn't make debugging faster—knowing what was broken did.
Comprehensive metadata on every error. We stopped logging everything we could and started logging only what was relevant. This meant sometimes we'd have to add more logging after discovering we needed additional context. That's fine: better to add specific logging when you need it than to drown in unused context all the time.
Type-safe error handling. Our custom exception hierarchy gave us compile-time guarantees about error handling. Removing it felt risky. But runtime reliability isn't about type safety—it's about humans understanding failures quickly and fixing them correctly.
Sophisticated error transformation pipelines. We had middleware that enriched errors, categorized them, and routed them to different logging systems based on type. We deleted most of it. Simpler error handling meant fewer places for bugs to hide in the error handling itself.
What we gained was worth more than what we lost: the ability to debug production issues quickly.
The Pattern That Emerged
After six months with simplified error handling, we noticed something interesting: we were fixing bugs faster, but we weren't fixing more bugs.
The complex error handling hadn't prevented bugs. It had just made them harder to understand. When you can't quickly diagnose what's failing, you either ignore intermittent errors (hoping they'll go away) or spend excessive time debugging simple issues.
With clearer errors, we could quickly distinguish between:
- Known issues we're monitoring
- New failures that need immediate attention
- Client errors that don't require action
- Infrastructure problems vs. code bugs
This meant less time investigating false alarms and more time fixing actual problems.
The developers on our team started writing simpler error handling in new code. Not because we mandated it, but because they saw how much easier it made their own debugging. The cultural shift from "comprehensive error handling" to "useful error handling" happened organically.
What This Means for Your API
If you're building error handling right now, here's what I'd do differently:
Start with simple logging. Don't build sophisticated error categorization until you've actually debugged production issues and know what information you need. Your first error handling should be almost embarrassingly simple.
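Something like this is plenty to start with (a minimal Python sketch; the handler shape is illustrative):

```python
import logging

logging.basicConfig(format="[%(levelname)s] %(message)s")
log = logging.getLogger("api")


def handle_request(endpoint, do_work):
    """Run a request handler; on failure, log one readable line and re-raise."""
    try:
        return do_work()
    except Exception as exc:
        # Add more context only after a real incident shows you need it.
        log.error("%s - %s: %s", endpoint, type(exc).__name__, exc)
        raise
```

Grow it only in the directions your actual incidents demand.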
Optimize for human scanning, not machine parsing. Structured logging has its place, but errors should be readable first, queryable second. When something's on fire, you need to scan logs visually and quickly form hypotheses.
Make errors actionable. Every error should tell you what to do next. "Database connection failed" isn't enough. "Database connection failed - check if prod-db-1 is accepting connections" actually helps.
Include context that matters, exclude context that doesn't. You don't need to log everything about the request with every error. You need to log what's relevant to that specific failure mode.
Test your error handling by reading logs. Don't just test that errors are caught and logged. Actually read the logs and see if you can quickly understand what's failing. If it takes you more than a few seconds to understand an error, your error handling is too complex.
Use platforms like Crompt AI that let you work with multiple AI models to help analyze error patterns when you're debugging complex issues. Not as a replacement for good logging, but as a thinking partner when you're trying to make sense of what logs are telling you.
The Real Lesson
Error handling isn't about catching every possible failure mode and logging comprehensive context. It's about making failures understandable to the person who has to fix them.
The best error handling I've seen isn't sophisticated—it's simple, direct, and optimized for human comprehension under pressure.
Your errors will be read by tired developers at inconvenient times trying to fix problems quickly. Write error handling for them, not for the idealized version of yourself that has unlimited time to investigate issues.
The sophistication comes from understanding what information actually helps during debugging, not from building elaborate abstractions around every failure you can imagine.
-ROHIT