Mustafa ERBAY

Posted on Jun 3 • Originally published at mustafaerbay.com.tr

Error Handling Choices: The Operational Burden of a Detailed Approach

#tutorials #errors #architecture #operations

The Operational Burden of Detailed Error Management: A Perspective

While developing a production ERP, an incomplete late shipment report kept me busy for days. At first, I thought the database query was at fault, but the real issue was the uninformative message the application returned to the user when an unexpected error occurred. That's when I realized that error handling is not just about lines of code; its operational burden and cost need to be analyzed correctly. In this post, I will explain the operational impacts, trade-offs, and real-world consequences of detailed error management, drawing on my own experience.

In software development, especially in enterprise applications, error management is critically important. Still, the most detailed approach is not always the best one. While looking for an answer to the question "how detailed should it be?", I will give priority to operational efficiency and cost.

Why Detailed Error Management?

It is certainly possible to prevent an application from crashing with a simple try-catch block. However, in production systems, especially in applications managing a supply chain or financial transaction flow, this simplicity is often not enough. Instead of just telling the user "An error occurred," explaining what the error is, why it occurred, and what they need to do makes a big difference in understanding and resolving the operational process.

For example, imagine a payment process failing on an e-commerce platform. Instead of just telling the user "Payment failed," informing them whether the error was due to card information, a communication issue with the bank, or a problem elsewhere in the system helps the user resolve the issue faster and makes the support team's job easier. This level of detail, especially in complex enterprise software, can significantly reduce the debugging time required to find the root cause of a problem.
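As a minimal sketch of this idea, the three causes named above can be modeled as categories, each mapped to a message that says what happened and what to do next. The category names, the `user_message` helper, and the message texts are all hypothetical; a real payment gateway defines its own error codes.

```python
from enum import Enum


class PaymentFailure(Enum):
    """Hypothetical failure categories, not taken from any real gateway."""
    CARD_DECLINED = "card_declined"
    BANK_UNREACHABLE = "bank_unreachable"
    INTERNAL = "internal"


# One user-facing message per category: what happened and what to do next.
USER_MESSAGES = {
    PaymentFailure.CARD_DECLINED: (
        "Your card was declined. Please check the card details or try another card."
    ),
    PaymentFailure.BANK_UNREACHABLE: (
        "We could not reach your bank. Please try again in a few minutes."
    ),
    PaymentFailure.INTERNAL: (
        "Payment failed because of a problem on our side. "
        "Please contact support if it keeps happening."
    ),
}


def user_message(failure: PaymentFailure) -> str:
    """Return the message for a category, with a generic fallback."""
    return USER_MESSAGES.get(failure, "Payment failed. Please try again later.")


print(user_message(PaymentFailure.BANK_UNREACHABLE))
```

Keeping the messages in one table also gives the support team a single place to look up what the user was told.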

ℹ️ A Note from My Experience

In one project, I noticed that users couldn't track their orders. The source of the problem was a background microservice that had unexpectedly stopped responding. Initially, only a general error message was returned. After we enriched both the logging and the error message returned to the user, we could see clearly which service failed and with which parameters. This allowed us to resolve the issue within 2 hours. Without detailed error messages, it would have taken at least 1-2 days longer.

Operational Cost: Unexpected Burden

Detailed error management, at first glance, may seem like just an extra effort during the development phase, but it actually directly affects operational costs. The content, size, and storage method of each error message have an impact on overall system performance and costs.

For example, creating separate, long, and detailed error messages for each possible error condition prolongs development time and requires additional resources for storing these messages in the system. If these messages are written to log files, the size of the log files increases rapidly. This increases disk space costs and can degrade the performance of log analysis tools. Analyzing a 100 GB log file requires much more time and processing power than analyzing a 1 GB file.

Furthermore, foreseeing and coding for every error scenario costs developers significant time. This can make it difficult to meet sprint goals, especially in agile development processes. When the team's focus shifts from developing new features to detailing existing errors, the overall progress of the project can slow down.

Trade-offs: Level of Detail and Pragmatism

So, how do we strike a balance at this point? How much detail should we go into in error management? The answer to this question varies depending on the type of project, the user base, and business requirements.

Level of Detail Presented to the User

Error messages shown directly to the user should generally contain less technical detail. The goal is not to scare or confuse the user, but to help them understand the problem and, if necessary, guide them on what to do.

Simple User Interfaces: For mobile applications or end-user-focused websites, more general messages like "Your transaction could not be completed. Please try again later." may suffice.
Enterprise Applications: In enterprise software like ERP or CRM, users may have more technical knowledge. In these cases, information such as which module the error occurred in or which data entry caused the problem can be more useful. For example, the message "Supplier information could not be saved. Supplier code 'XYZ123' already exists." provides sufficient detail to understand the problem.

⚠️ Example Scenario - Incorrect Detailing

In a banking application, showing a raw database error like java.sql.SQLException: ORA-00600: internal error code, arguments: [a12345], [], [], [], [], [], [] to the user during a money transfer error will panic the user and be useless. Instead, a more guiding message like "Your transfer could not be completed. Please contact your bank's customer service." should be preferred.
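One way to sketch this separation: catch the low-level failure at the boundary, write the raw detail to the log, and raise an error that carries only user-safe text. The names `TransferError`, `run_transfer`, and the `"transfers"` logger are assumptions made for illustration, and the broad `except Exception` is a simplification.

```python
import logging

logger = logging.getLogger("transfers")  # hypothetical logger name

SAFE_MESSAGE = (
    "Your transfer could not be completed. "
    "Please contact your bank's customer service."
)


class TransferError(Exception):
    """Carries only text that is safe to show to the end user."""


def run_transfer(do_transfer):
    """Run a transfer callable and hide raw driver errors from the user."""
    try:
        return do_transfer()
    except Exception as exc:  # simplification: a real handler would be narrower
        # The full technical detail, including the stack trace, goes to the log.
        logger.exception("transfer failed")
        # The original exception stays attached as __cause__ for debugging.
        raise TransferError(SAFE_MESSAGE) from exc
```

The user interface then displays `str(error)` and never sees the database message, while the operations team still finds the full trace in the log.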

Level of Detail for Developer and Operations Teams

For developers and operations teams, the situation is different. They need more technical details to debug and resolve errors. These details are usually written to log files and tracked with special monitoring tools.

Logging: Information such as the code line causing the error, parameters, called functions, and system status (CPU, memory usage, etc.) should be logged. This is critical for finding the root cause of the problem.
Trace Information: In distributed systems, distributed tracing tools are used to track the journey of a request across different services. These tools provide vital information to understand which service and at what stage the error occurred. For example, a user's request first comes to the API Gateway, then to the authentication service, and then to the backend service. If the error occurs in the backend service, the trace information shows this flow.
Metrics and Alerts: System performance metrics (e.g., error rate per request, request processing time) should be collected, and alerts should be set up for abnormal changes in these metrics. This helps detect problems before they affect users. For example, an alert can fire if the error rate exceeds 5%.
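The 5% alert rule can be sketched as a sliding window over the most recent requests. This is only an illustration: the `ErrorRateMonitor` class and the window size of 100 are assumptions, and in practice this logic usually lives in a monitoring system rather than in application code.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the failure rate over the last `window` requests (hypothetical helper)."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        self.outcomes.append(failed)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self) -> bool:
        # Alert only when the rate is strictly above the threshold.
        return self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window=100, threshold=0.05)
for i in range(100):
    monitor.record(failed=(i % 10 == 0))  # 10 failures in 100 requests
print(monitor.should_alert())
```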

Real-time Data Analysis and Error Management

While validating user-entered data in a financial calculator application I developed, I ran into a case that once again demonstrated the importance of detailed error management. For one specific dataset, I had to perform an in-depth analysis to understand why the calculation was incorrect.

Initially, only a general error message like "Invalid input" was returned. However, to find the source of the problem, I started logging all the data entered by the user, intermediate calculation steps, and the final result. Through this detailed logging, I realized that the error was actually caused by a floating-point precision issue, and this issue became more pronounced in a specific data set. If I hadn't obtained this information, it could have taken weeks to find the source of the problem.
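The post does not show the calculation itself, so the following is a generic illustration of the kind of floating-point precision issue described, not the actual bug. It compares binary floats with Python's `decimal.Decimal`, which is one common choice for monetary values.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly,
# so small errors appear even in simple sums.
float_total = 0.1 + 0.2
print(float_total == 0.3)  # False

# Decimal values built from strings keep exact decimal digits.
decimal_total = Decimal("0.1") + Decimal("0.2")
print(decimal_total == Decimal("0.3"))  # True
```

Logging the intermediate values, as described above, is what makes a discrepancy like this visible in the first place.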

💡 Practical Solution: A Separate Mechanism for Error Details

In a production environment, instead of logging every detail directly, we can return a basic error message along with an error ID to the user when an error occurs. This error ID points to a separate database or logging service where more detailed information (user information, system status, parameters, etc.) is stored. This way, a clean interface is presented to the user, and the operations team can access all the details needed to investigate the problem.
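A minimal sketch of the error ID mechanism, assuming the standard `logging` module stands in for the separate logging service. The `handle_failure` function, the logger name, and the payload shape are hypothetical.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")  # hypothetical logger name


def handle_failure(exc: Exception, context: dict) -> dict:
    """Log full details under a fresh error ID and return a user-safe payload."""
    error_id = uuid.uuid4().hex
    # Details stay server-side; the ID is the lookup key for the operations team.
    logger.error("error_id=%s context=%r", error_id, context, exc_info=exc)
    return {
        "message": "Something went wrong. Please quote the error ID when contacting support.",
        "error_id": error_id,
    }


try:
    raise ValueError("supplier code 'XYZ123' already exists")
except ValueError as err:
    payload = handle_failure(err, {"module": "suppliers"})
print(payload["error_id"])
```

The user sees only the short message and the ID; searching the logs for that ID returns the context and stack trace.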

Error Management Strategies: Options and Consequences

There are different strategies that can be followed for error management. Each has its own unique advantages and disadvantages.

Coding Errors and Operational Burden

Errors made during the software development phase directly affect the operational burden. For example, a NullPointerException raised during an API request can, if not handled properly, crash the application or cause unexpected behavior. Developers need to detect and correct such errors carefully.

In one project, I observed that the entire system slowed down due to an error in a microservice. The problem was caused by the service entering an infinite loop when it couldn't respond to a request. This caused the server's CPU usage to reach 100% and negatively affected the performance of other services. The operational cost of such errors is not only performance degradation but also wasted server resources and potentially service outages.

Comprehensive Testing and Error Prevention

The most effective way to manage errors is to prevent them from occurring. This is achieved through comprehensive unit tests, integration tests, and user acceptance tests. However, it is not always possible to foresee all possible error scenarios.

🔥 Unexpected Situations: My Own Mistake

Last month, while publishing content on my blog, I added sleep(360) to my code to test a feature. My goal was to measure how long an operation took. However, I forgot to remove this code before publishing, and the 360-second wait locked the system during a specific user interaction, leading to an OOM (Out of Memory) error. This simple "error" I wrote turned into an operational problem. Afterwards, I used a more controlled polling mechanism instead of sleep. This is a personal example showing how serious unexpected error scenarios can be.
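The post does not show the polling mechanism that replaced the sleep call, so this is only a sketch of what a bounded polling helper could look like. The `wait_until` function and its default values are assumptions.

```python
import time


def wait_until(condition, timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)  # short, bounded pause between checks
    return False  # the caller decides how to handle the timeout


started = time.monotonic()
ready = wait_until(lambda: time.monotonic() - started > 0.2, timeout=2.0)
print(ready)
```

Unlike a fixed sleep, the wait ends as soon as the condition holds, and the timeout puts an upper bound on how long a request can be blocked.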

Design Patterns for Error Management

Design patterns used for error management also affect the operational burden. For example, the Circuit Breaker pattern prevents other parts of the system from being affected by temporarily stopping calls to a service if it continuously returns errors. This increases system stability and prevents the spread of operational problems.
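A simplified sketch of the Circuit Breaker pattern follows. The class, its thresholds, and the injectable `clock` parameter are illustrative assumptions; production systems typically rely on an established library rather than a hand-written breaker.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure now reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

While the circuit is open, callers fail fast with `CircuitOpenError` instead of waiting on a service that is already struggling.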

Another pattern, the Retry Pattern, allows an operation to be retried several times when an error occurs. This is especially effective in situations such as temporary network issues or services being temporarily unavailable. However, if implemented incorrectly, it can also lead to system overload. For example, an operation that retries 1000 times per second can completely disable the target service. Therefore, retry strategies (number of retries, wait time, etc.) need to be carefully adjusted.
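A common way to keep retries from overloading the target service is to cap the number of attempts and wait longer after each failure. The sketch below assumes exponential backoff with jitter; the `retry` function, its defaults, and the injectable `sleep` parameter are hypothetical.

```python
import random
import time


def retry(func, max_attempts: int = 4, base_delay: float = 0.5,
          max_delay: float = 8.0, sleep=time.sleep):
    """Call `func`; on failure wait base_delay * 2**attempt (capped, jittered)."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # simplification: real code should retry only transient errors
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from many clients hitting one service.
            sleep(delay * random.uniform(0.5, 1.0))
```

With these defaults the nominal waits are 0.5, 1, and 2 seconds before jitter, so a failing operation makes at most four calls in a few seconds rather than hammering the service continuously.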

Conclusion: A Pragmatic Approach

Detailed error management is important for software reliability and operational efficiency. However, it is necessary to accept that the "most detailed" approach is not always the best solution. A balance must be struck by considering operational costs, development time, and user experience.

Error messages shown to the user should be clear, understandable, and guiding; for developer and operations teams, they should contain sufficient technical detail. Logging, monitoring, and alerting mechanisms should ensure these details are used effectively. A pragmatic error management strategy both improves code quality and minimizes operational burden. This is a continuous process of improvement and finding balance.
