Mustafa ERBAY

Posted on Jun 9 • Originally published at mustafaerbay.com.tr

Log Level Decisions: The Anatomy of DEBUG, INFO, and ERROR Strategies

#observability #guide #software

Whenever an issue arises in my systems or the applications I develop, the first place I look is always the logs. Logs are like a window into the inner world of the application; they are the most fundamental way to understand what is going on. However, how you use this window directly affects how much information you see and how quickly you can process that information. This is exactly where the log level decision comes into play.

Over the years, I have seen many bad practices—ranging from needlessly printing DEBUG logs in production to throwing ERROR logs that brush off an important error with just "an unexpected error occurred"—and unfortunately, I have made some of these mistakes myself. In this post, I will present a guide shaped by my own experiences on how I use log levels and what these levels mean for a more efficient operation. My goal is to move logs away from being just a file-filling tool and turn them into a real source of operational intelligence.

Log Levels and Their Purpose: Why Do We Differentiate?

Log levels are used to classify the importance and detail level of messages produced by an application or system at runtime. This classification serves different needs in both development and production environments. For example, while you want to see every detail during development, tracking only critical events and errors in production is usually sufficient.

The main levels I use, which are widely accepted in the industry, are: DEBUG, INFO, WARN, ERROR, and FATAL. Each has its own purpose and use case. Choosing the wrong level can either cause you to miss a major issue or drown you in noise. I have experienced both countless times. In a manufacturing company's ERP, when a bottleneck occurred in the shipment flow at 2:00 AM, we found the root cause within minutes thanks to correct log levels; but there were also times we wasted hours due to incorrect level usage.

ℹ️ Log Level Hierarchy

The generally accepted log level hierarchy is as follows (from most detailed to most critical):

TRACE: Very fine-grained information, usually meant for libraries.

DEBUG: Detailed information for development and troubleshooting.

INFO: Important events showing the general flow of the application.

WARN: Potential issues, unexpected but non-critical situations.

ERROR: Errors that prevent the normal operation of the application.

FATAL: Very critical errors that cause the application to stop completely.

Thanks to these levels, we can filter logs, set up alerts for events at specific levels, and monitor the overall health of the system in a much more meaningful way. For example, while we receive instant notifications for logs at the ERROR level, we can use INFO level logs only for periodic reporting or general status monitoring. This distinction is essential for reducing operational load and focusing on real problems.
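The filtering described above can be sketched with Python's standard logging module. This is only an illustration under assumptions: the logger name "shipment", the two in-memory streams, and the build_logger helper are hypothetical, and the second stream merely stands in for whatever alerting channel is actually used.

```python
import io
import logging

def build_logger(name, general_stream, alert_stream):
    """Return a logger that sends INFO+ to one stream and ERROR+ to another."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)   # DEBUG records are dropped at the logger
    logger.propagate = False        # keep the example isolated from root handlers
    logger.handlers.clear()

    fmt = logging.Formatter("%(levelname)s %(name)s - %(message)s")

    general = logging.StreamHandler(general_stream)
    general.setLevel(logging.INFO)
    general.setFormatter(fmt)

    alerts = logging.StreamHandler(alert_stream)  # stand-in for an alert channel
    alerts.setLevel(logging.ERROR)
    alerts.setFormatter(fmt)

    logger.addHandler(general)
    logger.addHandler(alerts)
    return logger

general_out, alert_out = io.StringIO(), io.StringIO()
log = build_logger("shipment", general_out, alert_out)
log.debug("Computed route candidates")        # filtered out everywhere
log.info("Shipment batch started")            # general stream only
log.error("Carrier API rejected the batch")   # both streams
```

The level set on the logger decides what is recorded at all, while the level on each handler decides where a record goes, so one ERROR call can feed both the general log and the alert path.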

DEBUG Level: A Lifesaver for Development and Deep Investigation

The DEBUG level, as the name suggests, is the most detailed log level I use, typically during the development phase or when analyzing a problem in depth. I turn to DEBUG logs when I want to see which parameters a function was called with, the values of variables at each step of a loop, or exactly how a database query was constructed. In the backend of my own side project, when an API request comes in, I log the request payload, headers, and response time at the DEBUG level so I can easily diagnose potential integration problems.

However, leaving DEBUG logs enabled in a production environment can lead to serious issues. First, the performance cost is very high. Writing hundreds of log lines for every single transaction increases disk I/O, raises CPU usage, and extends the overall response time of the application. Second, disk space fills up rapidly. In a client project, DEBUG logs were accidentally left enabled in production; the volume overran journald's RateLimitInterval and RateLimitBurst limits and completely filled the server's disk within 24 hours. This caused the application to crash and led to a major outage. Third, accidentally logging sensitive data poses security risks. User passwords, API keys, or personal information can easily be exposed in DEBUG logs.

# Example of detailed logging at DEBUG level in a FastAPI application
import logging
from fastapi import FastAPI, Request

# Attach a handler to the root logger; without one, DEBUG records are never emitted
logging.basicConfig(level=logging.DEBUG)  # DEBUG in development environment only

app = FastAPI()
logger = logging.getLogger(__name__)

@app.post("/api/users")
async def create_user(request: Request):
    payload = await request.json()
    # Lazy %-style arguments are only formatted when DEBUG is actually enabled
    logger.debug("Received user creation request. Headers: %s, Payload: %s",
                 dict(request.headers), payload)
    # ... user creation logic ...
    logger.debug("User creation successful for username: %s", payload.get("username"))
    return {"message": "User created successfully"}

In the example above, the headers and payload of every request coming to the /api/users endpoint are logged at the DEBUG level. This can be highly valuable when integrating an API during development, but it must be turned off in production. Because of this, when deploying my applications, I make sure I can dynamically adjust the log level via environment variables or configuration files. For example, I turn off DEBUG logs with a value like LOG_LEVEL=INFO or LOG_LEVEL=WARN.
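Even with DEBUG switched off, the application can still pay for building a debug message if the expensive part runs before the logging call. A minimal sketch of guarding that work with Logger.isEnabledFor follows; the logger name and the describe_request and handle_request helpers are hypothetical and not taken from the application above.

```python
import json
import logging

logger = logging.getLogger("api.debug_demo")

def describe_request(headers, payload):
    """Hypothetical helper: builds a verbose dump that is costly for large payloads."""
    return json.dumps({"headers": headers, "payload": payload}, sort_keys=True)

def handle_request(headers, payload):
    # Skip the expensive serialization entirely unless DEBUG is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Request dump: %s", describe_request(headers, payload))
    logger.info("Handled request for user %s", payload.get("username"))
    return {"message": "ok"}
```

With the logger at INFO, describe_request is never called, so the cost of the detailed dump is only paid while someone is actually debugging.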

INFO Level: The Pulse of System Status

The INFO level is ideal for recording important events that show the normal and expected workflow of the application or system. These logs are used to monitor the overall health and operational flow of the system. Events such as successful user logins, starting or stopping a service, or completing a job should be logged at the INFO level. For me, INFO logs are the answer to the question, "What is the system doing right now?"

In a production ERP, I would log every major step—from receiving an order to confirming a shipment—at the INFO level. This made these logs an invaluable resource when generating daily operational reports or tracking the status of an order. For instance, status changes of an order like "payment received", "prepared by warehouse", and "shipped" were tracked with INFO logs. This allowed me to easily see which orders passed through which stages and when in the end-of-month reports.

2024-05-18 10:30:05.123 INFO [main] com.example.app.UserService - User 'mustafa.erbay' logged in successfully from IP: 192.168.1.10
2024-05-18 10:30:10.456 INFO [order-processor] com.example.app.OrderService - New order received: #ORD-20240518-001 for customer ID: 12345
2024-05-18 10:30:15.789 INFO [payment-gateway] com.example.app.PaymentService - Payment processed successfully for order #ORD-20240518-001, amount: 1250.00 TL
2024-05-18 10:30:20.111 INFO [inventory-manager] com.example.app.InventoryService - Inventory updated for product SKU: PROD-ABC-001, new stock: 48

As you can see in this example, INFO logs clearly outline the core business workflow of the system. Even if there is no immediate error, these are crucial for detecting operational anomalies or understanding at which stage a user's problem began. For example, when looking for an answer to a user's question "Why hasn't my order arrived?", I can quickly determine whether the order got stuck at the payment stage or the shipping stage by looking at the INFO logs. This is a key part of my observability strategy.
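One way to keep such INFO lines easy to search is to log every business milestone through a single helper with a fixed message shape. The sketch below assumes a hypothetical record_stage function and reuses the order statuses mentioned above; the key=value layout is an illustrative choice, not the format of the ERP described here.

```python
import logging

logger = logging.getLogger("orders")

# Hypothetical order stages, taken from the statuses mentioned above.
STAGES = ("payment received", "prepared by warehouse", "shipped")

def record_stage(order_id, stage):
    """Log one INFO line per business milestone, with a stable, greppable shape."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    logger.info("order=%s stage=%r", order_id, stage)
    return stage
```

Because every line carries the order ID in the same position, answering "where did this order get stuck?" becomes a single search for that ID.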

WARN and ERROR Levels: Diagnosing Problems Early

The WARN and ERROR levels are used to point out potential or existing problems in the system, and they are the most critical log categories for a system administrator or developer. The WARN (Warning) level is used when an unexpected situation occurs or something could cause a problem in the future, even if it is not an outright error yet. For example, when a configuration file is not found and default values are used, when a response from an API is slower than expected (but still successful), or when a cache starts filling up, the WARN level is appropriate.

The ERROR level, on the other hand, is reserved for situations that disrupt the normal flow of the application, prevent a task from being completed, or indicate that a critical component has failed. Database connection failures, file write errors, a critical service not responding, or a business logic error (e.g., canceling a transaction due to invalid data input) should be logged at the ERROR level. In my own Android spam-blocking app, I print an ERROR log when a call blocking list cannot be updated or when an API call fails due to a network error. This allows me to instantly see issues affecting the core functionality of the application.

2024-05-18 10:45:01.321 WARN [config-loader] com.example.app.ConfigService - Configuration file 'app.properties' not found. Using default settings.
2024-05-18 10:45:15.654 WARN [payment-gateway] com.example.app.PaymentService - Payment gateway response time exceeded 2000ms for order #ORD-20240518-002. Current: 2350ms.
2024-05-18 10:46:01.987 ERROR [database-connector] com.example.app.DataService - Failed to connect to PostgreSQL database. Retrying in 5 seconds. Exception: org.postgresql.util.PSQLException: Connection refused.
2024-05-18 10:46:05.111 ERROR [user-api] com.example.app.UserService - User registration failed due to invalid email format for 'test@invalid'.
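The WARN and ERROR cases described above can often be told apart in code by asking whether the application can carry on. A minimal sketch, assuming a hypothetical JSON settings file and a hypothetical DEFAULTS dictionary (neither comes from the systems described here):

```python
import json
import logging

logger = logging.getLogger("config")

DEFAULTS = {"timeout_ms": 2000}  # hypothetical default settings

def load_settings(path):
    """Return settings from a JSON file, falling back to defaults when possible."""
    try:
        with open(path, encoding="utf-8") as fh:
            return {**DEFAULTS, **json.load(fh)}
    except FileNotFoundError:
        # Recoverable: the application keeps running with defaults, so WARN.
        logger.warning("Configuration file %r not found. Using default settings.", path)
        return dict(DEFAULTS)
    except json.JSONDecodeError:
        # Not recoverable here: the file exists but cannot be parsed, so ERROR.
        logger.error("Configuration file %r is not valid JSON.", path, exc_info=True)
        raise
```

The missing file produces a WARN because a sensible fallback exists; the corrupt file produces an ERROR because the task of loading the configuration cannot be completed.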

The FATAL level is used for very rare and critical errors that usually cause the application to crash completely or become entirely unusable. We see FATAL logs in situations like a JVM crash or an application failing to load its dependencies. A log at this level typically requires restarting the application or manual intervention. When one of my services exits after a FATAL error, its systemd unit automatically attempts to restart it and my alarm mechanisms are triggered. This distinction is vital for alerting the right team at the right time in critical situations.
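In Python's logging module the FATAL level is called CRITICAL (logging.FATAL is an alias for it). The sketch below shows one possible pattern: log at CRITICAL, then return a non-zero exit code so that a supervisor can react. The run and failing_startup functions are hypothetical, and the restart behaviour itself depends on how the service unit is configured, for example with Restart=on-failure in systemd.

```python
import logging

logger = logging.getLogger("bootstrap")

def run(startup):
    """Run a startup callable and return a process exit code."""
    try:
        startup()
    except Exception:
        # CRITICAL is Python's name for FATAL; exc_info adds the stack trace.
        logger.critical("Startup failed, shutting down", exc_info=True)
        return 1
    return 0

def failing_startup():
    """Hypothetical startup step that cannot load a required dependency."""
    raise RuntimeError("required dependency could not be loaded")

exit_code = run(failing_startup)
# In a real entry point this value would be passed to sys.exit().
```

Keeping the exit code separate from the logging call makes the behaviour testable and leaves the restart decision to the process supervisor.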

Log Level Strategies and Implementation Tips

Deciding on log levels is not just about what you log, but also about how you structure your logging infrastructure. There are several strategies I implement in my own systems and client projects:

Environment-Based Log Level Configuration

Using different log levels for development, test, and production environments is a fundamental rule. Typically, DEBUG is used in development, INFO or WARN in test, and INFO or ERROR in production. This is highly important for both performance and security. For example, in a Python application using the logging library, we can easily configure this with an environment variable:

# app.py
import os
import logging

LOG_LEVEL = os.getenv("LOG_LEVEL", "INFO").upper()
# Fall back to INFO if LOG_LEVEL is not a valid level name
logging.basicConfig(level=getattr(logging, LOG_LEVEL, logging.INFO),
                    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s')

logger = logging.getLogger(__name__)

def my_function():
    logger.debug("This is a debug message.")
    logger.info("This is an info message.")
    logger.warning("This is a warning message.")
    logger.error("This is an error message.")

if __name__ == "__main__":
    my_function()

This way, while we see all the details in development with LOG_LEVEL=DEBUG python app.py, we can track only important events in production with LOG_LEVEL=INFO python app.py. This is a practice I easily manage using environment variables when spinning up my applications with Docker Compose.

Avoiding Sensitive Data Logging

It is essential to ensure that sensitive data such as passwords, credit card numbers, and personally identifiable information (PII) are never written to logs. In a production ERP, I never kept user passwords in any logging stage, either before or after hashing them with bcrypt. If there is a need for debugging, masked or encrypted versions of this data should be used. This is a critical step both for compliance with regulations like GDPR and for your overall security posture.

⚠️ Security Risk: Sensitive Data

The presence of sensitive data (passwords, API keys, personal data) in logs can lead to serious security breaches. Always ensure that such data is not logged, or use masking/encryption techniques.
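As a last line of defence, masking can be applied inside the logging pipeline itself. The following is a minimal sketch using a logging.Filter; the RedactingFilter class and the SENSITIVE pattern are hypothetical, and the pattern only catches simple key=value or key: value text, so it is a safety net rather than a replacement for not logging such data in the first place.

```python
import logging
import re

# Hypothetical pattern: masks the values of a few sensitive-looking keys.
SENSITIVE = re.compile(r"(?i)\b(password|api_key|token)\b(\s*[=:]\s*)(\S+)")

class RedactingFilter(logging.Filter):
    """Rewrite each record so matching values never reach a handler's output."""

    def filter(self, record):
        message = record.getMessage()                    # merge %-args first
        record.msg = SENSITIVE.sub(r"\1\2***", message)  # mask the value part
        record.args = None                               # args are already merged
        return True                                      # keep the masked record
```

Attaching the filter to a handler with handler.addFilter(RedactingFilter()) applies it to every record that handler emits, regardless of which logger produced it.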

Dynamic Log Level Changing

Sometimes in a production environment, you need to temporarily increase log verbosity to troubleshoot a specific issue. Instead of switching the entire application to DEBUG, it is far better to have a mechanism that can change the log level of a specific module or class at runtime. In my systems, I can adjust the log level of a specific component via an admin interface or an API endpoint, and restore it as soon as I have identified the problem, which avoids unnecessary performance costs. This is a lifesaver when, for example, a PostgreSQL query slows down: I temporarily switch the code path that triggers the query to DEBUG, collect detailed logs until I find the problem, and then set it back to INFO.
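With Python's logging module, a per-component switch like this comes down to calling setLevel on one named logger. A minimal sketch of the core function that an admin endpoint might call; set_component_level and the logger name used in the note below are hypothetical, and the endpoint itself (including its authentication) is left out.

```python
import logging

def set_component_level(component, level_name):
    """Change the level of one named logger at runtime; return the previous level name."""
    level = logging.getLevelName(level_name.upper())
    if not isinstance(level, int):
        # getLevelName returns a string such as "Level XYZ" for unknown names
        raise ValueError(f"unknown log level: {level_name!r}")
    logger = logging.getLogger(component)
    previous = logger.level
    logger.setLevel(level)
    return logging.getLevelName(previous)
```

Calling set_component_level("app.db.queries", "debug") raises verbosity for that logger and its children only, and calling it again with "info" restores the normal level. Returning the previous level makes it easy to put things back exactly as they were.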

Common Mistakes and My Lessons Learned

In my twenty years of experience, I must admit I have made a lot of mistakes when it comes to logging. The lessons I learned from these mistakes guide me today as I define my logging strategies.

Printing Everything to DEBUG and Performance Fires

One of the most common mistakes is logging everything at the DEBUG level. Done with a "just in case, let it sit there" mindset, this can turn into a serious "disk fire" in production. On a client's e-commerce site, we forgot to switch Nginx's error_log away from the debug level over a weekend, and I found 95% of a 500GB disk filled with log files. The server slowed down and critical events were lost because new logs could not be written. Since then, I have carefully configured settings like SystemMaxUse and SystemKeepFree in journald, kept logrotate rules tight, and always set the default log level of applications to INFO or WARN.
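For applications that write their own log files, a similar cap can be enforced inside the application. This is a different technique from the journald and logrotate settings mentioned above: a minimal sketch using Python's RotatingFileHandler, where the build_capped_logger helper, the logger name, and the size limits are all illustrative assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler

def build_capped_logger(path, max_bytes=10 * 1024 * 1024, backups=5):
    """Return a logger whose files use roughly max_bytes * (backups + 1) at most."""
    logger = logging.getLogger("app.capped")
    logger.setLevel(logging.INFO)          # default to INFO, never DEBUG
    logger.handlers.clear()
    # Rolls over to path.1, path.2, ... and deletes the oldest backup
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

With the defaults above, the log files stay at around 60 MB in total no matter how chatty the application becomes, at the price of losing the oldest entries first.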

Forgetting to Log Errors or Insufficient Error Messages

Knowing that an error occurred is not enough; you also need to know why it occurred. Sometimes developers leave generic messages in try-catch blocks like logger.error("An error occurred"). This turns the troubleshooting process into a nightmare. In a manufacturing company's ERP, the stock update service was throwing errors, but the logs only said "Stock could not be updated." We had to restart the entire service in DEBUG mode to find the cause of the error. However, a call like logger.error("Stock could not be updated: %s", exc, exc_info=True), which records both the exception message and the stack trace, would have sped up finding the root cause immensely. Now, I include the stack trace and relevant variable values as much as possible in every ERROR log.
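A minimal sketch of an ERROR log that carries its own context, assuming a hypothetical update_stock function and an in-memory dictionary in place of the real stock service:

```python
import logging

logger = logging.getLogger("inventory")

def update_stock(stock, sku, delta):
    """Apply a stock change; log full context and the traceback on failure."""
    try:
        new_level = stock[sku] + delta
        if new_level < 0:
            raise ValueError(f"stock would become negative ({new_level})")
        stock[sku] = new_level
        return new_level
    except (KeyError, ValueError) as exc:
        # logger.exception logs at ERROR level and appends the stack trace.
        logger.exception("Stock could not be updated: sku=%s delta=%s reason=%s",
                         sku, delta, exc)
        raise
```

The message names the SKU, the attempted change, and the reason, and logger.exception adds the traceback, so the failure can be diagnosed from the log alone without restarting anything in DEBUG mode.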

Drowning the System with Unnecessary INFO Logs

INFO logs are important, but logging every minor event as INFO is another common mistake. Logging every micro-event like "User moved the mouse" or "Button clicked" as INFO needlessly strains log collection systems (like Promtail or Fluentd), consumes disk space, and causes important INFO logs to be overlooked. In one project, I saw a Redis client logging every SET and GET operation as INFO. This meant thousands of log lines per second and was rapidly filling the Redis server's disk. In such cases, I use the INFO level only for the main steps of the workflow and leave the finer details to DEBUG.

Not Applying Log Level Changes

Log levels can change throughout the application lifecycle. However, failing to integrate these changes into the CI/CD pipeline, or forgetting them during deployment, is common. When releasing a new version, accidentally leaving DEBUG logs from the development environment enabled in production can lead to the disk capacity issues I mentioned above. In my own CI/CD processes, I added a step that automatically resets the log level configuration to INFO or WARN before production deployments. This is important for minimizing human error.
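Such a pipeline step can be as small as one function that clamps the requested level before the configuration is written. The sketch below is an assumption about how that might look, not the actual step from my pipeline; production_safe_level, the environment name "production", and the INFO floor are all illustrative.

```python
import logging

def production_safe_level(environment, requested, floor="INFO"):
    """Return the requested level name, raised to `floor` for production deploys."""
    name = requested.strip().upper()
    requested_no = logging.getLevelName(name)
    floor_no = logging.getLevelName(floor)
    if not isinstance(requested_no, int):
        raise ValueError(f"unknown log level: {requested!r}")
    if environment == "production" and requested_no < floor_no:
        return floor          # e.g. DEBUG is silently replaced by INFO
    return name
```

A stricter variant could fail the build instead of correcting the value, which makes the mistake visible at the cost of blocking the deployment.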

Conclusion

Deciding on log levels is one of the cornerstones of software development and system operations. When done right, it speeds up troubleshooting, makes it easier to understand the overall health of the system, and lowers operational costs. When done wrong, it negatively impacts performance, creates security risks, and wastes teams' time.

For me, the key point is to clearly understand the purpose of each log level and use it in the correct context. DEBUG is for development and deep investigation; INFO is for tracking the operational flow; WARN is for drawing attention to potential issues; and ERROR and FATAL are for reporting critical errors. Making this distinction well has made my systems much more resilient and observable. Remember, logs are not just text files; they are the stories your system tells, and listening to these stories correctly always helps us build better systems.

In a follow-up post, I plan to write about how I visualize logs more effectively using tools like Grafana Loki and how I add alerting mechanisms to them.
