Mustafa ERBAY

Posted on May 28 • Originally published at mustafaerbay.com.tr

Log Level Strategy: How to Make the Right Choices in a Production Environment?

#tutorials #logging #systemadministration #debugging

Monitoring the health and performance of systems in production is critical, and logging is one of the cornerstones of that monitoring. However, logging everything at the DEBUG level all the time hurts performance and buries the information we actually need under noise. A well-defined log level strategy lets us use system resources efficiently and find solutions quickly when problems arise. In this post, drawing on my own experience, I explain what to consider when creating a log level strategy for production, which level to use in which situation, and how to keep optimizing that strategy.

Logging is the process of recording information generated by an application or system during its runtime. This information is used for debugging, monitoring system performance, detecting security events, and understanding general system behavior. Log levels determine the level of detail of the recorded information. The standard levels commonly used are: TRACE, DEBUG, INFO, WARN, ERROR, FATAL. Each level has its own specific purpose and priority. Managing these levels correctly in a production environment requires working with the precision of a surgeon.

Standard Log Levels and Their Meanings

There are several widely accepted standard levels in the world of logging. Understanding them is the first step to establishing the right strategy. Each level serves a specific use case and helps us determine what information should be recorded and when. Correctly understanding and implementing these levels speeds up troubleshooting processes and reduces unnecessary log overhead.

TRACE: The most detailed log level. It is generally used during development, when even the finest details of the code's execution need to be followed. Information such as method entries and exits and variable values is recorded at this level. Keeping it permanently enabled in a production environment is not recommended.
DEBUG: Used for development and debugging. It contains information that helps understand the normal flow of the application. For example, details like which steps a request went through and with which parameters it was processed can be kept at this level. In a production environment, it can be temporarily enabled for a specific module.
INFO: This level provides information about the normal operation of the application. Important events, successful operations, and initiated tasks are recorded. For example, a user logging in or a report being successfully generated would be logged at the INFO level. In a production environment, it is generally used as the default level.
WARN: Indicates situations that could cause problems later but do not stop the application from running. For example, if a configuration file is missing but the application can continue with default values, that can be logged at the WARN level. These warnings are important for preventing future issues.
ERROR: Records critical errors that indicate an operation has failed. For example, situations like failing to establish a database connection or being unable to write to a file are logged at the ERROR level. In a production environment, logs at this level indicate situations requiring immediate intervention.
FATAL: Indicates serious errors that prevent the application from running and require it to be stopped immediately. For example, situations like a critical service failing to start or running out of memory are logged as FATAL. Logs at this level signal events that could cause the system to crash completely.

ℹ️ Hierarchy of Log Levels

These levels form a hierarchy ordered by severity: TRACE < DEBUG < INFO < WARN < ERROR < FATAL. The level you select acts as a threshold, so that level and every more severe level are recorded. For example, if you select DEBUG, then DEBUG, INFO, WARN, ERROR, and FATAL logs are recorded, but TRACE logs are not. If you select FATAL, only FATAL logs are recorded. The lower the threshold, the more detail ends up in the logs.
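In Python's standard `logging` module, the threshold behaviour can be checked directly: a logger emits records at its own level and at every more severe level. The sketch below is only an illustration; the logger name `hierarchy_demo` and the helper `enabled_levels` are made up for this example. Note that Python has no built-in TRACE level and treats FATAL as an alias for CRITICAL.

```python
import logging

# Python's built-in levels are integers; a higher number means higher severity.
# The stdlib has no TRACE level, and logging.FATAL is an alias for logging.CRITICAL.
LEVELS = [logging.DEBUG, logging.INFO, logging.WARNING, logging.ERROR, logging.CRITICAL]


def enabled_levels(threshold):
    """Return the names of the levels a logger set to `threshold` would emit."""
    logger = logging.getLogger("hierarchy_demo")  # illustrative logger name
    logger.setLevel(threshold)
    return [logging.getLevelName(lvl) for lvl in LEVELS if logger.isEnabledFor(lvl)]


at_info = enabled_levels(logging.INFO)    # threshold INFO: DEBUG is dropped
at_error = enabled_levels(logging.ERROR)  # threshold ERROR: only ERROR and above
print(at_info)
print(at_error)
```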

Default Log Level Selection in Production Environment

Choosing the log level in a production environment is a balancing act. On one hand, we need to monitor the system's health; on the other, we must avoid unnecessary resource consumption. The common practice is to use INFO as the default level. This provides enough information to track the application's normal operation while avoiding the performance and storage issues that come with excessive logging. However, INFO is not always the right answer.

When I worked on a production ERP system, the INFO level was usually sufficient for the shipping and invoicing modules. However, in the part where operator screens handled real-time data updates, switching to DEBUG was critical for finding the root cause of delays when they occurred. In such situations, temporarily raising the verbosity of only the relevant module is more sensible than enabling DEBUG for the entire system: it lets us diagnose the problem with minimal impact on overall system performance.
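With Python's `logging` module, per-module verbosity comes from the logger name hierarchy: each logger inherits its parent's level unless it has one of its own. The following is a minimal sketch of that idea; the logger names `erp.shipping` and `erp.operator_screens` are placeholders, not names from a real system.

```python
import logging

root = logging.getLogger()
root.setLevel(logging.INFO)  # production default for everything

# Hypothetical module loggers; the names are placeholders for illustration.
shipping = logging.getLogger("erp.shipping")
operator_screens = logging.getLogger("erp.operator_screens")

# Raise verbosity only where the delays are being investigated.
operator_screens.setLevel(logging.DEBUG)

shipping_debug = shipping.isEnabledFor(logging.DEBUG)
operator_debug = operator_screens.isEnabledFor(logging.DEBUG)
print(shipping_debug, operator_debug)

# When the investigation ends, NOTSET makes the logger inherit from its parent again.
operator_screens.setLevel(logging.NOTSET)
```

Resetting to `NOTSET` rather than hard-coding INFO means the module follows the system default again, whatever that default is changed to later.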

⚠️ Points to Consider

Frequently changing the log level in a production environment, especially in large and complex systems, can lead to unforeseen side effects. Such changes should be planned and carried out in a controlled manner.

Dynamic Log Level Adjustment and Use Cases

We don't always have to use a fixed log level in a production environment. We can adjust log levels according to the situation, considering the dynamic nature of our systems. This is particularly useful when experiencing issues in a specific module or when a new feature is rolled out. For example, when deploying a new AI-powered production planning module, we can log at the DEBUG level for the first few days to detect potential errors and unexpected behaviors early on.

In my own financial calculator projects, I sometimes needed to closely monitor specific API calls or calculation steps to improve the user experience. In these cases, I temporarily increased the log level of components like Nginx or FastAPI to DEBUG. For instance, to fix a calculation error reported by a user, I temporarily increased the log level for requests belonging to that user's session to pinpoint exactly where the problem started and with what parameters it occurred. This allowed me to quickly identify and resolve the source of the issue.
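One way to approximate per-session DEBUG logging in Python is a `logging.Filter` that lets DEBUG records through only for sessions marked for tracing. This is a sketch under assumptions, not the setup used in the project described above: `current_session`, `TRACED_SESSIONS`, `SessionDebugFilter`, and `ListHandler` are all hypothetical names, and a real application would set the session ID per request (for example in middleware).

```python
import contextvars
import logging

# Hypothetical names: the context variable and the set of traced sessions.
current_session = contextvars.ContextVar("current_session", default=None)
TRACED_SESSIONS = set()


class SessionDebugFilter(logging.Filter):
    """Let DEBUG records through only for sessions that are being traced."""

    def filter(self, record):
        if record.levelno >= logging.INFO:
            return True  # INFO and above always pass
        return current_session.get() in TRACED_SESSIONS


class ListHandler(logging.Handler):
    """Collects messages in memory so the sketch is easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("calc_demo")
logger.setLevel(logging.DEBUG)   # the logger itself allows DEBUG...
logger.propagate = False
handler = ListHandler()
handler.addFilter(SessionDebugFilter())  # ...but the filter decides per session
logger.addHandler(handler)

TRACED_SESSIONS.add("session-42")

for session_id in ("session-1", "session-42"):
    current_session.set(session_id)
    logger.debug("intermediate values for %s", session_id)
    logger.info("request finished for %s", session_id)

print(handler.messages)
```

The trade-off is that the logger still creates a record for every DEBUG call; the filter only reduces what gets written, so this limits log volume more than CPU cost.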

# Example: Dynamic log level adjustment in FastAPI (simple demonstration)
import logging

from fastapi import FastAPI

app = FastAPI()

# Default logger setup: INFO and above are emitted, DEBUG is dropped
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

@app.get("/calculate")
async def calculate(value: int):
    # %-style arguments defer string formatting until the record is actually emitted
    logger.info("Calculation requested with value: %s", value)

    try:
        # Actual calculation logic goes here
        if value < 0:
            logger.warning("Negative value provided: %s. Proceeding with absolute value.", value)
            value = abs(value)

        result = value * 2
        logger.debug("Intermediate result for %s: %s", value, result)  # Only visible at DEBUG level

        if result > 1000:
            logger.error("Calculation result exceeded threshold: %s", result)
            return {"error": "Result too large"}

        logger.info("Calculation successful. Result: %s", result)
        return {"result": result}

    except Exception:
        # logger.exception logs at ERROR level and appends the traceback
        logger.exception("An error occurred during calculation for value %s", value)
        return {"error": "An internal error occurred"}

# A mechanism to change the log level externally should be added (e.g., via an API endpoint or configuration file)
# For example:
# logger.setLevel(logging.DEBUG)  # This line should be executed dynamically
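The external mechanism mentioned in the closing comment could be built on a small helper like the one below. It is a sketch using only the standard library; `set_level_by_name`, `temporary_level`, and the logger name `calculator` are hypothetical. In a web application, a function like this might be called from a protected admin endpoint or a configuration-reload handler, which is not shown here.

```python
import logging
from contextlib import contextmanager

_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
    "CRITICAL": logging.CRITICAL,
}


def set_level_by_name(logger_name, level_name):
    """Set a logger's level from a string such as "debug"; return the previous level."""
    level = _LEVELS.get(level_name.upper())
    if level is None:
        raise ValueError(f"unknown log level: {level_name!r}")
    logger = logging.getLogger(logger_name)
    previous = logger.level
    logger.setLevel(level)
    return previous


@contextmanager
def temporary_level(logger_name, level_name):
    """Change a logger's level for the duration of a block, then restore it."""
    previous = set_level_by_name(logger_name, level_name)
    try:
        yield
    finally:
        logging.getLogger(logger_name).setLevel(previous)


calc_logger = logging.getLogger("calculator")  # hypothetical logger name
calc_logger.setLevel(logging.INFO)

with temporary_level("calculator", "debug"):
    inside = calc_logger.isEnabledFor(logging.DEBUG)
after = calc_logger.isEnabledFor(logging.DEBUG)
print(inside, after)
```

Validating the level name against a fixed mapping keeps a typo in a request or config file from silently setting an unexpected level, and restoring the previous level in `finally` avoids leaving DEBUG enabled after an exception.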

Logging Performance and Resource Consumption

Logging is a process that consumes system resources (CPU, memory, disk I/O). Especially in high-traffic systems or when performing very detailed logging, this consumption can reach significant levels. Optimizing logging performance in a production environment is vital for the overall stability and response time of the system. Logging excessively can heavily utilize the CPU, rapidly fill up disk space, and even cause the application to slow down.

In the backend service of an e-commerce platform, log volume grew rapidly during a busy campaign period. Specifically, DEBUG logging, which recorded every shopping cart transaction in detail, strained the system's I/O so much that even database queries became slow. When we realized this, we kept INFO only for the critical modules and raised the general threshold to WARN for everything else. Even with this simple adjustment, disk I/O decreased by 40%, and the system's response time returned to normal. This experience once again demonstrated how sensitive the log level setting is in a production environment.
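Part of the cost of verbose logging is paid even when the level is disabled, because the arguments of a log call are evaluated before the logger decides whether to drop the record. The sketch below illustrates this with a hypothetical `describe_cart` helper standing in for an expensive serialization step; the logger name `cart_demo` is also made up.

```python
import logging

logger = logging.getLogger("cart_demo")  # illustrative name
logger.setLevel(logging.INFO)            # DEBUG is disabled

calls = {"count": 0}


def describe_cart(cart):
    """Stand-in for an expensive serialization step; counts how often it runs."""
    calls["count"] += 1
    return ", ".join(f"{sku}x{qty}" for sku, qty in sorted(cart.items()))


cart = {"sku-1": 2, "sku-7": 1}

# Eager: the f-string and describe_cart() run even though DEBUG is disabled.
logger.debug(f"cart contents: {describe_cart(cart)}")
eager_calls = calls["count"]

# Guarded: the expensive call is skipped entirely when DEBUG is off.
# "%s" arguments only defer string formatting; they are still evaluated,
# which is why the explicit isEnabledFor() check is needed here.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("cart contents: %s", describe_cart(cart))
guarded_calls = calls["count"] - eager_calls

print(eager_calls, guarded_calls)
```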

💡 Logging and Storage Management

The growth of log files over time is inevitable. Properly configuring log rotation mechanisms and regularly archiving or deleting old logs prevents disk space from being exhausted. Additionally, sending logs to a centralized logging system (e.g., ELK Stack, Splunk) simplifies management and provides real-time analysis capabilities.
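For applications that write their own log files, Python's standard library ships a size-based rotation handler. The following is a minimal sketch; the temporary directory, the file name `app.log`, and the deliberately tiny `maxBytes` value are assumptions chosen so the rotation is visible in a short run. Real deployments would use far larger limits, or leave rotation to a tool such as logrotate.

```python
import logging
import logging.handlers
import os
import shutil
import tempfile

log_dir = tempfile.mkdtemp()  # stand-in for a real log directory
log_path = os.path.join(log_dir, "app.log")

handler = logging.handlers.RotatingFileHandler(
    log_path,
    maxBytes=1024,   # roll over before the file would reach 1 KiB (tiny on purpose)
    backupCount=3,   # keep app.log.1 .. app.log.3, delete anything older
)
logger = logging.getLogger("rotation_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

for i in range(500):
    logger.info("order %d processed", i)

logger.removeHandler(handler)
handler.close()

files = sorted(os.listdir(log_dir))
sizes = [os.path.getsize(os.path.join(log_dir, name)) for name in files]
print(files, sizes)

shutil.rmtree(log_dir)  # clean up the demo directory
```

With `backupCount=3`, at most four files exist at any time, so total disk usage is bounded at roughly `maxBytes * (backupCount + 1)`.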

Security-Focused Logging Strategies

Logging plays a critical role not only for performance and debugging but also for security. Sufficient and accurate logs must be kept to detect security events, monitor suspicious activities, and analyze intrusion attempts. However, security logs can also impact performance, so a balance must be struck.

While working on an internal platform for a bank, we enabled more detailed logging of certain API accesses and authorization operations at the request of the security team. This meant more detailed logs including the source, destination, method used, and response of each request. This allowed us to detect a potential unauthorized access attempt or data leak much faster. However, considering the disk space requirement of this additional logging, we also began to archive older and less critical security logs more aggressively.
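Detailed access logs are easier to search and archive when each event is a structured record. Below is a rough sketch of a JSON audit formatter built on the standard `logging` module. The field names (`source_ip`, `destination`, `method`, `status`), the logger name, and the sample values are assumptions for illustration, not the schema used on the platform described above.

```python
import io
import json
import logging


class JsonAuditFormatter(logging.Formatter):
    """Render each audit record as one JSON object per line."""

    FIELDS = ("source_ip", "destination", "method", "status")  # assumed field names

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        for field in self.FIELDS:
            # Values passed via `extra=` become attributes on the record.
            entry[field] = getattr(record, field, None)
        return json.dumps(entry)


stream = io.StringIO()  # stand-in for a file or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonAuditFormatter())

audit = logging.getLogger("audit_demo")
audit.setLevel(logging.INFO)
audit.propagate = False
audit.addHandler(handler)

audit.info(
    "api_access",
    extra={
        "source_ip": "203.0.113.10",        # documentation-range address
        "destination": "/api/v1/accounts",  # hypothetical endpoint
        "method": "GET",
        "status": 403,
    },
)

line = stream.getvalue().strip()
print(line)
```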

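# Example: Monitoring fail2ban logs and detecting security events
# Banned IP addresses and the jails that blocked them are recorded in /var/log/fail2ban.log.
grep "Ban" /var/log/fail2ban.log

# Searching for failed login attempts in system logs (e.g., journald)
# sshd reports failed passwords at informational priority, so a filter such as
# "-p err" would hide them; no priority filter is applied here.
# On some distributions (for example Debian and Ubuntu) the unit is named ssh rather than sshd.
journalctl -u sshd | grep "Failed password"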

Which Log Level is Suitable for Security?

While ERROR and WARN levels are generally sufficient for detecting security events, in some cases, INFO or even DEBUG level logs can contain critical information. For example, logs recording failed login attempts are important for detecting a brute-force attack. Keeping this information at the ERROR level is logical. However, if detailed monitoring of a user's authorization steps is required, this can be done at the INFO or DEBUG level. The important thing is to work with the security team to determine which events should be recorded at which level of detail.
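To illustrate why failed login records matter, the sketch below counts failed SSH password attempts per source address. The sample lines follow the message format OpenSSH's sshd writes, but the hosts, users, and addresses are invented, and the threshold of three attempts is an arbitrary assumption rather than a recommendation.

```python
import re
from collections import Counter

# Sample lines in sshd's message format; all values are made up for illustration.
SAMPLE = [
    "May 28 09:00:01 host sshd[1201]: Failed password for invalid user admin from 203.0.113.5 port 50122 ssh2",
    "May 28 09:00:03 host sshd[1201]: Failed password for root from 203.0.113.5 port 50124 ssh2",
    "May 28 09:00:05 host sshd[1201]: Failed password for root from 203.0.113.5 port 50126 ssh2",
    "May 28 09:01:10 host sshd[1302]: Failed password for alice from 198.51.100.7 port 40022 ssh2",
    "May 28 09:01:20 host sshd[1302]: Accepted password for alice from 198.51.100.7 port 40023 ssh2",
]

FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+")


def failed_logins_by_ip(lines):
    """Count failed password attempts per source address."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


THRESHOLD = 3  # assumed cut-off for this example
counts = failed_logins_by_ip(SAMPLE)
suspicious = [ip for ip, n in counts.items() if n >= THRESHOLD]
print(suspicious)
```

In practice a tool such as fail2ban does this kind of counting within a time window; the point here is only that the analysis depends on those login records being logged at a level that is actually kept.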

🔥 Risks of Excessive Security Logging

Overly detailed security logging can rapidly consume storage space and increase the risk of losing real security events among unimportant logs. Therefore, the logging strategy should be defined together with the security team so that it is both comprehensive and manageable.

Continuous Improvement of Log Level Strategy

A logging strategy is not a static structure; it must adapt to changing needs and system evolution over time. As new features are added, performance issues arise, or security requirements change, it is necessary to review and update the logging strategy. This is a continuous improvement cycle.

In the Android version of one of my mobile applications, I adjusted the log levels to better understand certain errors based on user feedback. Initially, logging was at the INFO level, but following a user complaint of "the app crashes suddenly," it was temporarily increased to DEBUG for the relevant module. This allowed me to identify the exact error and the triggering parameters that caused the application to crash. This experience showed how valuable user feedback is in shaping the logging strategy.

Log Analysis and Feedback Loop

Regularly analyzing logs not only helps in finding errors but also provides information about the effectiveness of the logging strategy. Answers to questions like which logs are used the most, which logs are unnecessary, or which information is missing can be used to improve the strategy. This feedback loop makes the logging system more efficient and useful.
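A simple way to start answering "which logs are produced the most" is to count records per logger and level. The sketch below does this with a custom handler; `CountingHandler` and the logger names `shop`, `shop.cart`, and `shop.payment` are hypothetical, and the same counting could just as well be done offline on stored log files.

```python
import logging
from collections import Counter


class CountingHandler(logging.Handler):
    """Counts records per (logger name, level) instead of writing them anywhere."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def emit(self, record):
        self.counts[(record.name, record.levelname)] += 1


counter = CountingHandler()
app = logging.getLogger("shop")  # hypothetical application root logger
app.setLevel(logging.DEBUG)
app.propagate = False
app.addHandler(counter)

# Child loggers inherit the level and propagate their records to "shop".
cart = logging.getLogger("shop.cart")
payment = logging.getLogger("shop.payment")

for _ in range(50):
    cart.debug("cart recalculated")
payment.info("payment accepted")
payment.error("payment gateway timeout")

noisiest = counter.counts.most_common(1)[0]
print(noisiest)
```

A logger and level pair that dominates the counts but is rarely consulted during incidents is a natural candidate for a higher threshold.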

In a previous system administration project, our analysis to optimize journald's log level and retention period led us to clean up verbose logs that were unnecessarily occupying disk space and extend the retention period for critical error logs. Such regular analyses allow us to use system resources more efficiently and access the information we need faster.

# Examples of analyzing logs with journalctl

# Displaying logs for a specific time
journalctl --since "2026-05-28 09:00:00" --until "2026-05-28 10:00:00"

# Filtering logs for a specific service
journalctl -u my-application.service

# Searching logs by a specific keyword
journalctl -g "database connection error"

Logging is an integral part of modern systems. Defining the right log level strategy in a production environment is critical for maintaining system health, resolving issues quickly, and ensuring security. While adopting the INFO level as a default is a good starting point when creating this strategy, the most effective approach is to make dynamic adjustments considering your system's specific needs and continuously improve the strategy through regular analysis.
