Mustafa ERBAY

Posted on May 28 • Originally published at mustafaerbay.com.tr

Log Level Strategies: Detailed Monitoring or Minimum Noise?

#career #logging #systemadministration #observability

Log Levels: Why Do They Matter?

Analyzing the logs our systems produce is one of the most basic ways to diagnose issues and tune performance. However, logging everything turns troubleshooting into a search for a needle in a haystack, while logging too little can make it impossible to find the source of a problem. This is precisely where a log level strategy comes into play.

In this post, I will cover what the different log levels mean, when to use each one, and how I apply this strategy in my own projects. My goal is to reduce unnecessary noise while still capturing all the information we need in critical situations.

Common Log Levels and Their Meanings

Logging is the process of recording the operational status of systems and applications. Different levels indicate the degree of importance and the level of detail of the recorded information. Understanding these levels correctly is the first step in creating an effective logging strategy.

Generally accepted log levels include:

TRACE (Most Detailed): Typically used during the development phase or for debugging a very specific error. It can record every function call, variable value, and flow control. Its use in production environments is generally not recommended as it generates an excessive amount of data.
DEBUG (Debugging): Helps developers understand the normal flow of applications. The information recorded at this level shows which paths the application takes and which conditions are met.
INFO (Information): Records important events that indicate the normal operational status of the application, such as a user logging in or a transaction completing successfully. This is often the default level in production environments.
WARN (Warning): Indicates situations that point to potential problems but do not directly stop the application from running. For example, a resource being temporarily unavailable or a configuration running with default values.
ERROR (Error): Indicates that an operation or request has failed. Logs at this level are critical for identifying situations where the application is not functioning correctly.
FATAL (Critical Error): Indicates serious errors that cause the application or system to crash and that require immediate intervention. Once a log at this level is written, the application usually stops running.
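To make the threshold behavior concrete, here is a minimal sketch using Python's standard logging module. Two things are assumptions of this sketch rather than part of the list above: Python has no built-in TRACE level, so one is registered at the numeric value 5, and Python calls the FATAL level CRITICAL (logging.FATAL is an alias for it). The logger name is hypothetical.

```python
import io
import logging

# Python's standard library has no TRACE level; registering one at 5
# (below DEBUG = 10) is an assumption made for this sketch.
TRACE = 5
logging.addLevelName(TRACE, "TRACE")


def emit_all_levels(threshold):
    """Log one message per level and return the level names that passed the threshold."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))

    logger = logging.getLogger("levels-demo")  # hypothetical logger name
    logger.handlers = [handler]
    logger.propagate = False
    logger.setLevel(threshold)

    logger.log(TRACE, "entering function, x=42")
    logger.debug("took the cached branch")
    logger.info("user logged in")
    logger.warning("config missing, using defaults")
    logger.error("request failed")
    logger.critical("cannot continue")  # Python names FATAL "CRITICAL"

    return stream.getvalue().split()
```

Calling emit_all_levels(logging.INFO) returns only INFO, WARNING, ERROR, and CRITICAL: everything below the threshold is dropped before it reaches the handler.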

ℹ️ The Importance of Log Levels

Choosing the right log levels makes it easier to monitor the health of our systems. Both too much detail and too little information can complicate the troubleshooting process. Therefore, knowing what each level means and determining the appropriate level based on project needs is critically important.

Log Levels and System Performance

Changing the log level can directly impact system performance. For instance, extensive logging at the TRACE or DEBUG level can reduce I/O performance as it requires constant writing to disk. Additionally, the large log files generated can quickly consume disk space, which can indirectly lead to performance issues.
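Disk I/O is not the only cost: building a detailed message can itself be expensive. The sketch below, again assuming Python's standard logging module and hypothetical function names, shows two common ways to avoid paying for DEBUG output when DEBUG is disabled: passing arguments lazily and guarding expensive work with isEnabledFor.

```python
import logging

logger = logging.getLogger("perf-demo")  # hypothetical logger name
logger.addHandler(logging.NullHandler())
logger.setLevel(logging.INFO)

calls = {"expensive": 0}


def expensive_summary(order_ids):
    """Stand-in for a costly computation that is only useful in DEBUG output."""
    calls["expensive"] += 1
    return ",".join(str(i) for i in sorted(order_ids))


def process(order_ids):
    # Lazy %-style arguments: the string is only formatted if the record is emitted.
    logger.debug("processing %d orders", len(order_ids))

    # Guard work that would otherwise run even when DEBUG is disabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("order ids: %s", expensive_summary(order_ids))

    return len(order_ids)
```

With the logger at INFO, process never calls expensive_summary; it only runs once the level is lowered to DEBUG.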

⚠️ Considerations for Production Environments

TRACE or DEBUG levels should generally not be used in production environments. These levels can place an excessive load on the system and pose security risks, since detailed logs may contain sensitive data. INFO or WARN levels are sufficient for tracking normal operations, while ERROR and FATAL levels are used to identify critical issues.

Strategy Development: When to Use What

An effective logging strategy is shaped by the project's lifecycle, existing infrastructure, and monitoring needs. There isn't a single "correct" strategy; the best one is the one that best suits the project's requirements.

Development Environment

In the development phase, quickly finding and fixing bugs is essential. Therefore, more detailed log levels are generally preferred.

TRACE / DEBUG: These levels are invaluable for understanding code flow, checking variable values, and identifying unexpected behavior. For example, DEBUG level logs can show exactly how an API request was sent and what the response from the server was, helping to understand why it failed.
INFO: Used for monitoring the normal flow.

The cost of excessive logging in this environment is low, as development is usually done on local machines or test servers. However, it should still be remembered that excessive TRACE logging can negatively affect performance.

Test Environment (Staging)

The test environment should mimic the production environment but still offer some flexibility for debugging.

DEBUG / INFO: These levels are suitable for observing how the application behaves in different scenarios. For example, during load tests or integration tests, the DEBUG level can help identify potential bottlenecks or faulty integration points.
WARN / ERROR: These levels are used to detect situations where test scenarios fail.

Production Environment

In the production environment, the priorities are system stability, performance, and security. Therefore, logging levels should be more conservative.

INFO: Used to monitor normal operational states. Information such as a user successfully logging in or a request being processed successfully is recorded at this level. This provides sufficient information to understand overall system health.
WARN: Important for early detection of potential issues. For example, situations where a database connection times out but is then re-established can be logged as warnings.
ERROR: Used to record actual errors and their impact. Situations like a request failing to process or a module crashing are logged at this level.
FATAL: Reserved for critical errors that cause the application or system to stop completely.

Logs kept at these levels have a minimal impact on performance and provide the necessary basic information for troubleshooting. When necessary, it is possible to temporarily switch to more detailed log levels for troubleshooting purposes.
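One way to express these per-environment defaults is to resolve the level from configuration at startup. This is a minimal sketch, not a prescribed setup: the variable names APP_ENV and LOG_LEVEL and the default mapping are assumptions chosen for illustration.

```python
import logging
import os

# Hypothetical defaults per environment; adjust to the project's needs.
DEFAULT_LEVELS = {
    "development": logging.DEBUG,
    "staging": logging.DEBUG,
    "production": logging.INFO,
}


def resolve_level(env=None):
    """Pick a log level from APP_ENV, with an optional LOG_LEVEL override."""
    env = env if env is not None else os.environ
    app_env = env.get("APP_ENV", "production").lower()
    level = DEFAULT_LEVELS.get(app_env, logging.INFO)

    override = env.get("LOG_LEVEL", "").upper()
    if override:
        # getLevelName maps a known name to its number; unknown names come back as a string.
        candidate = logging.getLevelName(override)
        if isinstance(candidate, int):
            level = candidate
    return level
```

At startup this could be applied with logging.basicConfig(level=resolve_level()). Defaulting to the production level when APP_ENV is missing is a deliberate choice here, so that a misconfigured deployment fails toward less noise.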

💡 Dynamic Log Level Adjustment

Many modern logging frameworks offer the ability to dynamically adjust log levels at runtime. This allows us to investigate the source of a problem in more detail by temporarily switching to the DEBUG level when an issue arises in a production environment, without needing to restart the system. However, caution should be exercised when using this feature, and it's important not to forget to switch the log level back to a safer level like INFO or WARN after the issue is resolved.
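Forgetting to switch back is the easy mistake, so it helps to make the revert automatic. A minimal sketch with Python's standard logging module, where temporary_level is a hypothetical helper name:

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_level(logger, level):
    """Switch a logger to `level` and restore the previous level on exit."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Runs even if the investigated code raises, so DEBUG is never left on.
        logger.setLevel(previous)
```

Usage would look like `with temporary_level(logging.getLogger("shipping"), logging.DEBUG): ...`. In a long-running service the trigger is more likely an admin endpoint or a configuration reload with a timeout, but the principle of restoring the previous level is the same.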

Real-World Scenarios and My Applications

I've had various experiences with log levels in my own projects and in firms I've consulted for. These experiences have contributed to the evolution of my strategy.

Scenario 1: Shipping Errors in a Production ERP

While working on a production ERP system, we noticed that shipping reports were occasionally incomplete. Initially, logging was done at the INFO level. To diagnose the problem, we temporarily switched the log level to DEBUG. This allowed us to see in detail at which steps the shipping process was getting stuck and which data was missing.

For example, a log line like the following indicated that the problem stemmed from a database query:

DEBUG 2026-05-28 10:35:12.123 [shipping-processor] com.example.erp.ShippingService: Executing query: SELECT * FROM orders WHERE status = 'ready_for_shipment' AND ship_date < NOW() - INTERVAL '1 day';

We noticed that this query returned empty results under certain conditions. As a result, we optimized the query and added more error checks for missing data to resolve the issue. We reverted to the INFO level in production but added specific warning (WARN) logs for such critical modules.
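The kind of targeted WARN logging described here might look like the following sketch. It is illustrative only, not the ERP's actual code: the function name, the dictionary shape of an order, and the message wording are all assumptions.

```python
import logging

logger = logging.getLogger("shipping")  # hypothetical logger name


def select_ready_orders(orders):
    """Return orders ready for shipment; warn instead of silently returning nothing."""
    ready = [o for o in orders if o.get("status") == "ready_for_shipment"]
    if not ready:
        logger.warning("no orders ready for shipment among %d candidates", len(orders))

    # Flag incomplete data that would otherwise surface later as a partial report.
    missing = [o["id"] for o in ready if not o.get("ship_date")]
    if missing:
        logger.warning("orders missing ship_date: %s", missing)
    return ready
```

Because these are WARN messages, they stay visible at the INFO level used in production without needing DEBUG to be switched on again.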

Scenario 2: Performance Issues in My Own Side Project

I observed slowdowns in the backend of my self-developed financial calculator application during heavy usage. Initially, the INFO level seemed sufficient, but I switched to DEBUG to find the performance bottleneck.

Upon reviewing the logs, I found that a specific calculation function was being repeated much more often than expected. This indicated an error in the caching mechanism. The following DEBUG output made the source of the problem clearer:

DEBUG 2026-05-28 11:05:45.789 [calculator-worker] com.example.finance.Calculator: Cache miss for key 'user_123:portfolio_xyz', recalculating.
DEBUG 2026-05-28 11:05:45.795 [calculator-worker] com.example.finance.Calculator: Cache miss for key 'user_123:portfolio_xyz', recalculating.
DEBUG 2026-05-28 11:05:45.801 [calculator-worker] com.example.finance.Calculator: Cache miss for key 'user_123:portfolio_xyz', recalculating.

These repeated "Cache miss" logs indicated that the cache was not being updated correctly or that the key was being generated incorrectly. After fixing the cache key and improving the update logic, the performance issue was resolved. I continue to use INFO and WARN levels for certain critical calculations in production.

🔥 Risks of Excessive Logging

Logging at DEBUG or TRACE levels for extended periods in a production environment can lead to rapid disk filling, decreased I/O performance, and potentially system instability. Furthermore, logs at these levels may contain sensitive information, posing security risks. Therefore, these levels should only be used for troubleshooting, for short durations, and with caution.
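One way to reduce the sensitive-data risk is to mask known fields before a record is written. The sketch below uses a logging.Filter from Python's standard library; the list of field names in the pattern is an assumption and would need to match the application's own data.

```python
import logging
import re

# Hypothetical patterns; a real deployment would list its own sensitive fields.
SENSITIVE = re.compile(r"(password|token|api_key)=\S+", re.IGNORECASE)


class RedactingFilter(logging.Filter):
    """Mask values of sensitive key=value pairs before a record is emitted."""

    def filter(self, record):
        message = record.getMessage()
        record.msg = SENSITIVE.sub(lambda m: m.group(1) + "=***", message)
        record.args = None  # message is already fully formatted
        return True
```

Attaching the filter to a handler with handler.addFilter(RedactingFilter()) applies it to everything that handler writes. Pattern-based masking only catches what it knows about, so it complements short DEBUG windows rather than replacing them.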

Scenario 3: Mobile App Update Processes

While publishing an update to the Google Play Store for my Android spam blocker app, I encountered a metadata rejection. To understand the reason for this, I had to examine the app's local logs. Since the Play Store itself doesn't provide detailed logs, the app's internal logging mechanism became critical.

Within the app, INFO level logs recorded which features users were using and how basic functions were operating. However, to find the reason for the metadata rejection, I added a special DEBUG logging mechanism that recorded the steps related to the app's publishing process.

This allowed me to obtain a detailed log entry showing why the Play Store rejected a specific metadata field:

DEBUG 2026-05-28 14:20:01.555 [play-store-upload] com.example.android.spamblocker.PlayStoreUploader: Uploading metadata. Field 'release_notes_ch' rejected with reason: 'Invalid character detected. Please use plain text only.'.

This log indicated that a special character in the "release_notes_ch" field was causing the problem. After removing this character and resubmitting the update, the issue was resolved. This experience once again showed how valuable more detailed logging is for development and debugging in such situations.

Advanced Logging Techniques and Tools

Beyond basic log levels, more advanced logging techniques and tools allow us to monitor our systems more deeply.

Structured Logging

Using structured logging (e.g., in JSON format) instead of traditional plain text logs makes log analysis and querying much easier. Each log entry is structured as key-value pairs, making filtering and searching by specific fields (e.g., user_id, request_id, error_code) much more efficient.

An example of structured log output:

{
  "timestamp": "2026-05-28T15:00:00Z",
  "level": "ERROR",
  "service": "auth-service",
  "request_id": "abc123xyz789",
  "user_id": "user_456",
  "message": "Authentication failed: Invalid credentials",
  "details": {
    "ip_address": "192.168.1.100",
    "attempt_count": 3
  }
}

This format can be easily processed and visualized by log analysis tools (e.g., Elasticsearch, Splunk, Loki).
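Output of this shape can be produced with a small custom formatter. This is a minimal sketch using only Python's standard library; the class name JsonFormatter and the convention of passing extra fields under a "context" key are assumptions of this example, and dedicated structured logging libraries offer more complete solutions.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via extra={"context": {...}}; "context" is a name chosen for this sketch.
        entry.update(getattr(record, "context", {}))
        return json.dumps(entry)
```

A call such as `logger.error("Authentication failed: %s", reason, extra={"context": {"request_id": rid, "user_id": uid}})` then yields a single JSON line with request_id and user_id as top-level fields that a log pipeline can filter on.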

Centralized Logging Systems

In distributed systems, aggregating logs from different services in one place is essential for gaining a holistic view. Centralized logging systems such as the ELK Stack (Elasticsearch, Logstash, Kibana) or Grafana Loki consolidate all log data into a single platform, offering search, analysis, and visualization capabilities.

These systems not only collect logs but also offer advanced features such as filtering logs, recognizing patterns, and detecting anomalies. For example, you can easily see how often a specific error code occurs within a given time frame.

Correlation IDs

In distributed systems, it's common for a single request to pass through multiple services. Correlation IDs are used to trace these requests end-to-end. When a request reaches the first service, a unique ID is generated, and this ID is passed to all subsequent services the request travels through.

Each service records this correlation ID in its logs, making it possible to track the lifecycle of a single request throughout the entire system. This is vital for understanding which service a request is getting stuck in or delayed by.
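The mechanism can be sketched in a few lines with Python's contextvars and logging modules. The header name X-Correlation-ID and the helper names are assumptions for illustration; services only need to agree on one name and forward it consistently.

```python
import contextvars
import logging
import uuid

# Holds the ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def start_request(incoming_id=None):
    """Reuse the caller's ID if one was forwarded, otherwise generate a new one."""
    value = incoming_id or uuid.uuid4().hex
    correlation_id.set(value)
    return value


def outgoing_headers():
    """Headers to forward to the next service; the header name is an assumption."""
    return {"X-Correlation-ID": correlation_id.get()}
```

With the filter attached to a handler and a format string containing %(correlation_id)s, every log line written while handling a request carries the same ID, and outgoing_headers passes it on to the next service.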

ℹ️ The Importance of Correlation IDs

Correlation IDs significantly simplify debugging and performance monitoring in distributed systems. Tracing how a request is processed across multiple services is a fundamental technique for diagnosing issues in complex systems.

Conclusion: A Balanced Approach

Log levels are a powerful tool we use to monitor the health of our systems and resolve potential issues. However, to use this tool effectively, adopting a balanced approach is essential. While it's crucial to access detailed information in the development environment, it's equally important to protect the system from unnecessary load and information clutter in the production environment.

Advanced techniques like structured logging, centralized logging systems, and correlation IDs help us derive maximum benefit from log data. We must remember that the best logging strategy is the one that is most suitable, sustainable, and scalable for the project's needs. By establishing this balance, we can make our systems more reliable, performant, and manageable.
