Log Level Strategies: Balancing Observability and Cost
A recurring issue with delayed shipment reports in a production ERP system once took three days to resolve. Situations like this show how much depends on how well we can see what is happening inside our systems. Logs are the cornerstone of that understanding. However, logging everything is both impractical and costly, which is why choosing the right log level strategy is critical. In this post, I'll share how I maximize observability in our systems while keeping costs under control, based on my own experience.
Determining the correct log level is not just a technical detail but a strategic decision. Overly detailed logs can rapidly increase storage and analysis costs, while insufficient logs can make debugging impossible when issues arise. Striking this balance directly impacts our systems' health and operational efficiency. Let's examine how I achieve this balance step by step.
Understanding Different Log Levels and Their Use Cases
Log levels are used to indicate the severity and importance of an event. There are typically standardized levels that provide a guide on what information should be recorded and when. Understanding these levels correctly forms the foundation of our logging strategy.
The most commonly used log levels are:
- DEBUG: This is the most detailed level, generally used during development and debugging processes. It includes things like variable values, function calls, and loop iterations. It's not recommended to keep this level constantly enabled in production environments.
- INFO: These messages indicate the normal operation of the application. They record events such as a user logging in or a process completing successfully. This level is crucial for observability.
- WARN: This level indicates potential issues or unexpected situations that do not directly prevent the application from running. An example is a database connection that drops briefly and then recovers. Logs at this level are critical for proactive intervention.
- ERROR: This level records errors that prevent the application from performing a function. Situations like an unprocessable request or a missing file fall under this category. It's one of the first places to look for troubleshooting.
- FATAL: This signifies the most severe errors that can cause the application to crash completely. These are typically situations that lead to application termination.
Each of these levels serves a specific purpose. For instance, DEBUG level logs are invaluable during the development phase when testing a new feature. However, continuously collecting the same logs in a production environment consumes unnecessary disk space and slows down log analysis. Therefore, defining appropriate log levels for each environment (development, testing, production) is vital.
ℹ️ Log Levels and Environments
As a general rule, using the DEBUG level in development environments, while preferring INFO or WARN levels in test environments, is logical. In production environments, INFO, WARN, and ERROR levels are typically used together. The FATAL level is critical only for situations where the application unexpectedly halts.
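One way to apply this rule is to pick the minimum level from the environment at startup. The following is a minimal sketch using Python's standard `logging` module; the `APP_ENV` variable name, the `LEVEL_BY_ENV` mapping, and the `payment-service` logger name are assumptions made for illustration. Note that Python spells WARN as `WARNING` and FATAL as `CRITICAL`.

```python
import logging
import os
from typing import Optional

# Hypothetical mapping from environment name to the minimum level that is logged.
# Setting INFO as the minimum means INFO, WARNING, ERROR, and CRITICAL all pass.
LEVEL_BY_ENV = {
    "development": logging.DEBUG,
    "test": logging.INFO,
    "production": logging.INFO,
}


def level_for_env(env: str) -> int:
    """Return the minimum level for an environment, defaulting to INFO."""
    return LEVEL_BY_ENV.get(env.lower(), logging.INFO)


def configure_logging(env: Optional[str] = None) -> logging.Logger:
    """Configure the service logger from an explicit name or from APP_ENV."""
    # APP_ENV is an assumed variable name, not a standard one.
    env = env or os.environ.get("APP_ENV", "production")
    logger = logging.getLogger("payment-service")
    logger.setLevel(level_for_env(env))
    return logger
```

Defaulting unknown environments to INFO is a deliberate choice here: a typo in the environment name should not silently turn DEBUG on.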
Log Level Selection in Production: Observability vs. Cost
The production environment is where we need to establish the most delicate balance. On one hand, we need sufficient observability to understand what's happening in our systems, detect potential issues early, and respond quickly. On the other hand, storing endless logs can lead to astronomical costs. Therefore, we must find a sensible answer to the question, "How much logging is enough?"
Generally, the INFO level is a good starting point for production environments. This level allows us to follow the normal flow of the application, logging key events like user registrations, order creations, or report generations. However, this level might not be sufficient when issues arise. For example, understanding why a user cannot open a specific page might be impossible with only INFO logs.
This is where WARN and ERROR levels come into play. WARN logs indicate that something might be going wrong but hasn't yet escalated into a major problem. Perhaps an external service is responding slowly, or a configuration file contained an unexpected value. Heeding these warnings allows us to take preventive measures before issues become more significant. ERROR logs directly indicate the presence of a problem and are the most important source for troubleshooting.
But what about DEBUG? Keeping the DEBUG level constantly enabled in production is usually a major mistake. A couple of weeks ago, I accidentally left the DEBUG level enabled for a service in my own system. Within an hour, 80% of my disk was full, and the service began to slow down. Switching back to the INFO level after debugging instantly resolved the issue. Such situations can lead to serious cost and performance impacts. However, when a specific debugging session is needed, temporarily switching to the DEBUG level and reverting it afterward can be a smart move.
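A small guard can make "switch to DEBUG, then revert" harder to forget. The sketch below, in Python with the standard `logging` module, restores the previous level automatically when the debugging block ends. The `temporary_level` helper is a hypothetical name, and in a long-running service you would still need some trigger (a signal, an admin endpoint, a config reload) to enter the block.

```python
import logging
from contextlib import contextmanager


@contextmanager
def temporary_level(logger: logging.Logger, level: int):
    """Set `logger` to `level` for the duration of the block, then restore it."""
    previous = logger.level
    logger.setLevel(level)
    try:
        yield logger
    finally:
        # Restore even if the block raises, so DEBUG cannot stay on by accident.
        logger.setLevel(previous)


logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)

with temporary_level(logger, logging.DEBUG):
    # DEBUG records are accepted only inside this block.
    logger.debug("Entering process_payment_request method.")
```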
Real-World Scenarios and Example Log Lines
To solidify theoretical knowledge, let's walk through some real-world scenarios. These examples will illustrate how different log levels might appear in a production environment and what kind of information they provide.
Consider a payment processing service for an e-commerce site.
INFO Level Example:
To log a successful payment by a user, we might see a log like this:
2026-05-23 10:30:15 INFO [payment-service] Payment successful for order ID: 123456789. Transaction ID: TXN987654321. Amount: 150.75 EUR. User ID: user_abc.
This log indicates the transaction occurred, providing order and transaction details, the amount, and the user ID. It's sufficient for a basic success record.
WARN Level Example:
If a delay is experienced in a request to an external API during the payment process, we might receive a warning like this:
2026-05-23 10:31:02 WARN [payment-service] External payment gateway (gateway_xyz) response time exceeded threshold (1500ms). Actual time: 1850ms. Order ID: 123456789.
This warning shows that while the payment transaction itself was successful, an external service was slow. This is an indicator that could affect future transactions.
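A warning like this usually comes from timing the external call and comparing the result with a threshold. Here is a rough Python sketch; `check_latency` and `timed_call` are hypothetical helpers, and the 1500 ms threshold is taken from the example line above.

```python
import logging
import time

logger = logging.getLogger("payment-service")

THRESHOLD_MS = 1500  # threshold from the example log line


def check_latency(elapsed_ms: float, order_id: str,
                  threshold_ms: float = THRESHOLD_MS) -> bool:
    """Log a WARN when the threshold is exceeded; return True if it was."""
    if elapsed_ms > threshold_ms:
        logger.warning(
            "External payment gateway response time exceeded threshold (%dms). "
            "Actual time: %dms. Order ID: %s.",
            threshold_ms, elapsed_ms, order_id,
        )
        return True
    return False


def timed_call(call, order_id: str):
    """Run `call`, measure it with a monotonic clock, and check the latency."""
    start = time.monotonic()
    result = call()
    elapsed_ms = (time.monotonic() - start) * 1000
    check_latency(elapsed_ms, order_id)
    return result
```

The call still returns its result when it is slow; only the log level changes, which matches the idea that WARN does not mean failure.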
ERROR Level Example:
If an error occurs during the payment process, such as a rejection by the bank, we would see a log like this:
2026-05-23 10:32:45 ERROR [payment-service] Payment failed for order ID: 123456789. Reason: Bank declined transaction. Response code: 402. User ID: user_abc.
This error record includes the reason for the failure (bank decline) and the relevant response code, which is critical for identifying the root cause.
DEBUG Level Example (Temporary Use):
If we enable DEBUG level during a debugging session, we can see much more detailed information:
2026-05-23 10:33:10 DEBUG [payment-service] Entering process_payment_request method.
2026-05-23 10:33:10 DEBUG [payment-service] Request payload: {"order_id": "123456789", "amount": 150.75, "currency": "EUR", ...}
2026-05-23 10:33:11 DEBUG [payment-service] Calling external gateway API with payload: {"transaction_data": "...", "user_token": "..."}
2026-05-23 10:33:12 DEBUG [payment-service] Received response from gateway: {"status": "declined", "code": "402", "message": "Insufficient funds"}
2026-05-23 10:33:12 DEBUG [payment-service] Mapping gateway response to internal error code.
As you can see, DEBUG logs detail every step of the process, the data used, and responses from external services. This can be very useful for definitively pinpointing the source of an issue but is not suitable for continuous use.
⚠️ Risks of DEBUG Level
Keeping the DEBUG level constantly enabled in a production environment not only increases storage costs but can also lead to performance issues. Furthermore, it carries the risk of sensitive information (e.g., user credentials or payment details) being logged in plain text. Therefore, the DEBUG level should only be used temporarily and under controlled conditions.
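One possible safeguard against sensitive values leaking into DEBUG output is a logging filter that masks them before the record is written. This is only a sketch: the `SENSITIVE_KEYS` list and the `RedactingFilter` class are assumptions, and a regular expression over JSON-like text will not catch every format.

```python
import logging
import re

# Hypothetical list of keys whose values should never reach the log output.
SENSITIVE_KEYS = ("user_token", "password", "card_number")

# Matches "key": "value" pairs for the keys above in JSON-like text.
_PATTERN = re.compile(r'("(?:%s)"\s*:\s*)"[^"]*"' % "|".join(SENSITIVE_KEYS))


def redact(text: str) -> str:
    """Replace the value of every sensitive key with a placeholder."""
    return _PATTERN.sub(r'\1"[REDACTED]"', text)


class RedactingFilter(logging.Filter):
    """Redact sensitive values in the fully formatted message."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format first, then redact, so values passed as arguments are covered too.
        record.msg = redact(record.getMessage())
        record.args = ()
        return True
```

Attaching the filter to a handler with `handler.addFilter(RedactingFilter())` applies it to every record that handler writes.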
Strategies for Managing Logging Costs
Logging costs can be a significant expense, especially in large-scale systems. Data storage, transfer, and licensing fees for log analysis tools can add up substantially over time. Several strategies can be employed to manage these costs.
First, correctly setting log levels is the most fundamental step. Start by avoiding keeping the DEBUG level constantly enabled in production. Instead, focusing on INFO, WARN, and ERROR levels will significantly reduce data volume.
Second, defining retention policies for logs is crucial. It's generally unnecessary to store all logs indefinitely. Automatically deleting old logs that are no longer needed after a certain period (e.g., 30, 90, or 180 days) or moving them to a cheaper storage solution can reduce costs. Many log management systems offer the ability to configure these retention policies.
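For logs written to local files, a retention window can be approximated with daily rotation and a cap on the number of old files kept. The sketch below uses `TimedRotatingFileHandler` from Python's standard library; the `build_retained_handler` helper and the 30-day default are assumptions, and a managed log platform would have its own retention settings instead.

```python
import logging
from logging.handlers import TimedRotatingFileHandler


def build_retained_handler(path: str, retention_days: int = 30) -> TimedRotatingFileHandler:
    """Rotate the log file at midnight and keep at most `retention_days` old files."""
    handler = TimedRotatingFileHandler(
        path,
        when="midnight",             # start a new file once per day
        backupCount=retention_days,  # older rotated files are deleted automatically
        encoding="utf-8",
        delay=True,                  # do not create the file until the first record
    )
    handler.setLevel(logging.INFO)
    return handler
```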
Third, using intelligent solutions for filtering and collecting logs is important. Sending only truly critical logs to a central system reduces data traffic and storage needs. For example, sending only ERROR and WARN level logs to a central logging service while storing INFO level logs for a more limited duration or only for specific services can be an option.
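Level-based routing of this kind can be expressed with one handler per destination, each with its own threshold. In the sketch below, `ListHandler` is a hypothetical in-memory stand-in for a real destination (a central logging service, a local file), used only so the effect is easy to see.

```python
import logging


class ListHandler(logging.Handler):
    """Stand-in for a real destination; collects formatted messages in memory."""

    def __init__(self, level: int = logging.NOTSET):
        super().__init__(level)
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(self.format(record))


central = ListHandler(logging.WARNING)  # assumed central service: WARN and above only
local = ListHandler(logging.INFO)       # assumed short-lived local store: INFO and above

logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(central)
logger.addHandler(local)

logger.info("Payment successful for order ID: 123456789.")
logger.warning("External payment gateway response time exceeded threshold.")
logger.error("Payment failed for order ID: 123456789.")
```

After these three calls, the central handler holds only the WARN and ERROR messages, while the local handler holds all three.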
Fourth, ensuring log collection agents run efficiently is vital. In some cases, the log collection agents themselves can consume significant CPU or memory resources. Optimizing the configurations of these agents and preventing them from collecting unnecessary data is also important.
Finally, evaluating alternative logging solutions can be beneficial. Managed logging services offered by cloud providers or solutions like the open-source ELK Stack (Elasticsearch, Logstash, Kibana) can offer different advantages in terms of scalability and cost. I've experimented with various combinations of these solutions in some side projects I've developed, and each has its own trade-offs.
Advanced Logging Techniques for Observability
Beyond standard log levels, there are advanced techniques we can use to delve deeper into our systems and gain more visibility. These techniques can make a significant difference, especially in complex distributed systems or performance-critical applications.
1. Structured Logging:
Recording logs in a structured format, such as JSON, rather than plain text, dramatically enhances log analysis and querying capabilities. Each log entry contains information in key-value pairs. For example:
{
  "timestamp": "2026-05-23T10:30:15Z",
  "level": "INFO",
  "service": "payment-service",
  "message": "Payment successful",
  "order_id": "123456789",
  "transaction_id": "TXN987654321",
  "amount": 150.75,
  "currency": "EUR",
  "user_id": "user_abc"
}
These structured logs are easily parsed by log management systems, making them searchable. Querying all transactions for a specific order_id or all activities of a particular user_id becomes much easier.
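An entry of this shape can be produced with a custom formatter. The following is a minimal Python sketch with the standard library only; the `JsonFormatter` class and the convention of passing extra key-value pairs under a `fields` key are assumptions for illustration, not a standard API.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Merge key-value pairs passed via extra={"fields": {...}} (assumed convention).
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("payment-service"))
logger = logging.getLogger("payment-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info(
    "Payment successful",
    extra={"fields": {"order_id": "123456789", "amount": 150.75, "currency": "EUR"}},
)
```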
2. Correlation IDs:
In distributed systems, a request often passes through multiple services. To track the entire journey of a request, a unique "correlation ID" is assigned, and this ID is added to the logs of all relevant services. This allows us to gather all logs related to a specific request in one place.
For example, a user request might first hit an API Gateway, then go to an authentication service, and finally to a backend service. If the same correlation ID is passed to each service, all logs related to that request can be found via this ID. This significantly speeds up the debugging process.
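Inside a single service, one way to attach the ID to every log line is a context variable plus a logging filter. This is a sketch under assumptions: `start_request`, `CorrelationIdFilter`, and the `auth-service` logger name are hypothetical, and how the ID travels between services (commonly an HTTP header) is left out.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means no request.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def start_request(incoming_id=None) -> str:
    """Reuse the caller's ID if one was passed, otherwise create a new one."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


handler = logging.StreamHandler()
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s [%(name)s] [cid=%(correlation_id)s] %(message)s"
))
logger = logging.getLogger("auth-service")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

start_request("req-42")  # ID received from the upstream service
logger.info("Token validated")
```

Reusing the incoming ID rather than always generating a new one is what keeps the logs of all services linked to the same request.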
3. Metrics Collection:
Logs are typically event-based (what happened?). Metrics, on the other hand, provide continuous information about the system's state (how much?). For instance, metrics like requests per second (RPS), memory usage, and CPU load of a service provide insights into the system's overall health. When used in conjunction with logs, these metrics offer more comprehensive observability. Tools like Prometheus are popular for collecting and querying metrics, and are often paired with a dashboard tool such as Grafana for visualization.
4. Tracing:
Distributed tracing allows for end-to-end tracking of a request's journey through a system. This is critical for understanding which service is causing latency or where an error originates, especially in microservice architectures. Tools like Jaeger or Zipkin assist in this regard. Where logs record individual events, a trace records how long the whole request took and how that time was divided among the calls made to sub-components.
While these advanced techniques may require a bit more setup and configuration initially, they ultimately make our systems more understandable, manageable, and resilient to errors in the long run. Implementing these techniques in my own developed products has allowed me to identify and resolve potential issues much faster.
Conclusion: Logging is a Marathon, Not a Sprint
Defining log levels is not a one-time task. As systems evolve, workloads change, and new features are added, we must continuously review our logging strategies. This is a marathon, not a sprint. The key is to find the right balance: ensuring sufficient observability while keeping costs at reasonable levels.
Based on my experience, focusing on INFO, WARN, and ERROR levels in production environments is usually the best starting point. Using DEBUG level only for temporary debugging sessions positively impacts both costs and performance. Advanced techniques like structured logging and correlation IDs elevate observability to the next level and simplify debugging in complex systems.
Let's remember that logs are silent witnesses that provide us with information about our system's health. Learning to listen to them correctly is the key to making our systems more reliable and efficient. By establishing this balance, we can keep operational costs under control and be better prepared for unexpected issues.