Log Level Strategy: Developer Comfort or Operational Burden?

#devops #learning #productivity

As developers, we all want to understand how the code we write behaves in the production environment. The most natural consequence of this desire is to sprinkle log lines into every corner and decision mechanism of the code. However, when a proper log level strategy is not established, those lines that save lives on a developer's local computer turn into monsters that fill up disks, melt CPUs, and drive operations teams crazy on live servers.

Throughout my twenty years of field experience, I have seen many systems crash due to logging. Once, on a client project, I watched the debug logs of just 4 microservices generate 180 GB in 24 hours; the disk hit 100% capacity and the entire database locked up. In this post, I will draw on my own experiences to explain why logging is not just a technical detail, but also a delicate bridge of empathy between the software team and the operations team.

Log Level Strategy: The Hidden Cost of Comfort

As developers, we mostly focus on the writing phase of the code. The thought of "let me throw a log here just in case, we'll look at it if an error occurs" is actually a time bomb left for our future selves or the system administrator. If the log level strategy is not defined with clear rules at the beginning of the project, every developer starts choosing log levels according to their own personal taste.

# Developer A's preference (Overly optimistic)
logger.info("User logged into the system, processing starts...")

# Developer B's preference (Paranoid)
logger.debug(f"SQL query triggered: SELECT * FROM users WHERE id = {user_id}")

Even the simple example above makes a huge difference in production. Logging every SQL query with dynamic string formatting, as Developer B does, in a system where thousands of users connect is the shortest way to hit disk I/O limits. While building the backend infrastructure of my own side project, I made this mistake myself and saw the response time of the application server behind the Nginx reverse proxy jump from 12 ms to 180 ms; the cause was a bloated disk write queue.
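The formatting cost mentioned above can be made visible with a small sketch. It assumes the standard library `logging` module; `ExpensiveRepr` is a hypothetical stand-in for any value that is costly to render. With an f-string the message is built before `logger.debug` is even called, so the cost is paid even when DEBUG is disabled. With `%s` placeholders the rendering is deferred and skipped entirely when the level check fails.

```python
import logging

logger = logging.getLogger("lazy_demo")
logger.setLevel(logging.INFO)              # DEBUG is disabled, as in production
logger.addHandler(logging.NullHandler())


class ExpensiveRepr:
    """Hypothetical value whose string form is costly to build."""

    def __init__(self):
        self.render_count = 0

    def __str__(self):
        self.render_count += 1
        return "SELECT * FROM users WHERE id = 42"


query = ExpensiveRepr()

# Eager: the f-string is rendered before logger.debug() runs.
logger.debug(f"SQL query triggered: {query}")
eager_renders = query.render_count

# Lazy: the level check rejects the call, so str(query) never runs.
logger.debug("SQL query triggered: %s", query)
lazy_renders = query.render_count - eager_renders
```

Lazy placeholders only remove the formatting cost of disabled levels. Once DEBUG is switched on, every line is still rendered and written, so the level itself remains the main control.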

⚠️ Disk I/O and CPU Bottleneck

Remember, writing logs to disk is an expensive operation. No matter how fast your application runs, if your logging library writes synchronously or your disk's write capacity (IOPS) is limited, your application will block on the logging call itself.
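One common way to keep the request path from blocking on a slow sink is to hand records to a background thread. The sketch below uses `QueueHandler` and `QueueListener` from the standard library; `ListHandler` is a hypothetical stand-in for a slow file or network handler, and the logger name is an assumption made for the example.

```python
import logging
import logging.handlers
import queue

log_queue = queue.Queue(-1)  # unbounded; a bounded queue trades memory for dropped or blocked records


class ListHandler(logging.Handler):
    """Hypothetical slow sink; stands in for a file or network handler."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


slow_sink = ListHandler()

# The listener drains the queue on its own thread and calls the slow handler there.
listener = logging.handlers.QueueListener(log_queue, slow_sink)
listener.start()

logger = logging.getLogger("async_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
# The calling thread only pays for putting the record on the queue.
logger.addHandler(logging.handlers.QueueHandler(log_queue))

logger.info("Order %s accepted", 4812)

listener.stop()  # processes what is still queued; call this at shutdown
```

This moves the write off the hot path but does not reduce the volume. If the sink is permanently slower than the producers, the queue simply grows, so it complements a log level strategy rather than replacing it.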

The Betrayal of the DEBUG Level in Production

The DEBUG level, as the name suggests, is designed to be used only during development and testing phases. Leaving this level enabled in a production environment is one of the worst things you can do to a system. While writing the inventory module of a production ERP, we logged every barcode scan coming from handheld terminals in the warehouse at the DEBUG level.

Three days after the system went live, the CPU consumption of the systemd-journald service reached 94%. Faced with the overwhelming log load, journald's rate limiting kicked in and it started dropping messages. As a result, when a real error occurred, we had no trace of it because the system was drowned in noise.

# The famous warning line we encountered in the journalctl output:
Jun 06 14:20:11 srv-erp-01 systemd-journald[452]: Suppressed 8420 messages from /user.slice/user-1000.slice

If your log level remains as DEBUG in production, you don't just fill up the disk; you also increase the risk of leaking sensitive data (personal data, passwords, token information) into log files. No matter how much you deal with security aspects like kernel hardening and switch hardening, this carelessness at the application level can make all your firewalls meaningless.
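As a second line of defense against leaked secrets, the rendered message can be scrubbed before any handler sees it. This is a minimal sketch using a `logging.Filter`; the key names in the pattern and the `auth` logger name are assumptions, and a regex like this only catches `key=value` shapes, so it does not replace keeping secrets out of log calls in the first place.

```python
import logging
import re

# Hypothetical key names; extend the pattern to match what your application logs.
SENSITIVE = re.compile(r"\b(password|token|secret)=([^\s,;]+)", re.IGNORECASE)


def redact(text):
    """Replace the value of every sensitive key=value pair with ***."""
    return SENSITIVE.sub(r"\1=***", text)


class RedactingFilter(logging.Filter):
    """Rewrite the rendered message before any handler sees it."""

    def filter(self, record):
        record.msg = redact(record.getMessage())  # render %-style args first
        record.args = None                        # the message is already final
        return True                               # rewrite only, never drop


logger = logging.getLogger("auth")
logger.addFilter(RedactingFilter())
```

A filter attached to a logger only sees records logged directly on that logger. Attaching the same filter to each handler instead covers records that arrive from child loggers as well.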

The Limits of the INFO Level: The Fine Line Between "Noise" and "Signal"

So, what should we log at the INFO level? This question is one of the most critical debates in software architecture. INFO should indicate important state changes during the normal operation of the system. It should not record every step inside a loop, only that the loop started and that it completed successfully.

In an AI-powered production planning module I designed for a production ERP, I categorized the log levels as follows:

Log Level	When to Use?	Target Audience	Example Scenario
DEBUG	Detailed code flow, variable states	Developer	`Connection pool active connections: 14`
INFO	Business logic start and end	System Administrator	`AI Production Plan #4812 started.`
WARNING	Unexpected but system-tolerated situations	L1 Support / Ops	`Database connection retry #1 failed. Retrying...`
ERROR	Operation interrupted, requires intervention	SRE / L2 Support	`Failed to write plan to DB. Transaction aborted.`

Sticking to this table prevents the operations team from being woken up unnecessarily in the middle of the night. If you use the INFO level like a DEBUG level, thousands of lines per second will flow through your log monitoring tools (such as OpenSearch or Grafana Loki), making it impossible to detect a real anomaly.
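The split between INFO and DEBUG inside a loop can be sketched as follows. `schedule_orders` and the `planner` logger name are hypothetical; the point is only that INFO marks the boundaries of the job while per-item detail stays at DEBUG.

```python
import logging

logger = logging.getLogger("planner")


def schedule_orders(order_ids):
    """Hypothetical batch job: INFO marks the boundaries, DEBUG the steps."""
    logger.info("Scheduling started: %d orders", len(order_ids))
    scheduled = 0
    for order_id in order_ids:
        # Per-item detail stays at DEBUG, so it disappears when the level is INFO.
        logger.debug("Scheduling order %s", order_id)
        scheduled += 1
    logger.info("Scheduling finished: %d/%d orders scheduled", scheduled, len(order_ids))
    return scheduled
```

With the logger at INFO, a run over a thousand orders produces two lines instead of a thousand and two, and the summary line still carries the numbers an operator needs.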

ERROR and WARNING: When Should We Sound the Alarm?

The two levels that are most frequently confused are WARNING and ERROR. My rule on this is very simple: if a log line does not require a human to get out of bed and intervene immediately, that log is not an ERROR.

Last year, on the backend of a mobile app I developed, I saw 450,000 unhandled exceptions accumulate on Sentry weekly. When I analyzed them in detail, I realized that 98% of them were actually temporary connection errors caused by network drops on the client side, which the server could do nothing about. Because these were marked as ERROR, the on-call engineer received an alert every half hour and, after a while, started completely ignoring these alerts (alert fatigue).

import logging
import requests

logger = logging.getLogger(__name__)

# BAD PRACTICE (Treating every error like a disaster)
try:
    response = requests.get("https://api.external-service.com/data", timeout=5)
except requests.exceptions.Timeout as e:
    # This is not an ERROR, it is a temporary situation.
    logger.error("External API timeout! Error: %s", e)

# GOOD PRACTICE (Handling tolerable situations with WARNING)
try:
    response = requests.get("https://api.external-service.com/data", timeout=5)
except requests.exceptions.Timeout as e:
    logger.warning("External API timeout (%s). Serving cached fallback data.", e)
    # get_cached_fallback_data() is the application's own cache lookup, not shown here.
    response = get_cached_fallback_data()

If an external service outage does not cause the entire system to stop and your application can tolerate it (for example, by serving cached data), this situation is a WARNING. However, situations that cause the entire system to stop, such as a complete loss of database connection or running out of disk space, are real ERRORs and must trigger an immediate alarm.
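The escalation rule in this paragraph can be sketched as a single function: a timeout that the fallback absorbs is a WARNING, and the level only rises to ERROR when the fallback is missing too. `load_data`, `fetch_live`, and `fetch_cached` are hypothetical names, and the built-in `TimeoutError` and `LookupError` stand in for whatever exceptions your HTTP client and cache actually raise.

```python
import logging

logger = logging.getLogger("external_api")


def load_data(fetch_live, fetch_cached):
    """Return live data, fall back to the cache, and escalate only if both fail."""
    try:
        return fetch_live()
    except TimeoutError as exc:
        # Tolerable: a fallback is still available, so nobody needs to wake up.
        logger.warning("Live fetch timed out (%s); trying cached data", exc)
    try:
        return fetch_cached()
    except LookupError:
        # Nothing left to serve: this is the point where a human has to step in.
        logger.error("Live fetch timed out and no cached data is available")
        raise
```

Keeping the decision in one place means the alerting rule can stay simple, for example paging only on ERROR, without each call site inventing its own threshold.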

The Two Sins of Log Management: Empty Catch Blocks and String Formatting

Two major technical errors frequently made during software development can completely undermine your log level strategy. The first is swallowing exceptions and not writing them anywhere. The second is writing logs as unstructured plain text, which makes the system difficult to analyze.

Previously, on a busy database replication setup running on PostgreSQL 15, it took us days to find the source of a WAL bloat (accumulating write-ahead log segments) issue because the logs were not written correctly. An empty catch block inside a failing function had caused the problem to grow silently.

# SILENT DEATH (Empty catch block)
try:
    process_payment(order_id)
except Exception:
    pass # Never do this! The system won't know it failed.

# STRUCTURED LOGGING
import json
import logging

logger = logging.getLogger("payment_processor")

def process_payment_safe(order_id):
    try:
        # Payment operations...
        pass
    except Exception as e:
        # We generate logs in JSON format to make the analysis tools' job easier
        log_payload = {
            "event": "payment_failed",
            "order_id": order_id,
            "error_message": str(e),
            "status": "failed"
        }
        logger.error(json.dumps(log_payload))

Structured logging (JSON format) allows modern log analysis systems to easily index logs. This way, we can find the answer to the question "Which orders received payment errors?" in seconds. With plain text logs, we would have to struggle writing regex, which extends operational recovery time.
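Instead of calling `json.dumps` at every log site, the JSON shape can live in one place. The following is a sketch of a custom `logging.Formatter`, not a replacement for a dedicated structured-logging library; the field names and the `context` attribute passed through `extra=` are choices made for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via extra= become attributes on the record;
        # "context" is a hypothetical attribute name chosen for this sketch.
        payload.update(getattr(record, "context", {}))
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("payment_processor")
logger.addHandler(handler)

# Call sites stay short; the structure is added by the formatter.
logger.error("payment_failed", extra={"context": {"order_id": 1042, "status": "failed"}})
```

Because every line is a complete JSON object, an indexer can filter on `order_id` or `level` directly, and a traceback logged with `logger.exception` stays inside the same object instead of spilling across multiple lines.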

Building a Human-Centric Operations Culture

As I mentioned at the beginning of the article, logging is entirely a matter of culture and empathy. It defines the relationship between the developer who writes the code and the system administrator or site reliability engineer (SRE) who tries to keep that code alive in production. The developer must know that every log line they write has a cost; the operator must provide the infrastructure to dynamically change log levels (for example, log level hot-reload).

I experienced a similar trade-off during VPS migrations. Temporarily raising the verbosity to DEBUG during the live migration and then dropping it back to INFO once the migration was complete, without restarting the server, saved lives. Nginx reloads its configuration, including the error_log level, when it receives SIGHUP. An ExecReload line in a systemd service unit can send the same signal to your own application, and as long as the application handles the signal and re-reads its log level, you can change the level without a restart.
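On the application side, the reload hook can be quite small. This sketch assumes the operator writes the desired level into a file and then sends SIGHUP; the path `/etc/myapp/log_level` and all function names are hypothetical. A file is used rather than an environment variable because the environment of a running process cannot be changed from outside.

```python
import logging
import signal
from pathlib import Path

# Hypothetical location; the operator writes DEBUG or INFO into this file.
LEVEL_FILE = Path("/etc/myapp/log_level")

VALID_LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def apply_level(name):
    """Set the root logger level by name; ignore anything unrecognized."""
    name = name.strip().upper()
    if name not in VALID_LEVELS:
        return False
    logging.getLogger().setLevel(name)  # setLevel accepts level names as strings
    return True


def apply_level_from_file(path=LEVEL_FILE):
    """Read the level from the file; keep the current level if it is unreadable."""
    try:
        return apply_level(path.read_text())
    except OSError:
        return False


def install_reload_handler():
    """Register the SIGHUP hook; call once at startup from the main thread."""
    if hasattr(signal, "SIGHUP"):  # SIGHUP does not exist on Windows
        signal.signal(signal.SIGHUP, lambda signum, frame: apply_level_from_file())
```

After `install_reload_handler()` has run, writing `DEBUG` into the file and sending SIGHUP to the process raises the verbosity, and writing `INFO` followed by another SIGHUP brings it back, with no restart in between.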

In conclusion, logging is not a random debug tool for developers to feel comfortable on their local machines. Logs are the black box of your system in the production environment. It is in our hands to either fill that box with noise and make it useless, or equip it with clear and readable signals to make it a lifesaver.

My clear stance: ban DEBUG as a standing log level in production and enable it only temporarily and deliberately, make sure to structure your logs in JSON format, and attach the question "Should this error wake someone up in the middle of the night?" to every error log. In my next post, I will talk about the mistakes we make in indexing strategies on PostgreSQL and how the query planner can catch us off guard.