Prioritizing Monitoring and Alerting: My 3-Step Pragmatic Guide

#monitoring #alerting #systemadministration #operations

Knowing what to monitor and what to alert on is a skill whose importance only becomes clear with experience. Initially, I tried to set alerts for everything, and then I drowned in the noise. Deciding what to monitor and what to alert on is an important engineering decision, especially when working with limited resources. Whether in the critical moments of a production ERP or in the background of my own side project, this prioritization has saved me from sleepless nights.

In this post, I'll explain how I've built my monitoring and alerting strategy over the years, detailing the steps I follow. For me, this isn't just a technical topic; it's directly related to operational maturity and team efficiency. Not every metric or log line should trigger an alarm; the important thing is to get the right signal at the right time.

Step 1: Does the Business Stop if This Breaks? Defining Business-Critical Metrics

The first and most important step is to identify the metrics that are the lifeblood of the business. A system can have hundreds of metrics indicating its performance, but not all of them are equally important. I always start by asking, "If this breaks, will we lose money, lose customers, or will operations stop?" When I was working with an ERP for a manufacturing company, this meant ensuring the uninterrupted flow of core business processes like order intake, production planning, shipping, and invoicing.

ℹ️ Examples of Business-Critical Metrics

Database connection count: If it exceeds a critical threshold, we can't accept new connections.

API latency: Especially for payment or critical data processing APIs. Latencies above 500ms slow down business workflows.

Error Rates: A specific service returning HTTP 5xx errors above 5%.

Disk usage: Especially for log or database disks, above 90% full.

Payment gateway success rate: Indispensable for e-commerce sites.

Once, at a large Turkish e-commerce site, I witnessed millions of liras in potential sales lost within hours due to a momentary problem with a payment gateway integration. It was then that I painfully learned that these types of metrics shouldn't just be "monitored," but should trigger an "alert" immediately. For such situations, I prioritize metrics that directly affect the output of business processes, rather than general system health indicators. For example, even if a database server's CPU usage reaches 80%, if this doesn't affect business workflows, a warning might suffice. But if our payment API's error rate jumps from 1% to 5%, that's a direct PagerDuty call.
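
The distinction between "warn" and "page" can be sketched as a small rule table. This is only an illustration: the metric names, the `Rule` type, and the `classify` helper are hypothetical, and the thresholds are simply the example values mentioned in this post.

```python
from dataclasses import dataclass

# Illustrative severities; a real setup would map these to notification channels.
PAGE, WARN, OK = "page", "warn", "ok"


@dataclass(frozen=True)
class Rule:
    metric: str              # hypothetical metric name
    threshold: float         # the rule fires at or above this value
    business_critical: bool  # does a breach directly stop a business workflow?


# Thresholds are the example values from the text, not recommendations.
RULES = [
    Rule("payment_api_5xx_ratio", 0.05, business_critical=True),
    Rule("payment_api_latency_ms", 500, business_critical=True),
    Rule("db_disk_used_ratio", 0.90, business_critical=True),
    Rule("db_cpu_ratio", 0.80, business_critical=False),
]


def classify(metric: str, value: float) -> str:
    """Return PAGE for breached business-critical rules, WARN for other breaches."""
    for rule in RULES:
        if rule.metric == metric and value >= rule.threshold:
            return PAGE if rule.business_critical else WARN
    return OK
```

With these rules, a payment API error ratio of 0.05 is classified as a page, while a database CPU ratio of 0.85 only produces a warning.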

When defining these metrics, I talk not only with the technical team but also with business units. What is their definition of "business stopping," and which workflows are most critical? This helps me move beyond technical jargon and understand the real business impact. In my experience, software architecture is often not about software, but about organizational flow. Therefore, monitoring architecture should also mirror organizational flow.

Step 2: Monitor the Signals That Come Before a Failure: Proactive Metrics

While business-critical metrics say "there's a problem now," proactive metrics signal "there will be a problem soon." These are indicators that provide clues about the system's future behavior. Beyond obvious ones like disk fullness, there are other, more subtle proactive metrics. In my experience, interpreting these metrics correctly lets me act early and fix problems while they are still minor, before they turn into major outages.

For example, in PostgreSQL the write-ahead log (WAL) can grow excessively, a situation sometimes called WAL bloat. Excessive growth of these transaction logs increases disk usage and degrades I/O performance. To monitor this, just looking at disk fullness isn't enough; I track WAL size and growth rate using functions like pg_wal_lsn_diff. If the WAL size exceeds a certain threshold (e.g., 1GB) and the growth rate becomes abnormal, this is a warning for me, indicating that I need to immediately check for lagging replicas, inactive replication slots, or failing WAL archiving.

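-- Simple query for tracking PostgreSQL WAL size and replica lag (PostgreSQL 10+)
-- pg_ls_waldir() requires superuser or membership in the pg_monitor role
SELECT
    client_addr,
    -- Size of the WAL files currently on disk, not the total WAL ever generated
    pg_size_pretty((SELECT sum(size) FROM pg_ls_waldir())) AS current_wal_size,
    -- Bytes of WAL that this replica has not yet written
    pg_wal_lsn_diff(pg_current_wal_lsn(), write_lsn) AS replica_lag_bytes
FROM pg_stat_replication
WHERE client_addr IS NOT NULL;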
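
The "size plus growth rate" condition can be sketched in a few lines. This is an illustration under stated assumptions: the samples are (timestamp, WAL bytes) pairs collected by some external job, the function names are hypothetical, and the 1 MiB/s rate limit is an arbitrary placeholder rather than a recommended value.

```python
from typing import Sequence, Tuple

GIB = 1024 ** 3
MIB = 1024 ** 2


def wal_growth_rate(samples: Sequence[Tuple[float, int]]) -> float:
    """Average growth in bytes/second between the first and last (timestamp, bytes) sample."""
    if len(samples) < 2:
        return 0.0
    (t0, b0), (t1, b1) = samples[0], samples[-1]
    if t1 <= t0:
        return 0.0
    return (b1 - b0) / (t1 - t0)


def wal_warning(samples: Sequence[Tuple[float, int]],
                size_limit: int = GIB,
                rate_limit: float = MIB) -> bool:
    """Warn only when the WAL is both large and still growing quickly."""
    if not samples:
        return False
    current_size = samples[-1][1]
    return current_size > size_limit and wal_growth_rate(samples) > rate_limit
```

Requiring both conditions keeps a large but stable WAL directory, or a brief burst on a small one, from producing a warning.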

On the Redis side, the choice of eviction policy and memory usage are critical. If I run Redis with maxmemory-policy noeviction and its memory fills up, it becomes unable to write new data. To see this coming, I monitor the used_memory metric against maxmemory. With an eviction policy such as allkeys-lru, I also watch the evicted_keys count: a steady increase in evicted keys is a sign that Redis is constantly pushing data out and that memory is insufficient. This situation requires a memory increase or an optimization of the data model.
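
As a rough sketch, the two signals can be derived from two consecutive snapshots of the Redis INFO output. The helper names and the 85% usage limit are assumptions for illustration; used_memory, maxmemory, and evicted_keys are fields that INFO reports.

```python
def parse_info(text: str) -> dict:
    """Parse the 'key:value' lines of Redis INFO output, skipping section headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def memory_pressure(prev_info: str, curr_info: str, usage_limit: float = 0.85) -> list:
    """Compare two INFO snapshots and list signs of memory pressure."""
    prev, curr = parse_info(prev_info), parse_info(curr_info)
    findings = []

    maxmemory = int(curr.get("maxmemory", 0))  # 0 means no limit is configured
    used = int(curr.get("used_memory", 0))
    if maxmemory > 0 and used / maxmemory >= usage_limit:
        findings.append("memory usage high")

    # evicted_keys is a counter, so only the delta between samples is meaningful.
    evicted = int(curr.get("evicted_keys", 0)) - int(prev.get("evicted_keys", 0))
    if evicted > 0:
        findings.append(f"{evicted} keys evicted since last sample")
    return findings
```

The snapshots could come from `redis-cli INFO` or any client library; the parsing above only relies on the plain text format.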

⚠️ Importance of Proactive Metrics

These metrics should be configured as a "warning" or "information" before reaching an "alarm" level. Our goal is not to create panic, but to be prepared for potential problems and gain time for intervention. I generally aim to address these warnings within a week.

At this step, having a deep understanding of the internal workings of systems is a great advantage. Setting soft limits like cgroup memory.high in Linux and getting warnings when these limits are approached allows me to take precautions before a container is OOM-killed. Such detailed monitoring is usually what I focus on more when I put on my "system administrator" hat, and from my field experience, I know that these fine-tunings make a big difference.
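
A minimal sketch of such a check, assuming cgroup v2, where each group directory exposes memory.current and memory.high files. The function names, the 90% ratio, and the directory passed to `check_cgroup` are illustrative assumptions.

```python
from pathlib import Path
from typing import Optional


def parse_limit(raw: str) -> Optional[int]:
    """cgroup v2 stores the literal string 'max' when no limit is set."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def near_memory_high(current_raw: str, high_raw: str, ratio: float = 0.9) -> bool:
    """True when current usage has reached `ratio` of the memory.high soft limit."""
    high = parse_limit(high_raw)
    if high is None or high == 0:
        return False
    return int(current_raw.strip()) >= ratio * high


def check_cgroup(cgroup_dir: str) -> bool:
    """Read the two files from a cgroup v2 directory (path supplied by the caller)."""
    base = Path(cgroup_dir)
    return near_memory_high(
        (base / "memory.current").read_text(),
        (base / "memory.high").read_text(),
    )
```

Keeping the parsing separate from the file reads makes the threshold logic easy to test without a real cgroup hierarchy.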

Step 3: Filter the Noise, Alert Only What Requires Action

The fundamental difference between monitoring and alerting is that one is about collecting information (monitoring), while the other is about notifying when a situation requires intervention (alerting). If we try to set an alarm for everything we monitor, we quickly experience "alert fatigue." The team starts ignoring constantly ringing alarms that don't signal a real problem. This prolongs response time when a truly critical situation arises. I've made this mistake repeatedly in my own operations and learned my lesson.

My basic principle for alerting is: "If an alarm wakes me up in the middle of the night, it must be a truly urgent problem requiring intervention." With this philosophy, I set the thresholds and triggering logic for alarms very carefully.

💡 Checklist for Effective Alerting

Threshold Values: For how long, above or below what value? (Example: CPU above 95% for 5 minutes).

Repeating Alarms: How to handle situations where the same error repeats continuously? (Deduplication, re-notification after a certain period).

Criticality Levels: Distinction with levels like Warning, Critical, Emergency.

Action-Oriented: When an alarm triggers, is it clear what needs to be done? (Runbook link).
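
The "for how long" part of the threshold question can be sketched as follows. The function name and sample format are assumptions; the idea is similar in spirit to the `for` clause of Prometheus alerting rules, which delays firing until a condition has held for a given duration.

```python
from typing import Sequence, Tuple


def breached_for(samples: Sequence[Tuple[float, float]],
                 threshold: float,
                 duration_s: float) -> bool:
    """True if the value has stayed above `threshold` for at least `duration_s`.

    `samples` are (unix_timestamp, value) pairs in ascending time order.
    Any sample at or below the threshold resets the streak.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= duration_s
```

For the checklist example, `breached_for(cpu_samples, 95.0, 300)` fires only when CPU has been above 95% for five consecutive minutes, so a single spike does not page anyone.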

For example, I use fail2ban to block malicious requests to my servers. fail2ban internally catches specific patterns and bans IPs. Normally, fail2ban banning an IP is not an alarm; it means it's doing its job. But if I see in fail2ban logs, for instance, more than 1000 bans in the last hour, this could be a DDoS attempt or a widespread scanning attack. This is an alert level. For this scenario, I set up a systemd unit that monitors fail2ban logs via journald and sends a notification when the number of bans exceeds a certain threshold.

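# Example of filtering and counting fail2ban logs with journalctl
# Assumes fail2ban logs to the journal (logtarget = SYSTEMD-JOURNAL); it may write to /var/log/fail2ban.log instead
# Counts new ban events in the last hour; matching "] Ban " skips "Unban" and "Restore Ban" lines
journalctl -u fail2ban.service --since "1 hour ago" --no-pager | grep -c -F "] Ban "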
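
The same count can be broken down per jail before comparing it with the threshold. This is a sketch that assumes log lines shaped like `... NOTICE  [sshd] Ban 203.0.113.7`; the regular expression and function names are illustrative and should be checked against the actual log format in use.

```python
import re
from collections import Counter

# Assumed line shape: "... NOTICE  [jail-name] Ban <ip>"
BAN_RE = re.compile(r"\[(?P<jail>[^\]]+)\]\s+Ban\s+(?P<ip>\S+)")


def count_bans(lines):
    """Count new bans per jail, ignoring 'Unban' and 'Restore Ban' lines."""
    counts = Counter()
    for line in lines:
        match = BAN_RE.search(line)
        if match:
            counts[match.group("jail")] += 1
    return counts


def should_alert(lines, threshold=1000):
    """Alert when the total number of new bans exceeds the threshold."""
    return sum(count_bans(lines).values()) > threshold
```

A per-jail breakdown also helps the notification say which service is being targeted, rather than only that the total is high.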

Similarly, journald itself enforces rate limits. If a service logs too much, journald may start dropping its messages, which can cause us to miss important logs. I pay attention to the RateLimitIntervalSec and RateLimitBurst settings in journald.conf. If I see a log entry indicating that these limits have been hit, I treat it as an alarm telling me to fix the logging behavior of the relevant application. Such situations often go hand in hand with the limits I assign to processes via cgroups, such as MemoryHigh (memory.high) or CPUQuota.
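
A sketch of detecting these drops from journal output. It assumes journald reports rate limiting with a line of the form "Suppressed N messages from <source>"; the exact wording can differ between systemd versions, so the pattern below should be treated as an assumption to verify.

```python
import re

# Assumed wording of journald's rate-limit notice; verify against your systemd version.
SUPPRESSED_RE = re.compile(r"Suppressed (?P<count>\d+) messages from (?P<source>\S+)")


def dropped_by_source(lines):
    """Sum the number of suppressed messages per source (unit or cgroup path)."""
    totals = {}
    for line in lines:
        match = SUPPRESSED_RE.search(line)
        if match:
            source = match.group("source")
            totals[source] = totals.get(source, 0) + int(match.group("count"))
    return totals
```

Any non-empty result is worth a warning, since it means some log lines from that source are gone for good.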

Connecting Monitoring Data to Action Plans

Knowing what to do when we receive an alarm is as important as the alarm itself. Just getting a notification saying "Disk full!" is not enough. We also need to know which disk is full, which application is using this disk, how to clean it, or how to resolve the issue. For me, every critical alarm should have a "runbook" behind it. This runbook includes a roadmap for problem detection, initial intervention steps, temporary solutions, and the ultimate resolution.

In a production ERP, I prepared detailed runbooks for situations like PostgreSQL slowdowns. These runbooks include which metrics to check (CPU, I/O, active connections), which queries to run (slow queries, locked tables), how to interpret EXPLAIN ANALYZE outputs, and possible index strategies. Sometimes, I even add simple shell scripts containing basic troubleshooting commands to these runbooks.

🔥 An Alarm Without a Runbook is an Unfinished Story

An alarm without a runbook leaves the team in uncertainty and prolongs intervention time. This increases the risk of business disruption. In my experience, runbooks should be living documents and updated after every incident.
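
One way to enforce this is a small check that flags critical alerts without a runbook link, for example in CI for the alert definitions. The `AlertRule` type, the severity names, and the URL below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertRule:
    name: str
    severity: str                      # e.g. "warning" or "critical" (illustrative levels)
    runbook_url: Optional[str] = None  # link to the runbook, if one exists


def missing_runbooks(rules: List[AlertRule]) -> List[str]:
    """Return the names of critical alerts that have no runbook attached."""
    return [r.name for r in rules if r.severity == "critical" and not r.runbook_url]


rules = [
    AlertRule("DiskAlmostFull", "critical", "https://wiki.example.internal/runbooks/disk-full"),
    AlertRule("PaymentErrorRateHigh", "critical"),
    AlertRule("CpuHigh", "warning"),
]
```

Here `missing_runbooks(rules)` reports only `PaymentErrorRateHigh`, since warnings are allowed to exist without a runbook in this sketch.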

The concept of observability also comes into play here. Not just metrics, but logs and traces also play a critical role in finding the root cause of a problem. When an application's error rate increases, knowing just the number isn't enough. We also need to know which requests are failing, the error's stack trace, and which users are affected. In my systems, while collecting metrics with Prometheus, I gather logs in a central location with Loki or Elasticsearch. For traces, I use OpenTelemetry to see how a request travels between different services and where it gets stuck. Bringing these three together ensures that an alarm saying "There's a problem!" also says "The problem is here, and you can solve it like this!"

This approach becomes even more important during architectural changes, especially when transitioning from monolith to microservices. Tracking the journey of a request in distributed systems can be much more complex than in a single monolith. Therefore, in architectures like event-sourcing or CQRS, I rely on the transactional outbox pattern to make sure events are published reliably, and I monitor for idempotency issues such as duplicate processing to confirm that events are handled correctly.
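
The monitoring side of idempotency can be sketched as a consumer that skips events it has already handled and counts the duplicates. This is a simplified illustration: the class name is hypothetical, and the in-memory set stands in for what would need to be a durable store in a real system.

```python
class IdempotentConsumer:
    """Processes each event ID at most once and counts redeliveries."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()   # stand-in for a durable store of processed event IDs
        self.duplicates = 0  # could be exported as a metric and watched for spikes

    def consume(self, event_id, payload) -> bool:
        """Return True if the event was processed, False if it was a duplicate."""
        if event_id in self._seen:
            self.duplicates += 1
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so failures can be retried.
        self._seen.add(event_id)
        return True
```

A sudden rise in the duplicate counter is a useful signal that the publishing side is retrying more than usual.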

My Techniques for Reducing Noise and Combating Alert Fatigue

Over the years, I've personally experienced how destructive alert fatigue can be. Constant, but insignificant, alarms eventually lead to real emergencies being overlooked. Therefore, reducing noise and only sending alerts that require action has become an operational discipline for me.

Some techniques I use include:

Deduplication (Grouping Recurring Alarms): Instead of sending a new notification every time the same error occurs repeatedly, I send the first notification and then group subsequent identical errors under that notification. This way, instead of "100 disk full alarms," I receive a notification like "Disk full (repeated 100 times)."
Silencing: I temporarily silence alarms during planned maintenance windows or for known temporary issues. For example, during a VPS migration process, I silence the old server's network alarms until the new server is up.
Escalation Policies: I initially send an alarm to only one team. If no action is taken within a certain period, it's escalated to a higher-level team or a different channel (SMS, phone call).
Baselines and Anomaly Detection: Static thresholds are not always sufficient. I use systems that monitor metrics like an application's normal traffic patterns and CPU usage during operation, and flag deviations from this baseline as anomalies. This is very useful, especially when I enable a new feature flag or perform a canary deployment.
Alert Review: I regularly review existing alarms (monthly or after every major incident). I weed out alarms that are no longer valid, produce false positives, or send unnecessary notifications. This also applies to alarms about systemd unit failures. Sometimes a job started by a systemd timer doesn't behave as expected and gets OOM-killed. In this case, I improve the alarm by linking it to a more specific cause than a generic "OOM-killed." Last month, one of my jobs was OOM-killed after I had written a sleep 360 into it; I then switched to a polling wait to reduce such errors.
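
Deduplication and silencing from the list above can be sketched together in one small class. The `Notifier` name, the alert keys, and the re-notification interval are illustrative assumptions, not a description of any particular alerting tool.

```python
class Notifier:
    """Groups repeated alerts and honours silences."""

    def __init__(self, renotify_after_s=3600.0):
        self.renotify_after_s = renotify_after_s
        self._silences = []  # list of (alert_key, start_ts, end_ts)
        self._open = {}      # alert_key -> {"count": int, "last_sent": float}

    def silence(self, key, start_ts, end_ts):
        """Suppress notifications for `key` during [start_ts, end_ts)."""
        self._silences.append((key, start_ts, end_ts))

    def _is_silenced(self, key, now):
        return any(k == key and start <= now < end for k, start, end in self._silences)

    def handle(self, key, now):
        """Return the message to send, or None if the alert is suppressed."""
        if self._is_silenced(key, now):
            return None
        state = self._open.get(key)
        if state is None:
            # First occurrence: notify immediately.
            self._open[key] = {"count": 1, "last_sent": now}
            return key
        state["count"] += 1
        if now - state["last_sent"] >= self.renotify_after_s:
            # Re-notify with the grouped count instead of one message per event.
            state["last_sent"] = now
            return f"{key} (repeated {state['count']} times)"
        return None
```

Escalation could be layered on top by checking how long an open alert has gone unacknowledged, which is left out here to keep the sketch short.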

These approaches are critically important, especially in complex hybrid environments that mix bare metal and containers. When monitoring disk-full emergencies or out-of-memory failures during builds in an environment managed with Docker Compose, these filtering techniques allow me to take action only in situations that truly require intervention. Otherwise, it would be inevitable to drown under constantly ringing phones and incoming notifications. This is one of the most valuable lessons I've learned since the early years of my career.
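
The baseline technique from the list above can be illustrated with a simple z-score check. This is a deliberately naive sketch: the function name and the limit of 3 standard deviations are assumptions, and real traffic usually needs seasonality-aware baselines.

```python
from statistics import mean, stdev


def is_anomaly(history, value, z_limit=3.0):
    """Flag `value` if it deviates from the history by more than `z_limit` standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    baseline, spread = mean(history), stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return value != baseline
    return abs(value - baseline) / spread > z_limit
```

During a canary deployment, `history` could hold the metric values from before the rollout and `value` the latest reading from the canary.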

Conclusion

Monitoring and alerting are the cornerstones of ensuring a system's healthy operation. However, striking the right balance between the two comes with years of experience. The 3 steps I've outlined in this post – defining business-critical metrics, monitoring proactive signals, and filtering out noise to alert only what requires action – have been a roadmap for operational maturity for me.

Let's not forget that this process is not a one-and-done task. As systems evolve and business needs change, these priorities must be continuously reviewed and updated. Every new feature, every new integration, can add a new layer to our monitoring and alerting strategy. With a pragmatic approach, learning from mistakes, and continuous improvement, we can build more resilient systems. In my next post, I'll discuss how I monitor index strategies and performance regressions in PostgreSQL.