James Lee

Monitoring & Alerting System Design: From Static Thresholds to Intelligent Alert Correlation

The Three Pillars of Observability

A complete monitoring system requires three complementary components:

┌─────────────────────────────────────────────────────────────┐
│  Metrics  — what is happening (numeric measurements)        │
│  Tracing  — where it is happening (request lifecycle)       │
│  Logging  — why it is happening (detailed context)          │
└─────────────────────────────────────────────────────────────┘
Pillar  | Purpose                                                                           | Examples
--------|-----------------------------------------------------------------------------------|---------------------------------------------
Metrics | Numeric measurements over time; 5 types: Gauge, Counter, Histogram, Timer, Meter   | CPU usage, request rate, error rate
Tracing | Full lifecycle of a request across distributed services                            | Distributed trace spanning API → DB → Cache
Logging | Detailed runtime information: method inputs, exceptions, stack traces              | Application error logs, audit logs

How They Work Together

Alert fires (Metrics threshold exceeded)
     │
     ▼
Trace the suspicious request path (Tracing)
     │
     ▼
Drill into detailed logs of the failing module (Logging)
     │
     ▼
Identify root cause
     │
     ▼
Tune alert rules for earlier detection next time (Metrics)

None of the three can replace the others. Together they form a complete feedback loop from detection → diagnosis → prevention.


Detection Algorithms

Static Threshold Detection

Best suited for baseline performance metrics (CPU, memory, disk, etc.).

Recommended configuration: N-of-M detection

Trigger an alert only when the threshold is exceeded at least N times within M consecutive check cycles.

Example: "5 cycles, satisfied 3 times"

Cycle:  1    2    3    4    5
Value: [OK] [!!] [!!] [OK] [!!]  → 3/5 exceeded → ALERT ✅

Cycle:  1    2    3    4    5
Value: [OK] [!!] [OK] [OK] [OK]  → 1/5 exceeded → no alert ✅

Tuning guidance:

Metric type               | Recommended N | Reason
--------------------------|---------------|---------------------------------------------------
Volatile (e.g. CPU usage) | N = 3         | Frequent spikes are normal; avoid false positives
Stable (e.g. disk usage)  | N = 1         | Disk rarely spikes; any breach is meaningful
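
A minimal sketch of N-of-M evaluation in Python, assuming a fixed window of M = 5 check cycles (the thresholds and metric values are illustrative):

from collections import deque

def n_of_m_breached(samples, threshold, n):
    """Return True if at least n of the given samples exceed the threshold."""
    return sum(1 for v in samples if v > threshold) >= n

M = 5
cpu_window = deque(maxlen=M)   # keep only the last M check cycles

# Volatile metric (CPU): require N = 3 breaches out of 5 cycles.
for value in [42.0, 93.5, 91.2, 55.0, 95.8]:
    cpu_window.append(value)
print(n_of_m_breached(cpu_window, threshold=90.0, n=3))   # True -> ALERT

# Stable metric (disk): any single breach (N = 1) is meaningful.
disk_window = deque([71.0, 72.0, 96.5], maxlen=M)
print(n_of_m_breached(disk_window, threshold=95.0, n=1))  # True -> ALERT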

Recovery Detection

Recovery uses the inverse of the trigger condition:

Trigger condition:  "5 cycles, satisfied ≥ 3 times" → ALERT
Recovery condition: "5 cycles, satisfied  < 3 times" → RESOLVED

This prevents alert flapping when a metric hovers near the threshold.
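
A sketch of the paired trigger/recovery check, again over a 5-cycle window (values and threshold are illustrative). Because recovery is the exact inverse of the trigger, the state only flips when the rolling breach count crosses the N boundary, not on every sample:

from collections import deque

def evaluate(window, threshold, n, currently_firing):
    """Flip state using N-of-M as the trigger and its inverse as the recovery."""
    breaches = sum(1 for v in window if v > threshold)
    if not currently_firing and breaches >= n:
        return True          # ALERT: at least n of the last m cycles breached
    if currently_firing and breaches < n:
        return False         # RESOLVED: fewer than n breaches remain in the window
    return currently_firing  # otherwise keep the current state

firing = False
window = deque(maxlen=5)
for value in [80, 96, 97, 95, 80, 80, 80]:   # hovers around a threshold of 90
    window.append(value)
    firing = evaluate(window, threshold=90, n=3, currently_firing=firing)
    print(value, "FIRING" if firing else "ok")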


Alert States (Alertmanager Model)

┌──────────────────────────────────────────────────────────────┐
│                     Alert State Machine                      │
│                                                              │
│   Inactive ──threshold crossed──▶ Pending                   │
│                                      │                      │
│                              duration satisfied             │
│                                      │                      │
│                                      ▼                      │
│                                   Firing ──▶ Notification   │
│                                      │       Pipeline       │
│                              condition clears               │
│                                      │                      │
│                                      ▼                      │
│                                  Inactive                   │
└──────────────────────────────────────────────────────────────┘
State    | Meaning
---------|------------------------------------------------------------------
Inactive | Threshold not exceeded; no alert
Pending  | Threshold exceeded, but the `for` duration is not yet satisfied
Firing   | Threshold exceeded AND the `for` duration satisfied → alert sent

The `for` duration requirement ensures transient spikes don't generate noise.
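
A sketch of this state machine in Python, using check cycles in place of wall-clock time for the `for` duration (threshold, duration, and sample values are illustrative):

import enum

class State(enum.Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"

class Alert:
    def __init__(self, threshold, for_cycles):
        self.threshold = threshold
        self.for_cycles = for_cycles     # the "for" duration, in check cycles
        self.state = State.INACTIVE
        self.pending_cycles = 0

    def observe(self, value):
        if value <= self.threshold:
            # Condition clears: fall back to Inactive from any state.
            self.state = State.INACTIVE
            self.pending_cycles = 0
        elif self.state is State.INACTIVE:
            # Threshold crossed: start waiting out the "for" duration.
            self.state = State.PENDING
            self.pending_cycles = 1
        elif self.state is State.PENDING:
            self.pending_cycles += 1
            if self.pending_cycles >= self.for_cycles:
                self.state = State.FIRING   # duration satisfied -> notify
        return self.state

alert = Alert(threshold=0.9, for_cycles=2)
for v in [0.5, 0.95, 0.7, 0.95, 0.96, 0.4]:
    print(v, alert.observe(v).value)
# A single 0.95 spike only reaches Pending; two consecutive breaches reach Firing.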


Alert Convergence Strategies

Without convergence, a single incident can generate thousands of alerts — flooding inboxes and causing alert fatigue.

Three Core Mechanisms

Mechanism  | Description
-----------|------------------------------------------------------------------
Grouping   | Bundle similar alerts into a single notification
Inhibition | When alert A fires, suppress related alerts B and C
Silencing  | Mute alerts for a specific time window (e.g. during maintenance)
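
A sketch of the three mechanisms applied in sequence, assuming each alert carries severity, cluster, and service labels (the label names, the inhibition rule, and the maintenance window are illustrative, not any particular platform's configuration):

from collections import defaultdict

alerts = [
    {"name": "HighErrorRate", "severity": "critical", "cluster": "sh-1", "service": "order"},
    {"name": "HighLatency",   "severity": "warning",  "cluster": "sh-1", "service": "order"},
    {"name": "HighLatency",   "severity": "warning",  "cluster": "sh-1", "service": "payment"},
    {"name": "DiskFull",      "severity": "warning",  "cluster": "bj-2", "service": "search"},
]

# Silencing: mute everything from a cluster that is under maintenance.
silenced_clusters = {"bj-2"}
alerts = [a for a in alerts if a["cluster"] not in silenced_clusters]

# Inhibition: a critical alert in a cluster suppresses warnings in the same cluster.
critical_clusters = {a["cluster"] for a in alerts if a["severity"] == "critical"}
alerts = [a for a in alerts
          if not (a["severity"] == "warning" and a["cluster"] in critical_clusters)]

# Grouping: everything that survives is bundled into one notification per group key.
groups = defaultdict(list)
for a in alerts:
    groups[(a["cluster"], a["service"])].append(a)

for key, members in groups.items():
    print(f"notify once for group {key}: {[m['name'] for m in members]}")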

Convergence Rules

Scenario: metric continuously exceeds threshold for hours
Without convergence: hundreds of duplicate alert emails
With convergence:    one alert email, then silence for N hours

┌──────────────────────────────────────────────────────────┐
│  Alert fires at T=0                                      │
│  ├── notify once                                         │
│  ├── suppress for 24h if condition persists unresolved   │
│  └── re-notify after 24h if still unresolved             │
└──────────────────────────────────────────────────────────┘
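
A sketch of the notify-once-then-suppress behaviour, keyed by an alert fingerprint (the 24-hour repeat interval and the fingerprint format are illustrative):

import time

REPEAT_INTERVAL = 24 * 3600          # suppress duplicate notifications for 24h
last_notified = {}                   # alert fingerprint -> last notify timestamp

def maybe_notify(fingerprint, now=None):
    """Send at most one notification per fingerprint per repeat interval."""
    now = now if now is not None else time.time()
    last = last_notified.get(fingerprint)
    if last is None or now - last >= REPEAT_INTERVAL:
        last_notified[fingerprint] = now
        return True      # notify
    return False         # still inside the suppression window

print(maybe_notify("cpu>90%@web-01", now=0))          # True  -> first email
print(maybe_notify("cpu>90%@web-01", now=3600))       # False -> suppressed
print(maybe_notify("cpu>90%@web-01", now=25 * 3600))  # True  -> re-notify after 24h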

No-data alerting:

  • If a metric suddenly stops reporting → trigger a "no data" alert
  • ⚠️ Only enable this for metrics with consistent, continuous reporting; disable for intermittent sources
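
A sketch of a no-data check, assuming the platform records the last time each metric reported (the interval and metric names are illustrative); only metrics marked as continuously reporting are eligible:

import time

MAX_SILENCE = 3 * 60   # expected report interval is 60s; alert after ~3 missed reports
last_seen = {"order.qps": time.time() - 500,
             "batch.job.duration": time.time() - 500}

# Only metrics known to report continuously are eligible for no-data alerting.
continuous_metrics = {"order.qps"}

now = time.time()
for metric, ts in last_seen.items():
    if metric in continuous_metrics and now - ts > MAX_SILENCE:
        print(f"NO-DATA alert: {metric} has not reported for {int(now - ts)}s")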

Three Generations of Alerting Architecture

Generation 1: Polling-Based Alerts

crontab ──periodic scan──▶ metrics store ──threshold check──▶ alert

Problems:

Issue           | Description
----------------|--------------------------------------------------------------------
Latency         | Data is already stale by the time it's polled from storage
Inflexibility   | Fixed rules and thresholds can't adapt to growing business scale
False positives | Brief network jitter triggers alerts on static thresholds

Generation 2: Streaming Alerts

Data flows through a stream processor instead of being stored first:

Incoming metrics stream
     │
     ▼
Stream processing engine
(evaluate rules in real-time)
     │
     ├── threshold met ──▶ push to Redis queue ──▶ alert consumer
     └── threshold not met ──▶ discard / expire from queue
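
A sketch of the streaming evaluation path, with a local queue.Queue standing in for the Redis queue (rule set, field names, and sample points are illustrative):

from queue import Queue

RULES = {"response_time_ms": 500, "error_rate": 0.05}   # metric -> threshold
alert_queue = Queue()   # stand-in for the Redis queue in front of alert consumers

def on_point(point):
    """Evaluate the rule as the data point flows through, with no storage round-trip."""
    threshold = RULES.get(point["metric"])
    if threshold is not None and point["value"] > threshold:
        alert_queue.put(point)   # threshold met -> hand off to an alert consumer
    # otherwise the point is simply discarded by the alerting path

stream = [
    {"metric": "response_time_ms", "value": 120,  "service": "order"},
    {"metric": "response_time_ms", "value": 830,  "service": "order"},
    {"metric": "error_rate",       "value": 0.11, "service": "payment"},
]
for point in stream:
    on_point(point)

while not alert_queue.empty():
    print("alert consumer received:", alert_queue.get())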

Four key advantages:

① Real-time
Rules are evaluated as data flows through — no storage round-trip.

② Hierarchical rule abstraction

Alert hierarchy (bottom-up convergence):

Interface level
└── auto-set threshold based on SLA/KPI
     │ converge up
Service pool level
└── warning: >5% of IPs exceed SLA
    error:   >10% of IPs exceed SLA
     │ converge up
Business line level
└── warning/error/fatal based on service pool states
     │ converge up
Global dashboard (9-grid view)
└── macro-level availability overview
    drill down to find root cause

Alerts that don't breach the next level's SLA are shown in dashboards only — not sent to users. This dramatically reduces notification noise.
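
A sketch of the bottom-up convergence, using the 5% / 10% pool thresholds from the hierarchy above; escalating a business line to fatal when more than one pool is in error is an assumption for illustration only:

def pool_state(ip_breaches, total_ips):
    """Converge interface-level SLA breaches up to a service-pool level state."""
    ratio = ip_breaches / total_ips
    if ratio > 0.10:
        return "error"      # > 10% of IPs exceed the SLA
    if ratio > 0.05:
        return "warning"    # > 5% of IPs exceed the SLA
    return "ok"             # shown on dashboards only, no notification

def business_line_state(pool_states):
    """Converge service-pool states up to the business line."""
    errors = pool_states.count("error")
    if errors > 1:
        return "fatal"      # assumption: multiple erroring pools escalate to fatal
    if errors == 1:
        return "error"
    if "warning" in pool_states:
        return "warning"
    return "ok"

pools = {"order": pool_state(12, 100), "payment": pool_state(3, 100)}
print(pools)                                      # {'order': 'error', 'payment': 'ok'}
print(business_line_state(list(pools.values())))  # 'error'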

③ Automatic rule extraction

As long as incoming data follows the protocol (includes business line, service pool, etc.), alert rules are automatically generated and stored in MySQL. Dashboards, emails, and push notifications are auto-configured — no manual intervention needed.

Auto-extracted rules cover ~80% of alerting needs. Custom rules handle the remaining 20% (e.g. sudden spike/drop, week-over-week comparison).
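
A sketch of rule auto-extraction, with a plain list standing in for the MySQL rule table and assumed SLA defaults per metric (field names and thresholds are illustrative):

DEFAULT_THRESHOLDS = {"response_time_ms": 500, "error_rate": 0.05}  # assumed SLA defaults

rule_store = []   # stand-in for the MySQL table holding generated rules

def extract_rules(point):
    """Auto-generate a rule the first time a (business line, pool, metric) combination appears."""
    key = (point["business_line"], point["service_pool"], point["metric"])
    if any(r["key"] == key for r in rule_store):
        return
    threshold = DEFAULT_THRESHOLDS.get(point["metric"])
    if threshold is not None:
        rule_store.append({"key": key, "threshold": threshold,
                           "notify": ["dashboard", "email"]})

extract_rules({"business_line": "trade", "service_pool": "order-service",
               "metric": "response_time_ms", "value": 210})
print(rule_store)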

④ Bulk rule import via API

POST /api/v1/alert-rules/batch
Content-Type: application/json

[
  { "metric": "response_time", "threshold": 500, "pool": "order-service" },
  { "metric": "error_rate",    "threshold": 0.05, "pool": "payment-service" }
]

Useful when teams have their own alert logic abstractions that differ from the platform's auto-extraction model.
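
A sketch of a client pushing such a batch with the requests library (the host name is a placeholder; the endpoint and payload follow the example above, and authentication is omitted):

import requests

rules = [
    {"metric": "response_time", "threshold": 500,  "pool": "order-service"},
    {"metric": "error_rate",    "threshold": 0.05, "pool": "payment-service"},
]

# Push a team's own rule abstractions to the platform in one call.
resp = requests.post("https://monitoring.example.com/api/v1/alert-rules/batch",
                     json=rules, timeout=5)
resp.raise_for_status()
print(resp.status_code)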


Generation 3: Intelligent Alerts

Intelligent alerting uses algorithms to continuously reduce alert volume (less noise) and improve alert quality (fewer false positives).

Four core techniques:

Technique            | Description
---------------------|---------------------------------------------------------
Alert correlation    | Group related alerts into a single "incident event"
Deduplication        | Merge repeated alerts from the same root cause
Auto-recovery        | Automatically close alerts when conditions normalize
Alert classification | Categorize alerts by type for smarter routing
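
A sketch of deduplication plus auto-recovery, keyed by a label-based fingerprint (the label names and the incident record shape are illustrative):

open_incidents = {}   # fingerprint -> incident record

def fingerprint(alert):
    """Alerts with identical identifying labels share one fingerprint."""
    return (alert["name"], alert["service"], alert["cluster"])

def ingest(alert):
    fp = fingerprint(alert)
    if alert["status"] == "resolved":
        # Auto-recovery: close the incident when the condition normalizes.
        open_incidents.pop(fp, None)
        return
    if fp in open_incidents:
        open_incidents[fp]["count"] += 1      # deduplication: just bump the counter
    else:
        open_incidents[fp] = {"alert": alert, "count": 1}

ingest({"name": "HighLatency", "service": "order", "cluster": "sh-1", "status": "firing"})
ingest({"name": "HighLatency", "service": "order", "cluster": "sh-1", "status": "firing"})
ingest({"name": "HighLatency", "service": "order", "cluster": "sh-1", "status": "resolved"})
print(open_incidents)   # {} -> deduplicated while firing, auto-closed on recovery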

Data Aggregation & Dimensionality Reduction

Why Aggregation Matters

Without aggregation:

Every raw data point stored and queried individually:
Write volume:  O(N) raw points per second
Read volume:   O(N) raw points → compute result at query time

With aggregation:

Pre-aggregate at write time:
Write volume:  O(1) aggregated value per window
Read volume:   O(1) pre-computed result

In high-volume monitoring scenarios, this difference is critical for system stability.
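
A sketch of write-time pre-aggregation, keeping one running bucket per metric per window so that reads return a pre-computed value (window size and metric names are illustrative):

from collections import defaultdict

WINDOW = 60   # seconds per aggregation bucket

# bucket key -> running aggregate, updated on every write
buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})

def write(metric, timestamp, value):
    """O(1) update of the pre-aggregated bucket instead of storing the raw point."""
    b = buckets[(metric, timestamp // WINDOW)]
    b["count"] += 1
    b["sum"] += value
    b["max"] = max(b["max"], value)

def read_avg(metric, timestamp):
    """O(1) read of the pre-computed result for the window."""
    b = buckets[(metric, timestamp // WINDOW)]
    return b["sum"] / b["count"]

for i, v in enumerate([120, 340, 95, 510]):
    write("response_time_ms", 1_700_000_000 + i, v)
print(read_avg("response_time_ms", 1_700_000_000))   # 266.25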

Two Aggregation Dimensions

① Computational aggregation

Raw data points ──▶ avg / sum / max / min / percentile / sampling

② Multi-dimensional aggregation (labels/tags)

Dimensions: time | location | business line | service | interface

Example query:
"Average response time of the payment service
 in IDC-Shanghai over the last 5 minutes"
= aggregate by [time=5m, location=shanghai, service=payment]

Tags/labels are the key to slicing monitoring data across different perspectives.
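
A sketch of slicing by tags, assuming each point carries location, service, and interface labels (the data and tag names are illustrative):

points = [
    {"value": 180, "location": "shanghai", "service": "payment", "interface": "pay"},
    {"value": 240, "location": "shanghai", "service": "payment", "interface": "refund"},
    {"value": 410, "location": "beijing",  "service": "payment", "interface": "pay"},
    {"value":  95, "location": "shanghai", "service": "order",   "interface": "create"},
]

def avg_by(points, **tags):
    """Average the value of all points whose tags match the given filter."""
    selected = [p["value"] for p in points
                if all(p.get(k) == v for k, v in tags.items())]
    return sum(selected) / len(selected) if selected else None

# "Average response time of the payment service in IDC-Shanghai"
print(avg_by(points, location="shanghai", service="payment"))   # 210.0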

Dimensionality Reduction for Alert Storms

Raw alert events per day:  10,000+
After convergence & aggregation:  < 10 actionable items

Reduction pipeline:
10,000 raw alerts
     │ deduplication
     ▼
2,000 unique alerts
     │ correlation (same root cause)
     ▼
200 incident groups
     │ SLA-based suppression (doesn't breach next level)
     ▼
< 10 alerts sent to on-call engineers

The goal: surface only what requires human intervention, while keeping full detail available for drill-down.
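
A compact sketch of the funnel on synthetic alerts, where deduplication keys on a fingerprint, correlation keys on a shared root cause, and suppression keeps only alerts that breach the next level's SLA (the data and the reduction ratios are illustrative):

raw = [{"fingerprint": f"disk-full@host-{i % 40}",
        "root_cause": f"switch-{i % 8}",
        "breaches_sla": i % 8 == 0}
       for i in range(10_000)]                                   # 10,000 raw alert events

unique = list({a["fingerprint"]: a for a in raw}.values())       # deduplication
incident_groups = {a["root_cause"] for a in unique}              # correlation by root cause
actionable = [a for a in unique if a["breaches_sla"]]            # SLA-based suppression

print(len(raw), "->", len(unique), "->", len(incident_groups), "->", len(actionable))
# 10000 -> 40 -> 8 -> 5 : only a handful of items reaches the on-call engineer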


Architecture Summary

┌──────────────────────────────────────────────────────────────┐
│                  Monitoring & Alerting Platform              │
│                                                              │
│  Data Collection                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                   │
│  │ Metrics  │  │ Tracing  │  │ Logging  │                   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                   │
│       └─────────────┴─────────────┘                         │
│                      │                                       │
│                      ▼                                       │
│            Stream Processing Engine                          │
│         (real-time rule evaluation)                          │
│                      │                                       │
│          ┌───────────┴───────────┐                          │
│          ▼                       ▼                          │
│   Alert Queue (Redis)     Aggregated Store                   │
│          │                                                   │
│          ▼                                                   │
│   Convergence Engine                                         │
│   (group / inhibit / silence / deduplicate)                  │
│          │                                                   │
│          ▼                                                   │
│   Notification Pipeline                                      │
│   (email / SMS / webhook / PagerDuty)                        │
└──────────────────────────────────────────────────────────────┘
