James Lee

Monitoring & Alerting System Design: From Static Thresholds to Intelligent Alert Correlation

The Three Pillars of Observability

A complete monitoring system requires three complementary components:

┌─────────────────────────────────────────────────────────────┐
│  Metrics  — what is happening (numeric measurements)        │
│  Tracing  — where it is happening (request lifecycle)       │
│  Logging  — why it is happening (detailed context)          │
└─────────────────────────────────────────────────────────────┘
Pillar  | Purpose                                                                           | Examples
--------|-----------------------------------------------------------------------------------|---------------------------------------------
Metrics | Numeric measurements over time; 5 types: Gauge, Counter, Histogram, Timer, Meter   | CPU usage, request rate, error rate
Tracing | Full lifecycle of a request across distributed services                            | Distributed trace spanning API → DB → Cache
Logging | Detailed runtime information: method inputs, exceptions, stack traces              | Application error logs, audit logs

How They Work Together

Alert fires (Metrics threshold exceeded)
     │
     ▼
Trace the suspicious request path (Tracing)
     │
     ▼
Drill into detailed logs of the failing module (Logging)
     │
     ▼
Identify root cause
     │
     ▼
Tune alert rules for earlier detection next time (Metrics)

None of the three can replace the others. Together they form a complete feedback loop from detection → diagnosis → prevention.


Detection Algorithms

Static Threshold Detection

Best suited for baseline performance metrics (CPU, memory, disk, etc.).

Recommended configuration: N-of-M detection

Trigger an alert only when the threshold is exceeded at least N times within M consecutive check cycles.

Example: "5 cycles, satisfied 3 times"

Cycle:  1    2    3    4    5
Value: [OK] [!!] [!!] [OK] [!!]  → 3/5 exceeded → ALERT ✅

Cycle:  1    2    3    4    5
Value: [OK] [!!] [OK] [OK] [OK]  → 1/5 exceeded → no alert ✅

Tuning guidance:

Metric type               | Recommended N | Reason
--------------------------|---------------|---------------------------------------------------
Volatile (e.g. CPU usage) | N = 3         | Frequent spikes are normal; avoid false positives
Stable (e.g. disk usage)  | N = 1         | Disk rarely spikes; any breach is meaningful
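
A minimal sketch of N-of-M evaluation in Python, assuming a fixed window of M = 5 check cycles (the thresholds and metric values are illustrative):

from collections import deque

def n_of_m_breached(samples, threshold, n):
    """Return True if at least n of the given samples exceed the threshold."""
    return sum(1 for v in samples if v > threshold) >= n

M = 5
cpu_window = deque(maxlen=M)   # keep only the last M check cycles

# Volatile metric (CPU): require N = 3 breaches out of 5 cycles.
for value in [42.0, 93.5, 91.2, 55.0, 95.8]:
    cpu_window.append(value)
print(n_of_m_breached(cpu_window, threshold=90.0, n=3))   # True -> ALERT

# Stable metric (disk): any single breach (N = 1) is meaningful.
disk_window = deque([71.0, 72.0, 96.5], maxlen=M)
print(n_of_m_breached(disk_window, threshold=95.0, n=1))  # True -> ALERT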

Recovery Detection

Recovery uses the inverse of the trigger condition:

Trigger condition:  "5 cycles, satisfied ≥ 3 times" → ALERT
Recovery condition: "5 cycles, satisfied  < 3 times" → RESOLVED

This prevents alert flapping when a metric hovers near the threshold.
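
A sketch of the paired trigger/recovery check, again over a 5-cycle window (values and threshold are illustrative). Because recovery is the exact inverse of the trigger, the state only flips when the rolling breach count crosses the N boundary, not on every sample:

from collections import deque

def evaluate(window, threshold, n, currently_firing):
    """Flip state using N-of-M as the trigger and its inverse as the recovery."""
    breaches = sum(1 for v in window if v > threshold)
    if not currently_firing and breaches >= n:
        return True          # ALERT: at least n of the last m cycles breached
    if currently_firing and breaches < n:
        return False         # RESOLVED: fewer than n breaches remain in the window
    return currently_firing  # otherwise keep the current state

firing = False
window = deque(maxlen=5)
for value in [80, 96, 97, 95, 80, 80, 80]:   # hovers around a threshold of 90
    window.append(value)
    firing = evaluate(window, threshold=90, n=3, currently_firing=firing)
    print(value, "FIRING" if firing else "ok")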


Alert States (Alertmanager Model)

┌──────────────────────────────────────────────────────────────┐
│                     Alert State Machine                      │
│                                                              │
│   Inactive ──threshold crossed──▶ Pending                   │
│                                      │                      │
│                              duration satisfied             │
│                                      │                      │
│                                      ▼                      │
│                                   Firing ──▶ Notification   │
│                                      │       Pipeline       │
│                              condition clears               │
│                                      │                      │
│                                      ▼                      │
│                                  Inactive                   │
└──────────────────────────────────────────────────────────────┘
State    | Meaning
---------|------------------------------------------------------------------
Inactive | Threshold not exceeded; no alert
Pending  | Threshold exceeded, but the `for` duration is not yet satisfied
Firing   | Threshold exceeded AND the `for` duration satisfied → alert sent

The `for` duration requirement ensures transient spikes don't generate noise.
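
A sketch of this state machine in Python, using check cycles in place of wall-clock time for the `for` duration (threshold, duration, and sample values are illustrative):

import enum

class State(enum.Enum):
    INACTIVE = "inactive"
    PENDING = "pending"
    FIRING = "firing"

class Alert:
    def __init__(self, threshold, for_cycles):
        self.threshold = threshold
        self.for_cycles = for_cycles     # the "for" duration, in check cycles
        self.state = State.INACTIVE
        self.pending_cycles = 0

    def observe(self, value):
        if value <= self.threshold:
            # Condition clears: fall back to Inactive from any state.
            self.state = State.INACTIVE
            self.pending_cycles = 0
        elif self.state is State.INACTIVE:
            # Threshold crossed: start waiting out the "for" duration.
            self.state = State.PENDING
            self.pending_cycles = 1
        elif self.state is State.PENDING:
            self.pending_cycles += 1
            if self.pending_cycles >= self.for_cycles:
                self.state = State.FIRING   # duration satisfied -> notify
        return self.state

alert = Alert(threshold=0.9, for_cycles=2)
for v in [0.5, 0.95, 0.7, 0.95, 0.96, 0.4]:
    print(v, alert.observe(v).value)
# A single 0.95 spike only reaches Pending; two consecutive breaches reach Firing.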


Alert Convergence Strategies

Without convergence, a single incident can generate thousands of alerts — flooding inboxes and causing alert fatigue.

Three Core Mechanisms

Mechanism  | Description
-----------|------------------------------------------------------------------
Grouping   | Bundle similar alerts into a single notification
Inhibition | When alert A fires, suppress related alerts B and C
Silencing  | Mute alerts for a specific time window (e.g. during maintenance)
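
A sketch of the three mechanisms applied in sequence, assuming each alert carries severity, cluster, and service labels (the label names, the inhibition rule, and the maintenance window are illustrative, not any particular platform's configuration):

from collections import defaultdict

alerts = [
    {"name": "HighErrorRate", "severity": "critical", "cluster": "sh-1", "service": "order"},
    {"name": "HighLatency",   "severity": "warning",  "cluster": "sh-1", "service": "order"},
    {"name": "HighLatency",   "severity": "warning",  "cluster": "sh-1", "service": "payment"},
    {"name": "DiskFull",      "severity": "warning",  "cluster": "bj-2", "service": "search"},
]

# Silencing: mute everything from a cluster that is under maintenance.
silenced_clusters = {"bj-2"}
alerts = [a for a in alerts if a["cluster"] not in silenced_clusters]

# Inhibition: a critical alert in a cluster suppresses warnings in the same cluster.
critical_clusters = {a["cluster"] for a in alerts if a["severity"] == "critical"}
alerts = [a for a in alerts
          if not (a["severity"] == "warning" and a["cluster"] in critical_clusters)]

# Grouping: everything that survives is bundled into one notification per group key.
groups = defaultdict(list)
for a in alerts:
    groups[(a["cluster"], a["service"])].append(a)

for key, members in groups.items():
    print(f"notify once for group {key}: {[m['name'] for m in members]}")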

Convergence Rules

Scenario: metric continuously exceeds threshold for hours
Without convergence: hundreds of duplicate alert emails
With convergence:    one alert email, then silence for N hours

┌──────────────────────────────────────────────────────────┐
│  Alert fires at T=0                                      │
│  ├── notify once                                         │
│  ├── suppress for 24h if condition persists unresolved   │
│  └── re-notify after 24h if still unresolved             │
└──────────────────────────────────────────────────────────┘
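
A sketch of the notify-once-then-suppress behaviour, keyed by an alert fingerprint (the 24-hour repeat interval and the fingerprint format are illustrative):

import time

REPEAT_INTERVAL = 24 * 3600          # suppress duplicate notifications for 24h
last_notified = {}                   # alert fingerprint -> last notify timestamp

def maybe_notify(fingerprint, now=None):
    """Send at most one notification per fingerprint per repeat interval."""
    now = now if now is not None else time.time()
    last = last_notified.get(fingerprint)
    if last is None or now - last >= REPEAT_INTERVAL:
        last_notified[fingerprint] = now
        return True      # notify
    return False         # still inside the suppression window

print(maybe_notify("cpu>90%@web-01", now=0))          # True  -> first email
print(maybe_notify("cpu>90%@web-01", now=3600))       # False -> suppressed
print(maybe_notify("cpu>90%@web-01", now=25 * 3600))  # True  -> re-notify after 24h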

No-data alerting:

  • If a metric suddenly stops reporting → trigger a "no data" alert
  • ⚠️ Only enable this for metrics with consistent, continuous reporting; disable for intermittent sources
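
A sketch of a no-data check, assuming the platform records the last time each metric reported (the interval and metric names are illustrative); only metrics marked as continuously reporting are eligible:

import time

MAX_SILENCE = 3 * 60   # expected report interval is 60s; alert after ~3 missed reports
last_seen = {"order.qps": time.time() - 500,
             "batch.job.duration": time.time() - 500}

# Only metrics known to report continuously are eligible for no-data alerting.
continuous_metrics = {"order.qps"}

now = time.time()
for metric, ts in last_seen.items():
    if metric in continuous_metrics and now - ts > MAX_SILENCE:
        print(f"NO-DATA alert: {metric} has not reported for {int(now - ts)}s")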

Three Generations of Alerting Architecture

Generation 1: Polling-Based Alerts

crontab ──periodic scan──▶ metrics store ──threshold check──▶ alert

Problems:

Issue           | Description
----------------|--------------------------------------------------------------------
Latency         | Data is already stale by the time it's polled from storage
Inflexibility   | Fixed rules and thresholds can't adapt to growing business scale
False positives | Brief network jitter triggers alerts on static thresholds

Generation 2: Streaming Alerts

Data flows through a stream processor instead of being stored first:

Incoming metrics stream
     │
     ▼
Stream processing engine
(evaluate rules in real-time)
     │
     ├── threshold met ──▶ push to Redis queue ──▶ alert consumer
     └── threshold not met ──▶ discard / expire from queue
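
A sketch of the streaming evaluation path, with a local queue.Queue standing in for the Redis queue (rule set, field names, and sample points are illustrative):

from queue import Queue

RULES = {"response_time_ms": 500, "error_rate": 0.05}   # metric -> threshold
alert_queue = Queue()   # stand-in for the Redis queue in front of alert consumers

def on_point(point):
    """Evaluate the rule as the data point flows through, with no storage round-trip."""
    threshold = RULES.get(point["metric"])
    if threshold is not None and point["value"] > threshold:
        alert_queue.put(point)   # threshold met -> hand off to an alert consumer
    # otherwise the point is simply discarded by the alerting path

stream = [
    {"metric": "response_time_ms", "value": 120,  "service": "order"},
    {"metric": "response_time_ms", "value": 830,  "service": "order"},
    {"metric": "error_rate",       "value": 0.11, "service": "payment"},
]
for point in stream:
    on_point(point)

while not alert_queue.empty():
    print("alert consumer received:", alert_queue.get())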

Four key advantages:

① Real-time
Rules are evaluated as data flows through — no storage round-trip.

② Hierarchical rule abstraction

Alert hierarchy (bottom-up convergence):

Interface level
└── auto-set threshold based on SLA/KPI
     │ converge up
Service pool level
└── warning: >5% of IPs exceed SLA
    error:   >10% of IPs exceed SLA
     │ converge up
Business line level
└── warning/error/fatal based on service pool states
     │ converge up
Global dashboard (9-grid view)
└── macro-level availability overview
    drill down to find root cause

Alerts that don't breach the next level's SLA are shown in dashboards only — not sent to users. This dramatically reduces notification noise.
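
A sketch of the bottom-up convergence, using the 5% / 10% pool thresholds from the hierarchy above; escalating a business line to fatal when more than one pool is in error is an assumption for illustration only:

def pool_state(ip_breaches, total_ips):
    """Converge interface-level SLA breaches up to a service-pool level state."""
    ratio = ip_breaches / total_ips
    if ratio > 0.10:
        return "error"      # > 10% of IPs exceed the SLA
    if ratio > 0.05:
        return "warning"    # > 5% of IPs exceed the SLA
    return "ok"             # shown on dashboards only, no notification

def business_line_state(pool_states):
    """Converge service-pool states up to the business line."""
    errors = pool_states.count("error")
    if errors > 1:
        return "fatal"      # assumption: multiple erroring pools escalate to fatal
    if errors == 1:
        return "error"
    if "warning" in pool_states:
        return "warning"
    return "ok"

pools = {"order": pool_state(12, 100), "payment": pool_state(3, 100)}
print(pools)                                      # {'order': 'error', 'payment': 'ok'}
print(business_line_state(list(pools.values())))  # 'error'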

③ Automatic rule extraction

As long as incoming data follows the protocol (includes business line, service pool, etc.), alert rules are automatically generated and stored in MySQL. Dashboards, emails, and push notifications are auto-configured — no manual intervention needed.

Auto-extracted rules cover ~80% of alerting needs. Custom rules handle the remaining 20% (e.g. sudden spike/drop, week-over-week comparison).
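
A sketch of rule auto-extraction, with a plain list standing in for the MySQL rule table and assumed SLA defaults per metric (field names and thresholds are illustrative):

DEFAULT_THRESHOLDS = {"response_time_ms": 500, "error_rate": 0.05}  # assumed SLA defaults

rule_store = []   # stand-in for the MySQL table holding generated rules

def extract_rules(point):
    """Auto-generate a rule the first time a (business line, pool, metric) combination appears."""
    key = (point["business_line"], point["service_pool"], point["metric"])
    if any(r["key"] == key for r in rule_store):
        return
    threshold = DEFAULT_THRESHOLDS.get(point["metric"])
    if threshold is not None:
        rule_store.append({"key": key, "threshold": threshold,
                           "notify": ["dashboard", "email"]})

extract_rules({"business_line": "trade", "service_pool": "order-service",
               "metric": "response_time_ms", "value": 210})
print(rule_store)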

④ Bulk rule import via API

POST /api/v1/alert-rules/batch
Content-Type: application/json

[
  { "metric": "response_time", "threshold": 500, "pool": "order-service" },
  { "metric": "error_rate",    "threshold": 0.05, "pool": "payment-service" }
]

Useful when teams have their own alert logic abstractions that differ from the platform's auto-extraction model.
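
A sketch of a client pushing such a batch with the requests library (the host name is a placeholder; the endpoint and payload follow the example above, and authentication is omitted):

import requests

rules = [
    {"metric": "response_time", "threshold": 500,  "pool": "order-service"},
    {"metric": "error_rate",    "threshold": 0.05, "pool": "payment-service"},
]

# Push a team's own rule abstractions to the platform in one call.
resp = requests.post("https://monitoring.example.com/api/v1/alert-rules/batch",
                     json=rules, timeout=5)
resp.raise_for_status()
print(resp.status_code)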


Generation 3: Intelligent Alerts

Intelligent alerting uses algorithms to continuously reduce alert volume (less noise) and improve alert quality (fewer false positives).

Four core techniques:

Technique            | Description
---------------------|---------------------------------------------------------
Alert correlation    | Group related alerts into a single "incident event"
Deduplication        | Merge repeated alerts from the same root cause
Auto-recovery        | Automatically close alerts when conditions normalize
Alert classification | Categorize alerts by type for smarter routing
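
A sketch of deduplication plus auto-recovery, keyed by a label-based fingerprint (the label names and the incident record shape are illustrative):

open_incidents = {}   # fingerprint -> incident record

def fingerprint(alert):
    """Alerts with identical identifying labels share one fingerprint."""
    return (alert["name"], alert["service"], alert["cluster"])

def ingest(alert):
    fp = fingerprint(alert)
    if alert["status"] == "resolved":
        # Auto-recovery: close the incident when the condition normalizes.
        open_incidents.pop(fp, None)
        return
    if fp in open_incidents:
        open_incidents[fp]["count"] += 1      # deduplication: just bump the counter
    else:
        open_incidents[fp] = {"alert": alert, "count": 1}

ingest({"name": "HighLatency", "service": "order", "cluster": "sh-1", "status": "firing"})
ingest({"name": "HighLatency", "service": "order", "cluster": "sh-1", "status": "firing"})
ingest({"name": "HighLatency", "service": "order", "cluster": "sh-1", "status": "resolved"})
print(open_incidents)   # {} -> deduplicated while firing, auto-closed on recovery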

Data Aggregation & Dimensionality Reduction

Why Aggregation Matters

Without aggregation:

Every raw data point stored and queried individually:
Write volume:  O(N) raw points per second
Read volume:   O(N) raw points → compute result at query time

With aggregation:

Pre-aggregate at write time:
Write volume:  O(1) aggregated value per window
Read volume:   O(1) pre-computed result

In high-volume monitoring scenarios, this difference is critical for system stability.
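
A sketch of write-time pre-aggregation, keeping one running bucket per metric per window so that reads return a pre-computed value (window size and metric names are illustrative):

from collections import defaultdict

WINDOW = 60   # seconds per aggregation bucket

# bucket key -> running aggregate, updated on every write
buckets = defaultdict(lambda: {"count": 0, "sum": 0.0, "max": float("-inf")})

def write(metric, timestamp, value):
    """O(1) update of the pre-aggregated bucket instead of storing the raw point."""
    b = buckets[(metric, timestamp // WINDOW)]
    b["count"] += 1
    b["sum"] += value
    b["max"] = max(b["max"], value)

def read_avg(metric, timestamp):
    """O(1) read of the pre-computed result for the window."""
    b = buckets[(metric, timestamp // WINDOW)]
    return b["sum"] / b["count"]

for i, v in enumerate([120, 340, 95, 510]):
    write("response_time_ms", 1_700_000_000 + i, v)
print(read_avg("response_time_ms", 1_700_000_000))   # 266.25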

Two Aggregation Dimensions

① Computational aggregation

Raw data points ──▶ avg / sum / max / min / percentile / sampling

② Multi-dimensional aggregation (labels/tags)

Dimensions: time | location | business line | service | interface

Example query:
"Average response time of the payment service
 in IDC-Shanghai over the last 5 minutes"
= aggregate by [time=5m, location=shanghai, service=payment]

Tags/labels are the key to slicing monitoring data across different perspectives.
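
A sketch of slicing by tags, assuming each point carries location, service, and interface labels (the data and tag names are illustrative):

points = [
    {"value": 180, "location": "shanghai", "service": "payment", "interface": "pay"},
    {"value": 240, "location": "shanghai", "service": "payment", "interface": "refund"},
    {"value": 410, "location": "beijing",  "service": "payment", "interface": "pay"},
    {"value":  95, "location": "shanghai", "service": "order",   "interface": "create"},
]

def avg_by(points, **tags):
    """Average the value of all points whose tags match the given filter."""
    selected = [p["value"] for p in points
                if all(p.get(k) == v for k, v in tags.items())]
    return sum(selected) / len(selected) if selected else None

# "Average response time of the payment service in IDC-Shanghai"
print(avg_by(points, location="shanghai", service="payment"))   # 210.0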

Dimensionality Reduction for Alert Storms

Raw alert events per day:  10,000+
After convergence & aggregation:  < 10 actionable items

Reduction pipeline:
10,000 raw alerts
     │ deduplication
     ▼
2,000 unique alerts
     │ correlation (same root cause)
     ▼
200 incident groups
     │ SLA-based suppression (doesn't breach next level)
     ▼
< 10 alerts sent to on-call engineers

The goal: surface only what requires human intervention, while keeping full detail available for drill-down.
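
A compact sketch of the funnel on synthetic alerts, where deduplication keys on a fingerprint, correlation keys on a shared root cause, and suppression keeps only alerts that breach the next level's SLA (the data and the reduction ratios are illustrative):

raw = [{"fingerprint": f"disk-full@host-{i % 40}",
        "root_cause": f"switch-{i % 8}",
        "breaches_sla": i % 8 == 0}
       for i in range(10_000)]                                   # 10,000 raw alert events

unique = list({a["fingerprint"]: a for a in raw}.values())       # deduplication
incident_groups = {a["root_cause"] for a in unique}              # correlation by root cause
actionable = [a for a in unique if a["breaches_sla"]]            # SLA-based suppression

print(len(raw), "->", len(unique), "->", len(incident_groups), "->", len(actionable))
# 10000 -> 40 -> 8 -> 5 : only a handful of items reaches the on-call engineer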


Architecture Summary

┌──────────────────────────────────────────────────────────────┐
│                  Monitoring & Alerting Platform              │
│                                                              │
│  Data Collection                                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐                   │
│  │ Metrics  │  │ Tracing  │  │ Logging  │                   │
│  └────┬─────┘  └────┬─────┘  └────┬─────┘                   │
│       └─────────────┴─────────────┘                         │
│                      │                                       │
│                      ▼                                       │
│            Stream Processing Engine                          │
│         (real-time rule evaluation)                          │
│                      │                                       │
│          ┌───────────┴───────────┐                          │
│          ▼                       ▼                          │
│   Alert Queue (Redis)     Aggregated Store                   │
│          │                                                   │
│          ▼                                                   │
│   Convergence Engine                                         │
│   (group / inhibit / silence / deduplicate)                  │
│          │                                                   │
│          ▼                                                   │
│   Notification Pipeline                                      │
│   (email / SMS / webhook / PagerDuty)                        │
└──────────────────────────────────────────────────────────────┘
