The Three Pillars of Observability
A complete monitoring system requires three complementary components:
┌─────────────────────────────────────────────────────────────┐
│ Metrics — what is happening (numeric measurements) │
│ Tracing — where it is happening (request lifecycle) │
│ Logging — why it is happening (detailed context) │
└─────────────────────────────────────────────────────────────┘
| Pillar | Purpose | Examples |
|---|---|---|
| Metrics | Numeric measurements over time; 5 types: Gauge, Counter, Histogram, Timer, Meter | CPU usage, request rate, error rate |
| Tracing | Full lifecycle of a request across distributed services | Distributed trace spanning API → DB → Cache |
| Logging | Detailed runtime information: method inputs, exceptions, stack traces | Application error logs, audit logs |
How They Work Together
Alert fires (Metrics threshold exceeded)
│
▼
Trace the suspicious request path (Tracing)
│
▼
Drill into detailed logs of the failing module (Logging)
│
▼
Identify root cause
│
▼
Tune alert rules for earlier detection next time (Metrics)
None of the three can replace the others. Together they form a complete feedback loop from detection → diagnosis → prevention.
Detection Algorithms
Static Threshold Detection
Best suited for baseline performance metrics (CPU, memory, disk, etc.).
Recommended configuration: N-of-M detection
Trigger an alert only when the threshold is exceeded at least N times within M consecutive check cycles.
Example: "5 cycles, satisfied 3 times"
Cycle: 1 2 3 4 5
Value: [OK] [!!] [!!] [OK] [!!] → 3/5 exceeded → ALERT ✅
Cycle: 1 2 3 4 5
Value: [OK] [!!] [OK] [OK] [OK] → 1/5 exceeded → no alert ✅
Tuning guidance:
| Metric Type | Recommended N | Reason |
|---|---|---|
| Volatile (e.g. CPU usage) | N = 3 | Frequent spikes are normal; avoid false positives |
| Stable (e.g. disk usage) | N = 1 | Disk rarely spikes; any breach is meaningful |
Recovery Detection
Recovery uses the inverse of the trigger condition:
Trigger condition: "5 cycles, satisfied ≥ 3 times" → ALERT
Recovery condition: "5 cycles, satisfied < 3 times" → RESOLVED
This prevents alert flapping when a metric hovers near the threshold.
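A minimal Python sketch of N-of-M evaluation together with the inverse recovery condition (the threshold, window size, and sample values below are illustrative):

```python
from collections import deque

class NofMDetector:
    """Fire when the threshold is breached at least n times in the last m checks;
    resolve when breaches drop back below n in the same window."""

    def __init__(self, threshold: float, n: int = 3, m: int = 5):
        self.threshold = threshold
        self.n = n
        self.window = deque(maxlen=m)   # rolling window of the last m check results
        self.firing = False

    def check(self, value: float) -> str:
        self.window.append(value > self.threshold)
        breaches = sum(self.window)
        if not self.firing and breaches >= self.n:
            self.firing = True
            return "ALERT"
        if self.firing and breaches < self.n:   # inverse condition -> recovery
            self.firing = False
            return "RESOLVED"
        return "FIRING" if self.firing else "OK"

# Example: CPU usage (%) sampled once per cycle, threshold 80, "5 cycles, 3 times"
detector = NofMDetector(threshold=80, n=3, m=5)
for sample in [70, 85, 90, 75, 88, 60, 65, 70]:
    print(sample, detector.check(sample))
```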
Alert States (Alertmanager Model)
┌──────────────────────────────────────────────────────────────┐
│ Alert State Machine │
│ │
│ Inactive ──threshold crossed──▶ Pending │
│ │ │
│ duration satisfied │
│ │ │
│ ▼ │
│ Firing ──▶ Notification │
│ │ Pipeline │
│ condition clears │
│ │ │
│ ▼ │
│ Inactive │
└──────────────────────────────────────────────────────────────┘
| State | Meaning |
|---|---|
| Inactive | Threshold not exceeded; no alert |
| Pending | Threshold exceeded, but the `for` duration has not yet elapsed |
| Firing | Threshold exceeded AND duration satisfied → alert sent |
The `for` duration requirement ensures transient spikes don't generate noise.
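A minimal Python sketch of these transitions, assuming a fixed evaluation interval and an illustrative `for` duration of three cycles:

```python
from enum import Enum

class State(Enum):
    INACTIVE = "Inactive"
    PENDING = "Pending"
    FIRING = "Firing"

class AlertRule:
    """Threshold crossing moves Inactive -> Pending; holding the breach for
    `for_cycles` consecutive evaluations moves Pending -> Firing; clearing
    the condition returns the rule to Inactive."""

    def __init__(self, threshold: float, for_cycles: int = 3):
        self.threshold = threshold
        self.for_cycles = for_cycles
        self.state = State.INACTIVE
        self.breached_cycles = 0

    def evaluate(self, value: float) -> State:
        if value > self.threshold:
            self.breached_cycles += 1
            if self.state == State.INACTIVE:
                self.state = State.PENDING
            if self.state == State.PENDING and self.breached_cycles >= self.for_cycles:
                self.state = State.FIRING      # hand off to the notification pipeline
        else:
            self.breached_cycles = 0
            self.state = State.INACTIVE        # condition cleared
        return self.state

rule = AlertRule(threshold=500, for_cycles=3)  # e.g. response time in ms
for v in [480, 510, 520, 530, 490]:
    print(v, rule.evaluate(v).value)
# 480 Inactive, 510 Pending, 520 Pending, 530 Firing, 490 Inactive
```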
Alert Convergence Strategies
Without convergence, a single incident can generate thousands of alerts — flooding inboxes and causing alert fatigue.
Three Core Mechanisms
| Mechanism | Description |
|---|---|
| Grouping | Classify similar alerts into a single notification |
| Inhibition | When alert A fires, suppress related alerts B and C |
| Silencing | Mute alerts for a specific time window (e.g. during maintenance) |
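A toy Python sketch of the three mechanisms applied to a batch of alerts (the label names, the inhibition rule, and the maintenance-window silence are illustrative, not Alertmanager's actual configuration syntax):

```python
from collections import defaultdict

alerts = [
    {"name": "HighErrorRate", "service": "payment", "severity": "critical"},
    {"name": "HighLatency",   "service": "payment", "severity": "warning"},
    {"name": "HighLatency",   "service": "order",   "severity": "warning"},
    {"name": "DiskFull",      "service": "order",   "severity": "critical"},
]

# Silencing: mute everything for a service during its maintenance window.
silenced_services = {"order"}
alerts = [a for a in alerts if a["service"] not in silenced_services]

# Inhibition: if a critical alert fires for a service, suppress its warnings.
critical_services = {a["service"] for a in alerts if a["severity"] == "critical"}
alerts = [a for a in alerts
          if not (a["severity"] == "warning" and a["service"] in critical_services)]

# Grouping: bundle whatever remains into one notification per service.
groups = defaultdict(list)
for a in alerts:
    groups[a["service"]].append(a["name"])

for service, names in groups.items():
    print(f"{service}: {', '.join(names)}")   # one grouped notification per service
```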
Convergence Rules
Scenario: metric continuously exceeds threshold for hours
Without convergence: hundreds of duplicate alert emails
With convergence: one alert email, then silence for N hours
┌──────────────────────────────────────────────────────────┐
│ Alert fires at T=0 │
│ ├── notify once │
│ ├── suppress for 24h if condition persists unresolved │
│ └── re-notify after 24h if still unresolved │
└──────────────────────────────────────────────────────────┘
No-data alerting:
- If a metric suddenly stops reporting → trigger a "no data" alert
- ⚠️ Only enable this for metrics with consistent, continuous reporting; disable for intermittent sources
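A small sketch of such a check; the scrape interval and tolerated gap are assumptions, and `last_seen` stands in for whatever bookkeeping the platform keeps per metric:

```python
import time

# Hypothetical map of metric name -> timestamp of the last received data point.
last_seen = {
    "payment.response_time": time.time() - 30,    # reported 30s ago
    "order.request_rate":    time.time() - 900,   # reported 15min ago
}

SCRAPE_INTERVAL = 60   # expected reporting period (seconds), assumed
MISSED_CYCLES = 3      # tolerate a few missed cycles before alerting

def no_data_alerts(now: float) -> list[str]:
    """Return metrics that have been silent longer than the allowed gap."""
    max_gap = SCRAPE_INTERVAL * MISSED_CYCLES
    return [name for name, ts in last_seen.items() if now - ts > max_gap]

print(no_data_alerts(time.time()))   # ['order.request_rate']
```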
Three Generations of Alerting Architecture
Generation 1: Polling-Based Alerts
crontab ──periodic scan──▶ metrics store ──threshold check──▶ alert
Problems:
| Issue | Description |
|---|---|
| Latency | Data is already stale by the time it's polled from storage |
| Inflexibility | Fixed rules and thresholds can't adapt to growing business scale |
| False positives | Brief network jitter triggers alerts on static thresholds |
Generation 2: Streaming Alerts
Data flows through a stream processor instead of being stored first:
Incoming metrics stream
│
▼
Stream processing engine
(evaluate rules in real-time)
│
├── threshold met ──▶ push to Redis queue ──▶ alert consumer
└── threshold not met ──▶ discard / expire from queue
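A minimal sketch of this streaming path: rules are evaluated inline as each point arrives, and matches are pushed to a queue for the alert consumer (a plain `deque` stands in for the Redis queue here, and the rule format is illustrative):

```python
from collections import deque

# Stand-in for the Redis alert queue consumed by the notification service.
alert_queue = deque()

# Illustrative rule set: metric name -> threshold.
rules = {"response_time": 500, "error_rate": 0.05}

def on_data_point(point: dict) -> None:
    """Evaluate rules inline as the point flows through, no storage round-trip."""
    threshold = rules.get(point["metric"])
    if threshold is not None and point["value"] > threshold:
        alert_queue.append({**point, "threshold": threshold})
    # Points below threshold are simply not enqueued (discarded / left to expire).

# Simulated incoming stream
for p in [
    {"metric": "response_time", "value": 620, "pool": "order-service"},
    {"metric": "error_rate",    "value": 0.01, "pool": "payment-service"},
]:
    on_data_point(p)

print(list(alert_queue))   # only the 620ms response_time point is enqueued
```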
Four key advantages:
① Real-time
Rules are evaluated as data flows through — no storage round-trip.
② Hierarchical rule abstraction
Alert hierarchy (bottom-up convergence):
Interface level
└── auto-set threshold based on SLA/KPI
│ converge up
Service pool level
└── warning: >5% of IPs exceed SLA
error: >10% of IPs exceed SLA
│ converge up
Business line level
└── warning/error/fatal based on service pool states
│ converge up
Global dashboard (9-grid view)
└── macro-level availability overview
drill down to find root cause
Alerts that don't breach the next level's SLA are shown in dashboards only — not sent to users. This dramatically reduces notification noise.
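A sketch of the service-pool convergence rule above: per-IP SLA results roll up into a single pool state, and anything below the cut-offs stays on the dashboard (the 5%/10% cut-offs come from the hierarchy; the IPs are made up):

```python
def pool_state(ip_sla_breached: dict[str, bool]) -> str:
    """Converge per-IP SLA results into one service-pool state."""
    ratio = sum(ip_sla_breached.values()) / len(ip_sla_breached)
    if ratio > 0.10:
        return "error"     # >10% of IPs exceed SLA
    if ratio > 0.05:
        return "warning"   # >5% of IPs exceed SLA
    return "ok"            # stays on the dashboard only, no notification

ips = {f"10.0.0.{i}": False for i in range(1, 21)}   # 20 instances in the pool
ips["10.0.0.3"] = True
print(pool_state(ips))    # 1/20 = 5%  -> ok (not strictly above the 5% cut-off)
ips["10.0.0.7"] = True
ips["10.0.0.9"] = True
print(pool_state(ips))    # 3/20 = 15% -> error
```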
③ Automatic rule extraction
As long as incoming data follows the protocol (includes business line, service pool, etc.), alert rules are automatically generated and stored in MySQL. Dashboards, emails, and push notifications are auto-configured — no manual intervention needed.
Auto-extracted rules cover ~80% of alerting needs. Custom rules handle the remaining 20% (e.g. sudden spike/drop, week-over-week comparison).
④ Bulk rule import via API
POST /api/v1/alert-rules/batch
Content-Type: application/json

[
  { "metric": "response_time", "threshold": 500, "pool": "order-service" },
  { "metric": "error_rate", "threshold": 0.05, "pool": "payment-service" }
]
Useful when teams have their own alert logic abstractions that differ from the platform's auto-extraction model.
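A small client-side sketch of the bulk import call using only the Python standard library; the endpoint path and payload fields come from the example above, while the host name is a placeholder:

```python
import json
import urllib.request

rules = [
    {"metric": "response_time", "threshold": 500, "pool": "order-service"},
    {"metric": "error_rate", "threshold": 0.05, "pool": "payment-service"},
]

# "monitoring.example.com" is a placeholder; point this at your alerting platform.
req = urllib.request.Request(
    "https://monitoring.example.com/api/v1/alert-rules/batch",
    data=json.dumps(rules).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```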
Generation 3: Intelligent Alerts
Intelligent alerting uses algorithms to continuously reduce alert volume (less noise) and improve alert quality (fewer false positives).
Four core techniques:
| Technique | Description |
|---|---|
| Alert correlation | Group related alerts into a single "incident event" |
| Deduplication | Merge repeated alerts from the same root cause |
| Auto-recovery | Automatically close alerts when conditions normalize |
| Alert classification | Categorize alerts by type for smarter routing |
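A toy sketch of the first two techniques, deduplication by fingerprint and correlation into incident events; the fingerprint fields and the correlation key (alerts from the same service) are assumptions:

```python
from collections import defaultdict

raw_alerts = [
    {"name": "HighLatency",   "service": "payment", "instance": "10.0.0.3"},
    {"name": "HighLatency",   "service": "payment", "instance": "10.0.0.3"},  # duplicate
    {"name": "HighLatency",   "service": "payment", "instance": "10.0.0.4"},
    {"name": "HighErrorRate", "service": "payment", "instance": "10.0.0.4"},
]

# Deduplication: identical fingerprints collapse to one alert.
def fingerprint(a: dict) -> tuple:
    return (a["name"], a["service"], a["instance"])

unique = {fingerprint(a): a for a in raw_alerts}.values()

# Correlation: alerts sharing a root-cause key become one incident event.
incidents = defaultdict(list)
for a in unique:
    incidents[a["service"]].append(a["name"])

for service, names in incidents.items():
    print(f"incident[{service}]: {sorted(set(names))}")
# incident[payment]: ['HighErrorRate', 'HighLatency']
```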
Data Aggregation & Dimensionality Reduction
Why Aggregation Matters
Without aggregation:
Every raw data point stored and queried individually:
Write volume: O(N) raw points per second
Read volume: O(N) raw points → compute result at query time
With aggregation:
Pre-aggregate at write time:
Write volume: O(1) aggregated value per window
Read volume: O(1) pre-computed result
In high-volume monitoring scenarios, this difference is critical for system stability.
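A minimal sketch of write-time aggregation: raw points are folded into a per-window running sum as they arrive, so a read returns one pre-computed value per window (the window size and values are illustrative):

```python
WINDOW = 60  # seconds per aggregation window, assumed

# window start timestamp -> (sum, count), i.e. O(1) state per window
windows: dict[int, tuple[float, int]] = {}

def write(timestamp: int, value: float) -> None:
    """Fold the raw point into its window instead of storing it individually."""
    start = timestamp - timestamp % WINDOW
    s, c = windows.get(start, (0.0, 0))
    windows[start] = (s + value, c + 1)

def read_avg(window_start: int) -> float:
    """Read back one pre-computed value, no per-point scan at query time."""
    s, c = windows[window_start]
    return s / c

for ts, v in [(120, 200.0), (130, 400.0), (185, 300.0)]:
    write(ts, v)

print(read_avg(120))   # 300.0 (average of the two points in the 120-180s window)
```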
Two Aggregation Dimensions
① Computational aggregation
Raw data points ──▶ avg / sum / max / min / percentile / sampling
② Multi-dimensional aggregation (labels/tags)
Dimensions: time | location | business line | service | interface
Example query:
"Average response time of the payment service
in IDC-Shanghai over the last 5 minutes"
= aggregate by [time=5m, location=shanghai, service=payment]
Tags/labels are the key to slicing monitoring data across different perspectives.
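A sketch of that tag-sliced query, filtering points by labels before averaging; the data points and label names are made up:

```python
points = [
    {"service": "payment", "location": "shanghai", "t": 100, "response_time": 210},
    {"service": "payment", "location": "shanghai", "t": 160, "response_time": 250},
    {"service": "payment", "location": "beijing",  "t": 120, "response_time": 180},
    {"service": "order",   "location": "shanghai", "t": 140, "response_time": 320},
]

def avg_by_tags(points: list[dict], window: tuple[int, int], **tags) -> float:
    """Average response_time for points inside the time window that match every tag."""
    lo, hi = window
    selected = [
        p["response_time"] for p in points
        if lo <= p["t"] < hi and all(p.get(k) == v for k, v in tags.items())
    ]
    return sum(selected) / len(selected)

# "Average response time of the payment service in Shanghai over a 100-400s window"
print(avg_by_tags(points, (100, 400), service="payment", location="shanghai"))  # 230.0
```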
Dimensionality Reduction for Alert Storms
Raw alert events per day: 10,000+
After convergence & aggregation: < 10 actionable items
Reduction pipeline:
10,000 raw alerts
│ deduplication
▼
2,000 unique alerts
│ correlation (same root cause)
▼
200 incident groups
│ SLA-based suppression (doesn't breach next level)
▼
< 10 alerts sent to on-call engineers
The goal: surface only what requires human intervention, while keeping full detail available for drill-down.
Architecture Summary
┌──────────────────────────────────────────────────────────────┐
│ Monitoring & Alerting Platform │
│ │
│ Data Collection │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Metrics │ │ Tracing │ │ Logging │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ └─────────────┴─────────────┘ │
│ │ │
│ ▼ │
│ Stream Processing Engine │
│ (real-time rule evaluation) │
│ │ │
│ ┌───────────┴───────────┐ │
│ ▼ ▼ │
│ Alert Queue (Redis) Aggregated Store │
│ │ │
│ ▼ │
│ Convergence Engine │
│ (group / inhibit / silence / deduplicate) │
│ │ │
│ ▼ │
│ Notification Pipeline │
│ (email / SMS / webhook / PagerDuty) │
└──────────────────────────────────────────────────────────────┘