Quame Jnr

Posted on Nov 6

Prometheus: The Essential Guide to Monitoring Systems

#monitoring #tooling #devops #opensource

What is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. It collects metrics as time series data: numerical measurements stored with a timestamp and optional key-value pairs called labels. It joined the Cloud Native Computing Foundation in 2016 and is now a standalone project with an active community, maintained independently of any company.
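
To make the data model concrete, here is a minimal Python sketch of how a time series can be identified by its metric name plus its label set. The `Sample` class and the `series_id` and `make_sample` helpers are hypothetical names chosen for illustration; they are not part of any Prometheus library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One measurement: a value observed at a point in time (illustrative only)."""
    name: str          # metric name, e.g. "http_requests_total"
    labels: tuple      # sorted (key, value) pairs, so equal label sets compare equal
    timestamp_ms: int  # when the value was observed
    value: float       # the numerical measurement


def series_id(name, labels):
    """Render the identity of a series: metric name plus its sorted labels."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}" if pairs else name


def make_sample(name, labels, timestamp_ms, value):
    """Build a Sample from a plain dict of labels."""
    return Sample(name, tuple(sorted(labels.items())), timestamp_ms, float(value))
```

Two samples belong to the same series only when both the name and every label match, which is why adding a label such as `version` effectively creates a new series.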

History and Relevance Compared to Older Tools

When monitoring tools like Nagios (1999), Zabbix (2004), and Graphite emerged, they were designed for stable, on-premises environments with fixed servers. As cloud computing and microservices grew popular, with containers and services scaling dynamically, these tools ran into difficulties.

Several of these tools relied on agents or per-host configuration that sent data to a central system, which was hard to keep in sync as hosts came and went, and short-lived jobs were easy to miss. Most lacked multi-dimensional labels for flexible data analysis, and their query capabilities were limited. Storage was often absent or had to be added as a separate component.

Prometheus addressed these problems with a pull model: the server scrapes metrics from its targets over HTTP, and service discovery keeps the target list current as instances appear and disappear. Labels and the PromQL query language allow rich data exploration, while built-in local storage lets each server run on its own without depending on a remote database. After joining the CNCF in 2016, it became a key tool for monitoring Kubernetes and other modern infrastructure.

Why Use Prometheus?

Prometheus helps monitor system and application performance, such as CPU usage, RAM usage, network traffic, and request counts. It alerts you to issues like sudden RAM or CPU surges, memory leaks, or high error rates. This is especially useful for dynamic, service-oriented architectures where traditional monitoring falls short.

It excels in:

Machine-centric monitoring: Tracking hardware metrics like CPU, memory, and disk usage.
Application monitoring: Measuring request latency, error rates, and throughput.
Microservices architectures: Handling dynamic environments with multi-dimensional data (e.g., labeling by service, version).
Alerting: Triggering notifications for thresholds (e.g., "RAM usage > 90%").
Reliability during outages: Standalone design allows diagnosis when other systems fail.
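
As a rough illustration of the alerting item above, the following Python sketch evaluates a "RAM usage > 90%" threshold and only reports it as firing once the condition has held for a minimum duration, similar in spirit to the `for` clause in Prometheus alerting rules. The function names and the sample layout are assumptions made for this example, not Prometheus APIs.

```python
def ram_usage_percent(total_bytes, available_bytes):
    """Percentage of memory in use, derived from total and available bytes."""
    return 100.0 * (total_bytes - available_bytes) / total_bytes


def is_firing(samples, threshold, hold_seconds):
    """Return True if the value has stayed above threshold for hold_seconds.

    samples: list of (timestamp_seconds, value) tuples sorted by time.
    """
    if not samples:
        return False
    breach_start = None  # timestamp at which the current breach began
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # condition cleared, start over
    newest = samples[-1][0]
    return breach_start is not None and newest - breach_start >= hold_seconds
```

Requiring the condition to hold for a while helps avoid notifications for brief spikes that resolve on their own.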

However, it is not a good fit when you need 100% accuracy, such as per-request billing, because it prioritizes staying available over completeness and the collected data may have gaps.

How Does It Work?

Prometheus performs three main functions:

Retrieves data: Uses a pull model over HTTP to scrape metrics from endpoints (typically /metrics).
Stores data: Saves metrics in a time series database with timestamps and labels.
Queries data: Provides an HTTP API for querying via PromQL, a flexible query language.
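
The three functions above can be sketched in a few lines of Python. This is a simplified model, not how the Prometheus server is implemented: `parse_exposition` handles only basic lines of the text format (no sample timestamps), the scraped text is passed in directly instead of being fetched over HTTP, and `TinyTSDB` with its equality-only `select` is a hypothetical stand-in for the real storage engine and PromQL.

```python
import re

# Matches: metric_name{optional labels} value
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Retrieve: turn scraped text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = LINE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            labels = dict(LABEL_RE.findall(raw_labels or ""))
            samples.append((name, labels, float(value)))
    return samples


class TinyTSDB:
    """Store: an in-memory map from series identity to (timestamp, value) points."""

    def __init__(self):
        self.series = {}

    def append(self, name, labels, ts, value):
        key = (name, tuple(sorted(labels.items())))
        self.series.setdefault(key, []).append((ts, value))

    def select(self, name, **matchers):
        """Query: return series with this name whose labels equal all matchers."""
        result = {}
        for (series_name, label_pairs), points in self.series.items():
            labels = dict(label_pairs)
            if series_name == name and all(
                labels.get(k) == v for k, v in matchers.items()
            ):
                result[label_pairs] = points
        return result


def scrape_into(db, text, ts):
    """Store every sample from one scrape under the same scrape timestamp."""
    for name, labels, value in parse_exposition(text):
        db.append(name, labels, ts, value)
```

Calling `db.select("http_requests_total", method="GET")` plays roughly the role of the PromQL selector `http_requests_total{method="GET"}`.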

To expose metrics, applications use client libraries or exporters like the Node Exporter for system data. For short-lived jobs, a push gateway allows pushing metrics that Prometheus then scrapes. The Alertmanager handles alerts by routing them to channels like Slack or email.
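
To show what exposing metrics looks like on the application side, here is a minimal sketch that serves a `/metrics` endpoint using only the Python standard library. In practice an official client library would normally do this for you; the metric name `app_requests_total`, the `REQUESTS` dictionary, and the port are assumptions made for this example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state: request counts keyed by (method, status code).
REQUESTS = {("GET", "200"): 0}


def render_metrics(requests):
    """Render the counts in the Prometheus text exposition format."""
    lines = [
        "# HELP app_requests_total Total HTTP requests handled.",
        "# TYPE app_requests_total counter",
    ]
    for (method, code), count in sorted(requests.items()):
        lines.append(
            f'app_requests_total{{method="{method}",code="{code}"}} {count}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics on localhost until interrupted (blocks the caller)."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` from your own entry point and pointing a scrape target at `127.0.0.1:8000` would let Prometheus pull these values on its regular scrape interval.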

Metric Types in Prometheus

Prometheus uses four metric types:

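Counter: A cumulative value that only increases, or resets to zero on restart (e.g., number of requests served).
Gauge: The current value of something that can go up or down (e.g., CPU or memory usage).
Histogram: Samples observations such as request latency or response size and counts them in configurable buckets, along with a sum and a count of all observations.
Summary: Also samples observations, but computes configurable quantiles (e.g., the 95th percentile) on the client side over a sliding time window.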
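
The behavior of the first three types can be sketched with a few small Python classes. These are simplified stand-ins written for this article, not the official client library classes; the bucket boundaries are arbitrary example values. The histogram mirrors the way Prometheus exposes buckets cumulatively, where each bucket counts all observations less than or equal to its upper bound.

```python
import bisect


class Counter:
    """A value that can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """A value that can go up or down."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations per bucket and tracks their sum and count."""

    def __init__(self, buckets):
        self.bounds = sorted(buckets)
        self.counts = [0] * (len(self.bounds) + 1)  # final slot is the +Inf bucket
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bound >= value, i.e. the smallest bucket
        # whose upper bound still includes this observation.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        total = 0
        result = []
        for bound, bucket_count in zip(self.bounds + [float("inf")], self.counts):
            total += bucket_count
            result.append((bound, total))
        return result
```

A summary is left out here because computing streaming quantiles correctly takes more than a short sketch.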

Architecture

Prometheus scrapes metrics from instrumented jobs directly or via the push gateway. It stores samples locally and runs rules for aggregation or alerting. Visualization tools like Grafana query the data via the API.
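
One common aggregation over stored samples is turning a raw counter into a per-second rate. The Python sketch below shows the basic idea, including how a counter reset (for example after a process restart) can be handled. It is a deliberately simplified approximation: PromQL's `rate()` also extrapolates to the edges of the query window, which this hypothetical `simple_rate` function does not attempt.

```python
def simple_rate(points):
    """Average per-second increase of a counter across the given points.

    points: list of (timestamp_seconds, counter_value) sorted by time.
    Returns None when there are too few points to compute a rate.
    """
    if len(points) < 2:
        return None
    increase = 0.0
    previous = points[0][1]
    for _, value in points[1:]:
        if value >= previous:
            increase += value - previous
        else:
            # The counter went backwards, so assume it restarted from zero
            # and count the new value as the increase since the reset.
            increase += value
        previous = value
    span = points[-1][0] - points[0][0]
    return increase / span if span > 0 else None
```

For example, a counter that reads 100, 130, and 160 at 15-second intervals has grown by 60 over 30 seconds, which gives a rate of 2 per second.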

At its core, Prometheus handles three key functions: retrieving metrics (scraping), saving metrics (storing in TSDB), and querying metrics (via HTTP API).


┌─────────────────┐          ┌──────────────────────────────┐          ┌─────────────┐
│                 │          │     PROMETHEUS SERVER        │          │             │
│  Instrumented   │─ Pull ──▶│                              │◀─ Query ─│   Grafana   │
│      Jobs       │          │  ┌────────────────────────┐  │          │             │
│   (/metrics)    │          │  │   1. Retrieval         │  │          └─────────────┘
└─────────────────┘          │  │      (Scrape)          │  │
                             │  └──────────┬─────────────┘  │
┌─────────────────┐          │             │                │          ┌─────────────┐
│                 │          │             ▼                │          │             │
│  Push Gateway   │─ Pull ──▶│  ┌────────────────────────┐  │─ Alert  ─│ Alertmanager│
│  (short-lived)  │          │  │   2. Storage (TSDB)    │  │          │             │
└─────────────────┘          │  └──────────┬─────────────┘  │          └──────┬──────┘
        ▲                    │             │                │                 │
        │ Push               │             ▼                │                 │
┌─────────────────┐          │  ┌────────────────────────┐  │                 ▼
│  Short-lived    │          │  │   3. Query (HTTP API)  │  │          ┌─────────────┐
│     Jobs        │          │  │      PromQL            │  │          │ Slack/Email │
└─────────────────┘          │  └────────────────────────┘  │          └─────────────┘
                             │                              │
┌─────────────────┐          │  ┌────────────────────────┐  │
│                 │          │  │   Rules Engine         │  │
│    Exporters    │─ Pull ──▶│  │   (Aggregation +       │  │
│  (Node, etc.)   │          │  │    Alerting)           │  │
└─────────────────┘          │  └────────────────────────┘  │
                             └──────────────────────────────┘

Components

The Prometheus ecosystem includes:

Prometheus server: The core engine that scrapes, stores, and queries metrics using a time series database.
Client libraries: SDKs for languages like Go, Python, or Java to instrument application code and expose custom metrics.
Push gateway: An intermediary for short-lived jobs (e.g., batch scripts) to push metrics, which Prometheus then scrapes.
Exporters: Specialized tools that collect metrics from external systems (e.g., the Node Exporter for OS stats, the HAProxy exporter for load balancers) and expose them in Prometheus format.
Alertmanager: Handles alert routing, deduplication, and notifications to channels like Slack or email.
Support tools: Utilities such as promtool for tasks like checking configuration and rule files or migrating data.
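
To illustrate what the Alertmanager's deduplication and routing amount to conceptually, here is a small Python sketch. It is an assumption-laden toy model: the `fingerprint` and `route` functions, the `severity` label, and the receiver names are hypothetical, and the real Alertmanager also handles grouping, silencing, and inhibition, which are not shown.

```python
def fingerprint(labels):
    """Identify an alert by its full label set, independent of key order."""
    return tuple(sorted(labels.items()))


def route(alerts, routes, default_receiver):
    """Drop duplicate alerts, then assign each one to the first matching route.

    alerts: list of label dicts, one per incoming alert.
    routes: list of (matchers, receiver) pairs checked in order.
    Returns a dict mapping receiver name to the alerts it should get.
    """
    seen = set()
    deliveries = {}
    for labels in alerts:
        fp = fingerprint(labels)
        if fp in seen:  # same alert already received, do not notify twice
            continue
        seen.add(fp)
        receiver = default_receiver
        for matchers, name in routes:
            if all(labels.get(k) == v for k, v in matchers.items()):
                receiver = name
                break
        deliveries.setdefault(receiver, []).append(labels)
    return deliveries
```

With a route such as `({"severity": "critical"}, "slack")`, critical alerts would go to the Slack receiver while everything else falls through to the default, for instance email.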

Conclusion

Prometheus is a powerful, flexible tool for modern monitoring, especially in dynamic environments like microservices and Kubernetes. It addresses many limitations of older tools through its pull model, labels, and query language.
