🟡 Why Monitoring Matters
In today’s digital world, software services run 24/7. And when something goes wrong, seconds can cost businesses money, trust, and peace of mind. Monitoring is not just about catching errors — it’s about understanding your systems in motion.
That’s where Prometheus and Grafana step in.
This multi-part article series will walk you through the practical implementation of a full monitoring stack using Prometheus for metrics collection and Grafana for visualization — with real Web API services and real metrics.
🟡 What We’ll Build (Overview)
- Spin up multiple Web API instances
- Collect realistic metrics from them
- Store those metrics in Prometheus
- Build beautiful and functional dashboards in Grafana
- Explore PromQL and alerting scenarios
- Learn through hands-on, containerized environments
The full code is available on GitHub:
👉 https://github.com/naeemaei/MonitoringExample
Just run this from the project root:
docker-compose -f Prometheus/docker-compose.yml up -d --build
🟡 What You’ll Need
- Visual Studio Code
- Docker and Docker Compose
🟡 What is a Time Series Database (TSDB)?
A Time Series Database (TSDB) is a specialized type of database optimized for storing, retrieving, and managing time-stamped data. Unlike traditional relational databases, TSDBs are designed to efficiently handle sequences of data points indexed in time order. A real-world example helps clarify this:
Take a look at your smartwatch. It continuously displays your heart rate in real time. Now, imagine you want to store that heart rate data over an extended period, say, a full month. How should we structure that data?
At the core, we have a primary dimension: time (our key), and a value, which is the heart rate at that specific timestamp. This structure, a timestamp paired with a single numeric value, is exactly what a time series is.
But that’s just the beginning.
We might also want to associate additional context with each data point. For instance:
- The city where the user is located
- The user’s activity state: sleeping, awake, walking, or working out
With this richer dataset, we can ask powerful questions, like:
- What’s the average heart rate of a person while sleeping in a coastal city compared to a mountain region?
- How does the heart rate fluctuate at the onset of sleep, during deep sleep, and upon waking?
- What was the average heart rate in the past hour?
Sound familiar? It should — modern smartwatches and smartphones already collect this kind of data. They track your steps, your sleep cycles, and your movement, and then generate daily summaries and health insights.
But here’s the real question:
How can we store and structure this kind of data so that it’s not only easy to monitor in real-time, but also easy to analyze, visualize, and report?
The answer lies in using a Time Series Database (TSDB) — and that’s exactly where Prometheus excels.
🟡 How Prometheus Stores Metrics
Each metric in Prometheus has four parts: a metric name, a timestamp, labels (tags), and a value. For example:
2021/07/14 14:17:25.1000 → heart_rate{city="Shiraz", state="sleep"} 103
In the example above, the metric name is heart_rate, the timestamp is 2021/07/14 14:17:25.1000, the labels are city and state, and the value is 103. In other words, a heart rate of 103 was recorded at 2021/07/14 14:17:25.1000 in the city of Shiraz while the user was in the sleep state.
In general, the main and key axis in a time series database is time. Take a look at this image:
As you can see in this image, we have a series of time points from t0 to t6, at each of which metric information is stored. For example, the number of search requests received by the Android application up to moment t3 (first row from the bottom) is 18, and by time t5 this number reaches 22, meaning that in the (t5 - t3) interval, 4 new search requests arrived at the Android service.
Traditional relational databases can store time-stamped data, but they're not designed for the scale, speed, and analytical nature of time series workloads.
🟡 What is Prometheus?
Prometheus is an open-source monitoring and alerting toolkit, designed for time series data collection, especially in cloud-native, containerized environments.
📦 Prometheus core components
✅ Prometheus Server (Core Engine)
The core engine has these responsibilities and building blocks:
- Scrapes metrics from targets (applications/services).
- Stores time series data locally.
- Executes queries using PromQL.
- Evaluates rules and triggers alerts.
- TSDB (Time Series DB): Embedded time series database optimized for speed, compression, and retention.
- Label-based Storage: Each time series is uniquely identified by a metric name + key-value labels.
✅ Exporters (Metric Endpoints)
Exporters are agents or libraries that expose metrics in the Prometheus format so that Prometheus can fetch them.
The interesting thing is that many third-party libraries provide Exporters for different types of services. Need to monitor a Microsoft SQL Server instance? Just search for the appropriate MSSQL Exporter. Want insights into your Windows OS metrics? Use the official Windows Exporter. The Prometheus ecosystem is incredibly rich. There’s an exporter for virtually every major service or system you can think of. From databases and message queues to web servers and operating systems, most come with prebuilt exporters that expose useful metrics in a Prometheus-compatible format.
Prometheus mainly works on a pull-based model: it scrapes (pulls) metrics from exporters at regular intervals, which is configurable via scrape_configs:
scrape_configs:
  - job_name: 'aspnet_core_api'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:5000']
In rare cases (e.g., short-lived jobs), Prometheus can receive pushed metrics via Pushgateway.
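To make the pull model concrete, here is a minimal sketch (not taken from the linked repository) of an ASP.NET Core service that could sit behind the localhost:5000 target above, assuming the prometheus-net.AspNetCore package that later parts of this series use; the /products route is purely illustrative:

using Prometheus;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.UseHttpMetrics(); // record default HTTP request metrics for every endpoint
app.MapMetrics();     // expose them at /metrics for Prometheus to scrape

app.MapGet("/products", () => Results.Ok(new[] { "keyboard", "mouse" })); // sample endpoint

app.Run();

With something like this in place, Prometheus would pull http://localhost:5000/metrics every 5 seconds, as configured above.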
✅ Service Discovery
Service discovery dynamically finds scrape targets in modern environments (Kubernetes, EC2, Consul, Docker Swarm); static configurations, like the one above, are also possible.
✅ PromQL (Prometheus Query Language)
It is used for querying and aggregating time series, and for writing alerting rules.
✅ Alertmanager
A decoupled service that handles alerts fired by Prometheus, grouping, deduplicating, and routing them to notification channels.
📦 Understanding Prometheus Metric Types
Prometheus supports four core types of metrics: Counter, Gauge, Histogram, and Summary. Each serves a distinct purpose in monitoring and analyzing the behavior of systems and services.
🔢 Counter
A Counter is a metric that only increases over time. It’s ideal for tracking cumulative values, like the number of requests received, errors encountered, or jobs completed.
Example use case:
Tracking the number of HTTP requests to a product detail endpoint.
# TYPE application_request_counter counter
application_request_counter{endpoint="product-detail-page", responsecode="200"} 532
In the example above, 532 requests have been received so far. The next incoming request will increment this to 533.
💡 Counters never decrease — if you see a reset, it usually means the application has restarted.
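As a rough sketch of how such a counter could be produced from application code with the prometheus-net library (mentioned in the series plan below), where the metric and label names simply mirror the example output above:

using Prometheus;

public static class AppMetrics
{
    // Counter with two labels, mirroring the example output above
    public static readonly Counter Requests = Metrics.CreateCounter(
        "application_request_counter",
        "Total number of HTTP requests received.",
        "endpoint", "responsecode");
}

// Inside a request handler:
// AppMetrics.Requests.WithLabels("product-detail-page", "200").Inc();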
📏 Gauge
A Gauge represents a metric that can go up or down. Use it to track values like current memory usage, CPU load, or request duration.
Example use case:
Tracking how long a product list page request takes.
# TYPE application_request_duration gauge
application_request_duration{endpoint="product-list-page"} 845
This means the current request took 845 milliseconds to complete. Unlike counters, gauges are momentary snapshots.
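A comparable sketch with prometheus-net, again using illustrative names rather than the repository’s actual code, could look like this:

using Prometheus;

public static class AppMetrics
{
    // Gauge holding the duration (in milliseconds) of the most recent request per endpoint
    public static readonly Gauge RequestDuration = Metrics.CreateGauge(
        "application_request_duration",
        "Duration of the last request in milliseconds.",
        "endpoint");
}

// After handling a request:
// AppMetrics.RequestDuration.WithLabels("product-list-page").Set(845);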
📊 Histogram
A Histogram collects observations (like response durations) and organizes them into buckets. It also tracks the total count and sum of all observed values, allowing you to calculate averages, percentiles, and latency distributions.
# TYPE application_request histogram
application_request_bucket{le="50"} 502
application_request_bucket{le="100"} 954
application_request_bucket{le="180"} 1166
application_request_bucket{le="250"} 1296
application_request_bucket{le="400"} 1383
application_request_bucket{le="800"} 1424
application_request_bucket{le="2000"} 1429
application_request_bucket{le="+Inf"} 1429
application_request_sum 25465
application_request_count 1429
This means:
- 502 requests took ≤ 50ms
- 954 took ≤ 100ms
- ...
- all 1429 requests fell within the +Inf bucket
From sum and count, you can calculate the average latency.
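Here is a sketch of how such a histogram could be declared with prometheus-net, assuming the bucket boundaries from the example above (the real series code may choose different buckets):

using Prometheus;

public static class AppMetrics
{
    // Histogram with explicit bucket boundaries in milliseconds, matching the example above
    public static readonly Histogram RequestDurations = Metrics.CreateHistogram(
        "application_request",
        "Request duration in milliseconds.",
        new HistogramConfiguration
        {
            Buckets = new double[] { 50, 100, 180, 250, 400, 800, 2000 }
        });
}

// After timing a request:
// AppMetrics.RequestDurations.Observe(elapsedMilliseconds);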
🧮 Summary
A Summary also tracks sum and count, but adds quantiles, such as the 50th, 95th, or 99th percentiles — very useful for latency analysis.
# TYPE application_request_duration summary
application_request_duration_sum{app="MonitoringExample.Api"} 95743
application_request_duration_count{app="MonitoringExample.Api"} 1429
application_request_duration{app="MonitoringExample.Api", quantile="0.5"} 114
application_request_duration{app="MonitoringExample.Api", quantile="0.75"} 132
application_request_duration{app="MonitoringExample.Api", quantile="0.95"} 146
application_request_duration{app="MonitoringExample.Api", quantile="0.99"} 152
This tells us:
- 50% of requests completed in ≤ 114ms
- 95% of requests completed in ≤ 146ms
- 99% completed in ≤ 152ms
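A sketch of the corresponding declaration with prometheus-net, assuming its SummaryConfiguration API in which each QuantileEpsilonPair pairs a quantile with its allowed error; the names mirror the example, not the repository:

using Prometheus;

public static class AppMetrics
{
    // Summary publishing the 50th, 75th, 95th and 99th percentiles, as in the example above
    public static readonly Summary RequestDurationSummary = Metrics.CreateSummary(
        "application_request_duration",
        "Request duration in milliseconds.",
        new SummaryConfiguration
        {
            Objectives = new[]
            {
                new QuantileEpsilonPair(0.5, 0.05),
                new QuantileEpsilonPair(0.75, 0.05),
                new QuantileEpsilonPair(0.95, 0.01),
                new QuantileEpsilonPair(0.99, 0.005)
            }
        });
}

// After timing a request:
// AppMetrics.RequestDurationSummary.Observe(elapsedMilliseconds);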
📦 Histogram vs. Summary
Histogram and Summary are often misunderstood metric types. Both Histogram and Summary are used to observe and analyze distributions of values, like response durations or payload sizes. However, they differ in how they collect, aggregate, and present data.
A Histogram samples observations (like durations or sizes) and counts how many fall into predefined buckets. Each bucket is defined by an upper bound (le, or "less than or equal to"). In addition, a Histogram always provides *_count (total number of observations) and *_sum (sum of all observed values). These let you calculate averages.
A Summary calculates and exports precomputed quantiles, like the 50th (median), 95th, or 99th percentile, along with *_sum and *_count. When you need to monitor user experience and answer questions like "How fast are most of the requests?" or "What's the typical latency for 99% of users?", a Summary is the better fit.
🟡 What’s Next?
- Integrating Prometheus with our ASP.NET Core API
- Exposing custom metrics using prometheus-net
- Deep dive into metric labeling strategies
- Creating our first Prometheus query
- Designing a working dashboard in Grafana