Introduction
In the fast-paced world of DevOps, monitoring and observability are essential for ensuring system reliability and cost efficiency. This guide walks you through configuring Prometheus and Grafana for AWS monitoring, setting up AWS-specific alerts, and integrating the AWS Cost Exporter. Whether you’re a beginner or an intermediate DevOps engineer, this guide is designed to be easy to follow and implement.
Who is this Guide for?
This guide is tailored for:
- Beginners: Those new to DevOps and monitoring tools.
- Intermediate Engineers: Those looking to deepen their understanding of Prometheus, Grafana, and DORA metrics.
What You’ll Learn
By the end of this guide, you’ll be able to:
- Deploy Prometheus and Grafana on a cloud server.
- Set up Node Exporter and Blackbox Exporter for system and uptime monitoring.
- Configure DORA metrics tracking for CI/CD pipelines.
- Set up an alerting system with Slack notifications.
Let’s dive in!
Part 1: Set Up Prometheus and Monitoring Tools
1. Setting Up Prometheus
Prometheus is an open-source monitoring system that collects and stores metrics as time series data. It’s like having a system constantly monitor your infrastructure, collecting performance data and alerting you before problems occur. Prometheus uses a pull-based model, scraping metrics from configured targets via HTTP endpoints.
In this section, we’ll install Prometheus, configure it to collect data from different sources, and ensure it’s properly storing and retrieving metrics.
1. Create User
We need to create a dedicated user for Prometheus. This enhances security by limiting the permissions and access of the Prometheus service.
sudo useradd --no-create-home --shell /bin/false prometheus
2. Create Directories
These directories will store Prometheus configuration files and data. Organizing them separately helps in managing and backing up data efficiently.
sudo mkdir -p /etc/prometheus /var/lib/prometheus
3. Download Prometheus
This step involves downloading and extracting the Prometheus binary.
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
4. Copy Binaries
Copy the binaries to /usr/local/bin, making them easily accessible and executable from anywhere in the system.
sudo cp prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
sudo cp prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
5. Copy the Configuration Files
These files contain the necessary configurations and libraries for Prometheus to function correctly.
sudo cp -r prometheus-2.45.0.linux-amd64/consoles /etc/prometheus
sudo cp -r prometheus-2.45.0.linux-amd64/console_libraries /etc/prometheus
6. Configure Prometheus
Edit /etc/prometheus/prometheus.yml and define scrape jobs for monitoring. This configuration defines how often Prometheus scrapes data and which endpoints it monitors.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['localhost:9100']   # use the exporter's hostname instead if it runs on another machine

  # Blackbox Exporter for HTTP endpoint checks
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - "https://example1.com" # <your-url-here>
          - "https://example2.com" # <your-url-here>
          - "https://example3.com" # <your-url-here>
          - "https://example4.com" # <your-url-here>
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115   # address of the Blackbox Exporter itself

  - job_name: 'github-exporter'
    static_configs:
      - targets: ['localhost:9118']
        labels:
          environment: 'production'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]

rule_files:
  - "alert_rules.yml"
7. Add the Alert Rules
Add the following rules to the /etc/prometheus/alert_rules.yml file:
groups:
  - name: blackbox_exporter_alerts
    rules:
      # Alert when an endpoint is down
      - alert: EndpointDown
        expr: probe_success == 0
        for: 1m
        labels:
          severity: critical
          target: "{{ $labels.instance }}"
        annotations:
          summary: "A monitored endpoint is down"
          description: "{{ $labels.instance }} URL is unreachable."

      # Alert when an endpoint is back up
      #- alert: EndpointUp
      #  expr: probe_success == 1
      #  for: 1m
      #  labels:
      #    severity: info
      #    target: "{{ $labels.instance }}"
      #  annotations:
      #    summary: "Endpoint is Back Online"
      #    description: "{{ $labels.instance }} is now reachable (job: {{ $labels.job }})."

      # High latency alert
      - alert: HighLatency
        expr: probe_duration_seconds > 1
        for: 1m
        labels:
          severity: warning
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High Latency"
          description: "{{ $labels.instance }} has high latency: {{ $value }}s."

      # SSL Certificate Expiry Alert (less than 7 days)
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry{job="blackbox-http"} - time() < 86400 * 7
        for: 1m
        labels:
          severity: warning
          target: "{{ $labels.instance }}"
        annotations:
          summary: "SSL Certificate Expiry Warning"
          description: "SSL certificate for {{ $labels.instance }} expires in less than 7 days."

  - name: node_exporter_alerts
    rules:
      # High CPU Usage Alert
      - alert: HighCPUUsage
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="user"}[2m])) * 100 > 80
        for: 1m
        labels:
          severity: critical
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High CPU Usage on {{ $labels.instance }}"
          description: "CPU usage has exceeded 80% for more than 1 minute."

      # High Memory Usage Alert
      - alert: HighMemoryUsage
        expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) * 100 > 80
        for: 1m
        labels:
          severity: critical
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High Memory Usage on {{ $labels.instance }}"
          description: "Memory usage ({{ $value | printf \"%.2f\" }}%) exceeds 80% for more than 1 minute."

      # High Disk Usage Alert
      - alert: HighDiskUsage
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs"} / node_filesystem_size_bytes{fstype!~"tmpfs"}) * 100 < 20
        for: 1m
        labels:
          severity: warning
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High Disk Usage on {{ $labels.instance }}"
          description: "Disk space usage is critically high, less than 20% available."

      # High System Load Alert
      - alert: HighSystemLoad
        expr: node_load1 > on(instance) (1.5 * count by (instance) (node_cpu_seconds_total{mode="user"}))
        for: 1m
        labels:
          severity: warning
          target: "{{ $labels.instance }}"
        annotations:
          summary: "High System Load on {{ $labels.instance }}"
          description: "System load ({{ $value }}) is too high compared to available CPU cores."
8. Set Permissions for Prometheus User
This step ensures that the Prometheus user has the necessary access to its configuration and data files.
sudo chown -R prometheus:prometheus /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
9. Create Systemd Service
Create a systemd service file at /etc/systemd/system/prometheus.service. This service file ensures Prometheus runs as a background process and starts automatically on boot.
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
10. Enable and Start Prometheus
These commands reload the systemd manager configuration, enable Prometheus to start on boot, and start the Prometheus service immediately.
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
11. Check Prometheus Status
sudo systemctl status prometheus
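Beyond systemctl, a quick way to confirm Prometheus is serving requests is to hit its health endpoint and list the health of its scrape targets. This assumes the default port 9090 on the same host:
# Health check and a summary of target health (both endpoints are part of Prometheus' HTTP API)
curl http://localhost:9090/-/healthy
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'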
2. Setting Up Node Exporter
Your computer or server is always running various processes — handling CPU load, managing memory, reading and writing to disks. But how do you keep track of these activities? Node Exporter acts as a sensor, continuously collecting system health data and making it available for Prometheus to analyze. It is a Prometheus exporter that provides detailed system metrics, including CPU, memory, disk I/O, and network statistics.
We’ll install Node Exporter, connect it to Prometheus, and visualize key metrics like CPU usage, memory consumption, and disk space. This will help in spotting performance issues before they impact your system.
1. Create a Node Exporter User
Similar to what we did with Prometheus, creating a dedicated user for Node Exporter enhances security.
sudo useradd --no-create-home --shell /bin/false node_exporter
2. Download Node Exporter
Download the Node Exporter release and extract the files.
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvf node_exporter-1.6.1.linux-amd64.tar.gz
3. Copy Binary
This step moves the executable to a system-wide directory and sets appropriate ownership.
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
4. Create Systemd Service
Create /etc/systemd/system/node_exporter.service. This ensures Node Exporter runs as a background service and starts on boot.
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter
[Install]
WantedBy=multi-user.target
5. Enable and Start Node Exporter
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
6. Check Node Exporter Status
sudo systemctl status node_exporter
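To verify the exporter is exposing data (assuming the default port 9100 on the same host), you can fetch a few CPU series directly:
# Node Exporter serves plain-text metrics on /metrics
curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head -n 5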
3. Setting Up Blackbox Exporter
Monitoring the health of your internal system is excellent, but what about services that users interact with, like websites and APIs? Blackbox Exporter is a tool that helps test whether these external services are reachable and responding correctly. It does this by simulating user interactions, such as:
- Checking if a website is online and loading correctly
- Measuring how long it takes for a webpage to respond
- Verifying whether a database or application can be reached over the network
We’ll set up Blackbox Exporter to monitor critical services and ensure they stay accessible.
1. Create a System User for Blackbox
sudo useradd --no-create-home --shell /bin/false blackbox_exporter
2. Download Blackbox Exporter
Blackbox Exporter is used to monitor the availability and response time of network services.
cd /tmp
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
tar -xvf blackbox_exporter-0.24.0.linux-amd64.tar.gz
3. Copy Binary and Set Permissions
This step moves the executable to a system-wide directory and sets appropriate ownership.
sudo cp blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter /usr/local/bin/
sudo chown blackbox_exporter:blackbox_exporter /usr/local/bin/blackbox_exporter
4. Create Blackbox Config Directory
sudo mkdir -p /etc/blackbox_exporter
sudo chown blackbox_exporter:blackbox_exporter /etc/blackbox_exporter
5. Create Configuration
This configuration file defines how Blackbox Exporter should probe different types of services. Add the below config to /etc/blackbox_exporter/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    timeout: 5s
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{}'
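Recent Blackbox Exporter releases can validate a configuration file without starting the server; if your version supports the flag, this is a quick way to catch YAML mistakes (treat it as optional otherwise):
# Validate the module definitions and exit
/usr/local/bin/blackbox_exporter --config.file=/etc/blackbox_exporter/blackbox.yml --config.check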
6. Create Systemd Service
Create /etc/systemd/system/blackbox_exporter.service. This ensures Blackbox Exporter runs as a background service and starts on boot.
[Unit]
Description=Blackbox Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=blackbox_exporter
Group=blackbox_exporter
Type=simple
ExecStart=/usr/local/bin/blackbox_exporter --config.file=/etc/blackbox_exporter/blackbox.yml
[Install]
WantedBy=multi-user.target
7. Enable and Start Blackbox Exporter
sudo systemctl daemon-reload
sudo systemctl enable blackbox_exporter
sudo systemctl start blackbox_exporter
8. Check Blackbox Exporter Status
sudo systemctl status blackbox_exporter
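You can also run a probe by hand to confirm the exporter works end to end. The example assumes the default port 9115 and reuses one of the placeholder URLs from the Prometheus config; a probe_success value of 1 means the target is reachable:
# Ask Blackbox Exporter to probe a URL with the http_2xx module
curl -s "http://localhost:9115/probe?target=https://example1.com&module=http_2xx" | grep probe_success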
4. Setting Up Grafana
Staring at rows of numbers can be overwhelming — Grafana turns those numbers into beautiful, easy-to-read dashboards. It connects to Prometheus and helps visualize performance trends, making it easier to understand what’s happening in your system at a glance.
In this section, we’ll install Grafana, configure it to pull data from Prometheus and create dashboards that display critical system and application performance metrics. By the end, you’ll have real-time, interactive charts showing exactly how your infrastructure is performing.
1. Import Grafana GPG Key
Importing the GPG key ensures the authenticity of the Grafana packages.
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
2. Add Repository
Adding the Grafana repository allows you to install Grafana using apt-get.
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
3. Update and Install Grafana
You only need one of the two packages: grafana-enterprise or the open-source grafana package.
sudo apt-get update
sudo apt-get install grafana-enterprise
or, for the open-source edition:
sudo apt-get install grafana
4. Enable and Start Grafana
sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
5. Check Grafana Status
sudo systemctl status grafana-server
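Before building dashboards, Grafana needs Prometheus registered as a data source. You can do this in the UI (Connections > Data sources > Add data source > Prometheus), or script it through Grafana's HTTP API. The call below is a minimal sketch that assumes Grafana is on its default port 3000 with the default admin/admin credentials and Prometheus on localhost:9090:
# Register Prometheus as a Grafana data source via the HTTP API (assumes default credentials)
curl -s -u admin:admin -H 'Content-Type: application/json' \
  -X POST http://localhost:3000/api/datasources \
  -d '{"name":"Prometheus","type":"prometheus","url":"http://localhost:9090","access":"proxy","isDefault":true}'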
5. Setting Up GitHub Exporter for DORA Metrics
If you’re building software, you want to know how efficiently your team delivers updates. That’s where DORA (DevOps Research and Assessment) metrics come in. These four key metrics help measure software delivery performance:
- Deployment Frequency (DF) — How often new code is deployed
- Lead Time for Changes (LTC) — How long it takes for a code change to go live
- Change Failure Rate (CFR) — How often deployments cause problems
- Mean Time to Recovery (MTTR) — How quickly issues are fixed
GitHub doesn’t provide these insights directly, so we use GitHub Exporter, which collects data from GitHub repositories and makes it available to Prometheus. We’ll set up GitHub Exporter, connect it to Prometheus, and visualize DORA metrics in Grafana to track and improve software delivery speed and reliability.
1. Install dependencies
These dependencies are necessary for running the GitHub Exporter script.
sudo apt-get install -y python3 python3-pip
sudo mkdir -p /opt/github_exporter
cd /opt/github_exporter
2. Create GitHub Exporter Script
Create /opt/github_exporter/github_exporter.py. This script fetches deployment data from GitHub and exposes it as Prometheus metrics.
import time
import requests
from datetime import datetime, timedelta
from prometheus_client import start_http_server, Gauge, Counter
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger('github-metrics-exporter')

# GitHub API Configuration with hardcoded token
GITHUB_TOKEN = "your_github_token_here"  # Replace with your actual GitHub token

# GitHub repositories to monitor
REPOS = [
    {"owner": "<repo-owner>", "repo": "<repo-name>"},
    {"owner": "<repo-owner>", "repo": "<repo-name>"}
]

# API headers
HEADERS = {
    'Authorization': f'token {GITHUB_TOKEN}',
    'Accept': 'application/vnd.github.v3+json'
}

# Prometheus metrics
deployment_frequency = Counter('github_deployment_frequency_total',
                               'Total number of deployments',
                               ['repository'])
lead_time_for_changes = Gauge('github_lead_time_for_changes_seconds',
                              'Time from commit to production in seconds',
                              ['repository'])
change_failure_rate = Gauge('github_change_failure_rate_percent',
                            'Percentage of deployments that failed',
                            ['repository'])
mean_time_to_restore = Gauge('github_mean_time_to_restore_seconds',
                             'Mean time to recover from failures in seconds',
                             ['repository'])


class GitHubMetricsCollector:
    def __init__(self, repos, headers):
        self.repos = repos
        self.headers = headers

    def get_workflows(self, owner, repo):
        """Get all workflows for a repository"""
        url = f"https://api.github.com/repos/{owner}/{repo}/actions/workflows"
        response = requests.get(url, headers=self.headers)
        if response.status_code != 200:
            logger.error(f"Failed to get workflows: {response.status_code}, {response.text}")
            return []
        return response.json().get('workflows', [])

    def get_workflow_runs(self, owner, repo, workflow_id, time_period_days=30):
        """Get workflow runs for a specific workflow"""
        since_date = (datetime.now() - timedelta(days=time_period_days)).isoformat()
        url = f"https://api.github.com/repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs?created=>{since_date}&per_page=100"
        response = requests.get(url, headers=self.headers)
        if response.status_code != 200:
            logger.error(f"Failed to get workflow runs: {response.status_code}, {response.text}")
            return []
        return response.json().get('workflow_runs', [])

    def get_commit_data(self, owner, repo, sha):
        """Get data for a specific commit"""
        url = f"https://api.github.com/repos/{owner}/{repo}/commits/{sha}"
        response = requests.get(url, headers=self.headers)
        if response.status_code != 200:
            logger.error(f"Failed to get commit data: {response.status_code}, {response.text}")
            return None
        return response.json()

    def calculate_deployment_frequency(self, owner, repo):
        """Calculate deployment frequency"""
        workflows = self.get_workflows(owner, repo)
        deployment_workflows = [w for w in workflows if 'deploy' in w.get('name', '').lower()]
        total_deployments = 0
        for workflow in deployment_workflows:
            runs = self.get_workflow_runs(owner, repo, workflow['id'])
            successful_deployments = [r for r in runs if r['conclusion'] == 'success']
            total_deployments += len(successful_deployments)
        deployment_frequency.labels(repository=f"{owner}/{repo}").inc(total_deployments)
        logger.info(f"[{owner}/{repo}] Deployment Frequency: {total_deployments} deployments")
        return total_deployments

    def calculate_lead_time_for_changes(self, owner, repo):
        """Calculate lead time for changes"""
        workflows = self.get_workflows(owner, repo)
        deployment_workflows = [w for w in workflows if 'deploy' in w.get('name', '').lower()]
        lead_times = []
        for workflow in deployment_workflows:
            runs = self.get_workflow_runs(owner, repo, workflow['id'])
            successful_deployments = [r for r in runs if r['conclusion'] == 'success']
            for run in successful_deployments:
                commit_sha = run.get('head_sha')
                if not commit_sha:
                    continue
                commit_data = self.get_commit_data(owner, repo, commit_sha)
                if not commit_data:
                    continue
                commit_time = datetime.strptime(commit_data['commit']['author']['date'],
                                                "%Y-%m-%dT%H:%M:%SZ")
                deployment_time = datetime.strptime(run['updated_at'],
                                                    "%Y-%m-%dT%H:%M:%SZ")
                lead_time = (deployment_time - commit_time).total_seconds()
                lead_times.append(lead_time)
        if lead_times:
            avg_lead_time = sum(lead_times) / len(lead_times)
            lead_time_for_changes.labels(repository=f"{owner}/{repo}").set(avg_lead_time)
            logger.info(f"[{owner}/{repo}] Lead Time for Changes: {avg_lead_time:.2f} seconds")
            return avg_lead_time
        return 0

    def calculate_change_failure_rate(self, owner, repo):
        """Calculate change failure rate"""
        workflows = self.get_workflows(owner, repo)
        deployment_workflows = [w for w in workflows if 'deploy' in w.get('name', '').lower()]
        total_deployments = 0
        failed_deployments = 0
        for workflow in deployment_workflows:
            runs = self.get_workflow_runs(owner, repo, workflow['id'])
            total_deployments += len(runs)
            failed_deployments += len([r for r in runs if r['conclusion'] == 'failure'])
        if total_deployments > 0:
            failure_rate = (failed_deployments / total_deployments) * 100
            change_failure_rate.labels(repository=f"{owner}/{repo}").set(failure_rate)
            logger.info(f"[{owner}/{repo}] Change Failure Rate: {failure_rate:.2f}%")
            return failure_rate
        return 0

    def calculate_mttr(self, owner, repo):
        """Calculate Mean Time to Restore"""
        workflows = self.get_workflows(owner, repo)
        deployment_workflows = [w for w in workflows if 'deploy' in w.get('name', '').lower()]
        recovery_times = []
        for workflow in deployment_workflows:
            runs = self.get_workflow_runs(owner, repo, workflow['id'])
            runs.sort(key=lambda x: datetime.strptime(x['created_at'], "%Y-%m-%dT%H:%M:%SZ"))
            # Find failure-success sequences
            for i in range(1, len(runs)):
                if runs[i-1]['conclusion'] == 'failure' and runs[i]['conclusion'] == 'success':
                    failure_time = datetime.strptime(runs[i-1]['updated_at'], "%Y-%m-%dT%H:%M:%SZ")
                    recovery_time = datetime.strptime(runs[i]['updated_at'], "%Y-%m-%dT%H:%M:%SZ")
                    time_to_restore = (recovery_time - failure_time).total_seconds()
                    recovery_times.append(time_to_restore)
        if recovery_times:
            mttr = sum(recovery_times) / len(recovery_times)
            mean_time_to_restore.labels(repository=f"{owner}/{repo}").set(mttr)
            logger.info(f"[{owner}/{repo}] Mean Time to Restore: {mttr:.2f} seconds")
            return mttr
        return 0

    def collect_metrics(self):
        """Collect all metrics for all repositories"""
        for repo_info in self.repos:
            owner = repo_info['owner']
            repo = repo_info['repo']
            logger.info(f"Collecting metrics for {owner}/{repo}")
            try:
                self.calculate_deployment_frequency(owner, repo)
                self.calculate_lead_time_for_changes(owner, repo)
                self.calculate_change_failure_rate(owner, repo)
                self.calculate_mttr(owner, repo)
            except Exception as e:
                logger.error(f"Error collecting metrics for {owner}/{repo}: {str(e)}")


def main():
    # Start Prometheus HTTP server
    port = 9118
    start_http_server(port)
    logger.info(f"Server started on port {port}")
    collector = GitHubMetricsCollector(REPOS, HEADERS)
    # Collect metrics every 15 minutes
    collection_interval = 15 * 60  # 15 minutes in seconds
    while True:
        collector.collect_metrics()
        logger.info(f"Metrics collection completed. Next collection in {collection_interval} seconds")
        time.sleep(collection_interval)


if __name__ == "__main__":
    main()
3. Install Required Python Packages
These packages are necessary for the GitHub Exporter script to function.
sudo pip3 install requests prometheus_client pytz
or run
sudo apt update
sudo apt install python3-requests python3-prometheus-client python3-tz
4. Create Systemd Service
Create /etc/systemd/system/github_exporter.service
[Unit]
Description=GitHub Metrics Exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/github_exporter/github_exporter.py
Restart=always
[Install]
WantedBy=multi-user.target
5. Enable and Start GitHub Exporter
sudo systemctl daemon-reload
sudo systemctl enable github_exporter
sudo systemctl start github_exporter
6. Check GitHub Exporter Status
sudo systemctl status github_exporter
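Once the service has completed its first collection cycle (up to 15 minutes with the interval used in the script), the DORA metrics should be visible on the exporter's /metrics endpoint. This assumes the default port 9118 from the script:
# List the github_* series exposed by the exporter
curl -s http://localhost:9118/metrics | grep '^github_'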
6. Setting Up AlertManager
AlertManager receives the alerts that Prometheus fires, groups and deduplicates them, and routes them to receivers such as Slack.
1. Download AlertManager
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
2. Create User
A dedicated user enhances security by limiting permissions.
sudo groupadd -f alertmanager
sudo useradd -g alertmanager --no-create-home --shell /bin/false alertmanager
sudo mkdir -p /etc/alertmanager/templates
sudo mkdir /var/lib/alertmanager
sudo chown alertmanager:alertmanager /etc/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
3. Unpack Prometheus AlertManager Binary
Untar the downloaded AlertManager archive and rename the extracted directory for convenience.
tar -xvf alertmanager-0.21.0.linux-amd64.tar.gz
mv alertmanager-0.21.0.linux-amd64 alertmanager-files
4. Install Prometheus AlertManager
Copying the alertmanager and amtool binaries to /usr/bin makes them globally accessible on your system. Changing ownership to the alertmanager user ensures that AlertManager runs with the appropriate permissions, enhancing security.
sudo cp alertmanager-files/alertmanager /usr/bin/
sudo cp alertmanager-files/amtool /usr/bin/
sudo chown alertmanager:alertmanager /usr/bin/alertmanager
sudo chown alertmanager:alertmanager /usr/bin/amtool
5. Install Prometheus AlertManager Configuration File
Copy the alertmanager.yml file from alertmanager-files to /etc/alertmanager and change its ownership to the alertmanager user.
sudo cp alertmanager-files/alertmanager.yml /etc/alertmanager/alertmanager.yml
sudo chown alertmanager:alertmanager /etc/alertmanager/alertmanager.yml
6. Setup Prometheus AlertManager Service
Create the alertmanager service file at /usr/lib/systemd/system/alertmanager.service
sudo vi /usr/lib/systemd/system/alertmanager.service
Add the following configuration:
[Unit]
Description=AlertManager
Wants=network-online.target
After=network-online.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/usr/bin/alertmanager \
--config.file /etc/alertmanager/alertmanager.yml \
--storage.path /var/lib/alertmanager/
[Install]
WantedBy=multi-user.target
7. Set File Permissions
Setting the correct permissions ensures that the system can read and execute the service file without being modified by unauthorised users.
sudo chmod 664 /usr/lib/systemd/system/alertmanager.service
8. Create Configuration File
Edit the configuration file /etc/alertmanager/alertmanager.yml:
global:
  resolve_timeout: 1m

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'job']
  repeat_interval: 1h

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#<channel-name-here>'
        send_resolved: true
        icon_url: https://avatars3.githubusercontent.com/u/3380462
        api_url: 'https://hooks.slack.com/services/<api-url-here>'
        title: |-
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }} for {{ .CommonLabels.job }}
          {{- if gt (len .CommonLabels) (len .GroupLabels) -}}
          (
          {{- with .CommonLabels.Remove .GroupLabels.Names }}
            {{- range $index, $label := .SortedPairs -}}
              {{ if $index }}, {{ end }}
              {{- $label.Name }}="{{ $label.Value -}}"
            {{- end }}
          {{- end }}
          )
          {{- end }}
        text: >-
          {{ range .Alerts -}}
          *Alert:* {{ .Annotations.summary }}{{ if .Labels.severity }} - `{{ .Labels.severity }}`{{ end }}

          *Description:* {{ .Annotations.description }}

          *Details:*
          {{ range .Labels.SortedPairs }} • *{{ .Name }}:* `{{ .Value }}`
          {{ end }}
          {{ end }}
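Before starting the service, you can validate the file with amtool, which was installed alongside AlertManager in step 4:
# Check the AlertManager configuration for syntax and routing errors
amtool check-config /etc/alertmanager/alertmanager.yml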
9. Reload Systemd and Start AlertManager
Reloading systemd ensures that it recognizes the new AlertManager service file. Starting the service ensures AlertManager is running and ready to handle alerts.
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
10. Check AlertManager Service Status
Checking the status ensures that AlertManager is running without errors. If there are issues, the status output will provide clues for troubleshooting.
sudo systemctl status alertmanager
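To confirm the Slack integration end to end, you can push a synthetic alert into AlertManager's v2 API (assuming the default port 9093 on the same host). It should appear in your Slack channel shortly, and again as resolved once it expires:
# Inject a test alert directly into AlertManager
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"info","job":"manual-test"},"annotations":{"summary":"Test alert","description":"Manually injected alert to verify Slack routing."}}]'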
By following these steps, you’ve successfully set up Prometheus AlertManager and configured it to send alerts to Slack. You’ve also prepared Grafana for visualising metrics and monitoring your system. This setup ensures that your team is notified of critical issues in real time, improving system reliability and efficiency.
Part 2: Configure Grafana Dashboards
Once you’ve set up Prometheus, Node Exporter, Blackbox Exporter, and AlertManager, the next step is to visualize your metrics using Grafana. Grafana is a powerful visualisation tool that turns Prometheus metrics into dashboards, helping you monitor your system and CI/CD pipeline performance. By connecting it to Prometheus, you can monitor system performance, track DORA metrics, and set up alerts for critical issues. Here’s how to configure dashboards for Node Exporter, Blackbox Exporter, and DORA metrics.
Once you have all the components installed and running, you can:
- Access Grafana at http://your-server-ip:3000.
- Log in with the default credentials (username: admin, password: admin).
- Set up dashboards to visualise metrics from Prometheus, Node Exporter, and Blackbox Exporter.
1. Configuring Node Exporter Dashboard
The Node Exporter dashboard provides insights into system metrics like CPU usage, memory usage, disk usage, and more. Here’s how to set it up:
- Create a New Dashboard:
- Click on the “Create” button (plus icon) in the left sidebar.
- Select “Import” from the dropdown menu.
- Import the Node Exporter Dashboard:
- In the “Import via grafana.com” textbox, enter the Node Exporter dashboard ID: 1860.
- Click “Load”.
- Select the Data Source:
- Choose Prometheus as your data source.
- Click “Import”.
- View Your Dashboard:
- You’ll now see a fully configured Node Exporter dashboard with panels for CPU, memory, disk, and other system metrics.
2. Configuring Blackbox Exporter Dashboard
The Blackbox Exporter dashboard helps you monitor uptime, HTTP response times, and SSL certificate expiration. Here's how to set it up:
- Create a New Dashboard:
- Click on the "Create" button (plus icon) in the left sidebar.
- Select "Import" from the dropdown menu.
- Import Blackbox Exporter Dashboard:
- Same as with Node Exporter, in the "Import via grafana.com" textbox, enter the Blackbox Exporter dashboard ID: 7587.
- Click "Load"
- Select Data Source:
- Choose Prometheus as your data source
- Click "Import"
- View Your Dashboard
- You'll now see a fully configured Blackbox Exporter dashboard with panels for probe success, HTTP response times, and SSL certificate expiry.
3. Configuring DORA Metrics Dashboard
The DORA metrics dashboard tracks key CI/CD performance indicators like Deployment Frequency, Lead Time for Changes, Change Failure Rate, and Mean Time to Restore. Here's how to set it up:
- Create a New Dashboard:
- Click on the "Create" button (plus icon) in the left sidebar.
- Select "New Dashboard" from the dropdown menu.
- Add Panels for DORA Metrics:
- Click on the "Add" dropdown and select "Visualization".
- In the Queries panel, select the following metrics: github_deployment_frequency_total, github_change_failure_rate_percent, github_mean_time_to_restore_seconds, and github_lead_time_for_changes_seconds.
- Save Your Dashboard:
- Click "Save" to save your dashboard.
- Give it a meaningful name, like "DORA Metrics Dashboard".
- View and Customize:
- You can now view your DORA metrics in real time.
- Feel free to edit, move panels around, or change visualization types (e.g., graphs, gauges, tables).
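If you want to sanity-check the numbers behind the panels, the same series can be queried through Prometheus' HTTP API; the expressions below are illustrative (assuming Prometheus on localhost:9090) and can be reused as Grafana panel queries:
# Deployments recorded over the last 7 days, and the current average lead time
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=increase(github_deployment_frequency_total[7d])'
curl -s http://localhost:9090/api/v1/query --data-urlencode 'query=github_lead_time_for_changes_seconds'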
4. Customizing Your Dashboards
Grafana is highly customizable, so don't be afraid to get creative! Here are some tips:
- Edit Panels: Click on a panel title and select "Edit" to change the visualization type or query.
- Move Panels: Drag and drop panels to rearrange them.
- Add Alerts: Set up alerts directly from Grafana panels to notify your team of critical issues.
- Save Changes: Always save your dashboard after making changes.
Part 3: Implementing AWS Cost Exporter
1. Create Cost Exporter Directory
sudo mkdir -p /opt/cost_exporter
2. Create the Python Cost Exporter Script
Create the file at /opt/cost_exporter/cost_exporter.py:
import time
import boto3
from datetime import datetime, timedelta
from prometheus_client import start_http_server, Gauge
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger('aws-cost-exporter')

# Create metrics
aws_service_cost = Gauge('aws_service_cost_dollars', 'Cost in dollars by AWS service', ['service'])
aws_total_cost = Gauge('aws_total_cost_dollars', 'Total AWS cost in dollars', [])
aws_budget_usage = Gauge('aws_budget_usage_percent', 'Budget usage percentage', ['budget_name'])


def collect_cost_metrics():
    """Collect AWS cost metrics and update Prometheus gauges"""
    ce_client = boto3.client('ce')
    budgets_client = boto3.client('budgets')

    # Get current date and start of month
    end_date = datetime.utcnow().strftime('%Y-%m-%d')
    start_date = datetime(datetime.utcnow().year, datetime.utcnow().month, 1).strftime('%Y-%m-%d')

    try:
        # Get cost by service
        response = ce_client.get_cost_and_usage(
            TimePeriod={
                'Start': start_date,
                'End': end_date
            },
            Granularity='MONTHLY',
            Metrics=['UnblendedCost'],
            GroupBy=[
                {
                    'Type': 'DIMENSION',
                    'Key': 'SERVICE'
                }
            ]
        )

        total_cost = 0
        # Process service costs
        for group in response['ResultsByTime'][0]['Groups']:
            service_name = group['Keys'][0]
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            total_cost += cost
            aws_service_cost.labels(service=service_name).set(cost)
            logger.info(f"Service: {service_name}, Cost: ${cost:.2f}")

        # Set total cost
        aws_total_cost.set(total_cost)
        logger.info(f"Total Cost: ${total_cost:.2f}")

        # Get budgets and their usage
        try:
            budgets_response = budgets_client.describe_budgets(
                AccountId=boto3.client('sts').get_caller_identity().get('Account')
            )
            for budget in budgets_response.get('Budgets', []):
                budget_name = budget['BudgetName']
                calculated_spend = float(budget.get('CalculatedSpend', {}).get('ActualSpend', {}).get('Amount', 0))
                budget_limit = float(budget.get('BudgetLimit', {}).get('Amount', 0))
                if budget_limit > 0:
                    usage_percent = (calculated_spend / budget_limit) * 100
                    aws_budget_usage.labels(budget_name=budget_name).set(usage_percent)
                    logger.info(f"Budget: {budget_name}, Usage: {usage_percent:.2f}%")
        except Exception as e:
            logger.error(f"Error getting budget information: {str(e)}")

    except Exception as e:
        logger.error(f"Error collecting cost metrics: {str(e)}")


def main():
    # Start up the server to expose the metrics.
    port = 9108
    start_http_server(port)
    logger.info(f"AWS Cost Exporter started on port {port}")
    # Update metrics every hour
    while True:
        collect_cost_metrics()
        time.sleep(3600)  # 1 hour


if __name__ == '__main__':
    main()
3. Install Required Python Packages
sudo apt install python3-boto3
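The script uses boto3's default credential chain, so the server needs AWS credentials with Cost Explorer and Budgets read access (ce:GetCostAndUsage for Cost Explorer, plus read access to AWS Budgets and sts:GetCallerIdentity) and a default region. Because the service file below doesn't set User=, the exporter runs as root, so file-based credentials must be available to root unless you rely on an EC2 instance role. If the AWS CLI is installed, you can confirm access first (the dates are just an example period):
# Verify credentials and Cost Explorer access before starting the exporter
aws sts get-caller-identity
aws ce get-cost-and-usage --time-period Start=2024-01-01,End=2024-02-01 --granularity MONTHLY --metrics UnblendedCost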
4. Create a Systemd Service for the Cost Exporter
Create /etc/systemd/system/cost_exporter.service file and add the following to it:
[Unit]
Description=AWS Cost Exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/cost_exporter/cost_exporter.py
Restart=always
[Install]
WantedBy=multi-user.target
5. Start the Cost Exporter
sudo systemctl daemon-reload
sudo systemctl enable cost_exporter
sudo systemctl start cost_exporter
6. Update Prometheus Configuration
Add the following to /etc/prometheus/prometheus.yml under the scrape_configs section:
  - job_name: 'aws_cost'
    static_configs:
      - targets: ['localhost:9108']
7. Restart Prometheus
sudo systemctl restart prometheus
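After the restart, the aws_cost job should show as UP under Status > Targets in the Prometheus UI, and the exporter itself (default port 9108 from the script) should list the cost gauges once the first collection has run:
# The aws_* gauges appear after the exporter's first successful collection
curl -s http://localhost:9108/metrics | grep '^aws_'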
Part 4: Setting Up AWS-Specific Alerts
1. Add AWS Alert Rules to Prometheus
Add the following to /etc/prometheus/alert_rules.yml under a new group. Note that the EC2, RDS, and Lambda alerts assume you also export CloudWatch metrics into Prometheus under these metric names (for example via a CloudWatch exporter); the budget alert uses the aws_budget_usage_percent gauge from the cost exporter above.
  - name: aws_alerts
    rules:
      - alert: HighEC2CPUUsage
        expr: avg(aws_ec2_cpuutilization_average{instance=~".*"}) by (instance) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Usage on EC2 Instance {{ $labels.instance }}"
          description: "EC2 Instance {{ $labels.instance }} has high CPU usage ({{ $value }}%) for 5 minutes."

      - alert: RDSHighCPUUsage
        expr: aws_rds_cpuutilization_average > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU Usage on RDS Instance {{ $labels.dbinstance_identifier }}"
          description: "RDS Instance {{ $labels.dbinstance_identifier }} has high CPU usage ({{ $value }}%) for 5 minutes."

      - alert: LambdaErrors
        expr: increase(aws_lambda_errors_sum[1h]) > 10
        labels:
          severity: warning
        annotations:
          summary: "High Error Rate on Lambda Function {{ $labels.function_name }}"
          description: "Lambda Function {{ $labels.function_name }} has more than 10 errors in the past hour."

      - alert: BudgetNearLimit
        expr: aws_budget_usage_percent > 90
        labels:
          severity: warning
        annotations:
          summary: "Budget Usage Approaching Limit"
          description: "Budget {{ $labels.budget_name }} is at {{ $value | printf \"%.2f\" }}% of its limit."
2. Restart Prometheus
sudo systemctl restart prometheus
Conclusion
By following these steps, you've successfully configured Grafana dashboards for Node Exporter, Blackbox Exporter, and DORA metrics, set up AWS-specific alerts, and integrated the AWS Cost Exporter into Prometheus and Grafana. These dashboards provide a clear view of your system's performance and CI/CD pipeline efficiency, helping you make data-driven decisions.
Don't forget to explore Grafana's extensive library of pre-built dashboards and plugins to further enhance your monitoring setup.
Thank You for Reading!
If you found this guide helpful, don't forget to like, comment, and share! Let me know if you have any questions or need further assistance.
Happy monitoring! 🚀