Alerting Setup in EKS using Prometheus

Introduction

Overview:

This SOP provides detailed instructions for configuring alerting in an Amazon EKS cluster using Prometheus. Prometheus is an open-source monitoring and alerting toolkit widely used in Kubernetes environments for real-time monitoring and proactive alerting based on metrics. The integration ensures timely notifications about potential issues, enabling swift action to maintain system health and reliability.

Prometheus:

Prometheus is a powerful open-source monitoring system designed for collecting, storing, and querying time-series data. It is highly scalable and well-suited for cloud-native environments, especially Kubernetes. Prometheus collects metrics from configured targets, evaluates defined rules, and enables queries using its PromQL language. With its robust integration capabilities, Prometheus is a cornerstone of modern observability stacks, offering insights into application and infrastructure performance.

Alerts for Pods:

Alerting for pods involves monitoring the health, resource usage, and performance of Kubernetes pods and triggering alerts when specific thresholds or conditions are breached. For example, alerts can be set for high CPU or memory usage, pod restarts, or readiness and liveness probe failures. This ensures teams are promptly notified of potential issues, enabling them to take corrective actions to maintain application availability and reliability.
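For instance, a hedged sketch of a high-memory rule in Prometheus alerting-rule YAML (the metric comes from cAdvisor via the kubelet; the namespace and the 1.5 GiB threshold are illustrative, not part of this SOP):

- alert: HighPodMemoryUsage
  expr: sum(container_memory_working_set_bytes{namespace="prod"}) by (pod) > 1.5e+09
  for: 5m
  labels:
    severity: warning
  annotations:
    description: 'Pod {{ $labels.pod }} has used more than 1.5 GiB of memory for 5 minutes.'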

Objective:

The objective of this SOP is to:

Set up Prometheus in an EKS cluster for monitoring.
Configure alerting rules for key performance metrics and resource utilization.
Integrate Prometheus with Alertmanager to route alerts to notification channels like email.

Key Components:

Amazon EKS: The managed Kubernetes service that hosts your applications.
Prometheus: The monitoring and alerting toolkit for collecting metrics.
Alertmanager: A component of Prometheus responsible for managing alerts and routing them to configured endpoints.
Kubernetes Metrics Server: A lightweight service for gathering resource metrics like CPU and memory.

Prerequisites:

EKS Cluster: A fully operational EKS cluster with kubectl configured for access.
Prometheus Operator: Deployed in the EKS cluster for managing Prometheus configurations.
Alertmanager: Installed alongside Prometheus in the cluster.
IAM Permissions: Sufficient AWS IAM permissions to manage resources in the EKS cluster.

Procedure
Initial Setup:

Log in to the server:

ssh -i "<key-file.pem>" <user>@<server-ip>

Switch to the Jenkins user:

sudo su - jenkins

Create a working directory and move into it:

cd

mkdir <working-directory>

cd <working-directory>

Installation of Prometheus Stack:

Check whether the Prometheus Helm repository is already present; if not, add it using Helm, the Kubernetes package manager:

helm repo ls

If the repo is not listed, add the prometheus-community repo with the commands below:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

helm repo update

helm pull prometheus-community/kube-prometheus-stack

Extract the downloaded chart archive:

tar -xvf kube-prometheus-stack-*.tgz

Adding Affinity in the Values file:

Find the values.yaml file inside the extracted kube-prometheus-stack directory: cd into it and list the files.

Take a backup of values.yaml, then copy its content and open it in an IDE for modification.

Before making the modification, find the role label of the node group, for example as in the sketch below.
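A minimal shell sketch of these steps (the extracted directory name follows the chart name; the role label key is taken from the affinity block below):

# Enter the extracted chart directory and back up the default values
cd kube-prometheus-stack
cp values.yaml values.yaml.bak

# Look up the node-group "role" label on the cluster nodes
kubectl get nodes -L role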

Use the below syntax for Affinity:

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: role
              operator: In
              values:
                - Production-Magnifi-Monitoring-NG
(Screenshots: affinity section before and after modification.)

After modifying the affinity, replace the old content of values.yaml with the newly modified content.

Go to the charts folder, find the grafana folder, and locate the values.yaml file inside it.

Follow the same steps to modify the affinity in the Grafana values file, or set it via the parent chart's values as sketched below.
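Alternatively, the Grafana subchart can be pinned to the same node group from the parent chart's values.yaml, since Helm passes values under the grafana: key through to the subchart; a minimal sketch:

grafana:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: role
                operator: In
                values:
                  - Production-Magnifi-Monitoring-NG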

Deploy the prom stack using Helm Package Manager:

helm install prom-stack . -f values.yaml -n monitoring --create-namespace

Check that all pods are in the Running state:

k get all -n monitoring

Setting up the Alerts:

Delete all the default Prometheus rules except the following two:

prom-stack-kube-prometheus-kubernetes-apps

prom-stack-kube-prometheus-k8s.rules.container-cpu-usage-second

List the PrometheusRule objects:

k get prometheusrules -n monitoring

Delete the default rules (everything except the two listed above):

k delete prometheusrules <rule-name> -n monitoring

Check that only the two rules below remain:

prom-stack-kube-prometheus-k8s.rules.container-cpu-usage-second

prom-stack-kube-prometheus-kubernetes-apps

k get prometheusrules -n monitoring

Edit the kubernetes-apps rule:

k edit prometheusrules prom-stack-kube-prometheus-kubernetes-apps -n monitoring

Delete all the existing entries under spec.groups[].rules.

Add the new rules as given below:

Reference Rule:

rules:
  - alert: MagnifiProductionPodRestart
    annotations:
      description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times.
    expr: kube_pod_container_status_restarts_total{namespace="prod"} > 0
    for: 1m
    labels:
      severity: production-critical
  - alert: MagnifiProductionPodPending
    annotations:
      description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in Pending state ({{ $value }}).
    expr: kube_pod_status_phase{namespace="prod", phase="Pending"} == 1
    for: 1m
    labels:
      severity: production-critical
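For reference, instead of editing the default rule in place, the same alerts could be shipped as a separate PrometheusRule object; a minimal sketch, assuming the Helm release name prom-stack (so the release: prom-stack label matches the operator's default rule selector) and a hypothetical object name:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: magnifi-production-pod-alerts   # hypothetical name
  namespace: monitoring
  labels:
    release: prom-stack                 # must match the operator's ruleSelector
spec:
  groups:
    - name: magnifi-production-pods
      rules:
        - alert: MagnifiProductionPodRestart
          expr: kube_pod_container_status_restarts_total{namespace="prod"} > 0
          for: 1m
          labels:
            severity: production-critical
          annotations:
            description: 'Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times.'
        # The MagnifiProductionPodPending alert from the reference rule is added the same way.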

Add the SMTP credentials to the Alertmanager secret by following the steps below:

Create an SMTP user (for example, an Amazon SES SMTP user).

Create its Access Key and Secret Key.

Use the format below and fill in the credentials.

global:
  resolve_timeout: 5m
  smtp_from: <sender-email>
  smtp_smarthost: email-smtp.ap-south-1.amazonaws.com:587
  smtp_auth_username: <smtp-access-key>
  smtp_auth_password: <smtp-secret-key>
  smtp_require_tls: true
route:
  receiver: support
  group_by:
    - job
    - monitor_type
    - severity
    - alertname
    - namespace
  routes:
    - receiver: support
      match:
        alertname: <Alert_Name_1>
    - receiver: support
      match:
        alertname: <Alert_Name_2>
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
  - name: support
    email_configs:
      - send_resolved: true
        to: <recipient-email-1>
      - send_resolved: true
        to: <recipient-email-2>
templates:
  - '/etc/alertmanager/config/*.tmpl'

After making the modifications, Base64-encode the configuration above, for example using the site below:

Link: https://www.base64encode.org/

Update the Alertmanager secret with the new configuration:

List the secrets in the monitoring namespace:

k get secrets -n monitoring

Edit the Alertmanager secret:

k edit secrets alertmanager-prom-stack-kube-prometheus-alertmanager -n monitoring

Remove the old Base64 value of the alertmanager.yaml key and add the newly encoded value in its place. A shell-based alternative is sketched below.
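A minimal sketch of doing the same from the command line instead of an online encoder (assumes the edited config is saved locally as alertmanager.yaml):

# Base64-encode the edited Alertmanager configuration (no line wrapping)
base64 -w0 alertmanager.yaml

# Paste the output as the value of the alertmanager.yaml key, or patch it directly:
kubectl -n monitoring patch secret alertmanager-prom-stack-kube-prometheus-alertmanager \
  --type merge -p "{\"data\":{\"alertmanager.yaml\":\"$(base64 -w0 alertmanager.yaml)\"}}"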

Alerts Testing:

Pod Restart Testing:

vi restart-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: always-restart-pod
spec:
  restartPolicy: Always
  containers:
    - name: my-container
      image: nginx:latest
      command: ["/bin/sh", "-c", "exit 1"]
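To trigger the restart alert, apply the pod in the prod namespace and clean it up once the alert fires, mirroring the pending-pod test below (the namespace is assumed from the alert expressions):

k apply -f restart-pod.yaml -n prod

k delete po always-restart-pod -n prod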

Got Alert for Pod Restart Firing and Resolved:

Pod Pending Testing:

vi pending-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: pending-pod
spec:
  containers:
    - name: my-container
      image: non-existing-image:latest

k apply -f pending-pod.yaml -n prod

k delete po pending-pod -n prod

Got Alert for Pod Pending Firing and Resolved:

Expose the Grafana, Prometheus, and Alertmanager services:

Check the services in the monitoring namespace:

k get svc -n monitoring

Create an ingress file for the monitoring stack:

vi prometheus-ingress.yaml

Replace the service name and port of each backend with those of the related service.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: 5m
  name: prod-prom-grafana-monitoring
  namespace: monitoring
spec:
  rules:
    - host: grafana-prod-mumbai.illusto.com
      http:
        paths:
          - backend:
              service:
                name: prom-stack-grafana
                port:
                  number: 80
            path: /
            pathType: ImplementationSpecific
    - host: prom-prod-mumbai.illusto.com
      http:
        paths:
          - backend:
              service:
                name: prom-stack-kube-prometheus-prometheus
                port:
                  number: 9090
            path: /
            pathType: ImplementationSpecific
    - host: alert-prod-mumbai.illusto.com
      http:
        paths:
          - backend:
              service:
                name: prom-stack-kube-prometheus-alertmanager
                port:
                  number: 9093
            path: /
            pathType: ImplementationSpecific

Create the ingress rules

k apply -f prometheus-ingress.yaml

Check the ingress.

k get ingress -n monitoring

Add the Route53 records pointing the three hostnames to the ingress controller's load balancer (a sketch follows):
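A minimal sketch of one such record using the AWS CLI; the hosted zone ID and the load balancer DNS name (the ingress ADDRESS) are placeholders, and the same change is repeated for the prom- and alert- hostnames:

aws route53 change-resource-record-sets \
  --hosted-zone-id <HOSTED_ZONE_ID> \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "grafana-prod-mumbai.illusto.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "<ingress-load-balancer-dns>"}]
      }
    }]
  }'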

Once DNS resolves, the Grafana, Prometheus, and Alertmanager dashboards are reachable at their respective hostnames.

Scope

This SOP applies to DevOps, monitoring, and SRE teams tasked with maintaining the reliability and performance of applications deployed on Amazon EKS. It is applicable for both staging and production environments.

Roles and Responsibilities

DevOps Engineers:

1. Responsible for deploying and configuring Prometheus and Alertmanager.

2. Define and maintain alerting rules based on organizational needs.

3. Act on received alerts to resolve issues and ensure high availability.

4. Validate alerting configurations to ensure compliance with security protocols.

Enforcement

1. Policy Compliance: All alerting configurations must follow organizational monitoring and alerting standards.

2. Access Control: Only authorized personnel are allowed to modify Prometheus and Alertmanager configurations.

3. Auditing: Regular audits of alerting rules should be conducted to ensure effectiveness and compliance.

Conclusion:

Alerting in EKS using Prometheus provides a robust mechanism to proactively monitor the health and performance of applications. By following this SOP, teams can ensure timely notifications for critical issues, reducing downtime and maintaining application reliability in dynamic Kubernetes environments.
