Introduction
Overview:
This SOP provides detailed instructions for configuring alerting in an Amazon EKS cluster using Prometheus. Prometheus is an open-source monitoring and alerting toolkit widely used in Kubernetes environments for real-time monitoring and proactive alerting based on metrics. The integration ensures timely notifications about potential issues, enabling swift action to maintain system health and reliability.
Prometheus:
Prometheus is a powerful open-source monitoring system designed for collecting, storing, and querying time-series data. It is highly scalable and well-suited for cloud-native environments, especially Kubernetes. Prometheus collects metrics from configured targets, evaluates defined rules, and enables queries using its PromQL language. With its robust integration capabilities, Prometheus is a cornerstone of modern observability stacks, offering insights into application and infrastructure performance.
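As a brief illustration of PromQL (a generic example, not specific to this setup), the per-pod CPU usage rate over the last five minutes in a namespace can be queried as:

```promql
sum(rate(container_cpu_usage_seconds_total{namespace="prod"}[5m])) by (pod)
```

Expressions like this form the basis of both dashboards and the alerting rules configured later in this SOP.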
Alerts for Pods:
Alerting for pods involves monitoring the health, resource usage, and performance of Kubernetes pods and triggering alerts when specific thresholds or conditions are breached. For example, alerts can be set for high CPU or memory usage, pod restarts, or readiness and liveness probe failures. This ensures teams are promptly notified of potential issues, enabling them to take corrective actions to maintain application availability and reliability.
Objective:
The objective of this SOP is to:
Set up Prometheus in an EKS cluster for monitoring.
Configure alerting rules for key performance metrics and resource utilization.
Integrate Prometheus with Alertmanager to route alerts to notification channels like email.
Key Components:
Amazon EKS: The managed Kubernetes service that hosts your applications.
Prometheus: The monitoring and alerting toolkit for collecting metrics.
Alertmanager: A component of Prometheus responsible for managing alerts and routing them to configured endpoints.
Kubernetes Metrics Server: A lightweight service for gathering resource metrics like CPU and memory.
Prerequisites:
EKS Cluster: A fully operational EKS cluster with kubectl configured for access.
Prometheus Operator: Deployed in the EKS cluster for managing Prometheus configurations.
Alertmanager: Installed alongside Prometheus in the cluster.
IAM Permissions: Sufficient AWS IAM permissions to manage resources in the EKS cluster.
Procedure
Initial Setup:
Log in to the server:
ssh -i "" @ip
Switch to the Jenkins user and create a working directory:
sudo su - jenkins
cd
mkdir
cd
Installation of Prometheus Stack:
Check the Helm repos; if the Prometheus repo is not present, add it through Helm, the Kubernetes package manager:
helm repo ls
If the repo is not listed, add the prometheus-community repo using the commands below:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm pull prometheus-community/kube-prometheus-stack
Extract the tar file:
tar -xvf
Adding Affinity in the Values file:
Find the values.yaml file inside the extracted kube-prometheus-stack directory.
cd into the extracted directory and check the files.
Take a backup of values.yaml.
Copy the content of the values file and open it in an IDE for modification.
Before making the modification, find the role label of the node group, e.g. with kubectl get nodes -L role.
Use the below syntax for affinity:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: role
              operator: In
              values:
                - Production-Magnifi-Monitoring-NG
Compare the values.yaml content before and after adding the affinity block.
After modifying the entire YAML file, replace the old content with the newly modified content.
Go to the charts folder, find the grafana folder, and locate the values file inside it.
Follow the same steps to modify the affinity in the Grafana values file.
Deploy the prom stack using Helm Package Manager:
helm install prom-stack . -f values.yaml -n monitoring --create-namespace
Check that all pods are in the Running state (here k is an alias for kubectl):
k get all -n monitoring
Setting up the Alerts:
Delete all the default Prometheus rules except the two below:
prom-stack-kube-prometheus-kubernetes-apps
prom-stack-kube-prometheus-k8s.rules.container-cpu-usage-second
List the PrometheusRule resources:
k get prometheusrules -n monitoring
Delete the default rules
k delete prometheusrules -n
Check that only the retained rules remain:
prom-stack-kube-prometheus-k8s.rules.container-cpu-usage-second
prom-stack-kube-prometheus-kubernetes-apps
k get prometheusrules -n monitoring
Edit the above rule.
k edit prometheusrules prom-stack-kube-prometheus-kubernetes-apps -n monitoring
Delete all the existing rules under spec, below the rules key.
Add the new rules as given below:
Reference rule:
rules:
  - alert: MagnifiProductionPodRestart
    annotations:
      description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times.
    expr: kube_pod_container_status_restarts_total{namespace="prod"} > 0
    for: 1m
    labels:
      severity: production-critical
  - alert: MagnifiProductionPodPending
    annotations:
      description: Pod {{ $labels.namespace }}/{{ $labels.pod }} is in Pending state ({{ $value }}).
    expr: kube_pod_status_phase{namespace="prod", phase="Pending"} == 1
    for: 1m
    labels:
      severity: production-critical
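For reference, when authored as a standalone manifest rather than via k edit, rules like the ones above sit under spec.groups of a PrometheusRule resource. A minimal sketch (the metadata name, group name, and release label here are illustrative assumptions; the release label must match your Prometheus rule selector):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: magnifi-production-pod-alerts   # illustrative name
  namespace: monitoring
  labels:
    release: prom-stack                 # must match the Prometheus ruleSelector
spec:
  groups:
    - name: magnifi-production.rules    # illustrative group name
      rules:
        - alert: MagnifiProductionPodRestart
          expr: kube_pod_container_status_restarts_total{namespace="prod"} > 0
          for: 1m
          labels:
            severity: production-critical
          annotations:
            description: Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times.
```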
Add the SMTP credentials to the Prometheus secrets, following the steps below:
Create an SMTP user.
Create an access key and secret key.
Use the format below and fill in the credentials.
global:
  resolve_timeout: 5m
  smtp_from:
  smtp_smarthost: email-smtp.ap-south-1.amazonaws.com:587
  smtp_auth_username:
  smtp_auth_password:
  smtp_require_tls: true
route:
  receiver: support
  group_by:
    - job
    - monitor_type
    - severity
    - alertname
    - namespace
  routes:
    - receiver: support
      match:
        alertname: <Alert_Name_1>
    - receiver: support
      match:
        alertname: <Alert_Name_2>
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
  - name: support
    email_configs:
      - send_resolved: true
        to:
      - send_resolved: true
        to:
templates:
  - '/etc/alertmanager/config/*.tmpl'
After making the modifications, base64-encode the configuration above.
Link: https://www.base64encode.org/
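The encoding can also be done locally instead of on the website. A minimal sketch, using a one-line stand-in string for the full Alertmanager config:

```shell
# Base64-encode the edited Alertmanager config so the result can be pasted
# as the value of the alertmanager.yaml key in the secret.
# A short stand-in string is used here; pipe the real file instead, e.g.
#   base64 -w0 < alertmanager.yaml
# (-w0 disables line wrapping, producing a single line)
encoded=$(printf 'resolve_timeout: 5m' | base64 -w0)
echo "$encoded"
# Sanity check: decoding restores the original text.
printf '%s' "$encoded" | base64 -d
```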
Make the change in the Alertmanager secret:
List the secrets in the monitoring namespace:
k get secrets -n monitoring
Edit the Alert Manager secret
k edit secrets alertmanager-prom-stack-kube-prometheus-alertmanager -n monitoring
Replace the old base64 value of the alertmanager.yaml key with the newly encoded value.
Alerts Testing:
Pod Restart Testing:
vi restart-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: always-restart-pod
spec:
  restartPolicy: Always
  containers:
    - name: my-container
      image: nginx:latest
      command: ["/bin/sh", "-c", "exit 1"]
k apply -f restart-pod.yaml -n prod
Alert received for pod restart, both firing and resolved:
Pod Pending Testing:
vi pending-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: pending-pod
spec:
  containers:
    - name: my-container
      image: non-existing-image:latest
k apply -f pending-pod.yaml -n prod
k delete po pending-pod -n prod
Alert received for pod pending, both firing and resolved:
Expose the Grafana, Prometheus, and Alertmanager services:
Check the services in the monitoring namespace:
k get svc -n
Create an ingress file for Prometheus:
vi prometheus-ingress.yaml
Replace the service name and port for each related service as needed.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: 5m
  name: prod-prom-grafana-monitoring
  namespace: monitoring
spec:
  rules:
    - host: grafana-prod-mumbai.illusto.com
      http:
        paths:
          - backend:
              service:
                name: prom-stack-grafana
                port:
                  number: 80
            path: /
            pathType: ImplementationSpecific
    - host: prom-prod-mumbai.illusto.com
      http:
        paths:
          - backend:
              service:
                name: prom-stack-kube-prometheus-prometheus
                port:
                  number: 9090
            path: /
            pathType: ImplementationSpecific
    - host: alert-prod-mumbai.illusto.com
      http:
        paths:
          - backend:
              service:
                name: prom-stack-kube-prometheus-alertmanager
                port:
                  number: 9093
            path: /
            pathType: ImplementationSpecific
Create the ingress rules
k apply -f prometheus-ingress.yaml
Check the ingress.
k get ingress -n monitoring
Add the Route 53 records pointing each hostname to the ingress load balancer:
Grafana Dashboard:
Prometheus Dashboard:
AlertManager Dashboard:
Scope
This SOP applies to DevOps, monitoring, and SRE teams tasked with maintaining the reliability and performance of applications deployed on Amazon EKS. It is applicable for both staging and production environments.
Roles and Responsibilities
DevOps Engineers:
1. Responsible for deploying and configuring Prometheus and Alertmanager.
2. Define and maintain alerting rules based on organizational needs.
3. Act on received alerts to resolve issues and ensure high availability.
4. Validate alerting configurations to ensure compliance with security protocols.
Enforcement
1. Policy Compliance: All alerting configurations must follow organizational monitoring and alerting standards.
2. Access Control: Only authorized personnel are allowed to modify Prometheus and Alertmanager configurations.
3. Auditing: Regular audits of alerting rules should be conducted to ensure effectiveness and compliance.
Conclusion:
Alerting in EKS using Prometheus provides a robust mechanism to proactively monitor the health and performance of applications. By following this SOP, teams can ensure timely notifications for critical issues, reducing downtime and maintaining application reliability in dynamic Kubernetes environments.