Monitoring
If you're starting in DevOps, or have some experience but haven't had any hands-on time with Monitoring, Logging or Alerts, this will help you with Monitoring (and alerts a little bit) at least.
Without Monitoring, any application crumbles under its sheer vastness. Any application needs constant care; otherwise, things go haywire or just stop working altogether. Monitoring means watching system metrics, and it is by watching those metrics that we detect incidents and events. It can be done for various purposes, like making sure the site is always healthy, servicing customers and bringing in revenue, or to detect any intrusion or threat. Whatever the reason, monitoring is important, with the end goal of keeping the application/site always healthy.
Since I'm a DevOps Guy, I'm not going into security but rather into monitoring to keep the application healthy, so let's start.
Prerequisite: AWS account, Helm
Google's SRE book suggests these 4 major signals, known as the Golden Signals, that we need to monitor all the time as they give critical data about the application.
1. Latency
The time it takes to service a request, that is, from the request reaching the server to the response getting back to us.
2. Traffic
Simply speaking, the number of requests per second received by the server.
3. Errors
As the name suggests, the number of failed requests: 4xx/5xx responses, timeouts, or any request that took longer to respond than the threshold set in your SLOs.
4. Saturation
In layman's terms, how full the system is: any resource that is close to 100% utilization. It is used to monitor the most constrained or limited resources in a system, like CPU, disk space, bandwidth, etc.
Again, these are very short explanations; if you want to read more about them, click here.
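To make the four definitions a bit more concrete, here is a minimal sketch that computes each signal by hand from a handful of request records. Everything in it is an assumption for illustration: the `requests` array, the 10-second window and the CPU numbers are made up, and `goldenSignals` is a hypothetical helper, not part of any library. In the real setup later in this post, Prometheus does this work for us.

```javascript
// Hypothetical request records for one 10-second window; all values are made up.
const requests = [
  { durationMs: 120, status: 200 },
  { durationMs: 80, status: 200 },
  { durationMs: 950, status: 500 },
  { durationMs: 200, status: 404 },
  { durationMs: 60, status: 200 },
];

// Hypothetical helper: derives the four Golden Signals from raw records.
function goldenSignals(records, windowSeconds, cpuUsedCores, cpuLimitCores) {
  const durations = records.map((r) => r.durationMs).sort((a, b) => a - b);

  // Latency: nearest-rank 95th percentile of the request durations.
  const rank = Math.ceil(0.95 * durations.length);
  const p95LatencyMs = durations[rank - 1];

  // Traffic: requests per second over the window.
  const requestsPerSecond = records.length / windowSeconds;

  // Errors: share of requests that came back as 4xx or 5xx.
  const errorCount = records.filter((r) => r.status >= 400).length;
  const errorRatio = errorCount / records.length;

  // Saturation: how much of the constrained resource is in use.
  const cpuSaturation = cpuUsedCores / cpuLimitCores;

  return { p95LatencyMs, requestsPerSecond, errorRatio, cpuSaturation };
}

const signals = goldenSignals(requests, 10, 0.4, 0.5);
console.log(signals);
```

With this sample data, the single slow request dominates the P95 latency, which is why percentiles are usually more telling than averages.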
Now let's talk about Tooling. I've been using Prometheus and Grafana as my go-to tools for monitoring.
So let's start with the hands-on.
STEP I
I'm doing this on AWS, specifically on EKS. Create the cluster using eksctl. I like eksctl as it creates everything from the VPC to the node group with minimal inputs; you can find your suitable settings here.
But here's mine:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster
  region: eu-west-2 # Replace with your preferred AWS region
  version: "1.30" # Replace with your desired Kubernetes version
managedNodeGroups:
  - name: ng-t2-medium
    instanceType: t3a.medium
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
    volumeSize: 20 # Size in GiB
    privateNetworking: true # If you want nodes to use private subnets only
    ssh:
      enableSsm: true # Allows access via AWS Systems Manager Session Manager
Nothing complicated, just 2 instances and that's all. Remember to update the add-ons installed by eksctl.
STEP II
Now that we have our EKS cluster, remember to get the kubeconfig file onto your workstation by using the command:
aws eks update-kubeconfig --region <region-code> --name <cluster-name>
Now let's use the Helm chart for the kube-prometheus-stack: install it and wait for the startup.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install <Name> prometheus-community/kube-prometheus-stack -n <namespace>
STEP III
Now that we have installed Prometheus and Grafana, let's create an application that we're going to monitor.
I'm using a very simple Node.js application:
const express = require("express");
const client = require("prom-client");
const winston = require("winston");
const os = require("os");

const app = express();
const PORT = 3000;

// Initialize Prometheus metrics
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.1, 0.5, 1, 2, 5], // Buckets for latency
});

const httpRequestCount = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route"],
});

const httpErrorCount = new client.Counter({
  name: "http_errors_total",
  help: "Total number of HTTP errors",
  labelNames: ["method", "status_code"],
});

const systemMetrics = new client.Gauge({
  name: "system_resource_usage",
  help: "System CPU and memory usage",
  labelNames: ["resource"],
});

// Logger setup
const logger = winston.createLogger({
  level: "info",
  format: winston.format.json(),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: "app.log" }),
  ],
});

// Middleware to track request duration and traffic
app.use((req, res, next) => {
  const start = Date.now();
  httpRequestCount.inc({ method: req.method, route: req.path });

  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000; // Convert to seconds
    httpRequestDuration.observe(
      { method: req.method, route: req.path, status_code: res.statusCode },
      duration
    );
    if (res.statusCode >= 400) {
      httpErrorCount.inc({ method: req.method, status_code: res.statusCode });
    }
  });

  next();
});

// Simple endpoints
app.get("/", (req, res) => {
  res.send("Welcome to the Golden Signals App!");
});

app.get("/hello", (req, res) => {
  setTimeout(() => {
    res.send("Hello, World!");
  }, Math.random() * 1000); // Random delay to simulate latency
});

app.get("/error", (req, res) => {
  res.status(500).send("Simulated error!");
});

// Expose metrics at /metrics
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});

// Monitor system resources every 5 seconds
setInterval(() => {
  const memoryUsage = process.memoryUsage().heapUsed / 1024 / 1024; // Convert to MB
  const cpuLoad = os.loadavg()[0]; // 1-minute average

  systemMetrics.set({ resource: "memory" }, memoryUsage);
  systemMetrics.set({ resource: "cpu" }, cpuLoad);

  logger.info({
    memoryUsage: `${memoryUsage.toFixed(2)} MB`,
    cpuLoad: cpuLoad.toFixed(2),
  });
}, 5000);

// Start server
app.listen(PORT, () => {
  console.log(`Server running on http://localhost:${PORT}`);
});
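The `buckets: [0.1, 0.5, 1, 2, 5]` setting on the latency histogram matters later when we query percentiles, so here is a rough sketch of what happens to an observation. Prometheus histograms are cumulative: a value is counted in every bucket whose upper bound is at least that value, plus an implicit `+Inf` bucket. This is a simplified stand-in written with plain arrays, not prom-client's actual implementation; `createHistogram` and `observe` are hypothetical helpers and the observed values are made up.

```javascript
// Bucket upper bounds in seconds, same values as in the app's histogram.
const buckets = [0.1, 0.5, 1, 2, 5];

// Hypothetical stand-in for a histogram: one cumulative counter per bucket,
// plus the implicit +Inf bucket, a running sum and a count.
function createHistogram(upperBounds) {
  return {
    upperBounds: [...upperBounds, Infinity],
    counts: new Array(upperBounds.length + 1).fill(0),
    sum: 0,
    count: 0,
  };
}

function observe(histogram, value) {
  histogram.upperBounds.forEach((le, i) => {
    // Cumulative: every bucket whose bound is >= value gets incremented.
    if (value <= le) histogram.counts[i] += 1;
  });
  histogram.sum += value;
  histogram.count += 1;
}

const h = createHistogram(buckets);
// Made-up request durations in seconds.
[0.05, 0.3, 0.3, 0.7, 1.5, 8].forEach((v) => observe(h, v));

console.log(h.upperBounds); // [0.1, 0.5, 1, 2, 5, Infinity]
console.log(h.counts); // [1, 3, 4, 5, 5, 6]
```

Notice that the 8-second request only lands in the `+Inf` bucket. If your real latencies often exceed the largest bucket, the percentile estimates get less useful, so it's worth picking bucket bounds around the latencies you actually expect.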
Dockerfile for the application:
# Use the official Node.js image from the Docker Hub
FROM node:21-alpine
# Create and change to the app directory
WORKDIR /usr/src/app
# Copy application dependency manifests to the container image.
# A wildcard is used to ensure both package.json AND package-lock.json are copied.
COPY package*.json ./
# Install dependencies
RUN npm install
# Copy the local code to the container image
COPY . .
# Expose the port the app runs on
EXPOSE 3000
# Run the application
CMD ["node", "app.js"]
Now the Deployment, Service and ServiceMonitor file for K8s:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: golden-signals-app
  labels:
    app: golden-signals-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: golden-signals-app
  template:
    metadata:
      labels:
        app: golden-signals-app
    spec:
      containers:
        - name: golden-signals-app
          image: ashutosh5786/golden-signal-app:v1
          ports:
            - containerPort: 3000
          resources:
            limits:
              memory: "256Mi"
              cpu: "500m"
            requests:
              memory: "128Mi"
              cpu: "250m"
          env:
            - name: NODE_ENV
              value: "production"
---
apiVersion: v1
kind: Service
metadata:
  name: golden-signals-service
  labels:
    app: golden-signals-app
spec:
  ports:
    - name: metrics-port # Port name, used in ServiceMonitor
      port: 3000 # Exposed port
      targetPort: 3000 # Port on the container
  selector:
    app: golden-signals-app
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: golden-signals-app-monitor
  labels:
    release: kubeprometheus # Name of the release used when installing with the Helm chart
spec:
  selector:
    matchLabels:
      app: golden-signals-app
  endpoints:
    - port: metrics-port # Use the port name from the Service
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - default # Namespace in which the app is deployed
At this point, I take it that you understand what a Deployment and a Service are, but you might have noticed the ServiceMonitor. It's a way for us to configure Prometheus to scrape the pods we are creating: it selects Services by label, and Prometheus then scrapes the named port on the pods behind them.
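If the target never shows up in Prometheus, mismatched labels are a common culprit, so here is a small sketch of the matching logic as I understand it. There are two hops: Prometheus picks up ServiceMonitors by their labels (I'm assuming the chart's default behaviour of selecting on the `release` label, which may differ if you changed the chart values), and the ServiceMonitor picks up Services through `selector.matchLabels`. The `matchesLabels` helper is hypothetical and only mimics equality-based `matchLabels`; the objects are trimmed copies of the manifests above.

```javascript
// Hypothetical helper: true when every key/value pair in matchLabels
// is present with the same value on the candidate's labels.
function matchesLabels(matchLabels, labels) {
  return Object.entries(matchLabels).every(
    ([key, value]) => labels[key] === value
  );
}

// Trimmed copies of the labels from the manifests in this post.
const serviceLabels = { app: "golden-signals-app" };
const serviceMonitor = {
  labels: { release: "kubeprometheus" },
  selector: { matchLabels: { app: "golden-signals-app" } },
};

// Assumption: Prometheus selects ServiceMonitors carrying the Helm release label.
const prometheusServiceMonitorSelector = { release: "kubeprometheus" };

const monitorIsPickedUp = matchesLabels(
  prometheusServiceMonitorSelector,
  serviceMonitor.labels
);
const serviceIsSelected = matchesLabels(
  serviceMonitor.selector.matchLabels,
  serviceLabels
);

console.log({ monitorIsPickedUp, serviceIsSelected });

// A typo in the release name is enough to break the first hop.
const typoIsPickedUp = matchesLabels(
  { release: "kube-prometheus" },
  serviceMonitor.labels
);
console.log({ typoIsPickedUp });
```

Both hops have to match for the application to appear as a target, which is why the `release` label has to be the exact name you passed to `helm install`.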
STEP IV
After STEP III, we can head to the Prometheus UI to check whether the application is being monitored. Go to Status > Targets.
If you see the application there, good; otherwise go back and recheck everything.
It should look something like this:
After confirming that Prometheus is scraping the data, let's move to Grafana and create the dashboard, which is my favourite part.
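If a panel stays empty later on, it helps to know what Prometheus actually reads from the `/metrics` endpoint on each scrape. Below is a rough sketch that parses a few lines in the Prometheus text exposition format. The sample text is made up (though the metric names match the app), and `parseMetrics` is a hypothetical helper that ignores details like escaped label values and timestamps, so treat it as an illustration and not a full parser.

```javascript
// Made-up sample in the Prometheus text exposition format.
const sampleOutput = `
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",route="/hello"} 42
http_requests_total{method="GET",route="/error"} 7
# HELP http_errors_total Total number of HTTP errors
# TYPE http_errors_total counter
http_errors_total{method="GET",status_code="500"} 7
`;

// Hypothetical, simplified parser: one sample per non-comment line.
function parseMetrics(text) {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line && !line.startsWith("#"))
    .map((line) => {
      // name, optional {labels}, then the sample value.
      const match = line.match(
        /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/
      );
      if (!match) return null;
      const [, name, rawLabels = "", value] = match;
      const labels = {};
      const labelPattern = /([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g;
      for (const [, key, val] of rawLabels.matchAll(labelPattern)) {
        labels[key] = val;
      }
      return { name, labels, value: Number(value) };
    })
    .filter(Boolean);
}

const parsed = parseMetrics(sampleOutput);
console.log(parsed);
```

Each combination of metric name and label values is its own time series, which is what the `by (status_code)` grouping in the dashboard queries works on.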
STEP V
Let's start creating those dashboards.
Side Panel > Dashboard > Add Visualization
Select Prometheus as the data source.
Once you've done this, select the Code option instead of the Builder.
As you can observe in the above screen.
Starting with Saturation: in my case CPU is the constraint, so I'm using the query below to visualize it.
rate(container_cpu_usage_seconds_total{container="golden-signals-app"}[1m])
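In case `rate()` feels like magic: `container_cpu_usage_seconds_total` is a counter that only goes up, and `rate()` turns it into a per-second average over the window, which for CPU seconds means "cores in use". Here is a simplified sketch of that idea. The samples are made up, `simpleRate` is a hypothetical helper, and the real PromQL `rate()` also extrapolates to the window edges, which I'm skipping here.

```javascript
// Made-up samples of a CPU-seconds counter: [timestamp in seconds, value].
const samples = [
  [0, 100.0],
  [15, 103.0],
  [30, 106.0],
  [45, 1.5], // counter reset, e.g. the container restarted
  [60, 4.5],
];

// Hypothetical, simplified version of rate(): total increase / elapsed time.
function simpleRate(points) {
  let increase = 0;
  for (let i = 1; i < points.length; i++) {
    const previous = points[i - 1][1];
    const current = points[i][1];
    // A drop means the counter was reset, so count the new value from zero.
    increase += current >= previous ? current - previous : current;
  }
  const elapsedSeconds = points[points.length - 1][0] - points[0][0];
  return increase / elapsedSeconds;
}

const cpuCores = simpleRate(samples);
console.log(cpuCores); // 0.175
```

With these numbers the container averages 0.175 cores, which is 35% of the 500m CPU limit set in the Deployment. Dividing the rate by the limit is one way to turn this panel into a true saturation percentage.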
Now I'm going to give all the queries I used to build my dashboard for the Golden Signals.
Latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
For finer-grained visualization we could use different queries (for example, one per route), but for the ease of this practical, I'm going with the easier option.
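One thing worth knowing about `histogram_quantile`: it doesn't see the individual request durations, only the cumulative bucket counts, so the P95 is an estimate. The sketch below follows the approach as I understand it from the Prometheus docs: find the bucket the target rank falls into and interpolate linearly inside it. The bucket counts are made up and `histogramQuantile` is my own hypothetical helper, not Prometheus code.

```javascript
// Made-up cumulative bucket counts, keyed by upper bound (le) in seconds.
const cumulativeBuckets = [
  { le: 0.1, count: 10 },
  { le: 0.5, count: 60 },
  { le: 1, count: 90 },
  { le: 2, count: 100 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];

// Hypothetical helper: estimates the q-quantile from cumulative buckets.
function histogramQuantile(q, bucketList) {
  const total = bucketList[bucketList.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = bucketList.findIndex((b) => b.count >= rank);

  // If the rank lands in the +Inf bucket, the best we can say is
  // "at least the highest finite bound".
  if (bucketList[i].le === Infinity) return bucketList[i - 1].le;

  const lowerBound = i === 0 ? 0 : bucketList[i - 1].le;
  const countBelow = i === 0 ? 0 : bucketList[i - 1].count;
  const countInBucket = bucketList[i].count - countBelow;

  // Linear interpolation inside the bucket.
  return (
    lowerBound +
    (bucketList[i].le - lowerBound) * ((rank - countBelow) / countInBucket)
  );
}

const p95 = histogramQuantile(0.95, cumulativeBuckets);
const p50 = histogramQuantile(0.5, cumulativeBuckets);
console.log({ p95, p50 });
```

Here the P95 comes out at about 1.5 seconds, halfway through the 1 to 2 second bucket, even though the real value could be anywhere in that range. That's the trade-off of coarse buckets.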
Error Rate
sum(rate(http_errors_total[1m])) by (status_code)
Traffic (RPS)
sum(rate(http_requests_total[1m]))
After adding all those queries, it should look like this:
Now we are all ready to monitor those Golden Signals to make sure our applications are always healthy and 100% running.
CONCLUSION
Monitoring is essential to keeping your applications healthy and reliable. By tracking Golden Signals with Prometheus and Grafana, you can catch issues early and ensure smooth operations. This guide gives you a starting point to build a monitoring setup that grows with your application’s needs.
Thank you for reading. I'd love to know what you have set up for monitoring in your system. If I missed anything, please mention it in the comments.