Ashutosh Singh for AWS Community Builders

Monitoring Golden Signals in EKS


If you're starting out in DevOps, or have some experience but haven't yet worked with monitoring, logging, or alerting, this post will at least help you with monitoring (and alerts, a little bit).

Without monitoring, any application crumbles under its sheer vastness. Any application needs constant care; otherwise things go haywire or just stop working altogether. By monitoring system metrics, we detect incidents and events and evaluate the health of the system. Monitoring can be done for various purposes: making sure the site is always healthy, serving customers, and bringing in revenue, or detecting intrusions and threats. Whatever the reason, monitoring is important, with the end goal of keeping the application or site always healthy.

Since I'm a DevOps guy, I'm not going into security here; I'll focus on monitoring to keep the application healthy. So let's start.

Prerequisites: an AWS account, eksctl, kubectl, and Helm

Google suggests these four major signals, or Golden Signals, that we need to monitor at all times, as they give critical data about the application.

1. Latency

The time it takes for a request to reach the server and for the response to get back to us.

2. Traffic

Simply speaking, the number of requests per second received by the server.

3. Errors

As the name suggests, the rate of failing requests: 4xx/5xx responses, timed-out requests, or any request that took longer to respond than set in your SLOs.

4. Saturation

In layman's terms, how full the system is: any resource that is close to 100% utilization. It is used to watch the most constrained or limited resources in a system, such as CPU, disk space, and bandwidth.

Again, these are very short explanations; if you want to read more about them, see the Monitoring Distributed Systems chapter of Google's SRE book.

Now let's talk about Tooling. I've been using Prometheus and Grafana as my go-to tools for monitoring.

So let's start with the hands-on.

STEP I

I'm doing this on AWS, specifically EKS. Create the cluster using eksctl; I like eksctl because it creates everything from the VPC to the node group with minimal input. You can find the settings that suit you in the eksctl documentation.

But here's mine:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: my-cluster
  region: eu-west-2 # Replace with your preferred AWS region
  version: "1.30" # Replace with your desired Kubernetes version

managedNodeGroups:
  - name: ng-t3a-medium
    instanceType: t3a.medium
    desiredCapacity: 2
    minSize: 2
    maxSize: 2
    volumeSize: 20 # Size in GiB
    privateNetworking: true # If you want nodes to use private subnets only
    ssh:
      enableSsm: true # Allows access via AWS Systems Manager Session Manager


Nothing complicated, just two instances. Remember to update the add-ons installed by eksctl.
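Assuming you save the config above as cluster.yaml (use any filename you like), create the cluster with:

eksctl create cluster -f cluster.yaml

Cluster creation usually takes around 15-20 minutes, so this is a good time for a coffee.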

STEP II

Now that we have our cluster, remember to fetch the kubeconfig to your workstation using this command:

aws eks update-kubeconfig --region <region-code> --name <cluster-name>
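A quick sanity check confirms kubectl is now talking to the new cluster:

kubectl get nodes

You should see the two nodes from the node group in the Ready state.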

Now let's install the kube-prometheus-stack Helm chart and wait for everything to start up:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install <Name> prometheus-community/kube-prometheus-stack -n <namespace> --create-namespace
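Before moving on, check that everything in the stack came up, using the namespace you installed into:

kubectl get pods -n <namespace>

You should see pods for Prometheus, Grafana, Alertmanager, the Prometheus Operator, node-exporter, and kube-state-metrics, all Running.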

STEP III

Now that we have Prometheus and Grafana installed, let's create the application that we're going to monitor.

I'm using a very simple Node.js application:

const express = require("express");
const client = require("prom-client");
const winston = require("winston");
const os = require("os");

const app = express();
const PORT = 3000;

// Initialize Prometheus metrics
const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "route", "status_code"],
  buckets: [0.1, 0.5, 1, 2, 5], // Buckets for latency
});

const httpRequestCount = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "route"],
});

const httpErrorCount = new client.Counter({
  name: "http_errors_total",
  help: "Total number of HTTP errors",
  labelNames: ["method", "status_code"],
});

const systemMetrics = new client.Gauge({
  name: "system_resource_usage",
  help: "System CPU and memory usage",
  labelNames: ["resource"],
});

// Logger setup
const logger = winston.createLogger({
  level: "info",
  format: winston.format.json(),
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: "app.log" }),
  ],
});

// Middleware to track request duration and traffic
app.use((req, res, next) => {
  const start = Date.now();
  httpRequestCount.inc({ method: req.method, route: req.path });
  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000; // Convert to seconds
    httpRequestDuration.observe(
      { method: req.method, route: req.path, status_code: res.statusCode },
      duration
    );

    if (res.statusCode >= 400) {
      httpErrorCount.inc({ method: req.method, status_code: res.statusCode });
    }
  });
  next();
});

// Simple endpoints
app.get("/", (req, res) => {
  res.send("Welcome to the Golden Signals App!");
});

app.get("/hello", (req, res) => {
  setTimeout(() => {
    res.send("Hello, World!");
  }, Math.random() * 1000); // Random delay to simulate latency
});

app.get("/error", (req, res) => {
  res.status(500).send("Simulated error!");
});

// Expose metrics at /metrics
app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});

// Monitor system resources every 5 seconds
setInterval(() => {
  const memoryUsage = process.memoryUsage().heapUsed / 1024 / 1024; // Convert to MB
  const cpuLoad = os.loadavg()[0]; // 1-minute average
  systemMetrics.set({ resource: "memory" }, memoryUsage);
  systemMetrics.set({ resource: "cpu" }, cpuLoad);

  logger.info({
    memoryUsage: `${memoryUsage.toFixed(2)} MB`,
    cpuLoad: cpuLoad.toFixed(2),
  });
}, 5000);

// Start server
app.listen(PORT, () => {
  console.log(`Server running on http://localhost:${PORT}`);
});

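If you want to try the app locally before containerizing it, here's a minimal sketch, assuming the code above is saved as app.js (os is a Node.js built-in, so only three packages need installing):

npm init -y
npm install express prom-client winston
node app.js

# In another terminal: generate some traffic, then peek at the metrics
curl http://localhost:3000/hello
curl http://localhost:3000/metrics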

Dockerfile for the same

# Use the official Node.js image from the Docker Hub
FROM node:21-alpine

# Create and change to the app directory
WORKDIR /usr/src/app

# Copy application dependency manifests to the container image.
# A wildcard is used to ensure both package.json AND package-lock.json are copied.
COPY package*.json ./

# Install dependencies
RUN npm install

# Copy the local code to the container image
COPY . .

# Expose the port the app runs on
EXPOSE 3000

# Run the application
CMD ["node", "app.js"]
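Build the image and push it to a registry your cluster can pull from; <your-dockerhub-user> below is a placeholder for your own Docker Hub username:

docker build -t <your-dockerhub-user>/golden-signal-app:v1 .
docker push <your-dockerhub-user>/golden-signal-app:v1

If you'd rather skip the build, the Deployment below references my image, ashutosh5786/golden-signal-app:v1.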

Now the Deployment, Service, and ServiceMonitor files for Kubernetes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: golden-signals-app
  labels:
    app: golden-signals-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: golden-signals-app
  template:
    metadata:
      labels:
        app: golden-signals-app
    spec:
      containers:
        - name: golden-signals-app
          image: ashutosh5786/golden-signal-app:v1
          ports:
            - containerPort: 3000
          resources:
            limits:
              memory: "256Mi"
              cpu: "500m"
            requests:
              memory: "128Mi"
              cpu: "250m"
          env:
            - name: NODE_ENV
              value: "production"
---
apiVersion: v1
kind: Service
metadata:
  name: golden-signals-service
  labels:
    app: golden-signals-app
spec:
  ports:
    - name: metrics-port             # Port name, used in ServiceMonitor
      port: 3000                     # Exposed port
      targetPort: 3000               # Port on the container
  selector:
    app: golden-signals-app
  type: ClusterIP
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: golden-signals-app-monitor
  labels:
    release: kubeprometheus # Name of the release used when installing the Helm chart
spec:
  selector:
    matchLabels:
      app: golden-signals-app
  endpoints:
    - port: metrics-port # Use the port name from the Service
      path: /metrics
      interval: 15s
  namespaceSelector:
    matchNames:
      - default # Namespace in which the app is deployed
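Assuming the three manifests above are saved together in a file called app.yaml (pick any name), deploy and verify with:

kubectl apply -f app.yaml
kubectl get pods -l app=golden-signals-app
kubectl get servicemonitor golden-signals-app-monitor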

At this point, I take it that you understand what a Deployment and a Service are, but you might have noticed the ServiceMonitor. It's the way we configure Prometheus to scrape the pods we're creating: the Prometheus Operator watches for ServiceMonitor objects and generates scrape configuration for any Service matching their selectors.

STEP IV

After Step III, we can head to the Prometheus UI to check whether the application is being monitored. Go to Status > Targets.
If you see the application there, good; otherwise go back and recheck everything.
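If you haven't exposed Prometheus outside the cluster, a port-forward is the quickest way in. The command below assumes the headless service prometheus-operated that the Prometheus Operator creates; run kubectl get svc -n <namespace> if your setup differs:

kubectl port-forward svc/prometheus-operated 9090:9090 -n <namespace>

Then open http://localhost:9090 in your browser.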

It should look something like this:
Prometheus Target Screen showing Golden Signal App

After confirming that Prometheus is scraping the data, let's move to Grafana and create the dashboard, which is my favourite part.
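Grafana can be reached the same way. The service and secret names below assume the kube-prometheus-stack defaults, with <Name> being the release name you chose in Step II; adjust if yours differ:

# Forward Grafana to localhost:8080 (the service listens on port 80)
kubectl port-forward svc/<Name>-grafana 8080:80 -n <namespace>

# Retrieve the auto-generated admin password (the username is "admin")
kubectl get secret <Name>-grafana -n <namespace> -o jsonpath="{.data.admin-password}" | base64 --decode

Log in at http://localhost:8080 with those credentials.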

STEP V

Let's start creating those dashboards.

Side Panel > Dashboard > Add Visualization
Select Prometheus as Data Source

Once you've done this, select the Code option instead of the Builder.

Code Option
As you can see in the screenshot above.

Starting with saturation: in my case I've used CPU as the constraint, so I'm using the query below to visualize it.

rate(container_cpu_usage_seconds_total{container="golden-signals-app"}[1m])

Now I'm going to share all the queries I used to build my dashboard for the Golden Signals.

Latency (P95)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
For more fine-grained visualization we could use more detailed queries, but for the ease of this walkthrough, I'm going with the simpler option.
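If you do want more granularity, one option is to aggregate the buckets while keeping the route label, so each endpoint gets its own P95 line (histogram_quantile needs the le label preserved for this to work):

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, route))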

Error Rate
sum(rate(http_errors_total[1m])) by (status_code)

Traffic (RPS)
sum(rate(http_requests_total[1m]))

After adding all those queries, the dashboard should look like this:

Dashboard on Grafana

Now we are all set to monitor those Golden Signals and make sure our applications stay healthy and running.

CONCLUSION

Monitoring is essential to keeping your applications healthy and reliable. By tracking Golden Signals with Prometheus and Grafana, you can catch issues early and ensure smooth operations. This guide gives you a starting point to build a monitoring setup that grows with your application’s needs.

Thank you for reading. I'd love to know what you have set up for monitoring in your systems; if I missed anything, please mention it in the comments.
