DEV Community: Kristijan

Traditional vs Modern Incident Response

Kristijan — Mon, 25 Apr 2022 12:15:16 +0000

What is Incident Response?

An incident is an event (network outage, system failure, data breach, etc.) that can lead to loss of, or disruption to, an organization's operations, services or functions. Incident response is an organization’s effort to detect, analyze and correct the damage caused by an incident. When incident response is mentioned, it usually relates to security incidents, and the terms incident response and incident management are often used interchangeably.

However, an incident can be of any nature; it doesn’t have to be tied to security. For example:

Physical damage to hardware or systems (fire, flooding)
Human error (misconfigurations, accidental deletion of data)
Malicious actors (denial of service attacks, malware, ransomware)

Every incident is different and may require a different response. Incident response consists of the steps an organization takes to address an incident, such as an outage, and restore services to normal operation, often in real time.

A good incident response plan can help your company respond quickly and effectively when an outage occurs. Keep in mind that incident response is not just a technical function to be done by a specific team. Instead, it is more of a corporate process that involves all areas of the business.

Traditional vs Modern Incident Response

The biggest change in the world of incident response was the widespread adoption of automation.

Traditionally, incident response was a highly manual process. Everything from creating a ticket to patching a server required human interaction. This was effective until the internet boom.

Easy internet access has certainly opened up opportunities for people and businesses alike. According to IDC, 60% or more of organizations have spent more on technology to embrace the digital future.

The rise in use of digital platforms has resulted in complex infrastructures with multiple application dependencies. Hence, downtime and system failures for even a few minutes can incur huge monetary losses (in some cases, even millions).

To avoid such events, organizations have resorted to dealing with incidents using teams that are on-call 24/7. This puts a lot of pressure on incident response teams, as they are required to manually monitor systems and keep track of alerts, all while avoiding fatigue. Automating some or most of the incident response process removes repetitive work and helps response teams be more effective with less effort.
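As a small illustration of removing repetitive work, the sketch below suppresses repeated alerts for the same problem within a time window, which is one common way to reduce alert fatigue. The Alert fields, the names, and the five-minute window are all assumptions made for this example and are not taken from any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str      # hypothetical service name, e.g. "api"
    check: str        # hypothetical check name, e.g. "5xx"
    timestamp: float  # seconds since an arbitrary reference point


def deduplicate(alerts, window_seconds=300):
    """Keep one alert per (service, check) pair within each time window."""
    last_kept = {}  # (service, check) -> timestamp of the last alert we kept
    kept = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.check)
        previous = last_kept.get(key)
        # Keep the alert if it is the first of its kind or the window has passed.
        if previous is None or alert.timestamp - previous >= window_seconds:
            kept.append(alert)
            last_kept[key] = alert.timestamp
    return kept
```

With this in place, an on-call engineer would be notified once per problem per window instead of once per raw alert.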

That's not to say that people are no longer involved with incident response. People are still involved in triage, troubleshooting, and postmortem analysis. It's just that those tasks are much less frequent than they were before automation became the norm.

Incident Response used to be about reacting to what happened with a solution for an immediate ‘bleed stop’. Nowadays, it is more about being proactive and trying to prevent incidents altogether by understanding and gaining intelligence about why something has happened.

Incident response and management have become more of a DevOps-based activity, where operational issues are addressed through code and automation rather than manual intervention.

Responding to an Incident

In the SRE (Site Reliability Engineering) realm, incident response can be divided into the following steps:

Detect
Respond
Resolution and Recovery
Postmortems

Let’s expand on those and understand how incidents were responded to in the past, and how they are now.

Detect

This step is where you detect an issue or determine if there has been a breach. A breach or incident could originate from different sources.

Traditional: Primary source of detection would most likely be calls or emails from the impacted users. Monitoring and alerting tools weren’t as ubiquitous as today.

Modern: An issue will usually be caught through monitoring and alerting on metrics, or by people noticing something strange while they're doing their work. With alerting tools and the right on-call schedules in place, it is easier to detect such issues so they can be handled through the proper process.
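To make "alerting on metrics" more concrete, here is a minimal sketch of a threshold alert evaluated over a sliding window of samples. The class name, the window size, and the 5% threshold are assumptions for illustration; real monitoring systems express rules like this in their own configuration language.

```python
from collections import deque


class ErrorRateMonitor:
    """Fires when the error rate over the most recent samples exceeds a threshold."""

    def __init__(self, window_size=5, threshold=0.05):
        # Only the most recent `window_size` samples are kept.
        self.samples = deque(maxlen=window_size)
        self.threshold = threshold

    def observe(self, requests, errors):
        """Record one interval and return True if an alert should fire."""
        self.samples.append((requests, errors))
        total_requests = sum(r for r, _ in self.samples)
        total_errors = sum(e for _, e in self.samples)
        if total_requests == 0:
            return False  # no traffic, nothing to judge
        return total_errors / total_requests > self.threshold
```

Evaluating the rate over a window rather than a single sample keeps one noisy interval with little traffic from paging someone.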

Respond

This is the step where you analyze the issue at hand and take a call on whether to contain the damage or terminate the concerned services.

Traditional: The limitations of technology made it difficult to connect globally. Cross-functional, localized teams would come together to figure out the issue. This often forced people to drop the work at hand and focus on solving the issue, and the constant context switching hit developers the hardest.

Modern: Modern-day teams analyze metrics and logs to determine how bad the outage is before responding further. Is it a brief spike in errors? Are a few nodes going offline? Or is it a full-on service disruption? This is also where colleagues from other teams collaborate to help. Modern ChatOps tools like Slack or Microsoft Teams make that collaboration effective and keep the right people connected, even globally if needed.
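The triage questions above can be sketched as a rough classification function. The labels and thresholds here are purely illustrative assumptions; each team would tune them to its own services.

```python
def classify_outage(error_rate, nodes_down, total_nodes):
    """Return a rough severity label for triage (thresholds are illustrative)."""
    if total_nodes <= 0:
        raise ValueError("total_nodes must be positive")
    down_ratio = nodes_down / total_nodes
    # Half the fleet gone, or half the requests failing: treat as a full outage.
    if down_ratio >= 0.5 or error_rate >= 0.5:
        return "full disruption"
    # Some nodes are offline but the service is still mostly available.
    if nodes_down > 0:
        return "partial degradation"
    # All nodes are up, but errors are elevated.
    if error_rate >= 0.05:
        return "error spike"
    return "healthy"
```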

Resolution and Recovery

Once you've analyzed and pinpointed the root cause, you need to resolve the issue and ensure the system has recovered, with the affected systems and devices up and running again.

Traditional: The process was unstructured. There was a lack of coordination between people, which led to support people tripping over each other and duplicating efforts. The aim of recovery was to get the system up and running, and nothing much followed. Getting to the root cause was rarely an objective until the same issue occurred repeatedly.

This changed with time as processes were put into place. But lack of automation meant that the on-call schedules were still not very efficient and there was a lot of manual work.

Modern: These days, various tools and techniques are used to deal with issues. The decision is based on the issue that is being dealt with and the team's capabilities. For example, if you're experiencing network issues and your team has access to network engineering resources, they may be able to resolve the issue quickly by adjusting settings on routers or switches.

Recovery is usually coordinated by the on-call incident handler, who is responsible for implementing a solution and making sure it does not fail. The SRE team then follows up with the manager to make sure the fix works as intended and, if necessary, to mitigate any damage caused by the outage. Another goal is to prevent such incidents from happening again.

Postmortems

A postmortem is written after the issue is resolved, and everything has calmed down. Once the postmortem write-up is ready, a meeting occurs and is led by an SRE manager or incident handler who distributes the postmortem notes to relevant parties within the organization.

The goal of this meeting is to review what happened during the outage, why it happened, what was done to stop it, and how it could be prevented in the future. The postmortem then becomes part of an organization's operational history, allowing teams to learn from past mistakes and improve their overall reliability going forward.

Traditional: Traditional postmortems were either internal reports that were never seen outside the company or formal reports submitted to external auditors.

Both constraints made it difficult to share detailed information about what happened and why it happened. Traditional postmortems were typically tactical documents that focused on how IT personnel responded to an incident.

Modern: Postmortems are an established part of modern incident response and are generally written as after-the-fact documentation.

Modern digital postmortems are more inclusive of all teams involved, including the stakeholders, and should be viewed as strategic documents that focus on lessons learned by the entire organization.

They can be used for training purposes since they document case studies from completed investigations. They allow you to:

Analyze past issues
Find trends and make predictions about future risks
Learn from mistakes
Prevent a recurrence
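As one way to picture "finding trends", the sketch below counts how often each root cause appears across a set of postmortems. The record layout, with a "root_cause" key, is a hypothetical structure chosen for the example.

```python
from collections import Counter


def recurring_causes(postmortems, min_count=2):
    """Return (cause, count) pairs seen at least min_count times, most common first."""
    counts = Counter(pm["root_cause"] for pm in postmortems)
    return [(cause, n) for cause, n in counts.most_common() if n >= min_count]
```

A root cause that keeps showing up in this list is a good candidate for a permanent fix or for more automation.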

An excellent example of a postmortem template, and of what should be included, can be found in the first SRE book by Google.

This brings us to the end of this blog. We have successfully explored what incident response is and how it has evolved with time.

Squadcast is an incident management tool that’s purpose-built for SRE. Create a blameless culture by reducing the need for physical war rooms, unify internal & external SLIs, automate incident resolution and create a knowledge base to effectively handle incidents.

Infrastructure as Code: All you need to know

Kristijan — Tue, 04 Jan 2022 13:42:52 +0000

Using code to create and manage deployments is more time-efficient and less tedious than using a CLI or even a UI. In this blog, we explore the buzz around the usage of Infrastructure as Code (IaC) and how Terraform can be used to implement IaC.

In this blog post, we will explore the good, the bad, and the ugly sides of infrastructure as code so you can make an informed decision on how (and why) to incorporate it into your workflow.

What is Infrastructure as Code?

Infrastructure as code (abbreviated as IaC) is a set of practices for infrastructure management that enables it to be managed and coordinated by code, instead of the traditional way of using CLI or UI.

You would want to have a solution that allows you to easily manage and provision infrastructure with reusability and templating in mind.

One of its most important benefits is that infrastructure as code enables infrastructure to be easily defined, replicated, templated, and put into a version-controlled system.

Why should I use Infrastructure as Code?

Infrastructure as code is beneficial for several reasons including automation, infrastructure consistency across environments, and full infrastructure history over time through the use of version control.

This allows you and your team to collaborate more closely, since IaC templates can be stored inside git repositories and easily shared across teams.

It also makes it easier for new team members to ramp up on how things work in your environment, because there is less need for documentation or handoffs between teammates. Everything needed will already be available, on GitHub for example.

IaC enables infrastructure to scale just like software does, with definitions for multiple environments such as development, staging, and production.

This means infrastructure can be quickly modified during the development process and the changes can be tested in an environment that is identical to production, hence minimizing any errors.

It is much quicker to code or supply templates for new infrastructure than it is to use a CLI console or UI.

Of course, there are exceptions depending on the infrastructure end goal.

The tools can help you create things in parallel. Imagine creating ten instances that each need an extra disk attached.

Even though this goal is quite straightforward, it will take you an eternity to complete by hand. You would need to click through UI wizards countless times to spin up all the instances.

However, utilizing the potential of IaC makes this simple.

The code can iterate through a list of your chosen instances and create them in a breeze.
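The idea of iterating instead of clicking can be shown in a few lines. This sketch only builds the list of instance and disk names that would be created; the name prefixes mirror the ones used in the Terraform demo in this post, and the function itself is a made-up helper.

```python
def plan_instances(count, name_prefix="squadcast-instance", disk_prefix="instance-disk"):
    """Return one {instance, disk} name pair per index, from 0 to count - 1."""
    return [
        {"instance": f"{name_prefix}-{i}", "disk": f"{disk_prefix}-{i}"}
        for i in range(count)
    ]


# Ten instances, each paired with its own extra disk.
for item in plan_instances(10):
    print(item["instance"], "->", item["disk"])
```

Changing ten to fifty is a one-character edit here, whereas in a UI it would mean forty more trips through the wizard.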

It's crucial to remember that infrastructure as code is not a magic bullet for infrastructure management.

IaC is only one piece of the puzzle; as with everything, there is more to it.

How do I get started with Infrastructure as Code?

There are many infrastructure as code tools such as Terraform, Pulumi, CloudFormation, and others which we will take a look at in a minute.

To get started with IaC, you must first have established a goal and objective on how you want to manage your infrastructure.

Managing infrastructure this way is easiest when it is provisioned and run through a cloud provider.

That said, IaC is versatile enough to manage physical infrastructure as well.

In general, IaC works well with any infrastructure that can be defined using code or templates.

You might use IaC to define the entirety of your infrastructure, or you may go hybrid and define some services using IaC while others through UI or CLI tooling.

Keep in mind that attempting to integrate current infrastructure into IaC code is a little more challenging and will require some effort, but isn't impossible. It is much easier starting from scratch with IaC.

The pros and cons of using Infrastructure as Code

Some pros and cons were already mentioned in the previous sections.

However, let's sum them up and look at them side by side.

Pros

Infrastructure as code is version controlled, which means you have a history of who did what and which changes have been done on the environment.
In case of an issue, you can easily roll back to a previous state if needed.
IaC enables infrastructure to be quickly modified during the development process and changes tested in an environment that is identical to production.
It's much quicker for a human to code infrastructure than it is to provision infrastructure via user interfaces. Your already written code blocks and templates can be reused in the future.
IaC tools are readily available which can be used for managing infrastructure across cloud providers, on-premises, or even hybrid environments.

Cons

It is difficult to integrate current infrastructure into IaC when you have an existing environment with a lot of infrastructure already created.
There is a learning curve, and it takes some effort to get infrastructure provisioned via IaC. However, it's well worth the effort spent in learning it.
You must be able to define infrastructure using code or templates for IaC to work. This means learning another language, syntax, and logic.
Not everything can be created and connected by the use of IaC; there are some limitations.

Tools for Infrastructure as Code

There are many IaC tools that you may use to manage infrastructure.

Some of these include:

Terraform for managing infrastructure in any cloud provider or your own data center.
CloudFormation from AWS enables you to define resources through custom templates.
Pulumi is another open-source tool for infrastructure as code, created by former Microsoft employees with experience in managing infrastructure at scale.
Azure Resource Manager (ARM) is Microsoft's answer to infrastructure as code. Using ARM you can easily provision infrastructure using JSON templates.

The most commonly used IaC tool is Terraform, as it offers a vendor-agnostic approach with extended support for different providers, services, and infrastructure components.

Terraform enables you to define infrastructure using a declarative approach: you write configuration files in its language, HCL, and reuse code through modules.

You may use these building blocks for configuration management, data management, continuous delivery workflows, serverless functions, and a variety of other applications.

Terraform Demo

You can now see how IaC can help automate creating multiple instances.

To demonstrate the power of IaC, let's take on the task of creating the ten instances mentioned earlier.

terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "3.89.0"
    }
  }
}

provider "google" {}

# Number of instances (and matching extra disks) to create
variable "instance_count" {
  type    = number
  default = 10
}

resource "google_compute_instance" "instance" {
  count   = var.instance_count
  project = "YOUR-PROJECT-ID"
  zone    = "us-central1-a"

  name         = "squadcast-instance-${count.index}"
  machine_type = "e2-medium"

  # Reference the disk resource so Terraform creates each disk before its instance
  attached_disk {
    source = google_compute_disk.instance-disk[count.index].self_link
  }

  lifecycle {
    ignore_changes = [attached_disk]
  }

  boot_disk {
    initialize_params {
      # Debian 9 has reached end of life; consider a newer image family
      image = "debian-cloud/debian-9"
    }
  }

  network_interface {
    network = "default"
  }
}

# One extra 50 GB SSD persistent disk per instance
resource "google_compute_disk" "instance-disk" {
  count   = var.instance_count
  project = "YOUR-PROJECT-ID"
  zone    = "us-central1-a"

  name = "instance-disk-${count.index}"
  type = "pd-ssd"
  size = 50

  physical_block_size_bytes = 4096
}

Three commands are executed in succession:

terraform init - initializes Terraform and downloads all the necessary files (such as provider plugins) before running
terraform plan - prints the execution plan showing which infrastructure elements will be created (or deleted)
terraform apply - applies the planned changes and creates the instances
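If you want to script these steps, for example in a CI job, a thin wrapper could look like the sketch below. It is an assumption-laden example rather than an official workflow: the function name and the tfplan file name are made up, and it saves the plan with -out so that apply executes exactly what was planned, which differs slightly from running the plain commands listed above. By default it only reports the commands it would run.

```python
import shutil
import subprocess

# The three steps, in order. Saving the plan to a file is a choice made for this sketch.
TERRAFORM_STEPS = [
    ["terraform", "init"],
    ["terraform", "plan", "-out=tfplan"],
    ["terraform", "apply", "tfplan"],
]


def run_terraform(workdir, dry_run=True):
    """Run the steps in order and return the commands as strings.

    With dry_run=True nothing is executed. Otherwise the first failing
    command raises subprocess.CalledProcessError and stops the sequence.
    """
    if not dry_run and shutil.which("terraform") is None:
        raise RuntimeError("terraform binary not found on PATH")
    executed = []
    for command in TERRAFORM_STEPS:
        executed.append(" ".join(command))
        if not dry_run:
            subprocess.run(command, cwd=workdir, check=True)
    return executed
```

Calling run_terraform(".", dry_run=False) from the folder containing the configuration would run the real commands, so try the dry run first.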

In a breeze, all the instances are created, each with its extra disk.

Squadcast is an incident management tool that’s purpose-built for SRE. Your team can get rid of unwanted alerts, receive relevant notifications, work in collaboration using the virtual incident war rooms, and use automated tools like runbooks to eliminate toil.

Implementing Istio in a Kubernetes cluster

Kristijan — Mon, 25 Oct 2021 13:22:34 +0000

As the complexity of a microservice architecture grows, it becomes important to implement a service mesh for better insights into your cluster and microservices. In this blog, Kristijan explains how Istio can be used as a service mesh, along with detailed installation steps and configuration setup.

Service Mesh? You’ve heard about it, but does it solve something, or is it just another hot buzzword in the industry?

In this article you will learn about the Istio service mesh, along with a full installation guide and configuration setup.

What is Istio?
Istio architecture
Installing Istio
Observability
Demo application
Recap

Before moving straight to Istio, it’s worth mentioning that in one of our previous articles - The Age of Service Mesh, Gigi Sayfan explained in detail how service meshes work and what problems they solve.

I highly suggest you give that article a read. Maybe even as a prequel, as it will provide you with great insight into service mesh basics and the general idea behind them.

Right now, there are a plethora of options for service meshes.

Each service mesh has its pros and cons, along with specific use cases that you should consider for your cluster and end goal.

You can decide which “brand” of a service mesh to install.

What is Istio?

Istio is a service mesh designed to enhance and give you better insight into your cluster and microservices.

One of the great things about Istio and service meshes overall is that they require absolutely no code change for them to work.

Istio works by integrating itself as an additional layer inside the Kubernetes cluster and thus provides modern features that you can utilize to your advantage.

Those features can include advanced load balancing, circuit breaking, mTLS traffic encryption, better authentication and authorization options, metrics, telemetry, and overall fine-grained control over the cluster’s traffic going in and out.

Now Istio isn’t just a single object that you install. It’s more of a collection of entities that work together and make up the whole service mesh.

Like Kubernetes, Istio has a control plane that manages everything and a data plane that handles the traffic between the services.

There is more to Istio, as it isn’t bound to only work in a Kubernetes cluster. It will also work with virtual machines and supports different deployment options both for installing and running.

In the next section, we will explain Istio’s components and architecture.

Istio Architecture

As the saying goes, a picture is worth a thousand words.

Consider the following diagram:

[Figure: Istio architecture diagram]

You can see that traffic going in and out of the pods no longer flows directly; instead, it must first pass through the sidecar proxies.

The container sidecars are Envoy proxies that get automatically injected into your pods on startup.

During installation, you instruct Istio which namespace to ‘watch’ and deploy Envoy proxies along with your applications.

You will see how this is done in action when we get to the installing section.

The other part is the control plane, made up of multiple components bundled into one binary - istiod. The control plane manages the proxies, certificates, and service discovery, and applies the configuration you set.

To explain this a bit better, here is an analogy.

Consider the service mesh as a telephone network.

The data plane consists of phones that you and your friends use to communicate with each other.

You could communicate without them, but you would have to yell across the room. Using phones is much more modern and secure, and gives you better control over the communication.

The control plane is the telephone service provider, where all the calls get managed, routed, and billed.

Everything you apply is done towards and on the control plane; the control plane will communicate that change to the sidecar proxies.

The traffic traversing the data plane is only visible to the proxies; the other Istio components have no access.

Installing Istio

Depending on your setup, Istio offers you different installation and deployment strategies.

Each cloud service provider has its own specifics, so it’s best to go over the platform setup and check whether any prerequisites or dependencies are needed before installing.

The install options include Helm, istioctl, and an operator.

You can look into them in the Istio installation documentation.

For this guide, we will install Istio using the istioctl tool.

First, you’ll need to download the binary:

$ curl -L https://istio.io/downloadIstio | sh -

Navigate into the newly created folder, export the path to the binary, and verify that it works:

$ cd istio-1.11.1/
$ export PATH=$PWD/bin:$PATH
$ istioctl version

no running Istio pods in "istio-system"
1.11.1

Well, that’s okay, you still haven’t installed Istio.

It is a good idea to run the pre-flight check to verify that your cluster has no issues that would prevent it from running the Istio service mesh.

$ istioctl x precheck

Install Pre-Check passed! The cluster is ready for Istio installation.

Before moving forward, you should assess which type of profile you want Istio to be installed with.

There are six of them at the time of this writing:

Default
Demo
Minimal
External
Empty
Preview

You can view each profile with an extended description in the Istio documentation.

We will go with the default profile intended for production environments.

Each profile is just a set of features that Istio will enable when installed.

If you want to test every feature, you can install it using the demo profile.

Note: Installing profiles that include the Ingress or Egress Gateway will automatically spin up an external load balancer.

Istio also offers customizations and custom third-party add-ons you can include in the profile.

If none of the above profiles meet your requirements, you can use istioctl to generate and create custom manifests to fit your needs.

To install using the default profile:

$ istioctl install --set profile=default -y

✔ Istio core installed             
✔ Istiod installed                                                         
✔ Egress gateways installed  
✔ Ingress gateways installed
✔ Installation complete

Excellent! You’ve installed Istio successfully!

You are halfway there.

Now you need to label the namespace that Istio will control so that it injects sidecar proxies into the pods.

For example, to label the default namespace for sidecar injection:

$ kubectl label namespace default istio-injection=enabled

namespace/default labeled

You can now verify this with:

$ kubectl get ns --show-labels

NAME            STATUS      AGE      LABELS
default         Active      119d     istio-injection=enabled
[other output truncated]

With that, you completed the Istio core components installation.

Observability

I installed Istio, and now what?

Next comes the observability part.

The Envoy proxies will send off telemetry and other data that you can use to visualize the traffic in the mesh.

Just as Prometheus is commonly paired with Grafana, Istio needs to be paired with a visualization tool to display the data.

You will use the Kiali dashboard to visualize and see what’s going on in the cluster.

There is one caveat, however. Kiali requires that you have a running Prometheus instance in your cluster.

You can deploy one or supply the address of the existing one if you have it already deployed.

For simplicity and example purposes, the following section will use the demo manifests for deploying Kiali, Jaeger, Prometheus, and Grafana.

Keep in mind that you shouldn’t rely on this setup for running in production environments!

Further below, there will be an explanation of how to set up Kiali to work with an existing Prometheus instance.

Installing the Kiali dashboard

Navigate to the Istio folder and apply the manifests located under samples/addons:

$  kubectl apply -f samples/addons

Applying the above will deploy many objects, so give them a couple of minutes to start.

Check whether the Kiali pod has started:

$ kubectl -n istio-system get pods -l app=kiali

NAME                           READY    STATUS    RESTARTS      AGE
kiali-787bc487b7-fbxwm         1/1      Running   0             2m10s

Once it’s running, you can now access the dashboard using kubectl and port-forward.

However, istioctl offers a much simpler way:

$ istioctl dashboard kiali

http://localhost:20001/kiali

Open http://localhost:20001/kiali in your browser.

As you can see, there is no traffic running in the selected namespace, and Kiali will show no connections.

If you want to access the other dashboards - Grafana and Jaeger, you can again use istioctl dashboard:

$ istioctl dashboard grafana

$ istioctl dashboard jaeger

Kiali tips

There are also other ways to deploy Kiali that are better suited to production use, where you can customize and set your own parameters.

Installing Kiali can be done by deploying the Kiali-server or the Kiali-operator.

You can find the GitHub link for both Helm charts here.

As mentioned in the previous section, you can specify external instances of Prometheus and the other tools.

It’s best to install all the tooling Kiali needs for you to have the most benefit and greater observability in the service mesh.

Those are the Prometheus instance, Grafana, and Jaeger for tracing.

Note: Refer to the Jaeger documentation as it requires additional configuration to have full distributed tracing in your apps.

Grafana and Jaeger are optional for Kiali and not required for it to work.

You can specify every connection to the other systems during installation.

For example, to specify an existing Prometheus instance:

$ helm install kiali-server kiali-server --repo https://kiali.org/helm-charts \
  -n istio-system \
  --set auth.strategy="anonymous" \
  --set external_services.custom_dashboards.prometheus.url="http://prometheus-k8s.monitoring:9090/" \
  --set external_services.prometheus.url="http://prometheus-k8s.monitoring:9090/"

The Kiali authentication options are available here.

The anonymous option used above provides free unauthenticated access to the dashboard.

Demo application

You’ve deployed Istio, have a running service mesh inside your cluster, and you also installed the Kiali dashboard to observe the traffic.

Let’s now deploy a simple demo application.

You can use the following hello-world web app that will display a simple web page for testing.

Apply the following deployment and service manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-kubernetes
spec:
  selector:
    matchLabels:
      name: hello-kubernetes
  template:
    metadata:
      labels:
        name: hello-kubernetes
    spec:
      containers:
      - name: app
        image: paulbouwer/hello-kubernetes:1.10
        ports:
          - containerPort: 8080
        env:
        - name: KUBERNETES_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: KUBERNETES_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: KUBERNETES_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
---
apiVersion: v1
kind: Service
metadata:
  name: hello-kubernetes
spec:
  type: ClusterIP
  ports:
  - port: 80
    targetPort: 8080
  selector:
    name: hello-kubernetes
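The Service finds its pods through the selector: every key and value in the selector must be present in the pod's labels. The sketch below is a simplified model of that equality-based matching, written for illustration only; it is not the Kubernetes implementation, and the function name is made up.

```python
def selector_matches(selector, pod_labels):
    """Return True if every selector key/value pair appears in the pod's labels."""
    # An empty selector matches nothing here, since a Service without a
    # selector does not pick up pods automatically.
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


service_selector = {"name": "hello-kubernetes"}
pod_labels = {"name": "hello-kubernetes", "pod-template-hash": "78c896db9c"}
print(selector_matches(service_selector, pod_labels))
```

Extra labels on the pod, such as the pod-template-hash added by the Deployment, do not prevent a match. This is why the label under template.metadata.labels in the Deployment has to agree with the selector in the Service.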

Verify that the pod is running and the service is deployed:

$ kubectl get pod,svc

NAME                                     READY   STATUS        RESTARTS      AGE
pod/hello-kubernetes-78c896db9c-hrcc8    2/2     Running       0             46s

NAME                        TYPE        CLUSTER-IP       EXTERNAL-IP    PORT(S)     AGE
service/hello-kubernetes    ClusterIP   10.0.12.41       <none>         80/TCP      47s

Now, for testing, you can use port-forward to access the application. However, a more permanent solution would be to use a load balancer or an ingress.

Istio has its own ingress controller that you can use to test the application.

The following ingress manifest will expose the application on the / path:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: istio
  name: ingress-hello-kubernetes
spec:
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hello-kubernetes
            port:
              number: 80

Note: Notice the ingress class annotation; it specifies that the Istio ingress controller will pick up this object.
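The pathType: Prefix setting in the manifest means that paths are compared element by element, split on the / separator, as described in the Kubernetes Ingress documentation. The sketch below models that rule in a simplified way (it ignores empty path elements); how a given ingress controller implements it internally may differ.

```python
def _elements(path):
    """Split a URL path into its non-empty elements."""
    return [part for part in path.split("/") if part]


def prefix_matches(rule_path, request_path):
    """Return True if rule_path is an element-wise prefix of request_path."""
    rule = _elements(rule_path)
    request = _elements(request_path)
    return request[:len(rule)] == rule


# The rule "/" from the manifest above matches every request path.
print(prefix_matches("/", "/any/page"))
```

Because matching is done per element, a rule for /aaa/bb does not match a request for /aaa/bbb, even though one string is a prefix of the other.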

You can get the IP address using:

$ kubectl -n istio-system get svc -l release=istio

NAME                    TYPE            CLUSTER-IP      EXTERNAL-IP    
istio-egressgateway     ClusterIP       10.0.12.240     <none>                                                                              
istio-ingressgateway    LoadBalancer    10.0.13.159     34.140.202.188   
istiod                  ClusterIP       10.0.8.208      <none>      
[other output truncated]

Visiting the external IP of the ingress gateway will open up the web application.

To see some activity in the Kiali dashboard, you first need to generate some traffic.

The simplest way is to use curl and a while loop:

From another terminal run:

$ while true; do curl 34.140.202.188 ; sleep 1; done;
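If you prefer a bounded loop that also records the responses, the same idea can be written in Python using only the standard library. This is a sketch: the function name and defaults are assumptions, and you would pass your own ingress gateway address as the URL.

```python
import time
import urllib.error
import urllib.request


def generate_traffic(url, count=10, delay_seconds=1.0, timeout=5.0):
    """Send `count` GET requests and return the list of HTTP status codes.

    A request that fails to connect or times out is recorded as None.
    """
    statuses = []
    for i in range(count):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                statuses.append(response.status)
        except urllib.error.HTTPError as error:
            # The server answered, but with an error status such as 404 or 503.
            statuses.append(error.code)
        except (urllib.error.URLError, OSError):
            statuses.append(None)
        if i < count - 1:
            time.sleep(delay_seconds)
    return statuses
```

For example, generate_traffic("http://34.140.202.188/", count=60) would send one request per second for about a minute, using the external IP shown earlier.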

And check the Kiali dashboard:

Success!!

You can now see that Kiali displays the traffic, and it reaches the web application without any issues.

Recap

You learned what Istio is and how it works
You got your hands dirty by installing and configuring Istio in a cluster
You installed Kiali to visualize the traffic in the mesh
You deployed a demo application and exposed it using Istio’s ingress

Infrastructure monitoring using kube-prometheus operator

Kristijan — Wed, 09 Jun 2021 12:36:29 +0000

Prometheus has emerged as the de-facto open source standard for monitoring Kubernetes implementations. In this tutorial, Kristijan Mitevski shows how infrastructure monitoring can be done using kube-prometheus operator. The blog also covers how the Prometheus Alertmanager cluster can be used to route alerts to Slack using webhooks.

In this tutorial by Squadcast, you will learn how to install and configure infrastructure monitoring for your Kubernetes cluster using the kube-prometheus operator, displaying metrics with Grafana, and configuring alerting with Alertmanager.

Infrastructure Monitoring

One of the key principles of running clusters in production is Monitoring.

You must be aware of the resource allocation and limits of each component that is a part of the cluster.

It is crucially important to have insight and observability into your cluster, be it Kubernetes, bare metal, virtual machines, or any other.

A good monitoring solution paired with a set of metrics and alerting will provide a safe environment for your workloads.

It’s safe to say that monitoring a Kubernetes cluster comes in two parts - infrastructure and workload monitoring.

The first part covers the actual infrastructure that supports your workloads. These will be the nodes or instances that host your applications. And by collecting and observing the metrics you will be very well aware of the node's health, usage, and capacity.

The other part covers the workloads and microservices that you deploy on the cluster.

These can be viewed either as your applications or, in Kubernetes terms, as the pods and containers that run them.

Monitoring just the instances is not enough. Kubernetes abstracts the hosts and adds additional layers for container management, and these also need to be taken into account.

The Pods are entities of their own, each with different resource requirements, limits, and usage.

Prometheus Operator

Before moving on to installing the monitoring stack, here is a brief intro to Kubernetes Operators and Custom Resources.

In one of our previous blogs, we explained in detail what Kubernetes Operators and Custom Resources are and how they are used.

Kubernetes operators take the Kubernetes controller pattern that manages native Kubernetes resources (Pods, Deployments, Namespaces, Secrets, etc) and lets you apply it to your own custom resources.

Custom resources are Kubernetes objects that you define via CRDs (Custom Resource Definitions). Once a CRD is defined, you can create custom resources based on the definition and they are stored by Kubernetes. And you can interact with them through the Kubernetes API or kubectl, just like existing resources.

As you’ve read, both of these resources are extensions to the Kubernetes API and are not available by default like those of Deployment or StatefulSet kinds.

The Prometheus Operator will manage and configure a Prometheus cluster for you. Bear in mind that it contains only the core components.

In this tutorial, you will instead deploy the more feature-rich kube-prometheus operator.

The kube-prometheus operator will deploy all the core components plus exporters, extra configurations, dashboards, and everything else required to get your cluster monitoring up to speed.

These configurations can then be easily modified to suit your needs.

You can follow the official link if you wish to compare the differences between the operator deployment options.

Note: Manually deploying the Prometheus stack is still an option. However, you will need to deploy every component separately. This creates operational toil and will require a lot of manual steps until all services are configured and connected properly.

Installing the kube-prometheus operator

Although Kubernetes Operators may sound complex and scary, fear not, their installation is a breeze!

The people contributing to the Prometheus Operator project made its install straightforward.

Just a couple of commands need to be executed and you will have your monitoring set up in the cluster.

Let’s start.

To install the kube-prometheus operator, first clone the repository containing all the necessary files with this command:

$ git clone https://github.com/prometheus-operator/kube-prometheus.git

Now, from inside the kube-prometheus folder, apply the manifests located in manifests/setup:

$ kubectl apply -f manifests/setup

First, this creates the monitoring namespace that will contain all deployments for the monitoring stack.

Second, it creates all the necessary RBAC roles, role bindings, and service accounts that the monitoring services need in order to access and gather metrics.

Finally, it creates the aforementioned custom resource definitions for the Prometheus Operator and deploys the Operator.

With the previous command you deployed the files for the Operator itself, but not for its services.

Wait a bit until the prometheus-operator pod is up and running.

You can check on it using:

$ kubectl -n monitoring get pods

NAME	READY	STATUS	RESTARTS	AGE
prometheus-operator-7775c66ccf-74fn5	2/2	Running	0	58s

Next, deploy the services:

$ kubectl apply -f manifests/

Running the above command will deploy:

Prometheus with High Availability
Alertmanager with High Availability
Grafana
Node exporters
Blackbox exporter
Prometheus adapter
Kube-state-metrics
And all other supporting services and configurations for the monitoring stack

After a couple of minutes, all the pods should get into a running state. You can verify again using the kubectl get pods command.
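Instead of polling kubectl get pods by hand, you can let kubectl block until the pods are ready. The sketch below wraps kubectl wait in a small helper; the function name and the default timeout are assumptions chosen for illustration.

```shell
# Hypothetical helper: block until every pod in the monitoring namespace
# reports the Ready condition, or give up after the given time limit.
wait_for_monitoring_pods() {
  limit=${1:-300s}
  kubectl -n monitoring wait --for=condition=Ready pods --all --timeout="$limit"
}

# Usage:
#   wait_for_monitoring_pods        # wait up to 300 seconds
#   wait_for_monitoring_pods 10m    # wait up to ten minutes
```

The command returns a non-zero exit status if the time limit passes first, which makes it convenient in scripts.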

You can now port-forward and open the Prometheus, Grafana, and Alertmanager services locally:

Prometheus

$ kubectl -n monitoring port-forward svc/prometheus-k8s 9090

Grafana

$ kubectl -n monitoring port-forward svc/grafana 3000

Alertmanager

$ kubectl -n monitoring port-forward svc/alertmanager-main 9093

It’s best to check and open all of them, just to verify that everything works as expected.

Note: The default user and password for Grafana are admin/admin, after which you will be prompted to create a new password.

If you are greeted by each service's user interface, the install was successful, and you can now explore and configure the services to fit your needs.

Before moving on, let’s understand how the services are interconnected and what role they serve in the monitoring stack.

Service Interaction

Consider the following diagram:

The high-level overview of the stack goes like this:

Prometheus will periodically scrape (or pull) metrics via the configured exporters and metrics servers on the nodes using HTTP.

The exporters don’t send data to Prometheus; instead, Prometheus pulls that data from the endpoints exposed by the exporters.

The scraped metrics get time-stamped and stored as time-series data, which is written to a persistent volume for storage and later analysis.

A Config Map contains the rules and configurations for Prometheus.

This configuration contains the rules, alerts, scraping jobs, and targets that Prometheus needs to do proper monitoring.

Prometheus uses its own language called PromQL in which alerts can be set, and data queried.

Example of a PromQL query:

api_http_requests_total{method="POST", handler="/messages"}
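Queries like this can be run in the Prometheus UI, or sent from the terminal through the Prometheus HTTP API. The following is a minimal sketch that assumes the port-forward on localhost:9090 from earlier is still active; the helper name promql_query and the PROMETHEUS_URL variable are illustrative.

```shell
# Hypothetical helper: run an instant PromQL query against the
# /api/v1/query endpoint of the Prometheus HTTP API.
promql_query() {
  # -G sends the data as URL query parameters; --data-urlencode escapes
  # characters such as { } " = that are common in PromQL expressions.
  curl -s -G "${PROMETHEUS_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Usage:
#   promql_query 'up'
#   promql_query 'kube_node_status_condition{condition="Ready",status="true"}'
```

The response is a JSON document, so piping it through a JSON formatter makes the result easier to read.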

To visualize and display the data stored by Prometheus, Grafana comes into the picture.

Grafana connects to Prometheus, sets it as the data source, fetches this data, and displays it through a custom dashboard.

Alertmanager is used to notify us of any alerts. Rules set in Prometheus get evaluated, and once a violation occurs Prometheus pushes this alert notification to Alertmanager.

Once Alertmanager receives this notification, it applies its own routing and grouping rules and sends the alert to the configured receivers.

These receivers can be the Squadcast platform, Slack, email, or any other configurable incident receiver.

Similar to Prometheus, the Alertmanager rules are stored separately, in either Config Maps or Secrets.

Let’s examine this Prometheus alerting rule:

- name: kubernetes-system-kubelet
  rules:
  - alert: KubeNodeNotReady
    annotations:
      description: '{{ $labels.node }} has been unready for more than 15 minutes.'
      runbook_url: https://github.com/prometheus-operator/kube-prometheus/wiki/kubenodenotready
      summary: Node is not ready.
    expr: |
      kube_node_status_condition{job="kube-state-metrics",condition="Ready",status="true"} == 0
    for: 15m
    labels:
      severity: warning

name - The name of the rule group; you can have multiple alerts under the same group
rules - Configured rules under the rule group
alert - Name of the actual Alert and how it will be displayed once it is firing
annotations - Under annotations, you can set more details and summaries about the alert firing
expr - The expression for the alert in PromQL language that gets evaluated
for - How long the expression must stay true before the alert fires. Here, if the node stays unready for 15 minutes or longer, Prometheus will send a notification to Alertmanager
labels - additional labeling that you can attach to the alert

Using the Go templating language, you can further detail your alert and give a clearer description.

Adding a custom alerting rule

Now that you’ve learned more about how the whole Prometheus setup operates, you can modify its configuration and add your own custom alerting rules.

As explained above, Prometheus gets its configuration data from a Config Map.

If you describe the prometheus-k8s Stateful Set, you will see the prometheus-k8s-rulefiles-0 Config Map that contains the rules.

$ kubectl -n monitoring describe statefulsets prometheus-k8s
[other output truncated]

    prometheus-k8s-rulefiles-0:
      Type: ConfigMap (a volume populated by a ConfigMap)
      Name: prometheus-k8s-rulefiles-0
      Optional: false

You can now edit the config map and add your custom alert.

Keep in mind that the config map will have a lot of lines!

It’s best to save it and edit it offline:

$ kubectl -n monitoring get configmap prometheus-k8s-rulefiles-0 -o yaml > prometheus-k8s-rulefiles.yaml

The configuration inside is split into separate YAML files, each containing a group of alerts based on common alert targets.

In this case, the example will cover adding an alerting rule under a new group.

But if you like, you can add it under one of the existing alert groups.

Add the following sample alert under the data block:

custom-alert-rules.yaml: |
  groups:
    - name: custom.rules
      rules:
        - alert: AlertTestJobFailed
          expr: kube_job_status_failed{job_name="alert-test"} > 0
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Alert Test Job failed (instance {{ $labels.instance }})
            description: "Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"

The alert will fire if the Kubernetes job fails, which you will test in a moment.
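Indentation mistakes in a rule file are easy to make, so it can help to validate the rule group on its own before touching the Config Map. The sketch below writes the group to a temporary file and checks it with promtool, the command-line tool distributed with Prometheus. The file location is an assumption, and the check is skipped if promtool is not installed. The comparison > 0 is used so that the alert fires once at least one pod of the job has failed.

```shell
# Write the rule group to a standalone file. The quoted 'EOF' delimiter
# stops the shell from expanding $labels and $value.
rules_file="${TMPDIR:-/tmp}/custom-alert-rules.yaml"
cat > "$rules_file" <<'EOF'
groups:
  - name: custom.rules
    rules:
      - alert: AlertTestJobFailed
        expr: kube_job_status_failed{job_name="alert-test"} > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: Alert Test Job failed (instance {{ $labels.instance }})
          description: "Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
EOF

# Validate the syntax and the PromQL expression if promtool is available.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules "$rules_file"
else
  echo "promtool not found; wrote $rules_file without validating it"
fi
```

Once the file passes the check, its contents can be pasted under the new key in the Config Map with the extra indentation that the data block requires.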

Now comes the tricky part.

Since the owner of the Config Map is the Prometheus Operator, you will need to remove the ownership data from the file.

Otherwise, any modifications to the Config Map will be reverted and your changes lost!

Under the config map's metadata field, remove everything except the name and namespace, and leave those as they are.

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-k8s-rulefiles-0
  namespace: monitoring

[other data truncated]

You can also find the link to the edited config map here.

Finally, replace the existing Config Map with the new one:

$ kubectl replace -f prometheus-k8s-rulefiles.yaml
configmap/prometheus-k8s-rulefiles-0 replaced

Prometheus will automatically reload the Config Map. After waiting a bit, check the UI to verify that the alerting rule has been created:

$ kubectl -n monitoring port-forward svc/prometheus-k8s 9090

You can now run the test job to check if the alert will fire correctly.

$ kubectl create job alert-test --image=busybox -- sh -c "sleep 10; exit 139"

The job will create a pod that will sleep for ten seconds and exit with status 139, thus failing the job and triggering the alert.
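An alternative worth knowing about is that the Prometheus Operator also accepts alerting rules as PrometheusRule custom resources, which avoids editing the generated Config Map by hand. The sketch below is not the method used in this tutorial. The resource name and the two labels are assumptions, and the labels must match the ruleSelector of your Prometheus resource.

```shell
# Write a PrometheusRule manifest. Check which labels your Prometheus
# resource selects on before applying it, for example with:
#   kubectl -n monitoring get prometheus k8s -o yaml
manifest="${TMPDIR:-/tmp}/custom-prometheus-rule.yaml"
cat > "$manifest" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: custom-alert-rules   # hypothetical name
  namespace: monitoring
  labels:                    # assumed labels, adjust to your ruleSelector
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: AlertTestJobFailed
          expr: kube_job_status_failed{job_name="alert-test"} > 0
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Alert Test Job failed
EOF
echo "Wrote $manifest"

# Apply it once you have confirmed the labels:
#   kubectl apply -f "$manifest"
```

The Operator then renders the rule into the rule files itself, so the change does not depend on the ownership metadata of the Config Map.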

Setting up Slack Webhook

Since the monitoring stack is now working, it’s a good idea to configure a receiver so that alert notifications are easier for people to read.

Creating a Slack workspace and channel for testing purposes is easy.

If you do not already have a Slack account you can sign up and create one here.

Once logged in, on the left side, in the Channels tab click on the plus sign and create a separate channel for alerts.

Now, to configure a webhook click on Browse Slack, then Apps, and in the search bar search for webhooks.

Make sure to select Incoming Webhooks; this is extremely important.

Incoming means that data will be sent to Slack, not received from it.

Once you click on Add, you will be redirected to the Slack webhook webpage.

Just confirm with Add to Slack, and choose the previously created alert channel. In our case #prometheus_alerts.

Once that is done on the next page, you will be given the webhook URL.

That looks something like:

https://hooks.slack.com/services/XXXXXXXXXX/XXXXX/XXXXX

Keep this URL a secret, and treat it like a password!

Anyone who gets hold of this URL can send anything to your channel.

At the bottom, there will be an example with curl that you can use to send a request and verify if the webhook integration works.

Going back to your channel, if you see the famous ghost emoji, the webhook configuration was successful.

Note: If you received an SSL certificate problem when running the test request, add the -k option to curl.

$ curl -k -X POST {rest of the command}
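For repeated testing, the request can be wrapped in a small helper. This is a sketch under a few assumptions: the function name and the SLACK_WEBHOOK_URL environment variable are made up for illustration, and the JSON payload uses the text field that Slack incoming webhooks accept.

```shell
# Hypothetical helper: post a test message to a Slack incoming webhook.
# The URL is read from SLACK_WEBHOOK_URL so it is not hard-coded in scripts.
send_slack_test() {
  if [ -z "${SLACK_WEBHOOK_URL:-}" ]; then
    echo "SLACK_WEBHOOK_URL is not set" >&2
    return 1
  fi
  # The message must not contain double quotes, since it is placed
  # directly into the JSON payload.
  curl -s -X POST -H 'Content-type: application/json' \
    --data "{\"text\": \"${1:-Test message from the monitoring stack}\"}" \
    "$SLACK_WEBHOOK_URL"
}

# Usage:
#   export SLACK_WEBHOOK_URL='https://hooks.slack.com/services/XXXXXXXXXX/XXXXX/XXXXX'
#   send_slack_test 'Hello from the terminal'
```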

Configuring Alertmanager

With the Slack channel configured with a webhook, what's left is to integrate it in the Alertmanager configuration.

In other, non-operator deployments, the configuration is usually stored inside a Config Map.

However, when deployed through the Prometheus Operator, the configuration is defined inside a Kubernetes secret.

This is much better, since its contents are encoded and not left in plain-text format. On the other hand, some additional steps are required when there is a configuration change.

You have two options for editing the Alertmanager configuration:

You can modify the existing secret that Alertmanager uses to store its configuration.
You can replace that secret with a new one that contains your updated configuration.

The first option is a somewhat tedious process, since you need to decode the contents and encode them again every time you want to change the configuration.

With the second option, you can define and store your configuration in plain YAML format. Once there is a new configuration needed, you can generate the Alertmanager secret, directly from that file.

First, you will need to grab the default configuration that’s deployed with Alertmanager.

You will use that as a base for further editing.

You can output the Alertmanager secret with:

$ kubectl -n monitoring get secret alertmanager-main -o yaml

The secret will look messy at first glance.

What you are interested in is the alertmanager.yaml field, under the data field.

That’s the configuration that Alertmanager uses.

For sake of clarity, the other fields will be truncated.

data:
  alertmanager.yaml: Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKImluaGliaXRfcnVsZXMiOgotICJlcXVhbCI6CiAgLSAibmFtZXNwYWNlIgogIC0gImFsZXJ0bmFtZSIKICAic291cmNlX21hdGNoIjoKICAgICJzZXZlcml0eSI6ICJjcml0aWNhbCIKICAidGFyZ2V0X21hdGNoX3JlIjoKICAgICJzZXZlcml0eSI6ICJ3YXJuaW5nfGluZm8iCi0gImVxdWFsIjoKICAtICJuYW1lc3BhY2UiCiAgLSAiYWxlcnRuYW1lIgogICJzb3VyY2VfbWF0Y2giOgogICAgInNldmVyaXR5IjogIndhcm5pbmciCiAgInRhcmdldF9tYXRjaF9yZSI6CiAgICAic2V2ZXJpdHkiOiAiaW5mbyIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAiRGVmYXVsdCIKLSAibmFtZSI6ICJXYXRjaGRvZyIKLSAibmFtZSI6ICJDcml0aWNhbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gIm5hbWVzcGFjZSIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJEZWZhdWx0IgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJXYXRjaGRvZyIKICAtICJtYXRjaCI6CiAgICAgICJzZXZlcml0eSI6ICJjcml0aWNhbCIKICAgICJyZWNlaXZlciI6ICJDcml0aWNhbCI=

In this state, you can’t do anything, since the data stored is encoded in base64.

You can decode it either with one of the free online decoders or through the terminal:

$ echo "Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIKImluaGliaXRfcnVsZXMiOgotICJlcXVhbCI6CiAgLSAibmFtZXNwYWNlIgogIC0gImFsZXJ0bmFtZSIKICAic291cmNlX21hdGNoIjoKICAgICJzZXZlcml0eSI6ICJjcml0aWNhbCIKICAidGFyZ2V0X21hdGNoX3JlIjoKICAgICJzZXZlcml0eSI6ICJ3YXJuaW5nfGluZm8iCi0gImVxdWFsIjoKICAtICJuYW1lc3BhY2UiCiAgLSAiYWxlcnRuYW1lIgogICJzb3VyY2VfbWF0Y2giOgogICAgInNldmVyaXR5IjogIndhcm5pbmciCiAgInRhcmdldF9tYXRjaF9yZSI6CiAgICAic2V2ZXJpdHkiOiAiaW5mbyIKInJlY2VpdmVycyI6Ci0gIm5hbWUiOiAiRGVmYXVsdCIKLSAibmFtZSI6ICJXYXRjaGRvZyIKLSAibmFtZSI6ICJDcml0aWNhbCIKInJvdXRlIjoKICAiZ3JvdXBfYnkiOgogIC0gIm5hbWVzcGFjZSIKICAiZ3JvdXBfaW50ZXJ2YWwiOiAiNW0iCiAgImdyb3VwX3dhaXQiOiAiMzBzIgogICJyZWNlaXZlciI6ICJEZWZhdWx0IgogICJyZXBlYXRfaW50ZXJ2YWwiOiAiMTJoIgogICJyb3V0ZXMiOgogIC0gIm1hdGNoIjoKICAgICAgImFsZXJ0bmFtZSI6ICJXYXRjaGRvZyIKICAgICJyZWNlaXZlciI6ICJXYXRjaGRvZyIKICAtICJtYXRjaCI6CiAgICAgICJzZXZlcml0eSI6ICJjcml0aWNhbCIKICAgICJyZWNlaXZlciI6ICJDcml0aWNhbCI=" | base64 -d

The above command will echo the contents and pipe them through base64 with the -d option for decoding.
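To see the decoding step on a smaller input, the sketch below decodes only the first 48 characters of the value shown above. Base64 works in groups of four characters, so any prefix whose length is a multiple of four decodes on its own. The variable names are illustrative.

```shell
# First 48 characters of the encoded alertmanager.yaml value.
encoded='Imdsb2JhbCI6CiAgInJlc29sdmVfdGltZW91dCI6ICI1bSIK'

# printf avoids adding a trailing newline to the input.
decoded=$(printf '%s' "$encoded" | base64 -d)
printf '%s\n' "$decoded"
# prints:
# "global":
#   "resolve_timeout": "5m"

# To decode the live secret without copying the string by hand
# (the dot in the key name is escaped for jsonpath):
#   kubectl -n monitoring get secret alertmanager-main \
#     -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d
```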

The final, cleaned-up configuration will look like this:

"global":
  "resolve_timeout": "5m"
"inhibit_rules":
  - "equal":
    - "namespace"
    - "alertname"
    "source_match":
      "severity": "critical"
    "target_match_re":
      "severity": "warning|info"
  - "equal":
    - "namespace"
    - "alertname"
    "source_match":
      "severity": "warning"
    "target_match_re":
      "severity": "info"
"receivers":
  - "name": "Default"
  - "name": "Watchdog"
  - "name": "Critical"
"route":
  "group_by":
  - "namespace"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "Default"
  "repeat_interval": "12h"
  "routes":
  - "match":
      "alertname": "Watchdog"
    "receiver": "Watchdog"
  - "match":
      "severity": "critical"
    "receiver": "Critical"

Now to add the Slack webhook, a couple of edits will be needed to the configuration.

On the following link from the official docs, you can see what options are available.

Copy the configuration output from above, to a new file named alertmanager.yaml.

In that file, you will need to add three separate configs so that Alertmanager can send alerts to Slack.

Take the Slack webhook URL created previously and add it under the global config as the slack_api_url value.

"global":
  "resolve_timeout": "5m"
  "slack_api_url": "https://hooks.slack.com/services/xxxxxxxxxxxx/xxxxxxxxxx/x"

[other data truncated]

Under the route section, replace the default receiver Default with slack.

"receiver": "slack"

[other data truncated]

Create a separate receiver for Slack alerts. Under the channel, add the alert channel name you created earlier.

"receivers":
  - "name": "slack"
    "slack_configs":
    - "channel": "#prometheus_alerts"
      "send_resolved": true
      "text": " \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"

[other data truncated]
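Put together, the three edits give a file along the lines of the sketch below. It assumes that the Watchdog and Critical receivers are kept, because the sub-routes still refer to them, and that the Default receiver is dropped in favour of slack. The webhook URL is the same placeholder as above, and the file is written to a temporary location chosen for illustration. If amtool, the command-line tool distributed with Alertmanager, is installed, the file is also validated.

```shell
# Assemble the full configuration from the fragments shown above.
# The quoted 'EOF' delimiter keeps the {{ ... }} templates untouched.
config_file="${TMPDIR:-/tmp}/alertmanager.yaml"
cat > "$config_file" <<'EOF'
"global":
  "resolve_timeout": "5m"
  "slack_api_url": "https://hooks.slack.com/services/xxxxxxxxxxxx/xxxxxxxxxx/x"
"inhibit_rules":
  - "equal":
      - "namespace"
      - "alertname"
    "source_match":
      "severity": "critical"
    "target_match_re":
      "severity": "warning|info"
  - "equal":
      - "namespace"
      - "alertname"
    "source_match":
      "severity": "warning"
    "target_match_re":
      "severity": "info"
"receivers":
  - "name": "slack"
    "slack_configs":
      - "channel": "#prometheus_alerts"
        "send_resolved": true
        "text": " \nsummary: {{ .CommonAnnotations.summary }}\ndescription: {{ .CommonAnnotations.description }}"
  - "name": "Watchdog"
  - "name": "Critical"
"route":
  "group_by":
    - "namespace"
  "group_interval": "5m"
  "group_wait": "30s"
  "receiver": "slack"
  "repeat_interval": "12h"
  "routes":
    - "match":
        "alertname": "Watchdog"
      "receiver": "Watchdog"
    - "match":
        "severity": "critical"
      "receiver": "Critical"
EOF

# Validate the file if amtool is available.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config "$config_file"
else
  echo "amtool not found; wrote $config_file without validating it"
fi
```

Copy the result next to your other files as alertmanager.yaml before creating the secret, since the secret key is taken from the file name.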

Victory is close!

For you to be able to add this new configuration as a new secret, you must first delete the existing one:

$ kubectl -n monitoring delete secret alertmanager-main
secret "alertmanager-main" deleted

Now create a new secret with the exact same name from the alertmanager.yaml file:

$ kubectl -n monitoring create secret generic alertmanager-main --from-file=alertmanager.yaml
secret/alertmanager-main created
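For later configuration changes, the delete and create steps can be combined so that the secret never disappears in between. The sketch below renders the secret locally with a client-side dry run and pipes the result to kubectl apply. The helper name is an assumption, and this is one possible approach rather than the one used above.

```shell
# Hypothetical helper: regenerate the alertmanager-main secret from a
# local file and update it in place.
update_alertmanager_secret() {
  config=${1:-alertmanager.yaml}
  # --from-file=KEY=PATH keeps the key name alertmanager.yaml regardless
  # of where the file lives; --dry-run=client only prints the manifest.
  kubectl -n monitoring create secret generic alertmanager-main \
    --from-file=alertmanager.yaml="$config" \
    --dry-run=client -o yaml |
    kubectl -n monitoring apply -f -
}

# Usage:
#   update_alertmanager_secret alertmanager.yaml
```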

Finally, be patient and wait a bit for Alertmanager to load the new configuration.

If all is configured properly, you should now see some alerts on your Slack alert channel.

The Alertmanager options are nearly endless, and there is a lot of tweaking and fine-tuning that can be done.

You can set the intervals, grouping methods, different receivers, and alert severities, and you can even format the alert messages however you like.

Above is just a sample to showcase the basic configuration options.

Recap

Let’s recap on what you’ve learned from this tutorial:

You learned more about Infrastructure Monitoring
Got familiar with Kubernetes Operators
Installed kube-prometheus operator as a monitoring solution on your cluster
You learned how the Prometheus components communicate with each other
Added your own custom alerting rule
Configured Alertmanager to send alerts to a Slack channel using webhooks
From the team at Squadcast, we encourage you to keep on learning!
