<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mpho Mphego</title>
    <description>The latest articles on DEV Community by Mpho Mphego (@mmphego).</description>
    <link>https://dev.to/mmphego</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F29455%2Fc66a1404-faf0-4eed-9cf1-959e7f84c011.jpg</url>
      <title>DEV Community: Mpho Mphego</title>
      <link>https://dev.to/mmphego</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mmphego"/>
    <language>en</language>
    <item>
      <title>Note To Self: How To Delete AWS SageMaker's Endpoint With MonitoringSchedule</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Mon, 21 Nov 2022 03:51:50 +0000</pubDate>
      <link>https://dev.to/aws-builders/note-to-self-how-to-delete-aws-sagemakers-endpoint-with-monitoringschedule-5gj4</link>
      <guid>https://dev.to/aws-builders/note-to-self-how-to-delete-aws-sagemakers-endpoint-with-monitoringschedule-5gj4</guid>
      <description>&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;I have recently been deep diving into &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/whatis.html" rel="noopener noreferrer"&gt;AWS SageMaker&lt;/a&gt;. I will document my journey in another blog post, so stick around!&lt;/p&gt;

&lt;p&gt;This short post will show you how to delete an endpoint that has a monitoring schedule attached. For some reason, this isn't possible from the AWS console, which I find very odd.&lt;/p&gt;

&lt;p&gt;If you have no idea what an endpoint with a monitoring schedule is, you can read the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/model-monitor.html" rel="noopener noreferrer"&gt;Amazon SageMaker Model Monitor docs&lt;/a&gt;. If, like me, you would rather skip the AWS docs, here is the short version:&lt;br&gt;
With SageMaker Model Monitor, you can do the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor data quality and model accuracy drift.&lt;/li&gt;
&lt;li&gt;Monitor bias in your model's predictions.&lt;/li&gt;
&lt;li&gt;Monitor drift in feature attribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of this post walks you through the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Delete an endpoint with a monitoring schedule via AWS CLI.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Walk-through
&lt;/h2&gt;

&lt;p&gt;If, like me, you have tried to delete an endpoint with a monitoring schedule, you will have noticed that it is not possible. See the dreaded and cryptic error message below!&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F139628925-ceca5097-079b-4500-b75c-8f63e1a53531.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F139628925-ceca5097-079b-4500-b75c-8f63e1a53531.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fear not, I have a solution.&lt;br&gt;
We first need to delete the &lt;code&gt;MonitoringSchedules&lt;/code&gt; configured for the endpoint via the AWS CLI tool.&lt;/p&gt;

&lt;p&gt;On the SageMaker terminal, run the following commands:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SageMaker instances do not come pre-installed with &lt;a href="https://stedolan.github.io/jq/" rel="noopener noreferrer"&gt;jq&lt;/a&gt;, so first things first, install it.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="c"&gt;# Ref: https://stedolan.github.io/jq/&lt;/span&gt;
  &lt;span class="nb"&gt;sudo &lt;/span&gt;yum &lt;span class="nb"&gt;install &lt;/span&gt;jq
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Let's get the region of the endpoint we want to delete.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nv"&gt;$ REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'import boto3; print(boto3.Session().region_name)'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"REGION: &lt;/span&gt;&lt;span class="nv"&gt;$REGION&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

  REGION: us-east-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Get the list of &lt;code&gt;MonitoringSchedules&lt;/code&gt; available
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nv"&gt;$ &lt;/span&gt;aws sagemaker list-monitoring-schedules &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; | jq &lt;span class="s1"&gt;'.'&lt;/span&gt;
  &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"MonitoringScheduleSummaries"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
      &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"MonitoringScheduleName"&lt;/span&gt;: &lt;span class="s2"&gt;"my-monitoring-schedule"&lt;/span&gt;,
        &lt;span class="s2"&gt;"MonitoringScheduleArn"&lt;/span&gt;: &lt;span class="s2"&gt;"arn:aws:sagemaker:us-east-1:853052508252:monitoring-schedule/my-monitoring-schedule"&lt;/span&gt;,
        &lt;span class="s2"&gt;"CreationTime"&lt;/span&gt;: 1635378407.474,
        &lt;span class="s2"&gt;"LastModifiedTime"&lt;/span&gt;: 1635476955.122,
        &lt;span class="s2"&gt;"MonitoringScheduleStatus"&lt;/span&gt;: &lt;span class="s2"&gt;"Scheduled"&lt;/span&gt;,
        &lt;span class="s2"&gt;"EndpointName"&lt;/span&gt;: &lt;span class="s2"&gt;"xgboost-2021-10-27-23-31-41-439"&lt;/span&gt;,
        &lt;span class="s2"&gt;"MonitoringJobDefinitionName"&lt;/span&gt;: &lt;span class="s2"&gt;"data-quality-job-definition-2021-10-27-23-46-47-211"&lt;/span&gt;,
        &lt;span class="s2"&gt;"MonitoringType"&lt;/span&gt;: &lt;span class="s2"&gt;"DataQuality"&lt;/span&gt;
      &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
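&lt;p&gt;The &lt;code&gt;jq&lt;/code&gt; filter used in the next step (&lt;code&gt;.MonitoringScheduleSummaries[].MonitoringScheduleName&lt;/code&gt;) can also be mimicked in plain Python if &lt;code&gt;jq&lt;/code&gt; is unavailable. Here is a minimal sketch against a sample payload shaped like the output above:&lt;/p&gt;

```python
import json

# Sample payload shaped like the `aws sagemaker list-monitoring-schedules` output
payload = json.loads("""
{
  "MonitoringScheduleSummaries": [
    {
      "MonitoringScheduleName": "my-monitoring-schedule",
      "EndpointName": "xgboost-2021-10-27-23-31-41-439",
      "MonitoringType": "DataQuality"
    }
  ]
}
""")

# Equivalent of: jq -r '.MonitoringScheduleSummaries[].MonitoringScheduleName'
names = [s["MonitoringScheduleName"] for s in payload["MonitoringScheduleSummaries"]]
print(names)  # ['my-monitoring-schedule']
```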



&lt;ol&gt;
&lt;li&gt;Get your &lt;code&gt;MonitoringScheduleName&lt;/code&gt; and pass it to the delete command
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nv"&gt;MON_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sagemaker list-monitoring-schedules &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s1"&gt;'.MonitoringScheduleSummaries[].MonitoringScheduleName'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  aws sagemaker delete-monitoring-schedule &lt;span class="nt"&gt;--monitoring-schedule-name&lt;/span&gt; &lt;span class="nv"&gt;$MON_NAME&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now we can delete the endpoint with no issues. First, get the name of the endpoint you want to delete
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nv"&gt;$ &lt;/span&gt;aws sagemaker list-endpoints &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; | jq &lt;span class="s2"&gt;"."&lt;/span&gt;
  &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"Endpoints"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;
          &lt;span class="o"&gt;{&lt;/span&gt;
              &lt;span class="s2"&gt;"EndpointName"&lt;/span&gt;: &lt;span class="s2"&gt;"xgboost-2021-10-27-23-31-41-439"&lt;/span&gt;,
              &lt;span class="s2"&gt;"EndpointArn"&lt;/span&gt;: &lt;span class="s2"&gt;"arn:aws:sagemaker:us-east-1:853052508252:endpoint/xgboost-2021-10-27-23-31-41-439"&lt;/span&gt;,
              &lt;span class="s2"&gt;"CreationTime"&lt;/span&gt;: 1635377502.453,
              &lt;span class="s2"&gt;"LastModifiedTime"&lt;/span&gt;: 1635378004.108,
              &lt;span class="s2"&gt;"EndpointStatus"&lt;/span&gt;: &lt;span class="s2"&gt;"InService"&lt;/span&gt;
          &lt;span class="o"&gt;}&lt;/span&gt;
      &lt;span class="o"&gt;]&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;With the endpoint name, delete the endpoint
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="nv"&gt;$ ENDPOINT_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;aws sagemaker list-endpoints &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt; | jq &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="s2"&gt;".Endpoints[].EndpointName"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
  &lt;span class="nv"&gt;$ &lt;/span&gt;aws sagemaker delete-endpoint &lt;span class="nt"&gt;--endpoint-name&lt;/span&gt; &lt;span class="nv"&gt;$ENDPOINT_NAME&lt;/span&gt; &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt;
  &lt;span class="nv"&gt;$ &lt;/span&gt;aws sagemaker list-endpoints &lt;span class="nt"&gt;--region&lt;/span&gt; &lt;span class="nv"&gt;$REGION&lt;/span&gt;
  &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"Endpoints"&lt;/span&gt;: &lt;span class="o"&gt;[]&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
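&lt;p&gt;The whole sequence (delete the monitoring schedules first, then the endpoint) can also be scripted with boto3. The sketch below assumes &lt;code&gt;sm&lt;/code&gt; is a &lt;code&gt;boto3.client("sagemaker")&lt;/code&gt;; the client is only duck-typed here, so treat it as an illustration rather than a drop-in script:&lt;/p&gt;

```python
def delete_endpoint_with_schedules(sm, endpoint_name):
    """Delete all monitoring schedules attached to an endpoint, then the endpoint itself."""
    schedules = sm.list_monitoring_schedules(EndpointName=endpoint_name)
    for summary in schedules["MonitoringScheduleSummaries"]:
        sm.delete_monitoring_schedule(
            MonitoringScheduleName=summary["MonitoringScheduleName"]
        )
    # NOTE: schedule deletion is asynchronous; in practice you may need to wait
    # until the schedules are gone before the endpoint delete succeeds.
    sm.delete_endpoint(EndpointName=endpoint_name)
```

&lt;p&gt;With real credentials you would pass &lt;code&gt;boto3.client("sagemaker", region_name=REGION)&lt;/code&gt; as &lt;code&gt;sm&lt;/code&gt;.&lt;/p&gt;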



&lt;p&gt;&lt;strong&gt;NB: Always make sure to delete the endpoint and other resources when you are done, to avoid unnecessary costs!&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>aws</category>
      <category>sagemaker</category>
    </item>
    <item>
      <title>How I Setup Jenkins On Docker Container Using Ansible (Part 1)</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Wed, 26 Oct 2022 07:45:00 +0000</pubDate>
      <link>https://dev.to/aws-builders/how-i-setup-jenkins-on-docker-container-using-ansible-part-1-49po</link>
      <guid>https://dev.to/aws-builders/how-i-setup-jenkins-on-docker-container-using-ansible-part-1-49po</guid>
      <description>&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;Recently, my team found themselves in a situation where they needed a staging or development Jenkins environment. The motivation was threefold: we needed a backup Jenkins environment, a place where new Jenkins users could get their hands dirty without having to worry about breaking the production environment, and, most importantly, assurance that our Jenkins environment is stored as code and can be easily replicated.&lt;/p&gt;

&lt;p&gt;For this task, which had been in the backlog for a while, I decided to pair with my padawan/mentee (&lt;a href="https://twitter.com/AneleMakhaba"&gt;@AneleMakhaba&lt;/a&gt;), as he was a good fit (and I wanted to disseminate the knowledge as well).&lt;/p&gt;

&lt;p&gt;I thought this meme was relevant to the task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9hcCDoBH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/122241859-29ca4b00-cec3-11eb-94ca-ba484c3bb733.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9hcCDoBH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/122241859-29ca4b00-cec3-11eb-94ca-ba484c3bb733.png" alt="image" width="700" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our initial approach to the task was to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Docker image that would be used for our Jenkins environment which includes all the necessary dependencies and configuration files.

&lt;ul&gt;
&lt;li&gt;The configuration should be based on the production environment.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Explore the "As code paradigm":

&lt;ul&gt;
&lt;li&gt;Create and version control &lt;strong&gt;&lt;em&gt;Jenkins Configuration&lt;/em&gt;&lt;/strong&gt; using &lt;a href="https://www.jenkins.io/projects/jcasc/"&gt;JCasC (Jenkins Configuration as Code)&lt;/a&gt;, a Jenkins plugin that provides the ability to define the whole configuration as simple, human-friendly, plain-text YAML syntax.&lt;/li&gt;
&lt;li&gt;Create and version control &lt;strong&gt;&lt;em&gt;Jenkins Job configuration&lt;/em&gt;&lt;/strong&gt; using &lt;a href="https://jenkins-job-builder.readthedocs.io/en/latest/"&gt;Jenkins Job Builder&lt;/a&gt;, which is a Python package with the ability to store Jenkins jobs in a YAML format.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Deploy a new Jenkins instance (dev-environment) with a single command&lt;/li&gt;
&lt;li&gt;Future work includes the ability to back up and restore Jenkins job history to the newly deployed environment with a single command.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;(&lt;a href="https://twitter.com/AneleMakhaba"&gt;@AneleMakhaba&lt;/a&gt;) recently gave a Lunch 'n Learn talk that summarises this post. The talk explores the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why do we need to configure Jenkins as Code?&lt;/li&gt;
&lt;li&gt;Managing Jenkins as Code&lt;/li&gt;
&lt;li&gt;Jenkins infrastructure as Code&lt;/li&gt;
&lt;li&gt;Jenkins Jobs as Code&lt;/li&gt;
&lt;li&gt;Some of the benefits of &lt;strong&gt;&lt;em&gt;'as code'&lt;/em&gt;&lt;/strong&gt; paradigm, and a demo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/wEL1KcKTjUw"&gt;
&lt;/iframe&gt;
&lt;/p&gt;




&lt;p&gt;To avoid a very long post, this collaborative blog post is divided into three sections: &lt;a href="https://blog.mphomphego.co.za/blog/2022/05/09/How-I-setup-Jenkins-on-Docker-container-using-Ansible-Part-1.html"&gt;Instance Creation&lt;/a&gt;, &lt;a href="https://blog.mphomphego.co.za/blog/2022/05/09/How-I-setup-Jenkins-on-Docker-container-using-Ansible-Part-2.html"&gt;Containerization&lt;/a&gt; and &lt;a href="https://blog.mphomphego.co.za/blog/2021/06/15/How-I-setup-a-private-PyPI-server-using-Docker-and-Ansible.html"&gt;Automation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this post, we will detail the steps that we undertook to create the environment (an &lt;a href="https://aws.amazon.com/ec2/instance-types/"&gt;EC2 instance&lt;/a&gt;) that will host our Jenkins instance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; We did not use any AWS services to host our Jenkins environment at our workplace, instead we used &lt;a href="https://www.proxmox.com/en/"&gt;Proxmox&lt;/a&gt; containers.&lt;/p&gt;

&lt;p&gt;Thank you &lt;a href="https://twitter.com/AneleMakhaba"&gt;@AneleMakhaba&lt;/a&gt; for your collaboration in writing this post.&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Create an EC2 instance with the following specifications:

&lt;ul&gt;
&lt;li&gt;Instance type:&lt;code&gt;t2.micro&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Instance name: &lt;code&gt;jenkins-server&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Instance key pair: &lt;code&gt;jenkins-ec2&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;AMI: &lt;code&gt;ami-09d56f8956ab235b3&lt;/code&gt; (Ubuntu 20.04 LTS)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;SSH into the instance and create an &lt;code&gt;ansible&lt;/code&gt; user with &lt;code&gt;sudo&lt;/code&gt; rights.&lt;/li&gt;
&lt;li&gt;Copy the local ssh key to the instance and add it to the &lt;code&gt;ansible&lt;/code&gt; user's &lt;code&gt;authorized_keys&lt;/code&gt; file.&lt;/li&gt;
&lt;/ul&gt;
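&lt;p&gt;For reference, the same TL;DR specification can be expressed as a single boto3 call. This is a sketch only; the walk-through below uses the AWS Console, and AMI IDs are region-specific, so verify the values for your region:&lt;/p&gt;

```python
def launch_jenkins_instance(ec2):
    """Launch the t2.micro Ubuntu instance described in the TL;DR above."""
    return ec2.run_instances(
        ImageId="ami-09d56f8956ab235b3",  # Ubuntu 20.04 LTS (region-specific!)
        InstanceType="t2.micro",
        KeyName="jenkins-ec2",
        MinCount=1,
        MaxCount=1,
        TagSpecifications=[{
            "ResourceType": "instance",
            "Tags": [{"Key": "Name", "Value": "jenkins-server"}],
        }],
    )
```

&lt;p&gt;Here &lt;code&gt;ec2&lt;/code&gt; would be &lt;code&gt;boto3.client("ec2")&lt;/code&gt;.&lt;/p&gt;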

&lt;h2&gt;
  
  
  The How
&lt;/h2&gt;

&lt;p&gt;This is the first post in the series of posts that will detail the steps that we undertook to create an environment (&lt;a href="https://aws.amazon.com/ec2/instance-types/"&gt;EC2 instance&lt;/a&gt;) for running Jenkins CI. The instance was launched via the AWS Console, a future post will detail the same steps using &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; for deterministic orchestrations.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Walk-through
&lt;/h2&gt;

&lt;p&gt;This walk-through mainly focuses on &lt;strong&gt;&lt;em&gt;instance creation&lt;/em&gt;&lt;/strong&gt;. If you would like to read more about &lt;strong&gt;&lt;em&gt;containerization&lt;/em&gt;&lt;/strong&gt;, click &lt;a href="//%7B%7B%20"&gt;here&lt;/a&gt;, and &lt;a href="//%7B%7B%20"&gt;here&lt;/a&gt; for the &lt;strong&gt;&lt;em&gt;automation&lt;/em&gt;&lt;/strong&gt; walk-through.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create an EC2 instance
&lt;/h3&gt;

&lt;p&gt;To create an EC2 instance that will be used to run the Jenkins container head over to the &lt;a href="https://console.aws.amazon.com/ec2/"&gt;AWS Console&lt;/a&gt; and create a new instance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;On the console, search for &lt;strong&gt;EC2&lt;/strong&gt; and select it, then locate the &lt;strong&gt;"Launch Instance"&lt;/strong&gt; button.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v2Aw_Vw8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167109925-8509860a-1ee5-436c-8892-5a320827d41f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v2Aw_Vw8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167109925-8509860a-1ee5-436c-8892-5a320827d41f.png" alt="image" width="880" height="507"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After selecting the &lt;strong&gt;"Launch Instance"&lt;/strong&gt; button, add the &lt;strong&gt;name of your instance&lt;/strong&gt; (I chose &lt;strong&gt;Jenkins-server&lt;/strong&gt;) then select the &lt;strong&gt;"Ubuntu"&lt;/strong&gt; option for the AMI (Amazon Machine Image).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vHfsnoqp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167110033-a0953334-0a0b-406e-8168-f431624cb121.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vHfsnoqp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167110033-a0953334-0a0b-406e-8168-f431624cb121.png" alt="image" width="880" height="529"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choose an instance type of your choice (for this post we chose a &lt;code&gt;t2.micro&lt;/code&gt;), then select the &lt;code&gt;Create new key pair&lt;/code&gt; button (this key pair will be used to SSH into our instance later).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KMXrkUFD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167110250-17cdc8d8-38be-418a-8ecc-f528df8bf361.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KMXrkUFD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167110250-17cdc8d8-38be-418a-8ecc-f528df8bf361.png" alt="image" width="880" height="542"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After the instance is created, we will need to wait for it to be ready; then we will be able to SSH into it by clicking on the &lt;strong&gt;"Connect"&lt;/strong&gt; button.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kMKjJ-57--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167112330-f82f2df3-0f25-408a-af39-fb67a87e66db.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kMKjJ-57--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167112330-f82f2df3-0f25-408a-af39-fb67a87e66db.png" alt="image" width="880" height="325"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Follow the instructions to SSH into the instance.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8tvFckGu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167112430-d48bac13-f099-425a-ba6c-7d5e41dcc0e0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8tvFckGu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167112430-d48bac13-f099-425a-ba6c-7d5e41dcc0e0.png" alt="image" width="880" height="629"&gt;&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now, open a new terminal window on the host and SSH into the instance to ensure that everything is working as expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KERhN4s9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167112879-30d25b3f-c232-4f11-bf90-38eea7ce45c3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KERhN4s9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167112879-30d25b3f-c232-4f11-bf90-38eea7ce45c3.png" alt="image" width="880" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Create an Ansible user
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This is an optional step, as we could use the default EC2 user in Ansible. For security reasons, however, it is recommended to create a dedicated Ansible user with &lt;code&gt;sudo&lt;/code&gt; rights and key-based, authorized access to the instance.&lt;/p&gt;

&lt;h4&gt;
  
  
  Generate the ssh-key for your user
&lt;/h4&gt;

&lt;p&gt;First, we need to generate an ssh-key for our Ansible user from our localhost. This key will help ease the SSH connection to the instance. The following command will generate an ssh-key for the user &lt;code&gt;ansible&lt;/code&gt; on localhost:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh-keygen &lt;span class="nt"&gt;-t&lt;/span&gt; rsa &lt;span class="nt"&gt;-b&lt;/span&gt; 4096 &lt;span class="nt"&gt;-C&lt;/span&gt; &lt;span class="s2"&gt;"ansible-user"&lt;/span&gt;
&lt;span class="nb"&gt;chmod &lt;/span&gt;400 ~/.ssh/id_rsa
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can leave everything as default - a pair of private/public keys will be generated in &lt;code&gt;~/.ssh&lt;/code&gt; as &lt;code&gt;id_rsa&lt;/code&gt; (the private key) and &lt;code&gt;id_rsa.pub&lt;/code&gt; (the public key).&lt;/p&gt;

&lt;p&gt;Read more about &lt;a href="https://docs.rockylinux.org/pt/guides/security/ssh_public_private_keys/"&gt;SSH Public and Private Key&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We need to copy the contents of the public key - &lt;code&gt;id_rsa.pub&lt;/code&gt; that looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.ssh/id_rsa.pub

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDAudXEIP2qNrYDOVdS5T7ZB7...............
ansible-user
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have our ssh-key, we can SSH into the EC2 instance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"jenkins-ec2.pem"&lt;/span&gt; ubuntu@&lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;ec2&lt;/span&gt;&lt;span class="sh"&gt;-host-or-ip&amp;gt;&amp;gt;.compute-1.amazonaws.com
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can create the ansible user and assign it &lt;strong&gt;&lt;em&gt;sudo rights&lt;/em&gt;&lt;/strong&gt; (I know) on the EC2 instance:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;su -
adduser ansible
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"ansible ALL=(ALL) NOPASSWD:ALL"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/sudoers
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /home/ansible/.ssh
&lt;span class="nb"&gt;cd&lt;/span&gt; /home/ansible/.ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then paste the contents of the public key that we generated earlier on the host into the &lt;code&gt;authorized_keys&lt;/code&gt; file and save.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vi authorized_keys
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will ensure that we can SSH into the instance without a password, and we can run Ansible commands without being prompted for a password each time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This is not the best security practice, but it is a good starting point.&lt;/p&gt;
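&lt;p&gt;Once the passwordless SSH access works, a minimal Ansible inventory for this host could look like the following sketch (the host address is a placeholder; substitute your instance's public DNS or IP):&lt;/p&gt;

```ini
[jenkins]
jenkins-server ansible_host=<ec2-host-or-ip> ansible_user=ansible ansible_ssh_private_key_file=~/.ssh/id_rsa
```

&lt;p&gt;You can then verify connectivity with &lt;code&gt;ansible -i inventory jenkins -m ping&lt;/code&gt;.&lt;/p&gt;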

&lt;p&gt;Going back to the host environment, we can test the SSH connection to the EC2 instance using the ansible user that we just created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--166fx3gr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167115045-cfea6afa-c896-463f-938b-e7003d0fd212.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--166fx3gr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/167115045-cfea6afa-c896-463f-938b-e7003d0fd212.png" alt="image" width="880" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have a running instance that you can SSH into, you can create a Jenkins server on it. The post &lt;a href="//%7B%7B%20"&gt;How I setup Jenkins on Docker container using Ansible Part 2&lt;/a&gt; will detail the steps to create a Jenkins server on an EC2 instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Congratulations! You have successfully created an EC2 instance that will run the Jenkins environment. You can now use the instance to run Ansible playbooks and containers. Another avenue to explore is &lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt; for deterministic deployment instead of relying on the AWS Console. This will be covered in future posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.ansible.com/"&gt;Ansible&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://console.aws.amazon.com/ec2/"&gt;AWS Console&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aws.amazon.com/ec2/instance-types/"&gt;EC2 instance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://jenkins.io/"&gt;Jenkins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.rockylinux.org/pt/guides/security/ssh_public_private_keys/"&gt;SSH Public and Private Key&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.terraform.io/"&gt;Terraform&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>How To Build An ETL Using Python, Docker, PostgreSQL And Airflow</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Sun, 09 Jan 2022 12:09:15 +0000</pubDate>
      <link>https://dev.to/mmphego/how-to-build-an-etl-using-python-docker-postgresql-and-airflow-4ooo</link>
      <guid>https://dev.to/mmphego/how-to-build-an-etl-using-python-docker-postgresql-and-airflow-4ooo</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.mphomphego.co.za%2Fassets%2F2022-01-09-How-to-build-an-ETL-using-Python-Docker-PostgreSQL-and-Airflow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fblog.mphomphego.co.za%2Fassets%2F2022-01-09-How-to-build-an-ETL-using-Python-Docker-PostgreSQL-and-Airflow.png" alt="post image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;30 Min Read&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Updated: 2022-02-18 06:54:15 +02:00&lt;/strong&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Story
&lt;/h1&gt;

&lt;p&gt;During the past few years, I have developed an interest in Machine Learning but have never written much about the topic. In this post, I want to share some insights about the foundational layers of the ML stack, starting with the basics and then moving on to more advanced topics.&lt;/p&gt;

&lt;p&gt;This post will detail how to build an &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load" rel="noopener noreferrer"&gt;ETL (Extract, Transform and Load)&lt;/a&gt; using Python, &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;, &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; and &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;You will need to sit down comfortably for this one, it will not be a quick read.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before we get started, let's take a look at what ETL is and why it is important.&lt;/p&gt;

&lt;p&gt;One of the foundational layers when it comes to Machine Learning is ETL(Extract, Transform and Load). According to &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load" rel="noopener noreferrer"&gt;Wikipedia&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ETL is the general procedure of copying data from one or more sources into a destination system that represents the data differently from the source(s) or in a different context than the source(s).&lt;br&gt;
Data extraction involves &lt;strong&gt;extracting data&lt;/strong&gt; from (one or more) homogeneous or heterogeneous sources; &lt;strong&gt;data transformation&lt;/strong&gt; processes data by data cleaning and transforming it into a proper storage format/structure for the purposes of querying and analysis; finally, &lt;strong&gt;data loading&lt;/strong&gt; describes the insertion of data into the final target database such as an operational data store, a data mart, data lake or a data warehouse.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;One might begin to wonder, Why do we need an ETL pipeline?&lt;/p&gt;

&lt;p&gt;Assume we had a set of data that we wanted to use. However, this data is unclean, missing information, and inconsistent, as most data is. One solution would be to have a program clean and transform this data so that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There is no missing information&lt;/li&gt;
&lt;li&gt;Data is consistent&lt;/li&gt;
&lt;li&gt;Data is fast to load into another program&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With smart devices, online communities, and e-commerce, there is an abundance of raw, unfiltered data in today's industry. However, most of it is squandered because it is tangled and therefore difficult to interpret. ETL pipelines exist to combat this by automating data collection and transformation so that analysts can use the data for business insights.&lt;/p&gt;

&lt;p&gt;There are many different tools and frameworks used to build ETL pipelines. In this post, I will focus on how one can &lt;strong&gt;tediously&lt;/strong&gt; build an ETL pipeline using Python, &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;, &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;PostgreSQL&lt;/a&gt; and &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149873065-2b5d8766-7ae7-452c-8dd8-6b3e3442a63f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149873065-2b5d8766-7ae7-452c-8dd8-6b3e3442a63f.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;There's no free lunch. Read the whole post.&lt;/p&gt;

&lt;p&gt;Code used in this post is available on &lt;a href="https://github.com/mmphego/simple-etl" rel="noopener noreferrer"&gt;https://github.com/mmphego/simple-etl&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The How
&lt;/h2&gt;

&lt;p&gt;For this post, we will be using data from the &lt;a href="https://archive.ics.uci.edu/ml/datasets/Wine+Quality" rel="noopener noreferrer"&gt;UC-Irvine Machine Learning Repository&lt;/a&gt;. This Wine Quality dataset is the result of chemical analyses of various wines grown in Portugal.&lt;/p&gt;

&lt;p&gt;We will extract the data from a public repository (for this post, I uploaded the data to &lt;a href="//gist.github.com"&gt;gist.github.com&lt;/a&gt;) and transform it into a format that can be used by ML algorithms (not covered in this post). We will then load both the raw and transformed data into a PostgreSQL database running in a Docker container, and finally create a &lt;a href="https://airflow.apache.org/tutorial.html" rel="noopener noreferrer"&gt;DAG&lt;/a&gt; that runs the ETL pipeline periodically in &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Walk-through
&lt;/h2&gt;

&lt;p&gt;Before we can do any transformation, we need to extract the data from a public repository. Using Python and Pandas, we will extract the data from a public repository and upload the raw data to a PostgreSQL database. This assumes that we have an existing PostgreSQL database running in a Docker container.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Setup
&lt;/h3&gt;

&lt;p&gt;Let's start by setting up our environment. First, we will set up our Jupyter Notebook and PostgreSQL database. Then, we will set up Apache Airflow (a fancy cron-like scheduler).&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup PostgreSQL and Jupyter Notebook
&lt;/h4&gt;

&lt;p&gt;In this section, we will set up the PostgreSQL database and Jupyter Notebook. First, we will need to create a &lt;code&gt;.env&lt;/code&gt; file in the project directory. This file will contain the PostgreSQL database credentials which are needed in the &lt;code&gt;docker-compose.yml&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; .env
POSTGRES_DB=winequality
POSTGRES_USER=postgres
POSTGRES_PASSWORD=postgres
POSTGRES_HOST=database
POSTGRES_PORT=5432
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have the &lt;code&gt;.env&lt;/code&gt; file, we can create a &lt;code&gt;Postgres&lt;/code&gt; container instance that we will use as our &lt;a href="https://en.wikipedia.org/wiki/Data_warehouse" rel="noopener noreferrer"&gt;Data Warehouse&lt;/a&gt;.&lt;br&gt;
The code below will create a &lt;code&gt;postgres-docker-compose.yaml&lt;/code&gt; file that contains all the necessary information to run the container, including a Jupyter Notebook that we can use to interact with the container and/or data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; postgres-docker-compose.yaml
version: "3.8"
# Optional Jupyter Notebook service
services:
  jupyter_notebook:
    image: "jupyter/minimal-notebook"
    container_name: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CONTAINER_NAME&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;jupyter_notebook&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;
    environment:
      JUPYTER_ENABLE_LAB: "yes"
    ports:
      - "8888:8888"
    volumes:
      - &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;PWD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:/home/jovyan/work
    depends_on:
      - database
    links:
      - database
    networks:
      - etl_network

  database:
    image: "postgres:11"
    container_name: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;CONTAINER_NAME&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;database&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;
    ports:
      - "5432:5432"
    expose:
      - "5432"
    environment:
      POSTGRES_DB: "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
      POSTGRES_HOST: "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_HOST&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
      POSTGRES_PASSWORD: "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
      POSTGRES_PORT: "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_PORT&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
      POSTGRES_USER: "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
    healthcheck:
      test:
        [
          "CMD",
          "pg_isready",
          "-U",
          "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;",
          "-d",
          "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
        ]
      interval: 5s
      retries: 5
    restart: always
    volumes:
      - /tmp/pg-data/:/var/lib/postgresql/data/
      - ./init-db.sql:/docker-entrypoint-initdb.d/init.sql
    networks:
      - etl_network

volumes:
  dbdata: null

# Create a custom network for bridging the containers
networks:
  etl_network: null
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But before we can run the container, we need to create the &lt;code&gt;init-db.sql&lt;/code&gt; file containing the SQL command that creates the database. Postgres's Docker entrypoint runs this script the first time the container starts. Read more about the Postgres Docker entrypoint &lt;a href="https://github.com/docker-library/docs/blob/master/postgres/README.md#initialization-scripts" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; init-db.sql
CREATE DATABASE &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POSTGRES_DB&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;;
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
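&lt;p&gt;A note on the heredoc above: because the &lt;code&gt;EOF&lt;/code&gt; delimiter is unquoted, the shell expands &lt;code&gt;${POSTGRES_DB}&lt;/code&gt; at the moment the file is written, which is why the &lt;code&gt;.env&lt;/code&gt; file must be sourced first. A quick sanity check (writing to &lt;code&gt;/tmp&lt;/code&gt; so we don't touch the project files):&lt;/p&gt;

```shell
# Unquoted EOF means ${POSTGRES_DB} is expanded when the file is written
export POSTGRES_DB=winequality   # normally provided by `source .env`
cat << EOF > /tmp/init-db.sql
CREATE DATABASE ${POSTGRES_DB};
EOF
cat /tmp/init-db.sql   # → CREATE DATABASE winequality;
```

If the variable were unset, the file would contain `CREATE DATABASE ;`, which fails at container startup.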



&lt;p&gt;After creating the &lt;code&gt;postgres-docker-compose.yaml&lt;/code&gt; file, we need to source the &lt;code&gt;.env&lt;/code&gt; file, create a &lt;a href="https://docs.docker.com/engine/reference/commandline/network_create/" rel="noopener noreferrer"&gt;docker network&lt;/a&gt; (the docker network will ensure all containers are interconnected) and then run the &lt;code&gt;docker-compose up&lt;/code&gt; command to start the container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; the current local directory is mounted to the &lt;code&gt;/home/jovyan/work&lt;/code&gt; directory in the container. This allows the container to access the data in the local directory, i.e. all files in the local directory will be available inside the container.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; .env
&lt;span class="c"&gt;# Install yq (https://github.com/mikefarah/yq/#install) to parse the YAML file and retrieve the network name&lt;/span&gt;
&lt;span class="nv"&gt;NETWORK_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;yq &lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s1"&gt;'.networks'&lt;/span&gt; postgres-docker-compose.yaml | &lt;span class="nb"&gt;cut&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; 1 &lt;span class="nt"&gt;-d&lt;/span&gt;&lt;span class="s1"&gt;':'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
docker network create &lt;span class="nv"&gt;$NETWORK_NAME&lt;/span&gt;
&lt;span class="c"&gt;# or hardcode the network name from the YAML file&lt;/span&gt;
&lt;span class="c"&gt;# docker network create etl_network&lt;/span&gt;
docker-compose &lt;span class="nt"&gt;--env-file&lt;/span&gt; ./.env &lt;span class="nt"&gt;-f&lt;/span&gt; ./postgres-docker-compose.yaml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When we run the &lt;code&gt;docker-compose up&lt;/code&gt; command, we will see the following output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Starting database_1 ... &lt;span class="k"&gt;done
&lt;/span&gt;Starting jupyter_notebook_1 ... &lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the container is running in detached mode, we will need to run the &lt;code&gt;docker-compose logs&lt;/code&gt; command to see the logs and retrieve the URL of the Jupyter Notebook. The command below will print the URL (with access token) of the Jupyter Notebook.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker logs &lt;span class="si"&gt;$(&lt;/span&gt;docker ps &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="nt"&gt;--filter&lt;/span&gt; &lt;span class="s2"&gt;"ancestor=jupyter/minimal-notebook"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s1"&gt;'http://127.0.0.1'&lt;/span&gt; | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once everything is running, we can open the Jupyter Notebook in the browser using the URL from the logs and have fun.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149630433-be1fe527-7f9e-4041-a824-4c6340fe136e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149630433-be1fe527-7f9e-4041-a824-4c6340fe136e.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Setup Airflow
&lt;/h4&gt;

&lt;p&gt;In this section, we will set up the Airflow environment. A quick overview: &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Apache Airflow&lt;/a&gt; is an &lt;strong&gt;open-source tool&lt;/strong&gt; for orchestrating complex computational workflows and creating data processing pipelines. Think of it as a fancy version of a job scheduler or cron job. A workflow is a series of tasks executed in a specific order; in Airflow, workflows are defined as &lt;strong&gt;DAGs&lt;/strong&gt;. A &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html" rel="noopener noreferrer"&gt;DAG (Directed Acyclic Graph)&lt;/a&gt; is a graph whose nodes (tasks) are connected by directed edges (dependencies), with no cycles.&lt;/p&gt;

&lt;p&gt;The image below shows an example of a DAG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fairflow.apache.org%2Fdocs%2Fapache-airflow%2Fstable%2F_images%2Fbranch_note.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fairflow.apache.org%2Fdocs%2Fapache-airflow%2Fstable%2F_images%2Fbranch_note.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Read more about DAGs here: &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html" rel="noopener noreferrer"&gt;https://airflow.apache.org/docs/apache-airflow/stable/concepts/dags.html&lt;/a&gt;&lt;/p&gt;
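&lt;p&gt;To make the &lt;em&gt;directed acyclic&lt;/em&gt; part concrete, here is a plain-Python sketch (not Airflow code) of a hypothetical ETL DAG, using the standard library's &lt;code&gt;graphlib&lt;/code&gt; to compute one valid execution order in which every task runs after its dependencies:&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Hypothetical ETL tasks; each task maps to the set of tasks it depends on
dag = {
    "extract": set(),
    "transform": {"extract"},
    "load_raw": {"extract"},
    "load_transformed": {"transform"},
}

# A valid execution order: dependencies always come before dependents
order = list(TopologicalSorter(dag).static_order())
print(order)
```

This mirrors what the Airflow scheduler does for real DAGs: it only triggers a task once everything upstream of it has completed.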

&lt;p&gt;Now that we have covered the basics of Airflow and DAGs, let's set up Airflow. First, we will create our custom Airflow Docker image. This image installs the Python packages we need to run the ETL (Extract, Transform and Load) pipeline.&lt;/p&gt;

&lt;p&gt;First, let's create a list of Python packages that we will need to install.&lt;/p&gt;

&lt;p&gt;Run the following command to create the &lt;code&gt;requirements.txt&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; requirements.txt
pandas==1.3.5
psycopg2-binary==2.8.6
python-dotenv==0.19.2
SQLAlchemy==1.3.24
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149872512-c241ada4-5f3a-493c-98b5-932ac459f893.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149872512-c241ada4-5f3a-493c-98b5-932ac459f893.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we will create a Dockerfile that will install the required Python packages (Ideally, we should only install packages in a virtual environment but for this post, we will install all packages in the Dockerfile).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; airflow-dockerfile
FROM apache/airflow:2.2.3
ADD requirements.txt /usr/local/airflow/requirements.txt
RUN pip install --no-cache-dir -U pip setuptools wheel
RUN pip install --no-cache-dir -r /usr/local/airflow/requirements.txt
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can create a Docker compose file that will run the Airflow container. The &lt;code&gt;airflow-docker-compose.yaml&lt;/code&gt; below is a modified version of the &lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html" rel="noopener noreferrer"&gt;official Airflow Docker&lt;/a&gt;. We have added the following changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses a customized Airflow image that installs our Python dependencies.&lt;/li&gt;
&lt;li&gt;Removes example DAGs and reloads DAGs every 60 seconds.&lt;/li&gt;
&lt;li&gt;Limits memory to 4GB.&lt;/li&gt;
&lt;li&gt;Allocates 2 workers to the Gunicorn web server.&lt;/li&gt;
&lt;li&gt;Adds our &lt;code&gt;.env&lt;/code&gt; file to the Airflow container.&lt;/li&gt;
&lt;li&gt;Creates a custom network bridging the containers (Jupyter, PostgresDB and Airflow).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When deployed, the &lt;code&gt;airflow-docker-compose.yaml&lt;/code&gt; file will start the following containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;airflow-scheduler&lt;/strong&gt; - The scheduler monitors all tasks and DAGs, then triggers the task instances once their dependencies are complete.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;airflow-webserver&lt;/strong&gt; - The webserver is available at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;airflow-worker&lt;/strong&gt; - The worker that executes the tasks given by the scheduler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;airflow-init&lt;/strong&gt; - The initialization service.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;flower&lt;/strong&gt; - The flower app for monitoring the environment. It is available at &lt;a href="http://localhost:5555" rel="noopener noreferrer"&gt;http://localhost:5555&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;postgres&lt;/strong&gt; - The database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;redis&lt;/strong&gt; - The redis-broker that forwards messages from scheduler to worker.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt; &amp;gt; airflow-docker-compose.yaml
---
version: '3'
x-airflow-common:
  &amp;amp;airflow-common
  build:
    context: .
    dockerfile: airflow-dockerfile
  environment:
    &amp;amp;airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__API__AUTH_BACKEND: 'airflow.api.auth.backend.basic_auth'
    # Scan for DAGs every 60 seconds
    AIRFLOW__SCHEDULER__DAG_DIR_LIST_INTERVAL: '60'
    AIRFLOW__WEBSERVER__SECRET_KEY: '3d6f45a5fc12445dbac2f59c3b6c7cb1'
    # Prevent airflow from reloading the dags all the time and set:
    AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL: '60'
    # 2 * NUM_CPU_CORES + 1
    AIRFLOW__WEBSERVER__WORKERS: '2'
    # Kill workers if they don't start within 5min instead of 2min
    AIRFLOW__WEBSERVER__WEB_SERVER_WORKER_TIMEOUT: '300'

  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins

  env_file:
    - ./.env
  user: "&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_UID&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;50000&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;AIRFLOW_GID&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;50000&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"
  mem_limit: 4000m
  depends_on:
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy
  networks:
    - etl_network

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: [ "CMD", "pg_isready", "-U", "airflow" ]
      interval: 5s
      retries: 5
    restart: always
    networks:
      - etl_network

  redis:
    image: redis:latest
    ports:
      - 6379:6379
    healthcheck:
      test: [ "CMD", "redis-cli", "ping" ]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always
    mem_limit: 4000m
    networks:
      - etl_network

  airflow-webserver:
    &amp;lt;&amp;lt;: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test:
        [
          "CMD",
          "curl",
          "--fail",
          "http://localhost:8080/health"
        ]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-scheduler:
    &amp;lt;&amp;lt;: *airflow-common
    command: scheduler
    healthcheck:
      test:
        [
          "CMD-SHELL",
          'airflow jobs check --job-type SchedulerJob --hostname
            "&lt;/span&gt;&lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="sh"&gt;{HOSTNAME}"'
        ]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-worker:
    &amp;lt;&amp;lt;: *airflow-common
    command: celery worker
    healthcheck:
      test:
        - "CMD-SHELL"
        - 'celery --app airflow.executors.celery_executor.app inspect ping -d
          "celery@&lt;/span&gt;&lt;span class="nv"&gt;$$&lt;/span&gt;&lt;span class="sh"&gt;{HOSTNAME}"'
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-init:
    &amp;lt;&amp;lt;: *airflow-common
    command: version
    environment:
      &amp;lt;&amp;lt;: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;_AIRFLOW_WWW_USER_USERNAME&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;airflow&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;
      _AIRFLOW_WWW_USER_PASSWORD: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;_AIRFLOW_WWW_USER_PASSWORD&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;airflow&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;

  flower:
    &amp;lt;&amp;lt;: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: [ "CMD", "curl", "--fail", "http://localhost:5555/" ]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always
    mem_limit: 4000m

volumes:
  postgres-db-volume: null

# Create a custom network for bridging the containers
networks:
  etl_network: null
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before starting Airflow for the first time, we need to prepare our environment. We need to add the Airflow user and group IDs to our &lt;code&gt;.env&lt;/code&gt; file so that the directories we mount into the container are owned by the correct user rather than &lt;code&gt;root&lt;/code&gt;. The directories are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;./dags&lt;/strong&gt; - you can put your DAG files here.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;./logs&lt;/strong&gt; - contains logs from task execution and scheduler.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;./plugins&lt;/strong&gt; - you can put your custom plugins here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The following commands will create the directories and append the Airflow user &amp;amp; group IDs to the &lt;code&gt;.env&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ./dags ./logs ./plugins
&lt;span class="nb"&gt;chmod&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; 777 ./dags ./logs ./plugins
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"AIRFLOW_UID=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"AIRFLOW_GID=0"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; .env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After that, we need to initialize the Airflow database. We can do this by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; airflow-docker-compose.yaml up airflow-init
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will initialize the Airflow database and create the Airflow user.&lt;br&gt;
Once that is done, we can start the rest of the Airflow services.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; airflow-docker-compose.yaml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;docker ps&lt;/code&gt; will show us the list of containers running and we should make sure that the status of all containers is &lt;strong&gt;Up&lt;/strong&gt; as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149629588-340fabd0-335d-4bb9-b689-280b16f5d111.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149629588-340fabd0-335d-4bb9-b689-280b16f5d111.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once we have confirmed that the Airflow, Jupyter and database services are running, we can open the Airflow webserver.&lt;/p&gt;

&lt;p&gt;The webserver is available at &lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;. The default account has the login &lt;strong&gt;airflow&lt;/strong&gt; and the password &lt;strong&gt;airflow&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now that all the hard work is done, we can create our ETL and DAGs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149839243-6faae305-592b-4b06-bedd-50669cc3fb2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149839243-6faae305-592b-4b06-bedd-50669cc3fb2a.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Memory and CPU utilization
&lt;/h4&gt;

&lt;p&gt;When all the containers are running, you may experience system lag if your machine cannot handle the load. Monitoring the CPU and memory utilization of the containers is crucial to maintaining good performance and a reliable system. To do so, we use Docker's &lt;code&gt;stats&lt;/code&gt; command, which gives us a live view of each running container's CPU, memory, network, and disk utilization.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the above command will look like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;CONTAINER ID   NAME                          CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
c857cddcac2b   dataeng_airflow-scheduler_1   89.61%    198.8MiB / 3.906GiB   4.97%     2.49MB / 3.72MB   0B / 0B           3
b4be499c5e4f   dataeng_airflow-worker_1      0.29%     1.286GiB / 3.906GiB   32.93%    304kB / 333kB     0B / 172kB        21
20af4408fd3d   dataeng_flower_1              0.14%     156.1MiB / 3.906GiB   3.90%     155kB / 93.4kB    0B / 0B           74
075bb3178876   dataeng_airflow-webserver_1   0.11%     715.4MiB / 3.906GiB   17.89%    1.19MB / 808kB    0B / 8.19kB       30
967341194e93   dataeng_postgres_1            4.89%     43.43MiB / 15.26GiB   0.28%     4.85MB / 4.12MB   0B / 4.49MB       15
a0de99b6e4b5   dataeng_redis_1               0.12%     7.145MiB / 15.26GiB   0.05%     413kB / 428kB     0B / 4.1kB        5
6ad0eacdcfe2   jupyter_notebook              0.00%     128.7MiB / 15.26GiB   0.82%     800kB / 5.87MB    91.2MB / 12.3kB   3
4ba2e98a551a   database                      6.80%     25.97MiB / 15.26GiB   0.17%     19.7kB / 0B       94.2kB / 1.08MB   7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
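&lt;p&gt;If you want to flag heavy containers programmatically rather than eyeballing the table, the output of &lt;code&gt;docker stats --no-stream&lt;/code&gt; can be parsed with a few lines of Python. This is a sketch fed with a captured sample (the column layout is assumed to match the output above) instead of a live Docker daemon:&lt;/p&gt;

```python
import re

# A captured sample of `docker stats --no-stream` output (assumed format)
sample = """\
CONTAINER ID   NAME                          CPU %     MEM USAGE / LIMIT     MEM %
b4be499c5e4f   dataeng_airflow-worker_1      0.29%     1.286GiB / 3.906GiB   32.93%
a0de99b6e4b5   dataeng_redis_1               0.12%     7.145MiB / 15.26GiB   0.05%
"""

def heavy_containers(stats: str, mem_threshold: float = 20.0) -> list[str]:
    """Return names of containers whose MEM % exceeds the threshold."""
    names = []
    for line in stats.splitlines()[1:]:            # skip the header row
        cols = re.split(r"\s{2,}", line.strip())   # columns are 2+ spaces apart
        name, mem_pct = cols[1], float(cols[-1].rstrip("%"))
        if mem_pct > mem_threshold:
            names.append(name)
    return names

print(heavy_containers(sample))  # → ['dataeng_airflow-worker_1']
```

The same idea could be wired into a cron job or alert, but for a local setup like ours, glancing at `docker stats` is usually enough.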



&lt;h4&gt;
  
  
  Clean Up
&lt;/h4&gt;

&lt;p&gt;To stop and remove all the containers, including the bridge network, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; airflow-docker-compose.yaml down &lt;span class="nt"&gt;--volumes&lt;/span&gt; &lt;span class="nt"&gt;--rmi&lt;/span&gt; all
docker-compose &lt;span class="nt"&gt;-f&lt;/span&gt; postgres-docker-compose.yaml down &lt;span class="nt"&gt;--volumes&lt;/span&gt; &lt;span class="nt"&gt;--rmi&lt;/span&gt; all
docker network &lt;span class="nb"&gt;rm &lt;/span&gt;etl_network
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extract, Transform and Load
&lt;/h3&gt;

&lt;p&gt;Now that we have the Jupyter, Airflow and Postgres services running, we can start creating our ETL. Open the Jupyter Notebook and create a new notebook called &lt;code&gt;Simple ETL&lt;/code&gt;. For this post, we will use the Wine Quality dataset from earlier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 0: Install the required libraries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We need to install the required libraries for our ETL, these include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;pandas&lt;/em&gt;: Used for data manipulation&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;python-dotenv&lt;/em&gt;: Used for loading environment variables&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;SQLAlchemy&lt;/em&gt;: Used for connecting to databases (Postgres)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;psycopg2&lt;/em&gt;: Postgres adapter for SQLAlchemy
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="n"&gt;requirements&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;txt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
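The post does not show the &lt;code&gt;requirements.txt&lt;/code&gt; file itself; based on the list of libraries above, it would contain something like the following (unpinned here since the post specifies no versions):

```text
pandas
python-dotenv
SQLAlchemy
psycopg2
```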



&lt;p&gt;&lt;strong&gt;Step 1: Import libraries and load the environment variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The first step is to import all the modules, load the environment variables and create the &lt;code&gt;connection_uri&lt;/code&gt; variable that will be used to connect to the Postgres database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dotenv_values&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inspect&lt;/span&gt;

&lt;span class="n"&gt;CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dotenv_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;

&lt;span class="n"&gt;connection_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql+psycopg2://{}:{}@{}:{}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;POSTGRES_HOST&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POSTGRES_PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Create a connection to the Postgres database&lt;/strong&gt;&lt;br&gt;
We will treat this database as a fake production database that will house both our raw and transformed data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool_pre_ping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
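Note that &lt;code&gt;create_engine&lt;/code&gt; is lazy: the connection is only actually opened by &lt;code&gt;engine.connect()&lt;/code&gt;. A quick way to prove the connection is alive is to run a trivial query. The sketch below uses an in-memory SQLite engine as a stand-in for the Postgres &lt;code&gt;connection_uri&lt;/code&gt;:

```python
from sqlalchemy import create_engine, text

# In-memory SQLite stands in for the Postgres connection_uri in this sketch
engine = create_engine("sqlite://", pool_pre_ping=True)

# A trivial query proves the connection is alive
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()
print(result)  # 1
```

The &lt;code&gt;pool_pre_ping=True&lt;/code&gt; flag makes the connection pool test each connection before handing it out, which avoids stale-connection errors on a long-running service.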



&lt;p&gt;&lt;strong&gt;Step 3: Extract the data from the hosting service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once we have a connection to the Postgres database, we can pull a copy of the &lt;a href="https://archive.ics.uci.edu/ml/datasets/Wine+Quality" rel="noopener noreferrer"&gt;UC-Irvine Wine Quality dataset&lt;/a&gt;, which I recently uploaded to &lt;a href="https://gist.github.com/mmphego" rel="noopener noreferrer"&gt;https://gist.github.com/mmphego&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://gist.githubusercontent.com/mmphego/5b6fc4d6dc3c8fba4fce9d994a2fe16b/raw/ab5df0e76812e13df5b31e466a5fb787fac0599a/wine_quality.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is always a good idea to check the data before you start working with it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
    table-layout: fixed;
    border-collapse: collapse;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;fixed acidity&lt;/th&gt;
      &lt;th&gt;volatile acidity&lt;/th&gt;
      &lt;th&gt;citric acid&lt;/th&gt;
      &lt;th&gt;residual sugar&lt;/th&gt;
      &lt;th&gt;chlorides&lt;/th&gt;
      &lt;th&gt;free sulfur dioxide&lt;/th&gt;
      &lt;th&gt;total sulfur dioxide&lt;/th&gt;
      &lt;th&gt;density&lt;/th&gt;
      &lt;th&gt;pH&lt;/th&gt;
      &lt;th&gt;sulphates&lt;/th&gt;
      &lt;th&gt;alcohol&lt;/th&gt;
      &lt;th&gt;quality&lt;/th&gt;
      &lt;th&gt;winecolor&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;7.0&lt;/td&gt;
      &lt;td&gt;0.27&lt;/td&gt;
      &lt;td&gt;0.36&lt;/td&gt;
      &lt;td&gt;20.7&lt;/td&gt;
      &lt;td&gt;0.045&lt;/td&gt;
      &lt;td&gt;45.0&lt;/td&gt;
      &lt;td&gt;170.0&lt;/td&gt;
      &lt;td&gt;1.0010&lt;/td&gt;
      &lt;td&gt;3.00&lt;/td&gt;
      &lt;td&gt;0.45&lt;/td&gt;
      &lt;td&gt;8.8&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;white&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;6.3&lt;/td&gt;
      &lt;td&gt;0.30&lt;/td&gt;
      &lt;td&gt;0.34&lt;/td&gt;
      &lt;td&gt;1.6&lt;/td&gt;
      &lt;td&gt;0.049&lt;/td&gt;
      &lt;td&gt;14.0&lt;/td&gt;
      &lt;td&gt;132.0&lt;/td&gt;
      &lt;td&gt;0.9940&lt;/td&gt;
      &lt;td&gt;3.30&lt;/td&gt;
      &lt;td&gt;0.49&lt;/td&gt;
      &lt;td&gt;9.5&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;white&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;8.1&lt;/td&gt;
      &lt;td&gt;0.28&lt;/td&gt;
      &lt;td&gt;0.40&lt;/td&gt;
      &lt;td&gt;6.9&lt;/td&gt;
      &lt;td&gt;0.050&lt;/td&gt;
      &lt;td&gt;30.0&lt;/td&gt;
      &lt;td&gt;97.0&lt;/td&gt;
      &lt;td&gt;0.9951&lt;/td&gt;
      &lt;td&gt;3.26&lt;/td&gt;
      &lt;td&gt;0.44&lt;/td&gt;
      &lt;td&gt;10.1&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;white&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;7.2&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.32&lt;/td&gt;
      &lt;td&gt;8.5&lt;/td&gt;
      &lt;td&gt;0.058&lt;/td&gt;
      &lt;td&gt;47.0&lt;/td&gt;
      &lt;td&gt;186.0&lt;/td&gt;
      &lt;td&gt;0.9956&lt;/td&gt;
      &lt;td&gt;3.19&lt;/td&gt;
      &lt;td&gt;0.40&lt;/td&gt;
      &lt;td&gt;9.9&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;white&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;7.2&lt;/td&gt;
      &lt;td&gt;0.23&lt;/td&gt;
      &lt;td&gt;0.32&lt;/td&gt;
      &lt;td&gt;8.5&lt;/td&gt;
      &lt;td&gt;0.058&lt;/td&gt;
      &lt;td&gt;47.0&lt;/td&gt;
      &lt;td&gt;186.0&lt;/td&gt;
      &lt;td&gt;0.9956&lt;/td&gt;
      &lt;td&gt;3.19&lt;/td&gt;
      &lt;td&gt;0.40&lt;/td&gt;
      &lt;td&gt;9.9&lt;/td&gt;
      &lt;td&gt;6&lt;/td&gt;
      &lt;td&gt;white&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We also need to understand the data types we will be working with. This gives us a clear indication of which features we may need to engineer and whether there are missing values to fill in.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64
 12  winecolor             6497 non-null   object
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;From the above information, we can see that there are 6497 rows and 13 columns. The 13th column, &lt;code&gt;winecolor&lt;/code&gt;, does not contain numerical values, so we need to convert/&lt;strong&gt;transform&lt;/strong&gt; it into numerical values.&lt;/p&gt;
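Before encoding, it helps to confirm exactly which categories &lt;code&gt;winecolor&lt;/code&gt; holds. A minimal sketch with stand-in values (the real dataset contains &lt;code&gt;red&lt;/code&gt; and &lt;code&gt;white&lt;/code&gt;):

```python
import pandas as pd

# Toy stand-in for the wine dataset's categorical column
df = pd.DataFrame({"winecolor": ["white", "white", "red", "white"]})

# List the distinct categories and how often each occurs
counts = df["winecolor"].value_counts()
print(counts.to_dict())  # {'white': 3, 'red': 1}
```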

&lt;p&gt;Now, let's check the table summary, which gives us a quick statistical overview of the numerical data: the count, mean, standard deviation, min, max, and the 25th, 50th and 75th percentiles of each column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;fixed acidity&lt;/th&gt;
      &lt;th&gt;volatile acidity&lt;/th&gt;
      &lt;th&gt;citric acid&lt;/th&gt;
      &lt;th&gt;residual sugar&lt;/th&gt;
      &lt;th&gt;chlorides&lt;/th&gt;
      &lt;th&gt;free sulfur dioxide&lt;/th&gt;
      &lt;th&gt;total sulfur dioxide&lt;/th&gt;
      &lt;th&gt;density&lt;/th&gt;
      &lt;th&gt;pH&lt;/th&gt;
      &lt;th&gt;sulphates&lt;/th&gt;
      &lt;th&gt;alcohol&lt;/th&gt;
      &lt;th&gt;quality&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;count&lt;/th&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;td&gt;7.215307&lt;/td&gt;
      &lt;td&gt;0.339666&lt;/td&gt;
      &lt;td&gt;0.318633&lt;/td&gt;
      &lt;td&gt;5.443235&lt;/td&gt;
      &lt;td&gt;0.056034&lt;/td&gt;
      &lt;td&gt;30.525319&lt;/td&gt;
      &lt;td&gt;115.744574&lt;/td&gt;
      &lt;td&gt;0.994697&lt;/td&gt;
      &lt;td&gt;3.218501&lt;/td&gt;
      &lt;td&gt;0.531268&lt;/td&gt;
      &lt;td&gt;10.491801&lt;/td&gt;
      &lt;td&gt;5.818378&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;std&lt;/th&gt;
      &lt;td&gt;1.296434&lt;/td&gt;
      &lt;td&gt;0.164636&lt;/td&gt;
      &lt;td&gt;0.145318&lt;/td&gt;
      &lt;td&gt;4.757804&lt;/td&gt;
      &lt;td&gt;0.035034&lt;/td&gt;
      &lt;td&gt;17.749400&lt;/td&gt;
      &lt;td&gt;56.521855&lt;/td&gt;
      &lt;td&gt;0.002999&lt;/td&gt;
      &lt;td&gt;0.160787&lt;/td&gt;
      &lt;td&gt;0.148806&lt;/td&gt;
      &lt;td&gt;1.192712&lt;/td&gt;
      &lt;td&gt;0.873255&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;min&lt;/th&gt;
      &lt;td&gt;3.800000&lt;/td&gt;
      &lt;td&gt;0.080000&lt;/td&gt;
      &lt;td&gt;0.000000&lt;/td&gt;
      &lt;td&gt;0.600000&lt;/td&gt;
      &lt;td&gt;0.009000&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;6.000000&lt;/td&gt;
      &lt;td&gt;0.987110&lt;/td&gt;
      &lt;td&gt;2.720000&lt;/td&gt;
      &lt;td&gt;0.220000&lt;/td&gt;
      &lt;td&gt;8.000000&lt;/td&gt;
      &lt;td&gt;3.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;25%&lt;/th&gt;
      &lt;td&gt;6.400000&lt;/td&gt;
      &lt;td&gt;0.230000&lt;/td&gt;
      &lt;td&gt;0.250000&lt;/td&gt;
      &lt;td&gt;1.800000&lt;/td&gt;
      &lt;td&gt;0.038000&lt;/td&gt;
      &lt;td&gt;17.000000&lt;/td&gt;
      &lt;td&gt;77.000000&lt;/td&gt;
      &lt;td&gt;0.992340&lt;/td&gt;
      &lt;td&gt;3.110000&lt;/td&gt;
      &lt;td&gt;0.430000&lt;/td&gt;
      &lt;td&gt;9.500000&lt;/td&gt;
      &lt;td&gt;5.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;50%&lt;/th&gt;
      &lt;td&gt;7.000000&lt;/td&gt;
      &lt;td&gt;0.290000&lt;/td&gt;
      &lt;td&gt;0.310000&lt;/td&gt;
      &lt;td&gt;3.000000&lt;/td&gt;
      &lt;td&gt;0.047000&lt;/td&gt;
      &lt;td&gt;29.000000&lt;/td&gt;
      &lt;td&gt;118.000000&lt;/td&gt;
      &lt;td&gt;0.994890&lt;/td&gt;
      &lt;td&gt;3.210000&lt;/td&gt;
      &lt;td&gt;0.510000&lt;/td&gt;
      &lt;td&gt;10.300000&lt;/td&gt;
      &lt;td&gt;6.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;75%&lt;/th&gt;
      &lt;td&gt;7.700000&lt;/td&gt;
      &lt;td&gt;0.400000&lt;/td&gt;
      &lt;td&gt;0.390000&lt;/td&gt;
      &lt;td&gt;8.100000&lt;/td&gt;
      &lt;td&gt;0.065000&lt;/td&gt;
      &lt;td&gt;41.000000&lt;/td&gt;
      &lt;td&gt;156.000000&lt;/td&gt;
      &lt;td&gt;0.996990&lt;/td&gt;
      &lt;td&gt;3.320000&lt;/td&gt;
      &lt;td&gt;0.600000&lt;/td&gt;
      &lt;td&gt;11.300000&lt;/td&gt;
      &lt;td&gt;6.000000&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;max&lt;/th&gt;
      &lt;td&gt;15.900000&lt;/td&gt;
      &lt;td&gt;1.580000&lt;/td&gt;
      &lt;td&gt;1.660000&lt;/td&gt;
      &lt;td&gt;65.800000&lt;/td&gt;
      &lt;td&gt;0.611000&lt;/td&gt;
      &lt;td&gt;289.000000&lt;/td&gt;
      &lt;td&gt;440.000000&lt;/td&gt;
      &lt;td&gt;1.038980&lt;/td&gt;
      &lt;td&gt;4.010000&lt;/td&gt;
      &lt;td&gt;2.000000&lt;/td&gt;
      &lt;td&gt;14.900000&lt;/td&gt;
      &lt;td&gt;9.000000&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Looking at the data, we can see a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Since our data contains categorical variables (&lt;strong&gt;winecolor&lt;/strong&gt;), we can use one-hot encoding to transform the categorical variables into binary variables&lt;/li&gt;
&lt;li&gt;We can standardize the data by transforming each feature to have zero mean and unit standard deviation, i.e. centre it around zero&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Transform the data into a usable format&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have an idea of what our data looks like, we can use the &lt;code&gt;pandas.get_dummies&lt;/code&gt; function to transform the categorical variable into binary variables and then drop the original categorical column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;winecolor_encoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;winecolor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;winecolor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;winecolor_encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_list&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;winecolor_encoded&lt;/span&gt;
&lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;winecolor&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we can standardize the data by subtracting the mean and dividing by the standard deviation, so that each column is centred around zero with a standard deviation of 1. Instead of using &lt;code&gt;sklearn.preprocessing.StandardScaler&lt;/code&gt;, we will apply this z-score normalization (also known as standardization) manually.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;
        &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
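As a sanity check, z-scoring any numeric column this way yields (near-)zero mean and unit standard deviation. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical sample values for a single column
s = pd.Series([8.8, 9.5, 10.1, 9.9])

# Same z-score transform as the loop above applies per column
z = (s - s.mean()) / s.std()

assert abs(z.mean()) < 1e-9        # centred around zero
assert abs(z.std() - 1.0) < 1e-9   # unit standard deviation
```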



&lt;p&gt;After transforming the data, we can now take a look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;fixed acidity&lt;/th&gt;
      &lt;th&gt;volatile acidity&lt;/th&gt;
      &lt;th&gt;citric acid&lt;/th&gt;
      &lt;th&gt;residual sugar&lt;/th&gt;
      &lt;th&gt;chlorides&lt;/th&gt;
      &lt;th&gt;free sulfur dioxide&lt;/th&gt;
      &lt;th&gt;total sulfur dioxide&lt;/th&gt;
      &lt;th&gt;density&lt;/th&gt;
      &lt;th&gt;pH&lt;/th&gt;
      &lt;th&gt;sulphates&lt;/th&gt;
      &lt;th&gt;alcohol&lt;/th&gt;
      &lt;th&gt;quality&lt;/th&gt;
      &lt;th&gt;winecolor_red&lt;/th&gt;
      &lt;th&gt;winecolor_white&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;-0.166076&lt;/td&gt;
      &lt;td&gt;-0.423150&lt;/td&gt;
      &lt;td&gt;0.284664&lt;/td&gt;
      &lt;td&gt;3.206682&lt;/td&gt;
      &lt;td&gt;-0.314951&lt;/td&gt;
      &lt;td&gt;0.815503&lt;/td&gt;
      &lt;td&gt;0.959902&lt;/td&gt;
      &lt;td&gt;2.102052&lt;/td&gt;
      &lt;td&gt;-1.358944&lt;/td&gt;
      &lt;td&gt;-0.546136&lt;/td&gt;
      &lt;td&gt;-1.418449&lt;/td&gt;
      &lt;td&gt;0.207983&lt;/td&gt;
      &lt;td&gt;-0.571323&lt;/td&gt;
      &lt;td&gt;0.571323&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;-0.706019&lt;/td&gt;
      &lt;td&gt;-0.240931&lt;/td&gt;
      &lt;td&gt;0.147035&lt;/td&gt;
      &lt;td&gt;-0.807775&lt;/td&gt;
      &lt;td&gt;-0.200775&lt;/td&gt;
      &lt;td&gt;-0.931035&lt;/td&gt;
      &lt;td&gt;0.287595&lt;/td&gt;
      &lt;td&gt;-0.232314&lt;/td&gt;
      &lt;td&gt;0.506876&lt;/td&gt;
      &lt;td&gt;-0.277330&lt;/td&gt;
      &lt;td&gt;-0.831551&lt;/td&gt;
      &lt;td&gt;0.207983&lt;/td&gt;
      &lt;td&gt;-0.571323&lt;/td&gt;
      &lt;td&gt;0.571323&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;0.682405&lt;/td&gt;
      &lt;td&gt;-0.362411&lt;/td&gt;
      &lt;td&gt;0.559923&lt;/td&gt;
      &lt;td&gt;0.306184&lt;/td&gt;
      &lt;td&gt;-0.172231&lt;/td&gt;
      &lt;td&gt;-0.029596&lt;/td&gt;
      &lt;td&gt;-0.331634&lt;/td&gt;
      &lt;td&gt;0.134515&lt;/td&gt;
      &lt;td&gt;0.258100&lt;/td&gt;
      &lt;td&gt;-0.613338&lt;/td&gt;
      &lt;td&gt;-0.328496&lt;/td&gt;
      &lt;td&gt;0.207983&lt;/td&gt;
      &lt;td&gt;-0.571323&lt;/td&gt;
      &lt;td&gt;0.571323&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;-0.011807&lt;/td&gt;
      &lt;td&gt;-0.666110&lt;/td&gt;
      &lt;td&gt;0.009405&lt;/td&gt;
      &lt;td&gt;0.642474&lt;/td&gt;
      &lt;td&gt;0.056121&lt;/td&gt;
      &lt;td&gt;0.928182&lt;/td&gt;
      &lt;td&gt;1.242978&lt;/td&gt;
      &lt;td&gt;0.301255&lt;/td&gt;
      &lt;td&gt;-0.177258&lt;/td&gt;
      &lt;td&gt;-0.882144&lt;/td&gt;
      &lt;td&gt;-0.496181&lt;/td&gt;
      &lt;td&gt;0.207983&lt;/td&gt;
      &lt;td&gt;-0.571323&lt;/td&gt;
      &lt;td&gt;0.571323&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;-0.011807&lt;/td&gt;
      &lt;td&gt;-0.666110&lt;/td&gt;
      &lt;td&gt;0.009405&lt;/td&gt;
      &lt;td&gt;0.642474&lt;/td&gt;
      &lt;td&gt;0.056121&lt;/td&gt;
      &lt;td&gt;0.928182&lt;/td&gt;
      &lt;td&gt;1.242978&lt;/td&gt;
      &lt;td&gt;0.301255&lt;/td&gt;
      &lt;td&gt;-0.177258&lt;/td&gt;
      &lt;td&gt;-0.882144&lt;/td&gt;
      &lt;td&gt;-0.496181&lt;/td&gt;
      &lt;td&gt;0.207983&lt;/td&gt;
      &lt;td&gt;-0.571323&lt;/td&gt;
      &lt;td&gt;0.571323&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Then check what the data looks like after normalization:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.dataframe tbody tr th:only-of-type {
    vertical-align: middle;
}

.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;fixed acidity&lt;/th&gt;
      &lt;th&gt;volatile acidity&lt;/th&gt;
      &lt;th&gt;citric acid&lt;/th&gt;
      &lt;th&gt;residual sugar&lt;/th&gt;
      &lt;th&gt;chlorides&lt;/th&gt;
      &lt;th&gt;free sulfur dioxide&lt;/th&gt;
      &lt;th&gt;total sulfur dioxide&lt;/th&gt;
      &lt;th&gt;density&lt;/th&gt;
      &lt;th&gt;pH&lt;/th&gt;
      &lt;th&gt;sulphates&lt;/th&gt;
      &lt;th&gt;alcohol&lt;/th&gt;
      &lt;th&gt;quality&lt;/th&gt;
      &lt;th&gt;winecolor_red&lt;/th&gt;
      &lt;th&gt;winecolor_white&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;count&lt;/th&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6497.000000&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
      &lt;td&gt;6.497000e+03&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;mean&lt;/th&gt;
      &lt;td&gt;2.099803e-16&lt;/td&gt;
      &lt;td&gt;-2.449770e-16&lt;/td&gt;
      &lt;td&gt;3.499672e-17&lt;/td&gt;
      &lt;td&gt;3.499672e-17&lt;/td&gt;
      &lt;td&gt;-3.499672e-17&lt;/td&gt;
      &lt;td&gt;-8.749179e-17&lt;/td&gt;
      &lt;td&gt;0.000000&lt;/td&gt;
      &lt;td&gt;-3.517170e-15&lt;/td&gt;
      &lt;td&gt;2.720995e-15&lt;/td&gt;
      &lt;td&gt;2.099803e-16&lt;/td&gt;
      &lt;td&gt;-8.399212e-16&lt;/td&gt;
      &lt;td&gt;-2.821610e-16&lt;/td&gt;
      &lt;td&gt;-3.499672e-17&lt;/td&gt;
      &lt;td&gt;1.749836e-16&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;std&lt;/th&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
      &lt;td&gt;1.000000e+00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;min&lt;/th&gt;
      &lt;td&gt;-2.634386e+00&lt;/td&gt;
      &lt;td&gt;-1.577208e+00&lt;/td&gt;
      &lt;td&gt;-2.192664e+00&lt;/td&gt;
      &lt;td&gt;-1.017956e+00&lt;/td&gt;
      &lt;td&gt;-1.342536e+00&lt;/td&gt;
      &lt;td&gt;-1.663455e+00&lt;/td&gt;
      &lt;td&gt;-1.941631&lt;/td&gt;
      &lt;td&gt;-2.529997e+00&lt;/td&gt;
      &lt;td&gt;-3.100376e+00&lt;/td&gt;
      &lt;td&gt;-2.091774e+00&lt;/td&gt;
      &lt;td&gt;-2.089189e+00&lt;/td&gt;
      &lt;td&gt;-3.227439e+00&lt;/td&gt;
      &lt;td&gt;-5.713226e-01&lt;/td&gt;
      &lt;td&gt;-1.750055e+00&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;25%&lt;/th&gt;
      &lt;td&gt;-6.288845e-01&lt;/td&gt;
      &lt;td&gt;-6.661100e-01&lt;/td&gt;
      &lt;td&gt;-4.722972e-01&lt;/td&gt;
      &lt;td&gt;-7.657389e-01&lt;/td&gt;
      &lt;td&gt;-5.147590e-01&lt;/td&gt;
      &lt;td&gt;-7.620156e-01&lt;/td&gt;
      &lt;td&gt;-0.685480&lt;/td&gt;
      &lt;td&gt;-7.858922e-01&lt;/td&gt;
      &lt;td&gt;-6.748102e-01&lt;/td&gt;
      &lt;td&gt;-6.805395e-01&lt;/td&gt;
      &lt;td&gt;-8.315512e-01&lt;/td&gt;
      &lt;td&gt;-9.371575e-01&lt;/td&gt;
      &lt;td&gt;-5.713226e-01&lt;/td&gt;
      &lt;td&gt;5.713226e-01&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;50%&lt;/th&gt;
      &lt;td&gt;-1.660764e-01&lt;/td&gt;
      &lt;td&gt;-3.016707e-01&lt;/td&gt;
      &lt;td&gt;-5.940918e-02&lt;/td&gt;
      &lt;td&gt;-5.135217e-01&lt;/td&gt;
      &lt;td&gt;-2.578628e-01&lt;/td&gt;
      &lt;td&gt;-8.593639e-02&lt;/td&gt;
      &lt;td&gt;0.039904&lt;/td&gt;
      &lt;td&gt;6.448391e-02&lt;/td&gt;
      &lt;td&gt;-5.287017e-02&lt;/td&gt;
      &lt;td&gt;-1.429263e-01&lt;/td&gt;
      &lt;td&gt;-1.608107e-01&lt;/td&gt;
      &lt;td&gt;2.079830e-01&lt;/td&gt;
      &lt;td&gt;-5.713226e-01&lt;/td&gt;
      &lt;td&gt;5.713226e-01&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;75%&lt;/th&gt;
      &lt;td&gt;3.738663e-01&lt;/td&gt;
      &lt;td&gt;3.664680e-01&lt;/td&gt;
      &lt;td&gt;4.911081e-01&lt;/td&gt;
      &lt;td&gt;5.584015e-01&lt;/td&gt;
      &lt;td&gt;2.559297e-01&lt;/td&gt;
      &lt;td&gt;5.901428e-01&lt;/td&gt;
      &lt;td&gt;0.712210&lt;/td&gt;
      &lt;td&gt;7.647937e-01&lt;/td&gt;
      &lt;td&gt;6.312639e-01&lt;/td&gt;
      &lt;td&gt;4.618885e-01&lt;/td&gt;
      &lt;td&gt;6.776148e-01&lt;/td&gt;
      &lt;td&gt;2.079830e-01&lt;/td&gt;
      &lt;td&gt;-5.713226e-01&lt;/td&gt;
      &lt;td&gt;5.713226e-01&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;max&lt;/th&gt;
      &lt;td&gt;6.698910e+00&lt;/td&gt;
      &lt;td&gt;7.533774e+00&lt;/td&gt;
      &lt;td&gt;9.230570e+00&lt;/td&gt;
      &lt;td&gt;1.268585e+01&lt;/td&gt;
      &lt;td&gt;1.584097e+01&lt;/td&gt;
      &lt;td&gt;1.456245e+01&lt;/td&gt;
      &lt;td&gt;5.736815&lt;/td&gt;
      &lt;td&gt;1.476765e+01&lt;/td&gt;
      &lt;td&gt;4.922650e+00&lt;/td&gt;
      &lt;td&gt;9.870119e+00&lt;/td&gt;
      &lt;td&gt;3.695947e+00&lt;/td&gt;
      &lt;td&gt;3.643405e+00&lt;/td&gt;
      &lt;td&gt;1.750055e+00&lt;/td&gt;
      &lt;td&gt;5.713226e-01&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Load the data into a database&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If we are happy with the results, then we can load both dataframes into our database. Since we do not have any tables in our database and our dataset is small, we can get away with using the &lt;code&gt;.to_sql&lt;/code&gt; method to write each dataframe to a table in the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;raw_table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;raw_wine_quality_dataset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;transformed_table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;clean_wine_quality_dataset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;df_transformed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed_table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create two tables in our database, namely &lt;code&gt;raw_wine_quality_dataset&lt;/code&gt; and &lt;code&gt;clean_wine_quality_dataset&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For a sanity check, we can verify that both tables were successfully written to the database using the following helper function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_table_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;inspect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_table_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; exists in the DB!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; does not exist in the DB!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;check_table_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;check_table_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;transformed_table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
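&lt;p&gt;To convince yourself the helper behaves as expected without touching Postgres, here is a throwaway run against an in-memory SQLite engine (the engine choice is only for this sketch; the helper itself is unchanged):&lt;/p&gt;

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

def check_table_exists(table_name, engine):
    if table_name in inspect(engine).get_table_names():
        print(f"{table_name!r} exists in the DB!")
    else:
        print(f"{table_name} does not exist in the DB!")

# Throwaway in-memory SQLite engine standing in for our Postgres instance.
engine = create_engine("sqlite://")
pd.DataFrame({"quality": [5, 6]}).to_sql("raw_wine_quality_dataset", engine, index=False)

check_table_exists("raw_wine_quality_dataset", engine)    # exists
check_table_exists("clean_wine_quality_dataset", engine)  # not written yet
```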



&lt;p&gt;Well, that was a lot of work! As a final check, we can use the &lt;code&gt;.read_sql&lt;/code&gt; method to read the data back from the database; if needed, the &lt;code&gt;.drop_duplicates&lt;/code&gt; method can then remove any duplicate rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;raw_table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;transformed_table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
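&lt;p&gt;As a minimal, self-contained sketch of that round trip (again substituting an in-memory SQLite engine for our Postgres instance, with made-up values and a deliberate duplicate row):&lt;/p&gt;

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the Postgres database in this sketch.
engine = create_engine("sqlite://")

# A tiny frame with one deliberate duplicate row.
df = pd.DataFrame({"fixed_acidity": [7.4, 7.4, 7.8], "quality": [5, 5, 6]})
df.to_sql("raw_wine_quality_dataset", engine, if_exists="replace", index=False)

# Read the table back, then drop the duplicate row.
round_trip = pd.read_sql("SELECT * FROM raw_wine_quality_dataset", engine)
deduped = round_trip.drop_duplicates()
print(len(round_trip), len(deduped))
```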



&lt;p&gt;Well done, we successfully wrote our data into the database. Our ETL pipeline is now complete; the only thing left to do is to make it repeatable via Airflow.&lt;/p&gt;

&lt;h3&gt;
  
  
  Airflow ETL Pipeline
&lt;/h3&gt;

&lt;p&gt;Now that we have a working ETL pipeline, we can start building the Airflow DAG that will run it.&lt;/p&gt;

&lt;p&gt;We can reuse our Jupyter notebook and ensure that the DAG is written to a file as a Python script by using the magic command &lt;code&gt;%%writefile dags/simple_etl_dag.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Import the necessary libraries&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, we need to import the necessary libraries. To create a DAG in Airflow, you always have to import the &lt;code&gt;DAG&lt;/code&gt; class from &lt;code&gt;airflow.models&lt;/code&gt;. Then import the &lt;code&gt;PythonOperator&lt;/code&gt; (since we will be executing Python logic) and, finally, &lt;code&gt;days_ago&lt;/code&gt; to get a &lt;code&gt;datetime&lt;/code&gt; object representing &lt;code&gt;n&lt;/code&gt; days ago.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wraps&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DAG&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.utils.dates&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;days_ago&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.operators.python&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PythonOperator&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dotenv_values&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inspect&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Create a DAG object&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After importing the necessary libraries, we can create a DAG object using the &lt;code&gt;DAG&lt;/code&gt; class from &lt;code&gt;airflow.models&lt;/code&gt;. A DAG object must have a &lt;code&gt;dag_id&lt;/code&gt;, a &lt;code&gt;schedule_interval&lt;/code&gt;, and a &lt;code&gt;start_date&lt;/code&gt;. The &lt;code&gt;dag_id&lt;/code&gt; is the unique name of the DAG, the &lt;code&gt;schedule_interval&lt;/code&gt; is the interval at which the DAG will be executed (setting it to &lt;code&gt;None&lt;/code&gt;, as we do below, means the DAG only runs when triggered manually), and the &lt;code&gt;start_date&lt;/code&gt; is the date from which the DAG starts being scheduled. We can also pass a &lt;code&gt;default_args&lt;/code&gt; parameter, a dictionary of default arguments that may include owner information, a description, and a default &lt;code&gt;start_date&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Airflow&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;days_ago&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;

&lt;span class="n"&gt;dag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DAG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dag_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple_etl_dag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schedule_interval&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Define a logging function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For the sake of simplicity, we will create a simple logging decorator that logs the execution of each task function, using print statements of course.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timezone&lt;/span&gt;

    &lt;span class="nd"&gt;@wraps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inner&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;called_at&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;gt;&amp;gt; Running &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; function. Logged at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;called_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;to_execute&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;&amp;gt;&amp;gt; Function: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; executed. Logged at &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;called_at&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;to_execute&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inner&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
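&lt;p&gt;To see what the decorator buys us outside Airflow, here is a quick sanity check (re-declaring a minimal version of &lt;code&gt;logger&lt;/code&gt; so the snippet runs on its own):&lt;/p&gt;

```python
from functools import wraps
from datetime import datetime, timezone

def logger(fn):
    @wraps(fn)
    def inner(*args, **kwargs):
        called_at = datetime.now(timezone.utc)
        print(f">>> Running {fn.__name__!r} function. Logged at {called_at}")
        result = fn(*args, **kwargs)
        print(f">>> Function: {fn.__name__!r} executed. Logged at {called_at}")
        return result
    return inner

@logger
def add(a, b):
    return a + b

result = add(2, 3)  # prints the two log lines around the call
print(result)
```

Thanks to &lt;code&gt;@wraps&lt;/code&gt;, the decorated function keeps its original &lt;code&gt;__name__&lt;/code&gt;, which is what shows up in the log lines.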



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149873808-29d79bc3-714f-42cd-ae5e-2b2da6189cda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149873808-29d79bc3-714f-42cd-ae5e-2b2da6189cda.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Create an ETL function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We will refactor the ETL pipeline we defined above into functions that can be called by the DAG, using the &lt;code&gt;logger&lt;/code&gt; decorator to log each function's execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;DATASET_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://gist.githubusercontent.com/mmphego/5b6fc4d6dc3c8fba4fce9d994a2fe16b/raw/ab5df0e76812e13df5b31e466a5fb787fac0599a/wine_quality.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="n"&gt;CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;dotenv_values&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;CONFIG&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;


&lt;span class="nd"&gt;@logger&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;connect_db&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Connecting to DB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;connection_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql+psycopg2://{}:{}@{}:{}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POSTGRES_USER&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POSTGRES_PASSWORD&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POSTGRES_HOST&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;CONFIG&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;POSTGRES_PORT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;connection_uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pool_pre_ping&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;


&lt;span class="nd"&gt;@logger&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_url&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Reading dataset from &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;dataset_url&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dataset_url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;


&lt;span class="nd"&gt;@logger&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# transformation
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Transforming data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_transform&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;winecolor_encoded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winecolor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winecolor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;winecolor_encoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_list&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;winecolor_encoded&lt;/span&gt;
    &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;winecolor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;df_transform&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;

&lt;span class="nd"&gt;@logger&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;check_table_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;inspect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;get_table_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;!r}&lt;/span&gt;&lt;span class="s"&gt; exists in the DB!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; does not exist in the DB!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="nd"&gt;@logger&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_to_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading dataframe to DB on table: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;replace&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;check_table_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@logger&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tables_exists&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;db_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;connect_db&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checking if tables exists&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;check_table_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_wine_quality_dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;check_table_exists&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clean_wine_quality_dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;db_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispose&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@logger&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;etl&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;db_engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;connect_db&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;raw_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;DATASET_URL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;raw_table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;raw_wine_quality_dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;clean_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;clean_table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;clean_wine_quality_dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="nf"&gt;load_to_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw_table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;load_to_db&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clean_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clean_table_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;db_engine&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;db_engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dispose&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
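&lt;p&gt;As a standalone sketch of what &lt;code&gt;transform&lt;/code&gt; does (one-hot encoding the wine colour, then z-scoring every column), here is the same logic run on a tiny made-up frame; the values are illustrative only:&lt;/p&gt;

```python
import pandas as pd

# A miniature stand-in for the wine dataset (values are made up).
df = pd.DataFrame({
    "fixed_acidity": [7.4, 7.8, 6.9, 7.1],
    "winecolor": ["red", "white", "red", "white"],
})

# One-hot encode the wine colour, exactly as `transform` does.
out = df.copy()
encoded = pd.get_dummies(out["winecolor"], prefix="winecolor")
out[encoded.columns.to_list()] = encoded
out.drop("winecolor", axis=1, inplace=True)

# Z-score every remaining column: mean 0, standard deviation 1.
for column in out.columns:
    out[column] = (out[column] - out[column].mean()) / out[column].std()

print(sorted(out.columns))
```

After the loop, every column has mean 0 and standard deviation 1, which is the standardisation applied to the cleaned dataset.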



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149874064-681a3465-353a-4a16-9ae3-11df9cc40a2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149874064-681a3465-353a-4a16-9ae3-11df9cc40a2c.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Create a PythonOperator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have our ETL function defined, we can create a PythonOperator that will execute the ETL and data-verification functions. A best practice is to define tasks inside a context manager, which removes the need to pass &lt;code&gt;dag=dag&lt;/code&gt; to every task; forgetting that argument is a common source of Airflow errors.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;run_etl_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_etl_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;etl&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;run_tables_exists_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PythonOperator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;task_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_tables_exists_task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;python_callable&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tables_exists&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;run_etl_task&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;run_tables_exists_task&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
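&lt;p&gt;Under the hood, the &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; operator simply records a dependency edge between tasks, and Airflow then executes tasks in an order that respects those edges. Here is a toy, pure-Python sketch of that idea (illustrative only, not the Airflow API):&lt;/p&gt;

```python
# Toy sketch (not Airflow): '>>' on a task records a dependency edge,
# and tasks run in an order that respects those edges.
class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []

    def __rshift__(self, other):
        # left >> right means "left runs before right"
        other.upstream.append(self)
        return other


def execution_order(tasks):
    # Simple depth-first topological sort over the recorded edges.
    ordered, seen = [], set()

    def visit(task):
        if task.task_id in seen:
            return
        for dep in task.upstream:
            visit(dep)
        seen.add(task.task_id)
        ordered.append(task.task_id)

    for task in tasks:
        visit(task)
    return ordered


run_etl_task = Task("run_etl_task")
run_tables_exists_task = Task("run_tables_exists_task")
run_etl_task >> run_tables_exists_task

print(execution_order([run_tables_exists_task, run_etl_task]))
# ['run_etl_task', 'run_tables_exists_task']
```

&lt;p&gt;Even though the tasks are passed in the "wrong" order, the ETL task is scheduled first because of the recorded edge.&lt;/p&gt;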



&lt;p&gt;That's it! Now, we can head out to the Airflow UI and check if our DAG was created successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149873377-c6cc7601-deed-4ac7-9b2f-e1e1bb31bac3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149873377-c6cc7601-deed-4ac7-9b2f-e1e1bb31bac3.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6: Run the DAG&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After we log in to the Airflow UI, we should notice that the DAG was created successfully. You should see an image similar to the one below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149836105-d6956b68-7379-46c8-b0c2-0487e64a3e58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149836105-d6956b68-7379-46c8-b0c2-0487e64a3e58.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we are happy with the DAG, we can run it by clicking the green play button and selecting &lt;strong&gt;&lt;em&gt;Trigger DAG&lt;/em&gt;&lt;/strong&gt;. This starts the DAG execution.&lt;/p&gt;

&lt;p&gt;Let's open the last successful run of the DAG and look at the logs. The image below shows the graph representation of the DAG.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149837213-6f97cdfd-a899-4e74-89b9-0be565b7fa1f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149837213-6f97cdfd-a899-4e74-89b9-0be565b7fa1f.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It looks like the DAG was executed successfully; everything is green!&lt;br&gt;
Now we can check the logs to see the execution of the ETL function by clicking on an individual task and then selecting the &lt;strong&gt;&lt;em&gt;Logs&lt;/em&gt;&lt;/strong&gt; tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149837366-9623c611-be97-473e-83b0-3c37b98c5d32.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149837366-9623c611-be97-473e-83b0-3c37b98c5d32.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The logs show that the ETL function was executed successfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149837428-29fe2ee4-1ee6-4ac2-9ec4-0e7076c23be3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149837428-29fe2ee4-1ee6-4ac2-9ec4-0e7076c23be3.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That concludes the walk-through. If you have gotten this far, I hope you enjoyed this post and found it useful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149874583-57f6e795-2e62-468b-bd84-a2d8fdb89e3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuser-images.githubusercontent.com%2F7910856%2F149874583-57f6e795-2e62-468b-bd84-a2d8fdb89e3c.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this post, we covered the basics of creating your very own ETL pipeline: running multiple interconnected Docker containers, data manipulation and feature-engineering techniques, simple ways of reading and writing data to a database, and finally, creating a DAG in Airflow. This has been a great learning experience, and I hope you found this post useful. In the next post, I will explore a less tedious way of creating an ETL pipeline using AWS services, so stick around to learn more!&lt;/p&gt;

&lt;p&gt;FYI, it took me a week to write this post, mostly spent getting a better understanding of Docker networking, Postgres fundamentals, the Airflow ecosystem, and DAG creation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html" rel="noopener noreferrer"&gt;Running Airflow in Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/analytics-vidhya/building-a-etl-pipeline-226656a22f6d" rel="noopener noreferrer"&gt;Building a ETL pipeline&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://machinelearninghd.com/wine-quality-dataset-machine-learning/" rel="noopener noreferrer"&gt;Wine Quality Dataset Modelling&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f" rel="noopener noreferrer"&gt;Airflow: Lesser Known Tips, Tricks, and Best Practises&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://marclamberti.com/blog/airflow-dag-creating-your-first-dag-in-5-minutes/" rel="noopener noreferrer"&gt;Airflow DAG: Creating your first DAG in 5 minutes&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>etl</category>
      <category>docker</category>
      <category>python</category>
      <category>airflow</category>
    </item>
    <item>
      <title>How To Configure Distributed Tracing With Jaeger On Kubernetes Cluster</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Sun, 26 Sep 2021 11:47:50 +0000</pubDate>
      <link>https://dev.to/mmphego/how-to-configure-distributed-tracing-with-jaeger-on-kubernetes-cluster-4d03</link>
      <guid>https://dev.to/mmphego/how-to-configure-distributed-tracing-with-jaeger-on-kubernetes-cluster-4d03</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D6c94kVE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-09-26-How-to-configure-distributed-tracing-with-Jaeger-on-kubernetes-cluster.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D6c94kVE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-09-26-How-to-configure-distributed-tracing-with-Jaeger-on-kubernetes-cluster.png" alt="imag" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;In the previous article, titled &lt;a href="https://blog.mphomphego.co.za/blog/2021/07/25/How-to-configure-Jaeger-Data-source-on-Grafana-and-debug-network-issues-with-Bind-utilities.html"&gt;How To Configure Jaeger Data Source On Grafana And Debug Network Issues With Bind-utilities&lt;/a&gt;, I described how to configure Jaeger as a data source in Grafana, but I did not go into the details of how to use Jaeger tracing in an application.&lt;/p&gt;

&lt;p&gt;If you did not read the &lt;a href="https://blog.mphomphego.co.za/blog/2021/07/25/How-to-configure-Jaeger-Data-source-on-Grafana-and-debug-network-issues-with-Bind-utilities.html"&gt;previous article&lt;/a&gt;, please do so now before we go down the rabbit hole.&lt;/p&gt;

&lt;p&gt;To understand this better, let's quickly revisit the not-so-distant past, when most applications were built as single, self-contained monolithic systems that executed operations down a fairly clear path, as shown in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8zX8TPKD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134979537-5d36a727-319f-42fc-ab66-bcb0bcc9afb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8zX8TPKD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134979537-5d36a727-319f-42fc-ab66-bcb0bcc9afb9.png" alt="image" width="880" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the user sends a request, which is received by a load balancer, routed to the monolithic application, and finally reaches the database. Since we want to know the request latency, we also trace the request on its way back. Because the monolith has all the application services bundled into one block, this is a good example of &lt;strong&gt;monolithic tracing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Now if we consider &lt;strong&gt;distributed tracing&lt;/strong&gt; where we use microservices and in which they are all decoupled, the transaction path will be very different. The transaction occurs across several distributed services, this is illustrated in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dHoijq89--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134979570-64f430d0-155c-48b6-9ae7-9f1e87d1cdb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dHoijq89--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134979570-64f430d0-155c-48b6-9ae7-9f1e87d1cdb8.png" alt="image" width="880" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, the user would send a request which would be received by a load balancer. But in this case, we don't have a monolithic application.&lt;br&gt;
We have a whole set of microservices. The question is now,&lt;br&gt;
how do we trace through these distributed services?&lt;/p&gt;

&lt;p&gt;Well, &lt;strong&gt;distributed tracing&lt;/strong&gt; allows us to follow the request as it goes&lt;br&gt;
through the various services to the database and of course, the trip back.&lt;br&gt;
From the image, you may notice that not every service was hit because, for that specific request, it probably didn't need those other two services.&lt;/p&gt;

&lt;p&gt;This is a very common scenario. With distributed tracing we can follow the request through the various services; even though the microservices are completely separate, we still get relevant latency information.&lt;/p&gt;

&lt;p&gt;For this post, we will use Jaeger, a distributed tracing system that starts collecting data as soon as a request is initiated. This triggers the creation of a special &lt;strong&gt;trace&lt;/strong&gt; ID and the initial &lt;strong&gt;span&lt;/strong&gt; (the parent span).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datadoghq.com/knowledge-center/distributed-tracing/"&gt;Datadog&lt;/a&gt; details how distributed tracing works perfectly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;End-to-end distributed tracing platforms start collecting data as soon as a request is initiated, such as when a user fills out a form on a website. This causes the tracing platform to generate a unique trace ID and an initial span, known as the parent span. A trace represents the entire execution path of the request, with each span representing a single unit of work along the way, such as an API call or database query. A top-level child span is created whenever a request enters a service. If the request contained multiple commands or queries within the same service, the top-level child span may act as a parent to child spans nested beneath it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A hierarchical bar chart is frequently used to visualize traces. A distributed trace illustrates the dependencies and durations of distinct microservices processing the request, similar to how Gantt charts represent subtask dependencies and durations in a project.&lt;br&gt;
This is illustrated in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2B1MzD1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134985080-ccbebf16-275c-47d1-9a89-e809146c1f89.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2B1MzD1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134985080-ccbebf16-275c-47d1-9a89-e809146c1f89.png" alt="image" width="880" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To understand what spans and traces are, let's look at the definitions as described by &lt;a href="https://opentracing.io/docs/overview/"&gt;opentracing&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Trace&lt;/strong&gt;: The description of a transaction as it moves through a distributed system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Span&lt;/strong&gt;: A named, timed operation representing a piece of the workflow. Spans accept key: value tags as well as fine-grained, timestamped, structured logs attached to the particular span instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Span context&lt;/strong&gt;: Trace information that accompanies the distributed transaction, including when it passes the service to service over the network or through a message bus. The span context contains the trace identifier, span identifier, and any other data that the tracing system needs to propagate to the downstream service.&lt;/li&gt;
&lt;/ul&gt;
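&lt;p&gt;To make those definitions concrete, here is a toy, dependency-free Python sketch of a trace with a parent span and nested child spans. This is illustrative only; a real application would use a Jaeger or OpenTracing client library rather than hand-rolled classes like these.&lt;/p&gt;

```python
import time
import uuid

# Toy tracer (illustrative only, not the Jaeger client): a trace is a tree
# of timed spans that all share one trace ID.
class Span:
    def __init__(self, name, trace_id, parent=None):
        self.name = name
        self.trace_id = trace_id  # shared by every span in the trace
        self.parent = parent      # None for the root (parent) span
        self.tags = {}
        self.start = self.end = None

    def __enter__(self):
        self.start = time.monotonic()
        return self

    def __exit__(self, *exc):
        self.end = time.monotonic()

    def child(self, name):
        # Child spans inherit the trace ID: this mirrors the "span context"
        # that gets propagated across service boundaries.
        return Span(name, self.trace_id, parent=self)


def start_trace(name):
    return Span(name, trace_id=uuid.uuid4().hex)


with start_trace("http-request") as root:
    with root.child("auth-service") as auth:
        auth.tags["user"] = "demo"
    with root.child("db-query") as db:
        db.tags["table"] = "rides"
```

&lt;p&gt;After this runs, every span carries the same trace ID, each span records its own duration, and the child spans point back at the parent span, which is exactly the hierarchy the Jaeger UI draws as a bar chart.&lt;/p&gt;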

&lt;p&gt;Before we go deeper into the details of how to use Jaeger, read the &lt;a href="https://www.jaegertracing.io/docs"&gt;Jaeger docs&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Back to the reason I started this blog post, before I go deeper down the rabbit hole.&lt;br&gt;
This post will detail how to deploy a demo application called Hot R.O.D (Rides on Demand), which consists of several microservices and illustrates the use of the OpenTracing API.&lt;br&gt;
It will be deployed in a k3s cluster with a Jaeger backend to view the traces. Read more about the app &lt;a href="https://github.com/jaegertracing/jaeger/tree/master/examples/hotrod"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If all that does not ring a bell, check out my previous post on &lt;a href="https://blog.mphomphego.co.za/blog/2021/07/25/How-to-configure-Jaeger-Data-source-on-Grafana-and-debug-network-issues-with-Bind-utilities.html"&gt;How To Configure Jaeger Data Source On Grafana And Debug Network Issues With Bind-utilities.&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;

&lt;p&gt;Life's too short; read the whole d*** article...&lt;/p&gt;
&lt;h2&gt;
  
  
  The How
&lt;/h2&gt;

&lt;p&gt;Before you continue, ensure that you have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Kubernetes cluster with Jaeger backend installed&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  The Walk-through
&lt;/h2&gt;

&lt;p&gt;First, we need to create a namespace for the Jaeger backend and a dedicated directory for the Kubernetes YAML manifests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;observability
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; ./manifests/jaeger-tracing/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we need to create a hotrod Jaeger instance named &lt;code&gt;hotrod-traces&lt;/code&gt;, along with a &lt;code&gt;hotrod-traces-query&lt;/code&gt; service whose type is changed from the default configuration to &lt;code&gt;NodePort&lt;/code&gt;, which lets us expose the Jaeger UI on port &lt;code&gt;30686&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; manifests/jaeger-tracing/jaeger-hotrod.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: hotrod-traces
  namespace: ${namespace}
---
apiVersion: v1
kind: Service
metadata:
  name: hotrod-traces-query
  namespace: ${namespace}
spec:
  ports:
    - name: http-query
      port: 16686
      protocol: TCP
      targetPort: 16686
      nodePort: 30686
  selector:
    app: jaeger
    app.kubernetes.io/component: all-in-one
    app.kubernetes.io/instance: hotrod-traces
    app.kubernetes.io/managed-by: jaeger-operator
    app.kubernetes.io/name: hotrod-traces
    app.kubernetes.io/part-of: jaeger
  type: NodePort
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We then apply the manifest, which creates both the &lt;code&gt;hotrod-traces&lt;/code&gt; Jaeger instance and the &lt;code&gt;hotrod-traces-query&lt;/code&gt; service, and confirm that everything is running, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;namespace&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; manifests/jaeger-tracing/jaeger-hotrod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xd6XGWNN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134848623-333952e6-0030-482d-9caa-cfbdda14430d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xd6XGWNN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134848623-333952e6-0030-482d-9caa-cfbdda14430d.png" alt="image" width="842" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have a running Jaeger instance, we can deploy the demo application itself.&lt;br&gt;
Let's create a &lt;code&gt;hotrod.yaml&lt;/code&gt; service and deployment manifest that deploys the latest &lt;code&gt;example-hotrod&lt;/code&gt; image from &lt;code&gt;jaegertracing/jaeger&lt;/code&gt;, pointing it at our Jaeger agent via environment variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; manifests/jaeger-tracing/hotrod.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: v1
kind: Service
metadata:
  name: hotrod
  labels:
    app: hotrod
spec:
  ports:
    - port: 8080
  selector:
    app: hotrod
    tier: frontend
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    sidecar.jaegertracing.io/inject: "true"
  labels:
    name: hotrod
  name: hotrod
spec:
  selector:
    matchLabels:
      app: hotrod
      tier: frontend
  template:
    metadata:
      labels:
        app: hotrod
        tier: frontend
    spec:
      containers:
        - image: jaegertracing/example-hotrod:latest
          args: ["all"]
          name: hotrod
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
              protocol: TCP
          env:
            - name: JAEGER_AGENT_HOST
              value: hotrod-traces-agent.${namespace}.svc.cluster.local
            - name: JAEGER_AGENT_PORT
              value: "6831"
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating the &lt;code&gt;hotrod.yaml&lt;/code&gt; manifest, we can deploy the application and confirm that the service is running, as shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;namespace&lt;span class="o"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; manifests/jaeger-tracing/hotrod.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DkgyuToE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134848993-4ec3b067-f547-4c70-94d0-700f71fe1e4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DkgyuToE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134848993-4ec3b067-f547-4c70-94d0-700f71fe1e4b.png" alt="image" width="880" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assuming that everything is running as expected, we can now access the Hot R.O.D application. Since we used a Kubernetes LoadBalancer service and did not configure a static port, we first need to find the port that Kubernetes assigned to the &lt;code&gt;hotrod&lt;/code&gt; service, using the command below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This port will be randomly assigned by Kubernetes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get svc &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;namespace&lt;span class="o"&gt;}&lt;/span&gt; hotrod &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.spec.ports[0].nodePort'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
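&lt;p&gt;If you prefer not to install &lt;code&gt;jq&lt;/code&gt;, the same lookup is easy in Python. The JSON below is a trimmed, hypothetical example of what &lt;code&gt;kubectl get svc -o json&lt;/code&gt; returns; only the fields the filter touches are shown.&lt;/p&gt;

```python
import json

# Trimmed, hypothetical kubectl output: only the fields we need are shown.
svc_json = '{"spec": {"ports": [{"port": 8080, "nodePort": 30415}]}}'

def node_port(svc):
    # Equivalent to the jq filter '.spec.ports[0].nodePort'.
    return json.loads(svc)["spec"]["ports"][0]["nodePort"]

print(node_port(svc_json))
# 30415
```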



&lt;p&gt;Once we have the port, we can open &lt;a href="http://localhost:30415"&gt;http://localhost:30415&lt;/a&gt; in the browser and issue a few requests, which triggers Jaeger to record the traces.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XH2OEOWg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134847616-a64c14a0-0f43-4e98-8d67-4f834e85fb40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XH2OEOWg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134847616-a64c14a0-0f43-4e98-8d67-4f834e85fb40.png" alt="image" width="880" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a second tab, we can open &lt;a href="http://localhost:30686"&gt;http://localhost:30686&lt;/a&gt; to view the traces recorded by Jaeger.&lt;/p&gt;

&lt;p&gt;In the Jaeger UI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Under "Service", choose any of the services shown in the drop-down menu then select "Find Traces"&lt;/li&gt;
&lt;li&gt;In the results, click and examine the spans.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The images below show various services and their spans.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dz73xaWz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134849520-725d613c-e46e-41ab-9642-6c1093252790.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dz73xaWz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134849520-725d613c-e46e-41ab-9642-6c1093252790.png" alt="image" width="880" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GeV0tbiY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134849571-0bf2ca1e-ed43-4ce9-8ac6-b6c45b36024b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GeV0tbiY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/134849571-0bf2ca1e-ed43-4ce9-8ac6-b6c45b36024b.png" alt="image" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to understand more about the data in the images above, I recommend you to read this article titled &lt;a href="https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941"&gt;Take OpenTracing for a HotROD ride&lt;/a&gt; as this is beyond the scope of this post.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog post, we covered a few topics related to Jaeger and how distributed tracing differs from monolithic tracing. We saw how to set up a simple distributed tracing system in Kubernetes, deploy a simple microservices application, and then use Jaeger tracing to understand how long requests take to complete, thereby helping to improve application performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.jaegertracing.io/"&gt;Jaeger: open source, end-to-end distributed tracing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jaegertracing/jaeger/tree/master/examples/hotrod"&gt;Jaeger: Hot R.O.D. - Rides on Demand&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.datadoghq.com/knowledge-center/distributed-tracing/"&gt;Distributed Tracing Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentracing.io/docs/overview/"&gt;OpenTracing Overview&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/opentracing/take-opentracing-for-a-hotrod-ride-f6e3141f7941"&gt;Take OpenTracing for a HotROD ride&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>microservices</category>
      <category>kubernetes</category>
      <category>jaeger</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>How To Configure Jaeger Data Source On Grafana And Debug Network Issues With Bind-utilities</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Sun, 25 Jul 2021 12:34:46 +0000</pubDate>
      <link>https://dev.to/mmphego/how-to-configure-jaeger-data-source-on-grafana-and-debug-network-issues-with-bind-utilities-7l</link>
      <guid>https://dev.to/mmphego/how-to-configure-jaeger-data-source-on-grafana-and-debug-network-issues-with-bind-utilities-7l</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GjFx2COB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-07-25-How-to-configure-Jaeger-Data-source-on-Grafana-and-debug-network-issues-with-Bind-utilities.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GjFx2COB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-07-25-How-to-configure-Jaeger-Data-source-on-Grafana-and-debug-network-issues-with-Bind-utilities.png" alt="post image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;As a &lt;a href="https://blog.mphomphego.co.za/blog/2021/01/03/How-I-became-a-Udacity-Mentor.html"&gt;mentor for a Udacity nanodegree&lt;/a&gt;, I realized that most students had difficulties adding &lt;a href="https://www.jaegertracing.io"&gt;Jaeger&lt;/a&gt; tracing &lt;a href="https://grafana.com/docs/grafana/latest/datasources/"&gt;data source&lt;/a&gt; on Grafana &amp;amp; Prometheus running in a Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://www.jaegertracing.io/docs/1.24/"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Jaeger is a distributed tracing system released as open source by Uber Technologies. It is used for monitoring and troubleshooting microservices-based distributed systems, including distributed context propagation, distributed transaction monitoring, root cause analysis, service dependency analysis and performance/latency optimization&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At this point, one might be wondering: what is &lt;em&gt;distributed tracing&lt;/em&gt;?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An understanding of application behaviour can be a fascinating task in a microservice architecture. This is because incoming requests may cover several services, and on this request, each intermittent service may have one or more operations. This makes it more difficult and requires more time to resolve problems.&lt;/p&gt;

&lt;p&gt;Distributed tracking helps gain insight into each process and identifies failure regions caused by poor performance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I, therefore, decided to document the guide below, which takes you through installing Jaeger, incorporating it into Grafana, and troubleshooting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This post will not be about using Jaeger for distributed tracing or backend/frontend application performance/latency optimization. If that interests you, you may find this &lt;a href="https://www.digitalocean.com/community/tutorials/how-to-implement-distributed-tracing-with-jaeger-on-kubernetes"&gt;post&lt;/a&gt; very useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This post assumes that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You are familiar with Kubernetes&lt;/li&gt;
&lt;li&gt;You have a running Kubernetes cluster, and&lt;/li&gt;
&lt;li&gt;You have already installed Grafana and Prometheus on the cluster&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If not, refer to a previous post on how to &lt;a href="https://blog.mphomphego.co.za/blog/2021/02/01/Install-Prometheus-and-Grafana-with-helm-3-on-Kubernetes-cluster-running-on-Vagrant-VM.html"&gt;install Prometheus &amp;amp; Grafana using Helm 3 on a Kubernetes cluster running on a Vagrant VM&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Walk-through
&lt;/h2&gt;

&lt;p&gt;This section is divided into 4 parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installing Jaeger Operator on Kubernetes&lt;/li&gt;
&lt;li&gt;Access Jaeger UI on Browser&lt;/li&gt;
&lt;li&gt;Configuring Jaeger Data Source on Grafana&lt;/li&gt;
&lt;li&gt;Debugging and Troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Installing Jaeger Operator on Kubernetes
&lt;/h3&gt;

&lt;p&gt;First, we will need to install &lt;a href="https://www.jaegertracing.io/docs/1.24/operator/#understanding-operators"&gt;Jaeger Operator&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Jaeger Operator is an implementation of a &lt;a href="https://www.openshift.com/learn/topics/operators"&gt;Kubernetes Operator&lt;/a&gt;. Operators are pieces of software that ease the operational complexity of running another piece of software. More technically, Operators are a method of packaging, deploying, and managing a Kubernetes application.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The command below will create the &lt;code&gt;observability&lt;/code&gt; namespace and install the Jaeger Operator (&lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions"&gt;CRD&lt;/a&gt; for &lt;code&gt;apiVersion: jaegertracing.io/v1&lt;/code&gt;) in the same namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;observability
kubectl create namespace &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/crds/jaegertracing.io_jaegers_crd.yaml
kubectl create &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/service_account.yaml
kubectl create &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/role.yaml
kubectl create &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/role_binding.yaml
kubectl create &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/operator.yaml

kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/cluster_role.yaml
kubectl create &lt;span class="nt"&gt;-f&lt;/span&gt; https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/cluster_role_binding.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have created the &lt;code&gt;jaeger-operator&lt;/code&gt; deployment, we need to create a Jaeger instance; see the snippet below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; jaeger-tracing
&lt;span class="nb"&gt;cat&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; jaeger-tracing/jaeger.yaml &lt;span class="o"&gt;&amp;lt;&amp;lt;&lt;/span&gt;&lt;span class="no"&gt;EOF&lt;/span&gt;&lt;span class="sh"&gt;
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: my-traces
  namespace: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;
&lt;/span&gt;&lt;span class="no"&gt;EOF
&lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; jaeger-tracing/jaeger.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the Jaeger instance named &lt;code&gt;my-traces&lt;/code&gt; has been created, we can verify that pods and services are running successfully by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; pods,svc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1_hM2oWy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126912022-8d52fab8-be88-4aad-b9bf-7830ff292f59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1_hM2oWy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126912022-8d52fab8-be88-4aad-b9bf-7830ff292f59.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Jaeger UI is served via the &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;Ingress&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An Ingress exposes HTTP and HTTPS routes from outside the cluster to services within the cluster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can verify that an ingress service exists by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; ingress &lt;span class="nt"&gt;-o&lt;/span&gt; yaml | &lt;span class="nb"&gt;tail&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8Dp3Q6xh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126898255-a23f5002-8600-4f04-a90b-335017ffe341.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Dp3Q6xh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126898255-a23f5002-8600-4f04-a90b-335017ffe341.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The service name and port number will be useful later when setting up data sources on Grafana.&lt;/p&gt;

&lt;h3&gt;
  
  
  Access Jaeger UI on Browser
&lt;/h3&gt;

&lt;p&gt;For testing purposes, we can &lt;a href="https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#port-forward"&gt;port-forward&lt;/a&gt; the Jaeger query service so that we can access it on localhost, by running the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl port-forward &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="si"&gt;$(&lt;/span&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"jaeger"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; name&lt;span class="si"&gt;)&lt;/span&gt; 16686:16686
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, in our browser, we can access the Jaeger UI to validate that the installation was successful.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3OtTiw1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126910823-696351d9-f8eb-4410-a0d4-380aadcfeebd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3OtTiw1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126910823-696351d9-f8eb-4410-a0d4-380aadcfeebd.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Configuring Jaeger Data Source on Grafana
&lt;/h3&gt;

&lt;p&gt;To configure Jaeger as a data source, we need to retrieve the &lt;a href="https://www.jaegertracing.io/docs/1.24/architecture/#query"&gt;Jaeger query&lt;/a&gt; service name and port, as these make up the Kubernetes DNS record that we will point Grafana at.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Query&lt;/em&gt; is a service that retrieves traces from storage and hosts a UI to display them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;According to &lt;a href="https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#introduction"&gt;Kubernetes docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Every Service defined in the cluster (including the DNS server itself) is assigned a DNS name. By default, a client Pod's DNS search list includes the Pod's namespace and the cluster's default domain.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can retrieve the full DNS name for the &lt;a href="https://www.jaegertracing.io/docs/1.24/architecture/#query"&gt;Jaeger Query&lt;/a&gt; endpoint, which we will use as our data source URL on Grafana.&lt;/p&gt;

&lt;p&gt;According to &lt;a href="https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#namespaces-of-services"&gt;Kubernetes docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A DNS query may return different results based on the namespace of the pod making it. DNS queries that don't specify a namespace are limited to the pod's namespace. Access services in other namespaces by specifying them in the DNS query.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The code below compiles the DNS name for the Jaeger query &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/"&gt;&lt;em&gt;service&lt;/em&gt;&lt;/a&gt;, which exists in the &lt;code&gt;observability&lt;/code&gt; namespace running on a &lt;em&gt;local cluster&lt;/em&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In Kubernetes, a Service is an abstraction that defines a logical set of Pods and a policy by which to access them (sometimes this pattern is called a micro-service).&lt;/p&gt;
&lt;/blockquote&gt;
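&lt;p&gt;The Jaeger Operator creates such a Service for the query component. A minimal sketch of what it could look like (the ports and label selector below are illustrative, not dumped from a live cluster):&lt;/p&gt;

```yaml
# Illustrative sketch of the query Service the operator creates
apiVersion: v1
kind: Service
metadata:
  name: my-traces-query
  namespace: observability
spec:
  selector:
    app: jaeger          # illustrative label selector
  ports:
    - name: http-query
      port: 16686        # Jaeger query/UI port
      targetPort: 16686
```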

&lt;p&gt;Notice the pattern &lt;code&gt;&amp;lt;service_name&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;ingress_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl get &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; ingress &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.items[0].metadata.name}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;ingress_port&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;kubectl get &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; ingress &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{.items[0].spec.defaultBackend.service.port.number}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ingress_name&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;.svc.cluster.local:&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;ingress_port&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
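&lt;p&gt;The same name-and-port composition can be wrapped in a small helper; a pure-shell sketch (no cluster needed, using the service name and port retrieved above):&lt;/p&gt;

```shell
# Compose an in-cluster DNS URL: service.namespace.svc.cluster.local:port
svc_dns_url() {
  local service="$1" namespace="$2" port="$3"
  printf '%s.%s.svc.cluster.local:%s\n' "$service" "$namespace" "$port"
}

svc_dns_url my-traces-query observability 16686
# prints my-traces-query.observability.svc.cluster.local:16686
```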



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FoOYr_nU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897017-513166bb-01da-4515-a336-00e9e6f3b60c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FoOYr_nU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897017-513166bb-01da-4515-a336-00e9e6f3b60c.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Copy the echoed URL (including the port number), then open the Grafana UI and add it as a Jaeger data source. Ensure that the connection is successful by selecting &lt;code&gt;Save &amp;amp; test&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GNUmDLOw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897423-6f3ef0bd-9b25-4597-bbf5-5990eecae0ff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GNUmDLOw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897423-6f3ef0bd-9b25-4597-bbf5-5990eecae0ff.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Should you encounter an error such as "&lt;strong&gt;Jaeger: Bad Gateway. 502. Bad Gateway&lt;/strong&gt;", skip ahead to the Debugging and Troubleshooting section.&lt;/p&gt;

&lt;p&gt;The image below shows a successful integration, where we can query Jaeger &lt;a href="https://www.jaegertracing.io/docs/1.24/architecture/#span"&gt;Span&lt;/a&gt; traces on Grafana.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A span represents a logical unit of work in Jaeger that has an operation name, the start time of the operation, and the duration. Spans may be nested and ordered to model causal relationships.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Uvm2wImg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897982-7e76451e-df9c-449f-a8d8-b7d25cc7241d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Uvm2wImg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897982-7e76451e-df9c-449f-a8d8-b7d25cc7241d.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Debugging and Troubleshooting
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Jaeger docs contain a list of commonly encountered issues, hit this &lt;a href="https://www.jaegertracing.io/docs/1.24/troubleshooting/"&gt;link&lt;/a&gt; for more information.&lt;/li&gt;
&lt;li&gt;If your issue relates to &lt;a href="https://www.cloudflare.com/learning/dns/what-is-dns/"&gt;DNS&lt;/a&gt;, ensure that &lt;code&gt;kube-dns&lt;/code&gt; is running. Every Service object gets an in-cluster DNS name of the form &lt;code&gt;&amp;lt;service_name&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local&lt;/code&gt;, which is how other workloads in the cluster address your &lt;code&gt;&amp;lt;service_name&amp;gt;&lt;/code&gt; in the &lt;code&gt;&amp;lt;namespace&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4ctCBa-e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126915101-31bc70dc-8584-4ddd-a3e5-d47d59e0d362.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4ctCBa-e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126915101-31bc70dc-8584-4ddd-a3e5-d47d59e0d362.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For the next task, we will run a Docker container in our cluster that provides useful &lt;a href="https://en.wikipedia.org/wiki/BIND"&gt;BIND&lt;/a&gt; utilities such as &lt;code&gt;dig&lt;/code&gt;, &lt;code&gt;host&lt;/code&gt; and &lt;code&gt;nslookup&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After a few Google searches, I found the popular container below and decided to use it for my debugging, after investigating and vetting it for malicious packages.&lt;/p&gt;

&lt;p&gt;Read more about &lt;a href="https://blog.mphomphego.co.za/blog/2021/03/28/How-I-hardened-the-security-of-my-Docker-environment.html"&gt;how to harden the security of your Docker environment&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---_jClgpD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897185-2f71afe1-5012-4fd0-af5f-ea78bf67b167.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---_jClgpD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897185-2f71afe1-5012-4fd0-af5f-ea78bf67b167.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Running the command below will create a new pod based on the &lt;code&gt;dnsutils&lt;/code&gt; Docker image and open a &lt;code&gt;bash&lt;/code&gt; shell in it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vagrant@dashboard:~&amp;gt; kubectl run dnsutils &lt;span class="nt"&gt;--image&lt;/span&gt; tutum/dnsutils &lt;span class="nt"&gt;-ti&lt;/span&gt; &lt;span class="nt"&gt;--&lt;/span&gt; bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; I am running &lt;a href="https://rancher.com/docs/k3s/latest/en/"&gt;k3s&lt;/a&gt; on a Vagrant box. In case you are not familiar with k3s:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;K3s, is designed to be a single binary of less than 40MB that completely implements the Kubernetes API. To achieve this, they removed a lot of extra drivers that didn't need to be part of the core and are easily replaced with add-ons.&lt;/p&gt;

&lt;p&gt;K3s is a fully CNCF (Cloud Native Computing Foundation) certified Kubernetes offering. This means that you can write your YAML to operate against a regular "full-fat" Kubernetes, and they'll also apply against a k3s cluster.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Anyway, let's not get side-tracked. If you have used Docker before, think of &lt;code&gt;kubectl run&lt;/code&gt; as an alternative to &lt;code&gt;docker run&lt;/code&gt;: it creates a pod and runs a particular image in it.&lt;/p&gt;

&lt;p&gt;The commands below use the &lt;a href="https://linuxize.com/post/how-to-use-dig-command-to-query-dns-in-linux/"&gt;&lt;code&gt;dig&lt;/code&gt; (Domain Information Groper)&lt;/a&gt; utility to query DNS for the &lt;a href="https://en.wikipedia.org/wiki/List_of_DNS_record_types"&gt;A records&lt;/a&gt; on the Kubernetes domain (&lt;code&gt;*.*.svc.cluster.local&lt;/code&gt;), then resolve each returned IP address back to a hostname and print to STDOUT only those records that belong to the &lt;code&gt;observability&lt;/code&gt; namespace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@dnsutils:/# &lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;observability
root@dnsutils:/# &lt;span class="k"&gt;for &lt;/span&gt;IP &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;dig +short &lt;span class="k"&gt;*&lt;/span&gt;.&lt;span class="k"&gt;*&lt;/span&gt;.svc.cluster.local&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nv"&gt;HOSTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;host &lt;span class="nv"&gt;$IP&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-q&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOSTS&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;HOSTS&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;fi&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
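&lt;p&gt;The grep step inside the loop above can be isolated into a tiny filter; a sketch, using illustrative &lt;code&gt;host&lt;/code&gt;-style reverse-lookup lines rather than real cluster output:&lt;/p&gt;

```shell
# Keep only reverse-lookup lines whose hostname is in the given namespace
filter_ns() {
  grep "\.${1}\.svc\.cluster\.local"
}

# Illustrative input lines (not real cluster output)
printf '%s\n' \
  'pointer grafana.monitoring.svc.cluster.local.' \
  'pointer my-traces-query.observability.svc.cluster.local.' \
  | filter_ns observability
# prints only the my-traces-query line
```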



&lt;p&gt;Below is an image highlighting the hostname of the service of interest: &lt;code&gt;my-traces-query.observability.svc.cluster.local&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VFYhSAm0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897114-e91d582b-4e34-4f08-aaa9-1e2e0141260c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VFYhSAm0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897114-e91d582b-4e34-4f08-aaa9-1e2e0141260c.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, check whether port &lt;code&gt;16686&lt;/code&gt; is open on that hostname using the &lt;a href="https://nmap.org/"&gt;nmap&lt;/a&gt; utility. Since &lt;code&gt;nmap&lt;/code&gt; doesn't come preinstalled in the container, we install it manually:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@dnsutils:/# apt update &lt;span class="nt"&gt;-qq&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nmap
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After installing the utility, we can scan the port that the Jaeger query service should be running on, as shown in Configuring Jaeger Data Source on Grafana.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;root@dnsutils:/# nmap &lt;span class="nt"&gt;-p&lt;/span&gt; 16686 my-traces-query.observability.svc.cluster.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The image below shows that port &lt;code&gt;16686&lt;/code&gt; is open, which validates that we can access the Jaeger query service either via the UI or as a Grafana data source.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tB4vKQu4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897246-965994e6-01bf-4804-9dff-cc035549e87c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tB4vKQu4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/126897246-965994e6-01bf-4804-9dff-cc035549e87c.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will try to update this post with new ways to debug as I find my way around Kubernetes, Jaeger and Grafana.&lt;/p&gt;

&lt;p&gt;If you have any suggestions, leave a comment below and I will get in touch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.metricfire.com/blog/grafana-data-sources/"&gt;Grafana Data Sources&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.jaegertracing.io/docs/1.24/operator/"&gt;Jaeger: Operator for Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/api-extension/custom-resources/#customresourcedefinitions"&gt;What are Custom Resource Definitions?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/infracloud/tracing-in-grafana-with-tempo-and-jaeger-ec"&gt;Tracing in Grafana with Tempo and Jaeger&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/jaegertracing/a-guide-to-deploying-jaeger-on-kubernetes-in-production-69afb9a7c8e5"&gt;A Guide to Deploying Jaeger on Kubernetes in Production&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://abirami-ece-09.medium.com/distributed-tracing-with-jaeger-on-kubernetes-b6364b3719d4"&gt;Distributed Tracing with Jaeger on Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developers.redhat.com/blog/2019/08/28/build-a-monitoring-infrastructure-for-your-jaeger-installation#create_a_podmonitor"&gt;Build a monitoring infrastructure for your Jaeger installation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/how-to-implement-distributed-tracing-with-jaeger-on-kubernetes"&gt;How To Implement Distributed Tracing with Jaeger on Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-the-kubernetes-dns-service"&gt;An Introduction to the Kubernetes DNS Service&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/kubernetes-tutorials/kubernetes-dns-for-services-and-pods-664804211501"&gt;Kubernetes DNS for Services and Pods&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/how-to-use-nmap-to-scan-for-open-ports"&gt;How To Use Nmap to Scan for Open Ports&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://phoenixnap.com/kb/nmap-scan-open-ports"&gt;How to Scan &amp;amp; Find All Open Ports with Nmap&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>jaeger</category>
      <category>grafana</category>
      <category>prometheus</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>How I Setup A Private Local PyPI Server Using Docker And Ansible. [Continues]</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Wed, 16 Jun 2021 14:18:48 +0000</pubDate>
      <link>https://dev.to/mmphego/how-i-setup-a-private-local-pypi-server-using-docker-and-ansible-continues-k1d</link>
      <guid>https://dev.to/mmphego/how-i-setup-a-private-local-pypi-server-using-docker-and-ansible-continues-k1d</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RjocBn-d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-06-16-How-I-setup-a-private-PyPI-server-using-Docker-and-Ansible-Continues.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RjocBn-d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-06-16-How-I-setup-a-private-PyPI-server-using-Docker-and-Ansible-Continues.png" alt="word" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;This post continues from &lt;a href="//%7B%7B%20"&gt;How I Setup A Private PyPI Server Using Docker And Ansible&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, I will try to detail how to set up a private local PyPI server using &lt;a href="https://docs.docker.com/get-docker"&gt;Docker&lt;/a&gt; And &lt;a href="https://docs.ansible.com/ansible/latest/index.html"&gt;&lt;strong&gt;Ansible&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Deploy/destroy &lt;a href="https://devpi.net/"&gt;devpi&lt;/a&gt; server running in &lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt; container using a single command.&lt;/p&gt;

&lt;h2&gt;
  
  
  The How
&lt;/h2&gt;

&lt;p&gt;After my initial &lt;a href="//%7B%7B%20"&gt;research&lt;/a&gt;, I wanted to ensure that the deployment is deterministic and the PyPI repository can be torn down and recreated ad-hoc by a single command. In our case, a simple &lt;code&gt;make pypi&lt;/code&gt; deploys an instance of PyPI server through an &lt;a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_intro.html"&gt;Ansible playbook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_intro.html"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ansible Playbooks offer a repeatable, re-usable, simple configuration management and multi-machine deployment system, one that is well suited to deploying complex applications. If you need to execute a task with Ansible more than once, write a playbook and put it under source control. Then you can use the playbook to push out new configuration or confirm the configuration of remote systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A basic Ansible playbook:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Selects machines to execute against from inventory&lt;/li&gt;
&lt;li&gt;Connects to those machines (or network devices, or other managed nodes), usually over SSH&lt;/li&gt;
&lt;li&gt;Copies one or more modules to the remote machines and starts execution there&lt;/li&gt;
&lt;/ul&gt;
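&lt;p&gt;Put together, a playbook is just a YAML file pairing hosts with roles or tasks. A minimal, illustrative sketch (the &lt;code&gt;pypi_servers&lt;/code&gt; host group name is an assumption, not taken from the actual project):&lt;/p&gt;

```yaml
# Illustrative playbook sketch -- host group name is an assumption
- name: Deploy a private PyPI (devpi) server
  hosts: pypi_servers
  become: true
  roles:
    - pypi_server
```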

&lt;p&gt;You can read more about Ansible &lt;a href="https://www.ansible.com/"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Walk-through
&lt;/h2&gt;

&lt;p&gt;The setup is divided into two sections, Containerization and Automation.&lt;/p&gt;

&lt;p&gt;This walk-through mainly focuses on the automation. Go &lt;a href="//%7B%7B%20"&gt;here&lt;/a&gt; for the containerisation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Containerization
&lt;/h3&gt;

&lt;p&gt;I didn't want the post to be too long.&lt;/p&gt;

&lt;p&gt;Post continues &lt;a href="//%7B%7B%20"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Automation
&lt;/h3&gt;

&lt;p&gt;See The How above for the justification for opting for Ansible for the automation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisite
&lt;/h3&gt;

&lt;p&gt;If you already have &lt;a href="https://www.ansible.com/"&gt;Ansible&lt;/a&gt; installed and configured, you can skip this step; otherwise, you can install it with &lt;code&gt;pip&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pip &lt;span class="nb"&gt;install &lt;/span&gt;ansible paramiko
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ensure the dependency collections have been installed as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ansible-galaxy collection &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    ansible.posix &lt;span class="se"&gt;\&lt;/span&gt;
    community.docker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Directory Structure
&lt;/h4&gt;

&lt;p&gt;In this section, I will go through each file in our &lt;code&gt;pypi_server&lt;/code&gt; directory, which houses the configurations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;├── ansible.cfg
├── ansible-requirements-freeze.txt
├── host_inventory
├── Makefile
├── README.md
├── roles
│   └── pypi_server
│      ├── defaults
│      │    └── main.yml
│      ├── files
│      │    └── simple_test-1.0.zip
│      ├── tasks
│      │    └── main.yml
│      └── templates
│           └── nginx-pypi.conf.j2
└── up_pypi.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  Ansible Configuration
&lt;/h5&gt;

&lt;p&gt;Certain settings in Ansible are adjustable via a configuration file (&lt;code&gt;ansible.cfg&lt;/code&gt;). The stock configuration is sufficient for most users, but in our case we needed a few custom settings. Below is a sample of our &lt;code&gt;ansible.cfg&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt;&amp;gt; ansible.cfg &amp;lt;&amp;lt; EOF
[defaults]
inventory=host_inventory

# https://github.com/ansible/ansible/issues/14426
transport=paramiko

[ssh_connection]
pipelining=True
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If installing Ansible from a package manager such as &lt;code&gt;apt&lt;/code&gt;, the latest &lt;code&gt;ansible.cfg&lt;/code&gt; file should be present in &lt;code&gt;/etc/ansible&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you installed Ansible from &lt;code&gt;pip&lt;/code&gt; or the source, you may want to create this file to override default settings in Ansible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget https://raw.githubusercontent.com/ansible/ansible/devel/examples/ansible.cfg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h5&gt;
  
  
  Selecting machines to manage from inventory
&lt;/h5&gt;

&lt;p&gt;Ansible reads information about which machines you want to manage from your inventory. Although you can pass an IP address to an ad hoc command, you need inventory to take advantage of the full flexibility and repeatability of Ansible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt;&amp;gt; host_inventory &amp;lt;&amp;lt; EOF
vagrant ansible_host=192.168.50.4 ansible_user=root ansible_become=yes

[pypi_server]
vagrant
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For this post, I will be using a &lt;a href="https://www.vagrantup.com/intro"&gt;Vagrant&lt;/a&gt; box.&lt;/p&gt;

&lt;p&gt;According to the &lt;a href="https://www.vagrantup.com/intro#introduction-to-vagrant"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Vagrant is a tool for building and managing virtual machine environments in a single workflow. With an easy-to-use workflow and focus on automation, Vagrant lowers development environment setup time, increases production parity, and makes the "works on my machine" excuse a relic of the past.&lt;/p&gt;

&lt;p&gt;Vagrant will isolate dependencies and their configuration within a single disposable, consistent environment, without sacrificing any of the tools you are used to working with (editors, browsers, debuggers, etc.). &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below is a &lt;code&gt;Vagrantfile&lt;/code&gt; used for local development, which you may use. Just run &lt;code&gt;vagrant up&lt;/code&gt; and everything is installed and configured for you.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt;&amp;gt; Vagrantfile &amp;lt;&amp;lt; EOF
# -*- mode: ruby -*-
# vi: set ft=ruby :
# set up the default terminal
ENV["TERM"]="linux"

Vagrant.configure(2) do |config|
    config.vm.box = "opensuse/Leap-15.2.x86_64"

    config.ssh.username = 'root'
    config.ssh.password = 'vagrant'
    config.ssh.insert_key = 'true'

    config.vm.network "private_network", ip: "192.168.50.4"
    config.vm.network "forwarded_port", guest: 3141, host: 3141 # devpi Access

    # configure the parameters for the VirtualBox provider
    config.vm.provider "virtualbox" do |vb|
        vb.memory = "1024"
        vb.cpus = 1
        vb.customize ["modifyvm", :id, "--ioapic", "on"]
    end
    config.vm.provision "shell", inline: &amp;lt;&amp;lt;-SHELL
      zypper --non-interactive install python3 python3-setuptools python3-pip
      zypper --non-interactive install docker
      systemctl enable docker
      usermod -G docker -a $USER
      systemctl restart docker
    SHELL
end
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thereafter, run the following command to install your SSH key in the remote server's &lt;code&gt;authorized_keys&lt;/code&gt;. This enables SSH key login, removing the need for a password on each login and ensuring a password-less, automatic login process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# password: vagrant&lt;/span&gt;
ssh-copy-id vagrant@192.168.50.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the Vagrant box is up, use the &lt;code&gt;ping&lt;/code&gt; module to ping all the nodes in your inventory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ansible all &lt;span class="nt"&gt;-m&lt;/span&gt; ping
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output for each host in your inventory, similar to the image below:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ujKJjrml--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/122216743-fd0b3900-ceac-11eb-8eaf-099c2c71c837.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ujKJjrml--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/122216743-fd0b3900-ceac-11eb-8eaf-099c2c71c837.png" alt="image" width="880" height="682"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h5&gt;
  
  
  Ansible Roles
&lt;/h5&gt;

&lt;p&gt;According to the &lt;a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_reuse_roles.html"&gt;docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Roles let you automatically load related vars, files, tasks, handlers, and other Ansible artefacts based on a known file structure. After you group your content into roles, you can easily reuse them and share them with other users.&lt;/p&gt;

&lt;p&gt;An Ansible role has a defined directory structure with eight main standard directories. You must include at least one of these directories in each role. You can omit any directories the role does not use. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Using the &lt;code&gt;ansible-galaxy&lt;/code&gt; CLI tool that comes bundled with Ansible, you can create a role with the &lt;code&gt;init&lt;/code&gt; command. For example, the following will create a role directory structure called &lt;code&gt;pypi_server&lt;/code&gt; in the current working directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ansible-galaxy init pypi_server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;See Directory Structure above.&lt;/p&gt;

&lt;p&gt;By default, Ansible will look in each directory within a role for a &lt;code&gt;main.yml&lt;/code&gt; file for relevant content:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;defaults/main.yml&lt;/code&gt;: default variables for the role.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;files/main.yml&lt;/code&gt;: files that the role deploys.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;handlers/main.yml&lt;/code&gt;: handlers, which may be used within or outside this role.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;meta/main.yml&lt;/code&gt;: metadata for the role, including role dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tasks/main.yml&lt;/code&gt;: the main list of tasks that the role executes.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;templates/main.yml&lt;/code&gt;: templates that the role deploys.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;vars/main.yml&lt;/code&gt;: other variables for the role.&lt;/li&gt;
&lt;/ul&gt;
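
&lt;p&gt;For instance, role dependencies would be declared in &lt;code&gt;meta/main.yml&lt;/code&gt; (the &lt;code&gt;common&lt;/code&gt; role below is hypothetical; our &lt;code&gt;pypi_server&lt;/code&gt; role has no dependencies):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;---
# Hypothetical meta/main.yml: pull in a "common" role
# (and override one of its variables) before this role runs.
dependencies:
  - role: common
    vars:
      some_parameter: 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;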

&lt;h4&gt;
  
  
  Playbook
&lt;/h4&gt;

&lt;p&gt;Below, we define the playbook that deploys the PyPI server.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt;&amp;gt; up_pypi.yml &amp;lt;&amp;lt;EOF
---
- name: configure and deploy a PyPI server
  hosts: pypi_server
  roles:
    - role: pypi_server
      vars:
        fqdn: # Fully qualified domain name
        fqdn_port: 80
        host_ip: "{{ hostvars[groups['pypi_server'][0]].ansible_default_ipv4.address }}"
        nginx_reverse_proxy: reverse_proxy
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I found these posts relevant to the way we set up our &lt;code&gt;nginx_reverse_proxy&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.nginx.com/nginx/admin-guide/web-server/reverse-proxy/"&gt;NGINX Reverse Proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.hostinger.com/tutorials/how-to-set-up-nginx-reverse-proxy/"&gt;How to Set Up an Nginx Reverse Proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://linuxconfig.org/how-to-setup-nginx-reverse-proxy"&gt;How to setup Nginx Reverse Proxy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://phoenixnap.com/kb/docker-nginx-reverse-proxy"&gt;How to Deploy NGINX Reverse Proxy on Docker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.domysee.com/blogposts/reverse-proxy-nginx-docker-compose"&gt;Setting up a Reverse-Proxy with Nginx and docker-compose&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Plays
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;files/&lt;/code&gt;&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;This is a simple tester for PyPI upload procedures. I modified the &lt;a href="https://pypi.org/project/simple_test"&gt;simple_test&lt;/a&gt; package downloaded from &lt;a href="https://pypi.org"&gt;https://pypi.org&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Download: &lt;a href="https://files.pythonhosted.org/packages/48/1d/73ed6695f69be0f5b3d752b6223e82304239c151cec71a38891b240a4d9c/simple_test-1.0.zip"&gt;simple_test-1.0.zip&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;defaults/main.yml&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;These are the default variables for the role. They have the lowest priority of any variables available and can be easily overridden by any other variable, including inventory variables. They are used as default variables in the &lt;code&gt;tasks&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt;&amp;gt; defaults/main.yml &amp;lt;&amp;lt; EOF
---
container_name : pypi_server
base_image : &amp;lt;&amp;lt; Your Docker Registry&amp;gt;&amp;gt;/pypi_server:latest

devpi_client_ver: '5.2.2'

devpi_port: 3141
devpi_user: devpi
devpi_group: devpi

devpi_folder_home: ./.devpi
devpi_nginx: /var/data/nginx

EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;tasks/main.yml&lt;/code&gt;&lt;/strong&gt;: &lt;/p&gt;

&lt;p&gt;In this &lt;code&gt;main.yml&lt;/code&gt; file we have a list of tasks that the role executes in sequence (and the whole play fails if any of these tasks fail):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install &lt;code&gt;apt&lt;/code&gt; and &lt;code&gt;python&lt;/code&gt; packages.

&lt;ul&gt;
&lt;li&gt;Update apt cache and install &lt;code&gt;python3-pip&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;ansible-docker&lt;/code&gt; dependencies.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Start &lt;code&gt;devpi&lt;/code&gt; and configure &lt;code&gt;nginx&lt;/code&gt; routings.

&lt;ul&gt;
&lt;li&gt;Start &lt;code&gt;devpi&lt;/code&gt; server on &lt;code&gt;docker&lt;/code&gt; container.&lt;/li&gt;
&lt;li&gt;Pause for 30 seconds to ensure server is up.&lt;/li&gt;
&lt;li&gt;Confirm if &lt;code&gt;docker&lt;/code&gt; container is up.&lt;/li&gt;
&lt;li&gt;Create PyPI user and an index.&lt;/li&gt;
&lt;li&gt;Template &lt;code&gt;nginx&lt;/code&gt; reverse proxy config.&lt;/li&gt;
&lt;li&gt;Check if &lt;code&gt;nginx&lt;/code&gt; reverse proxy is up.&lt;/li&gt;
&lt;li&gt;Reload &lt;code&gt;nginx&lt;/code&gt; reverse proxy.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Check if PyPI server is running!

&lt;ul&gt;
&lt;li&gt;Install &lt;code&gt;python&lt;/code&gt; dependencies locally in a virtual environment.&lt;/li&gt;
&lt;li&gt;Check if &lt;code&gt;devpi&lt;/code&gt; index is up and confirm &lt;code&gt;nginx&lt;/code&gt; routing!&lt;/li&gt;
&lt;li&gt;Login to &lt;code&gt;devpi&lt;/code&gt; as PyPI user.&lt;/li&gt;
&lt;li&gt;Find path to &lt;code&gt;simple-test&lt;/code&gt; package.&lt;/li&gt;
&lt;li&gt;Upload &lt;code&gt;simple-test&lt;/code&gt; package to &lt;code&gt;devpi&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Check if package was uploaded.&lt;/li&gt;
&lt;li&gt;Install &lt;code&gt;python&lt;/code&gt; package from PyPI server.&lt;/li&gt;
&lt;li&gt;Garbage cleaning.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These tasks are executed on the remote server, in this case, a vagrant box.&lt;/p&gt;

&lt;p&gt;Below is the &lt;code&gt;main.yml&lt;/code&gt; which details the configuration, deployment and testing of the PyPI server (in a vagrant box).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt;&amp;gt; tasks/main.yml &amp;lt;&amp;lt; EOF
---
- name: Install apt and python packages
  block:
  - name: update apt-cache and install python3-pip.
    apt:
      name: python3-pip
      state: latest
      update_cache: yes

  - name: install ansible-docker dependencies.
    pip:
      name: docker-py
      state: present

  become: yes
  tags: [devpi, packages]

- name: start devpi and configure Nginx routings
  block:
  - name: start devpi server on the docker container.
    community.docker.docker_container:
      name: "{{ container_name }}"
      image: "{{ base_image }}"
      volumes:
      - "{{ devpi_folder_home }}:/root/.devpi"
      ports:
      - "{{ devpi_port }}:{{ devpi_port }}"
      restart_policy: on-failure
      restart_retries: 10
      state: started

  - name: pause for 30 seconds to ensure server is up.
    pause:
      seconds: 30

  - name: "confirm if {{ container_name }} docker is up"
    community.docker.docker_container:
      name: "{{ container_name }}"
      image: "{{ base_image }}"
      state: present

  - name: create pypi user and an index.
    shell: "docker exec -ti {{ container_name }} /bin/bash -c '/data/create_pypi_index.sh'"
    register: command_output
    failed_when: "'Error' in command_output.stderr"

  - name: template nginx reverse proxy config
    template:
      src: "nginx-pypi.conf.j2"
      dest: "{{ devpi_nginx }}/{{ fqdn }}.conf"

  - name: "check if {{ nginx_reverse_proxy }} is up"
    community.docker.docker_container_info:
      name: "{{ nginx_reverse_proxy }}"
    register: result

  - name: "reload {{ nginx_reverse_proxy }}: nginx service"
    shell: "docker exec -ti {{ nginx_reverse_proxy }} bash -c 'service nginx reload'"
    when: result.exists

  - name: pause for 30 seconds to ensure nginx is reloaded.
    pause:
      seconds: 30

  tags: [docker, nginx]

- name: check if pypi server is running!
  delegate_to: localhost
  connection: local
  block:
  - name: install python dependencies locally in a virtual environment
    pip:
      name: devpi-client
      version: "{{ devpi_client_ver }}"
      virtualenv: /tmp/venv
      virtualenv_python: python3
      state: present

  - name: "check if devpi's index is up and confirm nginx routing!"
    shell: "/tmp/venv/bin/devpi use http://{{ fqdn }}/pypi/trusty"

  - name: login to devpi as pypi user
    shell: "/tmp/venv/bin/devpi login pypi --password="

  - name: find path to simple-test package
    find:
      paths: "."
      patterns: '*.zip'
      recurse: yes
    register: output

  - name: upload simple-test package to devpi
    shell: "/tmp/venv/bin/devpi upload {{ output.files[0]['path'] }}"

  - name: check if package was uploaded
    shell: "/tmp/venv/bin/devpi test simple-test"

  - name: install python package from pypi server
    pip:
      name: pip
      virtualenv: /tmp/venv
      extra_args: &amp;gt;
        --upgrade
        -i  http://{{ fqdn }}/pypi/trusty
        --trusted-host {{ fqdn }}

  - name: garbage cleaning
    file:
      path: "/tmp/venv"
      state: absent

  tags: [tests]
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;templates/&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Ansible uses &lt;a href="https://jinja2docs.readthedocs.io/"&gt;Jinja2 templating&lt;/a&gt; to enable dynamic expressions and access to variables.&lt;/p&gt;
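
&lt;p&gt;As a rough illustration of what the templating does (a sketch using &lt;code&gt;sed&lt;/code&gt;, not Ansible's actual rendering engine; &lt;code&gt;pypi.example.com&lt;/code&gt; is a made-up value):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# hypothetical variable value
fqdn="pypi.example.com"
# substitute the Jinja2 placeholder, much like the template module would
echo 'server_name {{ fqdn }};' | sed "s/{{ fqdn }}/${fqdn}/"
# prints: server_name pypi.example.com;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;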

&lt;p&gt;Below is a templated &lt;a href="https://nginx.org/en/"&gt;Nginx&lt;/a&gt; config file used for routing from localhost to a dedicated &lt;a href="https://en.wikipedia.org/wiki/Fully_qualified_domain_name"&gt;FQDN (fully qualified domain name)&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt;&amp;gt; nginx-pypi.conf.j2 &amp;lt;&amp;lt;EOF
server {
    server_name {{ fqdn }};
    listen 80;

    gzip             on;
    gzip_min_length  2000;
    gzip_proxied     any;
    gzip_types       application/json;

    proxy_read_timeout 60s;
    client_max_body_size 70M;

    # set to where your devpi-server state is on the filesystem
    root {{ devpi_folder_home }};

    # try serving static files directly
    location ~ /\+f/ {
        # workaround to pass non-GET/HEAD requests through to the named location below
        error_page 418 = @proxy_to_app;
        if ($request_method !~ (GET)|(HEAD)) {
            return 418;
        }

        expires max;
        try_files /+files$uri @proxy_to_app;
    }
    # try serving docs directly
    location ~ /\+doc/ {
        # if the --documentation-path option of devpi-web is used,
        # then the root must be set accordingly here
        root {{ devpi_folder_home }};
        try_files $uri @proxy_to_app;
    }
    location / {
        # workaround to pass all requests to / through to the named location below
        error_page 418 = @proxy_to_app;
        return 418;
    }
    location @proxy_to_app {
        proxy_pass http://{{ host_ip }}:{{ devpi_port }};
        proxy_set_header X-outside-url $scheme://$http_host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Makefile
&lt;/h4&gt;

&lt;p&gt;Below is a snippet from our Makefile, which makes it a lot easier to install dependencies and set up a PyPI server. This means that instead of typing the whole &lt;code&gt;pip&lt;/code&gt; or &lt;code&gt;ansible-playbook&lt;/code&gt; commands to install dependencies and bring up a PyPI server, we can run something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make install_pkgs pypi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also check out my over-engineered Makefile &lt;a href="https://github.com/mmphego/Generic_Makefile"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cat &amp;gt;&amp;gt; Makefile &amp;lt;&amp;lt; EOF 
.DEFAULT_GOAL := help

define PRINT_HELP_PYSCRIPT
import re, sys
print("Please use `make &amp;lt;target&amp;gt;` where &amp;lt;target&amp;gt; is one of\n")
for line in sys.stdin:
    match = re.match(r'^([a-zA-Z_-]+):.*?## (.*)$$', line)
    if match:
        target, help = match.groups()
        if not target.startswith('--'):
            print(f"{target:20} - {help}")
endef

export PRINT_HELP_PYSCRIPT

help:
    python3 -c "$$PRINT_HELP_PYSCRIPT" &amp;lt; $(MAKEFILE_LIST)

install_pkgs:  ## Install Ansible dependencies locally.
    python3 -m pip install -r ansible-requirements-freeze.txt

lint: *.yml  ## Lint all yaml files
    echo $^ | xargs ansible-playbook -i host_inventory --syntax-check

pypi: ## Setup and start PyPI server
    ansible-playbook -i host_inventory -Kk up_pypi.yml
EOF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Final Testing
&lt;/h3&gt;

&lt;p&gt;To ensure deterministic &lt;code&gt;pypi_server&lt;/code&gt; builds, I did the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stopped the &lt;code&gt;pypi_server&lt;/code&gt; container and deleted &lt;code&gt;pypi_server&lt;/code&gt; on the server.&lt;/li&gt;
&lt;li&gt;Ran a CI job that builds and pushes Docker images to our local docker registry.&lt;/li&gt;
&lt;li&gt;Started the &lt;code&gt;pypi_server&lt;/code&gt; container by executing &lt;code&gt;make pypi&lt;/code&gt; from the working directory (Ansible roles) on a dedicated Ansible server.&lt;/li&gt;
&lt;li&gt;Verified that the &lt;code&gt;pypi.domain&lt;/code&gt; FQDN is up (&lt;code&gt;curl http://pypi.domain &amp;amp;&amp;amp; dig pypi.domain&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;In a virtual environment, installed a random Python package, rebuilt its wheel, and pushed it to &lt;code&gt;pypi.domain&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Congratulations!!!&lt;/p&gt;

&lt;p&gt;Accessing your FQDN, you should see the devpi home page listing your indices:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PxV-12ma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/122233152-1cf62900-cebc-11eb-979c-474ed3465823.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PxV-12ma--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/122233152-1cf62900-cebc-11eb-979c-474ed3465823.png" alt="image" width="880" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Assuming everything was set up correctly, you now have a local/private PyPI server running in a Docker container under config management, ensuring deterministic builds: a single command can tear it down or bring it up.&lt;/p&gt;
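
&lt;p&gt;Clients can then consume the private index, for example via a &lt;code&gt;pip.conf&lt;/code&gt; (using the &lt;code&gt;pypi.domain&lt;/code&gt; FQDN and &lt;code&gt;trusty&lt;/code&gt; index from the setup above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[global]
index-url = http://pypi.domain/pypi/trusty
trusted-host = pypi.domain
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;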

&lt;p&gt;This was a great Ansible and Nginx learning experience for me. If you have reached the end of this post, I appreciate you!&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.ansible.com/ansible/latest/index.html"&gt;Ansible&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.ansible.com/ansible/latest/user_guide/playbooks_intro.html"&gt;Intro to playbooks&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/dhellmann/ansible-devpi"&gt;Ansible-devpi&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://nginx.org/en/"&gt;Nginx&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>ansible</category>
      <category>pypi</category>
      <category>devops</category>
    </item>
    <item>
      <title>How I Setup A Private Local PyPI Server Using Docker And Ansible</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Tue, 15 Jun 2021 09:30:00 +0000</pubDate>
      <link>https://dev.to/mmphego/how-i-setup-a-private-local-pypi-server-using-docker-and-ansible-2inj</link>
      <guid>https://dev.to/mmphego/how-i-setup-a-private-local-pypi-server-using-docker-and-ansible-2inj</guid>
      <description>&lt;p&gt;Liquid syntax error: Variable '{{%20"&amp;gt;here for the automation.
&lt;/p&gt;
&lt;h3&gt;
  
  
  Prerequisite
&lt;/h3&gt;

&lt;p&gt;If you already have &lt;a href="https://docs.docker.com/get-docker/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt; and &lt;a href="https://docs.docker.com/compose/" rel="noopener noreferrer"&gt;Docker-Compose&lt;/a&gt; installed and configured you can skip this step else you can search for your installation methods.&lt;br&gt;
{% raw %}' was not properly terminated with regexp: /\}\}/&lt;/p&gt;

</description>
      <category>python</category>
      <category>pypi</category>
      <category>docker</category>
      <category>ansible</category>
    </item>
    <item>
      <title>Note To Self: How To Stop A Running Pod On Kubernetes</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Tue, 18 May 2021 02:36:19 +0000</pubDate>
      <link>https://dev.to/mmphego/note-to-self-how-to-stop-a-running-pod-on-kubernetes-373h</link>
      <guid>https://dev.to/mmphego/note-to-self-how-to-stop-a-running-pod-on-kubernetes-373h</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Hmks42F7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-05-18-Note-To-Self-How-to-stop-a-running-pod-on-kubernetes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hmks42F7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-05-18-Note-To-Self-How-to-stop-a-running-pod-on-kubernetes.png" alt="wordwrap"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Story
&lt;/h2&gt;

&lt;p&gt;OMG, I just ran a Kubernetes command from the wild and now I cannot seem to stop or delete the running pod (that was me when my CPU fan sounded like an industrial fan). &lt;/p&gt;

&lt;p&gt;So, this is what happened, right. I have a &lt;a href="https://rancher.com/products/rke/"&gt;RKE Kubernetes&lt;/a&gt; cluster running on a Vagrant box, and I thought to myself: why not use it to mine a few cryptos while at it, since the crypto business has been booming recently?&lt;/p&gt;

&lt;p&gt;So the idea was to test it on my Vagrant box and somehow let it find its way to run elsewhere, so that I could mine while I sleep and then one day wake up a gajillionaire, or something close.&lt;/p&gt;

&lt;p&gt;Note To Self: Never blindly run commands on your system, especially ones from the wild.&lt;/p&gt;

&lt;h3&gt;
  
  
  TL;DR
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Set a new size for a Deployment, ReplicaSet, Replication Controller, or StatefulSet.&lt;/span&gt;
kubectl scale &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The How
&lt;/h2&gt;

&lt;p&gt;I ran a basic &lt;code&gt;kubectl run&lt;/code&gt; command to bring up a few mining pods, where &lt;code&gt;rke_config_cluster.yml&lt;/code&gt; is my RKE config file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#start monero_cpu_moneropool&lt;/span&gt;
kubectl run &lt;span class="nt"&gt;--kubeconfig&lt;/span&gt; rke_config_cluster.yml moneropool &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;servethehome/monero_cpu_moneropool:latest &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="c"&gt;#start minergate&lt;/span&gt;
kubectl run &lt;span class="nt"&gt;--kubeconfig&lt;/span&gt; rke_config_cluster.yml minergate &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;servethehome/monero_cpu_minergate:latest &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;span class="c"&gt;#start cryptotonight&lt;/span&gt;
kubectl run &lt;span class="nt"&gt;--kubeconfig&lt;/span&gt; rke_config_cluster.yml minergate &lt;span class="nt"&gt;--image&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;servethehome/universal_cryptonight:latest &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After realising that my CPU was choking, I then tried to stop the mining pods.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SRDCTOLq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/118583866-09a94e00-b796-11eb-8c16-5cfef8008c24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SRDCTOLq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/118583866-09a94e00-b796-11eb-8c16-5cfef8008c24.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Little did I know that Kubernetes doesn't support stopping/pausing a pod's current state. I then started deleting the pods, thinking this would automagically stop and delete them, and sure enough, that didn't work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0h0ptaa2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/118583540-85ef6180-b795-11eb-9e77-dcaa69607795.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0h0ptaa2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/118583540-85ef6180-b795-11eb-9e77-dcaa69607795.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That's when it hit me: the command I copied ensures that there's always 1 replica running, which was why the pods kept being re-spawned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Walk-through
&lt;/h2&gt;

&lt;p&gt;I managed to stop all my mining pods by ensuring that there were no active deployments, which is simply done by scaling the number of replicas to 0. Duh!!!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nt"&gt;--kubeconfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rke_config_cluster.yml  scale &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 deployment minergate moneropool
kubectl &lt;span class="nt"&gt;--kubeconfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rke_config_cluster.yml  scale &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 replicaset minergate-686c565775 moneropool-69fbc5b6d5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gKHF5yRi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/118584808-fa2b0480-b797-11eb-9bee-13bfb4661286.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gKHF5yRi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/118584808-fa2b0480-b797-11eb-9bee-13bfb4661286.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Checking all running pods again, I can see that my mining pods have been paused/stopped.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ndMsz6Io--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/118584763-e5e70780-b797-11eb-90ec-3b8109a8efb9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ndMsz6Io--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/118584763-e5e70780-b797-11eb-90ec-3b8109a8efb9.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And that's how I failed to become a gajillionaire. Maybe I should just run this in a production environment, bwagagagaga!&lt;/p&gt;

&lt;h2&gt;
  
  
  Reference
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://rancher.com/products/rke/"&gt;Rancher Kubernetes Engine (RKE)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
    </item>
    <item>
      <title>How And Why, I Moved From Docker Hub To GitHub Docker Registry.</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Thu, 15 Apr 2021 14:31:30 +0000</pubDate>
      <link>https://dev.to/mmphego/how-and-why-i-moved-from-docker-hub-to-github-docker-registry-526l</link>
      <guid>https://dev.to/mmphego/how-and-why-i-moved-from-docker-hub-to-github-docker-registry-526l</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YiPPpW8j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-04-15-How-and-Why-I-moved-from-Docker-Hub-to-GitHub-Docker-registry.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YiPPpW8j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-04-15-How-and-Why-I-moved-from-Docker-Hub-to-GitHub-Docker-registry.png" alt="post image" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  The Story
&lt;/h1&gt;

&lt;p&gt;In &lt;a href="https://www.docker.com/blog/what-you-need-to-know-about-upcoming-docker-hub-rate-limiting/"&gt;August 2020&lt;/a&gt;, &lt;a href="https://www.docker.com/"&gt;Docker&lt;/a&gt; announced that they were introducing rate-limiting on container pulls for free and anonymous users, which meant that if you did not log in to your DockerHub registry via the command line, you would be limited to 100 pulls per 6 hours. At first this did not affect me, as I rarely pulled 10 images per day, but recently I have been tinkering with Kubernetes, Prometheus, Jaeger (you can check this &lt;a href="https://blog.mphomphego.co.za/blog/2021/02/01/Install-Prometheus-and-Grafana-with-helm-3-on-Kubernetes-cluster-running-on-Vagrant-VM.html"&gt;post&lt;/a&gt; on how to install Prometheus &amp;amp; Grafana on a K3s cluster) and other tools, which usually pull multiple images per run.&lt;/p&gt;

&lt;p&gt;This meant that I would be pulling images more frequently than I used to in the past. That's when I got the dreaded error message, &lt;strong&gt;"429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: &lt;a href="https://www.docker.com/increase-rate-limit"&gt;https://www.docker.com/increase-rate-limit&lt;/a&gt;"&lt;/strong&gt;.&lt;/p&gt;
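&lt;p&gt;If you want to see where you stand against the limit, Docker exposes it through response headers on a manifest request. A rough sketch (network access required; the &lt;code&gt;ratelimitpreview/test&lt;/code&gt; image is the one Docker documents for this check):&lt;/p&gt;

```shell
# Sketch: print the anonymous pull-rate headers from Docker Hub.
# Needs curl and network access; parsing with sed to avoid a jq dependency.
check_pull_quota() {
  local token
  token=$(curl -fsS -G "https://auth.docker.io/token" \
    --data-urlencode "service=registry.docker.io" \
    --data-urlencode "scope=repository:ratelimitpreview/test:pull" \
    | sed -n 's/.*"token":"\([^"]*\)".*/\1/p')
  curl -fsSI -H "Authorization: Bearer ${token}" \
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest" \
    | grep -i '^ratelimit'
}
# check_pull_quota   # prints e.g. "ratelimit-limit" and "ratelimit-remaining" headers
```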

&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/aaa6bb5fed0b77cc6c2251b347043382cec926ef56814602c5556799b7317687/68747470733a2f2f6d656469612e67697068792e636f6d2f6d656469612f6c34364362417578466b32437a307332412f736f757263652e676966" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/aaa6bb5fed0b77cc6c2251b347043382cec926ef56814602c5556799b7317687/68747470733a2f2f6d656469612e67697068792e636f6d2f6d656469612f6c34364362417578466b32437a307332412f736f757263652e676966" alt="" width="400" height="225"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This meant that I either had to configure Kubernetes secrets to pull from an authenticated DockerHub registry or find an alternative registry that would not issue rate limits every 100th pull. Coincidentally, at roughly the same time &lt;a href="https://github.blog/2020-09-01-introducing-github-container-registry/"&gt;GitHub introduced their container registry&lt;/a&gt; (a private container registry that integrates easily with existing CI/CD tooling), and we were saved. Since it is still in its beta stage, usage is &lt;a href="https://docs.github.com/en/packages/guides/about-github-container-registry#about-billing-for-github-container-registry"&gt;free&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/4c101775ceb36ea8ab03b6aa7476ba086933188adef9a25d9633eb2405a19d78/68747470733a2f2f6d656469612e67697068792e636f6d2f6d656469612f6c304979363952427774646d76776b496f2f736f757263652e676966" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/4c101775ceb36ea8ab03b6aa7476ba086933188adef9a25d9633eb2405a19d78/68747470733a2f2f6d656469612e67697068792e636f6d2f6d656469612f6c304979363952427774646d76776b496f2f736f757263652e676966" alt="" width="366" height="295"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post will detail how I migrated from using DockerHub to GitHub as my Docker container registry.&lt;/p&gt;

&lt;h1&gt;
  
  
  TL;DR
&lt;/h1&gt;

&lt;p&gt;Set up the GitHub Container Registry and configure your Kubernetes pod containers to pull from the private registry (Optional)&lt;/p&gt;

&lt;h1&gt;
  
  
  But wait, What is a Container registry?
&lt;/h1&gt;

&lt;p&gt;According to &lt;a href="https://www.redhat.com/en/topics/cloud-native-apps/what-is-a-container-registry"&gt;RedHat&lt;/a&gt;,&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A container registry is a repository, or collection of repositories, used to store container images for Kubernetes, DevOps, and container-based application development.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;GitHub is the home of free and open-source software, which puts it in a great spot to offer a container registry that integrates well with its existing services and operates as an extension of &lt;a href="https://docs.github.com/en/packages/learn-github-packages/about-github-packages"&gt;GitHub Packages&lt;/a&gt;, making it a good competitor to DockerHub.&lt;/p&gt;

&lt;h1&gt;
  
  
  The How
&lt;/h1&gt;

&lt;p&gt;After deploying my application on my Kubernetes cluster, I noticed a few errors. After troubleshooting, I found that the Docker pull rate limit had been hit, which drove me insane.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl describe pod frontend-app-6b885c795d-9vbfx | &lt;span class="nb"&gt;tail

&lt;/span&gt;Events:
  Type     Reason          Age                From                Message
  &lt;span class="nt"&gt;----&lt;/span&gt;     &lt;span class="nt"&gt;------&lt;/span&gt;          &lt;span class="nt"&gt;----&lt;/span&gt;               &lt;span class="nt"&gt;----&lt;/span&gt;                &lt;span class="nt"&gt;-------&lt;/span&gt;
  Warning  FailedMount     72s                kubelet, dashboard  MountVolume.SetUp failed &lt;span class="k"&gt;for &lt;/span&gt;volume &lt;span class="s2"&gt;"default-token-6zmpp"&lt;/span&gt; : failed to &lt;span class="nb"&gt;sync &lt;/span&gt;secret cache: timed out waiting &lt;span class="k"&gt;for &lt;/span&gt;the condition
  Normal   SandboxChanged  71s                kubelet, dashboard  Pod sandbox changed, it will be killed and re-created.
  Warning  Failed          44s                kubelet, dashboard  Failed to pull image &lt;span class="s2"&gt;"mmphego/frontend:v7"&lt;/span&gt;: rpc error: code &lt;span class="o"&gt;=&lt;/span&gt; Unknown desc &lt;span class="o"&gt;=&lt;/span&gt; failed to pull and unpack image &lt;span class="s2"&gt;"docker.io/mmphego/frontend:v7"&lt;/span&gt;: failed to copy: httpReaderSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/mmphego/frontend/manifests/sha256:2994ce56c38abe2947935d7bc9d6a743dfc30186659aae80d5f2b51a0b8f37d1: 429 Too Many Requests - Server message: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit
  Warning  Failed          44s                kubelet, dashboard  Error: ErrImagePull
  Normal   BackOff         43s                kubelet, dashboard  Back-off pulling image &lt;span class="s2"&gt;"mmphego/frontend:v7"&lt;/span&gt;
  Warning  Failed          43s                kubelet, dashboard  Error: ImagePullBackOff
  Normal   Pulling         31s &lt;span class="o"&gt;(&lt;/span&gt;x2 over 70s&lt;span class="o"&gt;)&lt;/span&gt;  kubelet, dashboard  Pulling image &lt;span class="s2"&gt;"mmphego/frontend:v7"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l-4QT2jG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/116199774-cb40e600-a737-11eb-9f6c-9ad032b5f3a7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l-4QT2jG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/116199774-cb40e600-a737-11eb-9f6c-9ad032b5f3a7.png" alt="image" width="880" height="274"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Walk-through
&lt;/h1&gt;

&lt;p&gt;Setting up your container registry is straightforward.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Create a &lt;a href="https://docs.github.com/en/github/authenticating-to-github/creating-a-personal-access-token"&gt;GitHub Personal Token&lt;/a&gt; on &lt;a href="https://github.com/settings/apps"&gt;https://github.com/settings/apps&lt;/a&gt; (See images below)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Select &lt;strong&gt;Personal Access Tokens&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qNk_ju1q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114887704-fad32280-9e08-11eb-8fec-9b021269e70a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qNk_ju1q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114887704-fad32280-9e08-11eb-8fec-9b021269e70a.png" alt="image" width="880" height="254"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Click &lt;strong&gt;Generate a new token&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--88Md5Kv4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114887818-0fafb600-9e09-11eb-9ae9-085e571f6e54.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--88Md5Kv4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114887818-0fafb600-9e09-11eb-9ae9-085e571f6e54.png" alt="image" width="880" height="174"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Add a &lt;code&gt;Note&lt;/code&gt;, check &lt;code&gt;write: packages&lt;/code&gt; and hit generate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vc0U6ljM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114888436-8482f000-9e09-11eb-8023-fd6edec889d2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vc0U6ljM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114888436-8482f000-9e09-11eb-8023-fd6edec889d2.png" alt="image" width="880" height="529"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When done, you will be provided with a token that you need to back up.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;GitHub recommends placing the token into a file.&lt;/p&gt;

&lt;p&gt;Either add the token to your &lt;code&gt;~/.bashrc&lt;/code&gt; or &lt;code&gt;~/.bash_profile&lt;/code&gt; and risk exposing it as an environment variable, or place it in a file in a secrets directory with restricted read/write privileges (I prefer the latter).&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vim ~/.secrets/github_docker_token
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Paste the token into the &lt;code&gt;github_docker_token&lt;/code&gt; file.&lt;/p&gt;
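&lt;p&gt;The step above can be sketched in the shell; the path matches the post, while the placeholder token value and the owner-only permissions are my additions:&lt;/p&gt;

```shell
# Store the personal access token in a file only your user can read.
# "ghp_PLACEHOLDER" stands in for the real token; never commit this file.
mkdir -p "$HOME/.secrets"
printf '%s\n' "ghp_PLACEHOLDER" > "$HOME/.secrets/github_docker_token"
chmod 600 "$HOME/.secrets/github_docker_token"   # owner read/write only
```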
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Login to GitHub Container Registry&lt;/p&gt;

&lt;p&gt;Set up your username and email as environment variables:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.bashrc

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GH_EMAIL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git config user.email&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GH_USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;git config user.username&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# or hardcode your username (Not Recommended!)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Log in to your container registry with your username and personal token.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat&lt;/span&gt; ~/.secrets/github_docker_token | docker login ghcr.io &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GH_USERNAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;--password-stdin&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aj_MTqfL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114888892-eb080e00-9e09-11eb-8c1b-8c77e85cc3f3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aj_MTqfL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114888892-eb080e00-9e09-11eb-8c1b-8c77e85cc3f3.png" alt="image" width="880" height="107"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If successful, you should see an image similar to the one above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Typing secrets on the command line may store them in your shell history unprotected, and those secrets might also be visible to other users on your PC.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Confirm that you successfully logged in.&lt;/p&gt;

&lt;p&gt;To confirm that you are logged in, we need to build, tag, and push an image to ghcr (GitHub Container Registry).&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"add information"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;REPOSITORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"add information"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;IMAGE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"add information"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"add information"&lt;/span&gt;
docker build &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; ghcr.io/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPOSITORY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;IMAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
docker push ghcr.io/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;USERNAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;REPOSITORY&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;/&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;IMAGE&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;:&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;VERSION&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
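&lt;p&gt;For clarity, GHCR image references follow a fixed shape, which the variables above fill in. A sketch with hypothetical values (the username, repository, image, and version below are made-up examples):&lt;/p&gt;

```shell
# ghcr.io image references look like ghcr.io/OWNER/REPOSITORY/IMAGE:TAG
# (all values below are hypothetical examples).
USERNAME="mmphego"
REPOSITORY="my-repo"
IMAGE="my-app"
VERSION="v1"
IMAGE_REF="ghcr.io/${USERNAME}/${REPOSITORY}/${IMAGE}:${VERSION}"
echo "${IMAGE_REF}"   # pass this to docker build -t ... and docker push
```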


&lt;p&gt;Or, in my case, use a &lt;a href="https://docs.docker.com/compose/"&gt;docker-compose&lt;/a&gt; YAML file to make life easier (I suppose).&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cat &lt;/span&gt;docker-compose-file.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```yaml
version: '3'
services:
hello_world_app:
    build: ../../app
    image: ghcr.io/mmphego/jaeger-tracing-example/jaeger-tracing-example:v2
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Then build the tagged image.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```bash
docker-compose -f deployment/docker/docker-compose-file.yaml build
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TE6SrBpg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114886764-1b4ead00-9e08-11eb-8a8c-447fa11b46ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TE6SrBpg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114886764-1b4ead00-9e08-11eb-8a8c-447fa11b46ad.png" alt="Screenshot from 2021-04-15 16-27-31" width="880" height="219"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;continues...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nPRHcz7L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114886762-1a1d8000-9e08-11eb-9800-4be28c28f7c0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nPRHcz7L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114886762-1a1d8000-9e08-11eb-9800-4be28c28f7c0.png" alt="Screenshot from 2021-04-15 16-27-15" width="880" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;After a successful build, we push the image to the registry.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;```bash
docker-compose -f deployment/docker/docker-compose-file.yaml push
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--slj3uj0Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114886768-1be74380-9e08-11eb-93ac-04c8b4a0f17f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--slj3uj0Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114886768-1be74380-9e08-11eb-93ac-04c8b4a0f17f.png" alt="Screenshot from 2021-04-15 16-27-46" width="880" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**Note**: The manual build and push steps will be used on GitHub Actions (so at this point ensure everything works 100%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ol start="5"&gt;
&lt;li&gt;
&lt;p&gt;After pushing the image you should see new package(s) on your profile under &lt;em&gt;Packages&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eNMIOswG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114886770-1c7fda00-9e08-11eb-89dd-18a381377757.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eNMIOswG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114886770-1c7fda00-9e08-11eb-89dd-18a381377757.png" alt="Screenshot from 2021-04-15 16-28-29" width="880" height="892"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Set up a &lt;a href="https://docs.github.com/en/actions"&gt;GitHub Actions&lt;/a&gt; workflow for auto-build and publish &lt;strong&gt;(Optional)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Optionally, set up a workflow environment in your repository/project for the CI/CD to build and publish your container to the GitHub container registry.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; .github/workflows
vim .github/workflows/docker-image-publisher.yaml
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Paste the following code snippet into your &lt;code&gt;docker-image-publisher.yaml&lt;/code&gt;. This workflow will build and push images on pull requests and master branches.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Docker Image CI&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="c1"&gt;# Run workflow manually (without waiting for the cron to be called), through the Github Actions Workflow page directly&lt;/span&gt;
&lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;master&lt;/span&gt;
&lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;branches&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;*'&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v2&lt;/span&gt;
        &lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;fetch-depth&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Build the Docker image&lt;/span&gt;
        &lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;export USERNAME=${{ github.repository_owner }}&lt;/span&gt;
        &lt;span class="s"&gt;docker-compose -f deployment/docker/docker-compose-file.yaml build&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Login and Push Docker images to GitHub container registry&lt;/span&gt;
        &lt;span class="s"&gt;run&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
        &lt;span class="s"&gt;export USERNAME=${{ github.repository_owner }}&lt;/span&gt;
        &lt;span class="s"&gt;echo "${{ secrets.DOCKER_PASSWORD }}" | docker login ghcr.io -u ${{ secrets.DOCKER_USERNAME }} --password-stdin&lt;/span&gt;
        &lt;span class="s"&gt;docker-compose -f deployment/docker/docker-compose-file.yaml push&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Run the GitHub Action build manually...&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GWzJaNTU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/116215157-b2d8c780-a747-11eb-9978-6a8073dc7cf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GWzJaNTU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/116215157-b2d8c780-a747-11eb-9978-6a8073dc7cf6.png" alt="image" width="880" height="358"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Alternatively, check out the &lt;a href="https://github.com/marketplace/actions/docker-build-push-action"&gt;Docker Build &amp;amp; Push Action&lt;/a&gt; or &lt;a href="https://github.com/marketplace/actions/build-and-push-docker-images"&gt;Build and push Docker images&lt;/a&gt; actions.&lt;/p&gt;

&lt;ol start="7"&gt;
&lt;li&gt;
&lt;p&gt;Configure Kubernetes to use your new container registry &lt;strong&gt;(Optional)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kubernetes supports a special type of secret that is used to fetch images for your pods from any container registry that requires authentication.&lt;/p&gt;

&lt;p&gt;Create a Kubernetes Secret, naming it &lt;code&gt;my-secret-docker-reg&lt;/code&gt; and providing credentials:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl create secret docker-registry my-secret-docker-reg &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--docker-server&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;https://ghcr.io &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--docker-username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GH_USERNAME&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--docker-password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;cat&lt;/span&gt;  ~/.secrets/github_docker_token&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--docker-email&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GH_EMAIL&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; docker-secret.yaml
    &lt;span class="c"&gt;# or  kubectl apply the output of an imperative command in one line&lt;/span&gt;
    &lt;span class="c"&gt;# --docker-email=${GH_EMAIL} -o yaml | kubectl apply -f -&lt;/span&gt;

&lt;span class="c"&gt;# You can then apply the file like any other Kubernetes 'yaml':&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; docker-secret.yaml
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Inspect the Secret: &lt;code&gt;my-secret-docker-reg&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secrets
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NoaOCmwO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114930522-67641680-9e35-11eb-9773-de6798f59f39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NoaOCmwO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/114930522-67641680-9e35-11eb-9773-de6798f59f39.png" alt="image" width="880" height="98"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
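&lt;p&gt;Creating the secret is only half the job: pods still need to reference it. Besides adding &lt;code&gt;imagePullSecrets&lt;/code&gt; to each pod spec, one documented shortcut is to patch it onto the namespace's default service account. A hedged sketch (the secret name is the one created above; run the commented command against your own cluster):&lt;/p&gt;

```shell
# Build the imagePullSecrets patch for the default service account.
SECRET_NAME="my-secret-docker-reg"   # name created in the step above
PATCH="{\"imagePullSecrets\": [{\"name\": \"${SECRET_NAME}\"}]}"
echo "${PATCH}"
# Apply it (needs kubectl and a kubeconfig):
#   kubectl patch serviceaccount default -p "${PATCH}"
```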

&lt;h1&gt;
  
  
  Final Result
&lt;/h1&gt;

&lt;p&gt;Successful pull from (then private) GitHub container registry!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IVtdzX8A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/116219872-6774e800-a74c-11eb-8abe-28ca0d2ee95a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IVtdzX8A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/116219872-6774e800-a74c-11eb-8abe-28ca0d2ee95a.png" alt="image" width="880" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Me:&lt;/strong&gt;&lt;br&gt;
After setting up my GitHub container registry and Kubernetes docker-secrets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://camo.githubusercontent.com/67e533d93d1b86cc782080c39700b8ceb37c61ff98aadaa58be0dbd5791aa0b1/68747470733a2f2f6d65646961312e74656e6f722e636f6d2f696d616765732f65303166313663633639623432613264333763383261353563366565363064632f74656e6f722e6769663f6974656d69643d3134373139343036" class="article-body-image-wrapper"&gt;&lt;img src="https://camo.githubusercontent.com/67e533d93d1b86cc782080c39700b8ceb37c61ff98aadaa58be0dbd5791aa0b1/68747470733a2f2f6d65646961312e74656e6f722e636f6d2f696d616765732f65303166313663633639623432613264333763383261353563366565363064632f74656e6f722e6769663f6974656d69643d3134373139343036" alt="" width="498" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Hopefully, you have learned something new in this post (and enjoyed the Denzel Washington gifs) and will consider using GitHub Container Registry to house your images, using GitHub Actions to build and push them to your GitHub Container Registry, and, finally, configuring your Kubernetes cluster to fetch your pods' images from it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Reference
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/packages/guides/migrating-to-github-container-registry-for-docker-images"&gt;Migrating to GitHub Container Registry for Docker images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/"&gt;Pull an Image from a Private Registry&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>github</category>
      <category>kubernetes</category>
      <category>containers</category>
    </item>
    <item>
      <title>How I Hardened The Security Of My Docker Environment</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Sun, 28 Mar 2021 08:40:30 +0000</pubDate>
      <link>https://dev.to/mmphego/how-i-hardened-the-security-of-my-docker-environment-3a60</link>
      <guid>https://dev.to/mmphego/how-i-hardened-the-security-of-my-docker-environment-3a60</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--w8WxZdVb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-03-28-How-I-hardened-the-security-of-my-Docker-environment.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--w8WxZdVb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-03-28-How-I-hardened-the-security-of-my-Docker-environment.png" alt="" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Story
&lt;/h1&gt;

&lt;p&gt;One thing I had never considered when working with containers was &lt;strong&gt;security&lt;/strong&gt; (yes, I know what you're thinking). I always assumed that because Docker provides a more isolated and robust environment for managing the SDLC (systems development life cycle) than traditional VMs, I was immune to security issues such as &lt;em&gt;container breakouts, wild images and DoS (Denial-of-Service) attacks&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;But the worst thing happened: I pulled and ran a random image from the wild, which made my workstation unusable because some process(es) running in the container consumed all of my memory and CPU. I had to force a system restart and lost the work I had open.&lt;br&gt;
I learned the hard way and ended up tweeting about the ordeal to warn others.&lt;/p&gt;


&lt;blockquote class="ltag__twitter-tweet"&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--w64oXUGk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/1294585510850105344/6DdFjpw0_normal.jpg" alt="Mpho Mphego profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        Mpho Mphego
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        @mphomphego
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ir1kO05j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      &lt;a href="https://twitter.com/hashtag/NoteToSelf"&gt;#NoteToSelf&lt;/a&gt;: When running untrusted containers from the wild always "use memory limit mechanisms" to prevent a denial of service from occurring. &lt;br&gt;FYI a container can use all of the memory on the host.&lt;br&gt;&lt;br&gt;I learned this the hard way.
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      04:22 AM - 27 Mar 2021
    &lt;/div&gt;


    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1375664477459337220" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFnoeFxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-reply-action-238fe0a37991706a6880ed13941c3efd6b371e4aefe288fe8e0db85250708bc4.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1375664477459337220" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k6dcrOn8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-retweet-action-632c83532a4e7de573c5c08dbb090ee18b348b13e2793175fea914827bc42046.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/like?tweet_id=1375664477459337220" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SRQc9lOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-like-action-1ea89f4b87c7d37465b0eb78d51fcb7fe6c03a089805d7ea014ba71365be5171.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;


&lt;p&gt;This experience sent me down the rabbit hole of researching ways to harden the security of my Docker environment. In this post, I will detail some of the things everyone should know when working with Docker.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Audit your environment, don't run containers as Root and always keep your system up-to-date.&lt;/p&gt;

&lt;h1&gt;
  
  
  The How
&lt;/h1&gt;

&lt;p&gt;There are several ways to improve the security of your Docker environment.&lt;/p&gt;

&lt;h2&gt;
  
  
  Harden Your System/Host/Server
&lt;/h2&gt;

&lt;p&gt;Your Docker environment is only as secure as the host it runs on: if the host is compromised, the Docker environment will be too. &lt;br&gt;
Always ensure that your host system (OS, kernel, packages) is up-to-date.&lt;/p&gt;

&lt;p&gt;It is also a good idea to run a system security audit; for UNIX-based systems there's a tool called &lt;a href="https://github.com/CISOfy/Lynis"&gt;Lynis&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;According to the docs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lynis is a security auditing tool for systems based on UNIX like Linux, macOS, BSD, and others. It performs an in-depth security scan and runs on the system itself. The primary goal is to test security defences and provide tips for further system hardening. It will also scan for general system information, vulnerable software packages, and possible configuration issues. Lynis was commonly used by system administrators and auditors to assess the security defences of their systems. Besides the "blue team," nowadays penetration testers also have Lynis in their toolkit.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To run a system audit, clone/download &lt;code&gt;lynis&lt;/code&gt; and run the script (no compilation or installation required):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/CISOfy/lynis
&lt;span class="nb"&gt;cd &lt;/span&gt;lynis&lt;span class="p"&gt;;&lt;/span&gt; ./lynis audit system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It usually takes a few seconds to complete, and upon completion, you should see some recommended remediations similar to the ones pictured below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EB74JJ2Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/112767180-75ccc880-9015-11eb-93ac-1a4f30db204c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EB74JJ2Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/112767180-75ccc880-9015-11eb-93ac-1a4f30db204c.png" alt="image" width="880" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Avoid Running Containers As Root
&lt;/h2&gt;


&lt;blockquote class="ltag__twitter-tweet"&gt;
      &lt;div class="ltag__twitter-tweet__media ltag__twitter-tweet__media__video-wrapper"&gt;
        &lt;div class="ltag__twitter-tweet__media--video-preview"&gt;
          &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_BOA6G3A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/tweet_video_thumb/ExmAwVRXIAAjSzK.jpg" alt="unknown tweet media content"&gt;
          &lt;img src="/assets/play-butt.svg" class="ltag__twitter-tweet__play-butt" alt="Play butt"&gt;
        &lt;/div&gt;
        &lt;div class="ltag__twitter-tweet__video"&gt;
          
            
          
        &lt;/div&gt;
      &lt;/div&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--w64oXUGk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/1294585510850105344/6DdFjpw0_normal.jpg" alt="Mpho Mphego profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        Mpho Mphego
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        @mphomphego
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ir1kO05j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      Friends don't let friends run containers as root. 
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      20:39 PM - 28 Mar 2021
    &lt;/div&gt;


    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1376272738126561289" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFnoeFxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-reply-action-238fe0a37991706a6880ed13941c3efd6b371e4aefe288fe8e0db85250708bc4.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1376272738126561289" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k6dcrOn8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-retweet-action-632c83532a4e7de573c5c08dbb090ee18b348b13e2793175fea914827bc42046.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/like?tweet_id=1376272738126561289" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SRQc9lOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-like-action-1ea89f4b87c7d37465b0eb78d51fcb7fe6c03a089805d7ea014ba71365be5171.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;


&lt;p&gt;By default (&lt;em&gt;I think&lt;/em&gt;), Docker runs containers as root, meaning processes inside the container have root privileges.&lt;/p&gt;

&lt;p&gt;Remediation: update your &lt;code&gt;Dockerfile&lt;/code&gt; to add an unprivileged user, similar to the example below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add a user&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;groupadd &lt;span class="nt"&gt;-r&lt;/span&gt; vino &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    useradd &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="nt"&gt;-s&lt;/span&gt; /bin/bash &lt;span class="nt"&gt;-r&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; vino &lt;span class="nt"&gt;-G&lt;/span&gt; audio,video vino &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; /app &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="nb"&gt;chown&lt;/span&gt; &lt;span class="nt"&gt;-R&lt;/span&gt; vino:vino /app

&lt;span class="k"&gt;RUN &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"vino ALL=(ALL) NOPASSWD: ALL"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; /etc/sudoers
&lt;span class="k"&gt;RUN &lt;/span&gt;visudo &lt;span class="nt"&gt;--c&lt;/span&gt;
&lt;span class="k"&gt;USER&lt;/span&gt;&lt;span class="s"&gt; vino&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then run a container as that user:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;docker run --user &amp;lt;user&amp;gt;[:&amp;lt;group&amp;gt;] -ti &amp;lt;image&amp;gt; /bin/bash&lt;/code&gt;&lt;/p&gt;
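&lt;p&gt;If you don't control the &lt;code&gt;Dockerfile&lt;/code&gt;, you can still pass an unprivileged user on the command line. A minimal sketch (the image name is a placeholder; adjust to your setup):&lt;/p&gt;

```shell
# Build a --user flag from the current host user so the container runs
# unprivileged and files written to bind mounts keep your ownership.
# IMAGE is a placeholder; substitute a real image name.
uid=$(id -u)
gid=$(id -g)
IMAGE=some/image
echo "docker run --user ${uid}:${gid} -ti ${IMAGE} id"
# Running the printed command: `id` inside the container should report
# your UID/GID instead of root (uid=0).
```

This is handy for one-off runs of images you trust but whose `Dockerfile` you can't edit.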

&lt;h2&gt;
  
  
  Set Resource Limits for Images and Containers
&lt;/h2&gt;

&lt;p&gt;As mentioned at the beginning of this post, I had a container that, when run, caused my computer to become unresponsive due to high CPU and memory usage. This led me down the rabbit hole of understanding how to avoid the same issue happening again.&lt;/p&gt;

&lt;p&gt;To ensure your computer/host/server does not get DoS'ed, you should limit the system resources each container and image can consume. Limiting these resources also reduces the damage a compromised container can do.&lt;/p&gt;


&lt;blockquote class="ltag__twitter-tweet"&gt;
      &lt;div class="ltag__twitter-tweet__media"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v8_OvUtY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/media/ExjHHUpWQAEKSHL.jpg" alt="unknown tweet media content"&gt;
      &lt;/div&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--w64oXUGk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/1294585510850105344/6DdFjpw0_normal.jpg" alt="Mpho Mphego profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        Mpho Mphego
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        @mphomphego
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ir1kO05j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      &lt;a href="https://twitter.com/hashtag/Docker"&gt;#Docker&lt;/a&gt; image building best practices (IMHO)&lt;br&gt;&lt;br&gt;- Preset memory amount the image will use&lt;br&gt;- Force it to always fetch new dependencies (avoid using legacy dependencies) 
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      07:11 AM - 28 Mar 2021
    &lt;/div&gt;


    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1376069447031664640" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fFnoeFxk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-reply-action-238fe0a37991706a6880ed13941c3efd6b371e4aefe288fe8e0db85250708bc4.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1376069447031664640" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--k6dcrOn8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-retweet-action-632c83532a4e7de573c5c08dbb090ee18b348b13e2793175fea914827bc42046.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/like?tweet_id=1376069447031664640" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SRQc9lOp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev.to/assets/twitter-like-action-1ea89f4b87c7d37465b0eb78d51fcb7fe6c03a089805d7ea014ba71365be5171.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;


&lt;p&gt;You can also limit the memory and CPU a container uses at run time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--restart&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;on-failure:5 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--memory&lt;/span&gt; 256mb &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cpus&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"1.5"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-p&lt;/span&gt; 4000:4000 &lt;span class="se"&gt;\&lt;/span&gt;
    &amp;lt;image_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the container with these restrictions caps its memory usage at 256 MB and limits it to at most one and a half CPUs, which should be sufficient for most applications (assuming one application per container). Should the application fail, it will be restarted at most 5 times before Docker gives up.&lt;/p&gt;
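&lt;p&gt;To confirm the limits were actually applied, Docker exposes them under &lt;code&gt;HostConfig&lt;/code&gt; in &lt;code&gt;docker inspect&lt;/code&gt;: memory in bytes, and CPUs as "NanoCPUs" (CPUs multiplied by 10&lt;sup&gt;9&lt;/sup&gt;). A quick sketch of the expected values for the flags above (the final &lt;code&gt;docker inspect&lt;/code&gt; line is left commented since it needs a running container):&lt;/p&gt;

```shell
# 256mb and --cpus="1.5" as Docker stores them internally:
mem_bytes=$((256 * 1024 * 1024))   # --memory 256mb, in bytes
nano_cpus=$((15 * 100000000))      # --cpus=1.5, in NanoCPUs
echo "Memory=${mem_bytes} NanoCpus=${nano_cpus}"
# prints: Memory=268435456 NanoCpus=1500000000
# Compare against a live container (CONTAINER_NAME is a placeholder):
# docker inspect --format '{{.HostConfig.Memory}} {{.HostConfig.NanoCpus}}' CONTAINER_NAME
```

A value of `0` in either field means the container is running with no limit at all.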

&lt;h2&gt;
  
  
  CIS Benchmarks Auditing
&lt;/h2&gt;

&lt;p&gt;As you develop an image for your Docker container, you need to build, test, verify and harden it; this is where the &lt;a href="https://www.cisecurity.org/benchmark/docker/"&gt;CIS (Center for Internet Security) Docker Benchmark&lt;/a&gt; comes in. &lt;/p&gt;

&lt;p&gt;The CIS Docker Benchmark establishes an authoritative hardening guide for Docker across its core attack surfaces: the Docker client, host, and registry.&lt;/p&gt;

&lt;p&gt;There are currently two tools (that I know of) that are great for running Docker security audits.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;docker-bench&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="//https:/&amp;lt;br&amp;gt;%0A/github.com/aquasecurity/docker-bench"&gt;docker-bench&lt;/a&gt; is a detection tool (not an enforcement tool) written in Go that checks whether Docker is deployed according to security best practices documented in the &lt;a href="https://www.cisecurity.org/benchmark/docker/"&gt;CIS (Center for Internet Security) Docker Benchmark&lt;/a&gt; (&lt;a href="https://paper.bobylive.com/Security/CIS/CIS_Docker_Benchmark_v1_2_0.pdf"&gt;Download report&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;To install the tool, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/aquasecurity/docker-bench
&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="nv"&gt;$GOPATH&lt;/span&gt;/src/github.com/aquasecurity/docker-bench
go build &lt;span class="nt"&gt;-o&lt;/span&gt; docker-bench &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run the analysis and only review failed checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./docker-bench &lt;span class="nt"&gt;--include-test-output&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;FAIL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see output similar to the one below, listing the identified findings that need remediation. Remediation is usually a manual process, and the actual steps will vary depending on the specific attack surface you choose to harden.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4lNG9ABf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/112748270-ff9b7800-8fba-11eb-8c21-b5355c82c23d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4lNG9ABf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/112748270-ff9b7800-8fba-11eb-8c21-b5355c82c23d.png" alt="image" width="880" height="687"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose we want to remedy &lt;strong&gt;4.5 Ensure Content trust for Docker is Enabled&lt;/strong&gt;.&lt;br&gt;
We would follow the instructions listed on page 128 of the CIS Docker Benchmark, as shown in the snippet below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XmSi3KxM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/112763934-c1c44100-9006-11eb-9385-3c64c8657866.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XmSi3KxM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/112763934-c1c44100-9006-11eb-9385-3c64c8657866.png" alt="image" width="676" height="837"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, open the &lt;a href="https://paper.bobylive.com/Security/CIS/CIS_Docker_Benchmark_v1_2_0.pdf"&gt;CIS Docker Benchmark&lt;/a&gt; document for recommended remediation/hardening tips.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;code&gt;Docker Bench for Security&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;A tool similar to &lt;code&gt;docker-bench&lt;/code&gt; was developed by the Docker team; it also analyses containers and images for potential security risks. It is a great alternative, since it is written and maintained by the creators of Docker.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/docker/docker-bench-security"&gt;Docker Bench for Security&lt;/a&gt; is a script that checks for dozens of common best practices around deploying Docker containers in production. The tests are all automated and are inspired by the &lt;a href="https://paper.bobylive.com/Security/CIS/CIS_Docker_Benchmark_v1_2_0.pdf"&gt;CIS Docker Benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As opposed to &lt;code&gt;docker-bench&lt;/code&gt;, which is a Go package that needs to be built, Docker Bench for Security is packaged as a small container. However, this container runs with extensive privileges: it shares the host's network, PID, and user namespaces, and mounts parts of the host's filesystem and the Docker socket (read-only).&lt;/p&gt;

&lt;p&gt;Run the analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;--net&lt;/span&gt; host &lt;span class="nt"&gt;--pid&lt;/span&gt; host &lt;span class="nt"&gt;--userns&lt;/span&gt; host &lt;span class="nt"&gt;--cap-add&lt;/span&gt; audit_control &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;DOCKER_CONTENT_TRUST&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$DOCKER_CONTENT_TRUST&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; /etc:/etc:ro &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; /usr/bin/containerd:/usr/bin/containerd:ro &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; /usr/bin/runc:/usr/bin/runc:ro &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; /usr/lib/systemd:/usr/lib/systemd:ro &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; /var/lib:/var/lib:ro &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-v&lt;/span&gt; /var/run/docker.sock:/var/run/docker.sock:ro &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--label&lt;/span&gt; docker_bench_security &lt;span class="se"&gt;\&lt;/span&gt;
    docker/docker-bench-security
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If all went well, you should see output similar to the one below, listing the identified findings that need remediation. Again, remediation is usually a manual process, and the actual steps will vary depending on the specific attack surface you choose to harden.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QzG0ElBn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/112764649-da822600-9009-11eb-8568-4a31c53e7eda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QzG0ElBn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/112764649-da822600-9009-11eb-8568-4a31c53e7eda.png" alt="image" width="880" height="753"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't Use Images From The Wild
&lt;/h2&gt;

&lt;p&gt;Last but not least: if you can, try not to use images from the wild. Alternatively, vet their &lt;code&gt;Dockerfile&lt;/code&gt; if it's available, and then build your own image from it.&lt;/p&gt;

&lt;p&gt;Another option to consider is to enable the &lt;a href="https://docs.docker.com/engine/security/trust/"&gt;Docker Content Trust&lt;/a&gt; feature which is disabled by default.&lt;/p&gt;

&lt;p&gt;To enable it, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"export DOCKER_CONTENT_TRUST=1"&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this set, Docker will refuse to pull images that are not signed by a genuine publisher.&lt;/p&gt;
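&lt;p&gt;If you'd rather not enable it permanently via &lt;code&gt;~/.bashrc&lt;/code&gt;, you can turn on content trust for the current shell session only (a sketch; the commented pull uses a placeholder image name):&lt;/p&gt;

```shell
# Enable Docker Content Trust for this shell session only.
export DOCKER_CONTENT_TRUST=1
echo "DOCKER_CONTENT_TRUST=${DOCKER_CONTENT_TRUST}"
# With this set, pulls of unsigned images are refused, e.g.:
# docker pull some/unsigned-image   # placeholder; refused if unsigned
```

You can also override it per command with `DOCKER_CONTENT_TRUST=0 docker pull ...` when you knowingly need an unsigned image.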

&lt;h1&gt;
  
  
  Reference
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.docker.com/config/containers/resource_constraints/"&gt;Docker: Runtime options with Memory, CPUs, and GPUs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.securecoding.com/blog/best-practices-for-docker-security-for-2020/"&gt;Best Practices for Docker Security For 2020&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/devgorilla/docker-bench-for-security-f1cbb9edd12d"&gt;Docker Bench for Security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.digitalocean.com/community/tutorials/how-to-audit-docker-host-security-with-docker-bench-for-security-on-ubuntu-16-04"&gt;How To Audit Docker Host Security with Docker Bench for Security on Ubuntu 16.04&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>docker</category>
      <category>security</category>
      <category>linux</category>
    </item>
    <item>
      <title>How To Fork A Subdirectory Of Repo As A Different Repo On GitHub</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Sun, 07 Feb 2021 15:15:48 +0000</pubDate>
      <link>https://dev.to/mmphego/how-to-fork-a-subdirectory-of-repo-as-a-different-repo-on-github-4ob4</link>
      <guid>https://dev.to/mmphego/how-to-fork-a-subdirectory-of-repo-as-a-different-repo-on-github-4ob4</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3w6vN7V0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-02-07-How-to-fork-a-subdirectory-of-repo-as-a-different-repo-on-GitHub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3w6vN7V0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-02-07-How-to-fork-a-subdirectory-of-repo-as-a-different-repo-on-GitHub.png" alt="" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Story
&lt;/h1&gt;

&lt;p&gt;Ever wanted to fork a subdirectory rather than a whole Git/GitHub repository? Well, I have. I recently needed to work on a subdirectory of a repository without forking the whole thing. In this post, I will show you how it's done.&lt;/p&gt;

&lt;p&gt;Note: I do not think you can fork subdirectories through GitHub's web interface.&lt;/p&gt;

&lt;h1&gt;
  
  
  The How
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Clone the repo
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/&amp;lt;someones-username&amp;gt;/&amp;lt;some-repo-you-want-to-fork&amp;gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;some-repo-you-want-to-fork
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Create a branch using the &lt;code&gt;git subtree&lt;/code&gt; command for the folder only
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git subtree &lt;span class="nb"&gt;split&lt;/span&gt; &lt;span class="nt"&gt;--prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;./src &lt;span class="nt"&gt;-b&lt;/span&gt; dir-you-want-to-fork
git checkout dir-you-want-to-fork
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
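&lt;p&gt;To convince yourself of what &lt;code&gt;git subtree split&lt;/code&gt; produces, you can try it on a throwaway repository first. A self-contained sketch (all names and paths are illustrative):&lt;/p&gt;

```shell
# Create a disposable repo with a src/ directory, split src out into a
# branch, and confirm that branch contains only src's contents at its root.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
mkdir src
echo hello > src/file.txt
echo top > README.md
git add .
git commit -qm "initial"
git subtree split --prefix=src -b only-src >/dev/null
git ls-tree --name-only only-src
```

The last command should list only `file.txt`; `README.md` from the repository root is gone, because the split branch's history is rewritten relative to `src/`.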



&lt;h2&gt;
  
  
  Create a new GitHub repo
&lt;/h2&gt;

&lt;p&gt;Head over to GitHub and create a new repository you wish to fork the directory to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Add the newly created repo as a remote
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;some-repo-you-want-to-fork
git remote set-url origin https://github.com/&amp;lt;username&amp;gt;/&amp;lt;new_repo&amp;gt;.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Push the subtree to the new repository
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git fetch origin &lt;span class="nt"&gt;-pa&lt;/span&gt;
git push &lt;span class="nt"&gt;-u&lt;/span&gt; origin dir-you-want-to-fork
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Fetch all remote branches in the new repository
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/&amp;lt;username&amp;gt;/&amp;lt;new_repo&amp;gt;.git
&lt;span class="nb"&gt;cd &lt;/span&gt;new_repo
git checkout &lt;span class="nt"&gt;--detach&lt;/span&gt;
git fetch origin &lt;span class="s1"&gt;'+refs/heads/*:refs/heads/*'&lt;/span&gt;
git checkout dir-you-want-to-fork
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have a "fork" of the &lt;code&gt;src&lt;/code&gt; subdirectory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Merge to main/dev branch (troubleshooting)
&lt;/h2&gt;

&lt;p&gt;If you ever run &lt;code&gt;git merge master&lt;/code&gt; and get the error &lt;strong&gt;fatal: refusing to merge unrelated histories&lt;/strong&gt;, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git checkout dir-you-want-to-fork
git merge &lt;span class="nt"&gt;--allow-unrelated-histories&lt;/span&gt; master
&lt;span class="c"&gt;# Fix conflicts and&lt;/span&gt;
git commit &lt;span class="nt"&gt;-a&lt;/span&gt;
git push origin dir-you-want-to-fork
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Reference
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/github/using-git/splitting-a-subfolder-out-into-a-new-repository"&gt;Splitting a subfolder out into a new repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://stackoverflow.com/questions/24577084/forking-a-sub-directory-of-a-repository-on-github-and-making-it-part-of-my-own-r#24577293"&gt;StackOverflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>git</category>
      <category>github</category>
      <category>fork</category>
    </item>
    <item>
      <title>Install Prometheus &amp; Grafana With Helm 3 On Kubernetes Cluster Running On Vagrant VM</title>
      <dc:creator>Mpho Mphego</dc:creator>
      <pubDate>Mon, 01 Feb 2021 04:08:40 +0000</pubDate>
      <link>https://dev.to/mmphego/install-prometheus-grafana-with-helm-3-on-kubernetes-cluster-running-on-vagrant-vm-gbf</link>
      <guid>https://dev.to/mmphego/install-prometheus-grafana-with-helm-3-on-kubernetes-cluster-running-on-vagrant-vm-gbf</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eXggcRlW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-02-01-Install-Prometheus-and-Grafana-with-helm-3-on-Kubernetes-cluster-running-on-Vagrant-VM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eXggcRlW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://blog.mphomphego.co.za/assets/2021-02-01-Install-Prometheus-and-Grafana-with-helm-3-on-Kubernetes-cluster-running-on-Vagrant-VM.png" alt="" width="880" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Story
&lt;/h1&gt;

&lt;p&gt;We would like to install the monitoring tools &lt;a href="https://prometheus.io/docs/introduction/overview/"&gt;Prometheus&lt;/a&gt; and &lt;a href="https://grafana.com/"&gt;Grafana&lt;/a&gt; with &lt;a href="https://v3.helm.sh/"&gt;Helm 3&lt;/a&gt; on our local machine/VM running a &lt;a href="https://kubernetes.io/"&gt;Kubernetes&lt;/a&gt; cluster.&lt;/p&gt;

&lt;p&gt;In this post, we will go through the procedure of deploying Prometheus and Grafana in a Kubernetes Cluster.&lt;/p&gt;

&lt;h1&gt;
  
  
  The How
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;For this application, we need a Kubernetes cluster running locally that we can interface with via &lt;code&gt;kubectl&lt;/code&gt;. The list below shows the tools we'll need to get our environment set up properly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.vagrantup.com/docs/installation"&gt;Vagrant&lt;/a&gt; to provision the virtual machine,&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.virtualbox.org/wiki/Downloads"&gt;VirtualBox&lt;/a&gt; as the provider,&lt;/li&gt;
&lt;li&gt;&lt;a href="https://k3s.io/"&gt;K3s&lt;/a&gt; as the lightweight Kubernetes distribution, and&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rancher.com/docs/rancher/v2.x/en/cluster-admin/cluster-access/kubectl/"&gt;&lt;code&gt;kubectl&lt;/code&gt;&lt;/a&gt; to interface with the cluster.&lt;/li&gt;
&lt;/ul&gt;
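&lt;p&gt;Before provisioning anything, it can save time to confirm these CLIs are on your &lt;code&gt;PATH&lt;/code&gt;. A minimal sketch (the tool names are assumptions based on the list above; adjust to what you actually install):&lt;/p&gt;

```shell
# Check that each prerequisite CLI is installed and on PATH
for tool in vagrant VBoxManage kubectl helm; do
  if command -v "$tool" >/dev/null; then
    echo "$tool: found"
  else
    echo "$tool: MISSING - install it before continuing"
  fi
done
```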

&lt;h1&gt;
  
  
  The Walk-through
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;The full Vagrant configuration is shown below. Vagrant leverages VirtualBox, which loads an &lt;a href="https://www.opensuse.org/"&gt;openSUSE&lt;/a&gt; image and automatically installs the OS dependencies, &lt;a href="https://k3s.io/"&gt;K3s&lt;/a&gt; and &lt;a href="https://v3.helm.sh/"&gt;helm&lt;/a&gt;. Some useful Vagrant commands can be found in &lt;a href="https://gist.github.com/wpscholar/a49594e2e2b918f4d0c4"&gt;this cheatsheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Running &lt;code&gt;cat Vagrantfile&lt;/code&gt; shows the config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# -*- mode: ruby -*-&lt;/span&gt;
&lt;span class="c1"&gt;# vi: set ft=ruby :&lt;/span&gt;
&lt;span class="n"&gt;default_box&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"opensuse/Leap-15.2.x86_64"&lt;/span&gt;
&lt;span class="n"&gt;box_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"15.2.31.309"&lt;/span&gt;
&lt;span class="c1"&gt;# The "2" in `Vagrant.configure` configures the configuration version (we &lt;/span&gt;
&lt;span class="c1"&gt;# support older styles for backwards compatibility). Please don't change it # # unless you know what you're doing.&lt;/span&gt;
&lt;span class="no"&gt;Vagrant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"2"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="c1"&gt;# The most common configuration options are documented and commented on below.&lt;/span&gt;
  &lt;span class="c1"&gt;# For a complete reference, please see the online documentation at&lt;/span&gt;
  &lt;span class="c1"&gt;# https://docs.vagrantup.com.&lt;/span&gt;

  &lt;span class="c1"&gt;# Every Vagrant development environment requires a box. You can search for&lt;/span&gt;
  &lt;span class="c1"&gt;# boxes at https://vagrantcloud.com/search.&lt;/span&gt;

  &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;define&lt;/span&gt; &lt;span class="s2"&gt;"master"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;box&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;default_box&lt;/span&gt;
    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;box_version&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;box_version&lt;/span&gt;
    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hostname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"master"&lt;/span&gt;
    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;network&lt;/span&gt; &lt;span class="s1"&gt;'private_network'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;ip: &lt;/span&gt;&lt;span class="s2"&gt;"192.168.33.10"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="ss"&gt;virtualbox__intnet: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;network&lt;/span&gt; &lt;span class="s2"&gt;"forwarded_port"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;guest: &lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;host: &lt;/span&gt;&lt;span class="mi"&gt;2222&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;id: &lt;/span&gt;&lt;span class="s2"&gt;"ssh"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;disabled: &lt;/span&gt;&lt;span class="kp"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;network&lt;/span&gt; &lt;span class="s2"&gt;"forwarded_port"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;guest: &lt;/span&gt;&lt;span class="mi"&gt;22&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;host: &lt;/span&gt;&lt;span class="mi"&gt;2000&lt;/span&gt; &lt;span class="c1"&gt;# Master Node SSH&lt;/span&gt;
    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;network&lt;/span&gt; &lt;span class="s2"&gt;"forwarded_port"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;guest: &lt;/span&gt;&lt;span class="mi"&gt;6443&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;host: &lt;/span&gt;&lt;span class="mi"&gt;6443&lt;/span&gt; &lt;span class="c1"&gt;# API Access&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="nb"&gt;p&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="mi"&gt;30100&lt;/span&gt; &lt;span class="c1"&gt;# expose NodePort IP's&lt;/span&gt;
      &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;network&lt;/span&gt; &lt;span class="s2"&gt;"forwarded_port"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;guest: &lt;/span&gt;&lt;span class="nb"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;host: &lt;/span&gt;&lt;span class="nb"&gt;p&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;protocol: &lt;/span&gt;&lt;span class="s2"&gt;"tcp"&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;
    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"virtualbox"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
      &lt;span class="c1"&gt;# v.memory = "3072"&lt;/span&gt;
      &lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;memory&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2048"&lt;/span&gt;
      &lt;span class="n"&gt;vb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"k3s"&lt;/span&gt;
    &lt;span class="k"&gt;end&lt;/span&gt;

    &lt;span class="n"&gt;master&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;provision&lt;/span&gt; &lt;span class="s2"&gt;"shell"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;inline: &lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;lt;-&lt;/span&gt;&lt;span class="no"&gt;SHELL&lt;/span&gt;&lt;span class="sh"&gt;
      echo "******** Installing dependencies ********"
      sudo zypper refresh
      sudo zypper --non-interactive install bzip2
      sudo zypper --non-interactive install etcd
      sudo zypper --non-interactive install lsof

      echo "******** Begin installing k3s ********"
      curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.19.2+k3s1 K3S_KUBECONFIG_MODE="644" sh -
      echo "******** End installing k3s ********"

      echo "******** Begin installing helm ********"
      curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 | bash
      echo "******** End installing helm ********"
&lt;/span&gt;&lt;span class="no"&gt;    SHELL&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the following command will start up the virtual machine and install the relevant dependencies: &lt;code&gt;vagrant up&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install Prometheus with Helm 3&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Let's &lt;code&gt;ssh&lt;/code&gt; into our freshly baked VM: &lt;code&gt;vagrant ssh&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Let's create a namespace &lt;code&gt;monitoring&lt;/code&gt; for bundling all monitoring tools: &lt;code&gt;kubectl create namespace monitoring&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install &lt;code&gt;Prometheus&lt;/code&gt; using &lt;code&gt;helm 3&lt;/code&gt; on the &lt;code&gt;monitoring&lt;/code&gt; namespace&lt;br&gt;
&lt;em&gt;Helm&lt;/em&gt; is a popular package manager for Kubernetes (think &lt;code&gt;apt&lt;/code&gt; for &lt;code&gt;Ubuntu&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt; for &lt;code&gt;Python&lt;/code&gt;). It uses a templating language to make it easier to package, install, and update the multiple Kubernetes resources that make up a single application.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add stable https://charts.helm.sh/stable
helm repo update
&lt;span class="c"&gt;# Use k3s config file, normally this would be in `~/.kube/config`&lt;/span&gt;
helm &lt;span class="nb"&gt;install &lt;/span&gt;prometheus prometheus-community/kube-prometheus-stack &lt;span class="nt"&gt;--namespace&lt;/span&gt; monitoring &lt;span class="nt"&gt;--kubeconfig&lt;/span&gt; /etc/rancher/k3s/k3s.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the installation was successful you should be able to see &lt;strong&gt;6&lt;/strong&gt; running pods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alertmanager: This allows us to create and route alerts from Prometheus&lt;/li&gt;
&lt;li&gt;Operator: This manages the Prometheus deployment itself&lt;/li&gt;
&lt;li&gt;Node exporter: This is responsible for collecting metrics from the nodes&lt;/li&gt;
&lt;li&gt;Grafana and other metrics tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;kubectl get pods --namespace=monitoring&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bzsb_YOs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104797733-080c5100-57c9-11eb-96ac-348502975c13.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bzsb_YOs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104797733-080c5100-57c9-11eb-96ac-348502975c13.png" alt="image" width="880" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and,&lt;br&gt;
&lt;code&gt;helm ls --namespace monitoring&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kJkVpuoN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104797759-3be77680-57c9-11eb-9c38-5bd7e9e8ac15.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kJkVpuoN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104797759-3be77680-57c9-11eb-9c38-5bd7e9e8ac15.png" alt="image" width="880" height="58"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once everything is up and running we need to access &lt;em&gt;Grafana&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;It is highly advisable to use some kind of ingress, such as &lt;a href="https://kubernetes.github.io/ingress-nginx/"&gt;NGINX&lt;/a&gt;, to expose the services to the world.&lt;/p&gt;

&lt;p&gt;But for testing purposes, we can use either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kubectl port-forward&lt;/code&gt; or,&lt;/li&gt;
&lt;li&gt;Expose pods with &lt;a href="https://kubernetes.io/docs/concepts/services-networking/service/#nodeport"&gt;&lt;strong&gt;NodePort&lt;/strong&gt; service&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are simple ways of forwarding a Kubernetes service's port to a local port on your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NOTE:&lt;/strong&gt; This is something you would never do in production but would regularly do in testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Port-forwarding with &lt;code&gt;kubectl port-forward&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl port-forward prometheus-prometheus-kube-prometheus-prometheus-0 --address 0.0.0.0 3000:80 -n monitoring&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In my case, this was never successful and I had to opt for the second option.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Port-forwarding with NodePort service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Retrieve all services running on the &lt;code&gt;monitoring&lt;/code&gt; namespace&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vagrant@master:~&amp;gt; kubectl get svc &lt;span class="nt"&gt;--namespace&lt;/span&gt; monitoring

NAME                                      TYPE        CLUSTER-IP     EXTERNAL-IP   PORT&lt;span class="o"&gt;(&lt;/span&gt;S&lt;span class="o"&gt;)&lt;/span&gt;                      AGE
prometheus-kube-prometheus-prometheus     ClusterIP   10.43.27.175   &amp;lt;none&amp;gt;        9090/TCP                     40m
prometheus-kube-prometheus-alertmanager   ClusterIP   10.43.27.184   &amp;lt;none&amp;gt;        9093/TCP                     40m
prometheus-prometheus-node-exporter       ClusterIP   10.43.53.226   &amp;lt;none&amp;gt;        9100/TCP                     40m
prometheus-kube-state-metrics             ClusterIP   10.43.94.157   &amp;lt;none&amp;gt;        8080/TCP                     40m
alertmanager-operated                     ClusterIP   None           &amp;lt;none&amp;gt;        9093/TCP,9094/TCP,9094/UDP   40m
prometheus-operated                       ClusterIP   None           &amp;lt;none&amp;gt;        9090/TCP                     40m
prometheus-kube-prometheus-operator       ClusterIP   10.43.242.43   &amp;lt;none&amp;gt;        443/TCP                      40m
prometheus-grafana                        ClusterIP   10.43.31.19    &amp;lt;none&amp;gt;        80/TCP                       40m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will need to make some modifications to the &lt;code&gt;prometheus-grafana&lt;/code&gt; YAML config so that you can access Grafana from your local machine.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;kubectl edit svc --namespace monitoring prometheus-grafana&lt;/code&gt; and make the following changes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace &lt;code&gt;type: ClusterIP&lt;/code&gt; with &lt;code&gt;type: NodePort&lt;/code&gt;, and&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;nodePort&lt;/code&gt; to a port in the range &lt;code&gt;30000 - 30100&lt;/code&gt;, as defined in the &lt;code&gt;Vagrantfile&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do the same for &lt;code&gt;prometheus-operator&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubectl edit svc --namespace monitoring prometheus-kube-prometheus-operator&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Verify that the services were updated; you should now see the service type as &lt;code&gt;NodePort&lt;/code&gt; together with the exposed/forwarded ports.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YpeZZkSR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104798447-908df000-57cf-11eb-8613-05861105ccb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YpeZZkSR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104798447-908df000-57cf-11eb-8613-05861105ccb8.png" alt="image" width="880" height="151"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, you can patch the service config. Read more &lt;a href="https://stackoverflow.com/a/51559833"&gt;here&lt;/a&gt;.&lt;/p&gt;
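&lt;p&gt;If you prefer patching over &lt;code&gt;kubectl edit&lt;/code&gt;, the same change can be applied non-interactively. A sketch, assuming the service names from this walkthrough; the &lt;code&gt;nodePort&lt;/code&gt; value &lt;code&gt;30100&lt;/code&gt; is just one choice from the forwarded range:&lt;/p&gt;

```shell
# Switch the Grafana service to NodePort and pin the nodePort in one command.
# This is a strategic-merge patch: the ports entry is matched on "port",
# so only the type and nodePort fields change.
kubectl patch svc prometheus-grafana --namespace monitoring \
  --patch '{"spec": {"type": "NodePort", "ports": [{"port": 80, "nodePort": 30100}]}}'
```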

&lt;p&gt;Verify that you can access Grafana on localhost through port &lt;code&gt;30100&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y6QHOhze--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104797296-b1514800-57c5-11eb-8c81-257d17eb4d56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y6QHOhze--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104797296-b1514800-57c5-11eb-8c81-257d17eb4d56.png" alt="image" width="880" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, check out more details &lt;a href="https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/"&gt;on best practices when accessing Applications in a Cluster.&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Access Grafana
&lt;/h1&gt;

&lt;p&gt;If the installation was successful, we should be able to access Grafana from our local system, thanks to &lt;a href="https://portforward.com/"&gt;port-forwarding&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; When installing via the Prometheus Helm chart, the default Grafana login is the username &lt;code&gt;admin&lt;/code&gt; with password &lt;code&gt;prom-operator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y6QHOhze--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104797296-b1514800-57c5-11eb-8c81-257d17eb4d56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y6QHOhze--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104797296-b1514800-57c5-11eb-8c81-257d17eb4d56.png" alt="image" width="880" height="284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--l-oG11Su--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104796998-85cd5e00-57c3-11eb-9a83-ab35b5b72baf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--l-oG11Su--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104796998-85cd5e00-57c3-11eb-9a83-ab35b5b72baf.png" alt="image" width="880" height="474"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Troubleshooting
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Vagrant cannot forward the specified ports on this VM&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Vagrant cannot forward the specified ports on this VM, since they
would collide with another VirtualBox virtual machine&lt;span class="s1"&gt;'s forwarded
ports! The forwarded port to 4567 is already in use on the host
machine.

To fix this, modify your current projects Vagrantfile to use another
port. For example, where '&lt;/span&gt;1234&lt;span class="s1"&gt;' would be replaced by a unique host port:

  config.vm.forward_port 80, 1234
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As the message says, the port collides with another port on the host box. I would simply change the port to some other value on the host machine or let &lt;a href="https://www.vagrantup.com/docs/networking/forwarded_ports#port-collisions-and-correction"&gt;Vagrant auto-correct itself if it encounters any collisions&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In the Vagrantfile, append &lt;code&gt;, auto_correct: true&lt;/code&gt; at the end of &lt;code&gt;master.vm.network "forwarded_port", guest: 6443, host: 6443&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Read more &lt;a href="https://www.vagrantup.com/docs/networking/forwarded_ports"&gt;here&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communicate with the K3s cluster through local &lt;code&gt;kubectl&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After &lt;code&gt;vagrant up&lt;/code&gt; is done, you will SSH into the Vagrant environment and retrieve the Kubernetes config file used by &lt;code&gt;kubectl&lt;/code&gt;. We want to copy the contents of this file into our local environment so that &lt;code&gt;kubectl&lt;/code&gt; knows how to communicate with the K3s cluster.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vagrant ssh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Print out the contents of the file. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;sudo cat /etc/rancher/k3s/k3s.yaml&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;On a separate terminal, create the file (or replace it if it already exists) &lt;/p&gt;

&lt;p&gt;&lt;code&gt;vim ~/.kube/config&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;and paste the contents of the &lt;code&gt;k3s.yaml&lt;/code&gt; output here.&lt;/p&gt;

&lt;p&gt;Afterwards, you can test that &lt;code&gt;kubectl&lt;/code&gt; works by running &lt;code&gt;kubectl describe services&lt;/code&gt;. It should not return any errors.&lt;/p&gt;
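&lt;p&gt;The manual copy above can also be collapsed into a non-interactive one-liner. A sketch, assuming the VM from the &lt;code&gt;Vagrantfile&lt;/code&gt; is up; note that it overwrites any existing local config:&lt;/p&gt;

```shell
# Pull the kubeconfig out of the VM and point the local kubectl at it
vagrant ssh -c "sudo cat /etc/rancher/k3s/k3s.yaml" > ~/.kube/config
kubectl describe services  # should now talk to the K3s cluster without errors
```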

&lt;p&gt;&lt;strong&gt;Connection refused&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3c3Dixwl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104796952-3d15a500-57c3-11eb-8561-5590df88b02e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3c3Dixwl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://user-images.githubusercontent.com/7910856/104796952-3d15a500-57c3-11eb-8561-5590df88b02e.png" alt="image" width="755" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I encountered a few issues trying to access Grafana through port-forwarding. This was related to the way I configured port-forwarding on Vagrant. A workaround is to either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expand the range of &lt;code&gt;forwarded_port&lt;/code&gt;s in the &lt;code&gt;Vagrantfile&lt;/code&gt;, or&lt;/li&gt;
&lt;li&gt;Use one of the existing &lt;code&gt;forwarded_port&lt;/code&gt;s already available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Lastly, to check all listening ports, run &lt;code&gt;netstat -tulpn&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vagrant@master:~&amp;gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;netstat &lt;span class="nt"&gt;-tulpn&lt;/span&gt;

Active Internet connections &lt;span class="o"&gt;(&lt;/span&gt;only servers&lt;span class="o"&gt;)&lt;/span&gt;
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name   
tcp        0      0 127.0.0.1:10248         0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 127.0.0.1:10249         0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 0.0.0.0:30442           0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 127.0.0.1:10251         0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 127.0.0.1:10252         0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 127.0.0.1:6444          0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 127.0.0.1:10256         0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 0.0.0.0:30100           0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 0.0.0.0:30037           0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 0.0.0.0:22              0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      1015/sshd           
tcp        0      0 127.0.0.1:631           0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      905/cupsd           
tcp        0      0 127.0.0.1:25            0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      1002/master         
tcp        0      0 127.0.0.1:10010         0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5632/containerd     
tcp        0      0 0.0.0.0:32030           0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;               LISTEN      5596/k3s server     
tcp        0      0 :::10250                :::&lt;span class="k"&gt;*&lt;/span&gt;                    LISTEN      5596/k3s server     
tcp        0      0 :::6443                 :::&lt;span class="k"&gt;*&lt;/span&gt;                    LISTEN      5596/k3s server     
tcp        0      0 :::9100                 :::&lt;span class="k"&gt;*&lt;/span&gt;                    LISTEN      8779/node_exporter  
tcp        0      0 :::22                   :::&lt;span class="k"&gt;*&lt;/span&gt;                    LISTEN      1015/sshd           
udp        0      0 0.0.0.0:68              0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;                           658/wickedd-dhcp4   
udp        0      0 0.0.0.0:8472            0.0.0.0:&lt;span class="k"&gt;*&lt;/span&gt;                           -                   
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Error: Kubernetes cluster unreachable with helm 3&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vagrant@master:~&amp;gt; helm list
Error: Kubernetes cluster unreachable: Get &lt;span class="s2"&gt;"http://localhost:8080/version?timeout=32s"&lt;/span&gt;: dial tcp 127.0.0.1:8080: connect: connection refused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is fixed by letting &lt;code&gt;helm&lt;/code&gt; use the same config that &lt;code&gt;kubectl&lt;/code&gt; uses:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vagrant@master:~&amp;gt; echo "export KUBECONFIG=/etc/rancher/k3s/k3s.yaml" &amp;gt;&amp;gt; ~/.bashrc&lt;/code&gt;&lt;br&gt;
or&lt;/p&gt;

&lt;p&gt;&lt;code&gt;vagrant@master:~&amp;gt; kubectl config view --raw &amp;gt;~/.kube/config&lt;/code&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Reference
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://levelup.gitconnected.com/kubernetes-cluster-with-k3s-and-multipass-7532361affa3"&gt;Kubernetes multi-node cluster with k3s and multipass&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.exxactcorp.com/deploying-prometheus-and-grafana-in-kubernetes/"&gt;Deploying Prometheus and Grafana in Kubernetes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/ko_kamlesh/install-prometheus-grafana-with-helm-3-on-local-machine-vm-1kgj"&gt;Install Prometheus &amp;amp; Grafana with helm 3 on local machine/ VM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.vagrantup.com/docs/networking/forwarded_ports"&gt;Vagrant: Forwarded Ports&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kubernetes</category>
      <category>prometheus</category>
      <category>helm</category>
      <category>vagrant</category>
    </item>
  </channel>
</rss>
