<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Cey</title>
    <description>The latest articles on DEV Community by Cey (@akarce).</description>
    <link>https://dev.to/akarce</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1701777%2F051a703d-067f-4959-add4-d5823e1f75d9.jpeg</url>
      <title>DEV Community: Cey</title>
      <link>https://dev.to/akarce</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/akarce"/>
    <language>en</language>
    <item>
      <title>ELK Stack Mastery: Building a Scalable Log Management System</title>
      <dc:creator>Cey</dc:creator>
      <pubDate>Mon, 04 Nov 2024 23:22:49 +0000</pubDate>
      <link>https://dev.to/akarce/elk-stack-mastery-building-a-scalable-log-management-system-5hcn</link>
      <guid>https://dev.to/akarce/elk-stack-mastery-building-a-scalable-log-management-system-5hcn</guid>
      <description>&lt;h2&gt;
  
  
  YouTube Tutorial
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://youtu.be/U1i5sIZzEQM" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgzo18fca6c2omrxaszsh.jpg" alt="Watch the tutorial" width="480" height="360"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project sets up a three-node Elastic cluster using VirtualBox virtual machines. It includes the setup of Elasticsearch, Logstash, and Kibana (the ELK stack) for log management and analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Project Goals
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Set up an Elastic cluster with all necessary components.&lt;/li&gt;
&lt;li&gt;Create an index with a retention schedule of 10 days in the Hot tier, 10 days in the Cold tier, and 10 days in the Frozen tier.&lt;/li&gt;
&lt;li&gt;Load logs using one of the methods listed in the setup.&lt;/li&gt;
&lt;li&gt;Create a dashboard with drilldown capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;VirtualBox installed on your system&lt;/li&gt;
&lt;li&gt;Debian 12 ISO image&lt;/li&gt;
&lt;li&gt;Sufficient system resources to run 3 VMs&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  VM Configuration
&lt;/h2&gt;

&lt;p&gt;Create 3 VMs with the following specifications:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;elktest1&lt;/strong&gt; (Master + Data_Hot + Data_Content, Kibana, Logstash)

&lt;ul&gt;
&lt;li&gt;8 GB RAM, 4 CPU, 40 GB storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;elktest2&lt;/strong&gt; (Data_Cold, Logstash)

&lt;ul&gt;
&lt;li&gt;8 GB RAM, 4 CPU, 40 GB storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;elktest3&lt;/strong&gt; (Data_Frozen, Logstash)

&lt;ul&gt;
&lt;li&gt;6 GB RAM, 3 CPU, 50 GB storage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setup Instructions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. VM Installation
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Download Debian 12 ISO:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-12.7.0-amd64-netinst.iso&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install Debian on each VM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In VM settings, change network from NAT to Bridged Adapter.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  2. SSH Setup
&lt;/h3&gt;

&lt;p&gt;Install SSH on each VM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;su -
apt-get update
apt-get &lt;span class="nb"&gt;install &lt;/span&gt;openssh-server
systemctl start ssh
systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reboot and get IP addresses:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;reboot now
ip addr show
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connect from host machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ssh &amp;lt;username&amp;gt;@&amp;lt;your_ip_address&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
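

&lt;p&gt;Since the configuration files below refer to the nodes by hostname, it helps to map those names to the IP addresses you just noted in &lt;code&gt;/etc/hosts&lt;/code&gt; on every VM (and optionally on the host). A minimal sketch, using placeholder addresses:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;192.168.1.101 elktest1
192.168.1.102 elktest2
192.168.1.103 elktest3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;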



&lt;h3&gt;
  
  
  3. Elasticsearch Installation
&lt;/h3&gt;

&lt;p&gt;On all VMs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;curl
curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://artifacts.elastic.co/GPG-KEY-elasticsearch | gpg &lt;span class="nt"&gt;--dearmor&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; /usr/share/keyrings/elastic.gpg
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"deb [signed-by=/usr/share/keyrings/elastic.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main"&lt;/span&gt; | &lt;span class="nb"&gt;tee&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt; /etc/apt/sources.list.d/elastic-8.x.list
apt update
apt &lt;span class="nb"&gt;install &lt;/span&gt;elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Elasticsearch Configuration
&lt;/h3&gt;

&lt;h3&gt;
  
  
  On elktest1:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Edit &lt;code&gt;/etc/elasticsearch/elasticsearch.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;cluster.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elktestcluster&lt;/span&gt;
&lt;span class="na"&gt;node.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elktest1&lt;/span&gt;
&lt;span class="na"&gt;node.roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;master"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_hot"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_content"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;cluster.initial_master_nodes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elktest1"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;path.data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/elasticsearch&lt;/span&gt;
&lt;span class="na"&gt;path.logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/log/elasticsearch&lt;/span&gt;
&lt;span class="na"&gt;network.host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="na"&gt;http.port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9200&lt;/span&gt;
&lt;span class="na"&gt;discovery.seed_hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elktest1"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;xpack.security.enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;xpack.security.enrollment.enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;xpack.security.http.ssl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;keystore.path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs/http.p12&lt;/span&gt;
&lt;span class="na"&gt;xpack.security.transport.ssl&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;verification_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certificate&lt;/span&gt;
  &lt;span class="na"&gt;keystore.path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs/transport.p12&lt;/span&gt;
  &lt;span class="na"&gt;truststore.path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;certs/transport.p12&lt;/span&gt;
&lt;span class="na"&gt;http.host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start Elasticsearch:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl start elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Reset elastic user password:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/usr/share/elasticsearch/bin/elasticsearch-reset-password &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; elastic
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Generate enrollment tokens for other nodes:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /usr/share/elasticsearch/bin
./elasticsearch-create-enrollment-token &lt;span class="nt"&gt;-s&lt;/span&gt; node
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
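
&lt;p&gt;Optionally, verify that the node is up before enrolling the others. A quick check, assuming the self-signed certificates generated during installation (hence &lt;code&gt;-k&lt;/code&gt;) and the elastic password you just set:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -k -u elastic "https://localhost:9200/_cluster/health?pretty"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;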

&lt;h3&gt;
  
  
  On elktest2 and elktest3:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Reconfigure node with enrollment token:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /usr/share/elasticsearch/bin
./elasticsearch-reconfigure-node &lt;span class="nt"&gt;--enrollment-token&lt;/span&gt; &amp;lt;your_enrollment_token&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Edit &lt;code&gt;/etc/elasticsearch/elasticsearch.yml&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;For elktest2:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;cluster.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elktestcluster&lt;/span&gt;
&lt;span class="na"&gt;node.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elktest2&lt;/span&gt;
&lt;span class="na"&gt;node.roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_cold"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;path.data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/elasticsearch&lt;/span&gt;
&lt;span class="na"&gt;path.logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/log/elasticsearch&lt;/span&gt;
&lt;span class="na"&gt;network.host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="na"&gt;http.port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9200&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;For elktest3:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;cluster.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elktestcluster&lt;/span&gt;
&lt;span class="na"&gt;node.name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;elktest3&lt;/span&gt;
&lt;span class="na"&gt;node.roles&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_frozen"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;path.data&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/lib/elasticsearch&lt;/span&gt;
&lt;span class="na"&gt;path.logs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/var/log/elasticsearch&lt;/span&gt;
&lt;span class="na"&gt;network.host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.0.0.0&lt;/span&gt;
&lt;span class="na"&gt;http.port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;9200&lt;/span&gt;
&lt;span class="na"&gt;xpack.searchable.snapshot.shared_cache.size&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;30%&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Start Elasticsearch on both nodes:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl start elasticsearch
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;
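
&lt;p&gt;To confirm that elktest2 and elktest3 joined the cluster with the expected roles, you can query the &lt;code&gt;_cat/nodes&lt;/code&gt; API from any node (again assuming the elastic user's password):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -k -u elastic "https://elktest1:9200/_cat/nodes?v&amp;amp;h=name,node.role"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;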

&lt;h3&gt;
  
  
  5. Index Lifecycle Management
&lt;/h3&gt;

&lt;p&gt;Create ILM policy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PUT _ilm/policy/elktestcluster_logs_policy
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"policy"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="s2"&gt;"phases"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
            &lt;span class="s2"&gt;"hot"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"actions"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="s2"&gt;"rollover"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="s2"&gt;"max_size"&lt;/span&gt;: &lt;span class="s2"&gt;"40gb"&lt;/span&gt;,
                        &lt;span class="s2"&gt;"max_age"&lt;/span&gt;: &lt;span class="s2"&gt;"10d"&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;,
            &lt;span class="s2"&gt;"warm"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"min_age"&lt;/span&gt;: &lt;span class="s2"&gt;"10d"&lt;/span&gt;,
                &lt;span class="s2"&gt;"actions"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="s2"&gt;"forcemerge"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="s2"&gt;"max_num_segments"&lt;/span&gt;: 1
                    &lt;span class="o"&gt;}&lt;/span&gt;,
                    &lt;span class="s2"&gt;"allocate"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="s2"&gt;"require"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                            &lt;span class="s2"&gt;"data"&lt;/span&gt;: &lt;span class="s2"&gt;"cold"&lt;/span&gt;
                        &lt;span class="o"&gt;}&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;,
            &lt;span class="s2"&gt;"cold"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                &lt;span class="s2"&gt;"min_age"&lt;/span&gt;: &lt;span class="s2"&gt;"20d"&lt;/span&gt;,
                &lt;span class="s2"&gt;"actions"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                    &lt;span class="s2"&gt;"freeze"&lt;/span&gt;: &lt;span class="o"&gt;{}&lt;/span&gt;,
                    &lt;span class="s2"&gt;"allocate"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                        &lt;span class="s2"&gt;"require"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
                            &lt;span class="s2"&gt;"data"&lt;/span&gt;: &lt;span class="s2"&gt;"frozen"&lt;/span&gt;
                        &lt;span class="o"&gt;}&lt;/span&gt;
                    &lt;span class="o"&gt;}&lt;/span&gt;
                &lt;span class="o"&gt;}&lt;/span&gt;
            &lt;span class="o"&gt;}&lt;/span&gt;
        &lt;span class="o"&gt;}&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Assign policy to index template:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PUT _index_template/elktestcluster_logs_template
&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;"index_patterns"&lt;/span&gt;: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"elktestcluster-logs-*"&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;,
  &lt;span class="s2"&gt;"template"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"settings"&lt;/span&gt;: &lt;span class="o"&gt;{&lt;/span&gt;
      &lt;span class="s2"&gt;"number_of_shards"&lt;/span&gt;: 1,
      &lt;span class="s2"&gt;"number_of_replicas"&lt;/span&gt;: 1,
      &lt;span class="s2"&gt;"index.lifecycle.name"&lt;/span&gt;: &lt;span class="s2"&gt;"elktestcluster_logs_policy"&lt;/span&gt;,
      &lt;span class="s2"&gt;"index.lifecycle.rollover_alias"&lt;/span&gt;: &lt;span class="s2"&gt;"elktestcluster-logs"&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
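

&lt;p&gt;Note that rollover only applies when documents are written through the rollover alias. If you choose to index through the alias instead of dated index names, the first backing index typically needs to be bootstrapped with the alias marked as the write index. A minimal sketch (the index name below is an assumption that follows the template's pattern):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;PUT elktestcluster-logs-000001
{
  "aliases": {
    "elktestcluster-logs": {
      "is_write_index": true
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;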



&lt;h3&gt;
  
  
  6. Logstash Setup
&lt;/h3&gt;

&lt;p&gt;Install Logstash on all VMs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;logstash &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add logstash user to elasticsearch group:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; elasticsearch logstash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create Logstash pipeline configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano /etc/logstash/conf.d/elktestcluster-logs.con
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add the following content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;input {
  file {
    path =&amp;gt; [
      "/var/log/elasticsearch/elktestcluster*.json"
    ]
    start_position =&amp;gt; "beginning"
    sincedb_path =&amp;gt; "/dev/null"
    codec =&amp;gt; "json"
  }
}

output {
  elasticsearch {
    hosts =&amp;gt; ["https://elktest1:9200", "https://elktest2:9200", "https://elktest3:9200"]
    index =&amp;gt; "elktestcluster-logs-%{+YYYY.MM.dd}"
    user =&amp;gt; "elastic"
    password =&amp;gt; "elastic"
    ssl =&amp;gt; true
    cacert =&amp;gt; "/etc/elasticsearch/certs/http_ca.crt"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
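

&lt;p&gt;Before starting the service, you can have Logstash validate the pipeline file (the path assumes the configuration created above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/usr/share/logstash/bin/logstash --config.test_and_exit -f /etc/logstash/conf.d/elktestcluster-logs.conf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;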



&lt;p&gt;Start Logstash on all VMs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl start logstash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  7. Kibana Setup
&lt;/h3&gt;

&lt;p&gt;Install Kibana on one VM (preferably elktest1 or elktest2):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;kibana &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Reset kibana_system user password:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/usr/share/elasticsearch/bin/elasticsearch-reset-password &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; kibana_system
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Configure Kibana:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nano /etc/kibana/kibana.yml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Add/edit the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;server.port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5601&lt;/span&gt;
&lt;span class="na"&gt;server.host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0.0.0.0"&lt;/span&gt;
&lt;span class="na"&gt;elasticsearch.hosts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://elktest1:9200&amp;gt;"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://elktest2:9200&amp;gt;"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;https://elktest3:9200&amp;gt;"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;elasticsearch.username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kibana_system"&lt;/span&gt;
&lt;span class="na"&gt;elasticsearch.password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kibana"&lt;/span&gt;
&lt;span class="na"&gt;elasticsearch.ssl.verificationMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;none&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start Kibana:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl start kibana
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
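

&lt;p&gt;Kibana can take a minute to come up; a quick way to confirm it is running and listening on port 5601 (run on the Kibana VM):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl status kibana
curl -I http://localhost:5601
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;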



&lt;h3&gt;
  
  
  8. Accessing Kibana
&lt;/h3&gt;

&lt;p&gt;Open a web browser and go to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://&amp;lt;your_kibana_machine_ip&amp;gt;:5601
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the Elasticsearch credentials:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Username: elastic&lt;/li&gt;
&lt;li&gt;Password: elastic&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Final Steps
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Create a data view from cluster logs in Kibana.&lt;/li&gt;
&lt;li&gt;Create a dashboard from the data view.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/kibana_snapshot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/kibana_snapshot.png" alt="Kibana Snapshot" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You have now set up a complete ELK stack for log management and analysis.&lt;/p&gt;

</description>
      <category>elasticsearch</category>
      <category>kibana</category>
      <category>logstash</category>
      <category>virtualbox</category>
    </item>
    <item>
      <title>End to End Data Engineering OTP Pipeline Project</title>
      <dc:creator>Cey</dc:creator>
      <pubDate>Mon, 04 Nov 2024 18:21:17 +0000</pubDate>
      <link>https://dev.to/akarce/end-to-end-data-engineering-otp-pipeline-project-28gf</link>
      <guid>https://dev.to/akarce/end-to-end-data-engineering-otp-pipeline-project-28gf</guid>
      <description>&lt;h1&gt;
  
  
  End to End OTP Pipeline Project using Docker, Airflow, Kafka, KafkaUI, Cassandra, MongoDB, EmailOperator, SlackWebhookOperator and DiscordWebhookOperator
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Project Overview
&lt;/h2&gt;

&lt;p&gt;The project runs 3 Kafka brokers plus UI for Apache Kafka, Zookeeper, Cassandra, Mongo, Mongo-Express, airflow-init, postgres, airflow-webserver, airflow-scheduler, airflow-triggerer, and airflow-cli containers.&lt;/p&gt;

&lt;p&gt;The pipeline performs the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Creates a Kafka topic named email_topic with 3 partitions and writes one email address at a time, triggered by the user (a CLI sketch for inspecting the topic follows this list).&lt;/li&gt;
&lt;li&gt;Creates a Cassandra keyspace and table named email_namespace and email_table and consumes the Kafka topic messages into Cassandra.&lt;/li&gt;
&lt;li&gt;Creates a MongoDB database and collection named email_database and email_collection and consumes the topic messages into MongoDB.&lt;/li&gt;
&lt;li&gt;Checks for the existence of the email and OTP code in these two databases and pushes email notifications, Slack channel messages, and Discord messages simultaneously.&lt;/li&gt;
&lt;/ol&gt;
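
&lt;p&gt;As a quick sanity check on the topic created in step 1, you can describe it from inside one of the broker containers. A sketch, assuming a broker container named &lt;code&gt;kafka1&lt;/code&gt; with a listener on &lt;code&gt;localhost:9092&lt;/code&gt;; adjust the names to match your docker-compose file:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ docker exec -it kafka1 kafka-topics --bootstrap-server localhost:9092 --describe --topic email_topic&lt;/code&gt;&lt;/p&gt;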

&lt;h2&gt;
  
  
  Configuring Email Notifications in Airflow
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Obtain an App Password for your Microsoft Account:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Go to the Security tab and select Manage how I sign in to Microsoft.&lt;/li&gt;
&lt;li&gt;Create a new app password as instructed.&lt;/li&gt;
&lt;li&gt;For detailed guidance, refer to the &lt;a href="https://support.microsoft.com/en-us/account-billing/how-to-get-and-use-app-passwords-5896ed9b-4263-e681-128a-a6f2979a7944" rel="noopener noreferrer"&gt;Microsoft support article&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Update the docker-compose.yaml file
&lt;/h4&gt;

&lt;p&gt;AIRFLOW__SMTP__SMTP_HOST='smtp-mail.outlook.com' \&lt;br&gt;
AIRFLOW__SMTP__SMTP_MAIL_FROM='your_email@outlook.com' \&lt;br&gt;
AIRFLOW__SMTP__SMTP_USER='your_email@outlook.com' \&lt;br&gt;
AIRFLOW__SMTP__SMTP_PASSWORD='your_app_password' \&lt;br&gt;
AIRFLOW__SMTP__SMTP_PORT='587' \&lt;br&gt;
AIRFLOW__SMTP__SMTP_STARTTLS='True' \&lt;br&gt;
AIRFLOW__SMTP__SMTP_SSL='False' \&lt;br&gt;
AIRFLOW__SMTP__SMTP_TIMEOUT='5' \&lt;br&gt;
AIRFLOW__SMTP__SMTP_RETRY_LIMIT='3'&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up receiver emails using &lt;a href="https://www.mailslurp.com/" rel="noopener noreferrer"&gt;MailSlurp&lt;/a&gt;.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Sign up for a free account.&lt;/li&gt;
&lt;li&gt;From the Inboxes section, create several temporary email addresses.&lt;/li&gt;
&lt;li&gt;In the Emails section, find and copy the email addresses into the email_list.txt file in your project directory before running the pipeline.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up Slack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Go to the Your Apps section of the &lt;a href="https://api.slack.com/apps" rel="noopener noreferrer"&gt;Slack API website&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Create an app from scratch.&lt;/li&gt;
&lt;li&gt;Assign the app to your workspace.&lt;/li&gt;
&lt;li&gt;After creating your app, go to the Incoming Webhooks section in the dropdown menu and toggle Activate Incoming Webhooks.&lt;/li&gt;
&lt;li&gt;Go to the bottom of the page, add a new webhook to the workspace, and assign it to the data-engineering channel.&lt;/li&gt;
&lt;li&gt;From the Webhook URL section, copy the URL; it will be used in the Airflow connection setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting up Discord
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Go to the Discord Web or Desktop &lt;a href="https://discord.com/" rel="noopener noreferrer"&gt;Application&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Create a server from the bottom of the left pane of servers (choose Create My Own, then For me and my friends) and name it whatever you want.&lt;/li&gt;
&lt;li&gt;Go to Server Settings, then Apps/Integrations/Webhooks, and select Create Webhook. This creates a default webhook; copy the Webhook URL and store it for the Airflow connection setup.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Initialize the containers using the docker_run.sh shell script:
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;$ ./docker_run.sh&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This will create the required UID variable and a network named cassandra-kafka, initialize the Postgres database that holds the Airflow metadata, download the container images, and finally start the project containers, which are orchestrated by Airflow.&lt;/strong&gt;&lt;/p&gt;
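
&lt;p&gt;To confirm that everything came up, list the running services from the project directory:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ docker compose ps&lt;/code&gt;&lt;/p&gt;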

&lt;h4&gt;
  
  
  WebUI Links:
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;Airflow&lt;/code&gt;       : &lt;a href="http://localhost:8080/" rel="noopener noreferrer"&gt;http://localhost:8080/&lt;/a&gt;&lt;br&gt;
    Username: &lt;code&gt;admin&lt;/code&gt; Password: &lt;code&gt;admin&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Kafka UI&lt;/code&gt;      : &lt;a href="http://localhost:8888/" rel="noopener noreferrer"&gt;http://localhost:8888/&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;code&gt;Mongo Express&lt;/code&gt; : &lt;a href="http://localhost:8082" rel="noopener noreferrer"&gt;http://localhost:8082&lt;/a&gt;&lt;br&gt;
    Username: &lt;code&gt;admin&lt;/code&gt; Password: &lt;code&gt;pass&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access the cassandra shell using :&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ docker exec -it cassandra cqlsh -u cassandra -p cassandra&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Establish Airflow connections for Slack and Discord from the Admin/Connections panel.
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use the Webhook URL obtained from the Slack app and add a new record.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connection Id *         : slack_webhook&lt;br&gt;
Connection Type *       : Slack Incoming Webhook&lt;br&gt;
Slack Webhook Endpoint  : hooks.slack.com/services&lt;br&gt;
Webhook Token           : T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX (found in the webhook URL, usually starts with "T")&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the Webhook URL copied from Discord and add a new record.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Connection Id *         : discord_webhook&lt;br&gt;
Connection Type *       : Discord&lt;br&gt;
Host                    : &lt;a href="https://discord.com/api/" rel="noopener noreferrer"&gt;https://discord.com/api/&lt;/a&gt;&lt;br&gt;
Webhook Endpoint        : webhooks/{webhook.id}/{webhook.token} (found in the webhook URL)&lt;/p&gt;

&lt;h3&gt;
  
  
  Trigger the main_dag several times.
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Once you have triggered the DAG, you can track the Kafka messages created from &lt;a href="http://localhost:8888" rel="noopener noreferrer"&gt;Kafka UI&lt;/a&gt;, the data inserted into the Cassandra table using the Cassandra shell (cqlsh), and the documents inserted into MongoDB from the &lt;a href="http://localhost:8082" rel="noopener noreferrer"&gt;Mongo Express UI&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For an extensive explanation of the project, feel free to visit my &lt;a href="https://medium.com/@akarce/end-to-end-data-engineering-project-airflow-kafka-cassandra-mongodb-docker-emailoperator-07d48d27bee2" rel="noopener noreferrer"&gt;Medium post&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>airflow</category>
      <category>kafka</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>End-to-End Realtime Streaming Data Engineering Project</title>
      <dc:creator>Cey</dc:creator>
      <pubDate>Wed, 07 Aug 2024 13:58:38 +0000</pubDate>
      <link>https://dev.to/akarce/end-to-end-realtime-streaming-data-engineering-project-2340</link>
      <guid>https://dev.to/akarce/end-to-end-realtime-streaming-data-engineering-project-2340</guid>
      <description>&lt;p&gt;&lt;strong&gt;This repository demonstrates a data engineering pipeline using Spark Structured Streaming. It retrieves random names from an API, sends the data to Kafka topics via Airflow, and processes it with Spark Structured Streaming before storing it in Cassandra.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3i4vt2ho90tlsz2bete4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3i4vt2ho90tlsz2bete4.jpg" alt="Image description" width="800" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Components:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Data Source:&lt;/strong&gt; Uses the randomuser.me API for generating user data. &lt;br&gt;
&lt;strong&gt;Apache Airflow:&lt;/strong&gt; Orchestrates the pipeline and schedules data ingestion. &lt;br&gt;
&lt;strong&gt;Apache Kafka &amp;amp; Zookeeper:&lt;/strong&gt; Stream the ingested data to Spark. &lt;br&gt;
&lt;strong&gt;Apache Spark:&lt;/strong&gt; Processes data in real time. &lt;br&gt;
&lt;strong&gt;Cassandra:&lt;/strong&gt; Stores the processed data. &lt;br&gt;
&lt;strong&gt;Scripts:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kafka_stream.py:&lt;/strong&gt; Airflow DAG script that pushes API data to Kafka every second for 2 minutes. &lt;br&gt;
&lt;strong&gt;spark_stream.py:&lt;/strong&gt; Consumes and processes data from Kafka using Spark Structured Streaming. &lt;/p&gt;

&lt;h2&gt;
  
  
  What You'll Learn:
&lt;/h2&gt;

&lt;p&gt;Setting up and orchestrating pipelines with Apache Airflow. &lt;br&gt;
Real-time data streaming with Apache Kafka. &lt;br&gt;
Synchronization with Apache Zookeeper. &lt;br&gt;
Data processing with Apache Spark. &lt;br&gt;
Storage solutions with Cassandra and PostgreSQL. &lt;br&gt;
Containerization of the entire setup using Docker. &lt;br&gt;
&lt;strong&gt;Technologies:&lt;/strong&gt; &lt;br&gt;
Apache Airflow, Python, Apache Kafka, Apache Zookeeper, Apache Spark, Cassandra, PostgreSQL, Docker &lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;h3&gt;
  
  
  WebUI links
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Airflow&lt;/code&gt;  : &lt;a href="http://localhost:8080/" rel="noopener noreferrer"&gt;http://localhost:8080/&lt;/a&gt; &lt;br&gt;
&lt;code&gt;Kafka UI&lt;/code&gt; : &lt;a href="http://localhost:8085/" rel="noopener noreferrer"&gt;http://localhost:8085/&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Clone the repository:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ git clone https://github.com/akarce/e2e-structured-streaming.git&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Navigate to the project directory:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ cd e2e-structured-streaming&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Create an .env file in the project folder and set AIRFLOW_UID
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ echo -e "AIRFLOW_UID=$(id -u)" &amp;gt; .env&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ echo AIRFLOW_UID=50000 &amp;gt;&amp;gt; .env&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run Docker Compose to perform database migrations and create the first user account
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ docker-compose up airflow-init&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm8jmxjc9ecd0nk95s4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm8jmxjc9ecd0nk95s4g.png" alt="Image description" width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run Docker Compose again to spin up the services:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ docker compose up -d&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrh5melqrg4spwnqjg4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrh5melqrg4spwnqjg4w.png" alt="Image description" width="800" height="179"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Copy the dependencies.zip and spark_stream.py files into the spark-master container
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ docker cp dependencies.zip spark-master:/dependencies.zip&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ docker cp spark_stream.py spark-master:/spark_stream.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbqm9zfobxreraqeci2a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnbqm9zfobxreraqeci2a.png" alt="Image description" width="800" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run the docker exec command to access the cqlsh shell in the Cassandra container
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ docker exec -it cassandra cqlsh -u cassandra -p cassandra localhost 9042&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Run the DESCRIBE command to see that there are no user-created keyspaces in the Cassandra instance yet
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;cqlsh&amp;gt; DESCRIBE KEYSPACES;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5FlqYzjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/img/cqlsh_no_keyspace.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5FlqYzjT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/img/cqlsh_no_keyspace.png" alt="cqlsh no keyspace" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Unpause the user_automation DAG using the Airflow UI
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Go to Airflow UI using :&lt;/strong&gt; &lt;a href="http://localhost:8080/" rel="noopener noreferrer"&gt;http://localhost:8080/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Login using&lt;/strong&gt; Username: &lt;code&gt;admin&lt;/code&gt; Password: &lt;code&gt;admin&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LplaHPo6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/img/unpause_user_automation.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LplaHPo6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/v1/img/unpause_user_automation.png" alt="unpause the user_automation" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You can track the topic creation and the message queue using the open-source tool UI for Apache Kafka, which runs as a container. WebUI link:&lt;/strong&gt;  &lt;a href="http://localhost:8085/" rel="noopener noreferrer"&gt;http://localhost:8085/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnogr2tid4vmk2q6ekxxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnogr2tid4vmk2q6ekxxn.png" alt="Image description" width="800" height="458"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Message schema looks like this&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg310zmsby9gxzukq9b79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg310zmsby9gxzukq9b79.png" alt="Image description" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  In a new terminal, run the docker exec command to launch the Spark job that reads the stream from the Kafka topic:
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;$ docker exec -it spark-master spark-submit     --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.1,org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1     --py-files /dependencies.zip     /spark_stream.py&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Now go back to the cqlsh terminal and run the following query to see that data is inserted into the Cassandra table called created_users
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;cqlsh&amp;gt; SELECT * FROM spark_streaming.created_users;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuv1fx7uznwa4on5q22c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiuv1fx7uznwa4on5q22c.png" alt="Image description" width="800" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  and run the count query several times to confirm that data is being inserted while the user_automation DAG is running
&lt;/h4&gt;

&lt;p&gt;&lt;code&gt;cqlsh&amp;gt; SELECT count(*) FROM spark_streaming.created_users;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JUaFhb5p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/..." class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JUaFhb5p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/..." alt="Uploading image" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>docker</category>
      <category>kafka</category>
      <category>spark</category>
    </item>
  </channel>
</rss>
