<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Stefen</title>
    <description>The latest articles on DEV Community by Stefen (@stefentaime).</description>
    <link>https://dev.to/stefentaime</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F980393%2Fd6e2f453-b86e-45f6-a281-c50b2a5e0814.png</url>
      <title>DEV Community: Stefen</title>
      <link>https://dev.to/stefentaime</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/stefentaime"/>
    <language>en</language>
    <item>
      <title>Automating Data Pipeline Deployment on AWS with Terraform: Utilizing Lambda, Glue, Crawler, Redshift, and S3</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Wed, 26 Jul 2023 15:37:39 +0000</pubDate>
      <link>https://dev.to/stefentaime/automating-data-pipeline-deployment-on-aws-with-terraform-utilizing-lambda-glue-crawler-redshift-and-s3-3075</link>
      <guid>https://dev.to/stefentaime/automating-data-pipeline-deployment-on-aws-with-terraform-utilizing-lambda-glue-crawler-redshift-and-s3-3075</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/@stefentaime_10958/automating-data-pipeline-deployment-on-aws-with-terraform-utilizing-lambda-glue-crawler-1621e0736edd" rel="noopener noreferrer"&gt;https://medium.com/@stefentaime_10958/automating-data-pipeline-deployment-on-aws-with-terraform-utilizing-lambda-glue-crawler-1621e0736edd&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3840%2F1%2AX8GQswkaH8T278wAinFm0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3840%2F1%2AX8GQswkaH8T278wAinFm0g.png" alt="Automating Data Pipeline Deployment on AWS with Terraform"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Objective&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Pre-requisites&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Components&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Source Systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Schedule &amp;amp; Orchestrate&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extract&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Load&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Transform&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Visualization&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Choosing Tools &amp;amp; Frameworks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Future Work &amp;amp; Improvements&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Further Reading&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setup&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Important Note on Costs&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Objective
&lt;/h2&gt;

&lt;p&gt;The objective of this guide is to demonstrate how to automate the deployment of a data pipeline on AWS using Terraform. The pipeline will utilize AWS services such as Lambda, Glue, Crawler, Redshift, and S3. The data for this pipeline will be extracted from a Stock Market API, processed, and transformed to create various views for data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pre-requisites
&lt;/h2&gt;

&lt;p&gt;Before we begin, make sure you have the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Basic understanding of AWS services&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Familiarity with Terraform&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;An AWS account with an IAM user that has the necessary permissions:&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2852%2F1%2ApKU5x1DsGE71zwBCwW9jTg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2852%2F1%2ApKU5x1DsGE71zwBCwW9jTg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terraform installed on your local machine&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Components
&lt;/h2&gt;

&lt;p&gt;The main components of our data pipeline are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;AWS Lambda: Used for running serverless functions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Glue: Used for ETL (Extract, Transform, Load) operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS Glue Crawler: Used for cataloging the data stored in S3.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Amazon Redshift: Used for data warehousing and analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;AWS S3: Used for data storage.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Source Systems
&lt;/h2&gt;

&lt;p&gt;In this pipeline, the source system is a stock market API (Alpha Vantage), from which Lambda1 extracts daily time-series data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Schedule &amp;amp; Orchestrate
&lt;/h2&gt;

&lt;p&gt;Lambda2 is scheduled to execute the Crawler and the Glue Job, orchestrating the flow of data through the pipeline.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3&lt;br&gt;
import os&lt;br&gt;
import time

&lt;p&gt;def lambda_handler(event, context):&lt;br&gt;
    s3 = boto3.client('s3')&lt;br&gt;
    glue = boto3.client('glue')&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bucket_name = os.environ['BUCKET_NAME']

folders = ['AAPL', 'IBM', 'MSFT']

for folder in folders:
    objects = s3.list_objects_v2(
        Bucket=bucket_name,
        Prefix=folder + '/'
    )

    if not any(obj['Key'].endswith('.json') for obj in objects.get('Contents', [])):
        raise Exception(f"No JSON files found in {folder} folder")

glue.start_crawler(Name=os.environ['GLUE_CRAWLER_NAME'])

while True:
    crawler = glue.get_crawler(Name=os.environ['GLUE_CRAWLER_NAME'])
    if crawler['Crawler']['State'] == 'RUNNING':
        break
    time.sleep(10)  


while True:
    crawler = glue.get_crawler(Name=os.environ['GLUE_CRAWLER_NAME'])
    if crawler['Crawler']['LastCrawl']['Status'] == 'SUCCEEDED':
        break
    time.sleep(60)  

glue.start_job_run(JobName=os.environ['GLUE_JOB_NAME'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Extract
&lt;/h2&gt;

&lt;p&gt;Data is extracted from the API by Lambda1 and stored in an S3 bucket.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json&lt;br&gt;
import boto3&lt;br&gt;
import requests&lt;br&gt;
import os&lt;br&gt;
from datetime import datetime

&lt;p&gt;def flatten_data(data):&lt;br&gt;
    metadata = data['Meta Data']&lt;br&gt;
    time_series = data['Time Series (Daily)']&lt;br&gt;
    new_data = []&lt;br&gt;
    for date, values in time_series.items():&lt;br&gt;
        flattened_record = metadata.copy()&lt;br&gt;
        flattened_record.update(values)&lt;br&gt;
        flattened_record['date'] = date&lt;br&gt;
        new_data.append(flattened_record)&lt;br&gt;
    return new_data&lt;/p&gt;

&lt;p&gt;def lambda_handler(event, context):&lt;br&gt;
    s3 = boto3.resource('s3')&lt;br&gt;
    apikey = ''&lt;br&gt;
    symbols = ['MSFT', 'AAPL', 'IBM']&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bucket_name = os.environ['BUCKET_NAME']

for symbol in symbols:
    url = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&amp;amp;amp;symbol={symbol}&amp;amp;amp;outputsize=full&amp;amp;amp;apikey={apikey}'
    r = requests.get(url)
    data = r.json()

    data = flatten_data(data)


    date_str = datetime.now().strftime('%Y-%m-%d')


    key = f'{symbol}/{date_str}-{symbol}.json'


    lines = ""
    for record in data[:100]:  # Only take the first 100 records
        line = json.dumps(record) + "\n"
        lines += line


    s3.Bucket(bucket_name).put_object(Key=key, Body=lines)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Load
&lt;/h2&gt;

&lt;p&gt;The Glue Crawler catalogs the data in the S3 bucket, creating a database with three tables (one per stock symbol) that the downstream steps can query.&lt;/p&gt;
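&lt;p&gt;As a quick way to confirm the load step, the tables the Crawler registered in the Glue Data Catalog can be listed with boto3. This is a minimal sketch, assuming the database name used later by the Glue Job (av_financial_analysis); the client is injectable so the helper can be exercised without AWS access:&lt;/p&gt;

```python
def list_catalog_tables(database_name, glue=None):
    """Return the sorted table names registered in a Glue Data Catalog database."""
    if glue is None:
        import boto3  # deferred so the helper can also run against a stub client
        glue = boto3.client('glue')
    names = []
    paginator = glue.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName=database_name):
        names.extend(table['Name'] for table in page['TableList'])
    return sorted(names)
```

&lt;p&gt;Once the crawl has completed, calling list_catalog_tables('av_financial_analysis') should return one table per symbol folder.&lt;/p&gt;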

&lt;h2&gt;
  
  
  Transform
&lt;/h2&gt;

&lt;p&gt;The Glue Job transforms the data by reading it from the catalog, applying transformations, and writing the output back into the S3 bucket.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sys&lt;br&gt;
from awsglue.transforms import *&lt;br&gt;
from awsglue.utils import getResolvedOptions&lt;br&gt;
from pyspark.context import SparkContext&lt;br&gt;
from awsglue.context import GlueContext&lt;br&gt;
from awsglue.job import Job&lt;br&gt;
from pyspark.sql import functions as F&lt;br&gt;
from awsglue.dynamicframe import DynamicFrame

&lt;p&gt;args = getResolvedOptions(sys.argv, ["JOB_NAME"])&lt;br&gt;
sc = SparkContext()&lt;br&gt;
glueContext = GlueContext(sc)&lt;br&gt;
spark = glueContext.spark_session&lt;br&gt;
job = Job(glueContext)&lt;br&gt;
job.init(args["JOB_NAME"], args)&lt;/p&gt;

&lt;p&gt;table_names = ["aapl", "ibm", "msft"]&lt;/p&gt;

&lt;p&gt;for table_name in table_names:&lt;br&gt;
    S3bucket_node1 = glueContext.create_dynamic_frame.from_catalog(&lt;br&gt;
        database="av_financial_analysis",&lt;br&gt;
        table_name=table_name,&lt;br&gt;
        transformation_ctx="S3bucket_node1",&lt;br&gt;
    )&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1,
    mappings=[
        ("`1. information`", "string", "`1. information`", "string"),
        ("`2. symbol`", "string", "`2. symbol`", "string"),
        ("`3. last refreshed`", "string", "`3. last refreshed`", "date"), 
        ("`4. output size`", "string", "`4. output size`", "string"),
        ("`5. time zone`", "string", "`5. time zone`", "string"),
        ("`1. open`", "string", "`1. open`", "double"),
        ("`2. high`", "string", "`2. high`", "double"),
        ("`3. low`", "string", "`3. low`", "double"),
        ("`4. close`", "string", "`4. close`", "double"),
        ("`5. volume`", "string", "`5. volume`", "bigint"),
        ("date", "string", "date", "date"), 
        ("partition_0", "string", "partition_0", "string"),
    ],
    transformation_ctx="ApplyMapping_node2",
)


df = ApplyMapping_node2.toDF()

# Group by the 'symbol' column and calculate the mean, min, max of the specified columns
grouped_df = df.groupBy("`2. symbol`").agg(
    F.mean("`1. open`").alias("average_open"),
    F.min("`1. open`").alias("min_open"),
    F.max("`1. open`").alias("max_open"),

    F.mean("`4. close`").alias("average_close"),
    F.min("`4. close`").alias("min_close"),
    F.max("`4. close`").alias("max_close"),

    F.mean("`2. high`").alias("average_high"),
    F.min("`2. high`").alias("min_high"),
    F.max("`2. high`").alias("max_high"),

    F.mean("`3. low`").alias("average_low"),
    F.min("`3. low`").alias("min_low"),
    F.max("`3. low`").alias("max_low"),
)

# Convert back to DynamicFrame
grouped_dyf = DynamicFrame.fromDF(grouped_df, glueContext, "grouped_dyf")

glueContext.write_dynamic_frame.from_options(
    frame = grouped_dyf,
    connection_type = "s3",
    connection_options = {"path": f"s3://av-financial-analysis-bucket/output/{table_name}"},
    format = "csv",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;job.commit()&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Data Visualization
&lt;/h2&gt;

&lt;p&gt;Redshift reads the data from the catalog and creates views for visualization. Here are some examples of the views that can be created:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Stock performance comparison view: This view compares the daily performance of three stocks (AAPL, IBM, and MSFT). It includes columns for date, stock symbol, opening price, closing price, highest price, lowest price, and stock volume.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Daily stock statistics view: This view calculates daily statistics for each stock, such as the percentage change between the opening and closing price, the difference between the highest and lowest price, and the total volume of shares traded.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stock trends view: This view plots stock trends over a period of time. For example, it can calculate the moving average of closing prices over 7 days, 30 days, and 90 days for each stock.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Most-traded shares view: This view ranks stocks according to the total volume of shares traded each day. This can help identify the most popular or active stocks on the market.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stock correlation view: This view examines the correlation between the price movements of different stocks. For example, if the price of the AAPL share rises, does the price of the IBM share also rise?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Choosing Tools &amp;amp; Frameworks
&lt;/h2&gt;

&lt;p&gt;The tools and frameworks were chosen based on their integration with AWS and their ability to handle the tasks required for this pipeline. Terraform was chosen for its infrastructure as code capabilities, allowing for easy deployment and management of the pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future Work &amp;amp; Improvements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Regularly monitor and optimize your pipeline to ensure it remains efficient as your data grows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Implement proper error handling and alerting mechanisms to quickly identify and resolve any issues.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
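&lt;p&gt;For the alerting improvement, one lightweight option is to wrap each pipeline step so that failures publish an SNS notification before re-raising. This is a hedged sketch, not code from the project: the topic ARN is a placeholder for a topic you would create separately (for example with a Terraform aws_sns_topic resource), and the boto3 import is deferred so the wrapper can be exercised offline:&lt;/p&gt;

```python
def format_failure_alert(pipeline, step, error):
    """Build the human-readable alert message for a failed step."""
    return f"[{pipeline}] step '{step}' failed: {error}"

def alert_on_failure(step_name, fn, topic_arn, *args, **kwargs):
    """Run a pipeline step; publish an SNS alert and re-raise if it fails."""
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        import boto3  # deferred import: only needed on the failure path
        sns = boto3.client('sns')
        sns.publish(
            TopicArn=topic_arn,  # placeholder for your alert topic ARN
            Subject='Data pipeline failure',
            Message=format_failure_alert('stock-pipeline', step_name, exc),
        )
        raise
```

&lt;p&gt;A Lambda handler could then call alert_on_failure('crawl', glue.start_crawler, topic_arn, Name=crawler_name) instead of invoking the client directly.&lt;/p&gt;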

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;To get started with this project, follow the steps below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Ensure you have configured your AWS environment using aws configure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Clone the project repository to your local machine using the following command: git clone &lt;a href="https://github.com/Stefen-Taime/etl_onaws_deploy_with_terraform.git" rel="noopener noreferrer"&gt;https://github.com/Stefen-Taime/etl_onaws_deploy_with_terraform.git&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Navigate to the project directory: cd etl_onaws_deploy_with_terraform&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Familiarize yourself with the project structure using the tree command. The project structure should look like this:&lt;/p&gt;

&lt;p&gt;.&lt;br&gt;
├── glue&lt;br&gt;
│   ├── crawler&lt;br&gt;
│   │   └── main.tf&lt;br&gt;
│   └── job&lt;br&gt;
│       ├── glue_job.py&lt;br&gt;
│       └── main.tf&lt;br&gt;
├── lambda_functions&lt;br&gt;
│   ├── lambda1&lt;br&gt;
│   │   ├── deploy.sh&lt;br&gt;
│   │   ├── lambda_function.py&lt;br&gt;
│   │   ├── main.tf&lt;br&gt;
│   │   └── requirements.txt&lt;br&gt;
│   └── lambda2&lt;br&gt;
│       ├── deploy.sh&lt;br&gt;
│       ├── lambda_function.py&lt;br&gt;
│       └── main.tf&lt;br&gt;
├── main.tf&lt;br&gt;
├── outputs.tf&lt;br&gt;
├── redshift&lt;br&gt;
│   ├── network.tf&lt;br&gt;
│   ├── outputs.tf&lt;br&gt;
│   ├── provider.tf&lt;br&gt;
│   ├── redshift-cluster.tf&lt;br&gt;
│   ├── redshift-iam.tf&lt;br&gt;
│   ├── security-group.tf&lt;br&gt;
│   ├── terraform.tfstate&lt;br&gt;
│   ├── terraform.tfvars&lt;br&gt;
│   └── variables.tf&lt;br&gt;
├── s3&lt;br&gt;
│   └── main.tf&lt;br&gt;
└── variables.tf&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Package the Lambda1 function. Navigate to the Lambda1 directory (cd lambda_functions/lambda1), grant execute permissions to the deployment script (chmod a+x deploy.sh), and run the script (./deploy.sh). You should see a deployment_package.zip file generated.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return to the root directory of the project. At this point, we will deploy the first two essential modules for data ingestion: the Lambda1 function and the S3 bucket. In the main.tf file located at the root of the project, you can keep only the S3 and Lambda1 modules and comment out or temporarily remove the rest.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Once ready, run terraform init to initialize your Terraform workspace, followed by terraform plan. At this stage, you will need to enter the ARN of the Lambda function. You can enter an example ARN, such as MyArnLambda.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After that, run terraform apply -var="account_id=". You can find your account ID in the AWS console at the top right. If everything goes well, you should see an output similar to this image:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AMQCNMiBRCjXrPQNsR_2_tQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AMQCNMiBRCjXrPQNsR_2_tQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to the AWS console and execute the Lambda function. Check the S3 bucket, and you should see three folders: AAPL, IBM, and MSFT.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2572%2F1%2AgBqMgz2F394h0Xp3V6PV8A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2572%2F1%2AgBqMgz2F394h0Xp3V6PV8A.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3584%2F1%2AyfoHRj_E_EFMSta1Srcbzg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3584%2F1%2AyfoHRj_E_EFMSta1Srcbzg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Once deployed, return to the main.tf file at the root of the project and uncomment the Module2 section, which includes the Glue Crawler and Glue Job. Run terraform init, terraform plan, and terraform apply again.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The output should look like this:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ATlQjxn_zExGIWD6I7qF98Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ATlQjxn_zExGIWD6I7qF98Q.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3122%2F1%2ASqN7rOTzxjyNZvUT8fJ64Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3122%2F1%2ASqN7rOTzxjyNZvUT8fJ64Q.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Deploy Module3, which includes the Lambda2 function. Do the same as you did with Lambda1: navigate to the Lambda2 directory (cd lambda_functions/lambda2), grant execute permissions to the deployment script (chmod a+x deploy.sh), and run the script (./deploy.sh).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Return to the root of the project and run terraform init, terraform plan, and terraform apply.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2004%2F1%2Aj3wTYloA20vH8YDImNrptw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2004%2F1%2Aj3wTYloA20vH8YDImNrptw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once deployed, go to the AWS Lambda console and execute the second function. It should trigger the Crawler and the Glue Job. You can verify this by checking the image:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3010%2F1%2Adkwrb7BAiu17k7Hd3rZyng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3010%2F1%2Adkwrb7BAiu17k7Hd3rZyng.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3532%2F1%2ApIiyJB0Bvk0m6i709N7N9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3532%2F1%2ApIiyJB0Bvk0m6i709N7N9w.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AJNs3aKKIbvpeK5zN7yvOJQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AJNs3aKKIbvpeK5zN7yvOJQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3044%2F1%2AcbtjpE65vhL2DnWy6lNncQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3044%2F1%2AcbtjpE65vhL2DnWy6lNncQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note: The first execution of the Glue Job may fail, because the job starts while the Crawler is still building the catalog it depends on, so the catalog tables are not yet found. The simplest fix is to re-execute the Glue Job manually by clicking ‘Run’ at the top right.&lt;/p&gt;
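&lt;p&gt;If you prefer not to re-run the job by hand, the race can also be avoided by polling the Crawler and starting the Glue Job only once the Crawler returns to the READY state. This is an illustrative sketch, not part of the repository; the glue client is passed in so the polling logic can be tested with a stub:&lt;/p&gt;

```python
import time

def start_job_when_crawler_done(glue, crawler_name, job_name,
                                poll_seconds=30, max_polls=60):
    """Poll the crawler; start the Glue job once the crawl has finished."""
    for _ in range(max_polls):
        state = glue.get_crawler(Name=crawler_name)['Crawler']['State']
        if state == 'READY':
            # The catalog tables now exist, so the job can resolve them
            return glue.start_job_run(JobName=job_name)
        time.sleep(poll_seconds)
    raise TimeoutError(f"Crawler {crawler_name} did not finish in time")
```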

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3640%2F1%2ApYQwosNmAnxi5Ei1U49Akg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3640%2F1%2ApYQwosNmAnxi5Ei1U49Akg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once all these are executed, you should have an output folder in the bucket and a database containing three tables in the catalog.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3094%2F1%2AvHtC4Aeu6j44oiBR9orOKA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3094%2F1%2AvHtC4Aeu6j44oiBR9orOKA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The final step of the project is to deploy the Redshift cluster. To do this, navigate to the redshift directory and fill in your AWS access key and secret key in the terraform.tfvars file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Still in the redshift directory, run terraform init, terraform plan, and terraform apply.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2132%2F1%2AWCzKqUiowWPRVBz1XqLk1Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2132%2F1%2AWCzKqUiowWPRVBz1XqLk1Q.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Once deployed, connect to the data catalog in Redshift, which contains three tables. You can create scripts for various views such as comparison of stock performance, daily stock statistics, stock trends, most traded stocks, and correlation between stocks.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  "date",
  symbol,
  "1. open" AS "open",
  "4. close" AS "close",
  "2. high" AS "high",
  "3. low" AS "low",
  "5. volume" AS "volume"
FROM
  (
    SELECT "date", 'AAPL' AS symbol, "1. open", "4. close", "2. high", "3. low", "5. volume"
    FROM test.aapl
    UNION ALL
    SELECT "date", 'IBM' AS symbol, "1. open", "4. close", "2. high", "3. low", "5. volume"
    FROM test.ibm
    UNION ALL
    SELECT "date", 'MSFT' AS symbol, "1. open", "4. close", "2. high", "3. low", "5. volume"
    FROM test.msft
  )
ORDER BY
  "date",
  symbol;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3772%2F1%2AJcGy7hLdVJex2YrlFlPEIQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3772%2F1%2AJcGy7hLdVJex2YrlFlPEIQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Once finished, return to the redshift directory and destroy the Redshift infrastructure with terraform destroy. This is crucial to avoid additional costs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACMJg0kirNHlYN1mJ9CMT-w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2ACMJg0kirNHlYN1mJ9CMT-w.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Also, in the root of the project, run terraform destroy to destroy the Lambda functions, S3 bucket, Crawler, and Glue. You may encounter an error saying that the bucket is not empty. Just empty the bucket manually and try again.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Congratulations! You are now capable of managing a deployment on AWS with Terraform.&lt;/p&gt;

&lt;h2&gt;
  
  
  Important Note on Costs
&lt;/h2&gt;

&lt;p&gt;Remember, AWS services are not free, and costs can accumulate over time. It’s crucial to destroy your environment when you’re done using it to avoid unnecessary charges. You can do this by running terraform destroy in your terminal. Please note that I am not responsible for any costs associated with running this pipeline in your AWS environment.&lt;/p&gt;

</description>
      <category>softwareengineering</category>
      <category>dataengineering</category>
      <category>aws</category>
      <category>terraform</category>
    </item>
    <item>
      <title>Building a Modern Data Pipeline: A Deep Dive into Terraform, AWS Lambda and S3, Snowflake, DBT, Mage AI, and Dash</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Mon, 26 Jun 2023 05:44:10 +0000</pubDate>
      <link>https://dev.to/stefentaime/building-a-modern-data-pipeline-a-deep-dive-into-terraform-aws-lambda-and-s3-snowflake-dbt-mage-ai-and-dash-3jng</link>
      <guid>https://dev.to/stefentaime/building-a-modern-data-pipeline-a-deep-dive-into-terraform-aws-lambda-and-s3-snowflake-dbt-mage-ai-and-dash-3jng</guid>
      <description>&lt;p&gt;&lt;a href="https://medium.com/@stefentaime_10958/building-a-modern-data-pipeline-a-deep-dive-into-terraform-aws-lambda-and-s3-snowflake-dbt-cac6816f2100"&gt;https://medium.com/@stefentaime_10958/building-a-modern-data-pipeline-a-deep-dive-into-terraform-aws-lambda-and-s3-snowflake-dbt-cac6816f2100&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---VEI-4-W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7vwan3laz6wa4gxb05bm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---VEI-4-W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7vwan3laz6wa4gxb05bm.png" alt="Image description" width="800" height="587"&gt;&lt;/a&gt;&lt;a href="https://link.medium.com/eVU7N3rcWAb"&gt;https://link.medium.com/eVU7N3rcWAb&lt;/a&gt;&lt;/p&gt;

</description>
      <category>mageai</category>
      <category>dbt</category>
      <category>terraform</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>Creating an Election Monitoring System Using MongoDB, Spark, Twilio SMS Notifications, and Dash</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Tue, 13 Jun 2023 00:55:25 +0000</pubDate>
      <link>https://dev.to/stefentaime/creating-a-election-monitoring-system-using-mongodb-spark-twilio-sms-notifications-and-dash-504o</link>
      <guid>https://dev.to/stefentaime/creating-a-election-monitoring-system-using-mongodb-spark-twilio-sms-notifications-and-dash-504o</guid>
      <description>&lt;h2&gt;
  
  
  Creating an Election Monitoring System Using MongoDB, Spark, Twilio SMS Notifications, and Dash
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hid8GfSo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3720/1%2AsjwZvn23nt0X6cuFt8pM0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hid8GfSo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3720/1%2AsjwZvn23nt0X6cuFt8pM0w.png" alt="" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we present a proof-of-concept (POC) for an election monitoring solution. It was devised for a government that approached a young digital company specializing in data, wanting to make election results more transparent, accessible, and available in real time.&lt;/p&gt;

&lt;p&gt;The system proposed is designed to ingest voter data, process and analyze it, alert the media and concerned parties via SMS once the results are ready, and finally display the results on an interactive map via a Dash application.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Data Pipeline
&lt;/h2&gt;

&lt;p&gt;The Spark cluster is set up with a master and a single worker node; the worker executes the tasks assigned by the Spark master. This setup handles data processing efficiently, and the work can be split among additional worker nodes if necessary.&lt;/p&gt;

&lt;p&gt;The data the system processes comes from an intriguing source: a synthetic dataset of voting records. A script using the Python library Faker generates this data, imitating realistic voting behavior across all US states and the District of Columbia. The synthetic data is stored in MongoDB, a popular NoSQL database known for its flexibility and scalability, making it an excellent choice for handling large datasets like voting records.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime
from faker import Faker
from pymongo import MongoClient

# Init Faker
fake = Faker()

# Init MongoDB client
client = MongoClient('mongodb://root:example@localhost:27017/')
db = client['admin']
collection = db['votes']

state_weights = {
    "Alabama": (0.60, 0.40),
    "Alaska": (0.55, 0.45),
    "Arizona": (0.15, 0.85),
    "Arkansas": (0.20, 0.80),
    "California": (0.15, 0.85),
    "Colorado": (0.70, 0.30),
    "Connecticut": (0.10, 0.90),
    "Delaware": (0.34, 0.66),
    "Florida": (0.82, 0.18),
    "Georgia": (0.95, 0.05),
    "Hawaii": (0.50, 0.50),
    "Idaho": (0.67, 0.33),
    "Illinois": (0.60, 0.40),
    "Indiana": (0.15, 0.85),
    "Iowa": (0.45, 0.55),
    "Kansas": (0.40, 0.60),
    "Kentucky": (0.62, 0.38),
    "Louisiana": (0.58, 0.42),
    "Maine": (0.60, 0.40),
    "Maryland": (0.55, 0.45),
    "Massachusetts": (0.63, 0.37),
    "Michigan": (0.62, 0.38),
    "Minnesota": (0.61, 0.39),
    "Mississippi": (0.41, 0.59),
    "Missouri": (0.60, 0.40),
    "Montana": (0.57, 0.43),
    "Nebraska": (0.56, 0.44),
    "Nevada": (0.55, 0.45),
    "New Hampshire": (0.54, 0.46),
    "New Jersey": (0.53, 0.47),
    "New Mexico": (0.52, 0.48),
    "New York": (0.51, 0.49),
    "North Carolina": (0.50, 0.50),
    "North Dakota": (0.05, 0.95),
    "Ohio": (0.58, 0.42),
    "Oklahoma": (0.57, 0.43),
    "Oregon": (0.56, 0.44),
    "Pennsylvania": (0.55, 0.45),
    "Rhode Island": (0.50, 0.50),
    "South Carolina": (0.53, 0.47),
    "South Dakota": (0.48, 0.52),
    "Tennessee": (0.51, 0.49),
    "Texas": (0.60, 0.40),
    "Utah": (0.59, 0.41),
    "Vermont": (0.58, 0.42),
    "Virginia": (0.57, 0.43),
    "Washington": (0.44, 0.56),
    "West Virginia": (0.55, 0.45),
    "Wisconsin": (0.46, 0.54),
    "Wyoming": (0.53, 0.47),
    "District of Columbia": (0.15, 0.85)
}

import random  # used for the weighted party draw below

def generate_vote(state):
    weights = state_weights.get(state, (0.50, 0.50))  # weights for the state, defaulting to an even split
    vote = {
        "voting_time": datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f'),
        "voter": {
            "voter_id": str(fake.unique.random_number(digits=9)),
            "first_name": fake.first_name(),
            "last_name": fake.last_name(),
            "address": {
                "street": fake.street_address(),
                "city": fake.city(),
                "state": state,
                "zip_code": fake.zipcode()
            },
            "birth_date": str(fake.date_of_birth(minimum_age=18, maximum_age=90)),
            "gender": fake.random_element(elements=('Male', 'Female')),
        },
        "vote": {
            "voting_date": "2023-06-06",
            "voting_location": fake.address(),
            "election": {
                "type": "Presidential Election",
                "year": "2023"
            },
            "vote_valid": "Yes",
            "voting_choice": {
                # Draw the party according to the state's predefined probabilities
                "party": random.choices(('Republican', 'Democrat'), weights=weights)[0],
            }
        }
    }

    return vote

# List of states
states = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
          "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois",
          "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
          "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana",
          "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York",
          "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania",
          "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah",
          "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming",
          "District of Columbia"]


# Generate fake voting data for each state and insert into MongoDB
for state in states:
    for _ in range(60):
        vote = generate_vote(state)
        collection.insert_one(vote)

print("All votes saved to MongoDB")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For each state, the synthetic data simulates voter choices based on predefined probabilities, reflecting historical voting patterns. This data, consisting of 60 voters for each state, serves as the input for the Spark processing system.&lt;/p&gt;
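
&lt;p&gt;As a quick sanity check on those predefined probabilities, a standalone snippet (not part of the article's pipeline) confirms that sampling with a state's weights reproduces roughly that split:&lt;/p&gt;

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

weights = (0.60, 0.40)  # e.g. the "Alabama" entry in state_weights
draws = random.choices(("Republican", "Democrat"), weights=weights, k=10_000)
share = draws.count("Republican") / len(draws)
print(f"Republican share over 10,000 draws: {share:.2f}")
```

Over many draws the Republican share converges to the 0.60 weight, which is the behavior the generator relies on.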

&lt;p&gt;The Spark system processes the data, determining the winning party in each state. It then calculates the percentage of votes each party has won. This critical information is then fed into an SMS notification system, alerting media outlets and the relevant parties with real-time election results.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: '3.1'
services:

  # ===================== #
  #     Apache Spark      #
  # ===================== #
  spark:
    image: bitnami/spark:3.3.1
    environment:
      - SPARK_MODE=master
    ports:
      - '8080:8080'
      - '7077:7077'
    volumes:
      - ./data:/data
      - ./src:/src
  spark-worker:
    image: bitnami/spark:3.3.1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_EXECUTOR_MEMORY=4G
      - SPARK_WORKER_CORES=4
    ports:
      - '8081:8081'
    volumes:
      - ./data:/data
      - ./src:/src

  # ===================== #
  #        MongoDB        #
  # ===================== #
  mongo:
    image: mongo:4.4
    volumes:
      - ./mongo:/data/db
    ports:
      - '27017:27017'
    environment:
      - MONGO_INITDB_ROOT_USERNAME=root
      - MONGO_INITDB_ROOT_PASSWORD=example
  mongo-express:
    image: mongo-express
    ports:
      - '8091:8081'
    environment:
      - ME_CONFIG_MONGODB_ADMINUSERNAME=root
      - ME_CONFIG_MONGODB_ADMINPASSWORD=example
      - ME_CONFIG_MONGODB_SERVER=mongo
      - ME_CONFIG_MONGODB_PORT=27017
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Data Processing with PySpark (Job 1)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a SparkSession: The code initiates a SparkSession, which is an entry point to any Spark functionality. When it starts, it connects to the MongoDB database where the data is stored.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Load Data: The code then reads the data from MongoDB and loads it into a DataFrame, which is a distributed collection of data organized into named columns. It’s similar to a table in a relational database and can be manipulated in similar ways.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Processing: The code selects the relevant fields from the DataFrame (state, party, and validity of the vote), groups them by state and party, and counts the number of valid votes for each party in each state.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Find Winners: Next, the code finds the party with the most votes in each state. It does this by ranking the parties within each state based on the number of votes they got and then selecting the one with the highest rank (i.e., the one with the most votes).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate Percentages: The code then calculates the percentage of votes each winning party got in its state. It does this by dividing the number of votes the winning party got by the total votes in that state and multiplying by 100.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Write Results: Finally, the code saves the results, which include the winning party and their vote percentage in each state, back to MongoDB but in a different collection called ‘election_results’.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So in essence, this code processes voting records to determine the party that won the most votes in each state and calculates what percentage of the total votes in that state the winning party received. This analysis can give a clear picture of the distribution of votes in an election.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import count, expr, col
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Create a SparkSession
spark = SparkSession.builder \
    .appName('MongoDBIntegration') \
    .config("spark.mongodb.input.uri", "mongodb://root:example@mongo:27017/admin.votes") \
    .getOrCreate()

# Load the MongoDB data into a DataFrame
df = spark.read.format("mongo").load()

# Extract relevant fields and group by state and party
result = df.select(
    df["voter.address.state"].alias("state"),
    df["vote.voting_choice.party"].alias("party"),
    df["vote.vote_valid"].alias("validity")
).where(col("validity") == "Yes").groupby("state", "party").agg(count("validity").alias("votes"))

# Find the party with the most votes in each state
winners = result.withColumn("rn", F.row_number().over(Window.partitionBy("state").orderBy(F.desc("votes")))).filter(col("rn") == 1).drop("rn")

# Calculate the percentage of victory
total_votes = result.groupby("state").agg(F.sum("votes").alias("total_votes"))
winners_with_percentage = winners.join(total_votes, "state").withColumn("percentage", (col("votes") / col("total_votes")) * 100)

# Write the result to MongoDB
winners_with_percentage.write.format("mongo").mode("overwrite").option("spark.mongodb.output.uri", "mongodb://root:example@mongo:27017/admin.election_results").save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{&lt;br&gt;
    _id: ObjectId('64873b3df42ba41d32f3d1a6'),&lt;br&gt;
    state: 'Utah',&lt;br&gt;
    party: 'Republican',&lt;br&gt;
    votes: 127,&lt;br&gt;
    total_votes: 240,&lt;br&gt;
    percentage: 52.916666666666664&lt;br&gt;
}&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Data Processing with PySpark (Job 2)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Create a SparkSession and Load Data: The script starts a SparkSession and then loads data from a MongoDB collection.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set Electoral Votes by State: The United States uses a system called the Electoral College to decide the outcome of presidential elections. Each state has a number of votes in the Electoral College that is largely proportional to its population. This script creates a dictionary that maps each state to its number of electoral votes. Then it converts this dictionary into a DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Join Electoral Votes with Election Data: The script combines the election results data with the electoral votes data, based on the state name. This gives us a DataFrame where each row has the state name, the party, the votes that party received, and the number of electoral votes that state has.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate Nationwide Votes: The script calculates the total votes received by each party nationwide.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Identify the Nationwide Winner: The script determines the party that got the most votes nationwide.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate Maximum State Votes and Handle Ties: The script identifies the maximum number of votes received in each state and handles ties by giving the electoral votes to the nationwide winner.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Calculate Total Grand Electors for Each Party: The script then calculates the total number of electoral votes (“grand electors”) each party received nationwide, considering the rule of tie-breaking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Save the Results: The script saves the electoral votes results back to MongoDB.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Notify the Results via SMS: Using Twilio, an online messaging service, the script then sends an SMS with the election results. The results are formatted as a string which includes each party and the number of electoral votes they won.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Stop the SparkSession: Lastly, the script stops the SparkSession, releasing its resources.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import max, when
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from twilio.rest import Client

spark = SparkSession.builder \
    .appName("ElectionResults") \
    .config("spark.mongodb.input.uri", "mongodb://root:example@mongo:27017/admin.election_results") \
    .getOrCreate()

df = spark.read.format("mongo").load()

# Create a dictionary of grand electors by state
electors_dict = {
    "Alabama": 9,
    "Alaska": 3,
    "Arizona": 11,
    "Arkansas": 6,
    "California": 55,
    "Colorado": 9,
    "Connecticut": 7,
    "Delaware": 3,
    "Florida": 29,
    "Georgia": 16,
    "Hawaii": 4,
    "Idaho": 4,
    "Illinois": 20,
    "Indiana": 11,
    "Iowa": 6,
    "Kansas": 6,
    "Kentucky": 8,
    "Louisiana": 8,
    "Maine": 4,
    "Maryland": 10,
    "Massachusetts": 11,
    "Michigan": 16,
    "Minnesota": 10,
    "Mississippi": 6,
    "Missouri": 10,
    "Montana": 3,
    "Nebraska": 5,
    "Nevada": 6,
    "New Hampshire": 4,
    "New Jersey": 14,
    "New Mexico": 5,
    "New York": 29,
    "North Carolina": 15,
    "North Dakota": 3,
    "Ohio": 18,
    "Oklahoma": 7,
    "Oregon": 7,
    "Pennsylvania": 20,
    "Rhode Island": 4,
    "South Carolina": 9,
    "South Dakota": 3,
    "Tennessee": 11,
    "Texas": 38,
    "Utah": 6,
    "Vermont": 3,
    "Virginia": 13,
    "Washington": 12,
    "West Virginia": 5,
    "Wisconsin": 10,
    "Wyoming": 3,
    "District of Columbia": 3
}

# Convert dictionary to DataFrame and join it with the election results
electors_df = spark.createDataFrame([(k, v) for k, v in electors_dict.items()], ["state", "electors"])
df = df.join(electors_df, on="state", how="inner")

# Total votes per party nationwide, and the nationwide winner
nationwide_df = df.groupBy("party").agg(F.sum("votes").alias("total_votes"))
nationwide_winner = nationwide_df.orderBy(F.desc("total_votes")).first()[0]

# Identify maximum votes in each state
state_max_df = df.groupBy("state").agg(max("votes").alias("max_votes"))
df = df.join(state_max_df, on="state", how="inner")

window = Window.partitionBy(df['state'])

# Count how many parties tie for the top vote count in each state;
# on a tie, the state's electors go to the nationwide winner
df = df.withColumn('winners', F.sum(when(df.votes == df.max_votes, 1).otherwise(0)).over(window))
df = df.withColumn('final_party', when(df.winners &gt; 1, nationwide_winner).otherwise(df.party))

result_df = df.groupBy("final_party").sum("electors")

# Save the result to MongoDB
result_df.write.format("mongo").option("uri", "mongodb://root:example@mongo:27017/admin.election_results_out").mode("overwrite").save()

# Notify the results via Twilio SMS
account_sid = ''
auth_token = ''
client = Client(account_sid, auth_token)

result = result_df.collect()

result_str = "\n".join([f"{row['final_party']}: {row['sum(electors)']} electors" for row in result])

message_body = f"Dear recipient, \n\nWe are pleased to share with you the final election results:\n\n{result_str}\n\nWe would like to express our gratitude for your patience and interest in our democratic process. For more detailed results, please visit our official website.\n\nBest regards,\n[Election Committee]"

message = client.messages.create(
    from_='',
    body=message_body,
    to=''
)

print(f"Message sent with id {message.sid}")

# Stop the SparkSession
spark.stop()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;
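
&lt;p&gt;The tie-breaking rule from steps 6 and 7 can be illustrated with a small plain-Python example (toy vote counts, not the article's data): each state's electors go to its top party, and a tied state falls back to the nationwide winner.&lt;/p&gt;

```python
# Toy illustration of the tie-break rule used in Job 2 (numbers are made up)
state_results = {
    "StateA": {"Republican": 120, "Democrat": 100},  # clear Republican win
    "StateB": {"Republican": 110, "Democrat": 110},  # tie
    "StateC": {"Republican": 90, "Democrat": 140},   # clear Democrat win
}
electors = {"StateA": 10, "StateB": 6, "StateC": 12}

# The nationwide vote leader decides any tied states
nationwide = {}
for votes in state_results.values():
    for party, n in votes.items():
        nationwide[party] = nationwide.get(party, 0) + n
nationwide_winner = max(nationwide, key=nationwide.get)

# Allocate each state's electors, falling back to the nationwide winner on ties
totals = {}
for state, votes in state_results.items():
    top = max(votes.values())
    leaders = [p for p, n in votes.items() if n == top]
    final = leaders[0] if len(leaders) == 1 else nationwide_winner
    totals[final] = totals.get(final, 0) + electors[state]

print(totals)  # StateB's 6 electors go to the nationwide leader
```

Here the Democrats lead nationwide (350 votes to 320), so the tied StateB contributes its 6 electors to them.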

&lt;p&gt;Output:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{&lt;br&gt;
    _id: ObjectId('6487445c358709227a7e9c71'),&lt;br&gt;
    final_party: 'Republican',&lt;br&gt;
    'sum(electors)': 201&lt;br&gt;
}

&lt;p&gt;{&lt;br&gt;
    _id: ObjectId('6487445c358709227a7e9c72'),&lt;br&gt;
    final_party: 'Democrat',&lt;br&gt;
    'sum(electors)': 337&lt;br&gt;
}&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Notification of Results
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---vOBZzFZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2160/1%2ADXpFgmETQw5WMKybNM4V_w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---vOBZzFZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2160/1%2ADXpFgmETQw5WMKybNM4V_w.png" alt="" width="800" height="992"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Visualization with Dash
&lt;/h2&gt;

&lt;p&gt;The final step involves visualizing the results using Dash, a productive Python framework for building web analytic applications. It allows us to construct an interactive map of the United States, where each state is colored according to the party that won the majority of votes: blue for Democrats and red for Republicans. This enables users to easily and intuitively understand the election results.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Connect to a Database: The script connects to a database (specifically MongoDB) where the election results are stored.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define the Geographic Data: The script contains a list of states with their latitude and longitude coordinates. This data will help to plot each state accurately on the map.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a State Name to Abbreviation Dictionary: This dictionary is used to map full state names to their abbreviations (like “New York” to “NY”), because the map uses abbreviations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set Up the Application: The script sets up an app using a framework called Dash, which helps in building interactive web applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define the Application Layout: The layout of the app is defined to include a graphical element (a map in this case).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Update the Map: A function is defined that updates the map each time it’s called. This function does a few things:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;a. Get Election Results: The function fetches the election results from the database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;b. Process Results: It processes these results to extract the necessary data. For each state, it gets the party that won and the percentage of votes that party received. Parties are assigned a numerical value to color-code them later (0 for Republican and 1 for Democrat).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;c. Prepare Hover Text: This is the text that appears when you hover over a state on the map. It shows the party that won and the percentage of votes they received.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;d. Create the Map: The function creates a map of the United States, with each state color-coded based on the party that won there (blue for Democrats and red for Republicans).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;e. Add Legends: Legends are added to the map to indicate which color corresponds to which party.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;f. Adjust the Layout: Finally, the function adjusts the layout of the map and returns it. The map is displayed in the web application.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
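
&lt;p&gt;Steps b through d can be sketched in plain Python (the document shape is assumed from the Job 1 output, and the abbreviation table is truncated here for brevity):&lt;/p&gt;

```python
# Build the inputs a choropleth needs from election_results documents.
# The document fields (state, party, percentage) follow the Job 1 output;
# the full app would carry all 51 entries in state_abbrev.
state_abbrev = {"Utah": "UT", "Texas": "TX", "California": "CA"}
party_code = {"Republican": 0, "Democrat": 1}  # numeric codes drive the red/blue coloring

def map_inputs(results):
    locations, colors, hover = [], [], []
    for doc in results:
        locations.append(state_abbrev[doc["state"]])   # map expects abbreviations
        colors.append(party_code[doc["party"]])
        hover.append(f"{doc['state']}: {doc['party']} ({doc['percentage']:.1f}%)")
    return locations, colors, hover

sample = [{"state": "Utah", "party": "Republican", "percentage": 52.916666666666664}]
print(map_inputs(sample))  # (['UT'], [0], ['Utah: Republican (52.9%)'])
```

These three lists are what the Dash callback hands to the Plotly figure as locations, color values, and hover text.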

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hid8GfSo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3720/1%2AsjwZvn23nt0X6cuFt8pM0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hid8GfSo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3720/1%2AsjwZvn23nt0X6cuFt8pM0w.png" alt="" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope this guide gives you a better understanding of how MongoDB, PySpark, Twilio, and Dash can be combined to build an efficient, high-performance data pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/@stefentaime_10958/creating-a-real-time-election-monitoring-system-using-mongodb-spark-sms-notifications-and-dash-e3b276180dcb"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FjJ8ySJi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/auzp3iide37b21wky9sx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FjJ8ySJi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/auzp3iide37b21wky9sx.png" alt="Image description" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>data</category>
      <category>spark</category>
      <category>dash</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>End to end data engineering project with Spark, Mongodb, Minio, postgres and Metabase</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Mon, 15 May 2023 14:25:20 +0000</pubDate>
      <link>https://dev.to/stefentaime/end-to-end-data-engineering-project-with-spark-mongodb-minio-postgres-and-metabase-3b36</link>
      <guid>https://dev.to/stefentaime/end-to-end-data-engineering-project-with-spark-mongodb-minio-postgres-and-metabase-3b36</guid>
      <description>&lt;h3&gt;
  
  
  Using open-source technologies to implement a data pipeline
&lt;/h3&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lOnz1TWm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2708/1%2A9e6xb4j1RO9SPpPKdPea4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lOnz1TWm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2708/1%2A9e6xb4j1RO9SPpPKdPea4w.png" alt="" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;All the source code demonstrated in this post is open-source and available on &lt;a href="https://github.com/Stefen-Taime/projet_data"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;git clone &lt;a href="https://github.com/Stefen-Taime/projet_data.git"&gt;https://github.com/Stefen-Taime/projet_data.git&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;As a prerequisite for this post, you will need to create the following resources:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;(1) Linux machine;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(1) Docker;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(1) Docker Compose;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;(1) Virtualenv.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/Stefen-Taime/projet_data.git
cd projet_data/extractor

pip install -r requirements.txt
python main.py

or

docker build --tag=extractor .
docker-compose up run

# This folder contains code used to create a downloads folder, iteratively download files from a list of URIs, unzip them, and delete the zip files.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;At this point, the extractor directory should contain a new downloads folder with two CSV files.&lt;/p&gt;

&lt;h3&gt;
  
  
  then
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd ..
cd docker

docker-compose -f docker-compose-nosql.yml up -d  #for mongodb
docker-compose -f docker-compose-sql.yml up -d    #for postgres and adminer port 8085, metabase port 3000
docker-compose -f docker-compose-s3.yml up -d     #for minio port 9000
docker-compose -f docker-compose-spark.yml up -d  #for spark master and jupyter notebook port 8888


cd ..
cd loader
pip install -r requirements.txt

# ! modify the DATA and DATA_FOR_MONGODB path variables in .env

python loader.py mongodb  # upload data into the mongodb database (on error, manually create an auto-mpg database with an auto collection and try again)
python loader.py minio    # upload data into minio (on error, manually create a landing bucket and try again)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You should now have an auto-mpg database in MongoDB with an auto collection containing data, as well as data in MinIO.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oEz2F3fr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3520/1%2Aj00Z9SvxjH62D21N2KLF7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oEz2F3fr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3520/1%2Aj00Z9SvxjH62D21N2KLF7w.png" alt="" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fT3ahGwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3784/1%2AMgnNor5dNPvEibVgEP77XQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fT3ahGwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3784/1%2AMgnNor5dNPvEibVgEP77XQ.png" alt="" width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  then
&lt;/h3&gt;

&lt;p&gt;Go to localhost:8888 (the password is “stefen”). Once in the Jupyter notebook, run all cells.&lt;/p&gt;

&lt;p&gt;Go to localhost:8085:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KvBwE6zP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3784/1%2AJoNepCQpLl6nS2MYoBsDWQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KvBwE6zP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3784/1%2AJoNepCQpLl6nS2MYoBsDWQ.png" alt="" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Go to localhost:3000:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f8DzqoJ9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3754/1%2ApJ3ARMpJJ68j8hdpPK49IQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f8DzqoJ9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3754/1%2ApJ3ARMpJJ68j8hdpPK49IQ.png" alt="" width="800" height="302"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Cleaning Up
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://medium.com/@stefentaime_10958/end-to-end-data-engineering-project-with-spark-mongodb-minio-postgres-and-metabase-2c400672b50d"&gt;https://medium.com/@stefentaime_10958/end-to-end-data-engineering-project-with-spark-mongodb-minio-postgres-and-metabase-2c400672b50d&lt;/a&gt;&lt;/p&gt;

</description>
      <category>spark</category>
      <category>postgres</category>
      <category>metabase</category>
      <category>mongodb</category>
    </item>
    <item>
      <title>ELT Airflow Pipeline Project</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Mon, 15 May 2023 14:23:29 +0000</pubDate>
      <link>https://dev.to/stefentaime/elt-airflow-pipeline-project-m5m</link>
      <guid>https://dev.to/stefentaime/elt-airflow-pipeline-project-m5m</guid>
      <description>&lt;h2&gt;
  
  
  About
&lt;/h2&gt;

&lt;p&gt;A project applying core data engineering concepts.&lt;/p&gt;

&lt;p&gt;The project is an ELT (Extract, Load, Transform) data pipeline, orchestrated with Apache Airflow through Docker containers.&lt;/p&gt;

&lt;p&gt;Faker is used to generate data into a MySQL database. The data is extracted from MySQL, transformed with pandas and SQL, and then loaded into an OLAP Postgres database. A notification is sent by email once the whole process completes.&lt;/p&gt;
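
&lt;p&gt;The extract-transform-load flow described above can be sketched in plain Python with in-memory stand-ins for MySQL and Postgres (the table shape and helper names are hypothetical, not from the project):&lt;/p&gt;

```python
# Minimal sketch of the pipeline's three steps; MySQL and Postgres are
# replaced by in-memory structures, and the row shape is made up.
def extract(mysql_rows):
    """Pull raw order rows from the OLTP source."""
    return list(mysql_rows)

def transform(rows):
    """Aggregate revenue per customer, as the pandas/SQL step would."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0) + row["amount"]
    return [{"customer": c, "revenue": r} for c, r in sorted(totals.items())]

def load(olap_table, rows):
    """Append the aggregates to the OLAP target, returning the row count."""
    olap_table.extend(rows)
    return len(rows)

source = [{"customer": "a", "amount": 10}, {"customer": "b", "amount": 5},
          {"customer": "a", "amount": 7}]
olap = []
loaded = load(olap, transform(extract(source)))
print(loaded, olap)
```

In the real pipeline each function becomes an Airflow task: extract reads from MySQL, transform runs in pandas/SQL, load writes to the OLAP Postgres database, and a final task sends the completion email.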

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YMUfYI6u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2708/1%2AmUtAQzioW_Ct5B4I5B8PDA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YMUfYI6u--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2708/1%2AmUtAQzioW_Ct5B4I5B8PDA.png" alt="" width="800" height="460"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.docker.com/get-docker/"&gt;Docker&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.docker.com/compose/"&gt;Docker Compose&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://mailtrap.io/"&gt;Mailtrap Account&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setup mailtrap
&lt;/h2&gt;

&lt;p&gt;One platform to test, send, and control your emails:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aSbvHVsm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2406/1%2AlgSWIBGFZdXurBntiHwQjA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aSbvHVsm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2406/1%2AlgSWIBGFZdXurBntiHwQjA.png" alt="" width="800" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;p&gt;Clone the project to your desired location:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/Stefen-Taime/airflow_etl.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fill in the AIRFLOW__SMTP__SMTP_USER, AIRFLOW__SMTP__SMTP_PASSWORD, and AIRFLOW__SMTP__SMTP_MAIL_FROM values in the .envExample file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AIRFLOW_ADMIN_MAIL=airflow
AIRFLOW_ADMIN_FIRSTNAME=airflow
AIRFLOW_ADMIN_NAME=airflow
AIRFLOW_ADMIN_PASSWORD=airflowpassword
AIRFLOW__CORE__LOAD_DEFAULT_CONNECTIONS=False
AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgres+psycopg2://airflow:airflowpassword@postgres:5432/airflow
AIRFLOW__CORE__FERNET_KEY=81HqDtbqAywKSOumSha3BhWNOdQ26slT6K0YaZeZyPs=
AIRFLOW_CONN_METADATA_DB=postgres+psycopg2://airflow:airflowpassword@postgres:5432/airflow
AIRFLOW_VAR__METADATA_DB_SCHEMA=airflow
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC=5
AIRFLOW__CORE__EXECUTOR=LocalExecutor
AIRFLOW__SMTP__SMTP_HOST=smtp.mailtrap.io
AIRFLOW__SMTP__SMTP_PORT=2525
AIRFLOW__SMTP__SMTP_USER=xxxxxxxxxxx
AIRFLOW__SMTP__SMTP_PASSWORD=xxxxxxx
AIRFLOW__SMTP__SMTP_MAIL_FROM=your_email@gmail.com
AIRFLOW__WEBSERVER__BASE_URL=http://localhost:8080
POSTGRES_USER=airflow
POSTGRES_PASSWORD=airflowpassword
POSTGRES_DB=airflow
AIRFLOW_UID=1000
AIRFLOW_GID=0
PG_VER=14-alpine
POSTGRES_SRC_PASSWORD=Sup3rS3c3t
PORT=5432
POSTGRES_USER_OLAP=postgres
HOSTNAME=olap
CONTAINER_NAME=postgres
POSTGRES_DB_OLAP=postgres
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
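&lt;p&gt;Airflow's email alerting is plain SMTP under the hood. As a rough sketch of what the completion notice looks like with the Mailtrap settings above (the function names and message text are illustrative, not the project's actual code):&lt;/p&gt;

```python
# Sketch: composing and sending the pipeline-completion email via Mailtrap.
# build_notification / send_notification are hypothetical helper names.
import smtplib
from email.mime.text import MIMEText

SMTP_HOST, SMTP_PORT = "smtp.mailtrap.io", 2525  # values from the .env above

def build_notification(sender: str, recipient: str) -> MIMEText:
    # Compose the completion notice the DAG emails once the load finishes.
    msg = MIMEText("ETL pipeline completed successfully.")
    msg["Subject"] = "Airflow ETL: run finished"
    msg["From"] = sender
    msg["To"] = recipient
    return msg

def send_notification(user: str, password: str, msg: MIMEText) -> None:
    # user/password are the Mailtrap credentials placed in the .env file.
    with smtplib.SMTP(SMTP_HOST, SMTP_PORT) as server:
        server.starttls()
        server.login(user, password)
        server.send_message(msg)
```

In practice Airflow sends this for you via its SMTP settings; the sketch just shows what happens underneath.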

&lt;p&gt;Grant execute permission to the bash script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chmod a+x build_Services.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then run it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ ./build_Services.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Build Docker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker-compose up --build -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;When everything is done, you can check all the containers running:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker ps&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  OLTP Interface
&lt;/h2&gt;

&lt;p&gt;Now you can access the Adminer web interface at &lt;a href="http://localhost:8085/"&gt;http://localhost:8085&lt;/a&gt; with the default credentials from docker-compose.yml:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Système     MySQL&lt;br&gt;
Serveur     oltp&lt;br&gt;
user        root&lt;br&gt;
password    myrootpassword&lt;br&gt;&lt;br&gt;
Database    testdb&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  OLAP Interface
&lt;/h2&gt;

&lt;p&gt;Now you can access a second Adminer connection at &lt;a href="http://localhost:8085/"&gt;http://localhost:8085&lt;/a&gt; with the default credentials from docker-compose.yml:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Système     PostgesSQL&lt;br&gt;
Serveur     olap&lt;br&gt;
user        postgres&lt;br&gt;
password    Sup3rS3c3t&lt;br&gt;&lt;br&gt;
Database    postgres&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Airflow Interface
&lt;/h2&gt;

&lt;p&gt;Now you can access the Airflow web interface at &lt;a href="http://localhost:8080/"&gt;http://localhost:8080&lt;/a&gt; with the default credentials from docker-compose.yml. &lt;strong&gt;Username/Password: airflow/airflowpassword&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4uY42mjJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3834/1%2Ag5dOEFcUbrFUKJmvGEKBcA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4uY42mjJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3834/1%2Ag5dOEFcUbrFUKJmvGEKBcA.png" alt="" width="800" height="244"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Airflow DAG
&lt;/h2&gt;

&lt;p&gt;Now you can run the Airflow ETL DAG:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ti2qOw_g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3762/1%2AfCb9ndT0iYOf3buibMW3Nw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ti2qOw_g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3762/1%2AfCb9ndT0iYOf3buibMW3Nw.png" alt="" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check the OLTP and OLAP databases
&lt;/h2&gt;

&lt;p&gt;:)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NYvhKC8v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3570/1%2ACLgjyVaoAliEWOEX582FTQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NYvhKC8v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3570/1%2ACLgjyVaoAliEWOEX582FTQ.png" alt="" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---w1FXoUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3376/1%2AotdNZLYZd7Sr5V0LDrSMlQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---w1FXoUo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3376/1%2AotdNZLYZd7Sr5V0LDrSMlQ.png" alt="" width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Check your &lt;a href="http://mailtrap.io/inboxes"&gt;mailtrap.io/inboxes&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--weTsSIqp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2260/1%2AiugJpfXCbJv5rVgyigWU3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--weTsSIqp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2260/1%2AiugJpfXCbJv5rVgyigWU3w.png" alt="" width="800" height="239"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Shut down or restart Airflow
&lt;/h2&gt;

&lt;p&gt;If you need to make changes or shut down:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker-compose down&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html"&gt;Apache Airflow Documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://api-docs.mailtrap.io/"&gt;The following documentation mailtrap&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://faker.readthedocs.io/en/master/"&gt;Faker&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://medium.com/@stefentaime_10958/elt-airflow-pipeline-project-dcf834c1be17"&gt;https://medium.com/@stefentaime_10958/elt-airflow-pipeline-project-dcf834c1be17&lt;/a&gt;&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>pipeline</category>
      <category>etl</category>
    </item>
    <item>
      <title>Building a Scalable RSS Feed Pipeline with Apache Airflow, Kafka, and MongoDB, Flask Api</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Mon, 15 May 2023 14:20:29 +0000</pubDate>
      <link>https://dev.to/stefentaime/building-a-scalable-rss-feed-pipeline-with-apache-airflow-kafka-and-mongodb-flask-api-52gi</link>
      <guid>https://dev.to/stefentaime/building-a-scalable-rss-feed-pipeline-with-apache-airflow-kafka-and-mongodb-flask-api-52gi</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IIL5jIWK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ASCbdQEYl9SkUn-y51NK7rQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IIL5jIWK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2ASCbdQEYl9SkUn-y51NK7rQ.png" alt="" width="500" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In today’s data-driven world, processing large volumes of data in real time has become essential for many organizations. The Extract, Transform, Load (ETL) process is a common way to manage the flow of data between systems. In this article, we’ll walk through how to build a scalable ETL pipeline using Apache Airflow, Kafka, Python, MongoDB, and Flask.&lt;/p&gt;

&lt;p&gt;In this pipeline, the RSS feeds are scraped using a Python library called feedparser. This library is used to parse the XML data in the RSS feeds and extract the relevant information. The parsed data is then transformed into a standardized JSON format using Python's built-in json library. This format includes fields such as title, summary, link, published_date, and language, which make the data easier to analyze and consume.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NEWS_FEEDS = {
        "en": [
            "https://www.cnn.com/rss/edition.rss",
            "https://www.bbc.com/news/10628494",
            "https://www.nbcnews.com/id/303207/device/rss/rss.xml",
            "https://www.foxnews.com/about/rss/"
        ],
        "pl": [
            "https://www.tvn24.pl/najnowsze.xml",
            "https://www.rmf24.pl/fakty/polska/feed",
            "https://wiadomosci.wp.pl/rss",
            "https://www.money.pl/rss/wszystkie"
        ],
        "es": [
            "https://www.elpais.com/rss/feed.html?feedId=1022",
            "https://www.abc.es/rss/feeds/abc_EspanaEspana.xml",
            "https://www.elconfidencial.com/rss/",
            "https://www.elperiodico.com/es/rss/"
        ],
        "de": [
            "https://www.tagesschau.de/xml/rss2",
            "https://www.faz.net/rss/aktuell/",
            "https://www.zeit.de/rss",
            "https://www.spiegel.de/schlagzeilen/tops/index.rss"
        ],
        "fr": [
            "https://www.lemonde.fr/rss/une.xml",
            "https://www.lefigaro.fr/rss/figaro_actualites.xml",
            "https://www.liberation.fr/rss/",
            "https://www.lci.fr/rss"
        ]
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
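&lt;p&gt;The normalization step described above can be sketched as a small pure function. The field mapping below assumes a feedparser-style entry dict; the function name is illustrative:&lt;/p&gt;

```python
# Sketch: map a parsed feed entry onto the standardized JSON format.
# normalize_entry is a hypothetical helper; field names match the article.
import json

def normalize_entry(entry: dict, language: str) -> str:
    record = {
        "title": entry.get("title", "").strip(),
        "summary": entry.get("summary", "").strip(),
        "link": entry.get("link", ""),
        "published_date": entry.get("published", ""),
        "language": language,
    }
    # ensure_ascii=False keeps accented characters from the non-English feeds.
    return json.dumps(record, ensure_ascii=False)
```

Every downstream consumer (Kafka, MongoDB, the Flask API) then works with this one record shape.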
&lt;h2&gt;
  
  
  What is Apache Airflow?
&lt;/h2&gt;

&lt;p&gt;Apache Airflow is a platform used to programmatically author, schedule, and monitor workflows. It allows developers to create complex workflows by defining tasks and their dependencies. Airflow makes it easy to monitor the execution of tasks and provides an intuitive web interface to visualize the workflow.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Kafka?
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is a distributed event streaming platform that allows you to publish and subscribe to streams of records. Kafka provides high-throughput, low-latency, and fault-tolerant data transport. Kafka can be used for real-time data processing, streaming analytics, and log aggregation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementing the ETL pipeline
&lt;/h2&gt;

&lt;p&gt;To implement the ETL pipeline, we’ll use Python and the following libraries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;feedparser: A Python library that parses RSS feeds&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;beautifulsoup4: A Python library that extracts data from HTML and XML files&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;kafka-python: A Python library that provides a Kafka client&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;redis: A Python library that provides a Redis client&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;First, we’ll define a DAG (Directed Acyclic Graph) in Airflow to run the pipeline on a scheduled basis. The DAG consists of four tasks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Update the proxy pool: This task retrieves a list of proxy servers from Redis or a public API, tests their connectivity, and stores the valid proxies in Redis. We’ll use the proxies to avoid getting blocked by the RSS feed servers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extract news: This task reads the RSS feeds using the valid proxies, extracts the news articles, and stores them in a list. We’ll use concurrent programming to speed up the extraction process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Validate data: This task checks if the news articles have all the required fields (title, link, and summary), and stores the valid articles in a separate list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Send to Kafka: This task sends the validated news articles to a Kafka topic, using the JsonConverter to serialize the data.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.utils.dates import days_ago

import datetime
from datetime import timedelta
import feedparser
from bs4 import BeautifulSoup
from kafka import KafkaProducer
from kafka.errors import KafkaError
import json
import requests
import random
import redis
import concurrent.futures
import html

# NEWS_FEEDS is the dictionary of RSS feed URLs shown above.

headers_list = [
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    },
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Referer": "https://www.google.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    },
    {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
        "Dnt": "1",
        "Referer": "https://www.google.com/",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36",
        "X-Amzn-Trace-Id": "Root=1-5ee7bae0-82260c065baf5ad7f0b3a3e3"
    },
    {
        "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:55.0) Gecko/20100101 Firefox/55.0",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "pl-PL,pl;q=0.9,en-US;q=0.8,en;q=0.7",
        "Referer": "https://www.reddit.com/",
        "DNT": "1",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1"
    }
]

# Define default_args dictionary to pass to the DAG
ARGS = {
    "owner": "stefentaime",
    "start_date": days_ago(0),
    "retries": 1,
    "retry_delay": timedelta(seconds=30)
}

dag = DAG(
    dag_id="ETL-Pipeline",
    default_args=ARGS,
    description="",
    schedule_interval="0 0 1 * *",
    tags=["ETL", "kafka", "Scraping"]
)

REDIS_CONFIG = {'host': 'redis', 'port': 6379, 'decode_responses': True}
REDIS_KEY = 'proxies'
PROXY_WEBPAGE = 'https://free-proxy-list.net/'
TESTING_URL = 'https://httpbin.org/ip'
MAX_WORKERS = 20
PROXY_EXPIRATION = timedelta(minutes=5)

def get_proxies():
    r = redis.Redis(**REDIS_CONFIG)
    if r.exists(REDIS_KEY):
        proxies = r.lrange(REDIS_KEY, 0, -1)
        expiration = r.ttl(REDIS_KEY)
        if expiration == -1:
            r.expire(REDIS_KEY, PROXY_EXPIRATION)
        elif expiration &amp;lt; PROXY_EXPIRATION.total_seconds():
            r.delete(REDIS_KEY)
            proxies = []
    else:
        proxies = []
    if not proxies:
        headers = random.choice(headers_list)
        page = requests.get(PROXY_WEBPAGE, headers=headers)
        soup = BeautifulSoup(page.content, 'html.parser')
        for row in soup.find('tbody').find_all('tr'):
            proxy = row.find_all('td')[0].text + ':' + row.find_all('td')[1].text
            proxies.append(proxy)
        r.rpush(REDIS_KEY, *proxies)
        r.expire(REDIS_KEY, PROXY_EXPIRATION)
    return proxies

def test_proxy(proxies):
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        results = list(executor.map(test_single_proxy, proxies))
    return (proxy for valid, proxy in zip(results, proxies) if valid)

def test_single_proxy(proxy):
    headers = random.choice(headers_list)
    try:
        resp = requests.get(TESTING_URL, headers=headers, proxies={"http": proxy, "https": proxy}, timeout=3)
        if resp.status_code == 200:
            return True
    except requests.RequestException:
        pass
    return False

# Define the task to update the proxypool
def update_proxypool(**kwargs):
    proxies = get_proxies()
    valid_proxies = list(test_proxy(proxies))
    kwargs['ti'].xcom_push(key='valid_proxies', value=valid_proxies)

next_id = 1

def extract_website_name(link):
    # Extract the website name from the link
    website_name = link.split('//')[1].split('/')[0]
    # Remove any leading "www." from the website name
    website_name = website_name.replace('www.', '')
    return website_name

def extract_article_data(entry, language):
    global next_id
    title = entry.title.encode('ascii', 'ignore').decode()
    soup = BeautifulSoup(entry.summary, 'html.parser')
    summary = html.unescape(soup.get_text().strip().replace('\xa0', ' '))
    link = entry.link
    date_published = entry.get('published_parsed', None)
    if date_published is not None:
        date_published = datetime.datetime(*date_published[:6])
        time_since_published = datetime.datetime.utcnow() - date_published
        if time_since_published &amp;lt; datetime.timedelta(hours=1):
            today = datetime.datetime.utcnow().strftime("%d-%m-%Y")
            website_name = extract_website_name(link)
            unique_id = f"{language.upper()}{next_id:02d}-{website_name}-01-{today}"
            next_id += 1
            return {
                'id': unique_id,
                'title': title,
                'link': link,
                'summary': summary,
                'language': language
            }
    return None

def extract_news_feed(feed_url, language, proxy):
    feed = feedparser.parse(feed_url, request_headers={'User-Agent': proxy})
    articles = []
    extracted_articles = set()
    for entry in feed.entries:
        if len(articles) &amp;gt;= 2:
            break
        link = entry.link
        title = entry.title.encode('ascii', 'ignore').decode()
        unique_id = f'{language}-{link}-{title}'
        if unique_id in extracted_articles:
            continue
        extracted_articles.add(unique_id)
        article_data = extract_article_data(entry, language)
        if article_data is not None:
            articles.append(article_data)
    return articles

def extract_news(**kwargs):
    valid_proxies = set(kwargs['ti'].xcom_pull(key='valid_proxies'))
    articles = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
        futures = [executor.submit(extract_news_feed, feed_url, language, proxy) for language in NEWS_FEEDS.keys()
                   for feed_url in NEWS_FEEDS[language] for proxy in valid_proxies]
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            if result is not None:
                articles.extend(result)
        kwargs['ti'].xcom_push(key='articles', value=articles)
    return articles

# Define the task to validate the quality of the data
def validate_data(**kwargs):
    articles = kwargs['ti'].xcom_pull(key='articles', task_ids='extract_news')
    validated_articles = [article for article in articles if all(article.get(k) for k in ('title', 'link', 'summary'))]
    kwargs['ti'].xcom_push(key='validated_articles', value=validated_articles)
    return validated_articles

# Define the task to send data to the Kafka topic
def send_to_kafka(**kwargs):
    validated_articles = kwargs['ti'].xcom_pull(key='validated_articles', task_ids='validate_data')
    producer = KafkaProducer(bootstrap_servers='broker:29092')
    for article in validated_articles:
        try:
            producer.send('rss_feeds', key=article['title'].encode(), value=json.dumps(article).encode())
        except KafkaError as e:
            print(f"Failed to send message to Kafka: {e}")
    producer.flush()
    print("Data sent to Kafka successfully.")

# Define the tasks
update_proxypool_task = PythonOperator(task_id='update_proxypool', python_callable=update_proxypool, provide_context=True, dag=dag)
extract_news_task = PythonOperator(task_id='extract_news', python_callable=extract_news, provide_context=True, dag=dag)
validate_data_task = PythonOperator(task_id='validate_data', python_callable=validate_data, provide_context=True, dag=dag)
send_to_kafka_task = PythonOperator(task_id='send_to_kafka', python_callable=send_to_kafka, provide_context=True, dag=dag)

# Set the task dependencies
update_proxypool_task &amp;gt;&amp;gt; extract_news_task &amp;gt;&amp;gt; validate_data_task &amp;gt;&amp;gt; send_to_kafka_task
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uqZiyW2I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3752/1%2A7YyfecNx4iCIn4ouBE5yXg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uqZiyW2I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3752/1%2A7YyfecNx4iCIn4ouBE5yXg.png" alt="" width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we’ll deploy a Kafka connector to consume the news articles from the Kafka topic and load them into MongoDB. We’ll use the MongoSinkConnector from the mongo-kafka-connect library, which provides an efficient and reliable way to integrate Kafka with MongoDB. The connector is configured to read the news articles from the Kafka topic, and write them to a MongoDB collection in the demo database.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "name": "mongodb-sink-connector",
    "config": {
      "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
      "tasks.max": "1",
      "topics": "rss_feeds",
      "connection.uri": "mongodb://debezium:dbz@mongo:27017/demo?authSource=admin",
      "database": "demo",
      "collection": "rss_feeds_collection",
      "key.converter": "org.apache.kafka.connect.storage.StringConverter",
      "value.converter": "org.apache.kafka.connect.json.JsonConverter",
      "key.converter.schemas.enable": "false",
      "value.converter.schemas.enable": "false"
    }
  }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
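&lt;p&gt;The connector JSON above is typically registered by POSTing it to the Kafka Connect REST API. A minimal sketch, assuming Connect is exposed on its default REST port 8083 (the hostname and helper names are illustrative):&lt;/p&gt;

```python
# Sketch: register the MongoDB sink connector via the Kafka Connect REST API.
# CONNECT_URL and build_request/register_connector are hypothetical names.
import json
import urllib.request

CONNECT_URL = "http://localhost:8083/connectors"  # assumption: default Connect REST port

def build_request(payload: dict, connect_url: str = CONNECT_URL) -> urllib.request.Request:
    # Connect expects the connector definition as a JSON POST body.
    return urllib.request.Request(
        connect_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def register_connector(payload: dict, connect_url: str = CONNECT_URL):
    # Sends the request; urlopen raises on a non-2xx response.
    return urllib.request.urlopen(build_request(payload, connect_url))
```

Passing the `mongodb-sink-connector` config shown above to `register_connector` would deploy it to the Connect cluster.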

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qrvOg1l3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3172/1%2ApdQRa9G6KgwuSLsiK7RCFA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qrvOg1l3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3172/1%2ApdQRa9G6KgwuSLsiK7RCFA.png" alt="" width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To run the pipeline, you need to set up the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Apache Airflow: Use pip to install Airflow, and create a Python script that defines the DAG.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Redis: Set up a Redis instance to store the proxy servers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kafka: Install and configure a Kafka cluster with a single broker, and create a Kafka topic named rss_feeds.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MongoDB: Install and configure a MongoDB cluster, and create a database named demo.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kafka Connector: Deploy the mongo-kafka-connect connector to your Kafka cluster, and configure it to read from the rss_feeds topic and write to the rss_feeds_collection collection in the demo database.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A Flask web application serves the news articles stored in the MongoDB database. It provides the following endpoints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pymongo import MongoClient
from bson.objectid import ObjectId
from flask import Flask, request, jsonify, render_template

client = MongoClient('mongodb://debezium:dbz@localhost:27017/?authSource=admin')
db = client['demo']
collection = db['rss_feeds_collection']


app = Flask(__name__, template_folder='/path/template')

# get all news articles
@app.route('/news', methods=['GET'])
def get_all_news():
    cursor = collection.find({}, {"_id": 0})
    news = []
    for item in cursor:
        news.append({'title': item['title'], 'summary': item['summary'], 'link': item['link'], 'language': item['language'], 'id': item['id']})
    return jsonify({'news': news})

# get a news article by id
@app.route('/news/&amp;lt;id&amp;gt;', methods=['GET'])
def get_news_by_id(id):
    item = collection.find_one({'id': id})
    if item:
        return jsonify({'_id': str(item['_id']), 'title': item['title'], 'summary': item['summary'], 'link': item['link'], 'language': item['language']})
    else:
        return jsonify({'error': 'News article not found'})

# update a news article by id
@app.route('/news/&amp;lt;id&amp;gt;', methods=['PUT'])
def update_news_by_id(id):
    item = collection.find_one({'id': id})
    if item:
        data = request.get_json()
        collection.update_one({'id': id}, {'$set': data})
        return jsonify({'message': 'News article updated successfully'})
    else:
        return jsonify({'error': 'News article not found'})

# delete a news article by id
@app.route('/news/&amp;lt;id&amp;gt;', methods=['DELETE'])
def delete_news_by_id(id):
    item = collection.find_one({'id': id})
    if item:
        collection.delete_one({'id': id})
        return jsonify({'message': 'News article deleted successfully'})
    else:
        return jsonify({'error': 'News article not found'})


# render a web page with news articles
@app.route('/', methods=['GET'])
def news_page():
    page = request.args.get('page', 1, type=int)
    language = request.args.get('language')

    # build query for language filtering
    query = {} if not language else {'language': language}

    # retrieve total count and paginated news articles
    count = collection.count_documents(query)
    cursor = collection.find(query, {"_id": 0}).skip((page-1)*8).limit(8)  # page size must match num_pages below
    news = []
    for item in cursor:
        news.append({'title': item['title'], 'summary': item['summary'], 'link': item['link'], 'language': item['language'], 'id': item['id']})

    # calculate number of pages for pagination
    num_pages = count // 8 + (1 if count % 8 &amp;gt; 0 else 0)

    return render_template('index.html', news=news, page=page, language=language, num_pages=num_pages)

if __name__ == '__main__':
    app.run(debug=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GET /news: returns all news articles from the database&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GET /news/&amp;lt;id&amp;gt;: returns the news article with the specified id&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PUT /news/&amp;lt;id&amp;gt;: updates the news article with the specified id&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DELETE /news/&amp;lt;id&amp;gt;: deletes the news article with the specified id&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GET /: renders a web page that displays paginated news articles with an optional language filter&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
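&lt;p&gt;The skip/limit arithmetic behind the paginated &lt;code&gt;/&lt;/code&gt; route is worth keeping in one place so the page size cannot drift. A sketch, with &lt;code&gt;paginate&lt;/code&gt; as a hypothetical helper rather than part of the app above:&lt;/p&gt;

```python
def paginate(count, page, per_page=8):
    """Return (skip, limit, num_pages) for a 1-indexed page of `count` items."""
    # Ceiling division without floats: full pages plus one partial page if needed
    num_pages = count // per_page + (1 if count % per_page else 0)
    skip = (page - 1) * per_page
    return skip, per_page, num_pages

# Example: 17 articles, 8 per page -> 3 pages; page 2 skips the first 8
skip, limit, num_pages = paginate(17, 2)
```

The Mongo cursor then becomes `collection.find(query).skip(skip).limit(limit)`, and the template receives the same `num_pages`.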

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hPVj--8J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3762/1%2A8bV7pbg24O2SBjGy6TjX9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hPVj--8J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3762/1%2A8bV7pbg24O2SBjGy6TjX9w.png" alt="" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before we start, make sure you have the following installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Python 3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Docker and Docker Compose&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A text editor&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Steps To Run:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clone the project to your desired location:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;$ git clone &lt;a href="https://github.com/Stefen-Taime/Scalable-RSS-Feed-Pipeline.git"&gt;https://github.com/Stefen-Taime/Scalable-RSS-Feed-Pipeline.git&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Execute the following command that will create the .env file containing the Airflow UID needed by docker-compose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$ echo -e "AIRFLOW_UID=$(id -u)" &amp;gt; .env&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Build Docker:&lt;/p&gt;

&lt;p&gt;$ docker-compose build&lt;/p&gt;

&lt;p&gt;Initialize Airflow database:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$ docker-compose up airflow-init&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start Containers:&lt;/p&gt;

&lt;p&gt;$ docker-compose up -d&lt;/p&gt;

&lt;p&gt;When everything is done, you can check all the containers running:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$ docker ps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can now access the Airflow web interface at &lt;a href="http://localhost:8080/"&gt;http://localhost:8080&lt;/a&gt; with the default credentials defined in docker-compose.yml (username/password: airflow). From there, trigger the DAG and watch its tasks run.&lt;/p&gt;

&lt;p&gt;To set up Kafka and MongoDB, navigate to the mongo-kafka directory:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$ cd mongo-kafka&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Start Kafka and MongoDB containers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$ docker-compose up -d&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Execute the following command to create the MongoDB sink connector:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$ curl -X POST -H "Content-Type: application/json" --data @mongo-sink.json &lt;a href="http://localhost:8083/connectors"&gt;http://localhost:8083/connectors&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Execute the following command to run the API:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$ python api.py&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;In conclusion, this article has covered a variety of topics related to building a scalable RSS feed pipeline. We started by discussing RSS feeds and how to scrape them using Python. We then explored the use of Apache Airflow for orchestrating the pipeline and scheduling tasks.&lt;/p&gt;

&lt;p&gt;Next, we looked at how to use Kafka as a message broker to handle the data flow between the different components of the pipeline. We also examined the use of Kafka Connect to integrate Kafka with MongoDB and to enable easy data ingestion.&lt;/p&gt;

&lt;p&gt;To visualize the data ingested into MongoDB, we built a simple Flask API with Jinja templates to render a web page with paginated news articles. We used Bootstrap to make the page responsive and added filtering capabilities based on the language of the news articles.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/@stefentaime_10958/building-a-scalable-rss-feed-pipeline-with-apache-airflow-kafka-and-mongodb-flask-api-da379cc2e3fb"&gt;https://medium.com/@stefentaime_10958/building-a-scalable-rss-feed-pipeline-with-apache-airflow-kafka-and-mongodb-flask-api-da379cc2e3fb&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Real-Time Data Processing with MySQL, Redpanda, MinIO, and Apache Spark Using Delta Lake</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Mon, 15 May 2023 14:18:45 +0000</pubDate>
      <link>https://dev.to/stefentaime/real-time-data-processing-with-mysql-redpanda-minio-and-apache-spark-using-delta-lake-2i9o</link>
      <guid>https://dev.to/stefentaime/real-time-data-processing-with-mysql-redpanda-minio-and-apache-spark-using-delta-lake-2i9o</guid>
      <description>&lt;p&gt;In this article, you will learn how to set up a real-time data processing and analytics environment using Docker, MySQL, Redpanda, MinIO, and Apache Spark. We will create a system that generates fake data simulating sensors on a bridge that flash car plates at each passage. The data will be stored in a MySQL database, and processed in real-time using Redpanda and Kafka Connect. We will then use MinIO as a distributed object storage and Apache Spark to further process and analyze the data. Additionally, we will integrate the Twilio API for real-time notifications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_dYRJpQz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2974/1%2AsPBGohDjls781VcZfJLnhQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_dYRJpQz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2974/1%2AsPBGohDjls781VcZfJLnhQ.png" alt="" width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Introduction&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setting up the environment&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Docker Compose configuration&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data generation and storage in MySQL&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Creating an API for data ingestion&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Setting up connectors for data streaming and storage&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Real-time data processing with Apache Spark&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Reading data from MinIO&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data transformation and storage in the data warehouse&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integrating Twilio for real-time notifications&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;In this article, we will walk through the process of setting up a real-time data processing and analytics environment for vehicle plate recognition. We will use Docker to manage our services, MySQL for data storage, Redpanda as a streaming platform, MinIO as an object storage server, and Apache Spark for data processing and analysis. We will also integrate the Twilio API to send SMS notifications in real-time based on the processed data.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Setting up the environment
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Docker Compose configuration
&lt;/h2&gt;

&lt;p&gt;To begin, we will create a Docker Compose file that defines all the necessary services, networks, and volumes for our environment. The services include Redpanda, MinIO, MySQL, Kafka Connect, Adminer, Spark Master, Spark Workers, Jupyter Notebook, a data generator, and an API.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "3.7"
services:
  redpanda:
    image: vectorized/redpanda
    container_name: redpanda
    ports:
      - "9092:9092"
      - "29092:29092"
    command:
      - redpanda
      - start
      - --overprovisioned
      - --smp
      - "1"
      - --memory
      - "1G"
      - --reserve-memory
      - "0M"
      - --node-id
      - "0"
      - --kafka-addr
      - PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
      - --advertise-kafka-addr
      - PLAINTEXT://redpanda:29092,OUTSIDE://redpanda:9092
      - --check=false
    networks:
      - spark_network  

  redpanda-console:
    image: vectorized/console
    container_name: redpanda_console
    depends_on:
      - redpanda
    ports:
      - "5000:8080"
    env_file:
      - .env
    networks:
      - spark_network  

  minio:
    hostname: minio
    image: "minio/minio"
    container_name: minio
    ports:
      - "9001:9001"
      - "9000:9000"
    command: [ "server", "/data", "--console-address", ":9001" ]
    volumes:
      - ./minio/data:/data
    env_file:
      - .env
    networks:
      - spark_network  

  mc:
    image: minio/mc
    container_name: mc
    hostname: mc
    environment:
      - AWS_ACCESS_KEY_ID=minio
      - AWS_SECRET_ACCESS_KEY=minio123
      - AWS_REGION=us-east-1
    entrypoint: &amp;gt;
      /bin/sh -c " until (/usr/bin/mc config host add minio http://minio:9000 minio minio123) do echo '...waiting...' &amp;amp;&amp;amp; sleep 1; done; /usr/bin/mc mb minio/warehouse; /usr/bin/mc policy set public minio/warehouse; exit 0; "
    depends_on:
      - minio
    networks:
      - spark_network  

  mysql:
    image: debezium/example-mysql:1.6
    container_name: mysql
    volumes:
      - ./mysql/data:/var/lib/mysql
    ports:
      - "3306:3306"
    env_file:
      - .env
    networks:
      - spark_network  

  kafka-connect:
    build:
      context: ./kafka
      dockerfile: ./Dockerfile
    container_name: kafka_connect
    depends_on:
      - redpanda
    ports:
      - "8083:8083"
    env_file:
      - .env
    networks:
      - spark_network  

  adminer:
    image: adminer:latest
    ports:
      - 8085:8080/tcp
    deploy:
     restart_policy:
       condition: on-failure 
    networks:
      - spark_network      

  spark-master:
    build:
      context: ./spark
      dockerfile: ./Dockerfile
    container_name: "spark-master"
    environment:
      - SPARK_MODE=master
      - SPARK_LOCAL_IP=spark-master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    ports:
      - "7077:7077"
      - "8080:8080"
    volumes:
      - ./spark/spark-defaults.conf:/opt/bitnami/spark/conf/spark-defaults.conf
    networks:
      - spark_network

  spark-worker-1:
    image: docker.io/bitnami/spark:3.3
    container_name: "spark-worker-1"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - spark_network

  spark-worker-2:
    image: docker.io/bitnami/spark:3.3
    container_name: "spark-worker-2"
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=4G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    networks:
      - spark_network

  spark-notebook:
    build:
      context: ./notebooks
      dockerfile: ./Dockerfile
    container_name: "spark-notebook"
    user: root
    environment:
      - JUPYTER_ENABLE_LAB="yes"
      - GRANT_SUDO="yes"
    volumes:
      - ./notebooks:/home/jovyan/work
      - ./notebooks/spark-defaults.conf:/usr/local/spark/conf/spark-defaults.conf
    ports:
      - "8888:8888"
      - "4040:4040"
    networks:
      - spark_network

  generate_data:
    build: ./generate_data
    container_name: generate_data
    command: python generate_data.py
    depends_on:
      - mysql
    networks:
      - spark_network

  api:
    build: ./api
    ports:
      - "8000:8000"
    depends_on:
      - mysql          


networks:
  spark_network:
    driver: bridge
    name: spark_network

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Build and start all the services:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker-compose up --build -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Data generation and storage in MySQL
&lt;/h2&gt;

&lt;p&gt;Once our environment is set up, we will generate fake data simulating sensors on a bridge that flash car plates at each passage. The data will include vehicle and owner information, subscription status, and other relevant fields. This data will be stored in a MySQL database and serve as the source of our real-time data processing pipeline.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
import uuid
from faker import Faker
import pandas as pd
import mysql.connector
from datetime import datetime, timedelta

# Initialize Faker
fake = Faker()

# Number of data points to generate
num_records = 1000

# Generate synthetic data
data = []

for _ in range(num_records):
    unique_id = str(uuid.uuid4())
    plate_number = f"{random.randint(1000, 9999)}-{fake.random_element(elements=('AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'FFF', 'GGG', 'HHH', 'III', 'JJJ', 'KKK', 'LLL', 'MMM', 'NNN', 'OOO', 'PPP', 'QQQ', 'RRR', 'SSS', 'TTT', 'UUU', 'VVV', 'WWW', 'XXX', 'YYY', 'ZZZ'))}"

    car_info = {
        "make": fake.random_element(elements=("Toyota", "Honda", "Ford", "Chevrolet", "Nissan", "Volkswagen", "BMW", "Mercedes-Benz")),
        "year": random.randint(2000, 2023)
    }

    owner_info = {
        "name": fake.name(),
        "address": fake.address(),
        "phone_number": fake.phone_number().replace("x", " ext. ")  # Modify phone number format
    }

    subscription_status = fake.random_element(elements=("active", "expired", "none"))

    if subscription_status != "none":
        subscription_start = fake.date_between(start_date='-3y', end_date='today')
        subscription_end = subscription_start + timedelta(days=365)
    else:
        subscription_start = None
        subscription_end = None

    balance = round(random.uniform(0, 500), 2)

    timestamp = fake.date_time_between(start_date='-30d', end_date='now').strftime('%Y-%m-%d %H:%M:%S')


    record = {
        "id": unique_id,
        "plate_number": plate_number,
        "car_make": car_info["make"],
        "car_year": car_info["year"],
        "owner_name": owner_info["name"],
        "owner_address": owner_info["address"],
        "owner_phone_number": owner_info["phone_number"],
        "subscription_status": subscription_status,
        "subscription_start": subscription_start,
        "subscription_end": subscription_end,
        "balance": balance,
        "timestamp": timestamp
    }

    data.append(record)

# Convert data to a pandas DataFrame
df = pd.DataFrame(data)

# Connect to the MySQL database
db_config = {
    "host": "mysql",
    "user": "root",
    "password": "debezium",
    "database": "inventory"
}
conn = mysql.connector.connect(**db_config)

# Create a cursor
cursor = conn.cursor()

# Create the 'customers' table if it doesn't exist
create_table_query = '''
CREATE TABLE IF NOT EXISTS customers (
    id VARCHAR(255) NOT NULL,
    plate_number VARCHAR(255) NOT NULL,
    car_make VARCHAR(255) NOT NULL,
    car_year INT NOT NULL,
    owner_name VARCHAR(255) NOT NULL,
    owner_address TEXT NOT NULL,
    owner_phone_number VARCHAR(255) NOT NULL,
    subscription_status ENUM('active', 'expired', 'none') NOT NULL,
    subscription_start DATE,
    subscription_end DATE,
    balance DECIMAL(10, 2) NOT NULL,
    timestamp TIMESTAMP NOT NULL
)
'''
cursor.execute(create_table_query)

# Store the synthetic data in the 'customers' table
for index, row in df.iterrows():
    insert_query = '''
    INSERT INTO customers (id, plate_number, car_make, car_year, owner_name, owner_address, owner_phone_number, subscription_status, subscription_start, subscription_end, balance, timestamp)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    '''
    cursor.execute(insert_query, (
        row['id'],
        row['plate_number'],
        row['car_make'],
        row['car_year'],
        row['owner_name'],
        row['owner_address'],
        row['owner_phone_number'],
        row['subscription_status'],
        row['subscription_start'],
        row['subscription_end'],
        row['balance'],
        row['timestamp']
    ))

# Commit the changes and close the cursor
conn.commit()
cursor.close()

# Close the database connection
conn.close()

print("Synthetic data stored in the 'customers' table in the MySQL database")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
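&lt;p&gt;The generated plate numbers follow a fixed shape: four digits, a dash, and three repeated uppercase letters (e.g. &lt;code&gt;7695-OOO&lt;/code&gt;). Downstream parsing can guard on that shape with a tiny validator; a sketch, with &lt;code&gt;is_valid_plate&lt;/code&gt; as a hypothetical helper not present in the generator above:&lt;/p&gt;

```python
import re

# Four digits, a dash, then one uppercase letter repeated three times,
# matching the elements tuple used by the Faker-based generator
PLATE_RE = re.compile(r"^\d{4}-([A-Z])\1\1$")

def is_valid_plate(plate):
    """Return True when `plate` matches the generator's NNNN-AAA pattern."""
    return PLATE_RE.fullmatch(plate) is not None
```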

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wm9jOqpR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3786/1%2AeKCKb9bHMhGhBRXOWqio1A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wm9jOqpR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3786/1%2AeKCKb9bHMhGhBRXOWqio1A.png" alt="" width="800" height="351"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating an API for data ingestion
&lt;/h2&gt;

&lt;p&gt;To facilitate data ingestion, we will create an API that allows us to send data as JSON objects. This API will be used to insert new data into the MySQL database, simulating the real-time data flow from the sensors on the bridge.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from flask import Flask, request, jsonify, render_template
import mysql.connector

app = Flask(__name__, template_folder='template')

db_config = {
        "host": "10.0.0.25",
        "user": "root",
        "password": "debezium",
        "database": "inventory"
    }

@app.route('/send_data', methods=['POST'])
def send_data():
    data = request.get_json()


    conn = mysql.connector.connect(**db_config)

    cursor = conn.cursor()

    insert_query = '''
    INSERT INTO customers (id, plate_number, car_make, car_year, owner_name, owner_address, owner_phone_number, subscription_status, subscription_start, subscription_end, balance, timestamp)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    '''
    cursor.execute(insert_query, (
        data['id'],
        data['plate_number'],
        data['car_make'],
        data['car_year'],
        data['owner_name'],
        data['owner_address'],
        data['owner_phone_number'],
        data['subscription_status'],
        data['subscription_start'],
        data['subscription_end'],
        data['balance'],
        data['timestamp']
    ))

    conn.commit()

    cursor.close()
    conn.close()

    return jsonify({"status": "success"}), 200

@app.route('/customers', methods=['GET'])
def customers():
    plate_number = request.args.get('plate_number', '')
    page = int(request.args.get('page', 1))
    items_per_page = 10

    conn = mysql.connector.connect(**db_config)

    # Create a cursor
    cursor = conn.cursor()

    # Fetch customers filtered by plate_number and apply pagination
    select_query = '''
    SELECT * FROM customers
    WHERE plate_number LIKE %s
    LIMIT %s OFFSET %s
    '''
    cursor.execute(select_query, (f"%{plate_number}%", items_per_page, (page - 1) * items_per_page))
    customers = cursor.fetchall()

    # Get the total number of customers
    cursor.execute("SELECT COUNT(*) FROM customers WHERE plate_number LIKE %s", (f"%{plate_number}%",))
    total_customers = cursor.fetchone()[0]

    # Close the cursor and connection
    cursor.close()
    conn.close()

    return render_template('customers.html', customers=customers, plate_number=plate_number, page=page, total_pages=(total_customers + items_per_page - 1) // items_per_page)


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Testing the API
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests

data = {
    "id": "5a5c562e-4386-44ad-bf6f-bab91081781e",
    "plate_number": "7695-OOO",
    "car_make": "Ford",
    "car_year": 2012,
    "owner_name": "Stefen",
    "owner_address": "92834 Kim Unions\nPort Harryport, MD 61729",
    "owner_phone_number": "your phone number",
    "subscription_status": "active",
    "subscription_start": None,
    "subscription_end": None,
    "balance": 100.0,
    "timestamp": "2023-03-03T14:37:49",
}

response = requests.post("http://0.0.0.0:8000/send_data", json=data)

print(response.status_code)
print(response.json())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Run the test script:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python request.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BZ6GTDCE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3726/1%2AKNIiMZAr2kji-wp815GPLw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BZ6GTDCE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3726/1%2AKNIiMZAr2kji-wp815GPLw.png" alt="" width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;my initial balance is $100&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Setting up connectors for data streaming and storage
&lt;/h2&gt;

&lt;p&gt;With our data stored in MySQL, we will set up Kafka Connect connectors to stream the data from MySQL to Redpanda and then store it in MinIO, which will serve as our distributed object storage. This data storage will act as the “bronze” table in our data warehouse.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create connector source for MySQL
curl --request POST \
  --url http://localhost:8083/connectors \
  --header 'Content-Type: application/json' \
  --data '{
  "name": "src-mysql",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.include.list": "inventory",
    "decimal.handling.mode": "double",
    "topic.prefix": "dbserver1",
    "schema.history.internal.kafka.bootstrap.servers": "redpanda:9092",
    "schema.history.internal.kafka.topic": "schema-changes.inventory"
  }
}'

# create connector sink MySQL to S3
curl --request POST \
  --url http://localhost:8083/connectors \
  --header 'Content-Type: application/json' \
  --data '{
  "name": "sink_aws-s3",
  "config": {
    "topics.regex": "dbserver1.inventory.*",
    "topics.dir": "inventory",
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "1",
    "store.url": "http://minio:9000",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "s3.region": "us-east-1",
    "s3.bucket.name": "warehouse",
    "aws.access.key.id": "minio",
    "aws.secret.access.key": "minio123"
  }
}'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D9Q56nz5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3714/1%2A68kAe4CjJYIyLYLlPW9ywg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D9Q56nz5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3714/1%2A68kAe4CjJYIyLYLlPW9ywg.png" alt="" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ujTlPVqz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3800/1%2Ajf3Vhvq2uiKVoHmHX1M9Vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ujTlPVqz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3800/1%2Ajf3Vhvq2uiKVoHmHX1M9Vw.png" alt="" width="800" height="371"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Real-time data processing with Apache Spark
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Reading data from MinIO
&lt;/h2&gt;

&lt;p&gt;Using Apache Spark, we will read the data stored in MinIO and process it further. This processing will involve selecting relevant fields and transforming the data into a more suitable format for analysis.&lt;/p&gt;
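&lt;p&gt;The S3 sink writes each change as a Debezium JSON envelope, so the first transformation step is unwrapping the &lt;code&gt;payload.after&lt;/code&gt; record. A plain-Python sketch of that unwrapping (the Spark job does the equivalent over DataFrames):&lt;/p&gt;

```python
import json

def extract_after(event):
    """Unwrap a Debezium change event and return the post-change row.

    Envelopes look like {"schema": ..., "payload": {"before": ..., "after": ..., "op": ...}};
    when schemas are disabled, the payload fields appear at the top level.
    """
    doc = json.loads(event)
    payload = doc.get("payload", doc)
    return payload.get("after") or {}

sample = json.dumps({
    "payload": {"before": None,
                "after": {"plate_number": "7695-OOO", "balance": 100.0},
                "op": "c"}
})
row = extract_after(sample)
```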

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uMnxRjTf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3162/1%2Aql8_YlFoEwSHfAN37DYeDw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uMnxRjTf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3162/1%2Aql8_YlFoEwSHfAN37DYeDw.png" alt="" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZCag1j3y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3096/1%2A0RkeGySa93Zp5kQG_knY9A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZCag1j3y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3096/1%2A0RkeGySa93Zp5kQG_knY9A.png" alt="" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data transformation and storage in the data warehouse
&lt;/h2&gt;

&lt;p&gt;Once we have processed the data, we will store it in a “silver” table in our data warehouse. This table will be used for further analysis and processing.&lt;/p&gt;
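&lt;p&gt;One concrete transformation the silver data feeds is the time-of-day toll rate applied by the notification job below. The schedule can be sketched in plain Python (a mirror for illustration; &lt;code&gt;toll_rate&lt;/code&gt; is a hypothetical name):&lt;/p&gt;

```python
def toll_rate(hour, subscription_status):
    """Time-of-day toll schedule: off-peak and peak rates for active
    subscribers, a flat rate for everyone else."""
    if subscription_status == "active":
        if hour in range(0, 6) or hour in range(11, 16):
            return 2.99   # off-peak
        if hour in range(6, 11) or hour in range(16, 23):
            return 3.99   # peak
        return 0.0        # hour 23 falls through, as in the job below
    return 9.99
```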

&lt;h2&gt;
  
  
  Integrating Twilio for real-time notifications
&lt;/h2&gt;

&lt;p&gt;To enhance our real-time data processing pipeline, we will integrate the Twilio API, allowing us to send SMS notifications based on specific conditions or events. For example, we could send an SMS to the vehicle owner when their subscription has expired or when a toll is charged to their balance.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from datetime import datetime as dt, timedelta, timezone
from typing import Optional

import mysql.connector
from mysql.connector import Error
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType
from twilio.rest import Client

TWILIO_ACCOUNT_SID = ''
TWILIO_AUTH_TOKEN = ''
TWILIO_PHONE_NUMBER = ''

client = Client(TWILIO_ACCOUNT_SID, TWILIO_AUTH_TOKEN)

# Reuse the notebook's SparkSession, or create one when running standalone
spark = SparkSession.builder.appName("plate-processing").getOrCreate()
silver_data = spark.read.parquet("s3a://warehouse/inventory/silver_data")

def get_rate_for_customer(timestamp, subscription_status):
    if subscription_status == 'active':
        if 0 &amp;lt;= timestamp.hour &amp;lt; 6 or 11 &amp;lt;= timestamp.hour &amp;lt; 16:
            return 2.99
        elif 6 &amp;lt;= timestamp.hour &amp;lt; 11 or 16 &amp;lt;= timestamp.hour &amp;lt; 23:
            return 3.99
    else:
        return 9.99

    # Add a default rate value to avoid NoneType issues
    return 0.0


def is_subscription_active(subscription_start: dt, subscription_end: dt, current_time: dt) -&amp;gt; bool:
    return subscription_start &amp;lt;= current_time &amp;lt;= subscription_end

def get_subscription_status(subscription_end: dt, current_time: dt) -&amp;gt; bool:
    grace_period = timedelta(days=7)
    return current_time &amp;lt;= subscription_end + grace_period


def send_sms(phone_number, message):
    try:
        client.messages.create(
            body=message,
            from_=TWILIO_PHONE_NUMBER,
            to=phone_number
        )
        print(f"SMS sent to {phone_number}: {message}")
    except Exception as e:
        print(f"Error sending SMS: {e}")


def is_valid_balance(value):
    try:
        float(value)
        return True
    except ValueError:
        return False

valid_balance_udf = udf(is_valid_balance, BooleanType())

silver_data = silver_data.filter(valid_balance_udf(col("balance")))

# Database configuration
db_config = {
    "host": "mysql",
    "user": "root",
    "password": "debezium",
    "database": "inventory"
}

def update_customer_balance(customer_id, new_balance):
    try:
        connection = mysql.connector.connect(**db_config)
        cursor = connection.cursor()
        update_query = "UPDATE customers SET balance = %s WHERE id = %s"
        cursor.execute(update_query, (new_balance, customer_id))
        connection.commit()
        print(f"Updated balance for customer {customer_id}: {new_balance}")
    except Error as e:
        print(f"Error updating balance: {e}")
    finally:
        if connection.is_connected():
            cursor.close()
            connection.close() 


def safe_date_conversion(date_string: Optional[str]) -&amp;gt; dt:
    if date_string is None or not isinstance(date_string, str):
        return dt(1970, 1, 1, tzinfo=timezone.utc)
    try:
        return dt.fromisoformat(date_string[:-1]).replace(tzinfo=timezone.utc)
    except ValueError:
        return dt(1970, 1, 1, tzinfo=timezone.utc)

def process_plate(row: Row) -&amp;gt; None:
    print(f"Processing plate: {row.plate_number}")
    current_time = dt.now(timezone.utc)
    try:
        plate_timestamp = dt.fromisoformat(row.timestamp[:-1]).replace(tzinfo=timezone.utc)
    except ValueError:
        plate_timestamp = dt.fromtimestamp(0, timezone.utc)

    subscription_start = safe_date_conversion(row.subscription_start)
    subscription_end = safe_date_conversion(row.subscription_end)

    is_active = is_subscription_active(subscription_start, subscription_end, current_time)
    rate = get_rate_for_customer(plate_timestamp, row.subscription_status)

    balance = float(row.balance)
    new_balance = balance - rate

    if row.subscription_status == 'none':
        message = f"Dear {row.owner_name}, your car with plate number {row.plate_number} is not registered. The rate of ${rate} has been charged for your recent passage. Your new balance is ${new_balance:.2f}."
        send_sms(row.owner_phone_number, message)
    elif is_active:  # Changed from row.subscription_status == 'active'
        message = f"Dear {row.owner_name}, your subscription is active. The rate of ${rate} has been charged for your recent passage. Your new balance is ${new_balance:.2f}."
        send_sms(row.owner_phone_number, message)
    elif not get_subscription_status(subscription_end, current_time):
        message = f"Dear {row.owner_name}, your subscription has expired. The rate of ${rate} has been charged for your recent passage. Your new balance is ${new_balance:.2f}."
        send_sms(row.owner_phone_number, message)

    update_customer_balance(row.id, new_balance)

silver_data.foreach(process_plate)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This script is designed to process a dataset containing information about car passages and their owners, including subscription status, balance, plate numbers, and owner details. It reads data from a “silver” table in a data warehouse, processes the data in real-time, sends SMS notifications to the car owners via the Twilio API, and updates the customer’s balance in a MySQL database.&lt;/p&gt;

&lt;p&gt;Here’s a breakdown of the script:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Import necessary libraries and modules for the script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define Twilio credentials (account SID, auth token, and phone number) for sending SMS notifications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a SparkSession to read data from the “silver” table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define utility functions:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;get_rate_for_customer: Calculate the rate based on timestamp and subscription status.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;is_subscription_active: Check if a subscription is active.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;get_subscription_status: Check if a subscription is within the grace period.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;send_sms: Send an SMS using the Twilio API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;is_valid_balance: Check if a given balance is valid (convertible to a float).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;update_customer_balance: Update the customer balance in the MySQL database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;safe_date_conversion: Convert a date string to a datetime object, handling errors and missing values.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;process_plate: Process each plate record, calculate the rate, send SMS notifications, and update the customer balance.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="5"&gt;
&lt;li&gt;&lt;p&gt;Register a User-Defined Function (UDF), valid_balance_udf, that checks whether a record’s balance can be parsed as a float.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filter the dataset to keep records with valid balances using the valid_balance_udf.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define database configuration for connecting to the MySQL database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the foreach action to process each plate record using the process_plate function. This includes checking subscription status, calculating the rate, sending SMS notifications, and updating the customer balance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
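The imports, Twilio client setup, and the head of get_rate_for_customer fall outside the excerpt above, so here is a minimal, hypothetical reconstruction of the rate helper. It assumes the same 2.99/3.99 time-of-day pricing used later in the gold-layer revenue aggregation and a flat 9.99 penalty for unregistered plates (the 9.99 and 0.0 returns appear in the original tail):

```python
from datetime import datetime

def get_rate_for_customer(ts: datetime, subscription_status: str) -> float:
    """Hypothetical sketch: rate by subscription status and time of day."""
    if subscription_status == "none":
        return 9.99          # flat penalty for unregistered plates (assumed)
    if subscription_status == "active":
        # Same time-of-day rule the gold-layer revenue aggregation uses
        return 2.99 if ts.hour < 12 else 3.99
    # Default rate to avoid NoneType issues downstream
    return 0.0

print(get_rate_for_customer(datetime(2023, 5, 15, 9, 30), "active"))  # 2.99
```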

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kkjVyWCN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A6ARIgXlgfMVRQ7JjL_ZZQw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kkjVyWCN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2A6ARIgXlgfMVRQ7JjL_ZZQw.jpeg" alt="" width="800" height="1647"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gold_data.write.parquet("s3a://warehouse/inventory/gold_data", mode="overwrite")


import pyspark.sql.functions as F
from pyspark.sql import SparkSession

class MetricsAdapter:
    def __init__(self, silver_table, warehouse_path):
        self.silver_table = silver_table
        self.warehouse_path = warehouse_path

    def show_metrics(self):
        daily_metrics = spark.read.format('delta').load(self.warehouse_path + '/gold/daily_metrics')
        weekly_metrics = spark.read.format('delta').load(self.warehouse_path + '/gold/weekly_metrics')
        monthly_metrics = spark.read.format('delta').load(self.warehouse_path + '/gold/monthly_metrics')
        quarterly_metrics = spark.read.format('delta').load(self.warehouse_path + '/gold/quarterly_metrics')
        yearly_metrics = spark.read.format('delta').load(self.warehouse_path + '/gold/yearly_metrics')
        subscription_status_count = self.silver_table.groupBy("subscription_status").count()

        print("Subscription Status Counts:")
        subscription_status_count.show()

        print("Daily Metrics:")
        daily_metrics.show(5)

        print("Weekly Metrics:")
        weekly_metrics.show(5)

        print("Monthly Metrics:")
        monthly_metrics.show(5)

        print("Quarterly Metrics:")
        quarterly_metrics.show(5)

        print("Yearly Metrics:")
        yearly_metrics.show(5)    

    def transform(self):
        # Calculate the week, month, quarter, and year from the timestamp
        time_based_metrics = self.silver_table.withColumn("date", F.to_date("timestamp")) \
            .withColumn("year", F.year("timestamp")) \
            .withColumn("quarter", F.quarter("timestamp")) \
            .withColumn("month", F.month("timestamp")) \
            .withColumn("week_of_year", F.weekofyear("timestamp")) \
            .withColumn("total_passages", F.lit(1)) \
            .withColumn("total_revenue", F.when(self.silver_table.timestamp.substr(12, 2).cast("int") &amp;lt; 12, 2.99).otherwise(3.99))


        # Daily metrics
        daily_metrics = time_based_metrics.groupBy("date").agg(
            F.count("*").alias("total_passages"),
            F.sum(F.when(time_based_metrics.timestamp.substr(12, 2).cast("int") &amp;lt; 12, 2.99).otherwise(3.99)).alias("total_revenue")
        )
        daily_metrics.write.format('delta').mode('overwrite').option("mergeSchema", "true").save(self.warehouse_path + '/gold/daily_metrics')

        # Weekly metrics
        weekly_metrics = time_based_metrics.groupBy("year", "week_of_year").agg(
            F.sum("total_passages").alias("total_passages"),
            F.sum("total_revenue").alias("total_revenue")
        )
        weekly_metrics.write.format('delta').mode('overwrite').option("mergeSchema", "true").save(self.warehouse_path + '/gold/weekly_metrics')

        # Monthly metrics
        monthly_metrics = time_based_metrics.groupBy("year", "month").agg(
            F.sum("total_passages").alias("total_passages"),
            F.sum("total_revenue").alias("total_revenue")
        )
        monthly_metrics.write.format('delta').mode('overwrite').option("mergeSchema", "true").save(self.warehouse_path + '/gold/monthly_metrics')

        # Quarterly metrics
        quarterly_metrics = time_based_metrics.groupBy("year", "quarter").agg(
            F.sum("total_passages").alias("total_passages"),
            F.sum("total_revenue").alias("total_revenue")
        )
        quarterly_metrics.write.format('delta').mode('overwrite').option("mergeSchema", "true").save(self.warehouse_path + '/gold/quarterly_metrics')

        # Yearly metrics
        yearly_metrics = time_based_metrics.groupBy("year").agg(
            F.sum("total_passages").alias("total_passages"),
            F.sum("total_revenue").alias("total_revenue")
        )
        yearly_metrics.write.format('delta').mode('overwrite').option("mergeSchema", "true").save(self.warehouse_path + '/gold/yearly_metrics')

# Example usage
spark = SparkSession.builder.getOrCreate()
silver_data = spark.read.parquet("s3a://warehouse/inventory/silver_data")
warehouse_path = "s3a://warehouse/inventory/gold_data"
metrics_adapter = MetricsAdapter(silver_data, warehouse_path)
metrics_adapter.transform()

metrics_adapter.show_metrics()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The code calculates daily, weekly, monthly, quarterly, and yearly metrics, such as total passages and total revenue. It also defines a MetricsAdapter class that encapsulates the data transformation and metrics display logic.&lt;/p&gt;

&lt;p&gt;The first line of code:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gold_data.write.parquet("s3a://warehouse/inventory/gold_data", mode="overwrite")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;writes the gold_data DataFrame to the specified S3 bucket in Parquet format, with the overwrite mode, which replaces any existing data in the destination.&lt;/p&gt;

&lt;p&gt;The MetricsAdapter class has two primary methods: transform() and show_metrics().&lt;/p&gt;

&lt;p&gt;transform() method:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Calculates the date, year, quarter, month, and week of the year from the timestamp.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Aggregates the data based on different time granularities (daily, weekly, monthly, quarterly, and yearly) using the groupBy and agg functions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Writes the aggregated metrics to the specified S3 path in Delta Lake format, which provides ACID transactions, versioning, and schema evolution for large-scale data lakes.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
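The time-based columns from step 1 can be sketched with the standard library alone, using a hypothetical sample timestamp (Spark’s F.year, F.quarter, and F.weekofyear follow the same calendar rules, with weekofyear using the ISO week number):

```python
from datetime import datetime

ts = datetime.fromisoformat("2023-05-15T09:30:00")  # hypothetical sample
year = ts.year
quarter = (ts.month - 1) // 3 + 1        # F.quarter equivalent
week_of_year = ts.isocalendar()[1]       # F.weekofyear (ISO week)
# Revenue rule from transform(): 2.99 before noon, 3.99 after
revenue = 2.99 if ts.hour < 12 else 3.99
print(year, quarter, week_of_year, revenue)
```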

&lt;p&gt;show_metrics() method:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Reads the metrics data back from the S3 bucket in Delta Lake format.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Displays the top 5 records of daily, weekly, monthly, quarterly, and yearly metrics using the show() function.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, the example usage part of the code initializes a SparkSession, reads the silver_data from the S3 bucket, creates a MetricsAdapter instance with silver_data and the warehouse path, calls the transform() method to aggregate the data, and then calls the show_metrics() method to display the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sMI1uAwV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3830/1%2A0JELvCj7pc9-3FtOy3HmuQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sMI1uAwV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3830/1%2A0JELvCj7pc9-3FtOy3HmuQ.png" alt="" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ORjy3CMJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3816/1%2Av0EzzKKIB7ObPKd47-t34g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ORjy3CMJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3816/1%2Av0EzzKKIB7ObPKd47-t34g.png" alt="" width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we have demonstrated how to set up a real-time data processing and analytics environment using Docker, MySQL, Redpanda, MinIO, and Apache Spark. We created a system that generates fake data simulating a sensor, stores it in a MySQL database, and processes it in real-time using Redpanda and Kafka Connect. We then utilized MinIO as a distributed object storage and Apache Spark to further process and analyze the data. Additionally, we integrated the Twilio API for real-time notifications.&lt;/p&gt;

&lt;p&gt;This project showcases the potential of using modern data processing tools to handle real-time scenarios, such as monitoring car passages on a bridge and notifying car owners about their subscription status and balance. The combination of these technologies enables scalable and efficient data processing, as well as the ability to respond quickly to changes in the data.&lt;/p&gt;

&lt;p&gt;The knowledge gained from this project can be applied to various other real-time data processing and analytics use cases. By understanding and implementing these technologies, you can build powerful and efficient systems that are able to handle large amounts of data and provide valuable insights in real-time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Stefen-Taime/stream-ingestion-redpanda-minio.git"&gt;https://github.com/Stefen-Taime/stream-ingestion-redpanda-minio.git&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/@stefentaime_10958/real-time-data-processing-and-analytics-with-docker-mysql-redpanda-minio-and-apache-spark-eca83f210ef6"&gt;https://medium.com/@stefentaime_10958/real-time-data-processing-and-analytics-with-docker-mysql-redpanda-minio-and-apache-spark-eca83f210ef6&lt;/a&gt;&lt;/p&gt;

</description>
      <category>minio</category>
      <category>spark</category>
      <category>redpanda</category>
      <category>deltalake</category>
    </item>
    <item>
      <title>Visualizing Bitcoin to USD Exchange Rates using FastAPI, Prometheus, Grafana, Deploy with jenkins</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Mon, 15 May 2023 14:15:42 +0000</pubDate>
      <link>https://dev.to/stefentaime/visualizing-bitcoin-to-usd-exchange-rates-using-fastapi-prometheus-grafana-deploy-with-jenkins-41d9</link>
      <guid>https://dev.to/stefentaime/visualizing-bitcoin-to-usd-exchange-rates-using-fastapi-prometheus-grafana-deploy-with-jenkins-41d9</guid>
      <description>&lt;h2&gt;
  
  
  Visualizing Bitcoin to USD Exchange Rates using FastAPI, Prometheus, Grafana, Deploy with jenkins On Localhost Ubuntu Server 20.04
&lt;/h2&gt;

&lt;p&gt;In this article, we’ll explore how to visualize the exchange rate of Bitcoin to USD using FastAPI, Prometheus, Grafana, and Docker. We will create a simple FastAPI application to import exchange rate data from an API, store it in a database, and expose it as metrics using Prometheus. Then, we’ll use Grafana to create dashboards that visualize the data, and deploy the whole setup using Docker and Jenkins. Inspired by the article of &lt;a href="https://amlanscloud.com/kubechallenge/"&gt;amlanscloud&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://devopscube.com/setup-slaves-on-jenkins-2/"&gt;Setup Jenkins Agent/Slave Using SSH [Password &amp;amp; SSH Key]&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ceMIo1hW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3816/1%2Aoc4RMB_AVLUm0MPzM0tGnA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ceMIo1hW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3816/1%2Aoc4RMB_AVLUm0MPzM0tGnA.png" alt="" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s an outline of our plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Data Source: The process commences by acquiring data from a public data source. In this case, we’ll use an API that offers free exchange rates for Bitcoin to USD. The API delivers the exchange rate between Bitcoin and USD at the moment the API is called. By invoking the API multiple times, we can obtain time series data for the fluctuations in the exchange rate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Importer App: An importer app reads the data from the data source API mentioned above. This app operates daily, invoking the data source API to acquire the exchange rate. Subsequently, the importer app stores this rate in a database. Each day’s exchange rate is represented by a row of items in the database.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prometheus Data Scraper: This component is a data scraper job defined within Prometheus. The Prometheus data scraper fetches data from an API endpoint and incorporates it as a metric in Prometheus. A custom API has been developed for this purpose, which, upon invocation, retrieves the day’s exchange rate from the database and returns the data in a format that can be easily read and imported as a metric in Prometheus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visualize Data: Prometheus stores the data scraped as a distinct metric. This metric serves as a data source for creating visualization dashboards on Grafana. These dashboards display various trends in the exchange rate fluctuations as documented by the metrics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Deployment: We will deploy the entire setup using Jenkins, Docker, and Docker Compose.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastapi import FastAPI, Response
from fastapi.responses import PlainTextResponse, JSONResponse
import uvicorn
from dotenv import load_dotenv
import os
import requests
import json
from prometheus_client import Gauge, CollectorRegistry, generate_latest
from datetime import date
import redis

load_dotenv()

def scrape_data():
    qry_url = f'{os.environ.get("STOCK_URL")}?function=CURRENCY_EXCHANGE_RATE&amp;amp;from_currency=BTC&amp;amp;to_currency=USD&amp;amp;apikey={os.environ.get("API_KEY")}'
    response = requests.request("GET", qry_url)
    respdata = response.json()
    rate = respdata['Realtime Currency Exchange Rate']['5. Exchange Rate']
    float_rate = "{:.2f}".format(float(rate))
    return float_rate

registry = CollectorRegistry()
exchange_rate_btc_usd = Gauge('exchange_rate_btc_usd', 'Exchange rate between BTC and USD', registry=registry)

app = FastAPI()

@app.get("/", response_class=JSONResponse)
async def root():
    return {"message": "Hello World"}

@app.get("/exchangemetrics", response_class=PlainTextResponse)
async def get_exchange_metrics():
    r = redis.Redis(host="redis", port=os.environ.get('REDIS_PORT'), password=os.environ.get('REDIS_PASSWORD'), db=0)
    scraped_data = 0.0
    try:
        todays_date = str(date.today())
        redis_key = f'exchange_rate-{todays_date}'
        tmpdata = r.get(redis_key)
        scraped_data = tmpdata.decode("utf-8")
        exchange_rate_btc_usd.set(float(scraped_data))
    except Exception as e:
        print(e)
        print('responding default value')
    return Response(generate_latest(registry), media_type="text/plain")

@app.get("/getexchangedata", response_class=PlainTextResponse)
async def get_exchange_data():
    r = redis.Redis(host="redis", port=os.environ.get('REDIS_PORT'), password=os.environ.get('REDIS_PASSWORD'), db=0)
    scraped_data = 0.0
    respdata = "error"
    try:
        scraped_data = scrape_data()
        todays_date = str(date.today())
        r.set(f'exchange_rate-{todays_date}', scraped_data)
        respdata = "done"
    except Exception as e:
        print(e)
        print('responding default value')
        todays_date = str(date.today())
        r.set(f'exchange_rate-{todays_date}', scraped_data)
    return respdata

if __name__ == "__main__":
    uvicorn.run("main:app", host="0.0.0.0", port=5000, reload=True)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is a FastAPI application that fetches the Bitcoin to USD exchange rate from an external API and stores the data in a Redis database. It also exposes the exchange rate data as Prometheus metrics. Let’s break down the code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Import necessary modules: The code starts by importing the required modules such as FastAPI, uvicorn, dotenv, os, requests, json, prometheus_client, datetime, and redis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Load environment variables: The load_dotenv() function is called to load the environment variables from the .env file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scrape exchange rate data: The scrape_data() function fetches the exchange rate data by making a GET request to the external API using the requests library. It then extracts the exchange rate and formats it as a float with two decimal places.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Initialize Prometheus Gauge: A Prometheus Gauge named exchange_rate_btc_usd is initialized to store the exchange rate data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create FastAPI app: A FastAPI application named app is created.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Define endpoints:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GET /: A simple Hello World endpoint that returns a JSON response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GET /exchangemetrics: This endpoint fetches the exchange rate data for the current day from the Redis database and sets the value of the Prometheus Gauge exchange_rate_btc_usd. It then returns the Prometheus metrics as a plain text response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GET /getexchangedata: This endpoint fetches the exchange rate data by calling the scrape_data() function, stores the data in the Redis database, and returns a plain text response indicating whether the operation was successful.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
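The two endpoints share a date-keyed storage scheme: /getexchangedata writes today’s rate under a date-stamped key, and /exchangemetrics reads it back. A minimal sketch of that scheme, with a plain dict standing in for the Redis connection (function names are illustrative):

```python
from datetime import date

store = {}  # stand-in for the Redis connection

def save_rate(rate: str) -> str:
    # /getexchangedata: store today's scraped rate under a date-stamped key
    store[f"exchange_rate-{date.today()}"] = rate
    return "done"

def read_rate() -> float:
    # /exchangemetrics: read today's rate back and feed it to the gauge
    return float(store[f"exchange_rate-{date.today()}"])

save_rate("27123.45")
print(read_rate())
```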

&lt;ol start="7"&gt;
&lt;li&gt;Run FastAPI app: The FastAPI app is run using uvicorn with the host set to “0.0.0.0” and port 5000. The reload=True parameter enables hot-reloading during development.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Prometheus
&lt;/h2&gt;

&lt;p&gt;The configuration below sets up Prometheus to scrape metrics from two different sources: the Prometheus server itself and the FastAPI application. Let’s break down the configuration:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Global settings: The global settings apply to all the scrape jobs defined in the configuration.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;scrape_interval: The interval at which Prometheus scrapes metrics from the targets. In this case, it's set to 15 seconds.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Scrape configs: The scrape_configs section defines the jobs that Prometheus will use to scrape metrics from different sources.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;job_name: 'prometheus': The first job is named 'prometheus' and is configured to scrape metrics from the Prometheus server itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;static_configs: This section specifies the target for this job. In this case, it's set to scrape metrics from the Prometheus server running on 'localhost:9090'.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;job_name: 'fastapi_app': The second job is named 'fastapi_app' and is configured to scrape metrics from the FastAPI application.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;metrics_path: The path to the metrics endpoint in the FastAPI application, which is '/exchangemetrics' in this case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;static_configs: This section specifies the target for this job. In this case, it's set to scrape metrics from the FastAPI application running on 'fastapi_app:5000' (the application's hostname and port).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'fastapi_app'
    metrics_path: '/exchangemetrics'
    static_configs:
      - targets: ['fastapi_app:5000']
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Jenkins
&lt;/h2&gt;

&lt;p&gt;The Jenkinsfile below defines a Jenkins pipeline for building and deploying the project. The pipeline consists of four stages, and a post section executes actions after all stages have completed. Let’s break down the pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Agent: Specifies that the pipeline will run on a Jenkins agent with the label ‘ubuntu’.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stages: Contains the sequential stages to be executed in the pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Stage 1: Build and Deploy prometheus, grafana, and redis:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This stage changes the working directory to /home/stefen/deploy/adminer.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It prints the current working directory and its contents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, it runs the docker-compose up -d command to deploy Prometheus, Grafana, and Redis using Docker Compose.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage 2: Build and Deploy API:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This stage waits for 30 seconds before proceeding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It changes the working directory to /home/stefen/deploy/api.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It prints the current working directory and its contents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Then, it runs the docker-compose up --build -d command to build and deploy the FastAPI application using Docker Compose.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage 3: Fetch and Print getexchangedata:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This stage makes an HTTP request to the FastAPI application’s /getexchangedata endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It prints the response received from the API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Stage 4: Fetch and Print Exchangemetrics:&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This stage makes an HTTP request to the FastAPI application’s /exchangemetrics endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It prints the response received from the API.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Post: Specifies actions to be executed after all the stages have completed, regardless of their success or failure.&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;In this case, it prints the URLs for Prometheus, Grafana, Redis, and the FastAPI application.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pipeline {
    agent {
        label 'ubuntu'
    }
    stages {
        stage('Build and Deploy prometheus, grafana, and redis') {
            steps {
                dir('/home/stefen/deploy/adminer') {
                    sh 'pwd' // Check the current working directory
                    sh 'ls -la'
                    sh 'docker-compose up -d'
                }
            }
        }
        stage('Build and Deploy API') {
            steps {
                sleep(time: 30, unit: 'SECONDS') // Wait for 30 seconds
                dir('/home/stefen/deploy/api') {
                    sh 'pwd' // Check the current working directory
                    sh 'ls -la'
                    sh 'docker-compose up --build -d'
                }
            }
        }
        stage('Fetch and Print getexchangedata') {
            steps {
                script {
                    def response = httpRequest 'http://localhost:5000/getexchangedata'
                    echo "Response: ${response.content}"
                }
            }
        }
        stage('Fetch and Print Exchangemetrics') {
            steps {
                script {
                    def response = httpRequest 'http://localhost:5000/exchangemetrics'
                    echo "Response: ${response.content}"
                }
            }
        }
    }
    post {
        always {
            script {
                def prometheusURL = 'http://localhost:9090'
                def grafanaURL = 'http://localhost:3000'
                def redisURL = 'http://localhost:6379'
                def apiURL = 'http://localhost:5000'

                echo "Prometheus URL: ${prometheusURL}"
                echo "Grafana URL: ${grafanaURL}"
                echo "Redis URL: ${redisURL}"
                echo "API URL: ${apiURL}"
            }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Grafana Dashboard:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_NFhlOV4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3834/1%2Ahqkhn_cI3UNKGa__tSkg5g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_NFhlOV4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3834/1%2Ahqkhn_cI3UNKGa__tSkg5g.png" alt="" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion:
&lt;/h2&gt;

&lt;p&gt;In conclusion, the provided code snippets and configurations showcase an end-to-end deployment process of a monitoring and data visualization system using Jenkins, Docker, Prometheus, Grafana, and FastAPI. The pipeline defined in the Jenkinsfile automates the build and deployment process, ensuring a smooth and streamlined workflow for the project. This setup allows developers to efficiently monitor and visualize Bitcoin to USD exchange rate data, making it easier to identify trends and understand the data’s behavior over time. The use of FastAPI, Redis, and the provided API ensures a robust and efficient architecture for the system, while Docker and Jenkins enable seamless deployment and automation. Overall, this project demonstrates a practical application of modern technologies to create an effective and reliable monitoring and data visualization system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Stefen-Taime/Visualizing-Bitcoin"&gt;Code Source Github&lt;/a&gt;&lt;/p&gt;

</description>
      <category>jenkins</category>
      <category>prometheus</category>
      <category>grafana</category>
      <category>fastapi</category>
    </item>
    <item>
      <title>Analyzing Uber and Uber Eats Expenses Using DBT, Postgres, Gmail, Python, SQL And PowerBI</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Mon, 15 May 2023 14:13:15 +0000</pubDate>
      <link>https://dev.to/stefentaime/analyzing-uber-and-uber-eats-expenses-using-dbt-postgres-gmail-python-sql-and-powerbi-1j9g</link>
      <guid>https://dev.to/stefentaime/analyzing-uber-and-uber-eats-expenses-using-dbt-postgres-gmail-python-sql-and-powerbi-1j9g</guid>
      <description>&lt;p&gt;Unveiling the true cost of your ride-sharing and food delivery habits with an ELT data pipeline, PostgreSQL, dbt, and Power BI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JQV1fC5j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/1%2AjmyS3SW2x15UPpbppBKx8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JQV1fC5j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/1%2AjmyS3SW2x15UPpbppBKx8w.png" alt="" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Introduction&lt;/h2&gt;

&lt;p&gt;As a regular user of Uber and Uber Eats products, I realized that I wanted to gain better insights into how much I spend on these services per month, year, or quarter. As a digital content creator and data engineer, I decided to create a proof-of-concept (POC) for a data analysis project to track my expenses on these platforms.&lt;/p&gt;

&lt;p&gt;In this article, I will walk you through the process of building the “My Uber Project” pipeline. This pipeline utilizes an ELT (Extract, Load, Transform) approach to extract data from PDF receipts, clean and structure the data, store the data in a PostgreSQL database, perform transformations using dbt (Data Build Tool), and finally visualize the results with Power BI.&lt;/p&gt;

&lt;h2&gt;Data Extraction: PDF Receipts&lt;/h2&gt;

&lt;p&gt;The first step in the My Uber Project pipeline is to extract data from the PDF receipts received via email after each Uber ride or Uber Eats order. To achieve this, we can use Python libraries like PyPDF2 or pdfplumber to parse the PDF files and extract the relevant information.&lt;/p&gt;

&lt;h2&gt;Data Cleaning and Structuring&lt;/h2&gt;

&lt;p&gt;After extracting the raw data, the next step is to clean and structure it. This process involves tasks such as parsing dates, converting currencies, and standardizing column names. The cleaned and structured data will be stored in two separate CSV files:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; uber_eats.csv: Contains information related to Uber Eats orders with columns: type, date, total, and restaurant.&lt;/li&gt;
&lt;li&gt; uber_ride.csv: Contains information related to Uber rides with columns: type, date, total, and driver.
&lt;/li&gt;
&lt;/ol&gt;
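&lt;p&gt;The month translation and date parsing involved in this cleaning step can be sketched in plain Python. The project performs the French-to-English month translation later, in PostgreSQL; the helper below is purely illustrative (the names &lt;code&gt;FRENCH_MONTHS&lt;/code&gt; and &lt;code&gt;normalize_french_date&lt;/code&gt; are not part of the project's code) and assumes receipt dates of the form &lt;code&gt;15 mai 2023&lt;/code&gt;:&lt;/p&gt;

```python
from datetime import date

# Illustrative sketch only: these helper names are assumptions,
# not part of the project's published code.
FRENCH_MONTHS = {
    'janvier': 1, 'février': 2, 'mars': 3, 'avril': 4,
    'mai': 5, 'juin': 6, 'juillet': 7, 'août': 8,
    'septembre': 9, 'octobre': 10, 'novembre': 11, 'décembre': 12,
}

def normalize_french_date(raw):
    """Turn a French receipt date like '15 mai 2023' into ISO format."""
    day, month_fr, year = raw.split()
    return date(int(year), FRENCH_MONTHS[month_fr.lower()], int(day)).isoformat()
```

&lt;p&gt;Applying a helper like this during extraction would let the CSV files carry ISO dates from the start.&lt;/p&gt;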

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pdfplumberimport reimport osimport pandas as pddef extract_data(pdf_path):    with pdfplumber.open(pdf_path) as pdf:        page = pdf.pages[0]        content = page.extract_text()    date_pattern = r'\d{1,2} \w+ \d{4}'    date = re.search(date_pattern, content).group(0)    total_pattern = r'Total (\d+\,\d{2}) \$CA'    total = re.search(total_pattern, content).group(1).replace(',', '.')    driver_pattern = r'Votre chauffeur était (\w+)'    driver_match = re.search(driver_pattern, content)    restaurant_pattern = r'restaurant suivant : (.+?)\.'    restaurant_match = re.search(restaurant_pattern, content)    if driver_match:        return {'type': 'Uber', 'date': date, 'total': total, 'driver': driver_match.group(1)}    elif restaurant_match:        return {'type': 'Uber Eats', 'date': date, 'total': total, 'restaurant': restaurant_match.group(1)}    else:        return {'error': 'Invalid receipt format'}pdf_directory = '/home/stefen/uber/data'pdf_files = [f for f in os.listdir(pdf_directory) if f.endswith('.pdf')]uber_data = []uber_eats_data = []for pdf_file in pdf_files:    pdf_path = os.path.join(pdf_directory, pdf_file)    extracted_data = extract_data(pdf_path)    if 'error' in extracted_data:        print(f"Error processing file {pdf_file}: {extracted_data['error']}")    elif extracted_data['type'] == 'Uber':        uber_data.append(extracted_data)    elif extracted_data['type'] == 'Uber Eats':        uber_eats_data.append(extracted_data)uber_df = pd.DataFrame(uber_data)uber_eats_df = pd.DataFrame(uber_eats_data)uber_df.to_csv('uber_receipts.csv', index=False)uber_eats_df.to_csv('uber_eats_receipts.csv', index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s an explanation of each part of the code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Import necessary libraries:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  pdfplumber: To extract text from PDF files&lt;/li&gt;
&lt;li&gt;  re: To perform regular expression operations&lt;/li&gt;
&lt;li&gt;  os: To interact with the operating system, e.g., working with directories and files&lt;/li&gt;
&lt;li&gt;  pandas: To work with data in DataFrame format and save to CSV&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="2"&gt;
&lt;li&gt; Define the &lt;code&gt;extract_data&lt;/code&gt; function that takes a PDF file path as an input: a. Open the PDF file using pdfplumber and get the first page b. Extract the text content from the page c. Use regular expressions to find the date, total, driver (if available), and restaurant (if available) information in the text d. If a driver is found, return the extracted data as a dictionary with the 'type' key set to 'Uber' e. If a restaurant is found, return the extracted data as a dictionary with the 'type' key set to 'Uber Eats' f. If neither a driver nor a restaurant is found, return an error dictionary indicating an invalid receipt format&lt;/li&gt;
&lt;li&gt; Specify the directory containing the PDF files and create a list of all PDF files in the directory.&lt;/li&gt;
&lt;li&gt; Initialize empty lists &lt;code&gt;uber_data&lt;/code&gt; and &lt;code&gt;uber_eats_data&lt;/code&gt; to store extracted data.&lt;/li&gt;
&lt;li&gt; Iterate through each PDF file in the list, call the &lt;code&gt;extract_data&lt;/code&gt; function to extract the data, and append it to the appropriate list based on the 'type' key value. If an error is encountered, print the error message.&lt;/li&gt;
&lt;li&gt; Create separate DataFrames for Uber and Uber Eats data using the pandas library.&lt;/li&gt;
&lt;li&gt; Save the DataFrames to CSV files (uber_receipts.csv and uber_eats_receipts.csv) without including the index column.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;After the data extraction and processing, the next step is to set up the PostgreSQL database and pgAdmin. In this section, we will use Docker and docker-compose to stand up both services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: "3.8"services:  postgres:    image: postgres:latest    environment:      POSTGRES_USER: postgres      POSTGRES_PASSWORD: mysecretpassword    ports:      - "0.0.0.0:5432:5432"    volumes:      - postgres_data:/var/lib/postgresql/data      - ./postgres-init:/docker-entrypoint-initdb.d  pgadmin:    image: dpage/pgadmin4:latest    environment:      PGADMIN_DEFAULT_EMAIL: admin@example.com      PGADMIN_DEFAULT_PASSWORD: mysecretpassword    ports:      - "8080:80"    depends_on:      - postgresvolumes:  postgres_data:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the PostgreSQL database and pgAdmin have been set up, the next step is to initialize and configure our dbt project. After running the &lt;code&gt;dbt init&lt;/code&gt; command, we can start setting up the project structure. Here's an overview of the dbt project structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:.├───dbt_packages├───logs├───macros├───models│   ├───intermediate│   ├───marts│   │   ├───eats_dept│   │   └───rides_dept│   └───staging├───seeds└───target    ├───compiled    │   └───my_uber_project    │       └───models    │           ├───intermediate    │           ├───marts    │           │   ├───eats_dept    │           │   └───rides_dept    │           └───staging    └───run        └───my_uber_project            ├───models            │   ├───intermediate            │   ├───marts            │   │   ├───eats_dept            │   │   └───rides_dept            │   └───staging            └───seeds
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The project structure contains the following folders:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;code&gt;dbt_packages&lt;/code&gt;: Contains packages installed via the &lt;code&gt;packages.yml&lt;/code&gt; file.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;logs&lt;/code&gt;: Stores log files generated during dbt execution.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;macros&lt;/code&gt;: Contains custom macros for the project.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;models&lt;/code&gt;: Holds the dbt models, organized into subdirectories for intermediate, staging, and marts (eats_dept and rides_dept) layers.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;seeds&lt;/code&gt;: Contains CSV files with seed data to be loaded into the database.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;target&lt;/code&gt;: Stores the output of dbt commands (compiled and run). This folder has subdirectories for compiled and run models, each with the same structure as the &lt;code&gt;models&lt;/code&gt; folder (intermediate, staging, and marts layers).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By following this structure, we can keep our dbt project organized and easy to maintain. Each subdirectory within the &lt;code&gt;models&lt;/code&gt; folder serves a specific purpose, helping to separate different stages of data transformation and analysis.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;profiles.yml&lt;/code&gt; file is a configuration file used by dbt to define different environments (called profiles) and their connection settings. In this example, two profiles are defined: &lt;code&gt;dev&lt;/code&gt; and &lt;code&gt;prod&lt;/code&gt;. Each profile specifies the connection settings for a PostgreSQL database.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;profiles.yml&lt;/code&gt; file contents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;code&gt;default&lt;/code&gt;: The name of the profile group. You can have multiple profile groups if needed.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;outputs&lt;/code&gt;: A dictionary containing the different profiles within the group.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;dev&lt;/code&gt;: The development profile with the following connection settings:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;type&lt;/code&gt;: The type of database being used (in this case, PostgreSQL).&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;threads&lt;/code&gt;: The number of concurrent threads dbt should use when executing queries.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;host&lt;/code&gt;, &lt;code&gt;port&lt;/code&gt;, &lt;code&gt;user&lt;/code&gt;, &lt;code&gt;pass&lt;/code&gt;, &lt;code&gt;dbname&lt;/code&gt;, &lt;code&gt;schema&lt;/code&gt;: Connection settings for the PostgreSQL database (host, port, username, password, database name, and schema) in the development environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt; &lt;code&gt;prod&lt;/code&gt;: The production profile with similar connection settings as the &lt;code&gt;dev&lt;/code&gt; profile. Replace the placeholders (&lt;code&gt;[host]&lt;/code&gt;, &lt;code&gt;[port]&lt;/code&gt;, &lt;code&gt;[prod_username]&lt;/code&gt;, &lt;code&gt;[prod_password]&lt;/code&gt;, &lt;code&gt;[dbname]&lt;/code&gt;, and &lt;code&gt;[prod_schema]&lt;/code&gt;) with the actual values for your production environment.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;target&lt;/code&gt;: Specifies the default target profile to use when running dbt commands. In this case, it is set to &lt;code&gt;dev&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By defining different profiles, you can easily switch between development and production environments when running dbt commands, allowing you to test and develop transformations in one environment before deploying them to another. To switch between profiles, you can change the &lt;code&gt;target&lt;/code&gt; value in the &lt;code&gt;profiles.yml&lt;/code&gt; file or use the &lt;code&gt;--target&lt;/code&gt; flag when running dbt commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;default:  outputs:    dev:      type: postgres      threads: 3      host: localhost      port: 5432      user: dbt      pass: dbt_password      dbname: olap      schema: public    prod:      type: postgres      threads: 1      host: [host]      port: [port]      user: [prod_username]      pass: [prod_password]      dbname: [dbname]      schema: [prod_schema]  target: dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the dbt project is set up, one of the first things to do is to manage the date format in the Uber receipts, which are in French. To handle the French month names, you can create a custom function in your PostgreSQL database to translate them into English month names. Here’s a step-by-step explanation of the process:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Connect to your &lt;code&gt;olap&lt;/code&gt; PostgreSQL database using your preferred database client or pgAdmin.&lt;/li&gt;
&lt;li&gt; Create a new function called &lt;code&gt;translate_french_month_to_english&lt;/code&gt; that accepts a single &lt;code&gt;TEXT&lt;/code&gt; parameter representing the French month name.&lt;/li&gt;
&lt;li&gt; Inside the function, use a &lt;code&gt;CASE&lt;/code&gt; statement to map the French month names (in lowercase) to their corresponding English month names.&lt;/li&gt;
&lt;li&gt; Return the translated English month name or &lt;code&gt;NULL&lt;/code&gt; if no match is found.&lt;/li&gt;
&lt;li&gt; The function is defined using the &lt;code&gt;plpgsql&lt;/code&gt; language.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s the SQL code for the function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION translate_french_month_to_english(month TEXT)RETURNS TEXT AS $$BEGIN  RETURN CASE    WHEN lower(month) = 'janvier' THEN 'January'    WHEN lower(month) = 'février' THEN 'February'    WHEN lower(month) = 'mars' THEN 'March'    WHEN lower(month) = 'avril' THEN 'April'    WHEN lower(month) = 'mai' THEN 'May'    WHEN lower(month) = 'juin' THEN 'June'    WHEN lower(month) = 'juillet' THEN 'July'    WHEN lower(month) = 'août' THEN 'August'    WHEN lower(month) = 'septembre' THEN 'September'    WHEN lower(month) = 'octobre' THEN 'October'    WHEN lower(month) = 'novembre' THEN 'November'    WHEN lower(month) = 'décembre' THEN 'December'    ELSE NULL  END;END;$$ LANGUAGE plpgsql;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By adding this function to your PostgreSQL database, you can easily translate the French month names in your Uber receipts data to their English counterparts. This will help standardize the date format and make it easier to work with the data in dbt and other data processing tools.&lt;/p&gt;

&lt;p&gt;Once the &lt;code&gt;translate_french_month_to_english&lt;/code&gt; function is created, you can now create your first staging models for both Uber Eats and Uber rides data. In each model, you will use the custom date parsing function to convert the French date format to a standardized format.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Create a new model for staging Uber Eats data:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ config(materialized='table') }}SELECT *,       {{ parse_custom_date('date') }} as transaction_dateFROM {{ ref('uber_eats') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model uses the &lt;code&gt;parse_custom_date&lt;/code&gt; macro (which should be defined in your &lt;code&gt;macros&lt;/code&gt; folder) to convert the French date format in the &lt;code&gt;date&lt;/code&gt; column. The resulting standardized date is stored in a new column called &lt;code&gt;transaction_date&lt;/code&gt;.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt; Create a new model for staging Uber rides data:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ config(materialized='table') }}SELECT *,       {{ parse_custom_date('date') }} as transaction_dateFROM {{ ref('uber_ride') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the staging models in place, run the following dbt commands:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;dbt seed&lt;/code&gt;: This command loads the seed data from the CSV files in the &lt;code&gt;seeds&lt;/code&gt; folder into your database.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;dbt run&lt;/code&gt;: This command executes the models in your project. It will create the staging tables for both Uber Eats and Uber rides data, applying the custom date parsing to standardize the date format.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After creating the staging models, you can create an intermediate model called &lt;code&gt;uber_transactions.sql&lt;/code&gt; in the &lt;code&gt;models/intermediate&lt;/code&gt; folder. This model combines the Uber Eats and Uber rides data into a single table, which can be useful for further analysis and reporting. Here's a breakdown of the code in this model:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Set the materialization type to ‘table’ using the &lt;code&gt;config&lt;/code&gt; function:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{{ config(materialized='table') }}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;2. Create a Common Table Expression (CTE) named &lt;code&gt;eats&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH eats AS (    SELECT 'eats' as type,           transaction_date,           total,           restaurant    FROM {{ ref('uber_eating') }}),
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This CTE selects data from the &lt;code&gt;uber_eating&lt;/code&gt; staging model, adding a new column called &lt;code&gt;type&lt;/code&gt; with a value of 'eats' to identify the source of the data.&lt;/p&gt;

&lt;p&gt;3. Create another CTE named &lt;code&gt;rides&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rides AS (    SELECT 'rides' as type,           transaction_date,           total,           driver    FROM {{ ref('uber_riding') }})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similar to the &lt;code&gt;eats&lt;/code&gt; CTE, this CTE selects data from the &lt;code&gt;uber_riding&lt;/code&gt; staging model and adds a &lt;code&gt;type&lt;/code&gt; column with a value of 'rides' to identify the source of the data.&lt;/p&gt;

&lt;p&gt;4. Combine the &lt;code&gt;eats&lt;/code&gt; and &lt;code&gt;rides&lt;/code&gt; CTEs using the &lt;code&gt;UNION ALL&lt;/code&gt; operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT *FROM eatsUNION ALLSELECT *FROM rides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;UNION ALL&lt;/code&gt; operator combines the results of the two SELECT statements into a single result set. This will create a single table containing both Uber Eats and Uber rides data, with the &lt;code&gt;type&lt;/code&gt; column indicating the source of each row.&lt;/p&gt;

&lt;p&gt;Full-Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- models/intermediate/uber_transactions.sql{{ config(materialized='table') }}WITH eats AS (    SELECT 'eats' as type,           transaction_date,           total,           restaurant    FROM {{ ref('uber_eating') }}),rides AS (    SELECT 'rides' as type,           transaction_date,           total,           driver    FROM {{ ref('uber_riding') }})SELECT *FROM eatsUNION ALLSELECT *FROM rides
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QXLoC2ls--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/1%2AdR3nWRNo2n5fuZn68eWtIQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QXLoC2ls--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:1050/1%2AdR3nWRNo2n5fuZn68eWtIQ.png" alt="" width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating the intermediate model, the next step is to create a series of models. These models will generate various aggregated metrics for the rides data, such as average expense, and expenses by week, month, quarter, and year.&lt;/p&gt;

&lt;p&gt;Here’s a brief overview of the models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;code&gt;average_expense_rides.sql&lt;/code&gt;: Calculates the average expense of Uber rides.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;monthly_expenses_rides.sql&lt;/code&gt;: Aggregates the total expenses of Uber rides on a monthly basis.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;quarterly_expenses_rides.sql&lt;/code&gt;: Aggregates the total expenses of Uber rides on a quarterly basis.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;weekly_expenses_rides.sql&lt;/code&gt;: Aggregates the total expenses of Uber rides on a weekly basis.&lt;/li&gt;
&lt;li&gt; &lt;code&gt;yearly_expenses_rides.sql&lt;/code&gt;: Aggregates the total expenses of Uber rides on a yearly basis.&lt;/li&gt;
&lt;/ol&gt;
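&lt;p&gt;To make the aggregation concrete, here is what a model like &lt;code&gt;monthly_expenses_rides.sql&lt;/code&gt; computes, sketched in pandas. This is for intuition only; the actual models are dbt SQL, and the sample figures below are invented:&lt;/p&gt;

```python
import pandas as pd

# Invented sample data standing in for the rides rows of uber_transactions
rides = pd.DataFrame({
    'transaction_date': pd.to_datetime(['2023-01-05', '2023-01-20', '2023-02-03']),
    'total': [12.50, 8.00, 15.25],
})

# Pandas equivalent of grouping ride totals by calendar month
monthly = (
    rides.assign(month=rides['transaction_date'].dt.to_period('M').astype(str))
         .groupby('month', as_index=False)['total']
         .sum()
         .rename(columns={'total': 'monthly_expense'})
)
```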

&lt;p&gt;By creating these models, you can use Power BI to analyze and visualize various aspects of your Uber rides expenses over different time periods. This provides a comprehensive view of your expenditure patterns and helps you make more informed decisions about your transportation budget.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--66qzzmgw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:630/1%2AwegAW7DAfK-xMBBWDNiyrA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--66qzzmgw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:630/1%2AwegAW7DAfK-xMBBWDNiyrA.png" alt="" width="420" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In conclusion&lt;/strong&gt;, this project demonstrates the process of building a data pipeline for analyzing Uber and Uber Eats expenses. By leveraging tools such as Python, PostgreSQL, dbt, and Power BI, you can extract, clean, and transform data from various sources, then visualize it in a way that provides valuable insights.&lt;/p&gt;

&lt;p&gt;Throughout this project, you:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Extracted data from Uber and Uber Eats PDF receipts using Python and pdfplumber.&lt;/li&gt;
&lt;li&gt; Created a PostgreSQL database and a pgAdmin container using Docker Compose.&lt;/li&gt;
&lt;li&gt; Loaded the extracted data into the database and configured a dbt project.&lt;/li&gt;
&lt;li&gt; Created a custom PostgreSQL function to handle date translations from French to English.&lt;/li&gt;
&lt;li&gt; Built a series of dbt models for staging, intermediate, and aggregated data.&lt;/li&gt;
&lt;li&gt; Analyzed and visualized the data using Power BI (not covered in detail here but assumed as part of the project).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://medium.com/@stefentaime_10958/uber-project-analyzing-personal-uber-and-uber-eats-expenses-with-elt-data-pipeline-using-dbt-91ead4aea5df"&gt;https://medium.com/@stefentaime_10958/uber-project-analyzing-personal-uber-and-uber-eats-expenses-with-elt-data-pipeline-using-dbt-91ead4aea5df&lt;/a&gt;&lt;/p&gt;

</description>
      <category>powerbi</category>
      <category>dbt</category>
      <category>sql</category>
      <category>python</category>
    </item>
    <item>
      <title>AI-Powered Accommodation Search: Harnessing the Power of Hadoop, MongoDB, Spark, GPT-3, React, and Flask</title>
      <dc:creator>Stefen</dc:creator>
      <pubDate>Mon, 15 May 2023 14:06:14 +0000</pubDate>
      <link>https://dev.to/stefentaime/ai-powered-accommodation-search-harnessing-the-power-of-hadoop-mongodb-spark-gpt-3-react-and-flask-1a69</link>
      <guid>https://dev.to/stefentaime/ai-powered-accommodation-search-harnessing-the-power-of-hadoop-mongodb-spark-gpt-3-react-and-flask-1a69</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VE09heG---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4000/1%2ABD-N-PRO9gXaxJB9jEhXOA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VE09heG---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4000/1%2ABD-N-PRO9gXaxJB9jEhXOA.png" alt="" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In today’s dynamic and data-driven world, the ability to harness information effectively and deliver user-specific results has become paramount. This is particularly true in the accommodation industry, where customer preferences can vary enormously. Leveraging AI and Big Data technologies, I’ve created an intelligent data pipeline capable of tailoring accommodation search results to individual needs.&lt;/p&gt;

&lt;p&gt;This article outlines the process of building an AI data pipeline using Hadoop HDFS, MongoDB, Spark, GPT-3, React, and Flask. The goal is to develop an intuitive platform where users can search for Airbnb apartments based on a target city, budget, and duration of stay, all powered by the intelligent language model, GPT-3.&lt;/p&gt;

&lt;h2&gt;Step 1: Data Acquisition and Upload to HDFS&lt;/h2&gt;

&lt;p&gt;The data used in this project was derived from a dataset comprising listings from Airbnb, Booking, and Hotels.com. This dataset is focused on exploring the pricing landscape within the most popular European capitals. Each city contributed 500 hotels from each platform, culminating in a total of 7500 hotel listings.&lt;/p&gt;

&lt;p&gt;Example Berlin.json:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "airbnbHotels": [
    {
      "thumbnail": "https://a0.muscache.com/im/pictures/miso/Hosting-647664199858827562/original/cfc2fc4c-d703-4827-bc25-f1acb07e0025.jpeg?im_w=720",
      "title": "Private room in Tempelhof",
      "subtitles": ["Privatzimmer in Tempelhofer Feld", "1 bed", "Jul 24 – 31"],
      "price": { "currency": "$", "value": 31, "period": "night" },
      "rating": 5,
      "link": "https://www.airbnb.com/rooms/647664199858827562"
    },
    {
      "thumbnail": "https://a0.muscache.com/im/pictures/b9cb8b8c-51b3-46c4-b9cd-d27053f7d628.jpg?im_w=720",
      "title": "Private room in Mitte",
      "subtitles": ["Tiny, individual Room with private Bathroom", "1 small double bed", "Sep 1 – 8"],
      "price": { "currency": "$", "value": 40, "period": "night" },
      "rating": 4.96,
      "link": "https://www.airbnb.com/rooms/41220512"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Paris.json:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "airbnbHotels": [
    {
      "thumbnail": "https://a0.muscache.com/im/pictures/b9bc653d-df43-4f91-8162-0be5c912a3b4.jpg?im_w=720",
      "title": "Apartment in Paris",
      "subtitles": ["A nice little space and cute", "1 single bed", "May 8 – 13"],
      "price": { "currency": "$", "value": 57, "period": "night" },
      "rating": 4.46,
      "link": "https://www.airbnb.com/rooms/7337703"
    },
    {
      "thumbnail": "https://a0.muscache.com/im/pictures/261b3fb2-fec9-4009-b8b9-90d9976597fd.jpg?im_w=720",
      "title": "Apartment in Paris",
      "subtitles": ["Small cozy cocoon in Paris! 1 person studio", "1 double bed", "Jul 30 – Aug 5"],
      "price": { "currency": "$", "value": 107, "period": "night" },
      "rating": 4.98,
      "link": "https://www.airbnb.com/rooms/25820315"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After downloading the data onto my local system, a Makefile was created to transfer the data onto Hadoop’s distributed file system (HDFS). This process involves copying the JSON files from the local directory into the HDFS using Docker and Hadoop commands, ensuring the data is stored in a distributed manner for efficient processing. Run &lt;strong&gt;make copy_files_to_hdfs&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.PHONY: copy_files_to_hdfs
copy_files_to_hdfs:
 @for local_path in $(local_dir_hdfs)/*.json; do \
  filename=$$(basename -- "$$local_path"); \
  docker cp $$local_path $(container_id_hdfs):$(docker_dir_hdfs)/$$filename; \
  docker exec $(container_id_hdfs) hadoop fs -copyFromLocal $(docker_dir_hdfs)/$$filename $(hdfs_dir_hdfs)/$$filename; \
 done
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;Step 2: Data Processing with Spark and Storage in MongoDB&lt;/h2&gt;

&lt;p&gt;Once the data is in HDFS, the next step is to process and clean it. For this purpose, Apache Spark comes into play. Apache Spark is an open-source, distributed computing system that handles data processing and analytics on large datasets, making it perfect for the task at hand.&lt;/p&gt;

&lt;p&gt;Let’s delve deeper into how Spark processes the data and loads it into MongoDB.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, lit, col, from_json
from pyspark.sql.types import *

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Data Processing") \
    .master("spark://spark:7077") \
    .getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The first step involves creating a Spark session, which is the entry point to any Spark functionality. Here, we specify the name of the application and the master URL to connect to, which in this case is a Spark standalone cluster.&lt;/p&gt;

&lt;p&gt;Next, we define the schema of our data. A Spark schema is a blueprint of what our DataFrame should look like, including the types of data it contains.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define the schema
schema = StructType([
    StructField("airbnbHotels", ArrayType(
        StructType([
            StructField("thumbnail", StringType()),
            StructField("title", StringType()),
            StructField("subtitles", ArrayType(StringType())),
            StructField("price", StructType([
                StructField("currency", StringType()),
                StructField("value", DoubleType()),
                StructField("period", StringType())
            ])),
            StructField("rating", DoubleType()),
            StructField("link", StringType())
        ])
    ))
])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Once the schema is defined, we load the data from HDFS and process it. The data processing includes adding a new column with the city name and exploding the “airbnbHotels” field into separate rows. We then select the individual fields from the “airbnbHotels” objects.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for city in cities:
    # Load the data for the current city
    df = spark.read.option("multiline", "true").json(f'hdfs://namenode:9000/input/{city}.json')

    # Add a new column with the city name
    df = df.withColumn("city", lit(city))

    # Explode the "airbnbHotels" field into separate rows
    df = df.select("city", explode(df.airbnbHotels).alias("airbnbHotels"))

    # Select the individual fields from the "airbnbHotels" objects
    df = df.select(
        lit(city).alias("city"),
        df.airbnbHotels.thumbnail.alias("thumbnail"),
        df.airbnbHotels.title.alias("title"),
        df.airbnbHotels.subtitles.alias("subtitles"),
        df.airbnbHotels.price.alias("price"),
        df.airbnbHotels.rating.alias("rating"),
        df.airbnbHotels.link.alias("link")
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
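&lt;p&gt;If the explode step is unfamiliar, the transformation can be pictured in plain Python: each element of the "airbnbHotels" array becomes its own row, with the city value repeated on every row. The toy data below is invented:&lt;/p&gt;

```python
# Plain-Python picture of what Spark's explode does: one row per array element
data = {
    "berlin": {"airbnbHotels": [{"title": "Loft A"}, {"title": "Studio B"}]},
    "paris": {"airbnbHotels": [{"title": "Flat C"}]},
}

rows = []
for city, doc in data.items():
    for hotel in doc["airbnbHotels"]:
        # The city column is repeated for every exploded element
        rows.append({"city": city, "title": hotel["title"]})

print(rows)
# [{'city': 'berlin', 'title': 'Loft A'}, {'city': 'berlin', 'title': 'Studio B'},
#  {'city': 'paris', 'title': 'Flat C'}]
```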

&lt;p&gt;After the processing is done, the data is ready to be loaded into MongoDB. MongoDB is a popular NoSQL database known for its flexibility and scalability.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    # Define MongoDB connection URI for Mongo Atlas, you can create free account
    MONGODB_URI = "mongodb+srv://&amp;lt;username&amp;gt;:&amp;lt;password&amp;gt;@cluster0.mongodb.net/?retryWrites=true&amp;amp;w=majority" 
    # Use mongodb://root:example@mongo:27017 if localhost docker for mongo
    # Write DataFrame to MongoDB
    df.write.format("com.mongodb.spark.sql.DefaultSource") \
        .mode("append") \
        .option("uri", MONGODB_URI) \
        .option("database", "booking") \
        .option("collection", "airbnb") \
        .save()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The DataFrame is written into MongoDB using the MongoDB Spark Connector. This allows data to be written from Spark into MongoDB by specifying the MongoDB connection URI, the database, and the collection where the data should be stored.&lt;/p&gt;

&lt;p&gt;Finally, the SparkSession is stopped to free up resources:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Stop the SparkSession
spark.stop()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This step is vital as it not only cleans the data but also transforms and structures it in a way that makes it readily accessible for the subsequent stages of the pipeline.&lt;/p&gt;

&lt;p&gt;By using the powerful processing capabilities of Apache Spark and the flexible storage of MongoDB, we can effectively handle, clean, and store large amounts of data. This forms a robust foundation for the next steps in our pipeline where the data will be used to generate tailored search results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aXFtFb5Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3424/1%2AUYiQPga1noAniFrdo3XF0A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aXFtFb5Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3424/1%2AUYiQPga1noAniFrdo3XF0A.png" alt="" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Flask API and GPT-3 Integration
&lt;/h2&gt;

&lt;p&gt;In step 3, a Flask API is built to serve as the interface between the GPT-3 model and the MongoDB database. This is a crucial component of the pipeline, as it enables real-time interaction with the data stored in MongoDB through GPT-3's natural language processing.&lt;/p&gt;

&lt;p&gt;To begin with, necessary modules are imported and a Flask application is set up. The Python JSON Encoder is extended to handle ObjectId instances, which are specific to MongoDB. This is done via a custom JSONEncoder class. This ensures the JSON serialization process can handle the unique ObjectId values from MongoDB documents.&lt;/p&gt;

&lt;p&gt;The Flask application also sets up logging to help with debugging and monitoring the application:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logger = logging.getLogger('my_logger')
logger.setLevel(logging.DEBUG)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A MongoDB client is established with the relevant MongoDB URI. The database and collection that will be interacted with are then specified.&lt;/p&gt;

&lt;p&gt;The function process_query() communicates with the GPT-3 model, sending a system message and the user's search query. The function then extracts and returns the content from the GPT-3 response.&lt;/p&gt;

&lt;p&gt;Next, the function translate_to_mongo_query() takes the output from process_query() and translates it into a MongoDB query. It does this by extracting the city, budget, and dates from the processed query and using these values to create a query dictionary.&lt;/p&gt;

&lt;p&gt;The find_results() function takes a user's query, processes it through GPT-3, translates it into a MongoDB query, then executes the query on the MongoDB collection. It then returns the results as a string.&lt;/p&gt;

&lt;p&gt;The generate_preamble() function generates a friendly introduction to the apartment options list using the GPT-3 model.&lt;/p&gt;

&lt;p&gt;The format_results() function takes the result set and formats each document into a string that includes relevant apartment information.&lt;/p&gt;

&lt;p&gt;Finally, the home() function in the Flask app receives the POST request containing the user's query. It uses the find_results() function to get the MongoDB results, which it then sends to the GPT-3 model. The GPT-3 model generates a response, which the function returns as a JSON object. If there are any errors during this process, the function catches the exception and returns the error message.&lt;/p&gt;

&lt;p&gt;In summary, this Flask API serves as the glue that connects the user’s search request, GPT-3’s natural language processing, and the MongoDB database.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import json
from pymongo import MongoClient
from flask import Flask, request, jsonify
import logging
from bson import ObjectId
from flask import json
import openai
import re
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

class JSONEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, ObjectId):
            return str(o)
        return json.JSONEncoder.default(self, o)

app.json_encoder = JSONEncoder

logger = logging.getLogger('my_logger')
logger.setLevel(logging.DEBUG)

# Get API key from environment variable (never print or log the key itself)
api_key = os.getenv("OPENAI_KEY")
if not api_key:
    raise ValueError("Missing OpenAI API key")

# Set the OpenAI API key
openai.api_key = api_key

# Connect to MongoDB
client = MongoClient('mongodb+srv://&amp;lt;username&amp;gt;:&amp;lt;password&amp;gt;@cluster0.td8y4zr.mongodb.net/?retryWrites=true&amp;amp;w=majority')
db = client['booking']
collection = db['airbnb']

def process_query(query):
    # Use GPT-3 to process the query (the openai client authenticates
    # with openai.api_key, so no manual request headers are needed)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant. You can retrieve information from a database and present it to the user. You can communicate information about an apartment such as its title, location, price, availability, and the link to its page. The user will provide you with their requirements in the following format: 'city: city_name, budget: budget_amount, dates: start_date to end_date'."},
            {"role": "user", "content": query}
        ],
        max_tokens=100
    )

    # Get the desired information from the response
    processed_query = response.choices[0].message['content'].strip()

    return processed_query

def translate_to_mongo_query(processed_query):
    # Initialize the MongoDB query dictionary
    query_dict = {}

    # Search for city in the processed query
    city_search = re.search(r'city: (.*?),', processed_query, re.IGNORECASE)
    if city_search:
        query_dict['city'] = city_search.group(1).strip()

    # Search for budget in the processed query
    budget_search = re.search(r'budget: \$?(\d+)', processed_query, re.IGNORECASE)
    if budget_search:
        # Convert the budget to the same currency as in your database
        budget_in_dollars = int(budget_search.group(1))
        query_dict['price.value'] = {'$lte': budget_in_dollars}

    # Search for the dates in the processed query
    date_search = re.search(r'dates: (.*?) - (.*?)\.', processed_query, re.IGNORECASE)
    if date_search:
        # Format the dates string to match your database format
        dates_string = f"{date_search.group(1)} – {date_search.group(2)}"
        query_dict['subtitles'] = {"$regex": dates_string}

    return query_dict

def find_results(query):
    logger.debug('Starting to find results')

    # Use GPT-3 to understand the query
    processed_query = process_query(query)
    print("Processed Query:", processed_query)
    logger.debug('Processed query with GPT-3')

    # Translate the processed query into MongoDB query
    mongo_query = translate_to_mongo_query(processed_query)
    logger.debug('Translated query to MongoDB query')
    print("MongoDB query:", mongo_query)

    # Run the query and get the results
    results = collection.find(mongo_query).limit(5)  
    logger.debug('Got results from MongoDB')

    # Transform the results into a string, including less information for each result
    results_string = "\n".join([f"Apartment title: {result['title']}, price: {result['price']['value']}, link: {result['link']}" for result in results])

    return results_string

def generate_preamble():
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Provide a friendly introduction to a list of apartment options."}
        ],
        max_tokens=100
    )

    # Get the generated preamble from the response
    preamble = response.choices[0].message['content'].strip()

    return preamble

def format_results(results):
    formatted_results = []
    for result in results:
        formatted_result = f"Title: {result['title']}\nCity: {result['city']}\nPrice: {result['price']['value']} {result['price']['currency']} per {result['price']['period']}\nDates: {', '.join(result['subtitles'])}\nRating: {result['rating']}\nLink: {result['link']}"
        formatted_results.append(formatted_result)
    return formatted_results

@app.route('/', methods=['POST'])
def home():
    try:
        data = request.get_json(force=True)
        query = data['query']

        # Get the results string from find_results
        results_string = find_results(query)

        # Create a GPT-3 prompt with the results
        prompt = f"{query}\n{results_string}"

        # Rough guard: whitespace word count only approximates the true token count
        total_tokens = len(query.split()) + len(results_string.split())
        if total_tokens &amp;gt; 4097:
            raise ValueError(f"Estimated token count ({total_tokens}) exceeds the model's limit (4097)")

        # Use GPT-3 to generate a response
        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
        )
        gpt3_response = response.choices[0].message['content'].strip()

        return jsonify({"response": gpt3_response})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    app.run(debug=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
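&lt;p&gt;To see the query translation in action, here is a standalone sketch of the same regex-based mapping. The helper below mirrors the translate_to_mongo_query logic from the API, and the sample sentence is invented:&lt;/p&gt;

```python
import re

def to_mongo_query(processed_query):
    # Extract city, budget, and dates with the same patterns as the API
    query = {}
    city = re.search(r'city: (.*?),', processed_query, re.IGNORECASE)
    if city:
        query['city'] = city.group(1).strip()
    budget = re.search(r'budget: \$?(\d+)', processed_query, re.IGNORECASE)
    if budget:
        # Cap listings at the stated budget
        query['price.value'] = {'$lte': int(budget.group(1))}
    dates = re.search(r'dates: (.*?) - (.*?)\.', processed_query, re.IGNORECASE)
    if dates:
        # Match the en-dash date format stored in the "subtitles" field
        query['subtitles'] = {'$regex': f"{dates.group(1)} – {dates.group(2)}"}
    return query

print(to_mongo_query("city: Berlin, budget: $700, dates: May 8 - May 13."))
# {'city': 'Berlin', 'price.value': {'$lte': 700}, 'subtitles': {'$regex': 'May 8 – May 13'}}
```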

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xbsYAcpp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2820/1%2AGvFJKcSS5bj0s4c0bazSoA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xbsYAcpp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2820/1%2AGvFJKcSS5bj0s4c0bazSoA.png" alt="" width="800" height="75"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: User Interface with React
&lt;/h2&gt;

&lt;p&gt;In step 4, a React application is built as the user interface where users can enter their search criteria such as city, budget, and period of stay. This is done by creating a functional component SearchBar that maintains two states: query and results.&lt;/p&gt;

&lt;p&gt;Here’s how the main components of this script work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;useState(): This React Hook adds state to functional components. query is initialized as an empty string, and the setQuery function is used to change its value; the same goes for results and setResults.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;handleSubmit(): This is an asynchronous function that is triggered when the user submits the search form. It sends a POST request to the Flask backend (running on &lt;a href="http://localhost:5000"&gt;http://localhost:5000&lt;/a&gt;) with the search query as JSON in the request body. It then waits for the response from the backend, parses the JSON response, and sets the results state with the returned data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;return: The returned JSX renders the search bar and the results. It includes a form with an input field and a submit button. The input field's value is bound to the query state, and any change in the input field updates that state. When the form is submitted, the handleSubmit function is called, and the results are displayed in a div element.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import React, { useState } from 'react';
import './Search.css';

function SearchBar() {
  const [query, setQuery] = useState("");
  const [results, setResults] = useState("");

  const handleSubmit = async (event) =&amp;gt; {
    event.preventDefault();
    const response = await fetch('http://localhost:5000', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ query })
    });
    const data = await response.json();
    setResults(data.response);
  };

  return (
    &amp;lt;form className="search-form" onSubmit={handleSubmit}&amp;gt;
      &amp;lt;input
        className="search-input"
        type="text"
        value={query}
        onChange={event =&amp;gt; setQuery(event.target.value)}
      /&amp;gt;
      &amp;lt;button type="submit"&amp;gt;Search&amp;lt;/button&amp;gt;
      &amp;lt;div className="search-results"&amp;gt;{results}&amp;lt;/div&amp;gt;
    &amp;lt;/form&amp;gt;
  );
}

export default SearchBar;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the given example, the user enters “I want an apartment in Berlin with a budget of $700 between May 8–13” in the search bar. The React application sends this query to the Flask backend, which processes the query using GPT-3, forms a MongoDB query, retrieves matching results from the MongoDB database, and sends the results back to the React application. The application then displays the results: a list of apartments that match the user’s criteria, with a link to each apartment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9ZujEkdg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2406/1%2AHEd7O9eR6WIabsTo3P2DTA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9ZujEkdg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2406/1%2AHEd7O9eR6WIabsTo3P2DTA.png" alt="" width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we have explored an innovative and efficient system that integrates GPT-3, Apache Spark, MongoDB, and React to create a user-friendly search application.&lt;/p&gt;

&lt;p&gt;The first step involves collecting data from a source like Airbnb, transforming it, and storing it in a distributed file system like HDFS, which lays the groundwork for large-scale data processing.&lt;/p&gt;

&lt;p&gt;Next, the data stored in HDFS is processed and loaded into MongoDB using PySpark, showcasing the flexibility and ease of combining big data technologies with NoSQL databases.&lt;/p&gt;

&lt;p&gt;The third step introduces GPT-3, a state-of-the-art language model developed by OpenAI. With its remarkable natural language understanding capabilities, GPT-3 is used to interpret the user’s search query, which is then transformed into a MongoDB query to retrieve relevant results from the database.&lt;/p&gt;

&lt;p&gt;Finally, a user-friendly interface is developed using React, a popular JavaScript library for building UIs. This interface allows users to input their search queries in a natural language format and get results in real-time, providing a seamless user experience.&lt;/p&gt;

&lt;p&gt;In conclusion, this system demonstrates the potential of combining big data technologies, AI, and modern frontend development to create powerful and user-friendly applications. It illustrates the possibilities that open up when different technologies are integrated to work together, providing a practical and efficient solution for complex search problems. The methods and technologies discussed here can be extended or modified for various use cases, paving the way for future innovations in the intersection of AI, big data, and web development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Stefen-Taime/IA_Data_Pipeline"&gt;Github&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/@stefentaime_10958/ai-powered-accommodation-search-harnessing-the-power-of-hadoop-mongodb-spark-gpt-3-react-and-7e0bfc41bf26"&gt;Medium&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;To run this project, follow these steps:

Step 1: Clone the Repository
First, clone the repository from GitHub using the following command:

git clone https://github.com/Stefen-Taime/IA_Data_Pipeline.git
Step 2: Navigate to Project Directory
Next, navigate to the project directory using this command:

cd IA_Data_Pipeline
Step 3: Use Makefile Commands
You'll need to run a series of commands using the Makefile provided in the project. The Makefile is a tool that simplifies building and managing the project.

Copy Files to HDFS: This will copy the necessary files to Hadoop Distributed File System (HDFS). Run the following command:

make copy_files_to_hdfs
Load Data to MongoDB: This will load the data from HDFS to MongoDB. Run the following command:

make run_load_to_mongo
Run API: This command will start the Flask API which uses GPT-3 to process the user's queries. Run the following command:

make run_api
Start Frontend: This will start the React application, which serves as the user interface for the project. Run the following command:

make start_front_end
Important Note: This project uses OpenAI's GPT-3, and you will need an OpenAI API key to run the application. Make sure to set the OpenAI API key in your environment variables before starting the API.

If you encounter any issues while setting up or running the project, you can refer to the project's documentation or open an issue in the GitHub repository. Remember, this project is an example of how to integrate various technologies, and you might need to adjust some settings based on your specific environment and setup.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>database</category>
      <category>ia</category>
      <category>openai</category>
      <category>data</category>
    </item>
  </channel>
</rss>
