<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Muhammad Adnan Khan</title>
    <description>The latest articles on DEV Community by Muhammad Adnan Khan (@adnankhanxx).</description>
    <link>https://dev.to/adnankhanxx</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1053059%2F72290f36-e238-4e3b-be9c-d07c24b849c6.jpg</url>
      <title>DEV Community: Muhammad Adnan Khan</title>
      <link>https://dev.to/adnankhanxx</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adnankhanxx"/>
    <language>en</language>
    <item>
      <title>AWS Kinesis - Stream Storage Layer</title>
      <dc:creator>Muhammad Adnan Khan</dc:creator>
      <pubDate>Mon, 29 Jan 2024 10:17:42 +0000</pubDate>
      <link>https://dev.to/aws-builders/aws-kinesis-stream-storage-layer-mi0</link>
      <guid>https://dev.to/aws-builders/aws-kinesis-stream-storage-layer-mi0</guid>
      <description>&lt;p&gt;In this blog post, we will discuss the AWS Kinesis data stream service to understand the high-level overview of the service, architecture, core components, and the use case of the AWS Kinesis service.&lt;/p&gt;

&lt;p&gt;AWS Kinesis has the following sub-services:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;AWS Kinesis Data Stream (KDS)&lt;/li&gt;
&lt;li&gt;AWS Kinesis Data Firehose&lt;/li&gt;
&lt;li&gt;AWS Kinesis Data Analytics&lt;/li&gt;
&lt;li&gt;AWS Kinesis Video Stream&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv78icb6ip4x4ajxada0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv78icb6ip4x4ajxada0h.png" alt="Kinesis Layers" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our primary discussion will be around Kinesis Data Stream, the stream storage layer; we will cover its overview, architecture, and other necessary details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Kinesis Data Stream (KDS)&lt;/strong&gt;&lt;br&gt;
KDS is an elastically scalable service for near-real-time processing of streaming big data. It's a data ingestion layer that retains data from 24 hours up to 8,760 hours (365 days); by default, retention is 24 hours. Data inside KDS is immutable: once stored, it cannot be modified, and it cannot be removed until it expires.&lt;/p&gt;

&lt;p&gt;The KDS is composed of two layers.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Storage Layer&lt;/li&gt;
&lt;li&gt;Processing Layer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;1. Storage Layer&lt;/strong&gt;&lt;br&gt;
The storage layer is responsible for storing and managing the incoming data stream temporarily before it goes for further processing in the processing layer.&lt;br&gt;
&lt;strong&gt;2. Processing Layer&lt;/strong&gt;&lt;br&gt;
This layer is fed by the storage layer and is responsible for analyzing and transforming the data in real time or near-real time. After processing, it notifies the storage layer to delete the data that is no longer needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kinesis Data Stream - Architecture&lt;/strong&gt;&lt;br&gt;
A KDS is composed of several components, which we will discuss one by one along with how they relate to one another.&lt;br&gt;
A KDS is composed of one or more shards.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shards&lt;/strong&gt;: a shard contains a sequence of data records. Each shard supports a total write rate of 1 MB/s or 1,000 records per second, and a read rate of 2 MB/s with up to 5 read transactions per second.&lt;/p&gt;
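&lt;p&gt;As a quick sanity check on these limits, here is a minimal Python sketch (my own illustration, not an AWS API) that computes how many shards a stream needs for a given write workload, using the 1 MB/s and 1,000 records/s per-shard figures above:&lt;/p&gt;

```python
import math

# Per-shard write limits described above
WRITE_MB_PER_SHARD = 1.0        # 1 MB/s per shard
WRITE_RECORDS_PER_SHARD = 1000  # 1,000 records/s per shard

def shards_needed(mb_per_sec: float, records_per_sec: int) -> int:
    """Smallest shard count that satisfies both write limits."""
    by_size = math.ceil(mb_per_sec / WRITE_MB_PER_SHARD)
    by_count = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    return max(by_size, by_count, 1)

# 4.5 MB/s needs 5 shards by size; 3,000 records/s needs only 3 by count
print(shards_needed(4.5, 3000))  # 5
```

&lt;p&gt;Whichever limit is the tighter one decides the shard count.&lt;/p&gt;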

&lt;p&gt;&lt;strong&gt;Data Records&lt;/strong&gt;: a data record is composed of a sequence number, a partition key, and a data blob.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucie57ujq6vof250i3hn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fucie57ujq6vof250i3hn.png" alt="Inside Shard Story" width="800" height="470"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The partition key inside a data record decides which shard the data will go to, and the blob is the original data itself.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note&lt;/em&gt;: The sequence number is unique per partition key within a shard.&lt;/p&gt;
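&lt;p&gt;Kinesis documents that the partition key is MD5-hashed to a 128-bit integer and each shard owns a range of that hash space. The sketch below imitates that routing with equal-sized ranges (a simplification; real shards can own unequal ranges after splits and merges):&lt;/p&gt;

```python
import hashlib

def route_to_shard(partition_key: str, num_shards: int) -> int:
    """Hash the partition key with MD5 to a 128-bit integer, then map it
    into one of num_shards equal hash-key ranges."""
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    return min(hashed // range_size, num_shards - 1)

# The same partition key always routes to the same shard:
print(route_to_shard("device-42", 4) == route_to_shard("device-42", 4))  # True
```

&lt;p&gt;This is why records sharing a partition key keep their ordering: they all land on one shard.&lt;/p&gt;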

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fando30ui5lwc7tdolufc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fando30ui5lwc7tdolufc.png" alt="AWS Kinesis Data Stream - Architecture by [AWS](https://www.amazon.com/)" width="800" height="344"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;producer&lt;/strong&gt; puts records into the data stream.&lt;br&gt;
&lt;strong&gt;Consumers&lt;/strong&gt; get records from the data stream; they are also known as KDS applications.&lt;/p&gt;

&lt;p&gt;The consumer applications generally run on a fleet of EC2 instances. There are two types of consumers in KDS:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Classic/Shared Fan-out consumers (SFO)&lt;/li&gt;
&lt;li&gt;Enhanced Fan-out consumers (EFO)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;SFO works on a poll/pull mechanism, where the consumer extracts records from the shard, whereas EFO works on a push mechanism: the consumer subscribes to the shard, and the shard automatically pushes the data to the consumer application.&lt;br&gt;
The default read throughput of each shard is 2 MB/s. In shared fan-out, all consumers share that 2 MB/s, but in enhanced fan-out each consumer receives its own 2 MB/s. Suppose 5 consumers are all reading from Shard1: in SFO, the 5 consumers together get 2 MB/s, but in EFO they get 10 MB/s in total, as each one has its own 2 MB/s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EFO vs SFO characteristics&lt;/strong&gt;&lt;br&gt;
EFO has a latency of around 70 ms, which stays the same for all consumers, while SFO starts at around 200 ms and increases with each consumer. For example, with 5 consumers, EFO latency remains at 70 ms, but SFO latency can increase up to 1,000 ms. EFO supports up to 20 registered consumers, while SFO is practically limited to about 5 consumers. EFO also costs more than SFO. The record delivery model in SFO uses HTTP, while EFO uses HTTP/2.&lt;br&gt;
Now that we have a high-level overview of the Amazon Kinesis Data Stream service, let's discuss its pricing model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;KDS Pricing&lt;/strong&gt;&lt;br&gt;
The following are the points you should consider while using the KDS service, as these are what you'll be charged for.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's an hourly charge incurred based on the number of shards.&lt;/li&gt;
&lt;li&gt;Separate charges when the producer puts the data in the stream.&lt;/li&gt;
&lt;li&gt;An additional hourly charge when the data retention period is extended beyond the default 24 hours.&lt;/li&gt;
&lt;li&gt;If Enhanced Fan-out is being used, charges are based on the amount of data and the number of consumers.&lt;/li&gt;
&lt;/ul&gt;
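&lt;p&gt;Those charge components can be combined into a back-of-the-envelope estimate. The rates in this sketch are placeholders I made up for illustration, not real AWS prices; always check the Kinesis pricing page for your region:&lt;/p&gt;

```python
# PLACEHOLDER rates for illustration only -- not real AWS prices.
SHARD_HOUR_USD = 0.015  # hypothetical charge per shard-hour
PUT_UNITS_USD = 0.014   # hypothetical charge per million 25 KB PUT payload units

def monthly_estimate(shards: int, million_put_units: float, hours: int = 730) -> float:
    """Shard-hours plus PUT payload units, rounded to cents."""
    shard_cost = shards * hours * SHARD_HOUR_USD
    put_cost = million_put_units * PUT_UNITS_USD
    return round(shard_cost + put_cost, 2)

print(monthly_estimate(4, 100))  # 45.2 with these placeholder rates
```

&lt;p&gt;Extended retention and enhanced fan-out consumers would add further line items on top of this.&lt;/p&gt;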

&lt;p&gt;Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.aws.amazon.com/streams/latest/dev/introduction.html"&gt;https://docs.aws.amazon.com/streams/latest/dev/introduction.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.amazon.com/Data-Engineering-AWS-Gareth-Eagar/dp/1800560419"&gt;https://www.amazon.com/Data-Engineering-AWS-Gareth-Eagar/dp/1800560419&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>kinesis</category>
      <category>awsbigdata</category>
      <category>stream</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Data Evolution - Databases to Data Lakehouse</title>
      <dc:creator>Muhammad Adnan Khan</dc:creator>
      <pubDate>Fri, 19 Jan 2024 12:14:16 +0000</pubDate>
      <link>https://dev.to/aws-builders/data-evolution-databases-to-data-lakehouse-d0j</link>
      <guid>https://dev.to/aws-builders/data-evolution-databases-to-data-lakehouse-d0j</guid>
<description>&lt;p&gt;In this blog post, we will discuss the evolution of data and data analytics solutions and how fast things have changed recently. We will go through the granular details to better understand the concepts later on.&lt;br&gt;
&lt;strong&gt;Data is the new oil!&lt;/strong&gt;&lt;br&gt;
Let's first understand what data is and how it became useful for many organizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data&lt;/strong&gt; is a term often used to describe information that can be stored in some format and transmitted. It can be in the form of text, numbers, or facts.&lt;/p&gt;

&lt;p&gt;It is not just a new term; it has been used by our ancestors in different forms, whether as oral tradition, in written form on paper, or in some electronic form stored somewhere.&lt;br&gt;
Before the invention of writing, people carried information orally, in the form of stories, knowledge, and history transferred from generation to generation. Later on, this was converted into written form on stone and leather, and with the invention of the printing press in the 15th century, information was stored in books and documents. Things kept changing with time, from the printing press to library catalogs, then punch cards, early computers, and databases, and now we are in the era of big data, where everyone has a personal device, every click generates data, and that data is stored somewhere in the world.&lt;br&gt;
Around 328.77 million TB of data was generated each day in 2023, which is around 120 zettabytes per year, and this is expected to rise to 180 zettabytes by 2025.&lt;/p&gt;

&lt;h2&gt;
  
  
  Welcome to the world of data
&lt;/h2&gt;

&lt;p&gt;Now that you have the history of data and how fast it has evolved: with all this evolution, many organizations utilized it for different purposes to get an edge over competitors.&lt;br&gt;
The data generated by you is used by organizations to generate profit. Every industry is using it, whether it's a social media platform, an e-commerce store, or a movie platform. They track historical data, analyze patterns, and make recommendations to keep users engaged on their platform and sell their content or products. And it's not just these industries: data has use cases in healthcare, oil &amp;amp; gas, pharma, and many other industries you can name.&lt;br&gt;
This is why data is called the new oil; it drives the world.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Big Data Era
&lt;/h2&gt;

&lt;p&gt;The journey of data processing and analytics systems has evolved over several decades. In the 1980s, data would be processed in nightly batch runs.&lt;br&gt;
With the increased use of databases, organizations found themselves with tens or even hundreds of databases supporting the business. These were transactional (OLTP) databases. As a result, in the 1990s data warehousing came into the picture for analytical purposes.&lt;br&gt;
The early 21st century witnessed the era of big data, when data was growing exponentially and in different formats (structured, unstructured, and semi-structured) from modern digital platforms: mobile, web, sensors, IoT devices, social media, and many others. All of it needed to be stored somewhere so analysis could be performed on it. In the early 2010s, a new technology for big data processing became popular: Hadoop, an open-source framework for processing large-scale datasets on clusters of computers. These clusters contain machines with attached disks that manage terabytes of data under a single distributed file system, the Hadoop Distributed File System (HDFS). The main bottleneck of these on-prem systems (Hadoop and Spark) is scalability, which requires a high upfront payment, along with other factors like latency, hardware management, and complexity.&lt;br&gt;
During this time, cloud-based &lt;strong&gt;data warehouses&lt;/strong&gt; (Redshift, BigQuery, Snowflake, and Synapse) came into the picture, which involved fewer managerial tasks, resolved the issues of scalability and latency, and offered a usage-based cost model.&lt;br&gt;
After that, the modern data stack started to evolve into a &lt;strong&gt;data lake architecture&lt;/strong&gt; built on highly durable, inexpensive, and limitless cloud object stores, where you can store any type of data without any transformation. Data lakes became the single source of truth for organizations. In this approach, all the data is ingested into the data lake, and a hot subset of it is moved from the data lake to the data warehouse to support low-latency queries.&lt;br&gt;
By integrating the best capabilities of both the data warehouse and the data lake, a new architecture came into the picture called the &lt;strong&gt;data lakehouse&lt;/strong&gt;, which overcame the bottlenecks of both: it supports any type of data, ACID transactions, and low latency, which a data lake can't support on its own.&lt;/p&gt;

&lt;p&gt;Now let's discuss each of the concepts defined above and the associated services used in AWS.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OLTP&lt;/strong&gt; (Online Transaction Processing)&lt;br&gt;
The source systems where business transactions are stored.&lt;br&gt;
AWS Services: RDS, Aurora, DynamoDB, and others&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OLAP&lt;/strong&gt; (Online Analytical Processing)&lt;br&gt;
The systems used for analytical purposes.&lt;br&gt;
AWS Service: Amazon Redshift&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;ETL&lt;/strong&gt; (Extract Transform Load)&lt;br&gt;
Used to transfer data from OLTP to OLAP systems.&lt;br&gt;
AWS Services: AWS Glue and AWS Data Pipeline&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data warehouse&lt;/strong&gt;&lt;br&gt;
A single source of truth that stores structured-only data with ACID properties, used for analytical purposes.&lt;br&gt;
AWS Service: Redshift&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data lake&lt;/strong&gt;&lt;br&gt;
A central repository that stores data from multiple source systems in any structure; it doesn't support ACID transactions and has high latency.&lt;br&gt;
AWS Service: S3&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Data lakehouse&lt;/strong&gt;&lt;br&gt;
A combination of the best capabilities of the data warehouse and the data lake, with support for ACID transactions, low latency, and any type of data.&lt;br&gt;
AWS Services: Redshift Spectrum and Lake Formation&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We will continue the series to explore each of the AWS services mentioned above in depth: the architecture, the working mechanism, and how to use multiple services to build a data warehouse, data lake, and data lakehouse on AWS.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dataengineering</category>
      <category>database</category>
      <category>datalake</category>
    </item>
    <item>
      <title>DBT + REDSHIFT = ❤</title>
      <dc:creator>Muhammad Adnan Khan</dc:creator>
      <pubDate>Mon, 27 Mar 2023 17:30:11 +0000</pubDate>
      <link>https://dev.to/aws-builders/dbt-redshift--3ci4</link>
      <guid>https://dev.to/aws-builders/dbt-redshift--3ci4</guid>
<description>&lt;p&gt;In recent times you have heard a lot about DBT (Data Build Tool), so let's explore the power of DBT with Amazon Redshift. We will develop data pipelines using DBT, with Redshift as our data warehouse and Power BI for visualization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is DBT?&lt;/strong&gt;&lt;br&gt;
Let's first understand what exactly DBT is and its use case.&lt;br&gt;
Data Build Tool, aka DBT, is an open-source tool that helps you apply transformations using the best practices of analytics engineering.&lt;br&gt;
I'm not going to explain the terms Extract Transform Load (ETL) and Extract Load Transform (ELT); I assume that you're familiar with them. The transformation step is what is applied in DBT.&lt;br&gt;
There are two ways to access DBT:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;DBT Core&lt;/li&gt;
&lt;li&gt;DBT Cloud&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're a GUI kind of person, go with DBT Cloud, and if you love to work with terminals, then go with DBT Core. The commands are not difficult; familiarity with basic commands like ls, cd, and pwd plus a few dbt commands is enough. For this project I'll go with DBT Core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redshift&lt;/strong&gt;&lt;br&gt;
Redshift is a cloud-based warehouse service provided by Amazon. It uses a Massively Parallel Processing (MPP) architecture, which distributes the data and processing across multiple nodes to improve query performance.&lt;/p&gt;


&lt;p&gt;A Redshift cluster is composed of a leader node and compute nodes; you can read about its architecture in detail here.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpfou5vldcaapuaskjft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdpfou5vldcaapuaskjft.png" alt="Redshift architecture by Hevo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Power BI&lt;/strong&gt;&lt;br&gt;
Power BI is a business intelligence tool by Microsoft. You can build highly interactive visualizations with just drag and drop, and it provides plenty of data connection options as well.&lt;br&gt;
If you're interested in Power BI, you can learn more about it here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dataset&lt;/strong&gt;&lt;br&gt;
The dataset I'm using is the Sakila database. You can find the scripts to create the tables and insert the data in the following repository.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: These scripts are specific to Amazon Redshift and will probably throw an error on other databases.&lt;/p&gt;

&lt;p&gt;1- &lt;strong&gt;Create environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You should create a dedicated Python environment for this project in order to avoid any conflicts.&lt;br&gt;
If you don't have the virtualenv library installed already, then run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install virtualenv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In order to create a virtual environment, you can run the following command.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python -m venv &amp;lt;environment-name&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To activate the environment, run:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;environment-name&amp;gt;/Scripts/activate.bat&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once the environment is activated, the environment name will appear in your command line before the path.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The commands differ between operating systems; the above-mentioned commands are specific to Windows.&lt;/p&gt;
&lt;/blockquote&gt;
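&lt;p&gt;To make the OS differences concrete, here is a small helper of my own (not part of DBT) that returns the activation script path for the current platform:&lt;/p&gt;

```python
import os
from pathlib import Path

def activate_script(env_dir: str) -> Path:
    """Return the venv activation script for the current OS:
    Scripts/activate.bat on Windows, bin/activate elsewhere."""
    if os.name == "nt":  # Windows
        return Path(env_dir) / "Scripts" / "activate.bat"
    # Linux/macOS: run it with `source`
    return Path(env_dir) / "bin" / "activate"

print(activate_script("dbt-env"))
```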

&lt;p&gt;2- &lt;strong&gt;DBT installation&lt;/strong&gt;&lt;br&gt;
It's time to install DBT. Before installing, make sure you have Python version 3.7 or above; DBT doesn't support versions below 3.7 as per its documentation, though this may change over time. You can read about the supported versions here.&lt;br&gt;
We're using Redshift, so we will use the Redshift adapter; if you're planning to use some other adapter, the command will vary accordingly. If you're following along, run the following command for Redshift:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install dbt-redshift&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Most project-related things are handled by DBT on its own; you can create or initialize a project by just running the command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dbt init&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The above command will create a project along with the boilerplate. Easy peasy, right? Okay, then what's next?&lt;/p&gt;

&lt;p&gt;3- &lt;strong&gt;Redshift Cluster Setup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before proceeding, we will set up our Amazon Redshift cluster and allow public accessibility. Public accessibility isn't recommended (you can use a VPC instead), but for demo purposes we can proceed.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create the Redshift cluster.&lt;/li&gt;
&lt;li&gt;Add inbound rules to the security group.&lt;/li&gt;
&lt;li&gt;Allow public accessibility.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once the cluster is set up, open the cluster properties, note down the endpoint, and connect to it locally. I already have Aqua Data Studio, so I connected through it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Redshift cost varies by different factors, so make sure to create a billing alert so you can receive updates regarding the cost. You can read more here.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once you're connected to your warehouse, create a schema called stg inside the database and run the table creation and insertion scripts; the repository with the scripts is mentioned above.&lt;br&gt;
Now you have the data in the staging layer and you want to load it into the warehouse. The data is already cleansed, so there's no need to introduce a transformation layer between the staging and target layers.&lt;/p&gt;

&lt;p&gt;As per the client's requirements, we decided to go with a sort of galaxy schema.&lt;/p&gt;

&lt;p&gt;Time to build models, but not ML models: I'm talking about DBT models, where you define your core logic. Inside the models directory, create two sub-directories for dimensions and facts. In each sub-directory, create a schema.yml file.&lt;/p&gt;

&lt;p&gt;This schema file contains information about the sources and some tests. This is how the schema file for the customer dimension will look:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;version: 2

models:
  - name: dim_customer
    description: "Dim customer to join customer with city,address and country"
    columns:
      - name: customer_id
        description: "The primary key for this table"
        tests:
          - unique
          - not_null

sources:
  - name: stg
    database: dev
    schema: stg
    tables:
      - name: customer
      - name: address
      - name: city
      - name: country
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A best practice while developing your models is to use Common Table Expressions (CTEs), as they enhance the readability of the code, though they're not strictly necessary.&lt;br&gt;
Now let's create a customer dimension which contains the details of the customer, pulled from the address, city, and country source tables.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;with customer_base as(
    SELECT *,
    CONCAT(CONCAT(customer.FIRST_NAME,' '),customer.LAST_NAME)  AS FULL_NAME,
    SUBSTRING(customer.EMAIL FROM POSITION('@' IN customer.EMAIL)+1 FOR CHAR_LENGTH(customer.EMAIL)-POSITION('@' IN EMAIL)) AS DOMAIN,
    customer.active::int as ACTIVE_INT,
    CASE WHEN customer.ACTIVE=0 then 'no' else 'yes' end as ACTIVE_DESC,
    '{{ run_started_at.strftime("%Y-%m-%d %H:%M:%S")}}' as DBT_TIME
    FROM
    {{ source('stg','customer')}} as customer

),
address as (
    SELECT * FROM
    {{ source('stg','address')}}

),
city as (
    SELECT * FROM
    {{ source('stg','city')}}

),
country as (
    SELECT * FROM
    {{ source('stg','country')}}

)

SELECT 
customer_base.CUSTOMER_ID,
customer_base.STORE_ID,
customer_base.FIRST_NAME,
customer_base.LAST_NAME,
customer_base.FULL_NAME,
customer_base.EMAIL,
customer_base.DOMAIN,
customer_base.ACTIVE_INT AS ACTIVE,
customer_base.ACTIVE_DESC,
customer_base.create_date,
customer_base.last_update,
customer_base.DBT_TIME,

address.ADDRESS_ID::INT,
address.address,
city.city_id,
city.city,
country.country_id,
country.country

FROM customer_base

LEFT JOIN ADDRESS AS address
 ON customer_base.address_id= address.address_id

LEFT JOIN CITY AS city
 ON address.city_id=city.city_id

LEFT JOIN COUNTRY AS country
 ON country.country_id=city.country_id
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how models are defined in DBT. In parallel, if you open your project.yml file, at the very bottom it contains details about how the models will be materialized.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;models:
  project:
    # Config indicated by + and applies to all files under models/example/
    example:
      +materialized: view

    dimension:
      +materialized: table
      +schema: dwh

    fact:
      +materialized: incremental
      +schema: dwh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: The properties defined inside the models take precedence over the ones defined in project.yml.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Similarly to the customer dimension, we defined the other dimensions and facts; however, the facts are materialized incrementally so that cost is saved when rerunning the models again and again, as facts have a large number of records.&lt;br&gt;
Now that you have defined all the models and your target destination in the DBT profile, it's time to run the models. To run them, hit the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dbt run -m dimensions&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command will run all the models in the dimensions directory. However, if you want to run a specific model, then try this command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;dbt run -s model_name&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once you run this command, if everything is defined correctly, the data will be inserted into your target schema, dwh in my case.&lt;br&gt;
That's how you can build your DBT pipeline. If you have good knowledge of SQL and a bit of Python, then you are good to develop complex pipelines on your own.&lt;/p&gt;

&lt;p&gt;Finally, our warehouse is ready, and users can now perform analysis as per their requirements. Now there's a request from a user to build a dashboard on top of that cleansed data. We have access to Power BI Desktop; in order to make a connection with Redshift, we have to provide the following details in Power BI:&lt;/p&gt;

&lt;p&gt;Server name: endpoint of your Redshift cluster&lt;br&gt;
Database: dbname&lt;br&gt;
Username: username&lt;br&gt;
Password: password&lt;/p&gt;

&lt;p&gt;Once you provide these details, you can either direct-query the source or import the tables. With import, data is cached inside Power BI; with direct query, Power BI directly hits the source to fetch data.&lt;/p&gt;

&lt;p&gt;Now you can play around and build an amazing dashboard for your users. For demo purposes I've built this one, but it can be improved much further by utilizing DAX functions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3u0ww9qcsqij6ayu2yo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj3u0ww9qcsqij6ayu2yo.png" alt="Power BI Desktop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can find the code related to the project &lt;a href="https://github.com/Adnan-Khanx/Sakila_dbt" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;That's it, Tada :D.&lt;/p&gt;

&lt;p&gt;Conclusions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a separate environment for the project, and choose your adapter beforehand.&lt;/li&gt;
&lt;li&gt;Define the tests and documentation.&lt;/li&gt;
&lt;li&gt;Run individual models if you're working in the cloud; running all the models can add cost.&lt;/li&gt;
&lt;li&gt;Modularize your logic, so you can use the same logic in multiple places by reference.&lt;/li&gt;
&lt;li&gt;The properties defined inside a model take precedence over those defined in the &lt;em&gt;project.yml&lt;/em&gt; file; they basically overwrite those properties.&lt;/li&gt;
&lt;li&gt;Go for incremental materialization if your data is quite large.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>redshift</category>
      <category>dbt</category>
      <category>powerbi</category>
    </item>
  </channel>
</rss>
