<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: muriuki muriungi erick</title>
    <description>The latest articles on DEV Community by muriuki muriungi erick (@ndurumo254).</description>
    <link>https://dev.to/ndurumo254</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F816024%2F469dc92a-a792-4cd5-b143-6f8ad8ec9995.jpeg</url>
      <title>DEV Community: muriuki muriungi erick</title>
      <link>https://dev.to/ndurumo254</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ndurumo254"/>
    <language>en</language>
    <item>
      <title>The easiest way to navigate through MongoDB, PySpark, and Jupyter Notebook</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Sat, 18 Nov 2023 16:20:46 +0000</pubDate>
      <link>https://dev.to/ndurumo254/the-easiest-way-to-navigate-through-mongodb-pyspark-and-jupyter-notebook-3f4g</link>
      <guid>https://dev.to/ndurumo254/the-easiest-way-to-navigate-through-mongodb-pyspark-and-jupyter-notebook-3f4g</guid>
<description>&lt;p&gt;I strongly believe that open source is the future. In the modern software development cycle there is huge interest in open-source projects, because this approach reduces development cost, makes development more flexible, and encourages innovation.&lt;/p&gt;

&lt;p&gt;MongoDB: an open-source document database that can store both structured and unstructured data, using a JSON-like format for its documents.&lt;br&gt;
Jupyter Notebook: one of the most widely used open-source tools in data science; it makes it easy to create and share documents containing code, equations, and visualizations. Jupyter Notebook has since evolved into JupyterLab, which adds functionality such as a command line, a terminal, and an editor.&lt;/p&gt;

&lt;p&gt;PySpark: the Python API for Apache Spark, an open-source cluster-computing framework. Spark's core idea is distributed computing.&lt;/p&gt;

&lt;p&gt;To start, we will install MongoDB, PySpark, and JupyterLab. These tools can easily be installed using Docker Compose: all the services we will use are defined in a single docker-compose file. For the containers to communicate efficiently, we define a custom network; in this case I named it my-network. This network also isolates the containers from external networks. The docker-compose file will look as follows. To create the containers, navigate to the directory housing the file and run docker compose up -d. That's all for our setup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--upCwGEcf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rhqyy6t6qj3smi07fzw1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--upCwGEcf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rhqyy6t6qj3smi07fzw1.png" alt="Image description" width="551" height="716"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loading data into MongoDB&lt;/strong&gt;&lt;br&gt;
MongoDB Compass makes it simple to import and export data to and from a MongoDB collection, and it supports both CSV and JSON file formats. For our illustration, we will use the electric vehicle dataset found at &lt;a href="https://catalog.data.gov/dataset/electric-vehicle-population-data"&gt;https://catalog.data.gov/dataset/electric-vehicle-population-data&lt;/a&gt;. This is what MongoDB Compass looks like after importing the data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---7OMaVrH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tv66t92k6kyfhjbx5e8h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---7OMaVrH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tv66t92k6kyfhjbx5e8h.png" alt="Image description" width="800" height="249"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To verify that the data was imported correctly, we can query it from a terminal using the MongoDB shell, following these steps.&lt;br&gt;
1. Select the database: this is done with the &lt;code&gt;use&lt;/code&gt; command. For example, I created a database named EV with a collection named data, so I run &lt;code&gt;use EV&lt;/code&gt; in the shell.&lt;br&gt;
2. Query the data: this is done with the find method; in my case, &lt;code&gt;db.data.find()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--We2jtdX3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x888ur2yo1kqvq0mqcus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--We2jtdX3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x888ur2yo1kqvq0mqcus.png" alt="Image description" width="800" height="487"&gt;&lt;/a&gt;&lt;br&gt;
As seen above, the data has been imported into MongoDB correctly. Now we can use Spark to load it into JupyterLab.&lt;/p&gt;

&lt;p&gt;1. Import the required libraries&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DOokK3n9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/77z1jjhgouxijzkunfku.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DOokK3n9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/77z1jjhgouxijzkunfku.png" alt="Image description" width="738" height="117"&gt;&lt;/a&gt;&lt;br&gt;
2. Start a Spark session&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YHlCCocn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n84957pqe2af04cb7lwv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YHlCCocn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n84957pqe2af04cb7lwv.png" alt="Image description" width="800" height="88"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3. Create a connection and read the data&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sg6q2Gjp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mc33374xkc3h09voxmm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sg6q2Gjp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mc33374xkc3h09voxmm.png" alt="Image description" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4. Check the data types&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZSXQnhft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pl7k4214xi5mq4vtfbhg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZSXQnhft--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pl7k4214xi5mq4vtfbhg.png" alt="Image description" width="800" height="409"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5. View the data&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9928Y8gx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7z1mqe9989yqxu2zotf3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9928Y8gx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7z1mqe9989yqxu2zotf3.png" alt="Image description" width="800" height="352"&gt;&lt;/a&gt;&lt;br&gt;
As the procedure above shows, creating a data pipeline from MongoDB to PySpark is straightforward and efficient. Happy coding!&lt;br&gt;
You can find the complete project at &lt;a href="https://github.com/ndurumo254/mongodb"&gt;https://github.com/ndurumo254/mongodb&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>mongodb</category>
      <category>etl</category>
    </item>
    <item>
      <title>Hashmap in python.</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Mon, 22 May 2023 12:59:50 +0000</pubDate>
      <link>https://dev.to/ndurumo254/hashmap-in-python-30ij</link>
      <guid>https://dev.to/ndurumo254/hashmap-in-python-30ij</guid>
<description>&lt;p&gt;&lt;strong&gt;Hashmap in Python&lt;/strong&gt;&lt;br&gt;
Hashmaps, also known as hash tables, are indexed data structures: a hash function computes an index from a key into an array of slots. Keys are unique and must be immutable. A hashmap stores key-value pairs, and the slot for each pair is computed from the key by the hash function.&lt;br&gt;
A hashmap can be compared to a closet with several drawers, where each drawer stores a specific kind of clothing and is labelled with the name of the clothes it holds. Hashmaps make data access easier and faster.&lt;br&gt;
In Python, hashmaps are implemented by the built-in dictionary type. To understand an application of a hashmap, I will use this example from LeetCode (&lt;a href="https://leetcode.com/problems/two-sum/description/"&gt;https://leetcode.com/problems/two-sum/description/&lt;/a&gt;)&lt;/p&gt;
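Since Python's dict is the language's hashmap, here is a tiny sketch of the drawer analogy (the variable names are mine, for illustration only):

```python
# A Python dict is a hashmap: each key is hashed to find its slot,
# so inserts and lookups are O(1) on average.
closet = {}                            # empty hashmap
closet["shirts"] = ["blue", "white"]   # insert a key-value pair
closet["socks"] = ["black"]

print(closet["shirts"])    # lookup by key ("open the labelled drawer")
print("socks" in closet)   # membership test is also O(1) on average
```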

&lt;blockquote&gt;
&lt;p&gt;Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target.&lt;br&gt;
You may assume that each input would have exactly one solution, and you may not use the same element twice.&lt;br&gt;
You can return the answer in any order.&lt;/p&gt;

&lt;p&gt;Example 1:&lt;br&gt;
Input: nums = [2,7,11,15], target = 9&lt;br&gt;
Output: [0,1]&lt;br&gt;
Explanation: Because nums[0] + nums[1] == 9, we return [0, 1].&lt;/p&gt;

&lt;p&gt;Example 2:&lt;br&gt;
Input: nums = [3,2,4], target = 6&lt;br&gt;
Output: [1,2]&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As we can see, we have a list of values stored in a variable named nums.&lt;br&gt;
We also have a target that is the sum of two values from the list.&lt;br&gt;
We are supposed to find the two values in the list that add up to the target and return their indices as an array. This is how we can achieve it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class Solution(object):
    def twoSum(self, nums, target):
    #initialize the hashmap to sore the values
        compliment_dict= {}
     #loop through the values in the list
        for values in range(len(nums)):
            compliment= target-nums[values]
            if compliment in compliment_dict:
                return compliment_dict[compliment],values
            compliment_dict[nums[values]] = values

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
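To check the solution against the examples from the problem statement, here is a self-contained run (the class is restated so the snippet runs on its own):

```python
# Two-sum via a hashmap of value -> index (same approach as above).
class Solution(object):
    def twoSum(self, nums, target):
        complement_dict = {}
        for i in range(len(nums)):
            complement = target - nums[i]
            if complement in complement_dict:
                return [complement_dict[complement], i]
            complement_dict[nums[i]] = i

print(Solution().twoSum([2, 7, 11, 15], 9))   # → [0, 1]
print(Solution().twoSum([3, 2, 4], 6))        # → [1, 2]
```

Each element is visited once, so the lookup table gives O(n) time instead of the O(n²) of checking every pair.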



&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Here the hashmap is used to store the elements of the array. We iterate through the array, and for each element we check for its complement (target - current element). If the complement exists in the hashmap, it means we have found two numbers that add up to the target.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>programming</category>
    </item>
    <item>
      <title>Simplest pyspark tutorial</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Wed, 19 Apr 2023 10:04:04 +0000</pubDate>
      <link>https://dev.to/ndurumo254/simplest-pyspark-tutorial-414</link>
      <guid>https://dev.to/ndurumo254/simplest-pyspark-tutorial-414</guid>
<description>&lt;p&gt;If data is the new oil, then Spark is the new engine. In recent years we have witnessed exponential growth in data, driven by the increased use of smart devices and sensors that collect data in real time. Businesses that master such data will outdo their counterparts in making intelligent decisions, and to make intelligent decisions from such data we need to understand and process it. Spark comes in handy when we want to process huge datasets because it distributes processing across parallel clusters. I am going to cover some basics of Spark. Spark supports several languages, and in this tutorial I will be using Python. Before proceeding, set up your environment and install PySpark on your machine. I am using this dataset &lt;br&gt;
(&lt;a href="https://drive.google.com/file/d/1b2oL92aRU5_xLkGBoxLwBfWY5MYv9xpG/view?usp=sharing"&gt;https://drive.google.com/file/d/1b2oL92aRU5_xLkGBoxLwBfWY5MYv9xpG/view?usp=sharing&lt;/a&gt;)&lt;br&gt;
To create a Spark resilient distributed dataset (RDD), we start by creating a SparkSession. A SparkSession is an object created with the SparkSession builder pattern; it is the first code in a Spark program, and it exposes the APIs of the different contexts. These APIs include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; SparkContext&lt;/li&gt;
&lt;li&gt; StreamingContext&lt;/li&gt;
&lt;li&gt; SQLContext&lt;/li&gt;
&lt;li&gt; HiveContext&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following lines of code create our session and load the data in this tutorial.&lt;br&gt;
&lt;code&gt;from pyspark.sql import SparkSession&lt;br&gt;
spark= SparkSession.builder.appName('Erick').getOrCreate()&lt;br&gt;
df_spark= spark.read.csv('salary.csv')&lt;br&gt;
df_spark.show()&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
SparkSession.builder returns a builder class with methods such as master(), appName(), and getOrCreate(). After creating the session, the next line loads our dataset into a DataFrame named df_spark. When we execute df_spark.show() we get the following output&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ztPcPG4s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9ww5j80zm1om0z1fvo8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ztPcPG4s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9ww5j80zm1om0z1fvo8m.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By default, Spark displays the first 20 rows. However, as seen in the output, the header of the dataset has been read as a data row. To read it correctly, we have to set the header option to true. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;df_spark=spark.read.option('header','true').csv('salary.csv')&lt;br&gt;
df_spark.show()&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The output should now be as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nlnWZl19--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mn1wcttwna4vbibqw1du.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nlnWZl19--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mn1wcttwna4vbibqw1du.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;br&gt;
The headers are now shown correctly. We can inspect the first rows by running&lt;br&gt;
&lt;code&gt;df_spark.head(5)&lt;/code&gt;&lt;br&gt;
Here I have asked for 5 rows, but you can specify any number. When you execute this, your output should look like&lt;br&gt;
&lt;code&gt;[Row(MMM-YY='1/1/2016', Emp_ID=1, Age=28, Gender='Male', City='C23', Education_Level='Master', Salary=57387),&lt;br&gt;
 Row(MMM-YY='2/1/2016', Emp_ID=1, Age=28, Gender='Male', City='C23', Education_Level='Master', Salary=57387),&lt;br&gt;
 Row(MMM-YY=None, Emp_ID=None, Age=None, Gender=None, City=None, Education_Level=None, Salary=None),&lt;br&gt;
 Row(MMM-YY='11/1/2017', Emp_ID=2, Age=31, Gender='Male', City='C7', Education_Level='Master', Salary=67016),&lt;br&gt;
 Row(MMM-YY='12/1/2017', Emp_ID=2, Age=31, Gender='Male', City='C7', Education_Level='Master', Salary=67016)]&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
PySpark lets us see the data type of each column via the printSchema() method. For the types to be detected, we should set inferSchema to True when loading the dataset. This is done as &lt;/p&gt;

&lt;p&gt;&lt;code&gt;df_spark=spark.read.option('header','true').csv('salary.csv', inferSchema=True)&lt;/code&gt;&lt;br&gt;
Then we can print the schema with &lt;br&gt;
&lt;code&gt;df_spark.printSchema()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When we run this, we should be able to see  the datatypes in the following format.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nOkaCckN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j6m4lrwq9iasvb8054qc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nOkaCckN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j6m4lrwq9iasvb8054qc.png" alt="Image description" width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you call printSchema() before setting inferSchema = True, your output will be as follows. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4k_syT9O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lica430a6eve6aan5ypi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4k_syT9O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lica430a6eve6aan5ypi.png" alt="Image description" width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is because Spark, by default, interprets every column as a string.&lt;br&gt;
This tutorial will also cover the select and drop functions, renaming columns, filling missing values, filtering, and the groupBy function in Spark.&lt;br&gt;
&lt;strong&gt;Select function.&lt;/strong&gt;&lt;br&gt;
To select columns from the dataset or DataFrame, select() is used. To select Age, Gender, City, Education_Level, and Salary, we can use the following syntax.&lt;br&gt;
&lt;code&gt;df_spark.select(['Age','gender','City','Education_Level','Salary']).show()&lt;/code&gt;&lt;br&gt;
When we select these columns, our output will be as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rR1on-ZQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a5t2fw2y9pqqdqraa6j0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rR1on-ZQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a5t2fw2y9pqqdqraa6j0.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometimes we might need to add a column to the dataset. For example, we can add a column for age after 5 years. &lt;br&gt;
&lt;code&gt;df_spark.withColumn('age after 5 years',df_spark['age']+5).show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output of this line is as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Fb4XoT1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2tt7eng6lzwa1wha7iu9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Fb4XoT1g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2tt7eng6lzwa1wha7iu9.png" alt="Image description" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At times we have columns that are irrelevant to our analysis. For example, we might not need Emp_ID. We can drop it as &lt;br&gt;
&lt;code&gt;df_spark.drop('Emp_ID').show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lYgGU6aS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i938w1iyqsl9wupljtm4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lYgGU6aS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i938w1iyqsl9wupljtm4.png" alt="Image description" width="800" height="305"&gt;&lt;/a&gt;&lt;br&gt;
As seen above, the Emp_ID column has been dropped. &lt;br&gt;
We may also need to rename specific columns, which can be done as follows.&lt;br&gt;
&lt;code&gt;df_spark.withColumnRenamed('MMM-YY','date_employed').show()&lt;/code&gt;&lt;br&gt;
This will rename MMM-YY to date_employed. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wkIgauIa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bmrgyg3gb2qizio2o3cw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wkIgauIa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bmrgyg3gb2qizio2o3cw.png" alt="Image description" width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also need a means of handling missing data in the dataset. We can either drop the rows with missing values or fill them in. To drop all rows with missing values: &lt;br&gt;
&lt;code&gt;df_spark.na.drop().show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When we look at the output of this, we notice that all rows with some missing values have been deleted. The output will be as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QP_ASNwa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0mbpugkpcppxhznnm290.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QP_ASNwa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0mbpugkpcppxhznnm290.png" alt="Image description" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also set a threshold. For example, na.drop(thresh=2) keeps only the rows that have at least two non-null values and drops the rest. This can be done as &lt;br&gt;
&lt;code&gt;df_spark.na.drop(how='any',thresh=2).show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output is shown below. Rows with only one missing value are kept, because they still have at least two non-null values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Mxx39FlG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9u2hs8ahayumhr4lta0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Mxx39FlG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9u2hs8ahayumhr4lta0n.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;br&gt;
We can also drop rows based on specific columns. This is done as follows&lt;br&gt;
&lt;code&gt;df_spark.na.drop(how="any",subset=['city']).show()&lt;/code&gt;&lt;br&gt;
Here we drop every row whose city value is missing. The output is as follows&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SNTESoKQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yuttvw6tpeiz7346udpa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SNTESoKQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yuttvw6tpeiz7346udpa.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also fill in the missing values rather than deleting them. This can be done as follows&lt;br&gt;
&lt;code&gt;df_spark.na.fill('missing value',['age','salary']).show()&lt;/code&gt;&lt;br&gt;
Here the missing values in the age and salary columns are filled with the string (missing value); note that na.fill only applies to columns whose type matches the fill value. The output is as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rin_LM7Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ifaa2z2wgy67r168fz58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rin_LM7Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ifaa2z2wgy67r168fz58.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;PySpark has a class pyspark.ml.feature.Imputer that completes missing values using the mean, median, or mode of the column containing them. The input columns must be numeric, because the class does not currently support categorical features. This class is used as shown below&lt;br&gt;
&lt;code&gt;from pyspark.ml.feature import Imputer&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;imputer = Imputer(&lt;br&gt;
    inputCols = ['Age',  'Salary'],&lt;br&gt;
    outputCols = ["{}_imputed".format(a) for a in ['Age', 'Salary']]&lt;br&gt;
).setStrategy("median")&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;code&gt;imputer.fit(df_spark).transform(df_spark).show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output of this will add Age_imputed and Salary_imputed columns, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TAe9TqT2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eagcpz6ojhz042rta5uu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TAe9TqT2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eagcpz6ojhz042rta5uu.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;br&gt;
We may also need to filter our results. For example, we can filter ages equal to or below 30 years as &lt;br&gt;
&lt;code&gt;df_spark.filter("Age &amp;lt;= 30").show()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output is as follows; as seen, the displayed ages are all 30 or below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IVTSm9nP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/02ynl3oc9mxv1pt1rttu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IVTSm9nP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/02ynl3oc9mxv1pt1rttu.png" alt="Image description" width="800" height="356"&gt;&lt;/a&gt;&lt;br&gt;
We can also select which columns to show in the filtered result, as &lt;br&gt;
&lt;code&gt;df_spark.filter("Age &amp;lt;= 30").select(['Gender','Education_Level','salary']).show()&lt;/code&gt;&lt;br&gt;
The output becomes &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a8l9OoSF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zb5z5y0ptsbhtzn899t5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a8l9OoSF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zb5z5y0ptsbhtzn899t5.png" alt="Image description" width="800" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We might also need to combine more than one condition, for example filtering on both salary and age as &lt;/p&gt;

&lt;p&gt;&lt;code&gt;df_spark.filter((df_spark['Age']&amp;lt;=30)&amp;amp; &lt;br&gt;
                (df_spark['Salary']&amp;gt;= 170000)).select(['Gender','Education_Level','salary']).show()&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The output becomes &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GZpg0cjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96jhic7qqfrcwogl2wtz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GZpg0cjv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/96jhic7qqfrcwogl2wtz.png" alt="Image description" width="800" height="224"&gt;&lt;/a&gt;&lt;br&gt;
 We can also negate a condition with the ~ (NOT) operator, as &lt;br&gt;
&lt;code&gt;df_spark.filter(~((df_spark['Age']&amp;lt;=30)&amp;amp; &lt;br&gt;
                (df_spark['Salary']&amp;gt;=170000)))&lt;br&gt;
                 .select(['Gender','Education_Level','salary']).show()&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The output becomes &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9dKfjmGU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lwwtgw17uu88o277jsgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9dKfjmGU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lwwtgw17uu88o277jsgz.png" alt="Image description" width="800" height="468"&gt;&lt;/a&gt;&lt;br&gt;
We may also need to group our data by various columns. For example, we can group by Education_Level and Gender and take the maximum salary, as &lt;br&gt;
&lt;code&gt;df_spark.groupBy('Education_Level','Gender').max('salary').show()&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
The output is as &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wd07O7lX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0t3ovdpcp8fk3h6wh3hx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wd07O7lX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0t3ovdpcp8fk3h6wh3hx.png" alt="Image description" width="800" height="224"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In summary, PySpark is the Python API for Apache Spark that lets data developers carry out processing tasks on large datasets using a distributed computing framework. The select() function selects specific columns from a DataFrame; it takes one or more column names as arguments and returns a new DataFrame containing only those columns. The drop() function deletes one or more columns from a DataFrame; it takes the columns to be deleted as its arguments and returns a new DataFrame without them. Another function we used in this tutorial is withColumnRenamed(), which renames a specific column; it takes two arguments, the old column name followed by the new name, and returns a DataFrame with the column renamed. Finally, we looked at the groupBy() function, which groups the data in a DataFrame by one or more columns and returns a grouped DataFrame on which operations such as sum, count, or mean can be performed.&lt;/p&gt;
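&lt;p&gt;For readers without a Spark session at hand, the effect of these four functions can be sketched in plain Python on a hypothetical miniature dataset (a list of dicts standing in for the DataFrame; the column values are made up for illustration):&lt;/p&gt;

```python
# A plain-Python sketch of what the four PySpark calls do, using a
# hypothetical miniature dataset instead of a Spark DataFrame.
rows = [
    {"Gender": "F", "Education_Level": "MSc", "Age": 28, "Salary": 90000},
    {"Gender": "M", "Education_Level": "BSc", "Age": 35, "Salary": 120000},
    {"Gender": "F", "Education_Level": "BSc", "Age": 41, "Salary": 150000},
]

# select(): keep only the named columns
selected = [{k: r[k] for k in ("Gender", "Salary")} for r in rows]

# drop(): remove a column
dropped = [{k: v for k, v in r.items() if k != "Age"} for r in rows]

# withColumnRenamed(): rename a column (here Salary becomes Pay)
renamed = [{("Pay" if k == "Salary" else k): v for k, v in r.items()} for r in rows]

# groupBy().max(): maximum Salary per (Education_Level, Gender) group
groups = {}
for r in rows:
    key = (r["Education_Level"], r["Gender"])
    groups[key] = max(groups.get(key, 0), r["Salary"])

print(groups[("BSc", "M")])  # 120000
```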

&lt;p&gt;Good news: you can find the full notebook here (&lt;a href="https://drive.google.com/file/d/1Iy69g13tzCCksbl8DLuQevnq2kWJns3d/view?usp=sharing"&gt;https://drive.google.com/file/d/1Iy69g13tzCCksbl8DLuQevnq2kWJns3d/view?usp=sharing&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;You can follow me on Twitter (&lt;a href="https://twitter.com/ErickNdurumo"&gt;https://twitter.com/ErickNdurumo&lt;/a&gt;) or LinkedIn (&lt;a href="http://www.linkedin.com/in/erick-muriungi-1500a6122"&gt;http://www.linkedin.com/in/erick-muriungi-1500a6122&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Happy coding!!!&lt;/p&gt;

</description>
      <category>spark</category>
      <category>bigdata</category>
      <category>machinelearning</category>
      <category>sql</category>
    </item>
    <item>
      <title>Code optimization</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Mon, 03 Apr 2023 12:18:50 +0000</pubDate>
      <link>https://dev.to/ndurumo254/code-optimization-51ak</link>
      <guid>https://dev.to/ndurumo254/code-optimization-51ak</guid>
      <description>&lt;h2&gt;
  
  
  This is part of everyday thoughts in Python, data engineering, and machine learning
&lt;/h2&gt;

&lt;p&gt;It's important to consider the complexity of a function call: if we have a complex call in our program, the program becomes slower. Complexity often creeps in through function calls, so it is good practice to look closely at any recursive or nested functions, since they are a common cause of slow code. To understand this, let's look at a sample Fibonacci function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def fib(n):
    if n &amp;lt;= 1:
        return n
    else:
        return fib(n - 1) + fib(n - 2)

result = fib(10)
print(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The time complexity of the above code is O(2^n).&lt;/p&gt;

&lt;p&gt;Here the 10th number in the Fibonacci sequence is calculated by recursively calling the Fibonacci function with smaller arguments until it reaches the base cases of n=0 and n=1. The algorithm's running time grows exponentially, which quickly becomes a problem for large values of n. In practice, we would prefer an algorithm that minimizes repeated calculations. A friendlier algorithm for this scenario is given below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def fib_optimized(n):
    if n &amp;lt; 2:
        return n
    else:
        fib_prev, fib_curr = 0, 1
        for i in range(2, n+1):
            fib_next = fib_prev + fib_curr
            fib_prev = fib_curr
            fib_curr = fib_next
        return fib_curr
result = fib_optimized(10)
print(result)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The time complexity of the above function is O(n), so it is much more efficient.&lt;br&gt;
From this, we can take home some points comparing an algorithm's time complexity with a function call's time complexity.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The time complexity of an algorithm depends on the size of the input data; it is the amount of time an algorithm takes to solve a given problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The time complexity of a function call depends on what the function does internally.&lt;br&gt;
It is therefore vital to select the right algorithm for the problem at hand and to optimize it as much as possible to avoid unnecessary complexity.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
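&lt;p&gt;As a further illustration (an alternative not shown in the original post), the same O(n) behaviour can be obtained while keeping the recursive style by memoizing it with functools.lru_cache:&lt;/p&gt;

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    # Base cases: fib(0) = 0, fib(1) = 1
    if n in (0, 1):
        return n
    # Each value is computed once and cached, so the whole call runs in O(n)
    return fib_memo(n - 1) + fib_memo(n - 2)

print(fib_memo(10))  # 55
```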

&lt;p&gt;Happy learning!!&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Python functions and lambda functions in data engineering.</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Mon, 20 Feb 2023 12:26:02 +0000</pubDate>
      <link>https://dev.to/ndurumo254/python-functions-and-lambda-functions-in-data-engineering-2nfe</link>
      <guid>https://dev.to/ndurumo254/python-functions-and-lambda-functions-in-data-engineering-2nfe</guid>
      <description>&lt;p&gt;Data engineering involves processes that aim to transform raw data into valuable insights. These processes include data ingestion, cleaning, transformation, and integration. Python has emerged as one of the most popular programming languages in data engineering. Python provides a wide range of libraries and functions that make data engineering a seamless process. In this blog, we will discuss how to use Python functions and lambda functions in data engineering to perform data manipulation, transformation, and cleaning tasks&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Functions in Data Engineering
&lt;/h2&gt;

&lt;p&gt;Functions are an essential aspect of any programming language, and Python is no exception. Functions in Python are used to group a set of statements that can be reused in a program. Functions play a crucial role in data engineering as they are used to perform a variety of data manipulation tasks. Some of the commonly used functions in data engineering include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; map() function - The map() function is used to apply a function to each item in an iterable object. This function is often used to convert data types or to perform calculations on a dataset.&lt;/li&gt;
&lt;li&gt; filter() function - The filter() function is used to filter out elements from a dataset that do not meet a specific condition. This function is often used to remove outliers or to remove irrelevant data.&lt;/li&gt;
&lt;li&gt; reduce() function - The reduce() function is used to perform a computation on a dataset by applying a function repeatedly to the dataset's elements. This function is often used to calculate the sum, average, or product of a dataset.&lt;/li&gt;
&lt;/ol&gt;
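&lt;p&gt;The three functions above can be combined in a small sketch (the readings and the threshold are made-up values for illustration):&lt;/p&gt;

```python
from functools import reduce

readings = ["3.5", "4.0", "10.2", "2.1"]

# map(): convert each string record to a float
values = list(map(float, readings))

# filter(): keep only readings above a threshold
large = list(filter(lambda v: v > 3.0, values))

# reduce(): fold the remaining values into a single sum
total = reduce(lambda a, b: a + b, large)

print(large, total)
```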

&lt;p&gt;Here are some of the ways functions are used in data engineering:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Data Cleaning: Functions are used to clean and preprocess data. For instance, functions can be used to handle missing values, outliers, and inconsistencies in data.&lt;/li&gt;
&lt;li&gt; Data Transformation: Functions can be used to convert data types, manipulate data, and create new features. For instance, functions can be used to compute summary statistics, aggregate data, or calculate the difference between two dates.&lt;/li&gt;
&lt;li&gt; Data Integration: Functions can be used to combine multiple datasets, join tables, or merge columns.&lt;/li&gt;
&lt;li&gt; Data Analysis: Functions can be used to perform data analysis, such as computing statistical measures, generating visualizations, and identifying patterns.
Here is an example of a Python function that computes the mean of a dataset:&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KuyilRTm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ut0byc0wfv3eg0o7lmmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KuyilRTm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ut0byc0wfv3eg0o7lmmt.png" alt="Image description" width="800" height="48"&gt;&lt;/a&gt;&lt;br&gt;
This function takes a list of numbers as input and returns the mean value of the list.&lt;/p&gt;
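&lt;p&gt;A minimal version of such a mean function might look like this (the exact code in the screenshot above may differ):&lt;/p&gt;

```python
def compute_mean(numbers):
    # Return the arithmetic mean of a non-empty list of numbers
    if not numbers:
        raise ValueError("numbers must not be empty")
    return sum(numbers) / len(numbers)

print(compute_mean([10, 20, 30]))  # 20.0
```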

&lt;h2&gt;
  
  
  Lambda Functions in Data Engineering:
&lt;/h2&gt;

&lt;p&gt;Lambda functions, also known as anonymous functions, are functions that are defined without a name. Lambda functions are a compact way to define small, one-line functions that can be used as arguments to other functions. Lambda functions are commonly used in data engineering for tasks that require a short and concise function. Some of the commonly used lambda functions in data engineering include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Sorting - Lambda functions are used to sort a dataset based on a specific key.&lt;/li&gt;
&lt;li&gt; Filtering - Lambda functions are used to filter out data that meets a specific condition.&lt;/li&gt;
&lt;li&gt; Mapping - Lambda functions are used to map a function to each element in a dataset.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here are some of the ways lambda functions are used in data engineering:&lt;br&gt;
Sorting: Lambda functions can be used to sort a dataset based on a specific key. For instance, you can sort a list of dictionaries by one of their keys:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9U58brTQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wntk3mkxaeadgjl5dci2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9U58brTQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wntk3mkxaeadgjl5dci2.png" alt="Image description" width="800" height="25"&gt;&lt;/a&gt;&lt;br&gt;
Filtering: Lambda functions can be used to filter out data that meets a specific condition. For instance, to filter out all values greater than a specific threshold, you can use a lambda function as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZLcd3f-8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gxz90cu7i9uyrha2vi5s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZLcd3f-8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gxz90cu7i9uyrha2vi5s.png" alt="Image description" width="800" height="25"&gt;&lt;/a&gt;&lt;br&gt;
Mapping: Lambda functions can be used to apply a function to each element in a dataset. For instance, to convert a list of strings to uppercase, you can use a lambda function as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zedSAnHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vg0fx0phe9dctao88xq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zedSAnHG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vg0fx0phe9dctao88xq6.png" alt="Image description" width="800" height="25"&gt;&lt;/a&gt;&lt;/p&gt;
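&lt;p&gt;The three lambda patterns above can be sketched as follows (the sample data is hypothetical):&lt;/p&gt;

```python
# Sorting: order a list of dictionaries by a specific key
people = [{"name": "Ann", "age": 31}, {"name": "Bob", "age": 25}]
by_age = sorted(people, key=lambda p: p["age"])

# Filtering: keep only values greater than a threshold
values = [5, 12, 8, 20]
above_10 = list(filter(lambda v: v > 10, values))

# Mapping: convert a list of strings to uppercase
names = list(map(lambda s: s.upper(), ["spark", "hive"]))

print(by_age[0]["name"], above_10, names)
```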

&lt;p&gt;happy data engineering practice!!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>python</category>
      <category>lambda</category>
      <category>functions</category>
    </item>
    <item>
      <title>Using python dictionary in data engineering.</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Sun, 19 Feb 2023 08:14:20 +0000</pubDate>
      <link>https://dev.to/ndurumo254/using-python-dictionary-in-data-engineering-3oec</link>
      <guid>https://dev.to/ndurumo254/using-python-dictionary-in-data-engineering-3oec</guid>
      <description>&lt;p&gt;Python dictionaries are a powerful data structure that can be useful in many data engineering applications. In this blog, we'll explore some of the ways that you can use Python dictionaries in data engineering, including how to create and manipulate dictionaries, and how to use them in various data processing tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Python Dictionary?
&lt;/h2&gt;

&lt;p&gt;A Python dictionary is a collection of key-value pairs that allows you to store and retrieve data using a key. Dictionaries are one of the core data structures in Python, and are commonly used in a variety of applications, including data engineering. Here's an example of a simple dictionary in Python:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuibls2nyoahhmfxkd8c3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuibls2nyoahhmfxkd8c3.png" alt=" " width="800" height="24"&gt;&lt;/a&gt;&lt;br&gt;
In this dictionary, the keys are 'key1', 'key2', and 'key3', and the values are 'value1', 'value2', and 'value3', respectively.&lt;/p&gt;
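&lt;p&gt;Written out in code, that dictionary looks like this:&lt;/p&gt;

```python
# The simple dictionary described above
my_dict = {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}

print(my_dict['key1'])  # value1
```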

&lt;h2&gt;
  
  
  Creating and Accessing Dictionaries
&lt;/h2&gt;

&lt;p&gt;To create a dictionary in Python, you can use the curly braces {} and separate the key-value pairs with colons. Here's an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7f3b84rukrgsnazon97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu7f3b84rukrgsnazon97.png" alt=" " width="800" height="24"&gt;&lt;/a&gt;&lt;br&gt;
You can also create a dictionary using the dict() function, which takes a sequence of key-value pairs as an argument. For example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wberh2m2epyanfuu5k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wberh2m2epyanfuu5k4.png" alt=" " width="800" height="24"&gt;&lt;/a&gt;&lt;br&gt;
Once you have created a dictionary, you can access its values by using the keys. For example, to access the value associated with the 'name' key in the dictionary above, you can use the following code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2lb0o9d4fem2nig0hty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2lb0o9d4fem2nig0hty.png" alt=" " width="800" height="24"&gt;&lt;/a&gt;&lt;br&gt;
This will output 'Alice'.&lt;/p&gt;
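&lt;p&gt;A sketch of the two creation styles and the key access described above (the 'name'/'Alice' pair comes from the text; the other field is illustrative):&lt;/p&gt;

```python
# Creating a dictionary with curly braces and colon-separated pairs
person = {'name': 'Alice', 'age': 30}

# Creating the same dictionary with the dict() function
person_alt = dict(name='Alice', age=30)

# Accessing a value by its key
print(person['name'])  # Alice
```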

&lt;h2&gt;
  
  
  Manipulating Dictionaries
&lt;/h2&gt;

&lt;p&gt;Dictionaries are mutable, which means that you can add, delete, and modify key-value pairs in the dictionary. Here are some of the ways that you can manipulate dictionaries in Python:&lt;/p&gt;

&lt;h2&gt;
  
  
  Adding Key-Value Pairs
&lt;/h2&gt;

&lt;p&gt;To add a new key-value pair to a dictionary, you can simply assign a value to a new key&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F560gxm37l2edttmddjqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F560gxm37l2edttmddjqj.png" alt=" " width="800" height="48"&gt;&lt;/a&gt;&lt;br&gt;
This will add a new key 'email' with the value '&lt;a href="mailto:alice@example.com"&gt;alice@example.com&lt;/a&gt;' to the dictionary&lt;/p&gt;

&lt;h2&gt;
  
  
  Modifying Values
&lt;/h2&gt;

&lt;p&gt;To modify the value associated with a key in a dictionary, you can simply reassign the value:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc849z89npua9633qvhil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc849z89npua9633qvhil.png" alt=" " width="800" height="48"&gt;&lt;/a&gt;&lt;br&gt;
This will change the value associated with the 'age' key from 30 to 31.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deleting Key-Value Pairs
&lt;/h2&gt;

&lt;p&gt;To delete a key-value pair from a dictionary, you can use the del statement:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyejrkkdhgpscvzewualy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyejrkkdhgpscvzewualy.png" alt=" " width="800" height="48"&gt;&lt;/a&gt;&lt;br&gt;
This will remove the 'age' key and its associated value from the dictionary.&lt;/p&gt;
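&lt;p&gt;Putting the three manipulation steps together (the keys and values are taken from the examples above):&lt;/p&gt;

```python
person = {'name': 'Alice', 'age': 30}

# Adding a new key-value pair by assigning to a new key
person['email'] = 'alice@example.com'

# Modifying an existing value (age 30 becomes 31)
person['age'] = 31

# Deleting a key-value pair with the del statement
del person['age']

print(person)  # {'name': 'Alice', 'email': 'alice@example.com'}
```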

&lt;h2&gt;
  
  
  Using Dictionaries in Data Engineering
&lt;/h2&gt;

&lt;p&gt;Dictionaries can be used in a variety of data engineering tasks, including data cleaning, data transformation, and data aggregation. Here are some examples of how dictionaries can be used in data engineering:&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Cleaning
&lt;/h2&gt;

&lt;p&gt;Suppose you have a dataset that contains customer information, and you want to clean up the data by standardizing the state names. You could create a dictionary that maps the abbreviated state names to the full state names, and then use that dictionary to replace the abbreviated state names in the dataset.&lt;/p&gt;
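&lt;p&gt;A hedged sketch of that cleaning step (the state abbreviations and customer records are made up for illustration):&lt;/p&gt;

```python
# Hypothetical mapping of abbreviated state names to full names
state_map = {'CA': 'California', 'NY': 'New York', 'TX': 'Texas'}

customers = [
    {'name': 'Ann', 'state': 'CA'},
    {'name': 'Bob', 'state': 'NY'},
]

# Standardize: replace each abbreviation, keeping unknown values as-is
for c in customers:
    c['state'] = state_map.get(c['state'], c['state'])

print(customers[0]['state'])  # California
```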

&lt;p&gt;&lt;strong&gt;happy coding&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>vibecoding</category>
    </item>
    <item>
      <title>Use of python loops in data engineering</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Fri, 17 Feb 2023 20:51:19 +0000</pubDate>
      <link>https://dev.to/ndurumo254/use-of-python-loops-in-data-engineering-3peh</link>
      <guid>https://dev.to/ndurumo254/use-of-python-loops-in-data-engineering-3peh</guid>
      <description>&lt;p&gt;Python programming language is rich in a set of tools and libraries used in data engineering. One of the widely used python concepts in data engineering is the python loops which are used to iterate over a collection of data and perform a set of operations on each element. This blog post will explore how to use Python loops for data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  For Loops
&lt;/h2&gt;

&lt;p&gt;The most common loop type in Python is the for loop, which is used to iterate over a collection of data. The syntax of a for loop is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rLVXelko--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/agi4z51pk2r9bvtammwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rLVXelko--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/agi4z51pk2r9bvtammwq.png" alt="Image description" width="800" height="48"&gt;&lt;/a&gt;&lt;br&gt;
Here, collection is the collection of data that we want to iterate over, and element is a variable that will take on the value of each element in the collection in turn. Within the loop, we can perform any operations on element that we like.&lt;/p&gt;

&lt;p&gt;One common use of for loops in data engineering is to read data from a file and process it line by line. For example, suppose we have a file data.csv that contains some data that we want to process. We can use a for loop to read each line of the file and perform some operations on it, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--keQkn3E---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fwne2gp8be7qwspfz50n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--keQkn3E---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fwne2gp8be7qwspfz50n.png" alt="Image description" width="800" height="74"&gt;&lt;/a&gt;&lt;br&gt;
Here, f is a file object that we can use to read from the file. The with statement is used to ensure that the file is properly closed after we're done with it. Within the loop, line is a string that contains the contents of each line of the file in turn. We can split the line into fields using the split() method, and perform any other operations on the fields that we like.&lt;/p&gt;
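&lt;p&gt;A self-contained sketch of that pattern, using an in-memory string in place of data.csv so it runs anywhere (the column names are hypothetical):&lt;/p&gt;

```python
import io

# Hypothetical in-memory stand-in for data.csv, so the sketch is runnable
f = io.StringIO("name,score\nann,90\nbob,85\n")

# Read the header row, then process the remaining lines one by one
header = next(f).strip().split(",")
total = 0
for line in f:
    fields = line.strip().split(",")
    total += int(fields[1])

print(header, total)  # ['name', 'score'] 175
```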

&lt;h2&gt;
  
  
  While Loops
&lt;/h2&gt;

&lt;p&gt;Another type of loop in Python is the while loop, which is used to repeat a set of operations until a certain condition is met. The syntax of a while loop is as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NAU8263k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dctyt999yvv6sg1iltyo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NAU8263k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dctyt999yvv6sg1iltyo.png" alt="Image description" width="800" height="48"&gt;&lt;/a&gt;&lt;br&gt;
Here, condition is an expression that evaluates to True or False. The loop will continue to execute as long as condition is True. Within the loop, we can perform any operations that we like.&lt;/p&gt;

&lt;p&gt;One common use of while loops in data engineering is to process data until some condition is met. For example, suppose we have a list of numbers that we want to process, and we want to keep processing them until the sum of the numbers is greater than 100. We can use a while loop to do this, like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iRB4gGQw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6bqa95ka0qkqhv82jx3m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iRB4gGQw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6bqa95ka0qkqhv82jx3m.png" alt="Image description" width="800" height="171"&gt;&lt;/a&gt;&lt;br&gt;
Here, numbers is a list of numbers that we want to process. We initialize total to 0, and use i as an index to iterate over the list. The loop will continue to execute as long as total is less than or equal to 100 and i is less than the length of numbers. Within the loop, we add each element of numbers to total, and perform any other operations on total that we like.&lt;/p&gt;
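&lt;p&gt;A runnable sketch of that idea (the numbers are made up; the loop keeps adding until the running total exceeds 100 or the list is exhausted):&lt;/p&gt;

```python
numbers = [40, 35, 20, 15, 30]

total = 0
i = 0
# Equivalent to: repeat while total is at most 100 and i is within bounds
while True:
    if total > 100 or i == len(numbers):
        break
    total += numbers[i]
    i += 1

print(total, i)  # 110 4
```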

&lt;h2&gt;
  
  
  Nested Loops
&lt;/h2&gt;

&lt;p&gt;In some cases, we may need to use nested loops in data engineering. Nested loops are loops that are defined inside other loops. For example, suppose we have a list of lists that we want to process, and we want to perform some operations on each element&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The simplest way to differentiate between the data engineers, data scientists and the data analyst.</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Mon, 12 Sep 2022 14:10:07 +0000</pubDate>
      <link>https://dev.to/ndurumo254/the-simplest-way-to-differentiate-between-the-data-engineers-data-scientists-and-the-data-analyst-21ei</link>
      <guid>https://dev.to/ndurumo254/the-simplest-way-to-differentiate-between-the-data-engineers-data-scientists-and-the-data-analyst-21ei</guid>
      <description>&lt;p&gt;Taking the three roles as a complete architectural model, we have architecture engineers designing and building the house, lorries and trucks, bringing the building materials and drivers to the construction site&lt;/p&gt;

&lt;h2&gt;
  
  
  Data engineers
&lt;/h2&gt;

&lt;p&gt;Data engineers, in this case, are like the architectural engineers. Just like the architectural engineers, they design and build the pipelines that ingest data, and they are responsible for maintaining those pipelines. They are the brains behind how the data flows from data lakes or data warehouses through the data pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data scientists
&lt;/h2&gt;

&lt;p&gt;Data scientists are like the trucks and lorries bringing construction materials to the construction site. Just as the trucks carry the materials, data scientists carry the preprocessed data to the consumer. They use technologies such as machine learning to make future predictions, exploiting the data from data pipelines to draw complex insights from it. &lt;/p&gt;

&lt;h2&gt;
  
  
  Data analysts.
&lt;/h2&gt;

&lt;p&gt;Like the driver who takes the building materials to the site, data analysts use their skills to drive the data to the consumer. Data analysts examine and combine several datasets to help the business understand its trends. They are the brains behind informed business decisions in an organization, working with current data to understand the organization's present business situation.&lt;/p&gt;

</description>
      <category>datascience</category>
    </item>
    <item>
      <title>Management of data in data engineering.</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Sat, 10 Sep 2022 13:07:19 +0000</pubDate>
      <link>https://dev.to/ndurumo254/management-of-data-in-data-engineering-34ij</link>
      <guid>https://dev.to/ndurumo254/management-of-data-in-data-engineering-34ij</guid>
      <description>&lt;p&gt;In their line of duty, data engineers come across pipelines built with different technologies, and they need to understand them. Data engineers must have basic knowledge of data storage, analytics, and pipeline to carry out their duty effectively.&lt;/p&gt;

&lt;h2&gt;
  
  
  Databases and data warehouses
&lt;/h2&gt;

&lt;p&gt;A database is made up of one or more tables of related data. Dynamic growth in the business sector has necessitated tools for bringing different databases together for data analysis. To produce analytics reports across various databases, data from those databases is ingested into a central point. A data warehouse is a tool that allows the ingestion of structured data from different databases. Before entering a data warehouse, the data undergoes processes such as validation, preprocessing, and transformation. Warehouses, however, struggle to hold all of today's business data, because businesses also need to handle unstructured and semistructured data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Handling of the big and unstructured dataset
&lt;/h2&gt;

&lt;p&gt;Unstructured and semistructured datasets come from digital platforms such as IoT sensors, social media, web and mobile applications, and video and audio platforms. These platforms generate data at high velocity and in huge volumes compared to structured data sources. The challenges of handling such datasets created the need for big data technology platforms. One such technology is the Hadoop open-source framework, designed in the early 2010s to process large datasets on a cluster of computers. Hadoop stores data in a distributed file system called the Hadoop Distributed File System (HDFS). Providers of Hadoop distributions include IBM, MapR, and Cloudera, among others. These packages include distributed data processing frameworks like Hive, Spark, and MapReduce.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benefits of public cloud infrastructure
&lt;/h2&gt;

&lt;p&gt;Public cloud infrastructure has on-demand capacity&lt;br&gt;
Public cloud infrastructures have elastic and virtually limitless scaling&lt;br&gt;
Public cloud infrastructures have global footprints&lt;br&gt;
Public cloud infrastructure has a usage-based cost model&lt;br&gt;
Public cloud infrastructure frees users from hardware management&lt;br&gt;
In 2013, AWS made Amazon Redshift available, providing a data warehouse as a cloud-native service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data marts
&lt;/h2&gt;

&lt;p&gt;A repository containing well-structured, curated, and trusted datasets is termed an enterprise data warehouse (EDW). To measure business performance, business users analyze the data in the warehouse. Data in the warehouse covers business subjects such as products, sales, and customers. A data warehouse has four main components. These are:&lt;/p&gt;

&lt;p&gt;Enterprise data warehouse: hosts the data assets, such as current and historical datasets.&lt;br&gt;
Source systems: data sources such as ERP and CRM&lt;br&gt;
ETL pipelines: load data into the warehouse&lt;br&gt;
Data consumers: applications used to consume data from the warehouse&lt;/p&gt;

&lt;h2&gt;
  
  
  Parallel processing
&lt;/h2&gt;

&lt;p&gt;An Amazon Redshift cluster contains several compute resources.&lt;br&gt;
Each Redshift cluster has two types of nodes:&lt;br&gt;
• One leader node that interfaces client applications with the compute nodes.&lt;br&gt;
• Multiple compute nodes that store the warehouse data and run queries in parallel. Each compute node has its own memory and processor, separate from the others.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dimensional models in data warehouses.
&lt;/h2&gt;

&lt;p&gt;In warehouses, data is stored in relational tables. The two common dimensional models in data warehouses are:&lt;br&gt;
• Star&lt;br&gt;
• Snowflake&lt;br&gt;
Dimensional models make it easy to filter and retrieve relevant data.&lt;br&gt;
Data marts are built focusing on a single business subject, such as marketing, finance, or sales. Data marts are created in either a top-down or a bottom-up fashion.&lt;/p&gt;

&lt;h2&gt;
  
  
  How data is fed into the warehouse
&lt;/h2&gt;

&lt;p&gt;Organizations bring data from different sources into the warehouse through pipelines. Data pipelines are designed to serve the following purposes:&lt;br&gt;
• Extracting data from the source&lt;br&gt;
• Transforming the data through validation, cleaning, and standardizing&lt;br&gt;
• Loading the transformed data into the enterprise warehouse&lt;br&gt;
There are two types of pipelines:&lt;br&gt;
Extract, load, transform (ELT) pipelines&lt;br&gt;
Extract, transform, load (ETL) pipelines&lt;/p&gt;
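&lt;p&gt;The ETL variant can be sketched as three small Python functions; the record fields and validation rules below are invented for illustration.&lt;/p&gt;

```python
def extract(source):
    """Pull raw records from a source system (here, just a list)."""
    return list(source)

def transform(records):
    """Validate, clean, and standardize before loading."""
    cleaned = []
    for rec in records:
        if rec.get("amount") is None:      # validation: drop incomplete rows
            continue
        cleaned.append({
            "customer": rec["customer"].strip().title(),  # standardize names
            "amount": round(float(rec["amount"]), 2),     # standardize types
        })
    return cleaned

def load(records, warehouse):
    """Append the transformed records to the warehouse (here, a list)."""
    warehouse.extend(records)
    return warehouse

raw = [{"customer": "  alice ", "amount": "19.991"},
       {"customer": "bob", "amount": None}]
warehouse = []
load(transform(extract(raw)), warehouse)
print(warehouse)  # [{'customer': 'Alice', 'amount': 19.99}]
```

&lt;p&gt;An ELT pipeline simply swaps the last two steps: the raw records land in storage first, and the transformation runs inside the warehouse afterwards.&lt;/p&gt;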

&lt;h2&gt;
  
  
  Data lake
&lt;/h2&gt;

&lt;p&gt;As stated earlier, warehouses are suited to handling structured datasets. However, businesses also need insights from semi-structured and unstructured data, such as HTML, JSON, social media content, and images. Specialized machine learning tools handle such datasets. Data lakes can hold all kinds of data, whether structured, semi-structured, or unstructured, and they also handle much larger datasets than warehouses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data lake architecture
&lt;/h2&gt;

&lt;p&gt;A data lake has five layers:&lt;br&gt;
• Storage layer: This layer sits at the center of the data lake architecture. It provides virtually unlimited storage at low cost. This layer has three main zones, each with a specific purpose:&lt;br&gt;
Landing zone: Also known as the raw zone. This is where the ingestion layer writes data from the data sources. The landing zone permanently stores raw data from the sources.&lt;br&gt;
Clean zone: Also called the transform zone. Data in the clean zone is stored in optimized formats.&lt;br&gt;
Curated zone: Also called the enriched zone. Data in the curated zone is optimized and cataloged for the consumption layer.&lt;br&gt;
• Catalog and search layer: Data lakes contain huge structured, semi-structured, and unstructured datasets from sources internal and external to an organization. Different departments use the datasets in the lake for different purposes, so users need a way to search the available schemas. The catalog and search layer provides metadata about the hosted data.&lt;br&gt;
• Ingestion layer: This layer connects to the different data sources. Data from the ingestion layer is forwarded to the storage layer.&lt;br&gt;
• Processing layer: Data from the storage layer is processed here to make it ready for consumption. Together, the components of the ingestion and processing layers form ELT pipelines.&lt;br&gt;
• Consumption layer: Consumers use the processed data through techniques such as interactive query processing and machine learning, among others.&lt;/p&gt;
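&lt;p&gt;The journey through the three storage zones can be mimicked with plain Python; the zone names follow the text above, while the payloads are invented for illustration.&lt;/p&gt;

```python
import json

# Landing zone: raw payloads exactly as ingested from the source.
landing_zone = ['{"user": "a1", "clicks": "7"}', '{"user": "a2", "clicks": "3"}']

# Clean zone: parsed into a typed, optimized representation.
clean_zone = []
for raw in landing_zone:
    rec = json.loads(raw)
    rec["clicks"] = int(rec["clicks"])   # enforce types during cleaning
    clean_zone.append(rec)

# Curated zone: aggregated and cataloged for the consumption layer.
curated_zone = {"total_clicks": sum(r["clicks"] for r in clean_zone)}
print(curated_zone)  # {'total_clicks': 10}
```

&lt;p&gt;In a real lake each zone would be a separate storage prefix (for example on Amazon S3) rather than a Python list, but the progression from raw to typed to aggregated data is the same.&lt;/p&gt;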

&lt;p&gt;In the end, I created a simple pipeline in AWS. I named it erick254 and selected Africa (Cape Town) as the region.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xRk3tase--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/swq7z1brt8mb215uttov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xRk3tase--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/swq7z1brt8mb215uttov.png" alt="Image description" width="800" height="442"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Introduction to Python for Data Engineering</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Wed, 31 Aug 2022 12:53:41 +0000</pubDate>
      <link>https://dev.to/ndurumo254/introduction-to-python-for-data-engineering-lal</link>
      <guid>https://dev.to/ndurumo254/introduction-to-python-for-data-engineering-lal</guid>
<description>

&lt;h2&gt;
  
  
  Setting up the tools.
&lt;/h2&gt;

&lt;p&gt;To use Python, you need a code editor and the required libraries and modules installed on your PC. This article covers how to run Python in a Jupyter notebook inside VS Code. We will start by installing VS Code, which runs on Windows, macOS, and Linux. If you don't have VS Code installed on your computer, visit  &lt;a href="https://code.visualstudio.com/"&gt;https://code.visualstudio.com/&lt;/a&gt;  to download and install it. VS Code is open source, so it is free and easy to install. After installing and opening VS Code, your screen will look as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NpoWCD45--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s5v6or5se1rzfh45vv2f.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NpoWCD45--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s5v6or5se1rzfh45vv2f.jpg" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click the Extensions icon on the left side of the window, and in the search box, type Jupyter notebook&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hb3t1it8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mz4i4ia4v3zi39bpsnzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hb3t1it8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mz4i4ia4v3zi39bpsnzt.png" alt="Image description" width="800" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then click on the Jupyter notebook extension and install it. Repeat the same procedure for any other extension you want to add to VS Code. After that, install Anaconda on your machine. If you haven't already installed Anaconda, visit &lt;a href="https://anaconda.org/"&gt;https://anaconda.org/&lt;/a&gt; for an installation guide. Then go to the Start button on your Windows machine and type anaconda prompt&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vFMzLt23--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oxvyaq9zgzi1hbwlppuw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vFMzLt23--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/oxvyaq9zgzi1hbwlppuw.png" alt="Image description" width="461" height="413"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on the Anaconda Prompt, and you will be taken to the conda terminal. Then create your workspace environment. I created an environment called datascience_basics and will use Python 3.9 in this project. The command to create this environment in Anaconda is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Users\HP.DESKTOP-QMIMHR3&amp;gt;conda create --name datascience_basics python=3.9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running the command, your screen will be as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KrlWI1zB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wwkc49u7pahx0b602s9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KrlWI1zB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wwkc49u7pahx0b602s9z.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Several packages will be installed, as shown in the screenshot. After that, we have to activate our environment. To activate it, run the following command in your terminal.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;C:\Users\HP.DESKTOP-QMIMHR3&amp;gt;conda activate datascience_basics
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can check whether Jupyter is properly installed by running&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda list jupyter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it installed properly, your screen will be as shown below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZJPNflH0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kpigrylmfepwqo3eq0j5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZJPNflH0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kpigrylmfepwqo3eq0j5.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We shall then navigate to the Desktop and create a folder for our data science work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(datascience_basics) C:\Users\HP.DESKTOP-QMIMHR3&amp;gt;cd Desktop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(datascience_basics) C:\Users\HP.DESKTOP-QMIMHR3\Desktop&amp;gt;mkdir datascience1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The screen will be as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TjW739ua--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kh3craagynub10c4glqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TjW739ua--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kh3craagynub10c4glqt.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now open VS Code from the terminal using&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(datascience_basics) C:\Users\HP.DESKTOP-QMIMHR3\Desktop&amp;gt;code .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the VS Code menu bar, click View, then Command Palette, and search for new Jupyter notebook. Your screen should be as shown below&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i1FrRI2S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dzmmme1zn5on20lqgh1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i1FrRI2S--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dzmmme1zn5on20lqgh1b.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
You can switch a cell between Python and Markdown from the toolbar at the top of the VS Code editor. Now your workspace is set up and ready to use, and you will have access to all Python libraries and packages. Your workspace will look like the screen below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qS2ZVLqz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kxko6ap2ylubixjcv5wk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qS2ZVLqz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kxko6ap2ylubixjcv5wk.png" alt="Image description" width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Python in data engineering?
&lt;/h2&gt;

&lt;p&gt;Data engineers collect data from different sources and convert it to the right format before delivering it to the right team. They prepare the data by removing repeated records and handling missing values, among other cleaning and pre-processing activities. The cleaned data is then forwarded to the analytics team. Below is a summary of the responsibilities of data engineers.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Ingesting data from various data sources&lt;/li&gt;
&lt;li&gt; Carrying out data optimization for data analysis&lt;/li&gt;
&lt;li&gt; Removing corrupted data in the dataset&lt;/li&gt;
&lt;li&gt; Developing, constructing, testing, and maintaining data architectures.
The growth of big data has facilitated the growth of data engineering. Big data is a dataset so large that traditional data management systems cannot analyze it economically. The growth of big data has been driven by the growth of IoT, mobile applications, and smart sensors. As per 2021 IDC data, there were more than 10 billion connected devices, and this number is projected to rise to roughly 25.4 billion by the year 2030; that is more than 15,000 million additional devices connected to the internet. Because of this, companies, organizations, and governments are investing heavily in how to ingest such data and store it for economic purposes. In past years, data was mainly structured. Data from mobile apps, web pages, and IoT devices mainly comes in the form of pictures, videos, or audio; such data is unstructured. Data from these devices can also arrive in JSON format, which is described as semi-structured. Big data is described using the five Vs. The 5 Vs help data scientists deliver valuable insights from the data and, at the same time, help data scientists, analysts, and data engineers make their organizations customer-centric. These 5 Vs are:

&lt;ul&gt;
&lt;li&gt;Volume: the amount of data that exists. When the volume is large enough, the data is termed big data.&lt;/li&gt;
&lt;li&gt;Variety: the diversity of data types. An organization can receive data from different sources, which often differ in type. The collected data can be structured, semi-structured, or unstructured.&lt;/li&gt;
&lt;li&gt;Velocity: how fast the data is produced and moved. This aspect is very important for a company to track the movement of data and make it available at the right time.&lt;/li&gt;
&lt;li&gt;Veracity: the quality and accuracy of the data collected. Collected data may contain missing values or wrong formats, making it messy and difficult to use.&lt;/li&gt;
&lt;li&gt;Value: the usefulness of the data to an organization.
Sometimes data engineer and data scientist sound like the same role. However, the two are quite different. To understand them, let's look at their differences.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TPOB7ytP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c7gqkd46gxzndqvv3x23.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TPOB7ytP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c7gqkd46gxzndqvv3x23.png" alt="Image description" width="705" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Data pipelines.
&lt;/h2&gt;

&lt;p&gt;Data is the new oil. As oil moves from crude form into different refined products, so does data. Raw data gets into the hands of data engineers, who prepare and clean it before handing it to the data scientists. Data scientists manipulate and analyze the data to extract different insights. Companies ingest data from a variety of sources, and they need to store this data. To achieve this, data engineers develop and construct data pipelines, which automate the flow of data from one location to another. Depending on the nature of the data source, the data can be processed either as streams or in batches.&lt;br&gt;
Before doing anything to the system's data, engineers ensure that it flows efficiently through the system. The input can be anything from images, videos, streams of JSON and XML data, and timed batches of data to readings from deployed sensors. Data engineers design systems that take this data as input, transform it, and store it in the right format for use by data scientists, data analysts, and machine learning engineers, among other data personnel. These systems are often referred to as extract, transform, and load (ETL) pipelines.&lt;br&gt;
As data flows through the system, it needs to conform to certain architectural standards. To make the data more accessible to users, data normalization is done. Some of the activities in data normalization include removing duplicated data, fixing missing and conflicting data, and converting the data to the right format. Unstructured data is stored in data lakes, while data warehouses are used to store relational data.&lt;br&gt;
&lt;strong&gt;Data lakes and warehouses&lt;/strong&gt;&lt;br&gt;
A data lake stores data from both internal and external sources. Data lakes and data warehouses are different; let's look at some of their differences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JXnzS6NT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cma9n7rfar9ipt3tdarl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JXnzS6NT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cma9n7rfar9ipt3tdarl.png" alt="Image description" width="800" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A data catalog for a data lake keeps records of:&lt;br&gt;
• The sources of the data&lt;br&gt;
• The location where the data is stored&lt;br&gt;
• The owner of the data&lt;br&gt;
• How often the data is updated&lt;br&gt;
&lt;strong&gt;Python libraries for data engineering&lt;/strong&gt;&lt;br&gt;
Python is widely used in data engineering because of its wealth of libraries and modules. Some of the data engineering Python libraries include:&lt;br&gt;
&lt;strong&gt;Pandas:&lt;/strong&gt; Data engineers use the pandas library to read, query, write, and manipulate data. Pandas can read both JSON and CSV file formats, and it can fix issues such as missing data in datasets. Data engineers use pandas to convert data into a readable format.&lt;br&gt;
&lt;strong&gt;Psycopg2/pyodbc/SQLAlchemy:&lt;/strong&gt; Data engineers often use PostgreSQL, which handles structured data, to store data. These libraries are used to connect to such databases.&lt;br&gt;
&lt;strong&gt;Elasticsearch:&lt;/strong&gt; Data engineers use this library to manage a NoSQL document store.&lt;br&gt;
&lt;strong&gt;SciPy:&lt;/strong&gt; This library offers fast mathematical routines. Data engineers use it to perform scientific calculations on problems related to the data.&lt;br&gt;
&lt;strong&gt;Beautiful Soup:&lt;/strong&gt; This library is used for data mining and web scraping. Data engineers use Beautiful Soup to extract data from specific websites. It supports both HTML and XML formats.&lt;br&gt;
&lt;strong&gt;Petl:&lt;/strong&gt; This library is used to extract and modify tabular data. Data engineers use it when building extract, transform, and load (ETL) pipelines.&lt;br&gt;
&lt;strong&gt;Pygrametl:&lt;/strong&gt; This is a library used during the deployment of ETL data pipelines.&lt;br&gt;
From what we have covered, it is clear that Python is among the best languages for data engineering because of its simplicity and wealth of data engineering libraries. Python is also open source, so everyone is free to improve the existing resources and use them for their own purposes.&lt;/p&gt;
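&lt;p&gt;As a small sketch of the pandas tasks described above (removing repeated rows and fixing missing data), assuming pandas is installed; the column names and values are invented for the example.&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "amount": [10.0, 10.0, None, 7.5],
})

df = df.drop_duplicates()                 # remove repeated rows
df["amount"] = df["amount"].fillna(0.0)   # fix missing values
print(df.to_dict("records"))
# [{'order_id': 1, 'amount': 10.0}, {'order_id': 2, 'amount': 0.0}, {'order_id': 3, 'amount': 7.5}]
```

&lt;p&gt;The same few lines scale from toy frames like this one to CSV or JSON files read with pd.read_csv or pd.read_json.&lt;/p&gt;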

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>aws</category>
      <category>iot</category>
    </item>
    <item>
      <title>Introduction to data engineering</title>
      <dc:creator>muriuki muriungi erick</dc:creator>
      <pubDate>Fri, 19 Aug 2022 10:33:51 +0000</pubDate>
      <link>https://dev.to/ndurumo254/introduction-to-data-engineering-33c9</link>
      <guid>https://dev.to/ndurumo254/introduction-to-data-engineering-33c9</guid>
<description>&lt;p&gt;&lt;strong&gt;Abstract:&lt;/strong&gt; The emergence of big data technology has altered the manner in which we do our daily business. The spontaneous growth of big data has created the need for data engineers who collect and manage this data. &lt;/p&gt;

&lt;h1&gt;
  
  
  INTRODUCTION
&lt;/h1&gt;

&lt;p&gt;Data engineering focuses on making data more useful and readily available to data consumers. Data engineers build systems used to collect, store, and analyze data. In the current cognitive era of computing, data engineering is a primary need for every industry. Most modern organizations collect huge amounts of data daily, facilitated by the growth of smart sensors and Internet of Things (IoT) technology. Most smart devices used in industry have smart sensors and transducers. Data from these devices is transferred through IoT and stored in different locations. It is the work of the data engineer to fine-tune the collected data and convert it into the appropriate formats. Data engineering is done before the data is forwarded to the data scientists. It's worth noting that emerging data technologies such as deep learning cannot thrive without competent data engineers.&lt;/p&gt;

&lt;h1&gt;
  
  
  WORK OF THE DATA ENGINEER
&lt;/h1&gt;

&lt;p&gt;Data engineers, as we have seen, ensure that the raw collected data is pre-processed and converted into a suitable format to be utilized by the data scientists for data analysis. They create the data pipelines used by data-centric teams and data scientists in their applications. The main goal of data engineers is to make the data available and accessible to the analysis team. Data engineers require sound technical skills in areas such as SQL databases and mastery of high-level programming languages such as Python. They work tirelessly to ensure that this data can be evaluated and optimized to solve the problems of different organizations. Data engineers design and build the algorithms used to access raw data. Before coming up with such algorithms, they first have to understand the objectives of their clients, mainly so that the algorithm performs well against the business goal. Data engineers need to understand data optimization and have the skills to develop dashboards and reports. They may sometimes be tasked with communicating data trends in an organization, and a huge organization may have several data scientists and analysts. To choose the right tools for data engineering, they should understand the different architectural principles used in data processing. The main functions of data infrastructure are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Data extraction: in most cases, the information is located in various places, in either structured or unstructured form. The information can be in a database or in internal CRM systems. It can also be real-time data streams coming directly from sensors.&lt;/li&gt;
&lt;li&gt; Data storage: after the data is extracted, it should be stored securely in certain locations. Data engineering mainly uses data warehouses for analytics.&lt;/li&gt;
&lt;li&gt; Data transformation: raw data is of little use to the user. It makes little or no sense for the business problem being solved, and it is difficult and time-consuming to analyze. Transformation cleans the data and formats it appropriately for the analytics team.
Data engineers' roles can be subdivided into three main subcategories.
i.  Generalists: these are roles for data engineers who work in small companies. They are responsible for the entire data process, taking on all data tasks from data management to data analysis.
ii. Pipeline-centric: this role is mainly found in mid-sized companies. In such a setup, data engineers work with data scientists to gain insight from the collected data. Such engineers need sound knowledge of distributed systems and computer science.
iii.    Database-centric: this role is for large organizations, where managing the flow of data is vital. Data engineers in such companies work with different warehouses across different databases, and they are responsible for developing table schemas&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  SKILLS NEEDED IN DATA ENGINEERING
&lt;/h1&gt;

&lt;p&gt;To land a data engineering job, one needs big data skills. These skills range from designing, creating, and building data pipelines to maintaining them. Knowledge of big data frameworks, databases, and containers is also vital, as is knowledge of tools such as Hadoop, Scala, Storm, and Python, to name a few. Below are some skills needed to build a successful career as a data engineer.&lt;br&gt;
i.  Database tools: data engineers deal with storing, organizing, and maintaining big data. To become a competent data engineer, one needs to understand database design as well as database structure. The commonly used databases are structured query language (SQL) based and NoSQL based. SQL-based systems include databases such as MySQL, which are used to store structured data. NoSQL includes technologies such as MongoDB and Cassandra, which are used to store unstructured, structured, or semi-structured data.&lt;br&gt;
ii. Tools for data transformation: raw data cannot be used directly. It is first cleaned and transformed into desirable formats. Commonly used data transformation tools are Talend, Pentaho Data Integration, Hevo Data, and more&lt;br&gt;
iii.    Tools for data mining: these tools extract useful information and then find patterns in big data. Data mining mainly assists in classification and prediction. Some of the data mining tools include Apache Mahout, KNIME, Weka, and more&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>python</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
