<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kinyungu Denis</title>
    <description>The latest articles on DEV Community by Kinyungu Denis (@kinyungu_denis).</description>
    <link>https://dev.to/kinyungu_denis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F860808%2Fb44dbc23-631b-4c48-a61e-956dba284a5c.jpg</url>
      <title>DEV Community: Kinyungu Denis</title>
      <link>https://dev.to/kinyungu_denis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kinyungu_denis"/>
    <language>en</language>
    <item>
      <title>How to Create and Use a Virtual Environment in Python in Ubuntu 22.04</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Tue, 01 Nov 2022 16:54:00 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/how-to-create-and-use-a-virtual-environment-in-python-in-ubuntu-2204-3pp9</link>
      <guid>https://dev.to/kinyungu_denis/how-to-create-and-use-a-virtual-environment-in-python-in-ubuntu-2204-3pp9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Greetings, my esteemed readers. I took a break to participate in &lt;strong&gt;HacktoberFest with Aviyel&lt;/strong&gt;. It was an awesome experience; I learnt a lot and I will be sharing it with you, my dear readers. I am happy to be back and share my knowledge with you.&lt;/p&gt;

&lt;p&gt;We will learn about virtual environments in Python: how to create one, why you should create one and how to manage one. To get the best out of this article, you should understand the basics of Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a Python virtual environment?
&lt;/h3&gt;

&lt;p&gt;A Python virtual environment is an isolated environment that keeps the packages for different projects in different places to avoid dependency conflicts. Each environment &lt;br&gt;
has its own independent dependencies. &lt;/p&gt;

&lt;p&gt;A single Python installation can fail to meet the requirements of every application. If application K needs version 4.0 of a particular module but application W needs version 3.0, then the requirements are in conflict and installing either version 3.0 or 4.0 will leave one application unable to run. &lt;/p&gt;

&lt;p&gt;The solution to this problem is to create virtual environments, so that different applications can use different virtual environments and all of them are able to run on your system.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to create a virtual environment?
&lt;/h3&gt;

&lt;p&gt;We will use &lt;code&gt;venv&lt;/code&gt; to manage separate packages for different projects.&lt;/p&gt;

&lt;p&gt;To create a virtual environment, go to your project directory and run &lt;code&gt;venv&lt;/code&gt;. For example, in my case I will navigate to the required directory using the &lt;code&gt;cd&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /home/exporter/Kadenno/python_projects/Django_projects/project_one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v6yoqwdhfno3wxgfky4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v6yoqwdhfno3wxgfky4.png" alt="Navigating to project directory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running the above command we are in our desired project directory.&lt;br&gt;
Now we can run our &lt;code&gt;venv&lt;/code&gt; command as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7c0okxe8wutd4y1ztuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7c0okxe8wutd4y1ztuv.png" alt="Using venv in virtual environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use the second argument, &lt;code&gt;env&lt;/code&gt;, as the location for our virtual environment; you can change it if you want a location of your own.&lt;/p&gt;

&lt;p&gt;Basically, &lt;code&gt;venv&lt;/code&gt; will create a virtual Python installation in the &lt;code&gt;env&lt;/code&gt; folder.&lt;/p&gt;

&lt;p&gt;Now we need to activate our virtual environment. Before you begin installing and using packages, your virtual environment should be activated. We activate it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source env/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn46ttxy5fhp7o4yzani.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn46ttxy5fhp7o4yzani.png" alt="Activating our vrtual environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great, now that we have our virtual environment up and running, we can go ahead and install the required packages for our project.&lt;/p&gt;

&lt;p&gt;You can confirm whether you are in your virtual environment by using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;which python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaxi49gjjfzqtzc7gwz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaxi49gjjfzqtzc7gwz5.png" alt="Checking whether in virtual environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you have completed your project, you can leave your virtual environment by using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deactivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our case, we won't leave our virtual environment, since we are yet to complete our project.&lt;/p&gt;

&lt;p&gt;To illustrate installing and using a package, let us upgrade pip in our virtual environment.&lt;br&gt;
You will use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade pip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspym01lnau65aj7fru3q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspym01lnau65aj7fru3q.png" alt="Installing pip in our virtual environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This looks awesome. You can now proceed to install the other packages, libraries and dependencies that your project requires in the virtual environment.&lt;/p&gt;

&lt;p&gt;At this point, we understand what a virtual environment is, how to create one and how to deactivate it. However, do you know how a virtual environment works?&lt;br&gt;
Let's take a deep dive and learn about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does a virtual environment work?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh5wp02wiupjf8mrm7aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh5wp02wiupjf8mrm7aw.png" alt="The virtual environment representation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you create a virtual environment using &lt;code&gt;venv&lt;/code&gt;, the module re-creates the file and folder structure of a standard Python installation on your operating system. Python also copies or symlinks into that folder structure the Python executable with which you called &lt;code&gt;venv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It adapts the prefix-finding process:&lt;br&gt;
Because of this standard folder structure, the Python interpreter in our virtual environment can work out where all relevant files are located. It does this with only minor adaptations to its prefix-finding process, according to the &lt;code&gt;venv&lt;/code&gt; specification.&lt;/p&gt;

&lt;p&gt;Instead of looking for the &lt;code&gt;os&lt;/code&gt; module to determine the location of the standard library, the Python interpreter first looks for a &lt;code&gt;pyvenv.cfg&lt;/code&gt; file. If the interpreter finds this file and it contains a &lt;code&gt;home&lt;/code&gt; key, then the interpreter uses that key to set the value of the following two variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sys.base_prefix:&lt;/strong&gt; holds the path to the Python executable used to create this virtual environment, which you can find at the path defined under the &lt;code&gt;home&lt;/code&gt; key in &lt;code&gt;pyvenv.cfg&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sys.prefix:&lt;/strong&gt; points to the directory containing &lt;code&gt;pyvenv.cfg&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the interpreter doesn’t find a &lt;code&gt;pyvenv.cfg&lt;/code&gt; file, then it determines that it’s not running within a virtual environment, and both &lt;code&gt;sys.base_prefix&lt;/code&gt; and &lt;code&gt;sys.prefix&lt;/code&gt; will then point to the same path.&lt;/p&gt;
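&lt;p&gt;You can check this behaviour from within Python itself. Here is a minimal sketch that compares the two prefix values to tell whether the interpreter is running inside a virtual environment:&lt;/p&gt;

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base installation.
    # Outside a venv, the two are the same path.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```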

&lt;p&gt;It links back to your standard library:&lt;br&gt;
Python virtual environments aim to be a lightweight, isolated Python environment that you can quickly create and then delete when you no longer need it. To achieve this, &lt;code&gt;venv&lt;/code&gt; copies only the minimally necessary files.&lt;/p&gt;

&lt;p&gt;The Python executable in our virtual environment has access to the standard library modules of the Python installation on which you based the environment. Python points to the file path of the base Python executable in the &lt;code&gt;home&lt;/code&gt; setting in &lt;code&gt;pyvenv.cfg&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It modifies your &lt;code&gt;PYTHONPATH&lt;/code&gt;:&lt;br&gt;
So that scripts run using the Python interpreter within our virtual environment, &lt;code&gt;venv&lt;/code&gt; adjusts the module search path that you can access using &lt;code&gt;sys.path&lt;/code&gt;.&lt;/p&gt;
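&lt;p&gt;A quick way to see the effect is to print the module search path; inside an activated environment, the site-packages entry points into the env folder:&lt;/p&gt;

```python
import sys

# Print every directory Python will search for modules.
# In an activated venv, the site-packages entry lives under the env folder.
for entry in sys.path:
    print(entry)
```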

&lt;p&gt;It changes your shell &lt;code&gt;PATH&lt;/code&gt; variable on activation:&lt;br&gt;
You activate your virtual environment before working in it. To do so, you execute an activation script, just as we did earlier.&lt;/p&gt;

&lt;p&gt;Actions that happen in the activation script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Path: It sets the &lt;code&gt;VIRTUAL_ENV&lt;/code&gt; variable to the root folder path of your virtual environment and puts the location of the environment's executables at the front of your &lt;code&gt;PATH&lt;/code&gt;.
Because the executables in your virtual environment now come first on your &lt;code&gt;PATH&lt;/code&gt;, typing &lt;code&gt;python&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt; makes your shell invoke the environment's own versions.&lt;/li&gt;
&lt;li&gt;Command prompt: the command prompt shows the name that you passed when creating the virtual environment, wrapped in parentheses, for example &lt;code&gt;(env)&lt;/code&gt;. We saw this when we created our virtual environment. From your command prompt, you will know whether or not your virtual environment is activated.&lt;/li&gt;
&lt;/ul&gt;
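&lt;p&gt;As a small sketch of the first point, you can read the &lt;code&gt;VIRTUAL_ENV&lt;/code&gt; variable from Python; it only exists after the activation script has run:&lt;/p&gt;

```python
import os

# VIRTUAL_ENV is set by the activation script and removed by deactivate.
venv_root = os.environ.get("VIRTUAL_ENV")
if venv_root is not None:
    print(f"Active virtual environment: {venv_root}")
else:
    print("No virtual environment is active.")
```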

&lt;p&gt;You will activate your virtual environment before working with it and deactivate it after you’re done, as we discussed earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Through this article we have learnt about virtual environments in Python. Virtual environments give you the ability to isolate your Python development projects from your system-installed Python and other Python environments. This gives you full control of your project.&lt;/p&gt;

&lt;p&gt;When developing any application that will grow beyond a simple &lt;code&gt;.py&lt;/code&gt; script, it's a good idea to use a virtual environment. Having read this article, you now know how to set one up and start using it.&lt;/p&gt;

&lt;p&gt;Let me know what you think about this article through my &lt;a href="https://twitter.com/deno_exporter" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/denis-mashellkinyungu-1b79bb171/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; handles. It would be great to get your feedback and connect with you.&lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Learn Ansible and how to Install it in Ubuntu 22.04.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Wed, 05 Oct 2022 23:06:24 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/learn-ansible-and-how-to-install-it-in-ubuntu-2204-3g5j</link>
      <guid>https://dev.to/kinyungu_denis/learn-ansible-and-how-to-install-it-in-ubuntu-2204-3g5j</guid>
      <description>&lt;p&gt;Greetings to my esteemed readers.&lt;/p&gt;

&lt;p&gt;In this article we will learn how to install Ansible on Ubuntu 22.04; I will also cover the details of Ansible so you are familiar with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Ansible and why should we use it?
&lt;/h2&gt;

&lt;p&gt;Ansible is an open-source infrastructure automation tool, now maintained by Red Hat, that is used to tackle all kinds of challenges that come with infrastructure as code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ansible has three major use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure automation&lt;/li&gt;
&lt;li&gt;Configuration management&lt;/li&gt;
&lt;li&gt;App Deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Infrastructure automation
&lt;/h2&gt;

&lt;p&gt;Provisioning is the first use case for Ansible.&lt;br&gt;
Using Ansible you can create an environment within existing infrastructure, such as a Virtual Private Cloud (VPC) on your favourite cloud provider. Let us say that our VPC has four virtual machines.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration management
&lt;/h2&gt;

&lt;p&gt;The main use case is the ability to configure your actual infrastructure.&lt;/p&gt;
&lt;h3&gt;
  
  
  The key principles in Ansible
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;It is declarative&lt;/strong&gt;: you use an Ansible playbook to group together the set of tasks that you need to run, describing the desired state rather than scripting each procedural step.&lt;/p&gt;

&lt;p&gt;You create an Ansible playbook: a book of tasks, or a set of plays.&lt;br&gt;
A play has three main things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The name of the play&lt;/li&gt;
&lt;li&gt;The hosts that it will run against&lt;/li&gt;
&lt;li&gt;The actual tasks that will run&lt;/li&gt;
&lt;/ul&gt;
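&lt;p&gt;A minimal playbook showing these three parts might look like the following sketch; the &lt;code&gt;webservers&lt;/code&gt; group and the patching task are illustrative examples, not part of this article's setup:&lt;/p&gt;

```yaml
# sample playbook -- the host group and task names are illustrative
- name: Apply security patches
  hosts: webservers
  tasks:
    - name: Upgrade all packages
      ansible.builtin.apt:
        upgrade: dist
        update_cache: yes
```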

&lt;p&gt;You define the hosts that the play will run against, and a set of tasks to perform, such as security patching.&lt;/p&gt;

&lt;p&gt;The set of virtual machines will be the set of hosts; in the Ansible world, we call this an inventory: the set of hosts that Ansible can work on.&lt;/p&gt;
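&lt;p&gt;An inventory is often just an INI-style file grouping hosts. For example, the four virtual machines in our VPC could be listed as follows; the host names are illustrative:&lt;/p&gt;

```ini
# inventory.ini -- host names are illustrative
[webservers]
vm1.example.com
vm2.example.com

[databases]
vm3.example.com
vm4.example.com
```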

&lt;p&gt;Ansible takes advantage of YAML for writing configuration files, letting you declare the tasks you want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ansible is agent-less&lt;/strong&gt;: you do not need to install an agent on the virtual machines that you have provisioned. &lt;br&gt;
Ansible uses secure shell (SSH) to run the tasks directly on the virtual machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ansible is idempotent&lt;/strong&gt;: an idempotent operation can be run multiple times without changing the system beyond its initial application.&lt;br&gt;
When run repeatedly, Ansible recognises what has already changed and what still needs to be resolved, ensuring every task has been done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ansible is community-driven:&lt;/strong&gt; a lot of Ansible playbooks are made available by the community and published as collections, and developers have a repository to contribute to.&lt;/p&gt;
&lt;h2&gt;
  
  
  App Deployment
&lt;/h2&gt;

&lt;p&gt;Ansible can be used to deploy the actual web applications and workloads into virtual machines.&lt;/p&gt;

&lt;p&gt;Now that we have a basic understanding of Ansible, let us install it on our machine.&lt;/p&gt;

&lt;p&gt;First, run the command to ensure our package index is up to date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkn2zqdl005vv26vmqc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkn2zqdl005vv26vmqc2.png" alt="Apt update"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install software-properties-common
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8yrjhbcgsezpwucp2sn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8yrjhbcgsezpwucp2sn.png" alt="Install  properties command"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo add-apt-repository --yes --update ppa:ansible/ansible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99jc4om6kmuk215k6pva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99jc4om6kmuk215k6pva.png" alt="Add the PPA repository"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install ansible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8nnto3g1zelfado7sh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8nnto3g1zelfado7sh0.png" alt="Install ansible"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above commands will install Ansible on your Ubuntu 22.04 machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have learned the basics about Ansible and how to install it in Ubuntu 22.04.&lt;br&gt;
I will be dropping more articles about how one uses Ansible.&lt;/p&gt;

&lt;p&gt;Let me know what you think about this article through my &lt;a href="https://twitter.com/deno_exporter" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/denis-mashellkinyungu-1b79bb171/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; handles. It will be good to connect with you.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>install</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Learning Boto3 and AWS Services the right way in Data Engineering.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Fri, 30 Sep 2022 02:45:23 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/learning-boto3-and-aws-services-the-right-way-in-data-engineering-1g32</link>
      <guid>https://dev.to/kinyungu_denis/learning-boto3-and-aws-services-the-right-way-in-data-engineering-1g32</guid>
      <description>&lt;p&gt;Greetings to my esteemed readers!&lt;/p&gt;

&lt;p&gt;In this article we will learn about AWS Boto3 and use it together with other AWS services. It will also cover other AWS services that are essential in data engineering. The only prerequisites for this article are basic knowledge of Python and AWS services.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS Boto3
&lt;/h2&gt;

&lt;p&gt;Boto3 is the Amazon Web Services (AWS) SDK for Python. &lt;br&gt;
Boto3 is your new friend when it comes to creating Python scripts for AWS resources. &lt;br&gt;
It allows you to directly create, configure, update and delete AWS resources from your Python scripts. Boto3 provides an easy to use, object-oriented API, as well as low-level access to AWS services.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to install and configure Boto3
&lt;/h2&gt;

&lt;p&gt;Before you install Boto3, you should have Python 3.7 or later.&lt;br&gt;
To install Boto3 via pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also install it using Anaconda, if you want it in your Anaconda environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c anaconda boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also install it in Google Colab, to perform your operations in the cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before using Boto3, you need to set up authentication credentials for your AWS account using either the AWS IAM Console or the AWS CLI. You can either choose an existing user or create a new one.&lt;/p&gt;

&lt;p&gt;If you have the AWS CLI installed, use the &lt;code&gt;aws configure&lt;/code&gt; command to configure your credentials file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also create the credentials file yourself. By default, its location is &lt;code&gt;~/.aws/credentials&lt;/code&gt;. The credentials file should specify the access key and secret access key. Replace &lt;code&gt;YOUR_ACCESS_KEY_ID&lt;/code&gt; with the access key ID for your user and &lt;code&gt;YOUR_SECRET_ACCESS_KEY&lt;/code&gt; with your user's secret access key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[default] 
aws_access_key_id = YOUR_ACCESS_KEY_ID 
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the file.&lt;br&gt;
Now that you have set up these credentials, you have a default profile, which will be used by Boto3 to interact with your AWS account.&lt;/p&gt;
&lt;h2&gt;
  
  
  Boto3 SDK features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Session&lt;/strong&gt;&lt;br&gt;
A session manages state about a particular configuration. By default, a session is created for you when needed. However, it's possible for you to maintain your own session. Sessions store the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credentials&lt;/li&gt;
&lt;li&gt;AWS Region&lt;/li&gt;
&lt;li&gt;Other configurations related to your profile&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Default Session&lt;/p&gt;

&lt;p&gt;Boto3 acts as a proxy to the default session. This is created when you create a low-level client or resource client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# Using the default session
rds = boto3.client('rds')
s3 = boto3.resource('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Custom Session&lt;/p&gt;

&lt;p&gt;You can also manage your own session and create low-level clients or resource clients from it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import boto3.session

# Create your own session
current_session = boto3.session.Session()

# Now we can create low-level clients or resource clients from our custom session
rds = current_session.client('rds')
s3 = current_session.resource('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Clients&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clients provide a low-level interface to AWS whose methods map close to 1:1 with service APIs. All service operations are supported by clients. Clients are generated from a JSON service definition file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# Create a low-level client with the service name
s3 = boto3.client('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To access a low-level client from an existing resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the resource
s3_resource = boto3.resource('s3')

# Get the client from the resource
s3 = s3_resource.meta.client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Resources represent an object-oriented interface to Amazon Web Services (AWS). They provide a higher-level abstraction than the raw, low-level calls made by service clients. To use resources, you invoke the resource() method of a Session and pass in a service name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get resources from the default session

s3 = boto3.resource('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every resource instance has a number of attributes and methods. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Collections&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A collection provides an iterable interface to a group of resources. A collection seamlessly handles pagination for you, making it possible to easily iterate over all items from all pages of data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# s3 list all buckets
s3 = boto3.resource('s3')
for bucket in s3.bucket.all():
    print(bucket.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Paginators&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pagination refers to the process of sending subsequent requests to continue where a previous request left off. It is needed because some AWS operations return truncated results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# Create a client
client = boto3.client('s3', region_name='ap-south-1')

# Create a reusable Paginator
paginator = client.get_paginator('list_objects')

# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket='sample-bucket')

for page in page_iterator:
    print(page['Contents'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Client vs Resource&lt;/strong&gt;: which one should you use?&lt;/p&gt;

&lt;p&gt;Resources offer higher-level, object-oriented service access, whereas clients offer low-level service access.&lt;/p&gt;

&lt;p&gt;The question is, “Which one should I use?”&lt;/p&gt;

&lt;p&gt;Understanding how the client and the resource are generated helps in deciding which one to choose:&lt;/p&gt;

&lt;p&gt;Boto3 generates the client from a JSON service definition file. The client’s methods support every single type of interaction with the target AWS service.&lt;br&gt;
Resources, on the other hand, are generated from JSON resource definition files.&lt;/p&gt;

&lt;p&gt;Boto3 generates the client and the resource from different definitions. As a result, you may find cases in which an operation supported by the client isn’t offered by the resource.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With clients, there is more programmatic work to be done. The majority of the client operations give you a dictionary response. To get the exact information that you need, you’ll have to parse that dictionary yourself. With resource methods, the SDK does that work for you.&lt;/li&gt;
&lt;li&gt;With the client, you might see some slight performance improvements. The disadvantage is that your code becomes less readable than it would be if you were using the resource. Resources offer a better abstraction, and your code will be easier to comprehend.&lt;/li&gt;
&lt;/ul&gt;
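&lt;p&gt;To make the first point concrete, here is a sketch of parsing a client-style dictionary response; the payload below is a hand-written sample, not real API output:&lt;/p&gt;

```python
# A client call such as list_buckets() returns a plain dictionary.
# This payload is a hand-written sample for illustration only.
response = {
    "Buckets": [
        {"Name": "farm-videos", "CreationDate": "2022-09-01"},
        {"Name": "farm-logs", "CreationDate": "2022-09-15"},
    ],
}

# With a client you extract the fields you need yourself;
# a resource would instead hand you bucket objects with .name attributes.
bucket_names = [bucket["Name"] for bucket in response["Buckets"]]
print(bucket_names)  # prints ['farm-videos', 'farm-logs']
```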
&lt;h2&gt;
  
  
  Amazon s3
&lt;/h2&gt;

&lt;p&gt;AWS s3 is an object storage platform that allows you to store and retrieve any amount of data at any time. It is a storage service that makes web-scale computing easier for users and developers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;s3 offers four storage class solutions in total, with unlimited data storage capacity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;s3 Standard&lt;/li&gt;
&lt;li&gt;s3 Standard Infrequent Access (otherwise known as S3 IA)&lt;/li&gt;
&lt;li&gt;s3 One Zoned Infrequent Access&lt;/li&gt;
&lt;li&gt;Glacier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Amazon s3 Standard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;s3 Standard offers high durability, availability and performance object storage for frequently accessed data. It delivers low latency and high throughput. It is perfect for a wide variety of use cases including cloud applications, dynamic websites, content distribution, mobile applications and Big Data analytics.&lt;/p&gt;

&lt;p&gt;For example, consider a web application collecting farm video uploads. With unlimited storage, there will never be a disk size issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;s3 Infrequent Access (IA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;s3 IA is designed for data that is accessed less frequently but requires rapid access when needed. s3 Standard-IA offers the high durability, high throughput, and low latency of s3 Standard, with a low per GB storage price and per GB retrieval fee. This combination of low cost and high performance make s3 Standard-IA ideal for long-term storage, backups and as a data store for disaster recovery.&lt;/p&gt;

&lt;p&gt;For example, in a web application collecting farm video uploads on a daily basis, some of those videos will soon fall out of regular use: there is little demand to watch year-old farm videos. With IA we can move such objects to a different storage class without affecting their durability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 One Zone-IA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;S3 One Zone-IA is also designed for data that is accessed less frequently but requires rapid access when needed. Unlike the other storage classes, it stores data in a single Availability Zone (AZ). Because of this, storing data in S3 One Zone-IA costs 20% less than storing it in S3 Standard-IA. It is a good choice for storing secondary backup copies of on-premises data or easily re-creatable data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Reduced Redundancy Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reduced Redundancy Storage (RRS) is an Amazon S3 storage option that enables customers to store noncritical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Glacier&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving. Customers can store data for as little as $0.004 per gigabyte per month. To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides different options for access to archives, from a few minutes to several hours.&lt;/p&gt;
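&lt;p&gt;At that quoted rate, a monthly archive bill is simple arithmetic. Here is a quick sketch; the 500 GB figure is just an illustration:&lt;/p&gt;

```python
# Back-of-the-envelope Glacier cost estimate at the quoted rate of
# $0.004 per GB per month. The 500 GB figure is just an illustration.
GLACIER_RATE_PER_GB_MONTH = 0.004

def monthly_glacier_cost(gigabytes):
    """Return the monthly archive storage cost in dollars."""
    return gigabytes * GLACIER_RATE_PER_GB_MONTH

print(monthly_glacier_cost(500))  # 500 GB comes to about $2.00 per month
```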

&lt;p&gt;&lt;strong&gt;Object Store&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon S3 is a simple key-value store designed to store as many objects as you want. You store these objects in one or more buckets. An object consists of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key — The name that you assign to an object. You use the object key to retrieve the object.&lt;/li&gt;
&lt;li&gt;Version ID — Within a bucket, a key and version ID uniquely identify an object.&lt;/li&gt;
&lt;li&gt;Value — The content that we are storing.&lt;/li&gt;
&lt;li&gt;Metadata — A set of name-value pairs with which you can store information regarding the object.&lt;/li&gt;
&lt;li&gt;Subresources — Amazon S3 uses the subresource mechanism to store object-specific additional information.&lt;/li&gt;
&lt;li&gt;Access Control Information — We can control access to the objects in Amazon S3.&lt;/li&gt;
&lt;/ul&gt;
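&lt;p&gt;To make those components concrete, here is a toy in-memory sketch of that key-value model. This is purely illustrative; real S3 adds subresources, access control, and durability guarantees on top:&lt;/p&gt;

```python
# A toy, in-memory sketch of the S3 object model: a bucket maps keys to
# objects carrying a value, metadata, and a version id. Illustrative
# only; real S3 adds subresources, access control, and durability.
import itertools

class ToyBucket:
    _versions = itertools.count(1)   # stand-in for S3 version ids

    def __init__(self, name):
        self.name = name             # bucket names are globally unique in S3
        self._objects = {}           # key -> (value, metadata, version_id)

    def put_object(self, key, value, metadata=None):
        version_id = next(self._versions)
        self._objects[key] = (value, metadata or {}, version_id)
        return version_id

    def get_object(self, key):
        value, _metadata, _version_id = self._objects[key]
        return value

bucket = ToyBucket("sampled-bucket")
bucket.put_object("sampled/data.csv", b"a,b\n1,2\n", metadata={"type": "csv"})
print(bucket.get_object("sampled/data.csv"))
```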

&lt;p&gt;&lt;strong&gt;Connect to Amazon S3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As long as the credentials file described above has been created, you should be able to connect to your S3 object storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# resource() returns the high-level Resource interface (not the low-level client)
s3_client = boto3.resource('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create and View Buckets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When creating a bucket there is a lot you can configure (location constraint, read access, write access), and you can use the client API to do that; here we use the high-level resource() API. Once we create a new bucket, let’s view all the buckets available in S3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create a bucket with a given name (bucket names must be globally
# unique and may not contain underscores)
sampled_bucket = s3_client.create_bucket(Bucket='sampled-buckets')

# view buckets in s3
for bucket in s3_client.buckets.all():
     print(bucket.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;View Objects within a Bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s add objects to the bucket and then view all objects within our specific bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# point to bucket and add objects
sampled_bucket.put_object(Key='sampled/object1')
sampled_bucket.put_object(Key='sampled/object2')

# view objects within a bucket
for obj in sampled_bucket.objects.all():
     print(obj.key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Upload, Download, and Delete Objects&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Upload a CSV file, then view the objects within our bucket again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# upload local csv file to a specific s3 bucket
local_file_path = '/Users/Desktop/data.csv'
key_object = 'sampled/data.csv'

sampled_bucket.upload_file(local_file_path, key_object)
for obj in sampled_bucket.objects.all():
    print(obj.key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# download an s3 file to local machine
filename = 'downloaded_s3_data.csv'

sampled_bucket.download_file(key_object, filename)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s delete some of these objects. You can either delete a specific object or delete all objects within a bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# delete a specific object
sampled_bucket.Object('sampled/object2').delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# delete all objects in a bucket
sampled_bucket.objects.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can only delete an empty bucket, so before deleting a bucket ensure it contains no objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# delete specific bucket
sampled_bucket.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bucket vs Object&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A bucket has a name that is unique across all of S3 and may contain many objects. Each object has a key that is unique within the bucket; the key is effectively the object’s full path from the bucket root.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Redshift
&lt;/h2&gt;

&lt;p&gt;Amazon Redshift is a fully managed, columnar cloud data warehouse that you can use to run complex analytical queries on large datasets through massively parallel processing (MPP). Datasets can range from gigabytes to petabytes. It supports SQL, ODBC, and JDBC interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Redshift Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2zf3w1jd6jh89fckgx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2zf3w1jd6jh89fckgx7.png" alt="AWS Redshift"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The components of Redshift Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cluster&lt;br&gt;
A cluster in Redshift is a set of one or more compute nodes. There are two types of nodes: the leader node and compute nodes. If a cluster has two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication with client applications.&lt;/p&gt;

&lt;p&gt;Leader node&lt;br&gt;
The leader node interacts with client applications and communicates with compute nodes to carry out operations. It parses queries and generates an execution plan to carry out database operations. Based on the execution plan, it compiles code, distributes the compiled code to all provisioned compute nodes, and assigns a portion of the data to each node.&lt;/p&gt;

&lt;p&gt;Compute nodes&lt;br&gt;
The leader node compiles each step of the execution plan and assigns it to the compute nodes. &lt;br&gt;
Compute nodes execute the compiled code they are given and send intermediate results back to the leader node, which aggregates the final result for each request from a client application.&lt;br&gt;
Each compute node has its own dedicated CPU, memory, and storage, which are essentially determined by the node type.&lt;/p&gt;

&lt;p&gt;At a high level, AWS Redshift provides two node types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dense storage nodes (ds1 or ds2)&lt;/li&gt;
&lt;li&gt;Dense compute nodes (dc1 or dc2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Node slices&lt;br&gt;
Each compute node is further partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it carries out its share of the workload given to the node. The leader node manages distributing data for queries and other database operations to the slices. All slices work in parallel to complete the operation. &lt;/p&gt;

&lt;p&gt;Internal network&lt;br&gt;
The internal network carries communication between the leader node and compute nodes as they perform database operations. Redshift uses very high-bandwidth connections and custom communication protocols to provide high-speed, private, and secure communication between the leader and compute nodes.&lt;/p&gt;

&lt;p&gt;Databases&lt;br&gt;
A cluster contains one or more databases, and user data is stored on the compute nodes. Redshift provides the same functionality as a typical RDBMS, including OLTP features such as DML; however, it is optimized for high-performance analysis and reporting on large datasets.&lt;/p&gt;

&lt;p&gt;Connections&lt;br&gt;
Redshift interacts with client applications using JDBC and ODBC drivers for PostgreSQL.&lt;/p&gt;

&lt;p&gt;Client applications&lt;br&gt;
AWS Redshift provides the flexibility to connect with various client tools such as ETL, business intelligence reporting, and analytics tools. Because it is based on industry-standard PostgreSQL, most existing SQL client applications are compatible and work with little or no change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redshift Distribution Keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AUTO — if we do not specify a distribution style, Redshift chooses one based on the size of the data.&lt;br&gt;
EVEN — rows are distributed across slices in round-robin fashion; appropriate when the table does not participate in joins, or when there is no clear choice between KEY and ALL. It distributes rows evenly without trying to cluster data that is accessed at the same time.&lt;br&gt;
KEY — rows are distributed according to the values in one column; all the data with a specific key will be stored on the same slice.&lt;br&gt;
ALL — the entire table is copied to every node; appropriate for slow-moving tables.&lt;/p&gt;
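&lt;p&gt;The EVEN and KEY styles can be pictured with a short sketch. The slice count and rows below are made up for illustration:&lt;/p&gt;

```python
# Toy sketches of EVEN (round-robin) and KEY (hash) distribution styles.
# Slice counts and rows are made up for illustration.
def distribute_even(rows, num_slices):
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)  # deal rows out in turn
    return slices

def distribute_key(rows, num_slices, key):
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        # same key value -> same slice, which co-locates join partners
        slices[hash(row[key]) % num_slices].append(row)
    return slices

print([len(s) for s in distribute_even(list(range(10)), 4)])  # [3, 3, 2, 2]
```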

&lt;p&gt;&lt;strong&gt;Sort Keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A sort key is similar to an index and makes for fast range queries.&lt;br&gt;
Rows are stored on disk in sorted order based on the column you designate as the sort key.&lt;/p&gt;

&lt;p&gt;Types of sort keys&lt;br&gt;
Single column&lt;br&gt;
Compound&lt;br&gt;
Interleaved — gives equal weight to each column&lt;/p&gt;
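&lt;p&gt;The effect of a sort key on range queries is analogous to binary search over a sorted list, which we can sketch with Python's bisect module (the year column below is hypothetical):&lt;/p&gt;

```python
# Rows kept sorted on a sort key allow fast range queries, analogous to
# how Redshift can skip blocks outside the requested range.
import bisect

sort_key = [2018, 2019, 2020, 2021, 2022, 2022, 2023]  # sorted "year" column

def range_query(keys, low, high):
    """Return the slice of sort-key values that fall in [low, high]."""
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return keys[lo:hi]

print(range_query(sort_key, 2020, 2022))  # [2020, 2021, 2022, 2022]
```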

&lt;p&gt;&lt;strong&gt;Importing and Exporting Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;UNLOAD command (exporting) — unloads data from a table into files in S3.&lt;br&gt;
COPY command (importing) — reads from multiple data files or streams simultaneously. &lt;br&gt;
Use COPY to load large amounts of data from outside of Redshift.&lt;br&gt;
Gzip and bzip2 compression are supported to speed it up further.&lt;br&gt;
Automatic compression option — analyzes the data being loaded and figures out the optimal compression scheme for storing it.&lt;br&gt;
Special case: narrow tables (lots of rows, few columns); load these with a single COPY transaction if possible.&lt;/p&gt;
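&lt;p&gt;As a sketch of what such a bulk load looks like, here is Python code that assembles a COPY statement using standard Redshift options (IAM_ROLE, FORMAT AS CSV, GZIP). The table, bucket, and IAM role names are placeholders, not real resources:&lt;/p&gt;

```python
# Sketch: assemble a Redshift COPY statement that bulk-loads gzip'd CSVs
# from S3. All identifiers below are placeholders, not real resources.
def build_copy_statement(table, s3_path, iam_role):
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )

sql = build_copy_statement(
    "farm_videos",
    "s3://sampled-buckets/videos/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```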

&lt;p&gt;&lt;strong&gt;Short Query Acceleration (SQA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prioritizes short-running queries over longer-running ones.&lt;br&gt;
Short queries run in a dedicated space and won’t wait in a queue behind long queries.&lt;br&gt;
Can be used in place of Workload Management queues for short queries.&lt;br&gt;
Works with CREATE TABLE AS (CTAS) and read-only queries (SELECT statements).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrency Scaling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automatically adds cluster capacity to handle an increase in concurrent read queries.&lt;br&gt;
Supports virtually unlimited concurrent users &amp;amp; queries.&lt;br&gt;
WLM queues manage which queries are sent to the concurrency scaling cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vacuum Command&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recovers space from deleted rows.&lt;br&gt;
VACUUM FULL — the default vacuum operation; it re-sorts all the rows and reclaims space from deleted rows.&lt;br&gt;
VACUUM DELETE ONLY — only reclaims space from deleted rows, without re-sorting.&lt;br&gt;
VACUUM SORT ONLY — re-sorts the table but does not reclaim disk space.&lt;br&gt;
VACUUM REINDEX — reanalyzes the interleaved sort key columns and then performs a full vacuum operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resizing a Redshift Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Elastic resize&lt;br&gt;
Quickly add or remove nodes of the same type.&lt;br&gt;
The cluster is down for a few minutes.&lt;br&gt;
Redshift tries to keep connections open across the downtime.&lt;br&gt;
Limited to doubling or halving for some dc2 and ra3 node types.&lt;/p&gt;

&lt;p&gt;Classic resize&lt;/p&gt;

&lt;p&gt;Change the node type and/or the number of nodes.&lt;br&gt;
The cluster is read-only for hours to days.&lt;/p&gt;

&lt;p&gt;Snapshot, restore, resize&lt;/p&gt;

&lt;p&gt;Used to keep the cluster available during a classic resize.&lt;br&gt;
Snapshot and copy the cluster, then resize the new cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operations on AWS Redshift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are numerous operations we can perform on the database: query, create, modify, or remove database objects and records, and load and unload data from and to the Simple Storage Service (S3).&lt;/p&gt;

&lt;p&gt;Query&lt;br&gt;
Redshift lets you use the SELECT statement to extract data from tables. You can extract specific columns and restrict rows based on given conditions using the WHERE clause. Data can be sorted in ascending or descending order. Redshift also allows extracting data using joins and subqueries, and calling in-built and user-defined functions.&lt;/p&gt;

&lt;p&gt;Data Manipulation Language (DML)&lt;br&gt;
Redshift allows you to perform transactions using the INSERT, UPDATE, and DELETE commands. DML statements require a COMMIT to be saved permanently in the database, or a ROLLBACK to revert the changes. A set of DML statements is known as a transaction; a transaction is completed when a COMMIT, ROLLBACK, or any DDL statement is performed.&lt;/p&gt;

&lt;p&gt;Loading and Unloading Data&lt;br&gt;
Loading and unloading in Redshift are done with the COPY and UNLOAD commands. COPY loads data from files in S3, while UNLOAD dumps data into S3 buckets in various formats. COPY can load data into Redshift from data files or from multiple data streams simultaneously. Redshift recommends using COPY rather than INSERT for bulk inserts.&lt;/p&gt;

&lt;p&gt;Amazon Redshift splits the results of a SELECT statement across a set of one or more files per node slice to simplify parallel reloading of the data. While unloading data into S3, files can be generated serially or in parallel. UNLOAD can encrypt the data files using Amazon S3 server-side encryption (SSE-S3). &lt;/p&gt;

&lt;p&gt;Data Definition Language&lt;br&gt;
CREATE, ALTER, and DROP, to name a few, can be used to create, modify, and delete databases, schemas, users, and database objects such as tables, views, stored procedures, and user-defined functions. TRUNCATE can be used to delete table data; it is faster than DELETE and releases the space immediately.&lt;/p&gt;

&lt;p&gt;Grant, Revoke&lt;br&gt;
Access can be shared and restricted for different sets of user groups using the GRANT and REVOKE statements. Access can be granted individually or in the form of roles.&lt;/p&gt;

&lt;p&gt;Functions&lt;br&gt;
Functions are database objects with predefined code to perform a specific operation. They are stored in the database as precompiled code and can be used in SELECT statements, DML, and any expression. Functions provide reusability and avoid redundant code. There are two types of functions.&lt;/p&gt;

&lt;p&gt;User-defined functions&lt;br&gt;
Redshift allows you to create a custom user-defined scalar function (UDF) using either a SQL SELECT clause or a Python program. User-defined functions are stored in the database and can be run by any user with sufficient privileges. Functions are created with the CREATE FUNCTION command.&lt;/p&gt;

&lt;p&gt;In-Built Functions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Character functions&lt;/li&gt;
&lt;li&gt;Number and Math functions&lt;/li&gt;
&lt;li&gt;JSON functions&lt;/li&gt;
&lt;li&gt;Date Type formatting functions&lt;/li&gt;
&lt;li&gt;Aggregate/Group Functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stored Procedures&lt;br&gt;
Stored procedures can be created in Redshift using the PostgreSQL procedural language. A stored procedure contains a set of queries and logical conditions in its block. Parameters in procedures can be of IN, OUT, or INOUT type. We can use DML, DDL, and SELECT statements in stored procedures. Stored procedures can be reused, which removes duplicate pieces of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases of RedShift:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trading and Risk Management&lt;br&gt;
To make decisions on future trades, set exposure limits, and mitigate risk against a counterparty. Redshift’s data compression, result caching, and encryption options for securing critical data make it a suitable data warehouse solution for that industry.&lt;/p&gt;

&lt;p&gt;Build a Data Lake for pricing data&lt;br&gt;
The data can help implement price forecasting systems for the oil, gas, and power sectors. Redshift’s columnar storage is a good fit for time series data.&lt;/p&gt;

&lt;p&gt;Supply chain management&lt;br&gt;
Supply chain systems generate huge amounts of data used in planning, scheduling, optimization, and dispatching. Features like parallel processing with powerful node types make Redshift a good option for querying and analyzing such volumes of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, you learned about Boto3, AWS S3, and AWS Redshift. It is quite brief and provides only the basics of these services. Create your own AWS account and practice on your own to understand the concepts clearly.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Upload a File to Google Colab.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Tue, 20 Sep 2022 20:22:32 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/how-to-upload-a-file-to-google-colab-2119</link>
      <guid>https://dev.to/kinyungu_denis/how-to-upload-a-file-to-google-colab-2119</guid>
      <description>&lt;p&gt;To my dear readers: today I discovered Google Colab, a tool that can be very handy when working with huge datasets. In my case, datasets larger than 10 gigabytes are huge, and I would not like my computer fan overworking. There are no prerequisites for this article, just basic knowledge of computers and working on the internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Google Colab?
&lt;/h2&gt;

&lt;p&gt;Google Colab is a tool that allows you to write and execute Python in your browser, with zero configuration required, free access to GPUs, and easy sharing of your code.&lt;br&gt;
Colab is essentially the Google Suite version of a Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;Google Colab can be used by a student, an Artificial Intelligence researcher, a Machine Learning engineer, a Data Scientist, or a Data Engineer.&lt;/p&gt;

&lt;p&gt;You need access to good internet. Go to your favorite browser (Brave is my favorite), type “google colab”, and click on the first link.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zko2k842vytjuclbtb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zko2k842vytjuclbtb7.png" alt="Google Colab Search"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google Colab is easy to use: you can write your Python code, run it, and share it with others, with easier installation of packages and sharing of documents. However, when one wants to upload a file or folder to Google Colab, it is quite a hassle.  &lt;/p&gt;

&lt;h2&gt;
  
  
  To Upload a File or a Folder to Google Colab
&lt;/h2&gt;

&lt;p&gt;Mostly, people download a CSV file, upload it into Google Colab, and read/load the data frame. After a while, one needs to repeat everything again because the data is not stored there anymore. This article solves that issue.&lt;/p&gt;

&lt;p&gt;In this article, I will show you how to use PyDrive to read a file in CSV format directly from your Google Drive using Python3 in the Google Colab environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Step: Install PyDrive
&lt;/h3&gt;

&lt;p&gt;The first step is to install PyDrive in our Colab notebook.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since we are in the Colab environment, our pip command begins with an exclamation mark (!), as is the set standard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0ca3gq04dv68o4kldzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0ca3gq04dv68o4kldzd.png" alt="To install PyDrive"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step Two: Authenticate and Authorize.
&lt;/h3&gt;

&lt;p&gt;We need to authenticate and create a PyDrive client.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtkggbpp7h9yw7ycqa4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtkggbpp7h9yw7ycqa4f.png" alt="Running Authentication for our PyDrive"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you run the above code, it will prompt you to give permission for Google Colab to access your Drive. Click allow and proceed to let Google Colab access your Drive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtk6xyq16smh0we7rkyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtk6xyq16smh0we7rkyk.png" alt="Prompt for Permission"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step Three: generate a shareable link
&lt;/h3&gt;

&lt;p&gt;Once you have completed verification, go to Google Drive&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;find your file and click on it;&lt;/li&gt;
&lt;li&gt;click on the “share” button;&lt;/li&gt;
&lt;li&gt;generate a shareable link with “get link”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The link will be copied into your clipboard; paste it into a string variable in Colab.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step four: Getting the file id
&lt;/h3&gt;

&lt;p&gt;Do not share your link with others, to prevent unauthorized users from accessing your file. The link below is just for demonstration, to help you understand the file id that one needs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

##https://drive.google.com/file/d/25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X/view?usp=sharing


your_file = drive.CreateFile({'id':'25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X'})



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You assign the id to a variable your_file using drive.CreateFile({'id' : 'id_value'}).&lt;/p&gt;
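&lt;p&gt;If you prefer not to copy the id out by hand, a small helper can extract it from a share link, assuming the standard /file/d/.../view shape of Drive links (this helper is my own, not part of PyDrive):&lt;/p&gt;

```python
# A small helper (hypothetical, not part of PyDrive) that pulls the file
# id out of a standard Google Drive share link of the /file/d/<id>/view form.
def drive_file_id(share_link):
    parts = share_link.split("/")
    return parts[parts.index("d") + 1]   # the segment right after "d"

link = "https://drive.google.com/file/d/25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X/view?usp=sharing"
print(drive_file_id(link))  # 25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X
```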

&lt;h3&gt;
  
  
  Step Five: To load the file and show results.
&lt;/h3&gt;

&lt;p&gt;I was uploading a CSV file, so let's see if our process was a success by loading the CSV file and producing an output.&lt;/p&gt;

&lt;p&gt;Indicate the name of the CSV file you want to load into memory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

your_file.GetContentFile('matches.csv')



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I use Pandas to turn this into a DataFrame and display its header. I import pyforest, a package that makes many Python packages available to me, including pandas.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import pyforest 

df = pd.read_csv('matches.csv', delimiter=';' )

df.head()



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kg7jnfqcdtxpogpnl72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kg7jnfqcdtxpogpnl72.png" alt="File uploaded successfully to Google Colab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see in the picture above, the CSV file was uploaded successfully and we were able to operate on the data using pandas. &lt;/p&gt;

&lt;p&gt;Now you know how to upload files and folders into your Google Colab. This saves you from doing everything locally on your machine, and you are able to work comfortably with huge datasets.&lt;/p&gt;

&lt;p&gt;We are still learning data engineering together. To read the article on installing Apache PySpark in Ubuntu, &lt;a href="https://dev.to/kinyungu_denis/to-install-apache-spark-and-run-pyspark-in-ubuntu-2204-4i79"&gt;you can read it here&lt;/a&gt;. Installing PySpark in our local environment was indeed involving.&lt;/p&gt;

&lt;p&gt;In Google Colab, I only have to run the following command to install PySpark and the py4j library: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

!pip install pyspark==3.3.0 py4j==0.10.9.5



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then I move on to using Apache PySpark in my work. To learn about Apache PySpark, &lt;a href="https://dev.to/kinyungu_denis/apache-pyspark-for-data-engineering-3phi"&gt;read it here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This was a short, comprehensive article to solve a challenge I faced. Feel free to leave your comments and suggestions.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>cloud</category>
      <category>tooling</category>
    </item>
    <item>
      <title>SQL for Data Engineering</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Tue, 20 Sep 2022 01:37:50 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/sql-for-data-engineering-50fh</link>
      <guid>https://dev.to/kinyungu_denis/sql-for-data-engineering-50fh</guid>
      <description>&lt;p&gt;In Data Engineering we have large sets of data that will be queried to obtain meaningful results. SQL is heavily used, and writing and executing complex queries is a crucial skill.&lt;/p&gt;

&lt;p&gt;We have various relational databases such as MySQL, SQL Server, PostgreSQL, Oracle Database, and many others. The good thing is they all use the SQL query language, so they do not differ too much. In this post I will use PostgreSQL to write queries. PostgreSQL is an advanced, enterprise-class, open-source relational database system that supports both SQL (relational) and JSON (non-relational) querying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fundamentals in PostgreSQL
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Select, Column Aliases, Order By, Select Distinct, Where, Limit, Fetch, In, Between, Like, Is Null, Table Aliases.&lt;/li&gt;
&lt;li&gt;Joins, Inner Join, Left Join, Self-Join, Full Outer Join, Cross Join, Natural Join&lt;/li&gt;
&lt;li&gt;Group By, Union, Intersect, Having, Grouping Sets, Cube, Rollup, Subquery, Any, All, Exists&lt;/li&gt;
&lt;li&gt;Insert, Insert Multiple Rows, Update, Update Join, Delete, Delete Join, Upsert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SELECT statement has the following clauses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select distinct rows using DISTINCT operator.&lt;/li&gt;
&lt;li&gt;Sort rows using ORDER BY clause.&lt;/li&gt;
&lt;li&gt;Filter rows using WHERE clause.&lt;/li&gt;
&lt;li&gt;Select a subset of rows from a table using LIMIT or FETCH clause.&lt;/li&gt;
&lt;li&gt;Group rows into groups using GROUP BY clause.&lt;/li&gt;
&lt;li&gt;Filter groups using HAVING clause.&lt;/li&gt;
&lt;li&gt;Join with other tables using joins such as INNER JOIN, LEFT JOIN, FULL OUTER JOIN, CROSS JOIN clauses.&lt;/li&gt;
&lt;li&gt;Perform set operations using UNION, INTERSECT, and EXCEPT.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
   select_list
FROM
   table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT first_name, last_name, goods_bought FROM customer;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To select data from all columns of a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM consumer_reports;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, it is not a good practice to use the asterisk (*) in the SELECT statement when you embed SQL statements in the application code. It is a good practice to explicitly specify the column names in the SELECT clause whenever possible to get only necessary data from the database.&lt;/p&gt;

&lt;p&gt;A column alias allows you to assign a column or an expression in the select list of a SELECT statement a temporary name. The column alias exists temporarily during the execution of the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_name AS alias_name
FROM table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name || ' ' || last_name AS full_name
FROM
    consumer_reports;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
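&lt;p&gt;The queries in this post use PostgreSQL, but the AS alias and || concatenation above also work in SQLite, so you can try them with Python's built-in sqlite3 module (the sample row is made up):&lt;/p&gt;

```python
# Runnable illustration of a column alias using Python's stdlib sqlite3;
# the || concatenation and AS alias behave the same way as in PostgreSQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consumer_reports (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO consumer_reports VALUES ('Kinyungu', 'Denis')")

row = conn.execute(
    "SELECT first_name || ' ' || last_name AS full_name FROM consumer_reports"
).fetchone()
print(row[0])  # Kinyungu Denis
```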



&lt;p&gt;The ORDER BY clause allows you to sort rows returned by a SELECT clause in ascending or descending order based on a sort expression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    select_list
FROM
    table_name
ORDER BY
    sort_expression1 [ASC | DESC],
        ...
        ...
    sort_expressionN [ASC | DESC];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name,
    last_name
FROM
    consumer_reports
ORDER BY
    first_name DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DISTINCT clause is used in the SELECT statement to remove duplicate rows from a result set. The DISTINCT clause keeps one row for each group of duplicates. The DISTINCT clause can be applied to one or more columns in the select list of the SELECT statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
   DISTINCT column1, column2, column3, column4
FROM
   table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    DISTINCT shape,
    color
FROM
    records
ORDER BY
    color;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SELECT statement returns all rows from one or more columns in a table. To select only rows that satisfy a specified condition, you use a WHERE clause. The WHERE clause filters the rows returned by a SELECT statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_name
WHERE condition
ORDER BY sort_expression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To form the condition in the WHERE clause, you use comparison and logical operators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AND -- logical AND; true if both operands are true&lt;/li&gt;
&lt;li&gt;OR -- logical OR; true if either operand is true&lt;/li&gt;
&lt;li&gt;IN -- returns true if a value matches any value in a list&lt;/li&gt;
&lt;li&gt;BETWEEN -- returns true if a value is within a range of values&lt;/li&gt;
&lt;li&gt;LIKE -- returns true if a value matches a pattern&lt;/li&gt;
&lt;li&gt;IS NULL -- returns true if a value is NULL&lt;/li&gt;
&lt;li&gt;NOT -- negates the result of other operators&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    last_name,
    first_name
FROM
    consumer_records
WHERE
    first_name = 'Brian' AND 
        last_name = 'Kamau';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name,
    last_name
FROM
    customer
WHERE 
    first_name IN ('Brian','Kelvin','Martin');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL LIMIT is an optional clause of the SELECT statement that constrains the number of rows returned by the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list 
FROM table_name
ORDER BY sort_expression
LIMIT row_count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to skip a number of rows before returning the row_count rows, you place the OFFSET clause after the LIMIT clause, as in the following statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_name
LIMIT row_count OFFSET row_to_skip;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query displays the first 20 rows from the film table, ordered by film_id in descending order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    release_year
FROM
    film
ORDER BY
    film_id DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L5odlBaX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ev1kwrpqreijr5p8d2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L5odlBaX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ev1kwrpqreijr5p8d2t.png" alt="20 rows in Descending order" width="583" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query skips the first 15 rows, then displays only the next 20 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    release_year
FROM
    film
ORDER BY
    film_id
LIMIT 20 OFFSET 15;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7L-PPLmH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nsojbuawu3xipkmqbguf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7L-PPLmH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nsojbuawu3xipkmqbguf.png" alt="Skips 15 rows then Limit 20 rows" width="772" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To constrain the number of rows returned by a query, you often use the LIMIT clause. However, the LIMIT clause is not part of the SQL standard. To conform with the standard, PostgreSQL also supports the FETCH clause for retrieving a subset of the rows returned by a query.&lt;/p&gt;

&lt;p&gt;Syntax of the PostgreSQL FETCH clause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OFFSET start { ROW | ROWS }
FETCH { FIRST | NEXT } [ row_count ] { ROW | ROWS } ONLY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this syntax:&lt;br&gt;
ROW is a synonym for ROWS and FIRST is a synonym for NEXT; you can use them interchangeably.&lt;br&gt;
The start is an integer that must be zero or positive.&lt;br&gt;
The row_count must be 1 or greater.&lt;/p&gt;

&lt;p&gt;This query will skip the first 20 rows then proceed to display the next 20 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title
FROM
    film
ORDER BY
    title 
OFFSET 20 ROWS 
FETCH FIRST 20 ROWS ONLY; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OeESnJKF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4tzjev3txzmbenlhnu7u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OeESnJKF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4tzjev3txzmbenlhnu7u.png" alt="Fetch 20 rows after skipping 20 rows" width="646" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You use the IN operator in the WHERE clause to check whether a value matches any value in a list of values.&lt;/p&gt;

&lt;p&gt;The syntax of the IN operator is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value IN (value1,value2,value3, value4, ..., valueN)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns the first 15 rows that have a customer_id of 1 or 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customer_id,
    rental_id,
    return_date
FROM
    rental
WHERE
    customer_id IN (1, 2)
ORDER BY
    return_date DESC
FETCH FIRST 15 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I8RYJ2DG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jie14x8nfmlxn1gzgadw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I8RYJ2DG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jie14x8nfmlxn1gzgadw.png" alt="In operator example" width="685" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can combine the IN operator with the NOT operator to select rows whose values do not match the values in the list.&lt;br&gt;
The following query finds all rentals whose customer_id is not 1 or 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    rental_id,
    return_date
FROM
    rental
WHERE
    customer_id NOT IN (1, 2)
FETCH NEXT 20 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QZZCf8HV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kf363ybqu2ha8kpwsb7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QZZCf8HV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kf363ybqu2ha8kpwsb7r.png" alt="Using NOT IN operator" width="784" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You use the BETWEEN operator to match a value against a range of values. The syntax of the BETWEEN operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value BETWEEN low AND high;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check if a value is out of a range, you combine the NOT operator with the BETWEEN operator as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value NOT BETWEEN low AND high;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The BETWEEN operator is most often used in the WHERE clause.&lt;/p&gt;

&lt;p&gt;This query returns the first 15 rows where the amount is between 10 and 12.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    payment_id,
    amount
FROM
    payment
WHERE
    amount BETWEEN 10 AND 12
FETCH FIRST 15 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W2QvH1Y3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/97nhtge3nniwabcic62o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W2QvH1Y3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/97nhtge3nniwabcic62o.png" alt="Between Operator" width="602" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query returns the first 15 rows that do not meet the condition in the WHERE clause, that is, rows whose amount is not between 10 and 12.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    payment_id,
    amount
FROM
    payment
WHERE
    amount NOT BETWEEN 10 AND 12
FETCH FIRST 15 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cb0EtQ86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5lzspdlm7yfvpxs0hgr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cb0EtQ86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5lzspdlm7yfvpxs0hgr2.png" alt="Not Between Operator" width="681" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You construct a pattern by combining literal values with wildcard characters, then use the LIKE or NOT LIKE operator to find matches. PostgreSQL provides you with two wildcards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Percent sign ( %) matches any sequence of zero or more characters.&lt;/li&gt;
&lt;li&gt;Underscore sign ( _)  matches any single character.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value LIKE pattern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL also supports the ILIKE operator, which works like the LIKE operator but matches values case-insensitively.&lt;/p&gt;
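&lt;p&gt;As a small illustrative sketch using the same customer table, the following query matches first names that begin with 'bar' in any letter case, such as 'Barbara' or 'BARRY':&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name,
    last_name
FROM
    customer
WHERE
    first_name ILIKE 'bar%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;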

&lt;p&gt;This query returns all first names that contain 'er', ordered by first_name; it then skips the first 5 rows and fetches the next 20.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name,
        last_name
FROM
    customer
WHERE
    first_name LIKE '%er%'
ORDER BY 
        first_name
OFFSET 5 ROWS
FETCH FIRST 20 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rI_uvvCP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0xgfbslcucalndfnkdog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rI_uvvCP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0xgfbslcucalndfnkdog.png" alt="Like operator query result" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In databases, NULL means missing information or not applicable. NULL is not a value, therefore you cannot compare it with other values such as numbers or strings.&lt;br&gt;
To check whether a value is not NULL, you use the IS NOT NULL operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value IS NOT NULL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
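&lt;p&gt;As a sketch using the rental table from the earlier examples (where return_date may be NULL for rentals that have not yet come back), the following query lists only the rentals that have already been returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    rental_id,
    customer_id,
    return_date
FROM
    rental
WHERE
    return_date IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;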



&lt;p&gt;Table aliases temporarily assign tables new names during the execution of a query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table_name AS alias_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
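&lt;p&gt;For example, the alias c below lets you refer to the customer table by a shorter name; this becomes especially useful in joins where the same column name appears in several tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    c.first_name,
    c.last_name
FROM
    customer AS c
ORDER BY
    c.first_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;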



&lt;h3&gt;
  
  
  Inner Join
&lt;/h3&gt;

&lt;p&gt;In a relational database, data is typically distributed across more than one table. To select complete data, you often need to query data from multiple tables. Let us learn how to combine data from multiple tables using the INNER JOIN clause.&lt;/p&gt;

&lt;p&gt;Suppose that there are two tables car and manufacturer. The table car has a column model whose value matches with values in the make column of table manufacturer. To select data from both tables, you use the INNER JOIN clause in the SELECT statement as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    model,
    year,
    make,
    origin
FROM
    Car
INNER JOIN manufacturer ON model = make;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;To join table car with the table manufacturer, you follow these steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, specify columns from both tables that you want to select data in the SELECT clause.&lt;/li&gt;
&lt;li&gt;Second, specify the main table for example table car in the FROM clause.&lt;/li&gt;
&lt;li&gt;Third, specify the second table (table manufacturer) in the INNER JOIN clause and provide a join condition after the ON keyword.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How the INNER JOIN works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each row in the table car, inner join compares the value in the model column with the value in the make column of every row in the table manufacturer:&lt;br&gt;
If these values are equal, the inner join creates a new row that contains all columns of both tables and adds it to the result set.&lt;br&gt;
In case these values are not equal, the inner join just ignores them and moves to the next row.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YdTrYyvX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/trhsis5six5nu810isk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YdTrYyvX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/trhsis5six5nu810isk6.png" alt="Inner Join venn diagram" width="371" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query returns the customer with an id of 3, along with the amount and date of each of their payments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    c.customer_id,
    first_name,
    last_name,
    amount,
    payment_date
FROM
    customer c
INNER JOIN payment p 
    ON p.customer_id = c.customer_id
WHERE
    c.customer_id = 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DHd3TYZX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ozktqol45k7d4lsipfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DHd3TYZX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ozktqol45k7d4lsipfc.png" alt="Inner Join operator" width="880" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Left Join
&lt;/h3&gt;

&lt;p&gt;There are two tables, car and manufacturer. Each row in the table car may have zero or many corresponding rows in the table manufacturer, while each row in the table manufacturer has one and only one corresponding row in the table car.&lt;br&gt;
To select data from the table car that may or may not have corresponding rows in the table manufacturer, you use the LEFT JOIN clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    model,
    year,
    make,
    origin
FROM
    Car
LEFT JOIN manufacturer ON model = make;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To join the table car with the manufacturer table using a left join:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, specify the columns in both tables from which you want to select data in the SELECT clause.&lt;/li&gt;
&lt;li&gt;Second, specify the left table (table car) in the FROM clause.&lt;/li&gt;
&lt;li&gt;Third, specify the right table (table manufacturer) in the LEFT JOIN clause and the join condition after the ON keyword.&lt;/li&gt;
&lt;li&gt;The LEFT JOIN clause starts selecting data from the left table. For each row in the left table, it compares the value in the model column with the value of each row in the make column in the right table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If these values are equal, the left join clause creates a new row that contains columns that appear in the SELECT clause and adds this row to the result set.&lt;br&gt;
In case these values are not equal, the left join clause also creates a new row that contains columns that appear in the SELECT clause. In addition, it fills the columns that come from the right table with NULL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QoDMRgPm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lfhv62eiytqbh2vb1c03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QoDMRgPm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lfhv62eiytqbh2vb1c03.png" alt="Left join" width="371" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query uses a LEFT JOIN to join the film table with the inventory table, then returns the first 25 films that are not in the inventory (their inventory_id is NULL).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    f.film_id,
    title,
    inventory_id
FROM
    film f
LEFT JOIN inventory i
   ON i.film_id = f.film_id
WHERE i.film_id IS NULL
ORDER BY title
FETCH FIRST 25 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WCkWSWkG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hwlk288762g0ykyb3j14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WCkWSWkG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hwlk288762g0ykyb3j14.png" alt="Left outer join operator" width="779" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Join
&lt;/h3&gt;

&lt;p&gt;A self-join is a regular join that joins a table to itself. Self-joins are used to query hierarchical data or to compare rows within the same table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_name t1
INNER JOIN table_name t2 ON join_predicate;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the syntax above, the table is joined to itself using the INNER JOIN clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_name t1
LEFT JOIN table_name t2 ON join_predicate;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this variant, the table is joined to itself using the LEFT JOIN clause.&lt;/p&gt;

&lt;p&gt;This query finds all pairs of distinct films that have the same length. It skips the first 15 rows, then returns the next 25 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT
    f1.title,
    f2.title,
    f1.length
FROM
    film f1
INNER JOIN film f2 
    ON f1.film_id &amp;lt;&amp;gt; f2.film_id AND 
       f1.length = f2.length
OFFSET 15 ROWS
FETCH FIRST 25 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Efy5jQk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xaqa9it6eglmfibhtw4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Efy5jQk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xaqa9it6eglmfibhtw4l.png" alt="Self-join operation" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Outer Join
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--11974Rlo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vni2opohilw3g56ck8n8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--11974Rlo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vni2opohilw3g56ck8n8.png" alt="Full outer Join" width="364" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full outer join combines the results of a left join and a right join: every row from both tables appears in the result set. Syntax of the full outer join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM car
FULL [OUTER] JOIN manufacturer ON car.id = manufacturer.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the rows in the joined table do not match, the full outer join sets NULL values for every column of the table that does not have the matching row.&lt;br&gt;
If a row from one table matches a row in another table, the result row will contain columns populated from columns of rows from both tables.&lt;/p&gt;
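&lt;p&gt;As an illustrative sketch with the car and manufacturer tables used above, this query keeps every car and every manufacturer; where one side has no match, its columns are filled with NULL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    model,
    make
FROM
    car
FULL OUTER JOIN manufacturer ON car.id = manufacturer.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;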
&lt;h3&gt;
  
  
  Cross Join
&lt;/h3&gt;

&lt;p&gt;The CROSS JOIN clause allows you to produce a Cartesian product of the rows in two or more tables.&lt;/p&gt;

&lt;p&gt;Cross Join Syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table1
CROSS JOIN table2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This statement is similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM T1, T2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
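&lt;p&gt;As a sketch with two hypothetical tables, sizes and colors: if sizes has 3 rows and colors has 4 rows, the cross join below returns every combination, 12 rows in total:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    s.size,
    c.color
FROM
    sizes s
CROSS JOIN colors c;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;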



&lt;h3&gt;
  
  
  Natural Join
&lt;/h3&gt;

&lt;p&gt;A natural join creates an implicit join condition based on columns with the same names in the joined tables.&lt;br&gt;
A natural join can be an inner join, left join, or right join. If you do not specify a join type explicitly, PostgreSQL uses INNER JOIN by default.&lt;/p&gt;

&lt;p&gt;Natural Join syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table1
NATURAL [INNER | LEFT | RIGHT] JOIN table2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The convenience of the NATURAL JOIN is that it does not require you to specify the join clause because it uses an implicit join clause based on the common column.&lt;br&gt;
However, avoid using the NATURAL JOIN whenever possible because sometimes it may cause an unexpected result.&lt;/p&gt;
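&lt;p&gt;As a sketch with two hypothetical tables, products and categories, that share only a category_id column, the natural join below implicitly joins on category_id. Note that if the tables happened to share another column (for example a last_update column), it would silently be included in the join condition, which is exactly the kind of unexpected result to watch for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    product_name,
    category_name
FROM
    products
NATURAL INNER JOIN categories;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;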
&lt;h3&gt;
  
  
  Group By
&lt;/h3&gt;

&lt;p&gt;The GROUP BY clause divides the rows returned from the SELECT statement into groups. For each group, you can apply an aggregate function, for example SUM() to calculate the sum of items or COUNT() to get the number of items in each group.&lt;/p&gt;

&lt;p&gt;Basic Syntax of Group By:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
   column_1, 
   column_2,
   ...,
   aggregate_function(column_n)
FROM 
   table_name
GROUP BY 
   column_1,
   column_2,
   ...
   column_n;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns the first 25 rows from the payment table, grouped by customer_id and ordered by the total amount in descending order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    SUM (amount)
FROM
    payment
GROUP BY
    customer_id
ORDER BY
    SUM (amount) DESC
FETCH FIRST 25 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sLDsd0TD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fgqayr9q8fbqaebo8666.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sLDsd0TD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fgqayr9q8fbqaebo8666.png" alt="Group By operation" width="597" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use multiple columns with GROUP BY. In this query, the GROUP BY clause divides the rows in the payment table by the values in the customer_id and staff_id columns, and SUM() calculates the total amount for each group. The result is ordered by customer_id in ascending order, and the first 30 rows are fetched.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    customer_id, 
    staff_id, 
    SUM(amount) 
FROM 
    payment
GROUP BY 
    staff_id, 
    customer_id
ORDER BY 
    customer_id
FETCH FIRST 30 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0f1SRtvL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7gsxyq8jz1yegh8sbm21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0f1SRtvL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7gsxyq8jz1yegh8sbm21.png" alt="Group By operation with multiple columns" width="832" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Having
&lt;/h3&gt;

&lt;p&gt;HAVING clause specifies a search condition for a group or an aggregate. The HAVING clause is often used with the GROUP BY clause to filter groups or aggregates based on a specified condition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    column_1,
        column_2,
        ...
    aggregate_function (column_n)
FROM
    table_name
GROUP BY
    column_1
HAVING
    condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL evaluates the HAVING clause after the FROM, WHERE, GROUP BY, and before the SELECT, DISTINCT, ORDER BY and LIMIT clauses.&lt;/p&gt;

&lt;p&gt;Since the HAVING clause is evaluated before the SELECT clause, you cannot use column aliases in the HAVING clause: at the time the HAVING clause is evaluated, the column aliases specified in the SELECT clause are not yet available.&lt;/p&gt;

&lt;p&gt;The WHERE clause allows you to filter rows based on a specified condition. However, the HAVING clause allows you to filter groups of rows according to a specified condition.&lt;br&gt;
The WHERE clause is applied to rows while the HAVING clause is applied to groups of rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    SUM (amount)
FROM
    payment
GROUP BY
    customer_id
HAVING
    SUM (amount) &amp;gt; 150;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JM5VEPsj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obz9or3sn4k4x4oaa3vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JM5VEPsj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obz9or3sn4k4x4oaa3vn.png" alt="Using Having in a query" width="562" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Union Operator
&lt;/h3&gt;

&lt;p&gt;The UNION operator combines the result sets of two or more SELECT statements into a single result set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list_1
FROM table_1
UNION
SELECT select_list_2
FROM table_2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To combine the result sets of two queries using the UNION operator, ensure that:&lt;br&gt;
the number and the order of the columns in the select lists of both queries are the same, and the data types are compatible.&lt;/p&gt;

&lt;p&gt;The UNION operator removes all duplicate rows from the combined data set. Use UNION ALL to retain duplicate rows.&lt;/p&gt;
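&lt;p&gt;As a sketch assuming two hypothetical tables, top_rated_films and most_popular_films, each with a title column, the following query returns every title that appears in either table, with duplicates removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT title FROM top_rated_films
UNION
SELECT title FROM most_popular_films;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;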
&lt;h3&gt;
  
  
  Intersect Operator
&lt;/h3&gt;

&lt;p&gt;The PostgreSQL INTERSECT operator combines the result sets of two or more SELECT statements into a single result set.&lt;br&gt;
The INTERSECT operator returns only the rows that appear in both result sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_1
INTERSECT
SELECT select_list
FROM table_2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using the INTERSECT operator, the number of columns and their order in the SELECT clauses must be the same, and the data types of the columns must be compatible.&lt;/p&gt;
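&lt;p&gt;As a sketch assuming two hypothetical tables, top_rated_films and most_popular_films, each with a title column, this query returns only the titles that appear in both tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT title FROM top_rated_films
INTERSECT
SELECT title FROM most_popular_films;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;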

&lt;h3&gt;
  
  
  Except
&lt;/h3&gt;

&lt;p&gt;The EXCEPT operator returns rows by comparing the result sets of two or more queries. It returns the distinct rows from the first (left) query that are not in the output of the second (right) query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_1
EXCEPT 
SELECT select_list
FROM table_2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
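
&lt;p&gt;For example, assuming hypothetical tables sales_2021 and sales_2022 with the same columns, this sketch returns the products that appear in sales_2021 but not in sales_2022:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- sales_2021 and sales_2022 are hypothetical tables
SELECT product_name
FROM sales_2021
EXCEPT
SELECT product_name
FROM sales_2022;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;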



&lt;h3&gt;
  
  
  RollUp
&lt;/h3&gt;

&lt;p&gt;PostgreSQL ROLLUP is a subclause of the GROUP BY clause that offers a shorthand for defining multiple grouping sets. A grouping set is a set of columns by which you group. &lt;br&gt;
ROLLUP assumes a hierarchy among the input columns and generates all grouping sets that make sense considering the hierarchy. ROLLUP is often used to generate the subtotals and the grand total for reports.&lt;/p&gt;

&lt;p&gt;This query finds the number of rentals per day, month, and year by using ROLLUP. It skips the first 15 rows, then fetches the 25 rows that follow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    EXTRACT (YEAR FROM rental_date) y,
    EXTRACT (MONTH FROM rental_date) M,
    EXTRACT (DAY FROM rental_date) d,
    COUNT (rental_id)
FROM
    rental
GROUP BY
    ROLLUP (
        EXTRACT (YEAR FROM rental_date),
        EXTRACT (MONTH FROM rental_date),
        EXTRACT (DAY FROM rental_date)
    )
OFFSET 15
FETCH FIRST 25 ROWS ONLY;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WF5tddIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/li1nmilbe7gw90dzkau3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WF5tddIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/li1nmilbe7gw90dzkau3.png" alt="Results of rollup operation" width="608" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cube
&lt;/h3&gt;

&lt;p&gt;PostgreSQL CUBE is a subclause of the GROUP BY clause that allows you to generate multiple grouping sets. A grouping set is a set of columns by which you group.&lt;/p&gt;

&lt;p&gt;This query generates all possible grouping sets based on the dimension columns specified in CUBE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    c1, c2, c3,
    aggregate (c4)
FROM
    table_name
GROUP BY
    CUBE (c1, c2, c3);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
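
&lt;p&gt;Applied to the film table from the ROLLUP example, this sketch generates counts for every combination of rating and release_year, including subtotals for each dimension alone and a grand total:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    rating,
    release_year,
    COUNT (film_id)
FROM
    film
GROUP BY
    CUBE (rating, release_year);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;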



&lt;h3&gt;
  
  
  Subquery
&lt;/h3&gt;

&lt;p&gt;A subquery is a query nested inside another statement such as SELECT, INSERT, DELETE or UPDATE.&lt;br&gt;
The query inside the parentheses is called the subquery, and the query that contains the subquery is known as the outer query.&lt;/p&gt;

&lt;p&gt;PostgreSQL executes a statement that contains a subquery in the following sequence:&lt;br&gt;
First, it executes the subquery and passes the result to the outer query. Then it executes the outer query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    rental_rate
FROM
    film
WHERE
    rental_rate &amp;gt; (
        SELECT
            AVG (rental_rate)
        FROM
            film
    )
FETCH FIRST 30 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ezdJ1ATU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4prf3yqgdivwyxnrsmfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ezdJ1ATU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4prf3yqgdivwyxnrsmfv.png" alt="Subquery in WHERE clause" width="872" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following query gets films whose return date is between 2005-05-29 and 2005-05-30. Then 30 rows are skipped and the 30 rows that follow are returned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title
FROM
    film f
WHERE
    film_id IN (
        SELECT
            i.film_id
        FROM
            rental r
        INNER JOIN inventory i ON i.inventory_id = 
                      r.inventory_id
        WHERE
            return_date BETWEEN '2005-05-29'
        AND '2005-05-30'
    )
OFFSET 30 ROWS FETCH FIRST 30 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ppC4tnbL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ly9c13fg1brqsvb6gim8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ppC4tnbL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ly9c13fg1brqsvb6gim8.png" alt="Subquery with IN clause" width="866" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  All Operator
&lt;/h3&gt;

&lt;p&gt;The ALL operator allows you to query data by comparing a value with a list of values returned by a subquery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;comparison_operator ALL (subquery)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ALL operator must be preceded by a comparison operator such as equal (=), not equal (!=), greater than (&amp;gt;), greater than or equal to (&amp;gt;=), less than (&amp;lt;), or less than or equal to (&amp;lt;=), and followed by a subquery, which must be surrounded by parentheses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;column_name &amp;gt; ALL (subquery) the expression evaluates to true if a value is greater than the biggest value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name &amp;gt;= ALL (subquery) the expression evaluates to true if a value is greater than or equal to the biggest value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name &amp;lt; ALL (subquery) the expression evaluates to true if a value is less than the smallest value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name &amp;lt;= ALL (subquery) the expression evaluates to true if a value is less than or equal to the smallest value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name = ALL (subquery) the expression evaluates to true if a value is equal to every value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name != ALL (subquery) the expression evaluates to true if a value is not equal to any value returned by the subquery.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    length
FROM
    film
WHERE
    length &amp;gt; ALL (
            SELECT
                ROUND(AVG (length),2)
            FROM
                film
            GROUP BY
                rating
    )
ORDER BY
    length
FETCH FIRST 25 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KgEzw08G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhi70mdp07wlb5kwnzhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KgEzw08G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhi70mdp07wlb5kwnzhh.png" alt="Using ALL in Subquery" width="768" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Exists Operator
&lt;/h3&gt;

&lt;p&gt;The EXISTS operator is a boolean operator that tests for the existence of rows in a subquery. It accepts an argument which is a subquery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXISTS (subquery)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    column1, column2
FROM 
    table_1
WHERE 
    EXISTS( SELECT 
                1 
            FROM 
                table_2 
            WHERE 
                column_2 = table_1.column_1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns customers who have paid at least one rental with an amount greater than 15.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT first_name,
       last_name
FROM customer c
WHERE EXISTS
    (SELECT 1
     FROM payment p
     WHERE p.customer_id = c.customer_id
       AND amount &amp;gt; 15 )
ORDER BY first_name,
         last_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3AjOxHLt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chlqjz55v0s8hzbtmisb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3AjOxHLt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chlqjz55v0s8hzbtmisb.png" alt="Exists Subquery" width="663" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Insert
&lt;/h3&gt;

&lt;p&gt;The INSERT statement allows you to insert a new row into a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO table_name(column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, this inserts a row into a table called links, supplying values for the url, name, and last_modified columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO links (url, name, last_modified)
VALUES('https://www.dev.to','DEV','2022-09-20');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert Multiple Rows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO table_name (column_list)
VALUES
    (value_list_1),
    (value_list_2),
    (value_list_3),
    ...
    (value_list_n);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO 
    links (url, name, last_modified)
VALUES
    ('https://www.tradingview.com', 'tradingview', '2022-09-15'),
    ('https://www.codenewbie.com','codenewbie', '2022-09-18'),
    ('https://www.forem.com','Forem', '2022-09-20'),
    ('https://www.bitbucket.com', 'Bitbucket', '2022-09-20');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Update
&lt;/h3&gt;

&lt;p&gt;The UPDATE statement allows you to modify data in a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE table_name
SET column1 = value1,
    column2 = value2,
    column3 = value3
    ...
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE subjects
SET published_date = '2022-08-15' 
WHERE subject_id = 231;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns the following message after one row has been updated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Delete
&lt;/h3&gt;

&lt;p&gt;The DELETE statement allows you to delete one or more rows from a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM table_name
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will delete the row where the id is 7.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM links
WHERE id = 7;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query deletes all the rows in the links table since we did not specify a WHERE clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM links;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To delete all contacts whose phones are in the blacklist table, use a subquery: the subquery returns a list of phones from the blacklist table, and the DELETE statement deletes the contacts whose phones match the phones returned by the subquery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM contacts
WHERE phone IN (SELECT phone FROM blacklist);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Upsert
&lt;/h3&gt;

&lt;p&gt;Also referred to as merge: when you insert a new row into the table, PostgreSQL will update the row if it already exists; otherwise, it will insert the new row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO table_name(column_list) 
VALUES(value_list)
ON CONFLICT target action;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO customers (name, email)
VALUES('tradingview','hotline@tradingview') 
ON CONFLICT (name) 
DO NOTHING;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO customers (name, email)
VALUES('tradingview','hotline@tradingview') 
ON CONFLICT (name) 
DO 
   UPDATE SET email = EXCLUDED.email || ';' || customers.email;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Common Table Expressions (CTE)
&lt;/h3&gt;

&lt;p&gt;A common table expression is a temporary result set which you can reference within another SQL statement, including SELECT, INSERT, UPDATE or DELETE. CTEs only exist during the execution of the query and are used to simplify complex joins and subqueries in PostgreSQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH cte_name (column_list) AS (
    CTE_query_definition 
)
statement;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Advantages of using CTEs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve the readability of complex queries.&lt;/li&gt;
&lt;li&gt;Ability to create recursive queries, queries that reference themselves. &lt;/li&gt;
&lt;li&gt;Use CTEs in conjunction with window functions to create an initial result set and use another select statement to further process this result set.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this query, the CTE returns a result set that includes the staff id and the number of rentals. The outer query then joins the staff table with the CTE using the staff_id column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH cte_rental AS (
    SELECT staff_id,
        COUNT(rental_id) rental_count
    FROM   rental
    GROUP  BY staff_id
)
SELECT s.staff_id,
    first_name,
    last_name,
    rental_count
FROM staff s
    INNER JOIN cte_rental USING (staff_id); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recursive Query
&lt;/h3&gt;

&lt;p&gt;A recursive CTE has three elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-recursive term: a CTE query definition that forms the base result set of the CTE structure.&lt;/li&gt;
&lt;li&gt;Recursive term: one or more CTE query definitions joined with the non-recursive term using the UNION or UNION ALL operator.&lt;/li&gt;
&lt;li&gt;Termination check: the recursion stops when no rows are returned from the previous iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sequence that PostgreSQL executes a recursive CTE: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute the non-recursive term to create the base result set &lt;/li&gt;
&lt;li&gt;Execute recursive term with Ri as an input to return the result set Ri+1 as the output.&lt;/li&gt;
&lt;li&gt;Repeat step 2 until an empty set is returned (termination check)&lt;/li&gt;
&lt;li&gt;Return the final result set that is a UNION or UNION ALL of the result set
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH RECURSIVE cte_name AS(
    CTE_query_definition -- non-recursive term
    UNION [ALL]
    CTE_query_definition  -- recursive term
) SELECT * FROM cte_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
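
&lt;p&gt;A minimal, self-contained sketch: this recursive CTE generates the numbers 1 through 5. The non-recursive term produces the starting row, and the recursive term adds 1 to the previous result until the termination check returns no rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH RECURSIVE counter AS (
    SELECT 1 AS n       -- non-recursive term: base result set
    UNION ALL
    SELECT n + 1        -- recursive term
    FROM counter
    WHERE n &amp;lt; 5        -- termination check
)
SELECT n FROM counter;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;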



&lt;p&gt;I trust you have understood the content that we have covered so far. Let's gear up and continue learning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QkMfY28C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eq92ali1gr8vvbgkzuug.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QkMfY28C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eq92ali1gr8vvbgkzuug.gif" alt="Clapping for yourself" width="500" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Tables
&lt;/h2&gt;

&lt;p&gt;PostgreSQL Data Types, Create Table, Select Into, Create Table As, Serial, Sequences, Identity Column, Alter Table, Rename Table, Add Column, Drop Column, Change Column’s Data Type, Rename Column, Drop Table, Temporary Table, Truncate Table&lt;/p&gt;

&lt;h3&gt;
  
  
  Transaction
&lt;/h3&gt;

&lt;p&gt;A database transaction is a single unit of work that consists of one or more operations. A PostgreSQL transaction is atomic, consistent, isolated, and durable. These properties are often referred to as ACID:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomicity guarantees that the transaction completes in an all-or-nothing manner.&lt;/li&gt;
&lt;li&gt;Consistency ensures the change to data written to the database must be valid and follow predefined rules.&lt;/li&gt;
&lt;li&gt;Isolation determines how transaction integrity is visible to other transactions.&lt;/li&gt;
&lt;li&gt;Durability makes sure that transactions that have been committed will be stored in the database permanently.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- start a transaction
BEGIN;

-- insert a new row into the accounts table
INSERT INTO accounts(name,balance)
VALUES('Alice',10000);

-- commit the change (or roll it back later)
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- start a transaction
BEGIN;

-- deduct 1000 from account 1
UPDATE accounts 
SET balance = balance - 1000
WHERE id = 1;

-- add 1000 to account 2
UPDATE accounts
SET balance = balance + 1000
WHERE id = 2; 

-- select the data from accounts
SELECT id, name, balance
FROM accounts;

-- commit the transaction
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- begin the transaction
BEGIN;

-- deduct the amount from the account 1
UPDATE accounts 
SET balance = balance - 1500
WHERE id = 1;

-- add the amount from the account 3 (instead of 2)
UPDATE accounts
SET balance = balance + 1500
WHERE id = 3; 

-- roll back the transaction
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The easiest way to export the data of a table to a CSV file is to use the COPY statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY persons TO '/home/exporter/persons_db.csv' DELIMITER ',' CSV HEADER;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
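
&lt;p&gt;The reverse direction works the same way: COPY ... FROM imports a CSV file into an existing table (the file path here is hypothetical, and the server process must be able to read it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY persons FROM '/home/exporter/persons_db.csv' DELIMITER ',' CSV HEADER;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;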



&lt;p&gt;A relational database consists of multiple related tables. A table consists of rows and columns. Tables allow you to store structured data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE [IF NOT EXISTS] table_name (
   column1 datatype(length) column_constraint,
   column2 datatype(length) column_constraint,
   column3 datatype(length) column_constraint,
   table_constraints
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Column Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NOT NULL – ensures that values in a column cannot be NULL.&lt;/li&gt;
&lt;li&gt;UNIQUE – ensures the values in a column are unique across the rows within the same table.&lt;/li&gt;
&lt;li&gt;PRIMARY KEY – a primary key column uniquely identifies rows in a table. A table can have one and only one primary key.&lt;/li&gt;
&lt;li&gt;FOREIGN KEY – ensures values in a column or a group of columns from a table exists in a column or group of columns in another table. Unlike the primary key, a table can have many foreign keys.&lt;/li&gt;
&lt;li&gt;CHECK – a CHECK constraint ensures the data must satisfy a boolean expression.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE accounts (
    user_id serial PRIMARY KEY,
    username VARCHAR ( 50 ) UNIQUE NOT NULL,
    password VARCHAR ( 50 ) NOT NULL,
    email VARCHAR ( 255 ) UNIQUE NOT NULL,
    created_on TIMESTAMP NOT NULL,
    last_login TIMESTAMP
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE roles(
   role_id serial PRIMARY KEY,
   role_name VARCHAR (255) UNIQUE NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE account_roles (
  user_id INT NOT NULL,
  role_id INT NOT NULL,
  grant_date TIMESTAMP,
  PRIMARY KEY (user_id, role_id),
  FOREIGN KEY (role_id)
      REFERENCES roles (role_id),
  FOREIGN KEY (user_id)
      REFERENCES accounts (user_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    length 
INTO TEMP TABLE short_film
FROM
    film
WHERE
    length &amp;lt; 60
ORDER BY
    title;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM short_film;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE action_film AS
SELECT
    film_id,
    title,
    release_year,
    length,
    rating
FROM
    film
INNER JOIN film_category USING (film_id)
WHERE
    category_id = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM action_film
ORDER BY title;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Serial
&lt;/h3&gt;

&lt;p&gt;The SERIAL pseudo-type creates an auto-incrementing integer column. Behind the scenes, PostgreSQL creates a sequence and uses it to supply the column's default values.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE table_name(
    id SERIAL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE fruits(
   id SERIAL PRIMARY KEY,
   name VARCHAR NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO fruits(name) 
VALUES('Orange');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sequence
&lt;/h3&gt;

&lt;p&gt;A sequence is a database object that generates an ordered list of integers, commonly used to supply unique identifiers. The CREATE SEQUENCE statement has the following syntax:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SEQUENCE [ IF NOT EXISTS ] sequence_name
    [ AS { SMALLINT | INT | BIGINT } ]
    [ INCREMENT [ BY ] increment ]
    [ MINVALUE minvalue | NO MINVALUE ] 
    [ MAXVALUE maxvalue | NO MAXVALUE ]
    [ START [ WITH ] start ] 
    [ CACHE cache ] 
    [ [ NO ] CYCLE ]
    [ OWNED BY { table_name.column_name | NONE } ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE order_details(
    order_id SERIAL,
    item_id INT NOT NULL,
    item_text VARCHAR NOT NULL,
    price DEC(10,2) NOT NULL,
    PRIMARY KEY(order_id, item_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SEQUENCE order_item_id
START 10
INCREMENT 10
MINVALUE 10
OWNED BY order_details.item_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO 
    order_details(order_id, item_id, item_text, price)
VALUES
    (100, nextval('order_item_id'),'DVD Player',100),
    (100, nextval('order_item_id'),'Android TV',550),
    (100, nextval('order_item_id'),'Speaker',250);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    order_id,
    item_id,
    item_text,
    price
FROM
    order_details;        
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;List all sequences in the current database&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    relname sequence_name
FROM 
    pg_class 
WHERE 
    relkind = 'S';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, specify the name of the sequence that you want to drop, and use the CASCADE option if you want to recursively drop objects that depend on the sequence, and objects that depend on those dependent objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP SEQUENCE [ IF EXISTS ] sequence_name [, ...] 
[ CASCADE | RESTRICT ];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP TABLE order_details;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PostgreSQL identity column
&lt;/h3&gt;

&lt;p&gt;PostgreSQL version 10 introduced a new constraint GENERATED AS IDENTITY that allows you to automatically assign a unique number to a column. The GENERATED AS IDENTITY constraint is the SQL standard-conforming variant of the good old SERIAL column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;column_name type GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY[ ( sequence_option ) ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a table named color with the color_id as the identity column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE color (
    color_id INT GENERATED ALWAYS AS IDENTITY,
    color_name VARCHAR NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert new rows into the color table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO color(color_name)
VALUES
    ('Green'),
    ('Blue');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Alter Table
&lt;/h3&gt;

&lt;p&gt;To change the structure of an existing table, you use the PostgreSQL ALTER TABLE statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name action;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With ALTER TABLE you can: add a column, drop a column, change the data type of a column, rename a column, set a default value for a column, add a constraint to a column, or rename a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE links (
   link_id serial PRIMARY KEY,
   title VARCHAR (512) NOT NULL,
   url VARCHAR (1024) NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add a new column named active, you use the following statement&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE links
ADD COLUMN active boolean;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To remove the active column from the links table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE links 
DROP COLUMN active;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add a new column named target to the links table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE links 
ADD COLUMN target VARCHAR(10);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To change the name of the links table to short_urls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE links 
RENAME TO short_urls;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
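
&lt;p&gt;The remaining actions follow the same pattern. For example, this sketch renames a column and changes a column's data type on the short_urls table from the previous step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- rename the title column
ALTER TABLE short_urls
RENAME COLUMN title TO link_title;

-- change the data type of the url column
ALTER TABLE short_urls
ALTER COLUMN url TYPE TEXT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;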





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name 
DROP COLUMN column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To drop a column that other objects depend on, use the CASCADE option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name 
DROP COLUMN column_name CASCADE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Drop Table
&lt;/h3&gt;

&lt;p&gt;To drop a table from the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP TABLE [IF EXISTS] table_name 
[CASCADE | RESTRICT];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CASCADE option allows you to remove the table and its dependent objects. The RESTRICT option rejects the removal if any object depends on the table. The RESTRICT option is the default if you don’t explicitly specify it in the DROP TABLE statement.&lt;/p&gt;
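
&lt;p&gt;For example, this drops the action_film table created earlier; the IF EXISTS option avoids an error if the table is not there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP TABLE IF EXISTS action_film;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;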

&lt;h3&gt;
  
  
  Truncate Table
&lt;/h3&gt;

&lt;p&gt;The TRUNCATE TABLE statement deletes all data from a table without scanning it, which makes it faster than the DELETE statement. The TRUNCATE TABLE statement also reclaims the storage right away, so you do not have to perform a subsequent VACUUM operation, which is useful in the case of large tables.&lt;/p&gt;

&lt;p&gt;This query removes all data and resets the identity column value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRUNCATE TABLE table_name 
RESTART IDENTITY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To remove data from a table and from other tables that have foreign key references to it, you use the CASCADE option in the TRUNCATE TABLE statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRUNCATE TABLE table_name 
CASCADE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TRUNCATE TABLE statement is transaction-safe. This means that if you place it within a transaction, you can roll it back safely.&lt;/p&gt;
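
&lt;p&gt;A minimal sketch of this behaviour (the table name is illustrative): the truncation is undone when the transaction rolls back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN;
TRUNCATE TABLE short_urls;
-- the table now appears empty inside this transaction
ROLLBACK;
-- after the rollback, the data in short_urls is intact again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;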

&lt;h2&gt;
  
  
  Database Constraints
&lt;/h2&gt;

&lt;p&gt;PostgreSQL supports several constraints: Primary Key, Foreign Key, Check Constraint, Unique Constraint and Not-Null Constraint.&lt;/p&gt;
&lt;h3&gt;
  
  
  Primary Key
&lt;/h3&gt;

&lt;p&gt;A primary key is a column or a group of columns used to identify a row uniquely in a table. A table can have one and only one primary key. It is a good practice to add a primary key to every table. PostgreSQL creates a unique B-tree index on the column or a group of columns used to define the primary key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE po_headers (
    po_no SERIAL PRIMARY KEY,
    vendor_no INTEGER,
    description TEXT,
    shipping_address TEXT
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;General syntax for removing a primary key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name DROP CONSTRAINT primary_key_constraint;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, to remove the primary key from the products table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE products
DROP CONSTRAINT products_pkey;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Foreign Key
&lt;/h3&gt;

&lt;p&gt;A foreign key is a column or a group of columns in a table that reference the primary key of another table.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, you define a foreign key using the foreign key constraint. The foreign key constraint helps maintain the referential integrity of data between the child and parent tables.&lt;br&gt;
A foreign key constraint indicates that values in a column or a group of columns in the child table equal the values in a column or a group of columns of the parent table.&lt;/p&gt;

&lt;p&gt;The syntax of a foreign key constraint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[CONSTRAINT fk_name]
   FOREIGN KEY(fk_columns) 
   REFERENCES parent_table(parent_key_columns)
   [ON DELETE delete_action]
   [ON UPDATE update_action]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The available actions are SET NULL, SET DEFAULT, RESTRICT, NO ACTION and CASCADE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customers(
   customer_id INT GENERATED ALWAYS AS IDENTITY,
   customer_name VARCHAR(255) NOT NULL,
   PRIMARY KEY(customer_id)
);

CREATE TABLE contacts(
   contact_id INT GENERATED ALWAYS AS IDENTITY,
   customer_id INT,
   contact_name VARCHAR(255) NOT NULL,
   phone VARCHAR(15),
   email VARCHAR(100),
   PRIMARY KEY(contact_id),
   CONSTRAINT fk_customer
      FOREIGN KEY(customer_id) 
      REFERENCES customers(customer_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;ON DELETE CASCADE automatically deletes all the referencing rows in the child table when the referenced rows in the parent table are deleted.&lt;/li&gt;
&lt;li&gt;The SET NULL automatically sets NULL to the foreign key columns in the referencing rows of the child table when the referenced rows in the parent table are deleted.&lt;/li&gt;
&lt;li&gt;The RESTRICT action is similar to NO ACTION: PostgreSQL issues a constraint violation if the rows being deleted from the parent table still have referencing rows in the child table.&lt;/li&gt;
&lt;li&gt;The ON DELETE SET DEFAULT sets the default value to the foreign key column of the referencing rows in the child table when the referenced rows from the parent table are deleted.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE child_table
ADD CONSTRAINT constraint_fk
FOREIGN KEY (fk_columns)
REFERENCES parent_table(parent_key_columns)
ON DELETE CASCADE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check Constraint
&lt;/h3&gt;

&lt;p&gt;A CHECK constraint is a kind of constraint that allows you to specify that values in a column must meet a specific requirement. It uses a Boolean expression to evaluate the values before they are inserted into or updated in the column.&lt;/p&gt;

&lt;p&gt;If the values pass the check, PostgreSQL will insert or update these values to the column. Otherwise, PostgreSQL will reject the changes and issue a constraint violation error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR (50),
    last_name VARCHAR (50),
    birth_date DATE CHECK (birth_date &amp;gt; '1900-01-01'),
    joined_date DATE CHECK (joined_date &amp;gt; birth_date),
    salary numeric CHECK(salary &amp;gt; 0)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add CHECK constraints to an existing table, you use the ALTER TABLE statement. First, create the prices_list table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE prices_list (
    id serial PRIMARY KEY,
    product_id INT NOT NULL,
    price NUMERIC NOT NULL,
    discount NUMERIC NOT NULL,
    valid_from DATE NOT NULL,
    valid_to DATE NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE prices_list 
ADD CONSTRAINT price_discount_check 
CHECK (
    price &amp;gt; 0
    AND discount &amp;gt;= 0
    AND price &amp;gt; discount
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Unique Constraint
&lt;/h3&gt;

&lt;p&gt;Sometimes, you want to ensure that values stored in a column or a group of columns are unique across the whole table, such as email addresses or usernames.&lt;br&gt;
PostgreSQL provides the UNIQUE constraint to maintain the uniqueness of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE person (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR (50),
    last_name VARCHAR (50),
    email VARCHAR (50) UNIQUE
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Not-Null Constraint
&lt;/h3&gt;

&lt;p&gt;In databases, NULL represents unknown or missing information. NULL is not the same as an empty string or the number zero.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE invoices(
  id SERIAL PRIMARY KEY,
  product_id INT NOT NULL,
  qty numeric NOT NULL CHECK(qty &amp;gt; 0),
  net_price numeric CHECK(net_price &amp;gt; 0) 
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use the NOT NULL constraint to enforce that a column does not accept NULL. By default, a column can hold NULL.&lt;/li&gt;
&lt;li&gt;To check if a value is NULL or not, you use the IS NULL operator. The IS NOT NULL negates the result of the IS NULL.&lt;/li&gt;
&lt;li&gt;Never use the equality operator = to compare a value with NULL because the comparison always returns NULL.&lt;/li&gt;
&lt;/ul&gt;
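
&lt;p&gt;For example, using the invoices table above, the following finds the rows whose net_price is missing; note that WHERE net_price = NULL would match nothing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT id, product_id, qty
FROM invoices
WHERE net_price IS NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;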

&lt;h2&gt;
  
  
  PostgreSQL Data Types
&lt;/h2&gt;

&lt;p&gt;Boolean, Char, VarChar, and Text, Numeric, Integer, Serial, Date, Timestamp, Interval, Time, Uuid, Json, Hstore, Array, User-defined Data Types&lt;/p&gt;

&lt;h3&gt;
  
  
  Boolean
&lt;/h3&gt;

&lt;p&gt;PostgreSQL supports a single Boolean data type: BOOLEAN that can have three values: true, false and NULL.&lt;br&gt;
PostgreSQL uses one byte for storing a boolean value in the database. The BOOLEAN can be abbreviated as BOOL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE stock_availability (
   product_id INT PRIMARY KEY,
   available BOOLEAN NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Char, VarChar, Text
&lt;/h3&gt;

&lt;p&gt;PostgreSQL provides three primary character types: CHARACTER(n) or CHAR(n), CHARACTER VARYING(n) or VARCHAR(n), and TEXT, where n is a positive integer.&lt;/p&gt;

&lt;p&gt;The advantage of specifying the length for the VARCHAR data type is that PostgreSQL will issue an error if you attempt to insert a string that has more than n characters into a VARCHAR(n) column.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL supports CHAR, VARCHAR, and TEXT data types. The CHAR is fixed-length character type while the VARCHAR and TEXT are varying length character types.&lt;/li&gt;
&lt;li&gt;Use VARCHAR(n) if you want to validate the length of the string (n) before inserting into or updating to a column.&lt;/li&gt;
&lt;li&gt;VARCHAR (without the length specifier) and TEXT are equivalent.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE chemical_compounds (
    id serial PRIMARY KEY,
    first CHAR (7),
    second VARCHAR (19),
    third TEXT
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Numeric Type
&lt;/h3&gt;

&lt;p&gt;The NUMERIC type can store numbers with a lot of digits. Typically, you use the NUMERIC type for numbers that require exactness such as monetary amounts or quantities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NUMERIC(precision, scale)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The precision is the total number of digits and the scale is the number of digits in the fraction part. For example, the number 8765.351 has the precision 7 and scale 3.&lt;/p&gt;
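
&lt;p&gt;As a quick illustration, casting that value to a smaller scale rounds it (PostgreSQL rounds NUMERIC values half away from zero):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 8765.351::NUMERIC(6, 2);
-- 8765.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;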

&lt;p&gt;If precision is not required, you should not use the NUMERIC type because calculations on NUMERIC values are typically slower than integers, floats, and double precision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    price NUMERIC(5,2)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integer Data Types
&lt;/h3&gt;

&lt;p&gt;To store the whole numbers in PostgreSQL, you use one of the following integer types: SMALLINT, INTEGER, and BIGINT.&lt;/p&gt;

&lt;p&gt;Use the SMALLINT type for storing small-range values such as people's ages or the number of pages of a book.&lt;br&gt;
INTEGER is the most common choice among the integer types because it offers the best balance between storage size, range, and performance.&lt;br&gt;
The BIGINT type not only consumes more storage but can also decrease the performance of the database; therefore, you should have a good reason to use it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE cities (
    city_id serial PRIMARY KEY,
    city_name VARCHAR (255) NOT NULL,
    population INT NOT NULL CHECK (population &amp;gt;= 0)
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Date
&lt;/h3&gt;

&lt;p&gt;To store date values, use the PostgreSQL DATE data type, which uses 4 bytes to store a date value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE employees (
    employee_id serial PRIMARY KEY,
    first_name VARCHAR (255),
    last_name VARCHAR (355),
    birth_date DATE NOT NULL,
    hire_date DATE NOT NULL
);

INSERT INTO employees (first_name, last_name, birth_date, hire_date)
VALUES ('Derrick','Kimani','1990-05-01','2015-06-01'),
       ('Florence','Wanjiru','1991-03-05','2013-04-01'),
       ('Richard','Chege','1992-09-01','2011-10-01');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get the current date, use the built-in NOW() function cast to date&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT NOW()::date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT CURRENT_DATE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get the year, quarter, month, week, day from a date value, you use the EXTRACT() function.&lt;br&gt;
The following statement extracts the year, month, and day from the birth dates of employees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    employee_id,
    first_name,
    last_name,
    EXTRACT (YEAR FROM birth_date) AS YEAR,
    EXTRACT (MONTH FROM birth_date) AS MONTH,
    EXTRACT (DAY FROM birth_date) AS DAY
FROM
    employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Timestamp
&lt;/h3&gt;

&lt;p&gt;The timestamp datatype allows you to store both date and time. However, it does not have any time zone data. It means that when you change the timezone of your database server, the timestamp value stored in the database will not change automatically.&lt;br&gt;
The timestamptz datatype is the timestamp with the time zone. The timestamptz datatype is a time zone-aware date and time data type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT CURRENT_TIMESTAMP;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT TIMEOFDAY();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
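
&lt;p&gt;A small sketch of the difference between the two types (the time zone name is illustrative): the timestamp literal is stored as written, while the timestamptz literal is interpreted in the session time zone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET timezone = 'Africa/Nairobi';

SELECT '2022-11-01 10:00:00'::timestamp;   -- stored as-is, no time zone
SELECT '2022-11-01 10:00:00'::timestamptz; -- converted using the session time zone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;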



&lt;h3&gt;
  
  
  Time
&lt;/h3&gt;

&lt;p&gt;The TIME data type stores the time of day. A time value may have a precision of up to 6 digits; the precision specifies the number of fractional digits in the seconds field.&lt;br&gt;
The TIME data type requires 8 bytes and its allowed range is from 00:00:00 to 24:00:00.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;column_name TIME(precision);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE shifts (
    id serial PRIMARY KEY,
    shift_name VARCHAR NOT NULL,
    start_at TIME NOT NULL,
    end_at TIME NOT NULL
);  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT LOCAL TIME;

SELECT CURRENT TIME;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To extract hours, minutes and seconds from a time value, you use the EXTRACT function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    LOCALTIME,
    EXTRACT (HOUR FROM LOCALTIME) as hour,
    EXTRACT (MINUTE FROM LOCALTIME) as minute, 
    EXTRACT (SECOND FROM LOCALTIME) as second,
    EXTRACT (milliseconds FROM LOCALTIME) as milliseconds; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  UUID Data Type
&lt;/h3&gt;

&lt;p&gt;UUID stands for Universally Unique Identifier, defined by RFC 4122 and other related standards. A UUID value is a 128-bit quantity generated by an algorithm that makes it effectively unique in the known universe.&lt;br&gt;
A UUID is written as a sequence of 32 hexadecimal digits in groups separated by hyphens.&lt;/p&gt;

&lt;p&gt;Because of this uniqueness, UUIDs are often found in distributed systems: they guarantee better uniqueness than the SERIAL data type, which generates unique values only within a single database. To store UUID values in a PostgreSQL database, you use the UUID data type.&lt;/p&gt;

&lt;p&gt;To install the uuid-ossp module, you use the CREATE EXTENSION statement&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to generate a UUID value solely based on random numbers, use the uuid_generate_v4() function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT uuid_generate_v4();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a table whose primary key is of the UUID data type; the values of the primary key column will be generated automatically by the uuid_generate_v4() function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE contacts (
    contact_id uuid DEFAULT uuid_generate_v4 (),
    first_name VARCHAR NOT NULL,
    last_name VARCHAR NOT NULL,
    email VARCHAR NOT NULL,
    phone VARCHAR,
    PRIMARY KEY (contact_id)
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO contacts (
    first_name,
    last_name,
    email
)
VALUES
    (
        'Kamau',
        'Kelvin',
        'kamau.kelvin@example.com'
    ),
    (
        'Nafula',
        'Wepkulu',
        'nafula.wepkulu@example.com'
    ),
    (
        'Kasunda',
        'Mutorini',
        'kasunda.mutorini@example.com'
    );

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query the table to view the generated UUID values in the contact_id column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    *
FROM
    contacts;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t_ySyR06--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6t9zpbwrtg2ljtjn7h88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t_ySyR06--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6t9zpbwrtg2ljtjn7h88.png" alt="Uuid in the customer_id column" width="880" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hstore data type
&lt;/h3&gt;

&lt;p&gt;The hstore module implements the hstore data type for storing key-value pairs in a single value.&lt;br&gt;
The hstore data type is useful in many cases, such as semi-structured data or rows with many attributes that are rarely queried. Note that keys and values are just text strings.&lt;/p&gt;

&lt;p&gt;To enable the hstore extension, which loads the contrib module into your PostgreSQL instance, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION hstore;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE books (
    id serial primary key,
    title VARCHAR (255),
    attr hstore
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data that we insert into the hstore column is a list of comma-separated key =&amp;gt; value pairs. Both keys and values are quoted using double quotes (“”).&lt;/p&gt;
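
&lt;p&gt;For example, inserting a row into the books table above (the attribute keys are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO books (title, attr)
VALUES (
    'PostgreSQL Tutorial',
    '"paperback" =&amp;gt; "243", "publisher" =&amp;gt; "postgresqltutorial.com", "language" =&amp;gt; "English"'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;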

&lt;p&gt;PostgreSQL provides the hstore_to_json() function to convert hstore data to JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  title,
  hstore_to_json (attr) json
FROM
  books;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  JSON data type
&lt;/h3&gt;

&lt;p&gt;JSON stands for JavaScript Object Notation. JSON is an open standard format that consists of key-value pairs and is human-readable text.&lt;br&gt;
The main usage of JSON is to transport data between a server and a web application. &lt;/p&gt;

&lt;p&gt;The orders table consists of two columns:&lt;/p&gt;

&lt;p&gt;The id column is the primary key column that identifies the order. The info column stores the data in the form of JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE orders (
    id serial NOT NULL PRIMARY KEY,
    info json NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL provides two native operators -&amp;gt; and -&amp;gt;&amp;gt; to help you query JSON data.&lt;br&gt;
The operator -&amp;gt; returns a JSON object field by key as JSON.&lt;br&gt;
The operator -&amp;gt;&amp;gt; returns a JSON object field by key as text.&lt;/p&gt;
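
&lt;p&gt;For example, assuming each info value contains a customer key, -&amp;gt; keeps the field as JSON while -&amp;gt;&amp;gt; returns it as text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT info -&amp;gt; 'customer' AS customer_json,
       info -&amp;gt;&amp;gt; 'customer' AS customer_text
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;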

&lt;p&gt;We can apply aggregate functions such as MIN, MAX, SUM and AVG to JSON data. For example, the following statement returns the minimum quantity, maximum quantity, total quantity and average quantity of products sold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
   MIN (CAST (info -&amp;gt; 'items' -&amp;gt;&amp;gt; 'qty' AS INTEGER)),
   MAX (CAST (info -&amp;gt; 'items' -&amp;gt;&amp;gt; 'qty' AS INTEGER)),
   SUM (CAST (info -&amp;gt; 'items' -&amp;gt;&amp;gt; 'qty' AS INTEGER)),
   AVG (CAST (info -&amp;gt; 'items' -&amp;gt;&amp;gt; 'qty' AS INTEGER))
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The json_each() function allows us to expand the outermost JSON object into a set of key-value pairs. &lt;/p&gt;

&lt;p&gt;To get a set of keys in the outermost JSON object, you use the json_object_keys() function.&lt;/p&gt;
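
&lt;p&gt;A minimal sketch of both functions against the orders table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT json_each(info)
FROM orders;

SELECT json_object_keys(info)
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;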

&lt;h3&gt;
  
  
  User-defined Data Types
&lt;/h3&gt;

&lt;p&gt;A domain is a data type with optional constraints, e.g., NOT NULL and CHECK. A domain has a unique name within the schema scope.&lt;br&gt;
Domains are useful for centralizing the management of fields with common constraints.&lt;/p&gt;
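
&lt;p&gt;For instance, a hypothetical domain that rejects NULLs and spaces can be reused across tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DOMAIN contact_name AS
    VARCHAR NOT NULL CHECK (value !~ '\s');

CREATE TABLE mailing_list (
    id serial PRIMARY KEY,
    first_name contact_name,
    last_name contact_name
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;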

&lt;p&gt;The CREATE TYPE statement allows you to create a composite type, used as the return type of a function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TYPE film_summary AS (
    film_id INT,
    title VARCHAR,
    release_year SMALLINT
); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the film_summary data type as the return type of a function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION get_film_summary (f_id INT) 
    RETURNS film_summary AS 
$$ 
SELECT
    film_id,
    title,
    release_year
FROM
    film
WHERE
    film_id = f_id ; 
$$ 
LANGUAGE SQL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A user-defined function that returns a random number between two numbers low and high.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION random_between(low INT ,high INT) 
   RETURNS INT AS
$$
BEGIN
   RETURN floor(random()* (high-low + 1) + low);
END;
$$ language 'plpgsql' STRICT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT random_between(1,100);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get multiple random numbers between two integers, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT random_between(1, 100)
FROM generate_series(1, 4);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To list all user-defined types in the current database use the \dT or \dT+ command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conditional Expressions &amp;amp; Operators
&lt;/h2&gt;

&lt;p&gt;CASE, COALESCE, NULLIF, CAST&lt;/p&gt;

&lt;h3&gt;
  
  
  CASE expression
&lt;/h3&gt;

&lt;p&gt;The PostgreSQL CASE expression is similar to the IF/ELSE statement in other programming languages. It allows you to add if-else logic to a query to form a powerful query.&lt;/p&gt;

&lt;p&gt;General CASE expression&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CASE 
      WHEN condition_1  THEN result_1
      WHEN condition_2  THEN result_2
      [WHEN ...]
      [ELSE else_result]
END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    SUM (
        CASE
        WHEN rental_rate = 0.99 THEN 1
        ELSE 0
        END
    ) AS "Economy",
    SUM (
        CASE
        WHEN rental_rate = 2.99 THEN 1
        ELSE 0
        END
    ) AS "Mass",
    SUM (
        CASE
        WHEN rental_rate = 4.99 THEN 1
        ELSE 0
        END
    ) AS "Premium"
FROM
    film;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XeCF9UW8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ma02nvib7bf6fzx1kjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XeCF9UW8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ma02nvib7bf6fzx1kjl.png" alt="Aggregate on CASE" width="778" height="91"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Coalesce
&lt;/h3&gt;

&lt;p&gt;The COALESCE function returns the first non-null argument. It accepts an unlimited number of arguments and returns the first argument that is not null. If all arguments are null, the COALESCE function returns null.&lt;/p&gt;

&lt;p&gt;The COALESCE function evaluates arguments from left to right until it finds the first non-null argument. All the remaining arguments after the first non-null argument are not evaluated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COALESCE (argument_1, argument_2,argument_3, ...);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE items (
    ID serial PRIMARY KEY,
    product VARCHAR (100) NOT NULL,
    price NUMERIC NOT NULL,
    discount NUMERIC
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert records into the items table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO items (product, price, discount)
VALUES
    ('Cassava', 1000 ,10),
    ('Yams', 1500 ,20),
    ('Arrow roots', 800 ,5),
    ('Potatoes', 500, NULL);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    product,
    (price - COALESCE(discount,0)) AS net_price
FROM
    items;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The COALESCE function makes the query shorter and easier to read by substituting null values in the query.&lt;/p&gt;

&lt;h3&gt;
  
  
  NULLIF
&lt;/h3&gt;

&lt;p&gt;The PostgreSQL NULLIF function also helps handle null values. It takes exactly two arguments and returns NULL if they are equal; otherwise it returns the first argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NULLIF(argument_1, argument_2, argument_3);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the NULLIF function to substitute null values when displaying data and to prevent division-by-zero errors.&lt;/p&gt;
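
&lt;p&gt;For example, wrapping a zero divisor in NULLIF turns it into NULL, so the division yields NULL instead of raising an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 100 / NULLIF(0, 0);
-- returns NULL rather than a division-by-zero error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;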

&lt;h3&gt;
  
  
  CAST operator
&lt;/h3&gt;

&lt;p&gt;To convert a value of one data type into another, PostgreSQL provides the CAST operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CAST ( expression AS target_type );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specify an expression, which can be a constant, a table column, or any expression that evaluates to a value, followed by the target data type to which you want to convert the result.&lt;/p&gt;

&lt;p&gt;PostgreSQL also provides the shorthand type cast operator :: &lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;expression::type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cast a string to a double&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
   CAST ('10.2' AS DOUBLE PRECISION);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cast a string to a date&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
   CAST ('2015-01-01' AS DATE),
   CAST ('01-OCT-2015' AS DATE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the cast operator to convert a string to an interval&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT '15 minute'::interval,
 '2 hour'::interval,
 '1 day'::interval,
 '2 week'::interval,
 '3 month'::interval;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explain
&lt;/h3&gt;

&lt;p&gt;The EXPLAIN statement returns the execution plan that the PostgreSQL planner generates for a given statement.&lt;br&gt;
EXPLAIN shows how the tables involved in a statement will be scanned (by index scan, sequential scan, etc.) and, if multiple tables are used, what kind of join algorithm will be used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN [ ( option [, ...] ) ] sql_statement;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The option can be one of the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ANALYZE [ boolean ]
VERBOSE [ boolean ]
COSTS [ boolean ]
BUFFERS [ boolean ]
TIMING [ boolean ]  
SUMMARY [ boolean ]
FORMAT { TEXT | XML | JSON | YAML }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Boolean specifies whether the selected option should be turned on or off. You can use TRUE, ON, or 1 to enable the option, and FALSE, OFF, or 0 to disable it.&lt;/p&gt;

&lt;p&gt;The ANALYZE option causes the sql_statement to be executed first and then includes actual run-time statistics in the returned information, such as the total elapsed time spent within each plan node and the number of rows it actually returned.&lt;/p&gt;

&lt;p&gt;TIMING includes the actual startup time and the time spent in each node in the output. It defaults to TRUE and may only be used when ANALYZE is enabled.&lt;/p&gt;

&lt;p&gt;The COSTS option includes the estimated startup and total costs of each plan node, as well as the estimated number of rows and the estimated width of each row in the query plan. COSTS defaults to TRUE.&lt;/p&gt;

&lt;p&gt;BUFFERS adds information about buffer usage and can only be used when ANALYZE is enabled. By default, the BUFFERS parameter is set to FALSE.&lt;/p&gt;

&lt;p&gt;The VERBOSE parameter shows additional information regarding the plan. It is set to FALSE by default.&lt;/p&gt;

&lt;p&gt;The SUMMARY parameter adds summary information, such as total timing, after the query plan. Note that when the ANALYZE option is used, the summary information is included by default.&lt;/p&gt;

&lt;p&gt;FORMAT specifies the output format of the query plan: TEXT, XML, JSON or YAML. It is set to TEXT by default.&lt;/p&gt;
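
&lt;p&gt;The options are written as a parenthesized, comma-separated list. For example, to execute the query and report buffer usage in JSON format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM film;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;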

&lt;p&gt;An EXPLAIN ANALYZE statement that executes the query and returns the actual query plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN ANALYZE
SELECT
    f.film_id,
    title,
    name category_name
FROM
    film f
    INNER JOIN film_category fc 
        ON fc.film_id = f.film_id
    INNER JOIN category c 
        ON c.category_id = fc.category_id
ORDER BY
    title;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xOv_RIq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yrv15u1u8a4mkdupue9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xOv_RIq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yrv15u1u8a4mkdupue9.png" alt="Explain statement to show query plan" width="880" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Indeed, we have covered the basics of SQL using PostgreSQL in detail. You need to practice what you have learnt on a sample database, so as to understand the concepts well. Our goal should be to write complex, readable queries that execute fast. The next SQL article will cover advanced topics in SQL such as Indexes, Views and Stored Procedures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QbKHGw0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh5k7io1izfb4l7cdgh3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QbKHGw0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh5k7io1izfb4l7cdgh3.gif" alt="congratulations for reading to the end of our article" width="250" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to share your thoughts in the comments.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>beginners</category>
      <category>database</category>
    </item>
    <item>
      <title>Data Engineering Roadmap</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Sun, 18 Sep 2022 16:24:07 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/data-engineering-roadmap-34fk</link>
      <guid>https://dev.to/kinyungu_denis/data-engineering-roadmap-34fk</guid>
      <description>&lt;p&gt;Today, we will understand the road map for a data engineer. What one need to learn to become a good data engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Engineering Roadmap
&lt;/h2&gt;

&lt;p&gt;Software and technology requirements that you need:&lt;br&gt;
1). A cloud account: Google GCP, AWS or Azure.&lt;br&gt;
2). A Python IDE and a text editor, preferably Anaconda.&lt;br&gt;
3). SQL Server, MySQL Workbench, DBeaver or DbVisualizer.&lt;br&gt;
4). Git and a version control hosting service (preferably a GitHub account).&lt;br&gt;
5). Create an account on &lt;a href="https://www.atlassian.com"&gt;https://www.atlassian.com&lt;/a&gt; and understand the following Atlassian products.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jira, Trello, Confluence, Bitbucket
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1). Data Engineering
&lt;/h3&gt;

&lt;p&gt;What is data engineering?&lt;br&gt;
What does a data engineer do?&lt;br&gt;
What is the difference between Data Engineers, ML Engineers and Data Scientists?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9k-3rQIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w70rwua1ivfcexxi4jml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9k-3rQIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w70rwua1ivfcexxi4jml.png" alt="Data Engineering Process" width="880" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data engineering is the practice of designing and building systems for collecting, storing, processing and analyzing large amounts of data at scale.&lt;/p&gt;

&lt;p&gt;In data engineering we develop and maintain large-scale data processing systems that prepare structured and unstructured data for analytical modelling and data-driven decisions.&lt;/p&gt;

&lt;p&gt;The aim of data engineering is to make quality data available for analysis and efficient data-driven decision making. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Data Engineering ecosystem consists of 4 things:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data — different data types, formats and sources of data.&lt;/li&gt;
&lt;li&gt;Data stores and repositories — Relational and non-relational databases, data warehouses, data marts, data lakes, and big data stores that store and process the data&lt;/li&gt;
&lt;li&gt;Data Pipelines — Collect/gather data from multiple sources, then clean, process and transform it into data which can be used for analysis.&lt;/li&gt;
&lt;li&gt;Analytics and Data-driven Decision Making — Make the well-processed data available for further business analytics, visualization and data-driven decision making.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data engineering lifecycle consists of building/architecting data platforms; designing and implementing data stores, repositories and data lakes; gathering, importing, cleaning, pre-processing, querying and analyzing data; and performance monitoring, evaluation, optimization and fine-tuning of processes and systems.&lt;/p&gt;

&lt;p&gt;A Data Engineer is responsible for making quality data available from various sources: maintaining databases, building data pipelines, querying data, pre-processing data using tools such as Apache Hadoop and Spark, and developing data workflows using tools such as Airflow.&lt;/p&gt;

&lt;p&gt;Machine Learning Engineers are responsible for building ML algorithms, building data and ML models and deploying them; they have statistical and mathematical knowledge and measure, optimize and improve results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YmVhgQiD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j0ed4xxtnpkp776x38um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YmVhgQiD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j0ed4xxtnpkp776x38um.png" alt="Data Pipeline" width="880" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2). Python for Data Engineering
&lt;/h3&gt;

&lt;p&gt;Data engineering with Python has several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The role of a data engineer involves working with different types of data formats. For such cases, Python is best suited. Its standard library supports easy handling of .csv files, one of the most common data file formats.&lt;/li&gt;
&lt;li&gt;A data engineer is often required to use APIs to retrieve data from databases. The data in such cases is usually stored in JSON (JavaScript Object Notation) format, and Python has a built-in library, &lt;code&gt;json&lt;/code&gt;, to handle this type of data.&lt;/li&gt;
&lt;li&gt;Data engineering tools such as Apache Airflow and Apache NiFi are built around Directed Acyclic Graphs (DAGs). In Airflow, DAGs are defined in Python code that specifies tasks and their dependencies. Thus, learning Python will help data engineers use these tools efficiently.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The responsibility of a data engineer is not only to obtain data from different sources but also to process it. One of the most popular data processing engines is Apache Spark, which works with Python DataFrames and even offers an API, PySpark, to build scalable big data projects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Luigi, a Python module, is widely considered a fantastic tool for data engineering.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Python is easy to learn and is free to use for the masses. An active community of developers strongly supports it.&lt;/p&gt;

&lt;p&gt;Basic Python&lt;br&gt;
 Maths Expressions&lt;br&gt;
 Strings&lt;br&gt;
 Variables&lt;br&gt;
 Loops&lt;br&gt;
 Functions&lt;br&gt;
 Lists, Tuples, Dictionaries and Sets&lt;br&gt;
Connecting with Databases&lt;br&gt;
 Boto3&lt;br&gt;
 Psycopg2&lt;br&gt;
 mysql&lt;br&gt;
Working with Data&lt;br&gt;
 JSON&lt;br&gt;
 JSONSCHEMA&lt;br&gt;
 datetime&lt;br&gt;
 Pandas&lt;br&gt;
 Numpy&lt;br&gt;
Connecting to APIs&lt;br&gt;
 Requests&lt;/p&gt;
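&lt;p&gt;As a small, self-contained sketch of the "Working with Data" topics above, using only the standard library (the records and field names here are invented):&lt;/p&gt;

```python
import json
from datetime import datetime

# Hypothetical raw JSON, e.g. the payload returned by an API.
raw = '[{"id": 1, "amount": "19.99", "created_at": "2022-09-01"},' \
      ' {"id": 2, "amount": "5.50", "created_at": "2022-09-03"}]'

records = json.loads(raw)

# Clean/transform: cast amounts to float and dates to datetime objects.
for r in records:
    r["amount"] = float(r["amount"])
    r["created_at"] = datetime.strptime(r["created_at"], "%Y-%m-%d")

total = round(sum(r["amount"] for r in records), 2)
print(total)  # 25.49
```

From here, libraries such as Pandas or NumPy would take over for heavier analysis.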

&lt;h3&gt;
  
  
  3). Scripting and Automation
&lt;/h3&gt;

&lt;p&gt;You need to learn automation, so that you can automate repetitive tasks and save time.&lt;br&gt;
Shell Scripting&lt;br&gt;
CRON&lt;br&gt;
ETL&lt;/p&gt;
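&lt;p&gt;The extract-transform-load (ETL) idea can be sketched in a few lines of Python. This is a toy script, not tied to any scheduler: in practice the extract step would read a real file or call an API, the load step would write to a target database, and CRON would run the script on a schedule.&lt;/p&gt;

```python
import csv
import io

# Extract: inline CSV so the sketch is self-contained; a real job
# would read a file or call an API here.
source = io.StringIO("region,amount\neast,10\nwest,20\neast,5\n")

# Transform: aggregate the amount per region.
totals = {}
for row in csv.DictReader(source):
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

# Load: a real pipeline would INSERT into a database; here we
# write the aggregated rows back out as CSV text.
sink = io.StringIO()
csv.writer(sink).writerows(sorted(totals.items()))
print(sink.getvalue())
```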

&lt;h3&gt;
  
  
  4). Relational Databases and SQL
&lt;/h3&gt;

&lt;p&gt;SQL is very critical on your data engineering path; learn to perform advanced queries on your data as well.&lt;br&gt;
RDBMS&lt;br&gt;
Data Modeling&lt;br&gt;
Basic SQL&lt;br&gt;
Advanced SQL&lt;br&gt;
BigQuery&lt;/p&gt;

&lt;h3&gt;
  
  
  5). NoSQL Databases
&lt;/h3&gt;

&lt;p&gt;As a data engineer you will work with a variety of data; unstructured data will commonly be stored in NoSQL databases.&lt;br&gt;
Unstructured Data&lt;br&gt;
Advanced ETL&lt;br&gt;
Map-Reduce&lt;br&gt;
Data Warehouses&lt;br&gt;
Data API&lt;/p&gt;

&lt;h3&gt;
  
  
  6). Data Analysis
&lt;/h3&gt;

&lt;p&gt;Pandas&lt;br&gt;
Numpy&lt;br&gt;
Web Scraping&lt;br&gt;
Data Visualization&lt;/p&gt;

&lt;h3&gt;
  
  
  7). Data Processing Techniques
&lt;/h3&gt;

&lt;p&gt;Batch Processing — Apache Spark&lt;br&gt;
Stream Processing — Spark Streaming&lt;br&gt;
Build Data Pipelines&lt;br&gt;
Target Databases&lt;br&gt;
Machine learning Algorithms&lt;/p&gt;

&lt;h3&gt;
  
  
  8). Big Data
&lt;/h3&gt;

&lt;p&gt;Big data basics&lt;br&gt;
HDFS in detail&lt;br&gt;
Hadoop Yarn&lt;br&gt;
Hive&lt;br&gt;
Pig&lt;br&gt;
Hbase&lt;/p&gt;

&lt;p&gt;Data engineering tools are specialized applications that make building data pipelines and designing algorithms easier and more efficient. These tools are responsible for making the day-to-day tasks of a data engineer easier in various ways.&lt;br&gt;
Data ingestion systems such as Kafka, for example, offer a seamless and quick data ingestion process while also allowing data engineers to locate appropriate data sources, analyze them, and ingest data for further processing.&lt;/p&gt;

&lt;p&gt;Data engineering tools support the process of transforming data. This is important since big data can be structured, unstructured or any other format; therefore, data engineers need data transformation tools to transform and process big data into the desired format.&lt;br&gt;
Database tools/frameworks like SQL, NoSQL, etc., allow data engineers to acquire, analyze, process, and manage huge volumes of data simply and efficiently.&lt;br&gt;
Visualization tools like Tableau and Power BI allow data engineers to generate valuable insights and create interactive dashboards.&lt;/p&gt;

&lt;p&gt;Apache Spark is a fast cluster computing framework which is used for processing, querying and analyzing Big data. Being based on In-memory computation, it has an advantage over several other Big Data Frameworks.&lt;/p&gt;

&lt;p&gt;Apache Spark was originally written in the Scala programming language; the open-source community has since developed an amazing tool, PySpark, to support Python for Apache Spark. PySpark helps data scientists interface with RDDs in Apache Spark and Python through its library Py4j.&lt;br&gt;
There are many features that make PySpark a better framework than others:&lt;br&gt;
Speed: It is 100x faster than traditional large-scale data processing frameworks&lt;br&gt;
Powerful Caching: Simple programming layer provides powerful caching and disk persistence capabilities&lt;br&gt;
Deployment: Can be deployed through Mesos, Hadoop via Yarn, or Spark’s own cluster manager&lt;br&gt;
Real Time: Real-time computation &amp;amp; low latency because of in-memory computation&lt;br&gt;
Polyglot: Supports programming in Scala, Java, Python and R&lt;/p&gt;

&lt;p&gt;Spark RDDs&lt;br&gt;
When it comes to iterative distributed computing, i.e. processing data over multiple jobs, we need to reuse or share data among those jobs. Earlier frameworks like Hadoop had problems when dealing with multiple operations/jobs:&lt;br&gt;
Storing data in intermediate storage such as HDFS&lt;br&gt;
Multiple I/O jobs make the computations slow&lt;br&gt;
Replication and serialization, which in turn make the process even slower&lt;/p&gt;

&lt;p&gt;RDDs try to solve all the problems by enabling fault-tolerant distributed In-memory computations. RDD is short for Resilient Distributed Datasets. RDD is a distributed memory abstraction which lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. They are the read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. &lt;/p&gt;

&lt;p&gt;There are several operations performed on RDDs:&lt;br&gt;
Transformations: Transformations create a new dataset from an existing one. They are lazily evaluated.&lt;br&gt;
Actions: Spark forces the calculations to execute only when actions are invoked on the RDDs.&lt;/p&gt;
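&lt;p&gt;The split between lazy transformations and eager actions can be mimicked with plain Python generators; this is only an analogy, not the PySpark API:&lt;/p&gt;

```python
# Transformations: build a lazy pipeline; nothing is computed yet.
numbers = range(1, 11)
doubled = (n * 2 for n in numbers)       # like rdd.map(lambda n: n * 2)
big = (n for n in doubled if n > 10)     # like rdd.filter(lambda n: n > 10)

# Action: materializing the generator forces the whole pipeline
# to run, just as rdd.collect() or rdd.take(n) triggers execution.
result = list(big)
print(result)  # [12, 14, 16, 18, 20]
```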

&lt;p&gt;Reading a file and displaying the top n elements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rdd = sc.textFile("path/Sample")
rdd.take(n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  9). Workflows
&lt;/h3&gt;

&lt;p&gt;Introduction to Airflow&lt;br&gt;
Airflow hands on project&lt;/p&gt;

&lt;h3&gt;
  
  
  10). Infrastructure
&lt;/h3&gt;

&lt;p&gt;Docker&lt;br&gt;
Kubernetes&lt;br&gt;
Business Intelligence&lt;/p&gt;

&lt;h3&gt;
  
  
  11). Cloud Computing
&lt;/h3&gt;

&lt;p&gt;Such as AWS, Microsoft Azure and Google Cloud Platform.&lt;/p&gt;

&lt;p&gt;1). Data Engineering Tools in AWS&lt;br&gt;
 Amazon Redshift, Amazon Athena&lt;br&gt;
2). Data Engineering Tools in Azure&lt;br&gt;
 Azure Data Factory, Azure Databricks&lt;/p&gt;

&lt;p&gt;This may seem like a lot to learn and cover as you become a Data Engineer; however, you need to master Python and advanced SQL well, since they are core to data engineering. Understand the big data tools that are available, and remember to build projects so that you understand what you are learning.&lt;/p&gt;

&lt;p&gt;Tools differ from one organization to another, so you need to understand the tools that your organization uses and become proficient with them.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Apache PySpark for Data Engineering</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Fri, 09 Sep 2022 21:04:29 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/apache-pyspark-for-data-engineering-3phi</link>
      <guid>https://dev.to/kinyungu_denis/apache-pyspark-for-data-engineering-3phi</guid>
      <description>&lt;p&gt;Greetings to my dear readers. I wrote an article about installing Apache PySpark in Ubuntu and explained about about Apache Spark. &lt;a href="https://dev.to/deno_exporter/to-install-apache-spark-and-run-pyspark-in-ubuntu-2204-4i79"&gt;Read it here&lt;/a&gt;. Now lets go take a deep dive into PySpark and know what it is. This article covers about Apache PySpark a tool that is used in data engineering, understand all details about PySpark and how to use it. One should have basic knowledge in Python, SQL to understand this article well.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache Spark?
&lt;/h2&gt;

&lt;p&gt;Let us first understand Apache Spark, then we will proceed to PySpark.&lt;br&gt;
Apache Spark is an open-source cluster computing framework used for processing, querying and analyzing big data. It lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets, because each node only works with a small amount of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache PySpark?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez0258a53bstag386jt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez0258a53bstag386jt9.png" alt="Python and Apache Spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Spark was originally written in the Scala programming language; the open-source community later developed a tool to support Python for Apache Spark, called PySpark. PySpark provides the Py4j library, with the help of which Python can be easily integrated with Apache Spark. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing vast data in a distributed environment. PySpark is a tool in high demand among data engineers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features of PySpark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F936cl0llcbs3lcj2khpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F936cl0llcbs3lcj2khpu.png" alt="PySpark Features"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Speed - PySpark allows us to achieve a high data processing speed, which is about 100 times faster in memory and 10 times faster on the disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Caching - The PySpark framework provides powerful caching and good disk persistence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time - PySpark provides real-time computation on a large amount of data because it focuses on in-memory processing, which gives it low latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deployment - We have local mode and cluster mode. In local mode it runs on a single machine, for example my laptop, which is convenient for testing and debugging. In cluster mode there is a set of predefined machines, and it is good for production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PySpark works well with Resilient Distributed Datasets (RDDs)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Running our cluster locally
&lt;/h2&gt;

&lt;p&gt;To start any Spark application on a local cluster or a dataset, we use SparkConf to set some configuration and parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commonly used features of the SparkConf when working with PySpark:
&lt;/h3&gt;

&lt;p&gt;set(key, value) - sets a configuration property.&lt;br&gt;
setMaster(value) - sets the master URL to connect to.&lt;br&gt;
setAppName(value) - sets the application name.&lt;br&gt;
get(key, defaultValue=None) - gets the configured value for a key, or the default if it is unset.&lt;br&gt;
setSparkHome(value) - sets the path where Spark is installed on worker nodes.&lt;/p&gt;

&lt;p&gt;The following example shows some attributes of SparkConf:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx40ii0tlt95tv1fiugvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx40ii0tlt95tv1fiugvi.png" alt="SparkConf attributes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Spark program first creates a SparkContext object, which tells the application how to access a cluster. To accomplish this, you need to set up SparkConf so that the SparkContext object contains the configuration information about the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  SparkContext
&lt;/h2&gt;

&lt;p&gt;SparkContext is the first and most essential thing that gets initiated when we run any Spark application. It is the entry gate for any Spark-derived application or functionality. It is available as sc by default in PySpark.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note that creating any other variable instead of sc will give an error.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Inspecting our SparkContext:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1nc9sirchvzsodnti5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1nc9sirchvzsodnti5r.png" alt="Inspecting Spark Context"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Master - The URL of the cluster that Spark connects to.&lt;/p&gt;

&lt;p&gt;appName - The name of your task.&lt;/p&gt;

&lt;p&gt;Master and appName are the most widely used SparkContext parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  PySpark SQL
&lt;/h2&gt;

&lt;p&gt;PySpark SQL integrates relational processing with Spark's functional programming. It lets you extract data using an SQL query language, with queries written the same way as in SQL.&lt;/p&gt;

&lt;p&gt;PySpark SQL establishes the connection between the RDD and the relational table. It supports a wide range of data sources and algorithms in big data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features of PySpark SQL:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Incorporation with Spark - PySpark SQL queries are integrated with Spark programs, queries are used inside the Spark programs. Developers do not have to manually manage state failure or keep the application in sync with batch jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistence Data Access - PySpark SQL supports a shared way to access a variety of data sources like Parquet, JSON, Avro, Hive and JDBC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User-Defined Functions - PySpark SQL has a language combined User-Defined Function (UDFs). UDF is used to define a new column-based function that extends the vocabulary of Spark SQL's DSL for transforming DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hive Compatibility - PySpark SQL runs unmodified Hive queries and allows full compatibility with current Hive data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standard Connectivity - It provides a connection through JDBC or ODBC, the industry standards for connectivity for business intelligence tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important classes of Spark SQL and DataFrames are the following:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.SparkSession: Represents the main entry point for DataFrame and SQL functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.DataFrame: Represents a distributed collection of data grouped into named columns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.Row: Represents a row of data in a DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.Column: Represents a column expression in a DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.DataFrameStatFunctions: Represents methods for statistics functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.DataFrameNaFunctions: Represents methods for handling missing data (null values).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.GroupedData: Aggregation methods, returned by DataFrame.groupBy().&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.types: Represents a list of available data types.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.functions: Represents a list of built-in functions available for DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.Window: Used to work with Window functions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import pyspark 
from pyspark.sql import SparkSession   
spark = SparkSession.builder.getOrCreate()  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A SparkSession can be used to create the Dataset and DataFrame API. A SparkSession can also be used to create a DataFrame, register a DataFrame as a table, execute SQL over tables, cache a table, and read a parquet file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

class builder


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is a builder of SparkSession.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

getOrCreate()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is used to get an existing SparkSession, or if there is no existing one, create a new one based on the options set in the builder.&lt;/p&gt;

&lt;p&gt;pyspark.sql.DataFrame&lt;/p&gt;

&lt;p&gt;A distributed collection of data grouped into named columns. A DataFrame is similar to a relational table in Spark SQL, and can be created using various functions in SQLContext.&lt;br&gt;
It can then be manipulated using several domain-specific-language (DSL) methods, which are pre-defined functions of DataFrame.&lt;/p&gt;
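&lt;p&gt;Putting the pieces together, a minimal sketch might look like this (it assumes a local PySpark installation; the data and column names are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

# Get or create a SparkSession.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Create a DataFrame from an in-memory collection (hypothetical data).
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Register it as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 2").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;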

&lt;h3&gt;
  
  
  Querying Using PySpark SQL
&lt;/h3&gt;

&lt;p&gt;This displays my file where SQL queries are executed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cmzx1ndetj9xcxb30my.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cmzx1ndetj9xcxb30my.png" alt="The file where SQL queries are executed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbue90bxc5bpynxa37la.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbue90bxc5bpynxa37la.png" alt="Select Query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2jyycc7ebf88eni7zik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2jyycc7ebf88eni7zik.png" alt="Filter Query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The groupBy() function groups rows of the same category together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxlpowhc40ow0gagoszr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxlpowhc40ow0gagoszr.png" alt="Group by SQL query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PySpark UDF
&lt;/h2&gt;

&lt;p&gt;The PySpark UDF (User-Defined Function) is used to define a new column-based function. Using User-Defined Functions (UDFs), you can write functions in Python and use them when writing Spark SQL queries.&lt;/p&gt;

&lt;p&gt;You can declare a User-Defined Function just like any other Python function. The trick comes later, when you register the Python function with Spark. To use such functions in PySpark, first register them through the spark.udf.register() function.&lt;br&gt;
It accepts two parameters:&lt;/p&gt;

&lt;p&gt;name - A string, function name you'll use in SQL queries.&lt;br&gt;
f - A Python function that contains the programming logic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

spark.udf.register()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
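&lt;p&gt;Putting the two parameters together, a registration might look like this (a sketch assuming a running SparkSession; the function name is invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# f: a plain Python function containing the programming logic.
def to_upper(s):
    return s.upper() if s is not None else None

# name: "TO_UPPER", the string used in SQL queries afterwards.
spark.udf.register("TO_UPPER", to_upper)

spark.sql("SELECT TO_UPPER('spark') AS shouting").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;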

&lt;p&gt;Py4JJavaError is the most common exception while working with UDFs. It usually comes from a data type mismatch between Python and Spark.&lt;/p&gt;

&lt;p&gt;An example of a user-defined function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w60kacnyo8mek2bdc8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w60kacnyo8mek2bdc8w.png" alt="A basic User Defined Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PySpark RDD(Resilient Distributed Dataset)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2sp60uzlo2ffn9iyged.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2sp60uzlo2ffn9iyged.png" alt="Resilient Distributed dataSets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Resilient Distributed Datasets (RDDs) are an essential part of PySpark; they handle both structured and unstructured data and help perform in-memory computations on a large cluster. An RDD divides data into smaller parts based on a key. Dividing data into smaller chunks means that if one executor node fails, another node can still process the data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In-memory Computation - Computed results are stored in distributed memory (RAM) instead of stable storage (disk) providing very fast computation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Immutability - The created data can be retrieved anytime but its value can't be changed. RDDs can only be created through deterministic operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fault Tolerant - RDDs track data lineage information to reconstruct lost data automatically. If failure occurs in any partition of RDDs, then that partition can be re-computed from the original fault tolerant input dataset to create it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Coarse-Gained Operation - Coarse grained operation means that we can transform the whole dataset but not individual element on the dataset. On the other hand, fine grained mean we can transform individual element on the dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Partitioning - RDDs are the collection of various data items that are so huge in size, they cannot fit into a single node and must be partitioned across various nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Persistence - Optimization technique where we can save the result of RDD evaluation. It stores the intermediate result so that we can use it further if required and reduces the computation complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lazy Evaluation - It doesn't compute the result immediately; execution does not start until an action is triggered. When we call a transformation on an RDD, it does not execute immediately.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PySpark provides two methods to create RDDs: loading an external dataset, or distributing a collection of objects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using parallelize
&lt;/h3&gt;

&lt;p&gt;Create an RDD with the &lt;code&gt;parallelize()&lt;/code&gt; function, which takes an existing collection in your program and distributes it through the SparkContext.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixs7lntqwaj06qp3or8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixs7lntqwaj06qp3or8g.png" alt="Using parallelize function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also use the &lt;code&gt;createDataFrame()&lt;/code&gt; function. Since we already have a SparkSession, we can create our DataFrame directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcqj949cawql3aayr1kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcqj949cawql3aayr1kw.png" alt="Using Create DataFrame"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  External Data
&lt;/h3&gt;

&lt;p&gt;Read a single text file from HDFS, a local file system or any Hadoop-supported file system URI with &lt;code&gt;textFile()&lt;/code&gt;, or read a directory of text files with &lt;code&gt;wholeTextFiles()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdm1drf5acav4s0kp0xm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdm1drf5acav4s0kp0xm.png" alt="Using TextFile"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Using read_csv()
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7ityu72pfdypvocmiah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7ityu72pfdypvocmiah.png" alt="Using read_csv() to read a csv file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output of using &lt;code&gt;scores_file.show()&lt;/code&gt; and &lt;code&gt;scores_file.printSchema()&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6cx6j3yuv2t2zyceama.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6cx6j3yuv2t2zyceama.png" alt="Output for our csv file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  RDD Operations in PySpark
&lt;/h3&gt;

&lt;p&gt;RDD supports two types of operations:&lt;/p&gt;

&lt;p&gt;Transformations - Operations that create a new RDD. They follow the principle of lazy evaluation (execution does not start until an action is triggered). For example:&lt;br&gt;
map, flatMap, filter, distinct, reduceByKey, mapPartitions, sortBy&lt;/p&gt;

&lt;p&gt;Actions - Operations applied to an RDD that trigger Spark to run the computation and return the result to the driver. For example:&lt;br&gt;
collect, collectAsMap, reduce, countByKey/countByValue, take, first&lt;/p&gt;

&lt;p&gt;The map() transformation takes in a function and applies it to each element in the RDD.&lt;br&gt;
The collect() action returns all the elements of the RDD to the driver.&lt;/p&gt;
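&lt;p&gt;Since the PySpark calls below are shown as screenshots, here is a plain-Python sketch (not PySpark itself) of what &lt;code&gt;map()&lt;/code&gt; followed by &lt;code&gt;collect()&lt;/code&gt; computes; the list &lt;code&gt;numbers&lt;/code&gt; is a hypothetical stand-in for an RDD:&lt;/p&gt;

```python
# Plain-Python sketch of RDD map()/collect() semantics.
# In PySpark this would be: sc.parallelize(numbers).map(lambda x: x * 2).collect()
numbers = [1, 2, 3, 4]

# map() applies the function to each element
doubled = list(map(lambda x: x * 2, numbers))

print(doubled)  # collect() would return this list to the driver: [2, 4, 6, 8]
```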

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmtu62ecm0v3qxcy9erc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmtu62ecm0v3qxcy9erc.png" alt="Map Transformation and Collect Action"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibrhuf1sjbdiwjambvzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibrhuf1sjbdiwjambvzt.png" alt="Map Transformation and Collect Action in Strings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The RDD transformation filter() returns a new RDD containing only the elements that satisfy a given function. It is useful for filtering large datasets based on a keyword. &lt;br&gt;
The count() action returns the number of elements in the RDD.&lt;/p&gt;
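&lt;p&gt;As a rough plain-Python sketch of the same semantics (the list &lt;code&gt;lines&lt;/code&gt; stands in for an RDD of strings, and the data is hypothetical):&lt;/p&gt;

```python
# Plain-Python sketch of filter()/count() semantics.
# In PySpark: rdd.filter(lambda line: "spark" in line).count()
lines = ["spark is fast", "hello world", "learning spark"]

# keep only the elements that satisfy the predicate, like rdd.filter(...)
spark_lines = [line for line in lines if "spark" in line]

print(len(spark_lines))  # count() returns the number of elements: 2
```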

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vadjds3e7tauj9jhnnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vadjds3e7tauj9jhnnr.png" alt="Using Filter transformation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The RDD transformation reduceByKey() operates on (key, value) pairs and merges the values for each key.&lt;/p&gt;
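&lt;p&gt;A plain-Python sketch of what &lt;code&gt;reduceByKey()&lt;/code&gt; computes when the merge function is addition (the pairs are hypothetical sample data):&lt;/p&gt;

```python
from collections import defaultdict

# Plain-Python sketch of reduceByKey() semantics: merge values per key.
# In PySpark: sc.parallelize(pairs).reduceByKey(lambda a, b: a + b).collect()
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value  # the merge function here is addition

print(sorted(totals.items()))  # [('a', 4), ('b', 6)]
```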

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxi0elay682dumi3oicr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxi0elay682dumi3oicr.png" alt="Using reduceByKey()"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;join() returns an RDD that pairs, for each matching key, the values from both RDDs.&lt;/p&gt;
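&lt;p&gt;A plain-Python sketch of the &lt;code&gt;join()&lt;/code&gt; semantics on two lists of (key, value) pairs (the names and values are hypothetical):&lt;/p&gt;

```python
# Plain-Python sketch of join() semantics on (key, value) pairs.
# In PySpark: rdd1.join(rdd2).collect()
scores = [("alice", 80), ("bob", 70)]
grades = [("alice", "A"), ("bob", "B"), ("carol", "C")]

joined = [
    (k1, (v1, v2))
    for (k1, v1) in scores
    for (k2, v2) in grades
    if k1 == k2  # only matching keys appear in the result
]

print(joined)  # [('alice', (80, 'A')), ('bob', (70, 'B'))]
```

Note that "carol" is dropped because the key appears in only one of the two collections, just as in an inner join.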

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1aqwsaj82gi14wf2kvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1aqwsaj82gi14wf2kvq.png" alt="Join RDDs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataFrame from RDD
&lt;/h3&gt;

&lt;p&gt;PySpark provides two methods to convert an RDD to a DataFrame:&lt;br&gt;
toDF(), createDataFrame(rdd, schema)&lt;/p&gt;

&lt;p&gt;DataFrames also have two operations: transformations and actions.&lt;/p&gt;

&lt;p&gt;DataFrame transformations include: select, filter, groupBy, orderBy, dropDuplicates, withColumnRenamed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;select() - subsets the columns in a DataFrame&lt;/li&gt;
&lt;li&gt;filter() - filters out rows based on a condition&lt;/li&gt;
&lt;li&gt;groupBy() - groups rows based on a column&lt;/li&gt;
&lt;li&gt;orderBy() - sorts the DataFrame based on one or more columns&lt;/li&gt;
&lt;li&gt;dropDuplicates() - removes duplicate rows from a DataFrame&lt;/li&gt;
&lt;li&gt;withColumnRenamed() - renames a column in the DataFrame&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DataFrame actions include: head, show, count, describe, columns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;describe() - computes the summary statistics of the numerical columns in a DataFrame&lt;/li&gt;
&lt;li&gt;printSchema() - prints the types of the columns in a DataFrame&lt;/li&gt;
&lt;li&gt;columns - returns all the column names in the DataFrame&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inspecting Data in PySpark&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Print the first 10 observations
people_df.show(10)

# Count the number of rows
print("There are {} rows in the people_df DataFrame.".format(people_df.count()))

# Count the number of columns and their names
print("There are {} columns in the people_df DataFrame and their names are {}".format(len(people_df.columns), people_df.columns))



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;PySpark DataFrame sub-setting and cleaning&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Select name, sex and date of birth columns
people_df_sub = people_df.select('name', 'sex', 'date of birth')

# Print the first 10 observations from people_df_sub
people_df_sub.show(10)

# Remove duplicate entries from people_df_sub
people_df_sub_nodup = people_df_sub.dropDuplicates()

# Count the number of rows
print("There were {} rows before removing duplicates, and {} rows after removing duplicates".format(people_df_sub.count(), people_df_sub_nodup.count()))



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Filtering your DataFrame&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Filter people_df to select females
people_df_female = people_df.filter(people_df.sex == "female")

# Filter people_df to select males
people_df_male = people_df.filter(people_df.sex == "male")

# Count the number of rows
print("There are {} rows in the people_df_female DataFrame and {} rows in the people_df_male DataFrame".format(people_df_female.count(), people_df_male.count()))



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Stopping SparkContext
&lt;/h3&gt;

&lt;p&gt;To stop a SparkContext:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sc.stop()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is crucial to understand Resilient Distributed Datasets (RDDs) and SQL, since they are used extensively in data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article covers the key areas of Apache PySpark you should understand as you learn to become a data engineer. You should now be able to initialize Spark, use User Defined Functions (UDFs), load data, and work with RDDs by applying actions and transformations. Soon I will write an article on a practical use case of Apache PySpark in a project.&lt;/p&gt;

&lt;p&gt;Feel free to drop your comments about the article.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>sql</category>
    </item>
    <item>
      <title>Data Engineering 102: Introduction to Python for Data Engineering.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Wed, 31 Aug 2022 21:00:19 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/introduction-to-python-for-data-engineering-57i6</link>
      <guid>https://dev.to/kinyungu_denis/introduction-to-python-for-data-engineering-57i6</guid>
      <description>&lt;p&gt;Greetings to my dear readers, today we will be covering about Python for Data Engineering. If you read my article about Data Engineering 101, we understood that one of the key skills required for a data engineer is strong understanding of Python language. Read that article to gain a basic understanding about data engineering.&lt;/p&gt;

&lt;p&gt;Can one use other languages for data engineering? I would answer yes: Scala and Java, for example. Let's understand why we use Python for data engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A data engineer works with many different data formats, and Python is well suited to this. Its standard library supports easy handling of .csv files, one of the most common data file formats.&lt;/li&gt;
&lt;li&gt;Data engineering tools such as Apache Airflow and Apache NiFi use Directed Acyclic Graphs (DAGs), which are specified with Python code. Learning Python therefore helps data engineers use these tools efficiently.&lt;/li&gt;
&lt;li&gt;A data engineer must not only obtain data from different sources but also process it. One of the most popular data processing engines is Apache Spark, which offers a Python API, PySpark, to build scalable big data projects.&lt;/li&gt;
&lt;li&gt;A data engineer is often required to use APIs to retrieve data from databases. The data in such cases is usually stored in JSON (JavaScript Object Notation) format, and Python has a built-in json module to handle this type of data.&lt;/li&gt;
&lt;li&gt;Luigi, a Python package that helps us build complex data pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Python is relatively easy to learn and is open-source. An active community of developers strongly supports it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now that we have understood some of the reasons for choosing Python, how do we use it in data engineering?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data Acquisition and Ingestion: this involves obtaining data from databases, APIs and other sources. A data engineer uses Python to retrieve and ingest the data.&lt;/p&gt;

&lt;p&gt;Data Manipulation: this refers to how a data engineer transforms structured, unstructured and semi-structured data into meaningful information.&lt;/p&gt;

&lt;p&gt;Parallel Computing: this is necessary when memory and processing power are limited. A data engineer uses Python to split a job into sub-tasks and distribute them.&lt;/p&gt;

&lt;p&gt;Data Pipelines: ETL pipelines extract, transform and load data. Tools such as Snowflake and Apache Airflow integrate easily with Python.&lt;/p&gt;

&lt;p&gt;That's great, now we know how Python is used in data engineering. First, we need to be familiar with basic Python and understand it well in order to write code. I will use JupyterLab, a code editor that comes with Anaconda. I will explain basic Python with examples to ensure we understand the concepts well.&lt;/p&gt;

&lt;p&gt;For basic Python we will cover the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Variables &lt;/li&gt;
&lt;li&gt;Strings &lt;/li&gt;
&lt;li&gt;Math Expressions&lt;/li&gt;
&lt;li&gt;Loops&lt;/li&gt;
&lt;li&gt;Tuples, List, Dictionary and Sets&lt;/li&gt;
&lt;li&gt;Functions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A variable is a container that stores a value, and a variable name is the label used to refer to that value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name = value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x04-Lwmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uahtcxnm8dygehdr4l33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x04-Lwmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uahtcxnm8dygehdr4l33.png" alt="Variable name definition" width="717" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above tells us the rules that we should follow when defining variable names. Ensure you use concise and descriptive variable names such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;officer_duty = False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a variable to be treated as a constant, name it in capital letters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MAXIMUM_FILE_LIMIT = 1500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Strings
&lt;/h2&gt;

&lt;p&gt;A string is a series of characters enclosed in single or double quotation marks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hW8ql9aH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1dktc5x3i2ksqifyo8ou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hW8ql9aH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1dktc5x3i2ksqifyo8ou.png" alt="Python Strings" width="732" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python 3.6 introduced f-strings (formatted string literals), which let us embed the values of variables inside a string.&lt;/p&gt;
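&lt;p&gt;A minimal f-string example (the variable values are hypothetical):&lt;/p&gt;

```python
# f-strings embed variable values directly inside a string
name = "Denis"
language = "Python"
greeting = f"Hello {name}, welcome to {language}!"
print(greeting)  # Hello Denis, welcome to Python!
```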

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8nvDlitQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ib7g1hg52kqvd08jpgl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8nvDlitQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ib7g1hg52kqvd08jpgl2.png" alt="Python f-strings" width="815" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Mathematical Expressions
&lt;/h2&gt;

&lt;p&gt;Operators are used to perform various operations on values and variables. Python operators are classified into the following groups: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arithmetic operators&lt;/li&gt;
&lt;li&gt;Comparison operators&lt;/li&gt;
&lt;li&gt;Logical operators&lt;/li&gt;
&lt;li&gt;Bitwise operators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Arithmetic operators&lt;/strong&gt;&lt;br&gt;
These operators perform mathematical operations on numeric values. Python also has a math module for advanced numerical computations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ePtcDUUn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2a2bhr5wcl7mn30zfwt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ePtcDUUn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2a2bhr5wcl7mn30zfwt5.png" alt="Arithmetic opeators" width="676" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These operations give the following results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IQMMnXQd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xrjsk1ukpk4ddj2pej8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IQMMnXQd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xrjsk1ukpk4ddj2pej8i.png" alt="Arithmetic Results" width="704" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we combine multiple arithmetic operations, the operations inside parentheses are evaluated first. &lt;/p&gt;
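&lt;p&gt;A quick sketch of how parentheses change the result of a combined expression:&lt;/p&gt;

```python
# Parentheses are evaluated first, then the usual operator precedence applies
result_without = 2 + 3 * 4   # multiplication happens first
result_with = (2 + 3) * 4    # parentheses happen first
print(result_without)  # 14
print(result_with)     # 20
```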

&lt;p&gt;&lt;strong&gt;Comparison Operators&lt;/strong&gt;&lt;br&gt;
These operators compare two values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less than ( &amp;lt; )&lt;/li&gt;
&lt;li&gt;Less than or equal to (&amp;lt;=)&lt;/li&gt;
&lt;li&gt;Greater than (&amp;gt;)&lt;/li&gt;
&lt;li&gt;Greater than or equal to (&amp;gt;=)&lt;/li&gt;
&lt;li&gt;Equal to ( == )&lt;/li&gt;
&lt;li&gt;Not equal to ( != )&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They compare numbers or strings and return a boolean value (either True or False).&lt;/p&gt;
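&lt;p&gt;A short sketch of equality comparisons returning booleans (only &lt;code&gt;==&lt;/code&gt; and &lt;code&gt;!=&lt;/code&gt; are shown here; the other operators work the same way):&lt;/p&gt;

```python
# Comparison operators always return a boolean value
print(10 == 10)          # True
print("cat" != "dog")    # True
is_equal = (7 == 3 + 4)  # the comparison itself evaluates to a boolean
print(is_equal)          # True
```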

&lt;p&gt;&lt;strong&gt;Logical Operators&lt;/strong&gt;&lt;br&gt;
These operators check multiple conditions at the same time.&lt;br&gt;
We have the &lt;code&gt;and&lt;/code&gt;, &lt;code&gt;or&lt;/code&gt; and &lt;code&gt;not&lt;/code&gt; operators. &lt;br&gt;
&lt;code&gt;and&lt;/code&gt; - returns True only when both conditions are True; otherwise it returns False. &lt;br&gt;
&lt;code&gt;or&lt;/code&gt; - returns True when at least one condition is True, and False when both conditions are False.&lt;br&gt;
&lt;code&gt;not&lt;/code&gt; - reverses the given condition.&lt;/p&gt;
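&lt;p&gt;The three logical operators can be sketched as follows (the variables are hypothetical):&lt;/p&gt;

```python
age = 25
has_id = True

print(age == 25 and has_id)  # True: both conditions hold
print(age == 30 or has_id)   # True: at least one condition holds
print(not has_id)            # False: not reverses the condition
```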

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--no4x2Ny6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4lmiuej9nubh9f1aquh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--no4x2Ny6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4lmiuej9nubh9f1aquh.png" alt="Logical operator in Python" width="715" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bitwise Operators&lt;/strong&gt;&lt;br&gt;
They operate on the individual bits of binary numbers.&lt;/p&gt;
&lt;h2&gt;
  
  
  Loops
&lt;/h2&gt;

&lt;p&gt;Python has two loops: the while loop and the for loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;while loop&lt;/strong&gt;&lt;br&gt;
You will run a code block as long as the condition specified is True.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while condition:
   body
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The condition is an expression that evaluates to True or False (a boolean value).&lt;br&gt;
while checks the condition at the beginning of each iteration and executes the body as long as the condition is True.&lt;br&gt;
Inside the body, you need something that eventually ends the loop, to avoid an infinite loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;day_of_week = 0
while True:
   print(day_of_week)
   day_of_week += 1

   if day_of_week == 5:
      break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the block of code above, &lt;code&gt;day_of_week&lt;/code&gt; is incremented by one on each iteration. The &lt;code&gt;if&lt;/code&gt; statement checks whether &lt;code&gt;day_of_week == 5&lt;/code&gt;; the loop runs until the value five is reached, at which point the &lt;code&gt;break&lt;/code&gt; statement exits the loop. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h_3-WWtt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kthzjttvhock71ffkfim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h_3-WWtt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kthzjttvhock71ffkfim.png" alt="Python while loop" width="643" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;for loop&lt;/strong&gt;&lt;br&gt;
We mainly use a for loop to execute a code block a fixed number of times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for index in range(n):
   statement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the syntax of a for loop. The &lt;code&gt;index&lt;/code&gt; is called the loop counter, and &lt;code&gt;n&lt;/code&gt; is the number of times the loop executes the statement. &lt;code&gt;range()&lt;/code&gt; is a built-in function; &lt;code&gt;range(n)&lt;/code&gt; generates the sequence of numbers from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;n&lt;/code&gt;, with the last value &lt;code&gt;n&lt;/code&gt; excluded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum = 0
for number in range(101):
   sum += number

print(sum)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SLAwl8Q7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3nsyfkdwrve6pw85kqgp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SLAwl8Q7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3nsyfkdwrve6pw85kqgp.png" alt="For loops in Python" width="779" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you see in &lt;code&gt;range(0, 10, 2)&lt;/code&gt;, the full form is &lt;code&gt;range(start, stop, step)&lt;/code&gt;. You can change the values and see how the code behaves.&lt;/p&gt;
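&lt;p&gt;A quick sketch of the three-argument form of &lt;code&gt;range()&lt;/code&gt;:&lt;/p&gt;

```python
# range(start, stop, step): from start up to (but excluding) stop
even_numbers = list(range(0, 10, 2))
print(even_numbers)  # [0, 2, 4, 6, 8]
```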

&lt;h2&gt;
  
  
  Functions
&lt;/h2&gt;

&lt;p&gt;A function is a block of code that performs a certain task or returns a value. Functions help to divide a program into manageable parts to make it easier to read, test and maintain the program.&lt;br&gt;
This is how we write a function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def greet(name):
   return f'{name} how are you doing?'
greetings = greet('Richard')
print(greetings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A parameter is a piece of information that a function needs, specified in the function definition. In our example &lt;code&gt;name&lt;/code&gt; is a parameter.&lt;br&gt;
An argument is the piece of data you pass to the function when you call it; &lt;code&gt;Richard&lt;/code&gt; is an argument.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eXCl2q3K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d82gcyxo7jbj6e6jl2s7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eXCl2q3K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d82gcyxo7jbj6e6jl2s7.png" alt="Functions in Python" width="786" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also have recursive functions: a recursive function is a function that calls itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wUgCaU1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a14fov1dkbvm6v9xm36h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wUgCaU1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a14fov1dkbvm6v9xm36h.png" alt="Recursive functions in Python" width="866" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lambda Function&lt;/strong&gt;&lt;br&gt;
When you have a simple function with a single expression, defining it with the &lt;code&gt;def&lt;/code&gt; keyword can be unnecessary. Lambda expressions allow you to define anonymous functions, typically used once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;map() function&lt;/strong&gt;&lt;br&gt;
This function takes two arguments: the function to apply and the iterable to apply it to.&lt;br&gt;
It provides a quick and clean way to apply a function to every element without writing a for loop.&lt;/p&gt;
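&lt;p&gt;Combining &lt;code&gt;map()&lt;/code&gt; with a lambda can be sketched as follows (the list is hypothetical):&lt;/p&gt;

```python
numbers = [1, 2, 3, 4]

# map() applies the lambda to every element without a for loop
squares = list(map(lambda x: x ** 2, numbers))
print(squares)  # [1, 4, 9, 16]
```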

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E9bETm83--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lj90bfo6adwzvc9blb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E9bETm83--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lj90bfo6adwzvc9blb2.png" alt="Implement map and lambda function in Python" width="810" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List&lt;/strong&gt;&lt;br&gt;
A list is an ordered collection of items, enclosed in square brackets &lt;code&gt;[]&lt;/code&gt;.&lt;br&gt;
Because a list is mutable, you can add, remove, modify and sort its elements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;empty_list = []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
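&lt;p&gt;The list operations mentioned above (add, remove, modify, sort) can be sketched with hypothetical data:&lt;/p&gt;

```python
colors = ['red', 'green', 'blue']

colors.append('yellow')  # add an element
colors.remove('green')   # remove an element
colors[0] = 'cyan'       # modify an element in place
colors.sort()            # sort the list in place

print(colors)  # ['blue', 'cyan', 'yellow']
```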



&lt;p&gt;&lt;strong&gt;Tuples&lt;/strong&gt;&lt;br&gt;
A tuple is an ordered collection of items, enclosed in parentheses &lt;code&gt;()&lt;/code&gt;. It is immutable: you cannot change its elements after assignment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;selected_colors = ('cyan', 'gray', 'white')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;List comprehension&lt;/em&gt;&lt;br&gt;
A list comprehension transforms the elements of a list and returns a new list.&lt;br&gt;
The syntax for a list comprehension is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list_comprehension = [expression for item in iterable if condition == True]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let us implement this list comprehension and understand how it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dXLzulsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/khdc84ifd8gkv8m1pug6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dXLzulsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/khdc84ifd8gkv8m1pug6.png" alt="List comprehension in Python" width="807" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;unpacking and packing&lt;/em&gt;&lt;br&gt;
This can be done for both tuples and lists.&lt;br&gt;
When you create a tuple you assign values to it, that is referred to as packing a tuple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rainbow_colors = ('Red', 'Orange', 'Yellow', 'Green', 'Blue', 
   'Indigo', 'Violet')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extracting the values from a tuple back into variables is known as unpacking, so we will unpack our tuple.&lt;br&gt;
The number of variables used must match the number of values inside the tuple. For example, our tuple has seven values, so it can be unpacked into seven variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(first, second, third, forth, fifth, sixth, seventh) = 
   rainbow_colors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, this can be simplified with an asterisk &lt;code&gt;*&lt;/code&gt;: added before a variable name, it collects all the remaining elements and unpacks them into a list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(first, second, *other_colors) = rainbow_colors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The variable name &lt;code&gt;other_colors&lt;/code&gt; will contain all the remaining colors from the initial tuple &lt;code&gt;rainbow_colors&lt;/code&gt;.&lt;/p&gt;
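&lt;p&gt;Putting this together, a short sketch (reusing the &lt;code&gt;rainbow_colors&lt;/code&gt; tuple from above) shows what ends up in each variable:&lt;/p&gt;

```python
rainbow_colors = ('Red', 'Orange', 'Yellow', 'Green', 'Blue',
                  'Indigo', 'Violet')

# The starred name gathers every remaining element into a list
first, second, *other_colors = rainbow_colors

print(first)         # Red
print(second)        # Orange
print(other_colors)  # ['Yellow', 'Green', 'Blue', 'Indigo', 'Violet']
```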

&lt;p&gt;&lt;em&gt;unpacking lists&lt;/em&gt;&lt;br&gt;
The unpacking that was done on tuples can also be done on lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rainbow_colors = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 
   'Indigo', 'Violet']

first, second, *other_colors = rainbow_colors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have learnt that using &lt;code&gt;*&lt;/code&gt; on a variable name unpacks the remaining elements from the initial list into a new list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xzn6bobO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5nqfd6bt4dkj1f98m0g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xzn6bobO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5nqfd6bt4dkj1f98m0g9.png" alt="Unpacking tuples" width="799" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A_Yz_yZ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wzyd5xkpsk8f1aiqbikm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A_Yz_yZ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wzyd5xkpsk8f1aiqbikm.png" alt="Unpacking lists" width="790" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the above images, we see that using &lt;code&gt;*&lt;/code&gt; on a variable name returns a list.&lt;br&gt;
That's cool; you now understand unpacking in tuples and lists. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dictionary&lt;/strong&gt;&lt;br&gt;
It is a collection of key-value pairs that stores data. Python uses curly braces &lt;code&gt;{}&lt;/code&gt; to define a dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;empty_dictionary = {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customer = {
   'first_name' : 'Fred',
   'last_name' : 'Kagia',
   'age' : 39,
   'location' : 'Nairobi',
   'active' : True
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To iterate over all key-value pairs in a dictionary, you can use a for loop with two variables, &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt;. However, these names are arbitrary; we can use any other variable names in the for loop instead of the &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt; that we have decided to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for key, value in customer.items():
   print (f"{key} : {value}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f4dmCJ_i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tajk8336k1pvltkomur9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f4dmCJ_i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tajk8336k1pvltkomur9.png" alt="Python Dictionaries" width="599" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sets&lt;/strong&gt;&lt;br&gt;
A set is an unordered collection of elements in which each element is unique. We use curly braces &lt;code&gt;{}&lt;/code&gt; to enclose a set.&lt;br&gt;
To define an empty set we use this syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;empty_set = set()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;frozen sets&lt;/em&gt;&lt;br&gt;
To make a set immutable, use &lt;code&gt;frozenset()&lt;/code&gt;; this ensures that the elements in the set cannot be modified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
capital_cities_frozen = frozenset(capital_cities)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ObiRzULs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5asp4bfngagllkkmbup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ObiRzULs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5asp4bfngagllkkmbup.png" alt="Frozen sets cannot be modified" width="808" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DKhxSLeh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/miv7h8muz8ti05pa5kb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DKhxSLeh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/miv7h8muz8ti05pa5kb1.png" alt="Frozen set" width="673" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To access the index of elements in a set as you iterate over them, you can use the built-in function &lt;code&gt;enumerate()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}

for index, city in enumerate(capital_cities, 1):
   print(f"{index}. Capital city is {city}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---IxfKh0e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mzfq8023fi9kafomqv9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---IxfKh0e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mzfq8023fi9kafomqv9p.png" alt="using enumerate in sets" width="694" height="166"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Set Theory&lt;/em&gt;&lt;br&gt;
This refers to methods of the set datatype that operate on collections of objects.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;set.intersection() - returns the elements present in both sets.&lt;/li&gt;
&lt;li&gt;set.difference() - returns the elements in one set but not in the other set.&lt;/li&gt;
&lt;li&gt;set.symmetric_difference() - returns the elements in exactly one of the sets.&lt;/li&gt;
&lt;li&gt;set.union() - returns the elements in either set.&lt;/li&gt;
&lt;/ul&gt;
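&lt;p&gt;A small sketch of these four methods, using the &lt;code&gt;capital_cities&lt;/code&gt; set from above and a second, made-up set of coastal cities:&lt;/p&gt;

```python
capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
coastal_cities = {'Lagos', 'Cairo', 'Mombasa'}

print(capital_cities.intersection(coastal_cities))          # in both sets
print(capital_cities.difference(coastal_cities))            # only in the first set
print(capital_cities.symmetric_difference(coastal_cities))  # in exactly one set
print(capital_cities.union(coastal_cities))                 # in either set
```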

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bEmyBGDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v446p1zio4p16zmy2uve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bEmyBGDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v446p1zio4p16zmy2uve.png" alt="Set theory in Python" width="829" height="321"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Working with Data
&lt;/h2&gt;

&lt;p&gt;1. JSON&lt;br&gt;
2. datetime&lt;br&gt;
3. Pandas&lt;br&gt;
4. NumPy&lt;/p&gt;
&lt;h2&gt;
  
  
  JSON
&lt;/h2&gt;

&lt;p&gt;This is a syntax for storing and exchanging data. Python has a module &lt;code&gt;json&lt;/code&gt; that is used to work with JSON data.&lt;/p&gt;

&lt;p&gt;To convert JSON to Python, you parse the JSON string using the &lt;code&gt;json.loads()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;To convert Python to JSON, you convert the object to a JSON string using the &lt;code&gt;json.dumps()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--or1kwJZ7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/onjdqjp522zr05gdhspl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--or1kwJZ7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/onjdqjp522zr05gdhspl.png" alt="Basic usage of JSON" width="744" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To analyze and debug JSON data, we may need to print it in a more readable format. This can be done by passing the additional parameters &lt;code&gt;indent&lt;/code&gt; and &lt;code&gt;sort_keys&lt;/code&gt; to the &lt;code&gt;json.dumps()&lt;/code&gt; and &lt;code&gt;json.dump()&lt;/code&gt; methods.&lt;/p&gt;
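&lt;p&gt;A minimal sketch of both conversions, including the readable form (the &lt;code&gt;customer_json&lt;/code&gt; string here is made up for illustration):&lt;/p&gt;

```python
import json

# A hypothetical JSON string for illustration
customer_json = '{"first_name": "Fred", "age": 39}'

# JSON string to Python dictionary
customer = json.loads(customer_json)
print(customer['first_name'])  # Fred

# Python dictionary back to a JSON string, pretty-printed
print(json.dumps(customer, indent=4, sort_keys=True))
```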

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0UkzgzRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aw2dbk76tt6b65d4vk41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0UkzgzRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aw2dbk76tt6b65d4vk41.png" alt="JSON in a readable form" width="816" height="320"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  datetime
&lt;/h2&gt;

&lt;p&gt;We use a module called &lt;code&gt;datetime&lt;/code&gt; to work with dates as date objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datetime

current_time = datetime.datetime.now()
print(current_time)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A datetime object contains year, month, day, hour, minute, second and microsecond. You can access these as attributes of the date object.&lt;/p&gt;
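&lt;p&gt;For example, reading the individual components of the current time:&lt;/p&gt;

```python
import datetime

current_time = datetime.datetime.now()

# year, month, day, hour, minute, second and microsecond are attributes
print(current_time.year)
print(current_time.month)
print(current_time.day)
```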

&lt;p&gt;Create a date object&lt;br&gt;
You may use the &lt;code&gt;datetime()&lt;/code&gt; class of the datetime module. This class requires three parameters to create a date: year, month and day.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datetime

planned_date = datetime.datetime(2022, 9, 3)
print(planned_date)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  NumPy
&lt;/h2&gt;

&lt;p&gt;NumPy (numerical Python) is a Python library for working with arrays. A NumPy array contains elements of the same type. This homogeneity allows NumPy arrays to be faster and more efficient than Python lists.&lt;/p&gt;

&lt;p&gt;Create a NumPy array&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

natural_numbers = np.array([1, 2, 3, 4, 5])
print(natural_numbers)
print(type(natural_numbers))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NumPy has a powerful technique called broadcasting: the ability to vectorize operations so that they are performed on all elements at once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;natural_numbers = np.array([1, 2, 3, 4, 5])
natural_numbers_squared = natural_numbers ** 2
print(natural_numbers_squared)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BM0xXtU4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vx9ss6aohyz2i0aa8wy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BM0xXtU4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vx9ss6aohyz2i0aa8wy.png" alt="Numpy basics" width="690" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also compare performing calculations with NumPy against using a Python list. We will see that NumPy performs better than Python lists.&lt;/p&gt;
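&lt;p&gt;One way to sketch such a comparison is to time the same element-wise operation done with a list comprehension and with NumPy broadcasting (the exact timings will vary by machine):&lt;/p&gt;

```python
import timeit

import numpy as np

numbers = list(range(10_000))
numbers_array = np.array(numbers)

# Square every element, repeated 100 times for each approach
list_time = timeit.timeit(lambda: [n ** 2 for n in numbers], number=100)
numpy_time = timeit.timeit(lambda: numbers_array ** 2, number=100)

print(f"list comprehension: {list_time:.4f}s")
print(f"numpy broadcasting: {numpy_time:.4f}s")
```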

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nkl5qf-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pkpl3iqtgtef3sirt1yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nkl5qf-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pkpl3iqtgtef3sirt1yu.png" alt="Comparing using List Comprehension and NumPy" width="880" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pandas
&lt;/h2&gt;

&lt;p&gt;pandas is a library used for working with datasets. It has functions for analyzing, exploring, cleaning and manipulating data. Its main data structure is the DataFrame: tabular data with labelled rows and columns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BF82Lqf4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i0f1q8xqe8sdmcp36yx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BF82Lqf4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i0f1q8xqe8sdmcp36yx2.png" alt="create a pandas DataFrame" width="808" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qj3hA5Fl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gfpa371yx5l0ca3skkuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qj3hA5Fl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gfpa371yx5l0ca3skkuq.png" alt="Reading data from csv" width="841" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;pandas has a method &lt;code&gt;.apply()&lt;/code&gt; that takes a function and applies it to a DataFrame. You must specify an axis: &lt;code&gt;0&lt;/code&gt; to apply the function to each column and &lt;code&gt;1&lt;/code&gt; to apply it to each row. This method can be used with anonymous functions (remember lambda functions).&lt;/p&gt;
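&lt;p&gt;A small sketch of &lt;code&gt;.apply()&lt;/code&gt; on a made-up DataFrame of film rental rates, applying a lambda to each row with &lt;code&gt;axis=1&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data for illustration
films = pd.DataFrame({
    'title': ['Alpha', 'Beta'],
    'rental_rate': [2.99, 4.99],
})

# axis=1 hands each row to the lambda
films['discounted_rate'] = films.apply(
    lambda row: row['rental_rate'] * 0.9, axis=1)

print(films)
```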

&lt;p&gt;We have covered the basics of Python that will help us understand and implement data engineering. We will be able to work with tools such as PySpark and Airflow.&lt;/p&gt;

&lt;p&gt;For example, let's look at some sample code for a Directed Acyclic Graph (DAG).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# this is DAG definition file

from airflow.models import DAG
from airflow.operators.python_operator
import python_operator

dag = DAG(dag_id = "etl_pipeline"
   schedule_interval = "0 0 * * *")

etl_task = Python_Operator(task_id = "etl_task"
   python_callable = etl, dag = dag)

etc_task.set_upstream(wait_for_this_task)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#defines an ETL function

def etl():
   film_dataframe = extract_film_to_pandas()
   film_dataframe = transform_rental_rate(film_dataframe)
   load_loadframe_to_film(film_dataframe)

#define ETL task using PythonOperator


etl_task = PythonOperator(task_id = 'etl_film',
   python_callable = etl, dag =dag)


#set the upstreamto wait_for_table and sample run etl()


etl_task.set_upstream(wait_for_table)
etl()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above shows a DAG (Directed Acyclic Graph) definition file with an ETL task, which is added to the DAG. The task is set to wait for an upstream task, wait_for_table, defined elsewhere in the DAG. It is just sample code; soon we will write our own DAGs and ETL pipelines and implement them.&lt;/p&gt;

&lt;p&gt;Learning Python is critical for our data engineering career, so ensure that you understand it well. We will continue together on this path of data engineering. Feel free to give your feedback about this article.&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>beginners</category>
    </item>
    <item>
      <title>To install Apache Spark and run Pyspark in Ubuntu 22.04</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Thu, 25 Aug 2022 17:27:38 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/to-install-apache-spark-and-run-pyspark-in-ubuntu-2204-4i79</link>
      <guid>https://dev.to/kinyungu_denis/to-install-apache-spark-and-run-pyspark-in-ubuntu-2204-4i79</guid>
      <description>&lt;p&gt;Hello my esteemed readers, today we will cover installing Apache Spark in our Ubuntu 22.04 and also to ensure that also our Pyspark is running without any errors. &lt;br&gt;
From our previous article about data engineering, we talked about a data engineer is responsible for processing large amount of data at scale, Apache Spark is one good tools for a data engineer to process data of any size. I will explain the steps to use using examples and screenshots from my machine so that you don't run into errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Apache Spark and what is it used for?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Apache Spark is a unified analytics engine for large-scale data processing on a single-node machine or across multiple clusters. It is open source, meaning you don't have to pay to download and use it. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.&lt;br&gt;
It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It supports code reuse across multiple workloads such as batch processing, real-time analytics, graph processing, interactive queries and machine learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How does Apache Spark work?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Spark does its processing in-memory, reducing the number of steps in a job and reusing data across multiple parallel operations. With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution. Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset. Data re-use is accomplished through DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), a collection of objects that is cached in memory and reused in multiple Spark operations. This dramatically lowers latency, as Apache Spark runs 100 times faster in-memory and 10 times faster on disk than Hadoop MapReduce.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Apache Spark Workloads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Spark Core&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It is responsible for distributing and monitoring jobs, memory management, fault recovery, scheduling, and interacting with storage systems. Spark Core is exposed through application programming interfaces (APIs) built for Java, Scala, Python and R.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spark SQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark SQL performs interactive queries for structured and semi-structured data. It is a distributed query engine that provides low-latency, interactive queries up to 100x faster than MapReduce. It includes a cost-based optimizer, columnar storage, and code generation for fast queries, while scaling to thousands of nodes. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spark Streaming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark Streaming is a real-time solution that leverages Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning Library (MLib)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark includes MLlib, a library of algorithms to do machine learning on data at scale. Machine Learning models can be trained by data scientists with R or Python on any Hadoop data source, saved using MLlib, and imported into a Java or Scala-based pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphX&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark GraphX is a distributed graph processing framework built on top of Spark. GraphX provides ETL, exploratory analysis, and iterative graph computation to enable users to interactively build, and transform a graph data structure at scale. It also provides an optimized runtime for this abstraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key Benefits of Apache Spark:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed:&lt;/strong&gt;  Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory. Through in-memory caching and optimized query execution, Spark can run fast analytic queries against data of any size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support Multiple Languages:&lt;/strong&gt;  Apache Spark natively supports Java, Scala, R, and Python, giving you a variety of languages for building your applications. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multiple Workloads:&lt;/strong&gt;  Apache Spark comes with the ability to run multiple workloads, including interactive queries, real-time analytics, machine learning, and graph processing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we have a basic understanding about Apache Spark, we can proceed to our installation in our machines.&lt;/p&gt;

&lt;p&gt;To download Apache Spark in Linux we need to have java installed in our machine.&lt;br&gt;
To check if you have java in your machine, use this command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

java --version


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example in my machine, java is installed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fymbw1cm2dbdmayu1hn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fymbw1cm2dbdmayu1hn.png" alt="To show java is installed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In case you don't have java installed in your system, use the following commands to install it:&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Install Java&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First, update the system packages:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo apt update


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Install Java:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo apt install default-jdk  -y


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Verify the Java installation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

java --version


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your Java version should be version 8 or later, which meets our requirement.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Install Apache Spark&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First install the required packages, using the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo apt install curl mlocate git scala -y


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Download Apache Spark. Find the latest release on the &lt;a href="https://spark.apache.org/downloads.html" rel="noopener noreferrer"&gt;download page&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Replace the link with the version you are downloading from the Apache download page, where I have&lt;br&gt;
entered my Spark file link.&lt;/p&gt;

&lt;p&gt;Extract the downloaded file using this command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

tar xvf spark-3.3.0-bin-hadoop3.tgz


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Ensure you specify the correct file name you have downloaded, since it could be another version. The above command extracts the file into the directory you downloaded it in. Make sure you note the path of your Spark directory.&lt;/p&gt;

&lt;p&gt;For example, my spark file directory appears as shown in the image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88lfez9qq1reomijyyct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88lfez9qq1reomijyyct.png" alt="My spark installation directory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have completed the above steps you are done downloading Apache Spark, but wait: we still have to configure the Spark environment. This is one of the sections that gives people errors and leaves them wondering what they aren't doing right. However, I will guide you to ensure that you successfully configure your environment, are able to use Apache Spark on your machine, and that PySpark runs as expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Configure Spark environment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For this, you have to set some environment variables in the bashrc configuration file.&lt;/p&gt;

&lt;p&gt;Open this file with your editor; in my case I will use the nano editor. The following command opens the file in nano:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo nano ~/.bashrc


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb4n07fyuot6i49qt0cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb4n07fyuot6i49qt0cp.png" alt="Using nano to open bashrc file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This file contains sensitive configuration, so don't delete any line in it. Go to the bottom of the file and add the following lines to the bashrc file to ensure that we can use Spark successfully.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

export SPARK_HOME=/home/exporter/spark-3.3.0-bin-hadoop3

export PATH=$PATH:$SPARK_HOME/bin

export SPARK_LOCAL_IP=localhost

export PYSPARK_PYTHON=/usr/bin/python3

export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Remember when I asked you to note your Spark installation directory: that installation directory should be assigned to &lt;code&gt;SPARK_HOME&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

export SPARK_HOME=&amp;lt;your Spark installation directory&amp;gt; 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example you can see mine is set to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

export SPARK_HOME=/home/exporter/spark-3.3.0-bin-hadoop3


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then write the other lines as they are without changing anything and save the bashrc file. The image below shows how the end of my bashrc file appears after adding the environment variables. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanlblqbwuk4519kf7wp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanlblqbwuk4519kf7wp2.png" alt="Variables at end of my bashrc file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After saving the bashrc file and exiting the nano editor, you need to load the new variables into your current shell. Use the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

source ~/.bashrc


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The image below shows how to write the command (I wrote my command twice, but you only need to write it once).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gr6wnrl86o5wm81kf4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gr6wnrl86o5wm81kf4f.png" alt="using source to save bashrc file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run Spark shell
&lt;/h2&gt;

&lt;p&gt;Now that you are done configuring the Spark environment, you need to check that Spark is working as expected. Use the command below to run the Spark shell:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

spark-shell


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the variables are configured successfully, you will see output such as this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7ooik31vw3d8x7wy4pz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7ooik31vw3d8x7wy4pz.png" alt="Spark-shell comand"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run Pyspark
&lt;/h2&gt;

&lt;p&gt;Use the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

pyspark


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the variables were configured successfully, you should see output like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7wlrj8kautwopoffxft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7wlrj8kautwopoffxft.png" alt="Shows pyspark in the shell"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we have provided a guide to installing Apache Spark and its necessary dependencies on Ubuntu 22.04, and described the configuration of the Spark environment in detail.&lt;/p&gt;

&lt;p&gt;This article should make it easy for you to understand Apache Spark and install it. So, esteemed readers, feel free to give feedback and comments.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>ubuntu</category>
    </item>
    <item>
      <title>Data Engineering 101: Introduction to Data Engineering.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Fri, 19 Aug 2022 16:14:00 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/data-engineering-101-introduction-to-data-engineering-4md3</link>
      <guid>https://dev.to/kinyungu_denis/data-engineering-101-introduction-to-data-engineering-4md3</guid>
<description>&lt;p&gt;Today we will look at an introduction to data engineering: what the field entails, what tools a data engineer uses, and what a data engineer should learn.&lt;br&gt;
This article will help developers who want to begin a career in data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Engineering?
&lt;/h2&gt;

&lt;p&gt;Data Engineering is the practice of designing and building systems for collecting, storing, processing, and analyzing large amounts of data at scale. It is a field that involves developing and maintaining large-scale data processing systems that make data available and usable for analysis and business-driven decisions.&lt;/p&gt;

&lt;p&gt;The image below illustrates the processes involved in data engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhiqklngt3svyu2dm6th.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhiqklngt3svyu2dm6th.png" alt="The process that involved in data engineering."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Who is a data engineer and what does a data engineer do?
&lt;/h2&gt;

&lt;p&gt;A data engineer is a person responsible for building data pipelines from different sources and preparing data for analytical and operational uses.&lt;br&gt;
Data engineers lay the foundations for the acquisition, storage, transformation, and management of data in an organization.&lt;br&gt;
They design, build, and maintain data warehouses. A data warehouse is a place where raw data is transformed and stored in queryable form.&lt;/p&gt;

&lt;h2&gt;
  
  
Let us understand the Data Engineering tools that a Data Engineer uses.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data engineering tools are specialized applications that make building data pipelines and designing algorithms easier and more efficient. These tools are responsible for making the day-to-day tasks of a data engineer easier in various ways.&lt;/li&gt;
&lt;li&gt;Data ingestion systems such as Kafka, for example, offer a seamless and quick data ingestion process while also allowing data engineers to locate appropriate data sources, analyze them, and ingest data for further processing.&lt;/li&gt;
&lt;li&gt;Data engineering tools support the process of transforming data. This is important since big data can be structured, unstructured, or in some other format. Therefore, data engineers need data transformation tools to transform and process big data into the desired format.&lt;/li&gt;
&lt;li&gt;Database tools/frameworks like SQL, NoSQL, etc., allow data engineers to acquire, analyze, process, and manage huge volumes of data simply and efficiently.&lt;/li&gt;
&lt;li&gt;Visualization tools like Tableau and Power BI allow data engineers to generate valuable insights and create interactive dashboards.&lt;/li&gt;
&lt;/ul&gt;
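&lt;p&gt;To make the database side of this list concrete, here is a small sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in for a production database; the table and column names are invented for the example.&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 120.0), (2, "bob", 75.5), (3, "alice", 30.0)],
)

# A simple analytical query: total spend per customer.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 150.0), ('bob', 75.5)]
```

&lt;p&gt;The same acquire-analyze-manage workflow applies when the engine is MySQL, PostgreSQL, or a NoSQL store; only the driver and dialect change.&lt;/p&gt;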

&lt;h2&gt;
  
  
  Commonly used Cloud-Based Data Engineering Tools:
&lt;/h2&gt;

&lt;p&gt;Data Engineering Tools in AWS&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt; Amazon Athena&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data Engineering Tools in Azure&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Data Factory&lt;/li&gt;
&lt;li&gt; Azure Databricks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly, the &lt;strong&gt;Data Engineering ecosystem&lt;/strong&gt; consists of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data — different data types, formats, and sources of data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data stores and repositories — Relational and non-relational databases, data warehouses, data marts, data lakes, and big data stores that store and process the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Pipelines — Collect/gather data from multiple sources, then clean, process, and transform it into data that can be used for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics and Data driven Decision Making — Make the well processed data available for further business analytics, visualization and data driven decision making.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ETL Data Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A data pipeline is essentially a collection of tools and methods for transferring data from one system to another for storage and processing. It collects data from several sources and stores it in a database.&lt;/p&gt;

&lt;p&gt;ETL (Extract, Transform and Load) involves extraction, transformation, and loading tasks across different environments.&lt;/p&gt;

&lt;p&gt;These three conceptual steps are how most data pipelines are designed and structured. They serve as a blueprint for how raw data is transformed to analysis-ready data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqhqgmbuzxlru66pa8o0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqhqgmbuzxlru66pa8o0.png" alt="ETL Data pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us explain these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract:&lt;/strong&gt; the step where sensors wait for upstream data sources to land, then the data is transported from its source locations for further transformation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transform:&lt;/strong&gt; where we apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets. This step requires a great deal of business understanding and domain knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Load:&lt;/strong&gt; finally, we load the processed data and transport it to its final destination. Often, this dataset can be consumed directly by end users.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
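&lt;p&gt;The three steps above can be sketched in plain Python. This is a toy pipeline over an in-memory source with invented field names, not a production implementation:&lt;/p&gt;

```python
# A toy ETL run; the "source" and its field names are made up for illustration.
source_records = [
    {"name": "Alice", "country": "KE", "purchase": "120"},
    {"name": "Bob", "country": "UG", "purchase": "bad-value"},
    {"name": "Carol", "country": "KE", "purchase": "30"},
]

def extract(records):
    """Extract: pull raw records from the upstream source."""
    return list(records)

def transform(records):
    """Transform: apply business logic — drop malformed rows, cast types,
    and aggregate purchases by country."""
    totals = {}
    for rec in records:
        try:
            amount = float(rec["purchase"])
        except ValueError:
            continue  # filter out rows that fail validation
        totals[rec["country"]] = totals.get(rec["country"], 0.0) + amount
    return totals

def load(totals, destination):
    """Load: write the analysis-ready dataset to its final destination."""
    destination.update(totals)

warehouse = {}
load(transform(extract(source_records)), warehouse)
print(warehouse)  # {'KE': 150.0}
```

&lt;p&gt;Real pipelines swap the in-memory source and destination for databases, files, or message queues, but the extract-transform-load shape stays the same.&lt;/p&gt;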

&lt;h2&gt;
  
  
  Data Warehousing
&lt;/h2&gt;

&lt;p&gt;A data warehouse is a database that stores all of your organization's historical data and allows you to run analytical queries against it. From a technical point of view, it is a database optimized for reading, aggregating, and querying massive amounts of data. Modern data warehouses can integrate structured and unstructured data.&lt;/p&gt;

&lt;p&gt;Four essential components are combined to create a data warehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouse storage.&lt;/li&gt;
&lt;li&gt;Metadata.&lt;/li&gt;
&lt;li&gt;Access tools.&lt;/li&gt;
&lt;li&gt;Management tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
Which skills are required to become a data engineer?
&lt;/h2&gt;

&lt;p&gt;Data engineers require a significant set of technical skills to carry out their tasks, such as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database management:&lt;/strong&gt; Data engineers spend a considerable part of their daily work operating databases, whether to collect, store, or transfer data. One should have a basic understanding of relational databases such as MySQL and PostgreSQL and non-relational databases such as MongoDB and DynamoDB, and be able to work efficiently with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Programming languages:&lt;/strong&gt; Data engineers use programming languages for a wide range of tasks. Many programming languages can be used in data engineering, but Python is certainly one of the best options: it is well suited to executing ETL jobs and writing data pipelines. Another reason to use Python is its great integration with tools and frameworks that are critical in data engineering, such as Apache Airflow and Apache Spark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud technology:&lt;/strong&gt; Being a data engineer entails, to a great extent, connecting your company’s business systems to cloud-based systems. Therefore, a good data engineer should know and have experience with cloud services, their advantages and disadvantages, and their application in Big Data projects. Widely used cloud platforms include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed computing frameworks:&lt;/strong&gt;  A distributed system is a computing environment in which various components are spread across multiple computers on a network. Distributed systems split up the work across the cluster, coordinating the efforts to complete the job more efficiently. Distributed computing frameworks, such as Apache Hadoop and Apache Spark, are designed for the processing of massive amounts of data, and they provide the foundations for some of the most impressive Big Data applications.&lt;/p&gt;
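&lt;p&gt;The split-and-coordinate idea can be illustrated on a single machine with Python's &lt;code&gt;concurrent.futures&lt;/code&gt;; frameworks like Hadoop and Spark apply the same map-and-reduce pattern across an entire cluster of machines:&lt;/p&gt;

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# The "cluster" here is just a local thread pool; Hadoop and Spark apply
# the same split/coordinate idea across many networked machines.
lines = [
    "spark splits the work",
    "hadoop splits the work",
    "the cluster coordinates the work",
]

def count_words(chunk):
    """Map step: count words in one partition of the data."""
    counter = Counter()
    for line in chunk:
        counter.update(line.split())
    return counter

# Split the dataset into partitions, process them concurrently, then
# reduce the partial results into one final count.
partitions = [lines[0:1], lines[1:2], lines[2:3]]
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_words, partitions))

total = Counter()
for partial in partial_counts:
    total.update(partial)
print(total["the"], total["work"])  # 4 3
```

&lt;p&gt;The payoff of the distributed version is that each partition can live on, and be processed by, a different node, so the job scales with the size of the cluster.&lt;/p&gt;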

&lt;p&gt;&lt;strong&gt;Shell:&lt;/strong&gt; Most of the jobs and routines of the Cloud and other Big Data tools and frameworks are executed using shell commands and scripts. Data engineers should be comfortable with the terminal to edit files, run commands, and navigate the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL frameworks:&lt;/strong&gt; Data engineers create data pipelines with ETL technologies and orchestration frameworks. We could list many technologies here, but a data engineer should know or be comfortable with some of the best known, such as Apache Airflow. Airflow is an orchestration framework: an open-source tool for planning, generating, and tracking data pipelines. There are other ETL frameworks as well; I would advise you to research them further in order to understand them.&lt;/p&gt;
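&lt;p&gt;To see what an orchestration framework buys you, here is a toy sketch of running tasks in dependency order. This is not the Airflow API, just the underlying DAG idea, and the task names are made up for illustration:&lt;/p&gt;

```python
# A toy orchestrator: each task names the tasks it depends on, and we run
# them in dependency order. Airflow formalizes exactly this idea as a DAG
# of operators, adding scheduling, retries, and monitoring on top.
tasks = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
}

def run_order(task_graph):
    """Return a valid execution order for the dependency graph."""
    done, order = set(), []
    while len(order) != len(task_graph):
        for name, deps in task_graph.items():
            if name not in done and all(d in done for d in deps):
                done.add(name)
                order.append(name)
    return order

print(run_order(tasks))  # ['extract', 'transform', 'load', 'report']
```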

&lt;h2&gt;
  
  
  Why then should we consider data engineering?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldcznmlic9c7b9sz0oqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldcznmlic9c7b9sz0oqi.png" alt="Why Data Engineering?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data engineering helps firms collect, generate, store, analyze, and manage data in real time or in batches. We can achieve this while constructing data infrastructure, thanks to a new set of tools and technologies.&lt;br&gt;
It focuses on scaling data systems and dealing with various levels of complexity in terms of scalability, optimization and availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  How are data engineers different from data scientists and machine learning scientists?
&lt;/h2&gt;

&lt;p&gt;A data engineer is responsible for making quality data available from various sources: maintaining databases, building data pipelines, querying and pre-processing data, feature engineering, working with tools such as Apache Hadoop and Spark, and developing data workflows using Airflow.&lt;/p&gt;

&lt;p&gt;ML engineers are responsible for building machine learning algorithms, building and deploying machine learning models, applying statistical and mathematical knowledge, and measuring, optimizing, and improving results.&lt;/p&gt;

&lt;p&gt;The primary role of a data scientist is to take raw data and apply analytic tools and modeling techniques to analyze it and provide insights to the business.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five V's of Data:
&lt;/h2&gt;

&lt;p&gt;Volume - how much data there is&lt;br&gt;
Variety - what kinds of data there are&lt;br&gt;
Velocity - how frequently the data arrives&lt;br&gt;
Veracity - how accurate the data is&lt;br&gt;
Value - how useful the data is&lt;/p&gt;

&lt;p&gt;That's it for our introduction to data engineering. This article has introduced you to the field of data engineering and explained what you need to learn to build a successful career as a Data Engineer.&lt;/p&gt;

&lt;p&gt;I will continue writing about Data Engineering, so join me as we read about data engineering together.&lt;/p&gt;

&lt;p&gt;Remember to give your feedback about this article.&lt;/p&gt;

&lt;p&gt;Happy Learning!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to uninstall MySQL Server from Ubuntu 22.04.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Thu, 18 Aug 2022 13:29:00 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/how-to-uninstall-mysql-server-from-ubuntu-2204-1k9j</link>
      <guid>https://dev.to/kinyungu_denis/how-to-uninstall-mysql-server-from-ubuntu-2204-1k9j</guid>
<description>&lt;p&gt;I am Kinyungu, an IT support specialist who loves helping people understand and use applications with ease. I am currently growing a career as a data engineer.&lt;/p&gt;

&lt;p&gt;In this article, we look at how to uninstall MySQL Server on Ubuntu 22.04. What might cause one to uninstall MySQL Server? Perhaps you face unexpected issues while using it, a MySQL Server update causes problems, or you simply decide to remove it from your computer.&lt;/p&gt;

&lt;p&gt;So follow along and see how we uninstall our MySQL server. Let us do it!!!&lt;/p&gt;

&lt;p&gt;First, open your Ubuntu terminal using the shortcut:&lt;br&gt;
&lt;code&gt;CTRL&lt;/code&gt; + &lt;code&gt;ALT&lt;/code&gt; + &lt;code&gt;T&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now that we are at the terminal, we will write the command to remove MySQL Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get remove --purge mysql*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's explain this command:&lt;br&gt;
&lt;code&gt;sudo&lt;/code&gt; runs the command with root privileges.&lt;br&gt;
&lt;code&gt;apt-get remove&lt;/code&gt; on its own uninstalls a package but leaves its configuration files on your computer.&lt;br&gt;
&lt;code&gt;--purge&lt;/code&gt; is passed to &lt;code&gt;apt-get remove&lt;/code&gt; so that the configuration files are deleted as well, and &lt;code&gt;mysql*&lt;/code&gt; matches every package whose name starts with mysql.&lt;/p&gt;

&lt;p&gt;Next, let's purge any remaining MySQL packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get purge mysql*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's understand this command; we already know what &lt;code&gt;sudo&lt;/code&gt; does.&lt;br&gt;
&lt;code&gt;apt-get purge mysql*&lt;/code&gt; removes any remaining MySQL packages together with their associated files and configuration.&lt;/p&gt;

&lt;p&gt;At this point we have successfully uninstalled MySQL Server from Ubuntu 22.04.&lt;br&gt;
However, it is advisable to run the following commands so that MySQL Server is removed completely, without leaving residual files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get autoremove
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get autoclean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the above commands leaves your system clean and you can continue to use your Ubuntu 22.04 well.&lt;/p&gt;

&lt;p&gt;The next step should be considered optional, but I would advise you to do it. After uninstalling applications or packages, it is good to bring the system up to date.&lt;/p&gt;

&lt;p&gt;We will run this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get dist-upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command upgrades the packages installed on the system and can also update the kernel to a new version. It handles package dependencies intelligently: it resolves conflicts that arise from changed dependencies, removes dependency packages that are no longer required, and, if needed, installs new packages required by the new kernel version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yeah!!&lt;/strong&gt; Indeed, we did uninstall MySQL Server.&lt;/p&gt;

&lt;p&gt;I hope this article will help anyone uninstalling MySQL server from Ubuntu 22.04 and any other reader looking to learn.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>uninstallation</category>
      <category>ubuntu</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
