<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kinyungu Denis</title>
    <description>The latest articles on DEV Community by Kinyungu Denis (@kinyungu_denis).</description>
    <link>https://dev.to/kinyungu_denis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F860808%2Fb44dbc23-631b-4c48-a61e-956dba284a5c.jpg</url>
      <title>DEV Community: Kinyungu Denis</title>
      <link>https://dev.to/kinyungu_denis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kinyungu_denis"/>
    <language>en</language>
    <item>
      <title>How to Create and Use a Virtual Environment in Python in Ubuntu 22.04</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Tue, 01 Nov 2022 16:54:00 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/how-to-create-and-use-a-virtual-environment-in-python-in-ubuntu-2204-3pp9</link>
      <guid>https://dev.to/kinyungu_denis/how-to-create-and-use-a-virtual-environment-in-python-in-ubuntu-2204-3pp9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Greetings, my esteemed readers. I took a break to participate in &lt;strong&gt;HacktoberFest with Aviyel&lt;/strong&gt;. It was an awesome experience; I learnt a lot and I will be sharing it with you, my dear readers. I am happy to be back and share my knowledge with you.&lt;/p&gt;

&lt;p&gt;We will learn about virtual environments in Python: how to create one, why you should create one and how to manage one. To get the best out of this article, you should understand the basics of Python.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a Python virtual environment?
&lt;/h3&gt;

&lt;p&gt;A Python virtual environment is an isolated environment that keeps the packages for different projects in different places to avoid dependency conflicts. Each environment &lt;br&gt;
has its own independent dependencies. &lt;/p&gt;

&lt;p&gt;A single Python installation can fail to meet the requirements of every application. If application K needs version 4.0 of a particular module but application W needs version 3.0, then the requirements are in conflict and installing either version 3.0 or 4.0 will leave one application unable to run. &lt;/p&gt;

&lt;p&gt;The solution to this problem is to create virtual environments, so that different applications can use different virtual environments and all of them are able to run on your system.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to create a virtual environment?
&lt;/h3&gt;

&lt;p&gt;We will use &lt;code&gt;venv&lt;/code&gt; to manage separate packages for different projects.&lt;/p&gt;

&lt;p&gt;To create a virtual environment, go to your project directory and run &lt;code&gt;venv&lt;/code&gt;. For example, in my case I will navigate to the required directory using the &lt;code&gt;cd&lt;/code&gt; command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /home/exporter/Kadenno/python_projects/Django_projects/project_one
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v6yoqwdhfno3wxgfky4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1v6yoqwdhfno3wxgfky4.png" alt="Navigating to project directory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After running the above command we are in our desired project directory.&lt;br&gt;
Now we can run our &lt;code&gt;venv&lt;/code&gt; command as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 -m venv env
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7c0okxe8wutd4y1ztuv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe7c0okxe8wutd4y1ztuv.png" alt="Using venv in virtual environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We use the second argument, &lt;code&gt;env&lt;/code&gt;, as the location for our virtual environment; you can change it if you want a location of your own.&lt;/p&gt;

&lt;p&gt;Basically, &lt;code&gt;venv&lt;/code&gt; will create a virtual Python installation in the &lt;code&gt;env&lt;/code&gt; folder.&lt;/p&gt;

&lt;p&gt;Now we need to activate our virtual environment. Before you begin installing and using packages, your virtual environment should be activated. We activate it as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;source env/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn46ttxy5fhp7o4yzani.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmn46ttxy5fhp7o4yzani.png" alt="Activating our vrtual environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Great, now that we have our virtual environment up and running, we can go ahead and install the required packages for our project.&lt;/p&gt;

&lt;p&gt;You can confirm whether you are in your virtual environment by using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;which python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaxi49gjjfzqtzc7gwz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feaxi49gjjfzqtzc7gwz5.png" alt="Checking whether in virtual environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you have completed your project, you can leave your virtual environment by using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;deactivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In our case, we won't leave our virtual environment, since we are yet to complete our project.&lt;/p&gt;

&lt;p&gt;To illustrate installing and using a package, let us upgrade pip in our virtual environment.&lt;br&gt;
You will use the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade pip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspym01lnau65aj7fru3q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspym01lnau65aj7fru3q.png" alt="Installing pip in our virtual environment"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This looks awesome. You can now proceed to install the other packages, libraries and dependencies that your project requires in the virtual environment.&lt;/p&gt;

&lt;p&gt;At this point, we understand what a virtual environment is, how to create one and how to deactivate it. However, do you know how a virtual environment works?&lt;br&gt;
Let's take a deep dive and learn about it.&lt;/p&gt;

&lt;h3&gt;
  
  
  How does a virtual environment work?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh5wp02wiupjf8mrm7aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh5wp02wiupjf8mrm7aw.png" alt="The virtual environment representation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you create a virtual environment using &lt;code&gt;venv&lt;/code&gt;, the module re-creates the file and folder structure of a standard Python installation on your operating system. Python also copies or symlinks into that folder structure the Python executable with which you called &lt;code&gt;venv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It adapts the prefix-finding process:&lt;br&gt;
Because of this standard folder structure, the Python interpreter in our virtual environment can work out where all relevant files are located. It does this with only minor adaptations to its prefix-finding process, according to the &lt;code&gt;venv&lt;/code&gt; specification.&lt;/p&gt;

&lt;p&gt;Instead of looking for the &lt;code&gt;os&lt;/code&gt; module to determine the location of the standard library, the Python interpreter first looks for a &lt;code&gt;pyvenv.cfg&lt;/code&gt; file. If the interpreter finds this file and it contains a &lt;code&gt;home&lt;/code&gt; key, then the interpreter uses that key to set the value of the following two variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;sys.base_prefix:&lt;/strong&gt; holds the path to the Python executable used to create this virtual environment, which you can find at the path defined under the &lt;code&gt;home&lt;/code&gt; key in &lt;code&gt;pyvenv.cfg&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;sys.prefix:&lt;/strong&gt; points to the directory containing &lt;code&gt;pyvenv.cfg&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the interpreter doesn’t find a &lt;code&gt;pyvenv.cfg&lt;/code&gt; file, then it determines that it’s not running within a virtual environment, and both &lt;code&gt;sys.base_prefix&lt;/code&gt; and &lt;code&gt;sys.prefix&lt;/code&gt; will then point to the same path.&lt;/p&gt;
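&lt;p&gt;You can check this behaviour from within Python itself. Here is a minimal sketch that compares the two prefix values to tell whether the interpreter is running inside a virtual environment:&lt;/p&gt;

```python
import sys

def in_virtualenv() -> bool:
    # Inside a venv, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base installation.
    # Outside a venv, the two are the same path.
    return sys.prefix != sys.base_prefix

print(in_virtualenv())
```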

&lt;p&gt;It links back to your standard library:&lt;br&gt;
Python virtual environments aim to be a lightweight, isolated Python environment that you can quickly create and then delete when you no longer need it. To achieve this, &lt;code&gt;venv&lt;/code&gt; copies only the minimally necessary files.&lt;/p&gt;

&lt;p&gt;The Python executable in our virtual environment has access to the standard library modules of the Python installation on which you based the environment. Python points to the file path of the base Python executable in the &lt;code&gt;home&lt;/code&gt; setting in &lt;code&gt;pyvenv.cfg&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;It modifies your &lt;code&gt;PYTHONPATH&lt;/code&gt;:&lt;br&gt;
So that scripts run using the Python interpreter within our virtual environment, &lt;code&gt;venv&lt;/code&gt; adjusts the module search path that you can access using &lt;code&gt;sys.path&lt;/code&gt;.&lt;/p&gt;
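&lt;p&gt;A quick way to see the effect is to print the module search path; inside an activated environment, the site-packages entry points into the env folder:&lt;/p&gt;

```python
import sys

# Print every directory Python will search for modules.
# In an activated venv, the site-packages entry lives under the env folder.
for entry in sys.path:
    print(entry)
```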

&lt;p&gt;It changes your shell &lt;code&gt;PATH&lt;/code&gt; variable on activation:&lt;br&gt;
You activate your virtual environment before working in it. To do so, you execute an activation script, just as we did earlier.&lt;/p&gt;

&lt;p&gt;Actions that happen in the activation script:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Path: It sets the &lt;code&gt;VIRTUAL_ENV&lt;/code&gt; variable to the root folder path of your virtual environment and puts the location of the environment's executables at the front of your &lt;code&gt;PATH&lt;/code&gt;.
Because the executables in your virtual environment now come first on your &lt;code&gt;PATH&lt;/code&gt;, typing &lt;code&gt;python&lt;/code&gt; or &lt;code&gt;pip&lt;/code&gt; makes your shell invoke the environment's own versions.&lt;/li&gt;
&lt;li&gt;Command prompt: the command prompt shows the name that you passed when creating the virtual environment, wrapped in parentheses, for example &lt;code&gt;(env)&lt;/code&gt;. We saw this when we created our virtual environment. From your command prompt, you will know whether or not your virtual environment is activated.&lt;/li&gt;
&lt;/ul&gt;
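&lt;p&gt;As a small sketch of the first point, you can read the &lt;code&gt;VIRTUAL_ENV&lt;/code&gt; variable from Python; it only exists after the activation script has run:&lt;/p&gt;

```python
import os

# VIRTUAL_ENV is set by the activation script and removed by deactivate.
venv_root = os.environ.get("VIRTUAL_ENV")
if venv_root is not None:
    print(f"Active virtual environment: {venv_root}")
else:
    print("No virtual environment is active.")
```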

&lt;p&gt;You will activate your virtual environment before working with it and deactivate it after you’re done, as we discussed earlier.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Through this article we have learnt about virtual environments in Python. Virtual environments give you the ability to isolate your Python development projects from your system-installed Python and other Python environments. This gives you full control of your project.&lt;/p&gt;

&lt;p&gt;When developing any application that will grow beyond a simple &lt;code&gt;.py&lt;/code&gt; script, it's a good idea to use a virtual environment. Having read this article, you now know how to set one up and start using it.&lt;/p&gt;

&lt;p&gt;Let me know what you think about this article through my &lt;a href="https://twitter.com/deno_exporter" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/denis-mashellkinyungu-1b79bb171/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; handles. It would be great to get your feedback and connect with you.&lt;/p&gt;

</description>
      <category>python</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Learn Ansible and how to Install it in Ubuntu 22.04.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Wed, 05 Oct 2022 23:06:24 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/learn-ansible-and-how-to-install-it-in-ubuntu-2204-3g5j</link>
      <guid>https://dev.to/kinyungu_denis/learn-ansible-and-how-to-install-it-in-ubuntu-2204-3g5j</guid>
      <description>&lt;p&gt;Greetings to my esteemed readers.&lt;/p&gt;

&lt;p&gt;In this article we will learn how to install Ansible on Ubuntu 22.04; I will also cover the details of Ansible so you are familiar with it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Ansible and why should we use it?
&lt;/h2&gt;

&lt;p&gt;Ansible is an open-source infrastructure automation tool, now maintained by Red Hat, that is used to tackle all kinds of challenges that come with infrastructure as code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ansible has three major use cases
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Infrastructure automation&lt;/li&gt;
&lt;li&gt;Configuration management&lt;/li&gt;
&lt;li&gt;App Deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Infrastructure automation
&lt;/h2&gt;

&lt;p&gt;Provisioning is the first use case for Ansible.&lt;br&gt;
Using Ansible you can create an environment within existing infrastructure, such as a Virtual Private Cloud (VPC) on your favourite cloud provider. Let us say that our VPC has four virtual machines.&lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration management
&lt;/h2&gt;

&lt;p&gt;The main use case is the ability to configure your actual infrastructure.&lt;/p&gt;
&lt;h3&gt;
  
  
  The key principles in Ansible
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;It is declarative&lt;/strong&gt;: you use an Ansible playbook to group together the set of tasks that you need to run, describing the desired state rather than scripting each procedural step.&lt;/p&gt;

&lt;p&gt;You create an Ansible playbook: a book of tasks, or a set of plays.&lt;br&gt;
A play has three main things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The name of the play&lt;/li&gt;
&lt;li&gt;The hosts that it will run against&lt;/li&gt;
&lt;li&gt;The actual tasks that will run&lt;/li&gt;
&lt;/ul&gt;
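&lt;p&gt;A minimal playbook showing these three parts might look like the following sketch; the &lt;code&gt;webservers&lt;/code&gt; group and the patching task are illustrative examples, not part of this article's setup:&lt;/p&gt;

```yaml
# sample playbook -- the host group and task names are illustrative
- name: Apply security patches
  hosts: webservers
  tasks:
    - name: Upgrade all packages
      ansible.builtin.apt:
        upgrade: dist
        update_cache: yes
```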

&lt;p&gt;You define the hosts that the play will run against, and a set of tasks to perform, such as security patching.&lt;/p&gt;

&lt;p&gt;The set of virtual machines will be the set of hosts; in the Ansible world, we call this an inventory: the set of hosts that Ansible can work on.&lt;/p&gt;
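&lt;p&gt;An inventory is often just an INI-style file grouping hosts. For example, the four virtual machines in our VPC could be listed as follows; the host names are illustrative:&lt;/p&gt;

```ini
# inventory.ini -- host names are illustrative
[webservers]
vm1.example.com
vm2.example.com

[databases]
vm3.example.com
vm4.example.com
```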

&lt;p&gt;Ansible takes advantage of YAML for writing configuration files, letting you declare the tasks you want.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ansible is agent-less&lt;/strong&gt;: you do not need to install an agent on the virtual machines that you have provisioned. &lt;br&gt;
Ansible uses secure shell (SSH) to run the tasks directly on the virtual machines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ansible is idempotent&lt;/strong&gt;: an idempotent operation can be run multiple times without changing the system beyond its initial application.&lt;br&gt;
When run repeatedly, Ansible recognises what has already changed and what still needs to be resolved, ensuring every task has been done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ansible is community-driven:&lt;/strong&gt; a lot of Ansible playbooks are made available by the community and published as collections, and developers have a repository to contribute to.&lt;/p&gt;
&lt;h2&gt;
  
  
  App Deployment
&lt;/h2&gt;

&lt;p&gt;Ansible can be used to deploy the actual web applications and workloads into virtual machines.&lt;/p&gt;

&lt;p&gt;Now that we have a basic understanding of Ansible, let us install it on our machine.&lt;/p&gt;

&lt;p&gt;First, run the command to ensure our package index is up to date:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt update
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkn2zqdl005vv26vmqc2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhkn2zqdl005vv26vmqc2.png" alt="Apt update"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install software-properties-common
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8yrjhbcgsezpwucp2sn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw8yrjhbcgsezpwucp2sn.png" alt="Install  properties command"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo add-apt-repository --yes --update ppa:ansible/ansible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99jc4om6kmuk215k6pva.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F99jc4om6kmuk215k6pva.png" alt="Add the PPA repository"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt install ansible
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8nnto3g1zelfado7sh0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr8nnto3g1zelfado7sh0.png" alt="Install ansible"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above commands will install Ansible on your Ubuntu 22.04 machine.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have learned the basics about Ansible and how to install it in Ubuntu 22.04.&lt;br&gt;
I will be dropping more articles about how one uses Ansible.&lt;/p&gt;

&lt;p&gt;Let me know what you think about this article through my &lt;a href="https://twitter.com/deno_exporter" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; or &lt;a href="https://www.linkedin.com/in/denis-mashellkinyungu-1b79bb171/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; handles. It will be good to connect with you.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>install</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Learning Boto3 and AWS Services the right way in Data Engineering.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Fri, 30 Sep 2022 02:45:23 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/learning-boto3-and-aws-services-the-right-way-in-data-engineering-1g32</link>
      <guid>https://dev.to/kinyungu_denis/learning-boto3-and-aws-services-the-right-way-in-data-engineering-1g32</guid>
      <description>&lt;p&gt;Greetings to my esteemed readers!&lt;/p&gt;

&lt;p&gt;In this article we will learn about AWS Boto3 and use it together with other AWS services. It will also cover other AWS services that are essential in data engineering. The only prerequisites for this article are basic knowledge of Python and AWS services.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AWS Boto3
&lt;/h2&gt;

&lt;p&gt;Boto3 is the Amazon Web Services (AWS) SDK for Python. &lt;br&gt;
Boto3 is your new friend when it comes to creating Python scripts for AWS resources. &lt;br&gt;
It allows you to directly create, configure, update and delete AWS resources from your Python scripts. Boto3 provides an easy to use, object-oriented API, as well as low-level access to AWS services.&lt;/p&gt;
&lt;h2&gt;
  
  
  How to install and configure Boto3
&lt;/h2&gt;

&lt;p&gt;Before you install Boto3, you should have Python 3.7 or later.&lt;br&gt;
To install Boto3 via pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also install it using Anaconda, if you want it in your Anaconda environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conda install -c anaconda boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also install it in Google Colab, to perform your operations in the cloud:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install boto3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before using Boto3, you need to set up authentication credentials for your AWS account using either the AWS IAM Console or the AWS CLI. You can either choose an existing user or create a new one.&lt;/p&gt;

&lt;p&gt;If you have the AWS CLI installed, use the &lt;code&gt;aws configure&lt;/code&gt; command to configure your credentials file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;aws configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also create the credentials file yourself. By default, its location is &lt;code&gt;~/.aws/credentials&lt;/code&gt;. The credentials file should specify the access key and secret access key. Replace &lt;code&gt;YOUR_ACCESS_KEY_ID&lt;/code&gt; with the access key ID for your user and &lt;code&gt;YOUR_SECRET_ACCESS_KEY&lt;/code&gt; with your user's secret access key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[default] 
aws_access_key_id = YOUR_ACCESS_KEY_ID 
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the file.&lt;br&gt;
Now that you have set up these credentials, you have a default profile, which will be used by Boto3 to interact with your AWS account.&lt;/p&gt;
&lt;h2&gt;
  
  
  Boto3 SDK features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Session&lt;/strong&gt;&lt;br&gt;
A session manages state about a particular configuration. By default, a session is created for you when needed. However, it's possible for you to maintain your own session. Sessions store the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Credentials&lt;/li&gt;
&lt;li&gt;AWS Region&lt;/li&gt;
&lt;li&gt;Other configurations related to your profile&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Default Session&lt;/p&gt;

&lt;p&gt;Boto3 acts as a proxy to the default session. This is created when you create a low-level client or resource client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# Using the default session
rds = boto3.client('rds')
s3 = boto3.resource('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Custom Session&lt;/p&gt;

&lt;p&gt;You can also manage your own session and create low-level clients or resource clients from it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
import boto3.session

# Create your own session
current_session = boto3.session.Session()

# Now we can create low-level clients or resource clients from our custom session
rds = current_session.client('rds')
s3 = current_session.resource('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Clients&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Clients provide a low-level interface to AWS whose methods map close to 1:1 with service APIs. All service operations are supported by clients. Clients are generated from a JSON service definition file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# Create a low-level client with the service name
s3 = boto3.client('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To access a low-level client from an existing resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create the resource
s3_resource = boto3.resource('s3')

# Get the client from the resource
s3 = s3_resource.meta.client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Resources represent an object-oriented interface to Amazon Web Services (AWS). They provide a higher-level abstraction than the raw, low-level calls made by service clients. To use resources, you invoke the resource() method of a Session and pass in a service name:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get resources from the default session

s3 = boto3.resource('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every resource instance has a number of attributes and methods. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Collections&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A collection provides an iterable interface to a group of resources. A collection seamlessly handles pagination for you, making it possible to easily iterate over all items from all pages of data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# s3 list all buckets
s3 = boto3.resource('s3')
for bucket in s3.bucket.all():
    print(bucket.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Paginators&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Pagination refers to the process of sending subsequent requests to continue where a previous request left off. It is needed because some AWS operations return truncated results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# Create a client
client = boto3.client('s3', region_name='ap-south-1')

# Create a reusable Paginator
paginator = client.get_paginator('list_objects')

# Create a PageIterator from the Paginator
page_iterator = paginator.paginate(Bucket='sample-bucket')

for page in page_iterator:
    print(page['Contents'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Client vs Resource&lt;/strong&gt;: which one should you use?&lt;/p&gt;

&lt;p&gt;Resources offer higher-level, object-oriented service access, whereas clients offer low-level service access.&lt;/p&gt;

&lt;p&gt;The question is, “Which one should I use?”&lt;/p&gt;

&lt;p&gt;Understanding how the client and the resource are generated helps in deciding which one to choose:&lt;/p&gt;

&lt;p&gt;Boto3 generates the client from a JSON service definition file. The client’s methods support every single type of interaction with the target AWS service.&lt;br&gt;
Resources, on the other hand, are generated from JSON resource definition files.&lt;/p&gt;

&lt;p&gt;Boto3 generates the client and the resource from different definitions. As a result, you may find cases in which an operation supported by the client isn’t offered by the resource.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;With clients, there is more programmatic work to be done. The majority of the client operations give you a dictionary response. To get the exact information that you need, you’ll have to parse that dictionary yourself. With resource methods, the SDK does that work for you.&lt;/li&gt;
&lt;li&gt;With the client, you might see some slight performance improvements. The disadvantage is that your code becomes less readable than it would be if you were using the resource. Resources offer a better abstraction, and your code will be easier to comprehend.&lt;/li&gt;
&lt;/ul&gt;
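&lt;p&gt;To make the first point concrete, here is a sketch of parsing a client-style dictionary response; the payload below is a hand-written sample, not real API output:&lt;/p&gt;

```python
# A client call such as list_buckets() returns a plain dictionary.
# This payload is a hand-written sample for illustration only.
response = {
    "Buckets": [
        {"Name": "farm-videos", "CreationDate": "2022-09-01"},
        {"Name": "farm-logs", "CreationDate": "2022-09-15"},
    ],
}

# With a client you extract the fields you need yourself;
# a resource would instead hand you bucket objects with .name attributes.
bucket_names = [bucket["Name"] for bucket in response["Buckets"]]
print(bucket_names)  # prints ['farm-videos', 'farm-logs']
```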
&lt;h2&gt;
  
  
  Amazon s3
&lt;/h2&gt;

&lt;p&gt;AWS s3 is an object storage platform that allows you to store and retrieve any amount of data at any time. It is a storage service that makes web-scale computing easier for users and developers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;s3 offers four storage class solutions in total, with unlimited data storage capacity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;s3 Standard&lt;/li&gt;
&lt;li&gt;s3 Standard Infrequent Access (otherwise known as S3 IA)&lt;/li&gt;
&lt;li&gt;s3 One Zoned Infrequent Access&lt;/li&gt;
&lt;li&gt;Glacier&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Amazon s3 Standard&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;s3 Standard offers high durability, availability and performance object storage for frequently accessed data. It delivers low latency and high throughput. It is perfect for a wide variety of use cases including cloud applications, dynamic websites, content distribution, mobile applications and Big Data analytics.&lt;/p&gt;

&lt;p&gt;For example, consider a web application collecting farm video uploads. With unlimited storage, there will never be a disk size issue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;s3 Infrequent Access (IA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;s3 IA is designed for data that is accessed less frequently but requires rapid access when needed. s3 Standard-IA offers the high durability, high throughput, and low latency of s3 Standard, with a low per GB storage price and per GB retrieval fee. This combination of low cost and high performance make s3 Standard-IA ideal for long-term storage, backups and as a data store for disaster recovery.&lt;/p&gt;

&lt;p&gt;For example, in a web application collecting farm video uploads on a daily basis, some of those videos will soon fall out of regular use: there is little demand to watch year-old farm videos. With IA we can move such objects to a different storage class without affecting their durability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 One Zone-IA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;S3 One Zone-IA is also designed for data that is accessed less frequently but requires rapid access when needed. Unlike the other storage classes, it stores data in a single Availability Zone (AZ). Because of this, storing data in S3 One Zone-IA costs 20% less than storing it in S3 Standard-IA. It is a good choice for storing secondary backup copies of on-premises data or easily re-creatable data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;S3 Reduced Redundancy Storage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Reduced Redundancy Storage (RRS) is an Amazon S3 storage option that enables customers to store noncritical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Glacier&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon Glacier is a secure, durable, and extremely low-cost storage service for data archiving. Customers can store data for as little as $0.004 per gigabyte per month. To keep costs low yet suitable for varying retrieval needs, Amazon Glacier provides different options for access to archives, from a few minutes to several hours.&lt;/p&gt;
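&lt;p&gt;At that quoted rate, a monthly archive bill is simple arithmetic. Here is a quick sketch; the 500 GB figure is just an illustration:&lt;/p&gt;

```python
# Back-of-the-envelope Glacier cost estimate at the quoted rate of
# $0.004 per GB per month. The 500 GB figure is just an illustration.
GLACIER_RATE_PER_GB_MONTH = 0.004

def monthly_glacier_cost(gigabytes):
    """Return the monthly archive storage cost in dollars."""
    return gigabytes * GLACIER_RATE_PER_GB_MONTH

print(monthly_glacier_cost(500))  # 500 GB comes to about $2.00 per month
```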

&lt;p&gt;&lt;strong&gt;Object Store&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Amazon S3 is a simple key-value store designed to store as many objects as you want. You store these objects in one or more buckets. An object consists of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Key — The name that you assign to an object. You use the object key to retrieve the object.&lt;/li&gt;
&lt;li&gt;Version ID — Within a bucket, a key and version ID uniquely identify an object.&lt;/li&gt;
&lt;li&gt;Value — The content that we are storing.&lt;/li&gt;
&lt;li&gt;Metadata — A set of name-value pairs with which you can store information regarding the object.&lt;/li&gt;
&lt;li&gt;Subresources — Amazon S3 uses the subresource mechanism to store object-specific additional information.&lt;/li&gt;
&lt;li&gt;Access Control Information — We can control access to the objects in Amazon S3.&lt;/li&gt;
&lt;/ul&gt;
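&lt;p&gt;To make those components concrete, here is a toy in-memory sketch of that key-value model. This is purely illustrative; real S3 adds subresources, access control, and durability guarantees on top:&lt;/p&gt;

```python
# A toy, in-memory sketch of the S3 object model: a bucket maps keys to
# objects carrying a value, metadata, and a version id. Illustrative
# only; real S3 adds subresources, access control, and durability.
import itertools

class ToyBucket:
    _versions = itertools.count(1)   # stand-in for S3 version ids

    def __init__(self, name):
        self.name = name             # bucket names are globally unique in S3
        self._objects = {}           # key -> (value, metadata, version_id)

    def put_object(self, key, value, metadata=None):
        version_id = next(self._versions)
        self._objects[key] = (value, metadata or {}, version_id)
        return version_id

    def get_object(self, key):
        value, _metadata, _version_id = self._objects[key]
        return value

bucket = ToyBucket("sampled-bucket")
bucket.put_object("sampled/data.csv", b"a,b\n1,2\n", metadata={"type": "csv"})
print(bucket.get_object("sampled/data.csv"))
```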

&lt;p&gt;&lt;strong&gt;Connect to Amazon S3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As long as the credentials file described above has been created, you should be able to connect to your S3 object storage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3

# resource() returns the high-level Resource interface (not the low-level client)
s3_client = boto3.resource('s3')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create and View Buckets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When creating a bucket there is a lot you can configure (location constraint, read access, write access), and you can use the client API to do that; here we use the high-level resource() API. Once we create a new bucket, let’s view all the buckets available in S3.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# create a bucket with a given name (bucket names must be globally
# unique and may not contain underscores)
sampled_bucket = s3_client.create_bucket(Bucket='sampled-buckets')

# view buckets in s3
for bucket in s3_client.buckets.all():
     print(bucket.name)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;View Objects within a Bucket&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s add objects to the bucket and then view all objects within our specific bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# point to bucket and add objects
sampled_bucket.put_object(Key='sampled/object1')
sampled_bucket.put_object(Key='sampled/object2')

# view objects within a bucket
for obj in sampled_bucket.objects.all():
     print(obj.key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Upload, Download, and Delete Objects&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Upload a CSV file, then view the objects within our bucket again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# upload local csv file to a specific s3 bucket
local_file_path = '/Users/Desktop/data.csv'
key_object = 'sampled/data.csv'

sampled_bucket.upload_file(local_file_path, key_object)
for obj in sampled_bucket.objects.all():
    print(obj.key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# download an s3 file to local machine
filename = 'downloaded_s3_data.csv'

sampled_bucket.download_file(key_object, filename)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let’s delete some of these objects. You can either delete a specific object or delete all objects within a bucket.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# delete a specific object
sampled_bucket.Object('sampled/object2').delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# delete all objects in a bucket
sampled_bucket.objects.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can only delete an empty bucket, so before deleting a bucket ensure it contains no objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# delete specific bucket
sampled_bucket.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Bucket vs Object&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A bucket has a name that is unique across all of S3 and may contain many objects. Each object has a key that is unique within the bucket; the key is effectively the object’s full path from the bucket root.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Redshift
&lt;/h2&gt;

&lt;p&gt;Amazon Redshift is a fully managed, columnar cloud data warehouse that you can use to run complex analytical queries on large datasets through massively parallel processing (MPP). Datasets can range from gigabytes to petabytes. It supports SQL, ODBC, and JDBC interfaces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Redshift Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2zf3w1jd6jh89fckgx7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2zf3w1jd6jh89fckgx7.png" alt="AWS Redshift"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The components of Redshift Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cluster&lt;br&gt;
A cluster in Redshift is a set of one or more compute nodes. There are two types of nodes: the leader node and compute nodes. If a cluster has two or more compute nodes, an additional leader node coordinates the compute nodes and handles external communication with client applications.&lt;/p&gt;

&lt;p&gt;Leader node&lt;br&gt;
The leader node interacts with client applications and communicates with compute nodes to carry out operations. It parses queries and generates an execution plan to carry out database operations. Based on the execution plan, it compiles code, distributes the compiled code to all provisioned compute nodes, and assigns a portion of the data to each node.&lt;/p&gt;

&lt;p&gt;Compute nodes&lt;br&gt;
The leader node compiles each step of the execution plan and assigns it to the compute nodes. &lt;br&gt;
Compute nodes execute the compiled code they are given and send intermediate results back to the leader node, which aggregates the final result for each request from a client application.&lt;br&gt;
Each compute node has its own dedicated CPU, memory, and storage, which are essentially determined by the node type.&lt;/p&gt;

&lt;p&gt;At a high level, AWS Redshift provides two node types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dense storage nodes (ds1 or ds2)&lt;/li&gt;
&lt;li&gt;Dense compute nodes (dc1 or dc2)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Node slices&lt;br&gt;
Each compute node is further partitioned into slices. Each slice is allocated a portion of the node’s memory and disk space, where it carries out its share of the workload given to the node. The leader node manages distributing data for queries and other database operations to the slices. All slices work in parallel to complete the operation. &lt;/p&gt;

&lt;p&gt;Internal network&lt;br&gt;
The internal network carries communication between the leader node and compute nodes as they perform database operations. Redshift uses very high-bandwidth connections and custom communication protocols to provide high-speed, private, and secure communication between the leader and compute nodes.&lt;/p&gt;

&lt;p&gt;Databases&lt;br&gt;
A cluster contains one or more databases, and user data is stored on the compute nodes. Redshift provides the same functionality as a typical RDBMS, including OLTP features such as DML; however, it is optimized for high-performance analysis and reporting on large datasets.&lt;/p&gt;

&lt;p&gt;Connections&lt;br&gt;
Redshift interacts with client applications using JDBC and ODBC drivers for PostgreSQL.&lt;/p&gt;

&lt;p&gt;Client applications&lt;br&gt;
AWS Redshift provides the flexibility to connect with various client tools such as ETL, business intelligence reporting, and analytics tools. Because it is based on industry-standard PostgreSQL, most existing SQL client applications are compatible and work with little or no change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Redshift Distribution Keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AUTO — if we do not specify a distribution style, Redshift chooses one based on the size of the data.&lt;br&gt;
EVEN — rows are distributed across slices in round-robin fashion; appropriate when the table does not participate in joins, or when there is no clear choice between KEY and ALL. It distributes rows evenly without trying to cluster data that is accessed at the same time.&lt;br&gt;
KEY — rows are distributed according to the values in one column; all the data with a specific key will be stored on the same slice.&lt;br&gt;
ALL — the entire table is copied to every node; appropriate for slow-moving tables.&lt;/p&gt;
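&lt;p&gt;The EVEN and KEY styles can be pictured with a short sketch. The slice count and rows below are made up for illustration:&lt;/p&gt;

```python
# Toy sketches of EVEN (round-robin) and KEY (hash) distribution styles.
# Slice counts and rows are made up for illustration.
def distribute_even(rows, num_slices):
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        slices[i % num_slices].append(row)  # deal rows out in turn
    return slices

def distribute_key(rows, num_slices, key):
    slices = [[] for _ in range(num_slices)]
    for row in rows:
        # same key value -> same slice, which co-locates join partners
        slices[hash(row[key]) % num_slices].append(row)
    return slices

print([len(s) for s in distribute_even(list(range(10)), 4)])  # [3, 3, 2, 2]
```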

&lt;p&gt;&lt;strong&gt;Sort Keys&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A sort key is similar to an index and makes for fast range queries.&lt;br&gt;
Rows are stored on disk in sorted order based on the column you designate as the sort key.&lt;/p&gt;

&lt;p&gt;Types of sort keys&lt;br&gt;
Single column&lt;br&gt;
Compound&lt;br&gt;
Interleaved — gives equal weight to each column&lt;/p&gt;
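&lt;p&gt;The effect of a sort key on range queries is analogous to binary search over a sorted list, which we can sketch with Python's bisect module (the year column below is hypothetical):&lt;/p&gt;

```python
# Rows kept sorted on a sort key allow fast range queries, analogous to
# how Redshift can skip blocks outside the requested range.
import bisect

sort_key = [2018, 2019, 2020, 2021, 2022, 2022, 2023]  # sorted "year" column

def range_query(keys, low, high):
    """Return the slice of sort-key values that fall in [low, high]."""
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return keys[lo:hi]

print(range_query(sort_key, 2020, 2022))  # [2020, 2021, 2022, 2022]
```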

&lt;p&gt;&lt;strong&gt;Importing and Exporting Data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;UNLOAD command (exporting) — unloads data from a table into files in S3.&lt;br&gt;
COPY command (importing) — reads from multiple data files or streams simultaneously. &lt;br&gt;
Use COPY to load large amounts of data from outside of Redshift.&lt;br&gt;
Gzip and bzip2 compression are supported to speed it up further.&lt;br&gt;
Automatic compression option — analyzes the data being loaded and figures out the optimal compression scheme for storing it.&lt;br&gt;
Special case: narrow tables (lots of rows, few columns); load these with a single COPY transaction if possible.&lt;/p&gt;
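&lt;p&gt;As a sketch of what such a bulk load looks like, here is Python code that assembles a COPY statement using standard Redshift options (IAM_ROLE, FORMAT AS CSV, GZIP). The table, bucket, and IAM role names are placeholders, not real resources:&lt;/p&gt;

```python
# Sketch: assemble a Redshift COPY statement that bulk-loads gzip'd CSVs
# from S3. All identifiers below are placeholders, not real resources.
def build_copy_statement(table, s3_path, iam_role):
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "GZIP;"
    )

sql = build_copy_statement(
    "farm_videos",
    "s3://sampled-buckets/videos/",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
print(sql)
```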

&lt;p&gt;&lt;strong&gt;Short Query Acceleration (SQA)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Prioritizes short-running queries over longer-running ones.&lt;br&gt;
Short queries run in a dedicated space and won’t wait in a queue behind long queries.&lt;br&gt;
Can be used in place of Workload Management queues for short queries.&lt;br&gt;
Works with CREATE TABLE AS (CTAS) and read-only queries (SELECT statements).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concurrency Scaling&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automatically adds cluster capacity to handle an increase in concurrent read queries.&lt;br&gt;
Supports virtually unlimited concurrent users &amp;amp; queries.&lt;br&gt;
WLM queues manage which queries are sent to the concurrency scaling cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vacuum Command&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Recovers space from deleted rows.&lt;br&gt;
VACUUM FULL — the default vacuum operation; it re-sorts all the rows and reclaims space from deleted rows.&lt;br&gt;
VACUUM DELETE ONLY — only reclaims space from deleted rows, without re-sorting.&lt;br&gt;
VACUUM SORT ONLY — re-sorts the table but does not reclaim disk space.&lt;br&gt;
VACUUM REINDEX — reanalyzes the interleaved sort key columns and then performs a full vacuum operation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resizing a Redshift Cluster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Elastic resize&lt;br&gt;
Quickly add or remove nodes of the same type.&lt;br&gt;
The cluster is down for a few minutes.&lt;br&gt;
Redshift tries to keep connections open across the downtime.&lt;br&gt;
Limited to doubling or halving for some dc2 and ra3 node types.&lt;/p&gt;

&lt;p&gt;Classic resize&lt;/p&gt;

&lt;p&gt;Change the node type and/or the number of nodes.&lt;br&gt;
The cluster is read-only for hours to days.&lt;/p&gt;

&lt;p&gt;Snapshot, restore, resize&lt;/p&gt;

&lt;p&gt;Used to keep the cluster available during a classic resize.&lt;br&gt;
Snapshot and copy the cluster, then resize the new cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operations on AWS Redshift&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are numerous operations we can perform on the database: query, create, modify, or remove database objects and records, and load and unload data from and to the Simple Storage Service (S3).&lt;/p&gt;

&lt;p&gt;Query&lt;br&gt;
Redshift lets you use the SELECT statement to extract data from tables. You can extract specific columns and restrict rows based on given conditions using the WHERE clause. Data can be sorted in ascending or descending order. Redshift also allows extracting data using joins and subqueries, and calling in-built and user-defined functions.&lt;/p&gt;

&lt;p&gt;Data Manipulation Language (DML)&lt;br&gt;
Redshift allows you to perform transactions using the INSERT, UPDATE, and DELETE commands. DML statements require a COMMIT to be saved permanently in the database, or a ROLLBACK to revert the changes. A set of DML statements is known as a transaction; a transaction is completed when a COMMIT, ROLLBACK, or any DDL statement is performed.&lt;/p&gt;

&lt;p&gt;Loading and Unloading Data&lt;br&gt;
Loading and unloading in Redshift are done with the COPY and UNLOAD commands. COPY loads data from files in S3, while UNLOAD dumps data into S3 buckets in various formats. COPY can load data into Redshift from data files or from multiple data streams simultaneously. Redshift recommends using COPY rather than INSERT for bulk inserts.&lt;/p&gt;

&lt;p&gt;Amazon Redshift splits the results of a SELECT statement across a set of one or more files per node slice to simplify parallel reloading of the data. While unloading data into S3, files can be generated serially or in parallel. UNLOAD can encrypt the data files using Amazon S3 server-side encryption (SSE-S3). &lt;/p&gt;

&lt;p&gt;Data Definition Language&lt;br&gt;
CREATE, ALTER, and DROP, to name a few, can be used to create, modify, and delete databases, schemas, users, and database objects such as tables, views, stored procedures, and user-defined functions. TRUNCATE can be used to delete table data; it is faster than DELETE and releases the space immediately.&lt;/p&gt;

&lt;p&gt;Grant, Revoke&lt;br&gt;
Access can be shared and restricted for different sets of user groups using the GRANT and REVOKE statements. Access can be granted individually or in the form of roles.&lt;/p&gt;

&lt;p&gt;Functions&lt;br&gt;
Functions are database objects with predefined code to perform a specific operation. They are stored in the database as precompiled code and can be used in SELECT statements, DML, and any expression. Functions provide reusability and avoid redundant code. There are two types of functions.&lt;/p&gt;

&lt;p&gt;User-defined functions&lt;br&gt;
Redshift allows you to create a custom user-defined scalar function (UDF) using either a SQL SELECT clause or a Python program. User-defined functions are stored in the database and can be run by any user with sufficient privileges. Functions are created with the CREATE FUNCTION command.&lt;/p&gt;

&lt;p&gt;In-Built Functions&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Character functions&lt;/li&gt;
&lt;li&gt;Number and Math functions&lt;/li&gt;
&lt;li&gt;JSON functions&lt;/li&gt;
&lt;li&gt;Date Type formatting functions&lt;/li&gt;
&lt;li&gt;Aggregate/Group Functions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Stored Procedures&lt;br&gt;
Stored procedures can be created in Redshift using the PostgreSQL procedural language. A stored procedure contains a set of queries and logical conditions in its block. Parameters in procedures can be of IN, OUT, or INOUT type. We can use DML, DDL, and SELECT statements in stored procedures. Stored procedures can be reused, which removes duplicate pieces of code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases of RedShift:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Trading and Risk Management&lt;br&gt;
To make decisions on future trades, set exposure limits, and mitigate risk against a counterparty. Redshift’s data compression, result caching, and encryption options for securing critical data make it a suitable data warehouse solution for that industry.&lt;/p&gt;

&lt;p&gt;Build a Data Lake for pricing data&lt;br&gt;
The data can help implement price forecasting systems for the oil, gas, and power sectors. Redshift’s columnar storage is a good fit for time series data.&lt;/p&gt;

&lt;p&gt;Supply chain management&lt;br&gt;
Supply chain systems generate huge amounts of data used in planning, scheduling, optimization, and dispatching. Features like parallel processing with powerful node types make Redshift a good option for querying and analyzing such volumes of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In this article, you learned about Boto3, AWS S3, and AWS Redshift. It is quite brief and provides only the basics of these services. Create your own AWS account and practice on your own to understand the concepts clearly.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to Upload a File to Google Colab.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Tue, 20 Sep 2022 20:22:32 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/how-to-upload-a-file-to-google-colab-2119</link>
      <guid>https://dev.to/kinyungu_denis/how-to-upload-a-file-to-google-colab-2119</guid>
      <description>&lt;p&gt;To my dear readers: today I discovered Google Colab, a tool that can be very handy when working with huge datasets. In my case, datasets larger than 10 gigabytes are huge, and I would not like my computer fan overworking. There are no prerequisites for this article, just basic knowledge of computers and working on the internet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Google Colab?
&lt;/h2&gt;

&lt;p&gt;Google Colab is a tool that allows you to write and execute Python in your browser, with zero configuration required, free access to GPUs, and easy sharing of your code.&lt;br&gt;
Colab is essentially the Google Suite version of a Jupyter Notebook.&lt;/p&gt;

&lt;p&gt;Google Colab can be used by a student, an Artificial Intelligence researcher, a Machine Learning engineer, a Data Scientist, or a Data Engineer.&lt;/p&gt;

&lt;p&gt;You need access to good internet. Go to your favorite browser (Brave is my favorite), type “google colab”, and click on the first link.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zko2k842vytjuclbtb7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8zko2k842vytjuclbtb7.png" alt="Google Colab Search"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google Colab is easy to use: you can write your Python code, run it, and share it with others, with easier installation of packages and sharing of documents. However, when one wants to upload a file or folder to Google Colab, it is quite a hassle.  &lt;/p&gt;

&lt;h2&gt;
  
  
  To Upload a File or a Folder to Google Colab
&lt;/h2&gt;

&lt;p&gt;Mostly, people download a CSV file, upload it into Google Colab, and read/load the data frame. After a while, one needs to repeat everything again because the data is not stored there anymore. This article solves that issue.&lt;/p&gt;

&lt;p&gt;In this article, I will show you how to use PyDrive to read a file in CSV format directly from your Google Drive using Python3 in the Google Colab environment.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Step: Install PyDrive
&lt;/h3&gt;

&lt;p&gt;The first step is to install PyDrive in our Colab notebook.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Since we are in the Colab environment, our pip command begins with an exclamation mark (!), as is the set standard.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0ca3gq04dv68o4kldzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0ca3gq04dv68o4kldzd.png" alt="To install PyDrive"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step Two: Authenticate and Authorize.
&lt;/h3&gt;

&lt;p&gt;We need to authenticate and create a PyDrive client.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtkggbpp7h9yw7ycqa4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtkggbpp7h9yw7ycqa4f.png" alt="Running Authentication for our PyDrive"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you run the above code, it will prompt you to give permission for Google Colab to access your Drive. Click allow and proceed to let Google Colab access your Drive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtk6xyq16smh0we7rkyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtk6xyq16smh0we7rkyk.png" alt="Prompt for Permission"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step Three: generate a shareable link
&lt;/h3&gt;

&lt;p&gt;Once you have completed verification, go to Google Drive&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;find your file and click on it;&lt;/li&gt;
&lt;li&gt;click on the “share” button;&lt;/li&gt;
&lt;li&gt;generate a shareable link with “get link”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The link will be copied into your clipboard; paste it into a string variable in Colab.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step four: Getting the file id
&lt;/h3&gt;

&lt;p&gt;Do not share your link with others, to prevent unauthorized users from accessing your file. The link below is just for demonstration, to help you understand the file id that one needs.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

##https://drive.google.com/file/d/25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X/view?usp=sharing


your_file = drive.CreateFile({'id':'25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X'})



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You assign the id to a variable your_file using drive.CreateFile({'id' : 'id_value'}).&lt;/p&gt;
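&lt;p&gt;If you prefer not to copy the id out by hand, a small helper can extract it from a share link, assuming the standard /file/d/.../view shape of Drive links (this helper is my own, not part of PyDrive):&lt;/p&gt;

```python
# A small helper (hypothetical, not part of PyDrive) that pulls the file
# id out of a standard Google Drive share link of the /file/d/<id>/view form.
def drive_file_id(share_link):
    parts = share_link.split("/")
    return parts[parts.index("d") + 1]   # the segment right after "d"

link = "https://drive.google.com/file/d/25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X/view?usp=sharing"
print(drive_file_id(link))  # 25XVhnRJvieQMAEC9TfrWBNG6ERmtU7X
```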

&lt;h3&gt;
  
  
  Step Five: To load the file and show results.
&lt;/h3&gt;

&lt;p&gt;I was uploading a CSV file, so let's see if our process was a success by loading the CSV file and producing an output.&lt;/p&gt;

&lt;p&gt;Indicate the name of the CSV file you want to load into memory.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

your_file.GetContentFile('matches.csv')



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;I use Pandas to turn this into a DataFrame and display its header. I import pyforest, a package that makes many Python packages available to me, including pandas.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import pyforest 

df = pd.read_csv('matches.csv', delimiter=';' )

df.head()



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kg7jnfqcdtxpogpnl72.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7kg7jnfqcdtxpogpnl72.png" alt="File uploaded successfully to Google Colab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see in the picture above, the CSV file was uploaded successfully and we were able to operate on the data using pandas. &lt;/p&gt;

&lt;p&gt;Now you know how to upload files and folders into your Google Colab. This saves you from doing everything locally on your machine, and you are able to work comfortably with huge datasets.&lt;/p&gt;

&lt;p&gt;We are still learning data engineering together. To read the article on installing Apache PySpark in Ubuntu, &lt;a href="https://dev.to/kinyungu_denis/to-install-apache-spark-and-run-pyspark-in-ubuntu-2204-4i79"&gt;you can read it here&lt;/a&gt;. Installing PySpark in our local environment was indeed involving.&lt;/p&gt;

&lt;p&gt;In Google Colab, I only have to run the following command to install PySpark and the py4j library: &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

!pip install pyspark==3.3.0 py4j==0.10.9.5



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then I move on to using Apache PySpark in my work. To learn about Apache PySpark, &lt;a href="https://dev.to/kinyungu_denis/apache-pyspark-for-data-engineering-3phi"&gt;read it here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This was a short, comprehensive article to solve a challenge I faced. Feel free to leave your comments and suggestions.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>cloud</category>
      <category>tooling</category>
    </item>
    <item>
      <title>SQL for Data Engineering</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Tue, 20 Sep 2022 01:37:50 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/sql-for-data-engineering-50fh</link>
      <guid>https://dev.to/kinyungu_denis/sql-for-data-engineering-50fh</guid>
      <description>&lt;p&gt;In Data Engineering we have large sets of data that will be queried to obtain meaningful results. SQL is heavily used, and writing and executing complex queries is a crucial skill.&lt;/p&gt;

&lt;p&gt;We have various relational databases such as MySQL, SQL Server, PostgreSQL, Oracle Database, and many others. The good thing is they all use the SQL query language, so they do not differ too much. In this post I will use PostgreSQL to write queries. PostgreSQL is an advanced, enterprise-class, open-source relational database system that supports both SQL (relational) and JSON (non-relational) querying.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fundamentals in PostgreSQL
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Select, Column Aliases, Order By, Select Distinct, Where, Limit, Fetch, In, Between, Like, Is Null, Table Aliases.&lt;/li&gt;
&lt;li&gt;Joins, Inner Join, Left Join, Self-Join, Full Outer Join, Cross Join, Natural Join&lt;/li&gt;
&lt;li&gt;Group By, Union, Intersect, Having, Grouping Sets, Cube, Rollup, Subquery, Any, All, Exists&lt;/li&gt;
&lt;li&gt;Insert, Insert Multiple Rows, Update, Update Join, Delete, Delete Join, Upsert&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The SELECT statement has the following clauses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select distinct rows using DISTINCT operator.&lt;/li&gt;
&lt;li&gt;Sort rows using ORDER BY clause.&lt;/li&gt;
&lt;li&gt;Filter rows using WHERE clause.&lt;/li&gt;
&lt;li&gt;Select a subset of rows from a table using LIMIT or FETCH clause.&lt;/li&gt;
&lt;li&gt;Group rows into groups using GROUP BY clause.&lt;/li&gt;
&lt;li&gt;Filter groups using HAVING clause.&lt;/li&gt;
&lt;li&gt;Join with other tables using joins such as INNER JOIN, LEFT JOIN, FULL OUTER JOIN, CROSS JOIN clauses.&lt;/li&gt;
&lt;li&gt;Perform set operations using UNION, INTERSECT, and EXCEPT.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
   select_list
FROM
   table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT first_name, last_name, goods_bought FROM customer;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To select data from all columns of a table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM consumer_reports;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, it is not a good practice to use the asterisk (*) in the SELECT statement when you embed SQL statements in the application code. It is a good practice to explicitly specify the column names in the SELECT clause whenever possible to get only necessary data from the database.&lt;/p&gt;

&lt;p&gt;A column alias allows you to assign a column or an expression in the select list of a SELECT statement a temporary name. The column alias exists temporarily during the execution of the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_name AS alias_name
FROM table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name || ' ' || last_name AS full_name
FROM
    consumer_reports;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
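&lt;p&gt;The queries in this post use PostgreSQL, but the AS alias and || concatenation above also work in SQLite, so you can try them with Python's built-in sqlite3 module (the sample row is made up):&lt;/p&gt;

```python
# Runnable illustration of a column alias using Python's stdlib sqlite3;
# the || concatenation and AS alias behave the same way as in PostgreSQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE consumer_reports (first_name TEXT, last_name TEXT)")
conn.execute("INSERT INTO consumer_reports VALUES ('Kinyungu', 'Denis')")

row = conn.execute(
    "SELECT first_name || ' ' || last_name AS full_name FROM consumer_reports"
).fetchone()
print(row[0])  # Kinyungu Denis
```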



&lt;p&gt;The ORDER BY clause allows you to sort rows returned by a SELECT clause in ascending or descending order based on a sort expression.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    select_list
FROM
    table_name
ORDER BY
    sort_expression1 [ASC | DESC],
        ...
        ...
    sort_expressionN [ASC | DESC];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name,
    last_name
FROM
    consumer_reports
ORDER BY
    first_name DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The DISTINCT clause is used in the SELECT statement to remove duplicate rows from a result set. The DISTINCT clause keeps one row for each group of duplicates. The DISTINCT clause can be applied to one or more columns in the select list of the SELECT statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
   DISTINCT column1, column2, column3, column4
FROM
   table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    DISTINCT shape,
    color
FROM
    records
ORDER BY
    color;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SELECT statement returns all rows from one or more columns in a table. To select only rows that satisfy a specified condition, you use a WHERE clause. The WHERE clause filters the rows returned by a SELECT statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_name
WHERE condition
ORDER BY sort_expression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To form the condition in the WHERE clause, you use comparison and logical operators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AND -- logical AND; true if both operands are true&lt;/li&gt;
&lt;li&gt;OR -- logical OR; true if either operand is true&lt;/li&gt;
&lt;li&gt;IN -- returns true if a value matches any value in a list&lt;/li&gt;
&lt;li&gt;BETWEEN -- returns true if a value is within a range of values&lt;/li&gt;
&lt;li&gt;LIKE -- returns true if a value matches a pattern&lt;/li&gt;
&lt;li&gt;IS NULL -- returns true if a value is NULL&lt;/li&gt;
&lt;li&gt;NOT -- negates the result of other operators&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    last_name,
    first_name
FROM
    consumer_records
WHERE
    first_name = 'Brian' AND 
        last_name = 'Kamau';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name,
    last_name
FROM
    customer
WHERE 
    first_name IN ('Brian','Kelvin','Martin');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL LIMIT is an optional clause of the SELECT statement that constrains the number of rows returned by the query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list 
FROM table_name
ORDER BY sort_expression
LIMIT row_count
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to skip a number of rows before returning the row_count rows, you place the OFFSET clause after the LIMIT clause, as in the following statement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_name
LIMIT row_count OFFSET row_to_skip;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query displays the first 20 rows from the film table, ordered by film_id in descending order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    release_year
FROM
    film
ORDER BY
    film_id DESC
LIMIT 20;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L5odlBaX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ev1kwrpqreijr5p8d2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L5odlBaX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8ev1kwrpqreijr5p8d2t.png" alt="20 rows in Descending order" width="583" height="430"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query skips the first 15 rows, then displays only the next 20 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    release_year
FROM
    film
ORDER BY
    film_id
LIMIT 20 OFFSET 15;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7L-PPLmH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nsojbuawu3xipkmqbguf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7L-PPLmH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nsojbuawu3xipkmqbguf.png" alt="Skips 15 rows then Limit 20 rows" width="772" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To constrain the number of rows returned by a query, you often use the LIMIT clause. However, the LIMIT clause is not part of the SQL standard. To conform with the standard, PostgreSQL also supports the FETCH clause for retrieving a subset of the rows returned by a query.&lt;/p&gt;

&lt;p&gt;Syntax of the PostgreSQL FETCH clause:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OFFSET start { ROW | ROWS }
FETCH { FIRST | NEXT } [ row_count ] { ROW | ROWS } ONLY
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this syntax:&lt;br&gt;
ROW is a synonym for ROWS and FIRST is a synonym for NEXT; you can use them interchangeably.&lt;br&gt;
The start is an integer that must be zero or positive.&lt;br&gt;
The row_count must be 1 or greater.&lt;/p&gt;

&lt;p&gt;This query will skip the first 20 rows then proceed to display the next 20 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title
FROM
    film
ORDER BY
    title 
OFFSET 20 ROWS 
FETCH FIRST 20 ROWS ONLY; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OeESnJKF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4tzjev3txzmbenlhnu7u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OeESnJKF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4tzjev3txzmbenlhnu7u.png" alt="Fetch 20 rows after skipping 20 rows" width="646" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You use the IN operator in the WHERE clause to check whether a value matches any value in a list of values.&lt;/p&gt;

&lt;p&gt;The syntax of the IN operator is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value IN (value1,value2,value3, value4, ..., valueN)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns the first 15 rows that have a customer_id of 1 or 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT customer_id,
    rental_id,
    return_date
FROM
    rental
WHERE
    customer_id IN (1, 2)
ORDER BY
    return_date DESC
FETCH FIRST 15 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--I8RYJ2DG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jie14x8nfmlxn1gzgadw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--I8RYJ2DG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jie14x8nfmlxn1gzgadw.png" alt="In operator example" width="685" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can combine the IN operator with the NOT operator to select rows whose values do not match the values in the list.&lt;br&gt;
The following query finds all rentals whose customer_id is not 1 or 2.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    rental_id,
    return_date
FROM
    rental
WHERE
    customer_id NOT IN (1, 2)
FETCH NEXT 20 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QZZCf8HV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kf363ybqu2ha8kpwsb7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QZZCf8HV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kf363ybqu2ha8kpwsb7r.png" alt="Using NOT IN operator" width="784" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You use the BETWEEN operator to match a value against a range of values. The syntax of the BETWEEN operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value BETWEEN low AND high;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To check if a value is out of a range, you combine the NOT operator with the BETWEEN operator as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value NOT BETWEEN low AND high;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The BETWEEN operator is most often used in the WHERE clause.&lt;/p&gt;

&lt;p&gt;This query returns the first 15 rows where the amount is between 10 and 12.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    payment_id,
    amount
FROM
    payment
WHERE
    amount BETWEEN 10 AND 12
FETCH FIRST 15 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--W2QvH1Y3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/97nhtge3nniwabcic62o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--W2QvH1Y3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/97nhtge3nniwabcic62o.png" alt="Between Operator" width="602" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query returns the first 15 rows that do not meet the condition in the WHERE clause, that is, rows whose amount is not between 10 and 12.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    payment_id,
    amount
FROM
    payment
WHERE
    amount NOT BETWEEN 10 AND 12
FETCH FIRST 15 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cb0EtQ86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5lzspdlm7yfvpxs0hgr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cb0EtQ86--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5lzspdlm7yfvpxs0hgr2.png" alt="Not Between Operator" width="681" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You construct a pattern by combining literal values with wildcard characters, then use the LIKE or NOT LIKE operator to find matches. PostgreSQL provides you with two wildcards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Percent sign ( %) matches any sequence of zero or more characters.&lt;/li&gt;
&lt;li&gt;Underscore sign ( _)  matches any single character.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value LIKE pattern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL also supports the ILIKE operator, which works like the LIKE operator but matches values case-insensitively.&lt;/p&gt;
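&lt;p&gt;As a small illustrative sketch using the same customer table, the following query matches first names that begin with 'bar' in any letter case, such as 'Barbara' or 'BARRY':&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name,
    last_name
FROM
    customer
WHERE
    first_name ILIKE 'bar%';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;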

&lt;p&gt;This query returns all first names that contain 'er', ordered by first_name; it then skips the first 5 rows and fetches the next 20.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    first_name,
        last_name
FROM
    customer
WHERE
    first_name LIKE '%er%'
ORDER BY 
        first_name
OFFSET 5 ROWS
FETCH FIRST 20 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--rI_uvvCP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0xgfbslcucalndfnkdog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--rI_uvvCP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0xgfbslcucalndfnkdog.png" alt="Like operator query result" width="800" height="431"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In databases, NULL means missing information or not applicable. NULL is not a value, therefore you cannot compare it with other values such as numbers or strings.&lt;br&gt;
To check whether a value is not NULL, you use the IS NOT NULL operator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value IS NOT NULL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
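&lt;p&gt;As a sketch using the rental table from the earlier examples (where return_date may be NULL for rentals that have not yet come back), the following query lists only the rentals that have already been returned:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    rental_id,
    customer_id,
    return_date
FROM
    rental
WHERE
    return_date IS NOT NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;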



&lt;p&gt;Table aliases temporarily assign tables new names during the execution of a query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;table_name AS alias_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
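&lt;p&gt;For example, the alias c below lets you refer to the customer table by a shorter name; this becomes especially useful in joins where the same column name appears in several tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    c.first_name,
    c.last_name
FROM
    customer AS c
ORDER BY
    c.first_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;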



&lt;h3&gt;
  
  
  Inner Join
&lt;/h3&gt;

&lt;p&gt;In a relational database, data is typically distributed across more than one table. To select complete data, you often need to query data from multiple tables. Let us learn how to combine data from multiple tables using the INNER JOIN clause.&lt;/p&gt;

&lt;p&gt;Suppose that there are two tables car and manufacturer. The table car has a column model whose value matches with values in the make column of table manufacturer. To select data from both tables, you use the INNER JOIN clause in the SELECT statement as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    model,
    year,
    make,
    origin
FROM
    Car
INNER JOIN manufacturer ON model = make;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;To join table car with the table manufacturer, you follow these steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, specify columns from both tables that you want to select data in the SELECT clause.&lt;/li&gt;
&lt;li&gt;Second, specify the main table for example table car in the FROM clause.&lt;/li&gt;
&lt;li&gt;Third, specify the second table (table manufacturer) in the INNER JOIN clause and provide a join condition after the ON keyword.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How the INNER JOIN works.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For each row in the table car, inner join compares the value in the model column with the value in the make column of every row in the table manufacturer:&lt;br&gt;
If these values are equal, the inner join creates a new row that contains all columns of both tables and adds it to the result set.&lt;br&gt;
In case these values are not equal, the inner join just ignores them and moves to the next row.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YdTrYyvX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/trhsis5six5nu810isk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YdTrYyvX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/trhsis5six5nu810isk6.png" alt="Inner Join venn diagram" width="371" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query returns the customer with an id of 3, along with the amount and date of each of their payments.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    c.customer_id,
    first_name,
    last_name,
    amount,
    payment_date
FROM
    customer c
INNER JOIN payment p 
    ON p.customer_id = c.customer_id
WHERE
    c.customer_id = 3;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DHd3TYZX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ozktqol45k7d4lsipfc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DHd3TYZX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3ozktqol45k7d4lsipfc.png" alt="Inner Join operator" width="880" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Left Join
&lt;/h3&gt;

&lt;p&gt;There are two tables, car and manufacturer. Each row in the table car may have zero or many corresponding rows in the table manufacturer, while each row in the table manufacturer has one and only one corresponding row in the table car.&lt;br&gt;
To select data from the table car that may or may not have corresponding rows in the table manufacturer, you use the LEFT JOIN clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    model,
    year,
    make,
    origin
FROM
    Car
LEFT JOIN manufacturer ON model = make;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To join the table car with the manufacturer table using a left join:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, specify the columns in both tables from which you want to select data in the SELECT clause.&lt;/li&gt;
&lt;li&gt;Second, specify the left table (table car) in the FROM clause.&lt;/li&gt;
&lt;li&gt;Third, specify the right table (table manufacturer) in the LEFT JOIN clause and the join condition after the ON keyword.&lt;/li&gt;
&lt;li&gt;The LEFT JOIN clause starts selecting data from the left table. For each row in the left table, it compares the value in the model column with the value of each row in the make column in the right table.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If these values are equal, the left join clause creates a new row that contains columns that appear in the SELECT clause and adds this row to the result set.&lt;br&gt;
In case these values are not equal, the left join clause also creates a new row that contains columns that appear in the SELECT clause. In addition, it fills the columns that come from the right table with NULL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QoDMRgPm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lfhv62eiytqbh2vb1c03.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QoDMRgPm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lfhv62eiytqbh2vb1c03.png" alt="Left join" width="371" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This query uses a LEFT JOIN to join the film table with the inventory table, then returns the first 25 films that are not in the inventory (their inventory_id is NULL).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    f.film_id,
    title,
    inventory_id
FROM
    film f
LEFT JOIN inventory i
   ON i.film_id = f.film_id
WHERE i.film_id IS NULL
ORDER BY title
FETCH FIRST 25 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WCkWSWkG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hwlk288762g0ykyb3j14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WCkWSWkG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hwlk288762g0ykyb3j14.png" alt="Left outer join operator" width="779" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Self-Join
&lt;/h3&gt;

&lt;p&gt;A self-join is a regular join that joins a table to itself. Self-joins are used to query hierarchical data or to compare rows within the same table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_name t1
INNER JOIN table_name t2 ON join_predicate;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the syntax above, the table is joined to itself using the INNER JOIN clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_name t1
LEFT JOIN table_name t2 ON join_predicate;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this variant, the table is joined to itself using the LEFT JOIN clause.&lt;/p&gt;

&lt;p&gt;This query finds all pairs of distinct films that have the same length. It skips the first 15 rows, then returns the next 25 rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT
    f1.title,
    f2.title,
    f1.length
FROM
    film f1
INNER JOIN film f2 
    ON f1.film_id &amp;lt;&amp;gt; f2.film_id AND 
       f1.length = f2.length
OFFSET 15 ROWS
FETCH FIRST 25 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Efy5jQk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xaqa9it6eglmfibhtw4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Efy5jQk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xaqa9it6eglmfibhtw4l.png" alt="Self-join operation" width="880" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Full Outer Join
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--11974Rlo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vni2opohilw3g56ck8n8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--11974Rlo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vni2opohilw3g56ck8n8.png" alt="Full outer Join" width="364" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The full outer join combines the results of a left join and a right join: every row from both tables appears in the result set. Syntax of the full outer join:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM car
FULL [OUTER] JOIN manufacturer ON car.id = manufacturer.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the rows in the joined table do not match, the full outer join sets NULL values for every column of the table that does not have the matching row.&lt;br&gt;
If a row from one table matches a row in another table, the result row will contain columns populated from columns of rows from both tables.&lt;/p&gt;
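&lt;p&gt;As an illustrative sketch with the car and manufacturer tables used above, this query keeps every car and every manufacturer; where one side has no match, its columns are filled with NULL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    model,
    make
FROM
    car
FULL OUTER JOIN manufacturer ON car.id = manufacturer.id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;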
&lt;h3&gt;
  
  
  Cross Join
&lt;/h3&gt;

&lt;p&gt;The CROSS JOIN clause allows you to produce a Cartesian product of the rows in two or more tables.&lt;/p&gt;

&lt;p&gt;Cross Join Syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table1
CROSS JOIN table2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This statement is similar to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM T1, T2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
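&lt;p&gt;As a sketch with two hypothetical tables, sizes and colors: if sizes has 3 rows and colors has 4 rows, the cross join below returns every combination, 12 rows in total:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    s.size,
    c.color
FROM
    sizes s
CROSS JOIN colors c;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;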



&lt;h3&gt;
  
  
  Natural Join
&lt;/h3&gt;

&lt;p&gt;A natural join creates an implicit join condition based on columns with the same names in the joined tables.&lt;br&gt;
A natural join can be an inner join, left join, or right join. If you do not specify a join type explicitly, PostgreSQL uses INNER JOIN by default.&lt;/p&gt;

&lt;p&gt;Natural Join syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table1
NATURAL [INNER | LEFT | RIGHT] JOIN table2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The convenience of the NATURAL JOIN is that it does not require you to specify the join clause because it uses an implicit join clause based on the common column.&lt;br&gt;
However, avoid using the NATURAL JOIN whenever possible because sometimes it may cause an unexpected result.&lt;/p&gt;
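&lt;p&gt;As a sketch with two hypothetical tables, products and categories, that share only a category_id column, the natural join below implicitly joins on category_id. Note that if the tables happened to share another column (for example a last_update column), it would silently be included in the join condition, which is exactly the kind of unexpected result to watch for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    product_name,
    category_name
FROM
    products
NATURAL INNER JOIN categories;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;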
&lt;h3&gt;
  
  
  Group By
&lt;/h3&gt;

&lt;p&gt;The GROUP BY clause divides the rows returned from the SELECT statement into groups. For each group, you can apply an aggregate function, for example SUM() to calculate the sum of items or COUNT() to get the number of items in each group.&lt;/p&gt;

&lt;p&gt;Basic Syntax of Group By:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
   column_1, 
   column_2,
   ...,
   aggregate_function(column_n)
FROM 
   table_name
GROUP BY 
   column_1,
   column_2,
   ...
   column_n;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns the first 25 rows from the payment table, grouped by customer_id and ordered by the total amount in descending order.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    SUM (amount)
FROM
    payment
GROUP BY
    customer_id
ORDER BY
    SUM (amount) DESC
FETCH FIRST 25 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sLDsd0TD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fgqayr9q8fbqaebo8666.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sLDsd0TD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fgqayr9q8fbqaebo8666.png" alt="Group By operation" width="597" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use multiple columns with GROUP BY. In this query, the GROUP BY clause divides the rows in the payment table by the values in the customer_id and staff_id columns, and SUM() calculates the total amount for each group. The result is ordered by customer_id in ascending order, and the first 30 rows are fetched.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    customer_id, 
    staff_id, 
    SUM(amount) 
FROM 
    payment
GROUP BY 
    staff_id, 
    customer_id
ORDER BY 
    customer_id
FETCH FIRST 30 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0f1SRtvL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7gsxyq8jz1yegh8sbm21.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0f1SRtvL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7gsxyq8jz1yegh8sbm21.png" alt="Group By operation with multiple columns" width="832" height="610"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Having
&lt;/h3&gt;

&lt;p&gt;HAVING clause specifies a search condition for a group or an aggregate. The HAVING clause is often used with the GROUP BY clause to filter groups or aggregates based on a specified condition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    column_1,
        column_2,
        ...
    aggregate_function (column_n)
FROM
    table_name
GROUP BY
    column_1
HAVING
    condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL evaluates the HAVING clause after the FROM, WHERE, GROUP BY, and before the SELECT, DISTINCT, ORDER BY and LIMIT clauses.&lt;/p&gt;

&lt;p&gt;Since the HAVING clause is evaluated before the SELECT clause, you cannot use column aliases in the HAVING clause: at the time the HAVING clause is evaluated, the column aliases specified in the SELECT clause are not yet available.&lt;/p&gt;

&lt;p&gt;The WHERE clause allows you to filter rows based on a specified condition. However, the HAVING clause allows you to filter groups of rows according to a specified condition.&lt;br&gt;
The WHERE clause is applied to rows while the HAVING clause is applied to groups of rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    customer_id,
    SUM (amount)
FROM
    payment
GROUP BY
    customer_id
HAVING
    SUM (amount) &amp;gt; 150;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JM5VEPsj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obz9or3sn4k4x4oaa3vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JM5VEPsj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/obz9or3sn4k4x4oaa3vn.png" alt="Using Having in a query" width="562" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Union Operator
&lt;/h3&gt;

&lt;p&gt;The UNION operator combines the result sets of two or more SELECT statements into a single result set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list_1
FROM table_1
UNION
SELECT select_list_2
FROM table_2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To combine the result sets of two queries using the UNION operator, ensure that:&lt;br&gt;
the number and the order of the columns in the select lists of both queries are the same, and the data types are compatible.&lt;/p&gt;

&lt;p&gt;The UNION operator removes all duplicate rows from the combined data set. Use UNION ALL to retain duplicate rows.&lt;/p&gt;
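&lt;p&gt;As a sketch assuming two hypothetical tables, top_rated_films and most_popular_films, each with a title column, the following query returns every title that appears in either table, with duplicates removed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT title FROM top_rated_films
UNION
SELECT title FROM most_popular_films;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;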
&lt;h3&gt;
  
  
  Intersect Operator
&lt;/h3&gt;

&lt;p&gt;The PostgreSQL INTERSECT operator combines the result sets of two or more SELECT statements into a single result set.&lt;br&gt;
The INTERSECT operator returns only the rows that appear in both result sets.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_1
INTERSECT
SELECT select_list
FROM table_2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using the INTERSECT operator, the number of columns and their order in the SELECT clauses must be the same, and the data types of the columns must be compatible.&lt;/p&gt;
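&lt;p&gt;As a sketch assuming two hypothetical tables, top_rated_films and most_popular_films, each with a title column, this query returns only the titles that appear in both tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT title FROM top_rated_films
INTERSECT
SELECT title FROM most_popular_films;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;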

&lt;h3&gt;
  
  
  Except
&lt;/h3&gt;

&lt;p&gt;The EXCEPT operator returns rows by comparing the result sets of two or more queries. It returns the distinct rows from the first (left) query that are not in the output of the second (right) query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT select_list
FROM table_1
EXCEPT 
SELECT select_list
FROM table_2;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
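
&lt;p&gt;For example, assuming hypothetical tables sales_2021 and sales_2022 with the same columns, this sketch returns the products that appear in sales_2021 but not in sales_2022:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- sales_2021 and sales_2022 are hypothetical tables
SELECT product_name
FROM sales_2021
EXCEPT
SELECT product_name
FROM sales_2022;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;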



&lt;h3&gt;
  
  
  RollUp
&lt;/h3&gt;

&lt;p&gt;PostgreSQL ROLLUP is a subclause of the GROUP BY clause that offers a shorthand for defining multiple grouping sets. A grouping set is a set of columns by which you group. &lt;br&gt;
ROLLUP assumes a hierarchy among the input columns and generates all grouping sets that make sense considering the hierarchy. ROLLUP is often used to generate the subtotals and the grand total for reports.&lt;/p&gt;

&lt;p&gt;This query finds the number of rentals per day, month, and year by using ROLLUP. It skips the first 15 rows, then fetches the 25 rows that follow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    EXTRACT (YEAR FROM rental_date) y,
    EXTRACT (MONTH FROM rental_date) M,
    EXTRACT (DAY FROM rental_date) d,
    COUNT (rental_id)
FROM
    rental
GROUP BY
    ROLLUP (
        EXTRACT (YEAR FROM rental_date),
        EXTRACT (MONTH FROM rental_date),
        EXTRACT (DAY FROM rental_date)
    )
OFFSET 15
FETCH FIRST 25 ROWS ONLY;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WF5tddIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/li1nmilbe7gw90dzkau3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WF5tddIC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/li1nmilbe7gw90dzkau3.png" alt="Results of rollup operation" width="608" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cube
&lt;/h3&gt;

&lt;p&gt;PostgreSQL CUBE is a subclause of the GROUP BY clause that allows you to generate multiple grouping sets. A grouping set is a set of columns by which you group.&lt;/p&gt;

&lt;p&gt;This query generates all possible grouping sets based on the dimension columns specified in CUBE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    c1, c2, c3,
    aggregate (c4)
FROM
    table_name
GROUP BY
    CUBE (c1, c2, c3);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
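
&lt;p&gt;Applied to the film table from the ROLLUP example, this sketch generates counts for every combination of rating and release_year, including subtotals for each dimension alone and a grand total:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    rating,
    release_year,
    COUNT (film_id)
FROM
    film
GROUP BY
    CUBE (rating, release_year);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;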



&lt;h3&gt;
  
  
  Subquery
&lt;/h3&gt;

&lt;p&gt;A subquery is a query nested inside another statement such as SELECT, INSERT, DELETE or UPDATE.&lt;br&gt;
The query inside the parentheses is called the subquery, and the query that contains the subquery is known as the outer query.&lt;/p&gt;

&lt;p&gt;PostgreSQL executes a statement that contains a subquery in the following sequence:&lt;br&gt;
First, it executes the subquery and passes the result to the outer query. Then it executes the outer query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    rental_rate
FROM
    film
WHERE
    rental_rate &amp;gt; (
        SELECT
            AVG (rental_rate)
        FROM
            film
    )
FETCH FIRST 30 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ezdJ1ATU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4prf3yqgdivwyxnrsmfv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ezdJ1ATU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4prf3yqgdivwyxnrsmfv.png" alt="Subquery in WHERE clause" width="872" height="612"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The following query gets films whose return date is between 2005-05-29 and 2005-05-30. Then 30 rows are skipped and the 30 rows that follow are returned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title
FROM
    film f
WHERE
    film_id IN (
        SELECT
            i.film_id
        FROM
            rental r
        INNER JOIN inventory i ON i.inventory_id = 
                      r.inventory_id
        WHERE
            return_date BETWEEN '2005-05-29'
        AND '2005-05-30'
    )
OFFSET 30 ROWS FETCH FIRST 30 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ppC4tnbL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ly9c13fg1brqsvb6gim8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ppC4tnbL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ly9c13fg1brqsvb6gim8.png" alt="Subquery with IN clause" width="866" height="615"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  All Operator
&lt;/h3&gt;

&lt;p&gt;The ALL operator allows you to query data by comparing a value with a list of values returned by a subquery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;comparison_operator ALL (subquery)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ALL operator must be preceded by a comparison operator such as equal (=), not equal (!=), greater than (&amp;gt;), greater than or equal to (&amp;gt;=), less than (&amp;lt;), or less than or equal to (&amp;lt;=), and followed by a subquery, which must be surrounded by parentheses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;column_name &amp;gt; ALL (subquery) the expression evaluates to true if a value is greater than the biggest value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name &amp;gt;= ALL (subquery) the expression evaluates to true if a value is greater than or equal to the biggest value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name &amp;lt; ALL (subquery) the expression evaluates to true if a value is less than the smallest value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name &amp;lt;= ALL (subquery) the expression evaluates to true if a value is less than or equal to the smallest value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name = ALL (subquery) the expression evaluates to true if a value is equal to every value returned by the subquery.&lt;/li&gt;
&lt;li&gt;column_name != ALL (subquery) the expression evaluates to true if a value is not equal to any value returned by the subquery.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    length
FROM
    film
WHERE
    length &amp;gt; ALL (
            SELECT
                ROUND(AVG (length),2)
            FROM
                film
            GROUP BY
                rating
    )
ORDER BY
    length
FETCH FIRST 25 ROWS ONLY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KgEzw08G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhi70mdp07wlb5kwnzhh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KgEzw08G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dhi70mdp07wlb5kwnzhh.png" alt="Using ALL in Subquery" width="768" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Exists Operator
&lt;/h3&gt;

&lt;p&gt;The EXISTS operator is a boolean operator that tests for the existence of rows in a subquery. It accepts an argument which is a subquery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXISTS (subquery)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    column1, column2
FROM 
    table_1
WHERE 
    EXISTS( SELECT 
                1 
            FROM 
                table_2 
            WHERE 
                column_2 = table_1.column_1);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query returns customers who have paid at least one rental with an amount greater than 15.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT first_name,
       last_name
FROM customer c
WHERE EXISTS
    (SELECT 1
     FROM payment p
     WHERE p.customer_id = c.customer_id
       AND amount &amp;gt; 15 )
ORDER BY first_name,
         last_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3AjOxHLt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chlqjz55v0s8hzbtmisb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3AjOxHLt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/chlqjz55v0s8hzbtmisb.png" alt="Exists Subquery" width="663" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Insert
&lt;/h3&gt;

&lt;p&gt;The INSERT statement allows you to insert a new row into a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO table_name(column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, this inserts a row into a table called links, supplying values for the url, name, and last_modified columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO links (url, name, last_modified)
VALUES('https://www.dev.to','DEV','2022-09-20');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert Multiple Rows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO table_name (column_list)
VALUES
    (value_list_1),
    (value_list_2),
    (value_list_3),
    ...
    (value_list_n);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO 
    links (url, name, last_modified)
VALUES
    ('https://www.tradingview.com', 'tradingview', '2022-09-15'),
    ('https://www.codenewbie.com','codenewbie', '2022-09-18'),
    ('https://www.forem.com','Forem', '2022-09-20'),
    ('https://www.bitbucket.com', 'Bitbucket', '2022-09-20');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Update
&lt;/h3&gt;

&lt;p&gt;The UPDATE statement allows you to modify data in a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE table_name
SET column1 = value1,
    column2 = value2,
    column3 = value3
    ...
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE subjects
SET published_date = '2022-08-15' 
WHERE subject_id = 231;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns the following message after one row has been updated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Delete
&lt;/h3&gt;

&lt;p&gt;The DELETE statement allows you to delete one or more rows from a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM table_name
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will delete the row where the id is 7.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM links
WHERE id = 7;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query deletes all the rows in the links table since we did not specify a WHERE clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM links;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To delete all contacts whose phones are in the blacklist table, use a subquery: the subquery returns a list of phones from the blacklist table, and the DELETE statement deletes the contacts whose phones match the phones returned by the subquery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DELETE FROM contacts
WHERE phone IN (SELECT phone FROM blacklist);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Upsert
&lt;/h3&gt;

&lt;p&gt;Also referred to as merge: when you insert a new row into the table, PostgreSQL will update the row if it already exists; otherwise, it will insert the new row.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO table_name(column_list) 
VALUES(value_list)
ON CONFLICT target action;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO customers (name, email)
VALUES('tradingview','hotline@tradingview') 
ON CONFLICT (name) 
DO NOTHING;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO customers (name, email)
VALUES('tradingview','hotline@tradingview') 
ON CONFLICT (name) 
DO 
   UPDATE SET email = EXCLUDED.email || ';' || customers.email;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Common Table Expressions (CTE)
&lt;/h3&gt;

&lt;p&gt;A common table expression is a temporary result set which you can reference within another SQL statement, including SELECT, INSERT, UPDATE or DELETE. CTEs only exist during the execution of the query and are used to simplify complex joins and subqueries in PostgreSQL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH cte_name (column_list) AS (
    CTE_query_definition 
)
statement;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Advantages of using CTEs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improve the readability of complex queries.&lt;/li&gt;
&lt;li&gt;Ability to create recursive queries, queries that reference themselves. &lt;/li&gt;
&lt;li&gt;Use CTEs in conjunction with window functions to create an initial result set and use another select statement to further process this result set.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this query, the CTE returns a result set that includes the staff id and the number of rentals. The outer query then joins the staff table with the CTE using the staff_id column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH cte_rental AS (
    SELECT staff_id,
        COUNT(rental_id) rental_count
    FROM   rental
    GROUP  BY staff_id
)
SELECT s.staff_id,
    first_name,
    last_name,
    rental_count
FROM staff s
    INNER JOIN cte_rental USING (staff_id); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Recursive Query
&lt;/h3&gt;

&lt;p&gt;A recursive CTE has three elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Non-recursive term: a CTE query definition that forms the base result set of the CTE structure.&lt;/li&gt;
&lt;li&gt;Recursive term: one or more CTE query definitions joined with the non-recursive term using the UNION or UNION ALL operator.&lt;/li&gt;
&lt;li&gt;Termination check: the recursion stops when no rows are returned from the previous iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sequence that PostgreSQL executes a recursive CTE: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute the non-recursive term to create the base result set &lt;/li&gt;
&lt;li&gt;Execute recursive term with Ri as an input to return the result set Ri+1 as the output.&lt;/li&gt;
&lt;li&gt;Repeat step 2 until an empty set is returned (termination check)&lt;/li&gt;
&lt;li&gt;Return the final result set that is a UNION or UNION ALL of the result set
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH RECURSIVE cte_name AS(
    CTE_query_definition -- non-recursive term
    UNION [ALL]
    CTE_query_definition  -- recursive term
) SELECT * FROM cte_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
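
&lt;p&gt;A minimal, self-contained sketch: this recursive CTE generates the numbers 1 through 5. The non-recursive term produces the starting row, and the recursive term adds 1 to the previous result until the termination check returns no rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH RECURSIVE counter AS (
    SELECT 1 AS n       -- non-recursive term: base result set
    UNION ALL
    SELECT n + 1        -- recursive term
    FROM counter
    WHERE n &amp;lt; 5        -- termination check
)
SELECT n FROM counter;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;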



&lt;p&gt;I trust you have understood the content that we have covered so far. Let's gear up and continue learning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QkMfY28C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eq92ali1gr8vvbgkzuug.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QkMfY28C--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eq92ali1gr8vvbgkzuug.gif" alt="Clapping for yourself" width="500" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Managing Tables
&lt;/h2&gt;

&lt;p&gt;PostgreSQL Data Types, Create Table, Select Into, Create Table As, Serial, Sequences, Identity Column, Alter Table, Rename Table, Add Column, Drop Column, Change Column’s Data Type, Rename Column, Drop Table, Temporary Table, Truncate Table&lt;/p&gt;

&lt;h3&gt;
  
  
  Transaction
&lt;/h3&gt;

&lt;p&gt;A database transaction is a single unit of work that consists of one or more operations. A PostgreSQL transaction is atomic, consistent, isolated, and durable. These properties are often referred to as ACID:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Atomicity guarantees that the transaction completes in an all-or-nothing manner.&lt;/li&gt;
&lt;li&gt;Consistency ensures the change to data written to the database must be valid and follow predefined rules.&lt;/li&gt;
&lt;li&gt;Isolation determines how transaction integrity is visible to other transactions.&lt;/li&gt;
&lt;li&gt;Durability makes sure that transactions that have been committed will be stored in the database permanently.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- start a transaction
BEGIN;

-- insert a new row into the accounts table
INSERT INTO accounts(name,balance)
VALUES('Alice',10000);

-- commit the change (or roll it back later)
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- start a transaction
BEGIN;

-- deduct 1000 from account 1
UPDATE accounts 
SET balance = balance - 1000
WHERE id = 1;

-- add 1000 to account 2
UPDATE accounts
SET balance = balance + 1000
WHERE id = 2; 

-- select the data from accounts
SELECT id, name, balance
FROM accounts;

-- commit the transaction
COMMIT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- begin the transaction
BEGIN;

-- deduct the amount from the account 1
UPDATE accounts 
SET balance = balance - 1500
WHERE id = 1;

-- add the amount from the account 3 (instead of 2)
UPDATE accounts
SET balance = balance + 1500
WHERE id = 3; 

-- roll back the transaction
ROLLBACK;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The easiest way to export the data of a table to a CSV file is to use the COPY statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY persons TO '/home/exporter/persons_db.csv' DELIMITER ',' CSV HEADER;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
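
&lt;p&gt;The reverse direction works the same way: COPY ... FROM imports a CSV file into an existing table (the file path here is hypothetical, and the server process must be able to read it):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY persons FROM '/home/exporter/persons_db.csv' DELIMITER ',' CSV HEADER;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;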



&lt;p&gt;A relational database consists of multiple related tables. A table consists of rows and columns. Tables allow you to store structured data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE [IF NOT EXISTS] table_name (
   column1 datatype(length) column_constraint,
   column2 datatype(length) column_constraint,
   column3 datatype(length) column_constraint,
   table_constraints
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Column Constraints&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NOT NULL – ensures that values in a column cannot be NULL.&lt;/li&gt;
&lt;li&gt;UNIQUE – ensures the values in a column are unique across the rows within the same table.&lt;/li&gt;
&lt;li&gt;PRIMARY KEY – a primary key column uniquely identifies rows in a table. A table can have one and only one primary key.&lt;/li&gt;
&lt;li&gt;FOREIGN KEY – ensures values in a column or a group of columns from a table exists in a column or group of columns in another table. Unlike the primary key, a table can have many foreign keys.&lt;/li&gt;
&lt;li&gt;CHECK – a CHECK constraint ensures the data must satisfy a boolean expression.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE accounts (
    user_id serial PRIMARY KEY,
    username VARCHAR ( 50 ) UNIQUE NOT NULL,
    password VARCHAR ( 50 ) NOT NULL,
    email VARCHAR ( 255 ) UNIQUE NOT NULL,
    created_on TIMESTAMP NOT NULL,
    last_login TIMESTAMP
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE roles(
   role_id serial PRIMARY KEY,
   role_name VARCHAR (255) UNIQUE NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE account_roles (
  user_id INT NOT NULL,
  role_id INT NOT NULL,
  grant_date TIMESTAMP,
  PRIMARY KEY (user_id, role_id),
  FOREIGN KEY (role_id)
      REFERENCES roles (role_id),
  FOREIGN KEY (user_id)
      REFERENCES accounts (user_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    film_id,
    title,
    length 
INTO TEMP TABLE short_film
FROM
    film
WHERE
    length &amp;lt; 60
ORDER BY
    title;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM short_film;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE action_film AS
SELECT
    film_id,
    title,
    release_year,
    length,
    rating
FROM
    film
INNER JOIN film_category USING (film_id)
WHERE
    category_id = 1;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT * FROM action_film
ORDER BY title;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Serial
&lt;/h3&gt;

&lt;p&gt;The SERIAL pseudo-type creates an auto-incrementing integer column. Behind the scenes, PostgreSQL creates a sequence and uses it to supply the column's default values.&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE table_name(
    id SERIAL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE fruits(
   id SERIAL PRIMARY KEY,
   name VARCHAR NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO fruits(name) 
VALUES('Orange');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sequence
&lt;/h3&gt;

&lt;p&gt;A sequence is a database object that generates an ordered list of integers, commonly used to supply unique identifiers. The CREATE SEQUENCE statement has the following syntax:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SEQUENCE [ IF NOT EXISTS ] sequence_name
    [ AS { SMALLINT | INT | BIGINT } ]
    [ INCREMENT [ BY ] increment ]
    [ MINVALUE minvalue | NO MINVALUE ] 
    [ MAXVALUE maxvalue | NO MAXVALUE ]
    [ START [ WITH ] start ] 
    [ CACHE cache ] 
    [ [ NO ] CYCLE ]
    [ OWNED BY { table_name.column_name | NONE } ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE order_details(
    order_id SERIAL,
    item_id INT NOT NULL,
    item_text VARCHAR NOT NULL,
    price DEC(10,2) NOT NULL,
    PRIMARY KEY(order_id, item_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE SEQUENCE order_item_id
START 10
INCREMENT 10
MINVALUE 10
OWNED BY order_details.item_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO 
    order_details(order_id, item_id, item_text, price)
VALUES
    (100, nextval('order_item_id'),'DVD Player',100),
    (100, nextval('order_item_id'),'Android TV',550),
    (100, nextval('order_item_id'),'Speaker',250);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    order_id,
    item_id,
    item_text,
    price
FROM
    order_details;        
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;List all sequences in the current database&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    relname sequence_name
FROM 
    pg_class 
WHERE 
    relkind = 'S';
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;First, specify the name of the sequence that you want to drop, and use the CASCADE option if you want to recursively drop objects that depend on the sequence, and objects that depend on those dependent objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP SEQUENCE [ IF EXISTS ] sequence_name [, ...] 
[ CASCADE | RESTRICT ];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP TABLE order_details;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PostgreSQL identity column
&lt;/h3&gt;

&lt;p&gt;PostgreSQL version 10 introduced a new constraint GENERATED AS IDENTITY that allows you to automatically assign a unique number to a column. The GENERATED AS IDENTITY constraint is the SQL standard-conforming variant of the good old SERIAL column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;column_name type GENERATED { ALWAYS | BY DEFAULT } AS IDENTITY[ ( sequence_option ) ]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a table named color with the color_id as the identity column:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE color (
    color_id INT GENERATED ALWAYS AS IDENTITY,
    color_name VARCHAR NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert new rows into the color table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO color(color_name)
VALUES
    ('Green'),
    ('Blue');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Alter Table
&lt;/h3&gt;

&lt;p&gt;To change the structure of an existing table, you use the PostgreSQL ALTER TABLE statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name action;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With ALTER TABLE you can: add a column, drop a column, change the data type of a column, rename a column, set a default value for a column, add a constraint to a column, or rename a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE links (
   link_id serial PRIMARY KEY,
   title VARCHAR (512) NOT NULL,
   url VARCHAR (1024) NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add a new column named active, you use the following statement&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE links
ADD COLUMN active boolean;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To remove the active column from the links table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE links 
DROP COLUMN active;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add a new column named target to the links table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE links 
ADD COLUMN target VARCHAR(10);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To change the name of the links table to short_urls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE links 
RENAME TO short_urls;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
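
&lt;p&gt;The remaining actions follow the same pattern. For example, this sketch renames a column and changes a column's data type on the short_urls table from the previous step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- rename the title column
ALTER TABLE short_urls
RENAME COLUMN title TO link_title;

-- change the data type of the url column
ALTER TABLE short_urls
ALTER COLUMN url TYPE TEXT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;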





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name 
DROP COLUMN column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To drop a column that other objects depend on, use the CASCADE option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name 
DROP COLUMN column_name CASCADE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Drop Table
&lt;/h3&gt;

&lt;p&gt;To drop a table from the database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP TABLE [IF EXISTS] table_name 
[CASCADE | RESTRICT];
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CASCADE option allows you to remove the table and its dependent objects. The RESTRICT option rejects the removal if any object depends on the table. The RESTRICT option is the default if you don’t explicitly specify it in the DROP TABLE statement.&lt;/p&gt;
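
&lt;p&gt;For example, this drops the action_film table created earlier; the IF EXISTS option avoids an error if the table is not there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DROP TABLE IF EXISTS action_film;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;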

&lt;h3&gt;
  
  
  Truncate Table
&lt;/h3&gt;

&lt;p&gt;The TRUNCATE TABLE statement deletes all data from a table without scanning it, which makes it faster than the DELETE statement. The TRUNCATE TABLE statement also reclaims the storage right away, so you do not have to perform a subsequent VACUUM operation, which is useful in the case of large tables.&lt;/p&gt;

&lt;p&gt;This query removes all data and resets the identity column value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRUNCATE TABLE table_name 
RESTART IDENTITY;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To remove data from a table and from other tables that have foreign key references to it, you use the CASCADE option in the TRUNCATE TABLE statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TRUNCATE TABLE table_name 
CASCADE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The TRUNCATE TABLE statement is transaction-safe. This means that if you place it within a transaction, you can roll it back safely.&lt;/p&gt;
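
&lt;p&gt;A minimal sketch of this behaviour (the table name is illustrative): the truncation is undone when the transaction rolls back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BEGIN;
TRUNCATE TABLE short_urls;
-- the table now appears empty inside this transaction
ROLLBACK;
-- after the rollback, the data in short_urls is intact again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;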

&lt;h2&gt;
  
  
  Database Constraints
&lt;/h2&gt;

&lt;p&gt;PostgreSQL supports several constraints: Primary Key, Foreign Key, Check Constraint, Unique Constraint and Not-Null Constraint.&lt;/p&gt;
&lt;h3&gt;
  
  
  Primary Key
&lt;/h3&gt;

&lt;p&gt;A primary key is a column or a group of columns used to identify a row uniquely in a table. A table can have one and only one primary key. It is a good practice to add a primary key to every table. PostgreSQL creates a unique B-tree index on the column or a group of columns used to define the primary key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE po_headers (
    po_no SERIAL PRIMARY KEY,
    vendor_no INTEGER,
    description TEXT,
    shipping_address TEXT
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;General syntax for removing a primary key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE table_name DROP CONSTRAINT primary_key_constraint;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, to remove the primary key from the products table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE products
DROP CONSTRAINT products_pkey;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Foreign Key
&lt;/h3&gt;

&lt;p&gt;A foreign key is a column or a group of columns in a table that reference the primary key of another table.&lt;/p&gt;

&lt;p&gt;In PostgreSQL, you define a foreign key using the foreign key constraint. The foreign key constraint helps maintain the referential integrity of data between the child and parent tables.&lt;br&gt;
A foreign key constraint indicates that values in a column or a group of columns in the child table equal the values in a column or a group of columns of the parent table.&lt;/p&gt;

&lt;p&gt;The syntax of a foreign key constraint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[CONSTRAINT fk_name]
   FOREIGN KEY(fk_columns) 
   REFERENCES parent_table(parent_key_columns)
   [ON DELETE delete_action]
   [ON UPDATE update_action]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The available actions are SET NULL, SET DEFAULT, RESTRICT, NO ACTION and CASCADE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE customers(
   customer_id INT GENERATED ALWAYS AS IDENTITY,
   customer_name VARCHAR(255) NOT NULL,
   PRIMARY KEY(customer_id)
);

CREATE TABLE contacts(
   contact_id INT GENERATED ALWAYS AS IDENTITY,
   customer_id INT,
   contact_name VARCHAR(255) NOT NULL,
   phone VARCHAR(15),
   email VARCHAR(100),
   PRIMARY KEY(contact_id),
   CONSTRAINT fk_customer
      FOREIGN KEY(customer_id) 
      REFERENCES customers(customer_id)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;ON DELETE CASCADE automatically deletes all the referencing rows in the child table when the referenced rows in the parent table are deleted.&lt;/li&gt;
&lt;li&gt;The SET NULL automatically sets NULL to the foreign key columns in the referencing rows of the child table when the referenced rows in the parent table are deleted.&lt;/li&gt;
&lt;li&gt;The RESTRICT action is similar to NO ACTION: PostgreSQL issues a constraint violation if the rows being deleted from the parent table still have referencing rows in the child table.&lt;/li&gt;
&lt;li&gt;The ON DELETE SET DEFAULT sets the default value to the foreign key column of the referencing rows in the child table when the referenced rows from the parent table are deleted.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE child_table
ADD CONSTRAINT constraint_fk
FOREIGN KEY (fk_columns)
REFERENCES parent_table(parent_key_columns)
ON DELETE CASCADE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Check Constraint
&lt;/h3&gt;

&lt;p&gt;A CHECK constraint is a kind of constraint that allows you to specify that values in a column must meet a specific requirement. It uses a Boolean expression to evaluate the values before they are inserted into or updated in the column.&lt;/p&gt;

&lt;p&gt;If the values pass the check, PostgreSQL will insert or update these values to the column. Otherwise, PostgreSQL will reject the changes and issue a constraint violation error.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE employees (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR (50),
    last_name VARCHAR (50),
    birth_date DATE CHECK (birth_date &amp;gt; '1900-01-01'),
    joined_date DATE CHECK (joined_date &amp;gt; birth_date),
    salary numeric CHECK(salary &amp;gt; 0)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To add CHECK constraints to an existing table, you use the ALTER TABLE statement. First, create the prices_list table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE prices_list (
    id serial PRIMARY KEY,
    product_id INT NOT NULL,
    price NUMERIC NOT NULL,
    discount NUMERIC NOT NULL,
    valid_from DATE NOT NULL,
    valid_to DATE NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER TABLE prices_list 
ADD CONSTRAINT price_discount_check 
CHECK (
    price &amp;gt; 0
    AND discount &amp;gt;= 0
    AND price &amp;gt; discount
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Unique Constraint
&lt;/h3&gt;

&lt;p&gt;Sometimes, you want to ensure that values stored in a column or a group of columns are unique across the whole table, such as email addresses or usernames.&lt;br&gt;
PostgreSQL provides the UNIQUE constraint to maintain the uniqueness of the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE person (
    id SERIAL PRIMARY KEY,
    first_name VARCHAR (50),
    last_name VARCHAR (50),
    email VARCHAR (50) UNIQUE
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Not-Null Constraint
&lt;/h3&gt;

&lt;p&gt;In databases, NULL represents unknown or missing information. NULL is not the same as an empty string or the number zero.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE invoices(
  id SERIAL PRIMARY KEY,
  product_id INT NOT NULL,
  qty numeric NOT NULL CHECK(qty &amp;gt; 0),
  net_price numeric CHECK(net_price &amp;gt; 0) 
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Use the NOT NULL constraint to enforce that a column does not accept NULL. By default, a column can hold NULL.&lt;/li&gt;
&lt;li&gt;To check if a value is NULL or not, you use the IS NULL operator. The IS NOT NULL negates the result of the IS NULL.&lt;/li&gt;
&lt;li&gt;Never use the equality operator = to compare a value with NULL because the comparison always returns NULL.&lt;/li&gt;
&lt;/ul&gt;
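
&lt;p&gt;For example, using the invoices table above, the following finds the rows whose net_price is missing; note that WHERE net_price = NULL would match nothing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT id, product_id, qty
FROM invoices
WHERE net_price IS NULL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;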

&lt;h2&gt;
  
  
  PostgreSQL Data Types
&lt;/h2&gt;

&lt;p&gt;Boolean, Char, VarChar, and Text, Numeric, Integer, Serial, Date, Timestamp, Interval, Time, Uuid, Json, Hstore, Array, User-defined Data Types&lt;/p&gt;

&lt;h3&gt;
  
  
  Boolean
&lt;/h3&gt;

&lt;p&gt;PostgreSQL supports a single Boolean data type: BOOLEAN that can have three values: true, false and NULL.&lt;br&gt;
PostgreSQL uses one byte for storing a boolean value in the database. The BOOLEAN can be abbreviated as BOOL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE stock_availability (
   product_id INT PRIMARY KEY,
   available BOOLEAN NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Char, VarChar, Text
&lt;/h3&gt;

&lt;p&gt;PostgreSQL provides three primary character types: CHARACTER(n) or CHAR(n), CHARACTER VARYING(n) or VARCHAR(n), and TEXT, where n is a positive integer.&lt;/p&gt;

&lt;p&gt;The advantage of specifying the length for the VARCHAR data type is that PostgreSQL will issue an error if you attempt to insert a string that has more than n characters into a VARCHAR(n) column.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL supports CHAR, VARCHAR, and TEXT data types. The CHAR is fixed-length character type while the VARCHAR and TEXT are varying length character types.&lt;/li&gt;
&lt;li&gt;Use VARCHAR(n) if you want to validate the length of the string (n) before inserting into or updating to a column.&lt;/li&gt;
&lt;li&gt;VARCHAR (without the length specifier) and TEXT are equivalent.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE chemical_compounds (
    id serial PRIMARY KEY,
    first CHAR (7),
    second VARCHAR (19),
    third TEXT
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Numeric Type
&lt;/h3&gt;

&lt;p&gt;The NUMERIC type can store numbers with a lot of digits. Typically, you use the NUMERIC type for numbers that require exactness such as monetary amounts or quantities.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NUMERIC(precision, scale)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The precision is the total number of digits and the scale is the number of digits in the fraction part. For example, the number 8765.351 has the precision 7 and scale 3.&lt;/p&gt;
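
&lt;p&gt;As a quick illustration, casting that value to a smaller scale rounds it (PostgreSQL rounds NUMERIC values half away from zero):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 8765.351::NUMERIC(6, 2);
-- 8765.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;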

&lt;p&gt;If precision is not required, you should not use the NUMERIC type because calculations on NUMERIC values are typically slower than integers, floats, and double precision.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    price NUMERIC(5,2)
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Integer Data Types
&lt;/h3&gt;

&lt;p&gt;To store the whole numbers in PostgreSQL, you use one of the following integer types: SMALLINT, INTEGER, and BIGINT.&lt;/p&gt;

&lt;p&gt;Use the SMALLINT type for storing small-range values such as people's ages or the number of pages of a book.&lt;br&gt;
INTEGER is the most common choice among the integer types because it offers the best balance between storage size, range, and performance.&lt;br&gt;
The BIGINT type not only consumes more storage but can also decrease the performance of the database; therefore, you should have a good reason to use it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE cities (
    city_id serial PRIMARY KEY,
    city_name VARCHAR (255) NOT NULL,
    population INT NOT NULL CHECK (population &amp;gt;= 0)
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Date
&lt;/h3&gt;

&lt;p&gt;To store date values, use the PostgreSQL DATE data type, which uses 4 bytes to store a date value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE employees (
    employee_id serial PRIMARY KEY,
    first_name VARCHAR (255),
    last_name VARCHAR (355),
    birth_date DATE NOT NULL,
    hire_date DATE NOT NULL
);

INSERT INTO employees (first_name, last_name, birth_date, hire_date)
VALUES ('Derrick','Kimani','1990-05-01','2015-06-01'),
       ('Florence','Wanjiru','1991-03-05','2013-04-01'),
       ('Richard','Chege','1992-09-01','2011-10-01');
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get the current date, use the built-in NOW() function cast to date&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT NOW()::date;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT CURRENT_DATE;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get the year, quarter, month, week, day from a date value, you use the EXTRACT() function.&lt;br&gt;
The following statement extracts the year, month, and day from the birth dates of employees:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    employee_id,
    first_name,
    last_name,
    EXTRACT (YEAR FROM birth_date) AS YEAR,
    EXTRACT (MONTH FROM birth_date) AS MONTH,
    EXTRACT (DAY FROM birth_date) AS DAY
FROM
    employees;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Timestamp
&lt;/h3&gt;

&lt;p&gt;The timestamp datatype allows you to store both date and time. However, it does not have any time zone data. It means that when you change the timezone of your database server, the timestamp value stored in the database will not change automatically.&lt;br&gt;
The timestamptz datatype is the timestamp with the time zone. The timestamptz datatype is a time zone-aware date and time data type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT CURRENT_TIMESTAMP;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT TIMEOFDAY();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
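
&lt;p&gt;A small sketch of the difference between the two types (the time zone name is illustrative): the timestamp literal is stored as written, while the timestamptz literal is interpreted in the session time zone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SET timezone = 'Africa/Nairobi';

SELECT '2022-11-01 10:00:00'::timestamp;   -- stored as-is, no time zone
SELECT '2022-11-01 10:00:00'::timestamptz; -- converted using the session time zone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;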



&lt;h3&gt;
  
  
  Time
&lt;/h3&gt;

&lt;p&gt;The TIME data type stores the time of day. A time value may have a precision of up to 6 digits; the precision specifies the number of fractional digits in the seconds field.&lt;br&gt;
The TIME data type requires 8 bytes and its allowed range is from 00:00:00 to 24:00:00.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;column_name TIME(precision);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE shifts (
    id serial PRIMARY KEY,
    shift_name VARCHAR NOT NULL,
    start_at TIME NOT NULL,
    end_at TIME NOT NULL
);  
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT LOCAL TIME;

SELECT CURRENT TIME;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To extract hours, minutes and seconds from a time value, you use the EXTRACT function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    LOCALTIME,
    EXTRACT (HOUR FROM LOCALTIME) as hour,
    EXTRACT (MINUTE FROM LOCALTIME) as minute, 
    EXTRACT (SECOND FROM LOCALTIME) as second,
    EXTRACT (milliseconds FROM LOCALTIME) as milliseconds; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  UUID Data Type
&lt;/h3&gt;

&lt;p&gt;UUID stands for Universally Unique Identifier, defined by RFC 4122 and other related standards. A UUID value is a 128-bit quantity generated by an algorithm that makes it effectively unique in the known universe.&lt;br&gt;
A UUID is written as a sequence of 32 hexadecimal digits in groups separated by hyphens.&lt;/p&gt;

&lt;p&gt;Because of this uniqueness, UUIDs are often found in distributed systems: they guarantee better uniqueness than the SERIAL data type, which generates unique values only within a single database. To store UUID values in a PostgreSQL database, you use the UUID data type.&lt;/p&gt;

&lt;p&gt;To install the uuid-ossp module, you use the CREATE EXTENSION statement&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to generate a UUID value solely based on random numbers, use the uuid_generate_v4() function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT uuid_generate_v4();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a table whose primary key is of the UUID data type; the values of the primary key column will be generated automatically by the uuid_generate_v4() function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE contacts (
    contact_id uuid DEFAULT uuid_generate_v4 (),
    first_name VARCHAR NOT NULL,
    last_name VARCHAR NOT NULL,
    email VARCHAR NOT NULL,
    phone VARCHAR,
    PRIMARY KEY (contact_id)
);

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO contacts (
    first_name,
    last_name,
    email
)
VALUES
    (
        'Kamau',
        'Kelvin',
        'kamau.kelvin@example.com'
    ),
    (
        'Nafula',
        'Wepkulu',
        'nafula.wepkulu@example.com'
    ),
    (
        'Kasunda',
        'Mutorini',
        'kasunda.mutorini@example.com'
    );

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Query the table to view the generated UUID values in the contact_id column.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    *
FROM
    contacts;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t_ySyR06--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6t9zpbwrtg2ljtjn7h88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t_ySyR06--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6t9zpbwrtg2ljtjn7h88.png" alt="Uuid in the customer_id column" width="880" height="99"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hstore data type
&lt;/h3&gt;

&lt;p&gt;The hstore module implements the hstore data type for storing key-value pairs in a single value.&lt;br&gt;
The hstore data type is useful in many cases, such as semi-structured data or rows with many attributes that are rarely queried. Note that keys and values are just text strings.&lt;/p&gt;

&lt;p&gt;To enable the hstore extension, which loads the contrib module into your PostgreSQL instance, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE EXTENSION hstore;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE books (
    id serial primary key,
    title VARCHAR (255),
    attr hstore
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data that we insert into the hstore column is a list of comma-separated key =&amp;gt; value pairs. Both keys and values are quoted using double quotes (“”).&lt;/p&gt;
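
&lt;p&gt;For example, inserting a row into the books table above (the attribute keys are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO books (title, attr)
VALUES (
    'PostgreSQL Tutorial',
    '"paperback" =&amp;gt; "243", "publisher" =&amp;gt; "postgresqltutorial.com", "language" =&amp;gt; "English"'
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;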

&lt;p&gt;PostgreSQL provides the hstore_to_json() function to convert hstore data to JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
  title,
  hstore_to_json (attr) json
FROM
  books;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  JSON data type
&lt;/h3&gt;

&lt;p&gt;JSON stands for JavaScript Object Notation. JSON is an open standard format that consists of key-value pairs and is human-readable text.&lt;br&gt;
The main usage of JSON is to transport data between a server and a web application. &lt;/p&gt;

&lt;p&gt;The orders table consists of two columns:&lt;/p&gt;

&lt;p&gt;The id column is the primary key column that identifies the order. The info column stores the data in the form of JSON.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE orders (
    id serial NOT NULL PRIMARY KEY,
    info json NOT NULL
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;PostgreSQL provides two native operators -&amp;gt; and -&amp;gt;&amp;gt; to help you query JSON data.&lt;br&gt;
The operator -&amp;gt; returns a JSON object field by key as JSON.&lt;br&gt;
The operator -&amp;gt;&amp;gt; returns a JSON object field by key as text.&lt;/p&gt;
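
&lt;p&gt;For example, assuming each info value contains a customer key, -&amp;gt; keeps the field as JSON while -&amp;gt;&amp;gt; returns it as text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT info -&amp;gt; 'customer' AS customer_json,
       info -&amp;gt;&amp;gt; 'customer' AS customer_text
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;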

&lt;p&gt;We can apply aggregate functions such as MIN, MAX, SUM and AVG to JSON data. For example, the following statement returns the minimum quantity, maximum quantity, total quantity and average quantity of products sold.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
   MIN (CAST (info -&amp;gt; 'items' -&amp;gt;&amp;gt; 'qty' AS INTEGER)),
   MAX (CAST (info -&amp;gt; 'items' -&amp;gt;&amp;gt; 'qty' AS INTEGER)),
   SUM (CAST (info -&amp;gt; 'items' -&amp;gt;&amp;gt; 'qty' AS INTEGER)),
   AVG (CAST (info -&amp;gt; 'items' -&amp;gt;&amp;gt; 'qty' AS INTEGER))
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The json_each() function allows us to expand the outermost JSON object into a set of key-value pairs. &lt;/p&gt;

&lt;p&gt;To get a set of keys in the outermost JSON object, you use the json_object_keys() function.&lt;/p&gt;
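
&lt;p&gt;A minimal sketch of both functions against the orders table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT json_each(info)
FROM orders;

SELECT json_object_keys(info)
FROM orders;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;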

&lt;h3&gt;
  
  
  User-defined Data Types
&lt;/h3&gt;

&lt;p&gt;A domain is a data type with optional constraints, e.g., NOT NULL and CHECK. A domain has a unique name within the schema scope.&lt;br&gt;
Domains are useful for centralizing the management of fields with common constraints.&lt;/p&gt;
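
&lt;p&gt;For instance, a hypothetical domain that rejects NULLs and spaces can be reused across tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE DOMAIN contact_name AS
    VARCHAR NOT NULL CHECK (value !~ '\s');

CREATE TABLE mailing_list (
    id serial PRIMARY KEY,
    first_name contact_name,
    last_name contact_name
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;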

&lt;p&gt;The CREATE TYPE statement allows you to create a composite type, used as the return type of a function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TYPE film_summary AS (
    film_id INT,
    title VARCHAR,
    release_year SMALLINT
); 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the film_summary data type as the return type of a function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION get_film_summary (f_id INT) 
    RETURNS film_summary AS 
$$ 
SELECT
    film_id,
    title,
    release_year
FROM
    film
WHERE
    film_id = f_id ; 
$$ 
LANGUAGE SQL;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A user-defined function that returns a random number between two numbers low and high.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE OR REPLACE FUNCTION random_between(low INT ,high INT) 
   RETURNS INT AS
$$
BEGIN
   RETURN floor(random()* (high-low + 1) + low);
END;
$$ language 'plpgsql' STRICT;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT random_between(1,100);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To get multiple random numbers between two integers, execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT random_between(1, 100)
FROM generate_series(1, 4);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To list all user-defined types in the current database use the \dT or \dT+ command.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conditional Expressions &amp;amp; Operators
&lt;/h2&gt;

&lt;p&gt;CASE, COALESCE, NULLIF, CAST&lt;/p&gt;

&lt;h3&gt;
  
  
  CASE expression
&lt;/h3&gt;

&lt;p&gt;The PostgreSQL CASE expression is similar to the IF/ELSE statement in other programming languages. It allows you to add if-else logic to a query to form a powerful query.&lt;/p&gt;

&lt;p&gt;General CASE expression&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CASE 
      WHEN condition_1  THEN result_1
      WHEN condition_2  THEN result_2
      [WHEN ...]
      [ELSE else_result]
END
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    SUM (
        CASE
        WHEN rental_rate = 0.99 THEN 1
        ELSE 0
        END
    ) AS "Economy",
    SUM (
        CASE
        WHEN rental_rate = 2.99 THEN 1
        ELSE 0
        END
    ) AS "Mass",
    SUM (
        CASE
        WHEN rental_rate = 4.99 THEN 1
        ELSE 0
        END
    ) AS "Premium"
FROM
    film;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XeCF9UW8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ma02nvib7bf6fzx1kjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XeCF9UW8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6ma02nvib7bf6fzx1kjl.png" alt="Aggregate on CASE" width="778" height="91"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Coalesce
&lt;/h3&gt;

&lt;p&gt;The COALESCE function returns the first non-null argument. It accepts an unlimited number of arguments and returns the first argument that is not null. If all arguments are null, the COALESCE function returns null.&lt;/p&gt;

&lt;p&gt;The COALESCE function evaluates arguments from left to right until it finds the first non-null argument. All the remaining arguments after the first non-null argument are not evaluated.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COALESCE (argument_1, argument_2,argument_3, ...);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE TABLE items (
    ID serial PRIMARY KEY,
    product VARCHAR (100) NOT NULL,
    price NUMERIC NOT NULL,
    discount NUMERIC
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Insert records into the items table&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO items (product, price, discount)
VALUES
    ('Cassava', 1000 ,10),
    ('Yams', 1500 ,20),
    ('Arrow roots', 800 ,5),
    ('Potatoes', 500, NULL);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
    product,
    (price - COALESCE(discount,0)) AS net_price
FROM
    items;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The COALESCE function makes the query shorter and easier to read by substituting null values in the query.&lt;/p&gt;

&lt;h3&gt;
  
  
  NULLIF
&lt;/h3&gt;

&lt;p&gt;The PostgreSQL NULLIF function also helps handle null values. It takes exactly two arguments and returns NULL if they are equal; otherwise it returns the first argument.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NULLIF(argument_1, argument_2, argument_3);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Apply the NULLIF function to substitute null values when displaying data and to prevent division-by-zero errors.&lt;/p&gt;
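
&lt;p&gt;For example, wrapping a zero divisor in NULLIF turns it into NULL, so the division yields NULL instead of raising an error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 100 / NULLIF(0, 0);
-- returns NULL rather than a division-by-zero error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;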

&lt;h3&gt;
  
  
  CAST operator
&lt;/h3&gt;

&lt;p&gt;To convert a value of one data type into another, PostgreSQL provides the CAST operator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CAST ( expression AS target_type );
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specify an expression, which can be a constant, a table column, or any expression that evaluates to a value, followed by the target data type to which you want to convert the result.&lt;/p&gt;

&lt;p&gt;PostgreSQL also provides the shorthand type cast operator :: &lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;expression::type
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cast a string to a double&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
   CAST ('10.2' AS DOUBLE PRECISION);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cast a string to a date&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT
   CAST ('2015-01-01' AS DATE),
   CAST ('01-OCT-2015' AS DATE);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Use the cast operator to convert a string to an interval&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT '15 minute'::interval,
 '2 hour'::interval,
 '1 day'::interval,
 '2 week'::interval,
 '3 month'::interval;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explain
&lt;/h3&gt;

&lt;p&gt;The EXPLAIN statement returns the execution plan that the PostgreSQL planner generates for a given statement.&lt;br&gt;
EXPLAIN shows how the tables involved in a statement will be scanned (by index scan, sequential scan, etc.) and, if multiple tables are used, what kind of join algorithm will be used.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN [ ( option [, ...] ) ] sql_statement;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The option can be one of the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ANALYZE [ boolean ]
VERBOSE [ boolean ]
COSTS [ boolean ]
BUFFERS [ boolean ]
TIMING [ boolean ]  
SUMMARY [ boolean ]
FORMAT { TEXT | XML | JSON | YAML }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Boolean specifies whether the selected option should be turned on or off. You can use TRUE, ON, or 1 to enable the option, and FALSE, OFF, or 0 to disable it.&lt;/p&gt;

&lt;p&gt;The ANALYZE option causes the sql_statement to be executed first and then includes actual run-time statistics in the returned information, such as the total elapsed time spent within each plan node and the number of rows it actually returned.&lt;/p&gt;

&lt;p&gt;TIMING includes the actual startup time and the time spent in each node in the output. It defaults to TRUE and may only be used when ANALYZE is enabled.&lt;/p&gt;

&lt;p&gt;The COSTS option includes the estimated startup and total costs of each plan node, as well as the estimated number of rows and the estimated width of each row in the query plan. COSTS defaults to TRUE.&lt;/p&gt;

&lt;p&gt;BUFFERS adds information about buffer usage and can only be used when ANALYZE is enabled. By default, the BUFFERS parameter is set to FALSE.&lt;/p&gt;

&lt;p&gt;The VERBOSE parameter shows additional information regarding the plan. It is set to FALSE by default.&lt;/p&gt;

&lt;p&gt;The SUMMARY parameter adds summary information, such as total timing, after the query plan. Note that when the ANALYZE option is used, the summary information is included by default.&lt;/p&gt;

&lt;p&gt;FORMAT specifies the output format of the query plan: TEXT, XML, JSON or YAML. It is set to TEXT by default.&lt;/p&gt;
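
&lt;p&gt;The options are written as a parenthesized, comma-separated list. For example, to execute the query and report buffer usage in JSON format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM film;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;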

&lt;p&gt;An EXPLAIN ANALYZE statement that executes the query and returns the actual query plan.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;EXPLAIN ANALYZE
SELECT
    f.film_id,
    title,
    name category_name
FROM
    film f
    INNER JOIN film_category fc 
        ON fc.film_id = f.film_id
    INNER JOIN category c 
        ON c.category_id = fc.category_id
ORDER BY
    title;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xOv_RIq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yrv15u1u8a4mkdupue9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xOv_RIq2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0yrv15u1u8a4mkdupue9.png" alt="Explain statement to show query plan" width="880" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Indeed, we have covered the basics of SQL using PostgreSQL in detail. You need to practice what you have learnt on a sample database, so as to understand the concepts well. Our goal should be to write complex, readable queries that execute fast. The next SQL article will cover advanced topics in SQL such as Indexes, Views and Stored Procedures.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QbKHGw0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh5k7io1izfb4l7cdgh3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QbKHGw0q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jh5k7io1izfb4l7cdgh3.gif" alt="congratulations for reading to the end of our article" width="250" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Feel free to share your thoughts in the comments.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>beginners</category>
      <category>database</category>
    </item>
    <item>
      <title>Data Engineering Roadmap</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Sun, 18 Sep 2022 16:24:07 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/data-engineering-roadmap-34fk</link>
      <guid>https://dev.to/kinyungu_denis/data-engineering-roadmap-34fk</guid>
      <description>&lt;p&gt;Today, we will understand the road map for a data engineer. What one need to learn to become a good data engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Engineering Roadmap
&lt;/h2&gt;

&lt;p&gt;Software and technology requirements that you need:&lt;br&gt;
1). A cloud account: Google GCP, AWS or Azure.&lt;br&gt;
2). A Python IDE and a text editor, preferably Anaconda.&lt;br&gt;
3). SQL Server, MySQL Workbench, DBeaver or DbVisualizer.&lt;br&gt;
4). Git and a version control hosting service (preferably a GitHub account).&lt;br&gt;
5). Create an account on &lt;a href="https://www.atlassian.com"&gt;https://www.atlassian.com&lt;/a&gt; and understand the following Atlassian products.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jira, Trello, Confluence, Bitbucket
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  1). Data Engineering
&lt;/h3&gt;

&lt;p&gt;What is data engineering?&lt;br&gt;
What does a data engineer do?&lt;br&gt;
What is the difference between Data Engineers, ML Engineers and Data Scientists?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9k-3rQIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w70rwua1ivfcexxi4jml.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9k-3rQIM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w70rwua1ivfcexxi4jml.png" alt="Data Engineering Process" width="880" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data engineering is the practice of designing and building systems for collecting, storing, processing and analyzing large amounts of data at scale.&lt;/p&gt;

&lt;p&gt;In data engineering we develop and maintain large-scale data processing systems that prepare structured and unstructured data for analytical modelling and data-driven decisions.&lt;/p&gt;

&lt;p&gt;The aim of data engineering is to make quality data available for analysis and efficient data-driven decision making. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Data Engineering ecosystem consists of 4 things:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data — different data types, formats and sources of data.&lt;/li&gt;
&lt;li&gt;Data stores and repositories — Relational and non-relational databases, data warehouses, data marts, data lakes, and big data stores that store and process the data&lt;/li&gt;
&lt;li&gt;Data Pipelines — Collect/gather data from multiple sources, then clean, process and transform it into data which can be used for analysis.&lt;/li&gt;
&lt;li&gt;Analytics and Data-driven Decision Making — Make the well-processed data available for further business analytics, visualization and data-driven decision making.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The data engineering lifecycle consists of building/architecting data platforms; designing and implementing data stores, repositories and data lakes; gathering, importing, cleaning, pre-processing, querying and analyzing data; and performance monitoring, evaluation, optimization and fine-tuning of processes and systems.&lt;/p&gt;

&lt;p&gt;A Data Engineer is responsible for making quality data available from various sources: maintaining databases, building data pipelines, querying data, pre-processing data using tools such as Apache Hadoop and Spark, and developing data workflows using tools such as Airflow.&lt;/p&gt;

&lt;p&gt;Machine Learning Engineers are responsible for building ML algorithms, building data and ML models and deploying them; they have statistical and mathematical knowledge and measure, optimize and improve results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YmVhgQiD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j0ed4xxtnpkp776x38um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YmVhgQiD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j0ed4xxtnpkp776x38um.png" alt="Data Pipeline" width="880" height="399"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2). Python for Data Engineering
&lt;/h3&gt;

&lt;p&gt;Data engineering with Python has several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The role of a data engineer involves working with different types of data formats. For such cases, Python is best suited. Its standard library supports easy handling of .csv files, one of the most common data file formats.&lt;/li&gt;
&lt;li&gt;A data engineer is often required to use APIs to retrieve data from databases. The data in such cases is usually stored in JSON (JavaScript Object Notation) format, and Python has a built-in library, &lt;code&gt;json&lt;/code&gt;, to handle this type of data.&lt;/li&gt;
&lt;li&gt;Data engineering tools such as Apache Airflow and Apache NiFi are built around Directed Acyclic Graphs (DAGs). In Airflow, DAGs are defined in Python code that specifies tasks and their dependencies. Thus, learning Python will help data engineers use these tools efficiently.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The responsibility of a data engineer is not only to obtain data from different sources but also to process it. One of the most popular data processing engines is Apache Spark, which works with Python DataFrames and even offers an API, PySpark, to build scalable big data projects.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Luigi, a Python module, is widely considered a fantastic tool for data engineering.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Python is easy to learn and is free to use for the masses. An active community of developers strongly supports it.&lt;/p&gt;

&lt;p&gt;Basic Python&lt;br&gt;
 Maths Expressions&lt;br&gt;
 Strings&lt;br&gt;
 Variables&lt;br&gt;
 Loops&lt;br&gt;
 Functions&lt;br&gt;
 Lists, Tuples, Dictionaries and Sets&lt;br&gt;
Connecting with Databases&lt;br&gt;
 Boto3&lt;br&gt;
 Psycopg2&lt;br&gt;
 mysql&lt;br&gt;
Working with Data&lt;br&gt;
 JSON&lt;br&gt;
 JSONSCHEMA&lt;br&gt;
 datetime&lt;br&gt;
 Pandas&lt;br&gt;
 Numpy&lt;br&gt;
Connecting to APIs&lt;br&gt;
 Requests&lt;/p&gt;
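&lt;p&gt;As a small, self-contained sketch of the "Working with Data" topics above, using only the standard library (the records and field names here are invented):&lt;/p&gt;

```python
import json
from datetime import datetime

# Hypothetical raw JSON, e.g. the payload returned by an API.
raw = '[{"id": 1, "amount": "19.99", "created_at": "2022-09-01"},' \
      ' {"id": 2, "amount": "5.50", "created_at": "2022-09-03"}]'

records = json.loads(raw)

# Clean/transform: cast amounts to float and dates to datetime objects.
for r in records:
    r["amount"] = float(r["amount"])
    r["created_at"] = datetime.strptime(r["created_at"], "%Y-%m-%d")

total = round(sum(r["amount"] for r in records), 2)
print(total)  # 25.49
```

From here, libraries such as Pandas or NumPy would take over for heavier analysis.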

&lt;h3&gt;
  
  
  3). Scripting and Automation
&lt;/h3&gt;

&lt;p&gt;You need to learn automation, so that you can automate repetitive tasks and save time.&lt;br&gt;
Shell Scripting&lt;br&gt;
CRON&lt;br&gt;
ETL&lt;/p&gt;
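&lt;p&gt;The extract-transform-load (ETL) idea can be sketched in a few lines of Python. This is a toy script, not tied to any scheduler: in practice the extract step would read a real file or call an API, the load step would write to a target database, and CRON would run the script on a schedule.&lt;/p&gt;

```python
import csv
import io

# Extract: inline CSV so the sketch is self-contained; a real job
# would read a file or call an API here.
source = io.StringIO("region,amount\neast,10\nwest,20\neast,5\n")

# Transform: aggregate the amount per region.
totals = {}
for row in csv.DictReader(source):
    totals[row["region"]] = totals.get(row["region"], 0) + int(row["amount"])

# Load: a real pipeline would INSERT into a database; here we
# write the aggregated rows back out as CSV text.
sink = io.StringIO()
csv.writer(sink).writerows(sorted(totals.items()))
print(sink.getvalue())
```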

&lt;h3&gt;
  
  
  4). Relational Databases and SQL
&lt;/h3&gt;

&lt;p&gt;SQL is very critical on your data engineering path; learn to perform advanced queries on your data as well.&lt;br&gt;
RDBMS&lt;br&gt;
Data Modeling&lt;br&gt;
Basic SQL&lt;br&gt;
Advanced SQL&lt;br&gt;
BigQuery&lt;/p&gt;

&lt;h3&gt;
  
  
  5). NoSQL Databases
&lt;/h3&gt;

&lt;p&gt;As a data engineer you will work with a variety of data; unstructured data will commonly be stored in NoSQL databases.&lt;br&gt;
Unstructured Data&lt;br&gt;
Advanced ETL&lt;br&gt;
Map-Reduce&lt;br&gt;
Data Warehouses&lt;br&gt;
Data API&lt;/p&gt;

&lt;h3&gt;
  
  
  6). Data Analysis
&lt;/h3&gt;

&lt;p&gt;Pandas&lt;br&gt;
Numpy&lt;br&gt;
Web Scraping&lt;br&gt;
Data Visualization&lt;/p&gt;

&lt;h3&gt;
  
  
  7). Data Processing Techniques
&lt;/h3&gt;

&lt;p&gt;Batch Processing — Apache Spark&lt;br&gt;
Stream Processing — Spark Streaming&lt;br&gt;
Build Data Pipelines&lt;br&gt;
Target Databases&lt;br&gt;
Machine learning Algorithms&lt;/p&gt;

&lt;h3&gt;
  
  
  8). Big Data
&lt;/h3&gt;

&lt;p&gt;Big data basics&lt;br&gt;
HDFS in detail&lt;br&gt;
Hadoop Yarn&lt;br&gt;
Hive&lt;br&gt;
Pig&lt;br&gt;
Hbase&lt;/p&gt;

&lt;p&gt;Data engineering tools are specialized applications that make building data pipelines and designing algorithms easier and more efficient. These tools are responsible for making the day-to-day tasks of a data engineer easier in various ways.&lt;br&gt;
Data ingestion systems such as Kafka, for example, offer a seamless and quick data ingestion process while also allowing data engineers to locate appropriate data sources, analyze them, and ingest data for further processing.&lt;/p&gt;

&lt;p&gt;Data engineering tools support the process of transforming data. This is important since big data can be structured, unstructured or any other format; therefore, data engineers need data transformation tools to transform and process big data into the desired format.&lt;br&gt;
Database tools/frameworks like SQL, NoSQL, etc., allow data engineers to acquire, analyze, process, and manage huge volumes of data simply and efficiently.&lt;br&gt;
Visualization tools like Tableau and Power BI allow data engineers to generate valuable insights and create interactive dashboards.&lt;/p&gt;

&lt;p&gt;Apache Spark is a fast cluster computing framework which is used for processing, querying and analyzing Big data. Being based on In-memory computation, it has an advantage over several other Big Data Frameworks.&lt;/p&gt;

&lt;p&gt;Apache Spark was originally written in the Scala programming language; the open-source community has since developed an amazing tool, PySpark, to support Python for Apache Spark. PySpark helps data scientists interface with RDDs in Apache Spark and Python through its library Py4j.&lt;br&gt;
There are many features that make PySpark a better framework than others:&lt;br&gt;
Speed: It is 100x faster than traditional large-scale data processing frameworks&lt;br&gt;
Powerful Caching: Simple programming layer provides powerful caching and disk persistence capabilities&lt;br&gt;
Deployment: Can be deployed through Mesos, Hadoop via Yarn, or Spark’s own cluster manager&lt;br&gt;
Real Time: Real-time computation &amp;amp; low latency because of in-memory computation&lt;br&gt;
Polyglot: Supports programming in Scala, Java, Python and R&lt;/p&gt;

&lt;p&gt;Spark RDDs&lt;br&gt;
When it comes to iterative distributed computing, i.e. processing data over multiple jobs, we need to reuse or share data among those jobs. Earlier frameworks like Hadoop had problems when dealing with multiple operations/jobs:&lt;br&gt;
Storing data in intermediate storage such as HDFS&lt;br&gt;
Multiple I/O jobs make the computations slow&lt;br&gt;
Replication and serialization, which in turn make the process even slower&lt;/p&gt;

&lt;p&gt;RDDs try to solve all the problems by enabling fault-tolerant distributed In-memory computations. RDD is short for Resilient Distributed Datasets. RDD is a distributed memory abstraction which lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. They are the read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. &lt;/p&gt;

&lt;p&gt;There are several operations performed on RDDs:&lt;br&gt;
Transformations: Transformations create a new dataset from an existing one. They are lazily evaluated.&lt;br&gt;
Actions: Spark forces the calculations to execute only when actions are invoked on the RDDs.&lt;/p&gt;
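&lt;p&gt;The split between lazy transformations and eager actions can be mimicked with plain Python generators; this is only an analogy, not the PySpark API:&lt;/p&gt;

```python
# Transformations: build a lazy pipeline; nothing is computed yet.
numbers = range(1, 11)
doubled = (n * 2 for n in numbers)       # like rdd.map(lambda n: n * 2)
big = (n for n in doubled if n > 10)     # like rdd.filter(lambda n: n > 10)

# Action: materializing the generator forces the whole pipeline
# to run, just as rdd.collect() or rdd.take(n) triggers execution.
result = list(big)
print(result)  # [12, 14, 16, 18, 20]
```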

&lt;p&gt;Reading a file and displaying the top n elements:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rdd = sc.textFile("path/Sample")
rdd.take(n)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  9). Workflows
&lt;/h3&gt;

&lt;p&gt;Introduction to Airflow&lt;br&gt;
Airflow hands on project&lt;/p&gt;

&lt;h3&gt;
  
  
  10). Infrastructure
&lt;/h3&gt;

&lt;p&gt;Docker&lt;br&gt;
Kubernetes&lt;br&gt;
Business Intelligence&lt;/p&gt;

&lt;h3&gt;
  
  
  11). Cloud Computing
&lt;/h3&gt;

&lt;p&gt;Such as AWS, Microsoft Azure and Google Cloud Platform.&lt;/p&gt;

&lt;p&gt;1). Data Engineering Tools in AWS&lt;br&gt;
 Amazon Redshift, Amazon Athena&lt;br&gt;
2). Data Engineering Tools in Azure&lt;br&gt;
 Azure Data Factory, Azure Databricks&lt;/p&gt;

&lt;p&gt;This may seem like a lot to learn and cover as you become a Data Engineer; however, you need to master Python and advanced SQL well, since they are core to data engineering. Understand the big data tools that are available, and remember to build projects so that you understand what you are learning.&lt;/p&gt;

&lt;p&gt;Tools differ from one organization to another, so you need to understand the tools that your organization uses and become proficient with them.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>tutorial</category>
      <category>codenewbie</category>
    </item>
    <item>
      <title>Apache PySpark for Data Engineering</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Fri, 09 Sep 2022 21:04:29 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/apache-pyspark-for-data-engineering-3phi</link>
      <guid>https://dev.to/kinyungu_denis/apache-pyspark-for-data-engineering-3phi</guid>
      <description>&lt;p&gt;Greetings to my dear readers. I wrote an article about installing Apache PySpark in Ubuntu and explained about about Apache Spark. &lt;a href="https://dev.to/deno_exporter/to-install-apache-spark-and-run-pyspark-in-ubuntu-2204-4i79"&gt;Read it here&lt;/a&gt;. Now lets go take a deep dive into PySpark and know what it is. This article covers about Apache PySpark a tool that is used in data engineering, understand all details about PySpark and how to use it. One should have basic knowledge in Python, SQL to understand this article well.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache Spark?
&lt;/h2&gt;

&lt;p&gt;Let us first understand Apache Spark, then we will proceed to PySpark.&lt;br&gt;
Apache Spark is an open-source cluster computing framework used for processing, querying and analyzing big data. It lets you spread data and computations over clusters with multiple nodes (think of each node as a separate computer). Splitting up your data makes it easier to work with very large datasets, because each node only works with a small amount of data.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Apache PySpark?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez0258a53bstag386jt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fez0258a53bstag386jt9.png" alt="Python and Apache Spark"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apache Spark was originally written in the Scala programming language; the open-source community later developed a tool to support Python for Apache Spark, called PySpark. PySpark provides the Py4j library, with the help of which Python can be easily integrated with Apache Spark. It not only allows you to write Spark applications using Python APIs, but also provides the PySpark shell for interactively analyzing vast data in a distributed environment. PySpark is a tool in high demand among data engineers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Features of PySpark
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F936cl0llcbs3lcj2khpu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F936cl0llcbs3lcj2khpu.png" alt="PySpark Features"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Speed - PySpark allows us to achieve a high data processing speed, which is about 100 times faster in memory and 10 times faster on the disk.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Caching - The PySpark framework provides powerful caching and good disk persistence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time - PySpark provides real-time computation on a large amount of data because it focuses on in-memory processing, which gives it low latency.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deployment - We have local mode and cluster mode. In local mode it runs on a single machine, for example my laptop, which is convenient for testing and debugging. In cluster mode there is a set of predefined machines, and it is good for production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;PySpark works well with Resilient Distributed Datasets (RDDs)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Running our cluster locally
&lt;/h2&gt;

&lt;p&gt;To start any Spark application on a local cluster or a dataset, we use SparkConf to set some configuration and parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Commonly used features of the SparkConf when working with PySpark:
&lt;/h3&gt;

&lt;p&gt;set(key, value) - sets a configuration property.&lt;br&gt;
setMaster(value) - sets the master URL to connect to.&lt;br&gt;
setAppName(value) - sets the application name.&lt;br&gt;
get(key, defaultValue=None) - gets the configured value for a key, or the default if it is unset.&lt;br&gt;
setSparkHome(value) - sets the path where Spark is installed on worker nodes.&lt;/p&gt;

&lt;p&gt;The following example shows some attributes of SparkConf:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx40ii0tlt95tv1fiugvi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx40ii0tlt95tv1fiugvi.png" alt="SparkConf attributes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A Spark program first creates a SparkContext object, which tells the application how to access a cluster. To accomplish this, you need to set up SparkConf so that the SparkContext object contains the configuration information about the application.&lt;/p&gt;

&lt;h2&gt;
  
  
  SparkContext
&lt;/h2&gt;

&lt;p&gt;SparkContext is the first and most essential thing that gets initiated when we run any Spark application. It is the entry gate for any Spark-derived application or functionality. It is available as sc by default in PySpark.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note that creating any other variable instead of sc will give an error.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Inspecting our SparkContext:
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1nc9sirchvzsodnti5r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1nc9sirchvzsodnti5r.png" alt="Inspecting Spark Context"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Master - The URL of the cluster that Spark connects to.&lt;/p&gt;

&lt;p&gt;appName - The name of your task.&lt;/p&gt;

&lt;p&gt;Master and appName are the most widely used SparkContext parameters.&lt;/p&gt;

&lt;h2&gt;
  
  
  PySpark SQL
&lt;/h2&gt;

&lt;p&gt;PySpark SQL integrates relational processing with Spark's functional programming. It lets you extract data using an SQL query language, with queries written the same way as in SQL.&lt;/p&gt;

&lt;p&gt;PySpark SQL establishes the connection between the RDD and the relational table. It supports a wide range of data sources and algorithms in big data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features of PySpark SQL:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Incorporation with Spark - PySpark SQL queries are integrated with Spark programs, queries are used inside the Spark programs. Developers do not have to manually manage state failure or keep the application in sync with batch jobs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistence Data Access - PySpark SQL supports a shared way to access a variety of data sources like Parquet, JSON, Avro, Hive and JDBC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;User-Defined Functions - PySpark SQL has a language combined User-Defined Function (UDFs). UDF is used to define a new column-based function that extends the vocabulary of Spark SQL's DSL for transforming DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hive Compatibility - PySpark SQL runs unmodified Hive queries and allows full compatibility with current Hive data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Standard Connectivity - It provides a connection through JDBC or ODBC, the industry standards for connectivity for business intelligence tools.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Important classes of Spark SQL and DataFrames are the following:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.SparkSession: Represents the main entry point for DataFrame and SQL functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.DataFrame: Represents a distributed collection of data grouped into named columns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.Row: Represents a row of data in a DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.Column: Represents a column expression in a DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.DataFrameStatFunctions: Represents methods for statistics functionality.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.DataFrameNaFunctions: Represents methods for handling missing data (null values).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.GroupedData: Aggregation methods, returned by DataFrame.groupBy().&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.types: Represents a list of available data types.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.functions: Represents a list of built-in functions available for DataFrame.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;pyspark.sql.Window: Used to work with Window functions.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import pyspark 
from pyspark.sql import SparkSession   
spark = SparkSession.builder.getOrCreate()  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A SparkSession can be used to create the Dataset and DataFrame API. A SparkSession can also be used to create a DataFrame, register a DataFrame as a table, execute SQL over tables, cache a table, and read a parquet file.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

class builder


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is a builder of SparkSession.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

getOrCreate()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is used to get an existing SparkSession, or if there is no existing one, create a new one based on the options set in the builder.&lt;/p&gt;

&lt;p&gt;pyspark.sql.DataFrame&lt;/p&gt;

&lt;p&gt;A distributed collection of data grouped into named columns. A DataFrame is similar to a relational table in Spark SQL, and can be created using various functions in SQLContext.&lt;br&gt;
It can then be manipulated using several domain-specific-language (DSL) methods, which are pre-defined functions of DataFrame.&lt;/p&gt;
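&lt;p&gt;Putting the pieces together, a minimal sketch might look like this (it assumes a local PySpark installation; the data and column names are invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

# Get or create a SparkSession.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Create a DataFrame from an in-memory collection (hypothetical data).
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Register it as a temporary view and query it with SQL.
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE id = 2").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;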

&lt;h3&gt;
  
  
  Querying Using PySpark SQL
&lt;/h3&gt;

&lt;p&gt;This displays my file where SQL queries are executed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cmzx1ndetj9xcxb30my.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6cmzx1ndetj9xcxb30my.png" alt="The file where SQL queries are executed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbue90bxc5bpynxa37la.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbue90bxc5bpynxa37la.png" alt="Select Query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2jyycc7ebf88eni7zik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2jyycc7ebf88eni7zik.png" alt="Filter Query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The groupBy() function groups rows of the same category together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxlpowhc40ow0gagoszr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxlpowhc40ow0gagoszr.png" alt="Group by SQL query"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PySpark UDF
&lt;/h2&gt;

&lt;p&gt;The PySpark UDF (User-Defined Function) is used to define a new column-based function. Using User-Defined Functions (UDFs), you can write functions in Python and use them when writing Spark SQL queries.&lt;/p&gt;

&lt;p&gt;You can declare a User-Defined Function just like any other Python function. The trick comes later, when you register the Python function with Spark. To use such functions in PySpark, first register them through the spark.udf.register() function.&lt;br&gt;
It accepts two parameters:&lt;/p&gt;

&lt;p&gt;name - A string, function name you'll use in SQL queries.&lt;br&gt;
f - A Python function that contains the programming logic.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

spark.udf.register()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
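&lt;p&gt;Putting the two parameters together, a registration might look like this (a sketch assuming a running SparkSession; the function name is invented):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# f: a plain Python function containing the programming logic.
def to_upper(s):
    return s.upper() if s is not None else None

# name: "TO_UPPER", the string used in SQL queries afterwards.
spark.udf.register("TO_UPPER", to_upper)

spark.sql("SELECT TO_UPPER('spark') AS shouting").show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;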

&lt;p&gt;Py4JJavaError is the most common exception while working with UDFs. It usually comes from a data type mismatch between Python and Spark.&lt;/p&gt;

&lt;p&gt;An example of a user-defined function:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w60kacnyo8mek2bdc8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9w60kacnyo8mek2bdc8w.png" alt="A basic User Defined Function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  PySpark RDD(Resilient Distributed Dataset)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2sp60uzlo2ffn9iyged.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm2sp60uzlo2ffn9iyged.png" alt="Resilient Distributed dataSets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Resilient Distributed Datasets (RDDs) are an essential part of PySpark; they handle both structured and unstructured data and help perform in-memory computations on a large cluster. An RDD divides data into smaller parts based on a key. Dividing data into smaller chunks means that if one executor node fails, another node can still process the data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;In-memory Computation - Computed results are stored in distributed memory (RAM) instead of stable storage (disk) providing very fast computation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Immutability - The created data can be retrieved anytime but its value can't be changed. RDDs can only be created through deterministic operations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fault Tolerant - RDDs track data lineage information to reconstruct lost data automatically. If failure occurs in any partition of RDDs, then that partition can be re-computed from the original fault tolerant input dataset to create it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Coarse-Gained Operation - Coarse grained operation means that we can transform the whole dataset but not individual element on the dataset. On the other hand, fine grained mean we can transform individual element on the dataset.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Partitioning - RDDs are the collection of various data items that are so huge in size, they cannot fit into a single node and must be partitioned across various nodes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Persistence - Optimization technique where we can save the result of RDD evaluation. It stores the intermediate result so that we can use it further if required and reduces the computation complexity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Lazy Evaluation - It doesn't compute the result immediately; execution does not start until an action is triggered. When we call a transformation on an RDD, it does not execute immediately.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PySpark provides two methods to create RDDs: loading an external dataset, or distributing a collection of objects.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using parallelize
&lt;/h3&gt;

&lt;p&gt;Create an RDD with the &lt;code&gt;parallelize()&lt;/code&gt; function, which takes an existing collection in your program and distributes it through the SparkContext.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixs7lntqwaj06qp3or8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixs7lntqwaj06qp3or8g.png" alt="Using parallelize function"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also use the &lt;code&gt;createDataFrame()&lt;/code&gt; function. Since we already have a SparkSession, we can create our DataFrame directly.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcqj949cawql3aayr1kw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftcqj949cawql3aayr1kw.png" alt="Using Create DataFrame"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  External Data
&lt;/h3&gt;

&lt;p&gt;Read a single text file from HDFS, a local file system or any Hadoop-supported file system URI with &lt;code&gt;textFile()&lt;/code&gt;, or read a directory of text files with &lt;code&gt;wholeTextFiles()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdm1drf5acav4s0kp0xm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftdm1drf5acav4s0kp0xm.png" alt="Using TextFile"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Using read_csv()
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7ityu72pfdypvocmiah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg7ityu72pfdypvocmiah.png" alt="Using read_csv() to read a csv file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The output of using &lt;code&gt;scores_file.show()&lt;/code&gt; and &lt;code&gt;scores_file.printSchema()&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6cx6j3yuv2t2zyceama.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo6cx6j3yuv2t2zyceama.png" alt="Output for our csv file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  RDD Operations in PySpark
&lt;/h3&gt;

&lt;p&gt;RDD supports two types of operations:&lt;/p&gt;

&lt;p&gt;Transformations - Operations that create a new RDD. They follow the principle of lazy evaluation (execution does not start until an action is triggered). For example:&lt;br&gt;
map, flatMap, filter, distinct, reduceByKey, mapPartitions, sortBy&lt;/p&gt;

&lt;p&gt;Actions - Operations applied to an RDD that trigger Spark to run the computation and return the result to the driver. For example:&lt;br&gt;
collect, collectAsMap, reduce, countByKey/countByValue, take, first&lt;/p&gt;

&lt;p&gt;The map() transformation takes in a function and applies it to each element in the RDD.&lt;br&gt;
The collect() action returns all the elements of the RDD to the driver.&lt;/p&gt;
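&lt;p&gt;Since the PySpark calls below are shown as screenshots, here is a plain-Python sketch (not PySpark itself) of what &lt;code&gt;map()&lt;/code&gt; followed by &lt;code&gt;collect()&lt;/code&gt; computes; the list &lt;code&gt;numbers&lt;/code&gt; is a hypothetical stand-in for an RDD:&lt;/p&gt;

```python
# Plain-Python sketch of RDD map()/collect() semantics.
# In PySpark this would be: sc.parallelize(numbers).map(lambda x: x * 2).collect()
numbers = [1, 2, 3, 4]

# map() applies the function to each element
doubled = list(map(lambda x: x * 2, numbers))

print(doubled)  # collect() would return this list to the driver: [2, 4, 6, 8]
```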

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmtu62ecm0v3qxcy9erc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmtu62ecm0v3qxcy9erc.png" alt="Map Transformation and Collect Action"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibrhuf1sjbdiwjambvzt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fibrhuf1sjbdiwjambvzt.png" alt="Map Transformation and Collect Action in Strings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The RDD transformation filter() returns a new RDD containing only the elements that satisfy a given function. It is useful for filtering large datasets based on a keyword. &lt;br&gt;
The count() action returns the number of elements in the RDD.&lt;/p&gt;
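&lt;p&gt;As a rough plain-Python sketch of the same semantics (the list &lt;code&gt;lines&lt;/code&gt; stands in for an RDD of strings, and the data is hypothetical):&lt;/p&gt;

```python
# Plain-Python sketch of filter()/count() semantics.
# In PySpark: rdd.filter(lambda line: "spark" in line).count()
lines = ["spark is fast", "hello world", "learning spark"]

# keep only the elements that satisfy the predicate, like rdd.filter(...)
spark_lines = [line for line in lines if "spark" in line]

print(len(spark_lines))  # count() returns the number of elements: 2
```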

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vadjds3e7tauj9jhnnr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vadjds3e7tauj9jhnnr.png" alt="Using Filter transformation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The RDD transformation reduceByKey() operates on (key, value) pairs and merges the values for each key.&lt;/p&gt;
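&lt;p&gt;A plain-Python sketch of what &lt;code&gt;reduceByKey()&lt;/code&gt; computes when the merge function is addition (the pairs are hypothetical sample data):&lt;/p&gt;

```python
from collections import defaultdict

# Plain-Python sketch of reduceByKey() semantics: merge values per key.
# In PySpark: sc.parallelize(pairs).reduceByKey(lambda a, b: a + b).collect()
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value  # the merge function here is addition

print(sorted(totals.items()))  # [('a', 4), ('b', 6)]
```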

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxi0elay682dumi3oicr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxi0elay682dumi3oicr.png" alt="Using reduceByKey()"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;join() returns an RDD that pairs, for each matching key, the values from both RDDs.&lt;/p&gt;
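&lt;p&gt;A plain-Python sketch of the &lt;code&gt;join()&lt;/code&gt; semantics on two lists of (key, value) pairs (the names and values are hypothetical):&lt;/p&gt;

```python
# Plain-Python sketch of join() semantics on (key, value) pairs.
# In PySpark: rdd1.join(rdd2).collect()
scores = [("alice", 80), ("bob", 70)]
grades = [("alice", "A"), ("bob", "B"), ("carol", "C")]

joined = [
    (k1, (v1, v2))
    for (k1, v1) in scores
    for (k2, v2) in grades
    if k1 == k2  # only matching keys appear in the result
]

print(joined)  # [('alice', (80, 'A')), ('bob', (70, 'B'))]
```

Note that "carol" is dropped because the key appears in only one of the two collections, just as in an inner join.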

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1aqwsaj82gi14wf2kvq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1aqwsaj82gi14wf2kvq.png" alt="Join RDDs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  DataFrame from RDD
&lt;/h3&gt;

&lt;p&gt;PySpark provides two methods to convert an RDD to a DataFrame:&lt;br&gt;
toDF(), createDataFrame(rdd, schema)&lt;/p&gt;

&lt;p&gt;DataFrames also have two operations: transformations and actions.&lt;/p&gt;

&lt;p&gt;DataFrame transformations include: select, filter, groupBy, orderBy, dropDuplicates, withColumnRenamed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;select() - subsets the columns in a DataFrame&lt;/li&gt;
&lt;li&gt;filter() - filters out rows based on a condition&lt;/li&gt;
&lt;li&gt;groupBy() - groups rows based on a column&lt;/li&gt;
&lt;li&gt;orderBy() - sorts the DataFrame based on one or more columns&lt;/li&gt;
&lt;li&gt;dropDuplicates() - removes duplicate rows from a DataFrame&lt;/li&gt;
&lt;li&gt;withColumnRenamed() - renames a column in the DataFrame&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;DataFrame actions include: head, show, count, describe, columns.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;describe() - computes the summary statistics of the numerical columns in a DataFrame&lt;/li&gt;
&lt;li&gt;printSchema() - prints the types of the columns in a DataFrame&lt;/li&gt;
&lt;li&gt;columns - returns all the column names in the DataFrame&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Inspecting Data in PySpark&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Print the first 10 observations
people_df.show(10)

# Count the number of rows
print("There are {} rows in the people_df DataFrame.".format(people_df.count()))

# Count the number of columns and their names
print("There are {} columns in the people_df DataFrame and their names are {}".format(len(people_df.columns), people_df.columns))



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;PySpark DataFrame sub-setting and cleaning&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Select name, sex and date of birth columns
people_df_sub = people_df.select('name', 'sex', 'date of birth')

# Print the first 10 observations from people_df_sub
people_df_sub.show(10)

# Remove duplicate entries from people_df_sub
people_df_sub_nodup = people_df_sub.dropDuplicates()

# Count the number of rows
print("There were {} rows before removing duplicates, and {} rows after removing duplicates".format(people_df_sub.count(), people_df_sub_nodup.count()))



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Filtering your DataFrame&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

# Filter people_df to select females
people_df_female = people_df.filter(people_df.sex == "female")

# Filter people_df to select males
people_df_male = people_df.filter(people_df.sex == "male")

# Count the number of rows
print("There are {} rows in the people_df_female DataFrame and {} rows in the people_df_male DataFrame".format(people_df_female.count(), people_df_male.count()))



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Stopping SparkContext
&lt;/h3&gt;

&lt;p&gt;To stop a SparkContext:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sc.stop()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It is crucial to understand Resilient Distributed Datasets (RDDs) and SQL, since they are used extensively in data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article covers the key areas of Apache PySpark you should understand as you learn to become a data engineer. You should now be able to initialize Spark, use User Defined Functions (UDFs), load data, and work with RDDs by applying actions and transformations. Soon I will write an article on a practical use case of Apache PySpark in a project.&lt;/p&gt;

&lt;p&gt;Feel free to drop your comments about the article.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>dataengineering</category>
      <category>sql</category>
    </item>
    <item>
      <title>Data Engineering 102: Introduction to Python for Data Engineering.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Wed, 31 Aug 2022 21:00:19 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/introduction-to-python-for-data-engineering-57i6</link>
      <guid>https://dev.to/kinyungu_denis/introduction-to-python-for-data-engineering-57i6</guid>
      <description>&lt;p&gt;Greetings to my dear readers, today we will be covering about Python for Data Engineering. If you read my article about Data Engineering 101, we understood that one of the key skills required for a data engineer is strong understanding of Python language. Read that article to gain a basic understanding about data engineering.&lt;/p&gt;

&lt;p&gt;Can one use other languages for data engineering? I would answer yes: Scala and Java, for example. Let's understand why we use Python for data engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A data engineer works with many different data formats, and Python is well suited to this. Its standard library supports easy handling of .csv files, one of the most common data file formats.&lt;/li&gt;
&lt;li&gt;Data engineering tools such as Apache Airflow and Apache NiFi use Directed Acyclic Graphs (DAGs), which are specified with Python code. Learning Python therefore helps data engineers use these tools efficiently.&lt;/li&gt;
&lt;li&gt;A data engineer must not only obtain data from different sources but also process it. One of the most popular data processing engines is Apache Spark, which offers a Python API, PySpark, to build scalable big data projects.&lt;/li&gt;
&lt;li&gt;A data engineer is often required to use APIs to retrieve data from databases. The data in such cases is usually stored in JSON (JavaScript Object Notation) format, and Python has a built-in json module to handle this type of data.&lt;/li&gt;
&lt;li&gt;Luigi, a Python package that helps us build complex data pipelines.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Python is relatively easy to learn and is open-source. An active community of developers strongly supports it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now that we have understood some of the reasons for choosing Python, how do we use it in data engineering?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data Acquisition and Ingestion: this involves obtaining data from databases, APIs and other sources. A data engineer uses Python to retrieve and ingest the data.&lt;/p&gt;

&lt;p&gt;Data Manipulation: this refers to how a data engineer transforms structured, unstructured and semi-structured data into meaningful information.&lt;/p&gt;

&lt;p&gt;Parallel Computing: this is necessary when memory and processing power are limited. A data engineer uses Python to split a job into sub-tasks and distribute them.&lt;/p&gt;

&lt;p&gt;Data Pipelines: ETL pipelines extract, transform and load data. Tools such as Snowflake and Apache Airflow integrate easily with Python.&lt;/p&gt;

&lt;p&gt;That's great, now we know how Python is used in data engineering. First, we need to be familiar with basic Python and understand it well in order to write code. I will use JupyterLab, a code editor that comes with Anaconda. I will explain basic Python with examples to ensure we understand the concepts well.&lt;/p&gt;

&lt;p&gt;For basic Python we will cover the following topics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Variables &lt;/li&gt;
&lt;li&gt;Strings &lt;/li&gt;
&lt;li&gt;Math Expressions&lt;/li&gt;
&lt;li&gt;Loops&lt;/li&gt;
&lt;li&gt;Tuples, List, Dictionary and Sets&lt;/li&gt;
&lt;li&gt;Functions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A variable is a container that stores a value, and a variable name is the label used to refer to that value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;variable_name = value
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--x04-Lwmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uahtcxnm8dygehdr4l33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--x04-Lwmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uahtcxnm8dygehdr4l33.png" alt="Variable name definition" width="717" height="169"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The image above tells us the rules that we should follow when defining variable names. Ensure you use concise and descriptive variable names such as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;officer_duty = False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a variable to be treated as a constant, name it in capital letters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;MAXIMUM_FILE_LIMIT = 1500
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Strings
&lt;/h2&gt;

&lt;p&gt;A string is a series of characters enclosed in single or double quotation marks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hW8ql9aH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1dktc5x3i2ksqifyo8ou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hW8ql9aH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1dktc5x3i2ksqifyo8ou.png" alt="Python Strings" width="732" height="246"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Python 3.6 introduced f-strings (formatted string literals), which let us embed the values of variables inside a string.&lt;/p&gt;
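&lt;p&gt;A minimal f-string example (the variable values are hypothetical):&lt;/p&gt;

```python
# f-strings embed variable values directly inside a string
name = "Denis"
language = "Python"
greeting = f"Hello {name}, welcome to {language}!"
print(greeting)  # Hello Denis, welcome to Python!
```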

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8nvDlitQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ib7g1hg52kqvd08jpgl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8nvDlitQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ib7g1hg52kqvd08jpgl2.png" alt="Python f-strings" width="815" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Mathematical Expressions
&lt;/h2&gt;

&lt;p&gt;Operators are used to perform various operations on values and variables. Python operators are classified into the following groups: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arithmetic operators&lt;/li&gt;
&lt;li&gt;Comparison operators&lt;/li&gt;
&lt;li&gt;Logical operators&lt;/li&gt;
&lt;li&gt;Bitwise operators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Arithmetic operators&lt;/strong&gt;&lt;br&gt;
These operators perform mathematical operations on numeric values. Python also has a math module for advanced numerical computations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ePtcDUUn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2a2bhr5wcl7mn30zfwt5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ePtcDUUn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2a2bhr5wcl7mn30zfwt5.png" alt="Arithmetic opeators" width="676" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These operations give the following results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IQMMnXQd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xrjsk1ukpk4ddj2pej8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IQMMnXQd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xrjsk1ukpk4ddj2pej8i.png" alt="Arithmetic Results" width="704" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we combine multiple arithmetic operations, the operations inside parentheses are evaluated first. &lt;/p&gt;
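&lt;p&gt;A quick sketch of how parentheses change the result of a combined expression:&lt;/p&gt;

```python
# Parentheses are evaluated first, then the usual operator precedence applies
result_without = 2 + 3 * 4   # multiplication happens first
result_with = (2 + 3) * 4    # parentheses happen first
print(result_without)  # 14
print(result_with)     # 20
```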

&lt;p&gt;&lt;strong&gt;Comparison Operators&lt;/strong&gt;&lt;br&gt;
These operators compare two values.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less than ( &amp;lt; )&lt;/li&gt;
&lt;li&gt;Less than or equal to (&amp;lt;=)&lt;/li&gt;
&lt;li&gt;Greater than (&amp;gt;)&lt;/li&gt;
&lt;li&gt;Greater than or equal to (&amp;gt;=)&lt;/li&gt;
&lt;li&gt;Equal to ( == )&lt;/li&gt;
&lt;li&gt;Not equal to ( != )&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They compare numbers or strings and return a boolean value (either True or False).&lt;/p&gt;
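&lt;p&gt;A short sketch of equality comparisons returning booleans (only &lt;code&gt;==&lt;/code&gt; and &lt;code&gt;!=&lt;/code&gt; are shown here; the other operators work the same way):&lt;/p&gt;

```python
# Comparison operators always return a boolean value
print(10 == 10)          # True
print("cat" != "dog")    # True
is_equal = (7 == 3 + 4)  # the comparison itself evaluates to a boolean
print(is_equal)          # True
```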

&lt;p&gt;&lt;strong&gt;Logical Operators&lt;/strong&gt;&lt;br&gt;
These operators check multiple conditions at the same time.&lt;br&gt;
We have the &lt;code&gt;and&lt;/code&gt;, &lt;code&gt;or&lt;/code&gt; and &lt;code&gt;not&lt;/code&gt; operators. &lt;br&gt;
&lt;code&gt;and&lt;/code&gt; - returns True only when both conditions are True; otherwise it returns False. &lt;br&gt;
&lt;code&gt;or&lt;/code&gt; - returns True when at least one condition is True, and False when both conditions are False.&lt;br&gt;
&lt;code&gt;not&lt;/code&gt; - reverses the given condition.&lt;/p&gt;
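&lt;p&gt;The three logical operators can be sketched as follows (the variables are hypothetical):&lt;/p&gt;

```python
age = 25
has_id = True

print(age == 25 and has_id)  # True: both conditions hold
print(age == 30 or has_id)   # True: at least one condition holds
print(not has_id)            # False: not reverses the condition
```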

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--no4x2Ny6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4lmiuej9nubh9f1aquh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--no4x2Ny6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n4lmiuej9nubh9f1aquh.png" alt="Logical operator in Python" width="715" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bitwise Operators&lt;/strong&gt;&lt;br&gt;
They operate on the individual bits of binary numbers.&lt;/p&gt;
&lt;h2&gt;
  
  
  Loops
&lt;/h2&gt;

&lt;p&gt;Python has two loops: the while loop and the for loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;while loop&lt;/strong&gt;&lt;br&gt;
You will run a code block as long as the condition specified is True.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while condition:
   body
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The condition is an expression that evaluates to True or False (a boolean value).&lt;br&gt;
while checks the condition at the beginning of each iteration and executes the body as long as the condition is True.&lt;br&gt;
Inside the body, you need something that eventually ends the loop, to avoid an infinite loop.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;day_of_week = 0
while True:
   print(day_of_week)
   day_of_week += 1

   if day_of_week == 5:
      break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the block of code above, &lt;code&gt;day_of_week&lt;/code&gt; is incremented by one on each iteration. The &lt;code&gt;if&lt;/code&gt; statement checks whether &lt;code&gt;day_of_week == 5&lt;/code&gt;; the loop runs until the value five is reached, at which point the &lt;code&gt;break&lt;/code&gt; statement exits the loop. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h_3-WWtt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kthzjttvhock71ffkfim.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h_3-WWtt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kthzjttvhock71ffkfim.png" alt="Python while loop" width="643" height="283"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;for loop&lt;/strong&gt;&lt;br&gt;
We mainly use a for loop to execute a code block a fixed number of times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for index in range(n):
   statement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the syntax of a for loop. The &lt;code&gt;index&lt;/code&gt; is called the loop counter, and &lt;code&gt;n&lt;/code&gt; is the number of times the loop executes the statement. &lt;code&gt;range()&lt;/code&gt; is a built-in function; &lt;code&gt;range(n)&lt;/code&gt; generates the sequence of numbers from &lt;code&gt;0&lt;/code&gt; to &lt;code&gt;n&lt;/code&gt;, with the last value &lt;code&gt;n&lt;/code&gt; excluded.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum = 0
for number in range(101):
   sum += number

print(sum)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SLAwl8Q7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3nsyfkdwrve6pw85kqgp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SLAwl8Q7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3nsyfkdwrve6pw85kqgp.png" alt="For loops in Python" width="779" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you see in &lt;code&gt;range(0, 10, 2)&lt;/code&gt;, the full form is &lt;code&gt;range(start, stop, step)&lt;/code&gt;. You can change the values and see how the code behaves.&lt;/p&gt;
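&lt;p&gt;A quick sketch of the three-argument form of &lt;code&gt;range()&lt;/code&gt;:&lt;/p&gt;

```python
# range(start, stop, step): from start up to (but excluding) stop
even_numbers = list(range(0, 10, 2))
print(even_numbers)  # [0, 2, 4, 6, 8]
```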

&lt;h2&gt;
  
  
  Functions
&lt;/h2&gt;

&lt;p&gt;A function is a block of code that performs a certain task or returns a value. Functions help to divide a program into manageable parts to make it easier to read, test and maintain the program.&lt;br&gt;
This is how we write a function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def greet(name):
   return f'{name} how are you doing?'
greetings = greet('Richard')
print(greetings)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A parameter is a piece of information that a function needs, specified in the function definition. In our example &lt;code&gt;name&lt;/code&gt; is a parameter.&lt;br&gt;
An argument is the piece of data you pass to the function when you call it; &lt;code&gt;Richard&lt;/code&gt; is an argument.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eXCl2q3K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d82gcyxo7jbj6e6jl2s7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eXCl2q3K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/d82gcyxo7jbj6e6jl2s7.png" alt="Functions in Python" width="786" height="301"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We also have recursive functions: a recursive function is a function that calls itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wUgCaU1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a14fov1dkbvm6v9xm36h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wUgCaU1M--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a14fov1dkbvm6v9xm36h.png" alt="Recursive functions in Python" width="866" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lambda Function&lt;/strong&gt;&lt;br&gt;
When you have a simple function with a single expression, defining it with the &lt;code&gt;def&lt;/code&gt; keyword can be unnecessary. Lambda expressions allow you to define anonymous functions, typically used once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;map() function&lt;/strong&gt;&lt;br&gt;
This function takes two arguments: the function to apply and the iterable to apply it to.&lt;br&gt;
It provides a quick and clean way to apply a function to every element without writing a for loop.&lt;/p&gt;
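&lt;p&gt;Combining &lt;code&gt;map()&lt;/code&gt; with a lambda can be sketched as follows (the list is hypothetical):&lt;/p&gt;

```python
numbers = [1, 2, 3, 4]

# map() applies the lambda to every element without a for loop
squares = list(map(lambda x: x ** 2, numbers))
print(squares)  # [1, 4, 9, 16]
```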

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E9bETm83--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lj90bfo6adwzvc9blb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E9bETm83--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6lj90bfo6adwzvc9blb2.png" alt="Implement map and lambda function in Python" width="810" height="215"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List&lt;/strong&gt;&lt;br&gt;
A list is an ordered collection of items, enclosed in square brackets &lt;code&gt;[]&lt;/code&gt;.&lt;br&gt;
Because a list is mutable, you can add, remove, modify and sort its elements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;empty_list = []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
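&lt;p&gt;The list operations mentioned above (add, remove, modify, sort) can be sketched with hypothetical data:&lt;/p&gt;

```python
colors = ['red', 'green', 'blue']

colors.append('yellow')  # add an element
colors.remove('green')   # remove an element
colors[0] = 'cyan'       # modify an element in place
colors.sort()            # sort the list in place

print(colors)  # ['blue', 'cyan', 'yellow']
```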



&lt;p&gt;&lt;strong&gt;Tuples&lt;/strong&gt;&lt;br&gt;
A tuple is an ordered collection of items, enclosed in parentheses &lt;code&gt;()&lt;/code&gt;. It is immutable: you cannot change its elements after assignment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;selected_colors = ('cyan', 'gray', 'white')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;List comprehension&lt;/em&gt;&lt;br&gt;
A list comprehension transforms the elements of a list and returns a new list.&lt;br&gt;
The syntax for a list comprehension is as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;list_comprehension = [expression for item in iterable if condition == True]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let us implement this list comprehension and understand how it works:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dXLzulsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/khdc84ifd8gkv8m1pug6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dXLzulsx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/khdc84ifd8gkv8m1pug6.png" alt="List comprehension in Python" width="807" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;unpacking and packing&lt;/em&gt;&lt;br&gt;
This can be done for both tuples and lists.&lt;br&gt;
When you create a tuple you assign values to it, that is referred to as packing a tuple.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rainbow_colors = ('Red', 'Orange', 'Yellow', 'Green', 'Blue', 
   'Indigo', 'Violet')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Extracting the values from a tuple back into variables is known as unpacking, so we will unpack our tuple.&lt;br&gt;
The number of variables used must match the number of values inside the tuple. For example, our tuple has seven values, so it can be unpacked into seven variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(first, second, third, forth, fifth, sixth, seventh) = 
   rainbow_colors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, this can be simplified with an asterisk &lt;code&gt;*&lt;/code&gt;: added before a variable name, it collects all the remaining elements and unpacks them into a list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(first, second, *other_colors) = rainbow_colors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The variable name &lt;code&gt;other_colors&lt;/code&gt; will contain all the remaining colors from the initial tuple &lt;code&gt;rainbow_colors&lt;/code&gt;.&lt;/p&gt;
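&lt;p&gt;Putting this together, a short sketch (reusing the &lt;code&gt;rainbow_colors&lt;/code&gt; tuple from above) shows what ends up in each variable:&lt;/p&gt;

```python
rainbow_colors = ('Red', 'Orange', 'Yellow', 'Green', 'Blue',
                  'Indigo', 'Violet')

# The starred name gathers every remaining element into a list
first, second, *other_colors = rainbow_colors

print(first)         # Red
print(second)        # Orange
print(other_colors)  # ['Yellow', 'Green', 'Blue', 'Indigo', 'Violet']
```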

&lt;p&gt;&lt;em&gt;unpacking lists&lt;/em&gt;&lt;br&gt;
The unpacking that was done on tuples can also be done on lists.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rainbow_colors = ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 
   'Indigo', 'Violet']

first, second, *other_colors = rainbow_colors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have learnt that using &lt;code&gt;*&lt;/code&gt; on a variable name unpacks the remaining elements from the initial list into a new list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xzn6bobO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5nqfd6bt4dkj1f98m0g9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xzn6bobO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/5nqfd6bt4dkj1f98m0g9.png" alt="Unpacking tuples" width="799" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--A_Yz_yZ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wzyd5xkpsk8f1aiqbikm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--A_Yz_yZ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wzyd5xkpsk8f1aiqbikm.png" alt="Unpacking lists" width="790" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the above images, we see that using &lt;code&gt;*&lt;/code&gt; on a variable name returns a list.&lt;br&gt;
That's cool; you now understand unpacking in tuples and lists. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dictionary&lt;/strong&gt;&lt;br&gt;
It is a collection of key-value pairs that stores data. Python uses curly braces &lt;code&gt;{}&lt;/code&gt; to define a dictionary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;empty_dictionary = {}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;customer = {
   'first_name' : 'Fred',
   'last_name' : 'Kagia',
   'age' : 39,
   'location' : 'Nairobi',
   'active' : True
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To iterate over all key-value pairs in a dictionary, you can use a for loop with two variables, &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt;. However, these names are arbitrary; we can use any other variable names in the for loop instead of the &lt;code&gt;key&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt; that we have decided to use.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for key, value in customer.items():
   print (f"{key} : {value}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f4dmCJ_i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tajk8336k1pvltkomur9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f4dmCJ_i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tajk8336k1pvltkomur9.png" alt="Python Dictionaries" width="599" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sets&lt;/strong&gt;&lt;br&gt;
A set is an unordered collection of elements in which each element is unique. We use curly braces &lt;code&gt;{}&lt;/code&gt; to enclose a set.&lt;br&gt;
To define an empty set we use this syntax:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;empty_set = set()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;frozen sets&lt;/em&gt;&lt;br&gt;
To make a set immutable, use &lt;code&gt;frozenset()&lt;/code&gt;; this ensures that the elements in the set cannot be modified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
capital_cities_frozen = frozenset(capital_cities)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ObiRzULs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5asp4bfngagllkkmbup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ObiRzULs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v5asp4bfngagllkkmbup.png" alt="Frozen sets cannot be modified" width="808" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DKhxSLeh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/miv7h8muz8ti05pa5kb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DKhxSLeh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/miv7h8muz8ti05pa5kb1.png" alt="Frozen set" width="673" height="185"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To access the index of elements in a set as you iterate over them, you can use the built-in function &lt;code&gt;enumerate()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}

for index, city in enumerate(capital_cities, 1):
   print(f"{index}. Capital city is {city}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---IxfKh0e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mzfq8023fi9kafomqv9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---IxfKh0e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mzfq8023fi9kafomqv9p.png" alt="using enumerate in sets" width="694" height="166"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Set Theory&lt;/em&gt;&lt;br&gt;
This refers to methods of the set datatype that operate on collections of objects.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;set.intersection() - returns the elements present in both sets.&lt;/li&gt;
&lt;li&gt;set.difference() - returns the elements in one set but not in the other set.&lt;/li&gt;
&lt;li&gt;set.symmetric_difference() - returns the elements in exactly one of the sets.&lt;/li&gt;
&lt;li&gt;set.union() - returns the elements in either set.&lt;/li&gt;
&lt;/ul&gt;
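&lt;p&gt;A small sketch of these four methods, using the &lt;code&gt;capital_cities&lt;/code&gt; set from above and a second, made-up set of coastal cities:&lt;/p&gt;

```python
capital_cities = {'Nairobi', 'Lusaka', 'Cairo', 'Lagos'}
coastal_cities = {'Lagos', 'Cairo', 'Mombasa'}

print(capital_cities.intersection(coastal_cities))          # in both sets
print(capital_cities.difference(coastal_cities))            # only in the first set
print(capital_cities.symmetric_difference(coastal_cities))  # in exactly one set
print(capital_cities.union(coastal_cities))                 # in either set
```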

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bEmyBGDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v446p1zio4p16zmy2uve.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bEmyBGDn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v446p1zio4p16zmy2uve.png" alt="Set theory in Python" width="829" height="321"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Working with Data
&lt;/h2&gt;

&lt;p&gt;1. JSON&lt;br&gt;
2. datetime&lt;br&gt;
3. Pandas&lt;br&gt;
4. NumPy&lt;/p&gt;
&lt;h2&gt;
  
  
  JSON
&lt;/h2&gt;

&lt;p&gt;This is a syntax for storing and exchanging data. Python has a module &lt;code&gt;json&lt;/code&gt; that is used to work with JSON data.&lt;/p&gt;

&lt;p&gt;To convert JSON to Python, you parse the JSON string using the &lt;code&gt;json.loads()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;To convert Python to JSON, you convert the object to a JSON string using the &lt;code&gt;json.dumps()&lt;/code&gt; method.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--or1kwJZ7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/onjdqjp522zr05gdhspl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--or1kwJZ7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/onjdqjp522zr05gdhspl.png" alt="Basic usage of JSON" width="744" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To analyze and debug JSON data, we may need to print it in a more readable format. This can be done by passing the additional parameters &lt;code&gt;indent&lt;/code&gt; and &lt;code&gt;sort_keys&lt;/code&gt; to the &lt;code&gt;json.dumps()&lt;/code&gt; and &lt;code&gt;json.dump()&lt;/code&gt; methods.&lt;/p&gt;
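&lt;p&gt;A minimal sketch of both conversions, including the readable form (the &lt;code&gt;customer_json&lt;/code&gt; string here is made up for illustration):&lt;/p&gt;

```python
import json

# A hypothetical JSON string for illustration
customer_json = '{"first_name": "Fred", "age": 39}'

# JSON string to Python dictionary
customer = json.loads(customer_json)
print(customer['first_name'])  # Fred

# Python dictionary back to a JSON string, pretty-printed
print(json.dumps(customer, indent=4, sort_keys=True))
```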

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0UkzgzRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aw2dbk76tt6b65d4vk41.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0UkzgzRq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aw2dbk76tt6b65d4vk41.png" alt="JSON in a readable form" width="816" height="320"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  datetime
&lt;/h2&gt;

&lt;p&gt;We use a module called &lt;code&gt;datetime&lt;/code&gt; to work with dates as date objects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datetime

current_time = datetime.datetime.now()
print(current_time)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A datetime object contains year, month, day, hour, minute, second and microsecond. You can access these as attributes of the date object.&lt;/p&gt;
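&lt;p&gt;For example, reading the individual components of the current time:&lt;/p&gt;

```python
import datetime

current_time = datetime.datetime.now()

# year, month, day, hour, minute, second and microsecond are attributes
print(current_time.year)
print(current_time.month)
print(current_time.day)
```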

&lt;p&gt;Create a date object&lt;br&gt;
You may use the &lt;code&gt;datetime()&lt;/code&gt; class of the datetime module. This class requires three parameters to create a date: year, month and day.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datetime

planned_date = datetime.datetime(2022, 9, 3)
print(planned_date)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  NumPy
&lt;/h2&gt;

&lt;p&gt;NumPy (numerical Python) is a Python library for working with arrays. A NumPy array contains elements of the same type. This homogeneity allows NumPy arrays to be faster and more efficient than Python lists.&lt;/p&gt;

&lt;p&gt;Create a NumPy array&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

natural_numbers = np.array([1, 2, 3, 4, 5])
print(natural_numbers)
print(type(natural_numbers))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NumPy has a powerful technique called broadcasting: the ability to vectorize operations so that they are performed on all elements at once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;natural_numbers = np.array([1, 2, 3, 4, 5])
natural_numbers_squared = natural_numbers ** 2
print(natural_numbers_squared)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BM0xXtU4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vx9ss6aohyz2i0aa8wy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BM0xXtU4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8vx9ss6aohyz2i0aa8wy.png" alt="Numpy basics" width="690" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can also compare performing calculations with NumPy against using a Python list. We will see that NumPy performs better than Python lists.&lt;/p&gt;
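&lt;p&gt;One way to sketch such a comparison is to time the same element-wise operation done with a list comprehension and with NumPy broadcasting (the exact timings will vary by machine):&lt;/p&gt;

```python
import timeit

import numpy as np

numbers = list(range(10_000))
numbers_array = np.array(numbers)

# Square every element, repeated 100 times for each approach
list_time = timeit.timeit(lambda: [n ** 2 for n in numbers], number=100)
numpy_time = timeit.timeit(lambda: numbers_array ** 2, number=100)

print(f"list comprehension: {list_time:.4f}s")
print(f"numpy broadcasting: {numpy_time:.4f}s")
```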

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nkl5qf-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pkpl3iqtgtef3sirt1yu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nkl5qf-Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pkpl3iqtgtef3sirt1yu.png" alt="Comparing using List Comprehension and NumPy" width="880" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Pandas
&lt;/h2&gt;

&lt;p&gt;pandas is a library used for working with datasets. It has functions for analyzing, exploring, cleaning and manipulating data. Its main data structure is the DataFrame: tabular data with labelled rows and columns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BF82Lqf4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i0f1q8xqe8sdmcp36yx2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BF82Lqf4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i0f1q8xqe8sdmcp36yx2.png" alt="create a pandas DataFrame" width="808" height="353"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qj3hA5Fl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gfpa371yx5l0ca3skkuq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qj3hA5Fl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gfpa371yx5l0ca3skkuq.png" alt="Reading data from csv" width="841" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;pandas has a method &lt;code&gt;.apply()&lt;/code&gt; that takes a function and applies it to a DataFrame. You must specify an axis: &lt;code&gt;0&lt;/code&gt; to apply the function to each column and &lt;code&gt;1&lt;/code&gt; to apply it to each row. This method can be used with anonymous functions (remember lambda functions).&lt;/p&gt;
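&lt;p&gt;A small sketch of &lt;code&gt;.apply()&lt;/code&gt; on a made-up DataFrame of film rental rates, applying a lambda to each row with &lt;code&gt;axis=1&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data for illustration
films = pd.DataFrame({
    'title': ['Alpha', 'Beta'],
    'rental_rate': [2.99, 4.99],
})

# axis=1 hands each row to the lambda
films['discounted_rate'] = films.apply(
    lambda row: row['rental_rate'] * 0.9, axis=1)

print(films)
```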

&lt;p&gt;We have covered the basics of Python that will help us understand and implement data engineering. We will be able to work with tools such as PySpark and Airflow.&lt;/p&gt;

&lt;p&gt;For example, let's look at some sample code for a Directed Acyclic Graph (DAG).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# this is DAG definition file

from airflow.models import DAG
from airflow.operators.python_operator
import python_operator

dag = DAG(dag_id = "etl_pipeline"
   schedule_interval = "0 0 * * *")

etl_task = Python_Operator(task_id = "etl_task"
   python_callable = etl, dag = dag)

etc_task.set_upstream(wait_for_this_task)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#defines an ETL function

def etl():
   film_dataframe = extract_film_to_pandas()
   film_dataframe = transform_rental_rate(film_dataframe)
   load_loadframe_to_film(film_dataframe)

#define ETL task using PythonOperator


etl_task = PythonOperator(task_id = 'etl_film',
   python_callable = etl, dag =dag)


#set the upstreamto wait_for_table and sample run etl()


etl_task.set_upstream(wait_for_table)
etl()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above shows a DAG (Directed Acyclic Graph) definition file with an ETL task, which is added to the DAG. The task is set to wait for an upstream task, wait_for_table, defined elsewhere in the DAG. It is just sample code; soon we will write our own DAGs and ETL pipelines and implement them.&lt;/p&gt;

&lt;p&gt;Learning Python is critical for our data engineering career, so ensure that you understand it well. We will continue together on this path of data engineering. Feel free to give your feedback about this article.&lt;/p&gt;

</description>
      <category>python</category>
      <category>dataengineering</category>
      <category>beginners</category>
    </item>
    <item>
      <title>To install Apache Spark and run Pyspark in Ubuntu 22.04</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Thu, 25 Aug 2022 17:27:38 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/to-install-apache-spark-and-run-pyspark-in-ubuntu-2204-4i79</link>
      <guid>https://dev.to/kinyungu_denis/to-install-apache-spark-and-run-pyspark-in-ubuntu-2204-4i79</guid>
      <description>&lt;p&gt;Hello my esteemed readers, today we will cover installing Apache Spark in our Ubuntu 22.04 and also to ensure that also our Pyspark is running without any errors. &lt;br&gt;
From our previous article about data engineering, we talked about a data engineer is responsible for processing large amount of data at scale, Apache Spark is one good tools for a data engineer to process data of any size. I will explain the steps to use using examples and screenshots from my machine so that you don't run into errors.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is Apache Spark and what is it used for?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Apache Spark is a unified analytics engine for large-scale data processing on a single-node machine or across multiple clusters. It is open source, meaning you don't have to pay to download and use it. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.&lt;br&gt;
It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It supports code reuse across multiple workloads such as batch processing, real-time analytics, graph processing, interactive queries and machine learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How does Apache Spark work?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Spark does its processing in-memory, reducing the number of steps in a job and reusing data across multiple parallel operations. With Spark, only one step is needed: data is read into memory, operations are performed, and the results are written back, resulting in much faster execution. Spark also reuses data by using an in-memory cache to greatly speed up machine learning algorithms that repeatedly call a function on the same dataset. Data re-use is accomplished through DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), a collection of objects that is cached in memory and reused in multiple Spark operations. This dramatically lowers latency, as Apache Spark runs 100 times faster in-memory and 10 times faster on disk than Hadoop MapReduce.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Apache Spark Workloads&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Spark Core&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It is responsible for distributing and monitoring jobs, memory management, fault recovery, scheduling, and interacting with storage systems. Spark Core is exposed through application programming interfaces (APIs) built for Java, Scala, Python and R.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spark SQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark SQL performs interactive queries for structured and semi-structured data. It is a distributed query engine that provides low-latency, interactive queries up to 100x faster than MapReduce. It includes a cost-based optimizer, columnar storage, and code generation for fast queries, while scaling to thousands of nodes. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Spark Streaming&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark Streaming is a real-time solution that leverages Spark Core’s fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Machine Learning Library (MLib)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark includes MLlib, a library of algorithms to do machine learning on data at scale. Machine Learning models can be trained by data scientists with R or Python on any Hadoop data source, saved using MLlib, and imported into a Java or Scala-based pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphX&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Spark GraphX is a distributed graph processing framework built on top of Spark. GraphX provides ETL, exploratory analysis, and iterative graph computation to enable users to interactively build, and transform a graph data structure at scale. It also provides an optimized runtime for this abstraction.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Key Benefits of Apache Spark:&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed:&lt;/strong&gt;  Spark helps run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory. Through in-memory caching and optimized query execution, Spark can run fast analytic queries against data of any size.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Support Multiple Languages:&lt;/strong&gt;  Apache Spark natively supports Java, Scala, R, and Python, giving you a variety of languages for building your applications. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multiple Workloads:&lt;/strong&gt;  Apache Spark comes with the ability to run multiple workloads, including interactive queries, real-time analytics, machine learning, and graph processing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now we have a basic understanding about Apache Spark, we can proceed to our installation in our machines.&lt;/p&gt;

&lt;p&gt;To download Apache Spark in Linux we need to have java installed in our machine.&lt;br&gt;
To check if you have java in your machine, use this command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

java --version


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example in my machine, java is installed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fymbw1cm2dbdmayu1hn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fymbw1cm2dbdmayu1hn.png" alt="To show java is installed"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In case you don't have java installed in your system, use the following commands to install it:&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Install Java&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First, update the system packages:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo apt update


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Install Java:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo apt install default-jdk  -y


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Verify the Java installation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

java --version


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Your Java version should be version 8 or later, which meets our requirement.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Install Apache Spark&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;First install the required packages, using the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo apt install curl mlocate git scala -y


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Download Apache Spark. Find the latest release on the &lt;a href="https://spark.apache.org/downloads.html" rel="noopener noreferrer"&gt;download page&lt;/a&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

wget https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Replace the link with the version you are downloading from the Apache download page, where I have&lt;br&gt;
entered my Spark file link.&lt;/p&gt;

&lt;p&gt;Extract the downloaded file using this command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

tar xvf spark-3.3.0-bin-hadoop3.tgz


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Ensure you specify the correct file name you have downloaded, since it could be another version. The above command extracts the file into the directory you downloaded it in. Make sure you note the path of your Spark directory.&lt;/p&gt;

&lt;p&gt;For example, my spark file directory appears as shown in the image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88lfez9qq1reomijyyct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88lfez9qq1reomijyyct.png" alt="My spark installation directory"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have completed the above steps you are done downloading Apache Spark, but wait: we still have to configure the Spark environment. This is one of the sections that gives people errors and leaves them wondering what they aren't doing right. However, I will guide you to ensure that you successfully configure your environment, are able to use Apache Spark on your machine, and that PySpark runs as expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;How to Configure Spark environment&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For this, you have to set some environment variables in the bashrc configuration file.&lt;/p&gt;

&lt;p&gt;Open this file with your editor; in my case I will use the nano editor. The following command opens the file in nano:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sudo nano ~/.bashrc


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb4n07fyuot6i49qt0cp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqb4n07fyuot6i49qt0cp.png" alt="Using nano to open bashrc file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This file contains sensitive configuration, so don't delete any line in it. Go to the bottom of the file and add the following lines to the bashrc file to ensure that we can use Spark successfully.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

export SPARK_HOME=/home/exporter/spark-3.3.0-bin-hadoop3

export PATH=$PATH:$SPARK_HOME/bin

export SPARK_LOCAL_IP=localhost

export PYSPARK_PYTHON=/usr/bin/python3

export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Remember when I asked you to note your Spark installation directory: that installation directory should be assigned to &lt;code&gt;SPARK_HOME&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

export SPARK_HOME=&amp;lt;your Spark installation directory&amp;gt; 


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For example you can see mine is set to:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

export SPARK_HOME=/home/exporter/spark-3.3.0-bin-hadoop3


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Then write the other lines as they are without changing anything and save the bashrc file. The image below shows how the end of my bashrc file appears after adding the environment variables. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanlblqbwuk4519kf7wp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fanlblqbwuk4519kf7wp2.png" alt="Variables at end of my bashrc file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After saving the bashrc file and exiting the nano editor, you need to load the new variables into your current shell. Use the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

source ~/.bashrc


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The image below shows how to write the command (I wrote my command twice, but you only need to write it once).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gr6wnrl86o5wm81kf4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7gr6wnrl86o5wm81kf4f.png" alt="using source to save bashrc file"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run Spark shell
&lt;/h2&gt;

&lt;p&gt;Now that you are done configuring the Spark environment, you need to check that Spark is working as expected. Use the command below to run the Spark shell:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

spark-shell


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the variables are configured successfully, you will see output such as this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7ooik31vw3d8x7wy4pz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7ooik31vw3d8x7wy4pz.png" alt="Spark-shell comand"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run Pyspark
&lt;/h2&gt;

&lt;p&gt;Use the following command:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

pyspark


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the variables were configured successfully, you should see output like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7wlrj8kautwopoffxft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7wlrj8kautwopoffxft.png" alt="Shows pyspark in the shell"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article, we have provided a guide to installing Apache Spark and its necessary dependencies on Ubuntu 22.04, and described the configuration of the Spark environment in detail.&lt;/p&gt;

&lt;p&gt;This article should make it easy for you to understand Apache Spark and install it. So, esteemed readers, feel free to give feedback and comments.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>ubuntu</category>
    </item>
    <item>
      <title>Data Engineering 101: Introduction to Data Engineering.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Fri, 19 Aug 2022 16:14:00 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/data-engineering-101-introduction-to-data-engineering-4md3</link>
      <guid>https://dev.to/kinyungu_denis/data-engineering-101-introduction-to-data-engineering-4md3</guid>
<description>&lt;p&gt;Today we will look at an introduction to data engineering: what the field entails, what tools a data engineer uses, and what a data engineer should learn.&lt;br&gt;
This article will help developers who want to begin a career in data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Data Engineering?
&lt;/h2&gt;

&lt;p&gt;Data Engineering is the practice of designing and building systems for collecting, storing, processing, and analyzing large amounts of data at scale. It is a field that involves developing and maintaining large-scale data processing systems that make data available and usable for analysis and business-driven decisions.&lt;/p&gt;

&lt;p&gt;The image below illustrates the processes involved in data engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhiqklngt3svyu2dm6th.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frhiqklngt3svyu2dm6th.png" alt="The process that involved in data engineering."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Who is a data engineer and what does a data engineer do?
&lt;/h2&gt;

&lt;p&gt;A data engineer is a person responsible for building data pipelines from different sources and preparing data for analytical and operational uses.&lt;br&gt;
Data engineers lay the foundations for the acquisition, storage, transformation, and management of data in an organization.&lt;br&gt;
They design, build, and maintain data warehouses. A data warehouse is a place where raw data is transformed and stored in queryable form.&lt;/p&gt;

&lt;h2&gt;
  
  
Let us understand the Data Engineering tools that a Data Engineer uses.
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data engineering tools are specialized applications that make building data pipelines and designing algorithms easier and more efficient. These tools are responsible for making the day-to-day tasks of a data engineer easier in various ways.&lt;/li&gt;
&lt;li&gt;Data ingestion systems such as Kafka, for example, offer a seamless and quick data ingestion process while also allowing data engineers to locate appropriate data sources, analyze them, and ingest data for further processing.&lt;/li&gt;
&lt;li&gt;Data engineering tools support the process of transforming data. This is important since big data can be structured, unstructured, or in some other format. Therefore, data engineers need data transformation tools to transform and process big data into the desired format.&lt;/li&gt;
&lt;li&gt;Database tools/frameworks like SQL, NoSQL, etc., allow data engineers to acquire, analyze, process, and manage huge volumes of data simply and efficiently.&lt;/li&gt;
&lt;li&gt;Visualization tools like Tableau and Power BI allow data engineers to generate valuable insights and create interactive dashboards.&lt;/li&gt;
&lt;/ul&gt;
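&lt;p&gt;To make the database side of this list concrete, here is a small sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in for a production database; the table and column names are invented for the example.&lt;/p&gt;

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 120.0), (2, "bob", 75.5), (3, "alice", 30.0)],
)

# A simple analytical query: total spend per customer.
rows = conn.execute(
    "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)  # [('alice', 150.0), ('bob', 75.5)]
```

&lt;p&gt;The same acquire-analyze-manage workflow applies when the engine is MySQL, PostgreSQL, or a NoSQL store; only the driver and dialect change.&lt;/p&gt;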

&lt;h2&gt;
  
  
  Commonly used Cloud-Based Data Engineering Tools:
&lt;/h2&gt;

&lt;p&gt;Data Engineering Tools in AWS&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Redshift&lt;/li&gt;
&lt;li&gt; Amazon Athena&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Data Engineering Tools in Azure&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Azure Data Factory&lt;/li&gt;
&lt;li&gt; Azure Databricks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly, the &lt;strong&gt;Data Engineering ecosystem&lt;/strong&gt; consists of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Data — different data types, formats, and sources of data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data stores and repositories — Relational and non-relational databases, data warehouses, data marts, data lakes, and big data stores that store and process the data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data Pipelines — Collect/gather data from multiple sources, then clean, process, and transform it into data that can be used for analysis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Analytics and Data driven Decision Making — Make the well processed data available for further business analytics, visualization and data driven decision making.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;ETL Data Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A data pipeline is essentially a collection of tools and methods for transferring data from one system to another for storage and processing. It collects data from several sources and stores it in a database.&lt;/p&gt;

&lt;p&gt;ETL (Extract, Transform and Load) involves extraction, transformation, and loading tasks across different environments.&lt;/p&gt;

&lt;p&gt;These three conceptual steps are how most data pipelines are designed and structured. They serve as a blueprint for how raw data is transformed to analysis-ready data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqhqgmbuzxlru66pa8o0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqhqgmbuzxlru66pa8o0.png" alt="ETL Data pipeline"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let us explain these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract:&lt;/strong&gt; the step where sensors wait for upstream data sources to land, then the data is transported from its source locations for further transformation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transform:&lt;/strong&gt; where we apply business logic and perform actions such as filtering, grouping, and aggregation to translate raw data into analysis-ready datasets. This step requires a great deal of business understanding and domain knowledge.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Load:&lt;/strong&gt; finally, we load the processed data and transport it to its final destination. Often, this dataset can be consumed directly by end users.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
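&lt;p&gt;The three steps above can be sketched in plain Python. This is a toy pipeline over an in-memory source with invented field names, not a production implementation:&lt;/p&gt;

```python
# A toy ETL run; the "source" and its field names are made up for illustration.
source_records = [
    {"name": "Alice", "country": "KE", "purchase": "120"},
    {"name": "Bob", "country": "UG", "purchase": "bad-value"},
    {"name": "Carol", "country": "KE", "purchase": "30"},
]

def extract(records):
    """Extract: pull raw records from the upstream source."""
    return list(records)

def transform(records):
    """Transform: apply business logic — drop malformed rows, cast types,
    and aggregate purchases by country."""
    totals = {}
    for rec in records:
        try:
            amount = float(rec["purchase"])
        except ValueError:
            continue  # filter out rows that fail validation
        totals[rec["country"]] = totals.get(rec["country"], 0.0) + amount
    return totals

def load(totals, destination):
    """Load: write the analysis-ready dataset to its final destination."""
    destination.update(totals)

warehouse = {}
load(transform(extract(source_records)), warehouse)
print(warehouse)  # {'KE': 150.0}
```

&lt;p&gt;Real pipelines swap the in-memory source and destination for databases, files, or message queues, but the extract-transform-load shape stays the same.&lt;/p&gt;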

&lt;h2&gt;
  
  
  Data Warehousing
&lt;/h2&gt;

&lt;p&gt;A data warehouse is a database that stores all of your organization's historical data and allows you to run analytical queries against it. From a technical point of view, it is a database optimized for reading, aggregating, and querying massive amounts of data. Modern data warehouses can integrate structured and unstructured data.&lt;/p&gt;

&lt;p&gt;Four essential components are combined to create a data warehouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data warehouse storage.&lt;/li&gt;
&lt;li&gt;Metadata.&lt;/li&gt;
&lt;li&gt;Access tools.&lt;/li&gt;
&lt;li&gt;Management tools.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
Which skills are required to become a data engineer?
&lt;/h2&gt;

&lt;p&gt;Data engineers require a significant set of technical skills to carry out their tasks, such as:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Database management:&lt;/strong&gt; Data engineers spend a considerable part of their daily work operating databases, whether to collect, store, or transfer data. One should have a basic understanding of relational databases such as MySQL and PostgreSQL and non-relational databases such as MongoDB and DynamoDB, and be able to work efficiently with them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Programming languages:&lt;/strong&gt; Data engineers use programming languages for a wide range of tasks. Many programming languages can be used in data engineering, but Python is certainly one of the best options: it is well suited to executing ETL jobs and writing data pipelines. Another reason to use Python is its great integration with tools and frameworks that are critical in data engineering, such as Apache Airflow and Apache Spark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cloud technology:&lt;/strong&gt; Being a data engineer entails, to a great extent, connecting your company’s business systems to cloud-based systems. Therefore, a good data engineer should know and have experience with cloud services, their advantages and disadvantages, and their application in Big Data projects. Widely used cloud platforms include Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed computing frameworks:&lt;/strong&gt;  A distributed system is a computing environment in which various components are spread across multiple computers on a network. Distributed systems split up the work across the cluster, coordinating the efforts to complete the job more efficiently. Distributed computing frameworks, such as Apache Hadoop and Apache Spark, are designed for the processing of massive amounts of data, and they provide the foundations for some of the most impressive Big Data applications.&lt;/p&gt;
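&lt;p&gt;The split-and-coordinate idea can be illustrated on a single machine with Python's &lt;code&gt;concurrent.futures&lt;/code&gt;; frameworks like Hadoop and Spark apply the same map-and-reduce pattern across an entire cluster of machines:&lt;/p&gt;

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# The "cluster" here is just a local thread pool; Hadoop and Spark apply
# the same split/coordinate idea across many networked machines.
lines = [
    "spark splits the work",
    "hadoop splits the work",
    "the cluster coordinates the work",
]

def count_words(chunk):
    """Map step: count words in one partition of the data."""
    counter = Counter()
    for line in chunk:
        counter.update(line.split())
    return counter

# Split the dataset into partitions, process them concurrently, then
# reduce the partial results into one final count.
partitions = [lines[0:1], lines[1:2], lines[2:3]]
with ThreadPoolExecutor(max_workers=3) as pool:
    partial_counts = list(pool.map(count_words, partitions))

total = Counter()
for partial in partial_counts:
    total.update(partial)
print(total["the"], total["work"])  # 4 3
```

&lt;p&gt;The payoff of the distributed version is that each partition can live on, and be processed by, a different node, so the job scales with the size of the cluster.&lt;/p&gt;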

&lt;p&gt;&lt;strong&gt;Shell:&lt;/strong&gt; Most of the jobs and routines of the Cloud and other Big Data tools and frameworks are executed using shell commands and scripts. Data engineers should be comfortable with the terminal to edit files, run commands, and navigate the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL frameworks:&lt;/strong&gt; Data engineers create data pipelines with ETL technologies and orchestration frameworks. We could list many technologies here, but a data engineer should know or be comfortable with some of the best known, such as Apache Airflow. Airflow is an orchestration framework: an open-source tool for planning, generating, and tracking data pipelines. There are other ETL frameworks as well; I would advise you to research them further in order to understand them.&lt;/p&gt;
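&lt;p&gt;To see what an orchestration framework buys you, here is a toy sketch of running tasks in dependency order. This is not the Airflow API, just the underlying DAG idea, and the task names are made up for illustration:&lt;/p&gt;

```python
# A toy orchestrator: each task names the tasks it depends on, and we run
# them in dependency order. Airflow formalizes exactly this idea as a DAG
# of operators, adding scheduling, retries, and monitoring on top.
tasks = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
}

def run_order(task_graph):
    """Return a valid execution order for the dependency graph."""
    done, order = set(), []
    while len(order) != len(task_graph):
        for name, deps in task_graph.items():
            if name not in done and all(d in done for d in deps):
                done.add(name)
                order.append(name)
    return order

print(run_order(tasks))  # ['extract', 'transform', 'load', 'report']
```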

&lt;h2&gt;
  
  
  Why then should we consider data engineering?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldcznmlic9c7b9sz0oqi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldcznmlic9c7b9sz0oqi.png" alt="Why Data Engineering?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data engineering helps firms collect, generate, store, analyze, and manage data in real time or in batches. We can achieve this while constructing data infrastructure, thanks to a new set of tools and technologies.&lt;br&gt;
It focuses on scaling data systems and dealing with various levels of complexity in terms of scalability, optimization and availability.&lt;/p&gt;

&lt;h2&gt;
  
  
  How are data engineers different from data scientists and machine learning scientists?
&lt;/h2&gt;

&lt;p&gt;A data engineer is responsible for making quality data available from various sources: maintaining databases, building data pipelines, querying and pre-processing data, feature engineering, working with tools such as Apache Hadoop and Spark, and developing data workflows using Airflow.&lt;/p&gt;

&lt;p&gt;ML engineers are responsible for building machine learning algorithms, building and deploying machine learning models, applying statistical and mathematical knowledge, and measuring, optimizing, and improving results.&lt;/p&gt;

&lt;p&gt;The primary role of a data scientist is to take raw data and apply analytic tools and modeling techniques to analyze it and provide insights to the business.&lt;/p&gt;

&lt;h2&gt;
  
  
  The five V's of Data:
&lt;/h2&gt;

&lt;p&gt;Volume - how much data there is&lt;br&gt;
Variety - what kinds of data there are&lt;br&gt;
Velocity - how frequently the data arrives&lt;br&gt;
Veracity - how accurate the data is&lt;br&gt;
Value - how useful the data is&lt;/p&gt;

&lt;p&gt;That's it for our introduction to data engineering. This article has introduced you to the field of data engineering and explained what you need to learn to build a successful career as a Data Engineer.&lt;/p&gt;

&lt;p&gt;I will continue writing about Data Engineering, so join me as we read about data engineering together.&lt;/p&gt;

&lt;p&gt;Remember to give your feedback about this article.&lt;/p&gt;

&lt;p&gt;Happy Learning!!&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How to uninstall MySQL Server from Ubuntu 22.04.</title>
      <dc:creator>Kinyungu Denis</dc:creator>
      <pubDate>Thu, 18 Aug 2022 13:29:00 +0000</pubDate>
      <link>https://dev.to/kinyungu_denis/how-to-uninstall-mysql-server-from-ubuntu-2204-1k9j</link>
      <guid>https://dev.to/kinyungu_denis/how-to-uninstall-mysql-server-from-ubuntu-2204-1k9j</guid>
<description>&lt;p&gt;I am Kinyungu, an IT support specialist who loves helping people understand and use applications with ease. I am currently growing a career as a data engineer.&lt;/p&gt;

&lt;p&gt;In this article, we look at how to uninstall MySQL Server on Ubuntu 22.04. What might cause one to uninstall MySQL Server? Perhaps you face unexpected issues while using it, a MySQL Server update causes problems, or you simply decide to remove it from your computer.&lt;/p&gt;

&lt;p&gt;So follow along and see how we uninstall our MySQL server. Let us do it!!!&lt;/p&gt;

&lt;p&gt;First, open your Ubuntu terminal using the shortcut:&lt;br&gt;
&lt;code&gt;CTRL&lt;/code&gt; + &lt;code&gt;ALT&lt;/code&gt; + &lt;code&gt;T&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now that we are at the terminal, we will write the command to remove MySQL Server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get remove --purge mysql*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's explain this command:&lt;br&gt;
&lt;code&gt;sudo&lt;/code&gt; runs the command with root privileges.&lt;br&gt;
&lt;code&gt;apt-get remove&lt;/code&gt; on its own uninstalls a package but leaves its configuration files on your computer.&lt;br&gt;
&lt;code&gt;--purge&lt;/code&gt; is passed to &lt;code&gt;apt-get remove&lt;/code&gt; so that the configuration files are deleted as well, and &lt;code&gt;mysql*&lt;/code&gt; matches every package whose name starts with mysql.&lt;/p&gt;

&lt;p&gt;Next, let's purge any remaining MySQL packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get purge mysql*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's understand this command; we already know what &lt;code&gt;sudo&lt;/code&gt; does.&lt;br&gt;
&lt;code&gt;apt-get purge mysql*&lt;/code&gt; removes any remaining MySQL packages together with their associated files and configuration.&lt;/p&gt;

&lt;p&gt;At this point we have successfully uninstalled MySQL Server from Ubuntu 22.04.&lt;br&gt;
However, it is advisable to run the following commands so that MySQL Server is removed completely, without leaving residual files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get autoremove
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get autoclean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the above commands leaves your system clean and you can continue to use your Ubuntu 22.04 well.&lt;/p&gt;

&lt;p&gt;The next step should be considered optional, but I would advise you to do it. After uninstalling applications or packages, it is good to bring the system up to date.&lt;/p&gt;

&lt;p&gt;We will run this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo apt-get dist-upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command upgrades the packages installed on the system and can also update the kernel to a new version. It handles package dependencies intelligently: it resolves conflicts that arise from changed dependencies, removes dependency packages that are no longer required, and, if needed, installs new packages required by the new kernel version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Yeah!!&lt;/strong&gt; Indeed, we did uninstall MySQL Server.&lt;/p&gt;

&lt;p&gt;I hope this article will help anyone uninstalling MySQL server from Ubuntu 22.04 and any other reader looking to learn.&lt;/p&gt;

</description>
      <category>mysql</category>
      <category>uninstallation</category>
      <category>ubuntu</category>
      <category>linux</category>
    </item>
  </channel>
</rss>
