<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karen Ngala</title>
    <description>The latest articles on DEV Community by Karen Ngala (@karen_ngala).</description>
    <link>https://dev.to/karen_ngala</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F464461%2F22c2033f-b4da-42ef-996e-3b449b41341c.png</url>
      <title>DEV Community: Karen Ngala</title>
      <link>https://dev.to/karen_ngala</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karen_ngala"/>
    <language>en</language>
    <item>
      <title>Git for Data Science</title>
      <dc:creator>Karen Ngala</dc:creator>
      <pubDate>Wed, 05 Apr 2023 12:54:19 +0000</pubDate>
      <link>https://dev.to/karen_ngala/git-for-data-science-2h3b</link>
      <guid>https://dev.to/karen_ngala/git-for-data-science-2h3b</guid>
      <description>&lt;p&gt;As data science continues to gain momentum as a field, managing and versioning data and code has become increasingly important. Git, a powerful version control system, is a popular tool among software developers for managing source code changes. However, Git is not just limited to software development and can also be used effectively for managing data science projects. &lt;/p&gt;

&lt;p&gt;In this article, we will explore how Git can be leveraged by data scientists to efficiently manage and version data, track changes, collaborate with team members, and reproduce experiments. Whether you are new to Git or an experienced user, this article aims to provide a comprehensive guide on using Git for data science projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Git and How does it work?
&lt;/h2&gt;

&lt;p&gt;Git is a distributed version control system used for tracking changes in source code during software development. It allows multiple people to collaborate on the same project by tracking changes to code. Git does this by taking snapshots of the files at various points in time, creating a complete history of changes made to those files. Each snapshot is called a &lt;em&gt;"commit"&lt;/em&gt; and contains a reference to the previous commit, forming a &lt;em&gt;"commit chain"&lt;/em&gt; or a &lt;em&gt;"commit history"&lt;/em&gt;.&lt;/p&gt;
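
&lt;p&gt;To make the idea of a commit chain concrete, here is a minimal sketch in Python. It is a toy model, not how Git is actually implemented: the helper names &lt;code&gt;make_commit&lt;/code&gt; and &lt;code&gt;history&lt;/code&gt;, the in-memory &lt;code&gt;store&lt;/code&gt; and the file contents are all assumptions made for illustration. Each commit holds a snapshot and a reference to its parent, and walking those references recovers the history.&lt;/p&gt;

```python
import hashlib
import json


def make_commit(snapshot, message, parent=None):
    """Build a toy commit: a snapshot of files plus a pointer to its parent."""
    body = {"snapshot": snapshot, "message": message, "parent": parent}
    # Hash the commit body to get a stable identifier (real Git also hashes its objects).
    commit_id = hashlib.sha1(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return commit_id, body


def history(store, head):
    """Walk parent pointers from `head` back to the first commit."""
    messages = []
    while head is not None:
        messages.append(store[head]["message"])
        head = store[head]["parent"]
    return messages


# Hypothetical in-memory "repository" holding two commits.
store = {}
first_id, first = make_commit({"analysis.py": "print('v1')"}, "Add analysis script")
store[first_id] = first
second_id, second = make_commit(
    {"analysis.py": "print('v2')"}, "Update analysis script", parent=first_id
)
store[second_id] = second

print(history(store, second_id))
```

&lt;p&gt;Starting from the latest commit, the walk prints the newest message first and ends at the commit that has no parent.&lt;/p&gt;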

&lt;p&gt;Git uses a &lt;strong&gt;distributed&lt;/strong&gt; model, which means that each user has a local copy of the entire repository, including the commit history. This allows users to work offline and makes collaboration easier. When users are ready to share their changes, they can &lt;em&gt;push&lt;/em&gt; their commits to a remote repository, from which other users can then &lt;em&gt;pull&lt;/em&gt; to incorporate those changes into their local copies. &lt;/p&gt;

&lt;p&gt;Git also offers tools for merging changes made by different people and reverting to earlier versions if necessary. It also provides tools for branching, enabling developers to work on different parts of a project simultaneously without disrupting each other's work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Git vs GitHub
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Git&lt;/strong&gt; is a command-line tool that allows developers to track source code history over time while also allowing them to collaborate on the same project with minimal conflict.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt; is a web platform built around Git that hosts remote repositories of Git projects. It also offers features such as bug tracking, project management, and automation. Alternatives to GitHub include GitLab and Bitbucket, among others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Terminologies &amp;amp; Commands&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repository&lt;/strong&gt;: A repository is a central location where Git stores all the files and folders of a project, along with their revision history.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Create a new repository on your local computer&lt;/span&gt;
git init 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Commit&lt;/strong&gt;: A commit is a snapshot of a repository at a specific point in time. It represents a set of changes that have been made to the repository. Before committing, you must stage the edited files using the &lt;code&gt;git add&lt;/code&gt; command, which marks the files to go into the commit.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# stage all edited files&lt;/span&gt;
git add &lt;span class="nb"&gt;.&lt;/span&gt; 

&lt;span class="c"&gt;# stage a specific file&lt;/span&gt;
git add &amp;lt;file_name.ext&amp;gt;

git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"commit message goes here"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Branch&lt;/strong&gt;: A branch is a separate version of the repository that allows developers to work on different features or fixes simultaneously without interfering with each other's work.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# create a branch, then switch to it&lt;/span&gt;
git branch &amp;lt;branch_name&amp;gt;
git checkout &amp;lt;branch_name&amp;gt;

&lt;span class="c"&gt;# create a new branch and switch to it in one step&lt;/span&gt;
git checkout &lt;span class="nt"&gt;-b&lt;/span&gt; &amp;lt;branch_name&amp;gt;

&lt;span class="c"&gt;# list all branches in the repository &lt;/span&gt;
git branch 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt;: Push is the process of sending changes from a local repository to a remote repository, such as on GitHub.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git push origin &amp;lt;branch_name&amp;gt;
&lt;span class="c"&gt;# origin -&amp;gt; the default name Git gives to the remote repository you cloned from&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt;: Pull is the process of fetching and merging changes from a remote repository into a local repository.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git pull origin &amp;lt;branch_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Merge&lt;/strong&gt;: A merge is the process of combining changes from one branch into another branch.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git merge &amp;lt;feature_branch_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pull Request&lt;/strong&gt;: A pull request is a request made by a developer to merge their changes from a branch into the main branch of the repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fork&lt;/strong&gt;: A fork is a copy of a repository that allows a developer to make changes to the code without affecting the original repository.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clone&lt;/strong&gt;: A clone is a &lt;strong&gt;local copy&lt;/strong&gt; of a remote repository that a developer can work on without affecting the original repository.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &amp;lt;&lt;span class="nb"&gt;link &lt;/span&gt;to remote repository&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HEAD&lt;/strong&gt;: A reference to the commit that is currently checked out in your local repository.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Git Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Don't push secrets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Whether you are working on a private or public repository, never commit any secrets. These include usernames, passwords, API keys, TLS certificates, and other sensitive information. Keep in mind that private repositories can be accessed and cloned by multiple accounts and may be made public at some point.&lt;br&gt;
To protect such sensitive information, make use of a &lt;code&gt;.env&lt;/code&gt; file. This file's purpose is to hold environment variables. The &lt;code&gt;.env&lt;/code&gt; file is in turn kept out of the repository by listing it in the &lt;code&gt;.gitignore&lt;/code&gt; file.&lt;br&gt;
To make collaboration easy, you should create a &lt;code&gt;.env.example&lt;/code&gt; or &lt;code&gt;.env.template&lt;/code&gt; file. This file tells other collaborators which environment variables the system expects. From this file, they can create a &lt;code&gt;.env&lt;/code&gt; file with their own usernames, passwords and secret keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .env file:
&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;97467282&lt;/span&gt;&lt;span class="n"&gt;TTa89sdaf7659025f7sda22245&lt;/span&gt;

&lt;span class="c1"&gt;# .env.example file:
&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;your_key&lt;/span&gt;

&lt;span class="c1"&gt;# .gitignore file:
&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;

&lt;span class="c1"&gt;# app.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="n"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'API_KEY'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you happen to commit a secret, you cannot fix it by simply deleting it in a later commit. Because Git is designed to maintain a persistent history of the code, removing the secret requires rewriting history. This can prove difficult when other people already have the secret in their local repositories. The safest response is to treat the secret as compromised: change the passwords and revoke the exposed keys.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Don't push datasets&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The main purpose of Git is to track changes in text files, not large binary files such as datasets. You may work with extremely large datasets, which you can accidentally commit if you are not careful. There are several approaches you can take:&lt;br&gt;
a) If your dataset does not change, you can upload it to a server and access it via its URL.&lt;br&gt;
b) Use a &lt;code&gt;.gitignore&lt;/code&gt; file. Add your dataset files or folders to the &lt;code&gt;.gitignore&lt;/code&gt; file to avoid accidentally staging and committing them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ignore archives&lt;/span&gt;
&lt;span class="k"&gt;*&lt;/span&gt;.zip
&lt;span class="k"&gt;*&lt;/span&gt;.tar
&lt;span class="k"&gt;*&lt;/span&gt;.tar.gz
&lt;span class="k"&gt;*&lt;/span&gt;.rar

&lt;span class="c"&gt;# ignore dataset folder and subfolders&lt;/span&gt;
datasets/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;3. Don't push notebook outputs&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Cell outputs in notebooks are a great feature. However, when using a version control system such as Git, a change to a code cell will most likely change its output. Keeping track of the changes made in output cells distracts from the more important changes in the code cells. This can prove tedious when multiple people are working on the same notebook.&lt;br&gt;
You should, therefore, strip all outputs from a notebook before committing to Git by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Manually clearing all output cells from the main menu &lt;code&gt;Cell -&amp;gt; All Output -&amp;gt; Clear&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Setting up a &lt;a href="https://zhauniarovich.com/post/2020/2020-06-clearing-jupyter-output/"&gt;pre-commit hook&lt;/a&gt; to clear outputs automatically.&lt;/li&gt;
&lt;li&gt;Using a &lt;a href="https://gist.github.com/33eyes/431e3d432f73371509d176d0dfb95b6e"&gt;.gitattributes file&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
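
&lt;p&gt;As a rough illustration of what clearing outputs involves, the sketch below treats a notebook as the JSON document it is stored as and empties the outputs of every code cell. The function name &lt;code&gt;strip_outputs&lt;/code&gt; and the sample notebook are made up for illustration; the dedicated tools linked above handle more cases than this does.&lt;/p&gt;

```python
import json


def strip_outputs(notebook):
    """Clear outputs and execution counts from every code cell of a notebook dict."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# A tiny in-memory notebook used only for illustration.
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Title"], "metadata": {}},
        {
            "cell_type": "code",
            "source": ["print(1 + 1)"],
            "metadata": {},
            "execution_count": 3,
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["2\n"]}],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(example)
print(json.dumps(cleaned["cells"][1], indent=2))
```

&lt;p&gt;To apply this to a real file, you could load it with &lt;code&gt;json.load&lt;/code&gt;, pass it through &lt;code&gt;strip_outputs&lt;/code&gt;, and write it back with &lt;code&gt;json.dump&lt;/code&gt; before committing.&lt;/p&gt;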

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Refrain from using &lt;code&gt;--force&lt;/code&gt; or &lt;code&gt;-f&lt;/code&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At times, you may encounter an error when pushing to remote that asks you to use the &lt;code&gt;--force&lt;/code&gt; or &lt;code&gt;-f&lt;/code&gt; flag. There are situations that require using this flag. However, make it a habit to read the error message first, try to identify the origin of the error and fix the underlying issue. If this proves challenging, try asking for help.&lt;br&gt;
A forced push overwrites the history of the remote branch, so using &lt;code&gt;--force&lt;/code&gt; habitually risks discarding commits that your collaborators have already pushed.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Make frequent and clear commits&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As a general rule of thumb, a single commit should do one thing: fix one bug, not five; solve a single issue, not ten.&lt;br&gt;
For example, a commit that fixes ten bugs will most likely have multiple changed files. Further, if the commit message is unclear like "Model now working", it becomes difficult for someone else to understand what happened in the commit. This provides zero value. The commit message "Fix special tokens not correctly tokenized" is short, but clear. You know what changed, and why.&lt;br&gt;
Thankfully, you can fix your commit history &lt;strong&gt;if&lt;/strong&gt; you haven't pushed to remote. Learning to &lt;a href="https://git-scm.com/book/en/v2/Git-Tools-Rewriting-History"&gt;rewrite history&lt;/a&gt; can prove very useful in real world projects.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Utilize branching and pull requests&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your project is constantly being worked on by many people or is in production, pull requests can prove very helpful. By default, a Git repository has a single branch, named &lt;code&gt;main&lt;/code&gt; or &lt;code&gt;master&lt;/code&gt;. It is considered the &lt;strong&gt;central true&lt;/strong&gt; branch.&lt;br&gt;
When you branch, you create a temporary offshoot of the &lt;code&gt;main&lt;/code&gt; branch. Through branching, you and other collaborators can work on new features or fix old ones simultaneously without affecting the main branch.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bthrkqog--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://valohai.com/blog/git-for-data-science/git-branches.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bthrkqog--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://valohai.com/blog/git-for-data-science/git-branches.png" width="880" height="361"&gt;&lt;/a&gt;&lt;br&gt;
When you are done working on your feature, you create a pull request to merge (include) the changes from your branch into the central &lt;code&gt;main&lt;/code&gt; branch. Pull requests are a GitHub concept and have features that allow other people to review, comment on, suggest changes to, approve, or apply the changes in the pull request.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we've covered Git, how it works and the best practices when working with Git. To further help you in this journey, I have linked articles I found useful below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://towardsdatascience.com/comprehensive-guide-to-github-for-data-scientist-d3f71bd320da"&gt;The basics of Git and Github&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;&lt;a href="https://towardsdatascience.com/a-guide-to-git-for-data-scientists-fd68bc1c729"&gt;Understand the Git workflow and Source code history&lt;/a&gt;&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you found this post useful!&lt;/p&gt;

</description>
      <category>git</category>
      <category>github</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Getting started with Sentiment Analysis</title>
      <dc:creator>Karen Ngala</dc:creator>
      <pubDate>Wed, 22 Mar 2023 19:00:06 +0000</pubDate>
      <link>https://dev.to/karen_ngala/getting-started-with-sentiment-analysis-lc0</link>
      <guid>https://dev.to/karen_ngala/getting-started-with-sentiment-analysis-lc0</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Pre-reading:&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/karen_ngala/exploratory-data-analysis-ultimate-guide-2olg"&gt;Basic understanding of EDA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What is Sentiment Analysis?
&lt;/h2&gt;

&lt;p&gt;Humans communicate with each other using natural language, which is often complicated. We tend to use subtle variations in our speech, such as sarcasm, which are easy for us to interpret but difficult for machines. To make computers understand natural language, we use a process known as Natural Language Processing (NLP).&lt;/p&gt;

&lt;p&gt;Sentiment analysis, also known as opinion mining, is an approach to natural language processing that seeks to identify the emotion behind a text, such as a movie or product review. Businesses around the world use sentiment analysis to understand public opinion on their products or services as expressed on online platforms.&lt;/p&gt;

&lt;p&gt;Sentiment analysis identifies, classifies, and quantifies the sentiment expressed in a text. For example, the text "I loved the movie" carries a positive sentiment while "I found it rather slow and boring" carries a negative sentiment. The strength of the sentiment can also be quantified: for example, "I really enjoyed the movie" can be scored as 'relatively more positive'. The amount of positivity or negativity in a text is known as &lt;em&gt;polarity&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;When a large amount of data is involved, it becomes more effective to use an algorithm to determine customer satisfaction as opposed to humans. &lt;/p&gt;

&lt;h2&gt;
  
  
  Sentiment Analysis Process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Import relevant libraries
&lt;/h3&gt;

&lt;p&gt;There are a number of libraries we can use in sentiment analysis depending on your goals.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; — for data analysis and manipulation &lt;code&gt;import pandas as pd&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Matplotlib&lt;/strong&gt; — for data visualization &lt;code&gt;import matplotlib.pyplot as plt&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seaborn&lt;/strong&gt; — for high-level data visualization
 &lt;code&gt;import seaborn as sns&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WordCloud&lt;/strong&gt; - to visualize text data. The more a word appears in the text, the larger the font of the word.  &lt;code&gt;from wordcloud import WordCloud&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;re&lt;/strong&gt; — for string pre-processing. Searches and transforms strings using regular expressions  &lt;code&gt;import re&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;nltk&lt;/strong&gt; — Natural Language Toolkit. It is a collection of libraries used in Natural Language Processing.  &lt;code&gt;import nltk&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;stopwords&lt;/strong&gt; — A collection of words that do not offer sentiment in a sentence, such as "the", "and"  &lt;code&gt;from nltk.corpus import stopwords&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Evaluation Libraries:&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;roc_curve&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;auc&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;classification_report&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;confusion_matrix&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once we have trained our model, we need to evaluate its correctness using the testing dataset, i.e., is the result what we expect it to be?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy Score&lt;/strong&gt; — Ratio of correctly classified instances to the total number of instances.
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2F1%2AyRa2inzTnyASJOre93ep3g.gif"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Precision Score&lt;/strong&gt; — Ratio of correctly predicted positive instances to all instances predicted as positive. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall Score&lt;/strong&gt; — Ratio of correctly predicted positive instances to all instances that are actually positive.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Classification Report&lt;/strong&gt; — a per-class report of precision, recall, and F1 scores, along with the overall accuracy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ROC Curve&lt;/strong&gt; — a graph of Sensitivity/True Positive Rate (y-axis) against the False Positive Rate, i.e. 1 - Specificity (x-axis), at various threshold values. An ROC (“Receiver Operating Characteristic”) curve summarizes the performance of a binary classification model.&lt;/li&gt;
&lt;/ul&gt;
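
&lt;p&gt;The first three metrics above can be worked out by hand from the counts of true and false positives and negatives. The sketch below does this in plain Python for labels encoded as 1 (positive) and 0 (negative); the function name &lt;code&gt;binary_metrics&lt;/code&gt; and the sample labels are invented for illustration, and in practice the &lt;code&gt;sklearn.metrics&lt;/code&gt; functions imported earlier would normally be used instead.&lt;/p&gt;

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for labels encoded as 1 and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Made-up labels: 4 actual positives and 4 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]

print(binary_metrics(y_true, y_pred))
```

&lt;p&gt;With these sample labels there are 3 true positives, 2 true negatives, 2 false positives and 1 false negative, so accuracy is 5/8, precision is 3/5 and recall is 3/4.&lt;/p&gt;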

&lt;blockquote&gt;
&lt;p&gt;A binary classification model is one that classifies an instance as one of exactly two classes, i.e., the output can only be one value or the other: 'Sick' or 'Not Sick', 'Cat' or 'Dog', 'Tree' or 'Not Tree'.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  2. Load the dataset
&lt;/h3&gt;

&lt;p&gt;A sample sentiment analysis dataset will contain a text column and its corresponding sentiment/target value.&lt;/p&gt;

&lt;p&gt;To read the dataset, we need to load it using pandas:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Exploratory Data Analysis
&lt;/h3&gt;

&lt;p&gt;Understand the data you are working with. Check various aspects of the dataset to familiarize yourself with it. This will help you know how you can manipulate the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dtypes&lt;/span&gt;

&lt;span class="c1"&gt;# Count the rows that contain at least one null value
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Distribution of target variables:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next step is to check the various target sentiments in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# or plot the counts
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;countplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In cases where there are more than two label types, we can merge them into two simple sentiments, &lt;em&gt;positive&lt;/em&gt; and &lt;em&gt;negative&lt;/em&gt;, represented numerically as '1' and '0'.&lt;/p&gt;
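
&lt;p&gt;One possible way to do this merge is with a lookup table from each original label to its binary value. The label names below are hypothetical, chosen only to illustrate the idea; a real dataset will have its own labels, and a neutral class would need a decision of its own (for example, dropping those rows).&lt;/p&gt;

```python
# Hypothetical label names; real datasets will use their own.
LABEL_TO_BINARY = {
    "very positive": 1,
    "positive": 1,
    "negative": 0,
    "very negative": 0,
}


def to_binary(labels, mapping=LABEL_TO_BINARY):
    """Map fine-grained sentiment labels to 1 (positive) or 0 (negative)."""
    return [mapping[label] for label in labels]


raw_labels = ["positive", "very negative", "very positive", "negative"]
print(to_binary(raw_labels))
```

&lt;p&gt;With a pandas DataFrame, the same mapping could be applied to the label column with &lt;code&gt;df['label'].map(LABEL_TO_BINARY)&lt;/code&gt;.&lt;/p&gt;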

&lt;h3&gt;
  
  
  4. Data Preparation
&lt;/h3&gt;

&lt;p&gt;Dealing with alphanumeric text requires pre-processing to remove any odd characters and prepare the text for the model.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Convert the text to lowercase. Because string comparison is case-sensitive, the word "Hello" would otherwise be treated as different from "hello".&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Remove any stopwords. Words such as &lt;em&gt;"the", "and"&lt;/em&gt; do not offer much value in sentiment analysis&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;

&lt;span class="c1"&gt;# Inspect the English stopword list (requires a one-time nltk.download('stopwords'))
&lt;/span&gt;&lt;span class="n"&gt;stopwords_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stopwords_list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get rid of any stopwords
&lt;/span&gt;&lt;span class="n"&gt;STOPWORDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cleaning_stopwords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;STOPWORDS&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;cleaning_stopwords&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Remove non-alphabetic characters.&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# remove special characters, numbers and punctuations
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[^a-zA-Z#]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;regex&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# remove short words
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;


&lt;p&gt;Depending on the data you are dealing with, you may need to remove different characters and character combinations. For example, when handling Twitter data, you will need to remove user handles such as "&lt;em&gt;@username&lt;/em&gt;".&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# function to remove patterns in the input text.
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_pattern&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_txt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_txt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;input_txt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;escape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_txt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;input_txt&lt;/span&gt;

&lt;span class="c1"&gt;# remove twitter handles (@user)
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vectorize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remove_pattern&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@[\w]*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;4. Tokenization&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Tokenization is used in natural language processing to split text into smaller units (tokens) that can be assigned meaning more easily. For example, the string "Loved the ambiance and drinks" is broken into individual tokens that the program can process: 'Loved', 'the', 'ambiance', 'and', 'drinks'.&lt;/p&gt;

&lt;p&gt;This step also lays the groundwork for stemming or lemmatization. Learn more on this topic &lt;a href="https://towardsdatascience.com/sentiment-analysis-intro-and-implementation-ddf648f79327#:~:text=Questions%20and%20Answers-,Tokens%20and%20Bigrams,-In%20order%20for" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RegexpTokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;\w+&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
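
&lt;p&gt;To see what this pattern does to the example sentence above, here is a minimal sketch that uses only Python's built-in &lt;code&gt;re&lt;/code&gt; module. It is an illustration rather than part of the original pipeline, and it should give the same tokens as the &lt;code&gt;\w+&lt;/code&gt; tokenizer for this sentence.&lt;/p&gt;

```python
import re

# Example sentence from the text above
text = "Loved the ambiance and drinks"

# \w+ matches runs of letters, digits and underscores,
# so whitespace and punctuation act as separators
tokens = re.findall(r"\w+", text)

print(tokens)
```

&lt;p&gt;Each token can then be cleaned, counted, or lemmatized on its own.&lt;/p&gt;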



&lt;h4&gt;
  
  
  &lt;strong&gt;5. Lemmatization&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;This is the process of deriving the root word from the different forms of a word. For example, the words &lt;em&gt;eats&lt;/em&gt; and &lt;em&gt;eating&lt;/em&gt; belong to the same lexeme, with &lt;em&gt;eat&lt;/em&gt; as the &lt;strong&gt;lemma&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lemmatization is computationally expensive since it involves look-up tables. &lt;br&gt;
    Unlike &lt;em&gt;stemming&lt;/em&gt;, which simply strips word endings, lemmatization uses a language's vocabulary to derive the base word. The base words produced by stemming don't always make sense. For example, a stemmer may reduce 'having' to 'hav', whereas lemmatization returns 'have'.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WordNetLemmatizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lemmatizer_on_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lemmatize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;lemmatizer_on_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Prepare for training
&lt;/h3&gt;

&lt;p&gt;The next step is to separate the dataset into training data and testing data. Sentiment analysis is a classification problem. As such, a classification model is trained using the training dataset and evaluated using the testing dataset. A common ratio of training data to testing data is 4:1 (an 80:20 split), which leaves most of the data for training while keeping enough unseen data for a meaningful evaluation.&lt;/p&gt;

&lt;p&gt;The purpose of this step is to ensure the data you use to evaluate your model's accuracy is unseen/new data. If you test a model on the same data it was trained on, the score cannot tell you whether the model has learned general patterns or has simply memorized the training data. A model that performs well on the training data but poorly on any other data is &lt;em&gt;overfitted&lt;/em&gt;; a model that performs poorly even on the training data is &lt;em&gt;underfitted&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Accuracy score&lt;/em&gt; allows us to evaluate the model's performance. We compare the training accuracy to the testing accuracy to identify underfitting and overfitting.&lt;br&gt;
If the training accuracy is extremely high while the testing accuracy is poor then this is a good indicator that the model is probably overfitted.&lt;/p&gt;

&lt;p&gt;In cases where we need to choose between multiple models, we need to create an extra dataset known as the &lt;strong&gt;validation dataset&lt;/strong&gt;. This allows us to evaluate the models to pick which performs better.&lt;/p&gt;

&lt;p&gt;There are many ways to split your dataset. The following is one method that utilizes sklearn. Read more about &lt;a href="https://towardsdatascience.com/how-to-split-a-dataset-into-training-and-testing-sets-b146b1649830" rel="noopener noreferrer"&gt;how to split a dataset&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This splits data into an 80:20 ratio
&lt;/span&gt;&lt;span class="n"&gt;training_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;testing_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
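
&lt;p&gt;When a validation dataset is also needed, the same idea extends to a three-way split. The sketch below is a plain-Python illustration of the principle, not a replacement for sklearn: the function name &lt;code&gt;split_dataset&lt;/code&gt; and the 70:10:20 proportions are assumptions chosen for the example.&lt;/p&gt;

```python
import random


def split_dataset(rows, test_fraction=0.2, validation_fraction=0.0, seed=25):
    """Shuffle rows and cut them into train, validation and test lists."""
    shuffled = list(rows)
    # A fixed seed makes the split reproducible between runs
    random.Random(seed).shuffle(shuffled)

    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


rows = list(range(100))  # stand-in for 100 labelled examples
train, validation, test = split_dataset(
    rows, test_fraction=0.2, validation_fraction=0.1
)
print(len(train), len(validation), len(test))
```

&lt;p&gt;Every row ends up in exactly one of the three lists, so the validation and testing data are never seen during training.&lt;/p&gt;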



&lt;h3&gt;
  
  
  6. Build Model
&lt;/h3&gt;

&lt;p&gt;The model you choose to use here is not set in stone. A popular choice for sentiment analysis is &lt;strong&gt;&lt;a href="https://medium.com/@fmnobar/logistic-regression-overview-through-11-practice-questions-practice-notebook-64e94cb8d09d" rel="noopener noreferrer"&gt;Logistic regression&lt;/a&gt;&lt;/strong&gt;. This is because it trains quickly even on large datasets and often gives strong baseline results. Other model choices include Random Forests and Naive Bayes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Q: What if we do not have labelled data? How can we know the sentiment in a text?&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;A: Using Pre-Trained Models — TextBlob&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
TextBlob is a library that returns the sentiment of a text as a named tuple: "(polarity, subjectivity)".&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polarity is a float between -1.0 and 1.0. It shows whether a text is negative (towards -1.0) or positive (towards 1.0).&lt;/li&gt;
&lt;li&gt;Subjectivity is a float between 0.0 and 1.0, where 0.0 is &lt;em&gt;very objective&lt;/em&gt; and 1.0 is &lt;em&gt;very subjective&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  7. Model Evaluation
&lt;/h3&gt;

&lt;p&gt;After training the model, we evaluate the performance of the model. Assessing the model's efficiency answers the question, &lt;em&gt;Is the model working well with unseen data?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Before going into the evaluation metrics we can use, let's define the results we can get from these metrics.&lt;br&gt;
For these definitions, let's use the example of a model classifying patients as &lt;strong&gt;"Sick"&lt;/strong&gt; or &lt;strong&gt;"Not Sick"&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;True Positive(TP)&lt;/em&gt; - the number of Sick people that were correctly classified as Sick. &lt;/li&gt;
&lt;li&gt;
&lt;em&gt;True Negative(TN)&lt;/em&gt; - the number of Not Sick people that were correctly classified as Not Sick.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;False Positive(FP)&lt;/em&gt; - the number of Not Sick people that were wrongly classified as Sick.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;False Negative(FN)&lt;/em&gt; - the number of Sick people that were wrongly classified as Not Sick.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;N&lt;/em&gt; - total number of patients
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2AGMlSubndVt3g7FmeQjpeMA.png"&gt;
&lt;/li&gt;
&lt;/ul&gt;
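
&lt;p&gt;The four outcomes above can be counted directly from the actual and predicted labels. This is a minimal sketch with made-up labels; the function name &lt;code&gt;confusion_counts&lt;/code&gt; and the six example patients are assumptions for illustration only.&lt;/p&gt;

```python
def confusion_counts(actual, predicted, positive="Sick"):
    """Count TP, TN, FP and FN for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn, "N": len(pairs)}


# Invented labels for six patients
actual = ["Sick", "Sick", "Not Sick", "Not Sick", "Sick", "Not Sick"]
predicted = ["Sick", "Not Sick", "Not Sick", "Sick", "Sick", "Not Sick"]

counts = confusion_counts(actual, predicted)
print(counts)
```

&lt;p&gt;The four counts always add up to N, which is a quick sanity check on any confusion matrix.&lt;/p&gt;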

&lt;p&gt;There are many evaluation metrics. However, we will look at 3 popular metrics used for classification models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Accuracy&lt;/strong&gt;: How often does the model make correct predictions, i.e., how often are the actual sentiment and the predicted sentiment the same?&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Testing accuracy
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Test set&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  Accuracy: {:0.2f}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accr1&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2AxvJylefImAAukT7dx7lZ3g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2AxvJylefImAAukT7dx7lZ3g.png"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Confusion Matrix&lt;/strong&gt; — a table used to visualize the performance of a classification model on a dataset for which the true (target) values are known. A confusion matrix highlights two errors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Type 1 Error&lt;/em&gt;&lt;/strong&gt; - The number of instances that were negative but were wrongly classified as positive. Also called a False Positive (FP).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;em&gt;Type 2 Error&lt;/em&gt;&lt;/strong&gt; - The number of instances that were positive but were wrongly classified as negative. Also called a False Negative (FN).
&lt;/li&gt;
&lt;/ul&gt;

&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confusion matrix&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;CR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CR&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;plot_confusion_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf_mat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CR&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                                &lt;span class="n"&gt;show_absolute&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;show_normed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                &lt;span class="n"&gt;colorbar&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;AUC (Area Under the ROC Curve)&lt;/strong&gt;: the area under the ROC curve, which is obtained by plotting the true positive rate against the false positive rate at different classification thresholds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;True Positive Rate (sensitivity)&lt;/em&gt; - proportion of positive samples that are correctly identified as positive
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2F1%2Ayw4Y3D7nGNVza2EC2WrOfg.gif"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;False Positive Rate (1-specificity)&lt;/em&gt; - proportion of negative samples that are incorrectly classified as positive
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2A857kpm2k2y-eor5Zy3-YeQ.png"&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;True Negative Rate (Specificity)&lt;/em&gt; - proportion of negative samples that are correctly identified as negative
&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2AT4PXeK_Hd397C-6ItmLReQ.png"&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2AzFW1Kj3e2X_mmluTW3rVeA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A640%2Fformat%3Awebp%2F1%2AzFW1Kj3e2X_mmluTW3rVeA.png"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;/li&gt;

&lt;/ol&gt;
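
&lt;p&gt;Accuracy and the rates listed above all follow from the four confusion-matrix counts. The sketch below uses invented counts (TP=2, TN=2, FP=1, FN=1), and the function name &lt;code&gt;classification_rates&lt;/code&gt; is an assumption for illustration.&lt;/p&gt;

```python
def classification_rates(tp, tn, fp, fn):
    """Derive accuracy and the ROC-related rates from confusion counts."""
    total = tp + tn + fp + fn
    return {
        # share of all predictions that were correct
        "accuracy": (tp + tn) / total,
        # sensitivity: positives correctly identified as positive
        "true_positive_rate": tp / (tp + fn),
        # 1 - specificity: negatives wrongly classified as positive
        "false_positive_rate": fp / (fp + tn),
        # specificity: negatives correctly identified as negative
        "true_negative_rate": tn / (tn + fp),
    }


rates = classification_rates(tp=2, tn=2, fp=1, fn=1)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

&lt;p&gt;Recomputing the true positive rate and the false positive rate at several classification thresholds gives the points of the ROC curve.&lt;/p&gt;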

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we talked about the steps you can take to solve a sentiment analysis problem.&lt;br&gt;
&lt;strong&gt;Practical guide:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.kaggle.com/code/muhammadimran112233/eda-twitter-sentiment-analysis-using-nn" rel="noopener noreferrer"&gt;Kaggle Notebook&lt;/a&gt; on Twitter sentiment analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you found this article helpful. Leave a comment if you have any questions or would like to discuss this topic further.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>sentimentanalysis</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Essential SQL Commands for Data Science</title>
      <dc:creator>Karen Ngala</dc:creator>
      <pubDate>Wed, 15 Mar 2023 13:25:17 +0000</pubDate>
      <link>https://dev.to/karen_ngala/essential-sql-commands-for-data-science-1kkm</link>
      <guid>https://dev.to/karen_ngala/essential-sql-commands-for-data-science-1kkm</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Pre-requisites:&lt;/em&gt;&lt;/strong&gt; This article assumes basic SQL knowledge, including familiarity with commands such as &lt;code&gt;CREATE&lt;/code&gt;, &lt;code&gt;INSERT&lt;/code&gt;, &lt;code&gt;UPDATE&lt;/code&gt;, &lt;code&gt;ALTER&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, and &lt;code&gt;DROP&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;SQL, Structured Query Language&lt;/em&gt;, is a programming language used for manipulating and managing data in a relational database. Data scientists use it to extract insights from data, since much of the data they work with lives in relational databases and can be extracted using SQL commands. Relational database management systems such as MySQL and PostgreSQL use SQL.&lt;/p&gt;

&lt;p&gt;This article covers the essential SQL commands that data scientists rely on to effectively clean and filter data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Retrieval

&lt;ul&gt;
&lt;li&gt;Conditions for Data Retrieval&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Data Aggregation

&lt;ul&gt;
&lt;li&gt;Changing Data Types&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Joining Data From Different Tables&lt;/li&gt;
&lt;li&gt;Complex Conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Basics: Data Retrieval
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SELECT FROM&lt;/strong&gt;&lt;br&gt;
This is the simplest method of data retrieval in a relational database.&lt;br&gt;
It can be combined with clauses such as WHERE, ORDER BY, and GROUP BY to filter, sort, and group data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- To select specific columns in a table:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;column1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column3&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- To select everything in a table:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DISTINCT&lt;/strong&gt;&lt;br&gt;
DISTINCT is used with SELECT to view unique values in a column. &lt;br&gt;
For example, to list all the departments appearing in the column &lt;code&gt;department&lt;/code&gt;, we use DISTINCT. It returns each department that appears in that column exactly once.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Conditions for Data Retrieval&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;WHERE&lt;/strong&gt;&lt;br&gt;
This is a conditional statement used to filter data according to a specific condition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;column1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column3&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;table_name&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- for example:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;age&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- We can also filter data with more than one condition:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Sales'&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'Finance'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'IT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'HR'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GROUP BY&lt;/strong&gt;&lt;br&gt;
This clause is used to group rows based on one or more columns. Every other column in the SELECT list must be wrapped in an aggregate function, such as the average salary per department below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
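
&lt;p&gt;To try grouping without setting up a database server, the sketch below runs a GROUP BY query through Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. The in-memory &lt;code&gt;employees&lt;/code&gt; table, the names, and the salaries are all invented for illustration.&lt;/p&gt;

```python
import sqlite3

# In-memory database with a small, made-up employees table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Amina", "Finance", 60000),
        ("Brian", "Finance", 80000),
        ("Chao", "IT", 70000),
        ("Dalia", "IT", 50000),
        ("Eli", "HR", 45000),
    ],
)

# One output row per department: headcount and average salary
rows = conn.execute(
    "SELECT department, COUNT(*), AVG(salary) "
    "FROM employees GROUP BY department ORDER BY department"
).fetchall()
conn.close()

for department, headcount, average_salary in rows:
    print(department, headcount, average_salary)
```

&lt;p&gt;Each department appears once in the result, with the aggregate functions summarizing the rows that were grouped together.&lt;/p&gt;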



&lt;p&gt;&lt;strong&gt;ORDER BY&lt;/strong&gt;&lt;br&gt;
This is used to sort the results of a query either alphabetically or numerically.&lt;br&gt;
The default sort order in SQL is ascending (&lt;code&gt;ASC&lt;/code&gt;), so you do not have to specify &lt;code&gt;ASC&lt;/code&gt; in your query.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, to sort the results in descending order, use the keyword &lt;code&gt;DESC&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LIMIT&lt;/strong&gt;&lt;br&gt;
When the records in a table are many, we may want to limit the number of records we get. For example, to view only the top 10 earners in the Finance department:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;employee_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Finance'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
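
&lt;p&gt;To see why the sort direction matters with &lt;code&gt;LIMIT&lt;/code&gt;, here is a minimal sketch that runs a top-N query from Python. It assumes Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and an in-memory database. The sample rows and the limit of 2 are made up for illustration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

# Hypothetical sample data: names and salaries are invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_name TEXT, department TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        ("Amina", "Finance", 72000),
        ("Brian", "Finance", 58000),
        ("Chloe", "IT", 91000),
        ("David", "Finance", 64000),
    ],
)

# Sort from highest to lowest salary, then keep only the first two rows.
top_earners = conn.execute(
    """
    SELECT employee_name, salary
    FROM employees
    WHERE department = 'Finance'
    ORDER BY salary DESC
    LIMIT 2
    """
).fetchall()

print(top_earners)  # [('Amina', 72000), ('David', 64000)]
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dropping &lt;code&gt;DESC&lt;/code&gt; would return the two lowest Finance salaries instead, because &lt;code&gt;LIMIT&lt;/code&gt; simply keeps the first rows of whatever order the query produces.&lt;/p&gt;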



&lt;h2&gt;
  
  
  Data Aggregation
&lt;/h2&gt;

&lt;p&gt;Aggregations are summaries of data used to gain insights on a dataset. They are often used with the GROUP BY clause.&lt;br&gt;
&lt;strong&gt;COUNT()&lt;/strong&gt;&lt;br&gt;
Count returns the total number of rows. In the example below, we are displaying the number of employees in each department.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
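&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;employee_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;employee_count&lt;/span&gt;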
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SUM() &amp;amp; AVG()&lt;/strong&gt;&lt;br&gt;
Sum returns the sum of all the values. In the example below, we use the GROUP BY statement to group the employees by department and calculate the total salary for each department:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;total_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Avg returns the average value. In the example below, we use the GROUP BY statement to group the employees by department and calculate the average salary for each department:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;HAVING&lt;/strong&gt;&lt;br&gt;
Having is used to add additional conditions after calculating a grouped aggregation.&lt;br&gt;
For example, the above query can be conditioned further to only show departments with an average salary above 50000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;50000&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;MIN() &amp;amp; MAX()&lt;/strong&gt;&lt;br&gt;
To know the lowest or highest values in a column, we can use &lt;code&gt;MIN&lt;/code&gt; and &lt;code&gt;MAX&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MIN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;lowest_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;highest_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Changing Data Types&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;CAST( )&lt;/strong&gt;&lt;br&gt;
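SQL treats salary values as plain numbers, even when they represent money. In databases that provide a &lt;code&gt;money&lt;/code&gt; type, such as PostgreSQL and SQL Server, we can convert the summed &lt;code&gt;salary&lt;/code&gt; values to currency amounts using the CAST function:&lt;br&gt;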
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;money&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;SUM&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can also change numbers into floats, text, or date and time.&lt;br&gt;
&lt;strong&gt;ROUND()&lt;/strong&gt;&lt;br&gt;
When an aggregation produces many decimal places, we can round the result to a fixed number of places:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ROUND&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_salary&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;department&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  JOINS
&lt;/h2&gt;

&lt;p&gt;Working with a single table limits the number of manipulations we can do with data. This is where JOINs come in. We are able to join data from multiple tables.&lt;br&gt;
Before we go any further, we need to distinguish a &lt;em&gt;primary key&lt;/em&gt; from a &lt;em&gt;foreign key&lt;/em&gt;. A primary key is a column used to uniquely identify records in a table. For example, the primary key in the &lt;code&gt;employees&lt;/code&gt; table is &lt;em&gt;employee_id&lt;/em&gt;. On the other hand, a foreign key is a column used to relate two tables. &lt;br&gt;
A foreign key is usually a primary key in the other table. A separate table having information about when employees take vacation days (&lt;code&gt;employee_vacation&lt;/code&gt; table) will have a column &lt;code&gt;employee_id&lt;/code&gt; to relate to the employee table. Therefore, &lt;em&gt;employee_id&lt;/em&gt; is a primary key in the employee table but a foreign key in the employee_vacation table.&lt;br&gt;
There are different types of SQL joins, which are often illustrated using Venn diagrams.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The following examples will feature a customer database with &lt;code&gt;customers&lt;/code&gt; table and &lt;code&gt;orders&lt;/code&gt; table.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;INNER JOIN&lt;/strong&gt;&lt;br&gt;
An inner join returns only the rows where the two tables have matching values in the join column. The example below shows the order_id and customer_name for every row where the customer_id in the orders table and the customer_id in the customers table are the &lt;strong&gt;same&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;INNER&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The keyword INNER is optional: a plain JOIN is an inner join. Therefore, the code above can be written as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;-- We can filter the data to not show a specific customer:&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;'Lucy Lucy'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also work with more than two tables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;shippers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shipper_name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt; 
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;shippers&lt;/span&gt; 
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shipper_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;shippers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shipper_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;LEFT JOIN&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The table before the statement &lt;code&gt;LEFT JOIN&lt;/code&gt; is the left table while the one after is the right table.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A LEFT JOIN will return &lt;code&gt;all&lt;/code&gt; the records in the left table and the matching records in the right table. If there are no matching records, the result will contain &lt;code&gt;NULL&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- customers = left table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;RIGHT JOIN&lt;/strong&gt; &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;As with a LEFT JOIN, the table before the statement &lt;code&gt;RIGHT JOIN&lt;/code&gt; is the left table while the one after is the right table.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A RIGHT JOIN will return &lt;code&gt;all&lt;/code&gt; the records in the right table and the matching records in the left table. If there are no matching records, the result will contain &lt;code&gt;NULL&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- orders = right table&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;
&lt;span class="k"&gt;RIGHT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
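
&lt;p&gt;The difference between these joins is easiest to see on a tiny dataset. The sketch below assumes Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module and three made-up customers, one of whom has no orders. It compares an inner join with a left join. A RIGHT JOIN is not run here because older SQLite versions do not support it, but &lt;code&gt;orders RIGHT JOIN customers&lt;/code&gt; would return the same rows as &lt;code&gt;customers LEFT JOIN orders&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

# Hypothetical sample data: Wanjiru is a customer with no orders.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Lucy'), (2, 'Omar'), (3, 'Wanjiru');
    INSERT INTO orders VALUES (101, 1), (102, 1), (103, 2);
    """
)

# Inner join: only customers that have at least one matching order.
inner_rows = conn.execute(
    """
    SELECT customers.customer_name, orders.order_id
    FROM customers
    JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY customers.customer_id, orders.order_id
    """
).fetchall()

# Left join: every customer, with NULL (None in Python) where no order matches.
left_rows = conn.execute(
    """
    SELECT customers.customer_name, orders.order_id
    FROM customers
    LEFT JOIN orders ON customers.customer_id = orders.customer_id
    ORDER BY customers.customer_id, orders.order_id
    """
).fetchall()

print(inner_rows)  # [('Lucy', 101), ('Lucy', 102), ('Omar', 103)]
print(left_rows)   # [('Lucy', 101), ('Lucy', 102), ('Omar', 103), ('Wanjiru', None)]
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The extra &lt;code&gt;('Wanjiru', None)&lt;/code&gt; row is the only difference: the left join keeps the customer without orders and fills the missing order_id with &lt;code&gt;NULL&lt;/code&gt;.&lt;/p&gt;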



&lt;h2&gt;
  
  
  Complex Queries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Subqueries&lt;/strong&gt;&lt;br&gt;
This is a query within another query, also known as a Nested query. It is usually embedded within the WHERE clause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- showing the highest paid employees&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;salary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;salary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 
                &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;employees&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;CASE statement&lt;/strong&gt;&lt;br&gt;
This can be used when you need to add a column whose values are determined by &lt;code&gt;if...else&lt;/code&gt; logic.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_total&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;CASE&lt;/span&gt; 
    &lt;span class="k"&gt;WHEN&lt;/span&gt; &lt;span class="n"&gt;order_total&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt; &lt;span class="k"&gt;THEN&lt;/span&gt; &lt;span class="s1"&gt;'Order total is less than $20'&lt;/span&gt;
    &lt;span class="k"&gt;ELSE&lt;/span&gt; &lt;span class="s1"&gt;'Order total is $20 or more'&lt;/span&gt; 
&lt;span class="k"&gt;END&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;sales_threshold&lt;/span&gt; 
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Common Table Expressions (CTEs)&lt;/strong&gt;&lt;br&gt;
A CTE defines a temporary, named result set that exists only for the duration of a query. The statement that follows can then reference it like a table to extract the information we need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- weekly_orders is the temporary table&lt;/span&gt;
&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;weekly_orders&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt;
        &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;DATE_PART&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'week'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order_date&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;week&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;order_numbers&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;week&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_numbers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;weekly_orders&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
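
&lt;p&gt;For readers who want to run a CTE end to end, here is a minimal sketch using Python's built-in &lt;code&gt;sqlite3&lt;/code&gt; module. Two things are assumptions for the sake of a runnable example: the four sample orders are made up, and SQLite's &lt;code&gt;strftime('%W', ...)&lt;/code&gt; stands in for &lt;code&gt;DATE_PART('week', ...)&lt;/code&gt;, which SQLite does not provide.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import sqlite3

# Hypothetical sample data: customer 1 orders twice in one week and once in the next.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    INSERT INTO orders VALUES
        (1, 1, '2023-01-02'),
        (2, 1, '2023-01-03'),
        (3, 1, '2023-01-10'),
        (4, 2, '2023-01-04');
    """
)

# The CTE counts orders per customer per week; the outer query averages those counts.
rows = conn.execute(
    """
    WITH weekly_orders AS (
        SELECT
            customer_id,
            strftime('%W', order_date) AS week,
            COUNT(order_id) AS order_numbers
        FROM orders
        GROUP BY customer_id, week
    )
    SELECT customer_id, AVG(order_numbers) AS avg_weekly_orders
    FROM weekly_orders
    GROUP BY customer_id
    ORDER BY customer_id
    """
).fetchall()

print(rows)  # [(1, 1.5), (2, 1.0)]
conn.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Customer 1 has weekly counts of 2 and 1, which average to 1.5. The name used after &lt;code&gt;FROM&lt;/code&gt; in the outer query has to match the name given to the CTE after &lt;code&gt;WITH&lt;/code&gt;.&lt;/p&gt;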



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Further reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://pub.towardsai.net/useful-intermediate-sql-queries-for-data-science-408c724b67d0"&gt;TRIGGER &amp;amp; COALESCE function&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mode.com/sql-tutorial/sql-window-functions/"&gt;Window Functions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I hope you found this article helpful in your SQL journey. Practice questions will help you retain the information you have learned. Use platforms like HackerRank to level up your SQL skills.&lt;br&gt;
If you found this article helpful, make sure to like it or leave a comment.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>sql</category>
      <category>sqlfordatasccience</category>
    </item>
    <item>
      <title>Exploratory Data Analysis: Ultimate Guide</title>
      <dc:creator>Karen Ngala</dc:creator>
      <pubDate>Mon, 27 Feb 2023 10:04:05 +0000</pubDate>
      <link>https://dev.to/karen_ngala/exploratory-data-analysis-ultimate-guide-2olg</link>
      <guid>https://dev.to/karen_ngala/exploratory-data-analysis-ultimate-guide-2olg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Some terms can be confusing for beginners when used interchangeably in articles (even when they shouldn't). I thought it'd be neat to define them before we jump in.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Variable vs Value

&lt;ul&gt;
&lt;li&gt;In a dataset, a variable is a characteristic or attribute that is being measured or observed for each individual or unit in the dataset. For example, in a dataset of student grades, variables could include the &lt;em&gt;student's name&lt;/em&gt;, &lt;em&gt;class&lt;/em&gt;, &lt;em&gt;subject&lt;/em&gt;, and &lt;em&gt;test scores&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;On the other hand, a value is a specific measurement or observation of that variable for a particular individual or unit in the dataset. For example: if there were 20 students in the dataset, there would be 20 values for each variable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Column vs Feature

&lt;ul&gt;
&lt;li&gt;A column in a dataset can also be referred to as a feature. The variables we talked about appear as columns in a dataset, and these columns are considered features. Therefore, the terms "&lt;strong&gt;column&lt;/strong&gt;" and "&lt;strong&gt;feature&lt;/strong&gt;" can be used interchangeably to refer to a &lt;strong&gt;variable&lt;/strong&gt; or attribute in a dataset that is used to build a model.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What is covered in this guide:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is Exploratory Data Analysis?&lt;/li&gt;
&lt;li&gt;Why is it important?&lt;/li&gt;
&lt;li&gt;Common EDA techniques&lt;/li&gt;
&lt;li&gt;Types of EDA&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What is Exploratory Data Analysis?
&lt;/h2&gt;

&lt;p&gt;Exploratory Data Analysis (EDA) is a technique used by data professionals to examine and understand datasets before modelling them. Simply put, the goal of EDA is to discover underlying patterns, trends, relationships, structures, and anomalies in the data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;EDA plays &lt;strong&gt;two main&lt;/strong&gt; roles: cleaning data as well as understanding variables and the relationships between them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Analyzing data enables analysts to derive meaningful insights that help identify data cleaning issues, inform the choice of modelling technique, and guide hypothesis testing. EDA is an &lt;em&gt;iterative&lt;/em&gt; process consisting of activities such as data cleaning, manipulation, and visualization. The EDA process can be revisited at any stage of the data analysis process if need be.&lt;/p&gt;

&lt;h2&gt;
  
  
  Importance of EDA
&lt;/h2&gt;

&lt;p&gt;EDA allows data analysts to understand the data better by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;identifying important variables.&lt;/li&gt;
&lt;li&gt;understanding the relationships between variables.&lt;/li&gt;
&lt;li&gt;identifying issues in data that can affect the accuracy of your models, such as missing values and outliers.&lt;/li&gt;
&lt;li&gt;uncovering hidden patterns in a dataset that were not obvious to the naked eye.&lt;/li&gt;
&lt;li&gt;drawing new insights that affect associated hypotheses. These hypotheses are tested and explored to gain a better understanding of the dataset.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Components &amp;amp; Techniques in EDA
&lt;/h2&gt;

&lt;p&gt;The techniques and steps you choose to employ are determined by the task you are performing and the dataset you are working with. You may not need to follow all the steps below.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Understand the Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It is important to understand the nature of data you are working with. In this step, you need to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Import the libraries you will need for analysis&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Import Libraries
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next natural step is to load your data into your working environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = pd.read_csv("file.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Conduct preliminary analyses on the data. This involves answering the following questions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;a. What is the size of my dataset and what are the variable data types?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;data.shape &lt;span class="c"&gt;# returns the number of rows by the number of columns in the dataset&lt;/span&gt;

data.columns

data.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;b. What does my data look like?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;data.head&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c"&gt;# view first few records of data&lt;/span&gt;

data.describe&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c"&gt;# summarizes the count, mean, standard deviation, min, and max for numeric variables&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;c. Are there any missing values?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;data.isnull&lt;span class="o"&gt;()&lt;/span&gt;.sum&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c"&gt;#check for missing values&lt;/span&gt;

data.info&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c"&gt;# show the data types of each attribute&lt;/span&gt;

&lt;span class="c"&gt;#Checking for wrong entries (symbols -,? # *)&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;col &lt;span class="k"&gt;in &lt;/span&gt;data.columns:
    print&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'{} : {}'&lt;/span&gt;.format&lt;span class="o"&gt;(&lt;/span&gt;col,data[col].unique&lt;span class="o"&gt;()))&lt;/span&gt;

data.&amp;lt;column_name&amp;gt;.unique&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="c"&gt;# applied to a column of data to return a list of unique values in that column.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There can be many reasons for missing values, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There was no response recorded&lt;/li&gt;
&lt;li&gt;Error while recording the data&lt;/li&gt;
&lt;li&gt;Error in reading the data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Categorize your values:&lt;/strong&gt; &lt;br&gt;&lt;br&gt;
After finding the missing values in your data, you need to determine what category the values fall in. This will help you determine the best method of handling the missing values as well as help you determine the statistical and visualization methods that can work with your dataset.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Categorical variables&lt;/strong&gt; take one of a fixed set of labels, such as a color or a country.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous variables&lt;/strong&gt; can take any numeric value within a range, such as a height or a price.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discrete variables&lt;/strong&gt; take countable numeric values, such as the number of doors on a car.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How we handle missing values depends on the situation itself and the relations these variables have with other variables. We can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Delete&lt;/strong&gt; all rows with missing values from the dataset before training the model.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Impute&lt;/strong&gt; the missing values, that is, fill them in using one of several methods.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ways of imputing missing values:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;For &lt;strong&gt;continuous&lt;/strong&gt; data, you can:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace the missing value with the mean, median or mode value&lt;/li&gt;
&lt;li&gt;Train a linear model to predict the missing value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;For &lt;strong&gt;categorical&lt;/strong&gt; data, you can:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace the missing value with the mode value&lt;/li&gt;
&lt;li&gt;Train a classification model to predict the missing value&lt;/li&gt;
&lt;/ul&gt;
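
&lt;p&gt;The simplest imputation options above can be sketched as follows. This is a minimal illustration rather than a recommended recipe: the DataFrame &lt;code&gt;df&lt;/code&gt; and its &lt;code&gt;age&lt;/code&gt; and &lt;code&gt;city&lt;/code&gt; columns are made up for the example, and the median is used for the continuous column purely as one reasonable choice.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 35.0],                   # continuous
    "city": ["Nairobi", "Mombasa", None, "Nairobi"],   # categorical
})

# Continuous column: fill gaps with the median of the observed values.
age_median = df["age"].median()
df["age"] = df["age"].fillna(age_median)

# Categorical column: fill gaps with the mode (the most frequent value).
city_mode = df["city"].mode()[0]
df["city"] = df["city"].fillna(city_mode)

print(df)
```

&lt;p&gt;Assigning the result of &lt;code&gt;fillna()&lt;/code&gt; back to the column, instead of relying on &lt;code&gt;inplace=True&lt;/code&gt;, tends to behave more predictably across pandas versions.&lt;/p&gt;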
&lt;h3&gt;
  
  
  &lt;strong&gt;2. Clean the Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The steps above are some of the many ways to understand the data you are working with. The insights gained are used in this step to correct issues in your dataset and make it more usable. &lt;br&gt;
a. Remove redundant variables&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cleaned_data &lt;span class="o"&gt;=&lt;/span&gt; cleaned_data.copy&lt;span class="o"&gt;()&lt;/span&gt;.drop&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="s1"&gt;'variableA'&lt;/span&gt;,&lt;span class="s1"&gt;'variableB'&lt;/span&gt;,&lt;span class="s1"&gt;'variableC'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;, &lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;b. Remove rows with null values&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Using dropna(axis=0) to drop rows with null values&lt;/span&gt;
cleaned_data &lt;span class="o"&gt;=&lt;/span&gt; cleaned_data.dropna&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0&lt;span class="o"&gt;)&lt;/span&gt;
cleaned_data.shape &lt;span class="c"&gt;# to see the change in dataset size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;c. Remove outliers&lt;br&gt;
You can identify outliers by visualization (discussed later in the article), z-score method, interquartile range method, and machine learning-based methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Outliers&lt;/strong&gt; are data points that are &lt;em&gt;noticeably different&lt;/em&gt; from the rest. They can result from errors in measurement, bad data collection, or variables not considered when collecting the data.&lt;br&gt;
Under the interquartile range method, X is an outlier if it satisfies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;X &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;Q3 + 1.5&lt;span class="k"&gt;*&lt;/span&gt;IQR&lt;span class="o"&gt;)&lt;/span&gt; OR X &amp;lt; &lt;span class="o"&gt;(&lt;/span&gt;Q1-1.5&lt;span class="k"&gt;*&lt;/span&gt;IQR&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="c"&gt;# where:&lt;/span&gt;
&lt;span class="c"&gt;# Q1: median for first 25% observation when sorted in ascending order&lt;/span&gt;
&lt;span class="c"&gt;# Q2: median for last 25% observation when sorted in ascending order&lt;/span&gt;
&lt;span class="c"&gt;# Q3: median of all observation&lt;/span&gt;
&lt;span class="c"&gt;# IQR: Inter quartile range = Q3-Q1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
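
&lt;p&gt;The IQR criterion above can be applied with pandas roughly as follows. This is a sketch under assumptions: the &lt;code&gt;prices&lt;/code&gt; Series is invented for illustration, and the quartiles use the default linear interpolation of &lt;code&gt;Series.quantile()&lt;/code&gt;, which can differ slightly from hand-computed quartiles on small samples.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample; 95 is the value we expect to be flagged.
prices = pd.Series([10, 12, 11, 13, 12, 14, 95])

q1 = prices.quantile(0.25)   # first quartile (25th percentile)
q3 = prices.quantile(0.75)   # third quartile (75th percentile)
iqr = q3 - q1                # interquartile range

lower = q1 - 1.5 * iqr       # lower fence
upper = q3 + 1.5 * iqr       # upper fence

# between() is inclusive, so negating it keeps only values outside the fences.
outliers = prices[~prices.between(lower, upper)]
print(outliers.tolist())
```

&lt;p&gt;Dropping the &lt;code&gt;~&lt;/code&gt; keeps the non-outliers instead, which is one way to filter them out before further processing.&lt;/p&gt;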



&lt;p&gt;So, what do you do when you have skewed data and outliers? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replace outlier values with more suitable values using quartile or interquartile range (IQR) methods.&lt;/li&gt;
&lt;li&gt;Use a machine learning model that is less sensitive to outliers, e.g. a Naive Bayes Classifier or Decision Tree Regressor.&lt;/li&gt;
&lt;li&gt;Use a lot of training data to improve the signal-to-noise ratio. Outliers have less impact on the statistical average when you are working with a lot of data.&lt;/li&gt;
&lt;li&gt;Remove outliers by excluding them from further processing.&lt;/li&gt;
&lt;li&gt;Use transformation methods to reduce skewness and bring your data closer to a normal distribution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Normalization:&lt;/strong&gt;&lt;br&gt;
Transformation methods reduce skewness and dampen the effect of outliers, bringing the dataset closer to a normal distribution. Some methods of variable transformation include log, square root, and Box-Cox. For example, the value of x can be replaced by its &lt;strong&gt;log&lt;/strong&gt; value. Missing values, in turn, can be replaced by the column mean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Replacing missing values with mean:&lt;/span&gt;
num_col &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'columnA'&lt;/span&gt;, &lt;span class="s1"&gt;'columnB'&lt;/span&gt;,  &lt;span class="s1"&gt;'columnC'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;col &lt;span class="k"&gt;in &lt;/span&gt;num_col:
    data[col]&lt;span class="o"&gt;=&lt;/span&gt;pd.to_numeric&lt;span class="o"&gt;(&lt;/span&gt;data[col]&lt;span class="o"&gt;)&lt;/span&gt;
    data[col] &lt;span class="o"&gt;=&lt;/span&gt; data[col].fillna&lt;span class="o"&gt;(&lt;/span&gt;data[col].mean&lt;span class="o"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Normalization puts all features on a similar scale, which helps models weigh them fairly. If some features in a dataset are much larger in scale than others, they dominate the rest, leading to inaccurate results. Un-normalized inputs can also cause your model to get stuck in very flat regions, which can stop it from learning.&lt;/p&gt;
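
&lt;p&gt;As a rough sketch of the two ideas in this section, the snippet below applies a log transform to a skewed column and then rescales columns to a common scale. The DataFrame and its &lt;code&gt;income&lt;/code&gt; and &lt;code&gt;rooms&lt;/code&gt; columns are hypothetical, and z-score standardization is just one of several scaling choices (min-max scaling is another).&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical data: a right-skewed feature next to a small-scale feature.
df = pd.DataFrame({
    "income": [20_000, 25_000, 30_000, 400_000],
    "rooms": [2, 3, 3, 4],
})

# log1p computes log(1 + x), which compresses very large values.
df["income_log"] = np.log1p(df["income"])

# Z-score standardization: subtract the mean, divide by the standard deviation.
for col in ["income_log", "rooms"]:
    df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std()

print(df.round(2))
```

&lt;p&gt;After this step, each standardized column has a mean of roughly 0 and a standard deviation of roughly 1, so neither feature dominates because of its units.&lt;/p&gt;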

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Analyze variable relationships&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Correlation Matrix:&lt;/strong&gt;&lt;br&gt;
A correlation matrix is a table that shows how strongly different pairs of variables in a dataset are related to each other. Two variables have a: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Positive correlation&lt;/strong&gt; when one goes up and the other goes up.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Negative correlation&lt;/strong&gt; when one goes up and the other goes down.&lt;/li&gt;
&lt;li&gt;or &lt;strong&gt;no&lt;/strong&gt; relationship between them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the fastest way to get a general understanding of &lt;strong&gt;all&lt;/strong&gt; your variables. It helps you identify which variables are important for predicting or explaining a particular outcome of interest.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# calculate correlation matrix&lt;/span&gt;
plt.figure&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;10,10&lt;span class="o"&gt;))&lt;/span&gt;
sns.heatmap&lt;span class="o"&gt;(&lt;/span&gt;cleaned_data.corr&lt;span class="o"&gt;()&lt;/span&gt;,xticklabels&lt;span class="o"&gt;=&lt;/span&gt;corr.columns, &lt;span class="nv"&gt;yticklabels&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;corr.columns, &lt;span class="nv"&gt;annot&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True, &lt;span class="nv"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sns.diverging_palette&lt;span class="o"&gt;(&lt;/span&gt;220, 20, &lt;span class="nv"&gt;as_cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;True&lt;span class="o"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
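
&lt;p&gt;To see how positive and negative correlations show up in the numbers themselves, here is a small sketch without the plotting step. The &lt;code&gt;cars&lt;/code&gt; DataFrame and its values are invented for illustration.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical car data, made up for illustration.
cars = pd.DataFrame({
    "horsepower": [70, 90, 110, 150, 200],
    "price": [8_000, 10_000, 13_000, 18_000, 26_000],
    "mpg": [40, 35, 30, 24, 18],
})

corr = cars.corr()  # Pearson correlation by default
print(corr.round(2))

# Pull out single pairs to read their sign.
hp_price = corr.loc["horsepower", "price"]  # rises together: positive
hp_mpg = corr.loc["horsepower", "mpg"]      # moves in opposite directions: negative
print(hp_price, hp_mpg)
```

&lt;p&gt;Values close to 1 or -1 indicate a strong linear relationship, while values near 0 suggest little or no linear relationship.&lt;/p&gt;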



&lt;p&gt;&lt;strong&gt;Visualization:&lt;/strong&gt;&lt;br&gt;
By drawing visual representations of your data, such as histograms, scatter plots and pie charts, you can get a better understanding of the distribution of your data. Further, visualization helps in identifying patterns and detecting outliers in a dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I know which charts to generate?&lt;/strong&gt; &lt;br&gt;
Visualizations are all about asking analytical questions. Once you understand your data, such as its columns (also known as features), you can ask questions to understand the relationships between them.&lt;/p&gt;

&lt;p&gt;For example, if you have a dataset containing different car features such as horsepower, engine quality and price, you can ask: "How does engine quality affect price?" From this question, you can generate a scatter plot or box plot to show their relationship.&lt;br&gt;
&lt;strong&gt;1. Histogram&lt;/strong&gt; - shows how often the values of a variable fall into each bin (range of values).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cleaned_data[&lt;span class="s1"&gt;'columnX'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;.plot&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'hist'&lt;/span&gt;, &lt;span class="nv"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;50, &lt;span class="nv"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;12,6&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;facecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'grey'&lt;/span&gt;,edgecolor&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'black'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
cleaned_data[&lt;span class="s1"&gt;'columnY'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;.plot&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'hist'&lt;/span&gt;, &lt;span class="nv"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;20, &lt;span class="nv"&gt;figsize&lt;/span&gt;&lt;span class="o"&gt;=(&lt;/span&gt;12,6&lt;span class="o"&gt;)&lt;/span&gt;, &lt;span class="nv"&gt;facecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'grey'&lt;/span&gt;,edgecolor&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'black'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Pie Chart&lt;/strong&gt; - commonly used to display the distribution of a single categorical variable as a percentage of a whole.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;data[&lt;span class="s1"&gt;'columnA'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;.value_counts&lt;span class="o"&gt;()&lt;/span&gt;.iloc[:5].plot.pie&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;autopct&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"%1.2f%%"&lt;/span&gt;,fontsize&lt;span class="o"&gt;=&lt;/span&gt;13,startangle&lt;span class="o"&gt;=&lt;/span&gt;90,labels&lt;span class="o"&gt;=[&lt;/span&gt;&lt;span class="s1"&gt;''&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="k"&gt;*&lt;/span&gt;5, &lt;span class="nv"&gt;cmap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'Set2'&lt;/span&gt;,explode&lt;span class="o"&gt;=[&lt;/span&gt;0.05] &lt;span class="k"&gt;*&lt;/span&gt; 5,pctdistance&lt;span class="o"&gt;=&lt;/span&gt;1.2&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Box Plot&lt;/strong&gt; - visualizes the distribution of a variable through its median, quartiles, and outliers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cleaned_data.boxplot&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'columnA'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NjLY6kS4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:396/format:webp/0%2AGWZmD1Z7JuZDHlvC.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NjLY6kS4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/v2/resize:fit:396/format:webp/0%2AGWZmD1Z7JuZDHlvC.png" width="198" height="358"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A box plot can also be used to compare two variables. From the box plot below, the median price of a vehicle with two doors is about 10000, and the median price of a vehicle with four doors is about 12000.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sns.boxplot&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'price'&lt;/span&gt;,y&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'num_of_doors'&lt;/span&gt;,data&lt;span class="o"&gt;=&lt;/span&gt;auto&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Usgtmpdy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://editor.analyticsvidhya.com/uploads/17129blog19.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Usgtmpdy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://editor.analyticsvidhya.com/uploads/17129blog19.PNG" width="520" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Scatter plots&lt;/strong&gt; - plot the values of two variables along two axes. Like a correlation matrix, a scatter plot shows the relationship between variables and helps identify outliers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cleaned_data.plot&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;kind&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'scatter'&lt;/span&gt;, &lt;span class="nv"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'columnA'&lt;/span&gt;, &lt;span class="nv"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'columnB'&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sns.pairplot&lt;span class="o"&gt;(&lt;/span&gt;cleaned_data&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="c"&gt;# creates scatter plots between all of your variables.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Types of EDA
&lt;/h2&gt;

&lt;p&gt;There are a few types of EDA techniques:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Univariate analysis:&lt;/strong&gt; This involves examining the distribution of a single variable. The goal is to understand the central tendency (mean, median, mode), variability (range, interquartile range, standard deviation), and shape (skewness, kurtosis) of the variable.&lt;br&gt;
When exploring a single variable, we can use the following methods:&lt;br&gt;
a. For &lt;strong&gt;continuous&lt;/strong&gt; data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tabular method: describe central tendencies, dispersion, and missing values.&lt;/li&gt;
&lt;li&gt;Graphical method: &lt;em&gt;histograms&lt;/em&gt; for distribution and &lt;em&gt;box plots&lt;/em&gt; for detecting outliers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;b. For &lt;strong&gt;categorical&lt;/strong&gt; variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tabular method: the &lt;code&gt;.value_counts()&lt;/code&gt; method in pandas gives a tabular form of frequencies.&lt;/li&gt;
&lt;li&gt;Graphical method: a &lt;em&gt;bar plot&lt;/em&gt; is usually the best graph for a categorical variable.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bivariate analysis:&lt;/strong&gt; This involves analyzing the relationship between two variables. The goal is to understand how changes in one variable affect changes in another variable. Common bivariate analysis techniques include scatter plots, line charts, and correlation analysis. &lt;br&gt;
When exploring two variables, we can use the following methods:&lt;br&gt;
a. For &lt;strong&gt;continuous&lt;/strong&gt; data: &lt;em&gt;scatter plots&lt;/em&gt; and &lt;em&gt;correlation analysis&lt;/em&gt;.&lt;br&gt;
b. For &lt;strong&gt;categorical-continuous&lt;/strong&gt; types: &lt;em&gt;bar plots&lt;/em&gt; and &lt;em&gt;t-tests&lt;/em&gt;. &lt;br&gt;
c. For &lt;strong&gt;Categorical-categorical&lt;/strong&gt; types: use &lt;em&gt;Two-way table&lt;/em&gt; and &lt;em&gt;Chi-square test&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multivariate analysis:&lt;/strong&gt; This involves analyzing the relationship between multiple variables. The goal is to understand how multiple variables interact with each other and to identify any patterns or relationships that may exist. Common multivariate analysis techniques include principal component analysis (PCA) and factor analysis.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
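
&lt;p&gt;The tabular methods for univariate and bivariate analysis listed above might look like this in pandas. It is only a sketch: the &lt;code&gt;cars&lt;/code&gt; DataFrame and its &lt;code&gt;doors&lt;/code&gt;, &lt;code&gt;fuel&lt;/code&gt; and &lt;code&gt;price&lt;/code&gt; columns are hypothetical, and the statistical tests (t-test, chi-square) are left out.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical dataset, made up for illustration.
cars = pd.DataFrame({
    "doors": ["two", "four", "four", "two", "four", "four"],
    "fuel": ["gas", "gas", "diesel", "gas", "diesel", "gas"],
    "price": [9_000, 12_000, 15_000, 10_000, 14_000, 11_000],
})

# Univariate, continuous: central tendency and spread in one table.
print(cars["price"].describe())

# Univariate, categorical: frequency table.
door_counts = cars["doors"].value_counts()
print(door_counts)

# Bivariate, categorical-continuous: compare the mean price per group.
mean_price = cars.groupby("doors")["price"].mean()
print(mean_price)

# Bivariate, categorical-categorical: two-way table of counts.
two_way = pd.crosstab(cars["doors"], cars["fuel"])
print(two_way)
```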

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I hope this article gave you a better understanding of Exploratory Data Analysis and how to apply EDA techniques to your dataset.&lt;/p&gt;

&lt;p&gt;Feedback is very welcome and highly appreciated.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>exploratorydataanalysis</category>
      <category>dataanalysis</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Python 101: Introduction to Python for Data Science</title>
      <dc:creator>Karen Ngala</dc:creator>
      <pubDate>Fri, 17 Feb 2023 11:23:50 +0000</pubDate>
      <link>https://dev.to/karen_ngala/python-101-introduction-to-python-for-data-science-2chf</link>
      <guid>https://dev.to/karen_ngala/python-101-introduction-to-python-for-data-science-2chf</guid>
      <description>&lt;p&gt;A big dilemma many techies face when picking up a new skill is "what language or tool should I use, and why?". This dilemma of choice is popularly known as "analysis paralysis" or "choice overload." You may feel overwhelmed by the options available to you, which can lead to indecision and a feeling of being stuck. I've been there.&lt;/p&gt;

&lt;p&gt;Going into data science, you have the option of learning many languages, including Python, R, Java, and Julia, just to name a few. The choice you make should be based on your specific goals, background, and preferences, not on peer influence. &lt;strong&gt;So, why Python?&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It has a simple and intuitive syntax.&lt;/li&gt;
&lt;li&gt;Python has developed a deep ecosystem around data science. It has a large and active community of volunteers who create and contribute to a wealth of data science libraries such as matplotlib, sklearn, pandas, and numpy.&lt;/li&gt;
&lt;li&gt;Python can be applied widely beyond data science, in areas such as web development.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Setting up a Python environment
&lt;/h2&gt;

&lt;p&gt;Before jumping into the deep end, you need to set up your computer in a way that allows you to write and run code. First, check that you have Python installed using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you have Python, the output should be the version you have installed, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Python 3.8.5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you do not, you will get an error. You can download the latest Python version from the &lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;official Python website&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A good place to start for beginners is using Anaconda as the environment for your Data Science workflow. Package conflicts in a Python environment can be a nightmare to deal with. &lt;a href="https://docs.anaconda.com/anaconda/install/" rel="noopener noreferrer"&gt;Anaconda&lt;/a&gt; helps you navigate this and houses required tools, such as Jupyter Notebook. You can later move on to using virtual environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;TOOLS:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Jupyter Notebook&lt;/strong&gt; &lt;br&gt;
is an open-source web application that allows data scientists, like yourself, to create and share live code and visualizations. Each notebook contains executable cells and text descriptions. This makes it easy for people to interact with and understand the code from start to end.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Colab&lt;/strong&gt; &lt;br&gt;
Also known as &lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;Colaboratory&lt;/a&gt;, this is a Jupyter notebook environment that runs entirely in the cloud and requires no setup. It allows users to load notebooks from public GitHub repos and save them back to GitHub. A copy of each notebook is saved to your Google Drive.&lt;/p&gt;
&lt;h2&gt;
  
  
  Python Basics
&lt;/h2&gt;

&lt;p&gt;Learning the language entails first understanding the syntax and rules of Python as a programming language. I will summarize some of the fundamentals of working with Python. For absolute beginners, it would be beneficial to seek further resources and materials. The following are great places to start:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.programiz.com/python-programming/first-program" rel="noopener noreferrer"&gt;Getting Started with Python&lt;/a&gt; on Programiz&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.python.org/about/gettingstarted/" rel="noopener noreferrer"&gt;Python For Beginners&lt;/a&gt; on python.org&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://realpython.com/python-first-steps/" rel="noopener noreferrer"&gt;How to Use Python: Your First Steps&lt;/a&gt; by Leodanis Pozo Ramos on Real Python&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  1. Variables &amp;amp; Data types
&lt;/h3&gt;

&lt;p&gt;A variable is a named reference to a value that can be changed during program execution. Assigning a value to a variable is done using the assignment operator (=).&lt;br&gt;
A &lt;strong&gt;data type&lt;/strong&gt; describes the kind of value assigned to a variable. Python's basic data types include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;integer (a whole number with no decimal part)&lt;/li&gt;
&lt;li&gt;string (alphanumeric text)&lt;/li&gt;
&lt;li&gt;float (a number with a decimal part)&lt;/li&gt;
&lt;li&gt;boolean (value can only be &lt;code&gt;True&lt;/code&gt; or &lt;code&gt;False&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;
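
&lt;p&gt;A short sketch of these data types in action is shown below. The variable names and values are arbitrary examples.&lt;/p&gt;

```python
# Each assignment binds a name to a value; type() reports the data type.
age = 30            # integer
height = 1.75       # float
name = "Alice"      # string
is_student = True   # boolean

for value in (age, height, name, is_student):
    print(value, type(value).__name__)

# A variable can be rebound to a new value, even one of a different type.
age = "thirty"
print(type(age).__name__)
```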
&lt;h3&gt;
  
  
  &lt;strong&gt;Data Structures:&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;list&lt;/strong&gt; - a collection of values that is &lt;u&gt;ordered&lt;/u&gt; and &lt;u&gt;changeable&lt;/u&gt;. Syntax-wise, it uses square brackets: &lt;code&gt;my_list = [1, 2, 3, 4]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tuple&lt;/strong&gt; - similar to a list, but its values &lt;u&gt;cannot&lt;/u&gt; be changed once created. Syntax-wise, it uses parentheses: &lt;code&gt;my_tuple = (1, 2, 3, 4)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;dictionary&lt;/strong&gt; - a &lt;u&gt;changeable&lt;/u&gt; collection of &lt;strong&gt;key-value pairs&lt;/strong&gt;. Since Python 3.7, dictionaries keep their keys in insertion order. Syntax-wise, it uses curly braces: &lt;code&gt;my_dict = {'name': 'John', 'age': 30}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;set&lt;/strong&gt; - an unordered collection of unique values. Example: &lt;code&gt;my_set = {1, 2, 3, 4}&lt;/code&gt;. Values in a set will &lt;u&gt;never repeat&lt;/u&gt;.
&lt;/li&gt;
&lt;/ul&gt;
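
&lt;p&gt;The differences between these structures are easiest to see by trying to change them. The snippet below is a small illustration that reuses the example values from the list above.&lt;/p&gt;

```python
my_list = [1, 2, 3, 4]
my_list.append(5)            # lists can grow
my_list[0] = 10              # and change in place

my_tuple = (1, 2, 3, 4)
try:
    my_tuple[0] = 10         # tuples reject item assignment
except TypeError as err:
    tuple_error = str(err)
    print("Cannot change a tuple:", tuple_error)

my_dict = {"name": "John", "age": 30}
my_dict["age"] = 31          # update the value stored under a key

my_set = {1, 2, 2, 3, 3, 4}  # duplicates collapse on creation

print(my_list, my_tuple, my_dict, my_set)
```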
&lt;h3&gt;
  
  
  2. Operators
&lt;/h3&gt;

&lt;p&gt;The symbols used for mathematical and logical operations are pretty straight-forward in Python. &lt;code&gt;+&lt;/code&gt; for addition, &lt;code&gt;-&lt;/code&gt; for subtraction, &lt;code&gt;*&lt;/code&gt; for multiplication, and &lt;code&gt;/&lt;/code&gt; for division. &lt;code&gt;==&lt;/code&gt; for checking value equality, &lt;code&gt;!=&lt;/code&gt; for not  equal and &lt;code&gt;&amp;lt;&lt;/code&gt; for less than, and &lt;code&gt;&amp;gt;&lt;/code&gt; for greater than.&lt;/p&gt;
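
&lt;p&gt;As a quick illustration of the arithmetic and equality operators, consider the sketch below. The values of &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are arbitrary.&lt;/p&gt;

```python
a = 7
b = 2

print(a + b)   # addition
print(a - b)   # subtraction
print(a * b)   # multiplication
print(a / b)   # division always returns a float in Python 3

total = a + b
quotient = a / b
same = (a == b)        # equality check gives a boolean
different = (a != b)   # inequality check gives a boolean
print(same, different)
```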
&lt;h3&gt;
  
  
  3. Logic &amp;amp; Process Flow
&lt;/h3&gt;

&lt;p&gt;The first thing to note here is &lt;strong&gt;indentation&lt;/strong&gt;. Python follows a strict indentation rule when it comes to blocks of code. While other languages use markers such as curly braces, Python relies on indentation level to decide which statements belong to a block.&lt;br&gt;
&lt;strong&gt;Conditions&lt;/strong&gt;&lt;br&gt;
They are used to execute a block of code based on whether a certain condition is true or false. For example, &lt;code&gt;if... else&lt;/code&gt; is a conditional statement that executes the first block of statements if the condition is true and the statements after &lt;code&gt;else&lt;/code&gt; if the condition is false. For multiple conditions, the &lt;code&gt;if... elif&lt;/code&gt; statement can be used.&lt;br&gt;
&lt;strong&gt;Loops&lt;/strong&gt; &lt;br&gt;
They are used to repeat a block of code multiple times. Python has the &lt;code&gt;for&lt;/code&gt; loop and the &lt;code&gt;while&lt;/code&gt; loop. &lt;em&gt;For loops&lt;/em&gt; iterate over a sequence, while &lt;em&gt;while loops&lt;/em&gt; repeat a block of code &lt;u&gt;as long as&lt;/u&gt; a specific condition remains true.&lt;br&gt;
&lt;strong&gt;Functions&lt;/strong&gt; &lt;br&gt;
They are used to group together a set of instructions that can be called multiple times elsewhere in a program. Functions are defined using the &lt;code&gt;def&lt;/code&gt; keyword, followed by the function name and the input parameters. They can return a value or simply perform an action.&lt;br&gt;
For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;greet_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, Alice!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, stranger!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
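
&lt;p&gt;The conditions and loops described above can be combined as in the following sketch. The lists and labels are made up for the example.&lt;/p&gt;

```python
# for loop with if / elif / else: exactly one branch runs per item.
labels = []
for dtype in ["int", "float", "str", "bool"]:
    if dtype == "int":
        labels.append("whole number")
    elif dtype == "float":
        labels.append("decimal number")
    else:
        labels.append("other")

# while loop: repeats for as long as its condition stays true.
countdown = 3
ticks = []
while countdown != 0:
    ticks.append(countdown)
    countdown -= 1

print(labels)
print(ticks)
```

&lt;p&gt;Note how indentation alone marks which statements belong to each loop and each branch.&lt;/p&gt;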



&lt;p&gt;&lt;strong&gt;Classes and objects&lt;/strong&gt; &lt;br&gt;
Python supports object-oriented programming (OOP), a programming paradigm that organizes code into reusable and modular components. &lt;br&gt;
A &lt;strong&gt;class&lt;/strong&gt; is a blueprint for creating objects that share the same attributes and behaviours. &lt;br&gt;
&lt;strong&gt;Objects&lt;/strong&gt; are instances of a class that are created using the class constructor. They can have attributes, which are variables that store data, and methods, which are functions that can be called on the object.&lt;br&gt;
In the following example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Class: Rectangle&lt;br&gt;
Object: my_rectangle&lt;br&gt;
&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Rectangle&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;length&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;area&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;length&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;

&lt;span class="n"&gt;my_rectangle&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Rectangle&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_rectangle&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;area&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;  &lt;span class="c1"&gt;# Output: 20
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Understanding OOP will be important when interacting with the libraries used in data science.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. File Handling
&lt;/h3&gt;

&lt;p&gt;File handling is an important part of data science: reading from and writing to files is a common task in both data science and data analysis.&lt;br&gt;
&lt;strong&gt;Reading a File&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;open()&lt;/code&gt; function is used to open a file (file.txt in this case) in &lt;code&gt;'r'&lt;/code&gt; mode. This mode specifies that the file should be opened in read-only mode. The &lt;code&gt;read()&lt;/code&gt; method reads the contents of &lt;em&gt;file.txt&lt;/em&gt; into the &lt;code&gt;contents&lt;/code&gt; variable.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;with&lt;/code&gt; keyword ensures that the file is closed automatically when the block ends, even if an error occurs while reading.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;contents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Writing to a File&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;'w'&lt;/code&gt; argument opens the file in write mode, which creates the file if it does not exist and overwrites its contents if it does. The &lt;code&gt;write()&lt;/code&gt; method then writes &lt;em&gt;"Hello, world!"&lt;/em&gt; to the file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;file.txt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Hello, world!&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other modes include &lt;code&gt;'a'&lt;/code&gt;, which opens the file in append mode. This allows additional text to be written at the end of the file without overwriting what is already there.&lt;/p&gt;
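
&lt;p&gt;As a minimal sketch of how the write, append and read modes fit together, the example below writes one line, appends a second, and reads both back. The file name &lt;code&gt;notes.txt&lt;/code&gt; and the temporary directory are assumptions made for illustration only.&lt;/p&gt;

```python
import os
import tempfile

# Hypothetical file name inside a throwaway temporary directory
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'w' creates the file (or empties an existing one) before writing
with open(path, "w") as f:
    f.write("first line\n")

# 'a' keeps the existing contents and writes at the end of the file
with open(path, "a") as f:
    f.write("second line\n")

# 'r' reads the file back; splitlines() gives one list element per line
with open(path, "r") as f:
    lines = f.read().splitlines()

print(lines)  # ['first line', 'second line']
```

&lt;p&gt;Opening the same file with &lt;code&gt;'w'&lt;/code&gt; a second time would have replaced the first line instead of adding to it.&lt;/p&gt;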

&lt;h2&gt;
  
  
  Loading and manipulating data in Python
&lt;/h2&gt;

&lt;p&gt;Data Science often requires working with large amounts of data, and the first step is loading that data into your program. There are several ways to do this, the most common being the &lt;strong&gt;Pandas library&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Pandas&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It is an open-source data analysis and manipulation library for Python. It offers fast and flexible data structures for working with structured and &lt;a href="https://www.tableau.com/learn/articles/time-series-analysis#:~:text=Time%20series%20data%20is%20data%20that%20is%20recorded%20over%20consistent,data%20and%20cross%2Dsectional%20data." rel="noopener noreferrer"&gt;time series data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Install the pandas library by running the following command in your terminal or command prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;pandas

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pandas offers two primary data structures: Series and DataFrame. A Series is a one-dimensional labelled array.&lt;/p&gt;
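
&lt;p&gt;A short sketch of a Series may help here. The names and ages are made-up sample values; the example shows lookup by label, lookup by position, and a simple aggregate.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data: ages labelled by name
ages = pd.Series([25, 30, 35], index=["Alice", "Bob", "Charlie"], name="age")

print(ages["Bob"])   # look up a value by its label
print(ages.iloc[0])  # look up a value by its position
print(ages.mean())   # aggregate over the whole Series
```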

&lt;p&gt;A DataFrame is a 2D table-like data structure in Pandas. It is similar to a spreadsheet or SQL table in that it consists of rows and columns. You access data in a DataFrame by its row and column labels. Rows are labelled with an index, and the columns are labelled with column names. You can then load data into a pandas DataFrame as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Replace 'data.csv' with the name of your file
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pandas offers an array of data manipulation tools such as filtering, grouping, merging, reshaping and pivoting data, as well as time series analysis. It is worth reading further on these. Below are a few examples:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Print the first few rows of the DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Output:
&lt;/span&gt;&lt;span class="c1"&gt;#       name  age gender
&lt;/span&gt;&lt;span class="c1"&gt;# 0    Alice   25      F
&lt;/span&gt;&lt;span class="c1"&gt;# 1      Bob   30      M
&lt;/span&gt;&lt;span class="c1"&gt;# 2  Charlie   35      M&lt;/span&gt;

&lt;span class="c1"&gt;# Filter the DataFrame to only include rows where the 'age' column is greater than 30
&lt;/span&gt;&lt;span class="n"&gt;filtered_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;age&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Group by the 'gender' column and compute the mean 'salary' per group (assumes the DataFrame also has a 'salary' column)
&lt;/span&gt;&lt;span class="n"&gt;grouped_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;filtered_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gender&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;salary&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
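
&lt;p&gt;The snippets above rely on a DataFrame loaded from a file. As a self-contained sketch, the example below builds a small DataFrame in memory and applies the same filtering and grouping steps. All values, including the &lt;code&gt;salary&lt;/code&gt; column, are made up for illustration, and &lt;code&gt;.gt(30)&lt;/code&gt; is simply the method form of the "greater than 30" comparison.&lt;/p&gt;

```python
import pandas as pd

# Hypothetical data; the salary values are invented for this example
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dana"],
    "age": [25, 30, 35, 40],
    "gender": ["F", "M", "M", "F"],
    "salary": [50000, 60000, 70000, 80000],
})

# Keep only the rows where age is greater than 30
filtered_df = df[df["age"].gt(30)]

# Mean salary per gender among the remaining rows
mean_salary = filtered_df.groupby("gender")["salary"].mean()
print(mean_salary)
```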



&lt;h3&gt;
  
  
  &lt;strong&gt;Numpy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;NumPy is also a data analysis and manipulation library. It differs from pandas in that a NumPy array stores elements of a single (homogeneous) data type, while a pandas DataFrame can hold a different data type in each column (heterogeneous data). Read more about &lt;a href="https://datageek-prabhakarpandey.medium.com/pythonic-way-of-storing-your-data-f9bd7a5f30f5" rel="noopener noreferrer"&gt;Homogeneous vs Heterogeneous data types&lt;/a&gt;.&lt;/p&gt;
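
&lt;p&gt;The difference can be seen by inspecting data types directly. This is a small sketch with made-up values and hypothetical column names: mixing a number and a string in one NumPy array forces a single common type, while each DataFrame column keeps its own.&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype; mixing an int, a float and a string
# makes NumPy convert every element to a string
mixed = np.array([1, 2.5, "three"])
print(mixed.dtype.kind)  # 'U' means a Unicode string dtype

# A DataFrame keeps a separate dtype for each column
df = pd.DataFrame({"quantity": [1, 2], "price": [2.5, 3.0], "label": ["a", "b"]})
print(df.dtypes)
```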

&lt;p&gt;Install the numpy library by running the following command in your terminal or command prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;numpy

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Numpy is the foundation for many other scientific computing and data science libraries in Python, such as Pandas.&lt;/p&gt;

&lt;p&gt;NumPy is a great library for statistical and mathematical operations. For example, you can compute the mean, median and standard deviation of a dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Create a dataset
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate the mean, median, and standard deviation
&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;median&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;median&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;std&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;std&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mean:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Median:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;median&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Standard deviation:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;std&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Resources for Numpy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://colab.research.google.com/github/computationalcore/introduction-to-python/blob/master/notebooks/6-numpy/PY0101EN-6-1-Numpy1D.ipynb" rel="noopener noreferrer"&gt;1D NumPy in Python&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://colab.research.google.com/github/computationalcore/introduction-to-python/blob/master/notebooks/6-numpy/PY0101EN-6-2-Numpy2D.ipynb" rel="noopener noreferrer"&gt;2D NumPy in Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Data Visualizations using Matplotlib&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Data visualization is a critical part of data science. It allows you to understand and communicate the insights derived from your data. Matplotlib provides a wide range of tools for creating different types of charts and plots, including line charts, bar charts, histograms, scatter plots, and more. It also offers customization through styles, shapes, and colors.&lt;/p&gt;

&lt;p&gt;Install matplotlib by running the following command in your terminal or command prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;matplotlib

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To demonstrate how Matplotlib works, let's create a simple line plot.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Import the library
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Some sample data
&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Plot the data to create a line chart
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;plot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add labels and title
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;x-axis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;y-axis&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Line Plot&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Display the chart
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To represent the relationship between the variables, you can create a &lt;strong&gt;scatter plot&lt;/strong&gt;. The only difference in the above code will be in plotting (and the title, of course). Replace &lt;code&gt;plt.plot(x, y)&lt;/code&gt; with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scatter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NumPy could be used in the above example to generate the data, for instance evenly spaced x values and their sine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;linspace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# This generates 100 data points for the x-axis 
&lt;/span&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# This calculates the corresponding y-axis values 
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a bar graph, on the other hand, you need labels and their corresponding values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="c1"&gt;# Data to be used
&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;A&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;B&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;C&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;D&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;E&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;values&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Create a bar chart
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Add labels and title
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Category&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bar Chart&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Display the chart
&lt;/span&gt;&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are many other charts that you can create using matplotlib such as histograms, scatter plots, pie charts, and more. It is worth exploring the &lt;a href="https://matplotlib.org/stable/plot_types/index.html" rel="noopener noreferrer"&gt;matplotlib documentation&lt;/a&gt; to familiarize yourself with the different charts.&lt;/p&gt;
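
&lt;p&gt;As one more illustration, here is a sketch of a histogram. The data is randomly generated and the seed, bin count and labels are arbitrary choices. The &lt;code&gt;Agg&lt;/code&gt; backend and the in-memory buffer are assumptions that let the script run without a display; in an interactive session you could drop them and call &lt;code&gt;plt.show()&lt;/code&gt; instead.&lt;/p&gt;

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: 1,000 draws from a standard normal distribution
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=0.0, scale=1.0, size=1000)

# hist() returns the count in each bin and the bin edges
fig, ax = plt.subplots()
counts, bin_edges, _ = ax.hist(values, bins=20)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.set_title("Histogram")

# Save the figure to an in-memory PNG instead of opening a window
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```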

&lt;h3&gt;
  
  
  &lt;strong&gt;Optimisation with SciPy&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;SciPy is a scientific computing library built on top of NumPy. It provides additional functionality for optimization, integration, interpolation, linear algebra, and more.&lt;/p&gt;

&lt;p&gt;The example below uses SciPy to perform a simple optimization problem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.optimize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;minimize_scalar&lt;/span&gt;

&lt;span class="c1"&gt;# Define the objective function (a quadratic function)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;objective&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;

&lt;span class="c1"&gt;# Find the minimum of the objective function 
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;minimize_scalar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;objective&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Print the minimum value and the corresponding value of x
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Minimum value:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fun&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Value of x at minimum:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;minimize_scalar()&lt;/code&gt; function searches for the minimum of a function of a single variable. The returned result stores the minimum value of the function in &lt;code&gt;result.fun&lt;/code&gt; and the value of x at which it occurs in &lt;code&gt;result.x&lt;/code&gt;. For this quadratic, the minimum is 1.75 at x = -1.5.&lt;/p&gt;

&lt;p&gt;This concept can be applied to more complex optimization problems, including those with multiple variables and constraints. Scipy is a powerful and versatile library with many scientific and engineering applications.&lt;/p&gt;
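
&lt;p&gt;As a rough sketch of the multi-variable case, &lt;code&gt;scipy.optimize.minimize&lt;/code&gt; accepts an objective that takes an array of parameters and a starting guess. The objective below and the starting point are made up for illustration; its minimum is at x = 1, y = -2 by construction.&lt;/p&gt;

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical two-variable objective with its minimum at (1, -2)
def objective(params):
    x, y = params
    return (x - 1) ** 2 + (y + 2) ** 2

# x0 is the starting guess; with no method or constraints given,
# SciPy chooses a default solver
result = minimize(objective, x0=np.array([0.0, 0.0]))

print("Minimum value:", result.fun)
print("Values of x and y at minimum:", result.x)
```

&lt;p&gt;The result is numerical, so expect values very close to, rather than exactly, 1 and -2.&lt;/p&gt;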

&lt;h2&gt;
  
  
  Statistical analysis in Python
&lt;/h2&gt;

&lt;p&gt;Statistical analysis involves interpreting, analyzing, and presenting collected data. Several Python libraries support it, and they can perform tasks such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hypothesis testing&lt;/strong&gt; — testing claims about the population based on a sample of data. This can be done using libraries such as &lt;em&gt;&lt;a href="https://scipy-lectures.org/packages/statistics/index.html" rel="noopener noreferrer"&gt;SciPy&lt;/a&gt;&lt;/em&gt;.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regression analysis&lt;/strong&gt; — modelling the relationship between two or more variables. For example, &lt;em&gt;Statsmodels&lt;/em&gt; can be used to perform a linear regression on a dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Descriptive statistics&lt;/strong&gt; — simple and quick summary of a dataset. &lt;em&gt;Numpy&lt;/em&gt; is used for summaries such as calculating the mean, median, and standard deviation of a dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Time series analysis&lt;/strong&gt; — modelling and forecasting time-dependent data. This can be done using libraries such as &lt;em&gt;Statsmodels&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Predictive Modelling&lt;/strong&gt; — libraries such as &lt;em&gt;Scikit-learn&lt;/em&gt; provide a range of machine learning algorithms, including linear and logistic regression, decision trees, random forests, support vector machines, and neural networks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probability distribution&lt;/strong&gt; — modelling the uncertainty in a dataset using common probability distributions such as normal distribution, binomial distribution, and Poisson distribution. This can be done using &lt;em&gt;SciPy&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
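
&lt;p&gt;To make one of these tasks concrete, here is a minimal regression sketch using &lt;code&gt;scipy.stats.linregress&lt;/code&gt;, which fits a straight line to two variables. The data is made up and deliberately follows y = 2x exactly, so the fitted slope should be 2 and the intercept 0, up to floating-point error.&lt;/p&gt;

```python
from scipy import stats

# Hypothetical data lying exactly on the line y = 2x
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

# Fit a simple linear regression of y on x
result = stats.linregress(x, y)

print("Slope:", result.slope)
print("Intercept:", result.intercept)
print("Correlation coefficient:", result.rvalue)
```

&lt;p&gt;With real, noisy data the correlation coefficient would be below 1, and the result's &lt;code&gt;pvalue&lt;/code&gt; attribute would indicate how strong the evidence for a non-zero slope is.&lt;/p&gt;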

&lt;h3&gt;
  
  
  &lt;strong&gt;Further reading:&lt;/strong&gt;
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://realpython.com/python-statistics/" rel="noopener noreferrer"&gt;Python Statistics Fundamentals: How to Describe Your Data&lt;/a&gt; by Mirko Stojiljković&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/an-introduction-to-statistical-analysis-and-modelling-with-python-ef816b67f8ff" rel="noopener noreferrer"&gt;An Introduction to Statistical Analysis and Modelling with Python&lt;/a&gt; by Roberto&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we have covered some of the key features and concepts of Python, including data types, operators, control flow, functions, and file reading/writing. We have also introduced some of the most commonly used libraries in Python for data analysis, such as NumPy, Pandas, Matplotlib, and SciPy.&lt;/p&gt;

&lt;p&gt;Python is a powerful language for Data Science. Its readability and its popularity within the data science community make it easy for beginners to dive into Data Science, and there are numerous resources available for learning and development.&lt;/p&gt;

&lt;p&gt;As an aspiring data scientist, learning Python is only the beginning of building your skillset. This article is a great starting point for beginners looking to learn Python and its applications in data analysis. Keep practising and exploring the wonderful world of Data Science. The possibilities are endless.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>datasciencecourse</category>
    </item>
    <item>
      <title>Starting a new Django Project with PostgreSQL database</title>
      <dc:creator>Karen Ngala</dc:creator>
      <pubDate>Wed, 28 Sep 2022 19:37:52 +0000</pubDate>
      <link>https://dev.to/karen_ngala/starting-a-new-django-project-with-postgresql-backend-2786</link>
      <guid>https://dev.to/karen_ngala/starting-a-new-django-project-with-postgresql-backend-2786</guid>
      <description>&lt;h3&gt;
  
  
  Pre-reading: Tutorials you may need
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.guru99.com/python-ide-code-editor.html" rel="noopener noreferrer"&gt;Choosing an IDE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pip.pypa.io/en/stable/installation/" rel="noopener noreferrer"&gt;Installing pip&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.python.org/downloads/" rel="noopener noreferrer"&gt;Download Python&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  This article assumes:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Basic understanding of Django.&lt;/li&gt;
&lt;li&gt;Basic knowledge of how to use CLI.&lt;/li&gt;
&lt;li&gt;Basic understanding of Git.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Let's jump right into it!&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;First, head to your terminal and create a new folder using the &lt;code&gt;mkdir&lt;/code&gt; command. This is the folder that will host all the work for the project you are working on.&lt;br&gt;
Then &lt;code&gt;cd&lt;/code&gt; into this folder to create a virtual environment.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Create a virtual environment
&lt;/h3&gt;

&lt;p&gt;There are &lt;a href="https://towardsdatascience.com/comparing-python-virtual-environment-tools-9a6543643a44" rel="noopener noreferrer"&gt;many virtual environment tools&lt;/a&gt; available. &lt;/p&gt;

&lt;p&gt;Working within a virtual environment isolates Python installs and their associated pip packages, allowing you to install and manage your own set of packages that are &lt;strong&gt;independent&lt;/strong&gt; of those provided by the system or used by other projects. The command to create a virtual environment varies depending on the tool you chose to install on your machine.&lt;/p&gt;

&lt;p&gt;For this use case, we will be using &lt;strong&gt;virtualenv&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;virtual&lt;/em&gt; is the name of my virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ virtualenv virtual
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Activate&lt;/strong&gt; the virtual environment so that subsequent commands run inside it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ source virtual/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
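&lt;p&gt;To confirm from Python that the activated environment is really the one in use, you can compare the interpreter's prefixes. The sketch below is only an illustration; the helper name &lt;code&gt;in_virtual_environment&lt;/code&gt; is made up for this example and is not part of virtualenv.&lt;/p&gt;

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # Otherwise sys.prefix differs from sys.base_prefix inside an environment.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("Inside a virtual environment:", in_virtual_environment())
```

&lt;p&gt;Running this script before and after activating the environment should print different answers.&lt;/p&gt;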



&lt;h3&gt;
  
  
  2. Install Django
&lt;/h3&gt;

&lt;p&gt;You can now install Django into this dedicated workspace.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This command will install the most recent version of Django.
(virtual) $ pip install django
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To install a specific version of Django, specify it as follows (replace the number after the &lt;code&gt;==&lt;/code&gt; sign with the version you wish to install):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(virtual) $ pip install django==2.2.11
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To make collaboration easier and keep track of all the packages (and their versions) currently in your virtual environment, pin your dependencies using the following command. This will create the file &lt;code&gt;requirements.txt&lt;/code&gt;. Run the command again whenever you install more external packages; it overwrites the file with the updated list of dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(virtual) $ pip freeze &amp;gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
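&lt;p&gt;Each line that &lt;code&gt;pip freeze&lt;/code&gt; writes has the form &lt;code&gt;name==version&lt;/code&gt;. As a rough sketch of how that file can be read back, the hypothetical helper below turns such text into a dictionary. The sample package versions are made up for illustration and are not taken from a real project.&lt;/p&gt;

```python
def parse_pinned_requirements(text):
    """Map each package name to its pinned version from pip-freeze style text."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not a name==version pin.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, _, version = line.partition("==")
        pins[name.strip()] = version.strip()
    return pins


# Hypothetical file contents, for illustration only.
sample = "Django==2.2.11\npytz==2019.3\nsqlparse==0.3.1\n"
pins = parse_pinned_requirements(sample)
print(pins["Django"])
```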



&lt;h3&gt;
  
  
  3. Create a Django project &amp;amp; app
&lt;/h3&gt;

&lt;p&gt;Django is organized in two major parts: &lt;em&gt;project&lt;/em&gt; and &lt;em&gt;app&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Project - the package that represents the entire website. The project directory contains settings for the whole website. &lt;em&gt;A project can have many apps&lt;/em&gt;. Create a project using the following command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(virtual) $ django-admin startproject &amp;lt;project-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



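&lt;p&gt;If you name the project &lt;code&gt;project&lt;/code&gt; and add a trailing &lt;code&gt;.&lt;/code&gt; to the command so that it is created directly in the current folder (&lt;code&gt;example/&lt;/code&gt; here), your folder structure will look like this. Without the trailing dot, Django nests everything inside an extra &lt;code&gt;project/&lt;/code&gt; folder.&lt;br&gt;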
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;example/
│
├── project/
│   ├── __init__.py
│   ├── asgi.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
└── manage.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;App - a sub-module of a project that implements a specific functionality. For example, a website can have an app for &lt;em&gt;posts&lt;/em&gt; and another app for &lt;em&gt;payment.&lt;/em&gt; Create a django app using the following command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(virtual) $ python manage.py startapp &amp;lt;app-name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A new folder named after your app (&lt;code&gt;app&lt;/code&gt; here) will be added. Your folder structure will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;example/
│
├── app/
│   │
│   ├── migrations/
│   │   └── __init__.py
│   │
│   ├── __init__.py
│   ├── admin.py
│   ├── apps.py
│   ├── models.py
│   ├── tests.py
│   └── views.py
│
├── project/
│   ├── __init__.py
│   ├── asgi.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
└── manage.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Create .gitignore &amp;amp; .env files
&lt;/h3&gt;

&lt;p&gt;Before adding git to your project, or before you can commit the changes you've made so far, there are some files you don't want tracked.&lt;br&gt;
The &lt;strong&gt;.gitignore&lt;/strong&gt; file tells git to not track these files or any changes you make to them.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;example/
│
├── app/
│   │
│   ├── migrations/
│   │   └── __init__.py
│   │
│   ├── __init__.py
│   ├── admin.py
│   ├── apps.py
│   ├── models.py
│   ├── tests.py
│   └── views.py
│
├── .gitignore
├── .env
├── .env.example
│
├── project/
│   ├── __init__.py
│   ├── asgi.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
└── manage.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are some of the files you can add to &lt;code&gt;.gitignore&lt;/code&gt;. You can add or omit entries as needed. For example, I usually have a &lt;code&gt;.txt&lt;/code&gt; file that I use for 'rough work', which I add to &lt;code&gt;.gitignore&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;virtual/
.env
*.pyc
db.sqlite3
migrations/
media/*
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



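&lt;p&gt;The reason for adding the migrations folder to &lt;code&gt;.gitignore&lt;/code&gt; is to minimize merge conflicts and errors in production. Note, however, that the Django documentation recommends committing migration files so that every environment applies the same schema changes, so only ignore them if your team has agreed on that workflow.&lt;/p&gt;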

&lt;p&gt;Your project also contains sensitive data that you do not want tracked, such as your Django secret key or your database password. This information is stored in a &lt;code&gt;.env&lt;/code&gt; file, which is then listed in &lt;code&gt;.gitignore&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;When collaborating with others, create a &lt;code&gt;.env.example&lt;/code&gt; file that contains example data that other collaborators can replace with their own values to run your project locally. This way, no one commits their environment credentials and you don't have to change the values each time you pull the project.&lt;/p&gt;

&lt;p&gt;Contents of .env may look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SECRET_KEY=generate-a-key
DEBUG=True
DB_NAME=db-name
DB_USER=username
DB_PASSWORD=your-password
DB_HOST=127.0.0.1
MODE=dev
ALLOWED_HOSTS=*
DISABLE_COLLECTSTATIC=1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then reference these credentials in &lt;code&gt;project/settings.py&lt;/code&gt; using the &lt;code&gt;python-decouple&lt;/code&gt; package (installed in the next step) as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
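&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from decouple import config, Csv  # add this to the top of the file

# Values are read from the .env file (or from environment variables)
MODE = config('MODE')

SECRET_KEY = config('SECRET_KEY')

# cast=bool turns the string "True" into the boolean True
DEBUG = config('DEBUG', cast=bool)

# Csv() splits a comma-separated string into a list of hosts
ALLOWED_HOSTS = config('ALLOWED_HOSTS', cast=Csv())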

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
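&lt;p&gt;To make the idea of reading and casting values concrete, here is a minimal standard-library sketch of what happens to the text in a &lt;code&gt;.env&lt;/code&gt; file. It is not how python-decouple is implemented; the helper names &lt;code&gt;parse_env&lt;/code&gt;, &lt;code&gt;to_bool&lt;/code&gt; and &lt;code&gt;to_list&lt;/code&gt; are assumptions made for this example, and the sample values are placeholders.&lt;/p&gt;

```python
def parse_env(text):
    """Turn KEY=VALUE lines into a dictionary of strings."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Ignore blank lines, comments, and lines without an equals sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def to_bool(value):
    """Interpret common truthy spellings as True, everything else as False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


def to_list(value):
    """Split a comma-separated string into a list of trimmed items."""
    return [item.strip() for item in value.split(",") if item.strip()]


# Placeholder contents, for illustration only.
sample = "SECRET_KEY=generate-a-key\nDEBUG=True\nALLOWED_HOSTS=localhost, 127.0.0.1\n"
env = parse_env(sample)
print(env["SECRET_KEY"], to_bool(env["DEBUG"]), to_list(env["ALLOWED_HOSTS"]))
```

&lt;p&gt;Everything in a &lt;code&gt;.env&lt;/code&gt; file starts out as a string, which is why settings such as &lt;code&gt;DEBUG&lt;/code&gt; need an explicit cast.&lt;/p&gt;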



&lt;h3&gt;
  
  
  5. Database and settings.py
&lt;/h3&gt;

&lt;p&gt;The default database used by Django out of the box is SQLite. For more complex projects, you will require a more powerful database like PostgreSQL. &lt;/p&gt;

&lt;p&gt;Some operating systems come with Postgres pre-installed; otherwise, you will need to &lt;a href="https://www.postgresql.org/" rel="noopener noreferrer"&gt;install it&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To check if you have PostgreSQL installed, run the &lt;code&gt;which psql&lt;/code&gt; command.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If Postgres is not installed, there is usually no output. You just get the terminal prompt back, ready to accept another command:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; which psql
&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;If Postgres is installed, you'll get a response with the path to the location of the Postgres install:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; which psql
/usr/bin/psql
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
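&lt;p&gt;The same check can be done from Python with the standard library's &lt;code&gt;shutil.which&lt;/code&gt;, which returns the path of an executable or &lt;code&gt;None&lt;/code&gt; when it is not on your &lt;code&gt;PATH&lt;/code&gt;. The wrapper name &lt;code&gt;find_executable&lt;/code&gt; below is just an illustrative choice.&lt;/p&gt;

```python
import shutil


def find_executable(name):
    """Return the full path of a command, or None if it is not on PATH."""
    return shutil.which(name)


psql_path = find_executable("psql")
if psql_path is None:
    print("psql was not found on PATH; Postgres may not be installed")
else:
    print("psql found at", psql_path)
```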



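&lt;p&gt;To support a Postgres database, you need to install &lt;code&gt;psycopg2&lt;/code&gt; and two other libraries. &lt;code&gt;psycopg2&lt;/code&gt; is a database adapter that connects PostgreSQL to Python, &lt;code&gt;dj-database-url&lt;/code&gt; builds Django database settings from a database URL, and &lt;code&gt;python-decouple&lt;/code&gt; reads settings from your &lt;code&gt;.env&lt;/code&gt; file.&lt;br&gt;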
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install psycopg2
pip install dj-database-url
pip install python-decouple
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make the following changes to &lt;code&gt;project/settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import dj_database_url


INSTALLED_APPS = [
    'application',  # new: replace with the name of your app
    'django.contrib.admin',
    ...
]


# Database
# https://docs.djangoproject.com/en/3.1/ref/settings/#databases
if config('MODE') == "dev":
    # Local development: connect using the credentials in .env
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql',  # changed database from sqlite to postgresql
            'NAME': config('DB_NAME'),
            'USER': config('DB_USER'),
            'PASSWORD': config('DB_PASSWORD'),
            'HOST': config('DB_HOST'),
            'PORT': '',
        }
    }
else:
    # Any other mode: build the connection from the DATABASE_URL value
    DATABASES = {
        'default': dj_database_url.config(
            default=config('DATABASE_URL')
        )
    }

# If a DATABASE_URL environment variable is set, it overrides the settings above
db_from_env = dj_database_url.config(conn_max_age=500)
DATABASES['default'].update(db_from_env)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
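&lt;p&gt;A database URL packs the same credentials as the &lt;code&gt;DB_*&lt;/code&gt; values into a single string. The sketch below uses only the standard library to show roughly how such a URL maps onto a Django-style settings dictionary. It is a simplified illustration, not dj-database-url's actual implementation, and both the function name and the example URL are assumptions built from the placeholder values used earlier.&lt;/p&gt;

```python
from urllib.parse import urlparse


def database_settings_from_url(url):
    """Split a postgres:// URL into the keys Django expects in DATABASES."""
    parts = urlparse(url)
    return {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": parts.path.lstrip("/"),
        "USER": parts.username or "",
        "PASSWORD": parts.password or "",
        "HOST": parts.hostname or "",
        "PORT": str(parts.port) if parts.port else "",
    }


# Placeholder URL, for illustration only.
example_url = "postgres://username:your-password@127.0.0.1:5432/db-name"
settings = database_settings_from_url(example_url)
print(settings["NAME"], settings["HOST"], settings["PORT"])
```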



&lt;h3&gt;
  
  
  6. Version tracking using git
&lt;/h3&gt;

&lt;p&gt;Initialize version control using the &lt;code&gt;git init&lt;/code&gt; command. Then add and commit your changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Test
&lt;/h3&gt;

&lt;p&gt;Check that your setup worked by running this command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;(virtual) $ python manage.py runserver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will use this command any time you need to test your code in the browser. By default, the development server runs at 127.0.0.1 on port &lt;strong&gt;8000&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You should see an output like this in your browser:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F5%2F53%2FDjango_2.1_landing_page.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2F5%2F53%2FDjango_2.1_landing_page.png"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;At this point, you’ve finished setting up the scaffolding for your Django website, and you can start implementing your ideas by adding models, views and templates.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary of Commands
&lt;/h2&gt;

&lt;p&gt;Commands in order of execution:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;$ virtualenv virtual&lt;/td&gt;
&lt;td&gt;set up a virtual environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$ source virtual/bin/activate&lt;/td&gt;
&lt;td&gt;activate the virtual environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(virtual) $ pip install django&lt;/td&gt;
&lt;td&gt;install Django inside the virtual environment&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(virtual) $ django-admin startproject &amp;lt;projectname&amp;gt;&lt;/td&gt;
&lt;td&gt;set up a Django project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(virtual) $ python manage.py startapp &amp;lt;appname&amp;gt;&lt;/td&gt;
&lt;td&gt;set up a Django app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(virtual) $ pip install psycopg2&lt;/td&gt;
&lt;td&gt;connect PostgreSQL to Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(virtual) $ pip install dj-database-url&lt;/td&gt;
&lt;td&gt;build database settings from a database URL&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(virtual) $ pip install python-decouple&lt;/td&gt;
&lt;td&gt;read settings from the .env file&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(virtual) $ pip freeze &amp;gt; requirements.txt&lt;/td&gt;
&lt;td&gt;pin dependencies and versions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;$ git init&lt;/td&gt;
&lt;td&gt;initialize version control, then add and commit your changes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;(virtual) $ python manage.py runserver&lt;/td&gt;
&lt;td&gt;view website on 127.0.0.1:&lt;strong&gt;8000&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we went through the steps of starting a new Django project with a PostgreSQL database, as well as the common terminal commands used for Django web development.&lt;/p&gt;

&lt;p&gt;I hope you found this article helpful!&lt;/p&gt;

</description>
      <category>django</category>
      <category>beginners</category>
      <category>postgres</category>
      <category>create</category>
    </item>
    <item>
      <title>Developer's guide to remote collaboration</title>
      <dc:creator>Karen Ngala</dc:creator>
      <pubDate>Sun, 08 Nov 2020 18:18:39 +0000</pubDate>
      <link>https://dev.to/karen_ngala/beginner-guide-to-developer-collaboration-3n1e</link>
      <guid>https://dev.to/karen_ngala/beginner-guide-to-developer-collaboration-3n1e</guid>
      <description>&lt;h4&gt;
  
  
   Pre-requisites 
&lt;/h4&gt;

&lt;p&gt;This article assumes a basic understanding of Git and GitHub and how to use them.&lt;/p&gt;

&lt;h3&gt;
  
  
  So, collaboration...
&lt;/h3&gt;

&lt;p&gt;The first step of effective collaboration is identifying a software development methodology, or &lt;a href="https://blog.planview.com/top-6-software-development-methodologies/" rel="noopener noreferrer"&gt;Software Development Life Cycle (SDLC)&lt;/a&gt;, to use.&lt;/p&gt;

&lt;p&gt;As a software developer, you have encountered, or inevitably will encounter, the &lt;strong&gt;Agile methodology&lt;/strong&gt;. A good place to start is by reading the &lt;a href="https://agilemanifesto.org/" rel="noopener noreferrer"&gt;Agile manifesto&lt;/a&gt; and the principles behind it. It is brief, yet complete. It was written by 17 software developers who sat together to uncover better ways of developing software. Agile development is beyond the scope of this article, but it may be of interest to you because it focuses on a developer's mindset and values, not tools or processes.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F2MYqBGS%2FAgile-Manifesto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2F2MYqBGS%2FAgile-Manifesto.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tools
&lt;/h3&gt;

&lt;p&gt;Because we still value the items on the right of the manifesto, such as processes and tools, we need to consider various technologies that will make remote software development and collaboration seamless and effective.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Definition of scrum terms&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Scrum is an agile project management framework that describes a set of meetings, tools, and roles in teamwork.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Scrum master&lt;/strong&gt; - is ideally dedicated to just one team, to avoid context switching. They are in charge of leading the daily standup, addressing blockers, merging approved pull requests, and coaching the team on best practices.

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;The SM role assumes servant-leadership, a way of leading people without having formal authority over them. The SM resorts to setting a shared vision, involving everyone in the decisions, coaching the group.&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scrum team&lt;/strong&gt; - the team of developers and designers working on the project. A scrum team is ideally self-managing and cross-functional.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backlog&lt;/strong&gt; - the master list of work that needs to get done to complete a project.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Blocker&lt;/strong&gt; - an obstacle faced while tackling an assigned task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standup&lt;/strong&gt; - in a team of developers working on a project, short meetings are held daily, usually in the mornings. The term comes from the fact that during the meetings, a developer literally stands up and states briefly:

&lt;ul&gt;
&lt;li&gt;What did I do yesterday? -achievements-&lt;/li&gt;
&lt;li&gt;What do I plan to do today? -tasks-&lt;/li&gt;
&lt;li&gt;Am I facing any challenges? -blockers-
&lt;code&gt;Note:&lt;/code&gt; Standups can be run as often as suits the team&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sprint&lt;/strong&gt; - a time period, usually between 1 week and 1 month but typically 2 weeks, in which a team works to complete a set number of user stories. A project is generally divided into sprints, and each sprint should produce a usable end-product (an increment).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kanban board&lt;/strong&gt; - a basic board divides a sprint into columns such as &lt;code&gt;'To do'&lt;/code&gt;, &lt;code&gt;'In progress'&lt;/code&gt; and &lt;code&gt;'Complete'&lt;/code&gt;. It can be altered to suit the needs of the team with columns like &lt;code&gt;'In review'&lt;/code&gt; and &lt;code&gt;'Resources'&lt;/code&gt;.
  &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fi.ibb.co%2FcDQZfNj%2Fkanban-board.webp"&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Breakdown of tools and processes&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Processes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Organize product backlog from stakeholder feedback and requirements&lt;/li&gt;
&lt;li&gt;Plan sprints, set timelines and allocate tasks&lt;/li&gt;
&lt;li&gt;Run sprints with daily standups&lt;/li&gt;
&lt;li&gt;Production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tools for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Communication&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;&lt;code&gt;Slack&lt;/code&gt;&lt;/em&gt;  is a great tool for general communication and integrates with numerous developer tools.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;code&gt;Google Meet&lt;/code&gt;&lt;/em&gt; is great for running standups even in large teams. It allows for screen sharing and has no time limit for a large number of attendees. &lt;code&gt;Google calendar&lt;/code&gt; also comes in handy when scheduling recurring meetings.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Project management&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;Kanban boards come in here. &lt;br&gt;
&lt;br&gt;Each GitHub repository has a Project section for managing tasks within the repo. Some external resources include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.trello.com/" rel="noopener noreferrer"&gt;&lt;em&gt;Trello&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.atlassian.com/software/jira" rel="noopener noreferrer"&gt;&lt;em&gt;Atlassian Jira&lt;/em&gt;&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Design and prototyping&lt;/strong&gt;&lt;br&gt;
&lt;br&gt;Design is an important part of a collaborative project. It keeps front-end developers from straying from the agreed-upon design, saving hours of back and forth.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://www.figma.com/" rel="noopener noreferrer"&gt;&lt;em&gt;Figma&lt;/em&gt;&lt;/a&gt; - web-based; collaborate as you would on a Google Doc&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Coding&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When it comes to coding in a collaborative environment, coding best practices play a major role in its effectiveness.  &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A &lt;a href="https://github.com/andela/bestpractices/wiki" rel="noopener noreferrer"&gt;good place to start&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;A good article - &lt;a href="https://medium.com/osedea/the-perfect-code-review-process-845e6ba5c31" rel="noopener noreferrer"&gt;The perfect code review process&lt;/a&gt;. The writer of the article, Robert Cooper, takes you through a fictional scenario of Jimmy and his team.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow#:~:text=Gitflow%20Workflow%20is%20a%20Git,designed%20around%20the%20project%20release.&amp;amp;text=Instead%2C%20it%20assigns%20very%20specific,and%20when%20they%20should%20interact." rel="noopener noreferrer"&gt;&lt;strong&gt;Gitflow&lt;/strong&gt;&lt;/a&gt; is a branching model in which every developer in the team works on independent features. A feature should not be dependent on another feature. Programmers should be able to work on features simultaneously without having to wait on the work of another developer.

&lt;ul&gt;
&lt;li&gt;Naming convention - each developer works on a feature branch, e.g. &lt;code&gt;ft-header&lt;/code&gt;, &lt;code&gt;ft-authentication&lt;/code&gt;, &lt;code&gt;ft-models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Pull requests are made to a &lt;code&gt;development&lt;/code&gt; branch, &lt;strong&gt;not main/master&lt;/strong&gt;. At any given time, the master branch should contain deployable work (no bugs, no incomplete work). An incomplete branch should never be merged.&lt;/li&gt;
&lt;li&gt;If modifications are requested after your branch has been reviewed, use &lt;a href="https://thoughtbot.com/blog/git-interactive-rebase-squash-amend-rewriting-history" rel="noopener noreferrer"&gt;interactive rebasing&lt;/a&gt; to fold the changes into your existing commits instead of adding extra commits.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
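&lt;p&gt;A naming convention is easier to keep when it can be checked automatically. The sketch below assumes the convention is "&lt;code&gt;ft-&lt;/code&gt; followed by lowercase words separated by hyphens"; that exact pattern and the helper name &lt;code&gt;is_feature_branch&lt;/code&gt; are assumptions for illustration, so adjust them to whatever your team agrees on.&lt;/p&gt;

```python
import re

# Assumed pattern: "ft-" followed by lowercase words separated by hyphens.
FEATURE_BRANCH = re.compile(r"ft-[a-z0-9]+(?:-[a-z0-9]+)*")


def is_feature_branch(name):
    """Return True when a branch name follows the assumed ft- convention."""
    return FEATURE_BRANCH.fullmatch(name) is not None


for branch in ["ft-header", "ft-authentication", "ft-models", "development", "main"]:
    print(branch, is_feature_branch(branch))
```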

&lt;h3&gt;
  
  
  &lt;strong&gt;Extra remarks&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Take some time and listen to some of &lt;a href="https://www.google.com/search?q=uncle+bob+on+youtube&amp;amp;oq=uncle+bob+on+youtube&amp;amp;aqs=chrome..69i57.5040j0j7&amp;amp;client=ms-android-oppo-rvo3&amp;amp;sourceid=chrome-mobile&amp;amp;ie=UTF-8" rel="noopener noreferrer"&gt;Uncle Bob's talks&lt;/a&gt; on YouTube&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://chris.beams.io/posts/git-commit/#imperative" rel="noopener noreferrer"&gt;Read&lt;/a&gt;: Best practices when it comes to git commits&lt;/li&gt;
&lt;li&gt;Always keep in mind:

&lt;ul&gt;
&lt;li&gt;Code is read more often than it is written. &lt;/li&gt;
&lt;li&gt;It is your duty as a programmer to write readable and maintainable code.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;small&gt;  Happy hacking!  &lt;/small&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>beginners</category>
      <category>scrum</category>
      <category>collaboration</category>
      <category>agile</category>
    </item>
  </channel>
</rss>
