<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rodney Kirui</title>
    <description>The latest articles on DEV Community by Rodney Kirui (@rodneykirui).</description>
    <link>https://dev.to/rodneykirui</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1029446%2F94e2401b-28ff-4bba-990c-1217e60ed2e5.png</url>
      <title>DEV Community: Rodney Kirui</title>
      <link>https://dev.to/rodneykirui</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/rodneykirui"/>
    <language>en</language>
    <item>
      <title>Introduction to Data Version Control</title>
      <dc:creator>Rodney Kirui</dc:creator>
      <pubDate>Mon, 03 Apr 2023 07:55:17 +0000</pubDate>
      <link>https://dev.to/rodneykirui/introduction-to-data-version-control-1kdh</link>
      <guid>https://dev.to/rodneykirui/introduction-to-data-version-control-1kdh</guid>
      <description>&lt;p&gt;&lt;strong&gt;What is Data Version Control (DVC)?&lt;/strong&gt;&lt;br&gt;
Data Version Control (DVC) is an open-source tool that enables data scientists to track and manage changes to their data, models, and experiments. DVC is designed to work seamlessly with Git, the popular version control system used for software development.&lt;/p&gt;

&lt;p&gt;Data version control is a critical aspect of any data science project. In traditional software development, version control is used to keep track of changes to source code. With the rise of data-driven applications, data has become a critical part of the development process, and version control is just as important for data as it is for code.&lt;br&gt;
In standard software engineering, many people need to work on a shared codebase and handle multiple versions of the same code. This can quickly lead to confusion and costly mistakes.&lt;/p&gt;

&lt;p&gt;To address this problem, developers use version control systems, such as Git, that help keep team members organized.&lt;/p&gt;

&lt;p&gt;In a version control system, there’s a central repository of code that represents the current, official state of the project. A developer can make a copy of that project, make some changes, and request that their new version become the official one. Their code is then reviewed and tested before it’s deployed to production.&lt;/p&gt;

&lt;p&gt;These quick feedback cycles can happen many times per day in traditional development projects. But similar conventions and standards are largely missing from commercial data science and machine learning. Data version control is a set of tools and processes that tries to adapt the version control process to the data world.&lt;/p&gt;

&lt;p&gt;Having systems in place that allow people to work quickly and pick up where others have left off would increase the speed and quality of delivered results. It would enable people to manage data transparently, run experiments effectively, and collaborate with others.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Is DVC?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DVC is a command-line tool written in Python. It mimics Git commands and workflows to ensure that users can quickly incorporate it into their regular Git practice. If you haven’t worked with Git before, then be sure to check out Introduction to Git and GitHub for Python Developers. If you’re familiar with Git but would like to take your skills to the next level, then check out Advanced Git Tips for Python Developers.&lt;/p&gt;

&lt;p&gt;DVC is meant to be run alongside Git. In fact, the git and dvc commands will often be used in tandem, one after the other. While Git is used to store and version code, DVC does the same for data and model files.&lt;/p&gt;

&lt;p&gt;Git can store code locally and also on a hosting service like GitHub, Bitbucket, or GitLab. Likewise, DVC uses a remote repository to store all your data and models. This is the single source of truth, and it can be shared amongst the whole team. You can get a local copy of the remote repository, modify the files, then upload your changes to share with team members.&lt;/p&gt;

&lt;p&gt;The remote repository can be on the same computer you’re working on, or it can be in the cloud. DVC supports most major cloud providers, including AWS, GCP, and Azure, but you can also set up a DVC remote repository on any server and connect it to your laptop. There are safeguards to keep team members from corrupting or deleting the remote data.&lt;/p&gt;

&lt;p&gt;When you store your data and models in the remote repository, a .dvc file is created. A .dvc file is a small text file that points to your actual data files in remote storage.&lt;/p&gt;

&lt;p&gt;The .dvc file is lightweight and meant to be stored with your code in GitHub. When you download a Git repository, you also get the .dvc files. You can then use those files to get the data associated with that repository. Large data and model files go in your DVC remote storage, and small .dvc files that point to your data go in GitHub.&lt;/p&gt;
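
&lt;p&gt;For illustration, here is roughly what a .dvc file looks like. This is only a sketch: the exact fields vary by DVC version, and the path and hash below are made up.&lt;/p&gt;

```yaml
# Hypothetical .dvc file tracking data/raw/train.csv (fields vary by DVC version)
outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d  # content hash used to locate the file in remote storage
  size: 1048576
  path: data/raw/train.csv
```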

&lt;p&gt;The best way to understand DVC is to use it, so let’s dive in. You’ll explore the most important features by working through several examples. Before you start, you’ll need to set up an environment to work in and then get some data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set Up Your Working Environment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You’ll need to have Python and Git installed on your system. You can follow the Python 3 Installation and Setup Guide to install Python, and read through Installing Git to install Git.&lt;/p&gt;

&lt;p&gt;Since DVC is a command-line tool, you’ll need to be familiar with working in your operating system’s command line. If you’re a Windows user, have a look at Running DVC on Windows.&lt;/p&gt;

&lt;p&gt;To prepare your workspace, you’ll take the following steps:&lt;/p&gt;

&lt;p&gt;Create and activate a virtual environment.&lt;br&gt;
Install DVC and its prerequisite Python libraries.&lt;br&gt;
Fork and clone a GitHub repository with all the code.&lt;br&gt;
Download a free dataset to use in the examples.&lt;br&gt;
You can use any package and environment manager you want. This tutorial uses conda because it has great support for data science and machine learning tools. To create and activate a virtual environment, open your command-line interface of choice and type the following command:&lt;br&gt;
&lt;code&gt;$ conda create --name dvc python=3.8.2 -y&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The create command creates a new virtual environment. The --name switch gives a name to that environment, which in this case is dvc. The python argument lets you select the version of Python you want installed inside the environment. Finally, the -y switch automatically answers yes to any confirmation prompts, so you don’t have to respond to them yourself.&lt;/p&gt;

&lt;p&gt;Once everything is installed, activate the environment:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ conda activate dvc&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;You now have a Python environment that is separate from your operating system’s Python installation. This gives you a clean slate and prevents you from accidentally messing up something in your default version of Python.&lt;/p&gt;

&lt;p&gt;You’ll also use some external libraries in this tutorial:&lt;/p&gt;

&lt;p&gt;dvc is the star of this tutorial.&lt;br&gt;
scikit-learn is a machine learning library that allows you to train models.&lt;br&gt;
scikit-image is an image processing library that you’ll use to prepare data for training.&lt;br&gt;
pandas is a library for data analysis that organizes data in table-like structures.&lt;br&gt;
numpy is a numerical computing library that adds support for multidimensional data, like images.&lt;br&gt;
Some of these are available only through conda-forge, so you’ll need to add it to your config and use conda install to install all the libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ conda config --add channels conda-forge
$ conda install dvc scikit-learn scikit-image pandas numpy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can use the pip installer:&lt;br&gt;
&lt;code&gt;$ python -m pip install dvc scikit-learn scikit-image pandas numpy&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now you have all the necessary Python libraries to run the code.&lt;/p&gt;

&lt;p&gt;This tutorial comes with a ready-to-go repository that contains the directory structure and code to quickly get you experimenting with DVC. &lt;br&gt;
You need to fork the repository to your own GitHub account. On the repository’s GitHub page, click Fork in the top-right corner of the screen and select your private account in the window that pops up. GitHub will create a forked copy of the repository under your account.&lt;/p&gt;

&lt;p&gt;Clone the forked repository to your computer with the git clone command and position your command line inside the repository folder:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/YourUsername/data-version-control
$ cd data-version-control
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Don’t forget to replace YourUsername in the above command with your actual GitHub username. You should now have a clone of the repository on your computer.&lt;/p&gt;

&lt;p&gt;There are six folders in your repository:&lt;/p&gt;

&lt;p&gt;src/ is for source code.&lt;br&gt;
data/ is for all versions of the dataset.&lt;br&gt;
data/raw/ is for data obtained from an external source.&lt;br&gt;
data/prepared/ is for data modified internally.&lt;br&gt;
model/ is for machine learning models.&lt;br&gt;
data/metrics/ is for tracking the performance metrics of your models.&lt;br&gt;
The src/ folder contains three Python files:&lt;/p&gt;

&lt;p&gt;prepare.py contains code for preparing data for training.&lt;br&gt;
train.py contains code for training a machine learning model.&lt;br&gt;
evaluate.py contains code for evaluating the results of a machine learning model.&lt;br&gt;
The final step in the preparation is to get an example dataset you can use to practice DVC. Images are well suited for this particular tutorial because managing lots of large files is where DVC shines, so you’ll get a good look at DVC’s most powerful features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Data Version Control Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At a high level, DVC works by creating a separate version control system for data and model files, while leveraging Git for code and experiment tracking. When a new data file is added to the project, DVC stores the file in a central repository and generates a small metadata file that contains information about the data, such as its hash value and location.&lt;/p&gt;

&lt;p&gt;When a change is made to the data file, DVC generates a new metadata file with updated information about the file, including the new hash value. This metadata file is then committed to the Git repository, along with any code changes or experiment results.&lt;/p&gt;
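
&lt;p&gt;The hash-based tracking described above can be sketched in a few lines of plain Python. This is not DVC's actual implementation, just an illustration of how a file's contents can be fingerprinted so that any change produces a new hash:&lt;/p&gt;

```python
import hashlib

def file_md5(path):
    """Return the MD5 hex digest of a file's contents, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()

# The same contents always produce the same digest; any edit changes it.
with open("example.csv", "w") as f:
    f.write("id,label\n1,pos\n")
print(file_md5("example.csv"))
```

&lt;p&gt;A metadata file only needs to store this digest and the file's location, which is why it stays tiny no matter how large the data is.&lt;/p&gt;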

&lt;p&gt;Let's dive deeper into how DVC works step-by-step:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Initialize a DVC project:&lt;/strong&gt; The first step is to initialize a new DVC project by running dvc init inside an existing Git repository. This creates a .dvc/ directory that holds the DVC configuration and cache settings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Add data to the project:&lt;/strong&gt; Next, data files are added to the project using the DVC add command. When a data file is added to the project, DVC generates a small metadata file that contains information about the data, such as its hash value and location. This metadata file is stored in the DVC cache directory, along with the original data file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Track data changes:&lt;/strong&gt; When a change is made to a data file, DVC detects the change and generates a new metadata file with updated information about the file, including the new hash value. The new metadata file, along with the new version of the data, is stored in the DVC cache directory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Commit changes to Git:&lt;/strong&gt; Once the data changes are tracked by DVC, the changes are committed to the Git repository along with any code changes or experiment results. This ensures that all changes to data and code are tracked and versioned.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Share data with others:&lt;/strong&gt; To share data with others, the DVC project directory can be pushed to a shared Git repository, or data can be shared directly from the DVC cache directory.&lt;/p&gt;
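
&lt;p&gt;The five steps above map onto a handful of commands. As a sketch, assuming a dataset at data/raw/train.csv and a DVC remote already configured, a typical session might look like this:&lt;/p&gt;

```shell
# 1. Initialize DVC inside an existing Git repository
git init
dvc init

# 2. Track a data file; this writes the metadata file data/raw/train.csv.dvc
dvc add data/raw/train.csv

# 3-4. Commit the lightweight metadata (not the data itself) to Git
git add data/raw/train.csv.dvc data/raw/.gitignore
git commit -m "Track raw training data with DVC"

# 5. Push the actual data to the shared remote, and the code to GitHub
dvc push
git push
```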

&lt;p&gt;By using separate metadata files for data and models, DVC can track changes to large files without actually storing the files in the Git repository. This allows data scientists to manage and share large files without overwhelming the Git repository or slowing down the development process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of Using DVC&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Collaboration: DVC allows team members to work on the same project simultaneously, while ensuring that changes to data and models are tracked and shared.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reproducibility: DVC ensures that data, models, and experiments are stored and versioned, enabling scientists to reproduce experiments easily.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Traceability: DVC provides a detailed history of changes to data and models, making it easy to track down the source of errors or issues.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Scalability: DVC is designed to handle large datasets, allowing data scientists to work with big data without compromising performance or storage.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Data Version Control (DVC) is a powerful tool that enables data scientists to track and manage changes to their data, models, and experiments. By using DVC alongside Git, data scientists can streamline their development process and focus on creating insights from data. DVC provides a way to manage and share large data files, collaborate with team members, and ensure reproducibility and traceability of experiments.&lt;/p&gt;

</description>
      <category>python</category>
      <category>github</category>
    </item>
    <item>
      <title>Getting started with Sentiment Analysis</title>
      <dc:creator>Rodney Kirui</dc:creator>
      <pubDate>Sat, 25 Mar 2023 17:23:48 +0000</pubDate>
      <link>https://dev.to/rodneykirui/getting-started-with-sentiment-analysis-7k7</link>
      <guid>https://dev.to/rodneykirui/getting-started-with-sentiment-analysis-7k7</guid>
      <description>&lt;p&gt;How do customers feel about your products or services? That’s important question business owners shouldn’t neglect. Positive and negative words matter. They can boost your business efforts or initiate a crisis. Luckily, you can measure customer satisfaction through sentiment analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sentiment analysis&lt;/strong&gt; is the process of analyzing online pieces of writing to determine the emotional tone they carry, whether positive, negative, or neutral. In simple words, sentiment analysis helps to find the author’s attitude towards a topic. &lt;br&gt;
Essentially, sentiment analysis, or sentiment classification, falls into the broad category of text classification tasks: you are supplied with a phrase, or a list of phrases, and your classifier is supposed to tell whether the sentiment behind it is positive, negative, or neutral. Sometimes the third attribute is dropped to keep it a binary classification problem.&lt;br&gt;
Sentiment analysis tools will collect all publicly available mentions containing your predefined keyword and analyze the emotions behind the message. The results of sentiment analysis are a wealth of information for your customer service teams, product development, or marketing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Types of Sentiment Analysis&lt;/strong&gt;&lt;br&gt;
Sentiment analysis focuses on the polarity of a text (positive, negative, neutral) but it also goes beyond polarity to detect specific feelings and emotions (angry, happy, sad, etc), urgency (urgent, not urgent) and even intentions (interested v. not interested).&lt;/p&gt;

&lt;p&gt;Depending on how you want to interpret customer feedback and queries, you can define and tailor your categories to meet your sentiment analysis needs. In the meantime, here are some of the most popular types of sentiment analysis:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graded Sentiment Analysis&lt;/strong&gt;&lt;br&gt;
If polarity precision is important to your business, you might consider expanding your polarity categories to include different levels of positive and negative:&lt;/p&gt;

&lt;p&gt;Very positive&lt;br&gt;
Positive&lt;br&gt;
Neutral&lt;br&gt;
Negative&lt;br&gt;
Very negative&lt;br&gt;
This is usually referred to as graded or fine-grained sentiment analysis, and could be used to interpret 5-star ratings in a review, for example:&lt;/p&gt;

&lt;p&gt;Very Positive = 5 stars&lt;br&gt;
Very Negative = 1 star&lt;/p&gt;
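
&lt;p&gt;As a small illustration, a graded sentiment score can be mapped to star ratings with a simple helper. The cut-offs, and the assumption of a score in the range [-1, 1], are made up for this sketch:&lt;/p&gt;

```python
def stars_from_score(score):
    """Map a sentiment score in [-1, 1] to a 1-5 star rating (illustrative cut-offs)."""
    if score > 0.6:
        return 5  # very positive
    if score > 0.2:
        return 4  # positive
    if score >= -0.2:
        return 3  # neutral
    if score >= -0.6:
        return 2  # negative
    return 1      # very negative

print(stars_from_score(0.9))   # -> 5
print(stars_from_score(-0.9))  # -> 1
```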

&lt;p&gt;Emotion detection&lt;br&gt;
Emotion detection sentiment analysis allows you to go beyond polarity to detect emotions, like happiness, frustration, anger, and sadness.&lt;/p&gt;

&lt;p&gt;Many emotion detection systems use lexicons (i.e. lists of words and the emotions they convey) or complex machine learning algorithms.&lt;/p&gt;

&lt;p&gt;One of the downsides of using lexicons is that people express emotions in different ways. Some words that typically express anger, like bad or kill (e.g. your product is so bad or your customer support is killing me) might also express happiness (e.g. this is bad ass or you are killing it).&lt;/p&gt;
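
&lt;p&gt;A minimal lexicon-based scorer makes this limitation concrete. The word lists below are tiny, made-up examples, not a real sentiment lexicon:&lt;/p&gt;

```python
# Tiny illustrative lexicons -- real ones contain thousands of entries
POSITIVE = {"love", "great", "happy", "fun"}
NEGATIVE = {"bad", "killing", "terrible", "sad"}

def lexicon_sentiment(text):
    """Classify text by counting lexicon hits, ignoring context entirely."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(lexicon_sentiment("I love this great product"))  # -> positive
# Context-blindness in action: slang praise is misread as negative.
print(lexicon_sentiment("you are killing it"))         # -> negative
```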

&lt;p&gt;&lt;strong&gt;Aspect-based Sentiment Analysis&lt;/strong&gt;&lt;br&gt;
Usually, when analyzing sentiments of texts you’ll want to know which particular aspects or features people are mentioning in a positive, neutral, or negative way.&lt;/p&gt;

&lt;p&gt;That's where aspect-based sentiment analysis can help, for example in this product review: "The battery life of this camera is too short", an aspect-based classifier would be able to determine that the sentence expresses a negative opinion about the battery life of the product in question.&lt;/p&gt;
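
&lt;p&gt;A toy version of this idea pairs each known aspect with the nearest opinion word. The aspect and opinion lists here are invented for the sketch, and real aspect-based systems are far more sophisticated:&lt;/p&gt;

```python
# Toy aspect-based sketch: pair each known aspect with the nearest opinion word.
ASPECTS = {"battery", "screen", "price"}
OPINIONS = {"short": "negative", "long": "positive", "great": "positive", "poor": "negative"}

def aspect_sentiment(sentence):
    """Return a mapping from each aspect found to the polarity of its closest opinion word."""
    words = [w.strip('.",').lower() for w in sentence.split()]
    found = {}
    for i, w in enumerate(words):
        if w in ASPECTS:
            # scan outward from the aspect for the closest opinion word
            for dist in range(1, len(words)):
                for j in (i - dist, i + dist):
                    if 0 <= j < len(words) and words[j] in OPINIONS:
                        found[w] = OPINIONS[words[j]]
                        break
                if w in found:
                    break
    return found

print(aspect_sentiment("The battery life of this camera is too short"))  # -> {'battery': 'negative'}
```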

&lt;p&gt;&lt;strong&gt;Multilingual sentiment analysis&lt;/strong&gt;&lt;br&gt;
Multilingual sentiment analysis can be difficult. It involves a lot of preprocessing and resources. Most of these resources are available online (e.g. sentiment lexicons), while others need to be created (e.g. translated corpora or noise detection algorithms), but you’ll need to know how to code to use them.&lt;/p&gt;

&lt;p&gt;Alternatively, you could detect language in texts automatically with a language classifier, then train a custom sentiment analysis model to classify texts in the language of your choice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Is Sentiment Analysis Important?&lt;/strong&gt;&lt;br&gt;
Since humans express their thoughts and feelings more openly than ever before, sentiment analysis is fast becoming an essential tool to monitor and understand sentiment in all types of data.&lt;/p&gt;

&lt;p&gt;Automatically analyzing customer feedback, such as opinions in survey responses and social media conversations, allows brands to learn what makes customers happy or frustrated, so that they can tailor products and services to meet their customers’ needs.&lt;/p&gt;

&lt;p&gt;For example, using sentiment analysis to automatically analyze 4,000+ open-ended responses in your customer satisfaction surveys could help you discover why customers are happy or unhappy at each stage of the customer journey.&lt;/p&gt;

&lt;p&gt;Maybe you want to track brand sentiment so you can detect disgruntled customers immediately and respond as soon as possible. Maybe you want to compare sentiment from one quarter to the next to see if you need to take action. Then you could dig deeper into your qualitative data to see why sentiment is falling or rising.&lt;br&gt;
&lt;strong&gt;The overall benefits of sentiment analysis include:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Better customer insights&lt;/strong&gt;: Sentiment analysis can help businesses to understand their customers' needs, preferences, and opinions. By analyzing customer feedback, companies can identify areas for improvement and tailor their products and services to better meet customer needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Improved brand reputation&lt;/strong&gt;: By analyzing social media posts and customer reviews, businesses can monitor their brand reputation and identify potential issues before they become major problems. This allows businesses to proactively address customer concerns and maintain a positive brand image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Increased customer satisfaction&lt;/strong&gt;: By understanding customer feedback and sentiment, businesses can improve their products and services to better meet customer needs. This can lead to increased customer satisfaction and loyalty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Enhanced marketing campaigns&lt;/strong&gt;: Sentiment analysis can help businesses to develop more effective marketing campaigns by identifying customer preferences and trends. By analyzing social media posts and other online content, businesses can identify key influencers and tailor their messaging to better resonate with their target audience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Streamlined customer support&lt;/strong&gt;: Sentiment analysis can help businesses to quickly identify and prioritize customer issues. By analyzing customer feedback in real-time, businesses can respond to customer inquiries and complaints more efficiently, leading to improved customer support and satisfaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Competitive advantage&lt;/strong&gt;: By analyzing customer sentiment and feedback, businesses can gain valuable insights into their competitors' strengths and weaknesses. This can help businesses to develop more effective marketing strategies and stay ahead of their competitors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Scalability&lt;/strong&gt;: Sentiment analysis can analyze large volumes of textual data quickly and accurately, making it an ideal solution for businesses with large customer bases or high volumes of social media posts and other online content to monitor.&lt;/p&gt;

&lt;p&gt;Overall, sentiment analysis in machine learning can provide businesses with valuable insights into their customers, products, and competitors, leading to improved customer satisfaction, brand reputation, and competitive advantage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formulating the problem statement of sentiment analysis:&lt;/strong&gt;&lt;br&gt;
Before understanding the problem statement of a sentiment classification task, you need to have a clear idea of general text classification problem. Let's formally define the problem of a general text classification task.&lt;/p&gt;

&lt;p&gt;Input:&lt;br&gt;
- A document d&lt;br&gt;
- A fixed set of classes C = {c1, c2, ..., cn}&lt;/p&gt;

&lt;p&gt;Output: a predicted class c ∈ C&lt;br&gt;
The term document is used loosely in the text classification world: it can mean tweets, phrases, parts of news articles, whole news articles, a product manual, a story, and so on. The reason behind this terminology is that word denotes an atomic entity, which is small in this context, so document is used to denote larger sequences of words. A tweet is a shorter document, whereas an article is a larger one.&lt;/p&gt;

&lt;p&gt;So, a training set of n labeled documents looks like: (d1,c1), (d2,c2),...,(dn,cn) and the ultimate output is a learned classifier.&lt;/p&gt;

&lt;p&gt;You are doing well! But one question you may have at this point is: where are the features of the documents? A genuine question! You will get to that a bit later.&lt;/p&gt;

&lt;p&gt;Now, let's move on with the problem formulation and slowly build the intuition behind sentiment classification.&lt;/p&gt;

&lt;p&gt;One crucial point to keep in mind while working in sentiment analysis is that not all the words in a phrase convey its sentiment. Words like "I", "are", and "am" do not contribute to conveying any kind of sentiment, so they are not relevant in a sentiment classification context. Consider the problem of feature selection here: in feature selection, you try to figure out the features that relate most strongly to the class label. The same idea applies here. Only a handful of words in a phrase take part in conveying sentiment, and identifying and extracting them from the phrases proves to be a challenging task. But don't worry, you will get to that.&lt;/p&gt;

&lt;p&gt;Consider the following movie review to understand this better:&lt;/p&gt;

&lt;p&gt;"I love this movie! It's sweet, but with satirical humor. The dialogs are great and the adventure scenes are fun. It manages to be romantic and whimsical while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I have seen it several times and I'm always happy to see it again......."&lt;/p&gt;

&lt;p&gt;Yes, this is undoubtedly a review which carries positive sentiments regarding a particular movie. But what are those specific words which define this positivity?&lt;/p&gt;

&lt;p&gt;Retake a look at the review.&lt;/p&gt;

&lt;p&gt;"I love this movie! It's sweet, but with satirical humor. The dialogs are great and the adventure scenes are fun. It manages to be romantic and whimsical while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I have seen it several times and I'm always happy to see it again......."&lt;/p&gt;

&lt;p&gt;You must have got the clear picture now. The bold words in the above piece of text are the most important words which construct the positive nature of the sentiment conveyed by the text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A simple sentiment classifier in Python:&lt;/strong&gt;&lt;br&gt;
Here's an example of a simple sentiment classifier in Python using the Natural Language Toolkit (NLTK) library. For this case study, you'll use an off-line movie review corpus as covered in the NLTK book &lt;a href="https://www.nltk.org/book/ch06.html#document-classification"&gt;https://www.nltk.org/book/ch06.html#document-classification&lt;/a&gt; and can be downloaded from here &lt;a href="http://www.nltk.org/nltk_data/"&gt;http://www.nltk.org/nltk_data/&lt;/a&gt; nltk provides a version of the dataset. The dataset categorizes each review as positive or negative. You need to download that first as follows:&lt;br&gt;
&lt;code&gt;python -m nltk.downloader all&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
It's not recommended to run it from Jupyter Notebook. Try to run it from the command prompt (if using Windows). It will take some time. So, be patient.&lt;/p&gt;

&lt;p&gt;For more information about NLTK datasets, make sure you visit this link. &lt;a href="https://www.nltk.org/data.html"&gt;https://www.nltk.org/data.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You will be implementing a Naive Bayes, or more precisely a Multinomial Naive Bayes, classifier using NLTK, which stands for Natural Language Toolkit. It is a library dedicated to NLP and NLU tasks, and the documentation is very good. It covers many techniques in great detail and provides free datasets for experiments as well.&lt;/p&gt;

&lt;p&gt;Make sure you check out NLTK's official website, because it has some well-written tutorials covering different NLP concepts.&lt;/p&gt;

&lt;p&gt;After all the data is downloaded, you will start by importing the movie reviews dataset with from nltk.corpus import movie_reviews. Then, you will construct a list of documents, labeled with the appropriate categories.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Load and prepare the dataset
import nltk
from nltk.corpus import movie_reviews
import random

documents = [(list(movie_reviews.words(fileid)), category)
              for category in movie_reviews.categories()
              for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, you will define a feature extractor for documents, so the classifier will know which aspects of the data it should pay attention to. "In this case, you can define a feature for each word, indicating whether the document contains that word. To limit the number of features that the classifier needs to process, you start by constructing a list of the 2000 most frequent words in the overall corpus" - Source. You can then define a feature extractor that simply checks whether each of these words is present in a given document.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Define the feature extractor

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"The reason that you computed the set of all words in a document document_words = set(document), rather than just checking if the word in the document, is that checking whether a word occurs in a set is much faster than checking whether it happens in a list" - Source.&lt;/p&gt;

&lt;p&gt;You have defined the feature extractor. Now, you can use it to train a Naive Bayes classifier to predict the sentiments of new movie reviews. To check your classifier's performance, you will compute its accuracy on the test set. NLTK provides show_most_informative_features() to see which features the classifier found to be most informative.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Train Naive Bayes classifier
featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Test the classifier
print(nltk.classify.accuracy(classifier, test_set))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;0.71&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
Wow! The classifier was able to achieve an accuracy of 71% without even tweaking any parameters or fine-tuning. This is great for the first go!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Show the most important features as interpreted by Naive Bayes
classifier.show_most_informative_features(5)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Most Informative Features
       contains(winslet) = True              pos : neg    =      8.4 : 1.0
     contains(illogical) = True              neg : pos    =      7.6 : 1.0
      contains(captures) = True              pos : neg    =      7.0 : 1.0
        contains(turkey) = True              neg : pos    =      6.5 : 1.0
        contains(doubts) = True              pos : neg    =      5.8 : 1.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"In the dataset, a review that mentions "Illogical" is almost 8 times more likely to be negative than positive, while a review that mentions "Captures" is about 6 times more likely to be positive" - Source.&lt;/p&gt;

&lt;p&gt;Now the question - why Naive Bayes?&lt;/p&gt;

&lt;p&gt;You chose to study Naive Bayes because of the way it is designed and developed. Text data has some practical and sophisticated features that map well to Naive Bayes, provided you are not considering neural networks. Besides, it is easy to interpret and does not behave like a black-box model.&lt;br&gt;
Naive Bayes suffers from a certain disadvantage as well:&lt;br&gt;
its main limitation is the assumption of independent predictors. In real life, it is almost impossible to get a set of predictors that are entirely independent.&lt;/p&gt;
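&lt;p&gt;To make that independence assumption concrete, here is a minimal from-scratch sketch of a Naive Bayes text classifier in plain Python. The toy documents and labels are made up for illustration; this is not the NLTK implementation:&lt;/p&gt;

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (list_of_words, label). Returns priors and a likelihood function."""
    label_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)   # word_counts[label][word] = frequency
    vocab = set()
    for words, label in docs:
        word_counts[label].update(words)
        vocab.update(words)

    def likelihood(word, label):
        # Laplace (add-one) smoothing so unseen words do not zero out the product
        total = sum(word_counts[label].values())
        return (word_counts[label][word] + 1) / (total + len(vocab))

    priors = {label: count / len(docs) for label, count in label_counts.items()}
    return priors, likelihood

def classify(words, priors, likelihood):
    scores = {}
    for label, prior in priors.items():
        # log P(label) + sum of log P(word | label): the "naive" independence step,
        # each word contributes on its own, ignoring order and co-occurrence
        scores[label] = math.log(prior) + sum(math.log(likelihood(w, label)) for w in words)
    return max(scores, key=scores.get)

# Hypothetical mini-corpus of "reviews"
docs = [("great captures brilliant".split(), "pos"),
        ("illogical turkey boring".split(), "neg"),
        ("brilliant moving captures".split(), "pos"),
        ("boring illogical mess".split(), "neg")]
priors, likelihood = train(docs)
print(classify("captures brilliant".split(), priors, likelihood))  # pos
print(classify("illogical boring".split(), priors, likelihood))    # neg
```

&lt;p&gt;The sum of log-likelihoods is exactly where the "naive" part lives: each word contributes independently, with no account of word order or co-occurrence.&lt;/p&gt;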

</description>
      <category>python</category>
    </item>
    <item>
      <title>Essential SQL Commands for Data Science</title>
      <dc:creator>Rodney Kirui</dc:creator>
      <pubDate>Wed, 15 Mar 2023 10:40:20 +0000</pubDate>
      <link>https://dev.to/rodneykirui/essential-sql-commands-for-data-science-2aeb</link>
      <guid>https://dev.to/rodneykirui/essential-sql-commands-for-data-science-2aeb</guid>
      <description>&lt;p&gt;SQL (Structured Query Language) is a programming language that is widely used for managing and manipulating relational databases. For data scientists, SQL is an essential tool for accessing, querying, and transforming data stored in databases. Here are some essential SQL commands for data science:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. SELECT&lt;/strong&gt;&lt;br&gt;
The SELECT statement is used to select data from a database.&lt;br&gt;
The data returned is stored in a result table, called the result-set.&lt;br&gt;
&lt;code&gt;SELECT * FROM table_name;&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;2. WHERE&lt;/strong&gt;&lt;br&gt;
The WHERE clause is used to filter records.&lt;br&gt;
It is used to extract only those records that fulfill a specified condition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column1, column2, ...
FROM table_name
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;NOTE: The WHERE clause is not only used in SELECT statements, it is also used in UPDATE, DELETE, etc.!&lt;/p&gt;
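&lt;p&gt;If you want to try these statements end to end, Python's built-in sqlite3 module gives you a throwaway in-memory database. The employees table and its rows below are made up for illustration:&lt;/p&gt;

```python
import sqlite3

# Hypothetical employees table, just to exercise SELECT and WHERE
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Ann", "Sales", 50000), ("Ben", "IT", 65000), ("Cara", "IT", 72000)])

# SELECT * returns every column of every row
everyone = conn.execute("SELECT * FROM employees").fetchall()
print(everyone)

# WHERE keeps only the rows that satisfy the condition
well_paid = conn.execute(
    "SELECT name FROM employees WHERE salary > 60000 ORDER BY name").fetchall()
print(well_paid)  # [('Ben',), ('Cara',)]
```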

&lt;p&gt;&lt;strong&gt;3. GROUP BY&lt;/strong&gt;&lt;br&gt;
This command is used to group data based on a specific column.&lt;br&gt;
Example:&lt;br&gt;
&lt;code&gt;SELECT column_name, COUNT(*) FROM table_name GROUP BY column_name;&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;4. ORDER BY&lt;/strong&gt;&lt;br&gt;
This command is used to sort data in ascending or descending order based on a specific column.&lt;br&gt;
The ORDER BY keyword is used to sort the result-set in ascending or descending order.&lt;br&gt;
The ORDER BY keyword sorts the records in ascending order by default. To sort the records in descending order, use the DESC keyword.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column1, column2, ...
FROM table_name
ORDER BY column1, column2, ... ASC|DESC;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: ORDER BY without an explicit ASC or DESC arranges the data in ascending order by default.&lt;/p&gt;
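&lt;p&gt;GROUP BY and ORDER BY are often combined. A small sketch with sqlite3 (the orders table is hypothetical):&lt;/p&gt;

```python
import sqlite3

# Made-up orders table: one row per order
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 30), ("bob", 10), ("alice", 20), ("bob", 40), ("carol", 25)])

# One row per customer, sorted by order count (descending), ties broken by name
rows = conn.execute("""
    SELECT customer, COUNT(*) AS n_orders
    FROM orders
    GROUP BY customer
    ORDER BY n_orders DESC, customer ASC
""").fetchall()
print(rows)  # [('alice', 2), ('bob', 2), ('carol', 1)]
```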

&lt;p&gt;&lt;strong&gt;5. JOIN&lt;/strong&gt;&lt;br&gt;
A JOIN clause is used to combine rows from two or more tables, based on a related column between them.&lt;br&gt;
&lt;code&gt;SELECT * FROM table1 JOIN table2 ON table1.column_name = table2.column_name;&lt;br&gt;
&lt;/code&gt;&lt;br&gt;
   &lt;strong&gt;a. Inner Join&lt;/strong&gt;&lt;br&gt;
The INNER JOIN keyword selects records that have matching values in both tables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_name(s)
FROM table1
INNER JOIN table2
ON table1.column_name = table2.column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;b. Left Join&lt;/strong&gt;&lt;br&gt;
The LEFT JOIN keyword returns all records from the left table (table1) and the matching records from the right table (table2). If there is no match, the columns from the right side are NULL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_name(s)
FROM table1
LEFT JOIN table2
ON table1.column_name = table2.column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;c. Right Join&lt;/strong&gt;&lt;br&gt;
The RIGHT JOIN keyword returns all records from the right table (table2) and the matching records from the left table (table1). If there is no match, the columns from the left side are NULL.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_name(s)
FROM table1
RIGHT JOIN table2
ON table1.column_name = table2.column_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;d. Full Join&lt;/strong&gt;&lt;br&gt;
The FULL OUTER JOIN keyword returns all records when there is a match in left (table1) or right (table2) table records.&lt;br&gt;
Tip: FULL OUTER JOIN and FULL JOIN are the same.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_name(s)
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;e. Self Join&lt;/strong&gt;&lt;br&gt;
A self join is a regular join, but the table is joined with itself.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_name(s)
FROM table1 T1, table1 T2
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
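&lt;p&gt;The difference between INNER JOIN and LEFT JOIN is easiest to see side by side. A sketch with sqlite3 and two made-up tables; note that SQLite reports SQL NULL as Python's None:&lt;/p&gt;

```python
import sqlite3

# Hypothetical customers/orders tables; Cara has no orders on purpose
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cara');
    INSERT INTO orders VALUES (10, 1, 9.99), (11, 1, 4.5), (12, 2, 20.0);
""")

# INNER JOIN: only customers with at least one matching order
inner = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    INNER JOIN orders o ON c.id = o.customer_id
    ORDER BY c.name, o.amount
""").fetchall()
print(inner)  # [('Ann', 4.5), ('Ann', 9.99), ('Ben', 20.0)]

# LEFT JOIN: every customer; order columns are NULL (None) when there is no match
left = conn.execute("""
    SELECT c.name, o.amount
    FROM customers c
    LEFT JOIN orders o ON c.id = o.customer_id
    ORDER BY c.name, o.amount
""").fetchall()
print(left)  # Cara appears once, with amount None
```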



&lt;p&gt;&lt;strong&gt;6. HAVING&lt;/strong&gt;&lt;br&gt;
This command is used to filter data based on conditions after grouping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.&lt;/p&gt;
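&lt;p&gt;A quick sketch of HAVING with sqlite3 (the orders table is made up): WHERE cannot filter on COUNT(*), but HAVING filters the groups after aggregation:&lt;/p&gt;

```python
import sqlite3

# Hypothetical orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("alice", 30), ("bob", 10), ("alice", 20), ("carol", 25)])

# Keep only the customers with more than one order
rows = conn.execute("""
    SELECT customer, COUNT(*) AS n
    FROM orders
    GROUP BY customer
    HAVING COUNT(*) > 1
""").fetchall()
print(rows)  # [('alice', 2)]
```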

&lt;p&gt;&lt;strong&gt;7. DISTINCT&lt;/strong&gt;&lt;br&gt;
This command is used to retrieve unique values from a column. The SELECT DISTINCT statement is used to return only distinct (different) values.&lt;br&gt;
Inside a table, a column often contains many duplicate values; and sometimes you only want to list the different (distinct) values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT DISTINCT column1, column2, ...
FROM table_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;8. COUNT, AVG, SUM&lt;/strong&gt;&lt;br&gt;
These commands are used to perform calculations on a set of data. The COUNT() function returns the number of rows that match a specified criterion.&lt;br&gt;
COUNT() Syntax&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT COUNT(column_name)
FROM table_name
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AVG() function returns the average value of a numeric column. &lt;br&gt;
AVG() Syntax&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT AVG(column_name)
FROM table_name
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The SUM() function returns the total sum of a numeric column.&lt;br&gt;
SUM() Syntax&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT SUM(column_name)
FROM table_name
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;9. UPDATE&lt;/strong&gt;&lt;br&gt;
The UPDATE statement is used to modify the existing records in a table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;10. DELETE&lt;/strong&gt;&lt;br&gt;
The DELETE statement is used to delete existing records in a table.&lt;br&gt;
&lt;code&gt;DELETE FROM table_name WHERE condition;&lt;/code&gt;&lt;/p&gt;
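&lt;p&gt;UPDATE and DELETE can be tried safely in an in-memory sqlite3 database (the users table is made up). Note how each statement touches only the rows matched by its WHERE clause:&lt;/p&gt;

```python
import sqlite3

# Hypothetical users table with an "active" flag
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, active INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ann", 1), ("ben", 0), ("cara", 0)])

# UPDATE modifies only the rows matching the condition
conn.execute("UPDATE users SET active = 1 WHERE name = 'ben'")

# DELETE removes only the rows matching the condition
conn.execute("DELETE FROM users WHERE active = 0")

remaining = conn.execute("SELECT name FROM users ORDER BY name").fetchall()
print(remaining)  # [('ann',), ('ben',)]
```

&lt;p&gt;Without a WHERE clause, both statements would touch every row in the table, which is why it is worth double-checking the condition before running them on real data.&lt;/p&gt;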

</description>
    </item>
    <item>
      <title>ULTIMATE GUIDE TO EXPLORATORY DATA ANALYSIS</title>
      <dc:creator>Rodney Kirui</dc:creator>
      <pubDate>Wed, 01 Mar 2023 09:54:37 +0000</pubDate>
      <link>https://dev.to/rodneykirui/ultimate-guide-to-exploratory-data-analysis-1f3h</link>
      <guid>https://dev.to/rodneykirui/ultimate-guide-to-exploratory-data-analysis-1f3h</guid>
      <description>&lt;p&gt;Exploratory Data Analysis is a data analytics process to understand the data in depth and learn the different data characteristics, often with visual means. This allows you to get a better feel of your data and find useful patterns in it.&lt;br&gt;
It is crucial to understand it in depth before you perform data analysis and run your data through an algorithm. You need to know the patterns in your data and determine which variables are important and which do not play a significant role in the output. Further, some variables may have correlations with other variables. You also need to recognize errors in your data. &lt;/p&gt;

&lt;p&gt;All of this can be done with Exploratory Data Analysis. It helps you gather insights and make better sense of the data, and removes irregularities and unnecessary values from data. &lt;/p&gt;

&lt;p&gt;Helps you prepare your dataset for analysis.&lt;br&gt;
Allows a machine learning model to make better predictions on your dataset.&lt;br&gt;
Gives you more accurate results.&lt;br&gt;
Helps you choose a better machine learning model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps Involved in Exploratory Data Analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Understand the Problem&lt;/strong&gt;&lt;br&gt;
Before starting the exploratory data analysis (EDA), it is essential to understand the problem you are trying to solve. What is the research question or business problem you are trying to answer? What are the goals of the analysis? Understanding the context of the data will help you frame the analysis and guide your EDA efforts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Data Collection&lt;/strong&gt;&lt;br&gt;
Data collection is an essential part of exploratory data analysis. It refers to the process of finding and loading data into our system. Good, reliable data can be found on various public sites or bought from private organizations. Some reliable sites for data collection are Kaggle, Github, Machine Learning Repository, etc.&lt;br&gt;
Example:&lt;br&gt;
Let’s explore the steps of exploratory data analysis in detail using a customer churn analysis based on customers’ behavior on a website or app.&lt;/p&gt;

&lt;p&gt;We will classify what kind of customers are likely to sign up for the paid subscription of a website. After analyzing and classifying the dataset, we will be able to target marketing or recommendations to the customers who are likely to sign up for the paid subscription plan.&lt;br&gt;
Import the Libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import re
import string
import matplotlib.pyplot as plt
import seaborn as sn
from dateutil import parser
import warnings
warnings.filterwarnings('ignore')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is stored in CSV format, so we import it using pd.read_csv:&lt;br&gt;
&lt;code&gt;data = pd.read_csv('app_data.csv')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;How many entries (Rows) and attributes(Columns) are present in the data? What is the shape of the data?&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data.shape&lt;/code&gt;&lt;br&gt;
(50000, 12)&lt;br&gt;
The .shape attribute returns the number of rows and columns of the dataset. So, our dataset has 50000 rows and 12 columns.&lt;/p&gt;

&lt;p&gt;Display the first 5 entries of the data.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data.head()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;.head() method gives the first 5 rows of the dataset. It is useful for seeing some example values for each variable.&lt;/p&gt;

&lt;p&gt;What are the different features available in the data?&lt;/p&gt;

&lt;p&gt;&lt;code&gt;data.columns&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The .columns attribute returns all the column names in the dataset.&lt;/p&gt;

&lt;p&gt;Display the distribution of Numerical Variables.&lt;br&gt;
&lt;code&gt;data.describe()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;.describe() method summarizes the count, mean, standard deviation, min, and max for numeric variables. It helps to understand the skewness in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Data Cleaning&lt;/strong&gt;&lt;br&gt;
Data cleaning refers to the process of removing unwanted variables and values from your dataset and getting rid of any irregularities in it. Such anomalies can disproportionately skew the data and hence adversely affect the results. Some steps that can be done to clean data are:&lt;br&gt;
Missing Data&lt;br&gt;
Irregular Data (Outliers)&lt;br&gt;
Unnecessary Data — Repetitive Data, Duplicates and more&lt;br&gt;
Inconsistent Data — Capitalization, Addresses and more&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Explore the Data&lt;/strong&gt;&lt;br&gt;
Once you have cleaned the data, the next step is to explore the data. Exploratory data analysis involves examining the data to identify patterns, relationships, and trends. There are several ways to explore the data:&lt;/p&gt;

&lt;p&gt;a. Descriptive Statistics: Descriptive statistics summarize the data's main characteristics, such as mean, median, mode, standard deviation, and variance.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import math
import statistics
import numpy as np
import scipy.stats
import pandas as pd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are all the packages you’ll need for Python statistics calculations. Usually, you won’t use Python’s built-in math package, but it’ll be useful in this tutorial. Later, you’ll import matplotlib.pyplot for data visualization.&lt;/p&gt;
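&lt;p&gt;As a minimal sketch of those descriptive statistics using only the standard library (the sample values below are made up, not the churn dataset):&lt;/p&gt;

```python
import statistics

# Hypothetical sample of session durations (minutes)
data = [4, 8, 6, 5, 3, 2, 8, 9, 2, 8]

print(statistics.mean(data))      # 5.5
print(statistics.median(data))    # 5.5
print(statistics.mode(data))      # 8 (the most frequent value)
print(statistics.stdev(data))     # sample standard deviation, ~2.68
print(statistics.variance(data))  # sample variance, ~7.17
```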

&lt;p&gt;b. Data Visualization: Data visualization is a powerful way to explore the data. You can create charts, graphs, and plots to visualize the data's distribution, relationships, and patterns.&lt;/p&gt;

&lt;p&gt;c. Statistical Tests: Statistical tests can help you test hypotheses and identify significant differences between groups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Identify Outliers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An outlier is a data point that differs significantly from the majority of the data taken from a sample or population. There are many possible causes of outliers, but here are a few to start you off:&lt;/p&gt;

&lt;p&gt;Natural variation in data&lt;br&gt;
Change in the behavior of the observed system&lt;br&gt;
Errors in data collection&lt;br&gt;
Data collection errors are a particularly prominent cause of outliers. For example, the limitations of measurement instruments or procedures can mean that the correct data is simply not obtainable. Other errors can be caused by miscalculations, data contamination, human error, and more.&lt;/p&gt;

&lt;p&gt;There isn’t a precise mathematical definition of outliers. You have to rely on experience, knowledge about the subject of interest, and common sense to determine if a data point is an outlier and how to handle it.&lt;/p&gt;
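&lt;p&gt;That said, one widely used rule of thumb (not a strict definition) flags points that fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]. A standard-library sketch with made-up values:&lt;/p&gt;

```python
import statistics

def iqr_outliers(data):
    # quantiles(n=4) returns the three quartiles [Q1, Q2, Q3]
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]

# Hypothetical measurements with two obvious anomalies
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102,
        12, 14, 17, 18, 107, 10, 13, 12, 14, 11]
print(iqr_outliers(data))  # [102, 107]
```

&lt;p&gt;Whether flagged points should actually be removed is still a judgment call, as noted above.&lt;/p&gt;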

&lt;p&gt;&lt;strong&gt;6. Identify Patterns and Relationships&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once you have explored the data, you can identify patterns and relationships between variables. Correlation analysis can help you identify the relationship between two variables, and regression analysis can help you predict the outcome variable based on the predictor variables.&lt;/p&gt;
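&lt;p&gt;As a sketch, the Pearson correlation coefficient can be computed directly from its definition: covariance divided by the product of the standard deviations. The two variables below are hypothetical:&lt;/p&gt;

```python
import math

def pearson(x, y):
    # r = cov(x, y) / (std(x) * std(y)), written out from the definition
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up example: time spent on the app vs. pages viewed
hours_on_app = [1, 2, 3, 4, 5]
pages_viewed = [2, 4, 6, 8, 10]
print(round(pearson(hours_on_app, pages_viewed), 6))  # 1.0, a perfect positive correlation
```

&lt;p&gt;Values near +1 or -1 indicate a strong linear relationship; values near 0 indicate little or no linear relationship.&lt;/p&gt;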

&lt;p&gt;&lt;strong&gt;7. Iterate&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Iterative data exploration is an essential aspect of the exploratory data analysis (EDA) process. EDA is an iterative process that involves repeatedly looking at data from different angles and perspectives to gain a deeper understanding of its properties and relationships.&lt;/p&gt;

&lt;p&gt;In the initial stages of EDA, you may start with a general overview of the data to understand its size, shape, and structure. Once you have a basic understanding of the data, you may start to explore specific aspects, such as relationships between variables, distributions, or outliers.&lt;/p&gt;

&lt;p&gt;As you uncover new information, you may need to go back and revisit earlier steps of the process, updating or refining your analysis. This iterative approach allows you to build a more complete and nuanced understanding of the data, and can help you identify patterns, trends, or anomalies that may be missed with a single pass through the data.&lt;/p&gt;

&lt;p&gt;Overall, iterative data exploration is a critical part of the EDA process, allowing you to explore data from different angles, uncover hidden relationships, and gain a deeper understanding of its properties and patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Reporting&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After completing an exploratory data analysis (EDA), it's important to communicate your findings to others. One way to do this is by creating a report that summarizes your EDA process, the insights gained, and any recommendations or conclusions that can be drawn from the data.&lt;/p&gt;

&lt;p&gt;Here are some steps you can follow to create a report after an EDA:&lt;/p&gt;

&lt;p&gt;Start with an introduction: Begin by providing some context about the data and the purpose of the analysis. This could include a brief overview of the data source, the problem you're trying to solve, or the goals of the analysis.&lt;/p&gt;

&lt;p&gt;Describe your EDA process: Explain the methods you used to explore the data, such as summary statistics, visualizations, or hypothesis testing. Provide details on the data cleaning and preparation steps you took, as well as any challenges or limitations you encountered.&lt;/p&gt;

&lt;p&gt;Present your findings: Summarize the key insights you gained from the analysis. This could include trends, patterns, correlations, outliers, or other noteworthy observations. Use visualizations, such as charts, graphs, or tables, to help illustrate your findings.&lt;/p&gt;

&lt;p&gt;Draw conclusions: Based on your findings, draw conclusions about the data and the problem you're trying to solve. Identify any relationships, trends, or patterns that are significant, and provide context for why they matter. Be sure to acknowledge any limitations or uncertainties in your analysis.&lt;/p&gt;

&lt;p&gt;Make recommendations: Based on your conclusions, provide recommendations for next steps or actions that could be taken based on the insights gained from the EDA. This could include further analysis, data collection, or changes to business processes.&lt;/p&gt;

&lt;p&gt;Conclude with a summary: Provide a brief summary of the key points of your report, highlighting the most important findings and recommendations.&lt;/p&gt;

&lt;p&gt;Overall, the goal of the report is to provide a clear, concise, and accurate summary of the EDA process and its results. It should be tailored to the intended audience, using language and visuals that are accessible and easy to understand.&lt;/p&gt;

</description>
      <category>gratitude</category>
    </item>
    <item>
      <title>Introduction to SQL</title>
      <dc:creator>Rodney Kirui</dc:creator>
      <pubDate>Sun, 19 Feb 2023 08:34:47 +0000</pubDate>
      <link>https://dev.to/rodneykirui/introduction-to-sql-1767</link>
      <guid>https://dev.to/rodneykirui/introduction-to-sql-1767</guid>
      <description>&lt;p&gt;&lt;strong&gt;INTRODUCTION&lt;/strong&gt;&lt;br&gt;
SQL is one of the most common programming languages for interacting with data.&lt;/p&gt;

&lt;p&gt;SQL consists of a data definition language, data manipulation language, and a data control language.&lt;/p&gt;

&lt;p&gt;The data definition language deals with the schema creation and modification e.g., CREATE TABLE statement allows you to create a new table in the database and the ALTER TABLE statement changes the structure of an existing table.&lt;br&gt;
The data manipulation language provides the constructs to query data such as the SELECT statement and to update the data such as INSERT, UPDATE and DELETE statements.&lt;br&gt;
The data control language consists of the statements that deal with the user authorization and security such as GRANT and REVOKE statements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HISTORY OF SQL&lt;/strong&gt;&lt;br&gt;
SQL was first brought into origin by IBM Researcher’s – Raymond F. Boyce, and Donald D. Chamberlin in the 1970’s and the initial version created by them was called SEQUEL or Structured English Query Language which worked on manipulation and retrieving data from IBM databases.&lt;/p&gt;

&lt;p&gt;After commercial testing, IBM released various versions like System/38, SQL/DS, and DB2 in 1979, 1981, and 1983, respectively.&lt;/p&gt;

&lt;p&gt;In 1986 making a breakthrough, ANSI and ISO adopted the Standard “Database Language SQL”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RULES OF WRITING SQL QUERIES&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;SQL statements can span multiple lines.&lt;/li&gt;
&lt;li&gt;SQL queries are capable of performing almost all actions on a database.&lt;/li&gt;
&lt;li&gt;SQL keywords are not case sensitive, but by convention they are written in uppercase for readability.&lt;/li&gt;
&lt;li&gt;SQL follows the principle of tuple relational calculus and the rules of relational algebra.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;SQL Commands and Types&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. DDL (Data Definition Language)&lt;/strong&gt;&lt;br&gt;
Deals with schema creation and modification, e.g., the CREATE TABLE statement allows you to create a new table in the database and the ALTER TABLE statement changes the structure of an existing table. Example:&lt;br&gt;
&lt;code&gt;CREATE TABLE DataFlair_Employee (&lt;br&gt;
name_emp  varchar(50),&lt;br&gt;
post_emp varchar(50),&lt;br&gt;
email varchar(50),&lt;br&gt;
age int,&lt;br&gt;
salary varchar(10)&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. DML (Data Manipulation Language)&lt;/strong&gt;&lt;br&gt;
Provides the constructs to query data, such as the SELECT statement, and to update data, such as the INSERT, UPDATE, and DELETE statements.&lt;br&gt;
Example: let us populate the table using the INSERT command:&lt;br&gt;
&lt;code&gt;Insert into DataFlair_Employee (name_emp , post_emp , email , age , salary)&lt;br&gt;
Values ('Ram', 'Intern', 'ram@dataflair.com', 21, '10000'),&lt;br&gt;
('Shyam', 'Manager', 'shyam@dataflair.com', 25, '25000'),&lt;br&gt;
('Ria', 'Analyst', 'ram@dataflair.com', 23, '20000'),&lt;br&gt;
('Kavya', 'Senior Analyst', 'kavya@dataflair.com', 31, '30000'),&lt;br&gt;
('Aman', 'Database Operator', 'rish@dataflair.com', 26, '15000');&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. DQL (Data Query Language)&lt;/strong&gt;&lt;br&gt;
It is used to retrieve the data stored in the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. DCL (Data Control Language)&lt;/strong&gt;&lt;br&gt;
Consists of the statements that deal with user authorization and security, such as the GRANT and REVOKE statements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performing a simple calculation&lt;/strong&gt;&lt;br&gt;
The following example uses the SELECT statement to get the first name, last name, salary, and new salary:&lt;br&gt;
&lt;code&gt;SELECT &lt;br&gt;
first_name, &lt;br&gt;
last_name, &lt;br&gt;
salary, &lt;br&gt;
salary * 1.05&lt;br&gt;
FROM&lt;br&gt;
    employees;&lt;/code&gt;&lt;br&gt;
The expression salary * 1.05 adds 5% to the salary of every employee. By default, SQL uses the expression as the column heading:&lt;/p&gt;

&lt;p&gt;To assign an expression or a column an alias, you specify the AS keyword followed by the column alias as follows:&lt;/p&gt;

&lt;p&gt;expression AS column_alias&lt;br&gt;
For example, the following SELECT statement uses the new_salary as the column alias for the salary * 1.05 expression:&lt;br&gt;
&lt;code&gt;SELECT &lt;br&gt;
first_name, &lt;br&gt;
last_name, &lt;br&gt;
salary, &lt;br&gt;
salary * 1.05 AS new_salary&lt;br&gt;
FROM&lt;br&gt;
    employees;&lt;/code&gt;&lt;/p&gt;
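&lt;p&gt;The alias example above can be verified end to end with Python's built-in sqlite3 module (the employees rows are made up); the cursor's description confirms that the computed column is reported as new_salary:&lt;/p&gt;

```python
import sqlite3

# Hypothetical employees table for the salary * 1.05 example
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, last_name TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Ada", "Lovelace", 1000.0), ("Alan", "Turing", 2000.0)])

cur = conn.execute("""
    SELECT first_name, last_name, salary, salary * 1.05 AS new_salary
    FROM employees
""")
cols = [d[0] for d in cur.description]  # column headings, including the alias
rows = cur.fetchall()
print(cols)  # ['first_name', 'last_name', 'salary', 'new_salary']
print(rows)  # each row carries the original salary plus the computed new_salary
```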

&lt;p&gt;&lt;strong&gt;CONCLUSION&lt;/strong&gt;&lt;br&gt;
In summary, SQL provides a standard syntax for interacting with relational databases, enabling users to easily retrieve, modify, and manage data in a variety of contexts.&lt;/p&gt;

&lt;p&gt;Overall, the versatility and efficiency of SQL make it a critical component of modern data-driven applications, from business intelligence and data warehousing to web development and analytics.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
