<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pushpa Sree Potluri</title>
    <description>The latest articles on DEV Community by Pushpa Sree Potluri (@sreepotluri).</description>
    <link>https://dev.to/sreepotluri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F285049%2F918893e1-58dc-4847-a5c3-c0bb0546c60e.jpg</url>
      <title>DEV Community: Pushpa Sree Potluri</title>
      <link>https://dev.to/sreepotluri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sreepotluri"/>
    <language>en</language>
    <item>
      <title>Introduction to Data Science - Part 2</title>
      <dc:creator>Pushpa Sree Potluri</dc:creator>
      <pubDate>Sat, 13 Jun 2020 17:05:19 +0000</pubDate>
      <link>https://dev.to/sreepotluri/introduction-to-data-science-part-2-26lo</link>
      <guid>https://dev.to/sreepotluri/introduction-to-data-science-part-2-26lo</guid>
<description>&lt;p&gt;Data Science is all about how well you understand your data. Knowing what type of data you have makes a big difference. By knowing your data type, you will be able to apply the appropriate statistical measurements and draw certain conclusions about the data.&lt;br&gt;
 &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gegq237Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/u4lpvxtkcuqo4q0ke28g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gegq237Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/u4lpvxtkcuqo4q0ke28g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;1. Categorical data or qualitative data&lt;/strong&gt; - data that can be divided into categories. Categorical data cannot be defined in numbers. Sometimes we use numbers to represent categorical data, but those numbers do not hold any mathematical value. &lt;br&gt;
Ex: You have population data of a city divided into male and female categories, where you have represented male as 1 and female as 0. Here 1 &amp;amp; 0 are numbers, but they do not hold any mathematical meaning.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;i. Nominal data&lt;/em&gt; - categorical data that has no order or sequence&lt;br&gt;
Ex: gender, race, language, etc.&lt;br&gt;
&lt;em&gt;ii. Ordinal data&lt;/em&gt; - categorical data that follows an order or sequence&lt;br&gt;
Ex: education (high school, college, graduate, PhD)&lt;/p&gt;
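&lt;p&gt;As a quick illustration of the two categorical subtypes, here is a minimal pandas sketch (the data is made up for the example):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Nominal: categories with no inherent order.
gender = pd.Series(pd.Categorical(["male", "female", "female", "male"]))

# Encoding male as 1 and female as 0 gives numbers with no mathematical meaning.
encoded = gender.map({"male": 1, "female": 0})

# Ordinal: categories with a meaningful order.
education = pd.Categorical(
    ["college", "high school", "PhD", "graduate"],
    categories=["high school", "college", "graduate", "PhD"],
    ordered=True,
)
print(education.min(), education.max())  # high school PhD
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;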

&lt;p&gt;&lt;strong&gt;2. Numerical or quantitative data&lt;/strong&gt; - data that represents numerical values &lt;br&gt;
Ex: You have population data of a city that includes each individual's annual income and number of children. Both of these are considered numerical data, since they give information about the quantity of a specific thing.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;i. Discrete data&lt;/em&gt; - data that can be counted and not measured&lt;br&gt;
Ex: Number of students in a class&lt;br&gt;
&lt;em&gt;ii. Continuous data&lt;/em&gt; - data that represents measurements&lt;br&gt;
Ex: Temperature, heights of the students in a class&lt;/p&gt;

&lt;p&gt;And just as there are different types of data, we have different ways to visualize each type (a short plotting sketch follows this list):&lt;br&gt;
&lt;em&gt;a. Nominal data&lt;/em&gt; - pie chart, bar graphs&lt;br&gt;
&lt;em&gt;b. Ordinal data&lt;/em&gt; - stacked bar graph&lt;br&gt;
&lt;em&gt;c. Discrete data&lt;/em&gt; - bar graphs, scatter plots&lt;br&gt;
&lt;em&gt;d. Continuous data&lt;/em&gt; - box plot, histograms&lt;/p&gt;
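&lt;p&gt;Here is a minimal matplotlib sketch of two of these pairings, using made-up data:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt

# Nominal data: a bar graph of counts per category.
languages = ["English", "Spanish", "Hindi"]
counts = [50, 30, 20]
plt.bar(languages, counts)
plt.title("Nominal: language counts")
plt.show()

# Continuous data: a histogram of measurements.
heights = [150, 152, 155, 160, 160, 162, 165, 170, 171, 175]
plt.hist(heights, bins=5)
plt.title("Continuous: student heights (cm)")
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;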

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Introduction to Data Science</title>
      <dc:creator>Pushpa Sree Potluri</dc:creator>
      <pubDate>Sat, 13 Jun 2020 03:39:46 +0000</pubDate>
      <link>https://dev.to/sreepotluri/introduction-to-data-science-227b</link>
      <guid>https://dev.to/sreepotluri/introduction-to-data-science-227b</guid>
<description>&lt;p&gt;Many people are under the misconception that data science is all about machine learning algorithms. That is not true. Data Science is a combination of mathematics, computer science, and machine learning. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xR65-9EQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/41vfi1clim213cv5hcct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xR65-9EQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/41vfi1clim213cv5hcct.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data Science is a study of data, where you maintain datasets and derive insights from them. Data Science combines the components shown in the diagram below to solve problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1nbzccMa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m0v770upuc0s1qki3mq9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1nbzccMa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m0v770upuc0s1qki3mq9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perception - trying to identify patterns with the help of the data&lt;br&gt;
Planning - involves two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finding all possible solutions&lt;/li&gt;
&lt;li&gt;Finding the best possible solution among all solutions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What do you need to know to be a successful data scientist?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Programming Knowledge&lt;/li&gt;
&lt;li&gt;Data modelling and evaluation&lt;/li&gt;
&lt;li&gt;Data Visualization and reporting&lt;/li&gt;
&lt;li&gt;Probability and Statistics&lt;/li&gt;
&lt;li&gt;Machine Learning techniques&lt;/li&gt;
&lt;li&gt;Relational Database knowledge&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's get started with some basic terminology used in data science (a short code sketch after the list shows these terms in action):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Observations - data points in your dataset (rows)&lt;/li&gt;
&lt;li&gt;Features - variables in your dataset (columns)&lt;/li&gt;
&lt;li&gt;Target Variable - the variable you are trying to predict&lt;/li&gt;
&lt;li&gt;Train data - data from which your algorithm learns&lt;/li&gt;
&lt;li&gt;Test data - data to evaluate your model performance&lt;/li&gt;
&lt;li&gt;Model - set of patterns learned from the data&lt;/li&gt;
&lt;li&gt;Algorithm - specific machine learning process used to train your model&lt;/li&gt;
&lt;/ol&gt;
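&lt;p&gt;A minimal scikit-learn sketch of these terms, using the built-in iris toy dataset (the classifier choice is just for illustration):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# X holds the features (columns); each row is an observation.
# y is the target variable we are trying to predict.
X, y = load_iris(return_X_y=True)

# Split into train data (to learn from) and test data (to evaluate on).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

algorithm = DecisionTreeClassifier()      # the algorithm
model = algorithm.fit(X_train, y_train)   # the model: patterns learned from the data
print(model.score(X_test, y_test))        # evaluate performance on the test data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;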

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Environment setup for Data Analysis with PySpark and Spark SQL</title>
      <dc:creator>Pushpa Sree Potluri</dc:creator>
      <pubDate>Mon, 27 Apr 2020 21:46:32 +0000</pubDate>
      <link>https://dev.to/sreepotluri/environment-setup-for-data-analysis-with-pyspark-and-spark-sql-gnb</link>
      <guid>https://dev.to/sreepotluri/environment-setup-for-data-analysis-with-pyspark-and-spark-sql-gnb</guid>
<description>&lt;p&gt;Data Analysis is all about extracting all possible insights from your dataset. A very important step in building a machine learning model is getting to know the data. Spark is widely used for its parallel data processing on computer clusters. Spark supports multiple programming languages (Python, Scala, R, and Java) and includes libraries for SQL (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming), and graph analytics (GraphX). In this post, I am going to use PySpark and Spark SQL for my data analysis.&lt;/p&gt;

&lt;p&gt;If you want to run Spark locally, you should have Java, as well as Python (Python 3), installed on your machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install Spark&lt;/strong&gt;&lt;br&gt;
i. Go to &lt;a href="https://spark.apache.org/downloads.html" rel="noopener noreferrer"&gt;https://spark.apache.org/downloads.html&lt;/a&gt; &lt;br&gt;
ii. Select version and package type&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fom6rnzuqivixv1bhbdet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fom6rnzuqivixv1bhbdet.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
iii. Click on the download link; it will take you to the Apache Software Foundation site, from which you can download the package&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsm9lypajdb0tsoas94k4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsm9lypajdb0tsoas94k4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
iv. Set up some environment variables for the Spark home and PySpark in a file called .bash_profile (a sketch of typical entries follows these steps)&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbeouglu3zdpizq13rgyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbeouglu3zdpizq13rgyw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
v. Install PySpark - I am using the Python installer program (pip) to install PySpark&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvingolript4klfczhtt4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvingolript4klfczhtt4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Launching Jupyter Notebook&lt;/strong&gt;&lt;br&gt;
i. Install Jupyter Notebook with the Python installer program (pip)&lt;/p&gt;
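&lt;p&gt;For reference, the entries from steps iv and v might look something like this (the Spark version, install path, and commands below are assumptions based on a typical local setup; match them to your actual download):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# ~/.bash_profile (hypothetical paths)
export SPARK_HOME=/opt/spark-2.4.5-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=python3

# install PySpark and Jupyter Notebook with pip
pip install pyspark
pip install jupyter
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;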

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnnl2cae045if943xpovs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnnl2cae045if943xpovs.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ii. Open a terminal window, navigate to your working directory, and type jupyter notebook. This will launch Jupyter Notebook&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa7hzqey0ix9lwwegsyud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fa7hzqey0ix9lwwegsyud.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6ofpf5ap5kwzllqj47em.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6ofpf5ap5kwzllqj47em.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;iii. Create a new Jupyter notebook by clicking on the "New" button on the upper right side and selecting Python 3&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp6w03exrl5gy33344ob4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fp6w03exrl5gy33344ob4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>spark</category>
    </item>
    <item>
      <title>Best online data science course for beginners (My opinion)</title>
      <dc:creator>Pushpa Sree Potluri</dc:creator>
      <pubDate>Thu, 02 Jan 2020 22:21:21 +0000</pubDate>
      <link>https://dev.to/sreepotluri/best-online-data-science-course-for-beginners-my-opinion-1dc</link>
      <guid>https://dev.to/sreepotluri/best-online-data-science-course-for-beginners-my-opinion-1dc</guid>
<description>&lt;p&gt;I have done some research to find the best online course for learning data science, especially for beginners, and I found this course really interesting. It has a well-organized structure, starting with data preprocessing and covering all the popular algorithms with hands-on exercises. &lt;/p&gt;

&lt;p&gt;Course: Udemy - Machine Learning A-Z : Hands-On Python &amp;amp; R in Data Science&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Intermediate level of Python or R&lt;/li&gt;
&lt;li&gt;Anyone with a programming background can try this, but I suggest going through the Python or R basics before starting this course&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This course provides hands-on practice building models using some of the most basic and widely used algorithms in Regression, Classification, Clustering, Association Rule Mining, Neural Networks, etc., in both R and Python. &lt;/p&gt;

&lt;p&gt;Each section is structured in a way that helps us understand the basics of how to build a model. Every section consists of the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Dataset (Explanation and Importing)&lt;/li&gt;
&lt;li&gt;Algorithm (Intuition and Implementation)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;And coming to how I learned data science&lt;/strong&gt; (a minimal code sketch of this workflow follows the list):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Think of a use case you want to implement (I work in the telecommunications industry, so I searched for the most popular machine learning use cases in telecom)&lt;/li&gt;
&lt;li&gt;Set your objective&lt;/li&gt;
&lt;li&gt;Which category does your objective fall into? (For this, you need to have a prior understanding of machine learning categories like Regression, Classification, Clustering, etc.)&lt;/li&gt;
&lt;li&gt;Choose a dataset (you can download datasets online - Kaggle.com has some good datasets)&lt;/li&gt;
&lt;li&gt;Start your project (I started in Jupyter Notebooks)&lt;/li&gt;
&lt;li&gt;Go through the data and make sure you have a clear understanding of the features (you should be able to answer all your questions from the data itself)&lt;/li&gt;
&lt;li&gt;And now the most important part: data pre-processing (handling missing data, removing duplicates, etc.)&lt;/li&gt;
&lt;li&gt;Select the features you need from data to train your model&lt;/li&gt;
&lt;li&gt;Select an algorithm that will fit your purpose&lt;/li&gt;
&lt;li&gt;Train your model&lt;/li&gt;
&lt;li&gt;Validate the model performance&lt;/li&gt;
&lt;/ol&gt;
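&lt;p&gt;Here is a minimal scikit-learn sketch of steps 7-11, assuming a hypothetical telecom churn CSV (the file name, column names, and model choice are all made up for illustration):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("telecom_churn.csv")   # hypothetical dataset
df = df.drop_duplicates().dropna()      # step 7: basic pre-processing

X = df[["monthly_charges", "tenure"]]   # step 8: select your features
y = df["churn"]                         # the target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()            # step 9: choose an algorithm
model.fit(X_train, y_train)             # step 10: train your model
print(accuracy_score(y_test, model.predict(X_test)))  # step 11: validate performance
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;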

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>Association Rule Learning - Part I</title>
      <dc:creator>Pushpa Sree Potluri</dc:creator>
      <pubDate>Sun, 15 Dec 2019 05:42:48 +0000</pubDate>
      <link>https://dev.to/sreepotluri/association-rule-learning-part-i-2o4p</link>
      <guid>https://dev.to/sreepotluri/association-rule-learning-part-i-2o4p</guid>
<description>&lt;p&gt;Ever wondered how retailers figure out which products customers tend to buy together? &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4gl5tdV1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/1vth0mkx51zxp3byle82.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4gl5tdV1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/1vth0mkx51zxp3byle82.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The answer is simple: Association Rule Learning. This technique is used by retailers across the globe to understand customer buying patterns by finding correlations between the products that customers have bought.&lt;/p&gt;

&lt;p&gt;Association Rule Learning involves two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finding all frequent itemsets&lt;/li&gt;
&lt;li&gt;Generating strong association rules from the frequent itemsets&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finding frequent itemsets can be done using either the Apriori algorithm or the FP-Growth algorithm. In this part, we will see how the Apriori algorithm works. Apriori works on the principle that &lt;/p&gt;

&lt;p&gt;"All nonempty subsets of a frequent itemset must also be frequent".&lt;/p&gt;

&lt;p&gt;Here is a sample dataset consisting of 9 transactions containing the items I1, I2, ..., I5.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZsvtewX8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/b6lz4bwv5oah6t3ipu7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZsvtewX8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/b6lz4bwv5oah6t3ipu7c.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In order to have a proper understanding of association rule learning, it's better if we know the following metrics:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Support: The support of an item I1 is simply the ratio of the number of transactions containing I1 to the total number of transactions&lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Support (I1) = Transactions containing I1 / Total transactions
                = 6 / 9 = 0.66
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Confidence: How likely a customer is to purchase item I3 when I1 is purchased. &lt;/p&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   Confidence (I1 =&amp;gt; I3) = Transactions containing both I1 and I3 / Transactions containing I1
                        = 4 / 6 = 0.66
&lt;/code&gt;&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now that we are familiar with these terms, let's try to understand the Apriori algorithm. For this example, I'm taking a minimum support count of 2. &lt;/p&gt;

&lt;p&gt;Step 1: Find the 1-frequent itemsets (all the items) and calculate their support counts (simply the number of times each itemset appears in our transactions)&lt;/p&gt;

&lt;p&gt;Step 2: Compare each item's support with the minimum support and remove the items whose support is less than the minimum support. Here, all the items satisfy the minimum support.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Yj2VKDnC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/7yo8m6vgi5ndcpwsq9yn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Yj2VKDnC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/7yo8m6vgi5ndcpwsq9yn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Step 3: From the resulting table, find the 2-frequent itemsets&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xJE2ZxGD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/0dh0wlyce1njhj4ybygp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xJE2ZxGD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/0dh0wlyce1njhj4ybygp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Step 4: Compare each itemset's support with the minimum support and remove the itemsets whose support is less than the minimum support. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lMlzKqqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/xzp02naj14uw8knv4hgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lMlzKqqe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/xzp02naj14uw8knv4hgr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Step 5: From the resulting table, find the 3-frequent itemsets and calculate their support counts&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eOP4XDJf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/slczpi3kxaqozsq1ie59.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eOP4XDJf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/slczpi3kxaqozsq1ie59.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Step 6: Compare each itemset's support with the minimum support and remove the itemsets whose support is less than the minimum support. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QB0fsI13--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/w19198lecd9vhdlt69o0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QB0fsI13--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/w19198lecd9vhdlt69o0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Step 7: From the resulting table, find the 4-frequent itemsets&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zgeu5-hn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/766epnscl5tip0mvc5ay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zgeu5-hn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/766epnscl5tip0mvc5ay.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Step 8: Compare each itemset's support with the minimum support and remove the itemsets whose support is less than the minimum support. &lt;/p&gt;

&lt;p&gt;Repeat these steps until you get an empty set. Since no 4-itemset satisfies our minimum support count, we stop generating itemsets here. A minimal Python sketch of this frequent-itemset search follows.&lt;/p&gt;
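&lt;p&gt;This sketch assumes a hypothetical 9-transaction list consistent with the counts quoted above (the actual dataset is in the image), and skips Apriori's candidate-pruning step for brevity:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support_count = 2

def support_count(itemset):
    # number of transactions containing every item in the itemset
    return sum(1 for t in transactions if itemset.issubset(t))

items = sorted({item for t in transactions for item in t})
candidates = [frozenset([i]) for i in items]
k, frequent = 1, {}
while candidates:
    # keep the k-itemsets that meet the minimum support count
    frequent[k] = {c: support_count(c) for c in candidates if support_count(c) &amp;gt;= min_support_count}
    # join frequent k-itemsets into (k+1)-itemset candidates
    keys = list(frequent[k])
    candidates = list({a | b for a in keys for b in keys if len(a | b) == k + 1})
    k += 1

print(frequent[3])  # expect {I1, I2, I3} and {I1, I2, I5}, each with support count 2
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;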

&lt;p&gt;Once the frequent itemsets are generated, it is time to generate strong association rules from them. Association rules can be generated as follows (a short code sketch of this follows below): &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For each frequent itemset l, generate all non-empty subsets s&lt;/li&gt;
&lt;li&gt;For every non-empty subset s, output the rule "s =&amp;gt; (l-s)"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this, I'm taking minimum confidence value = 60%&lt;/p&gt;
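&lt;p&gt;Reusing the support_count() helper and transaction list from the earlier sketch, generating and filtering the rules for the itemset {I1, I2, I5} might look like this:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from itertools import combinations

min_confidence = 0.6
l = frozenset({"I1", "I2", "I5"})

for size in range(1, len(l)):
    for s in map(frozenset, combinations(sorted(l), size)):
        # rule: s =&amp;gt; (l - s); confidence = support count of l / support count of s
        confidence = support_count(l) / support_count(s)
        if confidence &amp;gt;= min_confidence:
            print(set(s), "=&amp;gt;", set(l - s), round(confidence, 2))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;With the 60% threshold, this prints exactly the three strong rules identified in step 11 below.&lt;/p&gt;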

&lt;p&gt;Step 9: Generating all non-empty subsets of an itemset. Here, I am generating all non-empty subsets for an itemset {I1, I2, I5}&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eCfaPD9O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/30embyb6csamfs8jtv86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eCfaPD9O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/30embyb6csamfs8jtv86.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Step 10: Generating rules from the non-empty subsets&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1YiGZz7h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/i510arogh71aswdvoe81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1YiGZz7h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/i510arogh71aswdvoe81.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Step 11: Which rules to consider? For this, we have to calculate the confidence value for each rule&lt;/p&gt;

&lt;p&gt;Consider the first rule in the table I1 =&amp;gt; I2∩I5&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         Confidence = Support count of (I1, I2, I5) / Support Count of I1
                    = 2 / 6 * 100 = 33.3%
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Calculate the confidence for all the rules&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6hScBS1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/r8g6ry0u52b59t05wy7u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6hScBS1o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/r8g6ry0u52b59t05wy7u.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
After considering the minimum confidence value, rules 3,5 &amp;amp; 6 are strong rules for the itemset {I1, I2, I5}&lt;/p&gt;

&lt;p&gt;Step 12: Take the itemset {I1, I2, I3} and follow through steps 10 &amp;amp; 11&lt;/p&gt;

&lt;p&gt;This series consists of &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Apriori algorithm working (Current Post).&lt;/li&gt;
&lt;li&gt;Python implementation of apriori.&lt;/li&gt;
&lt;li&gt;FP Growth algorithm working.&lt;/li&gt;
&lt;li&gt;Python implementation of FP growth.&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>What do you need to know to get into Data Science as a Beginner?</title>
      <dc:creator>Pushpa Sree Potluri</dc:creator>
      <pubDate>Wed, 04 Dec 2019 02:35:30 +0000</pubDate>
      <link>https://dev.to/sreepotluri/what-do-you-need-to-know-to-get-into-data-science-as-a-beginner-1m4l</link>
      <guid>https://dev.to/sreepotluri/what-do-you-need-to-know-to-get-into-data-science-as-a-beginner-1m4l</guid>
<description>&lt;p&gt;Data Science is a combination of Programming &amp;amp; Statistics, so to be a data scientist you need to know at least one programming language, preferably Python or R, as there are large communities of people who use these languages to build their models. &lt;/p&gt;

&lt;p&gt;For a complete beginner, Python is easy to learn. Here are some of the basic tools used in data science from the Python stack (a typical import block follows the list):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Jupyter Notebooks&lt;/strong&gt; - interactive development environment&lt;br&gt;
&lt;strong&gt;Pandas&lt;/strong&gt; - library for data manipulation and analysis&lt;br&gt;
&lt;strong&gt;Numpy&lt;/strong&gt; - library for scientific computing&lt;br&gt;
&lt;strong&gt;Matplotlib &amp;amp; Seaborn&lt;/strong&gt; - libraries for data visualization&lt;br&gt;
&lt;strong&gt;Scikit-Learn&lt;/strong&gt; - library for machine learning&lt;/p&gt;
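&lt;p&gt;In practice, a notebook built on this stack usually starts with an import block along these lines (a sketch with made-up data; the aliases are the common conventions):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np                  # scientific computing
import pandas as pd                 # data manipulation and analysis
import matplotlib.pyplot as plt     # data visualization
import seaborn as sns               # statistical data visualization
from sklearn.linear_model import LinearRegression  # machine learning

# a tiny end-to-end example with made-up data
df = pd.DataFrame({"x": np.arange(10), "y": np.arange(10) * 2.0})
sns.scatterplot(data=df, x="x", y="y")
plt.show()
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_)  # expect approximately [2.]
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;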

&lt;p&gt;Good mathematical knowledge helps to make a better judgment while choosing a procedure (algorithm) based on the data available to you and also to diagnose the problems.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ETT3tZdo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zqvelxm4ki8vs7alnf7u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ETT3tZdo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/zqvelxm4ki8vs7alnf7u.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don't have time to go through the theory, start with a tutorial. Follow the tutorial step-by-step. After you complete a tutorial, apply what you learned to new datasets. You can find some sample datasets online (&lt;a href="https://www.kaggle.com/datasets"&gt;https://www.kaggle.com/datasets&lt;/a&gt;). If you try the same modeling on a new dataset, you might run into new issues. Upon doing some research, you might discover data issues in the dataset like different formats or missing values.&lt;/p&gt;

&lt;p&gt;If you are looking for more resources, &lt;a href="https://www.coursera.org/"&gt;https://www.coursera.org/&lt;/a&gt; and &lt;a href="https://www.datacamp.com/"&gt;https://www.datacamp.com/&lt;/a&gt; offer some good free courses.&lt;/p&gt;

&lt;p&gt;This blog was first posted on &lt;a href="https://www.hackerheap.com/"&gt;hackerheap.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>sql</category>
      <category>beginners</category>
      <category>python</category>
    </item>
  </channel>
</rss>
