<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marjan Ferdousi</title>
    <description>The latest articles on DEV Community by Marjan Ferdousi (@orthymarjan).</description>
    <link>https://dev.to/orthymarjan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F499037%2Fa673273a-5343-4425-8931-b2809530de1b.jpeg</url>
      <title>DEV Community: Marjan Ferdousi</title>
      <link>https://dev.to/orthymarjan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/orthymarjan"/>
    <language>en</language>
    <item>
      <title>Move your mouse pointer with hand gestures</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Thu, 01 Jun 2023 18:03:06 +0000</pubDate>
      <link>https://dev.to/orthymarjan/move-your-mouse-pointer-with-hand-gestures-1gp1</link>
      <guid>https://dev.to/orthymarjan/move-your-mouse-pointer-with-hand-gestures-1gp1</guid>
      <description>&lt;p&gt;There is a game called slither.io (&lt;a href="https://slither.io/" rel="noopener noreferrer"&gt;Link&lt;/a&gt;) I enjoy playing. It’s kinda hard playing from my phone (curse my small hands and the huge phone) and playing it using the built-in touchpad of my laptop is not something comfortable either. This is how the idea of a virtual mouse came to my mind.&lt;/p&gt;

&lt;p&gt;Controlling the mouse pointer with your gestures needs three parts. First, you need some sort of input that decides which direction the pointer should go. Second, that input has to be interpreted and processed. And finally, the actual movement of the mouse pointer according to those inputs has to be executed. Hand gestures can easily be captured with the webcam of my laptop. A very close friend recently used MediaPipe to interpret sign language from hand movements, so I thought, why don’t I try that library too? And for moving the mouse, a few Google searches made me want to use pyautogui.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbjarz1unh1tl7ztpcon.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcbjarz1unh1tl7ztpcon.png" alt="Libraries"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we start writing the code. I used pip to install the above libraries according to their documentation. After installing, the first thing we do is import them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import cv2
import mediapipe as mp
import pyautogui


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Taking input from the webcam:
&lt;/h4&gt;

&lt;p&gt;The first step is simple. We gotta capture the video from the webcam.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

video = cv2.VideoCapture(0)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  Understanding MediaPipe:
&lt;/h4&gt;

&lt;p&gt;Now we gotta use this video to detect movement. I wanted my mouse pointer to follow the direction of my index finger, and MediaPipe is a great option for identifying such human gestures. To use this library, we need to understand how it identifies gestures. MediaPipe can detect movements of your eyes or hands, or your posture in general, by identifying some important points on your body. These points are called landmarks. Let’s have a look at the landmarks of our hands.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F423ke4d32ma2587h71d3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F423ke4d32ma2587h71d3.png" alt="Mediapipe landmarks"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgifdr5nozfxp22qqvwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgifdr5nozfxp22qqvwg.png" alt="Sample output"&gt;&lt;/a&gt;&lt;br&gt;
So you can see, if we want to track the motion of our index finger, we need to find out what &lt;strong&gt;landmark 8&lt;/strong&gt; is doing at that moment. It might look a bit complex, but the library already does the work of identifying these points for you. You can create a hand object from the ‘Hands’ class of this library and use it to analyse your movements like this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

handGesture = mp.solutions.hands.Hands()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You might also need the drawing utilities from MediaPipe if you wanna draw the landmarks of your hand on the output screen.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

drawingTools = mp.solutions.drawing_utils


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h4&gt;
  
  
  The ‘loop’:
&lt;/h4&gt;

&lt;p&gt;Now, as you are taking input from the webcam, you have to process the input over and over for every frame, so you’d need a continuous loop. It might look somewhat like this:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

while True:
    &amp;lt;read the captured video&amp;gt;
    &amp;lt;format the video&amp;gt;
    &amp;lt;do something to detect the landmark 8&amp;gt;
    &amp;lt;move the mouse pointer according to the movements of landmark 8&amp;gt;


    cv2.imshow('Virtual Mouse', frame)
    cv2.waitKey(1)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here the &lt;strong&gt;imshow&lt;/strong&gt; creates a window to show your video output, and &lt;strong&gt;waitKey&lt;/strong&gt; introduces a delay of 1 ms to let the window respond to the actions.&lt;/p&gt;
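One thing this loop never does is exit: the return value of waitKey is simply discarded. A common pattern (my own addition here, not part of the original code) is to break out when a chosen key, say ‘q’, is pressed, e.g. `if cv2.waitKey(1) &amp; 0xFF == ord('q'): break`, followed by `video.release()` and `cv2.destroyAllWindows()`. The key test itself is plain Python:

```python
# Sketch of the key test used to end a cv2 capture loop cleanly.
# cv2.waitKey returns the pressed key's code, or -1 if no key was pressed;
# masking with 0xFF keeps only the low 8 bits before comparing.
def should_quit(key_code, quit_char="q"):
    return key_code != -1 and (key_code & 0xFF) == ord(quit_char)

print(should_quit(ord("q")))  # True: 'q' was pressed
print(should_quit(-1))        # False: no key was pressed
```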

&lt;h4&gt;
  
  
  Reading and formatting the video:
&lt;/h4&gt;

&lt;p&gt;You’ve already got the ‘video’ variable where you kept your input video from the webcam. &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

while True:
    _, frame = video.read()
    frame = cv2.flip(frame, 1)
    rgbConvertedFrame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    &amp;lt;do something to detect the landmark 8&amp;gt;
    &amp;lt;move the mouse pointer according to the movements of landmark 8&amp;gt;


    cv2.imshow('Virtual Mouse', frame)
    cv2.waitKey(1)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;video.read()&lt;/strong&gt; gives us two things: a boolean that says whether the reading was successful, and the actual frame data. I don’t particularly wanna do anything with the boolean, therefore it goes to the ‘_’ placeholder. I found the video got mirrored in the output window, therefore I added the cv2.flip() function to horizontally flip the image back. The frame has three attributes: height, width and number of channels. In this part I added &lt;strong&gt;cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)&lt;/strong&gt; to convert the frame from BGR to RGB. My code worked without this one, but it is recommended if you are using MediaPipe with cv2, as these libraries by default use different colour spaces.&lt;/p&gt;

&lt;h4&gt;
  
  
  Finding the landmark 8:
&lt;/h4&gt;

&lt;p&gt;Now the frame we got from the RGB conversion needs to be analyzed to find the index finger.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

while True:
    _, frame = video.read()
    frame = cv2.flip(frame, 1)
    rgbConvertedFrame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)


    output = handGesture.process(rgbConvertedFrame)
    hands = output.multi_hand_landmarks


    if hands:
            for hand in hands:
                drawingTools.draw_landmarks(frame, hand)
                landmarks = hand.landmark
                for id, landmark in enumerate(landmarks):
                    if id == 8:
                            &amp;lt;move the mouse pointer&amp;gt;




    cv2.imshow('Virtual Mouse', frame)
    cv2.waitKey(1)  


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We process the RGB-converted video frame using the handGesture object of the Hands class we declared earlier. Here &lt;strong&gt;multi_hand_landmarks&lt;/strong&gt; gives you all the hand landmarks found in the video, and we store them in the hands variable. We then draw all the landmarks we’ve got inside this ‘hands’ variable using &lt;strong&gt;draw_landmarks&lt;/strong&gt;, just to get a better view on the output screen. However, in our current case we’re gonna work with landmark 8 only, so the mouse-binding code will execute only when we find landmark 8.&lt;/p&gt;

&lt;h4&gt;
  
  
  Moving the mouse pointer:
&lt;/h4&gt;

&lt;p&gt;When we find landmark 8, aka your index finger, we have to locate its position and somehow bind it to the mouse. Say the tip of my index finger is at the center of the frame right now and I move it to the right; the mouse pointer needs to move to the right too. We can easily do that using the &lt;strong&gt;pyautogui.moveTo(mousePositionX, mousePositionY)&lt;/strong&gt; function. My first intuition was:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

if id == 8:
    x = int(landmark.x*frameWidth)
    y = int(landmark.y*frameHeight)
    cv2.circle(img=frame, center=(x,y), radius=30, color=(0, 255, 255))
    pyautogui.moveTo(mousePositionX, mousePositionY)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;However, the dimensions of the frame you are capturing with the camera and the dimensions of your computer screen might not be the same. In my case, my mouse pointer was moving, but the script was crashing within seconds and I couldn’t understand why. I needed to scale the values of x and y to fit within the screen.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

screenWidth, screenHeight = pyautogui.size()


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;pyautogui.size() gives you the size of the screen, and &lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

frameHeight, frameWidth, _ = frame.shape


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;frame.shape gives you the size of the frame.&lt;/p&gt;

&lt;p&gt;Combining these two, it became something like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

if id == 8:
    x = int(landmark.x*frameWidth)
    y = int(landmark.y*frameHeight)
    cv2.circle(img=frame, center=(x,y), radius=30, color=(0, 255, 255))
    mousePositionX = screenWidth/frameWidth*x
    mousePositionY = screenHeight/frameHeight*y
    pyautogui.moveTo(mousePositionX, mousePositionY)



&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
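The scaling here is just a linear map from frame pixels to screen pixels. As a quick sanity check of the arithmetic in plain Python (the 640×480 frame and 1920×1080 screen are made-up example sizes, not necessarily what your camera and monitor report):

```python
def frame_to_screen(x, y, frame_w, frame_h, screen_w, screen_h):
    # Scale a point inside the camera frame to the equivalent point
    # on the computer screen, the same way the moveTo code does.
    return screen_w / frame_w * x, screen_h / frame_h * y

# The center of a 640x480 frame should land at the center of a 1920x1080 screen.
print(frame_to_screen(320, 240, 640, 480, 1920, 1080))  # (960.0, 540.0)
```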

&lt;p&gt;And the final code looks like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

import cv2
import mediapipe as mp
import pyautogui


video = cv2.VideoCapture(0)


handGesture = mp.solutions.hands.Hands()
drawingTools = mp.solutions.drawing_utils
screenWidth, screenHeight = pyautogui.size()


while True:
   _, frame = video.read()
   frame = cv2.flip(frame, 1)
   frameHeight, frameWidth, _ = frame.shape
   rgbConvertedFrame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
   output = handGesture.process(rgbConvertedFrame)
   hands = output.multi_hand_landmarks


   if hands:
       for hand in hands:
           drawingTools.draw_landmarks(frame, hand)
           landmarks = hand.landmark
           for id, landmark in enumerate(landmarks):
               if id == 8:
                   x = int(landmark.x*frameWidth)
                   y = int(landmark.y*frameHeight)
                   cv2.circle(img=frame, center=(x,y), radius=30, color=(0, 255, 255))
                   mousePositionX = screenWidth/frameWidth*x
                   mousePositionY = screenHeight/frameHeight*y
                   pyautogui.moveTo(mousePositionX, mousePositionY)




   cv2.imshow('Virtual Mouse', frame)
   cv2.waitKey(1)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And the output was like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt9aej2bulegeaeox2fh.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt9aej2bulegeaeox2fh.gif" alt="my output"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57ws0in8k4corboyudkr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57ws0in8k4corboyudkr.png" alt="cv2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvok7i02ljq6yki1j1eid.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvok7i02ljq6yki1j1eid.png" alt="MediaPipe"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86y02fuk2v0j31et3zqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F86y02fuk2v0j31et3zqv.png" alt="pyautogui"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Beginners’ journey to machine learning</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Mon, 08 Nov 2021 15:18:29 +0000</pubDate>
      <link>https://dev.to/orthymarjan/beginners-journey-in-machine-learning-3ei9</link>
      <guid>https://dev.to/orthymarjan/beginners-journey-in-machine-learning-3ei9</guid>
      <description>&lt;p&gt;Hello, data science cat is back.&lt;/p&gt;

&lt;p&gt;After kitty became successful in decision making with machine learning (&lt;a href="https://dev.to/orthymarjan/data-science-for-cats-1d7k"&gt;https://dev.to/orthymarjan/data-science-for-cats-1d7k&lt;/a&gt;), a lot of hooman friends have been asking him questions like&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I understand the basics, but where do I start coding?&lt;/li&gt;
&lt;li&gt;I understand the codes from the internet, but how can I start writing codes by myself?&lt;/li&gt;
&lt;li&gt;How do I organize my project?&lt;/li&gt;
&lt;li&gt;How do I visualize the solution of a real life problem?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Kitty now tries to explain the answers with real life examples.&lt;/p&gt;

&lt;p&gt;First, think of the word ‘learning’. Kitty wants you to remember how you all started learning formally, and later how you implemented your knowledge in the real world. Imagine yourself as a teacher in a school. A new student comes and gets enrolled in a class. You prepare a course curriculum for the class and start teaching accordingly. You take a few class tests to assess how the kids are doing. At the end of the year, you prepare a final test based on what you’ve taught throughout the year. You distribute the question paper to the kids, they answer the questions and you verify the answers to see how well they have learnt. If their answers are above a certain level, they pass. Otherwise they fail. Those who pass later get jobs and use their knowledge from the school to complete their tasks. For example, if you are an English teacher, you teach the kids grammar and literature, and later in real life they might not have to write poems or fill in the blanks with the right forms of verbs, but they implement their knowledge of the English language to write a report or product document.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlq4wq3640v723ubaezf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjlq4wq3640v723ubaezf.png" alt="Kitty is now teacher"&gt;&lt;/a&gt;&lt;br&gt;
Remember, machine learning is also a procedure of learning. Now let’s compare the procedure of a school with machine learning. In our example, we will be working in Python on a very small dataset (&lt;a href="https://www.kaggle.com/ronitf/heart-disease-uci" rel="noopener noreferrer"&gt;https://www.kaggle.com/ronitf/heart-disease-uci&lt;/a&gt;) where the machine learning model will try to predict whether a patient has a high risk of heart disease based on some test reports. You can follow a similar process in R too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv('Heart Disease Dataset.csv')
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbze34prjdo139k3eweu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsbze34prjdo139k3eweu.png" alt="The data"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Make Curriculum:&lt;/strong&gt;&lt;br&gt;
At first you would want to decide which topics you would like to teach your student throughout the year. You have to decide which topics (in this case, columns or features) your student has to understand in order to learn whatever you’re trying to teach him (in this case, whether the patient has a high risk of heart disease). You’ll also decide how much data you have to teach him and how you would take exams later in your course. You’ll be defining ‘training data’ to teach throughout the year and ‘testing data’ for the exam, both taken from your actual dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split

feature_cols = ['age',  'sex',  'cp', 'trestbps', 'chol', 'fbs',  'restecg',  'thalach',  'exang',  'oldpeak',  'slope',  'ca', 'thal']
X = df[feature_cols]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, in X_train you have training features, in X_test you have features of the testing set, in y_train you have the targets of the training set and in y_test you have the targets of the testing set. Train-test ratio is 90%-10% here.&lt;/p&gt;
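To make the 90%–10% idea concrete, here is a toy stand-in for what train_test_split does (a sketch of the idea, not sklearn’s actual implementation, which also supports stratification and other options):

```python
import random

def toy_train_test_split(rows, test_size=0.10, seed=0):
    # Shuffle the row indices, then hold out a `test_size` fraction
    # as the exam ("testing data") and keep the rest for teaching.
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_size)
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test

train, test = toy_train_test_split(list(range(100)))
print(len(train), len(test))  # 90 10
```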

&lt;p&gt;&lt;strong&gt;Student Enrolment:&lt;/strong&gt;&lt;br&gt;
The student in the process of machine learning is your model. At first, it knows nothing. It’s your job to make a suitable procedure of learning for it so that it can later perform in the real world. Let’s say our teaching procedure in this case is the SVM model. We declare a variable named svm and tune its parameters (here we’ve taken a linear kernel; there are many more options you can find in the documentation).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.svm import SVC #SVM classifier
svm = SVC(kernel="linear")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Teaching:&lt;/strong&gt;&lt;br&gt;
In our case, teaching is ‘fitting’ data to the model. When you fit the training set to the model, it ‘learns’.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;svm.fit(X_train, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Exam:&lt;/strong&gt;&lt;br&gt;
In the case of the exam, our question paper is X_test. You already have the correct answers to the question paper in your hand, which is y_test. The student will write his answers in another variable, let’s say y_pred.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;y_pred = svm.predict(X_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Evaluating the test papers:&lt;/strong&gt;&lt;br&gt;
You can already understand that you can verify the answers of a student by comparing y_test and y_pred, and decide if he passes or fails.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
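Accuracy here is nothing more than the fraction of questions the student answered correctly. A minimal sketch of what metrics.accuracy_score computes for labels like these:

```python
def accuracy(y_true, y_pred):
    # Count how many predicted answers match the answer sheet,
    # then divide by the number of questions.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```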



&lt;p&gt;&lt;strong&gt;Importance of class tests:&lt;/strong&gt;&lt;br&gt;
If you find your student didn’t do well in the final exam, two things might have happened. The student might not have learnt properly throughout the year, or maybe he did study well but for some reason couldn’t do well in the finals. Here comes the importance of taking class tests. If the student hadn’t studied properly throughout the year, his class test results would not be satisfactory. If he has done well in the class tests, failing the finals would indicate some other problem. As our accuracy is not satisfactory, let’s check his class test performance using k-fold cross validation (which kinda takes some tests from the training set).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn import model_selection

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=31)
model =  SVC(kernel="linear")

results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold)
results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
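Under the hood, k-fold simply carves the training set into k chunks and lets each chunk serve once as the ‘class test’ while the rest is used for studying. A toy sketch of the index bookkeeping (without the shuffling that random_state controls):

```python
def kfold_indices(n, k):
    # Split n sample indices into k consecutive folds, with the first
    # n % k folds getting one extra sample so every index is used once.
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

print(kfold_indices(10, 3))  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```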



&lt;p&gt;If the cross validation results are poor too, that means we have ‘underfitting’, meaning we couldn’t provide enough data for our model to learn from (which happened in our case, as our accuracy is pretty mediocre and the cross validation results aren’t that good either). If not, the probable cause is ‘overfitting’, meaning the model is learning the data a bit too well (including noise and bad stuff). Here is a link for you on what you can do (like changing parameters and stuff) in this case: &lt;a href="https://adityarohilla.com/2018/11/02/a-brief-introduction-to-support-vector-machine/" rel="noopener noreferrer"&gt;https://adityarohilla.com/2018/11/02/a-brief-introduction-to-support-vector-machine/&lt;/a&gt;.&lt;br&gt;
You can also test other models like decision tree, random forest or naive bayes from the same Python library to check which one suits you best.&lt;br&gt;
&lt;strong&gt;Using this knowledge in the workplace:&lt;/strong&gt;&lt;br&gt;
What your student has learnt well throughout his school life, he will also be able to perform at his job. To make him remember his training, we can export the trained model into a file and later load that file in a system to predict. You can easily integrate your models into a system made with Python frameworks like Flask by importing the model file.&lt;br&gt;
To export,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from joblib import dump

# dump the pipeline model
dump(svm, filename="classification.joblib")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To import,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from joblib import load

# load the pipeline model
pipeline = load("classification.joblib")
pipeline.predict([[35,  0,  2,  115,  245,  0,  0,  147,  0,  0.4,  2,  0,  2]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here you can see our model predicted the heart disease risk of a new patient who was not a part of our training set and the prediction is [1]. Here is an example of how to integrate such files with Flask: &lt;a href="https://www.analyticsvidhya.com/blog/2020/04/how-to-deploy-machine-learning-model-flask/" rel="noopener noreferrer"&gt;https://www.analyticsvidhya.com/blog/2020/04/how-to-deploy-machine-learning-model-flask/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnnlkn4chkqra8rt3y0f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flnnlkn4chkqra8rt3y0f.png" alt="The process"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here I’m rewriting the code sequentially so that it becomes a bit clearer to you if you are new.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
from sklearn.svm import SVC #SVM classifier
from sklearn import metrics

feature_cols = ['age',  'sex',  'cp', 'trestbps', 'chol', 'fbs',  'restecg',  'thalach',  'exang',  'oldpeak',  'slope',  'ca', 'thal']
X = df[feature_cols]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
    </item>
    <item>
      <title>How to prove that your cat is fat (with statistics and python)</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Tue, 23 Feb 2021 22:03:40 +0000</pubDate>
      <link>https://dev.to/orthymarjan/how-to-prove-your-cat-is-fat-with-statistics-and-python-1a6</link>
      <guid>https://dev.to/orthymarjan/how-to-prove-your-cat-is-fat-with-statistics-and-python-1a6</guid>
      <description>&lt;p&gt;You have a very, very fat and lazy cat named Bogla. He’s so lazy that he often falls asleep in his food bowl. You keep yelling at him not to eat so much tuna but he pays absolutely no attention to you, and has no clue how fat he is. So you decide to lecture him with statistical proofs. You weigh him and oh my God, he’s already 6 kg.&lt;/p&gt;

&lt;p&gt;You look up on the internet how you can statistically prove that your cat is too fat. They have written that you have to collect more ‘data’, and to do so, you need to know the weights of some other cats. So you start collecting the information by calling the other cat parents. You write a small python script where you use the ‘pandas’ library to convert the pairs of cat names and their weights into a row-column like structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Import pandas library
import pandas as pd

# initialize list of lists
height_data = [['tom', 3.0], ['pumpkin', 3.51], ['bonk', 4.2], ['thunder', 5.5], ['oreo', 4.73], ['nya', 5], ['kitkat', 4.55], ['bubbles', 4.9], ['sparkle', 6.29], ['pebbles', 3.72]]
 # Create the pandas DataFrame
df_weight = pd.DataFrame(height_data, columns = ['Name', 'Weight(kg)'])
 # print dataframe.
df_weight
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7hfb4sasqj2pd0hvpi2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7hfb4sasqj2pd0hvpi2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, you think you need to put the data into a type of graph named a ‘&lt;strong&gt;Histogram&lt;/strong&gt;’. To make a histogram, you need to understand the concept of &lt;strong&gt;bins&lt;/strong&gt; or &lt;strong&gt;buckets&lt;/strong&gt;. A bin is like a group. For example, when you think of a person’s age, if the person is below 13 years, you call him a kid; if he is between 13 and 19, you call him a teen; and if he is more than 19 years old, you usually call him a grown up. Here the age ranges of 0-12, 13-19 and 19+ can be considered as bins.&lt;/p&gt;
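The age example maps straight to code; here’s a tiny sketch of putting a value into its bin:

```python
def age_bin(age):
    # The three bins from the example: 0-12 (kid), 13-19 (teen), 19+ (grown up).
    if age < 13:
        return "kid"
    elif age <= 19:
        return "teen"
    return "grown up"

print([age_bin(a) for a in (7, 15, 42)])  # ['kid', 'teen', 'grown up']
```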

&lt;p&gt;To make a histogram of the weights of your friends’ cats, you decide to make 5 bins. The lightest cat is Tom, who is 3 kg, and the heaviest is Sparkle, who is 6.29 kg. So all the cats are ‘distributed’ within this 3 to 6.29 kg range. If you want to make 5 equal bins within this (6.29-3) or 3.29 kg range, each bin will be around (3.29/5) or 0.658 kg wide. Therefore, Tom, who is 3 kg, will be in the first bin. Pumpkin (3.51 kg) is also in this range, as the first bin spreads from 3 to 3.658 kg. The second bin goes from 3.658 to (3.658+0.658) or 4.316 kg, and you can see that the third cat, Bonk, who is 4.20 kg, will be in the second bin. In the same way, you put all of these cats into bins and count how many cats each bin contains. You see, bins 1, 2, 3, 4 and 5 contain 2, 2, 3, 2 and 1 cats respectively. You can do the whole process with a small python script.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import seaborn as sns
sns.histplot(data=df_weight, x="Weight(kg)", bins=5)
#sns.histplot(data=df_weight, x="Weight(kg)", bins='auto')  # 'auto' decides how many bins you need
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folnoe9ts7brh7hh8er9z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folnoe9ts7brh7hh8er9z.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You see, most of the cats (in this case, 3 cats) have their weight in the 3rd bin, which covers the range of 4.316 to (4.316+0.658) or 4.974 kg.&lt;/p&gt;
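&lt;p&gt;If you want to see the counts without plotting, numpy’s histogram function returns both the counts and the bin edges. Here is a minimal sketch; the weights below are assumed for illustration, since the article doesn’t list every cat:&lt;/p&gt;

```python
import numpy as np

# Assumed sample weights for the 10 neighbourhood cats (kg);
# only a few are named in the article, the rest are made up.
weights = [3.0, 3.51, 4.0, 4.20, 4.5, 4.64, 4.8, 5.1, 5.5, 6.29]

counts, edges = np.histogram(weights, bins=5)
print(counts)          # how many cats fall in each bin
print(edges.round(3))  # the bin boundaries, 0.658 kg apart
```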

&lt;p&gt;There are more cats out there, aren’t there? There might be cats weighing less than 3 kg, and so on. It would be great if you could estimate, from the data you have, the probability of another cat being fat. Here, the weight of a cat is a “&lt;strong&gt;Random Variable&lt;/strong&gt;” and the set of all the cats’ weights is the “&lt;strong&gt;Sample Space&lt;/strong&gt;”. A “&lt;strong&gt;Distribution&lt;/strong&gt;” describes all the possible values of the random variable (in this case, the weight of a cat) and how often each specific value might occur (for example, a cat weighing 6 kg). Now let’s try to write a few lines of code to see the distribution of our dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.FacetGrid(df_weight, size=6) \
  .map(sns.kdeplot, "Weight(kg)") \
  .add_legend()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mrujw78pxt3mte1sm1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F41mrujw78pxt3mte1sm1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this graph, you can clearly see that the curve is at its highest point near 4 to 5 kg, which means most of the cats in your neighbourhood fall within this range. &lt;/p&gt;

&lt;p&gt;Here the curve you see is (almost) a ‘&lt;strong&gt;Normal Distribution&lt;/strong&gt;’. An actual normal distribution is symmetric about the mean, shaped kind of like a bell (🔔): the mean value sits at the center of the curve, and the right and left sides look like mirror images. Here is a picture of a normal distribution:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qkjbl4m7w0zy2x905gh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1qkjbl4m7w0zy2x905gh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might already know the meanings of mean, median and mode. The mean is the average of all your data, the median is the middle value that separates the higher and lower halves of your dataset, and the mode is the value that occurs most often. In a normal distribution, all three of them are at the same point, and that point is marked in our image with a dotted line.&lt;/p&gt;
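&lt;p&gt;All three measures can be checked with Python’s built-in statistics module. A small sketch, again with assumed weights:&lt;/p&gt;

```python
import statistics

# Assumed sample weights (kg); made up for illustration.
weights = [3.0, 3.51, 4.0, 4.20, 4.5, 4.64, 4.8, 5.1, 5.5, 6.29]

print(statistics.mean(weights))    # the average
print(statistics.median(weights))  # the middle value
# On continuous data every value is usually unique, so a single mode is
# not meaningful; multimode() lists the most frequent value(s) instead.
print(statistics.multimode(weights))
```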

&lt;p&gt;Now the problem is, your distribution curve was not perfectly symmetrical. Curves sometimes lean to the left or to the right. We call this asymmetry ‘&lt;strong&gt;Skewness&lt;/strong&gt;’. Skewness can be negative or positive depending on where the mean and median are. If the mean is greater than the median (mean &amp;gt; median, i.e. the mean is to the right of the median in the graph), the skewness is positive. In a negatively skewed distribution, the mean is less than the median (mean &amp;lt; median, i.e. the mean is to the left of the median).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgg3cf5usnsrti9yplz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgg3cf5usnsrti9yplz3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For our dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
from scipy import stats

mean = np.mean(df_weight['Weight(kg)'])
median = np.median(df_weight['Weight(kg)'])

print(mean, median)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the mean is 4.54 and the median is 4.64, so mean &amp;lt; median, which means the data is negatively skewed.&lt;/p&gt;

&lt;p&gt;Anyway, you now know from the graph that your cat is somewhat fat. But how fat exactly is he? Here comes the concept of ‘&lt;strong&gt;Percentile&lt;/strong&gt;’ to save your day. What is a percentile? Suppose there is a cat of 4.95 kg and he is fatter than 75% of the cats. Then the value 4.95 is the ‘75th percentile’. The difference between the 75th percentile and the 25th percentile is called the interquartile range (IQR).&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbif60f0q56w6hizi9nb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpbif60f0q56w6hizi9nb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q2gcer2rf78d0nh63pc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8q2gcer2rf78d0nh63pc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s check how many cats have lower weight than your cat.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
sum(np.abs(df_weight["Weight(kg)"]) &amp;lt; 6) / float(len(df_weight["Weight(kg)"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Whoa, 0.9! That means your cat is fatter than 90% of the cats.&lt;/p&gt;

&lt;p&gt;You finally know how to yell at your cat in statistics, but your journey wasn’t smooth. While collecting data in the beginning, you accidentally wrote Tom’s weight as ‘0.3’ instead of ‘3.0’, and all the graphs were messed up.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmrg0kbyk7rwhwdxys0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmrg0kbyk7rwhwdxys0v.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You didn’t know why there was blank space in the histogram. You looked it up on the internet at the time and came across the concept of ‘&lt;strong&gt;Outliers&lt;/strong&gt;’. An outlier is simply a data point that differs markedly from the rest. To check for outliers in your data, you used boxplots. In a box plot, the most likely range of values (in this case, the most common cat weights) is shown as a box, the lower and upper (but still acceptable) values are shown as whiskers, and the unacceptable values are shown as dots.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sns.boxplot(y='Weight(kg)', data=df_weight)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwb0g26tt8lm96yw6ee4f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwb0g26tt8lm96yw6ee4f.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this you saw there was a value that didn’t belong in your dataset. You can find that value using the interquartile range (IQR) I mentioned earlier. If the value of the 25th percentile is Q1 and the 75th percentile is Q3, anything higher than Q3 + 1.5 x IQR or lower than Q1 - 1.5 x IQR is an outlier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Q1 = np.percentile(df_weight['Weight(kg)'], 25, interpolation = 'midpoint')
Q3 = np.percentile(df_weight['Weight(kg)'], 75, interpolation = 'midpoint')
IQR = Q3 - Q1

low = Q1 - 1.5 * IQR
up = Q3 + 1.5 * IQR

outlier =[]
for x in df_weight['Weight(kg)']:
   if ((x&amp;gt; up) or (x&amp;lt;low)):
       outlier.append(x)
print('outlier in the dataset is', outlier)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is how you knew you made a mistake while writing Tom’s weight.&lt;/p&gt;

&lt;p&gt;Now go yell at your cat.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>python</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Reading and Manipulating Your Dataset With Pandas (2)</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Sat, 19 Dec 2020 03:28:25 +0000</pubDate>
      <link>https://dev.to/orthymarjan/reading-and-manipulating-the-dataset-with-pandas-2-4b32</link>
      <guid>https://dev.to/orthymarjan/reading-and-manipulating-the-dataset-with-pandas-2-4b32</guid>
      <description>&lt;h1&gt;
  
  
  Manipulation
&lt;/h1&gt;

&lt;p&gt;Let's say you need to see only one column of your dataframe. To see the 'fixed acidity' column of our dataset, you need to write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['fixed acidity']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh72efcbi0ovivmd9w1pb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fh72efcbi0ovivmd9w1pb.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you add a condition to this column, for example, if you want to see the rows that have a fixed acidity higher than 9:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[df['fixed acidity']&amp;gt;9]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmln4d8tu70y8obn77ja9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmln4d8tu70y8obn77ja9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometimes you might need rows with multiple conditions added to columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[(df['fixed acidity']&amp;gt;9) &amp;amp; (df['citric acid']&amp;gt;0.5)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmzhxr6ghuey36snfgexl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmzhxr6ghuey36snfgexl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you need to select specific columns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.loc[:,['volatile acidity', 'chlorides']]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0g408hsf38b6ppfohdbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0g408hsf38b6ppfohdbt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may want to add conditions with them too, for example, you may want to see the 'volatile acidity' and 'chlorides' content of those rows that have a 'fixed acidity' of 9.2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.loc[df['fixed acidity'] == 9.2, ['fixed acidity','volatile acidity', 'chlorides']]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwe2hmdmwzigx8n6ql9kt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwe2hmdmwzigx8n6ql9kt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can view the rows for specific indices (as discussed in the previous chapter) too, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.loc[0:3, ['volatile acidity', 'chlorides']]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1ah7gxs2gm4g748vc8ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1ah7gxs2gm4g748vc8ld.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, if you want to locate a specific value, for example, the alcohol content of the wine in the 0th row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['alcohol'].loc[0]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and you will get a value of 9.4.&lt;/p&gt;

&lt;p&gt;You can also locate a row using its index:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.iloc[100]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm1pu3dx11d7bsua26vdf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fm1pu3dx11d7bsua26vdf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now if you want to pinpoint a value within this, for example, the 1st attribute (volatile acidity in this case) of the 100th row, try:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.iloc[100][1]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and you will get 0.61 as expected.&lt;/p&gt;

&lt;p&gt;You can locate specific consecutive rows and columns using this iloc command, for example, first three columns of 3rd to 7th row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.iloc[3:8, 0:3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8zyq9tazgavq2qs1jike.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8zyq9tazgavq2qs1jike.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;and non-consecutive rows and columns too:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.iloc[[71, 122, 400], [0, 2]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy25ihctt5qvbdn7z0mqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy25ihctt5qvbdn7z0mqw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if you want to add a new column to your dataframe? Let's add a 'new column' containing the word 'hi' for all rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['new column'] = 'hi'
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7otyirlsqo39ho3fw07e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7otyirlsqo39ho3fw07e.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's try changing the value of 'new column' of 0th index of the dataframe using iloc from 'hi' to 'bye':&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.iloc[0, df.columns.get_loc('new column')]= 'bye'
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe2109eya7v7z403wyv9i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe2109eya7v7z403wyv9i.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's try to find the word starts with 'by' (that we just have added) and replace it with 'hello':&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['new column'].loc[df['new column'].str.startswith('by')] = 'hello'
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foymfdq8nfkquf5r4ocyj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foymfdq8nfkquf5r4ocyj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also replace null values of your data using pandas. We do not have any null values here, so let's introduce a null value first. Let's replace the string 'hello' with null. To do so, we would need the numpy library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
df['new column'].loc[df['new column'].str.startswith('hel')] = np.nan
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fja4y8gzk9235033fnh0v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fja4y8gzk9235033fnh0v.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To check the number of null values, you can use the isna() method like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.isna().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2p0jbhii1ldzc49gz5av.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2p0jbhii1ldzc49gz5av.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isna() method can also be used to locate the null value like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pd.isna(df.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2n5w7vuutgq1t88ptam0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2n5w7vuutgq1t88ptam0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's replace the null value with 'hey'.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.fillna(value='hey', inplace=True)
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpvby0arnc5di35z2e8tq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpvby0arnc5di35z2e8tq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to drop null values, use the dropna() method.&lt;/p&gt;
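&lt;p&gt;For example, a minimal sketch on an assumed toy dataframe (not the wine dataset):&lt;/p&gt;

```python
import pandas as pd

# A tiny assumed frame with one missing value.
df_demo = pd.DataFrame({'new column': ['hi', None, 'bye'],
                        'value': [1, 2, 3]})

print(df_demo.dropna())        # drops the row whose 'new column' is missing
print(df_demo.dropna(axis=1))  # drops the whole column containing the NaN instead
```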

&lt;p&gt;Now we will try to create a new dataframe using a loop, where one column of the new dataframe would look the same as the 'new column' of our dataframe df.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rows = []
for i in range(df.shape[0]):
     rows.append(['hi', 'bye'])
df_new = pd.DataFrame(rows, columns=["new column 2", "new column 3"])
df_new.iloc[0, df_new.columns.get_loc('new column 2')]= 'hey'
df_new.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7e66sspkpf59nk03q5qs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7e66sspkpf59nk03q5qs.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can merge these two dataframes using their common attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_merged = df.merge(df_new, left_on='new column', right_on='new column 2')
df_merged.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2r0p5rkk9rs355fwpcz0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2r0p5rkk9rs355fwpcz0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can vary your merge operations as needed, for example by dropping mismatched rows, by merging on a column that shares a common name, and so on.&lt;/p&gt;
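&lt;p&gt;For example, here is a sketch of two hypothetical frames (assumed data, not the wine dataset) that share a column name, merged so that either only matching rows or all rows are kept:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical frames, assumed for illustration.
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [10, 20, 40]})

# When both frames share a column name, merge on it directly.
inner = left.merge(right, on='key')               # keeps matching keys only
outer = left.merge(right, on='key', how='outer')  # keeps every row from both sides
print(inner)
print(outer)
```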

&lt;p&gt;You can also group your dataframes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.groupby(['volatile acidity', 'chlorides']).count().head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3i3j3nlxyq2is6vrtxcn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F3i3j3nlxyq2is6vrtxcn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also aggregate the groups with other functions, such as sum(). &lt;/p&gt;
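&lt;p&gt;For example, a sketch on an assumed toy frame:&lt;/p&gt;

```python
import pandas as pd

# Assumed toy frame for illustration.
df_demo = pd.DataFrame({'quality': [5, 5, 6],
                        'alcohol': [9.4, 9.8, 10.0]})

print(df_demo.groupby('quality').sum())              # total alcohol per quality
print(df_demo.groupby('quality')['alcohol'].mean())  # or any other aggregate
```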

&lt;p&gt;When you are done with manipulation of your dataframes, you are ready to visualize your data.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Reading and Manipulating Your Dataset With Pandas</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Sat, 19 Dec 2020 02:29:13 +0000</pubDate>
      <link>https://dev.to/orthymarjan/reading-and-manipulating-your-dataset-with-pandas-3kla</link>
      <guid>https://dev.to/orthymarjan/reading-and-manipulating-your-dataset-with-pandas-3kla</guid>
<description>&lt;p&gt;If you are a data science enthusiast who wants to work on data analytics or machine learning and is wondering where to start, the first thing you need to learn is how to read and manipulate a dataset. While working on a data analytics or machine learning problem, you will most likely be given a set of data (probably an excel sheet), or you might be collecting data from some hardware, a survey or some other source. When I first started working in this field, I had a hard time keeping track of the most common and widely used dataset manipulation commands. In this article I would like to share some of my most used commands from Python’s ‘Pandas’ library. The dataset I’ve used for the examples is taken from Kaggle (&lt;a href="https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009" rel="noopener noreferrer"&gt;https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009&lt;/a&gt;). I’ve used Google Colab to run my code, which you can easily use by visiting &lt;a href="https://colab.research.google.com/notebooks/intro.ipynb#recent=true" rel="noopener noreferrer"&gt;https://colab.research.google.com/notebooks/intro.ipynb#recent=true&lt;/a&gt;. You need to create a new notebook to write your code blocks.&lt;/p&gt;

&lt;h1&gt;
  
  
  Display
&lt;/h1&gt;

&lt;p&gt;At first you need to &lt;strong&gt;upload your dataset&lt;/strong&gt; to Google colab. To do so, you need to write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from google.colab import files
uploaded = files.upload()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will get a button to select the .csv file from your computer. Once you upload the file, check if the name is still the same, because &lt;strong&gt;uploading the same file multiple times in the same session would change the name of your dataset&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fygfk8kffzbw1e5e0tmaf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fygfk8kffzbw1e5e0tmaf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As your file is uploaded now, you need to read the dataset. You'll be using the 'Pandas' library to read the .csv file, referring to it as 'pd'. CSV stands for Comma Separated Values; this format stores data as a table (or spreadsheet) with rows and columns. Therefore we need a two-dimensional data structure to read data from .csv files. The most common two-dimensional data structure in Pandas is the &lt;strong&gt;dataframe&lt;/strong&gt;. We create a dataframe denoted by &lt;strong&gt;df&lt;/strong&gt;, read the .csv file, and keep the contents of the file in the dataframe df.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
df = pd.read_csv("winequality-red.csv")
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6os33m3tv0sy4vwi6c47.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6os33m3tv0sy4vwi6c47.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what your data looks like. You can find the total number of rows and columns in the bottom left corner of your output. There is another way to learn the dimensions of your dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is: &lt;strong&gt;(1599, 12)&lt;/strong&gt;, where the numbers are the rows and columns respectively. As there are a number of columns, you might need to know what types of data they hold: whole numbers, fractions or words. To check that, write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.dtypes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9icpjtwd1x4n40nt9lbv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9icpjtwd1x4n40nt9lbv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see statistical summaries such as the count, mean, standard deviation, minimum and maximum values, and the 25th, 50th and 75th percentiles of every column using the command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5pqyavt646lc82dy00r0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5pqyavt646lc82dy00r0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You might have already noticed that not all the rows are shown in the output. The first and last rows appear, while the middle ones are replaced by '...'. Viewing every row can be too much at times, and you may only want a few rows to check whether your code is working. For example, to see only the &lt;strong&gt;first five rows&lt;/strong&gt; of data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsenrn071fj37okxi6ctm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsenrn071fj37okxi6ctm.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Similarly, if you want to see only the last few rows of your dataset (here, the last three):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.tail(3)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgbad6lkhvvflbrgdpe3g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgbad6lkhvvflbrgdpe3g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What if you want to see the first 8 rows?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[:8]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu6uoa1912xku07iqqupo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fu6uoa1912xku07iqqupo.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The number after the colon indicates how many rows you want to see, counting from the first row (in this case, the 0th to the 7th row). To see the last 8 rows instead, you have to work out that they are rows 1591 through 1598. To do so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df[1591:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsa2tbyo2tes8uxe2lm28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsa2tbyo2tes8uxe2lm28.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you want to see all the rows of the dataset at once instead of the '...', do this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pd.set_option('display.max_rows', None)  
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will allow you to see all the rows within a scrollable field.&lt;/p&gt;
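&lt;p&gt;Note that pd.set_option changes the display limit globally for the rest of your session. If you only want the full view once, pandas also offers option_context, which restores the default automatically. A small sketch:&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({"quality": range(100)})

# Lift the row limit only inside this block; the default
# truncation comes back as soon as the block ends.
with pd.option_context('display.max_rows', None):
    print(df)

print(pd.get_option('display.max_rows'))  # back to the default limit
```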

&lt;p&gt;You can also &lt;strong&gt;transpose&lt;/strong&gt; your dataframe, that is, turn the rows into columns and the columns into rows. To exchange rows and columns, write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuuajhtp1bvj1k8fudqf6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuuajhtp1bvj1k8fudqf6.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, you cannot see all the columns; the middle ones are replaced with '...' again. To show them all, you can write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pd.set_option('display.max_columns', None)  
df.T
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that you know the basic display commands of Pandas, you are ready to dive into dataset manipulation techniques.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>beginners</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Data Science For Cats : PART 6</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Sun, 29 Nov 2020 13:28:40 +0000</pubDate>
      <link>https://dev.to/orthymarjan/data-science-for-cats-part-6-1g4a</link>
      <guid>https://dev.to/orthymarjan/data-science-for-cats-part-6-1g4a</guid>
      <description>&lt;h1&gt;
  
  
  Fitting Data To Your Model: Time Series Analysis
&lt;/h1&gt;

&lt;p&gt;(Data was taken and slightly edited from an open sourced time series dataset.)&lt;/p&gt;

&lt;p&gt;Kitty, do you remember that hooman told you about the components and stationarity &lt;a href="https://dev.to/orthymarjan/data-science-for-cats-part-4-2kh2"&gt;(here)&lt;/a&gt; of a time series the other day? Of course you do, but you do not remember hooman telling you how to surely detect if a time series is stationary or not.&lt;/p&gt;

&lt;p&gt;To check if your data is stationary, the first thing you do is start believing that your data is not stationary, and keep wishing that someone proves you wrong. It’s like, whenever your hooman makes you wear a harness and puts you inside your travel bag, you immediately understand hooman is taking you to the vet, and you keep praying that you are wrong and that hooman is taking you somewhere else. Here, believing ‘data is non stationary’ or ‘hooman is taking you to the vet’ is called a ‘&lt;strong&gt;NULL HYPOTHESIS&lt;/strong&gt;’. On the other hand, the event of your data being stationary, or hooman taking you to a park or playground, is called the ‘&lt;strong&gt;ALTERNATIVE HYPOTHESIS&lt;/strong&gt;’. Now you have to check if your null hypothesis is true or not. &lt;/p&gt;

&lt;p&gt;First you load data from the csv file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv("likestuna - Sheet3.csv") 
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fz9llznuzdaq8v37nccki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fz9llznuzdaq8v37nccki.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now hooman says he is going to ‘test’ the stationarity using a method named the &lt;strong&gt;ADF test&lt;/strong&gt;, also known as the Augmented Dickey Fuller test. Hoomans named Dickey and Fuller devised this method, where they check something named the ‘&lt;strong&gt;p-value&lt;/strong&gt;’. The p-value is like a judge of your ADF test. The higher its value, the more it supports your null hypothesis. If the value is lower than 0.05, it speaks against your hypothesis, meaning whatever you believed at first is not true.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.tsa.stattools import adfuller
def adfuller_test(sales):
    result=adfuller(sales)
    labels = ['ADF Test Statistic','p-value','#Lags Used','Number of Observations']
    for value,label in zip(result,labels):
        print(label+' : '+str(value) )

    if result[1] &amp;lt;= 0.05:
        print("Strong evidence against the null hypothesis (Ho), rejects the null hypothesis. Data is stationary")
    else:
        print("Weak evidence against the null hypothesis (Ho), fail to reject it. Data is non-stationary")

adfuller_test(df['Sales'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Hooman gets the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F84e19mmjorclafi1kzo0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F84e19mmjorclafi1kzo0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You see, your ADF test says your data is not stationary. What do you do now? &lt;/p&gt;

&lt;p&gt;To make a series stationary, first try to remember when or why a time series becomes non stationary. If you remember the diagrams hooman showed you in the 4th part of your data science journey, after you plotted the data there were repeated, similar looking fragments that kept going upwards, because the data had an upward trend. You can say that trend is the reason your data gets distributed across so many different levels. &lt;/p&gt;

&lt;p&gt;A nice way to stationarize a time series is to shift your time series a bit and to compare it with your original series. By shifting, hooman means introducing a delay, and you will be calling this delay ‘&lt;strong&gt;LAG&lt;/strong&gt;’. For example, what if your data started after one month of delay?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8wkxxtkjcj5oijkemrsd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8wkxxtkjcj5oijkemrsd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You want to see what your data looks like after these shifts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['Sales']
df['lag 1']=df['Sales'].shift(1)
df['lag 2']=df['Sales'].shift(2)
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fs4t7h7e81p5qpvwnucfq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fs4t7h7e81p5qpvwnucfq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hooman plots your data and a 5 month delayed version on the same graph to observe how they compare.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ax = df['Sales'].plot(color = 'b') 
df['Sales'].shift(5).plot(ax=ax, color = 'm')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnb0mc3ukl9rwsv7z3jfu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnb0mc3ukl9rwsv7z3jfu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, blue is your original data, and magenta is the data with a delay of 5 units (in your case, months).&lt;/p&gt;

&lt;p&gt;Hooman says that if you plot the difference of your data and 1 unit lagged data, there is a good chance that it will become stationary. Now what is this difference?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frsrixi6jlod5803igyj5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frsrixi6jlod5803igyj5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df['Sales First Difference'] = df['Sales'] - df['Sales'].shift(1)
df['Sales Second Difference'] = df['Sales'] - df['Sales'].shift(2)
df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frhrhtxsmktmte12gi121.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Frhrhtxsmktmte12gi121.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hooman now checks if the first difference is stationary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;adfuller_test(df['Sales First Difference'].dropna())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the p-value becomes 0.054213290283824704, which is barely above the 0.05 cutoff, implying that the data is not stationary yet. If you take the second difference,&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;adfuller_test(df['Sales Second Difference'].dropna())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The p-value now becomes 0.03862975767698862, which indicates that the data is stationary. So do you need to take the second difference to stationarize the data?&lt;/p&gt;

&lt;p&gt;Well, actually no. Now you are confused. Hooman asks you to take a closer look at the p-values of your first and second difference ADF results. For the first difference, the p-value is almost 0.05, and the gap between 0.054213290283824704 and 0.05 is insignificant. So you can say that the first difference has made your data almost stationary. If you take the second difference in cases where the first difference already makes the data almost stationary, there is a chance that your data becomes &lt;strong&gt;over-differenced&lt;/strong&gt;, which will not give you a good forecast later. If you are unsure, you can cross check using another stationarity test, such as KPSS. Hooman says you have to remember this term as the ‘&lt;strong&gt;order of differencing&lt;/strong&gt;’, or simply ‘d’. If you had to take the 99th difference to get a stationary time series, your d value would have been 99. In this case, your d value is simply 1.&lt;/p&gt;

&lt;p&gt;There is another method you can use to confirm your d value. You plot the correlation between a time series and a lagged version of itself and observe the output. Hooman calls such a correlation ‘&lt;strong&gt;AUTOCORRELATION&lt;/strong&gt;’. It usually decays, so if the output curve immediately approaches 0, you can assume that the difference between the series and that lagged version is stationary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pandas.plotting import autocorrelation_plot
import matplotlib.pyplot as plt
autocorrelation_plot(df['Sales First Difference'].dropna())
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5vm3yo7xlcj0fxnvhfvw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5vm3yo7xlcj0fxnvhfvw.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hooman will be forecasting this data using a model named ‘&lt;strong&gt;ARIMA&lt;/strong&gt;’, AutoRegressive Integrated Moving Average. You need to know 3 values to tune this model, and you already know one of them, the d value. &lt;/p&gt;

&lt;p&gt;You see, there are three parts in the ARIMA model. The AR or autoregressive part is denoted by p, the Integrated part is denoted by d (which you have already calculated) and the MA or moving average part is denoted by q. You already know what d does, and you are now curious what the AR and MA terms do. Hooman starts explaining. &lt;/p&gt;

&lt;p&gt;The first term is ‘&lt;strong&gt;AUTOREGRESSIVE&lt;/strong&gt;’, which has two parts: auto, and regressive. You already know about regression from your previous conversations with hooman. Hooman says auto means something like ‘with yourself’. So you can say autoregression is when the lagged values of your data have an impact on your current values. AR(x) means x lagged terms have an impact on your current data. On the other hand, the ‘&lt;strong&gt;MOVING AVERAGE&lt;/strong&gt;’ term smooths out the randomness in your data. MA(x) means you are taking x previous observations to understand your current data.&lt;/p&gt;
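&lt;p&gt;To build intuition for the AR term, here is a toy simulation (not from the post) of an AR(1) process, where each value is 0.8 times the previous value plus noise; the measured lag-1 autocorrelation should land close to that 0.8 coefficient:&lt;/p&gt;

```python
import numpy as np

# Simulate a toy AR(1) process: y[t] = 0.8 * y[t-1] + noise.
rng = np.random.default_rng(42)
n, phi = 500, 0.8
y = np.zeros(n)
for t in range(1, n):
    y[t] = phi * y[t - 1] + rng.normal()

# The lag-1 autocorrelation of an AR(1) process estimates phi.
lag1 = np.corrcoef(y[:-1], y[1:])[0, 1]
print(round(lag1, 2))  # close to 0.8
```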

&lt;p&gt;Now how do you calculate these AR (or p) and MA (or q) terms? Hooman explains the AR term calculations first. You need to plot a ‘&lt;strong&gt;Partial Autocorrelation Function&lt;/strong&gt;’ or ‘&lt;strong&gt;PACF&lt;/strong&gt;’ to find out the AR terms. PACF? What’s that? Hooman says, in PACF, you take each of the lagged values of your series, find the residuals by removing the effects that are explained by the lags and find a ‘partial’ correlation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.graphics.tsaplots import plot_pacf

plt.rcParams.update({'figure.figsize':(9,3), 'figure.dpi':120})

fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df['Sales First Difference']); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0,5))
plot_pacf(df['Sales First Difference'].dropna(), ax=axes[1])

plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output looks like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5rhv0li09uw5cs5u7bnz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5rhv0li09uw5cs5u7bnz.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hooman is not sure how many bars cross the blue zone after 0. It could be 1 or 2. So our AR value would be either 1 or 2.&lt;/p&gt;

&lt;p&gt;To find the MA terms, you check an ‘&lt;strong&gt;Autocorrelation Function&lt;/strong&gt;’ or &lt;strong&gt;ACF&lt;/strong&gt;. An ACF gives you the complete autocorrelation of a time series with its lagged values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.graphics.tsaplots import plot_acf

fig, axes = plt.subplots(1, 2, sharex=True)
axes[0].plot(df['Sales First Difference']); axes[0].set_title('1st Differencing')
axes[1].set(ylim=(0,1.2))
plot_acf(df['Sales First Difference'].dropna(), ax=axes[1])

plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqre0s80yfsl5okpthxmq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqre0s80yfsl5okpthxmq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hooman similarly reads the MA value off this graph. It could be 3 or 4.&lt;/p&gt;

&lt;p&gt;So your (p, d, q) combination could be (1, 1, 3), (2, 1 ,3), (1, 1, 4) or (2, 1, 4). Which one is it?&lt;/p&gt;

&lt;p&gt;Take the first combination, use it to fit an ARIMA model, and then print a summary like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.tsa.stattools import acf

# Create Training and Test
train = df['Sales'][:130]
test = df['Sales'][130:]

from statsmodels.tsa.arima_model import ARIMA
# Build Model
model = ARIMA(train, order=(1, 1, 3))  
fitted = model.fit(disp=-1)  
print(fitted.summary())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There will be an ‘&lt;strong&gt;AIC&lt;/strong&gt;’ term in the output. AIC stands for Akaike Information Criterion, which estimates how good your model is relative to other candidates. The lower the value, the better the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr0caeguu6c3ssp4rmokr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr0caeguu6c3ssp4rmokr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now hooman compares AIC of all 4 combinations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6gxqh393e21neqf8qk4b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6gxqh393e21neqf8qk4b.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You see, the model with order (2, 1, 4) gives the best result. Hooman now tries to plot the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.tsa.stattools import acf

# Create Training and Test
train = df['Sales'][:130]
test = df['Sales'][130:]

from statsmodels.tsa.arima_model import ARIMA
# Build Model
# model = ARIMA(train, order=(3,1,1))  
model = ARIMA(train, order=(2, 1, 4))  
fitted = model.fit(disp=-1)  

# Forecast
fc, se, conf = fitted.forecast(14, alpha=0.05)  # 95% conf

# Make as pandas series
fc_series = pd.Series(fc, index=test.index)
lower_series = pd.Series(conf[:, 0], index=test.index)
upper_series = pd.Series(conf[:, 1], index=test.index)

# Plot
plt.figure(figsize=(12,5), dpi=100)
plt.plot(train, label='training')
plt.plot(test, label='actual')
plt.plot(fc_series, label='forecast')
plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.15)
plt.title('Forecast vs Actuals')
plt.legend(loc='upper left', fontsize=8)
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The more data you have, the better the answer will be.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffe7zyr7d7nwoc3h1w6k1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffe7zyr7d7nwoc3h1w6k1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you have your sales forecast!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Science For Cats : PART 5</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Wed, 25 Nov 2020 10:23:56 +0000</pubDate>
      <link>https://dev.to/orthymarjan/data-science-for-cats-part-5-18ng</link>
      <guid>https://dev.to/orthymarjan/data-science-for-cats-part-5-18ng</guid>
      <description>&lt;h1&gt;
  
  
  Fitting Data To Your Model: Classification And Regression
&lt;/h1&gt;

&lt;p&gt;Today you are ready to ‘predict’ something with your hooman. How do you do that? Hooman says you need a ‘model’ that would do it for you. What is a model, then? Hooman tells you to think of the process like this: you go to your friend who knows a lot of maths, you give him/her a dataset, he/she calculates, and gives you a forecast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwadptlgftegh6bozcxei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwadptlgftegh6bozcxei.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, your friend is a machine, and the way he/she learns all the maths to do the calculation is called &lt;strong&gt;MACHINE LEARNING&lt;/strong&gt;. He/She knows how to ‘learn’ new things, and he/she learns the possible forecast using the data you have given. Hooman is going to tell you how he works with his machine friend to find out the forecasts. Of course your machine friend does a lot of maths and you should have a look at those maths in your free time.&lt;/p&gt;

&lt;p&gt;Remember the three questions you found out with your hooman at the beginning? You are going to see if you can find out the answers of those questions from the data.&lt;/p&gt;

&lt;p&gt;First question was, if someone would like or want tuna flavored chips or not. Hooman says he is taking only a few data points just to show you, therefore the accuracy of the model is not going to be very high, in other words, the model is going to make a few mistakes. The more data you provide, the better accuracy it gives.&lt;/p&gt;

&lt;p&gt;At first hooman loads 14 rows of data from a csv file. The model will learn about your problem and its solution from this data. You have data on 14 cats: their age, their weight, how long they sleep, how many potatoes they eat per day, how many packets of chips they eat per day, how many portions of tuna they eat per day and how many meals they have per day. You also asked them whether they want tuna chips or not. You have already cleaned the data as your hooman showed you earlier, and you are going to find out which of these habits correlate most significantly with the cats' desire for tuna chips.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn import metrics 
from sklearn.svm import SVC #SVM classifier

likestuna = pd.read_csv("likestuna - Sheet1.csv")
corrMatrix = likestuna.corr()
corrMatrix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first 5 rows of the file look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0bm0y51t9rhisfeyun56.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0bm0y51t9rhisfeyun56.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And the output matrix is like this: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgmapruwxk6afshjltlo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fgmapruwxk6afshjltlo5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the correlation matrix, you can narrow down the factors that have an impact on kitties' liking or disliking of the chips. The factors you keep are called &lt;strong&gt;FEATURES&lt;/strong&gt;. Using these features, your model is going to predict a cat’s liking or disliking for tuna flavored potato chips.&lt;/p&gt;
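&lt;p&gt;One simple way to sketch that reduction in code (with a tiny made-up table, since the survey CSV is local) is to keep only the columns whose absolute correlation with the target crosses a threshold; the 0.5 cutoff here is an arbitrary illustration, not a rule:&lt;/p&gt;

```python
import pandas as pd

# Tiny made-up stand-in for the cat survey.
likestuna = pd.DataFrame({
    "Age": [1, 2, 3, 4, 5, 6],
    "Weight": [3.0, 3.5, 4.0, 4.5, 5.0, 5.5],
    "Sleep hours": [16, 12, 15, 11, 16, 12],
    "Wants tuna chips": [0, 0, 1, 1, 1, 1],
})

# Correlation of every column with the target, target itself dropped.
corr = likestuna.corr()["Wants tuna chips"].drop("Wants tuna chips")

# Keep columns whose absolute correlation beats the (arbitrary) cutoff.
features = corr[corr.abs() > 0.5].index.tolist()
print(features)  # prints ['Age', 'Weight']
```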

&lt;p&gt;Hooman says that he is going to split the data into two parts: a training set and a testing set. The training set randomly takes 85% of the data, which will be used to teach the model which habits make the kitties like tuna chips. The remaining 15% will be used to test how accurately the model works. Here, you already know the preference of all 14 cats, so you can easily verify the outcomes on the testing set. Hooman wrote:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feature_cols = ['Weight','Potato per day','Chips per day', 'Tuna per day', 'Meal count']
X = likestuna[feature_cols]
y = likestuna['Wants tuna chips']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
X_test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;X_test contains our testing data. In this case, they are: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc5gsrkstqua5zam7je4a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc5gsrkstqua5zam7je4a.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We already know if these three kitties want tuna chips or not. Kitties #5 and #3 want tuna chips, as their ‘Wants tuna chips’ value is 1. On the other hand, kitty #4 does not want tuna chips. Hooman says he wants to try a model named ‘SVM’ (Support Vector Machine) to see what the prediction says.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;svm = SVC(kernel="linear", C=0.025)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
y_pred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output is &lt;strong&gt;array([1, 0, 0])&lt;/strong&gt;, which means the first cat of the testing set, kitty #5, wants tuna chips, and the other two, #4 and #3, don’t. You see, there is a mistake: we already know that kitty #3 does want tuna chips. Hooman shows you this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This says the accuracy is &lt;strong&gt;0.6666666666666666&lt;/strong&gt;, as one of the 3 predictions is wrong. Generally, the more good-quality data you have to train on, the more accurate the predictions tend to become. &lt;/p&gt;
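&lt;p&gt;If you are curious what that accuracy number actually is, it is nothing mysterious: just the fraction of predictions that match the labels you already know. A tiny sketch, using the labels and predictions from above:&lt;/p&gt;

```python
# Accuracy = number of correct predictions / total number of predictions.
y_test = [1, 0, 1]  # what we already know about kitties #5, #4 and #3
y_pred = [1, 0, 0]  # what the SVM predicted
correct = sum(t == p for t, p in zip(y_test, y_pred))
accuracy = correct / len(y_test)
print(accuracy)  # 0.6666666666666666
```

&lt;p&gt;This is exactly what metrics.accuracy_score computes for you.&lt;/p&gt;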

&lt;p&gt;So, this is what you get from the data you already have in your hand. What if there is a #15 kitty? Can you predict if the 15th kitty would want tuna chips or not? Hooman said you can.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = [[4.7, 4, 1.75, 3.3, 3]]
df = pd.DataFrame(data, columns = ['Weight','Potato per day','Chips per day', 'Tuna per day', 'Meal count'])
svm.predict(df)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It says: &lt;strong&gt;array([0])&lt;/strong&gt;, which means the 15th kitty will not want tuna chips. Of course there is a chance of error, as your model is not purrfect.&lt;/p&gt;

&lt;p&gt;Is SVM the only way to make predictions? Of course not. Hooman says there are many other techniques. For example, there is one that hooman loves a lot, called ‘RANDOM FOREST’. Hooman runs a random forest on the same training and testing sets you used for the SVM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier #random forest classifier
rf=RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
y_pred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here the output becomes &lt;strong&gt;array([1, 1, 0])&lt;/strong&gt;. You see, there are two mistakes this time, so the accuracy is lower, &lt;strong&gt;0.33&lt;/strong&gt; in this case. Hooman says that you should try different methods and compare the results and accuracy to see which one is more appropriate for your case.&lt;/p&gt;
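&lt;p&gt;Hooman’s “try different methods and compare” advice can be sketched as a small loop. The prediction arrays below are the SVM and random forest outputs from above:&lt;/p&gt;

```python
# Score each model's predictions against the same known test labels.
y_test = [1, 0, 1]
predictions = {
    'SVM': [1, 0, 0],
    'Random Forest': [1, 1, 0],
}
scores = {}
for name, y_pred in predictions.items():
    scores[name] = sum(t == p for t, p in zip(y_test, y_pred)) / len(y_test)
print(scores)  # SVM scores ~0.67, Random Forest ~0.33 on this tiny test set
```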

&lt;p&gt;Now hooman says he wants to solve the second question. How are you going to set the price? He has already explained to you earlier that this is a regression problem. Your data has two parts, weight of a packet of chips and its price.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuna2 = pd.read_csv("likestuna - Sheet2.csv") 
tuna2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe5vj1cpy0mqi5g9rqw0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe5vj1cpy0mqi5g9rqw0q.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hooman tries to plot them as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt
from matplotlib import pylab
from pylab import *

tuna2.plot(x='Weight (oz)', y='Price($)', style='o')
plt.title('Weight vs Price')
plt.xlabel('weight')
plt.ylabel('price')
z = np.polyfit(tuna2['Weight (oz)'], tuna2['Price($)'], 1)
p = np.poly1d(z)
pylab.plot(tuna2['Weight (oz)'],p(tuna2['Weight (oz)']),"r--")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkmz3ajj3lbw5uuzxx8kf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkmz3ajj3lbw5uuzxx8kf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You see, the pattern of the price with respect to different weights roughly resembles a straight line, the red line in the picture. To find an approximate price for any given weight, you just have to find the position on the red line at that weight.&lt;/p&gt;
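&lt;p&gt;In case the red line feels like magic: it is just a straight line y = m·x + b, and “finding the position on the line” means plugging a weight into that equation. The slope and intercept below are hypothetical, purely for illustration:&lt;/p&gt;

```python
# Predicting with a fitted line: plug the weight into y = m*x + b.
m, b = 0.1, 2.0   # hypothetical slope and intercept, not the fitted values
weight = 16       # oz
approx_price = m * weight + b
print(approx_price)
```

&lt;p&gt;np.polyfit in the snippet above gives you the actual m and b for your data (as z[0] and z[1]).&lt;/p&gt;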

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fox34gyq7iapzchbdxjib.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fox34gyq7iapzchbdxjib.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hooman now tries to find out the predictions using this line.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = tuna2.iloc[:, :-1].values
y = tuna2.iloc[:, 1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
y_pred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The answer is like this: &lt;strong&gt;array([2.91206515, 4.11357766, 2.67176265])&lt;/strong&gt;&lt;br&gt;
Now let’s see how much they differ from the original points:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2wjuxcddnsigz7n5444g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2wjuxcddnsigz7n5444g.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Close enough, aren’t they?&lt;/p&gt;

&lt;p&gt;Now you try to predict the price of a packet size you haven’t seen before, say, 25 oz:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;regressor.predict([[25]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The answer is: &lt;strong&gt;array([4.47403141])&lt;/strong&gt;, which means the price should be around $4.5. You can now find an approximate price for any size of packet, yayy!!&lt;/p&gt;

&lt;p&gt;You ask hooman how accurate the model is. Hooman shows you how to find out. First you need to compute the ‘root mean squared error’ (RMSE) of your model. (You should check the math behind this error later.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn import metrics 
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error is &lt;strong&gt;0.280631222467514&lt;/strong&gt;. Is this good or bad? How do you know? Hooman says you first have to check the mean of all of your prices. As a rule of thumb, if the root mean squared error is lower than 10% of the mean, the model can be considered good.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tuna2.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmqctpw8l2cyxxknk7ii1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmqctpw8l2cyxxknk7ii1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, 10% of 3.287273 is 0.3287273. Your root mean squared error is lower than that, so your model can be considered good! Congrats!&lt;/p&gt;
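&lt;p&gt;The check itself is one comparison, with the numbers copied from the two outputs above:&lt;/p&gt;

```python
# Rule of thumb: the model is decent if RMSE < 10% of the mean target value.
rmse = 0.280631222467514   # root mean squared error from above
mean_price = 3.287273      # mean of 'Price($)' from tuna2.describe()
threshold = 0.10 * mean_price
print(rmse < threshold)  # True
```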

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Science For Cats : PART 4</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Sat, 07 Nov 2020 16:33:32 +0000</pubDate>
      <link>https://dev.to/orthymarjan/data-science-for-cats-part-4-2kh2</link>
      <guid>https://dev.to/orthymarjan/data-science-for-cats-part-4-2kh2</guid>
      <description>&lt;h2&gt;
  
  
  Processing Time Series
&lt;/h2&gt;

&lt;p&gt;As you know a few things about correlations, hooman thinks this is a good time for you to learn some more complicated things. Hooman has picked a file where they have noted how many cats have bought potato chips on different days of February and wants you to find out if there is anything interesting. The data looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7ecfvu3673wxpx4dg4lu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7ecfvu3673wxpx4dg4lu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What did you find out by looking at these numbers? The first thing you notice is that the numbers are noted on different days of the month, from February 1 to February 28. The sales are recorded at one-day intervals, for all 28 days. Suddenly you have a feeling: could this be a time series...?&lt;/p&gt;

&lt;p&gt;Yes, this indeed is a time series dataset. The data is sorted by time, at regular intervals, and the values are related to time. How? Hooman now plots the data on a graph. You and hooman immediately find some patterns that might have a relationship with time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5yxq1p4uqi500ghkiafk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5yxq1p4uqi500ghkiafk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hooman tells you that he has found three interesting insights in this graph that might have some significance. He points out that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Every Thursday fewer cats buy potato chips for some unknown reason, and on Friday cats seem to buy more potato chips. This pattern occurred in every week of February. It looks like a cycle of ups and downs in the sales of chips.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;As the events are occurring in a cyclic order, if you compare all 4 Fridays, you can see the sales of chips on the second Friday are higher than on the first Friday, the sales on the third Friday are higher than on the second Friday, and so on. The same pattern appears for Thursdays and Saturdays and almost all the other days. So although the sales have ups and downs on different days, the overall sales are slowly increasing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There are a few days that do not match the weekly pattern. For example, the sales of chips on the last Sunday of February seem quite different from the other Sundays for some reason. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This makes you curious. Are these normal? How are you going to explain these events? This is too complex to handle! Hooman now asks you, what if you could find the exact pattern of the ups and downs of the sales? You realize that you could have found the variations hooman mentioned in the third point if you knew an ‘ideal’ pattern. Yes, you’re right. Hooman says he calls the ‘ideal’ pattern of these cyclic ups and downs &lt;strong&gt;SEASONALITY&lt;/strong&gt;. The gradual increase you have found in spite of the ups and downs (in point 2) is called the &lt;strong&gt;TREND&lt;/strong&gt; (hooman says this can be a decrease of values too). And the variation of your data from the ‘ideal’ pattern is called the &lt;strong&gt;RESIDUALS&lt;/strong&gt;.&lt;/p&gt;
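&lt;p&gt;It may help to see how the three components fit together. In an additive model (the kind used later in this post), every observation is simply the sum of the three; the numbers here are made up for illustration:&lt;/p&gt;

```python
# Additive decomposition: observed = trend + seasonality + residual,
# so the residual is whatever trend and seasonality can't explain.
trend = 14.0        # the slowly rising overall level (hypothetical)
seasonality = -3.0  # the usual Thursday dip (hypothetical)
observed = 12.0     # what was actually sold that day
residual = observed - trend - seasonality
print(residual)  # 1.0
```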

&lt;p&gt;Now how would the seasonality of your data look? Hooman shows the pattern of ups and downs of the sales data:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4auscjeqxbpf8muymek8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4auscjeqxbpf8muymek8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you can clearly see the ideal pattern. But wait… you think you found an overall increasing trend earlier, but this curve is going up and down within the same level. Where did the trend go?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff93167349sx2gozg76qu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ff93167349sx2gozg76qu.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oh wow! Hooman has finally shown you the general direction of change of your sales record. Here you can see that some of the points of your sales data are far from your trend, and from the seasonality too. Whatever is left over after comparing your data with both the trend and the seasonality is the residuals. Hooman now shows you a comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5gox5073xaw5gcequn2q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F5gox5073xaw5gcequn2q.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So there are a number of points that are quite far away from the ideal values. Now, why do your data have a pattern? Or variation from the pattern? There must be an explanation, right? Now you start thinking what usually happens on Thursday or Friday. Well, on Purrsday, I mean Thursday you usually are tired. Being a cat is not an easy job. As hooman is not going outside and working from home for this stupid pandemic, you too are working very hard with him on weekdays, keeping his lap warm. So you do not go outside and keep having chips from your stock. You refill your stock on Friday because you usually have fun with your hooman friend on the weekend, Saturday and Sunday, and eat chips frequently. So… It makes sense, doesn’t it? Your, and most of your furry friends’ behaviour matches with this pattern.&lt;/p&gt;

&lt;p&gt;Then why is the data showing variations sometimes? You now want to have a closer look at the residuals to remember what actually happened.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Famggkdco6u6ml2uvy60e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Famggkdco6u6ml2uvy60e.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you remember! On the second Wednesday there was a football match and you all stocked up on chips to eat while watching. On the last Sunday of the month, there was a huge thunderstorm, and since you cats do not like to get wet, almost every one of you stayed home. It matches purrfectly! &lt;/p&gt;

&lt;p&gt;Now that you have learnt about all these troublesome components, you start thinking: why would you even need to know this? Hooman says that a time series with all these components mixed together is hard to analyze. He calls this a &lt;strong&gt;NON STATIONARY&lt;/strong&gt; time series. When the components are separated out, it turns into a &lt;strong&gt;STATIONARY&lt;/strong&gt; time series, and with the components separated, the data becomes much easier to analyze.&lt;/p&gt;

&lt;p&gt;Now the big question. How do you do this separation? Hooman prefers Python libraries in this case too. He shows you an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from statsmodels.tsa.seasonal import seasonal_decompose

data = [10,12,9,12,6,5,16,12,15,11,13,15,5,18,11,17,12,14,8,8,21,13,5,13,15,11,12,23] 
nresult=seasonal_decompose(data, model='additive', freq=4)
nresult.plot()
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of this code looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkkhhch5l472cqh8s0kr2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkkhhch5l472cqh8s0kr2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you know how to process your data if you have a time series. You will be able to forecast your demand using these components. Hooman wants you to check out the maths behind these libraries as homework, and promises to explain how to do the forecasting the next day.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>timeseries</category>
      <category>machinelearning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Data Science For Cats : PART 3</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Fri, 30 Oct 2020 05:32:48 +0000</pubDate>
      <link>https://dev.to/orthymarjan/data-science-for-cats-part-3-33do</link>
      <guid>https://dev.to/orthymarjan/data-science-for-cats-part-3-33do</guid>
      <description>&lt;h1&gt;
  
  
  Understanding The Relations
&lt;/h1&gt;

&lt;p&gt;With the help of hooman, you’ve fixed your dataset and you both are planning to jump into some real action. You look at the data and find out there are lots of rows and columns. How are you going to find a meaning from these numbers? Hooman understands that you are confused and starts showing you what to do.&lt;/p&gt;

&lt;p&gt;Hooman says he wants to find out if there is any relationship among different types of information. Relationship? Among information? How? Hooman gives you an example: when he tries to work on his laptop, you tend to sit on his keyboard. You do not do that at other times. Or, you meow when you are hungry. Here, hooman’s attempt to work on the laptop encourages you to sit on the keyboard. Your increased hunger makes you meow. Any mutual connection between two events like this is significant, and hooman calls it &lt;strong&gt;CORRELATION&lt;/strong&gt;. These examples are positive correlations, because in both cases one thing (your sitting on the laptop, or your meowing) increases as the other (hooman’s attempt to work, or your hunger) increases. Hooman says that correlation can be negative too, like when you play more because you are less hungry. In that case, one increases while the other decreases.&lt;/p&gt;
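&lt;p&gt;You can see both kinds of correlation in a couple of lines of pandas. The numbers below are made up to mirror the hunger examples:&lt;/p&gt;

```python
import pandas as pd

# Made-up observations: meowing rises with hunger, playing falls with it.
df = pd.DataFrame({
    'hunger': [1, 2, 3, 4, 5],
    'meows':  [2, 4, 5, 7, 9],
    'play':   [9, 7, 6, 4, 2],
})
print(df.corr()['hunger'])  # meows: close to +1, play: close to -1
```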

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F88qb728m69fh3tqthk3y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F88qb728m69fh3tqthk3y.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you understand that you need to find some clue about what makes people buy potato chips. In doing so, the hooman shows you an example. He randomly picks some attributes about different brands of chips and how much people liked them. These attributes are basically a few columns from a file consisting of features of different brands of chips. They look like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjv750uys4mdqro5pozzr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fjv750uys4mdqro5pozzr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You would like to know why a specific brand of chips is loved by people. You can see, here the #0 brand was loved by 90% of people and the #4 was loved by 55% of people. There must be a reason behind it.&lt;/p&gt;

&lt;p&gt;Hooman picks some values from the columns to show you what he means. He converts them to a dataframe using the pandas library of Python, and calls the built-in DataFrame.corr() function to find out the correlations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
data = {'potato content': [45,37,42,35,39],
        'packaging quality': [38,31,26,28,33],
        'owner can say potato in how many languages': [1,3,7,1,7],
        'spiciness': [44,44,43,43,44],
        'liked by %': [90,56,88,73,55],
        }
df = pd.DataFrame(data,columns=['potato content','packaging quality','owner can say potato in how many languages','spiciness','liked by %'])
pd.set_option("display.max_rows", None, "display.max_columns", None)
pd.set_option('expand_frame_repr', False)
corrMatrix = df.corr()
print (corrMatrix)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then he shows you the output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsipji0vrhphsgm18umc5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsipji0vrhphsgm18umc5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Whoa, more numbers! What do they even mean? You meow at hooman and he starts explaining. He calls the output a &lt;strong&gt;CORRELATION MATRIX&lt;/strong&gt;. So what is this correlation matrix? You can see that it’s a table, with some numbers. Each of the numbers represents how strongly one column from your dataset is related to another column. These numbers are called &lt;strong&gt;CORRELATION COEFFICIENT&lt;/strong&gt;s. A correlation coefficient always lies between -1 and 1. Of course there are mathematical equations behind this calculation; you can search and have a look at them on the internet. How do they work? In the first row, the first number represents the relation between ‘potato content’ and ‘potato content’. The correlation of something with itself is always 1. As hooman emphasized knowing the reason for people liking a brand, he now explains the last number of the first row to you. It represents the relation between the potato content of a brand of chips and people liking it. The higher the number, the stronger their relationship. Here, 0.685493 is pretty high. Similarly, the last number of the second row contains the relationship between packaging quality and people liking the chips. The last numbers of the other rows represent similar relationships too. You can see that some of them are negative numbers. That means the relationship between those attributes and people liking a brand of potato chips goes in opposite directions: a decrease in those attributes goes with an increase in liking for that brand. Hooman says they are ‘negatively correlated’.&lt;/p&gt;

&lt;p&gt;You now understand that higher potato content in a brand of chips makes people like the brand more, and a lower amount of spiciness makes people love the chips… but wait, ‘owner can say potato in how many languages’?? How on earth can that make people love or hate a brand of chips? You point your paw at that number. &lt;/p&gt;

&lt;p&gt;Hooman knows that you have again become confused. He now asks you when you eat chips the most. You think and reply that you eat them most while watching football on television. What else do you do while watching the matches? You wear the jersey of your favourite team and meow a lot. You suddenly realize that, it kind of seems like you eat potato chips more when you wear a jersey, but in reality, is wearing a jersey a ‘cause’ of eating more chips? No, your chips intake doesn’t increase with wearing a jersey, the real reason for chips consumption is watching the game. Hooman calls this ‘real reason’ &lt;strong&gt;CAUSATION&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F76zu7s9foggffj6upklc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F76zu7s9foggffj6upklc.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, &lt;strong&gt;correlation doesn’t always imply causation&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;Now that’s a problem. How can you determine which one is the real reason? Well, there is no straightforward way to find that, at least right now. You still are a young cat. You need to grow bigger to learn more complicated stuff. So what are you going to do? For now, you can assume that a relationship is more likely to be causal if the correlation coefficient is large. You can set a threshold for the correlation coefficient and ignore the smaller values. For example, if you decide to only keep coefficients whose absolute value is higher than 0.4, then 0.027518 and -0.214263 are too small. Therefore, you can safely take ‘potato content’ and ‘spiciness’ into consideration while thinking about why someone liked or disliked a specific brand of potato chips. Here, our finding is that people like potato chips more if the potato content of the chips is higher, or we can say there is a positive correlation between them. If the spiciness is high, people tend to dislike the chips; in other words, they are negatively correlated. You will need these relationship assumptions for all types of problems, classification, regression or time series analysis, to find out and predict something about the data.  &lt;/p&gt;
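&lt;p&gt;The thresholding step can be sketched like this. The values 0.685493, 0.027518 and -0.214263 come from the matrix above; which feature each small value belongs to, and the spiciness coefficient, are hypothetical here:&lt;/p&gt;

```python
# Keep only features whose correlation with 'liked by %' is strong enough,
# using the absolute value so strong negative correlations also survive.
correlations = {
    'potato content': 0.685493,
    'owner can say potato in how many languages': 0.027518,  # hypothetical mapping
    'packaging quality': -0.214263,                          # hypothetical mapping
    'spiciness': -0.75,                                      # hypothetical value
}
threshold = 0.4
selected = [name for name, c in correlations.items() if abs(c) > threshold]
print(selected)  # ['potato content', 'spiciness']
```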

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Science For Cats : PART 2</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Tue, 27 Oct 2020 09:23:28 +0000</pubDate>
      <link>https://dev.to/orthymarjan/data-science-for-cats-part-2-afi</link>
      <guid>https://dev.to/orthymarjan/data-science-for-cats-part-2-afi</guid>
      <description>&lt;h1&gt;
  
  
  Preparing Your Data
&lt;/h1&gt;

&lt;p&gt;Now that you know a few things about the types of problems you’re going to solve, you decide to look at the data you have received from the hooman. OMG! There are a lot of numbers everywhere. It’s gonna take days just to look at them all. Seeing you hissing and growling at the data file, the hooman laughs and tells you that instead of reading, you need to &lt;strong&gt;VISUALIZE&lt;/strong&gt; the data. As you look at him with confusion, he explains that you need to plot your data into graphs to see the shapes of the data and to understand why they look like that. At first you thought that you would need some fancy tools or would have to write a lot of code to make the graphs. Hooman shows you that you can plot cool graphs using nothing more than simple tools like Microsoft Excel.&lt;/p&gt;

&lt;p&gt;Once you start plotting the data into a graph, you find that some of the values are missing. Someone must have forgotten to add some sales records that day. Sigh. What do you do now?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffn1e6ul9q267hf2ofihn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffn1e6ul9q267hf2ofihn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You point out the missing part of the graph to your hooman friend. Hooman assures you that there is a way to find the missing values. You will have to guess them. Wait, you are not going to randomly put a value there, are you? &lt;/p&gt;

&lt;p&gt;Hooman now teaches you a few techniques to guess the missing values. You can take the prior or the next point of your missing data and put it there. You can use the mean value. Or, you can connect the missing dots by guessing the pattern of your graph.&lt;/p&gt;
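&lt;p&gt;Each of those guessing techniques is a one-liner in pandas. Here is a sketch with hypothetical sales numbers and one missing day:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical daily sales with one missing value on the second day.
sales = pd.Series([12.0, None, 15.0, 11.0])
print(sales.ffill())               # repeat the prior point: the gap becomes 12.0
print(sales.bfill())               # repeat the next point: the gap becomes 15.0
print(sales.fillna(sales.mean()))  # fill with the mean of the known values
```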

&lt;p&gt;As an example, hooman has asked you to find out where to put the missing data point in the following graph. In the figure (1) you pointed too low, and in figure (2), you put the point too high. Then you suddenly realize that the graph looks like a wave and you have to put the point in a way that keeps the shape of the wave intact. You’re right, yayy! &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fisojavlgrvw9qyhncxo0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fisojavlgrvw9qyhncxo0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now hooman tells you that real life data are really complex and contain noises, distortions or misplacements. You cannot simply draw them in a known shape like straight lines or waves point by point. Therefore, you have to interpret the graph as something that is really close to a known shape. Once you can relate your graph to such a shape, you have to use an equation to find an approximate position for your missing point, or you can say, the approximate value of your missing data. This is called &lt;strong&gt;INTERPOLATION&lt;/strong&gt;. It sounds complicated, doesn’t it? Hooman tells you not to worry, because smart hoomans have created magic tools (like pandas.DataFrame.interpolate in the pandas library for Python) that perform these interpolation operations for you. However, hooman insists that you should search on the internet for the basics and the equations, because having a good knowledge of what you’re doing is really important (and he wants you to go deeper by yourself, because you always learn more when you face difficulties and have to do something by yourself!). Sigh, hoomans are annoying.&lt;/p&gt;

&lt;p&gt;Hooman shows you how you do that in Python using pandas.&lt;br&gt;
Hooman gives you some data on how many packets of chips you ate in the last 4 days: &lt;br&gt;
[0, 2, unknown, 8]&lt;/p&gt;

&lt;p&gt;Now he converts this collection of numbers into a series using the built-in pd.Series constructor and interpolates it using the ‘polynomial’ method, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s = pd.Series([0, 2, np.nan, 8]) 
s.interpolate(method='polynomial', order=2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that gives you an approximate value of how many packets of chips you may have eaten on the third day:&lt;br&gt;
 [0, 2, 4.666667, 8]&lt;/p&gt;
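The “magic” here is just curve fitting: a degree-2 polynomial is fitted through the known points and read off at the missing index. A rough sketch of the same idea with plain NumPy, using the same hypothetical chips data:

```python
import numpy as np

# Known points: day indices 0, 1, 3 with values 0, 2, 8 (day 2 is missing)
x = np.array([0, 1, 3])
y = np.array([0, 2, 8])

coeffs = np.polyfit(x, y, 2)     # fit a degree-2 polynomial through the points
missing = np.polyval(coeffs, 2)  # evaluate it at the missing day

print(round(missing, 6))  # 4.666667
```

Three points determine a quadratic exactly, so this reproduces the 4.666667 that pandas reports.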

&lt;p&gt;Sometimes you may find a point in your graph that doesn’t look right. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4mfu78ikslqhv02s1amg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F4mfu78ikslqhv02s1amg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hoomans might have made mistakes, or their sensors might have gone crazy, or their cats might have knocked something off the counter while they were working, and therefore caused these unwanted distortions. Now, how do you know which piece of data was a mistake? Is that even possible every time?&lt;/p&gt;

&lt;p&gt;Well, there is a way. Hooman has told you to call 10 of your kitty buddies and ask how many packets of potato chips they eat per day. Knowing how lazy you are, it is certain that you will make a mistake. &lt;/p&gt;

&lt;p&gt;You have called your buddies and they have told you how many packs of chips they eat per day. Here they are:&lt;br&gt;
[2, 2, 2, 2, 4, 1, 3, 3, 15, 5]&lt;/p&gt;

&lt;p&gt;Now the hooman has asked if you know how many packets of chips you have to eat if you want to say “I eat more chips than 75% of people”. Of course you don’t know. You’re a cat, how are you supposed to know that? Hooman says that he calls this number the ‘75th percentile’. Similarly, the number of chips packs you have to finish in order to say “I eat more chips than 50% of the people” is called the ‘50th percentile’. You also need a 25th percentile to find the mistake you have made. The question is, how do you find the number?&lt;/p&gt;

&lt;p&gt;Of course there are some mathematical equations to find these numbers. Hooman says it’s your homework to learn about them. He will just show how you can use built-in functions to determine those numbers. Hooman loves Python’s numpy and pandas libraries. He has written something like this in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np 
data = [2, 2, 2, 2, 4, 1, 3, 3, 15, 5]
# NumPy 1.22 renamed the 'interpolation' keyword to 'method'
Q1 = np.percentile(data, 25, method='midpoint')
Q2 = np.percentile(data, 50, method='midpoint')
Q3 = np.percentile(data, 75, method='midpoint')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are the 25th, 50th and 75th percentiles of your data, and their values are 2.0, 2.5 and 3.5 respectively. That means, if you eat more than 3.5 packets of chips per day, you will be able to say that you eat more chips than 75% of people.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fih3227iab2kzk0vn56fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fih3227iab2kzk0vn56fg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, the hooman says that the value of (Q3 - Q1) is called the interquartile range, or the IQR, and any value lower than (Q1 - 1.5 * IQR) or higher than (Q3 + 1.5 * IQR) is called an ‘outlier’, which means it does not belong to this dataset. If you want to write that in Python, you will be writing something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;IQR = Q3 - Q1 

low = Q1 - 1.5 * IQR 
up = Q3 + 1.5 * IQR 

outlier =[] 
for x in data: 
    if ((x&amp;gt; up) or (x&amp;lt;low)): 
        outlier.append(x) 
print('outlier in the dataset is', outlier)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you know which number is wrong in your collected data. It was 15. Since you know which value was mistakenly recorded, you can easily replace the outlier with the median of the remaining values. &lt;/p&gt;
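Putting the pieces together, one way to swap the outlier for the median of the remaining values (the IQR fences are the same as above; `method='midpoint'` assumes NumPy 1.22 or newer):

```python
import numpy as np

data = [2, 2, 2, 2, 4, 1, 3, 3, 15, 5]

q1, q3 = np.percentile(data, [25, 75], method='midpoint')
iqr = q3 - q1
low, up = q1 - 1.5 * iqr, q3 + 1.5 * iqr

kept = [x for x in data if low <= x <= up]  # everything except the outliers
replacement = float(np.median(kept))        # median of the well-behaved values

cleaned = [x if low <= x <= up else replacement for x in data]
print(cleaned)  # the 15 becomes 2.0
```

With this data the fences work out to -0.25 and 5.75, so only the 15 falls outside and gets replaced by the median 2.0.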

&lt;p&gt;You actually didn’t make a mistake here, by the way. Your friend got too greedy, ate all those packets of chips that day, and reported it to you. That is not a usual case, so it is considered an outlier too! Lucky you!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Data Science For Cats : PART 1</title>
      <dc:creator>Marjan Ferdousi</dc:creator>
      <pubDate>Mon, 26 Oct 2020 06:40:44 +0000</pubDate>
      <link>https://dev.to/orthymarjan/data-science-for-cats-1d7k</link>
      <guid>https://dev.to/orthymarjan/data-science-for-cats-1d7k</guid>
      <description>&lt;h1&gt;
  
  
  Understanding The Problem
&lt;/h1&gt;

&lt;p&gt;Imagine you’re a cat who is obsessed with potato chips and has no idea what data science is. You have a hooman friend who has a lot of data but is too lazy to do anything with it. You love potato chips so much that one day you decide to launch your own tuna flavoured potato chips brand. You’re not sure whether the hoomans would like your tuna flavoured potato chips, how you should decide the price, or what the demand would be in the future. So you’ve called your hooman friend to get some advice, because he has a lot of data on it, and data can do magic.&lt;/p&gt;

&lt;p&gt;Your hooman friend agrees to provide you with the data and tell you how to use it. Now that you have the data, you plan to identify your questions and find the answers in the data. Firstly, you wanted to know if the hoomans would like your tuna flavour. Your hooman friend explains that if you take a random person from the hooman race who has eaten chips at least once in his life and ask him whether he likes them or not, there can be only two answers: yes or no. Similarly, if you ask them which flavour they like among sour cream, tomato and bbq, the answer will definitely not be jalapenos. In these types of questions, you pick an answer from a definite set of options. Your hooman now tells you that you have successfully figured out &lt;strong&gt;CLASSIFICATION&lt;/strong&gt; problems.&lt;/p&gt;
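To make the idea concrete: a classification answer always comes from a fixed set of labels. A toy sketch in plain Python (the survey answers are made up):

```python
# Hypothetical survey: does each hooman like tuna flavour? Only two possible labels.
answers = ["yes", "no", "yes", "yes", "no"]
labels = sorted(set(answers))  # the definite set of options
print(labels)  # ['no', 'yes']

# The simplest possible "classifier": always predict the majority class
prediction = max(labels, key=answers.count)
print(prediction)  # yes
```

Real classifiers predict the label from features of each hooman, but the target is still one of a fixed set of classes.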

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft4xry6zit9cm2hxxevp7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ft4xry6zit9cm2hxxevp7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you’ve started thinking about your other questions. How can you get some basic idea about the price? You start checking your data, where you see that a 16oz pack of Hay’s chips with onion and sour cream flavour costs $3.66, and an 8oz pack of Tingles tomato salsa flavoured chips costs $2. You’ve noticed that your data holds various pieces of information about the chips, like packet size, flavours, ingredients and so on, and the prices are not necessarily always $3.66 or $2. Depending on features like size or ingredients, the price varies within a range. For example, if the first 5 samples of chips are priced as follows: $2.19, $4.10, $3.50, $2.20 and $2.50, there is no rule that the price of the 6th sample has to be one of these exact prices. It can be $1.99, or $4.50, depending on how complex the flavour profile is and how big the pack is. You mentally take note that your hooman friend calls this a &lt;strong&gt;REGRESSION&lt;/strong&gt; problem.&lt;/p&gt;
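A regression answer, by contrast, is a number on a continuous scale. A tiny sketch fitting a straight line of price against pack size (the sizes and prices below are invented for illustration):

```python
import numpy as np

# Hypothetical pack sizes (oz) and their prices ($)
size = np.array([8, 12, 16, 20])
price = np.array([2.00, 2.80, 3.66, 4.40])

# Least-squares straight line: price ≈ slope * size + intercept
slope, intercept = np.polyfit(size, price, 1)
guess = slope * 10 + intercept  # predicted price for a 10 oz pack
print(round(guess, 2))
```

The prediction is not restricted to the prices already seen; it can land anywhere on the fitted line, which is the essence of regression.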

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnj4ah4do027kcaecng67.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fnj4ah4do027kcaecng67.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hearing you meowing enthusiastically, your hooman friend decides to explain a special type of regression to you. He calls it a &lt;strong&gt;TIME SERIES&lt;/strong&gt; regression. It is a special type of regression where you try to predict future values of something from past values, linked by time. You suddenly realize that your third problem is a time series problem, where you’re trying to predict the demand for potato chips in the next month using the demand data of this month, the previous month and so on. In other words, next month’s sales can be predicted from this month’s sales record. You haven’t understood all the details of this regression yet, but hooman said he will explain it later.&lt;/p&gt;
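A naive time-series sketch: forecast next month's demand from the last few months (the demand numbers here are invented):

```python
# Hypothetical monthly demand for chips (packets sold), oldest first
demand = [100, 110, 120, 130, 140, 150]

# Two of the simplest possible forecasts for next month:
naive_forecast = demand[-1]                  # repeat the last observed value
moving_avg_forecast = sum(demand[-3:]) / 3   # average of the three most recent months

print(naive_forecast)       # 150
print(moving_avg_forecast)  # 140.0
```

Real time-series models get far more sophisticated, but they all share this shape: past values in, future values out.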

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvkez4xwy9jlxj6aretft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fvkez4xwy9jlxj6aretft.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the hooman thinks that you are prepared for starting some real work with all these data. He believes you have understood how to identify your questions and which approach you should take to explain your problems.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
