Data Science For Cats : PART 2

#datascience #beginners #ai #machinelearning

Preparing Your Data

Now that you know a few things about the types of problems you’re going to solve, you decide to look at the data you have received from the hooman. OMG! There are numbers everywhere. It’s gonna take days just to look at them all. Seeing you hissing and growling at the data file, the hooman laughs and tells you that instead of reading the data, you need to VISUALIZE it. As you look at him in confusion, he explains that you need to plot your data as graphs to see its shape and to understand why it looks like that. At first you thought you would need some fancy tools or a lot of code to make the graphs. Hooman shows you that you can plot cool graphs with a simple tool like Microsoft Excel.

Once you start plotting the data into a graph, you find that some of the values are missing. Someone must have forgotten to add some sales records that day. Sigh. What do you do now?

You point out the missing part of the graph to your hooman friend. Hooman assures you that there is a way to find the missing values. You will have to guess them. Wait, you are not going to randomly put a value there, are you?

Hooman now teaches you a few techniques to guess the missing values. You can copy the value just before or just after the gap. You can use the mean of the values you do have. Or, you can connect the missing dots by guessing the pattern of your graph.
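Here is a minimal sketch of the first two kinds of guess in plain Python. The sales numbers are made up for illustration, and the sketch assumes there is exactly one gap and that it is not on the first or last day.

```python
# Hypothetical daily sales; None marks the day nobody wrote down.
sales = [3, 5, None, 9]

# Position of the gap (assumes exactly one gap, not at either end).
gap = sales.index(None)

# The values we actually have.
known = [x for x in sales if x is not None]

fill_prior = sales[gap - 1]          # copy the day before the gap
fill_next = sales[gap + 1]           # copy the day after the gap
fill_mean = sum(known) / len(known)  # average of the known days

print(fill_prior, fill_next, round(fill_mean, 2))
```

If you use pandas, a Series offers ffill, bfill and fillna for the same three ideas.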

As an example, hooman has asked you to find out where to put the missing data point in the following graph. In figure (1) you put the point too low, and in figure (2) you put it too high. Then you suddenly realize that the graph looks like a wave and you have to put the point in a way that keeps the shape of the wave intact. You’re right, yayy!

Now hooman tells you that real-life data is really complex and contains noise, distortions or misplaced values. You cannot simply draw it point by point as a known shape like a straight line or a wave. Instead, you have to approximate the graph with something that is really close to a known shape. Once you can relate your graph to such a shape, you use its equation to find the approximate position of your missing point or, in other words, the approximate value of your missing data. This is called INTERPOLATION. It sounds complicated, doesn’t it? Hooman tells you not to worry, because smart hoomans have created magic tools (like pandas.DataFrame.interpolate in the pandas library for Python) that perform these interpolation operations for you. However, hooman insists that you should search the internet for the basics and the equations, because knowing what you’re doing is really important (and he wants you to go deeper by yourself, because you always learn more when you face difficulties and have to work things out on your own!). Sigh, hoomans are annoying.

Hooman shows you how to do that in Python using pandas.
Hooman gives you some data on how many packets of chips you ate in the last 4 days:
[0, 2, unknown, 8]

Now he turns this collection of numbers into a pandas Series with pd.Series, writes the unknown value as np.nan, and interpolates it using the ‘polynomial’ method in this way:

import numpy as np
import pandas as pd
s = pd.Series([0, 2, np.nan, 8])
s.interpolate(method='polynomial', order=2)  # this method needs SciPy installed

And that gives you an approximate value of how many packets of chips you may have eaten on the third day:
[0, 2, 4.666667, 8]
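If you are curious where 4.666667 comes from, here is a rough sketch in plain Python. It assumes the days are numbered 0, 1, 2, 3 and draws the single order-2 curve that passes through the three known points, then reads off its value on the missing day. The helper name polynomial_through is made up for this example, and this is only one way to reproduce the number, not necessarily how pandas computes it internally.

```python
# Known days (x) and chip counts (y); day 2 is the missing one.
known_x = [0, 1, 3]
known_y = [0, 2, 8]

def polynomial_through(xs, ys, x):
    """Evaluate at x the polynomial that passes through every (xs, ys) point.

    Uses the Lagrange form; with three points this is a curve of order 2.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

missing = polynomial_through(known_x, known_y, 2)
print(round(missing, 6))  # 4.666667
```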

Sometimes you may find a point in your graph that doesn’t look right.

Hoomans might have made mistakes, or their sensors might have gone crazy, or their cats might have knocked something off the counter while they were working, and therefore caused these unwanted distortions. Now, how do you know which piece of data was a mistake? Is that even possible every time?

Well, there is a way. Hooman has told you to call 10 of your kitty buddies and ask how many packets of potato chips they eat per day. Knowing how lazy you are, it is certain that you will make a mistake.

You have called your buddies and they have told you how many packs of chips they eat per day. Here they are:
[2, 2, 2, 2, 4, 1, 3, 3, 15, 5]

Now the hooman has asked if you know how many packets of chips you have to eat if you want to say “I eat more chips than 75% of people”. Of course you don’t know. You’re a cat, how are you supposed to know that? Hooman says that he calls this number the ‘75th percentile’. Similarly, the number of packets of chips you have to finish in order to say “I eat more chips than 50% of people” is called the ‘50th percentile’. You also need the 25th percentile to find the mistake you have made. The question is, how do you find these numbers?

Of course there are some mathematical equations to find the numbers. Hooman says it’s your homework to learn about them. He will just show you how to use built-in functions to determine those numbers. Hooman loves the numpy and pandas libraries of Python. He has written something like this in Python:

import numpy as np

data = [2, 2, 2, 2, 4, 1, 3, 3, 15, 5]
# NumPy 1.22 and later name this keyword `method`; older versions called it `interpolation`
Q1 = np.percentile(data, 25, method='midpoint')
Q2 = np.percentile(data, 50, method='midpoint')
Q3 = np.percentile(data, 75, method='midpoint')

These are the 25th, 50th and 75th percentiles of your data, and their values are 2.0, 2.5 and 3.5 respectively. That means if you eat more than 3.5 packets of chips per day, you can say that you eat more chips than 75% of people.
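As a head start on that homework, here is a small sketch of the ‘midpoint’ rule without numpy. The function name midpoint_percentile is made up for this example. The idea: sort the values, work out how far along the sorted list the percentile falls, and average the two neighbours on either side of that position.

```python
import math

def midpoint_percentile(values, p):
    """Return the p-th percentile using the midpoint of the two nearest values."""
    ordered = sorted(values)
    # Fractional position in the sorted list (0 = smallest, len - 1 = largest).
    position = (len(ordered) - 1) * p / 100
    lower = ordered[math.floor(position)]
    upper = ordered[math.ceil(position)]
    return (lower + upper) / 2

data = [2, 2, 2, 2, 4, 1, 3, 3, 15, 5]
quartiles = [midpoint_percentile(data, p) for p in (25, 50, 75)]
print(quartiles)  # [2.0, 2.5, 3.5]
```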

Now, the hooman says that the value of (Q3 - Q1) is called the interquartile range, or IQR. Anything lower than (Q1 - 1.5 * IQR) or higher than (Q3 + 1.5 * IQR) is called an ‘outlier’, which means it sits suspiciously far from the rest of the dataset. If you want to write that in Python, you will write something like this:

IQR = Q3 - Q1

# Anything outside these two limits counts as an outlier
low = Q1 - 1.5 * IQR
up = Q3 + 1.5 * IQR

outliers = []
for x in data:
    if x < low or x > up:
        outliers.append(x)
print('outliers in the dataset:', outliers)

Now you know which number is wrong in your collected data. It was 15. Since you know which value was mistakenly recorded, you can easily use the median of the rest of the values to replace the outlier.
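Here is a small sketch of that replacement step. It assumes the limits -0.25 and 5.75, which follow from Q1 = 2.0, Q3 = 3.5 and IQR = 1.5, and it uses median from Python’s standard statistics module.

```python
from statistics import median

data = [2, 2, 2, 2, 4, 1, 3, 3, 15, 5]

# Limits computed from Q1 = 2.0, Q3 = 3.5, IQR = 1.5
low, up = -0.25, 5.75

# Values that fall inside the limits
normal = [x for x in data if low <= x <= up]

# Median of the remaining values replaces every outlier
replacement = median(normal)
cleaned = [x if low <= x <= up else replacement for x in data]

print(cleaned)  # [2, 2, 2, 2, 4, 1, 3, 3, 2, 5]
```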

By the way, you didn’t actually make a mistake here. Your friend got too greedy, ate all the packets of chips that day, and told you about it. That is not a usual case, so it is considered an outlier too! Lucky you!