Data Science For Cats : PART 3

#datascience #beginners #machinelearning #ai

Understanding The Relations

With the help of hooman, you’ve fixed your dataset, and the two of you are planning to jump into some real action. You look at the data and find that there are lots of rows and columns. How are you going to find any meaning in these numbers? Hooman understands that you are confused and starts showing you what to do.

Hooman says he wants to find out if there is any relationship among different types of information. Relationship? Among information? How? Hooman gives you an example: when he tries to work on his laptop, you tend to sit on his keyboard. You do not do that at other times. Or, you meow when you are hungry. Here, hooman’s attempt to work on the laptop encourages you to sit on the keyboard. Your increased hunger makes you meow. Any mutual connection like this between two events is what hooman calls CORRELATION. These examples are called positive correlations, because in both cases one thing increases along with the other: your sitting on the keyboard increases as hooman works more, and your meowing increases as your hunger grows. Hooman says that correlation can be negative too, like when you play more as you get less hungry. In this case, one increases when the other decreases.
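A minimal sketch of what positive and negative correlation look like in numbers. The log below is made up for illustration (the column names and values are assumptions, not part of hooman's dataset); it only uses the pandas Series.corr() method.

```python
import pandas as pd

# Hypothetical log of one cat over five afternoons (made-up numbers).
cat_log = pd.DataFrame({
    "hunger": [1, 2, 3, 4, 5],
    "meows": [2, 3, 5, 6, 9],             # goes up as hunger goes up
    "play_minutes": [30, 26, 20, 12, 5],  # goes down as hunger goes up
})

# Series.corr() returns the correlation coefficient between two columns.
hunger_vs_meows = cat_log["hunger"].corr(cat_log["meows"])
hunger_vs_play = cat_log["hunger"].corr(cat_log["play_minutes"])

print("hunger vs meows:", round(hunger_vs_meows, 3))  # positive
print("hunger vs play: ", round(hunger_vs_play, 3))   # negative
```

The first coefficient comes out close to 1 and the second close to -1, which is the "increase together" versus "one goes up, the other goes down" pattern described above.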

Now you understand that you need to find some clue about what makes people buy potato chips. In doing so, the hooman shows you an example. He randomly picks some attributes about different brands of chips and how much people liked them. These attributes are basically a few columns from a file consisting of features of different brands of chips. They look like this:

You would like to know why a specific brand of chips is loved by people. You can see, here the #0 brand was loved by 90% of people and the #4 was loved by 55% of people. There must be a reason behind it.

Hooman picks some values from the columns to show you what he means. He converts them to a dataframe using the pandas library of Python, and calls the built-in DataFrame.corr() function to find the correlations:

import pandas as pd
data = {'potato content': [45,37,42,35,39],
        'packaging quality': [38,31,26,28,33],
        'owner can say potato in how many languages': [1,3,7,1,7],
        'spiciness': [44,44,43,43,44],
        'liked by %': [90,56,88,73,55],
        }
df = pd.DataFrame(data,columns=['potato content','packaging quality','owner can say potato in how many languages','spiciness','liked by %'])
pd.set_option("display.max_rows", None, "display.max_columns", None)
pd.set_option('expand_frame_repr', False)
# Correlation coefficient between every pair of columns
corrMatrix = df.corr()
print(corrMatrix)

Then he shows you the output:

Whoa, more numbers! What do they even mean? You meow at hooman and he starts explaining. He calls the output a CORRELATION MATRIX. So what is this correlation matrix? It is a table of numbers, and each number tells you how strongly one column of your dataset is related to another column. These numbers are called CORRELATION COEFFICIENTs. A coefficient always lies between -1 and 1: values near 1 mean the two columns rise together, values near -1 mean one rises while the other falls, and values near 0 mean there is hardly any straight-line relationship. Of course there are mathematical equations behind this calculation; you can look them up on the internet. How do you read the table? In the first row, the first number represents the relation between ‘potato content’ and ‘potato content’. The correlation of something with itself is always 1. Since hooman wants to know why people like a brand, he now explains the last number of the first row to you. It represents the relation between the potato content of a brand of chips and how much people like it. The further the number is from 0, the stronger the relationship. Here, 0.685493 is pretty high. Similarly, the last number of the second row describes the relationship between packaging quality and people liking the chips, and the last numbers of the other rows represent similar relationships. You can see that some of them are negative. A negative number means the relationship runs in the opposite direction: when that attribute decreases, liking for the brand tends to increase. Hooman says they are ‘negatively correlated’.
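If you are curious about the equation behind one of these numbers, here is a rough sketch. It assumes the Pearson correlation coefficient, which is the default method of DataFrame.corr(); the helper name pearson is made up for this example. It recomputes the ‘potato content’ versus ‘liked by %’ value by hand from the same five rows.

```python
import math

potato_content = [45, 37, 42, 35, 39]
liked_by = [90, 56, 88, 73, 55]

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (hypothetical helper)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How far each value sits from its column average
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    # Sum of products of deviations, scaled by the spread of each column
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

r = pearson(potato_content, liked_by)
print(round(r, 6))  # 0.685493, the last number of the first row
print(round(pearson(potato_content, potato_content), 6))  # 1.0, a column with itself
```

The numerator is positive when both columns tend to sit above or below their averages together, and negative when they move in opposite directions, which is where the sign of the coefficient comes from.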

You now think that a higher potato content makes people like a brand of chips more, and that a lower amount of spiciness makes people love the chips… but wait, ‘owner can say potato in how many languages’?? How on earth can that make people love or hate a brand of chips? You point your paw at that number.

Hooman knows that you have become confused again. He now asks you when you eat chips the most. You think and reply that you eat them most while watching football on television. What else do you do while watching the matches? You wear the jersey of your favourite team and meow a lot. You suddenly realize that it kind of seems like you eat more potato chips when you wear a jersey. But in reality, is wearing a jersey a ‘cause’ of eating more chips? No, your chips intake doesn’t increase because you wear a jersey; the real reason for the chips consumption is watching the game. Hooman calls this ‘real reason’ CAUSATION.

So, correlation doesn’t always imply causation.
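The jersey story can be sketched with a tiny made-up dataset. Everything below is an assumption for illustration (eight invented evenings, hypothetical column names). Across all evenings, jersey-wearing and chips look related, but once you look at game nights and non-game nights separately, the relationship disappears.

```python
import pandas as pd

# Eight hypothetical evenings (made-up numbers).
evenings = pd.DataFrame({
    "watching_game":  [1, 1, 1, 1, 0, 0, 0, 0],
    "wearing_jersey": [1, 1, 1, 0, 0, 0, 0, 1],
    "chips_eaten":    [9, 8, 10, 9, 2, 3, 1, 2],
})

# Across all evenings, the jersey seems to go with more chips.
overall = evenings["wearing_jersey"].corr(evenings["chips_eaten"])

# Compare only evenings that are alike in the 'real reason'.
game = evenings[evenings["watching_game"] == 1]
no_game = evenings[evenings["watching_game"] == 0]
within_game = game["wearing_jersey"].corr(game["chips_eaten"])
within_no_game = no_game["wearing_jersey"].corr(no_game["chips_eaten"])

print("all evenings:   ", round(overall, 2))
print("game nights:    ", round(within_game, 2))
print("non-game nights:", round(within_no_game, 2))
```

Here the jersey only looks important because it usually shows up on game nights. Splitting the data by the suspected real reason is one simple check, though it only works when you already know which column to split on.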

Now that’s a problem. How can you determine which one is the real reason? Well, there is no straightforward way to find that, at least right now. You are still a young cat. You need to grow bigger to learn more complicated stuff. So what are you going to do? For now, as a rough working rule, you can treat a relationship as a better candidate for being causal if its correlation coefficient is far from 0, keeping in mind that even a large coefficient does not prove causation. You can set a threshold for the size of the correlation coefficient (ignoring its sign) and ignore the smaller values for now. For example, 0.027518 and -0.214263 are small if you decide to keep only values whose size is above 0.4. Therefore, you can take ‘potato content’ and ‘spiciness’ into consideration while thinking about why someone liked or disliked a specific brand of potato chips. Here, our finding is that people like potato chips more if the potato content of the chips is higher, or in other words, there is a positive correlation between them. If the spiciness is high, people tend to dislike those chips; in other words, they are negatively correlated. You will need these relationship assumptions for all types of problems, whether classification, regression or time series analysis, to find out and predict something about the data.
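The threshold idea can be sketched like this, reusing hooman's five rows. The 0.4 cutoff comes from the text above; comparing the absolute value of each coefficient (so that a negative one like spiciness still counts) is an assumption of this sketch, and the variable names are made up.

```python
import pandas as pd

data = {'potato content': [45, 37, 42, 35, 39],
        'packaging quality': [38, 31, 26, 28, 33],
        'owner can say potato in how many languages': [1, 3, 7, 1, 7],
        'spiciness': [44, 44, 43, 43, 44],
        'liked by %': [90, 56, 88, 73, 55]}
df = pd.DataFrame(data)

threshold = 0.4  # cutoff used in the text

# Last column of the correlation matrix, without the trivial 1.0 entry
liked_corr = df.corr()['liked by %'].drop('liked by %')

# Keep attributes whose coefficient is far enough from 0, whatever its sign
strong = liked_corr[liked_corr.abs() > threshold]
print(strong)
```

With these numbers only ‘potato content’ (positive) and ‘spiciness’ (negative) survive the cut, while ‘packaging quality’ and the languages column are dropped.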

Top comments (6)

Gregor Gonzalez • Oct 31 '20

thanks! I didn't even know what data science was about. These examples remind me of when we build a "suggested" inventory from monthly sales.

The idea was to make a sales report to later determine the most sold, the most productive and suggest the purchase of products to fill the inventory for the following month.

There were certain tendencies and errors. For example, for the best sellers they were always the same products, small things that were sold very often but it was not the strength of the business, that damaged the report. The most productive was a very expensive product that was sold once a year. Certain days there were rebates or big sales on certain dates of the year and that affected the suggested inventory by indicating that the product was high turnover when in reality it was an isolated event.

In conclusion, analyzing data can be very complex. I would like to learn more about data science and python.

Marjan Ferdousi • Oct 31 '20

Glad you liked it!! I'm planning to continue and hopefully the next part will cover some topics on time series analysis. Hope that's gonna answer some of the confusing questions.

Gregor Gonzalez • Oct 31 '20

oh that's nice. i'll follow for any next part. 🙌

Sudip Podder • Nov 30 '20

I have read up to this part, will complete the rest later. Nice job 'bilai'! Easy to understand for people like me without any prior knowledge of data science 😼