
Data Science For Cats : PART 5

Marjan Ferdousi (orthymarjan) ・6 min read

Fitting Data To Your Model: Classification And Regression

Today you are ready to ‘predict’ something with your hooman. How do you do that? Hooman says you need a ‘model’ that would do that for you. What is a model then? Hooman tells you to think of the process like this: you go to a friend who knows a lot of maths, you give him/her a dataset, he/she does the calculations, and gives you a forecast.


Here, your friend is a machine, and the way he/she learns all the maths to do the calculation is called MACHINE LEARNING. He/She knows how to ‘learn’ new things, and he/she learns the possible forecast using the data you have given. Hooman is going to tell you how he works with his machine friend to find out the forecasts. Of course your machine friend does a lot of maths and you should have a look at those maths in your free time.

Remember the three questions you found out with your hooman at the beginning? You are going to see if you can find the answers to those questions from the data.

The first question was whether someone would want tuna flavored chips or not. Hooman says he is using only a few data points just to show you, so the accuracy of the model is not going to be very high; in other words, the model is going to make a few mistakes. The more data you provide, the better the accuracy usually gets.

First, hooman loads 14 rows of data from a csv file. The model will learn about your problem and its solution from these data. For each of the 14 cats you have the age, the weight, how long they sleep, how many potatoes they eat per day, how many packets of chips they eat per day, how many portions of tuna they eat per day and how many meals they have per day. You also asked them if they want tuna chips or not. You have already cleaned the data as your hooman has shown you earlier, and are going to find out which of these habits are most strongly correlated with wanting tuna chips.

import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn import metrics 
from sklearn.svm import SVC #SVM classifier

likestuna = pd.read_csv("likestuna - Sheet1.csv")
corrMatrix = likestuna.corr()
corrMatrix

The first 5 rows of the file look like this:

[Image: the first 5 rows of the dataset]

And the output matrix is like this:

[Image: the correlation matrix]

From the correlation matrix, you see that you can ignore age and sleep time, as their correlation with wanting tuna chips is very weak. The other factors have some impact on a cat’s liking or disliking tuna chips. Therefore, they are treated as FEATURES. Using these features, your model is going to predict a cat’s liking or disliking for tuna flavored potato chips.
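If you want the machine to do the "look at the matrix and drop the weak columns" step for you, it could be sketched roughly like this. The tiny table below, its values and the 0.3 cut-off are all made up for illustration; they are not hooman's real data, and the cut-off is an assumption rather than a standard rule.

```python
import pandas as pd

# Hypothetical mini-dataset (NOT the real likestuna sheet)
cats = pd.DataFrame({
    "Age":              [1, 2, 3, 4, 5, 6],
    "Tuna per day":     [3, 1, 0, 3, 2, 1],
    "Chips per day":    [2, 0, 1, 2, 2, 0],
    "Wants tuna chips": [1, 0, 0, 1, 1, 0],
})

# Correlation of every column with the target; the target itself is dropped
corr_with_target = cats.corr()["Wants tuna chips"].drop("Wants tuna chips")

# Keep the columns whose absolute correlation passes an arbitrary cut-off
threshold = 0.3  # assumption: tune this for your own data
feature_cols = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()

print(corr_with_target.round(2))
print(feature_cols)
```

In this made-up table, age barely moves with the answer, so it falls below the cut-off, while the tuna and chips columns stay in as features.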

Hooman says that he is going to split the data into two parts: a training set and a testing set. A randomly chosen 85% of the data goes into the training set, which teaches the model which habits make kitties like tuna chips. The remaining 15% is the testing set, used to check how accurately the model works. With only 14 cats, the testing set gets rounded up to 3 cats, leaving 11 for training. Here, we already know the liking or disliking of all 14 cats, so we can easily verify the predictions for the testing set. Hooman wrote:

feature_cols = ['Weight','Potato per day','Chips per day', 'Tuna per day', 'Meal count']
X = likestuna[feature_cols]
y = likestuna['Wants tuna chips']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15)
X_test

X_test contains our testing data. In this case, they are:

[Image: the three rows of X_test]
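A side note on the split: hooman's call has no random_state, so a different set of three cats can land in the testing set every time you run it. A minimal sketch of how to pin the split, using placeholder numbers instead of the real cats and an arbitrary seed of 42 (both are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 14 placeholder rows standing in for the 14 cats (values are arbitrary)
X_demo = np.arange(14).reshape(-1, 1)
y_demo = np.array([0, 1] * 7)

# random_state pins the shuffle, so the same rows land in the test set each run
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.15, random_state=42
)

print(len(Xa_train), len(Xa_test))       # the test share is rounded up to whole rows
print(np.array_equal(Xa_test, Xb_test))  # same seed, same split
```

Pinning the seed is handy when you want to compare two models on exactly the same testing cats.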

We already know whether the three test kitties want tuna chips or not. The #5 and #3 kitties want tuna chips, as their ‘Wants tuna chips’ value is 1. On the other hand, the #4 kitty does not want tuna chips. Hooman says he wants to try a model named ‘SVM’ (Support Vector Machine) to see what the prediction says.

svm = SVC(kernel="linear", C=0.025)
svm.fit(X_train, y_train)
y_pred = svm.predict(X_test)
y_pred

The output is array([1, 0, 0]). The predictions follow the order of the testing set, so the model says the #5 cat wants tuna chips and the other two, #4 and #3, don’t. You see, there is a mistake: the #3 cat actually wants tuna chips. Hooman shows you this:

print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

This says the accuracy is 0.6666666666666666, because 2 of the 3 predictions are right and one is wrong. A larger dataset usually makes the predictions more accurate.
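Accuracy is nothing more than "right answers divided by all answers". A small sketch that checks this by hand, using the three actual answers and the three SVM predictions quoted in the text (the variable names are just for this example):

```python
import numpy as np
from sklearn import metrics

# Values from the text, in testing-set order: cats #5, #4, #3
y_true = np.array([1, 0, 1])    # what the cats actually said
svm_pred = np.array([1, 0, 0])  # what the SVM predicted

# Accuracy = number of matching predictions / number of predictions
manual_accuracy = (y_true == svm_pred).mean()
sklearn_accuracy = metrics.accuracy_score(y_true, svm_pred)

print(manual_accuracy, sklearn_accuracy)
```

Both numbers come out as two thirds, matching what hooman's print statement showed.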

So, this is what you get from the data you already have in your hand. What if there is a #15 kitty? Can you predict if the 15th kitty would want tuna chips or not? Hooman said you can.

data = [[4.7, 4, 1.75, 3.3, 3]]
df = pd.DataFrame(data, columns = ['Weight','Potato per day','Chips per day', 'Tuna per day', 'Meal count'])
svm.predict(df)

It says: array([0]), that means the 15th kitty will not want tuna chips. Of course there is a chance of error as your model is not purrfect.

Is SVM the only way to make a prediction? Of course not. Hooman said there are many other techniques. For example, there is another one that hooman loves a lot. It is called ‘RANDOM FOREST’. Hooman runs a random forest on the same training and testing sets you used for the SVM.

from sklearn.ensemble import RandomForestClassifier #random forest classifier
rf=RandomForestClassifier(n_estimators=100)
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)
y_pred

Here the output becomes array([1, 1, 0]). You see, there are two mistakes this time (#4 and #3), so the accuracy is lower: 0.33 in this case. With only three test cats, every single mistake costs a third of the accuracy. Hooman says that you should try different methods and compare their accuracy to see which one is more appropriate in your case.
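Three test cats are very few to judge a model by, so one possible way to compare models more fairly is cross-validation, where every row gets a turn in the testing set. This is a sketch and not what hooman ran: the data is synthetic (generated by make_classification), and the choice of 60 rows and 5 folds is an assumption made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 60 rows, 5 features (NOT the cat survey)
X_syn, y_syn = make_classification(
    n_samples=60, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

models = {
    "SVM": SVC(kernel="linear", C=0.025),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validation: each row is used for testing exactly once
cv_scores = {
    name: cross_val_score(model, X_syn, y_syn, cv=5)
    for name, model in models.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.2f} over {len(scores)} folds")
```

The mean over the folds is a steadier number to compare than a single three-cat test.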

Now hooman says he wants to solve the second question. How are you going to set the price? He has already explained to you earlier that this is a regression problem. Your data has two parts, weight of a packet of chips and its price.

tuna2 = pd.read_csv("likestuna - Sheet2.csv") 
tuna2

[Image: the weight and price table]

Hooman tries to plot them as follows:

import numpy as np
import matplotlib.pyplot as plt

# Scatter plot of price against weight
tuna2.plot(x='Weight (oz)', y='Price($)', style='o')
plt.title('Weight vs Price')
plt.xlabel('weight')
plt.ylabel('price')

# Fit a straight line (degree-1 polynomial) through the points
z = np.polyfit(tuna2['Weight (oz)'], tuna2['Price($)'], 1)
p = np.poly1d(z)

# Draw the fitted line in dashed red on the same axes
plt.plot(tuna2['Weight (oz)'], p(tuna2['Weight (oz)']), "r--")
plt.show()

[Image: scatter plot of weight vs price with the fitted red line]

You see, the prices at different weights roughly follow a straight line, the red line in the picture. To find an approximate price for any given weight, you just have to find the point on the red line for that weight.

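"Finding the price on the red line" is the same as plugging the weight into the line's equation. A small sketch with made-up packets that sit exactly on the line price = 0.1 * weight + 2 (these numbers are hypothetical, not from hooman's sheet):

```python
import numpy as np

# Hypothetical packets lying exactly on price = 0.1 * weight + 2
weights = np.array([5.0, 10.0, 15.0, 20.0, 30.0])
prices = 0.1 * weights + 2.0

# A degree-1 polyfit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(weights, prices, 1)
line = np.poly1d([slope, intercept])

# Reading the price off the line = evaluating the line at that weight
price_at_25 = line(25.0)
print(round(slope, 3), round(intercept, 3), round(price_at_25, 3))
```

Real prices will not sit exactly on a line, so in practice the line only gives an approximate price.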

Hooman now tries to find out the predictions using this line.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X = tuna2.iloc[:, :-1].values  # every column except the last: the weight
y = tuna2.iloc[:, 1].values    # second column: the price

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
y_pred

The answer is like this: array([2.91206515, 4.11357766, 2.67176265])
Now let’s see how much they differ from the actual prices:

df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
df

[Image: table of actual vs predicted prices]

Close enough, aren’t they?

Now you try to predict an unknown size of packet, say, 25 oz:

regressor.predict([[25]])

The answer is array([4.47403141]), which means the price should be around $4.47, or roughly $4.5. You can now find out an approximate price for any size of packet, yayy!!
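Why is 25 wrapped in two pairs of brackets? predict expects a table with one row per packet and one column per feature, even when there is only one packet and one feature. The sketch below also shows that, for a single feature, predict is just intercept plus slope times weight. The weights and prices here are hypothetical, and demo_reg is a name chosen only for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical weights (oz) and prices ($); the real sheet is not reproduced here
X_w = np.array([[8.0], [12.0], [16.0], [20.0], [28.0]])  # 2-D: rows x features
y_p = np.array([2.7, 3.2, 3.5, 4.1, 4.9])

demo_reg = LinearRegression().fit(X_w, y_p)

# One row, one column: hence the double brackets
by_predict = demo_reg.predict([[25.0]])[0]

# The same number straight from the line equation
by_hand = demo_reg.intercept_ + demo_reg.coef_[0] * 25.0

print(by_predict, by_hand)
```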

You ask hooman how accurate the model is. Hooman shows you how to find out. First you need to find the ‘root mean squared error’ of your model. You had better check the math behind this error later.

import numpy as np
from sklearn import metrics

print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

The error is 0.280631222467514. Is this good or bad? How do you know? Hooman says you have to check the mean of all of your prices first. As a rule of thumb, if the root mean squared error is lower than 10% of the mean, the model is good.

tuna2.describe()

[Image: output of tuna2.describe()]

Now, 10% of 3.287273 is 0.3287273. Your root mean squared error is lower than that, so your model can be considered good! Congrats!
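If you are curious about the maths behind the root mean squared error, it can be sketched by hand: square each error, average the squares, then take the square root. The prices below are hypothetical, and for simplicity the sketch compares against 10% of the mean of these few actual prices, whereas hooman used the mean of all prices in the sheet.

```python
import numpy as np
from sklearn import metrics

# Hypothetical actual and predicted prices (the real test values are not reproduced here)
actual = np.array([3.0, 4.0, 2.5])
predicted = np.array([2.9, 4.2, 2.7])

# RMSE: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((actual - predicted) ** 2))
rmse_sklearn = np.sqrt(metrics.mean_squared_error(actual, predicted))

# Hooman's rule of thumb: compare the RMSE with 10% of the mean price
limit = 0.10 * actual.mean()
print(rmse_manual, rmse_sklearn, rmse_manual < limit)
```

The hand-made number and the one from sklearn agree, so you can trust that the library is doing the same three steps.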
