<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jamie</title>
    <description>The latest articles on DEV Community by Jamie (@jamiedumayne).</description>
    <link>https://dev.to/jamiedumayne</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F560089%2F5288935a-c931-49da-aa31-1e5ecb43dcbf.jpg</url>
      <title>DEV Community: Jamie</title>
      <link>https://dev.to/jamiedumayne</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jamiedumayne"/>
    <language>en</language>
    <item>
      <title>Day 7: Building first KNN model</title>
      <dc:creator>Jamie</dc:creator>
      <pubDate>Fri, 22 Jan 2021 22:41:59 +0000</pubDate>
      <link>https://dev.to/jamiedumayne/day-7-building-first-knn-model-1b44</link>
      <guid>https://dev.to/jamiedumayne/day-7-building-first-knn-model-1b44</guid>
      <description>&lt;p&gt;Today's the day to build my first model!&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2sT1MshL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/DapperPlaintiveAmericanrobin.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2sT1MshL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/DapperPlaintiveAmericanrobin.webp" alt="Celebrate the good times"&gt;&lt;/a&gt;&lt;br&gt;
I decided to go with a K-Nearest Neighbours model, since I've never built one of those before and I already have the code for how to do it. I've attempted linear regression and logistic regression before, but it didn't go well.&lt;br&gt;
Next, I hunted on Kaggle for some data that I could use. I wanted to find a dataset that someone else had used with a KNN before, so I'd know it was possible. After a quick Google search I found this dataset on &lt;a href="https://www.kaggle.com/uciml/pima-indians-diabetes-database"&gt;Diabetes&lt;/a&gt;.&lt;br&gt;
I tried to follow the steps I'd been taught on the DataCamp course and examine the data first. However, I hit a snag. I've not worked with data like this before, so I don't know if it's okay for features such as skin thickness to be 0. So I found someone else's notebook on this dataset to see what they did (it's okay because I'm still learning, right?). They also weren't sure, so they filled the zero values with the column means, and I did the same. Now the brain is really sweating.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EEVI1ozV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/PerkyForkedIcefish.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EEVI1ozV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/PerkyForkedIcefish.webp" alt="Working so hard I start sweating"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Okay, so the next step was to build the actual model. I did this by only looking at my notes and I actually managed to get it to work first time! I did have some problems trying to get GridSearchCV working (if you're unfamiliar, it's a way to find the best values for your hyperparameters). I think I'll come back to that on Monday and see if I can get it going. Instead, I managed to get a simple for loop working. Then I finished up by making a ROC curve and getting the area under the curve value. It's not perfect, but I'm pretty proud of getting this far and mostly understanding everything I've done.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NdnfjPjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/LoathsomeUnsightlyEidolonhelvum.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NdnfjPjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/LoathsomeUnsightlyEidolonhelvum.webp" alt="Super smiley Ron Swanson"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I don't really have code to show today since it was just putting everything together from the last week. However, if you'd like to see what I've done you can &lt;a href="https://www.kaggle.com/jamiedumayne/diabetes-data-set-knn/edit/run/52484973"&gt;look for yourself&lt;/a&gt;. Anyway, have a good weekend. I'll be back on Monday!&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FQQGQWy2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/SpecificThirstyIbadanmalimbe.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FQQGQWy2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/SpecificThirstyIbadanmalimbe.webp" alt="Waving bye"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>100daysofcode</category>
    </item>
    <item>
      <title>Day 5 + 6: Pre-processing data</title>
      <dc:creator>Jamie</dc:creator>
      <pubDate>Thu, 21 Jan 2021 20:03:31 +0000</pubDate>
      <link>https://dev.to/jamiedumayne/day-5-6-pre-processing-data-2b5l</link>
      <guid>https://dev.to/jamiedumayne/day-5-6-pre-processing-data-2b5l</guid>
      <description>&lt;p&gt;I decided to group days 5 and 6 together (today and yesterday) since they were both on the same topic. I'd picked up a little bit about pre-processing data through my PhD, but I just didn't really know the actual terms to do things in Python.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0h5U7O8j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/WavyQualifiedDogfish.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0h5U7O8j--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/WavyQualifiedDogfish.webp" alt="Clefairy trying to use a computer"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  The Code
&lt;/h1&gt;

&lt;p&gt;This one is very helpful: it converts categorical columns into binary dummy columns so they work better with machine learning models:&lt;br&gt;
&lt;code&gt;df_origin = pd.get_dummies(df)&lt;/code&gt;&lt;br&gt;
Replace all zero values with NaN, then drop all rows containing NaN (note that dropna returns a new DataFrame, so you need to assign it):&lt;br&gt;
&lt;code&gt;df['column'] = df['column'].replace(0, np.nan)&lt;br&gt;
df = df.dropna()&lt;/code&gt;&lt;br&gt;
Scale the data so every feature sits in a similar range:&lt;br&gt;
&lt;code&gt;from sklearn.preprocessing import scale&lt;br&gt;
X_scaled = scale(X)&lt;/code&gt;&lt;/p&gt;
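Here are those three snippets stitched together on a tiny made-up DataFrame, so you can see them run end to end:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale

df = pd.DataFrame({
    "origin": ["US", "Europe", "Asia", "US"],
    "mpg": [18.0, 0.0, 24.0, 30.0],  # the 0.0 is really a missing value
})

# 1. Convert the categorical column to binary dummy columns.
df = pd.get_dummies(df)

# 2. Replace zeros with NaN, then drop those rows.
df["mpg"] = df["mpg"].replace(0, np.nan)
df = df.dropna()

# 3. Scale the numeric column to zero mean and unit variance.
X_scaled = scale(df[["mpg"]])
print(X_scaled.mean())  # should be (approximately) zero
```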

&lt;p&gt;That's the end of the DataCamp course on scikit-learn (well this introduction anyway).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3CpoWHQg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/TerribleUnfinishedAfghanhound.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3CpoWHQg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/TerribleUnfinishedAfghanhound.webp" alt="Kid walking down the hall to applause but also looking scared, because that's how I feel"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Thoughts on the course
&lt;/h1&gt;

&lt;p&gt;Would I recommend the DataCamp course I did? That's actually a tough question. If you already pay for DataCamp or can get a free trial, I'd recommend doing this course. But I wouldn't recommend paying for DataCamp specifically for this course. It's a good way to learn the terminology and how to do some of the basics of machine learning in Python (see previous days for specifics on what the whole course contained). I just feel like there are other free resources you could probably find to pick up the same knowledge.&lt;br&gt;
Anyway, now that I've finished the course I'm going to try and build my first machine learning model on Kaggle tomorrow.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q4baEI1k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/GiantAdorableCommongonolek.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q4baEI1k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/GiantAdorableCommongonolek.webp" alt="Me freaking out and being all scared because that's how my brain feels right now"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>100daysofcode</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Day 4: Area under the ROC</title>
      <dc:creator>Jamie</dc:creator>
      <pubDate>Wed, 20 Jan 2021 19:42:53 +0000</pubDate>
      <link>https://dev.to/jamiedumayne/day-4-area-under-the-roc-7kh</link>
      <guid>https://dev.to/jamiedumayne/day-4-area-under-the-roc-7kh</guid>
      <description>&lt;p&gt;I actually did this yesterday, but I forgot to post.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xs20-6Ow--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/CrispFirstAurochs.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xs20-6Ow--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/CrispFirstAurochs.webp" alt="House saying oops"&gt;&lt;/a&gt;&lt;br&gt;
Anyway, building on day 3's work on ROC curves, today was all about how to quantify how good a model is. That's where the area under the curve (or AUC) comes in. It boils the whole curve down to a single value between 0 and 1: 0.5 means the model is no better than random guessing, and the closer to 1, the better.&lt;/p&gt;

&lt;h1&gt;
  
  
  The code
&lt;/h1&gt;

&lt;p&gt;Once you've got your model (any model), split your data into train and test sets and have it make probability predictions. Then you just need to import the AUC function and run it:&lt;br&gt;
&lt;code&gt;from sklearn.metrics import roc_auc_score&lt;br&gt;
roc_auc_score(y_test, y_pred_prob)&lt;/code&gt;&lt;/p&gt;
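If you want a complete runnable version, here's the whole pipeline on a made-up dataset. I've used logistic regression, but any model with predict_proba should work:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# AUC needs probabilities for the positive class, not hard labels.
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_pred_prob))
```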

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IZvrWNYV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/FreeCrispAmericanratsnake.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IZvrWNYV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/FreeCrispAmericanratsnake.webp" alt="Super easy thing to do"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>100daysofcode</category>
    </item>
    <item>
      <title>Day 3: ROC curves</title>
      <dc:creator>Jamie</dc:creator>
      <pubDate>Tue, 19 Jan 2021 00:19:41 +0000</pubDate>
      <link>https://dev.to/jamiedumayne/day-3-roc-curves-31no</link>
      <guid>https://dev.to/jamiedumayne/day-3-roc-curves-31no</guid>
      <description>&lt;p&gt;Happy Monday! Today is all about ROCs&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ubLYz6SN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/BriskThriftyCockerspaniel.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ubLYz6SN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/BriskThriftyCockerspaniel.webp" alt="spinning rock"&gt;&lt;/a&gt;&lt;br&gt;
Okay, not that kind of rock. Instead, as I found out, it's a way to evaluate how well logistic regression works (and I believe other classifiers too, though I'm not 100% sure).&lt;br&gt;
So first you start by making a confusion matrix. This takes all of the true positives, false positives, true negatives and false negatives and gives you their counts. Sounds confusing, I know, so I guess they got the right name for the matrix.&lt;br&gt;
The next step is to take these four values and calculate metrics such as precision and accuracy. Precision tells you how reliable your positive predictions are, and accuracy tells you how often the model is right overall. Alongside these you also get a value called the F1 score; this is something I've come across before but not really understood, so I'll try and brush up on that in the future.&lt;br&gt;
Anyway, from these we can produce a receiver operating characteristic (ROC) curve which allows us to see these things more quickly in a graph. Here's an example of a ROC curve I found on Google:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UVGUqvRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/08/ROC-Curve-Plot-for-a-No-Skill-Classifier-and-a-Logistic-Regression-Model.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UVGUqvRh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2018/08/ROC-Curve-Plot-for-a-No-Skill-Classifier-and-a-Logistic-Regression-Model.png" alt="A ROC curve, it's kind of hard to explain in words"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the diagonal line is how good your model would be if it just classified between two things at random: it would be right about half the time. The orange curve shows how good your model is. Basically, the closer it is to the top left, the better your model. With a graph like this, you can plot a ROC curve for multiple models and compare how they do. So if you're, say, comparing KNN and logistic regression, you can visually see how well they perform on a specific task and choose the one that's right for you. Neat, huh?&lt;/p&gt;

&lt;p&gt;I'm trying a new thing today where I post some of the code I learnt so you can try and learn these new things with me. What do you think? Would you like to see more of this or would you prefer I keep it more of a diary?&lt;/p&gt;

&lt;h1&gt;
  
  
  The Code
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2Ti7abXd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/SoggyFearfulCoelacanth.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Ti7abXd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/SoggyFearfulCoelacanth.webp" alt="Spinning earth in code form, like literally ones and zeroes changing to make it look like that. It's a good metaphor for code right?"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How to make a confusion matrix:&lt;br&gt;
&lt;code&gt;from sklearn.metrics import confusion_matrix&lt;br&gt;
print(confusion_matrix(y_test, y_pred))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;How to make and plot a ROC curve:&lt;br&gt;
&lt;code&gt;import matplotlib.pyplot as plt&lt;br&gt;
from sklearn.metrics import roc_curve&lt;br&gt;
y_pred_prob = logreg.predict_proba(X_test)[:,1]&lt;br&gt;
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)&lt;br&gt;
plt.plot([0, 1], [0, 1], 'k--')&lt;br&gt;
plt.plot(fpr, tpr, label='Logistic Regression')&lt;br&gt;
plt.xlabel('False Positive Rate')&lt;br&gt;
plt.ylabel('True Positive Rate')&lt;br&gt;
plt.title('Logistic Regression ROC Curve')&lt;br&gt;
plt.show()&lt;/code&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>100daysofcode</category>
      <category>roc</category>
    </item>
    <item>
      <title>Day 2: learning linear regression</title>
      <dc:creator>Jamie</dc:creator>
      <pubDate>Sat, 16 Jan 2021 00:19:55 +0000</pubDate>
      <link>https://dev.to/jamiedumayne/day-2-learning-linear-regression-2173</link>
      <guid>https://dev.to/jamiedumayne/day-2-learning-linear-regression-2173</guid>
      <description>&lt;p&gt;Well I'm back again (that's a good start to my 100 days of learning machine learning).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--egTCq4Ut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/FlatComposedHornbill.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--egTCq4Ut--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/FlatComposedHornbill.webp" alt="Confetti raining down on a corgi, because I'm a good puppy"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Day 2
&lt;/h1&gt;

&lt;p&gt;I moved onto the next chapter of the scikit-learn DataCamp course. Today was all about linear regression. I can really see what people mean when they say it only takes a few lines of code to do machine learning. Both KNN (see yesterday's post) and linear regression take like 10 lines, tops. So that's nice.&lt;/p&gt;

&lt;p&gt;Anyway, today I managed to get a linear regression model working, which I'm pretty happy about. I also now know what ridge regression and lasso regression are. They were always just mystery words that people would throw out sometimes. Like if you order a plain black coffee and then they ask if you want sugar or milk with that. I'd just look at them in confusion and be like "I just wanted coffee??". But now I understand that lasso and ridge regression are just ways to supplement linear regression if you need them. Just like milk and sugar with coffee (I think, I don't actually drink coffee).&lt;/p&gt;
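To make the coffee analogy concrete, here's how the three look side by side in scikit-learn on a made-up regression dataset. The alpha values are just illustrative; they control how strong the regularisation is:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=10, noise=10.0,
                       random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

# Plain coffee, coffee with milk, coffee with sugar.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),  # shrinks coefficients towards zero
    "lasso": Lasso(alpha=1.0),  # can zero some coefficients out entirely
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # R^2 on held-out data
```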

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--plCgP6aI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/WarmheartedForthrightGuernseycow.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--plCgP6aI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/WarmheartedForthrightGuernseycow.webp" alt="Alex from modern family spitting out a drink, basically my reaction to drinking coffee"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm unsure what my next step should be. I could carry on with the course and learn a bit more about hyperparameters, or I could jump over to Kaggle and try building my first model. I want to do both, so it's just a question of which one I do first. What do you think I should do?&lt;/p&gt;

&lt;p&gt;Be back on Monday for day 3, have a good weekend!&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FIBd1Ggw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/SomeJoyousDove.webp" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FIBd1Ggw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thumbs.gfycat.com/SomeJoyousDove.webp" alt="Jake the dog sleeping in a drawer, I accidentally carried on the puppy theme"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>100daysofcode</category>
    </item>
    <item>
      <title>100 days of learning machine learning: day 1</title>
      <dc:creator>Jamie</dc:creator>
      <pubDate>Thu, 14 Jan 2021 22:43:16 +0000</pubDate>
      <link>https://dev.to/jamiedumayne/100-days-of-learning-machine-learning-day-1-3544</link>
      <guid>https://dev.to/jamiedumayne/100-days-of-learning-machine-learning-day-1-3544</guid>
      <description>&lt;p&gt;So I'm trying something new. I'm going to try and spend 100 days learning machine learning. I've been using Python for about a year and also have some theoretical knowledge of machine learning after completing &lt;a href="https://www.coursera.org/learn/machine-learning"&gt;this course&lt;/a&gt;. I now need to learn how to actually &lt;em&gt;do&lt;/em&gt; machine learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why am I doing this?&lt;/strong&gt; What else is there to do right now? I'm bored, stuck in the house. The main reason though is being able to do machine learning would help with my PhD research immensely!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why are you telling me all this?&lt;/strong&gt; I'm hoping by posting on DEV it'll help me to stay motivated and keep practising. I'm also hoping that maybe I can find other people who want to learn machine learning (or are super bored) and we could figure some of this stuff out together.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/13YZ9H9oyZOjhm/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/13YZ9H9oyZOjhm/giphy.gif" alt="Get on with it, from Monty Python and the Holy Grail"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anyway...&lt;/p&gt;

&lt;h2&gt;
  
  
  Day 1
&lt;/h2&gt;

&lt;p&gt;So today is the first day. I'm starting by using DataCamp to learn scikit-learn. I'm hoping I can pick up some of the coding knowledge there, then once I understand it a bit, move over to Kaggle and try it out on some real datasets. Today I spent about an hour working on a K-nearest neighbours model. I was hoping to start with linear regression, but this was actually quite simple to pick up (always nice). The examples showed how to use it with the iris dataset, and then I got to work with a dataset from American politics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My favourite thing I learnt today&lt;/strong&gt; is how to use a for loop to change the number of neighbors to be used, and evaluate the score of the KNN each time. This is so simple but blew my mind at the same time. It's such a clever way to work out the best number of neighbors each time.&lt;/p&gt;
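In case it helps anyone else, here's roughly what that loop looks like (I've used the iris dataset here, since that's what the course examples used):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42)

# Fit a KNN for each k and keep the one that scores best on the test set.
best_k, best_score = 1, 0.0
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    score = knn.score(X_test, y_test)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, best_score)
```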

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>100daysofcode</category>
    </item>
  </channel>
</rss>
