Beginners’ journey to machine learning

Hello, data science cat is back.

After kitty became successful in decision making with machine learning (https://dev.to/orthymarjan/data-science-for-cats-1d7k), a lot of hooman friends have been asking him questions like:

I understand the basics, but where do I start coding?
I understand the code from the internet, but how can I start writing code by myself?
How do I organize my project?
How do I visualize the solution of a real life problem?

Kitty now tries to explain the answers with real life examples.

First, think of the word ‘learning’. Kitty wants you to remember how you started learning formally, and how you later applied that knowledge in the real world. Imagine yourself as a teacher in a school. A new student comes and gets enrolled in a class. You prepare a course curriculum for the class and start teaching accordingly. You take a few class tests to assess how the kids are doing. At the end of the year, you prepare a final test based on what you’ve taught throughout the year. You distribute the question paper to the kids, they answer the questions and you verify the answers to see how well they have learnt. If their answers are above a certain level, they pass. Otherwise they fail. Those who pass later get jobs and use their knowledge from the school to complete their tasks. For example, if you are an English teacher, you teach the kids grammar and literature. Later in real life they might not have to write poems or fill in the blanks with the right forms of verbs, but they use their knowledge of the English language to write a report or a product document.

Remember, machine learning is also a procedure of learning. Now let’s compare the procedure of a school with machine learning. In our example, we will use Python to work on a very small dataset (https://www.kaggle.com/ronitf/heart-disease-uci), where the machine learning model will try to predict whether a patient has a high risk of heart disease based on some test reports. You can follow a similar process in R too.

import pandas as pd
df = pd.read_csv('Heart Disease Dataset.csv')
df.head()

Make Curriculum:
First, you decide which topics you would like to teach your student throughout the year. You have to decide which topics (in this case, columns or features) your student has to understand in order to learn whatever you’re trying to teach him (in this case, whether the patient has a high risk of heart disease). You’ll also decide how much data you will use to teach him and how you will take exams later in the course. You’ll be defining ‘training data’ to teach throughout the year and ‘testing data’ for the exam, both taken from your actual dataset.

from sklearn.model_selection import train_test_split

feature_cols = ['age',  'sex',  'cp', 'trestbps', 'chol', 'fbs',  'restecg',  'thalach',  'exang',  'oldpeak',  'slope',  'ca', 'thal']
X = df[feature_cols]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)

Here, in X_train you have training features, in X_test you have features of the testing set, in y_train you have the targets of the training set and in y_test you have the targets of the testing set. Train-test ratio is 90%-10% here.
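To make the split a bit more concrete, here is a minimal sketch on a made-up table of 100 rows (the toy columns and values are assumptions for illustration, not the heart dataset). It also passes two optional arguments of train_test_split that the code above does not use: random_state, so the same split comes back on every run, and stratify, so both sets keep the same share of each target class.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: 100 rows, two made-up features.
toy = pd.DataFrame({
    "age": [30 + (i % 40) for i in range(100)],
    "chol": [180 + (i * 7) % 120 for i in range(100)],
    "target": [i % 2 for i in range(100)],  # 50 zeros and 50 ones
})

X_toy = toy[["age", "chol"]]
y_toy = toy["target"]

# Hold back 10% of the rows for the final exam.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.10, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)         # (90, 2) (10, 2)
print(y_te.value_counts().to_dict())  # class counts in the exam set
```

With a dataset this small, a stratified split helps make sure the exam paper does not end up with almost only one kind of patient.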

Student Enrolment:
The student in the process of machine learning is your model. At first, it knows nothing. It’s your job to set up a suitable learning procedure for it so that it can later perform in the real world. Let’s say our teaching procedure is the SVM model in this case. We declare a variable named svm and set its parameters (here we’ve chosen a linear kernel; there are many more options that you can find in the documentation).

from sklearn.svm import SVC #SVM classifier
svm = SVC(kernel="linear")

Teaching:
In our case, teaching is ‘fitting’ data to the model. When you fit the training set to the model, it ‘learns’.

svm.fit(X_train, y_train)

Exam:
In the case of exams, our question paper is X_test. You already have the correct answers to the question paper in hand, which is y_test. The student will write his answers in another variable, let’s say y_pred.

y_pred = svm.predict(X_test)

Evaluating the test papers:
You can already understand that you can verify the answers of a student by comparing y_test and y_pred, and decide if he passes or fails.

from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
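If you are curious what the accuracy score actually measures, it is simply the fraction of answers the student got right. Here is a small hand-written sketch of that idea (the accuracy function and the five example answers are made up for illustration; in practice you would keep using metrics.accuracy_score).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical answers for five patients.
answer_key = [1, 0, 1, 1, 0]       # plays the role of y_test
student_answers = [1, 0, 0, 1, 0]  # plays the role of y_pred

print("Accuracy:", accuracy(answer_key, student_answers))  # Accuracy: 0.8
```

Four of the five answers match the answer key, so the student scores 0.8.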

Importance of class tests:
If you find your student didn’t do well in the final exams, two things might have happened. The student might not have learnt properly throughout the year, or maybe he did study well but for some reason he couldn’t do well in the finals. Here comes the importance of taking class tests. If the student hadn’t studied properly throughout the year, his class test results would not be satisfactory. If he has done well in the class tests, failing in the finals would indicate some other problem. If our accuracy is not satisfactory, let’s check his class test performance using k-fold cross validation (which splits the training set into k parts and, in turn, holds each part back as a small test while training on the rest).

from sklearn import model_selection

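# random_state only applies when shuffle=True (recent scikit-learn versions raise an error otherwise)
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=31)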
model =  SVC(kernel="linear")

results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold)
results
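To see how the ‘class tests’ are cut out of the training set, here is a hand-rolled sketch of k-fold splitting in plain Python. It is only an illustration of the idea, without shuffling, and kfold_indices is a made-up helper, not a scikit-learn function.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous k-fold splits."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("need 2 <= n_splits <= n_samples")
    indices = list(range(n_samples))
    # If the rows do not divide evenly, the first folds get one extra row.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

# Ten rows, five class tests: each row is used for a test exactly once.
for train_idx, val_idx in kfold_indices(n_samples=10, n_splits=5):
    print("study on", train_idx, "class test on", val_idx)
```

Each round, the model studies on four fifths of the rows and is tested on the remaining fifth, so cross_val_score returns one score per round.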

If the cross validation results are poor too, that suggests ‘underfitting’, meaning the model has not managed to learn the patterns in the data, for example because the model is too simple or because we couldn’t provide enough useful data for it to study (which happened in our case, as our accuracy is pretty mediocre and the cross validation result isn’t that good either). If the cross validation results are good but the final test result is poor, a probable cause is ‘overfitting’, meaning the model has learnt the training data a bit too well (including noise and other bad stuff) and cannot generalise to new data. With an exam set as small as ours, it can also simply be bad luck in the split. Here is a link for you on what you can do (like changing parameters) in this case: https://adityarohilla.com/2018/11/02/a-brief-introduction-to-support-vector-machine/.

You can also test other models like decision tree, random forest or naive Bayes from the same Python library (scikit-learn) to check which one suits your data best.
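One possible way to compare several students on the same class tests is sketched below. It runs on synthetic data from make_classification as a stand-in for the heart dataset, and the model settings (such as n_estimators=50) are arbitrary choices for the demo, so the scores it prints say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 300 rows, 13 features, binary target.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

candidates = {
    "svm (linear)": SVC(kernel="linear"),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "naive bayes": GaussianNB(),
}

# Every model sits the same five class tests, so the comparison is fair.
demo_folds = KFold(n_splits=5, shuffle=True, random_state=31)

mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=demo_folds)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

best = max(mean_scores, key=mean_scores.get)
print("best on this toy data:", best)
```

To try this on the real data, you would pass X_train and y_train instead of X_demo and y_demo.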

Using this knowledge in the workplace:
What your student has learnt well throughout his school life, he will be able to apply at his workplace too. To make him remember his training, we can export the trained model to a file and later load that file in another system to make predictions. You can easily integrate your model into a system built with a Python framework like Flask by loading the model file.

To export,

from joblib import dump

# save the trained SVM model to a file
dump(svm, filename="classification.joblib")

To import,

from joblib import load

# load the trained model back from the file
loaded_model = load("classification.joblib")
# one new patient; the values must be in the same order as feature_cols
loaded_model.predict([[35,  0,  2,  115,  245,  0,  0,  147,  0,  0.4,  2,  0,  2]])

Here you can see that our model predicted the heart disease risk of a new patient who was not part of our training set, and the prediction is [1]. Here is an example of how to integrate such files with Flask: https://www.analyticsvidhya.com/blog/2020/04/how-to-deploy-machine-learning-model-flask/
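If you want to convince yourself that the student really remembers his training after being saved and loaded, a round trip like the following sketch can help. It assumes synthetic data with made-up column names (f0 to f3) and a hypothetical file name in a temporary folder, so it does not touch classification.joblib.

```python
import os
import tempfile

import pandas as pd
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the training data, with made-up column names.
cols = ["f0", "f1", "f2", "f3"]
X_arr, y_demo = make_classification(n_samples=100, n_features=4, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=cols)

trained = SVC(kernel="linear").fit(X_demo, y_demo)

# Save to a temporary folder and load it back, as another system would.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    dump(trained, path)
    restored = load(path)

# Build the new row as a DataFrame with the training column names,
# so the values cannot silently end up in the wrong order.
new_patient = pd.DataFrame([[0.1, -0.3, 1.2, 0.5]], columns=cols)

same = (restored.predict(X_demo) == trained.predict(X_demo)).all()
print("round trip keeps predictions:", same)
print("prediction for the new row:", restored.predict(new_patient))
```

Passing a DataFrame with the same column names also avoids the feature-name warning that recent scikit-learn versions show when a model fitted on a DataFrame is given a plain list.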

Here I’m rewriting the code in sequence so that it becomes a bit clearer if you are new.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC #SVM classifier
from sklearn import metrics

# load the dataset
df = pd.read_csv('Heart Disease Dataset.csv')

# make curriculum: choose the features and the target, then split
feature_cols = ['age',  'sex',  'cp', 'trestbps', 'chol', 'fbs',  'restecg',  'thalach',  'exang',  'oldpeak',  'slope',  'ca', 'thal']
X = df[feature_cols]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)

# enrol the student and teach
svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

# exam and evaluation
y_pred = svm.predict(X_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
