DEV Community: arham jawad

Chapter 2: Classification

arham jawad — Wed, 10 Dec 2025 08:10:07 +0000

MNIST

MNIST is a dataset of 70,000 small images of digits handwritten by high school students and employees of the US Census Bureau.

In this chapter, we will train a machine learning model that classifies whether a given image shows a 5 or not.

from sklearn.datasets import fetch_openml

mnist = fetch_openml('mnist_784', as_frame=False)

Let's create two variables: X and y for features and labels respectively:

X, y = mnist.data, mnist.target

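There are 70,000 images in total, and each image has 784 features, one per pixel, because each image is 28x28 pixels.

Let's split this into a training set and a test set. The dataset is already ordered so that the first 60,000 images form the training set and the last 10,000 form the test set: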
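As a quick check of the 784 = 28x28 relationship, here is a minimal sketch that reshapes a flat row of 784 values back into an image grid. The array flat_image is a made-up stand-in for one row of X, not real MNIST data.

```python
import numpy as np

# Stand-in for a single row of X: 784 pixel intensities (all zeros here).
flat_image = np.zeros(784)

# Reshape the flat vector into a 28x28 grid of pixels.
image = flat_image.reshape(28, 28)

print(image.shape)
```

With the real data, the same reshape on a row such as X[0] would give an array you could pass to an image plotting function.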

X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]

Training a Binary Classifier

Let's start by training a binary classifier. A binary classifier is a model that has to choose between just two options, such as cat or dog, 5 or not-5, comedy or horror. In our case, the classifier decides whether a given image is a 5 or not. Let's reduce our labels to just 5 or not-5:

y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

This means that y_train_5 will have the value of True if a given image has the label of '5', and False if it's not 5.

Now let's train an SGDClassifier. It uses stochastic gradient descent to tweak the parameters of the model: instead of computing the gradient over the whole training set before each update, it updates the parameters using one training instance at a time, which makes it very efficient on large datasets. I will cover more about gradient descent in the next chapter.

from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)

sgd_clf.fit(X_train, y_train_5)
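To make the "one instance at a time" idea concrete, here is a simplified sketch of stochastic gradient descent for a linear classifier with hinge loss, written in plain NumPy on made-up toy data. It is not the actual SGDClassifier implementation (which also applies regularization and a learning rate schedule); the learning rate, the number of epochs, and all the names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy, linearly separable data: the label is +1 when x0 + x1 > 0, else -1.
X_toy = rng.normal(size=(200, 2))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 0, 1, -1)

w = np.zeros(2)  # weights
b = 0.0          # bias
eta = 0.1        # learning rate (arbitrary choice)

for epoch in range(20):
    # Visit the training instances in a random order.
    for i in rng.permutation(len(X_toy)):
        margin = y_toy[i] * (X_toy[i] @ w + b)
        if margin < 1:
            # The hinge loss is non-zero for this instance, so take a
            # gradient step computed from this single instance only.
            w += eta * y_toy[i] * X_toy[i]
            b += eta * y_toy[i]

# Fraction of toy instances classified correctly by the learned line.
accuracy = np.mean(np.sign(X_toy @ w + b) == y_toy)
print(accuracy)
```

Each update touches only one row of the data, which is why this style of training scales well to large datasets.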

Finally! You’ve built the model. Everything perfect? Well, not exactly. Now we need to measure its performance. That might sound easy, but evaluating a classifier is significantly trickier than evaluating a regression model. There are several things to measure, and we will cover them right now:

Confusion Matrix
Precision Score
Recall Score
F1 Score
ROC (Receiver Operating Characteristic)
ROC-AUC (Area Under the Curve)

Okay! Let's get right into this.

Confusion Matrix

To understand how a confusion matrix works, or more generally how to measure the performance of a classifier, you first need to understand these terms:

Note: I will define them in our context (5 or not-5).

True Positive: The model said that an image is of a 5 and it was a 5. The model was correct.
True Negative: The model said that an image is not a 5 and it was not a 5. The model was correct.
False Positive: The model said that the image is a 5 but it was not 5. The model was incorrect.
False Negative: The model said that the image is not a 5 but it was a 5. The model was incorrect.
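The four definitions above can be checked by counting them directly. This is a small sketch on six made-up labels and predictions (True means "is a 5"); the arrays are illustrative, not model output.

```python
import numpy as np

# Made-up ground truth and predictions for six images (True = "is a 5").
y_true = np.array([True, False, True, False, False, True])
y_pred = np.array([True, False, False, True, False, True])

tp = int(np.sum(y_pred & y_true))    # predicted 5, actually 5
tn = int(np.sum(~y_pred & ~y_true))  # predicted not-5, actually not-5
fp = int(np.sum(y_pred & ~y_true))   # predicted 5, actually not-5
fn = int(np.sum(~y_pred & y_true))   # predicted not-5, actually 5

print(tp, tn, fp, fn)
```

Every prediction falls into exactly one of the four buckets, so the four counts always add up to the number of instances.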

Here is what a Confusion Matrix looks like:

[[53892, 687],
 [1891, 3530]]

The top left number is the number of True Negatives.

The top right number is the number of False Positives.

The bottom left number is the number of False Negatives.

The bottom right number is the number of True Positives.

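Before we can compute the confusion matrix, we need a set of predictions. cross_val_predict performs k-fold cross-validation and returns, for each training instance, the prediction made by a model that never saw that instance during training: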

from sklearn.model_selection import cross_val_predict

y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
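Once predictions are available, the matrix itself can be computed with the confusion_matrix function from sklearn.metrics. The sketch below uses six made-up labels and predictions so it runs on its own; in the setting of this chapter you would presumably pass y_train_5 and y_train_pred instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions (True = "is a 5").
y_true = np.array([True, False, True, False, False, True])
y_pred = np.array([True, False, False, True, False, True])

# Rows are actual classes, columns are predicted classes,
# ordered negative first: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred)
print(cm)
```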

Precision

Precision tells you: "Out of all the predicted positives, how many were actually positive?"

Precision = TP / (TP + FP)

Recall

Recall tells you: "Out of all the actual positives, how many did the model catch?"

Recall = TP / (TP + FN)
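As a worked example of both formulas, this sketch plugs in the true positive, false positive, and false negative counts from the confusion matrix shown earlier (3530, 687, and 1891).

```python
# Counts taken from the confusion matrix shown earlier in this post.
tp = 3530  # true positives
fp = 687   # false positives
fn = 1891  # false negatives

precision = tp / (tp + fp)  # share of predicted 5s that really are 5s
recall = tp / (tp + fn)     # share of actual 5s that the model caught

print(round(precision, 4), round(recall, 4))
```

So when this classifier claims an image is a 5, it is right roughly 84% of the time, but it only detects about 65% of the 5s.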

Precision and Recall

To measure the precision and recall of a model, you can use the precision_score and recall_score functions provided by sklearn.metrics:

from sklearn.metrics import precision_score, recall_score

print(precision_score(y_train_5, y_train_pred))

print(recall_score(y_train_5, y_train_pred))

F1 Score

The F1 score is the harmonic mean of precision and recall, meaning the classifier will only get a high F1 score if both precision and recall are high.

To measure the F1 score, you can use the f1_score function provided by the sklearn.metrics module:

from sklearn.metrics import f1_score

print(f1_score(y_train_5, y_train_pred))

The F1 score favors classifiers that have similar precision and recall.
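To see why the harmonic mean favors balanced classifiers, here is a small sketch that computes F1 = 2 * precision * recall / (precision + recall) for two made-up classifiers. The helper name f1 and the input values are purely illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

balanced = f1(0.80, 0.80)  # similar precision and recall
lopsided = f1(0.99, 0.10)  # excellent precision, poor recall

print(balanced, lopsided)
```

A plain average would give the lopsided classifier about 0.55, while its F1 score stays below 0.2 because the low recall dominates.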

Precision/Recall Trade-off

To understand this trade-off, let's first look at how the SGDClassifier makes its classification decisions. For each instance, it computes a score based on a decision function. If that score is greater than a threshold, it assigns the instance to the positive class; otherwise it assigns it to the negative class.

This creates a tension: raising the threshold makes the classifier more cautious, so precision generally goes up while recall goes down, and lowering the threshold does the opposite. You have to balance the two, and you usually cannot maximize both. If you push precision to 99%, recall could be very low!
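The effect of the threshold can be illustrated with a handful of made-up decision scores. In this sketch the scores and labels are invented for the example; with a fitted SGDClassifier, scores of this kind would come from its decision_function method.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up decision scores and true labels (True = "is a 5").
scores = np.array([-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0])
y_true = np.array([False, False, True, False, True, True, False, True])

results = {}
for threshold in (0.0, 3.0):
    # An instance is predicted positive when its score exceeds the threshold.
    y_pred = scores > threshold
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )

for threshold, (precision, recall) in results.items():
    print(threshold, precision, recall)
```

Moving the threshold from 0 to 3 lifts precision from 0.6 to 1.0 on this toy data, but recall drops from 0.75 to 0.25.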

ROC Curve

The ROC curve plots recall (also called the true positive rate) against the false positive rate (FPR). Once again there is a trade-off: the higher the recall, the more false positives the classifier produces.

One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has a ROC AUC of 1, whereas a purely random classifier has a ROC AUC of 0.5. Scikit-learn provides the roc_auc_score function in the sklearn.metrics module to measure the ROC AUC of a model.
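Here is a minimal sketch of both functions on made-up decision scores. The scores and labels are invented for illustration; roc_curve returns the false positive rate and recall at each candidate threshold, and roc_auc_score summarizes the curve as a single number.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up decision scores and true labels (True = "is a 5").
scores = np.array([-3.0, -1.5, -0.5, 0.2, 0.8, 1.5, 2.5, 4.0])
y_true = np.array([False, False, True, False, True, True, False, True])

# One (FPR, recall) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

auc = roc_auc_score(y_true, scores)

# A scorer that ranks every positive above every negative is "perfect".
perfect_auc = roc_auc_score(y_true, y_true.astype(float))

print(auc, perfect_auc)
```

Plotting tpr against fpr (for example with Matplotlib) would draw the ROC curve itself.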

Conclusion

Finally! You have done it! You learnt about SGDClassifier and covered a whole set of performance metrics, so you should now be comfortable building some pretty good classification projects! But this is not the end! In the next chapter, we will cover the theory behind these machine learning models, which we’ve treated as black boxes for the last two chapters. I’ll explain how they make decisions and what happens under the hood, which is really useful if you want to deeply understand machine learning or go into research (like ME!). Overall, this was a lot of fun. See you soon!

Chapter 2: End-to-End Machine Learning Project

arham jawad — Sat, 06 Dec 2025 15:09:29 +0000

Chapter 2

In this chapter, we create a very simple project where we predict house prices from features such as the total number of bedrooms, the age of the house, the size of the house, and many more.

Pipeline

A pipeline is just a sequence of data processing components. Pipelines are very common because there are usually a lot of data transformations to apply.

Classification or Regression?

First, we need to decide whether this problem is regression or classification. In classification, your model chooses from a fixed set of categories, such as cat or dog, or the genre of a movie. The options are limited: the model predicts the probability that an instance belongs to each class and then picks one. In regression, your model predicts a continuous value, so it does not choose from a list; in principle the number can be anywhere between negative infinity and infinity. The housing price problem is a regression problem because we are predicting a price, which is a continuous quantity.

Select a Performance Measure

Now that we know what kind of problem it is, we need to select a performance measure. For regression, the most commonly used performance measure is the root mean squared error (RMSE). It tells you how bad your model is, not how good it is, so the goal is to minimize the RMSE rather than maximize it as you would with accuracy.
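As a small worked example, the RMSE is the square root of the average squared difference between predictions and true values. The three prices below are made up for illustration.

```python
import numpy as np

# Made-up true prices and model predictions for three houses.
y_true = np.array([200_000.0, 150_000.0, 320_000.0])
y_pred = np.array([210_000.0, 140_000.0, 300_000.0])

mse = np.mean((y_pred - y_true) ** 2)  # mean squared error
rmse = np.sqrt(mse)                    # back in the same unit as the prices

print(mse, rmse)
```

Because of the square root, the RMSE is expressed in the same unit as the target (here, a price), which makes it easier to interpret than the MSE.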

Take a Look at the Data Structure

You can start by looking at the top five rows of the data using pandas' head() method.

Note: You can look at a different number of rows by passing a number to head(); the default is 5 rows.
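A quick sketch of both calls on a tiny made-up DataFrame (the column names rooms and price are hypothetical, not the real housing columns):

```python
import pandas as pd

# Tiny made-up dataset with ten rows.
housing_demo = pd.DataFrame({
    "rooms": range(10),
    "price": range(100, 110),
})

first_five = housing_demo.head()    # default: first 5 rows
first_three = housing_demo.head(3)  # explicit row count

print(first_three)
```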

Create a Test Set

In machine learning, your models learn from a training set, and then you evaluate them on a test set, so at this point we will create a test set. A typical split is 80/20, where we keep 80% of the data for training and the remaining 20% for testing.

Scikit-learn has a train_test_split function in its sklearn.model_selection module. You pass it your features, your labels, and a test_size, and it splits the data for you:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Data Cleaning

Most machine learning algorithms cannot work with missing features, so you have to take care of them. You can drop the rows containing missing values, drop the whole feature, or set the missing values to some other value such as the mean, median, or mode. The third option, called imputation, is usually preferred because it does not throw data away. Scikit-learn provides a SimpleImputer class in its sklearn.impute module:

from sklearn.impute import SimpleImputer


imputer = SimpleImputer(strategy="median")

Since the median can only be computed on numerical features, let's set aside a subset of our dataset containing only the numerical attributes:

import numpy as np

housing_num = housing.select_dtypes(include=[np.number])

X = imputer.fit_transform(housing_num)
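To see what the imputer actually does, here is a self-contained sketch on a tiny made-up DataFrame. The column names bedrooms and age are hypothetical and are not taken from the real housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up numerical dataset with one missing value per column.
demo = pd.DataFrame({
    "bedrooms": [2.0, np.nan, 4.0, 3.0],
    "age": [10.0, 20.0, np.nan, 40.0],
})

imputer = SimpleImputer(strategy="median")
filled = imputer.fit_transform(demo)  # returns a NumPy array

# The medians learned during fit, one per column.
print(imputer.statistics_)
print(filled)
```

Each missing value is replaced by the median of its own column, and the learned medians stay available in statistics_ so the same values can be reused on new data.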

Handling Text and Categorical Attributes

Since most ML models cannot work with text categories directly, we have to convert them into numbers. One option is OrdinalEncoder, which assigns a number to each category. It is useful when two nearby categories are more similar than two distant categories, like "Bad", "Average", "Good", "Excellent". But in our scenario, OneHotEncoder is better. It creates one binary column per category: for each row, the column of its category is 1 and all the others are 0:

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()

housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
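Here is a small self-contained sketch of one-hot encoding on a made-up categorical column. The column name location and its values are hypothetical. By default the encoder returns a sparse matrix, so the example converts it to a dense array for printing.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column with three distinct categories.
demo_cat = pd.DataFrame({
    "location": ["INLAND", "NEAR BAY", "INLAND", "ISLAND"],
})

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(demo_cat).toarray()  # sparse -> dense

# Categories are sorted; each one gets its own column.
print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one 1, in the column that corresponds to its category.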

Feature Scaling

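Many ML models perform poorly when numerical features are on very different scales, like 1 and 100,000, or 100,000 and 1,000,000,000, so we need to bring them to a similar scale. One approach is min-max scaling, which squeezes values into a range such as 0 to 1. Another is standardization, which rescales each feature to zero mean and unit variance. For standardization, Scikit-learn provides a StandardScaler class in the sklearn.preprocessing module: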

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)
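It is worth noting that StandardScaler does not confine values to the 0 to 1 range: it centers each feature on zero mean with unit variance. Scaling into a fixed range is what MinMaxScaler does. This sketch compares the two on three made-up values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature whose values span several orders of magnitude.
values = np.array([[1.0], [100.0], [100_000.0]])

standardized = StandardScaler().fit_transform(values)  # zero mean, unit variance
min_maxed = MinMaxScaler().fit_transform(values)       # squeezed into [0, 1]

print(standardized.ravel())
print(min_maxed.ravel())
```

Standardized values can be negative or greater than 1, while min-max scaled values always fall between 0 and 1.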

Transformation Pipelines

As you can see, you need to apply a lot of transformations to your data, like imputing, feature scaling, and categorical encoding, and doing them manually one by one is tedious and error-prone. So, we can use the Pipeline class provided by the sklearn.pipeline module to chain all these data transformations together:

from sklearn.pipeline import Pipeline

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

It would be more convenient to handle both categorical and numerical features together. For this, we can use the ColumnTransformer class provided by the sklearn.compose module:

from sklearn.compose import ColumnTransformer

num_attribs = housing.select_dtypes(include=[np.number]).columns
# "object" selects string columns; np.object was removed from recent NumPy versions
cat_attribs = housing.select_dtypes(include=["object"]).columns

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehotencode", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])
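To show the combined preprocessor in action, here is a self-contained sketch that rebuilds the same two pipelines and applies them to a tiny made-up DataFrame. The column names rooms, age, and location are hypothetical, the column lists are written out by hand instead of being detected by dtype, and sparse_threshold=0 is an added choice so that the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny made-up dataset: two numerical columns (with gaps) and one categorical.
demo = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0, 4.0],
    "age": [10.0, 25.0, 40.0, np.nan],
    "location": ["INLAND", "NEAR BAY", "NEAR BAY", "INLAND"],
})
num_attribs = ["rooms", "age"]
cat_attribs = ["location"]

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])
cat_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehotencode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 forces a dense NumPy array as output.
preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
], sparse_threshold=0)

prepared = preprocessor.fit_transform(demo)
print(prepared.shape)
```

The output has two scaled numerical columns followed by one column per category, with no missing values left.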

Select and Train a Model

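Finally! It's time for us to select and train a machine learning model on our data! In our case, we have used the XGBoost regressor, which follows the same fit/predict interface as scikit-learn estimators such as RandomForestRegressor:

from xgboost import XGBRegressor

# Hyperparameter values used in this project
model = XGBRegressor(
    n_estimators=330,
    max_depth=10,
    random_state=42,
)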

Evaluate on the Test Set

I evaluated the model on the test set, and here are the results. Not bad!

MSE: 764340224.0, RMSE: 27646.703125