Apoorva Dave

Classification from scratch — Mammographic Mass Classification

In our previous article, we discussed the classification technique in theory. It’s time to play with the code 😉 Before we start coding, the following libraries need to be installed on your system:

  1. Pandas: pip install pandas
  2. Numpy: pip install numpy
  3. scikit-learn: pip install scikit-learn

The task here is to classify mammographic masses as benign or malignant using different classification algorithms, including SVM, Logistic Regression and Decision Trees. A benign tumor does not invade surrounding tissue, whereas a malignant tumor can invade nearby tissue and spread. Mammography is the most effective method for breast cancer screening available today.


The dataset used in this project is “Mammographic masses” which is a public dataset from UCI repository (

It can be used to predict the severity (benign or malignant) of a mammographic mass from BI-RADS attributes and the patient’s age. It has 6 attributes: 1 goal field (Severity), 1 non-predictive field (BI-RADS) and 4 predictive attributes.

Attribute Information:

  1. BI-RADS assessment: 1 to 5 (ordinal)
  2. Age: patient’s age in years (integer)
  3. Shape (mass shape): round=1, oval=2, lobular=3, irregular=4 (nominal)
  4. Margin (mass margin): circumscribed=1, microlobulated=2, obscured=3, ill-defined=4, spiculated=5 (nominal)
  5. Density (mass density): high=1, iso=2, low=3, fat-containing=4 (ordinal)
  6. Severity: benign=0 or malignant=1 (binomial)


Screenshot of top 10 rows of the dataset

So we talked a lot about the theory behind it. Building a classification model is fairly simple. Follow the steps below and get your own model in an hour 😃 So let’s get started!


Create a new IPython Notebook and insert the below code to import the necessary modules. In case you get any error, do install the necessary packages using pip.

import numpy as np
import pandas as pd
from sklearn import model_selection
from sklearn.preprocessing import StandardScaler
from sklearn import tree
from sklearn import svm
from sklearn import linear_model

Read the data into a dataframe using pandas. To check the top 5 rows of the dataset, use masses_data.head(). You can pass the number of rows as an argument if you want to see a different number of rows. The BI-RADS attribute is marked as non-predictive in the dataset, so it is not loaded. Missing values appear as '?' in the file and are read as NaN because of na_values='?'.

input_file = ''
masses_data = pd.read_csv(
    input_file,
    names=['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity'],
    usecols=['Age', 'Shape', 'Margin', 'Density', 'Severity'],
    na_values='?',
)

You can get summary statistics for each column, such as count, mean and standard deviation, with masses_data.describe().
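
As a rough illustration of head() and describe(), here is a small sketch on a handful of made-up rows in the same layout as the dataset. The rows and the names sample_csv, sample and missing_per_column are assumptions for the example, not real data from the file.

```python
import io
import pandas as pd

# Made-up rows in the dataset's layout:
# BI-RADS, Age, Shape, Margin, Density, Severity ('?' marks a missing value).
sample_csv = io.StringIO(
    "5,61,3,5,3,1\n"
    "4,45,1,1,?,1\n"
    "5,52,4,5,3,1\n"
    "4,30,1,1,3,0\n"
    "5,70,1,5,?,1\n"
)

sample = pd.read_csv(
    sample_csv,
    names=['BI-RADS', 'Age', 'Shape', 'Margin', 'Density', 'Severity'],
    usecols=['Age', 'Shape', 'Margin', 'Density', 'Severity'],
    na_values='?',
)

print(sample.head(3))      # first 3 rows instead of the default 5
print(sample.describe())   # count, mean, std, min, quartiles and max per column

# Count the missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

The count row of describe() only counts non-missing entries, so a count lower than the number of rows is a quick hint that a column has gaps.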


As you might have observed, there are missing values in the dataset. Handling missing data is an important part of data preprocessing. A common approach is to fill the empty values with the mean or mode of the column, depending on the data. For simplicity, for now you can just drop the rows with null values.

masses_data = masses_data.dropna()
features = list(masses_data.columns[:4])
X = masses_data[features].values
labels = list(masses_data.columns[4:])
y = masses_data[labels].values
y = y.ravel()

The matrix X contains the four input features (Age, Shape, Margin and Density), i.e. every column except the target variable. Their values will be used for training. The target variable, Severity, is stored in the vector y.
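
For comparison with dropna(), the mean/mode filling mentioned earlier could look roughly like the sketch below. It is only an illustration on a made-up frame: the toy values, the choice of mean for Age and mode for the coded columns, and the names toy and imputed are all assumptions, not part of the original code.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps, reusing the article's column names.
toy = pd.DataFrame({
    'Age':      [67.0, 43.0, np.nan, 28.0],
    'Shape':    [3.0, 1.0, 4.0, np.nan],
    'Margin':   [5.0, 1.0, 5.0, 1.0],
    'Density':  [3.0, np.nan, 3.0, 3.0],
    'Severity': [1, 1, 1, 0],
})

imputed = toy.copy()

# Numeric column: fill gaps with the column mean.
imputed['Age'] = imputed['Age'].fillna(imputed['Age'].mean())

# Coded categorical columns: fill gaps with the most frequent value (mode).
# mode() can return several values on a tie; iloc[0] takes the smallest.
for col in ['Shape', 'Margin', 'Density']:
    imputed[col] = imputed[col].fillna(imputed[col].mode().iloc[0])

print(imputed)
```

Filling keeps every row, at the cost of inventing plausible values, so it is worth comparing model scores with both approaches.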

Scale the input features so that they are on a comparable scale. Here we are using StandardScaler(), which transforms each feature to have a mean of 0 and a standard deviation of 1.

scaler = StandardScaler()
X = scaler.fit_transform(X)
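
To make the scaling step concrete, the sketch below checks on a tiny made-up array that StandardScaler() matches subtracting each column's mean and dividing by its standard deviation. The array toy_X and the variable names are assumptions for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales.
toy_X = np.array([[20.0, 1.0],
                  [40.0, 2.0],
                  [60.0, 3.0]])

scaled = StandardScaler().fit_transform(toy_X)

# The same transformation by hand: (x - column mean) / column std.
# np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses.
manual = (toy_X - toy_X.mean(axis=0)) / toy_X.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```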

Create training and testing set using train_test_split. 25% of the data is used for testing and 75% for training.

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.25, random_state=0)
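
A quick way to confirm the 75/25 split is to look at the shapes of the returned arrays. The sketch below does this on 20 made-up samples; toy_X, toy_y and the other names are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up samples with 2 features each, and alternating labels.
toy_X = np.arange(40).reshape(20, 2)
toy_y = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=0
)

# With test_size=0.25, 5 of the 20 samples go to the test set.
print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Fixing random_state makes the split reproducible, so the reported scores do not change from run to run.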

To build a Decision Tree classifier from the training set, we just need to use the class DecisionTreeClassifier(). It has a number of parameters, which are described in the scikit-learn documentation. For now, we will just use the default values (random_state is fixed only to make the result reproducible). Use predict() on the test input features X_test to get the predicted values y_pred. The method score() can be used directly to compute the accuracy of the predictions on the test samples.

clf = tree.DecisionTreeClassifier(random_state=0)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
clf.score(X_test, y_test)
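
The predictions in y_pred can also be inspected directly. The sketch below, run on synthetic data from make_classification rather than the mammographic dataset, shows that score() matches accuracy_score() computed from the predictions and prints a confusion matrix. The synthetic data and all variable names here are assumptions for the example.

```python
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 samples, 4 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_clf = tree.DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
y_hat = demo_clf.predict(X_te)

# score() is the accuracy of predict() on the given samples.
acc_from_score = demo_clf.score(X_te, y_te)
acc_from_preds = accuracy_score(y_te, y_hat)
print(acc_from_score, acc_from_preds)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)
```

For a medical task like this one, the confusion matrix is worth a look because it separates missed malignant cases from false alarms, which a single accuracy number hides.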

The DecisionTreeClassifier() without any tuning gives an accuracy of around 77%, which we can say is not the worst.

To build an SVM classifier, the classes provided by scikit-learn include SVC, NuSVC, and LinearSVC. We will build a classifier using SVC class and linear kernel. (To know the difference between SVC with linear kernel and LinearSVC you can go to the link —

svc = svm.SVC(kernel='linear', C=1)
scores = model_selection.cross_val_score(svc,X,y,cv=10)

In this section, I am showing you a different approach for evaluating a classifier. The svc classifier object is created using the SVC class. The cross_val_score() function then fits and scores it using cross-validation on the full data X and y, instead of a single train/test split. Cross-validation gives a more reliable estimate of how well the model generalizes and makes overfitting easier to spot. In k-fold cross-validation, the data is split into k folds; each fold is used once for testing while the remaining k-1 folds are used for training. With cv=10 this returns 10 scores, and their mean is around 79.5%.
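
To show what cross_val_score() does under the hood, the sketch below reproduces its scores with an explicit loop over the folds, again on synthetic data. For a classifier and an integer cv, scikit-learn uses StratifiedKFold without shuffling, which is what the loop assumes. The data and the names svc_demo, folds and manual_scores are assumptions for the example.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data: 200 samples, 4 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    random_state=0,
)

svc_demo = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(svc_demo, X_demo, y_demo, cv=10)
print(scores.shape)                  # one score per fold
print(scores.mean(), scores.std())   # summary across the folds

# The same evaluation written out by hand.
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    # Train on 9 folds, test on the remaining one.
    model = svm.SVC(kernel='linear', C=1)
    model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))

print(np.allclose(scores, manual_scores))
```

Reporting the standard deviation next to the mean gives a feel for how much the score depends on which rows end up in the test fold.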


Cross-Validation (source:

Similar to the Decision Tree classifier, we can also create a Logistic Regression classifier using the class LogisticRegression(). The classifier is fitted on the training set and used to predict target values for the test set in the same way. It is then evaluated with 10-fold cross-validation, which gives a mean score of 80.5%.

clf = linear_model.LogisticRegression(C=1e5)
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
scores = model_selection.cross_val_score(clf,X,y,cv=10)
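
One thing to keep in mind: in the steps above the scaler is fitted on all rows before cross-validation, so every test fold has already influenced the scaling. A possible refinement, not part of the original code, is to put the scaler and the classifier into a scikit-learn pipeline so that scaling is re-fitted on the training folds only. The sketch below uses synthetic data; the data, max_iter=1000 and the names pipe, cv_scores and mean_score are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 200 samples, 4 features, 2 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1,
    random_state=0,
)

# Scaler and classifier are fitted together on the training folds only.
# C=1e5 follows the article; max_iter is raised as a precaution (assumption).
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=1e5, max_iter=1000),
)

cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=10)
mean_score = cv_scores.mean()
print(cv_scores)
print(mean_score)   # the single number to report
```

On a small, simple dataset the difference is usually minor, but the pipeline version keeps the test folds completely unseen during preprocessing.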

Thus, if we want to build a single classifier, we can do it in just 10 lines of code 😄. And with very little effort, we achieved an accuracy of about 80%. You can create your own classification models (there are plenty of options) or fine-tune any of these. If you are interested, you can also give Artificial Neural Networks a shot 😍. I got my best accuracy of 84% with ANNs. To get the entire code, please use this link

If you liked the article do show some ❤ Stay tuned for more! Till then happy learning 😸
