Classification Problem: Predicting Values in Group Column

Cecilia Ngunjiri — Mon, 24 Feb 2025 10:46:25 +0000

Project link: https://github.com/CessNgunjiri/Project/blob/main/cecilia.ipynb

Introduction
The project involves a classification problem. The objective is to predict the values in the "group" column, which can either be "control" or "patient."

Dataset Columns and Descriptions

rownames: A unique identifier for each record in the dataset, often used as an index.
subject: Represents the identifier or label for the individuals or entities being studied.
age: Indicates the age of the subjects in the dataset.
group: The target variable for the classification problem, categorizing each subject as either "control" or "patient."

Data Understanding/Inspection

I started by importing the Python libraries required for this project, which include pandas, seaborn, and matplotlib.pyplot.
The data has 5 columns and 945 rows.

Data Cleaning

The dataset was clean and there were no missing values in any of the columns. It also has no duplicates.
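A check like this can be sketched in pandas as follows. The small data frame is made up for illustration (it deliberately contains one missing value and one duplicate row so the counts are visible); the column names come from the description above, the values are hypothetical.

```python
import pandas as pd

# Made-up rows standing in for the project data (hypothetical values).
df = pd.DataFrame({
    "subject": ["s1", "s2", "s3", "s2"],
    "age": [34, 51, None, 51],
    "group": ["control", "patient", "patient", "patient"],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()

# Number of rows that are exact copies of an earlier row.
duplicate_rows = int(df.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

On a clean dataset, every entry of `missing_per_column` and the duplicate count would be zero.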

Checking for Outliers
I used a box plot to check for outliers in the dataset. There are notable outliers in the exercise column.

To check the number of outliers, I calculated the interquartile range.

I decided to remove the outliers because there are only 83 of them out of 945 rows (about 9%). Removing them preserves the integrity of the data without discarding a large part of it.
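The interquartile-range rule can be sketched as below. The values are hypothetical, and the 1.5 multiplier is the conventional choice rather than something stated in this post.

```python
import pandas as pd

# Hypothetical values for the exercise column; 40 is an obvious outlier.
df = pd.DataFrame({"exercise": [1, 2, 2, 3, 3, 3, 4, 4, 5, 40]})

q1 = df["exercise"].quantile(0.25)
q3 = df["exercise"].quantile(0.75)
iqr = q3 - q1

# Conventional 1.5 * IQR fences (assumed; the multiplier is not given in the post).
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag rows outside the fences, count them, and keep the rest.
is_outlier = (df["exercise"] < lower) | (df["exercise"] > upper)
n_outliers = int(is_outlier.sum())
df_clean = df[~is_outlier]

print("outliers:", n_outliers, "rows kept:", len(df_clean))
```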

To ensure data integrity, I decided to remove non-numeric characters from the subject column.
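One way to do this in pandas is a regular-expression replace. The subject labels below are hypothetical, since the post does not show what the real values look like.

```python
import pandas as pd

# Hypothetical subject labels; the real format is not shown in the post.
df = pd.DataFrame({"subject": ["S001", "sub-12", "7"]})

# Drop every character that is not a digit, then convert to integers.
df["subject"] = (
    df["subject"].astype(str).str.replace(r"\D", "", regex=True).astype(int)
)

print(df["subject"].tolist())
```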

Exploratory Data Analysis

I performed both univariate and bivariate analysis.

Data Preprocessing

One Hot Encoding

I one-hot encoded the control and patient entries in the group column.
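A minimal sketch of this step with `pandas.get_dummies` is shown below. Treating 1 as "patient" matches the class labels used in the evaluation reports later in this post, but the exact call used in the notebook is an assumption.

```python
import pandas as pd

df = pd.DataFrame({"group": ["control", "patient", "patient", "control"]})

# One indicator column per category: group_control and group_patient.
one_hot = pd.get_dummies(df["group"], prefix="group", dtype=int)

# For a binary target a single 0/1 column is enough: 0 = control, 1 = patient.
y = one_hot["group_patient"]

print(one_hot)
```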

Standard Scaling
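Standard scaling rescales each feature to zero mean and unit variance. The sketch below uses scikit-learn's `StandardScaler` on made-up numbers; which columns were scaled in the notebook is not stated here, so the two features are only an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features on very different scales.
X = np.array([[25.0, 1.0], [40.0, 3.0], [55.0, 5.0], [70.0, 7.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # close to 0 for every column
print(X_scaled.std(axis=0))   # close to 1 for every column
```

When the data is split into training and test sets, the scaler is usually fitted on the training portion only and then applied to the test portion with `scaler.transform`.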

Handling Class Imbalance using SMOTE
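SMOTE balances the classes by creating synthetic minority samples: it picks a minority sample, picks one of its nearest minority neighbors, and places a new point somewhere on the line between them. The sketch below is a simplified NumPy illustration of that idea, not the imbalanced-learn implementation that a notebook would normally call; the function name `smote_like_oversample` and all data are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a random minority
    sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))
        neighbor = rng.choice(neighbors[base])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = X_min[base] + gap * (X_min[neighbor] - X_min[base])
    return synthetic

# Hypothetical imbalanced data: 6 majority rows, 4 minority rows.
X_majority = np.array([[5.0, 5.0], [6.0, 5.5], [5.5, 6.0],
                       [6.5, 6.5], [7.0, 6.0], [6.0, 7.0]])
X_minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [2.5, 2.5]])

# Generate just enough synthetic rows to match the majority class size.
n_needed = len(X_majority) - len(X_minority)
X_synthetic = smote_like_oversample(X_minority, n_needed)
X_minority_balanced = np.vstack([X_minority, X_synthetic])

print(len(X_majority), len(X_minority_balanced))
```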

Data Splitting

I split the data into training and testing sets: 80% of the data for training and 20% for testing.
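An 80/20 split can be sketched with scikit-learn's `train_test_split`. The data is made up, and `stratify` and `random_state` are assumptions added here to keep the class proportions and make the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 2 features, binary labels.
X = np.arange(200, dtype=float).reshape(100, 2)
y = np.array([0] * 40 + [1] * 60)

# test_size=0.2 gives the 80/20 split; stratify and random_state are assumed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
```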

Modeling
Baseline Model: Logistic Regression Model

A baseline model provides a reference point for evaluating the performance of more complex models. It helps to set realistic expectations, identify potential issues, and ensure that advanced techniques offer meaningful improvements.
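A baseline of this kind might look like the sketch below. It uses synthetic, well-separated data and default `LogisticRegression` settings, both of which are assumptions; the notebook's actual features and parameters are not shown in this post.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical two-class data with clearly separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the baseline and record its test accuracy as the number to beat.
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

print(f"baseline accuracy: {baseline_accuracy:.2f}")
```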

Other Models:
Random Forest Classifier
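The three models compared in this post (Random Forest, Decision Tree, and K-Nearest Neighbors) can be trained and scored in one loop. This is a rough sketch on synthetic data with default hyperparameters, which are assumptions rather than the notebook's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical two-class data with clearly separated clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(4.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(random_state=42),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "K-Nearest Neighbors": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy on the test set
    print(f"{name}: {scores[name]:.2f}")
```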

Random Forest Classifier Evaluation:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        66
           1       1.00      1.00      1.00       123

    accuracy                           1.00       189
   macro avg       1.00      1.00      1.00       189
weighted avg       1.00      1.00      1.00       189

Precision:
Precision measures how many of the predicted positives were actually correct. Class 0 (control): 100% of the predictions for control were correct. Class 1 (patient): 100% of the predictions for patient were correct.

Recall:
Recall tells us how many of the actual positives were correctly identified. Class 0 (control): 100% of actual control cases were identified. Class 1 (patient): 100% of actual patient cases were identified.

F1-Score:
F1-Score is the harmonic mean of precision and recall, providing a single score. Class 0 (control): 100%. Class 1 (patient): 100%.

Support:
Support is the number of actual cases of each class in the test set. Class 0 (control): 66 instances. Class 1 (patient): 123 instances.

Accuracy: Accuracy tells us the overall proportion of correct predictions. Accuracy: 100%. The model got all 189 test predictions correct.
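Reports in this layout are what scikit-learn's `classification_report` prints. The sketch below runs it on a small set of made-up labels (0 = control, 1 = patient) so that the metrics are not all 100% and the definitions are easier to follow.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: one control is wrongly predicted as patient.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 1, 1]

# Text report with precision, recall, f1-score, and support per class.
print(classification_report(y_true, y_pred))

# The same numbers as a dictionary, convenient for programmatic checks.
report = classification_report(y_true, y_pred, output_dict=True)
```

Here class 0 has perfect precision (both control predictions are right) but a recall of 2/3, because one of the three actual controls was missed.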

Decision Tree Model

Decision Tree Classifier Evaluation:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        66
           1       1.00      1.00      1.00       123

    accuracy                           1.00       189
   macro avg       1.00      1.00      1.00       189
weighted avg       1.00      1.00      1.00       189

Decision Tree Model Accuracy is 100%

Class 0 (Control Group):

Precision (100%):

When the model says “This is Class 0,” it’s right 100% of the time.

Recall (100%):

Out of all the actual Class 0 samples, it correctly identified 100% of them.

F1-Score (100%):

This is a balance between precision and recall. It combines how accurate and how thorough the model is for Class 0.

Class 1 (Patient Class):

Precision (100%):

When the model says “This is Class 1,” it’s right 100% of the time.

Recall (100%):

Out of all the actual Class 1 samples, it correctly identified 100% of them.

F1-Score (100%):

This combines precision and recall for Class 1 to give a single measure of how well it’s doing.

Class 0 (Control Group):

The model finds every Class 0 sample (100% recall).

Class 1 (Patient Group):

Every Class 1 prediction the model makes is correct (100% precision).

K-Nearest Neighbors Model

K-Nearest Neighbors Evaluation:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98        66
           1       0.98      1.00      0.99       123

    accuracy                           0.99       189
   macro avg       0.99      0.98      0.99       189
weighted avg       0.99      0.99      0.99       189

K-Nearest Neighbors Model Accuracy is 99%
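The figures in the report above can be reproduced from confusion-matrix counts. The counts below are inferred from the rounded values (64 of 66 controls correct is the only count that rounds to a recall of 0.97); they are an assumption, not output taken from the notebook.

```python
# Counts inferred from the rounded report (assumed, not notebook output).
tp_control, fn_control = 64, 2    # actual controls: predicted control / patient
tp_patient, fn_patient = 123, 0   # actual patients: predicted patient / control

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# A control misclassified as patient is a false positive for the patient
# class, and a patient misclassified as control is one for the control class.
fp_control = fn_patient
fp_patient = fn_control

p0, r0 = precision(tp_control, fp_control), recall(tp_control, fn_control)
p1, r1 = precision(tp_patient, fp_patient), recall(tp_patient, fn_patient)
accuracy = (tp_control + tp_patient) / (66 + 123)

print(f"class 0: precision={p0:.2f} recall={r0:.2f} f1={f1(p0, r0):.2f}")
print(f"class 1: precision={p1:.2f} recall={r1:.2f} f1={f1(p1, r1):.2f}")
print(f"accuracy={accuracy:.2f}")
```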

Class 0 (Control Group):

Precision (100%):

When the model says “This is Class 0,” it’s right 100% of the time.

Recall (97%):

Out of all the actual Class 0 samples, it correctly identified 97% of them.

F1-Score (98%):

This is a balance between precision and recall. It combines how accurate and how thorough the model is for Class 0.

Class 1 (Patient Class):

Precision (98%):

When the model says “This is Class 1,” it’s right 98% of the time.

Recall (100%):

Out of all the actual Class 1 samples, it correctly identified 100% of them.

F1-Score (99%):

This combines precision and recall for Class 1 to give a single measure of how well it’s doing.

Class 0 (Control Group):

The model finds most Class 0 samples (97% recall), and every Class 0 prediction it makes is correct (100% precision).

Class 1 (Patient Group):

The model finds every Class 1 sample (100% recall) and is right 98% of the time when it predicts Class 1.

Conclusion

The best model for predicting the group column is the Decision Tree Model.

It has an accuracy of 100% and predicts both the control group and the patient group correctly for every sample in the test set.

DEV Community: Cecilia Ngunjiri
