Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new output values.
If you’re looking to read more about machine learning, check out this article I wrote for FreeCodeCamp: https://www.freecodecamp.org/news/what-is-machine-learning-for-beginners/
In this project, I developed a machine learning model that predicts the diabetic status of a patient, using two classification algorithms: Support Vector Machine and Logistic Regression.
I decided to use both algorithms so I could compare the performance of both on the dataset.
I chose SVM in particular for this project because it handles high-dimensional data well, which helps it pick up complex patterns in a dataset and often leads to accurate predictions.
Support Vector Machine (SVM) is quite a powerful machine learning model that operates by finding an optimal hyperplane to separate data into distinct classes. My interest in SVM stems from its core principle: maximizing the margin between the hyperplane and the nearest data points of each class, which makes the classification more robust.
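To make the idea of a margin a little more concrete, here is a minimal sketch in plain Python. The weight vector `w`, the bias `b`, and the four points are made-up values chosen for illustration; they are not taken from the diabetes dataset, and this is not how scikit-learn works internally. For a linear hyperplane w.x + b = 0, the signed distance of a point x is (w.x + b) / ||w||, and the width of the band between w.x + b = -1 and w.x + b = +1 is 2 / ||w||.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_width(w):
    """Width of the band between w.x + b = -1 and w.x + b = +1."""
    return 2 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane and points, for illustration only.
w, b = (3.0, 4.0), -5.0
points = [(3.0, 4.0), (0.0, 0.0), (2.0, 1.0), (1.0, 0.0)]

for p in points:
    d = signed_distance(w, b, p)
    side = 1 if d >= 0 else 0  # which side of the hyperplane the point falls on
    print(p, round(d, 2), side)

print("margin width:", margin_width(w))
```

A smaller ||w|| gives a wider margin, which is why training a linear SVM amounts to keeping ||w|| small while still placing the points on the correct side.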
Data Description:
The dataset used for this project is a diabetes-focused dataset with columns such as age, glucose level, blood pressure, insulin level, and BMI, which are used to determine whether a person is diabetic or not.
Steps:
1. Importing the necessary libraries.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn import svm
from sklearn.metrics import accuracy_score
2. Loading in the dataset:
The CSV file was loaded using the code below:
Diabetes_dataset = pd.read_csv("diabetes.csv")
A peep into what the dataset looks like:
Diabetes_dataset.head()
Checking the number of rows and columns present in the dataset.
Diabetes_dataset.shape
Statistical description of the dataset:
Diabetes_dataset.describe()
Value counts of diabetic and non-diabetic records in the dataset:
Diabetes_dataset['Outcome'].value_counts()
3. Extracting dependent and independent variables
X = Diabetes_dataset.drop(columns='Outcome')
Y = Diabetes_dataset['Outcome']
4. Standardizing the “X” values due to the high variation in range of numbers present in the different columns.
scaler = StandardScaler()
scaler.fit(X)
standardized_data = scaler.transform(X)
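print(standardized_data)
The data has now been standardized, so each column has a mean of 0 and a standard deviation of 1. The values are not limited to the range -1 to +1. The standardized values replace X so that the models are trained on them:
X = standardized_data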
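As a rough illustration of what StandardScaler does to a single column, the sketch below standardizes a short list by hand using only the standard library. The glucose readings are made up for the example, and `standardize` is a hypothetical helper, not part of scikit-learn. It uses the population standard deviation, which is what StandardScaler uses as well.

```python
import statistics

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1 (z-scores)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]

# Made-up glucose readings, for illustration only.
glucose = [85, 90, 100, 110, 115, 200]
z = standardize(glucose)

print([round(v, 2) for v in z])
print("mean:", round(statistics.fmean(z), 6))
print("std:", round(statistics.pstdev(z), 6))
print("largest value:", round(max(z), 2))  # an outlier can land well above 1
```

The last line is the point of the example: standardizing centers and rescales a column, but it does not clip it to a fixed interval.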
5. Splitting the dataset into training and test sets.
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size = 0.2, stratify=Y, random_state=2)
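The `stratify=Y` argument keeps the share of diabetic and non-diabetic records roughly the same in the training and test sets. The sketch below shows the idea with a hand-rolled split in plain Python. The function `stratified_split` and the 65/35 label counts are assumptions made for illustration; this is not scikit-learn's implementation, only the same principle.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx) with class proportions kept in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Made-up labels: 65 non-diabetic (0) and 35 diabetic (1) records.
labels = [0] * 65 + [1] * 35
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in train_idx))
print(Counter(labels[i] for i in test_idx))
```

Both parts end up with the same 65/35 class ratio as the full list, which matters when one class is noticeably smaller than the other.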
6. Training and fitting the model using Logistic Regression.
model = LogisticRegression()
model.fit(X_train, Y_train)
7. Checking the accuracy score of the model using the train and test data.
Accuracy score using the train data:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Accuracy score on Training data : ', training_data_accuracy)
Accuracy score using the test data:
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy score on Test Data : ', test_data_accuracy)
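For intuition, the accuracy score is simply the fraction of predictions that match the true labels. Here is a minimal plain-Python sketch; the `accuracy` helper and the two label lists are made up for illustration and are not the project's actual results.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels and predictions, for illustration only.
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]

print(accuracy(y_true, y_pred))  # 6 of the 8 predictions match
```

Because accuracy only counts matches, swapping the two arguments gives the same number, although scikit-learn's documented order for `accuracy_score` is true labels first, then predictions. Comparing the training and test scores is a quick check for overfitting: a training score far above the test score suggests the model has memorized the training data.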
8. Training and fitting the model using Support Vector Machine.
classifier = svm.SVC(kernel='linear')
classifier.fit(X_train, Y_train)
9. Checking the accuracy score of the SVM model using the train and test data.
Accuracy score using the train data:
X_train_prediction = classifier.predict(X_train)
training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print('Accuracy score on the training data : ', training_data_accuracy)
Accuracy score using the test data:
X_test_prediction = classifier.predict(X_test)
test_data_accuracy = accuracy_score(X_test_prediction, Y_test)
print('Accuracy score on the test data : ', test_data_accuracy)
From the accuracy scores of both models, we can see that the Support Vector Machine performed slightly better than the Logistic Regression model.
Testing the model: predicting a random individual's diabetic status using the model.
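# Step 1: the individual's values, in the same column order as the dataset
individuals_data = (2,141,84,26,175,34,0.42,36)
# Step 2: convert the tuple to a NumPy array
individuals_data_as_numpy_array = np.asarray(individuals_data)
# Step 3: reshape to a single row, since we are predicting for one record
individuals_data_reshaped = individuals_data_as_numpy_array.reshape(1,-1)
# Step 4: standardize the input with the scaler fitted earlier
std_data = scaler.transform(individuals_data_reshaped)
print(std_data)
# Step 5: predict with the trained SVM classifier
prediction = classifier.predict(std_data)
print(prediction)
if prediction[0] == 0:
    print('The person is not diabetic')
else:
    print('The person is diabetic')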
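The five steps can also be wrapped in a small reusable function. This is only a sketch and assumes NumPy and scikit-learn are installed. The name `predict_diabetic_status` is hypothetical, and the scaler and classifier here are fitted on synthetic random data so that the block runs on its own, which means the printed prediction is meaningless. With the real project you would pass in the `scaler` and `classifier` fitted earlier.

```python
import numpy as np
from sklearn import svm
from sklearn.preprocessing import StandardScaler

def predict_diabetic_status(features, scaler, classifier):
    """Scale one record with the fitted scaler and return a readable label."""
    row = np.asarray(features, dtype=float).reshape(1, -1)
    prediction = classifier.predict(scaler.transform(row))
    if prediction[0] == 0:
        return "The person is not diabetic"
    return "The person is diabetic"

# Synthetic stand-in for the dataset: 100 rows, 8 features.
# Labels are derived from the sign of one column, purely for the demo.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 8))
y_demo = (X_demo[:, 1] > 0).astype(int)

demo_scaler = StandardScaler().fit(X_demo)
demo_classifier = svm.SVC(kernel="linear").fit(demo_scaler.transform(X_demo), y_demo)

result = predict_diabetic_status(
    (2, 141, 84, 26, 175, 34, 0.42, 36), demo_scaler, demo_classifier
)
print(result)
```

Keeping the scaling step inside the function makes it harder to forget, since a model trained on standardized values should only ever see standardized input.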
For the entire code of this project, check the notebook on my GitHub:
https://github.com/heyfunmi/Diabetes-Prediction-using-SVM/blob/main/Diabetes_Prediction.ipynb
Thank you for reading, Ciao!!