Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.
Machine learning algorithms use historical data as input to predict new output values.
If you’d like to read more about machine learning, check out this article I wrote for freeCodeCamp: https://www.freecodecamp.org/news/what-is-machine-learning-for-beginners/
In this project, I developed a machine learning model that predicts whether an individual will pay back a loan. I used two classification algorithms: Decision Tree and Random Forest.
I decided to use both algorithms so I could compare the performance of both on the dataset.
Random Forest is often preferred over a single Decision Tree, particularly on high-dimensional data. It is an ensemble method: many decision trees are trained on different random samples of the rows and features, and their predictions are combined, which usually improves predictive accuracy.
Using Random Forest in this project is not just a personal preference. Combining many trees tends to reduce the overfitting a single tree is prone to and makes classification more robust on diverse, real-world datasets.
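To make the ensemble idea concrete, here is a minimal sketch that fits a small forest and scores each of its trees individually. It uses synthetic data from make_classification as a stand-in (not the loan dataset), and the names X_demo, y_demo, forest, and tree_scores are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the loan dataset): 1,000 rows, 20 features.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# A forest is a collection of decision trees, each fitted on a
# bootstrap sample of the rows with random feature subsets per split.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_tr, y_tr)

# The fitted trees live in estimators_; score each one on its own.
tree_scores = [t.score(X_te, y_te) for t in forest.estimators_]
forest_score = forest.score(X_te, y_te)

print("Mean single-tree accuracy: {:.3f}".format(np.mean(tree_scores)))
print("Forest accuracy:           {:.3f}".format(forest_score))
```

On most datasets the combined forest scores higher than its average tree, but the size of the gap depends on the data.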
Data Description
The dataset is lending data available online. It shows the profiles of people who applied for loans and whether or not they paid them back.
Here is what the columns of the dataset represent:
- credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
- purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
- int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
- installment: The monthly installments owed by the borrower if the loan is funded.
- log.annual.inc: The natural log of the self-reported annual income of the borrower.
- dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
- fico: The FICO credit score of the borrower.
- days.with.cr.line: The number of days the borrower has had a credit line.
- revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
- revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
- inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
- delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
- pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
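Two of the columns above are stored in a transformed form, so it can help to convert them back when reading the data. The sketch below uses two made-up rows (not taken from the dataset); the column names int_rate_pct and annual_inc are hypothetical helpers added for illustration.

```python
import numpy as np
import pandas as pd

# Two made-up rows for illustration only (not real dataset values).
sample = pd.DataFrame({
    "int.rate": [0.11, 0.145],
    "log.annual.inc": [np.log(50000), np.log(85000)],
})

# int.rate is a proportion, so multiply by 100 to read it as a percentage.
sample["int_rate_pct"] = sample["int.rate"] * 100

# log.annual.inc is a natural log, so np.exp recovers the income itself.
sample["annual_inc"] = np.exp(sample["log.annual.inc"])

print(sample.round(2))
```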
Steps:
1. Importing the necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
2. Loading the dataset
loan_dataset = pd.read_csv("loan-data.csv")
A peek at what the dataset looks like
loan_dataset.head()
Checking the number of rows and columns present in the dataset
loan_dataset.shape
3. Data Cleaning
It is essential to clean and preprocess any dataset before building a model.
Data cleaning involves removing duplicates, null values, outliers, and other errors found in the dataset.
Checking for missing values
loan_dataset.isnull().sum()
The dataset has no missing values.
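Missing values are only one of the checks mentioned above. As a rough sketch of the other two, here is how duplicates and outliers could be checked with pandas. The toy frame is made up for illustration, and the 1.5 × IQR rule is one common convention for flagging outliers, not necessarily the right one for every column.

```python
import pandas as pd

# Made-up rows for illustration: rows 0 and 1 are identical,
# and the last fico value is far from the rest.
toy = pd.DataFrame({
    "fico": [700, 700, 720, 680, 710, 300],
    "dti": [10.5, 10.5, 14.2, 8.0, 12.1, 11.0],
})

# Count exact duplicate rows, then drop them.
n_duplicates = toy.duplicated().sum()
toy = toy.drop_duplicates()

# Flag outliers in fico with the 1.5 * IQR rule.
q1 = toy["fico"].quantile(0.25)
q3 = toy["fico"].quantile(0.75)
iqr = q3 - q1
outliers = toy[(toy["fico"] < q1 - 1.5 * iqr) | (toy["fico"] > q3 + 1.5 * iqr)]

print("Duplicate rows:", n_duplicates)
print(outliers)
```

Whether a flagged row should be dropped is a judgment call. An unusual value can be a data error or a genuine but rare borrower profile.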
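4. Encoding the categorical column
Machine learning models need numerical input, so categorical data has to be converted into numerical form.
The "purpose" column is categorical and needed to be converted. I used pd.get_dummies, which performs one-hot encoding: it creates one indicator column per category, and drop_first=True drops the first category to avoid a redundant column.
cat_feats = ['purpose']
loan = pd.get_dummies(loan_dataset, columns=cat_feats, drop_first=True)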
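To see what pd.get_dummies does with drop_first=True, here is a small sketch on a made-up three-row frame (the values are illustrative, not from the dataset). Each category becomes its own indicator column, which is usually called one-hot encoding, and the alphabetically first category is dropped.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "purpose": ["credit_card", "all_other", "educational"],
    "fico": [700, 680, 720],
})

# One indicator column per category; the first category
# ("all_other", alphabetically) is dropped by drop_first=True.
encoded = pd.get_dummies(toy, columns=["purpose"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

The columns come out as fico, purpose_credit_card, and purpose_educational. A row whose purpose is "all_other" is represented by zeros (or False) in both indicator columns.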
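5. Extracting the dependent and independent variables and splitting the data
from sklearn.model_selection import train_test_split
# Features: every column except the target
X = loan.drop('not.fully.paid', axis=1)
# Target: the not.fully.paid column
y = loan['not.fully.paid']
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=101)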
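The sketch below shows what an 80/20 split produces, using made-up data instead of the loan dataset. It also passes stratify, an optional train_test_split argument that the split above does not use, to keep the class proportions the same in both parts. The 16% positive rate is an arbitrary choice for the demo.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 1,000 rows, 3 features, 16% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 160 + [0] * 840)

# stratify=y_demo is an optional extra (an assumption of this sketch):
# it preserves the share of each class in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=101, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print("Positive rate in train:", y_tr.mean())
print("Positive rate in test: ", y_te.mean())
```

Stratifying matters most when one class is much rarer than the other, because a plain random split can leave the test set with noticeably fewer or more rare cases than the training set.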
6. Fitting the Decision Tree Model
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)
7. Checking the accuracy of the Decision Tree model using the test data
from sklearn.metrics import accuracy_score
y_pred = tree.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score: {:.2f}%".format(accuracy * 100))
The Decision Tree model gave an accuracy score of 73.38%.
Not bad!
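An accuracy score is easier to judge next to a baseline. As a hedged sketch on synthetic data (not the loan dataset, and with an arbitrary 84/16 class mix), the code below compares a decision tree against a DummyClassifier that always predicts the most frequent class, and prints a confusion matrix to show where the errors fall.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (NOT the loan dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=101
)

# Baseline: always predict the majority class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
tree_acc = accuracy_score(y_te, tree_demo.predict(X_te))

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, tree_demo.predict(X_te))

print("Baseline accuracy: {:.2f}%".format(baseline_acc * 100))
print("Tree accuracy:     {:.2f}%".format(tree_acc * 100))
print(cm)
```

If the classes are imbalanced, the baseline alone can reach a high accuracy, so the confusion matrix is a better guide to how well the minority class is predicted.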
8. Fitting the Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
9. Checking the accuracy of the Random Forest model using the test data
y_pred = rfc.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy Score: {:.2f}%".format(accuracy * 100))
As expected, the Random Forest model outperformed the Decision Tree model, with an accuracy score of 84.86%.
These results suggest that Random Forest is more effective than a single Decision Tree for this particular problem, and they illustrate how ensemble techniques can improve model performance and generalization to unseen data. Since neither model was given a fixed random_state, the exact scores will vary slightly from run to run.
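A single train/test split gives one number per model. One way to make the comparison sturdier is k-fold cross-validation, sketched below with scikit-learn's cross_val_score on synthetic data (not the loan dataset). The choice of 5 folds and the dictionary names are assumptions of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the loan dataset).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=15, random_state=0
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score each model on 5 different train/test folds and
# keep the mean and spread of the accuracy.
cv_results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print("{}: {:.2f}% (+/- {:.2f})".format(
        name, scores.mean() * 100, scores.std() * 100
    ))
```

If one model wins on every fold, the difference is less likely to be an accident of how the rows happened to be split.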
That’s it for this project!
For the entire code, check my GitHub repository: https://github.com/heyfunmi/Loan-Repayment-Prediction-using-Decision-Tree-and-Random-Forest./blob/main/Loan_prediction..ipynb
Thank you for reading!