<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Oluwafunmilola Obisesan </title>
    <description>The latest articles on DEV Community by Oluwafunmilola Obisesan  (@heyfunmi).</description>
    <link>https://dev.to/heyfunmi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F875554%2F491fb878-c4a1-4696-8fbb-995730686f36.jpeg</url>
      <title>DEV Community: Oluwafunmilola Obisesan </title>
      <link>https://dev.to/heyfunmi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/heyfunmi"/>
    <language>en</language>
    <item>
      <title>Loan Repayment Prediction using Machine Learning.</title>
      <dc:creator>Oluwafunmilola Obisesan </dc:creator>
      <pubDate>Mon, 30 Oct 2023 14:46:23 +0000</pubDate>
      <link>https://dev.to/heyfunmi/loan-repayment-prediction-using-machine-learning-308i</link>
      <guid>https://dev.to/heyfunmi/loan-repayment-prediction-using-machine-learning-308i</guid>
      <description>&lt;p&gt;Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.&lt;br&gt;
Machine learning algorithms use historical data as input to predict new output values.&lt;br&gt;
If you’re looking to read more about machine learning, check out this article I wrote for FreeCodeCamp: &lt;a href="https://www.freecodecamp.org/news/what-is-machine-learning-for-beginners/" rel="noopener noreferrer"&gt;https://www.freecodecamp.org/news/what-is-machine-learning-for-beginners/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this project, I worked on developing a machine learning model that predicts whether an individual will pay back a loan. This was done using two classification algorithms: Decision Tree and Random Forest.&lt;/p&gt;

&lt;p&gt;I used both algorithms so I could compare their performance on the same dataset.&lt;/p&gt;

&lt;p&gt;Random Forest is generally preferred over a single Decision Tree, particularly in high-dimensional data scenarios. It harnesses ensemble learning: many decision trees, each trained on a bootstrap sample of the data, vote on the prediction, which helps tackle complex patterns and improves predictive accuracy.&lt;/p&gt;

&lt;p&gt;Using Random Forest in this project reflects not just personal preference but a data-driven choice: combining many trees mitigates the overfitting a single deep tree is prone to and makes classification more robust on real-world, diverse datasets.&lt;/p&gt;
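
&lt;p&gt;To make that concrete, here is a minimal sketch (on synthetic data, not the loan dataset used below; all names here are illustrative) of how an ensemble of trees typically compares to a single tree under cross-validation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only: synthetic data, not the loan dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=20, random_state=0)

# One tree vs. 100 bagged trees, each scored with 5-fold cross-validation.
tree_score = cross_val_score(DecisionTreeClassifier(random_state=0), X_demo, y_demo, cv=5).mean()
forest_score = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0), X_demo, y_demo, cv=5).mean()
print(tree_score, forest_score)  # the forest usually scores noticeably higher
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;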

&lt;p&gt;&lt;strong&gt;Data Description&lt;/strong&gt;&lt;br&gt;
The dataset is lending data, available online, that shows the profiles of people who applied for a loan and whether or not they paid it back.&lt;/p&gt;

&lt;p&gt;Here is what the columns of the dataset represent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.&lt;/li&gt;
&lt;li&gt;purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").&lt;/li&gt;
&lt;li&gt;int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.&lt;/li&gt;
&lt;li&gt;installment: The monthly installments owed by the borrower if the loan is funded.&lt;/li&gt;
&lt;li&gt;log.annual.inc: The natural log of the self-reported annual income of the borrower.&lt;/li&gt;
&lt;li&gt;dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).&lt;/li&gt;
&lt;li&gt;fico: The FICO credit score of the borrower.&lt;/li&gt;
&lt;li&gt;days.with.cr.line: The number of days the borrower has had a credit line.&lt;/li&gt;
&lt;li&gt;revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).&lt;/li&gt;
&lt;li&gt;revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).&lt;/li&gt;
&lt;li&gt;inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.&lt;/li&gt;
&lt;li&gt;delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.&lt;/li&gt;
&lt;li&gt;pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).&lt;/li&gt;
&lt;/ol&gt;
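
&lt;p&gt;Once the dataset is loaded in step 2 below, these columns can be verified quickly. A short sketch, assuming the same loan_dataset name used later:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: quick structural checks (run after loading the CSV in step 2).
loan_dataset.info()                        # dtypes and non-null counts per column
print(loan_dataset['purpose'].unique())    # the categorical values listed above
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;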

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. &lt;strong&gt;Importing the necessary libraries&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab7xrt42eytl8ujr7mjz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fab7xrt42eytl8ujr7mjz.jpeg" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;Loading in the dataset:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loan_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;loan-data.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmxn4ucif4ki66d10kud.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgmxn4ucif4ki66d10kud.jpeg" alt=" " width="800" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A peep into what the dataset looks like&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loan_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxhl62vpip89kkth57i1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxhl62vpip89kkth57i1.jpeg" alt=" " width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Checking the number of rows and columns present in the dataset&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loan_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwb17ms07bu95vpr4b9oy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwb17ms07bu95vpr4b9oy.jpeg" alt=" " width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;Data Cleaning&lt;/strong&gt;&lt;br&gt;
It is essential to carry out data cleaning/pre-processing on any given dataset before proceeding to model building.&lt;br&gt;
Data cleaning involves removing duplicates, null values, outliers and the many other errors that can be found in a dataset.&lt;/p&gt;
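
&lt;p&gt;The missing-value check is shown next; for completeness, here is a hedged sketch of the duplicate and outlier checks just mentioned, assuming the loan_dataset dataframe loaded above, with int.rate as just one example column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: duplicate and simple IQR outlier checks on the loaded dataframe.
print(loan_dataset.duplicated().sum())        # count of fully duplicated rows

q1 = loan_dataset['int.rate'].quantile(0.25)  # interquartile-range rule on one column
q3 = loan_dataset['int.rate'].quantile(0.75)
iqr = q3 - q1
outliers = loan_dataset['int.rate'].lt(q1 - 1.5 * iqr) | loan_dataset['int.rate'].gt(q3 + 1.5 * iqr)
print(outliers.sum())                         # number of potential outliers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;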

&lt;p&gt;&lt;strong&gt;Checking for missing values&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;loan_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh8lxnek7nwfxgircljs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh8lxnek7nwfxgircljs.jpeg" alt=" " width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset has no missing values.&lt;/p&gt;

&lt;p&gt;4. &lt;strong&gt;Encoding the categorical column&lt;/strong&gt;&lt;br&gt;
Encoding converts categorical data into numerical form.&lt;br&gt;
The “purpose” column needed to be converted from a categorical column into numerical columns. Note that pd.get_dummies below performs one-hot (dummy) encoding, creating one indicator column per category, rather than label encoding.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;cat_feats&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;purpose&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;loan&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loan_dataset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cat_feats&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;drop_first&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faih2p0w3p7fbtyinnzsj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faih2p0w3p7fbtyinnzsj.jpeg" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5. &lt;strong&gt;Extracting the dependent and independent variables and splitting the data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loan&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;not.fully.paid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loan&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;not.fully.paid&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;101&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3r05p1ewp8quj6e150j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw3r05p1ewp8quj6e150j.jpeg" alt=" " width="800" height="152"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;6. &lt;strong&gt;Fitting the Decision Tree Model&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.tree&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DecisionTreeClassifier&lt;/span&gt;
&lt;span class="n"&gt;tree&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;DecisionTreeClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0qp1wvgwx4aeceoprki.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl0qp1wvgwx4aeceoprki.jpeg" alt=" " width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;7. &lt;strong&gt;Checking the accuracy of the Decision Tree model using the test data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;
&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tree&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy Score {:.2f}%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3muv3vblncy47yamj73x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3muv3vblncy47yamj73x.jpeg" alt=" " width="800" height="292"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Decision Tree model gave an accuracy score of 73.38%&lt;br&gt;
Not bad!&lt;/strong&gt;&lt;/p&gt;
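
&lt;p&gt;Since defaults are the minority class in this dataset, accuracy alone can flatter a model. A per-class breakdown (a sketch reusing the y_test and y_pred from above) gives a fuller picture:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: confusion matrix and per-class precision/recall for the Decision Tree.
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;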

&lt;p&gt;8. &lt;strong&gt;Fitting the Random Forest&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="n"&gt;rfc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rfc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;9. &lt;strong&gt;Checking the accuracy of the Random Forest Model using the test data&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rfc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_pred&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Accuracy Score: {:.2f}%&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;accuracy&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3ni5ais8dq874fcwxsx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3ni5ais8dq874fcwxsx.jpeg" alt=" " width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;As expected, the Random Forest model outperformed the Decision Tree model with an accuracy score of 84.86%.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These results demonstrate the effectiveness of Random Forest compared to a single Decision Tree for this particular problem, highlighting the valuable role of ensemble techniques in enhancing model performance and improving generalization to unseen data.&lt;/p&gt;
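
&lt;p&gt;A single train/test split can be noisy, so one way to double-check this comparison (a sketch reusing the X, y, and classifier imports from above) is cross-validation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: 5-fold cross-validated accuracy for both models on the full data.
from sklearn.model_selection import cross_val_score

print(cross_val_score(DecisionTreeClassifier(), X, y, cv=5).mean())
print(cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5).mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;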

&lt;p&gt;That’s it for this project!&lt;/p&gt;

&lt;p&gt;For the entire code, check my GitHub profile: &lt;a href="https://github.com/heyfunmi/Loan-Repayment-Prediction-using-Decision-Tree-and-Random-Forest./blob/main/Loan_prediction..ipynb" rel="noopener noreferrer"&gt;https://github.com/heyfunmi/Loan-Repayment-Prediction-using-Decision-Tree-and-Random-Forest./blob/main/Loan_prediction..ipynb&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Diabetes Prediction using Machine Learning.</title>
      <dc:creator>Oluwafunmilola Obisesan </dc:creator>
      <pubDate>Sun, 08 Oct 2023 22:20:34 +0000</pubDate>
      <link>https://dev.to/heyfunmi/diabetes-prediction-using-machine-learning-3kbc</link>
      <guid>https://dev.to/heyfunmi/diabetes-prediction-using-machine-learning-3kbc</guid>
      <description>&lt;p&gt;Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.&lt;br&gt;
Machine learning algorithms use historical data as input to predict new output values.&lt;br&gt;
If you’re looking to read more about machine learning, check out this article I wrote for FreeCodeCamp: &lt;a href="https://www.freecodecamp.org/news/what-is-machine-learning-for-beginners/" rel="noopener noreferrer"&gt;https://www.freecodecamp.org/news/what-is-machine-learning-for-beginners/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this project, I worked on developing a machine learning model that predicts the diabetic status of a patient. This was done using two classification algorithms: Support Vector Machine and Logistic Regression.&lt;/p&gt;

&lt;p&gt;I used both algorithms so I could compare their performance on the same dataset.&lt;/p&gt;

&lt;p&gt;I chose SVM in particular for this project because it excels at handling high-dimensional data, making it adept at identifying complex patterns in datasets and producing accurate predictions.&lt;/p&gt;

&lt;p&gt;Support Vector Machine (SVM) is quite a powerful machine learning model that operates by finding an optimal hyperplane to separate data into distinct classes. My interest in SVM stems from its core principle: maximizing the margin between the hyperplane and the nearest data points ensures robust classification.&lt;/p&gt;
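
&lt;p&gt;To illustrate the margin idea, here is a toy sketch on synthetic blobs (not the diabetes data; every name here is illustrative). For a linear SVM the hyperplane is w·x + b = 0, and the margin the model maximizes has width 2/||w||:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only: a linear SVM on synthetic 2-D blobs.
import numpy as np
from sklearn import svm
from sklearn.datasets import make_blobs

X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=6)
clf = svm.SVC(kernel='linear').fit(X_demo, y_demo)

w = clf.coef_[0]                  # normal vector of the separating hyperplane
margin = 2 / np.linalg.norm(w)    # width of the margin the SVM maximized
print(margin, len(clf.support_vectors_))  # margin width and number of support vectors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;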

&lt;p&gt;&lt;strong&gt;Data Description:&lt;/strong&gt;&lt;br&gt;
The dataset used for this project is a diabetes-focused dataset that contains columns such as age, glucose level, blood pressure, insulin level, BMI, and other data, which were used to determine whether a person is diabetic or not.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Importing the necessary libraries.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.preprocessing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StandardScaler&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;svm&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;accuracy_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxahn6jfq6vpwfr0bgikr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxahn6jfq6vpwfr0bgikr.jpeg" alt=" " width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Loading in the dataset:&lt;/strong&gt;&lt;br&gt;
The csv was loaded using the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Diabetes_dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;diabetes.csv”)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jmcase2lvzsg0oowwjk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0jmcase2lvzsg0oowwjk.jpeg" alt=" " width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A peep into what the dataset looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Diabetes_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowcu36awt0a2b58oapje.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fowcu36awt0a2b58oapje.jpeg" alt=" " width="800" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Checking the number of rows and columns present in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Diabetes_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctaho55bndv9b1tdqujl.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctaho55bndv9b1tdqujl.jpeg" alt=" " width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Statistical description of the dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Diabetes_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6lkxi2rnjq1e44p9ffs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6lkxi2rnjq1e44p9ffs.jpeg" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Value counts of the diabetic and non-diabetic records in the dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Diabetes_dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Outcome&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;value_counts&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqmvd7c86h007ejymnhq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flqmvd7c86h007ejymnhq.jpeg" alt=" " width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Extracting dependent and independent variables&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Diabetes_dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Outcome&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;Y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Diabetes_dataset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Outcome&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjha5wcl1j8qv6ijz8lk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftjha5wcl1j8qv6ijz8lk.jpeg" alt=" " width="800" height="171"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Standardizing the “X” values due to the high variation in the ranges of the different columns.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;scaler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StandardScaler&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;standardized_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;standardized_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqgz3jg4qda2ktsa2b0i.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmqgz3jg4qda2ktsa2b0i.jpeg" alt=" " width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data has now been standardized: each feature has zero mean and unit variance, so most values fall within a few units of 0.&lt;/p&gt;
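
&lt;p&gt;A quick sanity check (a sketch reusing the standardized_data array above) confirms this column by column:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch: verify the standardization column by column.
print(standardized_data.mean(axis=0).round(2))  # close to 0 everywhere
print(standardized_data.std(axis=0).round(2))   # close to 1 everywhere
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;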

&lt;p&gt;&lt;strong&gt;5. Splitting the dataset into test and train.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stratify&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e36iwxdv2paf232c5ki.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8e36iwxdv2paf232c5ki.jpeg" alt=" " width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Training and fitting the model using Logistic Regression.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh14xawyzlqjbfrdo872.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvh14xawyzlqjbfrdo872.jpeg" alt=" " width="800" height="209"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Checking the accuracy score of the model using the train and test data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Accuracy score using the train data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;training_data_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy score on Training data : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;training_data_accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj505tfe12efyzsv8f2r.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyj505tfe12efyzsv8f2r.jpeg" alt=" " width="800" height="229"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Accuracy score using the test data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_test_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;test_data_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy score on Test Data : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data_accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqzdw5x8d3zz4zozl0i1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqzdw5x8d3zz4zozl0i1.jpeg" alt=" " width="800" height="222"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Training and fitting the model using Support Vector Machine.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;svm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SVC&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kernel&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;linear&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8ej8rs5anlap48wg4oo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh8ej8rs5anlap48wg4oo.jpeg" alt=" " width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;9. Checking the accuracy score of the model using the train and test data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Accuracy score using the train data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;training_data_accuracy = accuracy_score(X_train_prediction, Y_train)
print(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy score on the training data : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, training_data_accuracy)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6rekm2t2alutp8a3um9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft6rekm2t2alutp8a3um9.jpeg" alt=" " width="800" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Accuracy score using the test data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_test_prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="n"&gt;test_data_accuracy&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;accuracy_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_prediction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Y_test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Accuracy score on the test data : &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_data_accuracy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1jr5phejv5fbiqv4p1t.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1jr5phejv5fbiqv4p1t.jpeg" alt=" " width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From the accuracy scores obtained from both models, we can see that the Support Vector Machine performed slightly better than the Logistic Regression model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Testing the model: predicting a random individual's diabetic status using the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Step 1
&lt;/span&gt;&lt;span class="n"&gt;individuals_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;141&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;84&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;26&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;175&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;34&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mf"&gt;0.42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;36&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Step  individuals_data_as_numpy_array = np.asarray(individuals_data)
# Step 3
&lt;/span&gt;&lt;span class="n"&gt;individuals_data_reshaped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;individuals_data_as_numpy_array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Step 4
&lt;/span&gt;&lt;span class="n"&gt;std_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scaler&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;individuals_data_reshaped&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;#Step 5
&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;std_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prediction&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The person is not diabetic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;The person is diabetic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the entire code of this project, check the notebook on my GitHub: &lt;br&gt;
&lt;a href="https://github.com/heyfunmi/Diabetes-Prediction-using-SVM/blob/main/Diabetes_Prediction.ipynb" rel="noopener noreferrer"&gt;https://github.com/heyfunmi/Diabetes-Prediction-using-SVM/blob/main/Diabetes_Prediction.ipynb&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading, Ciao!!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>Insurance Cost Prediction using Machine Learning with Python.</title>
      <dc:creator>Oluwafunmilola Obisesan </dc:creator>
      <pubDate>Sun, 29 Jan 2023 12:13:29 +0000</pubDate>
      <link>https://dev.to/heyfunmi/insurance-cost-prediction-using-machine-learning-with-python-2gma</link>
      <guid>https://dev.to/heyfunmi/insurance-cost-prediction-using-machine-learning-with-python-2gma</guid>
      <description>&lt;p&gt;Machine learning (ML) is a subset of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so.&lt;br&gt;
Machine learning algorithms use historical data as input to predict new output values.&lt;/p&gt;

&lt;p&gt;In this project, I worked on developing an end-to-end machine learning model using linear regression.&lt;br&gt;
Data cleaning, extensive data visualization, and exploratory data analysis were also carried out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Description&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;The dataset used for this project is an insurance-focused dataset that contains columns such as age, sex, bmi, region, and others, which were used to determine the cost of each person’s insurance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Importing the necessary libraries:
NumPy, pandas, matplotlib, seaborn and scikit-learn were imported.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;matplotlib&lt;/span&gt; &lt;span class="n"&gt;inline&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qqdaeoepix4ydz0be4r.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0qqdaeoepix4ydz0be4r.jpeg" alt=" " width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Loading in the dataset:
The csv was loaded using the code below:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Insurance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzin9x0c219z0vb0ci9vh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzin9x0c219z0vb0ci9vh.jpeg" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Information about the data:
To get some information about the data, such as the type of data in each column, we use the code below.
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5521yexody7irngvu51f.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5521yexody7irngvu51f.jpeg" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking the statistical description of the data:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujyp7cj7t3w4jxb0brcs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujyp7cj7t3w4jxb0brcs.jpeg" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking for the number of rows and columns present in the dataset:
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwushf4x2w5zne4qw9gj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftwushf4x2w5zne4qw9gj.jpeg" alt=" " width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Cleaning and Preparation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Working with “unclean” data leads to inaccurate results, so it’s necessary to carry out data cleaning before any analysis or prediction is done.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking for null values:&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To check for null values in our dataset, we use the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isnull&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra8lk6ljeu3gbj8re6vb.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra8lk6ljeu3gbj8re6vb.jpeg" alt=" " width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking for duplicates:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;duplicated&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;any&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mmftkv5qg0edqiondr9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8mmftkv5qg0edqiondr9.jpeg" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exploratory Data Analysis&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;Exploratory data analysis helps in understanding the patterns, trends and metrics in a dataset. It also helps in detecting outliers and anomalous events.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using a correlation matrix to check for correlations among the columns in the dataset:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;heatmap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;corr&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr5oqcj83sk8rnz9dkyy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnr5oqcj83sk8rnz9dkyy.jpeg" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4gavr9yoc4xf2crcgw6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4gavr9yoc4xf2crcgw6.jpeg" alt=" " width="800" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The correlation matrix shows there’s little or no correlation between “age” and “charges”.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking for the distribution pattern of the “charges” column
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;charges&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8ia1t58q186h7rnbe2r.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa8ia1t58q186h7rnbe2r.jpeg" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;
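&lt;p&gt;&lt;em&gt;A quick aside: distplot has been deprecated in newer seaborn releases. If you’re following along on a recent seaborn version (an assumption about your environment, not part of the original notebook), a close equivalent is:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Close equivalent of the distplot call above on seaborn 0.11 and later (sketch)
sns.histplot(Insurance['charges'], kde=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;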

&lt;ul&gt;
&lt;li&gt;Plotting a pairplot to examine the pairwise relationships between the columns.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pairplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyxiehf6qgbl2w0gb7au.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpyxiehf6qgbl2w0gb7au.jpeg" alt=" " width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnupewblrjk7cokneipiz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnupewblrjk7cokneipiz.jpeg" alt=" " width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extracting dependent and independent variables:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dependent variable in this case is “charges”, while the independent variables are the other columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Insurance&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;charges&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszkjtaszl8gml1n9rt69.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fszkjtaszl8gml1n9rt69.jpeg" alt=" " width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrt60qncqjc5gawdy7zm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqrt60qncqjc5gawdy7zm.jpeg" alt=" " width="800" height="420"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Splitting the dataset into test and train.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To build a machine learning model, you “train” it on one subset of the data and use another subset to “test” the model you’ve built.&lt;br&gt;
So we split our data into “train” and “test” sets, using 80 percent to train the model and the remaining 20 percent to test it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;X_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y_test&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e98q50if3qoe6vjs0ke.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e98q50if3qoe6vjs0ke.jpeg" alt=" " width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One hot encoding to transform categorical text data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The data contains some columns with text values, such as sex, smoker and region.&lt;br&gt;
Since we can’t build the model with text data, we need to convert these categories into numbers.&lt;br&gt;
Using the sex column as an example, female could be encoded as 0 and male as 1.&lt;br&gt;
We can do this using one-hot encoding, with the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;X_train_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_dummies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sex&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;smoker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;drop_first&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38awjk70jbfxjvp0kwey.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F38awjk70jbfxjvp0kwey.jpeg" alt=" " width="800" height="413"&gt;&lt;/a&gt;&lt;/p&gt;
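&lt;p&gt;&lt;em&gt;One detail worth calling out: the same encoding must be applied to the test set before predicting on it, and its dummy columns must line up with the training columns. The full notebook linked at the end presumably has the original version of this step; a minimal sketch looks like this:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# One-hot encode the test set the same way as the training set (sketch)
X_test_ = pd.get_dummies(X_test, columns=["sex", "smoker", "region"], drop_first=True)

# Align the test columns with the training columns, filling any missing dummies with 0
X_test_ = X_test_.reindex(columns=X_train_.columns, fill_value=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;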

&lt;p&gt;&lt;strong&gt;Building and fitting the model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is the most interesting part of this project. Now that we are done with data cleaning and converting text data to numbers, we can build our model using the lines of code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LinearRegression&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;lm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LinearRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_train_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;y_train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwm1mg0gj8vnp8kfqdt7t.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwm1mg0gj8vnp8kfqdt7t.jpeg" alt=" " width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Predicting the “test” set results.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Remember we trained our model on 80 percent of our data; now that we’ve built the model, we can use it to predict the outcome of the 20 percent we set aside.&lt;br&gt;
Here’s the code and the prediction using our “test” data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X_test_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firvbxnx9mqed8rrrkpcj.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Firvbxnx9mqed8rrrkpcj.jpeg" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Now let’s check the accuracy of our model and see how close it comes to predicting the “test” set results perfectly.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model evaluation&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;To evaluate the accuracy of our model, we’ll use the R2 score.&lt;br&gt;
The R2 score measures the proportion of the variance in the target variable that is explained by the model.&lt;/p&gt;

&lt;p&gt;If the R2 score is 1, the model’s predictions fit the data perfectly; if it’s 0, the model explains none of the variance and will perform poorly on unseen data.&lt;br&gt;
The closer the R2 score is to 1, the better the model is trained.&lt;/p&gt;
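&lt;p&gt;&lt;em&gt;Concretely, R2 = 1 - (sum of squared residuals) / (total sum of squares). A minimal sketch of that formula, assuming numpy is available (it isn’t imported in the snippets above, so the import here is mine):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

# R2 = 1 - SS_res / SS_tot, the standard definition behind sklearn's r2_score
ss_res = np.sum((y_test - predictions) ** 2)      # variance the model fails to explain
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)  # total variance of the target
r2_manual = 1 - ss_res / ss_tot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;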

&lt;p&gt;To check our R2 score, we use the code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.metrics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;r2_score&lt;/span&gt;
&lt;span class="nf"&gt;r2_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;y_test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndkasri52uxj9celoxnt.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fndkasri52uxj9celoxnt.jpeg" alt=" " width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Oops&lt;br&gt;
Not a bad model I must say!&lt;/p&gt;

&lt;p&gt;View the entire code here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/heyfunmi/Insurance_Cost_Prediction_using_Machine_Learning_with_Python" rel="noopener noreferrer"&gt;https://github.com/heyfunmi/Insurance_Cost_Prediction_using_Machine_Learning_with_Python&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;See you in another project!&lt;br&gt;
Cheers!!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Exploratory Data Analysis Using SQL.</title>
      <dc:creator>Oluwafunmilola Obisesan </dc:creator>
      <pubDate>Fri, 05 Aug 2022 21:41:00 +0000</pubDate>
      <link>https://dev.to/heyfunmi/exploratory-data-analysis-using-sql-4jin</link>
      <guid>https://dev.to/heyfunmi/exploratory-data-analysis-using-sql-4jin</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;INTRODUCTION&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Two years ago, when I started learning how to write SQL and work with it, it felt complex and disjointed to me. But now, working with SQL is one of my favorite things to do, and I totally enjoy analyzing datasets with it.&lt;/p&gt;

&lt;p&gt;In this project, I carried out an exploratory analysis of petrol/gas prices around the world. The dataset contains records of the petrol/gas prices of ALL the countries in the world, the daily petrol/gas consumption of each country, and the varying prices per liter and per barrel as of June 2022.&lt;/p&gt;

&lt;p&gt;Using SQL, I was able to run some queries through the dataset and get answers to some questions and also to get insights.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;DATA PREPARATION AND CLEANING.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dataset was first loaded into Microsoft Excel in order to “clean” it: rename the columns, remove outliers and check for consistency.&lt;br&gt;
Here is a picture of the dataset in Microsoft Excel:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5j4hjwuglk94s39o115.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj5j4hjwuglk94s39o115.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A database was then created in MySQL, and the dataset was imported into MySQL Workbench in order to begin the analysis.&lt;br&gt;
Picture of the dataset after being imported into the SQL workbench:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkts78mudvre3zlt0lzuo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkts78mudvre3zlt0lzuo.jpeg" alt=" " width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fext18ng3zifdv3bhya9x.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fext18ng3zifdv3bhya9x.jpeg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;ANALYSIS&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Analysis was done to get answers to some very important questions and to get an understanding of the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PETROL/ GAS CONSUMPTION AROUND THE WORLD.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. To start with, I wanted to know the total number of barrels of petrol/gas consumed by the entire world on a daily basis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fqun73mslfjvfh89u0d.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7fqun73mslfjvfh89u0d.jpeg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The query above shows the world consumes 96,576,722 barrels of petrol/gas on a daily basis, as of June 2022.&lt;/p&gt;

&lt;p&gt;2. Country with the highest petrol/gas consumption daily:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft393ojzhse66hyev92dy.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft393ojzhse66hyev92dy.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The United States of America consumes the most petrol/gas daily:&lt;br&gt;
19,687,287 barrels per day.&lt;/p&gt;

&lt;p&gt;3. Country with the least petrol/gas consumption daily:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju7eav6v8gcv5bt2n4ny.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fju7eav6v8gcv5bt2n4ny.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Niue consumes the least petrol/gas in the world: 51 barrels per day.&lt;br&gt;
Before carrying out this analysis, I didn’t know of the country called Niue; I had to Google it and learned it’s a country of about 1,620 people, located on the continent of Oceania.&lt;/p&gt;

&lt;p&gt;4. Top five countries with the highest daily oil consumption:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv5jb4vkhwoihv1tgi83.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv5jb4vkhwoihv1tgi83.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The USA, China, India, Japan and Russia consume the most petrol/gas daily.&lt;/p&gt;

&lt;p&gt;5. Bottom five countries with the lowest oil consumption around the world:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7t5tp1l60ogswqioyk2.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7t5tp1l60ogswqioyk2.jpeg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Niue, Saint Helena, Montserrat, Kiribati, and Saint Pierre &amp;amp; Miquelon consume the least petrol/gas around the world.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;PETROL/ GAS PRICE ANALYSIS AROUND THE WORLD:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;1. Country with the highest price for a liter of petrol/gas:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66ubjk647ejbh4tfpr1m.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66ubjk647ejbh4tfpr1m.jpeg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;North Korea sells a liter of petrol/gas for $14.50!&lt;br&gt;
Wow! &lt;/p&gt;

&lt;p&gt;2. Country with the least price for a liter of petrol/gas:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwqlomm68ttl1sj96shs.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frwqlomm68ttl1sj96shs.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Venezuela sells the cheapest petrol/gas in the world, with a liter going for a paltry $0.02.&lt;/p&gt;

&lt;p&gt;3. Countries with the highest and lowest prices per gallon:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv67rkgqtidd5s7ll7zu6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv67rkgqtidd5s7ll7zu6.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeclv5cgc046sd0sugfq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpeclv5cgc046sd0sugfq.jpeg" alt=" " width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Consistent with the data above, North Korea and Venezuela sell petrol/gas at the highest and cheapest prices per gallon, respectively.&lt;/p&gt;

&lt;p&gt;4. Top five countries with the most expensive price for a liter of petrol/gas:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2008pgkjbp6ss1i4286g.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2008pgkjbp6ss1i4286g.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;North Korea, Tonga, Niue, Hong Kong, and Norway sell a liter of petrol/gas at the most expensive prices.&lt;/p&gt;

&lt;p&gt;5. Bottom five countries with the cheapest price for petrol/gas:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovr65lt8h9i6ii0w2nll.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fovr65lt8h9i6ii0w2nll.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Venezuela, Libya, Iran, Brunei and Syria sell a liter of petrol/gas at the cheapest prices.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        ******
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;Comments and suggestions, please do send a mail: &lt;a href="mailto:heyfunmi@gmail.com"&gt;heyfunmi@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>analytics</category>
      <category>sql</category>
      <category>programming</category>
    </item>
    <item>
      <title>Data Analysis With Microsoft Excel: Analysis Of The MonkeyPox Disease.</title>
      <dc:creator>Oluwafunmilola Obisesan </dc:creator>
      <pubDate>Wed, 03 Aug 2022 22:19:00 +0000</pubDate>
      <link>https://dev.to/heyfunmi/data-analysis-with-microsoft-excel-analysis-of-the-monkeypox-disease-3918</link>
      <guid>https://dev.to/heyfunmi/data-analysis-with-microsoft-excel-analysis-of-the-monkeypox-disease-3918</guid>
      <description>&lt;p&gt;&lt;strong&gt;INTRODUCTION&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9aqpk3iwxhbftjqnaz1.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9aqpk3iwxhbftjqnaz1.jpeg" alt=" " width="800" height="400"&gt;&lt;/a&gt;&lt;br&gt;
MonkeyPox is a viral zoonosis (a virus transmitted to humans from animals). Animal hosts include rodents and non-human primates. The MonkeyPox virus was first discovered in 1970 in the Democratic Republic of the Congo.&lt;br&gt;
In May 2022, multiple cases of the virus were identified in several countries across the world.&lt;/p&gt;

&lt;p&gt;In this project, I carried out data analysis on the viral disease, using Microsoft Excel.&lt;/p&gt;

&lt;p&gt;Data cleaning, analysis, exploration and visualization were done using just Microsoft Excel.&lt;br&gt;
This analysis shows the trends and patterns of the disease in 2022; it also provides answers to some very important questions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Structure And Preparation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dataset for this project was obtained from kaggle.com.&lt;br&gt;
The dataset contains thirty-six columns and over twenty-five thousand rows. Records in the dataset include: date of confirmation, status, city, country, gender, age, symptoms, and many more.&lt;br&gt;
Here is what the dataset looks like:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m75t93hx1uv8d5wftql.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4m75t93hx1uv8d5wftql.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Cleaning And Preparation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The dataset had no outliers, errors, duplicates, or missing rows and columns, so the data cleaning process was straightforward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1quatohkg8xegjutf5h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx1quatohkg8xegjutf5h.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis And Insights.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffeod1fqr6mcpgh21r6nn.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffeod1fqr6mcpgh21r6nn.jpeg" alt=" " width="800" height="472"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An overview of the dashboard shows the most important information obtained from the dataset, which includes: the total number of confirmed cases, the countries with the most confirmed cases, and the pattern of the spread of the disease in 2022.&lt;/p&gt;

&lt;p&gt;1. &lt;strong&gt;&lt;em&gt;Number of incidents:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum8u7l56vpd2pg6jqido.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fum8u7l56vpd2pg6jqido.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;Suspected, discarded and confirmed cases were recorded in the dataset. Analysis carried out on the dataset shows 23,273 confirmed cases of the MonkeyPox disease were recorded from January 1st, 2022 to August 1st, 2022.&lt;/p&gt;

&lt;p&gt;2. &lt;strong&gt;&lt;em&gt;Countries with Most Confirmed Cases.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3zwe37qnxzgdgmacxn6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3zwe37qnxzgdgmacxn6.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
Much to my surprise, the United States of America and Central European countries have had the most confirmed cases of MonkeyPox in 2022.&lt;/p&gt;

&lt;p&gt;3. &lt;strong&gt;&lt;em&gt;Cases Per Month.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2w9779e4ydouzo09k86.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu2w9779e4ydouzo09k86.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
The disease caught the media’s attention in May 2022 and has garnered a lot of publicity since then.&lt;br&gt;
Analysis done on this dataset shows the gradual trend of the disease from January 2022 through August.&lt;br&gt;
           *****&lt;/p&gt;

&lt;p&gt;Dashboard:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mzqaqvjzwxeys6xrrag.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9mzqaqvjzwxeys6xrrag.jpeg" alt=" " width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading!&lt;/p&gt;

&lt;p&gt;Comments and suggestions, please do send me a mail: &lt;a href="mailto:heyfunmi@gmail.com"&gt;heyfunmi@gmail.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ciao&lt;/strong&gt;!&lt;/p&gt;

</description>
      <category>programming</category>
      <category>datascience</category>
      <category>database</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Data Analysis With Python: Analysis of gun violence incidents in the USA.</title>
      <dc:creator>Oluwafunmilola Obisesan </dc:creator>
      <pubDate>Sat, 25 Jun 2022 23:01:30 +0000</pubDate>
      <link>https://dev.to/heyfunmi/data-analysis-with-python-event-analytics-ohb</link>
      <guid>https://dev.to/heyfunmi/data-analysis-with-python-event-analytics-ohb</guid>
<description>&lt;p&gt;After the school shooting that happened some weeks ago in Uvalde, Texas, I was prompted to carry out some data analysis on gun violence incidents in the USA.&lt;br&gt;
Read my findings here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     **************
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The dataset for this project contains records of gun violence incidents across the 50 states of the United States of America within the period of &lt;strong&gt;December 28, 2018 to June 20, 2022.&lt;/strong&gt;&lt;br&gt;
Data cleaning, analysis and visualization were done using Python. The analysis provides answers to some important questions and builds an understanding of the dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Structure:&lt;/strong&gt;&lt;br&gt;
Columns in the dataset include: Incident ID, Incident Date, State, City and the Address where the gun violence incident happened.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The necessary Python libraries needed to carry out this analysis were imported into the Jupyter Notebook, and the dataset was loaded in to begin the analysis (a minimal sketch of this step follows below).&lt;/li&gt;
&lt;/ul&gt;
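&lt;p&gt;&lt;em&gt;For readers who want to reproduce this step, a minimal sketch (the file name here is a placeholder, not the actual path used in the notebook):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Load the gun violence incidents dataset (placeholder file name)
incidents = pd.read_csv("gun_violence_incidents.csv")
incidents.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;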

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F028pr26yckdbr42zzfu7.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F028pr26yckdbr42zzfu7.jpeg" alt=" " width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking the total number of rows and columns shows the dataset has 2000 rows and 7 columns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu75tnkeyjb3e6vp0204k.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu75tnkeyjb3e6vp0204k.jpeg" alt=" " width="750" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Viewing 10 random samples of the dataset to see what it looks like (one line of pandas, sketched below).&lt;/li&gt;
&lt;/ul&gt;
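&lt;p&gt;&lt;em&gt;A sketch of that one-liner, using the placeholder DataFrame name from above:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Draw 10 random rows to eyeball the data
incidents.sample(10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;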

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hqhmudmx9prqab2hsus.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7hqhmudmx9prqab2hsus.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Cleaning&lt;/strong&gt; &lt;br&gt;
Data cleaning was done using the Python pandas library in order to “clean” the dataset and prepare it for analysis.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking for missing values in the dataset &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zh1tyxd8m2nzetf7osw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0zh1tyxd8m2nzetf7osw.jpeg" alt=" " width="750" height="478"&gt;&lt;/a&gt;&lt;br&gt;
The image above shows the dataset had missing values in the “Address” column. These missing values were then replaced with “not available”, since we cannot replace the missing addresses with any random address (a sketch of this step follows below).&lt;/p&gt;
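&lt;p&gt;&lt;em&gt;A sketch of that replacement, using the column name described above and the placeholder DataFrame name from earlier:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Replace missing addresses with a sentinel value rather than guessing an address
incidents["Address"] = incidents["Address"].fillna("not available")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;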

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1aysndlgtepwfybygn6w.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1aysndlgtepwfybygn6w.jpeg" alt=" " width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Checking for duplicates in the dataset to avoid inaccurate results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw4q95hqc69sd05yzeyo.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmw4q95hqc69sd05yzeyo.jpeg" alt=" " width="750" height="275"&gt;&lt;/a&gt;&lt;br&gt;
The image above shows there were no duplicates in the dataset.&lt;br&gt;
Superb!!!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Analysis and Exploration&lt;/strong&gt; &lt;br&gt;
Analysis was done to answer some key questions and to get useful information.&lt;br&gt;
1) Total number of gun violence incidents in the USA within the period of this record: December 28, 2018 to June 20, 2022.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b3zvjlzmhojxz4qlm3m.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9b3zvjlzmhojxz4qlm3m.jpeg" alt=" " width="750" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The analysis shows there were 2000 gun violence incidents in the USA within the period stated above.&lt;/p&gt;

&lt;p&gt;2) Total number of people killed within the period:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foykhsqul71imrrtemk67.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foykhsqul71imrrtemk67.jpeg" alt=" " width="750" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Analysis shows 2,003 people were killed within the period of this record.&lt;/p&gt;

&lt;p&gt;3) Total number of people injured.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42s5jocz2a68inek03an.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F42s5jocz2a68inek03an.jpeg" alt=" " width="750" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;8,299 people were injured due to gun violence incidents within the period of this record.&lt;/p&gt;
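&lt;p&gt;&lt;em&gt;Both casualty totals come down to simple column sums. A sketch, assuming the casualty columns are named "Killed" and "Injured" (hypothetical names; the screenshots show the originals):&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Total incidents is simply the row count
total_incidents = len(incidents)

# Casualty totals are column sums (column names are hypothetical)
total_killed = incidents["Killed"].sum()
total_injured = incidents["Injured"].sum()
print(total_incidents, total_killed, total_injured)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;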

&lt;p&gt;4) Top ten states with the most gun violence incidents &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foi4ed2gynpq5ungxlkdf.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foi4ed2gynpq5ungxlkdf.jpeg" alt=" " width="750" height="629"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1frdhy4xi8l23tuaacfx.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1frdhy4xi8l23tuaacfx.jpeg" alt=" " width="737" height="484"&gt;&lt;/a&gt;&lt;br&gt;
I was particularly surprised by this result; I really didn’t expect Illinois to top this chart. This further shows that data analysis gives results based on facts and not intuition.&lt;/p&gt;
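&lt;p&gt;&lt;em&gt;A ranking like this is typically a one-liner in pandas; a sketch, using the State column listed in the data structure above:&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Count incidents per state and keep the ten largest
top_states = incidents["State"].value_counts().head(10)

# A horizontal bar chart makes the ranking easy to read
top_states.sort_values().plot(kind="barh")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;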

&lt;p&gt;5) Bottom ten states with the LEAST gun violence incidents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr236bhzx8rl78a1l41z.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzr236bhzx8rl78a1l41z.jpeg" alt=" " width="750" height="601"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u459fakctd0vif8ghaz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1u459fakctd0vif8ghaz.jpeg" alt=" " width="750" height="552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Wyoming, Maine and New Hampshire each had just 1 gun violence incident within the period.&lt;/p&gt;

&lt;p&gt;6) Top ten states with the most deaths:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxx3ibvnlqx0w276aq3vq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxx3ibvnlqx0w276aq3vq.jpeg" alt=" " width="800" height="298"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bu6ys4cnpaq9xheaaeh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bu6ys4cnpaq9xheaaeh.jpeg" alt=" " width="800" height="318"&gt;&lt;/a&gt;&lt;br&gt;
Texas had the most deaths within this period, with a total of 231 people killed due to gun violence incidents.&lt;br&gt;
Crazy!&lt;/p&gt;

&lt;p&gt;7) Top ten cities with the most gun violence incidents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh53ho16tw14v5ijc3ntz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh53ho16tw14v5ijc3ntz.jpeg" alt=" " width="750" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg0897l3y6we5dmyyyj6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feg0897l3y6we5dmyyyj6.jpeg" alt=" " width="750" height="566"&gt;&lt;/a&gt;&lt;br&gt;
Chicago, a city in Illinois, had the most gun violence incidents across all cities in the USA.&lt;/p&gt;

&lt;p&gt;8) Bottom five cities with the least incidents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhazkqt2kmz4hya3olcn9.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhazkqt2kmz4hya3olcn9.jpeg" alt=" " width="750" height="498"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxyr0809p3yjwdnj6jzr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcxyr0809p3yjwdnj6jzr.jpeg" alt=" " width="750" height="548"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lima, Salt Lake, Cleveland, Inkster and West Jefferson were the cities with the least incidents.&lt;/p&gt;

&lt;p&gt;9) Top five addresses with the most gun violence incidents:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F070nx6aleot7zbl3w88b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F070nx6aleot7zbl3w88b.jpeg" alt=" " width="694" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4ps5lgwc1o1ysexybmq.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4ps5lgwc1o1ysexybmq.jpeg" alt=" " width="750" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;NB: four rows in the address column were missing and were replaced with “not available” as explained earlier, hence the “not available” entry in the image above. The other four addresses had the most gun violence incidents within the period of this record.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         *************
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To view the entire code for this project, please check out my GitHub profile for the Jupyter notebook: &lt;a href="https://github.com/heyfunmi/Gun-violence-Analysis" rel="noopener noreferrer"&gt;https://github.com/heyfunmi/Gun-violence-Analysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for reading&lt;/em&gt;!&lt;/p&gt;

&lt;p&gt;For comments and suggestions, please reach me via: &lt;a href="mailto:heyfunmi@gmail.com"&gt;heyfunmi@gmail.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>100daysofcode</category>
      <category>web3</category>
    </item>
    <item>
      <title>Data Analysis With Microsoft Excel: People Analytics.</title>
      <dc:creator>Oluwafunmilola Obisesan </dc:creator>
      <pubDate>Fri, 17 Jun 2022 16:37:55 +0000</pubDate>
      <link>https://dev.to/heyfunmi/data-analysis-with-microsoft-excel-people-analytics-nc9</link>
      <guid>https://dev.to/heyfunmi/data-analysis-with-microsoft-excel-people-analytics-nc9</guid>
      <description>&lt;p&gt;&lt;em&gt;Hey!&lt;br&gt;
Another project here! This one was done using one of my favorite tools for data analysis, Microsoft Excel! I still feel there’s nothing you can’t do with Excel; it’s one of the most powerful tools out there.&lt;/em&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                *****
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In this project, data cleaning, exploration, and visualization were done using only Microsoft Excel.&lt;br&gt;
Analysis was done to provide an end-of-year summary of the company’s employment structure based on department, performance score, and current status.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data structure&lt;/strong&gt;: This dataset originates from the company’s HR department. It contains employee records covering name, gender, marital status, salary, country, state, department, recruitment source, performance score, and more.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data cleaning and preparation:&lt;/strong&gt;&lt;br&gt;
The data cleaning and preparation phase ensured the dataset was free of errors, outliers, and duplicates.&lt;br&gt;
Here is a picture of the dataset prior to cleaning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncaadj6cug4pj3vkjkif.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fncaadj6cug4pj3vkjkif.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some of the steps taken to clean the data include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Removal of duplicates&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The gender and marital status columns were edited to make their values easier to understand. The Find and Replace function in Microsoft Excel was used to replace “M” and “F” in the gender column with “Male” and “Female”.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The marital status column was also edited; the numeric codes 0, 1, 2, 3, and 4 were replaced with descriptive labels such as single, married, divorced, and separated.&lt;/p&gt;
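
&lt;p&gt;For readers who prefer a programmatic route, here is a small pandas sketch of the same recoding; the file name, column names, and code mapping are assumptions based on the description above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# File and column names are assumptions for illustration.
df = pd.read_excel("hr_dataset.xlsx")

# Equivalent of Excel's Find and Replace on the gender column.
df["Gender"] = df["Gender"].replace({"M": "Male", "F": "Female"})

# Recode the numeric marital status codes into descriptive labels
# (mapping assumed from the description above).
labels = {0: "Single", 1: "Married", 2: "Divorced", 3: "Separated"}
df["MaritalStatus"] = df["MaritalStatus"].replace(labels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;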

&lt;p&gt;Here is a picture showing how the find and replace function was used to replace the values in the marital status column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjzs27y5cw5l9jgjjw9p.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhjzs27y5cw5l9jgjjw9p.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;3. An additional column, an employee count column, was also created. This column was necessary because it is pivotal to all the analysis carried out, as shown in the sketch after the dashboard questions below.&lt;br&gt;
Here is a picture of the dataset after the data cleaning process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsou361g0z2bpkumjn6di.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsou361g0z2bpkumjn6di.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis and insights:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Analysis was done using pivot tables and charts.&lt;br&gt;
Slicers were also added for easy access to the various sections of the dashboard.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The dashboard provides answers to the following:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;1. Employees count by gender&lt;br&gt;
2. Employees count by department&lt;br&gt;
3. Employees count by performance score&lt;br&gt;
4. Employees count by current status.&lt;/p&gt;
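
&lt;p&gt;As a rough pandas equivalent of the pivot tables behind these questions, using the employee count helper column mentioned earlier (the file and column names are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_excel("hr_dataset.xlsx")  # file name assumed

# A helper column of 1s mirrors the employee count column created above,
# so every pivot can simply sum it.
df["Employees count"] = 1

# Employee counts by gender, department, performance score, and status
# (column names assumed).
for col in ["Gender", "Department", "PerformanceScore", "EmploymentStatus"]:
    print(df.pivot_table(values="Employees count", index=col, aggfunc="sum"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;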

&lt;p&gt;Here is a picture of the dashboard:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr40tbcmucq4hhxajqmvm.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr40tbcmucq4hhxajqmvm.jpeg" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>discuss</category>
      <category>datascience</category>
      <category>analytics</category>
    </item>
    <item>
      <title>Data Analysis With Power BI: Sales Analysis.</title>
      <dc:creator>Oluwafunmilola Obisesan </dc:creator>
      <pubDate>Sat, 11 Jun 2022 13:23:33 +0000</pubDate>
      <link>https://dev.to/heyfunmi/data-analysis-with-power-bi-sales-report-analysis-456k</link>
      <guid>https://dev.to/heyfunmi/data-analysis-with-power-bi-sales-report-analysis-456k</guid>
      <description>&lt;p&gt;&lt;strong&gt;INTRODUCTION&lt;/strong&gt;&lt;br&gt;
Hey there! Here is a data analysis project I did with Power BI.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Come along!!!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The dataset used for this project originates from a superstore sales record. Analysis was done to answer questions that reflect important key performance indicators (KPIs) and to get a proper understanding of the metrics and trends in the dataset.&lt;br&gt;
This project provides an in-depth analysis of the sales record and offers insights that could support the growth of the superstore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data structure&lt;/strong&gt;&lt;br&gt;
The dataset contains records of sales from 2012 to 2017.&lt;br&gt;
Fields in the data include order ID, item type, customer location, order quantity, unit price, order priority, units sold, total revenue, and total profit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data cleaning and preparation&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nd5xfpchlw1xefimiss.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9nd5xfpchlw1xefimiss.jpeg" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;br&gt;
The dataset was loaded into the Power Query editor of Power BI for cleaning in preparation for analysis. The dataset had many empty rows across its columns, and these were removed.&lt;br&gt;
Misspellings, duplicates, and outliers were also removed from the dataset, so only unique values were used.&lt;/p&gt;
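
&lt;p&gt;Outside Power Query, the same cleaning steps could be sketched in pandas as follows; the file name is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("superstore_sales.csv")  # file name assumed

# Remove rows that are entirely empty across all columns.
df = df.dropna(how="all")

# Drop exact duplicate records so only unique rows remain.
df = df.drop_duplicates()

# Misspellings and outliers need domain-specific rules; trimming stray
# whitespace from the column headers is a common first step.
df.columns = df.columns.str.strip()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;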

&lt;p&gt;Here is a picture of the dataset after cleaning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4g4qcycsw361ba5bfeg.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff4g4qcycsw361ba5bfeg.jpeg" alt=" " width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analysis and insights&lt;/strong&gt;&lt;br&gt;
An overview of the analysis showed the superstore sold 513k units of goods and made 44.17 million dollars in profit.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs51z3hrizx5aa92imon3.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs51z3hrizx5aa92imon3.jpeg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The data was used to provide insights to the following:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The sales channel with the most units of goods sold&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The superstore made sales through online and offline channels.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxk4wv3pwm9vv5ndnxoep.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxk4wv3pwm9vv5ndnxoep.jpeg" alt=" " width="503" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Analysis of the dataset showed the superstore sold more units of goods through the offline channel than through the online channel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which channel brought in the most profit&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pchsy7swjeuzsq4274j.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1pchsy7swjeuzsq4274j.jpeg" alt=" " width="441" height="316"&gt;&lt;/a&gt;&lt;br&gt;
Consistent with the data above, more profit was made through offline sales than through online sales.&lt;/p&gt;
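
&lt;p&gt;A small pandas sketch of this channel comparison might look as follows; column names such as "Sales Channel", "Units Sold", and "Total Profit" are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("superstore_sales.csv")  # file name assumed

# Total units sold and total profit per sales channel.
by_channel = df.groupby("Sales Channel")[["Units Sold", "Total Profit"]].sum()
print(by_channel.sort_values("Total Profit", ascending=False))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;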

&lt;p&gt;&lt;strong&gt;Sales made based on item type&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcmrdfr7jfsffcf362i6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgcmrdfr7jfsffcf362i6.jpeg" alt=" " width="423" height="382"&gt;&lt;/a&gt;&lt;br&gt;
This analysis reveals that cosmetics was the best-selling item type and meat was the least sold item type at the superstore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Profit by item type&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze34l8itzgf3qifqc75b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze34l8itzgf3qifqc75b.jpeg" alt=" " width="420" height="394"&gt;&lt;/a&gt;&lt;br&gt;
Fruit was the least profitable item type, while cosmetics brought in the most profit for the superstore.&lt;/p&gt;
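
&lt;p&gt;The item type findings above could be reproduced with a similar aggregation; again, the column names are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

df = pd.read_csv("superstore_sales.csv")  # file name assumed

by_item = df.groupby("Item Type")[["Units Sold", "Total Profit"]].sum()

# Best and least selling item types, then most and least profitable.
print(by_item["Units Sold"].idxmax(), by_item["Units Sold"].idxmin())
print(by_item["Total Profit"].idxmax(), by_item["Total Profit"].idxmin())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;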

&lt;p&gt;&lt;strong&gt;Units sold by each region&lt;/strong&gt;&lt;br&gt;
The superstore sold goods to eight regions, including Sub-Saharan Africa, Europe, Asia, the Middle East, Central America, North America, and Australia.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkan4xeo5drp3ip38do0b.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkan4xeo5drp3ip38do0b.jpeg" alt=" " width="660" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The superstore made the most sales to the Sub-Saharan Africa region, while the North America region bought the fewest units of goods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Total profit made per region&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2f5md9zau5awz6y09nr.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp2f5md9zau5awz6y09nr.jpeg" alt=" " width="463" height="436"&gt;&lt;/a&gt;&lt;br&gt;
The superstore also made the most profit selling to the Sub-Saharan Africa region and the least profit selling to the North America region.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategic recommendations&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;More sales and profit were made through the offline channel than the online channel. Hence, discounts could be applied to goods bought online, thereby generating more sales and profit through that channel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To support regions with low sales, such as North America and Central America, the superstore could run publicity campaigns in those regions to create awareness of the store and what it sells. Promotional emails and flyers could also be used for this purpose. Furthermore, establishing superstore branches in those regions could boost sales.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;3. The least sold item types, such as meats and vegetables, could be offered in combo or special packages to stimulate consumer interest in buying them.&lt;/p&gt;

&lt;p&gt;Here is an overview of the analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1vnq9ydcmozac5pm2pd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1vnq9ydcmozac5pm2pd.jpeg" alt=" " width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

</description>
      <category>powerfuldevs</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>100daysofcode</category>
    </item>
  </channel>
</rss>
