<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Omale Happiness Ojone</title>
    <description>The latest articles on DEV Community by Omale Happiness Ojone (@codinghappinessweb).</description>
    <link>https://dev.to/codinghappinessweb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F561973%2F3bb9568b-fc8f-4fc1-8af3-a9b808cf6a71.jpeg</url>
      <title>DEV Community: Omale Happiness Ojone</title>
      <link>https://dev.to/codinghappinessweb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/codinghappinessweb"/>
    <language>en</language>
    <item>
      <title>How to predict a model using linear regression</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Tue, 14 Jun 2022 20:11:26 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/how-to-predict-a-model-using-linear-regression-2hpm</link>
      <guid>https://dev.to/codinghappinessweb/how-to-predict-a-model-using-linear-regression-2hpm</guid>
      <description>&lt;h2&gt;
  
  
  What is Regression?
&lt;/h2&gt;

&lt;p&gt;Regression is a supervised learning method used to determine the relationship between variables. When the output variable is a real or continuous value, you have a regression problem.&lt;/p&gt;

&lt;p&gt;In this lesson, I'll show you how to predict a model using linear regression, and I'll use the fish market &lt;a href="https://www.kaggle.com/aungpyaeap/fish-market"&gt;dataset&lt;/a&gt; as an example. Let's get started. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is Linear Regression?
&lt;/h2&gt;

&lt;p&gt;Linear regression is a straightforward statistical method for modelling the relationship between continuous variables. As the name implies, it assumes a linear relationship between the independent variable (X-axis) and the dependent variable (Y-axis). A linear regression with only one input variable (x) is called simple linear regression; when there are several input variables, it is called multiple linear regression. The fitted model gives a sloped straight line describing the relationship between the variables.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8nLls5e5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i0ovqwzavb0weu6s7yyp.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8nLls5e5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/i0ovqwzavb0weu6s7yyp.JPG" alt="Image description" width="392" height="290"&gt;&lt;/a&gt;&lt;br&gt;
The dependent variable and independent variables have a linear relationship, as shown in the graph above. When the value of x (the independent variable) rises, so does the value of y (the dependent variable). The best fit straight line is designated by the red line. We aim to plot a line that best predicts the data points based on the given data points.&lt;/p&gt;
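&lt;p&gt;As a quick illustration (with made-up numbers, not the fish dataset), the best-fit line y = b0 + b1*x can be computed with NumPy's least-squares fit:&lt;/p&gt;

```python
import numpy as np

# Simple linear regression: y = b0 + b1 * x, fitted by least squares
# on a tiny synthetic sample (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

b1, b0 = np.polyfit(x, y, deg=1)  # slope first, then intercept
print(round(b1, 2), round(b0, 2))  # 1.99 0.09
```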

&lt;p&gt;Firstly, let's import basic utilities:&lt;/p&gt;

&lt;p&gt;In [1]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%matplotlib inline
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's now read the csv file (keep in mind it should be in the same folder as the notebook!):&lt;/p&gt;

&lt;p&gt;In [2]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df = pd.read_csv('Fish.csv')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's take a peek at the dataset's first few rows to get a basic understanding of its structure.&lt;/p&gt;

&lt;p&gt;In [3]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gMKzZEDY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c8i0dp6rys5im8zq2xi0.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gMKzZEDY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c8i0dp6rys5im8zq2xi0.JPG" alt="Image description" width="522" height="189"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let's look at the dataset more closely to obtain essential statistical indicators such as the mean and standard deviation.&lt;/p&gt;

&lt;p&gt;In [4]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df.describe()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--v016xRqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kvzfhgbjh4bx37hpgv7j.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--v016xRqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kvzfhgbjh4bx37hpgv7j.JPG" alt="Image description" width="602" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;X and y must be reshaped into two-dimensional column arrays, because scikit-learn expects its inputs as 2-D arrays of shape (n_samples, n_features); passing 1-D arrays would raise an error.&lt;/p&gt;

&lt;p&gt;In [5]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = np.array(df['Length1']).reshape(-1, 1) 
y = np.array(df['Length2']).reshape(-1, 1)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
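&lt;p&gt;To see why the reshape matters, here is a minimal sketch with made-up values: a flat array must become a single-feature column before scikit-learn will accept it.&lt;/p&gt;

```python
import numpy as np

# A flat 1-D array of 4 measurements...
lengths = np.array([23.2, 24.0, 23.9, 26.3])
print(lengths.shape)        # (4,)

# ...reshaped into a 2-D column: 4 samples, 1 feature.
X = lengths.reshape(-1, 1)
print(X.shape)              # (4, 1)
```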



&lt;p&gt;Split the dataset into two parts: train and test. This is needed to calculate the accuracy (and many other metrics) of the model. We use the train part during training and the held-out test part during evaluation.&lt;/p&gt;

&lt;p&gt;In [6]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=101)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Import the model and instantiate it:&lt;/p&gt;

&lt;p&gt;In [7]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.linear_model import LinearRegression
linearmodel = LinearRegression()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's train the model:&lt;/p&gt;

&lt;p&gt;In [8]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;linearmodel.fit(X_train, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's have a look at how the model performs using R2 (the coefficient of determination), a statistical metric that measures how well the model fits the data. In this single-variable case, R2 equals the square of the Pearson correlation coefficient between x and y. It typically ranges from 0.0 (worst fit) to 1.0 (best fit), although on held-out test data it can even be negative if the model performs worse than simply predicting the mean.&lt;/p&gt;
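&lt;p&gt;That equivalence can be checked directly on synthetic data (not the fish dataset): the in-sample R2 of a one-variable linear model matches the squared Pearson correlation.&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic linear data with noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0, 1, size=200)

# In-sample R2 of the fitted model...
model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = model.score(x.reshape(-1, 1), y)

# ...equals the squared Pearson correlation coefficient.
r = np.corrcoef(x, y)[0, 1]
print(round(abs(r2 - r ** 2), 10))  # 0.0
```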

&lt;p&gt;In [9]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;linearmodel.score(X_test, y_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Tzf0PVxJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y91ai6brp4j0au1jvsas.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Tzf0PVxJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/y91ai6brp4j0au1jvsas.JPG" alt="Image description" width="243" height="38"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's quite high! This is because the two variables (Length1 and Length2) lie almost exactly on a straight line. Let's compare the predicted values to the test values in the dataset.&lt;/p&gt;

&lt;p&gt;In [10]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.scatter(x_test, y_test)
plt.plot(x_test, linearmodel.predict(x_test), color = 'red')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Zo-N4cuj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gtukup7h74khexvhxpbx.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Zo-N4cuj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gtukup7h74khexvhxpbx.JPG" alt="Image description" width="445" height="265"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Linear Regression (multiple independent variables): Let's predict weight
&lt;/h2&gt;

&lt;p&gt;Predicting the weight of the fish with linear regression works the same way as before. The only significant difference is that there are now several independent variables. The categorical variable "Species" will be dropped entirely.&lt;/p&gt;

&lt;p&gt;In [11]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x = fish.drop(['Weight', 'Species'], axis = 1)
y = fish['Weight']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In [12]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In [13]:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;linearmodel = LinearRegression()
linearmodel.fit(x_train, y_train)
linearmodel.score(x_test, y_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--__tTpaq---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ral1ojvjywb78vrelb8b.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--__tTpaq---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ral1ojvjywb78vrelb8b.JPG" alt="Image description" width="251" height="41"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The model fits, but it doesn't perform as well as hoped for this problem and data. The EDA skipped some basics such as feature selection and outlier removal, and the model was deprived of the 'Species' feature, which, as you might imagine, may influence a fish's weight. Dataset size also matters: larger datasets usually lead to more accurate results. In any case, the goal of simple linear regression is to predict the value of a dependent variable from an independent variable; the stronger the linear relationship between them, the more accurate the prediction.&lt;br&gt;
Thanks for reading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S:&lt;/strong&gt; I'm looking forward to being your friend; let's connect on &lt;a href="https://twitter.com/Coding_Happinex"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Preparation</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Sat, 16 Apr 2022 21:57:00 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/data-preparation-4khj</link>
      <guid>https://dev.to/codinghappinessweb/data-preparation-4khj</guid>
      <description>&lt;h1&gt;
  
  
  Data Preparation
&lt;/h1&gt;

&lt;p&gt;Data preparation is the transformation of raw data into a form that is more suitable for modeling so that data scientists and analysts can run it through machine learning algorithms to uncover insights or make predictions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why is Data Preparation Important?
&lt;/h2&gt;

&lt;p&gt;Most machine learning algorithms require data to be formatted in a very specific way, so datasets generally require some amount of preparation before they can yield useful insights. Some datasets have values that are missing, invalid, have inaccuracies or other errors, which are difficult for the algorithm to process. &lt;/p&gt;

&lt;p&gt;The algorithm cannot function if data is missing. If the data is incorrect, the algorithm produces less accurate, if not misleading, results. Some datasets simply lack useful business context (for example, poorly defined ID values), necessitating feature enrichment. Good data preparation results in clean, well-curated data, which leads to more practical, accurate model results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Steps in data preparation process
&lt;/h2&gt;

&lt;p&gt;The process of preparing data includes the following:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Data collection:
&lt;/h3&gt;

&lt;p&gt;Relevant data is gathered from operational systems, data warehouses and other data sources. During this step, data professionals and end users gathering data themselves should confirm that the data is a good fit for the objectives of the planned applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Data discovery and profiling.
&lt;/h3&gt;

&lt;p&gt;The next step is to explore the collected data to understand what it contains and what needs to be done to prepare it for the intended use. Data profiling helps identify patterns, anomalies, inconsistencies, missing data, and other attributes and issues in data sets, so problems can be addressed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Data cleaning.
&lt;/h3&gt;

&lt;p&gt;In this step, the identified data errors are corrected to create complete and accurate data sets that are ready to be processed and analyzed. For example, faulty data is removed or fixed, missing values are filled in, and inconsistent entries are harmonized. Common data cleaning operations include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using statistics to define normal data and identify outliers.
&lt;/li&gt;
&lt;li&gt;Identifying columns that have the same value or no variance and removing them.
&lt;/li&gt;
&lt;li&gt;Identifying duplicate rows of data and removing them.
&lt;/li&gt;
&lt;li&gt;Marking empty values as missing.
&lt;/li&gt;
&lt;li&gt;Imputing missing values using statistics or a learned model&lt;/li&gt;
&lt;/ul&gt;
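&lt;p&gt;The operations above can be sketched in pandas on a toy DataFrame (column names and values are made up for illustration):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy frame with a constant column, a duplicate row, and a missing value.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "value": [10.0, np.nan, np.nan, 30.0],
    "constant": [5, 5, 5, 5],
})

# Drop columns with no variance (a single unique value).
df = df.loc[:, df.nunique(dropna=False).gt(1)]

# Drop duplicate rows.
df = df.drop_duplicates()

# Impute missing values with a simple statistic (here, the mean).
df["value"] = df["value"].fillna(df["value"].mean())
print(df)
```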

&lt;h3&gt;
  
  
  4. Data structuring.
&lt;/h3&gt;

&lt;p&gt;At this point, the data needs to be structured, modelled and organized into a unified format that will meet the requirements of the planned use.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Data transformation and enrichment.
&lt;/h3&gt;

&lt;p&gt;In connection with structuring data, it often must be transformed to make it consistent and turn it into usable information. Data enrichment and optimization further enhance data sets as needed to produce the desired business insights.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Data validation and publishing.
&lt;/h3&gt;

&lt;p&gt;To complete the preparation process, automated routines are run against the data to validate its consistency, completeness and accuracy. The prepared data is then stored in a data warehouse or other repository and made available for use.&lt;/p&gt;

&lt;p&gt;A big benefit of instituting an effective data preparation process is that data scientists and other end users can spend less time finding and structuring data and instead focus more on data mining and data analysis. For example, data preparation can be done more quickly, and prepared data can automatically be fed to users for analyses.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we have seen what data preparation is and the process of preparing data. We also saw reasons why data preparation is important. &lt;strong&gt;Thanks for reading&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S:&lt;/strong&gt; I'm looking forward to being your friend; let's connect on &lt;a href="https://twitter.com/Coding_Happinex"&gt;Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>programming</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Data Cleaning</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Fri, 15 Apr 2022 16:14:13 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/data-cleaning-1k9b</link>
      <guid>https://dev.to/codinghappinessweb/data-cleaning-1k9b</guid>
      <description>&lt;p&gt;Data cleaning refers to the process of “cleaning” data, by identifying errors in the data and then rectifying them.&lt;br&gt;
The main aim of Data Cleaning is to identify and remove errors &amp;amp; duplicate data, in order to create a reliable dataset.&lt;br&gt;
We will use the fish dataset as the basis for this tutorial.&lt;/p&gt;
&lt;h2&gt;
  
  
  Fish Dataset
&lt;/h2&gt;

&lt;p&gt;The “Fish Dataset” is a machine learning dataset.&lt;br&gt;
The task involves predicting the weight of a fish.&lt;br&gt;
You can access the dataset here:&lt;br&gt;
&lt;a href="https://www.kaggle.com/aungpyaeap/fish-market" rel="noopener noreferrer"&gt;https://www.kaggle.com/aungpyaeap/fish-market&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         from pandas import read_csv
         from numpy import unique
         import pandas as pd
         import seaborn as sns
         import matplotlib.pyplot as plt
         import numpy as np
         fish = pd.read_csv("Fish.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What does the data look like?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rfziynaczcjjq6bpi58.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6rfziynaczcjjq6bpi58.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fill-Out Missing Values
&lt;/h2&gt;

&lt;p&gt;One of the first steps in fixing errors in your dataset is to find incomplete values and fill them in. Much of your data can be grouped into categories, and in most cases it is best to fill in missing values per category, or to create a new category to hold them.&lt;br&gt;
For numerical data, you can impute missing values with the mean or median.&lt;br&gt;
Let's check our dataset:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzhg5of4xodaop3el1xv.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvzhg5of4xodaop3el1xv.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, in this case, we do not have missing values.&lt;/p&gt;
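&lt;p&gt;Had there been gaps, a numeric column could be filled with the mean or median like this (illustrative values, not the actual fish data):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Illustrative series with a gap in a numeric column.
weights = pd.Series([242.0, np.nan, 340.0, 363.0])

# Median imputation is robust to outliers; mean is also common.
filled = weights.fillna(weights.median())
print(filled[1])  # 340.0 (median of 242, 340, 363)
```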

&lt;h2&gt;
  
  
  Removing rows with missing values
&lt;/h2&gt;

&lt;p&gt;One of the simplest things to do in data cleansing is to remove or delete rows with missing values. This may not be the ideal step in case of a huge amount of errors in your training data.&lt;br&gt;
If the missing values are considerably less, then removing or deleting missing values can be the right approach. You will have to be very sure that the data you are deleting does not include information that is present in the other rows of the training data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; As you can see, in this case, we do not have missing values. However, this is not always the case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fixing errors in the Dataset
&lt;/h2&gt;

&lt;p&gt;Ensure there are no typographical errors or inconsistencies in upper and lower case.&lt;br&gt;
Go through your data set, identify such errors, and fix them to make sure your training set is error-free. This will help your machine learning models yield better results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify Columns That Contain a Single Value
&lt;/h2&gt;

&lt;p&gt;Columns that have a single observation or value are probably useless for modeling.&lt;br&gt;
These columns are referred to as zero-variance predictors: if we measured the variance (the average squared deviation from the mean), it would be zero, because the predictor displays no variation at all.&lt;br&gt;
You can detect columns with this property using the &lt;strong&gt;nunique() Pandas function&lt;/strong&gt;, which reports the number of unique values in each column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vkih3yhwbb25lvtaie1.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vkih3yhwbb25lvtaie1.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Delete Columns That Contain a Single Value
&lt;/h2&gt;

&lt;p&gt;Variables or columns that have a single value should probably be removed from your dataset.&lt;br&gt;
From the picture above, we can see that the column &lt;strong&gt;Species&lt;/strong&gt; has a single value.&lt;br&gt;
Columns are relatively easy to remove from a NumPy array or Pandas DataFrame.&lt;br&gt;
One approach is to record all columns that have a single unique value, then delete them from the Pandas DataFrame by calling the drop() function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4m3kn4tazdfgbptq8e9.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4m3kn4tazdfgbptq8e9.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Identify Rows That Contain Duplicate Data
&lt;/h2&gt;

&lt;p&gt;Rows that contain identical data are probably useless, if not dangerously misleading during model evaluation.&lt;br&gt;
A duplicate row is one where every column value appears, in the same order, in another row.&lt;br&gt;
The Pandas function &lt;strong&gt;duplicated()&lt;/strong&gt; reports whether a given row is duplicated: each row is marked False if it is not a duplicate, or True if it is. By default, the first occurrence of a repeated row is marked False, as we might expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw7vxosrsw7f562i5a97.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw7vxosrsw7f562i5a97.JPG" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;First, the presence of any duplicate rows is reported; in this case, we can see that there are none (False).&lt;br&gt;
Where duplicates do exist, the Pandas function &lt;strong&gt;drop_duplicates()&lt;/strong&gt; can be used to drop the duplicate rows.&lt;/p&gt;
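&lt;p&gt;Both functions can be sketched on a toy frame with one deliberate duplicate:&lt;/p&gt;

```python
import pandas as pd

# duplicated() flags repeat rows; drop_duplicates() removes them.
df = pd.DataFrame({
    "Weight": [242.0, 242.0, 290.0],
    "Length": [23.2, 23.2, 24.0],
})
print(df.duplicated().any())   # True: row 1 repeats row 0
df = df.drop_duplicates()
print(len(df))                 # 2
```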

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Data cleaning is critical to the success of any machine learning project; for most projects, a large share of the effort (often cited as around 80 percent) is spent on it. We have covered some of the key operations above.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>programming</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How I Deployed my First Machine Learning Model Using Streamlit (Part 2)</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Sat, 17 Jul 2021 20:34:48 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-2-103a</link>
      <guid>https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-2-103a</guid>
      <description>&lt;p&gt;In this article I will be explaining how to deploy your machine learning model online.&lt;/p&gt;

&lt;p&gt;But before then here is the link to the part 1 of this article:&lt;br&gt;
&lt;a href="https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-1-31h9"&gt;https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-1-31h9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After signing up on Streamlit, you will get an invite from the app, which looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U5aoUujw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sits7xaw5og0s9zrwywh.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U5aoUujw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/sits7xaw5og0s9zrwywh.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It takes 1–2 days for the request to be accepted; the acceptance invite looks like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GwqEwTvG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/odt4yzhb2i3znce00ekk.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GwqEwTvG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/odt4yzhb2i3znce00ekk.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After signing in, create a GitHub repository and name it, e.g. "Loan Prediction". Add the project files to the repository you created.&lt;/p&gt;

&lt;p&gt;Some of the files you should add to the repository include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The code for the project&lt;/li&gt;
&lt;li&gt;The python script for the project&lt;/li&gt;
&lt;li&gt;The pickled model file, which is classifier.pkl for this project&lt;/li&gt;
&lt;li&gt;The requirements.txt file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To generate the requirements.txt file, first install pipreqs by running "pip install pipreqs" at the command prompt. Then open a terminal in the folder containing the Python file for your Streamlit app and run "pipreqs". It will scan the Python files there and create a requirements.txt file for you.&lt;/p&gt;

&lt;p&gt;Next, sign in to Streamlit sharing: create a new app, link your GitHub repository to it, specify the main Python file for the app, and deploy.&lt;/p&gt;

&lt;p&gt;Finally, your app is deployed!&lt;br&gt;
You can now share the link with your friends.&lt;/p&gt;

&lt;h1&gt;
  
  
  End Notes
&lt;/h1&gt;

&lt;p&gt;Congratulations! We have now successfully completed loan prediction model deployment using Streamlit. The deployment is simple, fast, and, most importantly, in Python. I encourage you to first try this particular project, play around with the input values, and check the results. Then you can try out other machine learning projects and deploy them with Streamlit as well.&lt;/p&gt;

&lt;p&gt;Lastly, I would love to hear your feedback and suggestions for this article. If you have any questions related to the article, post them in the comments section below. I will actively look forward to answering them.&lt;/p&gt;

&lt;p&gt;Link to part 1 of the article: &lt;a href="https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-1-31h9"&gt;https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-1-31h9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can view the app via &lt;a href="https://share.streamlit.io/codinghappiness-web/loanprediction/main/loan_prediction.py/"&gt;Streamlit&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can access the dataset on &lt;a href="https://github.com/codinghappiness-web/streamlit_app.py/blob/main/train_u6lujuX_CVtuZ9i.csv"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And my Jupyter notebook on &lt;a href="https://github.com/codinghappiness-web/streamlit_app.py/blob/main/Loan%20Prediction.ipynb"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How I Deployed my First Machine Learning Model Using Streamlit (Part 1)</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Sat, 17 Jul 2021 20:31:39 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-1-31h9</link>
      <guid>https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-1-31h9</guid>
      <description>&lt;p&gt;I believe most of you must have done some form of data science project at some point in your lives, be it a machine learning project, a deep learning project, or even visualizations of your data. And the best part of these projects is to showcase them to others.&lt;/p&gt;

&lt;p&gt;But the question is how will you showcase your work to others? Well, this is where Model Deployment will help you.&lt;/p&gt;

&lt;p&gt;In this article I will be showing you how I was able to deploy my first machine learning model using Streamlit.&lt;/p&gt;

&lt;p&gt;Streamlit is a popular open-source framework used for model deployment by machine learning and data science teams. And the best part is it's free and written purely in Python.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygnksdh285gffuu1j3v9.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fygnksdh285gffuu1j3v9.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Preparing Data and Training Model
&lt;/h1&gt;

&lt;p&gt;We will first build a loan prediction model and then deploy it using Streamlit.&lt;/p&gt;

&lt;p&gt;The project that I have picked for this particular article is automating the loan eligibility process.&lt;/p&gt;

&lt;p&gt;The task is to predict whether the loan will be approved or not based on the details provided by customers.&lt;/p&gt;

&lt;p&gt;Based on the details provided by customers, we have to create a model that decides whether or not a loan should be approved, and identify the factors that help us make that prediction.&lt;/p&gt;

&lt;p&gt;As a starting point, here are a couple of factors that I think will be helpful for us with respect to this project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amount of loan: The total amount of loan applied for by the customer. My hypothesis here is that the higher the loan amount, the lower the chances of approval, and vice versa.&lt;/li&gt;
&lt;li&gt;Income of applicant: The income of the applicant (customer) can also be a deciding factor. A higher income should lead to a higher probability of loan approval.&lt;/li&gt;
&lt;li&gt;Education of applicant: The educational qualification of the applicant can also be a vital factor in predicting the loan status of a customer. My hypothesis is that the higher the applicant’s educational qualification, the higher the chances of loan approval.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, we need to collect the data. The dataset of customers and loans is linked at the end of this article.&lt;/p&gt;

&lt;p&gt;We will first import the required libraries and then read the CSV file:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  import pandas as pd
  train = pd.read_csv('train_ctrUa4K.csv') 
  train.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2y3t00dsdaffv3vhdfm.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr2y3t00dsdaffv3vhdfm.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above are the first five rows from the dataset.&lt;/p&gt;

&lt;p&gt;We know that machine learning models take only numbers as inputs and cannot process strings. So, we have to deal with the categorical features present in the dataset and convert them into numbers:&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; train['Gender']= train['Gender'].map({'Male':0, 'Female':1})
 train['Married']= train['Married'].map({'No':0, 'Yes':1})
 train['Loan_Status']= train['Loan_Status'].map({'N':0, 
 'Y':1})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Above, we have converted the categories present in the Gender, Married, and Loan_Status variables into numbers, simply using the map function of the pandas DataFrame. Next, let’s check if there are any missing values in the dataset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     train.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwh3kerrp0kz1s8pkrhqi.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwh3kerrp0kz1s8pkrhqi.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, there are missing values in several features, including the Gender, Married, and LoanAmount variables. Next, we will remove all rows that contain any missing values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train = train.dropna()
train.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nrmx8274629cn13i8hr.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nrmx8274629cn13i8hr.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now there are no missing values in the dataset. Next, we will separate the dependent (Loan_Status) and the independent variables:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  X = train[['Gender', 'Married', 'ApplicantIncome', 
      'LoanAmount', 'Credit_History']]
  y = train.Loan_Status
  X.shape, y.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzg5o5xj8wmp3sb4v44s.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzzg5o5xj8wmp3sb4v44s.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will first split our dataset into a training and validation set, so that we can train the model on the training set and evaluate its performance on the validation set.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; from sklearn.model_selection import train_test_split
 x_train, x_cv, y_train, y_cv = train_test_split(X,y, 
 test_size = 0.2, random_state = 10)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We have split the data using the train_test_split function from the sklearn library, keeping test_size at 0.2, which means 20 percent of the total dataset will be kept aside for the validation set. Next, we will train a random forest classifier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      from sklearn.ensemble import RandomForestClassifier 
      model = RandomForestClassifier(max_depth=4, random_state 
      = 10) 
      model.fit(x_train, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now that our model is trained, let’s check its performance on both the training and validation sets:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      from sklearn.metrics import accuracy_score
      pred_cv = model.predict(x_cv)
      accuracy_score(y_cv,pred_cv)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6yxtxq93uyywiey0y4y.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6yxtxq93uyywiey0y4y.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model is 80% accurate on the validation set. Let’s check the performance on the training set too:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    pred_train = model.predict(x_train)
    accuracy_score(y_train,pred_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt833gsk6zapbftdk9qm.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt833gsk6zapbftdk9qm.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Performance on the training set is similar to that on the validation set, so the model has generalized well. Finally, we will save this trained model so that it can be used in the future to make predictions on new observations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         # saving the model 
         import pickle 
         pickle_out = open("classifier.pkl", mode = "wb") 
         pickle.dump(model, pickle_out) 
         pickle_out.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We are saving the model in pickle format and storing it as classifier.pkl. This will store the trained model and we will use this while deploying the model.&lt;/p&gt;
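
&lt;p&gt;As a quick sanity check (a minimal sketch, not part of my original notebook), you can load the pickled file back and confirm the round trip works. The stand-in dictionary below is hypothetical; in the article the pickled object is the trained RandomForestClassifier:&lt;/p&gt;

```python
import pickle

# stand-in for the trained model (in the article, this is the RandomForestClassifier)
model = {"name": "loan_classifier",
         "features": ["Gender", "Married", "ApplicantIncome", "LoanAmount", "Credit_History"]}

# save the object in pickle format, exactly as done for classifier.pkl
with open("classifier.pkl", mode="wb") as pickle_out:
    pickle.dump(model, pickle_out)

# load it back, as the Streamlit app will do at startup
with open("classifier.pkl", mode="rb") as pickle_in:
    restored = pickle.load(pickle_in)

print(restored == model)  # True: the restored object matches what was saved
```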

&lt;p&gt;We will be deploying this loan prediction model using Streamlit, one of the simplest ways of building web apps and deploying machine learning and deep learning models.&lt;/p&gt;

&lt;h1&gt;
  
  
  Model Deployment of the Loan Prediction Model using Streamlit
&lt;/h1&gt;

&lt;p&gt;To create the app, we will start with the basic installation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; !pip install -q streamlit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Streamlit will be used to make our web app.&lt;/p&gt;

&lt;p&gt;We have to create the python script for our app. Let me show the code first and then I will explain it to you in detail:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         import pickle
         import streamlit as st

         # loading the trained model
         pickle_in = open('classifier.pkl', 'rb') 
         classifier = pickle.load(pickle_in)

         @st.cache()

         # defining the function which will make the 
         prediction using the data which the user inputs 
         def prediction(Gender, Married, ApplicantIncome, 
             LoanAmount, Credit_History):   

             # Pre-processing user input    
             if Gender == "Male":
                Gender = 0
            else:
                Gender = 1

           if Married == "Unmarried":
              Married = 0
          else:
              Married = 1

          if Credit_History == "Unclear Debts":
             Credit_History = 0
         else:
             Credit_History = 1  

         LoanAmount = LoanAmount / 1000

         # Making predictions 
         prediction = classifier.predict( 
           [[Gender, Married, ApplicantIncome, LoanAmount, 
           Credit_History]])

         if prediction == 0:
            pred = 'Rejected'
         else:
             pred = 'Approved'
         return pred


        #this is the main function in which we define our 
        webpage  
       def main():       
       #front end elements of the web page 
       html_temp = """ 
       &amp;lt;div style ="background-color:yellow;padding:13px"&amp;gt; 
       &amp;lt;h1 style ="color:black;text-align:center;"&amp;gt;Streamlit 
       Loan 
       Prediction ML App&amp;lt;/h1&amp;gt; 
       &amp;lt;/div&amp;gt; 
       """

      #display the front end aspect
      st.markdown(html_temp, unsafe_allow_html = True) 

     #following lines create boxes in which user can enter 
     data 
     required to make prediction 
     Gender = st.selectbox('Gender',("Male","Female"))
     Married = st.selectbox('Marital Status', 
     ("Unmarried","Married")) 
     ApplicantIncome = st.number_input("Applicants monthly 
     income") 
     LoanAmount = st.number_input("Total loan amount")
     Credit_History = st.selectbox('Credit_History',("Unclear 
     Debts","No Unclear Debts"))
     result =""

    #when 'Predict' is clicked, make the prediction and store 
    it 
    if st.button("Predict"): 
       result = prediction(Gender, Married, ApplicantIncome, 
       LoanAmount, Credit_History) 
       st.success('Your loan is {}'.format(result))
       print(LoanAmount)

   if __name__=='__main__': 
        main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is the entire python script which will create the app for us. Let me break it down and explain in detail:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6xo6pl502qs15c7l7pn.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz6xo6pl502qs15c7l7pn.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this part, we save the script as app.py and load the required libraries: pickle to load the trained model and streamlit to build the app. Then we load the trained model and store it in a variable named classifier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45qh2ozkjr469v5ajzhx.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45qh2ozkjr469v5ajzhx.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we have defined the prediction function. This function takes the data provided by users as input and makes the prediction using the model we loaded earlier. It takes customer details like gender, marital status, income, loan amount, and credit history, pre-processes that input so that it can be fed to the model, and finally makes the prediction using the model loaded as classifier. In the end, it returns whether the loan is approved or not based on the output of the model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8kwz1ddhwhf4qrg8qnb.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd8kwz1ddhwhf4qrg8qnb.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And here is the main app. First of all, we are defining the header of the app. It will display “Streamlit Loan Prediction ML App”. To do that, we are using the markdown function from streamlit. Next, we are creating five boxes in the app to take input from the users. These 5 boxes will represent the five features on which our model is trained. &lt;/p&gt;

&lt;p&gt;The first box is for the gender of the user. The user will have two options, Male and Female, and they will have to pick one from them. We are creating a dropdown using the selectbox function of streamlit. Similarly, for Married, we are providing two options, Married and Unmarried and again, the user will pick one from it. Next, we are defining the boxes for Applicant Income and Loan Amount.&lt;/p&gt;

&lt;p&gt;Since both of these variables will be numeric in nature, we are using the number_input function from streamlit. And finally, for the credit history, we are creating a dropdown which will have two categories, Unclear Debts, and No Unclear Debts. &lt;/p&gt;

&lt;p&gt;At the end of the app, there will be a predict button, and after filling in the details, users have to click that button. Once that button is clicked, the prediction function will be called and the resulting Loan Status will be displayed in the app. This completes the web app creation part. And you must have noticed that everything we did is in Python. Isn’t it awesome?&lt;/p&gt;

&lt;p&gt;This part is for running the app on your local machine, not the actual deployment.&lt;br&gt;
I will be explaining the actual deployment in my next article.&lt;/p&gt;

&lt;p&gt;First, run the .py file from its directory in your command prompt:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    streamlit run loan_prediction.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will generate a link, something like this:&lt;br&gt;
 Local URL: &lt;a href="http://localhost:8501" rel="noopener noreferrer"&gt;http://localhost:8501&lt;/a&gt;&lt;br&gt;
 Network URL: &lt;a href="http://192.168.43.47:8501" rel="noopener noreferrer"&gt;http://192.168.43.47:8501&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note that the link will vary at your end. You can click on the link which will take you to the web app:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvda6hcaryb4ssg5ixoa.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvda6hcaryb4ssg5ixoa.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see, we first have the name displayed at the top. Then we have 5 different boxes that will take input from the user and finally, we have the predict button. Once the user fills in the details and clicks on the Predict button, they will get the status of their loan whether it is approved or rejected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8p2qichvwvcxgxa1t0a0.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8p2qichvwvcxgxa1t0a0.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And it is as simple as this to build and deploy your machine learning models using Streamlit.&lt;/p&gt;


&lt;p&gt;Link to part 2 of the article: &lt;a href="https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-2-103a"&gt;https://dev.to/codinghappinessweb/how-i-deployed-my-first-machine-learning-model-using-streamlit-part-2-103a&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can view the app via &lt;a href="https://share.streamlit.io/codinghappiness-web/loanprediction/main/loan_prediction.py/" rel="noopener noreferrer"&gt;Streamlit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can access the dataset on &lt;a href="https://github.com/codinghappiness-web/streamlit_app.py/blob/main/train_u6lujuX_CVtuZ9i.csv" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And my Jupyter notebook on &lt;a href="https://github.com/codinghappiness-web/streamlit_app.py/blob/main/Loan%20Prediction.ipynb" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>streamlit</category>
      <category>modeldeployment</category>
    </item>
    <item>
      <title>Analysing Dataset Using Naive Bayes Classifier</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Mon, 26 Apr 2021 19:44:09 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/analysing-dataset-using-naive-bayes-classifier-3d7o</link>
      <guid>https://dev.to/codinghappinessweb/analysing-dataset-using-naive-bayes-classifier-3d7o</guid>
      <description>&lt;p&gt;In this article, we will discuss several things related to Naive Bayes Classifier including:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Introduction to Naive Bayes.&lt;/li&gt;
&lt;li&gt;Naive Bayes with Scikit-Learn.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Introduction to Naive Bayes
&lt;/h2&gt;

&lt;p&gt;The Naive Bayes classifier is a supervised classification algorithm in machine learning. It is based on the Bayes Theorem created by Thomas Bayes, so we must first understand the Bayes Theorem before using the Naive Bayes Classifier.&lt;/p&gt;

&lt;p&gt;The essence of the Bayes theorem is conditional probability: the probability that something will happen, given that something else has already occurred. Using conditional probability, we can find the probability that an event will occur given knowledge of a previous event.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FD1ch5Fk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ficbozr51lzhxdg1x12a.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FD1ch5Fk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ficbozr51lzhxdg1x12a.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P(A|B) = Posterior probability: the probability of A given the value of B.&lt;/li&gt;
&lt;li&gt;P(B|A) = Likelihood: the probability of B given that A is true.&lt;/li&gt;
&lt;li&gt;P(A) = Prior probability: the probability of event A.&lt;/li&gt;
&lt;li&gt;P(B) = Marginal probability: the probability of event B.&lt;/li&gt;
&lt;/ul&gt;
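
&lt;p&gt;The theorem can be checked with a small worked example (the numbers below are made up purely for illustration):&lt;/p&gt;

```python
# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a = 0.01         # prior: probability of event A
p_b_given_a = 0.9  # likelihood: probability of B given A
p_b = 0.05         # marginal: probability of event B

# posterior probability of A given B
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # approximately 0.18
```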

&lt;p&gt;By using the basis of the Bayes theorem, the Naive Bayes Classifier formula can be written as follows :&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pHRy5gdg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzd65norqb2xhsy711a7.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pHRy5gdg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bzd65norqb2xhsy711a7.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;P(y | x1, … , xj) = Posterior probability: the probability that the data belongs to class y given its features x1 through xj.&lt;/li&gt;
&lt;li&gt;P(x1, … , xj | y) = Likelihood of the feature values given that the class is y.&lt;/li&gt;
&lt;li&gt;P(y) = Prior probability.&lt;/li&gt;
&lt;li&gt;P(x1, … , xj) = Marginal probability.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the marginal probability is the same for every class in the Naive Bayes calculation, we can ignore it. The Naive Bayes classifier then assigns a data point to the class with the greatest posterior probability.&lt;/p&gt;
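
&lt;p&gt;With the marginal dropped, classification reduces to comparing prior × likelihood products and picking the class with the largest value. A minimal sketch with made-up numbers (the class labels and probabilities below are purely illustrative):&lt;/p&gt;

```python
# unnormalized posterior per class: P(y) * P(x1|y) * ... * P(xj|y)
# (the marginal P(x1, ..., xj) is the same for every class, so it is dropped)
priors = {"y1": 0.6, "y2": 0.4}
likelihoods = {"y1": [0.2, 0.5], "y2": [0.7, 0.3]}  # per-feature likelihoods, assumed independent

scores = {}
for label in priors:
    score = priors[label]
    for lk in likelihoods[label]:
        score *= lk  # naive independence: multiply feature likelihoods
    scores[label] = score

# assign the point to the class with the greatest (unnormalized) posterior
prediction = max(scores, key=scores.get)
print(prediction)  # "y2", since 0.4*0.7*0.3 beats 0.6*0.2*0.5
```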

&lt;h2&gt;
  
  
  2. Naive Bayes with Scikit-Learn
&lt;/h2&gt;

&lt;p&gt;Now that we know how to calculate the Naive Bayes Classifier algorithm manually, we can easily use Scikit-learn, one of the Python libraries used for implementing machine learning. Here I am using the Gaussian Naive Bayes Classifier, and the dataset I use is glass classification, which you can download via the link at the end of this article.&lt;/p&gt;

&lt;p&gt;The steps in solving the Classification Problem using Naive Bayes Classifier are as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load the library&lt;/li&gt;
&lt;li&gt;Load the dataset&lt;/li&gt;
&lt;li&gt;Visualize the data&lt;/li&gt;
&lt;li&gt;Handling missing values&lt;/li&gt;
&lt;li&gt;Exploratory Data Analysis (EDA)&lt;/li&gt;
&lt;li&gt;Modelling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;1. Load several Python libraries that will be used to work on this case:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 import seaborn as sns
 from sklearn.naive_bayes import GaussianNB
 from sklearn.metrics import accuracy_score
 from sklearn.model_selection import train_test_split
 import warnings
 warnings.filterwarnings('ignore')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;2. Load the dataset that will be used in working on this case. The dataset used is a glass dataset:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  glass=pd.read_csv("glass.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;3. Look at some general information from the data to find out its characteristics in general:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  #Top five of our data
  glass.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KVSeLe6N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k9yyvimk3nwrsyyi77ia.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KVSeLe6N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k9yyvimk3nwrsyyi77ia.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  #Last five of our data
  glass.tail() 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9FqiHuRv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wz5s2q2ykh0l0ca2ooj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9FqiHuRv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wz5s2q2ykh0l0ca2ooj3.png" alt="image"&gt;&lt;/a&gt;&lt;br&gt;
:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  #Viewing the number of rows (214) and number of columns / 
  features (10)
  glass.shape
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lWiY2r38--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zfdgkrrcrjcadb1odiao.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lWiY2r38--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zfdgkrrcrjcadb1odiao.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4. Handle missing values in the data, if there are any; if not, we can proceed to the next stage:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  #Data is clean and can continue to the Explorary Data 
  Analysis stage
  glass.isnull().sum()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9e0v91_9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b6flmxrxyqkho56ukp88.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9e0v91_9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/b6flmxrxyqkho56ukp88.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;5. Exploratory Data Analysis to find out more about the characteristics of the data:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; #Univariate analysis Type (Target features).
 sns.countplot(df['Type'], color='red')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_U4xx4XO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mjhrxx2vl59yf424o6iy.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_U4xx4XO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mjhrxx2vl59yf424o6iy.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;6. Model our data with Gaussian Naive Bayes from Scikit-Learn:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#Create a Naive Bayes object
nb = GaussianNB()
#Create variable x and y.
x = glass.drop(columns=['Type'])
y = glass['Type']
#Split data into training and testing data 
x_train, x_test, y_train, y_test = train_test_split(x, y, 
test_size=0.2, random_state=4)
#Training the model
nb.fit(x_train, y_train)
#Predict testing set
y_pred = nb.predict(x_test)
#Check performance of model
print(accuracy_score(y_test, y_pred))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FHSdeZmg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6dreuqqfibdst7ebugd3.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FHSdeZmg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6dreuqqfibdst7ebugd3.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the accuracy score, we can see that the value is 48%, which in my opinion still needs to be improved.&lt;br&gt;
From my analysis, the accuracy of the Naive Bayes model is so low because of imbalanced data. So one of the ways I will use to improve the accuracy of my model is data balancing.&lt;/p&gt;
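
&lt;p&gt;One simple way to balance the data (a sketch only, not code from my notebook) is random oversampling: duplicate rows of the minority classes until every class has as many rows as the largest one. The toy rows below are hypothetical stand-ins for the glass dataset:&lt;/p&gt;

```python
import random

random.seed(0)

# toy labelled rows standing in for the glass dataset: (features, Type)
rows = [([1.5, 13.0], 1)] * 6 + [([1.6, 12.0], 2)] * 2  # class 1 dominates class 2

# group rows by class label
by_class = {}
for row in rows:
    by_class.setdefault(row[1], []).append(row)

# oversample each minority class up to the size of the largest class
target = max(len(group) for group in by_class.values())
balanced = []
for group in by_class.values():
    balanced.extend(group)
    balanced.extend(random.choices(group, k=target - len(group)))

counts = {label: sum(1 for r in balanced if r[1] == label) for label in by_class}
print(counts)  # {1: 6, 2: 6} -- every class now has 6 rows
```

A model would then be trained on the balanced rows instead of the raw ones; note that oversampling should only ever be applied to the training split, never the test split.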

&lt;p&gt;Link to the dataset: &lt;a href="https://www.kaggle.com/uciml/glass/download"&gt;https://www.kaggle.com/uciml/glass/download&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>supervisedlearning</category>
      <category>algorithms</category>
    </item>
    <item>
      <title>Analysing Dataset Using KNN</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Sat, 27 Mar 2021 08:20:00 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/analysing-dataset-using-knn-276n</link>
      <guid>https://dev.to/codinghappinessweb/analysing-dataset-using-knn-276n</guid>
      <description>&lt;p&gt;In this article, I will explain a classification model in detail which is a major type of supervised machine learning. The model we will work on is called a KNN classifier as the title says.&lt;br&gt;
The KNN classifier is a very popular and well-known supervised machine learning technique. This article will explain the KNN classifier with a simple but complete project.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a supervised learning model?
&lt;/h1&gt;

&lt;p&gt;I will explain it in detail. &lt;br&gt;
But here is what Wikipedia has to say:&lt;br&gt;
Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labelled training data consisting of a set of training examples.&lt;br&gt;
Supervised learning models take input features (X) and output (y) to train a model. The goal of the model is to learn a function that can take the input features and calculate the output.&lt;br&gt;
I will show a practical example with a real dataset.&lt;/p&gt;

&lt;h1&gt;
  
  
  KNN Classifier
&lt;/h1&gt;

&lt;p&gt;The KNN classifier is an example of a memory-based machine learning model.&lt;br&gt;
That means this model memorizes the labelled training examples and uses them to classify objects it hasn’t seen before.&lt;br&gt;
The k in KNN classifier is the number of training examples it will retrieve in order to predict a new test example.&lt;/p&gt;

&lt;p&gt;KNN classifier works in three steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When it is given a new instance or example to classify, it retrieves the training examples it memorized before and finds the k closest examples.&lt;/li&gt;
&lt;li&gt;Then the classifier looks up the labels of those k closest examples.&lt;/li&gt;
&lt;li&gt;Finally, the model combines those labels to make a prediction. Usually, it predicts the majority label.&lt;/li&gt;
&lt;/ol&gt;
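&lt;p&gt;The three steps above can be sketched in a few lines of plain Python (a toy illustration of the idea, not scikit-learn's implementation; all names and data are made up):&lt;/p&gt;

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, x, k=3):
    # Step 1: retrieve the k closest memorized examples (Euclidean distance).
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    # Step 2: look up the labels of those k examples.
    labels = [train_y[i] for i in nearest]
    # Step 3: combine the labels by majority vote.
    return Counter(labels).most_common(1)[0][0]

train_X = [[0, 0], [0, 1], [5, 5], [6, 5]]
train_y = ['a', 'a', 'b', 'b']
print(knn_predict(train_X, train_y, [5, 6]))  # → b
```

&lt;p&gt;The query point [5, 6] sits near the two 'b' examples, so two of its three nearest neighbours vote 'b'.&lt;/p&gt;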

&lt;h1&gt;
  
  
  Data Preparation
&lt;/h1&gt;

&lt;p&gt;Before we start, I encourage you to check if you have the following resources available in your computer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Numpy Library&lt;/li&gt;
&lt;li&gt;Pandas Library&lt;/li&gt;
&lt;li&gt;Matplotlib Library&lt;/li&gt;
&lt;li&gt;Scikit-Learn Library&lt;/li&gt;
&lt;li&gt;Jupyter Notebook environment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Download the dataset. I provided the link at the bottom of the page. Run every line of code yourself if you are reading this to learn.&lt;br&gt;
First, import the necessary libraries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  %matplotlib notebook
  import numpy as np
  import matplotlib.pyplot as plt
  import pandas as pd
  from sklearn.model_selection import train_test_split
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For this tutorial, I will use the Titanic dataset from Kaggle.&lt;/p&gt;

&lt;p&gt;Here is how I can import the dataset in the notebook using pandas.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      titanic = pd.read_csv('titanic_data.csv')
      titanic.head()
      #titaninc.head() gives the first five rows of the 
      dataset.  
      #we will print first five rows only to examine the 
      dataset.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo77k495cz36ksvtdslkb.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo77k495cz36ksvtdslkb.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look at the second column. It contains the information on whether the person survived or not.&lt;br&gt;
0 means the person did not survive and 1 means the person survived.&lt;br&gt;
For this tutorial, our goal will be to predict the ‘Survived’ feature.&lt;br&gt;
This dataset is very simple. Just from intuition, we can see that there are columns that are unlikely to be important for predicting the ‘Survived’ feature.&lt;br&gt;
For example, ‘PassengerId’, ‘Name’, ‘Ticket’, and ‘Cabin’ do not seem useful for predicting whether a passenger survived.&lt;br&gt;
I will make a new DataFrame with a few key features and name the new DataFrame titanic1.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         titanic1 = titanic[['Pclass', 'Sex', 'Fare', 
         'Survived']]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The ‘Sex’ column has string values, and that needs to change, because the model works with numbers, not words. I will replace ‘male’ with 0 and ‘female’ with 1.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         titanic1['Sex'] = titanic1.Sex.replace({'male':0, 
         'female':1})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is what the DataFrame titanic1 looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5krtax2cxxst5g1c6uiz.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5krtax2cxxst5g1c6uiz.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our goal is to predict the ‘Survived’ parameter based on the other information in the titanic1 DataFrame. So, the output variable or label (y) is ‘Survived’. The input features (X) are ‘Pclass’, ‘Sex’, and ‘Fare’.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    X = titanic1[['Pclass', 'Sex', 'Fare']]
    y = titanic1['Survived']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  KNN Classifier Model
&lt;/h1&gt;

&lt;p&gt;To start with, we need to split the dataset into two sets: &lt;br&gt;
a training set and a test set.&lt;br&gt;
We will use the training set to train the model, where the model will memorize both the input features and the output variable.&lt;br&gt;
Then we will use the test set to see if the model can predict whether a passenger survived using ‘Pclass’, ‘Sex’, and ‘Fare’.&lt;br&gt;
The method ‘train_test_split’ will help split the data. By default, this function uses 75% of the data for the training set and 25% for the test set. If you want, you can change that by specifying ‘train_size’ and ‘test_size’.&lt;br&gt;
If you set train_size to 0.8, the split will be 80% training data and 20% test data. But for me the default 75% is good, so I am not using the train_size or test_size parameters.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           X_train, X_test, y_train, y_test = 
           train_test_split(X, y, random_state=0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
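&lt;p&gt;As a quick sanity check on the default 75/25 split, here is a toy example (the eight rows of data are made up):&lt;/p&gt;

```python
from sklearn.model_selection import train_test_split

# Eight toy rows: by default 75% (6 rows) go to training, 25% (2 rows) to test.
X = [[i] for i in range(8)]
y = [0, 1, 0, 1, 0, 1, 0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(len(X_train), len(X_test))  # → 6 2
```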

&lt;p&gt;Remember to use the same value for ‘random_state’. That way, every time you do this split, it will take the same data for the training set and test set.&lt;br&gt;
I chose random_state as 0. You can choose a number of your choice.&lt;br&gt;
Python’s scikit-learn library already has a KNN classifier model.&lt;br&gt;
I will import that.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         from sklearn.neighbors import KNeighborsClassifier
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Save this classifier in a variable.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        knn = KNeighborsClassifier(n_neighbors = 5)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here, n_neighbors is 5.&lt;/p&gt;

&lt;p&gt;That means when we ask our trained model to predict the survival chance of a new instance, it will take the 5 closest training examples.&lt;br&gt;
Based on the labels of those 5 training examples, the model will predict the label of the new instance.&lt;br&gt;
Now, I will fit the training data to the model so that the model can memorize it.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              knn.fit(X_train, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;You may think that, since it memorized the training data, it can predict the labels of 100% of the training features correctly. But that’s not certain. Why?&lt;br&gt;
Whenever we give it an input and ask it to predict the label, it takes a vote among the 5 closest neighbors, even if it has the exact same feature memorized.&lt;br&gt;
Let’s see how much accuracy it can give us on the training data.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           knn.score(X_train, y_train)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The training data accuracy I got is 0.83 or 83%.&lt;/p&gt;

&lt;p&gt;Remember, we have a test dataset that our model has never seen. Now check how accurately it can predict the labels of the test dataset.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;knn.score(X_test, y_test)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The accuracy came out to be 0.78 or 78%.&lt;/p&gt;

&lt;h1&gt;
  
  
  Congrats! You developed a KNN classifier!
&lt;/h1&gt;

&lt;p&gt;Notice that the training set accuracy is a bit higher than the test set accuracy. That’s a sign of overfitting.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is Overfitting?
&lt;/h1&gt;

&lt;p&gt;In a single sentence: when a model performs noticeably better on the training set than on the test set, we call it overfitting. The model has fit the training data too closely and does not generalize well to unseen data.&lt;/p&gt;

&lt;h1&gt;
  
  
  Prediction
&lt;/h1&gt;

&lt;p&gt;If you want to see the predicted output for the test dataset, here is how to do that:&lt;/p&gt;

&lt;p&gt;Input:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;        y_pred = knn.predict(X_test)
        y_pred
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5l3jcqe2ua7ggwd2u6b.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5l3jcqe2ua7ggwd2u6b.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Or you can just input one single example and find the label.&lt;br&gt;
I want to see whether a person traveling in ‘Pclass’ 3, whose ‘Sex’ is female (that is, 1), and who paid a ‘Fare’ of 25 could survive as per our model.&lt;br&gt;
Input:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;      knn.predict([[3, 1, 25]])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Remember to use two brackets, because it requires a 2D array.&lt;br&gt;
Output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;     array([0], dtype=int64)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The output is zero. That means as per our trained model the person could not survive.&lt;/p&gt;

&lt;p&gt;Please feel free to try with more different inputs like this one!&lt;/p&gt;

&lt;h1&gt;
  
  
  If You Want to See Some Further Analysis of KNN Classifier
&lt;/h1&gt;

&lt;p&gt;The KNN classifier is highly sensitive to the choice of ‘k’ or n_neighbors. In the example above I used n_neighbors = 5.&lt;br&gt;
For different n_neighbors, the classifier will perform differently.&lt;br&gt;
Let’s check how it performs on the training dataset and test dataset for different n_neighbors values. I chose 1 to 20.&lt;br&gt;
Now, we will calculate the training set accuracy and the test set accuracy for each n_neighbors value from 1 to 20.&lt;/p&gt;

&lt;p&gt;Input:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;         training_accuracy  = []  
         test_accuracy = []
         for i in range(1, 21):
             knn = KNeighborsClassifier(n_neighbors = i)
             knn.fit(X_train, y_train)
             training_accuracy.append(knn.score(X_train, 
             y_train))
             test_accuracy.append(knn.score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;After running this code snippet, I got the training and test accuracy for different n_neighbors.&lt;br&gt;
Now, let’s plot the training and test set accuracy against n_neighbors in the same plot.&lt;br&gt;
Input:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          plt.figure()
          plt.plot(range(1, 21), training_accuracy, 
          label='Training Accuarcy')
          plt.plot(range(1, 21), test_accuracy, label='Testing 
          Accuarcy')
          plt.title('Training Accuracy vs Test Accuracy')
          plt.xlabel('n_neighbors')
          plt.ylabel('Accuracy')
          plt.ylim([0.7, 0.9])
          plt.legend(loc='best')
          plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspqhts9rq3ht263f5twj.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fspqhts9rq3ht263f5twj.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Analyze the graph above.&lt;br&gt;
In the beginning, when n_neighbors was 1, 2, or 3, training accuracy was a lot higher than test accuracy, so the model was overfitting heavily.&lt;br&gt;
After that, training and test accuracy became closer. That is the sweet spot. We want that.&lt;br&gt;
But when n_neighbors went even higher, both training and test set accuracy went down. We do not need that.&lt;br&gt;
From the graph above, the best n_neighbors for this particular dataset and model should be 6 or 7.&lt;/p&gt;
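&lt;p&gt;Picking the best k from the accuracy lists can also be done programmatically. A sketch, using made-up test accuracies in place of the loop's real output (index 0 corresponds to k = 1):&lt;/p&gt;

```python
# Made-up test accuracies standing in for the loop's output; index 0 is k = 1.
test_accuracy = [0.72, 0.74, 0.76, 0.78, 0.79, 0.81, 0.81, 0.79, 0.78]

# max over indices returns the first index with the highest accuracy.
best_k = max(range(len(test_accuracy)), key=lambda i: test_accuracy[i]) + 1
print(best_k)  # → 6
```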

&lt;h1&gt;
  
  
  That is a good classifier!
&lt;/h1&gt;

&lt;p&gt;Look at the graph above! When n_neighbors is about 7, both training and testing accuracy were above 80%.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This article’s purpose was to show a KNN classifier with a project. If you are a machine learning beginner this should help you learn some key concepts of machine learning and the workflow. There are so many different machine learning models out there. But this is the typical workflow of a supervised machine learning model.&lt;/p&gt;

&lt;p&gt;Here is the titanic dataset I used in the article:&lt;br&gt;
&lt;a href="https://www.kaggle.com/biswajee/titanic-dataset" rel="noopener noreferrer"&gt;https://www.kaggle.com/biswajee/titanic-dataset&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>WORKING WITH DATETIME FUNCTION IN PYTHON</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Mon, 25 Jan 2021 08:10:03 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/working-with-datetime-function-in-python-j3k</link>
      <guid>https://dev.to/codinghappinessweb/working-with-datetime-function-in-python-j3k</guid>
      <description>&lt;h1&gt;
  
  
  Dates for Python
&lt;/h1&gt;

&lt;p&gt;A date in Python is not a data type of its own, but to work with dates as date objects, we can import a module called datetime.&lt;br&gt;
The datetime module is part of Python’s standard library, so there is nothing to install; you simply import it.&lt;br&gt;
So, in this example I will show you how to import the datetime module and display the current date.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;       :
       import datetime
       x = datetime.datetime.now()
       print(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Date Output
&lt;/h1&gt;

&lt;p&gt;When we execute the code from the example above the result will be &lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;   :
   2021-01-20 14:01:32.454684
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Year, month, day, hour, minute, second, and microseconds are included in the date. The datetime module has several methods for returning information about the date object.&lt;/p&gt;

&lt;p&gt;Here are a few examples; you will learn more about them later in this chapter.&lt;br&gt;
Example:&lt;br&gt;
Return the year and the name of the weekday.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          :
          import datetime
          x = datetime.datetime.now()
          print(x.year)
          print(x.strftime("%A"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Date Output
&lt;/h1&gt;

&lt;p&gt;When we execute the code from the example above the result will be&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; :
 2021
 Wednesday
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Creating Date Objects
&lt;/h1&gt;

&lt;p&gt;You can use the datetime() class constructor of the datetime module to create a date. For creating a date, the datetime() class &lt;br&gt;
takes three parameters: year, month, day.&lt;br&gt;
Example:&lt;br&gt;
Create a date object.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;           :
           import datetime
           x = datetime.datetime(2021, 1, 20)
           print(x)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Date Output
&lt;/h1&gt;

&lt;p&gt;When we execute the code from the example above the result will be&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          :
          2021-01-20 00:00:00
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The datetime() class also takes time and timezone parameters (hour, minute, second, microsecond, tzinfo), but they are optional and default to 0 (None for tzinfo).&lt;/p&gt;
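&lt;p&gt;For example, passing some of the optional time parameters (the values here are arbitrary):&lt;/p&gt;

```python
import datetime

# Year, month, day plus the optional hour, minute, and second.
x = datetime.datetime(2021, 1, 20, 14, 30, 15)
print(x)  # → 2021-01-20 14:30:15
```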

&lt;h1&gt;
  
  
  The strftime() Method
&lt;/h1&gt;

&lt;p&gt;The datetime object has a method for formatting date objects into readable strings. The method is called strftime() and takes one format parameter that defines the format of the returned string.&lt;br&gt;
Example:&lt;br&gt;
Display the name of the month.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  :&lt;br&gt;
  import datetime&lt;br&gt;
  x = datetime.datetime(2021, 1, 1)&lt;br&gt;
  print(x.strftime("%B"))&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  Date Output
&lt;/h1&gt;

&lt;p&gt;When we execute the code from the example above the result will be&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  :&lt;br&gt;
 January&lt;br&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h1&gt;
  
  
  A reference of all the legal format codes:
&lt;/h1&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Directive&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;th&gt;Example&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;%a&lt;/td&gt;&lt;td&gt;Weekday, short version&lt;/td&gt;&lt;td&gt;Wed&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%A&lt;/td&gt;&lt;td&gt;Weekday, full version&lt;/td&gt;&lt;td&gt;Wednesday&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%w&lt;/td&gt;&lt;td&gt;Weekday as a number 0-6, 0 is Sunday&lt;/td&gt;&lt;td&gt;3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%d&lt;/td&gt;&lt;td&gt;Day of month 01-31&lt;/td&gt;&lt;td&gt;31&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%b&lt;/td&gt;&lt;td&gt;Month name, short version&lt;/td&gt;&lt;td&gt;Dec&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%B&lt;/td&gt;&lt;td&gt;Month name, full version&lt;/td&gt;&lt;td&gt;December&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%m&lt;/td&gt;&lt;td&gt;Month as a number 01-12&lt;/td&gt;&lt;td&gt;12&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%y&lt;/td&gt;&lt;td&gt;Year, short version, without century&lt;/td&gt;&lt;td&gt;18&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%Y&lt;/td&gt;&lt;td&gt;Year, full version&lt;/td&gt;&lt;td&gt;2018&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%H&lt;/td&gt;&lt;td&gt;Hour 00-23&lt;/td&gt;&lt;td&gt;17&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%I&lt;/td&gt;&lt;td&gt;Hour 00-12&lt;/td&gt;&lt;td&gt;05&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%p&lt;/td&gt;&lt;td&gt;AM/PM&lt;/td&gt;&lt;td&gt;PM&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%M&lt;/td&gt;&lt;td&gt;Minute 00-59&lt;/td&gt;&lt;td&gt;41&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%S&lt;/td&gt;&lt;td&gt;Second 00-59&lt;/td&gt;&lt;td&gt;08&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%f&lt;/td&gt;&lt;td&gt;Microsecond 000000-999999&lt;/td&gt;&lt;td&gt;548513&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%z&lt;/td&gt;&lt;td&gt;UTC offset&lt;/td&gt;&lt;td&gt;+0100&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%Z&lt;/td&gt;&lt;td&gt;Timezone&lt;/td&gt;&lt;td&gt;CST&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%j&lt;/td&gt;&lt;td&gt;Day number of year 001-366&lt;/td&gt;&lt;td&gt;365&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%U&lt;/td&gt;&lt;td&gt;Week number of year, Sunday as the first day of week, 00-53&lt;/td&gt;&lt;td&gt;52&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%W&lt;/td&gt;&lt;td&gt;Week number of year, Monday as the first day of week, 00-53&lt;/td&gt;&lt;td&gt;52&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%c&lt;/td&gt;&lt;td&gt;Local version of date and time&lt;/td&gt;&lt;td&gt;Mon Dec 31 17:41:00 2018&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%x&lt;/td&gt;&lt;td&gt;Local version of date&lt;/td&gt;&lt;td&gt;12/31/18&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%X&lt;/td&gt;&lt;td&gt;Local version of time&lt;/td&gt;&lt;td&gt;17:41:00&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%%&lt;/td&gt;&lt;td&gt;A % character&lt;/td&gt;&lt;td&gt;%&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%G&lt;/td&gt;&lt;td&gt;ISO 8601 year&lt;/td&gt;&lt;td&gt;2018&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%u&lt;/td&gt;&lt;td&gt;ISO 8601 weekday (1-7)&lt;/td&gt;&lt;td&gt;1&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;%V&lt;/td&gt;&lt;td&gt;ISO 8601 week number (01-53)&lt;/td&gt;&lt;td&gt;01&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
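&lt;p&gt;Several directives can be combined in one format string; for example, using the same 2018 date as the reference above:&lt;/p&gt;

```python
import datetime

x = datetime.datetime(2018, 12, 31, 17, 41)
# Combine weekday, day, month, year, hour, and minute in one format string.
print(x.strftime("%A, %d %B %Y %H:%M"))  # → Monday, 31 December 2018 17:41
```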

</description>
    </item>
    <item>
      <title>TIPS FOR BEGINNERS IN PYTHON</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Mon, 18 Jan 2021 09:06:28 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/tips-for-beginners-in-python-ipf</link>
      <guid>https://dev.to/codinghappinessweb/tips-for-beginners-in-python-ipf</guid>
      <description>&lt;h1&gt;
  
  
  WHAT IS PYTHON?
&lt;/h1&gt;

&lt;p&gt;Python is an open-source, interpreted, object-oriented, high-level programming language with dynamic semantics, with applications in numerous areas, including web programming, scientific computing, and artificial intelligence. &lt;/p&gt;

&lt;h1&gt;
  
  
  CHARACTERISTICS OF PYTHON
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;  It has a large standard library&lt;/li&gt;
&lt;li&gt;  It is used in Databases&lt;/li&gt;
&lt;li&gt;  It is used for web scraping&lt;/li&gt;
&lt;li&gt;  Python can be used to develop games&lt;/li&gt;
&lt;li&gt;  It is used for machine learning&lt;/li&gt;
&lt;li&gt;  It is used for Data analytics&lt;/li&gt;
&lt;li&gt;  It is used for web framework &lt;/li&gt;
&lt;li&gt;  It is used for Graphical User Interface&lt;/li&gt;
&lt;li&gt;  It is used for networking and documentation e. t. c.&lt;/li&gt;
&lt;li&gt;  It is simple and powerful&lt;/li&gt;
&lt;li&gt;  Above all it is easy and fun to learn&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  BRIEF HISTORY OF PYTHON
&lt;/h1&gt;

&lt;p&gt;Python was created by Guido van Rossum. Funny enough, Python was not named after the snake; it was named after the British comedy group Monty Python.&lt;br&gt;
The first release of Python was in 1991, version 0.9.0.&lt;br&gt;
In 2000, Python 2.0 was released.&lt;br&gt;
Python 3 was released in December 2008. Although Python 2 and 3 are similar, they have subtle differences; the most noticeable is the print statement.&lt;br&gt;
For example, print "Hello World" works in Python 2, but in Python 3 it raises an error because the argument must be in parentheses: print("Hello World").&lt;/p&gt;

&lt;h1&gt;
  
  
  USEFUL RESOURCES TO LEARN PYTHON
&lt;/h1&gt;

&lt;p&gt;If you decide to learn Python in 2021, then here are some of the useful Python books, courses, and tutorials to start your journey in the beautiful world of Python.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Complete Python MasterClass&lt;/li&gt;
&lt;li&gt;The Python Bible — Everything You Need to Program in Python&lt;/li&gt;
&lt;li&gt;Python Fundamentals by Pluralsight&lt;/li&gt;
&lt;li&gt;10 Free Python Programming EBooks and PDF&lt;/li&gt;
&lt;li&gt;Traversy Media on YouTube: Brad's beginner Python course explains things in a way you understand clearly, and it is practical, so you will see how the code actually works.&lt;/li&gt;
&lt;li&gt;Sololearn: Sololearn is a great app for learning how to code. Its Python course is really amazing and covers all you need to know about Python as a beginner; there are questions and answers too.&lt;/li&gt;
&lt;li&gt;Udemy: a good place to learn too; there are both paid and free Python courses on the Udemy platform.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  REASONS WHY YOU SHOULD LEARN PYTHON
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Simplicity - This is the single biggest reason for beginners to learn Python. When you first start with programming and coding, you don’t want a language with tough syntax and weird rules. Python is both readable and simple. It is also easier to set up; you don’t need to deal with classpath problems like Java or compiler issues like C++. Just install Python and you are done. While installing, it will also ask you to add Python to PATH, which means you can run Python from anywhere on your machine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multipurpose - One of the things I like about Python is its Swiss Army knife nature. It’s not tied to just one thing, unlike R, which is good for data science and machine learning but nowhere when it comes to web development. Learning Python means you can do many things.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jobs and Growth - Python is growing really fast, and it makes a lot of sense to learn a growing major programming language if you are just starting your programming career.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Salary - Python developers are among the highest-paid developers, particularly in data science, machine learning, and web development. On average, they are very well paid, ranging from 70,000 USD to 150,000 USD depending on experience, location, and domain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Huge Community - You need a community to learn a new technology, and friends are your biggest asset when learning a programming language. You often get stuck on one issue or another, and that is when you need a helping hand. Thanks to Google, you can find the solution to your Python-related problem in minutes. Communities like StackOverflow also bring many Python experts together to help newcomers.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  TIPS TO GETTING STARTED WITH PYTHON
&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Dedication&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Make friends with experts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Teach&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Build projects&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Contribute to open source&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rest: very important&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Closing Notes:&lt;br&gt;
Thanks, you made it to the end of the article. Good luck with your Python journey! It’s certainly a great decision and will pay off a lot in the near future.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to send emails using python</title>
      <dc:creator>Omale Happiness Ojone</dc:creator>
      <pubDate>Mon, 18 Jan 2021 08:38:44 +0000</pubDate>
      <link>https://dev.to/codinghappinessweb/how-to-send-emails-using-python-3e63</link>
      <guid>https://dev.to/codinghappinessweb/how-to-send-emails-using-python-3e63</guid>
      <description>&lt;p&gt;What are emails?&lt;br&gt;
Emails are messages distributed by electronic means from one computer user to another. There can be many more recipient as well via network.&lt;/p&gt;

&lt;p&gt;Methods we can use to send an email:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We can use Python web automation with Selenium.&lt;/li&gt;
&lt;li&gt;We can use Python’s SMTP library.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For this article I will explain how to use the SMTP library.&lt;br&gt;
SMTP stands for Simple Mail Transfer Protocol. The smtplib module defines an SMTP client session object that can be used to send an email to any other internet machine with an SMTP or ESMTP listener daemon.&lt;/p&gt;

&lt;h1&gt;
  
  
  Here's the full code
&lt;/h1&gt;

&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import smtplib
server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
server.login('example@gmail.com', 'password')
server.sendmail('example@gmail.com', 'contact1@gmail.com', 'Hi happiness, how are you?')
server.quit()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Quickly what you should note:&lt;br&gt;
"&lt;a href="mailto:example@gmail.com"&gt;example@gmail.com&lt;/a&gt;"-your email address should be there&lt;br&gt;
"password"-the password to your email address&lt;br&gt;
"&lt;a href="mailto:contact1@gmail.com"&gt;contact1@gmail.com&lt;/a&gt;"-the receiver's email address.&lt;br&gt;
then you go ahead with the body of the message.&lt;/p&gt;
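&lt;p&gt;A slightly fuller sketch builds the message with the standard-library email.message module, so you can set a subject line (the addresses and subject are placeholders; sending still uses smtplib as above):&lt;/p&gt;

```python
from email.message import EmailMessage

# Build the message object; the addresses here are placeholders.
msg = EmailMessage()
msg['From'] = 'example@gmail.com'
msg['To'] = 'contact1@gmail.com'
msg['Subject'] = 'Greetings'
msg.set_content('Hi happiness, how are you?')

print(msg['Subject'])  # → Greetings

# To actually send it:
# import smtplib
# with smtplib.SMTP_SSL('smtp.gmail.com', 465) as server:
#     server.login('example@gmail.com', 'password')
#     server.send_message(msg)
```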

&lt;p&gt;So here's the output&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SgxDPJ-K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1610347510124/y6QB8bxJZ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SgxDPJ-K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn.hashnode.com/res/hashnode/image/upload/v1610347510124/y6QB8bxJZ.jpeg" alt="gmail ans.JPG" width="468" height="258"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, you have to enable "less secure app access" in your Google account in order to send the message.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
