DEV Community: Nitin Kendre

Python : Linear search and Binary Search

Nitin Kendre — Fri, 31 May 2024 19:13:54 +0000

Practicing Python From Basics

Linear Search:

Linear search, also known as sequential search, checks each element in a collection one by one until the target element is found or the end of the collection is reached.
It's a simple but inefficient algorithm, especially for large datasets, as it has a time complexity of O(n) in the worst case.
Linear search is applicable to both sorted and unsorted collections.

Implementation

def linear_search(key, arr):
    # Check each element in turn and return the index of the first match
    for index in range(len(arr)):
        if arr[index] == key:
            return index

    # -1 signals "not found"; 0 cannot be used because it is a valid index
    return -1

Calling function

arr = [5,8,2,10,3,6]
key = 3
result = linear_search(key,arr)

if result != -1:
    print(f'Element {key} found at index {result}')

else:
    print("Element not found")

Element 3 found at index 4

2nd calling

key1 = 7
result1 = linear_search(key1,arr)

if result1 != -1:
    print(f'Element {key1} found at index {result1}')

else:
    print("Element not found")

Element not found

The linear_search function takes a target value key and a list arr.
It iterates through each element of the list using a for loop.
For each element, it checks if it matches the target value.
If a match is found, it returns the index of the element. If not found, it returns -1 rather than 0, because 0 is a valid index (a match at the first position).
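One case worth testing explicitly is a match at the first position, because index 0 is falsy in Python and is easy to confuse with "not found". The sketch below is a variant written with enumerate; the name find_index and the -1 sentinel are choices made for this illustration.

```python
def find_index(key, items):
    """Return the index of the first item equal to key, or -1 if absent."""
    for index, item in enumerate(items):
        if item == key:
            return index
    return -1  # sentinel that can never be a valid index


values = [5, 8, 2, 10, 3, 6]
first = find_index(5, values)     # match at the first position
missing = find_index(7, values)   # value that is not in the list

# A truthiness test would treat the match at index 0 as "not found",
# so compare against the sentinel instead.
print(first, bool(first), missing)
```

Comparing with `!= -1` (or returning None and checking `is not None`) keeps a match at index 0 from being reported as missing.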

Binary Search

Binary search is a more efficient algorithm for finding a target value within a sorted array.
It repeatedly divides the search interval in half until the target is found or the interval is empty.
Binary search has a time complexity of O(log n), making it significantly faster than linear search for large datasets.
It requires the array to be sorted beforehand.

Implementation

def binary_search(key, arr):
    # arr must already be sorted in ascending order
    start, end = 0, len(arr)-1

    while start<=end:
        mid = (start+end)//2

        if arr[mid] == key:
            return mid

        elif arr[mid]<key:
            start = mid+1

        else:
            end = mid-1

    # -1 signals "not found"; 0 cannot be used because it is a valid index
    return -1

Calling Binary search

arr = [2, 4, 6, 8, 10, 12, 14, 16]
key = 12
result = binary_search(key,arr)

if result != -1:
    print(f'Element {key} found at index {result}')

else:
    print("Element not found")

Element 12 found at index 5

2nd Calling

key = 1
result = binary_search(key,arr)

if result != -1:
    print(f'Element {key} found at index {result}')

else:
    print("Element not found")

Element not found

The binary_search function takes a target value key and a sorted array arr.
It initializes the start and end pointers to the first and last index of the array, respectively.
It repeatedly calculates the mid index and compares the element at mid with the key.
Based on the comparison, it moves the start or end pointer to narrow down the search interval.
It continues until the target is found or the search interval is empty, returning the index of the target, or -1 if it is not found (0 is a valid index, so it cannot mean "not found").
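Python's standard library also ships a binary search helper in the bisect module. The sketch below wraps bisect_left to get the same "index or -1" behaviour; the wrapper name binary_search_bisect is made up for this example.

```python
from bisect import bisect_left


def binary_search_bisect(key, sorted_items):
    """Return the index of key in sorted_items, or -1 if it is absent."""
    # bisect_left returns the leftmost position where key could be inserted
    pos = bisect_left(sorted_items, key)
    if pos < len(sorted_items) and sorted_items[pos] == key:
        return pos
    return -1


data = [2, 4, 6, 8, 10, 12, 14, 16]
print(binary_search_bisect(12, data))  # 5
print(binary_search_bisect(2, data))   # 0
print(binary_search_bisect(1, data))   # -1
```

As with the hand-written version, the input list has to be sorted for the result to be meaningful.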

Recursion : Python

Nitin Kendre — Thu, 30 May 2024 18:45:03 +0000

Recursion

Recursion is a programming technique where a function calls itself in order to solve a problem.
The recursive approach breaks a problem down into smaller, more manageable sub-problems of the same type.

Components of Recursion:

Base Case:

The condition under which the recursion terminates.
It prevents infinite recursion by providing a simple, non-recursive solution to the smallest instance of the problem.

Recursive Case:
- The part of the function where the function calls itself with a modified argument, gradually approaching the base case.

Types of Recursion:

Direct Recursion:
- A function calls itself directly.
Indirect Recursion:
- A function calls another function, which in turn calls the original function.

Example (Direct Recursion)

Factorial Function:

def factorial(number):
    # base case
    if number == 1 or number == 0:
        return 1
    else:
        # Recursive case
        return number*factorial(number-1)

Factorial of 5

fact = factorial(5)
print(f"factorial of 5 is : {fact}")

factorial of 5 is : 120

Here, factorial(5) calls factorial(4), which calls factorial(3), and so on, until it reaches factorial(1).

Example (Indirect Recursion)

# variable to count how many times function called
function_call_count = 0

def functionA():
    # Declare that this function modifies the global counter.
    global function_call_count 

    # Counting function call
    function_call_count += 1
    print("Printing from functionA().")

    # Base case: stop after five calls; otherwise the two functions would keep calling each other forever.
    if function_call_count == 5:
        return

    # function call
    functionB()

def functionB():
    print("Printing from functionB().")

    # Function call
    functionA()

# Calling functionA()
functionA()

Printing from functionA().
Printing from functionB().
Printing from functionA().
Printing from functionB().
Printing from functionA().
Printing from functionB().
Printing from functionA().
Printing from functionB().
Printing from functionA().

Advantages of Recursion:

Simplifies the code for problems that can naturally be divided into similar sub-problems, like tree traversals, and certain mathematical computations (e.g., Fibonacci sequence).
Provides a clear and straightforward approach to solve problems like backtracking and divide-and-conquer algorithms.
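As a small illustration of the Fibonacci case mentioned above, here is a minimal recursive sketch. The lru_cache decorator from functools is an addition for this example: it stores results that were already computed, so each value is calculated only once.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fibonacci(n):
    """Return the n-th Fibonacci number, with fibonacci(0) == 0."""
    if n < 2:  # base cases: fibonacci(0) = 0 and fibonacci(1) = 1
        return n
    # recursive case: the problem is split into two smaller sub-problems
    return fibonacci(n - 1) + fibonacci(n - 2)


print([fibonacci(i) for i in range(10)])  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```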

Disadvantages of Recursion:

Can lead to high memory usage due to the depth of the call stack, especially if the recursion depth is large.
May result in slower performance compared to iterative solutions because of the overhead of multiple function calls.
Risk of stack overflow if the base case is not properly defined or if the problem size is too large.
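In Python the stack overflow risk shows up as a RecursionError once the call depth passes the interpreter's recursion limit. The sketch below triggers it on purpose; the helper name depth is made up for this example.

```python
import sys


def depth(n):
    """Recurse n levels deep and return the depth reached."""
    if n == 0:
        return 0
    return 1 + depth(n - 1)


limit = sys.getrecursionlimit()  # maximum call depth allowed by the interpreter

try:
    depth(limit + 100)  # deliberately deeper than the limit
    overflowed = False
except RecursionError:
    overflowed = True

print(limit, overflowed)
```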

Tail Recursion:

A special form of recursion where the recursive call is the last operation in the function.
In some languages the compiler or interpreter can optimize tail recursion so that the call stack does not grow, effectively converting the recursion into iteration internally.

Example

def tail_recursive_factorial(number, result = 1):
    if number == 1 or number == 0:
        return result

    # Recursive case (only recursive case)
    return tail_recursive_factorial(number-1, result*number)

%%time
tail_recursive_factorial(5)

CPU times: total: 0 ns

Wall time: 0 ns

120

Most important: Python does not optimize tail recursion. Some languages do not perform this optimization, and Python is one of them, so the tail-recursive version above still uses one stack frame per call.

Conclusion

Recursion is a powerful tool in programming, enabling elegant solutions for complex problems by breaking them down into simpler sub-problems.
However, it requires careful implementation to manage memory and performance effectively.

Unleashing the Potential of Random Forest Regression : A Python Implementation Guide with Hyperparameter Tuning.

Nitin Kendre — Fri, 09 Jun 2023 11:19:23 +0000

In machine learning, regression is a widely used technique for predicting continuous numerical values, and random forest regression is one of the most versatile and effective regression algorithms. In this article we will learn how to implement random forest regression in Python.

Random Forest Regressor

Random Forest Regressor is an ensemble learning algorithm that combines decision trees with the concept of randomness. It belongs to the family of supervised learning algorithms. During training it creates multiple decision trees, and it combines the predictions of all trees to give the final output.

The name Random Forest comes from two concepts: randomness and forests. A forest is a collection of trees, which this model creates by generating multiple decision trees and combining them.

A single decision tree uses all of the data to build one tree and predicts with it. A random forest instead builds as many trees as the user requests through n_estimators, and produces its output by combining the outputs of all these trees.

How Random Forest Regressor Works

Data Preparation :
The first step is preparing the training data. Each data point should be represented by a set of features and a corresponding numerical target value.
Generating Trees :
In this step the algorithm creates as many trees as the user requests through the n_estimators parameter. Each tree is built on a random sample drawn from the training dataset. The sample is drawn with replacement, so the same data point can be picked more than once.
Growing The Trees :
Each tree repeatedly splits its data into smaller groups based on the chosen features. Splitting continues until the data cannot be divided further or a stopping condition is met. The aim is to make the target values within each group as similar as possible.
Combining Predictions :
After building all the trees, we combine their predictions by adding them up and taking the average. This is called aggregation. For example, when predicting the price of a house, we take the average of the prices predicted by all the trees.
This combined prediction is usually more accurate than relying on just one tree.
Missing Values :
Some random forest implementations can handle missing values natively, while others expect them to be imputed beforehand, so check the library you use. At prediction time each decision tree makes its own prediction, and the forest averages the results.
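The sampling and averaging steps above can be sketched in a few lines of plain Python. This is only an illustration under strong simplifications: each "tree" is replaced by a stand-in that predicts the mean of its bootstrap sample, and the function name bagged_prediction and the price values are made up.

```python
import random
from statistics import mean


def bagged_prediction(targets, n_estimators=100, seed=42):
    """Average the predictions of n_estimators stand-in models.

    Each stand-in is "trained" on a bootstrap sample and predicts the mean
    target of that sample. A real decision tree would split on features.
    """
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_estimators):
        # Bootstrap: sample with replacement, same size as the training set
        sample = rng.choices(targets, k=len(targets))
        predictions.append(mean(sample))
    # Aggregation: average the individual predictions
    return mean(predictions)


prices = [100.0, 120.0, 90.0, 150.0, 110.0]  # made-up house prices
print(bagged_prediction(prices))
```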

Hyper Parameter Tuning

Hyper parameters control the behavior of the algorithm and must be set before the training process starts. Tuning them can change the performance of the model. We will discuss two important hyper parameters and their tuning here.

n_estimators : This parameter sets the number of decision trees in the random forest. Increasing the number of trees can improve the performance of the model, but it also increases the computational cost. We can try different values and pick the one that gives the best results.
max_depth : It represents the maximum depth of decision tree. A deeper tree can capture more complex relationships in the data but can also lead to overfitting. By tuning this parameter, we can find the right balance between model complexity and generalization.

To tune these parameters we can use Grid Search, Random Search, or Bayesian Optimization.

Grid Search involves deciding the set of values for each hyper parameter and exhaustively evaluating all possible combinations.

Random Search randomly samples combinations of hyperparameters and evaluates their performance.

Bayesian Optimization uses a probabilistic model to search for promising hyperparameters.
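To make the difference between grid search and random search concrete, the sketch below enumerates a small grid with itertools.product and then draws a few combinations at random. The grid values and the sample size of 4 are chosen only for illustration.

```python
import itertools
import random

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10],
}

# Grid search: every combination of the listed values
names = list(param_grid)
grid = [dict(zip(names, values))
        for values in itertools.product(*param_grid.values())]
print(len(grid))  # 3 * 3 = 9 combinations

# Random search: only a fixed number of combinations, drawn at random
rng = random.Random(0)
sampled = rng.sample(grid, k=4)
print(len(sampled))  # 4
```

With two parameters the saving is small, but the number of grid combinations multiplies with every extra parameter, while the random search budget stays fixed.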

For a brief explanation and more information on hyper parameter tuning, you can refer to this Link

Implementation of Random Forest Regressor using Python

To implement random forest regression we will use the sklearn library, which provides a wide set of tools for machine learning tasks.

Step 1 : Importing Necessary Libraries

First we will import the necessary libraries for loading and manipulating the data.



import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

Step 2 : Loading the data

We will use salary data for this example.



sal_d = pd.read_csv('Position_Salaries.csv')

Our data will look like this -

Step 3 : Creating x and y variables

x will be the independent variable and y will be the dependent variable.



x = sal_d.iloc[:,1:-1].values
y = sal_d.iloc[:,-1].values

Step 4 : Hyper Parameter Tuning

In this step we will use GridSearchCV for hyper parameter tuning.



## Importing required libraries.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error

## Defining param grid 
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 5, 10]
}

## Creating a random forest regressor object
rf_regressor = RandomForestRegressor(random_state=42)
# Perform grid search for hyperparameter tuning
grid_search = GridSearchCV(rf_regressor, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(x, y)

# Getting the best parameters and best score
best_params = grid_search.best_params_
best_score = -grid_search.best_score_

print("Best Parameters:", best_params)
print("Best Score (MSE):", best_score)

Step 5 : Training the Model

In this step we will use the tuned parameters from above and the whole dataset to train our model.



rf_regressor_best = RandomForestRegressor(n_estimators=best_params['n_estimators'], max_depth=best_params['max_depth'], random_state=42)

The code above creates a model using the parameters tuned in the previous step.
Below is the code to fit it on the whole dataset.



# Fitting the model to the whole dataset using the best parameters

rf_regressor_best.fit(x, y)

Step 6 : Predictions.

In this step we will predict the salary for a new input (level 6.5) that is not in the dataset.



rf_regressor_best.predict([[6.5]])

Conclusion

By following these steps, anyone can implement a random forest regressor in Python. There are also more than two hyperparameters; tuning the others can improve the model further.

I implemented all these steps in Python myself. The theory comes from the internet and a Udemy course, written in my own words. If there is a mistake above, please comment.

Decision Tree Regression : A Comprehensive Guide with Python Code Examples and Hyperparameter Tuning

Nitin Kendre — Mon, 05 Jun 2023 06:51:00 +0000

Decision Tree regression is a popular and powerful regression algorithm. But to get its full potential you have to do hyperparameter tuning, which means choosing the parameter values that best fit the data and predict correctly. In this article we will focus mainly on the implementation in Python. We will also learn some hyperparameter tuning techniques.

For more information on Decision tree Regression you can refer to this blog by Ashwin Prasad - Link.

Decision Tree Regression

Decision Tree Regression builds a tree-like structure by splitting the data based on the values of various features. Put simply, it creates different subsets of the data. To predict a new sample, it uses the average value of the target variable in the leaf node that the sample falls into. In principle it handles both categorical and continuous variables, making it a versatile algorithm for regression tasks.

But in some Python libraries, such as sklearn, decision tree regression cannot handle categorical variables directly. So we have to encode them with a suitable encoder, chosen according to the data or model.
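The splitting idea can be illustrated with the smallest possible tree: a single split (a "stump") on one feature. The sketch below is not how sklearn implements it; the function names fit_stump and predict_stump and the salary values are made up for this example.

```python
from statistics import mean


def fit_stump(xs, ys):
    """Find the single threshold on xs that minimizes the squared error
    when each side of the split predicts its own mean target."""
    best = None
    points = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct x values
    for low, high in zip(points, points[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((y - left_mean) ** 2 for y in left)
                 + sum((y - right_mean) ** 2 for y in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    # The prediction is the average target of the leaf the sample falls into
    return left_mean if x <= threshold else right_mean


levels = [1, 2, 3, 4, 5, 6]
salaries = [10, 12, 11, 50, 52, 51]  # made-up values for illustration
stump = fit_stump(levels, salaries)
print(stump)
print(predict_stump(stump, 2.5), predict_stump(stump, 5.5))
```

A full decision tree repeats this search inside each resulting group until a stopping condition such as max_depth or min_samples_leaf is reached.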

Implementation Using Python

We will use the sklearn library for the implementation.
As a first step we will import the necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Next we will read the dataset. In this article we will be using a simple salary dataset.

## Reading the data
sal_d = pd.read_csv('Position_Salaries.csv')

sal_d ## this line will print the data in jupyter notebook

Our data will look like -

Now we will create the dependent and independent variables.

x = sal_d.iloc[:,1:-1].values
y = sal_d.iloc[:,-1].values

In the next step we will train our regression model using the variables above. Our dataset is very small, so we train the model on the entire data.

Data preparation or preprocessing involves various steps; you can find all of them in this article - Link

Below is the code for training our model -


## Importing 
from sklearn.tree import DecisionTreeRegressor

## Creating model
reg = DecisionTreeRegressor(random_state=21)

## training our model
reg.fit(x,y)

Now we will make a prediction using created model.

reg.predict([[6.5]])

Output of above -

array([150000.])

Now we will visualize the predictions of our model. For higher resolution we will create x_grid, a finely spaced range of levels, so that the plotted line shows each step of the tree clearly.

## For smooth line
x_grid = np.arange(x.min(), x.max(), 0.01)  # x is a 2D array, so use its own min/max
x_grid = x_grid.reshape(len(x_grid),1)

## this will plot points on chart
plt.scatter(x,y,color='red')

## this will plot the line connecting to points
plt.plot(x_grid,reg.predict(x_grid),color='blue')

## This will give title to our plot
plt.title('actual vs predict')

## this will give label to x axis
plt.xlabel('level')

### this will give label to y axis
plt.ylabel('salary')

## This line will save our plot as image on our computer
plt.savefig('decision_tree_regression.png',bbox_inches='tight')

## And this line is for showing the chart and ending.
plt.show()

Output of above plot code is below -

Hyperparameter Tuning

Hyperparameter tuning means selecting the best values for the parameters of a machine learning algorithm. It involves searching and evaluating different combinations of parameters to maximize the performance of the model.

To enhance the performance of decision tree regression we can tune its parameters using sklearn tools such as GridSearchCV and RandomizedSearchCV.

Grid Search

Grid search is a method to find the best set of values for different options by trying out all possible combinations.

Below is the code for implementing GridSearchCV -


## importing class from library
from sklearn.model_selection import GridSearchCV

## Setting candidate values for the parameters.
param_grid = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

## creating instance
grid_search = GridSearchCV(DecisionTreeRegressor(), param_grid, cv=5)

## fitting data (the whole dataset, because no train/test split was created above)
grid_search.fit(x, y)

## getting best parameters 
best_params = grid_search.best_params_
print("Best Hyperparameters:", best_params)

Using these parameters we can retrain our model and improve its performance.

Randomized Search

Randomized search is a way to find the best values for different parameters by randomly trying out a subset of possible combinations, which makes the search process faster.

Below is the code for implementing RandomizedSearchCV -


## importing class from library
from sklearn.model_selection import RandomizedSearchCV

## Setting candidate values for the parameters.
param = {
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

## creating instance
random_search = RandomizedSearchCV(DecisionTreeRegressor(random_state=25), param, n_iter=10, cv=5)

## fitting data (the whole dataset, because no train/test split was created above)
random_search.fit(x, y)

## getting best parameters 
best_params = random_search.best_params_
print("Best Hyperparameters:", best_params)

These parameters can also be used to improve the performance of our model.

Of these two methods, RandomizedSearchCV is usually faster, because GridSearchCV fits the model for every combination in the grid (27 here), while RandomizedSearchCV only tries n_iter randomly chosen combinations (10 here).

The actual difference also depends on the size of the dataset and of the parameter grid.

There are various other methods for searching for the best model parameters. I have personally implemented these two, so I explained them here as I learned them.

Conclusion

In this article we learned how to implement decision tree regression in Python. We also learned some techniques for hyperparameter tuning, namely GridSearchCV and RandomizedSearchCV.

I wrote all the code myself, so if anyone finds a mistake in it, please leave a comment.

Thank You!

Support Vector Regression (SVR) Using Python : A practical approach to Predictive Modeling

Nitin Kendre — Fri, 02 Jun 2023 07:57:59 +0000

Introduction :

Support Vector Regression (SVR) is a powerful algorithm for solving regression problems. It is the regression counterpart of Support Vector Machines (SVM) and, with a suitable kernel, can model nonlinear relationships between variables.

In this article we will learn how to implement it using python language.

Understanding SVR :

The goal of SVR is to find the hyperplane that best fits the data points while allowing a margin of tolerance for errors. While traditional regression models try to minimize every error, SVR ignores errors that fall inside this margin. As a result, only the support vectors, the data points on or outside the margin, significantly affect the fitted model.
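The margin of tolerance is commonly expressed as the epsilon-insensitive loss: errors smaller than epsilon cost nothing, and larger errors are penalized by how far they exceed epsilon. A minimal sketch, where the function name and the epsilon value of 0.1 are chosen for illustration:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Errors smaller than epsilon cost nothing; larger ones grow linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


inside = epsilon_insensitive_loss(1.0, 1.05)   # error 0.05, within the margin
outside = epsilon_insensitive_loss(1.0, 1.5)   # error 0.5, outside the margin
print(inside, outside)
```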

For more information on SVR you can refer to this blog post LINK.

Implementing SVR Using Python :

We will implement the SVR algorithm using the sklearn library in Python.

Below are the steps of implementation -

Step 1 : Importing Necessary Libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2 : Loading the Dataset and preparing it

We discussed all the data preprocessing steps in a previous article. You can refer to it for more data preparation tools LINK


salary_data = pd.read_csv('Position_Salaries.csv')

## Below line will print first 10 rows from data.
salary_data.head(10)

Creating Variables

We will create the dependent and independent variables in this step.

## Creating independent variable (column 0 holds the position names, so it is skipped)
x = salary_data.iloc[:, 1:-1].values

## Creating dependent variable
y = salary_data.iloc[:, -1].values

Step 3 : Feature Scaling

It is an important step in machine learning that brings all features in similar scales. It ensures that no single feature dominates the learning process due to differences in their magnitude. By scaling features, algorithms can converge faster and perform better, leading to more accurate and reliable models.

Below is the code example to do feature scaling.

from sklearn.preprocessing import StandardScaler
x_sc = StandardScaler()
y_sc = StandardScaler()

Here y is a 1D array, but StandardScaler expects a 2D array, so we have to reshape it. Below is the code for reshaping

y_res = y.reshape(len(y),1)

The code below fits the scalers and transforms the x and y variables to a similar scale.


x_scld = x_sc.fit_transform(x)
y_scld = y_sc.fit_transform(y_res)
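Standardization itself is simple arithmetic: subtract the mean and divide by the standard deviation. The sketch below does it by hand with the statistics module, assuming the population standard deviation; the helper names are made up for this example and it also shows the inverse step.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return the mean and population standard deviation of values."""
    return mean(values), pstdev(values)


def transform(values, center, scale):
    # z = (value - mean) / standard deviation
    return [(v - center) / scale for v in values]


def inverse_transform(scaled, center, scale):
    # undo the scaling: value = z * standard deviation + mean
    return [s * scale + center for s in scaled]


levels = [1.0, 2.0, 3.0, 4.0, 5.0]
center, scale = fit_scaler(levels)
scaled = transform(levels, center, scale)
restored = inverse_transform(scaled, center, scale)
print(scaled)
print(restored)
```

After scaling, the values have mean 0 and standard deviation 1, and the inverse step recovers the original values.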

Step 4 : Training The SVR Model

In this step we will use above scaled data to train our model.

## Importing SVR
from sklearn.svm import SVR

## Creating SVR model
svr_rbf = SVR(kernel='rbf') ## here `rbf` stands for Radial Basis Function kernel

## Training SVR model (ravel() turns the 2D scaled target back into the 1D array that fit expects)
svr_rbf.fit(x_scld, y_scld.ravel())

Here rbf is one of the kernels available for the SVR model. To learn more you can refer to this Link
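For intuition, the RBF kernel measures how similar two points are: it equals 1 for identical points and decays towards 0 as they move apart. A minimal sketch of the usual formula K(a, b) = exp(-gamma * ||a - b||^2), where the function name and the gamma value of 1.0 are chosen for illustration:

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """K(a, b) = exp(-gamma * ||a - b||^2) for two equal-length vectors."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points
near = rbf_kernel([0.0], [1.0])             # distance 1
far = rbf_kernel([0.0], [3.0])              # distance 3
print(same, near, far)
```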

Step 5 : Predicting New Result

In this step we predict the salary for a new input value (level 6.5).


result = y_sc.inverse_transform(svr_rbf.predict(x_sc.transform([[6.5]])).reshape(-1,1))

print(result)

Here the inverse_transform() method converts the scaled prediction back to the original salary scale.

Step 6 : Visualizing the result and data with Smooth Curve

## This line will create an array from the min to the max value of the data in steps of 0.1
x_grid = np.arange(x.min(), x.max(), 0.1)

## This line will reshape above 1D array to 2D array.
x_grid = x_grid.reshape(len(x_grid),1)

## This line will plot the scatter plot for x and y variable
plt.scatter(x,y,color='red')

## This line will connect the points using curve line
plt.plot(x_grid,y_sc.inverse_transform(svr_rbf.predict(x_sc.transform(x_grid)).reshape(-1,1)),color='blue')

## This line will give the title to our plot
plt.title('actual vs predicted (SVR) smooth')

## this line label the x axis
plt.xlabel('Level')

## this line will label the y axis
plt.ylabel('Salary')

## This line will help to show the plot
plt.show()

I implemented all of the steps above myself, after learning them online. So if anyone finds a mistake, please leave a comment. I will be happy to fix it.

Conclusion :

SVR combined with data preprocessing can provide accurate predictions for regression tasks. By following the steps above, anyone can implement it in Python.

Polynomial Regression with Python: A Flexible Approach for Non-Linear Curve Fitting

Nitin Kendre — Thu, 01 Jun 2023 14:25:50 +0000

Polynomial Regression is a powerful technique for modeling the relationship between a dependent variable and one or more independent variables. It extends simple and multiple Linear Regression by allowing more flexible curve fitting.

In this article we will look at some of the theory behind it and how to implement it in Python.

1. Understanding the Polynomial Regression

Introduction to Polynomial Regression :

The goal of polynomial regression is to fit a polynomial equation to a given set of data points.

Below is the polynomial equation for a single independent variable :


y = b0 + b1 * x + b2 * x^2 + ... + bn * x^n

where,

y is the dependent variable.
x is the independent variable.
b0 is the intercept, and b1, b2, ..., bn are the coefficients of the polynomial terms.
n denotes the degree of polynomial

Advantages and Limitations :

Polynomial Regression offers several advantages over simple Linear Regression -

Flexibility : By using polynomial terms, we can capture non-linear relationships between variables that cannot be represented by a straight line.
Higher Order Trends : Polynomial Regression can capture higher-order trends in the data, which can lead to more accurate predictions.
Interpretability : The polynomial terms provide insights into the relationships between variables.

But Polynomial Regression also has some limitations -

Overfitting : High-degree polynomials can easily overfit the training data, which results in poor generalization to unseen data.
Computational Complexity : Increasing the degree of the polynomial increases the number of terms in the model, which makes it more computationally expensive.

2. Implementing Polynomial Regression using Python :

We will use the sklearn library to implement Polynomial Regression. sklearn provides a simple and efficient way to build machine learning models.

Below are the steps to implement the Polynomial Regression -

step 1 : Data Preparation

In this step, we will load the data and preprocess it for training our model. This step also ensures that the data is in the required format.


## Importing Libraries

import numpy as np
import pandas as pd

## Loading the dataset

sal_data = pd.read_csv("Position_Salaries.csv")

## creating dependent and independent variables

x = sal_data.iloc[:,1:-1].values
y = sal_data.iloc[:,-1].values

step 2 : Feature Engineering

In this step we will transform our independent variable into polynomial terms.

Below is the code for generating polynomial terms -


from sklearn.preprocessing import PolynomialFeatures
pf = PolynomialFeatures(degree=3)
x_poly = pf.fit_transform(x)
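For a single feature, the transformation above turns each value x into the row [1, x, x^2, x^3]. The plain-Python sketch below reproduces that idea by hand; the function name polynomial_terms is made up for this example.

```python
def polynomial_terms(x, degree=3):
    """Return [1, x, x^2, ..., x^degree] for a single feature value x."""
    return [x ** power for power in range(degree + 1)]


levels = [1, 2, 3]
x_poly_rows = [polynomial_terms(level) for level in levels]
print(x_poly_rows)  # [[1, 1, 1, 1], [1, 2, 4, 8], [1, 3, 9, 27]]
```

Fitting an ordinary linear regression on these columns is what makes the fitted curve a polynomial in x.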

step 3 : Model Training

In this step we will train our model using the polynomial terms generated above.

Below is the code for training -


from sklearn.linear_model import LinearRegression
lr2 = LinearRegression()
lr2.fit(x_poly,y)

step 4 : Evaluating the model

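We will use the r2_score function from the sklearn library to compute R squared and evaluate the model. Note that the score here is computed on the same data the model was trained on.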


from sklearn.metrics import r2_score
y_pred = lr2.predict(x_poly)

r2 = r2_score(y, y_pred)

print("r_squared : ",r2)

step 5 : Visualizing results from Polynomial Regression

In this step we will plot the original values as a scatter plot and the predicted values as a line.

import matplotlib.pyplot as plt

plt.scatter(x,y, color='red')
plt.plot(x,lr2.predict(x_poly),color='blue')
plt.xlabel('level')
plt.ylabel('Salary')
plt.title('Polynomial')
plt.show()

I learned all these steps and implemented them in Python myself. So if you find any mistake, please leave a comment.

Conclusion :

Remember to carefully choose the degree of the polynomial to avoid overfitting and consider the computational complexity of higher degree polynomials.

Mastering Multiple Linear Regression: A Step-by-Step Implementation Guide with Python Code Examples

Nitin Kendre — Wed, 31 May 2023 10:16:28 +0000

Introduction :

Multiple Linear Regression is a statistical model used to find the relationship between a dependent variable and multiple independent variables. It helps us see how different variables contribute to the outcome or prediction. In this article we will see how to implement it in Python, from data preparation to model evaluation.

1. Understanding Multiple Linear Regression :

Simple linear regression has only one independent variable and one dependent variable. Multiple Linear Regression extends it by allowing any number of independent variables.

The general equation for Multiple Linear Regression is as follows -

y = β₀ + β₁x₁ + β₂x₂ + ... + βₚxₚ + ɛ

where,

y is the dependent variable
β₀ is the intercept (the value of y when all independent variables are zero)
β₁, β₂, ..., βₚ are the coefficients.
x₁, x₂, ..., xₚ are the independent variables.
ɛ represents error terms.

2. Data Preparation :

Data preparation is a fundamental step for any machine learning model, because the data fed to the model should be clean, free of missing values, and entirely numeric.

Below are some code examples -

## Importing Libraries
import numpy as np
import pandas as pd

## Loading Data
m_data = pd.read_csv("50_Startups.csv")

## Checking for missing values
m_data.isnull().sum()

## Creating dependent and independent variables.
x = m_data.iloc[:, :-1].values
y = m_data.iloc[:, -1].values

Encoding Categorical Variables :

Categorical values have to be encoded as numbers, because the model does not accept values such as strings or characters. In this article we will be using one hot encoding.

Refer to this link for more information on the one hot encoder.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(),[3])],remainder='passthrough')
x = np.array(ct.fit_transform(x))
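To show what one hot encoding produces, here is a hand-written sketch in plain Python: each category gets its own 0/1 column. The function name and the example category values are assumptions made for illustration.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))  # fixed column order
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


states = ["New York", "California", "Florida", "New York"]  # illustrative values
categories, encoded = one_hot_encode(states)
print(categories)  # ['California', 'Florida', 'New York']
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each row contains exactly one 1, marking the category of that sample.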

Splitting Dataset into training and test set :

By splitting the dataset into a training set and a test set, we can train our model on the training set and evaluate it on the test set. This is an important step if you do not have a separate testing dataset.

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=21)

For more data preparation tools, refer to this Link.

3. Model Training and Evaluation :

Now in this step we will train our multiple linear regression model using training set and evaluate it using test set.

Model Training -

from sklearn.linear_model import LinearRegression
mlr = LinearRegression()
mlr.fit(x_train,y_train)

Predicting -

y_pred = mlr.predict(x_test)

Model Evaluation -

We will use mean squared error (MSE) to evaluate our model.

from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test,y_pred)

To learn more about Mean Squared Error (MSE), refer to this Link
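Mean squared error is the average of the squared differences between actual and predicted values. A minimal hand-written sketch of that definition, with a made-up function name and made-up numbers:

```python
def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


actual_values = [1.0, 2.0, 3.0]
predicted_values = [1.0, 2.0, 5.0]
mse_value = mean_squared_error_manual(actual_values, predicted_values)
print(mse_value)  # (0 + 0 + 4) / 3
```

Because the errors are squared, a few large mistakes raise the MSE much more than many small ones.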

4. Interpreting Coefficients :

The coefficients in the multiple linear regression equation describe the relationship between each independent variable and the dependent variable. A positive coefficient indicates a positive relationship between the two variables, while a negative coefficient indicates a negative relationship.

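Below is the Python code that builds a data frame with one row per model input column and its coefficient. Because one hot encoding replaced the categorical column with several columns, the names are taken from the fitted ColumnTransformer rather than from m_data.columns, whose length would not match mlr.coef_.

coefficients = pd.DataFrame({"Variable": ct.get_feature_names_out(), "Coefficient": mlr.coef_})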
print(coefficients)

5. Model Evaluation using plots :

We can use various diagnostic plots to evaluate the performance of model or diagnose any issue.

Below is the python code for plot -

import matplotlib.pyplot as plt

# residuals vs Predicted values
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='r', linestyle='-')
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Values")
plt.show()

There are other diagnostic plots for models. You can refer to the links below for more information.

Conclusion :

We have learned how to implement Multiple Linear Regression in Python. Along the way we also covered data preparation, model evaluation and more.

So by following these steps and using code examples provided, you can easily implement Multiple Linear Regression in your own projects.

References :

sklearn: Linear Regression Documentation - Link.
sklearn: Mean Squared Error Documentation - Link
Seaborn Documentation for Various Diagnostic plots - Link.
Matplotlib Documentation for Visualization - Link.

From Data to Prediction : Mastering Simple Linear Regression with python

Nitin Kendre — Tue, 30 May 2023 10:30:00 +0000

Linear Regression is an essential statistical method used to analyze the relationship between two variables. In this article we will study the concept of simple linear regression and implement it using the Python language.

1. Introduction to Simple Linear Regression :

Simple Linear Regression is a statistical method which helps us find the relation between two variables. It focuses on exploring the connection between a dependent variable (which we are predicting) and an independent variable (which is used to predict).

The equation for simple linear regression is as follows :

y = b0 + b1 * x + e

Where,

y is the dependent variable.
x is the independent variable.
b0 is the intercept ( means value of y when value of x is zero ).
b1 is the slope.
e is the error.

The main goal of simple linear regression is to find the values of b0 and b1 that best fit the data.
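
As a rough illustration of what "best fit" means here, the least-squares values of b0 and b1 can be computed directly from the data. The sketch below uses plain Python; the function name fit_simple_linear_regression and the small hours/scores lists are made up for illustration and are not taken from student_scores.csv.

```python
def fit_simple_linear_regression(x, y):
    """Return (b0, b1) that minimise the sum of squared errors."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b1 = sxy / sxx
    # intercept: the fitted line passes through (x_mean, y_mean)
    b0 = y_mean - b1 * x_mean
    return b0, b1

# made-up data for illustration only (exactly y = 2 + 10 * x)
hours = [1, 2, 3, 4, 5]
scores = [12, 22, 32, 42, 52]

b0, b1 = fit_simple_linear_regression(hours, scores)
print("intercept b0 :", b0)
print("slope b1 :", b1)
```

Because the made-up points lie exactly on a line, the error term e is zero here; with real data the same formulas give the line with the smallest total squared error.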

For more theoretical knowledge refer this

2. Data preprocessing :

Data preprocessing is an essential step in any data analysis task. It includes data cleaning, transforming and preparing raw data for further analysis.

We discussed it in detail in a previous blog post. Please refer to :
7 Essential Techniques for Data Preprocessing Using Python: A Guide for Data Scientists.

3. Implementing Simple Linear Regression using python :

Step 1 : Importing Necessary Libraries

The first step in any code is importing the libraries needed for further analysis.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

Step 2 : Loading and preprocessing the data

For this model we will use a dataset of students which includes their study hours and their exam scores. This data is stored in a csv file named student_scores.csv .


from scipy import stats

# Loading the data in code
df = pd.read_csv("student_scores.csv")

# Checking for missing values in data
missing_values = df.isnull().sum()

# Handling missing values
df.fillna(0, inplace=True)

# Handling Outliers
sns.boxplot(x=df['Hours'])
plt.show()

# keep only rows whose 'Hours' value is within 3 standard deviations of the mean
df = df[np.abs(stats.zscore(df['Hours'])) < 3]

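# Encoding the categorical variables (only applies if the data has a categorical column such as 'Category')
if 'Category' in df.columns:
    df = pd.get_dummies(df, columns=['Category'])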

Step 3 : Creating dependent and independent variables

x = df.iloc[:, :-1].values  # independent variable
y = df.iloc[:, -1].values   # dependent variable

Step 4 : Splitting dataset in training and test set

Splitting the dataset helps in evaluating the model: after training on the training set, we can use the test set to check whether the model has been trained well.

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x, y, test_size=0.2, random_state=21)

Step 5 : Fitting the Regression model on data

The sklearn library has a class named LinearRegression, which is used to fit the model on the data.


from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)

Step 6 : Predicting

After fitting the model, we can use it to make predictions on new data. Let's test the model by predicting the exam score for a student who studies for 9.25 hours.

hours = 9.25
# predict expects a 2D input: one row, one feature
score = regressor.predict([[hours]])
print('Predicted Score:', score[0])

Output :

Predicted Score: 92.90985477015731

Step 7 : Model Evaluation

For model evaluation two values are important: the first is Mean Squared Error and the second is R-squared.

Below is the code to calculate these values -


from sklearn.metrics import mean_squared_error, r2_score

# making predictions on the test set
y_pred = regressor.predict(x_test)

# calculating mean squared error
mse = mean_squared_error(y_test,y_pred) 
print('mean_squared_error : ',mse)


# calculating r-squared value
r2 = r2_score(y_test,y_pred)
print("R-Squared value : ",r2)
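
To make these two metrics less of a black box, here is a minimal plain-Python sketch of the formulas behind them. The helper names mean_squared_error_manual and r2_score_manual and the actual/predicted lists are made up for illustration; in practice the sklearn functions above are the ones to use.

```python
def mean_squared_error_manual(y_true, y_pred):
    # average of the squared differences between actual and predicted values
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_score_manual(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    # squared error left over after using the model
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# made-up values for illustration only
actual = [20, 30, 40, 50]
predicted = [22, 28, 41, 49]

mse_manual = mean_squared_error_manual(actual, predicted)
r2_manual = r2_score_manual(actual, predicted)
print('mean_squared_error : ', mse_manual)
print('R-Squared value : ', r2_manual)
```

A lower MSE means the predictions are closer to the actual values, and an R-squared value close to 1 means the model explains most of the variation in the dependent variable.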

Conclusion :

In this article we have studied the concept of simple linear regression and implemented it using python language.

Remember, Linear regression is just one of many techniques available in regression analysis. Keep exploring and expanding your knowledge to unlock the full potential of regression analysis.

7 Essential Techniques for Data Preprocessing Using Python: A Guide for Data Scientists

Nitin Kendre — Tue, 30 May 2023 03:45:00 +0000

Data preprocessing is an important step in data science. It means cleaning, transforming and preparing raw data for analysis. Python is widely used for this purpose because it has a rich set of libraries and tools for data science and machine learning. In this blog we will see some steps in data preprocessing.

Below are some steps, simple enough for beginners to understand, that anyone can use in their practice or learning.

1. Importing Libraries :

The first step in any project is importing the necessary libraries, which will be used throughout the code. Below are some common libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Loading Data :

The next step is to load the data to process. In Python, pandas is a library which is used for this purpose. Pandas is a very powerful library for loading and processing data. To load data from a csv file there is a function named read_csv.

df = pd.read_csv("data.csv")

3. Handling Missing Values :

When we work on real world data, there might be some missing values in the files. Many algorithms cannot work on data that contains missing values, so this is one of the most important steps in data preprocessing: we need to handle all missing values before performing any analysis. pandas includes various functions/methods to handle missing values. There is a function named isnull which provides information about missing values in the data.

missing_values = df.isnull()

To fill in a value in place of the missing values, the fillna function can be used.

df.fillna(0,inplace=True)

Here inplace=True means the values will be overwritten in the dataset directly.

4. Handling outliers :

There may be some data points in a dataset which are totally different from the other data points. These data points are known as outliers. Finding them can be thought of as finding the odd one out. There is a function in the seaborn library named boxplot which can be used to visualize the distribution of data and identify the outliers.

sns.boxplot(x=df['column_name'])

The z-score is used to identify and remove the outliers. It measures how many standard deviations a data point is from the mean. The zscore function is inside the scipy library.

from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]
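
To show what this filter does for a single numeric column, here is a minimal plain-Python sketch. The helper names zscores and drop_outliers and the hours values are made up for illustration. It uses the population standard deviation, which matches the default of scipy's zscore (ddof=0).

```python
from statistics import mean, pstdev

def zscores(values):
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

def drop_outliers(values, threshold=3.0):
    # keep only the values whose absolute z-score is below the threshold
    return [v for v, z in zip(values, zscores(values)) if abs(z) < threshold]

# made-up data: 20 ordinary values and one extreme value
hours = [2, 3, 4, 5, 6] * 4 + [60]

cleaned = drop_outliers(hours)
print(len(hours), len(cleaned))
print(60 in cleaned)
```

Note that with very small datasets no point can reach a z-score of 3, so this rule only becomes useful once there are enough rows.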

For more information on outliers Click Here

5. Encoding the Categorical variables :

Categorical variables are data which represent categories or group membership. Mostly this data is in the form of strings or characters. But most machine learning models work only on numerical data, which is why we have to encode these variables in numerical format.

There are several methods for encoding categorical variables.
We will discuss the get_dummies method from the pandas library.
This method creates dummy variables for categorical data. Basically it converts categorical data into 0/1 columns, one for each category in that column.

df = pd.get_dummies(df,columns=['column_name'],dtype=float)

Demo :
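
As a small stand-in demo, the plain-Python sketch below mimics the idea of dummy variables: one 0/1 column per category. It is only an illustration of the concept, not how pandas implements get_dummies, and the one_hot helper and the grades values are made up.

```python
def one_hot(column_name, values):
    # one new column per distinct category
    categories = sorted(set(values))
    rows = []
    for v in values:
        rows.append({f"{column_name}_{c}": 1.0 if v == c else 0.0 for c in categories})
    return rows

# made-up categorical column
grades = ["A", "B", "A", "C"]

encoded = one_hot("grade", grades)
for row in encoded:
    print(row)
```

Each row has exactly one column set to 1.0, the one that matches its original category.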

For more Click Here

6. Feature Scaling :

For a model to work smoothly, the features should be on a comparable range. This is where feature scaling comes in. Using feature scaling, data points can be brought onto a common scale, like 0-1. There is a class named MinMaxScaler in the sklearn library, which can scale the features between 0 and 1.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['column_name']] = scaler.fit_transform(df[['column_name']])

The above code scales the data points of the given column to between 0 and 1.
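
The idea behind min-max scaling is the formula (x - min) / (max - min). A minimal plain-Python sketch of it is below; the min_max_scale helper and the hours values are made up for illustration, and returning zeros for a constant column is a choice made only for this sketch.

```python
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant column: nothing to scale (a choice made for this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# made-up data for illustration only
hours = [2, 4, 6, 10]

scaled = min_max_scale(hours)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```

The smallest value always maps to 0 and the largest to 1, with everything else spread proportionally in between.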

7. Feature Selection :

Feature selection involves choosing the most relevant features from a dataset, aiming to enhance the accuracy and efficiency of machine learning models. The sklearn library provides a class named SelectKBest which selects the top K features from the dataset using statistical tests. By opting for the most relevant features, we can optimize the performance of our models in terms of speed and accuracy.

from sklearn.feature_selection import SelectKBest, chi2
x = df.drop('target_column', axis=1)
y = df['target_column']

selection = SelectKBest(chi2, k=3)
x_new = selection.fit_transform(x,y)

Here chi2 means the chi-squared statistic, which is used to score how strongly each feature is related to the target. Features with higher chi2 scores are selected. k=3 is the number of features to select. Note that chi2 requires non-negative feature values.

For more information on chi2 - Click Here

Conclusion :

In conclusion, we can say that data preprocessing plays an important role in data science/Machine Learning. In this post we explored some fundamental techniques for data preprocessing using Python. By applying these techniques, we can clean, transform and prepare raw data for further analysis and modeling.

Data Preprocessing with Python: Essential Techniques for Cleaning and Transforming Data

Nitin Kendre — Sat, 06 May 2023 16:49:13 +0000

Data preprocessing is a crucial step in the data science pipeline. It involves cleaning, transforming, and preparing raw data for further analysis. Python is a popular programming language for data preprocessing because of its rich ecosystem of data science libraries.

In this blog post, we will explore some essential techniques for data preprocessing using Python.

1. Importing Libraries:

The first step in any data science project is importing the necessary libraries. For data preprocessing, we typically use the following libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

2. Loading Data

The next step is to load the data into Python. Pandas is a powerful library for loading and manipulating data. We can use the read_csv function to load data from a CSV file.

df = pd.read_csv('data.csv')

3. Handling Missing Values

Missing values are a common problem in real-world data. We need to handle missing values before we can perform any analysis. Pandas provides several functions to handle missing values. The isnull function returns a Boolean mask indicating which values are missing.

missing_values = df.isnull()

We can use the fillna function to replace missing values with a specified value.

df.fillna(0, inplace=True)

4. Handling Outliers

Outliers are data points that are significantly different from the other data points in the dataset. Outliers can have a significant impact on statistical models, so it is essential to handle them. We can use the boxplot function in Seaborn to visualize the distribution of data and identify outliers.

sns.boxplot(x=df['column_name'])

We can use the Z-score to identify and remove outliers. The Z-score measures how many standard deviations a data point is from the mean.

from scipy import stats
df = df[(np.abs(stats.zscore(df)) < 3).all(axis=1)]

5. Encoding Categorical Variables

Categorical variables are variables that take on a limited number of possible values. Machine learning algorithms typically require numeric input, so we need to encode categorical variables. We can use the get_dummies function in Pandas to convert categorical variables into a series of binary columns.

df = pd.get_dummies(df, columns=['column_name'])

6. Feature Scaling

Feature scaling is the process of scaling the values of the features in the dataset. Scaling is essential for algorithms that use distance-based metrics, such as K-nearest neighbors and support vector machines. We can use the MinMaxScaler class in Scikit-learn to scale the features between 0 and 1.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df[['column_name']] = scaler.fit_transform(df[['column_name']])

7. Feature Selection

Feature selection is the process of selecting the most relevant features from the dataset. Selecting the most relevant features can improve the accuracy and speed of the machine learning models. We can use the SelectKBest class in Scikit-learn to select the top k features based on statistical tests.

from sklearn.feature_selection import SelectKBest, chi2
X = df.drop('target_column', axis=1)
y = df['target_column']
selector = SelectKBest(chi2, k=3)
X_new = selector.fit_transform(X, y)

In conclusion, data preprocessing is a critical step in the data science pipeline. In this blog post, we explored some essential techniques for data preprocessing using Python. By following these techniques, we can clean, transform, and prepare raw data for further analysis.

Outliers

Nitin Kendre — Wed, 23 Feb 2022 21:18:10 +0000

My Simple Definition of an outlier is :

An outlier in a dataset is a value that is extremely high or extremely low compared to the other values in the dataset.

In the above graph we can see that values below 50 and greater than 150 are outliers.

Finding Outliers from a Dataset

An outlier satisfies any one of the conditions below.

1. outlier < Q1 - 1.5*(IQR)
2. outlier > Q3 + 1.5*(IQR)

where

IQR = Interquartile range
Q1 = Lower Quartile
Q2 = Median or 2nd Quartile
Q3 = Upper Quartile

From the above we can say that the 1st rule means the data point lies more than 1.5 times the IQR below the lower quartile, and the 2nd rule means the data point lies more than 1.5 times the IQR above the upper quartile.

To find the outliers we have to calculate Q1, Q2, Q3 and IQR first.

Finding the Upper, median, lower quartile and inter quartile range in an odd dataset:

Let's say we have -

3,5,1,4,2,6,7

The first step is to sort the given data in ascending order.

1,2,3,4,5,6,7

Now, here lowest value is 1 i.e. MIN and highest value is 7 i.e. MAX

Calculating Q2 in an odd dataset:

Now, Q2 means the median or quartile 2. In this step we will calculate it.

Our given data contains an odd number of values, i.e. 7.
So, we have to divide it into two equal parts, and there will be one middle value, i.e. 4.

(1,2,3),4,(5,6,7)

So here 4 is the median or Q2 value.

Now, to verify it, here is an alternate way to calculate it.

index of median = (total_no_of_values+1)/2

Here, (7+1)/2 = 4, which means the median is the number at place 4 in the sorted dataset.

So, Q2 = 4.

Calculating Q1 in an odd dataset:

Initial Dataset :

(1,2,3),4,(5,6,7)

So, to calculate the lower quartile or Q1 we have to take first half of the data. That is-

1,2,3

So, here also we have to pick the middle value i.e. 2

And Formula to calculate the Q1 is -

Q1_place = (total_number_of_values_in_first_half+1)/2
Q1_place = (3+1)/2
Q1_place = 2

That means, Q1 is at place 2 in data.
So, Q1 = 2.

Calculating Q3 in an odd dataset:

It is similar to calculating Q1, but instead of the first half we have to take the second half.
(1,2,3),4,(5,6,7)

5,6,7

So, the middle value is Q3 i.e. 6

Formula:

Q3_place = (total_no_of_values_in_last_half+1)/2
Q3_place = (3+1)/2
Q3_place = 2

That means, Q3 is at 2nd place in given half.
So, Q3 = 6.

Calculating IQR in an odd dataset:

Formula for Calculating IQR is :-

IQR = Q3 - Q1

To find the IQR of given data from above-

IQR = 6-2
IQR = 4

To find an outlier in an odd dataset:

Given Data is-

1,2,3,4,5,6,7

We have calculated -

MIN = 1
Q1 = 2
MED = 4
Q3 = 6
MAX = 7
IQR = 4

Now, we can check whether there are any outliers in the data -
For a data point to be an outlier, it must satisfy any one of the rules below.

outlier < Q1 - 1.5*(IQR)

outlier > Q3 + 1.5*(IQR)

So to find an outlier we have to calculate the minimum and maximum limits.

outlier < Q1-1.5*(IQR)
outlier < 2-1.5*(4)
outlier < 2-6.0
outlier < -4

There are no minimum value outliers, because there is no value in dataset less than -4.

Next,-

outlier > Q3 + 1.5*(IQR)
outlier > 6 + 1.5*4
outlier > 6 + 6
outlier > 12

And there are no maximum value outliers in the data, because there is no value greater than 12.

Finding the upper, median, lower quartile and IQR in an Even Dataset:

The process of finding quartiles and outliers is a bit different from the odd dataset.

Calculating Q2 in an even dataset:

Let's say we have -

4,8,12,16,20,24,28,32

Now, the given data is already sorted.
So, to find the median or Q2 we have to take the average of the middle two numbers. Like-

(4,8,12),16,20,(24,28,32)

So, here we have to take the average of 16 and 20, and that will be our median or Q2.

Q2 = (16+20)/2
Q2 = 36/2
Q2 = 18

Calculating Q1 in an even dataset:

To calculate Q1 we have to cut given dataset in half -

4,8,12,16 | 20,24,28,32

Here to find the Q1, we have to take average of the middle 2 numbers of first half -

4,(8,12),16

That is, average of 8 and 12

Q1 = (8+12)/2
Q1 = 20/2
Q1 = 10

Calculating Q3 in an even dataset:

Showing the given data in two halves -

4,8,12,16 | 20,24,28,32

To calculate Q3, we have to take average of middle two numbers of last half. Like -

20,(24,28),32

That is, average of 24 and 28

Q3 = (24+28)/2
Q3 = 52/2
Q3 = 26

Calculating IQR in an even dataset:

Calculating IQR is the same as for the odd dataset, that is -

IQR = Q3 - Q1
IQR = 26 - 10
IQR = 16

Finding an outlier in an even dataset:

Now, we have calculated terms required -

MIN = 4
Q1 = 10
MED = 18
Q3 = 26
MAX = 32
IQR = 16

Rules for outliers -

outlier < Q1 - 1.5*(IQR)

outlier > Q3 + 1.5*(IQR)

Finding minimum value outlier -

outlier < Q1 - 1.5*(IQR)
outlier < 10 - 1.5*(16)
outlier < 10 - 24
outlier < -14

So, there is no minimum value outlier, because there is no value less than -14.

Finding Maximum value outlier -

outlier > Q3 + 1.5*(IQR)
outlier > 26 + 1.5*(16)
outlier > 26 + 24
outlier > 50

Here also there is no outlier, because there is no value greater than 50.
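
The steps for both the odd and the even dataset can be sketched in a few lines of plain Python. The helper names median, quartiles and find_outliers are made up for this illustration. The sketch follows the "median of each half" method used in this article; library functions such as numpy's percentile may use a different interpolation method and give slightly different quartiles.

```python
def median(sorted_values):
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]  # single middle value
    # average of the two middle values
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]            # first half
    upper = data[half + (n % 2):]  # last half, skipping the middle value when n is odd
    return median(lower), median(data), median(upper)

def find_outliers(values):
    q1, q2, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

odd_data = [3, 5, 1, 4, 2, 6, 7]
even_data = [4, 8, 12, 16, 20, 24, 28, 32]

print(quartiles(odd_data))    # (2, 4, 6)
print(quartiles(even_data))   # (10.0, 18.0, 26.0)
print(find_outliers(odd_data), find_outliers(even_data))  # [] []

# adding one extreme value (made up for illustration) produces an outlier
print(find_outliers(even_data + [100]))  # [100]
```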

Conclusion:

In this article we learned how to calculate quartiles, the interquartile range and outliers.

Thank You!

Links For Basics Of Machine Learning

Nitin Kendre — Mon, 24 Jan 2022 18:47:00 +0000

Basic Concepts:-

Free 1 Hour Course to Understand the Basics of Machine Learning by Google
Link :-
https://learndigital.withgoogle.com/digitalgarage/course/machine-learning-basics
Basic Statistics Concepts for Machine Learning Newbies! with practicals using Python
Data Structures Related to Machine Learning
Commonly used Machine Learning Algorithms (with Python and R Codes)
Statistics in ML.

Interview Questions:-

Thank You For Reading.


All of the above resources are free to read and learn from. I am not promoting anything commercial.