DEV Community: CptWycliffe

Churn Prediction: Using Gradio

CptWycliffe — Wed, 07 Feb 2024 16:25:31 +0000

📢 ChurnPredict Pro: Customer Churn Prediction App with Gradio💥🔍

In today's competitive business landscape, understanding and predicting customer behavior is paramount. One crucial aspect of this is forecasting customer churn, which can help businesses make data-driven decisions to enhance customer retention and satisfaction. ChurnPredict Pro is a cutting-edge web application that leverages the power of machine learning, specifically a Random Forest Classifier model, to provide real-time customer churn predictions.

This article takes you on a journey through the development process of ChurnPredict Pro, highlighting the technology stack, features, and the exciting journey of integrating machine learning models for accurate churn predictions.

Introduction

ChurnPredict Pro is an innovative web application designed to predict customer churn. The application allows users to input customer data effortlessly and receive instant churn predictions, thereby enabling them to take proactive measures to retain valuable customers.

The Importance of Customer Churn Prediction

Customer churn, the loss of customers, can have a significant impact on a company's bottom line. By predicting churn, businesses can take preventive actions, such as targeted marketing and improved customer service, to reduce customer attrition. This predictive power can lead to higher customer satisfaction, improved profitability, and a more sustainable business model.

The Technology Stack

ChurnPredict Pro relies on a robust technology stack to deliver real-time customer churn predictions. Here are the key technologies used in the development of the application:

Gradio: A Python library for building interactive interfaces, which is the foundation of the user interface.
Pandas: A widely used library for data manipulation and analysis.
Scikit-Learn: A machine learning library that simplifies the implementation of various machine learning algorithms.
Joblib: Used for serialization and deserialization of machine learning models.

Development Process

### Data Collection, Preprocessing and Model Training

The development of ChurnPredict Pro began with data. Historical customer information from the previous project, "📢 Unlocking Insights: Decoding Telecommunication Customer Churn Through Machine Learning!💥🔍", served as the basis for training a Random Forest Classifier model. This data included various customer attributes, such as gender, senior citizen status, contract details, and payment method, which are used to make predictions about churn. The Random Forest Classifier was chosen because it emerged as the best-performing model for churn prediction.

### Preprocessor and Model Exports

The preprocessing steps and the Random Forest Classifier model were exported from the notebook using Joblib. This ensured that the preprocessor and model were readily available for further preprocessing tasks and forecasting within the app.
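
As a rough illustration of this export step, the sketch below fits a small preprocessor and Random Forest on made-up rows, saves both with Joblib, and loads them back. The column subset, file names, and temporary directory are assumptions for the example, not the project's actual artifacts.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Tiny made-up training frame; the real project used the Telco churn data.
train = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year", "Month-to-month"],
    "MonthlyCharges": [70.0, 20.0, 55.0, 90.0],
    "Churn": [1, 0, 0, 1],
})
features = ["Contract", "MonthlyCharges"]

preprocessor = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["Contract"]),
    ("num", StandardScaler(), ["MonthlyCharges"]),
])
X = preprocessor.fit_transform(train[features])
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, train["Churn"])

# Export both fitted objects (the file names are hypothetical).
export_dir = Path(tempfile.mkdtemp())
joblib.dump(preprocessor, export_dir / "preprocessor.joblib")
joblib.dump(model, export_dir / "model.joblib")

# Inside the app: load them back and reuse them unchanged.
loaded_preprocessor = joblib.load(export_dir / "preprocessor.joblib")
loaded_model = joblib.load(export_dir / "model.joblib")
reloaded_predictions = loaded_model.predict(
    loaded_preprocessor.transform(train[features])
)
```

Saving the fitted preprocessor alongside the model matters because the app must apply exactly the same encoding and scaling that the model saw during training.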

### Building the User Interface

The user interface serves as the front end of ChurnPredict Pro. Built with Gradio, it offers an interactive and intuitive experience. The design allows users to effortlessly input customer data, and with a single click, receive real-time churn predictions.

Using Gradio Blocks, the app's structure is organized into two main elements:

The main function, which preprocesses the input data and returns the churn prediction together with the customer information in a DataFrame.

The output, which consists of two components that display the prediction and the customer information.

The UI is composed of Rows and Columns for layout, and Gradio components that receive the inputs from the user.
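
A minimal sketch of such a layout is shown below. It assumes only four of the input fields and uses a placeholder prediction function, so the component choices and labels are illustrative rather than the app's actual interface.

```python
import pandas as pd


def churn_predict(gender, senior_citizen, contract, monthly_charges):
    """Placeholder logic: returns a fixed label plus the inputs as a table."""
    customer = pd.DataFrame([{
        "gender": gender,
        "SeniorCitizen": senior_citizen,
        "Contract": contract,
        "MonthlyCharges": monthly_charges,
    }])
    return "Prediction goes here", customer


def build_demo():
    # Imported here so the function above can be exercised without Gradio installed.
    import gradio as gr

    with gr.Blocks(title="ChurnPredict Pro") as demo:
        # Inputs: one Row holding two Columns of components.
        with gr.Row():
            with gr.Column():
                gender = gr.Radio(["Male", "Female"], label="Gender")
                senior = gr.Radio(["Yes", "No"], label="Senior citizen")
            with gr.Column():
                contract = gr.Dropdown(
                    ["Month-to-month", "One year", "Two year"], label="Contract"
                )
                charges = gr.Number(label="Monthly charges")
        submit = gr.Button("Submit")
        # Outputs: a Label for the prediction and a Dataframe for the inputs.
        with gr.Row():
            prediction = gr.Label(label="Prediction")
            customer_info = gr.Dataframe(label="Customer information")
        submit.click(
            fn=churn_predict,
            inputs=[gender, senior, contract, charges],
            outputs=[prediction, customer_info],
        )
    return demo
```

Calling `build_demo().launch()` would start the interface locally; the order of `inputs` must match the order of the function's parameters.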

### Building the Logic

When the user submits the customer data, the Submit button calls the churn_predict function and passes the data to it as input.

The churn_predict function then processes the data as follows:

The preprocessor and model are loaded using joblib.load from the model files. These components are essential for making churn predictions.
Customer data (received through *args) is converted into a DataFrame.
The categorical feature "SeniorCitizen" data is converted from "Yes"/"No" to "1"/"0" for machine learning compatibility.
The preprocessor is applied to transform the user's input.
Predictions are made using the Random Forest Classifier model.
The DataFrame of the customer data and the predicted churn status is returned from the function.

The output is displayed in a user-friendly format. The Gradio Dataframe component displays the customer information as received from the user, and the Gradio Label component displays the prediction, making it easy for businesses to understand the likelihood of churn and make informed decisions to retain customers.
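
The steps above can be sketched roughly as follows. To keep the example self-contained, the fitted preprocessor and model are passed in as arguments and replaced by simple stand-in classes; in the app they would come from joblib.load. The column subset, the stand-ins, and the label texts are assumptions.

```python
import pandas as pd

# Hypothetical subset of the input columns, in the order the UI passes them.
COLUMNS = ["gender", "SeniorCitizen", "Contract", "MonthlyCharges"]


def churn_predict(*args, preprocessor, model):
    # 1. Turn the positional inputs into a one-row DataFrame.
    customer = pd.DataFrame([args], columns=COLUMNS)
    # 2. Encode SeniorCitizen as 1/0 on a copy, keeping the user's view intact.
    features = customer.copy()
    features["SeniorCitizen"] = features["SeniorCitizen"].map({"Yes": 1, "No": 0})
    # 3. Apply the fitted preprocessor, then the fitted classifier.
    prediction = model.predict(preprocessor.transform(features))[0]
    label = "Churn" if prediction == 1 else "No churn"
    # 4. Return the prediction and the customer data as entered.
    return label, customer


class PassthroughPreprocessor:
    """Stand-in for the fitted preprocessor loaded with joblib.load."""

    def transform(self, frame):
        return frame


class RuleModel:
    """Stand-in for the Random Forest: flags month-to-month contracts."""

    def predict(self, frame):
        return [int(c == "Month-to-month") for c in frame["Contract"]]


label, info = churn_predict(
    "Female", "Yes", "Month-to-month", 80.5,
    preprocessor=PassthroughPreprocessor(), model=RuleModel(),
)
```

When wiring a function like this to a Gradio button, the loaded objects could be bound beforehand, for example with functools.partial, so that the click handler only receives the user inputs.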

### Displaying Results

A simple click on the "Submit" button delivers a real-time churn prediction.

ChurnPredict Pro simplifies churn prediction. With real-time predictions, businesses can take immediate actions to enhance customer retention. The application provides a clear prediction of whether a customer is likely to churn or stay, helping businesses plan their customer management strategies effectively.

Conclusion

ChurnPredict Pro is more than a churn prediction tool; it's a solution that empowers businesses to optimize customer management. The development process involved data collection, model training, and the creation of an interactive user interface. ChurnPredict Pro exemplifies the potential of machine learning in real-time decision-making. Businesses can now anticipate customer churn and take the necessary steps to enhance customer satisfaction and profitability.

The journey of developing ChurnPredict Pro is a testament to the power of combining machine learning and user-friendly applications. With the ability to predict customer churn, businesses can stay ahead of the competition and deliver the best possible service to their customers.

Leveraging on ML : Customer Churn Analysis:

CptWycliffe — Tue, 13 Jun 2023 20:38:31 +0000

Analyzing the Telco Dataset

In a rapidly evolving business environment, companies are constantly seeking ways to optimize their operations and improve customer experience and retention. For any business that wants to grow its profit and revenue margins, customer retention is a key area on which to focus resources. Because customers are the main source of revenue, it is imperative to minimize customer attrition.

One powerful approach is leveraging machine learning (ML) techniques to analyze large datasets and gain valuable insights into customer attrition factors and thereby formulate necessary intervention strategies.

In this article, we present a comprehensive analysis of the Telco dataset using different ML models.

Business Understanding
The business problem is that the customer churn at Telco is impacting the company's revenue and profitability.

Telco is a telecommunications company that provides phone and internet services to its customers. They have observed that their customer churn rate has been increasing over the past year, and they want to understand the reasons behind it and predict which customers are most likely to churn in the future.

In this project, we will analyse customer data from Telco to determine the key indicators of churn and formulate retention strategies that can be implemented to address the problem.

Scope: The project will focus on analyzing the customer data, including customer type, services used and billing information. The analysis covers only a portion of the customer base, and churn is defined as a customer cancelling their subscription.

Data Understanding
The data provided for this project is in CSV format. The following describes the columns present in the data.

Gender -- Whether the customer is a male or a female
SeniorCitizen -- Whether a customer is a senior citizen or not
Partner -- Whether the customer has a partner or not (Yes, No)
Dependents -- Whether the customer has dependents or not (Yes, No)
Tenure -- Number of months the customer has stayed with the company
PhoneService -- Whether the customer has a phone service or not (Yes, No)
MultipleLines -- Whether the customer has multiple lines or not
InternetService -- Customer's internet service provider (DSL, Fiber optic, No)
OnlineSecurity -- Whether the customer has online security or not (Yes, No, No internet service)
OnlineBackup -- Whether the customer has online backup or not (Yes, No, No internet service)
DeviceProtection -- Whether the customer has device protection or not (Yes, No, No internet service)
TechSupport -- Whether the customer has tech support or not (Yes, No, No internet service)
StreamingTV -- Whether the customer has streaming TV or not (Yes, No, No internet service)
StreamingMovies -- Whether the customer has streaming movies or not (Yes, No, No internet service)
Contract -- The contract term of the customer (Month-to-month, One year, Two year)
PaperlessBilling -- Whether the customer has paperless billing or not (Yes, No)
PaymentMethod -- The customer's payment method (Electronic check, Mailed check, Bank transfer (automatic), Credit card (automatic))
MonthlyCharges -- The amount charged to the customer monthly
TotalCharges -- The total amount charged to the customer
Churn -- Whether the customer churned or not (Yes or No)

We’ll therefore analyze the relationship between customer churn and each of the independent variables (gender, senior citizen status, partner, dependents, tenure, phone service, multiple lines, internet service, online security, online backup, device protection, tech support, streaming TV, streaming movies, contract, paperless billing, payment method, monthly charges, total charges).

Hypothesis
H0: Null hypothesis - There is no significant relationship between any of the independent variables and customer churn.

H1: Alternative hypothesis - There is a significant relationship between customer churn and at least two of the independent variables.

Research Questions
To understand the trends in the Telco customer data, we will investigate the following questions:
1. Is there a correlation between contract length and customer churn?
2. Do customers who have online security and backup services have lower churn rates?
3. Does the payment method have an impact on customer churn?
4. Is there a difference in churn rates between male and female customers?
5. Are customers with dependents less likely to churn compared to those without dependents?
6. Is there a correlation between TotalCharges and customer churn?
7. Are the customers who get the TechSupport service less likely to churn?
8. Is there a relationship between device protection and churn?
These questions will give us insights into the dataset, help us understand the distributions of variables, identify correlations, and discover any patterns or anomalies that may impact customer churn.

Findings
Plotting the categorical feature

Plotting the KDE for the numerical feature

From the EDA we make the following findings:
(i) From the bar plot of Contract, we can see that most of the customers who churned were on the Month-to-Month subscription. We can therefore conclude that there is a negative relationship between contract length and customer churn, i.e. the shorter the contract length, the more likely the customer is to churn.
(ii) From the bar plots of OnlineSecurity and OnlineBackup, it is evident that churn rates are lower for customers with online security and online backup services, and that the majority of customers who churned did not have these services.
(iii) While the probability that a customer with neither OnlineSecurity nor OnlineBackup will churn is 0.48, the probability that a customer with both services will churn is just 0.1.
(iv) From the bar plot of PaymentMethod, we can observe that the payment method does have an impact on customer attrition. In particular, customers using electronic check as their payment method have a significantly higher churn rate than those using other payment methods.
(v) The gender of the customer does not affect the probability of customer attrition.
(vi) Of the total number of customers that churned, about 66% had no dependents while about 33% had dependents. We can therefore conclude that customers with dependents are less likely to churn.
(vii) The customers who did not have TechSupport had the highest attrition.
(viii) A large proportion of the customers who churned did not have device protection.
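
Churn rates and conditional probabilities like the ones quoted above can be computed directly with pandas. The sketch below uses a handful of invented rows, so its numbers are illustrative only and do not reproduce the article's figures.

```python
import pandas as pd

# Made-up rows standing in for the Telco data.
df = pd.DataFrame({
    "Contract": ["Month-to-month", "Month-to-month", "One year", "Two year",
                 "Month-to-month", "Two year"],
    "OnlineSecurity": ["No", "No", "Yes", "Yes", "Yes", "No"],
    "OnlineBackup": ["No", "No", "Yes", "Yes", "No", "Yes"],
    "Churn": ["Yes", "Yes", "No", "No", "No", "No"],
})
churned = df["Churn"].eq("Yes")

# Churn rate per contract type: the mean of a boolean column is a proportion.
rate_by_contract = churned.groupby(df["Contract"]).mean()

# P(churn | neither service) versus P(churn | both services).
neither = df["OnlineSecurity"].eq("No") & df["OnlineBackup"].eq("No")
both = df["OnlineSecurity"].eq("Yes") & df["OnlineBackup"].eq("Yes")
p_churn_neither = churned[neither].mean()
p_churn_both = churned[both].mean()
```

Comparing rates within each group, rather than raw counts of churners, avoids being misled by groups that are simply larger.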

Data Pre-processing
Since our target variable is “Churn”, we will use the pandas value_counts() method to determine the number of observations in each target class.

From the class counts, we see that the data is not balanced.
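
As a small illustration of this check, the sketch below counts the classes in an invented target column; the values are made up and do not reflect the real class distribution.

```python
import pandas as pd

# Invented target column standing in for df["Churn"].
churn = pd.Series(["No", "No", "No", "Yes", "No", "Yes", "No", "No"], name="Churn")

class_counts = churn.value_counts()               # absolute counts per class
class_share = churn.value_counts(normalize=True)  # proportions per class
```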
Step 1. We will balance the data using the oversampling technique.

Step 2. We then split the dataset into train and test sets.

Step 3. Since our categorical columns each have only a handful of unique values, we will use the OrdinalEncoder for categorical feature encoding. This technique assigns each unique value an integer code.

Step 4. Scale the numerical columns using StandardScaler.
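
A rough sketch of these four steps with scikit-learn follows. It uses invented rows and only two feature columns. Note one deliberate difference from the order listed above: the data is split before oversampling, so that duplicated rows cannot end up in both the training and the test set. That ordering is this sketch's assumption, not necessarily what the original notebook did.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.utils import resample

# Made-up, imbalanced sample (8 "No" rows, 4 "Yes" rows).
df = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year"] * 4,
    "MonthlyCharges": [70, 55, 20, 90, 60, 25, 85, 50, 30, 75, 65, 35],
    "Churn": ["Yes", "No", "No", "Yes", "No", "No",
              "Yes", "No", "No", "Yes", "No", "No"],
})
feature_columns = ["Contract", "MonthlyCharges"]

# Split first (a deliberate swap of Steps 1 and 2) so test rows are never duplicated.
train, test = train_test_split(
    df, test_size=0.25, stratify=df["Churn"], random_state=0
)

# Oversample the minority class in the training set until both classes match.
majority = train[train["Churn"] == "No"]
minority = train[train["Churn"] == "Yes"]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=0)
train_balanced = pd.concat([majority, minority_up])

# Encode the categorical column as integer codes and standardise the numeric one.
preprocessor = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["Contract"]),
    ("num", StandardScaler(), ["MonthlyCharges"]),
])
X_train = preprocessor.fit_transform(train_balanced[feature_columns])
X_test = preprocessor.transform(test[feature_columns])
y_train = train_balanced["Churn"]
```

The preprocessor is fitted on the training rows only and then applied to the test rows, which keeps test-set statistics out of the scaling.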
Model Building and Evaluation
We will then train and evaluate different models, which include:
1) Logistic Regression
2) Decision Tree Classifier
3) Random Forest Classifier
4) CatBoost Classifier
5) Gradient Boosting Classifier
6) Naive Bayes
7) Nearest Neighbours (KNN)
8) Multi-Layer Perceptron
9) Stochastic Gradient Descent (SGD)
10) AdaBoost
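
One way to compare several classifiers is to loop over them with a shared train/test split. The sketch below does this for four of the listed models on synthetic data from make_classification. The data, the hyperparameters, and the choice of F1 as the ranking metric are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed churn features.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SGD": SGDClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Fit each model on the same training data and score it on the same test data.
scores = {}
for name, model in models.items():
    predictions = model.fit(X_train, y_train).predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1": f1_score(y_test, predictions),
    }

# Rank by F1, which is less flattering than accuracy on imbalanced classes.
best_model = max(scores, key=lambda name: scores[name]["f1"])
```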

Results and Analysis
From the models trained and evaluated, the best two models are SGD and AdaBoost, with the following metrics:
AdaBoost
Accuracy : 0.80
F1_Score : 0.78
Precision : 0.78
Recall : 0.80

SGD
Accuracy : 0.96
F1_Score : 0.78
Precision : 0.78
Recall : 0.80
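
For readers less familiar with these four metrics, the sketch below shows how each is derived from the counts of a binary confusion matrix. The counts are invented and are not the results of the models above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumes tp + fp > 0 and tp + fn > 0; no zero-division handling is included.
    """
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted churners, how many churned
    recall = tp / (tp + fn)      # of the actual churners, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented counts: 80 true positives, 20 false positives,
# 20 false negatives, 80 true negatives.
metrics = classification_metrics(tp=80, fp=20, fn=20, tn=80)
```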

Recommendation
a) The finding on payment methods can help inform strategies to reduce churn, such as promoting alternative payment methods or providing incentives for customers to switch from electronic check to more stable payment methods.
b) It's imperative for the company to initiate measures that encourage more customers to use the TechSupport service in order to reduce the probability of customer turnover.
c) It is necessary for the company to formulate incentive strategies to encourage customers to use the device protection service.
d) The company can continuously use ML algorithms to find the probability of a customer dropping the service, and thereby employ the necessary measures to ensure customer retention.

Exploring the Indian Startup Funding Ecosystem for the years 2018-to-2021

CptWycliffe — Sun, 09 Apr 2023 19:00:47 +0000

Introduction

It is reported that about one in every five startups fails within its first year, and the situation in India is no different. A recent study of the Indian startup economy indicated that although India has 105 unicorn startups, only 23 of them are profitable.
This trend is compounded by the fact that only 10% of Indian startups live to see their fifth anniversary, sinking billions in funding from venture capitalists and investment firms around the world.

It is therefore imperative that venture capitalists weigh a number of considerations before committing resources to any startup venture.

This project aims to explore and gain insight into the Indian startup funding ecosystem through an in-depth data analysis. We will try to understand the venture capital funding trends observed from 2018 to 2021. Some of the factors we will consider are:

The Location/headquarters of the Business startup
The Sector/Industry of the business venture
The Startup year
The stage at which the business is in when seeking funding

The null hypothesis in this project suggests that Indian startups in the technology industry are likely to receive funding. The alternative hypothesis argues that no factor will affect the probability or amount of funding received by an Indian startup.

To understand these factors, we will investigate the following questions:

What are the top five startup sectors that are investors' favorites?
Can the success of obtaining finance from investors be impacted by location?
Which stages receive more investment from investors?
Which sectors have the maximum amount of funding?
What is the total amount of funds each year?

Data Handling

Using Python scripts, we will explore four datasets with information about Indian startup funding for the years 2018, 2019, 2020 and 2021.
We will go through the following steps in a bid to validate our hypothesis:

Load our data
Process each dataset
Merge the datasets
Evaluate the data using univariate and multivariate analysis
Visualize our findings
Draw a conclusion

First we will import the libraries that will enable us to analyze the data. These libraries include:
pandas and NumPy: for data manipulation
seaborn and matplotlib: for visualization

# Data handling
import pandas as pd
import numpy as np

# Visualization (Matplotlib, Seaborn)
import matplotlib.pyplot as plt
import seaborn as sns

# Other packages
import os
import re

Data Loading

Using pandas.read_csv, we can load all four .csv files into DataFrames.

# For CSV, use pandas.read_csv to import the data files
data_2018 = pd.read_csv('startup_funding2018.csv')
data_2019 = pd.read_csv('startup_funding2019.csv')
data_2020 = pd.read_csv('startup_funding2020.csv')
data_2021 = pd.read_csv('startup_funding2021.csv')

Data Processing

In this section, we will explore each dataset separately. We will undertake three main tasks, i.e.

Understand the data contained in them
Identify the issues with the data
Determine how to handle each of the issues identified

To do this we will use the following code:

data.columns - to see what columns are contained in the dataset
data.head() - to preview the data contained in the dataset
data.info() - to get a summary of the data in each column
data.isnull().sum() - to count the null values in each column
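
Put together, a first look at one dataset might resemble the sketch below. It reads a two-row inline CSV instead of the real file, and the column names and values are invented for the example.

```python
import io

import pandas as pd

# Inline CSV standing in for one of the yearly files (invented rows).
csv_text = """Company Name,Industry,Amount,Location
Acme,"FinTech, Payments",250000,"Bangalore, Karnataka, India"
Beta,EdTech,,"Mumbai, Maharashtra, India"
"""
data = pd.read_csv(io.StringIO(csv_text))

columns = list(data.columns)      # which columns are present
preview = data.head()             # first rows of the dataset
data.info()                       # dtypes and non-null counts, printed to stdout
null_counts = data.isnull().sum() # number of missing values per column
```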

Processing 2018 Dataset

From the above, we notice the following:
- The Amount column contains different currencies as well as cells with "-"
- The Industry column is described using more than one phrase
- The Location column is described using more than one physical address

To clean these:
We will assume the amounts without any currency sign are in dollars, convert the values in INR to USD, and replace "-" with NaN.
Split the location and pick the first part of the phrase to describe the physical address/headquarters.
Pick only one phrase to describe the industry.
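
The cleaning rules above could be implemented along the following lines. The sample rows, the helper name amount_to_usd, and the conversion rate are all placeholders assumed for this sketch; the rate in particular should be replaced with whichever one the analysis actually uses.

```python
import numpy as np
import pandas as pd

INR_TO_USD = 0.0125  # placeholder rate (an assumption), not an official figure

# Invented rows showing the three kinds of issue described above.
raw = pd.DataFrame({
    "Amount": ["₹40,000,000", "$1,000,000", "250000", "-"],
    "Industry": ["FinTech, Payments", "EdTech", "Health Care, Medical", "AgriTech"],
    "Location": ["Bangalore, Karnataka, India", "Mumbai, Maharashtra, India",
                 "Gurgaon, Haryana, India", "Pune, Maharashtra, India"],
})


def amount_to_usd(value):
    """Convert one raw amount string to a float in USD (NaN if missing)."""
    text = str(value).strip()
    if text in {"-", "", "nan"}:
        return np.nan
    number = float(text.replace("₹", "").replace("$", "").replace(",", ""))
    # Amounts with no currency sign are assumed to be in USD already.
    return number * INR_TO_USD if text.startswith("₹") else number


cleaned = raw.assign(
    Amount=raw["Amount"].map(amount_to_usd),
    # Keep only the first phrase before the comma separator.
    Industry=raw["Industry"].str.split(",").str[0].str.strip(),
    Location=raw["Location"].str.split(",").str[0].str.strip(),
)
```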

Processing 2019 - 2021 Datasets

By repeating the process with the other datasets, we make the following observations

Observations from previewing the datasets

The 2018 DataFrame

The columns in 2018 are different from those of 2019 - 2021, meaning they have to be renamed for concatenation.
The amounts in the 2018 DataFrame are a mix of Indian Rupees (INR) and US Dollars (USD), meaning they have to be converted into the same currency.
The industry and location columns contain multiple values. A decision has to be made on whether to select the first value before the separator (,) as the main value.

The 2019 DataFrame

The datatype of the "Founded" column is set to float64. It should be set to a string for uniformity.
The headquarter column contains multiple values. A decision has to be made on whether to select the first value before the separator (,) as the main value.

The 2020 DataFrame

There is an extra column called "Unnamed: 9", giving it a total of 10 columns. It should be dropped to ensure complete alignment with the other DataFrames for ease of concatenation.

The 2021 DataFrame

The datatype of the "Founded" column is set to float64. It should be set to a string for uniformity.
Some cells have null values; for example, "Amount" has 3 null cells.

General Observations

The currency signs and comma separators have to be removed from the amount column of each DataFrame to allow numerical manipulation and analysis.
The 2022 average INR/USD rate will be used to convert the Indian Rupee values to US Dollars in the 2018 DataFrame.
First values of industry and location in the 2018 data will be selected as the primary sector and headquarters respectively.
Amounts without currency symbols are assumed to be in USD ($)
Financial analysis will be narrowed to transactions whose amounts are available in the loaded data
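
The structural fixes noted in these observations (renaming columns, dropping the stray column, and changing the type of "Founded") might look like the sketch below. The column names and the one-row frames are assumptions chosen for illustration; the real files may use different headers.

```python
import pandas as pd

# One-row stand-ins for two of the yearly frames (invented values).
data_2018 = pd.DataFrame({
    "Company Name": ["Acme"], "Industry": ["FinTech"],
    "Location": ["Bangalore"], "Amount": [250000.0],
})
data_2020 = pd.DataFrame({
    "Company/Brand": ["Beta"], "Founded": [2015.0], "HeadQuarter": ["Mumbai"],
    "Sector": ["EdTech"], "Amount": [1000000.0], "Unnamed: 9": [None],
})

# Rename the 2018 columns so they line up with the later years.
data_2018 = data_2018.rename(columns={
    "Company Name": "Company/Brand",
    "Industry": "Sector",
    "Location": "HeadQuarter",
})

# Drop the stray column that only exists in the 2020 file.
data_2020 = data_2020.drop(columns=["Unnamed: 9"])

# float64 years such as 2015.0 become the string "2015"; missing years stay missing.
data_2020["Founded"] = data_2020["Founded"].astype("Int64").astype("string")
```

Going through the nullable Int64 type first avoids ending up with strings such as "2015.0".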

Merging the Datasets

Once the cleaning of the data is completed, we combine all four DataFrames into one using pandas.concat(), as demonstrated below. We can also view a summary of the contents of the new dataset.

# Join the four DataFrames; ignore_index=True gives the result a fresh 0..n-1 index
data = pd.concat([data_2018, data_2019, data_2020, data_2021], ignore_index=True)

# Preview a summary of the data in the new combined DataFrame
data.info()
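
Since the yearly files are merged into a single frame, it can help to record which year each row came from before concatenating. The sketch below adds a "Year" column for that purpose; the column name and the tiny frames are assumptions for the example.

```python
import pandas as pd

# Tiny stand-ins for the cleaned yearly frames (invented values).
frames = {
    2018: pd.DataFrame({"Sector": ["FinTech"], "Amount": [1.0]}),
    2019: pd.DataFrame({"Sector": ["EdTech", "FinTech"], "Amount": [5.0, 2.0]}),
}

# Tag each frame with its funding year, then stack them with a fresh index.
combined = pd.concat(
    [frame.assign(Year=year) for year, frame in frames.items()],
    ignore_index=True,
)
```

With the year preserved as a column, per-year totals can later be computed with a simple groupby.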

You can view more visualizations of the findings here.

By conducting multivariate analysis on the combined dataset, we are able to answer the research questions and put our findings on the visualization as shown.

What are the top five startup sectors that are investors' favorites?

Can the success of obtaining finance from investors be impacted by location?

Which stages receive more investment from investors?

Which sectors have the maximum amount of funding?

What is the total amount of funds each year?
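
Most of these questions reduce to counting or summing over a grouping column. The sketch below shows the general pattern on invented rows; the column names ("Sector", "Stage", "Amount", "Year") and all values are assumptions, so the outputs say nothing about the real data.

```python
import pandas as pd

# Invented rows standing in for the combined dataset (amounts in USD millions).
data = pd.DataFrame({
    "Sector": ["FinTech", "EdTech", "FinTech", "HealthTech", "EdTech", "FinTech"],
    "Stage": ["Seed", "Series A", "Seed", "Seed", "Series B", "Series A"],
    "Amount": [1.0, 5.0, 2.0, 0.5, 20.0, 8.0],
    "Year": [2018, 2019, 2019, 2020, 2021, 2021],
})

# Investors' favorite sectors, measured by number of funded deals.
top_sectors_by_deals = data["Sector"].value_counts().head(5)

# Total funding per sector and per stage, largest first.
funding_by_sector = data.groupby("Sector")["Amount"].sum().sort_values(ascending=False)
funding_by_stage = data.groupby("Stage")["Amount"].sum().sort_values(ascending=False)

# Total funding per year.
funding_by_year = data.groupby("Year")["Amount"].sum()
```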

Exploring Indian startup funding provides insight into the vibrant and ever-growing venture capital ecosystem. From this analysis, we gain a better understanding of the characteristics associated with Indian startups.
We are able to draw the following observations:
The top sectors that are investors' favorites are FinTech and EdTech.
The highest amounts received by startups are also in the FinTech and EdTech sectors.
The startup with the highest amount of funding is also in the EdTech industry.

We can therefore affirm the null hypothesis: Indian startups in the technology industry are likely to receive funding.

If you're interested in exploring this project further, please check out my GitHub repository for more information, suggestions and input.