<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: mayank-p</title>
    <description>The latest articles on DEV Community by mayank-p (@mayankp).</description>
    <link>https://dev.to/mayankp</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F280593%2F8b650eec-8482-4a48-b250-a0eeaaf4b258.png</url>
      <title>DEV Community: mayank-p</title>
      <link>https://dev.to/mayankp</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mayankp"/>
    <language>en</language>
    <item>
      <title>Churn Prediction Pt. 2</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Thu, 13 Aug 2020 22:06:58 +0000</pubDate>
      <link>https://dev.to/mayankp/churn-prediction-pt-2-2mpi</link>
      <guid>https://dev.to/mayankp/churn-prediction-pt-2-2mpi</guid>
      <description>&lt;p&gt;This is a continuation of my last post (&lt;a href="https://dev.to/mayankp/churn-prediction-c75"&gt;https://dev.to/mayankp/churn-prediction-c75&lt;/a&gt;). For this blog, I will talk about how well a random forest model did. &lt;/p&gt;

&lt;h2&gt;
  
  
  Random Forest
&lt;/h2&gt;

&lt;p&gt;I tested out different split criteria to determine which produced the better model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dtSeM9SZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w14i9lkz07fk4acb7vto.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dtSeM9SZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w14i9lkz07fk4acb7vto.png" alt="drawing" width="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This model ended up being my best one with an roc_auc of 91.5%, which is pretty good.&lt;/p&gt;
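&lt;p&gt;As a rough sketch of what that comparison looks like (assuming scikit-learn; the actual churn features aren't in this post, so a synthetic imbalanced data set stands in for them):&lt;/p&gt;

```python
# Sketch of comparing split criteria for a random forest; X and y are
# synthetic stand-ins for the telecom churn features and labels.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)

for criterion in ("gini", "entropy"):
    model = RandomForestClassifier(criterion=criterion, random_state=0)
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(criterion, round(scores.mean(), 3))
```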

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JTViezTs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z69w7jnii2z53wg391si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JTViezTs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z69w7jnii2z53wg391si.png" alt="drawing" width="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The random forest model also found international plan, customer service calls, and total day minutes to be the most important factors. This somewhat makes sense because the people who are charged the most usually talk a lot more or have to make expensive calls (like international calls). Also, going back to my last blog, people generally like the status quo until they become irritated enough to change it, and high charges fall into that category. &lt;/p&gt;

&lt;h2&gt;
  
  
  Imbalances
&lt;/h2&gt;

&lt;p&gt;While dealing with this data set, I realized that there was an imbalance between churns and non-churns. Approximately 15-20% of the data resulted in a churn, while the rest did not. This could cause problems for the model because it wouldn't learn enough from the churn data points. So in order to balance the data out, I used SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates new minority-class data points based on the information in the existing data set. By doing this, you give the model you're training more information to learn from, which is always good. &lt;/p&gt;
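&lt;p&gt;The real implementation lives in the imbalanced-learn package, but the core idea can be sketched in a few lines of plain NumPy: synthesize new minority-class points by interpolating between existing minority samples (this toy version picks a random pair rather than k-nearest neighbours, so it is an illustration, not the library's algorithm):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(minority, n_new):
    """Toy sketch of SMOTE's core idea: create synthetic minority-class
    rows by interpolating between two existing minority samples."""
    new_points = []
    for _ in range(n_new):
        a, b = minority[rng.choice(len(minority), size=2, replace=False)]
        t = rng.random()                  # random position on the segment a-b
        new_points.append(a + t * (b - a))
    return np.array(new_points)

churners = rng.normal(loc=5.0, size=(30, 4))   # pretend minority class
synthetic = smote_like(churners, n_new=70)     # 70 new synthetic churn rows
```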

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--GHzUfktW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/udulcn7w7y2cdjo8d0lp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--GHzUfktW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/udulcn7w7y2cdjo8d0lp.png" alt="drawing" width="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sure enough, this increased my roc_auc to 96.1%. &lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>Churn Prediction</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Mon, 10 Aug 2020 04:21:54 +0000</pubDate>
      <link>https://dev.to/mayankp/churn-prediction-c75</link>
      <guid>https://dev.to/mayankp/churn-prediction-c75</guid>
      <description>&lt;p&gt;I decided to look at a churn data set found on Kaggle ( &lt;a href="https://www.kaggle.com/becksddf/churn-in-telecoms-dataset"&gt;https://www.kaggle.com/becksddf/churn-in-telecoms-dataset&lt;/a&gt;). A churn is when a customer decides to change their telecom service. So the point of this exercise was to try to identify factors that caused customers to switch their plans and to create a model to try and predict them. &lt;/p&gt;

&lt;h2&gt;
  
  
  EDA
&lt;/h2&gt;

&lt;p&gt;After doing some preliminary exploratory data analysis, I found two features that deserved more attention. Customers who switched are labeled as 1 in the graphs below and appear in the second graph of each picture. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nshdv7kq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p3dtafvj51i3oiwgvwok.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nshdv7kq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p3dtafvj51i3oiwgvwok.png" alt="drawing" width="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this graph, you can see how customers with an international plan were more willing to switch plans. Maybe they weren't happy with the service or high prices.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8yXo6xUS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lknyr1etnign8vbdeadc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8yXo6xUS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/lknyr1etnign8vbdeadc.png" alt="drawing" width="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this graph, you can see how customers who switched called customer service a lot more than customers who didn't switch. Maybe they were so unhappy with the service that they wanted to switch. This makes sense because people generally stick to the status quo until they absolutely have to change. &lt;/p&gt;

&lt;h2&gt;
  
  
  Modeling
&lt;/h2&gt;

&lt;p&gt;Now time for some modeling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/3kruPvGDcRnY6hmzp0/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/3kruPvGDcRnY6hmzp0/giphy.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No. Not that kind of modeling!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Logistic Regression&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For my first take, I tried a logistic regression model on the data. This model found these columns important.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--K2o7__R6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8q5ckh7sasygi1spg725.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--K2o7__R6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8q5ckh7sasygi1spg725.png" alt="drawing" width="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly enough, this model found the number of customer service calls the most important feature when predicting a churn, just like I thought earlier. However, this model was only 81% accurate.&lt;/p&gt;
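&lt;p&gt;A minimal sketch of that baseline, assuming scikit-learn (synthetic data stands in for the churn table, which isn't reproduced in this post):&lt;/p&gt;

```python
# Logistic-regression baseline: fit, score, and inspect coefficients.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

# Coefficient magnitudes give a rough feature-importance ranking,
# which is how plots like the one above are built.
importances = abs(model.coef_[0])
```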

&lt;p&gt;For my next blog post, I will try a random forest model and see how well it fares.&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Black Friday Shopping Hackathon: Part 1</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Mon, 20 Jul 2020 04:36:48 +0000</pubDate>
      <link>https://dev.to/mayankp/black-friday-shopping-hackathon-part-1-39lf</link>
      <guid>https://dev.to/mayankp/black-friday-shopping-hackathon-part-1-39lf</guid>
      <description>&lt;p&gt;This week I decided to participate in a hack-a-thon provided by analyticsvidhya.com. Here is the link: &lt;a href="https://datahack.analyticsvidhya.com/contest/black-friday/#About"&gt;https://datahack.analyticsvidhya.com/contest/black-friday/#About&lt;/a&gt;. The basic premise of this problem is to be able to predict how much money a customer is going to spend this Black Friday based on criteria ranging anywhere from age, occupation, gender, to marital status.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ABmkEXE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/dyy3uxpf1vpqxje9ib9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ABmkEXE7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/dyy3uxpf1vpqxje9ib9j.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Initial Thoughts
&lt;/h2&gt;

&lt;p&gt;My initial thought on looking into this data set is that I probably have to build a linear regression model based on the type of information that I have. Most of the data in these columns are numerical values. There are approximately 170k nulls in the Product Category 2 column, 380k in the Product Category 3 column, and none in the other columns. For my first run-through, I always drop the columns with a large number of nulls. Next I try to see if there are any relationships between the columns and the target, like do girls spend more than guys, or do married individuals spend more than unmarried ones. &lt;/p&gt;
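&lt;p&gt;Those two first steps look roughly like this in pandas (the miniature table and its column names are made up for illustration, not the competition's exact schema):&lt;/p&gt;

```python
import pandas as pd

# Hypothetical miniature of the Black Friday table.
df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "M"],
    "Marital_Status": [0, 1, 1, 0, 0],
    "Product_Category_2": [2.0, None, None, None, None],  # mostly null
    "Purchase": [9000, 15000, 7000, 12000, 11000],
})

# First pass: drop columns that are mostly null.
min_non_null = int(0.5 * len(df))
df = df.dropna(axis=1, thresh=min_non_null)

# Then compare average spend across groups.
spend_by_gender = df.groupby("Gender")["Purchase"].mean()
```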

&lt;h1&gt;
  
  
  Exploratory Data Analysis
&lt;/h1&gt;

</description>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>Poker</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Mon, 13 Jul 2020 04:53:41 +0000</pubDate>
      <link>https://dev.to/mayankp/poker-3508</link>
      <guid>https://dev.to/mayankp/poker-3508</guid>
      <description>&lt;p&gt;So after playing poker during all of quarantine, I have decided to create my own poker table using Python. &lt;/p&gt;

&lt;pre&gt;
from poker import *
import eval7
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.core.display import display, HTML
import random
&lt;/pre&gt;

&lt;p&gt;These are all of the imports I have used.&lt;/p&gt;

&lt;pre&gt;
deck = list(Card)
random.shuffle(deck)
&lt;/pre&gt;

&lt;p&gt;This creates the deck and shuffles it.&lt;/p&gt;

&lt;pre&gt;
players = int(input("How many people are playing: "))
&lt;/pre&gt;

&lt;p&gt;This asks for and stores the number of players (converted to an integer so it can be used for dealing).&lt;/p&gt;

&lt;pre&gt;
flop = [deck.pop() for __ in range(3)]
turn = deck.pop()
river = deck.pop()
&lt;/pre&gt;

&lt;p&gt;This deals the community cards: the three-card flop, the turn, and the river.&lt;/p&gt;

&lt;p&gt;My next steps are to create hands for each of the players to evaluate and to create stack sizes for buy ins.&lt;/p&gt;
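&lt;p&gt;A stand-alone sketch of that next step might look like this. To keep it self-contained it uses plain string cards instead of the poker library's Card objects, and hard-codes the player count where the table would use input(); it is an illustration of the dealing logic, not the finished table:&lt;/p&gt;

```python
import itertools
import random

# Plain-string deck standing in for the poker library's Card objects.
ranks = "23456789TJQKA"
suits = "shdc"
deck = [r + s for r, s in itertools.product(ranks, suits)]
random.shuffle(deck)

num_players = 4  # would come from input() at the real table

# Deal each player a two-card hand, then the flop.
hands = {f"player {i}": [deck.pop(), deck.pop()]
         for i in range(1, num_players + 1)}
flop = [deck.pop() for _ in range(3)]
```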

</description>
      <category>beginners</category>
      <category>python</category>
    </item>
    <item>
      <title>Unsupervised Learning</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Mon, 06 Jul 2020 04:46:50 +0000</pubDate>
      <link>https://dev.to/mayankp/unsupervised-learning-2ggd</link>
      <guid>https://dev.to/mayankp/unsupervised-learning-2ggd</guid>
      <description>&lt;p&gt;If you're just starting to get into data science and machine learning, you've probably heard of unsupervised learning a lot. It is generally used for classification problems. When classifying sets of data, you have a question that you need answered. Say for example you only have today's temperature, humidity, air pressure, and inches of rain today. And you want to try and predict if it will rain tomorrow. When you look up historical data, your targets become the answer to whether or not it rained the next day. If yes, then your target is a one. If no, then your target is a zero. &lt;/p&gt;

&lt;p&gt;Unsupervised learning is when you don't set targets for your machine learning algorithms to learn from. And in this example, it would be if you didn't know if it rained the next day. The advantage of unsupervised learning is that it might be able to pick up on events that don't normally happen, like a hurricane. &lt;/p&gt;
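&lt;p&gt;As a toy sketch of that idea (assuming scikit-learn; the weather numbers are invented), you can cluster the readings with no rain labels at all and then inspect what the groups turn out to mean:&lt;/p&gt;

```python
# Unsupervised clustering of weather-like readings: no targets given.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Columns: temperature (C), humidity (%), pressure (hPa); invented values.
dry = rng.normal([30, 40, 1015], [3, 5, 4], size=(50, 3))
wet = rng.normal([22, 85, 1002], [3, 5, 4], size=(50, 3))
readings = np.vstack([dry, wet])

# The algorithm groups the days on its own; we never told it which rained.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(readings)
```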

&lt;p&gt;Another example of this is in fraud detection. Fraud detection algorithms are pretty good at detecting common frauds. However, fraudsters make the most money by inventing new methods, which regular supervised learning cannot pick up. &lt;/p&gt;

&lt;p&gt;Unsupervised learning has a lot of potential ranging anywhere from fraud detection to stock trading.&lt;/p&gt;

</description>
      <category>beginners</category>
    </item>
    <item>
      <title>Overfitting</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Mon, 29 Jun 2020 02:44:58 +0000</pubDate>
      <link>https://dev.to/mayankp/overfitting-3och</link>
      <guid>https://dev.to/mayankp/overfitting-3och</guid>
      <description>&lt;h2&gt;
  
  
  Too Good to Be True
&lt;/h2&gt;

&lt;p&gt;Overfitting is when you train a model too closely to an existing data set. In every data set, there is signal and noise. When predicting something, you want your model to learn the signal: the part of the data that actually affects your desired result. For example, if you have a data set on house pricing, you want the model to learn that location or number of bedrooms is a signal that increases the price. Noise is just the randomness of collected data. You want all of your predictive models learning from the signal in the data and not the noise. Overfitting happens when you don't account for the fact that data can contain noise and fit your model to the noise anyway. Although an overfitted model will correctly classify the data it trained on, it will fail to correctly classify new data. Here is an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DutuMX64--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fy0bay30l1sxigxm59we.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DutuMX64--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fy0bay30l1sxigxm59we.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The black line represents a good model that pretty clearly identifies a dividing line that accurately classifies all the blue and red dots with a few exceptions, which can be attributed to the natural noise of data. The overfitted line divides the two colors with a 100% accuracy; however, it seems to look very unnatural and in simple terms "tries too hard". This overfitted line accounts for the noise in the data and does not filter it out. &lt;/p&gt;

&lt;h2&gt;
  
  
  How to Identify Overfitting
&lt;/h2&gt;

&lt;p&gt;The easiest way to identify overfitting is to split your data set into a train set and a test set before training your model (I like to use an 80%/20% split, respectively). Then you can train your models on the train set and see how accurately they predict the test set. If a model scores well on the training set, say 92%, but only predicts the test set at 65%, then it is overfit. &lt;/p&gt;
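&lt;p&gt;That check can be sketched in a few lines (assuming scikit-learn): an unconstrained decision tree happily memorizes noisy training data, so its train score far exceeds its test score:&lt;/p&gt;

```python
# Train/test gap as an overfitting detector.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y=0.2 injects label noise, which the tree will memorize.
X, y = make_classification(n_samples=400, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)  # typically near 1.0
test_acc = tree.score(X_test, y_test)     # noticeably lower: overfit
```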

&lt;h2&gt;
  
  
  How to Prevent Overfitting
&lt;/h2&gt;

&lt;p&gt;The most common practice I use to prevent overfitting is cross validation. &lt;/p&gt;

&lt;p&gt;Cross validation is when you split your data into multiple train/test splits. If you're cross validating with 5 folds, you have 5 different 80/20 splits, and each fold takes a turn as the test set, so no two splits share the same test data. This way you train and evaluate your model on multiple parts of the data set, which reduces overfitting to any one portion of the data.&lt;/p&gt;
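&lt;p&gt;In scikit-learn this is a one-liner; with cv=5, each of the five scores comes from a different held-out 20% of the data:&lt;/p&gt;

```python
# 5-fold cross-validation: every row is held out for testing exactly once.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```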

</description>
      <category>beginners</category>
    </item>
    <item>
      <title>Skewness</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Sun, 21 Jun 2020 22:37:39 +0000</pubDate>
      <link>https://dev.to/mayankp/skewness-24k1</link>
      <guid>https://dev.to/mayankp/skewness-24k1</guid>
      <description>&lt;h2&gt;
  
  
  Skewness Definition
&lt;/h2&gt;

&lt;p&gt;Skewness refers to the asymmetry of a distribution in a certain direction. If the tail tapers off on the right side, the distribution is said to have positive skew. If the tail tapers off on the left side, it is said to have negative skew. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jlf-N02g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gc9eh5454rhc8v9caenb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jlf-N02g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gc9eh5454rhc8v9caenb.png" alt="drawing" width="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Skewness Example
&lt;/h2&gt;

&lt;p&gt;If the distribution is positively skewed, then the mean of the data will be larger than the median. This is because the large values in the right tail pull the mean upward. One example of this is the US income disparity. The graph below shows the 2015 median and mean household income.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--roXioGYH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ueg5b59f6nnan0ztshx5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--roXioGYH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ueg5b59f6nnan0ztshx5.png" alt="drawing" width="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mean income is over $20,000 more than the median income due to the extremely high earning millionaires and billionaires. Below is a table of the top percent earning buckets causing the skew. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WWM3YC9I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/83bljs1x10a4xcuneyp9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WWM3YC9I--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/83bljs1x10a4xcuneyp9.png" alt="drawing" width="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Measuring Skewness
&lt;/h2&gt;

&lt;p&gt;There are two different ways to measure skewness. The two values used to measure it are called Pearson's coefficients. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EdrdjJV---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5zu20ojcmfrw4u6u5uom.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EdrdjJV---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5zu20ojcmfrw4u6u5uom.png" alt="drawing" width="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first one is used if the distribution has a clear mode. If not, the second one is used.&lt;/p&gt;
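&lt;p&gt;Both effects can be checked numerically. The sketch below uses a log-normal sample as a stand-in for income data (an assumption for illustration) and computes Pearson's second coefficient, the mode-free version:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# Log-normal sample: right-skewed, roughly like household income.
incomes = rng.lognormal(mean=11, sigma=0.6, size=10_000)

mean, median, std = incomes.mean(), np.median(incomes), incomes.std()

# Pearson's first coefficient would be (mean - mode) / std, but this
# sample has no clear mode, so we use the second coefficient:
skew = 3 * (mean - median) / std   # positive for right-skewed data
```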

&lt;p&gt;Information and pics gotten from: &lt;br&gt;
&lt;a href="https://www.investopedia.com/personal-finance/how-much-income-puts-you-top-1-5-10/"&gt;https://www.investopedia.com/personal-finance/how-much-income-puts-you-top-1-5-10/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.investopedia.com/terms/s/skewness.asp"&gt;https://www.investopedia.com/terms/s/skewness.asp&lt;/a&gt;&lt;br&gt;
&lt;a href="https://fas.org/sgp/crs/misc/R44705.pdf"&gt;https://fas.org/sgp/crs/misc/R44705.pdf&lt;/a&gt;&lt;br&gt;
&lt;a href="https://en.wikipedia.org/wiki/Skewness"&gt;https://en.wikipedia.org/wiki/Skewness&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
    </item>
    <item>
      <title>Bernoulli Distributions</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Mon, 15 Jun 2020 03:55:09 +0000</pubDate>
      <link>https://dev.to/mayankp/bernoulli-distributions-2g92</link>
      <guid>https://dev.to/mayankp/bernoulli-distributions-2g92</guid>
      <description>&lt;p&gt;The Bernoulli Distribution. One of the first stats concepts that I learned in middle school. Interested in seeing what else this mathematician did, I looked him up and was surprised to see how many concepts were named after him. The Bernoulli Effect, the Bernoulli Series, Bernoulli Polynomials. I remember thinking, "man this guy must have been really smart or had nothing better to do". However, everything made more sense a couple years later when I found out they a whole family of geniuses and each had come up with their own mark in math history. It was probably the only way to not be the loser of the family.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LEGmvn19--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/k4hhbbmojsjtewc1v3h1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LEGmvn19--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/k4hhbbmojsjtewc1v3h1.jpg" alt="Johann Bernoulli"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This was Johann Bernoulli&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  What is the Bernoulli Function?
&lt;/h1&gt;

&lt;p&gt;Anyway, Jacob Bernoulli was the one who coined this distribution. The idea behind it is that in any event, you will either get a success (1) or a failure (0); the probability of success is p, making the probability of failure (1-p). Say you were trying to pick an ace out of a randomly shuffled deck. Since there are four aces in a deck of fifty-two cards, the probability of success p is 1/13, which means the probability of failure is 12/13. When the trial is repeated n times, the chance that you pick an ace every single time is p to the nth power, which decreases exponentially. These probabilities can be graphed, creating the Bernoulli Distribution.&lt;/p&gt;
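&lt;p&gt;The ace example works out exactly with Python's fractions module (the formulas for the expected value and variance are the ones shown below):&lt;/p&gt;

```python
from fractions import Fraction

p = Fraction(4, 52)      # probability of drawing an ace (reduces to 1/13)
q = 1 - p                # probability of failure (12/13)

expected_value = p       # E[X] = p for a single Bernoulli trial
variance = p * q         # Var[X] = p(1 - p)

n = 5
all_successes = p ** n   # chance of an ace on every one of n draws
```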

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cq7CwUAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/u0bf8k0cwxkpmw5q5hvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cq7CwUAX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/u0bf8k0cwxkpmw5q5hvu.png" alt='"Formula for calculating the Bernoulli function"'&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Formula for calculating the Bernoulli function&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--G8u7cQcI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wtdre2w4hxrvwivzrh4z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--G8u7cQcI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wtdre2w4hxrvwivzrh4z.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Formula for calculating expected value and variance&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Where is it used?
&lt;/h1&gt;

&lt;p&gt;This distribution can be used in any statistical modeling as long as the situation meets four criteria.&lt;/p&gt;

&lt;p&gt;1) It has only two outcomes -- Success or Failure.&lt;br&gt;
   (Some event either can or cannot happen)&lt;br&gt;
2) Each trial must be independent from one another.&lt;br&gt;
   (The success or failure in trial one cannot affect the success &lt;br&gt;
    or failure in outcome two)&lt;br&gt;
3) Probability of success and failure stays the same throughout &lt;br&gt;
   each trial.&lt;br&gt;
4) Number of trials are fixed. &lt;/p&gt;

&lt;p&gt;So going back to our picking-aces example: we can use the Bernoulli Distribution to estimate how many times we will succeed, as long as we put the drawn card back into the deck before running the experiment again. If we don't, then criteria 2 and 3 no longer apply, so we cannot use the distribution to calculate our success over n trials. &lt;/p&gt;
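&lt;p&gt;A quick simulation confirms the with-replacement setup: since the card goes back each time, every draw is an independent trial with the same p, and the empirical success rate settles near 4/52:&lt;/p&gt;

```python
import random

random.seed(0)
deck = ["ace"] * 4 + ["other"] * 48

# random.choice leaves the deck unchanged, so each draw is an
# independent trial with the same success probability (criteria 2 and 3).
trials = 100_000
hits = sum(random.choice(deck) == "ace" for _ in range(trials))

empirical_p = hits / trials  # should land near 4/52, about 0.077
```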

&lt;p&gt;Pictures and Information gotten from: &lt;br&gt;
&lt;a href="https://probabilityformula.org/bernoulli-trials.html"&gt;https://probabilityformula.org/bernoulli-trials.html&lt;/a&gt;&lt;br&gt;
&lt;a href="https://towardsdatascience.com/understanding-bernoulli-and-binomial-distributions-a1eef4e0da8f"&gt;https://towardsdatascience.com/understanding-bernoulli-and-binomial-distributions-a1eef4e0da8f&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Statistically Significant</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Mon, 08 Jun 2020 02:52:48 +0000</pubDate>
      <link>https://dev.to/mayankp/statistically-significant-3nnn</link>
      <guid>https://dev.to/mayankp/statistically-significant-3nnn</guid>
      <description>&lt;p&gt;How do scientists test out different theories? One such strategy is using a p-value test. &lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Statement
&lt;/h2&gt;

&lt;p&gt;On average, do dogs weigh more than cats?&lt;/p&gt;

&lt;h2&gt;
  
  
  State the Hypothesis
&lt;/h2&gt;

&lt;p&gt;This is where we make our guesses. When we try to prove that event A causes event B, we have to carry the burden of proof, just like a prosecutor in a court case. Now we state our hypotheses.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Null Hypothesis: On average, dogs do not weigh more than cats.&lt;br&gt;
Alternate Hypothesis: On average, dogs weigh more than cats.  &lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Set the Significance Level
&lt;/h2&gt;

&lt;p&gt;For the test, we need to set a probability threshold that will indicate whether the probability we get after running the test is significant. Generally the threshold is set at 0.05, or 5%. So if the test gives a probability less than 5%, then the result is significant enough to reject the null hypothesis. If the probability is greater than 5%, then the result is not significant enough to reject the null hypothesis.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Set the significance level or alpha to 0.05&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Perform the Test
&lt;/h2&gt;

&lt;p&gt;After setting the significance level, we can perform the statistical test. This can be pretty much any test: a chi-square test, a t-test, a z-test, etc. After performing the test, we get a probability based on the test statistic, called the p-value. If the p-value is less than the significance level, then we reject the null hypothesis.&lt;/p&gt;
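&lt;p&gt;The dogs-vs-cats question can be sketched end to end with SciPy's one-sided t-test. The weight samples below are made-up illustrative numbers, not real data:&lt;/p&gt;

```python
# One-sided two-sample t-test: do dogs weigh more than cats on average?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
dog_weights = rng.normal(25, 8, size=40)    # kg, hypothetical sample
cat_weights = rng.normal(4.5, 1.5, size=40)

# alternative="greater" tests whether the first group's mean is larger.
t_stat, p_value = stats.ttest_ind(dog_weights, cat_weights,
                                  alternative="greater")

alpha = 0.05
reject_null = bool(alpha > p_value)  # True here: dogs weigh more on average
```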

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Now we can conclude our test. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;P-value = 0.03. Therefore I reject my null hypothesis.&lt;br&gt;
P-value = 0.16. Therefore I do not reject my null hypothesis.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>beginners</category>
      <category>statistics</category>
      <category>testing</category>
    </item>
    <item>
      <title>Rookie Mistakes...or not</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Fri, 17 Jan 2020 20:29:59 +0000</pubDate>
      <link>https://dev.to/mayankp/rookie-mistakes-or-not-390b</link>
      <guid>https://dev.to/mayankp/rookie-mistakes-or-not-390b</guid>
      <description>&lt;p&gt;Sports are all about winning a championship. Dynasties are built on the ability to win championships. Players become legends when they bring their team championships as their legacies become invincible. In the NBA, superstars control the game more than any other team sport because only five people play on the court at one time. Therefore, every team tries to get as many superstars on their team as possible. There are three methods to get a superstar on the team. Attract one in free agency, trade for one, or draft and develop one. The first two methods require a lot of capital, in terms of money or talent. So teams try to draft rookies and develop them, without worrying about giving up money or talent. Therefore, picking the player correctly becomes vital to the success of a franchise. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/lYvDnYZM6RNWU/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/lYvDnYZM6RNWU/giphy.gif" alt="GOAT dynasty"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For my analysis, I will be looking at approximately 1300 rookies and their stats and trying to predict whether they lasted in the NBA for more than five years. This is a good starting point for trying to predict superstar talent because rookie contracts generally last four years. If the team sees value in the player, it can sign another contract when the rookie one ends. (Also, the data set where I got all of this information from had a 5-years-played target, so it was convenient.) This is the first step, and as a rookie myself, I will be trying to use methods that I've recently learned, like logit and decision trees, in order to create a model that predicts whether or not a rookie will get a second contract. &lt;/p&gt;

&lt;h1&gt;
  
  
  Step 1: Understanding the Data
&lt;/h1&gt;

&lt;p&gt;First, I decided to take a look at all of the correlations in the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fyu47sb8amhkyygp6glbo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fyu47sb8amhkyygp6glbo.png" alt="drawing"&gt;&lt;/a&gt;&lt;br&gt;
Note: Red means highly correlated and blue means not correlated.&lt;/p&gt;

&lt;p&gt;As you can see in the picture above, there are a lot of correlations in the data. Most of the correlations make sense, like any relationship between shot attempts and made attempts (the more you shoot, the more you'll make). Furthermore, there are a lot of correlations between counting stats (like points, rebounds, and assists) and minutes played, which also makes sense because the more you play, the more of these stats you can accumulate. There seems to be a signal in the number of minutes played that strongly affects the longevity of the player's career. One surprising observation was that there was little to no correlation between shooting threes and whether the player survived for five years, because in today's NBA the three-point shot is the most important shot in the game. Next, I will look at the players who got to the five-year mark and compare them to the players who did not. &lt;/p&gt;

&lt;p&gt;For this, I decided to analyze the data by incorporating some visualizations. I compared all of the given stats; however, the most important differences I noticed were in the number of games played and the number of minutes per game.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft2b3ezoi0h3z2vyqy0v0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Ft2b3ezoi0h3z2vyqy0v0.png" alt="drawing"&gt;&lt;/a&gt; &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5hag2of64dve1pdczumb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F5hag2of64dve1pdczumb.png" alt="drawing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes sense, since good players usually get to play more and have more time to show off their talent and skill, thus earning a second contract. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/hC9YKQgNSkrF6/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/hC9YKQgNSkrF6/giphy.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sometimes it might be a bad idea to play more.&lt;/p&gt;

&lt;h1&gt;
  
  
  Step 2: Baseline Modeling
&lt;/h1&gt;

&lt;p&gt;Now I will run a basic logistic regression model and test its accuracy. I created an 80/20 train/test split, fit the model on the train set, and built a scorecard to evaluate it. I used "roc_auc" as the base measurement of accuracy on the data and ended up getting this:&lt;/p&gt;
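&lt;p&gt;For anyone following along, the baseline workflow looks roughly like this. Note that the data here is a synthetic stand-in (via make_classification), not the real rookie stats, so the printed score is only illustrative:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the rookie stats (X) and the 5-year target (y);
# the real frame would come from the data.world CSV linked below.
X, y = make_classification(n_samples=1300, n_features=19, random_state=42)

# 80/20 train/test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score with ROC AUC on the held-out 20%.
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"baseline roc_auc: {auc:.3f}")
```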

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feo7kw3aarej8d61fli4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Feo7kw3aarej8d61fli4c.png" alt="drawing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the baseline score that I now want to beat. Time to do some feature engineering. &lt;/p&gt;

&lt;h1&gt;
  
  
  Step 3: Feature Importance
&lt;/h1&gt;

&lt;p&gt;For my next step, I will try to find the most important features for the model. The correlation heat map earlier indicated that a lot of the stats were related to each other, so reducing the features is necessary to make the model better. I decided to test feature importance on the scorecard using permutation importance. My results produced this graph. &lt;/p&gt;
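&lt;p&gt;Permutation importance itself is a one-liner in scikit-learn: shuffle one feature at a time and measure how much the score drops. Here is a hedged sketch on synthetic data (the real version would use the rookie features):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much roc_auc drops;
# a large drop means the model leaned on that feature (the blue bars in
# the graph), and the std across repeats is the variance (the black lines).
result = permutation_importance(model, X_test, y_test, scoring="roc_auc",
                                n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```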

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6qafz51juz1rqqshtkmo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F6qafz51juz1rqqshtkmo.png" alt="drawing"&gt;&lt;/a&gt;&lt;br&gt;
Note: Blue lines indicate signal and black lines indicate variance. &lt;/p&gt;

&lt;p&gt;The results are a bit surprising. Some of the signals make sense. For example, field goals made and games played being two of the strongest signals is understandable: the more games a rookie plays and the more field goals they score, the more chances they have to showcase their talent. I did not expect the model to give so little importance to the number of minutes played, given that more minutes let rookies rack up more counting stats. Furthermore, the model found a lot of correlated variables important, like FGM, FGA, and FG%. But hey, maybe the model is finding something my intuition can't, right?&lt;br&gt;
&lt;a href="https://i.giphy.com/media/jTf6RxluSLJAnypCdp/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/jTf6RxluSLJAnypCdp/giphy.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is the post-feature importance score: &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdj80tsrfsubrjo59mn2t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fdj80tsrfsubrjo59mn2t.png" alt="drawing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It seems the features I removed earlier were contributing to the overall accuracy. Time to go back to the drawing board.&lt;/p&gt;

&lt;h1&gt;
  
  
  Step 4: Next Steps
&lt;/h1&gt;

&lt;p&gt;Clearly the permutation importance step dropped features that carried signal, causing the accuracy to go down. My next plan is to try PCA as another way to reduce features. Hopefully this will produce a better accuracy score. &lt;/p&gt;
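&lt;p&gt;For the curious, a PCA-based pipeline might look something like this sketch (again on synthetic stand-in data; scaling first matters because PCA is variance-based):&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1300, n_features=19, random_state=42)

# Scale first, keep enough principal components to explain 95% of the
# variance, then fit the same logistic regression on top.
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, scoring="roc_auc", cv=5)
print(f"PCA pipeline roc_auc: {scores.mean():.3f}")
```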

&lt;p&gt;If anyone has any more ideas of what I should try, I would appreciate them in the comments!&lt;/p&gt;

&lt;p&gt;Data from: &lt;a href="https://data.world/exercises/logistic-regression-exercise-1" rel="noopener noreferrer"&gt;https://data.world/exercises/logistic-regression-exercise-1&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
    </item>
    <item>
      <title>Sesame Street: Harmless Kids TV Show or Skynet's First Step?</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Thu, 19 Dec 2019 17:21:53 +0000</pubDate>
      <link>https://dev.to/mayankp/sesame-street-harmless-kids-tv-show-or-skynet-s-first-step-4380</link>
      <guid>https://dev.to/mayankp/sesame-street-harmless-kids-tv-show-or-skynet-s-first-step-4380</guid>
      <description>&lt;p&gt;When most people think of Sesame Street, they believe it to be an innocent kids TV show that teaches our youth words and math. Count von Count, with his number knowledge. Big Bird with his goofy and awkward height, which is only matched by his big heart. And Cookie Monster, whose only goal in life is to eat cookies--or is it? These seemingly harmless characters have an ulterior motive. A much darker one that would end humanity as we know it. Their easy-going demeanors are meant to catch us off guard until it's too late...BERT, ERNIE, and most importantly, their notorious ring-leader ELMo are more than just Sesame Street characters. They are the forefront of cutting edge AI designed to learn and adapt to the human language. In other words, they learn how to use and communicate through our languages for reasons unknown. It's time to wake up, everyone. Elmo is evil. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/M8qzn8yJ72MRG/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/M8qzn8yJ72MRG/giphy.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No. Not really, Elmo. Not yet at least. &lt;/p&gt;

&lt;p&gt;Predicting what to say is extremely hard. Sometimes even humans have a tough time thinking of what to say, like when a long-time crush decides to talk to you or when you're about to give a speech in front of an audience. Similarly, computers have a hard time predicting what to say when given a sentence or test set. As humans, it takes most of our toddler years to form proper sentences and learn when to use the words we've learned in the proper situation (it doesn't make sense to say "I like rainbow ponies and unicorns" during a company board meeting). Learning a language is supervised learning, trialed and tested over years of human development, and it can continue our whole lifetimes. As a matter of fact, behavioral scientists say the best time to learn a language is usually our toddler years, which ironically is the target audience of Sesame Street. Coincidence? I think not.&lt;br&gt;
&lt;a href="https://i.giphy.com/media/JV7sokLFwQdfG/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/JV7sokLFwQdfG/giphy.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  ELMo
&lt;/h1&gt;

&lt;p&gt;ELMo stands for Embeddings from Language Models. Embedding is the process of converting words into vectors in a space. This is advantageous because you can then apply numerical and graphical techniques to strings in order to predict or visualize them. Existing word embeddings at the time, like GloVe, assigned the same vector to the same word, without any context clues. However, this approach has flaws that can be improved upon. &lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Sentence 1: The worker has to buy the paint.&lt;/p&gt;

&lt;p&gt;Sentence 2: The worker tried to paint the fence. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;As English-speaking humans, we know that "paint" has a different meaning in each sentence, so it should be represented differently. Previous word embeddings would give "paint" the same value even though it represented different ideas. ELMo differed from existing word embeddings because it gave the same word different vector values based on the context of the sentence. After converting the words, it puts its embeddings through its magic sauce of RNNs and CNNs in order to spit out a predicted outcome, drawing the words as vectors and then converting them back to strings. This seems innocent enough, so how does it relate to evil Elmo? Well, in his song, "Elmo's World," he talks about loving his crayons. What can crayons do? Draw! Yes. I'm saying Elmo uses his crayons to draw embeddings and try to speak like a human. &lt;/p&gt;
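&lt;p&gt;To make the "paint" problem concrete, here is a toy sketch of a static, GloVe-style lookup table (the vectors are made-up 3-d numbers): no matter which sentence the word appears in, the lookup returns the same vector, which is exactly the limitation ELMo addresses:&lt;/p&gt;

```python
# A static embedding table has exactly one vector per word, so "paint"
# the noun and "paint" the verb collide. A contextual model like ELMo
# would emit a different vector for each occurrence.
static_embeddings = {           # hypothetical 3-d vectors
    "paint": (0.2, -0.7, 0.5),
    "fence": (0.9, 0.1, -0.3),
}

s1 = "The worker has to buy the paint"      # "paint" as a noun
s2 = "The worker tried to paint the fence"  # "paint" as a verb

v1 = static_embeddings["paint"]  # lookup ignores the sentence entirely
v2 = static_embeddings["paint"]
print(v1 == v2)  # True: same vector despite different meanings
```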

&lt;p&gt;In case you forgot the song: &lt;a href="https://www.youtube.com/watch?v=OeVp9S1HzqI" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=OeVp9S1HzqI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/1fBoaSrhRtoYg/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/1fBoaSrhRtoYg/giphy.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  BERT
&lt;/h1&gt;

&lt;p&gt;Similar to ELMo, BERT, which stands for Bidirectional Encoder Representations from Transformers, is another language model used to predict words. Unlike ELMo, though, BERT moves away from RNNs. &lt;/p&gt;

&lt;p&gt;RNNs are neural nets that add a time variable. This method is commonly used in translators that take each word you say and convert it to a different language. However, the problem with this is that it generally "forgets" words said earlier in a paragraph. This could hurt the prediction model because it could forget the context clues provided earlier in a text of data. Furthermore, they read data and do calculations in only one direction, which affects predictions.  &lt;/p&gt;

&lt;p&gt;Instead of reading words only from left to right (like we do in English) or right to left (like in Arabic), BERT uses a Transformer to read text both ways, which is why it's called bidirectional. Transformers in BERT are used to focus attention on certain details in a data set. Take a moment to identify the important features you notice in the picture below before you go on to the next paragraph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fplwm4vxvc9gpwhx7u9y1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fplwm4vxvc9gpwhx7u9y1.png" alt="Example Pic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the picture above, our brain notices a couple of important features. The first is the bridge, which is clearly outlined and in the middle. The second is the mountains in the background, which contrast with the sky. The bridge attracts our brain because it is in the middle, and the mountains attract our brain because of the color contrast between them and the sky. The rest of the picture, such as the middle of the sky or the water, our brain decides is not as important. Over the years, our brain has been conditioned to look for objects in the middle of a picture and objects that contrast greatly with their surroundings. Similar to how our brain uses the ENHANCE! feature, this Transformer does the same. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foigek0ljqq29pnf1any4.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foigek0ljqq29pnf1any4.gif"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;The Transformer then encodes the training data into vectors, like ELMo, and gives each of them weights at each node while reading the sample backwards and forwards. BERT also embeds words based on their position in the sentence, the entire sentence, and each word itself. This allows the neural nets to train on a lot more information and come up with a better model.&lt;/p&gt;
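&lt;p&gt;A toy way to see why reading in both directions helps: try to fill in a masked word using only the word to its left versus both neighbors. This is just counting over a two-sentence corpus, not a real language model:&lt;/p&gt;

```python
# Guess the [MASK] token from a tiny corpus, first using only the word
# to its left (like a one-direction model), then using both neighbors
# (crudely mimicking bidirectional context).
corpus = [
    "the bank of the river",
    "the bank approved the loan",
]
tokens = [s.split() for s in corpus]

sentence = "the [MASK] approved the loan".split()
i = sentence.index("[MASK]")

# Left-only context: every word seen after "the" is a candidate.
left_candidates = {t[j] for t in tokens for j in range(1, len(t))
                   if t[j - 1] == sentence[i - 1]}

# Both-sides context: the word must sit between "the" and "approved".
both_candidates = {t[j] for t in tokens for j in range(1, len(t) - 1)
                   if t[j - 1] == sentence[i - 1]
                   and t[j + 1] == sentence[i + 1]}

print(sorted(left_candidates))  # several options
print(sorted(both_candidates))  # narrowed down
```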

&lt;p&gt;Fun facts: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;BERT was trained on Wikipedia articles.&lt;/p&gt;

&lt;p&gt;There's a Small BERT, Medium BERT, Large BERT, and Extra Large BERT.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Researching the complexity of language models has left me in awe of how powerful and smart our brains are. The smartest and largest companies in the world are trying to imitate our brain--something we take for granted. Language processing still has a long way to go before it can accurately predict what to say. However, it took evolution millions of years to get the brain to this point, while it has taken only a decade to get AI from nothing to closely modeling the language processing our brains do. The rapid growth and development is astonishing, but it'll still take some time. So for now, I guess Skynet will have to wait a while.&lt;/p&gt;

&lt;p&gt;If you want to see an example of language processing models in action, here's a great link about someone who fed a model Hallmark movies and generated a script with it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://twitter.com/KeatonPatti/status/1072877290902745089" rel="noopener noreferrer"&gt;https://twitter.com/KeatonPatti/status/1072877290902745089&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Update 1: I see suspicious "people" following me around. I think Skynet is on to me...&lt;/p&gt;

&lt;p&gt;Update 2: Human fine. Nothing wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/67RCy4sOyixOg/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/67RCy4sOyixOg/giphy.gif"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Information from: &lt;br&gt;
&lt;a href="https://medium.com/@jonathan_hui/nlp-bert-transformer-7f0ac397f524" rel="noopener noreferrer"&gt;https://medium.com/@jonathan_hui/nlp-bert-transformer-7f0ac397f524&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/synapse-dev/understanding-bert-transformer-attention-isnt-all-you-need-5839ebd396db" rel="noopener noreferrer"&gt;https://medium.com/synapse-dev/understanding-bert-transformer-attention-isnt-all-you-need-5839ebd396db&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/" rel="noopener noreferrer"&gt;https://www.analyticsvidhya.com/blog/2019/03/learn-to-use-elmo-to-extract-features-from-text/&lt;/a&gt;&lt;br&gt;
Picture from:&lt;br&gt;
&lt;a href="https://www.deviantart.com/nickchoubg/art/Landscape-Wallpaper-299457255" rel="noopener noreferrer"&gt;https://www.deviantart.com/nickchoubg/art/Landscape-Wallpaper-299457255&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Going Wildcat</title>
      <dc:creator>mayank-p</dc:creator>
      <pubDate>Tue, 26 Nov 2019 23:06:45 +0000</pubDate>
      <link>https://dev.to/mayankp/going-wildcat-367d</link>
      <guid>https://dev.to/mayankp/going-wildcat-367d</guid>
      <description>&lt;p&gt;I recently graduated with a petroleum engineering degree from the University of Texas at Austin. I decided to do petroleum engineering for a couple of reasons. During high school, I liked math and science and felt that engineering would be a good way to challenge myself in college and as a career. Furthermore, my dad worked as a chemical engineer for an oil and gas company, and I felt like doing something similar as him but something different at the same time. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/l2YWF00ZX8wOs0p0s/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/l2YWF00ZX8wOs0p0s/giphy.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By talking to industry professionals in college and interning in IT-related industries, I realized the need for and potential of data science and analysis in an industry that is slow to adapt. I took some online courses on data analysis through Coursera and found that I really liked making a computer recognize images through pixels. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/orUDTj9Q5TMzTdB892/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/orUDTj9Q5TMzTdB892/giphy.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ability of data science to be applied to a wide range of activities, anywhere from deciding where to drill to financial models to sports analytics, gives me enormous potential to learn and pursue my interests. As technology and analysis become more important in the 21st century, I want to be part of the ongoing change and potential of data science, which is why I want to learn more about it.  &lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/UPYVzyd49BJcf6PqhH/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/UPYVzyd49BJcf6PqhH/giphy.gif" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
