<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ankit Patel</title>
    <description>The latest articles on DEV Community by Ankit Patel (@ankit1797).</description>
    <link>https://dev.to/ankit1797</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F395426%2F94c9615d-006d-41e2-98d4-f2fd47241e71.jpeg</url>
      <title>DEV Community: Ankit Patel</title>
      <link>https://dev.to/ankit1797</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ankit1797"/>
    <language>en</language>
    <item>
<title>Why you should consider Udacity's 75% off Black Friday deal and enlighten yourself this holiday season</title>
      <dc:creator>Ankit Patel</dc:creator>
      <pubDate>Fri, 27 Nov 2020 19:57:54 +0000</pubDate>
      <link>https://dev.to/ankit1797/why-you-should-consider-udacity-s-75-off-black-friday-deal-and-get-yourself-enlightened-in-this-holiday-season-5bhk</link>
      <guid>https://dev.to/ankit1797/why-you-should-consider-udacity-s-75-off-black-friday-deal-and-get-yourself-enlightened-in-this-holiday-season-5bhk</guid>
<description>&lt;p&gt;First of all, if you are seeing this post on Black Friday, then happy Thanksgiving to all of you. As this is the holiday season, I know everyone is a little busy and mostly spending time with their families. But in this post I'm going to cover an important deal which might help you uplift your career or land a high-paying job. So, if you are interested, please go through this post thoroughly. &lt;/p&gt;

&lt;p&gt;Before going any further, I want to clarify that I have already graduated from &lt;a href="https://imp.i115008.net/c/2380748/788199/11298"&gt;Udacity&lt;/a&gt;'s Deep Learning Nanodegree program, so I am recommending this site based on my own experience. You can verify my Nanodegree &lt;a href="https://graduation.udacity.com/confirm/MSRDNEAW"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Why &lt;a href="https://imp.i115008.net/c/2380748/788199/11298"&gt;Udacity&lt;/a&gt;?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you don't know about Udacity, let me tell you a little
about it. Udacity was founded by Sebastian Thrun, David
Stavens, and Mike Sokolsky in June 2011. Today, Udacity is a
leader in the online education sector, providing
industry-leading, competitive flagship Nanodegree programs
through its distance-learning platform.&lt;/li&gt;
&lt;li&gt;Most of Udacity's Nanodegree programs are also made in
collaboration with Google, Facebook, Amazon AWS, Microsoft
Azure, Intel, and more, so you can be confident that each
course delivers quality content with hands-on experience.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;What are the flagship courses that &lt;a href="https://imp.i115008.net/c/2380748/788199/11298"&gt;Udacity&lt;/a&gt; offers?&lt;br&gt;
Here I'm mentioning some of the Nanodegree programs which I think will help you excel in your career:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data Structures and Algorithms&lt;/li&gt;
&lt;li&gt;Android Kotlin Developer&lt;/li&gt;
&lt;li&gt;DevOps Engineer for Microsoft Azure&lt;/li&gt;
&lt;li&gt;AWS Cloud Architect&lt;/li&gt;
&lt;li&gt;Machine Learning Engineer for Microsoft Azure&lt;/li&gt;
&lt;li&gt;Intel® Edge AI for IoT Developers&lt;/li&gt;
&lt;li&gt;Data Scientist&lt;/li&gt;
&lt;li&gt;Intro to Machine Learning with TensorFlow&lt;/li&gt;
&lt;li&gt;Become a Computer Vision Expert&lt;/li&gt;
&lt;li&gt;Machine Learning Engineer&lt;/li&gt;
&lt;li&gt;Natural Language Processing&lt;/li&gt;
&lt;li&gt;Self Driving Car Engineer&lt;/li&gt;
&lt;li&gt;Flying Car and Autonomous Flight Engineer&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is the offer?&lt;br&gt;
This holiday season, Udacity announced a flat 75% discount on all Nanodegree programs. So if you want to shift your career toward the latest technologies like machine learning, self-driving cars, deep learning, or cloud architecture, you should grab this offer and show your employer your latest skills by graduating from the Nanodegree you want to pursue.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How to avail this 75% off offer?&lt;br&gt;
First, go to Udacity's official website. You can visit it via this &lt;a href="https://imp.i115008.net/c/2380748/788199/11298"&gt;link&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bOJ7szr2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rr03xdmof95p7m5xosna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bOJ7szr2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rr03xdmof95p7m5xosna.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Browse to the course you want to study.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YMwzdOi7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xb1ee7kvi9dft2huwvtq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YMwzdOi7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xb1ee7kvi9dft2huwvtq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Click the Enroll Now button to select and buy the course.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LA_nHM9r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ly2gtis9fpxvghmw7344.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LA_nHM9r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ly2gtis9fpxvghmw7344.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Now you should see a 75% off coupon code automatically applied to your cart. If it is not there, you can apply it manually. The coupon code for the 75% discount on any Nanodegree is BLACKFRIDAY75.&lt;/p&gt;

&lt;p&gt;After applying that code, just pay via credit card or PayPal. Then go to the My Courses section and you will see that the purchased course is now available in your list. Happy learning!&lt;/p&gt;

</description>
      <category>career</category>
      <category>udacity</category>
      <category>learning</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Xbox vs PS4: which is the best gaming platform based on data</title>
      <dc:creator>Ankit Patel</dc:creator>
      <pubDate>Sat, 25 Jul 2020 13:33:42 +0000</pubDate>
      <link>https://dev.to/ankit1797/xbox-vs-ps4-which-is-the-best-gaming-platform-based-on-data-55pb</link>
      <guid>https://dev.to/ankit1797/xbox-vs-ps4-which-is-the-best-gaming-platform-based-on-data-55pb</guid>
<description>&lt;p&gt;In this Udacity Data Scientist Nanodegree project, I want to find out the underlying factors behind a video game's success. As an avid gamer myself, I thought it would make an interesting project, and as such I aim to answer four questions:&lt;/p&gt;

&lt;p&gt;Q1: Which genres perform the best?&lt;/p&gt;

&lt;p&gt;Q2: Which platform is used the most by users, PS4 or Xbox?&lt;/p&gt;

&lt;p&gt;Q3-Q4: How do a game's platform, genre, critic score, user score, and rating influence sales, and can we&lt;br&gt;
predict sales using these attributes?&lt;/p&gt;

&lt;p&gt;Before I start, I should be clear about the dataset I have used for this project: an open-source Kaggle dataset, which can be found &lt;a href="https://www.kaggle.com/gregorut/videogamesales"&gt;here&lt;/a&gt;.&lt;/p&gt;
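&lt;p&gt;As a minimal sketch of the setup (assuming the Kaggle CSV, including the critic/user score columns, is saved locally as vgsales.csv; the exact file name and column set depend on the version you download), loading and inspecting the data looks like this:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# Load the Kaggle video game sales dataset (the file name is an assumption).
df = pd.read_csv('vgsales.csv')

# Inspect the columns and size before starting the analysis.
print(df.columns.tolist())
print(df.shape)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;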

&lt;p&gt;Now let's dive into the findings:&lt;/p&gt;

&lt;h4&gt;
  
  
  Q1: Which genres perform the best?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xAzDxzX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cdnvtzmi1e57b1yfgcwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xAzDxzX4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/cdnvtzmi1e57b1yfgcwl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — List of genre.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d4FKGbFh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/daoaw7zf1s2pjzymnihd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d4FKGbFh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/daoaw7zf1s2pjzymnihd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — List of genre by continent.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;As we can see, the top 3 genres, &lt;em&gt;Action, Shooter,&lt;/em&gt; and &lt;em&gt;Sports&lt;/em&gt;, make up by far the largest sales by genre. &lt;br&gt;
When we break it down by region, we can see that North America values shooters above any other genre, while for pure action the EU is slightly ahead. &lt;br&gt;
For sports they are practically evenly matched, while the EU takes the majority for racing, and the EU and NA are again tied for role-playing.&lt;/p&gt;
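&lt;p&gt;A minimal sketch of the aggregation behind these genre figures (the Genre, Global_Sales, and regional sales column names follow the Kaggle dataset's schema):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Total global sales per genre, sorted to surface the top genres.
genre_sales = df.groupby('Genre')['Global_Sales'].sum().sort_values(ascending=False)
print(genre_sales.head(3))

# The same breakdown by region, behind the per-continent comparison.
regional = df.groupby('Genre')[['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']].sum()
print(regional.sort_values('NA_Sales', ascending=False))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;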

&lt;h4&gt;
  
  
  Q2: Which platform is used the most by users, PS4 or Xbox?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--os7NMNBz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/paz65fc2y8ho8qvgwvp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--os7NMNBz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/paz65fc2y8ho8qvgwvp5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — List of data by Platform.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TnZyOSIj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c7xb4cxbc3yhqzlc3do1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TnZyOSIj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c7xb4cxbc3yhqzlc3do1.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — Popularity of Platform.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;Here we can see that for the PS4, global sales peaked in its release year, with the following year seeing only a slight decline. For the Xbox the opposite is true: it saw a slight increase globally in its second year. Overall, the PS4 outshines the Xbox globally, which can be seen in the following bar chart.&lt;/p&gt;
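&lt;p&gt;A minimal sketch of the comparison behind these platform figures (the platform codes 'PS4' and 'XOne', and the 'Year' column name, are assumptions about this version of the dataset):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Global sales per platform and year, behind the PS4 vs Xbox comparison.
consoles = df[df['Platform'].isin(['PS4', 'XOne'])]
yearly = consoles.pivot_table(index='Year', columns='Platform',
                              values='Global_Sales', aggfunc='sum')
print(yearly)

# Overall totals per platform, behind the global comparison.
print(consoles.groupby('Platform')['Global_Sales'].sum())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;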

&lt;h4&gt;
  
  
  Q3: Looking at the correlation of features with overall sales
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--b-auZRay--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6oebq9c8qlz1hjkld0rr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--b-auZRay--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6oebq9c8qlz1hjkld0rr.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — Correlation of Global_Sales and Critic_Score.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NZoAS7WD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2aiyhnw46dmvdb84ow5j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NZoAS7WD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2aiyhnw46dmvdb84ow5j.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — Correlation of Global_Sales and User_Score.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7X43Ht3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rrjwg1lw383e8fxh08c3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7X43Ht3H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rrjwg1lw383e8fxh08c3.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — Heatmap of All the features.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7X5m0TEs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1so3iw2pdltksx2qm6qq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7X5m0TEs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1so3iw2pdltksx2qm6qq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — Pair Grid of Global_Sales, Critic_Score and User_Score.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;Looking above, we can see that critic score and global sales have a positive correlation. User_Count, Critic_Count, and Critic_Score all seem to have the highest positive correlations with Global_Sales, with Genre also showing a relatively high positive correlation. Check out below for more:&lt;/p&gt;
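&lt;p&gt;A minimal sketch of how these correlations can be computed, continuing from the loading sketch above (in some versions of the dataset User_Score is stored as a string with 'tbd' entries, hence the coercion to numeric):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt

# Coerce the score columns to numeric ('tbd' and blanks become NaN).
num_cols = ['Global_Sales', 'Critic_Score', 'Critic_Count', 'User_Score', 'User_Count']
numeric = df[num_cols].apply(pd.to_numeric, errors='coerce')

# Correlation of each feature with Global_Sales, plus the full heatmap.
corr = numeric.corr()
print(corr['Global_Sales'].sort_values(ascending=False))
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;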

&lt;h4&gt;
  
  
  Q4: Can we predict sales with these attributes?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sGZFjxxH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xns77t19shp5otb75kgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sGZFjxxH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xns77t19shp5otb75kgs.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
  &lt;em&gt;&lt;center&gt;Figure — Prediction of sales.&lt;/center&gt;&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;So, we can obviously assume that sales outside the major regions would correlate with global sales as a whole, but I think the interesting aspect is that neither Europe nor North America makes up the biggest part of overall sales; most likely it is places like Australia, Asia, South America, etc.&lt;/p&gt;
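&lt;p&gt;As a baseline illustration of the prediction step (a plain linear regression on the score features; this is a sketch, not the exact model from my notebook):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Keep the numeric predictors and drop rows with missing values.
model_df = df[['Critic_Score', 'Critic_Count', 'User_Count', 'Global_Sales']]
model_df = model_df.apply(pd.to_numeric, errors='coerce').dropna()

X = model_df.drop(columns='Global_Sales')
y = model_df['Global_Sales']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print('R^2 on held-out data:', model.score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;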

&lt;h4&gt;
  
  
  Conclusion:
&lt;/h4&gt;

&lt;p&gt;In this article, we looked at the most popular genres according to the Kaggle dataset.&lt;br&gt;
We also saw that the PS4 dominates the gaming market based on our analysis.&lt;br&gt;
Finally, we looked at some of the correlations and found an interesting fact: neither Europe nor North America makes up the biggest part of overall sales; most likely it is places like Australia, Asia, South America, etc.&lt;/p&gt;

&lt;p&gt;To see more of this analysis, check out my GitHub repository linked below.&lt;/p&gt;

&lt;h4&gt;
  
  
  References:
&lt;/h4&gt;

&lt;p&gt;GitHub:&lt;br&gt;
&lt;a href="https://github.com/ankit1797/Write-A-DS-Blog-Post"&gt;https://github.com/ankit1797/Write-A-DS-Blog-Post&lt;/a&gt;&lt;br&gt;
Video Game Sales Dataset:&lt;br&gt;
&lt;a href="https://www.kaggle.com/gregorut/videogamesales"&gt;https://www.kaggle.com/gregorut/videogamesales&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>What are the most common programming languages used in Brazil?</title>
      <dc:creator>Ankit Patel</dc:creator>
      <pubDate>Mon, 29 Jun 2020 04:07:37 +0000</pubDate>
      <link>https://dev.to/ankit1797/what-are-the-most-common-programming-languages-used-in-brazil-1obn</link>
      <guid>https://dev.to/ankit1797/what-are-the-most-common-programming-languages-used-in-brazil-1obn</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QqkrfMM0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A_M7bffifBKe0VOj_IoyCRA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QqkrfMM0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A_M7bffifBKe0VOj_IoyCRA.png" alt="Programming Languages"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With the arrival of new areas in Brazil such as Data Science and Machine Learning, many programming languages that were rarely discussed or used are now experiencing a rise in popularity.&lt;/p&gt;

&lt;p&gt;In this article, we are going to analyse real data to verify whether these programming languages are really being used in Brazil or whether that is just a rumour.&lt;/p&gt;

&lt;p&gt;For this, we are going to use data from Stack Overflow’s 2017 and 2018 Annual Developer Survey.&lt;/p&gt;
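&lt;p&gt;As a minimal sketch of the setup (the file names are assumptions; the CSVs can be downloaded from the survey link in the References), loading the data and restricting it to Brazilian respondents looks like this:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# File names are assumptions; download the CSVs from the survey site first.
df_2017 = pd.read_csv('survey_results_public_2017.csv', low_memory=False)
df_2018 = pd.read_csv('survey_results_public_2018.csv', low_memory=False)

# Keep only the respondents from Brazil.
brazil_2017 = df_2017[df_2017['Country'] == 'Brazil']
brazil_2018 = df_2018[df_2018['Country'] == 'Brazil']
print(len(brazil_2017), len(brazil_2018))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;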

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JwRwcJYT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A9O5qDDaX7mtFROy7NoPX1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JwRwcJYT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A9O5qDDaX7mtFROy7NoPX1w.png" alt="Stack Overflow Logo"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every year, Stack Overflow conducts a massive survey of the people on its website, covering all sorts of information such as programming languages, jobs, code style, and more.&lt;/p&gt;

&lt;p&gt;There were more than 150 questions as a part of the survey, including:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“What frameworks do you work with?”&lt;/em&gt;&lt;br&gt;
 &lt;em&gt;“Do you program as a hobby or contribute to open source projects?”&lt;/em&gt;&lt;br&gt;
 “What IDE do you &lt;em&gt;work with&lt;/em&gt;?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Part 1 — What are the most used programming languages in Brazil?
&lt;/h3&gt;

&lt;p&gt;We can see that classic languages like &lt;strong&gt;SQL&lt;/strong&gt;, &lt;strong&gt;Java&lt;/strong&gt;, and &lt;strong&gt;JavaScript&lt;/strong&gt; are still in the top positions in both the Stack Overflow 2017 and Stack Overflow 2018 survey data.&lt;/p&gt;

&lt;p&gt;We can also see that in 2018 two languages, &lt;strong&gt;HTML&lt;/strong&gt; and &lt;strong&gt;CSS&lt;/strong&gt;, rose in the ranking. These languages are old and commonly used, so this jump probably happened because the 2017 Stack Overflow survey did not offer them as options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---U4kMRrF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AU4gHt4pw6tXh6euSonjwag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---U4kMRrF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AU4gHt4pw6tXh6euSonjwag.png" alt="Figure 1 — Percentage of use of programming languages among all the cited programming languages by the respondents."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These data give us valid information, but it would be even more interesting to know which programming languages are the most wanted.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 2 — What are the most wanted programming languages in Brazil?
&lt;/h3&gt;

&lt;p&gt;We can see that &lt;strong&gt;Python&lt;/strong&gt; has grown tremendously compared to other languages. This is probably because it is a very versatile programming language that has been used extensively in data-related areas.&lt;/p&gt;

&lt;p&gt;Most of the languages that appeared as the most used at work also appeared in the ranking of most wanted languages, which shows us that many people want to learn these languages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1xthxrLV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AfHZHBIOImjnXl7LK7B-tYQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1xthxrLV--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AfHZHBIOImjnXl7LK7B-tYQ.png" alt="Figure 2 — Percentage of desire of programming languages among all the cited programming languages by the respondents."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Part 3 — How do the programming languages used at work relate to the languages people want to learn?
&lt;/h3&gt;

&lt;p&gt;By looking at the raw data, we can spot some patterns: for people who use Python at work, for example, Python is also cited as a language they want to learn in the next year.&lt;/p&gt;

&lt;p&gt;A natural question arises:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Does the pattern observed for Python hold for the other languages?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To address this question, we built a heat map that indicates how the languages used at work relate to the desired languages. The darker a cell, the stronger the relationship between the two languages.&lt;/p&gt;
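&lt;p&gt;A minimal sketch of how such a heat map can be built from the 2018 data, continuing from the loading sketch above (the LanguageWorkedWith and LanguageDesireNextYear column names follow the 2018 survey schema; the 2017 schema uses different names):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import seaborn as sns
import matplotlib.pyplot as plt

# Both columns hold semicolon-separated lists; expand them into pairs.
cols = ['LanguageWorkedWith', 'LanguageDesireNextYear']
pairs = []
for _, row in brazil_2018[cols].dropna().iterrows():
    for worked in row['LanguageWorkedWith'].split(';'):
        for desired in row['LanguageDesireNextYear'].split(';'):
            pairs.append((worked, desired))
pair_df = pd.DataFrame(pairs, columns=['worked', 'desired'])

# Normalize per "worked" language so each row sums to 100%.
heat = pd.crosstab(pair_df['worked'], pair_df['desired'], normalize='index') * 100
sns.heatmap(heat, cmap='Blues')
plt.show()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;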

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sjKvJAGw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AnOeq2EiGQW8yiHcGKWi5Vw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sjKvJAGw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AnOeq2EiGQW8yiHcGKWi5Vw.png" alt="Figure 3 — Percentage of relationship of work programming languages and desire programming languages ​​in 2017 and 2018."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From this figure we can draw two insights:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;As is evidenced by the diagonal line, people who already work with a programming language have a strong probability of wanting to learn the same programming language.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;People who work with a programming language from a specific area tend to want to learn programming languages from the same area. For example, &lt;strong&gt;HTML&lt;/strong&gt; is strongly correlated with &lt;strong&gt;CSS&lt;/strong&gt; and &lt;strong&gt;JavaScript&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this article, we had a look at the most popular and most wanted programming languages, according to Stack Overflow’s 2017 and 2018 Annual Developer Survey data.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;People who already work with a certain programming language tend to want to keep learning that language, or related languages within the same area, to improve their own skills.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We have seen that some older programming languages like JavaScript, SQL, and Java still dominate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A younger programming language like &lt;strong&gt;Python&lt;/strong&gt; is well worth learning now, but the older ones still have their value and remain in high demand.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;p&gt;Github: &lt;a href="https://github.com/ankit1797/Write-A-Data-Science-Blog-Post"&gt;https://github.com/ankit1797/Write-A-Data-Science-Blog-Post&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stackoverflow Developer Survey Data: &lt;a href="https://insights.stackoverflow.com/survey"&gt;https://insights.stackoverflow.com/survey&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Sparkify — Digital Music Service - Customer Churn Analysis &amp; Prediction</title>
      <dc:creator>Ankit Patel</dc:creator>
      <pubDate>Sat, 27 Jun 2020 12:22:52 +0000</pubDate>
      <link>https://dev.to/ankit1797/sparkify-digital-music-service-customer-churn-analysis-prediction-14de</link>
      <guid>https://dev.to/ankit1797/sparkify-digital-music-service-customer-churn-analysis-prediction-14de</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fUK2rs3O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A0A5Iojsakoc9Odtji7fqbQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fUK2rs3O--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2A0A5Iojsakoc9Odtji7fqbQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Sparkify — Digital Music Service&lt;/em&gt;&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Customer Churn Analysis &amp;amp; Prediction&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Sparkify is a digital music service similar to Spotify and fizy. Users subscribe and listen to their favourite songs under free (ad-supported) or paid plans, add friends, add songs to their playlists, and so on. Users can also upgrade, downgrade, or cancel their subscriptions at any time. As with all digital subscription services, member churn is the most important problem, so the company should build a model to identify customers who are likely to churn and make marketing offers to keep their subscriptions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Business Problem:&lt;/em&gt;&lt;/strong&gt; predicting potential customers who will churn from the Sparkify digital music service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Datasets:&lt;/em&gt;&lt;/strong&gt; Udacity provides a mini dataset (128MB) and a full dataset (12GB) containing transactional log data of Sparkify users. I used the mini dataset in a Jupyter notebook to perform exploratory data analysis and implement the model. The mini JSON dataset includes 18 columns, 12 categorical (string) and 6 numerical (long and double), and 286,500 records.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H7zvY3AH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3004/1%2AsNURIdMfBh2jMsunh36cWw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H7zvY3AH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3004/1%2AsNURIdMfBh2jMsunh36cWw.png" alt="Sparkify — mini dataset (json file)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  After loading and a preliminary look at the data above, I started cleaning it:
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;1. Drop NaN and ‘ ‘ values (empty strings)&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#Check missing values and '' data for each column.
&lt;/span&gt;&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Count of Missing Values for each Column'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;missing_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;isNull&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;isnan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;])).&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'{}: '&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;missing_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_ZyizUsw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3680/1%2Af7JW6wfhaVpNBajl7STyKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_ZyizUsw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3680/1%2Af7JW6wfhaVpNBajl7STyKg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;*There are 8.346 missing values for userId and other related columns (first name, last name etc.) These records should be dropped, because userId is important column to define potential customers who churn. After cleaning, there are also 58.392 missing values for artist and song. However, they can be acceptable, because all Sparkify transaction in dataset is not related with play song in data.*
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
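&lt;p&gt;A minimal sketch of the drop itself (assuming, as in the rest of the notebook, that the raw frame is df and the cleaned one is df_clean):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Drop rows with a missing or empty userId; keep the artist/song NaNs.
df_clean = df.dropna(how='any', subset=['userId'])
df_clean = df_clean.filter(df_clean['userId'] != '')
print(df_clean.count())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;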

&lt;p&gt;&lt;em&gt;2. Convert the “registration” and “ts” columns to human-readable form to obtain the registration and page event log dates in the mini dataset.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;convert_ts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fromtimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;1000.0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;strftime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"%Y-%m-%d %H:%M:%S"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'page_event_time'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;convert_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'ts'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'registration_time'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;convert_ts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'registration'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Furthermore, I will create a “register_duration” column to capture customer lifetime, the number of days between the registration date and the last page event date. It will also be used as a feature in modelling.&lt;/p&gt;
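&lt;p&gt;A minimal sketch of that feature, assuming the “page_event_time” and “registration_time” columns created above:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.functions import datediff, max as spark_max

# Last page event per user, then the day difference to the registration date.
last_event = df_clean.groupBy('userId') \
                     .agg(spark_max('page_event_time').alias('last_event_time'))
df_clean = df_clean.join(last_event, on='userId') \
                   .withColumn('register_duration',
                               datediff('last_event_time', 'registration_time'))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;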

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7gy_5BpO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3120/1%2AFyGqkJEmyyifsB5EeWxwPA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7gy_5BpO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3120/1%2AFyGqkJEmyyifsB5EeWxwPA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;3. Split location columns to define city and state in different columns.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pyspark.sql.functions&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;
&lt;span class="n"&gt;split_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'location'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s"&gt;','&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'City'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split_col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df_clean&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'State'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split_col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;getItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S0qFnfDr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AIH2LpxttU6q4Tdc0c_5o4Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S0qFnfDr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2000/1%2AIH2LpxttU6q4Tdc0c_5o4Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After cleaning the data, there are 278,154 records and 22 columns, including the new columns page_event_time, registration_time, City, and State.&lt;/p&gt;

&lt;p&gt;After this data cleaning and type conversion, I created the following columns (a sketch of the two label columns follows the figures below):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;“Register_Duration”: the day difference between “page_event_time” and “registration_time”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Churn_Flag”: using the “Cancellation Confirmation” events of the page column to define churn as “1”&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BjXytJ3d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3656/1%2AcQiJGdczLBE02p6qOl-5Gw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BjXytJ3d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3656/1%2AcQiJGdczLBE02p6qOl-5Gw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;“Downgrade_Flag”: using the “Submit Downgrade” events of the page column to define a downgrade as “1”.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--M4Aeocdx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3644/1%2AmI-fgECjV4DxBZNIppZtOQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--M4Aeocdx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3644/1%2AmI-fgECjV4DxBZNIppZtOQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exploratory Data Analysis
&lt;/h2&gt;

&lt;p&gt;After cleaning data, I performed some exploratory data analysis.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;1. There are 121 male and 104 female users in the data frame, so the data is approximately balanced by gender.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT gender,COUNT(DISTINCT userId) AS user_counts
        FROM Sparkify_df_clean
        GROUP BY gender
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+------+-----------+

|gender|user_counts|

+------+-----------+

|     M|        121|

|     F|        104|

+------+-----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;2. The majority of the page event data is composed of “NextSong” (80% of the data: 228,108 / 278,154).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Furthermore, the “Cancellation Confirmation” page event occurs 52 times, and this event was used to define churn.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT page,COUNT(userId) AS user_counts
        FROM Sparkify_df_clean
        GROUP BY page
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+--------------------+-----------+

|                page|user_counts|

+--------------------+-----------+

|            NextSong|     228108|

|           Thumbs Up|      12551|

|                Home|      10082|

|     Add to Playlist|       6526|

|          Add Friend|       4277|

|         Roll Advert|       3933|

|              Logout|       3226|

|         Thumbs Down|       2546|

|           Downgrade|       2055|

|            Settings|       1514|

|                Help|       1454|

|             Upgrade|        499|

|               About|        495|

|       Save Settings|        310|

|               Error|        252|

|      Submit Upgrade|        159|

|    Submit Downgrade|         63|

|              Cancel|         52|

|Cancellation Conf...|         52|

+--------------------+-----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;3. There are 195 free and 165 paid accounts in this dataset, but only 225 distinct userIds in the data frame, so 135 users (195 + 165 - 225) appear under both levels, meaning they changed their account level.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT level,COUNT(DISTINCT userId) AS user_counts
        FROM  Sparkify_df_clean
        GROUP BY level
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-----+-----------+

|level|user_counts|

+-----+-----------+

| free|        195|

| paid|        165|

+-----+-----------+
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;4. I checked the distribution of the cities and states in which customers are located. The majority of users are in the CA, TX, NY-NJ-PA, and FL states, and LA-Long Beach and New York City are the top cities. However, there was no extraordinary location bias in the mini dataset.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# City distribution of users is showed below. LA and NY are top cities. 
&lt;/span&gt;&lt;span class="n"&gt;city_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT City,COUNT(DISTINCT userId) AS user_counts
        FROM Sparkify_df_clean
        GROUP BY City
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Plot graph for distribution
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'figure.figsize'&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'City Distribution of Users'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;barplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'user_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'City'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;city_dist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;palette&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'dark'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c1"&gt;# Majority of Users are in CA, TX, NY-NJ-PA, FL states.
&lt;/span&gt;&lt;span class="n"&gt;state_dist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT State,COUNT(DISTINCT userId) AS user_counts
        FROM Sparkify_df_clean
        GROUP BY State
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Plot graph for distribution
&lt;/span&gt;&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'figure.figsize'&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'State Distribution of Users'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;barplot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'user_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'State'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state_dist&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;palette&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;'dark'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N_nvM_3Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3960/1%2A7YrIDYGskw6ONjrUkTmnwA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N_nvM_3Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3960/1%2A7YrIDYGskw6ONjrUkTmnwA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cyyba8oh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3884/1%2AJXCq4dMNRcKAw4mV_vhH4w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cyyba8oh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3884/1%2AJXCq4dMNRcKAw4mV_vhH4w.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;5. I checked the length column (the play time recorded for each log row) and plotted the distribution of event log lengths. The majority of lengths are between 0 and 500 seconds.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;length_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT length
        FROM Sparkify_df_clean
'''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'figure.figsize'&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;distplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;length_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ppWB5yhp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3724/1%2A6xjcSoV2DM_mRTFUqCGGSg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ppWB5yhp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3724/1%2A6xjcSoV2DM_mRTFUqCGGSg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;6. There are 52 users who cancelled their service and 49 users who downgraded their service in the dataset.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT Churn_Flag,COUNT(DISTINCT userId) AS user_counts
        FROM  Sparkify_df_clean
        GROUP BY Churn_Flag
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT Downgrade_Flag,COUNT(DISTINCT userId) AS user_counts
        FROM  Sparkify_df_clean
        GROUP BY Downgrade_Flag
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kNoMwq3i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3372/1%2AZR-r4Jvx6P3_P2YgwUGsrQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kNoMwq3i--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3372/1%2AZR-r4Jvx6P3_P2YgwUGsrQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;7. Taking “gender” into account, I analysed churn against playing the next song. Male customers who churned played slightly more songs than female customers; on the other hand, female customers who did not churn played more songs than male customers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The same analysis was also performed for the “Thumbs Up” page event, and similar results were observed.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cust_positive_lifetime_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT  gender, Churn_Flag, COUNT(userId) AS user_counts
        FROM  Sparkify_df_clean
        WHERE page == "NextSong"
        GROUP BY gender, Churn_Flag
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'figure.figsize'&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'user_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'gender'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cust_positive_lifetime_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'User Counts'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Gender'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'best'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Listen Next Song Analysis: Churn vs. Gender'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;cust_positive_lifetime_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT  gender, Churn_Flag, COUNT(userId) AS user_counts
        FROM  Sparkify_df_clean
        WHERE page == "Thumbs Up"
        GROUP BY gender, Churn_Flag
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'figure.figsize'&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'user_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'gender'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;cust_positive_lifetime_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'User Counts'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Gender'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'best'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Thumbs Up Analysis: Churn vs. Gender'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AMB0pWrD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3916/1%2A97KKFMvJ2EsJ0icpVvb20w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AMB0pWrD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3916/1%2A97KKFMvJ2EsJ0icpVvb20w.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;8.&lt;/em&gt; &lt;em&gt;I also explored churn patterns by gender and account level. According to the analysis, male customers are more likely to churn than female customers. Furthermore, customers with paid accounts churned more than customers with free accounts.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Explore churn pattern according to gender: Male customers are more likely to churn than female customers.
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;churn_gender_analysis_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT gender, Churn_Flag, COUNT( DISTINCT userId) AS user_counts
        FROM  Sparkify_df_clean
        GROUP BY  gender, Churn_Flag
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'figure.figsize'&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'user_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hue&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'gender'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;churn_gender_analysis_pd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'User Counts'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Gender'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'best'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Churn vs. Gender Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;# Explore churn pattern according to level (paid and free): Customers having paid account are more likely to churn than customers having free account.
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;churn_level_analysis_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT level, COUNT(DISTINCT userId) AS user_counts
        FROM  Sparkify_df_clean
        WHERE page == "Cancellation Confirmation"
        GROUP BY level
        ORDER BY user_counts DESC
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;'figure.figsize'&lt;/span&gt;&lt;span class="p"&gt;:(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)})&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'level'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'user_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;churn_level_analysis_pd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'User Counts'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;legend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Level'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'best'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Churn vs. Level Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1EAlOzmx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3724/1%2AmDh3c1EYRw4VIJB97FR0TQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1EAlOzmx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3724/1%2AmDh3c1EYRw4VIJB97FR0TQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;9.&lt;/em&gt; &lt;em&gt;The registration duration (customer lifetime) of customers who churned is shorter than that of customers who did not churn.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;
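
&lt;p&gt;The df_register_duration DataFrame plotted below is prepared earlier in the notebook. As a rough, hedged sketch of how it can be derived (assuming, as in the Sparkify event log, that the ts and registration columns hold epoch timestamps in milliseconds):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch, not the notebook's exact code: days between a user's last
# logged event and their registration, assuming millisecond epoch timestamps.
from pyspark.sql import functions as F

df_register_duration = (
    df_clean
    .groupBy('userId', 'Churn_Flag')
    .agg(F.max('ts').alias('last_event_ts'),
         F.first('registration').alias('registration_ts'))
    .withColumn('Register_Duration',
                (F.col('last_event_ts') - F.col('registration_ts')) / (1000 * 60 * 60 * 24))
)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;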

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df_register_duration_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_register_duration&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;df_register_duration_pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Register_Duration'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'h'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Days Between Last Page Event and Registration'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Registration Duration vs. Churn Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BOM4ZRo5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2272/1%2Akf917LVRrbx0iuFGHg4aSg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BOM4ZRo5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/2272/1%2Akf917LVRrbx0iuFGHg4aSg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;10.&lt;/em&gt; &lt;em&gt;I analyzed the total number of songs listened to and the average number of songs listened to per session according to customers' churn status. Customers who churned typically listened to fewer songs than customers who did not churn.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Total number of song listened vs. Churn Analysis
# It shows that users churned listen less song.
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;churn_listenedsong_analysis_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT Churn_Flag, userID, sum(page_counts) AS sum_page_counts
        FROM
        (SELECT Churn_Flag, userId, count(page) AS page_counts
        FROM  Sparkify_df_clean
        WHERE page == "NextSong"
        GROUP BY Churn_Flag, userId)
        GROUP BY Churn_Flag, userID
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;churn_listenedsong_analysis_pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'sum_page_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'h'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Total Number of Song Listened'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Total Listened Song vs. Churn Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;# Average Number of song listened per session vs. Churn Analysis
# It shows that users churned listen less song per session.
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;churn_ses_listenedsong_analysis_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT Churn_Flag, userID, avg(page_counts) AS avg_page_counts
        FROM
        (SELECT Churn_Flag, userId, sessionid, count(page) AS page_counts
        FROM  Sparkify_df_clean
        WHERE page == "NextSong"
        GROUP BY Churn_Flag, userId, sessionid)
        GROUP BY Churn_Flag, userID
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;churn_ses_listenedsong_analysis_pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'avg_page_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'h'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Average Number of Song Listened Per Session'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Average Listened Song per Session vs. Churn Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--TcI_FJ-L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3908/1%2AbxNdzhkI2Y_umMJkO5o01Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--TcI_FJ-L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3908/1%2AbxNdzhkI2Y_umMJkO5o01Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;11.&lt;/em&gt; &lt;em&gt;I analyzed thumbs-up and thumbs-down events according to customers' churn status. Customers who churned typically gave fewer thumbs up than customers who did not churn, while thumbs-down counts look similar across the two groups.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Number of thumbs up vs. Churn Analysis: It shows that users churned make less thumbs up. 
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;thumbs_up_analysis_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT Churn_Flag, userId, count(page) AS page_counts
        FROM  Sparkify_df_clean
        WHERE page == "Thumbs Up"
        GROUP BY Churn_Flag, userId
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thumbs_up_analysis_pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'page_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'h'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Number of Thumbs Up'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Number of Thumbs Up vs. Churn Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;# Number of thumbs down vs. Churn Analysis: It shows that the number of thumbs down is similar according to churn flag. 
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;thumbs_down_analysis_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT Churn_Flag, userId, count(page) AS page_counts
        FROM  Sparkify_df_clean
        WHERE page == "Thumbs Down"
        GROUP BY Churn_Flag, userId
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;thumbs_down_analysis_pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'page_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'h'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Number of Thumbs Down'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Number of Thumbs Down vs. Churn Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fWJPQxjZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3792/1%2ACgolv5k69wpQCqoAsA_LVA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fWJPQxjZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3792/1%2ACgolv5k69wpQCqoAsA_LVA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;12.&lt;/em&gt; &lt;em&gt;Finally, I analyzed “Add to Playlist” and “Add Friend” page transactions according to churn status. Customers who churned typically performed fewer “Add Friend” and “Add to Playlist” transactions than customers who did not churn.&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Number of Add Playlist vs. Churn Analysis: It shows that users churned make less "add playlist" 
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;playlist_analysis_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT Churn_Flag, userId, count(page) AS page_counts
        FROM  Sparkify_df_clean
        WHERE page == "Add to Playlist"
        GROUP BY Churn_Flag, userId
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;playlist_analysis_pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'page_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'h'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Number of Add Playlist'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Number of Add Playlist vs. Churn Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;# Number of Add Friend vs. Churn Analysis: It shows that users churned make less "add friend" 
&lt;/span&gt;&lt;span class="n"&gt;df_clean&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;createOrReplaceTempView&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Sparkify_df_clean"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;addfriend_analysis_pd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'''
        SELECT Churn_Flag, userId, count(page) AS page_counts
        FROM  Sparkify_df_clean
        WHERE page == "Add Friend"
        GROUP BY Churn_Flag, userId
'''&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;boxplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;addfriend_analysis_pd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Churn_Flag'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'page_counts'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'h'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Number of Add Friend'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Customer Churn Flag - (0: No) (1: Yes)'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Number of Add Friend vs. Churn Analysis'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;despine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ax&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nws8xukQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3836/1%2AVTTSmnLCrlc0f1O2qgR0sg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nws8xukQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3836/1%2AVTTSmnLCrlc0f1O2qgR0sg.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Engineering&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After defining the churn label and completing the exploratory data analysis explained above, I calculated the feature metrics below to solve the problem (predicting which customers are likely to churn).&lt;/p&gt;

&lt;p&gt;These 10 features are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Registration duration (customer lifetime — important for loyalty)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Total number of songs listened&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Average number of songs listened per session&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of thumbs up&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of thumbs down&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of “Add Friend” events&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The number of “Add to Playlist” events&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Total length of listening&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gender&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Account level (paid or free)&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;These features serve as inputs, and the churn label as the output, for the machine learning models designed in the next section. Apart from “label” (the churn label), there are 10 feature metrics in the calculated dataset used in the models; a sketch of how such a feature table can be assembled follows.&lt;br&gt;
&lt;/p&gt;
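
&lt;p&gt;As a minimal sketch of how such a feature table can be packed into the single vector column Spark ML expects (assuming a hypothetical df_features DataFrame with one row per user, the listed metrics as numeric columns, and gender/level already index-encoded; all column names here are illustrative):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Hedged sketch with illustrative column names, not the notebook's exact code.
from pyspark.ml.feature import VectorAssembler, StandardScaler

feature_cols = ['Register_Duration', 'total_songs', 'avg_songs_per_session',
                'thumbs_up', 'thumbs_down', 'add_friend', 'add_playlist',
                'total_length', 'gender_idx', 'level_idx']

# Pack the 10 numeric columns into one vector, then standardize it.
assembler = VectorAssembler(inputCols=feature_cols, outputCol='raw_features')
assembled = assembler.transform(df_features)
scaler = StandardScaler(inputCol='raw_features', outputCol='features', withStd=True)
df_final_model_data = scaler.fit(assembled).transform(assembled)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;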

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Read model data from json file.
&lt;/span&gt;&lt;span class="n"&gt;model_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;read&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'df_final_model_data.json'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# change column name Churn_Label to label
&lt;/span&gt;&lt;span class="n"&gt;model_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;withColumnRenamed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Churn_Label"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"label"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Check columns and shape
&lt;/span&gt;&lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;printSchema&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;toPandas&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
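
&lt;p&gt;One small caveat on the shape check above: toPandas() collects the whole dataset to the driver, which is fine for this mini dataset but costly at full scale. A cheaper check that stays in Spark:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# Shape check without collecting the data to the driver.
print((model_data.count(), len(model_data.columns)))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;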



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MF9_7RNK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3360/1%2ALj1dQO41hhrm0oYci_C16Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MF9_7RNK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3360/1%2ALj1dQO41hhrm0oYci_C16Q.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Modelling&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;For this step of my project, I will build 3 base models that could be used for predicting churn. Since churn is a simple binary churned/not-churned value, I can leverage classification models. The models I will build are listed below: &lt;/p&gt;

&lt;p&gt;Logistic Regression Model -(&lt;a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.LogisticRegression&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;Random Forest Model -&lt;br&gt;
(&lt;a href="http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier"&gt;http://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;GBT Classifier Model - (&lt;a href="https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.GBTClassifier"&gt;https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.classification.GBTClassifier&lt;/a&gt;) &lt;/p&gt;

&lt;p&gt;From these models I will report their results and evaluate their Accuracy and F1-score. It is important that the models are as accurate as possible when determining which users are likely to churn, so accuracy is something I will check for all models. &lt;/p&gt;

&lt;p&gt;However, accuracy is not everything: I must also make sure the model does not flag many more users as potential churners than will actually churn. &lt;/p&gt;

&lt;p&gt;Too many false positives would mean time wasted on users who are not churning, and extra costs for Sparkify. To balance this, I will also evaluate and optimize the F1-score, the harmonic mean of precision and recall (F1 = 2 × Precision × Recall / (Precision + Recall)), which helps give me the right balanced model.&lt;/p&gt;
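
&lt;p&gt;The train and validation DataFrames used below come from a random split of the assembled model data, done earlier in the notebook. As a minimal sketch (the 60/20/20 ratios and the seed are assumptions for illustration), together with the pyspark.ml imports the models rely on:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.ml.classification import LogisticRegression, RandomForestClassifier, GBTClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Hedged sketch: hold out validation (and test) sets from the assembled model data.
train, validation, test = model_data.randomSplit([0.6, 0.2, 0.2], seed=42)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;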

&lt;p&gt;&lt;strong&gt;Logistic Regression Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# import time to calculate model training duration (seconds)
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;time&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Logistic Regression 
&lt;/span&gt;
&lt;span class="c1"&gt;# initialize classifier, set evaluater and build paramGrid
&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxIter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;f1_evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'f1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;paramGrid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ParamGridBuilder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;crossval_lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CrossValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;f1_evaluator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;estimatorParamMaps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;paramGrid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;numFolds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate time metric of model. 
&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cvModel_lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crossval_lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cvModel_lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avgMetrics&lt;/span&gt;
&lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;

&lt;span class="n"&gt;results_lr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cvModel_lr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictionCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"prediction"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Logistic Regression Metrics:'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Accuracy of model is : {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results_lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;})))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'F1 score of model is :{}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results_lr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;})))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'The training process of model took {} seconds'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
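
&lt;p&gt;Note that cvModel_lr.avgMetrics holds the cross-validated F1-score for each entry of the parameter grid (a single value here, since the grid is empty); the bare expression only displays in a notebook cell and would need print() in a script.&lt;/p&gt;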



&lt;p&gt;&lt;em&gt;Accuracy of model is : 0.7887323943661971&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;F1 score of model is :0.7189395194697596&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The training process of model took 4.507469654083252 seconds&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Random Forest Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 2. Random Forest Classifier 
&lt;/span&gt;
&lt;span class="c1"&gt;# initialize classifier, set evaluater and build paramGrid
&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;f1_evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'f1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;paramGrid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ParamGridBuilder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;crossval_rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CrossValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;rf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;estimatorParamMaps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;paramGrid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;f1_evaluator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;numFolds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate time metric of model. 
&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cvModel_rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crossval_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cvModel_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avgMetrics&lt;/span&gt;
&lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;

&lt;span class="n"&gt;results_rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cvModel_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictionCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"prediction"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Random Forest Metrics:'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Accuracy of model is : {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results_rf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;})))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'F1 score of model is :{}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results_rf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;})))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'The training process of model took {} seconds'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Accuracy of model is : 0.9014084507042254&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;F1 score of model is :0.8972171191399606&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The training process of model took 4.879174709320068 seconds&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradient Boosted Trees Metrics:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# 3. Gradient Boosting Trees
&lt;/span&gt;
&lt;span class="c1"&gt;# initialize classifier, set evaluater and build paramGrid
&lt;/span&gt;&lt;span class="n"&gt;gbt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;GBTClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxIter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;f1_evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'f1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;paramGrid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ParamGridBuilder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;crossval_gbt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CrossValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;gbt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;estimatorParamMaps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;paramGrid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;f1_evaluator&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;numFolds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate time metric of model. 
&lt;/span&gt;&lt;span class="n"&gt;start_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cvModel_gbt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crossval_gbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;end_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;cvModel_gbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;avgMetrics&lt;/span&gt;
&lt;span class="n"&gt;seconds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;end_time&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start_time&lt;/span&gt;

&lt;span class="n"&gt;results_gbt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cvModel_gbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictionCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"prediction"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Gradient Boosted Trees Metrics:'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Accuracy of model is : {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results_gbt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;})))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'F1 score of model is :{}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results_gbt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;})))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'The training process of model took {} seconds'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seconds&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Accuracy of model is : 0.8450704225352113&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;F1 score of model is :0.8430847520275758&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The training process of model took 21.43347477912903 seconds&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Why the Random Forest model is best for our scenario:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First of all, we used the F1 score to select our final model because ours is a classification problem, and the F1 score balances precision and recall. The higher the F1 score, the better the model, since both false negatives and false positives will be low.&lt;/p&gt;
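
&lt;p&gt;As a quick intuition check, F1 is the harmonic mean of precision and recall; here is a tiny sketch in plain Python (the numbers are made up for illustration, not taken from the models above):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# F1 is the harmonic mean of precision and recall
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# illustrative values only
print(f1(0.90, 0.88))  # 0.8899..., high only when both inputs are high
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;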

&lt;p&gt;Here, Random Forest had the best F1 score, which is why I chose it for the next steps.&lt;/p&gt;

&lt;p&gt;Then I applied hyperparameter tuning based on the F1 score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Optimizing Hyperparameters in Random Forest Classification
&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;numTrees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;maxDepth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;    
&lt;span class="n"&gt;paramGrid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ParamGridBuilder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;addGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;numTrees&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;numTrees&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;addGrid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;maxDepth&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxDepth&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;       
&lt;span class="n"&gt;crossval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CrossValidator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;estimator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;clf&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
                         &lt;span class="n"&gt;estimatorParamMaps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;paramGrid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'f1'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                         &lt;span class="n"&gt;numFolds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;cvModel_rf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;crossval&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;predictions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cvModel_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'f1'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;f1_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;select&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'label'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'prediction'&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'The F1 score is {:.2%}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f1_score&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; 

&lt;span class="n"&gt;bestPipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cvModel_rf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bestModel&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Best parameters : max depth:{}, num Trees:{}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bestPipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;getOrDefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'maxDepth'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;bestPipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;getNumTrees&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
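
&lt;p&gt;As a side note, the mean cross-validated F1 for every parameter combination can be inspected too; a small sketch using the paramGrid and cvModel_rf defined above:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# print the mean cross-validated F1 for each parameter combination;
# avgMetrics is ordered the same way as the param grid
for params, metric in zip(paramGrid, cvModel_rf.avgMetrics):
    settings = {p.name: v for p, v in params.items()}
    print(settings, round(metric, 4))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;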



&lt;p&gt;After hyperparameter tuning finished, the best parameters and F1 score were as follows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The F1 score is 91.39%&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Best parameters : max depth:10, num Trees:75&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Finally, retraining the Random Forest classifier (RF) with the best parameters (&lt;em&gt;max depth: 10, num trees: 75&lt;/em&gt;) gave the higher accuracy and F1 score below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Best Model: Random Forrest Model with max depth:10, num Trees:75 parameters
&lt;/span&gt;&lt;span class="n"&gt;rf_best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numTrees&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;maxDepth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;rf_best_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rf_best&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rf_best_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;MulticlassClassificationEvaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;predictionCol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"prediction"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Random Forrest Model - Test Metrics with best parameters:'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Accuracy of model is :  {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"accuracy"&lt;/span&gt;&lt;span class="p"&gt;})))&lt;/span&gt;
&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'F1 score of model is : {}'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metricName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"f1"&lt;/span&gt;&lt;span class="p"&gt;})))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Szc4t79B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3244/1%2A_GOIGIGKmJksEzWsdpZPDQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Szc4t79B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3244/1%2A_GOIGIGKmJksEzWsdpZPDQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Accuracy and F1 score are very good. However, it should not be forgotten that the mini-sized dataset I used may not represent the whole customer base. On a real-world, full-sized dataset, I would consider an accuracy around 80% very good.&lt;/p&gt;

&lt;p&gt;I have also calculated feature importances for predicting customer churn. I observed that register_duration (days, i.e. customer lifetime), the number of thumbs-down events, and the average number of songs listened per session are the top 3 features when predicting churn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Features importance of Random Forrest Model 
&lt;/span&gt;&lt;span class="n"&gt;Feature_Importance_Scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rf_best_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;featureImportances&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;values&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;Feature_Importance_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'Feature_Importance_Scores'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Feature_Importance_Scores&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'Features'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Features Importance Scores of Random Forrest Model'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;barplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Feature_Importance_Scores'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'Features'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;Feature_Importance_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"blue"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
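
&lt;p&gt;To read the top features off programmatically rather than from the chart, the DataFrame built above can simply be sorted; a small sketch:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;# the three most important features, highest score first
top3 = Feature_Importance_df.sort_values('Feature_Importance_Scores',
                                         ascending=False).head(3)
print(top3)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;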



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xTOdSXOY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3416/1%2A6RqPP2miNgUwgysVtBdwWQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xTOdSXOY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/3416/1%2A6RqPP2miNgUwgysVtBdwWQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the Udacity Sparkify project, I built a model that predicts customer churn from Sparkify's transactional user data. The following steps were completed:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Loaded the mini JSON dataset, cleaned the data, and prepared it for exploratory data analysis (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Performed exploratory data analysis, defined churn, and identified candidate features for the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Defined the 10 features used by the model to predict churn.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Compared three models: logistic regression, random forest, and gradient boosted trees.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Selected the random forest model as the final model and fine-tuned its parameters, obtaining approximately 92% accuracy and a 0.913 F1 score.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
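
&lt;p&gt;A minimal sketch of step 1 (the file name below is a hypothetical path; adjust it for your environment):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql import SparkSession

# build a session and load the mini event log (hypothetical path)
spark = SparkSession.builder.appName('Sparkify').getOrCreate()
df = spark.read.json('mini_sparkify_event_data.json')

# rows with an empty userId belong to logged-out visitors; drop them
df = df.filter(df.userId != '')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;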

&lt;p&gt;Finally, I observed that register_duration (days), the number of thumbs-down events, and the average number of songs listened per session are the top 3 features when predicting churn.&lt;/p&gt;

&lt;p&gt;ML algorithms on Spark let us process big data, gain insight, and turn predictions into actions. For this project, Sparkify could run marketing campaigns targeted at customers who are likely to churn, preventing higher churn rates.&lt;/p&gt;
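
&lt;p&gt;As a sketch of that idea, the fitted model's predictions can be filtered for at-risk users; this assumes the prediction DataFrame still carries a userId column, which depends on how the feature pipeline was set up:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;from pyspark.sql.functions import col

# users the model flags as likely to churn (label 1.0)
at_risk = result.filter(col('prediction') == 1.0)
at_risk.select('userId').distinct().show()  # assumes userId survived vectorization
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;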

&lt;p&gt;Feature engineering is also very important for defining the right inputs to a prediction model. In this project, I designed 10 numerical features that have an impact on churn. Many more metrics could be added when working with big data to improve the model; for instance, rolling advertisements and logout events could also be turned into features. Even so, I obtained a good accuracy and F1 score with the existing mini dataset, and extra features can be taken into account when working with the full dataset.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/ankit1797/Udacity-data-scientist-capstone-project"&gt;https://github.com/ankit1797/Udacity-data-scientist-capstone-project&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Stack overflow Survey in India</title>
      <dc:creator>Ankit Patel</dc:creator>
      <pubDate>Sat, 27 Jun 2020 06:52:23 +0000</pubDate>
      <link>https://dev.to/ankit1797/stack-overflow-survey-in-india-58jd</link>
      <guid>https://dev.to/ankit1797/stack-overflow-survey-in-india-58jd</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8gEp4aaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c59kge167y6p2y073ryj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8gEp4aaD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c59kge167y6p2y073ryj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This blog post is about the massive survey results posted by Stack Overflow for 2017 and 2018, covering all sorts of information such as code style, programming languages, and more. The 2017 survey had more than 50,000 participants, and participation grew further in 2018.&lt;/p&gt;

&lt;p&gt;Every year, Stack Overflow conducts a massive survey of developers on its website, covering all sorts of information like programming languages, jobs, code style and more. This can help future developers think about what would work best for them. Some interesting insights that I got from the results are listed below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Question 1: How do the programming languages used at work relate to the programming languages people want to learn in India, according to Stack Overflow survey data from 2017 and 2018?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DCYG3WGK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zt35ghl2nzhdi6odjh8e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DCYG3WGK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zt35ghl2nzhdi6odjh8e.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
&lt;em&gt;Figure: Correlation between programming languages in 2017 and 2018 (in percent).&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;HTML is highly correlated with almost every programming language (the exceptions are Rust, OCaml, Julia, Hack, Haskell and Clojure), and CSS shows the same trend as HTML.&lt;br&gt;
Around 27 programming languages show nearly no correlation with any other language.&lt;br&gt;
HTML correlates most strongly with CSS and JavaScript (about 12% to 14%).&lt;br&gt;
One noticeable thing is that Objective-C and Swift are strongly correlated with each other (about 10% to 12%).&lt;/p&gt;
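
&lt;p&gt;For readers who want to reproduce this, here is a minimal pandas sketch under stated assumptions: the file name is hypothetical, and the column names (Country, LanguageWorkedWith with semicolon-separated values) follow the 2018 survey schema; the 2017 schema uses different names.&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# load only the columns we need (hypothetical file name)
df = pd.read_csv('survey_results_public_2018.csv',
                 usecols=['Country', 'LanguageWorkedWith'])
india = df[df['Country'] == 'India'].dropna(subset=['LanguageWorkedWith'])

# one-hot encode the semicolon-separated language lists, then correlate
langs = india['LanguageWorkedWith'].str.get_dummies(sep=';')
corr = langs.corr()

# languages most correlated with JavaScript, for example
print(corr['JavaScript'].drop('JavaScript').sort_values(ascending=False).head())
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;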

&lt;p&gt;&lt;strong&gt;Question 2: What are the most wanted programming languages in India, according to Stack Overflow survey data from 2017 and 2018?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gM5g0jpl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/uosyn6onn7i4pdeil5n0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gM5g0jpl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/uosyn6onn7i4pdeil5n0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
&lt;em&gt;Figure: Percentage of developers in India wanting to learn each programming language, 2017–2018.&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;Most of the programming languages that appeared among the languages most used at work also appeared in the ranking of most wanted programming languages; this shows that many people want to keep learning the languages they already use.&lt;/p&gt;
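
&lt;p&gt;A minimal pandas sketch of how such a ranking could be computed, under the same assumptions as before (hypothetical file name, 2018 column names):&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

# hypothetical file name; LanguageDesireNextYear is the 2018 column
df = pd.read_csv('survey_results_public_2018.csv',
                 usecols=['Country', 'LanguageDesireNextYear'])
india = df[df['Country'] == 'India'].dropna(subset=['LanguageDesireNextYear'])

# share of Indian respondents naming each language, highest first
wanted = india['LanguageDesireNextYear'].str.get_dummies(sep=';')
print((wanted.mean() * 100).sort_values(ascending=False).head(10))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;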

&lt;p&gt;&lt;strong&gt;Question 3: Which programming languages are most used at work and most in demand, according to Stack Overflow survey data from 2017 and 2018?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---MdMUb9J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/beo8mxdcb6pf4xrjjcrg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---MdMUb9J--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/beo8mxdcb6pf4xrjjcrg.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h6&gt;
  
  
&lt;em&gt;Figure: Percentage of use of programming languages in India, 2017–2018.&lt;/em&gt;
&lt;/h6&gt;

&lt;p&gt;JavaScript had the highest usage percentage among all programming languages and the highest growth rate in 2017 (around 16%), whereas its share in India dropped significantly in 2018, by about 4%.&lt;br&gt;
As the graph shows, Java saw a similar drop of about 4% in India.&lt;br&gt;
The most interesting thing about this graph concerns HTML and CSS: it was probably not possible to select these two languages as options in 2017, whereas in 2018 both of them rose in the ranking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion:&lt;/strong&gt;&lt;br&gt;
In this article, we looked at the most popular programming languages according to Stack Overflow's 2017 and 2018 Annual Developer Survey data.&lt;br&gt;
We also saw that some older programming languages such as Python, SQL, JavaScript, C, and Java still dominate the market.&lt;br&gt;
People who already work with a certain programming language tend to keep learning that language, or related languages in correlated areas, to improve their skills.&lt;/p&gt;

&lt;p&gt;To see more about this analysis, follow the GitHub link in the references below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;References:&lt;/strong&gt;&lt;br&gt;
GitHub: &lt;a href="https://github.com/ankit1797/Write-A-Data-Science-Blog-Post"&gt;https://github.com/ankit1797/Write-A-Data-Science-Blog-Post&lt;/a&gt;&lt;br&gt;
Stack overflow Developer Survey Data:&lt;br&gt;
&lt;a href="https://insights.stackoverflow.com/survey"&gt;https://insights.stackoverflow.com/survey&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>github</category>
    </item>
  </channel>
</rss>
