<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ezgigm</title>
    <description>The latest articles on DEV Community by ezgigm (@ezgigm).</description>
    <link>https://dev.to/ezgigm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F344312%2Febd61885-c9fd-4c9a-9cc5-a981fbc58874.jpeg</url>
      <title>DEV Community: ezgigm</title>
      <link>https://dev.to/ezgigm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ezgigm"/>
    <language>en</language>
    <item>
      <title>My Data Science Bootcamp Experience</title>
      <dc:creator>ezgigm</dc:creator>
      <pubDate>Sun, 07 Jun 2020 02:21:09 +0000</pubDate>
      <link>https://dev.to/ezgigm/my-data-science-bootcamp-experience-3d79</link>
      <guid>https://dev.to/ezgigm/my-data-science-bootcamp-experience-3d79</guid>
      <description>&lt;p&gt;In February, I have enrolled Flatiron Immersive Data Science program. I preferred to enroll full-time in-person program instead of self-paced or online because I thought that to be on the campus makes me more motivated. In this post, I will write about my experiences with Bootcamp. &lt;/p&gt;

&lt;p&gt;First of all, I did not have big expectations when I started the program. Of course, I knew the syllabus and the details, but I did not know how much we would practice and work on hands-on projects. In the bootcamp, I finished 5 modules. Each module has different goals, and I had to finish the assignment and project of each module to graduate. Project weeks were really tough to survive; they felt like my final weeks at university. Certainly, the most difficult one was the capstone project, which we worked on for the last 3 weeks.&lt;/p&gt;

&lt;p&gt;Now, when I look at my first project, I can clearly see how much my knowledge and skills have improved. Here, I would like to say 'thank you' to our lead instructor &lt;a href="https://www.linkedin.com/in/bryan-arnold-mathematics/" rel="noopener noreferrer"&gt;Bryan Arnold&lt;/a&gt; and our coach &lt;a href="https://www.linkedin.com/in/lindseyberlin/" rel="noopener noreferrer"&gt;Lindsey Berlin&lt;/a&gt;. They always supported me in pushing my boundaries. In the first project, I got my data using an API key and web scraping, and it took two days to understand how to collect it. If they had written the code for me and handed it over ready-made, I would not have learned how to do it myself. Instead of giving me the code, they gave me a perspective on how to solve the problem. This is the approach I was used to from university, so it helped me a lot. &lt;/p&gt;
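
&lt;p&gt;For readers curious what that data-collection step can look like, here is a minimal sketch of pulling records through an API with the requests library. The endpoint, key and parameters are placeholders, not the actual service I used.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Minimal sketch: fetching JSON records from an API with a key (placeholder endpoint).
import requests

API_KEY = "YOUR_API_KEY"                        # hypothetical credential
url = "https://api.example.com/v1/records"      # placeholder URL

response = requests.get(url, params={"api_key": API_KEY, "page": 1})
response.raise_for_status()                     # fail loudly on HTTP errors

records = response.json()                       # parse the JSON payload
print(len(records))
&lt;/code&gt;&lt;/pre&gt;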

&lt;p&gt;Because of Covid-19, we completed our last 3 modules remotely. It is certainly easier to focus on campus than at home, but the remote system taught me new skills, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Working in a team remotely&lt;/li&gt;
&lt;li&gt;Being self-sufficient&lt;/li&gt;
&lt;li&gt;Focusing on online lectures&lt;/li&gt;
&lt;li&gt;How to get online help from instructors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Flatiron's full-time remote schedule helped me a lot in keeping up my motivation. Whenever I began to feel too relaxed, the strict schedule and syllabus kept me engaged. We even had an accountability channel with our instructors that reminded us of our goals every day. &lt;/p&gt;

&lt;p&gt;For the capstone project, I worked on sentiment analysis with deep learning and built a book recommendation system from Kindle Store reviews on Amazon, something I could not have imagined back in the first module. I feel pretty comfortable with what I learned at Flatiron, and it was worth the risk of leaving my job. If you have any questions about the bootcamp, please feel free to ask about my experience. To see the details of my projects, my GitHub repo can be found &lt;a href="https://github.com/ezgigm" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzsa5kruw75fi28jw482k.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzsa5kruw75fi28jw482k.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cover photo by Glenn Carstens-Peters on &lt;a href="https://unsplash.com/s/photos/blog" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt; and gif is from &lt;a href="https://giphy.com" rel="noopener noreferrer"&gt;giphy&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Kindle Store Reviews - Sentiment Analysis and Recommendation Systems</title>
      <dc:creator>ezgigm</dc:creator>
      <pubDate>Fri, 29 May 2020 17:21:20 +0000</pubDate>
      <link>https://dev.to/ezgigm/kindle-store-reviews-sentiment-analysis-and-recommendation-systems-2p4m</link>
      <guid>https://dev.to/ezgigm/kindle-store-reviews-sentiment-analysis-and-recommendation-systems-2p4m</guid>
      <description>&lt;p&gt;As a capstone project of Flatiron School, I worked on Kindle Store reviews on Amazon. &lt;/p&gt;

&lt;h2&gt;Why did I choose this project?&lt;/h2&gt;

&lt;p&gt;With the increasing demand in e-commerce, voice-of-customer channels such as reviews and surveys are becoming more important. Customers cannot touch online products, so they want to learn from previous customers' experiences, and a company that fails to meet this expectation can lose money. I chose Kindle Store reviews because it is easy to buy and read books with a Kindle, and the Kindle app is available for every device, so many people buy books online and leave feedback about them online. I worked on book reviews, but the important point of this project for me is that the same approach can be applied to different companies and different areas. I found this online business problem interesting, so I tried to solve it. &lt;/p&gt;

&lt;h2&gt;How did I solve this problem?&lt;/h2&gt;

&lt;p&gt;I have two main solutions. First, I did sentiment analysis to classify reviews as positive or negative. It is hard to read thousands of reviews regularly and understand which products customers like or hate; it takes too much time and generally does not give a clear picture for comparing products. I built a model that predicts from the text whether a review is positive or negative, which saves that time. &lt;/p&gt;
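
&lt;p&gt;As a rough illustration of this first solution, a classical baseline can be sketched with scikit-learn. The file and column names below are assumptions for the example, not the exact ones from the Kindle dataset.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: a TF-IDF + logistic regression baseline for review sentiment.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# hypothetical file with 'reviewText' and a binary 'sentiment' column
df = pd.read_csv("kindle_reviews.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["reviewText"], df["sentiment"], test_size=0.2, random_state=42
)

model = make_pipeline(TfidfVectorizer(max_features=20000),
                      LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))
&lt;/code&gt;&lt;/pre&gt;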

&lt;p&gt;My second solution is a recommendation system that meets customers' expectations by recommending good books to them, which also helps sell more products. &lt;/p&gt;

&lt;h2&gt;Models&lt;/h2&gt;

&lt;p&gt;I tried different models for sentiment analysis. First, I tried statistical models: LogReg, DecisionTree, ExtraTree, RandomForest, XGBoost and LGBM classifiers. Among them, the best result was 87% accuracy. I then decided to try neural nets. As neural net models, I built a Torch model with the FastText class, and then tried CNNs and RNNs with different numbers and types of layers in Keras. Finally, I used transfer learning with a pre-trained BERT model, which gave me 95% accuracy on the small sample data and 97% on the large-scale data. The results for the small-scale data are shown below.&lt;/p&gt;
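
&lt;p&gt;For a feel of the transfer-learning step, here is a minimal sketch using the Hugging Face transformers pipeline with a default pre-trained sentiment model. This is only an illustration of the idea; it is not my fine-tuning code.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: scoring reviews with a pre-trained sentiment model (no fine-tuning shown).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default pre-trained model

reviews = [
    "I could not put this book down, wonderful characters.",
    "Boring plot and full of typos, a waste of money.",
]
for review, result in zip(reviews, classifier(reviews)):
    print(result["label"], round(result["score"], 3), "-", review)
&lt;/code&gt;&lt;/pre&gt;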

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbbfu1iy4hr5tejxeireh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fbbfu1iy4hr5tejxeireh.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Recommendation Systems&lt;/h2&gt;

&lt;p&gt;As a recommender system, I used collaborative filtering, which finds similar users according to their past ratings of the same products and recommends to user B the books that user A liked. Although this approach has many advantages, such as easily finding customers' areas of interest and easily adding new data to the system, it has one big disadvantage: the cold-start problem.&lt;/p&gt;

&lt;p&gt;To solve this issue, I built a system that finds similarities between keywords and reviews. It is easy to type keywords into a search tab and get recommendations, but what if I have no information about a product other than its reviews, or the best book for me is not the best book for a new user? So, I preferred to calculate text similarities between the keywords and other users' past reviews. Then, I find the N most similar users and give the new user recommendations based on them. &lt;/p&gt;
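
&lt;p&gt;The core of that idea can be sketched with TF-IDF vectors and cosine similarity; the toy reviews below stand in for real user review histories.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: rank users by cosine similarity between their reviews and a keyword query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reviews = [
    "great mystery novel with a clever detective",   # user 0
    "a heartwarming romance set in Paris",           # user 1
    "dense science fiction about space travel",      # user 2
]
keywords = "detective mystery"

vectorizer = TfidfVectorizer()
review_vectors = vectorizer.fit_transform(reviews)
query_vector = vectorizer.transform([keywords])

scores = cosine_similarity(query_vector, review_vectors).ravel()
top_n = scores.argsort()[::-1][:2]   # indices of the N=2 most similar users
print(top_n, scores[top_n])
&lt;/code&gt;&lt;/pre&gt;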

&lt;h2&gt;Challenges of This Project&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The first challenge was the size of the data: there are more than 2.2 million reviews from 139,815 users covering 98,824 books. I used the metadata as a helper dataset to match book titles with book IDs. Every trial could take hours, and my personal computer could not even run some models on the whole dataset. So, I found workarounds. First, to run and compare models easily, I took a sample from the whole set; after choosing my model on the small set, I ran it on the whole set (see the sketch after this list). I also used Google Colab to run the models that could take hours.&lt;/li&gt;
&lt;li&gt;The second challenge of the project was building the neural networks. Before using transfer learning, I tried to train Torch and Keras models from scratch, and it is not easy to understand which layers are necessary or what each layer does. &lt;/li&gt;
&lt;li&gt;The third challenge is that it is not easy to say whether a recommendation system is good. Although we try to guess people's likes and dislikes, human nature is hard to predict.&lt;/li&gt;
&lt;/ul&gt;
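
&lt;p&gt;The sample-first workflow from the first bullet is simple in pandas; the file name here is a placeholder.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: prototype on a sample, then rerun the chosen model on the full data.
import pandas as pd

df = pd.read_csv("kindle_reviews.csv")           # hypothetical full dataset (~2.2M rows)
small = df.sample(n=100_000, random_state=42)    # fixed seed keeps experiments comparable

# ...compare candidate models on `small` and pick a winner...
# ...then fit the winning model on `df`, e.g. on Google Colab for the long runs...
&lt;/code&gt;&lt;/pre&gt;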

&lt;p&gt;As a result, this project can help companies save time by not having to read many reviews to compare products. If I can recommend good books, people will buy more products and recommend my online store to others. The project also makes it possible to compare products easily and identify dislikes quickly. And if customers spend less time on the site because they find interesting books easily, it means less load and fewer problems for my servers.&lt;/p&gt;

&lt;p&gt;All details and codes can be found in my &lt;a href="https://github.com/ezgigm/sentiment_analysis_and_product_recommendation" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As a personal gain, I learned many different techniques and models, and I deepened my knowledge of deep learning. It was a good hands-on learning experience with neural networks. &lt;/p&gt;

&lt;p&gt;Overall, I am very happy with the level I have reached in 15 weeks. When I compare my first project with the capstone, I feel that I have improved a lot. I am grateful to our instructor Bryan Arnold and our coach Lindsey Berlin for always helping and supporting us to push our limits. And special thanks to my dear husband, who always encourages me to try my best. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy90mwe4arn424h9m9z7v.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fy90mwe4arn424h9m9z7v.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover image by &lt;a href="https://pixabay.com/" rel="noopener noreferrer"&gt;pixabay&lt;/a&gt; and gif is from &lt;a href="//giphy.com"&gt;giphy&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>sentimentanalysis</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
    <item>
      <title>Walmart Store Sales Forecasting</title>
      <dc:creator>ezgigm</dc:creator>
      <pubDate>Fri, 08 May 2020 18:08:06 +0000</pubDate>
      <link>https://dev.to/ezgigm/walmart-store-sales-forecasting-3c67</link>
      <guid>https://dev.to/ezgigm/walmart-store-sales-forecasting-3c67</guid>
      <description>&lt;p&gt;All companies have long-term and short-term plans. Especially, big companies like Walmart must make sure about the effectiveness of future plans. Sales forecasting is one of the very important plans because it gives idea to the company for arranging stocks, calculating revenue and deciding to make new investment. Another advantage of knowing future sales is that achieving predetermined targets from the beginning of the seasons can have a positive effect on stock prices and investors' perceptions. Also, not reaching the projected target could significantly damage stock prices, conversely. And, it will be a big problem especially for Walmart as a big company.&lt;/p&gt;

&lt;p&gt;Walmart ran a &lt;a href="https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/overview"&gt;recruiting competition&lt;/a&gt; for store sales forecasting on Kaggle. As my Module 4 project for the Flatiron School Data Science Bootcamp, I worked on this competition. I chose it to learn more about time series models and to understand in depth which parameters affect sales forecasts. My previous job before the bootcamp was in sales, so I also wanted to examine my experience of seasonal effects scientifically.&lt;/p&gt;

&lt;p&gt;The data was obtained from &lt;a href="https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting/data"&gt;Kaggle&lt;/a&gt;. It mainly contains weekly sales by department and store, holidays, the type and size of each store, and some external features such as the consumer price index, fuel price, temperature and unemployment rate. &lt;/p&gt;

&lt;h1&gt;Main Challenge of This Project&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6NW20C2v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ian1xgle36x6r6m397sv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6NW20C2v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ian1xgle36x6r6m397sv.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The main challenge of this project for me was seasonal sales. Some seasons, like Thanksgiving, Black Friday and Christmas, have much higher sales. These seasonal effects make the data highly non-stationary, and dealing with non-stationary data is not easy. There are several ways to address this problem. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Differencing: compute the difference of consecutive terms in the data. Differencing is generally done to get rid of a varying mean. &lt;/li&gt;
&lt;li&gt;Seasonal differencing: take the difference between the same season in different years, for example between the first week of January 2010 and the first week of January 2011. &lt;/li&gt;
&lt;li&gt;Transforming the data: this technique is generally used for non-constant variance. We can transform the data by taking the log, the square root or a power transform, and we can also shift the data. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best solution I found for making my data more stationary was to take differences of consecutive terms and use that differenced series for modeling. The graph below shows clearly how non-stationary my data is. I confirmed this with the Augmented Dickey-Fuller (adfuller) test; details about the test can be found &lt;a href="https://en.wikipedia.org/wiki/Augmented_Dickey%E2%80%93Fuller_test"&gt;here&lt;/a&gt;.&lt;/p&gt;
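
&lt;p&gt;A minimal sketch of both steps with pandas and statsmodels; the file and column names are assumed from the Kaggle data layout.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: first-order differencing plus the Augmented Dickey-Fuller stationarity check.
import pandas as pd
from statsmodels.tsa.stattools import adfuller

sales = pd.read_csv("train.csv")["Weekly_Sales"]  # assumed column name

diffed = sales.diff().dropna()  # difference of consecutive terms removes a varying mean

for name, series in [("raw", sales.dropna()), ("differenced", diffed)]:
    stat, pvalue = adfuller(series)[:2]
    print(name, "ADF statistic:", round(stat, 2), "p-value:", round(pvalue, 4))
&lt;/code&gt;&lt;/pre&gt;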

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fWBwIWCp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2ix244coljs3u1pbs6g5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fWBwIWCp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/2ix244coljs3u1pbs6g5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Metric and Models&lt;/h1&gt;

&lt;p&gt;The competition metric is the weighted mean absolute error (WMAE). The weight of an error changes on holidays: errors in holiday weeks count 5 times as much as in normal weeks, so mispredicting by $50 in a usual week counts like a $250 error in a holiday week. I think Walmart weights holiday predictions because holiday sales are much higher and more important than usual. Details of the metric can be found &lt;a href="https://github.com/ezgigm/Project4_Store_Sales_Forecasting/blob/master/STEP2_Random_Forest_Regressor.ipynb"&gt;in this link&lt;/a&gt;.&lt;/p&gt;
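
&lt;p&gt;The metric itself is a few lines of NumPy; this is a sketch of the formula as described above.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch of WMAE: holiday weeks get weight 5, all other weeks weight 1.
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    weights = np.where(is_holiday, 5.0, 1.0)
    return np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights)

y_true = np.array([20000.0, 18500.0, 35000.0])
y_pred = np.array([19500.0, 19000.0, 33000.0])
holiday = np.array([False, False, True])
print(round(wmae(y_true, y_pred, holiday), 2))
&lt;/code&gt;&lt;/pre&gt;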

&lt;p&gt;After trying several machine learning and time series models, I found the best results with exponential smoothing (Holt-Winters), which smooths the trend of the data. My best error value is 821. If we assume average weekly sales between 18,000 and 20,000 dollars, this means my model can predict future sales of Walmart stores with roughly 4% error. &lt;/p&gt;
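
&lt;p&gt;For reference, fitting a Holt-Winters model in statsmodels looks roughly like this; the file name and the additive trend/seasonality choices are assumptions for the sketch.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: Holt-Winters exponential smoothing on one store's weekly sales.
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

series = pd.read_csv("store1_weekly_sales.csv",   # hypothetical single-store series
                     index_col="Date", parse_dates=True)["Weekly_Sales"]

model = ExponentialSmoothing(series, trend="add", seasonal="add",
                             seasonal_periods=52)  # 52 weeks per seasonal cycle
fit = model.fit()
print(fit.forecast(8))  # predict the next 8 weeks
&lt;/code&gt;&lt;/pre&gt;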

&lt;h1&gt;Explorations&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yy1Goeeu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/yvccqon729hx7iexqqcp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yy1Goeeu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/yvccqon729hx7iexqqcp.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Looking at average sales by month, it is obvious that November and December have higher sales. For a deeper analysis, I looked at weekly sales by week number. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZbH5WGUL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wonjfw0axkwl2c0qfrlk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZbH5WGUL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wonjfw0axkwl2c0qfrlk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;From the graph, we can easily see that the 51st and 47th weeks have the highest sales, which means the highest demand of the whole year belongs to the Christmas, Thanksgiving and Black Friday seasons. The 50th and 49th weeks follow the top two. And, although it is not obvious from the graph, I found from the values in the project that the 22nd week is in the top 5: after schools close at the end of May, people shop for summer and vacations. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Another interesting finding: the single highest weekly sales value falls in Thanksgiving week, yet in the graph above Christmas has the higher average. Thanksgiving week may have higher values for some individual stores and departments, but when we average and sort, Christmas is best for overall average weekly sales. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Although sales in the other months are close to each other, January has the lowest sales average. This is a consequence of the high November and December sales: after two high-sales months, people prefer to spend less in January.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--n8iXPyw2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6zsbaj9orig6dd823dqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--n8iXPyw2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6zsbaj9orig6dd823dqn.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X0pqjITE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1r6k06cwj098sw05upri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X0pqjITE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1r6k06cwj098sw05upri.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The pie plot and box plot above show that there are mainly 3 types of Walmart stores in the data, categorized by size: Type A stores are the biggest, B are medium and C are the smallest. Nearly half of the stores in the data are Type A, the biggest.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dHdUd5or--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qcroujti1yas8rmltkt9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dHdUd5or--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qcroujti1yas8rmltkt9.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In the graph above, the red line shows average holiday sales and the green line average non-holiday sales. As expected, the holiday average is higher. &lt;/li&gt;
&lt;li&gt;The graph also shows clearly that store size is directly proportional to weekly sales: bigger stores have higher average sales. &lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interestingly, this graph shows Thanksgiving sales higher than Christmas, although we know from the weekly graph that the 51st week is the best. Looking deeper into the data, I realized that Walmart assigned the Christmas sales to 31-Dec-10, 30-Dec-11, 28-Dec-12 and 27-Dec-13. But people generally do their Christmas shopping before the last week of the year, so assigning the Christmas sales season to the last week of the year is not a good idea. This problem in the data can be seen easily in this graph.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Interestingly, according to this data, the consumer price index, temperature, fuel price and unemployment rate show no pattern against weekly sales.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UlY316eR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/bupriohbqn0rpl3mvi16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UlY316eR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/bupriohbqn0rpl3mvi16.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FkZlmSiS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fnhaexlelufcif10q3dq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FkZlmSiS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/fnhaexlelufcif10q3dq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first graph above shows that department 72 has some very high values; looking at the numbers, I saw that department 72 sells especially well at Thanksgiving. But the second graph reveals other departments with higher average sales: although department 72 does well during Thanksgiving, departments such as 92 and 95 have higher sales averages in general. This shows us one more time how important, and how potentially misleading, seasonal sales are. &lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;Future Improvements and Recommendations&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;From a modeling perspective, the data could be made more stationary with different techniques, and feature selection and feature engineering could be added to the model. &lt;/li&gt;
&lt;li&gt;From a data quality perspective, improving the results requires more data with more detailed seasonal sales, such as Back to School, the 4th of July, Memorial Day, Halloween or Easter. Before these holidays, some departments may sell more products. &lt;/li&gt;
&lt;li&gt;There are markdowns and store sales in some special seasons. The effect of markdowns and discounts on departments could be added to the model. &lt;/li&gt;
&lt;li&gt;From the data, I observed that some holidays have a stronger effect on some stores and departments. So, different models could be built for each store and department. It is not easy to build models for 45 stores and 81 departments, but one could begin with the departments and stores that have the highest sales, because they affect total sales the most. &lt;/li&gt;
&lt;li&gt;Lastly, market basket analysis could be added to find the items in highest demand by department.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All details of the cleaning process, data preprocessing, modeling, further explorations, solutions to the problem and future improvements can be found in this &lt;a href="https://github.com/ezgigm/Project4_Store_Sales_Forecasting"&gt;github repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover image by &lt;a href="https://pixabay.com/users/PublicDomainPictures-14/?utm_source=link-attribution&amp;amp;utm_medium=referral&amp;amp;utm_campaign=image&amp;amp;utm_content=163464"&gt;PublicDomainPictures&lt;/a&gt; from &lt;a href="https://pixabay.com/?utm_source=link-attribution&amp;amp;utm_medium=referral&amp;amp;utm_campaign=image&amp;amp;utm_content=163464"&gt;Pixabay&lt;/a&gt; and gif is from &lt;a href="//giphy.com"&gt;giphy&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>timeseries</category>
    </item>
    <item>
      <title>Pump it Up: Data Mining the Water Table</title>
      <dc:creator>ezgigm</dc:creator>
      <pubDate>Mon, 20 Apr 2020 15:31:30 +0000</pubDate>
      <link>https://dev.to/ezgigm/pump-it-up-data-mining-the-water-table-2f6n</link>
      <guid>https://dev.to/ezgigm/pump-it-up-data-mining-the-water-table-2f6n</guid>
      <description>&lt;p&gt;Despite the unbelievable development of technology, simple basic needs such as access to clean drinking water are still one of the most important problems of human beings. For some areas in the world, to find clean water, pump this water up and transport this water to people are really difficult processes. Tanzania is the largest country of East-Africa with 59 million population. 25 million of this population have lack access to clean water, 40 million people also have a lack access to improved sanitation. The Tanzanian Water Ministry agreed with Taarifa and they begun a competition by &lt;a href="https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/"&gt;DrivenData&lt;/a&gt; to solve this problem with improving clean water sources.  As a Module 3 project of Flatiron School Data Science Bootcamp, I worked on this problem with &lt;a href="https://github.com/geomms"&gt;Mark Subra&lt;/a&gt;. The reason for choosing this project is my interest in solving the main problems that concern humanity. I have worked in NGO for many years to help people. Even, I have two close friends who live in Tanzania and always tell me about the water shortage of this country. So, I have an interest this type of problems.  &lt;/p&gt;

&lt;p&gt;The Ministry of Water divides water points into three classes: functional, non-functional, and functional but in need of repair. Our aim in this project was to build a model that predicts the functionality of water points. The data was taken from &lt;a href="https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/"&gt;DrivenData&lt;/a&gt;. There are 4 datasets: a submission format, a training set, a test set, and a training labels set that contains the status of the wells. Given the training set and labels, competitors are asked to build a predictive model, apply it to the test set to determine the status of the wells, and submit the result. The training set contains 59,400 water points with 40 features.&lt;/p&gt;

&lt;h1&gt;Challenges of This Project&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SBgxJhSe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/0td1qd92ns0g39c7y4f3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SBgxJhSe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/0td1qd92ns0g39c7y4f3.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;Importance of Data Cleaning&lt;/h4&gt;

&lt;p&gt;Mainly, there are two challenges in this data. The first is cleaning it. The data contains many columns that carry the same information, which causes multicollinearity in the model. There are also many null, zero and missing values. Most features are categorical, and some have more than 2,000 unique values; spelling mistakes in some columns inflate the number of unique values, and some columns hold discrete values. So, we dropped columns that duplicated information, converted null and missing values to the mean, or collected them in an 'unknown' category. For feature engineering, we created new columns for some features and re-categorized them manually. This cleaning process took a lot of time, but in the end it showed us the importance of data cleaning once again: our first baseline, a simple logistic regression, already gave a 0.83 ROC-AUC score on the binary-class version of the problem. &lt;/p&gt;
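
&lt;p&gt;A condensed sketch of those cleaning moves in pandas; the column names come from the competition data, but the thresholds and exact choices here are illustrative.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: drop duplicated-information columns, fill gaps, bucket rare categories.
import pandas as pd

df = pd.read_csv("training_set.csv")

df = df.drop(columns=["quantity_group", "payment_type"])   # near-duplicates of other columns
df["construction_year"] = df["construction_year"].replace(0, pd.NA)
df["gps_height"] = df["gps_height"].fillna(df["gps_height"].mean())
df["funder"] = df["funder"].fillna("unknown")

# collapse rare funder values (spelling variants etc.) into 'other'
counts = df["funder"].value_counts()
rare = counts[counts.lt(50)].index
df["funder"] = df["funder"].where(~df["funder"].isin(rare), "other")
&lt;/code&gt;&lt;/pre&gt;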

&lt;h4&gt;Imbalanced Ternary Class Problem&lt;/h4&gt;

&lt;p&gt;Our data has three highly imbalanced target labels, and all three are important to predict correctly, so we had to balance the values for each label in the confusion matrix. To simplify the problem, we first grouped the functional and functional-but-needs-repair wells together and found the best model for the binary-class version. By doing this, we understood which modeling approach suits this data and how to set our model parameters. When we moved to the ternary-class version, we again faced the imbalance problem: one of the target labels has far fewer examples than the others. To solve it, we used the oversampling technique SMOTE (Synthetic Minority Over-sampling Technique). SMOTE creates synthetic data based on nearest neighbors, and it is one of the most popular oversampling techniques. &lt;/p&gt;
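
&lt;p&gt;A minimal sketch of SMOTE with the imbalanced-learn library, using a toy 3-class problem in place of the well-status labels:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: oversampling minority classes with SMOTE (imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# toy imbalanced 3-class problem standing in for the well-status labels
X, y = make_classification(n_samples=5000, n_classes=3, n_informative=6,
                           weights=[0.55, 0.38, 0.07], random_state=42)
print("before:", Counter(y))

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))  # classes are now balanced with synthetic samples
&lt;/code&gt;&lt;/pre&gt;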

&lt;h1&gt;Models and Metric&lt;/h1&gt;

&lt;p&gt;To prepare the data for machine learning, and to further address the first challenge, we also did encoding and scaling. For the ternary-target model, a target encoder and a robust scaler were used, and Random Forest, LGBM and XGBoost were tried. To handle the second challenge, the imbalanced targets, SMOTE over-sampling was applied. The competition metric was balanced accuracy, so we used that metric, and sometimes the ROC-AUC score to compare and double-check results. &lt;/p&gt;
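
&lt;p&gt;Put together, the pipeline can be sketched as below. The tiny toy DataFrame and the choice of the category_encoders library for target encoding are assumptions for the example, not our exact setup.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: target encoding + robust scaling + SMOTE + random forest,
# scored with balanced accuracy.
import pandas as pd
from category_encoders import TargetEncoder
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, RobustScaler

df = pd.DataFrame({  # toy stand-in for the cleaned competition data
    "funder": ["gov", "danida", "gov", "rwssp", "danida", "gov"] * 50,
    "gps_height": [300, 1200, 150, 900, 1100, 200] * 50,
    "status": ["non functional", "functional", "non functional",
               "functional needs repair", "functional", "non functional"] * 50,
})
X = df[["funder", "gps_height"]]
y = LabelEncoder().fit_transform(df["status"])  # three integer class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

pipe = make_pipeline(TargetEncoder(), RobustScaler(),
                     SMOTE(random_state=42),
                     RandomForestClassifier(n_estimators=300, random_state=42))
pipe.fit(X_train, y_train)
print("balanced accuracy:", balanced_accuracy_score(y_test, pipe.predict(X_test)))
&lt;/code&gt;&lt;/pre&gt;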

&lt;h1&gt;Explorations of Problems for Solution&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vB-wakVN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qeadtf3qqnr8h4ottg20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vB-wakVN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qeadtf3qqnr8h4ottg20.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The graph shows that there are many wells which contain enough water but are non-functional. It also shows that 4,272 wells have dried up despite good water quality; if a way can be found to resupply them, they can become functional again. Finding clean water sources is not the only problem: keeping those sources fed is equally important. 2,226 wells (7%) have enough soft water but need repair; the authorities must invest in repairs, otherwise these will become non-functional. 8,035 wells (27%) have enough good-quality water but are non-functional, which shows that the authorities must work on and invest in the technology to pump these good sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gdIA-rqJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m9no4mt6jm8mxiocle7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gdIA-rqJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m9no4mt6jm8mxiocle7t.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3,500 water wells need repairs; otherwise they could easily become non-functional.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iGM_wTHH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/o9jkhdlynhiswzxeaxm7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iGM_wTHH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/o9jkhdlynhiswzxeaxm7.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This graph shows the funders with the highest ratio of functional to non-functional wells. Danida, a Tanzania-Denmark cooperation on wells, has many functional wells. RWSSP is the Rural Water Supply and Sanitation Program. Most of the wells funded by Germany are also functional.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gXvc7W5r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3p328uioiy06kt4lyj3a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gXvc7W5r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3p328uioiy06kt4lyj3a.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Most of the wells funded by the government are non-functional; most water points installed by the central government and district councils are non-functional.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dar es Salaam is one of the most populated cities, but 35% of its good-quality water points are non-functional.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Iringa is one of the important areas, but it contains many non-functional water points that have soft water.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QZCAcH74--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vf5stomki86d5rd3h3ld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QZCAcH74--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/vf5stomki86d5rd3h3ld.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The most common extraction type is gravity, and the second is hand pumps. There are many non-functional water points whose extraction type is gravity, which is a natural force and needs nothing expensive. Since gravity-type wells do not need much investment, many such water points could be made functional easily. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wells constructed in recent years are more often functional than older ones. Recent years also show some functional-but-needs-repair wells, which means that if they are not repaired soon, they will easily become non-functional.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--p_mbKHu5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gzv7jcs7zaridgm2d0jq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--p_mbKHu5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gzv7jcs7zaridgm2d0jq.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;This map shows the locations of the functional-but-needs-repair wells. There are clusters around highly populated areas; with regular maintenance of these wells, more people could get clean water.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The water basin is another important parameter for the functionality of wells: areas near a good water basin have a high probability of clean water.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JOvJI286--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/903aevdftchkyhml96he.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JOvJI286--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/903aevdftchkyhml96he.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wells with no fee are more likely to be non-functional, and wells with some form of payment are more likely to be functional.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;Solutions&lt;/h1&gt;

&lt;p&gt;Our model can predict the functionality of the water wells with 86% accuracy. With good predictions of functionality, the solutions can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prioritizing functioning wells which need repair and yield clean water&lt;/li&gt;
&lt;li&gt;targeting repairs at clusters of wells, especially those in highly populated areas&lt;/li&gt;
&lt;li&gt;introducing payments of some kind to provide an incentive to keep wells functional&lt;/li&gt;
&lt;li&gt;allocating funds and resources to effective organizations with a track record&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All details of the cleaning process, data preprocessing, modeling, further explorations, solutions to the problem and future improvements can be found in &lt;a href="https://github.com/ezgigm/Project3_TanzanianWaterWell"&gt;this github repo&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pump image courtesy of &lt;a href="https://www.flickr.com/photos/christophercjensen/3559607145"&gt;flickr user christophercjensen&lt;/a&gt; and gif is from &lt;a href="//giphy.com"&gt;giphy&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
      <category>xgboost</category>
    </item>
    <item>
      <title>Data Visualization Basic Libraries</title>
      <dc:creator>ezgigm</dc:creator>
      <pubDate>Mon, 23 Mar 2020 06:37:17 +0000</pubDate>
      <link>https://dev.to/ezgigm/data-visualization-basic-libraries-4gdi</link>
      <guid>https://dev.to/ezgigm/data-visualization-basic-libraries-4gdi</guid>
      <description>&lt;p&gt;One of the advantages of using Python is user-friendly visualization packages. In this blog, I try to explain some essential plots for data visualization and I will share with you which plots I used for what before and what I found useful. Mainly, the plots which I will share in this article belongs to Matplotlib which is a library for Python using for visualization. Also, I would like to state about Seaborn, visualization library based on matplotlib, because I like Seaborn's colorful visuals. It is all about my usage preferences from my experiences.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/Pn4vFaA3h38KA/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/Pn4vFaA3h38KA/giphy.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Scatter Plot &lt;em&gt;(plt.scatter)&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;First of all, as with any other library, we need to import &lt;em&gt;matplotlib.pyplot&lt;/em&gt; to use these plot types.&lt;/p&gt;

&lt;p&gt;Scatter plots are very useful for quickly understanding the correlation between two variables. I also like to use markers for the data points, like 'stars', 'triangles' or 'sized circles', according to the frequency of the data. The feature I like most is a scatter plot with a legend. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FTOghoJO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6eb0jd4kmto6pokrsv68.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FTOghoJO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6eb0jd4kmto6pokrsv68.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another feature I like about scatter plots is adding a color bar with &lt;em&gt;plt.colorbar()&lt;/em&gt;. It is very useful for seeing the colors and distinguishing different points. In the graph below, I used 'viridis', a perceptually uniform sequential colormap, but 'matplotlib' offers many alternative sequential, diverging, qualitative and miscellaneous colormaps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---MgdsnMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wp2hi7d37hdzhglcfvkl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---MgdsnMW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/wp2hi7d37hdzhglcfvkl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of them can be found in &lt;a href="https://matplotlib.org/3.1.1/tutorials/colors/colormaps.html"&gt;matplotlib.org&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Bar Plot &lt;em&gt;(plt.bar)&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;I prefer bar plots for comparing categorical data and the differences between categories. It is super easy to read the lengths of the rectangular bars, which are proportional to the data. Bars can be sorted by their y-axis values so that the variables can easily be ranked from best to worst. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ek3cq3zK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/d7spam4dlcz5vcbntjtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ek3cq3zK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/d7spam4dlcz5vcbntjtl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I also like to color the bars and compare two variables in the same graph. With this method, we can compare many variables with each other proportionally. Additionally, it is good to be able to adjust the width of the bars. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pzWjqB12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/k22zpt77fyc83xfiuou4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pzWjqB12--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/k22zpt77fyc83xfiuou4.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Pie Plot &lt;em&gt;(plt.pie)&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;When I want to see the percentages of a small number of variables in the same graph, I like to use pie plots. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tqC9p5nd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/sorxiyslcsixr836depa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tqC9p5nd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/sorxiyslcsixr836depa.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nested pie charts can also be created for more complex data, but I prefer to avoid complexity when interpreting data. Another visual trick I like about the pie plot is that we can separate one slice from the pie to highlight its importance. &lt;/p&gt;
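
&lt;p&gt;That separated-slice effect is the &lt;em&gt;explode&lt;/em&gt; argument; a quick sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: a pie chart with one slice pulled out via the explode argument.
import matplotlib.pyplot as plt

shares = [45, 30, 15, 10]
labels = ["Type A", "Type B", "Type C", "Other"]
explode = (0.1, 0, 0, 0)   # offset only the first slice to emphasize it

plt.pie(shares, labels=labels, explode=explode, autopct="%1.1f%%", startangle=90)
plt.axis("equal")          # keep the pie circular
plt.show()
&lt;/code&gt;&lt;/pre&gt;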

&lt;h3&gt;Histogram and Density Plot &lt;em&gt;(plt.hist)&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;Undoubtedly, one of the best ways to see the distribution of data is a histogram. It makes it super easy to spot normally distributed variables, a very important concept for data scientists, so histograms are essential graphs for us. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ONHWNWRK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/axdo2pf0bw3cx1hd8ba5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ONHWNWRK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/axdo2pf0bw3cx1hd8ba5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although a histogram is a great way to start understanding a single variable's distribution, histograms can fail when we try to compare the distributions of one variable across multiple categories, because overlapping histograms in the same area hurt readability. The best fix is to create a separate histogram for each category; side-by-side histograms or stacked bars can also be useful. &lt;/p&gt;

&lt;p&gt;There is a good example on &lt;a href="https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0"&gt;towardsdatascience.com&lt;/a&gt; with NYC flight data showing when histograms fail and how to solve this with side-by-side histograms or stacked bars. &lt;/p&gt;

&lt;p&gt;The graph becomes even more useful when a density plot is drawn together with the histogram of the data's distribution. Here I want to mention Seaborn, which is super easy to use and can create a histogram and density curve on the same plot.&lt;/p&gt;
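
&lt;p&gt;In recent Seaborn versions this is one call to histplot (the successor of the distplot mentioned below); a sketch on random normal data:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: histogram with an overlaid density curve in one Seaborn call.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = np.random.default_rng(42).normal(loc=0, scale=1, size=1000)

sns.histplot(data, kde=True, stat="density")  # bars plus a kernel density estimate
plt.title("Histogram with density curve")
plt.show()
&lt;/code&gt;&lt;/pre&gt;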

&lt;h3&gt;Seaborn &lt;em&gt;(sns.pairplot, sns.heatmap, sns.distplot, sns.barplot)&lt;/em&gt;&lt;/h3&gt;

&lt;p&gt;Seaborn is a really useful and colorful Python library. Like 'matplotlib', we need to import Seaborn before using it. &lt;/p&gt;

&lt;p&gt;The most useful basic plots in Seaborn are the pair plot, heat map and distribution plot. I also include the bar plot here because I like Seaborn's easy, colorful bar plots: you do not need to specify colors, so it is easier than 'matplotlib'. The other thing I like about Seaborn is that it needs less syntax.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1lMYv5Gv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/re1v0ajl0t3drnefu0bv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1lMYv5Gv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/re1v0ajl0t3drnefu0bv.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Seaborn also comes with many ready-made datasets, which is another thing I like. For a data scientist candidate, it is very good to have ready datasets to play with and visualize. In the graph below, I plotted the distributions and density curves for the ready dataset 'iris', which can be loaded simply with &lt;em&gt;sns.load_dataset('iris')&lt;/em&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eaP22kao--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/y2jxkh95jcp1s0yvxeqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eaP22kao--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/y2jxkh95jcp1s0yvxeqt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another important thing for modeling is the correlation between independent variables. To analyze this correlation, Seaborn's sns.pairplot and sns.heatmap are useful functions. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--d_sw2OX---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1lgicpjgiw2dovj6zu42.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--d_sw2OX---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1lgicpjgiw2dovj6zu42.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the 'auto-mpg.csv' dataset, I chose some columns for simplicity and simply passed my data to the pair plot as sns.pairplot(df). It is a super fast, easy and useful Seaborn shortcut. From the graph above, I can observe the correlations, linearity and distributions of the variables, so sns.pairplot is a very valuable feature.&lt;/p&gt;

&lt;p&gt;The same goes for the heatmap.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5eRp8ojL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3ybzelq0f1p4y9a7qurl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5eRp8ojL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3ybzelq0f1p4y9a7qurl.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With just one line of code, I can observe all the correlations with the target and the multicollinearity between variables super easily. &lt;/p&gt;
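
&lt;p&gt;Both one-liners, sketched on Seaborn's built-in 'iris' dataset:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Sketch: pairwise relationships and a correlation heat map in one line each.
import matplotlib.pyplot as plt
import seaborn as sns

df = sns.load_dataset("iris")        # one of Seaborn's ready datasets

sns.pairplot(df, hue="species")      # scatter matrix with per-class colors
plt.show()

numeric = df.drop(columns="species")
sns.heatmap(numeric.corr(), annot=True, cmap="viridis")  # correlation matrix
plt.show()
&lt;/code&gt;&lt;/pre&gt;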

&lt;p&gt;Lastly, there are many ways to visualize data in Python and in other programming languages, but these basics are really useful and easy for beginners, and they are the fastest ways to get a feel for a dataset. As seen in this blog post, the choice of visualization library or tool always depends on the data and what you want to show.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cover image sourced by &lt;a href="https://pixabay.com/users/ColiN00B-346653/?utm_source=link-attribution&amp;amp;utm_medium=referral&amp;amp;utm_campaign=image&amp;amp;utm_content=3068300"&gt;Colin Behrens&lt;/a&gt; from &lt;a href="https://pixabay.com/?utm_source=link-attribution&amp;amp;utm_medium=referral&amp;amp;utm_campaign=image&amp;amp;utm_content=3068300"&gt;Pixabay&lt;/a&gt; and gif is from &lt;a href="//giphy.com"&gt;giphy&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>seaborn</category>
      <category>matplotlib</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data: The Superpower of the Future</title>
      <dc:creator>ezgigm</dc:creator>
      <pubDate>Mon, 02 Mar 2020 04:55:46 +0000</pubDate>
      <link>https://dev.to/ezgigm/data-the-superpower-of-future-518j</link>
      <guid>https://dev.to/ezgigm/data-the-superpower-of-future-518j</guid>
      <description>&lt;p&gt;The demand for data scientists in the market increases day by day. I also decided to learn data science and enrolled Flatiron School Data Science Bootcamp. Although there are many popular tech areas nowadays, I will try to explain why I decided to data science in this post.&lt;/p&gt;

&lt;h3&gt;Personal Interest&lt;/h3&gt;

&lt;p&gt;The most important reason for my decision is obviously my personal interest in math and coding. Although I graduated with a BSc in Chemical Engineering, my favorite courses during my university years were always about math and coding, so I always wanted to continue my education with an MSc in Computer Science. After I discovered the concept of data science, I understood that this, not computer science, was the area I was looking for. It is a more specific field: a combination of math, programming, statistics, problem-solving and the ability to interpret captured data, and of using that data to build and offer business solutions. This combination perfectly matches my expectations.&lt;/p&gt;

&lt;h3&gt;Demand in the Market&lt;/h3&gt;

&lt;p&gt;Big data is one of the most important resources for companies' futures. Companies can use past data to build models for future plans and investments. The importance of data science has increased as big data (individual and corporate) is shared via the internet. Thanks to the internet, people can easily reach important data about a topic and project where that area is heading. But this data is not always clean and useful for everyone: data that is easy to reach usually needs to be cleaned and prepared before modeling. So, market demand for data scientists grows every day.&lt;/p&gt;

&lt;h3&gt;Use of Data Science in Healthcare&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oF6VDXOf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kjdfyqkq9yhlhe6yxbld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oF6VDXOf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/kjdfyqkq9yhlhe6yxbld.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Data science not only offers solutions for companies, but also for issues that concern all of humanity. Healthcare is one of these areas of common concern. Data science has many applications in healthcare and medicine, and I am really drawn to some of them, like genetics and genomics. All the data about our bodies is stored in each individual's genes. If this database can be interpreted perfectly, the impact of genetic variation on the genetic code can be understood, and from this interpretation scientists may even find medicines for fatal illnesses. Every step in healthcare, from interpreting DNA, creating medicine and improving diagnostic accuracy to keeping patient data and offering models for illnesses or virtual assistance to patients, involves a lot of data and needs data scientists. One of the reasons for my decision is exactly these applications of data science that relate to all humanity. Even though it is difficult to make precise predictions at the beginning of the road, this is the field I would like to work in after graduation.&lt;/p&gt;

&lt;h3&gt;Long-term Learning Process - like a Marathon, not a 100m Run&lt;/h3&gt;

&lt;p&gt;Lastly, I love learning, and for me learning about my job must be life-long. Data science is the perfect area for this: it has very broad applications and, like all other tech jobs, it is very dynamic. You have to be aware of innovations and must keep learning new concepts. You should find solutions for different cases in the market, and to do that you have to search many sources and interpret them. I have always believed that a rolling stone gathers no moss, so I love to learn and to level up my knowledge.&lt;/p&gt;

&lt;p&gt;Among many other reasons, these four headings are the main reasons why I began my data science journey.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4YUab4RP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/09tlemdf1l0453gz0umk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4YUab4RP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/09tlemdf1l0453gz0umk.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Two images sourced from &lt;a href="//www.pixabay.com"&gt;pixabay.com&lt;/a&gt; and gif is from &lt;a href="//giphy.com"&gt;giphy&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>python</category>
      <category>newcareer</category>
    </item>
  </channel>
</rss>
