DEV Community

Abzal Seitkaziyev

Estimation of unit sales of Walmart retail goods

For my module 4 project, I decided to focus on time series and larger datasets.

Problem Description
After some research, I found an interesting project on the Kaggle competitions page. The original competition asked participants to estimate unit sales of Walmart retail goods. The metric specified by the organizers is the Weighted Root Mean Squared Scaled Error (WRMSSE). This means that for a forecast to perform well, it must provide accurate estimates across all hierarchical levels, especially for series that represent significant sales, measured in US dollars (Source: M5-Competitors-Guide).
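To make the metric more concrete, here is a minimal sketch of RMSSE for a single series and of a weighted sum over several series. It is a simplification under stated assumptions: the weights are taken as given and assumed to sum to 1, and the hierarchy aggregation described in the competition guide is left out. The function names are my own, not part of any library.

```python
import math

def rmsse(train, actual, forecast):
    """Root Mean Squared Scaled Error for one series."""
    # Scale: mean squared error of the one-step naive forecast on the history.
    diffs = [(train[t] - train[t - 1]) ** 2 for t in range(1, len(train))]
    scale = sum(diffs) / len(diffs)
    if scale == 0:
        raise ValueError("constant training series: scale is zero")
    # Mean squared error of the forecast over the evaluation horizon.
    mse = sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)
    return math.sqrt(mse / scale)

def wrmsse(series, weights):
    """Weighted sum of per-series RMSSE.

    series  -- list of (train, actual, forecast) tuples
    weights -- one weight per series, assumed to sum to 1
    """
    return sum(w * rmsse(*s) for s, w in zip(series, weights))
```

Because the scale depends only on the training history, a series with large day-to-day swings is penalized less for the same absolute error than a very stable one.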

Data Analysis
By the time I had prepared the data for the feature engineering stage, I already had around 50 million rows. I realized that training on the whole data frame with all the features would be compute-intensive and not an effective way for me to get good results. Also, during data exploration, I found that around 14% of the retail items generated 50% of the revenue. Considering that, and the time limit for the project, I decided to explore only the items that generate high revenue. This reduced model training time and let me use RMSE as the metric. In this case, RMSE serves as a good approximation of WRMSSE, particularly when applied to each item separately, because for a single series the scaled error is just RMSE divided by a constant that depends only on the training history.
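The revenue concentration check can be sketched roughly as follows. This is an illustration only: the function name and the item ids in the usage note are hypothetical, and the real analysis ran on the full dataset rather than on a small dictionary.

```python
def top_items_for_revenue_share(revenue_by_item, target_share=0.5):
    """Return the highest-revenue items that together reach target_share
    of total revenue, plus the fraction of all items they represent."""
    total = sum(revenue_by_item.values())
    # Rank items from highest to lowest revenue.
    ranked = sorted(revenue_by_item.items(), key=lambda kv: kv[1], reverse=True)
    selected, running = [], 0.0
    for item, revenue in ranked:
        if running >= target_share * total:
            break  # target share already covered
        selected.append(item)
        running += revenue
    return selected, len(selected) / len(ranked)
```

For example, with `{"A": 50, "B": 30, "C": 10, "D": 5, "E": 5}` the function selects only item `A`, which is 20% of the items and half of the revenue.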

Baseline model and Validation strategy

Since I may need to scale my solution up to the whole dataset, I looked for a model that trains fast and gives accurate results. After some research, I chose LightGBM, which is based on the gradient-boosted trees algorithm.

Because my goal was to find an effective method that I can scale up, and I wanted to check the concept quickly, I used a simple random search for hyperparameter tuning.
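A random search of this kind can be sketched in a few lines. The sketch below is model-agnostic on purpose: `evaluate` is a placeholder for a function that would train a LightGBM model with the given parameters and return the validation RMSE. The keys are LightGBM parameter names, but the ranges are illustrative assumptions, not the ones used in the project.

```python
import random

# Hypothetical search space: real LightGBM parameter names,
# illustrative ranges (an assumption, not the project's settings).
SEARCH_SPACE = {
    "num_leaves": lambda rng: rng.randint(16, 256),
    "learning_rate": lambda rng: 10 ** rng.uniform(-3, -1),  # log-uniform
    "feature_fraction": lambda rng: rng.uniform(0.5, 1.0),
    "min_data_in_leaf": lambda rng: rng.randint(10, 200),
}

def random_search(evaluate, n_trials=20, seed=0):
    """Sample n_trials parameter sets and keep the one with the lowest score.

    evaluate -- callable taking a params dict and returning a score to
                minimize (for example, validation RMSE)
    """
    rng = random.Random(seed)  # seeded so the search is reproducible
    best_params, best_score = None, float("inf")
    for _ in range(n_trials):
        params = {name: draw(rng) for name, draw in SEARCH_SPACE.items()}
        score = evaluate(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

Each trial is independent, so the number of trials can be set directly from the available time budget, which is what makes this approach convenient for a quick concept check.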

For future work, I think the most important step would be applying a good validation strategy along with a custom loss function to get a more accurate and stable forecast. This idea is described in the tutorial I found here.
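One common validation strategy for time series is rolling-origin validation, where each fold trains on the past and validates on the block of days that follows. The sketch below is an assumption about what such a strategy could look like, not necessarily the one from the tutorial. The function name is hypothetical, and the default 28-day horizon is chosen to match the competition's forecast horizon.

```python
def rolling_origin_folds(n_days, horizon=28, n_folds=3):
    """Return (train_days, valid_days) index ranges, oldest fold first.

    Each validation block is `horizon` days long and comes strictly
    after its training range, so no future data leaks into training.
    """
    folds = []
    for k in range(n_folds, 0, -1):
        valid_end = n_days - (k - 1) * horizon
        valid_start = valid_end - horizon
        if valid_start <= 0:
            raise ValueError("not enough history for the requested folds")
        folds.append((range(0, valid_start), range(valid_start, valid_end)))
    return folds
```

Averaging the metric over several such folds gives a more stable estimate than a single hold-out period, at the cost of training the model once per fold.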

My main conclusion is that, when working with a big dataset, finding a good strategy for model training and validation is important. For that reason, I decided to explore a forecasting method that performs better on the series that generate high revenue for the company, and then potentially scale it up.

Here is the link to my repository.
