<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vincent O Ajayi, PhD in Economics</title>
    <description>The latest articles on DEV Community by Vincent O Ajayi, PhD in Economics (@vincajayi).</description>
    <link>https://dev.to/vincajayi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F195147%2F77618649-43d2-43d6-8bb9-bc3a328e9feb.jpeg</url>
      <title>DEV Community: Vincent O Ajayi, PhD in Economics</title>
      <link>https://dev.to/vincajayi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vincajayi"/>
    <language>en</language>
    <item>
      <title>How to deal with missing data </title>
      <dc:creator>Vincent O Ajayi, PhD in Economics</dc:creator>
      <pubDate>Wed, 24 Jul 2019 22:55:24 +0000</pubDate>
      <link>https://dev.to/vincajayi/how-to-deal-with-missing-data-15h2</link>
      <guid>https://dev.to/vincajayi/how-to-deal-with-missing-data-15h2</guid>
      <description>&lt;p&gt;The most common challenge faced by data scientists (DS) and data analysts (DA) is missing data. Every day, both DA and DS spend several hours dealing with missing data. The question is why is missing data a problem? Analysts presume that all variables should have a particular value at a particular state, and when there is no value for the variable, we refer to it as missing data. Missing data can have severe effects on a statistical model and ignoring it may lead to a biased estimate that may invalidate statistical results. &lt;br&gt;
In this article, I will suggest ways to resolve the problem of missing data. Although different studies have suggested various methods to deal with missing data, I have noticed that none of these methods have theoretical or mathematical support to justify their processes. In this article, I will analysis the nine essential steps a data scientist must follow to address the issue of missing data. The steps are based on my personal experience as a quantitative researcher and data scientist for more than 7 years.&lt;/p&gt;

&lt;p&gt;Basic steps for dealing with missing data&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Aims and objectives: Before jumping to any method of estimating missing data, understand the motivation behind the project and identify the research problem. The aim of the project must be outlined so that the key variables likely to be relevant can be specified. You should be able to list the data that can help answer the questions defining the project’s objectives.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check for the appropriate variables: If you have been given a dataset, ask yourself: does it contain all the variables needed to address the research questions? For example, a data scientist may be interested in predicting inflation with a multivariate model, yet the data received may lack likely inflation indicators such as the consumer price index or the GDP deflator. In that case, contact your line manager or the data department for a dataset that contains the relevant variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Visualise the data and check for missing values: If a value is missing, check the source database; the best approach for finding a missing value is to look for it at the source, since the extraction process itself may be at fault.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Variable substitution: A straightforward way to deal with missing data, especially when a large percentage of a variable is missing, is to substitute a similar indicator, particularly for continuous variables. For example, the GDP deflator could be used instead of the consumer price index to measure or forecast inflation. However, apply this method with care, because different proxies for a variable may lead to different outcomes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Mean/mode/median substitution: This method can be applied if the percentage of missing values is small (e.g., less than 30%). For a continuous variable, the missing value can be replaced by the variable’s mean or median; for a categorical variable, by its mode. The limitation of this method is that it reduces the variability of your data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Delete the missing attribute: If a large percentage of the data is missing (e.g., more than 30%), the affected rows or columns can be dropped, provided the variable is an independent variable that is neither strongly related to the dependent variable nor relevant to the model. For example, if you want to use multiple regression to predict revenue and a product-number variable has many missing entries, that variable could be removed rather than imputed. Note that you may lose samples and important information, and may underfit the model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluation and prediction: You can use statistical or theoretical models to estimate or predict the missing value. For instance, a regression fitted on the complete observations can predict a missing value from the other available variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Apply sophisticated statistical models that are robust to missing data without requiring imputation: For example, the XGBoost model can be applied for prediction instead of linear regression. XGBoost handles missing values by default: during training it learns, for each split, the branch direction that minimises the training loss when a value is missing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Sample reduction: This step applies to time-series data. If values are missing, the sample can be reduced so that the estimation is based on a shorter span that contains no missing values, avoiding the search for them altogether. Note that sample reduction can significantly affect the precision and accuracy of the results.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Common Applications of Machine Learning for Small-Scale Businesses</title>
      <dc:creator>Vincent O Ajayi, PhD in Economics</dc:creator>
      <pubDate>Sun, 14 Jul 2019 14:18:59 +0000</pubDate>
      <link>https://dev.to/vincajayi/common-applications-of-machine-learning-for-small-scale-businesses-3oki</link>
      <guid>https://dev.to/vincajayi/common-applications-of-machine-learning-for-small-scale-businesses-3oki</guid>
      <description>&lt;p&gt;When I first heard about machine learning (ML), I thought only big companies applied it to explore big data. On searching the internet for the meaning of ML, I discovered that Wikipedia defines it as a subset of artificial intelligence (AI). In particular, it involves the scientific study of algorithms and statistical models that computer systems use to perform a specific task effectively, without using explicit instructions and relying on patterns and inferences instead. A few common examples of ML’s application available on the internet include skin cancer detection, facial recognition, churn prediction, diagnosis of diabetic eye disease, in addition to those of natural language processing such as language translation. Moreover, ML plays in a role in the way companies such as Amazon and Netflix apply recommended engines to predict and suggest what any given user might want to buy or watch.&lt;/p&gt;

&lt;p&gt;I never thought that ML could be useful for small-scale businesses. Surprisingly, almost everyone I discussed the concept with held the same view. I kept this opinion until I registered for a short training course on ML and was subsequently assigned a project to develop an ML model to reduce the cost of marketing campaigns for a charitable organisation, which changed my perspective completely.&lt;br&gt;
 &lt;br&gt;
This article discusses the importance of ML for small-scale businesses and gives an example of how an ML algorithm can be employed to estimate costs.&lt;/p&gt;

&lt;p&gt;Six Common Important Functions of Machine Learning (ML) for Small-Scale Businesses&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Trend and Pattern Recognition:
Many owners of small-scale businesses maintain a sales book and an accounts book, in which they record customers’ names, sales volumes, cash transactions and so on across their stores. These records generate data that can be analysed to identify customers’ buying patterns, along with other factors that drive sales or influence customers’ preferences.&lt;/li&gt;
&lt;li&gt; Modelling and Forecasting:
The information in a sales book can be used to estimate costs, predict sales volumes and gauge revenues, profits and expected market share. The sustainability and success of a business depend on the accuracy of such forecasts.&lt;/li&gt;
&lt;li&gt; Security:
ML can help analyse data and identify patterns within it. This can be used to flag suspicious transaction behaviour, track errors and detect fraud, so that business owners can act immediately to close a loophole and prevent its recurrence.&lt;/li&gt;
&lt;li&gt; Information Extraction:
ML can be used to extract valuable information from external databases and support operational coordination. No business owner can be an island of knowledge: to make sound business decisions, owners need external information (for instance, on weather, inflation or interest rates) and appropriate analysis of it.&lt;/li&gt;
&lt;li&gt; Good Business Environment:
ML creates a suitable environment for small businesses to grow and become efficient; it also gives staff new technology with which to work better. For example, ML recommendation engines can model customer behaviour, recommend additional products and promote upselling. Media companies use ML to identify patterns of lip movements, which they convert into text.&lt;/li&gt;
&lt;li&gt; Advertisement and Marketing:
It is noteworthy that 75% of enterprises utilise ML to enhance customer satisfaction by more than 10%, and three in four organisations employ AI and ML to increase the sales of new products and services by more than 10%. Within seconds, ML applications can reach millions of customers to inform them about new products and why an existing product is better than alternatives.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;An Example of How Machine Learning Can be Used to Estimate Costs&lt;br&gt;
A charitable organisation relies on the generosity of its well-wisher to cover the operational cost and provide the necessary capital to pursue charitable endeavours. Owing to the higher numbers of donors from different parts of various countries, the cost of soliciting for funds through postcard has increased over the years, for 10% record that the average donation received through postcard for an individual is £15, while the cost to produce and send the card is £2. The expectation is that this cost could increase if a charitable organisation chooses to send a postcard to everybody that identifies with the organisation. As a result, the organisation would need to hire a data scientist to develop a cost-effective model that can identify donors with the highest potential and likelihood of making donations.&lt;br&gt;
Here is the link to this project’s code (much of the code is omitted here for brevity); please follow the GitHub notebook alongside this article: Machine Learning: Donor Prediction (&lt;a href="https://github.com/vincajayi/-Machine-Learning-Donor-prediction-/blob/master/Dnonor_solution.ipynb"&gt;https://github.com/vincajayi/-Machine-Learning-Donor-prediction-/blob/master/Dnonor_solution.ipynb&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Machine Learning Algorithms&lt;br&gt;
We compare the predictive performance of six supervised machine learning techniques in Python, with the aim of choosing the most appropriate algorithm to estimate costs. In particular, we considered the following classification techniques: Logistic Regression (LR), K-Neighbors Classifier (KNN), Gaussian Naive Bayes (GaussianNB), Random Forest Classifier (RF), Linear Discriminant Analysis (LDA) and Neural Network Classifier (NN). The results are available at (&lt;a href="https://github.com/vincajayi/-Machine-Learning-Donor-prediction-/blob/master/Dnonor_solution.ipynb"&gt;https://github.com/vincajayi/-Machine-Learning-Donor-prediction-/blob/master/Dnonor_solution.ipynb&lt;/a&gt;).&lt;/p&gt;
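&lt;p&gt;A sketch of how such a comparison can be set up with scikit-learn: each candidate model is scored by k-fold cross-validation, and the one with the highest mean score is retained. The data here is synthetic and the model list is illustrative (a subset of the article&amp;#8217;s candidates plus CART); this is not the project&amp;#8217;s actual notebook.&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the donor dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

models = {
    "LR":   LogisticRegression(max_iter=1000),
    "KNN":  KNeighborsClassifier(),
    "NB":   GaussianNB(),
    "RF":   RandomForestClassifier(random_state=1),
    "LDA":  LinearDiscriminantAnalysis(),
    "CART": DecisionTreeClassifier(random_state=1),
}

# Mean 5-fold cross-validation accuracy for each model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
best = max(scores, key=scores.get)   # the model with the highest mean score wins
```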

&lt;p&gt;From the study, we observed that the decision tree classifier (CART) outperformed the other prediction models, as it achieved the highest mean score among the selected models. Therefore, we apply the CART model for prediction. The results are available below:&lt;/p&gt;

&lt;p&gt;True Positives (TP): 280&lt;br&gt;
False Positives (FP): 796&lt;br&gt;
True Negatives (TN): 2106&lt;br&gt;
False Negatives (FN): 693&lt;/p&gt;

&lt;p&gt;Classification Report&lt;br&gt;
Precision, recall, f1-score and support, tabulated below:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              precision    recall  f1-score   support
          0        0.75      0.73      0.74      2902
          1        0.26      0.29      0.27       973
avg / total        0.63      0.62      0.62      3875
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
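&lt;p&gt;The class-1 row of the report follows directly from the confusion-matrix counts above, which gives a quick way to sanity-check the table:&lt;/p&gt;

```python
# Confusion-matrix counts from the CART model.
TP, FP, TN, FN = 280, 796, 2106, 693

precision_1 = TP / (TP + FP)   # 280 / 1076, about 0.26
recall_1    = TP / (TP + FN)   # 280 /  973, about 0.29
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # about 0.27
support_1 = TP + FN            # 973 households that actually donate
```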

&lt;p&gt;The classification report covers 3875 households, for which we test how many are actually likely to donate to the charitable organisation. There are two possible classes: “1” marks households that donate, whereas “0” marks households that do not.&lt;/p&gt;

&lt;p&gt;From the Confusion Matrix, N = 3875 (the total households), the decision tree classifier predicted 1076 households likely to donate (FP +TP) and 2799 households unlikely to do so (TN+FN). In reality, 973 households from the sample will donate (ACTUAL “1” = TP +FN), while 2902 households may not (ACTUAL “0” = TN +FP).&lt;br&gt;
To calculate the cost, recall the following:&lt;/p&gt;

&lt;p&gt;Unit cost = £2&lt;br&gt;
Unit average revenue = £15&lt;br&gt;
For the organisation to minimise its cost, it could avoid sending postcards to everyone who expressed an interest and send them to only those households that are most likely to donate.&lt;/p&gt;

&lt;p&gt;In this case, the total cost will be (TP+FN) *unit cost = £1946&lt;br&gt;
Revenue = TP*unit revenue = £4200&lt;br&gt;
Profit = (TP*unit revenue) – ([TP + FN] *unit cost) = £2254&lt;/p&gt;
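&lt;p&gt;The same arithmetic, written out in Python using the article&amp;#8217;s formulas (which cost the mailing over the 973 actual donors, TP + FN):&lt;/p&gt;

```python
TP, FN = 280, 693
unit_cost, unit_revenue = 2, 15   # pounds per card sent / per donation received

cost    = (TP + FN) * unit_cost   # 973 * 2  = 1946
revenue = TP * unit_revenue       # 280 * 15 = 4200
profit  = revenue - cost          # 4200 - 1946 = 2254
```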

&lt;p&gt;This result implies that if the charity sends postcards only to those households that are likely to donate, it will spend £1946 and earn £4200, generating a profit of £2254.&lt;br&gt;
 &lt;br&gt;
Conclusion&lt;br&gt;
In this article, we discussed how small-scale businesses can apply ML to improve their performance and generate greater revenue. We also provided an example of how ML can be used to estimate costs and identify likely donors for a charity. We believe it is crucial for business owners to learn about the importance of data collection and to use ML algorithms to improve their businesses’ performance.&lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
  </channel>
</rss>
