Dealing with missing values is a crucial step in the data cleaning process. Most of us know what techniques to apply when handling missing values, but it's essential to understand why we choose specific techniques for our dataset.
In this article, we will explore the reasons behind choosing deletion or imputation techniques.
Deletion: Removing rows or columns with missing values.
Imputation: Filling in missing values with estimates such as mean, median or mode.
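As a minimal sketch of the two approaches, assuming pandas is used and a small made-up DataFrame (the column names and values here are illustrative, not taken from a real dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "hours_of_study": [5, 2, np.nan, 4],
    "previous_test_score": [70, np.nan, 85, 60],
})

# Deletion: drop every row that has at least one missing value.
deleted = df.dropna()

# Imputation: fill each column's gaps with that column's mean.
imputed = df.fillna(df.mean())

print(deleted)
print(imputed)
```

Deletion keeps only the fully observed rows, while imputation keeps every row but replaces the gaps with estimates.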
Types of Missing Values:
Before diving into the techniques, let's briefly review the three types of missing values. As a running example, consider the task of predicting a student's performance from the following data:
| Hours of Study | Previous Test Score | Attendance Rate | Final Exam Score |
|----------------|---------------------|-----------------|------------------|
| 5 | 70 | 90% | 80 |
| 2 | 45 | 75% | 60 |
| 7 | 85 | 95% | 90 |
| 4 | 60 | 80% | 70 |
| 6 | 75 | 85% | 85 |
| 3 | 50 | 70% | 65 |
| 8 | 90 | 100% | 95 |
| 4 | 65 | 80% | 75 |
| 6 | 80 | 90% | 85 |
| 2 | 40 | 60% | 55 |
| 9 | 95 | 95% | 100 |
| 5 | 70 | 85% | 80 |
| 3 | 55 | 75% | 70 |
| 7 | 80 | 90% | 90 |
| 4 | 65 | 80% | 75 |
Missing Completely At Random (MCAR):
In MCAR, missing values occur completely at random and are unrelated to any other variables in the dataset. The absence of data happens purely by chance, and there is no systematic reason for the missingness. If some rows were missing randomly, irrespective of any pattern or relationship with other variables, it would be considered MCAR.
Missing At Random (MAR):
In MAR, the occurrence of missing values is related to other observed variables in the dataset, but not to the values that are missing. In other words, once you account for the observed variables, the missingness no longer depends on the unobserved values. For example, if the "Attendance Rate" is missing mostly for students who scored poorly in the previous test, and "Previous Test Score" is itself recorded, it could be considered MAR.
Missing Not At Random (MNAR):
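In MNAR, the occurrence of missing values is related to the missing values themselves. The missingness depends on information that is missing and not captured by other variables in the dataset. For example, if students with a low "Final Exam Score" tend to have that score missing (say, because they chose not to report it), it would be considered MNAR. By contrast, if the "Final Exam Score" were missing mainly for students with high "Hours of Study", and hours of study are recorded, that would be MAR, because the missingness is explained by an observed variable.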
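The three mechanisms can be made concrete by deleting values from a complete table in three different ways. The sketch below uses the first eight rows of the student table above; the 25% drop rate and the cut-offs of 60 and 70 are assumptions chosen for illustration, and the deterministic rules are a simplification (in real data the dependence is usually probabilistic).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# First eight rows of the student table, stored as floats so NaN fits.
df = pd.DataFrame({
    "hours_of_study": [5, 2, 7, 4, 6, 3, 8, 4],
    "previous_test_score": [70, 45, 85, 60, 75, 50, 90, 65],
    "attendance_rate": [90, 75, 95, 80, 85, 70, 100, 80],
    "final_exam_score": [80, 60, 90, 70, 85, 65, 95, 75],
}, dtype=float)

# MCAR: every attendance value has the same chance (assumed 25%) of
# going missing, regardless of anything else in the table.
mcar = df.copy()
mcar_mask = rng.random(len(df)) < 0.25
mcar.loc[mcar_mask, "attendance_rate"] = np.nan

# MAR: attendance goes missing when the *observed* previous test score
# is low (assumed cut-off: below 60).
mar = df.copy()
mar_mask = df["previous_test_score"] < 60
mar.loc[mar_mask, "attendance_rate"] = np.nan

# MNAR: the final exam score goes missing when that score *itself*
# is low (assumed cut-off: below 70).
mnar = df.copy()
mnar_mask = df["final_exam_score"] < 70
mnar.loc[mnar_mask, "final_exam_score"] = np.nan
```

In the MAR copy, the column that explains the missingness is still fully available. In the MNAR copy, the explanation is hidden together with the values it removed, which is what makes this case hard to detect from the data alone.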
When to Apply Deletion:
Missing Completely At Random (MCAR):
If the missing values occur completely at random, deleting the rows or columns with missing data is a valid option. The remaining rows are effectively a random sample of the original data, so deletion does not introduce bias. The cost is a smaller sample, which means less precise estimates.
Small Amount of Missing Data:
If the dataset has a small proportion of missing values (e.g., less than 5%), deletion may be reasonable, especially if the missingness is random. In such cases, the impact of removing the missing data might not significantly affect the analysis.
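One way to apply this rule of thumb is to measure the share of incomplete rows before deleting anything. The helper below is a hypothetical sketch (the name `drop_if_few_missing` and the default 5% threshold are assumptions for illustration). It only checks the proportion; it does not test whether the missingness is random, which still has to be judged separately.

```python
import numpy as np
import pandas as pd

def drop_if_few_missing(df, threshold=0.05):
    """Drop incomplete rows only if their share is below `threshold`.

    Hypothetical helper for illustration; returns the input unchanged
    when too many rows would be lost.
    """
    # Share of rows that contain at least one missing value.
    incomplete_share = df.isna().any(axis=1).mean()
    if incomplete_share < threshold:
        return df.dropna()
    return df

# Synthetic example: 100 rows, 2 of them with a missing score (2%).
df = pd.DataFrame({"score": np.arange(100, dtype=float)})
df.loc[[3, 50], "score"] = np.nan

cleaned = drop_if_few_missing(df)
print(len(df), len(cleaned))
```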
When to Apply Imputation:
Missing At Random (MAR) or Missing Not At Random (MNAR):
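When the missing data is not completely random and has some patterns related to other observed variables (MAR), imputation methods can use those variables to estimate the missing values. Imputation tries to preserve relationships between variables and can help retain valuable information. Under MNAR, the observed variables do not fully explain the missingness, so imputed values may still be biased; deletion is biased in that case too, so imputation is usually still preferred, but the results deserve extra caution.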
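To illustrate how imputation can use an observed variable, the sketch below fills missing "Attendance Rate" values from "Previous Test Score" with a straight-line fit. This is one simple option chosen for illustration, not the only or necessarily the best method, and the gaps in the data are made up.

```python
import numpy as np
import pandas as pd

# Illustrative data: attendance is missing for the two lowest test scores.
df = pd.DataFrame({
    "previous_test_score": [70, 45, 85, 60, 75, 50, 90, 65],
    "attendance_rate": [90, np.nan, 95, 80, 85, np.nan, 100, 80],
}, dtype=float)

# Fit attendance ~ slope * score + intercept on the fully observed rows.
observed = df.dropna(subset=["attendance_rate"])
slope, intercept = np.polyfit(
    observed["previous_test_score"], observed["attendance_rate"], deg=1
)

# Predict attendance only for the rows where it is missing.
missing = df["attendance_rate"].isna()
df.loc[missing, "attendance_rate"] = (
    slope * df.loc[missing, "previous_test_score"] + intercept
)

print(df)
```

Mean imputation would assign the same attendance to both students. The fitted line instead gives the student with the lower test score a lower estimate, which keeps the relationship between the two columns intact.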
Significant Amount of Missing Data:
If a large portion of the dataset contains missing values, outright deletion may lead to a loss of valuable information and potentially biased results. Imputation, in this case, allows you to keep every row and make use of all the values that were actually observed.
Drawback of Imputation:
While imputation techniques can be helpful in estimating missing values, they do make assumptions about the data. The quality of imputed values depends on the accuracy of these assumptions. Therefore, it is crucial to validate the imputation techniques used and assess their impact on the final results.
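One common way to validate an imputation method is to hide values that are actually known, impute them, and compare the estimates with the truth. The sketch below does this for mean imputation on synthetic data; the distribution, the 20% hiding rate, and the choice of mean absolute error are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic, fully observed column standing in for real data.
true_values = pd.Series(rng.normal(loc=70, scale=10, size=200))

# Hide roughly 20% of the known values, completely at random.
hidden = rng.random(len(true_values)) < 0.2
with_gaps = true_values.mask(hidden)

# Impute the gaps with the mean of the values that remain.
imputed = with_gaps.fillna(with_gaps.mean())

# Mean absolute error, measured on the hidden entries only.
mae = (imputed[hidden] - true_values[hidden]).abs().mean()
print(f"MAE of mean imputation on hidden values: {mae:.2f}")
```

Running the same check with several imputation methods gives a like-for-like comparison. Note that hiding values at random only simulates MCAR; it says little about how a method behaves when the real missingness is MAR or MNAR.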
In conclusion, understanding the types of missing values in your dataset is essential for making informed decisions about how to handle them. If the missingness is completely random and affects only a small share of the data, deletion might be a suitable option, but if there are patterns in the missing data, imputation techniques can help retain valuable information. Always remember to validate the chosen techniques to ensure the integrity of your analysis.