Missing data is a common problem in data analysis and machine learning. It can occur for various reasons, such as human error, sensor failures, or incomplete records. Missing data can affect the quality and validity of an analysis and lead to biased or inaccurate results. Therefore, it is important to handle missing data properly before applying any statistical or machine learning techniques.
In this blog post, I will show you how to handle missing data in Python with pandas, a popular library for data manipulation and analysis. I will use a sample dataset from Kaggle that contains information about passengers on the Titanic. The dataset has some missing values in the columns Age, Cabin, and Embarked. I will demonstrate five different methods to handle missing data and compare their advantages and disadvantages.
You can find the complete code for this blog post in this Kaggle notebook: 5 Methods to Handle Missing Data
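Before choosing a method, it helps to see how much data is actually missing. Here is a minimal sketch, assuming the Titanic training data has been downloaded as train.csv (the filename is hypothetical):
import pandas as pd
df = pd.read_csv('train.csv') # hypothetical filename for the Kaggle Titanic training data
print(df.isna().sum()) # number of missing values per column (Age, Cabin, and Embarked here)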
Deleting the columns with missing data
One of the simplest methods to handle missing data is to delete the columns (or features) that contain missing values. This method can be applied by using the dropna method in pandas:
df.dropna(axis=1, inplace=True) # drop every column that contains at least one missing value
This method has the advantage of being easy to implement and avoiding any assumptions about the missing data. However, it also has some drawbacks:
- It can result in a significant loss of information and reduce the predictive power of the model.
- It can introduce bias if the missing data is not completely random and depends on some other variables.
- It can reduce the variability and generalizability of the model by eliminating some important features.
Therefore, this method should only be used when the columns with missing data are irrelevant or redundant for the analysis, or when the proportion of missing data is very high (more than 50%).
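To apply that rule of thumb programmatically, one option is to measure the proportion of missing values per column and drop only the columns above a threshold. A minimal sketch:
missing_ratio = df.isna().mean() # fraction of missing values in each column
df = df.drop(columns=missing_ratio[missing_ratio > 0.5].index) # drop columns that are more than half missing
On the Titanic dataset this drops only Cabin, which is missing for the majority of passengers.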
Deleting the rows with missing data
Another simple method to handle missing data is to delete the rows (or observations) that contain missing values. This method can also be applied by using the dropna method in pandas:
df.dropna(axis=0, inplace=True) # drop every row that contains at least one missing value
This method has the advantage of preserving all the features and avoiding any assumptions about the missing data. However, it also has some drawbacks:
- It can also result in a significant loss of information and reduce the sample size and statistical power of the test.
- It can also introduce bias if the missing data is not completely random and depends on some other variables.
- It can increase the variance and overfitting of the model by eliminating some important observations.
Therefore, this method should only be used when the rows with missing data are irrelevant or outliers for the analysis, or when the proportion of missing data is very low (less than 5%).
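In practice it is often safer to drop rows only when specific columns are missing, which dropna supports through its subset parameter. In the Titanic dataset, for example, Embarked is missing for just two passengers, so those rows can be dropped at little cost:
df.dropna(subset=['Embarked'], inplace=True) # drop only the rows where Embarked is missing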
Filling the missing data with a value – Imputation
A more sophisticated method to handle missing data is to fill (or impute) the missing values with some appropriate value. This method can be applied by using the fillna method in pandas:
df.fillna(value=30, inplace=True) # replace every missing value in the dataframe with the constant 30
This method has the advantage of retaining all the information and avoiding any loss of data. However, it also has some drawbacks:
- It requires making some assumptions about the distribution and mechanism of the missing data.
- It can introduce bias and distortion if the imputed value is not representative or realistic for the missing data.
- It can reduce the variability and uncertainty of the model by creating artificial values.
Therefore, this method should only be used when there is a reasonable justification for choosing a specific value for imputing the missing data, such as domain knowledge or a logical rule.
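In the Titanic dataset, for instance, a missing Cabin plausibly means the cabin was simply not recorded, so one defensible rule is to fill it with an explicit placeholder rather than a guessed value:
df['Cabin'] = df['Cabin'].fillna('Unknown') # treat a missing cabin as its own category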
Filling missing data with a measure of central tendency
A common way to choose a value for imputing missing data is to use a measure of central tendency, such as the mean, median, or mode. This method can be applied by using the fillna method in pandas along with a function that calculates the measure of central tendency:
df.fillna(value=df.mean(numeric_only=True), inplace=True) # mean for numerical columns with symmetric data
df.fillna(value=df.median(numeric_only=True), inplace=True) # median for numerical columns with skewed data
df.fillna(value=df.mode().iloc[0], inplace=True) # mode for categorical columns (df.mode() returns a dataframe, so take its first row)
This method has the advantage of being simple and robust to outliers. However, it also has some drawbacks:
- It assumes that the data is missing at random; the mean, in particular, is only representative when the column is roughly symmetric.
- It can introduce bias and distortion if the mean, median, or mode is not representative or realistic for the missing data.
- It can reduce the variability and uncertainty of the model by creating artificial values that are equal to each other.
Therefore, this method should only be used when there is no strong evidence that the missing data is non-random or that a single central value would be unrepresentative of the column.
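Applied per column to the Titanic data, that might look like this (Age is numerical and somewhat skewed, Embarked is categorical):
df['Age'] = df['Age'].fillna(df['Age'].median()) # the median is robust to the skew in Age
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0]) # the most frequent port of embarkation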
Filling using the most probable value
A more advanced way to choose a value for imputing missing data is to use a statistical or machine learning technique to estimate the most probable value based on the other variables. For example, one can use linear regression to predict a missing numerical variable, or logistic regression to predict a missing categorical variable, from the remaining features. This method can be applied by using a library such as scikit-learn to fit a model and make predictions:
import pandas as pd
from sklearn.linear_model import LinearRegression

lr = LinearRegression() # create an instance of the linear regression model
traindf = df[df['Age'].notna()].copy() # rows with a known Age form the training set
testdf = df[df['Age'].isna()].copy() # rows with a missing Age are the ones to predict
y = traindf['Age'] # the Age column of the training set is the target variable
traindf.drop("Age", axis=1, inplace=True) # the remaining columns are the features
testdf.drop("Age", axis=1, inplace=True) # (they must be numeric and free of missing values)
lr.fit(traindf, y) # fit the model on the rows where Age is known
pred = lr.predict(testdf) # predict Age for the rows where it is missing
traindf['Age'] = y # add the original Age values back to the training set
testdf['Age'] = pred # add the predicted Age values to the test set
df = pd.concat([traindf, testdf]) # recombine into a dataframe with no missing Age values
This method has the advantage of being more accurate and realistic than using a constant value. However, it also has some drawbacks:
- It requires making some assumptions about the relationship and correlation between the variables.
- It can introduce bias and distortion if the model is not well-fitted or validated for the missing data.
- It can increase the complexity and computation time of the model by adding more parameters and steps.
Therefore, this method should only be used when there is a strong evidence that the missing data is related to other variables and can be predicted by a reliable model.
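If you would rather not wire this up by hand, scikit-learn also ships ready-made imputers built on the same idea. A minimal sketch with KNNImputer, assuming the columns being imputed are numeric:
from sklearn.impute import KNNImputer

num_cols = df.select_dtypes(include='number').columns # KNNImputer only handles numeric columns
imputer = KNNImputer(n_neighbors=5) # estimate each missing value from the 5 most similar rows
df[num_cols] = imputer.fit_transform(df[num_cols])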
Conclusion
In this blog post, I have shown you how to handle missing data in Python with pandas. I have demonstrated five different methods to handle missing data and compared their advantages and disadvantages. There is no single best method that works for all cases, as different methods have different assumptions and implications. The choice of method depends on various factors, such as the type and amount of missing data, the nature of the problem, and the goal of the analysis. Therefore, it is important to understand the characteristics and limitations of each method and apply them with caution and care.
I hope you found this blog post useful and informative. If you have any questions or feedback, please feel free to leave a comment below. Thank you for reading!