DEV Community

Neha Gupta

Handling Missing Values || Feature Engineering || Machine Learning (Part1)

Hey reader 👋 Hope you are doing well 😊
We know that feature engineering is a crucial step in improving the performance of a machine learning model. One of the most important tasks in feature engineering is handling missing values. In this blog we are going to discuss in detail how to handle them. So let's get started 🔥.

What are Missing Values?

Missing values are data points that are absent for a specific variable in a dataset. They can be represented in various ways, such as blank cells, null values, or special symbols like “NA” or “unknown.”

These missing data points pose a significant challenge in data analysis and can lead to inaccurate or biased results.
There are many reasons why a dataset may contain missing values:

  • Technical issues during data collection or storage.

  • If the data comes from a survey, respondents may leave some questions blank, which leads to missing values in the data.

  • Data processing issues, privacy concerns, etc.

Types of Missing Values

  • Missing Completely at Random (MCAR)
    MCAR is a specific type of missing data in which the probability of a data point being missing is entirely random and independent of any other variable in the dataset. In simpler terms, whether a value is missing or not has nothing to do with the values of other variables or the characteristics of the data point itself.

  • Missing at Random (MAR)
    MAR is a type of missing data where the probability of a data point being missing depends on the values of other observed variables in the dataset, but not on the missing value itself. For example, in a survey, if younger respondents skip the income question more often than older ones regardless of how much they earn, the missingness of income depends on age (which is observed) and not on income itself.

  • Missing Not at Random (MNAR)
    MNAR is the most challenging type of missing data to deal with. It occurs when the probability of a data point being missing is related to the missing value itself. This means that the reason for the missing data is informative and directly associated with the variable that is missing. For example, smoking status may go unrecorded for patients admitted as emergencies, who are also more likely to have worse outcomes from surgery.
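
To make the three mechanisms above more concrete, here is a minimal simulation sketch. Everything in it (the age and income variables, the probabilities, the thresholds) is made up for illustration and is not taken from any real dataset.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
n = 10_000

# Hypothetical variables: age is always observed, income may go missing.
age = rng.integers(18, 70, size=n)
income = rng.normal(50_000, 15_000, size=n)

# MCAR: every income value has the same 20% chance of being missing.
mcar_missing = rng.random(n) < 0.20

# MAR: the chance of missing income depends on age (observed), not on income.
young = age < 30
mar_missing = rng.random(n) < np.where(young, 0.40, 0.10)

# MNAR: the chance of missing income depends on the income value itself.
mnar_missing = rng.random(n) < np.where(income > 65_000, 0.50, 0.05)

print("MCAR missing rate:", mcar_missing.mean())
print("MAR missing rate (age < 30 vs. rest):",
      mar_missing[young].mean(), mar_missing[~young].mean())
print("MNAR mean income (missing vs. observed):",
      income[mnar_missing].mean(), income[~mnar_missing].mean())
```

Under MAR the missing rate differs between age groups, while under MNAR the values that go missing are systematically higher than the ones we get to see, which is why MNAR is so hard to correct for.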

How do missing values impact our dataset?

  • It can reduce the size of the sample or dataset.

  • Lack of information. If the dataset has a large amount of missing values, there is a high chance that useful information is lost.

  • If the missing data is not handled properly, it can bias the results of your analysis, because the model is trained on data that no longer represents the full picture.

  • Some statistical techniques require complete data for all variables, making them inapplicable when missing values are present.

Identify missing values

There are different methods in Python's pandas library to identify missing values.

  • .isnull(): Identifies missing values in a Series or DataFrame. It returns a boolean Series or DataFrame, where True indicates missing values and False indicates non-missing values.

  • .notnull(): Checks for non-missing values in a pandas Series or DataFrame. It returns a boolean Series or DataFrame, where True indicates non-missing values and False indicates missing values.

  • .isna(): An alias of .isnull(). It returns True for missing values and False for non-missing values.
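
Here is a small sketch of these methods in action. The DataFrame and its column names are hypothetical and only serve as toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in each column.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

missing_mask = df.isnull()            # True where a value is missing
present_mask = df.notnull()           # True where a value is present
missing_per_column = df.isna().sum()  # number of missing values per column

print(missing_per_column)
```

Chaining .sum() after .isna() (or .isnull()) is a common way to get a quick per-column count of missing values before deciding how to treat them.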

Treating Missing Values

There are various techniques used to treat missing values in a dataset.

1. Remove all the missing data
If the dataset doesn't contain a significant amount of missing data, it can be worthwhile to simply remove the rows or columns that contain missing values. The method used in pandas is:
dropna(): Drops rows or columns containing missing values based on custom criteria.
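
A minimal sketch of dropna() with a few of its options follows. The DataFrame and column names are hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "notes" is mostly empty.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40],
    "salary": [50000, 62000, np.nan, 58000],
    "notes": [np.nan, np.nan, np.nan, "ok"],
})

# Keep only rows that have no missing value in any column.
rows_complete = df.dropna()

# Keep rows that are complete in the listed columns only.
rows_key_cols = df.dropna(subset=["age", "salary"])

# Keep columns that have at least 3 non-missing values.
cols_mostly_full = df.dropna(axis=1, thresh=3)

print(len(df), len(rows_complete), len(rows_key_cols))
print(list(cols_mostly_full.columns))
```

Note how dropping on every column leaves a single row here, while restricting the check with subset keeps more data. dropna() returns a new DataFrame and leaves df unchanged.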

2. Imputation
Imputation means replacing a missing value with another value based on a reasonable estimate. Because the filled-in values are estimates rather than real observations, imputation can introduce bias.
Some common imputation methods are:

  • Mean Imputation: Replace missing values with the mean of the relevant variable. The mean is sensitive to outliers, so this strategy can be strongly affected by them.
    Implementation:
    Method 1:
    df[column_name] = df[column_name].fillna(df[column_name].mean())

Method 2:
Using SimpleImputer():
It is defined in the sklearn library (in sklearn.impute). It replaces missing values using a descriptive statistic (e.g. mean, median, or most frequent) along each column, or using a constant value.
To use it, we import numpy and SimpleImputer and create an instance of SimpleImputer named imp_mean, which replaces missing values (np.nan) with the mean (strategy="mean"). Then we fit the imputer to the data and transform it.
We can use different strategies to impute missing values here.
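
The steps described for SimpleImputer can be sketched roughly as follows. The array X is hypothetical toy data; only the name imp_mean and strategy="mean" come from the description above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns with one missing value each.
X = np.array([
    [1.0, 2.0],
    [np.nan, 3.0],
    [7.0, np.nan],
])

# Replace np.nan with the mean of each column.
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the column means and fills the gaps in one step.
X_imputed = imp_mean.fit_transform(X)

print(X_imputed)
```

In a real project you would usually call fit() on the training data only and then transform() on both training and test data, so that the test set does not influence the learned means.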

  • Median Imputation: Replace missing values with the median of the relevant variable. The median is less affected by outliers than the mean.
    Implementation:
    df[column_name] = df[column_name].fillna(df[column_name].median())
    We can also use SimpleImputer; all we need to do is pass strategy="median".

  • Mode Imputation: Replace missing values with the mode of the relevant variable.
    Implementation:
    df[column_name] = df[column_name].fillna(df[column_name].mode()[0])
    Note that mode() returns a Series (there can be more than one mode), so we pick the first entry with [0].
    We can also use SimpleImputer; all we need to do is pass strategy="most_frequent".
    This strategy can be challenging in the case of multimodal data (data having more than one mode).
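
To compare the three fillna()-based strategies above side by side, here is a small sketch. The DataFrame, its column names, and the values (including the deliberate outlier) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "income" contains an outlier (500.0),
# and "color" has two equally frequent values (multimodal).
df = pd.DataFrame({
    "income": [30.0, 32.0, np.nan, 35.0, 500.0],
    "color": ["red", "blue", None, "red", "blue"],
})

# Mean imputation: pulled upwards by the outlier.
mean_filled = df["income"].fillna(df["income"].mean())

# Median imputation: much less affected by the outlier.
median_filled = df["income"].fillna(df["income"].median())

# Mode imputation: mode() returns all modes, so we must choose one.
modes = df["color"].mode()
mode_filled = df["color"].fillna(modes.iloc[0])

print(mean_filled[2], median_filled[2])
print(list(modes), mode_filled[2])
```

The mean fills the gap with a value far from the typical incomes, while the median stays close to them. For the multimodal column, taking the first mode is an arbitrary choice, which is exactly the difficulty mentioned above.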

3. Forward and Backward Fill
Replace missing values with the previous or next non-missing value in the same variable.
These fill methods are particularly useful when there is a logical sequence or order in the data (for example, a time series), and missing values can be reasonably assumed to follow a pattern. In pandas, forward fill is done with ffill() and backward fill with bfill(). Older code often passes method='ffill' or method='bfill' to fillna(), but that parameter is deprecated since pandas 2.1.

Forward Fill
It replaces missing values with the last observed non-missing value in the column.
Implementation:
forward_fill = df[column_name].ffill()
The result is stored in the variable forward_fill.

Backward Fill
It replaces missing values with the next observed non-missing value in the column.
Implementation:
backward_fill = df[column_name].bfill()
The result is stored in the variable backward_fill.
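
Here is a short sketch of both fills on an ordered series. The dates and values are hypothetical toy data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with gaps at the start, middle, and end.
s = pd.Series(
    [np.nan, 10.0, np.nan, np.nan, 13.0, np.nan],
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

forward_fill = s.ffill()   # carry the last observed value forward
backward_fill = s.bfill()  # pull the next observed value backward

print(pd.DataFrame({
    "original": s,
    "ffill": forward_fill,
    "bfill": backward_fill,
}))
```

Notice that forward fill cannot fill a gap at the very start of the series and backward fill cannot fill one at the very end, because there is no earlier or later value to copy from.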

There are two more techniques, which we will see in the next blog.
I hope you have understood how missing values are handled in a dataset. In the next blog we are going to take our discussion further. Till then stay connected and don't forget to follow me.
Thank you 💙
