Suppose a dataset has variables with more than 30% missing values. How would you deal with it?

#discuss #data #datascience #beginners

If a variable has more than 30% missing values, I treat it carefully because that much missing information can weaken the model. First, I try to understand the cause: is the missingness random, system-driven, or tied to some pattern in the data? Knowing this helps me decide whether the feature is still useful.
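As a rough sketch of that first check, the snippet below measures the share of missing values per column and then looks at whether missingness in one column lines up with another variable. The DataFrame, its column names (`age`, `plan`, `city`), and the `THRESHOLD` constant are made up for illustration; only the 30% cutoff comes from the question.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "age": [25, np.nan, 40, np.nan, 31, 58, np.nan, 45, 29, np.nan],
    "plan": ["free", "free", "pro", "free", "pro", "pro", "free", "pro", "pro", "free"],
    "city": ["A", "B", None, "A", "B", "A", "B", "A", "B", "A"],
})

THRESHOLD = 0.30  # the 30% cutoff from the question

# Fraction of missing values in each column.
missing_share = df.isna().mean()

# Columns whose missing share exceeds the cutoff.
high_missing = missing_share[missing_share > THRESHOLD].index.tolist()

# Does missingness in "age" depend on "plan"? A large gap between groups
# suggests the values are not missing at random.
age_missing_by_plan = df["age"].isna().groupby(df["plan"]).mean()

print(missing_share)
print(high_missing)
print(age_missing_by_plan)
```

In this toy data, `age` is missing only for rows on the `free` plan, which is the kind of pattern that would make me think twice before dropping or blindly imputing the column.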

If the column adds little value, the safest choice is to drop it. For important variables, I compare different imputation methods. Numeric fields can use the median, or interpolation when the rows have a natural order (such as a time series), while categorical fields can use the mode. If the feature is valuable but tricky, I may use advanced methods like KNN imputation or model-based imputation to estimate missing values.
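The options above could look something like this with pandas and scikit-learn. This is a minimal sketch on invented data: the column names, the decision to drop `notes`, and `n_neighbors=2` are all assumptions for the example, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "income": [30.0, np.nan, 50.0, np.nan, 70.0, 90.0],
    "tenure": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "segment": ["a", None, "b", "a", None, "a"],
    "notes": [None, None, None, "x", None, None],
})

# 1) Drop a column judged to add little value (assumed here to be "notes").
df = df.drop(columns=["notes"])

# 2) Median imputation for a numeric column.
income_median = df["income"].median()
df["income_median"] = df["income"].fillna(income_median)

# 3) Linear interpolation; only sensible if the row order is meaningful.
tenure_interp = df["tenure"].interpolate()

# 4) Mode imputation for a categorical column.
segment_mode = df["segment"].mode().iloc[0]
df["segment_mode"] = df["segment"].fillna(segment_mode)

# 5) KNN imputation: fill each gap from the most similar rows.
knn = KNNImputer(n_neighbors=2)  # neighbor count chosen arbitrarily
knn_values = knn.fit_transform(df[["income", "tenure"]])
df_knn = pd.DataFrame(knn_values, columns=["income", "tenure"], index=df.index)

print(df)
print(df_knn)
```

In a real project, the imputer should be fitted on the training split only and then applied to validation and test data, so that statistics from held-out rows do not leak into the model.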

Sometimes missing values are meaningful on their own, for example, “not filled because user didn’t use the feature.” In those cases, I keep the column and also create a separate flag like “is_missing” to capture that information.
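A small sketch of the flag idea, assuming a hypothetical `minutes_used` column where a blank means the user never touched the feature. Filling with `0.0` is only reasonable under that assumption; in other cases a median or another fill value may fit better.

```python
import numpy as np
import pandas as pd

# Hypothetical column: NaN is assumed to mean "user never used the feature".
df = pd.DataFrame({"minutes_used": [12.0, np.nan, 7.5, np.nan, 3.0]})

# Record the missingness before filling, so the information is not lost.
df["minutes_used_is_missing"] = df["minutes_used"].isna().astype(int)

# Fill the original column; 0.0 matches the "never used" interpretation.
df["minutes_used"] = df["minutes_used"].fillna(0.0)

print(df)
```

The flag has to be created before the fill, otherwise the filled zeros are indistinguishable from real ones. scikit-learn's `SimpleImputer` also offers an `add_indicator=True` option that appends such indicator columns as part of a pipeline.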

The goal is to keep the dataset balanced, clean, and meaningful without forcing incomplete or low-quality data into the modeling process.

DEV Community
