soul-o mutwiri
DATA CLEANING: Common data issues and their solutions

Data cleaning is the process of detecting and fixing errors in a dataset so that it reaches its desired state before being ingested and used for insights and visualization. Data-driven decisions depend on the accuracy of the underlying data.

Some of the common data issues are:

  1. Missing data
  2. Incorrect data
  3. Outliers
  4. Duplication
  5. Irrelevant data

Potential techniques for fixing them:

  1. Missing data: imputation (mean, median, mode), or dropping rows/columns with excessive missing values
  2. Incorrect data: validation against an external reference, standardization of formats, or manual correction/review by a domain expert
  3. Outliers: deleting or retaining based on domain review
  4. Duplication: fuzzy matching, or use of unique identifiers
  5. Irrelevant data: feature importance scores or correlation analysis to identify and remove features with low variance or no contribution to the target variable

Pandas-based solutions

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from scipy import stats
```

Step 1: Load your dataset

```python
df = pd.read_csv('your_data.csv')
```

Step 2: Handle missing data

```python
# Impute missing values in a numeric column with the column mean
imputer = SimpleImputer(strategy='mean')
df[['A']] = imputer.fit_transform(df[['A']])

# Keep only rows that have at least half of their values non-missing
df = df.dropna(axis=0, thresh=df.shape[1] // 2)
```
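The mean strategy above only works for numeric columns. For categorical columns the usual choice is the mode, which SimpleImputer exposes as `strategy='most_frequent'`. A minimal sketch, assuming a hypothetical categorical column `'B'`:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry
df = pd.DataFrame({'B': ['red', 'blue', np.nan, 'red']})

# 'most_frequent' fills gaps with the column's mode ('red' here)
mode_imputer = SimpleImputer(strategy='most_frequent')
df[['B']] = mode_imputer.fit_transform(df[['B']])
```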

Step 3: Handle incorrect data (standardize formats)

```python
# Cast a column to a consistent type (keep numeric columns numeric
# if later steps compute statistics on them)
df['A'] = df['A'].astype(str)
df['date'] = pd.to_datetime(df['date'], errors='coerce')  # unparseable dates become NaT
```
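Standardizing formats is only one half of handling incorrect data; the techniques list also mentions validating against an external reference. A minimal sketch, with a hypothetical `'country'` column and a hand-made set of valid codes standing in for a real reference table:

```python
import pandas as pd

# Hypothetical data with one invalid country code ('XX')
df = pd.DataFrame({'country': ['KE', 'UG', 'XX', 'TZ']})

# In practice this set would come from an external reference (e.g. ISO 3166)
valid_codes = {'KE', 'UG', 'TZ'}

# Flag rows that fail validation, then drop them or route them for manual review
invalid = ~df['country'].isin(valid_codes)
df = df[~invalid]
```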

Step 4: Handle outliers (IQR method)

```python
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['A'] >= Q1 - 1.5 * IQR) & (df['A'] <= Q3 + 1.5 * IQR)]
```
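The `from scipy import stats` import above suggests the common alternative to the IQR rule: drop rows whose value lies more than 3 standard deviations from the mean. A minimal sketch on a hypothetical numeric column:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical numeric column where 500 is an obvious outlier
df = pd.DataFrame({'A': [10] * 10 + [12] * 9 + [500]})

# Keep rows whose |z-score| is below 3
z = np.abs(stats.zscore(df['A']))
df = df[z < 3]
```

Note that on very small samples a single outlier mathematically cannot exceed |z| = 3, so the z-score rule is only meaningful once there are enough rows.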

Step 5: Remove duplicates

```python
df = df.drop_duplicates()
```
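`drop_duplicates` only catches exact matches; the techniques list above also mentions fuzzy matching for near-duplicates such as misspelled names. A minimal sketch using the standard library's difflib, with a hypothetical name column and an arbitrary 0.9 similarity cutoff:

```python
import pandas as pd
from difflib import SequenceMatcher

# Hypothetical records where 'Jon Smith' is a near-duplicate of 'John Smith'
df = pd.DataFrame({'name': ['John Smith', 'Jane Doe', 'Jon Smith']})

def is_near_duplicate(a, b, threshold=0.9):
    # ratio() is 1.0 for identical strings and drops as they diverge
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Keep the first occurrence of each fuzzy group (O(n^2); fine for small data)
keep = []
for i, name in enumerate(df['name']):
    if not any(is_near_duplicate(name, df['name'].iloc[j]) for j in keep):
        keep.append(i)
df = df.iloc[keep]
```

For large datasets a dedicated record-linkage approach (blocking plus pairwise comparison) scales much better than this all-pairs loop.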

Step 6: Remove irrelevant features (based on feature importance)

```python
# Note: RandomForestClassifier requires numeric, non-missing features
X = df.drop(columns=['target'])
y = df['target']
model = RandomForestClassifier()
model.fit(X, y)
feature_importances = pd.Series(model.feature_importances_, index=X.columns)
important_features = feature_importances[feature_importances > 0.01].index
df = df[important_features.tolist() + ['target']]
```
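Feature importances are one option; the techniques list also mentions correlation analysis. A minimal sketch that drops hypothetical numeric features whose absolute correlation with the target falls below an arbitrary 0.1 cutoff:

```python
import pandas as pd

# Hypothetical dataset: 'a' tracks the target, 'noise' does not
df = pd.DataFrame({
    'a':      [1, 2, 3, 4, 5, 6],
    'noise':  [5, 1, 4, 2, 6, 3],
    'target': [2, 4, 6, 8, 10, 12],
})

# Absolute correlation of each feature with the target
corr = df.corr()['target'].drop('target').abs()
selected = corr[corr >= 0.1].index
df = df[selected.tolist() + ['target']]
```

Correlation only captures linear relationships, so it is best used alongside (not instead of) model-based importance scores.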

Now df is cleaned and ready for analysis or modeling.
