Data cleaning is the process of bringing data into its desired state before it is ingested and used for insights and visualization. Data-driven decisions depend on the accuracy of the data being presented.
Some of the common data issues are:
- Missing data
- Incorrect data
- Outliers
- Duplication
- Irrelevant data
Potential techniques for fixing these issues:
- Missing data
  - Imputation (mean, median, mode)
  - Dropping rows/columns with excessive missing values
- Incorrect data
  - Validation against an external reference
  - Standardization of formats
  - Manual correction by a domain expert/reviewer
- Outliers: deleting or retaining based on domain review
- Duplication: fuzzy matching, use of unique identifiers
- Irrelevant data: feature importance scores or correlation analysis to identify and remove features with low variance or no contribution to the target variable
Pandas-based solutions

Step 1: Load your dataset

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Load the raw data (replace 'your_data.csv' with your file)
df = pd.read_csv('your_data.csv')
```
Step 2: Handle missing data

```python
# Impute missing values in numeric column 'A' with the column mean
imputer = SimpleImputer(strategy='mean')
df['A'] = imputer.fit_transform(df[['A']]).ravel()

# Keep only rows where at least half of the columns are non-missing
df = df.dropna(axis=0, thresh=df.shape[1] // 2)
```
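Mean imputation is only one of the strategies listed above; the median is more robust for skewed numeric columns, and the mode (most frequent value) works for categorical columns. A brief sketch, where 'category' is a placeholder column name:

```python
# Median imputation for a skewed numeric column (placeholder name 'A')
median_imputer = SimpleImputer(strategy='median')
df['A'] = median_imputer.fit_transform(df[['A']]).ravel()

# Mode (most frequent) imputation for a categorical column (placeholder name 'category')
mode_imputer = SimpleImputer(strategy='most_frequent')
df['category'] = mode_imputer.fit_transform(df[['category']]).ravel()
```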
Step 3: Handle incorrect data (standardize formats)

```python
# Standardize a text column (placeholder name 'category'): trim whitespace, normalize case
df['category'] = df['category'].astype(str).str.strip().str.lower()

# Coerce dates to a consistent datetime type; unparseable values become NaT
df['date'] = pd.to_datetime(df['date'], errors='coerce')
```
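The techniques list also mentions validating against an external reference. A minimal sketch, assuming a hypothetical 'country' column and an illustrative reference set of allowed codes:

```python
# Hypothetical reference set of valid country codes (assumption for illustration)
valid_countries = {'US', 'GB', 'DE', 'IN'}

# Flag rows whose 'country' value is not in the reference set
invalid_mask = ~df['country'].isin(valid_countries)
print(f"{invalid_mask.sum()} rows have an unrecognized country code")

# One option: set invalid values to missing so they can be imputed or reviewed later
df.loc[invalid_mask, 'country'] = pd.NA
```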
Step 4: Handle outliers (IQR method)

```python
# Compute the interquartile range for column 'A'
Q1 = df['A'].quantile(0.25)
Q3 = df['A'].quantile(0.75)
IQR = Q3 - Q1

# Keep only rows within 1.5 * IQR of the quartiles
df = df[(df['A'] >= (Q1 - 1.5 * IQR)) & (df['A'] <= (Q3 + 1.5 * IQR))]
```
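Deleting rows is not the only option. If domain review suggests the extreme values are genuine, a hedged alternative to the row filter above is to cap (winsorize) them at the same IQR fences instead:

```python
# Alternative to dropping rows: cap values at the IQR fences computed above
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
df['A'] = df['A'].clip(lower=lower, upper=upper)
```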
Step 5: Remove duplicates

```python
# Drop rows that are exact duplicates across all columns
df = df.drop_duplicates()
```
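drop_duplicates only removes exact duplicates. For the fuzzy matching mentioned in the list above, here is a minimal sketch using the standard-library difflib on a hypothetical 'name' column; the pairwise loop is O(n^2), so a dedicated record-linkage library would be preferable on large data:

```python
from difflib import SequenceMatcher

# Compare each pair of names and report pairs above a similarity threshold
names = df['name'].astype(str).tolist()
threshold = 0.9  # illustrative cutoff

for i in range(len(names)):
    for j in range(i + 1, len(names)):
        similarity = SequenceMatcher(None, names[i], names[j]).ratio()
        if similarity >= threshold:
            print(f"Possible duplicates: {names[i]!r} ~ {names[j]!r} ({similarity:.2f})")
```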
Step 6: Remove irrelevant features (based on feature importance)

```python
# Use only numeric predictors here; the random forest cannot ingest raw strings/dates
X = df.drop(columns=['target']).select_dtypes(include='number')
y = df['target']

# Rank features by importance with a random forest
model = RandomForestClassifier()
model.fit(X, y)
feature_importances = pd.Series(model.feature_importances_, index=X.columns)

# Keep features whose importance exceeds a small threshold
important_features = feature_importances[feature_importances > 0.01].index
df = df[important_features.tolist() + ['target']]
```
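The techniques list also mentions correlation analysis and low-variance filtering. A sketch of both, as a complement to the importance-based filter; the 0.05 cutoff is illustrative, and the correlation check assumes the target is numeric or already label-encoded:

```python
from sklearn.feature_selection import VarianceThreshold

numeric_cols = df.drop(columns=['target']).select_dtypes(include='number')

# Flag features whose variance is (near) zero, i.e. almost constant columns
selector = VarianceThreshold(threshold=0.0)
selector.fit(numeric_cols)
low_variance = numeric_cols.columns[~selector.get_support()]
print("Near-constant features:", list(low_variance))

# Flag features with negligible correlation to the target (illustrative cutoff of 0.05)
correlations = numeric_cols.corrwith(df['target']).abs()
uncorrelated = correlations[correlations < 0.05].index
print("Features barely correlated with the target:", list(uncorrelated))

# Drop both groups of uninformative features
df = df.drop(columns=low_variance.union(uncorrelated))
```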