You’re testing your model on the validation set and the results look great. But when you try it on new data, performance drops badly. After hours of searching, you find the problem: duplicate samples in your data and mishandled categorical features. These are common mistakes when preprocessing is done poorly. And you’re not alone: many machine learning practitioners say “dirty data” is their biggest challenge.
So, what is preprocessing? It simply means preparing your raw data: fixing missing values, normalizing numbers, removing duplicates, and encoding categories so the model can learn properly.
Did you know that more than 80% of the time in AI projects is spent on data preparation, not on building models?
In this article, I’ll show you the key steps for good preprocessing, to save you from these problems and help your models perform better.
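Concretely, those four steps might look like this on a toy pandas DataFrame (the data and column names here are invented purely for illustration):

```python
import pandas as pd

# Toy data with a missing value, a duplicate row, and a categorical column
df = pd.DataFrame({
    "age": [25, None, 30, 30],
    "color": ["red", "blue", "red", "red"],
})

df = df.drop_duplicates()                       # remove duplicate samples
df["age"] = df["age"].fillna(df["age"].mean())  # fix missing values
df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())  # normalize
df = pd.get_dummies(df, columns=["color"])      # encode categories
print(df)
```

Each of these steps is covered in more detail below.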
Different types of data require different preprocessing methods:
- Numerical Data: Continuous values like age, salary, or temperature. Scaling and normalization are often necessary.
from sklearn.preprocessing import MinMaxScaler
X = [[20],[50],[80]]
print(MinMaxScaler().fit_transform(X))
- Categorical Data: Discrete values like gender, country, or product category. Encoding is required.
from sklearn.preprocessing import LabelEncoder
labels = ["male","female","male"]
print(LabelEncoder().fit_transform(labels))
- Text Data: Articles, reviews, or tweets. Requires tokenization, cleaning, and vectorization.
from sklearn.feature_extraction.text import CountVectorizer
texts = ["AI is cool","AI is powerful"]
print(CountVectorizer().fit_transform(texts).toarray())
- Images and Audio: Pixels, spectrograms, or raw audio signals. They need resizing, normalization, and augmentation.
import numpy as np
img = np.array([[0,255],[128,64]])
print(img/255.0) # normalize to [0,1]
Common Preprocessing Steps
a) Data Cleaning
- Handling missing values (imputation or removal).
import pandas as pd
df = pd.DataFrame({"age":[20, None, 30]})
df["age"] = df["age"].fillna(df["age"].mean()) # imputation
print(df)
- Detecting and removing outliers.
import numpy as np
data = np.array([10,12,11,500])
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1  # interquartile range
clean = data[(data >= q1 - 1.5*iqr) & (data <= q3 + 1.5*iqr)]  # IQR rule
print(clean)
- Correcting errors and inconsistencies.
df = pd.DataFrame({"gender":["Male","male","Female"]})
df["gender"] = df["gender"].str.lower()
print(df)
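- Removing duplicate samples. Duplicates were the culprit in the opening story, so a quick sketch (toy data, invented for illustration):

```python
import pandas as pd

# Toy data with one exact duplicate row
df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [0.9, 0.8, 0.8, 0.7]})
df = df.drop_duplicates()  # keeps the first occurrence of each repeated row
print(df)
```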
b) Data Transformation
- Normalization and Standardization: Ensures features are on the same scale.
from sklearn.preprocessing import StandardScaler
X = [[1.0],[2.0],[3.0]]
print(StandardScaler().fit_transform(X)) # zero mean, unit variance
- Encoding Categorical Variables: One-hot, label encoding, or embedding techniques.
from sklearn.preprocessing import OneHotEncoder
X = [["red"],["blue"],["red"]]
print(OneHotEncoder().fit_transform(X).toarray())
- Date and Time Features: Extracting useful components like day, month, year, or weekday.
dates = pd.Series(pd.to_datetime(["2025-08-25","2025-08-26"]))
print(dates.dt.day, dates.dt.weekday)
c) Feature Engineering
- Creating new meaningful features.
df = pd.DataFrame({"length":[2,4],"width":[3,5]})
df["area"] = df["length"] * df["width"]
print(df)
- Selecting the most important features.
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np
X = np.array([[1,10,20],[2,20,30],[3,30,40]])
y = [0,1,0]
print(SelectKBest(f_classif, k=2).fit_transform(X,y))
- Dimensionality reduction techniques like PCA or t-SNE.
from sklearn.decomposition import PCA
X = [[1,2,3],[4,5,6],[7,8,9]]
print(PCA(2).fit_transform(X))
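t-SNE is mentioned above but not shown; in scikit-learn it exposes the same fit_transform interface. A minimal sketch on random toy data (only the output shape is checked, since t-SNE coordinates vary from run to run):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(20, 5)  # 20 samples, 5 features (random toy data)
X_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(X_2d.shape)  # (20, 2)
```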
d) Deep Learning Specific Preprocessing
- Images: Resizing, normalization, data augmentation (flip, rotate, crop).
import numpy as np
img = np.random.randint(0,255,(64,64))
img_norm = img/255.0 # normalize to [0,1]
img_flip = np.fliplr(img_norm) # horizontal-flip augmentation
print(img_norm.shape, img_flip.shape)
- Text: Tokenization, stemming/lemmatization, converting words to embeddings.
sentence = "AI is powerful"
tokens = sentence.lower().split()
vocab = {word: idx+1 for idx, word in enumerate(tokens)}
print(vocab)
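The snippet above stops at a vocabulary; the final step, mapping words to embedding vectors, can be sketched with a random lookup table (real systems use trained embeddings such as word2vec or GloVe, and the 4-dimensional size here is an arbitrary choice for illustration):

```python
import numpy as np

tokens = "AI is powerful".lower().split()
vocab = {word: idx for idx, word in enumerate(tokens)}

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))  # one 4-d vector per word

# Look up the vector for each word in the sentence
sentence_vectors = np.array([embeddings[vocab[w]] for w in tokens])
print(sentence_vectors.shape)  # (3, 4)
```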
- Audio: Converting to spectrograms or MFCCs, normalizing amplitudes.
import librosa
y, sr = librosa.load(librosa.example('trumpet'))
mfccs = librosa.feature.mfcc(y=y, sr=sr)
print(mfccs.shape)
Overall, strong models start with clean data. By focusing on preprocessing, you unlock better accuracy, faster training, and reliable results.