DEV Community

Jospin Ndagano
Basics of Preprocessing in Machine Learning and Deep Learning

You’re testing your model on the validation set and the results look great. But when you try it on new data, performance drops badly. After hours of searching, you find the problem: duplicate samples in your data and mishandled categorical features. These are common mistakes when preprocessing is done poorly. And you’re not alone: many practitioners in machine learning say “dirty data” is their biggest challenge.

So, what is preprocessing? It simply means preparing your raw data: fixing missing values, normalizing numbers, removing duplicates, and encoding categories so the model can learn properly.

Did you know that more than 80% of the time in AI projects is spent on data preparation, not on building models?

In this article, I’ll show you the key steps for good preprocessing, to save you from these problems and help your models perform better.

Different types of data require different preprocessing methods:

  • Numerical Data: Continuous values like age, salary, or temperature. Scaling and normalization are often necessary.
from sklearn.preprocessing import MinMaxScaler
X = [[20],[50],[80]]
print(MinMaxScaler().fit_transform(X))
  • Categorical Data: Discrete values like gender, country, or product category. Encoding is required.
from sklearn.preprocessing import LabelEncoder
labels = ["male","female","male"]
print(LabelEncoder().fit_transform(labels))

  • Text Data: Articles, reviews, or tweets. Requires tokenization, cleaning, and vectorization.
from sklearn.feature_extraction.text import CountVectorizer
texts = ["AI is cool","AI is powerful"]
print(CountVectorizer().fit_transform(texts).toarray())
  • Images and Audio: Pixels, spectrograms, or raw audio signals. Need resizing, normalization, and augmentation.
import numpy as np
img = np.array([[0,255],[128,64]])
print(img/255.0)  # normalize to [0,1]

Common Preprocessing Steps

a) Data Cleaning

  • Handling missing values (imputation or removal).
import pandas as pd
df = pd.DataFrame({"age":[20, None, 30]})
df["age"] = df["age"].fillna(df["age"].mean())  # imputation
print(df)
  • Detecting and removing outliers.
import numpy as np
data = np.array([10,12,11,500])
clean = data[data < 100]  # crude fixed-threshold filter
print(clean)
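A fixed cutoff like `data < 100` is rarely robust in practice. A common alternative is the interquartile-range (IQR) rule, sketched here on the same toy array:

```python
import numpy as np

data = np.array([10, 12, 11, 500])

# IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
clean = data[(data >= lower) & (data <= upper)]
print(clean)  # the extreme value 500 is dropped
```

The 1.5 multiplier is a conventional default; widen it if your data legitimately has heavy tails.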
  • Correcting errors and inconsistencies.
df = pd.DataFrame({"gender":["Male","male","Female"]})
df["gender"] = df["gender"].str.lower()
print(df)

b) Data Transformation

  • Normalization and Standardization: Ensures features are on the same scale.
from sklearn.preprocessing import StandardScaler
X = [[20],[50],[80]]
print(StandardScaler().fit_transform(X))  # zero mean, unit variance
  • Encoding Categorical Variables: One-hot, label encoding, or embedding techniques.
from sklearn.preprocessing import OneHotEncoder
X = [["red"],["blue"],["red"]]
print(OneHotEncoder().fit_transform(X).toarray())
  • Date and Time Features: Extracting useful components like day, month, year, or weekday.
dates = pd.Series(pd.to_datetime(["2025-08-25","2025-08-26"]))
print(dates.dt.day, dates.dt.weekday)  # day of month, day of week (Mon=0)
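The transformation steps above usually run together. A minimal sketch using scikit-learn's ColumnTransformer, assuming a toy frame with one numeric and one categorical column (the column names are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({"age": [20, 50, 80], "color": ["red", "blue", "red"]})

# Apply the right transformation to each column type in one step
ct = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),   # scale numeric columns
    ("cat", OneHotEncoder(), ["color"]),  # one-hot encode categoricals
])
result = ct.fit_transform(df)
print(result.shape)  # 3 rows, 1 scaled column + 2 one-hot columns
```

Bundling transforms this way also makes it easy to reuse the exact same preprocessing on new data via `ct.transform`.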

c) Feature Engineering

  • Creating new meaningful features.
df = pd.DataFrame({"length":[2,4],"width":[3,5]})
df["area"] = df["length"] * df["width"]
print(df)
  • Selecting the most important features.
from sklearn.feature_selection import SelectKBest, f_classif
import numpy as np
X = np.array([[1,10,20],[2,20,30],[3,30,40]])
y = [0,1,0]
print(SelectKBest(f_classif, k=2).fit_transform(X,y))
  • Dimensionality reduction techniques like PCA or t-SNE.
from sklearn.decomposition import PCA
X = [[1,2,3],[4,5,6],[7,8,9]]
print(PCA(2).fit_transform(X))
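t-SNE, mentioned above, works similarly but is nonlinear and mainly used for visualization. A sketch on random toy data (note that perplexity must be smaller than the number of samples):

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(20, 5)  # 20 samples, 5 features

# Project to 2D; perplexity must be < number of samples
X_2d = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(X)
print(X_2d.shape)  # (20, 2)
```

Unlike PCA, t-SNE has no `transform` for unseen data, so it is a visualization tool rather than a reusable preprocessing step.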

d) Deep Learning Specific Preprocessing

  • Images: Resizing, normalization, data augmentation (flip, rotate, crop).
import numpy as np
img = np.random.randint(0, 256, (64, 64))  # random 8-bit grayscale image
img_norm = img / 255.0  # normalize to [0,1]
print(img_norm.shape)
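The flip/rotate/crop augmentations mentioned above can be sketched with plain NumPy (real pipelines typically use libraries like torchvision or albumentations):

```python
import numpy as np

img = np.random.randint(0, 256, (64, 64))  # random 8-bit grayscale image

# Simple augmentations with pure NumPy
flipped = np.fliplr(img)       # horizontal flip
rotated = np.rot90(img)        # 90-degree rotation
cropped = img[8:56, 8:56]      # 48x48 center crop
print(flipped.shape, rotated.shape, cropped.shape)
```

Each augmented copy is a "new" training example, which helps the model generalize without collecting more data.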
  • Text: Tokenization, stemming/lemmatization, converting words to embeddings.
sentence = "AI is powerful"
tokens = sentence.lower().split()
vocab = {word: idx + 1 for idx, word in enumerate(tokens)}  # 0 reserved for padding
print(vocab)
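The word-to-id mapping can then feed an embedding lookup. Here a toy random table stands in for learned embeddings such as word2vec or GloVe:

```python
import numpy as np

sentence = "AI is powerful"
tokens = sentence.lower().split()
vocab = {word: idx + 1 for idx, word in enumerate(tokens)}  # 0 reserved for padding

# Toy embedding table: one random 4-dimensional vector per vocabulary id
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab) + 1, 4))  # row 0 = padding
ids = [vocab[t] for t in tokens]
print(embeddings[ids].shape)  # one vector per token
```

In a real model the table is learned during training (or loaded from pretrained vectors); the lookup itself is just this integer indexing.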
  • Audio: Converting to spectrograms or MFCCs, normalizing amplitudes.
import librosa
y, sr = librosa.load(librosa.example('trumpet'))
mfccs = librosa.feature.mfcc(y=y, sr=sr)
print(mfccs.shape)
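Amplitude normalization itself needs no extra library. A minimal peak-normalization sketch on a toy signal:

```python
import numpy as np

signal = np.array([0.1, -0.5, 0.8, -0.2])

# Peak normalization: scale so the maximum absolute amplitude is 1.0
normalized = signal / np.max(np.abs(signal))
print(normalized)
```

This keeps recordings captured at different volumes on a comparable scale before feature extraction.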

Overall, strong models start with clean data. By focusing on preprocessing, you unlock better accuracy, faster training, and reliable results.

References

TechTarget: What is data preprocessing? Key steps and techniques

Precisely: 2025 Planning Insights: Data Quality Remains the Top Data Integrity Challenge and Priority
