Imagine you're learning to tell dog breeds apart, but your teacher occasionally gives you the wrong information. They sometimes call a Labrador a Golden Retriever, or a Husky a Malamute. When this keeps happening, you'll start doubting yourself, or worse, learn the wrong things altogether.
This is exactly what happens when you train machine learning models on noisy labels: labels in the data that are simply wrong. The model gets confused, learns incorrect patterns, and makes poor predictions.
So, how do you make a model smart enough to handle these errors? That's what we will explore in this article.
You can find the code snippets I've used here in my Colab notebook: Colab Notebook
What Are Noisy Labels?
A label is the correct answer for a data point. So, if you have a dataset of pictures of cats and dogs, each picture will have a label of "cat" or "dog."
But sometimes, labels are wrong. This can happen because:
- Humans make errors: Someone manually labeled a picture of a Husky as a Wolf.
- Data can be unclear: Some flowers are nearly identical to each other.
- Automatic labeling goes wrong: A weak system can incorrectly classify objects.
These kinds of labeling errors are what we call noisy labels. And if you train a model with too much noise, it may end up memorizing the mistakes instead of learning the correct patterns.
Let’s Create a Noisy Dataset in Python
First, let’s generate a clean dataset, then introduce some noise.
Step 1: Generate a Clean Dataset
We’ll create a simple dataset with two classes (0 and 1) using sklearn.datasets.make_classification.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create a classification dataset with 1000 samples (data points) and 2 features (columns)
# n_informative=2 means the two features are useful for the classification task
# n_redundant=0 means no extra, redundant features are added
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="coolwarm", alpha=0.7)
plt.title("Clean Dataset")
plt.show()
Step 2: Add Noisy Labels
Now, we introduce 20% label noise by flipping some labels randomly.
def add_label_noise(y, noise_rate=0.2):
    np.random.seed(42)
    # Pick a random 20% (by default) of the labels to corrupt
    num_noisy = int(len(y) * noise_rate)
    noisy_indices = np.random.choice(len(y), num_noisy, replace=False)
    y_noisy = y.copy()
    y_noisy[noisy_indices] = 1 - y_noisy[noisy_indices]  # flip the labels :) (works because labels are 0/1)
    return y_noisy
# Introduce noise into labels
y_train_noisy = add_label_noise(y_train, noise_rate=0.2)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train_noisy, cmap="coolwarm", alpha=0.7)
plt.title("Dataset with Noisy Labels (20% incorrect)")
plt.show()
🔴 Notice the difference? Some red points are mixed into the blue area and vice versa. That’s the noise!
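As a quick sanity check (optional, this just confirms we got the noise rate we asked for):
num_flipped = (y_train != y_train_noisy).sum()
print(f"Flipped {num_flipped} of {len(y_train)} training labels ({num_flipped / len(y_train):.0%})")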
Why Are Noisy Labels Bad?
To understand why it can be a problem, let’s train a Random Forest model on both clean and noisy datasets to compare how noise affects accuracy.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Train on the clean labels
clf_clean = RandomForestClassifier(random_state=42)
clf_clean.fit(X_train, y_train)
acc_clean = accuracy_score(y_test, clf_clean.predict(X_test))
# Train on noisy labels
clf_noisy = RandomForestClassifier(random_state=42)
clf_noisy.fit(X_train, y_train_noisy)
acc_noisy = accuracy_score(y_test, clf_noisy.predict(X_test))
print(f"Accuracy with Clean Labels: {acc_clean * 100:.2f}%")
print(f"Accuracy with Noisy Labels: {acc_noisy * 100:.2f}%")
Accuracy with Clean Labels: 93.00%
Accuracy with Noisy Labels: 86.00% 😧
📉 On clean labels, the model was 93% accurate; once noisy labels were introduced, accuracy dropped to 86%. This shows how much incorrect data can confuse a model during training.
A 7-point drop may not seem dramatic, but a moderate proportion of noisy data (20% in this case) is enough to cause a clear degradation in performance. This is because the model starts memorizing the incorrect labels instead of learning the real patterns.
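To see that memorization in action, compare the noisy model's accuracy on the labels it was trained on with its accuracy on the true labels for the same points. This reuses the clf_noisy model trained above; the exact numbers will vary, but the gap is the point.
# How well does the noisy model fit the (partly wrong) labels it was trained on...
train_acc_noisy = accuracy_score(y_train_noisy, clf_noisy.predict(X_train))
# ...versus the true labels for the same training points?
train_acc_true = accuracy_score(y_train, clf_noisy.predict(X_train))
print(f"Accuracy on noisy training labels: {train_acc_noisy * 100:.2f}%")
print(f"Accuracy on true training labels:  {train_acc_true * 100:.2f}%")
If the first number is much higher than the second, the model has memorized the flipped labels rather than the underlying pattern.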
How to Handle Noisy Labels?
Method 1: Use a Smarter Loss Function (MAE Instead of Cross-Entropy)
As per my research (Google and ChatGPT), the most common loss function for classification is cross-entropy, but it reacts very strongly to confidently wrong predictions, so mislabeled examples can dominate training. A better option here is Mean Absolute Error (MAE), whose per-sample loss is bounded, making it more resistant to noise.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(16, activation="relu", input_shape=(2,)),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid")  # outputs a probability for class 1
])
# MAE is bounded per sample, so a single mislabeled point can't dominate the gradient
model.compile(optimizer="adam", loss="mean_absolute_error", metrics=["accuracy"])
model.fit(X_train, y_train_noisy, epochs=20, batch_size=16, verbose=0, validation_data=(X_test, y_test))
_, acc = model.evaluate(X_test, y_test)
print(f"Accuracy with MAE loss: {acc:.4f}")
Accuracy with MAE loss: 0.9150
MAE gives less importance to extreme errors, making the model less likely to overfit to noisy labels.
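To make this concrete, here is a tiny comparison of the two losses on a single confidently wrong prediction. The formulas are the standard binary cross-entropy and absolute error; the specific probability (0.01) is just an example I picked.
import numpy as np
def binary_cross_entropy(y_true, p):
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
def mean_abs_error(y_true, p):
    return abs(y_true - p)
# True label is 1, but the model is confidently wrong (p = 0.01)
print(binary_cross_entropy(1, 0.01))  # ~4.61 -> the penalty grows without bound
print(mean_abs_error(1, 0.01))        # 0.99  -> capped at 1, so one bad label can't dominate
Cross-entropy's penalty explodes as a prediction gets more confidently wrong, which is exactly what happens on mislabeled points; MAE stays bounded.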
Method 2: Co-Teaching (Two Models Teach Each Other)
Instead of using one model, we can train two models and let them filter out the noisy data for each other. The snippet below is a heavily simplified illustration: because the dataset is synthetic, we happen to know which labels were flipped, and we use that knowledge to mark "trusted" samples (a sketch that doesn't need the clean labels follows after it).
from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression()
model2 = LogisticRegression()
# Identify "trusted" samples (least noisy)
# NOTE: this compares against the original clean labels (y_train), which we only have
# because the dataset is synthetic; real co-teaching estimates trust from per-sample loss
trusted_samples = np.abs(y_train - y_train_noisy) < 0.5
model1.fit(X_train[trusted_samples], y_train_noisy[trusted_samples])
model2.fit(X_train[trusted_samples], y_train_noisy[trusted_samples])
# Combine predictions
preds1 = model1.predict(X_test)
preds2 = model2.predict(X_test)
final_preds = (preds1 + preds2) // 2  # with only two models, this predicts 1 only when both agree
acc_coteach = accuracy_score(y_test, final_preds)
print(f"Accuracy with Co-Teaching: {acc_coteach:.4f}")
Both models train only on the "trusted" data, which keeps them from memorizing the noise.
Accuracy with Co-Teaching: 0.9000
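For completeness, here is a minimal sketch of the actual co-teaching idea, which does not rely on knowing the clean labels: each model keeps only its smallest-loss samples (assuming we can roughly estimate the noise rate, here 20%) and hands them to the other model to train on. The helper name small_loss_indices, the keep_fraction value, and the single training round are my simplifications for illustration, not part of the original code.
def small_loss_indices(model, X, y, keep_fraction=0.8):
    # Per-sample cross-entropy; mislabeled points tend to have larger loss
    proba = np.clip(model.predict_proba(X)[:, 1], 1e-7, 1 - 1e-7)
    losses = -(y * np.log(proba) + (1 - y) * np.log(1 - proba))
    return np.argsort(losses)[: int(len(y) * keep_fraction)]
# Warm up both models on all (noisy) labels
# (in practice the two models are made different, e.g. via different architectures or data order)
m1 = LogisticRegression().fit(X_train, y_train_noisy)
m2 = LogisticRegression().fit(X_train, y_train_noisy)
# Each model picks its small-loss samples...
idx1 = small_loss_indices(m1, X_train, y_train_noisy)
idx2 = small_loss_indices(m2, X_train, y_train_noisy)
# ...and teaches the other model with them (one round, for illustration)
m1 = LogisticRegression().fit(X_train[idx2], y_train_noisy[idx2])
m2 = LogisticRegression().fit(X_train[idx1], y_train_noisy[idx1])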
Method 3: Semi-Supervised Learning (Train on Clean Data First)
If we can identify some clean (trusted) samples, we can train a model on them first, then use that model to relabel the noisy training data.
# Train a model only on the trusted samples identified earlier
model_clean = RandomForestClassifier()
model_clean.fit(X_train[trusted_samples], y_train_noisy[trusted_samples])
# Use it to relabel the entire training set
y_train_fixed = model_clean.predict(X_train)
# Retrain a fresh model on the relabeled data
final_model = RandomForestClassifier()
final_model.fit(X_train, y_train_fixed)
acc_fixed = accuracy_score(y_test, final_model.predict(X_test))
print(f"Accuracy after Semi-Supervised Learning: {acc_fixed:.4f}")
The model learns correct patterns first, then refines its understanding with the full dataset.
Accuracy after Semi-Supervised Learning: 0.9050
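Since the dataset is synthetic and we still have the original clean labels, we can also check how much of the noise the relabeling step actually repaired (only possible here because we know y_train; the exact numbers will vary):
print(f"Labels matching the truth before fixing: {(y_train_noisy == y_train).mean() * 100:.1f}%")
print(f"Labels matching the truth after fixing:  {(y_train_fixed == y_train).mean() * 100:.1f}%")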
Conclusion
While noisy labels cause a noticeable drop in performance, techniques like a noise-tolerant loss (MAE), co-teaching-style filtering, and semi-supervised relabeling can help recover some of the lost accuracy, though nothing quite matches the performance of clean data.