Imagine you're learning to tell dog breeds apart, but your teacher occasionally gives you the wrong information. They sometimes call a Labrador a Golden Retriever, or a Husky a Malamute. When this keeps happening, you'll start doubting yourself, or worse, learn the wrong things altogether.
This is exactly what happens when you train machine learning models on noisy labels: labels in the data that are simply wrong. The model gets confused, learns incorrect patterns, and makes poor predictions.
So, how do you make a model smart enough to handle these errors? That's what we will explore in this article.
You can find the code snippets I've used here in my Colab notebook: Colab Notebook
What Are Noisy Labels?
A label is the correct answer for a data point. So, if you have a dataset of pictures of cats and dogs, each picture will have a label of "cat" or "dog."
But sometimes, labels are wrong. This can happen because:
- Humans make errors: Someone manually labeled a picture of a Husky as a Wolf.
- Data can be unclear: Some flowers are nearly identical to each other.
- Automatic labeling goes wrong: A weak system can incorrectly classify objects.
These kinds of labeling errors are what we call noisy labels. And if you train a model with too much noise, it may end up memorizing the mistakes instead of learning the correct patterns.
Let’s Create a Noisy Dataset in Python
First, let’s generate a clean dataset, then introduce some noise.
Step 1: Generate a Clean Dataset
We’ll create a simple dataset with two classes (0 and 1) using sklearn.datasets.make_classification.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Create a classification dataset with 1000 samples (data points) and 2 features (columns)
# n_informative=2 means the two features are useful for the classification task
# n_redundant=0 means no extra, redundant features are added
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap="coolwarm", alpha=0.7)
plt.title("Clean Dataset")
plt.show()
Step 2: Add Noisy Labels
Now, we introduce 20% label noise by flipping some labels randomly.
def add_label_noise(y, noise_rate=0.2):
    np.random.seed(42)
    # Pick a random 20% (by default) of the labels to corrupt
    num_noisy = int(len(y) * noise_rate)
    noisy_indices = np.random.choice(len(y), num_noisy, replace=False)
    y_noisy = y.copy()
    y_noisy[noisy_indices] = 1 - y_noisy[noisy_indices]  # flip the labels :) (works because labels are 0/1)
    return y_noisy
# Introduce noise into labels
y_train_noisy = add_label_noise(y_train, noise_rate=0.2)
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train_noisy, cmap="coolwarm", alpha=0.7)
plt.title("Dataset with Noisy Labels (20% incorrect)")
plt.show()
🔴 Notice the difference? Some red points are mixed into the blue area and vice versa. That’s the noise!
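As a quick sanity check (optional, this just confirms we got the noise rate we asked for):
num_flipped = (y_train != y_train_noisy).sum()
print(f"Flipped {num_flipped} of {len(y_train)} training labels ({num_flipped / len(y_train):.0%})")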
Why Are Noisy Labels Bad?
To understand why it can be a problem, let’s train a Random Forest model on both clean and noisy datasets to compare how noise affects accuracy.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Train on the clean labels
clf_clean = RandomForestClassifier(random_state=42)
clf_clean.fit(X_train, y_train)
acc_clean = accuracy_score(y_test, clf_clean.predict(X_test))
# Train on noisy labels
clf_noisy = RandomForestClassifier(random_state=42)
clf_noisy.fit(X_train, y_train_noisy)
acc_noisy = accuracy_score(y_test, clf_noisy.predict(X_test))
print(f"Accuracy with Clean Labels: {acc_clean * 100:.2f}%")
print(f"Accuracy with Noisy Labels: {acc_noisy * 100:.2f}%")
Accuracy with Clean Labels: 93.00%
Accuracy with Noisy Labels: 86.00% 😧
📉 On clean labels, the model was 93% accurate; once noisy labels were introduced, accuracy dropped to 86%. This shows how much incorrect data can confuse a model during training.
A 7-point drop may not seem dramatic, but a moderate proportion of noisy data (20% in this case) is enough to cause a clear degradation in performance. This is because the model starts memorizing the incorrect labels instead of learning the real patterns.
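To see that memorization in action, compare the noisy model's accuracy on the labels it was trained on with its accuracy on the true labels for the same points. This reuses the clf_noisy model trained above; the exact numbers will vary, but the gap is the point.
# How well does the noisy model fit the (partly wrong) labels it was trained on...
train_acc_noisy = accuracy_score(y_train_noisy, clf_noisy.predict(X_train))
# ...versus the true labels for the same training points?
train_acc_true = accuracy_score(y_train, clf_noisy.predict(X_train))
print(f"Accuracy on noisy training labels: {train_acc_noisy * 100:.2f}%")
print(f"Accuracy on true training labels:  {train_acc_true * 100:.2f}%")
If the first number is much higher than the second, the model has memorized the flipped labels rather than the underlying pattern.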
How to Handle Noisy Labels?
Method 1: Use a Smarter Loss Function (MAE Instead of Cross-Entropy)
As per my research (Google and ChatGPT), the most common loss function for classification is cross-entropy, but it reacts very strongly to confidently wrong predictions, so mislabeled examples can dominate training. A better option here is Mean Absolute Error (MAE), whose per-sample loss is bounded, making it more resistant to noise.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
model = Sequential([
    Dense(16, activation="relu", input_shape=(2,)),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid")  # outputs a probability for class 1
])
# MAE is bounded per sample, so a single mislabeled point can't dominate the gradient
model.compile(optimizer="adam", loss="mean_absolute_error", metrics=["accuracy"])
model.fit(X_train, y_train_noisy, epochs=20, batch_size=16, verbose=0, validation_data=(X_test, y_test))
_, acc = model.evaluate(X_test, y_test)
print(f"Accuracy with MAE loss: {acc:.4f}")
Accuracy with MAE loss: 0.9150
MAE gives less importance to extreme errors, making the model less likely to overfit to noisy labels.
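To make this concrete, here is a tiny comparison of the two losses on a single confidently wrong prediction. The formulas are the standard binary cross-entropy and absolute error; the specific probability (0.01) is just an example I picked.
import numpy as np
def binary_cross_entropy(y_true, p):
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
def mean_abs_error(y_true, p):
    return abs(y_true - p)
# True label is 1, but the model is confidently wrong (p = 0.01)
print(binary_cross_entropy(1, 0.01))  # ~4.61 -> the penalty grows without bound
print(mean_abs_error(1, 0.01))        # 0.99  -> capped at 1, so one bad label can't dominate
Cross-entropy's penalty explodes as a prediction gets more confidently wrong, which is exactly what happens on mislabeled points; MAE stays bounded.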
Method 2: Co-Teaching (Two Models Teach Each Other)
Instead of using one model, we can train two models and let them filter out the noisy data for each other. The snippet below is a heavily simplified illustration: because the dataset is synthetic, we happen to know which labels were flipped, and we use that knowledge to mark "trusted" samples (a sketch that doesn't need the clean labels follows after it).
from sklearn.linear_model import LogisticRegression
model1 = LogisticRegression()
model2 = LogisticRegression()
# Identify "trusted" samples (least noisy)
# NOTE: this compares against the original clean labels (y_train), which we only have
# because the dataset is synthetic; real co-teaching estimates trust from per-sample loss
trusted_samples = np.abs(y_train - y_train_noisy) < 0.5
model1.fit(X_train[trusted_samples], y_train_noisy[trusted_samples])
model2.fit(X_train[trusted_samples], y_train_noisy[trusted_samples])
# Combine predictions
preds1 = model1.predict(X_test)
preds2 = model2.predict(X_test)
final_preds = (preds1 + preds2) // 2  # with only two models, this predicts 1 only when both agree
acc_coteach = accuracy_score(y_test, final_preds)
print(f"Accuracy with Co-Teaching: {acc_coteach:.4f}")
Both models train only on the "trusted" data, which keeps them from memorizing the noise.
Accuracy with Co-Teaching: 0.9000
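For completeness, here is a minimal sketch of the actual co-teaching idea, which does not rely on knowing the clean labels: each model keeps only its smallest-loss samples (assuming we can roughly estimate the noise rate, here 20%) and hands them to the other model to train on. The helper name small_loss_indices, the keep_fraction value, and the single training round are my simplifications for illustration, not part of the original code.
def small_loss_indices(model, X, y, keep_fraction=0.8):
    # Per-sample cross-entropy; mislabeled points tend to have larger loss
    proba = np.clip(model.predict_proba(X)[:, 1], 1e-7, 1 - 1e-7)
    losses = -(y * np.log(proba) + (1 - y) * np.log(1 - proba))
    return np.argsort(losses)[: int(len(y) * keep_fraction)]
# Warm up both models on all (noisy) labels
# (in practice the two models are made different, e.g. via different architectures or data order)
m1 = LogisticRegression().fit(X_train, y_train_noisy)
m2 = LogisticRegression().fit(X_train, y_train_noisy)
# Each model picks its small-loss samples...
idx1 = small_loss_indices(m1, X_train, y_train_noisy)
idx2 = small_loss_indices(m2, X_train, y_train_noisy)
# ...and teaches the other model with them (one round, for illustration)
m1 = LogisticRegression().fit(X_train[idx2], y_train_noisy[idx2])
m2 = LogisticRegression().fit(X_train[idx1], y_train_noisy[idx1])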
Method 3: Semi-Supervised Learning (Train on Clean Data First)
If we can identify some clean (trusted) samples, we can train a model on them first, then use that model to relabel the noisy training data.
# Train a model only on the trusted samples identified earlier
model_clean = RandomForestClassifier()
model_clean.fit(X_train[trusted_samples], y_train_noisy[trusted_samples])
# Use it to relabel the entire training set
y_train_fixed = model_clean.predict(X_train)
# Retrain a fresh model on the relabeled data
final_model = RandomForestClassifier()
final_model.fit(X_train, y_train_fixed)
acc_fixed = accuracy_score(y_test, final_model.predict(X_test))
print(f"Accuracy after Semi-Supervised Learning: {acc_fixed:.4f}")
The model learns correct patterns first, then refines its understanding with the full dataset.
Accuracy after Semi-Supervised Learning: 0.9050
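Since the dataset is synthetic and we still have the original clean labels, we can also check how much of the noise the relabeling step actually repaired (only possible here because we know y_train; the exact numbers will vary):
print(f"Labels matching the truth before fixing: {(y_train_noisy == y_train).mean() * 100:.1f}%")
print(f"Labels matching the truth after fixing:  {(y_train_fixed == y_train).mean() * 100:.1f}%")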
Conclusion
While noisy labels cause a noticeable drop in performance, techniques like a noise-tolerant loss (MAE), co-teaching-style filtering, and semi-supervised relabeling can help recover some of the lost accuracy, though nothing quite matches the performance of clean data.