Data Augmentation in CNNs and Its Impact on Generalisation (CIFAR-10 Experiments)
Data augmentation is widely used when training convolutional neural networks, especially for image classification tasks.
The idea is simple: by transforming training images (rotating, flipping, or shifting them), we can introduce more variation and help the model generalise better.
However, one question that is often overlooked is:
Does more augmentation always improve performance?
In this post, I investigate how different levels of data augmentation affect a CNN trained on the CIFAR-10 dataset.
All experiments, code, and plots shown here are taken directly from my notebook.
Dataset Overview: CIFAR-10
The CIFAR-10 dataset contains:
- 60,000 colour images
- 10 output classes
- 32×32 resolution
- A balanced distribution across classes
One key detail is the image resolution.
At 32×32 pixels, fine details are limited, and some classes (like cats and dogs) can look very similar. This becomes important when analysing model performance later.
Data Preparation
Before training, the dataset was preprocessed to ensure stable learning.
```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Load the CIFAR-10 dataset
(x_train_full, y_train_full), (x_test, y_test) = cifar10.load_data()

# Define the class names
class_names = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck"
]

# Scale pixel values to the range [0, 1]
x_train_full = x_train_full.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Convert class labels to one-hot encoded format
y_train_full_cat = to_categorical(y_train_full, 10)
y_test_cat = to_categorical(y_test, 10)

# Split the training data into training and validation sets
x_train, x_val, y_train_cat, y_val_cat, y_train, y_val = train_test_split(
    x_train_full,
    y_train_full_cat,
    y_train_full,
    test_size=0.2,
    random_state=42,
    stratify=y_train_full
)

# Print dataset shapes
print("Training set shape:", x_train.shape)
print("Validation set shape:", x_val.shape)
print("Test set shape:", x_test.shape)
```
- Pixel values are scaled to [0, 1]
- Labels are converted into one-hot encoding
- Data is split into training and validation sets (80/20, stratified by class)
These steps ensure that the model trains efficiently and can be evaluated properly.
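The scaling and one-hot steps above can be sketched framework-free with NumPy; the tiny `pixels` and `labels` arrays below are toy stand-ins of mine, not data from the notebook:

```python
import numpy as np

# Toy stand-ins for CIFAR-10 data (real shapes: (50000, 32, 32, 3) and (50000, 1))
pixels = np.array([[0, 128, 255]], dtype="float32")
labels = np.array([3, 0, 9])  # cat, airplane, truck

# Scale pixel values to [0, 1], as done with x_train_full above
scaled = pixels / 255.0

# One-hot encode, equivalent to to_categorical(labels, 10):
# row i is all zeros except a 1 at column labels[i]
one_hot = np.eye(10, dtype="float32")[labels]
```

Each one-hot row sums to 1, which is what the softmax output layer and categorical cross-entropy loss expect.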
Model Architecture
All experiments use the same CNN architecture to ensure a fair comparison.
```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Conv2D, BatchNormalization, MaxPooling2D, Flatten, Dense, Dropout
)

def build_cnn_model():
    model = Sequential([
        Conv2D(
            32, (3, 3), activation="relu",
            padding="same", input_shape=(32, 32, 3)
        ),
        BatchNormalization(),
        MaxPooling2D((2, 2)),

        Conv2D(64, (3, 3), activation="relu", padding="same"),
        BatchNormalization(),
        MaxPooling2D((2, 2)),

        Conv2D(128, (3, 3), activation="relu", padding="same"),
        BatchNormalization(),
        MaxPooling2D((2, 2)),

        Flatten(),
        Dense(128, activation="relu"),
        Dropout(0.5),
        Dense(10, activation="softmax")
    ])
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"]
    )
    return model

epochs = 15
batch_size = 64
```
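A quick sanity check on the architecture: with `padding="same"`, each conv layer preserves spatial size, so only the three max-pools shrink the 32×32 input. This arithmetic (mine, not from the notebook) shows what the Flatten layer receives:

```python
# Each 2x2 max-pool halves the spatial size; "same" convs leave it unchanged
size = 32
for _ in range(3):
    size //= 2  # 32 -> 16 -> 8 -> 4

# Final feature maps are 4x4 with 128 channels
flatten_units = size * size * 128
print(flatten_units)  # 2048
```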
Experiment Setup
Three models were trained:
- Baseline – no augmentation
- Light augmentation – small transformations
- Strong augmentation – larger transformations
This setup allows us to isolate the effect of augmentation.
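The training calls themselves are not shown in this post; under this setup they would look roughly like the sketch below, assuming the standard Keras `fit` API (which accepts an `ImageDataGenerator.flow` iterator directly) and the `light_datagen` generator defined in the next section:

```python
# Baseline: train directly on the unaugmented arrays
baseline_model = build_cnn_model()
baseline_model.fit(
    x_train, y_train_cat,
    validation_data=(x_val, y_val_cat),
    epochs=epochs, batch_size=batch_size
)

# Augmented runs: stream randomly transformed batches from a generator
# (same pattern with strong_datagen for the third model)
light_model = build_cnn_model()
light_model.fit(
    light_datagen.flow(x_train, y_train_cat, batch_size=batch_size),
    validation_data=(x_val, y_val_cat),
    epochs=epochs
)
```

Note that the validation data is left unaugmented in both cases, so the three models are compared on identical held-out images.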
Augmentation Setup
```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Create a light augmentation generator
light_datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True
)

# Create a stronger augmentation generator
strong_datagen = ImageDataGenerator(
    rotation_range=20,
    width_shift_range=0.15,
    height_shift_range=0.15,
    zoom_range=0.2,
    horizontal_flip=True
)
```
These parameters control the transformation strength.
- Smaller values → subtle variation
- Larger values → stronger distortion
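What these ranges mean in pixels can be illustrated with plain NumPy. This is an illustration of the idea only, not Keras's actual implementation (which also interpolates sub-pixel shifts and fills borders); the 4×4 image is a toy stand-in:

```python
import numpy as np

rng = np.random.default_rng(42)
img = np.arange(16, dtype="float32").reshape(4, 4)  # tiny stand-in image

# width_shift_range=0.25 on a 4-pixel-wide image allows shifts of up to 1 pixel
max_shift = int(0.25 * img.shape[1])
shift = int(rng.integers(-max_shift, max_shift + 1))
shifted = np.roll(img, shift, axis=1)

# horizontal_flip=True mirrors the image left-right (applied with probability 0.5)
flipped = img[:, ::-1]
print(flipped[0])  # first row reversed: [3. 2. 1. 0.]
```

Scaling the same fractions up to CIFAR-10, `width_shift_range=0.15` moves a 32-pixel-wide image by up to ~5 pixels, a sizeable fraction of a small object.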
Results
Test accuracy comparison across models
Observations
- Baseline – 0.752
- Light augmentation – 0.750
- Strong augmentation – 0.692
Interpretation
- Light augmentation had almost no effect
- Strong augmentation reduced performance
More augmentation does not always mean better performance.
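To put the gap in perspective, here is the relative change in test accuracy for each run, computed from the numbers reported above:

```python
baseline, light, strong = 0.752, 0.750, 0.692

# Relative change in test accuracy versus the baseline model
light_drop = (baseline - light) / baseline    # ~0.3%, within run-to-run noise
strong_drop = (baseline - strong) / baseline  # ~8%, a clear regression

print(round(strong_drop * 100, 1))  # 8.0
```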
Model Behaviour (Confusion Matrix)
Confusion matrix showing class-level performance
Observations
- Strong performance: airplane, ship, truck
- Weak performance: cat vs dog, automobile vs truck
Insight
Errors are often due to:
- low resolution
- visual similarity between classes
Some limitations come from the dataset itself.
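A confusion matrix like the one above takes only a few lines of NumPy to build. The labels below are hypothetical examples of mine, chosen to show how cat/dog confusions appear as off-diagonal counts:

```python
import numpy as np

# Hypothetical true/predicted class indices for six test images
# (3=cat, 5=dog, 0=airplane, 8=ship in the class_names list)
y_true = np.array([3, 5, 3, 5, 0, 8])
y_pred = np.array([5, 5, 3, 3, 0, 8])

# Entry (i, j) counts images of true class i predicted as class j
cm = np.zeros((10, 10), dtype=int)
np.add.at(cm, (y_true, y_pred), 1)

print(cm[3, 5], cm[5, 3])  # cat->dog and dog->cat confusions: 1 1
print(np.trace(cm))        # correct predictions sit on the diagonal: 4
```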
Conclusion
The experiments highlight the following:
- The baseline model already performs well
- Light augmentation has minimal impact
- Strong augmentation reduces performance
- Augmentation must be applied carefully



