Introduction:
In this blog post, we present a comparative study of three machine learning models, Random Forest, XGBoost, and FashionCNN, applied to the FashionMNIST dataset, a widely recognized benchmark for image classification. While the traditional models achieve commendable accuracy, the analysis reveals their struggle to capture the dataset's nonlinear intricacies. Notably, the deep learning model FashionCNN stands out with an impressive accuracy of 92.39%, outperforming its counterparts. The exploration of trade-offs between traditional machine learning and deep learning offers valuable insights for practical applications. The study also examines how Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) enhance interpretability, contributing to the ongoing discourse on image classification for FashionMNIST. The findings serve as a roadmap for practitioners navigating the choice between traditional and deep learning models in pursuit of effective and interpretable image classification.
Background:
FashionMNIST is a dataset specifically designed for image classification in the context of fashion-related items. Fashion-MNIST was originally developed to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms. It shares the same image size and structure of training and testing splits as MNIST.
- MNIST is too easy. Convolutional nets can achieve 99.7% on MNIST. Classic machine learning algorithms can also achieve 97% easily.
- MNIST is overused. Popular research scientists and deep learning experts have called for people to move away from MNIST.
- MNIST cannot represent modern CV tasks, which often involve challenging, intricate datasets.
"Instead of moving on to harder datasets than MNIST, the ML community is studying it more than ever. Even proportional to other datasets."
-- Ian Goodfellow, inventor of GANs
"Many good ideas will not work well on MNIST (e.g. batch norm). Inversely many bad ideas may work on MNIST and not transfer to real CV."
-- François Chollet, creator of Keras
Dataset Overview:
Fashion-MNIST is a dataset of Zalando's article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255. The training and test data sets have 785 columns. The first column consists of the class labels (listed below) and represents the article of clothing. The rest of the columns contain the pixel-values of the associated image.
- To locate a pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27. The pixel is located on row i and column j of a 28 x 28 matrix.
- For example, pixel31 indicates the pixel in the fourth column from the left and the second row from the top.
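This index arithmetic can be sketched in a couple of lines (the function name is ours, chosen just for illustration):

```python
# Decompose a pixel index x as x = i * 28 + j, where i is the row and
# j is the column, both between 0 and 27.
def pixel_position(x):
    i, j = divmod(x, 28)
    return i, j

print(pixel_position(31))  # (1, 3): second row from the top, fourth column from the left
```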
Here are the key characteristics of the Fashion MNIST dataset:
- Fashion Categories: Fashion MNIST contains grayscale images of various fashion items and accessories instead of handwritten digits. It includes a total of 10 fashion categories, making it a multi-class classification problem.
- The 10 categories in the dataset are: T-shirt/top, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot
- Image Size: Each image in the Fashion MNIST dataset is 28 pixels in height and 28 pixels in width. Therefore, the images are 28x28 pixels in size, resulting in a total of 784 pixels per image.
- Grayscale Images: All images in the dataset are grayscale, meaning they carry no color information. Each pixel has a single intensity value between 0 and 255; in the CSV encoding used here, higher values indicate darker pixels.
- Dataset Size: The dataset is split into two main parts: a training set and a test set.
- Training Set: It contains 60,000 images of fashion items, with 6,000 images per class.
- Test Set: It contains 10,000 images, with 1,000 images per class.
- Balanced Classes: The dataset is balanced, meaning that each fashion category has an approximately equal number of examples in both the training and test sets. This balance ensures that the model doesn’t have a bias toward any specific class.
Exploratory Data Analysis (EDA):
Data Visualization
Let's begin by examining the visuals. In this exploration, we leverage the power of matplotlib to illustrate the initial 30 training images found within the input_data dataframe, accompanied by their corresponding labels.
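A minimal sketch of that visualization is below, assuming input_data follows the layout described above (label in the first column, 784 pixel columns after it); the function name and the synthetic demo frame are ours, added for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def show_first_images(df, n=30, cols=10):
    """Plot the first n images of a FashionMNIST-style dataframe
    (first column = label, remaining 784 columns = pixel values)."""
    labels = df.iloc[:n, 0].to_numpy()
    images = df.iloc[:n, 1:].to_numpy().reshape(-1, 28, 28)
    rows = int(np.ceil(n / cols))
    fig, axes = plt.subplots(rows, cols, figsize=(1.5 * cols, 1.5 * rows))
    for ax, img, lbl in zip(axes.ravel(), images, labels):
        ax.imshow(img, cmap="gray")
        ax.set_title(int(lbl), fontsize=8)
    for ax in axes.ravel():
        ax.axis("off")
    fig.tight_layout()
    return fig

# Demo on random stand-in data shaped like the real dataframe;
# on the real data this would be show_first_images(input_data)
rng = np.random.default_rng(0)
demo = pd.DataFrame(np.column_stack([rng.integers(0, 10, 30),
                                     rng.integers(0, 256, (30, 784))]))
fig = show_first_images(demo)
```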
Overlaying the pixel-intensity distributions for every label on a single graph makes it easy to see how intensities vary across clothing items. We plot one histogram per label, each in its own color, on shared axes: the x-axis represents pixel intensity values, and the y-axis the frequency of pixels with a particular intensity. Superimposing each label's histogram on the same graph allows a clear comparison of pixel distributions among the fashion categories.
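One way to sketch this overlay (the function name is ours; the dataframe layout is assumed to be label-first as before):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

def plot_pixel_distributions(df):
    """Overlay one pixel-intensity histogram per class label
    (first column = label, remaining columns = pixel values, 0-255)."""
    fig, ax = plt.subplots(figsize=(10, 6))
    for lbl in sorted(df.iloc[:, 0].unique()):
        pixels = df.loc[df.iloc[:, 0] == lbl].iloc[:, 1:].to_numpy().ravel()
        ax.hist(pixels, bins=50, range=(0, 255), histtype="step", label=str(lbl))
    ax.set_xlabel("Pixel intensity")
    ax.set_ylabel("Frequency")
    ax.legend(title="Label", ncol=2)
    return fig

# Demo on random stand-in data shaped like the real dataframe
rng = np.random.default_rng(0)
demo = pd.DataFrame(np.column_stack([np.repeat(np.arange(10), 5),
                                     rng.integers(0, 256, (50, 784))]))
fig = plot_pixel_distributions(demo)
```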
PCA
Principal Component Analysis (PCA) serves as a statistical technique designed to decrease the dimensionality of a dataset while preserving the essential information. Widely employed for dimension reduction, particularly with high-dimensional data, PCA operates by evaluating and contrasting correlations among various dimensions. This process yields a minimal set of variables that capture the highest amount of variation or explanation regarding the original data distribution. In simpler terms, PCA discards dimensions with lower explained variance, retaining those that convey more meaningful insights.

Each component corresponds to an eigenvector. The first resembles a combination of a T-shirt and a shoe, the second a trouser and a pullover, and the third a pullover and an ankle boot.
The first principal component explains 29% of the variance in the pixel data, the second approximately 17%, and the third only 6%. As additional components are incorporated, each contributes less to the overall variance. While keeping all 784 components would achieve a cumulative explained variance of 100%, such an exhaustive approach is impractical. Instead, dimension reduction trades some explained variance for efficiency.
Choosing a judiciously small number of components is imperative to maintain model effectiveness. Plotting the cumulative explained variance ratio across all PCA components enables a clear visualization of how variance changes with the growing number of components.
As the graph above depicts, around 200 components account for approximately 95% of the variance, while 80 components explain 90%. As the number of components increases, the explained variance approaches 100%, as expected, since the full set corresponds to the original 784-dimensional data.
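The cumulative-variance computation behind such a plot can be sketched as follows; X here is random stand-in data with the right shape, not the real training pixels, so the thresholds printed will differ from the post's figures:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the (n_samples, 784) flattened pixel matrix
rng = np.random.default_rng(0)
X = rng.random((500, 784))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components reaching 90% / 95% of the variance
n90 = int(np.searchsorted(cumvar, 0.90) + 1)
n95 = int(np.searchsorted(cumvar, 0.95) + 1)
print(n90, n95)
```

Plotting `cumvar` against the component index reproduces the elbow-style curve described above.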
Let's try plotting the first two dimensions of PCA results.
From the graph above, we can see the two components can separate different categories apart to some degree, but the separation is not clear enough. We need a more efficient technique.
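A two-component scatter plot like the one discussed here can be sketched as below; X and y are random stand-ins for the flattened pixels and labels, so the real plot would show the partial class separation described above:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((300, 784))           # stand-in for the flattened pixels
y = rng.integers(0, 10, size=300)    # stand-in for the class labels

coords = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
scatter = ax.scatter(coords[:, 0], coords[:, 1], c=y, cmap="tab10", s=8)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
fig.colorbar(scatter, label="class")
```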
t-SNE
t-SNE, or t-distributed stochastic neighbor embedding, is a powerful technique for reducing the dimensionality of high-dimensional data, often applied to visualize complex patterns in a more understandable form. The fundamental concept of t-SNE involves mapping each data point from a high-dimensional space to a lower-dimensional one, ensuring that similar points in the original space are represented as close points in the reduced space.
By applying t-SNE to FashionMNIST, we can effectively condense the image features while preserving their pairwise similarities. This is valuable for visualizing relationships and clusters between different fashion items in a two-dimensional or three-dimensional space.
The practical implementation involves creating a scatter plot after applying t-SNE, where each point on the plot corresponds to an image. The proximity of points on the plot reflects the similarities between the images, unveiling inherent structures and resemblances within the dataset. For instance, t-SNE may reveal that similar types of clothing tend to cluster together, providing a visually intuitive means of exploring and understanding the underlying structure of the FashionMNIST dataset.
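A minimal sketch of applying t-SNE with scikit-learn follows; t-SNE scales poorly with sample count, so a subsample is typical, and X/y here are random stand-ins for the real pixels and labels:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.random((300, 784))           # stand-in for a subsample of flattened pixels
y = rng.integers(0, 10, size=300)    # stand-in for the class labels

# Map 784-dimensional points down to 2 dimensions, preserving local similarity
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # (300, 2)
```

Scattering `emb[:, 0]` against `emb[:, 1]`, colored by `y`, gives the cluster plot described in the text.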
Data Preprocessing:
Flatten the Images:
The first step we are taking is to flatten these 2D arrays into 1D arrays. This means you convert each image into a vector of 784 (28x28) values. This is a common practice when working with neural networks or other machine learning models that expect 1D input.
Scale Pixel Values:
The second step involves scaling the pixel values of the images to standardize the input data, giving it a mean of 0 and a standard deviation of 1. The formula we are using, (X - np.mean(X)) / np.std(X), is the standard z-score: it subtracts the mean and divides by the standard deviation for each pixel value.
- Numerical Stability: Standardizing the input helps in numerical stability during the training of neural networks. It prevents the exploding or vanishing gradient problems that can occur when working with data that has a large range of values.
- Convergence Speed: Training may converge faster when input features are on a similar scale. This can be particularly important for iterative optimization algorithms used in training neural networks.
- Model Performance: Some machine learning algorithms are sensitive to the scale of input features. Standardizing the data ensures that no particular feature dominates the learning process.
# Flatten the images and scale pixel values to mean=0.0 and var=1.0
X_train = X_train.reshape(X_train.shape[0], -1)
X_train = (X_train - np.mean(X_train)) / np.std(X_train)
FashionMNIST is a relatively clean and straightforward dataset. Unlike more complex datasets or real-world images, FashionMNIST images are grayscale, well-aligned, and already normalized in terms of size. As a result, extensive preprocessing may not be as necessary compared to dealing with more complex datasets.
Model 1 - Random Forest:
Random Forests are an ensemble learning technique based on decision trees. The core idea is to build multiple decision trees during training and merge their predictions to achieve a more accurate and stable result. Each tree is trained on a random subset of the data, and the final prediction is determined by a majority vote or averaging.
Default Model Training and Evaluation:
Default Model: A Random Forest model is initially trained with default parameters on the normalized training set. This serves as a baseline for comparison. The default model's performance is assessed on the test set, providing insights into its initial accuracy. We get a test set accuracy of 87.60%.
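The baseline step can be sketched as follows; we use sklearn's small digits dataset as a lightweight stand-in for FashionMNIST, so the accuracy printed will not match the 87.60% reported above:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data: 8x8 digit images instead of 28x28 fashion images
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

rf = RandomForestClassifier(random_state=42)  # all default hyperparameters
rf.fit(X_train, y_train)
acc = rf.score(X_test, y_test)
print(f"test accuracy: {acc:.4f}")
```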
Hyperparameter Tuning:
GridSearchCV: A systematic search over a predefined parameter grid is conducted. Parameters like the number of trees (n_estimators), tree depth (max_depth), and leaf characteristics (min_samples_split, min_samples_leaf) are explored. GridSearchCV uses accuracy as the metric and stratified 5-fold cross-validation for each hyperparameter combination.
### Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [100],
    'max_depth': [20, 50],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2]
}
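Wiring this grid into GridSearchCV might look like the sketch below; sklearn's digits dataset stands in for FashionMNIST to keep the example quick to run:

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)  # lightweight stand-in data

param_grid = {
    'n_estimators': [100],
    'max_depth': [20, 50],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [1, 2]
}

# With a classifier, cv=5 uses stratified 5-fold splits by default
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring='accuracy', cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```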
Best Model: The combination of hyperparameters that maximizes the mean validation accuracy found by GridSearchCV is identified, resulting in a more optimized Random Forest model. We obtained the following hyperparameters for the fine-tuned model.
best_model = RandomForestClassifier(random_state=42, n_estimators=100, max_depth=50, min_samples_leaf=1, min_samples_split=2)
Model Evaluation
The optimized model is evaluated on the test set. There wasn't much improvement in the score and it gave an accuracy of 87.61%.
Interpretability:
We used the built-in feature_importances_ attribute from scikit-learn to interpret the model. This attribute provides a way to understand the importance of each feature in making predictions: the higher the value, the more important the feature is considered to be.
feature_importances = best_model.feature_importances_
feature_importance_df = pd.DataFrame({'Feature': range(len(feature_importances)), 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)
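Since each of the 784 features is a pixel, one intuitive way to view these importances is to reshape the vector back into a 28 x 28 grid, showing where on the garment the model focuses. The sketch below uses random stand-in values in place of the real feature_importances_ vector:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt

# Stand-in for best_model.feature_importances_ (one value per pixel, summing to 1)
importances = np.random.default_rng(0).random(784)
importances /= importances.sum()

grid = importances.reshape(28, 28)
fig, ax = plt.subplots()
im = ax.imshow(grid, cmap="viridis")
ax.set_title("Pixel importances (28 x 28)")
fig.colorbar(im)
```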
Model 2 - XGBoost
We used a boosting version of ensemble decision trees, to further improve the performance. XGBoost (Extreme Gradient Boosting) is a powerful and widely used machine learning algorithm for regression and classification tasks. It belongs to the ensemble learning family and is based on the gradient boosting framework. XGBoost is known for its efficiency, speed, and effectiveness in handling structured/tabular data. It is particularly popular in data science competitions and real-world applications due to its excellent performance.
XGBoost offers several advantages over Random Forests, including a regularization term to prevent overfitting, a sequential tree-building process with gradient boosting for improved accuracy, built-in handling of missing values, parallelization for faster training on large datasets, and advanced optimization techniques. XGBoost also allows for custom objective functions and assigns weights to training instances, providing finer control over learning.
Default Model Training and Evaluation:
Default Model: An XGBoost model is initially trained with default parameters on the normalized training set, serving as a baseline for comparison. Its performance on the test set gives an initial accuracy of 89.60%, a two-percentage-point improvement over the default Random Forest model.
Hyperparameter Tuning:
Grid Search and Random Search are traditional methods for hyperparameter tuning. They involve specifying a grid or a random set of hyperparameter values and evaluating the model's performance for each combination. We used Bayesian optimization for hyperparameter tuning of the XGBoost model.
Bayesian optimization, on the other hand, uses probabilistic models to model the unknown objective function. It maintains a surrogate model (usually a Gaussian process) that predicts the objective function's behavior across the hyperparameter space. We preferred this approach since Bayesian optimization is efficient in terms of the number of evaluations needed to find good hyperparameters.
Hyperparameters like learning rate, max depth, number of estimators, subsample, and min child weight are tuned within specified ranges.
# Hyperparameter tuning using Bayesian optimization
param_space = {
'learning_rate': (0.1, 0.5),
'max_depth': (8, 12),
'n_estimators': (70, 130),
'subsample': (0.5, 1.0),
'min_child_weight': (0, 4),
}
Best Model: The combination of hyperparameters that maximizes performance is identified, resulting in a more optimized XGBoost model. We obtained the following hyperparameters for the fine-tuned model, then trained it on the full training set and evaluated it on the test set.
hyperparameters = {
    'learning_rate': 0.27741698265128945,
    'max_depth': 10,
    'min_child_weight': 2,
    'n_estimators': 116,
    'subsample': 0.8251291379552849,
    'seed': SEED,
    'objective': 'multi:softprob',
    'num_class': 10
}
# Initialize the XGBoost classifier with best parameters
xgb_best_model = xgb.XGBClassifier(**hyperparameters)
# Fit the model on the entire training set
xgb_best_model.fit(X_train, y_train)
Model Evaluation:
The optimized model is evaluated on the test set. Similar to Random Forest, there wasn't much improvement in the score; it gave an accuracy of 89.8%.
Interpretability:
Similar to before, we used the built-in feature_importances_ attribute (exposed through XGBoost's scikit-learn API) to interpret the model. The higher the value, the more important the feature is considered to be.
Model 3 - Deep learning: FashionMLP
A Multi-Layer Perceptron (MLP) is a type of neural network with an input layer, one or more hidden layers, and an output layer. Nodes in each layer are connected with weights and biases, and activation functions introduce non-linearity. During training, the network adjusts these parameters using backpropagation to minimize prediction errors. MLPs are effective for various tasks, such as classification and regression, but may struggle with complex data patterns. They serve as foundational models for more advanced neural network architectures.
FashionMLP was crafted in the mold of a multi-layer perceptron (MLP) architecture specifically tailored for processing grayscale images in the FashionMNIST dataset. This model comprises three linear layers organized sequentially. The initial layer takes in 28 x 28 input features, followed by two hidden layers housing 400 and 150 neurons, respectively, and ultimately producing output for 10 classes. It's noteworthy that FashionMLP lacks activation functions, rendering it a purely linear model. Consequently, its suitability for nonlinear classification tasks is limited.
The primary objective behind FashionMLP was to establish a foundational architecture, serving as a baseline for image classification tasks on FashionMNIST. The model's design emphasizes simplicity, providing a benchmark for evaluating the performance of more intricate models.
class FashionMLP(nn.Module):
    def __init__(self):
        super(FashionMLP, self).__init__()
        self.layers = nn.Sequential(
            nn.Linear(in_features=28 * 28, out_features=400),
            nn.Linear(in_features=400, out_features=150),
            nn.Linear(in_features=150, out_features=10)
        )

    def forward(self, x):
        x = x.view(len(x), -1)
        return self.layers(x)
Capturing only linear patterns, it achieved an accuracy of 86.19% on the test set. This serves as a benchmark for the CNN model that we are going to develop.
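For readers new to PyTorch, one training step for a model like this can be sketched as follows; the Adam optimizer and learning rate here are our illustrative assumptions, since the post does not specify them for FashionMLP:

```python
import torch
import torch.nn as nn

# A stand-in model with the same layer sizes as FashionMLP
model = nn.Sequential(nn.Flatten(),
                      nn.Linear(28 * 28, 400),
                      nn.Linear(400, 150),
                      nn.Linear(150, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
loss_fn = nn.CrossEntropyLoss()

# Stand-in batch: 32 grayscale 28x28 images with random labels
images = torch.randn(32, 1, 28, 28)
labels = torch.randint(0, 10, (32,))

# One forward/backward pass
optimizer.zero_grad()
logits = model(images)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
print(logits.shape)  # torch.Size([32, 10])
```

Looping this over batches and epochs, with validation-based early stopping, gives the full training procedure used later in the post.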
Model 4 - Deep learning: FashionCNN
Convolutional Neural Networks (CNNs) are a class of neural networks specifically designed for image classification tasks. They excel in capturing spatial hierarchies of features within images through the use of convolutional layers. These layers apply filters or kernels to local regions of the input image, extracting low-level features such as edges and textures. Subsequent pooling layers reduce spatial dimensions while retaining essential information, and activation functions like ReLU introduce non-linearity to the model. CNNs often consist of fully connected layers for high-level reasoning, and the final layer typically employs the softmax activation for multi-class classification, producing probability scores for each class.
Training CNNs involves adjusting weights and biases through backpropagation, optimizing the model's ability to predict class labels accurately. The network learns to recognize increasingly complex features as information passes through its layers, creating hierarchical representations. Transfer learning is a common practice, where pre-trained CNNs on large datasets are fine-tuned for specific tasks, leveraging the knowledge gained from broader image datasets like ImageNet. CNNs have become foundational in computer vision, demonstrating remarkable effectiveness in tasks ranging from object recognition to facial recognition, owing to their ability to automatically learn and represent meaningful features from images.
CNN Architecture:
CNN Layer 1:
- Conv2d: This convolutional layer convolves input images with 32 filters, each using 3 x 3 kernels, and includes padding of 1. This operation extracts various features from the input images.
- BatchNorm2d: Batch normalization is applied to normalize and stabilize the gradients during training, enhancing the overall training process.
- ReLU: The Rectified Linear Unit introduces nonlinearity, allowing the network to capture complex patterns in the data.
- MaxPool2d: This max-pooling layer downsamples the spatial dimensions of the data by a factor of 2, aiding in extracting dominant features and reducing computational complexity.
CNN Layer 2:
This layer stacks the same sequence of operations as CNN Layer 1 (convolution, batch normalization, ReLU, and max pooling), but with 64 filters and no padding, further refining the extracted features.
Output Layer:
- Linear: The output from the preceding layers is flattened and transformed into a dimensional space of 600 neurons using a linear layer.
- ReLU: Similar to previous layers, ReLU introduces nonlinearity.
- Dropout: To prevent overfitting, a dropout layer with a rate of 0.2 is included. Dropout randomly deactivates some neurons during training.
- Linear: A linear layer with 120 neurons is added.
- Linear: Another linear layer with 10 neurons, one for each of the 10 fashion categories in the FashionMNIST dataset.
- Softmax: The softmax activation function generates class probabilities, assigning likelihoods to each fashion category. This final layer aids in making informed classification decisions by identifying the most probable category for a given input image.
In summary, this architecture employs two convolutional layers for feature extraction, followed by fully connected layers in the output layer for classification. The use of ReLU introduces nonlinearity, batch normalization stabilizes training, and dropout helps prevent overfitting. The softmax layer produces probability distributions, allowing the network to make confident fashion category predictions.
class FashionCNN(nn.Module):
    def __init__(self):
        super(FashionCNN, self).__init__()
        self.layer1 = nn.Sequential(  # x is [bs, 1, 28, 28]
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),  # Size([bs, 32, 28, 28])
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),  # Size([bs, 32, 14, 14])
        )
        self.layer2 = nn.Sequential(
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3),  # Size([bs, 64, 12, 12])
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2)  # Size([bs, 64, 6, 6])
        )
        self.output_layers = nn.Sequential(
            nn.Linear(in_features=64 * 6 * 6, out_features=600),  # Size([bs, 600])
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(in_features=600, out_features=120),  # Size([bs, 120])
            nn.Linear(in_features=120, out_features=10),  # Size([bs, 10])
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.view(out.size(0), -1)
        out = self.output_layers(out)
        return out
Hyper-Parameter Tuning:
The hyperparameters investigated include the number of epochs, learning rate, batch size, and the maximum early stopping criterion. The metrics evaluated are Cross-Validation (CV) Accuracy, Precision, Recall, and F1-score. Here's a summary of the key findings:
Learning Rate Impact: Lower learning rates (e.g., 0.00001) generally lead to better performance across all experiments. This suggests that a finer adjustment of weights during training improves the model's ability to learn intricate patterns.
Batch Size Influence: Larger batch sizes (e.g., 200) tend to yield better results, possibly benefiting from more stable weight updates and enhanced generalization.
Effect of Epochs: The number of epochs, set to 500 in all experiments, seems to be sufficient for model convergence. Further investigation could explore whether reducing the number of epochs without sacrificing performance is feasible.
Impact of Early Stopping: The maximum early stopping criterion is set at 50 epochs. This prevents overfitting by halting training if the model's performance on the validation set doesn't improve. The chosen value appears effective as it doesn't hinder model training.
Model Performance: The model achieves high accuracy, precision, recall, and F1-score, indicating robust performance across different hyperparameter configurations. Experiment 8 with a learning rate of 0.0005 and a batch size of 200 stands out as the top-performing configuration.
In conclusion, the hyperparameter tuning results suggest that a lower learning rate, larger batch size, and a reasonable number of epochs contribute to the optimal performance of the model. Experiment 8, with a learning rate of 0.0005 and a batch size of 200, seems to strike a balance between precision, recall, and overall accuracy. Further fine-tuning and exploration of hyperparameter space could potentially lead to even better results.
Model Evaluation:
Based on our experimental findings, it is clear that the FashionCNN model demonstrates superior performance when configured with a learning rate of 0.0005 and a batch size of 200. It achieved an impressive cross-validation accuracy of 96.65% with this optimized set of hyperparameters.
Using these hyper-parameters, our final model is trained on the whole training set.
# Hyperparameters setup
hyperparameters = Params(num_epochs=500, learning_rate=0.0005, batch_size=200, max_early_stop=50)
# Train model with full data
model = FashionCNN()
model_name = f"{model.__class__.__name__}__lr{hyperparameters.learning_rate}_batch{hyperparameters.batch_size}_epoch{hyperparameters.num_epochs}_earlystop{hyperparameters.max_early_stop}"
model_save_path = f"PyTorch_models/{model_name}"
train_records, test_records = train(model, train_loader, test_loader, hyperparameters, model_save_path, loss_fn=torch.nn.CrossEntropyLoss(), show_tqdm=False, verbose=False)
Our final FashionCNN model obtained a test accuracy of 92.16% with an F1-score of 0.9211 on the test data. The image below shows class-wise metrics on the test data.
CNN Feature Maps:
Feature maps are the output of convolutional layers in a convolutional neural network (CNN) during the process of convolution. In a CNN, convolutional layers apply a set of filters (also called kernels) to the input data, resulting in the generation of feature maps.
Convolution Operation: In a CNN, the convolution operation involves sliding a filter (small matrix) over the input data, computing the dot product at each step. This operation captures different patterns or features present in the input.
Feature Maps: The result of the convolution operation for each filter is a feature map. Each element in the feature map represents the activation of a neuron that has been exposed to a specific local region of the input.
Feature maps are crucial in deep learning models, as they represent the hierarchical learning of features from low-level to high-level. Early layers in a CNN might learn simple patterns like edges or textures, while deeper layers learn more complex and abstract features, potentially representing parts of objects or even entire objects. Visualizing feature maps can provide insights into what kind of information the network is learning and how it transforms the input data as it passes through the layers.
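One common way to visualize feature maps is to attach a forward hook to a convolutional layer and capture its output during a forward pass. The sketch below uses a stand-alone conv block shaped like FashionCNN's first layer rather than the trained model itself:

```python
import torch
import torch.nn as nn

# A conv block shaped like FashionCNN's layer1
conv = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

captured = {}

def hook(module, inputs, output):
    # Save the convolution's output (the feature maps) for later plotting
    captured["maps"] = output.detach()

conv[0].register_forward_hook(hook)
_ = conv(torch.randn(1, 1, 28, 28))  # one stand-in grayscale image
print(captured["maps"].shape)  # torch.Size([1, 32, 28, 28])
```

Each of the 32 slices of `captured["maps"]` can then be shown with `plt.imshow` to see which patterns that filter responds to.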
Conclusion:
The FashionCNN model consistently demonstrated superior performance across all metrics in the test evaluation. The superior performance of the FashionCNN model across all metrics on the test dataset could be attributed to various factors, such as its architecture, effective learning of relevant features from the data, and the hyper-parameter optimization process during training. Furthermore, the model may have effectively captured complex patterns and relationships within the fashion dataset, resulting in robust and accurate predictions during the test phase.
This study unveils compelling practical implications for businesses, particularly clothing companies. To enhance user experience and drive sales, companies can integrate image-based search functionalities into their websites or applications, allowing customers to upload or search for specific clothing items through images. Furthermore, leveraging image classification models for marketing analytics proves advantageous, as they can automatically tag, categorize, and analyze vast volumes of visual marketing content online. The insights gained from this analysis can then be applied to downstream tasks such as recommending visual content for marketing campaigns and monitoring competitors' marketing strategies.
Acknowledgments:
Team Jarvis -
Advisor -











