DEV Community

Markus Stoll
Markus Stoll

Posted on • Originally published at Medium

How I Created an Animation Of the Embeddings During Fine-Tuning

How I Created an Animation Of the Embeddings During Fine-Tuning

Using Cleanlab, PCA, and Procrustes to visualize ViT fine-tuning on CIFAR-10

In the field of machine learning, Vision Transformers (ViT) are a type of model used for image classification. Unlike traditional convolutional neural networks, ViTs use the transformer architecture, which was originally designed for natural language processing tasks, to process images. Fine-tuning these models, for optimal performance can be a complex process.

In a previous article, I used an animation to demonstrate changes in the embeddings during the fine-tuning process. This was achieved by performing Principal Component Analysis (PCA) on the embeddings. These embeddings were generated from models at various stages of fine-tuning and their corresponding checkpoints.


Projection of embeddings with PCA during fine-tuning of a Vision Transformer (ViT) model [1] on CIFAR10 [3]; Source: created by the author

The animation received over 200,000 impressions. It was well-received, with many readers expressing interest in how it was created. This article is here to support those readers and anyone else interested in creating similar visualizations.

In this article, I aim to provide a comprehensive guide on how to create such an animation, detailing the steps involved: fine-tuning, creation of embeddings, outlier detection, PCA, Procrustes, review, and creation of the animation.

The complete code for the animation is also available in the accompanying notebook on GitHub.

Preparation: Fine-tuning

The first step is to fine-tune the google/vit-base-patch16–224-in21k Vision Transformer (ViT) model [1], which is pre-trained. We use the CIFAR-10 dataset [2] for this, containing 60,000 images classified into ten different classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.

You can follow the steps outlined in the Hugging Face tutorial for image classification with transformers to execute the fine-tuning process also for CIFAR-10. Additionally, we utilize a TrainerCallback to store the loss values during training into a CSV file for later use in the animation.

from transformers import TrainerCallback

class PrinterCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero:
            if len(logs) == 3:  # skip last row
                with open("log.csv", "a") as f:
                    f.write(",".join(map(str, logs.values())) + "\n")
Enter fullscreen mode Exit fullscreen mode

It’s important to increase the save interval for checkpoints by setting save_strategy="step" and a low value for save_step in TrainingArguments to ensure enough checkpoints for the animation. Each frame in the animation corresponds to one checkpoint. A folder for each checkpoint and the CSV file are created during the training and are ready for further use.

Creation of the Embeddings

We use the AutoFeatureExtractor and AutoModel from the Transformers library to generate embeddings from the CIFAR-10 dataset’s test split using different model checkpoints.

Each embedding is a 768-dimensional vector representing one of the 10,000 test images for one model checkpoint. These embeddings can be stored in the same folder as the checkpoints to maintain a good overview.

Extracting Outliers

We can use the OutOfDistribution class provided by the Cleanlab library to identify outliers based on the embeddings for each checkpoint. The resulting scores can then identify the top 10 outliers for the animation.

from cleanlab.outlier import OutOfDistribution

def get_ood(sorted_checkpoint_folder, df):
...
ood = OutOfDistribution()
ood_train_feature_scores = ood.fit_score(features=embedding_np)
df["scores"] = ood_train_feature_scores

Enter fullscreen mode Exit fullscreen mode




Applying PCA and Procrustes Analysis

With a Principal Component Analysis (PCA) for the scikit-learn package, we visualize the embeddings in a 2D space by reducing the 768-dimensional vectors to 2 dimensions. When recalculating PCA for each timestep, large jumps in the animation due to axis-flips or rotations can occur. To address this issue, we apply an additional Procrustes Analysis [3] from the SciPy package to geometrically transform each frame onto the last frame, which involves only translation, rotation, and uniform scaling. This enables smoother transitions in the animation.

from sklearn.decomposition import PCA
from scipy.spatial import procrustes

def make_pca(sorted_checkpoint_folder, pca_np):
...
embedding_np_flat = embedding_np.reshape(-1, 768)
pca = PCA(n_components=2)
pca_np_new = pca.fit_transform(embedding_np_flat)
_, pca_np_new, disparity = procrustes(pca_np, pca_np_new)

Enter fullscreen mode Exit fullscreen mode




Review in Spotlight

Before finalizing the entire animation, we conduct a review in Spotlight. In this process, we utilize the first and last checkpoints to perform embedding generation, PCA, and outlier detection. We load the resulting DataFrame in Spotlight:


Embeddings for CIFAR-10: PCA and 8 worst outliers for the first and the last checkpoint of a short fine-tuning— visualized with spotlight, source: created by the author

Spotlight provides a comprehensive table in the top left, showcasing all the fields present in the dataset. On the top right, two PCA representations are displayed: one for the embeddings generated using the first checkpoint and one for the last checkpoint. Finally, in the bottom section, selected images are presented.

Disclaimer: The author of this article is also one of the developers of Spotlight.

Create the animation

For each checkpoint, we create an image, which we then store alongside its corresponding checkpoint.

This is achieved through the utilization of the make_pca(...) and get_ood(...) functions, which generate the 2D points representing the embedding and extract the top 8 outliers, respectively. The 2D points are plotted with colors corresponding to their respective classes.The outliers are sorted based on their score, and their corresponding images are displayed in a highscore leaderboard. The training loss is loaded from a CSV file and plotted as a line graph.

Finally, all the images can be compiled into a GIF using libraries such as imageio or similar.

Three generated frames from the initial three written checkpoints of the fine-tuning process show slight clustering. It is expected that more pronounced clustering will occur in later steps. source: created by the author

Conclusion

This article has provided a detailed guide on how to create an animation that visualizes the fine-tuning process of a Vision Transformer (ViT) model. We’ve walked through the steps of generating and analyzing embeddings, visualizing the results, and creating an animation that brings these elements together.

Creating such an animation not only helps in understanding the complex process of fine-tuning a ViT model but also serves as a powerful tool for communicating these concepts to others.

The complete code for the animation is available in the accompanying notebook on GitHub.

I am a professional with expertise in creating advanced software solutions for the interactive exploration of unstructured data. I write about unstructured data and use powerful visualization tools to analyze and make informed decisions.

References

[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (2020), arXiv

[2] Alex Krizhevsky, Learning Multiple Layers of Features from Tiny Images (2009), University Toronto

[3] Gower, John C. Generalized procrustes analysis (1975), Psychometrika

Top comments (0)