infinityspace23

Posted on Jun 5

Visualizing the Platonic Representation Hypothesis at a small scale - An elementary analysis on visual and semantic modalities.

#machinelearning #python #computerscience #ai

Abstract

AI models can be trained on inputs from different modalities, storing information as vectors in a compressed, abstract mathematical space called a ‘latent space’. Information is expected to cluster according to traits the model decides on. This entry aims to visualize how different modalities affect the way information is organized and stored. Taking inspiration from the Platonic Representation Hypothesis paper ¹, I explore whether convergence already happens at a small scale, and at which layer of a model’s architecture it occurs.

Introduction

Artificial Neural Networks (ANNs) have an immensely sophisticated architecture, so much so that our understanding of their internal workings has not kept pace with their rate of progress. From AlexNet to present-day LLMs, a great deal of work in mechanistic interpretability has sought to deepen our understanding of these models. A key question in that work is how an ANN stores, retrieves and uses the information it has learnt during training. Many researchers have written on this, and one recent paper that stands out, the Platonic Representation Hypothesis (PRH), is what drew me to this exploration.

The paper's core argument is that, at large scales, AI models seem to store information in similar geometric representations regardless of modality (Huh et al., 2024). This is counterintuitive, since one would expect such models to organize information along different distinguishing features and hence arrive at different representations. A sentence transformer is expected to sort along semantic categories while a vision transformer is expected to sort along visual categories. Thus, the original researchers raise the possibility that AI models are grasping a deeper meaning embedded in our worldly understanding, somewhat like Plato’s allegory of the Cave.

In this entry, I experiment with a few models while pointing out the limitations of my setup and areas for future improvement. I also consider whether the observed convergence is evidence for Plato’s allegory or simply a distillation of human-centered cultural concepts. Finally, this series of entries serves as a distillation of my experience understanding ANNs better and my attempt at visualizing the work done in the original PRH paper.

Methodology

Due to hardware constraints, 1000 images were sampled from Fashion-MNIST, sourced via PyTorch. All experiments were also run in PyTorch. Figure 1 below shows the images with their corresponding labels available in Fashion-MNIST.²

To understand how models store information, I peel back their architecture layer by layer. First, I set a baseline by comparing how information is stored in two models of the same modality: ResNet (He et al., 2016), acting as the representative Vision model, and FashionCNN, a Vision model I trained myself, with both run on the same Fashion-MNIST images. Starting with ResNet as the control, I hooked into some key layers (conv1, layer1, layer2, layer3, layer4, avgpool), compressing each layer's spatial dimensions to 1 to facilitate UMAP plotting later on (McInnes et al., 2018). The same was done for FashionCNN.

FashionCNN is a simple model with three kinds of layer: 2 convolutional layers, 1 pooling layer and 1 classifier. ReLU activations between the 2 convolutional layers introduce non-linearity, preventing the network from collapsing into a single linear transform. All layers except the final classifier were hooked.
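To make the hooking step concrete, here is a minimal sketch of how activations could be captured and spatially pooled. The `FashionCNN` class below is a hypothetical stand-in, not my actual trained model: the channel widths (16, 32), kernel sizes, the second ReLU after `conv2`, and the helper name `collect_activations` are all assumptions made for illustration, and random tensors stand in for Fashion-MNIST images.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FashionCNN(nn.Module):
    """Hypothetical stand-in: two conv layers, one pooling layer, one classifier."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)   # width is an assumption
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)  # width is an assumption
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))  # second ReLU is an assumption
        x = self.avgpool(x)
        return self.classifier(torch.flatten(x, 1))


def collect_activations(model, images, layer_names):
    """Run one forward pass and return {layer name: (N, C) tensor}."""
    captured = {}
    handles = []
    modules = dict(model.named_modules())

    def make_hook(name):
        def hook(_module, _inputs, output):
            # Collapse height and width to 1x1 so each layer yields one vector per image.
            pooled = F.adaptive_avg_pool2d(output.detach(), (1, 1))
            captured[name] = torch.flatten(pooled, 1)
        return hook

    for name in layer_names:
        handles.append(modules[name].register_forward_hook(make_hook(name)))
    model.eval()
    with torch.no_grad():
        model(images)
    for handle in handles:
        handle.remove()  # leave the model without lingering hooks
    return captured


torch.manual_seed(0)
fake_images = torch.randn(8, 1, 28, 28)  # random stand-in for Fashion-MNIST images
acts = collect_activations(FashionCNN(), fake_images, ["conv1", "conv2", "avgpool"])
for name, tensor in acts.items():
    print(name, tuple(tensor.shape))
```

Each hooked layer ends up as one flat vector per image, which is the shape a projection method such as UMAP expects. The same helper should in principle work on a ResNet by passing its layer names instead.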

When plotting heatmaps for VT (Vision model) vs VT, both Euclidean and cosine distances were computed. When plotting VT vs ST (Sentence Transformer), however, only cosine distance was used, since Euclidean distance provided little value for the ST model. Euclidean distance between two vectors depends on their magnitudes as well as their directions, while cosine distance depends on direction alone. Hence Euclidean distance can be dominated by a few large activations, whereas cosine distance is unaffected by overall scale.
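A small sketch of the difference between the two metrics, using made-up vectors rather than real activations. Scaling a vector by a constant changes its Euclidean distance to the original but leaves the cosine distance at (numerically) zero.

```python
import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to both magnitude and direction."""
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b):
    """One minus cosine similarity; sensitive to direction only."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a  # same direction, ten times the magnitude

print(euclidean_distance(a, b))  # large, driven entirely by the magnitude gap
print(cosine_distance(a, b))     # approximately 0, since the direction is unchanged
```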

To compare against a sentence transformer, the all-minilm-l6-v2 model was used to encode the label names from the Fashion-MNIST data, and its embeddings were compared against FashionCNN’s representations. The all-minilm-l6-v2 is a sentence transformer that maps input to a 384-dimensional vector space. To make this comparison possible, FashionCNN’s output data is used to calculate a centroid for every unique label, and these centroids are then compared with the embeddings of all-minilm-l6-v2 using Representational Similarity Analysis ³.
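The centroid-then-RSA step can be sketched roughly as follows. This is an illustration under assumptions, not my exact pipeline: random arrays stand in for the FashionCNN activations and the label embeddings, the function names are hypothetical, and Pearson correlation over the upper triangle is one common choice of RSA score (Spearman rank correlation is another).

```python
import numpy as np


def class_centroids(features, labels, num_classes):
    """Mean feature vector per label, shape (num_classes, D)."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])


def cosine_rdm(vectors):
    """Pairwise cosine-distance matrix (a representational dissimilarity matrix)."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T


def rsa_score(rdm_a, rdm_b):
    """Pearson correlation between the upper triangles of two RDMs (an assumption)."""
    upper = np.triu_indices_from(rdm_a, k=1)
    return float(np.corrcoef(rdm_a[upper], rdm_b[upper])[0, 1])


rng = np.random.default_rng(0)
num_classes = 10
labels = np.repeat(np.arange(num_classes), 100)        # 1000 stand-in labels
image_features = rng.normal(size=(1000, 32))           # stand-in for FashionCNN activations
text_embeddings = rng.normal(size=(num_classes, 384))  # stand-in for label embeddings

vision_rdm = cosine_rdm(class_centroids(image_features, labels, num_classes))
text_rdm = cosine_rdm(text_embeddings)
print(vision_rdm.shape, text_rdm.shape)
print(rsa_score(vision_rdm, text_rdm))  # meaningless here, since the inputs are random
```

Because both models are reduced to a 10 x 10 matrix of distances between labels, the comparison works even though one space has 32 dimensions and the other 384.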

Experiment and Results

Figure 2 shows ResNet’s performance on Fashion-MNIST. It can be seen that ResNet clustered footwear (Sandal, Ankle Boot, Sneaker) in the bottom left while all other articles of clothing are in the top right. Within both left and right clusters, all clothing articles have been loosely grouped with mostly no concrete gatherings (exception: Trousers have a strong gathering).

Figure 3 shows FashionCNN’s performance on Fashion-MNIST. Here, a clustering of footwear (Sandal, Ankle Boot, Sneaker) more distinct than ResNet’s is seen on the left. On the right, Trousers, Bags and Dresses, although still loosely grouped, seem to have found more stable clustering than under ResNet.

Figure 4 shows the changes over 6 layers of ResNet (conv1, layer1, layer2, layer3, layer4, avgpool). At first glance, there is a stark difference in the clustering between the early-mid layers and later layers. Once again, however, ResNet can clearly identify footwear as separate from the others, though bags seem to confound it. Trousers and Dresses remain weakly clustered. Clustering in layer4 shows little to no structure and seems random.

Figure 5 shows the changes over 3 layers of FashionCNN (conv1, conv2, avgpool). Some key insights that go beyond what the avgpool results alone show are:

The model is able to tell apart Trousers and Dresses relatively better compared to other non-footwear clothing articles.
FashionCNN can’t seem to discern if Bags should be clustered with footwear or non-footwear.
Throughout its early to later layers, the clustering remains mostly constant.

Figure 6 is a heatmap of Euclidean distances, showing a layer-by-layer comparison of how similarly information is stored in the two latent spaces. Key insights to take note of:

Fashion conv1 and conv2 against ResNet conv1 score 1, i.e. a 1-1 mapping. This suggests they share much of the same edge and line detection.
Fashion conv1 and conv2 typically have a higher similarity against every single ResNet layer than Fashion avgpool does. This is possibly due to Fashion’s avgpool discarding spatial dimensions, leading to a loss of potential points to compare against.
Fashion avgpool has a minimum when compared to ResNet layer4. This is likely due to avgpool being compared against a highly specialized later layer.
Fashion conv1, conv2 / ResNet layer1 have poorer scores compared to Fashion conv1, conv2 / ResNet layer2, layer3. Although weak, this does support the pre-established idea that early layers of models are generally more universal and share more with each other compared to later more specialized ones.
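For readers wondering how a grid like this can be built when the layers have different widths, here is one possible sketch. It is an assumption about the general approach rather than a record of my exact computation: each layer is turned into a sample-by-sample distance matrix, and the grid entry is the Pearson correlation between two such matrices. The function names, layer widths and random activations are all hypothetical.

```python
import numpy as np


def pairwise_distances(x, metric):
    """Sample-by-sample distance matrix for activations of shape (N, D)."""
    if metric == "euclidean":
        sq = np.sum(x**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
        return np.sqrt(np.clip(d2, 0.0, None))  # clip tiny negatives from rounding
    if metric == "cosine":
        unit = x / np.linalg.norm(x, axis=1, keepdims=True)
        return 1.0 - unit @ unit.T
    raise ValueError(f"unknown metric: {metric}")


def layer_similarity_grid(acts_a, acts_b, metric="euclidean"):
    """Correlate the distance structure of every layer in A with every layer in B."""
    rows, cols = list(acts_a), list(acts_b)
    grid = np.zeros((len(rows), len(cols)))
    for i, row in enumerate(rows):
        dist_a = pairwise_distances(acts_a[row], metric)
        upper = np.triu_indices_from(dist_a, k=1)
        for j, col in enumerate(cols):
            dist_b = pairwise_distances(acts_b[col], metric)
            grid[i, j] = np.corrcoef(dist_a[upper], dist_b[upper])[0, 1]
    return rows, cols, grid


rng = np.random.default_rng(0)
n = 40  # number of images; the same images must be used for both models
fashion_acts = {name: rng.normal(size=(n, d))
                for name, d in [("conv1", 16), ("conv2", 32), ("avgpool", 32)]}
resnet_acts = {name: rng.normal(size=(n, d))
               for name, d in [("conv1", 64), ("layer1", 64), ("layer2", 128),
                               ("layer3", 256), ("layer4", 512), ("avgpool", 512)]}

rows, cols, grid = layer_similarity_grid(fashion_acts, resnet_acts, metric="euclidean")
print(grid.shape)  # one row per FashionCNN layer, one column per ResNet layer
```

The resulting array can be passed to a plotting call such as matplotlib's `imshow` to draw the heatmap. Switching `metric` to `"cosine"` gives the cosine variant of the same grid.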

Figure 7 illustrates a similar heatmap except this time cosine distances were measured.

Two trends carry over from the Euclidean distances. First, Fashion conv1, conv2 / ResNet layer1 have lower similarity than Fashion conv1, conv2 / ResNet layer2, layer3. Second, Fashion conv1 and conv2 share higher similarity with all ResNet layers than avgpool does.
Fashion conv1, conv2 / ResNet conv1 no longer has a value of 1.0, so the two no longer share a 1-1 mapping.
Fashion avgpool has a minimum against ResNet layer1 instead of layer4 as before.
In general, there is greater similarity measured in this heatmap for most layer pairs.

Figure 8 plots the distance values in FashionCNN (x-axis) against the values in all-minilm-l6-v2. Most data points are scattered around the trendline, with 1 major outlier and a few points beyond the expected uncertainties of the trendline. These could be due to FashionCNN’s poor classification of Bags, which are more spread out over the activation maps than other labels.

Discussion

There is a stark difference between the performance of the two models (ResNet, FashionCNN), for a few reasons. ResNet is trained on everyday objects, while FashionCNN is trained specifically on Fashion-MNIST data. ResNet is also trained on RGB images, while FashionCNN is trained on grayscale images. These properties let FashionCNN fit the data more closely and hence perform better.

Another thing to note is that both models seem able to differentiate footwear from everything else, yet cannot sort effectively through the ‘others’ cluster. For them, footwear could have more distinct shapes (Sneakers vs Trousers) while the difference in shape between a coat and a dress may not be that substantial. This is indicative of a Vision model's nature, where it specifically looks for patterns and shapes in an image rather than understanding what is really represented in the image.

ResNet’s later-layer results are also interesting, particularly because of their weak clustering. This is consistent with earlier feature visualization research, which found that the earlier layers of an ANN are more universal and respond to generic edges and lines, while later layers become more specialized to the training dataset (Zeiler & Fergus, 2014). ResNet's later layers may in this case be looking for features such as object textures or fur, which the fashion dataset cannot provide.

Furthermore, for both models, clustering is much better in avgpool. This could be due to the nature of Adaptive Pooling, where the output is stripped of spatial data by reducing the height and width of each feature map to (1,1). This may indicate that spatial layout acts as excess noise when comparing feature-rich representations.

Turning to the Euclidean and cosine distance heatmaps, similarity between the layers of the two Vision models peaks in the intermediate layers: FashionCNN’s conv1 and conv2 against ResNet’s layers 2 to 4 have the highest similarity values. This may be because the intermediate layers sit beyond the initial raw-pixel processing and before the more specialized avgpool layer. Moreover, while the Euclidean heatmap shows a clear drop in similarity from ResNet layers 2 and 3 to layer 4 for FashionCNN conv1 and conv2, the cosine heatmap shows all three layers with similar values. This is likely because cosine distance ignores differences in activation magnitude, such as those caused by brightness or contrast, and so reflects the underlying concepts more directly.

Moving on to the VT vs ST analysis, the first clear observation is the RSA value. It is a surprising 0.347 (to 3 d.p.) despite the obvious limitation of using the Sentence Transformer with Fashion-MNIST data. STs are typically good at recognizing meaning in context, with words used in sentences, whereas here single words were embedded. Note that 0.347 is a correlation between the two models' distance structures, not a fraction of cases: it indicates a weak-to-moderate positive relationship, meaning the visual model's geometry lines up only loosely with semantic boundaries. It is likely the value would have been higher had the ST been fed complete sentences, as it was designed for. In addition, the trendline tends towards the form $y=Cx$ for some constant $C$, which is promising: if both models held the same representation with a 1-1 mapping, the trendline would be $y=x$.
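As a rough illustration of what a trendline of the form $y=Cx$ means, the sketch below fits a line through the origin by least squares and compares it with an ordinary fit that allows an intercept. The data are synthetic with a known slope, not the values from Figure 8, and the choice of 45 points assumes one point per pair of the 10 labels.

```python
import numpy as np


def slope_through_origin(x, y):
    """Least-squares estimate of C in y ≈ C * x, with no intercept term."""
    return float(np.dot(x, y) / np.dot(x, x))


rng = np.random.default_rng(0)
num_pairs = 45  # 10 labels give 10 * 9 / 2 = 45 label pairs (assumed plotting unit)
x = rng.uniform(0.1, 1.0, size=num_pairs)             # synthetic vision-model distances
y = 0.5 * x + rng.normal(scale=0.02, size=num_pairs)  # synthetic text distances, true slope 0.5

C = slope_through_origin(x, y)
slope, intercept = np.polyfit(x, y, deg=1)  # ordinary fit with a free intercept
print(C, slope, intercept)
```

If the free-intercept fit returns an intercept close to zero, the data are compatible with the $y=Cx$ form. A value of $C$ near 1 would correspond to the $y=x$ case described above.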

Limitations

Due to compute constraints, the data size used for training and testing is a minuscule 1000 images. Though using the full set of 60,000 training or 10,000 test images may not be radically useful for the experiment, it is worth noting as a limitation in gathering optimal data.
FashionCNN is not optimized for performance. Its structure is bare-bones, with no batch normalization or similar components. Hence comparing it to ResNet may create a smaller discrepancy than there might have been if stronger models trained on Fashion-MNIST data were used.
As mentioned, the sentence transformer was not used to its full strength due to experimental limitations. The result was likely deflated because of this.
The data used was all human-derived, raising the question of whether our own cultural biases affect how the data is produced.

Future improvements

Although VT vs VT and VT vs ST comparisons were done, a more complete study would include a model like CLIP (Radford et al., 2021) that natively fuses both modalities. This would offer a more complete perspective on the topic.

Furthermore, datasets lacking human-derived concepts would add diversity to the data. For example, recordings of Humpback Whale songs would put some pressure on the argument that PRH is simply a distillation of human knowledge.

Conclusion

Looking at the models and their results, a convergence can be seen despite our scale being objectively small. This leans in favor of the ideas represented in the Platonic Representation Hypothesis, but a key question that remains is whether human influence plays a role in our results. Both models have been trained on human-derived data, which inadvertently reflects how we as humans view the world. Thus, is convergence representative of a deeper underlying structure of reality, or are models simply optimizing and revealing human-specific knowledge in its most primitive form? I’m still not confident there is enough evidence to support either side of the argument, but the fact that even at such a small scale, models operating with different modalities converged on geometric representations of knowledge lends non-trivial support to the Platonic Representation Hypothesis.

References

Greeshma K V, & Viji Gripsy. (2020). Image classification using HOG and LBP feature descriptors with SVM and CNN.

Huh, M., Cheung, B., Wang, T., & Isola, P. (2024). The Platonic Representation Hypothesis. arXiv preprint arXiv:2405.07987.

Kriegeskorte, N., Mur, M., & Bandettini, P. A. (2008). Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 2(4), 1–28.

McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for dimension reduction. arXiv preprint arXiv:1802.03426.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning (ICML), 8748–8763.

Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. European Conference on Computer Vision (ECCV), 818–833.

¹ The foundational premise of this work is built entirely upon Huh et al. (2024), who proposed that representations inside independently trained neural networks tend to converge toward a shared statistical model of the physical world as scale increases.
² Image adapted from Greeshma K V & Viji Gripsy (2020), illustrating the 10 structural classes and sample dimensions characteristic of the Fashion-MNIST dataset.
³ Representational Similarity Analysis (RSA) is a framework originally adapted from computational neuroscience (Kriegeskorte et al., 2008) to quantitatively gauge how closely the geometry of two separate feature spaces align without requiring direct 1-to-1 node mappings.
