State Space Models such as Mamba have become a promising alternative to transformers, which are difficult to compress due to the quadratic complexity of the attention mechanism. For handling image data, models such as ViM and VMamba have been proposed. Though Mamba has been shown to perform well on long contexts, there is still much to understand about how this model works. In this study, I run some experiments on vision token transformation for the ViM model to see how it performs and to look for possible improvements. This post will serve as a journal of the progress of this independent study.
Background
Attention Mechanism of Transformers
Transformers took the world by storm for their unmatched ability to handle long contexts and to scale. Modern Large Language Models such as ChatGPT, Claude, Gemini and DeepSeek are built on transformers. These models have billions of parameters, which is feasible because transformers parallelize easily: unlike earlier seq2seq models, they do not process text sequentially. At their core is the self-attention mechanism, presented in the following equation.
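$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimensionality of the keys, used for scaling.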
Here, the input is projected into the $Q$, $K$ and $V$ matrices and passed through the formula above to obtain attention scores over the input. Multiple such attention maps are computed for each input sequence in a process called multi-headed attention, letting different heads focus on different aspects of the input and giving more robust representations.
Transformers have been adapted to vision tasks by dividing the image into a set number of tokens or patches, inserting positional information and then passing them to the transformer layers. A CLS token is added to identify which class the image belongs to when needed.
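As a rough, hypothetical sketch (not tied to ViM, VMamba or any particular ViT implementation), turning an image into flattened patch tokens can look like this:

```python
import torch

def patchify(img, patch):
    """Split a (C, H, W) image into flattened, non-overlapping patch tokens."""
    C, H, W = img.shape
    p = img.unfold(1, patch, patch).unfold(2, patch, patch)  # (C, H/p, W/p, p, p)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, C * patch * patch)

tokens = patchify(torch.randn(3, 224, 224), 16)  # (196, 768) patch tokens
```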
For all their scalability, transformers are notoriously difficult to compress because of that same self-attention mechanism: its quadratic complexity makes long input sequences hard to process under resource constraints. Many optimizations have been proposed to address this.
State Space Models & Mamba
State Space Models (SSMs) are a mathematical formulation that describes the state of a system in a continuous space with a small number of parameters. They are described by the following equations:
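$$h'(t) = A\,h(t) + B\,x(t)$$
$$y(t) = C\,h(t) + D\,x(t)$$

Here $x(t)$ is the input signal, $h(t)$ the hidden state and $y(t)$ the output.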
The first equation describes how the system's internal state evolves over time, while the second relates the internal state to the observable output. Neural networks using SSMs share the matrices $A$, $B$ and $C$ across the network. The matrix $D$ can be thought of as a skip connection and is usually ignored. To make these equations usable in a neural network, the matrices are discretized by introducing another variable $\Delta$ and transforming them in the following way:
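With the zero-order hold discretization used by S4 and Mamba, the discrete parameters become:

$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\left(\exp(\Delta A) - I\right)\Delta B$$

(A simpler first-order approximation $\bar{B} \approx \Delta B$ is also common in practice.)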
Because these weights are shared, we can pre-calculate a portion of the equation once and reuse it over and over again. Similar to convolutions, the pre-calculated kernel can be used to compute subsequent representations efficiently, which saves computation. This can be visualized using the following figure:
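A toy PyTorch sketch of this idea, assuming a tiny SSM with shared $\bar{A}$, $\bar{B}$ and $C$ (not an actual S4/Mamba implementation): the kernel is built once and then reused for every sequence.

```python
import torch

def ssm_kernel(A_bar, B_bar, C, L):
    """Precompute the SSM convolution kernel K = (C B, C A B, C A^2 B, ...).
    This is only possible because A, B and C are shared across all time steps."""
    K, A_power = [], torch.eye(A_bar.shape[0])
    for _ in range(L):
        K.append((C @ A_power @ B_bar).squeeze())
        A_power = A_bar @ A_power
    return torch.stack(K)

def ssm_conv(x, K):
    """Apply the precomputed kernel as a causal convolution: y_t = sum_j K[t-j] x[j]."""
    y = torch.zeros_like(x)
    for t in range(len(x)):
        y[t] = torch.dot(K[: t + 1].flip(0), x[: t + 1])
    return y

# toy example: 2-dimensional state, scalar input/output, sequence length 8
A_bar = 0.9 * torch.eye(2)
B_bar = torch.randn(2, 1)
C = torch.randn(1, 2)
x = torch.randn(8)
K = ssm_kernel(A_bar, B_bar, C, len(x))  # computed once, reused for every sequence
y = ssm_conv(x, K)
```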
Neural networks using SSMs were shown to be more efficient than transformers, but their performance was not good enough for them to be considered a replacement. This is because of the static nature of the $A$, $B$ and $C$ matrices: the model cannot focus on relevant parts of the input the way transformers can, and thus struggles as sequences grow longer. The introduction of Mamba changed that notion by making SSMs selective. Selectivity is introduced by making the matrices $B$ and $C$ (along with the step size $\Delta$) input-dependent; that is, each token gets its own $B$ and $C$ instead of sharing one. This allows the model to focus on specific input tokens, just like transformers. Input-dependent parameters rule out the convolution kernel trick, so Mamba addresses the cost with a hardware-aware scan algorithm. The key observation is that arithmetic is much cheaper than memory movement: moving data back and forth between the GPU's small, fast SRAM and its much larger but slower main memory (DRAM/HBM) dominates the runtime. To keep this movement low, the selective scan loads its inputs once, keeps the recurrent state in SRAM while scanning, and writes only the final outputs back to DRAM.
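As a rough sequential sketch of the selective recurrence (a reference loop, not the hardware-aware parallel scan; it assumes a diagonal $A$ stored as a vector and the simpler $\bar{B} \approx \Delta B$ discretization):

```python
import torch

def selective_scan(x, A, B, C, delta):
    """x: (L,) scalar inputs, A: (N,), B, C: (L, N), delta: (L,)."""
    L, N = B.shape
    h = torch.zeros(N)
    y = torch.zeros(L)
    for t in range(L):
        A_bar = torch.exp(delta[t] * A)  # per-token discretization of the shared A
        B_bar = delta[t] * B[t]          # input-dependent B for this token
        h = A_bar * h + B_bar * x[t]     # elementwise state update (diagonal A)
        y[t] = torch.dot(C[t], h)        # input-dependent output projection
    return y

L, N = 8, 4
y = selective_scan(torch.randn(L), -torch.rand(N),
                   torch.randn(L, N), torch.randn(L, N), torch.rand(L))
```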
Mamba for Vision
Different models have been proposed for handling image data with Mamba, such as VMamba and ViM. For my experiments, I will be utilizing VMamba. Similar to vision transformers, VMamba divides the image into patches, and the patches are fed through a linear embedding block to get intermediate representations. To properly capture the spatial relationships between patches, a 4-directional SSM scan is performed so that the model sees every patch from all four scan directions.
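A hypothetical sketch of the 4-directional scanning idea; the real VMamba cross-scan and merge modules are more involved:

```python
import torch

def cross_scan_orders(patches):
    """Flatten the (H, W, D) patch grid in four orders (row-major, reverse row-major,
    column-major, reverse column-major); each order would be fed to its own 1-D SSM
    and the outputs merged afterwards."""
    H, W, D = patches.shape
    row_major = patches.reshape(H * W, D)
    col_major = patches.permute(1, 0, 2).reshape(H * W, D)
    return [row_major, row_major.flip(0), col_major, col_major.flip(0)]
```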
Methodology
This experiment is inspired by Famba-V, which merges tokens after they pass through the forward and backward SSM blocks, as can be seen in the following image.
After the tokens pass through the SSM blocks, similar tokens are merged using cosine similarity and a new representation matrix is formed. This allowed for lower training time and memory usage.
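For illustration only, a heavily simplified similarity-based merge; Famba-V's actual merging strategy differs in its details:

```python
import torch
import torch.nn.functional as F

def merge_most_similar(tokens, n_merge):
    """Repeatedly average the most cosine-similar pair of tokens to shrink the sequence."""
    for _ in range(n_merge):
        normed = F.normalize(tokens, dim=-1)
        sim = normed @ normed.T
        sim.fill_diagonal_(-1.0)                        # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])  # most similar pair of tokens
        merged = (tokens[i] + tokens[j]) / 2
        keep = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged[None]], dim=0)
    return tokens
```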
Instead of merging, we aim to transform the tokens. As argued in this paper, simply pruning or merging vision tokens can lead to loss of relevant information. Rather than doing either, the vision tokens are transformed using a transformation matrix.
Constructing this transformation matrix first involves computing attention scores using the attention equivalence formulated for Mamba models in this paper:
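Unrolling the selective recurrence, the contribution of token $j$ to the output at token $i$ acts like an attention weight; the hidden-attention scores take roughly the form:

$$\tilde{\alpha}_{i,j} = C_i \left( \prod_{k=j+1}^{i} \bar{A}_k \right) \bar{B}_j$$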
Afterwards, the transformation matrix is created in the following order:
- The attention score of each token with respect to others is summed to create a vector of aggregated attention scores.
- Another matrix of size (number of patches × number of patches) is created by taking the pairwise cosine similarity between tokens.
- This new matrix is then softmaxed.
After this, the outputs from the selective scan operation are multiplied by this matrix to get the new latent representations.
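A minimal PyTorch sketch of my current understanding of these steps; the tensor shapes and the row-wise softmax are assumptions, and exactly how the aggregated attention scores enter the final matrix is still an open design choice, so the sketch only computes them alongside:

```python
import torch
import torch.nn.functional as F

def token_transform(scan_out, attn):
    """scan_out: (L, D) selective-scan outputs for L patch tokens.
    attn:     (L, L) Mamba hidden-attention matrix for the same tokens."""
    # 1. aggregate each token's attention with respect to all others
    agg_scores = attn.sum(dim=-1)      # (L,)

    # 2. pairwise cosine similarity between tokens -> (L, L) matrix
    normed = F.normalize(scan_out, dim=-1)
    sim = normed @ normed.T            # (L, L)

    # 3. softmax (row-wise here) turns similarities into a transformation matrix
    T = F.softmax(sim, dim=-1)         # (L, L)

    # 4. multiply the scan outputs with the matrix to get new latent representations
    return T @ scan_out, agg_scores    # (L, D), (L,)
```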
This token transformation can be visualized from the following image:
The inner workings of the token transformation can be visualized in the following way:
Potential Pitfalls:
Computing this attention matrix is quadratic in the number of tokens, which might become a bottleneck. To alleviate this, we could utilize the linear attention equivalence proposed in EfficientViT.
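For reference, a sketch of the ReLU-kernel linear attention idea (my reading of the EfficientViT-style formulation; their actual module differs in its details). The trick is to compute $\phi(K)^{\top}V$ first, so the cost grows linearly with the number of tokens:

```python
import torch

def relu_linear_attention(Q, K, V, eps=1e-6):
    """Q, K, V: (L, d). Associativity lets us form K^T V (a d x d matrix) first,
    so the cost is O(L * d^2) instead of O(L^2 * d)."""
    phi_q, phi_k = torch.relu(Q), torch.relu(K)
    kv = phi_k.T @ V                                        # (d, d)
    normalizer = phi_q @ phi_k.sum(dim=0, keepdim=True).T   # (L, 1)
    return (phi_q @ kv) / (normalizer + eps)                # (L, d)
```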
Conclusion:
This is an ongoing blog and, as I mentioned earlier, is going to act as a journal of whatever updates I have on this project.