Lewis Won

Efficient self-attention mechanism


Motivation

Ever felt that an LLM gets very slow, especially as your prompt grows longer? Or that it takes up too much memory? This article explores how we can speed up the self-attention mechanism and reduce its memory usage by making it more efficient, or, in technical parlance, reduce its time and space complexity.

This is a follow-up to the previous article, where I explored the why and how of the softmax self-attention mechanism. We ended with this softmax self-attention equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{in}}}\right)V$$

We will explore how this equation can be tweaked to be more efficient, both in time and space. The paper which I will examine is Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.

Efficient self-attention

To achieve lower memory consumption and higher throughput for longer contexts, we can improve the efficiency of the self-attention mechanism by replacing the softmax function with kernel-based attention (source: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention). I will flesh out the mathematical workings below; if you are more mathematically inclined, you may wish to read the paper directly.

Recall that the softmax self-attention mechanism can be represented as below, where $A$ is the attention matrix:

$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{in}}}\right)$$

$A$ is an $N \times N$ matrix, where $N$ is the sequence length. The element $A_{ij}$ represents the attention weight that the i-th query token gives to the j-th key token.

Because the softmax is applied row-wise, the formula for a single element $A_{ij}$ (the element at the i-th row and j-th column of $A$) is:

$$A_{ij} = \frac{\exp\left(\frac{Q_i K_j^T}{\sqrt{d_{in}}}\right)}{\sum_{k=1}^{N} \exp\left(\frac{Q_i K_k^T}{\sqrt{d_{in}}}\right)}$$

Here:

  • $Q_i$ is the i-th row of the query matrix $Q$.
  • $K_j$ is the j-th row of the key matrix $K$.
  • $Q_i K_j^T$ is the dot product between the i-th query and the j-th key.
  • The denominator sums these exponentiated scores over all keys (from $k = 1$ to $N$) for the fixed query $Q_i$.

If the equation is a mouthful, the following visualisation might make it easier to digest. Imagine both $Q$ and $K$ are 5 x 5 matrices, and we are interested in calculating $A_{22}$. The equation essentially performs what is shown in the diagram below. Note that the diagram omits the division by $\sqrt{d_{in}}$ and the exponentiation, but you get the drift.

Getting A_22

The final output $V'$ is the product of the attention matrix $A$ and the value matrix $V$:

$$V' = AV$$

To find the i-th row of the output, $V'_i$, multiply the i-th row of $A$ by the entire matrix $V$. This results in a weighted sum of all the value vectors $V_j$, where the weights are the attention scores in the i-th row of $A$:

$$V'_i = A_{i1} V_1 + A_{i2} V_2 + \dots + A_{iN} V_N = \sum_{j=1}^{N} A_{ij} V_j$$

Here, $V_j$ is the j-th row (the value vector) of the matrix $V$.

A visual representation of the above summation is below, calculating $V'_2$ as an example:

Calculating V^\prime_2

We can now substitute our formula for $A_{ij}$ into the equation for $V'_i$ and obtain the following:

$$V'_i = \sum_{j=1}^{N} A_{ij} V_j = \sum_{j=1}^{N} \frac{\exp\left(\frac{Q_i K_j^T}{\sqrt{d_{in}}}\right)}{\sum_{k=1}^{N} \exp\left(\frac{Q_i K_k^T}{\sqrt{d_{in}}}\right)} V_j$$
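To make the indices concrete, here is a minimal sketch (the toy dimensions and random seed are my own assumptions) that computes $V'_i$ for one row via the explicit sum and checks it against the matrix form:

```python
# Check the row-wise formula for V'_i against softmax(QK^T / sqrt(d)) @ V.
import torch

torch.manual_seed(0)
N, d = 5, 4                                   # sequence length, head dimension
Q, K, V = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)

# Matrix form: V' = softmax(QK^T / sqrt(d)) V
A = torch.softmax(Q @ K.T / d**0.5, dim=-1)   # (N, N) attention weights
V_prime = A @ V                               # (N, d)

# Element-wise form for a single row i
i = 2
scores = torch.exp(Q[i] @ K.T / d**0.5)       # exp(Q_i . K_k / sqrt(d)) for every key k
V_prime_i = (scores[:, None] * V).sum(dim=0) / scores.sum()

print(torch.allclose(V_prime[i], V_prime_i))  # True
```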

For simplicity of display and to abstract away the mechanics of the matrix operations, we re-express the dot product of the i-th query vector $Q_i$ and the j-th key vector $K_j$ as:

$$Q_i K_j^T = Q_i \cdot K_j$$

So we can re-express $V'_i$ as:

$$V'_i = \sum_{j=1}^{N} \frac{\exp\left(\frac{Q_i \cdot K_j}{\sqrt{d_{in}}}\right)}{\sum_{k=1}^{N} \exp\left(\frac{Q_i \cdot K_k}{\sqrt{d_{in}}}\right)} V_j$$

As the denominator term $\sum_{k=1}^{N}(...)$ is a scalar that is constant for all $j$ in the outer summation, we can pull it out of the summation and get the following:

$$V'_i = \frac{\sum_{j=1}^{N} \exp\left(\frac{Q_i \cdot K_j}{\sqrt{d_{in}}}\right) V_j}{\sum_{k=1}^{N} \exp\left(\frac{Q_i \cdot K_k}{\sqrt{d_{in}}}\right)}$$

Instead of committing to the exponential, we generalise the term $\exp\left(\frac{Q_i \cdot K_j}{\sqrt{d_{in}}}\right)$ to a kernel $\phi\left(Q_i \cdot K_j\right)$, and thus we get:

$$V'_i = \frac{\sum_{j=1}^{N} \phi\left(Q_i \cdot K_j\right) V_j}{\sum_{k=1}^{N} \phi\left(Q_i \cdot K_k\right)}$$

In the paper, the authors set the condition that a valid kernel $\phi$ must be non-negative, i.e. $\phi(x,y): \mathbb{R}^{2 \times d_{out}} \to \mathbb{R}^+$. My understanding is that this constraint is necessary for two reasons:

  1. Interpretability: a negative attention weight is difficult to interpret.

  2. Mathematical stability of the denominator: if weights could be negative, positive and negative weights could cancel each other out, resulting in a denominator of zero, and division by zero is undefined.

(Note: All kernels are functions, but not all functions are kernels. Check out Appendix A for an explanation.)

Since a kernel $\phi\left(Q_i \cdot K_j\right)$ is one that can be represented as a dot product of feature-mapped inputs, i.e. $\phi(x,y) = \langle\phi(x),\phi(y)\rangle = \phi(x)^\top \phi(y)$, we can express $V'_i$ as the following equation:

$$V'_i = \frac{\sum^N_{j=1} \phi\left(Q_i\right)^\top \phi\left(K_j\right) V_j}{\sum^N_{j=1} \phi\left(Q_i\right)^\top \phi\left(K_j\right)}$$

Time and space complexity of softmax self-attention

The problem with the above equation is that, computed naively, it has a time complexity of $O(N^2)$ and a space complexity of $O(N^2)$. As the time complexity of the numerator and the denominator is the same, let us just look at the numerator:

  • $QK^T$: $(N \times D) \cdot (D \times N) = (N \times N)$. This matrix multiplication takes $O(N^2 \times D)$ operations.
  • $\text{softmax}(QK^T / \sqrt{d_{in}})$: this consists of 3 parts:
    • Exponentiation: compute the exponential of each attention score. As there are $N^2$ attention scores, the time complexity is $O(N^2)$.
    • Summation: sum the $N$ exponentiated values in each row to get the denominator, which is $N - 1$ additions per row. As there are $N$ rows, the cost is $N(N - 1) = O(N^2)$.
    • Division: divide each of the $N^2$ exponentiated values by its row sum, which is $N^2$ division operations, so the cost is $O(N^2)$.
  • $AV$: $(N \times N) \cdot (N \times D) = (N \times D)$. This matrix multiplication takes $O(N^2 \times D)$ operations.
  • The total is $O(N^2 \times D + N^2 + N^2 \times D) = O(N^2 D)$. Unlike $N$, which can scale to a very large number as the context (i.e. the user prompt) gets longer, $D$ is a constant, so we can simplify the time complexity to $O(N^2)$.

The space complexity is also $O(N^2)$, because the full $N \times N$ matrix of attention scores must be stored to compute the attention weights.
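A quick sketch (the sizes below are illustrative assumptions) makes the quadratic memory growth tangible: doubling the sequence length roughly quadruples the memory held by the attention matrix.

```python
# Naive softmax attention materialises an (N, N) matrix.
import torch

D = 64
for N in (1_000, 2_000, 4_000):
    Q, K = torch.randn(N, D), torch.randn(N, D)
    A = torch.softmax(Q @ K.T / D**0.5, dim=-1)            # (N, N) attention weights
    mb = A.element_size() * A.nelement() / 1e6
    print(f"N={N}: {A.nelement():,} attention weights (~{mb:.0f} MB)")
```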

Here is how we can reduce the time and space complexity: given the distributive property of matrix multiplication over addition, i.e. $A(B+C) = AB + AC$, we can rewrite the equation as:

$$V'_i = \frac{\phi\left(Q_i\right)^\top \sum^N_{j=1} \phi\left(K_j\right) V_j^\top}{\phi\left(Q_i\right)^\top \sum^N_{j=1} \phi\left(K_j\right)}$$

I will explain in detail how, using $\phi(x) = \text{elu}(x) + 1$ as the kernel feature map, the time and space complexity of the attention mechanism becomes $O(N)$, i.e. it grows linearly, hence the name "linearised self-attention mechanism". This is a large speed-up and a large memory saving compared to the $O(N^2)$ time and space complexity of the softmax self-attention mechanism.

Naive implementation of $\phi(x) = x$ as the kernel feature map

The most straightforward choice is the kernel feature map $\phi(x) = x$, so that $\phi(x, y) = x^\top y$. You can see that it is a valid kernel feature map because it satisfies the condition:

$$\phi(x)^\top\phi(y) = x^\top y = \phi(x, y)$$

(Note: To prove that a function is suitable as a kernel, we need to prove that it is symmetric and positive semi-definite. The proof is in appendix C.)

However, $\phi(x) = x$ is not a valid feature map for our purpose, because we set the condition that the kernel must be non-negative, and $x^\top y$ does not guarantee a non-negative value.

Also, this approach effectively negates the advantages that softmax normalisation provides, because we would be normalising by a simple sum of the denominator terms, as shown in the equation below. This is not ideal: as explained in the previous article, softmax is important for emphasising attention on the tokens that truly matter, thanks to its "winner-takes-most" effect. While we would enjoy a time and space complexity of $O(N)$, we would most likely end up with an LLM that has difficulty focusing on what is important in the context.

$$V'_i = \frac{Q_i^\top \sum^N_{j=1} K_j V_j^\top}{Q_i^\top \sum^N_{j=1} K_j}$$

In addition, this kernel feature map would make the entire attention block a purely linear transformation (ignoring the layer norms for now). Stacking multiple linear transformations on top of each other is mathematically equivalent to a single, larger linear transformation, which could significantly limit the expressive power of the attention mechanism to learn complex patterns.

(Note: See appendix D for proof of how multiple linear transformations could collapse into a single linear transformation.)

A better kernel feature map

The kernel feature map considered by Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention is as follows:

$$\phi(x) = \text{elu}(x) + 1 = \begin{cases} x + 1 & \text{if } x > 0 \\ e^x & \text{if } x \le 0 \end{cases}$$

To prove that $\phi(x) = \text{elu}(x) + 1$ is a valid feature map, we must show that it fulfils the following two criteria:

  1. Theoretical validity: $\phi(x,y)$ is a valid kernel function, i.e. it is (a) symmetric and (b) positive semi-definite.
  2. Practical usefulness: it is a good approximation of the softmax normalisation, i.e. the weights after normalisation are (a) positive and (b) non-linear.

The proof of theoretical validity is in appendix E, and the proof of practical usefulness is in appendix F.

Why can't we linearise the softmax kernel exactly?

With softmax attention, the kernel is an exponential kernel where:

$$\phi(Q_i, K_j) = \exp(Q_i^\top K_j) = \phi(Q_i)^\top \phi(K_j)$$

For a feature map $\phi(x)$, we use the Taylor series expansion of the exponential function:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \dots$$

Substituting the dot product $Q^\top K$ for $x$, we get:

$$\begin{align*} \exp(Q^\top K) &= \sum_{n=0}^{\infty} \frac{(Q^\top K)^n}{n!} && \text{(Taylor series)} \\ &= \sum_{n=0}^{\infty} \frac{1}{n!} \left( \sum_{d=1}^D q_d k_d \right)^n && \text{(expand the dot product)} \\ &= \sum_{n=0}^{\infty} \frac{1}{n!} \sum_{|\alpha|=n} \frac{n!}{\alpha_1! \dots \alpha_D!} \prod_{d=1}^D (q_d k_d)^{\alpha_d} && \text{(apply the multinomial theorem)} \\ &= \sum_{n=0}^{\infty} \sum_{|\alpha|=n} \frac{\prod_{d=1}^D q_d^{\alpha_d} \prod_{d=1}^D k_d^{\alpha_d}}{\alpha_1! \dots \alpha_D!} && \text{(cancel } n! \text{ and separate the } Q \text{ and } K \text{ terms)} \\ &= \sum_{n=0}^{\infty} \sum_{|\alpha|=n} \left( \frac{\prod_{d=1}^D q_d^{\alpha_d}}{\sqrt{\alpha_1! \dots \alpha_D!}} \right) \cdot \left( \frac{\prod_{d=1}^D k_d^{\alpha_d}}{\sqrt{\alpha_1! \dots \alpha_D!}} \right) \\ &= \phi(Q)^\top \phi(K) \end{align*}$$

Since $\phi(Q)$ and $\phi(K)$ are infinite-dimensional vectors, it is impossible to linearise exact softmax attention with a finite feature map.

(Note: there are papers that attempt to linearise by taking only the first few terms of the Taylor series, in which case the vectors are no longer infinite. For example, QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion.)
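To illustrate the truncation idea (this sketch is my own, not the construction used in QT-ViT), a second-order Taylor feature map is already finite: $\exp(q \cdot k) \approx 1 + q \cdot k + \frac{(q \cdot k)^2}{2} = \phi(q) \cdot \phi(k)$, where $\phi(x)$ concatenates a constant 1, the vector $x$, and the scaled outer product of $x$ with itself.

```python
# Second-order Taylor feature map: finite, and a decent approximation of exp(q.k)
# when the scores are small (the scaling below is an assumption for illustration).
import torch

def taylor2_feature_map(x):                      # x: (D,)
    quad = torch.outer(x, x).flatten() / 2**0.5  # captures the (q.k)^2 / 2 term
    return torch.cat([torch.ones(1), x, quad])   # length 1 + D + D^2

torch.manual_seed(0)
D = 8
q, k = torch.randn(D) * 0.3, torch.randn(D) * 0.3

approx = taylor2_feature_map(q) @ taylor2_feature_map(k)
exact = torch.exp(q @ k)
print(f"exact={exact:.4f}  second-order approximation={approx:.4f}")
```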

Time complexity

In comparison, when we define the feature map as $\phi(x) = \text{elu}(x) + 1$, what we do instead is the following:

  1. For the matrix $Q$ of dimension N x D, we apply the element-wise transformation $\text{elu}(x) + 1$ to obtain $\phi(Q)$. The time complexity of this element-wise transformation is O(ND), since a fixed number of operations is performed for each of the N x D elements. Similarly, the time complexity of computing $\phi(K)$, where K is an N x D matrix, is also O(ND).
  2. Next, we compute $\phi(K)^T V$: as $\phi(K)^T$ is D x N and $V$ is N x D, $\phi(K)^T V$ is D x D. The time complexity of this matrix multiplication is $O(ND^2)$.
  3. Finally, we compute $\phi(Q)\left(\phi(K)^T V\right)$, which is also an $O(ND^2)$ operation.
  4. Because N >> D, especially as the context gets longer, we can treat D as a constant, hence the time complexity of the linearised self-attention mechanism is O(N). A numerical check of this reordering is sketched below.
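The following sketch (toy sizes are my own assumptions) verifies that the reordered product $\phi(Q)\left(\phi(K)^T V\right)$ equals $\left(\phi(Q)\phi(K)^T\right) V$ exactly, while only ever forming a D x D intermediate instead of the N x N matrix:

```python
import time
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D = 4_000, 64
Q, K, V = torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)
phi_Q, phi_K = F.elu(Q) + 1, F.elu(K) + 1

t0 = time.time()
out_quadratic = (phi_Q @ phi_K.T) @ V      # builds an (N, N) intermediate
t1 = time.time()
out_linear = phi_Q @ (phi_K.T @ V)         # only a (D, D) intermediate
t2 = time.time()

print(torch.allclose(out_quadratic, out_linear, rtol=1e-3, atol=1e-3))  # True
print(f"quadratic order: {t1 - t0:.3f}s, linear order: {t2 - t1:.3f}s")
```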

Space complexity

Let us do a quick recap of the dimensions of all the matrices for a prompt sequence of length N:

  • Q: N x D (queries)
  • K: N x D (keys)
  • V: N x D (values)
  • $\phi(Q)$: N x D (element-wise $\text{elu}(x) + 1$ applied to Q)
  • $\phi(K)$: N x D (element-wise $\text{elu}(x) + 1$ applied to K)

The storage requirements are as follows:

  1. We first compute $S = \phi(K)^\top V$, which has dimension D x D, hence a space complexity of $O(D^2)$. Note that as N >> D, $S$ is a small matrix and its space complexity can be treated as a constant, i.e. O(1).

  2. Next, we compute $W = \phi(Q) S$, where $W$ has dimension N x D and hence a space complexity of O(ND). As N >> D, this simplifies to O(N). Note that this is unlike softmax attention, where we had to store the N x N matrix of attention scores and weights, which has a space complexity of $O(N^2)$.

  3. The normalisation vector, which holds the normalisation term $\phi(Q_i)^\top \sum^N_{j=1} \phi(K_j)$ for each query $Q_i$, has dimension N x 1, hence a space complexity of O(N).

Implementation of linear self-attention as code

The Python code implementing the linear self-attention mechanism is below. You can simply replace the SelfAttention class from the previous article with the following and it should work out of the box.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSelfAttention(nn.Module):
    """ Linear Self-Attention with a Kernel Function (Equation 3) """
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def kernel_function(self, x):
        """
        Applies the feature map φ(x).
        A common choice is elu + 1 to ensure positivity, mimicking exp().
        """
        return F.elu(x) + 1

    def forward(self, x):
        # x shape: (batch_size, sequence_length, d_in)

        # 1. Project inputs to Q, K, V
        keys = self.W_key(x)      # Shape: (b, n, d_out)
        queries = self.W_query(x) # Shape: (b, n, d_out)
        values = self.W_value(x)  # Shape: (b, n, d_out)

        # 2. Apply the kernel function (feature map) to queries and keys
        queries_projected = self.kernel_function(queries) # φ(Q)
        keys_projected = self.kernel_function(keys)       # φ(K)

        # 3. Reorder operations using the associative property of matrix multiplication
        # Original (inefficient): (φ(Q) @ φ(K).T) @ V
        # Reordered (efficient): φ(Q) @ (φ(K).T @ V)
        # This avoids creating the large (n, n) attention matrix.

        # kv_state shape: (b, d_out, d_out)
        kv_state = torch.bmm(keys_projected.transpose(1, 2), values)

        # context_unnormalized shape: (b, n, d_out)
        context_unnormalized = torch.bmm(queries_projected, kv_state)

        # 4. Calculate the normalization factor
        # The denominator in Eq. 3 is φ(Q_i) @ Σ φ(K_j)
        # normalizer shape: (b, n, 1)
        normalizer = torch.bmm(
            queries_projected,
            keys_projected.sum(dim=1).unsqueeze(-1) # Sum keys over sequence length
        )

        # 5. Apply normalization with a small epsilon for stability
        context_vec = context_unnormalized / (normalizer + 1e-6)

        return context_vec
```
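A quick usage sketch (the batch size, sequence length, and dimensions below are arbitrary assumptions) to confirm the output shape:

```python
torch.manual_seed(0)
batch_size, seq_len, d_in, d_out = 2, 10, 16, 8

attn = LinearSelfAttention(d_in, d_out)
x = torch.randn(batch_size, seq_len, d_in)

out = attn(x)
print(out.shape)  # torch.Size([2, 10, 8])
```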

Any downsides to linear self-attention mechanism?

Some research has shown that the linear self-attention mechanism leads to a decline in model capability. Softmax self-attention is more expressive than linear self-attention: while softmax can create any attention pattern using the full $QK^\top$ matrix, linear attention is constrained to patterns expressible as the dot product of the feature maps $\phi(Q_i)$ and $\phi(K_j)$.

There are also papers that try to enjoy the best of both worlds by training with softmax self-attention and running inference with linear self-attention as an approximation.

Discussions in the paper which I did not touch upon

The paper also discusses causal masking, as well as a neat trick that reduces the space complexity during training (i.e. the forward + backward pass) from $O(ND^2)$ to $O(ND)$. I may discuss these techniques in future articles.

Conclusion

In short, we have shown how linear self-attention can be derived and implemented, and why it achieves linear time and space complexity of O(N), which is far more efficient than the quadratic $O(N^2)$ time and space complexity of softmax self-attention.

What's next

In my next article, my focus will turn more practical as I try to implement from scratch a code base self-documentation service, similar to Devin's DeepWiki. While there are already open source alternatives such as this, I believe a deeper understanding of how to build such a service will serve as a solid foundation to further improve the quality of documentation generated by LLM.

Appendix A - What is the difference between a kernel and a function?

(Note: The appendixes are written with the help of Google Gemini 2.5 Pro)

A simple way to think about the difference is:

Every kernel is a function, but not every function is a kernel.


Function

A function is a general mathematical concept. It is simply a rule that maps an input from a set A to a single output in a set B.

  • Inputs: Can take one, two, or many inputs.
  • Purpose: Can do anything. It can describe a relationship ($y = x^2$), perform an action (print("hello")), or calculate a value (sin(x)).
  • Examples:
    • f(x) = x + 5 (takes one number, outputs another)
    • g(x, y) = x * y (takes two numbers, outputs their product)
    • h(text) = length(text) (takes a string, outputs its length)

There are no other constraints. If it's a well-defined rule mapping inputs to an output, it's a function.

Kernel

A kernel, in the context of machine learning, is a special kind of function with two key characteristics: its purpose and its mathematical properties.

1. The Purpose: To Measure Similarity

The primary job of a kernel k(x, y) is to take two inputs, x and y, and return a single scalar value that quantifies how "similar" they are. A large value means they are very similar; a small or zero value means they are dissimilar.

This is why the sim(·) function in the paper is a perfect candidate to be thought of as a kernel. Its job is to measure the similarity between a query $Q_i$ and a key $K_j$.

2. The Mathematical Property: The "Kernel Trick"

A function k(x, y) is formally considered a kernel if it can be expressed as a dot product in some (potentially very high-dimensional) feature space.

Mathematically, a function k is a kernel if there exists a feature map $\phi$ (phi) such that for all x and y:

$$k(x, y) = \phi(x) \cdot \phi(y)$$

Let's unpack this:

  • $\phi(x)$ (the feature map): This is a function that takes a single input x and transforms it into a new, often much longer, vector. It maps the data into a "feature space" where, hopefully, relationships are easier to see.
  • $\cdot$ (the dot product): This is the standard dot product between the two transformed vectors.

The "kernel trick" is that you can often compute k(x, y) very easily and cheaply, without ever having to compute the potentially massive or even infinite-dimensional vectors ϕ(x)\phi(x) and ϕ(y)\phi(y) .

Example: The Polynomial Kernel

Let x and y be simple 2D vectors.
Consider the kernel function $k(x, y) = (x \cdot y)^2$. This is very easy to compute.

It turns out this simple function is equivalent to first mapping x and y into a higher-dimensional space with a feature map $\phi$ and then taking their dot product.
If $x = [x_1, x_2]$, then a valid feature map is $\phi(x) = [x_1^2, x_2^2, \sqrt{2}x_1x_2]$.
Calculating $\phi(x) \cdot \phi(y)$ gives you exactly the same result as $(x \cdot y)^2$, but $(x \cdot y)^2$ is much faster to compute.
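A small sketch (the example vectors are arbitrary) verifying the polynomial-kernel example numerically:

```python
import torch

def poly_feature_map(v):  # v = [v1, v2]
    return torch.stack([v[0]**2, v[1]**2, 2**0.5 * v[0] * v[1]])

x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, -1.0])

kernel_value = (x @ y) ** 2                             # (1*3 + 2*(-1))^2 = 1
feature_dot = poly_feature_map(x) @ poly_feature_map(y)
print(kernel_value.item(), feature_dot.item())          # both 1.0
```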

The formal requirement for a function to be a valid kernel is given by Mercer's Theorem, which states that the function must be "positive semi-definite". This condition guarantees that a feature map $\phi$ exists, even if we don't know what it is. (source)

Comparison Table

| Aspect | Function | Kernel |
| --- | --- | --- |
| Core idea | A rule that maps inputs to an output. | A special function that measures similarity between two inputs. |
| Number of inputs | Can be one or more. | Almost always takes two inputs (x and y). |
| Purpose | Can be anything. | To compute a similarity score. |
| Key property | None, beyond being a well-defined mapping. | Can be expressed as a dot product in a feature space: $k(x, y) = \phi(x) \cdot \phi(y)$. Must be positive semi-definite. |
| Example | f(x) = sin(x) | $k(x, y) = (x \cdot y + c)^d$ (polynomial kernel) |

Appendix B - What are positive, positive definite, and positive semi-definite matrices?

The most intuitive way to understand the difference is by using an analogy with regular numbers.

The Core Intuition: An Analogy with Numbers

  • A positive definite matrix is the matrix equivalent of a strictly positive number (like 3, 0.5, or 100). A number a > 0.
  • A positive semi-definite matrix is the matrix equivalent of a non-negative number (like 3, 0.5, 100, and also 0). A number a ≥ 0.

The key difference is whether zero is allowed.

Now, how does this concept of "positiveness" apply to a matrix, which is a whole grid of numbers? It's not about the individual elements of the matrix being positive. Instead, it's about how the matrix behaves when it interacts with vectors.

The key operation to test this "positiveness" is the quadratic form: $x^\top A x$, where A is our matrix and x is any non-zero vector. This operation always results in a single number (a scalar). We look at the sign of that resulting number.


Positive Semi-Definite (The "Non-Negative" Case, ≥ 0)

A symmetric matrix A is positive semi-definite if, for every possible non-zero vector x, the result of $x^\top A x$ is greater than or equal to zero.

$$x^\top A x \ge 0 \quad \forall\, x \neq 0$$

  • What it means: The matrix A will never "flip" a vector to point in the opposite direction.
  • The "Zero" Case: It is possible for $x^\top A x$ to equal zero even when the vector x is not zero. This means there are some directions in space that the matrix A "squashes" or "flattens" completely.
  • Eigenvalues: All eigenvalues of the matrix are non-negative (λ ≥ 0). The existence of a zero eigenvalue corresponds to a direction that gets squashed.
  • Invertibility: A positive semi-definite matrix might not be invertible. If it has a zero eigenvalue, its determinant is zero, and it's singular (not invertible).

Positive Definite (The "Strictly Positive" Case, > 0)

A symmetric matrix A is positive definite if, for every possible non-zero vector x, the result of $x^\top A x$ is strictly greater than zero.

$$x^\top A x > 0 \quad \forall\, x \neq 0$$

  • What it means: This is a stricter condition. The matrix A not only doesn't flip vectors, it also doesn't completely squash any of them (unless the vector was zero to begin with).
  • The "Zero" Case: The only way for $x^\top A x$ to equal zero is if the vector x itself is the zero vector.
  • Eigenvalues: All eigenvalues of the matrix are strictly positive (λ > 0). There are no zero eigenvalues.
  • Invertibility: A positive definite matrix is always invertible. Since none of its eigenvalues are zero, its determinant is non-zero.

Comparison Table

| Feature | Positive Semi-Definite | Positive Definite (Stricter) |
| --- | --- | --- |
| Condition | $x^\top A x \ge 0$ | $x^\top A x > 0$ |
| Analogy | Non-negative number (a ≥ 0) | Strictly positive number (a > 0) |
| Can $x^\top A x$ be zero? | Yes, for some non-zero x. | No, only if x is the zero vector. |
| Eigenvalues (λ) | All λ ≥ 0 (can include zero). | All λ > 0 (must be positive). |
| Invertibility | May not be invertible. | Always invertible. |
| Geometric intuition | A transformation that doesn't flip vectors, but might collapse some directions to zero. | A transformation that doesn't flip vectors and preserves some "energy" in all directions. |

Why This Matters for Kernels

A function k(x, y) is a valid kernel if and only if the "Gram matrix" it produces is positive semi-definite.

  • Gram Matrix: For any set of data points x_1, x_2, ..., x_n, the Gram matrix K is the n x n matrix where the entry K_ij = k(x_i, x_j).

The condition for a valid kernel is semi-definite, not definite. This is because we want to allow for cases where two different data points are perfectly "dissimilar" or orthogonal in the feature space (k(x_i, x_j) = 0) or where two different points x_i and x_j might actually map to the exact same location in the feature space (φ(x_i) = φ(x_j)). These situations lead to a Gram matrix that is only semi-definite, and we need to include them for the theory to be general enough.
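As a quick numerical sanity check (random data is an assumption), we can build the Gram matrix of the $\text{elu}(x) + 1$ feature map used in this article and confirm that it is symmetric with non-negative eigenvalues:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(6, 4)                  # 6 data points of dimension 4
phi = F.elu(X) + 1                     # feature-map each point
K = phi @ phi.T                        # Gram matrix: K_ij = phi(x_i) . phi(x_j)

print(torch.allclose(K, K.T))                          # symmetric
print(torch.linalg.eigvalsh(K).min().item() >= -1e-5)  # eigenvalues >= 0 (up to float error)
```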


Proof

We want to show that if K is symmetric and PSD, then there exists a feature map φ such that Kᵢⱼ = φ(xᵢ) ⋅ φ(xⱼ).

Proof using the Spectral Theorem:

  1. Eigendecomposition: The Spectral Theorem states that any real symmetric matrix K can be diagonalized by an orthogonal matrix of its eigenvectors.
    K = UΛUᵀ
    where:

    • U is an orthogonal matrix whose columns u₁, ..., uₙ are the eigenvectors of K.
    • Λ is a diagonal matrix whose entries λ₁, ..., λₙ are the corresponding real eigenvalues.
  2. Examine a Single Element: Let's write out what this decomposition means for a single element Kᵢⱼ:
    Kᵢⱼ = (UΛUᵀ)ᵢⱼ = ∑ₖ uᵢₖ λₖ uⱼₖ
    Here, uᵢₖ is the i-th component of the k-th eigenvector uₖ.

  3. Use the PSD Property: A fundamental property of a PSD matrix is that all its eigenvalues are non-negative: λₖ ≥ 0 for all k. This is the crucial step.

  4. Construct the Feature Map: Since λₖ ≥ 0, we can take its square root. Let's define a feature map φ that maps each point xᵢ to a new vector in ℝⁿ:
    φ(xᵢ) = [√λ₁ uᵢ₁, √λ₂ uᵢ₂, ..., √λₙ uᵢₙ]

  5. Verify the Dot Product: Now, let's compute the dot product φ(xᵢ) ⋅ φ(xⱼ) using our new feature map:
    φ(xᵢ) ⋅ φ(xⱼ) = ∑ₖ (√λₖ uᵢₖ) * (√λₖ uⱼₖ)
    = ∑ₖ (√λₖ)² uᵢₖ uⱼₖ
    = ∑ₖ λₖ uᵢₖ uⱼₖ

  6. Conclusion: This final expression is exactly our expansion for Kᵢⱼ from Step 2. We have successfully shown that Kᵢⱼ = φ(xᵢ) ⋅ φ(xⱼ). This proves the theorem for the finite-dimensional matrix case.
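A numerical sketch of this construction (the random PSD matrix below is an assumption): build φ(xᵢ) from the eigendecomposition and check that the dot products recover K.

```python
import torch

torch.manual_seed(0)
B = torch.randn(5, 5)
K = B @ B.T                              # symmetric PSD by construction

eigvals, U = torch.linalg.eigh(K)        # K = U diag(eigvals) U^T
eigvals = eigvals.clamp(min=0)           # guard against tiny negative float error
Phi = U * eigvals.sqrt()                 # row i is the feature vector phi(x_i)

print(torch.allclose(Phi @ Phi.T, K, atol=1e-4))  # K_ij = phi(x_i) . phi(x_j)
```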

Appendix C: proof that the identity matrix is symmetrical and positive semi-definite

Definition of a Symmetric Matrix:
A matrix A is symmetric if it is equal to its transpose, i.e., $A = A^\top$. The transpose of a matrix is found by swapping its rows and columns. In terms of elements, this means $(A^\top)_{ij} = A_{ji}$ for all i and j.

Proof:
We need to show that $I = I^\top$. Let's prove this by showing that their corresponding elements are equal for all i and j.

  1. Let $(I^\top)_{ij}$ be the element in the i-th row and j-th column of the matrix $I^\top$.
  2. By the definition of the transpose, $(I^\top)_{ij} = I_{ji}$.
  3. Now, let's use the definition of the identity matrix to evaluate $I_{ji}$:
    • If j = i, then $I_{ji} = 1$.
    • If $j \neq i$, then $I_{ji} = 0$.
  4. Let's compare this to the elements of the original identity matrix, $I_{ij}$:
    • If i = j, then $I_{ij} = 1$.
    • If $i \neq j$, then $I_{ij} = 0$.

Notice that the condition "j = i" is identical to "i = j", and $j \neq i$ is identical to $i \neq j$.

Therefore, for all possible values of i and j, we have shown that $(I^\top)_{ij} = I_{ji} = I_{ij}$. Since all corresponding elements are equal, the matrices are equal.

Conclusion: $I = I^\top$, so the identity matrix is symmetric.


Part 2: Proof that the Identity Matrix is Positive Semi-Definite

A symmetric matrix A is positive semi-definite if, for every non-zero column vector x, the quadratic form $\mathbf{x}^\top A \mathbf{x}$ is non-negative.

$$\mathbf{x}^\top A \mathbf{x} \ge 0$$

We can prove this for the identity matrix using two common methods.

Method 1: Using the Definition of the Quadratic Form

  1. Let $I$ be the $n \times n$ identity matrix and let $\mathbf{x}$ be any $n \times 1$ column vector:
$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
  2. We need to evaluate the quadratic form $\mathbf{x}^\top I \mathbf{x}$.

  3. First, let's compute the product $I\mathbf{x}$. By the definition of the identity matrix, multiplying any vector by $I$ results in the vector itself:
    $I\mathbf{x} = \mathbf{x}$

  4. Now, substitute this back into the quadratic form:
    $\mathbf{x}^\top I \mathbf{x} = \mathbf{x}^\top \mathbf{x}$

  5. The expression $\mathbf{x}^\top \mathbf{x}$ is the dot product of the vector $\mathbf{x}$ with itself:
    $$\mathbf{x}^\top \mathbf{x} = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1^2 + x_2^2 + \cdots + x_n^2$$

  6. This sum can be written as $\sum_{i=1}^n x_i^2$.

  7. Since each $x_i$ is a real number, its square $x_i^2$ must be greater than or equal to zero ($x_i^2 \ge 0$). The sum of non-negative numbers is also non-negative.

Therefore, $\mathbf{x}^\top I \mathbf{x} = \sum_{i=1}^n x_i^2 \ge 0$.

This holds true for any vector $\mathbf{x}$. Thus, the identity matrix is positive semi-definite.

Method 2: Using Eigenvalues

Definition: A symmetric matrix is positive semi-definite if and only if all of its eigenvalues are non-negative ($\lambda_i \ge 0$).

  1. Let's find the eigenvalues of the identity matrix I. The eigenvalue equation is $A\mathbf{v} = \lambda\mathbf{v}$, where $\lambda$ is an eigenvalue and $\mathbf{v}$ is its corresponding eigenvector.

  2. For the identity matrix, this equation becomes:
    $I\mathbf{v} = \lambda\mathbf{v}$

  3. Since $I\mathbf{v} = \mathbf{v}$, we have: $\mathbf{v} = \lambda\mathbf{v}$

  4. Rearranging the terms gives:
    $\mathbf{v} - \lambda\mathbf{v} = \mathbf{0}$
    $(1 - \lambda)\mathbf{v} = \mathbf{0}$

  5. By definition, an eigenvector $\mathbf{v}$ must be a non-zero vector. For the equation $(1-\lambda)\mathbf{v} = \mathbf{0}$ to be true with $\mathbf{v} \neq \mathbf{0}$, the scalar multiplier must be zero: $1 - \lambda = 0 \implies \lambda = 1$

  6. An $n \times n$ identity matrix has n eigenvalues, and they are all equal to 1.

  7. The condition for positive semi-definiteness is that all eigenvalues must be non-negative. Since all eigenvalues of I are 1, and $1 \ge 0$, the condition is satisfied.

Conclusion: The identity matrix is positive semi-definite.

Summary

We have shown that:

  1. $I = I^\top$, which proves symmetry.
  2. $\mathbf{x}^\top I \mathbf{x} \ge 0$ for all vectors $\mathbf{x}$ (and its eigenvalues are all $\ge 0$), which proves positive semi-definiteness.

Note: In fact, we have shown something stronger. Since $\mathbf{x}^\top I \mathbf{x} > 0$ for all non-zero vectors $\mathbf{x}$ and its eigenvalues are all strictly positive (1 > 0), the identity matrix is actually positive definite, which is a subset of positive semi-definite.

Appendix D: why stacking of multiple linear transformations is equivalent to a single linear transformation

1. Mathematical Foundation: What is a Linear Transformation?

First, let's formally define our terms. A transformation (or function) $T$ that maps vectors from one vector space to another ($T: \mathbb{R}^n \to \mathbb{R}^m$) is linear if it satisfies two properties for any vectors $\vec{u}, \vec{v}$ and any scalar c:

  1. Additivity: $T(\vec{u} + \vec{v}) = T(\vec{u}) + T(\vec{v})$
  2. Homogeneity (scalar multiplication): $T(c\vec{u}) = cT(\vec{u})$

A cornerstone theorem of linear algebra states that every linear transformation can be represented by a matrix multiplication. If T is a linear transformation, there exists a unique matrix A such that for any vector $\vec{x}$:

$$T(\vec{x}) = A\vec{x}$$

This matrix representation is the key to our proof.


2. The Proof for Two Layers

Let's start by proving the case for two stacked transformations. "Stacking" is mathematically known as function composition.

  1. Let's define two linear transformations:

    • Let $T_1: \mathbb{R}^n \to \mathbb{R}^m$ be our first linear transformation. It takes an n-dimensional vector and maps it to an m-dimensional vector. It can be represented by an $m \times n$ matrix, which we'll call $A_1$. So, $T_1(\vec{x}) = A_1\vec{x}$.
    • Let $T_2: \mathbb{R}^m \to \mathbb{R}^p$ be our second linear transformation. It takes an m-dimensional vector (the output of $T_1$) and maps it to a p-dimensional vector. It can be represented by a $p \times m$ matrix, which we'll call $A_2$. So, $T_2(\vec{y}) = A_2\vec{y}$.

  2. "Stacking" $T_1$ and then $T_2$ means we first apply $T_1$ to an input vector $\vec{x}$, and then apply $T_2$ to the result. Let's call this new, composed transformation L:

    $$L(\vec{x}) = T_2(T_1(\vec{x}))$$

  3. Now, let's substitute the matrix representations into this equation:

    • First, replace $T_1(\vec{x})$ with its matrix form: $L(\vec{x}) = T_2(A_1\vec{x})$
    • The term $A_1\vec{x}$ is just a vector. Let's think of it as the input to $T_2$. We can now apply the matrix representation for $T_2$: $L(\vec{x}) = A_2(A_1\vec{x})$

  4. Here is the crucial step. Matrix multiplication is associative. This means that for compatible matrices A, B, and C, we have $(AB)C = A(BC)$. We can regroup our expression:

    $$L(\vec{x}) = (A_2A_1)\vec{x}$$

  5. Let's define a new matrix $C = A_2A_1$. Since $A_2$ is a $p \times m$ matrix and $A_1$ is an $m \times n$ matrix, their product C will be a single $p \times n$ matrix.

  6. By substituting C back into our equation, we get:

    $$L(\vec{x}) = C\vec{x}$$

This shows that the entire composition of two linear transformations, L, can be represented by multiplication with a single matrix C. A transformation defined by a single matrix multiplication is, by definition, a linear transformation.

Therefore, the composition of two linear transformations is itself a single linear transformation.


3. Generalization to 'N' Layers

We can easily extend this logic to any number of layers by induction.

Imagine a stack of N linear transformations, $T_1, T_2, ..., T_N$, represented by matrices $A_1, A_2, ..., A_N$. The final transformation L is:

$$L(\vec{x}) = T_N(T_{N-1}(...T_2(T_1(\vec{x}))...))$$

Substituting the matrix forms gives:

$$L(\vec{x}) = A_N(A_{N-1}(...A_2(A_1\vec{x})...))$$

Because matrix multiplication is associative, we can drop all the parentheses and multiply the matrices together first:

$$L(\vec{x}) = (A_N A_{N-1} ... A_2 A_1)\vec{x}$$

The product of all these matrices, $C_{total} = A_N A_{N-1} ... A_2 A_1$, is still just one single matrix.

Therefore, the entire stack of N transformations is equivalent to a single linear transformation represented by the matrix $C_{total}$.
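A minimal PyTorch sketch of this collapse (toy sizes, no biases or activations, matching the assumptions of the proof):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(4, 6, bias=False)   # A1: maps R^4 -> R^6
layer2 = nn.Linear(6, 3, bias=False)   # A2: maps R^6 -> R^3

x = torch.randn(5, 4)                  # a batch of 5 input vectors

stacked = layer2(layer1(x))                           # T2(T1(x))
C = layer2.weight @ layer1.weight                     # single matrix A2 A1, shape (3, 4)
collapsed = x @ C.T                                   # one linear transformation

print(torch.allclose(stacked, collapsed, atol=1e-6))  # True
```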

Appendix E: theoretical validity of elu(x) + 1 as a kernel feature map

To prove symmetry, we must prove that $\phi(x,y) = \phi(y,x)$.

The kernel induced by the feature map is $\phi(x,y) = \phi(x)^\top \phi(y)$, where $\phi(x) = \text{elu}(x) + 1$ is applied element-wise. Since the dot product is commutative, i.e. $x^\top y = \sum^d_{i=1} x_i y_i = \sum^d_{i=1} y_i x_i = y^\top x$, we have $\phi(x,y) = \phi(x)^\top \phi(y) = \phi(y)^\top \phi(x) = \phi(y,x)$.

Next, prove that the function is positive semi-definite. Let's check the condition cᵀKc ≥ 0.
cᵀKc = Σᵢ Σⱼ cᵢ cⱼ k(xᵢ, xⱼ)
Substitute our kernel definition:
= Σᵢ Σⱼ cᵢ cⱼ (φ(xᵢ)ᵀφ(xⱼ))
We can rearrange this sum:
= (Σᵢ cᵢ φ(xᵢ)ᵀ) (Σⱼ cⱼ φ(xⱼ))
Let the vector v = Σᵢ cᵢ φ(xᵢ). The expression becomes:
= vᵀv
This is simply the dot product of a vector with itself, which is the squared L2 norm ||v||². The squared norm of any real vector is always greater than or equal to 0.
Positive semi-definiteness is automatically satisfied.

Appendix F: practical usefulness of elu(x) + 1 as a kernel feature map

We want to prove that elu(x) + 1 is a practically useful kernel feature map where:

  1. $\phi(Q_i)^\top \phi(K_j) \ge 0$, i.e. the unnormalised weights are non-negative.
  2. $\text{elu}(x) + 1$ is a non-linear feature map.

We recall that the equation of the feature map is:

$$\phi(x) = \text{elu}(x) + 1 = \begin{cases} x + 1 & \text{if } x > 0 \\ e^x & \text{if } x \le 0 \end{cases}$$

Given the equation, every element of $\phi(Q_i)$ and $\phi(K_j)$ is guaranteed to be positive, hence their dot product must also be non-negative. Therefore, $\phi(Q_i)^\top \phi(K_j) \ge 0$.

On linearity: a function f is linear if $f(ax + by) = af(x) + bf(y)$.

Let us evaluate $\phi(ax + by)$, where a = 2, x = 1, b = 1, and y = -1:

LHS: $\phi(2 \cdot 1 + 1 \cdot (-1)) = \phi(1) = 1 + 1 = 2$

RHS: $2\phi(1) + 1 \cdot \phi(-1) = 2 \cdot 2 + e^{-1} = 4 + 0.3679 \approx 4.368$

As $\text{LHS} \neq \text{RHS}$, $\phi(x) = \text{elu}(x) + 1$ is a non-linear feature map.
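The same check in code (a tiny sketch using PyTorch's F.elu):

```python
import torch
import torch.nn.functional as F

def phi(x):
    return F.elu(x) + 1

a, x, b, y = 2.0, torch.tensor(1.0), 1.0, torch.tensor(-1.0)

lhs = phi(a * x + b * y)           # phi(1) = 2
rhs = a * phi(x) + b * phi(y)      # 2 * phi(1) + phi(-1) ~ 4.368
print(lhs.item(), rhs.item())      # not equal, so phi is non-linear
```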
