Lewis Won

Efficient self-attention mechanism


Motivation

Ever felt that an LLM gets very slow, especially as your prompt grows longer? Or that it takes up too much memory? This article explores how we can speed up the self-attention mechanism and reduce its memory usage by making it more efficient, or, in technical parlance, reduce its time and space complexity.

This is a follow-up to the previous article, where I explored the why and how of the softmax self-attention mechanism. We ended with this softmax self-attention equation:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{in}}}\right)V$$

We will explore how this equation can be tweaked to be more efficient, both in time and space. The paper which I will examine is Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.

Efficient self-attention

To achieve lower memory consumption and higher throughput for longer contexts, we can improve the efficiency of the self-attention mechanism by replacing the softmax function with kernel-based attention (source: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention). I will flesh out the mathematical workings below; if you are more mathematically inclined, you may wish to read the paper directly.

Recall that the softmax self-attention mechanism can be represented as below, where $A$ is the attention matrix:

$$A = \text{softmax}\left(\frac{QK^T}{\sqrt{d_{in}}}\right)$$

$A$ is an $N \times N$ matrix, where $N$ is the sequence length. The element $A_{ij}$ represents the attention weight that the i-th query token gives to the j-th key token.

Because the softmax is applied row-wise, the formula for a single element $A_{ij}$ (the element at the i-th row and j-th column of $A$) is:

$$A_{ij} = \frac{\exp\left(\frac{Q_i K_j^T}{\sqrt{d_{in}}}\right)}{\sum_{k=1}^{N} \exp\left(\frac{Q_i K_k^T}{\sqrt{d_{in}}}\right)}$$

Here:

  • $Q_i$ is the i-th row of the query matrix $Q$.
  • $K_j$ is the j-th row of the key matrix $K$.
  • $Q_i K_j^T$ is the dot product between the i-th query and the j-th key.
  • The denominator sums these exponentiated scores over all keys (from $k = 1$ to $N$) for the fixed query $Q_i$.

If the equation is a mouthful, the following visualisation might make it easier to digest. Imagine both $Q$ and $K$ are 5 x 5 matrices, and we are interested in calculating $A_{22}$. The equation essentially performs what is shown in the diagram below. Note that the diagram omits the division by $\sqrt{d_{in}}$ and the exponentiation, but you get the drift.

Getting A_22

The final output $V'$ is the product of the attention matrix $A$ and the value matrix $V$:

$$V' = AV$$

To find the i-th row of the output, $V'_i$, multiply the i-th row of $A$ by the entire matrix $V$. This results in a weighted sum of all the value vectors $V_j$, where the weights are the attention scores in the i-th row of $A$:

$$V'_i = A_{i1} V_1 + A_{i2} V_2 + \dots + A_{iN} V_N = \sum_{j=1}^{N} A_{ij} V_j$$

Here, $V_j$ is the j-th row (the value vector) of the matrix $V$.

A visual representation of the above summation is below, calculating $V'_2$ as an example:

Calculating V^\prime_2

We can now substitute our formula for $A_{ij}$ into the equation for $V'_i$ and obtain the following:

$$V'_i = \sum_{j=1}^{N} A_{ij} V_j = \sum_{j=1}^{N} \frac{\exp\left(\frac{Q_i K_j^T}{\sqrt{d_{in}}}\right)}{\sum_{k=1}^{N} \exp\left(\frac{Q_i K_k^T}{\sqrt{d_{in}}}\right)} V_j$$
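To make the indices concrete, here is a minimal sketch (the toy dimensions and random seed are my own assumptions) that computes $V'_i$ for one row via the explicit sum and checks it against the matrix form:

```python
# Check the row-wise formula for V'_i against softmax(QK^T / sqrt(d)) @ V.
import torch

torch.manual_seed(0)
N, d = 5, 4                                   # sequence length, head dimension
Q, K, V = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)

# Matrix form: V' = softmax(QK^T / sqrt(d)) V
A = torch.softmax(Q @ K.T / d**0.5, dim=-1)   # (N, N) attention weights
V_prime = A @ V                               # (N, d)

# Element-wise form for a single row i
i = 2
scores = torch.exp(Q[i] @ K.T / d**0.5)       # exp(Q_i . K_k / sqrt(d)) for every key k
V_prime_i = (scores[:, None] * V).sum(dim=0) / scores.sum()

print(torch.allclose(V_prime[i], V_prime_i))  # True
```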

For simplicity of display and to abstract away the mechanics of the matrix operations, we re-express the dot product of the i-th query vector $Q_i$ and the j-th key vector $K_j$ as:

$$Q_i K_j^T = Q_i \cdot K_j$$

So we can re-express $V'_i$ as:

$$V'_i = \sum_{j=1}^{N} \frac{\exp\left(\frac{Q_i \cdot K_j}{\sqrt{d_{in}}}\right)}{\sum_{k=1}^{N} \exp\left(\frac{Q_i \cdot K_k}{\sqrt{d_{in}}}\right)} V_j$$

As the denominator term $\sum_{k=1}^{N}(...)$ is a scalar that is constant for all $j$ in the outer summation, we can pull it out of the summation and get the following:

$$V'_i = \frac{\sum_{j=1}^{N} \exp\left(\frac{Q_i \cdot K_j}{\sqrt{d_{in}}}\right) V_j}{\sum_{k=1}^{N} \exp\left(\frac{Q_i \cdot K_k}{\sqrt{d_{in}}}\right)}$$

Instead of committing to the exponential, we generalise the term $\exp\left(\frac{Q_i \cdot K_j}{\sqrt{d_{in}}}\right)$ to a kernel $\phi\left(Q_i \cdot K_j\right)$, and thus we get:

$$V'_i = \frac{\sum_{j=1}^{N} \phi\left(Q_i \cdot K_j\right) V_j}{\sum_{k=1}^{N} \phi\left(Q_i \cdot K_k\right)}$$

In the paper, the authors set the condition that a valid kernel $\phi$ must be non-negative, i.e. $\phi(x,y): \mathbb{R}^{2 \times d_{out}} \to \mathbb{R}^+$. My understanding is that this constraint is necessary for two reasons:

  1. Interpretability: a negative attention weight is difficult to interpret.

  2. Mathematical stability of the denominator: if weights could be negative, positive and negative weights could cancel each other out, resulting in a denominator of zero, and division by zero is undefined.

(Note: All kernels are functions, but not all functions are kernels. Check out Appendix A for an explanation.)

Since a kernel $\phi\left(Q_i \cdot K_j\right)$ is one that can be represented as a dot product of feature-mapped inputs, i.e. $\phi(x,y) = \langle\phi(x),\phi(y)\rangle = \phi(x)^\top \phi(y)$, we can express $V'_i$ as the following equation:

$$V'_i = \frac{\sum^N_{j=1} \phi\left(Q_i\right)^\top \phi\left(K_j\right) V_j}{\sum^N_{j=1} \phi\left(Q_i\right)^\top \phi\left(K_j\right)}$$

Time and space complexity of softmax self-attention

The problem with the above equation is that, computed naively, it has a time complexity of $O(N^2)$ and a space complexity of $O(N^2)$. As the time complexity of the numerator and the denominator is the same, let us just look at the numerator:

  • $QK^T$: $(N \times D) \cdot (D \times N) = (N \times N)$. This matrix multiplication takes $O(N^2 \times D)$ operations.
  • $\text{softmax}(QK^T / \sqrt{d_{in}})$: this consists of 3 parts:
    • Exponentiation: compute the exponential of each attention score. As there are $N^2$ attention scores, the time complexity is $O(N^2)$.
    • Summation: sum the $N$ exponentiated values in each row to get the denominator, which is $N - 1$ additions per row. As there are $N$ rows, the cost is $N(N - 1) = O(N^2)$.
    • Division: divide each of the $N^2$ exponentiated values by its row sum, which is $N^2$ division operations, so the cost is $O(N^2)$.
  • $AV$: $(N \times N) \cdot (N \times D) = (N \times D)$. This matrix multiplication takes $O(N^2 \times D)$ operations.
  • The total is $O(N^2 \times D + N^2 + N^2 \times D) = O(N^2 D)$. Unlike $N$, which can scale to a very large number as the context (i.e. the user prompt) gets longer, $D$ is a constant, so we can simplify the time complexity to $O(N^2)$.

The space complexity is also $O(N^2)$, because the full $N \times N$ matrix of attention scores must be stored to compute the attention weights.
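A quick sketch (the sizes below are illustrative assumptions) makes the quadratic memory growth tangible: doubling the sequence length roughly quadruples the memory held by the attention matrix.

```python
# Naive softmax attention materialises an (N, N) matrix.
import torch

D = 64
for N in (1_000, 2_000, 4_000):
    Q, K = torch.randn(N, D), torch.randn(N, D)
    A = torch.softmax(Q @ K.T / D**0.5, dim=-1)            # (N, N) attention weights
    mb = A.element_size() * A.nelement() / 1e6
    print(f"N={N}: {A.nelement():,} attention weights (~{mb:.0f} MB)")
```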

Here is how we can reduce the time and space complexity: given the distributive property of matrix multiplication over addition, i.e. $A(B+C) = AB + AC$, we can rewrite the equation as:

$$V'_i = \frac{\phi\left(Q_i\right)^\top \sum^N_{j=1} \phi\left(K_j\right) V_j^\top}{\phi\left(Q_i\right)^\top \sum^N_{j=1} \phi\left(K_j\right)}$$

I will explain in detail how, using $\phi(x) = \text{elu}(x) + 1$ as the kernel feature map, the time and space complexity of the attention mechanism becomes $O(N)$, i.e. it grows linearly, hence the name "linearised self-attention mechanism". This is a large speed-up and a large memory saving compared to the $O(N^2)$ time and space complexity of the softmax self-attention mechanism.

Naive implementation of $\phi(x) = x$ as the kernel feature map

The most straightforward choice is the kernel feature map $\phi(x) = x$, so that $\phi(x, y) = x^\top y$. You can see that it is a valid kernel feature map because it satisfies the condition:

$$\phi(x)^\top\phi(y) = x^\top y = \phi(x, y)$$

(Note: To prove that a function is suitable as a kernel, we need to prove that it is symmetric and positive semi-definite. The proof is in appendix C.)

However, $\phi(x) = x$ is not a valid feature map for our purpose, because we set the condition that the kernel must be non-negative, and $x^\top y$ does not guarantee a non-negative value.

Also, this approach effectively negates the advantages that softmax normalisation provides, because we would be normalising by a simple sum of the denominator terms, as shown in the equation below. This is not ideal: as explained in the previous article, softmax is important for emphasising attention on the tokens that truly matter, thanks to its "winner-takes-most" effect. While we would enjoy a time and space complexity of $O(N)$, we would most likely end up with an LLM that has difficulty focusing on what is important in the context.

$$V'_i = \frac{Q_i^\top \sum^N_{j=1} K_j V_j^\top}{Q_i^\top \sum^N_{j=1} K_j}$$

In addition, this kernel feature map would make the entire attention block a purely linear transformation (ignoring the layer norms for now). Stacking multiple linear transformations on top of each other is mathematically equivalent to a single, larger linear transformation, which could significantly limit the expressive power of the attention mechanism to learn complex patterns.

(Note: See appendix D for proof of how multiple linear transformations could collapse into a single linear transformation.)

A better kernel feature map

The kernel feature map considered by Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention is as follows:

$$\phi(x) = \text{elu}(x) + 1 = \begin{cases} x + 1 & \text{if } x > 0 \\ e^x & \text{if } x \le 0 \end{cases}$$

To prove that $\phi(x) = \text{elu}(x) + 1$ is a valid feature map, we must show that it fulfils the following two criteria:

  1. Theoretical validity: $\phi(x,y)$ is a valid kernel function, i.e. it is (a) symmetric and (b) positive semi-definite.
  2. Practical usefulness: it is a good approximation of the softmax normalisation, i.e. the weights after normalisation are (a) positive and (b) non-linear.

The proof of theoretical validity is in appendix E, and the proof of practical usefulness is in appendix F.

Why can't we linearise the softmax kernel exactly?

With softmax attention, the kernel is an exponential kernel where:

$$\phi(Q_i, K_j) = \exp(Q_i^\top K_j) = \phi(Q_i)^\top \phi(K_j)$$

For a feature map $\phi(x)$, we use the Taylor series expansion of the exponential function:

$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \frac{x^4}{4!} + \dots$$

Substituting the dot product $Q^\top K$ for $x$, we get:

$$\begin{align*} \exp(Q^\top K) &= \sum_{n=0}^{\infty} \frac{(Q^\top K)^n}{n!} && \text{(Taylor series)} \\ &= \sum_{n=0}^{\infty} \frac{1}{n!} \left( \sum_{d=1}^D q_d k_d \right)^n && \text{(expand the dot product)} \\ &= \sum_{n=0}^{\infty} \frac{1}{n!} \sum_{|\alpha|=n} \frac{n!}{\alpha_1! \dots \alpha_D!} \prod_{d=1}^D (q_d k_d)^{\alpha_d} && \text{(apply the multinomial theorem)} \\ &= \sum_{n=0}^{\infty} \sum_{|\alpha|=n} \frac{\prod_{d=1}^D q_d^{\alpha_d} \prod_{d=1}^D k_d^{\alpha_d}}{\alpha_1! \dots \alpha_D!} && \text{(cancel } n! \text{ and separate the } Q \text{ and } K \text{ terms)} \\ &= \sum_{n=0}^{\infty} \sum_{|\alpha|=n} \left( \frac{\prod_{d=1}^D q_d^{\alpha_d}}{\sqrt{\alpha_1! \dots \alpha_D!}} \right) \cdot \left( \frac{\prod_{d=1}^D k_d^{\alpha_d}}{\sqrt{\alpha_1! \dots \alpha_D!}} \right) \\ &= \phi(Q)^\top \phi(K) \end{align*}$$

Since $\phi(Q)$ and $\phi(K)$ are infinite-dimensional vectors, it is impossible to linearise exact softmax attention with a finite feature map.

(Note: there are papers that attempt to linearise by taking only the first few terms of the Taylor series, in which case the vectors are no longer infinite. For example, QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion.)
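To illustrate the truncation idea (this sketch is my own, not the construction used in QT-ViT), a second-order Taylor feature map is already finite: $\exp(q \cdot k) \approx 1 + q \cdot k + \frac{(q \cdot k)^2}{2} = \phi(q) \cdot \phi(k)$, where $\phi(x)$ concatenates a constant 1, the vector $x$, and the scaled outer product of $x$ with itself.

```python
# Second-order Taylor feature map: finite, and a decent approximation of exp(q.k)
# when the scores are small (the scaling below is an assumption for illustration).
import torch

def taylor2_feature_map(x):                      # x: (D,)
    quad = torch.outer(x, x).flatten() / 2**0.5  # captures the (q.k)^2 / 2 term
    return torch.cat([torch.ones(1), x, quad])   # length 1 + D + D^2

torch.manual_seed(0)
D = 8
q, k = torch.randn(D) * 0.3, torch.randn(D) * 0.3

approx = taylor2_feature_map(q) @ taylor2_feature_map(k)
exact = torch.exp(q @ k)
print(f"exact={exact:.4f}  second-order approximation={approx:.4f}")
```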

Time complexity

In comparison, when we define the feature map as $\phi(x) = \text{elu}(x) + 1$, what we do instead is the following:

  1. For the matrix $Q$ of dimension N x D, we apply the element-wise transformation $\text{elu}(x) + 1$ to obtain $\phi(Q)$. The time complexity of this element-wise transformation is O(ND), since a fixed number of operations is performed for each of the N x D elements. Similarly, the time complexity of computing $\phi(K)$, where K is an N x D matrix, is also O(ND).
  2. Next, we compute $\phi(K)^T V$: as $\phi(K)^T$ is D x N and $V$ is N x D, $\phi(K)^T V$ is D x D. The time complexity of this matrix multiplication is $O(ND^2)$.
  3. Finally, we compute $\phi(Q)\left(\phi(K)^T V\right)$, which is also an $O(ND^2)$ operation.
  4. Because N >> D, especially as the context gets longer, we can treat D as a constant, hence the time complexity of the linearised self-attention mechanism is O(N). A numerical check of this reordering is sketched below.
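The following sketch (toy sizes are my own assumptions) verifies that the reordered product $\phi(Q)\left(\phi(K)^T V\right)$ equals $\left(\phi(Q)\phi(K)^T\right) V$ exactly, while only ever forming a D x D intermediate instead of the N x N matrix:

```python
import time
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D = 4_000, 64
Q, K, V = torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)
phi_Q, phi_K = F.elu(Q) + 1, F.elu(K) + 1

t0 = time.time()
out_quadratic = (phi_Q @ phi_K.T) @ V      # builds an (N, N) intermediate
t1 = time.time()
out_linear = phi_Q @ (phi_K.T @ V)         # only a (D, D) intermediate
t2 = time.time()

print(torch.allclose(out_quadratic, out_linear, rtol=1e-3, atol=1e-3))  # True
print(f"quadratic order: {t1 - t0:.3f}s, linear order: {t2 - t1:.3f}s")
```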

Space complexity

Let us do a quick recap of the dimensions of all the matrices for a prompt sequence of length N:

  • Q: N x D (queries)
  • K: N x D (keys)
  • V: N x D (values)
  • $\phi(Q)$: N x D (element-wise $\text{elu}(x) + 1$ applied to Q)
  • $\phi(K)$: N x D (element-wise $\text{elu}(x) + 1$ applied to K)

The storage requirements are as follows:

  1. We first compute $S = \phi(K)^\top V$, which has dimension D x D, hence a space complexity of $O(D^2)$. Note that as N >> D, $S$ is a small matrix and its space complexity can be treated as a constant, i.e. O(1).

  2. Next, we compute $W = \phi(Q) S$, where $W$ has dimension N x D and hence a space complexity of O(ND). As N >> D, this simplifies to O(N). Note that this is unlike softmax attention, where we had to store the N x N matrix of attention scores and weights, which has a space complexity of $O(N^2)$.

  3. The normalisation vector, which holds the normalisation term $\phi(Q_i)^\top \sum^N_{j=1} \phi(K_j)$ for each query $Q_i$, has dimension N x 1, hence a space complexity of O(N).

Implementation of linear self-attention as code

The Python code implementing the linear self-attention mechanism is below. You can simply replace the SelfAttention class from the previous article with the following and it should work out of the box.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearSelfAttention(nn.Module):
    """ Linear Self-Attention with a Kernel Function (Equation 3) """
    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key   = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def kernel_function(self, x):
        """
        Applies the feature map φ(x).
        A common choice is elu + 1 to ensure positivity, mimicking exp().
        """
        return F.elu(x) + 1

    def forward(self, x):
        # x shape: (batch_size, sequence_length, d_in)

        # 1. Project inputs to Q, K, V
        keys = self.W_key(x)      # Shape: (b, n, d_out)
        queries = self.W_query(x) # Shape: (b, n, d_out)
        values = self.W_value(x)  # Shape: (b, n, d_out)

        # 2. Apply the kernel function (feature map) to queries and keys
        queries_projected = self.kernel_function(queries) # φ(Q)
        keys_projected = self.kernel_function(keys)       # φ(K)

        # 3. Reorder operations using the associative property of matrix multiplication
        # Original (inefficient): (φ(Q) @ φ(K).T) @ V
        # Reordered (efficient): φ(Q) @ (φ(K).T @ V)
        # This avoids creating the large (n, n) attention matrix.

        # kv_state shape: (b, d_out, d_out)
        kv_state = torch.bmm(keys_projected.transpose(1, 2), values)

        # context_unnormalized shape: (b, n, d_out)
        context_unnormalized = torch.bmm(queries_projected, kv_state)

        # 4. Calculate the normalization factor
        # The denominator in Eq. 3 is φ(Q_i) @ Σ φ(K_j)
        # normalizer shape: (b, n, 1)
        normalizer = torch.bmm(
            queries_projected,
            keys_projected.sum(dim=1).unsqueeze(-1) # Sum keys over sequence length
        )

        # 5. Apply normalization with a small epsilon for stability
        context_vec = context_unnormalized / (normalizer + 1e-6)

        return context_vec
```
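A quick usage sketch (the batch size, sequence length, and dimensions below are arbitrary assumptions) to confirm the output shape:

```python
torch.manual_seed(0)
batch_size, seq_len, d_in, d_out = 2, 10, 16, 8

attn = LinearSelfAttention(d_in, d_out)
x = torch.randn(batch_size, seq_len, d_in)

out = attn(x)
print(out.shape)  # torch.Size([2, 10, 8])
```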

Any downsides to linear self-attention mechanism?

Some research has shown that the linear self-attention mechanism leads to a decline in model capability. Softmax self-attention is more expressive than linear self-attention: while softmax can create any attention pattern using the full $QK^\top$ matrix, linear attention is constrained to patterns expressible as the dot product of the feature maps $\phi(Q_i)$ and $\phi(K_j)$.

There are also papers that try to enjoy the best of both worlds by training with softmax self-attention and running inference with linear self-attention as an approximation.

Discussions in the paper which I did not touch upon

The paper also discusses causal masking, as well as a neat trick that reduces the space complexity during training (i.e. the forward + backward pass) from $O(ND^2)$ to $O(ND)$. I may discuss these techniques in future articles.

Conclusion

In short, we have shown how linear self-attention can be derived and implemented, and why it achieves linear time and space complexity of O(N), which is far more efficient than the quadratic $O(N^2)$ time and space complexity of softmax self-attention.

What's next

In my next article, my focus will turn more practical as I try to implement from scratch a code base self-documentation service, similar to Devin's DeepWiki. While there are already open source alternatives such as this, I believe a deeper understanding of how to build such a service will serve as a solid foundation to further improve the quality of documentation generated by LLM.

Appendix A - What is the difference between a kernel and a function?

(Note: The appendixes are written with the help of Google Gemini 2.5 Pro)

A simple way to think about the difference is:

Every kernel is a function, but not every function is a kernel.


Function

A function is a general mathematical concept. It is simply a rule that maps an input from a set A to a single output in a set B.

  • Inputs: Can take one, two, or many inputs.
  • Purpose: Can do anything. It can describe a relationship ($y = x^2$), perform an action (print("hello")), or calculate a value (sin(x)).
  • Examples:
    • f(x) = x + 5 (takes one number, outputs another)
    • g(x, y) = x * y (takes two numbers, outputs their product)
    • h(text) = length(text) (takes a string, outputs its length)

There are no other constraints. If it's a well-defined rule mapping inputs to an output, it's a function.

Kernel

A kernel, in the context of machine learning, is a special kind of function with two key characteristics: its purpose and its mathematical properties.

1. The Purpose: To Measure Similarity

The primary job of a kernel k(x, y) is to take two inputs, x and y, and return a single scalar value that quantifies how "similar" they are. A large value means they are very similar; a small or zero value means they are dissimilar.

This is why the sim(·) function in the paper is a perfect candidate to be thought of as a kernel. Its job is to measure the similarity between a query $Q_i$ and a key $K_j$.

2. The Mathematical Property: The "Kernel Trick"

A function k(x, y) is formally considered a kernel if it can be expressed as a dot product in some (potentially very high-dimensional) feature space.

Mathematically, a function k is a kernel if there exists a feature map $\phi$ (phi) such that for all x and y:

$$k(x, y) = \phi(x) \cdot \phi(y)$$

Let's unpack this:

  • $\phi(x)$ (the feature map): This is a function that takes a single input x and transforms it into a new, often much longer, vector. It maps the data into a "feature space" where, hopefully, relationships are easier to see.
  • $\cdot$ (the dot product): This is the standard dot product between the two transformed vectors.

The "kernel trick" is that you can often compute k(x, y) very easily and cheaply, without ever having to compute the potentially massive or even infinite-dimensional vectors ϕ(x)\phi(x) and ϕ(y)\phi(y) .

Example: The Polynomial Kernel

Let x and y be simple 2D vectors.
Consider the kernel function $k(x, y) = (x \cdot y)^2$. This is very easy to compute.

It turns out this simple function is equivalent to first mapping x and y into a higher-dimensional space with a feature map $\phi$ and then taking their dot product.
If $x = [x_1, x_2]$, then a valid feature map is $\phi(x) = [x_1^2, x_2^2, \sqrt{2}x_1x_2]$.
Calculating $\phi(x) \cdot \phi(y)$ gives you exactly the same result as $(x \cdot y)^2$, but $(x \cdot y)^2$ is much faster to compute.
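A small sketch (the example vectors are arbitrary) verifying the polynomial-kernel example numerically:

```python
import torch

def poly_feature_map(v):  # v = [v1, v2]
    return torch.stack([v[0]**2, v[1]**2, 2**0.5 * v[0] * v[1]])

x = torch.tensor([1.0, 2.0])
y = torch.tensor([3.0, -1.0])

kernel_value = (x @ y) ** 2                             # (1*3 + 2*(-1))^2 = 1
feature_dot = poly_feature_map(x) @ poly_feature_map(y)
print(kernel_value.item(), feature_dot.item())          # both 1.0
```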

The formal requirement for a function to be a valid kernel is given by Mercer's Theorem, which states that the function must be "positive semi-definite". This condition guarantees that a feature map $\phi$ exists, even if we don't know what it is. (source)

Comparison Table

| Aspect | Function | Kernel |
| --- | --- | --- |
| Core idea | A rule that maps inputs to an output. | A special function that measures similarity between two inputs. |
| Number of inputs | Can be one or more. | Almost always takes two inputs (x and y). |
| Purpose | Can be anything. | To compute a similarity score. |
| Key property | None, beyond being a well-defined mapping. | Can be expressed as a dot product in a feature space: $k(x, y) = \phi(x) \cdot \phi(y)$. Must be positive semi-definite. |
| Example | f(x) = sin(x) | $k(x, y) = (x \cdot y + c)^d$ (polynomial kernel) |

Appendix B - What are positive, positive definite, and positive semi-definite matrices?

The most intuitive way to understand the difference is by using an analogy with regular numbers.

The Core Intuition: An Analogy with Numbers

  • A positive definite matrix is the matrix equivalent of a strictly positive number (like 3, 0.5, or 100). A number a > 0.
  • A positive semi-definite matrix is the matrix equivalent of a non-negative number (like 3, 0.5, 100, and also 0). A number a ≥ 0.

The key difference is whether zero is allowed.

Now, how does this concept of "positiveness" apply to a matrix, which is a whole grid of numbers? It's not about the individual elements of the matrix being positive. Instead, it's about how the matrix behaves when it interacts with vectors.

The key operation to test this "positiveness" is the quadratic form: $x^\top A x$, where A is our matrix and x is any non-zero vector. This operation always results in a single number (a scalar). We look at the sign of that resulting number.


Positive Semi-Definite (The "Non-Negative" Case, ≥ 0)

A symmetric matrix A is positive semi-definite if, for every possible non-zero vector x, the result of $x^\top A x$ is greater than or equal to zero.

$$x^\top A x \ge 0 \quad \forall\, x \neq 0$$

  • What it means: The matrix A will never "flip" a vector to point in the opposite direction.
  • The "Zero" Case: It is possible for $x^\top A x$ to equal zero even when the vector x is not zero. This means there are some directions in space that the matrix A "squashes" or "flattens" completely.
  • Eigenvalues: All eigenvalues of the matrix are non-negative (λ ≥ 0). The existence of a zero eigenvalue corresponds to a direction that gets squashed.
  • Invertibility: A positive semi-definite matrix might not be invertible. If it has a zero eigenvalue, its determinant is zero, and it's singular (not invertible).

Positive Definite (The "Strictly Positive" Case, > 0)

A symmetric matrix A is positive definite if, for every possible non-zero vector x, the result of $x^\top A x$ is strictly greater than zero.

$$x^\top A x > 0 \quad \forall\, x \neq 0$$

  • What it means: This is a stricter condition. The matrix A not only doesn't flip vectors, it also doesn't completely squash any of them (unless the vector was zero to begin with).
  • The "Zero" Case: The only way for $x^\top A x$ to equal zero is if the vector x itself is the zero vector.
  • Eigenvalues: All eigenvalues of the matrix are strictly positive (λ > 0). There are no zero eigenvalues.
  • Invertibility: A positive definite matrix is always invertible. Since none of its eigenvalues are zero, its determinant is non-zero.

Comparison Table

| Feature | Positive Semi-Definite | Positive Definite (Stricter) |
| --- | --- | --- |
| Condition | $x^\top A x \ge 0$ | $x^\top A x > 0$ |
| Analogy | Non-negative number (a ≥ 0) | Strictly positive number (a > 0) |
| Can $x^\top A x$ be zero? | Yes, for some non-zero x. | No, only if x is the zero vector. |
| Eigenvalues (λ) | All λ ≥ 0 (can include zero). | All λ > 0 (must be positive). |
| Invertibility | May not be invertible. | Always invertible. |
| Geometric intuition | A transformation that doesn't flip vectors, but might collapse some directions to zero. | A transformation that doesn't flip vectors and preserves some "energy" in all directions. |

Why This Matters for Kernels

A function k(x, y) is a valid kernel if and only if the "Gram matrix" it produces is positive semi-definite.

  • Gram Matrix: For any set of data points x_1, x_2, ..., x_n, the Gram matrix K is the n x n matrix where the entry K_ij = k(x_i, x_j).

The condition for a valid kernel is semi-definite, not definite. This is because we want to allow for cases where two different data points are perfectly "dissimilar" or orthogonal in the feature space (k(x_i, x_j) = 0) or where two different points x_i and x_j might actually map to the exact same location in the feature space (φ(x_i) = φ(x_j)). These situations lead to a Gram matrix that is only semi-definite, and we need to include them for the theory to be general enough.
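As a quick numerical sanity check (random data is an assumption), we can build the Gram matrix of the $\text{elu}(x) + 1$ feature map used in this article and confirm that it is symmetric with non-negative eigenvalues:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(6, 4)                  # 6 data points of dimension 4
phi = F.elu(X) + 1                     # feature-map each point
K = phi @ phi.T                        # Gram matrix: K_ij = phi(x_i) . phi(x_j)

print(torch.allclose(K, K.T))                          # symmetric
print(torch.linalg.eigvalsh(K).min().item() >= -1e-5)  # eigenvalues >= 0 (up to float error)
```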


Proof

We want to show that if K is symmetric and PSD, then there exists a feature map φ such that Kᵢⱼ = φ(xᵢ) ⋅ φ(xⱼ).

Proof using the Spectral Theorem:

  1. Eigendecomposition: The Spectral Theorem states that any real symmetric matrix K can be diagonalized by an orthogonal matrix of its eigenvectors.
    K = UΛUᵀ
    where:

    • U is an orthogonal matrix whose columns u₁, ..., uₙ are the eigenvectors of K.
    • Λ is a diagonal matrix whose entries λ₁, ..., λₙ are the corresponding real eigenvalues.
  2. Examine a Single Element: Let's write out what this decomposition means for a single element Kᵢⱼ:
    Kᵢⱼ = (UΛUᵀ)ᵢⱼ = ∑ₖ uᵢₖ λₖ uⱼₖ
    Here, uᵢₖ is the i-th component of the k-th eigenvector uₖ.

  3. Use the PSD Property: A fundamental property of a PSD matrix is that all its eigenvalues are non-negative: λₖ ≥ 0 for all k. This is the crucial step.

  4. Construct the Feature Map: Since λₖ ≥ 0, we can take its square root. Let's define a feature map φ that maps each point xᵢ to a new vector in ℝⁿ:
    φ(xᵢ) = [√λ₁ uᵢ₁, √λ₂ uᵢ₂, ..., √λₙ uᵢₙ]

  5. Verify the Dot Product: Now, let's compute the dot product φ(xᵢ) ⋅ φ(xⱼ) using our new feature map:
    φ(xᵢ) ⋅ φ(xⱼ) = ∑ₖ (√λₖ uᵢₖ) * (√λₖ uⱼₖ)
    = ∑ₖ (√λₖ)² uᵢₖ uⱼₖ
    = ∑ₖ λₖ uᵢₖ uⱼₖ

  6. Conclusion: This final expression is exactly our expansion for Kᵢⱼ from Step 2. We have successfully shown that Kᵢⱼ = φ(xᵢ) ⋅ φ(xⱼ). This proves the theorem for the finite-dimensional matrix case.
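A numerical sketch of this construction (the random PSD matrix below is an assumption): build φ(xᵢ) from the eigendecomposition and check that the dot products recover K.

```python
import torch

torch.manual_seed(0)
B = torch.randn(5, 5)
K = B @ B.T                              # symmetric PSD by construction

eigvals, U = torch.linalg.eigh(K)        # K = U diag(eigvals) U^T
eigvals = eigvals.clamp(min=0)           # guard against tiny negative float error
Phi = U * eigvals.sqrt()                 # row i is the feature vector phi(x_i)

print(torch.allclose(Phi @ Phi.T, K, atol=1e-4))  # K_ij = phi(x_i) . phi(x_j)
```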

Appendix C: proof that the identity matrix is symmetrical and positive semi-definite

Definition of a Symmetric Matrix:
A matrix A is symmetric if it is equal to its transpose, i.e., $A = A^\top$. The transpose of a matrix is found by swapping its rows and columns. In terms of elements, this means $(A^\top)_{ij} = A_{ji}$ for all i and j.

Proof:
We need to show that $I = I^\top$. Let's prove this by showing that their corresponding elements are equal for all i and j.

  1. Let $(I^\top)_{ij}$ be the element in the i-th row and j-th column of the matrix $I^\top$.
  2. By the definition of the transpose, $(I^\top)_{ij} = I_{ji}$.
  3. Now, let's use the definition of the identity matrix to evaluate $I_{ji}$:
    • If j = i, then $I_{ji} = 1$.
    • If $j \neq i$, then $I_{ji} = 0$.
  4. Let's compare this to the elements of the original identity matrix, $I_{ij}$:
    • If i = j, then $I_{ij} = 1$.
    • If $i \neq j$, then $I_{ij} = 0$.

Notice that the condition "j = i" is identical to "i = j", and $j \neq i$ is identical to $i \neq j$.

Therefore, for all possible values of i and j, we have shown that $(I^\top)_{ij} = I_{ji} = I_{ij}$. Since all corresponding elements are equal, the matrices are equal.

Conclusion: $I = I^\top$, so the identity matrix is symmetric.


Part 2: Proof that the Identity Matrix is Positive Semi-Definite

A symmetric matrix A is positive semi-definite if, for every non-zero column vector x, the quadratic form $\mathbf{x}^\top A \mathbf{x}$ is non-negative.

$$\mathbf{x}^\top A \mathbf{x} \ge 0$$

We can prove this for the identity matrix using two common methods.

Method 1: Using the Definition of the Quadratic Form

  1. Let $I$ be the $n \times n$ identity matrix and let $\mathbf{x}$ be any $n \times 1$ column vector:
$$\mathbf{x} = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
  2. We need to evaluate the quadratic form $\mathbf{x}^\top I \mathbf{x}$.

  3. First, let's compute the product $I\mathbf{x}$. By the definition of the identity matrix, multiplying any vector by $I$ results in the vector itself:
    $I\mathbf{x} = \mathbf{x}$

  4. Now, substitute this back into the quadratic form:
    $\mathbf{x}^\top I \mathbf{x} = \mathbf{x}^\top \mathbf{x}$

  5. The expression $\mathbf{x}^\top \mathbf{x}$ is the dot product of the vector $\mathbf{x}$ with itself:
    $$\mathbf{x}^\top \mathbf{x} = \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1^2 + x_2^2 + \cdots + x_n^2$$

  6. This sum can be written as $\sum_{i=1}^n x_i^2$.

  7. Since each $x_i$ is a real number, its square $x_i^2$ must be greater than or equal to zero ($x_i^2 \ge 0$). The sum of non-negative numbers is also non-negative.

Therefore, $\mathbf{x}^\top I \mathbf{x} = \sum_{i=1}^n x_i^2 \ge 0$.

This holds true for any vector $\mathbf{x}$. Thus, the identity matrix is positive semi-definite.

Method 2: Using Eigenvalues

Definition: A symmetric matrix is positive semi-definite if and only if all of its eigenvalues are non-negative ($\lambda_i \ge 0$).

  1. Let's find the eigenvalues of the identity matrix I. The eigenvalue equation is $A\mathbf{v} = \lambda\mathbf{v}$, where $\lambda$ is an eigenvalue and $\mathbf{v}$ is its corresponding eigenvector.

  2. For the identity matrix, this equation becomes:
    $I\mathbf{v} = \lambda\mathbf{v}$

  3. Since $I\mathbf{v} = \mathbf{v}$, we have: $\mathbf{v} = \lambda\mathbf{v}$

  4. Rearranging the terms gives:
    $\mathbf{v} - \lambda\mathbf{v} = \mathbf{0}$
    $(1 - \lambda)\mathbf{v} = \mathbf{0}$

  5. By definition, an eigenvector $\mathbf{v}$ must be a non-zero vector. For the equation $(1-\lambda)\mathbf{v} = \mathbf{0}$ to be true with $\mathbf{v} \neq \mathbf{0}$, the scalar multiplier must be zero: $1 - \lambda = 0 \implies \lambda = 1$

  6. An $n \times n$ identity matrix has n eigenvalues, and they are all equal to 1.

  7. The condition for positive semi-definiteness is that all eigenvalues must be non-negative. Since all eigenvalues of I are 1, and $1 \ge 0$, the condition is satisfied.

Conclusion: The identity matrix is positive semi-definite.

Summary

We have shown that:

  1. $I = I^\top$, which proves symmetry.
  2. $\mathbf{x}^\top I \mathbf{x} \ge 0$ for all vectors $\mathbf{x}$ (and its eigenvalues are all $\ge 0$), which proves positive semi-definiteness.

Note: In fact, we have shown something stronger. Since $\mathbf{x}^\top I \mathbf{x} > 0$ for all non-zero vectors $\mathbf{x}$ and its eigenvalues are all strictly positive (1 > 0), the identity matrix is actually positive definite, which is a subset of positive semi-definite.

Appendix D: why stacking of multiple linear transformations is equivalent to a single linear transformation

1. Mathematical Foundation: What is a Linear Transformation?

First, let's formally define our terms. A transformation (or function) $T$ that maps vectors from one vector space to another ($T: \mathbb{R}^n \to \mathbb{R}^m$) is linear if it satisfies two properties for any vectors $\vec{u}, \vec{v}$ and any scalar c:

  1. Additivity: $T(\vec{u} + \vec{v}) = T(\vec{u}) + T(\vec{v})$
  2. Homogeneity (scalar multiplication): $T(c\vec{u}) = cT(\vec{u})$

A cornerstone theorem of linear algebra states that every linear transformation can be represented by a matrix multiplication. If T is a linear transformation, there exists a unique matrix A such that for any vector $\vec{x}$:

$$T(\vec{x}) = A\vec{x}$$

This matrix representation is the key to our proof.


2. The Proof for Two Layers

Let's start by proving the case for two stacked transformations. "Stacking" is mathematically known as function composition.

  1. Let's define two linear transformations:

    • Let $T_1: \mathbb{R}^n \to \mathbb{R}^m$ be our first linear transformation. It takes an n-dimensional vector and maps it to an m-dimensional vector. It can be represented by an $m \times n$ matrix, which we'll call $A_1$. So, $T_1(\vec{x}) = A_1\vec{x}$.
    • Let $T_2: \mathbb{R}^m \to \mathbb{R}^p$ be our second linear transformation. It takes an m-dimensional vector (the output of $T_1$) and maps it to a p-dimensional vector. It can be represented by a $p \times m$ matrix, which we'll call $A_2$. So, $T_2(\vec{y}) = A_2\vec{y}$.

  2. "Stacking" $T_1$ and then $T_2$ means we first apply $T_1$ to an input vector $\vec{x}$, and then apply $T_2$ to the result. Let's call this new, composed transformation L:

    $$L(\vec{x}) = T_2(T_1(\vec{x}))$$

  3. Now, let's substitute the matrix representations into this equation:

    • First, replace $T_1(\vec{x})$ with its matrix form: $L(\vec{x}) = T_2(A_1\vec{x})$
    • The term $A_1\vec{x}$ is just a vector. Let's think of it as the input to $T_2$. We can now apply the matrix representation for $T_2$: $L(\vec{x}) = A_2(A_1\vec{x})$

  4. Here is the crucial step. Matrix multiplication is associative. This means that for compatible matrices A, B, and C, we have $(AB)C = A(BC)$. We can regroup our expression:

    $$L(\vec{x}) = (A_2A_1)\vec{x}$$

  5. Let's define a new matrix $C = A_2A_1$. Since $A_2$ is a $p \times m$ matrix and $A_1$ is an $m \times n$ matrix, their product C will be a single $p \times n$ matrix.

  6. By substituting C back into our equation, we get:

    $$L(\vec{x}) = C\vec{x}$$

This shows that the entire composition of two linear transformations, L, can be represented by multiplication with a single matrix C. A transformation defined by a single matrix multiplication is, by definition, a linear transformation.

Therefore, the composition of two linear transformations is itself a single linear transformation.


3. Generalization to 'N' Layers

We can easily extend this logic to any number of layers by induction.

Imagine a stack of N linear transformations, $T_1, T_2, ..., T_N$, represented by matrices $A_1, A_2, ..., A_N$. The final transformation L is:

$$L(\vec{x}) = T_N(T_{N-1}(...T_2(T_1(\vec{x}))...))$$

Substituting the matrix forms gives:

$$L(\vec{x}) = A_N(A_{N-1}(...A_2(A_1\vec{x})...))$$

Because matrix multiplication is associative, we can drop all the parentheses and multiply the matrices together first:

$$L(\vec{x}) = (A_N A_{N-1} ... A_2 A_1)\vec{x}$$

The product of all these matrices, $C_{total} = A_N A_{N-1} ... A_2 A_1$, is still just one single matrix.

Therefore, the entire stack of N transformations is equivalent to a single linear transformation represented by the matrix $C_{total}$.
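A minimal PyTorch sketch of this collapse (toy sizes, no biases or activations, matching the assumptions of the proof):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1 = nn.Linear(4, 6, bias=False)   # A1: maps R^4 -> R^6
layer2 = nn.Linear(6, 3, bias=False)   # A2: maps R^6 -> R^3

x = torch.randn(5, 4)                  # a batch of 5 input vectors

stacked = layer2(layer1(x))                           # T2(T1(x))
C = layer2.weight @ layer1.weight                     # single matrix A2 A1, shape (3, 4)
collapsed = x @ C.T                                   # one linear transformation

print(torch.allclose(stacked, collapsed, atol=1e-6))  # True
```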

Appendix E: theoretical validity of elu(x) + 1 as a kernel feature map

To prove symmetry, we must prove that $\phi(x,y) = \phi(y,x)$.

The kernel induced by the feature map is $\phi(x,y) = \phi(x)^\top \phi(y)$, where $\phi(x) = \text{elu}(x) + 1$ is applied element-wise. Since the dot product is commutative, i.e. $x^\top y = \sum^d_{i=1} x_i y_i = \sum^d_{i=1} y_i x_i = y^\top x$, we have $\phi(x,y) = \phi(x)^\top \phi(y) = \phi(y)^\top \phi(x) = \phi(y,x)$.

Next, prove that the function is positive semi-definite. Let's check the condition cᵀKc ≥ 0.
cᵀKc = Σᵢ Σⱼ cᵢ cⱼ k(xᵢ, xⱼ)
Substitute our kernel definition:
= Σᵢ Σⱼ cᵢ cⱼ (φ(xᵢ)ᵀφ(xⱼ))
We can rearrange this sum:
= (Σᵢ cᵢ φ(xᵢ)ᵀ) (Σⱼ cⱼ φ(xⱼ))
Let the vector v = Σᵢ cᵢ φ(xᵢ). The expression becomes:
= vᵀv
This is simply the dot product of a vector with itself, which is the squared L2 norm ||v||². The squared norm of any real vector is always greater than or equal to 0.
Positive semi-definiteness is automatically satisfied.

Appendix F: practical usefulness of elu(x) + 1 as a kernel feature map

We want to prove that elu(x) + 1 is a practically useful kernel feature map where:

  1. $\phi(Q_i)^\top \phi(K_j) \ge 0$, i.e. the unnormalised weights are non-negative.
  2. $\text{elu}(x) + 1$ is a non-linear feature map.

We recall that the equation of the feature map is:

$$\phi(x) = \text{elu}(x) + 1 = \begin{cases} x + 1 & \text{if } x > 0 \\ e^x & \text{if } x \le 0 \end{cases}$$

Given the equation, every element of $\phi(Q_i)$ and $\phi(K_j)$ is guaranteed to be positive, hence their dot product must also be non-negative. Therefore, $\phi(Q_i)^\top \phi(K_j) \ge 0$.

On linearity: a function f is linear if $f(ax + by) = af(x) + bf(y)$.

Let us evaluate $\phi(ax + by)$, where a = 2, x = 1, b = 1, and y = -1:

LHS: $\phi(2 \cdot 1 + 1 \cdot (-1)) = \phi(1) = 1 + 1 = 2$

RHS: $2\phi(1) + 1 \cdot \phi(-1) = 2 \cdot 2 + e^{-1} = 4 + 0.3679 \approx 4.368$

As $\text{LHS} \neq \text{RHS}$, $\phi(x) = \text{elu}(x) + 1$ is a non-linear feature map.
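The same check in code (a tiny sketch using PyTorch's F.elu):

```python
import torch
import torch.nn.functional as F

def phi(x):
    return F.elu(x) + 1

a, x, b, y = 2.0, torch.tensor(1.0), 1.0, torch.tensor(-1.0)

lhs = phi(a * x + b * y)           # phi(1) = 2
rhs = a * phi(x) + b * phi(y)      # 2 * phi(1) + phi(-1) ~ 4.368
print(lhs.item(), rhs.item())      # not equal, so phi is non-linear
```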
