Table of Contents
- Motivation
- Efficient self-attention
- Time and space complexity of softmax self-attention
- Naive implementation of φ(x) = x as the kernel feature map
- A better kernel feature map
- Why can't we linearise the softmax kernel exactly?
- Time complexity of linear self-attention
- Space complexity of linear self-attention
- Implementation of linear self-attention as code
- Any downsides to linear self-attention mechanism?
- Discussions in the paper which I did not touch upon
- Conclusion
- What's next
- Appendix A - What is the difference between a kernel and a function?
- Appendix B - What are positive, positive definite, and positive semi-definite matrices?
- Appendix C: proof that the identity matrix is symmetrical and positive semi-definite
- Appendix D: why stacking of multiple linear transformations is equivalent to a single linear transformation
- Appendix E: theoretical validity of elu(x) + 1 as a kernel feature map
- Appendix F: practical usefulness of elu(x) + 1 as a kernel feature map
Motivation
Ever felt that an LLM gets very slow, especially when your prompt gets longer? Or that it takes up too much memory? This article explores how we can speed up the self-attention mechanism and reduce its memory usage by making it more efficient. In technical parlance, we want to reduce its time complexity and space complexity.
This is a follow-up to the previous article, where I explored the why and how of the softmax self-attention mechanism. We ended with this softmax self-attention equation:

Attention(Q, K, V) = softmax(QKᵀ / √D) V
We will explore how this equation can be tweaked to be more efficient, both in time and space. The paper which I will examine is Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention.
Efficient self-attention
To achieve lower memory consumption and higher throughput for longer contexts, we can improve the efficiency of the self-attention mechanism by replacing the softmax function with a kernel-based attention (source: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention). I will flesh out the mathematical workings below; if you are more mathematically inclined, you may wish to read the paper directly.
Recall that the softmax self-attention mechanism can be represented as below, where A is the attention matrix:

A = softmax(QKᵀ / √D)
A is an N × N matrix, where N is the sequence length. The element Aᵢⱼ represents the attention weight that the i-th query token gives to the j-th key token.
Because the softmax is applied row-wise, the formula for a single element Aᵢⱼ (the element at the i-th row and j-th column of A) is:

Aᵢⱼ = exp(qᵢ · kⱼ / √D) / Σₘ exp(qᵢ · kₘ / √D)
Here:
- qᵢ is the i-th row of the Query matrix Q.
- kⱼ is the j-th row of the Key matrix K.
- qᵢ · kⱼ is the dot product between the i-th query and the j-th key.
- The denominator sums these exponential scores over all N keys (m = 1 to N) for the fixed query qᵢ.
If the equation is too much of a mouthful, the following visualisation might make it easier to digest. Imagine both Q and K are 5 × 5 matrices, and we are interested in calculating a single element Aᵢⱼ. What the equation essentially means is to do what is shown in the diagram below. Note that the diagram does not include dividing by √D or taking the exponent of the value, but you get the drift.
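To make the row-wise softmax concrete, here is a minimal PyTorch sketch of the same computation; the 5 × 5 sizes and random values are illustrative assumptions, not the article's actual example.

import torch

torch.manual_seed(0)
N, D = 5, 5                        # illustrative sequence length and dimension
Q = torch.randn(N, D)
K = torch.randn(N, D)

scores = Q @ K.T / D ** 0.5        # raw scores q_i . k_j / sqrt(D), shape (N, N)
A = torch.softmax(scores, dim=-1)  # row-wise softmax: A_ij = exp(score_ij) / sum_m exp(score_im)

print(A[1, 2])                     # attention weight that query token 1 gives to key token 2
print(A[1].sum())                  # each row of A sums to 1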
The final output O is the product of the attention matrix A and the value matrix V:

O = AV

To find the i-th row of the output, Oᵢ, multiply the i-th row of A by the entire matrix V. This results in a weighted sum of all the value vectors vⱼ, where the weights are the attention scores in the i-th row of A:

Oᵢ = Σⱼ Aᵢⱼ vⱼ

Here, vⱼ is the j-th row (the value vector) of the matrix V.

A visual representation of the above summation is below, calculating one output row Oᵢ as an example.

We can now substitute our formula for Aᵢⱼ into the equation for Oᵢ and obtain the following:

Oᵢ = Σⱼ [ exp(qᵢ · kⱼ / √D) / Σₘ exp(qᵢ · kₘ / √D) ] vⱼ
For simplicity of display and to abstract away the mechanics of matrix manipulation, we write the dot product of the i-th query vector and the j-th key vector compactly as:

qᵢ · kⱼ = qᵢᵀkⱼ

So we can re-express Oᵢ as:

Oᵢ = Σⱼ [ exp(qᵢᵀkⱼ / √D) / Σₘ exp(qᵢᵀkₘ / √D) ] vⱼ

As the denominator term Σₘ exp(qᵢᵀkₘ / √D) is a scalar that is constant for all j in the outer summation, we can pull it out of the summation and get the following:

Oᵢ = [ Σⱼ exp(qᵢᵀkⱼ / √D) vⱼ ] / [ Σₘ exp(qᵢᵀkₘ / √D) ]

Instead of taking the exponent of the scaled dot products, we generalise the exp(qᵢᵀkⱼ / √D) term to be represented by a kernel sim(qᵢ, kⱼ), and thus we get:

Oᵢ = [ Σⱼ sim(qᵢ, kⱼ) vⱼ ] / [ Σₘ sim(qᵢ, kₘ) ]
In the paper, the authors set the condition that a valid kernel must be non-negative, i.e. sim(qᵢ, kⱼ) ≥ 0. My understanding is that this constraint is necessary for two reasons:
1. Interpretability. A negative attention weight is difficult to interpret.
2. Mathematical stability of the denominator. If weights could be negative, positive and negative weights could cancel each other out, resulting in a denominator of zero, and division by zero is an error.
(Note: All kernels are functions, but not all functions are kernels. Check out Appendix A for an explanation.)
Since a kernel is one that can be represented as a dot product of feature-mapped inputs, i.e. sim(qᵢ, kⱼ) = φ(qᵢ)ᵀφ(kⱼ), we can express Oᵢ as the following equation:

Oᵢ = [ Σⱼ φ(qᵢ)ᵀφ(kⱼ) vⱼ ] / [ Σₘ φ(qᵢ)ᵀφ(kₘ) ]
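As a sanity check that this generalised formula really contains softmax attention as a special case, here is a small sketch (tensor sizes and names are my own) that plugs sim(q, k) = exp(qᵀk / √D) into the per-row formula and compares it with the usual matrix form:

import torch

torch.manual_seed(0)
N, D = 4, 8
Q, K, V = torch.randn(N, D), torch.randn(N, D), torch.randn(N, D)

def sim(q, k):
    # softmax kernel: exp(q . k / sqrt(D))
    return torch.exp(q @ k / D ** 0.5)

# O_i = sum_j sim(q_i, k_j) v_j / sum_m sim(q_i, k_m)
O = torch.stack([
    sum(sim(Q[i], K[j]) * V[j] for j in range(N)) /
    sum(sim(Q[i], K[m]) for m in range(N))
    for i in range(N)
])

O_ref = torch.softmax(Q @ K.T / D ** 0.5, dim=-1) @ V  # standard softmax attention
print(torch.allclose(O, O_ref, atol=1e-6))             # True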
Time and space complexity of softmax self-attention
The problem with the above equation is that it has a time complexity of O(N²), and also a space complexity of O(N²). As the time complexity of the numerator and the denominator are the same, let us just look at the numerator:
- QKᵀ: Q is an N × D matrix and Kᵀ is D × N, so QKᵀ is N × N. This matrix multiplication takes O(N²D) operations.
- A = softmax(QKᵀ / √D): it consists of 3 parts:
  - Exponentiation: compute the exponent of each attention score; as there are N² attention scores, the time complexity is O(N²).
  - Summation: sum the N exponentiated values in each row to get the denominator, which is N − 1 additions per row. As there are N rows, the cost is O(N²).
  - Division: divide each of the N² exponentiated values by its row sum. As there is a total of N² division operations, the cost is O(N²).
- AV: A is N × N and V is N × D, so the product is N × D. This matrix multiplication takes O(N²D) operations.
- The dominant part is O(N²D). Unlike N, which can grow very large as the context (i.e. the user prompt) gets longer, D is a constant, hence we can simplify the time complexity to O(N²).
The space complexity is also O(N²), because the full N × N matrix of attention scores and weights must be stored.
Here is how we can reduce the time and space complexity: thanks to the distributive property of matrix multiplication over addition, i.e. A(B + C) = AB + AC, we can pull φ(qᵢ)ᵀ out of the summations and rewrite the equation as:

Oᵢ = [ φ(qᵢ)ᵀ Σⱼ φ(kⱼ)vⱼᵀ ] / [ φ(qᵢ)ᵀ Σₘ φ(kₘ) ]
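The reordering is easy to verify numerically: (φ(Q)φ(K)ᵀ)V and φ(Q)(φ(K)ᵀV) give the same numerator, but only the second avoids materialising the N × N matrix. A minimal sketch, assuming the elu(x) + 1 feature map introduced below and arbitrary sizes:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D = 1000, 64
Q, K, V = (torch.randn(N, D, dtype=torch.float64) for _ in range(3))

phi_Q, phi_K = F.elu(Q) + 1, F.elu(K) + 1

num_quadratic = (phi_Q @ phi_K.T) @ V   # builds an N x N intermediate
num_linear = phi_Q @ (phi_K.T @ V)      # builds only a D x D intermediate

print(torch.allclose(num_quadratic, num_linear))  # True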
I will explain in detail how, using φ(x) = elu(x) + 1 as the kernel feature map, the time and space complexity of the attention mechanism becomes O(N), which grows linearly, hence the name "linearised self-attention mechanism". This is a huge performance improvement and provides huge memory savings compared to the O(N²) time and space complexity of the softmax self-attention mechanism.
Naive implementation of φ(x) = x as the kernel feature map
The most straightforward choice is the identity feature map φ(x) = x, so that sim(qᵢ, kⱼ) = φ(qᵢ)ᵀφ(kⱼ) = qᵢᵀkⱼ. You can see that it is a valid kernel because it satisfies the condition:

sim(qᵢ, kⱼ) = φ(qᵢ)ᵀφ(kⱼ)
(Note: To prove that a function is suitable as a kernel, we need to prove that it is symmetrical and positive semi-definite. The proof is in appendix C.)
However, φ(x) = x is not a valid feature map for our purposes, because we had set the condition for the kernel to be non-negative, and qᵢᵀkⱼ does not guarantee a positive value.
Also, this approach effectively negates the advantages which softmax normalisation provides, because we are normalising each score by a simple sum of all the scores, i.e. Aᵢⱼ = qᵢᵀkⱼ / Σₘ qᵢᵀkₘ. This is not ideal: as explained in the previous article, softmax is important in emphasising attention on tokens that truly matter due to its "winner-takes-most" effect. While we would enjoy a time and space complexity of O(N), we would most likely end up with an LLM that has difficulty focusing on what is important in the context.
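The "winner-takes-most" effect is easy to see on a toy set of scores: softmax sharpens the distribution towards the largest score, while normalising by a plain sum keeps it much flatter. A small illustrative sketch (the scores are arbitrary):

import torch

scores = torch.tensor([1.0, 2.0, 4.0])

softmax_weights = torch.softmax(scores, dim=0)   # ~[0.04, 0.11, 0.84] -> strongly favours the top score
plain_sum_weights = scores / scores.sum()        # ~[0.14, 0.29, 0.57] -> much flatter

print(softmax_weights)
print(plain_sum_weights)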
In addition, this kernel feature map would make the entire attention block a purely linear transformation (ignoring the layer norms for now). Stacking multiple linear transformations on top of each other is mathematically equivalent to a single, large linear transformation, and this would significantly limit the expressive power of the attention mechanism to learn complex patterns.
(Note: See appendix D for proof of how multiple linear transformations could collapse into a single linear transformation.)
A better kernel feature map
The kernel feature map considered by Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention is as follows:

φ(x) = elu(x) + 1

where elu is the exponential linear unit, elu(x) = x for x > 0 and elu(x) = eˣ − 1 for x ≤ 0, applied element-wise.
To prove that φ(x) = elu(x) + 1 is a valid feature map, we must show that it fulfils the following two criteria:
- Theoretical validity: the resulting sim(qᵢ, kⱼ) = φ(qᵢ)ᵀφ(kⱼ) is a valid kernel function, i.e. it is (a) symmetrical, and (b) positive semi-definite.
- Practical usefulness: it is a good approximation of the softmax normalisation, i.e. (a) the weights are positive and (b) the feature map is non-linear.
The proof of theoretical validity is in appendix E, and the proof of practical usefulness is in appendix F.
Why can't we linearise the softmax kernel exactly?
With softmax attention, the kernel is an exponential kernel:

sim(q, k) = exp(qᵀk / √D)

For this to be written as sim(q, k) = φ(q)ᵀφ(k) for some feature map φ, consider the Taylor series expansion of the exponential function:

eˣ = 1 + x + x²/2! + x³/3! + ...

Substituting the dot product qᵀk for x, we get:

exp(qᵀk) = 1 + qᵀk + (qᵀk)²/2! + (qᵀk)³/3! + ...

Each term (qᵀk)ⁿ corresponds to a polynomial feature map of degree n, so the feature maps φ(q) and φ(k) that reproduce this series exactly would have to be infinitely long vectors. It is therefore impossible to linearise the exact softmax attention.
(Note: there are papers which attempt to linearise exp(qᵀk) by taking only the first few terms of the Taylor series, in which case the feature vectors are no longer infinite. For example, QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion.)
Time complexity of linear self-attention
In comparison, when we define the feature map as φ(x) = elu(x) + 1, what we do instead is the following:
- For the matrix Q of dimension N × D, we apply the element-wise transformation φ(x) = elu(x) + 1. The time complexity of this element-wise transformation is O(ND): a constant number of calculations per element, and there are N × D elements in total. Similarly, the time complexity of transforming K, which is also an N × D matrix, is O(ND).
- Next, we compute φ(K)ᵀV: as φ(K)ᵀ is D × N and V is N × D, φ(K)ᵀV is D × D. The time complexity of this matrix multiplication is O(ND²).
- Finally, we compute φ(Q)(φ(K)ᵀV), which is an O(ND²) operation.
- Because N >> D, especially when the context length gets longer, we can treat D as a constant, hence the time complexity of the linearised self-attention mechanism is O(N).
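A rough way to see the quadratic-versus-linear behaviour is to time both orderings as N grows. The numbers depend entirely on hardware, so treat this as an illustrative sketch rather than a proper benchmark:

import time
import torch
import torch.nn.functional as F

D = 64
for N in [1000, 2000, 4000]:
    Q, K, V = (torch.randn(N, D) for _ in range(3))
    phi_Q, phi_K = F.elu(Q) + 1, F.elu(K) + 1

    t0 = time.time()
    _ = (phi_Q @ phi_K.T) @ V   # quadratic in N
    t1 = time.time()
    _ = phi_Q @ (phi_K.T @ V)   # linear in N
    t2 = time.time()

    print(f"N={N}: quadratic {t1 - t0:.4f}s, linear {t2 - t1:.4f}s")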
Space complexity of linear self-attention
Let us do a quick recap of the dimensions of all the matrices for an input sequence of length N:
- Q: N × D (Queries)
- K: N × D (Keys)
- V: N × D (Values)
- φ(Q): N × D (element-wise elu(x) + 1 applied to Q)
- φ(K): N × D (element-wise elu(x) + 1 applied to K)
The storage requirements are as follows:
We first compute φ(K)ᵀV, which has a dimension of D × D, hence the space complexity is O(D²). Note that as N >> D, φ(K)ᵀV is a small matrix and its space complexity can be treated as constant, i.e. O(1).
Next, we compute P = φ(Q)(φ(K)ᵀV), where P has a dimension of N × D, and hence a space complexity of O(ND). As N >> D, the space complexity can be simplified to O(N). Note that this is unlike softmax attention, where we had to store the N × N matrix of attention scores and weights, which has a space complexity of O(N²).
The normalisation vector, which holds the normalisation term for each query, has a dimension of N × 1, hence the space complexity is O(N).
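The shapes of the intermediates make the savings concrete. A small sketch (sizes are illustrative) that prints them:

import torch
import torch.nn.functional as F

N, D = 10_000, 64
Q, K, V = (torch.randn(N, D) for _ in range(3))
phi_Q, phi_K = F.elu(Q) + 1, F.elu(K) + 1

kv_state = phi_K.T @ V                                 # D x D, independent of N
normaliser = phi_Q @ phi_K.sum(dim=0, keepdim=True).T  # N x 1 normalisation terms
output = (phi_Q @ kv_state) / (normaliser + 1e-6)      # N x D

print(kv_state.shape, normaliser.shape, output.shape)
# torch.Size([64, 64]) torch.Size([10000, 1]) torch.Size([10000, 64])
# the N x N attention matrix is never materialised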
Implementation of linear self-attention as code
The Python code to implement the linear self-attention mechanism is below. You can simply replace the SelfAttention class in the previous article with the following and it should work out of the box.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearSelfAttention(nn.Module):
    """Linear Self-Attention with a Kernel Function (Equation 3)"""

    def __init__(self, d_in, d_out, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)

    def kernel_function(self, x):
        """
        Applies the feature map φ(x).
        A common choice is elu + 1 to ensure positivity, mimicking exp().
        """
        return F.elu(x) + 1

    def forward(self, x):
        # x shape: (batch_size, sequence_length, d_in)

        # 1. Project inputs to Q, K, V
        keys = self.W_key(x)        # Shape: (b, n, d_out)
        queries = self.W_query(x)   # Shape: (b, n, d_out)
        values = self.W_value(x)    # Shape: (b, n, d_out)

        # 2. Apply the kernel function (feature map) to queries and keys
        queries_projected = self.kernel_function(queries)  # φ(Q)
        keys_projected = self.kernel_function(keys)        # φ(K)

        # 3. Reorder operations using the associative property of matrix multiplication
        #    Original (inefficient): (φ(Q) @ φ(K).T) @ V
        #    Reordered (efficient):  φ(Q) @ (φ(K).T @ V)
        #    This avoids creating the large (n, n) attention matrix.
        # kv_state shape: (b, d_out, d_out)
        kv_state = torch.bmm(keys_projected.transpose(1, 2), values)

        # context_unnormalized shape: (b, n, d_out)
        context_unnormalized = torch.bmm(queries_projected, kv_state)

        # 4. Calculate the normalization factor
        #    The denominator in Eq. 3 is φ(Q_i) @ Σ φ(K_j)
        # normalizer shape: (b, n, 1)
        normalizer = torch.bmm(
            queries_projected,
            keys_projected.sum(dim=1).unsqueeze(-1)  # Sum keys over sequence length
        )

        # 5. Apply normalization with a small epsilon for stability
        context_vec = context_unnormalized / (normalizer + 1e-6)

        return context_vec
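A quick usage sketch (the batch size, sequence length, and dimensions below are arbitrary):

import torch

torch.manual_seed(123)
attn = LinearSelfAttention(d_in=16, d_out=32)
x = torch.randn(2, 1000, 16)   # (batch_size, sequence_length, d_in)
context = attn(x)
print(context.shape)           # torch.Size([2, 1000, 32])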
Any downsides to linear self-attention mechanism?
Some research has shown that the linear self-attention mechanism leads to a decline in model capability. Softmax self-attention is more expressive than linear self-attention: while softmax can create essentially any attention pattern via the full N × N matrix, linear attention is constrained to patterns that can be expressed as a dot product of the feature maps of qᵢ and kⱼ.
There are papers that try to enjoy the best of both worlds by training with softmax self-attention and running inference with linear self-attention as an approximation:
- [Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference](https://arxiv.org/abs/2211.10526)
- The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry
- LoLCATs: On Low-Rank Linearizing of Large Language Models
- Agent Attention: On the Integration of Softmax and Linear Attention
- QT-ViT: Improving Linear Attention in ViT with Quadratic Taylor Expansion
Discussions in the paper which I did not touch upon
The paper also goes on to discuss causal masking, as well as how, with a neat trick, it is able to further reduce the memory required during training (i.e. the forward and backward passes). I may discuss these techniques in future articles.
Conclusion
In short, we have shown how the linear self-attention mechanism can be derived and implemented, and why it is able to achieve linear time and space complexity of O(N), which is far more efficient than the O(N²) time and space complexity of softmax self-attention.
What's next
In my next article, my focus will turn more practical as I try to implement, from scratch, a codebase self-documentation service, similar to Devin's DeepWiki. While there are already open source alternatives such as this, I believe a deeper understanding of how to build such a service will serve as a solid foundation to further improve the quality of documentation generated by LLMs.
Appendix A - What is the difference between a kernel and a function?
(Note: The appendixes are written with the help of Google Gemini 2.5 Pro)
A simple way to think about the difference is:
Every kernel is a function, but not every function is a kernel.
Function
A function is a general mathematical concept. It is simply a rule that maps an input from a set A to a single output in a set B.
- Inputs: Can take one, two, or many inputs.
- Purpose: Can do anything. It can describe a relationship, perform an action (print("hello")), or calculate a value (sin(x)).
- Examples:
  - f(x) = x + 5 (takes one number, outputs another)
  - g(x, y) = x * y (takes two numbers, outputs their product)
  - h(text) = length(text) (takes a string, outputs its length)

There are no other constraints. If it's a well-defined rule mapping inputs to an output, it's a function.
Kernel
A kernel, in the context of machine learning, is a special kind of function with two key characteristics: its purpose and its mathematical properties.
1. The Purpose: To Measure Similarity
The primary job of a kernel k(x, y) is to take two inputs, x and y, and return a single scalar value that quantifies how "similar" they are. A large value means they are very similar; a small or zero value means they are dissimilar.

This is why the sim(·) function in the paper is a perfect candidate to be thought of as a kernel. Its job is to measure the similarity between a query and a key.
2. The Mathematical Property: The "Kernel Trick"
A function k(x, y) is formally considered a kernel if it can be expressed as a dot product in some (potentially very high-dimensional) feature space.

Mathematically, a function k is a kernel if there exists a feature map φ (phi) such that for all x and y:

k(x, y) = φ(x) ⋅ φ(y)

Let's unpack this:
- φ (the feature map): a function that takes a single input x and transforms it into a new, often much longer, vector. It maps the data into a "feature space" where, hopefully, relationships are easier to see.
- ⋅ (the dot product): the standard dot product between the two transformed vectors.

The "kernel trick" is that you can often compute k(x, y) very easily and cheaply, without ever having to compute the potentially massive or even infinite-dimensional vectors φ(x) and φ(y).
Example: The Polynomial Kernel
Let x
and y
be simple 2D vectors.
Consider the kernel function
. This is very easy to compute.
It turns out this simple function is equivalent to first mapping x
and y
into a higher-dimensional space with a feature map
and then taking their dot product.
If
, then a valid feature map is
.
Calculating
gives you the exact same result as
, but
is much faster to compute.
The formal requirement for a function to be a valid kernel is given by Mercer's Theorem, which states that the function must be "positive semi-definite." This condition guarantees that a feature map exists, even if we don't know what it is. (source)
Comparison Table
Aspect | Function | Kernel
--- | --- | ---
Core Idea | A rule that maps inputs to an output. | A special function that measures similarity between two inputs.
Number of Inputs | Can be one or more. | Almost always takes two inputs (x and y).
Purpose | Can be anything. | To compute a similarity score.
Key Property | None, beyond being a well-defined mapping. | Can be expressed as a dot product in a feature space: k(x, y) = φ(x) ⋅ φ(y). Must be positive semi-definite.
Example | f(x) = sin(x) | k(x, y) = (x ⋅ y)² (Polynomial Kernel)
Appendix B - What are positive, positive definite, and positive semi-definite matrices?
The most intuitive way to understand the difference is by using an analogy with regular numbers.
The Core Intuition: An Analogy with Numbers
- A positive definite matrix is the matrix equivalent of a strictly positive number (like 3, 0.5, or 100), i.e. a number a > 0.
- A positive semi-definite matrix is the matrix equivalent of a non-negative number (like 3, 0.5, 100, and also 0), i.e. a number a ≥ 0.
The key difference is whether zero is allowed.
Now, how does this concept of "positiveness" apply to a matrix, which is a whole grid of numbers? It's not about the individual elements of the matrix being positive. Instead, it's about how the matrix behaves when it interacts with vectors.
The key operation to test this "positiveness" is the quadratic form x^T A x, where A is our matrix and x is any non-zero vector. This operation always results in a single number (a scalar). We look at the sign of that resulting number.
Positive Semi-Definite (The "Non-Negative" Case, ≥ 0)

A symmetric matrix A is positive semi-definite if, for every possible non-zero vector x, the result of x^T A x is greater than or equal to zero.

x^T A x ≥ 0 for all x ≠ 0

- What it means: The matrix A will never "flip" a vector to point in the opposite direction.
- The "Zero" Case: It is possible for x^T A x to equal zero even when the vector x is not zero. This means there are some directions in space that the matrix A "squashes" or "flattens" completely.
- Eigenvalues: All eigenvalues of the matrix are non-negative (λ ≥ 0). The existence of a zero eigenvalue corresponds to a direction that gets squashed.
- Invertibility: A positive semi-definite matrix might not be invertible. If it has a zero eigenvalue, its determinant is zero, and it's singular (not invertible).

Positive Definite (The "Strictly Positive" Case, > 0)

A symmetric matrix A is positive definite if, for every possible non-zero vector x, the result of x^T A x is strictly greater than zero.

x^T A x > 0 for all x ≠ 0

- What it means: This is a stricter condition. The matrix A not only doesn't flip vectors, it also doesn't completely squash any of them (unless the vector was zero to begin with).
- The "Zero" Case: The only way for x^T A x to equal zero is if the vector x itself is the zero vector.
- Eigenvalues: All eigenvalues of the matrix are strictly positive (λ > 0). There are no zero eigenvalues.
- Invertibility: A positive definite matrix is always invertible. Since none of its eigenvalues are zero, its determinant is non-zero.
Comparison Table
Feature | Positive Semi-Definite | Positive Definite (Stricter)
--- | --- | ---
Condition | x^T A x ≥ 0 | x^T A x > 0
Analogy | Non-negative number (a ≥ 0) | Strictly positive number (a > 0)
Can x^T A x be zero? | Yes, for some non-zero x. | No, only if x is the zero vector.
Eigenvalues (λ) | All λ ≥ 0 (can include zero). | All λ > 0 (must be positive).
Invertibility | May not be invertible. | Always invertible.
Geometric Intuition | A transformation that doesn't flip vectors, but might collapse some directions to zero. | A transformation that doesn't flip vectors and preserves some "energy" in all directions.
Why This Matters for Kernels
A function k(x, y) is a valid kernel if and only if the "Gram matrix" it produces is positive semi-definite.

- Gram Matrix: For any set of data points x_1, x_2, ..., x_n, the Gram matrix K is the n x n matrix where the entry K_ij = k(x_i, x_j).

The condition for a valid kernel is semi-definite, not definite. This is because we want to allow for cases where two different data points are perfectly "dissimilar" or orthogonal in the feature space (k(x_i, x_j) = 0), or where two different points x_i and x_j might actually map to the exact same location in the feature space (φ(x_i) = φ(x_j)). These situations lead to a Gram matrix that is only semi-definite, and we need to include them for the theory to be general enough.
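A small sketch that builds a Gram matrix from the elu(x) + 1 feature map used in this article and checks that its eigenvalues are non-negative. The data is random, so this is an illustrative check rather than a proof:

import torch
import torch.nn.functional as F

torch.manual_seed(0)
X = torch.randn(6, 4, dtype=torch.float64)   # 6 data points in 4 dimensions
Phi = F.elu(X) + 1                           # feature-mapped data, row i is phi(x_i)

G = Phi @ Phi.T                              # Gram matrix: G_ij = phi(x_i) . phi(x_j)
print(torch.linalg.eigvalsh(G))              # all eigenvalues >= 0 (up to round-off)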
Proof
We want to show that if K is symmetric and PSD, then there exists a feature map φ such that Kᵢⱼ = φ(xᵢ) ⋅ φ(xⱼ).
Proof using the Spectral Theorem:
1. Eigendecomposition: The Spectral Theorem states that any real symmetric matrix K can be diagonalized by an orthogonal matrix of its eigenvectors: K = UΛUᵀ, where:
   - U is an orthogonal matrix whose columns u₁, ..., uₙ are the eigenvectors of K.
   - Λ is a diagonal matrix whose entries λ₁, ..., λₙ are the corresponding real eigenvalues.
2. Examine a Single Element: Let's write out what this decomposition means for a single element Kᵢⱼ: Kᵢⱼ = (UΛUᵀ)ᵢⱼ = ∑ₖ uᵢₖ λₖ uⱼₖ. Here, uᵢₖ is the i-th component of the k-th eigenvector uₖ.
3. Use the PSD Property: A fundamental property of a PSD matrix is that all its eigenvalues are non-negative: λₖ ≥ 0 for all k. This is the crucial step.
4. Construct the Feature Map: Since λₖ ≥ 0, we can take its square root. Let's define a feature map φ that maps each point xᵢ to a new vector in ℝⁿ: φ(xᵢ) = [√λ₁ uᵢ₁, √λ₂ uᵢ₂, ..., √λₙ uᵢₙ].
5. Verify the Dot Product: Now, let's compute the dot product φ(xᵢ) ⋅ φ(xⱼ) using our new feature map: φ(xᵢ) ⋅ φ(xⱼ) = ∑ₖ (√λₖ uᵢₖ)(√λₖ uⱼₖ) = ∑ₖ (√λₖ)² uᵢₖ uⱼₖ = ∑ₖ λₖ uᵢₖ uⱼₖ.
6. Conclusion: This final expression is exactly our expansion for Kᵢⱼ from Step 2. We have successfully shown that Kᵢⱼ = φ(xᵢ) ⋅ φ(xⱼ). This proves the theorem for the finite-dimensional matrix case.
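The construction in the proof can be reproduced numerically: eigendecompose a symmetric PSD matrix K, build φ(xᵢ) from the scaled eigenvectors, and check that the dot products recover K. A small sketch with a randomly generated PSD matrix:

import torch

torch.manual_seed(0)
B = torch.randn(5, 5, dtype=torch.float64)
K = B @ B.T                             # symmetric and PSD by construction

lam, U = torch.linalg.eigh(K)           # K = U diag(lam) U^T, with lam >= 0
lam = lam.clamp(min=0)                  # guard against tiny negative round-off
Phi = U * lam.sqrt()                    # row i is the feature vector phi(x_i)

print(torch.allclose(Phi @ Phi.T, K))   # True: K_ij = phi(x_i) . phi(x_j)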
Appendix C: proof that the identity matrix is symmetrical and positive semi-definite
Definition of a Symmetric Matrix:
A matrix A is symmetric if it is equal to its transpose, i.e., A = Aᵀ. The transpose of a matrix is found by swapping its rows and columns. In terms of elements, this means Aᵢⱼ = Aⱼᵢ for all i and j.
Proof:
We need to show that Iᵀ = I. Let's prove this by showing that their corresponding elements are equal for all i and j.
- Let (Iᵀ)ᵢⱼ be the element in the i-th row and j-th column of the matrix Iᵀ.
- By the definition of the transpose, (Iᵀ)ᵢⱼ = Iⱼᵢ.
- Now, let's use the definition of the identity matrix to evaluate Iⱼᵢ:
  - If j = i, then Iⱼᵢ = 1.
  - If j ≠ i, then Iⱼᵢ = 0.
- Let's compare this to the elements of the original identity matrix, Iᵢⱼ:
  - If i = j, then Iᵢⱼ = 1.
  - If i ≠ j, then Iᵢⱼ = 0.

Notice that the condition "j = i" is identical to "i = j", and "j ≠ i" is identical to "i ≠ j".
Therefore, for all possible values of i and j, we have shown that (Iᵀ)ᵢⱼ = Iᵢⱼ. Since all corresponding elements are equal, the matrices are equal.

Conclusion: Iᵀ = I, so the identity matrix is symmetrical.
Part 2: Proof that the Identity Matrix is Positive Semi-Definite
A symmetric matrix A is positive semi-definite if, for every non-zero column vector x, the quadratic form xᵀAx is non-negative.
We can prove this for the identity matrix using two common methods.
Method 1: Using the Definition of the Quadratic Form
- Let I be the n × n identity matrix and let x = [x₁, x₂, ..., xₙ]ᵀ be any column vector.
- We need to evaluate the quadratic form xᵀIx.
- First, let's compute the product Ix. By the definition of the identity matrix, multiplying any vector by I results in the vector itself: Ix = x.
- Now, substitute this back into the quadratic form: xᵀIx = xᵀx.
- The expression xᵀx is the dot product of the vector x with itself, which can be written as x₁² + x₂² + ... + xₙ² = Σᵢ xᵢ².
- Since each xᵢ is a real number, its square must be greater than or equal to zero (xᵢ² ≥ 0). The sum of non-negative numbers is also non-negative.
- Therefore, xᵀIx = Σᵢ xᵢ² ≥ 0.

This holds true for any vector x. Thus, the identity matrix is positive semi-definite.
Method 2: Using Eigenvalues
Definition: A symmetric matrix is positive semi-definite if and only if all of its eigenvalues are non-negative (λ ≥ 0).

- Let's find the eigenvalues of the identity matrix I. The eigenvalue equation is Av = λv, where λ is an eigenvalue and v is its corresponding eigenvector.
- For the identity matrix, this equation becomes Iv = λv. Since Iv = v, we have v = λv.
- Rearranging the terms gives (λ − 1)v = 0.
- By definition, an eigenvector v must be a non-zero vector. For the equation to be true with v ≠ 0, the scalar multiplier must be zero: λ − 1 = 0, i.e. λ = 1.
- An identity matrix has n eigenvalues, and they are all equal to 1.
- The condition for positive semi-definiteness is that all eigenvalues must be non-negative. Since all eigenvalues of I are 1, and 1 ≥ 0, the condition is satisfied.
Conclusion: The identity matrix is positive semi-definite.
Summary
We have shown that:
- Iᵀ = I, which proves symmetry.
- xᵀIx ≥ 0 for all vectors x (and its eigenvalues are all ≥ 0), which proves positive semi-definiteness.

Note: In fact, we have shown something stronger. Since xᵀIx > 0 for all non-zero vectors x and its eigenvalues are all strictly positive (1 > 0), the identity matrix is actually positive definite, which is a subset of positive semi-definite.
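A trivial numerical confirmation of both properties (the vector is arbitrary):

import torch

I = torch.eye(4)
x = torch.tensor([1.0, -2.0, 0.5, 3.0])

print(torch.equal(I, I.T))                        # True: symmetric
print((x @ I @ x).item(), (x ** 2).sum().item())  # both 14.25, i.e. ||x||^2 >= 0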
Appendix D: why stacking of multiple linear transformations is equivalent to a single linear transformation
1. Mathematical Foundation: What is a Linear Transformation?
First, let's formally define our terms. A transformation (or function) T that maps vectors from one vector space to another (T: ℝⁿ → ℝᵐ) is linear if it satisfies two properties for any vectors u, v and any scalar c:
- Additivity: T(u + v) = T(u) + T(v)
- Homogeneity (Scalar Multiplication): T(cu) = c·T(u)
A cornerstone theorem of linear algebra states that every linear transformation can be represented by a matrix multiplication. If T is a linear transformation, there exists a unique matrix A such that for any vector x:

T(x) = Ax

This matrix representation is the key to our proof.
2. The Proof for Two Layers
Let's start by proving the case for two stacked transformations. "Stacking" is mathematically known as function composition.
- Let's define two linear transformations:
  - Let T₁: ℝⁿ → ℝᵐ be our first linear transformation. It takes an n-dimensional vector and maps it to an m-dimensional vector. It can be represented by an m × n matrix, which we'll call A₁. So, T₁(x) = A₁x.
  - Let T₂: ℝᵐ → ℝᵖ be our second linear transformation. It takes an m-dimensional vector (the output of T₁) and maps it to a p-dimensional vector. It can be represented by a p × m matrix, which we'll call A₂. So, T₂(y) = A₂y.
- "Stacking" T₁ and then T₂ means we first apply T₁ to an input vector x, and then apply T₂ to the result. Let's call this new, composed transformation L: L(x) = T₂(T₁(x)).
- Now, let's substitute the matrix representations into this equation:
  - First, replace T₁(x) with its matrix form: L(x) = T₂(A₁x).
  - The term A₁x is just a vector. Let's think of it as the input to T₂. We can now apply the matrix representation for T₂: L(x) = A₂(A₁x).
- Here is the crucial step. Matrix multiplication is associative. This means that for compatible matrices A, B, and C, we have (AB)C = A(BC). We can regroup our expression: L(x) = (A₂A₁)x.
- Let's define a new matrix C = A₂A₁. Since A₂ is a p × m matrix and A₁ is an m × n matrix, their product C will be a single p × n matrix.
- By substituting C back into our equation, we get: L(x) = Cx.

This shows that the entire composition of two linear transformations, L, can be represented by multiplication with a single matrix C. A transformation defined by a single matrix multiplication is, by definition, a linear transformation.

Therefore, the composition of two linear transformations is itself a single linear transformation.
3. Generalization to 'N' Layers
We can easily extend this logic to any number of layers by induction.
Imagine a stack of N linear transformations, T₁, T₂, ..., T_N, represented by matrices A₁, A₂, ..., A_N. The final transformation L is:

L(x) = T_N(...T₂(T₁(x))...)

Substituting the matrix forms gives:

L(x) = A_N(...(A₂(A₁x))...)

Because matrix multiplication is associative, we can drop all the parentheses and multiply the matrices together first:

L(x) = (A_N ⋯ A₂A₁)x

The product of all these matrices, C = A_N ⋯ A₂A₁, is still just one single matrix.

Therefore, the entire stack of N transformations is equivalent to a single linear transformation represented by the matrix C.
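A quick sketch verifying the collapse numerically for two randomly chosen linear maps:

import torch

torch.manual_seed(0)
A1 = torch.randn(6, 4)   # first linear map:  R^4 -> R^6
A2 = torch.randn(3, 6)   # second linear map: R^6 -> R^3
x = torch.randn(4)

C = A2 @ A1              # the single equivalent 3 x 4 matrix
print(torch.allclose(A2 @ (A1 @ x), C @ x, atol=1e-6))   # True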
Appendix E: theoretical validity of elu(x) + 1 as a kernel feature map
To prove symmetry, we must prove that k(x, y) = k(y, x).

Given that k(x, y) = φ(x)ᵀφ(y), and since the dot product is commutative, i.e. φ(x)ᵀφ(y) = φ(y)ᵀφ(x), it follows that k(x, y) = k(y, x).
Next, we prove that the kernel is positive semi-definite. Let's check the condition cᵀKc ≥ 0 for any vector c, where K is the Gram matrix with Kᵢⱼ = k(xᵢ, xⱼ).
cᵀKc = Σᵢ Σⱼ cᵢ cⱼ k(xᵢ, xⱼ)
Substitute our kernel definition:
= Σᵢ Σⱼ cᵢ cⱼ (φ(xᵢ)ᵀφ(xⱼ))
We can rearrange this sum:
= (Σᵢ cᵢ φ(xᵢ)ᵀ) (Σⱼ cⱼ φ(xⱼ))
Let the vector v = Σᵢ cᵢ φ(xᵢ)
. The expression becomes:
= vᵀv
This is simply the dot product of a vector with itself, which is the squared L2 norm ||v||²
. The squared norm of any real vector is always greater than or equal to 0.
Positive semi-definiteness is automatically satisfied.
Appendix F: practical usefulness of elu(x) + 1 as a kernel feature map
We want to prove that elu(x) + 1 is a practically useful kernel feature map where:
- φ(q)ᵀφ(k) ≥ 0, i.e. the weights are non-negative
- elu(x) + 1 is a non-linear feature map.
We recall that the equation of the feature map is:

φ(x) = elu(x) + 1, where elu(x) = x for x > 0 and elu(x) = eˣ − 1 for x ≤ 0

Given the equation, both φ(q) and φ(k) are guaranteed to be non-negative (in fact strictly positive, since elu(x) > −1), hence their dot product must also be non-negative. Therefore, φ(q)ᵀφ(k) ≥ 0.
On linearity, a function f is linear if f(ax + by) = a·f(x) + b·f(y) for all scalars a, b and inputs x, y.

Let us test f(x) = elu(x) + 1 with a = 2, x = 1, b = 1, and y = −1:

LHS: f(2·1 + 1·(−1)) = f(1) = elu(1) + 1 = 2

RHS: 2·f(1) + 1·f(−1) = 2·2 + e⁻¹ ≈ 4.37

As the LHS ≠ RHS, elu(x) + 1 is a non-linear feature map.
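The counterexample is easy to verify in code:

import torch
import torch.nn.functional as F

def f(x):
    return F.elu(torch.tensor(x)) + 1

a, x, b, y = 2.0, 1.0, 1.0, -1.0
lhs = f(a * x + b * y)           # f(1) = 2
rhs = a * f(x) + b * f(y)        # 2*f(1) + 1*f(-1) = 4 + e^-1 ≈ 4.37
print(lhs.item(), rhs.item())    # not equal, so elu(x) + 1 is not linear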