Did you check out the recent ICLR results? I got intrigued by a rather provocative paper from Apple - ParaRNN, claiming parallel evaluation for RNNs, even though the inability to parallelize is supposedly their main weakness, the very reason transformers replaced them (in most tasks).
So let's dig into all of it at the lowest level possible. If you know what an RNN is and what a derivative is, this article is for you.
1. The DEER algorithm
DEER = Deep Equilibrium Evaluation of Recurrence (Lim et al., 2024). The base algorithm on which ParaRNN is built.
1.1. Formulation as a root-finding problem
Consider an RNN with transition function $f$, initial state $s_0$, and unknown states $s_1, \dots, s_T$. Let's introduce the residual:

$$r_t(s) = s_t - f(s_{t-1}), \qquad t = 1, \dots, T.$$

The true trajectory $s^*$ is the **unique solution** of the equation with zero residual:

$$r(s^*) = 0.$$
When we say "apply an RNN to a sequence", we mean the standard procedure: take the initial state $s_0$, apply the transition function $f$, get $s_1 = f(s_0)$, then apply $f$ again to get $s_2 = f(s_1)$, and so on:

$$s_1 = f(s_0), \quad s_2 = f(s_1), \quad \dots, \quad s_T = f(s_{T-1}).$$
Accordingly, $r$ evaluated on this trajectory turns out to be a vector with all elements equal to 0, again because when the recurrence holds we have $s_t = f(s_{t-1})$ and therefore $r_t = s_t - f(s_{t-1}) = 0$.
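To make this concrete, here is a minimal NumPy sketch (the toy transition $f(s) = \tanh(Ws + b)$ and all the sizes are my own illustration, not anything from the paper): run the recurrence sequentially and check that the residual of the resulting trajectory is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 4                       # sequence length and hidden size (arbitrary)
W = rng.normal(size=(D, D)) / np.sqrt(D)
b = rng.normal(size=D) * 0.1

def f(s):                         # one-step transition function
    return np.tanh(W @ s + b)

s0 = rng.normal(size=D)           # known initial state

# sequential application of the RNN: s_t = f(s_{t-1})
s = [s0]
for t in range(T):
    s.append(f(s[-1]))

# residual r_t(s) = s_t - f(s_{t-1}); on the true trajectory it is exactly zero
r = np.stack([s[t] - f(s[t - 1]) for t in range(1, T + 1)])
print(np.abs(r).max())            # -> 0.0
```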
1.2. Newton's iterations
So we need to find a solution to the equation $r(s) = 0$ - in the full case, a vector $s = (s_1, \dots, s_T)$ that solves a system of $T$ coupled equations $r_1(s) = 0, \dots, r_T(s) = 0$. But let's start with the scalar case.
Scalar case: one equation in one variable
Suppose we have a smooth function $r(s)$ and we want to find $s^*$ such that $r(s^*) = 0$. Geometrically, we want to find the point where the graph of the function crosses the x-axis.
The idea of Newton's method rests on a simple thought: in a small neighborhood of a point, a smooth function is almost indistinguishable from its tangent line. If we're standing at our current approximation $s^{(k)}$ (which, in general, is not a root - there $r(s^{(k)}) \neq 0$), we can pretend that $r$ is its tangent line at this point, and for such a linear function it's easy to find analytically where it crosses the axis.
The tangent to $r$ at the point $s^{(k)}$ is the Taylor expansion cut off at the first-order term:

$$r(s) \approx r(s^{(k)}) + r'(s^{(k)})\,(s - s^{(k)}).$$
The Taylor expansion is a way to approximate any smooth function near a point $a$ by a polynomial:

$$f(x) = f(a) + f'(a)(x - a) + \frac{f''(a)}{2!}(x - a)^2 + \frac{f'''(a)}{3!}(x - a)^3 + \dots,$$

where each subsequent term refines the approximation by adding information about an ever finer feature of the function's shape (slope, curvature, etc.). The logical meaning: if a function is smooth, then its behavior in a neighborhood of a point is fully encoded in the values of its derivatives at that single point - by measuring a few numbers at $a$, we can reconstruct the function's values nearby. The divisor $n!$ arises naturally from the requirement that at $x = a$ all derivatives of the polynomial coincide with those of the function itself (it cancels with the factorial that pops out when differentiating $(x - a)^n$ $n$ times).
We set this linear approximation equal to zero and find where it crosses the axis:

$$r(s^{(k)}) + r'(s^{(k)})\,(s - s^{(k)}) = 0.$$
Solve for $s$ - this is just school algebra:

$$s = s^{(k)} - \frac{r(s^{(k)})}{r'(s^{(k)})}.$$
And we declare this to be our next approximation:

$$s^{(k+1)} = s^{(k)} - \frac{r(s^{(k)})}{r'(s^{(k)})}.$$
Graphically, the step $s^{(k)} \to s^{(k+1)}$ looks like sliding down the tangent line to where it meets the x-axis: our estimate of the root (we're interested in the intersection of the function with the x-axis) shifts from 2 to 1, which is an improvement, since the target value is 0.
We can rewrite this in terms of the increment $\Delta = s^{(k+1)} - s^{(k)}$, which will be more convenient when generalizing:

$$r'(s^{(k)})\,\Delta = -r(s^{(k)}).$$
That is, "find an increment such that the linear correction cancels out the current residual ".
Multidimensional case: N equations in N variables
Now let's generalize. Instead of one function of one variable, we have $r: \mathbb{R}^N \to \mathbb{R}^N$ - a vector-valued function of a vector argument, and we're looking for a vector $s^*$ such that $r(s^*) = 0$.
The logic remains literally the same, only the objects change:
| Scalar | Vector |
|---|---|
| function $r(s)$ | vector function $r(s)$ |
| derivative $r'(s)$ - a number | Jacobian $J(s)$ - an $N \times N$ matrix |
| tangent (line) | tangent hyperplane |
| division by $r'(s^{(k)})$ | multiplication by $J^{-1}$ (i.e., solving a linear system) |
Where $J$ is the Jacobian, the multidimensional analog of the ordinary derivative (more on this below).
The Jacobian is just a matrix of all partial derivatives: at position $(i, j)$ stands $\partial r_i / \partial s_j$. It plays the role of a derivative - it shows how a small change in $s$ affects $r(s)$ to first order.
Linearization of $r$ around the point $s^{(k)}$:

$$r(s) \approx r(s^{(k)}) + J(s^{(k)})\,(s - s^{(k)}).$$
We set the linear approximation equal to the zero vector:

$$r(s^{(k)}) + J(s^{(k)})\,(s - s^{(k)}) = 0.$$
And by denoting the increment $\Delta = s - s^{(k)}$, we get a linear system for $\Delta$:

$$J(s^{(k)})\,\Delta = -r(s^{(k)}).$$
Having solved the system and obtained $\Delta$, we update the approximation:

$$s^{(k+1)} = s^{(k)} + \Delta.$$
This is often written compactly using the inverse matrix:

$$s^{(k+1)} = s^{(k)} - J(s^{(k)})^{-1}\,r(s^{(k)})$$
- this is the same formula, just shorter. The $J^{-1}$ here is purely notational: in practice, no one ever computes the inverse matrix, because it's both expensive and numerically unstable. Instead, one solves the system directly - for example, via LU decomposition, or via forward substitution if $J$ has a special structure (which is what happens in our case).
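The multidimensional step looks like this in NumPy; the 2-D test system (a circle intersected with a line) is an arbitrary illustration, and `np.linalg.solve` is used precisely so that $J^{-1}$ is never formed explicitly:

```python
import numpy as np

def r(s):                            # residual r: R^2 -> R^2
    x, y = s
    return np.array([x**2 + y**2 - 4.0,   # circle of radius 2
                     x - y])              # the line x = y

def J(s):                            # Jacobian of r: entry (i, j) is d r_i / d s_j
    x, y = s
    return np.array([[2 * x, 2 * y],
                     [1.0,  -1.0]])

s = np.array([3.0, 1.0])             # initial guess
for _ in range(8):
    delta = np.linalg.solve(J(s), -r(s))   # solve J delta = -r, never invert J
    s = s + delta
print(s)                             # -> approximately [sqrt(2), sqrt(2)]
```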
1.3. Application to our RNN problem
For RNNs everything follows exactly this pattern, only the dimensions are specific:
- $s = (s_1, \dots, s_T)$ - all hidden states glued into one long vector of length $T \cdot D$ (here $T$ is the sequence length and $D$ is the dimension of one hidden state).
- $r(s) = (r_1(s), \dots, r_T(s))$ - the vector of all one-step residuals, of the same length.
- $J = \partial r / \partial s$ - the $TD \times TD$ Jacobian of the residual with respect to the state.
We apply the same Newton step:

$$J(s^{(k)})\,\Delta = -r(s^{(k)}), \qquad s^{(k+1)} = s^{(k)} + \Delta.$$
And here a legitimate question arises: "But isn't solving a linear system of size $TD \times TD$ the same sequential problem? Where's the parallelization?"
If $J$ were an arbitrary dense matrix, then yes - a naive solution would cost $O((TD)^3)$, and there'd be no benefit. But $J$ is not arbitrary. Due to the Markov property of the RNN (each step sees only the previous state $s_{t-1}$, not the full history), in the Jacobian the overwhelming majority of blocks are zero. Specifically: in block-row $t$, nonzero entries appear only in block-columns $t$ and $t-1$. This gives us a block bidiagonal structure:

$$
J = \begin{pmatrix}
I & & & & \\
-J^f_2 & I & & & \\
 & -J^f_3 & I & & \\
 & & \ddots & \ddots & \\
 & & & -J^f_T & I
\end{pmatrix},
\qquad J^f_t := \frac{\partial f}{\partial s}(s_{t-1}).
$$
What is a Jacobian, generally
When we have an ordinary function of one variable $y = f(x)$, its derivative $f'(x)$ is a single number that tells us "how fast the output changes for a small change in the input". It plays the role of a local coefficient of proportionality: if we shift $x$ by a small $\varepsilon$, then $y$ changes by approximately $f'(x)\,\varepsilon$.
Now imagine a function whose input and output are both vectors. Say $F: \mathbb{R}^n \to \mathbb{R}^m$: we feed in a vector of $n$ numbers and get out a vector of $m$ numbers. The notion of "derivative" gets more complex here, because now we have to answer $m \cdot n$ questions at once: "how does the $i$-th output component change when the $j$-th input component changes?". The answers to all these questions naturally collect into a matrix of size $m \times n$ - and this is the Jacobian:

$$
J_F = \begin{pmatrix}
\dfrac{\partial F_1}{\partial x_1} & \cdots & \dfrac{\partial F_1}{\partial x_n} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial F_m}{\partial x_1} & \cdots & \dfrac{\partial F_m}{\partial x_n}
\end{pmatrix}.
$$
At position $(i, j)$ stands the number $\partial F_i / \partial x_j$ - the partial derivative of the $i$-th output component with respect to the $j$-th input component. So the Jacobian is literally a complete sensitivity map: each cell answers a specific question - "how sensitively does this output coordinate respond to this input coordinate?".
Recall the definition of the residual:

$$r_t(s) = s_t - f(s_{t-1}).$$
This expression depends only on two variables: on $s_t$ (through the first term) and on $s_{t-1}$ (through the second). All other $s_j$ simply don't appear in the formula. And the derivative with respect to a variable that doesn't appear in the formula equals zero.
Let's go case by case to see what block $\partial r_t / \partial s_j$ we get for different $j$:
Case 1: $j = t$. We take the derivative of $r_t$ with respect to $s_t$. Only the first term depends on $s_t$, and its derivative with respect to itself is the identity matrix. We get:

$$\frac{\partial r_t}{\partial s_t} = I.$$
Case 2: $j = t - 1$. We take the derivative with respect to $s_{t-1}$. The first term doesn't depend on it; the second is $-f(s_{t-1})$, and its derivative is $-\frac{\partial f}{\partial s}$ evaluated at the point $s_{t-1}$:

$$\frac{\partial r_t}{\partial s_{t-1}} = -\frac{\partial f}{\partial s}(s_{t-1}) = -J^f_t.$$
Case 3: all other $j$ (i.e., $j \neq t$ and $j \neq t - 1$). The variable $s_j$ simply doesn't appear in the formula for $r_t$. Hence:

$$\frac{\partial r_t}{\partial s_j} = 0.$$
That's it. Out of $T^2$ blocks, exactly $2T - 1$ are nonzero: $T$ identity matrices on the main diagonal and $T - 1$ transition Jacobians on the subdiagonal. Everything else is zero. If we write out the whole matrix:

$$
J = \begin{pmatrix}
I & & & & \\
-J^f_2 & I & & & \\
 & -J^f_3 & I & & \\
 & & \ddots & \ddots & \\
 & & & -J^f_T & I
\end{pmatrix}.
$$
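A small NumPy sketch that assembles this Jacobian for the toy $\tanh$ RNN used earlier (all names and sizes are illustrative) and confirms the structure we just derived: identities on the block diagonal, $-\partial f/\partial s$ on the subdiagonal, and nothing above the diagonal:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 3                                   # toy sizes
W = rng.normal(size=(D, D)) / np.sqrt(D)
bias = rng.normal(size=D) * 0.1

f = lambda s: np.tanh(W @ s + bias)
df = lambda s: np.diag(1 - np.tanh(W @ s + bias) ** 2) @ W   # df/ds at the point s

s = [rng.normal(size=D) for _ in range(T + 1)]   # s[0] is the initial state,
                                                 # s[1..T] is an arbitrary guess

J = np.zeros((T * D, T * D))
for t in range(1, T + 1):                        # block-row t
    row = slice((t - 1) * D, t * D)
    J[row, row] = np.eye(D)                      # d r_t / d s_t = I
    if t >= 2:
        col = slice((t - 2) * D, (t - 1) * D)
        J[row, col] = -df(s[t - 1])              # d r_t / d s_{t-1} = -df/ds(s_{t-1})

# only 2T - 1 of the T^2 blocks are nonzero; in particular everything strictly
# above the diagonal is zero
print(np.count_nonzero(np.triu(J, k=1)))         # -> 0
```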
Where the Markov property comes in
The key reason this structure appears is the Markov property of the RNN. The transition function $f$ at each step looks only at the previous state $s_{t-1}$, not the entire history $s_0, s_1, \dots, s_{t-1}$. Because of this, the residual $r_t$ turns out to be a "local" object: it depends only on two adjacent states - the current $s_t$ and the previous $s_{t-1}$.
How much memory do we actually need
Although formally the Jacobian has $(TD)^2$ cells, we only need to store the nonzero blocks. These are:
- $T$ identity matrices - but we don't even need to store these; we know they're $I$ and can substitute them on the fly;
- $T - 1$ transition Jacobians of size $D \times D$ - that's roughly $T D^2$ numbers, which for $T = 1000$, $D = 256$ gives about 65 million numbers instead of 65 billion. Already feasible.
How the Jacobian's structure gives us parallelism
Now the main question: why does this structure let us solve the system in parallel? Here it's important to distinguish two levels:
Level 1: the structure lets us solve the system via forward substitution. Take the system $J\,\Delta = -r$ and write it out block-row by block-row. The first block-row of the matrix is $(I \;\; 0 \;\; \cdots \;\; 0)$, so the first equation of the system is:

$$I\,\Delta_1 = -r_1$$
- that is, simply $\Delta_1 = -r_1$. We got the first chunk of the answer essentially for free.
The second block-row of $J$ is $(-J^f_2 \;\; I \;\; 0 \;\; \cdots \;\; 0)$, so the second equation:

$$-J^f_2\,\Delta_1 + I\,\Delta_2 = -r_2.$$
From which:

$$\Delta_2 = J^f_2\,\Delta_1 - r_2.$$
And in general, for any $t \ge 2$:

$$\Delta_t = J^f_t\,\Delta_{t-1} - r_t.$$
This is the linear recurrence that our gigantic system has turned into. Note that in general solving a linear system costs $O(n^3)$ in the system size - but here we avoided inverting any matrix, thanks to $J$ being block bidiagonal. The system is solved by a simple top-to-bottom sweep through the equations. This is what's called forward substitution.
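Here is a sketch of that sweep, with random matrices standing in for the $J^f_t$ blocks and random vectors standing in for the residuals (all of it illustrative); the result is cross-checked against a dense solve of the assembled block-bidiagonal system:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 3
Jf = rng.normal(size=(T + 1, D, D)) * 0.3   # Jf[t] stands in for df/ds(s_{t-1}); Jf[0..1] unused
r = rng.normal(size=(T + 1, D))             # r[t] stands in for the residual block r_t; r[0] unused

# forward substitution: a single top-to-bottom sweep
delta = np.zeros((T + 1, D))
delta[1] = -r[1]                            # block-row 1:  I delta_1 = -r_1
for t in range(2, T + 1):
    delta[t] = Jf[t] @ delta[t - 1] - r[t]  # block-row t:  -Jf_t delta_{t-1} + delta_t = -r_t

# cross-check against a dense solve of the full block-bidiagonal system
J = np.eye(T * D)
for t in range(2, T + 1):
    J[(t - 1) * D:t * D, (t - 2) * D:(t - 1) * D] = -Jf[t]
print(np.allclose(delta[1:].ravel(), np.linalg.solve(J, -r[1:].ravel())))   # -> True
```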
If things ended here, we'd have only a sequential algorithm taking $T$ steps - each $\Delta_t$ depends on $\Delta_{t-1}$, and we have to traverse the recurrence strictly in order. Same as just running the RNN sequentially. Parallelism is born at the next level.
Level 2: the recurrence is LINEAR, and therefore associative. This is the main trick. Let's note the principal difference between two situations:
- Original RNN: $s_t = f(s_{t-1})$ - the function $f$ is nonlinear, so this kind of recurrence cannot be parallelized: we have to compute each step honestly, in sequence.
- Recurrence for $\Delta$: $\Delta_t = A_t\,\Delta_{t-1} + b_t$ (where $A_t = J^f_t$, $b_t = -r_t$) - this one is linear. That means we can derive a closed-form expression from it:

$$\Delta_t = \sum_{k=1}^{t} \Big( \prod_{j=k+1}^{t} A_j \Big)\, b_k,$$

with the convention that the empty product (for $k = t$) is the identity.
All these matrix products can be computed in any order (matrix multiplication is associative: $(AB)C = A(BC)$). Which means we can build a computation tree where we first compute all pairwise products $A_{j+1} A_j$ in parallel, then all products of 4 consecutive factors, and so on. In $O(\log T)$ tree levels we get all the cumulative products we need, and from them we assemble all the $\Delta_t$ simultaneously.
This is the parallel scan (also known as parallel prefix sum in its generalized form). By analogy with ordinary addition: if you need to sum a billion numbers, sequentially that's a billion steps, but with a pairwise tree it's only $\log_2 10^9 \approx 30$ levels. The same trick works for any associative operation, and the composition of linear maps (i.e., multiplication of their matrices) is associative.
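Below is a sketch of such a scan over affine maps in NumPy (a Hillis-Steele-style inclusive scan; the combine rule and the random test data are my own illustration, not code from the paper). On a CPU it isn't actually faster, but it shows the structure: $O(\log T)$ levels, and within a level every position is independent of the others, which is what a GPU can exploit.

```python
import numpy as np

def scan_affine(A, b):
    """Inclusive scan over the affine maps x -> A[t] @ x + b[t].

    combine((A1, b1), (A2, b2)) = (A2 @ A1, A2 @ b1 + b2) is associative,
    so log2(T) doubling levels suffice; every position inside one level
    is updated independently of the others.
    """
    A, b = A.copy(), b.copy()
    T = A.shape[0]
    d = 1
    while d < T:
        A_old = A[d:].copy()                              # old values at the updated positions
        b[d:] = np.einsum('tij,tj->ti', A_old, b[:-d]) + b[d:]
        A[d:] = A_old @ A[:-d]                            # batched matmul over the time axis
        d *= 2
    return A, b

# The recurrence delta_t = A_t delta_{t-1} + b_t with delta_0 = 0:
# after the scan, the b-parts are exactly the delta_t.
rng = np.random.default_rng(1)
T, D = 16, 3
A = rng.normal(size=(T, D, D)) * 0.3     # stand-ins for the Jf_t blocks
b = rng.normal(size=(T, D))              # stand-ins for the -r_t blocks

_, delta_scan = scan_affine(A, b)

delta_seq = np.zeros((T, D))             # the same recurrence, computed sequentially
delta_seq[0] = b[0]
for t in range(1, T):
    delta_seq[t] = A[t] @ delta_seq[t - 1] + b[t]

print(np.allclose(delta_scan, delta_seq))   # -> True
```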
Bottom line on complexity: one Newton step runs in $O(\log T)$ parallel depth (instead of $O(T)$ sequential steps), and the entire RNN application takes $O(\text{iters} \cdot \log T)$, where iters is the number of Newton iterations.
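Putting the pieces together, here is a minimal end-to-end sketch of the DEER idea on the same toy $\tanh$ RNN: start from a deliberately wrong guess for the whole trajectory and apply Newton steps until the residual vanishes. For clarity the inner linear solve uses the sequential forward substitution from Level 1; in the parallel version that inner loop is exactly what gets replaced by the scan. Everything here (sizes, weights, tolerances) is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 64, 8
W = rng.normal(size=(D, D)) / np.sqrt(D)
bias = rng.normal(size=D) * 0.1
f = lambda s: np.tanh(s @ W.T + bias)                          # works on one state or a batch
df = lambda s: (1 - np.tanh(W @ s + bias) ** 2)[:, None] * W   # df/ds at one state s

s0 = rng.normal(size=D)

# ground truth: the ordinary sequential rollout
truth = np.zeros((T, D))
prev = s0
for t in range(T):
    prev = f(prev)
    truth[t] = prev

# DEER: Newton iterations on the whole trajectory at once, starting from zeros
s = np.zeros((T, D))
for it in range(T):                                  # at most T iterations are ever needed
    prev = np.vstack([s0, s[:-1]])                   # s_{t-1} for every t
    r = s - f(prev)                                  # residual r_t = s_t - f(s_{t-1})
    if np.abs(r).max() < 1e-12:
        break
    delta = np.zeros_like(s)                         # forward substitution (the sequential
    delta[0] = -r[0]                                 # Level-1 solver; the parallel version
    for t in range(1, T):                            # swaps this loop for the scan)
        delta[t] = df(prev[t]) @ delta[t - 1] - r[t]
    s = s + delta

print(f"converged after {it} Newton iterations, max error {np.abs(s - truth).max():.2e}")
```

In the worst case Newton needs $T$ iterations here (each step makes at least one more state exact), but for a well-behaved transition like this one it stops after far fewer, which is exactly what makes the iters-times-log-T budget attractive.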
