How to make PageRank faster (with lots of math and a hint of Python)

Giacomo Mariani (karjudev) ・ 7 min read

TL;DR: Here you can find a single-threaded implementation of Personalized PageRank that works better on sparse graphs.

In the beginning there was PageRank: a brilliant algorithm that measures the importance of a webpage by assigning it a numerical weight. Back in the day, to bring your site to the first page of Google Search, you could just add links to your page on portals perceived as "important".

Nowadays PageRank is no longer the only metric used to rank web pages, but the algorithm has proved useful in every context where we need an effective ranking over graphs of various kinds, from Twitter's WTF (Who To Follow) system to applications in predictive models for archaeology.

With the rise of exploratory programming environments such as Jupyter notebooks, lightweight, single-threaded implementations of well-known algorithms have become important for sketching ideas and developing software via iterative approximations, even on resource-constrained devices.

In this post I'd like to talk about a different implementation of the Personalized PageRank algorithm, introduced by Gianna Maria Del Corso, Antonio Gulli and Francesco Romani in this paper.

Note: The following requires a basic knowledge of linear algebra and of numerical methods for solving eigenvector problems and linear systems. There will be a bunch of mathematical explanation, and a final Python implementation.

The Personalized PageRank

Given a graph $G = (N, A)$ with $n$ nodes (vertices) and $m$ edges, we suppose that the probability of switching from node $i \in N$ to any of its out-neighbors $j \in \mathcal{O}_i$ is the quantity defined as follows:

$$P_{ij} = \begin{cases} \frac{1}{\# \mathcal{O}_i} & \# \mathcal{O}_i > 0 \\ 0 & \# \mathcal{O}_i = 0 \end{cases}$$

Note: In this implementation we removed self-loops from the adjacency list.
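As a tiny illustration, assuming the graph is stored as a Dict[int, List[int]] of out-neighbors (the representation used later in the post), $P_{ij}$ can be computed on the fly; the function name here is just an illustrative choice:

```python
from typing import Dict, List


def transition_probability(graph: Dict[int, List[int]], i: int, j: int) -> float:
    """P_ij: probability that a random surfer moves from node i to node j."""
    out = [k for k in graph[i] if k != i]  # out-neighbors, self-loops removed
    if not out:                            # i is a dangling node
        return 0.0
    return 1.0 / len(out) if j in out else 0.0
```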

When $\# \mathcal{O}_i = 0$, $i$ is called a dangling node, and we denote the position of such nodes in a vector $\mathbf{d} \in \{ 0, 1 \}^{n}$ where:

$$d_i = \begin{cases} 1 & \# \mathcal{O}_i = 0 \\ 0 & \# \mathcal{O}_i > 0 \end{cases}$$

We suppose to know a personalization vector $\mathbf{v} \in [0, 1]^{n}$ that for each node $i \in N$ stores the probability that $i$ is chosen among the others. For a web search engine it could be computed with cutting-edge machine learning models that learn from your search history (sometimes really badly). In another scenario, e.g. archaeology, it could be computed with respect to the cost of starting an excavation at the location. In every other case, we suppose $v_i = \frac{1}{n}$, and we always impose that $\sum_{i = 1}^{n} v_i = 1$ (in the implementation the values are normalized).

When we have no dangling nodes and no loops in $G$, the PageRank $z_i$ of node $i \in N$ is the sum of the PageRanks of its in-neighbors $j \in \mathcal{I}_i$, weighted with the transition probability that a random surfer switches from $j$ to $i$:

$$z_i = \sum_{j \in \mathcal{I}_i} P_{ji} z_j$$

which means that $\mathbf{z} \in [0, 1]^{n}$ is a left eigenvector of the matrix $P = [P_{ij}] \in [0, 1]^{n \times n}$:

$$\mathbf{z}^{T} P = \mathbf{z}^{T}$$

Since $\mathbf{z}$ is the steady-state distribution of the Markov chain modelled by $P$, $\mathbf{z}$ is in fact a probability vector, so $\sum_{i = 1}^{n} z_i = 1$.

We have to take into account two problems arising from this simple model:

  • When the random surfer reaches a dangling node, the navigation restarts according to the preference vector.
  • At each step, the surfer keeps navigating through outgoing edges with probability $\alpha$, and restarts the navigation from a node chosen via the preference vector with probability $1 - \alpha$. In such a way we prevent the surfer from getting stuck in infinite loops.

Denoting $\mathbf{e} = [1, \dots, 1]^{T} \in \{ 1 \}^{n}$, the eigenvector to compute is now:

$$\begin{aligned} \mathbf{z}^{T} &= \mathbf{z}^{T} \left[ \alpha (P + \mathbf{d} \mathbf{v}^{T}) + (1 - \alpha) \mathbf{e} \mathbf{v}^{T} \right] \\ &= \alpha \mathbf{z}^{T} (P + \mathbf{d} \mathbf{v}^{T}) + (1 - \alpha) \mathbf{z}^{T} \mathbf{e} \mathbf{v}^{T} \end{aligned}$$

We have $\mathbf{z}^{T} \mathbf{e} = \sum_{i = 1}^{n} z_i = 1$, so we can write:

$$\begin{aligned} \mathbf{z}^{T} &= \alpha \mathbf{z}^{T} P + \alpha \mathbf{z}^{T} \mathbf{d} \mathbf{v}^{T} + (1 - \alpha) \mathbf{v}^{T} \\ &= \alpha \mathbf{z}^{T} P + (\alpha \mathbf{z}^{T} \mathbf{d} + 1 - \alpha) \mathbf{v}^{T} \end{aligned}$$

The canonical way of solving this problem is via the well-known power method:

In mathematics, power iteration is an eigenvalue algorithm: given a diagonalizable matrix $A$, the algorithm will produce a number $\lambda$, which is the greatest (in absolute value) eigenvalue of $A$, and a nonzero vector $v$, which is a corresponding eigenvector of $\lambda$, that is, $Av = \lambda v$. The algorithm is also known as the Von Mises iteration.
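For reference, here is what that power iteration looks like on the equation above. This is a minimal sketch with a uniform personalization vector; the function name, tolerance and iteration cap are illustrative choices, and every node is assumed to appear as a key of the graph dict (dangling nodes mapping to an empty list):

```python
from typing import Dict, List


def pagerank_power(graph: Dict[int, List[int]], alpha: float = 0.85,
                   tol: float = 1e-9, max_iter: int = 200) -> Dict[int, float]:
    """Power iteration on z^T = alpha z^T (P + d v^T) + (1 - alpha) e v^T."""
    nodes = sorted(graph)
    n = len(nodes)
    v = {i: 1.0 / n for i in nodes}                            # uniform personalization
    out = {i: [j for j in graph[i] if j != i] for i in nodes}  # drop self-loops
    z = dict(v)
    for _ in range(max_iter):
        new = {i: 0.0 for i in nodes}
        dangling = 0.0
        for j in nodes:
            if out[j]:
                share = z[j] / len(out[j])   # z_j * P_ji for each out-edge (j, i)
                for i in out[j]:
                    new[i] += share
            else:
                dangling += z[j]             # probability mass on dangling nodes
        # Redistribute dangling mass via v, then apply the teleportation step.
        new = {i: alpha * (new[i] + dangling * v[i]) + (1 - alpha) * v[i]
               for i in nodes}
        if sum(abs(new[i] - z[i]) for i in nodes) < tol:
            return new
        z = new
    return z
```

Every iterate stays a probability vector, so no final normalization is needed.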

But now, instead of iterating, we'll do some algebraic manipulations to turn this into a sparse linear system.

Linear system formulation

Let's take a step back from the previous manipulation and write:

$$\begin{aligned} & \mathbf{z}^{T} = \alpha \mathbf{z}^{T} (P + \mathbf{d} \mathbf{v}^{T}) + (1 - \alpha) \mathbf{v}^{T} \\ \iff & \mathbf{z}^{T} - \alpha \mathbf{z}^{T} (P + \mathbf{d} \mathbf{v}^{T}) = (1 - \alpha) \mathbf{v}^{T} \\ \iff & \mathbf{z}^{T} \left[ I - \alpha (P + \mathbf{d} \mathbf{v}^{T}) \right] = (1 - \alpha) \mathbf{v}^{T} \\ \iff & \left[ I - \alpha (P^{T} + \mathbf{v} \mathbf{d}^{T}) \right] \mathbf{z} = (1 - \alpha) \mathbf{v} \end{aligned}$$

This is a dense linear system: when the graph is really big, the coefficient matrix is full of nonzero values and computing a solution takes a lot of work. However, when the graph is sparse, $P^{T}$ is sparse too, so the density comes entirely from the $\mathbf{v} \mathbf{d}^{T}$ component. What if we found a way not to store this matrix at all?

Let $R = I - \alpha P^{T}$ be the sparse coefficient matrix of this new linear system. We note that $I - \alpha (P^{T} + \mathbf{v} \mathbf{d}^{T}) = R - \alpha \mathbf{v} \mathbf{d}^{T}$, so we have:

$$\begin{aligned} & \left[ I - \alpha (P^{T} + \mathbf{v} \mathbf{d}^{T}) \right] \mathbf{z} = (1 - \alpha) \mathbf{v} \\ \iff & (R - \alpha \mathbf{v} \mathbf{d}^{T}) \mathbf{z} = (1 - \alpha) \mathbf{v} \\ \iff & \mathbf{z} = (1 - \alpha) (R - \alpha \mathbf{v} \mathbf{d}^{T})^{-1} \mathbf{v} \end{aligned}$$

Let's go all in with linear algebra and bring into play the Sherman–Morrison formula:

$$(R - \alpha \mathbf{v} \mathbf{d}^{T})^{-1} = R^{-1} + \frac{R^{-1} \mathbf{v} \mathbf{d}^{T} R^{-1}}{\frac{1}{\alpha} - \mathbf{d}^{T} R^{-1} \mathbf{v}}$$
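As a quick sanity check, the identity $(R - \alpha \mathbf{v} \mathbf{d}^{T})^{-1} = R^{-1} + \frac{R^{-1} \mathbf{v} \mathbf{d}^{T} R^{-1}}{1/\alpha - \mathbf{d}^{T} R^{-1} \mathbf{v}}$ (note the minus sign in the denominator, which comes from applying Sherman–Morrison with $u = -\alpha \mathbf{v}$) can be verified numerically with numpy on a small random invertible matrix standing in for $R$:

```python
import numpy as np

rng = np.random.default_rng(42)
n, alpha = 6, 0.85
R = np.eye(n) + 0.1 * rng.random((n, n))      # stand-in for I - alpha * P^T
v = rng.random(n)
v /= v.sum()                                  # personalization vector, sums to 1
d = rng.integers(0, 2, size=n).astype(float)  # dangling-node indicator vector

lhs = np.linalg.inv(R - alpha * np.outer(v, d))
R_inv = np.linalg.inv(R)
rhs = R_inv + (R_inv @ np.outer(v, d) @ R_inv) / (1.0 / alpha - d @ R_inv @ v)
assert np.allclose(lhs, rhs)                  # both sides agree
```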

Don't worry, we have climbed the peak of mathematical abstrusity; from here the equations get easier:

$$\begin{aligned} \mathbf{z} &= (1 - \alpha) (R - \alpha \mathbf{v} \mathbf{d}^{T})^{-1} \mathbf{v} \\ &= (1 - \alpha) \left( R^{-1} + \frac{R^{-1} \mathbf{v} \mathbf{d}^{T} R^{-1}}{\frac{1}{\alpha} - \mathbf{d}^{T} R^{-1} \mathbf{v}} \right) \mathbf{v} \\ &= (1 - \alpha) \left( 1 + \frac{\mathbf{d}^{T} R^{-1} \mathbf{v}}{\frac{1}{\alpha} - \mathbf{d}^{T} R^{-1} \mathbf{v}} \right) R^{-1} \mathbf{v} \end{aligned}$$

Now let $\mathbf{y}$ be the solution of $R \mathbf{y} = \mathbf{v} \iff \mathbf{y} = R^{-1} \mathbf{v}$, so we can write:

$$\mathbf{z} = (1 - \alpha) \left( 1 + \frac{\mathbf{d}^{T} \mathbf{y}}{\frac{1}{\alpha} - \mathbf{d}^{T} \mathbf{y}} \right) \mathbf{y} = \frac{\mathbf{y}}{\left\lVert \mathbf{y} \right\rVert_1}$$

The last equality holds because $\mathbf{e}^{T} R = (1 - \alpha) \mathbf{e}^{T} + \alpha \mathbf{d}^{T}$, so $\left\lVert \mathbf{y} \right\rVert_1 = \mathbf{e}^{T} \mathbf{y} = \frac{1 - \alpha \mathbf{d}^{T} \mathbf{y}}{1 - \alpha}$, while the scalar factor above simplifies to exactly $\frac{1 - \alpha}{1 - \alpha \mathbf{d}^{T} \mathbf{y}}$.

The computation is reduced to:

$$\begin{cases} R \mathbf{y} = (I - \alpha P^{T}) \mathbf{y} = \mathbf{v} \\ \mathbf{z} = \frac{\mathbf{y}}{\left\lVert \mathbf{y} \right\rVert_1} \end{cases}$$

We don't even need to compute the dangling nodes!
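As a cross-check of this formulation, the system can also be handed to an off-the-shelf sparse direct solver. This is not the method proposed in the paper, just a sketch using SciPy, assuming the graph is a Dict[int, List[int]] of out-neighbors with every node present as a key:

```python
from typing import Dict, List

import numpy as np
from scipy.sparse import csr_matrix, identity
from scipy.sparse.linalg import spsolve


def pagerank_sparse(graph: Dict[int, List[int]], alpha: float = 0.85) -> Dict[int, float]:
    """Solve (I - alpha P^T) y = v with a sparse direct solver, then normalize."""
    nodes = sorted(graph)
    index = {node: k for k, node in enumerate(nodes)}
    n = len(nodes)
    rows, cols, vals = [], [], []
    for j in nodes:
        out = [i for i in graph[j] if i != j]  # self-loops removed
        for i in out:
            rows.append(index[i])              # entry (i, j) of P^T is P_ji
            cols.append(index[j])
            vals.append(1.0 / len(out))
    P_T = csr_matrix((vals, (rows, cols)), shape=(n, n))
    v = np.full(n, 1.0 / n)                    # uniform personalization
    y = spsolve(identity(n, format="csr") - alpha * P_T, v)
    z = y / y.sum()                            # z = y / ||y||_1 (y is positive)
    return {node: z[index[node]] for node in nodes}
```

Dangling nodes simply contribute empty columns to $P^{T}$, exactly as the derivation promised.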

The Gauss–Seidel method

Let's take a step further and see how we don't even need to store $P$ explicitly: we can use $G$ itself.

To solve $R \mathbf{y} = \mathbf{v}$ we can use the Gauss–Seidel stationary iterative method:

In numerical linear algebra, the Gauss–Seidel method, also known as the Liebmann method or the method of successive displacement, is an iterative method used to solve a system of linear equations. It is named after the German mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel, and is similar to the Jacobi method. Though it can be applied to any matrix with non-zero elements on the diagonals, convergence is only guaranteed if the matrix is either strictly diagonally dominant, or symmetric and positive definite. It was only mentioned in a private letter from Gauss to his student Gerling in 1823. A publication was not delivered before 1874 by Seidel.

Starting from a vector $\mathbf{y}^{(0)}$ — in the implementation $\mathbf{y}^{(0)} = [\frac{1}{n}, \dots, \frac{1}{n}]^{T}$ — at each iteration $k \ge 0$ we compute every value $y_i^{(k+1)}$ as:

$$\begin{aligned} y_i^{(k+1)} &= \frac{v_i - \sum_{j = 1}^{i - 1} R_{ij} y_j^{(k+1)} - \sum_{j = i + 1}^{n} R_{ij} y_j^{(k)}}{R_{ii}} \\ &= \frac{v_i - \sum_{j = 1}^{i - 1} (-\alpha P_{ji}) y_j^{(k+1)} - \sum_{j = i + 1}^{n} (-\alpha P_{ji}) y_j^{(k)}}{1 - \alpha P_{ii}} \\ &= \frac{v_i + \alpha \left[ \sum_{j = 1}^{i - 1} P_{ji} y_j^{(k+1)} + \sum_{j = i + 1}^{n} P_{ji} y_j^{(k)} \right]}{1 - \alpha P_{ii}} \end{aligned}$$

Let's call $\mathcal{I}_i(a, b) = \{ j \in \mathcal{I}_i \mid a \le j \le b \}$. Noting that nodes with no link into $i$ contribute nothing to the sums (their coefficients are all zero), and that the diagonal of the matrix is $R_{ii} = 1 - \alpha P_{ii} = 1$ (we removed self-loops), we can write:

$$\begin{aligned} y_i^{(k+1)} &= \frac{v_i + \alpha \left[ \sum_{j = 1}^{i - 1} P_{ji} y_j^{(k+1)} + \sum_{j = i + 1}^{n} P_{ji} y_j^{(k)} \right]}{1 - \alpha P_{ii}} \\ &= v_i + \alpha \left[ \sum_{j \in \mathcal{I}_i(1, i - 1)} P_{ji} y_j^{(k+1)} + \sum_{j \in \mathcal{I}_i(i + 1, n)} P_{ji} y_j^{(k)} \right] \\ &= v_i + \alpha \left[ \sum_{j \in \mathcal{I}_i(1, i - 1)} \frac{y_j^{(k+1)}}{\# \mathcal{O}_j} + \sum_{j \in \mathcal{I}_i(i + 1, n)} \frac{y_j^{(k)}}{\# \mathcal{O}_j} \right] \end{aligned}$$

The implementation (finally!)

The main advantage of the Gauss–Seidel method is its low memory footprint: you just need to store one vector $\mathbf{y} \in \mathbb{R}^{n}$, updated in place. In Python, supposing that a graph G is stored as a Dict[int, List[int]], we can write the following, very simple implementation:
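Here is a minimal sketch of that idea; the function name, tolerance and iteration cap are illustrative choices, and every node is assumed to appear as a key of G, with dangling nodes mapped to an empty list:

```python
from typing import Dict, List, Optional


def pagerank(graph: Dict[int, List[int]],
             v: Optional[Dict[int, float]] = None,
             alpha: float = 0.85,
             tol: float = 1e-9,
             max_iter: int = 100) -> Dict[int, float]:
    """Personalized PageRank: solve R y = v by Gauss-Seidel, return z = y / ||y||_1."""
    nodes = sorted(graph)
    n = len(nodes)
    # Personalization vector: uniform by default, always normalized.
    if v is None:
        v = {i: 1.0 / n for i in nodes}
    else:
        total = sum(v.values())
        v = {i: v.get(i, 0.0) / total for i in nodes}
    # Out-degrees and in-neighbor lists, with self-loops removed.
    out_deg = {i: sum(1 for j in graph[i] if j != i) for i in nodes}
    in_neighbors: Dict[int, List[int]] = {i: [] for i in nodes}
    for j in nodes:
        for i in graph[j]:
            if i != j:
                in_neighbors[i].append(j)
    # Gauss-Seidel sweeps: y is updated in place, so the values computed
    # earlier in the current sweep (the y^(k+1) terms) are reused immediately.
    y = {i: 1.0 / n for i in nodes}
    for _ in range(max_iter):
        err = 0.0
        for i in nodes:
            new = v[i] + alpha * sum(y[j] / out_deg[j] for j in in_neighbors[i])
            err += abs(new - y[i])
            y[i] = new
        if err < tol:
            break
    # Normalize: the PageRank vector is z = y / ||y||_1.
    norm = sum(y.values())
    return {i: y[i] / norm for i in nodes}
```

Note that dangling nodes never enter the update: a node with an empty out-list simply contributes nothing to any sum, exactly as the derivation promised.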

This code is just a sketch, and can be improved in various ways. Just let me know if this caught your interest.

Sorry for the long post, see you next time!
