TL;DR: Here you can find a single-threaded implementation of Personalized PageRank that works well on sparse graphs.
At first, there was PageRank: a brilliant algorithm that measures the importance of a webpage by assigning a numerical weight to it. Back in the day, to bring your site to the first page of Google Search, you could just add links to your page in portals perceived as "important".
Nowadays PageRank is no longer the only metric used to rank web pages, but the algorithm proved to be useful in every context where we need an effective ranking for graphs of various kinds, from the Twitter WTF (Who To Follow) system to some applications in predictive models for archaeology.
With the rise of explorative programming environments such as Jupyter notebooks, lightweight, single-threaded implementations of well-known algorithms became important for sketching ideas and developing software via iterative approximations, even on resource-constrained devices.
Note: The following requires a basic knowledge of linear algebra and of numerical methods for solving eigenvector problems and linear systems. There will be a bunch of mathematical explanation, and a final Python implementation.
Given a graph $G = (V, E)$ with $n = |V|$ nodes (vertices) and $m = |E|$ edges, we suppose that the probability of switching from node $i$ to any of its neighbors $j$ is the quantity $p_{ij}$ defined as follows:

$$p_{ij} = \begin{cases} \dfrac{1}{\operatorname{outdeg}(i)} & \text{if } (i, j) \in E \\ 0 & \text{otherwise} \end{cases}$$
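As a minimal sketch, this transition probability can be read directly off an adjacency list (the graph `G` below is a made-up toy example):

```python
# Toy adjacency list: node -> list of out-neighbors (self-loops removed).
G = {0: [1, 2], 1: [2], 2: []}

def transition_prob(G, i, j):
    """p_ij = 1 / outdeg(i) if the edge (i, j) exists, 0 otherwise."""
    return 1.0 / len(G[i]) if j in G[i] else 0.0
```

Every row of probabilities sums to one, except the rows of nodes that have no outgoing edges at all.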
Note: In this implementation we removed self-loops from the adjacency list.
A node $i$ with $\operatorname{outdeg}(i) = 0$ is called a dangling node, and we denote the position of such nodes in a vector $d$, with $d_i = 1$ if $i$ is dangling and $d_i = 0$ otherwise.
We suppose to know a personalization vector $v$ that for each node $i$ stores the probability $v_i$ that $i$ is chosen among the others. For a web search engine it could be computed with cutting-edge machine learning models that learn from your search history (sometimes really badly). In another scenario, e.g. archaeology, it could be computed with respect to the cost of starting an excavation in each location. In every other case, we suppose $v_i = 1/n$, and we always impose that $\sum_i v_i = 1$ (in the implementation the values are normalized).
When we have no dangling nodes and no loops in $G$, the PageRank $x_i$ of node $i$ is the sum of the PageRanks of its in-pointing neighbors $j$, weighted with the transition probability $p_{ji}$ that a random surfer switches from $j$ to $i$. In matrix form, we look for the vector $x$ that represents the left eigenvector of the matrix $P = (p_{ij})$:

$$x^T = x^T P$$
Since $x$ is the steady-state distribution of the Markov chain modelled by $P$, $x$ is in fact a probability vector, so $\sum_i x_i = 1$.
We have to take into account two problems arising from this simple model:
- When the random surfer reaches a dangling node, the navigation restarts according to the preference vector $v$.
- At each step, the surfer can keep navigating through outgoing edges with probability $\alpha$, and restart the navigation from a node chosen via the preference vector with probability $1 - \alpha$. In this way we prevent the surfer from getting stuck in infinite loops.
Calling $A = \alpha (P + d v^T) + (1 - \alpha) \mathbf{1} v^T$ the matrix that embeds both corrections, the eigenvector to compute is now:

$$x^T = x^T A$$
We have $x^T \mathbf{1} = 1$, so we can write:

$$x^T = \alpha\, x^T P + \alpha\, (x^T d)\, v^T + (1 - \alpha)\, v^T$$
The canonical way of solving this problem is via the well-known power method:
In mathematics, power iteration is an eigenvalue algorithm: given a diagonalizable matrix $A$, the algorithm will produce a number $\lambda$, which is the greatest (in absolute value) eigenvalue of $A$, and a nonzero vector $v$, which is a corresponding eigenvector of $\lambda$, that is, $Av = \lambda v$. The algorithm is also known as the Von Mises iteration.
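For reference, here is a minimal sketch of the left power iteration on a dense toy matrix (NumPy; the graph and preference vector are made-up examples). It is only viable for small graphs, since it materializes the full matrix:

```python
import numpy as np

def power_method(A, tol=1e-12, max_iter=1000):
    """Left power iteration: find x with x^T = x^T A (A row-stochastic)."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)  # start from the uniform distribution
    for _ in range(max_iter):
        x_new = x @ A
        x_new /= x_new.sum()  # keep x a probability vector
        if np.abs(x_new - x).sum() < tol:
            break
        x = x_new
    return x_new

# Toy example: 3-node graph, node 2 dangling, uniform preference vector.
alpha, n = 0.85, 3
P = np.array([[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
d = np.array([0.0, 0.0, 1.0])
v = np.full(n, 1.0 / n)
A = alpha * (P + np.outer(d, v)) + (1 - alpha) * np.outer(np.ones(n), v)
x = power_method(A)
```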
but now we'll do some algebraic manipulations to get this into a sparse linear system.
Let's take a step back from the previous manipulation and write:

$$x^T (I - \alpha P - \alpha\, d v^T) = (1 - \alpha)\, v^T$$

This is a dense linear system: when the graph is really big, the coefficient matrix is full of nonzero values and a solution needs a lot of work to be computed. However, since $P$ is sparse whenever the graph is sparse, the density is caused by the rank-one component $d v^T$. What if we find a way to not store this matrix at all?
Let $S = I - \alpha P$ be the sparse coefficient matrix of this new linear system. We note that $I - \alpha P - \alpha\, d v^T = S - \alpha\, d v^T$ is a rank-one update of $S$, so we have:

$$x^T (S - \alpha\, d v^T) = (1 - \alpha)\, v^T$$
Let's go all in with linear algebra and put in the game the Sherman-Morrison formula:

$$(A + u w^T)^{-1} = A^{-1} - \frac{A^{-1} u\, w^T A^{-1}}{1 + w^T A^{-1} u}$$
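As a quick sanity check of the formula on random data (the matrix and vectors here are arbitrary, chosen only to be well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.random((4, 4))  # a well-conditioned matrix
u = rng.random(4)
w = rng.random(4)

A_inv = np.linalg.inv(A)
# Sherman-Morrison: (A + u w^T)^{-1} = A^{-1} - (A^{-1} u w^T A^{-1}) / (1 + w^T A^{-1} u)
lhs = np.linalg.inv(A + np.outer(u, w))
rhs = A_inv - np.outer(A_inv @ u, w @ A_inv) / (1.0 + w @ A_inv @ u)
assert np.allclose(lhs, rhs)
```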
Don't worry, we climbed the peak of mathematical abstrusity; from here on the equations will be easier.
Let $y$ be the solution of the sparse system $y^T S = (1 - \alpha)\, v^T$. Applying Sherman-Morrison with $A = S$, $u = -\alpha d$ and $w = v$, we can write:

$$x^T = (1 - \alpha)\, v^T (S - \alpha\, d v^T)^{-1} = \frac{1 - \alpha}{(1 - \alpha) - \alpha\, y^T d}\; y^T$$

Since $x$ is a scalar multiple of $y$ and must be a probability vector, the computation is reduced to:

$$x^T = \frac{y^T}{\lVert y \rVert_1}$$

We don't even need to compute the dangling nodes!
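In other words, one sparse solve plus a normalization gives the full personalized PageRank. A dense NumPy sketch on a hypothetical toy graph (real code would use a sparse solver):

```python
import numpy as np

alpha, n = 0.85, 3
# Toy graph: 0 -> 1, 0 -> 2, 1 -> 2; node 2 is dangling (all-zero row of P).
P = np.array([[0.0, 0.5, 0.5], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]])
v = np.full(n, 1.0 / n)

S = np.eye(n) - alpha * P
y = np.linalg.solve(S.T, (1 - alpha) * v)  # solves y^T S = (1 - alpha) v^T
x = y / y.sum()  # the normalization absorbs the dangling-node correction
```

The resulting `x` satisfies the fixed-point equation of the full dense matrix, dangling-node correction included, even though the solve never touched $d$.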
Let's take a step further and see how we don't even need to explicitly store $P$: we can use $G$ itself.
To solve $y^T S = (1 - \alpha)\, v^T$ we can use the Gauss-Seidel iterative stationary method:
In numerical linear algebra, the Gauss–Seidel method, also known as the Liebmann method or the method of successive displacement, is an iterative method used to solve a system of linear equations. It is named after the German mathematicians Carl Friedrich Gauss and Philipp Ludwig von Seidel, and is similar to the Jacobi method. Though it can be applied to any matrix with non-zero elements on the diagonals, convergence is only guaranteed if the matrix is either strictly diagonally dominant, or symmetric and positive definite. It was only mentioned in a private letter from Gauss to his student Gerling in 1823. A publication was not delivered before 1874 by Seidel.
Starting from a vector $y^{(0)}$ (in the implementation $y^{(0)} = v$), at each iteration $k$ we compute every value $y_i^{(k)}$. Noting that the nodes that have no link pointing to $i$ do not contribute to the sum because of their all-zero coefficients, and that the diagonal of the matrix is made of ones (we removed self-loops), we can write:

$$y_i^{(k)} = (1 - \alpha)\, v_i + \alpha \left( \sum_{\substack{j \to i \\ j < i}} \frac{y_j^{(k)}}{\operatorname{outdeg}(j)} + \sum_{\substack{j \to i \\ j > i}} \frac{y_j^{(k-1)}}{\operatorname{outdeg}(j)} \right)$$
The main advantage of the Gauss-Seidel method is its low memory footprint: you need to store just one vector $y$, updated in place. In Python, supposing that a graph `G` is stored as a `Dict[int, List[int]]` mapping each node to the list of its out-neighbors, we can write a very simple implementation.
This code is just a sketch, and can be improved in various ways. Just let me know if this caught your interest.
Sorry for the long post, see you next time!