Demystifying the PageRank Algorithm
Sishaar Rao May 13 '17
PageRank is arguably the most widely-utilized and influential algorithm in our modern day - chances are you use it tens of times every single day. In this article, I hope to explain not only what PageRank is and how it became so influential but also the Matrix/Linear Algebra that powers it. I created a simple Python+Shell demo that you can check out to see PageRank in live action!
The Stone Ages of the Internet -- How Search Engines were born
Once upon a time, you had to know the address of a webpage if you wanted to visit it. Could you imagine how frustrating that would be? You could pretty much only find new websites by word of mouth.
But then Archie said "Let there be the world's first search engine."
How did these work? Well the world's first search engines more or less followed the same process:
1) Crawling: Getting a list of domain names/information:
The tool that consolidated all website data to be later searched became known as a webcrawler. This guy will find every site and then peruse the links on the page to get a complete look at what is really in it.
2) Indexing: Searching through (usually grepping) and finding relevant information:
This process searches through all the pages that the web crawler found and will return to you pages that match the search query. For example, if your query is "cat" the web crawler will get you every single web page, which the indexer will parse through to return pages with the keyword "cat" on it
Pros of these early search engines
The internet was a lot more accessible and useful to ordinary people. They could now find new information from websites they'd never known of before. Thus, the growth and usage of the internet exploded.
Cons of these early search engines
It sucked at finding what you wanted. The indexer would simply return whatever pages contained the query. The query "cat" could return you the web page for Caterpillar. You'd have to click through multiple links to find what you wanted.
Why does Everybody know Google?
In 1996, two Ph.D. students at Stanford named Larry Page and Sergey Brin wanted to create a Search Engine that could provide users with relevant results to queries. They went on to devise the PageRank algorithm which accomplished precisely this. Their search engine, originally dubbed BackRub, took off in popularity because for some reason it would find you exactly what you were searching for. They accomplished this by not only creating a search engine with the web crawler and indexer components but with the PageRank as a third step to sift through results and find the most relevant. The rest is history.
So what is the PageRank algorithm?
The PageRank algorithm: given a network of sites with links pointing to each other, determine the significance of each site
What do we mean by a network of sites? Something along the following:
In this example (which is utilized by the demo I created) we can see that Page 1 has links to Page 2 & 4, Page 2 has links to Page 1, 4, & 3, etc.
The main question however is - How can we determine "Significance"? How do we know which page is relevant to the query? When I first thought of this problem, I came up with a few approaches. Here's why they're wrong:
Approach 1: Use Artificial Intelligence to analyze text in the query and Webpage
Why this can't work: While this may be practical for a page or two, it's simply too costly. With over a billion pages, it'll take too long to determine the significance of every page given a query.
Approach 2: Simply Count the Number of Occurrences of a Keyword
Why this can't work: The most relevant pages aren't always the ones that mention the query the most. Irrelevant sites can easily boost their traffic by pasting the word on their page a million times. When I search "shoes", I want to see Nike, Adidas, and New Balance.
The Approach used by PageRank: Count the Number of Links pointing to that Webpage
The more the website is cited by others in the network, the more likely it's important. We can think of this as a "democratic voting" system. The key to this system is that every page gets one vote that is weighted by its significance
The Technical Details + Linear Algebra Basis
Given a page in the network, we will count how many links to other pages the current page possesses. We will then take the inverse of this number to get the portion of the current page's weighted vote delegated to each of those other pages. Let's take a network of 4 pages with links that look like this:
We can express the significances of each page with a system of equations where Xn is Page n. Page 1 has links from Page 3 & 4 (aka Page 3 has a link to Page 1, Page 4 also has a link to Page 1). Since Page 3 is pointing to only one page, it's casting its entire vote for Page 1. Since Page 4 is pointing to two pages, it's casting half a vote for Page 1. We proceed with this process for the rest of the pages in the network to create the system of equations:
We could go ahead and brute-force our way through this SOE but ain't nobody got time fo' that. If we look carefully, we can observe that all the variables in this system are defined in relation to each other: namely, Xn is given some value in relation to different values of n, and those different values of n may be defined in relation to the original value of n. Thus, we can express this system of equations in matrix form Ax = x. The matrix A is going to be populated with our coefficients from the system, and the matrix X is going to be populated with the coefficients X1, X2, X3, X4.
Note: All the columns in matrix A add up to 1 because each Page can cast a total of one vote whereas the rows are the sums of the weighted votes that each Page receives. Also, the "T" superscript indicated transposed, in that the matrix X is a 4x1 rather than a 1x4 matrix.
So now we have transformed the system of equations to a system of matrices. How does help us find the solution (namely the x matrix) for the system? This is where Eigenvalues & Eigenvectors come in.
A brief overview of Eigenvalues & Eigenvectors
Eigenvectors are special instances of vectors that, when multiplied by a transformation matrix (in our case the matrix is A), the vector does not change its direction. It can change in scale, however, the factor by which we call the Eigenvalue. These special-case scenarios can be expressed by the equation Ax = λx where λ is our eigenvalue and x is the eigenvector. Note: Vector, in this case, is another name for the matrix.
In our system Ax = x we say Hey! The solution matrix X is an eigenvector for the system -- namely, it's the eigenvector when the eigenvalue (λ) is equal to 1!!!
We can then do the following steps which are characteristic of solving for an eigenvector:
1) Ax = λx
2) Ax - λx = 0
3) (A - λI)x = 0 <-- where I is the Identity matrix
From here, we can use row reduction to solve for our matrix x, when λ = 1. In this particular instance, we get the following as our solution matrix x:
As we can see, we find that Page 1 is our most relevant ranking. This may seem counterintuitive given that Page 3 has the more links pointing to it, but remember, the votes are based on weighted significance.
The Genius of Google
Learning how the PageRank algorithm works should leave everyone awestruck. The scale at which they employ PageRank is exponentially larger than the example I provided. Never mind the fact that they handle millions of queries every day in a fraction of a second.
The Genius behind Google is that they employed Mathematical principles to provide a solution to a problem everybody was facing at the time. Going forward, when I look at the work I'm doing, it helps me to take a step back and think: What am I trying to solve? What am I trying to accomplish? How does my work provide value to people? This algorithm is an ode to that mindset, and I genuinely believe that if you work in this manner, regardless of your field, you'll find success.
I hope this article was helpful in understanding the PageRank algorithm. Check out the Python+Shell demo I created! Above all, if you feel like I made any factual errors or if there's anything I can clarify/expand/modify to improve this post, please comment or reach out to me!