Last week the blogosphere was abuzz when Google decided to ‘update’ the PageRank numbers displayed on the toolbar. It seems Google has made good on its threat to demote sites that buy and sell links for search rankings. The problem is that some innocent sites were caught in the crossfire. A couple of days later, Google corrected the mistake, and those sites are back where they belong.
The incident reveals that there is a lot of misunderstanding about PageRank, both inside and outside the SEO community. For example, Forbes reporter Andy Greenberg writes:
On Thursday, Web site administrators for major sites including the Washingtonpost.com, Techcrunch, and Engadget (as well as Forbes.com) found that their “pagerank”–a number that typically reflects the ranking of a site in Google…
He also quotes Barry Schwartz saying:
But Schwartz says he knows better. “Typically what Google shows in the toolbar is not what they use behind the scenes,” he says. “For about two and a half years now this number has had very little to do with search results.”
There are two mistakes in these assertions:
The toolbar PageRank does not reflect the ranking of a site in Google. It reflects Google’s perceived ‘importance’ of the site.
The toolbar PageRank is an approximation of the real PageRank Google uses behind the scenes. Google doesn’t update the toolbar PageRank as often as they update the real thing, but saying that it has little to do with search results is a little farfetched.
Several sites lost toolbar PageRank, yet they did not experience a drop in search referrals. Link buyers and sellers use toolbar PageRank as a measure of the value of a site’s links, so by reducing this perceived value Google is sending a clear message: the drop is intended to discourage paid-link deals.
Some ask why Google doesn’t simply remove the toolbar PageRank altogether so that buyers and sellers won’t have a currency to trade with. At first glance it seems like a good idea, but here is the catch: the toolbar PageRank is just a means of enticing users to activate the surveillance component that Google uses to study online behavior. Google probably has several reasons for collecting this data, but at minimum it helps measure the quality of search results and improve its algorithms. If Google were to remove the toolbar PageRank, users would have no incentive to let Google ‘spy’ on their online activities.
Is PageRank dead?
Some are suggesting that PageRank is no longer important to Google, but that is far from the truth. Google’s own technology page says otherwise: http://www.google.com/technology/
The heart of our software is PageRank™, a system for ranking web pages developed by our founders Larry Page and Sergey Brin at Stanford University. And while we have dozens of engineers working to improve every aspect of Google on a daily basis, PageRank continues to play a central role in many of our web search tools.
There are several reasons why PageRank is still very important for Google. Among them is the fact that compared to other ranking algorithms it is both extremely reliable and computationally efficient. Given the size of the Web and the competition in the search space, those traits are must-haves.
An Inside Look at PageRank
In the complex mathematical terms I love, PageRank is the normalized dominant left-hand eigenvector of a Markov chain’s stochastic transition probability matrix: the stationary vector you get by applying the power method until it converges. In simplified Googlespeak, it’s described like this:
PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. …
Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo’s homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.
In other words, PageRank is simply the probability that your page will be found by some crazy surfer following links at random. The more incoming links (citations) your page has (and the higher the importance of such citations) the more likely that is.
In order to compute the PageRank for all the pages on the Web, a link between two pages is treated as a vote. The votes are weighted by the PageRank of the page casting the vote and by the number of votes (links) on that page. An important page (high PageRank) casts important votes, but the value of each vote goes down the more votes the page casts (the more links it has). This way, web pages distribute their PageRank to other pages evenly through their links.
It is obvious that in order to determine the value of the votes, the PageRank of each page must be determined first. This circular (recursive) reference is resolved by picking a starting point for all web pages (each valued with one vote) and computing the PageRank for all of them over multiple iterations until the values stabilize (converge). Originally, Google’s founders faced problems such as rank sinks: pages that only link to each other and not to other pages, and as such do not redistribute their PageRank. This was simple to fix with some adjustments to the original model. Check out the PageRank paper for more details.
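The iterative vote-passing described above can be sketched in a few lines of Python. This is a minimal illustration, not Google’s implementation: the damping factor d (0.85 in the original paper) is the standard fix for rank sinks, giving the surfer a 1−d chance of jumping to a random page, and the tiny three-page web at the bottom is made up for the example.

```python
def pagerank(links, d=0.85, tol=1e-8, max_iter=100):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # every page starts equal
    for _ in range(max_iter):
        new = {p: (1.0 - d) / n for p in pages}   # random-jump share
        for p, outs in links.items():
            if outs:                              # split the vote evenly
                share = d * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:                                 # dangling page: spread evenly
                for q in pages:
                    new[q] += d * rank[p] / n
        if sum(abs(new[p] - rank[p]) for p in pages) < tol:
            rank = new
            break                                 # values have converged
        rank = new
    return rank

# A tiny three-page web: A and B link to each other and both link to C.
web = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A"]}
print(pagerank(web))
```

Note how page A ends up with the highest rank: it receives C’s entire vote plus half of B’s, exactly the weighted-vote behavior described above.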
The toolbar PageRank is an approximation of the real PageRank of a page. To make it harder for SEOs to reverse engineer the ranking algorithms, Google only updates the toolbar values every few months. It is safe to assume that the real values are updated at least once a month, thanks to advances in accelerating the computation of PageRank. The delay primarily affects new sites, such as this blog, which waited five months to see its first toolbar PageRank yet was receiving search referrals on its second day.
The paid link challenge
Perhaps what is most interesting in all this is that back in 1998 Google already saw paid links as a problem, but underestimated their future impact:
6.1 Manipulation by Commercial Interests
…At worst, you can have manipulation in the form of buying advertisements (links) on important sites. But, this seems well under control since it costs money. This immunity to manipulation is an extremely important property. This kind of commercial manipulation is causing search engines a great deal of trouble, and making features that would be great to have very difficult to implement. For example fast updating of documents is a very desirable feature, but it is abused by people who want to manipulate the results of the search engine
Yes, Google, paid links cost money, but the revenue from the search referrals far outweighs the investment… In my opinion, Google should address this problem algorithmically rather than force everybody and their mother to label paid links with ‘nofollow’ attributes. But instead of complaining, I am going to explain a simple adjustment Google could make to its PageRank model to help discount paid, irrelevant, or low-quality links on a web page. Hopefully they will listen.
The need for an intelligent surfer
In their paper, Google’s founders claim that PageRank tries to model actual user behavior, yet they model a random surfer. Regular users do not click on links at random! They do not click on irrelevant links, invisible links or links that are not in the main content area. They follow links that pique their interest. Most of the time they ignore ads, too.
A couple of researchers from the University of Washington came up with an improved PageRank model they call the intelligent surfer. Their basic idea is that the transition probabilities are weighted according to the similarity of the pages at both ends of the link. The end result is that links between pages that are related receive more weight and irrelevant paid links on a page would not count as they do now.
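The weighting idea can be sketched roughly as follows. This is an illustration under loose assumptions, not the paper’s actual method: I am using a simple term-overlap (Jaccard) similarity between page vocabularies as a stand-in for their relevance measure, and the page data is invented.

```python
def similarity(source_terms, target_terms):
    # Jaccard overlap of page vocabularies -- an illustrative choice only.
    common = len(source_terms & target_terms)
    union = len(source_terms | target_terms)
    return common / union if union else 0.0

def weighted_transitions(links, terms):
    """links: page -> list of target pages; terms: page -> set of words.
    Returns page -> {target: transition probability}, normalized per page."""
    probs = {}
    for p, outs in links.items():
        weights = {q: similarity(terms[p], terms[q]) for q in outs}
        total = sum(weights.values())
        if total > 0:
            probs[p] = {q: w / total for q, w in weights.items()}
        else:  # no related targets: fall back to the uniform random surfer
            probs[p] = {q: 1.0 / len(outs) for q in outs} if outs else {}
    return probs

# Page A is about SEO; it links to a related page B and an off-topic page C.
terms = {"A": {"seo", "rank"}, "B": {"seo", "rank", "link"}, "C": {"recipes"}}
links = {"A": ["B", "C"]}
print(weighted_transitions(links, terms))
```

In this toy case the off-topic link to C receives no share of A’s vote at all, which is exactly why an irrelevant paid link would stop passing PageRank under such a model.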
Why isn’t Google using this model? There is a big reason:
The difficulty with calculating a query dependent PageRank is that a search engine cannot perform the computation, which can take hours, at query time, when it is expected to return results in seconds (or less). We surmount this problem by precomputing the individual term rankings QD-PageRankq, and combining them at query time according to equation 5. We show that the computation and storage requirements for QD-PageRankq for hundreds of thousands of words is only approximately 100-200 times that of a single query independent PageRank.
Essentially this means that in order to use this algorithm Google would need to increase its datacenter capacity a couple of hundred times. It is far cheaper and easier to scare the search crowd and keep up the propaganda campaign against paid links.
I like this intelligent surfer concept, but a simpler and more practical approach occurred to me. Instead of using the similarity between linked pages as the probabilities that will determine where the surfer jumps to, I’d use the click-through data of each page as measured by the Google toolbar.
Think about it. Click-through data reflects real user behavior. Links with higher click-throughs would have greater weight because users click on what really interests them. On the other hand, the links that are not relevant or not visible to the user will receive few clicks or no clicks at all. Google can also measure the bounce rates of users clicking on links and reduce the click-through rates accordingly. For example, if I place a prominent link on my blog to a questionable site, many of my readers will follow it, but they will hit the back button if it is not as good as I recommended or if they felt they were tricked.
And all this data is already available to Google via their toolbar!
The point I am trying to make is that Google can improve its ranking algorithm without resorting to toolbar PageRank reductions or the other scare tactics it is employing. Hopefully, Tom or some other mathematically savvy reader can help me with the math to prove this concept. With a formal, carefully researched paper, Google is more likely to listen.
Do you think Google should solve the problem algorithmically or keep going with their existing campaign against paid links?