PageRank: Caught in the paid-link crossfire

Last week the blogosphere was abuzz when Google decided to ‘update’ the PageRank numbers it displays on the toolbar. It seems Google has made good on its threat to demote sites engaged in buying and selling links for search rankings. The problem is that some innocent sites were caught in the crossfire. A couple of days later, Google corrected the mistake, and those sites are now back where they were supposed to be.

The incident reveals that there is a lot of misunderstanding about PageRank, both inside and outside the SEO community. For example, Forbes reporter Andy Greenberg writes:

On Thursday, Web site administrators for major sites including the Washingtonpost.com, Techcrunch, and Engadget (as well as Forbes.com) found that their “pagerank”–a number that typically reflects the ranking of a site in Google

He also quotes Barry Schwartz saying:

But Schwartz says he knows better. “Typically what Google shows in the toolbar is not what they use behind the scenes,” he says. “For about two and a half years now this number has had very little to do with search results.”

There are two mistakes in these assertions:

  • The toolbar PageRank does not reflect the ranking of a site in Google. It reflects Google’s perceived ‘importance’ of the site.

  • The toolbar PageRank is an approximation of the real PageRank Google uses behind the scenes. Google doesn’t update the toolbar PageRank as often as it updates the real thing, but saying that it has very little to do with search results is far-fetched.

Several sites lost PageRank, but they did not experience a drop in search referrals. Link buyers and sellers use toolbar PageRank as a measure of the value of a site’s links. By reducing this perceived value, Google is sending a clear message about paid links: the drop is intended to discourage such deals.

Some ask why Google doesn’t simply remove the toolbar PageRank altogether so that buyers and sellers won’t have a currency to trade with. At first glance it seems like a good idea, but here is the catch: the toolbar PageRank is just a means of enticing users to activate the surveillance component that Google uses to study online behavior. Google probably has several reasons for doing so, but at minimum the data helps it measure the quality of search results and improve its algorithms. If Google were to remove the toolbar PageRank, users would have no incentive to let Google ‘spy’ on their online activities.

Is PageRank dead?

Some are suggesting that PageRank is no longer important for Google, but that is far from the truth. See http://www.google.com/technology/:

The heart of our software is PageRank™, a system for ranking web pages developed by our founders Larry Page and Sergey Brin at Stanford University. And while we have dozens of engineers working to improve every aspect of Google on a daily basis, PageRank continues to play a central role in many of our web search tools.

There are several reasons why PageRank is still very important for Google. Among them is the fact that compared to other ranking algorithms it is both extremely reliable and computationally efficient. Given the size of the Web and the competition in the search space, those traits are must-haves.

An Inside Look at PageRank

In the complex mathematical terms I love, PageRank is the normalized dominant left-hand eigenvector, the stationary vector that results from applying the Power Method to a stochastic transition probability matrix for a Markov chain, until it converges. In simplified Googlespeak, it’s described like this:

PageRank can be thought of as a model of user behavior. We assume there is a “random surfer” who is given a web page at random and keeps clicking on links, never hitting “back” but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. …

Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo’s homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.

In other words, PageRank is simply the probability that your page will be found by some crazy surfer following links at random. The more incoming links (citations) your page has (and the higher the importance of such citations) the more likely that is.

In order to compute the PageRank for all the pages on the Web, a link between two pages is seen as a vote. The votes are weighted by the PageRank of the page casting the vote and the number of votes (links) on the page. An important page (high PageRank) casts important votes, but the value of the votes goes down the more votes the page casts (the more links it has). This way, web pages distribute their PageRank to other pages evenly through their links.
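The vote-weighting rule above can be written as a single formula. This is a sketch in the normalized, random-surfer form commonly used to present PageRank; the symbols are my own labels rather than anything Google publishes: d is the damping factor (the probability that the surfer keeps clicking instead of getting bored), N is the total number of pages, B(p) is the set of pages linking to p, and L(q) is the number of links on page q.

```latex
PR(p) = \frac{1 - d}{N} + d \sum_{q \in B(p)} \frac{PR(q)}{L(q)}
```

Each term PR(q)/L(q) is one vote: the voter's PageRank split evenly across all the links it casts.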

It is obvious that in order to determine the value of the votes, the PageRank of the page must be determined first. This circular (recursive) reference is resolved by picking a starting point for all web pages (all valued with one vote) and recomputing the PageRank for all of them over multiple iterations until the values stabilize (converge). Originally, Google’s founders faced some problems such as rank sinks – pages that only link to each other and not to other pages, and as such do not redistribute their PageRank. This was simple to fix with some adjustments to the original model. Check out the PageRank paper for more details.
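To make the iteration concrete, here is a minimal Python sketch of the idea, not Google's actual implementation. The function name `pagerank`, the 0.85 damping factor, the tolerance, and the tiny four-page link graph are all assumptions chosen for illustration. Every page starts with an equal share (1/N rather than one whole vote, so the values read as probabilities), and the damping term is the kind of adjustment that keeps rank sinks from hoarding PageRank.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # equal starting point for every page
    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            for target in targets:
                # Each link is a vote worth the voter's rank divided by its link count.
                new_rank[target] += damping * rank[page] / len(targets)
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:  # values have stabilized (converged)
            break
    return rank


# Hypothetical four-page web: "c" is cited by three pages, "d" by none.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 4))
```

In this toy graph the most-cited page ("c") ends up on top, and the page nobody links to ("d") keeps only the small share every page gets from the bored surfer's random jump.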

The toolbar PageRank is an approximation of the real PageRank of a page. In order to make it harder for SEOs to reverse engineer the ranking algorithms, Google is only updating the toolbar values every few months. It is safe to assume that the real values are updated at least once a month, thanks to advances in accelerating the computation of PageRank. This measure primarily affects new sites, such as this blog that waited five months to see its first toolbar PageRank, yet was receiving search referrals on its second day. ;-)

The paid link challenge

Perhaps what is most interesting in all this is that back in 1998 Google already saw paid links as a problem, but underestimated their future impact:

6.1 Manipulation by Commercial Interests

…At worst, you can have manipulation in the form of buying advertisements (links) on important sites. But, this seems well under control since it costs money. This immunity to manipulation is an extremely important property. This kind of commercial manipulation is causing search engines a great deal of trouble, and making features that would be great to have very difficult to implement. For example fast updating of documents is a very desirable feature, but it is abused by people who want to manipulate the results of the search engine

Yes, Google, paid links cost money, but the revenue from the search referrals far outweighs the investment… It is my opinion that Google should address this problem algorithmically and not force everybody and their mother to label paid links via ‘nofollow’ attributes. But instead of crying about it, I am going to explain a simple adjustment Google can make to their PageRank model to help discount any paid, irrelevant, or low-quality link on a web page. Hopefully they will listen.

The need for an intelligent surfer

In their paper, Google’s founders claim that PageRank tries to model actual user behavior, yet they model a random surfer. Regular users do not click on links at random! They do not click on irrelevant links, invisible links or links that are not in the main content area. They follow links that pique their interest. Most of the time they ignore ads, too.

A couple of researchers from the University of Washington came up with an improved PageRank model they call the intelligent surfer. Their basic idea is that the transition probabilities are weighted according to the similarity of the pages at both ends of the link. The end result is that links between pages that are related receive more weight and irrelevant paid links on a page would not count as they do now.

Why isn’t Google using this model? There is a big reason:

4 Scalability

The difficulty with calculating a query dependent PageRank is that a search engine cannot perform the computation, which can take hours, at query time, when it is expected to return results in seconds (or less). We surmount this problem by precomputing the individual term rankings QD-PageRankq, and combining them at query time according to equation 5. We show that the computation and storage requirements for QD-PageRankq for hundreds of thousands of words is only approximately 100-200 times that of a single query independent PageRank.

Essentially this means that in order to use this algorithm Google needs to increase its datacenter capacity by a couple of hundred times. It is definitely far cheaper and easier to scare the search crowd and continue the paid-link campaign propaganda. ;-)

I like this intelligent surfer concept, but a simpler and more practical approach occurred to me. Instead of using the similarity between linked pages as the probabilities that will determine where the surfer jumps to, I’d use the click-through data of each page as measured by the Google toolbar.

Think about it. Click-through data reflects real user behavior. Links with higher click-throughs would have greater weight because users click on what really interests them. On the other hand, the links that are not relevant or not visible to the user will receive few clicks or no clicks at all. Google can also measure the bounce rates of users clicking on links and reduce the click-through rates accordingly. For example, if I place a prominent link on my blog to a questionable site, many of my readers will follow it, but they will hit the back button if it is not as good as I recommended or if they felt they were tricked.

And all this data is already available to Google via their toolbar!
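Here is a rough Python sketch of what I have in mind; it is a proof of concept under my own assumptions, not a description of anything Google does. The function name `click_weighted_pagerank`, the input shapes (click counts per link, bounce rates per link), and the sample numbers are all hypothetical. Instead of splitting a page's vote evenly, each link gets a share proportional to its clicks, discounted by the fraction of visitors who bounce back.

```python
def click_weighted_pagerank(clicks, bounce=None, damping=0.85, tol=1e-10, max_iter=200):
    """clicks: {page: {target: click count}}; bounce: {(page, target): rate in [0, 1]}."""
    bounce = bounce or {}
    pages = sorted(set(clicks) | {t for out in clicks.values() for t in out})
    n = len(pages)

    # Turn bounce-discounted click counts into per-page transition probabilities.
    transitions = {}
    for page in pages:
        weights = {
            target: count * (1.0 - bounce.get((page, target), 0.0))
            for target, count in clicks.get(page, {}).items()
        }
        total = sum(weights.values())
        transitions[page] = (
            {t: w / total for t, w in weights.items()} if total > 0 else {}
        )

    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Pages with no clicked links behave like dead ends: spread their rank evenly.
        dangling = sum(rank[p] for p in pages if not transitions[p])
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, out in transitions.items():
            for target, prob in out.items():
                new_rank[target] += damping * rank[page] * prob
        change = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if change < tol:
            break
    return rank


# Hypothetical data: a blog links to a useful page and to a paid link
# that few readers click and most of those abandon.
clicks = {
    "blog": {"useful": 90, "paid": 10},
    "useful": {"blog": 50},
    "paid": {"blog": 5},
}
bounce = {("blog", "paid"): 0.8}
ranks = click_weighted_pagerank(clicks, bounce)
print({page: round(score, 4) for page, score in ranks.items()})
```

Note that in this sketch a link with no recorded clicks passes no weight at all, so a real version would probably need some smoothing for new pages and low-traffic sites, and some defense against faked clicks.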

The point I am trying to make is that it is possible for Google to improve their ranking algorithm and not have to resort to toolbar PageRank reductions or the other scare tactics they are employing. Hopefully, Tom or some other mathematically-savvy reader can help me with the math to prove this concept. With a formal, carefully-researched paper, Google is more likely to listen.

Do you think Google should solve the problem algorithmically or keep going with their existing campaign against paid links?

25 replies
  1. Richard Chmura says:

    What is the install base of the Google toolbar? Is it significant enough to make an index-wide impact on link relevance? (maybe for existing links from high traffic sites – but what happens to the low traffic sites or fresh new content?)

    Isn't there a specific segment of internet users who dominate usage of the toolbar? Could these users manipulate this data rather easily?

    I appreciate the thought you put into this post Hamlet. You have a good start. With a few tweaks in implementation, I'm sure it would make a solid solution.

  2. Tom says:

    I think that Google HAS to adopt some form of intelligent surfer method of calculating pagerank, the biggest problem is ensuring the algorithm is efficient enough to keep returning results quickly.

    While we're talking about using toolbar data – why not go the whole way and use traffic stats for each page as a factor in the algorithm for how important it is? Some combination of traditional pagerank, modified by how many visits it gets. After all, if a page is getting 10,000 hits a day then if the pagerank algorithm is doing its job the page should have a high pagerank (or probability of a random surfer hitting the page).

    Trying to use pagerank data on each link on a page (and then also relating that to the toolbar data on the linked-to page) I think unfortunately will cause the algorithm to slow down too much.

    Thanks for outing me as some kind of maths genius as well Hamlet – in actual fact Will is much better at maths than I am (although I think we're both a bit rusty having not used any hardcore maths for a good few years!!).

    One last note – do you think Google would ever accept a pagerank algorithm created by the search community!? I mean user generated content is one thing but I think that's taking it a bit far! We'd know exactly how to rank sites since we wrote the algo ;-)

  3. Tom says:

    Oh, and nice post fantomaster – I forgot to mention about manipulating the data but that's also a valid concern. In fact, it's what makes me think that if Google WERE using browsing behaviour to modify rankings in any significant way then they would want to keep quiet about it to limit the amount of manipulation.

  4. Will Critchlow says:

    Hey Hamlet,

    Interesting post. I haven't thought about eigenvectors in detail for some time. Stochastic convergence is a fun subject though. I'd be very interested in what kind of convergence they are using – I suspect quite a weak form (actually, when I say "very interested" I mean "a little bit interested, but not quite enough to go and read the original papers").

    I'm going to have a think (and Tom and I will discuss – he is too modest about his maths abilities) about how you might model an 'intelligent surfer' (there's an oxymoron for you!).

    I think one of the biggest challenges is going to be how you make it query-independent (which is important for computational intensiveness)…. Hmmmmmmm….

    Thanks for the thought-provoking post.

  5. David Hopkins says:

    I thought this was going to be another boring PR cull post until I got through it :P

    I'm with fantomaster here. If Google are using this data, it would be very easy to manipulate. How long would it take for traffic exchange networks to be set up that can spoof thousands or tens of thousands of visitors a day? A relatively simple PHP engine could easily set up 100s of connections to make it look like it was spending hours on a site.

    The problem is abuse of this system would be impossible to track, whereas paid links can be spotted by humans or guessed at by computers.

    Tom brings up an interesting point here: "do you think Google would ever accept a pagerank algorithm created by the search community!?"

    Earlier I was reading that Jimbo Wales of Wikipedia wants to take on Google in the search engine market and he is relying on the open source community – all of the code will be available. The obvious problem here is people will know the exact inner workings and be able to exploit the system. It was like when Infoseek released all their code, a few people worked out their system and started selling automated rankings on the engine. However, it came to mind that Jimbo would be able to allow people to make their own algorithms and systems. So rather than just suggesting your ideas to Google, you can just take action. If Jimbo can pull something like that off and make it work, I think the search market will be his.

  6. David Hopkins says:

    In regards to bounce rates, I have a really high bounce rate because the site is effectively one page, but the average time spent on site is really high. I monitor the time spent on site for all incoming quotes and it averages at 14 minutes 13 seconds for just one page.

    This data can be a bit ambiguous, because the bounce rate would suggest to Google that this is a low quality site, but the time spent on site would suggest high quality.

  7. David Airey says:

    Thanks for the intelligent write-up, Hamlet. As you know, I'm not that clued in on SEO, Page Rank, and the like, so I'll just let you know that there's a typo in your article. In the first para of 'The need for an intelligent surfer' piece, you type, 'peak their interest', whilst it should read, 'pique their interest'.

    I hope all's well with you.

  8. fantomaster says:

    Great analysis (as we’ve come to expect it of you, heh!).
    And you’re probably spot on with Google going for user behavior as a relevancy determinant: personalized search is there to stay, I would assume.
    Whether the toolbar alone can be relied upon to muscle this task is, however, somewhat doubtful. Think Alexa, for example, even considering that their model is far inferior if not primitive compared to what you’re suggesting: It’s dead easy to manipulate Alexa rankings, and automatically, too, you’ll actually find plenty of software around helping you do it.

    The Google toolbar is bound to experience a similar fate. While its installed base may be vastly greater than the Alexa toolbar’s, it’s still covering just a tiny weeny bit of overall human Internet traffic.

    Moreover, it’s essentially just as prone to manipulation as Alexa’s tool: Bot programs that simulate human behavior to the dot are a technically trivial option. Given a sufficiently diverse network of IPs it’s a no-brainer to simulate geo targetting, synching it with timezones, etc. etc.

    Is this happening right now? Not on a large scale, I would assume. But happening it is.
    Roll out that ultimate super duper “weapon to end all wars” – and experience a ton of nasty surprises when it turns out that your antagonist has either developed something of a similar (if not superior) quality, or, at the very least, a workaround that renders your arsenal basically useless. Same story in search…

  9. Hamlet Batista says:

    Richard – It is probably not significant enough to make an index-wide impact, but even a partial improvement to the current random surfer model is a welcome improvement.

    Since I am suggesting the use of click-through rates and not raw traffic, it doesn't really matter if the site is a low traffic one.

    I personally think that manipulation will always exist as long as there is money to be made.

    Ralph – You are right that using the toolbar alone would be suicide. For that reason, I am only suggesting they use click-through data to bias the weights (probabilities) of the votes (links). Weighting all the links in a page (even the invisible ones) the same is counterintuitive in my opinion.

    I agree that manipulation will always exist. The point I am trying to make is that Google needs to address this problem algorithmically.

    Tom – Thanks for your input. Sorry I forgot to mention that. I think they should combine the click-through data offline at the same time the PageRank is being computed. That way it is query-independent and the search results will come back fast to the user.

    I am trying to get my Math up to speed. I see it is a must, in order to understand all these interesting papers I am reading lately.

    do you think Google would ever accept a pagerank algorithm created by the search community!?

    That is a good point. I am not hoping they'd use it, but more to prove that it can be done. The idea is to push them in the right direction and stop scaring us.

    Will – From what I've read it seems they are still using the Power Method. It is slow, but does not require changes to the massive adjacency matrix that could result in massive storage/computation issues.

    The idea is to factor the click-through rates at the time the PageRank is being computed. That way it is query-independent. Maybe, averaging the click-through rates from the last PageRank update to date, would be a sound approach.

    BTW: Happy birthday, Tom!

    David H – My suggestion is to use the click-through rates to weight the links on each page. It doesn’t really matter the amount of traffic the page gets. What matters is the distribution of the traffic.

    This data can be a bit ambiguous, because the bounce rate would suggest to Google that this is a low quality site, but the time spent on site would suggest high quality.

    In that case, they can incorporate the average time on site too. That information is available to them via the toolbar. Thanks for the input.

    David D – Corrected, thanks.

    Tom F – Thanks for the link.

  10. Tom says:

    Hi Hamlet – sorry for confusing you again, it's Will's birthday today not mine (although he says thanks :-)

    You're right about brushing up on maths – I think people often forget that a sound technical understanding of the algorithms (or at least an understanding of how algorithms work in general) is useful when it comes to SEO. It's not all about social media, digg and iphones!

  11. Web Design Newcastle says:

    I reckon there is more to it than just buying and selling links. I've seen a lot of things that make me think the ranking algorithm has changed in some pretty major ways. It'll be interesting to see how it all pans out.

  12. Hamlet Batista says:

    Tom – You don't need to be a math genius or programmer to succeed in SEO, but I think that the better you understand the "why" of things, the more likely you are to compete and make better decisions.

    Carsten – Thanks for your comment and the link to your post. It is a very good post.

  13. Chris K says:

    Very interesting blog.
    With the focus on PageRank Google has created something big, not necessarily good. The commercial abuse aspect, millions of people trying to work the system for their individual gain is not a good reflection on our combined social conscience. It is difficult to see a much better model of ranking the value of pages. We may have to live with what we have for some time to come but of course we need to continue to search for a better outcome. Thanks for your work.

  14. Claire says:

    As a web design business, the Google pagerank is a very important thing for us to show customers, as a lot are sceptical about our services if we cold call them. As much as getting good links from highly ranked sites, I find that adding news posts around and generally advertising your site throughout website blogs helps. It's good to have a pagerank to get traffic via the search engines, but I think it's good to get custom through posting to blogs that people are likely to visit your site through.

  15. Eva White says:

    The maths to assign page ranks is still beyond me, but I have improved my pagerank since I started blogging and am quite happy about that. I recently changed my theme and that cost me two points off the pr. Will have to start building traffic and links to get it back.:)


