No doubt at some point you have run a search in Google, clicked on an attractive result, and hit a frightening wall: the article or page in question requires a subscription! 😉 As users, we all find this annoying, and the last thing we want to do is create yet another user name and password. But for a content provider, it's an excellent business move. Premium/paid content is a fine monetization strategy for anyone with content good enough to sell.
It also brings up an interesting question for SEO. How exactly does Google index paid content?
I got this email from my loyal reader Wing Yew:
I've read your blog since the day you launched. That said, I can
completely appreciate if you don't have time to respond to this
message or post a blog about it. On the off chance you do know an
answer, I knew I had to ask.
Question: How do you have google/yahoo/msn spider password protected
content? I know that SEOMoz does it with their premium content, but
I'm not sure how. I'm rather desperately seeking out a hard and fast
answer… and I know of no better person to whom to go.
for His reknown,
Saying that I've been extremely busy lately is an understatement, but how can I say no to a loyal reader who has been following my blog from day one? Thanks for your support, Wing! Letting search engines index paid content is not only a good idea, it is also a very clever one.
Activate cloaking device
In order to do this you need to use cloaking. Before you panic and run for the hills saying that this is black-hat stuff and you don't want to be penalized, it is important that you know that Google does not penalize every type of cloaking. It is all about the intention. Let me explain the main concept and then dive into the technical details.
Your paid or sensitive content can be protected by your web server or by a web application. Let's call it the gatekeeper. The gatekeeper is responsible for asking for credentials anytime a visitor lands on a protected page. It validates the credentials and, assuming they are good, allows access to the page.
In this case we need to make the gatekeeper a little bit smarter by teaching it how to distinguish search engine spiders from regular users. The gatekeeper should still ask for passwords from any web surfer, but it should not ask search engine spiders for credentials. This is where cloaking comes in.
As I explained before, cloaking is presenting different content to crawlers than we show regular users. Traditionally I have used two detection strategies: either by user agent or by IP address. The first is having the code check if the HTTP_USER_AGENT server variable contains a bot identification string (e.g. Googlebot, Yahoo Slurp, etc.), and the second is to check the IP against a list of known bots. You can get just such an IP list from http://iplists.com/. A list of search engine user agents can be found here: http://www.user-agents.org/
Both approaches are relatively simple, but they have flaws and are not difficult to exploit if an advanced user wants access to paid content for free. The user agent can be forged. For example, there is a Firefox extension that makes the gatekeeper think the user is a search engine robot simply by sending a search engine user agent instead of the browser's. The IP list method is stronger, but maintaining an accurate, up-to-date list of bot IP addresses is extremely difficult and time consuming.
Here is a better strategy
Let's use a method I've discussed before to protect against CGI hijackers. The method is not infallible, but it is extremely powerful for our purposes here. Here are the steps:
1. Do a simple user agent detection as explained above.
2. In order to detect fake robots, we use reverse-forward DNS detection. We only do this check if the requestor has been identified as a known search engine robot in step 1. Making two DNS requests for every single request will definitely slow your server down and we don't want that.
3. Once the code confirms that the requestor is a search engine, we allow the robot to access the paid content.
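To make the flow concrete, here is a minimal Python sketch of the three steps. All names (KNOWN_BOT_TOKENS, claims_to_be_bot, gatekeeper, verify_dns) are hypothetical, and the DNS check is passed in as a function so the sketch stays independent of how you implement it. Treat it as an illustration under those assumptions, not a finished gatekeeper.

```python
# Hypothetical names throughout; adapt them to your own application.
KNOWN_BOT_TOKENS = ("googlebot", "slurp", "msnbot")


def claims_to_be_bot(user_agent):
    """Step 1: cheap substring check on the User-Agent header."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in KNOWN_BOT_TOKENS)


def gatekeeper(user_agent, remote_addr, has_valid_login, verify_dns):
    """Return 'allow' or 'ask_for_credentials' for one request.

    verify_dns is a callable that takes the client IP and returns True
    when the reverse-forward DNS check passes (step 2).
    """
    if has_valid_login:
        return "allow"
    # Step 2 runs only when step 1 matched, so ordinary visitors
    # never trigger any DNS lookups.
    if claims_to_be_bot(user_agent) and verify_dns(remote_addr):
        return "allow"  # Step 3: a verified robot sees the paid content.
    return "ask_for_credentials"
```

Because `and` short-circuits, the expensive DNS check is skipped for every request whose user agent does not claim to be a robot, which is the point of doing step 1 first.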
A word of caution
It is wise to prevent the search engine from caching the paid content. Clever users will hit the back button and access the content via the cache. I see a lot of sites that implement this type of cloaking, yet forget to prevent the search engine from caching the protected content.
As regular readers know, this is as simple as setting the meta robots tag with the value "noarchive". Alternatively, you can set the HTTP header X-Robots-Tag to "noarchive", but this only works with Google at the moment.
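As a rough illustration, the following Python sketch applies both signals to a response before it is sent. The function name add_noarchive and the (name, value) header-list format are assumptions made for this example, and the simple string replacement assumes the page contains a plain lowercase <head> tag.

```python
NOARCHIVE_META = '<meta name="robots" content="noarchive">'


def add_noarchive(headers, html):
    """Return (headers, html) with both noarchive signals applied.

    headers is a list of (name, value) tuples, as used by WSGI.
    """
    # Drop any existing X-Robots-Tag so the header is not duplicated.
    new_headers = [(k, v) for k, v in headers if k.lower() != "x-robots-tag"]
    new_headers.append(("X-Robots-Tag", "noarchive"))
    # Insert the meta tag right after the opening <head>, if there is one.
    if "<head>" in html:
        html = html.replace("<head>", "<head>" + NOARCHIVE_META, 1)
    return new_headers, html
```

You would call this only on the protected pages you serve to a verified robot, just before writing the response.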
Now to the technical details
If the gatekeeper is the web server itself and it is using HTTP authentication, you can use mod_rewrite to set up rules that identify the bot and set the status code 401 (Authorization Required) if the requestor is not a search engine. Doing more advanced detection with this type of gatekeeper requires another post.
Robot detection by user agent or IP address
The code simply needs to check a couple of variables that are set by the web server for every request. These are HTTP_USER_AGENT and REMOTE_ADDR.
Most scripting languages, such as Python, PHP and Ruby, have a class or module named CGI (or an equivalent) that provides access to such variables. If you are using a framework such as Django, Ruby on Rails or CakePHP, look for the relevant documentation to see how you can access and modify the HTTP headers from your controller or view. It is important to keep in mind that any code that modifies headers needs to be executed before any other code that sends output to the browser.
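Here is a small Python sketch of both checks, assuming the request variables arrive in a CGI/WSGI-style dictionary called environ. The token list and the network range are placeholders I picked for illustration (192.0.2.0/24 is a documentation-only range, not a real crawler address), so you would substitute values from lists like the ones linked above.

```python
import ipaddress

# Placeholder values for illustration only.
BOT_UA_TOKENS = ("googlebot", "slurp", "msnbot")
BOT_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_bot_by_user_agent(environ):
    """True if HTTP_USER_AGENT contains a known bot token."""
    ua = environ.get("HTTP_USER_AGENT", "").lower()
    return any(token in ua for token in BOT_UA_TOKENS)


def is_bot_by_ip(environ):
    """True if REMOTE_ADDR falls inside one of the listed networks."""
    try:
        addr = ipaddress.ip_address(environ.get("REMOTE_ADDR", ""))
    except ValueError:
        # Missing or malformed address: treat as a regular visitor.
        return False
    return any(addr in net for net in BOT_NETWORKS)
```

In a plain CGI script you could pass os.environ; in a WSGI application, the environ argument your application receives already carries both variables.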
Reverse forward DNS
To do this type of detection you need to query your DNS cache or name server. The low level way to do this is by calling C system level functions that are available in any BSD-based TCP/IP implementation. They are gethostbyaddr() and gethostbyname(). With the first call your script does the name lookup by providing the IP address obtained from the server variable REMOTE_ADDR. In the second call your script passes the result from the first call to confirm with the DNS that the host does indeed correspond with such an IP. For all this to work, it is very important that the search engines maintain accurate forward (A) and reverse (PTR) DNS records for all the crawler IPs. It is also very important that you have a solid DNS cache if your site receives a lot of traffic.
I don't think it is so bad, but most web developers are not big fans of C. Fortunately, those APIs have been ported and are accessible as functions or methods in any modern scripting language, such as PHP, Python, Ruby and Perl.
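In Python, for instance, the same calls live in the standard socket module. The sketch below is one way to wire them together, with a few assumptions of mine: the function name verify_crawler_ip is hypothetical, the trusted hostname suffixes cover only Google's crawler domains, and I use socket.gethostbyname_ex() rather than gethostbyname() so that every address returned for the host is compared. The suffix check matters because anyone can publish matching PTR and A records for a domain they control.

```python
import socket

# Assumption: domains that Google's crawler hostnames resolve under.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def verify_crawler_ip(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Reverse-forward DNS check for the address in REMOTE_ADDR.

    The two resolver functions are parameters so they can be replaced
    with stubs when testing without network access.
    """
    try:
        hostname = reverse(ip)[0]         # PTR lookup: IP -> host name
        if not hostname.lower().endswith(TRUSTED_SUFFIXES):
            return False
        addresses = forward(hostname)[2]  # A lookup: host name -> IPs
    except OSError:
        # socket.herror and socket.gaierror are subclasses of OSError.
        return False
    return ip in addresses
```

A call such as verify_crawler_ip(environ["REMOTE_ADDR"]) performs two live DNS queries, so it is worth caching the result per IP address on a busy site.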