Making the world (and your site) flat—via a Reverse Proxy

In order to protect some of the inventions in our software, I’ve been working with a law firm that specializes in IP protection. I’ve learned a lot from them, but I’ve learned far more from reviewing the patent applications they sent me back as possible ‘prior art.’ Let me share one of the most interesting ones I’ve seen so far, Patent Application 20070143283. Here is the abstract:

A system and method for optimizing the rankings of web pages of a commercial website within search engine keyword search results. A proxy website is created based on the content on the commercial website. When a search engine spider reaches the commercial website, the commercial website directs the search engine spider to the proxy website. The proxy website includes a series of proxy web pages that correspond to web pages on the commercial website along with modifications that enhance the rankings of the pages by the search engines. However, hyperlinks containing complex, dynamic URLs are replaced with spider-friendly versions. When a human visitor selects a proxy web page listing on the search engine results page, that visitor is directed to the proxy web page. The proxy server delivers the same content to the human visitor as to the search engine spider, only with simplified URLs for the latter.

Basically they use a reverse proxy (I wrote about this before) to replace dynamic URLs with search engine–friendly ones automatically. In addition to this, they make ‘enhancements’ to the proxy version of the pages so that they get high search engine rankings. They claim this is not cloaking:

[0014] The content contained on the proxy web pages is the same when the proxy web page is accessed either by the search engine spider or by the human visitor. The presentation of the same web page content to both the search engine spider and the human visitor allows the proxy website to stay within the ‘no cloaking’ guidelines set by most commonly used search engines.

If they rewrote only the dynamic URLs I would agree that they present the same content to both users and search engines. I don’t think as users we care much about the URLs, unless we have to type them manually into the address bar. However, I do think it is cloaking because they say changes are made to the pages in order to optimize them for higher search engine rankings—and they only present these optimized pages to the search engine crawlers. From the patent application:

[0015] Since the proxy web pages are contained on a proxy website separate from the commercial website, additional content and HTML optimization can be added to the proxy web pages that are not included on the corresponding web pages on the commercial site, via a web-based interface. The addition of this content and HTML optimization on the proxy web pages can be utilized to enhance the ranking of the proxy web pages on the search engine results pages. The effect of the addition of these optimizations on ranking can be analyzed and the content can then be revised to further enhance the ranking of the proxy web page. By utilizing the proxy web pages rather than the web pages contained on the commercial website, the rankings and functionality of the proxy web pages can be enhanced without altering the commercial web pages.

That being said, I think this is a very useful and clever technique. Rewriting dynamic URLs to make them search engine friendly via a reverse proxy is extremely useful, particularly for large e-commerce sites where the CMS or shopping cart software is not flexible enough.
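To make the URL rewriting idea concrete, here is a minimal sketch in Python of what the proxy could do to the links on each page it serves. This is not the method from the patent application. The URL scheme (query parameters turned into path segments), the function names, and the in-memory lookup table are all assumptions made for illustration.

```python
import html
import posixpath
import re
from urllib.parse import parse_qsl, quote, urlsplit

# Naive href matcher; a real proxy would use an HTML parser instead.
HREF_RE = re.compile(r'href="([^"]+)"')


def friendly_path(dynamic_url):
    """Turn /product.php?cat=shoes&id=42 into /product/cat-shoes/id-42/ (hypothetical scheme)."""
    parts = urlsplit(dynamic_url)
    pairs = parse_qsl(parts.query)
    if not pairs:
        return dynamic_url  # nothing dynamic to rewrite
    stem = posixpath.splitext(parts.path)[0]  # "/product.php" -> "/product"
    slug = "/".join(f"{quote(k, safe='')}-{quote(v, safe='')}" for k, v in pairs)
    return f"{stem}/{slug}/"


def rewrite_links(page, lookup):
    """Replace dynamic hrefs with friendly ones and record how to map them back."""
    def swap(match):
        original = html.unescape(match.group(1))  # "&amp;" -> "&"
        friendly = friendly_path(original)
        if friendly == original:
            return match.group(0)  # leave static links untouched
        lookup[friendly] = original  # the proxy needs this to fetch the real page
        return f'href="{friendly}"'
    return HREF_RE.sub(swap, page)
```

When a request for a friendly URL comes in, the proxy would look it up in the table, fetch the original dynamic URL from the commercial site, and run the response through `rewrite_links` before returning it.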

Here is another interesting use that came to my mind and is not mentioned in the PA. (Maybe I should file a patent for this.) ;-)

If only the world were flat

Picture a tiered site architecture where you have a home page and tiered internal pages. Tier 1 includes pages that the search engine robots access in one click; tier 2 are pages that are accessible via two clicks, tier 3 via three clicks, and so on. Search engine spiders visit a limited number of pages per site and follow a limited number of clicks from the entrance page (usually the home page). The more clicks necessary to arrive at a page the less likely the page will be crawled or indexed. Ideally you would like to have a flat site architecture where all the pages are in tier 1. Unfortunately, while this is good for search engines, it is not very appealing for your site’s visitors. Imagine how crowded your home page would look with so many links!
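Since a page's tier is just its click distance from the home page, it can be computed with a breadth-first search over the site's link graph. The sketch below assumes the graph is already available as a dictionary mapping each URL to the URLs it links to; the function name and the sample URLs are made up for illustration.

```python
from collections import deque


def assign_tiers(links, home):
    """Return {url: tier}, where tier is the minimum number of clicks from home."""
    tiers = {home: 0}  # the home page itself is tier 0
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in tiers:  # first visit is the shortest path in BFS
                tiers[target] = tiers[page] + 1
                queue.append(target)
    return tiers


# Hypothetical link graph for a small shop.
site = {
    "/": ["/shoes", "/hats"],
    "/shoes": ["/shoes/boots", "/"],
    "/shoes/boots": ["/shoes/boots/item-1"],
}
tiers = assign_tiers(site, "/")
```

Here `/shoes` and `/hats` land in tier 1, `/shoes/boots` in tier 2, and `/shoes/boots/item-1` in tier 3.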

An automatic solution

In the initial step, a simple crawler script visits the whole site and tags each page with its tier: tier 1, tier 2, tier 3, and so on. The script records this information in a database. When a search engine requests a tier 1 page via the reverse proxy, the proxy can inject links to the pages in the next tier that is not directly reachable (tier 3, since tier 2 pages are already linked from the tier 1 page the robot is parsing), and likewise for deeper tiers. This gives the search engine robot a flatter structure, allowing more pages to be indexed and saving bandwidth and CPU cycles for the search engines' crawlers. Alternatively, the proxy can inject links to all internal pages beyond the next tier (tier 3, tier 4, etc.) when the search engine robot requests a tier 1 page. This would make the site completely flat.
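The injection step could look roughly like the Python sketch below. It assumes the crawler has already produced a dictionary of URL to tier, and it generalizes "tier 1 page gets tier 3 links" to "a page at tier t gets links to tier t + 2". That generalization, the function names, and the `<nav>` wrapper are my assumptions here, not part of any existing tool.

```python
import html


def links_to_inject(tiers, requested, flatten=False):
    """Pick the URLs the proxy would add to the requested page.

    flatten=False: only the next tier that is not directly linked (tier + 2).
    flatten=True:  every page at tier + 2 or deeper, for a fully flat site.
    """
    current = tiers[requested]
    if flatten:
        return sorted(u for u, t in tiers.items() if t >= current + 2)
    return sorted(u for u, t in tiers.items() if t == current + 2)


def inject_links(page, urls):
    """Insert plain links just before </body>, or append them if there is no body tag."""
    if not urls:
        return page
    block = "".join(
        f'<a href="{html.escape(u, quote=True)}">{html.escape(u)}</a>\n' for u in urls
    )
    if "</body>" in page:
        return page.replace("</body>", f"<nav>\n{block}</nav>\n</body>", 1)
    return page + block
```

With tiers `{"/": 0, "/a": 1, "/a/b": 2, "/a/b/c": 3, "/a/b/c/d": 4}`, a request for `/a` would get a link to `/a/b/c` injected, or links to both `/a/b/c` and `/a/b/c/d` with `flatten=True`.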

This is definitely very useful, but as I clearly explained above, this is cloaking. In my last post about cloaking Jill Whalen and others expressed concern that Google’s view of this is still negative. It is my personal opinion that Google needs to draw a line between the legitimate uses of cloaking and cloaking to take advantage of search engines. In order to stay on the safe side it is not a bad idea to ask Google if they are OK with this.

Update

After reading an insightful comment from Sam Daams and re-reading the PA, I have to admit I was wrong about my initial assessment about them cloaking. They are presenting the same 'optimized' content to the search engine spider and to the user coming from the search results. I guess this is technically not cloaking. However, if a user goes directly to the web page, he or she will see the original, non-optimized version. When Google says don't cloak to users, are they talking about search engine users or any regular visitor? Let me know your thoughts on this.

 

8 replies
  1. Web Design Newcastle says:

    Hehe. This really is just cloaking using a slightly different method. Redirecting search engines to a proxy site with additional content – who are they kidding? I can't believe this stuff gets patented.

    Asking Google about this would be interesting but I think I might already know their answer lol.

    • Hamlet Batista says:

      WDN – Thanks for all your comments. I don't have problems with comments with keyword rich names, but I'd appreciate if you ended at least one comment with your real name. It is a little bit awkward to call people by their website names. ;-)

  2. Sam Daams says:

    I usually agree with your assessments, which is why it's hard for me to understand that you see "additional content and HTML optimization can be added to the proxy web pages that are not included on the corresponding web pages on the commercial site" as cloaking. Only IF what is being shown to the search engine is different than what is being shown to the visitor would this be cloaking. I'm 99.99% sure they intend to do some kind of framing of the original source and rewrite that framed page in such a way that it is better optimized with a better url. But that whole 'new page' would be the same to visitors and search engines so how is this cloaking?

    Duplicate content it certainly is though and also more likely to cause the original article to be dumped by SE's than other proxy methods used…

    Either way I don't agree with the method because it is essentially stealing the content and with the way SE's currently work can get the original site penalized but it's still not cloaking :)

    • Hamlet Batista says:

      Sam – Thanks for your comment. It is good to disagree every once in a while as it becomes more of a conversation than a speech.

      You are right. I read the full patent again, more carefully, and found out that they are presenting the same 'optimized' content to the spiders and search engine users. It is not cloaking. I will update the post to reflect this. Thanks.

      From the patent:

      19… creating proxy web pages on the proxy website for web pages on the commercial website, each of the proxy web pages including substantially the same information as the corresponding commercial web page, the proxy web pages having a simplified URL and simplified hyperlinks compared to the corresponding commercial web page; adding optimized content to the proxy web pages that is not present on the corresponding commercial web page; serving the proxy web pages including the optimized content to a search engine spider upon request; and serving the proxy web pages including the optimized content to a web browser when the web browser selects the search results listing for the proxy web page from the search engine results page.

      My initial assumption came from the fact that when you want to present the same content to spiders and users, you don't need to detect spiders. In their case, they only present the 'optimized' content to spiders and users coming from search results, as explained in the claims. This means that users going directly to the site will still see the non-optimized version of the pages. The question is: When Google says don't present different content to users than you present to search engines, do they mean search engine users only?

  3. Web Design Newcastle says:

    I get what Sam is saying but that's a different impression to the one I got reading the article. Maybe I read it wrong?

    If it is as Sam points out then I just can't see the point of it – or at least any ethical reason for it.

    • Hamlet Batista says:

      David – The difference is that they detect both robots and users clicking from the search results, so they present the 'optimized' content to both.

      Sorry about the confusion. Sam's comment prompted me to re-read the PA.

  4. Matt says:

    This is actually exactly what Google recommends for getting AJAX and Flash sites indexed. I'm sure that by now they have some method for detecting obvious black hat stuff using this technique, but they are going to have to compromise until they figure out how to deal with rich interfaces.
