Posts

Preventing duplicate content issues via robots.txt and .htaccess

Rand of SEOmoz.org posted an interesting article on duplicate content issues. He uses a typical blog to show different examples.

In a blog, every post can appear on the home page, in pagination pages, in the archives, in feeds, etc.

Rand suggests using the meta robots “noindex” tag, or the potentially risky technique of cloaking, to point the robots to the original source.
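For reference, that tag goes in the <head> section of each duplicate page. A minimal sketch (the exact content value depends on whether you still want the links on the page followed):

    <meta name="robots" content="noindex,follow" />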

Joost de Valk recommends that WordPress users change some lines in the source code to address these problems.

There are a few things I would like to add, both to the problem and to the proposed solutions.

As willcritchlow asks, there is also the problem of multiple URLs leading to the same content (e.g., www.site.com, site.com, site.com/index.html, etc.). This can be fixed by using HTTP 301 redirects and by telling Google our preferred domain via Webmaster Central.
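For those on Apache, here is a rough sketch of what the redirect side could look like in .htaccess (assuming mod_rewrite is available and that www.site.com from the example above is the preferred host; adjust the domain to your own):

    # Send non-www requests to the www host with a 301 (permanent) redirect
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^site\.com$ [NC]
    RewriteRule ^(.*)$ http://www.site.com/$1 [R=301,L]
    # Collapse /index.html onto the root URL
    RewriteRule ^index\.html$ http://www.site.com/ [R=301,L]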

Reader roadies recalls reading about a robots.txt and .htaccess solution somewhere. That gave me the inspiration to write this post.

After carefully reviewing Google’s official response to the duplicate content issue, I realized that the problem might not be as bad as we think.

What does Google do about it?
During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in “regular” and “printer” versions and neither set is blocked in robots.txt or via a noindex meta tag, we’ll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering — rather than ranking adjustments … so in the vast majority of cases, the worst thing that’ll befall webmasters is to see the “less desired” version of a page shown in our index.

Basically, Google says that unless we are doing something purposely ill-intentioned (like ‘borrowing’ content from other sites), they will only toss out the duplicate pages. They explain that their algorithm automatically detects the ‘right’ page and uses it to return results.

The problem is that we might not want Google to choose the ‘right’ page for us. Maybe they are choosing the printer-friendly page and we want them to choose the page that includes our sponsors’ ads! That, in my opinion, is one of the main reasons to address the duplicate content issue. Another concern is that those tossed-out pages will likely end up in the infamous supplemental index. Nobody wants them there :-).

One important addition to Rand’s article is the use of robots.txt to address the issue. One advantage it has over the meta robots “noindex” tag is in the case of RSS feeds: web robots index them and they contain duplicate content, but the meta tag is intended for HTML/XHTML documents while feeds are XML content.
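To give you an idea, a sketch of the robots.txt approach for feeds could look like this (assuming WordPress-style /feed/ URLs; your blog’s feed paths may differ):

    # Keep crawlers out of the duplicate feed URLs
    User-agent: *
    Disallow: /feed/
    Disallow: /comments/feed/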

If you read my post on John Chow’s robots.txt file, you probably noticed that some of the changes he made to his file were precisely to address duplicate content issues.

Now, let me explain how you can address duplicate content via robots.txt. Read more

Advanced link cloaking techniques

The interesting discussion between Rand and Jeremy had me thinking about some of the things affiliates do to protect their links. I am talking about link cloaking — the art of hiding links.

We can hide links from our potential customers (in the case of affiliate links), and we can hide them from the search engines as well (as in the case of reciprocal links, paid links, etc.).

While I think cloaking affiliate links to prevent others from stealing your commissions is useful, I am not encouraging you to use the techniques I am about to explain. I certainly think it is very important to understand link cloaking in order to protect yourself when you are buying products, services or links.

When I am reading a product endorsement, I usually mouse over the link to see if it is an affiliate link. Why? I don’t mind the blogger making a commission; but if I see he or she is trying to hide it via redirects, JavaScript, etc., I don’t perceive it as an endorsement. I feel it is a concealed ad. When I see <aff>, an editor’s note, etc., I feel I can trust the endorsement.
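To show what I mean by the redirect variety, here is a rough sketch of how such a link is typically set up on Apache (the /go/ path and the merchant URL are made-up examples): the visible link points to the blogger’s own domain, and .htaccess quietly forwards it to the affiliate URL.

    # A local /go/ URL that forwards to the affiliate link (uses mod_alias)
    Redirect 302 /go/some-product http://www.merchant.example/?aff_id=12345

Mousing over the link only shows site.com/go/some-product, not the affiliate parameters.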

Another interesting technique is the cloaking of links to the search engines. The idea is that your link partners think you endorse them, while you tell the search engines that you don’t. Again, I am not supporting this.
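A common way this is done (again, only as an illustration; /out/ is a hypothetical directory name) is to route the reciprocal or paid links through a local redirect directory and then block that directory in robots.txt, so the partner sees a normal-looking link while the search engines are told not to crawl it:

    # Keep crawlers out of the hypothetical /out/ redirect directory
    User-agent: *
    Disallow: /out/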

Cloaking links to potential customers

Several of the techniques I’ve seen are: Read more

Robots.txt 101

First, let me thank my beloved reader, SEO Blog.

Thanks to him I got a really nice bump in traffic and several new RSS subscribers.

It is really funny how people who don’t know you start questioning your knowledge, calling you names, etc. I am glad that I don’t take things personally. For me it was a great opportunity to get my new blog some exposure.

I did not intentionally try to be controversial. I did run a backlink check on John’s site and found the interesting results I reported. I am still more inclined to believe that my theory has more grounds than SEO Blog’s. Please keep reading to learn why.

His theory is that John fixed the problem by making some substantial changes to his robots.txt file. I am really glad that he finally decided to dig for evidence. This is far more professional than calling people you don’t know names.

I carefully checked both robots.txt files and here is what John removed in the new version: Read more