Canonicalization: The Gospel of HTTP 301

Usually I don’t cover basic material in this blog, but since a loyal reader, Paul Montwill, requested it, I’m happy to oblige. As I learned back in school, if one person asks a question, there are probably many others at the back of the class quietly wondering the same thing. So here is a brief explanation of web server redirects and how to use them to solve URL canonicalization issues.

And just what is that ecclesiastic-sounding word “canonicalization”? It was Matt Cutts and not the Pope that made it famous, as he used the nomenclature to describe a certain issue that popped up at Google. Here is the problem. All of us have these URLs:

1) sitename.com/

2) sitename.com/index.html

3) www.sitename.com 

4) www.sitename.com/index.html

You know they are all the same page. I know they are all the same page. But computers, unfortunately, aren't on the same page. A crawler treats each distinct address as a potentially different page unless it is told that they all represent the same one. One way to tell it is to pick one of them and use it consistently in all your linking. The harder part, however, is getting other website owners who link to you to do the same. Some might use one, others another, and a few are bound to choose a third.
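To make the problem concrete, here is a minimal Python sketch (an illustration only, not part of any server setup) that folds the four variants above into one form. It assumes the non-www address is the preferred one and that index.html is the directory index file; the name canonicalize and both constants are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "sitename.com"  # assumption: the non-www form is preferred
INDEX_NAMES = ("index.html",)    # assumption: the server's directory index file


def canonicalize(url: str) -> str:
    """Map the common variants of a URL onto one canonical form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    # Fold the www host into the bare domain.
    if host == "www." + CANONICAL_HOST:
        host = CANONICAL_HOST
    # An empty path and "/" name the same resource.
    path = parts.path or "/"
    # Drop a trailing index file name so "/index.html" becomes "/".
    for name in INDEX_NAMES:
        if path.endswith("/" + name):
            path = path[: -len(name)]
    return urlunsplit((parts.scheme, host, path, parts.query, parts.fragment))


variants = [
    "http://sitename.com/",
    "http://sitename.com/index.html",
    "http://www.sitename.com",
    "http://www.sitename.com/index.html",
]
canonical_forms = {canonicalize(u) for u in variants}
print(canonical_forms)
```

All four addresses collapse to a single entry in the set, which is exactly the judgment a crawler cannot safely make on its own.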

The best way to solve this is to pick one URL and have your web server automatically force all requests for other variations to go to the one you picked. We can use HTTP redirects to accomplish this.

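HTTP redirects are simply web server responses that carry a 30x status code together with the address to go to instead. In simplified form (the real response puts the address in a Location header), this is how it looks to the web browser: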

HTTP 30x http://anotherurl.com

The number 30x is a status code from 300–307. The most commonly used are 301 and 302. (For a more complete description of each of the status codes, please read the HTTP Request for Comments (RFC 2616), section 10.) We only need 301, the permanent redirect. This status code tells the browser or crawler that the requested page has permanently moved to the address in the message. For example, you may want http://sitename.com to be your canonical page (like I do for my blog). If a visitor or crawler requests http://www.sitename.com, you want the web server to send back HTTP 301 http://sitename.com so that the crawler 'understands' that this is the proper, canonical page.
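For the curious, here is a rough Python sketch of what such a response looks like on the wire. It is not how Apache does it internally; it only shows the status line and the Location header a client receives. The function name redirect_for and the path /about.html are hypothetical, and the sketch assumes that any host other than the canonical one should be redirected.

```python
from typing import Optional

CANONICAL_HOST = "sitename.com"  # assumption: placeholder domain from the text


def redirect_for(host: str, path: str) -> Optional[str]:
    """Return a raw 301 response for a non-canonical host, or None."""
    if host.lower() == CANONICAL_HOST:
        return None  # already canonical: serve the page normally
    location = f"http://{CANONICAL_HOST}{path}"
    return (
        "HTTP/1.1 301 Moved Permanently\r\n"
        f"Location: {location}\r\n"
        "Content-Length: 0\r\n"
        "\r\n"
    )


response = redirect_for("www.sitename.com", "/about.html")
print(response)
```

The client reads the 301, takes the new address from the Location header, and requests that address instead. A crawler additionally records that the old address should no longer be treated as a page of its own.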

How do we do that?

There are two ways we can accomplish this with Apache: a basic one and an advanced one. The basic one uses the mod_alias module and its directives Redirect, RedirectPermanent or RedirectMatch. Keep in mind that it does not help with www vs non-www issues, though, because these directives match only the URL path and never look at the host name.

In your .htaccess file, add one of these:

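# Redirect a single path to the canonical homepage
Redirect 301 /index.html http://sitename.com/

# Shorthand for "Redirect 301"
RedirectPermanent /index.html http://sitename.com/

# Regex form. Use it only on a host other than sitename.com (an old domain, say); on sitename.com itself it redirects each page to itself and loops
RedirectMatch 301 ^/(.*)\.html$ http://sitename.com/$1.html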

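The more advanced one, which I recommend, is the one that I use. It relies on the mod_rewrite module. Here is what my Apache server configuration looks like (inside a .htaccess file, Apache strips the leading slash before matching, so the pattern would start with ^ rather than ^/):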

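# URL Rewriting

RewriteEngine on

# Rewrite logging (these two directives exist only in Apache 2.2 and earlier); level 0 turns logging off
RewriteLog logs/rewrite.log

RewriteLogLevel 0

# If the requested host starts with www.hamletbatista.com (NC = ignore case)...
RewriteCond %{HTTP_HOST} ^www\.hamletbatista\.com [NC]

# ...answer with a permanent redirect to the same path on the bare domain
RewriteRule ^/(.*) http://hamletbatista.com/$1 [R=301,L]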
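If the mod_rewrite syntax looks opaque, the condition and the rule above can be mimicked in a few lines of Python. This is only a sketch of the matching logic, not of how Apache processes requests; apply_rule is a hypothetical name and the sample path is made up.

```python
import re
from typing import Optional

# Mirrors: RewriteCond %{HTTP_HOST} ^www\.hamletbatista\.com [NC]
HOST_COND = re.compile(r"^www\.hamletbatista\.com", re.IGNORECASE)
# Mirrors the pattern of: RewriteRule ^/(.*) http://hamletbatista.com/$1
RULE_PATTERN = re.compile(r"^/(.*)")


def apply_rule(host: str, path: str) -> Optional[str]:
    """Return the 301 target for a request, or None if no redirect applies."""
    # mod_rewrite tests the rule pattern first and the conditions second;
    # the order is swapped here for readability, and the outcome is the same.
    if not HOST_COND.search(host):
        return None
    match = RULE_PATTERN.search(path)
    if match is None:
        return None
    # $1 in the substitution is the text captured by (.*)
    return "http://hamletbatista.com/" + match.group(1)


target = apply_rule("WWW.HamletBatista.com", "/some/post/")
print(target)
```

The captured group is what carries the rest of the path over to the new host, so every page on the www address lands on its counterpart rather than on the homepage.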

 

As you have probably noticed, I prefer http://hamletbatista.com. If I wanted http://www.hamletbatista.com/ instead, I would rewrite it this way:

RewriteCond %{HTTP_HOST} ^hamletbatista\.com [NC]

RewriteRule ^/(.*) http://www.hamletbatista.com/$1 [R=301,L]

If it were a regular website and not a blog, I'd add this line too:

RewriteRule ^/index\.html$ http://hamletbatista.com/ [R=301,L]

As always, when you begin playing with files like these, it’s a good idea to check the Apache documentation for more details. It may not be the Bible, but for canonicalization issues, it’s as good as gospel.

10 replies
  1. Jason says:

    Mod_Rewrite is certainly the way to go, and will offer the most flexibility when solving problems like this. However, WordPress users who might not want to edit their .htaccess file or update the mod_rewrite module can get away with using Justin Shattuck's WWW Redirect Plugin. It's dead simple and can solve this problem with just a few clicks.
  2. Mutiny Design says:

    I noticed you were doing this and realised I was not doing it myself…

    I notice a lot of sites out on the internet that suffer from this, particularly when the link to the homepage goes to a /index.php page with no PageRank. Decent programmers seem not to fall for this one, but you rarely see a site that redirects http:// to http://www. or vice versa.

    How damaging do you think it can be not to do this?

    No nice image with this post.
  3. markus941 says:

    I recently noticed that Google has both versions of my site and versions of my posts indexed with a slash and also without.

    Do you know how to write a rule that makes sure that nothing has a slash behind it?
