Posts

Reclaiming What’s Yours: Getting your feed back from FeedBurner while still tracking subscribers

As you are probably aware, FeedBurner's way of tracking subscriptions is a little unreliable. You have probably seen your subscription numbers drop significantly on weekends and on days with no new posts or little activity. If you're like me, you want to know your actual subscriber numbers.

There isn't a straightforward solution, but I have a couple of ideas I'd like to test. The easier one involves using FeedBurner's Awareness API to get access to the raw data it collects and interpreting that data myself in a more useful way. The other idea takes a little more time and involves parsing the RSS hits out of the web server log files. I explained my idea for the log files in this comment.
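To give you a taste of the Awareness API idea, here is a rough sketch from memory of the kind of request and response involved. Treat the endpoint, the parameters and the response fields as assumptions to check against FeedBurner's Awareness API documentation (you also have to enable the API for your feed in FeedBurner first); yourfeedname and the numbers are placeholders.

    GET http://api.feedburner.com/awareness/1.0/GetFeedData?uri=yourfeedname&dates=2007-07-01,2007-07-31

    <rsp stat="ok">
      <feed uri="yourfeedname">
        <!-- one entry per day: circulation is the estimated subscriber count, hits the raw requests -->
        <entry date="2007-07-01" circulation="1532" hits="4210" />
        <entry date="2007-07-02" circulation="1498" hits="3977" />
      </feed>
    </rsp>

With the raw daily numbers in hand, you could smooth out the weekend dips yourself, for example by averaging circulation over a rolling seven-day window instead of taking each day's figure at face value.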

The API idea has the disadvantage that it depends on FeedBurner, and my hands would be tied later if I wanted to create a competing product. The log file idea, on the other hand, is complicated by the fact that, once you've moved your feed to FeedBurner, the RSS hits no longer reach your website; you only get the hits that come from FeedBurner itself. That is a major obstacle. Read more

Controlling Your Robots: Using the X-Robots-Tag HTTP header with Googlebot

We have discussed before how to control Googlebot via robots.txt and robots meta tags. Both methods have limitations. With robots.txt you can block the crawling of any page or directory, but you cannot control indexing, caching or snippets. With the robots meta tag you can control indexing, caching and snippets, but only for HTML files, as the tag is embedded in the pages themselves. You have no granular control over binary and other non-HTML files.
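For reference, this is roughly what the two existing methods look like; the directory name and the directive values are just examples:

    # robots.txt: blocks crawling of a directory, but says nothing about indexing or caching
    User-agent: Googlebot
    Disallow: /private/

    <!-- robots meta tag: controls indexing, caching and snippets, but only inside HTML pages -->
    <meta name="robots" content="noindex,noarchive,nosnippet">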

Until now. Google recently introduced another clever solution to this problem. You can now specify the robots meta tag directives via an HTTP header. The new header is X-Robots-Tag, and it behaves like, and supports the same directives as, the regular robots meta tag: index/noindex, archive/noarchive, snippet/nosnippet and the new unavailable_after directive. This new technique makes it possible to have granular control over indexing, caching and other functions for any page on your website, no matter the type of content it has: PDF, Word doc, Excel file, zip file, etc. Read more
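If your site runs on Apache with mod_headers enabled, a sketch along these lines in .htaccess would attach the header to your binary files; the file extensions and the directives are just an example:

    # Send robots directives for non-HTML files in the HTTP response header
    <FilesMatch "\.(pdf|doc|xls|zip)$">
      Header set X-Robots-Tag "noindex, noarchive"
    </FilesMatch>

You can confirm the header is being sent by requesting one of the files with curl -I and checking the response headers.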

Canonicalization: The Gospel of HTTP 301

Usually I don't cover basic material in this blog, but since a loyal reader, Paul Montwill, requested it, I'm happy to oblige. As I learned back in school, if one person asks a question, there are probably many others at the back of the class quietly wondering the same thing. So here is a brief explanation of web server redirects and how they are used to solve URL canonicalization issues.

And just what is that ecclesiastic-sounding word "canonicalization"? It was Matt Cutts, not the Pope, who made it famous when he used the term to describe a certain issue that popped up at Google. Here is the problem. All of us have URLs like these:

1) sitename.com/

2) sitename.com/index.html

3) www.sitename.com 

4) www.sitename.com/index.html

You know they are all the same page. I know they are all the same page. But computers, unfortunately, are not on the same page. They aren't that smart and need to be told that each of these addresses represents the same page. One part of the fix is for you to pick one of them and use it consistently in all your linking. The harder part, however, is getting the other website owners who link to you to do the same. Some might use one, others another, and a few are bound to choose a third.

The best way to solve this is to pick one URL and have your web server automatically force all requests for the other variations over to the one you picked. We can use HTTP 301 redirects to accomplish this. Read more
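As a preview of what the full post covers, here is a minimal .htaccess sketch for Apache with mod_rewrite, assuming www.sitename.com/ is the version you picked; swap in your own domain and adjust the rules to your setup:

    RewriteEngine On

    # Permanently (301) redirect the bare domain to the www version
    RewriteCond %{HTTP_HOST} ^sitename\.com$ [NC]
    RewriteRule ^(.*)$ http://www.sitename.com/$1 [R=301,L]

    # Collapse /index.html onto the root URL
    RewriteRule ^index\.html$ http://www.sitename.com/ [R=301,L]

The 301 status code tells search engines the move is permanent, so the credit for links pointing at any of the four variations should consolidate on the one address you chose.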

Preventing duplicate content issues via robots.txt and .htaccess

Rand of SEOmoz.org posted an interesting article on duplicate content issues. He uses a typical blog to walk through the different ways it can happen.

In a blog, every post can appear on the home page, in the paginated listings, in the archives, in the feeds, and so on.

Rand suggests using the "noindex" robots meta tag, or the potentially risky option of cloaking, to point robots to the original source.

Joost de Valk recommends that WordPress users change some lines in the source code to address these problems.

There are a few things I would like to add, both to the problem and to the proposed solution.

As willcritchlow asks, there is also the problem of multiple URLs leading to the same content (i.e. www.site.com, site.com, site.com/index.html, etc.). This can be fixed by using HTTP redirects and by telling Google our preferred domain via Webmaster Central.

Reader roadies recalls reading about a robots.txt and .htaccess solution somewhere. That gave me the inspiration to write this post.

After carefully reviewing Google's official response to the duplicate content issue, I realized that the problem might not be as bad as we think.

What does Google do about it?
During our crawling and when serving search results, we try hard to index and show pages with distinct information. This filtering means, for instance, that if your site has articles in “regular” and “printer” versions and neither set is blocked in robots.txt or via a noindex meta tag, we’ll choose one version to list. In the rare cases in which we perceive that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. However, we prefer to focus on filtering — rather than ranking adjustments … so in the vast majority of cases, the worst thing that’ll befall webmasters is to see the “less desired” version of a page shown in our index.

Basically, Google says that unless we are doing something purposely ill-intentioned (like 'borrowing' content from other sites), they will simply toss out the duplicate pages. Their algorithm automatically detects the 'right' page and uses that one when returning results.

The problem is that we might not want Google choosing the 'right' page for us. Maybe they are choosing the printer-friendly page when we want them to choose the page that includes our sponsors' ads! That, in my opinion, is one of the main reasons to address the duplicate content issue. Another is that those tossed-out pages will likely end up in the infamous supplemental index, and nobody wants them there :-).

One important addition to Rand's article is the use of robots.txt to address the issue. One advantage it has over the "noindex" robots meta tag is in the case of RSS feeds: web robots index them and they contain duplicate content, but the meta tag is meant for HTML/XHTML documents, while feeds are XML content.
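As a quick sketch, a WordPress-style blog could keep crawlers out of its feed URLs with a couple of lines in robots.txt; the paths below assume WordPress-style feed locations, so adjust them to your own blog:

    # Keep crawlers out of the feed copies of the content
    User-agent: *
    Disallow: /feed/
    Disallow: /comments/feed/

Googlebot also understands wildcard patterns such as Disallow: /*/feed/ for per-post comment feeds, but that is an extension rather than part of the basic robots.txt standard, so test it before relying on it.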

If you read my post on John Chow's robots.txt file, you probably noticed that some of the changes he made to his file were precisely to address duplicate content issues.

Now, let me explain how you can address duplicate content via robots.txt. Read more