Robots.txt 101

June 4th, 2007 · 9 Comments

First, let me thank my beloved reader, SEO Blog.

Thanks to him I got a really nice bump in traffic and several new RSS subscribers.

It is really funny how people who don’t know you start questioning your knowledge, calling you names, etc. I am glad that I don’t take things personally. For me it was a great opportunity to get my new blog some exposure.

I did not intentionally try to be controversial. I did run a backlink check on John’s site and found those interesting results I reported. I am still more inclined to believe that my theory has more grounds than SEO Blog’s. Please keep reading to learn why.

His theory is that John fixed the problem by making some substantial changes to his robots.txt file. I am really glad that he finally decided to dig for evidence. That is far more professional than calling people you don’t know names.

I carefully checked both robots.txt files, and here is what John removed in the new version:

# Disallow all monthly archive pages
Disallow: /2005/12
Disallow: /2006/01
Disallow: /2006/02
Disallow: /2006/03
Disallow: /2006/04
Disallow: /2006/05
Disallow: /2006/06
Disallow: /2006/07
Disallow: /2006/08
Disallow: /2006/09
Disallow: /2006/10
Disallow: /2006/11
Disallow: /2006/12
Disallow: /2007/01
Disallow: /2007/02
Disallow: /2007/03
Disallow: /2007/04
Disallow: /2007/05

# The Googlebot is the main search bot for google
User-agent: Googlebot

# Disallow all files ending with these extensions
Disallow: /*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.gz$
Disallow: /*.wmv$
Disallow: /*.tar$
Disallow: /*.tgz$
Disallow: /*.cgi$
Disallow: /*.xhtml$

# Disallow Google from parsing indididual post feeds and trackbacks..
Disallow: */feed/
Disallow: */trackback/

# Disallow all files with ? in url
Disallow: /*?*
Disallow: /*?

# Disallow all archived monthlies
Disallow: /2006/0*
Disallow: /2007/0*
Disallow: /2005/1*
Disallow: /2006/1*
Disallow: /2007/1*

In plain English, this means he is now letting Google crawl and index his archived articles, dynamic pages, and files ending with “.php”, “.js”, “.inc”, “.css”, etc. Note that in neither robots.txt file is John preventing the crawler from accessing his home page or the regular posts. WordPress uses PHP, but regular posts and the home page can be accessed without “.php”.
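To see concretely what those removed rules were hiding, here is a rough sketch of mine (not John’s setup or Google’s exact logic) that translates a few of the Googlebot-style wildcard patterns into regular expressions and tests some hypothetical WordPress paths against them. Googlebot’s real matching has more nuances, but this approximates how “*” and a trailing “$” behave:

import re

# A handful of the Disallow patterns John removed (Googlebot wildcard syntax)
removed_rules = [
    "/2006/0*",      # archived monthlies
    "/*.php$",       # files ending in .php
    "/*?",           # URLs with a query string
    "*/trackback/",  # individual post trackbacks
]

def rule_to_regex(rule):
    """Rough translation of a Googlebot-style pattern: '*' matches any
    run of characters, a trailing '$' anchors the match at the end."""
    anchored = rule.endswith("$")
    pattern = re.escape(rule.rstrip("$")).replace(r"\*", ".*")
    return re.compile("^" + pattern + ("$" if anchored else ""))

def blocked(path, rules):
    return any(rule_to_regex(r).match(path) for r in rules)

# Hypothetical paths, for illustration only
for path in ["/2006/05/some-post/", "/wp-login.php", "/make-money-online/", "/?p=123"]:
    print(path, "->", "blocked" if blocked(path, removed_rules) else "allowed")

A date-less post URL like “/make-money-online/” comes back allowed, which matches the point above: the old file was only hiding archives, feeds, and dynamic URLs from Googlebot, not the posts themselves or the home page.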

If this was the change that fixed the problem, it might be because removing those internal pages from the spider’s view weakened his internal link structure. His claim is not without merit.

Now, here is one tiny little detail that my friend is missing. To prove his point, he used Google’s cache to show the older version of the robots.txt file. If Google still has that version in its cache, what makes him think that Google is already using the new one? Google should be caching the new version, not the old one. That is why I am still not convinced that this is the reason for the fix.

John says he is not telling, because a reader said Google might change their algorithm and drop him again. What do the changes John made to his robots.txt file have to do with algorithm changes? I am just curious.

In reality, we can theorize all we want, but the only ones who can tell for sure are the folks at the Googleplex. John probably tried many different things and one or several of them worked. He is probably not even sure which one did.

How did I learn SEO?

SEO Blog suggests I visit his forum to learn SEO. Here is the problem with that: I am a technical guy, and I cannot take gut feelings or opinions as truth. I do visit some forums and blogs every now and then, but my experience is that the noise-to-signal ratio is too high. I prefer to learn and get my insights from the source: search engine research papers, search engine representatives’ blogs, or my own experiments.

I learned SEO back in 2002 when I read this paper. Back then, nobody was even talking about Google bombs, anchor text, etc. Read the paper, it is all there.


Category: Blog

9 Comments → “Robots.txt 101”


  1. Jez

    6 years ago

    Hi Hamlet,

    I have been following this issue too and think you have made a bit of an error… as you know the cache is always a few days old, but the robots.txt will be analysed on the day of the crawl, in "real time".

    If Google never let go of the cached file, how would it ever crawl the site again???

    The actual crawl runs ahead of the cache, but you already know this…

    One thing you may not have seen is this post on JC:
    http://www.johnchow.com/getting-out-of-the-google

    A few days earlier the robots.txt file was changed for the reasons outlined in the above post… give it a few days for the denied pages to be dropped, a couple of days for users to report the drop in SERPs, and the timing is about right for John’s "Google ban".

    Then, the latest robots.txt file reverses what had been done, re-allows the supplemental pages, and things return to normal.

    What we should have checked was whether the supplemental pages were back in the cache.

    I think JC made a blunder in blocking his supplemental pages, simple as that.

    Does anyone really believe Google would change their algorithm because of John Chow!!!!

    I think you have to bear in mind that JC survives on hype, spin and reader manipulation; that's what his site exemplifies. I think he has created a lot of buzz and mystery out of his own %$£% up… that's what he is good at.

    Reply

  2. Hamlet Batista

    6 years ago

    Jez,

    Thanks for your comment. I am really glad to have experts visiting my blog.

    Please note that I did not rule out the robots.txt changes as the solution to the problem.

    If this was the change that fixed the problem, it might be because removing those internal pages from the spider’s view weakened his internal link structure. His claim is not without merit.

    I am not sure I follow part of your conclusions.

    … as you know the cache is always a few days old, but the robots.txt will be analysed on the day of the crawl, in “real time”.

    If Google never let go of the cached file, how would it ever crawl the site again???

    The robots.txt is "analyzed" (parsed) in real time, but the results of this will need to be reflected when the index is updated. Search engines first crawl and then index pages.

    Dropping pages implies a modification to the index (as a result of a crawl).

    To me, the pages in the cache are the pages that are affecting the current index. I might be wrong, but I need some research papers that would tell me otherwise.

    Again, I am not ruling out the robots.txt as the solution to his problem.

    At one point, I thought that he had blocked access to the regular posts by misusing the wildcards, e.g. Disallow: /2007/1*. My blog includes the date in normal post URLs, but I checked his and it doesn't.
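    To illustrate why that worried me, here is a tiny sketch (the permalinks are hypothetical) of how a rule like Disallow: /2007/1* behaves. It would catch any date-based permalink from October 2007 onward, while a date-less permalink like John's passes through:

    import re

    pattern = re.compile(r"^/2007/1.*")  # rough regex equivalent of "Disallow: /2007/1*"
    print(bool(pattern.match("/2007/10/my-post/")))  # True: a date-based permalink would be blocked
    print(bool(pattern.match("/my-post/")))          # False: a date-less permalink is unaffected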

    Reply

  3. Jez

    6 years ago

    Hi Hamlet,

    Sorry if I did not read your post thoroughly enough… your points are interesting, it could well have been the anchor text… or perhaps a mix of anchor and robots.txt.

    I thought for some time that JC should have asked users to use a mix of different links.

    If it had been me, I would have asked users to collect the link text from a dynamic page that rotated a number of different permutations….

    As for being an expert, far from it, I am here to learn ;-)

    Reply

  4. Jez

    6 years ago

    Oh yes, the point I was trying to make about the cache was that although the old robots file was still cached it is possible that re-allowed pages had already been re-indexed etc.

    I did not explain myself well….

    Reply

  5. Hamlet Batista

    6 years ago

    If it had been me, I would have asked users to collect the link text from a dynamic page that rotated a number of different permutations….

    That doesn't sound like newbie stuff to me :-)

    I am working on a post where I am dissecting Google's original paper. Hopefully we all can learn something valuable from it.

    Reply

  6. Jez

    6 years ago

    Hi Hamlet,

    I am no stranger to code, I work in that field, but most of my experience has been on Intranets… I currently manage a large installation (9 instances) of moodle.org for a University… but there is no SEO requirement for this work… SEO is something I am interested in learning more about…

    Reply

    • Hamlet Batista

      6 years ago

      Jez,

      I'm glad to have other developers visiting my blog. Hopefully you can put some of the code in my posts to work. I appreciate any feedback.

      It's amazing how we can find open source code for pretty much everything. Moodle.org looks very interesting.

      Reply

    • Jez

      6 years ago

      If I get time ;-)

      I notice some of it uses Python; it's been a long time since I've used Python, but I may have a play with it.

      Jez

      Reply

  7. Siaar

    5 years ago

    It really is working after making some changes to my blog using the same techniques you wrote about above, and I would like to purchase the software you added at the bottom of your article…. Siaar of Siaar Group

    Reply
