Anatomy of a Distributed Web Spider — Google's inner workings part 3

What can you do to make life easier for those search engine crawlers? Let's pick up where we left off in our inner workings of Google series. I am going to give a brief overview of how distributed crawling works. This topic is useful but can be a bit geeky, so I'm going to offer you a prize at the end of the post. Keep reading, I am sure you will like it. (Spoiler: it's a very useful script.)

 

4.3 Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

If that was when they started, can you imagine how massive it is now?
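To make the quoted passage a little more concrete, here is a minimal Python sketch of two of its ideas: a per-crawler DNS cache and connections that move through the four states it lists. This is not Google's code. The names `State`, `DnsCache`, `fetch`, `crawl` and `fake_resolver` are made up for illustration, the network I/O is replaced by placeholders, and it uses asyncio coroutines rather than the explicit queues the paper describes.

```python
import asyncio
from enum import Enum


class State(Enum):
    # The four connection states named in the quoted passage.
    LOOKING_UP_DNS = "looking up DNS"
    CONNECTING = "connecting to host"
    SENDING_REQUEST = "sending request"
    RECEIVING_RESPONSE = "receiving response"


class DnsCache:
    """Resolve each host once and reuse the answer (illustrative only)."""

    def __init__(self, resolver):
        self._resolver = resolver  # async function host -> IP (assumption)
        self._pending = {}         # host -> task holding the answer
        self.lookups = 0           # number of real lookups performed

    async def resolve(self, host):
        if host not in self._pending:
            self.lookups += 1
            # Store the task right away so concurrent callers share it.
            self._pending[host] = asyncio.ensure_future(self._resolver(host))
        return await self._pending[host]


async def fetch(host, dns, trace):
    """Walk one page fetch through the states, recording each step."""
    trace.append((host, State.LOOKING_UP_DNS))
    await dns.resolve(host)
    for state in (State.CONNECTING, State.SENDING_REQUEST,
                  State.RECEIVING_RESPONSE):
        trace.append((host, state))
        await asyncio.sleep(0)  # placeholder for real network I/O


async def crawl(hosts, resolver, max_connections=300):
    """Fetch all hosts concurrently, capped at max_connections at a time."""
    dns = DnsCache(resolver)
    slots = asyncio.Semaphore(max_connections)
    trace = []

    async def worker(host):
        async with slots:
            await fetch(host, dns, trace)

    await asyncio.gather(*(worker(host) for host in hosts))
    return dns.lookups, trace


async def fake_resolver(host):
    """Stand-in resolver so the sketch runs without touching the network."""
    await asyncio.sleep(0)
    return "192.0.2.1"  # address reserved for documentation
```

Calling `asyncio.run(crawl(["a.example", "a.example", "b.example"], fake_resolver))` fetches three pages but performs only two DNS lookups, because the second request for `a.example` is answered from the cache.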

As I explained in the first installment, crawling is the process of downloading the documents that are going to be indexed. If Google is unable to download your pages, you won't be listed. That is why it is so important to use reliable hosting and to keep your pages up at all times. For my most profitable sites I use a redundant setup I came up with a few years ago. I will share it in a future post.

A distributed crawling system is Google's way of dividing the crawling task among separate components (the URL server, the crawlers, and so on) that run on multiple servers. Let me explain this with an analogy.

Imagine a librarian who has been assigned the task of preparing books for a new library. The library is empty. He has several assistants to help with the task. The assistants don't know anything about the books, where they can be found or where they go. The librarian does know, and he gives each assistant a list of books and exactly where to find them. "Please bring them here," he says. He does the same for every assistant, until he has all the books he needs.

The librarian is the URL Server in this case, and the assistants are the crawlers. The crawlers receive lists of URLs that they need to download, and they need to do it efficiently. (As a consequence of this, you can see Googlebot hits on your site coming from different IP addresses.)
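The librarian and his assistants can be sketched in a few lines of Python. This is only an illustration of the idea, not how Google built it: `url_server`, `run_crawl` and the `fetch` callback are hypothetical names, and threads in a single process stand in for crawlers running on separate servers.

```python
import queue
import threading


def url_server(urls, batch_size):
    """The 'librarian': split the full URL list into batches."""
    for start in range(0, len(urls), batch_size):
        yield urls[start:start + batch_size]


def run_crawl(urls, fetch, num_crawlers=3, batch_size=2):
    """Hand batches to crawler threads; return {url: fetched page}."""
    batches = queue.Queue()
    for batch in url_server(urls, batch_size):
        batches.put(batch)

    results = {}
    lock = threading.Lock()

    def crawler():
        # The 'assistant': keep taking lists until none are left.
        while True:
            try:
                batch = batches.get_nowait()
            except queue.Empty:
                return
            for url in batch:
                page = fetch(url)  # fetch is supplied by the caller
                with lock:
                    results[url] = page

    workers = [threading.Thread(target=crawler) for _ in range(num_crawlers)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return results
```

For example, `run_crawl(my_urls, my_fetch)` would spread `my_urls` over three crawler threads, where `my_fetch` is whatever download function you plug in.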

There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like "This page is copyrighted and should not be indexed." Needless to say, web crawlers don't read pages the same way humans do, and they cannot understand a statement like that.

Another important piece of information is how to tell search engines not to crawl sensitive pages. There are pages on your site that, for one reason or another, you don't want included in the Google index. One way to do that is the robots exclusion protocol (a robots.txt file), which asks crawlers not to fetch those pages. Another is the robots meta tag with the noindex value, which asks them not to index a page they have fetched.
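Here is a small Python sketch of how a well-behaved crawler might honor both mechanisms. It uses the standard library's `urllib.robotparser` and `html.parser`; the robots.txt content, the `/private/` path, the example.com URLs and the helper names `allowed` and `has_noindex` are assumptions made up for the example.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Example robots.txt: ask every crawler to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed(robots_txt, user_agent, url):
    """True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class MetaRobots(HTMLParser):
    """Collect the directives from <meta name="robots" content="...">."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",")
            )


def has_noindex(html):
    """True if the page carries a robots meta tag containing noindex."""
    parser = MetaRobots()
    parser.feed(html)
    return "noindex" in parser.directives
```

With these helpers, `allowed(ROBOTS_TXT, "Googlebot", "http://example.com/private/report.html")` is false, while the same call for `http://example.com/index.html` is true. `has_noindex('<meta name="robots" content="noindex, nofollow">')` is true.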

If you read this far (or jumped to the end of the post ;-)), here is your prize: chkng.zip

Before I fell in love with Python, I coded this program in Perl a few years ago. It is an actual crawler for your own use.

What can you use it for?

1. To check your site for broken links.
2. To find out whether a site that is supposed to be linking to you (because you paid for or exchanged a link) still is.
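If you are curious what these two checks involve, here is a rough Python sketch. It is not a translation of the Perl script; the helper names (`extract_links`, `links_to`, `find_broken`) and the `status_of` callback are hypothetical, and treating any status of 400 or above as broken is my own simplification.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def links_to(html, base_url, target):
    """Use 2: does the page contain a link to the target's host?"""
    target_host = urlparse(target).netloc.lower()
    return any(
        urlparse(link).netloc.lower() == target_host
        for link in extract_links(html, base_url)
    )


def find_broken(links, status_of):
    """Use 1: return links whose status is missing or 400 and above.

    status_of is a caller-supplied function mapping a URL to an HTTP
    status code, or None when the server cannot be reached.
    """
    broken = []
    for link in links:
        status = status_of(link)
        if status is None or status >= 400:
            broken.append(link)
    return broken
```

Keeping the download step outside these functions (the `status_of` callback) makes the checks easy to try on saved HTML before pointing them at live sites.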

There are some other uses that I will not comment on, as I do not want to encourage spamming.

How do you use it? It is multi-threaded code, which means it can work on multiple sites simultaneously. You simply pass the number of sites you want to check in parallel and the URL to search for (your site's URL in most cases). The steps are:

1. Download it to your Linux box (I have not tested it on Windows or Mac).
2. Run chmod +x chkng.pl
3. Create a file with the list of sites you want to check, one per line.
4. Run ./chkng.pl 5 http://yoursite.com < sitestochecklist.txt

This example will check 5 sites in parallel, and search for your URL on each of the pages.
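For readers who prefer Python, the same command-line shape (a parallel count, a URL to search for, and a list of sites on standard input) could be sketched like this. To be clear, this is not the Perl script from the zip file and I am not claiming it behaves the same way: `fetch_page` and `check_sites` are names I made up, and the plain substring match is a deliberately naive way to look for your URL.

```python
import sys
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Download a page; return its text, or None on any failure."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except Exception:
        return None


def check_sites(sites, target, parallel, fetch=fetch_page):
    """Check up to `parallel` sites at once for the target URL.

    Returns (site, status) pairs in input order, where status is
    "found", "missing" or "unreachable".
    """
    def check(site):
        page = fetch(site)
        if page is None:
            return site, "unreachable"
        # Naive test: the target URL appears somewhere in the HTML.
        return site, "found" if target in page else "missing"

    with ThreadPoolExecutor(max_workers=parallel) as pool:
        return list(pool.map(check, sites))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check.py 5 http://yoursite.com < sitestochecklist.txt
    parallel, target = int(sys.argv[1]), sys.argv[2]
    sites = [line.strip() for line in sys.stdin if line.strip()]
    for site, status in check_sites(sites, target, parallel):
        print(status, site)
```

The `fetch` parameter exists so you can try the logic with canned pages before letting it loose on real sites. The file name `check.py` in the usage comment is just a placeholder.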

5 replies
  1. Mutiny Design says:

    Another excellent post. I will have to read the first two now. I hadn't considered some of the variables you have brought to light here. The 'if your site is down' is an obvious one though.

    • ABDUL BARI says:

      Hi Friend,

      I like the information given on this site, and I want to know more about the backend process of a search engine. I am working on a search engine project for my academics. Can you please help me out in this regard?

      thank you,
      with regards
      (BARI)
