Google keeps tweaking its search engine, and now it is more important than ever to better understand its inner workings.
Google lured Mr. Manber from Amazon last year. When he arrived and began to look inside the company’s black boxes, he says, that he was surprised that Google’s methods were so far ahead of those of academic researchers and corporate rivals.
While Google closely guards its secret sauce, for many obvious reasons, it is possible to build a pretty solid picture of Google's engine. In order to do this we are going to start by carefully dissecting Google's original engine: How Google was conceived back in 1998. Although a newborn baby, it had all the basic elements it needed to survive in the web world.
The plan is to study how it worked originally, and follow all the published research papers and patents in order to put together the missing pieces. It is going to be very interesting.
Google has added and improved many things over the years. The original paper only describes the workings of the web search engine. Missing features are the ability to search news, images, documents (PDF, word, etc.), video, products, addresses, books, patents, maps, blogs, etc.
Also missing are substantial improvements such as local search, mobile search, personalized search, universal search, supplemental index, freshness, spam detection and PageRank improvements. Some things that will be hard to know is how Google uses the data it collects through other services, like Google Toolbar, Google Analytics, Google Adsense, Doubleclick, Gmail, Gtalk, Feedburner, etc. There is a lot of information that can be used both for better ads and for better search results.
No matter the type of search you are conducting, conceptually, search engines have three key components: the crawler, the indexer and the searcher.
The crawler's (also known as a search engine robot) job is to collect all the information that will be later searched. Whether it's images, video, text or RSS feeds. These documents are stored for later processing by the indexer module. Webmasters and site owners can control how crawlers access their websites via a robots.txt file and the robots exclusion protocol. In this file you basically tell the crawler what pages or sections it is not allowed to crawl. I posted an entry about this several days ago.
The indexer module is the one doing the heavy lifting. It has the daunting task of carefully organizing the information collected by the crawler. The power of the search engine is on this specific task. Depending on how well classified the information is – the faster and the better the search. Search engines conceptually classify documents similar to the way you file documents on a cabinet. Without some sort of labeling you will probably waste a lot time finding your bank statements, notes, etc. Search engines label documents in a way that makes it easy for them to find them later by words or phrases (also known as keywords). In the case of text and similar documents the indexer breaks down the document in words and collects some additional information about those words, such as the frequency of the word in the document.
The searcher module is the one that takes the user search, cleans it to remove ambiguities, misspellings, etc., finds the documents in the index that more closely match the search, and rank them according to the current ranking formula. The ranking formula is the most closely guarded secret of all major commercial search engines.
These basic components remain the same nowadays, but they say that the devil is on the details. Today, Google's inner workings are far more complex than what I am going to explain, but the basic principles are the same. I will quote the original paper as necessary.
There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance-related assessements and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).
One notable improvement Google brought to the commercial search markeplace was the use of link structure and anchor/link text to improve the quality of results. This proved to be a significant factor that helped fuel their growth. Today, these elements remain significant, but Google makes use of very sophisticated filters to detect most attempts at manipulation. Proof that they remain signifficant are the successful Google bombs of late.
To support novel research uses, Google stores all of the actual documents it crawls in compressed form
Here is the reference to their caching feature we are acostumed to use.
Now, let's see how they define PageRank — how important or high quality are pages for Google's search engine.
…a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.
Here is a clear description of why they use anchor text for searching.
The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
Now, let's read about on-page elements that Google considered that were not in regular use back then. Proxi
mity, capitalization and font weight, and page caching.
Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font weigh heavier than other words. Third, full raw HTML of pages is available in a repository.
Now let's see have a big picture view as to how everything fits together. This is very technical, but I will try to explain it the best I can.
In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.
The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
Google uses distributed crawlers/downloaders. If you have ever looked at your server log files you will notice that when Googlebot is visiting your site you will see hits coming from different IPs. That is because the crawling is distributed among several computers. They need a URL server to feed the URLs to download, to the crawlers, because the URL server is the one coordinating the crawling efforts.
All the urls are sent to the storeserver for compression and storage and are assigned an ID (doctID). For computers, it is easier and more efficient to use numbers to refer to things.
The indexers does some heavy work:
- Reads, uncompresses and parses documents. Converts documents into hits (word ocurrences)
- Creates partially sorted forwarded indices.
- Create anchors file (link text, and to and from links). URLResolver fixes relative URLs and assigns docIDs.
- Include anchor text in forward index but using the link it points to as the docID. Associates the text in the link to the document it points to.
- Maintains a link databases used to compute PageRanks
- Generates a lexicon–the list of all different words in the index
Basically, the forward index allows you to find the words of a document given the docID. In order to be useful for searching, this needs to be inverted. ie.: find documents by the words. The sorter does this addtional step, by assisting the indexer in creating an inverted index that uses wordIDs as keys to the docIDs. The inverted index includes the offsets and list of words. Dumplexicon is used to update the lexicon used by the searcher.
Finally, the searcher combines the lexicon, the inverted index and the PageRanks to respond the queries.
Next, I'll describe each of the processes in more detail. Can't wait? Read the document yourself and draw your own conclusions