For this installment of my series on Google's inner workings, I decided to revisit my previous explanation. This time, however, I am including some illustrations so that both technical and non-technical readers can benefit from the information. At the end of the post, I will provide some practical tips to help you improve your site's rankings based on the information given here.
To start the high-level overview, let's see how crawling (the downloading of pages) was originally described.
Google uses a distributed crawling system: several computers/servers running clever software that downloads pages from the entire web.
There is a URLserver that sends lists of URLs to be fetched by the Crawlers. The Crawlers download each individual page and send it to a Store Server.
The Store Server compresses all web pages and stores them in a Repository, where every page is assigned an ID, the docID.
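To make the crawl-and-store flow concrete, here is a minimal sketch in Python. It is not Google's code: the `Repository` class, the `crawl` function, and the `FAKE_WEB` dictionary (which stands in for real HTTP fetching) are names I made up for illustration, and I am assuming docIDs are simply handed out in order.

```python
import zlib


class Repository:
    """Toy stand-in for the Repository: compressed pages keyed by docID."""

    def __init__(self):
        self._pages = {}        # docID -> (url, compressed HTML)
        self._next_doc_id = 0   # assumption: docIDs are sequential integers

    def store(self, url, html):
        """Play the Store Server: compress a page and return its new docID."""
        doc_id = self._next_doc_id
        self._next_doc_id += 1
        self._pages[doc_id] = (url, zlib.compress(html.encode("utf-8")))
        return doc_id

    def load(self, doc_id):
        """Return (url, html) for a docID, uncompressing the stored page."""
        url, blob = self._pages[doc_id]
        return url, zlib.decompress(blob).decode("utf-8")


def crawl(url_list, fetch, repository):
    """Play the Crawler: fetch every URL on the list and store the result."""
    return [repository.store(url, fetch(url)) for url in url_list]


# A fake "web" so the example runs without network access.
FAKE_WEB = {
    "http://site.com/": "<html><a href='/page.html'>Next page</a></html>",
    "http://site.com/page.html": "<html>Hello</html>",
}

repo = Repository()
doc_ids = crawl(list(FAKE_WEB), FAKE_WEB.get, repo)
```

The list passed to `crawl` plays the role of the URL list sent by the URLserver; in a distributed setup, many crawlers would each receive their own list.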
This is the complex, interesting part: the indexing.
The Indexer reads the pages from the Repository, uncompresses them, and extracts key information.
Each document is broken down into a set of word occurrences called word hits.
The word hits record the word, its position in the document, an approximate font size, and capitalization. These are later distributed into a set of "barrels". This is the first step in creating the index. At this point, it is a partially sorted forward index. This means that you can find the word hits by document identifier, but to be useful for search you need the opposite (finding documents by the words they contain). That is why an inverted index is needed.
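As a rough illustration of word hits and a forward index, consider the sketch below. The `WordHit` tuple, the `extract_hits` function, and the fixed `font_size` value are my own assumptions; a real indexer would derive the font size from the page markup rather than take it as a parameter.

```python
import re
from collections import namedtuple

# One hit: the word, its position, a rough font size, and capitalization.
WordHit = namedtuple("WordHit", "word position font_size capitalized")


def extract_hits(text, font_size=3):
    """Break plain text into word hits (font_size is a made-up constant)."""
    hits = []
    for position, token in enumerate(re.findall(r"[A-Za-z0-9]+", text)):
        hits.append(
            WordHit(token.lower(), position, font_size, token[0].isupper())
        )
    return hits


# Forward index: docID -> the word hits found in that document.
forward_index = {
    0: extract_hits("Google crawls the web"),
    1: extract_hits("the WEB is big"),
}
```

Looking up `forward_index[0]` gives every hit in document 0, which is exactly the "by document" direction described above. Answering "which documents contain the word web?" would require scanning every entry.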
The Indexer also extracts all the links in every web page and stores important information about them in an Anchors File. This file contains enough information to determine where each link points from and to, and the text of the link.
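Here is a small sketch of the link extraction step, using Python's standard `html.parser` module. The `AnchorExtractor` class and the tuple layout (source URL, href, link text) are my own choices for illustration; the actual format of the Anchors File is not described here.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (from_url, href, link text) for every <a href> in a page."""

    def __init__(self, from_url):
        super().__init__()
        self.from_url = from_url
        self.anchors = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.anchors.append((self.from_url, self._href, text))
            self._href = None


extractor = AnchorExtractor("http://site.com/")
extractor.feed(
    '<p>See the <a href="/page.html">next page</a> or '
    '<a href="http://other.com/">another site</a>.</p>'
)
anchors = extractor.anchors
```

Note that the first href is still relative at this stage; it is stored exactly as it appears in the page.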
The Sorter takes the barrels, which were previously sorted by docID, and re-sorts them by word identifier to generate the Inverted Index. Each possible word in the index has a unique identifier, the wordID. The Sorter then produces a list of wordIDs and their locations in the Inverted Index.
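The re-sorting step can be sketched as follows. The sample barrel, the `invert` and `word_offsets` functions, and the idea of measuring a "location" as a count of postings are all assumptions on my part, chosen to keep the example small.

```python
from collections import defaultdict

# A forward barrel: docID -> list of (wordID, position) hits.
forward_barrel = {
    0: [(7, 0), (3, 1), (5, 2)],
    1: [(3, 0), (5, 1), (5, 2)],
}


def invert(forward):
    """Regroup hits by wordID instead of docID."""
    inverted = defaultdict(list)
    for doc_id in sorted(forward):
        for word_id, position in forward[doc_id]:
            inverted[word_id].append((doc_id, position))
    # Return a plain dict ordered by wordID.
    return dict(sorted(inverted.items()))


def word_offsets(inverted):
    """List each wordID with the offset where its postings start."""
    offsets, cursor = {}, 0
    for word_id, postings in inverted.items():
        offsets[word_id] = cursor
        cursor += len(postings)
    return offsets


inverted_index = invert(forward_barrel)
offsets = word_offsets(inverted_index)
```

With the inverted form, finding every document that contains wordID 5 is a single lookup: `inverted_index[5]`.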
The URLresolver reads the Anchors File, converts relative URLs (/page.html) into absolute URLs (http://site.com/page.html), and assigns docIDs. The link text is included in the Document Index, associated with the docID that the link points to. It also creates a database of links (pairs of docIDs) that is used to compute the PageRank of each document.
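Below is a hedged sketch of what the URLresolver does, followed by a very simplified PageRank calculation over the resulting link pairs. The sample `anchors` list and the helper names are hypothetical. The `pagerank` function uses the formula from the original paper, PR(A) = (1 - d) + d * sum(PR(T) / C(T)), with a fixed number of iterations; it ignores refinements such as special handling of pages with no outgoing links.

```python
from urllib.parse import urljoin

# (page the link is on, href as written, link text)
anchors = [
    ("http://site.com/", "/page.html", "next page"),
    ("http://site.com/page.html", "/", "home"),
    ("http://site.com/page.html", "about.html", "about us"),
]

doc_ids = {}  # absolute URL -> docID


def doc_id_for(url):
    """Return the docID for a URL, assigning the next free one if needed."""
    return doc_ids.setdefault(url, len(doc_ids))


links = []        # database of links: (source docID, target docID)
anchor_text = {}  # target docID -> link texts pointing at it
for from_url, href, text in anchors:
    to_url = urljoin(from_url, href)  # relative -> absolute
    src, dst = doc_id_for(from_url), doc_id_for(to_url)
    links.append((src, dst))
    anchor_text.setdefault(dst, []).append(text)


def pagerank(links, n_docs, damping=0.85, iterations=20):
    """Simplified iterative PageRank over (source, target) docID pairs."""
    out_degree = [0] * n_docs
    for src, _ in links:
        out_degree[src] += 1
    ranks = [1.0] * n_docs
    for _ in range(iterations):
        new_ranks = [1.0 - damping] * n_docs
        for src, dst in links:
            new_ranks[dst] += damping * ranks[src] / out_degree[src]
        ranks = new_ranks
    return ranks


ranks = pagerank(links, len(doc_ids))
```

Notice that the link text is filed under the page being pointed to, not the page that contains the link.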
A program called DumpLexicon takes the wordID list produced by the Sorter, as well as the Lexicon generated by the Indexer, and converts this information into a new lexicon for the Searcher. The Searcher is run by a web server and uses the lexicon built by DumpLexicon together with the Inverted Index and the PageRanks to answer queries.
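To tie the pieces together, here is a toy Searcher. The `lexicon`, `inverted_index`, and `pageranks` values are invented sample data, and ordering results by PageRank alone is a simplifying assumption; the real ranking combines many more signals.

```python
# Sample data (made up): word -> wordID, wordID -> docIDs, docID -> PageRank.
lexicon = {"google": 3, "crawler": 5, "index": 7}
inverted_index = {3: [0, 1, 2], 5: [1, 2], 7: [0]}
pageranks = {0: 0.9, 1: 0.4, 2: 0.7}


def search(query):
    """Return docIDs containing every query word, highest PageRank first."""
    doc_sets = []
    for word in query.lower().split():
        word_id = lexicon.get(word)
        if word_id is None:
            return []  # a word missing from the lexicon matches nothing
        doc_sets.append(set(inverted_index.get(word_id, [])))
    if not doc_sets:
        return []
    matches = set.intersection(*doc_sets)
    return sorted(matches, key=lambda d: pageranks.get(d, 0.0), reverse=True)


results = search("Google crawler")
```

Here `results` holds documents 2 and 1, in that order, because both contain the two words and document 2 has the higher PageRank.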
I really hope these concepts were far easier to understand this time. Now, on to the practical applications.
Search engine friendliness tips
The ability of the Indexer to decompose your web pages and find all of the links is paramount if you want your content to be indexed. Make sure your pages' HTML is not broken and that your site is not inaccessible for long periods of time.
Expect robot hits coming from multiple IPs (thanks to distributed crawling).
On-page optimization tips
I highlighted an important section in the explanation. It is clear from Google's original research paper that they use the position of the word in the document, the font size of the word, and the capitalization information. I am sure a lot of people don't pay attention to the value of putting words in capitals to emphasize them. This is confirmation from the source that the practice is useful.
When we go into more depth in the next installments, we will see just how important the presence of the words in the title, URL, etc. actually is.
For now, make sure you use your most important keywords at the top of the page. Use large fonts or headings (h1, h2, etc.) and, if possible, use capitals. However, don't abuse it by being overly aggressive.
Off-page optimization tips
As you can see here, Google was conceived to make heavy use of the text in links as a qualifier for what the linked page is about. Nowadays, Google has very potent filters to prevent unscrupulous manipulation of this key feature. My recommendation is that you try to get links with several different anchor texts; the more mixed, the better. It needs to look really natural to avoid filtering.
In our next post, we are going to briefly look at the main software components and take a closer look at the Crawling part.