The Ranking Triathlon: How to overcome crawling, indexing, and searching hurdles

I frequently get asked why a particular page is no longer ranking. I wish there were a simple answer to that question. Instead of giving personal responses, I’ve decided to write a detailed post with the possible problems that might cause your ranking to drop, as well as all the solutions I could think of. I also want to present a case study every week of a ranking that dropped and what we did to get it back. If you have a site that is affected, I invite you to participate. Send me an email or leave a comment.

There are many reasons why your page or website might not be ranking. Let's go through each of the three steps in the search engine ranking process and examine the potential roadblocks your page might face. We’ll see how to avoid them, how to identify if your page was affected, and most importantly, how to recover.

Hurdle 1: Crawling

Before anything else, your page needs to be crawled—that is, downloaded to the search engine's cache. If your page was not crawled yet, you need to fix that.

The easiest way to tell if your page was crawled is to look at your web server log files or a log-based web analytics package. JavaScript-based web analytics (like Google Analytics) can’t help here: search engine robots do not interpret JavaScript, so their visits never trigger the tracking code.
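
As a rough illustration of the log check, here is a minimal Python sketch that counts crawler requests for a given page. It assumes your server writes the common "combined" log format; the crawler names in CRAWLER_TOKENS and the sample log lines are made up for the example, so adjust them to your own logs. Keep in mind that User-Agent strings can be faked, so treat the result as a hint rather than proof.

```python
import re
from collections import Counter

# Combined Log Format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Substrings used to spot crawlers (illustrative and incomplete).
CRAWLER_TOKENS = ("Googlebot", "Slurp", "msnbot")


def crawler_hits(lines, path):
    """Count crawler requests for `path`, keyed by (crawler, HTTP status)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group("path") != path:
            continue
        agent = match.group("agent").lower()
        for token in CRAWLER_TOKENS:
            if token.lower() in agent:
                hits[(token, int(match.group("status")))] += 1
                break
    return hits


# Made-up sample lines; in practice, iterate over your access log file.
sample_log = [
    '66.249.66.1 - - [10/Jul/2007:10:00:00 +0000] "GET /my-page/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
    '203.0.113.9 - - [10/Jul/2007:10:05:00 +0000] "GET /my-page/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (Windows; U) Firefox/2.0"',
    '66.249.66.1 - - [10/Jul/2007:10:09:00 +0000] "GET /old-page/ HTTP/1.1" '
    '404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"',
]

print(dict(crawler_hits(sample_log, "/my-page/")))
```

An empty result for a page means none of the listed crawlers requested it in the logs you scanned. The status code in each key also shows whether the crawler got the page or an error.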

If the search engine robot visited the page, then you know that it was crawled. (This is also very simple to test by checking the search engine's cache of the page.)

If not, then we need to look at the potential reasons:

  1. The crawler was not able to reach the page at the moment of the crawl.

  • The web server must be up. Hire reliable hosting with qualified staff. If your server is down, no one, including search engines, is going to reach it. You can use a watchdog service that automatically restarts the server software if it dies. I used those in the past with mixed results. I prefer to use redundant servers with load balancing for the serious stuff.

  • The web server must return the right status response, “200 OK” (30x is fine if the crawler can reach the address it is redirected to). 40x (client errors such as 404 Not Found) and 50x (server errors) status codes are signs of problems to fix. When I see 404s in my log files, there are usually one or more pages with incorrect links and I fix them. If the broken link is on another site out of my control, I e-mail the webmaster to have it fixed. If I get no response, I add a rewriting rule to make the link point to the right page. Every link counts :-)

  • If the page is dynamic, remove session ids and do not use more than 2 or 3 parameters. My advice is to avoid request parameters completely and use search engine friendly URLs (that means no ? or & in the URL). mod_rewrite is excellent for this. Friendly URLs are also one of the reasons why I love Django and its URLconfs :-) If you cannot change the URLs because they came with your CMS (Content Management System), use XML sitemaps to get them indexed.

  • Lastly, make sure your ISP/hosting service provider is not filtering the crawlers. This sounds silly, but I remember GoDaddy was doing something like this a few years ago.

  2. You are sending the crawler to another domain when the crawler visits (via HTTP 30x redirect). The redirected page will be the one indexed. Remove the redirect.

  3. You purposely or inadvertently blocked the search engine crawler. Proper use of robots.txt and meta robots tags is essential to avoid shooting your site in the foot.

  4. Your page needs to have at least one incoming link to be visited. The more links and the higher the quality, the higher the chances the page will be crawled and indexed. Contrary to some SEOs who advise that you submit your pages to the search engines, I personally think that it is a waste of valuable time. Search engine robots follow links; pages with no links will not be followed, even if you submit them to the search engine every day. If you have some incoming links, make sure they don’t carry the rel="nofollow" attribute. Although nofollow changes its meaning slightly depending on the search engine, you definitely want links without it.

  5. Depth quota. In order to allocate their resources more efficiently, search engines don't crawl every single page if the site is big or has many exact duplicate pages. It’s good to have some quality links going to every page that you deem important in order to improve the odds of more pages being visited.

  6. Not Cached. Your log files say that the crawler downloaded the web page but there is no cache on the search engine. If you are not preventing the cache via the meta robots tag ‘noarchive,’ then it is likely that there are one or more exact duplicates. Search engines can detect exact duplicates with a simple checksum. More advanced duplicate detection, such as near duplicates, is performed at the indexing step.

  7. You purposely or inadvertently removed your page from Google's index. Submit a re-inclusion request.

  8. You purposely or inadvertently engaged in some deceptive or potentially deceptive practice and the search engine banned your page (or site). Detecting this is not trivial, as search engines are not usually public about it. Google can penalize your site with mild, moderate or hard penalties. Basically you need to remove the offending content, links, etc. and send a re-inclusion request explaining to Google that you got greedy and followed bad advice, that you apologize, and that you promise never, ever to do it again. Pinky swear.
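
One of the roadblocks above, accidentally blocking a crawler with robots.txt, is easy to test for yourself. The following is a minimal sketch using the robots.txt parser from Python's standard library. The rules, URLs and crawler names are invented for the example, and real crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt, url, agents):
    """Return the crawlers in `agents` that robots_txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt content; paste in your own file to test it.
robots_txt = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /drafts/
"""

crawlers = ["Googlebot", "Slurp"]
for url in (
    "http://example.com/index.html",
    "http://example.com/private/report.html",
    "http://example.com/drafts/post.html",
):
    print(url, "blocked for:", blocked_agents(robots_txt, url, crawlers))
```

Note that a crawler with its own section follows only that section. In this example Googlebot is not blocked from /private/, which is a common source of surprises.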

Hurdle 2: Indexing

If your page was crawled, the next step is to check if it was indexed. A site:sitename.com search should be your first step, as it lists all your pages that are in the search index. You can also search for your website name, page titles, or unique word combinations that appear only on your web pages.

Assuming your page is not indexed, here are some of the potential problems and solutions:

  1. Page removed. Points #7 and 8 from the crawling issues also apply here. If you remove your pages from the index, they won't be there. If you spam it, it’s gone too. Let me add that you should check Google’s quality guidelines and their recent opposition to massive/automated link exchanges and paid links. Some penalties are automatic but others are manual, as a result of spam reports and the work of the spam monitoring team.

  2. Duplicate content. Search engines don't like to index duplicate content. Make sure your content is original and that no scraper with more links than you is pushing your content out of the index. Search engines can detect exact, as well as near duplicates.

  3. Hijacked pages. If your pages are hijacked via 302 redirects (supposedly this no longer works) or CGI proxies, make sure you follow my instructions in a previous post.

  4. Broken HTML. Search engine crawlers as well as web browsers are very good at fixing most errors produced by hand coding your HTML pages, but not all of them. We already know how smart you are, so just use a WYSIWYG HTML editor already!

  5. No Text (Frames, Flash, JavaScript, Images, etc.). Your pages look beautiful with all these features, but search engines only understand plain text and some document formats (Word, PDF, etc.). Make sure you provide text-only alternatives for search (e.g. the “alt” attribute for images).

  6. Password protected pages. Search engines are not yet bored enough to sit there guessing your password, so leave your important pages unprotected.

  7. Gargantuan pages. If search engines indexed every 10GB-document out there, there wouldn't be space left for the regular pages. Keep your pages at a reasonable size (100k-10MB) and consider supporting gzip encoding in your web server to save bandwidth. As we learned, search engines compress the pages in their cache.

  8. Supplemental page. The page doesn't have enough links and PageRank to deserve a main listing. Supplemental pages are only partially indexed and hence you will only be able to find them with very obscure searches. Improve the links to the page to help it reach the main index.

  9. Partially indexed pages. Your page is listed but only as a URL with no snippet or cache. Google doesn't need to crawl the page in order to return it in the search results. It can use the data it collects from the links pointing to the page. These pages are listed with URL only. Another possibility is that you are forbidding Google from indexing or displaying a snippet via a meta robots tag 'noindex' or ‘nosnippet.’
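
To rule out the meta robots possibility mentioned in the list above, you can scan the page's HTML for the tag. Here is a minimal sketch using Python's standard HTML parser. The sample page is made up, and the sketch only looks at tags named "robots", so it would miss tags addressed to one specific crawler.

```python
from html.parser import HTMLParser

# Directives that keep a page, its snippet or its cached copy out of the results.
BLOCKING = {"noindex", "nosnippet", "noarchive"}


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def blocking_directives(html):
    """Return the blocking meta robots directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives & BLOCKING


# Made-up page for illustration.
page = """
<html><head>
<title>Example</title>
<meta name="ROBOTS" content="NOINDEX, nosnippet, follow">
</head><body>Hello</body></html>
"""

print(sorted(blocking_directives(page)))
```

An empty result means the page itself is not asking to be left out, and you can move on to the other causes.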

Hurdle 3: Searching

We’re in the home stretch. If your page is crawled and indexed (i.e. searchable for phrases that only appear on your page), the next issue to address is whether the page is searchable for the terms that matter to you. I will explain how to obtain high rankings in greater detail in a future post, but for now let's assume that you were already ranking for particular, important terms. If your favorite page has suddenly dropped off the radar, this part is for you.

  1. Index updates. This is the most common cause. As search engines find more pages similar to yours, it is inevitable that some or many of them will have better quality signals and will start ranking above yours. You will need to get more sites to link to your page. This is one of the reasons why SEO is an ongoing battle.

  2. Ranking algorithm updates. Search engines tweak their formulas all the time and as a result some pages will lose their rankings and others will get a leg up. As I explained before, make sure you follow the leaders for your search keywords – the web authorities. Search engines need to make sure those pages’ rankings remain high and it’s wise to follow their steps to have yours maintain their rankings too. For major ranking algorithm updates you will see drastic changes in the results, official announcements, and a lot of noise and complaints in the webmaster forums.

  3. Changes to your relevance profile (visible signals). While search engines deduce your rankings or relevance by studying your page quality signals, the most important ones for the moment are the content on your page and incoming links. Check those for any radical changes. Make sure the sites that link to yours are not banned and are still providing link juice. Check your incoming links' anchor text and check that your page content is still relevant for the searches you want to rank for.

  4. Changes to your relevance profile (invisible signals). We know that Google uses Quality Raters to provide feedback. Other invisible signals are still theoretical, but Google could be using many signs to tell if the page in the search results is providing useful content. The only way to guarantee inclusion is by doing just that.

  5. Penalties. If you violated Google's quality guidelines, this may be the reason for the drop. Depending on the type of violation, they can demote all your incoming links, some of them, or your entire website and all internal pages (you get a PR0 in this case). Evidence of this is the site's rankings dropping significantly, even for obscure words.

  6. Negative SEOs. As explained in a previous post, your competitors might be trying to damage your rankings by forcing your site or page into any of these roadblocks. Actively monitor your site, rankings, etc. to make sure this is not the case. Consider legal options if you find out that this is the case and you can identify the culprit.

The Final Lap

If you find any problems on your pages and you fix them, it is also a good idea to remove the mess from the search engine index as well. Most web servers return 404s for pages that you remove. Unfortunately this tells the search engine robot that the page might be back at a later date. The best way to tell the search engine that the page is not coming back is to configure your web server to return 410s (gone for good) for those URLs. Good ol’ mod_rewrite and mod_alias are excellent for this purpose. Google's planned 'unavailable_after' meta tag will be another good alternative approach.
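
If your pages are served by a Python application rather than by Apache rules, the same 410 idea can be expressed in the application itself. The sketch below is a bare WSGI application, not the mod_rewrite or mod_alias setup mentioned above. The paths in GONE_PATHS are hypothetical placeholders for the URLs you removed.

```python
# Hypothetical paths of pages that were removed for good.
GONE_PATHS = {"/old-promo.html", "/spammy-links.html"}


def application(environ, start_response):
    """Answer 410 Gone for removed pages and 200 OK for everything else."""
    path = environ.get("PATH_INFO", "/")
    if path in GONE_PATHS:
        status = "410 Gone"
        body = b"This page has been removed permanently.\n"
    else:
        status = "200 OK"
        body = b"Page content goes here.\n"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]


def status_for(path):
    """Call the application directly and return the status line it sends."""
    captured = {}

    def fake_start_response(status, headers):
        captured["status"] = status

    application({"PATH_INFO": path}, fake_start_response)
    return captured["status"]


print(status_for("/old-promo.html"))
print(status_for("/index.html"))
```

For a quick local try-out, the application can be served with wsgiref.simple_server.make_server from the standard library. On a real site you would more likely keep the list of removed URLs in your web server configuration.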

For Google you can also use the URL removal tool. You need to use this with care to avoid any potential problems. As I explained before, don't forget to choose a canonical URL and install rewrite rules to make sure all your incoming links land in the right place. Check your robots.txt file periodically and monitor your site via the Google Webmaster Central console and your log files.
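
To make the canonical URL advice concrete, here is a minimal Python sketch of the decision a rewrite rule has to make: given a requested URL, work out the one canonical form and redirect there if the two differ. The host name in CANONICAL_HOST and the parameter names in SESSION_PARAMS are assumptions for the example. Pick whatever matches your site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumed preferred host name
SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}  # assumed session id names


def canonical_url(url):
    """Return url with the canonical host and without session id parameters."""
    parts = urlsplit(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(
        (parts.scheme or "http", CANONICAL_HOST, parts.path or "/",
         urlencode(query), "")
    )


def redirect_target(url):
    """Return the URL to send a 301 redirect to, or None if url is canonical."""
    target = canonical_url(url)
    return None if target == url else target


print(redirect_target("http://example.com/page?sid=abc123&id=7"))
print(redirect_target("http://www.example.com/page?id=7"))
```

Whatever mechanism you use, the point is that every variant of a URL ends up at a single address, so all your incoming links count toward the same page.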

Going for Gold

One last tip I highly recommend is to monitor your searches, find out what people are looking for, and where they are landing. High bounce rates are a clear indication that they are not finding what they are looking for. Make sure they do the next time they search. It’s a win-win situation: great for you and great for them.
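
If your analytics package does not report this directly, the arithmetic is simple. The sketch below assumes you can export visits as (search phrase, landing page, pages viewed) records and treats a visit with a single page view as a bounce. Both the record layout and the sample data are made up for illustration.

```python
from collections import defaultdict


def bounce_rates(visits):
    """Return {(search phrase, landing page): share of single-page visits}."""
    totals = defaultdict(int)
    bounces = defaultdict(int)
    for phrase, landing_page, pages_viewed in visits:
        key = (phrase, landing_page)
        totals[key] += 1
        if pages_viewed <= 1:
            bounces[key] += 1
    return {key: bounces[key] / totals[key] for key in totals}


# Made-up visits: (search phrase, landing page, pages viewed).
visits = [
    ("seo tips", "/seo-tips/", 1),
    ("seo tips", "/seo-tips/", 1),
    ("seo tips", "/seo-tips/", 3),
    ("seo consultant", "/contact/", 2),
]

for key, rate in sorted(bounce_rates(visits).items()):
    print(key, round(rate, 2))
```

Pairs with a high rate are the searches where visitors are not finding what they came for, so those landing pages are the first ones to review.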

I have tried to provide tips general enough so that you can attack the obstacles for all the major search engines. If your primary concern is Google, Google Webmaster Central provides most of the information you need to identify and address many of the problems.

As usual, I welcome any additions that I missed in the comments section.

3 replies
  1. Jez
    Jez says:

    Well you got the post title right, that was serious effort!
    I don't think you missed anything ;-)
    It will be really interesting to see how you pull all of this stuff together in real world case studies.
    I thought the analysis you did on John Chows site a while back was very interesting….
