Sphinn Doctor: Adding Sphinn It! (with Sphinn counts) to your feed and website posts

You have probably read about Sphinn – the Digg-like social media site for search engine marketers. Almost every SEO/SEM blog has talked about it. If you haven't, Rand's post is an excellent introduction.

Instead of trying to explain why it is important to get on the Sphinn home page – that is covered in other blogs – I will focus on how to make your posts more “sphinnable” by adding a Sphinn It! link to the end of all your posts. You can do that by using FeedBurner FeedFlares.

In my previous posts, both on the website and in the feed, you have probably seen something like this:

[Screenshot: the FeedFlare links that appear at the end of each post]

Read more
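Before you click through, here is the rough shape of what a dynamic FeedFlare involves: FeedBurner requests a small script with the post's URL, and the script answers with a tiny piece of XML describing the link text and target. The sketch below is only an illustration of that handshake, not the code from this post; the XML element names follow the FeedFlare format as I recall it, and the Sphinn submission endpoint and the count lookup are placeholders you will want to verify.

```python
#!/usr/bin/env python3
# Illustrative sketch of a dynamic FeedFlare responder, not the code from
# this post. FeedBurner requests the script with the post URL and expects
# a small XML fragment back describing the flare link.
# Assumptions to verify: the <FeedFlare>/<Text>/<Link> response format,
# the Sphinn submit URL, and get_sphinn_count() are all placeholders.

import os
from urllib.parse import parse_qs, quote
from xml.sax.saxutils import escape, quoteattr

def get_sphinn_count(post_url):
    """Placeholder: look up how many sphinns the story has (and cache it)."""
    return 0

def flare_xml(post_url):
    count = get_sphinn_count(post_url)
    text = "Sphinn It!" if count == 0 else "Sphinn It! (%d sphinns)" % count
    submit_url = "http://sphinn.com/submit.php?url=" + quote(post_url, safe="")
    return ("<FeedFlare>\n"
            "  <Text>%s</Text>\n"
            "  <Link href=%s/>\n"
            "</FeedFlare>" % (escape(text), quoteattr(submit_url)))

if __name__ == "__main__":
    # Minimal CGI-style handling: FeedBurner passes the post URL as ?url=...
    params = parse_qs(os.environ.get("QUERY_STRING", ""))
    post_url = params.get("url", [""])[0]
    print("Content-Type: text/xml")
    print()
    print(flare_xml(post_url))
```

You would then point FeedBurner at wherever the script is hosted as a personal flare, and it takes care of rendering the link under each post.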

Content is King, but Duplicate Content is a Royal Pain.

Duplicate content is one of the most common causes of concern among webmasters. We work hard to provide original and useful content, and all it takes is a malicious SERP (Search Engine Results Page) hijacker to copy our content and pass it off as his or her own. Not nice.

More troubling still is the way Google handles the issue. As I explained in my previous post about CGI hijacking, the main problem with hijacking and content scraping is that search engines cannot reliably determine who owns the content and, therefore, which page should stay in the index. When faced with multiple pages that have exactly the same or nearly the same content, Google's filters flag them as duplicates. Google's usual course of action is that only one of the pages, typically the one with the higher PageRank, makes it into the index; the rest are tossed out. Unless there is enough evidence to show that the owner or owners are trying to do something manipulative, there is no need to worry about penalties.

Recently, regular reader Jez asked me a thought-provoking question. I'm paraphrasing here, but essentially he wanted to know: “Why doesn’t Google consider the age of the content to determine the original author?” I responded that the task is not as trivial as it may seem at first, and I promised a more thorough explanation. Here it is. Read more
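As an aside on the filtering step mentioned above: one common way for a system to decide that two pages are near-duplicates (not necessarily the way Google does it) is to compare overlapping word “shingles” and measure how much they share. A toy sketch:

```python
# Toy sketch of near-duplicate detection via word shingles and Jaccard
# similarity -- one common technique, not a claim about Google's filters.

def shingles(text, size=5):
    """Return the set of overlapping word n-grams ("shingles") in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + size]) for i in range(max(len(words) - size + 1, 1))}

def similarity(text_a, text_b, size=5):
    """Jaccard similarity of the two pages' shingle sets (0.0 to 1.0)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    return len(a & b) / len(a | b) if a | b else 0.0

original = "Duplicate content is one of the most common causes of concern among webmasters."
scraped  = "Duplicate content is one of the most common causes of concern among site owners."

if similarity(original, scraped) > 0.5:   # the threshold here is arbitrary
    print("Pages look like near-duplicates; only one is likely to be kept.")
```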

Canonicalization: The Gospel of HTTP 301

Usually I don’t cover basic material in this blog, but since a loyal reader, Paul Montwill, requested it, I’m happy to oblige. As I learned back in school, if one person asks a question, there are probably many others at the back of the class quietly wondering the same thing. So here is a brief explanation of web server redirects and how to use them to solve URL canonicalization issues.

And just what is that ecclesiastic-sounding word “canonicalization”? It was Matt Cutts, not the Pope, who made it famous when he used the term to describe a certain issue that popped up at Google. Here is the problem. All of us have URLs like these:

1) sitename.com/

2) sitename.com/index.html

3) www.sitename.com 

4) www.sitename.com/index.html

You know they are all the same page. I know they are all the same page. But computers, unfortunately, aren't on the same page. They aren't that smart; they need to be told that each one of these addresses represents the same page. One way to do that is to pick one of them and use it consistently in all your own linking. The harder part, however, is getting the other website owners who link to you to do the same. Some might use one, others another, and a few are bound to choose a third.

The best way to solve this is to pick one URL and have your web server automatically force all requests for other variations to go to the one you picked. We can use HTTP redirects to accomplish this. Read more
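As a taste of what “force all requests to the one you picked” looks like in practice, here is a minimal sketch written as a tiny Python WSGI app. It is for illustration only, since most sites would do this in the web server configuration itself, and www.sitename.com stands in for whatever canonical host you choose.

```python
# Minimal sketch of URL canonicalization with an HTTP 301 redirect, written
# as a tiny Python WSGI app for illustration; most sites would do this in
# the web server configuration instead. "www.sitename.com" is a placeholder.

from wsgiref.simple_server import make_server

CANONICAL_HOST = "www.sitename.com"

def app(environ, start_response):
    host = environ.get("HTTP_HOST", "").split(":")[0]
    path = environ.get("PATH_INFO", "/") or "/"

    # Collapse the common variations: non-canonical host, or explicit /index.html.
    canonical_path = "/" if path in ("/index.html", "/index.htm") else path
    if host != CANONICAL_HOST or canonical_path != path:
        location = "http://%s%s" % (CANONICAL_HOST, canonical_path)
        start_response("301 Moved Permanently", [("Location", location)])
        return [b""]

    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"You are on the canonical URL.\n"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```

The point is simply that every non-canonical variation gets a permanent (301) redirect to the one address you chose, so both visitors and search engines end up counting a single page.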

Our Digital Footprints: Google's (and Microsoft’s) most valuable asset

After reading this intriguing article in the LA Times, I came to the conclusion that Google has far more ambitious plans than I originally thought. In their effort to build the perfect search engine, an oracle that can answer all of our questions (even ones about ourselves that we couldn't answer), Google is collecting every single digital footprint we leave online. They can afford to provide all their services for free; after all, our digital footprints are far more valuable.

What exactly are digital footprints, and how does Google get them? Imagine each one of Google’s offerings as a surveillance unit. Each service has a dual purpose: first, to provide a useful service for “free,” and second, to collect as much information about us as possible. Consider these few examples: Read more

You’ve Won the Battle but not the War: 10 Ways to protect your site from negative SEO

Last month there was an interesting article in Forbes about search engine marketing saboteurs. These so-called “SEO professionals” proudly proclaim that their job is to damage the hard-earned rankings of their clients’ competitors. I understand that a lot of people would do anything for money, but it’s still unsettling to see them trumpet their efforts with such gusto. A huge thumbs down to all those mentioned in the article.

Earning high search engine rankings is challenging enough; now we need to work twice as hard to protect those rankings once we earn them. The Forbes article lists seven ways you can damage someone else's website, and I can think of three more. But instead of adding more wood to the negative SEO fire, I’ve decided to create a list of things you can do to detect and prevent these kinds of attacks and protect your rankings.

Here are Hamlet’s countermeasures. (You may want to read the Forbes article first to better understand the terms.) Read more

Why Quality Always Wins Out in Google's Eyes

Quality is king in search engine rankings. Of course, spam sites using the latest techniques still make their way to the top, but their rankings are temporary, fleeting, and quickly forgotten. Quality sites are the only ones that maintain consistent top rankings.

This wasn’t always true. A few years ago it was very easy to rank highly for competitive search terms. I know that for a fact because I was able to build my company from scratch just by using thin affiliate sites. Everybody knows that there has been a drastic change since then. At least on Google, it is increasingly difficult for sites without real content to rank highly, no matter how many backlinks they get.

Why is this?

Search engines use “quality signals” to rank websites automatically. Traditionally, those signals were things you might expect: the presence of the search terms in the body of the web page and in the text of the links coming from other pages. There is increasing evidence that Google is looking at other quality signals to determine the relevance of websites.

Here are some of the so-called quality signals Google might be using to provide better results: Read more

Out of the Supplemental Index and into the Fire

Do you have pages in Google's supplemental index? Get 'em out of there!

Matt Cutts of Google doesn't think SEOs and website owners should be overly concerned about having pages in the supplemental index. He has some pages in the supplemental index, too.

As a reminder, supplemental results aren’t something to be afraid of; I’ve got pages from my site in the supplemental results, for example. A complete software rewrite of the infrastructure for supplemental results launched in Summer o’ 2005, and the supplemental results continue to get fresher. Having urls in the supplemental results doesn’t mean that you have some sort of penalty at all; the main determinant of whether a url is in our main web index or in the supplemental index is PageRank. If you used to have pages in our main web index and now they’re in the supplemental results, a good hypothesis is that we might not be counting links to your pages with the same weight as we have in the past. The approach I’d recommend in that case is to use solid white-hat SEO to get high-quality links (e.g. editorially given by other sites on the basis of merit).

Google is even considering removing the supplemental result tag, which means there won't be any way for us to tell whether a page is supplemental. Read more

Popularity Contest: How to reveal your invisible PageRank

Let's face it: we all like to check the Google PageRank bar to see how important websites, especially our own, are to Google. It tells us how cool and popular our site is.

For those of us who are popularity-obsessed, the sad part is that the other search engines do not provide a similar feature, and Google's visible PageRank is updated only about every three months (the real PageRank is invisible). This blog is two months old and doesn't have a visible PageRank yet, but I get referrals from many long-tail searches, so it must already have some PageRank.
How can you tell what your PageRank is without waiting for the public update? Keep reading to learn this useful tip. The technique is not bulletproof, but by studying how frequently your pages are indexed you can get a rough estimate of your invisible PageRank, and of how important your pages are to the other search engines as well. Read more
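To give a feel for the kind of measurement the tip relies on, here is a small sketch (not the method spelled out after the jump, just an illustration) that counts Googlebot visits per URL in an Apache-style access log. The log path and the user-agent check are assumptions you would adapt to your own server.

```python
# Rough sketch: count how often Googlebot fetches each URL, as a proxy for
# how interesting (high-PageRank) the crawler thinks the page is.
# Assumes a combined-format Apache access log at ACCESS_LOG; adjust as needed.

import re
from collections import Counter

ACCESS_LOG = "access.log"   # placeholder path

# combined log format: IP - - [date] "GET /path HTTP/1.1" status size "ref" "UA"
LINE_RE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" \d+ \S+ "[^"]*" "([^"]*)"')

hits = Counter()
with open(ACCESS_LOG) as log:
    for line in log:
        match = LINE_RE.search(line)
        if match and "Googlebot" in match.group(2):
            hits[match.group(1)] += 1

for url, count in hits.most_common(20):
    print("%5d  %s" % (count, url))
```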

Becoming a Web Authority: The story of Sally and Edward

Getting permanent search engine rankings for your site requires making it very popular within its specific niche; that is what we call a web authority. Other site owners will naturally link to your site when they are talking about your topic because you have some of the best content out there.

It's common sense that if you want your site to become popular, you'll seek advice from those who have already made it, assuming they are willing to share how they did it. I personally like to read Neil Patel's blog; he is one of the few SEO celebrities who openly shares how he made it to the top.
Most advanced SEOs carefully study the pages that rank highly for the keywords they are targeting. If you can understand how those sites got there, you can apply the same techniques to your own site. Unfortunately, as I explained previously, not all sites that rank highly will remain there. These sites are like a house of cards, destined to be brought down by the next gust of wind. If you build with the strategies of such cardboard castles, yours will come down too. Let me illustrate my point with this story about Sally and Edward.
Read more

Anatomy of a Distributed Web Spider — Google's inner workings part 3

What can you do to make life easier for the search engine crawlers? Let's pick up where we left off in our inner workings of Google series. I am going to give a brief overview of how distributed crawling works. This topic is useful but can be a bit geeky, so I'm going to offer you a prize at the end of the post. Keep reading; I am sure you will like it. (Spoiler: it's a very useful script.)

 

4.3 Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.

If that was the system when they started, can you imagine how massive it is now? Read more
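To make the quoted description a little more concrete, here is a toy sketch of the same pattern: one list of URLs to hand out and many fetches in flight at once. It is nothing like Google's crawler, of course; it uses a simple thread pool and the Python standard library instead of asynchronous IO and a DNS cache, and it leaves link extraction as a comment, so treat it as a diagram in code rather than a real spider (and definitely not the script promised above).

```python
# Toy illustration of the crawling pattern described above: one component
# hands out URLs and many fetches are in flight at the same time. Google's
# crawler used asynchronous IO and a per-crawler DNS cache; this sketch just
# uses a thread pool and the standard library.

from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.request import urlopen

SEED_URLS = ["http://example.com/"]   # placeholder seed list
MAX_PAGES = 50
WORKERS = 20                          # stand-in for "hundreds of connections"

def fetch(url):
    """Download one page; a real crawler would also respect robots.txt."""
    with urlopen(url, timeout=10) as response:
        return url, response.read()

def crawl(seeds):
    seen, queue, fetched = set(seeds), list(seeds), 0
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        while queue and fetched < MAX_PAGES:
            batch, queue = queue[:WORKERS], queue[WORKERS:]
            for future in as_completed(pool.submit(fetch, u) for u in batch):
                try:
                    url, body = future.result()
                except OSError:
                    continue  # DNS failures, timeouts, HTTP errors, etc.
                fetched += 1
                print("fetched %s (%d bytes)" % (url, len(body)))
                # A real spider would parse links out of `body` here and add
                # the unseen ones to `queue` and `seen`.

if __name__ == "__main__":
    crawl(SEED_URLS)
```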