Posts

Writing for People (and Search Engines): How to improve click-through rates for organic listings

Another new year has come and many of us are still analyzing the balance of successes and failures of the previous one. It is definitely a useful chore. I am happy to count this blog as one of my successes. It was humbling to see it included in SearchEngineLand’s blogroll and nominated for Best SEO Research blog—I voted for Bill’s and I am glad he won the title :-)—among other accomplishments. Thanks to everyone for the recognition!

On the other hand, last year I had more goals that I didn’t quite reach than ones that I did, although I suppose that puts me in good company. :-) I like to start each year by revisiting the unachieved goals, the uncompleted projects, the planned-but-not-executed things I call my missed opportunities. A common one (and I am sure many of my peers have experienced the same) is maximizing the number of clicks I get from organic listings. The problem, as many of you might be asking yourselves, is how to measure the organic click-through rate in the first place! Read on to learn how…. Read more
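As a taste of the arithmetic involved, here is a minimal sketch of one way to estimate organic CTR. This is not necessarily the approach described in the full post; the numbers and the way impressions are estimated are hypothetical placeholders.

```python
# Hypothetical illustration: estimating organic click-through rate.
# The inputs are made-up placeholders, not real data.

def estimate_organic_ctr(organic_clicks, estimated_impressions):
    """CTR = clicks received / times the listing was (estimated to be) shown."""
    if estimated_impressions == 0:
        return 0.0
    return organic_clicks / estimated_impressions

# organic_clicks: search-referred visits from your analytics package.
# estimated_impressions: e.g. monthly search volume for the keyword,
# scaled by how often your listing actually appears for it.
print(estimate_organic_ctr(organic_clicks=420, estimated_impressions=12000))  # 0.035, i.e. 3.5%
```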

Google: The New Market Gorilla

Every company, big or small, faces unfavorable market conditions at some point in its trajectory. The common sense thing to do is to try to adapt—modify the business strategy to survive and continue thriving. Unfortunately some companies, especially big and successful ones like Google or Microsoft, are stubborn and prefer that the market adapt to them. It really is difficult to hit the ‘Back’ button, throw away what you’ve built, and try something completely new. It is far easier—at least it seems so at first—to create publicity designed to adapt the market to your own needs.

The problem, for both Microsoft and Google, is that it rarely works. Let me show you why. Read more

Google brings back the Search API … Sort of

I found an interesting bit of information that has been missed by most of the SEO community. Just as quietly as Google dropped the Google Search API at the end of last year, they have decided to bring it back—but only to the research community.

It’s now called the University Research Program for Search and brings with it the following limitations:

  • The research program is open to faculty members and their research teams at colleges and universities, by registration only.
  • The program may be used exclusively for academic research, and research results must be made available to the public.
  • The program must not be used to display or retrieve interactive search results for end users.
  • The program may be used only by registered researchers and their teams, and access may not be shared with others.

Getting the information you need

As an advanced SEO you are no doubt aware that in order to test many ideas, theories and routines, you need to create custom tools and scripts that automate most of the work for you. The most crucial information resides with the search engines themselves, and gaining access to it is critical. Read more
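To give a concrete (and heavily simplified) idea of what such a script might look like, here is a sketch that fetches a results page and extracts the listing URLs. The endpoint and the CSS selector are hypothetical placeholders, not a real search engine API; adapt them to whatever data source you use and respect its terms of service.

```python
# Minimal sketch of an SEO research script: fetch a results page and
# pull out the listing URLs. The endpoint and the "a.result" selector
# are hypothetical placeholders, not a real search engine API.
import requests
from bs4 import BeautifulSoup

def fetch_listings(query, endpoint="https://search.example.test/search"):
    resp = requests.get(endpoint, params={"q": query}, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Assumed markup: each organic listing is an <a class="result"> element.
    return [a["href"] for a in soup.select("a.result") if a.get("href")]

if __name__ == "__main__":
    for url in fetch_listings("duplicate content filter"):
        print(url)
```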

Content is King, but Duplicate Content is a Royal Pain.

Duplicate content is one of the most common causes of concern among webmasters. We work hard to provide original and useful content, and all it takes is a malicious SERP (Search Engine Results Page) hijacker to copy our content and pass it off as his or her own. Not nice.

More troubling still is the way Google handles the issue. As my previous post about cgi hijacking made clear, the main problem with hijacking and content scraping is that search engines cannot reliably determine who owns the content and, therefore, which page should stay in the index. When faced with multiple pages that have exactly the same or nearly the same content, Google's filters flag them as duplicates. Google's usual course of action is that only one of the pages — the one with the higher PageRank — makes it into the index. The rest are tossed out. Unless there is enough evidence that the owner or owners are trying to do something manipulative, there is no need to worry about penalties.
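To make the idea of a duplicate filter more concrete, here is a toy sketch of one common technique, shingle overlap. It is not Google's actual filter, just an illustration of how near-identical pages can be grouped so that only the "strongest" copy survives (an arbitrary score stands in for PageRank here).

```python
# Toy near-duplicate filter using word shingles and Jaccard similarity.
# Illustration of the general technique only -- not Google's filter.

def shingles(text, k=4):
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_duplicates(pages, threshold=0.9):
    """pages: list of (url, text, score) tuples; 'score' stands in for PageRank.
    Keeps only the highest-scoring copy of each near-duplicate group."""
    kept = []
    for url, text, score in sorted(pages, key=lambda p: p[2], reverse=True):
        sig = shingles(text)
        if all(jaccard(sig, shingles(kept_text)) < threshold for _, kept_text, _ in kept):
            kept.append((url, text, score))
    return kept
```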

Recently, regular reader Jez asked me a thought-provoking question. I'm paraphrasing here, but essentially he wanted to know: "Why doesn’t Google consider the age of the content to determine the original author?” I responded that the task is not as trivial as it may seem at first, and I promised a more thorough explanation. Here it is. Read more

Why Quality Always Wins Out in Google's Eyes

Quality is king in search engine rankings. Of course spam sites using the latest techniques make their way up top—but their rankings are temporary, fleeting, and quickly forgotten. Quality sites are the only ones that maintain consistent top rankings.

This wasn’t always true. A few years ago it was very easy to rank highly for competitive search terms. I know that for a fact because I was able to build my company from scratch using nothing but thin affiliate sites. Since then things have changed drastically: at least on Google, it is increasingly difficult for sites without real content to rank highly, no matter how many backlinks they get.

Why is this?

Search engines use “quality signals” to rank websites automatically. Traditionally, those signals were the things you might expect: the presence of the search terms in the body of the web page and in the anchor text of links coming from other pages. There is increasing evidence that Google is looking at other quality signals to deduce the relevance of websites.
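As a rough illustration of those traditional signals, here is a toy scoring function that counts query terms in the page body and in the anchor text of incoming links, nothing more. The weights are arbitrary; real engines blend many more signals, which is exactly the point of the list below.

```python
# Toy relevance score using only the two "traditional" signals mentioned above:
# query terms in the page body and in the anchor text of incoming links.
# The weights are arbitrary and purely illustrative.

def toy_score(query, body_text, incoming_anchor_texts, anchor_weight=2.0):
    terms = query.lower().split()
    body_words = body_text.lower().split()
    body_hits = sum(body_words.count(t) for t in terms)
    anchor_hits = sum(
        anchor.lower().split().count(t)
        for anchor in incoming_anchor_texts
        for t in terms
    )
    return body_hits + anchor_weight * anchor_hits

print(toy_score("buy blue widgets",
                "We sell blue widgets of every size and color.",
                ["blue widgets", "widget shop"]))  # 2 body hits + 2 anchor hits -> 6.0
```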

Here are some of the so-called quality signals Google might be using to provide better results: Read more

Anatomy of a Distributed Web Spider — Google's inner workings part 3

What can you do to make life easier for those search engine crawlers? Let's pick up where we left off in our inner workings of Google series. I am going to give a brief overview of how distributed crawling works. This topic is useful but can be a bit geeky, so I'm going to offer you a prize at the end of the post. Keep reading; I am sure you will like it. (Spoiler: it's a very useful script.)

 

4.3 Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
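To make the quoted description a bit more tangible, here is a minimal sketch of the same ideas: many fetches in flight at once, handled by asynchronous IO, with DNS results cached, using Python's asyncio and aiohttp. It is nowhere near Google's crawler in scale or robustness; it only illustrates the technique.

```python
# Minimal asynchronous crawler sketch: many connections in flight, DNS
# results cached, each fetch moving through its states (DNS lookup, connect,
# request, response) under an event loop. Illustration only.
import asyncio
import aiohttp

async def fetch(session, url, sem):
    async with sem:  # cap the number of concurrent connections
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                body = await resp.text()
                return url, resp.status, len(body)
        except Exception as exc:
            return url, None, str(exc)

async def crawl(urls, max_connections=300):
    sem = asyncio.Semaphore(max_connections)
    connector = aiohttp.TCPConnector(limit=max_connections, ttl_dns_cache=300)
    async with aiohttp.ClientSession(connector=connector) as session:
        return await asyncio.gather(*(fetch(session, u, sem) for u in urls))

if __name__ == "__main__":
    seeds = ["https://example.com/", "https://example.org/"]
    for url, status, info in asyncio.run(crawl(seeds)):
        print(url, status, info)
```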

If that was when they started, can you imagine how massive it is now? Read more

The Never-Ending SERPs Hijacking Problem: Is there a definitive solution?

In 2005 it was the infamous 302 (temporary redirect) page hijacking. That was supposedly fixed, according to Matt Cutts. Now there is an interesting new twist. Hijackers have found another exploitable hole in Google: the use of cgi proxies to hijack search engine rankings.

The problem is basically the same: two URLs pointing to the same content. Google's duplicate content filters kick in and drop one of the URLs, normally the one with the lower PageRank. That is Google's core problem: they need a better way to identify the original author of a page.

When someone blatantly copies your content and hosts it on their site, you can take the offending page down by sending a DMCA complaint to Google, et al. The problem with 302 redirects and cgi proxies is that there is no content being copied. They are simply tricking the search engine into believing there are multiple URLs hosting the same content.

What is a cgi proxy anyway? Glad you asked. I love explaining technical things :-) Read more
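In the meantime, here is a heavily stripped-down sketch of what a cgi proxy boils down to: a script that fetches whatever URL it is handed and serves the content back under its own address, which is exactly how a second URL ends up appearing to host your page. The script name and parameter are hypothetical.

```python
#!/usr/bin/env python3
# Bare-bones cgi proxy sketch, e.g. /cgi-bin/proxy.py?url=http://example.com/page
# It serves someone else's page under this script's own URL -- which is the
# mechanism the hijackers abuse. Shown only to illustrate the concept.
import os
import sys
import urllib.parse
import urllib.request

query = urllib.parse.parse_qs(os.environ.get("QUERY_STRING", ""))
target = query.get("url", [""])[0]

sys.stdout.write("Content-Type: text/html\r\n\r\n")
if target.startswith(("http://", "https://")):
    with urllib.request.urlopen(target, timeout=10) as resp:
        sys.stdout.write(resp.read().decode("utf-8", errors="replace"))
else:
    sys.stdout.write("<p>No valid url parameter supplied.</p>")
```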

Google's Architectural Overview (Illustrated) — Google's inner workings part 2

For this installment of my Google's inner workings series, I decided to revisit my previous explanation. However, this time I am including some nice illustrations so that both technical and non-technical readers can benefit from the information. At the end of the post, I will provide some practical tips to help you improve your site rankings based on the information given here.

To start the high level overview, let's see how the crawling (downloading of pages) was described originally.

Google uses a distributed crawling system. That is, several computers/servers running clever software download pages from across the entire web. Read more
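As a rough sketch of what "distributed" means here, the snippet below assigns each URL to one of several crawler machines by hashing its hostname, so a given site is always handled by the same crawler. The machine names are hypothetical placeholders.

```python
# Sketch of distributing crawl work: hash the hostname to pick which
# crawler machine owns a URL, so each site is always fetched by the same
# crawler. The machine names are hypothetical placeholders.
import hashlib
from urllib.parse import urlsplit

CRAWLERS = ["crawler-01", "crawler-02", "crawler-03"]

def assign_crawler(url):
    host = urlsplit(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return CRAWLERS[int(digest, 16) % len(CRAWLERS)]

for u in ["http://example.com/a", "http://example.com/b", "http://example.org/"]:
    print(u, "->", assign_crawler(u))
```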

Google's architectural overview — an introduction to Google's inner workings

Google keeps tweaking its search engine, and now it is more important than ever to better understand its inner workings.

Google lured Mr. Manber from Amazon last year. When he arrived and began to look inside the company’s black boxes, he says, he was surprised that Google’s methods were so far ahead of those of academic researchers and corporate rivals.

While Google closely guards its secret sauce, for obvious reasons, it is possible to build a pretty solid picture of Google's engine. To do this, we are going to start by carefully dissecting Google's original engine: how Google was conceived back in 1998. Although a newborn baby, it had all the basic elements it needed to survive in the web world.
Read more

Protecting your privacy from Google with Squid and FoxyProxy

There is no doubt about it; this has definitely been Google’s Privacy Week. Relevant news:

  • The infamous Privacy International report (it basically says that Google sucks at privacy, far more than Microsoft)
  • Privacy International’s open letter to Google
  • Danny Sullivan defending Google
  • Matt Cutts defending his employer
  • Google’s official response (PDF letter)
  • Google Video flaw exposes user credentials

It’s only human nature to defend ourselves (and those close to us) when we are under public scrutiny. I am not surprised to see Matt or Danny stand behind Google on this matter. I do think it is far wiser and more beneficial to look into criticism and determine for ourselves what we can do to remedy it. I am glad to see that Google took this approach in their official response:

After considering the Working Party’s concerns, we are announcing a new policy: to anonymize our search server logs after 18 months, rather than the previously-established period of 18 to 24 months. We believe that we can still address our legitimate interests in security, innovation and anti-fraud efforts with this shorter period … We are considering the Working Party’s concerns regarding cookie expiration periods, and we are exploring ways to redesign cookies and to reduce their expiration without artificially forcing users to re-enter basic preferences such as language preference. We plan to make an announcement about privacy improvements for our cookies in the coming months.

You can take any side you want, but I feel that none of the people covering this topic has addressed two critical issues:

1) How do you opt-out of data collection by Google or other search engines at will?

2) And, do you want to wait 18 months for your data to be anonymized? Read more