Posts

LinkingHood v0.2 – Find all your supplemental pages with ease

As is well known by now, Google decided to remove the supplemental label from the pages it adds to its supplemental index. That is unfortunate, because pages labeled this way need some “link lovin’.” How are we going to give those pages the love they need if we can't identify them in the first place?

In this post, I want to take a look at some of the alternatives we have for identifying supplemental pages. WebmasterWorld has exposed a new query that displays such results, but nobody knows how long it is going to last. Type this into your Google search box: site:hamletbatista.com/& and you'll see my supplemental pages. I tested it before and after Google removed the label, and I get the same pages. Read more

Watch out, Feedburner's numbers are woefully inaccurate! … but why?

This was Rand's response to a comment I made about his confirmation of Aaron's claim that an RSS subscriber is worth 1000 links.

Here is my comment:

Wed (6/27/07) at 07:38 AM

Very useful links. I really like the Adwords tip.

An RSS Subscriber is Worth a Thousand Links – well said, Aaron, and very true (though I'd say, rather, 250 or 300)

I think it all depends on the quality of the links, the content on your blog, and your audience.

I checked some A-list blogs to compare subscriber counts and inbound links:

SEOmoz: 13,109 subscribers, 998,000 links, 76.13 links per subscriber
Problogger: 25,579 subscribers, 543,000 links, 21.23 links per subscriber
Copyblogger: 19,083 subscribers, 196,000 links, 10.27 links per subscriber
Shoemoney: 9,737 subscribers, 127,000 links, 13.04 links per subscriber
John Chow: 5,818 subscribers, 127,000 links, 21.83 links per subscriber

The gap doesn't seem to be so big.
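For what it's worth, here is a tiny Python sketch that reproduces the ratio calculation, using the subscriber and link counts exactly as quoted above:

```python
# Links-per-subscriber ratios, using the figures quoted above.
blogs = {
    "SEOmoz": (13109, 998000),
    "Problogger": (25579, 543000),
    "Copyblogger": (19083, 196000),
    "Shoemoney": (9737, 127000),
    "John Chow": (5818, 127000),
}

for name, (subscribers, links) in blogs.items():
    print(f"{name}: {links / subscribers:.2f} links per subscriber")
```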

What intrigued me was that bloggers are losing confidence in Feedburner's ability to accurately count RSS subscribers. I noticed, especially on SEOmoz, that RSS subscriber numbers jump up and down drastically, usually during weekends.

We all like to see our reader stats, subscriber counts, and traffic as a measure of whether we are doing things right or wrong. When WordPress.com dropped the RSS stats tab, it motivated me to host my blog on this server. I am glad they did, as I have a lot more flexibility now. I will write a post with more details on the move soon.

I decided to dig deep for clues as to how Feedburner assesses the subscriber count. I had the feeling they were measuring hits to the RSS pages. But how they account for hits coming from aggregator services like Bloglines, Google Reader, etc. was the question.
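One clue, and this is just my working assumption rather than anything Feedburner has documented: polling services such as Bloglines and Google's Feedfetcher advertise how many of their users subscribe to a feed right in the User-Agent string of each request. Here is a rough Python sketch of how one could tally those self-reported counts from a raw access log (the regex and the sample User-Agent strings are illustrative only):

```python
import re

# Aggregators typically poll a feed on a schedule and report their reader
# count in the User-Agent, e.g. (formats are illustrative):
#   "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 3 subscribers; ...)"
#   "Bloglines/3.1 (http://www.bloglines.com; 12 subscribers)"
# This only sketches a way to tally those self-reported counts from a raw
# access log; it is NOT FeedBurner's published algorithm.
SUBSCRIBER_RE = re.compile(r'"([^"]*?(\d+) subscribers[^"]*)"')

def aggregator_subscribers(log_path):
    totals = {}
    with open(log_path) as log:
        for line in log:
            match = SUBSCRIBER_RE.search(line)
            if match:
                user_agent, count = match.group(1), int(match.group(2))
                # The reported number is a snapshot, so keep the latest value
                # seen for each aggregator rather than summing every hit.
                agent = user_agent.split(";")[0].split("/")[0].strip()
                totals[agent] = count
    return totals

# Direct readers (browsers and desktop aggregators that don't report a count)
# would still have to be estimated separately, e.g. from unique IPs hitting
# the feed URL each day.
```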

How does Feedburner estimate the number of RSS readers?
Read more

Log based link analysis for improved PageRank

While top website analytics packages offer pretty much anything you might need to find actionable data to improve your site, there are situations where we need to dig deeper to identify vital information.

One such situation came to light in a post by randfish of SEOmoz.org. He writes about a problem with most enterprise-size websites: they have many pages with few or no incoming links, and fewer pages that get a lot of incoming links. He later discusses some approaches to alleviate the problem, suggesting primarily linking to the link-poor pages from the link-rich ones manually, or restructuring the website. I commented that this is a practical situation where one would want to use automation.

Log files are a goldmine of information about your website: links, clicks, search terms, errors, etc. In this case, they can be of great use to identify the pages that are getting a lot of links and the ones that are getting very few. We can later use this information to link from the rich to the poor by manual or automated means.

Here is a brief explanation on how this can be done.

Here is an actual log entry from my site tripscan.com in the extended log format:

64.246.161.30 - - [29/May/2007:13:12:26 -0400] "GET /favicon.ico HTTP/1.1" 206 1406 "http://www.whois.sc/tripscan.com" "SurveyBot/2.3 (Whois Source)" "-"

First we need to parse the entries with a regex to extract the internal page — between GET and HTTP — and the referring page, which comes after the server status code and the page size. In this case, after 206 and 1406.
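Here is a minimal parsing sketch in Python; the regex below is tailored to the extended format shown above and is only an illustration, so it would need adjusting for other log layouts:

```python
import re

# Extended/combined log format: IP, identity, user, [timestamp], "request",
# status, bytes, "referrer", "user-agent", ...
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<page>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)"'
)

def parse_line(line):
    """Return (internal page, referring page), or None if the line doesn't match."""
    match = LOG_RE.match(line)
    if not match:
        return None
    return match.group("page"), match.group("referrer")

sample = ('64.246.161.30 - - [29/May/2007:13:12:26 -0400] '
          '"GET /favicon.ico HTTP/1.1" 206 1406 '
          '"http://www.whois.sc/tripscan.com" "SurveyBot/2.3 (Whois Source)" "-"')
print(parse_line(sample))  # ('/favicon.ico', 'http://www.whois.sc/tripscan.com')
```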

We then create two maps: one for the internal pages (page URL to page ID) and another for the external incoming link pages (again, page URL to page ID). After that we can create a matrix that identifies the linking relationships between the pages. For example, matrix[23][15] = 1 means there is a link from external page ID 15 to internal page ID 23. This matrix is commonly known in information retrieval as the adjacency matrix or hyperlink matrix. We want an implementation that can preferably be operated from disk, in order to scale to millions of link relationships.
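Here is one way those two ID maps and the link structure could be built. I am using a plain in-memory dictionary of sets to stand in for the adjacency matrix; for millions of link relationships you would swap it for a disk-backed sparse matrix or key-value store, as noted above:

```python
from collections import defaultdict

internal_ids = {}         # internal page URL -> page id
external_ids = {}         # external referring URL -> page id
links = defaultdict(set)  # internal page id -> set of external page ids linking to it

def get_id(table, url):
    """Assign sequential ids the first time a URL is seen."""
    if url not in table:
        table[url] = len(table)
    return table[url]

def record(internal_page, referrer, own_host="tripscan.com"):
    # Skip direct hits and internal referrers; we only want incoming links.
    if not referrer or referrer == "-" or own_host in referrer:
        return
    i = get_id(internal_ids, internal_page)
    j = get_id(external_ids, referrer)
    links[i].add(j)  # the sparse equivalent of matrix[i][j] = 1

# Usage with the parse_line sketch above:
# with open("access.log") as log:
#     for line in log:
#         parsed = parse_line(line)
#         if parsed:
#             record(*parsed)
```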

Later we can walk the matrix and create reports identifying the link-rich pages (those with many link relationships) and the link-poor pages (those with few link relationships). We can define the threshold at some point (e.g., pages with more or fewer than 10 incoming links).
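Building on the structures from the previous sketch, the reporting step could look something like this, with an arbitrary threshold of ten referring pages:

```python
THRESHOLD = 10  # arbitrary cutoff; tune it to your site

def link_report(links, internal_ids):
    """Split internal pages into link-rich and link-poor, sorted by inbound links."""
    id_to_url = {page_id: url for url, page_id in internal_ids.items()}
    rich, poor = [], []
    for page_id, referrers in links.items():
        entry = (id_to_url[page_id], len(referrers))
        (rich if len(referrers) >= THRESHOLD else poor).append(entry)
    rich.sort(key=lambda item: item[1], reverse=True)
    poor.sort(key=lambda item: item[1])
    return rich, poor
```

One caveat: pages that never show up in the log as a link target will not appear in this report at all, so in practice you would want to seed the internal page map from a full crawl or a sitemap so that zero-link pages are reported as well.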