LinkingHood v0.1 alpha — Improve your PageRank distribution

As I promised to one of my readers, here is the first version of the code to mine log files for linking relationship information.

I named it LinkingHood because the intention is to take link juice from the link-rich pages and give it to the link-poor ones.

I wrote it in Python for clarity (I love Python :-) ). I was working on a more advanced approach involving matrices and linear algebra, but the feedback on the previous article gave birth to a new idea, and to make it easier to explain I decided to use a simpler approach here. The code would definitely need to be rewritten with matrices and linear algebraic operations to scale to sites with 10,000 or more pages (more about that in a later post). This version is primarily an illustration: it does everything in memory and is extremely inefficient in its current form.

I simply used a dictionary of sets. The keys are the internal pages, and each set contains the links pointing to that page. I tested it with my tripscan.com log file and included the results of a test run.
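To give a feel for how that dictionary is built, here is a minimal sketch (not the full script, which is in the zip below). It assumes a combined-format log file, and mine_links is a hypothetical name for illustration, not necessarily what LinkingHood.py uses. The "-" referrer (no referrer) is kept, just as it appears in the report further down.

```python
import re
from collections import defaultdict

# Sketch only: captures the requested page and the referrer from each
# line of an Apache combined-format log file.
LOG_LINE = re.compile(r'"(?:GET|POST) (\S+) [^"]*" \d+ \S+ "([^"]*)"')

def mine_links(log_lines):
    links = defaultdict(set)  # internal page -> set of referring URLs
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match:
            page, referrer = match.groups()
            if referrer:  # '-' (no referrer) is kept, as in the report
                links[page].add(referrer)
    return links
```

The incoming link count for each page is then simply len(links[page]), which is how a report line like "/index.php: 2 links" comes about.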

Here is the script:

linkinghood.zip

Here are the results from the run:

Tripscan internal pages:
/orlando.php: 2 links
/directory/money_and_finance.html: 3 links
/contact.php: 2 links
/favicon.ico: 3 links
/lasvegas.php: 2 links
/directory/services.html: 2 links
/index.php: 2 links
/directory/travel.html: 1 links
/charleston.php: 2 links
/sunburst.php: 2 links
/cancun.php: 2 links
/blank.php: 5 links
/london.php: 2 links
/discount_travel.php: 2 links
/santodomingo.php: 2 links
/directory/internet.html: 2 links
/phoenix.php: 2 links
/: 41 links
/paris.php: 2 links
/sanfrancisco.php: 2 links
/directory/drugs_and_pharmacy.html: 2 links
/honolulu.php: 2 links
/chicago.php: 2 links
/directory/general.html: 1 links
/directory/fun.html: 2 links
/sitemap.php: 2 links
/hiltongrand.php: 2 links
//: 1 links
/directory/travel2.html: 2 links
/directory/home_business.html: 1 links
/losangeles.php: 2 links
/directory/misc.html: 1 links
/jamaica.php: 2 links
/aruba.php: 2 links
/best_spa.php: 2 links
/amsterdam.php: 2 links
/puertovallarta.php: 3 links
/barcelona.php: 2 links
/newyork.php: 2 links
/submit_link.php: 2 links
/11thhour.php: 2 links
/directory/services2.html: 2 links
/neworleans.php: 2 links
/toronto.php: 2 links
/rome.php: 2 links
/directory/: 2 links
/aboutus.html: 4 links
/directory/other_resources.html: 2 links
/top_ten.php: 2 links

Home has 41 links
http://www.directorypanel.com/detail/link-3571.html
http://www.campwalden.ca/web/travel14.htm
http://chiangmai.discount-thailand-hotel.net/chmresources/travel_resources-page17.php
http://hamletbatista.com/2007/05/29/mining-you-server-log-files/
http://www.the-happy-side.com/link_description.php?cat_id=1
http://www.popularaffiliate.com/travel.html
http://www.kingbloom.com
http://energytable.com/links/shopping.html
http://hamletbatista.com/page/2/
http://www.garyknight.com/links/vacations6.html
http://www.nicepakistan.com/directory/index.php?c=14
http://www.linkdirectory.com/Travel___Vacation/Destinations/
http://www.realestateingrandrapids.com/links/recreation.html
http://whois.domaintools.com/tripscan.com
-
http://www.abccoachhire.co.uk
http://www.1americamall.com/index.php?c=22&s=201
http://www.littlemarketstreet.com/links/travel3.html
http://www.uddsprinting.com/travellinks.html
http://uddsprinting.com/travellinks.html
http://www.whois.sc/tripscan.com
http://www.cheap-air-travel-fares.info/resources9.html
http://www.siteinclusion.com/directory?logic=or&maximum=&term=mexico+vacation+central&sr=20&pp=20&cp=2
http://www.tripscan.com
http://www.goodsearch.com/Search.aspx?Keywords=vacation+packages&page=4
http://www.vts.net/links/travel3.html
http://hamletbatista.com/
http://www.search-the-world.com/search/search.php/search::cat/category::25/page::42/hpp::20/
http://linkcentre.com/search/?keyword=travel&page=4&flag=
http://www.patclarkconversions.com/links/travel2.html
http://www.goodsearch.com/Search.aspx?Keywords=www.tripscan.com&Source=mozillaplugin
http://hamletbatista.com/tag/link-building/
http://www.datingshare.com/sharelinks/travel.html
http://www.tripscan.com/directory/
http://www.webdigity.com/ws/
http://www.tripscan.com/
http://www.ottosuch.de/
http://www1.tripscan.com/hotel-deals/10015639-hotrate.html
http://www.link-exchange.ws/link-exchange/index.php?action=displaycat&catid=27&page=12&perpage=15&page=13&perpage=15
http://www.bargaintraveleurope.com/Travel_Links.htm
http://www.weboart.com/links/recreation-sports-travel.html
About has 4 links
http://www.tripscan.com
http://res99.lmdeals.com/config.html?in_origination_key=371&in_pd_key=329&SRC=10015639&SRC_AID=none&in_package_key=5034225&in_offering_key=1578846&in_slipclick=main_result&SRC=10015639&SRC_AID=none
-
http://www.tripscan.com/

One of the most common stumbling blocks for people unfamiliar with Python is indentation. This code cannot simply be copied, pasted into a text file, and handed to Python to run: you need to make sure the indentation (spacing) is right. I will post the code somewhere else and provide a link if this causes too much trouble.

Some readers got lost when I talked about matrices in the previous post. Linking relationships, and similarly connected structures, are conceptually and graphically represented as graphs. A graph is an interconnected structure made of nodes and edges; in our case, the pages are the nodes and the links are the edges. One of the most common ways to express a graph is with a matrix. Similar to an Excel sheet, it has rows and columns, and each cell can be used to indicate whether there is a relationship between the page in that row and the page in that column.

Matrices are great for this because one can use matrix operations to solve problems that would otherwise require a lot of memory and computing power. To create the matrix, we would number each unique page and each unique link, using the rows to represent the pages and the columns to represent the links. A 1 in a given position means there is a link between the two pages; a 0 means there is no relationship. Using numbers for the rows and columns, and ones and zeros for the values, saves a lot of memory and makes the computation far more efficient. In the code I use the pages and links directly for clarity.
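As a quick illustration of the idea (build_matrix is a hypothetical helper for this post, not part of LinkingHood), here is how such a 0/1 matrix could be built from the dictionary of sets, with pages as rows and linking URLs as columns:

```python
# Illustration only: turn the page -> set-of-links dictionary into a
# matrix of ones and zeros. Rows are pages, columns are linking URLs;
# a 1 means that URL links to that page.
def build_matrix(links):
    pages = sorted(links)                           # row index -> page
    sources = sorted(set().union(*links.values()))  # column index -> link
    matrix = [[1 if source in links[page] else 0 for source in sources]
              for page in pages]
    return pages, sources, matrix
```

In a real implementation you would keep only the numbers (and ideally a sparse representation, since most cells are 0), which is where the memory savings come from.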

I hope this is not too confusing.

Update: I made a small change to include the incoming link count for each page.

In order to use the script, download Python from http://www.python.org. The script should run on Unix/Linux, Mac and Windows, but I only tested it on Linux.

1. Copy your log file to the directory where the script was saved.

2. Change the name of the log file (inside the quotes) in the line log = open('tripscan.actual_log') to the name of your log file.

3. In the command line, type: python LinkingHood.py and you should see the report.

7 replies
  1. kichus says:

    That's a great one, Batista, thanks for simplifying. And now I've got one more doubt: the logs only record the hits (they won't have the entire list of links pointing to a page), so only the referral links for that particular period of time, depending on what date range the log files cover, get evaluated.

    I understand that these are the performing links, which have the capability to distribute some link juice. But do they give us the exact list we can rely on to make strategies?

    Also, if you could add one more paragraph saying the basic requirements for running that script, that would be great.

  2. Andrea says:

    Hi, this is really interesting, but I'm not sure what to do next.
    I already store in a DB all the incoming links from non-SE for almost any page of my sites plus the SE keywords.
    I use it to see the top keywords trend and to check if affiliate sites stop linking to me…
    What is the smartest thing I should do with all this huge amount of data?

  3. Hamlet Batista says:

    kichus,

    You are absolutely right. Note that search engines crawl your website frequently, and the script takes advantage of that to collect all your internal links. The more days recorded in your log file, the better.

    I am not sure I understand your second paragraph.

    I updated the entry based on your suggestions.

    Andrea,

    Parsing the keywords is a very smart idea. I think I will modify the code to do that as well.

    In my opinion the best use for this data is to modify the link rich pages so that they pass traffic and link love to the other ones.

    The keywords are the best way for your visitors to express their intentions. Use your data to make sure they land on the right pages: the intention expressed in the keywords needs to match the content of the landing page.

    If this is not the case I would change the content on the pages to match the queries.

  4. kichus says:

    Thanks Batista….

    Sorry about the confusion. I was actually asking if you could mention some basic requirements for running this script (like Python version x.xx, where to keep your log files, how to call it, etc.). Sorry if I missed it in the post…

  5. kichus says:

    Oops… I was actually asking for exactly the same thing you've added under the "Usage" section. :)

    I felt it was missing the first time I read the post…

    kichus


Trackbacks & Pingbacks

  1. [...] is Hamlet Batista in the SEOmoz comments when he posted a link to a great post talking about how to automate passing pagerank between your low pagerank pages. Since then I’ve loved the technical and insightful posts he makes and look forward to many [...]
