Like Flies to Project Honeypot: Revisiting the CGI proxy hijack problem

CGI proxy hijacking appears to be getting worse. I am pretty sure that Google is well aware of it by now, but it seems they have other things higher on their priority list. If you are not familiar with the problem, take a look at these for some background information:

  1. Dan Thies's take and proposed solutions

  2. My take and proposed solutions

Basically, negative SEOs are causing good pages to drop from the search engine results by pointing CGI proxy servers' URLs at a victim's domain and then linking to those URLs. Search engine bots find the proxy copies, the duplicate content filters drop one of the two pages, and it is inevitably the one with the lowest PageRank: the victim's page.

As I mentioned in a previous post, this is likely to be an ongoing battle, but that doesn't mean we have to lie down and do nothing. Existing solutions inject a meta robots noindex tag into every page when the visitor is not a search engine, so that search engines won't index the proxy-hijacked copy. Unfortunately, some proxies are already altering the content before passing it on to the search engine. I am going to present a solution that I think can drastically reduce the effectiveness of such attacks.

Case in point: I got an e-mail last week from an old friend I hadn’t heard from in a while. Richard operates the popular traffic stats web site GoStats and has been having some serious problems fighting CGI hijackers:

…In related Search Engine talk, I did post some information on DP about
some anti-proxy hijacking techniques:
http://forums.digitalpoint.com/showthread.php?t=499858

Perhaps this can be of interest to your anti-proxy research.

One thing I’m interested in learning more about is dealing with the bad
proxies who strip the meta noindex tags and cache the content (or
otherwise ignore the 403 forbidden message). Sending a DMCA fax to
Google for all of the offenders seems like a very time consuming way to
deal with a straight forward web-spam problem. What is your though[t] on
that? Do you know of any alternatives to expedite the proxy-hijack
problem when proxies are not behaving?

Regards,

Richard

It is interesting that the flaws I mentioned in my comments to Dan Thies’s post are already being exploited. I think this is a good time for me to resume my research and see how I can contribute something useful back to the community.

A solution built on experience

My idea is not 100% bulletproof, but it is definitely a good start. It builds upon the experience that the e-mail anti-spam community has accumulated through years of fighting spam.

First, a little bit of background. Most responsible ISPs implement strong anti-spam filters on their mail servers; if they didn't, you would receive far more spam than you currently do. One of the techniques used is to query public databases of blacklisted IP addresses. Users who receive spam can report the e-mail to a service such as SpamCop, which extracts the originating IP and submits it for inclusion in a shared blacklist. Anti-spam researchers also place decoy e-mail addresses in public places (known as honeypots) to trap e-mail harvesters; any spam arriving at those addresses gets flagged and the source is blacklisted. For efficiency, the blacklist databases are implemented as DNS servers, also known as DNSBLs (DNS-based Black Lists). A couple of well-known ones used for anti-spam purposes are Spamhaus and SURBL.
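The mechanics of a DNSBL lookup are simple: the client reverses the octets of the IP address, appends the blacklist's zone, and issues an ordinary DNS A-record query; getting an answer back means the IP is listed, while NXDOMAIN means it is clean. A minimal sketch in Python (the zone names are just examples):

```python
import socket

def dnsbl_query_name(ip, zone):
    """Build the hostname queried for a DNSBL lookup:
    IP octets reversed, blacklist zone appended."""
    reversed_octets = ".".join(reversed(ip.split(".")))
    return f"{reversed_octets}.{zone}"

def is_listed(ip, zone):
    """True if the DNSBL returns an A record for the IP (i.e. it is listed);
    a resolution failure (NXDOMAIN) means the IP is not on the list."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except OSError:
        return False

# 127.0.0.2 is the conventional test address most DNSBLs always list
print(dnsbl_query_name("127.0.0.2", "zen.spamhaus.org"))
# → 2.0.0.127.zen.spamhaus.org
```

A mail server runs this check on the connecting client's IP before accepting a message; the web equivalent below does the same thing per HTTP visitor.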

Enter the Honeypot

I am glad to report that there is already a project that applies this concept to malicious web traffic: Project Honey Pot's Http:BL (thanks to my reader Heather Paquinas for the tip). They are not currently targeting CGI proxy spammers, but they have pretty much everything we need to defend ourselves:

  1. Http:BL (DNSBL database). An existing, actively-maintained global database with the IP addresses of malicious web users.

  2. Detection modules and API (Apache module, WordPress plugin, etc.). Code that verifies each visitor against the blacklist database and blocks unwanted traffic to your site.

  3. Honeypot (PHP, ASP, and other scripts). Code that sets up traps to catch e-mail harvesters, comment spammers, etc.
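For the curious, Http:BL works like any other DNSBL: you query `<your-access-key>.<reversed-ip>.dnsbl.httpbl.org`, and a listed IP comes back as an A record of the form `127.<days-since-last-activity>.<threat-score>.<visitor-type>`, where the last octet is a bitmask (1 = suspicious, 2 = harvester, 4 = comment spammer; 0 marks a search engine). A sketch of the response decoding, based on my reading of the Http:BL API documentation:

```python
def parse_httpbl(response):
    """Decode an Http:BL A-record response such as '127.3.45.6'.
    Octets: 127 marker, days since last activity, threat score,
    visitor-type bitmask."""
    octets = [int(o) for o in response.split(".")]
    if octets[0] != 127:
        raise ValueError("not an Http:BL response")
    days, threat, visitor_type = octets[1], octets[2], octets[3]
    types = []
    if visitor_type == 0:
        types.append("search engine")
    if visitor_type & 1:
        types.append("suspicious")
    if visitor_type & 2:
        types.append("harvester")
    if visitor_type & 4:
        types.append("comment spammer")
    return {"days": days, "threat": threat, "types": types}

print(parse_httpbl("127.3.45.6"))
# visitor-type 6 = 2 | 4, i.e. harvester and comment spammer
```

The unused bits in that last octet are the "reserved slots" I refer to later: room where a new visitor type such as "CGI proxy hijacker" could live.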

I have already installed a WordPress plugin on this blog that does the detection, and I also added a honeypot. To do the same yourself and start blocking malicious traffic, you just need to register a free account, request an access key and set up the WordPress plugin (or Apache module, etc.). Just blocking the comment spammers is a great help in reducing the spam in the Akismet queue. :-)

Next, I recommend you install their honeypot on your server. In my case, I only had to copy a PHP script to the root folder of my blog, access it from the Web and follow a link on the displayed page to activate it. I then placed links to the honeypot page that are neither clickable nor visible to my users. The honeypot page already includes the relevant meta robots directive to prevent crawling and indexing, so a real search engine robot will not bother with it. I also recommend blocking access to your honeypot page via robots.txt, so that any IP that reaches the page anyway can be flagged as suspicious.
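As an illustration, the robots.txt rule could look like this (the honeypot path is just an example, not the real script name):

```
# robots.txt: keep well-behaved crawlers away from the trap
User-agent: *
Disallow: /honeypot-page.php
```

The invisible link itself can be as simple as `<a href="/honeypot-page.php" style="display:none">&nbsp;</a>`: hidden from human readers, but followed by scrapers and proxies that ignore robots.txt.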

Using the honeypot to lure CGI proxy hijackers

According to the API, Project Honey Pot currently detects search engines, suspicious robots, e-mail harvesters and comment spammers. We need to work with them to set up traps that catch CGI proxy hijackers too. My proposal to identify the CGI proxy hijackers is to generate an encoded text in the body of the honeypot page, and later perform searches for this text in major search engines to see if results come up. As the honeypot page is not supposed to be indexed (per the meta robots tag and robots.txt instructions), the presence of the text in the index means a CGI proxy is responsible. Finally, the encoded IP addresses and other information can be recorded in the blacklist and labeled as CGI proxy hijackers using one of the reserved slots shown in the diagram above.
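The encoded text could be built the same way the honeypot already builds its trap e-mail addresses: by base64-encoding the visitor's IP and a timestamp. Here is a sketch of the idea (my own illustration, not Project Honey Pot's actual scheme):

```python
import base64
import time

def make_stamp(visitor_ip, timestamp=None):
    """Encode the visitor's IP and a timestamp into an opaque token
    that can be embedded in the body of the honeypot page."""
    if timestamp is None:
        timestamp = int(time.time())
    payload = f"{visitor_ip}|{timestamp}"
    # urlsafe alphabet avoids '+' and '/' so the token survives copying intact
    return base64.urlsafe_b64encode(payload.encode()).decode().rstrip("=")

def read_stamp(token):
    """Recover the IP and timestamp from a token found in a search index."""
    padded = token + "=" * (-len(token) % 4)
    ip, ts = base64.urlsafe_b64decode(padded).decode().split("|")
    return ip, int(ts)

token = make_stamp("127.0.0.1", 1190000000)
print(token, read_stamp(token))
```

Searching the major engines for these tokens then reveals not just that the honeypot page was hijacked, but which IP fetched it and when.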

If the CGI proxies respect robots.txt and don't strip the meta robots noindex tag, the solution Thies recommends is enough. Otherwise, I suggest modifying the detection code (the WordPress plugin in my case) to serve a meta robots noindex tag to any visitor it cannot identify as a search engine.
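That fallback could look like the sketch below. The reverse-and-forward DNS check is the standard way to verify a real crawler such as Googlebot; the helper names and the crawler domain list are my own illustrative assumptions:

```python
import socket

def is_verified_search_engine(ip):
    """Reverse-and-forward DNS check: resolve the IP to a hostname,
    confirm it belongs to a known crawler domain, then resolve the
    hostname back and make sure it maps to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        if not host.endswith((".googlebot.com", ".search.msn.com", ".crawl.yahoo.net")):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

def robots_meta_for(ip, verifier=is_verified_search_engine):
    """Serve noindex to everyone except verified crawlers, so a proxy
    relaying the page to a search engine carries the tag along with it."""
    if verifier(ip):
        return ""
    return '<meta name="robots" content="noindex,nofollow">'
```

Because the check verifies the IP rather than trusting the User-Agent header, a proxy fetching the page on a spammer's behalf always receives (and hopefully passes on) the noindex tag.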

I wanted to write this post first in as much detail as I could, but I plan to contact the Project Honey Pot team to see if they are interested in adding CGI proxy hijacker detection to their already robust and comprehensive solution. Hopefully this will give the hijackers a really complex obstacle to deal with. :-)

16 replies
  1. egorych
    egorych says:

    Nice idea. But maybe it's too late? Webmasters report that hijacking proxies were banned in Google. Maybe Google already has the right solution? Or do you think hijacking will never die and can only be implemented in different ways? Anyhow, thanks for the information. It's very interesting.

    Reply
  2. Dan Thies
    Dan Thies says:

    I haven't seen any real evidence that Google did anything. A couple folks reported that they no longer see proxy copies of their own sites. But it's easy enough to check the # of php-proxy,cgi-proxy,nph-proxy etc. duplicates in the index, and those are just the obvious EASY ones to remove.

    Reply
  3. David Hopkins
    David Hopkins says:

    Is it possible to use a htaccess solution rather than a PHP one for this problem?

    A lot of people have low grade coded websites that don't have any ability to set anything globally. In these cases it would be better to have the server do the blocking.

    Reply
  4. Hamlet Batista
    Hamlet Batista says:

    David – the project honeypot http:BL has something they call quicklinks for people that don't have access to scripting functionality on their websites. They would need at least the Apache module installed for detection, though.

    Reply
  5. Richard Chmura
    Richard Chmura says:

    From my end I've seen a brief period of what seemed to be "testing" at google where the proxies were gone. I think "G" is up to something here – but not ready for a full production solution.

    Even still – despite some overall imminent victories from G against proxy-hijackers, it is always important to protect your content from other scrapers – and even future revisions of proxy-hijacking.

    I propose an additional step in dealing with this scourge: content stamping and source stamping. The stamp would be inserted into the content as a string of characters unique to each of your documents. A second part of the string would be an IP address encoding (either in decimal notation, in hex, or in any encrypted form for added effectiveness). This would allow quick discovery of which content in the index is being copied and also give you an idea of what IP address was used to copy the content (or even tell you where this site got your content from).

    Here's an example of the content stamp: (first is the unique content ID, the second is the hex format of the IP address)
    3P1I4E159 7F000001
    -ensure that the content id tag is unique for each page or resource for your site.
    -of course you can make the unique content id longer to decrease the effectiveness when searching.
    -This is also very effective when copying sites insert random words or parts of words to break up your normal sentences. (I've seen "th" used as a random injection into various sentences to avert manual duplicate content detection)
    -You can also prepend the content stamp with a sitewide content stamp. This will allow you to search google for [your-global-content-stamp -site:http://yoursite.com]

    Making a google alert (or alert for any other service) regarding copies of your content will be swift and easy.

    Another thing I should mention:
    -You can use .htaccess blocking for sites without php. Just use a cron job (or scheduled task) to rebuild your .htaccess file regularly from your block list source. This is assuming that you can download the BL & get regular updates.
    -You can also use a captcha for forbidden requests so that real humans can access your site in the event of false positives. (You may want to "allow all" to the captcha page.)

    Reply
  6. Richard Chmura
    Richard Chmura says:

    Small typo oops:
    "-of course you can make the unique content id longer to decrease the effectiveness when searching."
    should read:
    "-of course you can make the unique content id longer to increase the effectiveness when searching."

    Also, here's an example of an extended content stamp. (or content-id tag)
    3243F6A8885A308D31319 3P1I4E159 7F000001
    1 site wide content tag, 2 unique resource tag, 3 hex code for IP address of client requesting your resource.
    1 is constant across your domain
    2 is unique to each page/resource
    3 is unique to each client who requests your content

    Find copies of your site by searching:
    -site:http://yourdomain.com 3243F6A8885A308D31319

    Find copies of your resource by searching:
    -site:http://yourdomain.com 3243F6A8885A308D31319 3P1I4E159
    (or remove the global content tag portion to broaden your results)

    Then WHOIS the associated IP address to determine where the content was picked up from. Blacklist if necessary.
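    The hex stamp above round-trips easily; a small sketch (function names are illustrative):

```python
def ip_to_hex(ip):
    """Encode a dotted-quad IP as an 8-character hex stamp: 127.0.0.1 -> 7F000001."""
    return "".join(f"{int(octet):02X}" for octet in ip.split("."))

def hex_to_ip(stamp):
    """Decode the stamp back to dotted-quad for a WHOIS lookup."""
    return ".".join(str(int(stamp[i:i + 2], 16)) for i in range(0, 8, 2))

print(ip_to_hex("127.0.0.1"))  # 7F000001
print(hex_to_ip("7F000001"))   # 127.0.0.1
```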

    Reply
  7. Hamlet Batista
    Hamlet Batista says:

    Hi Richard,

    Thanks for stopping by.

    I propose an additional step in dealing with this scourge: content stamping and source stamping.

    Sorry, I didn't make it clearer, but what you describe is the purpose of the encoded text I mentioned in my recommendation. Thanks for making it clearer, though.

    The honeypot does exactly this. The honeypot generates email addresses with encoded IP, timestamp, etc. The harvesters collect the addresses and when the spam hits, the email address has all the identifying information. They use base64 encoding, which makes the strings a little bit shorter than simply hexing the characters.

    My proposal to identify the CGI proxy hijackers is to generate an encoded text in the body of the honeypot page, and later perform searches for this text in major search engines to see if results come up. As the honeypot page is not supposed to be indexed (per the meta robots tag and robots.txt instructions), the presence of the text in the index means a CGI proxy is responsible. Finally, the encoded IP addresses and other information can be recorded in the blacklist and labeled as CGI proxy hijackers using one of the reserved slots shown in the diagram above.

    Your content stamping idea is really interesting, but the encoded text needs to be included in all pages. Many site owners would not like to display such text in their copy. I'd personally prefer to have that encoded text on the honeypot page that is not accessible by regular readers. I believe the CGI proxy hijackers 'copy' all the pages, so we can do the detection in one page to catch them. I guess you could hide the encoded text too, but there will be the concern of being penalized.

    Another thing I should mention:
    -You can use .htaccess blocking for sites without php. Just use a cron job (or scheduled task) to rebuild your .htaccess file regularly from your block list source. This is assuming that you can download the BL & get regular updates.
    -You can also use a captcha for forbidden requests so that real humans can access your site in the event of false positives. (You may want to “allow all” to the captcha page.)

    These are excellent ideas. Thanks for sharing! Alternatively, they could rebuild the .htaccess on their PC and upload it to the server if they don't have access to scripting (the cron job needs to run a script).

    Reply
  8. Richard Chmura
    Richard Chmura says:

    Hi Hamlet,

    Yes, sorry I got a little carried away when detailing my explanation. ;) But here is the reason why I think content stamping on legitimate pages (non-honey pot) is also important: proxy-hijacking techniques will evolve – so will web scraping. What may be simple today, will be a totally different beast down the line. Honey pots will make a great positive identification of the unsophisticated proxies. However, as they change their tactics, locating the source of scraped content may become not so simple (or not so easy to black list). It is even possible that embedding scraping or proxy collecting software in malware could open up content to legitimate clients facilitating the theft. At that point the best defense is an aggressive offense of locating all content that has been copied. (And having a stamp in the public content too will help greatly) – Perhaps it could be worked into a copyright statement like:
    "copyright 2007 example.com 3243F6A8885A308D31319 3P1I4E159 7F000001" – or similar.

    Reply
  9. Hamlet Batista
    Hamlet Batista says:

    Richard – Don't worry, I appreciate you are taking the time to detail things. Sometimes I get too lazy ;-)

    I completely agree with you. That is why I say that this is an ongoing battle.

    I had content stamping for my feed on this blog. Exactly as you describe it. The plugin stopped working after I moved to WordPress 2.3. Hopefully they will fix it soon. Check it out http://www.blogclout.com/blog/goodies/feed-footer

    On the other hand, it would be great if Google fixed this. How do you think they can fix this problem? I blogged about an idea I had a while back, but I'd appreciate your ideas and suggestions. Please check http://preview.hamletbatista.com/2007/07/19/content-is-ki

    Reply

Trackbacks & Pingbacks

  1. […] but I’ve seen this technique being used more and more each day. Read more about it here and here. Basically you use proxy sites to generate duplicated content of a site… And Google is not […]
