Blocking nasty robots

Keith Tuomi
<20 Posts
Posts:14


05/03/2007 4:30 AM  
I've always used the following script on my classic ASP sites: www.evolvedcode.net/content/code_crawlerfilter/

I got into this habit after noticing bad robots crawling through my site and maxing out resources. This code, although a bit dated by now, worked pretty well. So I'm happy to see PageBlaster has now added user-agent blocking (which can also take regexes).

So I tried adding the following list of bad agents, which is basically lifted straight from the above-linked ASP code (I've added a line break at each agent for readability in the forum message):

blocked-useragents="\sAdvanced\sEmail\sExtractor\sv\d\.\d+\)$;
CherryPicker;ClariaBot/1.0;
Crescent;^DA\s\d\.\d+$;
^Mozilla/\d\.\d\s\(compatible;
\sMSIE\s\d\.\d;\sWindows\sNT;
\sDigExt;\sDTS\sAgent$;
asyDL/\d\.\d+;e-collector;
EmailCollector;
^EmailSiphon$;
EmailWolf;
ExtractorPro;
Go!Zilla;GetRight/\d.\d;
^ia_archiver$;
Indy\sLibrary;
larbin;MSIECrawler;
Microsoft\sURL\sControl;
NEWT\sActiveX;
NICErsPRO;
RealDownload/\d\.\d\.\d\.\d;
Teleport;
Telesoft;
UtilMind\sHTTPGet;
WebBandit;
webcollage/\d\.\d\d;
WebCopier\sv\d\.\d;
WebEMailExtrac;
WebZIP;
^WGet/\d\.\d;^Zeus.+Webster;
^Mozilla/3\.Mozilla/2\.01\s\(Win95;\sI\)$;
^Internet\sExplorer\s?\d?\.?\d?$;
^IE\s\d\.\d\sCompatible.*Browser$;
^Microsoft\sInternet\sExplorer/4\.40\.426\s\(Windows\s95\)$;
^Mozilla/4\.0\s\(?hhjhj@yahoo\.com\)?$;
^MSIE;^Mozilla$;^Mozilla(\\|/)\?\?$;
^Internet\sExplore\s?\d?\.?[a-z0-9]+$;
^IAArchiver-\d\.\d$;
^NPBot(-\d/\d\.\d)?(\s\(http://www\.nameprotect\.com/botinfo\.html\))?$;
^Webclipping\.com$;
^Mozilla/\d\.\d\s\(X11;\sLinux\si686;\sen-US;\srv:\d.\d[a-z0-9]*;
\sOBJR\)$;
^Sqworm/\d\.\d\.\d\d-BETA\s\(beta_release;\s\d{8}-\d{3};
\si\d{3}-pc-linux-gnu\)$;
^Lickity_Split/\d\.\d$;
^Production\sBot\s\d+B$;
^amzn_assoc$;
^Harvest;
^Webdup/\d\.\d$;
^WebIndex/\d\.\d[a-z]$;
(^|\s)RPT-HTTPClient/\d\.\d-\d$;
^sitecheck\.internetseer\.com\s\(For\smore\sinfo\ssee:\shttp://sitecheck\.internetseer\.com\)$;
^vspider$;
^k2spider$;
^Mac\sFinder\s;
^ICU\sv;
^DART$;
^Mozilla/\d\.\d\s\(compatible;\sMSIE\s\d\.\d;\sWindows\sNT\s\d\.\d;\sQ\d{6};
\s\.NET\sCLR\s\d\.\d\.\d{4}$;
^COMBOMANIA$;
^MyCrawler$;
^Mozilla/\d\.\d\s\(compatible;\sWin32;\sWinHttp\.WinHttpRequest\.\d\)$;
^WEP\sSearch\s\d+$;
^Mozilla/\d\.\d\s\(fantomBrowser\)$;
^TE$;
^WebStripper/\d\.\d\d$;
^OWR_Crawler\s\d\.\d$;
^WebMiner/\d\.\d\s\[en\]\s\(Win\d\d;\sI\)$;
^WebGather\s\d\.\d$;
^readwebpage$;
^InstantSSL\sBrowser:\slow\scost\sfully\svalidated\sSSL\s\+\sfree\strial$;
^Mozilla/\d\.\d\s\(compatible;\sHTTrack\s2\.0x;
\sWindows\s.+\)$;
^Mozilla/4\.0\s\(compatible;
\sPowermarks/\d\.\d;\sWindows\s.+\)$;
^Vivante\sLink\sChecker\s\(http://www\.vivante\.com\)$;
^Mozilla/\d\.\d\s\(compatible;
\sWindows\sNT\s\d\.\d;
\sABN\sAMRO$;
^Mozilla/\d\.\d\s\(compatible;
\sIntelliseek;
\shttp://www\.intelliseek\.com\)$;
^WebCopier\sSession\s\d$;
^Mozilla/\d\.\d\s\(compatible;
\sMSIE\s\d\.\d\d;
\sWindows\s\d\d$;
^Art-Online\.com\s\d\.\d\(Beta\)$;
^WebGo\s;
^SuperBot/\d\.\d\s\(Win\d\d\)$;
^Download\sNinja\s\d\.\d$;
^Mozilla/\d\.\d\s\(compatible;
\sMSIE\s\d\.\d;
\sWindows\sNT\s\d\.\d;
\s\.NET\sCLR\s\d\.\d\.\d{4}$;
^Expired\sDomain\sSleuth$;
^SHARP-TQ-GX\d\d$;
^HTTP/\d\.\d\sMozilla/\d\.\d\+\(compatible;
\+MSIE\+\d\.\d;
\+Windows\+NT\+\d\.\d\)$;
^.+/\d+\.\d+\s\(Version:\s\d+\sType:\d+\)$;
^Offline\sExplorer/\d+\.\d+$"


This, however, resulted in the site serving up blank HTML pages. So, considering that the regexes are already valid syntax (they worked before), could it be that this is simply a bigger chunk of exclusions than is desirable?
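
One thing worth checking besides the size of the list: several of the patterns above contain literal semicolons, which is also the character separating the entries. The thread does not say how PageBlaster parses blocked-useragents, so the following is only a sketch under the assumption that the value is split on every ";". It uses Python's re module rather than .NET, and the sample Internet Explorer 6 user-agent string is an illustration, not something taken from the site's logs.

```python
import re

# Three entries copied from the list above. The first one contains
# literal semicolons inside the pattern itself.
intended = r"^Mozilla/\d\.\d\s\(compatible;\sMSIE\s\d\.\d;\sWindows\sNT;\sDigExt;\sDTS\sAgent$"
raw = intended + r";CherryPicker;^EmailSiphon$"

# ASSUMPTION: the module splits the attribute value on every ';'.
# This is not confirmed anywhere in the thread.
fragments = [f for f in raw.split(";") if f]

# A typical Internet Explorer 6 user agent (illustrative sample).
ordinary_ua = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"

# Which fragments would block an ordinary browser?
matching = [f for f in fragments if re.search(f, ordinary_ua)]

print(len(fragments), "fragments after splitting")
for f in matching:
    print("matches an ordinary browser:", f)

# The pattern as originally intended does not match the ordinary browser.
print("intended pattern matches:", re.search(intended, ordinary_ua) is not None)
```

If the split really works this way, fragments such as `\sMSIE\s\d\.\d` or `\sWindows\sNT` would block most regular visitors, which could look like blank pages. Testing with a short list that has no semicolons inside any pattern would confirm or rule this out.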
John Mitchell
Posts:3033


05/03/2007 8:04 AM  
That is a hefty set of regexes. I'll test it in debug when I get a chance. Having that many exclusions means a loop that checks each one on every request, so you may want to consider blocking only the ones you know are giving you trouble.
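
To illustrate the per-request cost, here is a minimal sketch of that kind of check in Python. It is not PageBlaster's actual code; the names BLOCKED, COMPILED and is_blocked are made up for the example, and the short list is just a handful of entries picked from the post above.

```python
import re

# Hypothetical short list: only the agents known to cause trouble.
BLOCKED = [
    r"EmailCollector",
    r"^EmailSiphon$",
    r"WebZIP",
    r"^Offline\sExplorer/\d+\.\d+$",
]

# Compile once at startup so each request only pays for the matching.
COMPILED = [re.compile(p) for p in BLOCKED]

def is_blocked(user_agent: str) -> bool:
    """Return True if any blocked pattern matches the user agent.

    This loop runs on every request, so its cost grows with the
    number of patterns in the list.
    """
    return any(p.search(user_agent) for p in COMPILED)

print(is_blocked("EmailSiphon"))
print(is_blocked("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"))
```

With four patterns the loop is negligible; with well over a hundred, every page view runs that many regex searches before anything is served, which is the reason to keep the list short.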