Block spam bots and evil web scrapers

I noticed on one server today a huge CPU load. A quick look at Netstat showed that most of the current traffic was coming from someone in Africa. I crossed referenced the IP address from Netstat with the access log files on the various sites on the server and saw that it had a UserAgent of 'DTS Agent'. A quick Google showed this to be a scraper for email contacts.

It was time that I added a little something for these spam bots to help reduce their ability to see my sites. Simply adding the following code to an .htaccess file in the root of the site structure would help to curb evil good for nothing spam bots:

[Last Updated: April 7th 2009]

RewriteEngine On
RewriteCond %{HTTP:User-Agent} (?:Alexibot|Art-Online|asterias|BackDoorbot|Black.Hole|BlackWidow|BlowFish|botALot|BuiltbotTough|Bullseye|BunnySlippers|Cegbfeieh|Cheesebot|CherryPicker|ChinaClaw|CopyRightCheck|cosmos|Crescent|Custo|DISCo|DittoSpyder|DownloadsDemon|eCatch|EirGrabber|EmailCollector|EmailSiphon|EmailWolf|EroCrawler|ExpresssWebPictures|ExtractorPro|EyeNetIE|FlashGet|Foobot|FrontPage|GetRight|GetWeb!|Go-Ahead-Got-It|Go!Zilla|GrabNet|Grafula|Harvest|hloader|HMView|httplib|HTTrack|humanlinks|ImagesStripper|ImagesSucker|IndysLibrary|InfonaviRobot|InterGET|Internet\sNinja|Jennybot|JetCar|JOC\sWeb\sSpider|Kenjin.Spider|Keyword.Density|larbin|LeechFTP|Lexibot|libWeb/clsHTTP|LinkextractorPro|LinkScan/8.1a.Unix|LinkWalker|lwp-trivial|Mass\sDownloader|Mata.Hari|Microsoft.URL|MIDown\stool|MIIxpc|Mister.PiX|Mister\sPiX|moget|Mozilla/3.Mozilla/2.01|Mozilla.*NEWT|Navroad|NearSite|NetAnts|NetMechanic|NetSpider|Net\sVampire|NetZIP|NICErsPRO|NPbot|Octopus|Offline.Explorer|Offline\sExplorer|Offline\sNavigator|Openfind|Pagerabber|Papa\sFoto|pavuk|pcBrowser|Program\sShareware\s1|ProPowerbot/2.14|ProWebWalker|ProWebWalker|psbot/0.1|QueryN.Metasearch|ReGet|RepoMonkey|RMA|SiteSnagger|SlySearch|SmartDownload|Spankbot|spanner|Superbot|SuperHTTP|Surfbot|suzuran|Szukacz/1.4|tAkeOut|Teleport|Teleport\sPro|Telesoft|The.Intraformant|TheNomad|TightTwatbot|Titan|toCrawl/UrlDispatcher|toCrawl/UrlDispatcher|True_Robot|turingos|Turnitinbot/1.5|URLy.Warning|VCI|VoidEYE|WebAuto|WebBandit|WebCopier|WebEMailExtrac.*|WebEnhancer|WebFetch|WebGo\sIS|Web.Image.Collector|Web\sImage\sCollector|WebLeacher|WebmasterWorldForumbot|WebReaper|WebSauger|Website\seXtractor|Website.Quester|Website\sQuester|Webster.Pro|WebStripper|Web\sSucker|WebWhacker|WebZip|Wget|Widow|[Ww]eb[Bb]andit|WWW-Collector-E|WWWOFFLE|Xaldon\sWebSpider|Xenu's|Zeus|DTS\sAgent) [NC]
RewriteRule .* - [F]
ErrorDocument 403 /403.html
 
# IF THE UA STARTS WITH THESE
SetEnvIfNoCase ^User-Agent$ .*(aesop_com_spiderman|backweb|bandit|batchftp|bigfoot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(black.?hole|blackwidow|blowfish|botalot|buddy|builtbottough|bullseye) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(cheesebot|cherrypicker|chinaclaw|collector|copier|copyrightcheck) HTTP_SAFE_BADBOT  
SetEnvIfNoCase ^User-Agent$ .*(cosmos|crescent|diibot|dittospyder|dragonfly) HTTP_SAFE_BADBOT      
SetEnvIfNoCase ^User-Agent$ .*(drip|easydl|ebingbong|ecatch|eirgrabber|emailcollector|emailsiphon) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(emailwolf|erocrawler|exabot|eyenetie|filehound|flashget|flunky) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(frontpage|getright|getweb|go.?zilla|go-ahead-got-it|gotit|grabnet) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(grafula|harvest|hloader|hmview|httrack|humanlinks|ilsebot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(infonavirobot|infotekies|interget|iria|jennybot|jetcar) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(justview|jyxobot|kenjin|keyword|larbin|leechftp|lexibot|lftp) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(likse|linkscan|linkwalker|lnspiderguy|magnet|mag-net|markwatch) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(mata.?hari|memo|microsoft.?url|midown.?tool|miixpc|mirror|missigua) HTTP_SAFE_BADBOT  
SetEnvIfNoCase ^User-Agent$ .*(mister.?pix|moget|mozilla.?newt|nameprotect|navroad|backdoorbot|nearsite) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(net.?vampire|netants|netmechanic|netspider|nextgensearchbot) HTTP_SAFE_BADBOT 
SetEnvIfNoCase ^User-Agent$ .*(attach|nicerspro|nimblecrawler|npbot|octopus|offline.?explorer) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(offline.?navigator|openfind|outfoxbot|pagegrabber|pavuk) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(pcbrowser|php.?version.?tracker|pockey|propowerbot|prowebwalker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(psbot|pump|queryn|recorder|realdownload|reaper|true_robot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(repomonkey|internetseer|sitesnagger|siphon|slysearch|smartdownload) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(snake|snapbot|snoopy|sogou|spacebison|spankbot|spanner|sqworm|superbot) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(superhttp|surfbot|asterias|suzuran|szukacz|takeout|teleport) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(telesoft|the.?intraformant|thenomad|tighttwatbot|titan|urldispatcher) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(turingos|urly.?warning|vacuum|voideye|whacker) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(widow|wisenutbot|wwwoffle|xaldon|xenu|zeus|zyborg|anonymouse) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*web(zip|emaile|enhancer|fetch|go.?is|auto|bandit|clip|copier|reaper|sauger|site.?quester|whack) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(craftbot|download|extract|stripper|sucker|ninja|clshttp|webspider|leacher|collector|grabber|webpictures) HTTP_SAFE_BADBOT
SetEnvIfNoCase ^User-Agent$ .*(dts\sagent) HTTP_SAFE_BADBOT

This code is a cleaned up version of the code found here and here. Note that you really should look through all the User Agents and be sure you are not blocking someone or some software that you would like to keep. The list above is far from complete, but is a good start.

For a fairly up-to-date list of User Agents you will find a useful User Agent database here.

Read more…

Comments