Blocking bots with mod_security

31 Oct 2007
mod_security is an awesome Apache module to have in your security arsenal -- beyond the extra security it gives you, it can also be used to protect the content on your blog or website.
There are a few reasons to block certain bots: 1) if you run Apache as a proxy, spammers may try to use it to "fake" clicks on advertisements on their pages, 2) they may use it to send plain old email spam, and 3) bots may steal content from your site and use it as filler on sites that exist only to serve ads. mod_security can prevent all of this.
The key is to leverage mod_security's ability to perform certain actions based on the HTTP requests it receives from the client. In this case, clients include legitimate web browsers, search engine crawlers, and nefarious bots looking to mirror content on their own sites and make advertising money.
The key field in the HTTP request is called the "User-Agent". This is the string that clients use to identify themselves. For example, when the Google spider fetches a page it declares its user agent to be "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)", the Yahoo spider identifies itself as "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)", and finally my Firefox browser, currently version 2.0.0.8, identifies itself as "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.8) Gecko/20071008 Firefox/2.0.0.8".
It is worth noting that clients have full control over how they identify themselves in the User-Agent header, and many choose to mask their existence by looking like a standard web browser. However, many quick-and-dirty bots and scrapers fail to do so, either because they don't care or because their authors don't know enough about the HTTP protocol to know why they would want to.
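To see both behaviors in action, here is a small Python sketch. By default, Python's urllib announces itself honestly as "Python-urllib/<version>" (which is exactly why a string like that ends up on block lists), but a single header override lets a client claim to be anything it likes. The URL and the fake browser string are just placeholders for illustration.

```python
import urllib.request

# urllib's default User-Agent is "Python-urllib/<version>" -- honest,
# and therefore easy for a server to block.
opener = urllib.request.build_opener()
default_ua = dict(opener.addheaders)["User-agent"]
print(default_ua)  # e.g. "Python-urllib/3.11"

# But the client controls the header entirely: here we masquerade as a
# browser-like agent (a made-up string, just for demonstration).
req = urllib.request.Request(
    "http://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (compatible; MyBot/1.0)"},
)
print(req.get_header("User-agent"))
```

This is why blocking by User-Agent only catches the lazy or careless bots -- anyone determined to hide will simply lie in this header.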
There are two approaches: 1) block everything but a pre-approved whitelist, or 2) allow everything except a denied blacklist. The first approach doesn't really work, because we don't want to update our Apache configuration every time a new web browser comes out, so we are left with blocking a set of the most egregious bots.
Ideally we want our configuration to fail gracefully when mod_security isn't available (if we move servers, for example).
Below is the configuration I use on my Apache host.
I block bots that try to proxy to other sites (presumably to fake ad clicks) and ones that try to connect to port 25 -- the email port -- to send spam. I comment the configuration inline (the lines that begin with #), but here is a quick overview:
The <IfModule> block tells Apache only to apply the enclosed configuration if the mod_security module is present. This lets me copy the configuration around to different hosts where I may or may not have mod_security without breaking the Apache config.
The beginning just sets up mod_security and configures where we will log the traffic we have blocked. The first interesting configuration directive, "REQUEST_METHOD", blocks two HTTP request types that probe the server (TRACE) or can connect to a mail server to relay spam (CONNECT). The next line, "REQUEST_URI_RAW", blocks requests that are not relative to the server (a relative request looks like /help.html) but instead contain the full protocol and host, which means the client is probably trying to use the server as a proxy to a remote site. The remaining "REQUEST_HEADERS" lines block access depending on the headers in the HTTP request. We must specify which header we want to examine, in our case the "User-Agent", and what identifier we want to look for.
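The matching logic of these three rule types can be sketched in a few lines of Python. This is only an illustration of what the rules check, not how ModSecurity evaluates them internally; the `would_block` function and its test values are hypothetical, and the method check is written case-insensitively on the assumption that real requests arrive as uppercase CONNECT/TRACE.

```python
import re

def would_block(method, uri, user_agent):
    """Rough sketch of the three rule types from the config below."""
    # REQUEST_METHOD rule: refuse TRACE and CONNECT probes.
    if re.fullmatch(r"(?i)connect|trace", method):
        return True
    # REQUEST_URI_RAW rule: a full "http:/..." URI instead of a
    # relative path means someone is trying to proxy through us.
    if "http:/" in uri:
        return True
    # REQUEST_HEADERS:User-Agent rules: known bad bots.
    banned = ("VadixBot", "radianrss-1.0", "Python-urllib/2.5")
    return any(bot in user_agent for bot in banned)

print(would_block("CONNECT", "mail.example.com:25", ""))        # True
print(would_block("GET", "http://other-site.example/ads", ""))  # True
print(would_block("GET", "/help.html", "Python-urllib/2.5"))    # True
print(would_block("GET", "/help.html", "Mozilla/5.0 (X11)"))    # False
```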
Hope this helps.
<IfModule mod_security2.c>
    # Basic configuration options
    SecRuleEngine On
    SecRequestBodyAccess On
    SecResponseBodyAccess Off

    SecDebugLog /var/log/apache2/modsec_debug.log
    SecDebugLogLevel 0

    # Serial audit log
    SecAuditEngine RelevantOnly
    SecAuditLogRelevantStatus ^5
    SecAuditLogParts ABIFHZ
    SecAuditLogType Serial
    SecAuditLog /var/log/apache2/modsec_audit.log

    # Block HTTP requests that have http://, proxy spam, or ones that go to port 25
    SecRule REQUEST_METHOD "^((?:connect|trace))$" "log,drop"
    SecRule REQUEST_URI_RAW "http:/" "log,drop"
    SecRule REQUEST_HEADERS:User-Agent "VadixBot" "log,drop"
    SecRule REQUEST_HEADERS:User-Agent "radianrss-1.0" "log,drop"
    SecRule REQUEST_HEADERS:User-Agent "Python-urllib/2.5" "log,drop"
</IfModule>