Search Engine bots using up bandwidth

In order for a search engine to find your site, it uses something commonly called a 'bot'. A bot is an automated crawler that periodically trawls through every link on your site looking for relevant info. Usually that's a good thing, but some bots are a little overzealous.

For instance, we recently had a user whose site bandwidth was being eaten up at a rate of 4GB a week by a Russian search engine bot from yandex.ru. Yandex.ru is larger than Google in Russia, but its bot was burning through bandwidth trawling a site that would be of little interest to Russian visitors.

Here's how we block it.

First we create a robots.txt file following the instructions at http://www.robotstxt.org/robotstxt.html. There's lots of useful info about bots at robotstxt.org, and usually a robots.txt file is enough.
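As an example, a minimal robots.txt that asks Yandex's crawler to stay away entirely while leaving other bots alone might look like this (the 'Yandex' user-agent token is the one the crawler identifies itself with; adjust the rules to suit your site):

```
# Ask Yandex's crawler not to fetch anything
User-agent: Yandex
Disallow: /

# All other bots may crawl the whole site
User-agent: *
Disallow:
```

This file goes in the web root (e.g. public_html/robots.txt) so bots can find it at /robots.txt.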

But the Yandex search bot ignores robots.txt.

Next, we add the following to the .htaccess file in the public_html web root:

# Enable mod_rewrite
Options +FollowSymLinks
RewriteEngine on
RewriteBase /

# Block bots we don't care about
RewriteCond %{HTTP_USER_AGENT} Yandex [NC]
RewriteRule .* - [F,L]

This works by looking at the 'User-Agent' header of each incoming request and returning a 403 Forbidden response (the [F] flag) whenever it contains 'Yandex'. The [NC] flag makes the match case-insensitive, and the '-' tells Apache not to rewrite the URL, just to block it. It relies on Apache's mod_rewrite module being enabled.
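On Apache 2.4 and later, the same block can also be done without mod_rewrite. This is a sketch using mod_setenvif and mod_authz_core, assuming both modules are enabled on your server:

```
# Tag any request whose User-Agent contains 'Yandex' (case-insensitive)
SetEnvIfNoCase User-Agent "Yandex" bad_bot

# Allow everyone except requests carrying the bad_bot tag
<RequireAll>
    Require all granted
    Require not env bad_bot
</RequireAll>
```

The same SetEnvIfNoCase line can list other bot names too, one directive per bot, all feeding the same bad_bot variable.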
