How to Filter Bots from Your Nginx Log Files

One thing you learn when running your own webserver is just how unfriendly the internet can be. Within moments of opening your HTTP ports, malicious bots start scanning you, looking for vulnerabilities. It’s not too hard to lock things down: you have to be careful and meticulous, and keep up to date with current security best practices. But that still leaves one problem: the bots make a mess of your log files.

Even for a small site like this, there will be hundreds of scanner bots every day, probing for vulnerabilities. They don’t find anything (at least, not yet), but every time a bot pings my site it leaves a trace in the log file. That means my log files are cluttered with the spammy detritus of failed bot attempts. It messes up my server stats, so I can’t see which of my pages are popular. And it’s just plain annoying as well.

I’m going to share the config files I use to filter most bot activity out of my log files. It’s not perfect; it doesn’t catch everything. But it does filter out most of it, and for my purposes it’s good enough. Maybe it will work for you too. For the record, I’m running Debian 11 and the Debian-packaged version of Nginx (1.18), so if you’re using Apache this won’t work for you.

Prebuilt Solutions

I’m certainly not the only admin to have this problem, and other people have written their own solutions to it. I looked around, and one of the best I found was the Nginx Ultimate Bad Bot Blocker. It’s a set of blocker files much more extensive than mine, and it also includes a cron job that updates the files regularly. My setup is a lot simpler than that. Also, I couldn’t make use of the update script, because if you have a certificate installed with Let’s Encrypt (which I do) it has to verify your site via Webroot Authenticator, which isn’t compatible with my wildcard domains.

Directory Setup

For my setup, I like to isolate all my changes into subdirectories. This also ensures that if there’s a package update from Debian, it won’t interfere with any changes I make, and vice versa.

# mkdir /etc/nginx/local-conf.d

Config Files

(By the way, everything here has to be done as root or via sudo, of course.)

In this directory we’re going to create two files:

/etc/nginx/local-conf.d# ls -l
total 8
-rw-r--r-- 1 root root 201 Sep 20 12:26 blackhole.conf
-rw-r--r-- 1 root root 700 Sep 20 12:25 bot-filters.conf

Blackhole is where we’re going to send the bots. Whenever we detect an evil bot scanning us, we’ll reroute it here. These are the contents of blackhole.conf:

# Used for locations that are spammed by intrusion bots
# Get them off the system ASAP, and don't log anything

access_log off;
log_not_found off;
return 444;   # nginx special "drop connection" code

The access_log and log_not_found directives ensure that our logs don’t get cluttered with the bots’ scan attempts. The last directive, return 444, uses a special non-standard nginx code that hard-drops the connection. Usually every HTTP request is answered with a status code: 200, 404, 301, whatever. We could respond with 400 Bad Request, but I feel even that is too much. These are evil scan bots and I don’t want to waste a single byte responding to them. The 444 code is never actually sent to the client; when nginx sees it, it simply closes the connection without bothering to send a response. That’s perfect.

Next, we need to set up our filters for detecting bots. Here is bot-filters.conf:

## Disable common bot and pentests

# Disable the evil Google FLoC
add_header Permissions-Policy interest-cohort=();

# Don't serve dotfiles, except the .well-known dir
location ~ /\.(?!well-known) {
  include local-conf.d/blackhole.conf;
}

# Only parse the 'standard' HTTP verbs
if ($request_method !~ ^(GET|HEAD|PUT|POST|DELETE|OPTIONS|PATCH)$ ){
  return 405;
}

# We're not a wordpress server
location ~ /wp- {
  include local-conf.d/blackhole.conf;
}

# Or a PHP server
location ~ \.php$ {
  include local-conf.d/blackhole.conf;
}

# And we don't do cgi-bin
location ~ /cgi-bin/ {
  include local-conf.d/blackhole.conf;
}

# Or serve ads.txt / app-ads.txt (dot escaped so it only matches a literal ".")
location ~ /(app-)?ads\.txt {
  include local-conf.d/blackhole.conf;
}

This is a set of regex matches that I developed just by scanning my own logs. As I said, it’s by no means perfect, but it seems to catch about 90% of the bots, and that’s good enough for me. Some of these are specific to my own situation. I’m not running WordPress or PHP, and a lot of bots seem to test for vulnerabilities in those systems, so it’s easy for me to filter out any request that assumes I’m running a PHP WordPress site. Every request that matches one of these patterns gets rerouted to the blackhole.
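If your own logs show other probes, the same pattern extends naturally. Here is a minimal sketch of two extra filters; the paths and file extensions are hypothetical examples rather than rules from my own config, so swap in whatever actually shows up in your logs, and make sure you don’t blackhole anything your site genuinely serves.

```nginx
# Hypothetical additions to bot-filters.conf.
# The paths and extensions below are illustrative assumptions only.

# Probes for database admin panels (~* makes the match case-insensitive)
location ~* ^/(phpmyadmin|adminer)(/|$) {
  include local-conf.d/blackhole.conf;
}

# Probes for leftover backup or database dump files
location ~* \.(bak|sql)$ {
  include local-conf.d/blackhole.conf;
}
```

Keep in mind that nginx checks regex locations in the order they appear in the config and uses the first one that matches, so a broad pattern placed early can shadow a more specific one further down.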

Server Config File

Now, to connect everything up, inside my server {...} block I add a line:

include local-conf.d/bot-filters.conf;

and that’s it!
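For context, this is roughly where the include sits in a complete server block. Treat it as a sketch under assumptions: the server name, document root and the catch-all location are placeholders, not my actual site config.

```nginx
# Sketch only: example.com and /var/www/example are placeholder values.
server {
  listen 80;
  server_name example.com;
  root /var/www/example;

  # Relative include paths are resolved against the nginx config
  # directory (/etc/nginx on Debian).
  include local-conf.d/bot-filters.conf;

  # Normal content is served as usual; only requests that match a
  # filter location end up in the blackhole.
  location / {
    try_files $uri $uri/ =404;
  }
}
```

The change only takes effect once nginx reloads its configuration. Running nginx -t first checks the syntax, and systemctl reload nginx then applies it without dropping existing connections.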

If you’re a server admin and this real-life example was useful to you, please let me know by dropping a webmention below.
