Check for fake Googlebot scrapers

I noticed a bot scraping with a fake Googlebot user-agent string.

Here is a one-liner that lists the candidate IPs to ban:

$ awk 'tolower($0) ~ /googlebot/ {print $1}' /var/www/httpd/access_log | grep -v '^66\.249\.71\.' | sort | uniq -c | sort -n

It runs a case-insensitive awk search for the keyword "googlebot" in the Apache log file, removes IPs matching "66.249.71." (a range that belongs to Google), and prints the remaining IPs with their hit counts, sorted from fewest to most hits.
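
The same filtering can be wrapped in a small shell function so that the log path and the trusted prefix are not hard-coded. This is only a sketch, not the one-liner above: the function name fake_googlebot_ips and its two arguments are made up for illustration, and it assumes the client IP is the first field of each log line, as in Apache's default log formats.

```shell
# Hypothetical helper: count hits per IP for requests claiming to be Googlebot,
# skipping IPs that start with a trusted prefix.
# Usage: fake_googlebot_ips LOGFILE [TRUSTED_PREFIX]
fake_googlebot_ips() {
    log=$1
    prefix=${2:-66.249.71.}
    awk -v prefix="$prefix" '
        # index() is a literal match, so the dots in the prefix are not wildcards;
        # != 1 means "the IP does not start with the prefix"
        tolower($0) ~ /googlebot/ && index($1, prefix) != 1 { hits[$1]++ }
        END { for (ip in hits) print hits[ip], ip }
    ' "$log" | sort -n
}
```

For example, fake_googlebot_ips /var/www/httpd/access_log prints one "count IP" pair per line, lowest count first.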

You can validate the IPs with:

IP=66.249.71.37 ; reverse=$(dig -x "$IP" +short | grep '\.googlebot\.com\.$') ; ip=$(dig "$reverse" +short) ; [ -n "$reverse" ] && [ "$IP" = "$ip" ] && echo "$IP GOOD" || echo "$IP FAKE"

Replace the IP value with the one you want to check.
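
To check several IPs without editing the command each time, the same reverse-then-forward lookup can be put into a function. This is a minimal sketch that assumes dig is installed; the function name check_googlebot is hypothetical. Like the command above, it only accepts hostnames under googlebot.com.

```shell
# Hypothetical helper: forward-confirmed reverse DNS check for one IP.
# Prints "<ip> GOOD" or "<ip> FAKE"; returns 0 only for GOOD.
check_googlebot() {
    ip=$1
    # Reverse lookup: keep only a PTR name that ends in .googlebot.com
    host=$(dig -x "$ip" +short | grep -E '\.googlebot\.com\.?$' | head -n 1)
    # Forward lookup: the name must resolve back to the same IP
    if [ -n "$host" ] && dig "$host" +short | grep -qxF "$ip"; then
        echo "$ip GOOD"
    else
        echo "$ip FAKE"
        return 1
    fi
}
```

It can then be run in a loop over the IPs you want to verify, for instance: for ip in 66.249.71.37 203.0.113.5; do check_googlebot "$ip"; done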
