Steps
(1) `ssh` to one of the Drupal site's web servers
(2) `cd` to the site's logs directory
(3) Run the snippet shown in this transcript:
myemployeesite@ded-1234:/var/log/sites/myemployeesite.prod/logs/ded-1234$ min_pct=20; snippet=/mnt/tmp/htaccess-block-snippet.txt; cat /dev/null >$snippet; tail -n 200000 access.log | awk -F\" '{print $6}' >/tmp/tmp-count-$$ && total=$(grep -c . /tmp/tmp-count-$$) && sort /tmp/tmp-count-$$ | uniq -c | sort -nr | awk -v total=$total -v min_pct=$min_pct -v snippet=$snippet 'NR==1 { num_block=0; print "  Count (pct) Value" } { num=$1; $1=""; pct=num/total*100; if (pct>1) { printf("%7d (%2d%%) %s\n", num, pct, $0); } if (pct>min_pct) { ++num_block; sub(/^[ \t\r\n]+/, "", $0); gsub(/\./, "\\.", $0); gsub(/[^a-zA-Z0-9\\.\/:_ -]/, ".", $0); block[num_block]=$0; } } END { print "# " num_block " User-Agents should be blocked"; if (num_block>0) { print "# Add this into .htaccess rules" >snippet; print "#" >snippet; for (i=1; i<=num_block; i++) { print "RewriteCond %{HTTP_USER_AGENT} \"" block[i] "\" " (i<num_block ? "[NC,OR]" : "[NC]") >snippet } print "RewriteRule .* - [F,L]" >snippet } }' && rm /tmp/tmp-count-$$ && cat $snippet
Count (pct) Value
3577 (29%) Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)
706 ( 5%) check_http/v2.2 (monitoring-plugins 2.2)
...
# 1 User-Agents should be blocked
# Add this into .htaccess rules
#
RewriteCond %{HTTP_USER_AGENT} "Mozilla/5\.0 .compatible. SemrushBot/7.bl. .http://www\.semrush\.com/bot\.html." [NC]
RewriteRule .* - [F,L]
The snippet by itself, for copy-and-paste:
min_pct=20; snippet=/mnt/tmp/htaccess-block-snippet.txt; cat /dev/null >$snippet; tail -n 200000 access.log | awk -F\" '{print $6}' >/tmp/tmp-count-$$ && total=$(grep -c . /tmp/tmp-count-$$) && sort /tmp/tmp-count-$$ | uniq -c | sort -nr | awk -v total=$total -v min_pct=$min_pct -v snippet=$snippet 'NR==1 { num_block=0; print "  Count (pct) Value" } { num=$1; $1=""; pct=num/total*100; if (pct>1) { printf("%7d (%2d%%) %s\n", num, pct, $0); } if (pct>min_pct) { ++num_block; sub(/^[ \t\r\n]+/, "", $0); gsub(/\./, "\\.", $0); gsub(/[^a-zA-Z0-9\\.\/:_ -]/, ".", $0); block[num_block]=$0; } } END { print "# " num_block " User-Agents should be blocked"; if (num_block>0) { print "# Add this into .htaccess rules" >snippet; print "#" >snippet; for (i=1; i<=num_block; i++) { print "RewriteCond %{HTTP_USER_AGENT} \"" block[i] "\" " (i<num_block ? "[NC,OR]" : "[NC]") >snippet } print "RewriteRule .* - [F,L]" >snippet } }' && rm /tmp/tmp-count-$$ && cat $snippet
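For reference, the same logic can be written out across multiple lines. This sketch runs against a small hand-made sample log so it is self-contained; the `/tmp` paths, sample User-Agents, and thresholds are illustrative, and in practice `log` points at the real `access.log`. Note the escaping order: literal dots are escaped first, and only then are remaining regex metacharacters (`(`, `;`, `~`, `+`, ...) replaced with an unescaped `.`, which matches any single character, so the generated pattern still matches the original User-Agent.

```shell
#!/bin/sh
# Build a tiny sample access log in combined format (the User-Agent is the
# 6th field when splitting on double quotes): 9 bot hits, 1 normal hit.
{
  for i in 1 2 3 4 5 6 7 8 9; do
    echo '1.2.3.4 - - [01/Jan/2024:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "BadBot/1.0"'
  done
  echo '5.6.7.8 - - [01/Jan/2024:00:00:09 +0000] "GET /page HTTP/1.1" 200 512 "-" "GoodBrowser/2.0"'
} > /tmp/sample-access.log

min_pct=20                               # block any UA above this share of traffic
snippet=/tmp/htaccess-block-snippet.txt  # where the .htaccess rules are written
log=/tmp/sample-access.log               # use the real access.log in practice

: > "$snippet"   # truncate any previous snippet

# 1. Pull the User-Agent out of the most recent lines.
tail -n 200000 "$log" | awk -F'"' '{print $6}' > /tmp/tmp-count-$$

# 2. Tally the UAs, most frequent first, and emit block rules for the big ones.
total=$(grep -c . /tmp/tmp-count-$$)
sort /tmp/tmp-count-$$ | uniq -c | sort -nr |
awk -v total="$total" -v min_pct="$min_pct" -v snippet="$snippet" '
  NR==1 { num_block=0; print "  Count (pct) Value" }
  {
    num=$1; $1=""
    pct = num/total*100
    if (pct > 1) printf("%7d (%2d%%) %s\n", num, pct, $0)
    if (pct > min_pct) {
      ++num_block
      sub(/^[ \t\r\n]+/, "", $0)              # trim the leading separator
      gsub(/\./, "\\.", $0)                   # escape literal dots first...
      gsub(/[^a-zA-Z0-9\\.\/:_ -]/, ".", $0)  # ...then metacharacters become ".", matching any char
      block[num_block] = $0
    }
  }
  END {
    print "# " num_block " User-Agents should be blocked"
    if (num_block > 0) {
      print "# Add this into .htaccess rules" > snippet
      print "#" > snippet
      for (i = 1; i <= num_block; i++)
        print "RewriteCond %{HTTP_USER_AGENT} \"" block[i] "\" " (i < num_block ? "[NC,OR]" : "[NC]") > snippet
      print "RewriteRule .* - [F,L]" > snippet
    }
  }'
rm /tmp/tmp-count-$$
cat "$snippet"
```

With this sample data, `BadBot/1.0` is 90% of traffic and gets a block rule, while `GoodBrowser/2.0` at 10% is only reported in the table.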
(4) Add the recommended code to the `.htaccess` file, test it, and push the change live. In this example, the code to add is
RewriteCond %{HTTP_USER_AGENT} "Mozilla/5\.0 .compatible. SemrushBot/7.bl. .http://www\.semrush\.com/bot\.html." [NC]
RewriteRule .* - [F,L]
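These rules assume mod_rewrite is enabled; in a stock Drupal `.htaccess` the rewrite engine is already turned on near the top, so the block drops in below that line. A sketch of where the rules land (surrounding lines and the `BadBot` pattern are illustrative, not taken from a real site):

```apache
<IfModule mod_rewrite.c>
  RewriteEngine on
  # ... existing Drupal rewrite rules ...

  # Block abusive User-Agents identified by the log analysis
  RewriteCond %{HTTP_USER_AGENT} "BadBot/1\.0" [NC]
  RewriteRule .* - [F,L]
</IfModule>
```

Placing the block before Drupal's front-controller rules means the 403 is returned before any PHP runs, which is the point of the exercise.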
Testing
curl -sSLIXGET http://myemployeesitedev.prod.acquia-sites.com/url-that-does-or-doesnt-exist -H "User-Agent: Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)"
For the example above, once this robot is blocked you should get a 403 (Forbidden) response for any URL on the site, whether or not it exists.
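You can also sanity-check a generated pattern locally before deploying, using `grep -Ei` as a rough stand-in for mod_rewrite's case-insensitive regex match. Each metacharacter the generator substituted appears as an unescaped `.` (match any character), while `\.` matches only a literal dot, so the pattern still matches the real User-Agent string:

```shell
ua='Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html)'
pattern='Mozilla/5\.0 .compatible. SemrushBot/7.bl. .http://www\.semrush\.com/bot\.html.'
# The unescaped dots match "(", ";", "~", "+" and ")" in the original UA
printf '%s\n' "$ua" | grep -Eiq -- "$pattern" && echo "pattern matches"
```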