Count IP Addresses in Access Log File: BASH One-Liner

Recently my server was nearly overloaded by a web spider that was severely stupid and/or malfunctioning. It was making multiple requests every second for totally nonsensical URLs.
When I first noticed my server slowing down, I checked my Apache access.log file. Normally, when you want to learn about the real people viewing real pages on your site, VisitorLog is the app to use; but when you're dealing with a crazy web spider, you have to go right to the Apache logs.
Since encodable.com normally gets about 1000 visitors per day anyway, a visual inspection of the logfile did not make it immediately obvious which IP address was making the most requests. There are lots of hits from my own IP, for example, but not enough to slow the server down.
One quick way to see which IP addresses are most active is to sort by them:
cat access.log |cut -d ' ' -f 1 |sort
The cut command there simply throws away all the output except for the first field on each line, which is the IP address. Then we sort them. We can then scroll up through the terminal window and get a quick-and-dirty visual indication of which IP is most prevalent.
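For example, in a typical Apache combined-format log the client IP is the first space-separated field (the exact layout depends on your LogFormat directive, so treat this line as illustrative):

124.115.3.33 - - [01/Jan/2009:12:34:56 -0500] "GET /some/url HTTP/1.1" 200 2326

Running that through cut -d ' ' -f 1 keeps just the 124.115.3.33 part.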
But in my case, I had quite a few IPs with several hundred hits, and that's not enough to cause a problem. I needed to see which ones were in the thousands, but when you're scrolling up through terminal output, it's not especially easy to see the difference between, say, 500 lines and 1000 lines.
I needed an actual count of the number of times each IP address appeared in the access log. I came up with the following BASH one-liner to do it (split onto multiple lines here only for readability):
FILE=/path/to/access.log
for ip in `cat $FILE |cut -d ' ' -f 1 |sort |uniq`; do
  COUNT=`grep "^$ip " $FILE |wc -l`
  if [[ "$COUNT" -gt "500" ]]; then
    echo "$COUNT: $ip"
  fi
done
First it creates a for-loop based on the output of the uniq command, so each iteration of the loop handles a different unique IP from the log. It then greps the log for that IP (the trailing space in the pattern keeps one IP from also matching a longer IP that starts with the same digits, e.g. 124.115.3.3 matching lines for 124.115.3.33) and uses "wc -l" to count the lines in the output. Finally, if the count is greater than 500, it displays the count and the IP, like so:
6975: 124.115.3.33
5648: 124.115.5.169
1514: 66.219.73.236
1451: 74.204.11.20
As you can see, the stupid spider was coming from the 124.115.* IP range.
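Incidentally, once you know the offending range, grep alone can count every hit from it at once. A minimal sketch (the dots are escaped so they match literally, and -c prints the number of matching lines rather than the lines themselves):

grep -c '^124\.115\.' /path/to/access.log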
UPDATE: even easier: the uniq command has a -c option that does most of this work automatically, counting the occurrences of each unique line. Then a quick sort -n and a tail show the big ones. Also, I tend to use "cut" as above, but one of the Dreamhost guys reminded me that awk may be a little more straightforward:
cat /path/to/access.log |awk '{print $1}' |sort |uniq -c |sort -n |tail
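With the same log as above, the tail end of that output would look something like this: uniq -c puts the count first (left-padded), and sort -n leaves the biggest offenders at the bottom:

   1451 74.204.11.20
   1514 66.219.73.236
   5648 124.115.5.169
   6975 124.115.3.33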