Data Mining Apache Logfiles

Creative Commons License jennifrog
This is a guest post by Thomas Eisenbarth. Thomas studied computer science at the University of Augsburg, currently works at BINconsult GmbH, Berlin and co-founded makandra GmbH in Augsburg. He and his teams develop and operate web applications.

Everybody knows those tools for analyzing logfiles written out by your favorite httpd: awstats, WebAlizer (yes, this 1970-GUI-thing is still around), etc.

As one of those guys who insist on a certain level of security, I could cry when I see that there are still people who run PHP in three versions, Perl, Python and every module on this planet in production environments just to get the output of one of those analyzers. Dear colleagues: Please stop it!

Another problem with the tools mentioned is their granularity. A few days ago I had to do some capacity planning for a customer, and I wanted to know the exact number of requests per second on their static web server.

I installed awstats locally on my notebook and created a static report after scp’ing the logfile from the remote server. The tool was a pain to configure; I finally managed it, but was not really happy with the results. It is perfectly okay for a monthly view to get an idea of where your visitors come from, what browser they pretend to use and so on. I haven’t seen a tool that gives a very in-depth view of just a few minutes of traffic, so I decided to code it myself.

Go away or I will replace you …

… with a very small shell script. Ta-dah:


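#!/bin/sh
h=$1; m=$2; i=0    # hour, minute, second counter
while [ $i != 60 ]; do
    # zero-pad the second to match the log timestamp
    if [ $i -lt 10 ]; then s="0$i"; else s="$i"; fi
    echo -n "$h:$m:$s: "
    grep "26/Feb/2009:$h:$m:$s" host_filename.log.2009-02-27 | wc -l
    i=`expr $i + 1`
done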

You have to modify the date and the filename according to your setup (or just enhance the script a bit and mail it back to me :-))
Start it with the hour as first and minute as second parameter like this:
./ 15 58

Beware: This script is completely inefficient because it greps the whole logfile 60 times, but hey… the file system cache of your favourite OS will jump at the chance to demonstrate its amazing performance… No kidding: If your webserver is not idle, you should seriously think about not running the script on your production machine.
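If the 60 passes bother you, the same numbers can be collected in one pass. The following is only a sketch, not the script I actually ran: it assumes the usual Apache common/combined log format, where the timestamp appears as "[26/Feb/2009:15:58:00 +0000]", and the function name count_minute and its argument order are made up for illustration.

```shell
# Count requests per second for one minute in a single pass over the log.
# Hypothetical usage: count_minute LOGFILE DAY HOUR MINUTE
# Assumes timestamps of the form [26/Feb/2009:15:58:00 +0000].
count_minute() {
    awk -v day="$2" -v h="$3" -v m="$4" '
        BEGIN {
            # Literal prefix shared by every line of the wanted minute
            prefix = "[" day ":" h ":" m ":"
        }
        {
            p = index($0, prefix)          # plain string search, no regex
            if (p == 0) next               # line belongs to another minute
            # The two characters after the prefix are the second (00-59)
            sec = substr($0, p + length(prefix), 2)
            count[sec + 0]++
        }
        END {
            # Print all 60 seconds, including those without any request
            for (s = 0; s < 60; s++)
                printf "%s:%s:%02d: %d\n", h, m, s, count[s]
        }
    ' "$1"
}
```

Called as count_minute host_filename.log.2009-02-27 26/Feb/2009 15 58, it should print the same 60 lines as the loop above while reading the file only once.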

To give you an idea of the time it takes to produce the output: I ran the script on the live machine (pretty late in the evening). It is an old Dual-Xeon running at 3.2 GHz with 2 GB RAM and (compared to new SAS HDDs) slow SATA disks behind a hardware controller in RAID 1 (so maybe reading is a bit faster). The logfile for this day was 1.4 GB in size:

foo:~>date;./ 15 58;date
Mon Mar 9 22:28:41 GMT 2009
15:58:00: 82
15:58:01: 92
15:58:02: 115
15:58:57: 66
15:58:58: 39
15:58:59: 32
Mon Mar 9 22:30:11 GMT 2009

That is 90 seconds for 60 passes over a 1.4 GB file on old hardware, so you should get your results in an acceptable amount of time on modern machines.

You want more?

With some small enhancements you could additionally iterate over the hours, modify the output to write everything in a file and use sort(1) to find the maximum requests per second per day – if it’s of use for you.
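One possible shape for that enhancement, again just a sketch: instead of looping over hours and minutes, it counts every second of the file in one go and lets sort(1) put the busiest seconds on top. It assumes that the fourth whitespace-separated field of each line is the timestamp ("[26/Feb/2009:15:58:00"), which holds for the common and combined log formats unless the user field contains blanks. The name busiest_seconds is hypothetical.

```shell
# Print the N busiest seconds of a logfile (default: 10), busiest first.
# Hypothetical usage: busiest_seconds LOGFILE [N]
# Assumes field 4 is the timestamp, e.g. "[26/Feb/2009:15:58:00".
busiest_seconds() {
    # Strip the leading "[" so each line is just day:hour:minute:second,
    # count identical timestamps, then sort by count (descending).
    awk '{ print substr($4, 2) }' "$1" \
        | sort \
        | uniq -c \
        | sort -k1,1nr -k2 \
        | head -n "${2:-10}"
}
```

busiest_seconds host_filename.log.2009-02-27 1 would then give the maximum requests per second for that day together with the second in which it occurred.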

Simple as that

I ran the script against the logfile to get a better idea of the requests per second on this Apache web server, and I noticed something interesting. Our customer has clients installed at many locations in Europe which poll data from this machine. I wondered what the heck was going on in the first 3 seconds of every minute until I realized that those clients all check for those files in exactly these first seconds. Seems obvious after closer inspection…
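A pattern like this shows up even more clearly if you add up the requests by their position within the minute, over the whole file. This is a hedged sketch of how that could look, not something I ran back then; it assumes the timestamp is the fourth whitespace-separated field, and the name by_second_of_minute is invented for the example.

```shell
# Sum requests by second-of-minute (00-59) across the whole logfile.
# Hypothetical usage: by_second_of_minute LOGFILE
# Assumes field 4 is the timestamp, e.g. "[26/Feb/2009:15:58:00".
by_second_of_minute() {
    awk '
        {
            # Split "[day/mon/year:hh:mm:ss" on colons; part 4 is the second
            n = split($4, t, ":")
            if (n == 4) total[t[4] + 0]++
        }
        END {
            # One line per second of the minute, including empty ones
            for (s = 0; s < 60; s++)
                printf "%02d %d\n", s, total[s]
        }
    ' "$1"
}
```

If clients poll at the top of every minute, the lines for 00, 01 and 02 should stand out clearly from the rest.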

See you next time!

One thought on “Data Mining Apache Logfiles”

  1. Hmm... what about this?
    grep "20/Apr/2009:23:45:" access_log | awk -F: '{print $2":"$3":"substr($4,0,3)}' | sort | uniq -c
    6 23:45:00
    4 23:45:01
    3 23:45:02

