We had a short load spike on our application servers a couple weeks back. A load of 28 on a 4-core machine is more than uncomfortable – it’s downright dangerous. Luckily, it only lasted for 2 minutes and, just as suddenly as it came, vanished again without a trace. Well, that’s not quite true, because this is exactly why we keep those gigabytes of Apache logs lying around!
I knew where the information was, but how to crawl through that >1GB log and find it? A few years back I had done some similar research through JBoss application logs using Splunk. It was a fairly pleasant experience and I wondered if they were still around.
Turns out they are – Splunk is now offering version 3.4.9 for download and it was trivial to setup locally and start crunching data.
How did I know I had a peak in load? Well, we have Munin installed on all servers and the entire office keeps an eye on the health of our site.
Pretty obvious when you’ve got the proper monitoring and awareness running.
Splunk Twitter Support
After initially visiting the Splunk site, I read that the “free” version was limited to indexing of only 500MB of data. Hmmm, a 1.2GB logfile seems to leave me out in the cold. So I tweeted my frustrations as a warning to other would-be Splunk’ers out there and went to work slicing and dicing the log file to get under the 500MB ceiling.
Within a few hours @michaelwilde responded that this limit was only enforced after crossing this threshold 7 times within a 14 day period. Nice! So I re-ran the full index (which only took about 15 minutes on my lappy) and got to work.
What to look for
I’ll leave the installation Splunk as an excercise for the reader. To start it up, just drop into a shell and :
$ /Applications/splunk/bin/splunk start
Checking http port : open
Checking mgmt port : open
Verifying configuration. This may take a while...
Finished verifying configuration.
Checking index directory...
Verified databases: _audit, _blocksignature, _internal, _thefishbucket, history, main, sampledata, splunklogger, summary
All preliminary checks passed.
Starting Splunk web...
Splunk Server started.
The Splunk web interface is at http://localhost:8000
After dropping the log file into the default drop zone (on MacOS /Applications/splunk/var/spool/splunk/) and waiting 10-15 minutes:
1. pull up http://localhost:8000/ and goto the Search area (‘Back to search’)
2. verify your dataset is loaded by searching for “GET” during the time in question
3. narrow the search by looking for known load inducing activity
4. click on the spike time and then the “Report results” link
5. click on clientip in the left hand reporting column
6. click on this IP address to see what action they did
So, someone with IP addresss 220.127.116.11 made 190 searches for ‘darmkrebs’ between 1119 – 1120. As we don’t yet cache search results, and the server typically takes 2-4 seconds to return a search query, 190 searches in 2 minutes is extremely uncomfortable. Doing a reverse DNS lookup for this IP reports 77-22-235-134-dynip.superkabel.de, a dynamic IP which appears to belong to Kabel Deutschland.
*Note: Splunk will completely consume this log file so if you need a local copy, don’t
mv the file there!
What I did
First off, I figured out howto setup a simple Redirect 503 for searches in the Apache virtual host entry on the test environment. Next, I accessed our test server from our ILO site, shut off the networking and logged into the server via the remote terminal console. I then uncommented the entry, restarted Apache, and then restarted the networking through the ILO interface. After verifying that any search URL returned a 503, I reenabled search functionality on the test server.
Performing this emergency step first gave me confidence with using our ILO and proved I could gracefully disable the search functionality without shutting down the whole server.
We’re now working on a more permanent solution for caching search results to avoid such obviously trivial (yet incredibly damaging) attacks.
What’s your favorite “research” tool for digging through logfiles for those gold nuggets of information?