Splunking for Spikes


We had a short load spike on our application servers a couple weeks back. A load of 28 on a 4-core machine is more than uncomfortable – it’s downright dangerous. Luckily, it only lasted for 2 minutes and, just as suddenly as it came, vanished again without a trace. Well, that’s not quite true, because this is exactly why we keep those gigabytes of Apache logs lying around!

I knew where the information was, but how to crawl through that >1GB log and find it? A few years back I had done some similar research through JBoss application logs using Splunk. It was a fairly pleasant experience and I wondered if they were still around.

Turns out they are – Splunk is now offering version 3.4.9 for download and it was trivial to set up locally and start crunching data.

What happened

How did I know I had a peak in load? Well, we have Munin installed on all servers and the entire office keeps an eye on the health of our site.

Load spike

Pretty obvious when you’ve got the proper monitoring and awareness running.

Splunk Twitter Support

After initially visiting the Splunk site, I read that the “free” version was limited to indexing only 500MB of data. Hmmm, a 1.2GB logfile seems to leave me out in the cold. So I tweeted my frustrations as a warning to other would-be Splunk’ers out there and went to work slicing and dicing the log file to get under the 500MB ceiling.
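If you do end up needing to stay under a limit like that, something as blunt as grepping out the hour around the spike does the trick. A quick sketch (the date here is just a placeholder for whichever hour you’re after):

$ grep '07/Mar/2009:11:' access_log > access_log.spike
$ ls -lh access_log.spike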

Within a few hours @michaelwilde responded that this limit was only enforced after crossing the threshold 7 times within a 14-day period. Nice! So I re-ran the full index (which only took about 15 minutes on my lappy) and got to work.

What to look for

I’ll leave the installation of Splunk as an exercise for the reader. To start it up, just drop into a shell and run:

$ /Applications/splunk/bin/splunk start
Checking prerequisites...
Checking http port [8000]: open
Checking mgmt port [8089]: open
Verifying configuration. This may take a while...
Finished verifying configuration.
Checking index directory...
Verifying databases...
Verified databases: _audit, _blocksignature, _internal, _thefishbucket, history, main, sampledata, splunklogger, summary
...
All preliminary checks passed.
Starting splunkd...
Starting Splunk web...
Splunk Server started.
...
The Splunk web interface is at http://localhost:8000

After dropping the log file into the default drop zone (on Mac OS X, /Applications/splunk/var/spool/splunk/) and waiting 10-15 minutes:

1. pull up http://localhost:8000/ and go to the Search area (‘Back to search’)

2. verify your dataset is loaded by searching for “GET” during the time in question
Verifying data loaded

3. narrow the search by looking for known load-inducing activity
Finding the spike in Apache

4. click on the spike time and then the “Report results” link

5. click on clientip in the left-hand reporting column
Client IP Report

6. click on this IP address to see what actions they performed
Gotcha

So, someone with IP address 77.22.235.134 made 190 searches for ‘darmkrebs’ between 11:19 and 11:20. As we don’t yet cache search results, and the server typically takes 2-4 seconds to return a search query, 190 searches in 2 minutes is extremely uncomfortable. Doing a reverse DNS lookup for this IP reports 77-22-235-134-dynip.superkabel.de, a dynamic IP which appears to belong to Kabel Deutschland.
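If you’d rather type than click, the same drill-down can be run straight from the search bar. Roughly (a sketch only, assuming the default access_combined extraction gives you the clientip field, which the reporting column above suggests it does):

darmkrebs | stats count by clientip
clientip="77.22.235.134"

Both restricted to the spike’s time range, of course. And the reverse lookup is just a shell one-liner:

$ host 77.22.235.134
134.235.22.77.in-addr.arpa domain name pointer 77-22-235-134-dynip.superkabel.de.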

*Note: Splunk will completely consume this log file so if you need a local copy, don’t mv the file there!
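In other words, cp rather than mv (the source path here is just an example):

$ cp ~/access_log /Applications/splunk/var/spool/splunk/

Anything dropped into that spool directory gets batch-indexed and then deleted once Splunk is done with it, which is exactly what you want for a throwaway copy.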

What I did

First off, I figured out how to set up a simple Redirect 503 for searches in the Apache virtual host entry on the test environment. Next, I connected to our test server through its iLO interface, shut off the networking, and logged into the server via the remote terminal console. I then uncommented the Redirect entry, restarted Apache, and brought the networking back up through iLO. After verifying that any search URL returned a 503, I re-enabled search functionality on the test server.
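For the curious, the vhost entry itself is a one-liner courtesy of mod_alias, which lets Redirect hand back a bare status code instead of an actual redirect. Roughly like this, assuming your search pages live under /search (substitute your own path):

# return 503 Service Unavailable for anything under /search
Redirect 503 /search

Comment it out for normal operation, then uncomment it and restart Apache when you need to shed the load.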

Performing this emergency step first gave me confidence using our iLO and proved I could gracefully disable the search functionality without shutting down the whole server.

We’re now working on a more permanent solution for caching search results to avoid such obviously trivial (yet incredibly damaging) attacks.

What’s your favorite “research” tool for digging through logfiles for those gold nuggets of information?

2 thoughts on “Splunking for Spikes”

  1. Epic blog post… nicely done, and thanks for the props (and I don’t mean “props.conf”). 😉

    When you said “Note: Splunk will completely consume this log file so if you need a local copy, don’t mv the file there!”, were you using the spool input so Splunk would eat & delete, or were you just monitoring the file? Unless you ran into a bug with file monitoring, Splunk doesn’t typically delete the files it is monitoring.

    Michael Wilde
    Splunk Ninja
    SplunkNinja.com – A social network for Splunk enthusiasts.


  2. Thanks, Michael!

    I was using the spool input as I just copied the logfile down locally. Wasn’t really surprising to me that it did consume the file, but I just wanted to let everyone else know (because sometimes pulling down a Gig or two takes some time ;)).

