brykmantra
A couple of months ago, I told you about how simple it was to get Munin up and running. Today, I want to show you some of the successes I’ve had in getting our operations under control and how Munin was key in doing so.
Don’t Suck Your Own Bandwidth
Bandwidth is a pretty damn precious commodity in an application environment. Sure, you probably have at least 100 Mbit connections between your servers, but I doubt you have that going to the outside world. Enabling mod_deflate was our first visible success: it cut our outbound traffic by almost 60%.
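To get a feel for why compression pays off so well on text-heavy responses, here is a minimal Python sketch that measures how much DEFLATE shrinks a page. It is not how mod_deflate is configured or implemented: the `deflate_savings` helper and the sample page are made up for illustration, and Python's `zlib` wraps the DEFLATE stream slightly differently than the gzip encoding a browser usually receives. Real savings depend entirely on your own content.

```python
import zlib


def deflate_savings(body: bytes, level: int = 6) -> float:
    """Return the fraction of bytes saved by compressing `body` with DEFLATE."""
    if not body:
        return 0.0
    compressed = zlib.compress(body, level)
    return 1.0 - len(compressed) / len(body)


# Hypothetical, highly repetitive HTML page (an assumption, not real site data).
page = (
    b"<html><body>"
    + b"<div class='row'><span>item</span></div>\n" * 500
    + b"</body></html>"
)

print(f"saved {deflate_savings(page):.0%} of {len(page)} bytes")
```

Markup like this compresses far better than the average response, so treat the number it prints as an upper bound rather than a forecast.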
The savvy reader probably spotted a couple more optimizations in this picture. Yes, the spikes are daily backups, which were taking so long (the thicker spikes in weeks 6-8) that they even started to impact our site's response time in the morning. Turns out we were backing up ~25 GB of archived Apache logs per server every night (yes, the same logs).
And what about the sudden decrease in the inbound traffic (week 9)? I discovered a couple of weeks ago that we were completely rebuilding our search index every 5 minutes and pushing it out to the frontend slaves. I asked the business side whether this was really necessary: “Of course not!” Now, we run it once per day.
Server Load As A Yardstick
Increasing our cache time to 4 hours (week 6 in the image below) gave our servers some extra growing room, and this had to be done before I could enable mod_deflate.
After making sure the load was hunky dory for a full week (week 7), I finally enabled mod_deflate on the weekend. The next Monday was the true test and we passed with flying colors. But, before I could relax, I saw a terrible thing happening on Tuesday (week 8).
In this case, the comparable curve on the IOStat graph pointed me to the filesystem. Turns out that deleting the entire Zend file cache by hand is a bad idea. Zend apparently thought someone else was going to take care of those old pages after that, and it took me a couple of days to discover that we had ~250k files in the caching directory. What finally helped was deleting only those files that were older than the caching time period. Zend recovered on its own after that.
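If you want to script that cleanup, a minimal sketch could look like the following. It is not the exact command I ran. The `prune_stale_files` helper is hypothetical, it assumes the cache lives in one directory tree, and it takes the 4-hour cache time mentioned above as the cutoff. It goes by file modification time, which may or may not match how your cache decides a page is stale.

```python
import os
import time

# Assumption: the 4-hour cache time mentioned above is the cutoff.
CACHE_TTL_SECONDS = 4 * 60 * 60


def prune_stale_files(cache_dir, max_age_seconds=CACHE_TTL_SECONDS, now=None):
    """Delete files under cache_dir last modified more than
    max_age_seconds ago. Returns the number of files removed."""
    now = time.time() if now is None else now
    removed = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            path = os.path.join(root, name)
            try:
                if now - os.path.getmtime(path) > max_age_seconds:
                    os.remove(path)
                    removed += 1
            except FileNotFoundError:
                # The file vanished in the meantime (for example, the cache
                # cleaned it up itself); nothing left to do for this one.
                continue
    return removed
```

Calling `prune_stale_files("/path/to/cache")` (the path is a placeholder) leaves every file younger than the cutoff untouched, so the cache stays warm while the backlog disappears.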
I scan through my Munin graphs at least once a day. It takes about ten seconds of my time and provides me with at least a day’s worth of inner peace. If you’d like some more detailed information about what I’ve done here, feel free to write!