When you’re running any business-critical application, you need to know what’s going on with it. Is it up? Does it put excessive load on your servers? Does it have enough disk space left? How fast is the data on disk growing?
To know all that, you need a tool which a) monitors and tracks all important performance data like CPU load, memory, disk space, slow queries per second, etc. and b) alerts you if any of the monitored values crosses a defined threshold.
Both Munin and Nagios offer these features. Munin started as a pure monitoring tool for “remembering” data. But it soon learned about alerting, too. Nagios is a very powerful alerting tool, but there are plenty of extensions to make it graph as well. The one I use (and discuss here) is nagios-pnp.
Munin Node and Munin Server
Munin runs a munin-node service on every monitored box, which collects the current performance data by running its plugins. The Munin server connects to munin-node via TCP port 4949 to retrieve the data, stores it using RRDtool, and raises an alert if anything goes out of bounds. Thomas has described how to securely tunnel Munin traffic over SSH. That’s definitely better than any unsecured remote connection.
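To make the exchange on port 4949 more concrete, here is a minimal Python sketch of a client that asks munin-node for one plugin's values. It assumes the plain-text protocol (a banner line, a `fetch <plugin>` command, `field.value N` reply lines, and a lone `.` as terminator); the function names `parse_fetch` and `fetch_plugin` are made up for this illustration and are not part of Munin.

```python
import socket


def parse_fetch(lines):
    """Parse the reply to `fetch <plugin>` into a {field: value} dict.

    Data lines look like `load.value 0.42`; a lone `.` ends the reply.
    """
    values = {}
    for line in lines:
        line = line.strip()
        if line == ".":
            break
        if not line or line.startswith("#"):
            continue
        key, _, raw = line.partition(" ")
        field, _, attr = key.rpartition(".")
        if attr != "value" or not raw:
            continue
        # Munin reports "U" when a value is unknown.
        values[field] = None if raw == "U" else float(raw)
    return values


def fetch_plugin(host, plugin, port=4949, timeout=5.0):
    """Connect to munin-node and fetch the current values of one plugin."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("r", encoding="utf-8")
        reader.readline()  # skip the banner line sent on connect
        sock.sendall(f"fetch {plugin}\n".encode("utf-8"))
        lines = []
        for line in reader:
            lines.append(line)
            if line.strip() == ".":
                break
        sock.sendall(b"quit\n")
    return parse_fetch(lines)
```

Calling `fetch_plugin("localhost", "load")` only works from a host that munin-node allows to connect, which is one more reason to tunnel the traffic over SSH.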
Graphing Performance Data With Nagios-PNP or Nagiosgraph
Nagios does not necessarily need any service running on the monitored box. In our setup we let the Nagios server connect to the monitored box via SSH and execute the check commands there. Those check commands return the service status (OK, WARNING, CRITICAL or UNKNOWN) as well as the performance data at check time. Nagios-pnp and Nagiosgraph use RRDtool (on the Nagios server) to store and graph the retrieved performance data. One very nice feature of nagios-pnp, which I miss in Munin, is the ability to zoom into any graph to get a more detailed look at a certain event. Very cool!
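As an illustration of what such a check command hands back, here is a small Python sketch of a disk check. It follows the usual Nagios plugin conventions as I understand them: exit code 0 to 3 for OK, WARNING, CRITICAL and UNKNOWN, and one line of output with the performance data after a `|`. The thresholds, the `used_pct` label and the function names are assumptions chosen for this example.

```python
import shutil

# Exit codes Nagios maps to service states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(used_pct, warn=80.0, crit=90.0):
    """Return (exit_code, output_line) for a disk usage percentage."""
    if used_pct >= crit:
        status = CRITICAL
    elif used_pct >= warn:
        status = WARNING
    else:
        status = OK
    text = f"DISK {STATUS_NAMES[status]} - {used_pct:.1f}% used"
    # Performance data format: label=value[UOM];warn;crit;min;max
    perfdata = f"used_pct={used_pct:.1f}%;{warn:g};{crit:g};0;100"
    return status, f"{text} | {perfdata}"


def main(path="/"):
    """Check one filesystem, print the status line, return the exit code."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        print(f"DISK UNKNOWN - {exc}")
        return UNKNOWN
    status, line = evaluate(100.0 * usage.used / usage.total)
    print(line)
    return status
```

A real plugin would end with `sys.exit(main())` so that Nagios sees the exit code. The part after the `|` is what nagios-pnp or Nagiosgraph would pick up for graphing.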
Munin Plugins and Nagios Plugins
While Munin provides more sophisticated monitoring plugins at MuninExchange (e.g. it measures all imaginable parameters of NFS where Nagios can merely tell you: yes, it’s there and has X GB free), Nagios gives you much more flexibility in accessing the monitored hosts and in modeling your network structure. Writing new plugins is easy for both tools.
I have now switched from Munin to Nagios (with nagios-pnp, but you could use nagiosgraph, too) to enjoy the added benefit of more fine-grained configuration. What I’m missing, though, is the level of detail provided by the better Munin plugins. Time to make Nagios plugins out of those Munin plugins 😉
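To show how little a plugin needs on the Munin side, here is a hedged Python sketch of a load plugin. It assumes the usual Munin plugin contract: called with `config` it prints the graph description, called without arguments it prints `field.value N` lines. The graph title, the thresholds and the function names are invented for this example.

```python
import os


def config_lines():
    """Lines printed when Munin calls the plugin with the `config` argument."""
    return [
        "graph_title Load average (sketch)",
        "graph_vlabel load",
        "graph_category system",
        "load.label 1-minute load",
        "load.warning 4",    # assumed threshold, tune per host
        "load.critical 8",   # assumed threshold, tune per host
    ]


def value_lines(load1):
    """Lines printed on a normal run: one `field.value` line per field."""
    return [f"load.value {load1:.2f}"]


def run(argv):
    """Dispatch on the first argument the way Munin invokes plugins."""
    if len(argv) > 1 and argv[1] == "config":
        return config_lines()
    # os.getloadavg() is only available on Unix-like systems.
    return value_lines(os.getloadavg()[0])
```

An executable script would simply print `"\n".join(run(sys.argv))`. The `load.warning` and `load.critical` lines are the hook Munin's alerting uses, so the same small file covers both graphing and thresholds.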
Are you using Munin or Nagios with Nagios-PNP or Nagiosgraph for monitoring and alerting? What’s your take on Munin vs Nagios? Let us know in the comments!
9 thoughts on “Monitoring tools essentials: Munin vs. Nagios”
Besides nagios-pnp you could use Centreon (http://www.centreon.com/), which integrates graphing and Nagios configuration and comes with a quite sexy interface replacing the “plain” Nagios standard UI. For graphing & reports it uses the Nagios NDO interface. I was giving it a try just a few hours before reading your post. When you copy the main Nagios config files (cgi.cfg, nagios.cfg, ndo2db.cfg, ndomod.cfg) into a separate location, you can configure, run and evaluate Centreon quite isolated in a real environment, with real data but without interfering with normal operations.
Right now I am using Nagios 3.x for monitoring/alerting, and Cacti (http://www.cacti.net/) handles all the graphing. Cacti is an excellent tool for this task and, via SNMP and other interfaces, gives you all the details you want, including NFS usage. There are plenty of predefined configurations (data sources, data queries, graph templates, …) ready to be imported into Cacti, including scripts to get all the data if it does not come from SNMP. Works like a charm, comes with a nice UI and very granular security (including LDAP integration), and I just love it.
We’re trying ganglia for visualization and collectl for stat gathering, so far so good.
I used both and have been very happy doing so 😉
Here is the idea:
Provided Nagios can see the RRD files Munin generates, it can read them via the check_munin_rrd plugin and alert whenever a value rises above a given threshold.
So you can still take advantage of Munin’s awesome plugins and connect Nagios to them.
Cons: because Munin checks only every five minutes by default (*/5), Nagios won’t detect an issue as fast as it could with, say, SNMP, but that’s still acceptable in most cases.
we use munin and nagios together with about 100 servers across two data centers and it works very well.
we have munin submit passive check results to nagios via contact.nagios and send_nsca (it’s in the comments of munin.conf)
the downside is you have to set up all the hosts/services in both nagios and munin which is time consuming to begin with if you have a lot of servers.
but it works very well once set up. munin’s great stats-over-time are available if you need them, and nagios is handling alerts like a pro, with escalations, time periods, etc.
you’ll also find you probably only need to send 2 or 3 munin plugins as passive checks. for us it’s just load, diskspace and mailqueue, as anything else that needs alerting is already covered by our nagios service checks (e.g. check_http, check_website, etc.)
there’s also a check_munin plugin if you want nagios to contact munin-node for active checks.
For distributed and highly complex applications, it’s necessary to also add business transactions into the mix. Focusing solely on metrics such as CPU load won’t tell you whether, for example, an e-commerce transaction like “Add to Cart” or “Check Out” is functioning correctly from tier to tier. The trick is to gain visibility into business transactions without putting excessive overhead on the system in production.
Greg, you’re absolutely right. Adding business process monitoring is a very valuable step as it helps the whole organisation to understand and agree on what’s really important. That’s a big part of the DevOps idea!
I’ll throw another ball into the ring: Graphite (http://graphite.wikidot.com/).
Because we need near-real-time app monitoring, the app is tracked via StatsD (similar to what Etsy does; http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/ ). System data is served by munin-node on the servers, but all data (app and server) is pushed into Graphite, so everything can be mixed together if needed.
Nagios is also in the game; it is needed for server checks and app checks, and to send out notifications …
Configuration comes from a central database, so there is not much work to do.
I used Nagios a while ago, Munin in the meantime but now I am using Nagios with Nagiosgrapher again. Though I had some hard times setting up certain specific perf data, I love the overall simplicity of Nagios. The ability to add new plugins with just a few lines of bash or php code is great.