We’re running separate zones for web, app, and db servers. To know the health of our application and our servers, we rely on pnp4nagios for graphing performance data such as CPU utilization and memory usage. With OpenSolaris zones, there is only one OS kernel running. This differs from e.g. Xen, where every VM runs its own kernel. Such a “one kernel setup” has an important implication for monitoring: within a zone, you see the CPU utilization and memory usage of the whole box (the kernel) instead of what the zone itself uses. None of the available Nagios check scripts is able to report that data by zone.
Nagios Plugin for Monitoring OpenSolaris Zones CPU and MEM
To get CPU and memory data by zone, there is prstat -Z. Executed from within the global zone, it returns a list of all zones with their current memory and CPU utilization:
ZONEID  NPROC  SWAP   RSS    MEMORY  TIME       CPU   ZONE
3       34     2443M  2430M  15%     9:03:05    7.9%  app
4       19     2562M  2014M  12%     288:22:24  1.7%  db
0       51     85M    92M    0.6%    145:57:00  1.6%  global
2       143    483M   192M   1.2%    0:17:52    0.6%  web
1       19     3106M  3111M  19%     0:49:05    0.0%  mem
Wrapped in a script that takes the zone name as a command line parameter, this output can be turned into a Nagios status. The script derives the status from warning and critical thresholds, which are also passed as parameters, and returns it together with Nagios performance data.
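To make the idea concrete, here is a minimal sketch in Python of what such a check could look like. It is not the script I use in production: the function names, the message wording, and the choice of "greater or equal" for threshold comparison are assumptions made for illustration. It reads the zone summary from prstat -Z output, picks one zone, and produces a Nagios exit code plus a status line with performance data.

```python
import subprocess

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_prstat():
    """Run one iteration of prstat -Z (must be called in the global zone)."""
    result = subprocess.run(["prstat", "-Z", "1", "1"],
                            capture_output=True, text=True, check=True)
    return result.stdout


def parse_zone_summary(prstat_output):
    """Return {zone: {"mem": percent, "cpu": percent}} from the zone summary."""
    zones = {}
    in_summary = False
    for line in prstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "ZONEID":
            in_summary = True  # zone rows follow this header
            continue
        if in_summary:
            if len(fields) != 8 or not fields[0].isdigit():
                break  # e.g. the trailing "Total: ..." line
            # Columns: ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
            zones[fields[7]] = {
                "mem": float(fields[4].rstrip("%")),
                "cpu": float(fields[6].rstrip("%")),
            }
    return zones


def check_zone(prstat_output, zone, metric, warn, crit):
    """Return (exit_code, status_line) for one zone; metric is "cpu" or "mem"."""
    zones = parse_zone_summary(prstat_output)
    if zone not in zones:
        return UNKNOWN, f"ZONE UNKNOWN - zone '{zone}' not found in prstat output"
    value = zones[zone][metric]
    if value >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif value >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Nagios performance data: label=value[UOM];warn;crit;min;max
    perfdata = f"{metric}={value:g}%;{warn:g};{crit:g};0;100"
    return code, f"ZONE {label} - {zone} {metric} {value:g}% | {perfdata}"
```

A real plugin would call check_zone(run_prstat(), zone, metric, warn, crit), print the status line, and exit with the returned code. The part after the pipe character is what pnp4nagios picks up for graphing.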
prstat -Z needs to run within the global zone
Usually, our Nagios server calls Nagios check scripts, which need to run locally on the monitored box, via SSH. All our command definitions use $HOSTADDRESS$ as the target host for the SSH connection. $HOSTADDRESS$ resolves to the host under test during a Nagios check run. But prstat -Z needs to run within the global zone. To deal with that, I added the IP address of the global zone as a parameter to the Nagios check call.
In my command definition I changed the SSH target from $HOSTADDRESS$ to $ARG5$, making the check_by_ssh script connect to the global zone instead of the host under test.
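The effect of that change is easiest to see by expanding the macros by hand. The following Python sketch mimics how Nagios substitutes $HOSTADDRESS$ and $ARGn$ into a command line. It is only an illustration: the remote plugin name check_zone, its flags, the meaning of the first four arguments, and the addresses are all hypothetical. Only the use of $ARG5$ as the SSH target comes from my actual setup.

```python
import re


def expand_macros(command_line, check_command, host_address):
    """Expand $HOSTADDRESS$ and $ARGn$ roughly the way Nagios does."""
    # A check_command looks like "name!arg1!arg2!..."
    _name, *args = check_command.split("!")
    macros = {"HOSTADDRESS": host_address}
    macros.update({f"ARG{i}": arg for i, arg in enumerate(args, start=1)})
    # Macros this sketch does not know about are left untouched
    return re.sub(r"\$(\w+)\$",
                  lambda m: macros.get(m.group(1), m.group(0)),
                  command_line)


# Hypothetical remote plugin call, shared by both variants
REMOTE = '"check_zone -z $ARG1$ -m $ARG2$ -w $ARG3$ -c $ARG4$"'

# Before: SSH into the host under test
OLD_COMMAND_LINE = "check_by_ssh -H $HOSTADDRESS$ -C " + REMOTE
# After: SSH into the global zone, whose address is passed as the fifth argument
NEW_COMMAND_LINE = "check_by_ssh -H $ARG5$ -C " + REMOTE

# Hypothetical service check: zone, metric, warn, crit, global zone address
CHECK_COMMAND = "check_zone!app!cpu!80!90!192.0.2.10"
```

Calling expand_macros(NEW_COMMAND_LINE, CHECK_COMMAND, "192.0.2.23") yields a check_by_ssh call against 192.0.2.10, the global zone, while the old command line would have connected to 192.0.2.23, the zone itself.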
Now we get nice CPU and MEM usage graphs for our OpenSolaris zones. As soon as I have added documentation to my Nagios check scripts, I’ll publish them on MonitoringExchange for everyone to scrutinize and maybe even use.
3 thoughts on “Monitoring OpenSolaris Zones with Nagios”
Has this check been published yet? I like the concept and haven’t seen a really good zone resource checker.
I’ve released my zone memory and cpu check scripts for nagios at: http://www.monitoringexchange.org/cgi-bin/page.cgi?g=Detailed%2F3193.html;d=1
Hope it helps!
This is not the ideal way to do this if you have a large number of zones that have a large amount of shared memory on them.
Running this will cause a lot of locks on the shared memory and stall a lot of processes in the wait state because of the high number of prstat commands running.