Monitoring OpenSolaris Zones with Nagios

[Figure: OpenSolaris zone memory graph from pnp4nagios]

We run separate zones for our web, app, and db servers. To keep track of the health of our application and our servers, we rely on pnp4nagios to graph performance data such as CPU utilization and memory usage. With OpenSolaris zones, only one OS kernel is running. This differs from, e.g., Xen, where every VM runs its own kernel. Such a “one kernel setup” has an important implication for monitoring: within a zone, you see the CPU utilization and memory usage of the whole box (the kernel) instead of what the zone itself uses. None of the available Nagios check scripts is able to report that data by zone.

Nagios Plugin for Monitoring OpenSolaris Zones CPU and MEM


To get CPU and memory data by zone, there is prstat -Z. Executed from within the global zone, it returns a list of all zones along with their current memory and CPU utilization:

ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE                        
     3       34 2443M 2430M    15%   9:03:05 7.9% app                       
     4       19 2562M 2014M    12% 288:22:24 1.7% db                         
     0       51   85M   92M   0.6% 145:57:00 1.6% global                      
     2      143  483M  192M   1.2%   0:17:52 0.6% web                        
     1       19 3106M 3111M    19%   0:49:05 0.0% mem                        

Wrapped in a script that takes the zone name as a command line parameter, this output can be turned into a Nagios status. The script derives that status from warning and critical thresholds, which are also passed as parameters, and returns it together with Nagios performance data.
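My scripts are not published yet, so here is only a minimal sketch of the idea in Python, not the actual plugin. It parses the per-zone summary rows of prstat -Z, compares the zone's RSS against the thresholds, and builds a status line with performance data. The function names, the choice of RSS as the checked value, and the argument order are assumptions made for illustration. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and the perfdata layout label=value;warn;crit;min;max follow the usual Nagios plugin conventions.

```python
import subprocess

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Multipliers to convert prstat size suffixes to megabytes
UNITS = {"K": 1 / 1024, "M": 1, "G": 1024, "T": 1024 * 1024}


def to_mb(size):
    """Convert a prstat size such as '2430M' or '3G' to megabytes."""
    return float(size[:-1]) * UNITS[size[-1]]


def parse_zones(prstat_output):
    """Return {zone: {'rss_mb': ..., 'cpu_pct': ...}} from the zone summary rows."""
    zones = {}
    for line in prstat_output.splitlines():
        fields = line.split()
        # Zone rows have 8 columns: ZONEID NPROC SWAP RSS MEMORY TIME CPU ZONE
        if (len(fields) == 8 and fields[0].isdigit()
                and fields[3][-1] in UNITS and fields[6].endswith("%")):
            zones[fields[7]] = {
                "rss_mb": to_mb(fields[3]),
                "cpu_pct": float(fields[6].rstrip("%")),
            }
    return zones


def check_zone_mem(prstat_output, zone, warn_mb, crit_mb, max_mb):
    """Return (exit_code, status_line) for the RSS of one zone."""
    zones = parse_zones(prstat_output)
    if zone not in zones:
        return UNKNOWN, f"UNKNOWN - zone {zone} not found in prstat output"
    rss = zones[zone]["rss_mb"]
    if rss >= crit_mb:
        code, label = CRITICAL, "CRITICAL"
    elif rss >= warn_mb:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Perfdata after the pipe is what pnp4nagios graphs
    perfdata = f"rss={rss:.0f}MB;{warn_mb};{crit_mb};0;{max_mb}"
    return code, f"{label} - zone {zone} RSS {rss:.0f} MB | {perfdata}"


def main(argv):
    """Hypothetical entry point: main([prog, zone, warn_mb, crit_mb, max_mb])."""
    zone = argv[1]
    warn_mb, crit_mb, max_mb = int(argv[2]), int(argv[3]), int(argv[4])
    # One prstat iteration (interval 1, count 1); must run in the global zone
    output = subprocess.run(["prstat", "-Z", "1", "1"],
                            capture_output=True, text=True, check=True).stdout
    code, message = check_zone_mem(output, zone, warn_mb, crit_mb, max_mb)
    print(message)
    return code
```

To use the sketch as a plugin, the file would end with import sys and sys.exit(main(sys.argv)), so that Nagios sees the status as the exit code. Keeping the parsing separate from the prstat call makes it easy to try the logic against captured output like the listing above.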

prstat -Z needs to run within the global zone

Usually, our Nagios server calls check scripts that need to run locally on the monitored box via SSH. All our command definitions use $HOSTADDRESS$ as the target host for the SSH connection. During a check run, $HOSTADDRESS$ resolves to the host under test. But prstat -Z needs to run within the global zone. To deal with that, I added the IP address of the global zone as a parameter to the Nagios check call:

check_command check_by_ssh_zone_mem!4000!5000!app2!5120!10.0.0.1

In my command definition I changed the SSH target from $HOSTADDRESS$ to $ARG5$ making the check_by_ssh script connect to the global zone instead of the host under test.
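To make the substitution concrete, here is a small Python sketch of how Nagios splits a check_command on "!" into $ARG1$ to $ARGn$ and fills them into a command line. The command line itself is an assumption: the remote plugin path and its -w, -c, -z and -m flags are hypothetical, as is my reading of the first four arguments as thresholds, zone name and a maximum. Only -H (host) and -C (remote command) are actual check_by_ssh options, and only $ARG5$ as the global zone address comes from the setup described above.

```python
# Hypothetical command_line; only "-H $ARG5$" reflects the change described above.
COMMAND_LINE = ("check_by_ssh -H $ARG5$ -C "
                "'/usr/local/nagios/libexec/check_zone_mem "
                "-w $ARG1$ -c $ARG2$ -z $ARG3$ -m $ARG4$'")

CHECK_COMMAND = "check_by_ssh_zone_mem!4000!5000!app2!5120!10.0.0.1"


def expand_command(command_line, check_command, host_address):
    """Expand $HOSTADDRESS$ and $ARGn$ roughly the way Nagios does."""
    # The part before the first "!" is the command name, the rest are arguments
    name, *args = check_command.split("!")
    expanded = command_line.replace("$HOSTADDRESS$", host_address)
    for number, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{number}$", value)
    return name, expanded


# 10.0.0.12 stands in for the address of the zone under test (made up here)
name, expanded = expand_command(COMMAND_LINE, CHECK_COMMAND, "10.0.0.12")
print(name)
print(expanded)
```

Because the command line no longer mentions $HOSTADDRESS$, the address of the zone under test never ends up in the SSH call. The connection goes to 10.0.0.1, the global zone, and the zone name travels along as an ordinary argument.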

Now we get nice CPU and MEM usage graphs for our OpenSolaris zones. As soon as I have added documentation to my Nagios check scripts, I’ll publish them on MonitoringExchange for everyone to scrutinize and maybe even use.

3 thoughts on “Monitoring OpenSolaris Zones with Nagios”

  1. Has this check been published yet? I like the concept and haven’t seen a really good zone resource checker.
    Thanks.
    – Susan


  2. This is not the ideal way to do this if you have a large number of zones with a large amount of shared memory on them.

    Running this will cause a lot of locks on the shared memory and stall a lot of processes in the wait state because of the high number of prstat commands running.
