nagios puppet module for the April infrastructure

This document explains the nagios configuration for the infrastructure of the April non-profit organisation.
It is used to configure the nagios server overseeing all the services. The nagios plugins that cannot be run from the server (such as check_oom_killer) are installed locally and connected to nagios with nrpe. All services are bound to private IPs within the 192.168.0.0/16 network and exposed to the nagios server (bare metal machines are connected together using OpenVPN). The firewalls are set to allow TCP on the nrpe port (5666).

Configuration

To monitor a host, the april_nagios::host type must be used. It is a wrapper around the nagios_host puppet type.

node 'pavot.april-int' {
  april_nagios::host { 'pavot.april.org': address => '86.65.39.24' }
}

If a host needs to be monitored through nrpe (for instance to run check_oom_killer), the april_nagios::nrpe_server class installs the nrpe server and the april_nagios::check_oom_killer class installs the plugin itself.

node 'pavot.april-int' {
  include april_nagios::nrpe_server
  include april_nagios::check_oom_killer
}
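Once both classes are applied, the nrpe link can be checked by hand from the nagios server. The following is a minimal sketch, not part of the module: it assumes check_nrpe is installed at the usual Debian location, and the nrpe_check helper and the CHECK_NRPE variable are hypothetical names introduced for illustration.

```shell
# Assumption: Debian-style plugin path; override CHECK_NRPE if it differs.
CHECK_NRPE="${CHECK_NRPE:-/usr/lib/nagios/plugins/check_nrpe}"

# Run one nrpe command on a remote host, using the nrpe port (5666).
# $1: address of the monitored host, $2: nrpe command name
nrpe_check() {
    "$CHECK_NRPE" -H "$1" -p 5666 -c "$2"
}

# Example (hypothetical private address):
#   nrpe_check 192.168.1.10 check_oom_killer
```

The exit status is the one returned by check_nrpe, so the helper can be used directly in a shell condition.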

The list of available plugins can be found by looking for the string april_nagios::check_ in the init.pp file.
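As a rough sketch, the search can be done with grep. The manifests/init.pp path is an assumption based on the standard Puppet module layout, and list_check_classes is a hypothetical helper name.

```shell
# Print every april_nagios::check_* name mentioned in the manifest, once each.
# $1: path to the manifest (assumed default: manifests/init.pp)
list_check_classes() {
    manifest="${1:-manifests/init.pp}"
    grep -o 'april_nagios::check_[A-Za-z0-9_]*' "$manifest" | sort -u
}

# Example, run from the module root:
#   list_check_classes
```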

The nagios server is configured with the april_nagios::server class. A plugin (check_something in the example below) can be installed using the april_nagios::plugin type.

node 'nagios.vm.april-int' {
  include april_nagios::server
  april_nagios::plugin { 'check_something': }
}

The nagios plugins installable with april_nagios::plugin are provided with the module in the files directory, and their names all start with check_.

The following files are copied verbatim into the /etc/nagios3/conf.d directory of the nagios server by the april_nagios::server class:

Querying the nagios server state

Each nagios plugin can be used during the integration tests of puppet modules to assert that a service is properly configured. MK Livestatus is installed by default and can be queried as follows (assuming an OpenStack tenant is available and $instance holds the name of the instance being tested):

while ! ( echo  "GET services"
          echo  "Filter: host_name = $instance.novalocal"
          echo  "Filter: check_command = check_nrpe_1arg"'!'"check_oom_killer" ) |
    unixcat /var/lib/nagios3/rw/live |
    grep "OK : OOM" ; do
    sleep 1
done

The above snippet queries livestatus by sending the request with unixcat to the /var/lib/nagios3/rw/live socket, and retries every second until the plugin output reports an OK status. This is the nagios server view: the server first has to schedule a call to the check_oom_killer plugin, which may take some time.
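In an integration test it may be preferable to give up after a while instead of looping forever when the check never turns OK. The following is a minimal sketch of a bounded variant of the same query. The livestatus_query and wait_for_service helpers and the LIVESTATUS_CMD variable are hypothetical names introduced here, not part of the module.

```shell
# Print the Livestatus request for one nrpe-backed service on one host.
# $1: host name, $2: nrpe plugin name
livestatus_query() {
    printf 'GET services\n'
    printf 'Filter: host_name = %s\n' "$1"
    printf 'Filter: check_command = check_nrpe_1arg!%s\n' "$2"
}

# Poll once per second until the expected text appears in the answer.
# $1: host name, $2: plugin name, $3: expected text, $4: max attempts (default 60)
# Returns 0 on success, 1 when the attempts are exhausted.
wait_for_service() {
    attempts="${4:-60}"
    while :; do
        # LIVESTATUS_CMD is deliberately unquoted so it splits into command and argument.
        if livestatus_query "$1" "$2" |
            ${LIVESTATUS_CMD:-unixcat /var/lib/nagios3/rw/live} |
            grep -q -- "$3"; then
            return 0
        fi
        attempts=$((attempts - 1))
        [ "$attempts" -gt 0 ] || return 1
        sleep 1
    done
}

# Example:
#   wait_for_service "$instance.novalocal" check_oom_killer "OK : OOM" 120
```

Keeping the request in its own function makes it possible to inspect what is sent to the socket without a running nagios server.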