Icinga
Icinga ( http://www.icinga.org/ ) is host and service monitoring software that uses a binary daemon, some CGI scripts for the web interface, and binary plugins to check various things. Basically, it is automated testing of our site that screams and sends up alarms when something fails. It originated as a fork of the earlier project "Nagios", from which WMF transitioned in 2013.
It can be set to monitor services such as SSH, Squid status, and the MySQL socket, as well as the number of users logged in, load, and disk usage. There are two levels of alarms (warning, critical), and the notification system is fully customizable (groups of users, notification by email / IRC / pager, stop notifying after x alarms...).
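The warning/critical split follows the standard Nagios/Icinga plugin convention of exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). A minimal sketch of what such a check looks like, with illustrative thresholds (this is not one of our plugins):

```shell
#!/bin/sh
# Minimal Nagios-style plugin logic: disk usage with warning/critical thresholds.
# Plugin convention for return codes: 0=OK, 1=WARNING, 2=CRITICAL.
check_disk() {
    used=$1; warn=$2; crit=$3
    if [ "$used" -ge "$crit" ]; then
        echo "DISK CRITICAL - ${used}% used"; return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "DISK WARNING - ${used}% used"; return 1
    fi
    echo "DISK OK - ${used}% used"; return 0
}

# A real plugin would take the value from df, e.g.:
#   used=$(df -P / | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
check_disk 72 80 90    # prints "DISK OK - 72% used"
```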
Our installation can be found at https://icinga.wikimedia.org which is currently an alias to machine neon.
- (April 2013) The rest of this page needs to be updated for icinga
This page may be outdated or contain incorrect details. Please update it if you can.
Quick summary
- In order to set downtime / ack alerts you need to login which is done over https
- Nagios configuration files are automatically generated by /home/w/conf/nagios/conf.php (on any host with the NFS home mounted) and synced over to Spence.
- MRTG was set up and will display Nagios usage data (useful to see if Nagios is actually doing what it is supposed to be doing) - https://nagios.wikimedia.org/stats/ -
- Ganglia and Wikitech are loosely integrated with most hosts (G and W icons next to the host name, respectively) and will display the Ganglia data for that host, or its associated wikitech page if it is a SPOF.
- There is an icinga-wm bot in #wikimedia-operations that will echo whatever Icinga alerts on (see below)
- Merlin (Module for Endless Redundancy and Load balancing In Nagios) was installed and configured but is not being used at this time.
- Nagios has shortcuts in the side panel for most of Wikimedia's infrastructure monitoring
- On Spence, the nagios install is located in /usr/local/nagios.
- On Spence, the nagios config files are located in /etc/nagios (but you probably don't want to edit anything there)
- On Spence, the HTTP interface is configured in /usr/local/nagios/share
Installation
On the server
Install the packages from source, found at http://www.nagios.org/download/core/thanks/ (both the core and plugins packages are needed).
After installing, do this:
cp /home/wikipedia/conf/nagios/* /etc/nagios/
service nagios start
and you're away.
On each client
Ubuntu:
apt-get update
apt-get -y install nagios-nrpe-server nagios-plugins
scp fenari:/home/wikipedia/conf/nagios/nrpe-debian.cfg /etc/nagios/nrpe.cfg
invoke-rc.d nagios-nrpe-server restart
Solaris:
pkgadd -d http://toolserver.org/~rriver/TSnrpe.pkg
pkgadd -d http://toolserver.org/~rriver/TSnagios-plugins.pkg
The right answers are: all, yes, all, yes, yes.
mv /lib/svc/method/nagios-nrpe /lib/svc/method/nagios-nrpe.old
sed 's/nrpe.cfg -dn/nrpe.cfg -d/' /lib/svc/method/nagios-nrpe.old > /lib/svc/method/nagios-nrpe
chmod a+x /lib/svc/method/nagios-nrpe
scp fenari:/home/wikipedia/conf/nagios/nrpe-solaris.cfg /etc/opt/ts/nrpe/nrpe.cfg
scp fenari:/home/wikipedia/conf/nagios/check-zfs /opt/local/bin/
svcadm -v enable nrpe
If you're installing on a server with no internet access, you can use a local path to the pkg file instead.
Wikimedia-fication
Customization
- Logo: yeah... had to put it somewhere to show our 'leetness, so it is in /usr/local/nagios/share/images on Spence
- Theme: I prefer a black theme. This is controlled in the CSS in /usr/local/nagios/share/stylesheets
- Links to other services: this is controlled by /usr/local/nagios/share/side.php
Addons
Merlin
Merlin is an addon for Nagios that provides ease of integration and redundancy across multiple Nagios instances. Usually we will want to have a Nagios installation in each datacenter, and each instance should be able to talk to the others, share data, and act as a backup should one fail. This is in essence what Merlin offers. The interesting thing about Merlin is that it stores everything in a MySQL DB, from host config to statuses. This is a lot easier to parse than Nagios' own files, which is why it was installed in the first place. However, at this moment nothing is making use of Merlin, and it is just there 'in case'. Find more information about Merlin at http://www.op5.org/community/projects/merlin.
Ganglia / wikitech integration
I wrote a little Perl script that parses Ganglia data on Spence (/var/lib/ganglia) and tries its best to match Ganglia hostnames with Nagios host definitions. In most cases it will work as advertised. The same goes for wikitech. Most servers don't have a wikitech entry associated with them, but some do. Most legacy systems and SPOFs should have an entry.
This script is located on the Nagios host at /etc/nagios/generate_ext-info.pl and runs automagically when sync is called.
Configuration
There are two ways to set up monitoring: using the old PHP script, and using Puppet.
PHP script
There's a configurator script for adding hosts, host groups, services and service groups at /home/wikipedia/conf/nagios/conf.php . Run it somewhere with PHP CLI installed, e.g. fenari. The configurator writes to a file called hosts.cfg in the current directory.
cd /home/wikipedia/conf/nagios
./sync
Most host groups (the ones in $hostGroups) are based on dsh node group files. This is preferred for maintainability reasons if such a node group exists; otherwise you can list miscellaneous hosts inline using $listedHosts. Some service groups (e.g. Apache and Squid) are just replicas of the host groups; others (such as Lucene and Memcached) are taken from the MediaWiki configuration. Services may also be listed inline using $listedServices, but again, this is not preferred.
Other configuration should be done by editing the *.cfg files on NFS and then copying them to Spence. Keeping two up-to-date copies like this protects us against failure of the monitoring host or NFS. (Note: the sync command actually replicates every .cfg to Spence.)
If nagios refuses to restart due to a configuration error, you can get more information by running this on the monitoring host (Spence):
nagios -v /etc/nagios/nagios.cfg
The error messages can be cryptic at times.
Puppet
Puppet is being integrated with Nagios as well, in file manifests/nagios.pp. To monitor the availability of a host, simply define the following anywhere under its node definition (i.e. in site.pp or included classes):
monitor_host { $hostname: }
To monitor a service, e.g. SSH, use something like the following:
monitor_service { "ssh":
    description   => "SSH status",
    check_command => "check_ssh",
}
Custom Checks
Custom checks can be found in the private SVN repository under ops/nagios-checks.
Examples include:
- check_stomp.pl
- check_all_memcached.php
Authentication
To add a user or update a password:
- Log in to Spence
- Run htpasswd /usr/local/nagios/etc/htpasswd.users <user>
IRC notification
There's a contact called "irc" in a contact group called "admins" which currently does the IRC notification. Messages are appended to /var/log/nagios/irc.log and picked up by an IRC client. Our IRC client (ircecho) can be started with:
/usr/local/bin/start-nagios-bot
Which is just the shell one-liner:
su -s/bin/sh -c'tail -n0 -f /var/log/nagios/irc.log | /home/wikipedia/bin/ircecho \#wikimedia-tech nagios-wm irc.freenode.net &' nobody > /dev/null 2>&1
How to fix Nagios
When Nagios dies due to problems with the Puppet-generated configuration, run the following command:
cd /etc/nagios/ && purge-nagios-resources.py puppet_hosts.cfg puppet_hostgroups.cfg puppet_servicegroups.cfg puppet_services.cfg hosts.cfg && /etc/init.d/nagios restart
Acknowledgement logic
From the Nagios Wiki (the original site seemed gone and this was only available in Google's cache, so it is pasted here)
- There is a difference between sticky and non-sticky acknowledgements
From Nagios 3.2.3. Assuming you have a service with notifications enabled for all states and a max retry attempts of 1, these are the notifications you should get based on the following transitions:
1. service in OK
2. service goes into WARNING - notification sent
3. non-sticky acknowledgement applied
4. service goes into CRITICAL. Acknowledgement removed. Notification sent
5. non-sticky acknowledgement applied
6. service goes into WARNING. Acknowledgement removed. Notification sent
7. non-sticky acknowledgement applied
8. service goes into CRITICAL. Acknowledgement removed. Notification sent
9. service goes into OK. Recovery notification sent
This is the flow if sticky acknowledgements are used:
1. service in OK
2. service goes into WARNING - notification sent
3. sticky acknowledgement applied
4. service goes into CRITICAL. No notification sent
5. service goes into WARNING. No notification sent
6. service goes into CRITICAL. No notification sent
7. service goes into OK. Recovery notification sent
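Acknowledgements are applied through the Nagios external command file, the same mechanism used for downtimes further down this page. A sketch of building such a command (the host, service name, author, and comment are made up; the ACKNOWLEDGE_SVC_PROBLEM format is the standard Nagios one, where sticky=2 keeps the acknowledgement across state changes and sticky=0 drops it on any change):

```shell
#!/bin/sh
# Build an ACKNOWLEDGE_SVC_PROBLEM external command.
# Format: ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
# sticky=2 survives further state changes (e.g. WARNING -> CRITICAL); sticky=0 does not.
now=$(date +%s)
cmd="[$now] ACKNOWLEDGE_SVC_PROBLEM;db9;MySQL slave;2;1;0;Dzahn;known issue"
echo "$cmd"
# To actually apply it, write it to the command pipe:
#   echo "$cmd" > /var/lib/nagios/rw/nagios.cmd
```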
Fixing the USB dongle
If the dongle isn't showing up as ttyUSB0, and lsusb shows one of the following lines:
Bus 001 Device 010: ID 12d1:1446 Huawei Technologies Co., Ltd.
or
Bus 001 Device 011: ID 12d1:1436 Huawei Technologies Co., Ltd.
run the following:
usb_modeswitch -v 0x12d1 -p 0x1446 -V 0x12d1 -P 0x1436 -M "55534243123456780000000000000011062000000100000000000000000000"
you should see output that looks like:
Looking for target devices ...
Found devices in target mode or class (1)
Looking for default devices ...
No devices in default mode or class found. Nothing to do. Bye.
Scheduling downtimes with a shell command
Put multiple hosts into a scheduled downtime, from now on for the next 3 days. Example used on Labs Nagios:
The Nagios command file is a named pipe at /var/lib/nagios/rw/nagios.cmd.
for host in huggle-wa-w1 puppet-lucid turnkey-1 pad2 webserver-lcarr asher1 dumpster01 dumps-4 ; do
  printf "[%lu] SCHEDULE_HOST_DOWNTIME;$host;$(date +%s);1332479449;1;0;259200;Dzahn;down to save memory on virt3 having RAM issues\n" $(date +%s) \
    > /var/lib/nagios/rw/nagios.cmd
done
After a few seconds you should see something like this in /var/log/icinga/icinga.log (on neon)
[1332220596] HOST DOWNTIME ALERT: dumpster01;STARTED; Host has entered a period of scheduled downtime
Command Format: SCHEDULE_HOST_DOWNTIME;<host_name>;<start_time>;<end_time>;<fixed>;<trigger_id>;<duration>;<author>;<comment>
quote: If the "fixed" argument is set to one (1), downtime will start and end at the times specified by the "start" and "end" arguments. Otherwise, downtime will begin between the "start" and "end" times and last for "duration" seconds. The "start" and "end" arguments are specified in time_t format (seconds since the UNIX epoch). The specified host downtime can be triggered by another downtime entry if the "trigger_id" is set to the ID of another scheduled downtime entry. Set the "trigger_id" argument to zero (0) if the downtime for the specified host should not be triggered by another downtime entry.
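The fixed/flexible distinction in the quote above can be illustrated by building both variants of the command (the host name and comments are made up; the pipe path is the one given above):

```shell
#!/bin/sh
# Fixed vs. flexible downtime with SCHEDULE_HOST_DOWNTIME.
# fixed=1: downtime runs exactly from <start> to <end>; <duration> is ignored.
# fixed=0: downtime starts whenever the host actually goes down inside the
#          <start>..<end> window, then lasts <duration> seconds.
start=$(date +%s)
end=$((start + 259200))   # a 3-day window
fixed="[$start] SCHEDULE_HOST_DOWNTIME;dumpster01;$start;$end;1;0;259200;Dzahn;fixed 3-day downtime"
flex="[$start] SCHEDULE_HOST_DOWNTIME;dumpster01;$start;$end;0;0;3600;Dzahn;1h of downtime, whenever it starts in the window"
echo "$fixed"
echo "$flex"
# Apply either one by writing it to the command pipe:
#   echo "$fixed" > /var/lib/nagios/rw/nagios.cmd
```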