Ganglios

Just in case you missed the RT ticket (1603) fly by, I have installed (and tested, somewhat) ganglios[1], a collection of Python code that allows nagios to query ganglia for the metrics it reports and use that data to trigger alerts. This will be useful in two ways:

  • allow us to trigger alerts on data that only exists in ganglia, such as the enwiki job queue length or the packet loss metric on emery
  • allow us to reduce load on spence by querying for all hosts that match a given condition rather than querying each host individually.

The first use is relatively straightforward and understandable because it's really no different from how we use nagios now. An example check (this is the emery packet loss check):

root@spence:~# /usr/lib/nagios/plugins/check_ganglios_generic_value -H emery -m dataloss -o gt -w 5 -c 15
CRITICAL: dataloss is 89.809475 (gt 15.0)

The arguments are:

-H host
-m metric name
-o mathematical operator to use for the comparison (greater than (gt) and less than (lt) will be common)
-w warning level
-c crit level

This check says to WARN when dataloss goes over 5% and CRIT when it crosses 15%. The output says the current value is 89.8%, so the check CRITs. As soon as we fix the udp log packetloss counter to deal properly with the SSL hosts, this metric will have useful data and we can turn on the alert.
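To make the threshold logic concrete, here is a minimal sketch of how a value could be compared against the -w and -c levels using the -o operator. This is not the ganglios source; the function name evaluate, the OPS table, and the WARNING/OK message formats are assumptions made for illustration (only the CRITICAL line format is taken from the output above).

```python
import operator

# Hypothetical mapping from the -o argument to a comparison function.
# Only gt and lt are mentioned in this mail; nothing else is assumed.
OPS = {"gt": operator.gt, "lt": operator.lt}


def evaluate(metric, value, op, warn, crit):
    """Return (nagios_exit_code, message) for a single metric value."""
    compare = OPS[op]
    # Test the critical level first so it takes precedence over warning.
    if compare(value, crit):
        return 2, "CRITICAL: %s is %s (%s %s)" % (metric, value, op, float(crit))
    if compare(value, warn):
        return 1, "WARNING: %s is %s (%s %s)" % (metric, value, op, float(warn))
    return 0, "OK: %s is %s" % (metric, value)


if __name__ == "__main__":
    # Same numbers as the emery example above.
    code, message = evaluate("dataloss", 89.809475, "gt", 5, 15)
    print(message)
```

The exit codes follow the usual nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL). With lt, the same function would alert when a value drops below the thresholds instead.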

The second is unusual and is a little different from Nagios' normal way of thinking about the world. Normally, a nagios check is checking that a host has a service in a particular state. The second method asks instead for a list of all hosts that match a given condition. In other words, rather than asking each host how much disk it is using, you ask for all hosts that are using more than 95% of their disk. An example[2]:

root@spence:~# ./check_ganglios_disk warn 90 exclude ms1002.eqiad.wmnet
<b>DISK</b>:db10.pmtpa.wmnet:93.8 db5.pmtpa.wmnet:91.3 db7.pmtpa.wmnet:100.0 db9.pmtpa.wmnet:90.7 ekrem.wikimedia.org:100.0 hume.wikimedia.org:100.0 mobile1.wikimedia.org:94.4 ms3.pmtpa.wmnet:91.0 prototype.tesla.usability.wikimedia.org:93.4 searchidx1.pmtpa.wmnet:99.1 srv160.pmtpa.wmnet:90.2 ssl3001.esams.wikimedia.org:91.2

This is a good thing for us because it dramatically reduces the load on the nagios server. The current disk checks are 485 (or so) individual checks, each connecting to a host, querying its disk utilization, and registering that. Using the ganglios check replaces nearly all of those with a single check[2].
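The "ask for every host matching a condition" idea can be sketched as a single pass over data that has already been collected, rather than one connection per host. This is only an illustration and not how check_ganglios_disk is actually written: the function name hosts_over, the dict of host to percent used, and the strict greater-than comparison are all assumptions. The three db/srv values are taken from the output above; the ms1002 and example1 values are made up.

```python
def hosts_over(usage, warn, exclude=()):
    """Return 'host:value' strings for hosts whose usage exceeds warn.

    usage is a dict of hostname -> percent of disk used (assumed shape).
    """
    skipped = set(exclude)
    return ["%s:%s" % (host, pct)
            for host, pct in sorted(usage.items())
            if host not in skipped and pct > warn]


if __name__ == "__main__":
    usage = {
        "db10.pmtpa.wmnet": 93.8,
        "db7.pmtpa.wmnet": 100.0,
        "srv160.pmtpa.wmnet": 90.2,
        "ms1002.eqiad.wmnet": 99.0,  # made-up value; excluded below
        "example1.pmtpa.wmnet": 42.0,  # hypothetical host under the threshold
    }
    matches = hosts_over(usage, 90, exclude=["ms1002.eqiad.wmnet"])
    print("DISK:" + " ".join(matches))
```

One check run this way reports every offending host at once, which is why it can stand in for hundreds of per-host checks.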

I haven't actually configured (via puppet) any of these checks yet, but the system is installed and running with current data. Feel free to play around with it.

-ben

p.s. I've put a copy of this mail up on wikitech (ganglios) as a start for our documentation; it's obviously not well wiki formatted, but will be better than nothing to start.

[1] http://bitbucket.org/maplebed/ganglios/

[2] I'm excluding ms1002 because of a bug in gmond: it can't deal with disks >4T (http://bugzilla.ganglia.info/cgi-bin/bugzilla/show_bug.cgi?id=317). We will have to keep the existing active disk checks for hosts with extra large disks. I'm also using the copy of this check in root's homedir instead of the one installed by the package because it references a different metric for the disk check. I'll likely ask puppet to install a copy of this check that's more appropriate for us, overwriting the package's version.