Usability VM cluster monitoring wishes

Monitoring

Put tesla in misc group in Ganglia
- Requires trickery: tesla runs ESXi, so gmond can't run on the box itself (only inside the individual VMs); monitoring the host through SNMP is the only option
Put all VMs in a VM group in Ganglia

Check for CPU > 95% on every VM and on tesla itself
Check for real mem usage > 75% on every VM and on tesla itself
Disk space checks on every VM and on tesla itself
- VMs should have a lower percentage threshold, say 80% rather than the usual 95% or 97%, because they have small root partitions
- tesla's threshold should be 90%
HTTP checks on all VMs that serve HTTP
- prototype.wikimedia.org is set up already, commons.prototype.wikimedia.org isn't. No other VMs currently serve HTTP
HTTP check on grid.tesla.usability.wikimedia.org:4444/console (Selenium server)
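
The CPU, memory and disk thresholds listed above could be captured in one small table. The following is only a sketch of the threshold logic, not an actual Nagios plugin: the names THRESHOLDS and check are hypothetical, it assumes a reading is a problem only when it strictly exceeds the limit, and it reports every breach as CRITICAL since the list names no separate warning levels. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
# Nagios plugin exit codes (standard plugin convention).
OK, CRITICAL = 0, 2

# Limits in percent, taken from the wish list above.
# The dict layout and key names are hypothetical.
THRESHOLDS = {
    "vm":    {"cpu": 95, "mem": 75, "disk": 80},
    "tesla": {"cpu": 95, "mem": 75, "disk": 90},
}


def check(host_kind, metric, used_percent):
    """Return (exit_code, message) for one metric reading.

    Assumption: a reading is a problem only when it is strictly
    above the limit for that kind of host.
    """
    limit = THRESHOLDS[host_kind][metric]
    state = "CRITICAL" if used_percent > limit else "OK"
    code = CRITICAL if used_percent > limit else OK
    return code, f"{state} - {metric} at {used_percent:.0f}% (limit {limit}%)"
```

For example, check("vm", "disk", 85) reports CRITICAL, while the same reading on tesla is still OK because tesla's disk limit is 90%.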

Notify Ryan, Roan and possibly ops people by SMS when any CRITICAL status on the Nagios checks above persists for more than 5 minutes. The delay is meant to prevent text messages when a check merely flaps; hopefully Nagios can be configured this way.
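
The intended notification rule can be sketched as follows. This is an illustration of the desired behaviour, not how Nagios implements it; the class CriticalTracker and its method names are hypothetical. Nagios's own soft/hard state handling (a problem only triggers notifications once it has failed max_check_attempts checks in a row) is probably the native way to get a similar effect, though the exact settings would need to be worked out.

```python
CRITICAL = "CRITICAL"
HOLD_SECONDS = 5 * 60  # wish list: CRITICAL must persist for more than 5 minutes


class CriticalTracker:
    """Tracks one check and decides when an SMS is due (hypothetical helper)."""

    def __init__(self, hold_seconds=HOLD_SECONDS):
        self.hold_seconds = hold_seconds
        self.critical_since = None  # time of the first CRITICAL in the current run
        self.notified = False       # True once an SMS was sent for this run

    def update(self, status, now):
        """Record a result at time `now` (seconds); return True if an SMS is due."""
        if status != CRITICAL:
            # Any recovery resets the timer, so a flapping check never notifies.
            self.critical_since = None
            self.notified = False
            return False
        if self.critical_since is None:
            self.critical_since = now
        if not self.notified and now - self.critical_since > self.hold_seconds:
            self.notified = True  # send at most one SMS per continuous outage
            return True
        return False
```

With this logic a check that goes CRITICAL, recovers after a few minutes and fails again starts its 5-minute timer from scratch, so only a continuous outage results in a text message.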