Usability VM cluster monitoring wishes

Information on the VMs is at http://ryandlane.com/wiki/Category:Servers

Monitoring

Ganglia

  • Put tesla in the misc group in Ganglia
    • Requires trickery: tesla runs ESXi, so gmond cannot run on the host itself (only inside the individual VMs); SNMP is the only way to monitor the host
  • Put all VMs in a VM group in Ganglia
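
One possible form of the "trickery" for tesla, offered as an assumption rather than a settled plan: a machine that can run Ganglia tools polls tesla over SNMP and reports the values on tesla's behalf with gmetric's spoof option. The sketch below only builds the gmetric argument list. The SNMP polling step is left out, and the IP address, metric name and helper name are hypothetical.

```python
def gmetric_command(host_ip, host_name, metric, value, units, value_type="float"):
    """Build a gmetric argument list that reports a metric on behalf of another host.

    The value is assumed to have been fetched elsewhere (for example over SNMP).
    """
    return [
        "gmetric",
        "--name", metric,
        "--value", str(value),
        "--type", value_type,
        "--units", units,
        # --spoof takes "IP:hostname" and attributes the metric to that host.
        "--spoof", "%s:%s" % (host_ip, host_name),
    ]


# Hypothetical example: 192.0.2.10 is a documentation address, not tesla's real IP.
command = gmetric_command("192.0.2.10", "tesla", "cpu_percent", 42.5, "%")
# The list could then be passed to subprocess.call(command) on the polling machine.
```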

Nagios

  • Check for CPU > 95% on every VM and on tesla itself
  • Check for real mem usage > 75% on every VM and on tesla itself
  • Disk space checks on every VM and on tesla itself
    • VMs should have a lower percentage threshold, say 80% rather than the usual 95% or 97%, because their root partitions are small
    • tesla's threshold should be 90%
  • HTTP checks on all VMs that serve HTTP
    • prototype.wikimedia.org is already set up; commons.prototype.wikimedia.org is not. No other VMs currently serve HTTP
  • HTTP check on grid.tesla.usability.wikimedia.org:4444/console (Selenium server)
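
The thresholds listed above can be summarised in a small sketch written in the style of a Nagios plugin, which signals its result through an exit code (0 OK, 2 CRITICAL, 3 UNKNOWN). The role names "vm" and "host", the metric names and the function names are hypothetical, and treating a value as critical only when it is strictly above the threshold is an assumption.

```python
import shutil

# Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Critical thresholds in percent, taken from the list above.
# "host" stands for tesla itself; "vm" stands for any of the VMs.
THRESHOLDS = {
    ("vm", "cpu"): 95.0,
    ("vm", "mem"): 75.0,
    ("vm", "disk"): 80.0,
    ("host", "cpu"): 95.0,
    ("host", "mem"): 75.0,
    ("host", "disk"): 90.0,
}


def evaluate(role, metric, percent):
    """Return (exit_code, message) for a usage percentage."""
    threshold = THRESHOLDS.get((role, metric))
    if threshold is None:
        return UNKNOWN, "UNKNOWN - no threshold for %s/%s" % (role, metric)
    label = metric.upper()
    if percent > threshold:
        return CRITICAL, "%s CRITICAL - %.1f%% (threshold %.0f%%)" % (label, percent, threshold)
    return OK, "%s OK - %.1f%% (threshold %.0f%%)" % (label, percent, threshold)


def check_disk(role, path="/"):
    """Measure disk usage for the filesystem holding path and evaluate it."""
    usage = shutil.disk_usage(path)
    return evaluate(role, "disk", 100.0 * usage.used / usage.total)
```

For example, `evaluate("vm", "disk", 85.0)` reports CRITICAL, while the same reading on tesla, `evaluate("host", "disk", 85.0)`, is still OK.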

Notification

  • Notify Ryan, Roan and possibly ops people by SMS when any of the Nagios checks above stays in CRITICAL status for more than 5 minutes. The delay is meant to avoid text messages when Nagios merely flaps; hopefully this is possible.
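
The persistence rule above can be illustrated with a minimal sketch. It is not how Nagios implements it; Nagios has its own settings for this kind of behaviour (such as max_check_attempts with a retry interval, or first_notification_delay in Nagios 3), and which of them fits this setup is left open here. The class and method names are hypothetical, and timestamps are plain seconds.

```python
CRITICAL = 2                  # Nagios status code for CRITICAL
NOTIFY_AFTER_SECONDS = 5 * 60  # persistence required before an SMS is sent


class CriticalTracker:
    """Track one check and decide when a persistent CRITICAL warrants an SMS."""

    def __init__(self, notify_after=NOTIFY_AFTER_SECONDS):
        self.notify_after = notify_after
        self.critical_since = None  # time of the first CRITICAL in the current run
        self.notified = False       # whether an SMS was already sent for this run

    def update(self, status, now):
        """Record a check result; return True if an SMS should be sent now."""
        if status != CRITICAL:
            # Any recovery resets the timer, so a flapping check never notifies.
            self.critical_since = None
            self.notified = False
            return False
        if self.critical_since is None:
            self.critical_since = now
        if not self.notified and now - self.critical_since > self.notify_after:
            self.notified = True
            return True
        return False
```

With this logic, a check that alternates between CRITICAL and OK every couple of minutes never triggers a message, while one that stays CRITICAL for more than 300 seconds triggers exactly one.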