Usability VM cluster monitoring wishes
This page is kept for historical interest. Don't expect its contents to be relevant to the current day.
Information on the VMs is at http://ryandlane.com/wiki/Category:Servers
Monitoring
Ganglia
- Put tesla in misc group in Ganglia
- Requires trickery: tesla runs ESXi, so gmond can't run on the box itself (only on the individual VMs); monitoring through SNMP is the only option
- Put all VMs in a VM group in Ganglia
Nagios
- Check for CPU > 95% on every VM and on tesla itself
- Check for real mem usage > 75% on every VM and on tesla itself
- Disk space checks on every VM and on tesla itself
- VMs should have a lower percentage threshold, say 80% rather than the usual 95% or 97%, because they have small root partitions
- tesla's threshold should be 90%
- HTTP checks on all VMs that serve HTTP
- prototype.wikimedia.org is set up already; commons.prototype.wikimedia.org isn't. No other VMs currently serve HTTP
- HTTP check on grid.tesla.usability.wikimedia.org:4444/console (Selenium server)
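The threshold rules in the list above could be sketched roughly as follows. This is only an illustration, not an existing plugin: the function names are hypothetical, and treating each listed percentage as a CRITICAL level (with no separate WARNING level) is an assumption, since the list does not say which level is meant. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Percentages from the wishlist, treated as CRITICAL levels (an assumption).
CPU_CRIT = 95.0
MEM_CRIT = 75.0
DISK_CRIT_VM = 80.0      # VMs have small root partitions
DISK_CRIT_TESLA = 90.0   # the ESXi host itself


def disk_threshold(host):
    """Return the disk-usage threshold (percent) for a host name."""
    return DISK_CRIT_TESLA if host == "tesla" else DISK_CRIT_VM


def check_percent(value, critical):
    """Map a usage percentage to a Nagios status code."""
    if value is None or not 0.0 <= value <= 100.0:
        return UNKNOWN
    return CRITICAL if value > critical else OK


def disk_used_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def run_disk_check(host, path="/"):
    """Return (status_code, message) for a disk check on the local machine."""
    pct = disk_used_percent(path)
    status = check_percent(pct, disk_threshold(host))
    message = "DISK %s - %.1f%% used on %s" % (STATUS_NAMES[status], pct, path)
    return status, message
```

A wrapper script would print the message and exit with the returned status code, which is how Nagios reads a plugin result. The same `check_percent` helper could be reused with `CPU_CRIT` and `MEM_CRIT` once CPU and memory figures are collected by some other means.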
Notification
- Notify Ryan, Roan and possibly ops people by SMS when any CRITICAL status on the Nagios checks above persists for more than 5 minutes (to avoid sending text messages when a check merely flaps; hopefully this is possible).
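The "persists for more than 5 minutes" rule can be illustrated with a small sketch. Nagios has its own mechanisms for this (soft versus hard states governed by `max_check_attempts` and `retry_interval`, plus flap detection), so the code below is not how Nagios implements it; it only shows the intended logic. The class name and its methods are hypothetical, and timestamps are plain seconds.

```python
CRITICAL = 2               # Nagios status code for CRITICAL
PERSIST_SECONDS = 5 * 60   # the 5-minute window from the wishlist


class CriticalTracker:
    """Tracks how long one check has been CRITICAL and decides when to notify."""

    def __init__(self, persist_seconds=PERSIST_SECONDS):
        self.persist_seconds = persist_seconds
        self.critical_since = None   # time of the first CRITICAL in the current run
        self.notified = False        # whether an SMS was already sent for this run

    def update(self, status, now):
        """Record a check result; return True if an SMS should be sent now."""
        if status != CRITICAL:
            # Any recovery resets the window, so a flapping check never notifies.
            self.critical_since = None
            self.notified = False
            return False
        if self.critical_since is None:
            self.critical_since = now
        if self.notified:
            return False
        if now - self.critical_since > self.persist_seconds:
            self.notified = True
            return True
        return False
```

For example, a check that goes CRITICAL at t=0, recovers at t=250 and goes CRITICAL again at t=300 would trigger a notification only on the first result after t=600, and only once for that run.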