Nova Resource:Tools/Admin/new exec host

From Wikitech

New exec host checklist

Initial notes

  • Host types:
    • exec
    • webgrid-lighttpd
    • webgrid-generic 
    • custom (cyberbot, catscan, ...)
  • Hosts typically exist in Precise (-12xx) and Trusty (-14xx) variants.
  • Hosts are numbered incrementally.
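
The naming convention in the notes above can be sketched as a small shell function. This is only an illustration: instance_name is a hypothetical helper, not an existing tool, and it assumes the incremental part is always two zero-padded digits, passed in as a plain decimal number.

```shell
#!/bin/sh
# instance_name: hypothetical helper that builds tools-<host type>-NNxx.
# Arguments: host type, release (precise or trusty), incremental number.
instance_name() {
    name_type="$1"
    name_release="$2"
    name_index="$3"
    case "$name_release" in
        precise) name_nn=12 ;;
        trusty)  name_nn=14 ;;
        *) echo "unknown release: $name_release" >&2; return 1 ;;
    esac
    # Assumption: xx is zero-padded to two digits.
    printf 'tools-%s-%s%02d\n' "$name_type" "$name_nn" "$name_index"
}

instance_name exec trusty 3                 # prints tools-exec-1403
instance_name webgrid-lighttpd precise 11   # prints tools-webgrid-lighttpd-1211
```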

Host setup

  1. Create a new host
    • Instance name: tools-<host type>-NNxx
      • precise: NN=12, trusty: NN=14
      • xx is incremental
    • Instance type: m1.large
    • Image type: precise or trusty
    • Security groups:
      • exec: default, execnode
      • webgrid-lighttpd: default, execnode, webserver
      • webgrid-generic: default, execnode, webserver
      • custom: default, execnode
  2. Configure host:
    • all hosts: role::labs::tools::compute
    • exec: toollabs::node::compute::general
    • webgrid-lighttpd: toollabs::node::web::lighttpd
    • webgrid-generic: toollabs::node::web::generic
    • custom: ??
  3. Run sudo apt-get update && sudo puppet agent -tv repeatedly until it completes without failures
    1. For precise instances, reboot after the first puppet run, then run puppet again. This fixes an NFS permissions issue, turns on the swap partition properly, and makes the host report the correct vmem value for the gridengine configuration.
  4. Kill mpt-statusd
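
The "run puppet until no failures" step can be sketched as a retry loop. run_until_clean is a hypothetical helper and the attempt limit is an arbitrary choice. It relies on puppet agent -t enabling detailed exit codes, where 0 means no changes and 2 means changes were applied without failures; any other exit code is treated as a failure here.

```shell
#!/bin/sh
# run_until_clean: hypothetical helper. Repeats a command up to
# max_attempts times and stops at the first run without failures.
# Exit codes 0 and 2 count as clean (see the puppet note above).
run_until_clean() {
    max_attempts="$1"; shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        rc=0
        "$@" || rc=$?
        if [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]; then
            echo "attempt $attempt: clean run (exit code $rc)"
            return 0
        fi
        echo "attempt $attempt: failures (exit code $rc), retrying" >&2
        attempt=$((attempt + 1))
    done
    return 1
}

# On the new host this might be invoked as (not executed here):
#   sudo apt-get update && run_until_clean 5 sudo puppet agent -tv
```

The reboot required on precise instances after the first run is not handled by this sketch and still has to be done by hand.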

Grid configuration

  1. add the host as exec host: qconf -Ae /var/lib/gridengine/etc/exechosts/<hostname>
    1. If pooling precise instances, first check that swap is enabled (sudo swapon -s) and that the exec host config file lists 30G as the vmem value (on a large host)
  2. register the host according to its type:
    • webgrid, custom: add the host as submit host: qconf -as <hostname>
    • exec: add the host to hostgroup @general: qconf -mhgrp \@general
    • webgrid-lighttpd: add the host to hostgroup @webgrid: qconf -mhgrp \@webgrid
    • webgrid-generic: add the host to queue webgrid-generic: qconf -mq webgrid-generic
    • custom: add the host to the custom queue: qconf -mq <queue name>
  3. qmod -e "*@<hostname>" should now report that the new host's queues are enabled
  4. start gridengine-exec on the new host: sudo service gridengine-exec start
  5. qhost -q -h <hostname> should show the new queues without trailing 'au', indicating the host is up and running
  6. qhost -j -h <hostname> should, with luck, already show jobs running on the host
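
The check for a trailing 'au' in the qhost -q output can be sketched with a small filter. queues_in_au_state is a hypothetical helper, and the sample text below is illustrative rather than real output. It assumes queue lines are indented and laid out as name, type, slot counts, then an optional state column.

```shell
#!/bin/sh
# queues_in_au_state: hypothetical helper. Reads "qhost -q" style text on
# stdin, prints the queues whose state column contains "au", and returns
# non-zero if any were found.
queues_in_au_state() {
    awk '/^[ \t]/ && NF >= 4 && $4 ~ /au/ { print $1; found = 1 }
         END { exit (found ? 1 : 0) }'
}

# Illustrative sample (assumed layout, not captured from a real grid).
sample='tools-exec-1401         lx26-amd64      4  0.10    7.8G  500.0M   23.9G     0.0
   task                 BIP   0/2/50
   continuous           BC    0/1/50        au'

if printf '%s\n' "$sample" | queues_in_au_state; then
    echo "no queues in au state"
else
    echo "the queues listed above are still in au state"
fi
```

On the grid master the same filter could be fed from qhost -q -h <hostname> instead of the sample text.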