Nova Resource:Tools/Admin

Tool Labs documentation for admins.

Failover

Tools should be able to survive the failure of any one virt* node. Some items may need manual failover.

WebProxy

There are two webproxies, tools-proxy-01 and tools-proxy-02. They are on different virt hosts and act as hot spares for each other: you can switch between them without any downtime. Webservices register themselves with the active proxy (specified by the hiera setting active_proxy), and this information is stored in redis. The proxying information is also replicated to the standby proxy via simple redis replication. When the proxies are switched, new webservice starts will fail for a while until puppet has run on all the web nodes and the proxies, but current HTTP traffic will continue to be served.

To switch over:

  1. Switch the floating IP for tools.wmflabs.org (currently 208.80.155.131) from one proxy to the other (if tools-proxy-01 is the current active one, switch to tools-proxy-02, and vice versa). You can use this link to do the switchover. The easiest way to verify that the routing is correct (other than just hitting tools.wmflabs.org) is to tail /var/log/nginx/access.log on the proxy machines.
  2. Use hiera to set the active proxy host (toollabs::active_proxy) to the hostname (not fqdn) of the newly active proxy.
  3. Force a puppet run on all the proxies and webgrid nodes (see the sketch below).
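
A minimal sketch of step 3 (host names are illustrative; the real proxy and webgrid node names vary):

# run puppet on each proxy and webgrid node (example host list)
for host in tools-proxy-01 tools-proxy-02 tools-webgrid-lighttpd-1201; do
    ssh $host sudo puppet agent --test
done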

Doing potentially dangerous puppet merges

Since we have two instances, it is easy to verify that a puppet merge doesn't break anything. The sketch after the list shows the corresponding commands.

  1. Disable puppet on the 'active' instance (you can find this from hiera and by tailing /var/log/nginx/access.log)
  2. Run puppet on the other instance
  3. Check to make sure everything is ok. curl -H "Host: tools.wmflabs.org" localhost is a simple smoketest
  4. If ok, then enable puppet on the 'active' instance, and test that too to make sure
  5. Celebrate!
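
A sketch of the corresponding commands (puppet agent --disable/--enable/--test are the standard agent invocations; adjust host names as needed):

# on the 'active' proxy: keep puppet from applying the merge there
sudo puppet agent --disable 'testing potentially dangerous merge'

# on the standby proxy: apply the merge and smoke-test it
sudo puppet agent --test
curl -H "Host: tools.wmflabs.org" localhost

# if everything is ok: re-enable and test the 'active' proxy too
sudo puppet agent --enable
sudo puppet agent --test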

Recovering a failed proxy

When a proxy fails, it should be brought back and recovered so that it can be the new hot spare. This can be a post fire-fighting operation. The process for that is:

  1. Bring the machine back up (this implies that whatever issue caused the machine to be down has been fixed)
  2. Run puppet (this will start up replication from the current active master)

Checking redis registration status

The active proxy stores each tool's routing information in a redis hash keyed by prefix:<toolname>:

$ redis-cli hgetall prefix:dplbot
.*
http://tools-webgrid-lighttpd-1205.tools.eqiad.wmflabs:33555

To see how that entry was written, grep the redis append-only file on the proxy:

$ grep /var/lib/redis/tools-proxy-01-6379.aof -e 'dplbot' -C 5
(...)
HDEL
$13
prefix:dplbot
$2
.*
*4
(...)
HSET
$13
prefix:dplbot
$2
.*
$60
http://tools-webgrid-lighttpd-1202.tools.eqiad.wmflabs:44504
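
To confirm that the standby proxy is actually receiving this data, redis' built-in replication status can be queried (a sketch; assumes redis listens on the default port on the proxy itself):

# on the standby proxy
redis-cli info replication        # role should be slave, with master_link_status:up
redis-cli hgetall prefix:dplbot   # should show the same routing entry as on the active proxy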

Static webserver

This is a simple, stateless nginx HTTP server. To switch over, simply switch the floating IP from tools-web-static-01 to tools-web-static-02 (or vice versa). Recovery is equally trivial: just bring the machine back up and make sure puppet runs cleanly.

Checker service

This is the service that Catchpoint (our external monitoring service) hits to check the status of several services. It's completely stateless, so just switching the public IP from tools-checker-01 to -02 (or vice versa) is enough (IP switch direct link); the procedure is the same as for the static webserver.

GridEngine Master

The gridengine scheduler/dispatcher runs on tools-grid-master, and manages dispatching jobs to execution nodes and reporting. The active master writes its name to /var/lib/gridengine/default/common/act_qmaster, where all end-user tools pick it up. tools-grid-master normally serves in this role, but tools-grid-shadow can also be manually started as the master (only if there are currently no active masters) with service gridengine-master start on the shadow master.

Note that puppet is configured to start the master on every run on the designated master host, so it probably needs to be disabled there if one intends to use the shadow master as primary (a sketch follows below).
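
A sketch of manually promoting the shadow master (host names as above):

# on tools-grid-master: keep puppet from restarting the master there
sudo puppet agent --disable 'running grid master from the shadow'

# on tools-grid-shadow: start the master (only if no other master is active)
sudo service gridengine-master start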

Redundancy

Every 30s, the master touches the file /var/spool/gridengine/qmaster/heartbeat. On tools-grid-shadow there is a shadow master that watches this file for staleness, and it will fire up a new master on itself if the file has not been touched for too long (currently set at 10m). This only works if the running master crashed or was killed uncleanly (including the server hosting it crashing), because a clean shutdown creates a lockfile forbidding shadows from starting a master (as would be expected in the case of willfully stopped masters).

If it does, it changes act_qmaster to point to itself, redirecting all userland tools. This move is unidirectional; once the designated master is ready to take over again, the gridengine-master on tools-grid-shadow needs to be shut down manually, and the one on tools-grid-master started (this is necessary to prevent flapping, or split brain, if tools-grid-master only failed temporarily). This is simply done with service gridengine-master {stop/start}, as sketched below.
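
A sketch of failing back once tools-grid-master is healthy again:

# on tools-grid-shadow
sudo service gridengine-master stop

# on tools-grid-master
sudo service gridengine-master start
sudo puppet agent --enable    # if puppet had been disabled during the failover

# verify which master the tools now see
cat /var/lib/gridengine/default/common/act_qmaster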

Redis

Redis runs on two instances - tools-redis-01 and -02, and the currently active master is set via hiera on toollabs::active_redis (defaults to tools-redis-01). The other is set to be a slave of the master. Switching over can be done by:

  1. In hiera, set toollabs::active_redis to the hostname (not fqdn) of the host that should become the master
  2. Force a puppet run on the redis hosts
  3. Force a puppet run on the exec / webgrid hosts. This sets the /etc/hosts entry for tools-redis to the new master
  4. Restart redis on the redis hosts; this resets current connections and makes each host pick up its new master / slave role (see the sketch below)
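
A sketch for verifying the roles after the switchover (assumes the Debian service name redis-server and the default port):

# on each of tools-redis-01 and tools-redis-02
sudo service redis-server restart
redis-cli info replication | grep -E 'role|master_host'    # one host should report role:master, the other role:slave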

Services

These are services that run off service manifests for each tool - currently just the webservicemonitor service. They are in warm standby and require manual switchover. tools-services-01 and tools-services-02 both run the exact same code, but only one of them is 'active' at a time. Which one is active is determined by the puppet role param role::labs::tools::services::active_host. Set that via [[1]] to the fqdn of the host that should be 'active' and run puppet on all the services hosts; this will start the services on the active host and stop them on the other. Since the services do not have any internal state, they can be run from either host, and there is no need to switch back afterwards (see the sketch below).
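
To check which host is currently 'active', look at the service state on both hosts (a sketch; assumes the init service is named webservicemonitor, as above):

# on tools-services-01 and tools-services-02
sudo service webservicemonitor status    # should be running only on the 'active' host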

Administrative tasks

Logging in as root

In case the normal login does not work, for example due to an LDAP failure, administrators can also log in directly as root. To prepare for that occasion, generate a separate key with ssh-keygen, add an entry to the passwords::root::extra_keys hash in Hiera:Tools with your shell username as key and your public key as value (a sketch follows at the end of this section), and wait a Puppet cycle for your key to be added to the root accounts. Then add to your ~/.ssh/config:

# Use different identity for Tools root.
Match host *.tools.eqiad.wmflabs user root
     IdentityFile ~/.ssh/your_secret_root_key

The code that reads passwords::root::extra_keys is in labs/private:modules/passwords/manifests/init.pp.
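
A sketch of what the Hiera:Tools entry could look like (the exact YAML layout is an assumption; key and value are as described above):

passwords::root::extra_keys:
  yourshellusername: "ssh-rsa AAAA... your public root key"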

SGE resources

PDF manuals can be found using [2].

List of handy commands

Most commands take -xml as a parameter to enable xml output. This is useful when lines get cut off.

Queries

  • list queues on given host: qhost -q -h <hostname>
  • list jobs on given host: qhost -j -h <hostname>
  • list all queues: qstat -f
  • qmaster log file: tail -f /data/project/.system/gridengine/spool/qmaster/messages

Configuration

  • modify host group config: qconf -mhgrp \@general
  • print host group config: qconf -shgrp \@general
  • modify queue config: qconf -mq queuename
  • print queue config: qconf -sq continuous
  • enable a queue: qmod -e 'queue@node_name'
  • disable a queue: qmod -d 'queue@node_name'


  • add host as exec host: qconf -Ae node_name
  • print exec host config: qconf -se node_name
  • remove host as exec host: qconf -de node_name


  • add host as submit host: qconf -as node_name
  • remove host as submit host: qconf -ds node_name


  • add host as admin host: qconf -ah node_name
  • remove host as admin host: qconf -dh node_name

Accounting

  • retrieve information on finished job: qacct -j <jobid or jobname>
  • there are a few scripts (which still need to be puppetized) in /home/valhallasw/accountingtools:
    • vanaf.py makes a copy of recent entries in the accounting file
    • accounting.py contains python code to read in the accounting file

Creating a new node


Draining a node of Jobs

  1. Disable the queues on the node with qmod -d '*@node_name'
  2. Reschedule continuous jobs running on the node (see below)
  3. Wait for non-restartable jobs to drain or qdel them
  4. Once the work that required draining is done, re-enable the node with qmod -e '*@node_name'

There is no simple way to delete or reschedule jobs on a single host, but the following snippet is useful to provide a list to the command line:

$(qhost -j -h node_name|sed -e 's/^\s*//' | cut -d ' ' -f 1|egrep ^[0-9])

which makes for reasonable arguments to qdel or qmod -rj.
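
For example, to reschedule all restartable jobs on a node (node_name is a placeholder):

qmod -rj $(qhost -j -h node_name | sed -e 's/^\s*//' | cut -d ' ' -f 1 | egrep '^[0-9]')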

Local package management

Local packages are provided by an aptly repository on tools-services-01.

On tools-services-01, you can manipulate the package database with various commands; cf. aptly(1). Afterwards, you need to publish the database to the file Packages with (for the trusty-tools repository) aptly publish --skip-signing update trusty-tools. To use the packages on the clients, you either need to wait up to 30 minutes or run apt-get update manually. In general, you should never just delete packages, but move them to ~tools.admin/archived-packages.

You can always see where a package is (would be) coming from with apt-cache showpkg $package.
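
A minimal sketch of adding a package (the package file name is illustrative; check aptly(1) on the host for details):

# on tools-services-01, as root
aptly repo list                                      # show the local repositories
aptly repo add trusty-tools misctools_1.0_all.deb    # add a package file to the trusty-tools repo
aptly publish --skip-signing update trusty-tools     # publish the updated package index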

Webserver statistics

To get a look at webserver statistics, goaccess is installed on the webproxies. Usage:

goaccess --date-format="%d/%b/%Y" --log-format='%h - - [%d:%t %^] "%r" %s %b "%R" "%u"' -q -f/var/log/nginx/access.log

Interactive key bindings are documented on the man page. HTML output is supported by piping to a file. Note that nginx logs are rotated (twice?) daily, so there is only very recent data available.

Restarting all webservices

This is sometimes necessary if the proxy entries are out of whack. It can be done with:

qstat -q webgrid-generic -q webgrid-lighttpd -u '*' | awk '{ print $1;}' | xargs -L1 qmod -rj

The qstat gives us a list of all jobs from all users under the two webgrid queues, and the qmod -rj asks gridengine to restart them. This can be run as root on tools-login.wmflabs.org.

Banning an IP from tool labs

On Hiera:Tools, add the IP to the list of dynamicproxy::banned_ips, then force a puppet run on the webproxies. Add a note to Help:Tool Labs/Banned explaining why. The user will get a message like [3].
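
A sketch of what the Hiera entry might look like (the IP below is an example from the documentation range):

dynamicproxy::banned_ips:
  - 192.0.2.1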

Emergency guides

Other

Servicegroup log

tools.admin runs /data/project/admin/bin/toolhistory, which provides an hourly snapshot of ldaplist -l servicegroup as a git repository in /data/project/admin/var/lib/git/servicegroups.
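
To see when a service group changed, the snapshot repository can be inspected with ordinary git commands, e.g.:

cd /data/project/admin/var/lib/git/servicegroups
git log --oneline | head -n 5    # the most recent hourly snapshots
git diff HEAD~1                  # what changed in the last snapshot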