Labs troubleshooting

This page is for troubleshooting labs-wide infrastructure issues. For difficulties with a single instance, some useful information is on the OpenStack page.

Network node failure

The network node is either labnet1001 or labnet1002, running the nova-network service. Only one is active at a time; the other is an inactive failover. Before taking any of these steps, check site.pp in puppet to see which node is active -- role::nova::network is commented out on the standby node.
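If you have a puppet checkout handy, the check can be scripted. This is only a minimal sketch: it assumes site.pp uses node '<hostname>' { ... } blocks and that the standby entry is a line starting with '#'. The function name and the path in the usage comment are hypothetical.

```shell
# Report, for each mention of role::nova::network in a manifest, whether it
# is active or commented out, together with the enclosing node name.
# Assumptions (not confirmed by this page): nodes are declared as
#   node 'hostname' {
# and a standby role line has '#' as its first non-blank character.
network_role_status() {
    awk '
        /^[ \t]*node[ \t]/ { node = $2 }          # remember the current node block
        /role::nova::network/ {
            state = ($0 ~ /^[ \t]*#/) ? "standby" : "active"
            print state, node
        }
    ' "$1" | tr -d "'\""                           # strip quotes around the hostname
}

# Usage (hypothetical path inside a puppet checkout):
#   network_role_status manifests/site.pp
```

If the output does not show exactly one active node, read site.pp by hand before touching anything.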

Symptoms

ssh connections to instances display an unexpected host-key warning

In some cases when the nova-network service is down, traffic bound for a labs instance instead hits the network node directly. In this case, ssh will try to log you in to the node itself rather than the instance behind the node. This gets you a host-key (or userkey) failure.

All labs instances unreachable
Web services running on multiple instances fail at the same time

Treatments

Restart nova-network on the active network node (this works surprisingly often)

 service nova-network restart

Check iptables and try to figure out what's happening

Fail-over

If the active network node is completely dead, you'll need to switch the network service to the backup node. Note that this switch-over WILL cause network downtime for labs, so outside of an emergency don't do this without scheduling a window in advance.

This switchover requires you to muck about in the nova database. At the moment, this database is hosted on m5-master, aka db1009. You can access the database like so:

 $ ssh db1009.eqiad.wmnet
 $ sudo su -
 # mysql nova

stop puppet on both network nodes (old and new)
change the network record (today, newhostname was 'labnet1002')

This is probably how new floating IPs know what to set their host as.

   MariaDB MISC m5 localhost nova > select * from networks\G
   # note network record id, in this case it is '2'
   MariaDB MISC m5 localhost nova > update networks set host = '<newhostname>' where id=2;

reassign floating IPs to the new network host (again, today newhostname was 'labnet1002')

This is how a given network node knows to set up NAT for each floating IP

   MariaDB MISC m5 localhost nova > update floating_ips set host = '<newhostname>' where host = '<oldhostname>';
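Typing hostnames into UPDATE statements by hand during an outage is error-prone, so it can help to generate the statements and review them before pasting. A minimal sketch follows; the function name is hypothetical, and the trailing SELECT is an added sanity check, not part of the original procedure. Table and column names are the ones used above.

```shell
# Print the SQL for moving the network record and the floating IPs from the
# old network host to the new one. Nothing is executed; paste the output
# into the mysql prompt after reviewing it.
nova_failover_sql() {
    old="$1"; new="$2"; net_id="$3"
    # Refuse anything that is not a plain hostname, so quoting cannot break.
    for v in "$old" "$new"; do
        case "$v" in
            ''|*[!a-z0-9.-]*) echo "bad hostname: $v" >&2; return 1 ;;
        esac
    done
    # The network record id must be a plain integer.
    case "$net_id" in
        ''|*[!0-9]*) echo "bad network id: $net_id" >&2; return 1 ;;
    esac
    cat <<EOF
update networks set host = '$new' where id=$net_id;
update floating_ips set host = '$new' where host = '$old';
select host, count(*) from floating_ips group by host;
EOF
}

# Usage, with the values from the example above:
#   nova_failover_sql labnet1001 labnet1002 2
```

After the updates, the final SELECT should no longer list the old host.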

release 10.68.16.1 on the old network host

Shut down the active br01 interface (OpenStack will not migrate the IP while it is in use)

   $ sudo ifconfig br01 down

(re)start nova-network on the new network host

   $ sudo service nova-network restart

verify that the new network host has grabbed the gateway IP

   $ ip addr show
   # verify that the gateway IP has migrated
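The gateway check can be done without reading the whole interface listing. A minimal sketch, assuming the gateway IP is the 10.68.16.1 address released on the old host in the earlier step; the function name is hypothetical.

```shell
# Succeed if the given IPv4 address appears as an "inet" entry in the
# `ip addr show` output supplied on stdin.
has_inet_addr() {
    # -F treats the dots literally; the trailing "/" keeps 10.68.16.1
    # from matching 10.68.16.10.
    grep -Fq "inet $1/"
}

# Usage on the new network host:
#   ip addr show | has_inet_addr 10.68.16.1 && echo "gateway IP is here"
```

Running the same check on the old host should fail once the interface is down.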

verify that floating IPs have moved over to the new host

   $ sudo iptables -t nat -L -n
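The nat listing is long, so a filter helps. This sketch assumes that nova-network represents each floating IP as a DNAT rule whose destination column is the floating IP and whose last column is to:<instance IP>; check that against the real output before relying on it. The function name is hypothetical.

```shell
# Read `iptables -t nat -L -n` output on stdin and print one
# "floating-ip -> instance-ip" line per DNAT rule.
# Assumed column layout: target prot opt source destination to:<addr>
list_floating_dnat() {
    awk '$1 == "DNAT" {
        target = $NF
        sub(/^to:/, "", target)     # drop the "to:" prefix
        print $5 " -> " target
    }'
}

# Usage on the new network host:
#   sudo iptables -t nat -L -n | list_floating_dnat
```

An empty result on the new host suggests nova-network has not picked up the reassigned floating IPs yet.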

change routing so that floating IPs are routed to the new host

On both cr1 and cr2 (note that the next-hop should reflect the active node -- this example shows labnet1002):

   delete routing-options static route 208.80.155.128/25 next-hop 10.64.20.13
   delete routing-options static route 10.68.16.0/21 next-hop 10.64.20.13
   set routing-options static route 208.80.155.128/25 next-hop 10.64.20.25
   set routing-options static route 10.68.16.0/21 next-hop 10.64.20.25
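Since the same four lines are needed on both routers and the next-hops swap on every fail-over, the commands can be generated from the two next-hop addresses. A minimal sketch: the function name is hypothetical, and the two prefixes are the ones shown above.

```shell
# Print the router configuration commands that move both labs routes from
# the old next-hop to the new one. Nothing is sent to the routers.
labs_route_cmds() {
    old_hop="$1"; new_hop="$2"
    # Remove the routes pointing at the old network node first.
    for prefix in 208.80.155.128/25 10.68.16.0/21; do
        echo "delete routing-options static route $prefix next-hop $old_hop"
    done
    # Then add the routes pointing at the new network node.
    for prefix in 208.80.155.128/25 10.68.16.0/21; do
        echo "set routing-options static route $prefix next-hop $new_hop"
    done
}

# Usage, reproducing the example above:
#   labs_route_cmds 10.64.20.13 10.64.20.25
```

Check which address belongs to which labnet host before generating anything; swapping the arguments points the routes at the dead node.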

restart and then stop nova-network on the old node, just so it knows it's not responsible anymore

   $ sudo service nova-network restart
   $ sudo service nova-network stop

restart keystone on labcontrol1001. I don't know why, but it died while we did all this.

   $ sudo service keystone restart

Update puppet: comment out role::nova::network on the old node, uncomment on the new node, then re-enable puppet on both nodes.

Instance DNS failure

Fail-over

There are two designate/pdns nodes: labservices1001 and labservices1002. The active node is determined in Hiera by a few settings:

   labs_certmanager_hostname: <primary designate host, generally labservices1001>
   labs_designate_hostname: <primary designate host>
   labs_designate_hostname_secondary: <other designate host, generally labservices1002>

To switch to a new primary designate host, change the labs_designate_hostname and labs_certmanager_hostname settings. That's not enough, though! PowerDNS will reject DNS updates from the new server because it is not the configured master, which results in syslog messages like this:

   Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for 68.10.in-addr.arpa from 208.80.155.117 which is not a master
   Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for 68.10.in-addr.arpa from 208.80.155.117 which is not a master
   Nov 23 01:46:06 labservices1001 pdns[23266]: Received NOTIFY for eqiad.wmflabs from 208.80.155.117 which is not a master

To change this, change the master in the pdns database:

   $ ssh db1009.eqiad.wmnet
   $ sudo su -
   # mysql pdns
   MariaDB MISC m5 localhost pdns > select * from domains;
   +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
   | id | name               | master              | last_check | type  | notified_serial | account        | designate_id  
   +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
   |  1 | eqiad.wmflabs      | 208.80.155.117:5354 | 1448252102 | SLAVE |            NULL | noauth-project | 114f1333c2c144
   |  2 | 68.10.in-addr.arpa | 208.80.155.117:5354 | 1448252099 | SLAVE |            NULL | noauth-project | 8d114f3c815b46
   +----+--------------------+---------------------+------------+-------+-----------------+----------------+---------------
   MariaDB MISC m5 localhost pdns > update domains set master="<ip of new primary designate host>:5354" where id=1;
   MariaDB MISC m5 localhost pdns > update domains set master="<ip of new primary designate host>:5354" where id=2;
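After changing the master, the "not a master" messages should stop. To see which zones and source addresses are still being rejected, the syslog lines can be summarized. This is a minimal sketch that relies only on the message format shown above; the function name and the log path in the usage comment are hypothetical.

```shell
# Read syslog lines on stdin and print each distinct "zone source-ip" pair
# for which pdns rejected a NOTIFY because the sender is not a master.
rejected_notifies() {
    sed -n 's/.*Received NOTIFY for \([^ ]*\) from \([^ ]*\) which is not a master.*/\1 \2/p' |
        sort -u
}

# Usage on the pdns host (log path is an assumption):
#   sudo cat /var/log/syslog | rejected_notifies
```

If the new primary's IP still shows up here, recheck the master column in the domains table.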

Typically the DNS server 'labs-ns2.wikimedia.org' is associated with the primary designate server, and 'labs-ns3.wikimedia.org' with the secondary. You will need to make the corresponding Hiera changes to modify those as well.