Etcd

From Wikitech

etcd is a distributed key/value store.

Overview: https://coreos.com/etcd/

Sources: https://github.com/coreos/etcd

Labs project: Nova_Resource:Etcd

Operations

Note: There is no TLS for peer communications yet. Peer URLs use http on port 2380, while client URLs use https on port 2379, so pay close attention to the scheme and port number in each command and setting below.

Bootstrapping an etcd cluster

This procedure only covers creating the first node of a cluster. Suppose the initial node is named etcd1001.eqiad.wmnet.

Make sure the following hiera parameters are set:

"etcd::peers_list": 'etcd1001=http://etcd1001.eqiad.wmnet:2380'
"etcd::host": '%{::fqdn}'
"etcd::cluster_name": k8s-etcd
"etcd::ssl::puppet_cert_name": '%{::fqdn}'
"etcd::port": 2379
"etcd::use_ssl": true
"etcd::cluster_state": new

Now run puppet on the etcd1001 node, and it should bring up an etcd cluster of one node. You can verify this with:

etcdctl --ca-file /var/lib/etcd/ssl/certs/ca.pem -C https://etcd1001.eqiad.wmnet:2379 cluster-health
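
Every etcdctl call on this page passes the same CA file and an https client endpoint on port 2379. A small wrapper can cut the repetition. The following is only a sketch: etcd_client and ETCD_CA_FILE are names made up for this example, not part of etcd or of the puppet module.

```shell
# Hypothetical helper: run etcdctl against one cluster member over TLS.
# Usage: etcd_client HOST ETCDCTL_ARGS...
# ETCD_CA_FILE is an assumed variable name; when unset, the CA path
# used elsewhere on this page is taken as the default.
etcd_client() (
    host=$1
    shift
    etcdctl --ca-file "${ETCD_CA_FILE:-/var/lib/etcd/ssl/certs/ca.pem}" \
        -C "https://${host}:2379" "$@"
)
```

With that defined, the health check above becomes: etcd_client etcd1001.eqiad.wmnet cluster-health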

Once verified, change the etcd::cluster_state hiera variable from 'new' to 'existing', then add more nodes using the procedure below.
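
The etcd::peers_list value follows a fixed pattern: one name=peer-URL pair per member, joined by commas when there are several. As a rough sketch, it can be generated from short host names. The function name peers_list is hypothetical, and the sketch assumes that every peer shares one domain and listens on plain http, port 2380, as in the examples on this page.

```shell
# Hypothetical helper: build an etcd::peers_list value.
# Usage: peers_list DOMAIN HOST [HOST...]
# Assumes plain-http peer URLs on port 2380 (no peer TLS yet).
peers_list() (
    domain=$1
    shift
    out=""
    for host in "$@"; do
        entry="${host}=http://${host}.${domain}:2380"
        if [ -z "$out" ]; then
            out=$entry
        else
            # Members are separated by commas, with no spaces.
            out="${out},${entry}"
        fi
    done
    printf '%s\n' "$out"
)
```

For the single-node bootstrap, peers_list eqiad.wmnet etcd1001 prints the value used in the hiera parameters above.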

Adding a new member to the cluster

Say we want to add a new server called conf1001.eqiad.wmnet to our cluster. The steps are as follows:

  1. Add the member via the members api, using the etcdctl tool:
    $ etcdctl --ca-file /var/lib/etcd/ssl/certs/ca.pem -C https://etcd1001.eqiad.wmnet:2379 member add conf1001 http://conf1001.eqiad.wmnet:2380
    Added member named conf1001 with ID 5f62a924ac85910 to cluster
    
    ETCD_NAME="conf1001"
    # Next line is broken down artificially for ease of reading
    ETCD_INITIAL_CLUSTER="conf1001=http://conf1001.eqiad.wmnet:2380,
                          etcd1001=http://etcd1001.eqiad.wmnet:2380,
                          etcd1002=http://etcd1002.eqiad.wmnet:2380,
                          etcd1003=http://etcd1003.eqiad.wmnet:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"
    
    Write down the output as it will be useful for our puppet changes.
  2. Add the new entry to the SRV record for _etcd-server._tcp.eqiad.wmnet (or your value of etcd::srv_dns). As an example see this patchset
  3. Assign the etcd role to the node in puppet. Also use hiera to set the following variables:
    etcd::srv_dns set to false
    etcd::peers_list set to the value of ETCD_INITIAL_CLUSTER from the etcdctl output in step 1, joined back into a single line
    etcd::cluster_state set to existing
    You can see such changes for example in this gerrit changeset
  4. Run puppet on the host; it should join the cluster. Check the other hosts in the cluster as well: their logs should stop complaining about not reaching the new member.
  5. You should now remove the special parameters defined in hiera for the host, see here for example
  6. Verify everything is ok after puppet runs again (it will restart etcd with the new parameters)
  7. Finally, add the new server to the SRV records that clients consume, see here
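
Step 3 reuses the ETCD_INITIAL_CLUSTER value printed in step 1. As a sketch, the value can be pulled out of saved etcdctl output rather than copied by hand. This assumes etcdctl prints the variable on a single line (the listing in step 1 is wrapped only for readability); initial_cluster_from_output is a hypothetical name.

```shell
# Hypothetical helper: read `etcdctl member add` output on stdin and
# print the value of ETCD_INITIAL_CLUSTER without its surrounding quotes.
# The pattern requires ="..." right after the name, so the similarly
# named ETCD_INITIAL_CLUSTER_STATE line is not matched.
initial_cluster_from_output() {
    sed -n 's/^[[:space:]]*ETCD_INITIAL_CLUSTER="\(.*\)"[[:space:]]*$/\1/p'
}
```

For example, if the output of step 1 was saved to a file (member-add.out is a made-up name), initial_cluster_from_output < member-add.out prints the string to use for etcd::peers_list.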

Removing a member from the cluster

  1. Verify that the node you want to remove is not the current leader, since removing the leader could run us into trouble:
    $ curl -k -L https://etcd1001:2379/v2/stats/leader
    {"message":"not current leader"}
    
  2. Remove the server from the clients SRV record
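  3. Dynamically remove the server from the cluster. etcdctl's member remove takes the member ID rather than the name and peer URL, so look the ID up with member list first:
    $ etcdctl --ca-file /var/lib/etcd/ssl/certs/ca.pem -C https://conf1001.eqiad.wmnet:2379 member list
    $ etcdctl --ca-file /var/lib/etcd/ssl/certs/ca.pem -C https://conf1001.eqiad.wmnet:2379 member remove <ID of etcd1001 from the list>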
    $ etcdctl --ca-file /var/lib/etcd/ssl/certs/ca.pem -C https://conf1001.eqiad.wmnet:2379 cluster-health
    
  4. Remove the server from the cluster's SRV record
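
The leader check in step 1 can be scripted so that a removal does not proceed by accident. This is a minimal sketch that relies only on the "not current leader" message shown in step 1; is_not_leader is a hypothetical name, and any other response is treated as "may be the leader".

```shell
# Hypothetical helper: read a /v2/stats/leader response on stdin.
# Exits 0 only when the response carries the "not current leader"
# message, i.e. the queried node is not the leader.
is_not_leader() {
    grep -q 'not current leader'
}
```

A possible use, with the same curl flags as step 1 plus -s to silence the progress output:

    $ curl -k -L -s https://etcd1001:2379/v2/stats/leader | is_not_leader && echo "safe to remove"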