Salt/Upgrades


Upgrading Salt

TL;DR

  • use the saltstack packages as a base and apply wmf patches
  • always update the master first; minions running a more recent version than the master will break
  • shove the dependencies (but not the salt packages) into our repo
  • test on your laptop or a set of containers first, then in deployment-prep, copying the packages around manually
  • do one minion per distro first as a test
  • test git deploy to be sure it works
  • shove the salt packages into our repo
  • update the production salt master and its minion; then update the rest of prod except the labs salt masters
  • update the salt master and minion on the labs salt masters; then update labs
  • profit!

Getting started

We are usually running a more recent version of Salt than is included with the installed version of Ubuntu. We get source package files from https://launchpad.net/~saltstack/+archive/ubuntu/ for trusty/precise and from http://debian.saltstack.com/debian/pool/main/s/salt/ for jessie. For dependencies we make use of the packages from the salt ppa for ubuntu and the saltstack repo for debian.

Building

We keep our build sources and patches in the operations/debs/salt repo. These instructions are written for git-buildpackage, which is what I used. Make sure git-buildpackage is installed and that you have your config files set up for git and git-buildpackage; a sketch of the full sequence follows the list below.

  • Check out the master branch.
  • Download the appropriate source packages from the right repo as listed above. You want the dsc, debian and orig.tar files.
  • cd into your repo
  • gbp import-dsc path-to-dsc-file --pristine-tar for each distro
  • for each distro, create the branch if it doesn't already exist, and toss/cherry pick commits from master until your branch reflects the version and distro you want to build
  • make and commit your patches
  • gbp dch --debian-branch=jessie debian/ (or whatever branch you are using); you may need to edit debian/changelog by hand afterwards to fix up the version and the distro entry
  • commit your changelog
  • git-buildpackage --git-pbuilder --git-dist=jessie --git-arch=amd64 --git-debian-branch=jessie --git-export-dir=/var/tmp/build-area/salt --git-pristine-tar -us -uc or whatever branch you are using. You may need the --git-ignore-new option if gbp whines about upstream not matching the local branch.
  • go to /var/tmp/build-area/salt and retrieve your packages
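
Putting the list above together, the whole sequence for one distro might look something like the following sketch; the dsc path, version and branch name here are only illustrative:

git checkout master
gbp import-dsc /path/to/salt_2014.7.5+ds-1.dsc --pristine-tar
git checkout jessie
(apply or cherry-pick the wmf patches and commit them)
gbp dch --debian-branch=jessie debian/
git commit debian/changelog -m "update changelog"
git-buildpackage --git-pbuilder --git-dist=jessie --git-arch=amd64 --git-debian-branch=jessie --git-export-dir=/var/tmp/build-area/salt --git-pristine-tar -us -uc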

NOTE: For pushing your changes back to the repo, it is best to push the master and pristine-tar branches directly to the git repo, bypassing gerrit. Your patches will be in a distro branch and should go through gerrit as usual.


Then test on your own cluster (a few labs instances, or containers on your laptop, whatever works for you). I used to use docker and some scripts to set up a cluster running the versions of salt and ubuntu I wanted to test upgrading from, but it needs updating. I also tested git deploy via a script included in that repo.

When you're happy with the results move on to testing in beta labs.

Testing in beta labs

Next, test on the deployment-prep project in labs. Make sure the dependencies for salt have been added to our repo (see below for how we add them). You should install the dependencies via apt-get install on all deployment-xxx instances.

Check to see what packages are installed on the salt-master (deployment-salt); this could include for example the syndic package.
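
For example, on deployment-salt (a quick sketch):

root@deployment-salt:~ # dpkg -l 'salt-*' | grep ^ii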

Scp over the appropriate debs from your build (typically salt-common, salt-master, salt-minion and, if needed, salt-syndic) to deployment-salt. Install them, retaining the old config files, by:

dpkg -i --force-confold salt-common-xxx...deb salt-master-xxx...deb etc

Updating the master

Update salt-master on deployment-salt first; this will force an update of salt-minion as well, since both rely on the same salt-common code. Make sure you keep the old config files; you can do this with the dpkg --force-confold invocation shown above.

After the update, make sure there's one salt-minion process running and a number of salt-master worker processes, and that a test.ping from deployment-salt to itself and to the other minions returns True. If anything looks awry, check the logs (/var/log/salt/{minion,master}) for a clue.
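
A quick sketch of those checks, run on deployment-salt (the target pattern is illustrative):

root@deployment-salt:~ # ps -ef | grep [s]alt-
root@deployment-salt:~ # salt 'deployment-*' test.ping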

Updating the minions

Start with one minion per distro. If something goes wrong you don't want to have to ssh into them all to fix it by hand.

Double-check that the dependencies are all installed.

Check also whether a ps shows more than two copies of salt-minion running. If so, you have two main processes running by mistake and you'll need to shoot one, or they will both try to reply to your commands at once. NOTE that if you do this check via salt from the salt master, a separate salt-minion process will be spawned to run your ps command, so make sure there are not three minions in that case.

Install keeping the old configs by:

dpkg -i --force-confold salt-common-xxx...deb salt-minion-xxx...deb

Now check:

  • Does the host respond to salt myhost test.ping?
  • Does salt myhost test.version show the new salt version?
  • Does a ps on the host show only new salt-minion processes running? If not, kill the old ones: "killall --older-than 1d salt-minion" or something similar ought to get them. (See the sketch after this list.)
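
Run from deployment-salt, those checks might look like the following sketch (the minion name is a placeholder; remember that the cmd.run itself spawns an extra salt-minion process):

root@deployment-salt:~ # salt 'myhost*' test.ping
root@deployment-salt:~ # salt 'myhost*' test.version
root@deployment-salt:~ # salt 'myhost*' cmd.run 'ps -ef | grep [s]alt-minion'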

If that looks good, you can update the minions on the rest of the deployment instances, doing the same checks before and after as in the case of the single minion. You can also look at the master log to see if there are errors about multiple returns from any minion.

Test with trebuchet; there's a test repo which you can add as a requirement to a couple of instances and then see if sync all (done by puppet) works to them. You can also make a change to the test repo and try git sync to make sure it gets out to all the instances it should.
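
A minimal sanity check of the deploy workflow, assuming the usual Trebuchet git deploy commands and a hypothetical path for the test repo checkout on the deployment host:

cd /srv/deployment/test/testrepo
git deploy start
(make and commit a trivial change)
git deploy sync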

Wait a couple days to make sure no one reports any bizarre behaviour; then, on to the next step.

In case of trouble (for example, jessie upgrades)

It turns out that updating salt 2014.1.13/2014.7.x on a jessie minion via salt from the salt master kills off the minion in the middle of the upgrade and leaves dpkg unhappy. This is due to a setting in the shutdown/startup script. It has been fixed in puppet, but if you run into this on some other distro you can work around it by using at now to schedule the command; it's best to do the install followed by a service salt-minion restart as one line of input to at. And don't forget to send all output to /dev/null or use the -M option, or all of ops will get your at-spam.
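
For example, something along these lines (package filenames elided as above):

echo 'dpkg -i --force-confold salt-common-xxx...deb salt-minion-xxx...deb; service salt-minion restart' | at -M now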

Another option is to create the file /etc/systemd/system/salt-minion.service.d/killmode.conf on all jessie hosts that don't already have it, with the contents:

[Service]
KillMode=process
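
After creating the drop-in, systemd has to re-read its unit configuration before the setting takes effect:

systemctl daemon-reload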

If you run into something like this and it can't be worked around as above, you can always update those hosts via ssh. You did test only one minion that you'll have to fix by hand, right? The rest can be done as follows.

Grab the list of names of the hosts via e.g.

root@deployment-salt:~ # names=`salt -G 'lsb_distrib_codename:jessie' --out raw cmd.run hostname| awk -F "'" '{ print $2 }'`
root@deployment-salt:~ # echo $names

Now on deployment-bastion (assuming the debs have already been copied to each instance) you can do

for name in $names; do ssh -l root -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ${name} 'dpkg -i --force-confold salt-common-xxx...deb salt-minion-xxx...deb' ; done

After that you can go back to the salt master and do the checks from there.

Similarly, for production hosts you can do the above steps from the production salt master for getting the names, and from your laptop for ssh. DO NOT use dsh; it requires key forwarding, which we all agree SUCKS. Thank you.

Adding dependency packages from the salt ppa, or new salt packages, to our repo

You can add them to our repo by logging in to the host with the repo (at this writing, carbon), becoming root (be sure to pick up the root environment for gpg keys), and giving the command

For ubuntu:

reprepro -C universe includedeb trusty-wikimedia /path-to-deb/python-some-dep-or-other_trusty_all.deb

replacing "trusty" with the ubuntu distro you want. For debian:

reprepro -C main includedeb jessie-wikimedia /path-to-deb/salt-something_someversion.deb

replacing "jessie" with the debian distro you want.

Adding your newly built salt packages to our repo works the same way, but you can use the changes file to import the packages. All packages should go in 'main'. You may also need the --ignore=wrongdistribution and --ignore=missingfile flags, for example:

reprepro -C main --ignore=missingfile --ignore=wrongdistribution include trusty-wikimedia /home/ariel/salt/2014.7.5wmf/trusty/salt_2014.7.5+ds-1ubuntu1+wm1_amd64.changes

Updating production

You need to update the master first, just as you did in labs. This will entail updating the minion on the salt master at the same time, just as in labs. As before, you'll want to keep the old config files. It's a good idea to check first for currently unresponsive minions and fix whatever problems they may have before going on to the update. It's also a good time to delete any salt keys still lying around for decommissioned hosts, and to accept any keys that someone forgot to accept.
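
The key housekeeping can be done with salt-key on the production salt master; for example (host names are placeholders):

salt-key -L
salt-key -d decommissioned-host.eqiad.wmnet
salt-key -a forgotten-host.eqiad.wmnet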

Updating the master, now that the packages are in the repo, requires only

  • apt-get update
  • apt-get -y -o "DPkg::Options::=--force-confold" install salt-common salt-master salt-minion ...

Don't forget to update the syndic or any other salt packages that might be installed over there, at the same time.

You can --dry-run this first to make sure that nothing weird is going to be done.
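
For instance:

apt-get -o "DPkg::Options::=--force-confold" --dry-run install salt-common salt-master salt-minion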

Sometimes the master will hang on restart, walking through the salt job cache for 15-20 minutes or longer. If you see that the master has not started up a number of workers within a minute, strace it and see if it's stuck opening files in the cache. The fix for this is to stop the master and the minion, move the cache (/var/cache/salt/master/jobs) out of the way, and start the master and then the minion up again.
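
A sketch of that recovery, assuming sysvinit-style service commands on the master:

service salt-minion stop
service salt-master stop
mv /var/cache/salt/master/jobs /var/cache/salt/master/jobs.old
service salt-master start
service salt-minion start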

Once you see that the master is up and that test.ping works on the salt master to itself, you can update the rest of the minions:

First, apt-get update on them all.

Next, check:

Does a ps show more than two copies of salt-minion running on any minion? Then you have two main processes running by mistake and you'll need to shoot one, or they will both try to reply to your commands at once. This will lead, for example, to both of them trying to grab the apt-get lock, which will make your install fail.

Does apt-get update run cleanly on them all? If you see complaints about repositories that don't reply or duplicate list entries, there may be a corrupted apt list cache; clean out /var/lib/apt/lists and try the update again.
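
To clear out a corrupted list cache on an affected minion, something like:

rm -rf /var/lib/apt/lists/*
mkdir -p /var/lib/apt/lists/partial
apt-get update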

Install:

apt-get -y -o "DPkg::Options::=--force-confold" install salt-common salt-minion

I've done them in batches per cluster, checking to make sure they succeed before moving on; ymmv. You might want to use -b 20 across all hosts in a distro, for example.
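
For example, driven from the production salt master per distro in batches of 20 (a sketch; grain targeting as used elsewhere on this page, and keep the jessie caveat above in mind):

salt -b 20 -G 'lsb_distrib_codename:trusty' cmd.run 'apt-get update'
salt -b 20 -G 'lsb_distrib_codename:trusty' cmd.run 'apt-get -y -o "DPkg::Options::=--force-confold" install salt-common salt-minion'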

Labs has its own salt master, which at this writing runs on virt1000 and labcontrol2001.wikimedia.org; you'll need to handle those separately, updating both the master and the minion on those hosts when you get to them.

Also check for syndics, maybe we'll have one by the next time there's an upgrade.

Updating labs

This is likely to be the most painful part of the whole upgrade.

  • First, toss the salt keys for deleted instances. There's a script for that, monitor_labs_salt_keys.py, which must be run as root. python salt-cleanup/monitor_labs_salt_keys.py --cleanup --dryrun will show the keys that would be deleted if you want to double check them first. (Might need to update these for nova 2 api?)
  • Next, make sure any unaccepted keys are signed.
  • You can use that same script to show information on nova instances that are not deleted but are unresponsive to salt by running python salt-cleanup/monitor_labs_salt_keys.py --no_ping; you'll want to look at any of those that do not have a shutoff or error status, and fix them up if possible. A few of these may be in their own mini salt cluster. If so you can update them separately OR leave them for the project owners. Likewise there may be some that have bad ferm rules or whatever; just use the ssh procedure for them described above in "In case of trouble".
  • After that you are ready to do the upgrade. The master (a production host) should have been done already, so you can proceed to the minions in batches as you see fit.
  • After you've done them you'll see that some have failed. Typical reasons: no room for dpkg to do its work (remove some old linux-header packages on the instance to reclaim space; see the sketch after this list); other packages unpacked but not installed due to dependency issues, borking dpkg completely (just purge those and move on); etc. Lather, rinse and repeat till you've got all the minions responding to test.ping and salt-minion --version says they are on the new version.
  • Check that you don't have dup minions; you can check the master salt log for errors about duplicate responses to verify this.
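
To reclaim space for dpkg on a full instance, list the installed header packages and purge old ones; the versions below are hypothetical examples, and don't remove the headers matching the running kernel (check uname -r):

dpkg -l 'linux-headers-*' | grep ^ii
uname -r
apt-get -y purge linux-headers-3.13.0-24 linux-headers-3.13.0-24-generic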

If everything looks good, you're almost done.

Cleanup

Go back to deployment-prep and make sure any new instances created there in the meantime get the updated salt. That's it!