Puppet

From Wikitech
This page is about how to install, configure, and manage puppet. For documentation about writing puppet code, visit this page.

puppet is the main configuration management tool used on the Wikimedia clusters (see also "puppet for dummies" on the blog).

puppet agent is the client daemon that runs on all servers, and manages machines with configuration information gathered from puppetmasterd.

puppet agent

Installation of the puppet service is handled via our automated installation. No production ready machines should have puppet manually installed.

On new installs, initial root login can be done from palladium with sudo /usr/local/sbin/install-console HOSTNAME. The script uses the /root/.ssh/new_install ssh key and thus also works while debian-installer is running during PXE install.

On Solaris, the installation instructions for the Blastwave packages seem to work.

Communication with the puppetmaster server is over encrypted SSL and with signed certificates. To sign the certificate of the newly installed machine on the puppetmaster server, log in on palladium.eqiad.wmnet and run:

puppet cert -s clienthostname

To check the list of outstanding, unsigned certificates, use:

puppet cert -l
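
As a rough sketch of how the two commands above can be combined, the helper below signs a request only if the hostname actually appears in the list of outstanding requests. The function name sign_if_pending is hypothetical, and matching the hostname with grep against the output of puppet cert -l is an assumption about that output's format; check it against your puppet version before relying on it.

```shell
# Hypothetical helper: sign a client certificate only if a request is pending.
# Assumes `puppet cert -l` prints one pending hostname per line (possibly
# quoted and followed by a fingerprint), so a whole-word fixed-string match
# on the hostname is enough to detect it.
sign_if_pending() {
    host="$1"
    if puppet cert -l | grep -Fqw -- "$host"; then
        puppet cert -s "$host"
    else
        echo "no pending certificate request for $host" >&2
        return 1
    fi
}

# Usage (on the puppetmaster):
#   sign_if_pending clienthostname
```

Refusing to sign a name that is not in the pending list guards against typos in the hostname.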

Reinstalls

When a server gets reinstalled, the existing certs/keys on the puppetmaster will not match the freshly generated keys on the client, and puppet will not work.

Before a server runs puppet for the first time (again), on the puppetmaster host, the following command should be run to erase all history of a server:

puppet cert --clean clienthostname

However, if this is done after puppet agent has already run and therefore has already generated new keys, this is not sufficient. To fix this situation on the !!! client !!!, use the following command to erase the newly generated keys/certificates:

find /var/lib/puppet -name "$(hostname -f)*" -exec rm -f {} \;
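
To tell which of the two situations above applies, a small check like the following can be run on the client before cleaning anything. It is only a sketch: the function name agent_keys_exist is hypothetical, and it reuses the same find pattern as the command above without deleting anything.

```shell
# Hypothetical check: has puppet agent already generated keys/certs for this host?
# Arguments (both optional): puppet var directory, client FQDN.
agent_keys_exist() {
    dir="${1:-/var/lib/puppet}"
    fqdn="${2:-$(hostname -f)}"
    # Succeeds if at least one file or directory name starts with the FQDN.
    [ -n "$(find "$dir" -name "${fqdn}*" -print 2>/dev/null | head -n 1)" ]
}

# Usage (on the client):
#   if agent_keys_exist; then echo "agent already generated keys"; fi
```

If the check succeeds, the client-side cleanup above is needed in addition to puppet cert --clean on the puppetmaster.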

Misc

Sometimes you want to purge info for a host from the puppet db. To do so, run the following on the puppet master:

puppetstoredconfigclean.rb server fqdn

All references, i.e. the host entry and all facts going with it, will be tossed.

Puppetmaster

This could be worse, believe me.

As of late 2013 we have two puppetmasters -- our design supports the easy addition of additional masters. Palladium serves as the SSL terminator, certificate server, and load-balancer for all puppetmasters; Palladium does double-duty as one of the puppetmaster backends as well.

Puppetmaster Installation

Simply use the (backported) puppetmaster Ubuntu package:

# apt-get install puppetmaster puppetmaster-passenger

The default package install uses the Webrick development webserver. That works fine for a couple of nodes, but is single-threaded. Therefore we eventually switched to using Mongrel, but are now using a Passenger based install, from the package puppetmaster-passenger. This implies that puppetmaster is started from Apache, and not by an independent daemon anymore.

We use multiple backends via Apache's mod_proxy and mod_balancer, so it's easy to add a new puppet master into the group to take some of the load.

The installation basically follows these instructions, as well as the default configurations provided in the package, with the exception of the proxy balancer addition. The back end servers are set up to answer requests on port 8141, with minor changes to the certificate setup.

Configuration

The default configuration is very usable, but we've made some tweaks here and there.

See /etc/puppet/site.pp for the basics. Puppet currently pushes out crontabs for the image scalers, ganglia binaries and conf files on hosts, and syncs user information including ssh keys on all hosts. It will reread its conf instantly. Changes to any given host get pushed out every 30 minutes, but puppet is continually updating some host or other. See syslog on palladium for details.

MD5 is broken, use SHA1 for signing certificates:

ca_md=sha1

We use storeconfigs so hosts can exchange configuration (e.g. SSH host keys). To enable this, configure:

storeconfigs=true
dbadapter=sqlite3
dblocation=$vardir/clientconfigs/clientconfigs.sqlite3

Packages rails, sqlite3, libsqlite3-ruby need to be installed. The directory /var/lib/puppet/clientconfigs should be created and owned by user/group puppet.

Making changes

It's a crazy merry-go-round!

For the gerrit and pre-gerrit patch stages, see the Git/Gerrit doc page.

Updating operations/puppet for production nodes

For security purposes, changes made to the puppet git repository are not immediately applied to nodes. In order to get approved puppet changes live on production systems, you must fetch and review the changes one last time on palladium. This final visual check is crucial to making sure that malicious puppet changes don't sneak their way in, as well as making sure that you don't deploy something that wasn't ready to be deployed.

The operations/puppet repository is hosted on palladium at /var/lib/git/operations/puppet. This working copy has hooks to update strontium and other puppetmasters.

"puppet-merge" is a wrapper script designed to formalize the merge steps while making it possible to review actual diffs of submodules when they change. When there are submodule changes, puppet-merge will clone the /var/lib/git/operations/puppet working copy to a tmp directory, do the merge and submodule update, and then show a manual file diff between /var/lib/git/operations/puppet and the temporary clone. This allows for explicit inspection of exactly what is about to be done to the codebase, even when there are submodule changes.

 $ cd /var/lib/git/operations/puppet # optional, puppet-merge will work properly from anywhere
 $ puppet-merge
 # diff is shown...
 Merge these changes? (yes/no)? yes
 Merging a4678c710573006249e86d311198b94cc3889382...
 git merge --ff-only a4678c710573006249e86d311198b94cc3889382
 Updating 8b0e19d..a4678c7
 Fast-forward
  files/puppet/puppet-merge |    3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 From https://gerrit.wikimedia.org/r/p/operations/puppet
    8b0e19d..a4678c7  production -> origin/production
 Merge made by the 'recursive' strategy.
  files/puppet/puppet-merge |    3 ++-
  1 file changed, 2 insertions(+), 1 deletion(-)
 Running git clean to clean any untracked files.
 git clean -dffx
 HEAD is now a4678c710573006249e86d311198b94cc3889382.

Once the changes are merged, puppet will put them into place on the relevant nodes during each node's next puppet run.

Noop test run on a node

You can do a dry run of your changes using:

# puppet agent --noop --test --debug

This will give you (among other things) a list of all the changes it would make.

Trigger a run on a node

Just run:

# puppet agent --test

Debugging

Using

# puppet agent --test --trace --debug

You get maximum output from puppet.

You can see a list of classes that are being included on a given puppet host, by checking the file /var/lib/puppet/state/classes.txt.
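
A quick way to test for one particular class is to grep that file. The sketch below assumes classes.txt holds one class name per line; the function name node_has_class is hypothetical.

```shell
# Hypothetical helper: succeed if the given class is listed in classes.txt.
# Assumes the file contains exactly one class name per line.
# Arguments: class name, optional path to classes.txt.
node_has_class() {
    grep -Fqx -- "$1" "${2:-/var/lib/puppet/state/classes.txt}"
}

# Usage (on the node):
#   node_has_class apt::update && echo "apt::update is applied here"
```

The -x flag matches whole lines only, so a class name that is a prefix of another class does not produce a false positive.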

With --evaltrace, puppet shows the resources as they are being evaluated:

# puppet agent -tv --evaltrace
info: Class[Apt::Update]: Starting to evaluate the resource
info: Class[Apt::Update]: Evaluated in 0.00 seconds
info: /Stage[first]/Apt::Update/Exec[/usr/bin/apt-get update]: Starting to evaluate the resource
notice: /Stage[first]/Apt::Update/Exec[/usr/bin/apt-get update]/returns: executed successfully
info: /Stage[first]/Apt::Update/Exec[/usr/bin/apt-get update]: Evaluated in 16.24 seconds
info: Class[Apt::Update]: Starting to evaluate the resource
info: Class[Apt::Update]: Evaluated in 0.01 seconds
...

Most of the puppet configuration parameters can be passed as long options (e.g. evaltrace can be passed as --evaltrace).
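
The --evaltrace output shown above can be post-processed to find the slowest resources. The following is a minimal sketch that assumes the "Evaluated in N seconds" line format from that sample; the function name slowest_resources is hypothetical.

```shell
# Hypothetical helper: read evaltrace output on stdin and print the slowest
# resources as "<seconds><TAB><resource>", slowest first.
# Argument: number of entries to show (default 10).
slowest_resources() {
    awk '/: Evaluated in [0-9.]+ seconds$/ {
        secs = $(NF - 1)                       # the number before "seconds"
        res = $0
        sub(/^[Ii]nfo: /, "", res)             # drop the log level prefix
        sub(/: Evaluated in [0-9.]+ seconds$/, "", res)
        print secs "\t" res
    }' | LC_ALL=C sort -rn | head -n "${1:-10}"
}

# Usage (on the node):
#   puppet agent -tv --evaltrace 2>&1 | slowest_resources 5
```

On the sample output above, the Exec[/usr/bin/apt-get update] resource at 16.24 seconds would be listed first.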

Errors

Occasionally you may see puppet fill up disks, which then results in yaml errors during puppet runs. If so, you can run the following on the puppet master, but do so very, very carefully:

 cd /var/lib/puppet && find . -name "*<servername>*.yaml" -delete
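
Since the deletion is irreversible, it can help to list the matching files first. The sketch below runs the same find expression with -print instead of -delete; the function name preview_host_yaml is hypothetical.

```shell
# Hypothetical helper: list the yaml files that the delete command would remove.
# Arguments: server name (substring of the file name), optional base directory.
preview_host_yaml() {
    find "${2:-/var/lib/puppet}" -name "*$1*.yaml" -print
}

# Usage (on the puppet master), before running the -delete variant:
#   preview_host_yaml <servername>
```

Only proceed with the delete once the printed list contains nothing but files for the intended server.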

Check .erb template syntax

"ERB files are easy to syntax check. For a file mytemplate.erb, run"

erb -x -T '-' mytemplate.erb | ruby -c
(puppet templating)
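
To check several templates in one go, the same pipeline can be wrapped in a loop. This is only a sketch: the function name check_erb_templates is hypothetical, and it looks at the *.erb files directly inside one directory rather than recursing.

```shell
# Hypothetical helper: syntax-check every .erb file directly inside a directory.
# Prints one line per failing template and returns 1 if any template fails.
check_erb_templates() {
    failed=0
    for f in "$1"/*.erb; do
        [ -e "$f" ] || continue    # no .erb files: the glob stays unexpanded
        if ! erb -x -T '-' "$f" | ruby -c >/dev/null 2>&1; then
            echo "syntax error: $f"
            failed=1
        fi
    done
    return "$failed"
}

# Usage:
#   check_erb_templates templates/
```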

Troubleshooting

puppet master spewing 500s

A storm of puppet failures usually means the clients cannot talk to the master(s). If that happens, first identify the failing puppet master; there should be a nagios check on HTTP checking for 200s. Once on the puppet master, check that the apache children are present, in particular mod_passenger's passenger-spawn-server, and that there are "master" processes running. Their stdout/stderr are connected to /var/log/apache2/error.log, so that log will provide some guidance. If, for example, passenger-spawn-server crashed, restarting apache is sufficient.

Puppet failure on all hosts with Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Mysql::Error: Out of range value for column 'id' at row 1: INSERT INTO `fact_values` (`updated_at`, `host_id`, `created_at`, `fact_name_id`, `value`) VALUES ...

This happened on 2015-08-03 because the MySQL backend for puppet ran out of ids on the fact_values table. Truncating the table solved the issue:

 jynus@iron:~$ sudo mysql -h m1-master -e "TRUNCATE TABLE puppet.fact_values"

Private puppet

Our main puppet repo is publicly visible and accepts (via gerrit review) volunteer submissions. Certain information (passwords, keys, etc.) cannot be made public, and lives in a separate, private puppet repository.

The private repository is stored on palladium in /root/private. It is not managed by gerrit or subject to review; changes are made there by logging in, editing and committing directly on palladium. Changes to /root/private are distributed to puppetmasters automatically via a post-commit hook. On palladium, the puppet master pulls private data from /var/lib/git/operations/private but you don't need to edit there, it should be synced automatically by the post-commit hook in /root/private.

The data in the private repository is highly sensitive and should not ever be copied onto your local machine or to anywhere outside of a puppetmaster system.

Nowadays, most things in the private repo should be class parameters defined with Puppet Hiera. Those reside under private/hieradata and have the big advantage that they don't need to be replicated in a second repository (see below).

Think twice before doing this!

Public (fake) private puppet repo

In order to satisfy puppet dependencies while retaining security, there is also a 'labs private' repo which the labs puppetmaster uses in place of the actual, secure private repo. The labs private repo lives on Gerrit and consists mainly of disposable keys and dummy passwords. In the case of hieradata in the private repo, in most cases labs can be happy with class defaults or with some data you can put in labs.yaml in the public hiera repository.

puppet git submodules

Some puppet modules are managed as git submodules for easier collaboration and code sharing between production puppet and other environments (e.g. vagrant, third parties).

troubleshooting

If a submodule gets merged into the main puppet.git repo, a manual cleanup is needed, or git pull will fail with

 error: The following untracked working tree files would be overwritten by checkout:

thus you'll need to remove whichever modules/SUBMODULE files were there and try pulling again.

Todo

  • More secure certificate signing
  • Better, more automated version control
  • Better tools for adding/maintaining node definitions

tickets

some selected "puppetize" tickets that are open:

  • T80340 puppetize: contacts.wikimedia.org
  • RT4082 puppetize office servers ?

More information