Swift/TODO
- Switch to statsd
- (swift.pp already has options for it) done
- use a sensible sampling rate done
- extract useful performance graphs and set up a gdash dashboard done
- possibly based on existing Ganglia graphs
- kill ganglia-logtailer with fire, disable insane access logging
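For reference, the statsd options that swift.pp exposes map to settings in each Swift server's [DEFAULT] section (e.g. object-server.conf); a sketch with placeholder values — host, port, prefix and sample rate are deployment choices, not our production settings:

```ini
[DEFAULT]
# Swift's built-in statsd support; all values below are examples.
log_statsd_host = localhost
log_statsd_port = 8125
# sample 1 in 10 events so busy object servers don't flood statsd
log_statsd_default_sample_rate = 0.1
log_statsd_metric_prefix = ms-be3001
```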
- Rewrite ganglia-report-global-stats with something more sensible done
- Graphite-based?
- Upgrade to a newer (stable) release done
- Might need incremental updates
- Possibly use this opportunity to go with trusty (not done!)
- Start from stateless services (proxy) might be an option
- Figure out a way to fix the "restarting syslog kills Swift" bug done, fixed by icehouse
- Possibly by switching syslog from /dev/log to 127.0.0.1
- Verify that the issue is still there with Python 2.7.4 (trusty+)
- Looks like icehouse fixes this; poor man's test:

  filippo@ms-be3001:~$ ps fwaux | grep -c swift
  140
  filippo@ms-be3001:~$ sudo service rsyslog restart
  rsyslog stop/waiting
  rsyslog start/running, process 19463
  filippo@ms-be3001:~$ ps fwaux | grep -c swift
  140
  filippo@ms-be3001:~$
logs:

  Jul 7 12:53:44 ms-be3001 object-replicator: Starting object replication pass.
  Jul 7 12:53:44 ms-be3001 kernel: Kernel logging (proc) stopped.
  Jul 7 12:53:44 ms-be3001 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="14931" x-info="http://www.rsyslog.com"] exiting on signal 15.
  Jul 7 12:53:45 ms-be3001 kernel: imklog 5.8.6, log source = /proc/kmsg started.
  Jul 7 12:53:45 ms-be3001 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="19463" x-info="http://www.rsyslog.com"] start
  Jul 7 12:53:45 ms-be3001 rsyslogd: rsyslogd's groupid changed to 103
  Jul 7 12:53:45 ms-be3001 rsyslogd: rsyslogd's userid changed to 101
  Jul 7 12:53:45 ms-be3001 rsyslogd-2039: Could not open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
  Jul 7 12:54:01 ms-be3001 CRON[19541]: (root) CMD (/usr/sbin/ganglia-logtailer --classname SwiftHTTPLogtailer --log_file /var/log/syslog --mode cron > /dev/null 2>&1)
  Jul 7 12:54:09 ms-be3001 container-replicator: Replication run OVER
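One way to decouple Swift from the syslog socket is Swift's UDP logging options, which replace the /dev/log connection with UDP datagrams that cannot block or break across an rsyslog restart; a sketch with example values (rsyslog would also need its UDP input, imudp, listening on the chosen port):

```ini
[DEFAULT]
# Log via UDP to the local rsyslog instead of the /dev/log socket;
# host/port below are example values.
log_udp_host = 127.0.0.1
log_udp_port = 514
```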
- Bring wikitech documentation up to date
- mention new changes like graphite, swift-recon, dispersion, etc
- documentation overhaul in general
- port all of our existing functionality
- replace require => Class[...pipeline...] / $required_classes
- swift::middleware for custom middleware?
- rsync use_xinetd -> false?
- remove basedir parameter for '/dev'
- test XFS/parted logic
- fix tempauth template (hardcoded temp users?!?)
- add statsd options
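For the tempauth template fix, upstream's credential format is one line per user in the proxy's tempauth filter section; the account/user names and keys below are placeholders that the template should render from puppet data rather than hardcode:

```ini
[filter:tempauth]
use = egg:swift#tempauth
# user_<account>_<user> = <key> [.admin|.reseller_admin] [storage_url]
user_admin_admin = PLACEHOLDER_KEY .admin .reseller_admin
user_test_tester = PLACEHOLDER_KEY2 .admin
```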
- Move esams to real LVS service and away from DNS round robin
- apparently there is no "service LVS" in esams, only LVS for user-facing services (varnish); need to figure that out
- Get a better overview on global stats
- e.g. replication stats with dispersion-report done
- swift-recon might be an option too done
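swift-dispersion-populate and swift-dispersion-report read their credentials from /etc/swift/dispersion.conf; a minimal sketch, assuming a dedicated dispersion account (all values below are placeholders):

```ini
[dispersion]
# auth endpoint and credentials for the dispersion account; placeholders
auth_url = http://127.0.0.1:8080/auth/v1.0
auth_user = test:dispersion
auth_key = PLACEHOLDER_KEY
# percent of partitions covered by the populated dispersion objects
dispersion_coverage = 1.0
```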
- Figure out a better way to distribute rings than puppet fileserver
- It should be auditable (e.g. if it lives in git it should be code-reviewable)
- Alternative 1: Stackforge puppet module sets up an rsync server
- Alternative 2: https://github.com/pandemicsyn/swift-ring-master
- Alternative 3: New git repository with the rings (e.g. operations/swift-ring.git), storage via git-fat
- ring files are manipulated via swift-ring-builder and the results committed into e.g. swift-ring.git/eqiad-prod/...-ring.gz
- a textual representation is also committed as swift-ring.git/eqiad-prod/..-ring.txt; a code review is pushed and eventually approved
- the updated files are pushed via git-fat push
- puppet (or cron) invokes a script on the swift machines to safely pull/update swift-ring.git and copy the ring files in place if need be
- separate machine provides rsync server for git-fat (strictly not a SPOF, every swift machine would also have a copy)
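The pull/copy step in alternative 3 could be a small script along these lines; a sketch only, assuming the repository has already been cloned and the git/git-fat pulls happen beforehand (the function name and paths are hypothetical):

```shell
# sync_rings: copy ring files from a swift-ring.git checkout into place,
# touching the destination only when a ring actually changed, so no-op
# runs never disturb the running services.
sync_rings() {
    repo_dir="$1"    # e.g. /srv/swift-ring/eqiad-prod (hypothetical path)
    dest_dir="$2"    # normally /etc/swift
    for ring in "$repo_dir"/*.ring.gz; do
        [ -e "$ring" ] || continue
        dest="$dest_dir/$(basename "$ring")"
        if ! cmp -s "$ring" "$dest"; then
            install -m 0644 "$ring" "$dest"
            echo "updated $(basename "$ring")"
        fi
    done
}

# cron/puppet would run something like:
#   cd /srv/swift-ring && git pull --ff-only && git fat pull
#   sync_rings /srv/swift-ring/eqiad-prod /etc/swift
```

Because the copy is conditional on `cmp`, the script is idempotent and safe to run from cron on every host.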
- Set up a cluster as Labs infrastructure
- With Keystone
- Set up a geocluster in production with codfw