Swift/TODO
- Switch to statsd
- (swift.pp already has options for it) done
- use a sensible sampling rate done
- extract useful performance graphs and set up a gdash dashboard done
- possibly based on existing Ganglia graphs
- kill ganglia-logtailer with fire, disable insane access logging
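For reference, the statsd options that swift.pp exposes map to settings in each Swift server's [DEFAULT] section (e.g. object-server.conf); a sketch with placeholder values — host, port, prefix and sample rate are deployment choices, not our production settings:

```ini
[DEFAULT]
# Swift's built-in statsd support; all values below are examples.
log_statsd_host = localhost
log_statsd_port = 8125
# sample 1 in 10 events so busy object servers don't flood statsd
log_statsd_default_sample_rate = 0.1
log_statsd_metric_prefix = ms-be3001
```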
- Rewrite ganglia-report-global-stats with something more sensible done
- Graphite-based?
- Upgrade to a newer (stable) release done
- Might need incremental updates
- Possibly use this opportunity to go with trusty (not done!)
- Start from stateless services (proxy) might be an option
- Figure out a way to fix the "restarting syslog kills Swift" bug done, fixed by icehouse
- Possibly by switching syslog from /dev/log to 127.0.0.1
- Verify that the issue is still there with Python 2.7.4 (trusty+)
- Looks like icehouse fixes this; poor man's test:

  filippo@ms-be3001:~$ ps fwaux | grep -c swift
  140
  filippo@ms-be3001:~$ sudo service rsyslog restart
  rsyslog stop/waiting
  rsyslog start/running, process 19463
  filippo@ms-be3001:~$ ps fwaux | grep -c swift
  140
  filippo@ms-be3001:~$
logs:

  Jul 7 12:53:44 ms-be3001 object-replicator: Starting object replication pass.
  Jul 7 12:53:44 ms-be3001 kernel: Kernel logging (proc) stopped.
  Jul 7 12:53:44 ms-be3001 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="14931" x-info="http://www.rsyslog.com"] exiting on signal 15.
  Jul 7 12:53:45 ms-be3001 kernel: imklog 5.8.6, log source = /proc/kmsg started.
  Jul 7 12:53:45 ms-be3001 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="19463" x-info="http://www.rsyslog.com"] start
  Jul 7 12:53:45 ms-be3001 rsyslogd: rsyslogd's groupid changed to 103
  Jul 7 12:53:45 ms-be3001 rsyslogd: rsyslogd's userid changed to 101
  Jul 7 12:53:45 ms-be3001 rsyslogd-2039: Could not open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
  Jul 7 12:54:01 ms-be3001 CRON[19541]: (root) CMD (/usr/sbin/ganglia-logtailer --classname SwiftHTTPLogtailer --log_file /var/log/syslog --mode cron > /dev/null 2>&1)
  Jul 7 12:54:09 ms-be3001 container-replicator: Replication run OVER
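One way to decouple Swift from the syslog socket is Swift's UDP logging options, which replace the /dev/log connection with UDP datagrams that cannot block or break across an rsyslog restart; a sketch with example values (rsyslog would also need its UDP input, imudp, listening on the chosen port):

```ini
[DEFAULT]
# Log via UDP to the local rsyslog instead of the /dev/log socket;
# host/port below are example values.
log_udp_host = 127.0.0.1
log_udp_port = 514
```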
- Bring wikitech documentation up to date
- mention new changes like graphite, swift-recon, dispersion, etc
- documentation overhaul in general
- port all of our existing functionality
- replace require => Class[...pipeline...] / $required_classes
- swift::middleware for custom middleware?
- rsync use_xinetd -> false?
- remove basedir parameter for '/dev'
- test XFS/parted logic
- fix tempauth template (hardcoded temp users?!?)
- add statsd options
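For the tempauth template fix, upstream's credential format is one line per user in the proxy's tempauth filter section; the account/user names and keys below are placeholders that the template should render from puppet data rather than hardcode:

```ini
[filter:tempauth]
use = egg:swift#tempauth
# user_<account>_<user> = <key> [.admin|.reseller_admin] [storage_url]
user_admin_admin = PLACEHOLDER_KEY .admin .reseller_admin
user_test_tester = PLACEHOLDER_KEY2 .admin
```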
- Move esams to real LVS service and away from DNS round robin
- apparently there is no "service LVS" in esams, only LVS for user-facing services (varnish); need to figure that out
- Get a better overview on global stats
- e.g. replication stats with dispersion-report done
- swift-recon might be an option too done
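swift-dispersion-populate and swift-dispersion-report read their credentials from /etc/swift/dispersion.conf; a minimal sketch, assuming a dedicated dispersion account (all values below are placeholders):

```ini
[dispersion]
# auth endpoint and credentials for the dispersion account; placeholders
auth_url = http://127.0.0.1:8080/auth/v1.0
auth_user = test:dispersion
auth_key = PLACEHOLDER_KEY
# percent of partitions covered by the populated dispersion objects
dispersion_coverage = 1.0
```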
- Figure out a better way to distribute rings than puppet fileserver
- It should be auditable (e.g. if it lives in git it should be code-reviewable)
- Alternative 1: Stackforge puppet module sets up an rsync server
- Alternative 2: https://github.com/pandemicsyn/swift-ring-master
- Alternative 3: New git repository with the rings (e.g. operations/swift-ring.git), storage via git-fat
- ring files are manipulated via swift-ring-builder and the results committed into e.g. swift-ring.git/eqiad-prod/...-ring.gz
- a textual representation is also committed as swift-ring.git/eqiad-prod/..-ring.txt; a code review is pushed and eventually approved
- the updated files are pushed via git-fat push
- puppet (or cron) invokes a script on the swift machines to safely pull/update swift-ring.git and copy the ring files in place if need be
- separate machine provides rsync server for git-fat (strictly not a SPOF, every swift machine would also have a copy)
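The pull/copy step in alternative 3 could be a small script along these lines; a sketch only, assuming the repository has already been cloned and the git/git-fat pulls happen beforehand (the function name and paths are hypothetical):

```shell
# sync_rings: copy ring files from a swift-ring.git checkout into place,
# touching the destination only when a ring actually changed, so no-op
# runs never disturb the running services.
sync_rings() {
    repo_dir="$1"    # e.g. /srv/swift-ring/eqiad-prod (hypothetical path)
    dest_dir="$2"    # normally /etc/swift
    for ring in "$repo_dir"/*.ring.gz; do
        [ -e "$ring" ] || continue
        dest="$dest_dir/$(basename "$ring")"
        if ! cmp -s "$ring" "$dest"; then
            install -m 0644 "$ring" "$dest"
            echo "updated $(basename "$ring")"
        fi
    done
}

# cron/puppet would run something like:
#   cd /srv/swift-ring && git pull --ff-only && git fat pull
#   sync_rings /srv/swift-ring/eqiad-prod /etc/swift
```

Because the copy is conditional on `cmp`, the script is idempotent and safe to run from cron on every host.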
- Set up a cluster as Labs infrastructure
- With Keystone
- Set up a geocluster in production with codfw