Swift/TODO

From Wikitech


  • Switch to statsd
  • (swift.pp already has options for it) done
  • use a sensible sampling rate done
  • extract useful performance graphs and set up a gdash dashboard done
  • possibly based on existing Ganglia graphs
  • kill ganglia-logtailer with fire, disable insane access logging
  • Rewrite ganglia-report-global-stats using something more sensible done
  • Graphite-based?
  • Upgrade to a newer (stable) release done
  • Might need incremental updates
  • Possibly use this opportunity to go with trusty not done!
  • Start from stateless services (proxy) might be an option
  • Possibly by switching syslog from /dev/log to 127.0.0.1
  • Verify that the issue is still there with Python 2.7.4 (trusty+)
  • Looks like Icehouse fixes this; poor man's test:
 filippo@ms-be3001:~$ ps fwaux | grep -c swift
 140
 filippo@ms-be3001:~$ sudo service rsyslog restart
 rsyslog stop/waiting
 rsyslog start/running, process 19463
 filippo@ms-be3001:~$ ps fwaux | grep -c swift
 140
 filippo@ms-be3001:~$ 

logs:

 Jul  7 12:53:44 ms-be3001 object-replicator: Starting object replication pass.
 Jul  7 12:53:44 ms-be3001 kernel: Kernel logging (proc) stopped.
 Jul  7 12:53:44 ms-be3001 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="14931" x-info="http://www.rsyslog.com"] exiting on signal 15.
 Jul  7 12:53:45 ms-be3001 kernel: imklog 5.8.6, log source = /proc/kmsg started.
 Jul  7 12:53:45 ms-be3001 rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="19463" x-info="http://www.rsyslog.com"] start
 Jul  7 12:53:45 ms-be3001 rsyslogd: rsyslogd's groupid changed to 103
 Jul  7 12:53:45 ms-be3001 rsyslogd: rsyslogd's userid changed to 101
 Jul  7 12:53:45 ms-be3001 rsyslogd-2039: Could not open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
 Jul  7 12:54:01 ms-be3001 CRON[19541]: (root) CMD (/usr/sbin/ganglia-logtailer --classname SwiftHTTPLogtailer --log_file /var/log/syslog --mode cron > /dev/null 2>&1)
 Jul  7 12:54:09 ms-be3001 container-replicator: Replication run OVER
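
The before/after comparison in the transcript above could be automated roughly as follows. This is only a sketch: the helper names (count_matching, swift_process_count, survives_restart) are made up for illustration, and the restart command is taken from the transcript and assumed to work unchanged on the target host. Unlike the shell pipeline, no grep process appears in the count, so absolute numbers may differ slightly from the ones shown above.

```python
import subprocess


def count_matching(ps_output, needle="swift"):
    """Count lines of ps output that contain needle, like `grep -c`."""
    return sum(1 for line in ps_output.splitlines() if needle in line)


def swift_process_count():
    """Run `ps fwaux` and count the lines that mention swift."""
    result = subprocess.run(
        ["ps", "fwaux"], capture_output=True, text=True, check=True
    )
    return count_matching(result.stdout)


def survives_restart(restart_cmd=("sudo", "service", "rsyslog", "restart")):
    """Restart rsyslog and report whether the swift process count changed.

    restart_cmd is an assumption copied from the transcript; adjust it for
    the init system in use.
    """
    before = swift_process_count()
    subprocess.run(list(restart_cmd), check=True)
    after = swift_process_count()
    return before == after, before, after
```

Calling survives_restart() on a storage node returns a tuple such as (unchanged, before, after), which makes the check easy to repeat across releases.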
  • Bring wikitech documentation up to date
  • mention new changes like graphite, swift-recon, dispersion, etc
  • documentation overhaul in general
  • port all of our existing functionality
  • replace require => Class[...pipeline...] / $required_classes
  • swift::middleware for custom middleware?
  • rsync use_xinetd -> false?
  • remove basedir parameter for '/dev'
  • test XFS/parted logic
  • fix tempauth template (hardcoded temp users?!?)
  • add statsd options
  • Move esams to real LVS service and away from DNS round robin
  • Apparently there is no "service lvs" in esams except for user-facing services (varnish); need to figure that out
  • Get a better overview on global stats
  • e.g. replication stats with dispersion-report done
  • swift-recon might be an option too done
  • Figure out a better way to distribute rings than puppet fileserver
  • It should be auditable (e.g. if it lives in git it should be code-reviewable)
  • Alternative 1: Stackforge puppet module sets up an rsync server
  • Alternative 2: https://github.com/pandemicsyn/swift-ring-master
  • Alternative 3: New git repository with the rings (e.g. operations/swift-ring.git), storage via git-fat
    • manipulation of ring files via swift-ring-builder, the results are committed into e.g. swift-ring.git/eqiad-prod/...-ring.gz
    • a textual representation is also committed as swift-ring.git/eqiad-prod/..-ring.txt; a code review is pushed and eventually approved
    • the updated files are pushed via git-fat push
    • puppet (or cron) invokes a script on the swift machines to safely pull/update swift-ring.git and copy the ring files in place if need be
    • a separate machine provides the rsync server for git-fat (not strictly a SPOF, since every swift machine would also have a copy)
  • Set up a cluster as Labs infrastructure
  • With Keystone
  • Set up a geocluster in production with codfw
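
For the ring distribution idea (Alternative 3), the step where a script copies the ring files in place "if need be" might look roughly like the sketch below. It assumes the git checkout has already been updated by some other means, and it only handles the compare-and-replace part. The function names and the default ring file names (account.ring.gz, container.ring.gz, object.ring.gz) are assumptions for illustration, not something this page specifies.

```python
import hashlib
import os
import shutil
import tempfile


def file_digest(path):
    """Return the SHA-256 hex digest of a file, or None if it is missing."""
    if not os.path.exists(path):
        return None
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_ring(src, dst):
    """Atomically copy src over dst, but only if the contents differ.

    Returns True if dst was updated. Ownership is not handled here.
    """
    if file_digest(src) == file_digest(dst):
        return False
    dst_dir = os.path.dirname(os.path.abspath(dst))
    # Write to a temp file in the same directory so the rename is atomic.
    fd, tmp = tempfile.mkstemp(dir=dst_dir, prefix=".ring-")
    os.close(fd)
    try:
        shutil.copyfile(src, tmp)
        shutil.copymode(src, tmp)
        os.replace(tmp, dst)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True


def sync_rings(checkout_dir, ring_dir,
               names=("account.ring.gz", "container.ring.gz", "object.ring.gz")):
    """Install every ring found in checkout_dir; return the names updated."""
    updated = []
    for name in names:
        src = os.path.join(checkout_dir, name)
        if os.path.exists(src) and install_ring(src, os.path.join(ring_dir, name)):
            updated.append(name)
    return updated
```

Writing to a temporary file and renaming it means a swift process never sees a half-written ring, and skipping identical files keeps repeated puppet or cron runs cheap.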