Eqiad Migration Planning

Coordination

Outstanding Server/System Readiness

  • App, Imagescalers, Bits, Jobrunners and API Apaches
    • All Ready - awaiting code deploy
  • Parsoid servers@Eqiad
    • Target - 1/11/13 (RobH)
  • Set up Ceph in eqiad for image storage (Swift in Tampa & Ceph in eqiad) (Faidon/Mark)
    • 2 more servers set up (up to 4 now); intra-cluster replication ETA is Saturday early morning PST
    • holding off on adding more so as not to slow the swift->ceph replication
    • swift->ceph copy: 17.5 TB of 43 TB done, completion in ~12 days (very rough estimate)
    • some stability issues - working closely with the Ceph developers, fixes landing in real time
    • PERC H310 controller issue - worked around with RAID 0
    • Ceph 0.56 has been released and deployed to the eqiad cluster
    • various other hiccups, both hardware- and software-related
    • still pending: puppetization, rewrite.py -> VCL, testing with MediaWiki (a copy spot-check sketch follows this list)
  • Database Master switchover (PY / Asher)
    • MHA
    • https://bugzilla.wikimedia.org/show_bug.cgi?id=43453 - Checklist/script to switch datacenters - Tim
      • Automated DB/Apache switchover script
        • Tampa - Read-only
        • Eqiad - Grants needed
        • See "Actually Failing Over" below.
      • varnish configuration switchover script - Mark
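
For the pending swift->ceph verification and MediaWiki testing noted under the Ceph item, one quick spot check is to HEAD the same object on both stores and compare size and ETag. The sketch below is illustrative only; the frontend hostnames and container path are made-up placeholders, not the real cluster endpoints.

  #!/usr/bin/env python
  """Spot-check that objects copied from Swift (pmtpa) to Ceph/radosgw (eqiad)
  match. Hostnames and the container path are placeholders, not real endpoints."""
  import requests

  SWIFT_BASE = "http://ms-fe.pmtpa.example/v1/AUTH_mw/wikipedia-commons-local-public"  # assumed
  CEPH_BASE = "http://ms-fe.eqiad.example/swift/v1/wikipedia-commons-local-public"     # assumed

  def compare(object_name):
      """HEAD the object on both clusters and compare size and ETag."""
      old = requests.head("%s/%s" % (SWIFT_BASE, object_name))
      new = requests.head("%s/%s" % (CEPH_BASE, object_name))
      if new.status_code != 200:
          return "MISSING in ceph (%d)" % new.status_code
      if old.headers.get("Content-Length") != new.headers.get("Content-Length"):
          return "SIZE MISMATCH"
      if old.headers.get("ETag") != new.headers.get("ETag"):
          return "ETAG MISMATCH"
      return "OK"

  if __name__ == "__main__":
      for name in ["a/ab/Example.jpg"]:  # sample object paths to verify
          print(name, compare(name))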

Software / Config Requirements


  • replicating the git checkouts, etc. to new /home
    • not an issue

Actually Failing Over

  • Sequence (AI: Asher)
    • deploy db.php with all shards set to read-only in both pmtpa and eqiad
    • redis failover - set mc1001-1016 as masters, mc1-16 slaving from eqiad (see the sketch after this list)
    • deploy squid and mobile + bits varnish configs pointing to eqiad apaches
      • start with read-only mode
      • try to bypass puppet / must complete within a minute or two
    • database warmup - script a collection of SELECT queries for every project and run it against all eqiad dbs (see the warmup sketch after this list)
    • master swap every core db and writable es shard to eqiad
    • deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
      • the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
    • No DNS or Ceph/Swift changes required
    • Rollback plan - details still needed
    • turn off multi-write to NAS & turn on multi-write to Ceph
    • TEST! TEST! TEST!
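
The redis failover step in the sequence above could be driven by a short script that promotes the eqiad instances and then re-points the pmtpa ones. This is a minimal sketch using redis-py; the host naming, port, and one-to-one pairing are assumptions and would have to match the actual deployment.

  #!/usr/bin/env python
  """Minimal sketch of the redis failover step: promote the eqiad instances to
  masters, then point the pmtpa instances at them. Hostnames/port are assumed."""
  import redis

  PORT = 6379  # assumed redis port
  EQIAD = ["mc%d.eqiad.wmnet" % i for i in range(1001, 1017)]  # mc1001-mc1016
  PMTPA = ["mc%d.pmtpa.wmnet" % i for i in range(1, 17)]       # mc1-mc16

  # 1. Promote every eqiad instance to master (SLAVEOF NO ONE).
  for host in EQIAD:
      redis.Redis(host=host, port=PORT).slaveof()

  # 2. Re-point each pmtpa instance to slave from its eqiad counterpart.
  for pmtpa_host, eqiad_host in zip(PMTPA, EQIAD):
      redis.Redis(host=pmtpa_host, port=PORT).slaveof(eqiad_host, PORT)
      print("%s now slaving from %s" % (pmtpa_host, eqiad_host))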
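
The database warmup step could similarly be scripted: replay a canned collection of SELECT queries per project against every eqiad shard so the buffer pools are hot before the master swap. A rough sketch only; the shard map, credentials, and queries are placeholders, not the real collection.

  #!/usr/bin/env python
  """Rough sketch of the eqiad database warmup: replay a collection of SELECT
  queries per project against each eqiad core DB. All names are placeholders."""
  import pymysql

  # Assumed shard -> eqiad DB host mapping.
  SHARDS = {
      "s1": "db1001.eqiad.wmnet",
      "s2": "db1002.eqiad.wmnet",
  }

  # Assumed per-project query collections, e.g. sampled from production SELECT traffic.
  WARMUP_QUERIES = {
      "enwiki": [
          "SELECT * FROM page WHERE page_namespace = 0 ORDER BY page_touched DESC LIMIT 1000",
          "SELECT * FROM revision ORDER BY rev_timestamp DESC LIMIT 1000",
      ],
  }

  def warm(host, dbname, queries):
      conn = pymysql.connect(host=host, user="warmup", password="secret", database=dbname)
      try:
          with conn.cursor() as cur:
              for sql in queries:
                  cur.execute(sql)
                  cur.fetchall()  # pull the rows so the pages are actually read
      finally:
          conn.close()

  if __name__ == "__main__":
      for shard, host in SHARDS.items():
          for dbname, queries in WARMUP_QUERIES.items():
              warm(host, dbname, queries)
              print("warmed %s on %s (%s)" % (dbname, host, shard))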

Deployment - D-Day

  • Day minus 1 (1/21/13) - preparation work
    • Automated test run
    • determine if deploying bits early is a possibility
  • D-Day 1/22/13
    • see the "Actually Failing Over" section above
  • D-Day + 1 (1/23/13)

Risk & Mitigation

Identify the high risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.

  • What could make falling back to Tampa a big problem should the migration fail?
    • should Ceph fail?
    • should Swift@Tampa fail?
    • Database integrity
    • Performance
  • Need to determine Switchback Threshold - ??

Improving Switchover

  • pre-generate squid + varnish configs for the different primary-datacenter roles (see the sketch after this list)
  • implement MHA to better automate the mysql master failovers
  • migrate session storage to redis, with redundant replicas across colos
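
Pre-generating the cache configs could be as simple as rendering one template per candidate primary datacenter. The sketch below only illustrates the idea; the template contents and backend hostnames are invented for illustration, not the actual squid/varnish configuration.

  #!/usr/bin/env python
  """Sketch of pre-generating varnish backend configs for each possible primary
  datacenter. The template and hostnames are illustrative only."""
  from string import Template

  BACKEND_TEMPLATE = Template("""\
  backend appservers {
      .host = "$apache_lb";
      .port = "80";
  }
  """)

  # Assumed per-datacenter application server LB addresses.
  PRIMARY_ROLES = {
      "pmtpa": {"apache_lb": "appservers.svc.pmtpa.wmnet"},
      "eqiad": {"apache_lb": "appservers.svc.eqiad.wmnet"},
  }

  for dc, params in PRIMARY_ROLES.items():
      path = "backends.%s.vcl" % dc
      with open(path, "w") as f:
          f.write(BACKEND_TEMPLATE.substitute(params))
      print("wrote", path)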

Parking Lot Issues

  • Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate date.
    • Migration needs to happen before Fundraising season starts in Nov.
    • Vacation 'freeze'; all hands on deck the week before and after deployment
    • migrate ns1 from tampa to ashburn, but not a critical item.
  • An update from CT Woo from October 2012 regarding the status of the migration is available here. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).

AI - automated test scripts - ChrisM (see the smoke-test sketch after the use cases below)

Use Cases - Tests

  • Developer
    • check in / check out code
    • code review
    • Code push/deploy
    • revert deployment
  • User
    • registers
    • search article
    • read article
    • comment on article
    • edit article
    • create article
    • localization
  • Community member
    • tag article
    • (exercise special pages features)
  • Ops
    • monitoring works - ganglia, nagios, torrus, etc.
    • check amanda backups
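
The automated test run and the "User" use cases above could be covered by a small API smoke test executed against the appservers right after the switchover. A minimal sketch; the endpoint and article title are assumptions for illustration.

  #!/usr/bin/env python
  """Minimal smoke test for the 'User' use cases: search for and read an article
  via the MediaWiki API. The endpoint and title are placeholders."""
  import requests

  API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the test run

  def search(term):
      r = requests.get(API, params={"action": "query", "list": "search",
                                    "srsearch": term, "format": "json"})
      r.raise_for_status()
      return r.json()["query"]["search"]

  def read(title):
      r = requests.get(API, params={"action": "parse", "page": title,
                                    "prop": "text", "format": "json"})
      r.raise_for_status()
      return r.json()["parse"]["text"]["*"]

  if __name__ == "__main__":
      assert search("Wikipedia"), "search returned no results"
      assert "Wikipedia" in read("Wikipedia"), "article text looks wrong"
      print("read/search smoke test passed")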