Eqiad Migration Planning

Coordination

Outstanding Server/System Readiness

  • App, Imagescalers, Bits, Jobrunners and API Apaches
    • All Ready - awaiting code deploy
  • Parsoid servers@Eqiad
    • Target - 1/11/13 (RobH)
  • Set up Ceph in eqiad for image storage (Swift in Tampa & Ceph in eqiad) (Faidon/Mark)
    • 2 more servers set up (up to 4 now); intra-cluster replication ETA is Saturday early morning PST
    • holding off on adding more so as not to slow the swift->ceph replication
    • swift->ceph copy: 17.5 TB of 43 TB done, completion in ~12 days (very rough estimate)
    • some stability issues - working closely with the Ceph developers, fixes landing in real time
    • PERC H310 controller issue - worked around with RAID 0
    • Ceph 0.56 has been released and deployed to the eqiad cluster
    • various other hiccups, both hardware- and software-related
    • still pending: puppetization, rewrite.py -> VCL, testing with MediaWiki (a copy spot-check sketch follows this list)
  • Database Master switchover (PY / Asher)
    • MHA
    • https://bugzilla.wikimedia.org/show_bug.cgi?id=43453 - Checklist/script to switch datacenters - Tim
      • Automated DB/Apache switchover script
        • Tampa - Read-only
        • Eqiad - Grants needed
        • See "Actually Failing Over" below.
      • varnish configuration switchover script - Mark
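
For the pending swift->ceph verification and MediaWiki testing noted under the Ceph item, one quick spot check is to HEAD the same object on both stores and compare size and ETag. The sketch below is illustrative only; the frontend hostnames and container path are made-up placeholders, not the real cluster endpoints.

  #!/usr/bin/env python
  """Spot-check that objects copied from Swift (pmtpa) to Ceph/radosgw (eqiad)
  match. Hostnames and the container path are placeholders, not real endpoints."""
  import requests

  SWIFT_BASE = "http://ms-fe.pmtpa.example/v1/AUTH_mw/wikipedia-commons-local-public"  # assumed
  CEPH_BASE = "http://ms-fe.eqiad.example/swift/v1/wikipedia-commons-local-public"     # assumed

  def compare(object_name):
      """HEAD the object on both clusters and compare size and ETag."""
      old = requests.head("%s/%s" % (SWIFT_BASE, object_name))
      new = requests.head("%s/%s" % (CEPH_BASE, object_name))
      if new.status_code != 200:
          return "MISSING in ceph (%d)" % new.status_code
      if old.headers.get("Content-Length") != new.headers.get("Content-Length"):
          return "SIZE MISMATCH"
      if old.headers.get("ETag") != new.headers.get("ETag"):
          return "ETAG MISMATCH"
      return "OK"

  if __name__ == "__main__":
      for name in ["a/ab/Example.jpg"]:  # sample object paths to verify
          print(name, compare(name))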

Software / Config Requirements


  • replicating the git checkouts, etc. to new /home
    • not an issue

Actually Failing Over

  • Sequence (AI: Asher)
    • deploy db.php with all shards set to read-only in both pmtpa and eqiad
    • redis failover - set mc1001-1016 as masters, mc1-16 slaving from eqiad (see the sketch after this list)
    • deploy squid and mobile + bits varnish configs pointing to eqiad apaches
      • start with read-only mode
      • try to bypass puppet / must complete within a minute or two
    • database warmup - script a collection of SELECT queries for every project and run it against all eqiad dbs (see the warmup sketch after this list)
    • master swap every core db and writable es shard to eqiad
    • deploy db.php in eqiad removing the read-only flag, leave it read-only in pmtpa
      • the above master-swap + db.php deploys can be done shard by shard to limit the time certain projects are read-only
    • No DNS or Ceph/Swift changes required
    • Rollback plan - details still needed
    • turn off multi-write to NAS & turn on multi-write to Ceph
    • TEST! TEST! TEST!
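
The redis failover step in the sequence above could be driven by a short script that promotes the eqiad instances and then re-points the pmtpa ones. This is a minimal sketch using redis-py; the host naming, port, and one-to-one pairing are assumptions and would have to match the actual deployment.

  #!/usr/bin/env python
  """Minimal sketch of the redis failover step: promote the eqiad instances to
  masters, then point the pmtpa instances at them. Hostnames/port are assumed."""
  import redis

  PORT = 6379  # assumed redis port
  EQIAD = ["mc%d.eqiad.wmnet" % i for i in range(1001, 1017)]  # mc1001-mc1016
  PMTPA = ["mc%d.pmtpa.wmnet" % i for i in range(1, 17)]       # mc1-mc16

  # 1. Promote every eqiad instance to master (SLAVEOF NO ONE).
  for host in EQIAD:
      redis.Redis(host=host, port=PORT).slaveof()

  # 2. Re-point each pmtpa instance to slave from its eqiad counterpart.
  for pmtpa_host, eqiad_host in zip(PMTPA, EQIAD):
      redis.Redis(host=pmtpa_host, port=PORT).slaveof(eqiad_host, PORT)
      print("%s now slaving from %s" % (pmtpa_host, eqiad_host))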
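
The database warmup step could similarly be scripted: replay a canned collection of SELECT queries per project against every eqiad shard so the buffer pools are hot before the master swap. A rough sketch only; the shard map, credentials, and queries are placeholders, not the real collection.

  #!/usr/bin/env python
  """Rough sketch of the eqiad database warmup: replay a collection of SELECT
  queries per project against each eqiad core DB. All names are placeholders."""
  import pymysql

  # Assumed shard -> eqiad DB host mapping.
  SHARDS = {
      "s1": "db1001.eqiad.wmnet",
      "s2": "db1002.eqiad.wmnet",
  }

  # Assumed per-project query collections, e.g. sampled from production SELECT traffic.
  WARMUP_QUERIES = {
      "enwiki": [
          "SELECT * FROM page WHERE page_namespace = 0 ORDER BY page_touched DESC LIMIT 1000",
          "SELECT * FROM revision ORDER BY rev_timestamp DESC LIMIT 1000",
      ],
  }

  def warm(host, dbname, queries):
      conn = pymysql.connect(host=host, user="warmup", password="secret", database=dbname)
      try:
          with conn.cursor() as cur:
              for sql in queries:
                  cur.execute(sql)
                  cur.fetchall()  # pull the rows so the pages are actually read
      finally:
          conn.close()

  if __name__ == "__main__":
      for shard, host in SHARDS.items():
          for dbname, queries in WARMUP_QUERIES.items():
              warm(host, dbname, queries)
              print("warmed %s on %s (%s)" % (dbname, host, shard))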

Deployment - D-Day

  • Day minus 1 (1/21/13) - preparation work
    • Automated test run
    • determine if deploying bits early is a possibility
  • D-Day 1/22/13
    • see the "Actually Failing Over" section above
  • D-Day + 1 (1/23/13)

Risk & Mitigation

Identify the high risk migration tasks and ensure we have a way to mitigate or revert without extended downtime.

  • What could make falling back to Tampa a big problem should the migration fail?
    • should Ceph fail?
    • should Swift@Tampa fail?
    • Database integrity
    • Performance
  • Need to determine Switchback Threshold - ??

Improving Switchover

  • pre-generate squid + varnish configs for the different primary-datacenter roles (see the sketch after this list)
  • implement MHA to better automate the mysql master failovers
  • migrate session storage to redis, with redundant replicas across colos
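
Pre-generating the cache configs could be as simple as rendering one template per candidate primary datacenter. The sketch below only illustrates the idea; the template contents and backend hostnames are invented for illustration, not the actual squid/varnish configuration.

  #!/usr/bin/env python
  """Sketch of pre-generating varnish backend configs for each possible primary
  datacenter. The template and hostnames are illustrative only."""
  from string import Template

  BACKEND_TEMPLATE = Template("""\
  backend appservers {
      .host = "$apache_lb";
      .port = "80";
  }
  """)

  # Assumed per-datacenter application server LB addresses.
  PRIMARY_ROLES = {
      "pmtpa": {"apache_lb": "appservers.svc.pmtpa.wmnet"},
      "eqiad": {"apache_lb": "appservers.svc.eqiad.wmnet"},
  }

  for dc, params in PRIMARY_ROLES.items():
      path = "backends.%s.vcl" % dc
      with open(path, "w") as f:
          f.write(BACKEND_TEMPLATE.substitute(params))
      print("wrote", path)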

Parking Lot Issues

  • Identify and plan around the deployment/migration date - tentatively Oct 15, 2012 [see below]. Need to communicate date.
    • Migration needs to happen before Fundraising season starts in Nov.
    • Vacation 'freeze'; all hands on deck the week before and after deployment
    • migrate ns1 from tampa to ashburn, but not a critical item.
  • An update from CT Woo from October 2012 regarding the status of the migration is available here. It looks like it'll be pushed back to January or February 2013 (post-annual fundraiser).

AI - automated test scripts - ChrisM (see the smoke-test sketch after the use cases below)

Use Cases - Tests

  • Developer
    • check in / check out code
    • code review
    • Code push/deploy
    • revert deployment
  • User
    • registers
    • search article
    • read article
    • comment on article
    • edit article
    • create article
    • localization
  • Community member
    • tag article
    • (exercise special pages features)
  • Ops
    • monitoring works - ganglia, nagios, torrus, etc.
    • check amanda backups
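
The automated test run and the "User" use cases above could be covered by a small API smoke test executed against the appservers right after the switchover. A minimal sketch; the endpoint and article title are assumptions for illustration.

  #!/usr/bin/env python
  """Minimal smoke test for the 'User' use cases: search for and read an article
  via the MediaWiki API. The endpoint and title are placeholders."""
  import requests

  API = "https://en.wikipedia.org/w/api.php"  # assumed endpoint for the test run

  def search(term):
      r = requests.get(API, params={"action": "query", "list": "search",
                                    "srsearch": term, "format": "json"})
      r.raise_for_status()
      return r.json()["query"]["search"]

  def read(title):
      r = requests.get(API, params={"action": "parse", "page": title,
                                    "prop": "text", "format": "json"})
      r.raise_for_status()
      return r.json()["parse"]["text"]["*"]

  if __name__ == "__main__":
      assert search("Wikipedia"), "search returned no results"
      assert "Wikipedia" in read("Wikipedia"), "article text looks wrong"
      print("read/search smoke test passed")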