External storage

Big picture

The External Storage (ES) hosts store the text of wiki pages in the form of compressed page revisions. When a user requests a page, or a diff against an older revision, MediaWiki fetches the compressed content from the external store database and decompresses it to serve the request. Compressed text is stored in a number of different formats.
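
Which format applies to a given row is recorded in the old_flags column of the core text table (values such as utf-8, gzip, object, and external). A quick, cheap way to sample the formats in use on a wiki (the database name enwiki is just an example):

 $ mysql enwiki -e "SELECT old_id, old_flags FROM text ORDER BY old_id DESC LIMIT 10;"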

ES data is sharded into multiple 'clusters'. Originally this was to allow it to scale across multiple machines when one became unable to handle the load. Now those clusters have all been merged back onto one host (represented as separate tables inside each wiki's db) and the separation into clusters only serves to keep the table size down. Multiple clusters are used at any given time.

The Echo extension uses External Storage to store notifications. The Flow extension uses External Storage to store the Parsoid HTML of users' posts. Both use the 'extension1' cluster in production (as of 2014-07; see InitialiseSettings.php). The AbuseFilter extension uses ExternalStore as well.

Servers

File:External storage single cluster.png

The ES hosts are named es#. Eqiad hosts are numbered >=1000; codfw hosts are numbered >=2000. They are all Dell R510-class machines with twelve 2TB disks. Configured with RAID 10, they have ~12TB of available disk.

For a list of which servers are currently active serving which cluster, see db-*.php.

The ES servers run MySQL and use standard MySQL replication to maintain consistent datasets. The replication topology is described in the image on the right: replication is flat within each colo, with one host per colo handling inter-colo replication. Of course, this statement should be treated as immediately out of date and verified by querying the MySQL servers directly.
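
Verifying the topology amounts to asking each server who its master is and which slaves are attached to it, e.g. (host names are examples; note that SHOW SLAVE HOSTS only lists slaves configured with report-host):

 $ mysql -h es1002 -e 'SHOW SLAVE STATUS\G' | grep Master_Host
 $ mysql -h es1001 -e 'SHOW SLAVE HOSTS;'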

Reading db.php

The section in db.php that lists the ES hosts is called externalLoads. As with the other db arrays in db.php, the first host in the list is the master and the rest are slaves. Since all but one of the ES clusters are read only, the distinction makes no practical difference for those clusters.

The default table name on an ES host is 'blobs'. The templateOverridesByCluster section of db.php allows you to change that. Because most of the clusters are co-resident on the same servers, most of the tables are overridden.
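
A quick way to see which cluster tables actually live on a given host is simply to list them; on a server hosting several co-resident clusters there should be one table per cluster (host and database names below are examples):

 $ mysql -h es1001 enwiki -e "SHOW TABLES LIKE 'blobs%';"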

Nagios, Monitoring, and Health

Nagios

Nagios watches a few things on the ES hosts:

  • standards: ping, ssh, disk, etc.
  • mysql: replication lag, master write status

Here are responses to some potential Nagios alerts:

  • Host down
  • Disk full
    • verify whether it's / or /a/ that is full
    • delete replication logs, old backups, etc. (see the purge sketch below)
    • escalate to Sean or Jaime
    • if either / or /a/ reaches 100%, the database will need a reslave (see #Making a new slave using snapshots)
  • Replication fallen behind or stopped
    • Do nothing - MediaWiki will remove the host from rotation on its own
    • check the RAID for degraded disks
    • wait and see if it gets better on its own
    • if it doesn't get better on its own, escalate to Sean or Jaime
  • RAID
    • Figure out which disk has failed (see Raid and MegaCli; a sketch is below), then put in an RT ticket to replace it.
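
For the disk-full case, the safe way to delete replication logs is through MySQL rather than rm, because the server tracks its binlogs in an index file. A sketch (check that no slave still needs the logs before purging, and pick your own cutoff):

 $ df -h / /a
 mysql> SHOW BINARY LOGS;
 mysql> PURGE BINARY LOGS BEFORE NOW() - INTERVAL 7 DAY;

For the RAID case, a minimal sketch of finding the failed disk with MegaCli (the binary name, path, and flags vary by install; see the MegaCli page for the locally correct invocation):

 $ megacli -PDList -aALL | grep -i 'firmware state'

Any disk not reporting 'Online' is a candidate for replacement.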

Ganglia

In addition to the standard host metrics, each ES host exports a number of MySQL-specific metrics. The most useful of these are:

  • mysql_questions (how many queries are coming in)
  • mysql_max_used_connections (the high-water mark of simultaneous connections since the server started)
  • mysql_threads_connected (the number of currently open connections)
  • mysql_slave_lag (how many seconds behind the master the slave is currently lagging)
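
If Ganglia is down or lagging, the same numbers can be pulled straight from MySQL, since these metrics map onto standard status variables:

 $ mysql -e "SHOW GLOBAL STATUS LIKE 'Questions';"
 $ mysql -e "SHOW GLOBAL STATUS LIKE 'Max_used_connections';"
 $ mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected';"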

Health

Other generic commands to check the health of the databases:

  • show slave status\G
    • look for the following three lines
    • Slave_IO_Running: Yes
    • Slave_SQL_Running: Yes
    • Seconds_Behind_Master: 0
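
The same three lines can be checked from the shell, which is handy when sweeping across all the ES hosts (host name is an example):

 $ mysql -h es1002 -e 'SHOW SLAVE STATUS\G' | egrep 'Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master'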

Backups and Snapshots

Daily copies of /a/sqldata are currently taken on es1004. These snaps are filesystem copies, not LVM snapshots. They should be discontinued as soon as we have regular LVM snapshots. A Perl script called snaprotate.pl is floating around but is not yet in use on the ES hosts.

Taking a snapshot

The basic steps are: stop replication, record replication data, flush tables and sync the disks, take the snapshot, start replication.

  • mysql> flush tables; # this step is optional but speeds up the next flush tables
  • mysql> stop slave io_thread;
  • mysql> show slave status\G
    • note Read_Master_Log_Pos
  • mysql> show slave status\G
    • repeat until Read_Master_Log_Pos and Exec_Master_Log_Pos match and don't change twice in a row
  • mysql> flush tables;
  • $ mysql -e 'show slave status\G' > /a/slave_status_YYYY-MM-DD.txt
  • $ sync
  • $ device=$(lvdisplay | grep 'LV Name' | head -n 1 | awk '{print $3}')
    • what this means - note the LV Name field for the first volume in lvdisplay - it probably looks like /dev/es1003/data
  • $ lvcreate -L 200G -s -n snap $device
  • mysql> start slave;
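
Before relying on the snapshot, it is worth confirming that it exists and has room to grow; lvs shows the snapshot volume along with how much of its allocated space is in use (the volume group name es1003 is host-specific, per the lvdisplay note above):

 $ lvs
 $ lvdisplay /dev/es1003/snap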

Making a new slave using snapshots

This process is useful when you are making a new slave of the cluster. The broad task is to take a snapshot (recording slave status), copy the data over to a new host, clean it up, start replication, then delete the snapshot. Here are the steps in detail using example hosts A (the current slave) and B (the new host).

Do all the steps listed above in #Taking a snapshot.

On host A, also do:

  • $ mkdir /mnt/snap
  • $ fs=$(mount | grep /a | cut -f 1 -d\ )
    • what this means - note the filesystem mounted on /a; it probably looks like /dev/mapper/es1003-data
  • $ mount -t xfs -o ro,noatime,nouuid ${fs/data/snap} /mnt/snap
    • this mounts the snapshot at /mnt/snap. The snapshot is probably called something like /dev/mapper/es1003-snap

On host B:

  • stop all running mysql processes, either /etc/init.d/mysql stop or kill them
  • $ rsync -avP A:/mnt/snap/ /a/
    • copy over all of /mnt/snap/ into /a, overwriting anything you find there.
    • you should do this in a screen session; it will take about 2 days to complete.
  • $ cd /a/sqldata; rm A* master.info relay-log.info
    • by A* here I mean all the log files named after the old host. There are a bunch of them: binlogs, relay logs, the slow query log, the error log, etc., e.g. es1003-bin.000023, es1003-relay-bin.000004, es1003.err.
  • $ mysqld_safe &
  • $ cat /a/slave_status_YYYY-MM-DD.txt (the file you created above).
    • look for master host, user, log file, and position.
  • mysql> change master to master_host='10.0.0.123', master_user='repl', master_password='xxxxx', master_log_file='host-bin.00012', master_log_pos=12345;
  • mysql> start slave;

Back on host A:

  • $ umount /mnt/snap
  • $ lvremove ${device/data/snap}
    • e.g. lvremove /dev/es1003/snap - remove the snapshot volume, not the data volume

Database Schema

The ES hosts have tables named blobs or blobs_cluster#. The schema is more or less the same everywhere: two columns, an auto-incrementing ID and a blob containing the compressed text.

 CREATE TABLE `blobs` (
   `blob_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
   `blob_text` longblob,
   PRIMARY KEY (`blob_id`)
 ) ENGINE=MyISAM
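
Note that blob_text is not a standalone gzip file: MediaWiki compresses revisions with PHP's gzdeflate(), i.e. raw DEFLATE with no gzip header, so plain gzip -d will reject it. A minimal sketch for inspecting one blob, assuming a blob_id of 12345 (SELECT ... INTO DUMPFILE writes the bytes unescaped on the server side and requires the FILE privilege):

 mysql> SELECT blob_text INTO DUMPFILE '/tmp/blob.raw' FROM blobs WHERE blob_id = 12345;
 $ php -r 'print gzinflate(file_get_contents("/tmp/blob.raw"));'

This covers plain gzip-flagged revisions; blobs referenced by the two-integer pointer form (below) hold PHP-serialized HistoryBlob objects and need unserialize() as well.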

In the main database schema, you get to the external store from the old_text column of the text table. It contains either the actual text for the page (for very few very old pages) or a pointer into the external store databases. The pointer looks like one of these:

 DB://<clustername>/<integer>
 DB://<clustername>/<integer>/<integer>

clustername is resolved in db.php: there you find the hostname(s) of the servers that hold the content you're looking for, as well as the table name. The first integer is the blob_id; the second, when present, selects an item inside a multi-revision blob.
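
Putting that together, resolving a pointer like DB://cluster6/12345 by hand looks roughly like this (the host and wiki database are examples; get the real ones from db.php):

 $ mysql -h es1001 enwiki -e 'SELECT LENGTH(blob_text) FROM blobs_cluster6 WHERE blob_id = 12345\G'

(LENGTH() is used here just to confirm the blob exists without dumping binary to the terminal.)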

Path in the database from a page name to its text (using http://en.wikipedia.org/wiki/Defenestration as my example):

 select page_latest as rev_id from page where page_namespace=0 and page_title='Defenestration';
 select rev_text_id as old_id from revision where rev_id=$rev_id;
 select old_text from text where old_id=$old_id;

or, put together into one query:

 select text.old_text from page, revision, text 
  where page.page_namespace=0 and page.page_title='Defenestration'  
    and page.page_latest = revision.rev_id 
    and revision.rev_text_id = text.old_id;

Going back in the other direction, to find pages whose text lives on a specific cluster (warning: these are not efficient queries):

 select old_id as rev_id from text where old_text like 'DB://cluster6%' order by old_id desc limit 5;
 select rev_page as page_id from revision where rev_text_id=$rev_id;
 select page_title from page where page_id=$page_id;

or, put together as one query, select 20 pages from cluster 6 (warning: this query is slow):

 select page.page_title, revision.rev_id, text.old_text from text,revision,page 
  where text.old_text like 'DB://cluster6%' 
    and text.old_id = revision.rev_text_id 
    and page.page_id = revision.rev_page 
   order by page.page_id limit 20;

These pages can be loaded in a browser from the rev_id alone, with a URL like http://sv.wikipedia.org/w/index.php?oldid=2336620

See also