MariaDB/troubleshooting

[[File:Replication problems.jpg|thumb|DBAs are working on it]]

This guide is a work in progress.

Depooling a slave

Quick depooling, for emergencies only

Connect to the offending server. Log in to mysql (sudo mysql --defaults-file=/root/.my.cnf), and execute the following command:

STOP SLAVE;
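
To confirm that replication is actually stopped (so that the host will start showing up as lagged and get depooled), a quick check on the same host:

 SHOW SLAVE STATUS\G

Slave_IO_Running and Slave_SQL_Running should both say No, and Seconds_Behind_Master reports NULL while the slave is stopped.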

If you stop replication (assuming replication is working normally for other hosts), mediawiki will depool the slave automatically after X seconds. This is not the preferred method for several reasons:

  • The depool is not immediate; during that time, the server keeps receiving queries, which may be undesirable in many cases
  • The depool is not total; the server may still receive queries if all servers are lagged

The only advantage is that it does not require a deploy to depool the server. Use the regular depooling method in most cases.

Normal depooling

Comment out its server name in db-eqiad.php or db-codfw.php.

  's8' => array(
    'db1045' => 0,   # 1.4TB  64GB
#   'db2094' => 100, # depooled because reasons
    'db2095' => 100,
    'db2096' => 100,
    'db2097' => 100,
  ),
  • Do not just set the weight to 0; that will not fully depool the server
  • If the server has a particular role (normally commented on the same line), like vslow, dump, recentchanges, etc., you must take away the roles and assign them to another node in the 'groupLoadsBySection' array:
 's8' => array(
   'watchlist' => array(
#    'db2094' => 1,
     'db2055' => 1,
   ),
   'recentchanges' => array(
#    'db2094' => 1,
     'db2055' => 1,
   ),
   'recentchangeslinked' => array(
#    'db2094' => 1,
     'db2055' => 1,
   ),
 ),
  • Most roles can be taken by any slave (assuming they have enough resources and are not overloaded), although there are some cases where there are only a couple of "special slaves" optimized for a particular role.
  • Commit the change and deploy on tin:
/srv/mediawiki-staging $ sync-file wmf-config/db-eqiad.php "Depooled because reasons"
  • Monitor that mysql connections to that host slowly drop as soon as the queries finish. For that, connect to that host and use mysql's SHOW PROCESSLIST; and check there are no wikiuser or wikiadmin connections. You can also monitor connections with regular linux tools like netstat/ss on port 3306. Monitoring tools regularly check the host, but they use separate users.

Example:

MariaDB PRODUCTION x1 localhost (none) > SHOW PROCESSLIST;
+---------+-----------------+-------------------+--------------------+---------+-
| Id      | User            | Host              | db                 | Command | 
+---------+-----------------+-------------------+--------------------+---------+-
# internal process, ignore
|       2 | event_scheduler | localhost         | NULL               | Daemon  | 
# replication users, ignore
| 3192579 | system user     |                   | NULL               | Connect | 
| 3192580 | system user     |                   | NULL               | Connect | 
# monitoring users, ignore
| 6284249 | watchdog        | 10.XX.XX.XX:34525 | information_schema | Sleep   | 
| 6284250 | watchdog        | 10.XX.XX.XX:34716 | information_schema | Sleep   | 
| 6284253 | watchdog        | 10.XX.XX.XX:34890 | mysql              | Sleep   | 
# this is your own connection
| 6311084 | root            | localhost         | NULL               | Query   | 
+---------+-----------------+-------------------+--------------------+---------+-

(no wikiuser or wikiadmin processes, ok to do maintenance, kill the machine, etc.)

  • Except in the case of a problem, or the dump slave while it is creating the dumps, connections should go away in seconds or a very few minutes. If there is an emergency, killing the process (KILL <#ID>) is the way to go. SELECTs are ok to kill; writes and ALTERs can create worse issues due to the rollback process kicking in, so be sure of what you kill.
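
A quick way to list only the remaining mediawiki connections, and then kill one of them once you are sure it is a SELECT (the id below is just a placeholder, take it from the query output):

 SELECT id, user, time, left(info, 60) AS query
   FROM information_schema.processlist
   WHERE user IN ('wikiuser', 'wikiadmin');
 KILL 12345678;  -- only after checking the query is a SELECT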

Depooling a master (a.k.a. promoting a new slave to master)

See: Switch master (most of it still relevant).

Planned switchover

//FIXME: add the actual commands and some nice pictures:

This is the easiest case: the master is working (or at least replication is; or, if replication is not working, it has cleanly stopped, leaving all slaves in the exact same data state, in other words, there is no lag) and we only need to make a different server the current master for a particular shard. General summary of the steps:

  1. Put mediawiki in read-only mode (so users are aware of maintenance mode)
  2. Put the current mariadb master in read_only mode (now all servers should be in read-only mode, as slaves should generally always be in read-only mode)
  3. Wait until all slaves catch up (under regular, non-lagged conditions, it should take only 1-10 seconds)
  4. RESET SLAVE ALL the new master, so it is no longer replicating from any other server
  5. Note the binary log position of the new master
  6. CHANGE MASTER on all slaves except the new master to replicate from the new master
  7. CHANGE MASTER on the original master to replicate from the "new master"
  8. Modify mediawiki configuration to set the new master as the actual configured master
  9. Disable read_only mode on the new mysql master
  10. Disable mediawiki's read only mode
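
A minimal sketch of the MariaDB side of those steps (the host name, binlog file and position below are hypothetical; the mediawiki read-only and configuration steps are done separately):

 -- 2) on the old master
 SET GLOBAL read_only = 1;

 -- 3) on each slave, wait until it has caught up with the old master
 SHOW SLAVE STATUS\G  -- Seconds_Behind_Master = 0 and positions no longer advancing

 -- 4) and 5) on the new master
 STOP SLAVE;
 RESET SLAVE ALL;
 SHOW MASTER STATUS;  -- note File and Position

 -- 6) and 7) on every other slave, and then on the old master
 STOP SLAVE;
 CHANGE MASTER TO
   MASTER_HOST='db2095.codfw.wmnet',
   MASTER_LOG_FILE='db2095-bin.000123',
   MASTER_LOG_POS=456;
 START SLAVE;

 -- 9) on the new master, once mediawiki points to it
 SET GLOBAL read_only = 0;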

Topology changes can indeed be done in a hot way before the start of the maintenance, with scripts like repl.pl, minimizing the time mediawiki is in read-only mode, provided replication is running. That would alter the process to be:

  1. Set all slaves except the new master to be children of the new master
  2. Create a circular replication between the old and the new master by setting the old master as a slave of the new master
  3. Execute steps 1, 2, 3, 4, 8, 9, 10

This can be automated: MHA can do it; repl.pl does not do all the steps, but it can help with the slave changes; or a script like switch.php could be written to perform it.

There are some ops-level operations to clean up after the failover:

  • At the time of writing these lines, there is no puppet configuration to mark a master; this, however, is planned (to control the read_only config and monitoring). Review puppet and update it if necessary.
  • There are some tools that depend on the master being sX-master: update the DNS as necessary.
  • Make sure pt-heartbeat, if not yet controlled by puppet, is running on the master; mediawiki may already be using it (instead of SHOW SLAVE STATUS) to calculate the lag between the master and the slave.
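
To check that pt-heartbeat is indeed writing on the new master (assuming the default pt-heartbeat database and table names), something like:

 SELECT ts, server_id, file, position
   FROM heartbeat.heartbeat
   ORDER BY ts DESC
   LIMIT 1;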

Emergency failover

If the master is not available, or replication is broken, this is a more complex case. The reason is that the slaves will have executed a different number of transactions and will be in a close, but different, state. E.g. slave1 has executed transactions A, while slave2 has executed transactions A, B and C. In addition to that, if we do not have access to the master's binary log (or it has not been properly synchronized to disk after a crash), we will have to recover from a slave. In theory, with semi-sync replication, no transaction will be lost, and at least one slave will have the change, but all the other slaves will be on different coordinates (and binary log positions are only local to the master).

Scenario 1 - the master is recoverable: just wait until the master restarts; it will avoid headaches and be faster and less disruptive than trying to fail over.

Scenario 2 - the master is not recoverable, but its binary log is (and all slaves have an equal or smaller amount of data):

  1. For each slave: apply the master's binary log, starting from that slave's last Exec_Master_Log_Pos, so that all slaves end up in the same state
  2. Follow regular failover steps as mentioned in the scheduled maintenance

Scenario 3 - neither the master nor its binary logs are recoverable (or a master binary log is behind a slave binary log): we need to put all servers in the same state, using the most up-to-date slave, then perform the regular failover process. This is the most complicated part without using GTIDs:

  1. Identify the most up-to-date slave by comparing Relay_Master_Log_File and Exec_Master_Log_Pos
  2. By comparing binary log contents, find for each of the other slaves the coordinate in the most up-to-date slave's binlog that corresponds to that slave's current position. This is the tricky part. pt-heartbeat should be able to help find this.
  3. Execute the pending transactions on each slave
  4. Follow the regular steps for scheduled maintenance
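
For step 1, run the following on each slave and compare the output; the one with the highest (Relay_Master_Log_File, Exec_Master_Log_Pos) pair is the most up to date:

 SHOW SLAVE STATUS\G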

Again, these steps can be automated.

Replication lag

Caused by hardware

[[File:Hwraid.png|thumb|This is what a half-failing disk looks like in monitoring (small lag until it becomes critical).]]

One common cause of lag that is easy to check and repair is hardware issues. Databases have a lot (and I mean a lot) of IO pressure, and while it is not insane, it means that 3-year old drives are very prone to fail.

As an operator, you are already familiar with the way drives fail (not very reliably, to be honest). All important databases have a hardware RAID, which means 1 disk can fail at a time, usually with very little impact. When that happens, the icinga alert "1 failed LD(s) (Degraded)" should tell you it is time to replace at least one disk. Usually there are spares onsite or the servers are under warranty, which means you can create a ticket to ops-eqiad or ops-codfw and let Chris or Papaul know, so they can take the failed disk out and insert a new one; the hw RAID should automatically reconstruct itself.

To check the RAID status, execute:

 megacli -AdpAllInfo -aALL

And check the section "Devices present"

To identify the particular disk

 megacli -PDList -aALL

Check in particular for the Firmware State (on or off), the S.M.A.R.T. alerts, and the number of medium errors (a few, like a dozen, should not affect performance much, but when there are hundreds of errors in a short timespan, that is an issue).

Sadly, disks fail in very creative ways, and while our RAID controllers are reliable enough to 1) continue despite medium errors and 2) disable the disk when it fails completely, in a state of "almost failing" there could be lag issues. If that is the case, executing:

 megacli -PDOffline -PhysDrv \[#:#\] -aALL

where #:# is enclosure:slot, will take the particular physical drive offline so that it can be replaced later.

Data loss

Cloning a slave

The best (and fastest) way to repair a slave (or a master) is to clone from another slave. Either stop a slave (after depooling it) and just send the files away or use xtrabackup to clone it. The rest of the techniques assume that the data loss is fleet-wide (a logical/application/DBA error, rather than a physical problem).

Delayed slave

There are 2 delayed slaves with the main production data, 1 day behind: dbstore1001 and dbstore2001. If data loss is detected within 24 hours, it is important to stop those slaves so they do not execute the offending queries. It is not easy to stop them, as they are programmed to always be 1 day behind. To stop replication on those, disable the event scheduler:

 SET GLOBAL event_scheduler = 0;

Then STOP ALL SLAVES; or STOP SLAVE 's1';, to stop only connection s1 (enwiki replication), for example. Then you can roll the changes forward from the current slave position using mysqlbinlog:

 mysqlbinlog --start-position=<relay log position> --stop-position/--stop-datetime=<stopping relay coordinate or date time> <relay logs> | mysql

or just START SLAVE UNTIL;
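
For example, to roll only the s1 connection forward to a point just before the offending write (the binlog file name and position below are hypothetical):

 START SLAVE 's1' UNTIL MASTER_LOG_FILE = 'db1052-bin.002000', MASTER_LOG_POS = 123456;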

And finally, you can use mysqldump to import and export certain rows or tables:

 mysqldump -h dbstore1001 --single-transaction database table | mysql -h remote-master

Please note that tables on the dbstores are most of the time in TokuDB format, so they should be converted to InnoDB on import.
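
If a table still ends up in the wrong engine on the destination, it can be converted in place (the table name below is just an example; the rewrite can take a long time on big tables):

 ALTER TABLE enwiki.pagelinks ENGINE=InnoDB;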

Backups

There are weekly logical backups in /srv/backups on dbstore1001 (kept for 3 weeks) and in bacula (kept for 6 months). They are very slow to recover (it could take 3 days to recover a single shard), so think twice before recovering from them. If possible, load the tables in parallel.


Data inconsistency between nodes