Help:Tool Labs/Database

From Wikitech

Tool and Labs accounts are granted access to replicas of the production databases. Private user data has been redacted from these replicas (some rows are elided and/or some columns are made NULL depending on the table), but otherwise the schema is, for all practical purposes, identical to the production databases and the databases are sharded into clusters in much the same way.

Database credentials (credential user/password) are stored in the 'replica.my.cnf' file found in the tool account’s home directory. To use these credentials with command-line tools by default , copy 'replica.my.cnf' to '.my.cnf'.

If you do not have a 'replica.my.cnf' in your home directory, please create a ticket in Phabricator.


Naming conventions

As a convenience, each mediawiki project database (enwiki, bgwiki, etc) has an alias to the cluster it is hosted on. The alias has the form:

project.labsdb

where 'project' is the name of a hosted mediawiki project (e.g. enwiki for the English Wikipedia or bgwiktionary for the Bulgarian Wiktionary. (full list of available projects )).

The database names themselves consist of the mediawiki project name, suffixed with _p (an underscore, and a p), for example:

enwiki_p (for the English Wikipedia replica)

In addition each cluster can be accessed by the name of its shard sX.labsdb (for example, s1.labsdb hosts the enwiki_p database). As the cluster where a database is located can change, you should only use this name if your application requires it, e.g. for heavily crosswiki tools which would otherwise open hundreds of database connections.

If you use a user database (see below), connect to a database server instead of one of the aliases. The target of the aliases can (and do!) change, and your database will seem to have disappeared if it does. The current database servers are c1.labsdb, c2.labsdb and c3.labsdb.

Connecting to the database replicas

You can connect to the database replicas (and/or the cluster where a database replica is hosted) by specifying your access credentials and the alias of the cluster and replicated database. For example:

To connect to the English Wikipedia replica, specify the alias of the hosting cluster (enwiki.labsdb) and the alias of the database replica (enwiki_p) :

mysql --defaults-file="${HOME}"/replica.my.cnf -h enwiki.labsdb enwiki_p

To connect to the Wikidata cluster:

mysql --defaults-file=~/replica.my.cnf -h wikidatawiki.labsdb

To connect to Commons cluster:

mysql --defaults-file=~/replica.my.cnf -h commonswiki.labsdb

There is also a shortcut for connecting to the replicas: sql <dbname>[_p] The _p is optional, but implicit (i.e. the sql tool will add it if absent).

To connect to the English Wikipedia database replica using the shortcut, simply type:

sql enwiki

Connecting to the database replicas from other Labs instances

It is possible to connect to the tools.labs database replica (labsdb) from other labs instances besides tools with a bit of network configuration.

Connecting to the database replicas from your own computer

You can access the database replicas from your own computer by setting up an SSH tunnel. If you use MySQL Workbench, you can find a detailed description for that application below.

In the general case, add the option -L 4711:enwiki.labsdb:3306 to your ssh call, e. g.:

ssh -L 4711:enwiki.labsdb:3306 tools-login.wmflabs.org

This will set up a tunnel so that connections to port 4711 on your own computer will be relayed to the database replica's MySQL server on port 3306. This tunnel will persist as long as the corresponding SSH session.

If you use mysql, the corresponding command line would look like:

mysql --user=u1234 --host=127.0.0.1 --port=4711 --password enwiki_p

Note that you need to use the IP address of your loopback device; using localhost instead will give an "Access denied" error.

Creating new databases

User-created databases can be created on the database hosting the replica servers or on a database local to the 'tools' project: tools-db. The latter tends to be a bit faster since that server has less heavy activity, and tools-db is the recommended location for user-created databases when no interaction with the production replicas is needed. If you create a database on a replica server, make sure to connect to that specific server (e.g. c1.labsdb) and not to an alias (e.g. enwiki.labsdb), as the alias-server relationship can (and do) change.

Database names must start with the name of the credential user (not your user name), which can be found in your ~/replica.my.cnf file (the name of the credential user looks something like 'u1234' for a user and 's51234' for a service group). The name of the credential user is followed by two underscores and then the name of the database:

credentialUser__DBName

Privileges on the database

Users have all privileges and have access to all grant options on their databases. Database names ending with _p are granted read access for everyone. Please create a ticket if you need more fine-grained permissions, like sharing a database only between 2 users, or other special permissions.

Steps to create a user database on the replica servers

If you would like your database to interact with the replica databases (i.e., if you need to do actual SQL joins with the replicas, which can only be done on the same cluster) you can create a database on the replica servers.

To create a database on the replica servers:

1. Become your tool account:

maintainer@tools-login:~$ become toolaccount

2. Connect to the replica servers with the replica.my.cnf credentials. You must specify the host of the replica (e.g., enwiki.labsdb):

mysql --defaults-file="${HOME}"/replica.my.cnf -h XXwiki.labsdb

3. In the mysql console, create a new database (where CREDENTIALUSER is your credentials user, which can be found in your ~/replica.my.cnf file, and DBNAME the name you want to give to your database. Note that there are 2 underscores between CREDENTIALUSER and DBNAME):

MariaDB [(none)]> CREATE DATABASE CREDENTIALUSER__DBNAME

You can then connect to your database using:

mysql --defaults-file="${HOME}"/replica.my.cnf -h XXwiki.labsdb CREDENTIALUSER__DBBAME
Warning Caution: Writing (INSERT, UPDATE, DELETE) tables on the replica servers leads to the replication being stopped for all users on that server until the query finishes. So you need to make sure that such queries can be executed in a time frame that does not disrupt other users' work too much, for example by processing data in smaller batches or rethinking your data flow. As a rule of thumb, queries should finish in less than a minute.

Steps to create a user database on tools-db

To create a database on tools-db:

1. Become your tool account.

maintainer@tools-login:~$ become toolaccount

2. Connect to tools-db with the replica.my.cnf credentials:

mysql --defaults-file="${HOME}"/replica.my.cnf -h tools-db

You could also just type:

sql local

3. In the mysql console, create a new database (where CREDENTIALUSER is your credentials user, which can be found in your ~/replica.my.cnf file, and DBNAME the name you want to give to your database. Note that there are 2 underscores between CREDENTIALUSER and DBNAME):

MariaDB [(none)]> CREATE DATABASE CREDENTIALUSER__DBNAME;

You can then connect to your database using:

mysql --defaults-file="${HOME}"/replica.my.cnf -h tools-db CREDENTIALUSER__DBBAME

Example

Assuming that your tool account is called "mytool", this is what it would look like:

maintainer@tools-login:~$ become mytool
tools.mytool@tools-login:~$ cat replica.my.cnf | grep user
user='123something'
mysql --defaults-file="${HOME}"/replica.my.cnf -h tools-db
create database 123something__wiki;

Tables for revision or logging queries involving user names and IDs

The revision and logging tables do not have indexes on user columns. In an email, one of the system administrators pointed out that this is because "those values are conditionally nulled when supressed". One has to instead use the corresponding revision_userindex or logging_userindex for these types of queries. On those views, rows where the column would have otherwise been nulled are elided; this allows the indexes to be usable.

Example query that will use the appropriate index (in this case on the rev_user_text column, the rev_user column works the same way for user IDs):

SELECT rev_id, rev_timestamp FROM revision_userindex WHERE rev_user_text="Foo"

Example query that fails to use an index because the table doesn't have them:

SELECT rev_id, rev_timestamp FROM revision WHERE rev_user_text="Foo"

You should use the indexes so queries will go faster (performance).

Metadata database

From bugzilla:48626 there is a table with automatically maintained meta information about the replicated databases: meta_p.wiki. See Quarry #4031 for an up-to-date version.

MariaDB [meta_p]> DESCRIBE wiki;
+------------------+--------------+------+-----+---------+-------+
| Field            | Type         | Null | Key | Default | Extra |
+------------------+--------------+------+-----+---------+-------+
| dbname           | varchar(32)  | NO   | PRI | NULL    |       |
| lang             | varchar(12)  | NO   |     | en      |       |
| name             | text         | YES  |     | NULL    |       |
| family           | text         | YES  |     | NULL    |       |
| url              | text         | YES  |     | NULL    |       |
| size             | decimal(1,0) | NO   |     | 1       |       |
| slice            | text         | NO   |     | NULL    |       |
| is_closed        | decimal(1,0) | NO   |     | 0       |       |
| has_echo         | decimal(1,0) | NO   |     | 0       |       |
| has_flaggedrevs  | decimal(1,0) | NO   |     | 0       |       |
| has_visualeditor | decimal(1,0) | NO   |     | 0       |       |
| has_wikidata     | decimal(1,0) | NO   |     | 0       |       |
| is_sensitive     | decimal(1,0) | NO   |     | 0       |       |
+------------------+--------------+------+-----+---------+-------+

Example data:

MariaDB [nlwiki_p]> select * from meta_p.wiki limit 1 \G
 *************************** 1. row ***************************
          dbname: aawiki
            lang: aa
            name: Wikipedia
          family: wikipedia
             url: http://aa.wikipedia.org
            size: 1
           slice: s3.labsdb
       is_closed: 1
        has_echo: 1
 has_flaggedrevs: 0
has_visualeditor: 1
    has_wikidata: 1
    is_sensitive: 0

Identifying lag

If there is a network/labs db infrastructure problem, production problem, maintenance (scheduled or unscheduled), excessive load or production or user's queries blocking the replication process, labs replicas can get behind the production databases "lag".

To identify lag, execute on the database host you are connected to:

mysql> SELECT * FROM heartbeat_p.heartbeat;
+-------+----------------------------+------+
| shard | last_updated               | lag  |
+-------+----------------------------+------+
| s6    | 2015-11-24T12:21:13.500950 |    0 |
| s2    | 2015-11-24T12:21:13.501200 |    0 |
| s7    | 2015-11-24T12:21:13.501190 |    0 |
| s3    | 2015-11-24T12:21:13.501110 |    0 |
| s4    | 2015-11-24T12:21:13.501170 |    0 |
| s1    | 2015-11-24T12:21:13.500670 |    0 |
| s5    | 2015-11-24T12:21:13.500780 |    0 |
+-------+----------------------------+------+
7 rows in set (0.00 sec)

This table is based on the tool pt-heartbeat, not on SHOW MASTER STATUS, producing very accurate results, even if replication is broken, and directly comparing it to the original master, and not the replicas's direct master.

  • shard: s1-7. Each of the production masters. The wiki distribution can be seen at: https://noc.wikimedia.org/db.php
  • last_updated: Every 0.5 seconds, a row in the master is written with the date local to the master. Here you have its value, once replicated. As it is updated every 0.5 seconds, it has a measuring error of [0, 0.5] seconds.
  • lag: The difference between the current date and the last_updated column, rounded to seconds (timestampdiff(SECOND,`heartbeat`.`heartbeat`.`ts`,utc_timestamp())). Due to a MariaDB bug, decimals are truncated, but expect a number with 6 decimals soon.

Please note that some seconds or a few minutes of lag is considered normal, due to the filtering process and the hops done before reaching the public hosts.

Example queries

See Help:MySQL queries. Add yours!

Python

Use User:Legoktm/wmflib.

Configuring MySQL Workbench

You can connect to databases on Tool Labs with MySQL Workbench (or similar client applications) via an SSH tunnel.

Configuring PGAdmin for OSM access

This guide assumes you already know what PGAdmin is and how to use it.

Connecting to OSM via the official CLI PostgreSQL

  1. SSH to tools-login.wmflabs.org or tools-dev.wmflabs.org
  2. psql -h osmdb.eqiad.wmnet -U osm gis
  3. Start querying