Dumps/Rerunning a job

Fixing a broken dump

Back in the day, the dumps were meant to be generated in an endless loop with no human intervention; if a dump broke, you waited another 10 days or so until your project's turn came around again and then there was a new one.

These days folks want the data Right Now, and some dumps take a good long time to run (*cough*en wp*cough*). If you see a broken or failed dump, this is how you can fix it up.

Rerunning a complete dump

If most of the steps failed, or the script failed or died early on, you might as well rerun the entire thing. If the dump scheduler is running, just wait for the entire dump to be rerun. Otherwise, follow the steps below.

  1. Be root on the appropriate snapshot host (1001 for en wiki, one of 1002/1004 for the rest).
  • Note that if the dump is currently running you will need to be on the specific host running it, as you'll need to shoot the running dumps.
  2. Start a screen session (these dumps take a while to run).
  3. sudo -s datasets
  4. cd /srv/dumps
  5. Determine which config file the wiki uses: enwiki uses wikidump.conf.hugewikis, "big" wikis (listed here...) use wikidump.conf.bigwikis, and the rest use wikidump.conf.
  6. Make sure any process dumping the wiki in question has stopped:
    • python dumpadmin.py --kill --wiki <wikiname here> --configfile confs/<config-file here>
  7. Clean up the lock file left behind, if any:
    • python dumpadmin.py --unlock --wiki <wikiname here> --configfile confs/<config-file here>
  8. Rerun the entire dump. Steps already completed properly will be skipped.
    • If it's the full run with history, do this:
    python ./worker.py --date last --skipgood --log --configfile confs/<config-file here> <wikiname-here>
    • If it's not the full history, but the abbreviated run, do this:
    python ./worker.py --date last --skipgood --log --configfile confs/<config-file here> --skipjobs metahistorybz2dump,metahistorybz2dumprecombine,metahistory7zdump,metahistory7zdumprecombine <wikiname-here>

NOTE THAT if a new dump is already running for the project by the time you discover the broken dump, you can't do this. You'll need to either wait for the new dump to complete or run the old one, one step at a time (see below).
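To make the placeholders concrete, here is what the whole sequence might look like for a small wiki such as elwiktionary. This is only an illustrative sketch: it assumes elwiktionary is on neither the huge nor the "big" list and so uses the plain wikidump.conf, so double-check the lists before copying it.

    # stop any dump process still working on the wiki
    python dumpadmin.py --kill --wiki elwiktionary --configfile confs/wikidump.conf
    # remove any stale lock file
    python dumpadmin.py --unlock --wiki elwiktionary --configfile confs/wikidump.conf
    # rerun the full dump; steps that already completed properly are skipped
    python ./worker.py --date last --skipgood --log --configfile confs/wikidump.conf elwiktionary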

Rerunning one piece of a dump

ONLY do this if the dump on the wiki is already running (and hence locked) and you really really have to have that output Right Now.

  1. As above, you'll need to determine the date, which configuration file you need, and which host to run from.
  2. You don't need to do anything about lockfiles.
  3. Determine which job (which step) needs to be re-run. Presumably the failed step has been recorded on the web-viewable page for the particular dump (http://dumps.wikimedia.org/wikiname-here/YYYYmmdd/), in which case it will be marked as status:failed in the dumpruninfo.txt file in the run directory (a quick check is sketched just after this list). Use the job name from that file, and remember that the jobs are listed in reverse order of execution. If a user reported the problem or you aren't sure which job it is, see Dumps/Phases of a dump run to figure out the right job(s).
  4. If there's already a root screen session on the host, use it; otherwise start a new one. Open a window and run:
    • su - datasets
    • bash
    • cd /srv/dumps
    • python ./worker.py --job job-name-you-found --date YYYYmmdd --configfile wikidump.conf.XXX --log name-of-wiki
    The date in the above will be the date in the directory name and on the dump web page for that wiki.
    Example: to rerun the generation of the bzip2 pages meta history file for the enwiki dumps for January 2012 you would run
    python ./worker.py --job metahistorybz2dump --date 20120104 --configfile wikidump.conf.enwiki --log enwiki
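As mentioned in step 3, the quickest way to spot the failed job is to look for it in dumpruninfo.txt. The path below is an assumption pieced together from the output locations shown elsewhere on this page, so adjust it to wherever the run directory actually lives:

    # list the jobs that did not finish for the enwiki January 2012 run (example path)
    grep failed /mnt/data/xmldatadumps/public/enwiki/20120104/dumpruninfo.txt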

Rerunning an interrupted en wikipedia history dump

Just rerun the en wikipedia dump as described in 'Rerunning a complete dump'; checkpoint files and everything else will be taken care of automatically.

Rerunning a dump from a given step onwards

ONLY do this if the output for these jobs is corrupt and needs to be regenerated. Otherwise follow the instructions to rerun a complete dump, which will simply rerun steps with missing output.

Do as described in 'Rerunning one piece of a dump' but add '--cleanup', '--exclusive', and '--restart' args before the wikiname.

You must not do this while a dump run for the wiki is in progress.
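For example, building on the enwiki command shown in 'Rerunning one piece of a dump', rerunning everything from the bzip2 history step onwards would be that same command with the three extra flags added before the wiki name:

    python ./worker.py --job metahistorybz2dump --date 20120104 --configfile wikidump.conf.enwiki --log --cleanup --exclusive --restart enwiki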

Rerunning a step without using the python scripts

ONLY FOR DEBUGGING.

Sometimes you may want to rerun a step using mysql or the MediaWiki maintenance scripts directly, especially if the particular step causes problems more than once.

In order to see what command a job runs under worker.py, you can either look at the log (dumplog.txt) or run the step from worker.py with the "--dryrun" option, which tells it: don't actually do this, just write the commands that would be run to stderr.

  1. Determine which host the wiki is dumped from, which configuration file is used, the date of the dump and the job name, as described in the section above about rerunning one piece of a dump.
  2. Give the appropriate worker.py command, as in that same section, adding the option "--dryrun" before the name of the wiki.

Examples

  • To see how the category table gets dumped, type:
    python ./worker.py --date 20120109 --job categorytable --dryrun elwiktionary
    to get the output
    Command to run: /usr/bin/mysqldump -h 10.0.6.21 -u XXX -pXXX --opt --quick --skip-add-locks --skip-lock-tables elwiktionary category | /bin/gzip > /mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-category.sql.gz
  • To see how the stub xml files get dumped, type:
    python ./worker.py --date 20120109 --job xmlstubsdump --dryrun elwiktionary
    to get the output
    Command to run: /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpBackup.php --wiki=elwiktionary --full --stub --report=10000 --force-normal --output=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz --output=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-current.xml.gz --filter=latest --output=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-articles.xml.gz --filter=latest --filter=notalk --filter=namespace:!NS_USER
    As you see from the above, all three stub files are written at the same time.
  • To see how the full history xml bzipped file is dumped, type:
    python ./worker.py --date 20120109 --job metahistorybz2dump --dryrun elwiktionary
    to get the output
    Command to run: /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20120101/elwiktionary-20120101-pages-meta-history.xml.bz2 --force-normal --report=1000 --spawn=/usr/bin/php --output=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-pages-meta-history.xml.bz2 --full
    Don't be surprised if you see that it would prefetch from a file more recent than the dump you are fixing up. If the more recent dump is marked as successful, the script will try to do just that, which may be unexpected behaviour but should give you good output... unless you suspect the more recent dump of having bad data. In that case you should see the sections about prefetch below.

Generating new dumps

When new wikis are enabled on the site, they are added to all.dblist which is checked by the dump scripts. They get dumped as soon as a worker completes a run already in progress, so you don't have to do anything special for them.

Running a (new) specific dump by hand

Once in a while we get a request for a dump of a wiki out of sequence, so that it can be archived before it is shut down and removed, for example.

Just follow the instructions for rerunning an entire dump.

Text revision files

A few notes about the generation of the files containing the text revisions of each page.

Stub files as prerequisite

You need to have the "stub" XML files generated first. These get done much faster than the text dumps. For example, generating the stub files for en wikipedia without doing multiple pieces at a time took less than a day in early 2010, but generating the full history file without parallel runs took over a month, and today it would take much longer.

While you can specify a range of pages to the script that generates the stubs, there is no such option for generating the revision text files. The revision ids in the stub file used as input determine which revisions are written as output.
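For reference, the page range for stubs is passed via dumpBackup.php's --start/--end page id options. The sketch below is the stubs command from the dry run example above, trimmed to a single output and restricted to a limited page id range; treat the option names and paths as assumptions and verify them against your MediaWiki version:

    # generate history stubs for page ids 1 up to (but not including) 100000 only
    /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpBackup.php --wiki=elwiktionary --full --stub --report=10000 --start=1 --end=100000 --output=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz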

Prefetch from previous dumps

In order to save time and wear and tear on the database servers, old data is reused to the extent possible; the production scripts run with a "prefetch" option which reads revision texts from a previous dump and, if they pass a basic sanity check, writes them out instead of polling the database for them. Thus, only new or restored revisions in the database should be requested by the script.

Using a different prefetch file for revision texts

Sometimes the file used for prefetch may be broken or the XML parser may balk at it for whatever reason. You can deal with this in two ways.

  1. You could mark the file as bad, by going into the dump directory for the date the prefetch file was generated and editing the file dumpruninfo.txt, changing "status:done;" to "status:bad;" for the dump job (one of articlesdump, metacurrentdump or metahistorybz2dump), and then rerun the step using the python script worker.py.
  2. You could run the step by hand without the python script (see the section above on how to do that), specifying prefetch from another, earlier file or set of files. Example: to regenerate the elwiktionary history file from 20120109 with a prefetch from the 20111224 output instead of the 20120101 files, type:
    /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz --prefetch=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20111224/elwiktionary-20111224-pages-meta-history.xml.bz2 --force-normal --report=1000 --spawn=/usr/bin/php --output=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-pages-meta-history.xml.bz2 --full

Skipping prefetch for revision texts

Sometimes you may not trust the contents of the previous dumps, or you may not have them at all. You can run without prefetch in that case, but it is much slower, so avoid it for larger wikis if possible. To skip prefetch, do one of the following:

  1. run the worker.py script with the option --noprefetch
  2. run the step by hand without the python script (see the section above on how to do that), removing the prefetch option from the command. Example: to regenerate the elwiktionary history file from 20120109 without prefetch, you would type:
    /usr/bin/php -q /apache/common/multiversion/MWScript.php dumpTextPass.php --wiki=elwiktionary --stub=gzip:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-stub-meta-history.xml.gz --force-normal --report=1000 --spawn=/usr/bin/php --output=bzip2:/mnt/data/xmldatadumps/public/elwiktionary/20120109/elwiktionary-20120109-pages-meta-history.xml.bz2 --full