Compress old revisions

There is a maintenance script to compress individual old revisions. It has two modes: single revision compression (leaving about 50% of the original space in use) and concatenated multiple revision compression (about 25%). It needs to be run as root so it can create the log files.

Concatenated multiple revision compression

This reduces the size of old records to about 25% of the original by combining multiple revisions and compressing them into a single record. It is not available as a configuration setting, so you need to apply it as a batch job.

  • cd /home/wikipedia/common/php-new/maintenance
  • nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " | tee -a /home/wikipedia/logs/compressOld/20050108enwiki

If the preceding run was interrupted after getting as far as Burke, it would be resumed with nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Burke | tee -a /home/wikipedia/logs/compressOld/20050108enwiki.

The -q " cur_namespace not in (10,11,14,15) " part is optional, but it should be used for Wikimedia-hosted projects at present, while deletion and undeletion of articles with concatenated compressed revisions remains unavailable. It excludes the template, template talk, category and category talk namespaces (10, 11, 14 and 15) from concatenated compression, because those pages are currently being changed at a high rate.

It is not a problem to apply concatenated compression to records which are already compressed.

Normal operation looks like this:

[user@zwinger:/home/wikipedia/common/php-1.4/maintenance]$ nice php compressOld.php en wikipedia -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Cleanthes | tee -a /home/wikipedia/logs/compressOld/20050108enwiki
Depending on the size of your database this may take a while!
If you abort the script while it's running it shouldn't harm anything,
but if you haven't backed up your data, you SHOULD abort now!

Press control-c to abort first (will proceed automatically in 5 seconds)
Starting from Cleanthes
Starting article selection query cur_title >= 'Cleanthes' AND  cur_namespace not in (10,11,14,15)  ...

Cleanthes
Talk:Cleanthes
Wikipedia_talk:Cleanup ..................../...................././
Waiting for 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.24 10.0.0.23
Cleanup
MediaWiki_talk:Cleanup
Wikipedia:Cleanup ........../........../........../........../
.........../........../.........../............./............../
............../............./............./............./............/
............/............/............./.............../

When an article has a large number of revisions, it's possible that the connection to one of the database servers will time out. Restarting after that is harmless, if irritating:

Waiting for 10.0.0.2 10.0.0.1 A database error has occurred
Query: COMMIT
Function: Database::immediateCommit
Error: 2013 Lost connection to MySQL server during query (10.0.0.1)

Backtrace:
Database.php line 345 calls wfdebugdiebacktrace()
Database.php line 297 calls databasemysql::reportqueryerror()
Database.php line 1345 calls databasemysql::query()
Database.php line 1262 calls databasemysql::immediatecommit()
compressOld.inc line 249 calls databasemysql::masterposwait()
compressOld.inc line 226 calls waitforslaves()
compressOld.php line 74 calls compresswithconcat()

The cause is being investigated; it may be servmon killing the idle slave threads during long master operations, since retrieving all the old records for an article can sometimes take a long time, perhaps 400 seconds for 20,000 records on a lightly loaded master.

Single revision compression

This produces about a 50% reduction and is also available automatically via a config file setting. Use the batch job to apply the compression to records that were saved before the setting was turned on.
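
On a standard MediaWiki install the config setting in question is presumably $wgCompressRevisions in LocalSettings.php (verify the name against the DefaultSettings.php of your version); it only affects revisions saved after it is enabled:

  • $wgCompressRevisions = true;   # in LocalSettings.php; check your MediaWiki version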

  • cd /home/wikipedia/common/php-new/maintenance
  • nice php compressOld.php en wikipedia -t 1 -c 100 5467442
  • -t 1 : the time to sleep between batches, in seconds
  • -c 100: the number of old records per batch
  • 5467442: the old_id to start at; use 1 for a fresh start. The current old_id is displayed as the job runs, so if you stop the job, note the last value reached and use it to resume later. You get a warning for every record which has already been converted, so don't restart much below the point you need.
  • a batch size of 5000 is OK off peak (see the example below)
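
For example, to resume off peak with the larger batch size (substitute the last old_id your previous run printed for the value shown here):

  • nice php compressOld.php en wikipedia -t 1 -c 5000 5467442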

Completed. This left about 40GB lost to fragmentation. It will take a table rebuild to free it, but that can't be done on Ariel with an InnoDB table because it would add 40GB of space to the tablespace for the copy.

Full options

* Usage:
*
* Non-wikimedia
* php compressOld.php [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>]
*     [-a <first-article>] [--exclude-ns0]
*
* Wikimedia
* php compressOld.php <database> [-t <type>] [-c <chunk-size>] [-b <begin-date>] [-e <end-date>] [-s <start-id>]
*     [-f <max-factor>] [-h <factor-threshold>] [--exclude-ns0] [-q <query condition>]
*
* <type> is either:
*   gzip: compress revisions independently
*   concat: concatenate revisions and compress in chunks (default)
*
* <start-id> is the old_id to start from
*
* The following options apply only to the concat type:
*    <begin-date> is the earliest date to check for uncompressed revisions
*    <end-date> is the latest revision date to compress
*    <chunk-size> is the maximum number of revisions in a concat chunk
*    <max-factor> is the maximum ratio of compressed chunk bytes to uncompressed avg. revision bytes
*    <factor-threshold> is a minimum number of KB, where <max-factor> cuts in
*    <first-article> is the title of the first article to process
*    <query-condition> is an extra set of SQL query conditions for the article selection query
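
Putting the options together, a hypothetical non-Wikimedia run that uses concat chunks of 20 revisions, stops at the same cutoff date as above and resumes from a given article might look like this (the values are illustrative):

  • php compressOld.php -t concat -c 20 -e 20050108000000 -a Cleanthes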

Database fragmentation

Because the compression reduces record sizes it can result in substantial database record fragmentation. In the case of English language Wikipedia the old text started at 80GB and was reduced to 40GB but the MySQL InnoDB storage engine didn't make the space free for reuse by other tables in the tablespace.

The space can be fully freed by using alter table old engine=InnoDB, but this requires as much extra free space in the tablespace as the complete new copy of the table requires. If the space isn't available in the tablespace, the tablespace will be enlarged to make room. If you're short of disk space that can be impossible, or it could leave insufficient space for temporary files and logs. In a multiple-wiki situation it's best to apply the compression to the smallest wikis first, alter them to free the space, and move on up to the larger sizes. By the time you get to the largest, you'll have freed much of the space they will need.
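
To see roughly how much room is already free in the shared tablespace before starting an alter, one option on MySQL 4.x (where InnoDB reports its free space in the table comment) is:

  • show table status like 'old';

and read the "InnoDB free: NNNN kB" figure in the Comment column of the output.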

Alternatively, you can temporarily convert some tables to MyISAM using alter table tablename engine=MyISAM, which moves them out of the tablespace and into the normal filesystem free space, freeing room in the tablespace. Once the alter table for the big projects has completed, you can use alter table tablename engine=InnoDB to convert them back to InnoDB.
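
For example (the table and database names here are only illustrative; pick tables that can tolerate being MyISAM for a while):

  • alter table dewiki.links engine=MyISAM;    -- illustrative table; frees room in the shared tablespace
  • alter table enwiki.old engine=InnoDB;      -- the big rebuild that needs the freed room
  • alter table dewiki.links engine=InnoDB;    -- convert back once the rebuild is done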

A combination of the two, doing smaller wikis first and converting some tables in some wikis to MyISAM, may be necessary if space is very tight. For Wikimedia, the minimum safe free disk space is between 9 and 10GB. Even at 10GB there's a risk that a large set of temporary files will leave the server without sufficient log space and break replication.

If you're using MySQL 4.1 there's also the option of putting each InnoDB table into its own tablespace file (innodb_file_per_table). You'll still need enough free space for the copy of the table, but the main tablespace won't be enlarged.
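
A minimal sketch of the my.cnf change (MySQL 4.1 or later):

  [mysqld]
  innodb_file_per_table   # only affects tables created or rebuilt after the restart, so the alter table is what moves a table out of the shared tablespace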

Compression results

Some raw data for the Wikimedia compression in the week preceding 18 February 2005. Where present, the after-compression data is in FlagCount format while the before data is currently in countFlag format. Times are the run time on ariel for the alter table to free the space. Space free in the Ariel tablespace went from 8GB to 15.7GB.

changes for this set: -e 20050108000000 -q " cur_namespace not in (10,11,14,15) "
meta              97517 rec size pre   471465984 post   446283776 94.66% ariel  4:13 min 570 no;3 0;945gzip;79672object;15956utf-8,gzip. 543/1/841/84438/12034
commons           79446 rec size pre   129253376 post    94978048 73.48% ariel  0:40 min 904 no;4 0;22889gzip;54666utf-8,gzip. 896/2/5181/38078/36125
sources           38486 rec size pre   765607936 post   590938112 77.19% ariel 12:32 min 6247 no;1 0;21184gzip;10966utf-8,gzip. 287/2653/29698/5994
hewiki           230649 rec size pre   905347072 post   453722112 50.12% ariel  7:12 min 1st 1630 no;29 0;175031gzip;52771utf-8,gzip. 83/8/6684/189966/34989
etwiki            61803 rec size pre   195248128 post    70844416 36.28% ariel  1:40 min 12145 no;1 0;39846gzip;9575utf-8,gzip. 413/1/2115/53765/5747
cawiki            65178 rec size pre   147292160 post    75038720 50.95% ariel  2:56 min 2699 no;2 0;52742gzip;9543utf-8,gzip. 50/4210/55272/5796
huwiki            67255 rec size pre   425934848 post   212418560 49.87% ariel  7:20 min 7174 no;1 0;43199gzip;16524utf-8,gzip. 222/1572/56323/9402
slwiki            80759 rec size pre   199589888 post    93929472 47.06% ariel  2:56 min 1235 no;63942gzip;14541utf-8,gzip. 77/2893/69033/9138
nowiki           105901 rec size pre   228016128 post   129597440 56.84% ariel  2:52 min 1581 no;72071gzip;31435utf-8,gzip. 60/5933/80965/19431
bgwiki           108585 rec size pre   219398144 post   117014528 53.33% ariel  2:39 min 3036 no;1 0;83734gzip;21563utf-8,gzip. 328g5978o93974u8451
ruwiki           111898 rec size pre   382042112 post   210354176 55.06% ariel  3:51 min 1119 no;12 0;78024gzip;32120utf-8,gzip. 81/8g3476o89080u18929
eowiki           120872 rec size pre   203833344 post    97091584 47.63% ariel  3:13 min 1631 no;106402gzip;12641utf-8,gzip. 45g5167o108343u7476
fiwiki           127004 rec size pre   376274944 post   310099968 82.24% ariel  4:14 min 1st 1697 no,9 0,96339 gzip,28367utf-8,gzip. 
dawiki           168081 rec size pre   537673728 post   122273792 22.74% ariel  3:52 min 125298 no;14 0;42385gzip. 5406/1g10781o152094
enwikiquote       37614 rec size pre   279101440 post   115933184 41.54% ariel  3:13 min 10331 no;20167gzip;7004utf-8,gzip. 341g842o32676u3850
enwikibooks       91536 rec size pre   618070016 post   238649344 38.61% ariel  3:32 min 25797 no;5 0;49515gzip;15850utf-8,gzip. 805g2852o77213u10923
enwikinews        24421 rec size pre   117719040 post    77135872 65.53% ariel  0:56 min 1st 1420 no,5496gzip,17097utf-8,gzip
enwiktionary     153900 rec size pre   389431296 post   217677824 55.90% ariel  2:26 min 1st 7048 no,3 0,119327gzip,26851utf-8,gzip

zhwiki           297194 rec size pre   586022912 post   481017856 82.08% ariel  11 min 38 no;6 0;6565gzip;243090object;49149utf-8,gzip. 38/6g6562o243090u55444
eswiki           457025 rec size pre  1615773696 post  1339899904 82.93% ariel  22.5 min 5 no;13949gzip;391189object;53472utf-8,gzip. 5g13910o391179u57245
itwiki           396178 rec size pre  1466810368 post  1101873152 75.12% ariel  19.5 min 61 no;16 0;17262gzip;329671object;50929utf-8,gzip
svwiki           416266 rec size pre   797802496 post   475807744 59.64% ariel   8 min 11692 no;3 0;55069gzip;350841object
nlwiki           792247 rec size pre  2745171968 post  1545601024 56.30% ariel  26 min 12693 no;12 0;94620gzip;687934object
plwiki           551211 rec size pre   733937664 post   650002432 88.56% ariel   9 min 54 no;10 0;12825gzip;473302object;67028utf-8,gzip
frwiki          1428554 rec size pre  5574230016 post  4383047680 78.63% ariel  47 min 1303 no;44 0;105870gzip;1148205object;177333utf-8,gzip
jawiki          1390023 rec size pre  3605004288 post  2899312640 80.43% ariel  46 min 1320 no;432 0;135797gzip;1039149object;217480utf-8,gzip
dewiki          4327741 rec size pre 15771467776 post 13693353984 86.84% ariel 159:46 min (count time 2518 sec = 42 minutes)

Compression for en is ongoing. Currently needs to be resumed with:

nice php compressOld.php en wiki -e 20050108000000 -q " cur_namespace not in (10,11,14,15) " -a Surfers | tee -a /home/wikipedia/logs/compressOld/20050108enwiki