Media storage

This page describes Wikimedia's media storage infrastructure.

Context

When we talk about "media storage", we refer to the storage & serving of user-uploaded content (typically images, PDFs, video & audio files), served from upload.wikimedia.org. It includes both the originally uploaded content and content generated from it. The files can be broadly grouped into the following categories:

  • "Originals": originally uploaded content
  • Thumbnails: arbitrarily-sized thumbnails of original content, scaled on demand by the image scalers
  • Transcoded videos: conversions of originally uploaded videos into multiple formats (Ogg, WebM) at multiple preset resolutions (360p, 720p, etc.)
  • Rendered content: output generated by MediaWiki extensions (timelines, math, scores, captchas; see below)

Components

The media storage architecture involves the following closely coupled components.

(Diagram: media storage components)

Caching proxies

The usual tiered layers of Varnish HTTP caching proxies serve the upload.wikimedia.org domain.

The upload Varnish setup is special in a few ways:

  • There are special provisions for handling HTTP Range requests, which at the time also required a special Varnish version; this matters because Range requests are essential for serving large video files.
  • The config contains rewriting rules that convert upload.wikimedia.org URLs into ms-fe Swift API URLs.
  • The config has special support for handling 404 responses from media storage on thumbnail URLs, retrying the request against an image scaler instead.

The last two features were written to replace the previous Swift middleware (written in Python), both to prepare for the Ceph transition and to avoid the cascading failures in which the middleware's components could end up looping on each other during otherwise simple incidents. As of July 2013, both are implemented but inactive, pending the full Ceph roll-out.
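As an illustration of the URL rewriting mentioned above, the sketch below maps a public upload.wikimedia.org path onto a Swift API path, following the container naming scheme described later on this page. It is a simplification, not the production Varnish/middleware logic; the "AUTH_mw" account name and the assumption that the wiki uses sharded containers are hypothetical.

<?php
// Illustrative sketch of the URL rewriting described above: map a public
// upload.wikimedia.org path onto a Swift API path. The "AUTH_mw" account and
// the assumption that the wiki uses sharded containers are hypothetical.
function rewriteUploadUrl( $path ) {
    // e.g. /wikipedia/commons/a/a2/Example.jpg
    //  or  /wikipedia/commons/thumb/a/a2/Example.jpg/120px-Example.jpg
    if ( !preg_match(
        '!^/(?P<proj>[^/]+)/(?P<lang>[^/]+)(?P<thumb>/thumb)?/(?P<s1>[0-9a-f])/(?P<s2>[0-9a-f]{2})/(?P<rest>.+)$!',
        $path, $m
    ) ) {
        return null; // not a media URL this sketch understands
    }
    $zone      = $m['thumb'] ? 'thumb' : 'public';
    // Container name follows project-language-repo-zone(.shard); assume a sharded (large) wiki.
    $container = "{$m['proj']}-{$m['lang']}-local-{$zone}.{$m['s2']}";
    $object    = "{$m['s1']}/{$m['s2']}/{$m['rest']}";
    return "/v1/AUTH_mw/$container/$object";
}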

Media storage

This is the actual storage layer where files/objects are being stored to and retrieved from.

Historically, media storage was composed of a few NFS servers (ms7 for originals, ms5 for thumbs, and ms6 for a thumbs cache in esams) that all MediaWiki application servers mounted, with MediaWiki writing files using regular filesystem calls. This was unscalable, fragile and inelegant. In 2012, a combined effort from the platform and technical operations teams was made to replace this with a separate infrastructure that MediaWiki talks to through a separate API.

OpenStack Swift was picked as the new platform, and the Swift API was chosen because of its simplicity and because it is native to the Swift implementation. Because of certain Swift limitations, in particular the lack of geographically-aware replication between datacenters (which affected the eqiad migration), as well as Swift's shortcomings with data consistency and performance, as of 2013 Ceph with its Swift-compatible layer (radosgw) is also being evaluated for the same purpose, with pmtpa running Swift and eqiad running Ceph; a final decision between the two is to be taken in late 2013.

Image & video scalers

Image scalers are a special group of application servers that are otherwise normal servers running MediaWiki. Their sole purpose is to receive thumbnail scaling requests for arbitrary originals & sizes and to scale them down on demand. While a number of constraints are in place for resource usage & security purposes, they perform resource-intensive operations on foreign content and thus can frequently misbehave, which is why they are grouped separately.

Video scalers are similar, but because of the nature of their work they perform it as part of job queue processing rather than on a per-request basis.

Architecture

File/object structure

Files are grouped into containers whose names have 5 components: a project (e.g. wikipedia), a language (e.g. en), a repo, a zone and, optionally, a shard.

Project can also be "global" for certain global items. Note that there are a few exceptions to the project names, the most notable being Wikimedia Commons, which has a project name of "wikipedia" for historical reasons.

Rendered content (timeline, math, score and captcha) has its zone set to render and its repo set to the respective category. Regular media files have their repo set to local. Zones are: public for public unscaled media, thumb for thumbnails/scaled media, transcoded for transcoded videos, temp for temporary files created by e.g. UploadStash, and deleted for unscaled media whose on-wiki entries have been deleted. These are defined and categorized in the MediaWiki configuration option $wgFileBackends.

Historically, files were put under directories on a filesystem, with directories sharded per wiki in a two-level hierarchy of 16 shards per level, totaling 256 uniformly sharded directories. In the Swift era, the hope was that such a sharding scheme would be unneeded, as the backend storage would handle that complexity. This hope ultimately proved untrue: for certain wikis, the number of objects per container is large enough to create scalability problems in Swift. To address this, multiple containers were created for those large projects. They were sharded into a flat (one-level) set of 256 shards (00-ff), with the exception of the deleted zone, which was sharded into 1296 shards (00-zz). The list of large projects that have sharded containers is currently defined in three places: a) MediaWiki's $wmfSwiftBigWikis, b) Swift's shard_container_list (proxy-server.conf, via puppet) and c) Varnish's rewrite configuration in puppet.

The previous two-level scheme is kept as the name of the object in all containers, as well as in the public upload.wikimedia.org URLs, irrespective of whether the project is large enough to have sharded containers. This was done for compatibility reasons, as well as to keep the option of sharding more containers in the future if they grow large enough. For containers that are sharded, the name of the shard matches the object's second-level shard, and the shard of derived content (thumbnails) remains the same as the shard of the original that produced it.

A few examples:
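A minimal sketch, assuming the two shard levels are taken from the first hex digits of the MD5 of the file name (as in MediaWiki's hashed upload layout); the file name, container names and shard values below are illustrative, not real hashes:

<?php
// Illustrative sketch: derive the two shard levels from the MD5 of the file
// name. Names and shard values are hypothetical.
$filename = 'Example.jpg';
$md5    = md5( str_replace( ' ', '_', $filename ) );
$shard1 = substr( $md5, 0, 1 );   // first hash level, 16 possibilities (0-f)
$shard2 = substr( $md5, 0, 2 );   // second hash level, 256 possibilities (00-ff)

$object = "$shard1/$shard2/$filename";   // e.g. "x/xy/Example.jpg"

// Container names follow project-language-repo-zone(.shard):
$smallWikiContainer = 'wikipedia-en-local-public';               // unsharded
$largeWikiContainer = "wikipedia-commons-local-public.$shard2";  // sharded (large wikis only)

// The public URL keeps the same two-level path regardless of container sharding:
// http://upload.wikimedia.org/wikipedia/commons/x/xy/Example.jpg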

Thumbnail handling

When a user requests a page from a public wiki, links to the scaled media needed for the page (e.g. http://upload.wikimedia.org/project/language/thumb/x/xy/filename.ext/NNNpx-filename.ext) are generated, but the scaled media themselves are not generated at that time. As the thumb sizes are arbitrary, it is not possible to pregenerate them either, so the only way to handle this is to generate them on demand and cache them. On the MediaWiki side, this is accomplished with the configuration settings 'transformVia404' => true for the local file repo, $wgGenerateThumbnailOnParse = false, and $wgThumbnailScriptPath = false.
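For illustration, in a MediaWiki settings file those options might look roughly as follows (a sketch, not the actual production configuration):

// Thumbnail-on-404 settings described above (illustrative sketch only)
$wgLocalFileRepo['transformVia404'] = true;   // let 404 handling trigger scaling
$wgGenerateThumbnailOnParse = false;          // do not render thumbnails at parse time
$wgThumbnailScriptPath = false;               // link directly to upload.wikimedia.org, not thumb.php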

The architecture here is very different between Swift and Ceph (so far; there are plans to converge):

  • Swift has a middleware (rewrite.py, maintained in operations/puppet, which also handles the URL rewrites) that catches the 404 responses and instead makes an HTTP request to the image scalers. The image scaler fetches the original, resizes it, stores the thumbnail into media storage (all backends, see below) in a separate request and finally returns the thumbnail contents as an HTTP response, which Swift proxies back to the client (the HTTP frontend caches).
  • With Ceph, this entire functionality has been moved into Varnish: on a cache miss, Varnish makes the backend request to Ceph and, on a 404, it "restarts" the backend request, this time with the image scalers set as the backend and with a slightly modified URL. The image scalers then do the same as above.

For private wikis this cannot work, as no access control happens outside the MediaWiki layer. Therefore the links there point not to upload.wikimedia.org but to /w/thumb.php?f=Example.png&width=450 on the wiki itself, with MediaWiki ultimately serving the file.

Datacenter replication

We currently have Swift running in pmtpa and Ceph running in eqiad. MediaWiki has the capability of running with multiple backends, one of them being the primary (where reads come from, and whose file-operation success MediaWiki cares most about). This is configured by means of the FileBackendMultiWrite setting for $wgFileBackends, after creating a local-ceph and a local-swift SwiftFileBackend instance.
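A rough sketch of what such a configuration could look like; the backend names follow the text above, while the connection details and exact keys are omitted or assumed:

// Illustrative sketch only: two Swift-API backends plus a multi-write wrapper.
$wgFileBackends[] = array(
    'class' => 'SwiftFileBackend',
    'name'  => 'local-swift',
    /* ... Swift (pmtpa) connection settings ... */
);
$wgFileBackends[] = array(
    'class' => 'SwiftFileBackend',  // Ceph's radosgw speaks the Swift API
    'name'  => 'local-ceph',
    /* ... radosgw (eqiad) connection settings ... */
);
$wgFileBackends[] = array(
    'class'    => 'FileBackendMultiWrite',
    'name'     => 'local-multiwrite',
    'backends' => array(
        array( /* local-ceph backend settings */ 'isMultiMaster' => true ),  // primary: reads are served from here
        array( /* local-swift backend settings */ ),
    ),
);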

Multi-write is currently disabled because of Ceph's stability issues, which, due to the synchronous nature of FileBackendMultiWrite, would propagate to production traffic. Until it is enabled, we have two ways of syncing files:

  • For original media content, MediaWiki has a journal mechanism that keeps all changes in a database table, and scripts exist to replay that journal to the other store.
  • For all other content, operations/software contains a tool of our own called swiftrepl, which traverses containers on both sides and syncs them.

Examples

First request for a thumbnail image

  1. Request for http://upload.wikimedia.org/project/language/thumb/x/xy/filename.ext/NNNpx-filename.ext is received by an LVS server.
  2. The LVS server picks an arbitrary Varnish frontend server to handle the request.
  3. Frontend Varnish looks for cached content for the URL in its in-memory cache.
  4. Frontend Varnish computes a hash of the URL and uses that hash to consistently select a backend Varnish server.
    • The consistent hash routing ensures that all frontend Varnish servers select the same backend Varnish server for a given URL, eliminating duplication in the backend cache layer (see the sketch after this list).
  5. Frontend Varnish requests the URL from backend Varnish.
  6. Backend Varnish looks for cached content for the URL in its SSD-based cache.
  7. Backend Varnish requests URL from media storage cluster.
  8. Request for URL from media storage cluster received by an LVS server.
  9. The LVS server picks an arbitrary frontend Swift server to handle the request.
  10. The frontend Swift server rewrites the URL to map from the wiki URL space into the storage URL space.
  11. The frontend Swift server requests the new URL from the Swift cluster.
  12. The 404 response for the URL is caught in the frontend Swift server.
  13. The frontend Swift server constructs a URL to request the thumbnail from an image scaler server via /w/thumb_handler.php.
  14. The image scaler server requests the original image from Swift.
    • This goes back to the same LVS -> Swift frontend -> Swift backend path as the thumb request came down from the Varnish backend server.
  15. The image scaler transforms the original into the requested thumbnail image.
  16. The image scaler stores the resulting thumbnail in Swift.
  17. The image scaler returns the thumbnail as an HTTP response to the frontend Swift server's request.
  18. The frontend Swift server returns the thumbnail image as an HTTP response to the backend Varnish server.
  19. The backend Varnish server stores the response in its SSD-backed cache.
  20. The backend Varnish server returns the thumbnail image as an HTTP response to the frontend Varnish server.
  21. The frontend Varnish server stores the response in its in-memory cache.
  22. The frontend Varnish server returns the thumbnail image as an HTTP response to the original requestor.
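To make the consistent-hashing step above concrete, here is a minimal sketch of URL-hash-based backend selection. It illustrates the idea only and is not the actual Varnish director implementation; the backend host names are hypothetical.

<?php
// Illustrative sketch of consistent backend selection by URL hash. Not the
// actual Varnish director code; backend host names are hypothetical.
$backends = array( 'cp1048', 'cp1049', 'cp1050', 'cp1051' );

// Place each backend at a point on a hash ring.
$ring = array();
foreach ( $backends as $backend ) {
    $ring[ crc32( $backend ) ] = $backend;
}
ksort( $ring );

function pickBackend( array $ring, $url ) {
    $h = crc32( $url );
    // Pick the first ring point at or after the URL's hash, wrapping around.
    foreach ( $ring as $point => $backend ) {
        if ( $point >= $h ) {
            return $backend;
        }
    }
    return reset( $ring );
}

// Every frontend computes the same answer for the same URL, so a given object
// ends up cached on (at most) one backend Varnish server.
$backend = pickBackend( $ring, '/wikipedia/commons/thumb/x/xy/File.jpg/220px-File.jpg' );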


Common operations

Removing archived files

Occasionally, there is a need to eradicate the content of files that have been deleted & archived on the wikis (e.g. content that is illegal to distribute). To serve this purpose, there is a MediaWiki maintenance script, eraseArchivedFile.php, which handles the deletion of both the content and its thumbnails from all configured FileBackend stores, as well as purging them from the frontend HTTP caches. The script takes either the filename as input:

user@terbium:~$ mwscript eraseArchivedFile.php --wiki commonswiki --filename 'Example.jpg' --filekey '*' 
Use --delete to actually confirm this script
Purging all thumbnails for file 'Example.jpg'...done.
Finding deleted versions of file 'Example.jpg'...
Would delete version 'f6mypp1mxmrj2aoxfucxwo2sj8eb9ww.jpg.jpg' (20130604053028) of file 'Example.jpg'
Done

or the filekey (e.g. as given in a Special:Undelete URL) as an argument:

user@terbium:~$ mwscript eraseArchivedFile.php --wiki commonswiki --filekey 'f6mypp1mxmrj2aoxfucxwo2sj8eb9ww.jpg'
Use --delete to actually confirm this script
Purging all thumbnails for file 'Example.jpg'...done.
Would delete version 'f6mypp1mxmrj2aoxfucxwo2sj8eb9ww.jpg.jpg' (20130604053028) of file 'Example.jpg'

(note that it needs to be invoked with --delete to actually perform the deletions)

Cleaning up thumbs

Syncing between stores

See also