Analytics/Data/Mediacounts
The mediacounts stream holds counts of how often an image, video, or audio file from upload.wikimedia.org
has been transferred to users.
WMF currently does not have the infrastructure to provide perfect media counts, and the current media counts implementation has several shortcomings. Because the community has already waited a long time for any media counts, we publish this imperfect data now, until WMF has the infrastructure to produce perfect media counts.
Rationales and motivations for this stream can be found in the corresponding RfC.
This stream is owned by the Analytics Team.
Contained data
Selected requests
The stream contains all requests from the upload cache group that have
- HTTP status code 200 (OK), or
- HTTP status code 206 (Partial Content) and a Range header that starts with bytes=0-, but is not bytes=0-0.
The first condition matches the plain fetches of image, movie and audio files. The second condition matches beginnings of streamed media.
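The two selection conditions above can be sketched as a small predicate. This is only an illustration of the rules as stated on this page, not the code that produces the stream; the function name `is_counted` and its parameters are hypothetical.

```python
from typing import Optional


def is_counted(status: int, range_header: Optional[str] = None) -> bool:
    """Return True if a request would be selected under the rules above.

    `status` is the HTTP status code and `range_header` is the value of the
    Range request header, if any. Both names are assumptions made for this
    sketch.
    """
    # Condition 1: plain fetches of image, movie and audio files.
    if status == 200:
        return True
    # Condition 2: beginnings of streamed media, that is a partial response
    # whose Range header starts with "bytes=0-" but is not exactly "bytes=0-0".
    if status == 206 and range_header is not None:
        value = range_header.strip()
        return value.startswith("bytes=0-") and value != "bytes=0-0"
    # Everything else, including 304 (Not Modified), is not selected.
    return False
```

For example, `is_counted(206, "bytes=0-1023")` is selected, while `is_counted(206, "bytes=0-0")` and `is_counted(304)` are not.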
Corner cases
- After some discussion with stakeholders (partly on-wiki, mostly by email), requests with HTTP status code 304 (Not Modified) are not counted at this point, as there seems to be more interest in media transfers than in media requests. Ideally, the stream would count media consumption or media views, but there is currently no easy way to detect those from the logs.
- Jumping back to the beginning of a streamed media file after having watched part of it counts as a new transfer.
- When Media viewer is used to view images, some images are prefetched for a better user experience but have not necessarily been shown to the user yet. Currently, those prefetched images are counted, as there is no way yet to detect whether an image was actually shown to the user.
Fields
The stream consists of the following fields:
Field # | Name | Description |
---|---|---|
1 | base_name | The name of the raw, original file without the leading https?://upload.wikimedia.org So for example for each of , the For images from Commons, you can get the file's page by replacing the first four path segments of the |
2 | total_response_size | Total number of response bytes sent to the users for that file (and its transcodings). |
3 | total | Total number of transfers (a transfer of the raw, original file and a transfer of a tiny thumb each count as 1). |
4 | original | Total number of transfers of the raw, original file (transcodings, thumbs and the like are not counted here). Note, this includes JPG images embedded in pages without the thumb parameter or equivalent, as well as "thumbnails" requested at a resolution equal to or higher than the original image's resolution: in both cases the original image is embedded directly (and downloaded upon visiting the page), rather than a generated derivative image. See example. |
5 | transcoded_audio | Total number of transfers of a file that got transcoded to an audio file. So for example when a FLAC file is requested as OGG file, the request is counted in this column. (Transfers for the raw, original FLAC file would get counted in the original column.) |
6 | n/a | Reserved for future use. |
7 | n/a | Reserved for future use. |
8 | transcoded_image | Total number of transfers of a file that got transcoded to an image file. So for example when a WebM file, or a GIF file is requested as JPG file, the request is counted in this column. Note, this seems to include (all?) thumbnails as well: the value is higher than 0 also for jpg images, which are rescaled to jpg rather than converted to other formats. (Transfers for the raw, original WebM, or the raw, original GIF file, would get counted in the original column.) |
9 | transcoded_image_0_199 | Total number of transfers of a file that got transcoded to an image file, where 0 <= width <= 199. (This is a drill-down of the transcoded_image column.) |
10 | transcoded_image_200_399 | Total number of transfers of a file that got transcoded to an image file, where 200 <= width <= 399. (This is a drill-down of the transcoded_image column.) |
11 | transcoded_image_400_599 | Total number of transfers of a file that got transcoded to an image file, where 400 <= width <= 599. (This is a drill-down of the transcoded_image column.) |
12 | transcoded_image_600_799 | Total number of transfers of a file that got transcoded to an image file, where 600 <= width <= 799. (This is a drill-down of the transcoded_image column.) |
13 | transcoded_image_800_999 | Total number of transfers of a file that got transcoded to an image file, where 800 <= width <= 999. (This is a drill-down of the transcoded_image column.) |
14 | transcoded_image_1000 | Total number of transfers of a file that got transcoded to an image file, where 1000 <= width. (This is a drill-down of the transcoded_image column.) |
15 | n/a | Reserved for future use. |
16 | n/a | Reserved for future use. |
17 | transcoded_movie | Total number of transfers of a file that got transcoded to a movie file. So for example when a WebM file is requested as OGV file, the request is counted in this column. (Transfers for the raw, original WebM file, would get counted in the original column.) |
18 | transcoded_movie_0_239 | Total number of transfers of a file that got transcoded to a movie file, where 0 <= height <= 239. (This is a drill-down of the transcoded_movie column.) |
19 | transcoded_movie_240_479 | Total number of transfers of a file that got transcoded to a movie file, where 240 <= height <= 479. (This is a drill-down of the transcoded_movie column.) |
20 | transcoded_movie_480 | Total number of transfers of a file that got transcoded to a movie file, where 480 <= height. (This is a drill-down of the transcoded_movie column.) |
21 | n/a | Reserved for future use. |
22 | n/a | Reserved for future use. |
23 | referer_internal | Total number of transfers with a Referer from a WMF domain. |
24 | referer_external | Total number of transfers with a Referer from a non-WMF domain. |
25 | referer_unknown | Total number of transfers with an empty or invalid Referer. |
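The field layout above can be turned into a small parser. The following is a minimal sketch, assuming one tab-separated line per file with exactly 25 columns in the order of the table, and integer values in every count column. The helper names `parse_line` and `image_width_column` are hypothetical, and the name `referer_unknown` for field 25 (transfers with an empty or invalid Referer) is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Union

# Column names in table order; None marks fields reserved for future use.
FIELD_NAMES: List[Optional[str]] = [
    "base_name",                 # 1
    "total_response_size",       # 2
    "total",                     # 3
    "original",                  # 4
    "transcoded_audio",          # 5
    None, None,                  # 6-7 reserved
    "transcoded_image",          # 8
    "transcoded_image_0_199",    # 9
    "transcoded_image_200_399",  # 10
    "transcoded_image_400_599",  # 11
    "transcoded_image_600_799",  # 12
    "transcoded_image_800_999",  # 13
    "transcoded_image_1000",     # 14
    None, None,                  # 15-16 reserved
    "transcoded_movie",          # 17
    "transcoded_movie_0_239",    # 18
    "transcoded_movie_240_479",  # 19
    "transcoded_movie_480",      # 20
    None, None,                  # 21-22 reserved
    "referer_internal",          # 23
    "referer_external",          # 24
    "referer_unknown",           # 25 (name assumed: empty or invalid Referer)
]


def parse_line(line: str) -> Dict[str, Union[str, int]]:
    """Split one TSV line into named fields, skipping reserved columns."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(parts)}")
    row: Dict[str, Union[str, int]] = {}
    for name, raw in zip(FIELD_NAMES, parts):
        if name is None:
            continue  # reserved column, content ignored
        row[name] = raw if name == "base_name" else int(raw)
    return row


def image_width_column(width: int) -> str:
    """Return the drill-down column that a transcoded image width falls into."""
    if width < 0:
        raise ValueError("width must be non-negative")
    for upper in (199, 399, 599, 799, 999):
        if width <= upper:
            return f"transcoded_image_{upper - 199}_{upper}"
    return "transcoded_image_1000"
```

Reserved columns are dropped rather than parsed, since this page does not say what they contain.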
Availability
dumps.wikimedia.org
The stream is available as daily TSV files at http://dumps.wikimedia.org/other/mediacounts/.
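As an illustration of working with one of these daily files after downloading it, the sketch below streams a local TSV file and returns the files with the most transfers. It assumes the column order from the Fields section (base_name first, total third). The function names are hypothetical, and handling a `.bz2` suffix is an assumption about how a downloaded file might be compressed, not something this page states.

```python
import bz2
import csv
import heapq
from typing import IO, List, Tuple


def open_tsv(path: str) -> IO[str]:
    """Open a TSV file as text, transparently decompressing *.bz2 files."""
    if path.endswith(".bz2"):
        return bz2.open(path, mode="rt", encoding="utf-8", newline="")
    return open(path, mode="r", encoding="utf-8", newline="")


def top_files(path: str, n: int = 10) -> List[Tuple[int, str]]:
    """Return the n (total, base_name) pairs with the highest total."""
    with open_tsv(path) as handle:
        # QUOTE_NONE keeps quote characters in file names untouched.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        # Column 1 is base_name, column 3 is total (0-based indexes 0 and 2).
        pairs = ((int(row[2]), row[0]) for row in reader if len(row) >= 3)
        # nlargest keeps only n items in memory while streaming the file.
        return heapq.nlargest(n, pairs)
```

Streaming with `heapq.nlargest` avoids loading a whole daily file into memory, which matters if the files are large.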
stat1002.eqiad.wmnet
The stream is available as daily TSV files at /mnt/hdfs/wmf/data/archive/mediacounts on stat1002.
Analytics cluster
The stream is available as daily TSV files at /wmf/data/archive/mediacounts in the Analytics cluster.
In addition to those files, the data is also available at hourly granularity in Parquet format at /wmf/data/wmf/mediacounts, which is accessible in Hive through the wmf.mediacounts table.
Clients
- mediacounts-stats.py can filter statistics for a specific file or category of files, keeping the same CSV format (example).
- commons-media-views compacts the entire dataset to have only one row per filename and outputs the table in JSON format (example).
Events and known problems since 2015-01-01
Date from | Date until | Bug | Details |
---|---|---|---|