Analytics/Data/Pagecounts-raw

See also the pageviews API, available since the end of 2015.

pagecounts-raw holds the desktop sites' pageview data, in the same format that webstatscollector used to emit. The data is still generated for legacy consumers, but use pagecounts-all-sites instead if possible.

This stream is owned by the Analytics Team.

Contained data

The dataset consists of files named:

${YEAR}/${YEAR}-${MONTH}/pagecounts-${YEAR}${MONTH}${DAY}-${HOUR}0000.gz
${YEAR}/${YEAR}-${MONTH}/projectcounts-${YEAR}${MONTH}${DAY}-${HOUR}0000

The pagecounts files are gzipped text files holding hourly per-page aggregates of pageviews and total response bytes. The projectcounts files are plain text files holding hourly per-domain-name[1] aggregates of pageviews and total response bytes.

The time used in the filename is in UTC and refers to the end of the aggregation period, not the beginning.
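As a minimal sketch of the naming scheme above, the following Python helper builds the relative file path for a given hour. The function names pagecounts_path and path_for_period_start are hypothetical, and the sketch assumes that naive datetimes are meant as UTC. Because the timestamp marks the end of the aggregation period, the file covering the hour that starts at 18:00 is the one stamped 19:00.

```python
from datetime import datetime, timedelta, timezone


def pagecounts_path(period_end: datetime, kind: str = "pagecounts") -> str:
    """Relative path of the hourly file whose aggregation period ends at period_end."""
    if kind not in ("pagecounts", "projectcounts"):
        raise ValueError(f"unknown file kind: {kind!r}")
    # Assumption of this sketch: naive datetimes are already in UTC.
    if period_end.tzinfo is None:
        ts = period_end.replace(tzinfo=timezone.utc)
    else:
        ts = period_end.astimezone(timezone.utc)
    # pagecounts files are gzipped, projectcounts files are plain text.
    suffix = ".gz" if kind == "pagecounts" else ""
    return f"{ts:%Y}/{ts:%Y-%m}/{kind}-{ts:%Y%m%d-%H}0000{suffix}"


def path_for_period_start(period_start: datetime, kind: str = "pagecounts") -> str:
    """Path of the file covering the hour that begins at period_start."""
    return pagecounts_path(period_start + timedelta(hours=1), kind)
```

For example, path_for_period_start(datetime(2014, 12, 31, 23)) points into the 2015/2015-01 directory, because the hour ends at midnight UTC on 2015-01-01.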

Both pagecounts and projectcounts are made up of lines having 4 space-separated fields:

domain_code page_title count_views total_response_size
Field name Description
domain_code Domain name of the request, abbreviated.

The domain coding scheme in pagecounts-all-sites is deliberately backward compatible with pagecounts-raw, so it retains the quirks and inconsistencies of the original scheme (and perhaps adds to the confusion with its new complexity). Our apologies if the scheme looks a bit complex (it is), but the codes are unambiguous and are primarily meant for machine reading.

Common trailing parts of the domain name are abbreviated in pagecounts-all-sites just as they are in pagecounts-raw. The main inconsistency was and is that project 'wikipedia.org' does not add a suffix for the project name, whereas 'wikibooks.org' adds .b, 'wiktionary.org' adds .d, etc. (the original scheme predates Wikimedia's mobile site).

domain_code can now also be an abbreviation for mobile and zero domain names, in which case .m or .zero is inserted as the second part of the code (just as in the full domain name). E.g. 'en.m.v' stands for "en.m.wikiversity.org". Again, project Wikipedia is not coded in the abbreviation: 'en' stands for "en.wikipedia.org", and 'en.m' stands for "en.m.wikipedia.org".

Domain trailing part Coded as Database name
.wikipedia.org (no suffix) *wiki

(Be careful: other, non-Wikipedia sites also use database names ending in wiki.)

.wikibooks.org .b *wikibooks
.wiktionary.org .d *wiktionary
.wikimediafoundation.org .f foundationwiki
.wikimedia.org .m (database names listed below)

Only the following domains are considered (database name in parentheses):

  • commons.wikimedia.org (commonswiki)
  • meta.wikimedia.org (metawiki)
  • incubator.wikimedia.org (incubatorwiki)
  • species.wikimedia.org (specieswiki)
  • strategy.wikimedia.org (strategywiki)
  • outreach.wikimedia.org (outreachwiki)
  • usability.wikimedia.org (usabilitywiki)
  • quality.wikimedia.org (qualitywiki)
.m.${WHITELISTED_PROJECT}.org .mw (see "Aggregation for .mw" below)
.wikinews.org .n *wikinews
.wikiquote.org .q *wikiquote
.wikisource.org .s *wikisource
.wikiversity.org .v *wikiversity
.wikivoyage.org .voy *wikivoyage
.mediawiki.org .w mediawikiwiki
.wikidata.org .wd wikidatawiki
page_title For pagecounts files, it holds the unnormalized part after /wiki/ in the request URL (e.g. Main_Page, Berlin).

For projectcounts files, it is -.

count_views The number of times this page has been viewed in the respective hour.
total_response_size The total response size caused by the requests for this page in the respective hour. This is a sum over field #7 of Cache log format fields.

So for example a line

en Main_Page 42 50043

means 42 requests to "en.wikipedia.org/wiki/Main_Page", which accounted in total for 50043 response bytes. And

de.m.voy Berlin 176 314159

would stand for 176 requests to "de.m.wikivoyage.org/wiki/Berlin", which accounted in total for 314159 response bytes.
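The two example lines above can be decoded mechanically. The following Python sketch splits a line into its four fields and expands a domain_code back into a full domain name using the suffix table above. The names Count, parse_line and expand_domain_code are hypothetical. The rule used to tell a trailing .m for wikimedia.org apart from the mobile marker .m (checking the first label against the list of considered wikimedia.org domains) is an assumption of this sketch, not something the format description spells out.

```python
from typing import NamedTuple


class Count(NamedTuple):
    domain_code: str
    page_title: str
    count_views: int
    total_response_size: int


# Suffix codes from the table above; wikipedia.org has no suffix.
PROJECT_SUFFIXES = {
    "b": "wikibooks", "d": "wiktionary", "f": "wikimediafoundation",
    "n": "wikinews", "q": "wikiquote", "s": "wikisource",
    "v": "wikiversity", "voy": "wikivoyage", "w": "mediawiki",
    "wd": "wikidata",
}
# First labels of the wikimedia.org domains that are considered.
WIKIMEDIA_HOSTS = {
    "commons", "meta", "incubator", "species",
    "strategy", "outreach", "usability", "quality",
}
SITE_MARKERS = {"m", "zero"}  # mobile and zero sites


def parse_line(line: str) -> Count:
    """Split one pagecounts or projectcounts line into typed fields."""
    domain_code, page_title, views, size = line.split()
    return Count(domain_code, page_title, int(views), int(size))


def expand_domain_code(code: str) -> str:
    """Expand an abbreviated domain_code such as 'en.m.v' to a domain name."""
    host, *rest = code.split(".")
    if rest and rest[-1] == "mw":
        raise ValueError(".mw is a cross-project aggregate, not a single domain")
    if rest and rest[-1] in PROJECT_SUFFIXES:
        project = PROJECT_SUFFIXES[rest.pop()]
    elif rest and rest[-1] == "m" and host in WIKIMEDIA_HOSTS:
        # Assumption: a trailing .m on these hosts means wikimedia.org.
        rest.pop()
        project = "wikimedia"
    else:
        project = "wikipedia"
    # What remains may only be a single mobile or zero marker.
    if len(rest) > 1 or (rest and rest[0] not in SITE_MARKERS):
        raise ValueError(f"unrecognized domain_code: {code!r}")
    return ".".join([host, *rest, project, "org"])
```

With these helpers, expand_domain_code(parse_line("de.m.voy Berlin 176 314159").domain_code) yields "de.m.wikivoyage.org", matching the second example.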

Each domain_code and page_title pair occurs at most once.

The file is sorted by domain_code and page_title.
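Because the file is sorted, all lines for one domain_code are contiguous, so a reader can stop as soon as that block ends. The sketch below streams a gzipped pagecounts file and yields only the rows for one domain_code. The function name iter_domain is hypothetical, and decoding the file as UTF-8 with replacement of invalid bytes is an assumption, since page titles are unnormalized. The sketch relies only on the lines being grouped, not on any particular collation.

```python
import gzip
from typing import Iterator, Tuple


def iter_domain(path: str, domain_code: str) -> Iterator[Tuple[str, int, int]]:
    """Yield (page_title, count_views, total_response_size) for one domain_code."""
    seen = False
    # Assumption: treat the content as UTF-8 and replace undecodable bytes.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as lines:
        for line in lines:
            fields = line.split()
            if len(fields) != 4:
                continue  # skip malformed lines
            code, title, views, size = fields
            if code == domain_code:
                seen = True
                yield title, int(views), int(size)
            elif seen:
                # The block for domain_code has ended; sorted input means
                # no further matching lines can follow.
                break
```

Since each domain_code and page_title pair occurs at most once, dict((title, views) for title, views, _ in iter_domain(path, "en")) loses no rows.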


Data not included

This dataset does not contain per-language or per-title counts for a project's mobile or zero sites. See pagecounts-all-sites if you need them.

Aggregation for .mw

Note: this anomaly is retained for backward compatibility! These lines would fit better in the projectcounts file. It is best to ignore .mw lines.

The .mw abbreviation aggregates the mobile sites across all projects per language. The page_title field is set to the language code.

Suppose a given hour sees only the following requests:

https://en.m.wikipedia.org/wiki/Main_Page
https://en.m.wikipedia.org/wiki/Berlin
https://en.m.wiktionary.org/wiki/House

Assuming each request accounted for 100 bytes, the hour's pagecounts file would consist only of the line

 en.mw en 3 300

and the corresponding projectcounts file would be

 en.mw - 3 300

So while the .mw abbreviation counts the mobile sites, it puts Wikipedia and Wiktionary into the same bucket, and it does not distinguish between page titles.
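The aggregation described in this section can be reproduced with a short Python sketch. It is an illustration only, not the code that generates the files: the function name aggregate_mw is hypothetical, mobile hosts are assumed to have the form <language>.m.<project>.org, and the project whitelist mentioned in the table is not modelled.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit


def aggregate_mw(requests: Iterable[Tuple[str, int]]) -> Tuple[List[str], List[str]]:
    """Build the .mw lines from (url, response_bytes) pairs.

    Returns (pagecounts_lines, projectcounts_lines).
    """
    totals = defaultdict(lambda: [0, 0])  # language -> [views, bytes]
    for url, size in requests:
        labels = (urlsplit(url).hostname or "").split(".")
        # Assumption: mobile hosts look like <language>.m.<project>.org.
        if len(labels) == 4 and labels[1] == "m":
            bucket = totals[labels[0]]
            bucket[0] += 1
            bucket[1] += size
    # All projects of a language share one bucket; page titles are dropped.
    pagecounts = [f"{lang}.mw {lang} {v} {b}" for lang, (v, b) in sorted(totals.items())]
    projectcounts = [f"{lang}.mw - {v} {b}" for lang, (v, b) in sorted(totals.items())]
    return pagecounts, projectcounts
```

Feeding it the three example requests, each with 100 response bytes, gives one pagecounts line and one projectcounts line, both carrying 3 views and 300 bytes for the language en.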

Availability

dumps.wikimedia.org

The stream is available unsampled as gzipped hourly files from http://dumps.wikimedia.org/other/pagecounts-raw/.

The date in the file name refers to the end of the capturing period, not the beginning.

stat1002.eqiad.wmnet

The most recent ~11 days of data are available as hourly files at /mnt/data/pagecounts/incoming on stat1002.

The date in the file name refers to the end of the capturing period, not the beginning.

Events and known problems since 2014-03-01

Date from Date until Bug Details
* 2014-09-02 ~16:19 bug 70140 Https traffic from ulsfo gets counted twice.
2014-04-17 2014-07-07 bug 67456 Logs from SSL endpoints were not fed into webstatscollector, hence SSL traffic was not counted by webstatscollector.
2014-07-07 ~16:25 2014-09-02 ~16:19 bug 70295 Requests to Special:CentralAutoLogin/* have been counted.
2014-07-08 19:00 2014-07-08 22:00 bug 67694 A 2014 FIFA World Cup (soccer) related traffic spike caused udp2log overload and led to up to ~10% packet loss during this period.
2014-07-13 19:00 2014-07-13 23:00 bug 67694 A 2014 FIFA World Cup (soccer) related traffic spike caused udp2log overload and led to up to ~25% packet loss during this period.
2014-07-29 01:35 2014-07-29 01:42 bug 68796 Most of esams missing between 2014-07-29T01:35:45 and 2014-07-29T01:42:00 due to flapping network link (<=11% of total zero traffic around that time)
2014-08-16 ~22:43 2014-08-16 ~22:49 bug 69663 Root mount on oxygen went full, which caused services to panic and udp2log dropped requests during that time
2014-08-17 ~06:26 2014-08-17 ~06:30 bug 69663 Root mount on oxygen went full again, which caused services to panic and udp2log dropped requests during that time
2014-08-24 14:00 2014-08-27 21:00 bug 70118 Resource scarcity on gadolinium caused higher drop rates, and service restarts cut off part of the data for some hours.
2014-08-28 16:01 2014-08-28 ~20:30 bug 70136 Permission errors on gadolinium prohibited writing of hourly files
2014-10-08 22:00 2014-10-08 24:00 bug 71879 ULSFO having connectivity issues leading to partial message loss
* 2014-10-15 ~19:02:30 bug 66352 Pageviews to “undefined” and “Undefined” pages have been counted
* 2014-10-15 ~19:02:30 bug 71790 Redirects have been counted
2014-10-15 ~19:00:00 2014-10-15 ~19:02:30 bug 72102 No messages collected during deployment of new webstatscollector version
2014-10-15 ~20:22:00 2014-10-15 ~20:23:00 bug 72107 No messages collected during restart of webstatscollector's filter
2014-10-20 13:06 2014-10-20 13:27 bug 72306 ULSFO connectivity issues causing packet loss between 6% and 47% for ulsfo caches.
2014-10-21 ~10:30 2014-10-21 ~11:43 bug 72355 Ulsfo connectivity issues causing packet loss for ulsfo caches.
2014-11-25 ~01:56 2014-12-04 14:03 task T76390 A change in the HTTPS setup caused HTTPS requests from eqiad and esams (not ulsfo) to be counted twice.

On 2014-12-08, backfilling the affected period with good data from pagecounts-all-sites finished. So since then, the pagecounts/projectcounts files for the affected period are good again.

2014-11-30 ~03:50 2014-11-30 ~10:13 task T76334 No data while analytics infrastructure suffered eqiad network issues.

On 2014-12-08, backfilling the affected period with good data from pagecounts-all-sites finished. So since then, the pagecounts/projectcounts files for the affected period are good again.

2015-01-01 00:00 n/a Switch from webstatscollector generated files to Hive generated files (If32afc, stripped-down variant of pagecounts-all-sites).
2015-01-13 ~22:20 2015-01-13 ~23:18 task T86973 No data due to firewall problems

See also

  1. Hence, the “project” in projectcounts is something of a misnomer. It was nonetheless kept for compatibility with pagecounts-raw.