Analytics/Data/Pageview hourly

From Wikitech

wmf.pageview_hourly

The wmf.pageview_hourly table (available on Hive) contains 'pre-aggregated' webrequest data, filtered to keep only pageviews and aggregated over a predefined set of dimensions. It is stored in the Parquet columnar file format and partitioned by (year, month, day, hour).

Current Schema

$ hive --database wmf

hive (wmf)> describe pageview_hourly;
OK
col_name    data_type   comment
project                 string                  Project name from requests hostname
language_variant        string                  Language variant from requests path (not set if present in project name)
page_title              string                  Page Title from requests path and query
access_method           string                  Method used to access the pages, can be desktop, mobile web, or mobile app
zero_carrier            string                  Zero carrier if pageviews are accessed through one, null otherwise
agent_type              string                  Agent accessing the pages, can be spider, user or bot (Tagged as bot if user-agent explicitly contains 'WikimediaBot')
referer_class           string                  Can be internal, external or unknown
continent               string                  Continent of the accessing agents (computed using maxmind GeoIP database)
country_code            string                  Country iso code of the accessing agents (computed using maxmind GeoIP database)
country                 string                  Country (text) of the accessing agents (computed using maxmind GeoIP database)
subdivision             string                  Subdivision of the accessing agents (computed using maxmind GeoIP database)
city                    string                  City iso code of the accessing agents (computed using maxmind GeoIP database)
user_agent_map          map<string,string>      User-agent map with device_family, browser_family, browser_major, os_family, os_major, os_minor and wmf_app_version keys and associated values
record_version          string                  Keeps track of changes in the table content definition - https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageview_hourly
view_count              bigint                  number of pageviews
page_id                 int                     MediaWiki page_id for this page title. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.
year                    int                     Unpadded year of pageviews
month                   int                     Unpadded month of pageviews
day                     int                     Unpadded day of pageviews
hour                    int                     Unpadded hour of pageviews
# Partition Information
# col_name              data_type               comment
year                    int                     Unpadded year of pageviews
month                   int                     Unpadded month of pageviews
day                     int                     Unpadded day of pageviews
hour                    int                     Unpadded hour of pageviews

Notice the year, month, day, and hour fields. These are Hive partitions, and are explicit mappings to hourly aggregations in HDFS. You must include at least one partition predicate in the where clause of your queries (even if it is just year > 0). Partitions allow you to reduce the amount of data that Hive must parse and process before it returns results. For example, if you are only interested in data for a particular day, you could add where year = 2014 and month = 1 and day = 12. This instructs Hive to process only the data for partitions matching that partition predicate. You may use partition fields as you would any normal field, even though the field values are not actually stored in the data files.
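As an illustrative sketch (the date and project value are arbitrary examples), a query combining partition predicates with the dimension fields above might look like:

```sql
-- Count user (non-spider) pageviews per access method for en.wikipedia
-- on an example day. The predicates on year/month/day restrict Hive to
-- the matching HDFS partitions instead of scanning the whole table.
SELECT
    access_method,
    SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2015
  AND month = 6
  AND day = 12
  AND project = 'en.wikipedia'
  AND agent_type = 'user'
GROUP BY access_method;
```

Map-typed fields such as user_agent_map can be indexed by key in the same way, e.g. user_agent_map['browser_family'] in a SELECT or GROUP BY clause.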

Changes and known problems since 2015-06-16

Date from | Task | record_version | Details
2015-05-01[1] | task T99931 | 0.0.1 | Create the table with pageviews aggregated from the 'text' and 'mobile' refined webrequest sources, and backfill the aggregation from the beginning of May.
2015-06-01[2] | task T107436 | 0.0.2 | Add parsed user agent data (the user_agent_map field) to aggregated pageviews, to prepare for the wikistat-2.0 request.
2015-08-31 | task T110614 | 0.0.3 | Backfilled data from the beginning of pageview_hourly (May 1st, 2015) to correct bugs:
  • in page_title, spaces were not always transformed into underscores
  • wrong hosts were included and are no longer (outreach.wikimedia.org, donate.wikipedia.org, arbcom-*.wikipedia.org)

We took advantage of this backfill to move the user_agent_map field to a more coherent position in the schema.

2015-12-01 | task T116023 | 0.0.4 | Add the MediaWiki page_id for this title when available. For redirects this could be the page_id of the redirect or the page_id of the target. This may not always be set, even if the page is actually a pageview.
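Because page_id is not guaranteed to be populated (as noted above), queries that group or join on it should handle NULLs explicitly. A hedged sketch, with an arbitrary example day:

```sql
-- Top pages by page_id on an example day, skipping rows where page_id
-- was not set. Without the IS NOT NULL filter, all pages lacking a
-- page_id would collapse into a single NULL group.
SELECT page_id, page_title, SUM(view_count) AS views
FROM wmf.pageview_hourly
WHERE year = 2015 AND month = 12 AND day = 1
  AND project = 'en.wikipedia'
  AND page_id IS NOT NULL
GROUP BY page_id, page_title
ORDER BY views DESC
LIMIT 10;
```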

See also

The code that generates it:

Notes

  1. The change was applied on the 7th of June, but since data is available from the beginning of May, the date of this entry is set accordingly.
  2. The change was applied on the 31st of July, but since data is available from the beginning of June, the date of this entry is set accordingly.