Analytics/Data/Pageview hourly/Sanitization

This page summarizes and links to details about the Analytics Team's approach, research and results on sanitizing the pageview_hourly dataset.

Problem: Reconstruction of browsing patterns

As we found in our Identity reconstruction analysis, an attacker with access to our cluster could follow user browsing patterns by combining two datasets: pageview_hourly and webrequest. We only keep data in the webrequest dataset for a short period of time, but we would like to keep pageview_hourly indefinitely, and so we need to make it safe against this type of attack.

Two different cases can occur:

  • Users with a rare combination of values in various fields, especially user-agent and geographical location, are at risk of first being identified in the rawer webrequest dataset, and then followed in pageview_hourly.
  • Groups of users having viewed only one page can be positively identified as having viewed that page (while if there are two pages, users in the group could have viewed one, the other, or both).

Solution: Sanitizing using K-Anonymity over multiple fields

See this page for a detailed version of the algorithm we propose.

Very briefly, the idea is to group pageviews into buckets by sensitive fields, such as user agent and location. When a bucket has fewer than K_ip distinct IPs or fewer than K_pv distinct pages viewed, we anonymize one of its sensitive fields and repeat until every bucket has at least K_ip distinct IPs and at least K_pv distinct pages viewed. Fields with rare values (values unlikely to show up often) are anonymized first.
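To make the loop concrete, here is a minimal Python sketch of that idea. It is not the production Hive/Spark job: it assumes in-memory rows that still carry an ip and page column, a hypothetical rarity(field, value) helper (lower = rarer), and an illustrative subset of sensitive fields and thresholds.

from collections import defaultdict

# Hypothetical sensitive fields and thresholds, for illustration only.
SENSITIVE_FIELDS = ["city", "subdivision", "ua_browser_major", "ua_device_family"]
K_IP = 3   # minimum distinct IPs required per bucket
K_PV = 5   # minimum distinct pages required per bucket
ANONYMIZED = None  # marker used when a field value has been generalized away

def bucket_key(row, fields):
    # Bucket rows by the current values of the sensitive fields.
    return tuple(row[f] for f in fields)

def sanitize(rows, rarity):
    # Repeatedly anonymize the rarest sensitive value of offending buckets
    # until every bucket has >= K_IP distinct IPs and >= K_PV distinct pages,
    # or there is nothing left to generalize.
    while True:
        buckets = defaultdict(list)
        for row in rows:
            buckets[bucket_key(row, SENSITIVE_FIELDS)].append(row)

        changed = False
        for bucket in buckets.values():
            ips = {r["ip"] for r in bucket}
            pages = {r["page"] for r in bucket}
            if len(ips) >= K_IP and len(pages) >= K_PV:
                continue
            # Pick the not-yet-anonymized sensitive field whose value is rarest
            # and blank it out for every row in the bucket.
            candidates = [f for f in SENSITIVE_FIELDS if bucket[0][f] is not ANONYMIZED]
            if not candidates:
                continue  # every sensitive field is already anonymized here
            field = min(candidates, key=lambda f: rarity(f, bucket[0][f]))
            for r in bucket:
                r[field] = ANONYMIZED
            changed = True

        if not changed:
            return rows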

Strategies to review

The good Ks

We (the Analytics-Team) did a manual review of browsing patterns over an hour with various distinct IPs, distinct pages and settings. Detailed data on the exercise can be found on this dedicated page, along with Hive code.

We found that:

  • When looking at groups of pages viewed by multiple people, it is sometimes easy to guess which sub-groups of pages could have been viewed together based on topics.
    • It is however not possible to re-attach sub-groups to the underlying people with certainty.
    • It could be feasible to reattach subgroups to the underlying people with some probability of being right using prior knowledge of browsing habits of those people.
  • When looking at groups with a small number of distinct pages, we noticed a big difference between groups having 2 or fewer distinct pages and groups having 5 or more.
    • Groups with 2 or fewer distinct pages can almost always be identified as single sessions.
    • Groups with 5 or more distinct pages can almost never be identified as single sessions, and can sometimes be identified as two sessions.

This means that:

  • The minimum anonymization we could go for would ensure that at least 2 distinct IPs and 5 distinct pages occur per bucket. It would involve anonymizing 89.98% of buckets, representing 32.83% of requests.
  • We prefer to err on the safe side and add more variability to our buckets, ensuring that at least 3 distinct IPs and 5 distinct pages occur per bucket. This involves anonymizing 91.28% of buckets, representing 35.11% of requests.

Choosing hourly or longer term data to establish the "uniqueness" of values in sensitive fields

We want to anonymize the rarest values first, because they are the most identifying. We can establish the "rareness" of each value by looking at either hourly statistics or longer-term statistics, such as monthly ones:

  • Using hourly statistics would establish a Local probability. This should reduce the processing time and the number of steps needed for the algorithm to terminate (because locally rare values are anonymized first, leading to faster progress towards buckets of size greater than K).
  • Using monthly statistics would establish a more Global probability. This normalizes any temporal patterns (such as hourly or weekly seasonality) and accounts for differences across time zones. This approach favors global data quality but would run slower.

We decided as a team to use hourly statistics.
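For illustration only (the production statistics are computed in Hive), hourly rarity can be estimated by counting how often each value of a field appears in that hour's data. The column name below is just an example; the resulting probabilities could back the hypothetical rarity(field, value) helper used in the sketch above.

from collections import Counter

def hourly_rarity(rows, field):
    # Local probability of each value of `field` within one hour of data.
    # Lower probabilities mean rarer, more identifying values, which the
    # algorithm anonymizes first.
    counts = Counter(row[field] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# Example: rarity of user-agent device families in one hour of pageviews.
# device_rarity = hourly_rarity(one_hour_of_rows, "ua_device_family")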

Determining the information lost in the anonymization process

NOTE: The following assumptions may be wrong; this section is, for now, just an initial draft, subject to review and modification :]

Statistics on the anonymized fields

The anonymization algorithm allows us to retrieve, at the end of the process, the number of pageviews that got a certain field anonymized. For example:

FIELD        #PAGEVIEWS WHERE THAT FIELD GOT ANONYMIZED
city                 4,118,697
ua_device_family     1,749,547
subdivision            148,113
ua_browser_major       132,594
country                 76,716
country_code            76,716
zero_carrier            50,384
ua_browser_family       48,029
ua_wmf_app_version      34,399
ua_os_minor             33,334
ua_os_major             19,689
ua_os_family            15,641
continent                  793

Note that the same pageview can have several fields anonymized, so the sum of all these values does not correspond to the total number of pageviews that have some field anonymized; there is an overlap.

How much information does a pageview_hourly dataset contain?

Let's suppose we have 1 hour of pageview_hourly data. Each row possesses a value (view_count) and a set of dimensions or breakdowns (the rest of the fields). The less probable (less common) the dimension values are, the more information the row holds: a pageview of "Amoeba defense" viewed from Abuja (Nigeria) tells us more than a pageview of "Michael Jackson" viewed from New York, because we already know that there are a lot of "Michael Jackson" pageviews from New York. So the quantity of information is proportional to:

1 / P_row

Or if we want to express this relation in bits, we can do:

log2(1 / P_row)

This will give us the number of bits needed to binary-code the dimension values in the context of the dataset. And how do we define P_row? We can use the view_count value of the row divided by the total view_count of the whole dataset:

P_row = view_count_row / view_count_dataset

Also, we should consider that a row represents view_count_row pageviews, so its information load should be multiplied by view_count_row:

H_row = view_count_row * log2(view_count_dataset / view_count_row)

And finally, if we sum all information held by each row, we get the information held by the whole dataset:

H_dataset = ∑_row view_count_row * log2(view_count_dataset / view_count_row)
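As a small sketch (not part of the documented pipeline), this formula can be computed directly from the per-row view counts:

import math

def dataset_information(view_counts):
    # H_dataset = sum over rows of view_count_row * log2(total / view_count_row),
    # i.e. the Shannon information (in bits) of the row distribution, scaled by
    # the total number of pageviews.
    total = sum(view_counts)
    return sum(vc * math.log2(total / vc) for vc in view_counts)

# Toy example: three rows with 8, 4 and 4 pageviews (16 in total).
# H = 8*log2(16/8) + 4*log2(16/4) + 4*log2(16/4) = 8 + 8 + 8 = 24 bits.
print(dataset_information([8, 4, 4]))  # 24.0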

How much information loss does anonymization represent?

To get the information loss L, we can calculate the information held by the dataset before and after anonymization and compute the following ratio:

L = (H_before - H_after) / H_before
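Continuing the toy example above (illustrative numbers only):

def information_loss(h_before, h_after):
    # Relative information loss: L = (H_before - H_after) / H_before.
    return (h_before - h_after) / h_before

# If anonymizing a field merges the two toy rows of 4 pageviews into a single
# row of 8, then H_after = 8*log2(16/8) + 8*log2(16/8) = 16 bits, so
# L = (24 - 16) / 24 ≈ 0.33, i.e. roughly a third of the information is lost.
print(information_loss(24.0, 16.0))  # 0.333...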