Analytics/EventLogging/Data retention and auto-purging

To comply with WMF's Privacy Policy and the Data retention guidelines (m:Data_retention_guidelines), EventLogging data goes through an automatic purging process. In a nutshell, this process deletes all sensitive information contained in EventLogging events older than 90 days.

Definitions

What is personally identifiable information (PII)?

Information ... that could be used to personally identify you, like: name, IP address, email, phone number, credit card number, government ID, etc. We also consider MediaWiki's userName and userId as potentially identifying. Please read the "Definitions" section of the Privacy Policy as the authoritative source on those concepts: https://wikimediafoundation.org/wiki/Privacy_policy#Definitions

What do we consider sensitive information?

Any information that both:

  1) Is associated with any personally identifiable information (PII).
  2) Expresses any of: racial or ethnic origins, sexual orientation, marital or familial status, religion, political affiliation, etc.

Please read the "Definitions" section of the Privacy Policy as the true authority on those concepts: https://wikimediafoundation.org/wiki/Privacy_policy#Definitions

Usually, in the EventLogging context, sensitive information means reading history: the pages visited by a user, or the pages watched, or the recommendations clicked, etc.

Exceptions

Any editing history associated with a MediaWiki userName or userId is considered non-sensitive in EventLogging. The revision table in the MediaWiki database already contains this data, and it is publicly available.

What do the Data Retention Guidelines recommend?

Purging

Strategies

There are three purging strategies in EventLogging, listed here from the strictest to the most permissive.

Full purge

It permanently deletes whole event records from the database when they reach the age of 90 days. This is suited for schemas that are naturally sensitive or for schemas whose information doesn't need to be kept for a longer period of time. Note that this is the default strategy for new schemas.

Partial purge

It permanently assigns a garbage-value to a subset of the event's fields when the event reaches the age of 90 days. The rest of the fields are kept indefinitely. This is suited for schemas that can be easily sanitized and whose information is of great value and needs to be kept for a longer period of time. Note that this strategy includes the

Minimal purge

It permanently assigns a garbage-value to just 2 fields when the event reaches the age of 90 days. Those fields are part of the EventCapsule, a wrapper schema that is common to all EventLogging schemas. The 2 fields are clientIp and userAgent. All the other fields in the schema are kept indefinitely. This is suited for non-sensitive schemas.

Criteria for purging

The criteria for choosing the purging strategy (which fields must be purged and which can be kept) are:

  • If the schema has both PII and sensitive data, and most of the fields are either one or the other: Full purging.
  • If the schema has both PII and sensitive data, but there are still other fields that are neither PII nor sensitive data and that should be kept: Partial purging. The fields chosen to be purged will be either the identifying fields or the sensitive fields. Initially there's no need to purge both the identifying fields and the sensitive fields.
  • If the schema doesn't have both PII and sensitive data: Minimal purging.
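
The criteria above can be sketched as a small decision function. This is only an illustration: the names PurgeStrategy, choose_strategy and the three boolean parameters are hypothetical, and judging whether a schema "has other fields worth keeping" still requires a human review of the schema.

```python
from enum import Enum


class PurgeStrategy(Enum):
    FULL = "full"
    PARTIAL = "partial"
    MINIMAL = "minimal"


def choose_strategy(has_pii: bool,
                    has_sensitive: bool,
                    has_other_fields_worth_keeping: bool) -> PurgeStrategy:
    """Map the three criteria to a purging strategy (hypothetical helper)."""
    # Without both PII and sensitive data, only the EventCapsule
    # fields (clientIp, userAgent) need to be purged.
    if not (has_pii and has_sensitive):
        return PurgeStrategy.MINIMAL
    # Both are present, but some remaining fields are valuable:
    # purge either the identifying or the sensitive fields.
    if has_other_fields_worth_keeping:
        return PurgeStrategy.PARTIAL
    # Most fields are PII or sensitive: drop the whole event.
    return PurgeStrategy.FULL
```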

Implementation

The auto-purging will be activated in early Q2 2015. It acts upon 2 EventLogging storage systems: the MySQL/MariaDB replica and the Hadoop cluster. Note that the only database implementing partial and minimal purging will be MySQL/MariaDB. The Hadoop cluster will initially full-purge all events for all schemas after 90 days, so the MySQL/MariaDB instance will be the one holding all the historical data, properly purged (fully, partially or minimally).

The purging is implemented using white-lists. Any schema/field we want to keep needs to be in the white-list. This makes full purge the default strategy for new schemas.
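
As a rough illustration of the white-list idea, the sketch below purges a single event held in a Python dictionary. It is not the actual purging code: the WHITELIST contents, the schema name "ExampleSchema", the field names, the purge_event helper and the use of None as the garbage-value are all assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, Optional, Set

RETENTION = timedelta(days=90)

# Hypothetical white-list: schema name -> fields kept after 90 days.
WHITELIST: Dict[str, Set[str]] = {
    "ExampleSchema": {"timestamp", "wiki", "event_action"},
}

PURGED = None  # stand-in for the garbage-value (an assumption)


def purge_event(schema: str,
                event: Dict[str, Any],
                now: datetime) -> Optional[Dict[str, Any]]:
    """Return the event as it should be stored at time `now`."""
    # Events younger than 90 days are left untouched.
    if now - event["timestamp"] < RETENTION:
        return event
    keep = WHITELIST.get(schema)
    # Schema not white-listed: full purge, the whole record goes away.
    if keep is None:
        return None
    # White-listed schema: keep listed fields, overwrite the rest.
    return {field: (value if field in keep else PURGED)
            for field, value in event.items()}
```

Under this model, a minimal purge corresponds to white-listing every field except clientIp and userAgent, and a schema that is absent from the white-list is fully purged by default.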

Which schemas/fields are being purged?

Today, there are several places you can look for that:

In the near future, we'll have the actual white-list stored in a repository. This will be the single source of truth on purging information.

Bucketizing of editCount field

The field editCount is considered an identifier field (PII) because it can single out editors with a high number of edits, so it often needs to be purged. However, it is usually a very valuable piece of information for the schema owners.

The strategy adopted here is to anonymize it, transforming it into a bucketed value named "editCountBucket". Instead of an integer "number of edits" (e.g. 402) it would store a string "bucket of edits" (e.g. "100-999 edits"). Here is the partition algorithm used in EventLogging to calculate the buckets:

    if editCount = 0 then editCountBucket = "0 edits"
    if 1 <= editCount <= 4 then editCountBucket = "1-4 edits"
    if 5 <= editCount <= 99 then editCountBucket = "5-99 edits"
    if 100 <= editCount <= 999 then editCountBucket = "100-999 edits"
    if 1000 <= editCount then editCountBucket = "1000+ edits"
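
For instrumentation code that needs to compute the bucket itself, the partition above translates directly into a small function. The function name edit_count_bucket is hypothetical, and rejecting negative counts is an assumption, since the algorithm above does not cover them.

```python
def edit_count_bucket(edit_count: int) -> str:
    """Map a numeric editCount to its editCountBucket string."""
    if edit_count < 0:
        # Not covered by the partition above; rejecting is an assumption.
        raise ValueError("editCount must be non-negative")
    if edit_count == 0:
        return "0 edits"
    if edit_count <= 4:
        return "1-4 edits"
    if edit_count <= 99:
        return "5-99 edits"
    if edit_count <= 999:
        return "100-999 edits"
    return "1000+ edits"
```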

Implementation

For schemas that existed before end of Q1 2015

A collection of scripts will be deployed to the EventLogging MySQL database to add an editCountBucket column to the affected tables and to populate that column before the original editCount field is purged.
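
The two steps (add the column, then populate it) might look roughly like the sketch below. It is not the deployed script: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it can be run anywhere, and the table name "ExampleSchema_1", the column names and the function name are hypothetical.

```python
import sqlite3

TABLE = "ExampleSchema_1"  # hypothetical table name


def add_and_populate_bucket(conn: sqlite3.Connection) -> None:
    """Add an editCountBucket column and fill it from editCount."""
    # Step 1: add the new column to the affected table.
    conn.execute(f'ALTER TABLE "{TABLE}" ADD COLUMN editCountBucket TEXT')
    # Step 2: populate it with the same partition used for bucketizing.
    # Rows with a NULL editCount keep a NULL bucket.
    conn.execute(f"""
        UPDATE "{TABLE}" SET editCountBucket = CASE
            WHEN editCount = 0 THEN '0 edits'
            WHEN editCount BETWEEN 1 AND 4 THEN '1-4 edits'
            WHEN editCount BETWEEN 5 AND 99 THEN '5-99 edits'
            WHEN editCount BETWEEN 100 AND 999 THEN '100-999 edits'
            WHEN editCount >= 1000 THEN '1000+ edits'
        END
    """)
    conn.commit()
```

Running the population step before the purge matters: once editCount has been overwritten, the bucket can no longer be derived.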

For schemas created after end of Q1 2015

New schemas should be instrumented to send an already bucketized editCountBucket field instead of the numeric editCount. Note that, if your schema does not contain any sensitive information, there's no need to bucketize the editCount field.

Which schemas are being automatically bucketized?

Today, there are several places you can look for that:

In the near future, we'll have the actual bucketization deployed in the EventLogging MySQL database. This will be the single source of truth on bucketizing information.

Important warning

The default strategy for new schemas is full purge. This is a security measure to avoid losing control of the sensitive data inside EventLogging databases. This means that if you create a new schema, you'll see your events deleted after 90 days.

If you want to keep some or all of your new schema's data, please contact the Analytics team! We'll study the contents of your schema and may propose a less strict purging strategy.