Analytics/Unique Devices/Last access solution


Objective

The analytics team aims to count unique devices per project per day and month in a way that does not uniquely identify, fingerprint or otherwise track users. The desired outcome is a report on the number of Unique Devices per project for a given month. This will be achieved by setting cookies with a Last-Access day on clients and counting sightings of browsers with an old cookie or no cookie at all.

Deliverable

A report in the following format:

Bucket                                             2015-03        2015-04        ...
en.wikipedia                                       200,000,000    210,000,000    ...
es.wikipedia                                        20,000,000     21,000,000    ...
en.wikisource                                        2,000,000      2,100,000    ...
es.wikisource                                          200,000        210,000    ...
...                                                ...            ...            ...
overall total (not deduplicated across projects)   500,000,000    510,000,000    ...

Caveats

  • To report uniques per project, we set a WMF-Last-Access cookie per project.
  • The reported overall total is a sum of unique devices to each domain and includes duplicates (the same browser on the same computer visits multiple wikis). We do not think it is possible to de-duplicate with a Last-Access approach because we do not have a common ending for all our domains (like *.wikipedia.org). For example, cookies for *.wikipedia.org and *.wikidata.org cannot be shared. To count uniquely across domains we would need another domain (central.wikipedia.org) and a set of redirects among our domains to this centralized place to set cookies.
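As a toy illustration of this caveat (device names and numbers are made up), summing per-domain uniques counts a device once for every domain it visited:

  # Toy illustration only: Last-Access cookies are scoped per domain, so the
  # per-domain unique sets below cannot actually be merged in production.
  per_domain_uniques = {
      "en.wikipedia.org": {"device-A", "device-B"},
      "www.wikidata.org": {"device-A"},  # device-A also visited Wikidata
  }

  overall_total = sum(len(devices) for devices in per_domain_uniques.values())
  true_dedup = len(set.union(*per_domain_uniques.values()))

  print(overall_total)  # 3 -> what the reported overall total adds up to
  print(true_dedup)     # 2 -> not recoverable with per-domain cookies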

Bots

  • We will need to filter bots in our report, as the cookie system will over-count them. A bot that does not accept cookies is counted as distinct on every request.

This is easier said than done, but we are going to use requests tagged with 'nocookie' as a means to identify the percentage of our traffic that comes from bots not tagged as such. See [https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution/BotResearch BotResearch].

Privacy

  • Users can delete or refuse cookies
  • We will not be able to identify users from the data passed in the cookie. The cookie contains only a year, month and day.
  • We will comply with Wikimedia's Privacy Policy

Technicalities

In order to produce the above report these are the cookies we need; each cookie stores the last access date per project:

WMF-Last-Access:

  • <<language>>.m.<<project>>.org: mobile site uniques for <<project>> and <<language>>
  • <<language>>.<<project>>.org: desktop site uniques for <<project>> and <<language>>


How will we be counting: Plain English

Setting and value of WMF-Last-Access cookie

Inside Varnish VCL we will set the cookies and alter the [https://wikitech.wikimedia.org/wiki/X-Analytics X-Analytics] header. There are two possible cases per cookie:

1) A request comes in and the user does not have a WMF-Last-Access cookie: we issue one whose value is the current access date (day, month and year), with a future expiry time (any expiry over a month will work). The cookie value is "14-Dec-2015", for example.

2) A request comes in and the user already has a WMF-Last-Access cookie: we re-issue the cookie with the new date and a future expiry, and record the old date in the X-Analytics header. In our prior example, if one day has gone by between requests, the cookie value is reset to "15-Dec-2015" and we store the following in the X-Analytics hash:

X-analytics["WMF-Last-Access"] = "14-Dec-2015"
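The production logic for this lives in Varnish VCL; the following is only a minimal Python sketch of the two cases above, using made-up names (handle_request, a request_cookies dict) to make the flow concrete:

  from datetime import datetime, timedelta

  COOKIE = "WMF-Last-Access"

  def handle_request(request_cookies, now=None):
      """Sketch of the per-request cookie handling described above.

      Returns (set_cookie_header, x_analytics): the cookie is always
      (re-)issued with today's date, and the *previous* date, if any,
      is exposed in the X-Analytics map for later counting.
      """
      now = now or datetime.utcnow()
      today = now.strftime("%d-%b-%Y")    # e.g. "14-Dec-2015"
      expires = now + timedelta(days=32)  # any expiry over a month works

      x_analytics = {}
      old_value = request_cookies.get(COOKIE)
      if old_value is not None:
          # Case 2: cookie already present -> keep the old date in X-Analytics.
          x_analytics[COOKIE] = old_value
      # Cases 1 and 2: issue (or re-issue) the cookie with the current date.
      set_cookie_header = f"{COOKIE}={today}; Expires={expires:%a, %d %b %Y %H:%M:%S GMT}"
      return set_cookie_header, x_analytics

For the example above, a first request with no cookie gets a "14-Dec-2015" cookie and an empty X-Analytics map; a request one day later gets a "15-Dec-2015" cookie plus X-analytics["WMF-Last-Access"] = "14-Dec-2015".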

In order to count unique devices in the cluster we will get from the webrequest table all requests for, say, January that do not have a January date set in x-analytics["WMF-Last-Access"] (this includes requests without any date at all). All of those are January uniques, because they are requests that came in during January without a January date in the WMF-Last-Access cookie.

Same logic for daily: to count uniques for December 15th we will get all requests for December 15th whose X-analytics["WMF-Last-Access"] value is a date older than December 15th. So the request in our example above will be counted. Those are the uniques for December 15th.

Note that this method of counting assumes that requests come from real users that accept cookies: we are assuming that if we set a cookie we will be able to retrieve it in a subsequent request. This is true only for browser clients that accept cookies. While counting we only look at traffic tagged as "user" in the cluster, but we have to be aware of bots that are not reported as such. In order to discount those requests we only count requests that have nocookie=0, meaning that those requests came to us with 'some' cookie set. We use the nocookie flag in X-Analytics as a "cheap" proxy to identify bots.
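In production this counting is a query over the webrequest table; the sketch below just expresses the filter conditions in Python over simplified rows (the field names last_access and nocookie are illustrative stand-ins for the X-Analytics map entries):

  from datetime import date, datetime

  def parse_last_access(value):
      # Parse the X-Analytics WMF-Last-Access value, e.g. "14-Dec-2015".
      return datetime.strptime(value, "%d-%b-%Y").date() if value else None

  def is_monthly_unique(row, year, month):
      # Unique for the month: request arrived with some cookie (nocookie=0)
      # and carries no WMF-Last-Access date from that same month
      # (requests with no date at all are included).
      if row.get("nocookie", 1) != 0:
          return False
      last = parse_last_access(row.get("last_access"))
      return last is None or (last.year, last.month) != (year, month)

  def is_daily_unique(row, day):
      # Unique for `day`: nocookie=0 and the WMF-Last-Access date is
      # missing or older than `day`.
      if row.get("nocookie", 1) != 0:
          return False
      last = parse_last_access(row.get("last_access"))
      return last is None or last < day

  # The request from the example above, seen on December 15th:
  row = {"last_access": "14-Dec-2015", "nocookie": 0}
  print(is_daily_unique(row, date(2015, 12, 15)))  # True
  print(is_monthly_unique(row, 2015, 12))          # False, already seen in Dec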

Nocookie

Per the X-Analytics documentation, every request that comes in without any cookies whatsoever is tagged with nocookie=1. Requests that come without any cookies at all are either bots, users browsing with cookies off, or users in an "incognito" mode. See [https://wikitech.wikimedia.org/wiki/Analytics/Unique_clients/Last_access_solution/BotResearch BotResearch]; it turns out that nocookie=1 is a cheap proxy for ruling out a good chunk of what might be bot traffic.
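Conceptually the tagging (which happens in Varnish, per the X-Analytics documentation) amounts to something like this Python sketch:

  def nocookie_flag(request_headers):
      # nocookie=1 when the request carries no Cookie header at all, else 0.
      return 0 if request_headers.get("Cookie") else 1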

Nocookie Offset

Now, if possible we want to make sure that we count users who might be coming to Wikipedia with a fresh session and no cookies at all; the only true way to distinguish those requests from bots is to look at request ratios. Thus, we also count as uniques those requests with nocookie=1 whose signature (or fingerprint) appears only once in a day or month. The signature is calculated per project as hash(ip, user_agent, accept_language). The idea behind this reasoning is that, if you are a real user and you did not refresh your browser session, there is only one request in the period you could make without cookies: the first one. Subsequent requests will be sending the WMF-Last-Access cookie.

We add this offset to the numbers that result from looking at the WMF-Last-Access cookie.
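A rough Python sketch of this offset, assuming an iterable of request rows with ip, user_agent, accept_language, project and nocookie fields (all names illustrative):

  import hashlib
  from collections import Counter

  def signature(row):
      # Per-project fingerprint: hash(ip, user_agent, accept_language).
      key = "|".join((row["ip"], row["user_agent"], row["accept_language"]))
      return (row["project"], hashlib.sha1(key.encode("utf-8")).hexdigest())

  def nocookie_offset(rows):
      # For nocookie=1 traffic, count fingerprints seen exactly once in the
      # period (day or month): a real first-time user can make at most one
      # request without the WMF-Last-Access cookie, so repeated sightings
      # are more likely bots.
      sightings = Counter(signature(r) for r in rows if r.get("nocookie") == 1)
      offset = Counter()
      for (project, _fingerprint), count in sightings.items():
          if count == 1:
              offset[project] += 1
      return offset  # added, per project, to the cookie-based uniques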

Future work?

Deploy a cookie on *.wikimedia.org to count "per project". Doable?


Developer docs

Meeting notes

2015-02-27

Attending: bbblack, otto, mforns, milimetric, nuria

Brandon thinks that it is not that complicated to do, but it is likely best done by someone who knows VCL, as at this time the code structure and related work are a total mess. The people who have made these changes before are himself, ori and faidon. Nuria is to provide a functional implementation that will be further improved through code review. Brandon thinks that if we start next week we should be able to have this in production in a month.

Time functions in VCL: very limited support; we normally end up writing inline C rather than adding (and compiling) other extensions. Any manipulation we do with time will require some memoization so as not to harm performance; otherwise we are left with using functions like 'now' and regexes. Improvements for memoization to be suggested via code review.

There is a Phabricator ticket [1] to put in one place the code that adds various parameters to the X-Analytics header. The ticket was opened by Yuri; we could benefit from his work when adding our counters to the X-Analytics header.

There is generic code in place to strip off cookies in order to do cache lookups.

2015-03-09

Attending: Aaron, Nuria

Discuss how to count daily uniques using a Last-Access year-month-day. This implies changes in the VCL logic, issuing a cookie for every request, and passing on the last-access date in the x-analytics header. This will need to be discussed and reviewed with Ops.