Analytics/Geolocation

Fundraising and Analytics both use geolocation in understanding how users behave and how best to interact with them.

In Fundraising's case, geolocation through geoiplookup.wikimedia.org helps identify which users to display banners to, in which countries, and how far banners penetrate in various locales. In the case of Analytics, geolocation is used for per-country editor breakdowns (through the Geowiki scripts), per-country reader breakdowns, and ad-hoc data requests from various departments.

In all of these situations, we depend on the MaxMind databases. This is a guide to what they are, how they work (and when they don't), and how to access them internally.

The databases

MaxMind is an organisation that provides a variety of both free and paid IP geolocation databases, which resolve down to country, region and city level. The Wikimedia Foundation currently has access to:

  • GeoIP Country Edition: resolves down to the country level (e.g., "United States");
  • GeoIP Region Edition: resolves down to the region level (e.g., "California");
  • GeoIP City Edition: resolves down to the city level (e.g., "San Francisco");
  • GeoIP ASNum Edition: resolves down to Autonomous System Numbers;
  • GeoIP Country V6 Edition: resolves down to the country level for IPv6 IPs.

MaxMind updates these databases once a week, on Tuesdays, and the updates filter down to our machines in the form of entire databases (rather than deltas).
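When working with historical data it can help to know which weekly release was current on a given day. A minimal sketch, assuming a release is usable from the Tuesday it is published (the copy on our machines may arrive later); the function name is made up for illustration:

```python
from datetime import date, timedelta

TUESDAY = 1  # date.weekday() counts Monday as 0


def last_release_tuesday(day):
    """Return the most recent Tuesday on or before `day`.

    Assumption: the weekly MaxMind update is usable from the Tuesday it is
    published; the copy on a given machine may lag behind that.
    """
    days_since_tuesday = (day.weekday() - TUESDAY) % 7
    return day - timedelta(days=days_since_tuesday)
```

Two dates that map to the same Tuesday were, under that assumption, geolocated against the same release.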

Using MaxMind

Access and data formats

The MaxMind databases can be queried from any of our analytics machines, including stat1 and stat1002 (stat2), with the geoiplookup command:

/usr/bin/geoiplookup [IP address]

If you have access to those machines, no problem. If you don't, and you need it, submit an RT ticket explaining why you need it, and get your manager to sign off on it. Then read server access responsibilities very very carefully.

Either way, once queried, the MaxMind databases produce something that looks like...

ironholds@stat1002:~$ /usr/bin/geoiplookup 216.38.130.164 
GeoIP Country Edition: US, United States
GeoIP City Edition, Rev 1: US, CA, San Francisco, N/A, 37.774899, -122.419403, 807, 415
GeoIP Region Edition, Rev 1: US, CA
GeoIP City Edition, Rev 0: US, CA, San Francisco, N/A, 37.774899, -122.419403
GeoIP Region Edition, Rev 0: US, CA
GeoIP ASNum Edition: AS6994 Fastmetrics

(Using the office IP address for obvious privacy reasons. I suspect people know where we work.)

To break this down, we have the country, city, region and ASNum editions, as mentioned above, with two different revisions each of city and region. The IPv6 database isn't displayed, because the queried address is not an IPv6 address.
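The output format above can be split mechanically into one entry per edition. A rough Python sketch, assuming every line has the "label: fields" shape shown in the sample; `parse_geoiplookup` and `SAMPLE` are illustrative names, not existing tooling:

```python
def parse_geoiplookup(output):
    """Turn raw geoiplookup output into {edition label: [fields]}."""
    parsed = {}
    for line in output.splitlines():
        # Each line looks like "<edition label>: <comma-separated fields>".
        # Splitting on the first ": " keeps labels such as
        # "GeoIP City Edition, Rev 1" in one piece.
        label, sep, value = line.partition(": ")
        if not sep:
            continue  # blank line, or anything without a label
        parsed[label.strip()] = [field.strip() for field in value.split(", ")]
    return parsed


# The sample output from this page, used as test input.
SAMPLE = """\
GeoIP Country Edition: US, United States
GeoIP City Edition, Rev 1: US, CA, San Francisco, N/A, 37.774899, -122.419403, 807, 415
GeoIP Region Edition, Rev 1: US, CA
GeoIP City Edition, Rev 0: US, CA, San Francisco, N/A, 37.774899, -122.419403
GeoIP Region Edition, Rev 0: US, CA
GeoIP ASNum Edition: AS6994 Fastmetrics
"""

editions = parse_geoiplookup(SAMPLE)
```

Note that a naive split on ", " will also break up any field that itself contains a comma, so treat the field lists as a starting point rather than a clean record.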

For our purposes, the datapoints most likely to be useful are country, region and city. Country is best retrieved from the GeoIP Country Edition, simply because its output is the most useful; while the region- and city-level databases also generate country IDs, they're two- or three-letter abbreviations that can be difficult for end users to parse if they're passed into datasets or visualisations. The Country Edition, on the other hand, produces the full name ("United States" versus "US").

The databases cannot be queried one-by-one, and will always return some variant on the above data format. This is good because it guarantees you always retrieve all the available data, and bad because it demands some data scrubbing (see the example functions for how this can be handled in R).
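As one way of handling that scrubbing outside R, here is a hedged Python sketch that runs the lookup and keeps only the full country name. It assumes the Country Edition line has the format shown above and that `geoiplookup` exits with status 0 (otherwise `subprocess.check_output` raises); the function names and the `run` parameter are illustrative.

```python
import subprocess

GEOIPLOOKUP = "/usr/bin/geoiplookup"
COUNTRY_PREFIX = "GeoIP Country Edition: "


def country_from_output(output):
    """Return the full country name from geoiplookup output, or None."""
    for line in output.splitlines():
        if line.startswith(COUNTRY_PREFIX):
            # Drop the label, then split off the country code once only,
            # so a name that contains a comma stays in one piece.
            code, sep, name = line[len(COUNTRY_PREFIX):].partition(", ")
            return name if sep else None
    return None


def geoip_country(ip, run=subprocess.check_output):
    """Look up `ip` and return its country name, or None if unresolved.

    `run` is injectable so the parsing can be exercised without the binary.
    """
    # Passing a list (no shell) means the address is never shell-interpreted.
    output = run([GEOIPLOOKUP, ip]).decode("utf-8", errors="replace")
    return country_from_output(output)
```

On a machine with the databases installed, `geoip_country("216.38.130.164")` would be expected to give the same country name as the Country Edition line in the sample output.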

Caveats

There are a few caveats with using MaxMind's data for geolocation.

While it's the most accurate data we have access to, that doesn't mean it's flawless. MaxMind themselves boast 99.8% accuracy at the country level, but it drops off at Region level (90% in the US, less elsewhere) and at City level (83% in the US, and only if "accurate" means "within 40km"; less elsewhere). The MaxMind city accuracy report shows pretty high inaccuracy at City resolution for a variety of countries, including many European ones (e.g. Finland). Generally speaking, it's probably not worth relying on for anything below country level unless you really, really have to.

IPv6 support is currently very patchy for the paid versions, resulting in a generic error message rather than actual data. MaxMind claimed that they would have resolved this by "Q4 of 2013"; given that it's early 2014 at the time of writing, and still not resolved, we can conclude this was optimistic.

Finally, when using the data to analyse historical traffic, bear in mind that IPs do (very occasionally, but still occasionally) switch nations between database updates.
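Given the IPv6 caveat above, it can be worth checking an address's version before looking it up, so that an error for an IPv6 address is not mistaken for a genuinely unknown location. A small sketch using Python's standard `ipaddress` module; the helper name is illustrative:

```python
import ipaddress


def ip_version(ip):
    """Return 4 or 6 for a valid address, or None if it does not parse."""
    try:
        return ipaddress.ip_address(ip).version
    except ValueError:
        return None
```

Addresses for which this returns 6 can then be counted or reported separately, rather than silently dropped from a per-country breakdown.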

Example functions

Example functions for interfacing with the geolocation database; if you've got one in your language (Python, say), post it where everyone can use it, darnit.

R

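  #Function for retrieving country-level data
  geoip <- function(IP){
    
    #Query the MaxMind GeoIP databases for the submitted IP. shQuote() stops a malformed IP being interpreted by the shell.
    IPData <- system(command = paste("/usr/bin/geoiplookup", shQuote(IP)),
                     intern = TRUE)
    
    #Pick out the Country Edition line by its label, rather than assuming it always comes first
    country_line <- grep("^GeoIP Country Edition: ", IPData, value = TRUE)[1]
    
    #If there is no such line, or it has no "code, name" pair (e.g. an error message), return NA
    if(is.na(country_line) || !grepl(", ", country_line, fixed = TRUE)){
      return(NA_character_)
    }
    
    #Strip the label and the country code, leaving the full name (which may itself contain commas or hyphens)
    processed_IPData <- sub("^GeoIP Country Edition: [^,]*, ", "", country_line)
    
    #Return!
    return(processed_IPData)
    
  }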