Analytics/Geowiki


Geowiki is a set of scripts that automatically analyze the number of active editors per project and country. The generated data is split into a public part (available through http://gp.wmflabs.org/ (cf. domain description)) and a "foundation-only" part (available through https://stats.wikimedia.org/geowiki-private/ ).

Source code

The source code for the geowiki scripts themselves is at https://gerrit.wikimedia.org/r/#/admin/projects/analytics/geowiki .

The repository holding the generated public data can be found at https://gerrit.wikimedia.org/r/#/admin/projects/analytics/geowiki-data . The repository holding the generated "foundation-only" data can be synced over to machines by requiring puppet's misc::statistics::geowiki::data::private_bare::sync.
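For reference, the public data repository can be cloned anonymously from Gerrit; the exact clone URL below is an assumption derived from the project name:

    # Clone the public geowiki data repository (the URL layout is an
    # assumption based on the Gerrit project name).
    git clone https://gerrit.wikimedia.org/r/analytics/geowiki-data
    # The "foundation-only" data is not available this way; it only gets
    # synced to machines that require the puppet class named above.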

Generated data

The geowiki scripts generate several hundred files. To keep the overview manageable, we use ${WIKI_NAME} to refer to the names of wikis.

Public data

Dashboards

None.

(The related dashboards/reportcard dashboard is part of the dashboard-data repository. See Global-Dev_Dashboard)

Datafiles

Datasources

Geo

None.

Graphs

Foundation-only data

(Currently, no visualization is offered for this data.)

WIKI_NAME

${WIKI_NAME} is (as of 2013-09-15) any of ab, ace, af, ak, als, am, ang, an, arc, ar, arz, as, ast, av, ay, az, bar, ba, bat_smg, bcl, be, be_x_old, bg, bh, bi, bjn, bm, bn, bo, bpy, br, bs, bug, bxr, ca, cbk_zam, cdo, ceb, ce, chr, ch, chy, ckb, co, crh, cr, csb, cs, cu, cv, cy, da, de, diq, dsb, dv, dz, ee, el, eml, en, eo, es, et, eu, ext, fa, ff, fi, fiu_vro, fj, fo, frp, frr, fr, fur, fy, gan, ga, gd, glk, gl, gn, got, gu, gv, hak, ha, haw, he, hif, hi, hr, hsb, ht, hu, hy, ia, id, ie, ig, ik, ilo, io, is, it, iu, ja, jbo, jv, kaa, kab, ka, kbd, kg, ki, kk, kl, km, kn, koi, ko, krc, ksh, ks, ku, kv, kw, ky, lad, la, lbe, lb, lez, lg, lij, li, lmo, ln, lo, ltg, lt, lv, map_bms, mdf, mg, mhr, mi, mk, ml, mn, mrj, mr, ms, mt, mwl, my, myv, mzn, nah, nap, na, nds_nl, nds, ne, new, nl, nn, no, nov, nrm, nso, nv, ny, oc, om, or, os, pag, pam, pap, pa, pcd, pdc, pih, pi, pl, pms, pnb, pnt, ps, pt, qu, rm, rmy, rn, roa_rup, roa_tara, ro, rue, ru, rw, sah, sa, scn, sco, sc, sd, se, sg, sh, simple, si, sk, sl, sm, sn, so, sq, srn, sr, ss, stq, st, su, sv, sw, szl, ta, te, tet, tg, th, ti, tk, tl, tn, to, tpi, tr, ts, tt, tum, tw, ty, udm, ug, uk, ur, uz, vec, vep, ve, vi, vls, vo, war, wa, wo, wuu, xal, xh, yi, yo, za, zea, zh_classical, zh_min_nan, zh, zh_yue, zu .

To add further wikis, add them to the file geowiki/data/all_ids.tsv, as sketched below.
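For example, appending a new wiki could look roughly like the following sketch; the exact column layout of all_ids.tsv is an assumption, so check the existing file first:

    # Inspect the current format before editing (the column layout shown in the
    # appended line is an assumption, not the documented format).
    tail -n 3 geowiki/data/all_ids.tsv
    printf 'zzwiki\tzz\n' >> geowiki/data/all_ids.tsv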

Getting access to the “foundation only” part of geowiki

If you're part of the foundation and need to get access to the “foundation only” part of geowiki, contact the Analytics Development team.

Necessary steps for the team to grant access

The initial accounts were created by QChris and Ottomata, so if something is unclear, ask one of them.

To grant someone access to the “foundation only” part of geowiki, carry out the following steps in order:

  1. File an RT request
  2. Make sure the new user has signed an NDA (the people behind contracts@ asked us to double-check even if the new user is listed on the staff page). To do so, send an email to contracts@ for international contractors and volunteers. For domestic employees/contractors, check with Jlohr_(WMF).
  3. Generate a password for the user. The agreement is to use
    gpg --gen-random 2 200 | LC_ALL=C tr -d -c '[:graph:]' | head -c 10
    and make sure it's 10 characters long.
  4. Generate a htpasswd line for ops to add. For example use
    touch new-htpasswd-line && chmod 600 new-htpasswd-line && htpasswd new-htpasswd-line $NEW_USERNAME
    on stat1001.
  5. Ask ops to append the new-htpasswd-line to puppet:///private/apache/htpasswd.stats-geowiki (Do not email the contents of new-htpasswd-line or paste it in IRC. Ask the op to pick up the file from stat1001 themselves).
  6. Ask ops to run puppet on stat1001, so the new file gets deployed.
  7. Do not email the username/password to the new user in cleartext. Use encrypted email. If that is not possible, hand the password to the new user in person. If that is also not possible, tell the new user the username/password over the phone.
  8. Finally, done.


Dataflow

[Diagram: Dataflow for geowiki]

(For an up-to-date version, see https://commons.wikimedia.org/wiki/File:Geowiki_workflow.png)

The big picture of the dataflow in Geowiki is illustrated in the diagram above. The whole dataflow is split into five separate tasks:

  1. Aggregation
  2. Extraction and formatting for Limn
  3. Fetching
  4. Bringing data in place
  5. Monitoring

Aggregation

This step is responsible for aggregating the editor information (which is only available for 90 days) from the slave databases into a condensed format that is stored in a permanent container.

This aggregation is grouped by project, editor's country, and date.

The implementation of this step can be found in geowiki's geowiki/process_data.py script. Running this script has been puppetized as misc::statistics::geowiki::jobs::data and is run daily on stat1003 at 12:00 (as of 2013-09-15).
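For illustration, the puppetized cron job amounts to something like the following /etc/cron.d style entry; the user, script path, and log location are assumptions, and the authoritative definition is the puppet class itself:

    # Illustrative cron.d entry only; the real job is defined in puppet as
    # misc::statistics::geowiki::jobs::data. User, paths and redirection are assumptions.
    0 12 * * * stats python /srv/geowiki/process_data.py >> /var/log/geowiki/process_data.log 2>&1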

Extraction and formatting for Limn

The aggregated data gets formatted for Limn and pushed to a public and a private data repository by running the geowiki/make_and_push_limn_files.py script.

Running this script has been puppetized as misc::statistics::geowiki::jobs::limn and is run daily on stat1003 at 15:00 (as of 2013-09-15).

As it has been decided that not all packages required to format the Limn files will be puppetized, we have to rely on a pre-initialized setup containing those packages. This setup is currently provided by the user qchris on stat1003.
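Conceptually, a manual run of this step looks roughly like the sketch below; the virtualenv path is an assumption, since the actual environment is the pre-initialized setup in the qchris account on stat1003:

    # Rough sketch of a manual run; the virtualenv path is an assumption.
    source ~/geowiki-venv/bin/activate          # pre-initialized setup with the extra packages
    python geowiki/make_and_push_limn_files.py  # format the aggregates and push to the data repos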

Fetching

Since the computation of geowiki data takes place on a different host, the computed data has to be fetched onto the serving hosts periodically to be able to serve up-to-date data.

For the public data on limn1, up-to-date data is fetched through a cron job that fetches from the geowiki data repository daily at 19:00 (as of 2013-09-15).

For the private data on stat1001, up-to-date data is fetched through a cron job that rsyncs the private data bare repository over from stat1003 daily at 17:00 (as of 2013-12-11).
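Taken together, the two fetch jobs boil down to something like the following sketch; the host names come from this page, but the on-disk paths and the rsync source are assumptions:

    # On limn1: pull the latest public data (on-disk path is an assumption).
    cd /srv/geowiki-data && git pull

    # On stat1001: rsync the private bare repository over from stat1003
    # (rsync module/path names are assumptions).
    rsync -a stat1003.eqiad.wmnet::geowiki-data-private-bare/ /srv/geowiki-data-private-bare/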

Bringing data in place

The data fetched to limn1 in the previous step relies on absolute paths that are occupied by a different repository on the limn instance. So we have to link the geowiki data into the correct place to make the graphs, etc. work. This linking happens daily at 21:00 (as of 2013-09-15) through a cronjob that runs dashboard-data's blend_in_repository.sh.
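The effect of that linking is roughly the following; all paths below are assumptions, and the authoritative logic is dashboard-data's blend_in_repository.sh:

    # Rough idea of the blending step (paths are assumptions): link the fetched
    # geowiki data into the absolute paths the Limn instance expects, so the
    # graphs and datasources resolve correctly.
    ln -sfn /srv/geowiki-data/datasources /var/lib/limn/dashboard/datasources/geowiki
    ln -sfn /srv/geowiki-data/datafiles   /var/lib/limn/dashboard/datafiles/geowiki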

We can get rid of this step once the extraction of the geowiki parts out of the dashboard repos has been finalized.

Monitoring

To ensure that the data served through Limn and stat1001 is up to date, we check geowiki's data daily, making sure that it contains recent enough data and that the contained data is within expected bounds. This monitoring has been puppetized as misc::statistics::geowiki::jobs::monitoring and is currently running on stat1003 daily at 21:30.
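A minimal sketch of such a freshness check is shown below; the directory, file pattern, threshold, and mail address are assumptions, and the actual check is the puppetized monitoring job:

    # Toy freshness check, not the real monitoring job: alert if no data file
    # has been modified within the last two days (directory, pattern, threshold
    # and recipient are assumptions).
    if [ -z "$(find /srv/geowiki-data -name '*.csv' -mtime -2 -print -quit)" ]; then
        echo "geowiki data looks stale" | mail -s "geowiki monitoring" analytics@example.org
    fi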