Analytics/Data/Pageview hourly/K Anonymity Threshold Analysis

From Wikitech

In a plan to secure our user data, the pageview_hourly dataset needs to be sanitized in such a way that it does not allow to track user path.

See this page for a broad view on the pageview_hourly sanitization project, and this one for a detailed description of the algorithm proposal.

This page go through the various analysis we have been doing in defining K, the minimum number of distinct IPs a finger-printing group must have not to go through anonymization.

Dataset specification

  • One our of pageviews (${YEAR},${MONTH},${DAY},${HOUR})
  • filtered from
    • IPs having more than IP_UA_THRESHOLD different user agents
      (trying to remove mobile "many devices for a single IP" artifact)
    • IPs having more than IP_REQ_THRESHOLD views
      (trying to remove u "many devices for a single IP" artifact)
  • finger-printing Dimensions we took as defining a group are:
    • project (not to be anonymized)
    • language_variant (not to be anonymized)
    • access method (not to be anonymized)
    • agent type, (not to be anonymized)
    • zero carrier,
    • geocoding hierarchical data (continent, country, subdivision, city)
    • user_agent_map
  • Data to analyze
    • Distinct IPs (count and list)
    • Titles of page viewed
    • Number of requests
CREATE TABLE pv_san.idents_buckets STORED AS PARQUET AS
WITH wrong_ips AS  (
    SELECT
        client_ip AS w_ip,
        COUNT(DISTINCT user_agent) AS user_agent_count,
        COUNT(1) AS request_count
    FROM wmf.webrequest
    WHERE webrequest_source IN ('mobile', 'text')
        AND year=${YEAR} AND month=${MONTH} AND day=${DAY} AND hour=${HOUR}
        AND is_pageview
    GROUP BY client_ip
    HAVING (COUNT(DISTINCT user_agent) >= ${IP_UA_THRESHOLD}
            OR COUNT(1) >= ${IP_REQ_THRESHOLD} )
)
SELECT
    pageview_info['project'] as project,
    pageview_info['language_variant'] as language,
    agent_type,
    access_method,
    agent_type,
    x_analytics_map['zero'] AS zero_carrier,
    geocoded_data['continent'] AS continent,
    geocoded_data['country'] AS country,
    geocoded_data['subdivision'] AS subdivision,
    geocoded_data['city'] AS city,
    user_agent_map,
    COUNT(DISTINCT client_ip) AS ips_count,
    COLLECT_SET(client_ip) AS unique_ips,
    COUNT(DISTINCT pageview_info['page_title']) AS pages_count,
    COLLECT_SET(pageview_info['page_title']) AS unique_pages,
    COUNT(1) AS request_count
FROM wmf.webrequest webrequest
  LEFT OUTER JOIN wrong_ips ON (webrequest.client_ip = wrong_ips.w_ip)
WHERE wrong_ips.w_ip IS NULL and
    AND webrequest_source IN ('mobile', 'text')
    AND year=${YEAR} and month=${MONTH} and day=${DAY} and hour=${HOUR}
    AND geocoded_data['continent'] not in ('Unknown', '--', '-')
    AND geocoded_data['country'] not in ('Unknown', '--', '-')
    AND geocoded_data['subdivision'] not in ('Unknown', '--', '-')
    AND geocoded_data['city'] not in ('Unknown', '--', '-')
    AND is_pageview
    AND agent_type = 'user'
GROUP BY
    pageview_info['project'],
    pageview_info['language_variant'],
    access_method,
    agent_type,
    x_analytics_map['zero'],
    geocoded_data['continent'],
    geocoded_data['country'],
    geocoded_data['subdivision'],
    geocoded_data['city'],
    user_agent_map;

General shape of data

bucket_count request_count
2865382 21135520
SELECT
     COUNT(1) AS bucket_count,
     SUM(request_count) AS request_count
FROM pv_san.idents_buckets
LIMIT 100000;

Finger-printing groups, distinct ips, distinct pages and requests

We only dig into finger-printing groups that have 10 or less distinct IPs, or 10 or less distinct pages since those are where the anonymization problem exists.

Tables below show show two results:

  • Number of request per IP is stable at 2.4 and number of request per page is stable at 1.2, making number of pages per IP stable at 2.
  • Both for distinct ips and distinct pages, number of buckets and number of requests follow power laws (more or less steep).

Distinct IPs

ips_count bucket_count % %_cumul request_count % %_cumul request_per_ip %_difference_to_average
1 2127846 74.26% 74.26% 4617515 21.85% 21.85% 2.17 4.74%
2 315539 11.01% 85.27% 1416920 6.70% 28.55% 2.25 1.43%
3 125822 4.39% 89.66% 868266 4.11% 32.66% 2.30 -0.98%
4 68535 2.39% 92.06% 623021 2.95% 35.61% 2.27 0.23%
5 42663 1.49% 93.54% 488106 2.31% 37.92% 2.29 -0.45%
6 29511 1.03% 94.57% 402144 1.90% 39.82% 2.27 0.30%
7 21460 0.75% 95.32% 343318 1.62% 41.44% 2.29 -0.33%
8 16422 0.57% 95.90% 299057 1.41% 42.86% 2.28 0.07%
9 12803 0.45% 96.34% 267363 1.26% 44.12% 2.32 -1.86%
10 10465 0.37% 96.71% 245865 1.16% 45.29% 2.35 -3.14%
SELECT
     ips_count,
     COUNT(1) AS bucket_count,
     SUM(request_count) AS request_count
FROM pv_san.idents_buckets
WHERE ips_count <= 10
GROUP BY ips_count
ORDER BY ips_count
LIMIT 10;


Distinct pages

pages_count bucket_count % %_cumul request_count % %_cumul request_per_page %_difference_to_average
1 1551522 54.15% 54.15% 2065809 9.77% 9.77% 1.33 -7.42%
2 463215 16.17% 70.31% 1164015 5.51% 15.28% 1.26 -1.37%
3 223403 7.80% 78.11% 822750 3.89% 19.17% 1.23 0.96%
4 131879 4.60% 82.71% 650671 3.08% 22.25% 1.23 0.49%
5 87189 3.04% 85.75% 533933 2.53% 24.78% 1.22 1.19%
6 61462 2.14% 87.90% 451024 2.13% 26.91% 1.22 1.33%
7 45987 1.60% 89.50% 389742 1.84% 28.76% 1.21 2.32%
8 35842 1.25% 90.76% 353985 1.67% 30.43% 1.23 0.40%
9 28092 0.98% 91.74% 313424 1.48% 31.91% 1.24 -0.01%
10 23060 0.80% 92.54% 279851 1.32% 33.24% 1.21 2.09%
SELECT
     pages_count,
     COUNT(1) AS bucket_count,
     SUM(request_count) AS request_count
FROM pv_san.idents_buckets
WHERE pages_count <= 10
GROUP BY pages_count
ORDER BY pages_count
LIMIT 10;

Example of pages viewed together

By number of distinct IPs

2 distinct IPs

  • English_words_without_vowels, Davido
  • Main_Page, Jay_Chou
  • Flunitrazepam, Photograph, Main_Page
  • Derivative, Leibniz_integral_rule, Fundamental_theorem_of_calculus,
    Sir_John_Mandeville, Crocodile
  • Main_Page, Hamster_wheel
  • List_of_countries_by_GDP_(PPP), Luxembourg, List_of_countries_by_GDP_(nominal),
    List_of_countries_by_GDP_(nominal)_per_capita, List_of_countries_by_GDP_(PPP)_per_capita,
    The_Curious_Case_of_Benjamin_Button_(film), The_Social_Network
  • Substance_over_form, SsangYong_Rexton
  • Barkha_Dutt, Arnab_Goswami, Sagarika_Ghose, Rajdeep_Sardesai, Rosacea
  • Bigg_Boss_9, Debtor, Sovereign_debt_crisis, Great_Recession, Sovereign_default,
    European_sovereign-debt_crisis, Credit_worthiness, Greek_government-debt_crisis,
    Secondary_market, Loan_covenant, Credit_rating, Default_(finance), Main_Page,
    Tom_Enders, Airbus_Group, Financial_crisis_of_2007–08, Economic_bubble,
    Credit_rating_agency, * Intrinsic_value_(finance), Bailout
  • Trilateration, Ilya_Ilyin
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE ips_count = 2
    AND project = 'en.wikipedia'
LIMIT 10;

3 distinct IPs

  • Sharman_Joshi, Amendment_of_the_Constitution_of_India,
    List_of_amendments_of_the_Constitution_of_India,
    Ninety-third_Amendment_of_the_Constitution_of_India, Abu_Bakr_al-Baghdadi,
    Abu_Ahmed_al-Kuwaiti, Alfreda_Frances_Bikowsky, Ayman_al-Zawahiri, Main_Page,
    Kelly_Hu, Battlefield_Hardline
  • Pahlaj_Nihalani, Main_Page, Okariki_Okaru, Johnson_&_Johnson
  • GDP, List_of_countries_by_GDP_(PPP), Gross_domestic_product, GDP_per_capita,
    List_of_countries_by_GDP_(nominal)_per_capita, List_of_countries_by_GDP_(nominal),
    Purchasing_power_parity, Nauru, Turkey, NATO, Main_Page, BATE_Borisov, FC_BATE_Borisov
  • Main_Page, The_Revenant_(2015_film), Brandade, Bruschetta
  • Jon_Hamm, Minions_(film), Romulan, Vitruvian_Man
  • The_Late_Show_with_Stephen_Colbert, Smart_casual, Alice_in_Chains, Main_Page
  • Howard_Schultz, Starbucks, Influenza, Rhinorrhea, 1918_flu_pandemic, Peet's_Coffee_&_Tea
  • Triple_H, Triple_X_(professional_wrestling), Stephanie_McMahon, Nauru, Tuvalu,
    Funafuti, Palau, President_of_the_United_States, Rolex
  • Sargassum_fish, Kim_Richards, Kyle_Richards, Jamiroquai, Jay_Kay
  • Jamie_Vardy, Mad_Season_(band), Greg_Graffin, General_Dynamics_F-16_Fighting_Falcon_variants,
    Sukhoi_Su-24, Academy_Award_for_Best_Picture, The_Departed, Eurofighter_Typhoon
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE ips_count = 3
LIMIT 100;

By number of distinct pages (with more than only 1 IP)

2 distinct pages

  • English_words_without_vowels, Davido
  • The_Undertaker, Myanmar
  • Julia_Roberts, Kiefer_Sutherland
  • November_2015_Paris_attacks, Main_Page
  • Main_Page, Jay_Chou
  • Main_Page, Abu_Bakr_al-Baghdadi
  • Karen_Page, Deborah_Ann_Woll
  • Jaane_Kya_Hoga_Rama_Re, Mugdha_Godse
  • Patty_Brard, Main_Page
  • Main_Page, Hamster_wheel
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 2
    AND ips_count > 1
    AND project = 'en.wikipedia'
LIMIT 100;

3 distinct pages

  • Oprah_Winfrey, Main_Page, Gayle_King
  • Flunitrazepam, Photograph, Main_Page
  • List_of_Bakugan:_Gundalian_Invaders_episodes, List_of_Digimon_episodes_and_movies, Digimon
  • Ahsan_Khan_(actor), Khoya_Khoya_Chand, Khoya_Khoya_Chand_(TV_series)
  • Mass_Rapid_Transit_(Singapore), SBS_Transit, SMRT_Corporation
  • BMW, Main_Page, Program_Files
  • Ajmal_Kasab, Main_Page, Sonam_Kapoor
  • Russia, Main_Page, England
  • Box_Car_Racer_(album), Jerry_Finn, Spinal_disc_herniation
  • Denzel_Washington, Haley_Bennett, The_Equalizer_(film)
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 3
    AND ips_count > 1
    AND project = 'en.wikipedia'
LIMIT 100;

4 distinct pages

  • Pahlaj_Nihalani, Main_Page, Okariki_Okaru, Johnson_&_Johnson
  • Adnan_Sami, Salman_Khan_filmography, Salman_Khan, Main_Page
  • Main_Page, The_Revenant_(2015_film), Brandade, Bruschetta
  • Jon_Hamm, Minions_(film), Romulan, Vitruvian_Man
  • Great_Pyramid_of_Giza, Larry_Silverstein, Robin_Stewart, Deaths_in_2015
  • The_Late_Show_with_Stephen_Colbert, Smart_casual, Alice_in_Chains, Main_Page
  • Surf_music, Sukhoi_Su-24, Fairchild_Republic_A-10_Thunderbolt_II, Brian_Eno
  • Germany, Geography_of_Germany, Heavy_fighter, Buzzword_bingo
  • Natalie_Cole, CM_Punk, AJ_Lee, Blue_Bell_Creameries
  • Tortoiseshell_cat, Rath_House, Main_Page, Area_codes_718,_347,_and_929
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 4
    AND ips_count > 1
    AND project = 'en.wikipedia'
LIMIT 100;

5 distinct pages

  • Barkha_Dutt, Arnab_Goswami, Sagarika_Ghose, Rajdeep_Sardesai, Rosacea
  • Yakov_Dzhugashvili, Joseph_Stalin, Soviet_Union, Vladimir_Lenin, Dollar
  • Proclaimers, The_Proclaimers, Marcus_Garvey_Park, Henry_Zebrowski,
    Essex_County_Fire_and_Rescue_Service
  • Sargassum_fish, Kim_Richards, Kyle_Richards, Jamiroquai, Jay_Kay
  • Talk:Mindscape_(software_publisher), Stein, Potter_sequence, Vertically_transmitted_infection,
    Necrotizing_enterocolitis
  • Galen_G._Weston, Galen_Weston, George_Weston_Limited, Gene_Hackman, Utopia_(Australian_TV_series)
  • Ronda_Rousey, Dancing_with_the_Stars_(U.S._TV_series), Dancing_with_the_Stars,
    List_of_Dancing_with_the_Stars_(U.S.)_competitors, Années_de_pèlerinage
  • Gabriel_Byrne, Cool_World, List_of_Chicago_Fire_characters, Chic, Chicago_Fire_(TV_series)
  • Chesley_Sullenberger, ECW_Heat_Wave, John_Mulaney, Main_Page, Hey_Pretty
SELECT
    unique_pages
FROM pv_san.idents_buckets
WHERE pages_count = 5
    AND ips_count > 1
    AND project = 'en.wikipedia'
LIMIT 100;