Ops Duty - Analytics

Being on the Analytics team we have to keep an eye on data flowing through our infrastructure. Data gets lost, found, jobs need to be rerun, and other fun stuff. We'll keep this page up to date with what we need to do. Currently we each take turns doing this for a week at a time. We keep helper scripts here: https://github.com/wikimedia/analytics-refinery/tree/master/bin

Event Logging

Documented here: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall

Webrequest Data and Jobs

webrequest raw and refine status - see % loss (raw) or missing partition (refine)
- if missing less than 1%, too small, don't worry about it
- if missing more than 1% of data, check from which hostname(s) the data is lost (modify webrequest source and year, month, day, hour restriction clauses):

SELECT *
FROM wmf_raw.webrequest_sequence_stats
WHERE
    webrequest_source = 'SSSSSSS'
    AND year = YYYY AND month = MM AND day = DD AND hour = HH;

Pick one host where data is missing, and try to recompute statistics to check if data is still missing (modify webrequest_source, yeaar, month, day, hour, and hostname restriction clauses):

ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar;
SELECT
    hostname,
    MIN(sequence)                                                  AS sequence_min,
    MAX(sequence)                                                  AS sequence_max,
    COUNT(*)                                                       AS count_actual,
    MAX(sequence) - MIN(sequence) + 1                              AS count_expected,
    MAX(sequence) - MIN(sequence) + 1 - COUNT(hostname)            AS count_different,
    COUNT(*) - COUNT(DISTINCT sequence)                            AS count_duplicate,
    SUM(if(sequence IS NULL,1,0))                                  AS count_null_sequence,
    ((COUNT(*) / (MAX(sequence) - MIN(sequence) + 1)) - 1) * 100.0 AS percent_different
  FROM
    wmf_raw.webrequest
  WHERE
    webrequest_source='SSSSSSSS'
    AND year=YYYY AND month=MM AND day=DD AND hour=HH
    AND hostname = 'HHHHHHHHHH'
  GROUP BY
    hostname, webrequest_source, year, month, day, hour;

If the percent loss is the same as the one in the statistics table, the loss is real, contact Ottomata or Joal to look more in details. If the percent loss is zero, it means the original statistics computation happened before the data were fully imported, and therefore backfill is needed: rerun the load job and every job depending on it for the given hour(s) and webrequest_source(s).

coordinators in Hue
- get familiar with each type of job (weekly sessions, cassandra endpoints, etc.) and find jobs that are late and still running
Oozie command line (https://oozie.apache.org/docs/3.1.3-incubating/DG_CommandLineTool.html )

oozie job -info | grep killed/running
oozie job -info bundle/coordinator/workflow id
oozie job -kill bundle/coordinator/workflow id (To kill jobs)
oozie job -rerun COORD_ID -action ♯number
oozie job -rerun COORD_ID -date DATE

Yarn logs

sudo -u hdfs yarn logs -applicationId application_blah

Killing yarn applications

sudo -u hdfs yarn application -kill applicationId

Rerunning jobs starting with the first job that didn't work
- Look at graph of job dependencies: https://github.com/wikimedia/analytics-refinery/blob/master/diagrams/oozie-overview.dia

How to

Check hive settings

hive -e 'set -v' | grep hive.auto.convert.joi

Analytics Query Service

TBD