Analytics/Oncall
Ops Duty - Analytics
Being on the Analytics team we have to keep an eye on data flowing through our infrastructure. Data gets lost, found, jobs need to be rerun, and other fun stuff. We'll keep this page up to date with what we need to do. Currently we each take turns doing this for a week at a time. We keep helper scripts here: https://github.com/wikimedia/analytics-refinery/tree/master/bin
Event Logging
- Documented here: https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/Oncall
Webrequest Data and Jobs
- webrequest raw and refine status - see % loss (raw) or missing partition (refine)
- if missing less than 1%, too small, don't worry about it
- if missing more than 1% of data, check from which hostname(s) the data is lost (modify webrequest source and year, month, day, hour restriction clauses):
SELECT * FROM wmf_raw.webrequest_sequence_stats WHERE webrequest_source = 'SSSSSSS' AND year = YYYY AND month = MM AND day = DD AND hour = HH;
Pick one host where data is missing, and try to recompute statistics to check if data is still missing (modify webrequest_source, yeaar, month, day, hour, and hostname restriction clauses):
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core.jar; SELECT hostname, MIN(sequence) AS sequence_min, MAX(sequence) AS sequence_max, COUNT(*) AS count_actual, MAX(sequence) - MIN(sequence) + 1 AS count_expected, MAX(sequence) - MIN(sequence) + 1 - COUNT(hostname) AS count_different, COUNT(*) - COUNT(DISTINCT sequence) AS count_duplicate, SUM(if(sequence IS NULL,1,0)) AS count_null_sequence, ((COUNT(*) / (MAX(sequence) - MIN(sequence) + 1)) - 1) * 100.0 AS percent_different FROM wmf_raw.webrequest WHERE webrequest_source='SSSSSSSS' AND year=YYYY AND month=MM AND day=DD AND hour=HH AND hostname = 'HHHHHHHHHH' GROUP BY hostname, webrequest_source, year, month, day, hour;
If the percent loss is the same as the one in the statistics table, the loss is real, contact Ottomata or Joal to look more in details. If the percent loss is zero, it means the original statistics computation happened before the data were fully imported, and therefore backfill is needed: rerun the load job and every job depending on it for the given hour(s) and webrequest_source(s).
- coordinators in Hue
- get familiar with each type of job (weekly sessions, cassandra endpoints, etc.) and find jobs that are late and still running
- Oozie command line (https://oozie.apache.org/docs/3.1.3-incubating/DG_CommandLineTool.html )
oozie job -info | grep killed/running oozie job -info bundle/coordinator/workflow id oozie job -kill bundle/coordinator/workflow id (To kill jobs) oozie job -rerun COORD_ID -action ♯number oozie job -rerun COORD_ID -date DATE
- Yarn logs
sudo -u hdfs yarn logs -applicationId application_blah
- Killing yarn applications
sudo -u hdfs yarn application -kill applicationId
- Rerunning jobs starting with the first job that didn't work
- Look at graph of job dependencies: https://github.com/wikimedia/analytics-refinery/blob/master/diagrams/oozie-overview.dia
How to
Check hive settings
hive -e 'set -v' | grep hive.auto.convert.joi
Analytics Query Service
- TBD