Analytics/EventLogging/Oncall

EventLogging Routine Maintenance for the oncall

  • Check the Graphite counts for incoming events (raw and valid), on both the client side and the server side.
  • Check the database for gaps. A few different scripts exist for this; Milimetric's gist is one example.
  • Decide whether we need to deploy that week. Avoid Friday deployments.
  • Remember to log all actions to the SAL (!log <something> on the ops channel).
  • Report outages as part of Wikimedia's incident reports so there is a reference.
  • Follow up on any alarms that might be raised.
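
As a rough illustration of the database gap check, here is a minimal sketch. It is not Milimetric's gist. It assumes the event timestamps have already been exported from the database as epoch seconds, one per line; how that export is done is not shown here. The function name find_gaps and its threshold argument are hypothetical.

```shell
#!/usr/bin/env bash
# find_gaps (hypothetical helper): read epoch timestamps, one per line, on
# stdin and print every interval between consecutive events that is longer
# than the given threshold in seconds (default: 3600).
find_gaps() {
    local max_gap="${1:-3600}"

    # Sort numerically so the input does not need to be ordered already,
    # then compare each timestamp with the previous one.
    sort -n | awk -v max="$max_gap" '
        NR > 1 && $1 - prev > max {
            printf "gap: %d -> %d (%d s)\n", prev, $1, $1 - prev
        }
        { prev = $1 }
    '
}
```

For example, find_gaps 1800 < timestamps.txt would list every stretch longer than 30 minutes with no events, where timestamps.txt is a placeholder for the exported file.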

OnCall Setup

  • Ensure you can access Graphite (Wikitech credentials).
  • Get access to eventlog1001.eqiad.wmnet, with root access.
  • Ask Andrew (ottomata on IRC) to set up the system to send you alarms.
  • In case of DB errors, the person to contact is Jaime Crespo (jcrespo on IRC), usually found on the #wikimedia-operations channel.


How Tos

https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/How_to

How to blacklist a schema

https://gerrit.wikimedia.org/r/#/c/248045/

How to restart EL

I'm not sure how much I trust upstart to restart more than 10 processes all at once, so I prefer to restart with stop && sleep && start:

 sudo eventloggingctl stop && sleep 5 && sudo eventloggingctl start
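
If you restart often, the same sequence could be wrapped in a small shell function. This is only a sketch: the function name restart_eventlogging and the variables EL_CTL and EL_RESTART_PAUSE are assumptions introduced here for illustration and are not part of eventloggingctl. It uses only the stop and start subcommands shown above.

```shell
#!/usr/bin/env bash
# restart_eventlogging (hypothetical helper): stop, pause, then start.
# EL_CTL and EL_RESTART_PAUSE are illustrative overrides, not real settings.
restart_eventlogging() {
    local ctl="${EL_CTL:-sudo eventloggingctl}"
    local pause="${EL_RESTART_PAUSE:-5}"

    # $ctl is left unquoted on purpose so "sudo eventloggingctl" splits
    # into separate words.
    $ctl stop || { echo "stop failed, not starting" >&2; return 1; }

    # Give the processes time to exit before starting them again.
    sleep "$pause"
    $ctl start
}
```

Calling restart_eventlogging with no overrides runs the same three steps as the one-liner, and prints a message if the stop step fails.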