Analytics/EventLogging/Oncall
EventLogging Routine Maintenance for the oncall
- Check the Graphite counts for incoming events (raw and valid), both client side and server side.
- Check the database for any gaps. A few different scripts exist for this, for example Milimetric's gist.
- Decide whether we need to deploy that week; avoid Friday deployments.
- Remember to log all actions to the SAL (!log <something> on the ops channel).
- Report outages as part of Wikimedia's incident reports so there is a reference.
- Follow up on any alarms that might be raised.
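For the database gap check, here is a minimal sketch of the idea. It is not one of the existing scripts. It assumes you have already exported the event timestamps as sorted epoch seconds, one per line, using whatever query you prefer. The function name find_gaps and the threshold are hypothetical choices for illustration.

```shell
# find_gaps THRESHOLD_SECONDS
# Hypothetical helper: reads sorted epoch timestamps (one per line) on stdin
# and prints every pair of consecutive timestamps further apart than the threshold.
find_gaps() {
  awk -v max="$1" '
    # From the second line on, compare against the previous timestamp.
    NR > 1 && ($1 - prev) > max {
      printf "gap: %s -> %s (%d s)\n", prev, $1, $1 - prev
    }
    { prev = $1 }
  '
}

# Example (timestamps.txt is an assumed export, not a real file on the host):
#   find_gaps 300 < timestamps.txt
```

Run against the timestamps 100, 160, 220, 1000, 1060 with a threshold of 300, this reports a single gap between 220 and 1000. Pick the threshold to suit the expected event rate of the schema you are checking.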
OnCall Setup
- Ensure you can access Graphite (wikitech credentials).
- Get root access to eventlog1001.eqiad.wmnet.
- Ask Andrew (ottomata on IRC) to set up the system to send you alarms.
- In case of DB errors, the person to contact is Jaime Crespo (jcrespo on IRC), usually on the #wikimedia-operations channel.
How Tos
https://wikitech.wikimedia.org/wiki/Analytics/EventLogging/How_to
How to blacklist a schema
https://gerrit.wikimedia.org/r/#/c/248045/
How to restart EL
I'm not sure how far I trust upstart to restart more than 10 processes all at once, so I prefer an explicit stop, sleep, start sequence to restart:
sudo eventloggingctl stop && sleep 5 && sudo eventloggingctl start
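If you restart often, the same sequence can be wrapped in a small shell function. This is only a sketch: the function name restart_el and the EL_CTL override are hypothetical, added so the sequence can be dry-run without touching the services. The default command is the one shown above.

```shell
# restart_el [DELAY_SECONDS]
# Hypothetical wrapper: stop EventLogging, pause, then start it again.
# EL_CTL is an assumed override for dry runs; it defaults to the real command.
restart_el() {
  delay="${1:-5}"
  ctl="${EL_CTL:-sudo eventloggingctl}"
  # $ctl is left unquoted on purpose so "sudo eventloggingctl" splits into two words.
  $ctl stop && sleep "$delay" && $ctl start
}

# Dry run that only prints the two subcommands:
#   EL_CTL=echo restart_el 0
```

As with the one-liner, start only runs if stop succeeded, so a failed stop leaves the system untouched for you to investigate.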