Incident documentation/20150206-EventLogging



EventLogging code was sporadically dropping events from 2015-02-06 to 2015-02-10.


2015-02-05 08:15 PST
EL code was deployed from mainline (the deployment was not logged in SAL) to fix issues with incident on


Code seems to be working normally.


Alarms regarding throughput get triggered.

Developers compare events in the db against validated events in the log and find discrepancies. The two counts should agree, not 100% but about 99%. (Some valid events do not get inserted due to encoding issues and other errors.)
mysql --defaults-extra-file=/etc/mysql/conf.d/research-client.cnf --host dbstore1002.eqiad.wmnet -e "select left(timestamp,8) ts, COUNT(*) from log.ServerSideAccountCreation_5487345 where left(timestamp,8) >= '20150128' group by ts order by ts;"
ts      COUNT(*)
20150128        18237
20150129        17546 
20150130        16556
20150131        15814
20150201        17079
20150202        17387
20150203        17888
20150204        11496
20150205        6640
20150206        11159
20150207        10307
20150208        10095
20150209        10375
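
The drop in the daily counts above can also be spotted programmatically. The following is a minimal sketch, not the check the team actually ran: the function name flag_low_days, the 7-day baseline window and the 80% threshold are all assumptions chosen for illustration. The counts are copied from the query output above.

```python
from statistics import median

# Daily row counts copied from the query output above.
DAILY_COUNTS = {
    "20150128": 18237,
    "20150129": 17546,
    "20150130": 16556,
    "20150131": 15814,
    "20150201": 17079,
    "20150202": 17387,
    "20150203": 17888,
    "20150204": 11496,
    "20150205": 6640,
    "20150206": 11159,
    "20150207": 10307,
    "20150208": 10095,
    "20150209": 10375,
}


def flag_low_days(counts, baseline_days=7, min_fraction=0.8):
    """Return the days whose count is below min_fraction of the baseline.

    The baseline is the median of the first `baseline_days` days.
    Both parameters are illustrative assumptions, not tuned values.
    """
    days = sorted(counts)  # YYYYMMDD strings sort chronologically
    baseline = median(counts[d] for d in days[:baseline_days])
    threshold = baseline * min_fraction
    return [d for d in days[baseline_days:] if counts[d] < threshold]


if __name__ == "__main__":
    for day in flag_low_days(DAILY_COUNTS):
        print(day, DAILY_COUNTS[day])
```

The median is used for the baseline so that a single unusual day in the baseline window does not shift the threshold much.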


  • We would benefit from looking at alarms right away rather than waiting several days
  • More precise alarms that indicate what is actually going wrong would help
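
One possible shape for a more precise alarm is a direct comparison of validated events in the log against rows inserted in the db, using the roughly 99% agreement mentioned earlier as the threshold. This is a sketch only: check_insertion and its parameters are hypothetical names, and how the two counts are obtained is left out.

```python
def check_insertion(validated_in_log, rows_in_db, min_ratio=0.99):
    """Compare validated log events with db rows for the same period.

    min_ratio=0.99 reflects the "about 99%" agreement expected in this
    report; the function itself is a hypothetical illustration.
    """
    if validated_in_log <= 0:
        raise ValueError("validated_in_log must be positive")
    ratio = rows_in_db / validated_in_log
    if ratio >= min_ratio:
        return f"OK: {ratio:.1%} of validated events are in the db"
    missing = validated_in_log - rows_in_db
    return (
        f"ALARM: only {ratio:.1%} of validated events are in the db "
        f"({missing} missing)"
    )
```

A message that states the insertion ratio and the number of missing rows says more about what is going on than a bare throughput alarm does.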
