Incident documentation/QR201407


Quarterly Review of post-mortems - 2014-07

Questions we want to be able to answer

  • Have all of the issues that came out of the post-mortem been addressed? If not, why not?
  • Are we satisfied with the current state of that part of the infra? Are there further actions to take (upon further reflection)?
  • Anything else?

Agenda

  • Go through the post-mortems and their respective action items and make sure they have been followed up appropriately.
    • If you have details relevant to the post-mortem in Bugzilla, RT, etc., please link to them from the post-mortem.
  • Discuss if there is anything else that we learned from the situation and follow up to better inform future decisions.
  • Notes written up by all, collaboratively, so that others in the organization will learn from these as well.

Groupings

The post-mortems

20140318-EventLogging

Sean, Nuria

  • Status:    Ongoing - Clarify who is to respond to which EventLogging alerts, and by when.
    • Analytics now owns EL, and can respond to some alerts, but not all. Discussion between Ops and Analytics is still ongoing.
  • Status:    Done - RT #7081 - Move EventLogging database to m2.
  • Status:    Done - Figure out a way to allow joining EventLogging data against enwiki, as this seems to be critical for researchers (a sketch of such a join follows below).
    • Replication back to db1047 is included in the steps required to move the EventLogging database to m2.
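
As an illustration of what such a join can look like once both databases replicate to the same host, here is a minimal sketch; the schema, table, and column names are placeholders rather than the real EventLogging layout:

  # Join EventLogging data against enwiki on db1047, assuming both the
  # `log` and `enwiki` databases replicate to that host. Table and column
  # names below are illustrative, not the real schemas.
  import os
  import pymysql

  conn = pymysql.connect(host='db1047.eqiad.wmnet',
                         read_default_file=os.path.expanduser('~/.my.cnf'))
  with conn.cursor() as cur:
      cur.execute("""
          SELECT ev.timestamp, u.user_name
          FROM log.SomeSchema_12345 AS ev       -- hypothetical EventLogging table
          JOIN enwiki.user AS u                 -- enwiki table on the same host
            ON u.user_id = ev.event_userId      -- hypothetical capsule field
          LIMIT 10
      """)
      for row in cur.fetchall():
          print(row)
  conn.close()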


20140328-DB-Queries

Sean

  • Status:    In progress - PrivateSettings.php should be in a repo so we can be sure what's changed.
  • Status:    Done - DB user and password settings should go into PrivateSettings (and not be removed from AdminSettings until anyone relying on that file has converted their jobs).
  • Status:    Not done - Changes made should go out immediately as they do for all configuration files.
  • Status:    Not done - Better coordination


20140403-Deploy

Bryan, Reedy

  • Status:    Done - bug 63659 Fix ExtensionMessages-X.php generation
  • Status:    Done - Review Niklas' doc fixes
  • Status:    Not done - Revive, and come to a conclusion on, the issue of testwiki being served from mw1017.


20140503-Thumbnails

Antoine, Faidon

  • Status:    Done - Generate finer-grained metrics by adding the pingLimiter() action to the wfProfileIn() call.
  • Status:    Not done - A graph in gdash and a monitoring alarm could be added for whenever the rate changes significantly (see the sketch below).
  • Status:    Ongoing - It took us too long (3 days) to be informed about that outage; as soon as the proper folks were made aware of it, it was promptly resolved.
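
A minimal sketch of what such a rate-change alarm could look like, assuming the request rate is exposed through the Graphite render API; the metric name, endpoint, and thresholds are placeholders:

  # Alarm when the thumbnail request rate deviates sharply from its baseline,
  # using the Graphite render API. Metric name, endpoint and thresholds are
  # placeholders for illustration.
  import json
  import urllib.request

  GRAPHITE = 'https://graphite.wikimedia.org/render'
  METRIC = 'reqstats.thumbnails.rate'   # hypothetical metric name

  def mean_rate(time_range):
      url = '{}?target={}&from={}&format=json'.format(GRAPHITE, METRIC, time_range)
      datapoints = json.load(urllib.request.urlopen(url))[0]['datapoints']
      values = [v for v, _ts in datapoints if v is not None]
      return sum(values) / len(values) if values else 0.0

  recent = mean_rate('-15min')    # short window: current rate
  baseline = mean_rate('-24h')    # long window: baseline rate
  if baseline and abs(recent - baseline) / baseline > 0.5:    # >50% swing
      print('ALERT: rate changed from {:.1f} to {:.1f}'.format(baseline, recent))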


20140509-EventLogging

Nuria

  • Status:    Done - We have set up throughput alarms on EventLogging so the Analytics team is notified if events per second go beyond a certain threshold (see the sketch below).
  • Status:    Done - We have set a policy of notifying our users when this type of event happens.
  • Status:    Done - Repopulate data?
    • Analytics talked to the affected teams and it looks like there is no need to repopulate the lost data.
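
For illustration, a minimal sketch of the kind of threshold check described above, with Nagios/Icinga-style exit codes; the event source and thresholds are assumptions, not the actual alarm configuration:

  #!/usr/bin/env python
  # Icinga-style check: count EventLogging events per second over a short
  # window and alert above a threshold. The event source (stdin) and the
  # thresholds are assumptions for illustration.
  import sys
  import time

  WARN_EPS = 500     # hypothetical warning threshold (events/sec)
  CRIT_EPS = 1000    # hypothetical critical threshold (events/sec)
  WINDOW = 10        # seconds to sample

  count = 0
  deadline = time.time() + WINDOW
  for _line in sys.stdin:          # e.g. fed from the EventLogging stream
      count += 1
      if time.time() >= deadline:
          break
  eps = count / float(WINDOW)

  if eps >= CRIT_EPS:
      print('CRITICAL: {:.0f} events/sec'.format(eps)); sys.exit(2)
  if eps >= WARN_EPS:
      print('WARNING: {:.0f} events/sec'.format(eps)); sys.exit(1)
  print('OK: {:.0f} events/sec'.format(eps)); sys.exit(0)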


20140517-bits

Timo

  • Status:    Done - Tracking bug for the outage: bug 65424
  • Status:    Done - dissolve the bits servers into the appserver pool
  • Status:    Done - Monitor for anomalies/spikes in memcached read failures (task T69817, 231704)


20140526-m1

Sean

In the short term:

  • Status:    Done - MAX_USER_CONNECTIONS is 512 for "bloguser" on db1001.
  • Status:    Done - The affected MyISAM tables have been switched to Aria.

In the long term:

  • Status:    Ongoing - Phabricator will use proper InnoDB tables, plus Elasticsearch.
  • Status:    Ongoing - Tendril query sampling will steadily reach more hosts.


20140529-appservers

Ori

  • Status:    Done - wikimedia-task-appserver is no more, and the site is operational.
  • Status:    Ongoing - The postrm scripts of packages should be inspected prior to their removal from nodes that power critical services.


20140607-Elasticsearch

Nik

  • Status:    Ongoing - Don't ever deploy stuff on Friday afternoon.
  • Status:    Informational - Cirrus was using SSDs!
    • Why didn't we know? That's not clear. Obviously some communication breakdown.
    • Short term: we're going to continue using SSDs.
    • Long term: at steady state we don't look like we're doing much IO at once; when shards move around we obviously need more. We might be able to avoid needing expensive SSDs. OTOH, it might not be worth the time. Not sure.
  • Status:    Done - We're going to have to file a bug upstream to do something about nodes that are *broken* like this one. We've had this issue before: we built a node from puppet, didn't set the memory correctly, and it ran out of heap and started running really, really slowly. The timeouts we added then helped search limp along with the broken node, but it'd be nice to do something automatically to a node that is obviously sick.


20140608-Kafka

Andrew O


20140612-Math

Greg

  • Status:    Ongoing - Greg will be more diligent about actively reverting non-backwards-compatible changes before they cause problems.
  • Status:    Done - Update our extension development documentation as per Physikerwelt's (good) suggestion.
  • Status:    In progress - Get more WMF reviewers for the Math extension work (not only will it be reviewed more quickly, but we'll have more institutional knowledge for when things break, as software tends to do)


20140613-Videoscalers

Giuseppe

  • Status:    Not done - Add monitoring for individual job types on single machines.
  • Status:    Ongoing - Make job-loop aware of the status of launched jobs, or rethink and rewrite it completely.


20140618-Wikitech

Andrew O

  • Status:    Done - Add a puppet success check to icinga
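
A minimal sketch of what such a check could look like, reading the agent's last_run_summary.yaml and reporting icinga-style exit codes; the file path and staleness threshold are assumptions, not necessarily the deployed check:

  #!/usr/bin/env python
  # Icinga/NRPE-style check of the last puppet run, based on the agent's
  # last_run_summary.yaml. Path and staleness threshold are assumptions.
  import sys
  import time
  import yaml

  SUMMARY = '/var/lib/puppet/state/last_run_summary.yaml'
  MAX_AGE = 2 * 60 * 60   # treat runs older than two hours as stale

  try:
      with open(SUMMARY) as f:
          summary = yaml.safe_load(f)
  except IOError:
      print('CRITICAL: no puppet run summary found'); sys.exit(2)

  age = time.time() - summary['time']['last_run']
  failures = summary.get('events', {}).get('failure', 0)

  if failures:
      print('CRITICAL: last puppet run had {} failures'.format(failures)); sys.exit(2)
  if age > MAX_AGE:
      print('WARNING: last puppet run was {:.1f}h ago'.format(age / 3600)); sys.exit(1)
  print('OK: puppet ran successfully'); sys.exit(0)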


20140619-parsercache

Sean

  • Status:    Not done - Run puppet nice'd
  • Status:    Not done - A puppet run should not start if a box is under abnormal load.
  • Status:    Not done - Improve how MediaWiki handles a DB host that is flaky rather than completely down
  • Status:    Not done - Migrate parsercache away from being a full RDBMS.


20140622-es1006

Sean

  • Status:    Done - An additional S5 slave has been deployed.
  • Status:    Done - DB traffic sampling has been deployed to S5.
  • Status:    Done - Aude and Hoo deployed https://gerrit.wikimedia.org/r/#/c/141997/


20140622-imagescaler

Filippo

  • Status:    Done - Upgrade swift frontend bandwidth to 10G
  • Status:    Not done - Number of uploads should be a metric in graphite (see the sketch below).
  • Status:    Done - Increase the number of imagescaler workers once swift bandwidth can withstand it
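
A minimal sketch of how an upload counter could be fed into graphite via statsd; the statsd endpoint and metric name are placeholders:

  # Send an upload counter to Graphite through statsd over UDP; the statsd
  # endpoint and metric name are placeholders.
  import socket

  STATSD = ('statsd.eqiad.wmnet', 8125)   # hypothetical statsd host/port

  def count_upload():
      """Increment a counter; statsd aggregates it into a Graphite metric."""
      sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      sock.sendto(b'swift.uploads:1|c', STATSD)
      sock.close()

  count_upload()   # call once per completed upload, e.g. from the upload path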


20140625-CirrusSearch

Nik

  • Status:    Ongoing - Be more careful next time.
  • Status:    Not done - If sync-dir/sync-file/scap don't sync any files, then we need to log something about it because it's weird. Warn the operator that the sync they just performed was a no-op (see the sketch below).
  • Status:    Done - Add automatic cache warming to CirrusSearch to prevent load spikes when loading cold caches.
  • Status:    Done - Improve CirrusSearch error handling; it's very broken.
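
A minimal sketch of the idea (not the actual scap/sync-dir code): count what rsync actually transferred and warn the operator when the sync was a no-op; paths and options are illustrative:

  # Not the actual scap/sync-dir code: run rsync, count what it transferred,
  # and warn the operator if the sync was a no-op. Paths are illustrative.
  import subprocess
  import sys

  def sync_dir(src, dest):
      # --out-format prints one line per transferred file, so we can count them.
      result = subprocess.run(
          ['rsync', '-a', '--out-format=%n', src, dest],
          capture_output=True, text=True, check=True)
      synced = [line for line in result.stdout.splitlines() if line.strip()]
      if not synced:
          print('WARNING: sync finished but transferred no files (a no-op sync)',
                file=sys.stderr)
      return synced

  sync_dir('/srv/mediawiki-staging/', 'mw1017.eqiad.wmnet:/srv/mediawiki/')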