Quarterly Review of post-mortems - 2014-03-19

Questions we want to be able to answer

Have all of the issues that came out of the post-mortem been addressed? If not, why not?
Are we satisfied with the current state of that part of the infra? Are there further actions to take (upon further reflection)?
anything else?

Agenda

Go through the post-mortems and their respective action items and make sure they have been followed up appropriately.
- If you have details that are relevant to the post-mortem in BZ/etc, please link from the post-mortem.
Discuss if there is anything else that we learned from the situation and follow up to better inform future decisions.
Notes written up by all, collaboratively, so that others in the organization will learn from these as well.

The post mortems

site outage ~ 2014-01-11 22:10 UTC

TODO: Follow up with Sean and Tim about this. (Greg) - Status: Not done
- greg pinged sean 20140320

20140113-Poolcounter

Status: Done - Analyze CategoryTree problem and implement workaround
Status: Not done - Fix monitoring of poolcounter service
- https://rt.wikimedia.org/Ticket/Display.html?id=7108
Status: Not done - Improve poolcounter extension error messages. Some context would be helpful, like poolcounter server contacted, pool context, URL. And perhaps error messages even if only in english (as opposed to what's displayed to the user)
- https://bugzilla.wikimedia.org/show_bug.cgi?id=63027
Status: Done - Investigate page_restrictions query slowness on db1006
- SELECT /* Title::loadRestrictions */ pr_type,pr_expiry,pr_level,pr_cascade FROM `page_restrictions` WHERE pr_page = '2720924'; domas says needs FORCE INDEX
- https://rt.wikimedia.org/Ticket/Display.html?id=7126 (closed/rejected)
- Forcing an index seems like the wrong approach here, or perhaps there was some miscommunication somewhere. The example query is properly indexed and fast on any S6 slave today including db1006, so it's likely something else was affecting these queries during the outage, or there was a storm of them, or something else has been fixed in the meantime. Springle (talk) 06:03, 15 April 2014 (UTC)[reply]

20140203-LVS

Most of the issues addressed (as of 2014-03-19), the rest of the things like are good to haves, not critical.

Status: Done - We should explicitly monitor some critical sysctl active values on systems.
- "add Icinga checks for critically important sysctl params" - Ori
Status: Not done - LVS testing needs to include internal services testing, and simple TCP port connects may not tell the whole story.
- RT 6812
Status: Done - Check remaining uses of sysctl::parameters and their priorities (Andrew Bogott has committed to handling this).
- "Restore sysctl priorities." - Andrew Bogott
Status: Done - We need to reinvestigate the performance impact of ntpd on present day LVS (which was found detrimental on old kernels years ago), or find a solution for maintaining the clocks on these systems if it’s still a problem.
- RT 6813

20131205-Swift

Upstream: Swift daemons die when syslog stops running LP:1094230
- Abandoned (because of inactivity) change to fix: https://review.openstack.org/#/c/24871
Status: Not done - Figure out something since the upstream issue probably won't be resolved:
- ALT1: use udp for syslog messages from swift?
- ALT2: upstart hook to restart swift when syslog is restarted?
Status: Done - ping swift people (Faidon) - (unnecessary)
Status: Done - syslog is not autoupgraded, so that shouldn't happen again

20140206-Math

Status: Done - wrap Math stuff in PoolCounter so it doesn't kill apaches so easily. More review on recent changes to Math. Be careful in rolling this release out further.
- PoolCounter: https://gerrit.wikimedia.org/r/#/c/111916/
Status: Not done - Let's get better at reviewing the Math extension
- need client side knowledge, and caching
- Brion Vibber?
- Greg to ping Brion - (done)
Status: Not done - implement true code deployment pipleine (so that all code spends a comparable amount of time in testing/beta cluster before hitting production)
- backlog entry for RelEng team - Long term

20140211-Parsoid

Status: Done - Fix log rotation, run it hourly instead of daily
Status: Done - Remove old init scripts and update documentation on the log file path
Status: Done - Lower the warning threshold on parsoid node disk space to provide time to react
- RT 6851
Status: in-progress - Finish migration to async logging backend in Parsoid so that a full disk does not affect the service availability
- partly done, framework merged mid-March
Status: Not done - Check the logging volume in Parsoid unit tests, less critical once logging is async

20140228-Cirrus

What we're doing to prevent it from happening again:

Status: Done - We're going to monitor the slow query log and have icinga start complaining if it grows very quickly. We normally get a couple of slow queries per day so this shouldn't be too noisy. We're going to also have to monitor error counts, especially once we get more timeouts.
- https://bugzilla.wikimedia.org/show_bug.cgi?id=62077
Status: Done - We're going to sprinkle more timeouts all over the place. Certainly in Cirrus while waiting on Elasticsearch and figure out how to tell Elasticsearch what the shard timeouts should be as well.
- https://bugzilla.wikimedia.org/show_bug.cgi?id=62079
Status: Done - We're going to figure out why we only got half the settings. This is complicated because we can't let puppet restart Elasticsearch because Elasticsearch restarts must be done one node at a time.
- Was puppet error coding error. Fixed.

20140313-API-Parsoid

Status: informational - Search is broken, with latency quadrupling to crazy numbers on a daily basis. We kinda knew that :( I'll leave the decision of what to do (fix or wait for ElasticSearch) to Nik.

Status: Not done - Investigate throttling of the API, address DOS vectors
- https://bugzilla.wikimedia.org/show_bug.cgi?id=62615 (private bug)

Status: Not done - The Pybal DNS bug needs to be fixed. Until then remove servers from /h/w/conf/pybal before renumbering them or decommission them. It's probably a good idea anyway.
- https://rt.wikimedia.org/Ticket/Display.html?id=7127

Status: Not done - Make Pybal better about error detection/logging (it hides/makes opaque some backend errors)
- https://rt.wikimedia.org/Ticket/Display.html?id=7152

It's unfortunate that we noticed such an issue hours later via a user report. We should have an alert for unusual/high API latency (among others). The data is there, in Graphite, but we need a check_graphite to poll it. Matanya started that but it needs more work.
- Status: On hold - https://gerrit.wikimedia.org/r/#/c/118435/
- Status: Not done - https://bugzilla.wikimedia.org/show_bug.cgi?id=57882
- Status: Not done - Talk with Brad on his thoughts on monitoring the API (error rates etc) - NEEDS BUG

Similarly, we probably need to monitor reports for failing/retried requests & alert when they happen. The current reqstats/reqerror graphs report errors from frontends which in this case were showing no errors as they were retrying and succeeding. We really need to overhaul the whole metric collection & alerting there.
- Status: Not done - T83580
- Status: Not done - Talk with Analytics team, they probably have related analytics
- Status: informational - Good candidate for a monitoring Sprint at the Hackathon

Status: Done - Reconcile the Parsoid/Varnish connect/TTFB timeout differences.
- Gabriel votes that Varnish's 5s is more sensible than Parsoid's 15s.
- [1] and [2]

20140313-Deploy

Status: Not done - documentation update needed for when moving boxes?
- Greg to talk to RobH and Chris FILE TICKET
- Suggestion: Have mark/faidon work jointly on future moves?
- Suggestion: Investigate sharing the same source file? Or having some minimal automatic checking?
Status: Not done - investigate Potential scap bug with the change of mw versions
- Why didn't puppet pull in the latest versions of deployed mw?
- see also: the eventual consistency requirement for deployment tooling
- bugzilla:66050
Status: Done - make scap report rsync errors
- Bug 62862
Status: Done - put rsync proxy config in puppet
- https://gerrit.wikimedia.org/r/#/c/119677/
Status: Done - Add mw1161 and mw1201 as scap proxies for EQIAD row C and D
- https://gerrit.wikimedia.org/r/#/c/119686/ and https://rt.wikimedia.org/Ticket/Display.html?id=7080