Incident documentation/QR201407/group2
Quarterly Review of post-mortems - 2014-07
Questions we want to be able to answer
- Have all of the issues that came out of the post-mortem been addressed? If not, why not?
- Are we satisfied with the current state of that part of the infra? Are there further actions to take (upon further reflection)?
- Anything else?
Agenda
- Go through the post-mortems and their respective action items and make sure they have been followed up appropriately.
- If you have details that are relevant to the post-mortem in BZ/etc, please link from the post-mortem.
- Discuss if there is anything else that we learned from the situation and follow up to better inform future decisions.
- Notes written up by all, collaboratively, so that others in the organization will learn from these as well.
Notes
The post-mortems
20140328-DB-Queries
Bryan, Reedy
- Status: in-progress - PrivateSettings.php should be in a repo so we can be sure what's changed.
- Status: Done - DB user and password settings should go into PrivateSettings (and not be removed from AdminSettings until everyone relying on that file has converted their jobs).
- Status: Not done - Changes made should go out immediately as they do for all configuration files.
- Status: Not done - Better coordination
20140403-Deploy
Bryan, Reedy
- Status: Done - bug 63659 Fix ExtensionMessages-X.php generation
- Status: Done - Review Niklas' doc fixes
- Status: Not done - Revive the discussion of the "testwiki served from mw1017" issue and come to a conclusion.
20140503-Thumbnails
Antoine, Faidon
- Status: Done - we might want to generate finer metrics by adding the pingLimiter() action to the wfProfileIn() call.
- bugzilla:65477 User::pingLimiter should have per action profiling
- Status: Not done - A graph in gdash, and a monitoring alarm that fires whenever the rate changes significantly, could be added.
- bugzilla:65478 - Graph User::pingLimiter() actions in gdash
- Status: ongoing - It took us too long (3 days) to be informed about the outage, though once the right people were made aware of it, it was promptly resolved.
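The alarm suggested above for User::pingLimiter() could start from a simple comparison against a recent baseline. A minimal sketch in Python, assuming per-minute hit counts per action are already available from the profiling data; the function name, the ratio and the minimum baseline are hypothetical and are not taken from gdash or the existing monitoring setup:

```python
from statistics import mean


def rate_changed(history, current, ratio=3.0, min_baseline=1.0):
    """Return True when `current` differs from the recent average by `ratio`x.

    history: recent per-minute counts for one action (hypothetical input)
    current: the latest per-minute count
    """
    if not history:
        # No baseline yet, so there is nothing to compare against.
        return False
    average = mean(history)
    # Use a floor so that a near-zero baseline does not alarm on tiny counts.
    baseline = max(average, min_baseline)
    spike = current >= baseline * ratio
    # Only report a drop when there was meaningful traffic to begin with.
    drop = average >= min_baseline and current <= average / ratio
    return spike or drop
```

Comparing against a ratio rather than an absolute threshold keeps the same rule usable for both busy and quiet actions, at the cost of needing some history before it can fire.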
20140517-bits
Timo
- Status: Done - Tracking bug for the outage: bug 65424
- Status: Done - dissolve the bits servers into the appserver pool
- Status: Done - Monitor for anomalies/spikes in read failures of memcached task T69817 231704
20140529-appservers
Ori
- Status: Done - wikimedia-task-appserver is no more, and site is operational.
- Status: ongoing - The postrm script of packages should be inspected prior to their removal from nodes that power critical services.
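One way to make the postrm inspection routine is to scan the script for commands that commonly have side effects before the package is removed. A rough sketch, assuming the maintainer scripts are available as text (on a standard Debian or Ubuntu host they normally live under /var/lib/dpkg/info/); the pattern list and function name are hypothetical illustrations, not a vetted policy:

```python
import re

# Illustrative patterns only; a real review would still read the whole script.
RISKY_PATTERNS = [
    r"\brm\s+-[a-zA-Z]*r",                       # recursive deletes
    r"\b(deluser|userdel|delgroup)\b",           # account removal
    r"\b(service|invoke-rc\.d)\s+\S+\s+stop\b",  # stopping a service
    r"\bupdate-rc\.d\s+\S+\s+remove\b",          # removing init links
]


def flag_risky_lines(script_text):
    """Return (line_number, line) pairs that match any risky pattern."""
    flagged = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and shell comments (including the shebang).
        if not stripped or stripped.startswith("#"):
            continue
        if any(re.search(pattern, stripped) for pattern in RISKY_PATTERNS):
            flagged.append((number, stripped))
    return flagged
```

The function takes the script contents rather than a package name, so it can be run against a script extracted from a .deb before the package is ever installed on a production node.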
20140608-Kafka
Andrew O
- Status: Done - Decommission 2 or 3 Hadoop DataNodes and provision as Kafka Brokers.
- Status: Done - Create partman recipe for new Kafka brokers. Too complicated for partman :/
- Status: Done - Replace sdf on analytics1021:
- Status: Done - failover and load tests of Kafka Brokers with all varnish log traffic.
- Status: Done - fix or replace this alert: https://gerrit.wikimedia.org/r/#/c/138302/
- Faidon disabled this as it was flapping too much during the downtime.
- Status: Done - fix (?) pages for Kafka services.
- Status: Done - See if we can tune a single broker to handle more traffic.
- I.e. why is one broker not able to handle all traffic? Just curious!
20140612-Math
Greg
- Status: ongoing - Greg to be more diligent about actively reverting non-backwards-compatible changes before they cause problems.
- Status: Done - Update our extension development documentation as per Physikerwelt's (good) suggestion.
- bugzilla:66603 -- done by Physikerwelt
- Status: in-progress - Get more WMF reviewers for the Math extension work (not only will it be reviewed more quickly, but we'll have more institutional knowledge for when things break, as software tends to do)
- See also: RT 6077
- See also: wikitech-l thread
20140618-Wikitech
Andrew B, Marc
- Status: Done - Add a puppet success check to icinga
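A puppet success check for icinga generally comes down to mapping the state of the last agent run onto the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch, assuming the last-run timestamp and the number of failed resources have already been read from the agent's run summary; the function name and the age thresholds are hypothetical and do not describe the check that was actually deployed:

```python
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def puppet_check(last_run_epoch, failed_resources, now=None,
                 warn_age=3600, crit_age=3 * 3600):
    """Return (exit_code, message) for the most recent puppet run.

    last_run_epoch: Unix time of the last run, or None if unknown
    failed_resources: number of resources that failed in that run
    """
    now = time.time() if now is None else now
    if last_run_epoch is None:
        return UNKNOWN, "UNKNOWN: no record of a puppet run"
    age = int(now - last_run_epoch)
    if failed_resources > 0:
        return CRITICAL, "CRITICAL: %d failed resources" % failed_resources
    if age >= crit_age:
        return CRITICAL, "CRITICAL: last run %d seconds ago" % age
    if age >= warn_age:
        return WARNING, "WARNING: last run %d seconds ago" % age
    return OK, "OK: last run %d seconds ago" % age
```

Checking the age of the last run as well as its failure count catches the case where the agent has silently stopped running, which a failure count alone would report as healthy.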