Incident response

OMG THE SITE'S DOWN!!!! What to do?????

In overview, the tasks are to communicate, diagnose, fix, and follow up with a post-mortem.

The nature of the tasks is that they can all be interleaved. Speculative or temporary fixes can be applied before a full diagnosis is made. Analysis of root causes can be complicated and is often best left until after the site is back up. Operational communication (server admin log, IRC, etc.) starts immediately, and communication with the rest of the organisation and the public should start within the first 5 minutes.

Process

At the start of an incident, no definite assumptions can be made about who's available for what, as incidents can happen at any time, and can be triggered or responded to by anyone.

The first person arriving (on IRC) to investigate an incident should say so explicitly in the #wikimedia-operations channel, and is responsible for (basic) communication as well until others arrive to assist - typically very soon afterwards. These responsibilities can change later on.

Because the first few minutes of an outage are crucial and little information is available yet, hand off the facilitation of communication and/or the paging for additional assistance to someone else if anyone is present. As soon as multiple responders/team members are available, explicitly decide who will take this facilitative role before joining the investigation.

(This process will soon be assisted with additional tools.)

In the event of insufficient help/backup, call Mark, Faidon or Greg (phone numbers can be found on Office Wiki's Contact list).

Communicate

Peer communication and logging

It's absolutely essential that you communicate your actions to other sysadmins as you do them. Here are some reasons:

  • It avoids duplication of effort, conflicts over text file edits, etc.
  • It avoids confusing other sysadmins about the causes of the site changes that they observe. It is difficult enough to diagnose the cause of downtime. If a sysadmin changes something, and another sysadmin erroneously attributes the results, then that can significantly slow the diagnosis process.
  • Bus factor. If you say what you are doing, other sysadmins have a chance of continuing your work should you lose internet connectivity.
  • Sanity review. Responding to site downtime is a high-stress activity and is prone to errors. By writing about your actions and your thoughts, you give others the chance to review and comment on them.
  • It makes post-mortem analysis possible. If actions are unlogged, then reconstructing the order of events becomes very difficult. If you hinder post-mortem analysis, then you make it more likely that the same problem will happen again.

Manual actions/interventions should be logged to the Server Admin Log (https://wikitech.wikimedia.org/Server_Admin_Log) using the !log keyword in the #wikimedia-operations IRC channel:

!log Restarted Varnish backend instance on cp1065

General discussions / synchronisation should occur on IRC, in channels #wikimedia-operations or (if sensitive) #mediawiki_security.

Communicating with the rest of the organisation

Besides diagnosing and fixing the problem as soon as possible (which remains the highest priority), it is very important that for any outage impacting many users, a notification is sent to other parties within the organization within the first 5 minutes. At that point, the Communications team, the Community teams and management may start receiving (press/phone/e-mail/social media) inquiries that need to be answered. Although accurate and complete information can be scarce in the early stages of some types of outages, at the very minimum a notification of the ongoing outage should be sent out, along with a brief indication of scope & impact where known. Technical details are not yet important at this point, and can change (drastically) as more information becomes available. Focus on what is known, and on which users are impacted and how. Keep it brief & quick, allowing you to focus on the investigation & fix.

An update should be sent to outage-notification@wikimedia.org (using e-mail for now), within the first 5 minutes of the start of the investigation of the event. (We're working on additional technical tools to assist with this, and will update this page when they are available.) This is the responsibility of the initial responder, or someone else explicitly designated for this afterwards, if applicable. This notification will alert (and page) several people from different teams in the organization, allowing them to prepare and assist with handling the outage in its different aspects. Additional questions from these teams and management should be directed to Mark, Faidon, Greg or Ori at this point, allowing the engineers to focus on the resolution of the problem.
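
A minimal initial notification might look like the following (the wording, times and details are purely illustrative, not a prescribed template):

Subject: Ongoing outage: all wikis unreachable
Since approximately 14:05 UTC, all wikis are unreachable for most users.
The cause is not yet known and we are investigating. Next update within 15 minutes.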

When to notify

This process of notification should be done for severe outages that affect a significant number of people, and which need the Communications team, Community teams and management to be aware. Examples are outages that affect the majority of site users (e.g. all wikis down), the majority of contributors (e.g. editing broken), or a big security breach that needs to be dealt with immediately. Smaller incidents that impact a rather limited number of people or just a small subset of site functionality may not warrant paging other teams. There are no black & white rules for this; in the end it's a matter of judgement whether other teams in the organisation need to be aware and assist with followup. If in doubt, err on the side of yes.

After 15 minutes, if the outage is still ongoing, new update(s) should be sent with additional details, ideally including an ETA for a fix if available.

When service has been restored, a final update should be sent along with a brief description. More technical details will be provided later in the form of an incident report.

Paging

You have to be able to recognise when the problem is beyond your ability (or the ability of those people so far assembled) to fix alone.

Some issues require a lot of work to fix. For example, it takes a lot of work to recover from a power outage in a data center. In such a case, it makes sense to get everyone online from the outset.

Some issues require special expertise. For example, database crashes need Jaime to be online. Network failures need Faidon or Mark to be online.

Ask anyone online to assist with paging people, as that is typically the fastest way. If no one is available quickly, call Mark or Faidon or Greg to help with this.

If the site has been down for 15 minutes or more, it is time to stop working on the technical issues and get some perspective. If a small team can't get the site back up in that amount of time, it has failed, and it is time to wake more people up by calling them. Call, don't just text, so you know whether they have actually been reached. Primary phone numbers can be found on the Office wiki Contact List, or, if that is down, in the Icinga configuration for paging in Puppet.

Diagnosis

Diagnosis should always start by observing the symptoms.

  • Open Ganglia and the site itself in separate browser tabs.
  • Carefully read the reports from users, which typically come in on #wikimedia-tech. Ask for clarifications if they are unclear.

Ganglia is by far the most useful and important diagnosis tool. Interpreting it is complex but essential. Request rate statistics (e.g. reqstats) are useful to get a feel for the scale of the problem, and to confirm that the user reports are representative and not just confined to a few vocal users. Viewing the site itself is the least useful diagnosis tool, and can often be left out if the user reports are clear and trustworthy.

Shell-based tools such as MySQL "show processlist", strace, tcpdump, etc. are useful for providing more detail than Ganglia. However, they are potential time-wasters. Unfamiliarity with the ordinary output of these tools can lead to misdiagnosis. Complex chains of cause and effect can lead responders on a wild goose chase, especially when they are unfamiliar with the system.

Failure modes

Fast fail

Requests fail quickly. Backend resource utilisation drops by a large factor. Frontend request rate typically drops slightly, due to people going away when they see the error message, instead of following links and generating more requests. Frontend network traffic should drop significantly if the error messages are smaller than the average response size.

Example causes:

  • Someone pushes out a PHP source file with a syntax error in it
  • An essential TCP service fails with an immediate "connection refused"

Overload

This is the most common cause of downtime. Overload occurs when the demand for a resource outstrips the supply. The queue length increases at a rate given by the difference between the demand rate and the supply rate.
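
As an illustration (with made-up numbers): if clients submit 1,200 requests per second and the backend can complete only 1,000 per second, the queue grows by roughly 200 requests every second (about 12,000 per minute) until something limits its growth.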

The growth of the queue length in this situation is limited by two things:

  • Client disconnections. The client may give up waiting and voluntarily leave the queue.
  • Queue size limits. Once the queue reaches some size, something will happen that stops it from growing further. Ideally, this will be a rapidly-served error message. In the worst case, the limit is when the server runs out of memory and crashes.

As long as the server does not have some pathology at high queue sizes (such as swapping), it is normal for some percentage of requests to be properly served during an overload. However, if queue growth is limited by timeouts, the FIFO nature of a queue means that service times will be very long, approximately equal to the average timeout.

There are two kinds of overload causes:

Increase in demand
For example: news event, JavaScript code change, accidental DoS due to an individual running expensive requests, deliberate DoS.
Reduction in supply
For example: code change causing normal requests to become more expensive, hardware failure, daemon crash and restart, cache clear.

It can be difficult to distinguish between these two kinds of overload.

Note that for whatever reason, successful, deliberate DoS is extremely rare at Wikimedia. If you start with an assumption that the problem is due to stupidity, not malice, you're more likely to find a rapid and successful solution.

Common overload categories

Somewhere in the system, a resource has been exhausted. Problems will extend from the root cause, up through the stack to the user. Low utilisation will extend down through the stack to unrelated services.

For example, if MySQL is slow:

  • Looking up the stack, we will see overloads in MySQL and Apache, and error messages generated in Varnish.
  • Looking down the stack, the overload in Apache will cause a large drop in utilisation of unrelated services such as search.

Varnish connection count

For the Varnish/nginx cache pool, the client is the browser, and disconnections occur both due to humans pressing the "stop" button, and due to automated timeouts. It's rare for any queue size limit to be reached in Varnish, since queue slots are fairly cheap. Varnish's client-side timeouts tend to prevent the queue from becoming too large.

Apache process count

For the Apache pool, the client is Varnish. Varnish typically times out and disconnects after 60 seconds, then it begins serving HTTP 503 responses. However, when Varnish disconnects, the PHP process is not destroyed (ignore_user_abort=true). This helps to maintain database consistency, but the tradeoff is that the apache process pool can become very large, and often requires manual intervention to reset it back to a reasonable size.

An apache process pool overload can easily be detected by looking at the total process count in ganglia.
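
If Ganglia is unavailable, or you want to confirm the picture on an individual server, one quick (illustrative) check is to count the apache worker processes directly:

# Count apache worker processes on this host; a large or steadily growing
# number suggests the pool is filling up with long-running requests.
ps -C apache2 --no-headers | wc -l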

Regardless of the root cause, an apache process pool overload should be dealt with by regularly restarting the apache processes using /home/wikipedia/bin/apache-restart-all. In an overload situation, the bulk of the process pool is taken up with long-running requests, so restarting kills more long-running requests than short requests. Regular restarting of apache allows parts of the site which are still fast to continue working.

Regular restarting is somewhat detrimental to database consistency, but the effects of this are relatively minor compared to the site being completely down.
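
As a sketch only, "regularly restarting" can be as simple as re-running the script mentioned above on an interval for as long as the overload persists (the 10-minute interval here is an assumption, not a documented value):

# Illustrative only: repeat the cluster-wide apache restart every 10 minutes
# until the overload is resolved, then stop the loop manually (Ctrl-C).
while true; do
    /home/wikipedia/bin/apache-restart-all
    sleep 600
done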

There are two possible reasons for an apache process pool overload:

  • Some resource on the apache server itself has been exhausted, usually CPU.
  • Apache is acting as a client for some backend, and that backend is failing in a slow way.

Apache CPU

If CPU usage on most servers is above 90%, and CPU usage has plateaued (i.e. it has stopped bouncing up and down due to random variations in demand), then you can assume that the problem is an apache CPU overload. Otherwise, the problem is with one of the many remote services that MediaWiki depends on.

CPU profiling can be useful to identify the causes of CPU usage, in cases where the relevant profiling section terminates successfully, instead of ending with a timeout or other fatal error. Run /home/wikipedia/bin/clear-profile to reset the counters.

Note that recursive functions such as PPFrame_DOM::expand() are counted multiple times, roughly as many times as the average stack depth, so the numbers for those functions need to be interpreted with caution. Parser::parse() is typically non-recursive, and gives an upper limit for the CPU usage of recursive parser functions.

In cases of severe overload, or other cases where profiling is not useful, it is possible to identify the source of high CPU usage by randomly attaching to apache processes.

All our apache servers should have PHP debug symbols installed. Our custom PHP packages have stripping disabled. So just log in to a random apache, and run top. Pick the first process that seems to be using CPU, and run gdb -p PID to attach to it. Then run bt to get a backtrace. Here's the bottom of a typical backtrace:

#16 0x00007fa1230e85da in php_execute_script (primary_file=0x7fff8387eb10)
    at /tmp/buildd/php5-5.2.4/main/main.c:2003
#17 0x00007fa1231b19e4 in php_handler (r=0x13dd838)
    at /tmp/buildd/php5-5.2.4/sapi/apache2handler/sapi_apache2.c:650
#18 0x0000000000437d9a in ap_run_handler ()
#19 0x000000000043b1bc in ap_invoke_handler ()
#20 0x00000000004478ce in ap_process_request ()
#21 0x0000000000444cc8 in ?? ()
#22 0x000000000043eef2 in ap_run_process_connection ()
#23 0x000000000044b6c5 in ?? ()
#24 0x000000000044b975 in ?? ()
#25 0x000000000044c208 in ap_mpm_run ()
#26 0x0000000000425a44 in main ()

The "r" parameter to php_handler has the URL in it, which is extremely useful information. So switch to the relevant frame and print it out:

(gdb) frame 17
#17 0x00007fa1231b19e4 in php_handler (r=0x13dd838)
    at /tmp/buildd/php5-5.2.4/sapi/apache2handler/sapi_apache2.c:650
650	/tmp/buildd/php5-5.2.4/sapi/apache2handler/sapi_apache2.c: No such file or directory.
	in /tmp/buildd/php5-5.2.4/sapi/apache2handler/sapi_apache2.c
(gdb) print r->hostname
$2 = 0x13df198 "ko.wikipedia.org"
(gdb) print r->unparsed_uri
$3 = 0x13dee48 "/wiki/%ED%8A%B9%EC%88%98%EA%B8%B0%EB%8A%A5:%EA%B8%B0%EC%97%AC/%EB%A7%98%EB%A7%88"

At the other end of the stack, there is information about what is going on in PHP. See GDB with PHP for some information about using it.

An extension to this idea of profiling by randomly attaching to processes in gdb is Domas's Poor Man's Profiler.
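
The actual Poor Man's Profiler script is more elaborate, but the core idea can be sketched in a few lines of shell (the PID is hypothetical and the awk field handling is simplified):

# Grab a handful of backtraces from one apache process and count which
# functions appear most often; frequently seen frames are where the time goes.
PID=12345
for i in 1 2 3 4 5; do
    gdb -batch -ex bt -p "$PID" 2>/dev/null
    sleep 1
done | awk '/^#/ {print $4}' | sort | uniq -c | sort -rn | head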

Slow backend service

If the site is down because a service that MediaWiki depends on has become slow, there are a number of tools that can help to identify the service:

Ganglia
High load or resource utilisation at the root cause server may be obvious at a glance on ganglia.
Profiling
Run /home/wikipedia/bin/clear-profile, and then observe the highest users of real time in the profiling output. This will only show you requests which have completed within the php.ini timeout of 3 minutes, and without the apache process being killed by administrator intervention, so it's not so helpful for the most severe overloads.
Strace
This is useful for the most severe overloads. Log in to a random apache. Run ps -C apache2 -l and pick a process with a suspicious-looking WCHAN. Run lsof -p PID, then attach to it with strace -p PID. With luck (and perhaps some repetition), this will hopefully tell you what FD apache is waiting on. Using the lsof output, you can identify the corresponding remote service.
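
Put together, that sequence might look like this (the PID is made up):

ps -C apache2 -l      # pick a process with a suspicious-looking WCHAN
lsof -p 12345         # map its open file descriptors to files and sockets
strace -p 12345       # see which FD it is blocked on, e.g. read(42, ...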

MySQL overload

MySQL overload can often be detected from Ganglia, by looking for an increase in load, or for anomalies in network usage and CPU utilisation.

Slow queries on MySQL typically lead to exhaustion of disk I/O resources. Fast, numerous queries may lead to overload via high CPU and lock contention.

Slow queries can be identified by running SHOW PROCESSLIST. If slow queries are identified as the source of site downtime, the immediate response should be to kill them. To do this, a shell/awk one-liner typically suffices, such as:

mysql -h $server -e 'show processlist' | awk '$0 ~ /...CRITERIA.../ {print "kill", $1, ";"}' | mysql -h $server
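
For instance, to kill plain queries that have been running for more than 600 seconds (the threshold and the match on the "Query" command are illustrative, not policy):

# Batch output columns of 'show processlist': Id User Host db Command Time State Info
mysql -h $server -e 'show processlist' \
  | awk '$5 == "Query" && $6 > 600 {print "kill", $1, ";"}' \
  | mysql -h $server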

Once this is done and the site is back up, a secondary response can be considered, such as a temporary fix in MediaWiki by disabling the relevant module.

Monitoring of the number of running slow queries can be done with a related shell one liner:

mysql -h $server -e 'show processlist' | grep CRITERIA | wc -l

Important: disabling the source of slow queries in MediaWiki will typically not bring the site back up, if a large number of slow queries are queued in MySQL. That's one of the reasons why it's so important to kill first and patch second. Patching stops new queries from starting, it doesn't stop old queries from running.

Fixing

Get your priorities straight. While the site is down, your priority is to get it back up. Do not let curiosity or a desire for a complete and elegant solution distract you from doing this as quickly as possible.

Analysis of root causes can be done after the site is back up, based on logs. If you can't do it using the logs after the fact, then the logs aren't good enough and you should improve them for next time.

Post mortem

It's often overlooked that our server admin log is on a wiki. A nice way to start a postmortem is to add server admin log entries that were omitted at the time. Once you've reconstructed the order of events, with precise times attached, you can start looking at logs.

It's often worth testing your theories about the root causes of downtime rather than settling for the first plausible explanation. If your theory about the root cause is incorrect, the real root cause is still out there, waiting to cause more downtime. So there is a strong incentive to be rigorous.
