23:45 brion: apt is borked on mayflower due to smartmontools refusing to load
20:46 tomasz: cp'd phpmyadmin from dev.civicrm to prod civicrm for david
20:06 brion: replaced 'wikimedia.org' with 'meta.wikimedia.org' in the local VHosts list in wgConf.php. The general 'wikimedia.org' was causing CodeReview's diff loads (via codereview-proxy.wikimedia.org) to fail, as they were hitting localhost instead of the proxy. Do we need to add more vhosts to this list, or redo how it works?
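For context, a rough sketch of the shape of that list (assuming the usual $wgConf layout; entries other than the changed one are made up):
    <?php
    // wgConf.php (sketch): hostnames listed here are treated as local,
    // so matching requests are answered from localhost instead of going
    // out over HTTP. The bare 'wikimedia.org' also matched
    // codereview-proxy.wikimedia.org, which is why the diff loads never
    // reached the proxy.
    $wgConf->localVHosts = array(
        'meta.wikimedia.org',      // was: 'wikimedia.org' (too broad)
        'commons.wikimedia.org',   // made-up example entry
        // ...
    );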
19:45 brion: test-deploying CodeReview on mediawiki.org
19:17 RobH: updated DNS to add something for Brion.
18:21 mark: cache cleaning complete
15:05 Tim: doing some manual purges of URLs requested on #wikimedia-tech
15:00 mark: Cleaning caches of all backend text squids one by one, starting with pmtpa
14:20 mark: pooled all squids manually to fix the issues.
14:10 RobH: Site back up, slow as squids play catchup.
14:06 RobH: Pushed out old redirects.conf and restarted apaches.
14:01 RobH: Site is down, go me =[
14:00 RobH: updated redirects.conf and pushed change for orphaned domains.
13:38 RobH: updated dns for more orphaned domains.
13:11 Tim: cluster13 and cluster14 both have only one server left in rotation. Shut down apache on srv129 and srv139, fearing that the extra load might hasten their doom.
10:12 Tim: Switched ES clusters 3-10 to use the Ubuntu servers (again)
08:05 Tim: removed the ORDER BY clauses from the ApiQueryCategoryMembers queries to work around a MySQL bug, probably involving truncated indexes
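Illustratively (this is not the actual diff), the change has this shape in the options passed to Database::select():
    <?php
    // ApiQueryCategoryMembers query options (sketch): drop the explicit
    // sort and rely on index order, sidestepping whatever query plan the
    // (probably truncated-index-related) MySQL bug was producing.
    $options = array(
        'LIMIT' => $limit + 1,
        // 'ORDER BY' => 'cl_sortkey',   // removed as the workaround
    );
    $res = $db->select( 'categorylinks', $fields, $conds, __METHOD__, $options );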
07:08 Tim: re-enabled the API
06:56 Tim: ixia (s2 master) overloaded due to ApiQueryCategoryMembers queries. Disabled the API and killed the offending queries
September 29
22:20 brion: reenabled history export ($wgExportAllowHistory), but put $wgExportMaxHistory back to 1000 instead of experimental 0 for enwiki. (sorry enwiki)
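The two settings involved (both standard MediaWiki configuration variables; the per-wiki override syntax is omitted):
    <?php
    $wgExportAllowHistory = true;  // full-history export back on
    $wgExportMaxHistory   = 1000;  // cap on revisions per exported page;
                                   // 0 (no limit) was the experimental
                                   // value rolled back here for enwiki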
21:27 RobH: fixed the mounts on srv163 and started apache back up.
20:20 brion: srv163 has a bad NFS config, missing the upload and math mounts. I've shut off its apache so it stops polluting the parser cache with math errors.
17:01 RobH: updated apache redirects.conf for orphaned domains, restarted all apaches.
15:06 RobH: updated DNS to reflect a number of orphaned domains.
08:48 Tim: put db7 back into watchlist rotation (99%)
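Roughly what that looks like in db.php (the section name and db16's weight are assumptions; the log gives db7 at 99% and db16 as the stand-in while db7 was out):
    <?php
    $groupLoadsBySection = array(
        's1' => array(                 // section name assumed
            'watchlist' => array(
                'db7'  => 99,          // back in rotation at 99%
                'db16' => 1,           // covered watchlist/RCL meanwhile
            ),
        ),
    );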
08:08 domas: enabled ipblocks replication on db7, resynced from db16
08:00 domas: Replaced gcc-4.2 build on db7 with gcc-4.1 one, from /home/wikipedia/src/mysql-4.0.40-r9-hardy-x86_64-gcc4.1.tar.gz
September 28
13:25 mark: Upgraded php5 and APC on all Ubuntu apaches... got tired of restarting them. ;)
12:06 Tim: on db7: replicate-ignore-table=enwiki.ipblocks. Good enough for now.
11:51 Tim: schema update at 04:44 made db7 segfault. Replication stopped, watchlists stopped working after code referencing the new schema was synced. Switched to db16 for watchlist and RCL. Tried INSERT SELECT, that segfaulted too.
09:37 mark: Made syslog-ng on db20 filter the flood of 404s in /var/log/remote
09:15 mark: Restarted all (and only) segfaulting apaches
September 27
18:09 brion: got some complaints about ERROR_ZERO_SIZED_OBJECT on saves, seeing a lot of segfaults in log. Restarting all apaches to see what they do.
September 26
22:49 RobH: repooled sq49.
22:00 RobH: depooled sq49 for power testing.
21:50 RobH: pulled search7 for power testing and left it off, as the power circuit would trip if it was left on there.
21:18 RobH: put srv189 back into rotation.
19:51 RobH: Pulled srv189 for power testing.
September 25
21:41 RobH: had to recreate /home/wikipedia/logs/jobqueue/error, which was lost; the job queue runners were failing without it. Restarted the runners.
19:08 domas: fixed clear-profile by replacing 'zwinger' with 'zwinger.wikimedia.org' - apparently datagrams to 127.1 used to fail.
18:44 brion: manually applied r41264 to MimeMagic.php to fix uploads of OpenDocument files to private/internal wikis
15:25 RobH: bayes minimally installed.
15:23 RobH: reverted statistics1 to bayes in dns, pushed dns change.
14:04 RobH: bayes racked and ready for install.
05:00 mark: Flapped BGP session to HGTN, to resolve blackholing of traffic
03:20 Tim: stopped apache on srv167, was segfaulting again. I suspect binary version mismatch between compile and deployment, e.g. APC was compiled for libc 2.5-0ubuntu1, deployed on libc 2.7-10ubuntu3.
02:28 Tim: srv35 was segfaulting again, probably because it was in both the test.wikipedia.org pool and the main apache pool. Having two copies of everything tends to make the APC cache overflow, which triggers bugs in APC and leads to segfaulting. Removed it from the main apache pool.
September 24
20:23 RobH: restarted srv186 apache due to segfault.
20:21 RobH: restarted srv179 apache due to segfault.
20:05 brion: restarted srv35's apache (test.wikipedia.org); it was segfaulting
19:25 tomasz: restricted grant for 'exim'@'208.80.152.186' to 150 MAX_USER_CONNECTIONS
18:40 mark: Increased TCP backlog setting on mchenry from 20 to 128.
18:19 brion: restoring ApiQueryDeletedrevs and Special:Export since they're not at issue. Domas thinks some of the hangs may be caused by mails getting stuck via ssmtp when the mail server is overloaded; auto mails on account creation etc. may hold funny transactions open
17:52 brion: disabling SiteStats::update() actual update query since it's blocking for reasons we can't identify and generally breaking shit
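A sketch of the shape of that hot-patch (hypothetical code; the method name follows the log entry, the body is illustrative):
    <?php
    class SiteStats {
        static function update( $updates ) {
            return; // temporarily disabled: the single-row UPDATE on
                    // site_stats blocks under unexplained lock contention
                    // and stalls every edit queued behind it.
            // $dbw = wfGetDB( DB_MASTER );
            // $dbw->update( 'site_stats', $updates,
            //     array( 'ss_row_id' => 1 ), __METHOD__ );
        }
    }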
17:50 RobH: updated nagios files/node groups for raid checking on hosts without 3ware present
17:37 brion: domas thinks the problem is some kind of lock contention on site_stats, causing all the edit updates to hang -- as a result the ES connections stack up while waiting on the core master. I'm disabling ss_active_users update for now, that sounds slow...
17:34 RobH: srv131 apache setup is borked, removing from lvs.
17:33 RobH: added proper ip info for lo device on srv131
17:24 brion: temporarily disabling special:export
17:22 brion: the revert got us back to being able to read the site most of the time, but still lots of problems saving -- ES master on cluster18 still has lots of sleeper connections and refuses new saves
17:10 brion: trying a set of reverts to recent ES changes
16:43 brion: temporarily disabling includes/api/ApiQueryDeletedrevs.php, it may or may not be hitting too much ES or something?
16:38 brion: seeing lots of long-delayed sleeping connections on ES masters, not running queries. trying to figure out w/ Aaron what could cause these
16:36 mark: Set up a syslog server on db20, logging messages from other servers to /var/log/remote.
16:31 brion: confirmed PHP fatal error during connection error (backend connection error "too many connections"). Manually merging r41230 to live copy to skip around the frontend PHP error
16:20 brion: we're getting reports of eg "(Can't contact the database server: Unknown error (10.0.2.104))" on save. Trying to investigate, but MediaWiki was borked by the previous reversions of core DB-related files to a 6-month-old version with incompatible paths. Trying to re-sanitize MW to r41097 straight
15:45 Rob: set up wikimedia-task-appserver on srv141.
15:09 mark: The problem reappeared, looks like a bug in MediaWiki, possibly triggered by some issue in ES. Reverted the files includes/ExternalStore.php includes/ExternalStoreDB.php includes/Revision.php includes/db/Database.php includes/db/LoadBalancer.php to r35098 and ran scap.
14:50 mark: Reports of most/all saves failing with PHP fatal error in /usr/local/apache/common-local/php-1.5/includes/ExternalStoreDB.php line 127: Call to a member function nextSequenceValue() on a non-object. Suspected APC cache corruption, did a hard restart of all apaches which appeared to resolve the problem.
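Reconstructed from the logged fatal (names approximate, not the exact source); the guard is illustrative, since the actual remedy was the APC restart:
    <?php
    // $store is an ExternalStoreDB instance, $cluster e.g. 'cluster18'
    $dbw = $store->getMaster( $cluster );   // came back as a non-object
    if ( !is_object( $dbw ) ) {
        // hypothetical guard: fail loudly instead of the opaque fatal
        throw new MWException( "No ES master connection for '$cluster'" );
    }
    $id = $dbw->nextSequenceValue( 'blob_blob_id_seq' );  // the line-127 call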
September 23
16:53 RobH: srv136 was locked up, restarted, synced, added correct lvs ip info.
16:45 RobH: srv126 was locked up, restarted, synced, added correct lvs ip info.
16:29 RobH: rebooted srv106, was locked up.
16:25 RobH: reinstalled srv101, was old ubuntu with no ES data.
16:13 RobH: reinstalled srv143 and srv148 from FC to Ubuntu, redeployed as apache
15:57 RobH: reinstalled srv128 and srv140 from FC to Ubuntu, redeployed as apache.
14:00-14:50 Tim: cleaned up /home/wikipedia somewhat, put various things in /home/wikipedia/junk or /home/wikipedia/backup, moved some lock files to lockfiles, deleted ancient /h/w/c/*.png symlinks, etc.
14:50 Tim: Made sync-common-file use rsync instead of NFS since some mediawiki-installation servers still have a stale NFS handle for /home
14:31 RobH: srv189 back in apache rotation
14:20 RobH: srv130 back in apache rotation
13:56 Tim: started rsync daemon on db20
13:49 Tim: restored dsh node groups on zwinger
13:40 Tim: installed udplog 1.3 on henbane
00:05 - 01:20 Tim: copying everything from the recovered suda image except /home/kate/xx, /home/from-zwinger and /home/wikipedia/logs. Will copy /home/wikipedia/logs selectively.
September 22
21:30 brion: noting that ExtensionDistributor extension is disabled for now due to the NFS problem
18:59 RobH: srv131 offline due to kernel panic. Cannot bring back until /home issue is resolved.
18:00 brion: things seem at least semi-working.
- everything hung
- suda had some kind of kernel crash
- after reboot, it was found to have a couple of flaky disks
- brion hacked up the MW config files to skip the NFS logging
- mark set up an alternate /home NFS server
17:50 mark: Set up db20 as an (empty) temporary suda replacement. Set up NFS server for /home.
17:25 RobH: srv130 not working right, removed from pool.
16:32 RobH: removed srv8 and srv10 from nagios, resynced.
15:00 mark: Site down completely. Post-mortem:
- Rob is untangling power cables in rack B2, and both asw-b2-pmtpa and asw3-pmtpa (in B4) lose power
- Two racks are unreachable; PyBal sees too many hosts down and won't depool more
- Rob brings power to asw-b2-pmtpa back up, but the connectivity loss to B4 goes unnoticed
- Mark investigates why LVS isn't working and adjusts PyBal parameters, until PyBal ends up with not a single server pooled
- Apaches are unhappy about the completely missing ES clusters
- Connectivity loss to B4 is discovered and restored
- Site back online
September 21
10:10 Tim: disabled srv106's switch port. Was running the job queue with old configuration, inaccessible by ssh.
September 20
14:45 Tim: re-enabled Special:Export with $wgExportAllowHistory=false. Please find some way of doing transwiki requests which doesn't involve crashing the site.
14:30 Tim: People were reporting overload of the current ES master and no ability to save pages at all. This was apparently due to the small number of max connections on srv103/srv104; most threads were sleeping. The real culprit was apparently db2 being slow due to a long-running (1 hour) Special:Export request. Disabled Special:Export entirely.
12:00 mark: Restored zwinger's IPv6 connectivity; removed svn.wikimedia.org from /etc/hosts
11:40 mark: Found an IP conflict; 208.80.152.136 was assigned to srv9 but not listed in DNS
10:09 Tim: removed srv69 and srv118 from the memcached list, down
09:02 Tim: ES on srv84 had new passwords, was not accepting connections from 3.23 clients on srv32-34. Fixed.
08:45 Tim: depooled ES srv110, reformatted by Rob while it was still a current ES slave. Depooled srv137, mysqld was shut down on it for some reason. One server left in cluster14.
srv137 has a corrupt read-only file system on /usr/local/mysql/data2
05:34 Tim: svn.wikimedia.org not reachable from zwinger via IPv6, causing very slow operation due to timeouts. Hacked /etc/hosts.
01:06 Tim: ES migration failed on all clusters except cluster3 (the cluster I used to test the script), due to MySQL 4.0-4.1 version differences. Restarting with mysqldump --default-character-set=latin1.
September 19
15:59 RobH: reinstalled srv70, srv100, srv110-srv119 from FC to Ubuntu, redeployed.
07:30 Tim: srv38 was hanging while attempting to write to log files on /home. Fixed permissions on /mnt/upload4/en/thumb which was causing a high log write rate, restarted apache, disabled search-restart cron job, restarted pybal. Seems to be fixed.
01:55 Tim: the issue with ES was the lack of a master pos wait between transfer and slave shutdown. Fixing.
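Sketch of the missing step (variable names made up; getMasterPos()/masterPosWait() are MediaWiki's Database API):
    <?php
    $pos = $masterDb->getMasterPos();           // binlog coordinates now
    $ok  = $slaveDb->masterPosWait( $pos, 30 ); // block for up to 30s
    if ( $ok === null || $ok === false || $ok < 0 ) {
        die( "Slave never caught up; unsafe to shut it down\n" );
    }
    // only now is it safe to stop mysqld on the slave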
01:00 Tim: restarting possibly segfaulting apaches on srv158, srv177, srv178, srv173, srv51, srv187, srv182, srv44, srv117. Keeping srv139 for debugging, it has kindly depooled itself by segfaulting on pybal health checks.
September 18
17:39 RobH: srv35, srv37, srv55 & srv59 bootstrapped with ganglia.
17:37 RobH: srv40, srv41, srv43-srv53 bootstrapped with ganglia.
17:36 RobH: srv60-srv68 bootstrapped with ganglia.
17:31 RobH: srv151-srv188 bootstrapped with ganglia.
11:45 Tim: reverted db.php change, still has issues.
11:18 Tim: removed apaches_yaseo from nagios config, changed apaches_pmtpa to apaches.
11:09 Tim: in db.php, switched ES clusters 3-10 to use the ubuntu servers
September 17
23:57 brion: set $wgLogo to $stdpath for wikinews -- the old local /upload path failed to redirect properly on the secure.wikimedia.org interface
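Roughly (per-wiki selection syntax omitted; the path value is a made-up example):
    <?php
    $stdpath = "$wgUploadPath/b/bc/Wiki.png";  // example value only
    $wgLogo  = $stdpath;  // was a bare local '/upload/...' path, which
                          // secure.wikimedia.org failed to redirect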
22:19 mark: Deployed the rest of the new search servers, search2 - search7.
19:25 JeLuF: changed robots.php to send both MediaWiki:robots.txt and /apache/common/robots.txt
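A minimal sketch of that behaviour (not the actual robots.php; the URL and the ordering are illustrative):
    <?php
    header( 'Content-Type: text/plain' );
    readfile( '/apache/common/robots.txt' );    // global rules
    $onWiki = @file_get_contents( 'http://en.wikipedia.org/w/index.php'
        . '?title=MediaWiki:Robots.txt&action=raw' );
    if ( $onWiki !== false && $onWiki !== '' ) {
        echo "\n", $onWiki;                     // per-wiki additions
    }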
19:23 RobH: Removed srv63 from memcache list, put in spare memcache and synced file.
19:14 RobH: restarted memcached on srv74
19:00 RobH: reinstalled srv62, srv64, srv65, srv66, srv67, & srv68 from FC to Ubuntu.
18:26 RobH: srv63 shutdown due to hdd failure.
18:25 RobH: srv61 shutdown due to overheating issue.
16:00 RobH: moved srv37 from pybal render group to apache group
01:50 brion: killed obsolete juriwiki-l list per delphine
September 16
22:59 mark: srv133 is giving Bus errors, read-only file systems, and was therefore automatically depooled by PyBal. Good times.
22:59 mark: Installed memcached on srv182 (was missing?), restarted memcached on srv70, srv169 and replaced instance of srv141 by srv142.
22:36 mark: Prepared searchidx1 and search1 for production, if things work sufficiently well I'll deploy the others tomorrow
21:30 brion: found a bunch of memcache machines down or not running memcached: 170, 141, 70, 169, 182
21:01 mark: Building search deployment with rainman, with search1 as test host
20:33 brion: fixed secure.wikimedia.org for Wikimania wikis -- wikimedia-ssl-backend.conf rewrite rules were mistakenly excluding digits from the wiki pseudodir
18:28 mark: Upgraded wikimedia-task-appserver on all Ubuntu app servers, which creates a limited ssh account pybal-check for use by PyBal. The account still needs to be created manually on all Fedora apaches
17:01 mark: Apache on srv151 is stuck on an NFS mountpoint and cannot be restarted. I'm not rebooting the box as I'm not sure what's going on with ES atm.
September 12
23:30 jeluf: apache on srv37 doesn't restart, libhistory.so.4 is missing
23:00 jeluf: proxy robots.txt requests through live-1.5/robots.php, which delivers MediaWiki:robots.txt if it exists, and /apache/common/robots.txt otherwise.
15:30 Tim: set read_only=0 on srv108 (Rob rebooted it)
15:00 RobH: bart crashed, rebooted.
14:56 Tim: pulling out all the stops now, running migrate.php migrate-all.
14:45 RobH: synced srv104, back online.
14:40 RobH: synced db.php.
14:32 RobH: srv105 unresponsive, rebooted.
14:25 Tim: Removed the corrupted ES installations on srv151-176
14:18 RobH: Installed NRPE plugins on db9-db16.
09:01 Tim: reverted, blob corruption due to charset conversion observed
07:58 Tim: Experimentally switched db.php to use the ubuntu servers for cluster3/4.
07:50 Tim: Stopping replication on the ubuntu cluster3 and cluster4 servers, and changing the file permissions on the MyISAM files to prevent any kind of modification by the mysql daemon. This is done by the new lock/unlock commands in ~tstarling/migrateExtToUbuntu/migrate.php.
September 11
05:30 Tim: Migrating cluster4. Testing new binlog deletion feature.
September 10
15:40 RobH: Racktables database moved from will to db9.
15:00 RobH: Reinstalled srv185, srv186, srv187 to newest ubuntu, online as apache.
05:00 - 10:10 Tim: copied cluster3 to srv151, srv163 and srv175, second attempt, seems to have worked this time
September 9
23:25 brion: for a few minutes got some complaints about 'Can't contact the database server: Unknown error (10.0.6.22)' (db12). This box seems to be semi-down pending some data recovery, but load wasn't disabled from it. May have gotten load due to other servers being lagged at the time. Set its load to 0.
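The depool, sketched against db.php (the section and the other entries are made up; only db12's 0 comes from the log):
    <?php
    $sectionLoads = array(
        's2' => array(          // section name assumed
            'db12' => 0,        // semi-down pending data recovery
            'db13' => 200,      // other slaves: hypothetical
            'db14' => 200,
        ),
    );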
15:48 mark: alrazi overloaded, switch traffic back to knams and hope it can take the load
14:37 mark: knams partially back up, broken line card still down. Moved some important servers to another line card. knsq16 - knsq30 will be down for the upcoming days, as well as most management.
10:20 domas: copied in the mysql build from db16 to db12 - db12 was running the gcc-4.2 one and was in a crash loop. Next crash will bring up the proper build :)
September 6
20:15 river: failure of many hosts at knams (including lvs), moved to authdns-scenario knams-down
14:45 JeLuF: started to rsync enwiki images from amane to storage1 in preparation of tomorrow's final move of the image directory
04:24 Tim: sync-file screwup caused thumbnails to be created in the source image directory. Will try to repair.
03:13 Tim: srv151 is depooled for some reason. No indication as to why in the logs or config files. Using it to test the new wikimedia-task-appserver package. Will repool once I get it working properly.
September 4
22:15 JeLuF: Switched srv179's mysql to read_only
22:10 JeLuF: OTRS back online, switched to db9. Changed exim config on mchenry, too.
20:00 JeLuF,RobH: Shut down OTRS, migrating its DB from srv179 to db9
19:31 mark: Many boxes still in inconsistent state because of OOM kills. Some background processes not running (e.g. ntpd). Rebooted srv159, srv182, srv154, srv156, srv157, srv158, srv181, srv188
19:28 mark: scap
19:01 mark: Killed all stuck convert processes on srv151..srv188 (but left srv189 intact for debugging)
18:33 mark: Many application servers are running out of memory, one by one. This seems to be caused by stuck thumbnail convert processes which end up there. The thumbnail convert processes on the regular apaches are indirectly caused by the API - it's opensearch/prefixsearch/allpages related - but I get lost in that code. One sample url is http://en.wikipedia.org/w/api.php?action=opensearch&search=Gina&format=xml Another interesting and likely related question is why many apaches can no longer reach storage1 NFS...
17:07 RobH: Restarted ssh process which had stalled on srv188.