23:45 brion: apt is borked on mayflower due to smartmontools refusing to load
20:46 tomasz: cp'd phpmyadmin from dev.civicrm to prod civicrm for david
20:06 brion: replaced 'wikimedia.org' with 'meta.wikimedia.org' in the local VHosts list in wgConf.php. The general 'wikimedia.org' was causing CodeReview's diff loads (via codereview-proxy.wikimedia.org) to fail, as they were hitting localhost instead of the proxy. Do we need to add more vhosts to this list, or redo how it works?
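For context, a rough sketch of the shape of that list (assuming the usual $wgConf layout; entries other than the changed one are made up):
    <?php
    // wgConf.php (sketch): hostnames listed here are treated as local,
    // so matching requests are answered from localhost instead of going
    // out over HTTP. The bare 'wikimedia.org' also matched
    // codereview-proxy.wikimedia.org, which is why the diff loads never
    // reached the proxy.
    $wgConf->localVHosts = array(
        'meta.wikimedia.org',      // was: 'wikimedia.org' (too broad)
        'commons.wikimedia.org',   // made-up example entry
        // ...
    );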
19:45 brion: test-deploying CodeReview on mediawiki.org
19:17 RobH: updated DNS to add something for Brion.
18:21 mark: cache cleaning complete
15:05 Tim: doing some manual purges of URLs requested on #wikimedia-tech
15:00 mark: Cleaning caches of all backend text squids one by one, starting with pmtpa
14:20 mark: pooled all squids manually to fix the issues.
14:10 RobH: Site back up, slow as squids play catchup.
14:06 RobH: Pushed out old redirects.conf and restarted apaches.
14:01 RobH: Site is down, go me =[
14:00 RobH: updated redirects.conf and pushed change for orphaned domains.
13:38 RobH: updated dns for more orphaned domains.
13:11 Tim: cluster13 and cluster14 both have only one server left in rotation. Shut down apache on srv129 and srv139, fearing that the extra load might hasten their doom.
10:12 Tim: Switched ES clusters 3-10 to use the Ubuntu servers (again)
08:05 Tim: removed the ORDER BY clauses from the ApiQueryCategoryMembers queries to work around a MySQL bug, probably involving truncated indexes
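Illustratively (this is not the actual diff), the change has this shape in the options passed to Database::select():
    <?php
    // ApiQueryCategoryMembers query options (sketch): drop the explicit
    // sort and rely on index order, sidestepping whatever query plan the
    // (probably truncated-index-related) MySQL bug was producing.
    $options = array(
        'LIMIT' => $limit + 1,
        // 'ORDER BY' => 'cl_sortkey',   // removed as the workaround
    );
    $res = $db->select( 'categorylinks', $fields, $conds, __METHOD__, $options );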
07:08 Tim: re-enabled the API
06:56 Tim: ixia (s2 master) overloaded due to ApiQueryCategoryMembers queries. Disabled the API and killed the offending queries
September 29
22:20 brion: reenabled history export ($wgExportAllowHistory), but put $wgExportMaxHistory back to 1000 instead of experimental 0 for enwiki. (sorry enwiki)
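The two settings involved (both standard MediaWiki configuration variables; the per-wiki override syntax is omitted):
    <?php
    $wgExportAllowHistory = true;  // full-history export back on
    $wgExportMaxHistory   = 1000;  // cap on revisions per exported page;
                                   // 0 (no limit) was the experimental
                                   // value rolled back here for enwiki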
21:27 RobH: fixed the mounts on srv163 and started apache back up.
20:20 brion: srv163 has a bad NFS config, missing the upload and math mounts. I've shut off its apache so it stops polluting the parser cache with math errors.
17:01 RobH: updated apache redirects.conf for orphaned domains, restarted all apaches.
15:06 RobH: updated DNS to reflect a number of orphaned domains.
08:48 Tim: put db7 back into watchlist rotation (99%)
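Roughly what that looks like in db.php (the section name and db16's weight are assumptions; the log gives db7 at 99% and db16 as the stand-in while db7 was out):
    <?php
    $groupLoadsBySection = array(
        's1' => array(                 // section name assumed
            'watchlist' => array(
                'db7'  => 99,          // back in rotation at 99%
                'db16' => 1,           // covered watchlist/RCL meanwhile
            ),
        ),
    );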
08:08 domas: enabled ipblocks replication on db7, resynced from db16
08:00 domas: Replaced gcc-4.2 build on db7 with gcc-4.1 one, from /home/wikipedia/src/mysql-4.0.40-r9-hardy-x86_64-gcc4.1.tar.gz
September 28
13:25 mark: Upgraded php5 and APC on all Ubuntu apaches... got tired of restarting them. ;)
12:06 Tim: on db7: replicate-ignore-table=enwiki.ipblocks. Good enough for now.
11:51 Tim: schema update at 04:44 made db7 segfault. Replication stopped, watchlists stopped working after code referencing the new schema was synced. Switched to db16 for watchlist and RCL. Tried INSERT SELECT, that segfaulted too.
09:37 mark: Made syslog-ng on db20 filter the flood of 404s in /var/log/remote
09:15 mark: Restarted all (and only) segfaulting apaches
September 27
18:09 brion: got some complaints about ERROR_ZERO_SIZED_OBJECT on saves, seeing a lot of segfaults in log. Restarting all apaches to see what they do.
September 26
22:49 RobH: repooled sq49.
22:00 RobH: depooled sq49 for power testing.
21:50 RobH: pulled search7 for power testing and left it off, as the power circuit would trip if it was left on there.
21:18 RobH: put srv189 back into rotation.
19:51 RobH: Pulled srv189 for power testing.
September 25
21:41 RobH: had to recreate /home/wikipedia/logs/jobqueue/error, which was lost; the job queue runners were failing without it. Restarted the runners.
19:08 domas: fixed clear-profile by replacing 'zwinger' with 'zwinger.wikimedia.org' - apparently datagrams to 127.1 used to fail.
18:44 brion: manually applied r41264 to MimeMagic.php to fix uploads of OpenDocument files to private/internal wikis
15:25 RobH: bayes minimally installed.
15:23 RobH: reverted statistics1 to bayes in dns, pushed dns change.
14:04 RobH: bayes racked and ready for install.
05:00 mark: Flapped BGP session to HGTN, to resolve blackholing of traffic
03:20 Tim: stopped apache on srv167, was segfaulting again. I suspect binary version mismatch between compile and deployment, e.g. APC was compiled for libc 2.5-0ubuntu1, deployed on libc 2.7-10ubuntu3.
02:28 Tim: srv35 was segfaulting again, probably because it was in both the test.wikipedia.org pool and the main apache pool. Having two copies of everything tends to make the APC cache overflow, which triggers bugs in APC and leads to segfaulting. Removed it from the main apache pool.
September 24
20:23 RobH: restarted srv186 apache due to segfault.
20:21 RobH: restarted srv179 apache due to segfault.
20:05 brion: restarted srv35's apache (test.wikipedia.org); it was segfaulting
19:25 tomasz: restricted grant for 'exim'@'208.80.152.186' to 150 MAX_USER_CONNECTIONS
18:40 mark: Increased TCP backlog setting on mchenry from 20 to 128.
18:19 brion: restoring ApiQueryDeletedrevs and Special:Export since they're not at issue. Domas thinks some of the hangs may be caused by mails getting stuck via ssmtp when the mail server is overloaded; auto mails on account creation etc. may hold funny transactions open
17:52 brion: disabling SiteStats::update() actual update query since it's blocking for reasons we can't identify and generally breaking shit
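A sketch of the shape of that hot-patch (hypothetical code; the method name follows the log entry, the body is illustrative):
    <?php
    class SiteStats {
        static function update( $updates ) {
            return; // temporarily disabled: the single-row UPDATE on
                    // site_stats blocks under unexplained lock contention
                    // and stalls every edit queued behind it.
            // $dbw = wfGetDB( DB_MASTER );
            // $dbw->update( 'site_stats', $updates,
            //     array( 'ss_row_id' => 1 ), __METHOD__ );
        }
    }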
17:50 RobH: updated nagios files/node groups for raid checking on hosts without 3ware present
17:37 brion: domas thinks the problem is some kind of lock contention on site_stats, causing all the edit updates to hang -- as a result the ES connections stack up while waiting on the core master. I'm disabling ss_active_users update for now, that sounds slow...
17:34 RobH: srv131 apache setup is borked, removing from lvs.
17:33 RobH: added proper ip info for lo device on srv131
17:24 brion: temporarily disabling special:export
17:22 brion: the revert got us back to being able to read the site most of the time, but still lots of problems saving -- ES master on cluster18 still has lots of sleeper connections and refuses new saves
17:10 brion: trying a set of reverts to recent ES changes
16:43 brion: temporarily disabling includes/api/ApiQueryDeletedrevs.php, it may or may not be hitting too much ES or something?
16:38 brion: seeing lots of long-delayed sleeping connections on ES masters, not running queries. trying to figure out w/ Aaron what could cause these
16:36 mark: Set up a syslog server on db20, logging messages from other servers to /var/log/remote.
16:31 brion: confirmed PHP fatal error during connection error (backend connection error "too many connections"). Manually merging r41230 to live copy to skip around the frontend PHP error
16:20 brion: we're getting reports of eg "(Can't contact the database server: Unknown error (10.0.2.104))" on save. Trying to investigate, but MediaWiki was borked by the previous reversions of core DB-related files to a 6-month-old version with incompatible paths. Trying to re-sanitize MW to r41097 straight
15:45 Rob: set up wikimedia-task-appserver on srv141.
15:09 mark: The problem reappeared, looks like a bug in MediaWiki, possibly triggered by some issue in ES. Reverted the files includes/ExternalStore.php includes/ExternalStoreDB.php includes/Revision.php includes/db/Database.php includes/db/LoadBalancer.php to r35098 and ran scap.
14:50 mark: Reports of most/all saves failing with PHP fatal error in /usr/local/apache/common-local/php-1.5/includes/ExternalStoreDB.php line 127: Call to a member function nextSequenceValue() on a non-object. Suspected APC cache corruption, did a hard restart of all apaches which appeared to resolve the problem.
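Reconstructed from the logged fatal (names approximate, not the exact source); the guard is illustrative, since the actual remedy was the APC restart:
    <?php
    // $store is an ExternalStoreDB instance, $cluster e.g. 'cluster18'
    $dbw = $store->getMaster( $cluster );   // came back as a non-object
    if ( !is_object( $dbw ) ) {
        // hypothetical guard: fail loudly instead of the opaque fatal
        throw new MWException( "No ES master connection for '$cluster'" );
    }
    $id = $dbw->nextSequenceValue( 'blob_blob_id_seq' );  // the line-127 call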
September 23
16:53 RobH: srv136 was locked up, restarted, synced, added correct lvs ip info.
16:45 RobH: srv126 was locked up, restarted, synced, added correct lvs ip info.
16:29 RobH: rebooted srv106, was locked up.
16:25 RobH: reinstalled srv101, was old ubuntu with no ES data.
16:13 RobH: reinstalled srv143 and srv148 from FC to Ubuntu, redeployed as apache
15:57 RobH: reinstalled srv128 and srv140 from FC to Ubuntu, redeployed as apache.
14:00-14:50 Tim: cleaned up /home/wikipedia somewhat, put various things in /home/wikipedia/junk or /home/wikipedia/backup, moved some lock files to lockfiles, deleted ancient /h/w/c/*.png symlinks, etc.
14:50 Tim: Made sync-common-file use rsync instead of NFS since some mediawiki-installation servers still have a stale NFS handle for /home
14:31 RobH: srv189 back in apache rotation
14:20 RobH: srv130 back in apache rotation
13:56 Tim: started rsync daemon on db20
13:49 Tim: restored dsh node groups on zwinger
13:40 Tim: installed udplog 1.3 on henbane
00:05 - 01:20 Tim: copying everything from the recovered suda image except /home/kate/xx, /home/from-zwinger and /home/wikipedia/logs. Will copy /home/wikipedia/logs selectively.
September 22
21:30 brion: noting that ExtensionDistributor extension is disabled for now due to the NFS problem
18:59 RobH: srv131 offline due to kernel panic. Cannot bring back until /home issue is resolved.
18:00 brion: things seem at least semi-working.
- everything hung
- suda had some kind of kernel crash
- after reboot, it was found to have a couple of flaky disks
- brion hacked up the MW config files to skip the NFS logging
- mark set up an alternate /home NFS server
17:50 mark: Set up db20 as an (empty) temporary suda replacement. Set up NFS server for /home.
17:25 RobH: srv130 not working right, removed from pool.
16:32 RobH: removed srv8 and srv10 from nagios, resynced.
15:00 mark: Site down completely. Post-mortem:
- Rob is untangling power cables in rack B2, and both asw-b2-pmtpa and asw3-pmtpa (in B4) lose power
- Two racks are unreachable; PyBal sees too many hosts down and won't depool more
- Rob brings power to asw-b2-pmtpa back up, but the connectivity loss to B4 goes unnoticed
- Mark investigates why LVS isn't working and adjusts PyBal parameters, until PyBal ends up with not a single server pooled
- Apaches are unhappy about the completely missing ES clusters
- Connectivity loss to B4 is discovered and restored
- Site back online
September 21
10:10 Tim: disabled srv106's switch port. Was running the job queue with old configuration, inaccessible by ssh.
September 20
14:45 Tim: re-enabled Special:Export with $wgExportAllowHistory=false. Please find some way of doing transwiki requests which doesn't involve crashing the site.
14:30 Tim: People were reporting overload of the current ES master and no ability to save pages at all. This was apparently due to the small number of max connections on srv103/srv104; most threads were sleeping. The real culprit was apparently db2 being slow due to a long-running (1 hour) Special:Export request. Disabled Special:Export entirely.
12:00 mark: Restored zwinger's IPv6 connectivity; removed svn.wikimedia.org from /etc/hosts
11:40 mark: Found an IP conflict; 208.80.152.136 was assigned to srv9 but not listed in DNS
10:09 Tim: removed srv69 and srv118 from the memcached list, down
09:02 Tim: ES on srv84 had new passwords, was not accepting connections from 3.23 clients on srv32-34. Fixed.
08:45 Tim: depooled ES srv110, reformatted by Rob while it was still a current ES slave. Depooled srv137, mysqld was shut down on it for some reason. One server left in cluster14.
srv137 has a corrupt read-only file system on /usr/local/mysql/data2
05:34 Tim: svn.wikimedia.org not reachable from zwinger via IPv6, causing very slow operation due to timeouts. Hacked /etc/hosts.
01:06 Tim: ES migration failed on all clusters except cluster3 (the cluster I used to test the script), due to MySQL 4.0-4.1 version differences. Restarting with mysqldump --default-character-set=latin1.
September 19
15:59 RobH: reinstalled srv70, srv100, srv110-srv119 from FC to Ubuntu, redeployed.
07:30 Tim: srv38 was hanging while attempting to write to log files on /home. Fixed permissions on /mnt/upload4/en/thumb which was causing a high log write rate, restarted apache, disabled search-restart cron job, restarted pybal. Seems to be fixed.
01:55 Tim: the issue with ES was the lack of a master pos wait between transfer and slave shutdown. Fixing.
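Sketch of the missing step (variable names made up; getMasterPos()/masterPosWait() are MediaWiki's Database API):
    <?php
    $pos = $masterDb->getMasterPos();           // binlog coordinates now
    $ok  = $slaveDb->masterPosWait( $pos, 30 ); // block for up to 30s
    if ( $ok === null || $ok === false || $ok < 0 ) {
        die( "Slave never caught up; unsafe to shut it down\n" );
    }
    // only now is it safe to stop mysqld on the slave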
01:00 Tim: restarting possibly segfaulting apaches on srv158, srv177, srv178, srv173, srv51, srv187, srv182, srv44, srv117. Keeping srv139 for debugging, it has kindly depooled itself by segfaulting on pybal health checks.
September 18
17:39 RobH: srv35, srv37, srv55 & srv59 bootstrapped with ganglia.
17:37 RobH: srv40, srv41, srv43-srv53 bootstrapped with ganglia.
17:36 RobH: srv60-srv68 bootstrapped with ganglia.
17:31 RobH: srv151-srv188 bootstrapped with ganglia.
11:45 Tim: reverted db.php change, still has issues.
11:18 Tim: removed apaches_yaseo from nagios config, changed apaches_pmtpa to apaches.
11:09 Tim: in db.php, switched ES clusters 3-10 to use the ubuntu servers
September 17
23:57 brion: set $wgLogo to $stdpath for wikinews -- the old local /upload path failed to redirect properly on the secure.wikimedia.org interface
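Roughly (per-wiki selection syntax omitted; the path value is a made-up example):
    <?php
    $stdpath = "$wgUploadPath/b/bc/Wiki.png";  // example value only
    $wgLogo  = $stdpath;  // was a bare local '/upload/...' path, which
                          // secure.wikimedia.org failed to redirect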
22:19 mark: Deployed the rest of the new search servers, search2 - search7.
19:25 JeLuF: changed robots.php to send both MediaWiki:robots.txt and /apache/common/robots.txt
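A minimal sketch of that behaviour (not the actual robots.php; the URL and the ordering are illustrative):
    <?php
    header( 'Content-Type: text/plain' );
    readfile( '/apache/common/robots.txt' );    // global rules
    $onWiki = @file_get_contents( 'http://en.wikipedia.org/w/index.php'
        . '?title=MediaWiki:Robots.txt&action=raw' );
    if ( $onWiki !== false && $onWiki !== '' ) {
        echo "\n", $onWiki;                     // per-wiki additions
    }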
19:23 RobH: Removed srv63 from memcache list, put in spare memcache and synced file.
19:14 RobH: restarted memcached on srv74
19:00 RobH: reinstalled srv62, srv64, srv65, srv66, srv67, & srv68 from FC to Ubuntu.
18:26 RobH: srv63 shutdown due to hdd failure.
18:25 RobH: srv61 shutdown due to overheating issue.
16:00 RobH: moved srv37 from pybal render group to apache group
01:50 brion: killed obsolete juriwiki-l list per delphine
September 16
22:59 mark: srv133 is giving Bus errors, read-only file systems, and was therefore automatically depooled by PyBal. Good times.
22:59 mark: Installed memcached on srv182 (was missing?), restarted memcached on srv70, srv169 and replaced instance of srv141 by srv142.
22:36 mark: Prepared searchidx1 and search1 for production, if things work sufficiently well I'll deploy the others tomorrow
21:30 brion: found a bunch of memcache machines down or not running memcached: 170, 141, 70, 169, 182
21:01 mark: Building search deployment with rainman, with search1 as test host
20:33 brion: fixed secure.wikimedia.org for Wikimania wikis -- wikimedia-ssl-backend.conf rewrite rules were mistakenly excluding digits from the wiki pseudodir
18:28 mark: Upgraded wikimedia-task-appserver on all Ubuntu app servers, which creates a limited ssh account pybal-check for use by PyBal. The account still needs to be created manually on all Fedora apaches
17:01 mark: Apache on srv151 is stuck on an NFS mountpoint and cannot be restarted. I'm not rebooting the box as I'm not sure what's going on with ES atm.
September 12
23:30 jeluf: apache on srv37 doesn't restart, libhistory.so.4 is missing
23:00 jeluf: proxy robots.txt requests through live-1.5/robots.php, which delivers MediaWiki:robots.txt if it exists, and /apache/common/robots.txt otherwise.
15:30 Tim: set read_only=0 on srv108 (Rob rebooted it)
15:00 RobH: bart crashed, rebooted.
14:56 Tim: pulling out all the stops now, running migrate.php migrate-all.
14:45 RobH: synced srv104, back online.
14:40 RobH: synced db.php.
14:32 RobH: srv105 unresponsive, rebooted.
14:25 Tim: Removed the corrupted ES installations on srv151-176
14:18 RobH: Installed NRPE plugins on db9-db16.
09:01 Tim: reverted, blob corruption due to charset conversion observed
07:58 Tim: Experimentally switched db.php to use the ubuntu servers for cluster3/4.
07:50 Tim: Stopping replication on the ubuntu cluster3 and cluster4 servers, and changing the file permissions on the MyISAM files to prevent any kind of modification by the mysql daemon. This is done by the new lock/unlock commands in ~tstarling/migrateExtToUbuntu/migrate.php.
September 11
05:30 Tim: Migrating cluster4. Testing new binlog deletion feature.
September 10
15:40 RobH: Racktables database moved from will to db9.
15:00 RobH: Reinstalled srv185, srv186, srv187 to newest ubuntu, online as apache.
05:00 - 10:10 Tim: copied cluster3 to srv151, srv163 and srv175, second attempt, seems to have worked this time
September 9
23:25 brion: for a few minutes got some complaints about 'Can't contact the database server: Unknown error (10.0.6.22)' (db12). This box seems to be semi-down pending some data recovery, but load wasn't disabled from it. May have gotten load due to other servers being lagged at the time. Set its load to 0.
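The depool, sketched against db.php (the section and the other entries are made up; only db12's 0 comes from the log):
    <?php
    $sectionLoads = array(
        's2' => array(          // section name assumed
            'db12' => 0,        // semi-down pending data recovery
            'db13' => 200,      // other slaves: hypothetical
            'db14' => 200,
        ),
    );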
15:48 mark: alrazi overloaded, switch traffic back to knams and hope it can take the load
14:37 mark: knams partially back up, broken line card still down. Moved some important servers to another line card. knsq16 - knsq30 will be down for the upcoming days, as well as most management.
10:20 domas: copied in the mysql build from db16 to db12 - db12 was running the gcc-4.2 one and was in a crash loop. Next crash will bring up the proper build :)
September 6
20:15 river: failure of many hosts at knams (including lvs), moved to authdns-scenario knams-down
14:45 JeLuF: started to rsync enwiki images from amane to storage1 in preparation of tomorrow's final move of the image directory
04:24 Tim: sync-file screwup caused thumbnails to be created in the source image directory. Will try to repair.
03:13 Tim: srv151 is depooled for some reason. No indication as to why in the logs or config files. Using it to test the new wikimedia-task-appserver package. Will repool once I get it working properly.
September 4
22:15 JeLuF: Switched srv179's mysql to read_only
22:10 JeLuF: OTRS back online, switched to db9. Changed exim config on mchenry, too.
20:00 JeLuF,RobH: Shut down OTRS, migrating its DB from srv179 to db9
19:31 mark: Many boxes still in inconsistent state because of OOM kills. Some background processes not running (e.g. ntpd). Rebooted srv159, srv182, srv154, srv156, srv157, srv158, srv181, srv188
19:28 mark: scap
19:01 mark: Killed all stuck convert processes on srv151..srv188 (but left srv189 intact for debugging)
18:33 mark: Many application servers are running out of memory, one by one. This seems to be caused by stuck thumbnail convert processes which end up there. The thumbnail convert processes on the regular apaches are indirectly caused by the API - it's opensearch/prefixsearch/allpages related - but I get lost in that code. One sample url is http://en.wikipedia.org/w/api.php?action=opensearch&search=Gina&format=xml Another interesting and likely related question is why many apaches can no longer reach storage1 NFS...
17:07 RobH: Restarted ssh process which had stalled on srv188.