Server admin log/Archive 12

From Wikitech

July 31

  • 23:27 mark: Installed db12 (146G drives) and db13 (72G).
  • 21:45 RobH: srv101 back online.
  • 20:12 RobH: srv104 back online.
  • 10:17 mark: Disabled switchports srv101 and srv104.

July 30

  • 21:15 brion: wasted a lot of time there. Most of the segfaults have stopped as mysteriously as they came. srv101 is still reporting some, but I can't get into it anyway; it's pretty broken atm and should get rebooted. Can't ssh in, but it's a memcached host, so I don't want to just shut it off without rearranging things first
  • 20:30 brion: seeing a lot of segfaults on apaches, trying to track it down
  • 20:08 mark: Installed Ubuntu on db11
  • 18:26 brion: mobile.wikipedia.org temporarily moved to yongle on backend
  • 17:59 brion: mobile.wikipedia.org (anthony) down
  • 17:30ish rob is moving servers around
  • 04:50 Tim: removed comcomwiki from all.dblist. Obsolete, internal.wikimedia.org is used instead.
  • 03:00 Tim: Firewalled srv101/srv104 from the slave servers as well, to prevent pollution of the revision cache.
  • 02:30 Tim: Report on the village pump of ongoing database corruption due to job queue runners on srv101/srv104 with bad database configuration. Tried for half an hour to work out how to disable their switch ports, eventually gave up when I couldn't work out how to log in to asw3-pmtpa (the bulk of the half hour was used in determining which switch srv101 is connected to). Instead, firewalled them from db2, ixia and adler.
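  The actual rules weren't logged; a minimal sketch of firewalling the two job runners off a DB host (hostnames and the MySQL port are assumptions) looks like:

      # on db2 / ixia / adler: drop MySQL connections from the misconfigured job runners
      iptables -A INPUT -p tcp --dport 3306 -s srv101.pmtpa.wmnet -j DROP
      iptables -A INPUT -p tcp --dport 3306 -s srv104.pmtpa.wmnet -j DROP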

July 29

  • 19:15 JeLuF: moved enwiki thumbs to storage1
  • 05:00 Tim: disabled NRPE arguments, since they allow arbitrary shell commands to be invoked by any user with access to the NRPE port (config sketch after this list)
  • 04:35 Tim: samuel appears to be up, started mysqld on it
  • ~4:00 Tim: installed NRPE on adler, ixia, thistle, webster
  • 03:50 Tim: ran yum update on holbach
  • ~03:30 Tim: installed NRPE on FC4 servers db2 and lomaria using custom built RPM.
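  A rough nrpe.cfg sketch of the 05:00 change above (paths and the example command are illustrative): with dont_blame_nrpe=0, remote check_nrpe calls can't pass arguments, so each check is pinned to fixed options on the monitored host:

      # /etc/nagios/nrpe.cfg
      dont_blame_nrpe=0
      # arguments disabled, so thresholds are hard-coded per command
      command[check_disk]=/usr/lib/nagios/plugins/check_disk -w 10% -c 5% -p /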

July 28

  • 23:00 - 02:30 Rob & Tim: installed NRPE on storage1, storage2 and all ubuntu core DB servers. Installed NRPE from source on bart. Switched nagios to use NRPE for disk space monitoring on these servers.
  • 21:40 brion: seeing a huge rash of errors where a non-object is passed to Title::equals(). Added a live hack exception to try to track it down, but it's not being logged in exception.log
  • 21:05ish brion: fixed the categoryfinder infinite recursion bug -- lots less segfaults on labs...
  • 20:00 jeluf & mark: Stopped all services on friedrich, preparing for decommissioning. If you miss anything, be quick.
  • 06:18 brion: mass panic erupts as the update triggers several fatal errors from calls in 'skins' to things not yet in 'includes', etc., increasing the pain level of the code update
  • 06:09 brion: stopped srv71 apache, weird php fatal errs in log ('missing LocalFile' etc)

July 27

  • 23:33 brion: db9 mysql 5 test server has at least some bad entries in user_group table
  • 19:45 JeLuF: reinstalled knsq30, back in the pool
  • 14:30 JeLuF: rr.yaseo is down, switched to DNS-scenario yaseo-down, investigating

July 26

  • 02:20 Tim: changed master on cluster17 from srv101 to srv102. Took srv101 out of ext. store rotation. Pre-emptive action before it dies completely.
  • 02:14 Tim: Deployed a new version of wmerrors. Segfaults started spewing out everywhere, so I disabled it. srv101 went down, took it out of rotation.
    • srv101 is a current ext store master, apparently it's still doing mysql, but it's segfaulting with apache and not responding on ssh

July 25

  • 23:00 Tim: Attempting to upgrade librsvg on srv37. Involves upgrading the system from FC3 to FC4.
  • 21:35 Tim: added squid ACLs to make /w/thumb.php go to the rendering cluster, since the ubuntu apaches aren't set up properly for rendering. Also escaped the dots in the existing url_regex ACLs. (ACL sketch after this list.)
  • 20:40 RobH: Reinstalled sq21-sq30 with ubuntu 8.04.
  • 20:06 RobH: Reinstalled sq11-sq20 with ubuntu 8.04.
  • 19:14 RobH: Reinstalled sq1-sq10 with ubuntu 8.04.
  • 18:23 RobH: replaced /c1/p9 in storage2 and put in the rebuild for the array.
  • 18:17 RobH: replaced /c0/p3 and /c0/p6 in amane and put in the rebuild for the array.
  • 08:31 Tim: disabled UsernameBlacklist, nasty regexes cause the servers to crash. By some reports, 8/10 of account creation requests were failing.
  • 07:36 Tim: temporarily enabling core dumps on all apache servers, to see if I can track down the abort()s we're seeing
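  The 07:36 core-dump change is typically done along these lines (exact paths weren't logged and are assumptions):

      # in the httpd init environment, before apache starts
      ulimit -c unlimited
      # write cores somewhere with space, tagged with program name and pid
      echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern

  Apache's CoreDumpDirectory directive is another way to pick the dump location.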
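  A minimal squid.conf sketch of the 21:35 thumb.php ACL change above (the peer name 'renderers' and the exact regexes are assumptions, not the real config):

      # route thumb.php requests to the rendering cluster peer
      acl thumbphp urlpath_regex ^/w/thumb\.php
      cache_peer_access renderers allow thumbphp
      # in url_regex ACLs, literal dots need escaping:
      acl uploadhost url_regex ^http://upload\.wikimedia\.org/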

July 24

  • 20:29 RobH: knsq23-knsq29 reinstalled to 8.04.
  • 16:45 brion: fixed ownership on new dirs on upload4 -- fixes upload problem 14906
  • 16:18 RobH: knsq21-knsq22 reinstalled to 8.04.
  • 15:23 RobH: knsq16-knsq19 reinstalled to 8.04.
  • 13:45 Tim: installed ganglia on storage2, added storage1 and storage2 to nagios (but no working services), freed up a little bit of space on storage2 (was full)
  • 13:06 Tim: returned srv128 to the pool
  • 07:53 Tim: srv128 showing spurious OOM errors. Took it out of rotation (DID NOT RESTART), so that I can have a look at it with gdb in a couple of hours when I have time.

July 23

  • 21:47 RobH: knsq11-knsq15 reinstalled to 8.04.
  • 21:10 RobH: knsq1-knsq10 reinstalled to 8.04.
  • 20:43 brion: "fixed itself" -- may have been overload on amane/storage2, or from those slow procs on srv38, who knows. seems resolved now
  • 20:30 brion: massive thumbnailing errors on commons (thumb servers return 500 err)
  • 16:17 RobH: sq36-sq40 reinstalled to 8.04.
  • 15:51 RobH: sq31-sq35 reinstalled to 8.04.
  • 14:44 RobH: sq46-sq49 reinstalled to 8.04.
  • 14:00 RobH: sq41-sq45 had wrong lvs ip, fixed.
  • 13:50 brion: sighted but non-stable pages on de were being marked as 'noindex,nofollow' due to a logic buglet in FlaggedRevs. FlaggedArticle.php has been updated, but they'll be in cache. Sigh.

July 22

  • 23:35 RobH: reinstalled sq41 - sq45
  • 20:43 Tim: increased the number of job runners from 10 to 39.
  • 19:20 Tim: fixing redirect table on all wikis
  • 19:10ish brion: redirected donate.wikimedia.org to [1]; blank drupal page confuses people and davidstrauss didn't seem interested in fixing it to look nice
  • 19:05 RobH: srv65 kernel panic, reboot.
  • 19:00 RobH: srv78 kernel panic, reboot.
  • 15:37 mark: Upgraded yf1019 and bayle to Ubuntu 8.04 hardy.

July 21

  • 21:00ish brion: enabled collection extension, now with proper temp file usage, on test/labs
  • 15:57 mark: Deployed srv150 as an Ubuntu 8.04 Hardy application server for testing (it's pooled)

July 19

  • 11:03 mark: Pages containing timelines were giving HTTP 500 errors since some recent sync; reverted the live timeline extension to r31101 for now.
  • 07:30 Tim: setting up authentication and write commands on nagios.wikimedia.org

July 18

  • 08:00 brion: fixed img_auth on SSL sites
  • 07:35 Tim: removed srv61 from memcached rotation, is down. Added some servers to the mc-pmtpa.php spares list.
  • 07:19 Tim: cleaned up binlogs on adler, was 14GB free, now 84GB free

July 17

  • 13:45 brion: disabled collection again, there are stupid race conditions in the extension's CURL usage. wtf guys
  • 13:33 brion: got pediapress collection thingy at least sort of working. WSGI self-hosting just doesn't work at all. Fought a lot with Python's horrible eggs to tweak up all its awful ugly directories so the damn modules import. Enabled on test only atm. There's a warning that it causes segfaults with category loads.

Jul 16

  • 09:36 river: shut down HGTN peering, excessive failure to route packets anywhere useful

July 15

  • 22:00ish brion: popped in briefly to do code review and update live code. seems non-exploding so far

July 14

  • 19:00 mark: Built a test setup for my new LVS kernel module: mirrored all live traffic to alrazi (upload.wm.org) to lvs3 in a separate test VLAN with no outside connectivity
  • 06:00 Tim: installing mysql on db1 using the source tree in /home/midom/mysql/server
  • 05:15 Tim: samuel down, removed from rotation

July 12

  • 16:13 brion: trimming more stuff from storage2 (upload old versions, timelines not vital to backup)

July 11

  • 19:45 mark: srv78 went down again
  • 12:45 mark: srv39 (image rendering cluster) was out-of-memory and image rendering was mostly down, restarted apache to clear it up

July 10

  • 22:08 mark, rob: Reinstalled yaseo servers yf1000 - yf1017 with Ubuntu 8.04 Hardy and installed squid 2.6.21 (now in the repository). Switched traffic back
  • 21:48 brion: srv144 has been reporting 'read-only filesystem'. Have shut it down remotely.
  • 17:59 mark: srv37 is out of date and giving internal server errors, depooled it in LVS
  • 17:29 mark: Brought sq5 back up with a newly built squid-2.6.21 deb (hardy only), not yet in the repository.
  • 14:25 RobH: srv46 apache process was not responsive, synced it and restarted it.
  • 14:20 RobH: srv36 rebooted due to PDU swap, HDD died upon reboot.
  • 14:20 RobH: Swapped out the PDU for the switch feeds in C4.

July 9

  • 19:07 brion: srv37 appears to have a db.php from february -- something very wrong. (it's a scaler)
    • it's commented out of mediawiki-installation, so has not been receiving any scaps since! this probably explains the intermittent image thumb breakages
  • 16:13 brion: fixing mobile.wikipedia.org dns entry
  • 16:00 jeluf: unpooled knsq30, setting up some performance and config tests with OSM.
  • 08:34 mark: yaseo text squids under strong memory pressure again, moved traffic to pmtpa until I have time to look into it

July 8

  • 14:31 RobH: yf1016 reinstalled.
  • 14:08 RobH: yf1015 reinstalled.
  • 10:18 Tim: removed the firewall rule on adler which was preventing srv31 from connecting to it.
  • 8:45 mark: yaseo text squids ran out of socket memory

July 7

  • 21:49 RobH: reinstalled yf1010(mark), yf1011, yf1012, yf1013
  • 18:51 RobH: db1 reinstalled.
  • 18:07 brion: restarting apache on srv135, let's see if it continues
  • 18:04 brion: huge rash of 'out of memory in WebStart' on srv135; shutting off its apache.
  • 09:24 mark: yf1000 has no SSH host keys and appears to have no root password set - inaccessible, needs reinstall. Blocked its Squid process by sysrq force unmount.
  • 09:15 mark: yf1001 had no SSH host keys, created them, rebooted it, synced, back online
  • 08:40 mark: yaseo text squid cluster in trouble, run into swap. Decreased cache_mem to 1000 and restarted backends

July 5

  • 22:45 domas: added gmond on db7, thistle, db4. Stopped db10 for data copy to db7; trying to make db7 rotate two hourly snapshots
  • 22:00 brion: KNAMS seems to have fallen off of ganglia; pascal is down

July 3

  • 21:51 brion: finally got the damn software updated. yay!
  • 20:23 mark: Added AAAA record for hume.wikimedia.org (and therefore, static.wikipedia.org).

July 2

  • 15:29 RobH: srv124 crashed, back online.
  • 15:28 RobH: srv78 kernel panic, back online.
  • 15:27 RobH: srv107 back online and in lvs.
  • 15:11 RobH: srv107 broken, rerunning the bootstrap.
  • 14:52 RobH: db7 back online, reinstalled, replaced network cable. Needs db setups.
  • 03:21 Tim: deployed wmerrors extension, see wikitech-l

July 1

  • 22:45 domas: Collection was causing Categoryfinder to go into infinite recursion, thus causing segfaults. I need stickers 'I love Magnus' :)
  • 18:45 mark: Implemented the Cloud.
  • ~13:00 Tim: resynced /usr/local/dsh/node_groups/apaches and dalembert:/usr/local/etc/apaches. Symlinked node_groups/apaches_pmtpa -> apaches. Resynced nagios accordingly. Deleted the "Apaches 1 CPU" group from gmetad. Reassigned srv2 to the miscellaneous group, stopped apache on it, removed apache-specific lines from /etc/rc.local.
  • 8:00 jeluf: stopped apaches on srv101 and srv112, they are segfaulting several times per second.

June 30

  • 19:00 domas: restarted a few segfaulting httpds, as well as a few that were burning excessive CPU cycles; started httpd on boxes that didn't have it running. Killed stale child processes yet again.

June 29

  • 18:45 jeluf: srv107 depooled, needs a reboot.
  • 18:42 jeluf,rob: Switch in rack C4 powered up again, rack reachable again
  • 18:30 jeluf: added 10 more memcached candidates to mc-pmtpa.php
  • 18:00 jeluf: Rack C4 outage again, like last sunday. Contacted Powermedium. Shut down postfix on bart. going to change memcached.

June 27

  • 07:39 Tim: killed srv129 with sysrq-trigger, was segfaulting
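  Killing a wedged box via magic sysrq looks roughly like this (whether Tim used 'b' for reboot or 'o' for power-off isn't recorded):

      echo 1 > /proc/sys/kernel/sysrq     # make sure sysrq is enabled
      echo b > /proc/sysrq-trigger        # immediate reboot, no clean shutdown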

June 26

  • 22:38 Tim: Set mysql root password on srv179 (OTRS), was blank, now standard
  • 18:34 RobH: srv140 needed a poke, apache had crashed.

June 24

  • 20:20 brion: unmounted dead albert dirs from srv31.
  • 20:15 brion: poking at storage2 -- disk full again. trimming some old stuff
  • 20:14 RobH: Synced and restarted apache on srv31

June 22

  • 23:10 brion: updated static dump robots.txt to exclude 'new' and 'downloads' (sketch after this list)
  • 17:48 mark: All servers reachable again, reinstated Exim config on mchenry, repooled apaches
  • 15:07 mark: Added temporary hack to Exim config on mchenry to have it not check the OTRS db for address existence, but just defer addresses not accepted earlier
  • 14:54 mark, domas: Installed new memcached instances on srv151..168, replaced all down instances, depooled down servers in apache LVS
  • 14:20 mark, jeluf: One rack in Florida seems to be down. Mark informs hostway. OTRS can't reach its DB, postfix on bart stopped.
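  The 23:10 robots.txt change above would look roughly like this (the exact paths on the dump host are assumptions):

      User-agent: *
      Disallow: /new/
      Disallow: /downloads/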

June 21

  • 17:00 brion: added 'shared' dir on *.planet.wikimedia.org for shared styles/icons/etc
  • SSL cert on the IMAP server has expired.

June 20

  • 23:58 brion: experimentally enabled pdf download (special:collection pediapress thingy) on test.wikipedia.org. Still some major problems with this system; many pages don't render properly (eg, at all)
  • 21:13 Tim: recompiled ircd with nicklen=20 and maxclients=10000
  • 20:30ish brion: tweaked dir permissions on all top-level upload dirs so the automatic subdir creation for new wikis should work consistently
  • 18:05 brion: syslog was broken on suda -- nothing going to /var/log/messages. restarted syslog, seems ok now.
  • 17:55 brion: leuksman.com was offline for several hours. Sago sends automatic hourly emails "your server doesn't respond to ping!" but doesn't reboot it until you ask them. yay!

June 19

  • 21:15 Tim: spike on enwiki DB servers, possible outage, blocked offending client in squid
  • 16:15 RobH: db5 hdd replaced, reinstalled, needs setup.
  • 16:00 RobH: sparky pulled and boxed for return.
  • 15:28 RobH: amane disk replaced, raid needs rebuild.

June 17

  • 19:28 RobH: srv133 back online.
  • 18:54 RobH: srv51 moved and re-racked, back online.
  • 17:33 brion: srv133 hanging, probably 'read-only filesystem' or similar. shutdown via sysrq
  • 17:00 RobH: moved srv126 - srv123 to B1
  • 16:38 mark: Made catchall aliases work for CiviCRM on mchenry, added MX records to donate.wikimedia.org.
  • 15:40 RobH: srv76 shutdown due to primary HDD failure.
  • 15:35 RobH: srv141 shutdown briefly to move into its new home in B1.
  • 15:30 RobH: srv142 shutdown briefly to move a switch out of rack. Back online.

June 16

  • 14:52 mark: Moved test.wikipedia.org from srv3 to srv35 so the former can be decommissioned.

June 12

  • 20:19 RobH: srv27 shutdown per rainman, as it's not working and needs to be decommissioned.

June 11

  • 19:45 domas: srv101-srv109 are part of ES duty as cluster17-19. all raid1, myisam, set up by http://p.defau.lt/?GA97Jegyef8uQGQJTTFILg
  • 18:00 RobH: srv141 rebooted, back online.
  • 17:04 mark: Moved rendering cluster LVS from avicenna to alrazi so avicenna can be decommissioned
  • 12:22 mark: Upgraded khaldun to Ubuntu 8.04 hardy
  • 11:54 mark: Shut down albert's switchport in preparation for decommissioning it
  • 11:32 mark: Changed static.wikipedia.org to point to hume, removed Squid configuration for it
  • 11:08 mark: Moved mirror.pmtpa.wmnet (Fedora mirror) to a proxy setup on khaldun so we can retire albert
  • 03:42 Tim: Switched back to Preprocessor_DOM, uses 4 times less memory on Chicago test case

June 10

  • 19:56 brion: disabled spam blacklist on internal private/fishbowl wikis so jay can use tinyurl *rolleyes*
  • 16:37 RobH: srv78 rebooted, back online.
  • 00:01 brion: added a generic wgReadOnlyFile setting to all wikis in 'closed' group that don't have one specifically set.

June 9

  • 20:50 brion: removed srv81 from cluster6 db rotation. The downed slave was causing timeouts in transwiki imports due to crappy failover in ES connections
  • 19:04 brion: re-cleared localnames and globalnames tables to fix CentralAuth unattached lists. Only localnames had been cleared, but nothing got lazy-populated since there was already a globalnames entry.
  • 16:44 brion: srv141 rebooted (not online yet)
  • 16:39 brion: srv141 read-only filesystem
  • 16:30ish brion: setting up private collab.wikimedia.org

June 8

  • 16:00 - 20:00 mark, gmaxwell: IPv6 AAAA reachability testing, which lets us determine how much would break if we'd put an AAAA record on our main hostnames
    • Installed Hardy on iris
    • Set up lighttpd on v4 and v6
    • Added hostnames ipv6.labs.wikimedia.org, ipv4., ipv6and4. and results. to DNS
    • Put a modified version of this script in [[w:en:Special:Watchlist]]'s javascript
  • 14:00 domas: db10 died (disk 02 went 'foreign'; after recreating the array it came back up); db1 died a few hours earlier, reporting DIMM7 errors. A few MCE events complained about L2 cache though.

June 7

  • 09:21 Tim: /mnt/ganglia_tmp on zwinger was full, fixed.
  • 02:00 Tim: started new static HTML dump on hume

June 6

  • 12:00 - 12:45 mark: Network migration and firmware update. Split vlan 100 into 100 (Squids/LVS) and 101 (public services). Reloaded csw5-pmtpa with newer image.
  • 12:00 RobH: srv147 hdd dead, powered down pending rma.

June 5

  • 21:45 RobH: ssl enabled on srv9
  • 02:34 Tim: Danny B. asked me to delete the entire archive for wikimediacz-l, so I moved the private archive directory to wikimediacz-oldprivate, moved the mbox to wikimediacz-oldprivate.mbox, recreated a blank mbox, and regenerated the empty index with "arch".

June 4

  • 22:56 brion: updated PHP didn't change things. shutting srv134 back off
  • 22:51 brion: srv134 still giving mystery errors. reinstalling PHP
  • 19:43 RobH: sq5 reinstalled.
  • 19:40 RobH: srv134 memory tests passed, restarted.
  • 19:34 RobH: db1 reinstalled with Ubuntu 8.04.
  • 18:45 RobH: db1 mainboard and cpu replaced.
  • 17:30 RobH: hume memory upgraded from 2 GB to 8 GB.
  • 07:38 Tim: TorBlock extension enabled
  • 02:35 Tim: cleaned up binlogs on srv139

June 3

  • 22:16 brion: started two more dump threads to keep things churning while the big boys run
  • 20:46 mark: Brought knsq28 back up after its broken disk has been replaced

June 2

  • 20:24 brion: refreshing category counts for enwiki > 'Unprintworthy_redirects', was last one that was reached in the original batch process. Some inconsistent counts reported in the Ws.
  • 09:40 robert: removed srv30 from search_pool_1 on diderot. It's decommissioned and seems to cause balancer problems
  • 02:59 brion: manually depooled srv30 from diderot's LVS balancer for search... again... why isn't pybal taking care of this?

May 30

  • 19:36 domas depooled db9 -- mysql 5 test server had bad dataset, missing revisions
  • 16:19 brion: added more wikis to closed.dblist -- somebody started closing things by hardcoding wgReadOnly instead of setting lock files, so they weren't seen on the first pass. Sigh.
  • 15:00 RobH: yongle updated to Ubuntu 8.04.

May 29

  • 19:33 brion: disabled "apc.stat=off" in php.ini on srv3 and srv2. srv3 was breaking updates to InitialiseSettings.php on test.wikipedia.org
  • 17:42 RobH: srv146 fsck and back online.
  • 17:23 brion: shutting down srv134, has been reporting mysterious "Possible integer overflow" errors which may indicate bad RAM, corrupt software, or some other mystery problem
  • 16:30 domas: upgraded db1 to hardy, datadir severely desynced, will work on it later

May 28

  • 21:09 brion: excluded closed wikis from CentralAuth to avoid interference. closed.dblist provides a "closed" group for InitialiseSettings.php
  • 17:23 brion: reenabling $wgCentralAuthCreateOnView for now
  • 16:28 RobH: srv150 to srv130 now have IPMI connected via msw-b1-pmtpa
  • 15:59 RobH: srv67 to srv80 moved off asw2-pmtpa to asw-b2-pmtpa
  • 15:29 RobH: srv78 kernel panic, rebooted.
  • 15:29 RobH: srv79 accidentally rebooted.
  • 15:17 RobH: srv130 kernel panic, back online. Synced back into cluster in db.php.
  • 14:47 RobH: srv146 was shutdown in rack? Back online, cluster16 back online.
  • 14:32 RobH: db1 dimm4 swapped, back online.
  • 02:31 Tim: Site went down briefly due to ext store master srv139's disk filling up. Fixed.

May 27

  • 23:50 brion: disabling local account autocreation on page view for now (controlled via $wgCentralAuthCreateOnView); it's too darned annoying
  • 23:11 brion: running batch-updates of all globalname/localname records (some missing entries are messing up due to lazy initialization)
  • 22:30 brion: CentralAuth tweaks: disabled centralauth special perms for meta sysops (should just be bcrats for now); fixed global cookies for SSL
  • 20:18 brion: reverted Whatlinkshere to state of r35370; was broken by use of subqueries in 35371 and following updates.
  • 19:00 domas: db9 is in persistent state of testing, upgraded to hardy, 5.0.64, etc. not for production use yet.
  • 18:04 RobH: renewed secure.wikimedia.org cert on bart
  • 15:40 domas: srv146 is down; it's the cluster16 master, so disabled cluster16 for now.

May 26

  • 21:30 brion: tweaked up wap portal to fix JPEG images (PHP recompile -- missing JPEG libraries) and have a safer, more reliable image loading method
  • ~06:15 - 08:15 Tim: created wikis listed on bugzilla:13264 and bugzilla:14252.

May 25

  • 01:44 brion: upgraded wap portal server to PHP 5.2.6, seems to fix some crashing cases

May 24

  • 23:05 brion: updated wap portal so the search engine actually works
  • 19:51 brion: freeing up some space on storage2, restarting dump runners
  • 19:33 brion: storage2 full; poking...
  • 19:29 brion: remounted storage2 on the dump runners; NFS got broken by the reboot and stuck the clients
  • 11:43 mark: Whoops, I sent UAE upload to rr.knams a few days ago... corrected.

May 23

  • 21:42 brion: disabled write api on testwiki for now... it seems to accept edits over GET which is very.... not good.

May 22

  • 21:11 brion: ran migrateStewards on meta -- global steward flags active
  • 17:49 mark: Upgraded storage2 from Ubuntu 6.10 to 8.04 and rebooted it
  • 17:14 brion: stuck 79.115.44.59 in the proxy blocker set for now (mass link spamming)
  • 17:06 brion: commonswiki dump was stuck again (was on old stickable code, iirc). killed the fetch thread.... seems to be stopped/over/eh?
  • 05:00 brion: enabled SimpleAntiSpam for test logging

May 21

  • 19:09 brion: hacked bugzilla templates to avoid the big scary error when you log in after a password reset. got one too many complaints about the damn thing. :D
  • 18:22 RobH: srv133 went read-only. Ran FSCK to correct errors, restarted.
  • 18:20 RobH: srv149 back online, no issues in fsck.
  • 15:11 mark: Upgraded GnuTLS on Exim servers, restarted Exim, replaced ssh keys for mail sync jobs

May 20

  • 22:51 mark: Shutdown srv133's switchport

May 19

  • 19:52 brion: $wgEnableWriteAPI on testwiki
  • 15:52 brion: poking at srv133, read-only filesystem
  • 15:40 brion: poked stuck dump thread (commonswiki)... had a stuck fetchText subprocess. (should be fixed in svn now)
  • 02:40 brion: took srv149 off network (ifconfig eth2 down :D)
  • 02:10 brion: disabled $wgShowUpdatedMarker for now, it seems wonky
  • 01:33 brion: recovered some disk space from rabanus, but seem unable to account for a bunch of space o_O
  • 01:00 brion: srv149 borked with read-only filesystem. srv150 borked in some unspecified way (login problems). rabanus disk full

May 18

  • 8:20 JeLuF: Enabled the replaced disk on amane so that the disk is also being used.

May 16

  • 17:12 RobH: db1 memtest passed, no memory errors reported, but the OS detects memory errors! As the OS tests better than memtest, an RMA has been placed for the memory in slot 4. Server currently offline.
  • 17:00 RobH: srv130 kernel panic. Restarted.
  • 17:00 RobH: srv101 had a stalled apache, synced and restarted.

May 15

  • 23:55 brion: commented out down srv130 and srv127 from cluster13. failover to srv129 was working, but vveerrryyy slowly. [2] was taking 9s for a handful of revisions; this broke transwiki special:import due to our strict HTTP timeouts internally
  • 20:16 brion: got cs.planet actually working now
  • 18:27 brion: setting up cs.planet.wikimedia.org DNS, adding it to generation/update list
  • 14:22 mark: Upgraded ssh on mayflower so weak keys are detected and denied access

May 14

  • 21:10 brion: restarted lsearchd on maurus, was borked -- is currently the only non-main-namespace search server in enwiki cluster, so it broke some searches. also restarted sshd, which was mysteriously on the fritz and rejected rainman's logins
  • 17:27 RobH: replaced disk in amane
  • ?? mark: restarted pybal on diderot
  • 16:55 brion: manually depooled srv30 from LVS on diderot. Either pybal isn't being used to do health checks on search pools or it's not working.
  • 16:50 brion: per report on WP:VPT, seem to be seeing a lot of search failures on enwiki. investigating

May 13

  • 22:30 brion: logevents api back on, allegedly fixed
  • 21:39 brion: putting flaggedrevs back on. api has been reenabled, with logevents query disabled by a nice clean exception
  • 21:26 brion: scapped everything up to date, flaggedrevs still off. DISABLED API due to still having bad queries
  • 20:58 brion: db5 and webster up to date. samuel in process...
  • 20:53 brion, jeluf, domas: we managed to (hopefully) reconstruct the broken statement from the adler binlog file, and got db5 resyncing from that point. If it seems to be going well we'll put the other two s3 slaves in shortly. (s3 still in read-only)
  • 20:21 brion: removed a bunch of bogus old keys for srv150-srv170 from zwinger's /etc/ssh/ssh_known_hosts file. hopefully will clear up the broken sync issues
  • 20:15ish brion: disabled flaggedrevs; old patch is bogus. still busy with other problems before resyncing to current
  • warning: the new db.php config doesn't appear to allow marking a whole cluster read-only; this is a problem for maintenance
  • 20:00ish jeluf, brion: replication is broken on s3; corrupted binlog. investigating
  • 19:30-50ish brion: some fun times w/ DB overload as bad API queries flooded DBs. reverted updated code to r34539 plus domas's patch for flaggedrevs. still having some ssh key problems, want that sorted out before tackling it again
  • 18:37 RobH: Upgraded packages on yf1000-yf1009 and regenerated keys.
  • 17:33 mark: Ran weak key detection script on mayflower and did chmod 000 on matching authorized_keys files - Brion will contact these users and ask for new keys (sketch after this list)
  • 17:31 RobH: Upgraded packages on knsq1 - knsq30, and regenerated keys.
  • 17:18 RobH: Upgraded packages on db5, db8, db10, mchenry, sanger, srv8, srv10, srv151-srv170, sage, mint, mayflower, fuchsia, and regenerated keys.
  • 15:50 RobH: Upgraded packages on bayle, did not regenerate key, as it used to be 6.10 with original key generation.
  • 15:49 RobH: Upgraded packages on webster, adler, ixia, thistle, hume, db1, db3, db4 and regenerated keys.
  • 15:33 RobH: Upgraded packages on isidore and regenerated keys.
  • 14:51 RobH: Upgraded packages on sq46 through sq50 and regenerated keys.
  • 14:46 RobH: Upgraded packages on sq31 through sq45 and regenerated keys.
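  The 17:33 weak-key pass above was presumably against the Debian/Ubuntu OpenSSL blacklist (CVE-2008-0166). The actual script wasn't logged; a sketch using ssh-vulnkey from the openssh-blacklist package would be:

      # lock out any authorized_keys file that contains a blacklisted (weak) key
      for f in /home/*/.ssh/authorized_keys; do
          if ssh-vulnkey "$f" | grep -q COMPROMISED; then
              chmod 000 "$f"
          fi
      done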

May 12

  • 22:20 brion: wikimania2009.wikimedia.org and se.wikimedia.org set up
  • 19:19 brion: adding DNS stub for se.wikimedia.org
  • 17:02 brion: query.php now consistently disabled when current API is

May 11

May 9

  • 18:05 brion: load spike due to unindexed API recentchanges queries. reverted circa last day's changes to API rather than dig around in there

May 8

May 7

  • 19:55 brion: srv30 read-only filesystem; needs to be taken out and examined for disk problems. (I shut it down for the time being)
  • 17:50 brion: created new global groups tables; trying full scap again
  • 17:40 brion: site broken by CentralAuth upgrades which silently added use of a table that's not present. reverting pending addition of tables

May 6

  • 19:40 brion: fixed bugzilla upgrade
  • 00:15ish tim: activated FlaggedRevs on dewiki again

May 5

  • 00:00ish brion: restarted leuksman.com; server was down most of the day (Sago sends you an email every hour your server is down, but doesn't reboot it until you ask :)

May 3

  • 14:01 mark: Restarted lsearchd on maurus
  • 13:56 mark: Upgraded will to Ubuntu 8.04 Hardy
  • 01:12 brion: disabled flaggedrevs on dewiki. Some problems with reviewed pages list and the UI disrupting page layout, which need to be resolved before we put it back on.
  • 00:55 brion: starting up test deployment of flaggedrevs on dewiki. FlaggedRevs config is in flaggedrevs.php. Disable the section if it's causing trouble over the next couple days!
  • 00:34 brion: clearing off old log files from maurus again. we need some log rotation or to dump those logs :)
  • 00:24 brion: maurus out of space again
  • 00:15 brion: updating test, labs wikis for separate flaggedpages table. debugging some deployment issues

May 1

  • 23:10 brion: removed commons' foreign repo config from itself, so we don't get dupe file warnings :)
  • 22:50ish brion: reenabled newpages user filter for non-affected wikis. index use is right now \o/
  • 19:15 RobH: srv149 rebooted.
  • 19:00 RobH: srv36 rebooted.
  • 18:33 RobH: srv78 kernel panic. Rebooted.
  • 00:08 brion: switching it back off, doesn't seem right.... insanely slow
  • 00:05 brion: putting newpages username search back except for the four wikis affected by bad rc_user_text indexes; wgEnableNewpagesUserFilter is off for them

April 30

  • 18:35 brion: added $wmgUseDualLicense switch to InitialiseSettings.php. Set this to true for new wikis which should be created with the GFDL+CC-BY-SA 3 dual-license mode to set their default copyright link properly.

April 28

  • 18:00 brion: turned off the double-diff-then-log (no hits since Saturday, yay). Turned on a log for the logging code to check issues with the updated log code
  • 15:00 jeluf: mysqld on db5 hanging. Couldn't shut it down or even kill -9 it. Had to reboot the box. mysqld is currently recovering.

April 26

  • 21:56 brion: bad diff logging indicated that problems were only on fc4 apaches. possibly a c++ version mismatch? recompiled wikidiff2 RPMs fresh on fc3 and fc4; upgraded the fc4 boxes, log's stopped dead. so far so good
  • 21:38 brion: cleared old mwsearch indexes off rabanus, resyncing mw inst.
  • 21:23 brion: bumped diff cache version to force diff regeneration
  • 21:20 brion: enabled bad diff hack -- runs every diff twice, logging in baddiff.log if they don't match. (the shorter text is then returned, which may reduce the incidence of visible diff errors)
  • 21:17 brion: rabanus disk full
  • 17:50 jeronim: on db2 in ntp.conf, changed "restrict 66.230.200.234 nomodify notrap" to "restrict 208.80.152.189 nomodify notrap" to match the "server 208.80.152.189" line below it. Output from ntpq -p looks much better now, showing an IP address in the refid column instead of ".RSTR."
  • 16:53 mark: Installed lvs2, lvs3 and lvs4 for testing
  • 15:58 mark: Installed Ubuntu 8.04 on lvs1 for testing
  • 15:58 mark: Ubuntu 8.04 Hardy Heron installs are now possible on all VLANs
  • 12:09 jeronim: did /etc/init.d/ntpd restart on db2 which fixed clock offset of about 6 seconds; underlying problem not fixed

April 25

  • 20:09 mark: lily under extreme load, investigating
  • 19:28 brion: added 'Vary: Cookie' HTTP header to blogs... don't know if it'll do a damn thing, I can't even clear things from these squids using the normal methods
  • 18:33 brion: upgraded blog.wikimedia.org and whygive.wikimedia.org to wordpress 2.5.1
  • 17:13 brion: fixed MWSearch extension to use Http::get() instead of file() to hit the backend. This should resolve the load spikes we've been seeing around 7:30-8:00 UTC daily; the servers slow down while indexes are being resynced, and the long default timeouts caused things to back up on the front end instead of failing out gracefully.
  • 16:55 brion: upgraded utfnormal extension on srv42 so dumps will work again. (note that dumpBackup.php no longer works when autoselecting database connections, probably a bug due to the new load balancer. works in live use as a server is explicitly passed on command line.)
  • 07:56 river: /var/lock on lily became full from the spamd bayes database; moved it to /var/spamd. expired the old bayes database because its size was causing spamd to be very slow (30+ seconds per mail).

April 24

  • 05:50 Tim: fixed wikidiff2 on fedora apaches, was missing since 5.2.5 upgrade.

April 23

  • 19:54 brion: restarted apache on bart (secure proxy), seems happier
  • 19:50 brion: secure.wikimedia.org connections hanging
  • 00:25 brion: resynced db2's clock; was 7 seconds slow, causing all s1 slaves to think they were lagged, causing all enwiki jobs runners to sit waiting for them to catch up

April 22

  • 00:42 brion: enabling wgEnableMWSuggest globally for a few minutes to evaluate DB impact

April 21

  • 23:18 brion: enabled $wgCookieHttpOnly -- new session & user token/name/id cookies should be sent HttpOnly, so supporting browsers won't expose them to JavaScript as an additional protection against some categories of XSS
  • 23:10 brion: upgrading php on srv141, was down during 5.2 updates
  • 22:26 brion: got a report of a commons image with missing archive versions. Files are present on upload4 but not on upload3... which is odd because as far as I can tell, only the thumbs are used on upload4 for commons. Why is there a full copy of commons, and why don't they match?
  • 22:13 brion: getting lots of complaints from scap about time sync. clock offsets mostly <1s but some >3s
  • 20:55 RobH: lvs3 and lvs4 racked and remote access enabled.
  • 19:44 RobH: db4 reinstalled.
  • 19:44 RobH: lvs1 and lvs2 racked and remote access enabled.
  • 19:26 RobH: thistle reinstalled.
  • 17:30 RobH: db1 unresponsive, rebooted.
  • 17:30 RobH: racked srv141 and brought back online

April 20

  • 15:15 mark: squid on khaldun had disappeared due to an upgrade a few days ago and a dependency conflict with the Wikimedia packages
  • 13:00 mark: Depooled srv2 and srv4, the only remaining 32 bit apaches in rotation.

April 19

  • 18:00 mark: srv133's time was off, corrected

April 18

April 17

  • 21:50 brion: lowering db4 priority from 150 to 50; still loaded
  • 21:10 brion: lowering db4 priority from 200 to 150; seems very highly loaded compared to db3 with same priority
  • 20:25 RobH: Relocated srv136 & srv135.
  • 19:40 RobH: Relocated srv137
  • 19:25 RobH: Relocated srv138. Put ext store cluster 14 back in service.
  • 19:19 brion: applying pt_title encoding fixes
  • 18:59 RobH: Relocated srv141, srv140, srv139, srv138.
  • 18:50 RobH: Removed ext store cluster 14 from active use.
  • 18:44 mark: Removed AAAA record on khaldun.wikimedia.org, apparently apt doesn't even try v4 when it has a proxy hostname with an AAAA record and a v6 route is not available.
  • 18:44 mark: Fixed httpd on pascal
  • 18:20 brion: fixed ganglia reporting knams -> pmtpa (old zwinger IP in trusted list on pascal); detail reporting still down due to broken httpd on pascal
  • 18:10 mark: Fixed MySQL group in ganglia by making ixia an aggregator again
  • 18:11 RobH: srv143 and srv142 relocated.
  • 17:58 brion: enabled search suggestion drop-down on testwiki
  • 17:13 RobH: srv144 relocated.
  • 17:00 RobH: srv145 relocated.

April 16

  • 22:46 brion: enabled TitleKey extension, search suggestions, and HttpOnly cookies on wikitech
  • 21:40ish brion: hopefully fixed the php5.1 bug with global sessions on secure.wikimedia.org
  • 21:21 RobH: srv150 relocated.
  • 21:11 RobH: srv149 relocated.
  • 21:06 brion: enabling global sessions on secure.wikimedia.org
  • 20:57 srv148 relocated.
  • 20:47 brion: restarted data dumps on srv31 and srv42
  • 20:45 srv147 relocated.
  • 20:31 brion: cluster16 back in rotation; tim restarted mysql
  • 20:25 brion: rash of complaints of db errors due to srv146 being out (cluster16 ES master). took cluster16 out of $wgDefaultExternalStore while it's being fixed
  • 20:11 RobH: srv146 relocated.
  • 16:52 brion: fixed ticket.wikimedia.org redirect to otrs
  • 10:50 brion: got a mystery SMS complaining of 5-minute lag on dewiki

April 15

  • 23:55 brion: giving planet its own little user account :)
  • 22:24 brion: PMTPA databases, all KNAMS, and all YASEO are missing from ganglia and have been for a while. What's going on?
  • 19:00 mark: Cleaned up csw5-pmtpa's config, added BGP inbound filtering on prefix lists and known bogons
  • 17:35 brion: rc_user_text index is missing from frwiki, nlwiki, plwiki, and svwiki. Special:newpages was using it in some cases; have disabled the index and the username lookup feature for it pending fixes.
  • 00:25 brion: updated SpecialNewpages.php to tweak index forces per domas's request; new pages was causing some sort of problem

April 14

  • 23:55 brion: gettin' ready to svn up! applied flaggedrevs_promote table on test & labs, and the centralauth gu_token field
  • 19:10 brion: restarted IRCD, was hanging mysteriously
  • 16:25 RobH: srv130 synced and apache restarted.
  • 16:00 RobH: srv0 and benet powered down pending drive wipe for decommissioning.

April 13

  • 10:46 Tim: pybal on diderot was depooling servers due to name lookup failure (timeout). Traced the problem back to nscd and restarted it, that fixed it.

April 12

  • 00:15 brion: robots.txt may or may not be fixed for blog.wikimedia.org; some kind of freakish default, probably from wordpress 404 handling, redirected it to robots.txt/ (with final slash), which apparently disallowed everything by default (?!). Added a plain file... but caching is still serving the redirect as far as I can see
  • 00:10 brion: sql script doesn't work for non-wiki dbs such as 'centralauth' and 'oai' at the moment; lookup fails
  • 00:02 brion: setting up sr.planet.wikimedia.org

April 11

  • 12:48 mark: Discovered that lighttpd does not allow caching of unknown content-type responses. amane was serving quite a lot of unknown content types, which were consequently not cached by the Squid clusters. Fixed this by adding a lot of content types to lighttpd.conf, as well as a default content-type in case any are missed.
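  A lighttpd.conf sketch of the fix mark describes (the type list is illustrative; the empty extension acts as the catch-all default):

      # "" is the default content type for any extension not listed
      mimetype.assign = (
        ".png" => "image/png",
        ".jpg" => "image/jpeg",
        ".svg" => "image/svg+xml",
        ""     => "application/octet-stream"
      )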

April 10

  • 21:30 jeluf: fixed nagios' conf.php, to reflect the latest db.php changes.
  • 16:50 brion: restricted wfNoDeleteMainPage to enwiki which I presume it was added for. It's a huge nuisance for other wikis which quite legitimately are rearranging their content.

April 9

  • (all day) mark: Restarted various daemons on lots of servers to get DNS resolver libs to use the new DNS IPs (mostly nscd, apache, some mysql)

April 8

  • 21:35 brion: fixed (?) bad nsswitch.conf on bart (nis -> ldap)
  • 16:48 brion: adjusted new $wgExpensiveParserFunctionLimit to match old $wgMaxIfExistCount
  • 16:38 Tim reenabled search
  • ?????? Tim disabled search sitewide
  • 7:40-8:40 Tim: the lack of a FORCE INDEX caused LogPager queries to be extremely slow. The site eventually went down when the cumulative query load built up sufficiently. Took a bit of time to disable the queries properly, kill the MySQL threads, and get the site back up.
  • 07:40 Tim: updated to r32943
  • 07:30 jeluf: restored .procmailrc for OTRS. We've lost all mails coming in between 0:38 and 7:30 UTC. I can't find them in /var/spool/mail, and they didn't go to OTRS. Any idea where postfix has put them?
  • 07:19 Tim: deleted 100GB of binlogs on ixia
  • 04:30 jeluf: migrated some of the changes that I've made to our OTRS. Installed a big red MOTD message on the login screen.
  • 01:10 brion: reinstalled OTRS FAQ module, fixing broken ticket zoom.
  • 00:40 brion: upgraded OTRS to 2.1.8. If you have information about the patches that were previously applied, please provide them! They have not been copied over since it's unclear what's what.

April 7

  • 18:29 RobH: srv117 shutdown due to failed HDD. RMA placed.
  • 18:18 RobH: db1 rebooted due to hard lockup.
  • 17:25 Tim: running maintenance/archives/upgradeLogging.php on various (eventually all) wikis
  • 00:10 brion: running a bzip2 integrity check on enwiki-20080312-pages-meta-history.xml.bz2; .7z is cut off

April 6

  • 11:24 mark: Changing resolver IPs on all servers
  • 05:10 Tim: cleaned up binlogs on srv139 and srv146

April 5

  • 17:42 mark: lighttpd on storage2 had run out of FDs and crashed. Increased the limit.
  • 16:52 mark: Stopped announcing prefix 66.230.200.0/24 in BGP.
  • 16:00 mark: Removed old IPs from various servers.

April 4

  • 19:52 brion: srv117 is borked; logins hanging

April 3

  • 21:18 brion: moved dump monitor thread to srv31; stale ruwiki dump marked correctly as aborted now. NOTE: IPs for storage NFS mounts should be changed when enwiki and dewiki dumps finish..........
  • 21:15 brion: killed dump & sitemap processes on benet. we're retiring it...
  • 15:59 RobH: Removed vincent, biruni, kludge, humboldt, & hypatia from all dsh groups and apache pool for decommissioning.

April 2

  • 22:01 RobH: isidore updated with newest wordpress installation for blog and donation blog.
  • 17:55 RobH: db1 rebooted.
  • 17:45 brion: added bart's new ip to known proxy list
  • 17:32 mark: Renumbered friedrich
  • 16:07 mark: Renumbered srv8, bayle
  • 15:57 mark: Renumbered srv9 and srv10
  • 15:43 mark: Renumbered yongle
  • 15:34 mark: Renumbered isidore
  • 15:26 mark: Renumbered browne
  • 15:10 mark: Renumbered storage1, anthony
  • 14:16 mark: Renumbered storage2, will
  • 14:00 mark: Restored symlinks in /etc/powerdns/templates/, be careful when working on/copying those files, they are heavily symlinked!
  • 13:15 mark: Renumbering bart to new IP range
  • 11:00 - 11:30 mark: Reloaded csw1-knams with new firmware; temporarily moved traffic to florida

April 1

  • 08:00 domas: db1 didn't like oracle migration, crashed

March 31

  • 4:30 JeLuF: Added srv145 back to external storage pool 'cluster16'. Added srv130 back to external storage pool 'cluster13'.
  • 4:00 JeLuF: Fixed mysql on srv81 and srv145. On srv138, resolved "out of diskspace" situation. The second disk was not mounted and both mysql datafiles were on one disk only.

March 28

  • 18:57 RobH: sq12 back online from lockup.
  • 18:46 RobH: Replaced DIMM4 in srv166
  • 18:09 RobH: srv51 back online from kernel panic.
  • 17:59 RobH: srv78 & srv81 back online from kernel panic.
  • 17:55 RobH: srv130 & srv131 back online from kernel panic.
  • 17:46 RobH: srv145 back online, was powered down?

March 26

  • 19:00 brion: previous fix had a bug which broke wikis with language variants. fixed.
  • 18:20 brion: Worked around mystery segfaults with voodoo fix (r32477)
  • 17:26 brion: mysterious crashes on private wiki root redirects, still trying to diagnose (backtrace)
  • 15:26 mark: Set up sq50 as temporary LVS balancer instead of avicenna, so it's not a squid atm.
  • 15:00 mark: PyBal's configuration file had a syntax error, causing LVS to go down. Avicenna completely swamped and unreachable.
  • 14:08 mark: Rendering cluster down due to OOM kills on all 3 servers. Killed apaches and restarted them.

March 25

  • 22:31 brion: disabled CentralAuth debug log; found the bug i was looking for :)
  • 22:22 brion: enabled CentralAuth debug log

March 24

  • 23:11 brion: set default perms for upload to autoconfirmed except on commonswiki... this may be rolled back or changed if unpopular
  • 17:50 brion: restarting category builds on commons and enwiki
  • 17:45 brion: poked around old paypal post urls

March 21

March 20

  • 19:25 brion: restarted lighty on storage2; was down mysteriously
  • 16:53 storage2's lighty appears to have died... had lots of errors about too many open files etc
  • 12:53 RobH: srv150 back online.
  • 12:46 RobH: srv81 rebooted from kernel panic.

March 19

  • 23:55 brion: starting batch category table population...
  • 23:27 brion: updating code; stub updatelog and category tables applied. will populate tables after gone live...

March 18

  • 17:49 brion: benet crashed again. moving DNS for dumps.wikimedia.org over to storage2. it had a lighty pointing to a now-empty backups directory; pointed it at the currently-used dir for dump storage instead.
  • 17:00 and earlier -- some network issue with PowerMedium? Large packets dying on routes through HGTN. mark did something to the network to cut our PowerMedium route? Can't reach the 66.230.200.* network from outside now; secure.wikimedia.org and planet.wikimedia.org at least are still using these addresses publicly
  • 08:45 mark, JeLuF: Routing knams-pmtpa switched to another provider, dns switched to "normal". Everything looks fine. During the "knams-down" time, request rate in pmtpa dropped, needs further investigation.
  • 08:30 JeLuF: Lost connection pmtpa-knams, switched DNS to scenario "knams-down".
  • 07:23 Tim: hume's v1 partition is 92% full, set up a symlink farm to start filling v2.
  • 01:18 brion: secondary problem was some kind of overload on avicenna (pmtpa text LVS). river managed to tweak it into submission by taking it off net for a couple minutes. things appear up for now
  • 01:06 brion: packet loss down from 33%+ to about 4%... can reach ganglia consistently, still some outage issues
  • 00:18ish brion: major net issues in tampa? lots of packet loss; cpu down dramatically

March 17

  • 19:48 brion: fixed upload dir on wikimania2008wiki
  • 18:00 jeluf: srv51 is down. Replaced by memcached on srv65.

March 16

  • 15:28 mark: Renumbered mchenry to the new v4 IP range
  • 14:47 mark: Renumbered sanger to the new v4 IP range
  • 14:18 mark: Bound IPv6 IPs on csw5-pmtpa's vlan routing interfaces - so most if not all servers will have acquired one or more IPv6 addresses. Renumbered khaldun to the new IP range and published its IPv6 record as AAAA record in DNS (for apt.wikimedia.org)

March 13

  • 21:19 mark: Shutdown srv150's switchport, it has a ro fs and doesn't react to IPMI.
  • 19:55 brion: reenabled search result context for anons on LuceneSearch wikis
  • 04:28 Tim: enabled CentralAuth in dry-run mode on all wikis

March 12

  • 21:26 brion: de.labs thumbs mysteriously broken again. who knows...
  • 21:05 brion: poked at thumb-handler.php ... it was apparently pointing to the wrong backend URL for de-labs (de.labs) etc. Hacked in a special case for non-wikipedias.... which may well be even more broken. Look at this again... :P
  • 17:10 brion: dissolved mediawiki-ng-l list. Too much forced moderation and no mission meant it was never seriously used.

March 11

  • 18:57 brion: swapped LuceneSearch for MWSearch plugin on test.wikipedia.org and commons.wikimedia.org. Search front-end now includes thumbnails for image page results, which is kind of handy. :) Will do a little more testing before swapping wholesale; there are still UI differences and things which should be improved.

March 10

  • 20:25 brion: arbcom_enwiki was missing from dblist files (except private.dblist). Added it back to all.dblist and special.dblist, works again.
  • 19:07 brion: installed svn 1.4.6 on zwinger in /usr/local/svn; use this to svn up if the old version keeps whining
  • 18:36 brion: zwinger's old copy of svn (1.2.3) has decided that it can't deal with something in our repository (extensions/DumpHTML/wm-scripts). :(
  • 18:02 brion: removed the evil transclusion at Server admin log/All which caused updates of this log page to be insanely slow, by forcing links refresh of 12 huge log pages all combined into a giant page of death
  • 17:47 brion: set chapcom lang to 'en' instead of defaulting to 'chapcom'. special: page links now working instead of ':Userlogin' etc. not sure why it did that; seemed fine in command-line tests
  • 17:32 brion: reported language config issues on chapcom; examining
  • 16:54 brion: fixed spider blocks. :P
  • 16:37 brion: blocked an evil spider IP from mayflower; SVN http back up
  • 16:28 brion: mayflower overloaded in some way; load avg 147 o_O

March 9

  • 17:13 brion: en.labs.wikimedia.org and de.labs.wikimedia.org have FlaggedRevs testing configurations enabled. Still doing imports from en.wikibooks on en.labs, though. (Internal names are de_labswikimedia and en_labswikimedia.)

March 8

  • 08:45 Tim: cluster14 was inexplicably missing mywiki. No data loss, it's been missing since the cluster was created, apparently. Added it.
  • 11:09 Tim: srv81 is down. Removed it from external storage rotation.
  • 11:00 brion: updated hawhaw; WAP portal now looks nice in Mobile Safari on the iPhone SDK simulator app

March 7

  • 23:34 brion: importing de.wikibooks to de.labs.wikimedia.org....
  • 21:59 brion: setting up stub en.labs.wikimedia.org and de.labs.wikimedia.org for flaggedrevisions testing
  • 12:05 domas: srv25 has 40GB of lucene logs. disk full.
  • 12:00 domas: resynced samuel from db1, db5 remaining
  • 11:46 Tim: running dumpHTML on hume with 16 threads
  • 08:00 domas: s3 master switch, samuel_bin_log.171:224349875 to adler-bin.002:3522
  • 00:28 Tim: Updated zwinger:/etc/ntp.conf
  • 00:19 Tim: updated MySQL grants for new subnet

March 6

  • 23:26 Tim: added 208.80.152.128/26 to suda:/etc/exports and srv1:/var/yp/securenets. Created checklist at IP addresses
  • 06:49 brion: noticed zwinger can't access database servers since the IP renumbering. :P
  • 00:48 RobH: hume installation complete.

March 5

  • 23:57 brion: leuksman.com was offline for a while (net problems at sago)
  • 14:12 RobH: srv65 back online.
  • 13:59 RobH: srv150 back online from kernel panic.
  • 13:38 RobH: upgraded kernel in storage2
  • 13:28 RobH: srv127 back online from kernel panic.
  • 13:27 RobH: upgraded kernel in storage1

March 4

  • 22:30 mark: Changed dhcpd.conf on zwinger, firewall setup on khaldun and dhcp forwarding on csw5-pmtpa to make installs work from the new IP ranges.
  • 22:00 mark: Migrated zwinger onto the new IP range, changed its DNS entry to 208.80.152.189.
  • 19:08 brion: took out read-only
  • 19:05ish brion: put in temporary limit of Special:Newpages to 200; lots of reads with limit 5000 on dewiki were bogging down holbach. DB overload cleared up.
  • 18:53 brion: taking s2 and s2a to read-only temporarily while we work out this overload issue
  • 18:40 jeluf: DB servers for s2a cluster (dewiki) overloaded. ixia logs:
    [5100027.207458] Machine check events logged
  • 18:25 (large CPU spike up on mysql and apaches; continuing...)
  • 11:00 domas: db1 and adler are running compacted/fixed schema/tablespaces - next targets are db5 and samuel, master switch imminent

March 3

  • 21:18 brion: removed the special-case in lucene configuration for testwiki to use srv79; that host seems to have an experimental version of the lucene server which is currently broken. Search now works on testwiki
  • 18:57 mark: srv65 went offline, taking its memcached instance with it. Replaced the memcached slot by the last spare one.
  • 16:00 RobH: yf1019 kernel upgraded.
  • 16:00 RobH: yf1018 kernel upgraded.
  • 15:36 RobH: yf1016 kernel upgraded.
  • 15:36 RobH: yf1015 kernel upgraded.
  • 15:27 RobH: henbane kernel upgraded.
  • 14:59 RobH: sage kernel upgraded.
  • 14:51 RobH: mayflower kernel upgraded.
  • 14:41 RobH: hawthorn kernel upgraded.
  • 14:35 RobH: lily kernel upgraded.

March 2

  • 11:30 Tim: Not sure what the deal was. Cleaned up the mount options a bit: reduced the timeout, switched from TCP to UDP (lost TCP connections cause temporary hangs), removed "intr" (useless in soft mode). Remounted. (fstab sketch after this list)
  • 11:17 Tim: amane immediately locked up again due to hang on NFS read of storage1. Unmounted /mnt/upload4 temporarily to restore service.
  • 11:09 Tim: restarted lighttpd on amane, was broken
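  An fstab-style sketch of the options Tim describes in the 11:30 entry above (the export path and exact timeo/retrans values are assumptions):

      # soft + udp + short timeout, no "intr": a hung storage1 returns I/O errors
      # instead of wedging amane indefinitely
      storage1:/export/upload  /mnt/upload4  nfs  soft,udp,timeo=30,retrans=3  0  0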

February 29

  • 21:15 RobH: restarted ssh and put srv61 back in pool.
  • 21:15 RobH: brought srv130 back from kernel panic.
  • 19:56 RobH: Racked hume, new static-dump server. DRAC: 10.1.252.190. DHCPD needs modification to netboot this subnet.
  • 14:26 Tim: Removed /etc/cron.daily/find from all ubuntu apache servers that had it. Killed all long-running sort commands.

February 28

February 27

  • 22:22 RobH: Shutdown srv11-srv20 + srv6. (Old, warranty expiring, causing heat issues in that rack, per mark)
  • 18:34 RobH: upgraded kernel on will
  • 18:23 RobH: upgraded kernel on mchenry & sanger
  • 18:05 RobH: upgraded kernel on bayle
  • 18:00 RobH: upgraded kernel on khaldun
  • 17:45 RobH: upgraded kernel on srv9 & srv10
  • 17:37 RobH: upgraded kernel on yongle

February 26

  • 23:59 RobH: upgraded kernel on yf1009
  • 22:48 RobH: upgraded kernel on yf1005 to yf1008
  • 22:14 brion: rebuilding enwiki-20080103-pages-meta-current.xml.bz2 (as -2 for now) on srv31
  • 21:30 to 22:10 RobH: upgraded kernel on yf1002 to yf1004
  • 19:45 RobH: fixed replication on srv77 to srv8
  • 14:12 Tim: started lighttpd on benet, had crashed again

February 25

  • 23:51 brion: someone mucked up wgRemoveGroups on srwiki, listing pretty much every permission they could think of. pared it down to array( 'bot', 'patroller', 'rollbacker', 'autopatrolled')
  • 20:00 RobH: yf1001 security updates.
  • 19:58 RobH: yf1000 security updates.
  • 19:45 brion: maurus disk space filled up for a bit; there's a 39gb log file in /usr/local/search/log. Freed up some space from old index data; recommend adding some log rotation to search servers!

February 22

  • 21:33 RobH: srv171-srv175 kernel and security updates.
  • 20:32 RobH: srv161-srv170 kernel and security updates.
  • 20:00 RobH: srv151-srv160 kernel and security updates.
  • 16:53 RobH: sq33-sq40 kernel and security updates.
  • 16:34 RobH: sq24-sq32 kernel and security updates.
  • 16:09 RobH: sq16-sq23 kernel and security updates.
  • 15:52 RobH: sq41-sq50 kernel and security updates.
  • 05:15 Tim: Applying schema updates patch-page_props.sql and patch-ipb_by_text.sql
  • 02:00 - 04:45 mark: Migration of office DSL connections to Cisco 2841 - server is policy routed over the lower speed connection.

February 21

  • 22:42 RobH: sq10 - sq15 updated (kernel and security updates.)
  • 21:45 RobH: sq2 - sq9 updated (kernel and security updates.)
  • 20:08 RobH: sq1 updated (kernel and security updates.)

February 20

  • 23:53 RobH: knsq28 seems to not be rebuilding. Letting mark know.
  • 23:45 RobH: Upgraded kernel and such on knsq16 through knsq22 (apt-get upgrade). Not distro upgrade.
  • 23:21 RobH: Upgraded kernel and such on knsq8 through knsq15 (apt-get upgrade). Not distro upgrade.
  • 22:15 RobH: fuchsia brought back up by mark. All traffic remains routed to PMTPA (while rob finishes squid upgrades.)
  • 22:15 RobH: fuchsia down. All traffic routed to PMTPA.
  • 21:56 RobH: Upgraded kernel and such on knsq23 through knsq26 (apt-get upgrade). Not distro upgrade.
  • 21:30 RobH: Upgraded kernel and such on knsq1 through knsq7 (apt-get upgrade). Not distro upgrade.

February 18

  • 21:15 brion: manually mounted upload4 on srv189. Was not created in /mnt or listed in fstab.

February 17

  • 7:30 jeluf: suda's root FS was 100% full. Changed logrotate.conf to rotate logs daily instead of weekly, added switch.log to the log rotation.
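  A sketch of the logrotate change (file path and retention count are assumptions):

      # /etc/logrotate.conf - rotate daily instead of weekly
      daily
      rotate 7
      # keep switch.log in the rotation too
      /var/log/switch.log {
          daily
          compress
          missingok
      }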

February 13

  • After 18:44 RobH: Reinstalled db1 OS.
  • 18:44 RobH: rebooted srv37 from crash, back online.
  • 18:35 RobH: Restarted apache on srv166 per domas.
  • 15:03 RobH: storage2 disk 12 replaced. and is rebuilding

February 11

  • 03:38 Tim: srv61 is refusing ssh connections, still serving HTTP. Depooled.

February 10

  • 10:40 domas: db1 still needs fixing..
  • 07:30 Tim: upgrading the remaining squids with ~tstarling/squid/squid-upgrade.php. The script will upgrade one squid every two hours, in random order. This mitigates the effect of the cache clear for items with a Vary header (i.e. text). sq17 and sq18 were done during script testing.
  • 06:18 Tim: upgraded squid on sq16, including XVO feature
  • 05:40 Tim: srv150 accepts connections on SSH or HTTP and then hangs for a long time. Removed it from mediawiki_installation and apaches and depooled it.

February 8

  • 01:40 Tim: added "hidden" table (oversight) on wikis that didn't have it. Added it to addwiki.php.

February 7

  • 17:43 mark: Wrote a Mailman withlist script to change the embedded web_page_url variable to use https, as this is not possible using config_list.
  • 15:00ish to 16:30ish RobH: lily lighttpd.conf changed to support/redirect mailman with SSL certificate.

February 6

  • 17:45 brion: updated bugzilla to 3.0.3
  • 16:13 Tim: MW configuration changes:
    • Renamed some wikimedia-specific globals from $wgXxxx to $wmgXxxx. Some of them had rather obvious names that could potentially conflict with extension configuration in the future.
    • Moved passwords and private keys out to PrivateSettings.php
    • Changed SiteConfiguration.php to allow "tags" such as "fishbowl" and "private" to be applied to wikis. These tags can be used to specify settings in InitialiseSettings.php.
    • Used these tags to full effect by using fishbowl.dblist and private.dblist to set the fishbowl and private tags, and then removing all the fishbowl/private wiki lists from InitialiseSettings.php. This will make adding new private wikis easier. (A sketch of a tagged setting follows at the end of this day's entries.)
    • Fixed some whitespace and removed some old commented-out code
    • Moved various ancient subdirectories of /h/w/common to /h/w/junk/common
  • 14:43 RobH: srv166 had a memory error, reseated memory, and restarted server.
  • 14:22 RobH: storage2 disk 2 replaced. Not rebuilding? (please show rob how to force this.)
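  Sketch of a tagged setting for the 16:13 entry above (hedged: only the fishbowl/private tag idea comes from the entry; the variable chosen and the exact SiteConfiguration/InitialiseSettings.php syntax are illustrative assumptions):
      // Wikis listed in fishbowl.dblist / private.dblist get the matching tag,
      // and a tag can then be used like a wiki name when specifying a setting:
      'wgEnableUploads' => array(
          'default' => true,
          'private' => false,   // applies to every wiki tagged 'private'
      ),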

February 4

  • 21:11 RobH: isidore now running bugzilla.wikimedia.org with an SSL cert.

February 3

  • 11:47 mark: lighttpd disappeared on storage1 and was also inaccessible from the new IP range due to an old and broken firewall. Why was it there? Removed it.
  • 11:25 mark: Move traffic back to pmtpa

February 2

  • 20:30 mark: Added new service IPs to bayle and mchenry, the pmtpa DNS resolvers, and a new service IP for ns0.wikimedia.org on bayle.
  • 20:15 mark: Forgot that we have some DNS records pointing at 66.230.200.100 directly, so those were down for a while until I updated DNS.
  • 17:52 mark: Moved all text.* traffic to knams as well
  • 17:04 mark: Put Canadian traffic on pmtpa, to seed those caches a bit
  • 14:40 jeluf: storage1 overloaded. Killed static dump processes on srv136, srv135, srv134, srv133, srv132, srv131, srv42
  • 13:15 mark: Updated upload Squid configs to use the new pmtpa IP range, causing immediate pmtpa CARP cache clear, but mitigated by the knams squids.
  • 11:37 mark: Moved all upload.* traffic to knams, to prevent an effective CARP cache clear due to IP address changes swamping amane.

February 1

  • 20:19 brion: reverted r30405 which broke boardvote and re-enabled the ext
  • 20:10 brion: broken boardvote extension... was breaking all special pages; temporarily disabled the ext
    • Feb 1 20:08:18 kluge httpd[12208]: PHP Fatal error: Call to undefined function wfBoardVoteInitMessages() in /usr/local/apache/common-local/php-1.5/extensions/BoardVote/GoToBoardVote_body.php on line 3
  • 11:15 domas: restarted lighty on benet, did run away?

January 31

  • 10:53 Tim: deleted binlogs on srv146
  • 00:12 brion: svn.wikimedia.org resolved to an old 145.* address from anthony... since that doesn't work anymore, it's making svn access a pain while seeing about updating the wap interface. Tried to update resolv.conf with current values from zwinger, but still no dice.
    • have temporarily resorted to /etc/hosts hack

January 30

  • 22:25 brion: various reports of "blank pages" and/or 503 errors from Peru. Nothing narrowed down yet on our end.
  • 20:35 brion: switched Apple Dictionary app backend to OpenSearch. bumped MaxClients on yongle up to 20, may resolve the 'gets really slow for no reason' issue
  • 20:10 brion: enabling TitleKey sitewide. (Indexes should be rebuilt overnight to ensure they're up to date for changes in the last 15 hours.)
  • 05:54 brion: building TitleKey indexes generally (not fully enabled yet so opensearch isn't useless until done; want them built first)
  • 05:25 brion: experimenting with TitleKey ext on testwiki
  • 04:50 Tim: Fixed thumb-handler to not attempt to "cache" files locally on storage1. Removed bacon from /h/w/upload-scripts/sync.

January 29

  • 21:58 mark: Raised persistent_request_timeout on the backend squids from the default 2 minutes to 10 minutes, to make existing connection reuse even more likely between all communicating pairs of squids
  • 10:30 Tim: Setting up storage1 as a static HTML dump storage server. Installed ganglia on it.
  • 09:10 Tim: updatedb was running on storage1, attempting to index millions of files. Killed it, added /export to PRUNEPATHS, and re-ran it. Seems to work.

January 28

  • 22:30 brion: csw5-pmtpa has been spewing alarms about 5/3 and 5/4 optical connections for a while. :(
    • domas says this is harmless -- an unused port
  • 18:50 brion: svn revert'd a live hack in Parser.php which apparently added a $clearState parameter to Parser::internalParse() that never gets passed in, spewing billions of lines of PHP warnings into the error logs

January 24

  • 21:00 jeluf: installed lighty on storage1, configured squid so that all dewiki image requests and all commons thumb image requests go to storage1. Images fast again, backend request rate down to normal level.
  • 18:40 brion: images still very slow :(
  • 14:00 mark: Assigned new, extra IP addresses to Florida Squids, and added the new IP range to all squid.conf's. Also removed the old knams IP range, which has been unused for over 2 months. This seems to have caused a massive cache clear in knams upload squids, causing a huge increase of image requests and overload of Amane. A real explanation is as yet unknown... speculation is that old objects in knams caches were somehow invalidated because they had the (now removed) old IP prefix in their caching info.

January 23

  • 02:09 Tim: reverted refresh_pattern changes in squid (ignore-reload) to fix user JS/CSS problems. With Brion's blessing.

January 22

  • 20:46 mark: Set $wgUserEmailUseReplyTo back to false, as mchenry will now rewrite envelope sender addresses from MediaWiki to wiki@wikimedia.org
  • 16:12 Rob: srv11 back online
  • 15:55 Rob: srv130,srv132,srv134 back online, see detailed server pages for crash information.

January 21

  • 12:30 jeluf: mark reports twice as much backend requests as usual. live-patched opensearch_desc.php to send proper Cache-Control headers. Needs to be updated in SVN. Backend request rate back to normal levels.
  • 07:10 brion: set $wgUserEmailUseReplyTo to protect against SPF failures and privacy leakage due to bounce messages in user-to-user emails. (Caused by sSMTP, which forces the envelope sender and From: address to be the same.) This uglifies user-to-user emails but avoids those problems. In the long term I recommend replacing sSMTP with a minimal postfix or something like we used to use, which should work in a safe manner. (A config sketch follows at the end of this day's entries.)
  • 03:24 brion: taking srv184 out of apache rotation to test ssmtp config issues
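  Sketch of the 07:10 change above (hedged: a plain LocalSettings.php-style line, not necessarily the exact CommonSettings.php hunk):
      // Put the user's address in Reply-To and use a generic From:, so sSMTP's
      // forced envelope-sender == From: doesn't break SPF or leak addresses via bounces.
      $wgUserEmailUseReplyTo = true;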

January 20

  • 21:45 jeluf: unpooled srv183, investigated why NFS mounts were missing after a reboot. Seems to be related to https://bugs.launchpad.net/ubuntu/+source/sysvinit/+bug/44836 . The fix suggested in that bug seems to help. Have to package it tomorrow.
  • 21:40 brion: mounted NFS shares on srv183
  • 21:39 brion: srv183 was rebooted 2h55m ago. its apaches are running, but NFS shares aren't mounted. nothing works properly. led to several reports of captcha failures, and might have led to some upload-related issues
  • 18:30 jeluf: rebooted srv183, un-killable convert jobs were blocking port 80
  • 18:29 brion: apache not restarting on srv164, srv176, srv183, srv184 -- "(98)Address already in use: make_sock: could not bind to address 0.0.0.0:80"
  • 18:25 brion: killed job runner jobs on srv90-99, they were the error-spewers. syslog is clean. :D
  • 18:18 brion: several apaches in srv90-99 range still spewing errors, but seem to have the right file. stuck apc?
  • 18:11 brion: removed the random '$key' parameter from MessageCache::transform
  • 18:06 brion: space was filled by /var/log/messages and /var/log/syslog; runaway PHP warnings from some live hack extra parameter. truncating the log files and resyncing
  • 17:56 brion: turned off their apaches. looking for the space culprit.... they have most of their space wasted in a /a partition and a tiny / where all the stuff is
  • 17:53 brion: lots of srv's in 150-190 range out of disk space; broken (LocalRepo.php update failed)
  • 11:12 brion: file histories were broken for a few minutes (bad commit got through)
  • 07:08 brion: enabling $wgFileRedirects on test.wikipedia

January 19

  • 06:29 and a bit before - brion: some brief segfaulting due to a bad recursion in my SiteConfiguration update. Note: non-string values in InitialiseSettings.php (false, null, ints, etc) will now work.

January 18

  • 22:46 brion: wikibugs was idle for an hour or so due to being autoblocked for bounces again...
  • 22:40 brion: srv11 is hung; no ssh, HTTP opens but doesn't respond
  • 18:40 brion: created wikimedia-sf mailing list

January 16

  • 22:30ish brion: someone tried to delete sandbox on en.wikipedia, leading to various DB error warnings (transactions full) and breakage of most editing for nearly an hour. Have hacked in a 5000-revision limit on deletions, will prettify it shortly.
  • 21:39 brion: Added a default "Cache-control: no-cache" header on output in CommonSettings.php. This will prevent blank PHP fatal error pages and the like from getting cached due to a 200 result code and a lack of cache-control headers. Actual cache-control output will override the default one. (Had to manually purge a Special:Random result on en.wikipedia... various issues with editing etc.) (A sketch follows at the end of this day's entries.)
  • 07:32 brion: fixed IRC recentchanges name for wikimania2008.wikimedia (was sending to the 2007 channel)
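  Sketch of the 21:39 default-header change above (hedged: not the exact CommonSettings.php hunk):
      // Send no-cache early in the request; MediaWiki's own Cache-Control output
      // later overrides this for normal responses, but a blank fatal-error page
      // that dies before that point now won't be cached by the squids.
      header( 'Cache-Control: no-cache' );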

January 15

  • 21:00 jeluf: removed memcached on srv56,57,58 on rainman-sr's request. Memcached was causing problems with the indexer.

January 14

  • 21:33 brion: clearing a giant watchlist on users' request; may cause some s1 replag
  • 21:00ish brion: we seem to be getting blank PHP fatal error pages stuck in squid caches. :( latest php should mark these as 500...
  • 20:00 Rob: All yaseo upload squids upgraded.
  • 19:45 Rob: All yaseo text squids upgraded.
  • 18:45 Rob: Upgraded squid on sq41-sq50
  • 17:45 Rob: Upgraded squid on sq11-sq15
  • 17:00 Rob: Upgraded squid on sq6-sq10
  • 17:00 Rob: Upgraded squid on sq1-sq4
  • 16:20 Rob: Upgraded squid on sq32-sq40
  • 16:20 Rob: Upgraded squid on sq24-sq31
  • 16:03 Rob: Upgraded squid on sq16-sq23
  • 15:26 Rob: Upgraded squid on knsq16, knsq17, knsq18, knsq20, knsq21, knsq22.
  • 15:00 Rob: Upgraded squid on knsq8, knsq9, knsq10, knsq11, knsq12, knsq13, knsq14, knsq15

January 13

  • 20:34 mark: Enabled access log on mayflower's apache (why was it disabled?)
  • 18:12 mark: Upgraded all knams text squids to new squid version
  • 17:30 mark: Set refresh_pattern . 60 50% 3600 ignore-reload on all text squids to override reload headers
  • 17:00 mark: Upgraded knsq1 to the new Squid
  • 16:15 mark: Brought up knsq19, and installed a new squid 2.6.18-1wm1 on it, including Domas' Accept-Encoding normalization patches. If you notice anything weird, notify Mark or Domas...
  • 04:25 Tim: Updated MW from r29455 to r29682.

January 12

  • 11:00 domas: removing titleblacklist. there's a certain level of crap beyond which I won't fix stuff.
  • 03:10 brion: importing checkuser logs
  • 02:59 brion: upgrading to current CheckUser code (per-wiki logs for now)

January 11

  • 12:00 domas: installed lighty on zwinger for ganglia use

January 10

  • 17:00 domas: disabled CentralNotice

January 9

  • 21:00 domas: increased revtext ttl to 1w, fixed parser cache ttl problem, where magicwords were causing most of enwiki (and other template-aware wiki) pages to be cached for 1h only (r29511)
  • 09:00 domas: memcached arena increased to 158GB, 79 active nodes, ES instances getting lower buffer pools on servers running memcached (1000M to 100M), full cache drop
  • 00:14 brion: now that we've expanded storage2 and removed a bunch of useless thumb and temp files from the amane backup, there's room again; have restarted dump runs, including a continuation run of enwiki (which should start up from meta-current)

January 8

  • 22:33 jeluf: extended storage2:/export by 650 GB
  • 22:03 brion: uploads broken for several minutes by r29361 (reverted)
  • 21:48 brion: srv17 and srv18 are whining about high temperatures
  • 21:00 Rob: srv17 segfaults in httpd, resynced and restarted apache.
  • 17:10 Rob: srv78 Kernel Panic, rebooted and back online.
  • 16:45 Rob: srv177 cpu overheating, pulled, replaced thermal paste, back online.
  • 16:20 Rob: srv15 cpu overheating, pulled, replaced thermal paste, back online.
  • 16:15 Rob: srv189 back in rotation.
  • 14:59 Rob: srv189 reinstalled, needs apache setup.
  • 14:54 Rob: srv130 rebooted and back online.
  • 07:50 domas: added db8 and db10 to ganglia

January 7

  • 08:34 Tim: mounted upload4 on albert for static.wikipedia.org symlinks

January 6

  • 21:33 mark: Enabled TCP ECN on lily and mayflower
  • 21:03 mark: Added mayflower's EUI-64 address to DNS - svn may use it.
  • 20:06 mark: Added a v6 service IP to lily (lists.wikimedia.org) and put it in DNS.

January 4

  • 00:34 brion: restarting backup syncs from amane to storage2; was broken by bad script... trimming more thumbnails out of storage2 to clear up space

January 3

  • 19:29 brion: starting enwiki dump on srv42, will continue with general worker thread
  • 19:13 brion: Setting up srv42 to run dump worker threads as well as general batches, since it seems idle.
  • 15:05 mark: Rebooted fuchsia with an LVS optimized kernel, moved all LVS services back onto it
  • 13:45 mark: LVS on fuchsia overloaded, moved LVS for upload to mint
  • 00:26 brion: http://download.wikimedia.org/ now running off storage2. will restart dump runs aiming at it until we have a better place to put the backend (with benet still not checked for its disk issues)

Archives