20:00 domas: livehacked out rounded corner references in schulenberg main.css - files were 404 :) Tim, that's outrageous! :-)
15:00 domas: blocked & entirely.
14:00 domas: noticed that we still maintain separate squid caches due to access encodings not only for crawlers, but for IE/Moz too. Patch at http://p.defau.lt/?C9GXHJ14GWHAYK1Pf0x9cw
December 28
21:44 brion: running cleanupTitles on all wikis (HTML entity and general issues)
15:53 jeluf: shut down apache on srv14, srv15, srv17. CPU0: Running in modulated clock mode. Needs to be checked.
12:58 mark: Blocked broken Recentchanges& requests on the frontend squids (ACL badrc)
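An ACL of roughly this shape in squid.conf would do the blocking; the `badrc` name is from this entry, but the exact pattern isn't logged, so the regex below is an assumption:

```
# Hypothetical reconstruction -- the real pattern behind ACL badrc was not logged
acl badrc urlpath_regex Recentchanges&
http_access deny badrc
```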
00:28 brion: wikibugs irc bot had been idle for about 2 days; on investigation its mailbox was disabled from the mailing list due to bounces. unblocked it.
00:12 brion: Special:Userrights restricted mode available for bcrats: add sysop, bureaucrat, bot; remove bot
December 27
22:21 brion: fixed Cite regression which produced invalid error message for ref names containing integers
21:58 brion: scapping updates
00:58 brion: stopped the dump worker thread on srv31 for the time being
00:56 brion: either multiple dump threads are running, or the reporter's really broken from benet being full. :)
December 26
21:50 jeluf: installed lighttpd on browne (irc.wikimedia.org), redirecting all requests to meta. See bug 11792
18:04 brion: starting rsync benet->storage2 again... it borked the first time. :P
December 25
06:00 jeluf: rsync completed. Symlinked dewiki upload dir to /mnt/upload4
December 24
23:30 jeluf: disabled image uploads on dewiki. Rsync'ing images from amane to storage1.
17:40 brion: configured stub fr and ru.planet.wikimedia.org
December 23
21:14 brion: setting up private exec.wikimedia.org wiki
06:48 domas: restarted all memcacheds with 1.2.4, complete cache wipe caused interesting caching issues
11:15 jeluf: configured mchenry to use srv179 instead of srv7 for OTRS mail address lookups.
11:00 jeluf: moved otrs DB from srv7 to srv179. Bugzilla and donateblog are still on srv7.
December 21
19:45 jeluf: restarted lighttpd on benet since dewiki and frwiki were not reachable. I don't understand why it helped, but it did.
17:06 brion: benet was rebooted; migrating all data to storage2. started monitor thread but no additional worker threads for the moment.
December 20
00:08 brion: benet went from being sluggish to being totally unresponsive. May need to reboot it; may have disk problems or something.
20:43 brion: migrating additional dump dirs to storage2 to balance disk more
20:20 jeluf: benet's disk is full. Reduced root reserve to 100 blocks
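Reducing the root reserve on an ext2/3 filesystem is done with tune2fs; benet's device name isn't logged, so this sketch runs against a throwaway image file instead of a real disk:

```shell
# Create a small scratch ext2 filesystem in a file (no root needed)
dd if=/dev/zero of=/tmp/scratch.img bs=1024 count=1024 2>/dev/null
mke2fs -F -q /tmp/scratch.img
# Drop the blocks reserved for root from the 5% default to 100 blocks
tune2fs -r 100 /tmp/scratch.img
tune2fs -l /tmp/scratch.img | grep 'Reserved block count'
```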
December 19
20:13 mark: 14 new knams squids pushed into production, one is broken.
December 18
02:43 brion: download.wikimedia.org wasn't showing the dewiki dir; possibly confused about the symlink to the other mount. Restarting lighty seems to have resolved it.
02:34 brion: enabled gadgets extension sitewide
December 17
20:44 brion: working on patching up the broken dumps... Benet crashed on srv6, killing one worker thread and the monitor thread. Dumps continued when benet came back online, but weren't reported. Monitor thread is now restarted, showing status. Additionally, freeing up space on storage2 so dewiki dump can run again. ... trimming thumbnails from the storage2 backup of amane uploads
03:39 brion: updated wgRC2UDPAddress for some private wikis
December 16
13:39 mark: Blocking HTTP OPTIONS requests on all frontend Squids
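Squid can match on the request method directly; a minimal squid.conf fragment (the ACL name is illustrative):

```
acl badopts method OPTIONS
http_access deny badopts
```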
December 14
14:55 mark: Moved patch between csw1-knams (in J-13) and lily (J-16), old one may be bad?
14:00 mark: Installed new SFP line card in slot 7 of csw1-knams. Brought up link to AMS-IX.
6:00 jeluf: started move of thumb directories of the commons from amane to storage1. Dirs get rsynced and symlinked in small steps. When the entire thumb dir is moved, only one symlink will be left. It would be nice to have the thumb dirs independent of the image dirs, but Mediawiki currently doesn't have a config option for this.
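The rsync-and-symlink step described above can be sketched as below; the paths are temporary stand-ins, not the real amane/storage1 mount points:

```shell
# Stand-ins for the old (amane) and new (storage1) storage roots
OLD=$(mktemp -d); NEW=$(mktemp -d)
mkdir -p "$OLD/thumb/a" "$NEW/thumb"
echo thumbdata > "$OLD/thumb/a/file.jpg"

# 1. Copy one thumb subdirectory to the new storage
rsync -a "$OLD/thumb/a/" "$NEW/thumb/a/"
# 2. Swap the original directory for a symlink so readers keep working
mv "$OLD/thumb/a" "$OLD/thumb/a.old"
ln -s "$NEW/thumb/a" "$OLD/thumb/a"
# 3. Once verified, remove the old copy
rm -rf "$OLD/thumb/a.old"
cat "$OLD/thumb/a/file.jpg"
```

Repeating this per subdirectory is what allows the move to happen in small steps; once everything is across, the per-directory links collapse into the single top-level symlink.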
December 13
23:00 jeluf: Mounted storage1's upload directory on all apaches. Moved /mnt/upload3/wikipedia/commons/scans to the new directory. Other directories have to follow.
December 12
19:25 jeluf: restarted replication on srv127
December 11
16:12 mark: mchenry under incoming connection flood, increased max connections. This hit srv7's mysql max connection limit of 100, upped that to 500.
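The live bump and its persistent form, assuming stock MySQL (values from this entry):

```sql
-- Applied on the running server; lost on restart:
SET GLOBAL max_connections = 500;
-- To make it stick across restarts, also add under [mysqld] in my.cnf:
--   max_connections = 500
```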
21:38 brion: fixed redirect from sep11.wikipedia.org; it pointed to sep11memories.org, which does not exist (though some browsers let it pass). Now trying the real www.sep11memories.org.
16:18 Rob: srv81 kernel panic. reboot, back online.
around 5: benet was found dead, PM rebooted it, server is back up, but lighttpd tries to load modules from /home/midom/tmp/thttpd/*, which does not exist. Needs to be checked.
December 5
14:25 brion: set up rate limits for rollback, awaiting broader rollback permissions
12:30 mark: Depooled knams, routing problems in surfnet
December 3
17:50 mark: Exim on lily was still bound to the old 145. IP for outgoing messages, so couldn't send out to the world. Fixed.
17:28 mark: Removed monitorurl options from Squid cache_peer lines now that a Squid bug has been fixed where it wouldn't detect revival of dead parents.
13:05 mark: Downgraded knsq3 kernel to Feisty's kernel, started frontend squid
10:10 mark: Cut the routing for 145.97.39.128/26 on csw1-knams.
December 2
05:24 Tim: Master switch to ixia complete. New binlog pos: (ixia-bin.001, 2779)
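Slaves are repointed at the new master using the logged coordinates; the connection details below are placeholders:

```sql
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'ixia',             -- placeholder host/credentials
  MASTER_LOG_FILE = 'ixia-bin.001', -- binlog position from this entry
  MASTER_LOG_POS = 2779;
START SLAVE;
```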
05:10 Tim: starting response
04:58 db8 down (s2 master)
December 1
21:30 mark: Recompiled 2.6.20 kernel on mint (upload LVS host at knams) with a larger hash table of 2^20 entries instead of 2^12. %si CPU usage seems a bit lower, will be interesting to see it during peak traffic. Also tested the newer 2.6.22 kernel in Ubuntu on sage, and it appears 2-3 times slower in this respect!
November 30
19:06 domas: srv78 was commented out in mediawiki-installation since Nov2. It was also up since then. Uncommented, resynced.
19:29 Rob: Reinstalling yf1000 for software distro upgrade.
19:25ish mark: Upgraded yf1009 and yf1008 to new distro.
19:28 brion: apparently there remain problems with <ref>; reverting back to 27647
19:10 brion: updating wiki software to current trunk
17:22 brion: disabled send_reminders option on all mailing lists, since it's annoying
17:15 Rob: Finished upgrades on squid software for sq41-sq50
16:56 Rob: Started upgrades on squid software for sq41-sq50
16:54 Rob: Finished upgrades on squid software for sq38-sq40
16:48 Rob: Started upgrades on squid software for sq38-sq40
16:46 Rob: Finished upgrades on squid software for sq27-sq37
16:21 Rob: Started upgrades on squid software for sq27-sq37
16:18 Rob: Finished upgrades on squid software for sq16-sq26
15:55 Rob: Started upgrades on squid software for sq16-sq26
15:53 Rob: Finished upgrades on squid software for sq1-sq15
15:00 Rob: Started upgrades on squid software for sq1-sq15
November 28
23:01 Rob: completed reinstalls of KNAMS squids. (sans gmond)
21:43 brion: fixed apc.stat setting on srv0, was set to off
20:55 brion: renewed SSL cert for wikitech.leuksman.com
18:00ish Rob & Mark: Started reinstalls of KNAMS Squid servers.
November 27
22:27 mark: Reinstalled knsq1 with Ubuntu Gutsy and newest Squid; the rest will follow tomorrow
15:09 brion: srv121 is closing ssh, hanging on http traffic. May or may not be occasionally responding with bogus version of software. Requested shutdown from PM support.
12:20 mark: Moving European traffic back to knams on the new IPs.
09:00 - 12:20 mark: Renumbered knams public services VLAN.
04:00 - 07:00 mark: Brought up AS43821, 91.198.174.0/24 on a new transit link.
03:55 mark: DNS scenario knams-down for maintenance
November 26
19:52 Rob: srv81 reinstalled due to OS being fubar. Rob is bootstrapping and bringing into service.
18:55 Rob: Replaced bad ram in srv155. Stopped apache on boot, as there is no 'sync-common' on server.
17:03 mark: Noticed that fuchsia (LVS knams) was overloaded: more input packets than outgoing. Set up mint as emergency LVS host for upload.wikimedia.org and moved the service IP.
November 25
07:00 brion: rewired some file layout on leuksman.com
November 24
22:15 jeluf: created new wikis: fiwikinews hywikisource brwikiquote bclwiki liwikiquote
17:30 mark: isidore swamped and unreachable, set some crude caching-header-override options in squid
19:10 jeluf: killed squid on sq23 by accident, restarted.
18:54 jeluf: started squid on knsq3, disabled sdc
18:45 jeluf: started apache on srv78
18:45 jeluf: deleted old binlogs on srv123
November 22
22:49 Noticed that a fully dynamic blog with no caching whatsoever was linked off every page on the site, and wonders if anyone could come up with a more brilliant idea to kill a server. Put Squid infrastructure in front. Wordpress currently sends no caching headers, so set a default expire time of 1 minute in Squid.
14:58 brion: enabled blog link for english-language fundraising notice... http://whygive.wikimedia.org/ (set up the blog itself yesterday)
11:00 domas: (following older events) srv155 Apache misbehaved (database errors, 500, etc), because / was full. / was full, because updatedb's sorts were filling the disk. updatedb's sorts were filling the disk because find was traversing all NFS shares. find was traversing all NFS shares because file system mounts were not recorded in mtab. file system mounts were not recorded properly in mtab, because mount wasn't clean. mount wasn't clean because /home/ was mentioned 3 times in fstab. Home was mentioned in fstab 3 times because, oh well. :)
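The root cause (a mount point listed three times in fstab) is mechanically detectable; a sketch that flags duplicate mount points in an fstab-style file:

```shell
# Write a sample fstab with /home listed three times (mirroring the srv155 case)
cat > /tmp/fstab.sample <<'EOF'
/dev/sda1  /      ext3  defaults  1 1
/dev/sda2  /home  ext3  defaults  1 2
/dev/sda2  /home  ext3  defaults  1 2
/dev/sda2  /home  ext3  defaults  1 2
EOF
# Print any mount point (field 2) that appears more than once
awk '!/^#/ && NF {count[$2]++} END {for (m in count) if (count[m] > 1) print m, count[m]}' /tmp/fstab.sample
```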
November 20
21:55 mark: Set persistent_request_timeout to 1 minute in the frontend squid conf, since newer Squid versions increased the default value, causing greater FD usage.
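In squid.conf this is a one-line setting:

```
# Idle persistent connections release their FD after a minute
persistent_request_timeout 1 minute
```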
17:57 brion: patched OTRS to avoid giving out e-mail addresses when requesting password resets
Created a new squid 2.6.16-1wm1 package, included a patch by Adrian Chadd to solve the unbounded http headers array problem. Deployed it for testing on knsq15; the rest will follow soon.
November 18
21:35 domas: commented out srv132 on LVS, cause it is half up half down (serves wiki, ssh unhappy)
November 17
04:18 Tim: enabled AssertEdit extension on all wikis, after review. Short, simple extension to help some bot runners.
November 16
22:12 brion: switching on NPPatrol by default
22:00ish brion: switching Gadgets back on on dewiki, with new message naming convention
20:58 brion: changed master settings on bacon, letting it catch up...
20:53 brion: taking bacon out of rotation for the moment
20:52 brion: s3b has bacon, which didn't get updated for master switch. causing ugly lag
16:52 brion: updated all to current SVN code. disabled gadgets extension on dewiki pending code review.
04:17 brion: started CU index update on old masters
13:00 Rob: Took ixia offline for disk testing, failed raid.
12:00ish Rob: Rebooted Foundry after a failed upgrade attempt.
November 1
20:43 mark: Installed sage as gutsy build/scratch host. Sorry, new hostkey :(
Ubuntu Gutsy installs are now possible, both for amd64 and i386
20:32 mark: Prepared installation environment for Gutsy installs
12:18 mark: Set up new vlans on csw1-knams. Set up BGP config for the new prefix. Migrated full view BGP peering between csw5-pmtpa and csw1-knams to the new public ASN. Changed AS-path prefix lists to named instead of numbered on both routers.
October 31
13:50 brion: storage2 full; pruning some stuff...
October 30
18:22 brion: adding dev.donate.wikimedia.org cname and vhost
02:08 brion: leuksman.com was down for about four hours due to a mystery kernel panic, possibly a mystery hardware failure. Thanks for rebooting it, Kyle! :)
October 29
05:41 Tim: srv133 was giving bus errors on sync, not allowing logins. Forced restart, seems to be still down.
October 26
17:50 brion: rebuilt prefix search indexes, now splitting enwiki & other wikipedia index files to allow it to build on the 32-bit server. :) ready for leopard launch...
October 25
12:30 mark: knsq13's bad disk has been replaced; booted it up, cleaned the cache and started Squid.
08:14 Tim: created wiki at arbcom.en.wikipedia.org
October 23
21:30ish brion: sitenotice live on commons, meta, and en.*
19:25 brion: sitenotice now using user language; reenabled <funddraisinglogo/> little WMF logo
19:15 Rob: Resurrected srv118, synced, added back to pool.
18:37 Rob: Updated apache pool and pushed to dalembert.
18:33 Rob: Replaced backplane in srv78 and reinstalled. Bootstrapped and online.
17:30 Rob: Booted up srv146 which was turned off, details on srv146 page.
17:21 Rob: Rebooted srv131 from kernel panic. Synced and back online.
17:13 Rob: Rebooted srv124 from unresponsive crash (black screen when consoled.) Synced and back online.
15:54 mark: Reenabled the no-caching hack; seeing stalls
10:20 mark: Disabled the randomizing URL hack in the flash player as I'm not seeing stalls and I think it's actually triggering the Squid concurrency problems. Reenable if it makes things worse.
09:29 mark: Seeing huge CPU spikes on knsq12, the CARPed Squid that's responsible for the Flash video. Looks like the same problem we saw 2 weeks ago. Changed the configuration so the upload frontend squids cache it, to reduce concurrency.
01:24 brion: turned off the scrolling marquee; too many complaints about it being slow and distracting
01:00 brion: switching donate.wikimedia.org back to fundcore2 (srv9), keeping the link from banner to the wiki until we figure out performance problems
00:40ish brion: hacking around squid problem with flash player by randomizing the URL for the .flv, defeating caching. :P
October 22
22:47 brion: putting on sitenotice on testwiki and enwiki
20:15 mark: Shut down knsq13 awaiting disk replacement.
15:51 brion: adding www.donate CNAMEs
15:16 mark: Broken disk in knsq13, stopped backend squid. Frontend still running.
13:50 brion: php5-apc installed on srv9; updated php5-apc package in apt to wm6 version, with both 386 and amd64 versions...
13:34 brion: added wikimedia apt repo to pbuilder images on mint in the hopes that one day i'll be able to actually build packages
12:00 domas: db2 gone live, took out adler for jawiki dump
03:24 brion: fixed email on srv9, I think... an extra space on a line in ssmtp.conf after the hostname seemed to break it, making it try to send to port 0.
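Trailing whitespace like this is invisible in most editors but trivial for grep to expose; a sketch against a sample file (the real srv9 config isn't reproduced here):

```shell
# A config line with a stray trailing space after the hostname
printf 'mailhub=smtp.example.org \nFromLineOverride=YES\n' > /tmp/ssmtp.sample
# Show any line ending in whitespace, with its line number
grep -n ' $' /tmp/ssmtp.sample
```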
October 21
15:40 mark: Doubled upload squid's maximum cached object size to 100M for the fundraiser.
October 20
19:24 Rob: srv10 reinstalled as base Ubuntu for fundraiser.
18:50 Rob: srv9 reinstalled as base Ubuntu for fundraiser.
16:05 jeluf: removed binlogs on srv123
16:00 jeluf: started apache on srv85
16:00 jeluf: removed binlogs on srv7. Only 120MB of disk space was left...
13:30 spontaneous reboot of srv85
October 18
21:48 brion: updated interwiki map
17:53 brion: http://donate.wikimedia.org/ up as redirect for now; all other domains also have this subdomain as redir
October 17
18:34 mark: increased ircd max clients to 2048 on irc.wikimedia.org.
18:14 brion: set up an hourly cronjob to 'apache2ctl graceful' on wikitech. Had one instance of segfault-mania, probably from fun APC bugs or something finally creeping in.
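The crontab line for this is a single entry (apache2ctl path as on Ubuntu; adjust if different):

```
# Gracefully restart Apache at the top of every hour
0 * * * * /usr/sbin/apache2ctl graceful
```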
October 16
22:08 brion: tweaked CentralNotice URLs to use 'action=raw' to avoid triggering the squid header smashing. Sigh... :)
19:02 brion: switched in CentralNotice infrastructure sitewide. This may cause a spike of hits to meta's Special:NoticeLoader, a little JS stub. Due to squid's cache-control overwriting this won't be cached as nice on clients as I wanted, but it should still be sanely cached at squid. Actual notice won't be switched in until the fundraiser begins, but this should get the stub loader into anon-cached pages ahead of time.
October 11
21:17 brion: srv124 has been broken for a few hours, apparent disk error judging from its behavior. Still online but can't log in to shut it off.
18:59 Rob: amane disk replaced, but not rebuilt. Need Jens to look at this.
15:57 brion: srv118 down; swapped it out from memcached for srv101. WE ARE OUT OF MEMCACHE SPARES IN THE LIST
17:08 brion: adding wikimania2008, 2009 to dns; setting up 2008 wiki
17:08 tim, domas, and rob are doing some sort of evil copying db7 to db10
16:36 brion: ES slaves under srv130 are stopped, they need to be restarted... no wait it's ok now
16:33 brion: restarted mysqld on srv130 (via /etc/init.d/mysqld); some mystery chown errors about invalid group for mysql? but seems to be running
16:30 brion: srv130 (ES master) mysql down; rebooted 80 minutes ago (mark said there was some problem with it and he depooled it, but this is not logged except for a kernel panic and restart september 19. may be flaky)
October 9
21:33 brion: switched srv105 out of memcache rotation, srv118 spare in. [domas tried upgrading memcached on srv105, apparently something's broke]
21:30 brion: memcached on srv105 is broken; verrrry slow, lots of errors on site -- timeouts etc
19:12 brion: disabling the alpha PNG IE 6 hack on en.wikipedia; most of the breakage appears to be on the GIF that it loads, which may be result or may be cause or may be unrelated. :)
16:45 brion: CPU maxing out on Florida upload squids; site up but with slow image loading. We're working on figuring out why
October 8
23:22 dab: started nagios-irc-bot on bart.
16:00 jeluf: took db10 out of service, mysql not responding, SSH not responding.
October 7
19:22 brion: refreshing storage2 copy of amane
October 3
18:00 brion: scapped update including r26357 which tim warns may introduce performance problems. keep an eye out
21:00 mark: Reinstalled fuchsia and made it the new active LVS host for knams, as iris's cable seems bad.
09:58 mark: iris has had input errors on its link. Need to keep an eye on it.
09:45 mark: The site felt slow at knams with multi-second load times. Latency to LVS IP was ~ 200 ms. Apparently iris (LVS load balancer) had negotiated its link speed at 100M, which is not sufficient during peak. Forced it to 1000M-master at the switch, which seems to have worked.
September 30
15:10 mark: Gave csw1-knams a full BGP view from csw5-pmtpa with rewritten next-hop to the Kennisnet gateway.
September 29
18:33 mark: Installed will for routing/monitoring purposes
00:35 brion: bart seems alive and well, was rebooted around 23:25?
September 28
22:00 jeluf: bart is dead => no nagios, no otrs.
20:22 brion: pushed out updated cortado 0.2.2 patched build to live server. (Some clients may use cached old code, note.)
14:49 mark: pascal had ns2.wikimedia.org's IP bound, which only showed after the move of routing to csw1. Removed.
05:13 mark: Depooled knams for network maintenance later today
September 25
19:30 jeluf: added hsbwiktionary as requested on bugzilla
19:00 jeluf: added bnwikisource as requested on bugzilla
17:24 brion: ariel reasonably happy now.
17:21 brion: cycling ariel a bit to let it fill cache
17:18 brion: putting ariel back into $dbHostsByName in db.php. someone commented it out without any logging of the reason why, rumor is because it had crashed and was recovering. this broke all enwiki watchlists.
17:11 ???: watchlist server for s1 overloaded
15:46 Rob: tingxi reinstalled and serves no current role.
14:51 Rob: srv121 reinstalled and online as apache. Needs ext. storage setup.
14:35 Rob: srv135 rebooted from kernel panic and back online as apache.
02:12 brion: new wikitech server working :) poking DNS to update...
01:47 brion: preparing to move wikitech wiki to new server; will be locked for a bit; DNS will change... new IP will be 63.246.140.16
September 24
12:30 mark: knsq12's broken disk has been replaced, brought it back up. ZX pulled the wrong disk, the broken one is still there.
~1:45 Tim: Set up srv131 and srv135 for static HTML dumps. Fixed enwiki 777/1024, continued dump.
September 20
14:10 Rob: Upgraded libkrb53 libt1-5 on srv151-srv189 (Will take a bit of time to work through them all.)
11:06 mark: Upgraded frontend squids on knsq1 - knsq5 to increase #FDs, they were running low.
September 19
20:40 jeluf: cleaned up disk space on db4
20:33 Tim: Fixed iptables configuration on benet: accept NTP packets from localhost
~19:30-19:50 Tim: Fixed NTP configuration on amane, bart, srv1, browne. Stepped clock and restarted broken ntpd on db8, alrazi, avicenna. Still working on benet.
18:07 mark: Turned off htdig cron job on lily, it's killing the box every day
16:15 jeluf: started mysqld on srv130. Added cluster13 back to the write list.
09:53 mark: Disk errors on /dev/sdb in knsq12. Stopped both squid instances.
6:15 jeluf: DB copy completed, thistle and db8 are both up and running. NTP isn't properly configured on db8. Needs investigation. Check bugzilla.
4:20 jeluf: mysql on thistle stopped. Copying its DB to db8.
September 18
19:50 brion: shutting off srv121 apache again, segfaults continuing.
19:30 brion: segfaults on srv121; shutting its apache and examining. lots in srv188 past log too, earlier today
16:18 brion: killed the deletion thread, since it was by accident anyway. :) things look happier now
16:16ish brion: lots of hung threads on db4/enwiki due to deletion of Wikipedia:Sandbox clogging up revision table
16:06 brion: srv128 hanging on login... shut it down via IPMI
05:15 jeluf: updated nagios config to reflect db master switch.
September 17
22:50 mark: srv130, ES cluster 13 master, disappeared. Removed it from the active write list and set its load to 0.
13:45 domas: I/O hang on db8, FS hung too, caused mysqld to hang. Switched master to lomaria
13:10 mass breakage of some kind... segfault on db8 mysql, massive breakage due to commons loads etc
11:38 Tim: re-ran squid update to decommission bacon -- was bungled yesterday so bacon was still serving zero-byte thumbnails. Purging the zero-byte list again.
September 16
21:58 Rob: srv15 reported apache error on nagios. Re-synced apache data and restarted apache process, correcting the issue.
20:28 mark: Generated a list of 0-byte thumbs on bacon, purging that list from the Squids
06:15 domas: db2 simply seems tired. servers/processes get tired after running under huge load for a year or more? :) Anyway, looks like deadlock caused by hardware/OS/library/mysqld - evil restart.
00:50 mark: Mounted /var/spool/exim4/db (purely nonessential cache files) as a tmpfs filesystem on lily to improve performance.
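Either a one-off mount or an fstab entry does this; the size below is an assumption, not logged:

```
# One-off:  mount -t tmpfs -o size=64m tmpfs /var/spool/exim4/db
# Persistent, in /etc/fstab:
tmpfs  /var/spool/exim4/db  tmpfs  size=64m  0 0
```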
00:44 mark: ionice'd and renice'd running htdig processes on lily. I/O starvation caused high loads so Exim delayed mail delivery.
September 15
17:13 Tim: sending all image backend traffic to amane, turned off bacon and srv6. Deployed in slow mode.
15:49 Tim: took db2 out of rotation. Some threads are frozen, including replication, leaving it like that for a while so Domas can have a look.
10:27 Tim: Attempted to fix error handling in thumb-handler.php. It was probably serving zero-byte images on queue overflow of the backend cluster.
Wrong diagnosis -- images are not zero bytes on amane, but are cached as zero bytes. Suspect lighttpd or squid. Remains unfixed.
There was in fact a problem with thumb-handler.php, manifesting itself on bacon. Probably fixed now.
13:08 Rob: srv54 reinstalled, bootstrapped, and back online.
12:27 Rob: rebooted srv135, ssh now working, synced and added back to apache lvs pool, synced nagios.
September 12
20:30ish Rob: bootstrapped srv131 and srv150, back in apache rotation.
19:59 Tim: srv135 is refusing ssh connections. Removed from LVS.
19:45 Tim: auditcomwiki wasn't in special.dblist, so it was listed in wikipedia.dblist and the static HTML dump system attempted to dump it. Luckily I still haven't fixed that bug about dumping whitelist-read wikis, so it just gave an error and exited. I put another couple of layers of protection in place anyway -- added auditcomwiki to special.dblist and patched the scripts to ignore wikis from private.dblist.
16:55 Tim: fixed sysctls on srv52, repooled
15:52 Tim: started static HTML dump
14:18 Rob: Added srv0 to apache lvs pool. Started httpd on srv0
September 11
17:02 Rob: srv135 reinstalled and bootstrapped, now online.
16:36 Rob: srv136 had a kernel panic, restarted, and sync'd.
13:19 Rob: srv135 had a kernel panic. Powered on, online, and needs bootstrap. (Rob will do this.)
13:29 Rob: srv136 had a kernel panic. Restarted, sync'd, and is online.
13:19 Rob: srv146 was turned off. Back online, needs bootstrap. (Rob will do this.)
11:33 mark: Decreased cache_mem on knsq7, and upgraded squids to newer version with higher FD limits. I have a theory that these squids are running out of buffer memory.
07:14 Tim: killed lsearchd on srv142, that's not a search server
05:56 Tim: installed ganglia on thistle, db1 and webster. It would be kind of cool if whoever reinstalls them could also install ganglia at the same time. I've updated /home/wikipedia/src/packages/install-ganglia to make this easy on RH or Ubuntu.
September 10
20:20 brion: reassembled checkuser log file and updated code to not die when searching and it finds big blocks of nulls
19:11 brion: adding CNAMEs for quality.wikipedia.org, experimental.stats.wikimedia.org
September 9
22:51 mark: Upgraded lighttpd to version 1.4.18 on amane, bacon and benet to fix a security bug. New and previous version RPMs are in /usr/src/redhat on the respective hosts.
18:10 Tim: mod_dir, mod_autoindex and mod_setenvif were missing on the ubuntu apaches, causing assorted subtle breakage. Fixing.
19:12 Tim: The extract2.php portals were all broken on Apache 2, redirecting instead of serving directly. Fixed.
September 8
20:30 mark: SpamAssassin had crashed on lily, restarted it
16:30 domas: reenabled Quiz, reverted to pre-r25655
16:00 domas: disabled Quiz, it seems to break message cache, when used
07:43 Tim: installed json.so on srv110
07:25 Tim: installed PEAR on srv151-189
06:00 Tim: Removed some binlogs on srv123. But it's getting full, we need a new batch.
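Binlog cleanup is safer through the server than with rm, since the server keeps its index consistent; the file name and date below are illustrative:

```sql
-- After confirming all slaves are past the cutoff:
PURGE MASTER LOGS TO 'srv123-bin.100';
-- or by date:
PURGE MASTER LOGS BEFORE '2007-09-01 00:00:00';
```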
15:25 brion: added symlink from /usr/bin/rsvg to /usr/local/bin/rsvg in install-librsvg FC script. sync-common or something complained about the file being missing on srv63 re-setup
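The symlink fix can be sketched as below (one plausible reading of the direction: the link lives at /usr/bin/rsvg and points at /usr/local/bin/rsvg), shown against temp paths rather than the real ones:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/bin" "$tmp/usr/local/bin"
# Stand-in for the real rsvg binary
printf '#!/bin/sh\necho rsvg\n' > "$tmp/usr/local/bin/rsvg"
chmod +x "$tmp/usr/local/bin/rsvg"
# Create the link at the path other scripts expect
ln -s "$tmp/usr/local/bin/rsvg" "$tmp/usr/bin/rsvg"
"$tmp/usr/bin/rsvg"
```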
15:06 brion: reinstalling yf1015 to use it for logging experiments
14:41 brion: fiddling with yaseo autoinstall config; trying to use jp ubuntu mirror since kr still down
14:25 domas: removed db1 due to 9MB binlog skip. needs recloning
13:00 domas: restarted few misbehaving httpds (srv3, srv121) - they were segfaulting, srv3 - seriously.
12:59 domas: fiddling with db9->db7/db3 copies
September 6
20:43 brion: temporarily switched henbane from kr.archive.ubuntu.com to jp.archive.ubuntu.com, since the former is down, while fiddling with software
17:31 Rob: srv63 reinstalled with FC4. Needs to have setup scripts run for apache use.
15:00 Rob: srv141 back online, ran sync-common, sync'd its data.
14:41 Rob: srv134 back online, ran sync-common, sync'd its data.
14:23 mark: Ran aptitude upgrade on srv151 - srv189
13:42 brion: ariel is lagged 5371 secs, looking into it
slave thread listed as running, waiting for events; stop and start got it going again
12:30ish Rob: albert was locked. Rebooted; it's running a forced FSCK, visible from SCS, will take a long time.
12:13 mark: Reloaded csw5-pmtpa with no issues, port activation bug seems to have been fixed
September 5
19:53 brion: fixed various php.ini's, things are much quieter in dberror.log now
srv37-39 missing from apaches group, but were in service
ok they're in a new freak group. :) cleared apc caches and updated php.ini
.... wrong php.ini on srv37-srv39, srv135, srv144: apc.stat=off
fixed sudoers on srv61, resynced
19:30 brion: freeing up space on bart, setting up log rotation
19:20 brion: srv151 and srv152 were not in mediawiki-installation group despite apparently running in production. re-adding and re-syncing them.
srv150, srv151, and srv152 are not in apaches group, though apparently all running apache. the heck? added them
18:58 brion: adjusted $wgDBservers setup so the non-native Commons and OAI/CentralAuth servers appear without the database name setting. This allows lag checks to run without dying on the database selection.
Has greatly reduced the flood in dberror.log, but there's still a lot. Narrowing it down...
15:33 brion: shut down srv134, read-only filesystem and HD errors in syslog. needs fixin'
15:20ish brion: setting up closed quality.wikimedia.org wiki per erik
09:29 mark: Fixed Ganglia for the 8 CPU apaches, which broke when I reinstalled srv153 yesterday
01:20 river: removed lomaria from rotation to dump s2 for toolserver
September 4
21:09 Any reason why srv150 (old SM 4cpu Apache) is not pooled?
20:48 mark: Installed srv151, srv152 and srv153 with Ubuntu and deployed mediawiki. Sorry, forgot to save ssh host keys :(. Put into rotation.
18:37 brion: srv149 seems wrong, missing remote filesystems. mounted upload3 and math manually
18:36 brion: shut down srv141, was whining about read-only filesystem, couldn't log in
15:30 Rob: adler racked and online. Needs database check.
12:23 Rob: DRAC working on srv151, needs installation (failing on partitioning.)
13:16 Rob: Rebooted srv152 to borrow its DRAC for testing. Reinstalled and rebooted.
12:34 mark: Installed ubuntu on srv152, will deploy later
12:33 Rob: srv149 back online and sync'd. Ran FSCK on it.
08:00 domas: live revert of DifferenceEngine.php to pre-24607 - requires additional patrolling index (ergh!), which was not created (ergh too). why do people think that reindexing recentchanges because of minor link is a good idea? :-/
The schema change requirement was noted and made quite clear. If it wasn't taken live before the software was updated, it's no fault of the development team. Rob Church 14:44, 2 September 2007 (PDT)
03:16 Tim: fixed upload on advisorywiki
August 31
19:50 brion: starting an offsite copy of public upload files from storage2 to gmaxwell's server
13:40 brion: srv149 spewing logs with errors about read-only filesystem; can't log in; no ipmi; mark shut its switchport off
That test case is gone from the cache now, but I did see it before it went. Can't reproduce directly to the apache. Maybe an overloaded ext store server? -- Tim
06:50 mark: Massive packet loss within 3356, prepending AS14907 twice to 30217, removed prepend to 4323
August 30
21:40 mark: Python / PyBal on alrazi had crashed with a segfault - restarted it
13:52 Rob: Restarted srv135 from kernel panic, sync'd and back online.
13:50 mark: Set AllowOverrideFrom in /etc/ssmtp/ssmtp.conf on srv153 to allow MW to set the from address. Also made this the default on new Ubuntu installs.
13:14 Rob: Rebooted srv18 after cpu temp warnings. Sync'd server, back online, no more temp warnings.
12:46 Rob: sq39's replacement powersupply arrived. Server back online and squid process cleaned and started.
13:17 Tim: running populateSha1.php on commonswiki
August 25
22:53 brion: noticed dberror.log is flooded with 'Error selecting database' blah with various wrong dbs; possibly from job runners, but the only place I found suspicious was nextJobDB.php and I tried a live hack to prevent that. Needs more investigation.
14:30 Tim: got rid of the APC statless thing, APC is buggy and crashes regularly when used in this way
14:00ish brion: scapping again; Tim made the img metadata update on-demand a bit nicer
August 24
17:45 brion: fixed SVN conflict in Articles.php in master copy
17:29 brion: restarted slave on db2
17:27 brion: applying image table updates on db2, didn't seem to make it in somehow. tim was going to run this but i can't find it running and he's not online and didn't log it
17:16 brion: restarting slave on ariel
??:?? tim is running a batch update of image rows in the background of some kind
??:?? tim may have changed which server enwiki watchlists come from while ariel is non-synced
16:04 brion: applying img_sha1 update to ariel so we can restart replication and get watchlists for enwiki going again...
....15:10 tim reverted code to r24312 to avoid image update buggage for now
15:10 brion: took db3 (down), db2 (stopped due to schema buggage) out of rotation
15:00ish -- massive overload on db8 due to image row updates
14:47 brion: starting scap, finally!
14:10 brion: unblocked wikibugs IRC mailbox from wikibugs-l list, was autoblocked for excessive bounces
13:59 brion: confirmed that schema update job on samuel looks done
13:40 Tim: restarted job runners, only 2 were left out of 9. Wiped job log files.
August 23
22:00 Tim: replication on samuel stopped due to a replicated event from testwiki that referenced oi_metadata. Applied the new patches for testwiki only and restarted replication. Brion's update script will now get an SQL error from testwiki, but hopefully this won't have serious consequences.
20:04 brion: switched s3 master back: samuel_bin_log.009 514565744 -> db1-bin.009 496650201. samuel's load temporarily off while db changes apply...
19:56 brion: switching masters on s3 to apply final db changes to samuel
15:34 brion: knams more or less back on the net; mark wants to wait a bit to make sure it stays up. Apache load has been heavy for a while, probably due to having to serve more uncached pages. DBs have lots of idle connections
15:32 brion: updated setup-apache script to recopy the sudoers file after reinstalling sudo, hopefully this'll fix the bugses
15:12 brion: srv61, 68 have bad sudoers files. srv144 missing convert
14:08 brion: depooled knams (scenario knams-down)
13:50 brion: knams unreachable from FL
9:08 mark: Repooled yaseo, apparently depooling it causes inaccessibility in China
August 22
20:35 brion: applying patch-oi_metadata.sql, patch-archive-user-index.sql, patch-cu_changes_indexes.sql on db5, will then need a master switch and update to samuel
20:20 brion: found that not all schema updates were applied. possibly just s3, possibly more. investigating.
14:00ish brion: amane->storage2 rsync completed more or less intact; rerunning with thumbs included for a fuller copy
August 21
20:21 brion: amane->storage2 rsync running again with updated rsync from CVS; crashing bug alleged to be fixed
14:43 Rob: sq39 offline due to bad powersupply, replacement ordered.
13:30 Tim, mark: setting up srv37, srv38 and srv39 as an image scaling cluster. Moving them out of ordinary apache rotation for now.
~12:00 Tim: convert missing on srv61, srv68, srv144, attempting to reinstall
9:30 mark: Reachability problems to yaseo, depooled it
August 20
15:00 brion: restarted amane->storage2 sync, this time with gdb sitting on their asses to catch the segfault for debugging
~12:00 Tim: started static HTML dump
11:15 Tim: running setup-apache on srv135
August 18
22:32 brion: schema updates done!
August 17
20:42 brion: started schema updates on old masters db2 lomaria db1
20:39 brion: s1 switched from db2 (db2-bin.160, 270102185) to db4 (db4-bin.131 835632675)
20:32 brion: s2 switched from lomaria (lomaria-bin.051 66321679) to db8 (db8-bin.066 55061448)
20:13 brion: s3 switched from db1 (db1-bin.009 496276016) to samuel (samuel_bin_log.001, 79)
19:54 brion: noticed amane->storage2 rsync segfaulted again. starting another one skipping thumb directories, will fiddle with updating and investigating further later
19:48 brion: doing master switches to prepare for final schema updates this weekend
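The master switches above each record a binlog file and position on the old master; a replica (or the demoted master) is then pointed at those coordinates. A minimal sketch of assembling such a statement, using the coordinates logged for the s1 switch; the statement is only printed here, never executed, and the host name is illustrative:

```shell
# Build the CHANGE MASTER statement from logged binlog coordinates.
# Coordinates are the ones recorded for the s1 switch (db4-bin.131 835632675).
NEW_MASTER="db4"
FILE="db4-bin.131"
POS=835632675
STMT="CHANGE MASTER TO MASTER_HOST='$NEW_MASTER', MASTER_LOG_FILE='$FILE', MASTER_LOG_POS=$POS; START SLAVE;"
# Print rather than run; executing this requires a mysql client on the replica.
echo "$STMT"
```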
August 16
16:21 Rob: srv135 back up, needs bootstrap for setup.
16:01 brion: think I got the mail issue sorted out. Using sendmail mode (really ssmtp), and tweaked ssmtp.conf on isidore: set the host to match smtp.pmtpa.wmnet, and set FromLineOverride=YES so it doesn't mess up the from address anymore
15:42 brion: having a lovely goddamn time with bugzilla mail. Setting it back from SMTP to sendmail avoids the error messages with dab.'s email address, but appears to just send email to a black hole. Setting back to SMTP for now
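The 16:01 entry above boils down to two lines in /etc/ssmtp/ssmtp.conf. A minimal sketch of the fragment, assuming stock ssmtp directives; only the mailhub value and FromLineOverride are taken from the log:

```
# /etc/ssmtp/ssmtp.conf (fragment) -- relay all mail through the internal SMTP host
mailhub=smtp.pmtpa.wmnet
# let the caller's From: header through instead of rewriting it
FromLineOverride=YES
```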
August 15
3:40 jeluf: Start copy of database from db5 to samuel.
August 14
20:30 jeluf: Lots of apache problems in nagios after some updates to InitialiseSettings.php. Restarting all apaches.
14:47 Rob: Rebooted and ran FSCK on srv59. It is back up, needs to be brought back in to rotation.
Sync'd, and in rotation.
August 13
21:00 domas: APC collapsed after sync-file, restarting all apaches helped.
19:58 brion: updated broken sudoers file on humboldt; it was not updating files on scap correctly
18:59 brion: playing with rsync amane->storage2
18:17 brion: working on updating the upload copy on storage2. Removing the old dump file which eats up all the space, will then start on an internal rsync job
15:43 brion: seem to have more or less resolved user-level problems after a bunch of apache restarts. I think there was some hanging going on, maybe master waits or attempts to connect to gone servers
13:48 brion: adding wikiversity.com redirect
13:26 brion: metadata update for commons image row 'Dasha_00010644_edit.jpg' was stuck due to mysterious row locks for the last couple of days -- may have been related to a reported outage yesterday. Run after run, the script tried to update the row but couldn't. Finally managed to delete the row so it's no longer trying, but the delete took over three minutes. :P
August 5
07:44 Tim: set $wgUploadNavigationUrl on enwiki, left a note on the relevant talk pages
August 4
02:10 Tim: Fixed srv7 again
Aug 2
06:00 mark: Installed yf1015 for use as HTTPS gateway
Aug 1
07:00 domas: amane lighty upgraded
July 29
23:43 brion: shut off srv134 via ipmi, since nobody got to it
July 27
4:30 jeluf: removed srv120 from the external storage pool, fixed srv130
July 25
16:33 brion: srv134 bitching about read-only filesystem, possible hd prob
15:44 Rob: srv59 had a kernel panic. Restarted and is now back online.
15:00 Rob: biruni HDD replaced, FC4 reinstalled. (Had to use 32 bit, system did not support 64.) Requires scripts run and server to be put in rotation.
July 24
21:59 brion: srv59 is down; replaced it in memcache pool with spare srv61.
02:38 brion: fixed bugzilla queries; had accidentally borked the shadow db configuration -- disabled that (for srv8) a few hours ago due to the earlier reported replication borkage, but did it wrong so it was trying to connect to localhost
July 21
22:30 mark: srv7 was out of space again, deleted a few bin logs. Replication I/O thread on srv8 is not running and doesn't want to come up, can any of you MySQL heads please fix that? :)
19:26 Tim: db3 is down, removed from rotation.
18:56 brion: set up and mounted upload3 and math mounts on vincent -- these were missing, probably causing bugzilla:10610 and likely some upload-related problems.
15:08 river: borrowing storage1 to dump external text clusters
13:50 Rob: biruni filesystem in read-only. Rebooted and running FSCK.
HDD is toast. Emailed SM for RMA.
11:38 mark: Installed Ubuntu Feisty on yf1016 for use by JeLuF
July 15
15:06 Tim: Biruni was hanging on various operations such as ordinary ssh login, or a "sync" command. Restarted using "echo b > /proc/sysrq-trigger" in a non-pty ssh session.
Probably hung during startup, ping but no ssh
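The trick in the 15:06 entry uses the kernel's magic SysRq interface: each character written to /proc/sysrq-trigger fires one action, and 'b' reboots immediately without syncing or unmounting, which is why it works even when a plain "sync" hangs. A sketch that only prints the command; running it for real requires root on the stuck host:

```shell
# 'b' = immediate reboot, no sync, no unmount. Useful when the box is too
# wedged for a clean "reboot". Printed here instead of executed.
CMD="echo b > /proc/sysrq-trigger"
echo "$CMD"
```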
July 14
19:35 mark: Fixed a problem with our DNS SOA records
13:33 Tim: put ex-search servers vincent, hypatia, humboldt, kluge, srv37 into apache rotation. Fixed ganglia and nagios.
July 13
23:20 mark: Disabled Oscar's accounts on boardwiki/chairwiki per Anthere's request
July 12
18:35 brion: pa.us.wikimedia.org up.
~12:10 Tim: restarted postfix on leuksman, causing a flood of messages delivered to various locations.
11:38 mark: Set up Quagga on mint as a test box for my BGP implementation, gave it a multihop BGP feed from csw5-pmtpa / AS14907
11:02 Tim: reassigning srv57 and srv58 to search
July 11
13:10 Tim: updating lucene for enwiki, will be moving on to the other clusters shortly
12:49 Tim: removed /etc/crond.d/search-restart from search servers and restarted crond
July 10
21:30 brion: installed DeletedContributions ext
19:20 mark: Brought knsq6 back up, DIMM hopefully replaced
19:20 jeluf: setting skip-slave-start on read-only external storage clusters.
July 9
17:53 brion: temporarily taking db4 out of rotation cause people freak out about lag warnings
14:18 brion: running schema changes patch-backlinkindexes.sql on db1/db4 so they don't get forgotten
13:55 brion: fixed oai audit db setting (broken by master switch)
July 8
20:47 jeluf: added external storage cluster #13. Removed #10 from the write list.
18:24 Tim: changed cron jobs on the search servers to restart lsearchd instead of mwsearchd.
17:05 brion: switched s1 master, started the rest of the latest schema changes
17:00 brion: switched s2 master
16:54 brion: switched s3 master
16:23 brion: schema changes index tweaks done on slaves, waiting for a master switch to complete
July 7
20:10 jeluf: cleaned up binlogs on srv95.
July 6
21:16 brion: did rs.wikimedia.org updates -- imported pages and images from old offsite wiki, set up redirects from *.vikimedija.org domains that are pointed to us
13:09 Rob: replaced cable for rose, shows at full duplex speed again.
July 4
21:40 jeluf: cleaned up disk space on srv95, 120, 126
July 3
20:54 brion: Closed out inactive wikimedical-l list
15:56 brion: lag problem resolved. stop/start slave got them running again; presumably the connections broke due to the net problems, but each slave thought its connection was still alive, so it didn't reconnect
15:53 brion: lag problems on enwiki -- all slaves lagged (approx 3547 sec), but no apparent reason why
15:00ish? mark did firmware updates on the switch; something didn't come back up right and everything was dead for a few minutes
00:52 river: temporarily depooled thistle to dump s2.
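The 15:56 lag fix above amounts to cycling the replication threads so each replica drops its dead master connection and reconnects. A hedged sketch: the host list is illustrative, and the commands are only printed, not run:

```shell
# Cycling STOP SLAVE / START SLAVE forces a replica to tear down its
# (possibly half-dead) connection to the master and reconnect.
# Hosts below are examples, not the actual lagged set.
for host in db2 db4 ariel; do
  echo "mysql -h $host -e 'STOP SLAVE; START SLAVE;'"
done
```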
July 2
18:54 brion: set bugzilla to use srv8 as shadow database
18:53 brion: replication started on srv8
18:39 brion: srv7 db back up, putting otrs back in play. fiddling with replication
18:25 Tim: bootstrapping srv80
18:24 brion: shut down srv7 db to copy
17:54 brion: shutting down srv8 db and clearing space for copy from srv7. otrs mail is being queued per mark
17:40 Tim, rainman: installing LS2 for the remaining wikis. Splitting off a new search pool, on VIP 10.0.5.11.
July 1
18:38 mark: Upgraded lily to Feisty, including a new customized Mailman, and a newer PowerDNS