20:00 domas: livehacked out rounded corner references in schulenberg main.css - files were 404 :) Tim, that's outrageous! :-)
15:00 domas: blocked & entirely.
14:00 domas: noticed that we still maintain separate squid caches due to access encodings not only for crawlers, but for IE/Moz too. Patch at http://p.defau.lt/?C9GXHJ14GWHAYK1Pf0x9cw
December 28
21:44 brion: running cleanupTitles on all wikis (HTML entity and general issues)
15:53 jeluf: shut down apache on srv14, srv15, srv17. CPU0: Running in modulated clock mode. Needs to be checked.
12:58 mark: Blocked broken Recentchanges& requests on the frontend squids (ACL badrc)
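An ACL of roughly this shape in squid.conf would do the blocking; the `badrc` name is from this entry, but the exact pattern isn't logged, so the regex below is an assumption:

```
# Hypothetical reconstruction -- the real pattern behind ACL badrc was not logged
acl badrc urlpath_regex Recentchanges&
http_access deny badrc
```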
00:28 brion: wikibugs irc bot had been idle for about 2 days; on investigation its mailbox was disabled from the mailing list due to bounces. unblocked it.
00:12 brion: Special:Userrights restricted mode available for bcrats: add sysop, bureaucrat, bot; remove bot
December 27
22:21 brion: fixed Cite regression which produced invalid error message for ref names containing integers
21:58 brion: scapping updates
00:58 brion: stopped the dump worker thread on srv31 for the time being
00:56 brion: either multiple dump threads are running, or the reporter's really broken from benet being full. :)
December 26
21:50 jeluf: installed lighttpd on browne (irc.wikimedia.org), redirecting all requests to meta. See bug 11792
18:04 brion: starting rsync benet->storage2 again... it borked the first time. :P
December 25
06:00 jeluf: rsync completed. Symlinked dewiki upload dir to /mnt/upload4
December 24
23:30 jeluf: disabled image uploads on dewiki. Rsync'ing images from amane to storage1.
17:40 brion: configured stub fr and ru.planet.wikimedia.org
December 23
21:14 brion: setting up private exec.wikimedia.org wiki
06:48 domas: restarted all memcacheds with 1.2.4, complete cache wipe caused interesting caching issues
11:15 jeluf: configured mchenry to use srv179 instead of srv7 for OTRS mail address lookups.
11:00 jeluf: moved otrs DB from srv7 to srv179. Bugzilla and donateblog are still on srv7.
December 21
19:45 jeluf: restarted lighttpd on benet since dewiki and frwiki were not reachable. I don't understand why it helped, but it did.
17:06 brion: benet was rebooted; migrating all data to storage2. started monitor thread but no additional worker threads for the moment.
December 20
00:08 brion: benet went from being sluggish to being totally unresponsive. May need to reboot it; may have disk problems or something.
20:43 brion: migrating additional dump dirs to storage2 to balance disk more
20:20 jeluf: benet's disk is full. Reduced root reserve to 100 blocks
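Reducing the root reserve on an ext2/3 filesystem is done with tune2fs; benet's device name isn't logged, so this sketch runs against a throwaway image file instead of a real disk:

```shell
# Create a small scratch ext2 filesystem in a file (no root needed)
dd if=/dev/zero of=/tmp/scratch.img bs=1024 count=1024 2>/dev/null
mke2fs -F -q /tmp/scratch.img
# Drop the blocks reserved for root from the 5% default to 100 blocks
tune2fs -r 100 /tmp/scratch.img
tune2fs -l /tmp/scratch.img | grep 'Reserved block count'
```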
December 19
20:13 mark: 14 new knams squids pushed into production, one is broken.
December 18
02:43 brion: download.wikimedia.org wasn't showing the dewiki dir; possibly confused about the symlink to the other mount. Restarting lighty seems to have resolved it.
02:34 brion: enabled gadgets extension sitewide
December 17
20:44 brion: working on patching up the broken dumps... Benet crashed on srv6, killing one worker thread and the monitor thread. Dumps continued when benet came back online, but weren't reported. Monitor thread is now restarted, showing status. Additionally, freeing up space on storage2 so dewiki dump can run again. ... trimming thumbnails from the storage2 backup of amane uploads
03:39 brion: updated wgRC2UDPAddress for some private wikis
December 16
13:39 mark: Blocking HTTP OPTIONS requests on all frontend Squids
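Squid can match on the request method directly; a minimal squid.conf fragment (the ACL name is illustrative):

```
acl badopts method OPTIONS
http_access deny badopts
```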
December 14
14:55 mark: Moved patch between csw1-knams (in J-13) and lily (J-16), old one may be bad?
14:00 mark: Installed new SFP line card in slot 7 of csw1-knams. Brought up link to AMS-IX.
6:00 jeluf: started move of thumb directories of the commons from amane to storage1. Dirs get rsynced and symlinked in small steps. When the entire thumb dir is moved, only one symlink will be left. It would be nice to have the thumb dirs independent of the image dirs, but Mediawiki currently doesn't have a config option for this.
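The rsync-and-symlink step described above can be sketched as below; the paths are temporary stand-ins, not the real amane/storage1 mount points:

```shell
# Stand-ins for the old (amane) and new (storage1) storage roots
OLD=$(mktemp -d); NEW=$(mktemp -d)
mkdir -p "$OLD/thumb/a" "$NEW/thumb"
echo thumbdata > "$OLD/thumb/a/file.jpg"

# 1. Copy one thumb subdirectory to the new storage
rsync -a "$OLD/thumb/a/" "$NEW/thumb/a/"
# 2. Swap the original directory for a symlink so readers keep working
mv "$OLD/thumb/a" "$OLD/thumb/a.old"
ln -s "$NEW/thumb/a" "$OLD/thumb/a"
# 3. Once verified, remove the old copy
rm -rf "$OLD/thumb/a.old"
cat "$OLD/thumb/a/file.jpg"
```

Repeating this per subdirectory is what allows the move to happen in small steps; once everything is across, the per-directory links collapse into the single top-level symlink.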
December 13
23:00 jeluf: Mounted storage1's upload directory on all apaches. Moved /mnt/upload3/wikipedia/commons/scans to the new directory. Other directories have to follow.
December 12
19:25 jeluf: restarted replication on srv127
December 11
16:12 mark: mchenry under incoming connection flood, increased max connections. This hit srv7's mysql max connection limit of 100, upped that to 500.
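The live bump and its persistent form, assuming stock MySQL (values from this entry):

```sql
-- Applied on the running server; lost on restart:
SET GLOBAL max_connections = 500;
-- To make it stick across restarts, also add under [mysqld] in my.cnf:
--   max_connections = 500
```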
21:38 brion: fixed redirect from sep11.wikipedia.org; it pointed to sep11memories.org, which does not exist (though some browsers let it pass). Now trying the real www.sep11memories.org.
16:18 Rob: srv81 kernel panic. reboot, back online.
around 5: benet was found dead, PM rebooted it, server is back up, but lighttpd tries to load modules from /home/midom/tmp/thttpd/*, which does not exist. Needs to be checked.
December 5
14:25 brion: set up rate limits for rollback, awaiting broader rollback permissions
12:30 mark: Depooled knams, routing problems in surfnet
December 3
17:50 mark: Exim on lily was still bound to the old 145. IP for outgoing messages, so couldn't send out to the world. Fixed.
17:28 mark: Removed monitorurl options from Squid cache_peer lines now that a Squid bug has been fixed where it wouldn't detect revival of dead parents.
13:05 mark: Downgraded knsq3 kernel to Feisty's kernel, started frontend squid
10:10 mark: Cut the routing for 145.97.39.128/26 on csw1-knams.
December 2
05:24 Tim: Master switch to ixia complete. New binlog pos: (ixia-bin.001, 2779)
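Slaves are repointed at the new master using the logged coordinates; the connection details below are placeholders:

```sql
STOP SLAVE;
CHANGE MASTER TO
  MASTER_HOST = 'ixia',             -- placeholder host/credentials
  MASTER_LOG_FILE = 'ixia-bin.001', -- binlog position from this entry
  MASTER_LOG_POS = 2779;
START SLAVE;
```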
05:10 Tim: starting response
04:58 db8 down (s2 master)
December 1
21:30 mark: Recompiled 2.6.20 kernel on mint (upload LVS host at knams) with a larger hash table of 2^20 entries instead of 2^12. %si CPU usage seems a bit lower, will be interesting to see it during peak traffic. Also tested the newer 2.6.22 kernel in Ubuntu on sage, and it appears 2-3 times slower in this respect!
November 30
19:06 domas: srv78 was commented out in mediawiki-installation since Nov2. It was also up since then. Uncommented, resynced.
19:29 Rob: Reinstalling yf1000 for software distro upgrade.
19:25ish mark: Upgraded yf1009 and yf1008 to new distro.
19:28 brion: apparently there remain problems with <ref>; reverting back to 27647
19:10 brion: updating wiki software to current trunk
17:22 brion: disabled send_reminders option on all mailing lists, since it's annoying
17:15 Rob: Finished upgrades on squid software for sq41-sq50
16:56 Rob: Started upgrades on squid software for sq41-sq50
16:54 Rob: Finished upgrades on squid software for sq38-sq40
16:48 Rob: Started upgrades on squid software for sq38-sq40
16:46 Rob: Finished upgrades on squid software for sq27-sq37
16:21 Rob: Started upgrades on squid software for sq27-sq37
16:18 Rob: Finished upgrades on squid software for sq16-sq26
15:55 Rob: Started upgrades on squid software for sq16-sq26
15:53 Rob: Finished upgrades on squid software for sq1-sq15
15:00 Rob: Started upgrades on squid software for sq1-sq15
November 28
23:01 Rob: completed reinstalls of KNAMS squids. (sans gmond)
21:43 brion: fixed apc.stat setting on srv0, was set to off
20:55 brion: renewed SSL cert for wikitech.leuksman.com
18:00ish Rob & Mark: Started reinstalls of KNAMS Squid servers.
November 27
22:27 mark: Reinstalled knsq1 with Ubuntu Gutsy and newest Squid; the rest will follow tomorrow
15:09 brion: srv121 is closing ssh, hanging on http traffic. May or may not be occasionally responding with bogus version of software. Requested shutdown from PM support.
12:20 mark: Moving European traffic back to knams on the new IPs.
09:00 - 12:20 mark: Renumbered knams public services VLAN.
04:00 - 07:00 mark: Brought up AS43821, 91.198.174.0/24 on a new transit link.
03:55 mark: DNS scenario knams-down for maintenance
November 26
19:52 Rob: srv81 reinstalled due to OS being fubar. Rob is bootstrapping and bringing into service.
18:55 Rob: Replaced bad ram in srv155. Stopped apache on boot, as there is no 'sync-common' on server.
17:03 mark: Noticed that fuchsia (LVS knams) was overloaded: more input packets than outgoing. Set up mint as emergency LVS host for upload.wikimedia.org and moved the service IP.
November 25
07:00 brion: rewired some file layout on leuksman.com
November 24
22:15 jeluf: created new wikis: fiwikinews hywikisource brwikiquote bclwiki liwikiquote
17:30 mark: isidore swamped and unreachable, set some crude caching-header-override options in squid
19:10 jeluf: killed squid on sq23 by accident, restarted.
18:54 jeluf: started squid on knsq3, disabled sdc
18:45 jeluf: started apache on srv78
18:45 jeluf: deleted old binlogs on srv123
November 22
22:49 Noticed that a fully dynamic blog with no caching whatsoever was linked off every page on the site, and wonders if anyone could come up with a more brilliant idea to kill a server. Put Squid infrastructure in front. Wordpress currently sends no caching headers, so set a default expire time of 1 minute in Squid.
14:58 brion: enabled blog link for english-language fundraising notice... http://whygive.wikimedia.org/ (set up the blog itself yesterday)
11:00 domas: (following older events) srv155 Apache misbehaved (database errors, 500, etc), because / was full. / was full, because updatedb's sorts were filling the disk. updatedb's sorts were filling the disk because find was traversing all NFS shares. find was traversing all NFS shares because file system mounts were not recorded in mtab. file system mounts were not recorded properly in mtab, because mount wasn't clean. mount wasn't clean because /home/ was mentioned 3 times in fstab. Home was mentioned in fstab 3 times because, oh well. :)
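The root cause (a mount point listed three times in fstab) is mechanically detectable; a sketch that flags duplicate mount points in an fstab-style file:

```shell
# Write a sample fstab with /home listed three times (mirroring the srv155 case)
cat > /tmp/fstab.sample <<'EOF'
/dev/sda1  /      ext3  defaults  1 1
/dev/sda2  /home  ext3  defaults  1 2
/dev/sda2  /home  ext3  defaults  1 2
/dev/sda2  /home  ext3  defaults  1 2
EOF
# Print any mount point (field 2) that appears more than once
awk '!/^#/ && NF {count[$2]++} END {for (m in count) if (count[m] > 1) print m, count[m]}' /tmp/fstab.sample
```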
November 20
21:55 mark: Set persistent_request_timeout to 1 minute in the frontend squid conf, since newer Squid versions increased the default value, causing greater FD usage.
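In squid.conf this is a one-line setting:

```
# Idle persistent connections release their FD after a minute
persistent_request_timeout 1 minute
```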
17:57 brion: patched OTRS to avoid giving out e-mail addresses when requesting password resets
Created a new squid 2.6.16-1wm1 package, included a patch by Adrian Chadd to solve the unbounded http headers array problem. Deployed it for testing on knsq15; the rest will follow soon.
November 18
21:35 domas: commented out srv132 on LVS, cause it is half up half down (serves wiki, ssh unhappy)
November 17
04:18 Tim: enabled AssertEdit extension on all wikis, after review. Short, simple extension to help some bot runners.
November 16
22:12 brion: switching on NPPatrol by default
22:00ish brion: switching Gadgets back on on dewiki, with new message naming convention
20:58 brion: changed master settings on bacon, letting it catch up...
20:53 brion: taking bacon out of rotation for the moment
20:52 brion: s3b has bacon, which didn't get updated for master switch. causing ugly lag
16:52 brion: updated all to current SVN code. disabled gadgets extension on dewiki pending code review.
04:17 brion: started CU index update on old masters
13:00 Rob: Took ixia offline for disk testing, failed raid.
12:00ish Rob: Rebooted Foundry after a failed upgrade attempt.
November 1
20:43 mark: Installed sage as gutsy build/scratch host. Sorry, new hostkey :(
Ubuntu Gutsy installs are now possible, both for amd64 and i386
20:32 mark: Prepared installation environment for Gutsy installs
12:18 mark: Set up new vlans on csw1-knams. Set up BGP config for the new prefix. Migrated full view BGP peering between csw5-pmtpa and csw1-knams to the new public ASN. Changed AS-path prefix lists to named instead of numbered on both routers.
October 31
13:50 brion: storage2 full; pruning some stuff...
October 30
18:22 brion: adding dev.donate.wikimedia.org cname and vhost
02:08 brion: leuksman.com was down for about four hours due to a mystery kernel panic, possibly a mystery hardware failure. Thanks for rebooting it, Kyle! :)
October 29
05:41 Tim: srv133 was giving bus errors on sync, not allowing logins. Forced restart, seems to be still down.
October 26
17:50 brion: rebuilt prefix search indexes, now splitting enwiki & other wikipedia index files to allow it to build on the 32-bit server. :) ready for leopard launch...
October 25
12:30 mark: knsq13's bad disk has been replaced; booted it up, cleaned the cache and started Squid.
08:14 Tim: created wiki at arbcom.en.wikipedia.org
October 23
21:30ish brion: sitenotice live on commons, meta, and en.*
19:25 brion: sitenotice now using user language; reenabled <funddraisinglogo/> little WMF logo
19:15 Rob: Resurrected srv118, synced, added back to pool.
18:37 Rob: Updated apache pool and pushed to dalembert.
18:33 Rob: Replaced backplane in srv78 and reinstalled. Bootstrapped and online.
17:30 Rob: Booted up srv146 which was turned off, details on srv146 page.
17:21 Rob: Rebooted srv131 from kernel panic. Synced and back online.
17:13 Rob: Rebooted srv124 from unresponsive crash (black screen when consoled.) Synced and back online.
15:54 mark: Reenabled the no-caching hack; seeing stalls
10:20 mark: Disabled the randomizing URL hack in the flash player as I'm not seeing stalls and I think it's actually triggering the Squid concurrency problems. Reenable if it makes things worse.
09:29 mark: Seeing huge CPU spikes on knsq12, the CARPed Squid that's responsible for the Flash video. Looks like the same problem we saw 2 weeks ago. Changed the configuration so the upload frontend squids cache it, to reduce concurrency.
01:24 brion: turned off the scrolling marquee; too many complaints about it being slow and distracting
01:00 brion: switching donate.wikimedia.org back to fundcore2 (srv9), keeping the link from banner to the wiki until we figure out performance problems
00:40ish brion: hacking around squid problem with flash player by randomizing the URL for the .flv, defeating caching. :P
October 22
22:47 brion: putting on sitenotice on testwiki and enwiki
20:15 mark: Shut down knsq13 awaiting disk replacement.
15:51 brion: adding www.donate CNAMEs
15:16 mark: Broken disk in knsq13, stopped backend squid. Frontend still running.
13:50 brion: php5-apc installed on srv9; updated php5-apc package in apt to wm6 version, with both 386 and amd64 versions...
13:34 brion: added wikimedia apt repo to pbuilder images on mint in the hopes that one day i'll be able to actually build packages
12:00 domas: db2 gone live, took out adler for jawiki dump
03:24 brion: fixed email on srv9, I think... an extra space on a line in ssmtp.conf after the hostname seemed to break it, making it try to send to port 0.
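Trailing whitespace like this is invisible in most editors but trivial for grep to expose; a sketch against a sample file (the real srv9 config isn't reproduced here):

```shell
# A config line with a stray trailing space after the hostname
printf 'mailhub=smtp.example.org \nFromLineOverride=YES\n' > /tmp/ssmtp.sample
# Show any line ending in whitespace, with its line number
grep -n ' $' /tmp/ssmtp.sample
```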
October 21
15:40 mark: Doubled upload squid's maximum cached object size to 100M for the fundraiser.
October 20
19:24 Rob: srv10 reinstalled as base Ubuntu for fundraiser.
18:50 Rob: srv9 reinstalled as base Ubuntu for fundraiser.
16:05 jeluf: removed binlogs on srv123
16:00 jeluf: started apache on srv85
16:00 jeluf: removed binlogs on srv7. Only 120MB of disk space was left...
13:30 spontaneous reboot of srv85
October 18
21:48 brion: updated interwiki map
17:53 brion: http://donate.wikimedia.org/ up as redirect for now; all other domains also have this subdomain as redir
October 17
18:34 mark: increased ircd max clients to 2048 on irc.wikimedia.org.
18:14 brion: set up an hourly cronjob to 'apache2ctl graceful' on wikitech. Had one instance of segfault-mania, probably from fun APC bugs or something finally creeping in.
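The crontab line for this is a single entry (apache2ctl path as on Ubuntu; adjust if different):

```
# Gracefully restart Apache at the top of every hour
0 * * * * /usr/sbin/apache2ctl graceful
```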
October 16
22:08 brion: tweaked CentralNotice URLs to use 'action=raw' to avoid triggering the squid header smashing. Sigh... :)
19:02 brion: switched in CentralNotice infrastructure sitewide. This may cause a spike of hits to meta's Special:NoticeLoader, a little JS stub. Due to squid's cache-control overwriting this won't be cached as nice on clients as I wanted, but it should still be sanely cached at squid. Actual notice won't be switched in until the fundraiser begins, but this should get the stub loader into anon-cached pages ahead of time.
October 11
21:17 brion: srv124 has been broken for a few hours, apparent disk error judging from its behavior. Still online but can't log in to shut it off.
18:59 Rob: amane disk replaced, but not rebuilt. Need Jens to look at this.
15:57 brion: srv118 down; swapped it out from memcached for srv101. WE ARE OUT OF MEMCACHE SPARES IN THE LIST
17:08 brion: adding wikimania2008, 2009 to dns; setting up 2008 wiki
17:08 tim, domas, and rob are doing some sort of evil copying db7 to db10
16:36 brion: ES slaves under srv130 are stopped, they need to be restarted... no wait it's ok now
16:33 brion: restarted mysqld on srv130 (via /etc/init.d/mysqld); some mystery chown errors about invalid group for mysql? but seems to be running
16:30 brion: srv130 (ES master) mysql down; rebooted 80 minutes ago (mark said there was some problem with it and he depooled it, but this is not logged except for a kernel panic and restart september 19. may be flaky)
October 9
21:33 brion: switched srv105 out of memcache rotation, srv118 spare in. [domas tried upgrading memcached on srv105, apparently something's broke]
21:30 brion: memcached on srv105 is broken; verrrry slow, lots of errors on site -- timeouts etc
19:12 brion: disabling the alpha PNG IE 6 hack on en.wikipedia; most of the breakage appears to be on the GIF that it loads, which may be result or may be cause or may be unrelated. :)
16:45 brion: CPU maxing out on Florida upload squids; site up but with slow image loading. We're working on figuring out why
October 8
23:22 dab: started nagios-irc-bot on bart.
16:00 jeluf: took db10 out of service, mysql not responding, SSH not responding.
October 7
19:22 brion: refreshing storage2 copy of amane
October 3
18:00 brion: scapped update including r26357 which tim warns may introduce performance problems. keep an eye out
21:00 mark: Reinstalled fuchsia and made it the new active LVS host for knams, as iris's cable seems bad.
09:58 mark: iris has had input errors on its link. Need to keep an eye on it.
09:45 mark: The site felt slow at knams with multi-second load times. Latency to LVS IP was ~ 200 ms. Apparently iris (LVS load balancer) had negotiated its link speed at 100M, which is not sufficient during peak. Forced it to 1000M-master at the switch, which seems to have worked.
September 30
15:10 mark: Gave csw1-knams a full BGP view from csw5-pmtpa with rewritten next-hop to the Kennisnet gateway.
September 29
18:33 mark: Installed will for routing/monitoring purposes
00:35 brion: bart seems alive and well, was rebooted around 23:25?
September 28
22:00 jeluf: bart is dead => no nagios, no otrs.
20:22 brion: pushed out updated cortado 0.2.2 patched build to live server. (Some clients may use cached old code, note.)
14:49 mark: pascal had ns2.wikimedia.org's IP bound, which only showed after the move of routing to csw1. Removed.
05:13 mark: Depooled knams for network maintenance later today
September 25
19:30 jeluf: added hsbwiktionary as requested on bugzilla
19:00 jeluf: added bnwikisource as requested on bugzilla
17:24 brion: ariel reasonably happy now.
17:21 brion: cycling ariel a bit to let it fill cache
17:18 brion: putting ariel back into $dbHostsByName in db.php. someone commented it out without any logging of the reason why, rumor is because it had crashed and was recovering. this broke all enwiki watchlists.
17:11 ???: watchlist server for s1 overloaded
15:46 Rob: tingxi reinstalled and serves no current role.
14:51 Rob: srv121 reinstalled and online as apache. Needs ext. storage setup.
14:35 Rob: srv135 rebooted from kernel panic and back online as apache.
02:12 brion: new wikitech server working :) poking DNS to update...
01:47 brion: preparing to move wikitech wiki to new server; will be locked for a bit; DNS will change... new IP will be 63.246.140.16
September 24
12:30 mark: knsq12's broken disk has been replaced, brought it back up. ZX pulled the wrong disk, the broken one is still there.
~1:45 Tim: Set up srv131 and srv135 for static HTML dumps. Fixed enwiki 777/1024, continued dump.
September 20
14:10 Rob: Upgraded libkrb53 libt1-5 on srv151-srv189 (Will take a bit of time to work through them all.)
11:06 mark: Upgraded frontend squids on knsq1 - knsq5 to increase #FDs, they were running low.
September 19
20:40 jeluf: cleaned up disk space on db4
20:33 Tim: Fixed iptables configuration on benet: accept NTP packets from localhost
~19:30-19:50 Tim: Fixed NTP configuration on amane, bart, srv1, browne. Stepped clock and restarted broken ntpd on db8, alrazi, avicenna. Still working on benet.
18:07 mark: Turned off htdig cron job on lily, it's killing the box every day
16:15 jeluf: started mysqld on srv130. Added cluster13 back to the write list.
09:53 mark: Disk errors on /dev/sdb in knsq12. Stopped both squid instances.
6:15 jeluf: DB copy completed, thistle and db8 are both up and running. NTP isn't properly configured on db8. Needs investigation. Check bugzilla.
4:20 jeluf: mysql on thistle stopped. Copying its DB to db8.
September 18
19:50 brion: shutting off srv121 apache again, segfaults continuing.
19:30 brion: segfaults on srv121; shutting its apache and examining. lots in srv188 past log too, earlier today
16:18 brion: killed the deletion thread, since it was by accident anyway. :) things look happier now
16:16ish brion: lots of hung threads on db4/enwiki due to deletion of Wikipedia:Sandbox clogging up revision table
16:06 brion: srv128 hanging on login... shut it down via IPMI
05:15 jeluf: updated nagios config to reflect db master switch.
September 17
22:50 mark: srv130, ES cluster 13 master, disappeared. Removed it from the active write list and set its load to 0.
13:45 domas: I/O hang on db8, FS hung too, caused mysqld to hang. Switched master to lomaria
13:10 mass breakage of some kind... segfault on db8 mysql, massive breakage due to commons loads etc
11:38 Tim: re-ran squid update to decommission bacon -- was bungled yesterday so bacon was still serving zero-byte thumbnails. Purging the zero-byte list again.
September 16
21:58 Rob: srv15 reported apache error on nagios. Re-synced apache data and restarted apache process, correcting the issue.
20:28 mark: Generated a list of 0-byte thumbs on bacon, purging that list from the Squids
06:15 domas: db2 simply seems tired. servers/processes get tired after running under huge load for a year or more? :) Anyway, looks like deadlock caused by hardware/OS/library/mysqld - evil restart.
00:50 mark: Mounted /var/spool/exim4/db (purely nonessential cache files) as a tmpfs filesystem on lily to improve performance.
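Either a one-off mount or an fstab entry does this; the size below is an assumption, not logged:

```
# One-off:  mount -t tmpfs -o size=64m tmpfs /var/spool/exim4/db
# Persistent, in /etc/fstab:
tmpfs  /var/spool/exim4/db  tmpfs  size=64m  0 0
```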
00:44 mark: ionice'd and renice'd running htdig processes on lily. I/O starvation caused high loads so Exim delayed mail delivery.
September 15
17:13 Tim: sending all image backend traffic to amane, turned off bacon and srv6. Deployed in slow mode.
15:49 Tim: took db2 out of rotation. Some threads are frozen, including replication, leaving it like that for a while so Domas can have a look.
10:27 Tim: Attempted to fix error handling in thumb-handler.php. It was probably serving zero-byte images on queue overflow of the backend cluster.
Wrong diagnosis -- images are not zero bytes on amane, but are cached as zero bytes. Suspect lighttpd or squid. Remains unfixed.
There was in fact a problem with thumb-handler.php, manifesting itself on bacon. Probably fixed now.
13:08 Rob: srv54 reinstalled, bootstrapped, and back online.
12:27 Rob: rebooted srv135, ssh now working, synced and added back to apache lvs pool, synced nagios.
September 12
20:30ish Rob: bootstrapped srv131 and srv150, back in apache rotation.
19:59 Tim: srv135 is refusing ssh connections. Removed from LVS.
19:45 Tim: auditcomwiki wasn't in special.dblist, so it was listed in wikipedia.dblist and the static HTML dump system attempted to dump it. Luckily I still haven't fixed that bug about dumping whitelist-read wikis, so it just gave an error and exited. I put another couple of layers of protection in place anyway -- added auditcomwiki to special.dblist and patched the scripts to ignore wikis from private.dblist.
16:55 Tim: fixed sysctls on srv52, repooled
15:52 Tim: started static HTML dump
14:18 Rob: Added srv0 to apache lvs pool. Started httpd on srv0
September 11
17:02 Rob: srv135 reinstalled and bootstrapped, now online.
16:36 Rob: srv136 had a kernel panic, restarted, and sync'd.
13:19 Rob: srv135 had a kernel panic. Powered on, online, and needs bootstrap. (Rob will do this.)
13:29 Rob: srv136 had a kernel panic. Restarted, sync'd, and is online.
13:19 Rob: srv146 was turned off. Back online, needs bootstrap. (Rob will do this.)
11:33 mark: Decreased cache_mem on knsq7, and upgraded squids to newer version with higher FD limits. I have a theory that these squids are running out of buffer memory.
07:14 Tim: killed lsearchd on srv142, that's not a search server
05:56 Tim: installed ganglia on thistle, db1 and webster. It would be kind of cool if whoever reinstalls them could also install ganglia at the same time. I've updated /home/wikipedia/src/packages/install-ganglia to make this easy on RH or Ubuntu.
September 10
20:20 brion: reassembled checkuser log file and updated code to not die when searching and it finds big blocks of nulls
19:11 brion: adding CNAMEs for quality.wikipedia.org, experimental.stats.wikimedia.org
September 9
22:51 mark: Upgraded lighttpd to version 1.4.18 on amane, bacon and benet to fix a security bug. New and previous version RPMs are in /usr/src/redhat on the respective hosts.
18:10 Tim: mod_dir, mod_autoindex and mod_setenvif were missing on the ubuntu apaches, causing assorted subtle breakage. Fixing.
19:12 Tim: The extract2.php portals were all broken on Apache 2, redirecting instead of serving directly. Fixed.
September 8
20:30 mark: SpamAssassin had crashed on lily, restarted it
16:30 domas: reenabled Quiz, reverted to pre-r25655
16:00 domas: disabled Quiz, it seems to break message cache, when used
07:43 Tim: installed json.so on srv110
07:25 Tim: installed PEAR on srv151-189
06:00 Tim: Removed some binlogs on srv123. But it's getting full, we need a new batch.
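Binlog cleanup is safer through the server than with rm, since the server keeps its index consistent; the file name and date below are illustrative:

```sql
-- After confirming all slaves are past the cutoff:
PURGE MASTER LOGS TO 'srv123-bin.100';
-- or by date:
PURGE MASTER LOGS BEFORE '2007-09-01 00:00:00';
```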
15:25 brion: added symlink from /usr/bin/rsvg to /usr/local/bin/rsvg in install-librsvg FC script. sync-common or something complained about the file being missing on srv63 re-setup
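The symlink fix can be sketched as below (one plausible reading of the direction: the link lives at /usr/bin/rsvg and points at /usr/local/bin/rsvg), shown against temp paths rather than the real ones:

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/usr/bin" "$tmp/usr/local/bin"
# Stand-in for the real rsvg binary
printf '#!/bin/sh\necho rsvg\n' > "$tmp/usr/local/bin/rsvg"
chmod +x "$tmp/usr/local/bin/rsvg"
# Create the link at the path other scripts expect
ln -s "$tmp/usr/local/bin/rsvg" "$tmp/usr/bin/rsvg"
"$tmp/usr/bin/rsvg"
```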
15:06 brion: reinstalling yf1015 to use it for logging experiments
14:41 brion: fiddling with yaseo autoinstall config; trying to use jp ubuntu mirror since kr still down
14:25 domas: removed db1 due to 9MB binlog skip. needs recloning
13:00 domas: restarted few misbehaving httpds (srv3, srv121) - they were segfaulting, srv3 - seriously.
12:59 domas: fiddling with db9->db7/db3 copies
September 6
20:43 brion: temporarily switched henbane from kr.archive.ubuntu.com to jp.archive.ubuntu.com, since the former is down, while fiddling with software
17:31 Rob: srv63 reinstalled with FC4. Needs to have setup scripts run for apache use.
15:00 Rob: srv141 back online, ran sync-common, sync'd its data.
14:41 Rob: srv134 back online, ran sync-common, sync'd its data.
14:23 mark: Ran aptitude upgrade on srv151 - srv189
13:42 brion: ariel is lagged 5371 secs, looking into it
slave thread listed as running, waiting for events; stop and start got it going again
12:30ish Rob: albert was locked. Rebooted; it's running a forced FSCK, visible from SCS, will take a long time.
12:13 mark: Reloaded csw5-pmtpa with no issues, port activation bug seems to have been fixed
September 5
19:53 brion: fixed various php.ini's, things are much quieter in dberror.log now
srv37-39 missing from apaches group, but were in service
ok they're in a new freak group. :) cleared apc caches and updated php.ini
.... wrong php.ini on srv37-srv39, srv135, srv144: apc.stat=off
fixed sudoers on srv61, resynced
19:30 brion: freeing up space on bart, setting up log rotation
19:20 brion: srv151 and srv152 were not in mediawiki-installation group despite apparently running in production. re-adding and re-syncing them.
srv150, srv151, and srv152 are not in apaches group, though apparently all running apache. the heck? added them
18:58 brion: adjusted $wgDBservers setup so the non-native Commons and OAI/CentralAuth servers appear without the database name setting. This allows lag checks to run without dying on the database selection.
Has greatly reduced the flood in dberror.log, but there's still a lot. Narrowing it down...
15:33 brion: shut down srv134, read-only filesystem and HD errors in syslog. needs fixin'
15:20ish brion: setting up closed quality.wikimedia.org wiki per erik
09:29 mark: Fixed Ganglia for the 8 CPU apaches, which broke when I reinstalled srv153 yesterday
01:20 river: removed lomaria from rotation to dump s2 for toolserver
September 4
21:09 Any reason why srv150 (old SM 4cpu Apache) is not pooled?
20:48 mark: Installed srv151, srv152 and srv153 with Ubuntu and deployed mediawiki. Sorry, forgot to save ssh host keys :(. Put into rotation.
18:37 brion: srv149 seems wrong, missing remote filesystems. mounted upload3 and math manually
18:36 brion: shut down srv141, was whining about read-only filesystem, couldn't log in
15:30 Rob: adler racked and online. Needs database check.
12:23 Rob: DRAC working on srv151, needs installation (failing on partitioning.)
13:16 Rob: Rebooted srv152 to borrow its DRAC for testing. Reinstalled and rebooted.
12:34 mark: Installed ubuntu on srv152, will deploy later
12:33 Rob: srv149 back online and sync'd. Ran FSCK on it.
08:00 domas: live revert of DifferenceEngine.php to pre-24607 - requires additional patrolling index (ergh!), which was not created (ergh too). why do people think that reindexing recentchanges because of minor link is a good idea? :-/
The schema change requirement was noted and made quite clear. If it wasn't taken live before the software was updated, it's no fault of the development team. Rob Church 14:44, 2 September 2007 (PDT)
03:16 Tim: fixed upload on advisorywiki
August 31
19:50 brion: starting an offsite copy of public upload files from storage2 to gmaxwell's server
13:40 brion: srv149 spewing logs with errors about read-only filesystem; can't log in; no ipmi; mark shut its switchport off
That test case is gone from the cache now, but I did see it before it went. Can't reproduce directly to the apache. Maybe an overloaded ext store server? -- Tim
06:50 mark: Massive packet loss within 3356, prepending AS14907 twice to 30217, removed prepend to 4323
August 30
21:40 mark: Python / PyBal on alrazi had crashed with a segfault - restarted it
13:52 Rob: Restarted srv135 from kernel panic, sync'd and back online.
13:50 mark: Set AllowOverrideFrom in /etc/ssmtp/ssmtp.conf on srv153 to allow MW to set the from address. Also made this the default on new Ubuntu installs.
13:14 Rob: Rebooted srv18 after cpu temp warnings. Sync'd server, back online, no more temp warnings.
12:46 Rob: sq39's replacement powersupply arrived. Server back online and squid process cleaned and started.
13:17 Tim: running populateSha1.php on commonswiki
August 25
22:53 brion: noticed dberror.log is flooded with 'Error selecting database' blah with various wrong dbs; possibly from job runners, but the only place I found suspicious was nextJobDB.php and I tried a live hack to prevent that. Needs more investigation.
14:30 Tim: got rid of the APC statless thing, APC is buggy and crashes regularly when used in this way
14:00ish brion: scapping again; Tim made the img metadata update on-demand a bit nicer
August 24
17:45 brion: fixed SVN conflict in Articles.php in master copy
17:29 brion: restarted slave on db2
17:27 brion: applying image table updates on db2, didn't seem to make it in somehow. tim was going to run this but i can't find it running and he's not online and didn't log it
17:16 brion: restarting slave on ariel
??:?? tim is running a batch update of image rows in the background of some kind
??:?? tim may have changed which server enwiki watchlists come from while ariel is non-synced
16:04 brion: applying img_sha1 update to ariel so we can restart replication and get watchlists for enwiki going again...
....15:10 tim reverted code to r24312 to avoid image update buggage for now
15:10 brion: took db3 (down), db2 (stopped due to schema buggage) out of rotation
15:00ish -- massive overload on db8 due to image row updates
14:47 brion: starting scap, finally!
14:10 brion: unblocked wikibugs IRC mailbox from wikibugs-l list, was autoblocked for excessive bounces
13:59 brion: confirmed that schema update job on samuel looks done
13:40 Tim: restarted job runners, only 2 were left out of 9. Wiped job log files.
August 23
22:00 Tim: replication on samuel stopped due to a replicated event from testwiki that referenced oi_metadata. Applied the new patches for testwiki only and restarted replication. Brion's update script will now get an SQL error from testwiki, but hopefully this won't have serious consequences.
20:04 brion: switched s3 master back: samuel_bin_log.009 514565744 -> db1-bin.009 496650201. samuel's load temporarily off while db changes apply...
19:56 brion: switching masters on s3 to apply final db changes to samuel
15:34 brion: knams more or less back on the net; mark wants to wait a bit to make sure it stays up. Apache load has been heavy for a while, probably due to having to serve more uncached pages. DBs have lots of idle connections
15:32 brion: updated setup-apache script to recopy the sudoers file after reinstalling sudo, hopefully this'll fix the bugses
15:12 brion: srv61, 68 have bad sudoers files. srv144 missing convert
14:08 brion: depooled knams (scenario knams-down)
13:50 brion: knams unreachable from FL
9:08 mark: Repooled yaseo, apparently depooling it causes inaccessibility in China
August 22
20:35 brion: applying patch-oi_metadata.sql, patch-archive-user-index.sql, patch-cu_changes_indexes.sql on db5, will then need a master switch and update to samuel
20:20 brion: found that not all schema updates were applied. possibly just s3, possibly more. investigating.
14:00ish brion: amane->storage2 rsync completed more or less intact; rerunning with thumbs included for a fuller copy
August 21
20:21 brion: amane->storage2 rsync running again with updated rsync from CVS; crashing bug alleged to be fixed
14:43 Rob: sq39 offline due to bad powersupply, replacement ordered.
13:30 Tim, mark: setting up srv37, srv38 and srv39 as an image scaling cluster. Moving them out of ordinary apache rotation for now.
~12:00 Tim: convert missing on srv61, srv68, srv144, attempting to reinstall
9:30 mark: Reachability problems to yaseo, depooled it
August 20
15:00 brion: restarted amane->storage2 sync, this time with gdb sitting on their asses to catch the segfault for debugging
~12:00 Tim: started static HTML dump
11:15 Tim: running setup-apache on srv135
August 18
22:32 brion: schema updates done!
August 17
20:42 brion: started schema updates on old masters db2 lomaria db1
20:39 brion: s1 switched from db2 (db2-bin.160, 270102185) to db4 (db4-bin.131 835632675)
20:32 brion: s2 switched from lomaria (lomaria-bin.051 66321679) to db8 (db8-bin.066 55061448)
20:13 brion: s3 switched from db1 (db1-bin.009 496276016) to samuel (samuel_bin_log.001, 79)
19:54 brion: noticed amane->storage2 rsync segfaulted again. starting another one skipping thumb directories, will fiddle with updating and investigating further later
19:48 brion: doing master switches to prepare for final schema updates this weekend
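The master switches above each record a binlog file and position on the old master; a replica (or the demoted master) is then pointed at those coordinates. A minimal sketch of assembling such a statement, using the coordinates logged for the s1 switch; the statement is only printed here, never executed, and the host name is illustrative:

```shell
# Build the CHANGE MASTER statement from logged binlog coordinates.
# Coordinates are the ones recorded for the s1 switch (db4-bin.131 835632675).
NEW_MASTER="db4"
FILE="db4-bin.131"
POS=835632675
STMT="CHANGE MASTER TO MASTER_HOST='$NEW_MASTER', MASTER_LOG_FILE='$FILE', MASTER_LOG_POS=$POS; START SLAVE;"
# Print rather than run; executing this requires a mysql client on the replica.
echo "$STMT"
```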
August 16
16:21 Rob: srv135 back up, needs bootstrap for setup.
16:01 brion: think I got the mail issue sorted out. Using sendmail mode (really ssmtp), and tweaked ssmtp.conf on isidore: set the host to match smtp.pmtpa.wmnet, and set FromLineOverride=YES so it doesn't mess up the from address anymore
15:42 brion: having a lovely goddamn time with bugzilla mail. Setting it back from SMTP to sendmail avoids the error messages with dab.'s email address, but appears to just send email to a black hole. Setting back to SMTP for now
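The 16:01 entry above boils down to two lines in /etc/ssmtp/ssmtp.conf. A minimal sketch of the fragment, assuming stock ssmtp directives; only the mailhub value and FromLineOverride are taken from the log:

```
# /etc/ssmtp/ssmtp.conf (fragment) -- relay all mail through the internal SMTP host
mailhub=smtp.pmtpa.wmnet
# let the caller's From: header through instead of rewriting it
FromLineOverride=YES
```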
August 15
3:40 jeluf: Start copy of database from db5 to samuel.
August 14
20:30 jeluf: Lots of apache problems in nagios after some updates to InitialiseSettings.php. Restarting all apaches.
14:47 Rob: Rebooted and ran FSCK on srv59. It is back up, needs to be brought back in to rotation.
Sync'd, and in rotation.
August 13
21:00 domas: APC collapsed after sync-file, restarting all apaches helped.
19:58 brion: updated broken sudoers file on humboldt; it was not updating files on scap correctly
18:59 brion: playing with rsync amane->storage2
18:17 brion: working on updating the upload copy on storage2. Removing the old dump file which eats up all the space, will then start on an internal rsync job
15:43 brion: seem to have more or less resolved user-level problems after a bunch of apache restarts. I think there was some hanging going on, maybe master waits or attempts to connect to gone servers
13:48 brion: adding wikiversity.com redirect
13:26 brion: metadata update for commons image row 'Dasha_00010644_edit.jpg' was stuck due to mysterious row locks for the last couple of days -- may have been related to a reported outage yesterday. Run after run, the script tried to update the row but couldn't. Finally managed to delete the row so it's no longer trying, but the delete took over three minutes. :P
August 5
07:44 Tim: set $wgUploadNavigationUrl on enwiki, left a note on the relevant talk pages
August 4
02:10 Tim: Fixed srv7 again
Aug 2
06:00 mark: Installed yf1015 for use as HTTPS gateway
Aug 1
07:00 domas: amane lighty upgraded
July 29
23:43 brion: shut off srv134 via ipmi, since nobody got to it
July 27
4:30 jeluf: removed srv120 from the external storage pool, fixed srv130
July 25
16:33 brion: srv134 bitching about read-only filesystem, possible hd prob
15:44 Rob: srv59 had a kernel panic. Restarted and is now back online.
15:00 Rob: biruni HDD replaced, FC4 reinstalled. (Had to use 32 bit, system did not support 64.) Requires scripts run and server to be put in rotation.
July 24
21:59 brion: srv59 is down; replaced it in memcache pool with spare srv61.
02:38 brion: fixed bugzilla queries; had accidentally borked the shadow db configuration -- disabled that (for srv8) a few hours ago due to the earlier reported replication borkage, but did it wrong so it was trying to connect to localhost
July 21
22:30 mark: srv7 was out of space again, deleted a few bin logs. Replication I/O thread on srv8 is not running and doesn't want to come up, can any of you MySQL heads please fix that? :)
19:26 Tim: db3 is down, removed from rotation.
18:56 brion: set up and mounted upload3 and math mounts on vincent -- these were missing, probably causing bugzilla:10610 and likely some upload-related problems.
15:08 river: borrowing storage1 to dump external text clusters
13:50 Rob: biruni filesystem in read-only. Rebooted and running FSCK.
HDD is toast. Emailed SM for RMA.
11:38 mark: Installed Ubuntu Feisty on yf1016 for use by JeLuF
July 15
15:06 Tim: Biruni was hanging on various operations such as ordinary ssh login, or a "sync" command. Restarted using "echo b > /proc/sysrq-trigger" in a non-pty ssh session.
Probably hung during startup, ping but no ssh
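The trick in the 15:06 entry uses the kernel's magic SysRq interface: each character written to /proc/sysrq-trigger fires one action, and 'b' reboots immediately without syncing or unmounting, which is why it works even when a plain "sync" hangs. A sketch that only prints the command; running it for real requires root on the stuck host:

```shell
# 'b' = immediate reboot, no sync, no unmount. Useful when the box is too
# wedged for a clean "reboot". Printed here instead of executed.
CMD="echo b > /proc/sysrq-trigger"
echo "$CMD"
```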
July 14
19:35 mark: Fixed a problem with our DNS SOA records
13:33 Tim: put ex-search servers vincent, hypatia, humboldt, kluge, srv37 into apache rotation. Fixed ganglia and nagios.
July 13
23:20 mark: Disabled Oscar's accounts on boardwiki/chairwiki per Anthere's request
July 12
18:35 brion: pa.us.wikimedia.org up.
~12:10 Tim: restarted postfix on leuksman, causing a flood of messages delivered to various locations.
11:38 mark: Set up Quagga on mint as a test box for my BGP implementation, gave it a multihop BGP feed from csw5-pmtpa / AS14907
11:02 Tim: reassigning srv57 and srv58 to search
July 11
13:10 Tim: updating lucene for enwiki, will be moving on to the other clusters shortly
12:49 Tim: removed /etc/crond.d/search-restart from search servers and restarted crond
July 10
21:30 brion: installed DeletedContributions ext
19:20 mark: Brought knsq6 back up, DIMM hopefully replaced
19:20 jeluf: setting skip-slave-start on read-only external storage clusters.
July 9
17:53 brion: temporarily taking db4 out of rotation cause people freak out about lag warnings
14:18 brion: running schema changes patch-backlinkindexes.sql on db1/db4 so they don't get forgotten
13:55 brion: fixed oai audit db setting (broken by master switch)
July 8
20:47 jeluf: added external storage cluster #13. Removed #10 from the write list.
18:24 Tim: changed cron jobs on the search servers to restart lsearchd instead of mwsearchd.
17:05 brion: switched s1 master, started the rest of the latest schema changes
17:00 brion: switched s2 master
16:54 brion: switched s3 master
16:23 brion: schema changes index tweaks done on slaves, waiting for a master switch to complete
July 7
20:10 jeluf: cleaned up binlogs on srv95.
July 6
21:16 brion: did rs.wikimedia.org updates -- imported pages and images from old offsite wiki, set up redirects from *.vikimedija.org domains that are pointed to us
13:09 Rob: replaced cable for rose, shows at full duplex speed again.
July 4
21:40 jeluf: cleaned up disk space on srv95, 120, 126
July 3
20:54 brion: Closed out inactive wikimedical-l list
15:56 brion: lag problem resolved. stop/start slave got them running again; presumably the connections broke due to the net problems, but each slave thought its connection was still alive, so it didn't reconnect
15:53 brion: lag problems on enwiki -- all slaves lagged (approx 3547 sec), but no apparent reason why
15:00ish? mark did firmware updates on the switch; something didn't come back up right and everything was dead for a few minutes
00:52 river: temporarily depooled thistle to dump s2.
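The 15:56 lag fix above amounts to cycling the replication threads so each replica drops its dead master connection and reconnects. A hedged sketch: the host list is illustrative, and the commands are only printed, not run:

```shell
# Cycling STOP SLAVE / START SLAVE forces a replica to tear down its
# (possibly half-dead) connection to the master and reconnect.
# Hosts below are examples, not the actual lagged set.
for host in db2 db4 ariel; do
  echo "mysql -h $host -e 'STOP SLAVE; START SLAVE;'"
done
```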
July 2
18:54 brion: set bugzilla to use srv8 as shadow database
18:53 brion: replication started on srv8
18:39 brion: srv7 db back up, putting otrs back in play. fiddling with replication
18:25 Tim: bootstrapping srv80
18:24 brion: shut down srv7 db to copy
17:54 brion: shutting down srv8 db and clearing space for copy from srv7. otrs mail is being queued per mark
17:40 Tim, rainman: installing LS2 for the remaining wikis. Splitting off a new search pool, on VIP 10.0.5.11.
July 1
18:38 mark: Upgraded lily to Feisty, including a new customized Mailman, and a newer PowerDNS