Server admin log/Archive 8

September 30

19:30 jeluf: Adding new wikipedias ([1])
06:41 brion: yaseo wikis now living in pmtpa. yay!
various bits in yaseo migration plan
03:12 brion: samuel back in rotation
03:08 brion: fixed up samuel replication. need to clear out old binlogs and load yaseo again?
03:01 brion: took samuel out of rotation; unexpectedly filled up with binlogs during import
01:01 brion: holbach got forgotten; switched its master as well
00:36-00:47 brion: switched masters from samuel to adler as part of yaseo migration plan

September 29

18:13 brion: unlocking yaseo wikis to try fancier migration
18:03 brion: locking yaseo wikis for yaseo migration plan
09:25 Tim: upgraded 7z on srv31 to 4.42
03:30 Kyle: srv74 has some ram error, I will need a mce log to figure out which stick. I will RMA.
03:20 Tim: deleted text backups from amane for 2005 and Jan, Feb and March 2006. It now has 477GB free.
03:00 Kyle: holbach up again. Raid card spits out a strange error. 3ware says it is cosmetic.

September 28

23:37 Tim: srv74 was down too, took it out of external storage (cluster5) rotation. Re-added srv76, seems to be up and up to date.
23:25 Tim: noticed holbach was down, took it out of rotation
19:30 jeluf: srv34 looks better after a reboot.
19:10 brion: had to kill -9 stuck apache procs on srv34
16:35 jeluf: restarted apache on srv34. httpd was still running, but requests on port 80 were stuck in SYN_RECV.
14:20 mark: Set /proc/sys/net/ipv4/tcp_max_syn_backlog to 4096 (instead of 1024) on sq12. Let's see if this makes a difference tomorrow...
14:15 mark: Restarted all new Squids
14:00 Something funky
11:30 Tim: restarted static HTML dump

September 27

22:45 mark: Backup run of amane on zwinger (in a root screen) stalled, tar process on dest server disappeared. If anyone wants to poke it go ahead, I'm sleepy...
15:50 mark: Restarted all new squids because they were unhappy and dropping open requests
5:14 jeluf: deleted binlogs 1 to 19 on srv86
5:01 jeluf: switched rr.wikimedia.org back to knams (for European users)
4:49 jeluf: switched upload.wikimedia.org back to knams (for European users)
00:32 brion: upload.wikimedia.org no longer resolves correctly in europe
- rumored to be fixed

September 26

22:33 network issues at surfnet, removed knams from geodns
15:20 mark: Added sq1-10.pmtpa.wmnet to alrazi's /etc/hosts file because internal DNS is overloaded
14:50 mark: Started a backup run of amane's upload/wiki* on zwinger
11:30 mark: Gave avicenna a public IP using a public gateway. Assigned it an LVS service IP 66.230.200.100 to use for text squids. Put sq12 - sq20 in production.
10:40 mark: 66.230.200.101 was on sq11, which has crashed twice now and is apparently broken.
05:40ish brion: fixed caching bug I introduced a couple weeks ago which caused extra hits to MonoBook's generated CSS and JS
04:09 Tim: Assigned 66.230.200.101 (www01) to srv10. It was unreachable. Could it have been down since will went down, 7 days ago?
00:32 Tim: Gave Erik mail aliases, steward access, subscriptions to internal-l and private-l and OTRS board member role

September 25

23:56 brion: restarted dump threads on srv31. (had to reboot srv31 due to benet mount being broken and lots of hung procs)
23:25 brion: restarted lighty on benet so download.wm.o works again
21:40 mark: Some network moves. Made benet external-only and gave it a sane network config again.
17:50 mark: sq11 up as Ubuntu Squid.
16:00 domas: enabled srv84 ES with MyISAM'y blob store
14:00 domas: whoever installed json.so, did break 32-bit apaches, haha, pthread_once() was probably called twice, muhaha, for now json disabled
14:00 mark: Installed squid-2.6.4-1wm2 with an experimental performance patch by Adrian Chadd on sq13
13:20 mark: Increased cache_mem from 1500 to 2048 on ragweed and sage - they have (unused) swap space now, and have 512 MB free, so let's use it.

September 24

18:00 mark: sage up as a Ubuntu squid.

September 22

07:00 jeluf: Extended Nagios monitoring: disk space on NFS and MySQL core servers, raid status on 3ware controllers, mailq of goeje and albert
05:40 jeluf: disabled firewall on coronelli.

September 21

15:20 brion: amane out of disk space, breaking uploads (0 bytes). deleting some old crap from disk

September 20

16:30 brion: added clerks-l mailing list for enwiki arbcom clerks
15:40 brion: fixed alternate master setting for enwiki; steward Makesysop wanted to talk to enwiki on ariel, which is now a locked slave
15:30 brion: added spoofusers table for AntiSpoof extension, will build contents and start logging soon

September 19

12:45 mark: Added 66.230.200.110 to rr.pmtpa.wikimedia.org DNS pool
12:39 brion: added sq11-30 to $wgSquidServersNoPurge, there were reports of new squid ips being blocked
11:50 mark: Brought sq12 and sq13 up as emergency Ubuntu Squids, with a preliminary Wikimedia .deb
10:45 mark: will had broken network settings and is now inaccessible. Moved the squid service ip to srv6. I'll try to get 3 new squids up this afternoon to deal with the load...

September 18

22:30 mark: sq11 and sq13 up with Ubuntu server installs
19:30 mark: Replaced vlan 1 with vlan 100, which caused some downtime
22:10 Kyle: Removed auditd on srv78 for login problems.
22:00 Kyle: Reinstall of db4. It has a bad disk somewhere. Raid5 will find it.
18:30 brion: enabled experimental revision text caching in memcached (set $wgRevisionCacheExpiry to 0 to disable)
15:54 brion: fixed secure.wm.o IP alias on bart, was set to load only the old florida ip on boot
10:55 mark: Reduced A records in DNS to just one service IP per squid

September 17

12:44 mark: Updated udpmcast.py and made it send multicast packets with ttl 2 so they will pass a router between the two vlans
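The TTL change in the 12:44 entry can be illustrated with a minimal sketch. This is not the actual udpmcast.py code; the group address and port are placeholders invented for illustration. It only shows how a UDP socket's multicast TTL is raised from the default of 1, which keeps packets on the local subnet, to 2 so they can cross one router hop.

```python
import socket

# Hypothetical values for illustration only; not taken from udpmcast.py.
MCAST_GROUP = "239.0.0.1"
MCAST_PORT = 9999


def make_multicast_sender(ttl=2):
    """Return a UDP socket whose outgoing multicast packets carry the given TTL."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # Default multicast TTL is 1 (local subnet only); 2 survives one router hop.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    return sock


sock = make_multicast_sender(ttl=2)
# Read the option back to confirm what the kernel will use.
effective_ttl = sock.getsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL)
# A real sender would now call: sock.sendto(payload, (MCAST_GROUP, MCAST_PORT))
sock.close()
```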

September 15

20:56 brion: enwiki db update failed. ariel crash?
20:23 brion: applying page table update on pmtpa wikis
18:40 mark: Disabled passive IGMP snooping and enabled PIM DM routing on csw5 to fix the purging problems.
15:00ish brion: squid purging is broken apparently. Need help investigating the UDP multicast stuff
09:15 brion: lists online
08:19 brion: taking mailing lists down for a bit to back up and upgrade
08:00 brion: updated leuksman.com to PHP 5.2.0RC4 and APC 3.0.12p2

September 14

21:00 domas: started rebuilding srv84(myisam) from srv86 - disabled writes to cluster7 for a while.
11:39 brion: srv84, srv116 back online and resynced. restored to nodegroup.
11:25 brion: restarted lvsmon on dalembert with srv84 and srv116 commented out of 'apaches' dsh group
- don't forget to turn them back on some day after they're fixed; they don't respond to ssh, waiting for reboot

September 13

23:00 mark: Removed BGP announcement of Wikia's route on csw4-pmtpa, to disable incoming load balancing for them.
22:30 mark: Rebooted csw5-pmtpa with new firmware. Moved avicenna and alrazi onto it, and reverted back to the Cisco - Foundry seems to insist on untagged default vlan 1
19:00 brion: created sd.wikinews

September 12

22:08 brion: continuing search index builds on maurus
20:30 ævar: changed the logo on hrwikiquote and the wgSitename, didn't sync it (properly) though because all the servers hate me, boo hoo.
19:06 brion: removed harris and borked goeje from internal dns; added harris.wm.o on external dns
18:50 brion: restored port listen 80 on bart to prior config, with search.wm.o disabled
18:00 brion: reclaiming harris for search.wikimedia.org
15ish brion: moved search.wm.o to a separate ip, leaving bart free
13:06 brion: temporarily disabled port 80 on bart.wm.o, testing very slow response on secure.wm.o/ticket.wm.o

September 11

21:00 mark: avicenna broke, probably because of overload. Moved LVS services to alrazi and dalembert. avicenna is back after a reboot, on standby.
19:30 jeluf: killed jobs runner loop on srv81, srv82 and srv83. Load on ES servers is too high
18:24 brion: paused search index generator at frwiki; load on ES is too high
11:50 brion: moved www.wikimedia.org country portals into svn [2]
09:54 brion: installed new texvc on yaseo, built from srpms
09:43 brion: new texvc seems to be missing at yaseo

September 10

11:00 domas: installed dvipng/texvc RPMs (at /h/w/src/, as well as http://dammit.lt/tech/packages/) on all boxes.

September 9

03:00 Kyle: rewired scs. A bunch of servers don't console, I'm going to try to fix a bunch. A few are also not on it yet. I'm not quite finished.

September 6

13:28 brion: search index build restarted; weird mono thread-creation problem had hung it previously
~11:30 Tim: installed report.py and cachemgr.cgi on amaryllis
10:40 brion: restarted search build process on maurus, explicitly selecting db1/db3 as slaves this time
09:38 brion: restarted enwiki dump
09:14 Tim: brought yf1017 back into service as an apache (dual purpose with search)
09:05 Tim: fixed gmetad writes on amaryllis
~08:50 Tim: got gmetad, ganglia web working on amaryllis
04:30 Kyle: sq1 back up. srv110, 62, and 117 up.
01:00 river: srv62 broke, removed from memcached config

September 5

~23:00 domas: innodb deadlock monitor deadlocked something on ipblocks table, ..
~21:00 domas: restarted samuel and ariel with high TCP/IP backlogs and no name resolution, not sure if got rid of connection error problems completely
~15:00 Tim: installing JSON PECL module
10:30 brion: started pmtpa and yaseo dumps
08:15 domas: lighty on mail.wikipedia.org

September 4

22:45 Kyle, Mark: Put sq11 on the APC, which was empty and unconnected, but now on csw4-pmtpa:0/3 and SCS:9. sq11 is on SCS:14.
22:00 - 00:30 Kyle, Mark: Massive network moves... including all DBs and search servers, so go wild, Brion...
22:30 mark: Moved LVS on dalembert (Squids -> Apaches) to avicenna so dalembert could be moved.
21:10 mark: Removed LVS services on avicenna which were still using the old IPs.
21:00 mark: Deployed a new Squid.conf with all references to the old IP range replaced with new IPs.
17:00 mark: Increased vandale's COSS cache_dir to 12 GB.
12:49 brion: installing ploticus on yaseo boxen
somewhere... domas did something... which released a lot of email
10:09 brion: some or all mail appears to be stuck? not getting bugmail for last few hours, self-mails seem not to get delivered

September 3

20:25 domas: redirecting all domains mail to mail-eater, spammers used wiktionary.org... goeje... ergh...
15:11 brion: setting up hourly search index synchronization in pmtpa
08:58 brion: starting search-rebuild on maurus again; accidentally broke it last night

September 2

10:35 river: restarted mailman on goeje again, was hung
09:00 jeluf: Running runJobs.php on yf1008

September 1

16:13 brion: recompiled mwsearch on maurus, added /etc/cluster, rewrote search-sync, started a full index rebuild (search-rebuild via search-rebuild-wiki) pulling from dbs.
- auto syncs not set up yet, but will want to do those in the hourly restart, probably
- running the builds in a screen session for now. will want to run it continuous later
15:46 brion: added maurus back to mediawiki-installation group, set up a copy of php to do dump+index tests
14:50 brion: set up rsync server on maurus so that search indexes can be updated more easily
14:30ish brion: copied fedora 2 files back onto albert's yum mirror; they're gone from gatech mirror
02:57 river: restarted mailman on goeje, was stalled

August 31

15:10 brion: restarted sendmail on yaseo apaches. something was stuck somehow in them, causing the shell-outs to /usr/lib/sendmail to hang. since mail is sent *during* a db transaction in preference save, this caused some db locks also.
14:39 brion: added proctitle on rest of yaseo apaches
10:53 brion: yaseo problem appears to be stuck locks on user table, but i'm not sure why. trying to get the mighty domas on the case \o/
10:29 brion: rotated bot-heuristic.log; over 2gb, broke 32-bit boxes
09:50 brion: added proctitle on yf1009
09:09 brion: yaseo seems rather sluggish
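The 10:29 entry looks like the classic large-file limit: a signed 32-bit file offset tops out just under 2 GiB. That this was the exact failure mode is an assumption, since the log does not say. The helper name below is made up for illustration; the arithmetic is standard.

```python
# Largest value a signed 32-bit integer (e.g. a 32-bit file offset) can hold.
MAX_SIGNED_32 = 2**31 - 1


def fits_in_32bit_offset(size_bytes):
    """True if a file of this size is addressable with a signed 32-bit offset."""
    return size_bytes <= MAX_SIGNED_32


# The limit expressed in GiB: one byte past MAX_SIGNED_32 is exactly 2 GiB.
limit_gib = (MAX_SIGNED_32 + 1) / 2**30

just_under = fits_in_32bit_offset(2**31 - 1)   # last addressable size
two_gib = fits_in_32bit_offset(2 * 1024**3)    # a log that has reached 2 GiB
```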

August 30

13:36 Tim: started jobs-daemon on srv91-100
12:50 mark: Testing Squid's COSS storage file system on vandale
11:22 brion: fixing ^/wiki.phtml$ regexes in apache config to ^/wiki\.phtml$
10:39 Tim: and on srv86-90
10:30 Tim: started /h/w/b/jobs-daemon on srv81-85
10:05 Tim: installed daemonize-1.4-5_wm on pmtpa apaches
10:00 brion: added boardvote2006 to zedler's list of dbs not to replicate. may need to restart mysql to take effect. (replication is borked)
06:00 Kyle: Finished moving sq1-10 to their new rackspace in prep for the new squids.
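The 11:22 regex fix matters because an unescaped dot matches any character, so the old pattern accepted paths it was never meant to. A minimal sketch using Python's re module (Apache uses its own regex engine, but the dot behaves the same way in this case); the bogus test path is made up for illustration.

```python
import re

loose = re.compile(r"^/wiki.phtml$")    # '.' matches any single character
strict = re.compile(r"^/wiki\.phtml$")  # '\.' matches only a literal dot


def matches(pattern, path):
    """Return True if the compiled pattern matches the whole path."""
    return pattern.match(path) is not None


loose_hits_bogus = matches(loose, "/wikiXphtml")    # unintended match
strict_hits_bogus = matches(strict, "/wikiXphtml")  # correctly rejected
strict_hits_real = matches(strict, "/wiki.phtml")   # intended match
```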

August 29

20:45 mark: We got some evidence that routing for yaseo has been fucked all day, which may explain some reports we've been getting. Rerouting all traffic away from yaseo to florida just to be sure.
12:38 Tim: Fixed password and hostname configuration for dryas. It's probably been broken, getting virtually no requests, since we got it.
10:25 brion: amane offline for a few minutes due to network funkage, back alive now
00:00 Kyle, River, Mark: Moved a bunch of servers around, both physically and onto the Foundry (csw5-pmtpa). Moved csw4-pmtpa's L2 link from csw1-pmtpa to csw5-pmtpa and gave csw4-pmtpa's HSRP group 2 a higher preempted priority in order to balance traffic better while migrations are in progress.

August 28

15:50 Tim: fixed yaseo search, used an IP address instead of hostname
13:35 Tim: reinstalled srv50 and put it into rotation
13:20 mark: Made yaseo squids temporarily diskless while yf1001's outage lasts.

August 24

15:30 mark: Ran scap to disable sending of HTTP ETag (bug #7098)
13:30 mark: Fixed the PowerDNS setup on Browne, and added a beta.wikiversity.org DNS record, closing bug #7094.
11:00 mark: Reducing cache_dir sizes for yaseo from 20 GB to 6.

August 23

23:00 mark: Upgraded yf1000, yf1001 and yf1002 back to Squid 2.6.
22:53 mark: yaseo Squid problems were simply caused by the broken DNS resolvers: Squid needs a -HUP before it rereads /etc/resolv.conf. Also replaced lvsmon by PyBal for Squids on yf1018.
11:53 Tim: added version switch to squid.conf.php
11:37 Tim: removed yf1003, yf1004 and yf1019 from LVS rotation temporarily
~11:25 Tim: squid was mostly broken on yaseo, apparently not working at all for forwards to the local apaches, and very slow for forwards to pmtpa. It was just timing out. Lvsmon was reporting all squids down, it was just in "emergency mode", leaving a few in rotation for debugging. Downgraded squid to 2.5 on yf1000, yf1001 and yf1002, this fixed the problem on those squids. Lvsmon is now leaving those three in rotation and flapping the rest. The configuration file is locally hacked for the downgrade, will fix that shortly.
07:00 Kyle, Domas: db1-4 physically moved to redundant power circuits.
06:50 Kyle: Moved the load balancer. Right now unracked till I get rails for it.

August 22

20:30 knams back up, pointing EU traffic back at knams
19:35 mark: Kennisnet doesn't know what's wrong yet, and when it will be up again. Redirecting knams traffic to pmtpa.
18:45: Kennisnet went down
16:00 - 17:30 mark: Deployed squid-2.6.STABLE3-1wm on all Squids.
09:40 brion: migrating old dump data from benet to free space
09:30 brion: secure.wm.o whining again, about 10.0.2.7[4-6]

August 21

12:32 brion: someone appears to have fixed db permissions so secure.wm.o works again. thanks, mysterious person (possibly domas) who didn't log it! :)
00:59 brion: let's not forget to fix the database permissions today so secure.wm.o and manual db work from bastion hosts work again. Is it safe to update ourusers.php and re-source the SQL, or has this changed?

August 20

15:28 brion: srv58 appears dead, but was listed in memcached cluster. this maybe broke some sessions. switched to double-loading srv59.
11:25 brion: restarted postfix on leuksman.com, mail was down for some reason so svn commit notices weren't being mailed
10:11 brion: noticed secure.wm.o is not happy
7:53 river: set goeje's relayhost to mayflower, so it can send mail while reverse DNS isn't working

August 19

22:30 mark: Replaced Tim's modified PyBal by the current code in SVN, which already supported multiple services in one PyBal instance and configuration file. Tim's modified code is in /usr/local/pybal.old, old config is in /etc/pybal/old/.
22:10 mark: Submitted support request to Verio to change delegation / nameserver glue record IPs on wikimedia.org.
21:48 mark: Added IP 66.230.200.16 to browne, as the new ns0.wikimedia.org IP. Removed bogus 66.230.200.207 and 66.230.200.208 IPs, and killed the old firewall which didn't let any of the new IPs through.
12:27 Tim: Modified pybal to accept an optional configuration file name on the command line (pybal.py -f <filename>). Started a second instance on avicenna with the configuration file /etc/pybal/pybal-newip.conf, to load balance for 66.230.200.228. Modified source is in ~tstarling/pybal and avicenna:/usr/local/pybal.
10:32 river: starting to migrate services to the new network

August 18

21:30 jeluf: added 66.230.200.x aliases to all 207.142.131.x hosts
19:59 brion: set dbs back to read-write
19:55 brion: apparently cogent has returned our old ips until monday
19:50 brion: route to old ips suddenly back for many people. waiting to hear more details
18:50 brion: we've been assigned new ip space, people are trying to figure out how to attach stuff to it
17:50 brion: kyle got in touch with charles, apparently cogent deleted or re-advertised our ip space. they're trying to figure out what happened and fix it
17:30 brion: pmtpa more or less inaccessible. called powermedium to have them investigate. texted kyle to see if he's available to look

August 17

22:59 brion: resynced php files on yf1017; bad copy had extra language files leftover, breaking th.wikipedia.org
15:00 domas, river: fixed broken mail configuration which caused delays in mail delivery; now returning 55x for most clients for more serious reasons instead of 450s on DNS failures.

August 16

19:07 brion: resyncing common on srv21, had old wiki list
18:01 brion: upgraded leuksman.com to apache 2.2.3 and php 5.2.0rc1
06:43 brion: restarted enwiki dump, it got eaten by dead mysql servers
01:30 brion: restarted apache on bart in response to reports of very slowness. tim found threads stuck in futex state. boooo. evil!

August 15

21:40 brion: updating interwiki table for wikiversity
21:13 brion: refreshLinks on frwiktionary, see if it trims bugzilla:7023
20:19 brion: added wikiversity-l
16:04 brion: polish file dump available
14:34 river: fixed upload dirs for wikiversity. added account on zedler for brendang (OpenSolaris developer) to work on dtrace scripts.
06:55 brion: adding enwikiversity, betawikiversity

August 14

19:40 jeluf: restarted mwsearchd on rabanus
18:11 brion: fixed dump settings for comcom, removed stats
13:50 mark: Set up iBGP between csw4-pmtpa and csw5-pmtpa according to BGP.

August 13

19:30 jeluf: installed djvulibre in yaseo. Was already installed in pmtpa.

August 12

23:24 brion: Enabling email notification for watchlists on meta. amgine reminded me about it
21:28 brion: added stewards-l

August 11

22:09 brion: disabled wm06reg to ensure rails is not exposed until fixed

August 10

20:49 brion: running enwiki dump from db3, domas fixed it hopefully
12:03 river: added new mail gateway, mail-eater.wikimedia.org on albert (via LVS on avicenna) for incoming mail, so goeje doesn't die when it has to reject large amounts of mail
00:00 mark: (Temporarily?) enabled captcha for dewiki, by request of DaB and elian due to a vandal attack going on

August 9

14:49 brion: copying captcha image store to yaseo; was enabled on jawiki but wasn't set up properly yet. creating accounts on ja should work again

August 7

05:00 jeluf: bugzilla complained about broken data/versioncache. Removed empty /srv/org/wikimedia/bugzilla/data/versioncache, bugzilla fine again.

August 6

20:20 jeluf: mailmanctl restart on goeje
19:40 jeluf: added cronjob to automatically update pascal's recipient map, /home/wikipedia/bin/UpdateBackupMX. The job runs every 15 minutes.
10:00 jeluf: removed all MAILER-DAEMON mails (about 30'000) from pascal's mailq
9:30 jeluf: added relay recipient map to pascal's mail configuration. It's generated by /home/wikipedia/bin/CreateRelayRecipientMap on goeje in /etc/postfix/recipient_map. It needs to be copied to pascal and processed by postmap. Todo: Automate this process

August 5

18:05 jeluf: restarting postfix on pascal
14:35 brion: postqueue -f on goeje just in case some old bits got stuck in queue

August 4

18:00 jeluf: added srv81-86 to node group ext-stores, added srv83 and srv86 to ext-store-masters
17:45 jeluf: only 8GB free on srv76 => removed binlogs 100 to 149

August 3

22:29 brion: db2 and db4 still have broken grants, missing admin user. this broke enwiki dump
22:16 brion: started dumps on pmtpa and yaseo
13:20 Tim: fixed Special:Version
13:10 Tim: Fixed mounts on srv13. Ran away from wikimania to avoid getting lynched by angry tired devs
10:02 brion: forgot actual scap was different. blah. removed extra files manually from yaseo
09:38 brion: adjusting scap15-2 on yaseo to use --delete as well as --delete-after in the hopes it'll properly delete the now-removed language files
07:34 jeluf: Removed LanguageTh.php from yaseo apaches. Everything seems to be fine now (i.e. no user complains about problems)
06:38 jeluf: More wikis seem to work, some still broken. LanguageTh doesn't have $wgNamespaceNamesEn defined, so the + at its beginning fails. Looks like the codebase is in some undefined state.
06:24 jeluf: removed definition of

class LanguageUtf8 extends Language {}

in Language.php. I hope it doesn't break anything...

- note that that could break things. recommend putting it back eventually, but some language files should have the remaining LanguageUtf8 references removed
06:10 jeluf: Users complain about empty pages in YASEO, error messages regarding redefinition of LanguageUtf8 class, scapping: Not better. Still get

Aug  3 06:18:11 wikif1010 httpd[11932]: PHP Fatal error:  Cannot redeclare class languageutf8
  in /usr/local/apache/common-local/php-1.5/languages/LanguageUtf8.php on line 38

in /var/log/messages

August 2

22:30 domas: cool guys hacking at OLPC
03:14 mark: Built a squid 2.6.STABLE2 RPM and installed it on clematis

August 1

4:10 jeluf: switched enwiki to read only, ariel out of sessions. domas killed hanging DB queries, switched to read/write at 4:20

July 31

18:40 JeLuF, mark: both goeje and pascal didn't respond on tcp port 25, both complaining about SYN floods. Restarted Postfix on both, which "fixed" the problem.
00:22 brion: noticed someone returned db4 to service, but didn't log it. Was it properly recloned or is it still broken? There are reports of database locking, which would be caused by detection of lagging slaves. There may or may not be some laggy problems.

July 30

22:48 brion: fixed resolv.conf on other search boxes, synced wikimania search db
22:20 brion: rebuilding wikimania2006 search db. corrected /etc/resolv.conf on maurus

July 29

23:15 brion: had to restart yaseo search server again
22:27 brion: mounted /mnt/math on srv11,srv12,srv14,srv15,srv16,srv17,srv18
04:30 brion: search daemon was down on yf1017. restarted it
00:48 brion: added oversight-l

July 28

04:43 brion: yaseo math dir seems to have vanished. created upload/math and symlinked it to /mnt/math to match the symlink at /home/wikipedia/common/math

July 27

20:50 brion: trying to remount /mnt/upload3 on srv12-19, mostly were missing. amane mountd has some problem
05:19 brion: srv14 and srv20 had time about 50 minutes off, ntp hadn't properly started on boot. srv14 had to have /etc/ntp/step-tickers adjusted (was zwinger, now 10.0.0.200). may be multiple edits with wrong timestamp

July 26

21:41 brion: db4 still has sync problems, taking out of rotation
20:00 Kyle: During the server move earlier, something I did is causing ganglia to mis-report the down'ing of a bunch of apaches.
20:00 brion: set apache to start on boot on albert
19:59 brion: fixed nfs mounts on srv19, srv20, got them back in service
19:50 brion: srv19 and srv20 have nfs mounts broken. albert's up but not running apache
08:30 Kyle: srv11-20 are physically moved. The netgear has a new uplink to csw5-pmtpa on Patch B.

July 25

21:50 jeluf: Built djvulibre FC4 package, installed on remaining hosts. Added to bootstrap script
21:23 jeluf: Built djvulibre package for FC3. Tried to install on all "mediawiki-installation"-servers. Install failed on FC4.

July 24

21:40 brion: added custom php.ini on bart for secret project extension
19:25 brion: enabled DPL on *wikibooks
00:20 brion: fixed thumbnail generation on wikimania2006wiki
- had to tweak thumb-handler.php on amane to special-case the site prefix. some other sites may also require this

July 23

22:20 brion: disabled page creation for anons on fawiki[3]
21:09 jeluf: rotated botquerylog, sent log to yurik
10:00 Tim: fixed nagios.wikimedia.org
08:14 Tim: brought db2 and db4 back into rotation
08:10 Tim: can't log in to srv78 but it's still serving HTTP. Took it out of rotation.
~07:00 Tim: copied data directory from db2 to db4

July 22

09:30 Tim: reniced WikiCounts.pl so that the job queue gets priority
09:15 Tim: restarted job threads on srv42
05:45 brion: started yaseo dump; installed local dbzip2 on amaryllis

July 21

23:34 brion: fixing ipb_create_account on old user blocks, was incorrectly set to 0
16:40 Tim: live patch to profile only requests above a certain minimum request time
~10:00 Tim: copied most lost revisions from db4 using fixSlaveDesync.php, fixed 96 broken page_latest fields with some handwritten queries starting with "select page_id,rev_page,page_latest from page,revision where page_latest=rev_id and rev_page<>page_id". That seems to have dealt with most complaints.
09:52 brion: took db4 out of slave rotation again
09:50ish brion: tim took back read-write to attempt resync in background
09:13 brion: set db1 and db3 read_only at runtime as well, for good measure
09:05 brion: added read_only to my.cnf-core-slave-13G and regenerated master copies of my.cnf
08:49 brion: in grievous violation of proper replication etiquette, both db4 and db2 DID NOT HAVE read_only SET AND WERE THUS DANGEROUS. db4 has been corrupted. db2 appears to be a clean copy of ariel. Have manually set read_only true (runtime, did not check config)
08:40 brion: taking read-only on enwiki. site appears to be fucked; most edits were going to a slave over last several hours? did someone leave the slaves misconfigured to accept writes? what the fuck?
08:36 brion: odd edit lag reported in last hour, possibly due to ariel being commented out in db.php. fixing... hopefully
08:11 brion: expanded filetypes for wikimania2006wiki
07:52 brion: db.php briefly had a $wgReadOnly set on it for enwiki, for no apparent reason. Possibly accidentally re-saved by someone after/during/with some maintenance the other day?
05:49 brion: added biruni back to apaches node group
05:29 brion: running apache setup on biruni
04:45 jeluf: Blocked UCD search bot's user agent at squid level
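The consistency check quoted in the ~10:00 entry can be tried on a toy dataset. The sketch below runs that same query against an in-memory SQLite database standing in for MySQL; the two-column tables and the sample rows are simplifications invented for illustration, not the real MediaWiki schema. A row is reported when a page's page_latest points at a revision that belongs to a different page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Toy schema: only the columns the query touches.
    CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_latest INTEGER);
    CREATE TABLE revision (rev_id INTEGER PRIMARY KEY, rev_page INTEGER);
    -- Page 1 is consistent; page 2's latest revision belongs to page 3.
    INSERT INTO page VALUES (1, 10), (2, 20);
    INSERT INTO revision VALUES (10, 1), (20, 3);
""")

# Query text as quoted in the log entry.
broken = conn.execute(
    "select page_id,rev_page,page_latest from page,revision "
    "where page_latest=rev_id and rev_page<>page_id"
).fetchall()
conn.close()
```

Here broken holds one tuple per inconsistent page as (page_id, rev_page, page_latest), which is the list a repair script would then work through.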

July 20

21:05 jeluf: added info-cs alias, pointing to otrs
19:00 jeluf: added block for the search bot to squid.conf, deployed.
18:34 brion: lucene heavily overloaded last couple hours by some stupid bot. added a quickie block for it

July 19

mark: sq2 seems down. Kyle, can you look into it?
17:09 brion: migrating old dump files from benet to amane

July 18

22:07 brion: poking bart again
19:11 brion: poking bart's php config

July 17

20:52 brion: starting next enwiki dump
20:something db slowness from some bad group bys; domas fixing it

July 16

08:24 brion: added libwmf to install-imagemagick, silly dependencies...
08:01 brion: fixed /etc/hosts on srv78, someone had hardcoded wrong ip for mirror address
07:54 brion: removed wfLogProfilingData() from ProfileStub.php; seems to be now in GlobalFunctions rather than the profiler, and conflicts
07:47 brion: got yum mirror set up on albert, hopefully
07:37 brion: trying to fix albert. did a yum update, then setup-general which broke the yum configuration and haven't been able to bring it back to life yet
~05:00 Tim: db4 was down for an hour or so due to a previous configuration sync, changing the ibdata file size, combined with a segfault

July 14

22:45 jeluf: srv6 also restarted after fixing resolv.conf
22:25 brion: jeluf fixed dns on srv9 and restarted squid, seems happier. srv6 also slow
22:10 brion: srv9 seems to be very slow, trying to get in to poke it
21:14 brion: moving stats.wikimedia.org to zwinger
20:52 brion: since albert's down, setting up another fedora mirror on zwinger
20:31 brion: srv78 setup...
20:13 brion: reinstalling apc on srv12, srv19, srv24, was broken (0-byte .so)
17:20 mark: Restricted DNS queries to internal subnets on zwinger
at some point: albert broke again and nobody fucking logged it
06:something brion: with albert down, internal external dns is down. tim's poking it
06:38 Kyle: albert reinstalled. Now on port 25 of csw1-pmtpa
05:50 Kyle: srv78 has a new os.
05:44 brion: running bad image name fixes

July 13

18:22 brion: running message rebuilds for hu*, language file was updated recently
09:00 Tim: moved 10.0.5.7 to srv1 ready for reinstallation of albert
08:33 Tim: put srv117 into rotation
08:08 Tim: unmounted all albert mounts, removed them from /etc/fstab
08:00 Kyle: csw5-pmtpa is now on the scs.
07:45 Tim: moving fedora mirrors to srv81 ahead of reinstallation of albert

July 12

21:16 brion: running post-check on bad titles to make sure they're all dead
20:15 brion: running bad title cleanup; 99 wikis had at least some bad titles, at most a couple dozen
19:22 brion: running non-invasive bad title checks on all pmtpa wikis
08:57 Tim: installed apache/php/mw etc. on zwinger. Moved 207.142.131.234 back to zwinger.
~07:10 Tim: all NFS shares on 10.0.0.4 unmounted except srv31, srv42 and benet. Removed /home/wikipedia/shared/math entry from all fstabs, it has been unmounted on all apaches for a few days with no trouble.
06:52 Kyle: moreri is up with an ip of 10.0.0.32
06:47 Kyle: zwinger has ip 10.0.0.34 and is on the scs.
05:39 Kyle: srv110 rebooted after audit removed. It now allows logins.
05:36 Kyle: srv68 and srv78 rebooted. Raid controller crashes.
05:13 Kyle: db3 is running with the older kernel. Let's see how it does.

July 11

15:58 Tim: Belatedly fixed annoying apache configuration warnings:

[Tue Jul 11 15:53:26 2006] [alert] httpd: Could not determine the server's fully qualified domain name, using 127.0.0.1 for ServerName
[Tue Jul 11 15:53:26 2006] [warn] NameVirtualHost *:80 has no VirtualHosts
[Tue Jul 11 15:53:26 2006] [warn] NameVirtualHost *:80 has no VirtualHosts

15:45 Tim: synchronised php.ini files
~13:50 Tim: reinstalled PHP on srv2 and goeje, installed srv118 and srv119
10:45 Tim: noticed srv78 was down, set up srv67 to take its place as a memcached server
09:00-10:13 Tim: new recentchanges index
08:43 Tim: new blocking code live
07:14 Tim: disabled logwatch on suda, it had filled the root partition yet again. 7.7 GB in /var/cache/logwatch
04:35 brion: knams inaccessible for a few minutes

July 9

23:48 brion: disabled apc on friedrich; test drupal/civicrm doesn't seem to like the late binding problems
22:41 brion: adding office2 temp cname on wm.o
05:57 Kyle: Patches installed for the Foundry switch. Let's start discussing Foundry Crossover Procedure

July 8

08:07 domas: set ldap.conf on suda to check peer certificates :-)
08:05 domas: replaced srv1's expired cert with another one valid for 3 years, stored its public key in suda:/etc/openldap/cacerts/
08:00 domas: set ldap.conf on suda not to check peer certificates.
06:36 Kyle: Foundry switch is running. The first management module's console port is temporarily plugged in where asw3-pmtpa was. Its ethernet port is plugged into port 46 of asw3. (I couldn't get the console to come up.)
06:24 brion: killed & restarted apache on srv13; lots of fatal errors

July 7

22:50 brion: fixed scap, hopefully, so that it updates the SVN revision number properly
06:51 Kyle: took down, moved, and put back up srv120 to make room for the new switch that hasn't arrived yet.

July 6

22:29 brion: enabling DPL on de.wiktionary

July 5

15:54 Tim: fixed full root partition on suda
07:43 Tim: bringing db2 back into service with a warmup load
05:35 Tim: set $wgGenerateThumbnailOnParse = false

July 4

22:17 brion: db2 broke; asking for reboot
04:37 Tim: set logo on test.wikipedia.org

July 3

17:08 Tim: set $wgJobRunRate=0 everywhere, srv42 appears to be doing a sufficient job

July 2

21:40 brion: finished installing mono & mwsearchtool on srv31, restarted dump there
21:33 brion: started remaining dump threads on benet and srv31
16:30 brion: resolved 'Wikipedia:' title conflicts on lawiki

July 1

22:55 brion: experimentally activating DynamicPageList on en.wiktionary.org, it's been requested
21:18 brion: added some otrs addresses (info-en-[coqrv])
19:05 brion: zhwiki temporarily broken by a bad update to zh_cn lang file
18:37 brion: running messages updates on all wikis
07:30 Kyle: zwinger up with software raid 1 and FC4. (I sent an email with more detailed info to private-l)