Server admin log/Archive 11

December 31

  • 24:13 brion: enabled TitleBlacklist ext

December 30

  • 20:00 domas: livehacked out rounded corner references in schulenberg main.css - files were 404 :) Tim, that's outrageous! :-)
  • 15:00 domas: blocked & entirely.
  • 14:00 domas: noticed that we still maintain separate squid caches due to access encodings not only for crawlers, but for IE/Moz too. Patch at http://p.defau.lt/?C9GXHJ14GWHAYK1Pf0x9cw

December 28

  • 21:44 brion: running cleanupTitles on all wikis (HTML entity and general issues)
  • 15:53 jeluf: shut down apache on srv14, srv15, srv17. CPU0: Running in modulated clock mode. Needs to be checked.
  • 12:58 mark: Blocked broken Recentchanges&amp requests on the frontend squids (ACL badrc)
  • 00:28 brion: wikibugs irc bot had been idle for about 2 days; on investigation its mailbox was disabled from the mailing list due to bounces. unblocked it.
  • 00:12 brion: Special:Userrights restricted mode available for bcrats: add sysop, bureaucrat, bot; remove bot

December 27

  • 22:21 brion: fixed Cite regression which produced invalid error message for ref names containing integers
  • 21:58 brion: scapping updates
  • 00:58 brion: stopped the dump worker thread on srv31 for the time being
  • 00:56 brion: either multiple dump threads are running, or the reporter's really broken from benet being full. :)

December 26

  • 21:50 jeluf: installed lighttpd on browne (irc.wikimedia.org), redirecting all requests to meta. See bug 11792
  • 18:04 brion: starting rsync benet->storage2 again... it borked the first time. :P

December 25

  • 06:00 jeluf: rsync completed. Symlinked dewiki upload dir to /mnt/upload4

December 24

  • 23:30 jeluf: disabled image uploads on dewiki. Rsync'ing images from amane to storage1.
  • 17:40 brion: configured stub fr and ru.planet.wikimedia.org

December 23

  • 21:14 brion: setting up private exec.wikimedia.org wiki
  • 06:48 domas: restarted all memcacheds with 1.2.4, complete cache wipe caused interesting caching issues

December 22

  • 19:20 RobH: srv189 mainboard replaced, SCAP run.
  • 19:00 RobH: mchenry memory replaced, back online.
  • 11:15 jeluf: configured mchenry to use srv179 instead of srv7 for OTRS mail address lookups.
  • 11:00 jeluf: moved otrs DB from srv7 to srv179. Bugzilla and donateblog are still on srv7.

December 21

  • 19:45 jeluf: restarted lighttpd on benet since dewiki and frwiki were not reachable. I don't understand why it helped, but it did.
  • 17:06 brion: benet was rebooted; migrating all data to storage2. started monitor thread but no additional worker threads for the moment.

December 20

  • 00:08 brion: benet went from being sluggish to being totally unresponsive. May need to reboot it; may have disk problems or something.
  • 20:43 brion: migrating additional dump dirs to storage2 to balance disk more
  • 20:20 jeluf: benet's disk is full. Reduced root reserve to 100 blocks
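
A minimal sketch of how a root reserve like that is usually lowered with tune2fs; the device path is a placeholder, not taken from the log:

    # show the current reserved block count (device path hypothetical)
    tune2fs -l /dev/sda1 | grep -i 'reserved block count'
    # reserve only 100 blocks for root instead of the default percentage
    tune2fs -r 100 /dev/sda1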

December 19

  • 20:13 mark: 14 of the 15 new knams squids pushed into production; one is broken.

December 18

  • 02:43 brion: download.wikimedia.org wasn't showing the dewiki dir; possibly confused about the symlink to the other mount. Restarting lighty seems to have resolved it.
  • 02:34 brion: enabled gadgets extension sitewide

December 17

  • 20:44 brion: working on patching up the broken dumps... Benet crashed on srv6, killing one worker thread and the monitor thread. Dumps continued when benet came back online, but weren't reported. Monitor thread is now restarted, showing status. Additionally, freeing up space on storage2 so dewiki dump can run again. ... trimming thumbnails from the storage2 backup of amane uploads
  • 03:39 brion: updated wgRC2UDPAddress for some private wikis

December 16

  • 13:39 mark: Blocking HTTP OPTIONS requests on all frontend Squids

December 14

  • 14:55 mark: Moved patch between csw1-knams (in J-13) and lily (J-16), old one may be bad?
  • 14:00 mark: Installed new SFP line card in slot 7 of csw1-knams. Brought up link to AMS-IX.
  • 6:00 jeluf: started move of thumb directories of the commons from amane to storage1. Dirs get rsynced and symlinked in small steps. When the entire thumb dir is moved, only one symlink will be left. It would be nice to have the thumb dirs independent of the image dirs, but MediaWiki currently doesn't have a config option for this.
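
A rough sketch of the rsync-then-symlink step described in the 6:00 entry above; directory names are illustrative, not the exact paths used:

    # copy one thumb subdirectory to the new storage server
    rsync -a /mnt/upload3/wikipedia/commons/thumb/a/ /mnt/upload4/wikipedia/commons/thumb/a/
    # replace the old directory with a symlink to the new copy
    mv /mnt/upload3/wikipedia/commons/thumb/a /mnt/upload3/wikipedia/commons/thumb/a.moved
    ln -s /mnt/upload4/wikipedia/commons/thumb/a /mnt/upload3/wikipedia/commons/thumb/a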

December 13

  • 23:00 jeluf: Mounted storage1's upload directory on all apaches. Moved /mnt/upload3/wikipedia/commons/scans to the new directory. Other directories have to follow.

December 12

  • 19:25 jeluf: restarted replication on srv127

December 11

  • 16:12 mark: mchenry under incoming connection flood, increased max connections. This hit srv7's mysql max connection limit of 100, upped that to 500.
  • 9:40 domas: thistle taken out for db8 rebuild

December 10

  • 15:05 Rob: db8 is reinstalled, and needs DBA attention.

December 9

  • 14:10 jeluf: External storage cluster 14 (srv139,138,137) enabled, replacing cluster 11 in $wgDefaultExternalStore

December 6

  • 21:38 brion: fixed redirect from sep11.wikipedia.org; it pointed to sep11memories.org, which does not exist (though some browsers let it pass). Now trying the real www.sep11memories.org.
  • 16:18 Rob: srv81 kernel panic. reboot, back online.
  • around 5: benet was found dead, PM rebooted it, server is back up, but lighttpd tries to load modules from /home/midom/tmp/thttpd/*, which does not exist. Needs to be checked.

December 5

  • 14:25 brion: set up rate limits for rollback, awaiting broader rollback permissions

December 4

  • 13:50 mark: avicenna collapsed under load
  • 13:45 mark: Repooled knams
  • 12:30 mark: Depooled knams, routing problems in surfnet

December 3

  • 17:50 mark: Exim on lily was still bound to the old 145. IP for outgoing messages, so couldn't send out to the world. Fixed.
  • 17:28 mark: Removed monitorurl options from Squid cache_peer lines now that a Squid bug has been fixed where it wouldn't detect revival of dead parents.
  • 13:05 mark: Downgraded knsq3 kernel to Feisty's kernel, started frontend squid
  • 10:10 mark: Cut the routing for 145.97.39.128/26 on csw1-knams.

December 2

  • 05:24 Tim: Master switch to ixia complete. New binlog pos: (ixia-bin.001, 2779)
  • 05:10 Tim: starting response
  • 04:58 db8 down (s2 master)

December 1

  • 21:30 mark: Recompiled 2.6.20 kernel on mint (upload LVS host at knams) with a larger hash table of 2^20 entries instead of 2^12. %si CPU usage seems a bit lower, will be interesting to see it during peak traffic. Also tested the newer 2.6.22 kernel in Ubuntu on sage, and it appears 2-3 times slower in this respect!
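
The hash table size above is a kernel build-time option; assuming the stock IPVS setting rather than a custom patch, the relevant .config fragment would look roughly like this:

    # IPVS connection hash table: 2^20 buckets instead of the default 2^12
    CONFIG_IP_VS_TAB_BITS=20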

November 30

  • 19:06 domas: srv78 was commented out in mediawiki-installation since Nov 2. It was also up since then. Uncommented, resynced.
  • 17:18 Rob: srv146 back online.
  • 17:16 Rob: srv132 back online.
  • 17:14 Rob: srv125 back online.
  • 16:17 Rob: yf1006 upgrade completed.
  • 16:08 Rob: srv70 back online as apache.
  • 15:47 Rob: srv121 back online, disk checked fine. No IPMI access.
  • 13:11 Tim: Deploying r27961 with Parser_OldPP (svn up/scap)
  • 05:15 Tim: running maintenance/fixBug12154.php to find diffs broken by Brion's little experiment.

November 29

  • 22:49 Rob: Dist-upgrade to yf1005 completed.
  • 21:41 Rob: Dist-upgrade to yf1004 completed.
  • 21:14 Rob: Dist-upgrade to yf1003 completed.
  • 20:45 Rob: Dist-upgrades to yf1001 & yf1002 completed.
  • 19:44 Rob: Completed yf1000 reinstall.
  • 19:29 Rob: Reinstalling yf1000 for software distro upgrade.
  • 19:25ish mark: Upgraded yf1009 and yf1008 to new distro.
  • 19:28 brion: apparently there remain problems with <ref>; reverting back to 27647
  • 19:10 brion: updating wiki software to current trunk
  • 17:22 brion: disabled send_reminders option on all mailing lists, since it's annoying
  • 17:15 Rob: Finished upgrades on squid software for sq41-sq50
  • 16:56 Rob: Started upgrades on squid software for sq41-sq50
  • 16:54 Rob: Finished upgrades on squid software for sq38-sq40
  • 16:48 Rob: Started upgrades on squid software for sq38-sq40
  • 16:46 Rob: Finished upgrades on squid software for sq27-sq37
  • 16:21 Rob: Started upgrades on squid software for sq27-sq37
  • 16:18 Rob: Finished upgrades on squid software for sq16-sq26
  • 15:55 Rob: Started upgrades on squid software for sq16-sq26
  • 15:53 Rob: Finished upgrades on squid software for sq1-sq15
  • 15:00 Rob: Started upgrades on squid software for sq1-sq15

November 28

  • 23:01 Rob: completed reinstalls of KNAMS squids. (sans gmond)
  • 21:43 brion: fixed apc.stat setting on srv0, was set to off
  • 20:55 brion: renewed SSL cert for wikitech.leuksman.com
  • 18:00ish Rob & Mark: Started reinstalls of KNAMS Squid servers.

November 27

  • 22:27 mark: Reinstalled knsq1 with Ubuntu Gutsy and newest Squid; the rest will follow tomorrow
  • 15:09 brion: srv121 is closing ssh, hanging on http traffic. May or may not be occasionally responding with bogus version of software. Requested shutdown from PM support.
  • 12:20 mark: Moving European traffic back to knams on the new IPs.
  • 09:00 - 12:20 mark: Renumbered knams public services VLAN.
  • 04:00 - 07:00 mark: Brought up AS43821, 91.198.174.0/24 on a new transit link.
  • 03:55 mark: DNS scenario knams-down for maintenance

November 26

  • 19:52 Rob: srv81 reinstalled due to OS being fubar. Rob is bootstrapping and bringing into service.
  • 18:55 Rob: Replaced bad ram in srv155. Stopped apache on boot, as there is no 'sync-common' on server.
  • 18:30 Rob: srv70 has a dead hdd.
  • 17:03 mark: Noticed that fuchsia (LVS knams) was overloaded: more input packets than outgoing. Set up mint as emergency LVS host for upload.wikimedia.org and moved the service IP.

November 25

  • 07:00 brion: rewired some file layout on leuksman.com

November 24

  • 22:15 jeluf: created new wikis: fiwikinews hywikisource brwikiquote bclwiki liwikiquote
  • 17:30 mark: isidore swamped and unreachable, set some crude caching-header-override options in squid
  • 13:30 mark: Upgraded will to Gutsy

November 23

  • 19:15 jeluf: started squid on sq32.
  • 19:10 jeluf: killed squid on sq23 by accident, restarted.
  • 18:54 jeluf: started squid on knsq3, disabled sdc
  • 18:45 jeluf: started apache on srv78
  • 18:45 jeluf: deleted old binlogs on srv123

November 22

  • 22:49 Noticed that a fully dynamic blog with no caching whatsoever was linked off every page on the site; one wonders if anyone could come up with a more brilliant idea to kill a server. Put Squid infrastructure in front. WordPress currently sends no caching headers, so set a default expire time of 1 minute in Squid.
  • 14:58 brion: enabled blog link for english-language fundraising notice... http://whygive.wikimedia.org/ (set up the blog itself yesterday)
  • 11:00 domas: (following older events) srv155 Apache misbehaved (database errors, 500, etc), because / was full. / was full, because updatedb's sorts were filling the disk. updatedb's sorts were filling the disk because find was traversing all NFS shares. find was traversing all NFS shares because file system mounts were not recorded in mtab. file system mounts were not recorded properly in mtab, because mount wasn't clean. mount wasn't clean because /home/ was mentioned 3 times in fstab. Home was mentioned in fstab 3 times because, oh well. :)
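
One sketch of a defence against that failure mode (file location varies by distribution) is to tell updatedb explicitly to skip network filesystems instead of trusting mtab:

    # /etc/updatedb.conf -- keep find off NFS and other remote/transient mounts
    PRUNEFS="nfs NFS afs autofs"
    PRUNEPATHS="/tmp /var/tmp /mnt /media"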

November 20

  • 21:55 mark: Set persistent_request_timeout to 1 minute in the frontend squid conf, since newer Squid versions increased the default value, causing greater FD usage. (See the sketch after this list.)
  • 17:57 brion: patched OTRS to avoid giving out e-mail addresses when requesting password resets
  • 12:15 mark: Upgraded knsq14 to newer Squid
  • 06:39 mark: knams network change failed, rolled back. repooled knams
  • 04:06 mark: Depooled knams for maintenance
  • 02:53 brion: srv81 deadish. (ext cluster 6)
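
A sketch of the squid.conf change referred to in the 21:55 entry above:

    # frontend squid.conf: close idle persistent client connections after 1 minute
    persistent_request_timeout 1 minute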

November 19

  • 19:13 mark: Preparing for the IP renumbering at knams tomorrow. Set up reverse DNS zones, added new squid ips to the MediaWiki configuration
  • 16:00 brion: srv132 reporting read-only filesystem
  • Created a new squid 2.6.16-1wm1 package, included a patch by Adrian Chadd to solve the unbounded http headers array problem. Deployed it for testing on knsq15; the rest will follow soon.

November 18

  • 21:35 domas: commented out srv132 on LVS, cause it is half up half down (serves wiki, ssh unhappy)

November 17

  • 04:18 Tim: enabled AssertEdit extension on all wikis, after review. Short, simple extension to help some bot runners.

November 16

  • 22:12 brion: switching on NPPatrol by default
  • 22:ish brion: switching Gadgets back on on dewiki, with new message naming convention
  • 20:58 brion: changed master settings on bacon, letting it catch up...
  • 20:53 brion: taking bacon out of rotation for the moment
  • 20:52 brion: s3b has bacon, which didn't get updated for master switch. causing ugly lag
  • 16:52 brion: updated all to current SVN code. disabled gadgets extension on dewiki pending code review.
  • 04:17 brion: started CU index update on old masters
  • 04:15 brion: s1 swapped master db4 (db4-bin.237 748798723) -> db2 (db2-bin.161 198)
  • 04:11 brion: s2 swapped master lomaria (lomaria-bin.107 206111887) -> db8 (db8-bin.002 2362)
  • 04:04 brion: s3 swapped master db5 (db5-bin.107, 310565282) -> samuel (samuel_bin_log.021, 90525)
  • 03:42 brion: CU index update on slaves complete; time to switch masters and run em there

November 15

  • 18:43 Rob: srv146 back online.
  • 17:28 Rob: reseated disk in sq32, its good to go. Mark is testing things on it.
  • 17:00 brion: applying CU index update on slaves

November 14

  • 19:12 brion: freed up some space on storage2; no space for dewiki borked the dump progress update
  • 09:08 mark: sq32 has a bad disk /dev/sdc

November 13

  • 12:13 Tim: vincent was giving "Please set up the wiki first" errors. Fixed permissions on /var/tmp/texvc and ran sync-common, seems to be fixed.

November 11

  • 23:00 domas: reindexed holbach for efficient categorypage and contributions

November 10

  • 20:55 sq5 disappeared
  • 14:30 jeluf: Started move of commons from amane to storage1: Rsync started.

November 8

  • 15:06 Rob: Rebooted srv146 and put back in pool.

November 5

  • 15:00 Tim: Removed ru-sib.wikipedia.org from all.dblist and langlist as requested on bug 11680.

November 4

  • yesterday domas: ixia back to duty, bacon refurbished as jawiki principal slave

November 2

  • 16:40 Rob: srv133 unresponsive to console. Rebooted and back in pool.
  • 16:18 Rob: Kernel Panic on srv78, restarted and brought back in to pool.
  • 16:09 Rob: Reinstalled ixia due to broken array.
  • 13:40 Rob: Reinstalled bacon per domas.
  • 13:26 Rob: Connected amnesiac, juniper route server.
  • 13:00 Rob: Took ixia offline for disk testing, failed raid.
  • 12:00ish Rob: Rebooted Foundry after a failed upgrade attempt.

November 1

  • 20:43 mark: Installed sage as gutsy build/scratch host. Sorry, new hostkey :(
  • Ubuntu Gutsy installs are now possible, both for amd64 and i386
  • 20:32 mark: Prepared installation environment for Gutsy installs
  • 12:18 mark: Set up new vlans on csw1-knams. Set up BGP config for the new prefix. Migrated full view BGP peering between csw5-pmtpa and csw1-knams to the new public ASN. Changed AS-path prefix lists to named instead of numbered on both routers.

October 31

  • 13:50 brion: storage2 full; pruning some stuff...

October 30

  • 18:22 brion: adding dev.donate.wikimedia.org cname and vhost
  • 02:08 brion: leuksman.com was down for about four hours due to a mystery kernel panic, possibly a mystery hardware failure. Thanks for rebooting it, Kyle! :)

October 29

  • 05:41 Tim: srv133 was giving bus errors on sync, not allowing logins. Forced restart, seems to be still down.

October 26

  • 17:50 brion: rebuilt prefix search indexes, now splitting enwiki & other wikipedia index files to allow it to build on the 32-bit server. :) ready for leopard launch...

October 25

  • 12:30 mark: knsq13's bad disk has been replaced; booted it up, cleaned the cache and started Squid.
  • 08:14 Tim: created wiki at arbcom.en.wikipedia.org

October 23

  • 21:30ish brion: sitenotice live on commons, meta, and en.*
  • 19:25 brion: sitenotice now using user language; reenabled <funddraisinglogo/> little WMF logo
  • 19:15 Rob: Resurrected srv118, synced, added back to pool.
  • 18:37 Rob: Updated apache pool and pushed to dalembert.
  • 18:33 Rob: Replaced backplane in srv78 and reinstalled. Bootstrapped and online.
  • 18:20 Rob: Restarted srv124, synced, filesystem seems fine.
  • 17:30 Rob: Booted up srv146 which was turned off, details on srv146 page.
  • 17:21 Rob: Rebooted srv131 from kernel panic. Synced and back online.
  • 17:13 Rob: Rebooted srv124 from unresponsive crash (black screen when consoled.) Synced and back online.
  • 15:54 mark: Reenabled the no-caching hack; seeing stalls
  • 10:20 mark: Disabled the randomizing URL hack in the flash player as I'm not seeing stalls and I think it's actually triggering the Squid concurrency problems. Reenable if it makes things worse.
  • 09:29 mark: Seeing huge CPU spikes on knsq12, the CARPed Squid that's responsible for the Flash video. Looks like the same problem we saw 2 weeks ago. Changed the configuration so the upload frontend squids cache it, to reduce concurrency.
  • 01:24 brion: turned off the scrolling marquee; too many complaints about it being slow and distracting
  • 01:00 brion: switching donate.wikimedia.org back to fundcore2 (srv9), keeping the link from banner to the wiki until we figure out performance problems
  • 00:40ish brion: hacking around squid problem with flash player by randomizing the URL for the .flv, defeating caching. :P

October 22

  • 22:47 brion: putting on sitenotice on testwiki and enwiki
  • 20:15 mark: Shut down knsq13 awaiting disk replacement.
  • 15:51 brion: adding www.donate CNAMEs
  • 15:16 mark: Broken disk in knsq13, stopped backend squid. Frontend still running.
  • 13:50 brion: php5-apc installed on srv9; updated php5-apc package in apt to wm6 version, with both 386 and amd64 versions...
  • 13:34 brion: added wikimedia apt repo to pbuilder images on mint in the hopes that one day i'll be able to actually build packages
  • 12:00 domas: db2 gone live, took out adler for jawiki dump
  • 03:24 brion: fixed email on srv9 i think... an extra space on a line in ssmtp.conf after the hostname seemed to break it, making it try to send to port 0.

October 21

  • 15:40 mark: Doubled upload squid's maximum cached object size to 100M for the fundraiser.
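
A sketch of the corresponding squid.conf directive on the upload squids; the surrounding configuration is assumed:

    # upload squid.conf: cache objects up to 100 MB (e.g. fundraiser video)
    maximum_object_size 100 MB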

October 20

  • 19:24 Rob: srv10 reinstalled as base Ubuntu for fundraiser.
  • 18:50 Rob: srv9 reinstalled as base Ubuntu for fundraiser.
  • 16:05 jeluf: removed binlogs on srv123
  • 16:00 jeluf: started apache on srv85
  • 16:00 jeluf: removed binlogs on srv7. Only 120 MB of disk space was left...
  • 13:30 spontaneous reboot of srv85

October 18

  • 21:48 brion: updated interwiki map
  • 17:53 brion: http://donate.wikimedia.org/ up as redirect for now; all other domains also have this subdomain as redir

October 17

  • 18:34 mark: increased ircd max clients to 2048 on irc.wikimedia.org.
  • 18:14 brion: set up an hourly cronjob to 'apache2ctl graceful' on wikitech. Had one instance of segfault-mania, probably from fun APC bugs or something finally creeping in.
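
A minimal sketch of such a cron entry; the file path and exact schedule are assumptions:

    # /etc/cron.d/apache-graceful on wikitech: gracefully restart workers hourly
    0 * * * *  root  /usr/sbin/apache2ctl graceful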

October 16

  • 22:08 brion: tweaked CentralNotice URLs to use 'action=raw' to avoid triggering the squid header smashing. Sigh... :)
  • 19:02 brion: switched in CentralNotice infrastructure sitewide. This may cause a spike of hits to meta's Special:NoticeLoader, a little JS stub. Due to squid's cache-control overwriting this won't be cached as nicely on clients as I wanted, but it should still be sanely cached at squid. Actual notice won't be switched in until the fundraiser begins, but this should get the stub loader into anon-cached pages ahead of time.

October 11

  • 21:17 brion: srv124 has been broken for a few hours, apparent disk error from behavior. still online but can't get in to shut it off.
  • 18:59 Rob: amane disk replaced, but not rebuilt. Need Jens to look at this.
  • 15:57 brion: srv118 down; swapped it out from memcached for srv101. WE ARE OUT OF MEMCACHE SPARES IN THE LIST

October 10

  • 17:26 brion: http://wikimania2008.wikimedia.org/ up
  • 17:08 brion: adding wikimania2008, 2009 to dns; setting up 2008 wiki
  • 17:08 tim, domas, and rob are doing some sort of evil copying db7 to db10
  • 16:36 brion: ES slaves under srv130 are stopped, they need to be restarted... no wait it's ok now
  • 16:33 brion: restarted mysqld on srv130 (via /etc/init.d/mysqld); some mystery chown errors about invalid group for mysql? but seems to be running
  • 16:30 brion: srv130 (ES master) mysql down; rebooted 80 minutes ago (mark said there was some problem with it and he depooled it, but this is not logged except for a kernel panic and restart september 19. may be flaky)

October 9

  • 21:33 brion: switched srv105 out of memcache rotation, srv118 spare in. [domas tried upgrading memcached on srv105, apparently something's broke]
  • 21:30 brion: memcached on srv105 is broken; verrrry slow, lots of errors on site -- timeouts etc
  • 19:20 domas: purged that image also
  • 19:12 brion: disabling the alpha PNG IE 6 hack on en.wikipedia; most of the breakage appears to be on the GIF that it loads, which may be result or may be cause or may be unrelated. :)
  • 16:45 brion: CPU maxing out on Florida upload squids; site up but with slow image loading. We're working on figuring out why

October 8

  • 23:22 dab: started nagios-irc-bot on bart.
  • 16:00 jeluf: took db10 out of service, mysql not responding, SSH not responding.

October 7

  • 19:22 brion: refreshing storage2 copy of amane

October 3

  • 18:00 brion: scapped update including r26357 which tim warns may introduce performance problems. keep an eye out

October 2

October 1

  • 21:00 mark: Reinstalled fuchsia and made it the new active LVS host for knams, as iris's cable seems bad.
  • 09:58 mark: iris has had input errors on its link. Need to keep an eye on it.
  • 09:45 mark: The site felt slow at knams with multi-second load times. Latency to LVS IP was ~ 200 ms. Apparently iris (LVS load balancer) had negotiated its link speed at 100M, which is not sufficient during peak. Forced it to 1000M-master at the switch, which seems to have worked.

September 30

  • 15:10 mark: Gave csw1-knams a full BGP view from csw5-pmtpa with rewritten next-hop to the Kennisnet gateway.

September 29

  • 18:33 mark: Installed will for routing/monitoring purposes
  • 00:35 brion: bart seems alive and well, was rebooted around 23:25?

September 28

  • 22:00 jeluf: bart is dead => no nagios, no otrs.
  • 20:22 brion: pushed out updated cortado 0.2.2 patched build to live server. (Some clients may use cached old code, note.)
  • 16:30ish brion: restarted lucene on srv58, too many open files. srv58 error log extracts
  • 15:27 Rob: Restarted apache on hypatia. It was throwing seg-faults. See srv page for more info.
  • 15:14 Rob: Restarted apache on srv17. It was throwing seg-faults. See srv page for more info.
  • 15:05 Rob: sq45 back online, reseated disk a few times.
  • 12:00 domas: srv122->srv121 data migration (mysqldump to stock fedora mysqld/MyISAM), cluster11 temporary r/o

September 27

  • 16:13 Rob: Shutdown sq45 due to bad disk.
  • 16:00 Rob: Updated node_groups: squids_pmtpa and squids_upload to show dell squids in nagios. Corrected errors in squid_pmtpa.
  • 15:18 Rob: sq34 passed all disk checks, started back online as squid.
  • 14:46 Rob: checking disks on sq34, squid offline during check.
  • 14:14 mark: Repooled knsq12

September 26

  • 15:02 mark: Repooled knams
  • 14:49 mark: pascal had ns2.wikimedia.org's IP bound, which only showed after the move of routing to csw1. Removed.
  • 05:13 mark: Depooled knams for network maintenance later today

September 25

  • 19:30 jeluf: added hsbwiktionary as requested on bugzilla
  • 19:00 jeluf: added bnwikisource as requested on bugzilla
  • 17:24 brion: ariel reasonably happy now.
  • 17:21 brion: cycling ariel a bit to let it fill cache
  • 17:18 brion: putting ariel back into $dbHostsByName in db.php. someone commented it out without any logging of the reason why, rumor is because it had crashed and was recovering. this broke all enwiki watchlists.
  • 17:11 ???: watchlist server for s1 overloaded
  • 15:46 Rob: tingxi reinstalled and serves no current role.
  • 14:51 Rob: srv121 reinstalled and online as apache. Needs ext. storage setup.
  • 14:35 Rob: srv135 rebooted from kernel panic and back online as apache.
  • 02:12 brion: new wikitech server working :) poking DNS to update...
  • 01:47 brion: preparing to move wikitech wiki to new server; will be locked for a bit; DNS will change... new IP will be 63.246.140.16

September 24

  • 12:30 mark: knsq12's broken disk has been replaced, brought it back up. ZX pulled the wrong disk, the broken one is still there.

September 21

  • 19:35 jeluf: thistle's mysqld crashed. Restarts automatically. Error message.
  • 05:12 nagios: srv135 went down. No ping.
  • ~1:45 Tim: Set up srv131 and srv135 for static HTML dumps. Fixed enwiki 777/1024, continued dump.

September 20

  • 14:10 Rob: Upgraded libkrb53 libt1-5 on srv151-srv189 (Will take a bit of time to work through them all.)
  • 11:06 mark: Upgraded frontend squids on knsq1 - knsq5 to increase #FDs, they were running low.

September 19

  • 20:40 jeluf: cleaned up disk space on db4
  • 20:33 Tim: Fixed iptables configuration on benet: accept NTP packets from localhost
  • ~19:30-19:50 Tim: Fixed NTP configuration on amane, bart, srv1, browne. Stepped clock and restarted broken ntpd on db8, alrazi, avicenna. Still working on benet.
  • 18:07 mark: Turned off htdig cron job on lily, it's killing the box every day
  • 16:15 jeluf: started mysqld on srv130. Added cluster13 back to the write list.
  • 14:05 mark: Turned off knsq12.
  • 13:53 Rob: srv128 back online. Had RO Filesystem, FSCK ran, all is well, synced and working.
  • 13:42 brion: moving old dumps from amane's /export/archive to USB drive for archival. retrieve drive back to office when done. :)
  • 13:32 Rob: srv130 had a kernel panic. Rebooted, synced, online.
  • 12:23 Rob: Replaced disk on storage1.
  • 09:53 mark: Disk errors on /dev/sdb in knsq12. Stopped both squid instances.
  • 6:15 jeluf: DB copy completed, thistle and db8 are both up and running. NTP isn't properly configured on db8. Needs investigation. Check bugzilla.
  • 4:20 jeluf: mysql on thistle stopped. Copying its DB to db8.

September 18

  • 19:50 brion: shutting off srv121 apache again, segfaults continuing.
  • 19:30 brion: segfaults on srv121; shutting its apache and examining. lots in srv188 past log too, earlier today
  • 16:18 brion: killed the deletion thread, since it was by accident anyway. :) things look happier now
  • 16:16ish brion: lots of hung threads on db4/enwiki due to deletion of Wikipedia:Sandbox clogging up revision table
  • 16:06 brion: srv128 hanging on login... shut it down via IPMI
  • 05:15 jeluf: updated nagios config to reflect db master switch.

September 17

  • 22:50 mark: srv130, ES cluster 13 master, disappeared. Removed it from the active write list and set its load to 0.
  • 13:45 domas: I/O hang on db8, FS hanged too, caused mysqld to hang. switched master to lomaria
  • 13:10 mass breakage of some kind... segfault on db8 mysql, massive breakage due to commons loads etc
  • 11:38 Tim: re-ran squid update to decommission bacon -- was bungled yesterday so bacon was still serving zero-byte thumbnails. Purging the zero-byte list again.

September 16

  • 21:58 Rob: srv15 reported apache error on nagios. Re-synced apache data and restarted apache process, correcting the issue.
  • 20:28 mark: Generated a list of 0-byte thumbs on bacon, purging that list from the Squids
  • 10:45 domas: aawiki had corrupted localisation cache: http://p.defau.lt/?Q4c4JZRPeAufTVT_T_Xz1w
  • 06:15 domas: db2 simply seems tired. servers/processes get tired after running under huge load for a year or more? :) Anyway, looks like deadlock caused by hardware/OS/library/mysqld - evil restart.
  • 00:50 mark: Mounted /var/spool/exim4/db (purely nonessential cache files) as a tmpfs filesystem on lily to improve performance.
  • 00:44 mark: ionice'd and renice'd running htdig processes on lily. I/O starvation caused high loads so Exim delayed mail delivery.
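
Sketches of the two lily changes above; mount options and priority values are assumptions, not copied from the host:

    # /etc/fstab: keep Exim's hints databases (non-essential cache files) on tmpfs
    tmpfs  /var/spool/exim4/db  tmpfs  defaults,size=64m  0 0

    # demote running htdig processes to idle I/O class and low CPU priority
    for pid in $(pgrep htdig); do ionice -c3 -p "$pid"; renice 19 -p "$pid"; done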

September 15

  • 17:13 Tim: sending all image backend traffic to amane, turned off bacon and srv6. Deployed in slow mode.
  • 15:49 Tim: took db2 out of rotation. Some threads are frozen, including replication, leaving it like that for a while so Domas can have a look.
  • 10:27 Tim: Attempted to fix error handling in thumb-handler.php. It was probably serving zero-byte images on queue overflow of the backend cluster.
    • Wrong diagnosis -- images are not zero bytes on amane, but are cached as zero bytes. Suspect lighttpd or squid. Remains unfixed.
    • There was in fact a problem with thumb-handler.php, manifesting itself on bacon. Probably fixed now.

September 13

  • 13:52 brion: db1 troubles? some pain in s3
  • 14:33 Rob: adler reinstall complete.
  • 14:00 Rob: adler down for re-install.
  • 13:08 Rob: srv54 reinstalled, bootstrapped, and back online.
  • 12:27 Rob: rebooted srv135, ssh now working, synced and added back to apache lvs pool, synced nagios.

September 12

  • 20:30ish Rob: bootstrapped srv131 and srv150, back in apache rotation.
  • 19:59 Tim: srv135 is refusing ssh connections. Removed from LVS.
  • 19:45 Tim: auditcomwiki wasn't in special.dblist, so it was listed in wikipedia.dblist and the static HTML dump system attempted to dump it. Luckily I still haven't fixed that bug about dumping whitelist read wikis, so it just gave an error and exited. I put another couple of layers of protection in place anyway -- added auditcomwiki to special.dblist and patched the scripts to ignore wikis from private.dblist.
  • 16:55 Tim: fixed sysctls on srv52, repooled
  • 15:52 Tim: started static HTML dump
  • 14:18 Rob: Added srv0 to apache lvs pool. Started httpd on srv0

September 11

  • 17:02 Rob: srv135 reinstalled and bootstrapped, now online.
  • 16:36 Rob: srv136 had a kernel panic, restarted, and sync'd.
  • 15:50 Rob: srv0 re-installed and bootstrapped.
  • 15:49 Rob: srv135 kernel panic, again. Re-installing system, if it occurs after re-install, will place an RMA repair order.
  • 14:48 Rob: srv135 online.
  • 14:32 Rob: srv146 online.
  • 13:19 Rob: srv135 had a kernel panic. Powered on, online, and needs bootstrap. (Rob will do this.)
  • 13:29 Rob: srv136 had a kernel panic. Restarted, sync'd, and is online.
  • 13:19 Rob: srv146 was turned off. Back online, needs bootstrap. (Rob will do this.)
  • 11:33 mark: Decreased cache_mem on knsq7, and upgraded squids to newer version with higher FD limits. I have a theory that these squids are running out of buffer memory.
  • 07:14 Tim: killed lsearchd on srv142, that's not a search server
  • 05:56 Tim: installed ganglia on thistle, db1 and webster. It would be kind of cool if whoever reinstalls them could also install ganglia at the same time. I've updated /home/wikipedia/src/packages/install-ganglia to make this easy on RH or Ubuntu.

September 10

  • 20:20 brion: reassembled checkuser log file and updated code to not die when searching and it finds big blocks of nulls
  • 19:11 brion: adding CNAMEs for quality.wikipedia.org, experimental.stats.wikimedia.org

September 9

  • 22:51 mark: Upgraded lighttpd to version 1.4.18 on amane, bacon and benet to fix a security bug. New and previous version RPMs are in /usr/src/redhat on the respective hosts.
  • 18:10 Tim: mod_dir, mod_autoindex and mod_setenvif were missing on the ubuntu apaches, causing assorted subtle breakage. Fixing.
  • 19:12 Tim: The extract2.php portals were all broken on Apache 2, redirecting instead of serving directly. Fixed.

September 8

  • 20:30 mark: SpamAssassin had crashed on lily, restarted it
  • 16:30 domas: reenabled Quiz, reverted to pre-r25655
  • 16:00 domas: disabled Quiz, it seems to break message cache, when used
  • 07:43 Tim: installed json.so on srv110
  • 07:25 Tim: installed PEAR on srv151-189
  • 06:00 Tim: Removed some binlogs on srv123. But it's getting full, we need a new batch.
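
Binlog cleanup like the srv123 entry above is normally done from the mysql prompt rather than by deleting files by hand; a sketch (retention window and file name are hypothetical):

    -- drop binary logs older than a week
    PURGE MASTER LOGS BEFORE NOW() - INTERVAL 7 DAY;
    -- or purge everything up to a named log file
    PURGE MASTER LOGS TO 'srv123-bin.100';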

September 7

  • 15:45 Rob: srv63 online and good to go.
  • 15:25 brion: added symlink from /usr/bin/rsvg to /usr/local/bin/rsvg in install-librsvg FC script. sync-common or something complained about the file being missing on srv63 re-setup
  • 15:06 brion: reinstalling yf1015 to use it for logging experiments
  • 14:41 brion: fiddling with yaseo autoinstall config; trying to use jp ubuntu mirror since kr still down
  • 14:25 domas: removed db1 due to 9MB binlog skip. needs recloning
  • 13:00 domas: restarted few misbehaving httpds (srv3, srv121) - they were segfaulting, srv3 - seriously.
  • 12:59 domas: fiddling with db9->db7/db3 copies

September 6

  • 20:43 brion: temporarily switched henbane from kr.archive.ubuntu.com to jp.archive.ubuntu.com, since the former is down, while fiddling with software
  • 17:31 Rob: srv63 reinstalled with FC4. Needs to have setup scripts run for apache use.
  • 15:00 Rob: srv141 back online, ran sync-common, sync'd its data.
  • 14:41 Rob: srv134 back online, ran sync-common, sync'd its data.
  • 14:23 mark: Ran aptitude upgrade on srv151 - srv189
  • 13:42 brion: ariel is lagged 5371 secs, looking into it
    • slave thread listed as running, waiting for events; stop and start got it going again
  • 12:30ish Rob: albert was locked. Rebooted, it's running a forced FSCK, visible from SCS, will take a long time.
  • 12:13 mark: Reloaded csw5-pmtpa with no issues, port activation bug seems to have been fixed

September 5

  • 19:53 brion: fixed various php.ini's, things are much quieter in dberror.log now
  • srv37-39 missing from apaches group, but were in service
    • ok they're in a new freak group. :) cleared apc caches and updated php.ini
  • .... wrong php.ini on srv37-srv39, srv135, srv144: apc.stat=off
  • fixed sudoers on srv61, resynced
  • 19:30 brion: freeing up space on bart, setting up log rotation
  • 19:20 brion: srv151 and srv152 were not in mediawiki-installation group despite apparently running in production. re-adding and re-syncing them.
    • srv150, srv151, and srv152 are not in apaches group, though apparently all running apache. the heck? added them
  • 18:58 brion: adjusted $wgDBservers setup so the non-native Commons and OAI/CentralAuth servers appear without the database name setting. This allows lag checks to run without dying on the database selection.
    • Has greatly reduced the flood in dberror.log, but there's still a lot. Narrowing it down...
  • 15:33 brion: shut down srv134, read-only filesystem and HD errors in syslog. needs fixin'
  • 15:20ish brion: setting up closed quality.wikimedia.org wiki per erik
  • 09:29 mark: Fixed Ganglia for the 8 CPU apaches, which broke when I reinstalled srv153 yesterday
  • 01:20 river: removed lomaria from rotation to dump s2 for toolserver

September 4

  • 21:09 Any reason why srv150 (old SM 4cpu Apache) is not pooled?
  • 20:48 mark: Installed srv151, srv152 and srv153 with Ubuntu and deployed mediawiki. Sorry, forgot to save ssh host keys :(. Put into rotation.
  • 18:37 brion: srv149 seems wrong, missing remote filesystems. mounted upload3 and math manually
  • 18:36 brion: shut down srv141, was whining about read-only filesystem, couldn't log in
  • 15:30 Rob: adler racked and online. Needs database check.
  • 12:23 Rob: DRAC working on srv151, needs installation (failing on partitioning.)
  • 13:16 Rob: Rebooted srv152 to borrow its DRAC for testing. Reinstalled and rebooted.
  • 12:34 mark: Installed ubuntu on srv152, will deploy later
  • 12:33 Rob: srv149 back online and sync'd. Ran FSCK on it.
  • 12:32 mark: Reenabled switchport of srv149
  • 12:20 Rob: srv52 has a kernel panic. Rebooted and re-sync'd.
  • 12:14 Rob: Fixed DRAC password and parameters on srv151, srv152, & srv153.

September 3

  • 18:43 brion: fixed perms on checkuser log

September 2

  • 16:48 mark: Replaced srv52's memcached for srv69's off the spare list
  • 16:41 srv52 goes down, unreachable

September 1

  • 08:00 domas: live revert of DifferenceEngine.php to pre-24607 - requires additional patrolling index (ergh!), which was not created (ergh too). why do people think that reindexing recentchanges because of minor link is a good idea? :-/
    The schema change requirement was noted and made quite clear. If it wasn't taken live before the software was updated, it's no fault of the development team. Rob Church 14:44, 2 September 2007 (PDT)
  • 03:16 Tim: fixed upload on advisorywiki

August 31

  • 19:50 brion: starting an offsite copy of public upload files from storage2 to gmaxwell's server
  • 13:40 brion: srv149 spewing logs with errors about read-only filesystem; can't log in; no ipmi; mark shut its switchport off
  • 13:25 Users are reporting empty wiki pages rendered by the new servers, e.g. http://de.wikipedia.org/w/index.php?title=Sebnitz&oldid=36169808
    • That test case is gone from the cache now, but I did see it before it went. Can't reproduce directly to the apache. Maybe an overloaded ext store server? -- Tim
  • 06:50 mark: Massive packet loss within 3356, prepending AS14907 twice to 30217, removed prepend to 4323

August 30

  • 21:40 mark: Python / PyBal on alrazi had crashed with a segfault - restarted it
  • 21:37 mark: Installed ganglia on sq31 - sq50
  • 17:24 mark: Upgraded srv154 - srv189 to wikimedia-task-appserver 0.21, which has an additional depend on tetex-extra, needed for math rendering
  • 16:03 Tim: put srv154-189 into rotation
  • 15:55 Setup apaches srv154 - srv189, only waiting to be pooled...
  • 12:14 Setup apache on srv161
  • 11:52 Setup apache on srv160
  • srv159 has NFS /home brokenness, fix later
  • 00:55 brion: disabling wap server pending security review

August 29

  • 22:40 mark: Installed srv155 - srv189
  • 19:44 mark: Reinstalled srv154
  • 16:08 Rob: db3 drive 0:5 failed. Replaced with onhand spare, reinstalled ubuntu, needs Dev attn.
  • Incident starting at 15:50:
    • 15:50: CPU/disk utilisation spike on s2 slaves (not s2a)
    • ~15:55: Slaves start logging "Sort aborted" errors
    • 15:55: mysqld on ixia crashes
    • 15:58: mysqld on thistle crashes
    • 16:03: mysqld on lomaria crashes
    • 16:09-16:18: Timeout on s2 of $wgDBClusterTimeout=10 brings down entire apache pool in a cascading overload
    • 16:19 Tim and Mark investigate
    • 16:23 Tim: depooled ixia
    • 16:26 Tim: switched s2 to r/o
    • 16:28 Tim: reduced $wgDBClusterTimeout to zero on s2 only. Partial recovery of apache pool seen immediately
    • 16:37 Tim: 4-CPU apaches still dead, restarted with ddsh
    • 16:45 full recovery of apache pool
    • 16:59 Tim: Lomaria overloaded with startup traffic, depooled. Also switched s2 back to r/w.
    • 17:12 Tim: brought lomaria back in with reduced load and returned $wgDBClusterTimeout back to normal.
    • 17:14 - 17:35 Tim: wrote log entry
    • 17:37 Tim: returned lomaria to normal load
  • 13:55 Rob: Reinstalled db1, Raid0, Ubuntu.
  • 13:52 Rob: Restarted srv135 from kernel panic, sync'd and back online.
  • 13:50 mark: Set AllowOverrideFrom in /etc/ssmtp/ssmtp.conf on srv153 to allow MW to set the from address. Also made this the default on new Ubuntu installs.
  • 13:14 Rob: Rebooted srv18 after cpu temp warnings. Sync'd server, back online, no more temp warnings.
  • 12:46 Rob: sq39's replacement powersupply arrived. Server back online and squid process cleaned and started.
  • 09:54 mark: Ran apt-get dist-upgrade on srv153
  • 05:30 domas: kill-STOP recaches, until we get crashed DB servers (3) up (or get new machines? :)
  • 00:26 Tim: brought srv153 back into rotation after fixing some issues

August 28

  • 19:33 brion: freeing up some space on amane; trimming and moving various old dumps and other misc files
  • 19:08 Tim: fixed ssmtp.conf on srv153
  • 16:10 brion: rerunning htdig cronjob on lily.... at least some lists not indexed or something
  • 15:53 brion: fixed img_sha1 on pa_uswikimedia and auditcomwiki
  • 15:36 Tim: put srv153 into rotation
  • 15:08 brion: manually rotated captcha.log, hit 2gb and stopped in june

August 27

  • 15:57 Tim: switch done, r/w restored. New s3 master binlog pos: db5-bin.001, 79 (see the slave-side sketch after this list)
  • 15:37 Tim: restarting db5 with log_bin enabled
  • 15:28 Tim: switched s3 to read-only mode
  • 15:21 db1 down (s3 master)
  • 13:17 Tim: running populateSha1.php on commonswiki
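
A sketch of the slave-side statement implied by a master switch like the one above; only the binlog coordinates come from the log entry, the rest is assumed:

    STOP SLAVE;
    CHANGE MASTER TO
      MASTER_HOST='db5',              -- new s3 master
      MASTER_LOG_FILE='db5-bin.001',  -- coordinates from the 15:57 entry
      MASTER_LOG_POS=79;
    START SLAVE;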

August 25

  • 22:53 brion: noticed dberror.log is flooded with 'Error selecting database' blah with various wrong dbs; possibly from job runners, but the only place I found suspicious was nextJobDB.php and I tried a live hack to prevent that. Needs more investigation.
  • 14:30 Tim: got rid of the APC statless thing, APC is buggy and crashes regularly when used in this way
  • 14:00ish brion: scapping again; Tim made the img metadata update on-demand a bit nicer

August 24

  • 17:45 brion: fixed SVN conflict in Articles.php in master copy
  • 17:29 brion: restarted slave on db2
  • 17:27 brion: applying image table updates on db2, didn't seem to make it in somehow. tim was going to run this but i can't find it running and he's not online and didn't log it
  • 17:16 brion: restarting slave on ariel
  • ??:?? tim is running a batch update of image rows in the background of some kind
  • ??:?? tim may have changed which server enwiki watchlists come from while ariel is non-synced
  • 16:04 brion: applying img_sha1 update to ariel so we can restart replication and get watchlists for enwiki going again...
  • ....15:10 tim reverted code to r24312 to avoid image update buggage for now
  • 15:10 brion: took db3 (down), db2 (stopped due to schema buggage) out of rotation
  • 15:00ish -- massive overload on db8 due to image row updates
  • 14:47 brion: starting scap, finally!
  • 14:10 brion: unblocked wikibugs IRC mailbox from wikibugs-l list, was autoblocked for excessive bounces
  • 13:59 brion: confirmed that schema update job on samuel looks done
  • 13:40 Tim: restarted job runners, only 2 were left out of 9. Wiped job log files.

August 23

  • 22:00 Tim: replication on samuel stopped due to a replicated event from testwiki that referenced oi_metadata. Applied the new patches for testwiki only and restarted replication. Brion's update script will now get an SQL error from testwiki, but hopefully this won't have serious consequences.
  • 20:04 brion: switched -- samuel_bin_log.009 514565744 to db1-bin.009 496650201. samuel's load temporarily off while db changes apply...
  • 19:56 brion: switching masters on s3 to apply final db changes to samuel
  • 15:34 brion: knams more or less back on the net, mark wants to wait a bit to make sure it stays up. apache load has been heavy for a while, probably due to having to serve more uncached pages. dbs have lots of idle connections
  • 15:32 brion: updated setup-apache script to recopy the sudoers file after reinstalling sudo, hopefully this'll fix the bugses
  • 15:12 brion: srv61, 68 have bad sudoers files. srv144 missing convert
  • 14:08 brion: depooled knams (scenario knams-down)
  • 13:50 brion: knams unreachable from FL
  • 9:08 mark: Repooled yaseo, apparently depooling it causes inaccessibility in China

August 22

  • 20:35 brion: applying patch-oi_metadata.sql, patch-archive-user-index.sql, patch-cu_changes_indexes.sql on db5, will then need a master switch and update to samuel
  • 20:20 brion: found that not all schema updates were applied. possibly just s3, possibly more. investigating.
  • 14:00ish brion: amane->storage2 rsync completed more or less intact; rerunning with thumbs included for a fuller copy

August 21

  • 20:21 brion: amane->storage2 rsync running again with updated rsync from CVS; crashing bug alleged to be fixed
  • 14:43 Rob: sq39 offline due to bad powersupply, replacement ordered.
  • 13:30 Tim, mark: setting up srv37, srv38 and srv39 as an image scaling cluster. Moving them out of ordinary apache rotation for now.
  • ~12:00 Tim: convert missing on srv61, srv68, srv144, attempting to reinstall
  • 9:30 mark: Reachability problems to yaseo, depooled it

August 20

  • 15:00 brion: restarted amane->storage2 sync, this time with gdb sitting on their asses to catch the segfault for debugging
  • ~12:00 Tim: started static HTML dump
  • 11:15 Tim: running setup-apache on srv135

August 18

  • 22:32 brion: schema updates done!

August 17

  • 20:42 brion: started schema updates on old masters db2 lomaria db1
  • 20:39 brion: s1 switched from db2 (db2-bin.160, 270102185) to db4 (db4-bin.131 835632675)
  • 20:32 brion: s2 switched from lomaria (lomaria-bin.051 66321679) to db8 (db8-bin.066 55061448)
  • 20:13 brion: s3 switched from db1 (db1-bin.009 496276016) to samuel (samuel_bin_log.001, 79)
  • 19:54 brion: noticed amane->storage2 rsync segfaulted again. starting another one skipping thumb directories, will fiddle with updating and investigating further later
  • 19:48 brion: doing master switches to prepare for final schema updates this weekend

August 16

  • 16:21 Rob: srv135 back up, needs bootstrap for setup.
  • 16:01 brion: think I got the mail issue sorted out. Using sendmail mode (really ssmtp), and tweaked ssmtp.conf on isidore: set the host to match smtp.pmtpa.wmnet, and set FromLineOverride=YES so it doesn't mess up the from address anymore (see the sketch after this list)
  • 15:42 brion: having a lovely goddamn time with bugzilla mail. setting it back from SMTP to sendmail avoids the error messages with dab.'s email address, but appears to just send email to a black hole. setting back to SMTP for now
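
A sketch of what the isidore ssmtp.conf tweak at 16:01 amounts to; only the two settings mentioned in the entry, everything else assumed:

    # /etc/ssmtp/ssmtp.conf on isidore
    mailhub=smtp.pmtpa.wmnet
    # let Bugzilla supply its own From: address instead of rewriting it
    FromLineOverride=YES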

August 15

  • 3:40 jeluf: Start copy of database from db5 to samuel.

August 14

  • 20:30 jeluf: Lots of apache problems on nagios after some updates on InitialiseSettings.php. Restarting all apaches.
  • 19:50 Rob: srv63 down. Suspected Bad mainboard.
  • 19:38 Rob: srv51 back online and sync'd.
  • 19:24 jeluf srv66 bootstrapped
  • 19:20 Rob: srv133 rebooted. Network port re-enabled. Back online and sync'd.
  • 19:18 Rob: db3 had a bad disk. Replaced and reinstalled. Needs setup.
  • 18:18 brion: db1 in r/w, with reduced read load. seems working
  • 18:12 brion: putting db1 back into rotation r/o
  • 18:07 brion: applying SQL onto db1 to recover final changes from relay log which were rolled back in innodb recovery
  • 17:28 brion: temporarily put s3/s3a/default to use db5 as r/o master, will put db1 back once it's recovered
  • 17:24 brion: put s3/s3a/default to r/o while investigating
  • 17:20 Rob: db1 crashed! Checked console, kernel panic locked it. Rebooted and it came back up with raid intact.
  • 16:33 Rob: srv66 returned from RMA and reinstalled. Requires setup scripts to be run.
  • 15:26 Rob: srv134 restarted from heat crash & sync'd.
  • 15:17 Rob: srv146 restarted from crash and sync'd.
  • 15:00ish brion: restarted amane->storage2 rsync job with rsync 3.0-cvs, much friendlier for huge file trees
  • 15:00 Rob: biruni rebooted and FSCK. Back online.
  • 15:00 Rob: sq12 cache updated and squid services restarted.
  • 14:51 Rob: sq12 rebooted from crash. Back online.
  • 14:47 Rob: Rebooted and ran FSCK on srv59. It is back up, needs to be brought back in to rotation.
    • Sync'd, and in rotation.

August 13

  • 21:00 domas: APC collapsed after sync-file, restarting all apaches helped.
  • 18:59 brion: playing with rsync amane->storage2
  • 18:17 brion: working on updating the upload copy on storage2. removing the old dump file which eats up all the space, will then start on an internal rsync job

August 11

  • 18:50 brion: updated auth server for oai
  • 18:34 brion: starting schema updates on slaves
  • 15:43 brion: seem to have more or less resolved user-level problems after a bunch of apache restarts. i think there was some hanging going on, maybe master waits or attempts to connect to gone servers
  • 15:36 brion: s3 slaves happy now, fixed repl settings
  • 15:29 brion: s3 master moved to db1, hopefully back in r/w mode :)
  • 15:22 brion: db1/db5/webster appear consistent, so moving s3 master to db1. working on it manually...
  • 15:05ish brion: many but not all reqs working in r/o atm
  • 14:50ish brion: samuel broken in some way; putting s3/s3a to read-only and falling back to another box temporarily while working it out

August 8

  • 21:35 brion: shutting down biruni; read-only filesystem due to 2007-08-07 Biruni hard drive failure
  • 21:09 brion: cleaning up srv80, vincent, kluge, hypatia, biruni as well. need to poke at the setup scripts and find out wtf is wrong
  • 20:55 brion: same on srv37
  • 19:58 brion: updated broken sudoers file on humboldt, was not updating files on scap correctly
  • 13:48 brion: adding wikiversity.com redirect
  • 13:26 brion: metadata update for commons image row 'Dasha_00010644_edit.jpg' was sticking due to mysterious row locks for last couple days -- may have been related to a reported stickage/outage yesterday. Script run after run after run tried to update the row but couldn't. Finally managed to delete the row so it's no longer trying, but the delete took over three minutes. :P

August 5

  • 07:44 Tim: set $wgUploadNavigationUrl on enwiki, left a note on the relevant talk pages

August 4

  • 02:10 Tim: Fixed srv7 again

Aug 2

  • 06:00 mark: Installed yf1015 for use as HTTPS gateway

Aug 1

  • 07:00 domas: amane lighty upgraded

July 29

  • 23:43 brion: shut off srv134 via ipmi, since nobody got to it

July 27

  • 4:30 jeluf: removed srv120 from the external storage pool, fixed srv130

July 25

  • 16:33 brion: srv134 bitching about read-only filesystem, possible hd prob
  • 15:44 Rob: srv59 had a kernel panic. Restarted and is now back online.
  • 15:38 Rob: will reattached to port 15 of the SCS.
  • 15:00 Rob: biruni HDD replaced, FC4 reinstalled. (Had to use 32 bit, system did not support 64.) Requires scripts run and server to be put in rotation.

July 24

  • 21:59 brion: srv59 is down; replaced it in memcache pool with spare srv61.
  • 02:38 brion: fixed bugzilla queries; had accidentally borked the shadow db configuration -- disabled that (for srv8) a few hours ago due to the earlier reported replication borkage, but did it wrong so it was trying to connect to localhost

July 21

  • 22:30 mark: srv7 was out of space again, deleted a few bin logs. Replication I/O thread on srv8 is not running and doesn't want to come up, can any of you MySQL heads please fix that? :)
  • 19:26 Tim: db3 is down, removed from rotation.
  • 11:38 mark: Installed Ubuntu Feisty on yf1016 for use by JeLuF

July 19

July 18

  • 22:24 mark: switchport for srv133 shutdown
  • 22:21 brion: srv133 borked with read-only fs

July 17

  • 18:56 brion: set up and mounted upload3 and math mounts on vincent -- these were missing, probably causing bugzilla:10610 and likely some upload-related problems.
  • 15:08 river: borrowing storage1 to dump external text clusters
  • 13:51 Rob: Rebooted storage1 from a kernel panic.
  • 13:50 Rob: biruni filesystem in read-only. Rebooted and running FSCK.
    • HDD is toast. Emailed SM for RMA.

July 15

  • 15:06 Tim: Biruni was hanging on various operations such as ordinary ssh login, or a "sync" command. Restarted using "echo b > /proc/sysrq-trigger" in a non-pty ssh session.
    • Probably hung during startup, ping but no ssh

July 14

  • 19:35 mark: Fixed a problem with our DNS SOA records
  • 13:33 Tim: put ex-search servers vincent, hypatia, humboldt, kluge, srv37 into apache rotation. Fixed ganglia and nagios.

July 13

  • 23:20 mark: Disabled Oscar's accounts on boardwiki/chairwiki per Anthere's request

July 12

  • 18:35 brion: pa.us.wikimedia.org up.
  • ~12:10 Tim: restarted postfix on leuksman, causing a flood of messages delivered to various locations.
  • 11:44 Tim: running setup-apache on ex-search servers (vincent, hypatia humboldt, kluge, srv37)
  • 11:38 mark: Set up Quagga on mint as a test box for my BGP implementation, gave it a multihop BGP feed from csw5-pmtpa / AS14907
  • 11:02 Tim: reassigning srv57 and srv58 to search

July 11

  • 13:10 Tim: updating lucene for enwiki, will be moving on to the other clusters shortly
  • 12:49 Tim: removed /etc/crond.d/search-restart from search servers and restarted crond

July 10

  • 21:30 brion: installed DeletedContributions ext
  • 19:20 mark: Brought knsq6 back up, DIMM hopefully replaced
  • 19:20 jeluf: setting skip-slave-start on read-only external storage clusters.

July 9

  • 17:53 brion: temporarily taking db4 out of rotation cause people freak out about lag warnings
  • 14:18 brion: running schema changes patch-backlinkindexes.sql on db1/db4 so they don't get forgotten
  • 13:55 brion: fixed oai audit db setting (broken by master switch)

July 8

  • 20:47 jeluf: added external storage cluster #13. Removed #10 from the write list.
  • 18:24 Tim: changed cron jobs on the search servers to restart lsearchd instead of mwsearchd.
  • 17:05 brion: switched s1 master, started the rest of the latest schema changes
  • 17:00 brion: switched s2 master
  • 16:54 brion: switched s3 master
  • 16:23 brion: schema changes index tweaks done on slaves, waiting for a master switch to complete

July 7

  • 20:10 jeluf: cleaned up binlogs on srv95.

July 6

  • 21:16 brion: did rs.wikimedia.org updates -- imported pages and images from old offsite wiki, set up redirects from *.vikimedija.org domains that are pointed to us
  • 19:55 brion: created mediawiki-api list
  • 19:39 brion: starting schema changes run for index updates
  • 19:25 Tim: restarted PHP on anthony to fix *.wap.wikipedia.org

July 5

  • 21:59 river: taking clematis and 5TB space on the backup array to test replicating external text to knams
  • 14:07 Rob: Reinstalled srv135 with a new HDD. Needs setup scripts run.
  • 13:55 Rob: Rebooted srv99 from a kernel panic. It is back up.
  • 13:20 Rob: adler offline, will not boot.
  • 13:09 Rob: replaced cable for rose, shows at full duplex speed again.

July 4

  • 21:40 jeluf: cleaned up disk space on srv95, 120, 126

July 3

  • 20:54 brion: Closed out inactive wikimedical-l list
  • 15:56 brion: lag problem resolved. stop/start slave got them running again (see the sketch after this list); presumably the connections broke due to the net problems but it thought they were still alive, so didn't reconnect
  • 15:53 brion: lag problems on enwiki -- all slaves lagged (approx 3547 sec), but no apparent reason why
  • 15:00ish? mark did firmware updates on the switch and something didn't come back up right and everything was dead for a few minutes
  • 00:52 river: temporarily depooled thistle to dump s2.
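
The stop/start recovery in the 15:56 entry corresponds roughly to the following on each lagged slave (a sketch):

    -- restart the replication threads so the slave reconnects to the master
    STOP SLAVE;
    START SLAVE;
    -- verify: Seconds_Behind_Master should start falling
    SHOW SLAVE STATUS\G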

July 2

  • 18:54 brion: set bugzilla to use srv8 as shadow database
  • 18:53 brion: replication started on srv8
  • 18:39 brion: srv7 db back up, putting otrs back in play. fiddling with replication
  • 18:25 Tim: bootstrapping srv80
  • 18:24 brion: shut down srv7 db to copy
  • 17:54 brion: shutting down srv8 db and clearing space for copy from srv7. otrs mail is being queued per mark
  • 17:40 Tim, rainman: installing LS2 for the remaining wikis. Splitting off a new search pool, on VIP 10.0.5.11.

July 1

  • 18:38 mark: Upgraded lily to Feisty, including a new customized Mailman, and a newer PowerDNS
  • 13:15 mark: Upgraded asw-c3-pmtpa and asw-c4-pmtpa firmware to 3100a
