Server admin log/Archive 9

January 30

14:27 brion: moving some old dump files off benet to free dump space
13:04 mark: Move traffic back to knams
10:20 jeluf: added external storage cluster12, nodes srv126 (master), srv125, srv124. Using clusters 10, 11 and 12 as wgDefaultExternalStore.
08:50 jeluf: added external storage cluster11, nodes srv123 (master), srv122, srv120
05:24 mark: knams down for 15+ mins, moved all traffic to pmtpa. Here ya go Ben ;)

January 29

23:00 jeluf: setting up external storage cluster10 on srv95 (master), srv94, srv93
22:40 mark: rebooted srv27
22:30 mark: Brought up the new 10G link on csw5-pmtpa e5/1 using new subnet 84.40.25.100/30, brought down both 1 Gig E links (needed one fiber pair for the 10G link).
21:05 jeluf: removed binlogs 80-89 on adler
mark: Installed csw5-pmtpa SFM, 2x PSU, 4x 10G line card

January 27

01:22 Kyle: sq3 is in. Ready for OS.

January 26

11:05 brion: installed djvulibre on srv53,srv54,srv61,srv68,srv120; install-djvulibre script failed on 53,54,61,68 due to missing FC3 x86_64 RPM; the other FC3 boxen have the FC4 rpm installed so I symlinked it to let the script work.

January 24

21:12 Tim: taking db1 out of rotation for defragmentation of jawiki table space
15:56 Tim: upgrading to squid 2.6.9
09:30 brion: starting dumps on benet/srv31 again, need to watch it and see about disk usage, and get back to fixing up storage2 some time
01:15 brion: setting up revision compression on wikitech.leuksman.com

January 23

19:00 Tim: restarted ntpd on knsq*, many had no server in their peer list except the local clock.
18:53 Tim: stepped clock on knsq2, was off by one hour.

January 22

23:ish brion: seems to be some kind of problem with squid purging, with old or inconsistent versions of pages being returned from pmtpa squids; difficult to reproduce, but we could see several old versions of en:Barack Obama including a badly vandalized one from a couple users w/ Safari. Were not able to trace it to a particular squid before it finally got purged... somehow... Still hoping to figure this out.
20:08 Tim: phasing in UDP logging configuration
19:40 jeluf: deleted binlogs 100-109 on srv92
18:00 jeluf: configured srv52 as apache
16:58 Tim: upgrading all squids to 2.6.8 RC with UDP logging patch.
15:40 jeluf: restarted squid on sq6, deleted binlogs 50-59 on srv88

January 21

22:20 jeluf: db1 and samuel added to the pool.
15:15 jeluf: shutting down db1 and samuel, copying the DB from db1 to samuel.

January 20

20:20 mark: Raised the route cache on avicenna and alrazi by setting net.ipv4.route.max_size = 65536
20:05 mark: Florida was down due to troubles on avicenna:

printk: 23355 messages suppressed.
dst cache overflow

09:00 brion: switched on nofollow for enwiki article space in response to Jimmy's earlier request and the rumor of a spam championship targeting WP

January 19

16:30 jeluf: fixed disk space on samuel and srv88, restarted squid on yf1000 and yf1003
16:30 jeluf: fixed nagios config to reflect the master/slave change, the removal of harris, the new role of db10, ....
14:26 brion: going read-write
14:15-20ish brion: switching masters, disabling samuel from db.php
14:00 brion: switched samuel to explicit read-only while working on it ftm
13:50 brion: got woken up, some problem about full dbs
- samuel disk full
- all non-en locked

January 17

20:15 Tim: phasing in UDP squid logging, trialling it on a few hosts
03:27 Tim: set up henbane as a log host. Started udp2log on it. For now it is logging the access log of hawthorn, as a test.

January 16

09:19 brion: got isidore moved over to harris' old external IP and on the public vlan, woo
05:30 brion: broke isidore when experimenting with external IP on it. :D will poke it more shortly
02:59 brion: bringing isidore back up to speed, plan to replace downed harris

January 14

21:00 mark: Brought up db10 as a Squid with 16 GB memory and 8 disks, both to test and to help with the current upload load
19:00 mark: Installed Ganglia on a few misc Ubuntu servers
16:14 mark: Updated DNS names in zone wikimedia.org, removed the AAAA record for ns2 as it's reducing service quality for IPv6 users (geobackend doesn't work with ipv6)
16:05 mark: Made knams use lily as the primary DNS resolver, secondary bayle in Tampa.
15:45 mark: Set up PowerDNS Recursor on yf1019 as the new primary resolver for yaseo. Secondary is bayle in Tampa, no real point in having a secondary locally there...
13:40 mark: Reducing cache_mem on sq1 - sq10 to give Linux more disk cache memory for the beating of tomorrow. Restarting those squids to have it take effect.
10:56 brion: indeed that seemed to be most of it. restoring it for cascading pages only (english Main Page for now)
10:18 brion: db cpu way up; testing removing the templatelinks update on page view to see if that's it
09:55 brion: svn up'd to version with updated tooltips and cascading protection.

January 13

23:15 mark, Rob: Kernel upgrades on sq1 - sq10 seem to cause hangs at bootup. Had to reinstall sq1, sq6 and sq8 to get them to boot again. Put package linux-image-2.6.17-10-server on hold to keep them from upgrading. Other types of servers don't seem to have issues...
21:49 Rob: srv126 locked up again. Rebooting to bring back online.
20:32 Rob: Memory test ran 38 passes on srv144 with no errors, rebooting srv144. Back online.
19:26 mark: Replaced PowerDNS from Edgy on bayle by our custom package. Wasn't ours because the i386 variant wasn't built and in our repository before...
18:46 mark: DNS migration complete. ns0.wikimedia.org migrated to bayle, ns1 to yf1019 and ns2 to lily, all using the new wikimedia-task-dns-auth package, and newer PowerDNS. Procedure for updating DNS has changed slightly, please read!
17:08 mark: Starting authoritative DNS migration; please don't do any DNS updates until I'm done later.
16:45 mark: Reinstalling yf1019 for use as a DNS server

January 11

23:30 brion: stopped storage2 dump tests
23:09 mark: reverted bacon change, overloading amane
23:00 mark: memory frag issues back on florida image squids - many boxes down. restarting doesn't help. bacon is very loaded too - removed it from the squid conf to see if it helps anything
18:13 Tim: replaced loreley with perlbal on diderot. Loreley was only staying up for a few minutes at a time. Perlbal is only using 30% CPU.
18:04 RobH: srv52 was crashed. Rebooted. HTTPD still offline per JeLuF, but ssh is now working.
17:55 RobH: sq8 locks on boot reading 'Starting Up' Attached to serial 10 for further troubleshooting.
17:51 RobH: Started memtest on srv144 per JeLuF request.
17:38 RobH: srv111 same error as srv126. Server is online after reboot.
17:33 RobH: srv126 was down in nagios. Looked like an OS crash/lock. Rebooted and is now online.
17:21 RobH: sq11 returned from SM and connected. Set console redirect to port 9. No OS present.
16:55 RobH: thistle issues, would not detect mbr/boot media. Booted into RAID management for JeLuF to access via serial console.
16:44 RobH: srv129 ssh not accessible. Rebooted and now works.
16:26 RobH: srv133 ssh not accessible. Connected console and system rebooted. Server now online and working.
16:16 RobH: Corrected db6 DRAC settings, should work now.
16:00 RobH: Replaced bad drive in array for db8. Did not rebuild.
15:45 RobH: Replaced the secondary powercord for db10.

January 10

19:14 Tim: restarted loreley on diderot
19:11 brion: spot-tests seem to retrieve bits out of ES just fine on storage2. possibly temporary problems such as overload, but not really sure. will investigate further later, but at least storage looks safe
18:46 brion: external.log was being flooded by errors with enwiki blobs from the dump running on storage2 (have paused it). looking into whether it's a bug w/ storage2 setup or if we've got corrupt stuff
18:01 brion: rotated external.log, oversized

January 9

river: borrowing fuchsia + vandale to dry-run zedler reinstall
23:15 brion: replaced SSL cert on friedrich so fundraising.wikimedia.org has a non-annoying cert
- expires in a month, will want to upgrade to a paid one by then :)
22:12 mark: Set up header address rewriting on To: and CC: headers rewriting old mailing list addresses to new ones on lily
21:30 river: v6 outgoing smtp connections are experimentally enabled on lily
21:15 jeluf: removed binlogs 10 to 29 on srv88.
19:00ish: big eswiki template issue
18:55 river: repaired some broken knams hosts (fuchsia, mayflower, mint) to fix v6 config
17:30 jeluf, domas: Current status: Loreley broke, apache waited for loreley, apaches overloaded, lvsmon depooled some, load was too high for the remaining apaches, lvs pooled some again, vicious circle -> *boom*. manually pooled all working apaches, site still slow.
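Note on the 17:30 entry: the depooling feedback loop it describes can be illustrated with a toy model. This is a minimal sketch, not lvsmon's actual logic; the function name, the even load split, the numbers, and the `min_pooled` floor are all assumptions made for illustration. It only models the depooling half of the loop, not the re-pooling that followed.

```python
def simulate_depool_cascade(total_load, capacity, servers, min_pooled=0):
    """Return the pool size after each monitoring round.

    Hypothetical model: load is spread evenly over the pooled servers.
    Whenever the per-server share exceeds `capacity`, the monitor depools
    one server, which raises the share for every remaining server.
    """
    history = [servers]
    pooled = servers
    while pooled > min_pooled and total_load / pooled > capacity:
        pooled -= 1  # depool one overloaded server
        history.append(pooled)
    return history


# A pool running exactly at capacity is stable.
print(simulate_depool_cascade(1000, 100, 10))
# Losing a single server pushes the rest over capacity and the pool drains.
print(simulate_depool_cascade(1000, 100, 9))
# A floor on the pool size (an assumption, not something the log mentions)
# stops the drain, at the price of leaving overloaded servers in rotation.
print(simulate_depool_cascade(1000, 100, 9, min_pooled=5))
```

In this model, once the per-server share exceeds capacity every further depool makes it worse, which matches the "vicious circle" described above.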
16:45 jeluf: restarted loreley.
16:15 jeluf: killed loreley on diderot since it was no longer answering requests. Domas disabled lucene search.
01:52 brion: restarted pdns on browne
01:45 brion: some kind of internal dns breakage in pmtpa
01:10 river: fixed ipv6 at knams (assigned proper IP to every host, disabled autoconf, fixed dns, fixed netmask (removes the annoying "wrong prefix" message))

January 8

21:25 brion: all.dblist was corrupt, with many missing entries and a bunch of nonexistent wiktionaries. (cf bugzilla:8544) copied pmtpa.dblist back over it, which seemed ok. have a copy in my home dir if someone's interested in a postmortem
14:18 mark: We were saturating Kennisnet's uplink, moving some more images to pmtpa
13:14 mark: Installed rng-tools on lily; apparently it has a hardware RNG :)
12:07 mark: Many Exim processes blocked on /dev/random because of a starved entropy pool on lily. Disabled outbound TLS for now; the real fix is to get a better random source or link Exim to OpenSSL instead of GnuTLS.

January 7

20:36 mark: Removed mailing list info-de-l by akl's request
18:40 Tim: restarting mysql on db7, changing innodb_flush_log_at_trx_commit from 1 to 2.
17:09 mark: Moved a couple more countries from knams to pmtpa for images
16:40 mark: The new Mailman hit a crash with some messages (~ 5 / 24h) in i18n.py that caused these messages to be shunted. Deployed a dirty fix/workaround to prevent this from happening.
14:05 Tim: srv52 is down, replaced its memcached slot with srv53
12:45 brion: fixed crond on leuksman
12:00 brion: srv53,srv54,srv61,srv68 sudoers files fixed
12:00 brion: srv3,srv53,srv54,srv61,srv68 do not update correctly due to broken sudo; poking at them (shut down apaches)
11:37 brion: running deleteDefaultMessages on all wikis serially (did a scap)
11:30 thistle down
02:27 Tim: Experimentally configured ariel to serve only enwiki's watchlist queries. It is the sole member of the watchlist query group.

January 6

21:46 mark: Set up a privacy filter for mailing lists, but in freeze mode instead of bouncing.
19:02 Tim: brought srv89 into ext. storage rotation
18:30 Tim: installed memcached on about 5 reinstalled servers, added them to the spare list
17:58 Tim: srv129 is broken, returning 404 errors via apache with no access via ssh. Took it out of rotation. Requires manual restart and reinstall.
17:30 Tim: due to a configuration error, adler had no load and samuel (the everything else master) had lots. Fixed it.
17:00 Tim: all enwiki servers were lagged by about 5 minutes. Sent STOP signal to backup job running on storage2.
16:57 Tim: brought srv68 into apache rotation
16:45 Tim: removed srv111 from memcached rotation, it's down. Deleted binlogs on srv89 to free up space.
16:00 hashar: srv89 / partition is full.
15:55 hashar: added new namespaces for itwikibooks (see bug 7354 & 8408).
10:00 mark: Starting migration of mailing lists
7:15 jeluf: Starting maintenance of OTRS. Migrating to new version and new DB servers srv7 and srv8

January 5

19:39 brion: i think i bashed civicrm urls into shape. broke it for a while when trying to update the serialized config array in database (CR-LF sucks!)
17:15 brion: updated wikibugs, now in svn (under tools/wikibugs)
16:42 RobH: storage1 would not boot reliably. Reseated all cards and memory, it now boots just fine. No OS currently loaded.
16:31 RobH: srv134 was in read-only filesystem. Ran manual FSCK and rebooted.
16:19 RobH: Rebooted Will and set console redirection to 9600
6:20 jeluf: SCSI errors on db8. / was remounted read-only due to these errors. Rebooting.
6:00 jeluf: Cleaned up disk space on srv92, removed old binlogs 40-69

January 4

18:00ish robchurch: ~~...seems to have fixed itself?~~
- no, still throwing errors - on commit, it's going nuts at the top with "insufficient disk space, please try later" repeated over and over

Fixed by domas, pascal:/var/log/ldap.log took all the disk space. Needs to be rotated / zipped.

17:35 robchurch: BugZilla is dead, e.g. "my bugs" produces ./data/versioncache did not return a true value at globals.pl line 358
15:00 Tim: made Rob Church a bugzilla admin
14:40 Tim: reset mysql root password on pascal. New root password is in /root/.my.cnf

January 3

22:56 mark: Restarted loreley on diderot

January 2

15:24 mark: Increased cache dir size of knams text squids to 10 GB per disk.
15:24 mark: Installed a new DNS recursor/resolver on lily
10:40 brion: set up temporary web server on storage2 to watch the dump testing... it's pulling from ES directly (no previous XML to pull from) so may be extra slow, but should get cleaner copies from it in case old errors have accumulated
10:30ish brion: made another internal chair wiki for anthere
10:20 brion: browne dns temporarily broke... or something... after updating dns
09:53 brion: shutting srv134 down, didn't come up after reboot
09:38 brion: rebooting srv134 via ipmi, hopefully

January 1

20:45 jeluf: db3's replication slave is broken. Processlist shows the same query all the time. show slave status is hanging. Restarting mysql.
20:45 jeluf: srv134 is running apache, but doesn't allow SSH logins. Can't be updated by scap any more.

December 30

19:35 mark: Shut down BGP session to ar1, as ar1 seems to be the culprit of the packet loss. Uplink will be reconnected to a PM core router in the upcoming days.
19:13 mark: udpmcast.py wasn't running on goeje, started it.
~14:00 mark: There seems to be 4-6% packet loss outgoing to Hostway. Routed some problematic traffic over TWTC.
12:40 mark: Rebooted csw5-pmtpa as an attempt to solve some strange issues we've been seeing
04:30 brion: running dump-generation tests on storage2

December 29

~16:00 Tim: setting up srv53, will do the others soon. Various problems experienced probably due to FC3, FC4 would have been easier.
13:25 Tim: installed ganglia-metrics on ubuntu squids
02:39 mark: Shut down gi0/8 on csw1-pmtpa (srv128's port) by request of brion
02:35 brion: whining about srv128 being generally slow and broken
01:55ish brion: site notices and general briefly broken by bad interwiki database file generation. not entirely sure how that happened o_O
01:40 brion: rotated 5gb db error log file :P; setting up internal office wiki
01:11 Kyle: srv144 was off, not sure why. Killed apache just in case.
01:05 Kyle: srv126, srv129, and srv144 have had their ram replaced and apache killed for sanity. (Commented rc.local)
00:55ish brion: updating DNS for office.wm.o
00:46 Kyle: srv53, srv54, srv61, and srv68 have a fresh FC3 and a new raid card and are ready for apache service.

December 28

22:44 mark: Installed storage2 for backup purposes by request of Brion. has a 3TB RAID-10 array, JFS, with some space left in the volume group.
~19:55 Tim: installing gmond on various squid servers
18:05 Tim: running updateSpecialPages.php on all wikis, with some of the more expensive pages disabled.
16:50 mark: Noticed that goeje was about to crash again due to overload (most likely not hardware failure). There were lots of smtp and python (mailman) processes running. After killing them, load dropped and the box was under control again.
10:56 Tim: running resolveStubs.php on jawiki

December 27

12:38 Tim: remounting db1:/a with noatime
11:53 Tim: running moveToExternal.php on jawiki (to cluster6)
10:32 Tim: removed fedora mirror from srv81
10:05 Tim: started replication on srv89 (old cluster8 master), from position srv88_log_bin.000001, 0.

December 26

15:30 to 19:00 Rob: Replaced Fans with SM Warranty Tech for Storage1 and Storage2. Storage1 will not boot correctly. Storage2 is online and ready for testing once again.
15:30 Tim: enabled variant aliases (e.g. http://sr.wikipedia.org/sr-el) on all serbian wikis
12:14 Tim: disabled firewall on db7
12:05 Tim: put db7 into rotation
08:52 Tim: shut down mysql on db7, will shortly reboot it to enable write-behind caching

December 25

06:02 brion: added a little more sanity checking for error messages. man, this cache script sucks :D
05:37 brion: adjusted thumb cache script for better validity checking, de-escaping of input filenames so that images w/ punctuation or non-ascii chars should be less problematic (bugzilla:8367)
02:48 brion: migrating files from benet to amane again to free space
02:48 brion: requesting reboot on srv85
02:40 brion: srv85 not responding; benet disk full; removing srv85 from slave rotation on es to lower load and examine it

December 24

11:27 Tim: started replication on db7

December 23

22:00 brion: reenabled CIA-bot notifications on svn commit, in e-mail mode to hopefully not hang
05:47 brion: srv15 freaking out about cpu temperature and 'running in modulated clock mode'. taking out of service ... no wait, it stopped. odd. leaving it
00:41 brion: running title cleanups; RLM/LRM marks now stripped from titles, and any other mystery borkages...

December 22

23:47 brion: batch-initialising user_editcount fields
19:00 mark: Shut down apache and lighttpd on amaryllis
07:16 brion: aaaand it's up!
07:00ish brion: colo guys trying to swap hardware back with harris to see if that works; if not we'll use the backup
06:35 brion: goeje not coming back after reboot attempts. restoring its pre-move backup to harris, going to put it into place for now if we can't get it back up soon
05:54 brion: goeje is not online. what happened?
- It crashed after being online for about half an hour.
04:20 Kyle: I'm not so sure mail is flowing correctly...
04:01 Kyle: srv3 had no memory errors. Brought back up with no apache.
03:53 Kyle: goeje's chassis swapped with harris's. Mail seems to be flowing again.

December 21

22:06 brion: killed a stuck search index rebuild dump process... had been stuck since november 11! sigh... was holding up the build loop for small wikis

December 20

23:56 mark: Prepending our ASN once on TWTC's link
22:47 mark: Gave Adrian Chadd (adri) access to knsq15 (like yf1010), as he needed a busier server to test.
19:45 brion: goeje back online; had to force mailman to restart again, its stupid lockfiles get left
19:30 brion: requested reboot for goeje, dead again. kyle plans to transplant the drive into another mobo/chassis tomorrow, mark plans to replace the whole shebang in a week or two when we have a fresh new machine
04:05 Kyle: Tests performed on Storage1 <- Results are on the page.
02:50ish domas: been doing horrible things to db1
00:25 mark: Loreley on diderot was stuck on a futex again, had to restart it.

December 19

20:32 mark: Cleaned up the Squid leechers block list, updated some IPs. Most entries had long expired, domains no longer existed, URLs invalid or IPs were reassigned.
19:36 brion: mostly recovered from MASSIVE SLAVE OVERLOAD due to bad sorting in Special:Categories query change
19:16 brion: all dbs updated, so scapping to current mw
18:12 mark: Installed db8.
16:40 mark: Automated Ubuntu installs on internal servers are now possible. Installed db9.
15:12 brion: set up apc on friedrich; it was disabled, making load pretty high since people have been linking the new fundraiser report pages which are a bit php-intense
14:54 mark: Set up a forward Squid (the 2.6 version from Edgy, not our Wikimedia variant) on khaldun TCP port 8080 for use by internal servers, to let them access external webservers like security.ubuntu.com.
08:44 Tim: reading enwiki dump into mysql on db7
05:29 brion: master switch done. there was a brief period of 'write lock' errors on wikis not in the s2 or s3 group due to my slip-up. the use of read-only mode will have prevented this from causing data integrity problems (yay)
05:17 brion: starting master switch for s2/s3 adler -> samuel
04:54 Tim: started slave on ariel
00:17 brion: running schema updates on db3

December 18

15:08 mark: Doubled cache_dir size for knsq1 as a trial
15:07 mark: Why does sq29 have a weird cache_dir setting?
13:05 Tim: using ariel for SQL dump. Stopped slave.
10:26 Tim: brought db6 back into rotation
08:50 Tim: zwinger's root partition was full. Switched off debug-level logging in syslog on zwinger, deleted debug log.
08:22 Tim: setenforce 0 on db6. This was the reason /etc/init.d/mysql wasn't working.
07:26 Tim: mysqld on db6 was apparently running directly from a bash prompt, instead of via mysqld_safe. Restarting it, and increasing the default maximum number of FDs by editing /etc/init.d/mysql appropriately.
06:31 Tim: starting master switch from db3 to db2
05:30 Tim: installed ganglia on ariel
03:40 brion: webster was missing its local socket for mysqld so couldn't be root-logged in locally in mysql. restarting daemon.
- cron.daily/tmpwatch is suspected; could clear socket files after 10 days of no detected use...?
03:20 brion: the following slaves were running with read_only OFF in violation of reliability guidelines:
- db2 ariel db6 webster holbach
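Note on the 03:20 entry: a check like this can be expressed as a small filter over per-host settings. This is a minimal sketch under assumptions: the function name and the input dictionaries are hypothetical, and the values are taken to be the "ON"/"OFF" strings that MySQL reports for `SHOW VARIABLES LIKE 'read_only'`. How the values are collected from each host is left out.

```python
def slaves_not_read_only(roles, read_only):
    """List slave hosts whose read_only setting is not ON.

    roles:     hypothetical mapping of host -> "master" or "slave"
    read_only: hypothetical mapping of host -> "ON" / "OFF"
    A slave with no recorded value is reported too, since it is unverified.
    """
    return sorted(
        host
        for host, role in roles.items()
        if role == "slave" and read_only.get(host, "OFF").upper() != "ON"
    )


# Illustrative values only; they are not the real state of these servers.
roles = {"db3": "master", "db2": "slave", "ariel": "slave", "db4": "slave"}
read_only = {"db3": "OFF", "db2": "OFF", "ariel": "OFF", "db4": "ON"}
print(slaves_not_read_only(roles, read_only))
```

The master is expected to be writable, so it is skipped; only slaves are flagged.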
03:11 brion: noticed ntp seems broken on db6; selinux is denying access to files?
03:09 brion: depooled db6; replication broken
Error 'Can't find file: './enwiki/text.frm' (errno: 24)' on query. Default database: 'enwiki'.
00:32 brion: running db schema updates on slave servers (in a screen session on zwinger)

December 17

23:26 brion: lowered ariel's priority from 100 to 50; it's consistently 15-30 seconds lagged
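Note on the 23:26 entry: assuming the priority acts as a relative load weight, the effect of halving it can be estimated as below. This is a sketch only; the other hosts and their weights of 100 are hypothetical, and the function is not MediaWiki's load balancer.

```python
def read_share(weights):
    """Expected fraction of reads per host when picking proportionally to weight."""
    total = sum(weights.values())
    return {host: weight / total for host, weight in weights.items()}


# Hypothetical pool: ariel plus two other slaves with weight 100 each.
before = read_share({"ariel": 100, "slave_a": 100, "slave_b": 100})
after = read_share({"ariel": 50, "slave_a": 100, "slave_b": 100})
print(f"ariel before: {before['ariel']:.2f}, after: {after['ariel']:.2f}")
```

With these assumed weights, ariel's share drops from one third to one fifth of reads, so halving a weight does not halve the share.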
19:15 jeluf: pooling db6
18:07 mark: Users were reporting out-of-sync watchlists, nagios reported slave not running on db6, SHOW SLAVE STATUS confirmed. I depooled db6 by commenting out in db.php.

Odd error message: Error 'Can't find file: './enwiki/recentchanges.frm' (errno: 24)' on query ...

jeluf: Stopped slave, started slave, works fine. File was in place, no idea why mysql didn't see it

domas said that db6 was out of file descriptors
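Note on the error above: errno 24 is EMFILE ("too many open files"), which is consistent with domas's diagnosis rather than with a missing .frm file. A quick way to decode such a number, sketched with Python's standard library (the helper name is an assumption):

```python
import errno
import os


def describe_errno(code):
    """Return the symbolic name and the OS message for an errno value."""
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)


name, message = describe_errno(24)
print(name, "-", message)
```

The message text comes from the local C library, so its exact wording can differ between platforms.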

December 16

21:54 brion: disabled CIA hit from SVN post-commit script; it's been hanging a lot lately
05:40 Kyle: db5, db6, and db7 are at 1G and have the normal root password and are ready for mysql service.
04:36 Kyle: Replaced cables for db5-10
00:44 brion: set default sitenotice w/ basic fundraising info; tweaked the old one from last year a bit as the text is a little cleaner than the tiny anonnotice from en.wikipedia

December 15

21ish brion: enabled UsernameBlacklist extension on dewiki by request
19ish brion: enabled DismissableSiteNotice extension sitewide (now in svn, and with localization fixed for button)
19ish RobH: Re-installed FC5 on db5. Confirmed cables for db5-db10 need replacement prefabs for gigabit operation.
17:05 jeluf: increased retry count for "mysql running threads" and "lucene"
15:35 Tim: moved srv41 from apache to search, in enwiki pool.
15:10 Tim: re-added srv40 to the search pool
08:47 Tim: updated FixedImage configuration
08:40 Tim: stepped clock on amane (167s off)
08:26 Tim: set up cron job for fundraising meter in amane:/etc/cron.d/fundraising . Configured lighttpd to send Cache-Control: max-age=300,s-maxage=300 for the relevant file.
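Note on the 08:26 entry: in that header, max-age applies to any cache including browsers, while s-maxage applies to shared caches such as the Squids and takes precedence there. Both are set to 300 seconds, so the meter can be up to five minutes stale. A minimal parsing sketch (the function name is an assumption, and quoted values containing commas are not handled):

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a dict of directives.

    Numeric arguments become ints, other arguments stay strings,
    and bare directives such as no-cache map to True.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, arg = part.partition("=")
        arg = arg.strip().strip('"')
        if arg.isdigit():
            directives[name.strip().lower()] = int(arg)
        else:
            directives[name.strip().lower()] = arg if sep else True
    return directives


parsed = parse_cache_control("max-age=300,s-maxage=300")
print(parsed, "->", parsed["s-maxage"] // 60, "minutes in shared caches")
```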
8:20 jeluf: removed binlogs 50-69 on adler
6:20 jeluf: db6 added to enwiki pool
5:30 jeluf: removed binlogs 1..29 on srv92
5:00 jeluf: copying enwiki DB to db6
12:00 onward RobH: Racked db5-db10. Enabled drac, installed fc5 on db5-db7

December 14

7:45 jeluf: added ariel to the mysql pool.
5:25 jeluf: copying mysql from db2 to ariel. Ariel has a broken disk in its RAID. Rob set up a new array without the broken disk.
04:51 Tim: added names recursor0.wikimedia.org and recursor1.wikimedia.org for the new resolver VIPs, and also reverse DNS.

December 13

23:35 mark: switch traffic back to yaseo

December 12

14:29 mark: Doubling the size of the cache dirs of knams upload squids - it seems they can take it, others will follow if successful.
13:35 mark: yaseo traffic suddenly dropped quite a bit, which seems like routing trouble. Sending all yaseo traffic to pmtpa.

December 11

21:00 brion: updated search-rebuild-wiki script to use getSlaveServer to force slave use; an enwiki build was slurping from master, which made domas complain
20:35 brion: restarted mailman, was accidentally left off a couple hours ago after a list archive modification
18:45 jeluf: rebooted srv120. Its apache stopped answering several times.
18:00 jeluf: locked mowiki (switched to readonly mode), according to resolution [1] and the voting at [2]
17:40 Tim: enabled oversight on all wikis. The policy issue can be decided by stewards when they grant access, I don't have time to read yet another set of 600 debates.
17:30 jeluf: added eth2 to /etc/sysctl.conf on srv147, started apache
17:25 jeluf: restarted crashed squid on sq6
16:45 Tim: recompiled FSS on srv145, was using old version
16:40 Tim: took db4 out of rotation
15:11 Tim: Updated index pages for download.wikimedia.org and static.wikimedia.org.
12:00 db4 went down

December 10

22:00 mark: Disabled options rotate in all srv*'s /etc/resolv.conf to use only the primary nameserver in normal circumstances. Also changed all nameserver lines to the new resolver service IPs 66.230.200.17 and 66.230.200.18 earlier, which caused some weirdness with the Foundry (overflowing CAM table?)
21:04 mark: ariel.pmtpa.wmnet resolved to suda's ip due to my mistake, fixed
19:51 mark: Set up a secondary DNS resolver temporarily on khaldun - until we have a new mailserver.
18:56 mark: Setting up a new DNS resolver (pdns-recursor) on bayle. Made it forward internal zones to ns0.wikimedia.org. srv1 now slaves these zones from ns0 as well, so do not edit zonefiles on srv1! albert doesn't even seem to have the internal zones, I'm not fixing that, redoing the entire setup.
10:00 Domas: yesterday db4 was deployed with 4.0.28/tcmalloc - seems to be still working, but performance difference does not seem to be very huge. Needs proper benchmarking.

December 9

23:59 Mark: Installed Ubuntu on bayle
23:00 Kyle, Mark: Tried to install the 2 new storage servers, but there's something seriously wrong with the write performance of their OS arrays:

1048576000 bytes (1.0 GB) copied, 1475.95 seconds, 710 kB/s

storage2.wikimedia.org is up as a temporary test.
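Note on the figure above: the reported rate follows directly from the byte count and the elapsed time, using decimal kilobytes. A small sketch of the arithmetic (the function name is an assumption):

```python
def throughput_kb_per_s(num_bytes, seconds):
    """Average transfer rate in kB/s, with 1 kB = 1000 bytes."""
    return num_bytes / seconds / 1000


rate = throughput_kb_per_s(1_048_576_000, 1475.95)
print(f"{rate:.0f} kB/s over {1475.95 / 60:.1f} minutes")
```

That is roughly 0.7 MB/s sustained for almost 25 minutes.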

15:30 - 17:44 Kyle, Mark: Reinstalled all remaining Squids (sq14 - sq30) with Ubuntu Edgy so they run tcmalloc.
17:28 Kyle: srv78 switched to correct kernel and rebooted. Killed apache just in case it's old.

December 8

16:11 Tim: started new static HTML dump
11:45 jeluf: started srv(117|120|121|145|148|149) apaches after scap.
04:38 Kyle: srv117 has acpi off. I would like to see how it runs. Killed apache on this too just in case.
04:28 Kyle: srv120 ram replaced. Killed apache and awaiting sanity check before service.
04:22 Kyle: srv121 ram replaced. Killed apache and awaiting sanity check before service.
04:05 Tim: deleted adler binlogs 40-49 (to November 28)
02:00 mark: yf1010 was not in the Mediawiki trusted XFF list. Added all yaseo servers just to be sure.
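Note on the 02:00 entry: when a proxy is missing from the trusted list, the X-Forwarded-For address it reports is ignored and its users are presumably all attributed to the proxy's own IP. The sketch below illustrates the general technique of walking the XFF chain through trusted hops only. It is not MediaWiki's implementation; the function name and all addresses are hypothetical.

```python
import ipaddress


def client_ip(remote_addr, xff_header, trusted):
    """Resolve the client address from an X-Forwarded-For chain.

    Starts at the directly connected peer and steps outward through the
    header, right to left, only while the hop that reported the next
    address is in the trusted set.
    """
    hops = [h.strip() for h in xff_header.split(",") if h.strip()]
    current = remote_addr
    while hops and ipaddress.ip_address(current) in trusted:
        current = hops.pop()
    return current


# Hypothetical addresses: 10.0.0.1 stands in for a squid.
squid = ipaddress.ip_address("10.0.0.1")
print(client_ip("10.0.0.1", "203.0.113.7", {squid}))  # squid trusted
print(client_ip("10.0.0.1", "203.0.113.7", set()))    # squid not trusted
```

In the untrusted case every request through that proxy resolves to the proxy's address, so blocks and rate limits would hit all of its users at once.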

December 7

23:38 mark: Pooled Adri's testserver yf1010
22:58 mark: Reinstalled yf1010 with Ubuntu Edgy, for temporary use by Squid developer Adrian Chadd - he has root access on the box.
21:10 hashar: adler is running out of disk space (12/400GB free) [3]
jens: rebooted goeje at some point

December 6

01:10 mark: Reinstalled yf1000 - yf1009 with Ubuntu Edgy to run the latest Squid deb. Just sq14 - sq30 left...
22:16 mark: Deployed squid_2.6.5-1wm6 (with tcmalloc) on all Ubuntu Edgy Squids. Dapper Squids need to be upgraded, libgoogle-perftools is only available in Edgy.
21:20 mark: Disabled Squid's coredumps again, they were causing more problems (filled up filesystems) than helpful information. I'll enable them selectively on certain debug-Squids from now on.
21:19 brion: rebuilt stats table for frwikiquote, was empty/broken
20:30ish brion: fixed info.txt with updated version
17:55 brion: restarted leuksman web server, was mysteriously crashed again

December 5

23:20 jeluf: added ariel back to the pool after mark reinstalled it to 64bit and domas set up mysql.
19:23 mark: sq13 was running with 100% CPU, probably memory fragmentation. Installed the experimental tcmalloc squid deb.
19:00 mark: ariel was installed with Fedora 32 bit, which is "not helpful". Remotely reinstalled it with Ubuntu Edgy AMD64. ~~Had to move it to public VLAN for that, so new hostname is ariel.wikimedia.org~~.
07:08 Tim: started mysqld on db2
05:57 jeluf: OS configuration of ariel. Currently copying mysql from db2 to ariel
05:26 jeluf: switched DNS back to use all three datacenters.
02:58 Kyle: sq29 rebooted. Down for unknown reason.
02:51 Kyle: srv117 brought up, and apache killed for sanity.
02:46 Kyle: ariel is available.
01:47 brion: killed srv3's apache and removed its LVS address so it won't restart itself. it doesn't sync scripts properly...

December 4

23:48 mark: Moved knams traffic to pmtpa
23:33 brion: knams down
23:15 mark: Experimenting with bigger COSS cache dirs on knsq15
23:15 brion: rerunning SUL pass 0 migration test with new schema
20:45 mark: Running Squid linked to tcmalloc on knsq2 and hawthorn to try to solve the malloc fragmentation problems
19:50 brion: reopened fr.wikiquote on the board's orders

December 3

20:24 mark: loreley on diderot was blocked on a FUTEX. After killing and restarting it, it wouldn't keep running, so started perlbal instead.

December 2

14:59 mark: Fixed yf1001 and yf1013, yf1001 is up as a text squid.
09:15 brion: completed manual tweaks for blob recovery and did a bunch of purges of affected pages
02:39 brion: running disambiguation recovery for blobs
00:39 brion: put srv89 into read-only while i continue working w/ it
00:31 brion: srv3, 117, 121 also had bad config files and were saving into srv89. sigh. stopped those, and now poking dbs

December 1

23:43 brion: bad page saves discovered on frwiki and perhaps others. bad blobs were saved onto srv89, the former ES master, after its Apache was accidentally brought up with non-updated config files. have updated files on srv89, will need to find and clean up affected blobs.... somehow... :D
01:23 river: replaced perlbal on diderot with loreley
01:00 mark: Running an unoptimized Squid (-O0) on sq8 and knsq1 to get useful coredumps

November 30

23:59 mark: Set coredump_dir /var/spool/squid in squid.conf
14:59 mark: knsq15's hardware has been replaced. Installed it, it's up as an image squid
12:44 river: testing loreley on diderot next to perlbal
06:57 Tim: changed access rules on the text squids to allow queries to the bot entry points even from user agents on the stayaway list
06:15 Tim: relaxed restrictions for missing user agents in checkers.php: allow for query.php, api.php and action=raw
05:49 Tim: re-added srv82 to ext storage
04:45 jeluf: synced nagios config to reflect the changes in the MySQL setup (i.e. Master of cluster 8)
04:30 jeluf: Changed "root reserve" of /a on db1 from 5% to 0%
04:25 jeluf: squid on sq8 crashed, restarted it
03:55 brion: restarted data dumps on srv31 and benet
03:14 Tim: upgraded FSS on srv88 and srv82, were using the old segfaulting version

November 29

12:00 domas: note from yesterday, lucene perlbal was swapping with 500MB VM - memory leak in there. used 20MB after restart.
10:50 mark: oprofiling squid on knsq1 and knsq3
04:37 brion: rotated spam blacklist log - hit 2gb limit

November 28

20:38 sq12 disappeared
18:40 mark: Installed knsq2 (which was unreachable before) as squid.
16:57 Tim: set up index page for http://upload.wikimedia.org/ . Also changed the MIME type for .html on amane to text/html.
16:45 mark: Changed routing policy to send a bit more traffic to TWTC
16:36 Tim: deleted adler binlogs 1-39
16:21 brion: running centralauth pre-migration pass 1 testing (in a screen on zwinger)
07:56 brion: running centralauth pre-migration pass 0 testing (in a screen on zwinger)
07:20ish brion: webster replication broke from centralauth inserts confusing the limited replication. domas fiddling with settings
07:02 Tim: srv89 not back up. Took it out of ES rotation, made srv88 the new cluster8 master.
06:56 Tim: restarted srv89, wasn't responding to ssh
04:55 brion: creating dummy centralauth db on commons servers, going to start back-end migration testing tonight
01:40 brion: added wikisv-skilkom-l list

November 27

19:39 jeluf: changed hardcoded "ariel" in nextJobDB.php into "db4" since ariel is down and job queues were filling up.

November 26

08:22 Tim: changed some wiki logos
00:30 mark: Removed sq1.pmtpa.wmnet - sq10.pmtpa.wmnet from internal DNS, as those servers have moved to external
00:18 Kyle: sq3 up and ready for squid.

November 25

23:55 Kyle: removed audit on srv82. Stopped apache for sanity check.
23:50 mark, Kyle, JeLuF: reinstalled srv7 and srv8 with Ubuntu Edgy as a misc DB cluster for things like OTRS, bugzilla, etc...
23:42 Kyle: brought up srv117, but killed apache for sanity check.
16:00 - 22:30 Kyle, mark: Reinstalled sq1 - sq13 as Ubuntu Edgy squids
17:53 Tim: same on srv120,srv71,srv56,srv59
17:46 Tim: srv110 came up for some reason. Did sync-common, fixed time, recompiled FSS.
16:19 Tim: removed XFF logs from March to July
11:57 ariel died

November 23

05:52 jeluf: added srv83 to external storage cluster 6, disabled srv82.

November 22

23:33 brion: set wgAccountCreationThrottle to 1 on frwiki in response to proxy vandal attack
21:55 brion: took srv82 out of ipvsadm manually
21:50 brion: srv82 is breaking srwiki, doesn't respond to ssh. needs taking out of service
18:15 jeluf: restarted squid on sq12 and sq13. They were down.

November 21

17:00 domas: amane unhappy - restarted nfsd with more children (/etc/sysconfig/nfs created), restarted lighty and php env.
13:03 Tim: did scap. Lots of servers started segfaulting about 15 minutes later. Disabled the new FSS stuff, that fixed it.
06:40 brion: reopened access to stats.wikimedia.org now that the files are scrubbed
04:24 Tim: deployed text squid configuration: redirected static.wikipedia.org from srv31 to albert.
04:22 Tim: removed old keys for yaseo servers from zwinger:/etc/ssh/ssh_known_hosts. Hey, I don't suppose we could back these up and restore them next time we reinstall servers?

November 20

19:06 mark: Anthony overloaded, sending en: thumbs back to amane
17:40 mark: TWTC transit back up
16:30 mark: Disabled HELO checking on albert, it was bouncing valid e-mail
03:05 Kyle: srv83 - removed auditd. You can now log in.
02:54 Kyle: Ram replaced in sq3, ready for service.

November 18

11:30 mark: TWTC BGP session down for unknown reason, stuck in 'CONN' state

November 17

19:50 mark: Installed TWTC transit
16:08 brion: fixed
16:05 brion: message serialized files maybe borked, missing some new data. :( trying to regen
15:05 mark: Playing with Varnish on hawthorn
11:00ish brion: fixed problem on arbcom-l where mail vanished into ether; for reference, problem was extra blank lines in the spam filter sending all mails into discard bitbucket
07:14 Tim: Re-added srv74 to the ext store list.
01:55 Tim: Most apaches have recovered, either by themselves or through my action, but srv34 is still in swapdeath.
01:25 memory usage jump on some apaches, sending some into swap.

November 16

11:40 brion: set $wgGenerateThumbnailOnParse back on for private wikis using img_auth.php, as img_auth doesn't automagically pass through not-yet-generated thumbnails
11:34 mark: Removed proxy-only from all squid.conf sibling lines, as I believe it actually decreases performance and cacheability in various ways. We'll see what the actual effect on the site is.
08:22 Tim: While investigating dumpHTML performance, I found that the NFS client in TCP mode was regularly pausing for 15 seconds, before disconnecting and reconnecting to the server. This was occurring for both /mnt/upload3 and /mnt/static. Switched srv122, srv123, srv124, srv125, srv42 to UDP mode in response, for all NFS shares.

November 15

23:24 brion: goeje back up after reboot; took a couple hours to get pm to do this; slow response to email and phones were busy. possibly support overload due to their recent exciting network problem?
11:57 brion: changed check-time script to use full path to ntpdate; some machines didn't have it in local path while scapping

November 14

18:57 mark: Reinstalled hawthorn, iris, lily
16:40 Tim: started HTML dump on albert
16:16 Tim: srv143 and srv144 do not have the VIP on lo, presumably because of the now-fixed problem with eth2 and rc.local. Restarting, will attempt to bring into the pool.
16:08 Tim: fixed sysctl.conf on srv141, restarted
16:04 srv121 went down
15:50 Tim: added srv121 and srv123 to the apache pool. Installed ganglia on srv121.
14:46 Tim: installing mediawiki on albert for use as a static HTML dump controller
11:50 mark: Disabling the old knams Squids; new servers seem to be running just fine
09:34 Tim: going to run rsync --delete on the thumbnail servers, to fix outdated files which weren't purged, and cached error messages
07:20 jeluf: restarted squid on sq13 (squid crashed around 2 a.m.)
07:15 jeluf: cleaned up disk space on adler
06:25 Tim: fixed url encoding problem in HTCPpurger, set up synced copy in /usr/local/bin

November 13

21:34 mark: Put knsq8 - knsq14 into production as image squids
21:05 mark: Put knsq1 - knsq7 into production as text squids
19:19 mark: knsq1, knsq4-knsq14 OS installed. knsq2 is inaccessible because of wrong BIOS settings (my fault), knsq15 seems broken, as it doesn't want to enter BIOS and just says System halted!.
16:12 mark: Adding knsq1-15 to MediaWiki's XFF list
16:08 mark: knsq3 entered production as a text squid
15:52 mark: Installed Ubuntu Edgy on knsq3.
15:51 Tim: What is this?

+                               global $wgMaxShellMemory;
+                               $wgMaxShellMemory *= 3;
+

stuck in the middle of reallyRenderThumb()? I'm not really a fan of exponentially increasing memory limits. So if I use 3 djvu images on a page, then I get up to 4GB for all subsequent images? Cool!

Rest assured that if I was Brion, I would be swearing right now, instead of making sarcastic comments.
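
A rough illustration of the growth being objected to above, assuming the limit is tripled once per rendered image and never reset within the request. The starting value of 102400 KB and the function name are assumptions for illustration only; they are not taken from the log.

```python
def shell_memory_after(images, start_kb=102400, factor=3):
    """Return the memory limit in KB after `images` thumbnails have been rendered.

    Each render multiplies the limit by `factor`, mirroring the
    `$wgMaxShellMemory *= 3` line quoted above. `start_kb` is a
    hypothetical starting limit.
    """
    limit = start_kb
    for _ in range(images):
        limit *= factor
    return limit


if __name__ == "__main__":
    # Print the limit in MB after 0, 1, 2 and 3 images on one page.
    for n in range(4):
        print(n, shell_memory_after(n) // 1024, "MB")
```

Because the multiplier compounds, every additional image on the same page raises the ceiling for all images rendered after it.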

05:36 Tim: deleted some old lighttpd error logs from amane, to free up root partition space
03:39 Tim: amane full too, deleting April, May and June backups

November 12

15:25 brion: benet full; migrating files. sigh

November 11

13:36 brion: removed 80.242.195.68 from tor node list in mwblocker.log by request
13:19 brion: enabling email notification on commons
10:25 brion: upgraded leuksman.com to mysql 5.0.27
06:20 Tim: created Server roles
02:52 Tim: holbach was still replicating from samuel! Switched it to adler and took it out of rotation while it catches up.
02:20 Tim: running schema updates on the old masters

November 10

16:02 Tim: updated nagios configurator, made it draw MySQL server lists from db.php instead of elsewhere.
15:28 brion: restarted mailman runner on goeje; stale lockfile was left from the downtime
14:48 Tim: starting master switch
14:21 Tim: set up sync from /home/wikipedia/upload-scripts to local hard drives for thumb-handler.php etc.
14:07 Tim: fixed cache-control headers for thumb.php error messages. Symlinked bacon's thumb-handler.php to amane's.
09:07 Tim: goeje back up. I'm not sure if it was my request to PM or to Kyle which got through. I haven't heard anything from either of them.
08:13 goeje down
06:01 Tim: srv53 down, removed from memcached pool.
02:57 Tim: srv83 is down, removed from external storage rotation. Ports are open but nobody's home.

November 9

22:45 brion: commenting crawl-delay out of robots.txt; hopefully this is obsolete and no longer needed
16:30 Tim: had conversation with VoiceOfAll (VoABot operator). He has patched the bot but has asked that it remain blocked until he has a chance to update the code, later today. The patch he describes will probably fix the problem, but it doesn't sound like he has the bug completely characterised, so I'm not 100% confident. I'm happy for the bot to be unblocked, but we should also implement some kind of protection on the server side against this kind of thing.
12:32 Tim: applying patch-rc_user_text-index.sql to slaves
11:00 brion: clearing math rows with the 'extra - at end of html' bug, so they'll re-render on next page parse
06:30 Tim: blocked VoABot, was causing lock contention, about 100 concurrent threads running on the master.

November 8

17:18 Added knsq1-15 to wikimedia DNS. Reverse DNS needs delegation.
17:18 mark: Resized knams subnet from /27 to /26... on the router and pascal only. Other servers still need to be done. Updated pascal's dhcpd.conf and pdns-recursor.conf
16:25 brion: updated dump runner scripts to use getSlaveServer.php instead of hardcoding servers
14:08 Tim: set up staggered search restart
12:08 Tim: frwiki search index rebuild was going very slow, maybe because it is using adler which has no cache of frwiki. Trying removing the --server option from dumpBackup.
11:15 brion: hopefully fixed the parsertests automated reporting
01:07 Tim: moved srv38 back from the search cluster to the apache cluster. It's doing OTRS DB, which conflicts with the resource requirements of search. Moved srv40 from apache to search in its place.
00:23 Tim: stopped using perlbal for "small" search cluster. Split traffic among the 4 servers by crc32 hash of DB name instead.
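
A minimal sketch of splitting traffic by a crc32 hash of the DB name, as described in the entry above. The host names are hypothetical placeholders and this is not the code that was deployed; it only shows the general technique.

```python
import zlib

# Hypothetical names standing in for the 4 "small" search servers.
SEARCH_HOSTS = ["search-a", "search-b", "search-c", "search-d"]


def pick_search_host(db_name, hosts=SEARCH_HOSTS):
    """Map a wiki DB name to one host, always the same one for the same name."""
    # Mask to an unsigned 32-bit value so the result does not depend on platform.
    checksum = zlib.crc32(db_name.encode("utf-8")) & 0xFFFFFFFF
    return hosts[checksum % len(hosts)]


if __name__ == "__main__":
    for db in ("enwiki", "dewiki", "frwiki", "jawiki"):
        print(db, "->", pick_search_host(db))
```

The mapping is stable, so each server only needs the indexes for the wikis that hash to it, but changing the number of hosts reshuffles most of the assignments.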

November 7

14:15 mark: Deleted all the upload.* ACLs on the text squids, should save a few percents of CPU
13:47 Tim: set up two parallel search updater threads on srv37: one for enwiki and one for the rest.
13:17 mark: Upgraded the PowerDNS Recursor to 3.1.4-pre3 on pascal, mayflower, amaryllis (security fix)
11:30 mark: Successfully upgraded khaldun to Ubuntu Edgy
~08:00 - 10:10 Tim: upgraded to nagios 2.5 (from source). Managed to get it sorting in natural order, after a lengthy battle.
07:50 Tim: made "sort by hostname" in ganglia use natural order
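
For reference, a small sketch of what "natural order" means for hostnames like ours: digit runs are compared as numbers rather than character by character. This is an illustrative Python key function, not the change actually made to ganglia or nagios.

```python
import re


def natural_key(hostname):
    """Split a hostname into text and number parts so srv9 sorts before srv10."""
    # re.split with a capturing group alternates text, digits, text, ...
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", hostname)]


if __name__ == "__main__":
    hosts = ["srv100", "srv9", "sq10", "sq2", "srv10"]
    print(sorted(hosts, key=natural_key))
```

A plain string sort would put srv10 and srv100 ahead of srv9, which is what made the host lists hard to scan.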
06:55 Tim: Due to a change in fedora, some of our servers just have /etc/rc.local, some have /etc/rc.local as a symlink to /etc/rc.d/rc.local, and some have both /etc/rc.d/rc.local and /etc/rc.local as regular files. Standardised on having a symlink from /etc/rc.local to rc.d/rc.local, mainly to avoid the problem of "decoy" files. Synchronised rc.local from /home/config, to fix the eth2 problem.
06:40 Tim: Fixed rc.local on srv136 (eth2 problem). Did restart test. Also did restart test on srv78. It hasn't come back up yet.
06:24 Tim: srv78's problem appears to be firewallinit.sh. Removing firewallinit.sh invocation from all apaches using sed -i~ '/firewallinit/d' /etc/rc.local . The problem may continue to recur on the many apaches that are currently down.
05:30-05:50 Tim: stepped clocks on srv6, srv39, anthony, alrazi (318s!), srv10. Samuel and adler have no routing to zwinger, srv78 has no routing to 10/8.
- Samuel and adler actually had the wrong IP address cached for zwinger, nscd -i hosts fixed them.
05:25 Tim: put a time check in apache-sanity-check. Warning only. Can be run independently from /h/w/b/check-time.
04:52 Tim: In nagios, set up a router dependency for knams and yaseo. Hopefully this will make for less noisy flapping on IRC.
01:45 mark: Redirected upload requests with referer wikipedia - download . org to http://upload.wikimedia.org/not-wikipedia.png

November 6

22:35 brion: resynced ntp config on srv63, srv74; were about 8 and 10 seconds off respectively
20:30 jeluf, brion: srv3 needs a mem check. Apache is segfaulting at 30 times the rate of other servers. Powered off.
19:00 jeluf: bw rebooted db4 since it was no longer pinging. Had to fsck /a after reboot, now recovering mysql.
18:18ish brion: also killed runJobs.php on several apache boxen, they also were spewing connection loop errors
18:14 brion: removed db4 from rotation; it's down and spewing a giant 11-gig dberror log
18:00ish jeluf: rebooted srv119, srv3, srv142, srv32. Their apache always died after one to two minutes of service. Running fine since the reboot.

November 5

22:30 brion: installed corrected fix for bugzilla:1109 which I think causes the intermittent 'application/octet-stream' errors for people. a recent addition of an output buffer in PHP via a live hack broke the old protection, which only peeled back one output buffer on 304 events, incorrectly assuming it would be the compression handler.
17:26 Kyle: sq3 has MCE errors, RMA'ing RAM
15:39 Tim: maurus ran out of disk space, cleaning up
14:30 mark: Installed squid-2.6.5-1wm2 with a bugfix for squid bug #1818 on yf1005 - yf1007, but another bug showed up.
13:20 Tim: sent frwiki search load to srv39
12:05 Tim: installed normal (i.e. unicode NFC) extension on srv37
03:00 mark: sq3 seems down

November 4

23:23 mark: Brought sq1 and sq3 up as Edgy squids, for stability testing (RAID controller)
22:55 mark: Upgraded all Ubuntu squids at pmtpa
20:40 mark: Upgraded all yaseo squids
19:29 brion: restarted nscd on mediawiki-installation group; 45 of 146 machines had nscd not running. load on ldap server went waaaay down after that :D
19:18 brion: fiddling with logging for ldap on srv1. shut off from srv2 as no idea if it's set up right
18:45 brion: started ldap on srv2 which is supposedly failover ldap. mark is also fiddling with srv1
18:20 brion: restarted ldap server, lots of machines whining and confused
- doesn't seem to have helped. lots of machines still complain about unknown user id or "you don't exist, go away"
16:14 Tim: Started search index update on srv37, in an infinite loop.
15:49 Tim: Search index update finished, syncing
? brion: fixed dump bug, restarted enwiki dump
15:05 mark: Created squid-2.6.5-1wm1 deb, and included a fix for a crash bug we were experiencing. Installed it on ragweed and yf1004, will deploy on all other squids if nothing bad happens for a while.
13:34 brion: rotated dberror.log, was too big for 32-bit boxen
wow this sucks, we really should replace the logging infrastructure

November 3

09:44 brion: upgraded leuksman.com to php 5.2.0 final release
06:28 Tim: restarted mwsearchd on maurus, disabled squid
05:42 Kyle: The APC is enabled and has anthony, bayle, isidore, and yongle on it.
05:35 Tim: traced unusual disk activity on srv38 back to the DeleteAlbertMailerDaemon job, in OTRS's GenericAgent. Changed the job to delete bounce messages which have arrived in the last hour, rather than doing a search of all 370,000 tickets.

November 2

16:59 brion: disallowed all mailing list archives from robots.txt now
16:46 brion: got mailman-htdig working
15:00 mark: Created temporary channel #wikimedia-tech on irc.wikimedia.org. See you there?
14:23 mark: Deflecting some traffic from knams to pmtpa
14:20ish brion: upgrading mailman for htdig search
14:19 mark: Freenode is under DDoS
14:00 jeluf: irc.freenode.org does not resolve any longer. The cname points to chat.freenode.net, which gets a *** Can't find chat.freenode.net: No answer reply on nslookup
09:43 brion: applying ipblocks schema updates
08:45 brion: rebuilt wikifr-l archives to suppress some messages due to a problem; unfortunately the numbering got thrown off by something much earlier in the archives, possibly the old 'from' bug. oh wells

November 1

21:43 river: scap broke blocking since db changes weren't applied, reverted PHP files from r17355
17:05 Tim: back to 3, small cluster couldn't handle it
16:10 Tim: back to 2 partitions. If the small servers can serve requests in ~200ms by hitting the disk, we may as well let them. srv38 and 39 will be better utilised by enwiki, which needs more CPU power allocated to it. We just have to be careful that the disk I/O on the small servers doesn't become saturated.
15:37 Tim: split search nodes into 3 partitions instead of 2.
15:22 Tim: sending dewiki search requests back to the "big" pool
~15:10 Tim: moved srv38 and srv39 to the search pool
14:20 Tim: started search index rebuild for all wikis
14:15 Tim: Inserted bfr (from /home/wikipedia/src/bfr) into the pipe in search index rebuilds. It seems to improve performance, by ensuring that MWSearchTool does not stall waiting for dumpBackup.php.
13:40 Tim: took srv37 out of apache rotation for lucene stuff
05:11 Kyle: srv144 has bad ram, will RMA.

October 31

16:55 brion: redirected sep11 to sep11memories.org
11:51 Tim: added "umask 002" to JeLuF's .bashrc
09:12 Tim: noticed that srv61 and srv67 are down, memcached instances with them. Brought in the spares.
04:45 Tim: set up srv145-149
04:26 Tim: srv144 crashed
03:57 Tim: setting up srv126,srv138,srv141,srv143,srv144,srv145
03:45 Tim: installed ganglia on srv121-145
03:36 Tim: set up apache on srv122

October 30

20:29 Kyle: srv146 - srv149 are available.
16:24 Tim: fixed ganglia
15:37 brion & mark: trying to fix ganglia, still borked
15:27 mark: Started Apache on zwinger
04:24 Tim: added bart to the trusted XFF list

October 29

16:24 Tim: locked sep11.wikipedia.org at Erik's request

October 28

14:45 Tim: removed dkwiki from all.dblist, old alias for da

October 27

14:36 brion: adding redirect & querycachetwo tables, not yet populated
05:04 Kyle: configured ipmi on srv121?? Maybe? I'm not sure how to test it.
04:56 Kyle: srv39 was off, I don't know why. I turned it on. Also a bunch of unreachable srv's were fixed. (Of the newest batch)
04:27 Kyle: sq1, and sq3 have Ubuntu Edgy. But need a password.
04:04 Kyle: Accidentally rebooted zwinger! Sorry!
03:37 Kyle: Replaced power supply in sq11, it's back up.

October 26

21:15 mark: Reinstalled yf1010 with Ubuntu Edgy, instead of Dapper. Install went ok, but needs a few more tweaks to the preseeding files to make it fully automatic again.
19:00 brion: tightened down friedrich, nfs /home no longer mounted

October 25

23:08 mark, kyle: Swapped sq11's mainboard, reinstalled it and brought it up as an upload squid
16:47 Tim: installed FSS on new apaches, added to install-modules51
16:40 Tim: running rebuildMessages.php
15:25 Tim: Set system-wide default for ssh ConnectTimeout to 5 seconds, on zwinger
14:30 Tim: finished user table schema changes
~13:30 Tim: switched masters to db2 and samuel.
12:52 mark: Fuzheado says PMTPA is blocked in China. Updated the GeoIP maps to make sure as many Chinese IPs resolve to yaseo
08:00 jeluf: Updated apaches 121-145, added to the pool, fixed startup scripts (use of eth2 instead of eth0/1). Still broken: srv122, srv126, srv136, srv138, srv141, srv145.
04:00 Kyle: New apaches were down because of poor power distribution. It's fixed now and they are back up.
03:19 Tim: starting user table schema changes

October 24

12:34 mark: Deployed a newer PyBal on pascal, avicenna, alrazi and yf1018
11:14 brion: wikibugs bot wasn't running; restarted it on goeje and added run-wikibugs to rc.local
06:23 brion: restarted postfix on leuksman.com; svn mails were stalled
~06:00 Tim: installed Dancer's dsh as ddsh on zwinger, changed scap and sync-file to use it. It shares perl dsh's node group files, via a symlink.

October 23

19:00 jeluf: added srv121 and srv123-srv134 to the farm. srv122 and srv135 are unreachable. srv136-145 died earlier during a "scap". I've no idea why.
16:13 mark: Users reporting image problems with IE in yaseo. Depooled dryas from the upload queue. What was it doing there and wtf wasn't it logged?
08:20 Tim: made scap faster by turning off "lazy backups" and using an rsync daemon on suda instead of cp -prfu over NFS. Set up scap to recompile and install texvc automatically.
07:18 Kyle: srv121-135 are available. srv143-145 fixed.

October 22

14:08 mark: Deployed a newer PyBal on pascal

October 21

18:00 jeluf, domas: installed apache&al on srv136-srv145. srv143 was already broken when we started, srv144 broke during the installation (had to reboot it, didn't come back)
17:00ish brion: hack-bumped the $wgStyleVersion again
17:55ish brion: tweaked mail servers on leuksman.com again
16:50ish brion: did a svn up & scap; there may be some css/js issues with the changes to section edit links. germans have broken js
16:15 brion: ldap is broken on srv144
15:24 brion: updated leuksman.com to PHP 5.2.0RC6
15:11 brion: disabling disused MWBlocker extension include; new boxen we're not installing the PEAR xml-rpc anymore since we don't use it anymore and the install kept breaking
14:58 brion: removed ganglia port and interface options from mwsearch.conf, trying to see if these get through ganglia... manual from rabanus does go through using gmetric without the specifiers on the command line
09:04 jeluf: created otrs-de-l, otrs-it-l
~05:25 Tim: synced files on srv63, was out of date. Initialised srv103 as a memcached hot spare.

October 20

13:30 Domas: enabled holbach, lomaria, ixia with higher loads.
03:48 Kyle: srv136 - srv145 are available for service.
00:36 Tim: noticed that srv68 is down, memcached instance included. Brought the hot spare on srv119 into rotation.

October 19

10:31 Tim: updating fedora mirror
00:09 Kyle: srv136 available. (More soon)
00:09 Kyle: srv54, srv55, srv63, srv66 rebooted. Bad raid controllers.

October 18

23:12 Kyle, Mark: csw1 uplinked to csw5.
21:46 brion: upload.wm.o dead in pmtpa

October 17

16:29 mark: Started Mailman on goeje
16:00 mark: goeje back up after a PM reboot request.
15:50 mark: Users reporting loss of session data. mctest.php reported srv55 down, which indeed doesn't reply to ping. Replaced its memcached slot by srv62.
15:18 mark: Because goeje went down, srv1 couldn't resolve DNS, which brought the entire cluster into dismay (fun). Made srv1 forward to zwinger,goeje (in that order). Recursing DNS really needs to be fixed.
15:00 mark: goeje went down
13:45 mark: Converted sq14 and sq15 to upload squids
07:01 jeluf: set up srv6 as thumb server, serving de/thumb, taking load from anthony, which is only serving en/thumb now
06:39 brion: enabled AntiSpoof extension for active prevention as well as logging

October 16

19:52 jeluf: restarted mwsearchd on coronelli
18:40 jeluf: moved thumbs/en/ to anthony, which is now serving thumbs/en/ and /thumbs/de/. Set up another HTCPpurger in the second page of the screen session.
14:38 mark: Increased swap size per COSS cache_dir from 5000 to 8000 on sq12 and sq13... After 4 days they had only a 4% i/o wait.
14:30 mark: Disabled Squid cache digests, as I don't believe they work well in our very dynamic environment, and may actually decrease cache efficiency.
12:20 mark: Squids were set to deny HTCP CLR requests from the pmtpa internal subnet, so purging didn't work in pmtpa. Fixed.
03:54 brion: updated viewvc on leuksman.com to 1.0.4-dev
02:09 Tim: installed FastStringSearch (fss)
01:04 Tim: installed gmetricd on sq2-10
00:43 Tim: ran updateArticleCount.php on the new wikis, to correct for a previous bug in the same script.
00:14 mark: Reinstalled yf1000 - yf1004 with Ubuntu, set them up as text Squids. Taken yf1019 out of rotation.

October 15

23:00ish brion: mysterious spike in apache cpu usage and segfaults, haven't figured out cause yet. reverting recent changes to mw to test
21:30 mark: Reinstalled yf1005 - yf1009 with Ubuntu, set them up as upload squids. Set up LVS on yf1018, pointed upload.yaseo at it...
18:54 mark: Changed MediaWiki's HTCP purge method from 'NONE' to 'GET' to make Squid 2.6 purge again
18:40 mark: Built a new squid-2.6.4-2wm1 .deb with debug symbols and --enable-stacktrace, and installed it on sq15
17:00 mark: Lots of Ubuntu Squids (with COSS) crashed around the same time. Restarted them.
16:42 brion: added charset header on 404 page to fix utf-7 silliness
16:15 mark: Fixed NTP on amaryllis. Y! has blocked UDP port 123, so SNAT to a high port...
14:20 mark: Creating two separate Squid groups with distinct default origin servers and "special destinations": text for MediaWiki content from the Apaches, and upload for static content from Amane and the thumb servers. This allows us to tweak the two very different Squid groups much better. Each group has its own subdir under /h/w/conf/squid, along with a separate subdir with a backup of the old setup. Yaseo doesn't have its own upload group yet, but I hope to rectify that today.

October 14

23:30 mark: Installed Ubuntu on clematis, it's back up as a Squid
11:30 jeluf: migrated upload.wm.o/wikipedia/de/thumbs/ to anthony, migration of /wikipedia/en/thumbs/ still running.
08:24 brion: [4] was somehow stuck in sq21's cache as a 301 to wikimediafoundation.org. UDP multicast packets to purge it could be seen when using ?action=purge, but had no effect. manually sending a PURGE over port 80 cleared it successfully
07:35 brion: adjusted 'missing wiki' screen to send a 404 response instead of 200; should keep some transient errors out of caches more nicely
07:29 brion: adding wikimania2007.wm.o to dns, preparing for wiki setup
07:03 brion: recompiled utfnormal extension on benet against proper ICU headers *cough*, restarted dump thread 4
06:48 brion: recompiled utfnormal extension on benet w/o -fPIC, restarted dump thread 4
06:12 brion: started pmtpa data dumps
05:00 Kyle: New ram with srv74, lets see how it does.
04:48 brion: migrating some old dump data from benet to amane to make room for next dump run
04:50ish brion: unmounted broken khaldun mount from benet

October 12

18:00 mark, jeluf: added thumb server bacon. Serves upload.wikimedia.org/wikipedia/commons/thumb/[0-3]/*. Currently, the squid.conf is a live hack. The next deployment will break this again, unless squid.conf.php is fixed.
17:05 mark: Set originserver on all parent cache_peers in squid.conf. This makes Squid treat parents as origin content servers instead of proxy caches, and therefore enables Connection: keepalive and non-proxy GET requests.
15:10 mark: amane overloaded, tweaked its TCP settings a little more
07:39 Tim: secure.wikimedia.org back up, courtesy of mod_proxy.
07:00 jeluf: installed lighty on bacon, changed thumb handler to save images it got from the apaches to the FS. 0/* has been copied from bacon, 1/* currently running. Todo: HTCP listener to delete thumbs
05:57 Tim: disabled wiki stuff on secure.wikimedia.org temporarily, bart was overloaded. Will try to find a permanent solution involving proxying.
03:50 brion: started apache on leuksman.com, died again. :(
Set somaxconn = 1024 and tcp_max_syn_backlog = 4096 on the old image squids, and on amane.

October 11

23:40 mark: Made sq12 and sq13 image squids
22:30ish brion: a recently committed bug in ObjectCache caused the db to be used instead of memcached, grinding everything to a halt
19:30 jeluf: copying amane's wikipedia/commons/thumb/* to bacon:/export/upload/wikipedia/commons/thumb using rsync on bacon, bwlimit 500

October 10

23:00-* mark: Upgrading the new Squids sq12..sq30 to squid-2.6.4-1wm4 to enable COSS
19:40 mark: Set connect-timeout=5 on Squid backend requests
17:40 mark: Reduced amane's PHP processes from 64 to 32
17:30 mark: Upgraded amane's lighttpd to 1.4.13.
11:25 mark: Set up sq29 with COSS as well, though different settings than sq30, to compare.
11:00 mark: Started Squid on several of the new servers. Squid had disappeared...
11:00 mark: Set up sq30 with COSS filesystems, using devices /dev/sda6, /dev/sdb, /dev/sdc, /dev/sdd.
mark: Set up an Ubuntu Dapper mirror on khaldun
07:54 brion: took stats.wikimedia.org offline; contains private info, needs scrubbing

October 9

21:25 mark: Set 'refresh-pattern ignore-reload' on upload squids
21:03 brion: removed anthony from mediawiki-installation group
20:35ish brion: disabled FancyCaptcha; now using SimpleCaptcha. seems to be lighter on amane's NFS for now
20:15ish brion: restarted many pmtpa upload squids with high InActConn backed up in lvs
18:00 mark,kyle: Reinstalled khaldun as dedicated install server / archive mirror
18:00 kyle,jeluf: Rebooted holbach. After reboot, mysqld's error log shows duplicate key errors while replicating. Shut down mysqld.
03:27 brion: disabled obsolete firewall rules on maurus; was preventing rsyncing of search index updates, stopping the ex-yaseo wikis from being searchable

October 8

15:02 Tim: Doubled the memcached instance count. srv104-118 brought into service with srv119 spare.
08:41 Tim: Stepped clocks on sq1-8, which were off by 8 hours. This was messing up ganglia. In the process of fixing NTP.
03:45 Tim: zwinger's disks were very overloaded due to the PMTPA gmetad. The data size is only 120MB, but apparently it was syncing very often. I moved the rrds to a tmpfs with an hourly rsync to disk.
02:38 Tim: holbach is down, took it out of rotation
02:34 Tim: removing old static HTML dump backup on srv35
02:12 Tim: Fixed disk space exhaustion on coronelli. MWDaemon.log was to blame.
~02:00 Tim: installed gmetricd in various places. diskio_* metrics should now be available.

October 7

22:00 jeluf: restarted db's on ixia and db1, with help of domas. Running 4.0.27 on db1
19:30 jeluf: Shut down mysql on ixia, copying DB to db1
17:30 jeluf: rebooted sq1, disabled squid. Mark depooled it from the LB
15:30 (Squid on) sq1 is down and being odd again
13:00 jeluf: rebooted sq1
03:45 Tim: removed sq11 from LVS on avicenna manually, it was down again and pybal didn't remove it.
03:35ish - timeouts connecting to rr.pmtpa
03:20 Kyle: db1 is now up and ready to be setup.

October 6

20:15 mark: Brought sq1 back up. The reason PyBal didn't depool it last night, not even during a restart, was that PyBal was in dry run mode so that it prints ipvsadm commands but never actually executes them. Apparently it has been inactive for weeks. Sorry!
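
A sketch of the dry-run pattern described in the entry above, written so that the inactive mode is hard to miss. This is not PyBal's actual code: the function name, arguments and the injected `runner` are assumptions for illustration. The only real command used is `ipvsadm -d -t <service> -r <server>`, which deletes a real server from a TCP virtual service.

```python
import subprocess


def depool(service, server, dry_run=True, runner=subprocess.check_call):
    """Remove `server` from the LVS `service`, or only print the command.

    Returns True when the command was actually run, False in dry-run mode,
    so callers and monitoring can tell the two cases apart.
    """
    cmd = ["ipvsadm", "-d", "-t", service, "-r", server]
    if dry_run:
        # Say loudly that nothing was executed.
        print("DRY RUN, not executing:", " ".join(cmd))
        return False
    runner(cmd)
    return True
```

Returning a distinct value for the dry-run case is one way to avoid a balancer that looks healthy while never touching the pool.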
19:00 jeluf: unmounted khaldun:/usr/local/upload on all apaches, removed from fstab
17:15 mark: Set up imap.wikimedia.org (which points to my private colocated server) as a temporary solution. Various @wikimedia.org aliases will be redirected.
17:15 jeluf: restarted apache on bart. Nagios and OTRS were not responding
01:33 brion: sq1 switchport reenabled; still hasn't fully shut down.
01:20ish tim manually removed sq1 from lvs; pybal wasn't removing it for unknown reason
01:07 brion: rebooting sq1, still haven't figured out wtf is wrong
00:55 brion: removed sq1 from pybal list while trying to kill its mad squid
00:50 brion: restarting squid on sq1; insane load (30+), not responding
00:44 brion: something wrong with upload.wikimedia.org; investigating. trouble connecting to pybal on alrazi; is it a problem with pybal or backends?

October 5

21:32 brion: resyncing srv11 common files; all were missing!
21:27 brion: wiped old copy of fundraising report scripts w/ redirect to new location
19:30 mark: Set up ingress filtering on port e8/1 and e8/2 of csw5-pmtpa
Tim: Set up ganglia 3.0.3, more or less starting from scratch with the configuration. We now have a hierarchical arrangement of grids, with knams and pmtpa in the system at present, yaseo will perhaps follow later if we can get the ACLs set up.
01:23 Tim: fixed replication on srv75. It's a read-only cluster so it's not critical. Had to skip some deleted binlogs, they were probably empty anyway. MAX(blob_id) looks fine.

October 4

23:43 brion: starting search rebuilds for ex-yaseo wikis on maurus
22:30 mark: Moved the console server to csw5-pmtpa and Wikia's network, so we have out of band access. Also moved the last bunch of machines off csw1-pmtpa.
22:00 jeluf, kyle: hot-replaced amane's faulty drives, started rebuilding the RAID.
21:26 jeluf: gzipped binlogs 1 and 2 on adler.
17:40 jeluf: rebooted srv98,srv93,srv87,srv109 since their apaches locked up a few minutes after being restarted
~13:15 Tim: albert was hanging, smtpd down. Mark's reboot -f attempts weren't working, so I did echo b > /proc/sysrq-trigger which did the trick. Came up without the right VIPs, I fixed it temporarily, Mark will fix it permanently.
09:20 Tim: postfix on albert had been broken since 23:56, restarted.

October 3

22:30 mark: Installed Ubuntu on yf1005 (used it as a testing host)
15:25 Tim: Deployed new external storage: srv87-89 as cluster8 and srv90-92 as cluster9
12:14 mark: Deployed sq21..30 as text squids to see if brute power solves the TCP open problem.
11:53 mark: zwinger is not letting me log in. Stalls after "entering interactive session."

October 2

18:50ish brion: set up www.wikibooks.org portal
15:42 brion: disabling writes to cluster6; it's overloaded
15:15ish overload on ES
14:40 Tim: srv54 went down, replaced its memcached instance with srv68
12:40 mark: Made zwinger external only by disabling eth1 and changing the default gateway to 66.230.200.10.
03:00-07:00 kyle, tim, jeluf, river: suda broke, zwinger broke. rebooted suda, moved zwinger's dns resolver to goeje (temporary only)

October 1

13:00-20:00 mark, bw, river: moved uplinks over to csw5, set up BGP and began advertising our network. brief downtime due to router breaking.
13:51 Tim: Fixed uploads on new wikipedias. I also fixed the absence of a spoofuser table, earlier today.