19:08 logmsgbot: kaldari Started syncing Wikimedia installation... :
18:47 hashar: the internal change to CommonSettings.php caused a lack of stylesheet for less than a minute on most wikis. I did test on test.wikipedia.org and beta project, but there must be a logic error somewhere that messes with the prod projects. Revert changes have been sent out in gerrit and merged in master.
18:35 hashar: so the nicely reviewed changes broke the enwiki stylesheets :/ reverted change :-(((
20:55 RobH: db1003 back online, replaced mgmt cable and mgmt is working now as well
20:51 LeslieCarr: rebooting srv266 as it is unresponsive
20:44 RobH: db1003 mgmt issue due to bad cable, system booting back up, replacing mgmt cable
20:35 RobH: clean mysql shutdown, db1003 now offline
20:33 RobH: db1003 mgmt is not responsive, I need to remove power and reboot. confirmed with asher this is an s3 slave and can do a short downtime without issues
20:31 logmsgbot: maxsem synchronized php-1.20wmf5/extensions/MobileFrontend/ 'MF fixes and logging'
20:29 logmsgbot: maxsem synchronized php-1.20wmf6/extensions/MobileFrontend/ 'MF fixes and logging'
19:56 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'disable LastModified and LastModified/E3Experiment'
19:03 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: meta back to wmf6, not cause of translate issues
18:53 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: closed to 1.20wmf6
18:51 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikimedia wikis to 1.20wmf6
18:48 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wiktionary and wikiversity to 1.20wmf6
18:47 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikisource and wikiquote to 1.20wmf6
18:47 Jeff_Green: added several mobile hostnames to DNS for RT #2996
18:46 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews and wikibooks to 1.20wmf6
18:44 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moved metawiki back to 1.20wmf5
18:41 K4-713: synchronized payments cluster to fundraising/1.20 de0256084a
18:33 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moved special wikis to php-1.20wmf6
18:29 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moved en(wikibooks|wikinews|wikiquote|wikisource|wikiversity|wiktionary) to 1.20wmf6
18:05 RobH: cp1017 memory replaced
17:52 RobH: cp1017 is offline due to memory error. replacement memory on site, pulling system for swap
17:46 logmsgbot: preilly synchronized php-1.20wmf6/extensions/ZeroRatedMobileAccess 'update for landing page'
17:46 logmsgbot: asher synchronized wmf-config/mc.php 'disabling wgMemCachedPersistent; lowering wgMemCachedTimeout to 2x client default from 30x default'
17:45 logmsgbot: preilly synchronized php-1.20wmf5/extensions/ZeroRatedMobileAccess 'update for landing page'
17:44 logmsgbot: preilly synchronized php-1.20wmf4/extensions/ZeroRatedMobileAccess 'update for landing page'
17:19 LeslieCarr: restarting apache2 on srv258
17:11 maplebed: powercycled srv270
17:11 mutante: powercycling srv277 (had to, frozen console)
02:25 logmsgbot: LocalisationUpdate completed (1.20wmf5) at Sun Jun 24 02:25:23 UTC 2012
June 23
13:28 apergos: powercycling srv288, swap death etc, some message to mgmt console but only the timestamp so couldn't see the issue, also couldn't get past the login prompt
02:24 logmsgbot: LocalisationUpdate completed (1.20wmf5) at Sat Jun 23 02:24:11 UTC 2012
June 22
23:11 LeslieCarr: restarted apache on srv278
22:23 binasher: stopping mysql on es3, reseeding slave via innodb hotbackup of es1004
04:35 Tim: on nickel: there were data sources for both "Apaches 8 CPU" and "Application servers", these were getting the same cluster name from the remote gmonds, and so different threads in gmetad were trying to write to the same summary files. Fixed temporarily, will fix in puppet shortly
04:26 Tim: on nickel: ran gmetad with -d3, it spews errors when trying to write to the faulty summary info files
04:20 Tim: on nickel: restarting gmetad
04:19 Tim: on srv258: started gmond
04:12 Tim: experimentally stopping gmond on srv258 to check for effects on oscillating appserver stats
03:23 Tim: on fenari, queueing refreshLinks jobs for some 2.8M commons image description pages that use location templates
02:48 logmsgbot: LocalisationUpdate completed (1.20wmf4) at Wed Jun 20 02:48:41 UTC 2012
02:26 logmsgbot: LocalisationUpdate completed (1.20wmf5) at Wed Jun 20 02:26:26 UTC 2012
02:24 Tim: started socat for /var/log/mw/fatal.log on fenari
18:15 logmsgbot: reedy synchronized wikiversions.cdb 'sync using sync-file'
18:08 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.20wmf5
16:30 mutante: installing package upgrades on sodium
16:29 mutante: restarting lighttpd on sodium - redirecting mediawiki-cvs list page
16:05 binasher: rebooting es1001
15:59 mutante: there have been no archives, so that should be it. there may be another issue in BZ 37690 but should be unchanged by renaming
15:58 mutante: copied full config/users/passes from mediawiki-cvs to mediawiki-commits, merged redirects, added old list name to acceptable_aliases in recipient filters
15:52 mutante: making the mailing list switch. mediawiki-cvs -> mediawiki-commits
15:49 Ryan_Lane: assigned service IPs for labs-ns0/labs-ns1
15:25 binasher: rebooting es1002 and es1003
15:11 Ryan_Lane: added virt1000 as a secondary ldap server for labsconsole
15:08 Ryan_Lane: testing gerrit config with multiple ldap servers
14:52 hashar: hume is out of disk space again. Probably the wmf branches taking toooo much space
20:28 RoanKattouw: Correction: the /usr/local/apache filesystem is full on hume, the root fs is not
20:27 RoanKattouw: hume has a full disk
20:23 RoanKattouw: Fixed ownership of php-1.20wmf{4,5}/cache/l10n , should be l10nupdate:wikidev . The wmf4 copy had wrong ownership causing rebuildLocalisationCache.php to fail for shell users (e.g. from scap)
11:28 mutante: powercycling downed srv232 (also cause for check_all_memcached crit)
11:08 mutante: powercycled mw1042 to check for hardware issues and fscked. appears to be just unused (though down since ~3d like mw1071 per nagios)
10:37 mutante: test to show linking from !log via SAL to RT: RT:3100 (before/without template)
10:31 hashar: incubatorwiki.translate_messageindex on db39 uses MyISAM engine. See RT #3100
10:25 logmsgbot_: hashar synchronized wmf-config/InitialiseSettings.php 'bug 37391 , take 2 - Install Translate extension on be.wikimedia.org'
10:24 logmsgbot_: hashar synchronized wmf-config/InitialiseSettings.php 'bug 37391 , take 2 - Install Translate extension on be.wikimedia.org'
10:23 hashar: Compared translate% tables schema on bewikimedia with incubatorwiki. diff proves they are the same, so the schema changes made earlier were successful.
10:17 hashar: bewikimedia (db39) : dropped tables translate_tmf , translate_tms and translate_tmt that I had incorrectly added
09:59 logmsgbot_: hashar synchronized wmf-config/InitialiseSettings.php 'revert translate extension on be.wikimedia.org, need DB update'
06:24 apergos: db1047 looks like the aft_article_filter_count is missing a few rows compared to the master (after replication caught up), presumably this is a side effect of the repair, have pinged binasher for help, leaving everything running and hope it's tolerable error for a day
02:34 logmsgbot_: LocalisationUpdate completed (1.20wmf5) at Tue Jun 12 02:34:49 UTC 2012
02:25 logmsgbot_: LocalisationUpdate completed (1.20wmf4) at Tue Jun 12 02:25:21 UTC 2012
01:35 binasher: passes the dba mantle to notpeter
01:22 notpeter: removing one slave from each db shard to upgrade/restart
00:24 binasher: shutdown mysql on es3. stopped slaving on es1002, rsyncing cluster23 tables to es3
00:09 binasher: pointed es3 to MASTER_LOG_FILE='es1-bin.000788', MASTER_LOG_POS=453509865
00:05 binasher: es3:~# rm -rf /usr/local/mysql*
June 11
23:54 logmsgbot_: asher synchronized wmf-config/db.php 'fully commenting out es3'
23:52 logmsgbot_: asher synchronized wmf-config/db.php 'making es1 the master for blobs cluster 23'
23:51 binasher: es1 is the new master, now switching mw conf
23:48 binasher: preparing to switch es master to es1
15:46 mutante: hume /usr/local/apache is out of disk (just 5GB but more branches now). (LVM vg "tank" lv "tank-apache" ) but no free extents. could take from /archive but unsure about shrinking the xfs.
15:35 logmsgbot_: reedy Started syncing Wikimedia installation... : Rebuild messagecache for 1.20wmf5
15:30 logmsgbot_: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: test2wiki to 1.20wmf5
17:09 mark: Added uRPF exception for 6to4 traffic on all routers
17:05 jeremyb: (UTC) 23:42:14 <binasher> !log re-enabled es4 monitoring. its currently our only es server without any tables marked as crashed / needing recovery, myisam recovery has been absent for all systems since the ms servers were migrated off of in nov 2011. (Sum of human knowledge * Renyi entropy = ES)
16:52 mark: Pooled ssl1001
15:44 logmsgbot_: asher synchronized wmf-config/db.php 'returning es2 to service'
15:25 paravoid: rebooting lvs1004 and reinstalling with precise
15:17 binasher: rebooting es2 for kernel + mysql upgrade
15:16 logmsgbot_: asher synchronized wmf-config/db.php 'pulling es2 for kernel+mysql upgrades'
14:56 paravoid: rebooting amslvs3 & amslvs4 to reinstall with precise
14:20 paravoid: rebooting lvs1006 to reinstall with precise
13:51 cmjohnson1: shutting down bellin to replace main board
13:49 notpeter: reimaging db1042
13:40 paravoid: rebooting lvs1005 to reinstall with precise
12:59 paravoid: rebooting lvs2 to reinstall with precise
12:47 Ryan_Lane: changing capella's subnet in DNS
12:10 Ryan_Lane: rebuilding capella as precise
10:01 logmsgbot_: asher synchronized wmf-config/db.php 'putting es1 in production'
09:53 notpeter: cancel that, it's mid-cron. will do later
09:52 notpeter: stopping indexing on searchidx1001 to rsync to searchidx2
09:35 binasher: rebooting es1 for kernel+mysql upgrade. don't need to pull from db.php because it was never correctly added or queried?
09:14 mark: Built PyBal 1.01 for precise, and included it in the precise-wikimedia APT repository
08:45 binasher: restarted mysql on es1004 with default innodb file format as barracuda
08:31 notpeter: reimaging searchidx2
02:36 logmsgbot_: LocalisationUpdate completed (1.20wmf3) at Tue Jun 5 02:36:44 UTC 2012
02:14 logmsgbot_: LocalisationUpdate completed (1.20wmf4) at Tue Jun 5 02:14:17 UTC 2012
19:32 Ryan_Lane: force running puppet on ssl servers
18:22 Reedy: Nuked php-1.20wmf4 on mw64 then ran sync-common. Seems to have dealt with the permission errors
18:11 logmsgbot_: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.20wmf4
16:48 binasher: upgraded kernel on db1047 / analytics
16:09 Ryan_Lane: restarting ircecho on manganese
15:21 paravoid: reinstalling lvs1 with precise
15:13 mark: Added new IPv6 LVS prefixes to all routers for uRPF filters; BGP import filters still need adjusting for dual-family sessions
15:08 cmjohnson1: physically power cycling lvs1
15:02 Ryan_Lane: depooling ssl1001 and ssl3001
14:55 Ryan_Lane: disabling puppet on all ssl hosts
13:27 mark: Changed upload.esams.wikimedia.org CNAME to upload-lb.esams, effectively disabling the IPv6 selective answer script
12:23 mark: Upgrading wikimedia-lvs-realserver to version 0.08 across the cluster (by Puppet)
12:18 Ryan_Lane: depooling ssl1
11:32 mark: Copied wikimedia-lvs-realserver 0.08 from APT distribution precise-wikimedia to lucid-wikimedia
02:38 logmsgbot: LocalisationUpdate completed (1.20wmf3) at Mon Jun 4 02:38:15 UTC 2012
02:14 logmsgbot: LocalisationUpdate completed (1.20wmf4) at Mon Jun 4 02:14:45 UTC 2012
June 3
15:45 paravoid: aborting lvs1 install, partition map is not ready; putting it back to production as-is
15:31 paravoid: reinstalling lvs1 with precise
15:10 RobH: torrus failed to refresh via puppet (failed refresh takes too long) so manually running the refresh/rebuild command as puppet copied the updates to the system
14:55 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 37237 - Change Wikisource namespace for Tamil wikisource'
13:10 logmsgbot: hashar synchronized wmf-config/InitialiseSettings.php '59753a9 (allow bureaucrats on frwiki to add+remove accountcreator group'
13:09 logmsgbot: hashar synchronized wmf-config/CommonSettings.php 'Commits: 6bef518 (wgHTCPMulticast only used on production cluster) and 882dd69 (wgLoadScript only used on production) -- was not correctly deployed earlier'
10:18 hashar: mw64: rsync: write failed on "/apache/common-local/wmf-config/CommonSettings.php": No space left on device (28)
10:17 logmsgbot: hashar synchronized wmf-config/CommonSettings.php 'Commits: 6bef518 (wgHTCPMulticast only used on production cluster) and 882dd69 (wgLoadScript only used on production)'
09:16 notpeter: pushing new zone files. only minor changes
02:35 logmsgbot: LocalisationUpdate completed (1.20wmf3) at Sun Jun 3 02:35:31 UTC 2012
02:13 logmsgbot: LocalisationUpdate completed (1.20wmf4) at Sun Jun 3 02:13:39 UTC 2012
18:00 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Enable random root page on testwiki'
17:58 logmsgbot: reedy synchronized wmf-config/CommonSettings.php 'Add enabling code for randomrootpage'
17:53 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Add setting for random root page'
17:47 LeslieCarr: cleared mobile varnish cache
17:45 ssmollett: ganglia uploaded backported ganglia 3.3.5 deb package to precise-wikimedia repository
17:40 logmsgbot: aaron synchronized php-1.20wmf4/extensions/PageTriage 'Switched to wmf4 extension branch to get 0be1787634613a36439b760d6d5f0639724f8a7b'
16:06 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'subpages for frwikibooks'
12:00 mutante: restarting pdns on ns2
11:41 mutante: running authdns-update to push analytics1011 to 1022 entries
16:17 hashar: /usr/local/apache/common-local is 4G whereas / is 7G on srv187. Looks like deploying wmf2 + wmf3 + wmf4 will require partitions to be resized.
16:10 hashar: srv187 and srv188 are out of disk space
15:44 logmsgbot: reedy synchronizing Wikimedia installation... : test2wiki to 1.20wmf4 to build localisation cache
14:39 logmsgbot: reedy synchronizing Wikimedia installation... : Does running scap on its own with no wikis on that version build l10n for it? I suspect not...
23:32 logmsgbot: awjrichards synchronized wmf-config/CommonSettings.php 'Updating technical feedback email address for mobile feedback'
23:17 logmsgbot: awjrichards synchronizing Wikimedia installation... : Picking up changes to hide feedback form to prevent spamming of mobile feedback page - f6ed8ba
23:12 logmsgbot: awjrichards synchronized wmf-config/CommonSettings.php 'Enabling 'technical feedback' link on mobile feedback form to disable feedback form'
23:00 binasher: stopped replication on es1002 in order to rsync cluster23 to es1003
02:43 logmsgbot: LocalisationUpdate completed (1.20wmf2) at Sun May 20 02:43:08 UTC 2012
02:21 logmsgbot: LocalisationUpdate completed (1.20wmf3) at Sun May 20 02:21:20 UTC 2012
May 19
21:07 cmjohnson1: shutting down storage3 to replace RAID controller card
18:37 logmsgbot: awjrichards synchronized wmf-config/InitialiseSettings.php 'Disabling PageTriage extension on enwiki per request from Kaldari, due to bug 36968'
02:43 logmsgbot: LocalisationUpdate completed (1.20wmf2) at Sat May 19 02:43:56 UTC 2012
02:22 logmsgbot: LocalisationUpdate completed (1.20wmf3) at Sat May 19 02:22:21 UTC 2012
May 18
23:27 logmsgbot: aaron synchronized wmf-config/CommonSettings.php 'Set $wgSiteStatsAsyncFactor=1 on commonswiki.'
13:44 cmjohnson1: shutting down bellin for troubleshooting
09:04 hashar: Site outage was due to our custom wfLogXFF() which uses wfErrorLog(). $wmfUdp2logDest not being global there, caused exception to be shown.
08:59 hashar: Broke the cluster by having an invalid global set
06:50 hashar: WMFLabs dying out, I/O latency rose constantly over the last 2 hours and eventually led to a situation where the system (via ssh) is not usable anymore
03:41 logmsgbot: asher synchronized wmf-config/db.php 'returning db12 and db46'
02:48 logmsgbot: LocalisationUpdate completed (1.20wmf2) at Thu May 17 02:48:02 UTC 2012
02:22 logmsgbot: LocalisationUpdate completed (1.20wmf3) at Thu May 17 02:22:02 UTC 2012
19:14 Reedy: manually ran ddsh -cM -g mediawiki-installation -o -oSetupTimeout=30 -F30 "sudo -u mwdeploy rsync -a 10.0.5.8::common/*.dblist /usr/local/apache/common-local" because sync-dblist is woefully out of date..
19:13 notpeter: restarting ganglia on nickel
19:09 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: 12 more misc/wikimedia wikis to 1.20wmf3
18:59 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All closed wikis to 1.20wmf3
18:55 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All special wikis to 1.20wmf3
18:54 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All wikimedia wikis to 1.20wmf3
18:52 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All wikisource to 1.20wmf3
18:50 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All wikiquote to 1.20wmf3
18:48 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All wikiversity to 1.20wmf3
18:45 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All wikibooks to 1.20wmf3
18:42 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All wiktionaries to 1.20wmf3
18:40 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All wikinews to 1.20wmf3
18:31 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: All non wikipedia en projects to 1.20wmf3
18:27 binasher: running recentchanges.rc_ip (ipv6) schema migration on s3 dbs via osc (s4 already completed during prior testing)
18:25 mutante: synced wikiversions.* files from NFS to spence local to prevent death of check_job_queue monitoring
18:21 binasher: running recentchanges.rc_ip (ipv6) schema migration on s5 dbs via osc
18:19 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: commonswiki to 1.20wmf3, again
18:13 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: commonswiki back to 1.20wmf2
18:10 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: commonswiki to 1.20wmf3
18:10 binasher: running recentchanges.rc_ip (ipv6) schema migration on all s6 dbs via osc
18:03 binasher: running recentchanges.rc_ip (ipv6) schema migration on all s7 dbs via osc
17:43 binasher: ipblocks migration completed for all wikis
17:38 binasher: running ipblocks schema migration on all s2 dbs via osc
17:35 logmsgbot: awjrichards synchronized php-1.20wmf2/extensions/MobileFrontend 'Picking up fix for fatal in api in MobileFrontend at 9936e7a'
17:34 logmsgbot: awjrichards synchronized php-1.20wmf3/extensions/MobileFrontend 'Picking up fix for fatal in api in MobileFrontend at 9936e7a'
17:16 maplebed: deploying change to swift to make which containers write thumbs configurable
17:11 logmsgbot: preilly synchronized php-1.20wmf3/extensions/MobileFrontend/ 'zero and mobile changes'
17:10 logmsgbot: preilly synchronized php-1.20wmf2/extensions/MobileFrontend/ 'zero and mobile changes'
17:08 RobH: aluminum back online
17:00 binasher: running ipblocks schema migration on all s3 (819) dbs via osc
16:59 binasher: running ipblocks schema migration on all s4 dbs via osc
16:58 binasher: running ipblocks schema migration on s5/dewiki via osc
16:57 RobH: aluminum shut down for hard disk additions
16:56 binasher: running ipblocks schema migration on all s6 dbs via osc
16:51 RobH: updating dns for osm web servers
16:50 binasher: running ipblocks schema migration on all s7 dbs via osc
16:49 logmsgbot: preilly synchronized php-1.20wmf2/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing'
16:29 maplebed: deploying gerrit change 7798 to the mobile varnish servers
07:41 logmsgbot: raindrift synchronized php-1.20wmf3/extensions/PageTriage/api/ApiPageTriageTemplate.php 'fixing exception bug that makes lots of logspam'
07:41 logmsgbot: raindrift synchronized php-1.20wmf2/extensions/PageTriage/api/ApiPageTriageTemplate.php 'fixing exception bug that makes lots of logspam'
06:20 Ryan_Lane: restarted lucene on search1015
05:58 Tim: setting net.ipv4.tcp_tw_recycle=1 on cp1005 seems to have fixed it, doing it on cp1004 as well now
05:52 Tim: on cp1005 setting tcp_tw_recycle=1
05:29 Tim: experimentally started squid on cp1004
04:05 hashar: updating a few plugins on Jenkins (host: gallium )
03:34 Ryan_Lane: stopped the squid process on cp1004 and stopped puppet to avoid it being restarted. it's having issues and I can't debug it right now.
03:22 Ryan_Lane: repooling squid frontend on cp1004
03:14 Ryan_Lane: depooling cp1004 and stopping the squid backend service to let some connections close
02:43 logmsgbot: LocalisationUpdate completed (1.20wmf3) at Wed May 16 02:43:51 UTC 2012
17:35 logmsgbot: aaron synchronized wmf-config/swift.php 'Use new thumb purge hook for testwikis'
16:54 RobHalsell: updated apache config for wiki-pedia.org, seems the bot doesn't spam that anymore =[
16:36 mutante: srv app servers max. uptime with older kernel down to ~120 days after another bunch of upgrades
16:34 RobHalsell: updating dns for wiki-pedia.org
12:20 hashar: deployment-prep replaced most occurrences of /mnt/upload to /mnt/upload6
10:37 apergos: on db39 dropped triggers pt_osc_elwiki_recentchanges ins, del, upd, they were preventing all elwiki edits except bot edits with the complaint Table 'elwiki._recentchanges_new' doesn't exist ... binasher, doublecheck me please?
09:24 mutante: srv278 - still has issues as in reopened RT #24 - upgrading kernel anyways
03:05 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'update wgUploadNavigationUrl on all cs wikis'
02:35 logmsgbot: LocalisationUpdate completed (1.20wmf3) at Tue May 15 02:35:53 UTC 2012
02:23 logmsgbot: LocalisationUpdate completed (1.20wmf2) at Tue May 15 02:23:47 UTC 2012
01:09 logmsgbot: asher synchronized wmf-config/db.php 'returning db31 as an s4 slave'
17:46 maplebed: deleted wikipedia-de-local-thumb container from swift. the sharded version is currently being used.
15:33 mutante: adding DNS entries for analytics hosts in new vlan 1121 (10.64.21.0/24), hosts starting at .101 to match names analytics1001 = .101 and ++
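The naming rule in the entry above (analytics1001 = .101 "and ++" within 10.64.21.0/24) can be sketched as a small mapping. This is only an illustration: the function name `analytics_ip` is hypothetical, and the assumption that every later host simply continues the same offset (last octet = host number minus 900) is extrapolated from the single example given in the log.

```python
import ipaddress
import re

# Subnet taken from the log entry (vlan 1121).
SUBNET = ipaddress.ip_network("10.64.21.0/24")


def analytics_ip(hostname):
    """Return the address implied by the 'analytics1001 = .101' convention.

    Assumption: the last octet is the numeric suffix minus 900, so
    analytics1001 -> .101, analytics1002 -> .102, and so on.
    """
    match = re.fullmatch(r"analytics(\d{4})", hostname)
    if match is None:
        raise ValueError(f"not an analytics host name: {hostname!r}")
    offset = int(match.group(1)) - 900
    if not 1 <= offset <= 254:
        raise ValueError(f"{hostname!r} does not fit in {SUBNET}")
    # IPv4Address supports integer addition.
    return str(SUBNET.network_address + offset)


if __name__ == "__main__":
    for name in ("analytics1001", "analytics1011", "analytics1022"):
        print(name, analytics_ip(name))
```

Keeping the host number visible in the last octet makes it possible to go from name to address (and back) without a lookup, which seems to be the intent of the "to match names" remark.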
15:03 mutante: mw62 -unless somebody was on that right now it died. mgmt also just Create Instance Error
14:06 mutante: kernel upgrading / rebooting srv servers where uptime > 200 d order by uptime desc limit 1
13:12 mutante: installing package upgrades on pdf1-3 (and installed requested indic fonts via new puppet role class)
11:39 mutante: starting ms-be swift-container-auditors every once in a while
11:35 mutante: stat1 - installed new kernel, but waiting to reboot. schedule with aotto
11:24 mutante: upgrading packages/kernel on hooper, rebooting (Blog,Etherpad,Racktables)
09:21 mutante: ekrem was close to running out of disk again. logrotated apache logs, changed config to: size 512M,rotate 3
08:58 mutante: package upgrades on ekrem (IRC server, WAP, Apple dict...)
08:51 mutante: rebooting marmontel (blog)
08:48 mutante: upgrading apache/mysql/kernel on marmontel (blog)
02:20 logmsgbot: LocalisationUpdate completed (1.20wmf2) at Fri May 11 02:20:39 UTC 2012
02:00 RoanKattouw: Started Apache back up on srv200, done debugging
23:33 notpeter: taking down search20 to do precise test-install
23:26 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Enable Translate on outreachwiki'
23:25 Reedy: Created Translate tables on outreachwiki
22:49 Reedy: ExtensionDistributor fixed
22:32 Reedy: Debugging ExtensionDistributor being broken. Likely to show more debug output on mw.org if you attempt to use it (though, it wouldn't give you what you wanted anyway)
18:25 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: 295 other wikipedias over to 1.20wmf2
18:20 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: jawiki to 1.20wmf2
18:16 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: ruwiki to 1.20wmf2
18:12 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: frwiki to 1.20wmf2
18:11 notpeter: turning db30 back on
18:07 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: dewiki to 1.20wmf2
17:51 cmjohnson1: shutting down storage3
16:58 LeslieCarr: restarted mobile varnish instances
16:58 LeslieCarr: flushed mobile varnish cache
16:54 logmsgbot: aaron synchronized wmf-config/CommonSettings.php 'Make sure Swift backend will have journaling too.'
16:31 logmsgbot: aaron synchronized wmf-config/CommonSettings.php 'Removed backend config conditional now that everything was switched over.'
14:06 mutante: started container-auditor on ms-be1
09:24 mutante: started container-auditor on ms-be3 and 4
02:37 logmsgbot: LocalisationUpdate completed (1.20wmf1) at Wed May 9 02:37:02 UTC 2012
02:19 Reedy: Running cleanupUploadStash.php over all wikis
02:13 logmsgbot: LocalisationUpdate completed (1.20wmf2) at Wed May 9 02:13:10 UTC 2012
01:42 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 36506 - Site logo for Tsonga Wikipedia -- ts.wikipedia.org'
01:39 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 36522 - Upload link should lead to UploadWizard instead of commons:Special:Upload'
01:36 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 36663 - Please allow bureaucrats to add and remove autoreviewer status on pt.wiki'
01:30 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'wgShowUpdatedMarker enabled on anything that isn't enwiki or dewiki'
01:27 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 36533 - Set sitename to Telugu Wiktionary'
22:53 RoanKattouw: Actually fixed it now with chmod -R g+w /h/w/conf/httpd
22:47 RoanKattouw: Fixed permissions in /h/w/conf/httpd by running find -group wikidev -not -perm 020 -exec chmod g+w \{\} \;
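For readers unfamiliar with the two permission fixes above, here is a minimal Python sketch of what the recursive `chmod -R g+w` in the 22:53 entry does: it adds the group-write bit to every entry in a tree that lacks it. The function name `add_group_write` is hypothetical. As background, with find a bare `-perm 020` matches only files whose mode is exactly 020, whereas `-perm -020` tests whether the group-write bit is set; the log does not say why the first attempt was not sufficient.

```python
import os
import stat


def add_group_write(root):
    """Add the group-write bit to root and everything below it.

    Returns the list of paths whose mode was changed. Symlinks are
    skipped, and directories reached only through symlinks are not
    traversed (os.walk does not follow them by default).
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)

    changed = []
    for path in paths:
        if os.path.islink(path):
            continue
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if not mode & stat.S_IWGRP:
            os.chmod(path, mode | stat.S_IWGRP)
            changed.append(path)
    return changed
```

Returning the changed paths makes it easy to review what was touched, which is harder to do after the fact with a plain `chmod -R`.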
22:38 logmsgbot: awjrichards synchronized php-1.20wmf2/extensions/MobileFrontend/stylesheets/sections.css 'Live hack to live test broken interface on ICS devices on very large articles'
22:34 logmsgbot: awjrichards synchronized wmf-config/InitialiseSettings.php 'Enable mobile url transformation on testwiki'
22:13 logmsgbot: awjrichards synchronized wmf-config/CommonSettings.php 'bumping MobileFrontend resource version number'
21:32 binasher: rebooting eqiad core db slaves for kernel upgrade
21:29 logmsgbot: aaron synchronized wmf-config/swift.php 'Added new thumbnail purge/import hooks handlers that use the swift backend class; unused atm.'
21:15 logmsgbot: asher synchronized wmf-config/db.php 'returning db45 to service'
21:14 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Matched wikimania2013wiki configuration to that of wikimania2012wiki'
21:13 maplebed: deployed container sharding for thumbnails to swift for 'dewiki', 'fiwiki', 'frwiki', 'hewiki', 'huwiki', 'idwiki', 'itwiki', 'jawiki', 'rowiki', 'ruwiki', 'thwiki', 'trwiki', 'ukwiki', 'zhwiki' (in addition to existing sharding for commons and enwiki)
21:13 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Move wikimania2013wiki to php-1.20wmf2
21:12 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Matched wikimania2013wiki configuration to that of wikimania2012wiki'
21:11 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Matched wikimania2013wiki configuration to that of wikimania2012wiki'
21:10 binasher: shutting down mysql across all eqiad core db slaves
20:59 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'logo for wikimania2013wiki'
17:18 logmsgbot: reedy synchronized php-1.20wmf2/extensions/MobileFrontend/ 'Pushing out head'
17:16 logmsgbot: reedy synchronized php-1.20wmf1/extensions/MobileFrontend/ 'Pushing out head'
17:14 RobH: asw-c1-eqiad connected to both cr1 and cr2
15:16 cmjohnson1: shutting down storage3 to replace raid card
12:40 pp-pdf1: updated mwlib to 0.13.7
12:39 pp-pdf2: updated mwlib to 0.13.7
12:36 pp-pdf3: updated mwlib to 0.13.7
11:59 mutante: merging CSS fix for broken mobile site table layout
02:18 RoanKattouw: Removed and recloned /var/lib/l10nupdate/mediawiki/extensions , it was in a weird state because magic extension submodules work now but my hacky workaround for them not working was still in place
02:00 logmsgbot: LocalisationUpdate failed: git pull of extensions failed
22:16 logmsgbot: raindrift synchronized wmf-config/InitialiseSettings.php 'enabling PageTriage on enwp'
22:14 logmsgbot: raindrift synchronized php-1.20wmf2/extensions/PageTriage 'Syncing PageTriage to enwp, a la carte'
22:14 logmsgbot: raindrift synchronized php-1.20wmf1/extensions/PageTriage 'Syncing PageTriage to enwp, a la carte'
21:59 mutante: was still upgrading/rebooting amssq* and knsq* hosts on the side (slow,b/c upload squids). expect temp. nagios squid reports tomorrow as well. out for now.
21:44 binasher: moved default resolution for upload from eqiad to pmtpa
21:29 cmjohnson1: shutting down storage3 for troubleshooting
20:37 binasher: attempting a live online schema change for zuwiktionary.recentchanges on the prod master
20:22 LeslieCarr: (above) restarted nagios-wm on spence
18:07 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: enwiki to 1.20wmf2
16:16 cmjohnson1: shutting down storage3 to reseat RAID card
15:58 cmjohnson1: Going to power cycle storage3 several times to troubleshoot hardware issue
15:15 RobH: updating firmware on storage3
14:20 Jeff_Green: stopped cron jobs on storage3 because of RAID failure
12:49 mutante: pushing out virtual host for wikimania2013 wiki. sync / apache-graceful/all
11:18 mutante: continuing with upgrades/reboots in amssq* on the side during the day
11:09 mutante: squids - sq* done. all latest kernel and 0 pending upgrades.
09:27 mutante: rebooting bits varnish sq68-70 one by one..
08:00 mutante: upgrading/rebooting the last couple sq* servers
07:20 binasher: power cycled db45 (crashed dewiki slave)
07:05 logmsgbot: asher synchronized wmf-config/db.php 'db45 is down'
02:25 Tim: on locke: introduced 1/100 sampling for banner impressions, changed filename to bannerImpressions-sampled100.log
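The 1/100 sampling mentioned above can be illustrated with a counter-based filter that keeps every 100th line of a stream. This is a sketch only: the function name `sample_every_nth` is hypothetical, and the log does not state whether the sampling on locke is counter-based or random, so the counter approach here is an assumption.

```python
def sample_every_nth(lines, n=100):
    """Yield every n-th item of an iterable (the n-th, 2n-th, ...).

    With n=100 this keeps roughly 1% of the input, so counts taken
    from the sampled log need to be multiplied by n to estimate totals.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    for index, line in enumerate(lines, start=1):
        if index % n == 0:
            yield line


if __name__ == "__main__":
    # Synthetic stand-in for a stream of banner impression log lines.
    stream = (f"impression {i}" for i in range(1, 1001))
    kept = list(sample_every_nth(stream, n=100))
    print(len(kept), kept[0])
```

Putting the sampling factor in the file name, as in bannerImpressions-sampled100.log, keeps the multiplier needed for later analysis attached to the data itself.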
02:12 Tim: on locke: moved fundraising logs back where they were
02:00 logmsgbot: LocalisationUpdate failed: git pull of extensions failed
01:38 Tim: on locke: compressing bannerImpressions.log
01:35 Tim: on locke: moved bannerImpressions.log to archive and restarted udp2log
01:26 Tim: on locke: moved fundraising logs from /a/squid/fundraising/logs to /a/squid so that they will be processed by logrotate
May 6
07:03 apergos: manually rotates udplogs on locke, copying destined_for_storage3 off to hume:/archive/emergencyfromlocke/ (jeff, this note's for you in particular)
06:36 apergos: bringing up storage3 with neither /a nor /archive mounted, saw "The disk drive for /archive is not ready yet or not present" etc on boot, waited a long time, finally skipped them
06:12 apergos: and powercycling the box instead. grrrr
06:05 apergos: rebooting storage3: we have messages like May 6 05:45:12 storage3 kernel: [465081.410025] Filesystem "dm-0": xfs_log_force: error 5 returned. in the log, and the raid is inaccessible, megacli doesn't run either
02:00 logmsgbot: LocalisationUpdate failed: git pull of extensions failed
May 5
09:37 mutante: squids - upgrading in the sq5x range (upload)
08:53 apergos: disabling modcompress temporarily for lighty on dataset2 (live hack), let's see what that does as far as it dying. could be issue similar to http://redmine.lighttpd.net/issues/2391
19:49 logmsgbot: asher synchronized wmf-config/db.php 'setting db52 weight to 0 in prep for making new s2 master'
19:32 binasher: powering off db24
18:08 LeslieCarr: reloaded mobile varnish caches and purged them
18:02 Ryan_Lane: gerrit upgrade is done
17:55 Ryan_Lane: starting gerrit
17:32 Ryan_Lane: installing gerrit package on manganese
17:28 Ryan_Lane: adding gerrit 2.3 package to the repo
17:25 Ryan_Lane: shutting down gerrit so that everything can be backed up
16:45 apergos: lighty on dataset2 is running under gdb in screen session as root, if it dies please leave that alone (or look at it if you want to investigate)
16:26 notpeter: turning off db30 (former s2 db, still on hardy, will ask asher what to do with it) to test noise in DC
15:50 mutante: rebooting sq67 (bits)
15:42 mutante: going through sq7x servers (text), full upgrades
15:32 notpeter: removing srv281 from rendering pool until we figure out what's going on with it
15:23 notpeter: putting srv224 back into pybal pool
15:09 notpeter: removing srv224 from pybal pool for repartitioning
14:56 notpeter: putting srv223 back into pybal pool
14:50 mutante: going through sq6x (text), full upgrades
14:08 notpeter: removing srv223 from pybal pool for repartitioning
14:02 notpeter: putting srv222 back into pybal pool
13:50 notpeter: removing srv222 from pybal pool for repartitioning
13:43 notpeter: putting srv221 back into pybal pool
13:30 notpeter: removing srv221 from pybal pool for repartitioning
13:16 mutante: going through sq80 to sq86 (upload), full upgrade & reboot
12:56 mutante: maximum uptime in the sq* group down to 171 days, so we have like a month now for the rest. stopping upgrades for the time being.
12:54 notpeter: starting script to move /usr/local/apache to /a partition on all remaining non-imagescaler apaches
12:47 mutante: (just) new kernels & reboot - sq45,sq49 (upload)
12:30 mark: Sending ALL non-european upload traffic to eqiad
12:23 mutante: (just) new kernels & reboot - sq63 to sq66 (209 days up)
12:06 mutante: dist-upgrade & kernel & reboot - sq42,sq43 - rebooting upload squids one by one
11:48 mutante: powercycling srv266 one more time, but now creating RT for it, once already showed CPU issue before it was reinstalled recently
11:13 apergos: restarted lighty on dataset2 ... about ... half an hour ago. stupid case sensitivity
10:02 apergos: tossed knsq1 through 7 from squid_knams dsh nodegroups file, prolly lots more cleanup where that came from
17:44 RobH: db1029 ssd test items removed, can go back to normal service via asher
17:43 notpeter: returning mw58 to pool
17:34 RobH: shutting down db1029 for ssd card testing removal per rt 2766
17:26 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 36320 - Set $wgShowUpdatedMarker back to true on ptwiki'
17:18 notpeter: removing mw58 from pool for more testin'
17:16 LeslieCarr: reloaded and purged varnish cache for mobile in eqiad
17:03 notpeter: mwm59 out of apache pool. using it for some testing
16:16 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 36359 - Add namespace 102 to $wgContentNamespaces on ptwiki Bug 36360 - Add namespace 102 to $wgNamespacesToBeSearchedDefault on ptwiki'
16:08 logmsgbot: reedy synchronized wmf-config/CommonSettings.php 'Bug 36460 - Enable chunked uploads as opt-in user preference'
16:06 logmsgbot: reedy synchronized wmf-config/CommonSettings.php 'Bug 31406 - Set $wgUseMathJax = true on Wikimedia wikis'
15:12 notpeter: chris is taking down search1-12 to replace with new search nodes
15:05 mutante: powercycling srv266
13:49 mark: Built new wikimedia-base 1.00 package, stripped of most stuff now handled by Puppet, and inserted it into the lucid-wikimedia and precise-wikimedia APT repositories
10:33 mutante: starting container-auditor on ms-be3
02:38 Tim: fixed scap, was failing on the remote side due to mwversionsinuse exiting with status 1 due to /home/wikipedia/common not existing on apaches
22:49 logmsgbot: preilly synchronized php-1.20wmf2/extensions/MobileFrontend/ 'contact us change'
22:48 logmsgbot: preilly synchronized php-1.20wmf1/extensions/MobileFrontend/ 'contact us change'
21:43 logmsgbot: asher synchronized wmf-config/db.php 's2: pulling db30, raising weights on new hosts'
21:02 ^demon: finished database maintenance on db9.reviewdb
20:24 hashar: updated TestSwarm to distribute tests to Firefox 12 users.
20:12 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Re-pushing for srv219 and srv220
20:07 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files:
20:04 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moved special, wikimedia, wikiquote, wikiversity, and wiktionary wikis to 1.20wmf2
19:59 logmsgbot: asher synchronized wmf-config/db.php 'adding dbs 52,53,57 to s2 at lower weights'
19:55 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files:
19:47 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: metawiki to 1.20wmf2
19:40 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: commonswiki to 1.20wmf2
19:36 preilly: fix for PHP Warning: in_array() expects parameter 2 to be array, string given in /usr/local/apache/common-local/php-1.20wmf1/extensions/MobileFrontend/skins/SkinMobile.php on line 156
19:36 logmsgbot: preilly synchronized php-1.20wmf2/extensions/MobileFrontend/skins/SkinMobile.php 'fix php notice for in_array'
19:35 logmsgbot: preilly synchronized php-1.20wmf1/extensions/MobileFrontend/skins/SkinMobile.php 'fix php notice for in_array'
19:34 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moved all wikibooks to 1.20wmf2
19:25 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moved enwikibooks to 1.20wmf2
19:21 logmsgbot: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moved sourceswiki to 1.20wmf2
16:51 RoanKattouw: Changing docroot/bits/skins-1.19 and other 1.19 symlinks to point to the 1.20wmf1 tree instead. This is needed because we're still getting requests for magnify-clip.png at the 1.19 URL from cached HTML
16:16 notpeter: starting innobackupex from db1040 to db1022 for new s6 snapshot slave
15:31 notpeter: no nagios bot, kicking nagios on spence
15:04 RobH: shutting down mw64 for hw test per rt 1890
15:03 RobH: bellin crashed, unresponsive to ssh or serial console
14:43 mark: Built varnish for precise as 3.0.2-2wm5 and imported it into APT repository precise-wikimedia
11:52 mark: Started distribution upgrade of server stafford from Lucid to Precise
10:41 mutante: refreshLinks.php - started it once again in a screen on hume, just for s1. last cron failed with "mwscript command not found"?? well now it is there again and running
10:09 mark___: Started distribution upgrade of server sockpuppet from Lucid to Precise
09:20 mutante: upgrading bugzilla to 4.0.6
08:43 mutante: kaulen: installing various upgrades (apache,mysql,cron,php-wikidiff2,...)
22:23 logmsgbot_: asher synchronized wmf-config/db.php 'pulling db45, last coredb on prior fb mysql build'
22:17 logmsgbot_: reedy synchronized wmf-config/InitialiseSettings.php 'Enable doublepage on test2wiki'
22:11 binasher: upgraded percona-toolkit on coredbs to 2.1.1 - now with the potential to run online schema changes on tables without single column unique keys!!
21:39 binasher: created an ops db on all core mysql shards
21:00 notpeter: reinstalling db53. this time with correct raid!
20:40 logmsgbot_: awjrichards synchronized wmf-config/CommonSettings.php 'Fixing mailto links on mobilefrontend feedback form to properly populate subject lines'
19:32 LeslieCarr: reverting vrrp mastership of row a to cr2-eqiad
19:29 LeslieCarr: switching vrrp mastership of row a to cr1-eqiad
18:32 logmsgbot_: awjrichards synchronized wmf-config/InitialiseSettings.php 'Make testwiki use mobile domain for URLs'
18:28 LeslieCarr: making routing change, higher risk
17:51 Ryan_Lane: make that virt0
17:51 Ryan_Lane: switching the session cache back to filesystem on virt1, since it isn't working properly with memcache
17:29 maplebed: kicking nagios to check a change to fix the mobile LVS alert
20:57 pgehres: disabled all Jenkins jobs on Aluminium in prep for db1008 reboot
20:50 Jeff_Green: db1025 and storage3 get new kernels and reboot
20:28 notpeter: restarting, once again, innobackupex from db1034 to db57 for new s2 slave after fenari crash killed my screen
20:24 Reedy: Running ddsh -F30 -cM -g mediawiki-installation -o -oSetupTimeout=10 '/usr/bin/scap-1' in the hope it syncs all the files that would be nice to be on the app servers
07:37 apergos: powercycling srv266, had this message on mgmt console: Severity: Non Recoverable, SEL:CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted
07:22 mutante: installing upgrades on srv212
07:19 apergos: reinstalled srv284, seems to be up now
07:17 mutante: powercycled mw8
02:14 logmsgbot_: LocalisationUpdate completed (1.20wmf1) at Mon Apr 30 02:13:59 UTC 2012
April 29
20:13 apergos: srv206 won't run puppet, see syslog, clearing out the yaml file didn't help, since it's not urgent I'm leaving it for tomorrow
19:51 Ryan_Lane: depooling ssl3004
19:51 Ryan_Lane: removed the ipv6 addresses from maerlant and added them to ssl3001, then restarted nginx
19:50 Ryan_Lane: repooling ssl3001
19:46 apergos: powercycled mw60, same reason as the rest
19:12 apergos: power cycled mw48 and mw52 (hung just like the others)
18:05 apergos: ssl3002 and 3003 were rebooted and are the entire ssl esams pool right now
18:02 apergos: ok the ssl300x situation: ssl3001 is now disabled in the pybal conf file on fenari; it is picking up the ipv6and4labs template and I don't know if that's right, anyways nginx doesn't want to bind to one of those addresses. ssl3004 isn't reachable or pingable even via mgmt but at least lvs sees it's gone
16:34 apergos: powercycling the ssl300x.esams hosts. 212 days of uptime... (and 3001 had gone out to lunch)
12:34 mutante: and finally mw1, so just leaving mw1102 and mw60 for having other issues for a while (->Nagios)
12:22 mutante: check_all_memcached recovered, but still same treatment for mw10 and 11 (8 and 15h ago)
12:15 mutante: powercycling mw32,mw33,mw44,mw46 one by one, they were all frozen and went down between like 17 and 24 hours ago approx.
12:07 mutante: powercycling mw30
02:56 paravoid: rebooting ssl2 (has 214 days uptime)
02:47 paravoid: powercycled ssl3
02:13 logmsgbot_: LocalisationUpdate completed (1.20wmf1) at Sun Apr 29 02:13:58 UTC 2012
April 28
22:53 Reedy: Job queue logs on gdash seem to have stopped on the 26th...
02:24 Ryan_Lane: all mediawiki boxes that have uptimes affected by the bug are being rebooted at 8 minute intervals
02:14 logmsgbot_: LocalisationUpdate completed (1.20wmf1) at Sat Apr 28 02:14:14 UTC 2012
01:33 paravoid: powercycled mw29
01:21 paravoid: powercycled mw38
00:17 notpeter: db12 is sooooo sloooooow, starting innobackupex from db1017 to db60 for new s1 slave
April 27
22:15 paravoid: upgraded ssl4 to nginx 0.7.65-5wmf1 and added it back to the pool
21:45 paravoid: rebooting ssl4 after upgrading (incl. a kernel update)
20:00 notpeter: starting innobackupex from db1040 to db1022 for new eqiad s6 snapshot slave, again
19:59 notpeter: starting innobackupex from db12 to db60 for new s1 slave, again
19:58 notpeter: starting innobackupex from db1017 to db59 for new s1 slave, again
19:49 paravoid: de-pooling ssl4
19:30 mutante: test - added new gerrit interwiki prefix for SAL/wikitech - gerrit:6002
19:14 logmsgbot_: catrope synchronized wmf-config/CommonSettings.php 'Fix rights for afttest and afttest-hide groups'
18:25 logmsgbot_: reedy synchronized wmf-config/CommonSettings.php 'Cleanup enotif related settings'
18:24 logmsgbot_: reedy synchronized wmf-config/InitialiseSettings.php 'Set wgEnotifWatchlist to true for all wikis. Leaving wgShowUpdatedMarker set to false for all the big wikis'
18:10 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: Moving all remaining wikis to php-1.20wmf1
17:07 LeslieCarr: reloaded mobile varnish configs
17:06 LeslieCarr: purging mobile cache
16:40 LeslieCarr: starting delete script on ms-be3
16:14 RobH: done moving mgmt connections and serial connections in a8-eqiad for now
16:05 RobH: reshuffling cables in eqiad for serial and mgmt connections in a8, this may affect all eqiad mgmt and serial connections for the next 5 minutes
15:29 hashar: gallium: MySQL had issues most probably because of the mysql configuration snippets. https://gerrit.wikimedia.org/r/5796 might solve that.
14:03 mutante: gallium - don't start puppet unless the erb template fix for mysql has been merged
13:52 mutante: gallium stopped puppet, moved log_slow_queries config, re-setting up mysql again
13:41 mutante: gallium/testswarm - back up after mysql upgrade and issue starting the service
13:36 mutante: gallium - dpkg-reconfigure mysql-server-5.1, mysql does not start right
13:27 mutante: running apt-get upgrade on gallium
12:29 mark: Sending US, Brazil, Indian traffic to upload.eqiad
11:39 mutante: running authdns-update to add analytics100x and labsdb100x mgmt names
05:35 paravoid: powercycled lvs6, was dead and not responding to serial
03:43 logmsgbot_: asher synchronized wmf-config/db.php 'adding db58 to s7 as a new slave with a low weight'
11:54 apergos: after much cursing and kicking zfs, a manual snapshot replication is running in screen as root on ms7 to ms8, expect it to take at least a day
11:44 mark: Sending all non-european upload traffic back to pmtpa to prepare for eqiad varnish storage rework
08:56 mutante: updated blog theme per guillaume (April commits)
08:05 apergos: temporarily disabled automatic zfs replication from ms7 -> ms8, cleared out space on ms8, catching up by hand
10:30 mutante: searchidx1 was in site.pp and decom.pp at the same time. breaks puppet runs on spence. cannot override local resource. removing from site
10:27 mutante: killed a couple morebots processes on wikitech and it came back by itself :p
April 21
02:29 logmsgbot_: LocalisationUpdate completed (1.20wmf1) at Sat Apr 21 02:29:40 UTC 2012
02:15 logmsgbot_: LocalisationUpdate completed (1.19) at Sat Apr 21 02:15:20 UTC 2012
April 20
22:03 logmsgbot_: aaron synchronized wmf-config/CommonSettings.php 'Switched test2wiki to use the new LocalRepo config style.'
22:01 logmsgbot_: aaron synchronized wmf-config/CommonSettings.php 'Switched testwiki to use the new LocalRepo config style.'
21:52 logmsgbot_: aaron synchronized wmf-config/CommonSettings.php 'Added NFS backends for local/shared repos; they are not used yet.'
21:12 LeslieCarr: starting swift delete script on ms-be2
06:27 pgehres: re-enabled PayPal on donatewiki and wmfwiki and resumed queue consumer on Aluminium
05:32 LeslieCarr: flushing mobile varnish cache
04:56 pgehres: disabled paypal on donatewiki and disabled queue consumer for duration of PayPal outage
02:33 logmsgbot_: LocalisationUpdate completed (1.20wmf1) at Fri Apr 20 02:33:02 UTC 2012
02:23 logmsgbot_: LocalisationUpdate completed (1.19) at Fri Apr 20 02:23:57 UTC 2012
01:47 logmsgbot_: awjrichards synchronizing Wikimedia installation... : r114983 on wikis still running 1.19
April 19
23:33 binasher: powercycled es1004
21:08 Jeff_Green: changed nagios contactgroup fundraising from tfinc/awrichards --> jgreen
21:03 RoanKattouw: Scap is broken in some weird way, it just stops running after the scap1-skins step. Doesn't run scap-1 (which does the actual sync), doesn't log "sync done", doesn't update graphite
21:01 logmsgbot_: catrope synchronizing Wikimedia installation... : Running scap again, AFTv5 is acting up
19:29 RoanKattouw: Running scap to deploy AFTv5 updates, and running AFTv5 schema changes on enwiki at the same time
18:50 logmsgbot_: catrope synchronized wmf-config/InitialiseSettings.php 'Set wmgArticleFeedbackv5OversightEmails for enwiki'
18:25 notpeter: nothing obvious in logs on db1005, starting mysql
18:15 notpeter: rebooting db1005. it's dead, jim.
17:52 RoanKattouw: Running schema changes for AFTv5 on testwiki
17:51 Jeff_Green: discovered nfs1 had ~1K redundant iptables rules, removed extras and reloaded
17:42 Jeff_Green: discovered sanger had ~7K redundant iptables rules, removed extras and reloaded
13:56 mutante: adding refreshLinks cron jobs to hume per RT-2355 (via puppet). if there should be any performance issues, schedule can be changed like <cluster>@<hour> in mediawiki.pp (and/or remove mediawiki::refreshlinks from hume and clear out the jobs of user mwdeploy)
08:35 mutante: emery - "udp2log_age" says some squid logfiles have not been written to in 6 hours, but from the filenames it looks like this isn't a reason to worry, right
07:49 mutante: stat1 - this also needs udp2log stuff fixed. currently Could not find class misc::udp2log::udp-filter
07:47 mutante: gilman - what's up with it? closes SSH, does not like mgmt pass, was running jenkins but broken
07:43 mutante: owa[1-3] They dont have real puppet freshness issues, it's rather firewalling and the snmp traps
02:30 logmsgbot_: LocalisationUpdate completed (1.20wmf1) at Thu Apr 19 02:30:33 UTC 2012
02:21 logmsgbot_: LocalisationUpdate completed (1.19) at Thu Apr 19 02:21:31 UTC 2012
April 18
22:55 LeslieCarr: updating exim4.conf on mchenry to not allow old ranges
21:03 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files:
20:04 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: specieswiki and foundationwiki to 1.20wmf1
19:56 logmsgbot_: aaron synchronized php-1.20wmf1/extensions/LiquidThreads/classes/Hooks.php 'Avoid fatals on invalid title in API'
19:51 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: All *wiki wikis to 1.20wmf1
19:25 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: All wikiquote and wikiversity projects to 1.20wmf1
19:22 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: All wikibooks to 1.20wmf1
19:18 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: All wikinewses to 1.20wmf1
19:07 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: All wikisources to 1.20wmf1
19:07 logmsgbot_: catrope synchronized wmf-config/mc.php 'Swap out 10.0.2.251 (down) with 10.0.11.24 (spare). This is the last spare, there are now NO SPARES LEFT in mc.php'
19:00 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: All wiktionaries to 1.20wmf1
18:57 logmsgbot_: aaron synchronized php-1.20wmf1/extensions/LiquidThreads/classes/Dispatch.php 'Added type hint for better fatals'
18:44 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: enwikiversity to 1.20wmf1
18:43 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: enwikiquote to 1.20wmf1
18:41 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: enwikibooks to 1.20wmf1
18:40 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: enwikinews to 1.20wmf1
18:39 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: enwiktionary to 1.20wmf1
18:32 logmsgbot_: aaron rebuilt wikiversions.cdb and synchronized wikiversions files: enwikisource to 1.20wmf1
21:51 logmsgbot_: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.i18n.php 'changes for zero needed for carrier testing'
18:29 RoanKattouw: Did a graceful restart of all job runners using dsh about 15 mins ago
18:29 RoanKattouw: Restarted morebots
07:44 apergos: morebots test
07:44 apergos: restarted varnish service manually a bit ago on sq67 and sq70, the cron job didn't seem to have gone off. restarted morebots too while I was at it
03:37 Jeff_Green: dist-upgrade arsenic
03:29 LeslieCarr: restarting varnish on arsenic again
03:12 maplebed: started a script to delete old objects on ms-be1 for swift truncated object cleaning
02:53 Jeff_Green: dist-upgrade on strontium
02:43 LeslieCarr: restarted varnish on arsenic
02:26 logmsgbot_: LocalisationUpdate completed (1.20wmf1) at Tue Apr 17 02:26:40 UTC 2012
02:17 logmsgbot_: LocalisationUpdate completed (1.19) at Tue Apr 17 02:17:24 UTC 2012
11:45 logmsgbot_: reedy synchronized php-1.18/ 'Symlink php-1.18 back to php (our current main running version) as lots of requests on bits are for 1.18 resources'
11:44 mutante: sq34 was broken and died when connecting to mgmt, powercycling
11:37 mutante: nfs1 - Could not find class misc::mediawiki-logger for nfs1
10:57 Krinkle: bits.wikimedia.org back up, mark fixed it.
10:33 Krinkle: bits.wikimedia.org serving Error 503 Service Unavailable on all load.php requests for mediawiki.org and nl.wikipedia.org, maybe more
09:45 logmsgbot_: reedy synchronized wmf-config/InitialiseSettings.php 'Set wgEnableJavaScriptTest to true for test2wiki'
02:26 logmsgbot_: LocalisationUpdate completed (1.20wmf1) at Mon Apr 16 02:26:58 UTC 2012
02:17 logmsgbot_: LocalisationUpdate completed (1.19) at Mon Apr 16 02:17:57 UTC 2012
16:18 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35261 - Add block permissions in rollback on Lusophone Wikipedia'
16:13 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35823 - Wikijunior and cookbook namespaces for the Vietnamese Wikibooks'
16:05 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35659 - Set logo for sl.wikiversity'
16:00 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35853 - Set a non-empty default value for wmgArticleFeedbackBlacklistCategories on WMF wikis'
15:58 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35878 - Enable e-mail notifications for watchlist (EnotifWatchlist) on tawiki'
15:52 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35852 - Add a category to $wgArticleFeedbackBlacklistCategories for Portuguese Wikipedia to remove AFT from disambiguation pages'
15:10 mutante: gallium - after files have been deleted/moved, puppet back to normal operation (and new clone directory in Apache)
13:23 mutante: killed puppets on gallium
12:33 mark: repooled ssl1002
12:27 mutante: powercycling frozen ssl1002
12:22 mark: Manually depooled down ssl1002 in pybal
02:24 logmsgbot: LocalisationUpdate completed (1.20wmf1) at Thu Apr 12 02:24:29 UTC 2012
02:15 logmsgbot: LocalisationUpdate completed (1.19) at Thu Apr 12 02:15:54 UTC 2012
April 11
22:37 maplebed: deployed more log filters to emery: gerrit/r4758
21:20 logmsgbot: awjrichards synchronized wmf-config/InitialiseSettings.php 'Disabling mobile URL template for mediawiki.org (using "mediawikiwiki" this time)'
21:18 logmsgbot: awjrichards synchronized wmf-config/InitialiseSettings.php 'Disabling mobile URL template for mediawiki.org'
21:08 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: mediawikiwiki to 1.20wmf1
02:16 logmsgbot: LocalisationUpdate completed (1.19) at Sat Apr 7 02:16:54 UTC 2012
April 6
22:23 LeslieCarr: deploying new squid config to all squids
22:14 LeslieCarr: added neon into tiertwo of squid allowed hosts
22:13 LeslieCarr: deploying new squid config to amssq35
21:55 LeslieCarr: restarted puppet on spence
21:35 LeslieCarr: moved jenkins_1.458_all.deb to /srv/wikimedia/incoming/ on brewster
21:32 LeslieCarr: restarted squid on brewster
18:27 Ryan_Lane: updating OpenStackManager to r114758 on virt0
17:33 mark: Sending Japanese upload traffic to varnish in eqiad
17:15 mark: Power cycled down host lvs5
16:43 mutante: changed master and started slave on es1004
15:55 mutante: used gerrit create-project to create operations/debs/wikistats.git
14:13 mutante: manganese (gerrit) now sends SSL CA certificate on https, (curl -vvv says verify ok), should resolve RT:2777 and BZ:35709
11:51 mutante: es1004 - rsync was finished, deleted all binlogs from old host, mysqld_safe& , but did not "change master.." and "start slave" (see mail)
11:39 notpeter: restarting lsearchd on search3... again...
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Fri Apr 6 02:17:37 UTC 2012
01:21 Ryan_Lane: updating OpenStackManager to r114757 on virt0
00:18 Ryan_Lane: updating OpenStackManager to r114754 on virt0
April 5
23:49 logmsgbot: catrope synchronized wmf-config/InitialiseSettings.php 'Change guwikisource logo to point to the unscaled file instead'
21:46 notpeter: halting db15 for it to await decom
21:39 binasher: started enwiki.revision sha1 migration on db12
17:32 RobH: update done, all nameservers still online
17:31 RobH: dns update for wikipedia.org/com.il being resolved
17:08 RoanKattouw: Applying AFTv5 schema change on testwiki
15:30 logmsgbot: py synchronized wmf-config/lucene.php 'pointing eswiki search at lvs pool in eqiad for live testing'
15:28 notpeter: pointing eswiki search at eqiad
12:51 mutante: db1007 - add mysql startup via 'update-rc.d mysql defaults'
12:42 apergos: started mysqld on db1007 via /etc/init.d/mysql (this doesn't seem to point to a special fb build, and can't seem to find one on this host, what's up with that?)
12:31 apergos: rebooted db1007, it was dead in the water (also no helpful messages on console, bah)
11:16 mutante: enabled Renameuser extension on wikitech, renamed tchay per RT request, disabled extension again (it was installed but disabled)
02:19 logmsgbot: LocalisationUpdate completed (1.19) at Wed Apr 4 02:19:03 UTC 2012
01:39 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header of landing page only for mswiki'
April 3
23:17 LeslieCarr: updating bgp policies on cr1.sdtpa
22:44 LeslieCarr: reinstalling neon
22:04 maplebed: rolled back changes to emery in udp-filter due to the new binary crashing.
21:50 maplebed: ran /etc/init.d/udp2log reload on emery to enact the puppetted changes
21:41 maplebed: deploying new udp-filter and teahouse filters to emery for diederik
20:13 notpeter: restarting lsearchd on search7. was toasted
17:15 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header of landing page only for mswiki'
16:46 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header of landing page only for mswiki'
02:16 logmsgbot: LocalisationUpdate completed (1.19) at Mon Apr 2 02:16:47 UTC 2012
April 1
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Sun Apr 1 02:17:22 UTC 2012
March 31
10:22 mutante: srv222,225 were also upgraded but stopping there for now in favor of reinstalls
09:58 mutante: nuked /usr/share/doc on a couple srv's, hey at least 700MB or something, and yes we really should reinstall with a decent partitioning scheme as Mark said
02:18 logmsgbot: LocalisationUpdate completed (1.19) at Sat Mar 31 02:18:10 UTC 2012
March 30
19:37 hashar: configured jenkins on gallium to use smtp.pmtpa.wmnet as outgoing SMTP server
19:28 RobH: puppet daemon restarted on brewster
18:13 RobH: killing puppet daemon on brewster, i need to hack at local configuration for cisco server stuff
12:56 mutante: db1047 - added system startup for /etc/init.d/mysql
12:47 mutante: powercycling db1047
12:28 mutante: deleted old kernel sources on upgraded srvs for that little extra space during peaks, suggesting to nuke /usr/share/doc if there should be more disk space warnings
10:41 mutante: same for srv223
09:18 mutante: srv224,srv219,srv220, upgrade apache, dist-upgrading w/ kernel, disabling ureadahead, rebooting one by one
08:06 mutante: storage3 - gmond unable to find the metric information for any mysql_* .."module has not been loaded", starting mysql, running puppet ...
07:57 mutante: powercycling storage3
07:03 Tim: running bug 35578 cleanup script in screen on fenari
22:44:53 RobH: also rolling firmware to ps1-d[1|2|3]-pmtpa
22:28:10 RobH: pushing firmware updates to servertechs in sequence: ps1-[a2|a3|a4|a5|b2|b3|b4|b5|c1|c2|c3|d1|d2|d3]-sdtpa, disregard any errors from rebooting alerts
19:55:09 notpeter: stopping puppet on search6 and search15 for 24 hours to test new log rotation script
19:19:35 RobH: cp1019 memory replaced per rt 2651
19:07:14 apergos: rebooting ms1001 (new kernel)
17:53:34 RobH: cp1019 coming down for memory replacement per rt 2651
17:51:39 RobH: fluorine disk upgrade done, os install pending, details on rt 2350
17:14 notpeter: backing up plwiki.nspart1 index on search7, deleting working copy, and restarting lsearchd. (note: this will probably cause some downtime on some languages while the proc restarts...)
15:18 RobH: db59 has errors, but as it was a fusion io testbed server, it is more than likely tweaked for such, it is not in any rotation
14:54 RobH: db59 shutting down for io card removal per rt 2589
13:37 mutante: while on it, installing a whole bunch of package updates on db42
13:25 mutante: db42 was out of disk , caused by ~5G citations.csv in /tmp, gzipped the file
09:59 mutante: ..and on ms-be-3. running puppet on db59
09:43 mutante: another corrupted .yaml file on ssl2
09:05 mutante: brewster was out of disk - deleted lighttpd access.log.1, gzipped access.log
08:24 mutante: on several mw* boxes puppet did not run because .yaml files on the puppetmaster became corrupted. need to delete the $hostname files in /var/lib/puppet/yaml/node on stafford and re-run. puppet bug similar to http://projects.puppetlabs.com/issues/7836
02:18 logmsgbot: LocalisationUpdate completed (1.19) at Mon Mar 26 02:18:03 UTC 2012
March 25
22:26 RobH: row b servertech firmware in eqiad all updated, should clear alarms as they come back online
22:18 RobH: firmware updates on servertechs in row b eqiad, disregard alarms
20:14 RobH: to fellow ops, you can disregard those observium errors, as I caused them
20:13 RobH: firmware updated on all power strips in row a eqiad.
16:09 RobH: updating firmware on ps1-s1-eqiad and ps1-b1-sdtpa
16:07 RobH: updated firmware successfully on ps1-a8-eqiad, if it has observium alarms now then there are bigger issues.
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Sun Mar 25 02:17:21 UTC 2012
00:59 LeslieCarr: admin down asw-a-eqiad xe-1/1/2 and cr2-eqiad xe-5/0/0 due to framing errors causing packet loss and lacp sporadic timeouts. source of the issue
March 24
19:46 logmsgbot: preilly synchronized php-1.19/extensions/MobileFrontend/MobileFrontend.body.php 'Following a performance regression reported on wikitech-l, added merciless profiling to ExtMobileFrontend::DOMParse()'
19:46 logmsgbot: preilly synchronized php-1.19/extensions/MobileFrontend/MobileFrontend.body.php 'Following a performance regression reported on wikitech-l, added merciless profiling to ExtMobileFrontend::DOMParse()'
17:35 mark: Migration from br1-knams to cr2-knams completed.
17:09 mark: Migrated second knams-esams dark fiber link from br1-knams to cr2-knams
16:36 mark: Corrected MTU setting on cr2-knams's AMS-IX interface
16:20 Reedy: Some European users reporting routing issues
16:01 mark: Cleared OSPF session between csw1-esams and csw2-esams which magically made some internal routes reappear
15:40 mark: Brought up AMS-IX ipv4 BGP sessions
15:30 mark: Brought up AMS-IX ipv6 BGP sessions
15:25 mark: Moved AMS-IX connection to cr2-knams:xe-1/1/0
15:22 mark: Shutdown all AMS-IX BGP sessions
15:06 mark: Disabled BFD on OSPF3 between cr2-knams and csw1-esams
14:49 mark: Moved AS6908 and AS1257 PIs to cr2-knams
14:18 mark: Brought up AS13030 and AS1299 BGP sessions on cr2-knams
13:57 mark: Shutdown AS1299 BGP session on br1-knams
13:14 mark: Established full iBGP mesh with added router cr2-knams. cr2-knams now has full Internet connectivity.
12:48 mark: Moved fiber from br1-knams:e1/2 to cr2-knams:xe-0/0/0
12:44 mark: Disabled br1-knams:e1/2 (DF leg 1 to esams)
12:43 mark: Rack mounted and powered up cr2-knams
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Sat Mar 24 02:17:02 UTC 2012
16:44 RobH: cp1019 in middle of firmware update, please don't touch
16:44 RobH: cp1017 memory error seems to have cleared post firmware update, will keep an eye on it for the rest of the day
16:09 RobH: raid rebuilding on magnesium, however swift stuff is kind of black box mystery right now to me, need Ben to review magnesium later for that
15:53 RobH: magnesium coming back online
15:44 RobH: shutting down magnesium for disk swap
15:37 RobH: firmware updating on cp1017, no one touch it please
15:30 RobH: db1020 can go back into whatever rotation Asher wants it in
15:29 RobH: db20 memory error on raid controller resolved with firmware update
23:18 RobH: db1020 firmware still updating, will check on it later tonight. offline until then
22:19 notpeter: all 3 dns servers are responding to digs after reload
22:10 notpeter: pushing a new zone file to add 2 more search-related vips for eqiad
20:52 notpeter: stopping puppet on brewster temporarily
20:25 notpeter: rebuilding search1015 and 1016 for disk shuffles
20:01 RobH: magnesium going down and up again, troubleshooting the disks
19:47 apergos: rebooting ms1002, had stuck rsyncs, and kswapds at 100% cpu, weirdness like "ls /export/upload/wikipedia/am/0/00" hanging.
18:08 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header of landing page only for mswiki'
15:45 RobH: search 1015 and search1016 back up with added disks
15:08 RobH: shutting down search1015 & search1016 for hdd additions
14:45 RobH: db1020 still offline, requires firmware update on raid controller per rt 2621, will perform later today
20:10 logmsgbot: catrope synchronized wmf-config/CommonSettings.php 'Set $wgArticleFeedbackv5OversightEmails on enwiki'
18:59 maplebed: rebooted ms-be3 after it crashed.
18:51 binasher: brought db24 back up after hang, and reslaving, but leaving out of db.php. just replicating until a replacement s2 snapshot host is built
18:01 logmsgbot: catrope synchronized wmf-config/InitialiseSettings.php 'Temporarily disable ShortUrl on testwiki because we think it might conflict with ArticleFeedbackv5'
17:59 K4-713: updated and synchronized payments cluster to r114382
21:46 binasher: stopped eqiad bits servers from udplogging to emery, packet loss is back to zero
20:59 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header of landing page only for mswiki'
20:17 binasher: killed enwiki.revision sha1 migrator (upgrade-1.19wmf1-2.php). after db36 completes, will run the rest by hand
19:52 Ryan_Lane: pushing change for zero.wikipedia.org to redirect to the english message
19:41 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header of landing page'
19:16 cmjohnson1: pulling disk 5 on virt1 for reseating
18:34 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header of landing page'
18:02 pgehres: flipped Template:CC-status on wmfwiki since credit cards are still disabled on payments.wikimedia.org
17:52 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35193 - Enable sub page feature in Telugu Wikisource'
18:14 mark: Running smartctl -t long /dev/sdb on brewster
12:58 logmsgbot: hashar synchronized php-1.19/includes/SiteStats.php 'Reenable SiteStatsInit::articles() for bug 35169. SiteStatsInit::doAllAndCommit() still disabled since it breaks the site'
10:28 logmsgbot: tstarling synchronized wmf-config/PoolCounterSettings.php 'increased max queue from 50 to 100 on reports that the limit was reached on the enwiki main page in normal operation'
09:11 mutante: nomcom and langcom wikis look kind of broken, redirecting to pages on incubator with "Error: This page is unprefixed! "
08:49 mutante: making (almost) all private wikis https-only per RT-2565, vi remnant.conf,sync,graceful...
07:30 mutante: running sync-apache after making a change to remnant.conf to make grants.wm https-only
05:09 Ryan_Lane: bringing up most instances on virt3, doing so by project priority
04:42 Ryan_Lane: bringing up all instances on virt4, waiting 30 seconds between instances
04:25 Ryan_Lane: bringing up all instances on virt2, waiting 30 seconds between instances
04:09 Ryan_Lane: bringing up all instances on virt1, waiting 30 seconds between instances
04:00 Ryan_Lane: attempting to bring some instances up
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Mon Mar 19 02:17:17 UTC 2012
19:54 RobH: sq67-sq70 have been reinstalled, but not signed in puppet, not sure if they are ready for that or if there are other items mark needs to change first
19:11 RobH: working on sq67-sq70 reinstalls, disregard alerts
19:00 RobH: db1022 resetup and redeployed per rt 2537 and assigned back to asher
18:51 logmsgbot: reedy synchronizing Wikimedia installation... : Running scap to deal with message changes earlier
18:19 RobH: db1022 coming down for reinstall and resetup of raid per rt 2537
15:15 mark: Created git repo operations/debs/varnish in gerrit
14:06 apergos: disabled moodbar temporarily on en wiki, see bug 35245
14:02 logmsgbot: ariel synchronized wmf-config/InitialiseSettings.php 'emergency disable of feedback dashboard (right config var this time?)'
13:51 logmsgbot: ariel synchronized wmf-config/InitialiseSettings.php 'emergency disable of feedback dashboard'
13:11 apergos: on screen as root on dataset1001, copying to gluster volume; if this causes problems feel free to shoot it. ( cp -a 20120211 /mnt/glusterpublicdata/public/enwiki/ )
09:08 mutante: ran puppet on mw1020
08:12 mutante: installing apache,apt,cron,mysql-client upgrades on spence
07:51 mutante: messed with /var/lib/dpkg/status on hume to fix broken packages/remove "marked for purging" on libmysql-php5 without removing a ton of other packages, rather hackish but seems fine anyways, like not broken anymore on simulated dist-upgrade etc
07:01 mutante: upgrading apache and apt on hume
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Thu Mar 15 02:17:35 UTC 2012
01:26 Ryan_Lane: labsconsole was missing libapache2-mod-php5. puppet must have tried to upgrade a package unsuccessfully
01:22 mutante: planet back up (installed libapache2-mod-php5 which installed apache2-mpm-prefork and removed apache2-mpm-worker)
01:19 mutante: planet down - apache on singer, syntax error in site config "Invalid command 'php_admin_flag'"
01:03 mutante: fixing nrpe "unable to read output" raid check on srv197,207,243,244,253.. (nrpe running as wrong user)
March 14
23:16 maplebed: installed the swiftcleaner to run daily from iron. see root's crontab for more info.
20:41 binasher: disabled log_queries_not_using_indexes on all core dbs
20:33 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header for image support debugging'
20:30 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header for image support debugging'
19:29 maplebed: rebooting ms-be1 to enable hyperthreading (and make it the same as all the other ms-be hosts)
19:06 preilly: pushing x-images header for vary support
19:06 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing header for image support debugging'
19:05 logmsgbot: preilly synchronized php-1.19/extensions/MobileFrontend/MobileFrontend.body.php 'zero needs to add x-images to vary header'
18:58 maplebed: ms-be5 is back in rotation
18:31 preilly: push zero change for carrier testing
18:31 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing'
16:19 RobH: updating dns for new domain wikimediacommons.pt (nameservers not yet pointed at us)
16:04 logmsgbot: reedy synchronized wmf-config/CommonSettings.php 'add vcs for extdist updates'
13:03 RobH: cp1029-cp1035 all installed and ready for varnish deployment, puppet has been run
08:24 mutante: running "apt-get -f install" on snapshot3 to fix dpkg, which installed mysql-client- and client-core-5.1
08:02 mutante: stop/start memcached on srv254,srv255,srv257
07:51 mutante: restarting memcached on marmontel
07:51 mutante: fixing owa[1-3] Swift HTTP commands manually
03:44 mutante: ekrem - user agent "AppleDictionaryService" requests cause temp. WAP outage ..it seems
03:38 mutante: free some disk space on spence - deleted user.log.1 on spence, compressing messages.1, apt-get clean,...
02:52 RobH: cp1032-cp1035 reinstall issue wiped mbr causing issues, will reinstall in my AM
02:49 RobH: revoked, cp1032 is for some reason in grub error, and it's too late at night for me to work on it, will troubleshoot tomorrow
02:48 RobH: realized i forgot to log hours ago that cp1029-cp1036 are installed with puppet run, ready for varnish deployment tomorrow
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Wed Mar 14 02:17:13 UTC 2012
23:14 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing'
21:44 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero'
21:44 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.i18n.php 'changes for zero'
21:31 logmsgbot: asher synchronized wmf-config/db.php 'replacing db18 with new s7 slave db56'
21:19 binasher: started slaving db56 from db37
20:30 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.i18n.php 'changes for zero'
19:27 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero needed for carrier testing'
19:17 RobH: iron updated to use ipmi_mgmt script
19:08 preilly: pushing changes for zero to mswiki
19:08 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.i18n.php 'changes for zero'
19:08 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero'
19:05 binasher: streaming hotbackup of db1041 to db56 (new s7 slave replacing db18)
18:10 maplebed: failover successful, restarted pybal on lvs4, failback successful.
18:09 binasher: power cycling db1020, which also froze this morning
18:08 maplebed: stopping pybal on lvs4 - should fail over to lvs3
17:47 maplebed: pybal restarted on lvs3
17:47 binasher: power cycling db1040, crashed again
17:30 logmsgbot: reedy synchronized wmf-config/CommonSettings.php 'Bug 35183 - p include extensions/Renameuser/Renameuser.php instead of extensions/Renameuser/SpecialRenameuser.php'
17:12 mark: Sending all normally-pmtpa upload traffic to upload-lb.eqiad
17:05 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero'
16:59 preilly: add disable images support to mswiki under zero domain
16:59 logmsgbot: preilly synchronized wmf-config/CommonSettings.php 'add disable images option for mswiki on zero domain'
16:58 logmsgbot: preilly synchronized wmf-config/InitialiseSettings.php 'add disable images option for mswiki on zero domain'
16:46 logmsgbot: preilly synchronized wmf-config/InitialiseSettings.php 'add ZeroRatedMobileAccess extension to mswiki remove from mywiki'
16:44 mark: Sending traffic from Japan, India, Mexico to upload-lb.eqiad
16:37 LeslieCarr: reinstalling neon
16:23 apergos: stole some free space from the phys volume on ms1002 to give us more time for the rsync to keep going til after the move to swift etc
15:28 mark: Sending traffic from the USA to upload-lb.eqiad
15:27 mark: Rebooting lvs1005 with upgraded kernel/packages
15:12 LeslieCarr: manually deleted cp1025 info from nagios config file - nagios restored for now
14:51 mark: Sending traffic from Canada to upload-lb.eqiad
14:32 mark: Sending traffic from Brazil to upload-lb.eqiad
13:58 mark: Sending traffic from Argentina to upload-lb.eqiad
12:58 mark: Seeding the eqiad upload caches from live upload requests
11:59 mark: Setup squid logging to oxygen, with oxygen relaying to multicast 233.58.59.1
11:02 mark: Rebooting lvs1002 with kernel updates
10:17 mark: Rebooting manutius with newer 2.6.36 kernel to attempt avoiding i/o kernel bug with torrus
02:18 logmsgbot: LocalisationUpdate completed (1.19) at Tue Mar 13 02:18:03 UTC 2012
March 12
22:55 K4-713: synchronized payments cluster to r113679, and tweaked the anti-fraud rules
04:19 mutante: wanted to restart nagios-nrpe-server on spence with debug=1 to investigate permission issue. arr! "Address already in use" "cant write to pidfile", killed the one started on Feb18, and reordered allowed_hosts, spence talks to itself again now :p
03:40 mutante: same (and nscd) on fenari
03:35 mutante: upgrading libc6 and related packages on spence
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Mon Mar 12 02:17:28 UTC 2012
March 11
08:14 apergos: restarted lighttpd on dataset2
07:49 apergos: removed current htcp log file, restarted purger, it seems to be logging normally now
07:35 apergos: current ls shows 17416851456 2012-03-11 07:34 HTCPpurger.log while current du -sh shows 175M for /var/log. Sparse file that gets rotated badly? lots of leading nulls (many gb worth), why?
07:33 apergos: on ms1004 the HTCPpurger.log file after rotation was 17 gb, filling the disk. Removed it.
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Sun Mar 11 02:17:35 UTC 2012
March 10
22:09 Reedy: Make that wikimania2012, not wikimediawiki
19:28 binasher: set sync_binlog = 1 on all current masters and eqiad dbs
19:22 binasher: reslaved db1033
07:03 mutante: ran puppet on db1022, another one that works fine manually but somehow did not by itself
05:11 mutante: doing more (cp*, db*, msbe-* ,mw*) by hand / for loop
05:01 mutante: starting nagios-nrpe-server on all via dsh (fail to restart on config change issue)
02:16 logmsgbot: LocalisationUpdate completed (1.19) at Sat Mar 10 02:16:57 UTC 2012
01:07 maplebed: started swiftcleaner on owa1 looking for (and purging) bad objects
01:06 maplebed: rebalanced the swift rings to finish decreasing traffic sent to ms1 and ms2
00:18 Ryan_Lane: powercycling ssl1003
00:18 Ryan_Lane: powercycling ssl1001
March 9
20:34 notpeter: stopping search indexer on searchidx2 for fresh rsync to searchidx1001
19:58 preilly: pushed change to remove description from landing page
19:57 logmsgbot: preilly synchronized php-1.19/extensions/ZeroRatedMobileAccess/ZeroRatedMobileAccess.body.php 'changes for zero'
18:59 Ryan_Lane: sending test.m.wikipedia.org to the same place as test.wikipedia.org via squid
18:58 logmsgbot: awjrichards synchronized wmf-config/InitialiseSettings.php 'Fixing wgMobileUrlTemplate settings for domains that do not have .m. domains configured'
18:40 logmsgbot: awjrichards synchronized wmf-config/CommonSettings.php 'Changing the way in which wgMobileUrlTemplate is configurable by InitialiseSettings.php'
18:39 logmsgbot: awjrichards synchronized wmf-config/InitialiseSettings.php 'Disabling wgMobileUrlTemplate for testwiki - hopefully for real this time'
18:34 logmsgbot: awjrichards synchronized wmf-config/CommonSettings.php 'Making wgMobileUrlTemplate configurable by InitialiseSettings.php'
18:34 logmsgbot: awjrichards synchronized wmf-config/InitialiseSettings.php 'Disabling wgMobileUrlTemplate for testwiki'
17:32 maplebed: set swift storage device weight on ms2 to 0 and pushed out rings
15:52 apergos: cleared up a little bit of space on root partition of snapshot2, but that's about it. I hope we never have 3 versions of mw in test at the same time, the tmp caches will kill us
15:52 mark: Turned off vcc_err_unref on all varnish servers, so varnish doesn't complain when ACLs/probes/backends are unused
15:44 Jeff_Green: hume apt upgrades, puppetd --test, switch to mysql 5.1.53-fb3753-wm1
06:38 Ryan_Lane: reloading autofs on all labs instances
06:13 Tim: running svn cleanup on extdist trunk
04:18 Tim: switched php and wmf-deployment symlinks over to php-1.19 instead of php-1.18
23:56 logmsgbot: tstarling synchronized wmf-config/CommonSettings.php 'per-wiki memory limit configuration, with extra memory for zh* for converter tables'
23:55 logmsgbot: tstarling synchronized wmf-config/InitialiseSettings.php 'per-wiki memory limit configuration, with extra memory for zh* for converter tables'
18:40 binasher: deploying new squid frontend.conf to fix epic fail - all googlebot traffic was being redirected to mobile. now just if it's mobilegooglebot.
18:29 RoanKattouw: Applying AFTv5 schema changes on testwiki
18:27 RoanKattouw: Pushing new AFTv5 code to testwiki, do not sync to the live site just yet
17:46 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'ptwikipedia to ptwiki'
17:14 cmjohnson1: shutting down db18 for memory testing
16:57 RobH: search1014 still down per rt2483
16:47 maplebed: took ms-be5 out of rotation in the swift cluster - it's crashed 3 times now.
16:31 logmsgbot: reedy synchronized php-1.19/extensions/ExtensionDistributor/ExtensionDistributor_body.php 'Revert live hack because it works, will come in properly'
16:30 logmsgbot: reedy synchronized php-1.19/extensions/ExtensionDistributor/ExtensionDistributor_body.php 'Test for bug 27246'
16:16 RobH: search1008 repaired
15:52 RobH: mw1103 finally repaired and ready for os and such
14:48 pp-pdf1: installed python faulthandler 2.1
14:47 pp-pdf3: installed python faulthandler 2.1
14:47 pp-pdf2: installed python faulthandler 2.1
14:24 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 35012 - Namespace aliases for wikipedia and wikipedia-talk namespaces on Sanskrit wiki'
09:17 mutante: running puppet on mw1010 - finished quickly without problems - uh, wonder why Nagios reported puppet freshness then
08:22 mutante: cp1019 - Hitting F1 to continue reboot ( "Alert! System fatal error during previous boot")
08:21 mutante: cp1019 went down, then rebooted by itself (i think) after showing "idrac-8W82BP1 Severity: Non Recoverable, SEL:CPU Machine Chk: Processor sensor, transition to non-recoverable was asserted"
07:54 mutante: cadmium fixed by adding groups::wikidev
07:41 mutante: puppet on cadmium broken due to dependency Group[500] for User[catrope]
07:20 mutante: ms1004 ran out of disk - caused by 17G HTCPpurger.log.1, trying to gzip it now
20:17 Jeff_Green: yet another redirects.conf change, per RT#2498 redirect wikimedia.com-->wikimedia.org
20:05 binasher: reverted no-pagecache rsync on search nodes - without corresponding index warmup in lsearchd, it just pushes back the pain a bit and does more harm than good
20:04 binasher: deployed support for zero.wikipedia.org and carrier tagging to mobile varnish servers
10:41 logmsgbot: tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: switching zh* from 1.18 to 1.19
08:36 mutante: on hooper: puppet broken due to dependency Package[libapache2-mod-php5] for Service[apache2]
03:33 mutante: rebooting bast1001 for kernel upgrade
03:32 mutante: upgrading apache2 packages, base-files, kernel, several libs on bast1001
03:27 mutante: installing a couple upgrades on fenari (apache2-utils, update-manager-core, cvs, ruby, libxml*, libopenssl-ruby*...)
02:37 logmsgbot: LocalisationUpdate completed (1.18) at Tue Mar 6 02:37:06 UTC 2012
02:36 logmsgbot: tstarling synchronizing Wikimedia installation... : updating to r113119
02:18 logmsgbot: LocalisationUpdate completed (1.19) at Tue Mar 6 02:18:13 UTC 2012
01:27 Jeff_Green: manually updated packages and restarted apache on srv198, srv229, srv262, srv268, mw40 because their apache redirect configs failed to update after sync-apache and restart
01:07 Jeff_Green: another adjustment to redirects.conf and apache-graceful-all for RT#2488
March 5
22:24 Jeff_Green: modified redirects.conf per RT #2488
21:21 Reedy: Ran foreachwiki cleanupUploadStash.php
20:36 maplebed: enabled swift for 100% of thumbnails in production
13:50 mark: Set increased OSPF/OSPFv3 metric 30 on both directions of the link cr1-eqiad:xe-5/2/1 <--> cr1-sdtpa:xe-0/0/1, to combat higher than normal jitter and packet loss on the link
12:53 mark: Upgraded observium to latest version
09:41 mutante: restarting memcached on marmontel
09:40 mutante: restarting squid backend on knsq25
06:52 Ryan_Lane: all of the instances are accessing the file descriptors of files inside of the _base directory, and fuse has an issue with this. gluster can't recreate the base directory because of the processes holding open the old one.
06:50 Ryan_Lane: I've corrupted the _base directory on the instance's glusterfs share. I'm recovering the files from file descriptors using lsof. Not totally sure how I'm going to get the _base directory back, yet.
02:33 logmsgbot: LocalisationUpdate completed (1.18) at Mon Mar 5 02:33:04 UTC 2012
02:16 logmsgbot: LocalisationUpdate completed (1.19) at Mon Mar 5 02:16:39 UTC 2012
March 4
21:48 logmsgbot: reedy synchronized wmf-config/ 'Bug 32726 - Set =true for Commons'
16:55 RobH: ms-be4 boot order fixed, fixing ms-be5 & ms-be2
16:49 RobH: fixed boot order on ms-be3, fixing ms-be4
16:33 RobH: poking at bios on ms-be3
16:05 RobH: wikitech outage resolved
15:20 RobH: shutdown frdev offsite vm per email to engineering last week
15:18 RobH: backing up wikitech in hopes of upgrading some of its software
08:36 apergos: on ms1004, low on space, HTCPpurger.log.1 had about 16 gb of nulls before any real content, I tailed off the real stuff and tossed the original. The current log file has the same problem, why?
02:34 logmsgbot: LocalisationUpdate completed (1.18) at Fri Mar 2 02:34:34 UTC 2012
02:17 logmsgbot: LocalisationUpdate completed (1.19) at Fri Mar 2 02:17:51 UTC 2012
17:39 Jeff_Green: Removed >5GB /tmp/gmond.log on db25, db32, db33, db37
17:36 logmsgbot: hashar synchronized php-1.19/includes/EditPage.php 'r112819 - Bug 34849 diff during editing an old version compares to the old version instead of the current one'
17:36 Jeff_Green: Removed >5GB /tmp/gmond.log on db13
17:35 Jeff_Green: Removed >5GB /tmp/gmond.log on db11
17:25 Jeff_Green: Removed 5.3GB /tmp/gmond.log on db1018
17:24 Jeff_Green: Removed 5.3GB /tmp/gmond.log on db1017
17:13 Jeff_Green: Removed 4.8GB /tmp/gmond.log on db1008. Tried to resist urge to make snarky comment about ganglia but failed.
14:54 RobH: strontium server rebooting to set HT to enabled
14:26 mark: Moving bits traffic back from pmtpa to eqiad
14:24 mark: Cleared dnsmasq cache on virt2
14:16 mark: csw5-pmtpa: Mar 1 14:01:42:A:Power Supply 2 , 2nd from left, bad
14:14 mark: mr1-pmtpa rebooted/lost power for some reason
14:07 mark: pmtpa/sdtpa management network went down
13:54 mark: Pooled new eqiad bits servers strontium and palladium
12:45 logmsgbot: hashar synchronized php-1.19/includes/specials/SpecialWatchlist.php 'r111882 for Bug 34835 - watchlist shows times in UTC'
10:53 logmsgbot: tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: reverting sr* wikis back to 1.18 per Siebrand's recommendation due to bug 34832
21:00 logmsgbot: hashar synchronized php-1.19/extensions/ApiSandbox/ext.apiSandbox.js '(bug 34790) Pressing "Make Request" should not make two requests to api.php'
20:58 Ryan_Lane: restarting nova-compute on virt4
19:56 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Turn wmgReduceStartupExpiry on for wikipedia projects, off for nl/pl wiki ahead of tonights deploy'
19:33 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Turn wmgReduceStartupExpiry off by default. Needs to go to wikipedia only later on'
19:23 logmsgbot: aaron synchronized wmf-config/PrivateSettings.php 'updating swift auth'
13:55 mark: Reinstalled strontium and palladium with hw raid1 and fully automatic lvm based partman recipe
12:51 schmir: upgraded mwlib to 0.13.5 on pdf cluster
11:35 logmsgbot: hashar synchronized wmf-config/codereview.php 'CodeReview: autodefers /trunk/extensions/ParserFun[/$] so ParserFunctions is not deferred'
10:21 logmsgbot: hashar synchronized php-1.19/extensions/ApiSandbox 'ApiSandBox: r112114: show request time'
03:25 maplebed: took swift out of rotation - thumbnails now served by ms5
02:34 logmsgbot: LocalisationUpdate completed (1.19) at Wed Feb 29 02:34:48 UTC 2012
02:18 logmsgbot: LocalisationUpdate completed (1.18) at Wed Feb 29 02:18:05 UTC 2012
02:02 binasher: manually set large rmem_max and rmem_default on locke and restarted udp2log to stem packet loss, opened an rt ticket to fix the (lost) fix
20:22 logmsgbot: hashar synchronized php-1.19/extensions/CodeReview/backend/DiffHighlighter.php 'Special:CodeReview fix HTML entities showing up in diff output r112459r112464'
20:18 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Put oldwikisource/sourceswiki on 1.19wmf1
19:42 RobH: adjusted thresholds for ps1-b4-sdtpa.mgmt.pmtpa.wmnet again, bottom sensor set to high
18:04 logmsgbot: aaron synchronized multiversion 'deployed all changes through HEAD'
17:58 logmsgbot: catrope synchronized php-1.19/LocalSettings.php 'Guard against /home/wikipedia not existing'
17:51 RoanKattouw: And of course I can't commit this because the code in /h/w/common/multiversion hasn't been updated this calendar year and there are undeployed commits from January *grumble*
17:47 notpeter: stopping indexing on searchidx2 to rsync over a clean copy of index to searchidx1001
17:45 logmsgbot: catrope synchronized multiversion/MWVersion.php 'Add file_exists check for /home before trying to access /home'
13:40 mark: Fixed directory permissions of /srv/swift-storage/{sda4,sdb4} on ms-be1
13:35 mark: Copied swift ring builder files from ms-fe1 to all swift hosts
01:03 logmsgbot: tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: switching wikisource to 1.19 except for wikis with FlaggedRevs enabled
00:47 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Put all wikiversity on 1.19wmf1
00:24 logmsgbot: tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews back to 1.18
00:23 logmsgbot: tstarling rebuilt wikiversions.cdb and synchronized wikiversions files: wikinews back to 1.18
00:19 logmsgbot: tstarling synchronized wmf-config/CommonSettings.php 'reduce startup module expiry time for all projects except wikipedia'
00:19 logmsgbot: tstarling synchronized wmf-config/InitialiseSettings.php 'reduce startup module expiry time for all projects except wikipedia'
00:05 K4-713: synchronized payments cluster to r112275
February 23
23:54 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Put all wikinews' on 1.19wmf1
20:50 logmsgbot: aaron synchronized wmf-config/StartProfiler.php 'Sample thumb.php traffic properly. Removed 'bigpage' and 'incubatorslowness' hacks. Split thumbnail group by MW version and added some more code comments.'
20:42 logmsgbot: aaron synchronized wmf-config/StartProfiler.php 'updated profiling class code paths to post-1.17 locations'
20:34 logmsgbot: aaron synchronized wmf-config/StartProfiler.php 'Sample thumb.php traffic properly. Removed 'bigpage' and 'incubatorslowness' hacks. Split thumbnail group by MW version and added some more code comments.'
22:20 logmsgbot: catrope synchronized wmf-config/CommonSettings.php 'Comment out hack that enabled $wgResourceLoaderExperimentalAsyncLoading for logged-in users'
21:36 Ryan_Lane: force-running puppet on every labs instance
15:39 logmsgbot: reedy synchronized php-1.19/extensions/CentralAuth/CentralAuth.php 'Enable AntiSpoof for CentralAuth on all 1.19 wikis again, doesn't break signup with a mass of fail'
15:37 logmsgbot: reedy synchronized php-1.19/extensions/CentralAuth/CentralAuth.php 'Enable AntiSpoof for CentralAuth on testwiki only'
04:42 Tim: on db40: setting innodb-use-purge-thread=4 to test multithreaded purge
04:12 notpeter: disabling search lvs1 check because it's going to false-positive in 4 hours...
02:34 logmsgbot: LocalisationUpdate completed (1.19) at Tue Feb 21 02:34:36 UTC 2012
02:17 logmsgbot: LocalisationUpdate completed (1.18) at Tue Feb 21 02:17:36 UTC 2012
February 20
22:14 logmsgbot: hashar synchronized php-1.19/includes/logging/PatrolLog.php 'r111969 - bug 34495 patrol log credit the user patrolled, not the user patrolling'
21:10 rainman-sr: shut down lucene on search15, comes up with some strange errors "Connection refused to host: 10.0.3.15"
15:56 notpeter: initial test-spinup of searchidx1001 and search1001-1006 (en cluster)
14:56 logmsgbot: hashar synchronized php-1.19/skins/simple/main.css 'r111580 for Bug 34397: align footer so that it does not overlap with sidebar in Simple skin'
13:36 logmsgbot: hashar synchronized php-1.19/includes/UserMailer.php 'r111925 for bug 34421 duplicate Subject / wrong To: headers in mail'
03:06 notpeter: re-enabling notifications for search-pool1 and search-pool3, search-pool2 still flapping very badly
02:36 logmsgbot: LocalisationUpdate completed (1.19) at Mon Feb 20 02:36:36 UTC 2012
02:18 logmsgbot: LocalisationUpdate completed (1.18) at Mon Feb 20 02:18:57 UTC 2012
February 19
13:47 notpeter: disabling notifications for search lvs... if anyone still has their phone on
02:34 logmsgbot: LocalisationUpdate completed (1.19) at Sun Feb 19 02:34:11 UTC 2012
02:17 logmsgbot: LocalisationUpdate completed (1.18) at Sun Feb 19 02:17:17 UTC 2012
08:56 apergos: restarted all the searchpool1 lsearchds
02:35 logmsgbot: LocalisationUpdate completed (1.19) at Sat Feb 18 02:35:12 UTC 2012
02:18 logmsgbot: LocalisationUpdate completed (1.18) at Sat Feb 18 02:18:33 UTC 2012
01:59 logmsgbot: aaron synchronized php-1.19/extensions/CentralAuth/CentralAuth.php 'disabled AntiSpoof hooks which broken account creation with DB errors'
01:00 logmsgbot: andrew synchronizing Wikimedia installation... :
00:58 Andrew: Running scap to ensure a consistent environment
00:51 logmsgbot: andrew synchronized php-1.19/resources/mediawiki/mediawiki.user.js 'Attempt to repush r111695'
00:44 logmsgbot: andrew synchronized php-1.19/resources/Resources.php 'Deploy r111809'
18:06 maplebed: sending thumbnail traffic back to ms5, taking swift out of production
18:00 notpeter: temporarily stopping puppet on brewster. please let me know if you need to turn it back on
17:52 maplebed: changing squids to send 100% of thumbnail traffic to swift
16:58 maplebed: turned swift live for 50% of all thumbnail requests
15:02 logmsgbot: hashar synchronized php-1.19/LocalSettings.php 'make jobrunners for testwiki to use the apache CommonSettings file instead of the non existant /home/wikipedia one'
05:35 Tim: on db40: reduced to 10M, should be causing massive delays, but the site's not down and the purge rate is lower if anything. Going to disable the mysql parser cache entirely.
05:25 Tim: on db40: purge lag is still increasing at 108 per second, so reducing innodb_max_purge_lag to 50M
05:21 Tim: on db40: giving the innodb manual the benefit of the doubt and following its advice, setting innodb_max_purge_lag to 100M, which should give a delay of 4.5ms
05:13 Tim: killing purgeParserCache.php since it is probably doing more harm than good
02:43 maplebed: deployed updated thumb_handler.php to ms5 to include Content-Length in generated images
02:34 logmsgbot: LocalisationUpdate completed (1.19) at Fri Feb 17 02:34:32 UTC 2012
02:29 Ryan_Lane: installed labstore1-4
02:17 logmsgbot: LocalisationUpdate completed (1.18) at Fri Feb 17 02:17:35 UTC 2012
02:16 Tim: on db40: truncating pc008 - 15
02:01 binasher: db1035 is replicating again
01:12 Ryan_Lane: re-enabled the mobile plugin for the blogs, seems w3 total cache supports varying
01:03 Ryan_Lane: disabling mobile skin for the blogs - we need to fix varnish support first
00:38 Ryan_Lane: fixed singer by adding in ssl configuration to the planet configuration
February 16
23:53 logmsgbot: catrope synchronized wmf-config/CommonSettings.php 'Tweak live hack so async loading is enabled for all logged-in users on all 1.19 wikis'
23:04 binasher: db1035 is fubar after crashing during schema migrations, running a hotbackup from db1019
22:59 logmsgbot: catrope synchronized wmf-config/CommonSettings.php 'Also give User:Cmcmahon experimental async loading on meta'
22:37 logmsgbot: catrope synchronized wmf-config/CommonSettings.php 'Add live hack to enable $wgResourceLoaderExperimentalAsyncLoading on meta only for me (User:Catrope)'
22:35 binasher: db1035 died 2 days ago, attempting to power cycle
22:26 binasher: adding search15 to search-pool2 lvs vip
22:23 Ryan_Lane: restarting pdns ns2
22:17 logmsgbot: catrope synchronized wmf-config/InitialiseSettings.php 'Enable wgResourceLoaderExperimentalAsyncLoading on test2wiki'
21:45 apergos: singer certificate issues, looks like
21:18 apergos: the most recent apache update (thanks puppet) must have broken things on singer. the url.wm.o config wants /srv/org/wikimedia/url/ but I have no idea what that service ever did or what is supposed to be in there. will someone who knows this undocumented information please check it? thanks.
21:16 notpeter: stopping mysql and apache on searchidx2... not sure why they are there. also, going to clean up some packages... like the ubuntu version of mediawiki
22:10 binasher: restarted the 1.19 schema migration script - it's going to hit the just rotated s3 (db34), s2 (db30), s7 (db16), and s4 (db31) ex-masters before resuming s5 (db55) and all s6/s1 slaves
02:48 logmsgbot: tstarling synchronized wmf-config/AdminSettings.php 'remove $wgUseRootUser and $wgUseNormalUser, broken since 1.17 and 1.16 respectively'
02:41 logmsgbot: reedy synchronized php-1.19/extensions/Contest/Contest.php 'Comment out stupid die for the moment'
22:10 binasher: pulled db38 from enwiki, running normal "alter table revision add rev_sha1" and on db1043, the pt-online-schema-change equiv (with --chunk-size=1000, --sleep=0.1) to compare timing
22:03 logmsgbot: asher synchronized wmf-config/db.php 'pulling db38 from enwiki for revision alter timing'
21:38 maplebed: initial deploy of swift to serve thumbnails is complete
21:27 maplebed: deployed new squid.conf to enable swift for all thumbs with /a/a2 in the URL
20:04 binasher: running mk-slave-prefetch on db1018 which was down for 5 days to see if it can catch up
19:28 maplebed: swift deploy aborted due to squid config issues
18:42 maplebed: deploying squid config change to put swift in service for all thumbnails with /a/a2 in the URL
16:49 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 34043 - Change logo at Indonesian Wikibooks'
16:46 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 34009 - Translation of project namespace and site-name for ta.wikiquote'
16:22 Reedy: Updated user_former_groups.ufg_group to 32 characters
14:10 RoanKattouw: Finally fixed ownership of cache/l10n on scalers, sync-l10nupdate only throws the expected errors, no more perms errors on the scalers
14:09 RoanKattouw: Scalers now have disk space available because php-1.17-test is gone
13:59 logmsgbot: catrope synchronizing Wikimedia installation... : Deleted php-1.17-test on fenari, running scap to delete it on the Apaches as well
13:49 RoanKattouw: Deleting /home/wikipedia/common/php-1.17-test, has been unused for a long time
13:45 RoanKattouw: Deleting /tmp/mw-cache-1.17 on srv219 and srv223
13:44 RoanKattouw: srv219-224 have a full disk according to rsync
13:38 RoanKattouw: Fixing ownership of /usr/local/apache/common-local/php-1.18/cache/l10n on srv191, srv199, srv219-224
13:35 RoanKattouw: Running sync-l10nupdate again to investigate rsync errors
13:34 logmsgbot: LocalisationUpdate completed (1.18) at Thu Feb 2 13:34:53 UTC 2012
13:12 RoanKattouw: Running l10nupdate by hand to hopefully fix bug 33768
13:11 logmsgbot: catrope synchronized php-1.18/extensions/LocalisationUpdate/LocalisationUpdate.class.php 'Deploy live-hacked version that will hopefully fix bug 33768'
10:00 Tim: reinserted the deleted site_stats row for plwiki
09:35 Tim: killing statistics queries on all s2 slave servers
09:34 logmsgbot: tstarling synchronized php-1.18/includes/SiteStats.php 'disabling even more'
18:27 RobH: ganglia1002 back online ready for install
18:26 RobH: ganglia1002 mgmt offline per rt 2247, system was unplugged... no idea why
18:24 cmjohnson1: pulled drive 2 db47
18:15 RobH: cp1019 memory error repaired, now it is ready for OS install
18:14 RobH: cp1017 memory error repaired
17:54 RobH: updated dns for payments boxen renames in eqiad
17:37 RobH: cp1014 memory was improperly installed (from factory?), installed in supported configuration and system is now ready for OS install per RT2351
17:08 RobH: investigating errors on cp1014
17:04 RobH: cp1019 console redirection fixed per rt2353, ready for OS install
16:37 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'wikilove default on fawikis'
16:22 RobH: dataset1001 down for controller replacement
16:07 mark: Removed now obsolete package wikimedia-task-squid from the karmic-wikimedia and lucid-wikimedia APT repositories, and deleted in svn.wikimedia.org
15:45 logmsgbot: reedy synchronized php/cache/interwiki.cdb 'Try a quietened dumpInterwiki script'
14:45 mutante: running authdns-update to remove oldusability
14:30 mutante: shutting down "oldusability" linode instance
13:44 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Point wgInterwikiCache at interwiki.cdb'
06:34 Tim: added myself to the gerrit "administrators" group
05:23 Tim: the segfaults didn't stop, so I'm disabling wmerrors entirely for now
05:13 Tim: since puppet is broken, disabled wmerrors backtrace logging by adding a separate configuration file in /etc/php5/conf.d and reloading apache
02:06 logmsgbot: LocalisationUpdate completed (1.18) at Tue Jan 31 02:06:09 UTC 2012
January 30
23:56 awjr: synchronized i18n files for DonationInterface on payments cluster to r110342
23:46 Ryan_Lane: moving instances from virt2 to virt1 to rebalance compute cluster
19:48 logmsgbot: awjrichards synchronizing Wikimedia installation... : Syncing CentralNotice to r110026 of trunk, includes important fix for 1.19 compatibility
23:49 logmsgbot: asher synchronized wmf-config/db.php 'adding db26 to s7'
23:35 pgehres: re-enabled queue consumption for payments through Jenkins
23:35 pgehres: awjr synchronized CiviCRM on aluminium to r1211
23:05 logmsgbot: awjrichards synchronized wmf-config/CommonSettings.php 'Checking if wmgDisplayFeedsInSidebar === false rather than true, since it defaults to true in the install file'
00:32 logmsgbot: asher synchronized wmf-config/db.php 'setting db52 to full weight'
00:19 logmsgbot: asher synchronized wmf-config/db.php 'adding new enwiki slave db52, with a low weight'
00:08 preilly: push weekly mobile frontend update
00:08 logmsgbot: preilly synchronized php-1.18/extensions/MobileFrontend/MobileFrontend.php 'weekly update to Mobile Frontend'
January 19
23:56 Ryan_Lane: changed global roles netadmins and sysadmins to be virtual static groups in ldap that autopopulate with any user that has objectclass=novauser
23:15 Tim: rebuilt wikidiff2 with package name php-wikidiff2, removed lucid package php5-wikidiff2 from apt using "reprepro remove"
22:52 Tim: recompiled wikidiff2 and put the new version up on apt.wikimedia.org
21:51 Jeff_Green: starting conversion of fundraisingdb 'faulkner' tables from myisam to innodb, expect replication delays
21:12 binasher: starting slaving db52 from db36, running hotbackup of db32 to db53
03:56 logmsgbot: neilk synchronizing Wikimedia installation... : deploying CongressLookup (for i18n reasons, not deploying to enwiki)
03:28 logmsgbot: laner synchronized wmf-config/InitialiseSettings.php 'Removing restriction of display title for SOPA landing pages'
02:35 binasher: cp1039-40 are now in service for mobile wikipedia
02:04 logmsgbot: LocalisationUpdate completed (1.18) at Wed Jan 18 02:04:57 UTC 2012
01:35 RobH: cp1040 and cp1036 ready for use
01:33 RobH: cp1037, cp1038, cp1039 os installed, varnish partitions mounted, and puppet run
January 17
22:47 binasher: ram only varnish instance now running on marmontel in front of apache/wordpress
22:07 Ryan_Lane: installing memcache on marmontel
20:54 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33769 - Allow bureaucrats to remove sysop rights at Bashkir Wikipedia'
20:13 mutante: en.planet updates were stuck. reason was corrupted cache causing "bsddb.db.DBPageNotFoundError" which broke update script. solution was to kill stuck updates, delete files in cache dir and run update manually
18:05 RobH: theme updated on blog along with setting limit back to 20 comments per page
17:46 RobH: aware of blog slowdowns, work is being done
17:35 mutante: also upgraded drac firmware on mw1081 & mw1099 (fixes mgmt console problem)
16:45 mutante: upgrading drac firmware on mw1108
15:54 RobH: db43 rebooting
15:27 RobH: db7 shutting down for decom, not listed in db for any clusters, load .01
11:05 logmsgbot: neilk synchronized wmf-config/ExtensionMessages-1.18.php 'added CongressLookup to ExtensionMessages-1.18 for i18n'
11:04 logmsgbot: neilk synchronized wmf-config/extension-list 'added CongressLookup to extension-list for i18n'
10:30 logmsgbot: neilk synchronizing Wikimedia installation... : deploying CongressLookup. We are not deploying to any live wiki, just test, but this is to make i18n work
10:28 logmsgbot: neilk synchronized wmf-config/InitialiseSettings.php 'added CongressLookup to InitialiseSettings'
20:07 LeslieCarr: reassigning ports on asw-b-sdtpa
17:00 notpeter: stop sodium to do manual reinstall
16:33 RobH: adjusting all power strip humidity sensor 2 (floor level) to 12% humidity, as the center rack has the proper levels, floor levels always are low in humidity.
16:17 mutante: after a config change to nrpe_local.cfg and puppet applying the change, the service was not restarted but for some reason all nagios-nrpe-server caught SIGTERM. manually applying the same config change does not cause problems. that caused a Nagios outage until nrpe servers were started again (via dsh)
16:04 mutante: starting nagios-nrpe-server on ALL via dsh to speed up nagios recovery
15:33 mutante: starting nagios-nrpe-server on srv's via dsh
02:04 logmsgbot: LocalisationUpdate completed (1.18) at Thu Jan 12 02:04:31 UTC 2012
00:48 preilly: pushing quick fix for special random
00:48 logmsgbot: preilly synchronized php-1.18/extensions/MobileFrontend/MobileFrontend.php 'update to mobile frontend to fix random link'
00:41 LeslieCarr: added ganglia1002 and ganglia1001 to dns
January 11
23:18 RobH: searchidx1001 offline and powered down until replacement memory arrives (2012-01-13) rt 2208
22:56 RobH: poking searchidx1001 for memory error
22:45 RobH: mw1108 online and ready for install per rt2253
22:42 RobH: mw1099 repaired, ready for os install per rt2252
02:01 logmsgbot: LocalisationUpdate completed (1.18) at Mon Jan 9 02:04:47 UTC 2012
January 8
23:20 Reedy: For some reason cp1001-1042 weren't listed in CommonSettings.php XFF, but (at least) 1042 was in service, meaning edits were attributed to it
23:10 logmsgbot: reedy synchronized wmf-config/CommonSettings.php 'Add cp1042 to XFF'
21:47 rainman-sr: killed broken search indexer thread on searchidx1 (please note searchidx1 is no longer in use!), and restarted incremental indexing on searchidx2 which was somehow broken
21:43 rainman-sr: someone started incremental updating on searchidx1 ??!!
14:54 apergos: removed old puppet lockfile on brewster, ran by hand
14:47 apergos: cleared out some very large squid logs on brewster, (basically all of them) plus lighty logs, disk was full. restarted squid manually
02:01 logmsgbot: LocalisationUpdate completed (1.18) at Sun Jan 8 02:05:11 UTC 2012
00:43 tfinc: killing long running show_bug.cgi procs on kaulen
January 7
22:30 Reedy: Users reporting slowness while editing. dberror.log shows a few mysql errors for enwiki master and slaves. Few errors on other wikis, mainly enwiki
02:01 logmsgbot: LocalisationUpdate completed (1.18) at Sat Jan 7 02:05:09 UTC 2012
January 6
23:22 RobH: working rt1549 lvs1003 may flap, it is presently not in service due to possible hdd failure
22:55 binasher: db22 is back in s4
22:55 logmsgbot: asher synchronized wmf-config/db.php 'adding db22 back to s4'
21:41 RobH: db1029 powering back up with ssd testing hardware installed
21:35 RobH: db1029 coming down for ssd testing
21:26 RobH: cp1014 and cp1019 hdd controller cables replaced (removed for testing controllers), both can be used normally
21:19 binasher: restoring db22 from a live hotbackup of db1038
21:18 RobH: es1002 back ready for service use per #2220: replace original RAID card in es1002
21:05 binasher: putting db51 into production as an s4 slave
21:05 logmsgbot: asher synchronized wmf-config/db.php 'adding db51 as an s4 slave'
20:57 binasher: started slaving db51 off of db31
20:21 RobH: rt2226 - redeploy db22 for asher
20:19 RobH: db22 reinstalled and booting into OS. No puppet runs yet, now it's Asher's problem ;]
20:04 RobH: db22 reinstalling
19:24 binasher: started innodb hot backup of db1038 to db51
18:37 maplebed: pushed out new db.php setting s4 to read-write
18:37 logmsgbot: ben synchronized wmf-config/db.php
18:35 maplebed: db31 made read-write as the new master for s4
18:31 maplebed: old master for s4 log file db22-bin.000106 log pos 631618956
18:30 maplebed: new master for s4: db31, log file db31-bin.000213 log pos is 205612709
18:24 logmsgbot: asher synchronized wmf-config/db.php 'setting s4 to read only, preparing to make db31 master'
18:21 Reedy: Commons having db issues, db22 (s4 master) has a disk issue
16:02 apergos: restarted lighty on dataset2
16:01 Reedy: HTTP server (lighttpd?) seems to be down on dataset2
15:46 RoanKattouw: Removing gs_* files in /tmp on srv220 that are >30 min old
15:44 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33556 - ArticleFeedback settings on Chinese wikipedia'
15:43 RoanKattouw: Removed /tmp/mw-cache-1.17 and /tmp/mw-cache-1.17-test on srv220
15:41 Reedy: srv220 / is at 100% usage
15:41 logmsgbot: reedy synchronized wmf-config/InitialiseSettings.php 'Bug 33556 - ArticleFeedback settings on Chinese wikipedia'
14:34 mutante: saw the log about cp1043/44 being deliberately left broken, but requirement in varnish.pp also broke others, fixed on sq67,68,69 (gerrit change 1802)
02:01 logmsgbot: LocalisationUpdate completed (1.18) at Fri Jan 6 02:05:01 UTC 2012
01:25 binasher: puppet is being deliberately left broken on cp1043 and 1044 until tomorrow
01:23 binasher: backend varnish instance on cp1042 running 3.0.2 is in production for 1/3 of mobile requests
18:00 mutante: tarin - added "#includedir /etc/sudoers.d" to sudo config, needs to read /etc/sudoers.d/nrpe for Nagios RAID check
17:49 logmsgbot_: hashar: gallium: cleaned /tmp . Our test suites leak a large amount of files :D
17:49 ^demon: removed chuck norris plugin from jenkins, restarted
16:48 mutante: payments4 - 25 running nginx procs cause a warning - but normal and just raise limit?
16:15 mutante: people claim it was "completely resolved with 2.6.38-10 backport from PPA." (add-apt-repository ppa:kernel-ppa/ppa ...). wanna try that? (or just reboot ms1002 pls)
15:49 mutante: quotes on kswapd problem (that also appeared on other servers): "has nothing to do with swap space or memory".."the kernel process which swaps tasks".."means the kernel is spending more time context switching tasks than it is actually executing the tasks".."you're chasing a ghost if you're trying to tune your swap/memory environment"
15:39 mutante: Nagios check_ntp does stuff like: overall average offset: 0 -> NTP OK: Offset unknown| -> NTP CRITICAL: Offset unknown (even though this bug was supposed to be fixed in a version before the one we use)..sigh
15:14 mutante: lvs1004 - puppet hasn't run for 12 hours, looked stuck, "already in progress" on every run. rm /var/lib/puppet/state/puppetdlock, restart puppet agent, finished fine in a few seconds. maybe puppet bug 2888,5246 or related
14:57 mutante: magnesium - memcached runs on default port 11211, but we run all the others on 11000, this causes Nagios CRIT. Is it supposed to run here? (was also on -l 127.0.0.1 only, but init script starts it on all)
14:55 Jeff_Green: searchidx1 /a reached 100%, did the "space issues" maintenance procedure from wikitech search documentation
14:39 mutante: same on srv193
14:35 mutante: srv290 - before restart memcached was running with -m 64 and -l 127.0.0.1 for some reason, causing Nagios CRIT, now it looks like others and recovered
14:32 mutante: restarting memcached on srv290
02:01 logmsgbot: LocalisationUpdate completed (1.18) at Thu Jan 5 02:05:03 UTC 2012
18:30 logmsgbot: catrope synchronized wmf-config/CommonSettings.php 'Actually bump version number'
18:28 logmsgbot: catrope synchronized php-1.18/resources/mediawiki/mediawiki.user.js 'Revert live hack'
18:24 logmsgbot: catrope synchronized wmf-config/CommonSettings.php 'and bump the version number too'
18:22 logmsgbot: catrope synchronized wmf-config/CommonSettings.php 'Enable tracking for AFTv5 bucketing'
18:06 mutante: duplicate nagios-wm instances on spence (/home/wikipedia/bin/ircecho vs. /usr/ircecho/bin/ircecho) killed them both, restarted with init.d/ircecho
18:00 logmsgbot: catrope synchronized php-1.18/resources/mediawiki/mediawiki.user.js 'Live hack for tracking a percentage of bucketing events'
17:52 mutante: knsq11 is broken. boots into installer, then "Dazed and confused" at hardware detection (NMI received for unknown reason 21 on CPU 0). -> RT 2206
07:40 Tim: fixed puppet by re-running the post-merge hook with key forwarding enabled, and then started puppet on ms6
07:32 Tim: on ms6.esams: fixed proxy IP address and stopped puppet while I figure out how to fix it
03:25 Tim: experimentally raised max_concurrent_checks to 128
03:17 Tim: on spence in nagios.cfg, reduced service_reaper_frequency from 10 to 1, to avoid having a massive process count spike every 10 seconds as checks are started. Locally only as a test.
02:27 Ryan_Lane: I should clarify that I removed 10.2.1.13 from /etc/network/interfaces, it's still properly bound to lo
02:24 Tim: on spence: setting up logrotate for nagios.log and removing nagios-bloated-log.log
02:22 Ryan_Lane: removing manually added 10.2.1.13 address from lvs4
02:01 logmsgbot: LocalisationUpdate completed (1.18) at Wed Jan 4 02:04:57 UTC 2012
01:43 Nemo_bis: Last week slowness: job queue backlog now cleared on !Wikimedia Commons and (almost) English !Wikipedia http://ur1.ca/77q9b