15:33 logmsgbot: demon Synchronized docroot/bits/favicon/wikipedia.ico: Favicons are my favorite icons, especially when they're only 18% of the size of the original (duration: 00m 04s)
15:16 logmsgbot: demon Synchronized php-1.25wmf1/extensions/Wikidata: (no message) (duration: 00m 11s)
15:14 logmsgbot: demon Synchronized php-1.25wmf1/extensions/VisualEditor: (no message) (duration: 00m 08s)
15:12 andrewbogott: running sync-common on virt1000
15:12 logmsgbot: demon Synchronized visualeditor.dblist: (no message) (duration: 00m 04s)
15:11 logmsgbot: demon Synchronized visualeditor-default.dblist: (no message) (duration: 00m 04s)
15:06 logmsgbot: demon Synchronized wmf-config/Wikibase.php: (no message) (duration: 00m 04s)
15:06 logmsgbot: demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s)
14:26 _joe_: restarted apache on mw1196, lots of apc errors
14:22 logmsgbot: oblivian gracefulled all apaches
12:10 mark: Stopped exim daemon on mchenry
09:41 godog: removed obsolete /etc/puppet/hiera from strontium and palladium, /etc/puppet/hieradata is the new location
09:24 godog: reboot ms-be2001 as a test
04:18 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 30 04:18:48 UTC 2014 (duration 18m 47s)
03:17 logmsgbot: LocalisationUpdate completed (1.25wmf1) at 2014-09-30 03:17:15+00:00
02:41 logmsgbot: ori Synchronized 503.html: Ia88b306ef: Make the 503 error page consistent with other 5xx error pages (duration: 00m 08s)
02:34 logmsgbot: LocalisationUpdate completed (1.24wmf22) at 2014-09-30 02:34:07+00:00
01:00 Krinkle: Jenkins connection seemed in order with integration-slave1007 and 8, but disconnecting and relaunching the slave agents immediately resulted in them getting jobs assigned. Cause unknown, problem resolved for now.
00:58 Krinkle: integration-slave1007 and integration-slave1008 have not gotten any jobs in the past 24h. integration-slave1006 however has gotten loads of action. Investigating load balancing issue.
21:58 mutante: stopping udp2log-vumi on silver - not needed anymore per Yuvipanda
21:12 Reedy: elasticsearch upgraded to 1.3.2 on logstash1001
20:50 bd808: Ran sync-common on tmh1002.eqiad.wmnet for cscott's failed sync-dir there
20:49 bd808: Ran sync-common on tmh1001.eqiad.wmnet for cscott's failed sync-dir there
20:29 logmsgbot: cscott Synchronized wmf-config: Switch default PDF renderer to OCG (duration: 00m 15s)
20:04 subbu: deployed Parsoid version deed30b2
19:41 ottomata: restarted varnishkafka on cp3019 to troubleshoot drerrs
19:26 Reedy: doing rolling upgrade of elasticsearch on logstash100[1-3]
17:59 cscott: updated OCG to version 89d8f29a24295b05d0643abe976fea83b56575c9
17:58 logmsgbot: ori Synchronized php-1.24wmf22/includes/password/Pbkdf2Password.php: I3b0a1de69: Test for string in Pbkdf2Password::crypt() (duration: 00m 05s)
17:47 bblack: stopped powerdns and disabled puppet on virt1000 to prevent further cache pollution w/ bad data in public caches
08:28 awight: skip over wmf_civicrm schema migration 7022 -- *why* did I make that unsafe
08:24 awight: fundraising_code_update: revision for civicrm changed from 06c9546f9b68f6ecbaaf510944418aa52f9ed0fb to 5aca00fd4573f0fe8f385baa7238172f6ae54438
08:19 awight: disabling CRM jobs during deployment
08:09 cscott: cleared OCG queues and cache to quiet icinga; will try to get to the root cause tomorrow.
22:58 logmsgbot: ori Synchronized php-1.24wmf22/extensions/Wikidata: Update Wikidata for I0acd2096d21b (duration: 00m 11s)
21:41 mutante: powercycling mw1053
20:36 mutante: no !log
20:36 legoktm: manually migrated "NickK" to a global account
20:29 mutante: repooled mw1051
19:49 bd808: Restarted logstash on logstash1001. udp2log events were not being recorded.
19:30 logmsgbot: reedy Synchronized php-1.25wmf1/: (no message) (duration: 00m 46s)
19:24 logmsgbot: reedy Synchronized php-1.24wmf22/resources/src/mediawiki.ui/components/buttons.less: (no message) (duration: 00m 14s)
19:22 bblack: ntp work done on hosts
19:18 logmsgbot: reedy Synchronized php-1.25wmf1/: (no message) (duration: 00m 55s)
19:17 logmsgbot: reedy Synchronized php-1.24wmf22/extensions/CentralAuth/: (no message) (duration: 00m 14s)
18:55 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 14s)
18:47 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 14s)
18:20 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.25wmf1
18:08 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.24wmf22
17:20 logmsgbot: reedy Finished scap: testwiki to 1.25wmf1 and build l10n cache (duration: 28m 36s)
16:52 logmsgbot: reedy Started scap: testwiki to 1.25wmf1 and build l10n cache
16:41 Reedy: Purged php-1.24wmf9
16:38 logmsgbot: reedy Purged l10n cache for 1.24wmf20
15:31 bblack: testing ntpd changes on acamar, achernar, chromium, hydrogen, nescio, and baham (puppet-agent disabled)
15:19 logmsgbot: mattflaschen Synchronized wmf-config/CommonSettings.php: Extend GettingStarted bucketting period end date to Sept. 28 (duration: 00m 07s)
12:36 godog: update bash on elastic1014 analytics1021 elastic1013
11:33 _joe_: gracefully reloaded apache on mw1139 and mw1199, apc issues
23:12 greg-g: restarted jouncebot, he wasn't announcing deploy windows
23:00 mutante: OCG - scheduled downtime/disabled notifications for LVS check
22:44 andrewbogott: salted a bash update on labs instances, which turned out to be updated already.
22:09 cscott: icinga VS HTTP IPv4 on ocg.svc.eqiad.wmnet test is most likely due to `du -s` of a 6G cache directory, not critical. timeouts can be increased to quiet it. i will look into adding a -quick parameter or some such tomorrow to make the health check faster.
20:56 cscott: updated OCG to version 48acb8a2031863e35fad9960e48af60a3618def9
18:53 logmsgbot: reedy Synchronized php-1.24wmf22/extensions/WikimediaMaintenance: (no message) (duration: 00m 14s)
17:13 manybubbles: lowered the throttle on Elasticsearch index transfer speed from one node to another because I hate excitement
15:38 Nemo_bis: cscott> i'm working on the OCG health issue above. i'll let you know when i know what's going on. icinga-wm> PROBLEM - OCG health on ocg1002 is CRITICAL
15:37 logmsgbot: demon Synchronized php-1.24wmf22/extensions/CentralAuth: (no message) (duration: 00m 05s)
15:21 logmsgbot: demon Synchronized php-1.24wmf22/extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php: (no message) (duration: 00m 05s)
15:01 logmsgbot: demon Synchronized wmf-config/Wikibase.php: (no message) (duration: 00m 06s)
14:57 Jeff_Green: restarted service ocg on ocg1001
14:40 manybubbles: finished deployment - load spikes look to be gone. yay
14:22 logmsgbot: manybubbles Synchronized php-1.24wmf21/extensions/CirrusSearch/: Switch implementation of Cirrus link counting jobs to hopefully lower overall load. (duration: 00m 04s)
14:21 logmsgbot: manybubbles Synchronized wmf-config: More cirrus config to lower load (duration: 00m 04s)
14:14 logmsgbot: manybubbles Synchronized php-1.24wmf22/extensions/CirrusSearch/: Switch implementation of Cirrus link counting jobs to hopefully lower overall load. (duration: 00m 06s)
14:08 manybubbles: starting deployment to lower cirrus load spikes
13:19 manybubbles: *disabled*
13:17 manybubbles: disable row awareness on Cirrus's elasticsearch cluster - might help balance load better. too much load was on one row
13:04 hashar: Zuul processing queue again
13:00 hashar: Jenkins: disconnecting Gearman client from Zuul and reconnecting
12:59 hashar: Zuul / Jenkins stuck
09:33 hashar_: Jenkins switched mwext-UploadWizard-qunit back to Zuul cloner by applying pending change 161459
15:02 logmsgbot: anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Add securepoll-create-poll right to sysop on testwiki gerrit:161653 (duration: 00m 09s)
15:01 logmsgbot: anomie Synchronized wmf-config/CommonSettings.php: SWAT: Add REL1_24 as branch in ExtensionDistributor gerrit:161666 (duration: 00m 10s)
14:12 hashar: Jenkins deleted job mediawiki-core-lint , replaced by mediawiki-core-phplint
12:10 apergos: shutdown of db1050 to install trusty
10:04 hashar: Jenkins back and fully operational
09:55 hashar: restarting jenkins
09:37 hashar_: Jenkins: deleting old mediawiki extensions jobs (rm -fR /var/lib/jenkins/jobs/*testextensions-master). They are no longer triggered and are superseded by the *-testextension jobs.
03:36 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon Sep 22 03:36:40 UTC 2014 (duration 36m 39s)
02:41 logmsgbot: LocalisationUpdate completed (1.24wmf22) at 2014-09-22 02:41:29+00:00
02:29 logmsgbot: LocalisationUpdate completed (1.24wmf21) at 2014-09-22 02:29:09+00:00
02:16 logmsgbot: LocalisationUpdate completed (1.24wmf20) at 2014-09-22 02:16:20+00:00
September 21
22:43 ori: ms-be1008 overloaded starting 18:00:24 UTC, syslog says "BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:1:2196]". machine became unresponsive at 21:35, coinciding with a spike of 5xxs, lasting until Coren powercycled it at 22:10.
03:37 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun Sep 21 03:37:31 UTC 2014 (duration 37m 30s)
03:16 springle: labsdb1001 mysqld restarted in gdb; crash loop with a labs user's table
02:46 logmsgbot: ori Synchronized wmf-config/throttle.php: I7bb42b49a: Increase account creation throttle on enwiki for Cochrane colloquium. (duration: 00m 07s)
02:41 logmsgbot: LocalisationUpdate completed (1.24wmf22) at 2014-09-21 02:41:36+00:00
02:29 logmsgbot: LocalisationUpdate completed (1.24wmf21) at 2014-09-21 02:29:51+00:00
02:16 logmsgbot: LocalisationUpdate completed (1.24wmf20) at 2014-09-21 02:16:56+00:00
September 20
22:28 Krinkle: Reloading Zuul to deploy I0170766cfc06b8e6
20:30 andrewbogott: rebooting virt1006 to make good and sure it doesn't spontaneously re-enter the compute pool
20:29 andrewbogott_afk: moved all VMs off of virt1006, disabled compute service
03:46 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Sep 20 03:46:00 UTC 2014 (duration 45m 59s)
02:46 logmsgbot: LocalisationUpdate completed (1.24wmf22) at 2014-09-20 02:46:05+00:00
02:33 logmsgbot: LocalisationUpdate completed (1.24wmf21) at 2014-09-20 02:33:34+00:00
02:19 logmsgbot: LocalisationUpdate completed (1.24wmf20) at 2014-09-20 02:19:34+00:00
20:50 ori: restarted HHVM and cleared bytecode cache on all HHVM app servers
20:47 _joe_: restarted hhvm on mw1018, cleaning the cache as well
20:25 ori: Deployed Ic71064e08 (type hint fix for Wikidata) to wmf21/22.
19:09 bblack: restarted hhvm on mw1021
18:59 _joe_: rolling restart of hhvm servers
18:22 bblack: restarting hhvm on mw1020 (again!)
18:19 hashar: Jenkins: reverting job mwext-VisualEditor-qunit to previous state (i.e. without Zuul cloner)
18:17 bblack: restarting hhvm on mw1020
17:57 logmsgbot: ori Synchronized wmf-config/CommonSettings.php: I3e1bd5e4bb: Don't manipulate the environment to determine TZ offset (Bug: 71036) (duration: 00m 13s)
17:30 bblack: turned down apache prefork procs on fenari to reduce swapping
17:16 ottomata: initiating controlled shutdown of kafka broker analytics1021 to test some kafkatee weirdness, as well as a potential kafka/zookeeper bug
17:07 bblack: restarting apache on fenari
16:21 bblack: restarted hhvm on mw1019 + 1021
14:57 hashar: Jenkins friday deploy: migrate all MediaWiki extension qunit jobs to Zuul cloner.
14:37 akosiaris: initiated rsync of tridge data that is to be kept to nas1001-a
13:56 springle: killing any sleeping connection on enwiki db slaves to make room
13:56 mark: Stopped jobrunners on mw1001-1003
12:36 springle: temporarily disable log fsync on enwiki slaves
19:22 logmsgbot: reedy Synchronized php-1.24wmf22: (no message) (duration: 00m 57s)
19:16 Jeff_Green: disabling puppet on polonium, lead, sodium, iridium, magnesium, and iodine to monitor rollout of https://gerrit.wikimedia.org/r/155753
19:05 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: rest of group0 to 1.24wmf22
19:01 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf21
18:59 bblack: restarting apache on fenari
18:49 logmsgbot: reedy Finished scap: testwiki to 1.24wmf22 and build l10n cache (duration: 30m 23s)
18:44 Jeff_Green: testing exim configuration change on lead.wm.o
18:18 logmsgbot: reedy Started scap: testwiki to 1.24wmf22 and build l10n cache
17:49 logmsgbot: reedy Started scap: testwiki to 1.24wmf22 and build l10n cache
17:08 cmjohnson1: replacing failed disk es1005
17:05 logmsgbot: yurik Finished scap: (no message) (duration: 23m 26s)
16:43 yurikR: yurik scaping zero - partner needs an l10n message asap
21:41 mutante: fixing updates on planet feeds - file permissions
21:11 manybubbles: restarting the rebuild of cirrus's enwiki index now that I've found the reason it wasn't working before - the new index was putting too many shards on an already full node and overwhelming it. silly allocation algorithm! that's a bad idea!
21:07 logmsgbot: yurik Synchronized php-1.24wmf21/extensions/ZeroPortal/: (no message) (duration: 01m 05s)
20:19 godog: rebooting ms-be1006
19:00 Krinkle: jenkins-slave tmpfs on lanthanum was filling up (> 500MB). I purged tmp dbs for old jobs. We should get these purged automatically and also increase the size as 500MB is too little.
18:59 robh: disabled icinga alerts for ms-be1001, rebooting it to look at its raid bios settings for codfw deployment mirroring
18:23 mutante: phabricator - made aklapper an admin
17:26 logmsgbot: andrew rebuilt wikiversions.cdb and synchronized wikiversions files: (no message)
17:23 logmsgbot: andrew Synchronized wikiversions.json: (no message) (duration: 00m 05s)
17:04 manybubbles: cirrus brownout looks just about fixed. So! My plan for periodically explicitly merging deletes has some problems.....
16:42 gwicke: restarted parsoid on wtp102{2,3,4}
16:31 manybubbles: just going to make this clear - the current cirrus brownout doesn't seem to be affecting my queries but we're getting hit with pool counter full events - sadness. it's not caused by switching cirrus to ruwiki's primary backend - it's caused by me attempting to perform index maintenance activities.
16:23 akosiaris: restarted node on wtp boxes except wtp1022,wtp1023,wtp1024
16:23 manybubbles: caused cirrus brownout by executing a force merge for enwiki's general index. ooops
16:06 logmsgbot: manybubbles Synchronized wmf-config/: set cirrus as primary search backend for ruwiki and make permanent some settings set on the fly (duration: 00m 06s)
15:57 manybubbles: manually pushed apart ruwiki and nlwiki's shards as well - might help - updated commit to reflect that
15:41 manybubbles: manually forcing Cirrus's commonswiki's file index apart from one another in an attempt to lower the consistently high load on elastic1013
15:34 logmsgbot: reedy Synchronized wmf-config/InitialiseSettings.php: Set wgMetaNamespace for labswiki (duration: 00m 14s)
21:36 Jeff_Green: SPF record deployed for donate.wikimedia.org
21:01 logmsgbot: ejegg Synchronized php-1.24wmf20/extensions/CentralNotice/modules/ext.centralNotice.bannerController/bannerController.js: (no message) (duration: 00m 06s)
19:38 csteipp: deployed patches for bugs 70469 and 70672
19:17 logmsgbot: catrope Synchronized php-1.24wmf21/extensions/VisualEditor/: Revert IE hacks so Firefox will stop corrupting non-Latin characters (duration: 00m 06s)
19:15 logmsgbot: catrope Synchronized php-1.24wmf20/extensions/VisualEditor/: (no message) (duration: 00m 09s)
18:32 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 15s)
18:11 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf21
17:03 logmsgbot: bd808 Finished scap: No code change scap to test scap internal update (duration: 18m 06s)
16:45 logmsgbot: bd808 Started scap: No code change scap to test scap internal update
16:43 bd808|deploy: Updated scap to 663f137 (Check php syntax with parallel `php -l`)
16:42 bd808|deploy: Trebuchet sync for scap reporting failure from osmium.eqiad.wmnet, mw1053.eqiad.wmnet, searchidx1001.eqiad.wmnet, fenari.wikimedia.org, and mw1110.eqiad.wmnet
16:41 bd808|deploy: Trebuchet update for scap reporting failure from osmium.eqiad.wmnet, searchidx1001.eqiad.wmnet, fenari.wikimedia.org and mw1110.eqiad.wmnet
16:00 _joe_: mw1018 and mw1021 in the hhvm appservers pool
15:35 logmsgbot: reedy Synchronized docroot and w: Update symlinks to use /srv/mediawiki (duration: 00m 16s)
15:34 hashar: Jenkins: deleting /srv/ssd/jenkins-slave/workspace/*testextensions-master on gallium and lanthanum.
15:25 logmsgbot: andrew Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 03s)
15:23 logmsgbot: andrew Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 19s)
15:13 hashar: Jenkins: mediawiki extensions phpunit jobs should pass more or less until the CI system is sent into orbit and dies out horribly. in such a case ping me / phone.
14:52 ottomata: set vm.dirty_expire_centisecs to 10000 (was 30000) on analytics1021 to experiment with paging and kafka-zookeeper timeouts
14:36 godog: stopped htcp-purger on ms1004 RT #8358
14:32 godog: silenced ms-be1014 until tomorrow, pending forced reboot
14:28 hashar: Jenkins: breaking continuous integration for MediaWiki repositories. Extensions are now tested with mediawiki/vendor, and mediawiki/core is checked out to the patch branch if it exists. 160656
14:20 akosiaris_: restarted apache on fenari , it was leaking memory, situation back to normal, cause unknown yet
14:12 akosiaris_: stopped apache on fenari . It was in swap, investigating
23:12 bblack: restarting lvs1002 for HT disable + kernel upgrade
23:07 Krinkle: Running sample job on integration-slave1006 and warming up npmjs.org cache
22:56 Krinkle: Running sample job on integration-slave1008 and warming up npmjs.org cache
22:49 Krinkle: Running sample job on integration-slave1007 and warming up npmjs.org cache
22:48 Krinkle: Pooling the newly setup Trusty-based Jenkins slaves (integration-slave1006, integration-slave1007 and integration-slave1008)
22:42 bblack: dropping static routes for 2620:0:861:ed1a::[d,f,10,11] -> lvs1005 from cr[12]-eqiad (only 11 is of any consequence, misc-web-lb, and they're advertised by bgp and this is preventing failover to lvs1002)
21:28 cscott: updated OCG to version 188a3c221d927bd0601ef5e1b0c0f4a9d1cdbd31
18:43 manybubbles: performance tests show cirrus should handle jawiki with no problem but if load spirals out of control and I'm not around then revert https://gerrit.wikimedia.org/r/#/c/160465/
18:40 hoo: Local part of the global rename of Gnumarcoo => .avgas fatally timed out on itwiki. This needs to be fixed by hand.
18:40 manybubbles: Setting Cirrus to jawiki's primary search backend went well but Japan is mostly asleep. If Elasticsearch load takes a turn for the worse in four or five hours then we'll know how it went.
17:14 bd808: Restarted elasticsearch on logstash1003; 2014-09-14T09:33:57Z java.lang.OutOfMemoryError
17:09 _joe_: killing salt-call on all mediawiki hosts
17:06 bd808: Restarted elasticsearch on logstash1001; 2014-09-15T06:12:09Z java.lang.OutOfMemoryError
17:04 bblack: using salt to kill salt-minion everywhere...
17:02 bd808: Restarted logstash on logstash1001. I hoped this would fix the dashboards, but it looks like the backing elasticsearch cluster is too sad for them to work at the moment.
16:55 bd808: Restarted hung elasticsearch service on logstash1002
16:15 manybubbles: jawiki now has cirrus as primary. we're back to where we were before the great cascading failure of two months ago
16:13 logmsgbot: manybubbles Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 06s)
15:29 logmsgbot: marktraceur Synchronized php-1.24wmf21/extensions/MultimediaViewer/: [SWAT] Several backports for metrics and bugfixes in Media Viewer (duration: 00m 07s)
15:27 logmsgbot: marktraceur Synchronized php-1.24wmf20/extensions/MultimediaViewer/: [SWAT] Several backports for metrics and bugfixes in Media Viewer (duration: 00m 07s)
15:18 logmsgbot: marktraceur Synchronized php-1.24wmf21/extensions/GeoCrumbs/GeoCrumbs.class.php: [SWAT] Handle return value NULL of GeoCrumbs::getParserCache (duration: 00m 07s)
15:17 logmsgbot: marktraceur Synchronized php-1.24wmf20/extensions/GeoCrumbs/GeoCrumbs.class.php: [SWAT] Handle return value NULL of GeoCrumbs::getParserCache (duration: 00m 07s)
15:06 logmsgbot: marktraceur Synchronized wmf-config/: [SWAT] Remove 'renameuser' right from bureaucrats on CentralAuth wikis (duration: 00m 09s)
14:50 logmsgbot: aude Finished scap: Put test.wikidata back on mw1.24-wmf19 extension branch (duration: 37m 27s)
14:43 manybubbles: restarting the enwiki cirrus reindex process - it crashed over the weekend. why you crash and leave error message "1". "1" is not a useful error message.
14:13 logmsgbot: aude Started scap: Put test.wikidata back on mw1.24-wmf19 extension branch
13:03 _joe_: fenari is swapping hard, restarting apache, which was eating up all the RAM
09:20 logmsgbot: hashar Synchronized wmf-config/InitialiseSettings.php: *.scienceimage.csiro.au to the wgCopyUploadsDomains 159999 bug 70771 (duration: 00m 06s)
02:01 logmsgbot: LocalisationUpdate failed: mwversionsinuse returned empty list
00:45 ori_: fenari appears to still have twemproxy (in addition to nutcracker); decom'ing.
00:29 ori_: restarting apache2 on fenari
September 13
04:42 legoktm: global rename for Trevor Parscal (WMF) unstuck itself, yay
04:22 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Sep 13 04:22:04 UTC 2014 (duration 22m 3s)
03:51 legoktm: global rename for Trevor Parscal --> Trevor Parscal (WMF) looks stuck on metawiki and mswiki, in queued state for both but showJobs.php says the jobs are active and claimed
03:11 logmsgbot: LocalisationUpdate completed (1.24wmf21) at 2014-09-13 03:11:40+00:00
02:38 logmsgbot: LocalisationUpdate completed (1.24wmf20) at 2014-09-13 02:38:26+00:00
01:45 logmsgbot: ori Synchronized php-1.24wmf21/extensions/Flow: Update flow for I4da934dfe (duration: 00m 06s)
01:45 logmsgbot: ori Synchronized php-1.24wmf20/extensions/Flow: Update flow for I4da934dfe (duration: 00m 06s)
01:41 logmsgbot: ori Synchronized php-1.24wmf20/extensions/Flow: Update flow for I4da934dfe (duration: 00m 08s)
September 12
21:26 csteipp: deployed fixes for bugs 70620, 69008
20:37 logmsgbot: mattflaschen Synchronized php-1.24wmf21/extensions/GettingStarted/: Deploy to fix GettingStarted bucketting for users with null registration date (duration: 00m 05s)
20:37 logmsgbot: mattflaschen Synchronized php-1.24wmf20/extensions/GettingStarted/: Deploy to fix GettingStarted bucketting for users with null registration date (duration: 00m 07s)
19:34 legoktm: running migratePass0.php across all CentralAuth wikis
17:43 logmsgbot: ori updated /a/common to I4e4187285: Rename some constants to clarify their meaning and purpose
14:52 manybubbles: rebuilding enwiki's Cirrus index for more performance testing. Please be faster now. k?
08:37 _joe_: rolling restart of pybal finished. Adding note on Fenari
08:19 _joe_: reactivated puppet on all lvs hosts, esams almost done, pending eqiad
08:06 _joe_: new pybal conf applied in all of ulsfo
07:39 _joe_: changing pybal config place; stopping puppet on all loadbalancers
04:27 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Sep 12 04:27:17 UTC 2014 (duration 27m 16s)
03:15 logmsgbot: LocalisationUpdate completed (1.24wmf21) at 2014-09-12 03:15:57+00:00
03:08 logmsgbot: mattflaschen Finished scap: One last CSS fix (wrapping issue for error state) for GettingStarted A/B test (duration: 24m 38s)
02:43 logmsgbot: mattflaschen Started scap: One last CSS fix (wrapping issue for error state) for GettingStarted A/B test
02:39 logmsgbot: LocalisationUpdate completed (1.24wmf20) at 2014-09-12 02:39:35+00:00
01:33 logmsgbot: mattflaschen Synchronized php-1.24wmf21/extensions/GettingStarted/: CSS tweaks for GettingStarted A/B test (duration: 00m 07s)
01:32 logmsgbot: mattflaschen Synchronized php-1.24wmf20/extensions/GettingStarted/: CSS tweaks for GettingStarted A/B test (duration: 00m 21s)
01:29 logmsgbot: ori Synchronized wmf-config/wikitech.php: Ia5b81076e: Update path reference for /srv/mediawiki (duration: 00m 04s)
01:28 logmsgbot: ori updated /a/common to Ia5b81076e: Update path reference for /srv/mediawiki
01:19 ori: manually migrated /u/l/a/common-local to /srv/mediawiki on virt1000
00:36 logmsgbot: ori Synchronized php-1.24wmf21/extensions/Wikidata: Update Wikidata to tip of master for I23b7eb54b8e (Bug: 70747) (duration: 00m 08s)
00:12 logmsgbot: esanders Synchronized php-1.24wmf21/resources/lib/oojs-ui/: (no message) (duration: 00m 03s)
00:12 logmsgbot: esanders Synchronized php-1.24wmf21/extensions/MultimediaViewer/: (no message) (duration: 00m 07s)
23:00 mutante: restarting icinga-wm for config change
21:49 logmsgbot: mattflaschen Started scap: Deploy new GettingStarted recommendations A/B test
21:14 Krinkle: Stopping/starting zuul
21:08 andrewbogott: restarting zuul on gallium
20:58 andrewbogott: restarted jenkins, maybe
20:56 ori: graceful'd apache on mw1053, missed it earlier
20:49 logmsgbot: ori Synchronized wmf-config/CommonSettings.php: I1f3234746: Revert Scribunto: double the Lua CPU limit on the job runners (duration: 00m 05s)
20:48 logmsgbot: ori updated /a/common to I1f3234746: Revert "Scribunto: double the Lua CPU limit on the job runners"
20:42 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 14s)
20:15 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 13s)
20:15 andrewbogott: syncing virt1000, again in hopes of moving to wmf20
20:08 logmsgbot: reedy Synchronized php-1.24wmf21/extensions/Wikidata/: (no message) (duration: 00m 17s)
18:08 logmsgbot: reedy Started scap: testwiki to 1.24wmf21 and build l10n cache
18:02 manybubbles: raised logging on Elasticsearch cluster temporarily to get more information about merging - a process super important to keeping the index up to date in "real time"
17:20 logmsgbot: ori updated /a/common to I0bda3deab: Replace remaining references to /u/l/a/common
17:18 logmsgbot: ori updated /a/common to I37b0a8338: Get rid of MULTIVER_CDB_DIR_{APACHE,HOME}
16:57 andrewbogott: sync-common on virt1000 -- with any luck this will upgrade us to wmf20
16:56 logmsgbot: andrew rebuilt wikiversions.cdb and synchronized wikiversions files: (no message)
16:53 logmsgbot: bd808 Finished scap: Preparing to move wikitech to 1.24wmf20 (second try) (duration: 24m 25s)
16:46 andrewbogott: apache graceful on mw1039
16:33 bd808|deploy: andrewbogott did apache graceful on mw1120 to stop wikidata APC logspam
16:29 logmsgbot: bd808 Started scap: Preparing to move wikitech to 1.24wmf20 (second try)
16:22 logmsgbot: andrew Finished scap: Preparing to move wikitech to 1.24wmf20 (duration: 06m 45s)
16:19 bd808: Restarted logstash on logstash1001. Log empty and events not being stored in elasticsearch
16:15 logmsgbot: andrew Started scap: Preparing to move wikitech to 1.24wmf20
15:45 bblack: icinga config is correct now, back to normal puppet updates
15:24 bblack: restarted icinga, manually removed some labsy things that were broken in config and temporarily disabled puppet :p
14:44 _joe_: php upgrade finished
14:23 _joe_: upgrading php across the cluster: libapache2-mod-php5 php5-cli php-pear php5 php5-common php5-curl php5-dev php5-intl php5-mysql php5-xmlrpc
13:04 akosiaris: uploaded php5_5.3.10-1ubuntu3.14+wmf1 on apt.wikimedia.org
10:00 _joe_: enabled puppet on mw1053
09:38 _joe_: gracefulling mw1200 mw1196 and mw1186 as they have APC issues
09:21 _joe_: upgrading hhvm and hhvm-luasandbox across the production cluster
09:00 akosiaris: upgrading php5 to 5.3.10-1ubuntu3.14+wmf1 on mw1212
08:34 _joe_: updating php-pear php5 php5-cli php5-common php5-curl php5-dev php5-intl php5-mysql php5-xmlrpc libapache2-mod-php5 on mw1018, see USN 2344-1
03:41 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 11 03:41:03 UTC 2014 (duration 41m 2s)
02:49 logmsgbot: LocalisationUpdate completed (1.24wmf20) at 2014-09-11 02:49:26+00:00
02:36 logmsgbot: LocalisationUpdate completed (1.24wmf19) at 2014-09-11 02:36:37+00:00
02:23 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-09-11 02:23:29+00:00
00:28 mutante: graceful'ed Apaches on mw1171, mw1187
00:25 logmsgbot: ori Synchronized wmf-config: Id607bf36d: Update remaining references to /u/l/a/common-local (duration: 00m 03s)
00:25 logmsgbot: ori Synchronized multiversion: Id607bf36d: Update remaining references to /u/l/a/common-local (duration: 00m 04s)
00:22 logmsgbot: ori Synchronized docroot and w: Id607bf36d: Update remaining references to /u/l/a/common-local (duration: 00m 04s)
00:07 logmsgbot: ori updated /a/common to Id607bf36d: Update remaining references to /u/l/a/common-local
September 10
23:44 mutante: graceful'ed mw1202 apache
23:29 mutante: deleted labstore1003.eqiad.wmnet.org from puppet stored resource db, fixes puppet runs on hosts with ssh host key collection
22:52 mutante: labstore1003 - (earlier) revoked salt and puppet key and signed new after hostname fix - same salt-minion puppet errors that happen after reinstalls
19:52 Reedy: Created Echo tables on extension1 for cawikimedia
19:51 RobH: puppet disabled on carbon (install server) for a livehack test of config setting
18:51 logmsgbot: yurik Synchronized wmf-config/CommonSettings.php: (no message) (duration: 01m 05s)
18:26 logmsgbot: yurik Synchronized php-1.24wmf20/extensions/ZeroBanner: (no message) (duration: 01m 09s)
18:22 logmsgbot: yurik Synchronized php-1.24wmf19/extensions/ZeroBanner: (no message) (duration: 01m 11s)
18:00 manybubbles: cirrus index rebuild for test2wiki went well - doing the rest of group0
17:35 manybubbles: rebuilding cirrus index for test2wiki to test that some performance enhancements don't break anything. test2wiki is too small to see any gain from the enhancements though.
17:25 Reedy: mw1126, mw1116, mw1122, mw1146, mw1121, mw1136, mw1114, mw1068 have been gracefulled
11:29 _joe_: git.wikimedia.org works now, no action needed
11:26 MatmaRex: git.wikimedia.org is down: Error: 503, Service Unavailable
10:04 _joe_: also re-enabling puppet
10:02 _joe_: manually restarting apache on mw1178,mw1192,mw1163,mw1130,mw1018 as they started with the wrong pidfile before my fix
09:24 _joe_: disabling puppet on appservers
08:55 godog: launched "iptables" on tin to check current rules and it loaded iptables modules, logging for future reference
08:10 _joe_: re-enabling puppet on appservers and imagescalers, change is good
08:08 _joe_: restarted apache2 on mw1018
08:06 _joe_: stopping apache on mw1018 for inspection
07:36 _joe_: that was on appservers
07:36 _joe_: disabling puppet, releasing a potentially harmful apache change
04:56 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue Sep 9 04:56:25 UTC 2014 (duration 56m 24s)
03:44 logmsgbot: LocalisationUpdate completed (1.24wmf20) at 2014-09-09 03:44:07+00:00
03:11 logmsgbot: LocalisationUpdate completed (1.24wmf19) at 2014-09-09 03:11:27+00:00
02:38 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-09-09 02:38:38+00:00
01:02 logmsgbot: ebernhardson Synchronized php-1.24wmf20/extensions/Flow/includes/Content/BoardContentHandler.php: Sync BoardContentHandler.php for Flow in 1.24wmf20 (duration: 00m 04s)
00:22 mutante: re-enabled mw1070 in pybal
00:19 logmsgbot: ebernhardson Finished scap: Repeat SWAT scap deployment due to possible sync-common failure (duration: 38m 50s)
September 8
23:59 ori: restarted rsync on mw1070 to unblock scap
23:40 logmsgbot: ebernhardson Started scap: Repeat SWAT scap deployment due to possible sync-common failure
23:39 logmsgbot: ebernhardson Finished scap: SWAT deploy updates to Flow, Echo and Thanks (duration: 24m 00s)
23:34 mutante: disabled mw1070 in pybal because it refused sync
23:31 ebernhardson: scap failed to connect to mw1070. Repeated message: rsync: failed to connect to mw1070.eqiad.wmnet (10.64.16.50): Connection refused (111)
23:15 logmsgbot: ebernhardson Started scap: SWAT deploy updates to Flow, Echo and Thanks
22:28 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Sep 4 22:27:45 UTC 2014 (duration 54m 38s)
21:54 bd808: sync-dir failure was really on osmium, not mw1161; confusing error messages are confusing
21:50 bd808: Running sync-common on mw1161 to try and reproduce error seen during sync-file
21:43 logmsgbot: spage Synchronized wmf-config/InitialiseSettings.php: Enable Flow on pages, including frwiki and hewiki (duration: 00m 09s)
21:40 logmsgbot: spage updated /a/common to Ib0aaa60f0: Enable Flow on several pages
21:08 logmsgbot: LocalisationUpdate completed (1.24wmf20) at 2014-09-04 21:07:10+00:00
20:56 MaxSem: Running cleanupPageProps.php from terbium, now for realz
20:42 mutante: restarting icinga-wm, making it join #wikidata for custom output
20:16 logmsgbot: LocalisationUpdate completed (1.24wmf19) at 2014-09-04 20:15:35+00:00
19:56 Reedy: mw1088 and mw1100 rsync errors during the manual l10n update
19:25 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-09-04 19:23:57+00:00
18:32 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf20
18:26 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.24wmf19
18:11 logmsgbot: reedy Synchronized php-1.24wmf19: (no message) (duration: 00m 55s)
18:10 logmsgbot: reedy Synchronized php-1.24wmf20: (no message) (duration: 00m 35s)
18:09 logmsgbot: reedy Finished scap: testwiki to 1.24wmf20 and build l10n cache (duration: 41m 33s)
18:05 mutante: restarting service gitblit on antimony
17:48 RobH: correction, simply suppressing alerts for the host in icinga is the better move, as the host isn't reclaimed yet, so not removing holmium from puppetstoreddb
17:46 RobH: stopping puppet on holmium and removing it from puppetstoreddb so it doesn't show in icinga once updated
17:45 RobH: shutting down holmium, as blog has migrated for a month now. Not yet wiping system, please leave for me (robh)
17:27 logmsgbot: reedy Started scap: testwiki to 1.24wmf20 and build l10n cache
16:44 logmsgbot: reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 17s)
16:36 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 14s)
15:53 logmsgbot: andrew Synchronized private/WikitechPrivateSettings.php: (no message) (duration: 00m 01s)
15:40 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 14s)
18:48 logmsgbot: yurik Synchronized php-1.24wmf18/extensions/Graph/: (no message) (duration: 01m 09s)
18:47 logmsgbot: yurik Synchronized php-1.24wmf19/extensions/Graph/: (no message) (duration: 01m 05s)
16:52 logmsgbot: andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 03s)
16:52 logmsgbot: andrew Synchronized private/WikitechPrivateLdapSettings.php: (no message) (duration: 00m 03s)
16:51 logmsgbot: andrew Synchronized private/WikitechPrivateSettings.php: (no message) (duration: 00m 05s)
16:51 logmsgbot: andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 03s)
16:19 logmsgbot: andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 03s)
16:18 logmsgbot: andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 04s)
16:16 logmsgbot: andrew Synchronized wmf-config/wikitech.php: (no message) (duration: 00m 05s)
15:41 _joe_: mw1020 correctly reimaged, putting it in the hhvm pool
15:27 logmsgbot: manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT - Update another cirrus config - this time maybe it will work (duration: 00m 05s)
15:12 manybubbles: deployed throttling for Cirrus job named cirrusSearchLinksUpdate - it handles updating the index when a transcluded page changes - we'll have to check on the backlog over the next few hours/days to see if it stabilizes
15:11 logmsgbot: manybubbles Synchronized php-1.24wmf19/extensions/Wikidata/: (no message) (duration: 00m 07s)
15:07 manybubbles: mw1020 gets WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! during sync-dir call
15:07 logmsgbot: manybubbles Synchronized wmf-config/: SWAT deploy cirrus config changes - make sure to get mw1020 (duration: 00m 04s)
18:30 logmsgbot: ori Synchronized php-1.24wmf19/resources/src/mediawiki.action/mediawiki.action.view.redirect.js: I19221a25a: mediawiki.action.view.redirect: Work around a IE 10+ HTML5 history API bug (duration: 00m 06s)
18:30 logmsgbot: ori Synchronized php-1.24wmf18/resources/src/mediawiki.action/mediawiki.action.view.redirect.js: I19221a25a: mediawiki.action.view.redirect: Work around a IE 10+ HTML5 history API bug (duration: 00m 07s)
15:36 hashar_: Jenkins: pooled a new slave 10.68.16.162 as wikidata-jenkins3 on behalf of addshore / wmde
15:04 _joe_: shutting down mw1163, filed RT 8243 for repair.
14:54 _joe_: re-enabled mw1130
14:41 logmsgbot: aude Synchronized wmf-config/Wikibase.php: enable Wikibase badges css, follow up from last night deploy (duration: 00m 06s)
14:22 _joe_: syncing mw1130
14:06 _joe_: disable mw1130 from the api pool while it gets resynced
12:30 Krinkle: Running extensions/GlobalCssJs/removeOldManualUserPages.php per m:GlobalCssJs
01:06 cmjohnson1: shutting down ms-fe1002 to relocate racks
01:04 godog: depool ms-fe1002
01:02 godog: repool ms-fe1001
00:57 logmsgbot: ori Synchronized php-1.24wmf18/extensions/WikimediaEvents: Ib44fe0898: Inject 'wgPoweredByHHVM' JS config var if powered by HHVM (duration: 00m 03s)
00:56 logmsgbot: ori Synchronized php-1.24wmf19/extensions/WikimediaEvents: Ib44fe0898: Inject 'wgPoweredByHHVM' JS config var if powered by HHVM (duration: 00m 04s)
00:38 cmjohnson1: shutting down ms-fe1001 for rack relocation
00:34 godog: depool ms-fe1001
00:32 godog: repool ms-fe1004
00:27 mutante: restarting gmetad on nickel
00:04 cmjohnson1: shutting down ms-fe1004 to relocate racks
17:13 ^d: elastic: excluded the elastic1016 node from shard allocation, shards draining so we can take it down for disk testing
16:01 ottomata: restarted webstats-collector on gadolinium
13:18 mark: Reactivated cr2-eqiad AS3257 transit link
10:44 springle: xtrabackup clone db1051 to db1073
10:18 godog: restarting mailman on sodium
08:52 godog: restarted apache on mw1134
08:03 godog: killed stray mailman processes on sodium (no pid file) and restarted mailman
06:11 springle: xtrabackup clone db1051 to db1072
06:09 springle: restarted morebots
August 26
21:04 hashar: Updating our Jenkins Job Builder fork 0268581..e5c0c61 . Will let us define variables in 'default' section and override them when invoking a job template ( https://review.openstack.org/#/c/100020/ )
19:58 bd808: Ran sync-common on mw1053.eqiad.wmnet to recover from failure during last scap
19:48 logmsgbot: aude Finished scap: Update new messages for Wikibase (duration: 07m 16s)
19:41 logmsgbot: aude Started scap: Update new messages for Wikibase
18:54 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf18
18:53 logmsgbot: reedy Synchronized php-1.24wmf18/extensions/MassMessage: (no message) (duration: 00m 14s)
18:52 logmsgbot: reedy Synchronized php-1.24wmf17/extensions/MassMessage: (no message) (duration: 00m 16s)
18:19 jgage: Failover from analytics1010-eqiad-wmnet to analytics1004-eqiad-wmnet successful
17:47 logmsgbot: bd808 Synchronized private/PrivateSettings.php: Syncing file rather than symlink (duration: 00m 04s)
17:36 bd808: mw1010.eqiad.wmnet was out of sync too. I suspect there is something wrong with the fanout update step in scap
17:26 bd808: /usr/local/apache/common-local out of date on mw1161.eqiad.wmnet; updated via sync-common
17:25 bd808: sync-* not updating terbium properly; sync-common from terbium manually got several config changes; maybe a problem with mw1161.eqiad.wmnet rsync mirror
16:26 logmsgbot: bd808 Finished scap: no-op scap to test scap code update (duration: 13m 31s)
16:20 bd808|DEPLOY: Rsync sloooow to fenari "16:18:52 fenari INFO - Finished rsync common (duration: 04m 38s)"
16:12 logmsgbot: bd808 Started scap: no-op scap to test scap code update
16:07 logmsgbot: demon Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 04s)
16:07 bd808|DEPLOY: Updated scap to 116027f (Make sync-common update l10n cdb files by default)
15:05 logmsgbot: anomie Synchronized wmf-config: SWAT: Enable GlobalCssJs on all CentralAuth wikis minus loginwiki gerrit:154432 (duration: 00m 09s)
13:32 hashar: Jenkins mediawiki-core-qunit job has been switched to Zuul cloner and pass! :-D
13:29 _joe_: re-enabling puppet, change aborted as not all sites are served via hhvm on the hhvm appservers (true story). Will re-do once all configs are in their place
13:12 _joe_: disabling puppet on all appservers while deploying an apache change
12:48 logmsgbot: springle Synchronized wmf-config/db-eqiad.php: db1054 to normal load (duration: 00m 06s)
12:33 hashar: Jenkins reverted mediawiki-core-qunit to use Zuul cloner 156268. Gotta play with it on a new job name since it does not work out of the box as expected.
12:12 hashar: Jenkins migrating mediawiki-core-qunit to use Zuul cloner 156268
12:03 akosiaris: disable puppet on labsdb1006 for planet osm import
11:53 logmsgbot: springle Synchronized wmf-config/db-eqiad.php: pool db1054, warm up (duration: 00m 08s)
09:04 godog: reboot ms-be1011, unresponsive on network and console
17:24 godog: reboot ms-be1004 to pick up kernel upgrade
17:13 godog: rebooting ms-be1002 to pick up updated kernel
16:54 ottomata: stopping puppet on cp3021. Testing an increase of kafka.queue.buffering.max.ms in order to avoid dropping messages during broker metadata change (e.g. leader elections)
16:48 hashar: Jenkins pooled in a new slave wdjenkins-node1 that will be used to run Wikidata jenkins jobs. Work in progress with addshore. It is not running jobs yet.
16:47 godog: reboot ms-be1011, xfsaild errors in dmesg
16:06 andrewbogott: wikitech deployment finished. Note that the OpenStackManager submodule is off of the MediaWiki branch because… the whole submodule setup there is a bit broken on account of a git bug that uses absolute paths to manage submodules.
16:01 andrewbogott: deploying tiny OpenStackManager upgrade on wikitech
23:57 logmsgbot: ori Started scap: SWAT: d3de89777, 7abfe0d5e7, 8ec9853c32b, 476e9e90bd01
21:58 logmsgbot: ori Synchronized php-1.24wmf17/resources/src/mediawiki/mediawiki.js: I8d27442d1: Workaround for bug introduced by Icf6ede09b (duration: 00m 03s)
21:57 manybubbles: performing elasticsearch upgrade on elastic1015
21:02 logmsgbot: ori Synchronized php-1.24wmf17/resources/src/mediawiki/mediawiki.util.js: Touch resources/src/mediawiki/mediawiki.util.js (duration: 00m 06s)
20:44 godog: rolling restart of swift-proxy on ms-fe1*
20:11 godog: restarted swift-proxy on ms-fe1001
19:55 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 13s)
19:49 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf18
19:46 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf17
19:23 reedy|webirc: mw1178 returned [255]: ssh: connect to host mw1178 port 22: Connection timed out
19:23 reedy|webirc: mw1019 returned [127]: bash: sync-common: command not found
19:09 logmsgbot: reedy Started scap: testwiki to 1.24wmf18
18:28 manybubbles: *victim*
18:27 manybubbles: trying to recover from weird Elasticsearch upgrade failure by redoing the upgrade on one node while also blowing away the data directory during the upgrade. elastic1005, you are my first victem.
17:28 cmjohnson1: removing mw1130 from pybal
14:53 hashar: Jenkins: updated PHP CodeSniffer MediaWiki standard on all slaves.
03:21 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Aug 21 03:20:47 UTC 2014 (duration 20m 46s)
02:36 logmsgbot: LocalisationUpdate completed (1.24wmf17) at 2014-08-21 02:34:56+00:00
02:20 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-21 02:19:26+00:00
00:08 MatmaRex: (manybubbles contd.) …a single node going down but I expect the cluster to stay "yellow" during the process- no alerts.
00:07 manybubbles: bd808 needs to plan a logstash upgrade soon - let it be logged
00:05 manybubbles: if anyone is reading the SAL for fun or sees an error in Elasticsearch cluster in the next 24 hours - we're performing an elasticsearch upgrade. We've set it up this time so it's super slow and boring. So boring I'm going to sleep through it. If you see more than transient complaining from icinga about elasticsearch you can call me/have someone with access to the contact list call me. I expect icinga to complain about a
00:00 manybubbles: unattended rolling restart of Elasticsearch cluster is going just fine - adding the 30 minute sleep between servers and turning down the replication rate makes it pretty boring.
August 20
23:07 awight: stopping the Thank You job
22:50 ori: disabled puppet on osmium to debug memory leak
21:09 logmsgbot: marktraceur Synchronized wmf-config: Turn off Media Viewer for logged-in users at Commons. (duration: 00m 07s)
21:06 logmsgbot: marktraceur updated /a/common to I226bd1468: Add item-redirect to OAuth permissions
19:50 hashar: Restarting Zuul to prettify build results bug 66095
19:48 logmsgbot: awight Synchronized php-1.24wmf17/extensions/CentralNotice: push CentralNotice updates, including new hide cookie format (duration: 00m 05s)
19:47 logmsgbot: awight Synchronized php-1.24wmf16/extensions/CentralNotice: push CentralNotice updates, including new hide cookie format (duration: 00m 04s)
19:46 logmsgbot: awight Synchronized php-1.24wmf16/extensions/CentralNotice: push CentralNotice updates, including new hide cookie format (duration: 00m 07s)
16:11 manybubbles: elastic1001 upgrade went well - upgrading elastic1002 now
15:48 hashar: dns: Jenkins will now complain whenever you attempt to send tabs in any file of operations/dns.git bug 69478
15:17 manybubbles: manually lowered elasticsearch recovery speeds to stem off high load caused by healing the restart of elastic1001 - we were slowing down enough that we were filling the pool counter
15:01 logmsgbot: anomie Synchronized php-1.24wmf17/extensions/Wikidata/extensions/Wikibase/lib/: SWAT: Touch files on advice of Wikidata folks (duration: 00m 09s)
15:01 logmsgbot: anomie Synchronized wmf-config/Wikibase.php: SWAT: Fix config for specialSiteLinkGroups in Wikibase gerrit:155218 (duration: 00m 09s)
14:49 manybubbles: installing elasticsearch 1.3.2 on elasticsearch1001 only right now as a test
14:47 manybubbles: upgrading elasticsearch plugins on all elasticsearch servers in preparation to upgrade to elasticsearch 1.3 - if we roll back we'll have to redeploy the plugins
14:10 ottomata: changing group ownership and permissions on raw webrequest data in hdfs. Users now must be in the analytics-privatedata-users group to access.
13:47 manybubbles: experimenting with lowering merge factor on enwiki's Cirrus index - should improve query performance at the cost of more background tasks in the Elasticsearch cluster
13:36 ottomata: disabling puppet on analytics1027 temporarily
13:10 godog: reboot ms-be1003, xfs errors/panics
12:03 logmsgbot: ori updated /a/common to Ic3fe1ef83: Update all symlinks to /apache
15:13 logmsgbot: anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Add Affiliate namespace on chapcomwiki gerrit:154713 (for real this time) (duration: 00m 09s)
15:12 mark: Completed network migration of BGP confederation renumbering: AS65002 -> AS65001, AS65003 -> AS65004, old AS65001 (pmtpa) is part of eqiad for its remaining lifetime
07:17 hashar: Jenkins: manually cleared out a tmpfs partition on lanthanum.eqiad.wmnet which was causing all MediaWiki / extensions jobs to fail completely. bug 69731. We need disk space monitoring which is bug 69733.
07:09 bblack: ... and strontium passenger is failing to start up correctly again. icinga-wm disabled to avoid spam
07:07 bblack: restarted apache2 service on strontium/palladium, expect another small spike of puppet fail->ok
03:21 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 19 03:20:21 UTC 2014 (duration 20m 20s)
02:37 logmsgbot: LocalisationUpdate completed (1.24wmf17) at 2014-08-19 02:36:21+00:00
02:16 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-19 02:15:14+00:00
August 18
23:03 andrewbogott: isolated virt1006, re-enabling puppet on virt1000 and virt1006
22:36 andrewbogott: disabling puppet on virt1000 and virt1006 while I try to convince the scheduler to overlook virt1006
22:01 bblack: done futzing w/ puppetmasters+neon, all agents enabled and bot back online
21:28 hashar: Zuul processing again. Definitely need to write doc about how to unstuck it
21:02 hashar: Zuul / Jenkins stalled again :-/
19:35 bblack: testing new passenger perf params on strontium/palladium. agents on those two and icinga-wm still disabled
19:04 bblack: restarted service apache2 on strontium - passenger for puppet master was dead again
17:00 andrewbogott: added a (yuvi-built) python-txstatsd package to trusty on Carbon.
16:37 bd808: deployment-prep Restarted Apache and HHVM on deployment-mediawiki02 to pick up removal of /etc/php5/conf.d/mail.ini
13:06 Reedy: Large amount of incoming traffic to bast1001 is me uploading files
12:11 godog: rebalanced swift object ring in eqiad
09:34 godog: reenabled puppet on neon and started ircecho
09:23 godog: stop ircecho again on neon, disable puppet on neon
09:11 godog: restarted apache2 on strontium
08:58 godog: stopped ircecho on neon while diagnosing puppet failure
03:13 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 18 03:12:27 UTC 2014 (duration 12m 26s)
03:06 hoo: Ran sync-common on mw1053 to stop "Unrecognized job type 'ChangeNotification'." exceptions
02:31 logmsgbot: LocalisationUpdate completed (1.24wmf17) at 2014-08-18 02:30:17+00:00
02:19 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-18 02:18:52+00:00
August 17
21:07 legoktm: running migrateAccount.php without --safe or --auto on terbium for bug 69291
18:45 hashar: Zuul upgraded
18:41 hashar: Upgrading Zuul to latest version (that is not a friday after all)
09:22 springle: ongoing schema change wikidatawiki & testwikidatawiki wb_entity_per_page.epp_redirect_target. osc_host.sh processes on terbium ok to kill in emergency
04:34 ottomata: restarted udp2log on oxygen
03:05 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun Aug 17 03:04:22 UTC 2014 (duration 4m 21s)
02:49 springle: killed stuff on labsdb1002 using all disk for temp tables. investigating
02:24 logmsgbot: LocalisationUpdate completed (1.24wmf17) at 2014-08-17 02:23:08+00:00
02:14 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-17 02:13:35+00:00
August 16
18:12 bblack: (amssq33: and yes, removing from fe/be cache pools)
18:11 bblack: powering off amssq33, it's clipping network traffic at peak times due to bad ethernet connection negotiated down to 100Mbps (see existing RT 7933 in esams queue)
18:02 bblack: ms-be1006: syslog indicates it started generating repeated "BUG: soft lockup" 10 minutes before dying, in XFS kernel code again...
17:55 bblack: rebooting ms-be1006, ping-dead in icinga for 23m, console was unresponsive
17:37 bblack: restarted apache2 on palladium... looks like something went horribly wrong with its puppet of itself that somehow killed off puppetmaster service?
03:07 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Aug 16 03:06:29 UTC 2014 (duration 6m 28s)
02:27 logmsgbot: LocalisationUpdate completed (1.24wmf17) at 2014-08-16 02:26:02+00:00
02:17 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-16 02:16:00+00:00
15:47 Jeff_Green: adjust wiki-mail._domainkey DNS record to allow sending from 'wiki*@' addresses, instead of just wiki@
15:23 _joe_: powercycling mw1053, which looks like the victim of hhvm-induced ooms
15:15 logmsgbot: reedy Started scap: testwiki to 1.24wmf17
14:01 _joe_: puppet re-enabled on the appserver
12:38 _joe_: stopping puppet on appservers while deploying a delicate change.
12:12 manybubbles|away: cirrus index rebuilds are still proceeding without issue. Going to continue to let them run and keep half an eye on them. enwiki is nearly done. Commons and wikidata are done. Many of group1 are done - we're up to eswiktionary now - but there are many to go.
17:55 AaronSchulz: populateBacklinkNamespace.php finished on all wikis
17:13 springle: restart mysqld on labsdb1002, upgrade to mariadb 10.0.13 for bugfix
16:57 Jeff_Green: removed aluminium.wikimedia.org from production
16:50 springle: restart mysqld on labsdb1001, upgrade to mariadb 10.0.13 for bugfix
15:08 bblack: flipping ulsfo traffic back to ulsfo
11:51 logmsgbot: hoo Synchronized wmf-config/InitialiseSettings.php: Set siteGroup for testwikidata (duration: 00m 11s)
11:21 hashar: Jenkins: clearing up some obsolete symbolic links under gallium.wikimedia.org:/var/lib/jenkins/jobs/*/builds/ Running in a screen as user jenkins
05:01 springle: rsync ~1TB labsdb1001 to labsdb1003, throttled ~25MB/s
03:15 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue Aug 12 03:14:34 UTC 2014 (duration 14m 33s)
02:33 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-12 02:32:09+00:00
02:19 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-08-12 02:18:36+00:00
August 11
22:33 awight: update CRM schema to wmf_civicrm:7021
21:47 andrewbogott: removed the old puppet-freshness check which should have no effect but may instead produce a torrent of alert spam https://gerrit.wikimedia.org/r/#/c/142560/
04:00 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 11 03:59:17 UTC 2014 (duration 59m 16s)
03:05 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-11 03:04:15+00:00
02:34 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-08-11 02:33:02+00:00
August 10
23:47 logmsgbot: ori Synchronized php-1.24wmf16/extensions/MassMessage/includes: Revert MassMessage to 9884fbb50a (duration: 00m 06s)
23:36 logmsgbot: ori Synchronized php-1.24wmf16/extensions/MassMessage: Update MassMessage for I840c98dca: Fix MassMessage::getMessengerUser() after Password API changes (duration: 00m 06s)
14:22 mutante: RT - reverted permission change for access requests requestors per robh
13:50 mutante: RT - granted permission to show ticket summary for role requestor in queue access-requests
12:49 akosiaris: uploaded ruby-jsduck 5.3.4-1wmftrusty1 and ruby-rkelly-remix 0.0.6-1trusty1 on apt.wikimedia.org
12:33 ori: testwiki up, judgement poor
12:28 hashar: Jenkins: somehow the ArtifactDeployer plugin got upgraded on Aug 7th 20:57 UTC despite it being broken bug 69197. Attempting manual downgrade
12:13 hashar: reloading Jenkins
12:07 akosiaris: ifconfig br0 0.0.0.0 on platinum to get rid of the IP on that interface and have facter work more reliably. This does not matter right now as it is an evaluation machine but logging it for completeness
12:03 logmsgbot: ori rebuilt wikiversions.cdb and synchronized wikiversions files: (no message)
11:32 _joe_: rebooting mw1017
11:29 akosiaris: mw1130 has broken disk
11:09 ori: running rsync-common on mw1017
11:02 logmsgbot: hoo Synchronized php-1.24wmf16/extensions/CentralAuth/: Another shot towards bug 39996 (duration: 01m 04s)
11:01 logmsgbot: hoo Synchronized php-1.24wmf15/extensions/CentralAuth/: Another shot towards bug 39996 (duration: 01m 04s)
09:29 _joe_: reimaging mw1017 aka testwiki.
06:03 springle: ongoing schema changes: rev_content_model, rev_content_format. on terbium, osc_host.sh processes ok to kill in emergency
03:13 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 8 03:12:21 UTC 2014 (duration 12m 20s)
02:29 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-08 02:28:39+00:00
02:17 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-08-08 02:16:13+00:00
August 7
19:19 jgage: rebooting analytics1021 for kernel upgrade
18:55 bblack: starting the process of fixing upload cache sizes, there will be periodic slim 5xx spikes...
16:31 Jeff_Green: temporarily disabling icinga notifications for ocg100[123] ocg service check
07:44 logmsgbot: springle Synchronized wmf-config/db-eqiad.php: move s4 api traffic to db1056 (duration: 00m 07s)
07:39 mark: Set OSPF metric 1000 on cr2-eqiad:xe-5/2/2 (GTT link)
05:39 springle: labsdb1002 restart
03:48 springle: labsdb1001 restart
03:09 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Aug 7 03:08:49 UTC 2014 (duration 8m 48s)
02:28 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-07 02:27:52+00:00
02:16 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-08-07 02:15:45+00:00
August 6
21:33 hashar: Jenkins: moved mediawiki-core-regression-hhvm-master to run on Trusty instance
20:26 hashar: Jenkins: downgraded ansicolor plugin from 0.4 to 0.3.1 Some colors.js function emits ANSI codes to reset the color which are not properly understood
18:49 hashar: Stopping Jenkins. Reverting upgrade of artifact deployer plugin
18:10 mutante: puppet-catalog-compiler says to "wait while Jenkins is getting ready to work"
17:20 hashar: Jenkins process jobs again, the UI will take a bunch of hours to load though due to some issue when initializing
17:14 hashar: killed Jenkins
17:12 _joe_: stopped the jobrunner on mw1053, was running in fcgi mode unpuppetized and with a broken vhost. Fixed it, it started spawning exceptions. DO NOT enable puppet again
17:02 ^d: jenkins restarted, was stuck
15:52 hashar: Restarted Zuul and Zuul-merger on gallium to tweak logging settings 152118
11:30 logmsgbot: hoo Synchronized wmf-config/CommonSettings.php: Grant 'centralauth-rename' to 'steward' (duration: 00m 24s)
11:26 logmsgbot: demon Synchronized wmf-config/abusefilter.php: (no message) (duration: 00m 19s)
18:57 jgage: beginning kafka upgrade: disabling puppet on brokers
13:17 apergos: stopped labs rsync job from dataset1001, mount of labstore1003 was borked, removed 90GB of stuff on /mnt/data (= /) filesystem, restarted nfsd on dataset1001, dumps back to going
03:12 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon Aug 4 03:11:03 UTC 2014 (duration 11m 2s)
02:29 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-04 02:27:58+00:00
02:17 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-08-04 02:16:46+00:00
August 3
03:30 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun Aug 3 03:28:56 UTC 2014 (duration 28m 55s)
02:28 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-03 02:27:44+00:00
02:17 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-08-03 02:16:39+00:00
August 2
15:28 godog: reboot ms-be1008, stuck on xfs errors and most processes in D state
15:48 logmsgbot: reedy Synchronized php-1.24wmf16/includes/specials/SpecialRecentchangeslinked.php: (no message) (duration: 00m 14s)
12:07 apergos: powercycled dataset1001, inaccessible via mgmt console, only visible message was 'mnt.nfs failed'
09:10 _joe_: apache mediawiki::web train finished its run. re-enabling puppet on all appservers
07:47 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Aug 1 07:46:04 UTC 2014 (duration 46m 3s)
07:24 _joe_: stopping puppet on appservers to deploy a potentially dangerous case
05:16 logmsgbot: springle Synchronized wmf-config/db-eqiad.php: Move enwiki api traffic away from lagging slaves (duration: 00m 07s)
03:12 logmsgbot: LocalisationUpdate completed (1.24wmf16) at 2014-08-01 03:11:14+00:00
02:40 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-08-01 02:38:56+00:00
00:52 logmsgbot: catrope Synchronized php-1.24wmf16/extensions/VisualEditor/lib/ve/modules/ve/ui/inspectors/ve.ui.CommentInspector.js: Fix typo in class name (duration: 00m 10s)
July 31
23:23 logmsgbot: mwalker Synchronized php-1.24wmf16: Updating core and Flow for SWAT (duration: 00m 53s)
19:24 Coren_away: labsdb1005 had to blow away the postgres slave: was using all the space on / because DB at wrong spot (should have been /srv/postgres)
18:40 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 15s)
18:27 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 15s)
18:09 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf16
18:02 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: wikipedias to 1.24wmf15
17:46 logmsgbot: reedy Finished scap: testwiki to 1.24wmf16 and build l10n cache (duration: 22m 35s)
17:23 logmsgbot: reedy Started scap: testwiki to 1.24wmf16 and build l10n cache
14:57 bblack: added labstore1003 to filter labs-in4 terms allow-labstore-(udp|tcp)4 on cr[12]-eqiad
14:33 logmsgbot: reedy Synchronized wmf-config/InitialiseSettings.php: Allow sysops and 'crats on wikimania2014wiki to grant confirmed (duration: 00m 15s)
14:12 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikivoyages to 1.24wmf15
14:12 logmsgbot: reedy Synchronized wmf-config/InitialiseSettings.php: (no message) (duration: 00m 14s)
14:05 bblack: removed labs-in4 and labs-in6 filters on vlan 1117 (labs-hosts1-a-eqiad) on cr[12]-eqiad
13:47 logmsgbot: reedy Started scap: Rebuild 1.24wmf15 l10n cache for WikimediaMessages updates
13:44 logmsgbot: reedy Synchronized php-1.24wmf15/extensions/RelatedSites/: (no message) (duration: 00m 15s)
13:44 logmsgbot: reedy Synchronized php-1.24wmf15/extensions/WikimediaMessages: (no message) (duration: 00m 14s)
12:10 hashar: stopping Jenkins and restarting it
12:04 hashar: reloading Jenkins configuration
11:37 hashar: Jenkins: upgrading almost all jobs to use a new label 'UbuntuPrecise' bug 68340150785
10:49 hashar: Jenkins: attempting to poll a Trusty slave (integration-slave1004-trusty [10.68.17.148] with label UbuntuTrusty).
10:32 hashar: Jenkins: tweaking jobs labels, that might eventually screw up Zuul/Jenkins entirely.
08:43 _joe_: start rolling reload of nginx to catch up with the new ssl config
06:50 springle: labsdb1001 migration complete, should be all systems go
03:19 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 31 03:18:07 UTC 2014 (duration 18m 6s)
02:36 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-07-31 02:35:29+00:00
02:20 logmsgbot: LocalisationUpdate completed (1.24wmf14) at 2014-07-31 02:19:17+00:00
02:06 springle: labsdb1001 migrating to mariadb 10, expect read-only and downtime, see labs-l
July 30
23:27 logmsgbot: maxsem Synchronized php-1.24wmf15/extensions/MwEmbedSupport/: (no message) (duration: 00m 03s)
23:27 logmsgbot: maxsem Synchronized php-1.24wmf15/extensions/Wikidata/: (no message) (duration: 00m 08s)
23:26 logmsgbot: maxsem Synchronized php-1.24wmf15/extensions/SyntaxHighlight_GeSHi/: (no message) (duration: 00m 05s)
23:23 logmsgbot: maxsem Synchronized php-1.24wmf14/extensions/Wikidata: (no message) (duration: 00m 11s)
23:13 logmsgbot: maxsem Synchronized wmf-config: (no message) (duration: 00m 05s)
21:04 AaronSchulz: Started populateBacklinkNamespace.php on wikidata and commons
21:02 bblack: turned icinga email/sms back on
20:24 bblack: icinga back online again
19:57 bblack: shutting off icinga to make some optimizations
19:20 bblack: icinga is now substantially back online. email/sms still disabled for now, and downtimes/acks need to be re-added for known issues
19:06 logmsgbot: csteipp Synchronized php-1.24wmf14/includes/: (no message) (duration: 00m 05s)
19:04 logmsgbot: csteipp Synchronized php-1.24wmf15/includes/: (no message) (duration: 00m 07s)
18:59 bblack: icinga coming back up again for the first time, expect random strangeness to be ignored
18:46 bblack: temporarily hard-disabling email/sms from icinga via 'mv /usr/bin/mail /usr/bin/mail-disabled' on neon to prevent icinga spam on next startup attempt
17:55 bblack: stopping icinga service for now while working out other details
17:25 tacotuesday: repooled elastic1018 and elastic1019 as well
17:21 Coren: labmon1001 rebooting (final check for proper raid+lvm autodetection)
17:08 bblack: working on bringing up new neon install (first puppet run, etc)
17:01 Coren: labmon1001 rebooting (partitioning changes on primary disks)
July 29
23:04 logmsgbot: cscott Synchronized wmf-config: (no message) (duration: 00m 12s)
23:03 logmsgbot: cscott updated /a/common to Iae1ac79d5: Enable OCG in production
22:55 cscott: updated OCG to version aeb8623d6ebe41ae7c7e36c57844bd9ea8e6d595
22:50 RoanKattouw: Fixed ownership of slot0/cache on wikitech (virt1000), was root:root but should have been www-data:www-data
22:24 RoanKattouw: Updated lib/ve submodule inside extensions/VisualEditor on virt1000; wikitechwiki was running a Frankenstein version of VE that was part yesterday's code, part code from April
21:47 logmsgbot: ori Synchronized rpc/RunJobs.php: Ia62e9158f: Added a streamlined RunJobs that can be used by redisJobService (2/2) (duration: 00m 03s)
21:47 logmsgbot: ori Synchronized multiversion: Ia62e9158f: Added a streamlined RunJobs that can be used by redisJobService (1/2) (duration: 00m 03s)
16:50 logmsgbot: hoo Synchronized wmf-config/Wikibase.php: Only declare "special" sitegroups for testwikidata (duration: 00m 07s)
16:48 logmsgbot: hoo Synchronized wmf-config/Wikibase.php: Only declare "special" sitegroups for testwikidata (duration: 00m 08s)
16:47 logmsgbot: hoo Finished scap: Updating Wikidata with various changes for testwikidata and a client bug fix. (duration: 27m 27s)
16:37 cmjohnson1: replacing defective disk virt1009
16:20 logmsgbot: hoo Started scap: Updating Wikidata with various changes for testwikidata and a client bug fix.
16:10 logmsgbot: hoo Synchronized wmf-config/Wikibase.php: Make testwikidata use the "special" sitelink group. Preparations for submodule updates. (duration: 00m 08s)
16:10 bd808: logstash log event volume up after restart
16:09 bd808: restarted logstash on logstash1001.eqiad.wmnet; log volume looked to be down from expected levels
16:08 _joe_: reenabled puppet on mw1053
16:03 logmsgbot: hoo Synchronized wmf-config/InitialiseSettings.php: Enable Wikibase other projects links per default for ruwiki (duration: 00m 07s)
15:13 manybubbles: building cirrus indexes for group0 wikis in place to turn on the weighted all field we'll use for performance improvements later
15:06 logmsgbot: manybubbles Synchronized wmf-config: SWAT - deploy cirrussearch all field stage 2 part 2 (duration: 00m 04s)
15:06 logmsgbot: manybubbles Synchronized wmf-config/InitialiseSettings.php: SWAT - deploy cirrussearch all field stage 2 part 1 (duration: 00m 04s)
13:54 logmsgbot: hashar Synchronized wmf-config/InitialiseSettings.php: added Universiteits Museum Utrecht to the wgCopyUploadsDomains array 150163 (duration: 00m 04s)
13:38 ottomata: restarted gmetad on nickel, seems to have brought ganglia back up
11:30 _joe_: upgrading packages on mw1053, for testing hhvm with pcre-jit enabled
10:35 _joe_: puppet re-enabled on the appservers
10:29 _joe_: temporarily stopping puppet on appservers, releasing a potentially dangerous puppet change
09:10 _joe_: stopping jobrunner on mw1053, disabling puppet as well - running tests
09:02 hashar: restarted zuul-server and zuul-merger on gallium (new version though that is a noop)
09:00 hashar_: Zuul bumping Zuul cloner from patchset 21 to patchset 23. Deploying with tag wmf-deploy-2014-07-29-1
07:51 akosiaris: uploaded PHP 5.3.10-1ubuntu3.13+wmf1 on apt.wikimedia.org. Puppet will upgrade it across the fleet within 20 mins
03:48 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 29 03:47:39 UTC 2014 (duration 47m 38s)
03:11 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-07-29 03:10:31+00:00
02:36 logmsgbot: LocalisationUpdate completed (1.24wmf14) at 2014-07-29 02:35:18+00:00
23:41 logmsgbot: ori Started scap: I42c07b64: Update MobileFrontend
23:33 logmsgbot: ori Synchronized php-1.24wmf15/extensions/VisualEditor: Update VisualEditor to I944f8fbfa (duration: 00m 04s)
23:25 logmsgbot: ori Synchronized wmf-config/InitialiseSettings.php: I369dbad6e: Allow crats to add/remove petitiondata group on foundationWiki (duration: 00m 04s)
23:21 AaronS: Updated /srv/jobrunner to 0bb0ad62dd9240e0f67b2ded4519f125de13dfbc
23:12 mutante: temp. disabled puppet on neon and ircecho
23:06 mutante: graceful apache on palladium
21:12 hashar: Gerrit: allowed JenkinsBot to submit patches on wikimedia/bots (and thus on all child repositories)
16:30 andrewbogott: updated wikitech to 1.24wmf15; turned on OAuth
16:05 Nemo_bis: andrewbogott> Nikerabbit: I'm upgrading it [wikitech wiki], it'll be flaky for a bit
16:00 manybubbles: done with SWAT
15:57 logmsgbot: manybubbles Synchronized php-1.24wmf14/extensions/VisualEditor/: SWAT - fix visual editor bug - Changes made after reviewing changes are not sent (when caching is enabled) (duration: 00m 07s)
15:46 logmsgbot: manybubbles Synchronized php-1.24wmf15/extensions/VisualEditor/: SWAT - fix visual editor bug - Changes made after reviewing changes are not sent (when caching is enabled) (duration: 00m 08s)
15:41 hoo: Removed all right holders from closed and inaccessible ukwikimedia (bug 68737)
19:25 hashar: zuul@gallium:/etc/zuul/wikimedia$ echo status|nc -q 3 localhost 4730|wc -l ... Yields: 0 . Which means jobs are no longer registered for some reason.
19:24 hashar: Jenkins stalled again yeahhhhh
16:59 mutante: powercycled ms-be1010 - unresponsive to ssh, nothing on mgmt
16:28 MaxSem: Updating PageImages data for mainspace on Commons from terbium
13:09 _joe_: re-enabling puppet, test run on the test host was fine.
13:03 _joe_: stopping puppet on all appservers - will reactivate after testing
11:26 logmsgbot: reedy Synchronized docroot and w: (no message) (duration: 00m 13s)
10:50 hashar: contint: manually cleared /tmp on the 3 labs jenkins slaves.
10:46 hashar: integration-slave1001.eqiad.wmflabs is out of disk space ( / /dev/vda1)
07:29 springle: shutdown tantalum per mwalker request
04:19 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 25 04:18:45 UTC 2014 (duration 18m 44s)
03:31 logmsgbot: LocalisationUpdate completed (1.24wmf15) at 2014-07-25 03:30:33+00:00
02:48 logmsgbot: LocalisationUpdate completed (1.24wmf14) at 2014-07-25 02:47:17+00:00
01:21 logmsgbot: ori Synchronized wmf-config/CommonSettings.php: Ic29ae11fa: On Labs, disable LuaSandbox's profiling feature to isolate bug 68413 (duration: 00m 04s)
July 24
15:10 hashar: Clearing out old Zuul references on operations/puppet.git might cause merge errors
15:10 logmsgbot: yurik Synchronized php-1.24wmf14/extensions/ZeroBanner: (no message) (duration: 01m 07s)
15:08 logmsgbot: yurik Synchronized php-1.24wmf13/extensions/ZeroBanner: (no message) (duration: 01m 11s)
14:30 logmsgbot: yurik Synchronized wmf-config/mobile.php: Font for zero banner (duration: 01m 10s)
13:38 hashar: Deleting old Zuul references in the Zuul maintained repository /srv/ssd/zuul/git/mediawiki/core/ on gallium bug 68481 . Should speed up merge operations on that repository.
10:10 hashar: Zuul code being installed on lanthanum.eqiad.wmnet Will let us use a merger daemon there and the Zuul cloner client. 141758
05:44 springle: labsdb1002 work in progress; it may misbehave. see labs-l for updates
03:57 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 24 03:56:32 UTC 2014 (duration 56m 31s)
03:09 logmsgbot: LocalisationUpdate completed (1.24wmf14) at 2014-07-24 03:08:05+00:00
02:37 logmsgbot: LocalisationUpdate completed (1.24wmf13) at 2014-07-24 02:36:35+00:00
00:44 ori: installing linux-tools on mw1053 to run perf on jobrunner
July 23
23:59 logmsgbot: maxsem Finished scap: Pick up messages forgotten during Zero deployment (duration: 26m 42s)
23:39 ori: running sync-common on mw1053.eqiad.wmnet
23:32 logmsgbot: maxsem Started scap: Pick up messages forgotten during Zero deployment
23:26 logmsgbot: maxsem Synchronized php-1.24wmf14/extensions/MultimediaViewer/: (no message) (duration: 00m 03s)
23:26 logmsgbot: maxsem Synchronized php-1.24wmf14/extensions/VisualEditor/: (no message) (duration: 00m 04s)
23:19 logmsgbot: maxsem Synchronized php-1.24wmf13/extensions/MobileFrontend/: (no message) (duration: 00m 04s)
23:18 logmsgbot: maxsem Synchronized php-1.24wmf14/extensions/MobileFrontend/: (no message) (duration: 00m 04s)
22:39 mutante: removed platinum from icinga
22:36 _joe_: installed mw1053 as the first hhvm jobrunner, currently stopped. Puppet disabled so that it won't restart the jobrunner automatically
21:49 logmsgbot: ori Synchronized wmf-config/CommonSettings.php: I2f366fa93: Use luastandalone on HHVM (duration: 00m 03s)
21:17 hashar: Zuul is all good. It just receives too many patches :-]
20:31 bd808|deploy: Updated /a/common to 07834a9 (beta cluster: use luastandalone); no sync needed
20:30 subbu: deployed parsoid version 47d4bc83
20:27 hashar: Having no idea how to fix zuul. Restarting it and killing the whole queue :-/
20:14 mutante: contacts.wm - set $base_url in default/settings.php to https URL, and $is_https='on' in bootstrap.inc (unpuppetized?)
19:49 logmsgbot: awight Synchronized php-1.24wmf14/extensions/CentralNotice: push many CentralNotice fixes, including GeoIP cookie and hide cookie (duration: 00m 05s)
19:49 logmsgbot: awight Synchronized php-1.24wmf13/extensions/CentralNotice: push many CentralNotice fixes, including GeoIP cookie and hide cookie (duration: 00m 05s)
19:28 logmsgbot: awight Synchronized php-1.24wmf14/extensions/CentralNotice: push many CentralNotice fixes, including GeoIP cookie and hide cookie (duration: 00m 04s)
19:27 logmsgbot: awight Synchronized php-1.24wmf13/extensions/CentralNotice: push many CentralNotice fixes, including GeoIP cookie and hide cookie (duration: 00m 04s)
18:57 hashar: reenabled Gearman plugin in Jenkins. Jobs have been reregistered and seems to be proceeding again
18:10 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias back to 1.24wmf13 due to Wikidata and Cirrus fatals
18:06 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf14
17:48 logmsgbot: mwalker Finished scap: Deploying Petition extension to the cluster (duration: 28m 27s)
17:19 logmsgbot: mwalker Started scap: Deploying Petition extension to the cluster
18:05 logmsgbot: aaron Synchronized wmf-config/jobqueue-eqiad.php: Set "daemonized" flag for the redis job queue (duration: 00m 04s)
17:33 cmjohnson: replacing disk 2 es1005
17:25 mutante: temp. stopped icinga-wm to avoid channel spam
17:24 mutante: puppetmaster on strontium had "Unexpected error in mod_passenger" causing puppet fails all over the place with error 500 on master, resumed normal after graceful
17:21 mutante: graceful'ed apache on strontium
14:37 godog: rolling reload of proxy-server on swift ms-fe1* to pick up changes
13:19 _joe_: re-enabling puppet, applying on a sample of hosts created no change according to my tests.
13:13 _joe_: temporarily disabling puppet on mw servers, will re-enable when I'm done with testing (again) the change
11:20 godog: restart proxy-server on ms-fe1003, as suspected it wasn't running the latest version
11:14 godog: restart proxy-server on ms-fe1003, double checking for a change in numbers reported to graphite
10:04 godog: stagger reload swift {account,object,container} server in ms-be.eqiad to pick up recon changes
06:01 AaronSchulz: Updated /srv/deployment/jobrunner to 4cddd5033efadf431e138c399b5d86542e32f196
03:55 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 18 03:53:55 UTC 2014 (duration 53m 54s)
03:22 ori: Updated jobrunner to d9520c9 and restarted service on all jobrunners
03:09 logmsgbot: LocalisationUpdate completed (1.24wmf14) at 2014-07-18 03:08:02+00:00
July 17
17:03 RobH: payments4 is kernel updating (per jgreen)
17:01 logmsgbot: reedy Started scap: testwiki to 1.24wmf14
15:05 logmsgbot: manybubbles Synchronized php-1.24wmf13/extensions/MultimediaViewer/: SWAT - Moving repo icon back to the right-hand side in Media Viewer (duration: 00m 05s)
15:03 logmsgbot: manybubbles Synchronized wmf-config/CommonSettings-labs.php: SWAT deploy to keep us synced, but this is a noop in prod. only does anything in beta. (duration: 00m 05s)
07:27 springle: mariadb 10 on labsdb1002:3309 cloning s5 from sanitarium db1054:3308
03:33 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Jul 17 03:32:25 UTC 2014 (duration 32m 24s)
02:47 logmsgbot: LocalisationUpdate completed (1.24wmf13) at 2014-07-17 02:46:24+00:00
02:24 logmsgbot: LocalisationUpdate completed (1.24wmf12) at 2014-07-17 02:23:08+00:00
July 16
23:55 logmsgbot: maxsem Synchronized private: Clean up old mobile cruft (duration: 00m 05s)
22:34 andrewbogott: temporarily fixed puppet on tin by restarting salt-master and salt-minion. A proper fix would involve upgrading to a salt version that fixes https://github.com/saltstack/salt/issues/6306
22:29 logmsgbot: yurik Synchronized php-1.24wmf13/extensions/ZeroBanner: (no message) (duration: 03m 55s)
22:27 ori: restarted jobrunner service on all job runners
22:18 logmsgbot: yurik Synchronized php-1.24wmf12/extensions/ZeroBanner: (no message) (duration: 04m 31s)
21:50 AaronSchulz: Updated job runners to 186b9b33
21:08 legoktm: clearing Magog the Ogre's watchlist on enwp per request (173668 entries)
14:34 _joe_: moving the stale conf-enabled directory away on jobrunners, or when we upgrade to trusty all hell will break loose
13:06 logmsgbot: oblivian gracefulled all apaches
12:14 logmsgbot: oblivian gracefulled all apaches
12:01 _joe_: removed stale files from /etc/apache2/conf-enabled on all mw hosts
11:25 logmsgbot: manybubbles Synchronized wmf-config/InitialiseSettings.php: Take Cirrus as default from more wikis while we figure out load issues (duration: 00m 06s)
10:32 _joe_: releasing a new apache config to all mediawikis
08:54 godog: repool ms-fe1004
08:51 godog: repool ms-fe1003 and depool ms-fe1004
08:46 godog: repool ms-fe1002 and depool ms-fe1003
08:39 godog: depool ms-fe1002 for swift upgrade
05:54 springle: resuming page content model schema changes, osc_host.sh processes on terbium ok to kill in emergency
04:22 springle: restarted gitblit on antimony
03:04 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 16 03:03:41 UTC 2014 (duration 3m 40s)
02:27 logmsgbot: LocalisationUpdate completed (1.24wmf13) at 2014-07-16 02:26:12+00:00
02:15 logmsgbot: LocalisationUpdate completed (1.24wmf12) at 2014-07-16 02:14:32+00:00
01:34 manybubbles: moving shards off of elastic101[789]
July 15
22:35 K4-713: synchronized payments to afa12be34769000bf8
21:34 _joe_: disabling puppet on mw1001, tests
21:26 logmsgbot: aude Synchronized php-1.24wmf13/extensions/Wikidata: Update submodule to fix entity search issue on Wikidata (duration: 00m 21s)
21:15 ori: to test r146607, locally modified upstart conf for jobrunner on mw1001 to log to /var/log/mediawiki, and restarted service
20:24 ori: restarted jobrunner on all jobrunners
20:23 AaronSchulz: Deployed /srv/jobrunner to 31e54c564d369e89613db48977eec0a5891b6498
20:21 logmsgbot: reedy Synchronized docroot and w: (no message) (duration: 00m 21s)
20:18 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non wikipedias to 1.24wmf13
20:12 Krinkle: Reloading Zuul to deploy If2312bcf18bdbe8dee
20:12 bd808: log volume up after logstash restart
20:10 bd808: restarted logstash on logstash1001; log volume looked to be down from "normal"
19:55 Reedy: Applied extensions/UploadWizard/UploadWizard.sql to rowiki (re bug 59242)
18:53 manybubbles: bouncing elastic1018 to pick up new merge policy. hopefully that'll help with io thrashing
17:58 ori: _joe_ deployed jobrunner to all job runners
17:40 manybubbles: my last attempt to lower the concurrent traffic for recovery was a failure - tried again and succeeded. that seems to have fixed the echo service disruption from taking elastic1017 out of service
17:37 ori: updated jobrunner to bef32b9120
17:29 manybubbles: elastic1017 went nuts again. just shutting elasticsearch off on it for now
17:17 manybubbles: lowered Elasticsearch concurrent recovery streams to 2 (from 3) and total write rate across those streams to 20MB/sec (from 4MB/sec). This should prevent io thrash on recovery which looked to cause echo disruptions in service while recovering from some other disruption.
16:25 _joe_: all mw servers updated
16:10 _joe_: mw1100 and onwards updated
16:00 _joe_: mw1060-mw1099 updated
15:57 manybubbles: restarting Elasticsearch on elastic1017 - it's thrashing the disk again. I'm still not 100% sure why
15:56 _joe_: mw1020-mw1059 updated
15:53 _joe_: mw101[0-9] updated
15:51 manybubbles: elasticsearch1017 is freaking out again - maybe there is something wrong with it. odds aren't good it picked up the same shard again after restart and that shard is somehow poison just for it and not the other two nodes with the same shard....
15:47 _joe_: starting rolling update of all appservers to apache2 2.2.22-1ubuntu1.6, half of them are on 2.2.22-1ubuntu1.5 now
15:42 manybubbles: setting the filter cache on one node in the cluster set it on all. yay, I guess. Anyway, I'm going to let it soak for a while.
15:32 manybubbles: setting filter cache size to 20% on elastic1001 to see if it takes/helps us
15:18 anomie: anomie actually committed a live hack someone left on tin (removing db1035)
15:16 logmsgbot: anomie updated /a/common to I7ca6a16d5: Switch jawiki back to lsearchd
13:52 manybubbles: after switching jawiki back to lsearchd by default load is mostly recovered. the cluster is still healing from bouncing elastic1017 and that'll take a while. the load will be a bit high during that but searches are coming back in a reasonable amount of time again
13:42 logmsgbot: manybubbles Synchronized wmf-config/InitialiseSettings.php: jawiki back to lsearchd (duration: 00m 05s)
13:38 manybubbles: elastic1017 had a load average of 60 - was thrashing in io. bounced Elasticsearch. let's see if it recovers on its own
09:09 _joe_: restarting mailman on sodium, again, for testing
08:50 godog: restart mailman on sodium after inodes freed
07:27 _joe_: restarted mailman on sodium
07:22 _joe_: stopping mailman on sodium for repairing
06:54 _joe_: killed jenkins stale process on gallium, stuck in a futex while shutting down
04:48 springle: db1035 crash cycle. down for memtest and stuff
03:34 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue Jul 15 03:33:38 UTC 2014 (duration 33m 37s)
03:01 logmsgbot: LocalisationUpdate completed (1.24wmf13) at 2014-07-15 03:00:03+00:00
02:30 logmsgbot: LocalisationUpdate completed (1.24wmf12) at 2014-07-15 02:29:02+00:00
02:27 springle: powercycle db1035 unresponsive
July 14
23:52 logmsgbot: mwalker Finished scap: Updating for SWAT 146304, 146306, 146149, 146165, 146166, 146282, and 146281. Also finishing awight's deploy of FundraisingTranslateWorkflow. (duration: 19m 42s)
23:32 logmsgbot: mwalker Started scap: Updating for SWAT 146304, 146306, 146149, 146165, 146166, 146282, and 146281. Also finishing awight's deploy of FundraisingTranslateWorkflow.
20:22 cscott: updated Parsoid to version d51e64097bb1b18e356584d4f3ddcfd90a6071ba
19:57 ori: postponing jobrunner deployment to tomorrow; ran over time
19:45 _joe_: doing the same on mw1064, segfaulted for the same reason
19:44 _joe_: killed a lone apache2 child on mw1152, stuck in a futex, after a segfault of another apache process. Restarted apache, now working correctly
19:03 godog: re-enabling mailman on sodium, missing list config restored
18:49 logmsgbot: awight Synchronized wmf-config: Deploying FundraisingTranslateWorkflow on metawiki (t
18:03 logmsgbot: awight updated /a/common to Ie7599fb6e: jawiki gets Cirrus as primary search
17:43 Krinkle: npm-cache for integration slaves got corrupted again. Depooling/Repooling integration-slave100{1,2,3} one by one to clear cache and let it warm up again.
17:35 Krinkle: Jenkins slaves in labs are unable to reach zuul.eqiad.wmnet
17:10 andrewbogott: purging old local-* service group entries from labs ldap (via purgeOldServiceGroups.php)
17:05 godog: started mailman on sodium post-reboot
18:10 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf13
18:02 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf12
17:25 logmsgbot: hoo Synchronized php-1.24wmf13/extensions/Wikidata/: Fix a UI issue and two API related flaws (same version as for wmf12) (duration: 00m 09s)
17:21 logmsgbot: hoo Synchronized php-1.24wmf12/extensions/Wikidata/: Fix a UI issue and two API related flaws (duration: 00m 14s)
16:04 godog: restarted pdns in turn on virt1000 and virt0 after opendj ulimit change
15:56 hashar: gallium running a rather long du command in a screen. Need to have a good figure of how much disk space each job consumes
15:50 logmsgbot: reedy Finished scap: testwiki to 1.24wmf13 and build l10n cache (duration: 32m 09s)
15:18 logmsgbot: reedy Started scap: testwiki to 1.24wmf13 and build l10n cache
15:15 ottomata: reinstalling analytics1026 and analytics1027
14:10 godog: ran swift-dispersion-populate on eqiad and esams swift clusters
14:04 godog: cycle-restarting swift proxy-server on ms-fe to apply config updates
13:09 godog: restart pdns on virt1000
12:48 springle: ongoing schema changes: pl_from_namespace gerrit 117373. on terbium, osc_host.sh processes ok to kill in emergency
12:43 godog: restart opendj on virt1000 with higher ulimit -n
12:29 godog: restarted opendj on virt1000, ran out of fd
10:29 godog: restart profiler-to-carbon on tungsten, seemingly cpu spinning
01:06 bblack: cleared icinga downtimes for ulsfo (we now have some traffic back there)
00:50 logmsgbot: mattflaschen Synchronized php-1.24wmf11/extensions/GuidedTour/: GuidedTour cherry-pick to 1.24wmf11 in support of GettingStarted anonymous editor acquisition test (duration: 00m 09s)
16:51 logmsgbot: csteipp Finished scap: Update CentralAuth for Global Rename (duration: 28m 46s)
16:22 logmsgbot: csteipp Started scap: Update CentralAuth for Global Rename
16:17 mark: ulsfo is now offline
16:16 mark: Shutdown NTT BGP sessions on cr2-ulsfo
16:13 mark: Shutdown TiNet BGP sessions on cr1-ulsfo
16:10 mark: Shutdown IXP BGP sessions on cr2-ulsfo
16:10 mark: Shutdown WMF HQ BGP sessions on cr2-ulsfo
16:09 mark: Shutdown WMF HQ BGP sessions on cr1-ulsfo
16:02 logmsgbot: hoo Synchronized php-1.24wmf12/extensions/Wikidata/: Update Wikibase to fix a fatal and various JS things (duration: 00m 14s)
14:13 hashar: Jenkins: bringing back puppet-compiler02.eqiad.wmflabs node online. /tmp gets filled when running huge catalog compilations which causes Jenkins to unpool the node :/
13:30 godog: reboot ms-be1005, raid controller confused (?) after disk replacement
12:52 godog: umounted sdg1 on ms-be1005, device disappeared, errors in dmesg
12:35 bblack: enabled amssq47 text frontend cache in pybal for esams
09:39 hashar: Jenkins had a bit of failure earlier due to the massive configuration update of mediawiki-core and mwext jobs. If that fails again the best thing is to stop Jenkins on gallium , wait for it to be killed or force kill -9 the java process then restart Jenkins. Should sort it out
09:30 hashar: restarted Zuul to clear out stalled items in queue
09:12 hashar: Jenkins being slow because the mediawiki-core* jobs history cache has been wiped out while updating their configuration. Jenkins is busy processing the history :(
09:02 hashar: Jenkins killing slave process on lanthanum. Some job is stalled and unrecoverable.
08:53 godog: upgrade ms-be1013/1014/1015 (zone5) to icehouse swift
08:51 hashar: Jenkins migrating jobs to use $ZUUL_URL instead of git://zuul.eqiad.wmnet Preparing to scale out Zuul merger to several nodes
08:19 godog: upgrade ms-be1009/1010/1011 (zone4) to swift icehouse
08:04 hashar: Jenkins: granted matanya the ability to manually trigger builds. Use case: the puppet compiler!
08:02 godog: upgrade ms-be1005/1006/1007 (zone3) to swift icehouse
03:37 mutante: ran puppet on neon - false puppet failure alarms
02:55 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Wed Jul 9 02:54:37 UTC 2014 (duration 54m 36s)
02:26 logmsgbot: LocalisationUpdate completed (1.24wmf12) at 2014-07-09 02:25:33+00:00
02:15 logmsgbot: LocalisationUpdate completed (1.24wmf11) at 2014-07-09 02:14:38+00:00
20:07 legoktm: finished migrateAccount.php --safe, now starting migrateAccount.php --attachbroken
20:05 mutante: restarted apache on ytterbium
19:47 K4-713: updated payments fraud filters again
19:47 legoktm: running migrateAccount.php --safe for accounts only existing on one wiki (bug 39817)
19:27 mutante: this should have fixed all the services behind misc. varnish now getting an actual "A" rating on ssllabs
19:20 mutante: arr, i meant "nginx", not varnish
19:15 mutante: restarting varnish on cp1043/cp1044 (misc cluster)
18:55 cmjohnson1: disconnecting serial cable from psw1-c2-eqiad
18:50 csteipp: patch for bug66608 deployed to wmf11/12
18:50 K4-713: updated fraud filters on payments cluster
18:28 logmsgbot: reedy Synchronized robots-private.txt: (no message) (duration: 00m 14s)
18:27 logmsgbot: reedy Synchronized wmf-config/: (no message) (duration: 00m 15s)
18:20 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non Wikipedias to 1.24wmf12
15:22 logmsgbot: reedy Purged l10n cache for 1.24wmf10
15:21 logmsgbot: reedy Purged l10n cache for 1.24wmf9
15:15 logmsgbot: anomie Synchronized php-1.24wmf11/extensions/Scribunto/: SWAT: Fix regression in os.date and os.time at module scope gerrit:144559 (duration: 00m 10s)
15:14 logmsgbot: anomie Synchronized php-1.24wmf12/extensions/Scribunto/: SWAT: Fix regression in os.date and os.time at module scope gerrit:144511 (duration: 00m 11s)
15:10 logmsgbot: anomie Synchronized php-1.24wmf11/extensions/UploadWizard/UploadWizard.config.php: SWAT: Flickr API is https-only now gerrit:144584 (duration: 00m 10s)
15:04 logmsgbot: anomie Synchronized php-1.24wmf12/extensions/UploadWizard/UploadWizard.config.php: SWAT: Flickr API is https-only now gerrit:144583 (duration: 00m 10s)
13:34 springle: slow transaction rollback in progress on db1001 librenms. other databases not affected, but librenms writes are timing out
13:32 cmjohnson1: replacing disk 6 ms-be1005
13:30 cmjohnson1: replacing disk 4 ms-be1007
12:38 YuviPanda: disregard previous log message, was meant for labs
12:37 YuviPanda: graphite reduced metrics count from 65k to 25k, monitoring io performance
06:57 logmsgbot: springle Synchronized wmf-config/db-eqiad.php: raise db traffic samplers to normal load (duration: 00m 06s)
July 7
09:40 godog: upgrade ms-be1003/1004/1012 (zone2) to swift icehouse
09:16 _joe_: restarting rhenium, pings but no ssh for 2 days, serial console is blank and unresponsive
09:15 godog: upgrade ms-be1002/1008 (zone1) to swift icehouse
02:53 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon Jul 7 02:52:10 UTC 2014 (duration 52m 9s)
02:24 logmsgbot: LocalisationUpdate completed (1.24wmf12) at 2014-07-07 02:23:48+00:00
02:13 logmsgbot: LocalisationUpdate completed (1.24wmf11) at 2014-07-07 02:12:44+00:00
July 6
02:50 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun Jul 6 02:49:21 UTC 2014 (duration 49m 20s)
02:25 logmsgbot: LocalisationUpdate completed (1.24wmf12) at 2014-07-06 02:24:08+00:00
02:14 logmsgbot: LocalisationUpdate completed (1.24wmf11) at 2014-07-06 02:13:07+00:00
July 5
02:53 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Jul 5 02:52:04 UTC 2014 (duration 52m 3s)
02:27 logmsgbot: LocalisationUpdate completed (1.24wmf12) at 2014-07-05 02:26:08+00:00
02:16 logmsgbot: LocalisationUpdate completed (1.24wmf11) at 2014-07-05 02:15:05+00:00
01:22 springle: ongoing osc_host.sh schema change jobs on terbium. fine to kill in an emergency
July 4
20:05 hoo: Ran sync-common on fenari to update the docs on noc.wikimedia.org
15:40 _joe_: restarting salt-minion, killing io hungry job on fenari running since jun 30, 00 AM
12:28 akosiaris: executed dist-upgrade on virt1000. Keystone configure phase failed in keystone-manage db_sync and hence dpkg configure failed. It was trying to create an already existing index in the database. Dropped the index, ran dpkg --configure -a to recreate the index (and whatever else keystone-manage db_sync does). All is back to normal.
03:29 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Jul 4 03:28:29 UTC 2014 (duration 28m 28s)
03:03 logmsgbot: LocalisationUpdate completed (1.24wmf12) at 2014-07-04 03:02:49+00:00
02:33 logmsgbot: LocalisationUpdate completed (1.24wmf11) at 2014-07-04 02:32:29+00:00
00:28 gwicke: deployed parsoid config change e21a534 to support VE on the OTRS wiki
July 3
23:40 mutante: osmium - libboost-dev : Depends: libboost1.54-dev but it is not going to be installed
23:33 mutante: rhenium (pmacct / flow) Out of memory: Kill process 3123 (pmacctd) score 1 or sacrifice child
23:22 K4-713: updated payments to c5689f385b2f0a7bdc55c5752010e9eb
20:27 manybubbles: Adding cache warmers to all Cirrus indexes for group1 wikis with more than one shard except commons (commons is busy, it'll have to wait:)
19:16 awight: updated payments from a04e536b6923f2228bb7f5fbf2caeed64a888742 to 2b6c527617dcde154cc298dd9697c9d57c9f3620
18:41 awight: updated payments from a8138fefd940ba41812e5c07ca6bc74b63cb9bcf to a04e536b6923f2228bb7f5fbf2caeed64a888742
17:38 manybubbles: Cirrus reindex update! all wikipedias finished their in place reindex except ruwiki - that one is running now. all group1 wikis finished their from mediawiki reindex except commons and mgwiktionary which are running now. started from mediawiki reindex of all wikipedias except for enwiki, itwiki, and cawiki which are already long done.
17:12 logmsgbot: manybubbles Synchronized cirrus.dblist: Enabled CirrusSearch as the default search backend on 30 more wikis - take five (duration: 00m 04s)
17:08 logmsgbot: manybubbles Synchronized wmf-config/: Enable CirrusSearch as the default search backend on 30 more wikis - take four (duration: 00m 04s)
17:08 logmsgbot: manybubbles Synchronized wmf-config/: Enable CirrusSearch as the default search backend on 30 more wikis - for real for real (duration: 00m 04s)
17:07 logmsgbot: manybubbles Synchronized wmf-config/: Enable CirrusSearch as the default search backend on 30 more wikis - for real (duration: 00m 04s)
17:05 logmsgbot: manybubbles Synchronized wmf-config/: Enable CirrusSearch as the default search backend on 30 more wikis (duration: 00m 05s)
15:43 logmsgbot: manybubbles Synchronized php-1.24wmf11/extensions/Wikidata/: (no message) (duration: 00m 09s)
15:35 logmsgbot: manybubbles Synchronized php-1.24wmf11/extensions/VisualEditor/: SWAT Correctly VisualEditor - update full size in MediaSizeWidget (duration: 00m 07s)
15:26 logmsgbot: manybubbles Synchronized wmf-config/: SWAT - disable local uploads on Malay Wiktionary (duration: 00m 04s)
16:44 hoo: Jenkins/ Zuul not reacting for at least half an hour now
16:43 awight: update tools from 3a35482ab1fede2ccfcc49a64ec661b0cb013b81 to e894f1f77674b6b101ae0e1644e363ca52e319d9
16:09 awight: updated payments from 6d74002f2634f41f7038daa7357ff6de55ee4880 to a8138fefd940ba41812e5c07ca6bc74b63cb9bcf
02:45 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun Jun 29 02:44:35 UTC 2014 (duration 44m 34s)
02:22 logmsgbot: LocalisationUpdate completed (1.24wmf11) at 2014-06-29 02:21:01+00:00
June 28
17:16 ori: restarted lucene on search1016 per _joe_
12:58 manybubbles: Cirrus reindex status: enwiki has almost finished its in place reindex, alphabetical wikipedias are at frwiki, all group1 wikis have finished their in place reindex. all group1 wikis are running from mediawiki reindex. itwiki and cawiki both finished both the in place and from mediawiki reindex. Haven't started alphabetical from mediawiki reindex yet for wikipedias. that is the only
10:40 _joe_: restarting lucene on search1015, stuck. again.
02:47 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Jun 28 02:46:49 UTC 2014 (duration 46m 48s)
02:25 logmsgbot: LocalisationUpdate completed (1.24wmf11) at 2014-06-28 02:24:12+00:00
02:16 logmsgbot: LocalisationUpdate completed (1.24wmf10) at 2014-06-28 02:15:38+00:00
20:57 awight: updated crm from 340c43a15a84a9392ad5ef9fc2782243ff140deb to 17439326ca4488ece843a263fc14859b38cff0e9
19:33 hashar: puppet-compiler: removed modules/varnish at root@puppet-compiler02:/opt/wmf/software/compare-puppet-catalogs/external/puppet and reset the repo.
19:07 awight: update crm from e2fe03a9cd51e30206d9a1114d62dfbd6960816b to 340c43a15a84a9392ad5ef9fc2782243ff140deb
03:31 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Jun 27 03:30:00 UTC 2014 (duration 29m 59s)
03:06 logmsgbot: LocalisationUpdate completed (1.24wmf11) at 2014-06-27 03:05:31+00:00
02:36 logmsgbot: LocalisationUpdate completed (1.24wmf10) at 2014-06-27 02:35:20+00:00
June 26
23:32 manybubbles: Cirrus rebuild progress - started large/high cirrus visibility wikis in group2 - enwiki, cawiki, and itwiki.
23:31 manybubbles: Cirrus rebuild progress - alphabetical wikis in group2 are 2/3 of the way done with reindex - from mediawiki rebuild is maybe 20% done there
23:31 manybubbles: Cirrus rebuild progress - big wikis in group1 are finished with in place reindex and well into from mediawiki rebuild.
23:27 ori: Previous scap included I2cfcfaf06 as well
12:55 hashar: Jenkins: updates jobs for extensions (phpunit and qunit) to use the mw-run-update-script.sh instead of update.php . That runs update.php twice, the first time logging sql to a file that can be archived. 141851
12:48 mark: Deactivated BGP session to AS13030
11:01 hashar: Replacing operations-puppet-validate job with operations-puppet-pplint-HEAD which is faster and can run concurrently on multiple boxes. 142223
10:52 godog: stopping swift on ms-be3003
10:12 godog: upgrading ms-be3001 to swift icehouse
06:26 springle: ran operations/software maintain-replicas.pl and fedtables.pl on labsdbs for bug 59683
05:54 Tim: on mw1014: reformatted the /tmp partition
05:50 Tim: on mw1014: stopped job runner due to bad /tmp
04:44 ori: mw1014 is sad, has filesystem issues: "Attempt to read block from filesystem resulted in short read while trying to open /tmp". Puppet can't run. Should be depooled.
03:34 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu Jun 26 03:33:19 UTC 2014 (duration 33m 18s)
03:02 logmsgbot: LocalisationUpdate completed (1.24wmf10) at 2014-06-26 03:01:43+00:00
02:32 logmsgbot: LocalisationUpdate completed (1.24wmf9) at 2014-06-26 02:31:50+00:00
June 25
23:43 awight: updated crm from f3389daa94e9ad924175bdf0d5bc09c4a26aeb8c to e2fe03a9cd51e30206d9a1114d62dfbd6960816b
13:30 Krinkle: Depooling integration-slave1003 as almost every other -npm build on this node fails due to corrupted ~/.npm cache
12:52 manybubbles: cirrus rebuild update: starting from mediawiki reindex step for all alphabetical wikis that have finished so far
12:48 manybubbles: cirrus rebuild update: started rebuilding group1's indexes yesterday. commons and wikidata finished their in place pass and started their from mediawiki pass. The remaining wikis are running their in place pass in alphabetical order and currently on frwiktionary.
19:58 hashar: Upgraded Zuul on gallium.wikimedia.org to install the zuul-cloner of doom. 4f9fd51..9839edb Tagged wmf-deploy-20140624-1 in our repo.
19:39 manybubbles: rebuilding search index for group1 wikis after upgrade today
18:27 logmsgbot: reedy Synchronized docroot and w: (no message) (duration: 00m 14s)
18:25 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Non Wikipedias to 1.24wmf10
17:52 logmsgbot: manybubbles Synchronized wmf-config: Drop Cirrus indexes to five shards on rebuild and switch all wikis to new highlighter (duration: 00m 04s)
19:22 hashar: gallium / zuul : deleting /var/lib/zuul/git old Zuul repositories. They have been migrated to /srv/ssd/zuul/git/ ages ago
19:20 jgage: ms-be3003 full root partition fixed, swift had written to /srv/swift-storage/sdk1 onto root due to unmounted sdk1
17:38 bblack: lvs1005:eth3 was negotiated to 100mbps (???) - disable -> enable on switch fixed it
17:36 godog: restarted salt-master on palladium, suspected job cleanup stuck
17:04 bd808: Fixed dangling symlink for /etc/apache2/sites-enabled/logstash.wikimedia.org on logstash1001 by deleting symlink and forcing puppet run
16:49 godog: added mw1149-52 back to pybal apache
16:33 paravoid: switched inbound mail for all non-wikimedia.org domains from mchenry/sodium to polonium/mchenry (~16:00 + <= 1h TTL UTC)
15:13 logmsgbot: anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Add a Library of Congress domain to wgCopyUploadsDomains gerrit:141308 (duration: 00m 14s)
15:11 logmsgbot: anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Adjust group rights on ruwiki gerrit:140910 (duration: 00m 14s)
15:10 logmsgbot: anomie Synchronized php-1.24wmf9/includes/api/ApiExpandTemplates.php: SWAT: Fix fatal in API action=expandtemplates with Scribunto gerrit:141416 (duration: 00m 15s)
15:04 logmsgbot: anomie Synchronized php-1.24wmf10/includes/api/ApiExpandTemplates.php: SWAT: Fix fatal in API action=expandtemplates with Scribunto gerrit:141417 (duration: 00m 14s)
14:55 andrewbogott: reenabling puppet on labstore1001, hoping it doesn't break labs
14:38 hashar: Further upgraded Zuul up to upstream b8c24ce + our local hacks. Git tag is wmf-deploy-20140623-4
14:14 hashar: upgraded Zuul by one commit (that introduces swift support, though it is disabled on our setup via a custom hack)
13:20 paravoid: switching outbound email to polonium
12:17 manybubbles: rebuilding Cirrus index on group0 wikis to pick up changes like results boosting from categories and wikitext search
10:37 godog: powering down maerlant, decom-med
10:05 godog: hardreset maerlant, stuck on console and no ssh
18:37 ori: neon, logstash100x, zirconium, stat1001, netmon1001: replaced sites-enabled symlinks with their targets and forced puppet-run to clean up after Iddc778a28
18:37 logmsgbot: reedy Started scap: scap 1.24wmf10 take 2...
18:08 logmsgbot: reedy Started scap: testwiki to 1.24wmf10 and build l10n cache
17:29 logmsgbot: hoo Synchronized php-1.24wmf9/extensions/Wikidata/: Update Wikidata to fix the entity selector (duration: 00m 09s)
15:51 mutante: powercycling elastic1017 (went down and no console output)
15:13 godog: removed old pmtpa swift stats from graphite
15:04 logmsgbot: anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Put testwiki namespaces in the right place gerrit:140261 (duration: 00m 14s)
15:04 logmsgbot: anomie Synchronized wmf-config/InitialiseSettings.php: SWAT: Put testwiki namespaces in the right place gerrit:140261 (duration: 00m 15s)
15:02 logmsgbot: anomie Synchronized wmf-config/throttle.php: SWAT: Raise account creation limit for Telugu Wikipedia workshop on June 23 gerrit:140669 (duration: 00m 15s)
14:30 cmjohnson1: replacing failed disk slot3 es1006
18:46 paravoid: rebooting ms-be1001, XFS: Internal error XFS_WANT_CORRUPTED_RETURN, lots of processes in D
03:08 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Jun 14 03:07:14 UTC 2014 (duration 7m 13s)
02:37 logmsgbot: LocalisationUpdate completed (1.24wmf9) at 2014-06-14 02:36:35+00:00
02:36 bblack: enabled amssq43-46 frontends (esams text varnish) in pybal
02:17 logmsgbot: LocalisationUpdate completed (1.24wmf8) at 2014-06-14 02:16:38+00:00
00:46 bblack: enabled amssq39-42 frontends (esams text varnish) in pybal
June 13
22:01 manybubbles: logstash1002 seems to be properly restoring nodes to itself. I'll monitor it for the next few minutes but I believe my work here is done.
21:55 manybubbles: bouncing logstash1002 because it seems stuck. not sure why. no useful logs.
21:07 bblack: turned on amssq35-38 text frontends in esams (in pybal)
20:57 awight: update crm from c38296add61421f87e12cb5b4f3dd68bdf2340db to e52a4eb1bfab622f612dc84f687678fff1fdbc04
20:23 bblack: turned on amssq31-34 text frontends in esams
18:41 mutante: DNS update - removing manutius' public IP
18:31 mutante: shutting down manutius, decom
18:22 logmsgbot: ori Synchronized php-1.24wmf9/extensions/Math: I498053de4: Fix the VisualEditor parts of Math-wmf9 with a working cherry pick of I7d5e1174 (duration: 00m 08s)
03:54 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Jun 13 03:53:17 UTC 2014 (duration 53m 16s)
03:12 logmsgbot: LocalisationUpdate completed (1.24wmf9) at 2014-06-13 03:11:28+00:00
02:35 logmsgbot: LocalisationUpdate completed (1.24wmf8) at 2014-06-13 02:34:41+00:00
00:45 logmsgbot: ori Synchronized php-1.24wmf8/extensions/Math: Reverting Extension:Math to 1bb3bfa3b5656 (duration: 00m 05s)
00:44 logmsgbot: ori Synchronized php-1.24wmf9/extensions/Math: Reverting Extension:Math to 1bb3bfa3b5656 (duration: 00m 06s)
00:41 ori: removed Physikerwelt and Frédéric Wang from extension-Math group in Gerrit pending further inquiry into recent changes
00:38 logmsgbot: ori Finished scap: fix any lingering inconsistencies in the state of the app servers (see https://gerrit.wikimedia.org/r/139089) (duration: 26m 59s)
23:35 logmsgbot: ori Synchronized php-1.24wmf8/extensions/MobileFrontend: Re-syncing after submodule update (duration: 00m 06s)
23:34 ori: ran sync-common on mw1151
23:17 logmsgbot: catrope Synchronized php-1.24wmf9/extensions/MobileFrontend: (no message) (duration: 00m 04s)
23:17 logmsgbot: catrope Synchronized php-1.24wmf8/extensions/MobileFrontend: (no message) (duration: 00m 05s)
23:17 logmsgbot: catrope Synchronized php-1.24wmf8/extensions/VisualEditor: (no message) (duration: 00m 04s)
23:07 Krinkle: integration-slave1003 is failing npm-test builds due to a cache corruption (filed as https://github.com/npm/npm/issues/5472). Manually cleared /mnt/home/jenkins-deploy/.npm/async on integration-slave1003.eqiad.wmflabs for now.
23:05 MaxSem: Purging PageImages data from Wikibooks and Wikisource
22:59 logmsgbot: catrope Synchronized wmf-config/: (no message) (duration: 00m 04s)
22:46 logmsgbot: ori Synchronized wmf-config/CommonSettings.php: disable MW_MATH_MATHML until mathoid table is created (BUG 66492) (duration: 00m 04s)
22:31 logmsgbot: ori Synchronized php-1.24wmf8/extensions/WikimediaEvents: Update WikimediaEvents for Ibd36da416 (duration: 00m 03s)
22:30 logmsgbot: ori Synchronized php-1.24wmf9/extensions/WikimediaEvents: Update WikimediaEvents for Ibd36da416 (duration: 00m 03s)
18:37 logmsgbot: reedy Started scap: 1.24wmf9 staging take 2...
18:06 logmsgbot: reedy Started scap: testwiki to 1.24wmf9 and build l10n cache
17:49 ottomata: disabling puppet on analytics1012 and analytics1022
17:48 ottomata: starting some kafka failure tests, I have scheduled downtime for some service checks in icinga, hopefully this will not be noisy
17:41 ottomata: restarting elasticsearch on logstash servers
17:34 logmsgbot: yurik Synchronized wmf-config/InitialiseSettings.php: Enabling new zero ext on all wikis (duration: 01m 03s)
17:22 logmsgbot: yurik Synchronized wmf-config/InitialiseSettings.php: Attempting to enable new zero ext on zerowiki & ruwiki - take3 (duration: 01m 04s)
17:06 logmsgbot: yurik Synchronized php-1.24wmf8/extensions/: (no message) (duration: 01m 12s)
17:05 greg-g: yurik's blank sync message could have been: Deploying new JsonConfig,ZeroBanner,ZeroPortal extensions (refactoring ZeroRatedMobileAccess ext)
17:04 logmsgbot: yurik Synchronized php-1.24wmf7/extensions/: (no message) (duration: 01m 15s)
23:50 mwalker: clearing resourceloader blobs on commonswiki to try and force a multimediaviewer message "mwscript extensions/WikimediaMaintenance/clearMessageBlobs.php --wiki=commonswiki"
23:49 awight: updated SmashPig from 98b1f348aa55f6a3aac441db08a59ca309fade7a to 22e2923a3a030b17815181574f9ca99b38c5f2dc
23:41 logmsgbot: mwalker Finished scap: SWAT deploy for MultimediaViewer, CentralNotice, and testwiki config (duration: 24m 16s)
23:16 logmsgbot: mwalker Started scap: SWAT deploy for MultimediaViewer, CentralNotice, and testwiki config
23:10 Krinkle: Running deleteEqualMessages.php on trwiki (bug 43917)
22:58 logmsgbot: yurik Synchronized wmf-config/: Restoring to ZRMA for now (duration: 01m 04s)
22:22 logmsgbot: yurik Synchronized wmf-config/InitialiseSettings.php: Attempting to enable new zero ext on zerowiki & ruwiki - take2 (duration: 01m 06s)
22:19 ^d: restarted elasticsearch on logstash1003, complaining about heap.
22:06 logmsgbot: yurik Synchronized wmf-config/InitialiseSettings.php: Attempting to enable new zero ext on zerowiki & ruwiki (duration: 01m 12s)
21:58 logmsgbot: yurik Synchronized php-1.24wmf8/extensions/JsonConfig/: (no message) (duration: 01m 11s)
21:56 logmsgbot: yurik Synchronized php-1.24wmf7/extensions/JsonConfig/: (no message) (duration: 01m 09s)
21:50 logmsgbot: yurik Finished scap: (no message) (duration: 25m 51s)
21:46 ori: Disabling Puppet on mw1149. It's a former bits app server that isn't in PyBal so it isn't getting traffic. Going to stage some proposed changes for apache-config and operations/puppet there.
21:24 logmsgbot: yurik Started scap: (no message)
21:05 logmsgbot: yurik Finished scap: Deploying 3 new ext (JsonConfig, ZeroBanner, ZeroPortal), but they are not enabled anywhere yet (duration: 05m 03s)
21:00 logmsgbot: yurik Started scap: Deploying 3 new ext (JsonConfig, ZeroBanner, ZeroPortal), but they are not enabled anywhere yet
20:07 gwicke: deployed Parsoid 3de0dba15
19:18 bblack: rebooting lvs3003 for 3.13 kernel
19:17 logmsgbot: marktraceur Finished scap: MultimediaViewer fixes for cards 630, 429, and 697 (duration: 18m 45s)
14:48 bblack: rebooting lvs3004.esams (inactive uploads LVS) for 3.13 kernel
14:41 _joe_: manually ran 'planet' on en.planet to restore technews
14:40 hashar: Jenkins updating plugins
13:56 paravoid: upgrading mw1153-mw1160, tmh1001-tmh1002 for USN-2244-1
12:21 _joe_: set up a secondary remote named 'readonly' in /a/common on tin, to use with the icinga check for unmerged commits
11:40 akosiaris: manually cleaning librenms tables. db1001 is going to have increased load for some time. The approach is automatable, see http://jira.observium.org/browse/OBSERVIUM-757
11:32 godog: restarted uwsgi on tungsten, a lot of accesses to reqstats.edits.*.submits
10:45 godog: restarted uwsgi on tungsten, hung on fetching many metrics
09:54 _joe_: restarted apache on palladium - passenger crashed
05:26 paravoid: restarting all swift daemons across the cluster to fix runaway threads due to rsyslog restart
07:42 springle: enabled pt-slave-delay for dbstore1001, 24h all shards
06:12 springle: xtrabackup clone db1043 to db1048
04:57 springle: db1048 down for upgrade
03:40 springle: switched mchenry to use m2-master/m2-slave for OTRS address lookups
03:25 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue Jun 10 03:24:19 UTC 2014 (duration 24m 18s)
02:29 logmsgbot: LocalisationUpdate completed (1.24wmf8) at 2014-06-10 02:28:14+00:00
02:27 springle: switched traffic db1048 to db1020. broke gerrit briefly; see ops email
02:15 logmsgbot: LocalisationUpdate completed (1.24wmf7) at 2014-06-10 02:14:41+00:00
01:33 chasemp: restarted gerrit on ytterbium
01:01 manybubbles: upgraded all elasticsearch servers in production to 1.2.1. They are just restoring the last few shards on the last node now and they'll spend a few hours tonight rebalancing after the upgrade but otherwise I'm done.
00:41 mwalker: updating donationinterface on payments from b4c5cf1bceb70d65eae28cdd0873036dc33c8992 to 6d74002f2634f41f7038daa7357ff6de55ee4880 for worldpay form error
June 9
23:58 manybubbles: lied - upgrading elastic1014
23:57 manybubbles: upgrading elastic1015
23:30 Krinkle: Reloading Zuul to deploy 6727b8b
23:12 logmsgbot: maxsem Synchronized php-1.24wmf8/extensions/MobileApp: (no message) (duration: 00m 03s)
23:11 logmsgbot: maxsem Synchronized php-1.24wmf7/extensions/MobileApp: (no message) (duration: 00m 03s)
15:05 logmsgbot: anomie Synchronized php-1.24wmf8/extensions/VisualEditor/modules/ve-mw/: SWAT: VE fix for focus regression and alignment issues gerrit:137971 gerrit:138122 (duration: 00m 14s)
15:01 manybubbles: successfully synced plugins, upgrading elastic1001 to make sure everything is working ok with it - then we'll run through the others more quickly
14:57 manybubbles: syncing elasticsearch plugins for 1.2.1 - any elasticsearch restart from here on out needs to come with 1.2.1 or the node will break.
14:54 manybubbles: starting Elasticsearch upgrade with elastic1001
07:14 springle: disabled puppet on analytics1021 to avoid kafka broker restarting with missing mount
05:15 springle: xtrabackup clone db1046 to db1020
04:44 springle: umount /dev/sdf on analytics1021, fs in r/o mode, kafka broker not running. no checks yet
03:24 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon Jun 9 03:23:05 UTC 2014 (duration 23m 4s)
02:29 logmsgbot: LocalisationUpdate completed (1.24wmf8) at 2014-06-09 02:28:08+00:00
02:15 logmsgbot: LocalisationUpdate completed (1.24wmf7) at 2014-06-09 02:14:46+00:00
June 8
23:27 p858snake|l: icinga has been shitting in the channel for 9+ hours (before I went to bed) about Varnishkafka, nothing noted in SAL. Here be a note about it.
03:22 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun Jun 8 03:21:28 UTC 2014 (duration 21m 27s)
02:28 logmsgbot: LocalisationUpdate completed (1.24wmf8) at 2014-06-08 02:27:21+00:00
02:15 logmsgbot: LocalisationUpdate completed (1.24wmf7) at 2014-06-08 02:14:10+00:00
June 7
23:48 hoo: Fixed four CentralAuth log entries on meta which were logged for WikiSets/0
21:36 manybubbles: that means I turned off puppet and shut down Elasticsearch on elastic1017 - you can expect the cluster to go yellow for half an hour or so while the other nodes rebuild the redundancy that elastic1017 had
21:35 manybubbles: after consulting logs - elastic1017 has had high io wait since it was deployed - I'm taking it out of rotation
21:31 manybubbles: elastic1017 is sick - thrashing to death on io - restarting Elasticsearch to see if it recovers unthrashed
17:56 godog: restarted ES on elastic1017.eqiad.wmnet (at 17:22 UTC)
03:24 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Jun 7 03:23:32 UTC 2014 (duration 23m 31s)
02:31 logmsgbot: LocalisationUpdate completed (1.24wmf8) at 2014-06-07 02:29:57+00:00
02:17 logmsgbot: LocalisationUpdate completed (1.24wmf7) at 2014-06-07 02:16:30+00:00
June 6
23:51 Krinkle: Restarted Jenkins, force stopped Zuul, started Zuul, configured Jenkins via web interface (disable Gearman, save, enable Gearman); Seems to be back up now, finally.
22:52 mutante: same for rhenium, titanium, bast1001, calcium, carbon, ytterbium, stat1003
22:42 RoanKattouw: Restarting Jenkins didn't help, jobs still aren't making it across from Zuul into Jenkins
22:36 RoanKattouw: Restarting stuck Jenkins
22:35 mutante: same for holmium, hafnium, silver, netmon1001, magnesium, neon, antimony
22:17 mutante: upgraded ssl packages on zirconium
21:57 Krinkle: Took Jenkins slave on gallium temporarily offline and back online to resolve possible stagnation
20:56 awight_: updated crm from ded541894a70922e098fb3ea48306c8ec0f0f6aa to b38497a9d0ef75fe2b20b03b649ac13a5e3f47a7
18:24 mwalker: updating payments from e823354822c7a35e6c2069d3e72180a45dbc89dc to b4c5cf1bceb70d65eae28cdd0873036dc33c8992 for globalcollect oid hack
14:04 hashar: Gerrit back. chase rebooted it :)
13:55 hashar: Gerrit having some troubles: error: RPC failed; result=22, HTTP code = 503 (while cloning CirrusSearch )
16:02 mark: Connected cp3018:eth1 to cr1-esams:xe-0/0/3 (unconfigured)
15:59 _joe_: disabling puppet on virt1000 while we test the puppet3 upgrade on virt0
15:48 logmsgbot: reedy Finished scap: 2nd scap for 1.24wmf8, should be effectively a nooop (duration: 12m 33s)
15:35 logmsgbot: reedy Started scap: 2nd scap for 1.24wmf8, should be effectively a nooop
15:21 logmsgbot: anomie Synchronized php-1.24wmf6/extensions/VisualEditor/modules/ve-mw/ui/dialogs/: SWAT: Use <visualeditor-toolbar-cite-label> correctly in the Media and Reference toolbars gerrit:136783 (duration: 00m 15s)
15:18 logmsgbot: anomie Synchronized php-1.24wmf7/extensions/VisualEditor/modules/ve-mw/ui/dialogs/: SWAT: Use <visualeditor-toolbar-cite-label> correctly in the Media and Reference toolbars gerrit:136782 (duration: 00m 12s)
00:46 bblack: lvs3002 (live uploads lb for esams) is running ntpd
June 4
23:43 Tim: on searchidx1001: restarting lsearchd and indexer
23:40 logmsgbot: mwalker Finished scap: Scapping for SWAT; MultiMedia viewer and config changes (duration: 22m 16s)
23:20 Tim: on searchidx1001: as a temporary hack to work around scap disk full errors, set up a bind mount at /usr/local/apache/common-local linking to a directory in /a, by local modification of /etc/fstab
23:18 logmsgbot: mwalker Started scap: Scapping for SWAT; MultiMedia viewer and config changes
04:27 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Wed Jun 4 04:26:32 UTC 2014 (duration 26m 31s)
03:35 Krinkle: Deploy I882e3fa57b2e5e3de in Zuul and reload config
03:16 logmsgbot: LocalisationUpdate completed (1.24wmf7) at 2014-06-04 03:15:34+00:00
02:47 logmsgbot: LocalisationUpdate completed (1.24wmf6) at 2014-06-04 02:46:06+00:00
June 3
23:14 logmsgbot: ori Synchronized php-1.24wmf7/extensions/MobileApp: SWAT cherry-picks for MobileApp (with patch) (duration: 00m 04s)
23:11 logmsgbot: ori Synchronized php-1.24wmf6/extensions/MobileApp: SWAT cherry-picks for MobileApp (duration: 00m 04s)
23:10 logmsgbot: ori Synchronized php-1.24wmf7/extensions/MobileApp: SWAT cherry-picks for MobileApp (duration: 00m 03s)
23:06 logmsgbot: ori Synchronized wmf-config/InitialiseSettings.php: I9dac0dc6a80: Set $wgIncludejQueryMigrate = true; for all wikis (duration: 00m 03s)
22:41 logmsgbot: marktraceur Finished scap: Update Media Viewer preference string for wmf7 - already backported to wmf6 (duration: 13m 19s)
22:27 logmsgbot: marktraceur Started scap: Update Media Viewer preference string for wmf7 - already backported to wmf6
21:49 logmsgbot: marktraceur updated /a/common to I409703a11: Enable MMV by default on dewiki beta.
21:25 logmsgbot: marktraceur Synchronized mediaviewer.dblist: Enable media viewer by default on enwiki (duration: 00m 06s)
21:18 logmsgbot: marktraceur Synchronized wmf-config/InitialiseSettings.php: Throttle the MMV event logging a bit more for the launch today (duration: 00m 06s)
21:17 logmsgbot: marktraceur updated /a/common to I549906510: Launch Media Viewer for all users on English wikipedia
21:09 logmsgbot: marktraceur Synchronized wmf-config/InitialiseSettings.php: Touch InitialiseSettings.php because that's what we do (duration: 00m 06s)
21:08 logmsgbot: marktraceur Synchronized mediaviewer.dblist: Add dewiki to the on-by-default list for Media Viewer (duration: 00m 06s)
21:08 logmsgbot: marktraceur updated /a/common to Ie237b0ae1: Launch Media Viewer for all users on German wikipedia
03:11 logmsgbot: LocalisationUpdate completed (1.24wmf6) at 2014-05-29 03:10:15+00:00
02:36 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-29 02:35:45+00:00
01:22 awight: updated SmashPig from e9964bfec47b3796dab0a19a9545cc3abb23fde6 to 98b1f348aa55f6a3aac441db08a59ca309fade7a
01:16 awight: (rollback)
01:16 awight: updated SmashPig from 03015f3827fedea9d0f89c791604ad08ec97ba71 to e9964bfec47b3796dab0a19a9545cc3abb23fde6
01:04 awight: update SmashPig from f64f79f13cf4ab560d0bb5bd69690c827a821629 to 03015f3827fedea9d0f89c791604ad08ec97ba71
May 28
23:39 logmsgbot: demon Synchronized wmf-config/CommonSettings.php: Including external CologneBlue/Modern skins, if they exist (duration: 00m 07s)
22:27 awight: updated crm from 65a433b5564f42c3aa4f310cd4bb938ae70f841d to 5b231163e9e880de5b9787d40b679a6723748aca
22:27 awight: updated tools from 1e8029544dc19a84f6d1adf2783266e16d19ef1f to d257e8445e028b758b1d1fa90c857667d4faac62
21:23 RoanKattouw: Restarting Parsoid Varnishes per gwicke's request
20:38 mutante: enabling puppet on osmium
20:18 hashar: Jenkins: killed all phantomjs processes on gallium. They were eating all available memory. All three processes were VisualEditor qunit tests.
19:08 bd808|deploy: Symlinks for mergeCdbFileUpdates, mwversionsinuse, refreshCdbJsonFiles, scap-rebuild-cdbs, scap-recompile and sync-common on tin still pointing to /srv/scap/bin instead of /srv/deployment/scap/scap/bin
17:32 mwalker: enabling worldpay in BE (payments from 5136b0b6852f3e949e4dc847f7137f1b7bc3037b to 7c695e9c4c7386a7585b6067df29b8caaaa089f0)
16:47 hashar: Jenkins/Zuul back. Jobs meant to be run on labs instances ended up not being registered anymore with the Zuul Gearman server. That must be a bug in the Jenkins Gearman plugin :-( bug 63760
16:31 hashar: Jenkins / Zuul locked. Looking into it
12:19 Reedy: Created SecurePoll tables on zerowiki, legalteamwiki, zhwikivoyage, viwikivoyage, tyvwiki
11:40 godog: restart apache2 on tungsten, many report.py hung
05:42 gwicke: restarted parsoids after load surge
03:13 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue May 27 03:12:29 UTC 2014 (duration 12m 28s)
02:25 logmsgbot: LocalisationUpdate completed (1.24wmf6) at 2014-05-27 02:24:51+00:00
02:14 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-27 02:13:08+00:00
01:39 springle: starting updateCollation on s2 cs-wiki from tin
01:36 logmsgbot: springle synchronized wmf-config/InitialiseSettings.php '$wgCategoryCollation to uca-cs on cswiki'
May 26
09:17 hashar: bugzilla.bugs_fulltext bug was bug 65762
09:16 _joe_: repaired table bugzilla.bugs_fulltext on db1001 as it was marked as crashed
03:11 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon May 26 03:10:02 UTC 2014 (duration 10m 1s)
02:25 logmsgbot: LocalisationUpdate completed (1.24wmf6) at 2014-05-26 02:24:39+00:00
02:14 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-26 02:13:01+00:00
May 25
03:09 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun May 25 03:08:47 UTC 2014 (duration 8m 46s)
02:25 logmsgbot: LocalisationUpdate completed (1.24wmf6) at 2014-05-25 02:24:10+00:00
02:14 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-25 02:13:47+00:00
May 24
03:02 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat May 24 03:01:50 UTC 2014 (duration 1m 49s)
02:21 logmsgbot: LocalisationUpdate completed (1.24wmf6) at 2014-05-24 02:20:11+00:00
02:13 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-24 02:12:25+00:00
00:34 mutante: fixing Aaron's and Ariel's file permissions on fenari
May 23
19:43 logmsgbot: anomie synchronized php-1.24wmf6/includes/HistoryBlob.php 'Backport fix for bug 65665 to 1.24wmf6 gerrit:135089'
19:24 awight: updated fr-tools from 73921d4b4a7ba69b703340ed56e513f8ae8e0bb5 to 1e8029544dc19a84f6d1adf2783266e16d19ef1f
18:37 mwalker: updated payments wiki from d99177518b741e7fe18ffda86c83f93c72e164a6 for worldpay
18:22 Jeff_Green: ran authdns-update to merge new wikimedia.community dns zone
17:02 bd808: Starting rolling update of elasticsearch for logstash cluster
16:20 bd808: restarted elasticsearch on logstash1002
16:17 bd808: Elasticsearch on logstash1002 dead due to OOM at 2014-05-23T00:34:03Z
14:52 hashar: killed -9 a remaining Jenkins process
14:21 _joe_: killed zuul server, as it was stuck
13:50 _joe_: killed & started jenkins, jvm stuck, unresponsive to jstack
13:17 manybubbles: restarting jenkins because it seems stuck
11:05 mark: Setup BFD on Zayo link between cr2-ulsfo and cr1-eqiad
11:01 mark: Setup BFD on GTT link between cr1-ulsfo and cr2-eqiad
07:30 _joe_: powercycling ms-be1007, unresponsive, console blank, no way to debug
04:17 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri May 23 04:16:15 UTC 2014 (duration 16m 14s)
03:23 logmsgbot: LocalisationUpdate completed (1.24wmf6) at 2014-05-23 03:22:07+00:00
02:39 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-23 02:38:33+00:00
May 22
23:22 mutante: osmium,mw1151 fixed UID of mwalker (605->2454)
23:16 bd808: Ran sync-common manually on osmium and mw1151
23:14 mwalker: sync-dir failed for osmium and mw1151
23:14 logmsgbot: mwalker synchronized php-1.24wmf6/extensions/VisualEditor 'Syncing the extension manually because of scap failures on osmium, mw1010, mw1070, mw1161, mw1201, and mw1151'
15:31 logmsgbot: anomie synchronized php-1.24wmf5/extensions/MultimediaViewer/ 'SWAT: Deploy new MultimediaViewer logging to wmf5 wikis gerrit:134804'
15:22 logmsgbot: anomie synchronized wmf-config/CommonSettings.php 'SWAT: Disable old MultimediaViewer logging and pre-enable new logging gerrit:134343'
15:21 logmsgbot: anomie synchronized wmf-config/InitialiseSettings.php 'SWAT: Disable old MultimediaViewer logging and pre-enable new logging gerrit:134343'
19:56 awight: updated tools from c1f50f6909b04768f3a8faa50b25e88a43f89606 to 73921d4b4a7ba69b703340ed56e513f8ae8e0bb5
19:05 mutante: welcome new deployer tgr
18:22 andrewbogott: restarting gerrit service
15:47 logmsgbot: anomie synchronized php-1.24wmf5/tests/qunit/suites/resources/mediawiki/mediawiki.user.test.js 'May as well sync this too'
15:45 logmsgbot: anomie synchronized php-1.24wmf5/resources/src/mediawiki/mediawiki.user.js 'SWAT: Use mw.log.deprecate to track user() and anonymous()'
15:19 logmsgbot: anomie synchronized php-1.24wmf5/includes/filerepo/file/LocalFile.php 'SWAT: Replace FOR UPDATE with LockManager use in LocalFile::lock()'
15:18 logmsgbot: anomie synchronized php-1.24wmf5/includes/filebackend/FileBackend.php 'SWAT: Replace FOR UPDATE with LockManager use in LocalFile::lock()'
14:11 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'raise db1068 to normal load'
12:30 hashar: Jenkins: updated sysadmin email address from nobody@integration.wikimedia.org to jenkins-bot@wikimedia.org
15:18 logmsgbot: manybubbles synchronized php-1.24wmf4/extensions/CirrusSearch/ 'adding url parameter to suppress snippets and one to suggest suggestions to cirrus'
15:12 manybubbles: SWAT deployed cirrus update for wmf5 and looks good. doing for wmf4 now.
14:53 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'touched and synced InitialiseSettings.php to make update to cirrus.dblist take hold - resyncing to mw1171'
14:51 _joe_: powercycled mw1171, dead and serial console stuck
14:51 logmsgbot: manybubbles synchronized cirrus.dblist 'Switch cirrus to the primary backend for zh-yue wikipedia - resyncing to mw1171'
14:40 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'touched and synced InitialiseSettings.php to make update to cirrus.dblist take hold'
14:38 logmsgbot: manybubbles synchronized cirrus.dblist 'Switch cirrus to the primary backend for zh-yue wikipedia'
13:34 Krinkle: Running deleteEqualMessages.php on mtwiki (bug 43917)
13:17 Krinkle: Running deleteEqualMessages.php on zh_min_nanwiki (bug 43917)
13:07 Krinkle: Running deleteEqualMessages.php on zh_yuewiki (bug 43917)
12:01 Krinkle: Running deleteEqualMessages.php on suwiki (bug 43917)
03:10 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon May 19 03:09:00 UTC 2014 (duration 8m 59s)
02:26 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-19 02:25:51+00:00
02:14 logmsgbot: LocalisationUpdate completed (1.24wmf4) at 2014-05-19 02:13:37+00:00
00:30 Tim: on osmium: stopping job runners in order to fix cgroup permissions issue
May 18
03:07 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun May 18 03:06:03 UTC 2014 (duration 6m 2s)
02:24 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-18 02:23:29+00:00
02:13 logmsgbot: LocalisationUpdate completed (1.24wmf4) at 2014-05-18 02:12:46+00:00
May 17
03:08 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat May 17 03:07:12 UTC 2014 (duration 7m 11s)
02:25 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-17 02:24:46+00:00
02:15 logmsgbot: LocalisationUpdate completed (1.24wmf4) at 2014-05-17 02:14:13+00:00
01:58 mutante: powercycling labsdb1003
00:23 ori: varnishadm on cp1056 confirms that varnish recognizes mw1151 as "sick"
00:20 ori: stopping apache and disabling puppet on mw1151 so that varnish stops forwarding reqs to it
00:16 Krinkle: On mw1151, Gadget::loadStructuredList() returns false, memcached has no value for 'enwiki:gadgets-definition:7' and is unable to store it.
00:11 ori: Krinkle identified weird RL responses as all originating in mw1151; dmesg shows ata1 disk troubles: "failed command: READ DMA EXT", "sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed"
22:17 awight: tools updated from 85bb7293d83517086e3609f03365aecde9f58c71 to ee31fc94b17c11a48ddac19aabfcdaab69fd2f72
21:37 logmsgbot: ori synchronized php-1.24wmf4/extensions/MultimediaViewer 'Update MultimediaViewer for I0df067a61: Add sampling to unsampled event logging'
21:33 logmsgbot: ori synchronized php-1.24wmf5/extensions/MultimediaViewer 'Update MultimediaViewer for I0df067a61: Add sampling to unsampled event logging'
21:10 mwalker: updating fundraising smashpig from 2fdf982b20f1cbeaf9f57af64ef21b5b69a36f6e to f64f79f13cf4ab560d0bb5bd69690c827a821629
20:41 awight: update crm from 243641de631b712c4a29ca1f3618771b78dadeae to 0b8aa8aa046935b6cfc67c10ebe10396d5e42745
18:43 awight: update tools from a40c0caa18a0efd93bc5d3f7f68386fbc36bf1fa to 85bb7293d83517086e3609f03365aecde9f58c71
18:12 logmsgbot: ori synchronized wmf-config/InitialiseSettings.php 'I3c453b0949f4e: Tweak MediaViewer sampling settings'
18:07 logmsgbot: ori synchronized wmf-config 'Ia43821231: Add sampling control setting for MediaViewer event'
18:04 logmsgbot: ori updated /a/common to Ia43821231: Add sampling control setting for MediaViewer event logging
17:47 logmsgbot: demon synchronized wmf-config/CommonSettings.php 'GeoData to Elastic for all wikivoyages'
17:11 mwalker: updating staging payments servers as well from 5e24b953dcff5305099e152139e6e93daba8aeec to d99177518b741e7fe18ffda86c83f93c72e164a6
17:10 mwalker: and updated to 1.22.6
17:10 mwalker: moved fundraising wiki from 6a1d4983319038edeb88dc34a1c220ecaec1cbde to d99177518b741e7fe18ffda86c83f93c72e164a6 -- including json i18n changes
16:55 manybubbles: "in place" reindexing (for cirrus) all the wikipedias after the deploy train hit them yesterday
16:53 logmsgbot: demon synchronized wmf-config/CommonSettings.php 'Removing old WikiEditor settings'
16:44 RobH: partial zirconium downtime
16:44 RobH: I logged into zirconium, but it had recovered by the time I checked it.
16:33 mwalker: updated fundraising civicrm from 7a23465e620211739421cce3ad57c62597eb8cc3 to 75c1a50b8aa7e7b6f218d7c420932a8fc53a0a34 for an exchange rates fix
16:26 qchris: updated gerrit's hooks-bugzilla plugin to version 2.8.1.2 to allow talking to bugzilla-4.4.4
13:17 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'raise db1070 and db1071 to normal load'
10:14 springle: xtrabackup clone db1049 to db1056
10:13 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'reduce db1049 load while cloning'
09:40 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'pool db1070 and db1071 in s1, warm up'
06:20 logmsgbot: ori synchronized php-1.24wmf4/maintenance/compareParserCache.php 'Ica69a3ef2: Added a script to compare current parser output to cache (no impact on prod; syncing for consistency)'
03:55 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri May 16 03:54:03 UTC 2014 (duration 54m 2s)
03:09 logmsgbot: LocalisationUpdate completed (1.24wmf5) at 2014-05-16 03:08:04+00:00
02:39 logmsgbot: LocalisationUpdate completed (1.24wmf4) at 2014-05-16 02:38:41+00:00
02:07 springle: xtrabackup db1070 to db1071
01:02 logmsgbot: ori synchronized php-1.24wmf5/includes/parser 'I12a60b5cc: Revert "Declare visibility on class properties of includes/parser/"'
00:54 hoo: rebuildItemsPerSite finished running for Wikidata (after about 30h).
00:32 hoo: manually ran rebuildEntityPerPage for Wikidata to fix 2 broken records
19:56 mwalker: updated fundraising tools repo for screenshots, worldpay auditing, live analysis, and... stomp! from 0eb485c8b6db5f06805976860bce7aa8b0d6444b to 47407c16d9922b17af70146416913abfe50b728d
19:09 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf5
18:55 ori: deploying twemproxy module on mw106*, they may complain for a moment
18:52 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.24wmf4
18:46 logmsgbot: reedy Finished scap: testwiki to 1.24wmf5 and build l10n cache (duration: 27m 47s)
18:32 mutante: mw1053 was already disabled in pybal though and RT 7408,7435
18:31 mutante: mw1053 sits at disk partitioning dialog (via mgmt)
18:29 Reedy: mw1053 is pingable but not ssh-able
18:18 logmsgbot: reedy Started scap: testwiki to 1.24wmf5 and build l10n cache
17:53 Jeff_Green: adjusted exim conf on mchenry to route donate.wm.o mail to barium instead of aluminium
16:43 mwalker: disabled qc and put site_offline and maintenance_mode on civicrm to true
15:20 logmsgbot: anomie synchronized php-1.24wmf4/extensions/MultimediaViewer 'SWAT: Deploy change 133446 to fix bug 65225 in MultimediaViewer'
14:03 springle: xtrabackup clone db1056 to db1070
13:59 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'depool db1056 while cloning'
13:44 cmjohnson1: sodium going down again for a different disk replacement
13:16 cmjohnson1: shutting down sodium to replace sdb
12:56 godog: restarting gerrit on ytterbium, clones over https seemingly stuck
12:24 manybubbles|away: "in place" reindexing group1 wikis after the deployment train updated cirrus yesterday. They'll need a full reindex after that is done which will take some time but is required to fix issues with redirects not showing up off of the main namespace
11:56 godog: installed openjdk-7-jdk on ytterbium to attempt gerrit thread dump
10:15 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'depool db1009 for raid tests'
06:44 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'move s5 api traffic to db1005'
05:19 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'move s4 commonswiki api traffic to db1042'
04:20 springle: installed db1073
03:15 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu May 15 03:14:04 UTC 2014 (duration 14m 3s)
02:27 logmsgbot: LocalisationUpdate completed (1.24wmf4) at 2014-05-15 02:26:09+00:00
02:15 logmsgbot: LocalisationUpdate completed (1.24wmf3) at 2014-05-15 02:14:31+00:00
May 14
23:42 logmsgbot: mwalker synchronized wmf-config/InitialiseSettings.php 'Poking settings to try and apply them'
23:29 logmsgbot: mwalker synchronized visualeditor.dblist 'Another part of 132409 (visual editor)'
23:27 K4-713: updated payments from 78cc4285bdeb6eecba3efc75e4a04c8b886561e4 to 5e24b953dcff5305099e152139e6e93daba8aeec
23:27 logmsgbot: mwalker synchronized wmf-config/ 'SWAT of 132409 (visual editor) and 130274 (abuse filter)'
02:26 logmsgbot: LocalisationUpdate completed (1.24wmf4) at 2014-05-14 02:25:08+00:00
02:21 springle: upgrade db1043, rebuild as m3 master
02:14 logmsgbot: LocalisationUpdate completed (1.24wmf3) at 2014-05-14 02:13:23+00:00
00:18 manybubbles: bouncing elasticsearch on elastic1015 to pick up gc logging configuration. it might warn but shouldn't cause any service disruption.
22:43 chasemp: ms-be1009 rebooted as it had locked up, swift seems to have recovered
22:35 mutante: created new gerrit projects for phabricator, arcanist and libphutil
22:22 ori: restarting tungsten to verify fix for gdash/graphite initialization
21:33 ori: gdash and graphite currently down; chase & ori debugging
21:29 manybubbles: I caused elasticsearch1015 to drop out of the Elasticsearch cluster by trying to take a heap dump on it. don't do that. It stops the application for many seconds.
17:10 logmsgbot: reedy Finished scap: Build l10n cache for 1.24wmf4 and move testwiki (duration: 17m 05s)
16:53 logmsgbot: reedy Started scap: Build l10n cache for 1.24wmf4 and move testwiki
16:52 chasemp: rebooted ms-be1006 since it dropped dead
16:52 logmsgbot: reedy Finished scap: Build l10n cache for 1.24wmf4 and move testwiki (duration: 11m 42s)
16:42 manybubbles: reindexing the hebrew wikis other than hewikipedia now that they are on wmf3 so they can have hebmorph
16:40 logmsgbot: reedy Started scap: Build l10n cache for 1.24wmf4 and move testwiki
16:39 manybubbles: rebuilding enwiki's cirrus index to optimize for new highlighter
16:14 logmsgbot: demon rebuilt wikiversions.cdb and synchronized wikiversions files: group1 wikis to wmf3
16:13 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'raise db106[45] to normal load'
15:13 logmsgbot: manybubbles synchronized php-1.24wmf3/extensions/CirrusSearch/ 'updating Cirrus to pick up some fixes'
15:08 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'engage new highlighter on some more wikis'
15:00 manybubbles: rebuilding all hebrew wikis _except_ hebrew wikipedia and hebrew wikisource to pick up hebmorph. hewikisource got it this morning. hewiki will get it this afternoon after the deployment train
13:24 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'warm up db1064 in s4, db1065 in s1'
12:59 manybubbles: rebuilding cirrus index for hewikisource to pick up hebmorph
10:18 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'reduce db1049 and db1051 load while cloning'
16:33 manybubbles: performing a rolling restart on elasticsearch nodes in production to pick up new plugins: experimental-highlight 0.0.8 and analysis-hebrew 1.1.0
16:30 _joe_: restarted mwprof/profiler-to-carbon on tungsten, stuck somehow
15:40 logmsgbot: demon synchronized wmf-config/CirrusSearch-common.php 'Raised redundancy for commonswiki_file back up, config to match'
15:19 anomie: anomie namespaceDupes.php on OfficeWiki done (that was quick)
15:18 anomie: anomie Running maintenance/namespaceDupes.php on OfficeWiki
15:16 logmsgbot: anomie synchronized wmf-config/InitialiseSettings.php 'SWAT: Change wgMetaNamespace for OfficeWiki and add alias'
15:12 logmsgbot: anomie synchronized wmf-config/InitialiseSettings.php 'SWAT: Allow all users on OfficeWiki to send mass messages (for real this time)'
15:09 logmsgbot: anomie synchronized wmf-config/InitialiseSettings.php 'SWAT: Allow all users on OfficeWiki to send mass messages'
15:03 logmsgbot: anomie synchronized wmf-config/InitialiseSettings.php 'SWAT: Set $wgUploadMissingFileUrl for enwiki'
10:21 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'db1049 to full steam'
09:40 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'warm up db1049 in s4'
07:54 hashar: Jenkins: installing Claim plugin (allow folks to comment on builds and mark them)
20:58 RobH: ssl1001-1003 now have updated unified cert in service
20:58 jgage: both kafka brokers back in service
20:54 RobH: cp4001-4020 unified cert and nginx service reloaded, back in service
20:50 RobH: ssl1006 and ssl1009 are responsive to nginx and back in service
20:43 RobH: pybal
20:43 RobH: ssl1009 was refusing connections both before and after my ssl cert update. ssl1006 is presently refusing connections post update. they are set to disabled in pybal
20:39 RobH: ssl1008 back into service, ssl1009 already depooled
20:38 jgage: forced kafka broker reelection
20:34 RobH: ssl1007 going back into service, ssl1008 depooling
20:25 RobH: depooled ssl1006/7 for update
20:25 RobH: ssl1004/5 returned to service (and puppet agents enabled)
20:21 RobH: puppet agent has been re-enabled on ssl1001-1003
16:00 logmsgbot: anomie synchronized php-1.24wmf3/extensions/MobileFrontend/ 'SWAT: Backport change 131237 to 1.24wmf3 to fix bug in MobileFrontend'
15:59 logmsgbot: anomie synchronized php-1.24wmf2/extensions/MobileFrontend/ 'SWAT: Backport change 131237 to 1.24wmf2 to fix bug in MobileFrontend'
15:49 logmsgbot: anomie synchronized php-1.24wmf2/includes/specials/SpecialAllmessages.php 'SWAT: Backport change 131041 to 1.24wmf2 to fix bug in Special:AllMessages'
15:37 logmsgbot: anomie synchronized php-1.24wmf2/includes/specials/SpecialAllmessages.php 'SWAT: Backport change 131041 to 1.24wmf2 to fix bug in Special:AllMessages'
15:24 logmsgbot: anomie synchronized php-1.24wmf3/includes/specials/SpecialAllmessages.php 'SWAT: Backport change 131041 to 1.24wmf3 to fix bug in Special:AllMessages'
15:12 logmsgbot: anomie synchronized php-1.24wmf3/includes/api/ApiLogin.php 'SWAT: Backport change 131056 to 1.24wmf3 to fix bug 64727'
15:10 logmsgbot: anomie synchronized php-1.24wmf2/includes/api/ApiLogin.php 'SWAT: Backport change 131056 to 1.24wmf2 to fix bug 64727'
12:45 akosiaris: removing various sdtpa devices from LibreNMS
03:12 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon May 5 03:11:15 UTC 2014 (duration 11m 14s)
02:32 ^demon|away: antimony: ran very very aggressive repacking on mediawiki/core, operations/puppet, mediawiki/extensions/{UploadWizard,CentralAuth,CentralNotice,DonationInterface,FlaggedRevs,AbuseFilter,BlueSpiceExtensions,Translate,WikimediaMessages,EducationProgram,UniversalLanguageSelector,Wikibase}, pywikibot/{core,compat}, operations/dumps/tests. Basically anything taking up >90MB on disk. Probably not the cause of gitblit's wonkiness but they're certainly not helping matters.
02:26 logmsgbot: LocalisationUpdate completed (1.24wmf3) at 2014-05-05 02:25:34+00:00
02:14 logmsgbot: LocalisationUpdate completed (1.24wmf2) at 2014-05-05 02:13:18+00:00
16:04 _joe_: depooled mw1053 for hardware problems
15:13 andrewbogott: resetting a bunch more UIDs. Running find-and-chown again, but this time not on the swifts: salt -E '^(?!ms-be|labstore|snapshot).*$'
13:46 paravoid: swift @ eqiad: setting zone 5 (ms-be1013/1014/1015) to weight 2000, i.e. 66%
06:17 ori: re-enabled puppet on osmium and hafnium
04:02 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri May 2 04:01:04 UTC 2014 (duration 1m 3s)
03:10 logmsgbot: LocalisationUpdate completed (1.24wmf3) at 2014-05-02 03:09:16+00:00
02:40 logmsgbot: LocalisationUpdate completed (1.24wmf2) at 2014-05-02 02:39:39+00:00
May 1
23:43 bd808: Restarted logstash on logstash1001; MaxSem noticed that many recursion-guard logs were not being completely reassembled and JVM had one CPU maxed out.
16:13 logmsgbot: reedy Started scap: testwiki to 1.24wmf3
16:10 logmsgbot: reedy updated /a/common to I832b45db6: Correct a domain in wgCopyUploadsDomains
16:01 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'Enable cirrus as a betafeature on all wikis which did not already have it.'
15:51 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'SWAT fix GWtoolset url and add some more logos'
15:40 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'SWAT fix GWtoolset url and add some more logos'
15:35 andrewbogott: reassigning a ton of UIDs in production; running a couple dozen 'find' commands to chown files
15:34 logmsgbot: manybubbles synchronized php-1.24wmf2/includes/Article.php 'SWAT update to prevent fatal in backwards compatibility method'
15:27 logmsgbot: manybubbles synchronized php-1.24wmf2/extensions/VisualEditor/ 'SWAT update for firefox focus'
15:08 logmsgbot: manybubbles synchronized php-1.24wmf2/extensions/Wikidata/ 'SWAT update for time parsing and formatting'
08:15 springle: switching s1-analytics-slave db1047 enwiki to tokudb
03:18 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Thu May 1 03:17:39 UTC 2014 (duration 17m 38s)
02:34 logmsgbot: LocalisationUpdate completed (1.24wmf2) at 2014-05-01 02:33:34+00:00
02:22 logmsgbot: LocalisationUpdate completed (1.24wmf1) at 2014-05-01 02:21:44+00:00
April 30
23:58 bblack: mobile caches now sync zero carriers/proxies from zero.wm.org rather than noc(fenari) temp hack solution
23:14 logmsgbot: ori synchronized php-1.24wmf2/extensions/VisualEditor 'Ibaf0cc823bfe: Update VisualEditor for cherry-picks'
16:05 logmsgbot: manybubbles synchronized php-1.24wmf2/extensions/Wikidata/ 'SWAT upgrade wikidata for date parsing fixes'
15:47 manybubbles: rebuilding test2wiki's cirrus index after swat deploy
15:45 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'SWAT add autopatrolled group to shwiktionary and draft namespace to chapcomwiki'
15:38 logmsgbot: manybubbles synchronized wmf-config/CirrusSearch-common.php 'SWAT deploy - move group0 wikis to experimental highlighter and give enwiki its redundancy back'
15:37 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'SWAT deploy - extra setting for cirrus and new groups and sources for gwtoolset'
15:32 manybubbles: cirrus deploys look good, moving on to twkozlowski's requests
15:31 logmsgbot: manybubbles synchronized wmf-config/CirrusSearch-common.php 'SWAT deploy - move group0 wikis to experimental highlighter and give enwiki its redundancy back'
15:31 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'SWAT deploy - extra setting for cirrus'
10:14 hashar: Jenkins / Zuul: upgrading python-gear from 0.4.0-1 to 0.5.4-1. Should fix a bunch of job registration issues in Zuul Gearman. bug 63760
09:59 akosiaris: update python-gear on apt.wikimedia.org to 0.5.4-1
08:30 akosiaris: Published carbon's IPv6 address in DNS. apt.wikimedia.org and ubuntu.wikimedia.org are now IPv6 enabled
05:25 AaronSchulz: Manually removed a few tens of thousands of duplicate Cyberbot jobs
03:48 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Tue Apr 29 03:48:29 UTC 2014 (duration 48m 28s)
03:02 logmsgbot: LocalisationUpdate completed (1.24wmf2) at 2014-04-29 03:02:55+00:00
02:37 logmsgbot: LocalisationUpdate completed (1.24wmf1) at 2014-04-29 02:37:21+00:00
02:24 bblack: wiped disk cache (via mkfs) on cp1055 to (hopefully) clear crash-restart cycle, backend back in service now
01:53 Krinkle: Running deleteEqualMessages.php on cswiki (bug 43917)
00:49 Tim: on cp1055: backend varnish is continually panicking and restarting its child, will try to stop/start service
20:02 gwicke: deployed Parsoid cab9348e using deploy 9e9030d
20:00 logmsgbot: mflaschen synchronized wmf-config/InitialiseSettings.php 'Update GettingStarted config for new format'
19:59 logmsgbot: mflaschen synchronized php-1.24wmf2/extensions/GettingStarted/ 'Sync GettingStarted for Growth team deploy'
19:58 logmsgbot: mflaschen synchronized php-1.24wmf1/extensions/GettingStarted/ 'Sync GettingStarted for Growth team deploy'
19:30 Krinkle: Running deleteEqualMessages.php on simplewiki (bug 43917)
19:25 logmsgbot: ori synchronized wmf-config/InitialiseSettings.php 'I5e0709ef0: Unset $wgUseXVO'
19:25 logmsgbot: ori updated /a/common to I5e0709ef0: Unset $wgUseXVO
19:22 Krinkle: Running deleteEqualMessages.php on rowiktionary (bug 43917)
19:01 Krinkle: Running deleteEqualMessages.php on bat-smgwiki (bug 43917)
18:48 Krinkle: Running deleteEqualMessages.php on afwikiquote (bug 43917)
18:41 hashar: Jenkins disconnected lanthanum slave, killed all jenkins-slave process on it and repooled server.
18:39 Krinkle: Running deleteEqualMessages.php on abwiki (bug 43917)
17:54 manybubbles: deploying a new version of our Elasticsearch highlighter by doing a rolling restart on Elasticsearch machines - should cause no interruption of service
16:51 akosiaris: executed graceful-stop, start for apaches in order to load the new php-luasandbox apache module
15:08 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'Add new sources to gwtoolset and namespaces to hewikisource'
12:48 _joe_: restarted apache on wikitech-static
12:29 Krinkle: Running deleteEqualMessages.php on cvwiki (bug 43917)
12:29 Krinkle: Running deleteEqualMessages.php on afwiki (bug 43917)
11:46 Krinkle: Running deleteEqualMessages.php on bpywiki (bug 43917)
13:43 logmsgbot: hoo synchronized wmf-config/InitialiseSettings-labs.php 'Syncing for cluster consistency'
13:42 logmsgbot: hoo updated /a/common to Ic98928d54: Have Commons on Beta Labs use $stdlogo
13:26 springle: db1016 xfs head behind tail. reverted to last snapshot volume
12:57 springle: powercycle db1016 unresponsive
03:11 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Apr 26 03:11:16 UTC 2014 (duration 11m 15s)
02:31 logmsgbot: LocalisationUpdate completed (1.24wmf2) at 2014-04-26 02:31:41+00:00
02:22 logmsgbot: LocalisationUpdate completed (1.24wmf1) at 2014-04-26 02:22:41+00:00
April 25
20:29 logmsgbot: mwalker synchronized php-1.24wmf2/extensions/WikiEditor/ 'Reverting some faulty WikiEditor code for bug 64289'
20:28 logmsgbot: mwalker synchronized php-1.24wmf1/extensions/WikiEditor/ 'Reverting some faulty WikiEditor code for bug 64289'
18:02 K4-713: adjusted antifraud filters on payments
17:08 Jeff_Green: reenabled puppet and notifications for iodine
16:22 manybubbles: Elasticsearch rolling restart complete.
14:46 Jeff_Green: disabled icinga notifications for iodine too...
14:44 Jeff_Green: puppet stopped on iodine, doing manual spamassassin training
12:58 springle: upgrading db1047 (analytics slave) to mariadb 10
12:28 manybubbles: Performing rolling restart of Cirrus's Elasticsearch servers to upgrade a plugin. Low risk because it won't be used by the general public until Mondayish so a Friday push should be ok.
12:07 ottomata: stopping puppet on analytics1026 to test more frequent runs of Camus
23:16 logmsgbot: mwalker synchronized php-1.24wmf2/extensions/Flow 'Updating flow for 129589 and 129604'
22:53 AaronSchulz: Running PopulateImageSha1.php for all multi-versioned files on all wikis to fix broken SHA-1s
21:22 springle: eventlogging dump loading on db1048 m2 master in screen. ok to kill if necessary
21:18 hashar: restarting Zuul
21:01 mwalker: updating payments from 4811f6d3d80d126c8b3c89c11d20cc6416cb58f6 to e6d188f0dfcd57406acb58aa2b5bf45e48117c33 for donationinterface / worldpay updates
20:39 paravoid: shutting down sdtpa, cr1-sdtpa, csw1-sdtpa, msw1-sdtpa and other sdtpa hosts gone forever
20:37 Coren: sync-apache for 126969 and 91339
19:59 logmsgbot: reedy synchronized docroot and w
19:54 logmsgbot: aaron synchronized wmf-config/PoolCounterSettings-eqiad.php 'Removed redundant pool counter config'
19:00 ori: eventlogging data streaming into db1048; db1047 consumer decom'd.
06:26 mutante: db48,db63 - revoke puppet cert, salt key, kill from storedconfigs
03:55 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Wed Apr 23 03:55:03 UTC 2014 (duration 55m 2s)
03:09 logmsgbot: LocalisationUpdate completed (1.24wmf1) at 2014-04-23 03:08:58+00:00
02:46 logmsgbot: LocalisationUpdate completed (1.23wmf22) at 2014-04-23 02:46:47+00:00
02:25 manybubbles: restarted rebuilding commons' Cirrus index after something crashed. going to get more logging out of it if it crashes again. or it'll work. Either way. Like last time the Elasticsearch check might freak out for a bit after it finishes because shards are assigning. That can be ignored for an hour or so.
23:30 logmsgbot: ori Started scap: I595446dc5, If2c57846f, Iaa232298e
23:28 logmsgbot: ori synchronized php-1.24wmf1/extensions/EventLogging 'Update EventLogging for Iaa232298e: Set line-height for code icon on schema pages (bug 64251)'
23:27 logmsgbot: ori synchronized wmf-config/InitialiseSettings.php 'If2c57846f: Enable survey option in MediaViewer on a few more wikis'
23:26 logmsgbot: ori updated /a/common to If2c57846f: Enable survey option in MediaViewer on a few more wikis
23:24 logmsgbot: ori synchronized php-1.23wmf22/extensions/MultimediaViewer 'Update MultimediaViewer for I595446dc5: Add more survey languages (fr, de, pt/pr-br)'
23:23 logmsgbot: ori synchronized php-1.24wmf1/extensions/MultimediaViewer 'Update MultimediaViewer for I595446dc5: Add more survey languages (fr, de, pt/pr-br)'
20:12 manybubbles: rebuilding the search index for a few wikis - might cause the Elasticsearch health check to freak out because it sucks
19:51 MatmaRex: wikibugs is down, let's not bring it back up
19:31 Krinkle: Reloading Zuul to deploy config change I9c2f94b138244ab8
19:05 hashar: Jenkins: killed Jenkins java process on deployment-bastion.eqiad.wmflabs to free up the executor and threads entirely.
18:55 hashar: restarted Zuul to clean up some stuck jobs from the queue
18:49 hashar: Jenkins: deployment-bastion.eqiad.wmflabs is back online: Slave successfully connected and online
18:47 logmsgbot: reedy synchronized docroot and w
18:46 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: non wikipedias to 1.24wmf1
18:43 RobH: tridge back and accessible
18:42 hashar: Jenkins: depooling / repooling deployment-bastion.eqiad.wmflabs slave; it locked up somehow, and its executors are no longer taken into account by the Jenkins master
18:33 RobH: resurrecting tridge in pmtpa
18:00 RobH: tridge is coming down for relocation, shouldn't disrupt anything but backups in progress
17:52 bblack: disable cp301[34] (mobile varnish frontends) in pybal on fenari
15:47 logmsgbot: demon synchronized wmf-config/InitialiseSettings.php 'Collection back on, server move over'
15:14 logmsgbot: demon synchronized wmf-config/InitialiseSettings.php 'Icb6b4bad: Updated $wgForceUIMsgAsContentMsg for commonswiki'
14:59 cmjohnson1: shutting down and relocating virt0 and pdf2
14:50 logmsgbot: marc synchronized wmf-config/interwiki.cdb 'Updating interwiki cache'
14:41 logmsgbot: marc synchronized wmf-config/interwiki.cdb 'Updating interwiki cache'
14:33 manybubbles: populating cirrus indexes for all remaining wikis
14:19 akosiaris: added bblack account on all junipers
14:18 manybubbles: building new elasticsearch indexes for the last wikis that didn't have them. the cluster may go red as the indexes are assigned. silly nagios check.
14:15 logmsgbot: manybubbles synchronized wmf-config/ 'cirrus for more wikis and disable collection for more'
14:13 logmsgbot: manybubbles synchronized docroot/noc/createTxtFileSymlinks.sh 'noncirrus is removed'
14:09 cmjohnson1: mchenry and sanger going down for server relocation
13:26 mark: Disabled puppet and apache on fenari
13:25 paravoid: second pass of swiftrepl eqiad->esams
03:55 springle: reset pc100* slaves previously replicating from pmtpa
03:32 ori: 5.5k fatals over last 20 hrs, of which 3.5k are calls to doTransform() on a non-object at TimedMediaThumbnail.php:201, and 0.9k are Lua API OOMs at LuaSandbox/Engine.php:264
03:30 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon Apr 21 03:30:15 UTC 2014 (duration 30m 14s)
03:26 ori: ap_busy_workers spike on image scalers eqiad, started ~2:55, subsided around ~3:20
02:42 logmsgbot: LocalisationUpdate completed (1.24wmf1) at 2014-04-21 02:42:30+00:00
02:29 logmsgbot: LocalisationUpdate completed (1.23wmf22) at 2014-04-21 02:29:49+00:00
10:06 hashar: Upgrading Jenkins to latest LTS version 1.532.3
07:57 mutante: DNS update - remove api.svc, arptest.pmtpa ..
06:31 logmsgbot: demon synchronized wmf-config/InitialiseSettings.php 'Next round of wikis done building Cirrus indexes, throw into beta mode'
04:04 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 18 04:04:21 UTC 2014 (duration 4m 20s)
03:06 logmsgbot: LocalisationUpdate completed (1.24wmf1) at 2014-04-18 03:06:06+00:00
02:39 logmsgbot: LocalisationUpdate completed (1.23wmf22) at 2014-04-18 02:39:51+00:00
00:21 logmsgbot: ori synchronized wmf-config/CommonSettings.php 'Ie9b265be9: Enable GlobalCssJs on testwiki & test2wiki (2/2)'
00:21 logmsgbot: ori synchronized wmf-config/InitialiseSettings.php 'Ie9b265be9: Enable GlobalCssJs on testwiki & test2wiki (1/2)'
00:20 logmsgbot: ori updated /a/common to Ie9b265be9: Enable GlobalCssJs on testwiki & test2wiki
00:18 logmsgbot: ori Finished scap: Cherry-pick Ibe8e67ebf for MobileFrontend on 1.23wmf22 and 1.24wmf1; add GlobalCssJs extension to 1.24wmf1 and 1.23wmf22 (duration: 32m 53s)
April 17
23:45 logmsgbot: ori Started scap: Cherry-pick Ibe8e67ebf for MobileFrontend on 1.23wmf22 and 1.24wmf1; add GlobalCssJs extension to 1.24wmf1 and 1.23wmf22
23:39 logmsgbot: ori scap failed: CalledProcessError Command '/usr/local/bin/mw-update-l10n' returned non-zero exit status 1 (duration: 00m 24s)
23:38 logmsgbot: ori Started scap: Cherry-pick Ibe8e67ebf for MobileFrontend on 1.23wmf22 and 1.24wmf1; add GlobalCssJs extension to 1.24wmf1
23:37 logmsgbot: ori updated /a/common to I2a2abd7f3: Add GlobalCssJs to extension-list
23:29 logmsgbot: ori synchronized wmf-config/InitialiseSettings.php 'I52378a4b4: Add meta to legalteamwiki import sources'
23:28 logmsgbot: ori updated /a/common to I52378a4b4: Add meta to legalteamwiki import sources
20:41 manybubbles: elastic1016 restarted and not freaking out any more.
20:37 _joe_: restarting gitblit in order to prevent crippling due to the usual memory leak
20:28 manybubbles: restarting elastic1016 - it is freaking out. If it happens again I'll dig deeper, but for now I consider it a fluke of the rolling restarts today....
20:20 RobH: sorry for the misc-web-lb issues folks, they should be resolved at this time (for now)
20:19 paravoid: lvs1002/1005: commenting first resolv.conf entry until we have a more permanent fix, restarting pybal
20:18 paravoid: disabling puppet on lvs1002/lvs1005
19:57 RobH: still working on issue
19:57 RobH: both cp1043 and cp1044 seem online and serving nginx service, but pybal says they are down; still working
19:46 ottomata: power off emery
19:40 RobH: replacing ticket.wikimedia.org cert/key, apache may hiccup
19:33 RobH: blog.w.o cert replacement successful
19:30 ottomata: disabling puppet on emery for decommission
19:29 RobH: blog.w.o certificate swap (yes, again ;), apache may hiccup
18:30 ottomata: switching erbium udp2log instance from consuming multicast relay to unicast direct from varnishes
18:21 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: group0 to 1.24wmf1
18:10 ottomata: stopping puppet on elastic1001 and elastic1002, reinstalling elastic1002
18:02 logmsgbot: reedy rebuilt wikiversions.cdb and synchronized wikiversions files: Wikipedias to 1.23wmf22
16:06 logmsgbot: reedy Finished scap: testwiki to 1.24wmf1 and build l10n cache (duration: 26m 06s)
15:47 logmsgbot: anomie synchronized php-1.23wmf22/extensions/VisualEditor 'SWAT: 126913 - backport to wmf22 of critical fixes for the Math extension's VisualEditor tool'
15:47 logmsgbot: anomie synchronized php-1.23wmf22/extensions/Math 'SWAT: 126913 - backport to wmf22 of critical fixes for the Math extension's VisualEditor tool'
15:40 logmsgbot: reedy Started scap: testwiki to 1.24wmf1 and build l10n cache
13:50 manybubbles: restarting elastic1009 to suck up new config
13:50 manybubbles: raised the number of replicas for labswiki's search directly in elasticsearch because I can't easily do it through cirrus due to access restrictions
13:45 ottomata: reinstalling elastic1011
13:22 mutante: DNS update - remove virt5-15
12:11 mutante: virt5-11 - shut down
11:40 akosiaris: upgraded python-voluptuous on apt.wikimedia.org to 0.8.2-1wmf1
11:39 hashar: Upgraded Zuul to wmf-deploy-20140416-3 (bring in a84f0e4 - "Make queue processing more efficient" which was much needed)
11:29 hashar: upgraded Zuul to wmf-deploy-20140416-2
11:15 mutante: virt5-11 removing from icinga
11:03 mutante: virt5-11 revoked puppet certs and salt keys
10:56 mutante: stopping puppet on virt5-11
10:47 hashar: Upgraded Zuul on gallium to wmf-deploy-20140416 (depends on python-voluptuous 0.7+, Alexandros packaged 0.8.2 which I manually installed to validate).
15:02 mutante: DNS update - removing Tampa service IPs
13:51 hashar: Jenkins compressing console logs of builds. On gallium as user jenkins : find /var/lib/jenkins/jobs -wholename '*/builds/*/log' -type f -exec gzip --best {} \;
04:33 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Fri Apr 11 04:33:40 UTC 2014 (duration 33m 39s)
03:47 logmsgbot: LocalisationUpdate completed (1.23wmf22) at 2014-04-11 03:47:01+00:00
02:41 logmsgbot: LocalisationUpdate completed (1.23wmf21) at 2014-04-11 02:41:33+00:00
02:29 ori_: graphite: carbon instance 'f' saturates a cpu core. it's the instance that mediawiki profiling data gets hashed to. collector should probably emit to statsd and have statsd compute per-minute rollups
00:06 marktraceur: leaving MultimediaViewer slightly broken on enwiki based on the fact that logged-in users seem mostly unaffected and other wikis aren't seeing issues, will investigate more tomorrow and fix on Monday
09:20 hashar: Jenkins unpooling both slave labs using the web interface and killing the Jenkins client running as jenkins-deploy. Will repool so the job can be reregistered properly (bug 63760)
09:11 mutante: DNS update - removing ms6
09:04 hashar: Jenkins bunch of jobs are not being triggered properly. Taking traces.
08:55 mutante: ms6 - shutdown -h now
08:42 mutante: forcing Bugzilla logout for all users
23:36 logmsgbot: ebernhardson synchronized php-1.23wmf21/extensions/Math/modules/VisualEditor/ve.ui.MWMathInspectorTool.js 'Update Math VE tool to use a command in 1.23wmf21'
23:05 logmsgbot: ebernhardson synchronized wmf-config/InitialiseSettings-labs.php 'Enable math VE plugin on labs'
23:04 Krinkle: Jenkins and Zuul are back up. Queues have not been preserved.
23:01 ^d: gerrit: reloaded bugzilla plugin to force it to log back in
23:00 Krinkle: Restarting Jenkins because I have no clue what is going on and have no time to investigate yet another random clogging of all jobs. Restart ought to fix it.
22:54 Krinkle: Zuul has lots of queued jobs for npm slaves, but neither Jenkins nor integration-slave1001.eqiad.wmflabs and 1002 themselves have anything queued. They're idle, responsive and waiting for jobs.
22:47 Krinkle: Jenkins slaves in labs seem to be down. Zuul is stacking up jobs for hasNpm nodes (integration slaves in labs). Both slaves have 7/7 executors idle.
22:33 hoo: Logged out all Bugzilla users by deleting all session cookie data from mysql
00:04 Krinkle: To investigate bug 63579, manually patched "grunt-lib-phantomjs/phantomjs/main.js" in "/srv/deployment/integration/slave-scripts" on gallium
April 8
23:34 logmsgbot: mwalker synchronized php-1.23wmf21/extensions/MultimediaViewer/ 'Updating MultimediaViewer for 124510'
21:21 logmsgbot: bd808 Purged l10n cache for 1.23wmf19
21:20 logmsgbot: bd808 Purged l10n cache for 1.23wmf18
21:12 logmsgbot: bd808 rebuilt wikiversions.cdb and synchronized wikiversions files: group1 to 1.23wmf21
19:58 manybubbles: finished upgrading to Elasticsearch 1.1.0. The process went well with no issues other than knocking out search in labs 3 times for 30 seconds apiece and logging lots of nasty warnings to irc. I've started the process to fix search in labs so it won't happen again.
19:56 manybubbles: upgraded all elasticsearch servers except elastic1008. that is coming now.
18:45 logmsgbot: ori synchronized wmf-config/InitialiseSettings.php 'I4b18e4ce8: Change wgServer and wgCanonicalServer for arbcom wikis'
18:45 logmsgbot: ori updated /a/common to I4b18e4ce8: Change wgServer and wgCanonicalServer for arbcom wikis
18:13 logmsgbot: bd808 Finished scap: group0 wikis to 1.23wmf21 (with patch for bug 63659) (duration: 03m 18s)
18:10 logmsgbot: bd808 Started scap: group0 wikis to 1.23wmf21 (with patch for bug 63659)
18:01 logmsgbot: hoo synchronized wmf-config/InitialiseSettings.php 'Touch to clear config. cache'
17:56 hoo: changed the Wikidata wb_changes_dispatch position of all wikiquote wikis to 118158153
April 7
02:56 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Mon Apr 7 02:56:12 UTC 2014 (duration 56m 11s)
02:20 logmsgbot: LocalisationUpdate completed (1.23wmf21) at 2014-04-07 02:20:10+00:00
02:13 logmsgbot: LocalisationUpdate completed (1.23wmf20) at 2014-04-07 02:13:42+00:00
April 6
02:53 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sun Apr 6 02:53:13 UTC 2014 (duration 53m 12s)
02:18 logmsgbot: LocalisationUpdate completed (1.23wmf21) at 2014-04-06 02:18:43+00:00
02:13 logmsgbot: LocalisationUpdate completed (1.23wmf20) at 2014-04-06 02:12:57+00:00
01:32 jamesofu_: sugar down for move to labs
April 5
04:15 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Apr 5 04:15:00 UTC 2014 (duration 50m 39s)
03:56 logmsgbot: andrew rebuilt wikiversions.cdb and synchronized wikiversions files: Revert mw.org, test2wiki and testwikidatawiki to 1.23wmf20 due to localisation issue
03:51 Andrew: Reverting mw.org, test2 and test.wikidata back to 1.23wmf20
03:41 logmsgbot: LocalisationUpdate completed (1.23wmf21) at 2014-04-05 03:41:36+00:00
03:36 logmsgbot: LocalisationUpdate completed (1.23wmf20) at 2014-04-05 03:36:04+00:00
03:23 Andrew: Actually, going to rerun l10nupdate first just to check.
03:22 Andrew: Going to revert deployment of 1.23wmf21 again - still broken
03:08 logmsgbot: LocalisationUpdate ResourceLoader cache refresh completed at Sat Apr 5 03:08:33 UTC 2014 (duration 8m 32s)
02:34 logmsgbot: LocalisationUpdate completed (1.23wmf21) at 2014-04-05 02:34:54+00:00
02:14 logmsgbot: LocalisationUpdate completed (1.23wmf20) at 2014-04-05 02:14:07+00:00
02:35 logmsgbot: springle synchronized wmf-config/db-eqiad.php 's1 depool db1061 for upgrade'
02:24 logmsgbot: LocalisationUpdate completed (1.23wmf19) at 2014-04-03 02:24:07+00:00
April 2
23:47 logmsgbot: aaron synchronized wmf-config/CommonSettings.php 'Bumped wgJobBackoffThrottling for htmlCacheUpdate to 15'
23:47 mwalker: ... deploy was for mobile frontend 123454
23:46 logmsgbot: mwalker synchronized php-1.23wmf20/extensions/MobileFrontend 'SWAT deploy for MaxSem'
20:23 subbu: deployed Parsoid 33471172 with deploy repo sha 5c620e54
19:03 logmsgbot: ori synchronized php-1.23wmf20/extensions/WikimediaEvents 'Update WikimediaEvents for I7fdaa5524: Use simple random sampling to log deprecated usage at 1:100'
19:03 logmsgbot: ori synchronized php-1.23wmf19/extensions/WikimediaEvents 'Update WikimediaEvents for I7fdaa5524: Use simple random sampling to log deprecated usage at 1:100'
17:00 andrewbogott: fixed updating crons on wikitech-status, I think. Time will tell...
16:19 logmsgbot: manybubbles synchronized wmf-config/InitialiseSettings.php 'Lower timeout on prefix searches and make the cirrus.dblist sync I just did take effect.'
16:19 logmsgbot: manybubbles synchronized cirrus.dblist 'Cirrus as primary for most of group1'
16:14 akosiaris: banned tools-exec-03.eqiad.wmflabs. using manual iptables on ytterbium
15:20 ottomata: stopping puppet on stat1
14:27 hashar: Jenkins applying label contintLabsSlave on slaves in labs used for ci (integration-slave1001 and 1002)
14:15 hashar: Jenkins deleting pmtpa slaves (they all have been shutdown and jobs got deleted)
14:00 manybubbles: tried restarting some lsearchd services (carefully) to clear out some crashing when searching for a particular query term. It caused pool queue full errors.... serves me right for trying?
11:20 mutante: running CheckUser/maintenance/purgeOldData.php on all wikis
09:42 akosiaris: rsynced brewster /srv to carbon
09:34 mutante: restarting gitblit on antimony
09:14 mutante: DNS update - removing capella
09:09 mutante: DNS update - removing ms10
05:31 logmsgbot: springle synchronized wmf-config/db-eqiad.php 'normal loads for all upgraded slaves'