Nova Resource:Deployment-prep/SAL
2015-12-02
- 00:31 tgr: updated rsvg on appserver to 2.40.11 - https://phabricator.wikimedia.org/T112421
2015-11-04
- 00:06 Krenair: Synchronized portals: https://gerrit.wikimedia.org/r/#/c/250851/
2015-10-09
- 21:51 ori: Accidentally clobbered /etc/init.d/mysql on deployment-db1, causing deployment-prep failures. Restored now.
2015-09-16
- 20:39 cscott: updated OCG to version 4032a596ce6eb442b02cc6ee9b79263b1eb23275
2015-09-14
- 19:18 cscott: updated OCG to version 5811056e28f2bc6408b6da96095352ab381bb11f
- 12:04 dcausse: restarting elasticsearch (deployment-elastic0[5-8]) to deploy new plugins
2015-08-25
- 14:42 andrewbogott: moving deployment-cache-mobile04 to labvirt1004
2015-08-12
- 20:45 urandom: restarted restbase on deployment-restbase01 (dead)
2015-08-05
- 14:33 godog: update deployment-restbase02 to openjdk8 T104887
- 14:18 godog: update deployment-restbase01 to openjdk8 T104887
June 29
- 13:17 dcausse: restarting Elasticsearch to pick up new plugin versions
June 23
- 13:31 cscott: fixed salt on deployment-pdf02, restarted OCG there.
- 05:44 cscott: stopped OCG service on deployment-pdf02, see https://phabricator.wikimedia.org/T103473
- 05:20 cscott: updated OCG to version d7c698d5bf730d34057945e912ac75dc542dd788 ; restarted service.
- 03:58 cscott: stopped OCG on beta; redis 2.8.x is causing the service to crash on startup.
June 22
- 21:58 andrewbogott: re-enabling puppet on deployment-videoscaler01 because no reason was given for disabling
- 20:42 cscott: updated OCG to version b482144f5bd8b427bcc64a3dd287247195aa1951
June 4
- 20:29 ori: upgrading hhvm-fss from 1.1.4 to 1.1.5, has fix for T101395
May 29
- 14:07 moritzm: upgrade java on deployment-restbase0[12] to the 7u79 security update
May 28
- 08:46 godog: test es-tool restart-fast on deployment-elastic05
May 27
- 21:15 AaronSchulz: populated jobqueue:aggregator:s-wikis:v2 with 1000 fake wiki keys for load testing
- 21:07 AaronSchulz: Deployed https://gerrit.wikimedia.org/r/#/c/208852/
- 21:07 AaronSchulz: Deleted 4G of logs on jobrunner01
May 24
- 18:39 YuviKTM: purged old logs kept on NFS
May 20
- 20:58 cscott: updated OCG to version ca4f64852de5b1de782b292b50038fbd2dd84266
May 18
- 15:17 andrewbogott: rebooting deployment-logstash1
May 15
- 20:50 andrewbogott: rebooted deployment-bastion due to inconsistent run state after suspend/resume
May 13
- 21:08 cscott: updated OCG to version c7c75e5b03ad9096571dc6dbfcb7022c924ccb4f
May 2
- 00:51 yuvipanda: created deployment-boomboom to test
April 29
- 21:03 andrewbogott: suspending and shrinking disks of many instances
April 28
- 20:57 YuviPanda: KILL KILL KILL DEPLOYMENT-LUCID-SALT WITH FIRE AND BRIMSTONE AND BAD THINGS
April 27
- 08:01 _joe_: installed hhvm 3.6 on deployment-mediawiki02
April 24
- 14:25 _joe_: installing hhvm 3.6.1 on deployment-mediawiki01
April 23
- 17:19 andrewbogott: rebooting deployment-parsoidcache02 because it seems troubled
April 22
- 12:48 andrewbogott: migrating to new labvirt nodes
April 21
- 08:33 _joe_: rollback installation of hhvm 3.6
- 08:09 _joe_: installing HHVM 3.6 and the corresponding extensions on deployment-mediawiki01
April 9
- 20:11 mutante: fixed apt sources lists on deployment-bastion (T95541)
March 30
- 22:33 Josve05a: manually start mysql on db1 and db2
- 21:57 YuviPanda: reboot all instances from virt1000
March 23
- 20:41 cscott: updated OCG to version 11f096b6e45ef183826721f5c6b0f933a387b1bb
March 18
- 13:45 mobrovac: added restbase security group
- 13:35 YuviPanda: made mobrovac projectadmin
- 13:34 YuviPanda: added mobrovac to project
March 16
- 18:46 manybubbles: upgraded Elasticsearch on deployment-logstash1
March 11
- 18:47 YuviPanda: created deployment-mediawiki03
February 27
- 11:12 YuviPanda: start mysql on deployment-db1
February 26
- 11:53 YuviPanda: created deployment-parsoid01-test to test patch to use role::parsoid on labs
February 18
- 13:04 _joe_: installed new version of the hhvm extensions packages
February 17
- 23:18 Krenair: Started mysql on deployment-db1; beta now appears much less broken than before
February 6
- 20:07 ^d: scratch that, I rebuilt it as precise. why did I do that?
- 20:03 ^d: rebuilt deployment-elastic05 with new partition scheme
February 5
- 12:48 YuviPanda: cherry-picking https://gerrit.wikimedia.org/r/188798 on scap on deployment-prep
- 12:28 YuviPanda: killed chown on deployment-bastion, running directly on NFS server
- 12:13 YuviPanda: running `time sudo chown -R www-data:www-data upload7/` on /data/project
- 12:10 YuviPanda: stopped jobrunner on jobrunner01
- 11:53 YuviPanda: running git-sync-upstream on deployment-salt to pick up latest ops/puppet changes
- 11:52 _joe_: converting the web user to www-data
- 11:44 YuviPanda: deleted mediawiki03 instance, holdover from security testing from long, long ago
- 11:41 YuviPanda: disabled puppet on mediawiki01, 02, jobrunner01, bastion and salt
February 4
- 13:56 YuviPanda: created deployment-jobrunner01, trusty instance
- 13:51 YuviPanda: deleted deployment-jobrunner01, trusty version coming up
- 11:35 YuviPanda: created instance deployment-mediawiki02
- 11:26 YuviPanda: deleted instance deployment-mediawiki02
- 06:37 YuviPanda: created deployment-mediawiki01 host
- 06:34 YuviPanda: killed deployment-mediawiki01 host. FOREEVERRR
February 2
- 13:37 yuvipanda: added MX record to beta.wmflabs.org, for https://phabricator.wikimedia.org/T88215 via LDAP
January 27
- 18:15 andrewbogott: upgrading libc6 on all instances from deployment-salt
January 20
- 02:30 YuviPanda: created deployment-mediawiki04 to test roles
January 7
- 16:25 YuviPanda: added milimetric to NDA sudoers group
December 29
- 22:24 MaxSem: Created a DNS entry for m.wikidata.beta.wmflabs.org
December 22
- 12:40 _joe_: upgrading HHVM to the latest version
December 16
- 16:52 manybubbles: elasticsearch restart finished
- 16:48 mutante: deployment-db2 is down
- 16:48 manybubbles: restarting beta's elasticsearch servers to pick up a new version of a plugin. won't interfere with current downtime.
December 13
- 17:10 bd808: Many strange puppet and scap failures in beta that look to be related to DNS failures
- 16:03 bd808: Starting work on phab:T78076 to renumber apache users in beta
December 11
- 22:47 cscott: updated OCG to version bfc3812ef346c9f767135b339cedd123a1bcac98
December 6
- 05:05 ori: upgrade hhvm-tidy to 0.1-2
December 3
- 21:33 cscott: updated OCG to version 08e94b19c3f17e699d7e53d9605f65c58e17ea0e
December 2
- 17:09 _joe_: upgrading HHVM to its latest version
- 17:08 andrewbogott: this is a test message
December 1
- 21:50 cscott-split: updated OCG to version a06e7c186796a6ee5d5af81e93688520abdf2596
November 26
- 20:47 cscott: updated OCG to version 7d8f2b8bd496464041e3ef9c092732457cc8f7ef
November 24
- 15:16 YuviPanda: modified local hack to account for 47dcefb74dd4faf8afb6880ec554c7e087aa947b
- 14:58 YuviPanda: cherry-picked 3e45c538978710113e6e28e9d533bf8d18c159a6 and 9d4614a8a352c78505212fd6e9d2a7be6d2e4927 to deployment-salt puppetmaster, restoring local hacks
November 19
- 21:19 anomie: Cherry-picked https://gerrit.wikimedia.org/r/#/c/173336/3 to Beta
November 17
- 20:37 YuviPanda: cleaned out logs on deployment-bastion
- 16:48 YuviPanda: delete deployment-analytics01, a tortoise from an ancient time.
- 05:17 YuviPanda: forced `apt-get install -f` to unstick puppet
- 04:49 YuviPanda: cleaned up coredump on deployment-prep
November 16
- 00:38 YuviPanda: uncherrypick https://gerrit.wikimedia.org/r/#/c/173634/ because OMG CODE
- 00:14 YuviPanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/173634/ on deployment-salt
- 00:01 YuviPanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/173510/ on deployment-prep to make memc03 run puppet
November 14
- 20:02 anomie: Cherry-picking https://gerrit.wikimedia.org/r/#/c/173336/ for testing in logstash
November 13
- 10:11 YuviPanda: cherry pick https://gerrit.wikimedia.org/r/#/c/172967/1 to test https://bugzilla.wikimedia.org/show_bug.cgi?id=73263
November 12
- 18:16 YuviPanda: cherry picking https://gerrit.wikimedia.org/r/#/c/172776/ on labs puppetmaster to see if it fixes issues in the cache machines
November 11
- 17:13 cscott: removed old ocg cronjobs on deployment-pdf0x; see https://bugzilla.wikimedia.org/show_bug.cgi?id=73166
November 10
- 22:37 cscott: rsync'ed .git from pdf01 to pdf02 to resolve git-deploy issues on pdf02 (git fsck on pdf02 reported lots of errors)
- 21:41 cscott: updated OCG to version d9855961b18f550f62c0b20da70f95847a215805 (skipping deployment-pdf02)
- 21:39 cscott: deployment-pdf02 is not responding to git-deploy for OCG
November 5
- 06:14 ori: restarted hhvm on beta app servers
November 3
- 22:07 cscott: updated OCG to version 5834af97ae80382f3368dc61b9d119cef0fe129b
October 29
- 18:55 ori: upgraded hhvm on beta labs to 3.3.0+dfsg1-1+wm1
October 28
- 23:47 RoanKattouw: ...which was a no-op
- 23:46 RoanKattouw: Updating puppet repo on deployment-salt puppet master
- 21:36 RoanKattouw: Creating deployment-parsoid05 as a replacement for the totally broken deployment-parsoid04 (also as a trusty instance rather than precise)
- 21:06 RoanKattouw: Rebooting deployment-parsoid04, wasn't responding to ssh
October 27
- 20:23 cscott: updated OCG to version 60b15d9985f881aadaa5fdf7c945298c3d7ebeac
October 22
- 21:10 arlolra: updated OCG to version e977e2c8ecacea2b4dee837933cc2ffdc6b214cb
October 8
- 22:04 subbu: updated OCG to version def24eca
October 7
- 22:50 cscott: updated OCG to version c778ea8b898f8ad8c2b7ad9de78a75469e7ed061
October 6
- 23:13 YuviPanda: killed extra log files in deployment-bastion
- 21:44 cscott: updated OCG to version bbdf4c6400cfbbc6030114ad16e1a6f7025eab2c
- 15:36 cscott: updated OCG to version aee3712b352f51f96569de0bcccf3facf654e688
October 3
- 19:51 manybubbles: performing rolling restart of elasticsearch nodes to pick up preview of accelerated regex plugin for testing at larger-than-mylaptop-scale
- 14:02 manybubbles: rebuilding beta's simplewiki cirrus *index*
October 1
- 20:13 cscott: updated OCG to version 48c495e3656f528abe636ce0cd7562270505534f
- 16:40 bd808: Added Gilles to under_NDA sudoers group
September 30
- 22:00 bd808: Cleaned deleted instances out of salt and trebuchet redis
- 20:26 bd808: Converted deployment-rsync02 to use local puppet & salt masters
- 15:36 bd808: enabling puppet and forcing run on deployment-mediawiki03
- 15:34 bd808: enabling puppet and forcing run on deployment-mediawiki02
- 15:28 bd808: enabling puppet and forcing run on deployment-mediawiki01
September 29
- 22:45 Reedy: re-enabled beta-scap-eqiad
- 21:34 Reedy: disabled "beta-scap-eqiad" until things are fixed
- 21:24 Reedy: deleted l10n cache on deployment-rsync01 to attempt to run sync-common manually
- 21:22 Reedy: deployment-rsync01 hard drive is far too small
- 17:57 cscott: updated OCG to version 89d8f29a24295b05d0643abe976fea83b56575c9
- 06:58 ori: Configured Beta cluster to use redis for session storage
- 06:57 ori: Created deployment-redis02 and converted it to use local puppet & salt masters
- 05:23 ori: Created deployment-redis01 and converted it to use local puppet & salt masters
September 28
- 14:38 andrewbogott: cherry-picked https://gerrit.wikimedia.org/r/#/c/163464/ onto deployment-salt to fix a puppet compile failure.
- 14:38 andrewbogott: edited and re-cherry-picked roan's citoid patch into beta because the previous version was breaking puppet
September 26
- 06:34 cscott: updated OCG to version f3a6c1cbba118d4a5e1aa019937dc50159fc823d
September 25
- 22:48 RoanKattouw: Fixed permissions of deployment-bastion:/srv/deployment/mathoid/mathoid/.git/deploy (needed g+w)
- 11:36 _joe_: updated hhvm to fix most bugs, also cherry-picked https://gerrit.wikimedia.org/r/#/c/162839/
September 24
- 23:00 bd808: Updated bash with salt
- 20:52 cscott: updated OCG to version 48acb8a2031863e35fad9960e48af60a3618def9
September 23
- 20:14 cscott: updated OCG to version 1cf9281ec3e01d6cbb27053de9f2423582fcc156
- 17:37 AaronSchulz: Initialized bloom cache on betalabs, enabled it, and populated it for enwiki
September 22
- 16:08 ori: updating HHVM to 3.3.0-20140918+wmf1
September 20
- 14:43 andrewbogott: moving deployment-pdf02 to virt1009
- 00:36 mutante: raised instance quota to 43
September 19
- 00:26 cscott: updated OCG to version ce16f7adb60d7c77409e2e11ba0e5d6cce6955d5
September 16
- 15:44 godog: testing scap change from https://gerrit.wikimedia.org/r/#/c/160668/
- 02:46 cscott: updated OCG to version 188a3c221d927bd0601ef5e1b0c0f4a9d1cdbd31
September 15
- 21:44 andrewbogott: migrating deployment-videoscaler01 to virt1002
- 21:41 andrewbogott: migrating deployment-sentry2 to virt1002
- 21:40 cscott: *skipped* deploy of OCG, due to deployment-salt issues
- 21:19 bd808: Added Matanya to under_NDA sudoers group (bug 70864)
September 12
- 12:24 _joe_: set up hiera, noop as expected
September 11
- 16:31 YuviPanda: Delete deployment-graphite instance
- 02:29 mutante: raised instance quota by 1 to 42
September 10
- 08:14 Krinkle: bits.beta.wmflabs.org is down with 503 Service Unavailable (http://bits.beta.wmflabs.org/en.wikipedia.beta.wmflabs.org/load.php)
September 9
- 20:08 cscott: updated OCG to version c9a2b4cf2502479eeabed07ab2de728695d96e46
September 7
- 23:48 bd808: Added John F. Lewis to under_NDA sudo policy (bug 70539)
- 23:29 bd808: Promoted John F. Lewis to project admin (bug 70539)
- 23:26 bd808: Added Jalexander as project member (bug 70539)
September 5
- 17:54 bd808: Purged varnish cache on deployment-cache-bits01 -- `sudo varnishadm ban req.url '~' /` (sketch at the end of this day's entries)
- 16:00 YuviPanda: unfuck puppet on deployment-salt. Puppet is stupid and does not properly report failed events in last_run_summary.yaml if there's a syntax error or a resource conflict, so I have to read last_run_report and do things with *that* instead now
- 15:49 YuviPanda: deliberately fucking up puppet to see if icinga complains
- 09:52 _joe_: cherry-picked I6ec53da483bebfa375eba2383cbf60123ff1ce26, it works
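A minimal sketch of the 17:54 full-cache purge above, assuming the varnish 3.x CLI the beta caches ran at the time; the `ban.list` check is a verification step not in the original entry:

```
# Ban every cached object whose request URL matches "/" (i.e. everything);
# banned objects are discarded instead of being served from cache.
sudo varnishadm ban req.url '~' /

# Verify the ban was registered.
sudo varnishadm ban.list
```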
September 4
- 16:06 bd808: Manually cleaned bogus LocalRenameUserJob jobs from redis
- 13:54 _joe_: stopped puppet on the appservers but mw03, testing an apache change
- 05:28 legoktm: stopping jobrunner on deployment-jobrunner01
- 05:22 legoktm: restarted jobrunner on deployment-jobrunner01
- 05:14 bd808: Bad jobs in job queue filled up /var on jobrunner01 and killed jobrunner script. Leaving down for now until I find out how to delete the bad jobs.
- 01:41 bd808: Killed old jobs-loop.sh processes on deployment-jobrunner01
- 01:24 bd808: Many jobrunner errors like "wikiversions-labs.cdb has no version entry for `amwiki`" with various wiki names
- 01:23 bd808|AWAY: Started jobrunner service manually on jobrunner01.
- 00:44 bd808: Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known when Trebuchet is running)
- 00:35 bd808: Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known)
September 3
- 15:02 bd808: _joe_ rolled out a new hhvm package ~5 hours ago
- 15:01 bd808: morebots is back thanks to petan
- 14:50 bd808: logmsgbot down apparently
September 2
- 15:34 bd808: False alarm. SSL is borked in beta and we know that
- 15:29 bd808: `curl -vL -H 'Host: en.wikipedia.beta.wmflabs.org' localhost` works from deployment-cache-text02 (sketch at the end of this day's entries)
- 15:27 bd808: https://en.wikipedia.beta.wmflabs.org/ returning ERR_CONNECTION_REFUSED (is varnish down?)
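A sketch of the smoke test from the 15:29 entry above; the direct-to-backend check and the app server name are assumptions added for illustration:

```
# Hit the local varnish with the production-style Host header, so the
# right vhost answers even though we connect to localhost.
curl -v -H 'Host: en.wikipedia.beta.wmflabs.org' http://localhost/ -o /dev/null

# Bypass varnish and hit a backend app server directly (hypothetical target)
# to tell a cache problem apart from an apache/hhvm problem.
curl -v -H 'Host: en.wikipedia.beta.wmflabs.org' http://deployment-mediawiki01/ -o /dev/null
```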
August 29
- 22:56 bd808: Got puppet to run cleanly on deployment-mediawiki03. Should be ready for serving traffic.
- 22:39 bd808: Fixed a merge conflict in operations/puppet on deployment-salt
- 21:46 bd808: Forced install of the right version of libvips-tools on mediawiki03: `sudo apt-get install libvips-tools=7.38.5-2` (sketch below)
- 08:40 hashar: rebooting deployment-cache-mobile03 (kernel up)
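A hedged sketch of the 21:46 pinned install above; the `apt-mark hold` step is an assumption (not in the log) to keep unattended upgrades from replacing the pinned version:

```
# Install the exact version named in the log entry.
sudo apt-get install libvips-tools=7.38.5-2

# Assumption: hold the package so automatic upgrades leave it alone.
sudo apt-mark hold libvips-tools
```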
August 28
- 21:32 bd808: Added "Greg Grossmeier" to UnderNDA sudoers group
- 17:12 bd808: Changed centralauth db to rename labswiki -> deploymentwiki
- 16:49 bd808: CentralAuth looks broken on http://deployment.wikimedia.beta.wmflabs.org/
- 16:49 bd808: Apache vhosts look good again
- 16:34 bd808: Restarted varnishes on deployment-cache-text02
- 16:13 andrewbogott: merging a patch that renames 'labswiki' to 'deploymentwiki'
- 09:21 hashar: resetting git repository in /data/project/apache/conf to point to the beta cluster branch of operations/mediawiki-config.git; discarded all local hacks in the process
August 27
- 23:03 hashar: Blacklisting the security audit IP again on deployment-cache-bits01, -mobile03 and -text02
- 22:53 hashar: removed the blackhole ip route from deployment-cache-text02 and deployment-cache-mobile03
- 22:48 hashar: the IP is a known security audit. See Chris Steipp.
- 22:46 hashar: blackholed an IP address on deployment-cache-text02 and deployment-cache-mobile03; it was causing hundreds of requests per second and overloading the beta cluster. Use `route -n` to find the IP
- 22:37 hashar: restarting udp2log-mw on deployment-bastion. It has been crashing repeatedly since fairly recently
- 22:26 bd808: when restarting varnish on deployment-cache-text02, don't forget that there are 2 varnish services (varnish and varnish-frontend); see the sketch at the end of this day's entries
- 22:19 bd808: restarted varnish (again) on deployment-cache-text02
- 22:10 bd808: restarted varnish on deployment-cache-text02
- 16:22 bd808: killing `apt-get update` process running on deployment-bastion since Jun 13
- 14:59 bd808: Resolved puppet git merge conflict on deployment-salt
- 14:49 bd808: Moved hhvm core dumps to /data/project/hhvm-cores
- 14:42 bd808: Root drive full on deployment-mediawiki02; hhvm core files are the culprit
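Per the 22:26 reminder above, a text-cache restart only takes full effect if both daemons are bounced; a minimal sketch:

```
# deployment-cache-text02 runs a backend and a frontend varnish;
# restarting only one of them leaves the other serving stale objects.
sudo service varnish restart
sudo service varnish-frontend restart
```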
August 25
- 23:47 ori: stopping hhvm/apache on deployment-mediawiki02 to replace debug build of hhvm with release build
- 21:44 bd808: Deployed scap 116027f (Make sync-common update l10n cdb files by default)
- 18:30 ori: deployment-mediawiki02: cleared /tmp; running puppet
- 15:05 hashar: mediawiki02: rm /tmp/hhvm*.core. Filed as bug 69979
- 15:01 hashar: mediawiki02: rm /tmp/mw-cache-master/conf*
- 15:01 hashar: mediawiki02 has mw conf caches under /tmp/mw-cache-master/, and since that partition filled up, the conf caches ended up as null files
- 15:00 hashar: mediawiki02: rm /var/log/upstart/hhvm*
- 14:53 hashar: mediawiki02: removed /var/lib/puppet/state/agent_catalog_run.lock
- 14:46 hashar: restarting udp2log-mw service on -bastion. It is stalled for some reason
- 14:42 hashar: on mediawiki02, clearing out some /var/log/upstart/hhvm.* log files; see bug 69976
- 14:34 hashar: mediawiki02 / partition is 100% full
August 22
- 20:21 hashar: udp2log logs are back in /data/project/logs. The udp2log-mw service stalled for some reason.
- 20:08 ori: ran 'git pull' on deployment-salt:/srv/var-lib/git/operations/puppet
- 19:59 hashar: restarting udp2log-mw service on deployment-bastion
- 19:59 hashar: bits yielding 503
- 00:41 bd808: cherry-picked scap change https://gerrit.wikimedia.org/r/#/c/155677/ for testing
August 21
- 21:49 bd808: Trebuchet happier after all the salt-minion restarts; still have deleted hosts showing in the expected minion list for scap deploys
- 21:01 twentyafterfour: Started salt-minion on deployment-redis01
- 21:01 bd808: Started salt-minion on deployment-upload
- 21:00 bd808: Started salt-minion on deployment-fluoride
- 21:00 bd808: Started salt-minion on deployment-db1
- 20:59 bd808: Started salt-minion on deployment-elastic01
- 20:59 twentyafterfour: Started salt-minion on deployment-eventlogging02
- 20:58 bd808: Started salt-minion on deployment-elastic02
- 20:58 bd808: Started salt-minion on deployment-elastic03
- 20:57 bd808: Started salt-minion on deployment-elastic04
- 20:57 bd808: Started salt-minion on deployment-analytics01
- 20:55 bd808: Started salt-minion on deployment-cache-upload02
- 20:54 bd808: Started salt-minion on deployment-memc04
- 20:54 bd808: Started salt-minion on deployment-parsoid04
- 20:49 bd808: Started salt-minion on deployment-memc05
- 20:48 bd808: Started salt-minion on deployment-db2
- 20:48 twentyafterfour: Started salt-minion on deployment-cache-text02
- 20:47 twentyafterfour: Started salt-minion on deployment-memc03
- 20:46 bd808: Started salt-minion on deployment-cxserver01
- 20:12 bd808: List of broken salt minions can be obtained with `sudo salt-run manage.down` on deployment-salt (sketch at the end of this day's entries)
- 19:55 bd808: Fixed salt on deployment-memc02
- 19:52 bd808: Salt minions are broken all over beta. Hung grain-ensure calls, hung test.ping calls, downed minions
- 19:50 bd808: Killed dozens of grain-ensure calls and started salt-minion on deployment-cache-mobile03
- 19:47 bd808: Killed hung salt-call and started salt-minion on deployment-cache-bits01
- 19:28 bd808: Deployed cherry-pick of Iea7217a for scap
- 19:27 bd808: Restarted salt-minion on deployment-jobrunner01 & deployment-videoscaler01
- 19:27 bd808: Killed rogue salt-master process on deployment-bastion
- 19:26 bd808: Deleted salt keys for retired apache0[12] minions
- 00:13 bd808: Upgraded elasticsearch to 1.3.2 on deployment-logstash1
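A sketch of the minion recovery above, combining the 20:12 tip with a batched restart (the `-b 1` variant is borrowed from the August 18 entries below); run on deployment-salt:

```
# List minions that no longer respond to the master.
sudo salt-run manage.down

# Restart the minion service everywhere, one host at a time (-b 1),
# so a single wedged minion cannot stall the whole run.
sudo salt '*' -b 1 service.restart salt-minion
```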
August 19
- 16:11 hashar: deleted /usr/local/apache/common-local symlink, made it a directory and retriggered https://integration.wikimedia.org/ci/job/beta-scap-eqiad/17887/console
- 16:03 bd808: Removed local changes to /usr/local/apache/conf/wmflabs-logging.conf on deployment-mediawiki02; logs back to nfs share
- 15:52 bd808: Changed apache logging level from debug to notice on deployment-mediawiki02 in /usr/local/apache/conf/wmflabs-logging.conf
- 15:47 bd808: Changed apache logging level from debug to warn on deployment-mediawiki02
- 15:44 bd808: /var full on deployment-mediawiki02; deleting 572M /var/log/apache2/debug.log.1
- 15:03 hashar: Killed some stalled scap / rsync process on deployment-bastion that were preventing https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ from acquiring the lock.
- 14:17 hashar: huge rsync in progress on bastion
- 14:00 hashar: On bastion, reverted the symlink and manually created directory /usr/local/apache/common-local
- 13:55 hashar_: On bastion, deleting /usr/local/apache/common-local and symlinking it to /srv/common-local
August 18
- 22:22 ^d: dropped apache01/02 instances; unused, and we need the resources
- 18:23 manybubbles: finished upgrading elasticsearch in beta - everything seems ok so far
- 18:15 bd808: Restarted salt-minion on deployment-mediawiki01 & deployment-rsync01
- 18:15 bd808: Ran `sudo pkill python` on deployment-rsync01 to kill hundreds of grain-ensure processes
- 18:12 bd808: Ran `sudo pkill python` on deployment-mediawiki01 to kill hundreds of grain-ensure processes
- 18:10 manybubbles: finally restarting beta's elasticsearch servers now that they have new jars
- 17:56 bd808: Manually ran trebuchet fetches on deployment-elastic0*
- 17:49 bd808: Forcing puppet run on deployment-elastic01
- 17:47 godog: upgraded hhvm on mediawiki02 to 3.3-dev+20140728+wmf5
- 17:44 bd808: Trying to restart minions again with `salt '*' -b 1 service.restart salt-minion`
- 17:39 bd808: Restarting minions via `salt '*' service.restart salt-minion`
- 17:38 bd808: Restarted salt-master service on deployment-salt
- 17:19 bd808: 16:37 Restarted Apache and HHVM on deployment-mediawiki02 to pick up removal of /etc/php5/conf.d/mail.ini (logged in prod SAL by mistake)
- 16:59 manybubbles|lunc: upgrading Elasticsearch in beta to 1.3.2
- 16:11 bd808: Manually applied https://gerrit.wikimedia.org/r/#/c/141287/12/templates/mail/exim4.minimal.erb on deployment-mediawiki02 and restarted exim4 service
- 15:28 bd808: Puppet failing for deployment-mathoid due to duplicate definition error in trebuchet config
- 15:15 bd808: Reinstated puppet patch to depool deployment-mediawiki01 and forced puppet run on all deployment-cache-* hosts
- 15:04 bd808: Puppet run failing on deployment-mediawiki01 (apache won't start); Puppet disabled on deployment-mediawiki02 ('reason not specified'). Probably needs to wait until Giuseppe is back from vacation for fixing.
- 15:00 bd808: Rebooting deployment-eventlogging02 via wikitech; console filling with OOM killer messages and puppet runs failing with "Cannot allocate memory - fork(2)"
- 14:29 bd808: Forced puppet run on deployment-cache-upload02
- 14:27 bd808: Forced puppet run on deployment-cache-text02
- 14:24 bd808: Forced puppet run on deployment-cache-mobile03
- 14:20 bd808: Forced puppet run on deployment-cache-bits01
August 17
- 22:58 bd808: Attempting to reboot deployment-cache-bits01.eqiad.wmflabs via wikitech
- 22:56 bd808: deployment-cache-bits01.eqiad.wmflabs not allowing ssh access and wikitech console full of OOM killer messages
August 15
- 21:57 legoktm: set $wgVERPsecret in PrivateSettings.php
- 21:42 hashSpeleology: Beta cluster database updates are broken due to CentralNotice. Fix up is 154231
- 20:57 hashSpeleology: deployment-rsync01 : deleting /usr/local/apache/common-local content. Then ln -s /srv/common-local /usr/local/apache/common-local as set by beta::common which is not applied on that host for some reason. bug 69590
- 20:55 hashSpeleology: puppet administratively disabled on mediawiki02 . Assuming some work in progress on that host. Leaving it untouched
- 20:54 hashSpeleology: puppet is proceeding on mediawiki01
- 20:52 hashSpeleology: attempting to unbreak mediawiki code update bug 69590 by cherry picking 154329
- 20:39 hashSpeleology: in case it is not in SAL: MediaWiki is no longer synced to the app servers; bug 69590
- 20:20 hashSpeleology: rebooting mediawiki01; /var refuses to clear out and sticks at 100% usage
- 20:16 hashSpeleology: cleaning up /var/log on deployment-mediawiki02
- 20:14 hashSpeleology: on deployment-mediawiki01 deleting /var/log/apache2/access.log.1
- 20:13 hashSpeleology: on deployment-mediawiki01 deleting /var/log/apache2/debug.log.1
- 20:13 hashSpeleology: bunch of instances have a full /var/log :-/
- 11:37 ori: deployment-cache-bits01 unresponsive; console shows OOMs: https://dpaste.de/LDRi/raw . rebooting
- 03:20 jeremyb: 02:46:37 UTC <ebernhardson> !log beta /dev/vda1 full. moved /srv-old to /mnt/srv-old and freed up 2.1G
August 14
- 12:23 hashar: manually rebased operations/puppet.git on puppetmaster
August 13
- 08:02 hashar: beta-code-update-eqiad is running again
- 07:57 hashar: fixing ownership under /srv/scap-stage-dir/php-master/skins; some files belonged to root
- 07:55 hashar: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is broken :-/
August 8
- 16:05 bd808: Fixed merge conflict that was preventing updates on puppet master
August 6
- 13:13 hashar: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is running again
- 13:13 hashar: removed a bunch of local hacks on deployment-bastion:/srv/scap-stage-dir/php-master. They caused the git repo to be dirty and prevented scap from completing git pull there
- 12:08 hashar: Manually pruning whole text cache on deployment-cache-text02
- 12:07 hashar: Apache virtual hosts were not properly loaded on mediawiki02. I have hacked /etc/apache2/apache2.conf to make it Include /usr/local/apache/conf/all.conf (instead of main.conf, which does not include everything)
- 08:43 hashar: pruning cache on deployment-cache-text02 / restarting varnish
August 2
- 08:53 swtaarrs: rebuilt and restarted hhvm on deployment-mediawiki02 with potential fix
- 05:17 swtaarrs: restarted hhvm on deployment-mediawiki0{1,2} to unwedge them
August 1
- 15:03 bd808: Updated cherry-pick of Iceb8f43
- 15:02 bd808: Cleaned up puppet repo on deployment-salt; merge conflicts with local Ia463120 hack; reapplied depool of deployment-mediawiki01
- 14:50 bd808: Restarted stuck hhvm on deployment-mediawiki02; apache had 89 children waiting for a response
- 13:27 godog: changed inplace bt-hhvm on deployment-mediawiki01/02 to also copy the binary
- 05:32 ori: depooled deployment-mediawiki02 to investigate HHVM lock-up by cherry-picking I7df8c5310 on beta.
- 00:40 ori: disabled puppet on deployment-mediawiki{01,02} and enabled verbose apache logging
July 31
- 22:41 bd808: Restarted hhvm on -mediawiki{01,02}. Brett looked at 01 before I did and said "it's the same as before"
- 20:09 cscott: updated OCG to version d2919c59eb09e09fc87777696411a070620aef45
- 19:59 hashar: Granted sudo right to cscott (under NDA). Will let him reboot OCG service
- 18:58 ori: re-enabled puppet on deployment-mediawiki{01,02}
- 10:41 hashar: Taking gdb traces of hhvm on mediawiki01 and mediawiki02. Restarting hhvm
- 05:08 bd808: HHVM hung on both boxes. Grabbed core and backtrace before restarting
July 30
- 19:59 bd808: Created local commit 7d56b79 in puppet to work around bugs in Ia463120718dceab087ad3f8e3f35917fa879f387
- 19:46 bd808: Restored prior /etc/hhvm/php.ini from puppet filebucket archive on deployment-mediawiki0[12]
- 19:32 bd808: Disabled puppet on deployment-mediawiki02 for the same reason
- 19:31 bd808: Disabled puppet on deployment-mediawiki01; Ori will look into hhvm config changes that were being applied
- 16:52 bd808: Fixed beta-scap-eqiad Jenkins job by correcting ssh problems in beta project
- 16:43 bd808: Fixed ssh to jobrunner01 and videoscaler01 by correcting unrelated puppet manifest problem and forcing run via salt.
- 16:00 bd808: Puppet runs on videoscaler01 and jobrunner01 failing for "Could not find dependency Ferm::Rule[bastion-ssh] for Ferm::Rule[deployment-bastion-scap-ssh]"
- 16:00 bd808: Puppet seems manually disabled on apache0[12].
- 15:59 bd808: Can't ssh to apache0[12], videoscaler01 and jobrunner01. Puppet not running on any of them. libnss-ldapd unattended update has broken /etc/nslcd.conf
- 15:23 bd808: Removed cherry-pick for Iac547efa83cf059a1276b6e279c3ebd4c7224b2c and updated cherry-pick for I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 to latest patch set.
- 15:05 bd808: Two cherry-picks in puppet conflicting with merged production changes: I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 and Iac547efa83cf059a1276b6e279c3ebd4c7224b2c (ori, twentyafterfour)
- 14:49 bd808: Started apache2 service on deployment-mediawiki01
- 14:16 hashar: rebooting hhvm
- 09:42 hashar: bastion had broken puppet because deployment_server and zuul both declared the same python packages 150501
- 09:40 hashar: restoring on puppetmaster modules/mediawiki/templates/apache/apache2.conf.erb which got deleted somehow
- 09:29 hashar: Rebooting apache01/02 to see whether it fixes the ssh connection issue
- 09:27 hashar: manually started hhvm on mediawiki01
- 09:25 hashar: rebooting deployment-mediawiki01; hhvm process went zombie
- 09:23 hashar: restarting hhvm on mediawiki 01/02
- 09:05 hashar_: Beta scap script broken since 6:30am UTC https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
July 29
- 22:56 cscott: updated OCG to version aeb8623d6ebe41ae7c7e36c57844bd9ea8e6d595
- 21:02 bd808: Converted deployment-sentry2.eqiad.wmflabs to use beta salt/puppet master
- 19:14 hashar: Removed all jobs from the queue, restarted the slave agent. Update jobs are coming back
- 19:09 hashar: deployment-bastion jenkins slave is stuck. Beta cluster is no longer updating code :-//
- 15:58 godog: restarted hhvm on deployment-mediawiki01
- 15:52 godog: restarted hhvm on deployment-mediawiki02
- 15:50 godog: installed libevent-dbg on deployment-mediawiki02 to capture an hhvm backtrace
- 15:17 bd808: _joe_ restarting hhvm on deployment-mediawiki01
- 15:00 bd808: Apache stuck with 65 children on both deployment-mediawiki servers
- 10:37 hashar: Restarted hhvm on mediawiki{01,02}
July 28
- 17:41 bd808: Updated hhvm to latest 3.3-dev+20140728 build on deployment-mediawiki0[12]
- 15:37 manybubbles: rebuilding elasticsearch indexes to build a weighted all field we'll try to use to improve performance
- 15:32 bd808: Restarted hhvm on deployment-mediawiki0[12]. All apache children were stuck waiting for hhvm to respond.
- 15:20 bd808: Restarted apache on deployment-mediawiki02. 65 children and non-responsive to requests. (same as mediawiki01)
- 15:18 bd808: Restarted apache on deployment-mediawiki01. 65 children and non-responsive to requests.
- 14:23 manybubbles: or not - looks like I can't!
- 14:22 manybubbles: rebuilding cirrus search indexes to pick up the sped-up all field
- 08:30 hashar: restarted varnish on deployment-cache-bits01. Hoping to clear the bits cache
July 25
- 18:29 bd808: Added twentyafterfour and several other WMF staff to under_NDA sudo group
- 17:15 bd808: Morebots is back!
- 16:38 bd808: pstree showed "hhvm─┬─271*[sh]" on deployment-mediawiki02
- 16:38 bd808: Killed apache2+hhvm and restarted on deployment-mediawiki0[12]
- 16:06 bd808: `tcpdump -n udp dst port 8324` shows packets leaving deployment-bastion for deployment-logstash1 (sketch at the end of this day's entries)
- 16:00 bd808: Stopped udp2log and started udp2log-wm with no apparent effect
- 16:00 bd808: udp2log events not being sent from deployment-bastion to deployment-logstash1
- 15:49 bd808: Restarted logstash on deployment-logstash1
- 09:45 mwalker: rebasing puppet repo to get a ocg patch
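A sketch of the 16:06 packet check above; port 8324 is the udp2log port referenced there, and the receiving-side capture is an assumption added for the other half of the check:

```
# On deployment-bastion: confirm udp2log packets leave for the logstash host.
sudo tcpdump -n udp dst port 8324

# Assumption: on deployment-logstash1, confirm the packets actually arrive.
sudo tcpdump -n udp port 8324
```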
July 24
- 16:09 bd808: Reverted the MW config change that re-enabled luasandbox; back to luastandalone for now
- 15:44 bd808: Updated MW config to re-enable luasandbox mode
- 15:43 bd808: Updated hhvm-luasandbox to 2.0-3 and restarted hhvm instances
- 14:21 hashar: killed hhvm process on deployment-mediawiki01 and 02. init script does not work.
- 02:59 ori: promoted legoktm to project-admin
July 23
- 23:30 bd808: Running `find . -type d -exec chmod 777 {} +` in /data/project/upload7 to fix shared image dir permissions (sketch below)
- 20:49 bd808: Changed config to run lua via external executable to avoid hhvm crashing bug
- 16:20 bd808: hhvm upgraded to 3.1+20140723-1+wmf1 on deployment-mediawiki0[12]
- 15:34 bd808: Reverted hhvm to 3.1+20140630-1+wm1 on deployment-mediawiki02
- 15:21 bd808: Upgraded hhvm to 3.1+20140630; seeing problems with luasandbox extension
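The 23:30 directory-permission fix above, sketched with an absolute path; mode 777 matches the log entry, though a tighter mode would normally be preferable:

```
# chmod directories only, leaving files untouched; '{} +' batches many
# paths into each chmod invocation instead of forking once per directory.
find /data/project/upload7 -type d -exec chmod 777 {} +
```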
July 22
- 14:26 hashar: upgrading varnish on deployment-cache-mobile03
- 14:22 hashar: upgrading varnish on deployment-cache-text02
- 14:02 hashar: rebooting deployment-cache-upload02 varnish not happy with memory mapping
- 13:51 hashar: rebooting bits varnish cache
- 13:43 hashar: rebased puppetmaster repo. Rebase got broken after 0317463 ('beta: New script to restart apaches') got merged in.
- 13:35 hashar: apt-get upgrade on deployment-cache-bits01 + varnish upgrade
- 09:28 hashar: Removing role::beta::natfix, which is now handled by labs DNS; the class is removed with 146091
July 21
- 23:37 ori: Switched over beta cluster app servers to HHVM
- 21:27 bd808: Killed update.php jobs; Antoine will give jobs a longer timeout
- 21:23 bd808: Running update.php for simplewiki in screen
- 21:22 bd808: Running update.php for hewiki in screen
- 21:21 bd808: Running update.php for eswiki in screen
- 21:21 bd808: Running update.php for cawiki in screen
- 21:21 bd808: Running update.php for commonswiki in screen
- 21:18 hashar: Restarting udp2log-mw on deployment-bastion. There are a bunch of [python] <defunct> processes
- 17:32 bd808: Updated scap to 4871208 (+ cherry pick of I6a56b5e)
- 17:12 bd808: Hotfix for scap ssh host key checking to fix jenkins scap job
- 17:03 bd808: Testing scap change I40a891b via cherry-pick
- 10:25 hashar: on bastion, fixed some puppet dependencies to have nutcracker start with the proper configuration 148043
- 10:20 hashar: upgrading packages on deployment-bastion
- 10:19 hashar: deleted /var/lib/apt/lists/lock on bastion. It was preventing apt-get update from running
- 10:18 hashar: setting up nutcracker on deployment-bastion. It was installed but the puppet class to configure it was not being applied. Related Gerrit patches: 148041 and 148042
- 09:25 hashar: rebooting deployment-apache02
- 09:22 hashar: rebooting deployment-apache01.
- 00:27 ori: deployment-mediawiki01 & deployment-mediawiki02: configured for project-local puppet & salt masters
July 18
- 00:30 bd808: removed local l10nupdate user from deployment-jobrunner01 and deployment-videoscaler01
- 00:22 bd808: Killed stuck beta-update-databases-eqiad job (stuck for over 60m waiting for executor; deadlock?)
- 00:21 ori: beta broke due to I433826423. app servers load prod apache confs from /etc/apache2/wikimedia. temp fix: locally hack apache2.conf to load /usr/local/apache2/conf/all.conf; disable puppet.
July 17
- 23:18 bd808: Puppet broken for deployment-bastion by labs specific logic in misc::deployment::vars.
- 19:01 mwalker: possibly breaking labs by cherry picking an apparmor patch that affects mysql https://gerrit.wikimedia.org/r/#/c/147027/
July 16
- 19:15 mwalker: updated puppet about 20 minutes ago for new ocg variables (now officially in production puppet instead of just cherry picked)
July 15
- 18:26 bd808: Removed local mwdeploy user from /etc/passwd on deployment-videoscaler01 and deployment-jobrunner01
- 16:59 bd808: scap failing to deployment-videoscaler01 and deployment-jobrunner01 due to other random failures now. Lots of strange permissions errors during rsync
- 16:37 bd808: scap failing to deployment-videoscaler01 and deployment-jobrunner01 due to ssh auth failures; likely a puppet config problem
July 10
- 22:37 bd808: Added Gergő Tisza and Yuvipanda as project admins
July 8
- 23:37 bd808: Updated Kibana to 0afda49 (latest upstream head)
- 17:03 greg-g: Added John F. Lewis to the project after his NDA was signed by Mark (RT 7722)
July 7
- 20:55 bd808: Killed stuck `apt-get update` job on deployment-jobrunner01 started on Jun 17
- 20:20 bd808: Fixed puppet on deployment-analytics01 with manual apt-get commands.
- 20:08 bd808: Ran `apt-get dist-upgrade` on deployment-analytics01 to upgrade hadoop, hive, pig, etc which were failing to update via puppet.
July 4
- 02:28 RoanKattouw: Unbroke replication on deployment-db2, it's catching up now
July 3
- 18:59 legoktm: manually created centralauth.renameuser_status table
- 16:04 bd808: Updated scap to ff04431
- 09:24 hashar: Reindexed ElasticSearch index for cawiki/eswiki with: mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki {cawiki,eswiki} --batch-size=50 (sketch at the end of this day's entries)
- 09:22 hashar: Blew away the ElasticSearch indices for cawiki and eswiki with: mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php --wiki cawiki --startOver --indexType content && mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php --wiki cawiki --startOver --indexType general
- 09:10 hashar: used addwiki.php to create the wiki. Manually triggered the Jenkins job that updates the databases https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/2319/
- 09:06 hashar: Adding cawiki and eswiki for cxserver testing Ibbcbd4
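The 09:22/09:24 rebuild above, sketched as a loop over the affected wikis; the loop is an assumption for readability, while the mwscript invocations are copied from the log entries:

```
for wiki in cawiki eswiki; do
    # Drop and recreate both index types, then repopulate them.
    mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php \
        --wiki "$wiki" --startOver --indexType content
    mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php \
        --wiki "$wiki" --startOver --indexType general
    mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php \
        --wiki "$wiki" --batch-size=50
done
```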
July 2
- 07:49 hashar: cxserver being configured! 140723 by Kartik and Niklas \O/
July 1
- 15:46 bd808: Fixed git rebase conflict in operations/puppet on deployment-salt
- 13:29 manybubbles: rebuilding Cirrus search index in beta to pick up new configuration and cache warmers
- 11:20 hashar: Added Filippo Giunchedi to the project as an admin (WMF ops)
June 30
- 20:47 bd808: The state of puppet for beta is badly broken. I have hacked things to get puppet to apply on deployment-apache0[12] but puppet won't apply on deployment-bastion in part due to the same hacks.
- 18:48 bd808: Created symlink /apache -> /usr/local/apache on deployment-apache0[12] to fix docroot symlinks
- 18:09 bd808: Beta apaches are broken with latest puppet config applied. Working to correct.
- 18:08 bd808: Manually added symlink for /etc/apache/wmf on deployment-apache0[12]
June 26
- 12:48 YuviPanda: cherry picked https://gerrit.wikimedia.org/r/#/c/142228/ to puppetmaster, sending events to charcoal.wmflabs.org now with projectname \o/
- 09:46 YuviPanda: cherry-picked https://gerrit.wikimedia.org/r/#/c/142210/ on to puppetmaster
- 09:38 hashar: Granting sudo to YuviPanda
June 25
- 20:58 bd808: Fixed rebase conflict in operations/puppet.git on deployment-salt caused by cherry-picked vcl patch left over from varnish submodule usage
June 24
- 19:29 bd808: Manually updated operations/puppet checkout on deployment-salt to deal with varnish submodule change
June 19
- 22:47 bd808: Updated scap to 792a572
- 22:46 bd808: Trebuchet runs on deployment-videoscaler01 are succeeding but not showing up in the `git deploy report` output
- 22:40 bd808: Deleted /var/log/diamond/diamond.log on deployment-jobrunner01 because /var was full
June 18
- 16:55 bd808: Setup hourly cron as user bd808 on deployment-salt to test automatic update of puppet repo using ~bd808/git-sync-upstream script
June 17
- 20:36 bd808: Upgraded elasticsearch to version 1.2.1 on deployment-logstash1
June 16
- 21:16 bd808: Jenkins beta-scap-eqiad job broken because of missing puppet config on deployment-jobrunner01; needs role::beta::scap_target
- 20:36 bd808: Enabled puppet on deployment-jobrunner01 and forced a run
- 20:34 bd808: Puppet disabled on deployment-jobrunner01 since 2014-06-03; No SAL logs explaining why
- 20:19 bd808: Updated scap to 5adce72; trebuchet reported i-00000237 (deployment-videoscaler01) as not updating, but manual check shows it did sync properly
- 20:00 bd808: Deleted /var/lib/puppet/state/agent_catalog_run.lock on deployment-bastion after verifying that no puppet processes were running
- 19:55 bd808: Truncated /var/log/diamond/diamond.log and restarted diamond on deployment-bastion
- 19:36 bd808: /var/log/diamond is 787M of 1.2G total logs
- 19:29 bd808: /var 0% free on deployment-bastion; looking for things to clean-up
June 9
- 15:19 andrewbogott: doing a 'rebase origin' on deployment-salt, because it needs it.
- 15:10 andrewbogott: updating all instances to puppet 3 via a cherry-pick of https://gerrit.wikimedia.org/r/#/c/137898/ on deployment-salt
June 7
- 02:44 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-06-06T22:11:04
June 6
- 19:26 bblack: synced labs/private on deployment-salt again
- 16:30 bd808: Rebooted deployment-salt
- 16:27 bd808: Made /var/log a symlink to /srv/var-log on deployment-salt (sketch at the end of this day's entries)
- 16:26 bblack: Updated labs/private.git on puppetmaster. brings in updated zero+netmapper password for beta
- 16:18 bd808: Changed from role::labs::lvm::biglogs to role::labs::lvm::srv on deployment-salt and made /var/lib a symlink to /srv/var-lib
- 15:45 bd808: /var on deployment-salt still at 97% full after moving logs; /var/lib is our problem
- 15:43 bd808: Archived deployment-salt:/var/log to /data/project/deployment-salt
- 15:40 bd808: Disabled puppet on deployment-salt to work on disk space issues
- 12:44 hashar: Updated labs/private.git on puppetmaster. Brings Brandon Black change "add labs copy of zerofetcher auth file" 137918
- 02:48 mwalker: added role::labs::lvm::biglogs to deployment-salt because it is out of room on /var and I don't know what I can delete
- 01:25 bd808: Live hacked /etc/apache2/wmf/hhvm.conf on apaches to allow them to start
- 00:30 bd808: `git stash`ed dirty dblist files found in /a/common on deployment-bastion
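A hedged reconstruction of the 15:40-16:30 disk-space fix above; the role change, symlinks, and reboot come from the log entries, while the exact copy/rename commands are assumptions:

```
# Puppet was disabled first (15:40 entry) so it would not fight the changes.
sudo puppet agent --disable

# Copy the data onto the larger /srv volume, then symlink it back.
sudo mkdir -p /srv/var-log /srv/var-lib
sudo rsync -a /var/log/ /srv/var-log/
sudo rsync -a /var/lib/ /srv/var-lib/
sudo mv /var/log /var/log.orig && sudo ln -s /srv/var-log /var/log
sudo mv /var/lib /var/lib.orig && sudo ln -s /srv/var-lib /var/lib

sudo reboot   # the host was rebooted afterwards (16:30 entry)
```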
June 5
- 14:16 manybubbles: rebuilt beta's jawiki search index without kuromoji - it didn't help much anyway
- 14:14 manybubbles: recovered from busted elasticsearch - two problems: 1. I had an index that used the kuromoji plugin but I'd uninstalled it, and 2. I had plugins for 1.2.1 but was trying to start 1.1.0. Solution: 1. delete the index and recreate it without kuromoji; 2. upgrade to 1.2.1 as I had planned to do anyway. (sketch at the end of this day's entries)
- 14:01 manybubbles: elasticsearch cluster got really angry in beta when I restarted some node - it's like they aren't talking to each other properly - trying to recover. Once that is done I'll upgrade to 1.2.1 and that might fix it
- 13:59 hashar: deployment-elastic01 puppet was broken due to bug 63322, i.e. having some HTML garbage as the ec2id, which would be used as the puppet certname
- 13:47 manybubbles: rolling restart of elasticsearch nodes in beta to pick up new kernel
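Given the 14:14 post-mortem above (plugins built for 1.2.1, server still starting 1.1.0), a quick version-skew check before restarting a node; paths assume the stock elasticsearch package layout:

```
# Version of the running server.
curl -s localhost:9200 | grep '"number"'

# Plugins installed on this host; their versions must match the server.
/usr/share/elasticsearch/bin/plugin --list
ls /usr/share/elasticsearch/plugins
```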
June 4
- 20:46 bd808: Fixed file ownership on /data/project/apache/uncommon for beta-recompile-math-texvc-eqiad job
- 19:27 manybubbles: sorry, can't do that yet
- 19:27 manybubbles: plugins deployed to beta - time to restart Elasticsearch in beta - should cause no interruption of service
- 19:01 manybubbles: deploying Elasticsearch 1.2.1 and some updated plugins to beta
- 17:11 bd808: Unwedged the jenkins jobs updating beta by stopping the stuck db update job
- 16:27 bd808: Changed uid/gid for files owned by l10nupdate user
- 09:50 mwalker: Reset salt caches by running `salt '*' state.clear_cache` from deployment-salt -- deployment-pdf01 now no longer reports errors when returning status for deployment
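A sketch expanding the 09:50 fix above; the `test.ping` verification is an assumption, not in the log:

```
# Clear cached state data on every minion...
sudo salt '*' state.clear_cache

# ...then confirm all minions still respond to the master.
sudo salt '*' test.ping
```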
June 3
- 22:30 bd808: Deleted unused /data/project/apache/common-local on NFS share.
June 2
- 19:42 bd808: Updated scap to a7da355
- 05:14 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-06-01T07:22:56
May 30
- 21:45 bd808: Restarted uwsgi on deployment-graphite
- 18:43 bd808: Updated scap to c4204dd
May 29
- 21:07 bd808: mwalker cleaned up log spam from upstart on deployment-pdf01
- 20:59 bd808: /var full on deployment-pdf01
- 20:55 bd808: Restarted salt minion on deployment-pdf01 with `sudo salt 'i-00000396.eqiad.wmflabs' service.restart salt-minion`
May 28
- 17:53 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-05-28T12:11:37
- 16:56 bd808: Updated scap to fd7e538
May 27
- 19:08 bd808: Updated scap to 48c7e28
- 14:56 bd808: Updated scap to 9609e8d
May 23
- 16:32 bd808: Upgraded elasticsearch to 1.1.0 on deployment-logstash1
- 13:36 manybubbles: restarting elasticsearch on deployment-elastic01 to pick up some gc setting recommended by elasticsearch team
May 22
- 23:00 bd808: Added 20after4 as a project admin
- 22:59 bd808: Added matanya as a project member
- 21:38 bd808|LUNCH: Deployed scap 096cb3f
May 21
- 17:33 mwalker: converted deployment-pdf01 (i-00000396.eqiad.wmflabs) to use local puppet & salt master
- 14:50 bd808: restarted logstash on deployment-logstash1; getting really tired of these soft crashes
- 00:33 bd808: Puppet failing on deployment-videoscaler01 with duplicate definition of Class[Mediawiki::Jobrunner]
- 00:07 bd808: Fixed puppet for deployment-jobrunner01 using https://gerrit.wikimedia.org/r/#/c/134519/2
May 20
- 23:49 bd808: Fixed puppet for deployment-apache[12] using https://gerrit.wikimedia.org/r/#/c/134519/2
- 23:11 bd808: deployment-apache01 needs more work: "Could not set shell on user[mwdeploy]"
- 23:06 bd808: Fixing puppet config for upstream rename of role::applicationserver -> role::mediawiki
- 21:14 ori: Converted deployment-stream to use local puppet & salt masters
- 21:08 RoanKattouw: chown'ed /data/project/parsoid/parsoid.log from mwalker (?!?) to parsoid so Parsoid runs again
- 15:53 bd808: Deployed scap 7b6fc47 via trebuchet
May 19
- 14:34 bd808: Restarted logstash service on deployment-logstash1; it stopped logging new events at 10:37:13Z
May 16
- 21:20 manybubbles: restarting elasticsearch in beta to update some plugins
- 00:34 bd808: Updated EventLogging to I89819bd
May 15
- 22:14 bd808: Restarted logstash on deployment-logstash1 yet again; memory leak from invalid encoding bug
- 00:14 bd808: Disabled puppet on deployment-logstash1 to test a local logstash config change
May 14
- 23:33 bd808: Added irc input to logstash via I409fec9
May 13
- 09:28 bd808: Restarted logstash service on deployment-logstash1
- 09:28 bd808: Logstash events stop at 2014-05-11T18:36:35Z; Log file shows many "Failed parsing date from field" errors which probably triggered the known upstream memory leak bug
May 10
- 18:02 bd808: Restarted logstash on deployment-logstash1
May 6
- 17:54 bd808: Restarted logstash on deployment-logstash1
- 17:53 bd808: Logstash in beta hasn't recorded any events since 2014-05-04T04:32:36.
- 15:33 manybubbles: rolling restart of Elasticsearch servers in beta to pick up a new highlighter plugin to fix bugs found when we fixed Hebrew analysis, and to implement phrase highlighting.
May 5
- 21:29 mwalker: ran puppetstoredconfigclean and revoked puppet and salt keys for i-00000339.eqiad.wmflabs (was pdf01)
- 21:24 mwalker: removing pdf01 instance -- labs just uses production mwlib which works just fine. I'll recreate this when I make the OCG test instance
- 20:57 manybubbles: deploying new plugin to Elasticsearch (swift)
May 3
- 18:10 mwalker: Updated kernel on deployment-pdf01 (manually set console=ttyS0 to match older installed kernels)
- 17:58 mwalker: Converted i-00000339.eqiad.wmflabs (deployment-pdf01) to use local puppet & salt masters
- 17:54 mwalker: signed salt key for i-00000339.eqiad.wmflabs (deployment-pdf01)
- 17:43 bd808: Added mwalker to under_NDA sudoers group
May 2
- 17:01 bd808: Switched scap to use scripts delivered by trebuchet
May 1
- 15:46 manybubbles: upgrading Elasticsearch highlighter via a rolling restart
- 00:56 bd808: Fixed empty PrivateSettings.php configuration file (which I also broke earlier)
April 28
- 16:12 manybubbles: upgrading highlighter plugin in Elasticsearch
- 15:43 bd808: Created empty /srv/scap-stage-dir/wmf-config/mwblocker.log file to stop missing file warnings in beta.
April 25
- 11:31 hashar: commonswiki-75388f96: 0.6183 19.5M SQL ERROR (ignored): Table 'commonswiki.revtag_type' doesn't exist (10.68.16.193)
- 11:30 hashar: Authentication is broken on the beta cluster. Well at least from commons.wikimedia.beta.wmflabs.org
April 23
- 19:34 ^demon|lunch: created zhwiki, ukwiki, ruwiki, kowiki, hiwiki, jawiki for testing
- 10:19 hashar: stopping udp2log and starting udp2log-mw instead (known old bug that prevents logging)
April 22
- 18:42 bd808: Rebooting deployment-bastion in a wild attempt to get the jenkins slave there working again
April 18
- 19:24 manybubbles: rebuilding Cirrus indexes to pick up auxiliary fields and smarter accent matching
April 16
- 18:56 hashar: Migrating memc04 and memc05 to self master/salt [[bugzilla:64010|bug 64010]]
- 13:13 manybubbles: done
- 13:10 manybubbles: rolling restart of Elasticsearch nodes in beta to make super sure it picked up new plugins
- 09:33 hashar: rebased puppetmaster
April 15
- 20:02 manybubbles: restarting elasticsearch in beta to pick up a plugin update - no downtime should occur
- 14:24 hashar: rebased puppetmaster
April 11
- 17:41 bd808: Tried to enable role::protoproxy::ssl::beta on deployment-cache-text02 but it failed to apply because /etc/ssl/certs/star.wmflabs.org.pem and /etc/ssl/private/star.wmflabs.org.key don't match.
- 03:59 bd808: sudo apt-get install mysql-client on deployment-bastion
- 03:54 bd808: Added legoktm as a project member
- 00:02 bd808: Enabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/
April 10
- 21:35 bd808: Running scap on deployment-bastion for the first time in eqiad
- 21:13 bd808: Disabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ to work on scap setup
- 14:52 hashar: Adding Tobias Gritschacher to the project so he can look at udp2log / apache logs whenever needed :-]
April 9
- 23:04 bd808: Re-enabled puppet on deployment-apache02 and forced a puppet run
- 21:39 bd808: Cherry-picked I8f77e0c into puppet and forced puppet run on deployment-bastion
April 8
- 17:53 manybubbles: rebuilding simplewiki's search index optimized for the new highlighter to check the size difference
- 05:34 Ryan_Lane: upgraded libssl on all nodes, restarted affected ssl servers
- 05:03 Ryan_Lane: upgraded libssl on all salt accessible nodes
April 5
- 11:19 hashar: Attempting to reenable SSL support with 124057
April 4
- 21:39 bd808: Restarted logstash; it stopped processing events again at 2014-04-04T19:56:46Z
- 17:31 bd808: Forced puppet run on deployment-cache-text02
- 17:29 bd808: Manually fixed puppet config on deployment-cache-text02 (the cert html error problem)
- 17:22 bd808: Rebooting deployment-cache-bits01
- 17:21 bd808: Forced puppet run on deployment-cache-bits01
- 16:15 manybubbles: Performing a rolling restart of Elasticsearch nodes to pick up a new plugin
April 3
- 17:32 bd808: Fixed certname in /etc/puppet/puppet.conf manually on deployment-bastion so puppet would run again.
- 15:33 bd808: Restarted logstash on deployment-logstash1; stuck in a bad state due to jvm oom logged at 2014-04-03T12:03:43Z
April 2
- 17:54 manybubbles: done installing plugins on Elasticsearch in beta
- 14:10 hashar: Fixed database updating job https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/. It was not running on the proper node.
- 12:50 hashar: restarted parsoid daemon on deployment-parsoid04.eqiad.wmflabs. It also now logs to /data/project/parsoid/parsoid.log
- 12:36 hashar: Manually deleting parsoid user/group on deployment-parsoid04. Will use the LDAP uid/gid instead.
April 1
- 21:38 hashar: Removed the Zuul triggers that updated beta cluster in PMTPA 123100.
- 19:49 bd808: Converted deployment-graphite.eqiad.wmflabs to use local puppet & salt masters
- 19:20 bd808: Deleting and re-creating deployment-graphite because I forgot to add the web security group
- 15:57 andrewbogott: shutting down all pmtpa instances
- 14:32 manybubbles: completed upgrade to Elasticsearch 1.1.0 and fixed deployment-elastic04.
- 13:32 hashar: Thumbs access more or less fixed
- 13:31 hashar: deployment-upload is rejecting connections on port 80. Applying role::beta::uploadservice from 122786
- 13:30 manybubbles: upgrading labs Elasticsearch to 1.1.0
- 13:06 hashar: Applying role::beta::natfix on deployment-upload.eqiad.wmflabs . Might let it access images from commons.wikimedia.beta.wmflabs.org ( ex: http://upload.beta.wmflabs.org/wikipedia/commons/thumb/4/43/Feed-icon.svg/16px-Feed-icon.svg.png yields: Error retrieving thumbnail from scaling server: couldn't connect to host commons.wikimedia.beta.wmflabs.org )
- 08:31 hashar: MediaWiki config paths tweaks for Math [[bugzilla:63331|bug 63331]] and Captchas [[bugzilla:63342|bug 63342]]
- 00:32 bd808: Converted deployment-graphite to use local puppet & salt masters
March 31
- 21:02 hashar: Making the Parsoid daemon write its logs to /data/project/parsoid/parsoid.log 122561
- 20:47 hashar: Puppet master is fixed. The certificates got badly messed up, had to regenerate them following the documentation "Regenerate Certificates for Puppet Master"
- 20:17 hashar: restarted parsoid daemon
- 20:00 hashar: stopped parsoid. It was killing the application servers
- 19:53 hashar: restarting both apaches
- 19:21 hashar: restarting job service on jobrunner01 to apply 122436
- 19:20 hashar: Unbreak puppetmaster on deployment-salt.eqiad.wmflabs
- 19:01 hashar: puppet master is broken :(
- 17:39 hashar: lowering # of jobs spawned by the jobrunner 122436
- 16:00 bd808: Restarted logstash service on deployment-logstash1; no new log events seen since 2014-03-28T10:57
- 15:58 bd808: Updated kibana on deployment-logstash1 to e317bc6
- 15:56 hashar_: Cluster slow because some CirrusSearch job is spamming simplewiki. Gotta find a way to throttle the number of jobs being run on jobrunner01 or add more apache boxes. It is transient anyway, might look at limiting the runs tonight
- 15:10 hashar_: Rebased puppet repository. Only one hack left: https://gerrit.wikimedia.org/r/#/c/119534/
- 14:20 hashar: deleting deployment-parsoidcache01 cache the hard way: stopping varnish, deleting files in /srv/vdb/, starting varnish
- 14:05 hashar: shutting down database and apache boxes for now.
- 14:03 hashar: shutting down varnish instances in pmtpa
- 13:56 hashar: Deleted deployment-cache-upload01 , replaced by deployment-cache-upload02
- 13:52 hashar: upload varnish cache working :-]
- 13:47 hashar: applying role::cache::upload to deployment-cache-upload02
- 13:37 hashar: migrating deployment-cache-upload02.eqiad.wmflabs to self puppet/salt master
- 13:22 hashar: Creating deployment-cache-upload02 to replace deployment-cache-upload01 which was missing the security group "web"
- 11:30 hashar: Update DNS entries to point to EQIAD instances (aka switching beta cluster to eqiad)
March 28
- 16:18 hashar: rebased puppet on deployment-salt
- 15:39 hashar: Last log made to wrong project
- 15:39 hashar: deleting instance integration-selenium-driver, no longer needed. browsertests jobs should now be runnable on integration-slave1001 and integration-slave1002 (in eqiad)
- 10:54 hashar: deleting instance integration-debian-builder. That is breaking all debian-glue jobs. Will revisit later next week to get pbuilder/cowbuilder set up on the other eqiad slaves
- 08:48 hashar: deleting integration-slave-pbuilder. Unneeded (I need a coffee)
- 08:43 hashar: Created integration-slave-pbuilder on eqiad to replace pmtpa instance integration-debian-builder
- 00:23 bd808: `sudo chmod -R a+rwx /data/project/upload7`; We need to get this file permissions thing figured out
March 27
- 15:23 hashar: role::beta::natfix cant run on deployment-bastion.eqiad because the ferm rules conflicts with the Augeas rules coming from udp2log :-(
- 15:21 hashar: applying role::beta::natfix on deployment-bastion.eqiad
- 14:58 hashar: fixed up role::beta::natfix. Ferm is now being applied again on various application server instances (121378)
- 13:58 hashar: rebased puppetmaster git repository, reapplied ottomata live hacks.
- 12:55 hashar: mediawiki l10n cache being rebuilt!!!
- 12:54 hashar: Fixed permissions on eqiad bastion for /srv/scap. Others (such as mwdeploy) could not read / execute the scap scripts (sketch below)
- 11:29 hashar: MediaWiki code and configuration are now self updating on EQIAD cluster via Jenkins jobs. First run: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/4/console
- 11:11 hashar: deleting job beta-code-update , replaced by datacenter variants beta-code-update-pmtpa and beta-code-update-eqiad
- 10:54 hashar: Deleting job beta-update-databases , replaced by datacenter variants beta-update-databases-pmtpa and beta-update-databases-eqiad
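(The /srv/scap permission fix from the 12:54 entry above was presumably something along these lines; the exact mode was not logged, so the o+rX is an assumption:)

```bash
# Let "others" (e.g. mwdeploy) read everything and traverse/execute what is
# executable; capital X only affects directories and files that already
# carry an execute bit, so plain data files stay non-executable.
sudo chmod -R o+rX /srv/scap
```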
March 26
- 19:05 bd808: Added ottomata as a project member and admin
- 15:46 springle: deployment-db1 data loaded
- 14:45 bd808: created proxy https://logstash-beta.wmflabs.org for logstash instance
- 14:17 hashar: fixed up redis configuration in eqiad. Jobrunner is happy now: aawiki-504cd7d2: 0.9649 21.5M Creating a new RedisConnectionPool instance with id 627014d. 121060
- 14:05 hashar: udp2log functional on eqiad beta cluster \O/
- 13:55 hashar: stopping udp2log on eqiad bastion, starting udp2log-mw (really should fix that issue one day)
- 13:52 hashar: dropped some live hacks on eqiad in /data/project/apache/common-local and ran git pull
- 13:14 hashar: Dropping enwikivoyage and dewikivoyage databases from sql02. Related changes are updating the Jenkins config: https://gerrit.wikimedia.org/r/#/c/121045/ and cleaning up the mw-config : https://gerrit.wikimedia.org/r/#/c/121047/
- 07:53 springle: installed mariadb via puppet on deployment-db1. no data yet
March 25
- 19:43 hashar: created jenkins slave deployment-bastion.eqiad
- 17:17 hashar: Created and validated job that updates Parsoid on the EQIAD beta cluster \O/
March 24
- 23:16 marktraceur: Touching all the MMV scripts because they're not getting invalidated or something
- 23:10 hashar: l10n cache got broken due to a PHP fatal error I introduced. It is back up now. Found out via https://integration.wikimedia.org/dashboard/
- 23:09 hashar: upgraded all pmtpa varnishes, ran puppet on all of them. all set!
- 22:57 hashar: restarting deployment-cache-upload04, apparently stalled
- 22:48 hashar: upgrading varnish on all pmtpa caches.
- 22:47 hashar: apt-get upgrade varnish on deployment-cache-bits03
- 22:45 marktraceur: attempted restart of varnish on betalabs; seems to have failed, trying again
- 22:42 hashar: made marktraceur a project admin and granted sudo rights
- 22:39 marktraceur: Restarting betalabs varnish to workaround https://bugzilla.wikimedia.org/show_bug.cgi?id=63034
- 17:25 bd808: Converted deployment-db1.eqiad.wmflabs to use local puppet & salt masters
- 17:06 bd808: Changed rules in sql security group to use CIDR 10.0.0.0/8.
- 17:05 bd808: Changed rules in search security group to use CIDR 10.0.0.0/8.
- 17:05 bd808: Built deployment-elastic04.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
- 16:19 bd808: Built deployment-elastic03.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
- 16:08 bd808: Built deployment-elastic02.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
- 15:54 bd808: Built deployment-elastic01.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
- 10:31 hashar: migrated deployment-solr to self puppet/salt masters
March 21
- 09:29 hashar: l10ncache is now rebuilt properly: https://integration.wikimedia.org/ci/job/beta-code-update/53508/console
- 09:23 hashar: fixing l10ncache on deployment-bastion: ran chown -R l10nupdate:l10nupdate /data/project/apache/common-local/php-master/cache/l10n (the l10nupdate UID/GID have been changed and are now in LDAP)
March 20
- 23:46 bd808: Mounted secondary disk as /var/lib/elasticsearch on deployment-logstash1
- 23:46 bd808: Converted deployment-tin to use local puppet & salt masters
- 22:09 hashar: Migrated videoscaler01 to use self salt/puppet masters.
- 21:30 hashar: manually installing timidity-daemon on jobrunner01.eqiad so puppet can stop it and stop whining
- 21:00 hashar: migrate jobrunner01.eqiad.wmflabs to self puppet/salt masters
- 20:55 hashar: deleting deployment-jobrunner02, let's start with a single instance for now
- 20:51 hashar: Creating deployment-jobrunner01 and 02 in eqiad.
- 15:47 hashar: fixed salt-minion service on deployment-cache-upload01 and deployment-cache-mobile03 by deleting /etc/salt/pki/minion/minion_master.pub (sketch below)
- 15:30 hashar: migrated deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs to use the salt/puppetmaster deployment-salt.eqiad.wmflabs.
- 15:30 hashar: deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs recovered!! /dev/vdb does not exist on eqiad, which caused the instances to stall.
- 10:48 hashar: Stopped the simplewiki script. Would need to recreate the db from scratch instead
- 10:37 hashar: Cleaning up simplewiki by deleting most pages in the main namespace. Would free up some disk space. deleteBatch.php is running in a screen on deployment-bastion.pmtpa.wmflabs
- 10:08 hashar: applying role::labs::lvm::mnt on deployment-db1 to provide additional disk space on /mnt
- 09:39 hashar: convert all remaining hosts but db1 to use the local puppet and salt masters
- 04:40 springle: created deployment-db1 for mariadb master in eqiad
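(The salt fix from the 15:47 entry above generalises: the minion caches the master's public key, so after an instance is pointed at a new master the stale key makes authentication fail. A sketch, assuming the stock init scripts:)

```bash
sudo service salt-minion stop
sudo rm /etc/salt/pki/minion/minion_master.pub   # stale key from the old master
sudo service salt-minion start                   # minion re-fetches the new master's key
```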
March 19
- 21:23 bd808: Converted deployment-cache-text02 to use local puppet & salt masters
- 20:21 hashar: migrating eqiad varnish caches to use xfs
- 17:58 bd808: Converted deployment-parsoid04 to use local puppet & salt masters
- 17:51 bd808: Converted deployment-eventlogging02 to use local puppet & salt masters
- 17:22 bd808: Converted deployment-cache-bits01 to use local puppet & salt masters; puppet:///volatile/GeoIP not found on deployment-salt puppetmaster
- 17:00 bd808: Converted deployment-apache02 to use local puppet & salt masters
- 16:49 bd808: Converted deployment-apache01 to use local puppet & salt masters
- 16:30 hashar: Varnish caches in eqiad are failing puppet because there is no /dev/vdb. Will figure it out tomorrow :-]
- 16:15 hashar: Applying role::logging::mediawiki::errors on deployment-fluoride.eqiad.wmflabs. It is not receiving anything yet though.
- 15:50 hashar: fixed the udp2log-mw daemon not starting on the eqiad bastion (/var/log/udp2log belonged to the wrong UID/GID; sketch below)
- 15:49 hashar: deleted local user l10nupdate on deployment-bastion. It is in ldap now.
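(For the 15:50 udp2log-mw fix, a sketch of the repair; the assumption that the daemon runs as a udp2log user is mine, the exact ownership used was not logged:)

```bash
sudo chown -R udp2log:udp2log /var/log/udp2log   # re-own after the UID/GID change
sudo /etc/init.d/udp2log-mw start
```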
March 18
- 03:31 bd808: deployment-bastion now using deployment-salt as puppet master
March 17
- 15:02 hashar: Started copying /data/project from pmtpa to eqiad
- 14:46 hashar: manually purging all commonswiki archived files (on beta of course)
March 14
- 14:47 hashar: changing uid/gid of mwdeploy, which is now provisioned via LDAP (aka deleting the local user and group on all instances + file permission tweaks)
March 11
- 10:46 hashar: dropping some unused databases from deployment-sql instance.
March 10
- 11:09 hashar: Deleting http://simple.wikipedia.beta.wmflabs.org/wiki/MediaWiki:Robots.txt
- 09:54 hashar: Reducing memcached instances to 3GB (115617). Seems to fix writing to the EQIAD memcaches, which only have 3GB
- 09:08 hashar: Restarted bits cache (CPU / mem overload)
March 6
- 09:07 hashar: restarted varnish and varnish-frontend on deployment-cache-text1
March 5
- 17:26 hashar: hacked mwversioninuse to return "master=aawiki". Relaunched the l10n job using the mwdeploy user and then ran mw-update-l10n
- 17:07 hashar: mwversioninuse gives a wmf branch instead of master. That breaks l10n messages update and the job https://integration.wikimedia.org/ci/job/beta-code-update/ . Root cause is the python based scap.
March 3
- 17:28 manybubbles: doing an Elasticsearch reindex on beta before I try another one in production
February 28
- 10:17 hashar: Puppet running on varnish upload cache after several months. Might break random things in the process :(
February 27
- 14:11 manybubbles: upgrading beta to Elasticsearch 1.0
February 26
- 20:44 hashar: Cleaning up commonswiki archived files with mwscript deleteArchivedFiles.php --wiki=commonswiki --delete
- 20:44 hashar: deleted all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload (gwtoolset import test). Deleted File:Title_0* (Selenium tests).
- 15:06 hashar: deleted all thumbs from the shared directory: /data/project/upload7/*/*/thumb/* (sketch below)
- 14:54 hashar: cleaning out 2013 archived logs.
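(For the 15:06 thumb purge: a glob that wide can blow past the shell's argument-length limit; a find-based equivalent is safer. Sketch:)

```bash
# Same effect as: rm -rf /data/project/upload7/*/*/thumb/*
# but immune to "argument list too long"
find /data/project/upload7 -mindepth 4 -maxdepth 4 \
    -path '/data/project/upload7/*/*/thumb/*' -exec rm -rf {} +
```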
February 25
- 08:42 hashar: Upgrading all varnishes.
February 24
- 23:36 MaxSem: Rolled back
- 23:25 hoo: recursively chowned extensions/MobileFrontend to mwdeploy:mwdeploy
- 23:21 hoo: chowned /data/project/apache/common-local/php-master/extensions/.git/modules/MobileFrontend/* to mwdeploy:mwdeploy
- 17:47 MaxSem: Investigating a mobile bug, might cause intermittent problems
- 17:36 MaxSem: Rebooted deployment-cache-mobile01 - was impossible to log into it though Varnish still worked
February 21
- 19:42 MaxSem: Adjusted read privs on /home/wikipedia/syslog/apache.log to allow fatalmonitor to work
February 19
- 16:24 hashar: -bastion: /etc/init.d/udp2log stop && /etc/init.d/udp2log-mw start (known bug)
- 16:23 hashar: rebooting -bastion
- 16:22 hashar: rebooting apache32 and apache33, breaking beta :-]
February 17
- 15:26 hashar: rebooting bits cache
February 11
- 21:55 manybubbles: update elasticsearch schema after recent changes. will run a links update as well
February 6
- 22:20 Krinkle: Manually ran changePassword.php to help someone (password reminder emails don't get sent)
- 14:43 hashar: restarting udp2log-mw on deployment-bastion; logstash.wmflabs.org had not been receiving fatal logs since Jan 31st
February 4
- 17:22 hashar: fixed up beta-parsoid-update job so Parsoid should be up to date again. The issue is that the multigit job pointed to a wrong host (ZUUL_URL should be zuul.eqiad.wmnet)
- 13:33 hashar: removing role::memcached from both apache servers
- 09:58 hashar: rebooting all varnish caches
- 09:57 hashar: Upgrading all varnish
February 3
- 16:59 hashar: upgrading varnish on deployment-parsoidcache3
January 30
- 19:35 hashar: restarted gmond on deployment-cache-bits03 (it was leaking memory). Upgrading varnish
- 19:32 hashar: Canceled the varnish package upgrade on deployment-cache-mobile01; it runs a specific version (3.0.5plus~wmftest-wm1) instead of 3.0.3plus~rc1-wm29 (see the package-hold sketch below)
- 19:30 hashar: upgrading varnish on deployment-cache-mobile01
- 19:29 hashar: upgrading varnish on deployment-cache-bits03
- 19:29 hashar: upgrading varnish on deployment-staging-cache-mobile02
- 19:28 hashar: upgrading varnish on deployment-cache-upload04
- 19:27 hashar: reenabling puppet on deployment-cache-mobile01
- 17:10 manybubbles: done reindexing beta. everything looks good
- 16:54 manybubbles: reindexing beta like we're going to do in production when the release train departs later today
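(Re the 19:32 entry: to keep a blanket apt-get upgrade from clobbering the hand-pinned varnish build on deployment-cache-mobile01, the package can be put on hold. Sketch:)

```bash
# Mark varnish as held so "apt-get upgrade" skips it on this instance
echo 'varnish hold' | sudo dpkg --set-selections
dpkg --get-selections varnish   # verify: should print "varnish ... hold"
```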
January 28
- 17:10 hashar: added addshore and jhall to project so they can grep logs
January 27
- 15:17 hashar: applying role::beta::fatalmonitor puppet class on deployment-bastion (bug 60046)
January 23
- 19:38 hashar: VisualEditor was not being updated properly because some files belonged to root instead of mwdeploy. Ran chown -R mwdeploy:mwdeploy /data/project/apache/common-local/php-master/extensions/VisualEditor
January 16
- 20:54 manybubbles: turning on elasticsearch's disk-space-aware allocator
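(In 0.90.x-era Elasticsearch the disk-space-aware allocator is a dynamic cluster setting; a sketch of turning it on, with illustrative watermark values rather than the ones actually used:)

```bash
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "0.85",
    "cluster.routing.allocation.disk.watermark.high": "0.90"
  }
}'
```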
January 15
- 21:14 manybubbles: finished updating to elasticsearch 0.90.10
- 08:48 andrewbogott: rebooted deployment-cache-text1
January 2
- 15:32 hashar: Migrated parsoid on deployment-parsoid2 to use mediawiki/services/parsoid out of checkouts made in /srv/deployment/parsoid/{parsoid,deploy}. No job self-updating it yet
- 15:00 manybubbles: finished upgrading Elasticsearch in beta. We're on 0.90.9 now.
- 14:07 hashar: running mw-update-l10n , it was broken because of https://gerrit.wikimedia.org/r/#/c/104741/ fixed up by https://gerrit.wikimedia.org/r/#/c/104953/
- 13:54 manybubbles: upgrading Elasticsearch servers in beta
December 26
- 18:54 manybubbles: performing in place index rebuild for wikis in beta after recent cirrus update
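(An in-place rebuild here refers to CirrusSearch's index maintenance script; a sketch of the per-wiki invocation from that era, with the flags to be taken as an assumption rather than a record of what was run:)

```bash
# Rebuild the index in place and swap it in once populated (per wiki)
mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php \
    --wiki=enwiki --reindexAndRemoveOk --indexIdentifier now
```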
December 23
- 20:40 anomie: Restarting mw-job-runner service on deployment-jobrunner08, since jobs don't seem to be running
- 20:03 anomie: Restarting apache on deployment-apache33 to see if that clears the odd errors going on
December 18
- 10:56 hashar: reenabling puppet on parsoid2 and deploying the new Parsoid upstart configuration 99656