Nova Resource:Deployment-prep/SAL
2015-12-02
- 00:31 tgr: updated rsvg on appserver to 2.40.11 - https://phabricator.wikimedia.org/T112421
2015-11-04
- 00:06 Krenair: Synchronized portals: https://gerrit.wikimedia.org/r/#/c/250851/
2015-10-09
- 21:51 ori: Accidentally clobbered /etc/init.d/mysql on deployment-db1, causing deployment-prep failures. Restored now.
2015-09-16
- 20:39 cscott: updated OCG to version 4032a596ce6eb442b02cc6ee9b79263b1eb23275
2015-09-14
- 19:18 cscott: updated OCG to version 5811056e28f2bc6408b6da96095352ab381bb11f
- 12:04 dcausse: restarting elasticsearch (deployment-elastic0[5-8]) to deploy new plugins
2015-08-25
- 14:42 andrewbogott: moving deployment-cache-mobile04 to labvirt1004
2015-08-12
- 20:45 urandom: restarted restbase on deployment-restbase01 (dead)
2015-08-05
- 14:33 godog: update deployment-restbase02 to openjdk8 T104887
- 14:18 godog: update deployment-restbase01 to openjdk8 T104887
June 29
- 13:17 dcausse: restarting Elasticsearch to pick up new plugin versions
June 23
- 13:31 cscott: fixed salt on deployment-pdf02, restarted OCG there.
- 05:44 cscott: stopped OCG service on deployment-pdf02, see https://phabricator.wikimedia.org/T103473
- 05:20 cscott: updated OCG to version d7c698d5bf730d34057945e912ac75dc542dd788 ; restarted service.
- 03:58 cscott: stopped OCG on beta; redis 2.8.x is causing the service to crash on startup.
June 22
- 21:58 andrewbogott: re-enabling puppet on deployment-videoscaler01 because no reason was given for disabling
- 20:42 cscott: updated OCG to version b482144f5bd8b427bcc64a3dd287247195aa1951
June 4
- 20:29 ori: upgrading hhvm-fss from 1.1.4 to 1.1.5, has fix for T101395
May 29
- 14:07 moritzm: upgrade java on deployment-restbase0[12] to the 7u79 security update
May 28
- 08:46 godog: test es-tool restart-fast on deployment-elastic05
May 27
- 21:15 AaronSchulz: populated jobqueue:aggregator:s-wikis:v2 with 1000 fake wiki keys for load testing
- 21:07 AaronSchulz: Deployed https://gerrit.wikimedia.org/r/#/c/208852/
- 21:07 AaronSchulz: Deleted 4G of logs on jobrunner01
May 24
- 18:39 YuviKTM: purged old logs kept on NFS
May 20
- 20:58 cscott: updated OCG to version ca4f64852de5b1de782b292b50038fbd2dd84266
May 18
- 15:17 andrewbogott: rebooting deployment-logstash1
May 15
- 20:50 andrewbogott: rebooted deployment-bastion due to inconsistent run state after suspend/resume
May 13
- 21:08 cscott: updated OCG to version c7c75e5b03ad9096571dc6dbfcb7022c924ccb4f
May 2
- 00:51 yuvipanda: created deployment-boomboom to test
April 29
- 21:03 andrewbogott: suspending and shrinking disks of many instances
April 28
- 20:57 YuviPanda: KILL KILL KILL DEPLOYMENT-LUCID-SALT WITH FIRE AND BRIMSTONE AND BAD THINGS
April 27
- 08:01 _joe_: installed hhvm 3.6 on deployment-mediawiki02
April 24
- 14:25 _joe_: installing hhvm 3.6.1 on deployment-mediawiki01
April 23
- 17:19 andrewbogott: rebooting deployment-parsoidcache02 because it seems troubled
April 22
- 12:48 andrewbogott: migrating to new labvirt nodes
April 21
- 08:33 _joe_: rollback installation of hhvm 3.6
- 08:09 _joe_: installing HHVM 3.6 and the corresponding extensions on deployment-mediawiki01
April 9
- 20:11 mutante: fixed apt sources lists on deployment-bastion (T95541)
March 30
- 22:33 Josve05a: manually start mysql on db1 and db2
- 21:57 YuviPanda: reboot all instances from virt1000
March 23
- 20:41 cscott: updated OCG to version 11f096b6e45ef183826721f5c6b0f933a387b1bb
March 18
- 13:45 mobrovac: added restbase security group
- 13:35 YuviPanda: made mobrovac projectadmin
- 13:34 YuviPanda: added mobrovac to project
March 16
- 18:46 manybubbles: upgraded Elasticsearch on deployment-logstash1
March 11
- 18:47 YuviPanda: created deployment-mediawiki03
February 27
- 11:12 YuviPanda: start mysql on deployment-db1
February 26
- 11:53 YuviPanda: created deployment-parsoid01-test to test patch to use role::parsoid on labs
February 18
- 13:04 _joe_: installed new version of the hhvm extensions packages
February 17
- 23:18 Krenair: Started mysql on deployment-db1; beta now appears much less broken than before
February 6
- 20:07 ^d: scratch that, I rebuilt it as precise. why did I do that?
- 20:03 ^d: rebuilt deployment-elastic05 with new partition scheme
February 5
- 12:48 YuviPanda: cherry-picking https://gerrit.wikimedia.org/r/188798 on scap on deployment-prep
- 12:28 YuviPanda: killed chown on deployment-bastion, running directly on NFS server
- 12:13 YuviPanda: running `time sudo chown -R www-data:www-data upload7/` on /data/project
- 12:10 YuviPanda: stopped jobrunner on jobrunner01
- 11:53 YuviPanda: running git-sync-upstream on deployment-salt to pick up latest ops/puppet changes
- 11:52 _joe_: converting the web user to www-data
- 11:44 YuviPanda: deleted mediawiki03 instance, holdover from security testing from long, long ago
- 11:41 YuviPanda: disabled puppet on mediawiki01, 02, jobrunner01, bastion and salt
February 4
- 13:56 YuviPanda: created deployment-jobrunner01, trusty instance
- 13:51 YuviPanda: deleted deployment-jobrunner01, trusty version coming up
- 11:35 YuviPanda: created instance deployment-mediawiki02
- 11:26 YuviPanda: deleted instance deployment-mediawiki02
- 06:37 YuviPanda: created deployment-mediawiki01 host
- 06:34 YuviPanda: killed deployment-mediawiki01 host. FOREEVERRR
February 2
- 13:37 yuvipanda: added MX record to beta.wmflabs.org, for https://phabricator.wikimedia.org/T88215 via LDAP
January 27
- 18:15 andrewbogott: upgrading libc6 on all instances from deployment-salt
January 20
- 02:30 YuviPanda: created deployment-mediawiki04 to test roles
January 7
- 16:25 YuviPanda: added milimetric to NDA sudoers group
December 29
- 22:24 MaxSem: Created a DNS entry for m.wikidata.beta.wmflabs.org
December 22
- 12:40 _joe_: upgrading HHVM to the latest version
December 16
- 16:52 manybubbles: elasticsearch restart finished
- 16:48 mutante: deployment-db2 is down
- 16:48 manybubbles: restarting beta's elasticsearch servers to pick up a new version of a plugin. won't interfere with current downtime.
December 13
- 17:10 bd808: Many strange puppet and scap failures in beta that look to be related to DNS failures
- 16:03 bd808: Starting work on phab:T78076 to renumber apache users in beta
December 11
- 22:47 cscott: updated OCG to version bfc3812ef346c9f767135b339cedd123a1bcac98
December 6
- 05:05 ori: upgrade hhvm-tidy to 0.1-2
December 3
- 21:33 cscott: updated OCG to version 08e94b19c3f17e699d7e53d9605f65c58e17ea0e
December 2
- 17:09 _joe_: upgrading HHVM to its latest version
- 17:08 andrewbogott: this is a test message
December 1
- 21:50 cscott-split: updated OCG to version a06e7c186796a6ee5d5af81e93688520abdf2596
November 26
- 20:47 cscott: updated OCG to version 7d8f2b8bd496464041e3ef9c092732457cc8f7ef
November 24
- 15:16 YuviPanda: modified local hack to account for 47dcefb74dd4faf8afb6880ec554c7e087aa947b
- 14:58 YuviPanda: cherry-picked 3e45c538978710113e6e28e9d533bf8d18c159a6 and 9d4614a8a352c78505212fd6e9d2a7be6d2e4927 to deployment-salt puppetmaster, restoring local hacks
November 19
- 21:19 anomie: Cherry-picked https://gerrit.wikimedia.org/r/#/c/173336/3 to Beta
November 17
- 20:37 YuviPanda: cleaned out logs on deployment-bastion
- 16:48 YuviPanda: delete deployment-analytics01, a tortoise from an ancient time.
- 05:17 YuviPanda: forced `apt-get install -f` to unstick puppet
- 04:49 YuviPanda: cleaned up coredump on deployment-prep
November 16
- 00:38 YuviPanda: uncherrypick https://gerrit.wikimedia.org/r/#/c/173634/ because OMG CODE
- 00:14 YuviPanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/173634/ on deployment-salt
- 00:01 YuviPanda: cherry-pick https://gerrit.wikimedia.org/r/#/c/173510/ on deployment-prep to make memc03 run puppet
November 14
- 20:02 anomie: Cherry-picking https://gerrit.wikimedia.org/r/#/c/173336/ for testing in logstash
November 13
- 10:11 YuviPanda: cherry pick https://gerrit.wikimedia.org/r/#/c/172967/1 to test https://bugzilla.wikimedia.org/show_bug.cgi?id=73263
November 12
- 18:16 YuviPanda: cherry picking https://gerrit.wikimedia.org/r/#/c/172776/ on labs puppetmaster to see if it fixes issues in the cache machines
November 11
- 17:13 cscott: removed old ocg cronjobs on deployment-pdf0x; see https://bugzilla.wikimedia.org/show_bug.cgi?id=73166
November 10
- 22:37 cscott: rsync'ed .git from pdf01 to pdf02 to resolve git-deploy issues on pdf02 (git fsck on pdf02 reported lots of errors)
- 21:41 cscott: updated OCG to version d9855961b18f550f62c0b20da70f95847a215805 (skipping deployment-pdf02)
- 21:39 cscott: deployment-pdf02 is not responding to git-deploy for OCG
November 5
- 06:14 ori: restarted hhvm on beta app servers
November 3
- 22:07 cscott: updated OCG to version 5834af97ae80382f3368dc61b9d119cef0fe129b
October 29
- 18:55 ori: upgraded hhvm on beta labs to 3.3.0+dfsg1-1+wm1
October 28
- 23:47 RoanKattouw: ...which was a no-op
- 23:46 RoanKattouw: Updating puppet repo on deployment-salt puppet master
- 21:36 RoanKattouw: Creating deployment-parsoid05 as a replacement for the totally broken deployment-parsoid04 (also as a trusty instance rather than precise)
- 21:06 RoanKattouw: Rebooting deployment-parsoid04, wasn't responding to ssh
October 27
- 20:23 cscott: updated OCG to version 60b15d9985f881aadaa5fdf7c945298c3d7ebeac
October 22
- 21:10 arlolra: updated OCG to version e977e2c8ecacea2b4dee837933cc2ffdc6b214cb
October 8
- 22:04 subbu: updated OCG to version def24eca
October 7
- 22:50 cscott: updated OCG to version c778ea8b898f8ad8c2b7ad9de78a75469e7ed061
October 6
- 23:13 YuviPanda: killed extra log files in deployment-bastion
- 21:44 cscott: updated OCG to version bbdf4c6400cfbbc6030114ad16e1a6f7025eab2c
- 15:36 cscott: updated OCG to version aee3712b352f51f96569de0bcccf3facf654e688
October 3
- 19:51 manybubbles: performing rolling restart of elasticsearch nodes to pick up preview of accelerated regex plugin for testing at larger-than-mylaptop-scale
- 14:02 manybubbles: rebuilding beta's simplewiki cirrus *index*
October 1
- 20:13 cscott: updated OCG to version 48c495e3656f528abe636ce0cd7562270505534f
- 16:40 bd808: Added Gilles to under_NDA sudoers group
September 30
- 22:00 bd808: Cleaned deleted instances out of salt and trebuchet redis
- 20:26 bd808: Converted deployment-rsync02 to use local puppet & salt masters
- 15:36 bd808: enabling puppet and forcing run on deployment-mediawiki03
- 15:34 bd808: enabling puppet and forcing run on deployment-mediawiki02
- 15:28 bd808: enabling puppet and forcing run on deployment-mediawiki01
September 29
- 22:45 Reedy: re-enabled beta-scap-eqiad
- 21:34 Reedy: disabled "beta-scap-eqiad" until things are fixed
- 21:24 Reedy: deleted l10n cache on deployment-rsync01 to attempt to run sync-common manually
- 21:22 Reedy: deployment-rsync01 hard drive is far too small
- 17:57 cscott: updated OCG to version 89d8f29a24295b05d0643abe976fea83b56575c9
- 06:58 ori: Configured Beta cluster to use redis for session storage
- 06:57 ori: Created deployment-redis02 and converted it to use local puppet & salt masters
- 05:23 ori: Created deployment-redis01 and converted it to use local puppet & salt masters
September 28
- 14:38 andrewbogott: cherry-picked https://gerrit.wikimedia.org/r/#/c/163464/ onto deployment-salt to fix a puppet compile failure.
- 14:38 andrewbogott: edited and re-cherry-picked roan's citoid patch into beta because the previous version was breaking puppet
September 26
- 06:34 cscott: updated OCG to version f3a6c1cbba118d4a5e1aa019937dc50159fc823d
September 25
- 22:48 RoanKattouw: Fixed permissions of deployment-bastion:/srv/deployment/mathoid/mathoid/.git/deploy (needed g+w)
- 11:36 _joe_: updated hhvm to fix most bugs, also cherry-picked https://gerrit.wikimedia.org/r/#/c/162839/
September 24
- 23:00 bd808: Updated bash with salt
- 20:52 cscott: updated OCG to version 48acb8a2031863e35fad9960e48af60a3618def9
September 23
- 20:14 cscott: updated OCG to version 1cf9281ec3e01d6cbb27053de9f2423582fcc156
- 17:37 AaronSchulz: Initialized bloom cache on betalabs, enabled it, and populated it for enwiki
September 22
- 16:08 ori: updating HHVM to 3.3.0-20140918+wmf1
September 20
- 14:43 andrewbogott: moving deployment-pdf02 to virt1009
- 00:36 mutante: raised instance quota to 43
September 19
- 00:26 cscott: updated OCG to version ce16f7adb60d7c77409e2e11ba0e5d6cce6955d5
September 16
- 15:44 godog: testing scap change from https://gerrit.wikimedia.org/r/#/c/160668/
- 02:46 cscott: updated OCG to version 188a3c221d927bd0601ef5e1b0c0f4a9d1cdbd31
September 15
- 21:44 andrewbogott: migrating deployment-videoscaler01 to virt1002
- 21:41 andrewbogott: migrating deployment-sentry2 to virt1002
- 21:40 cscott: *skipped* deploy of OCG, due to deployment-salt issues
- 21:19 bd808: Added Matanya to under_NDA sudoers group (bug 70864)
September 12
- 12:24 _joe_: set up hiera, noop as expected
September 11
- 16:31 YuviPanda: Delete deployment-graphite instance
- 02:29 mutante: raised instance quota by 1 to 42
September 10
- 08:14 Krinkle: bits.beta.wmflabs.org is down with 503 Service Unavailable (http://bits.beta.wmflabs.org/en.wikipedia.beta.wmflabs.org/load.php)
September 9
- 20:08 cscott: updated OCG to version c9a2b4cf2502479eeabed07ab2de728695d96e46
September 7
- 23:48 bd808: Added John F. Lewis to under_NDA sudo policy (bug 70539)
- 23:29 bd808: Promoted John F. Lewis to project admin (bug 70539)
- 23:26 bd808: Added Jalexander as project member (bug 70539)
September 5
- 17:54 bd808: Purged varnish cache on deployment-cache-bits01 -- `sudo varnishadm ban req.url '~' /` (sketch at the end of this day's entries)
- 16:00 YuviPanda: unfuck puppet on deployment-salt. Puppet is stupid and does not properly report failed events in last_run_summary.yaml if there's a syntax error or a resource conflict, so I have to read last_run_report and do things with *that* instead now
- 15:49 YuviPanda: deliberately fucking up puppet to see if icinga complains
- 09:52 _joe_: cherry-picked I6ec53da483bebfa375eba2383cbf60123ff1ce26, it works
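A minimal sketch of the 17:54 full-cache purge above, assuming the varnish 3.x CLI the beta caches ran at the time; the `ban.list` check is a verification step not in the original entry:

```
# Ban every cached object whose request URL matches "/" (i.e. everything);
# banned objects are discarded instead of being served from cache.
sudo varnishadm ban req.url '~' /

# Verify the ban was registered.
sudo varnishadm ban.list
```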
September 4
- 16:06 bd808: Manually cleaned bogus LocalRenameUserJob jobs from redis
- 13:54 _joe_: stopped puppet on the appservers but mw03, testing an apache change
- 05:28 legoktm: stopping jobrunner on deployment-jobrunner01
- 05:22 legoktm: restarted jobrunner on deployment-jobrunner01
- 05:14 bd808: Bad jobs in job queue filled up /var on jobrunner01 and killed jobrunner script. Leaving down for now until I find out how to delete the bad jobs.
- 01:41 bd808: Killed old jobs-loop.sh processes on deployment-jobrunner01
- 01:24 bd808: Many jobrunner errors like "wikiversions-labs.cdb has no version entry for `amwiki`" with various wiki names
- 01:23 bd808|AWAY: Started jobrunner service manually on jobrunner01.
- 00:44 bd808: Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known when Trebuchet is running)
- 00:35 bd808: Puppet run on deployment-jobrunner01 failing with what seem to be dns issues (getaddrinfo: Name or service not known)
September 3
- 15:02 bd808: _joe_ rolled out a new hhvm package ~5 hours ago
- 15:01 bd808: morebots is back thanks to petan
- 14:50 bd808: logmsgbot down apparently
September 2
- 15:34 bd808: False alarm. SSL is borked in beta and we know that
- 15:29 bd808: `curl -vL -H 'Host: en.wikipedia.beta.wmflabs.org' localhost` works from deployment-cache-text02 (sketch at the end of this day's entries)
- 15:27 bd808: https://en.wikipedia.beta.wmflabs.org/ returning ERR_CONNECTION_REFUSED (is varnish down?)
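A sketch of the smoke test from the 15:29 entry above; the direct-to-backend check and the app server name are assumptions added for illustration:

```
# Hit the local varnish with the production-style Host header, so the
# right vhost answers even though we connect to localhost.
curl -v -H 'Host: en.wikipedia.beta.wmflabs.org' http://localhost/ -o /dev/null

# Bypass varnish and hit a backend app server directly (hypothetical target)
# to tell a cache problem apart from an apache/hhvm problem.
curl -v -H 'Host: en.wikipedia.beta.wmflabs.org' http://deployment-mediawiki01/ -o /dev/null
```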
August 29
- 22:56 bd808: Got puppet to run cleanly on deployment-mediawiki03. Should be ready for serving traffic.
- 22:39 bd808: Fixed a merge conflict in operations/puppet on deployment-salt
- 21:46 bd808: Forced install of the right version of libvips-tools on mediawiki03: `sudo apt-get install libvips-tools=7.38.5-2` (sketch below)
- 08:40 hashar: rebooting deployment-cache-mobile03 (kernel up)
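A hedged sketch of the 21:46 pinned install above; the `apt-mark hold` step is an assumption (not in the log) to keep unattended upgrades from replacing the pinned version:

```
# Install the exact version named in the log entry.
sudo apt-get install libvips-tools=7.38.5-2

# Assumption: hold the package so automatic upgrades leave it alone.
sudo apt-mark hold libvips-tools
```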
August 28
- 21:32 bd808: Added "Greg Grossmeier" to UnderNDA sudoers group
- 17:12 bd808: Changed centralauth db to rename labswiki -> deploymentwiki
- 16:49 bd808: CentralAuth looks broken on http://deployment.wikimedia.beta.wmflabs.org/
- 16:49 bd808: Apache vhosts look good again
- 16:34 bd808: Restarted varnishes on deployment-cache-text02
- 16:13 andrewbogott: merging a patch that renames 'labswiki' to 'deploymentwiki'
- 09:21 hashar: resetting git repository in /data/project/apache/conf to point to the beta cluster branch of operations/mediawiki-config.git; discarded all local hacks in the process
August 27
- 23:03 hashar: Blacklisting the security audit IP again on deployment-cache-bits01, -mobile03 and -text02
- 22:53 hashar: removed the blackhole ip route from deployment-cache-text02 and deployment-cache-mobile03
- 22:48 hashar: the IP is a known security audit. See Chris Steipp.
- 22:46 hashar: blackholed an IP address on deployment-cache-text02 and deployment-cache-mobile03; it was causing hundreds of requests per second and overloading the beta cluster. Use `route -n` to find the IP
- 22:37 hashar: restarting udp2log-mw on deployment-bastion. It has been crashing repeatedly since fairly recently
- 22:26 bd808: when restarting varnish on deployment-cache-text02, don't forget that there are 2 varnish services (varnish and varnish-frontend); see the sketch at the end of this day's entries
- 22:19 bd808: restarted varnish (again) on deployment-cache-text02
- 22:10 bd808: restarted varnish on deployment-cache-text02
- 16:22 bd808: killing `apt-get update` process running on deployment-bastion since Jun 13
- 14:59 bd808: Resolved puppet git merge conflict on deployment-salt
- 14:49 bd808: Moved hhvm core dumps to /data/project/hhvm-cores
- 14:42 bd808: Root drive full on deployment-mediawiki02; hhvm core files are the culprit
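Per the 22:26 reminder above, a text-cache restart only takes full effect if both daemons are bounced; a minimal sketch:

```
# deployment-cache-text02 runs a backend and a frontend varnish;
# restarting only one of them leaves the other serving stale objects.
sudo service varnish restart
sudo service varnish-frontend restart
```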
August 25
- 23:47 ori: stopping hhvm/apache on deployment-mediawiki02 to replace debug build of hhvm with release build
- 21:44 bd808: Deployed scap 116027f (Make sync-common update l10n cdb files by default)
- 18:30 ori: deployment-mediawiki02: cleared /tmp; running puppet
- 15:05 hashar: mediawiki02: rm /tmp/hhvm*.core. Filed as bug 69979
- 15:01 hashar: mediawiki02: rm /tmp/mw-cache-master/conf*
- 15:01 hashar: mediawiki02 has mw conf caches under /tmp/mw-cache-master/, and since that partition filled up, the conf caches ended up as null files
- 15:00 hashar: mediawiki02: rm /var/log/upstart/hhvm*
- 14:53 hashar: mediawiki02: removed /var/lib/puppet/state/agent_catalog_run.lock
- 14:46 hashar: restarting udp2log-mw service on -bastion. It is stalled for some reason
- 14:42 hashar: on mediawiki02, clearing out some /var/log/upstart/hhvm.* log files; see bug 69976
- 14:34 hashar: mediawiki02 / partition is 100% full
August 22
- 20:21 hashar: udp2log logs are back in /data/project/logs. The udp2log-mw service stalled for some reason.
- 20:08 ori: ran 'git pull' on deployment-salt:/srv/var-lib/git/operations/puppet
- 19:59 hashar: restarting udp2log-mw service on deployment-bastion
- 19:59 hashar: bits yielding 503
- 00:41 bd808: cherry-picked scap change https://gerrit.wikimedia.org/r/#/c/155677/ for testing
August 21
- 21:49 bd808: Trebuchet happier after all the salt-minion restarts; still have deleted hosts showing in the expected minion list for scap deploys
- 21:01 twentyafterfour: Started salt-minion on deployment-redis01
- 21:01 bd808: Started salt-minion on deployment-upload
- 21:00 bd808: Started salt-minion on deployment-fluoride
- 21:00 bd808: Started salt-minion on deployment-db1
- 20:59 bd808: Started salt-minion on deployment-elastic01
- 20:59 twentyafterfour: Started salt-minion on deployment-eventlogging02
- 20:58 bd808: Started salt-minion on deployment-elastic02
- 20:58 bd808: Started salt-minion on deployment-elastic03
- 20:57 bd808: Started salt-minion on deployment-elastic04
- 20:57 bd808: Started salt-minion on deployment-analytics01
- 20:55 bd808: Started salt-minion on deployment-cache-upload02
- 20:54 bd808: Started salt-minion on deployment-memc04
- 20:54 bd808: Started salt-minion on deployment-parsoid04
- 20:49 bd808: Started salt-minion on deployment-memc05
- 20:48 bd808: Started salt-minion on deployment-db2
- 20:48 twentyafterfour: Started salt-minion on deployment-cache-text02
- 20:47 twentyafterfour: Started salt-minion on deployment-memc03
- 20:46 bd808: Started salt-minion on deployment-cxserver01
- 20:12 bd808: List of broken salt minions can be obtained with `sudo salt-run manage.down` on deployment-salt (sketch at the end of this day's entries)
- 19:55 bd808: Fixed salt on deployment-memc02
- 19:52 bd808: Salt minions are broken all over beta. Hung grain-ensure calls, hung test.ping calls, downed minions
- 19:50 bd808: Killed dozens of grain-ensure calls and started salt-minion on deployment-cache-mobile03
- 19:47 bd808: Killed hung salt-call and started salt-minion on deployment-cache-bits01
- 19:28 bd808: Deployed cherry-pick of Iea7217a for scap
- 19:27 bd808: Restarted salt-minion on deployment-jobrunner01 & deployment-videoscaler01
- 19:27 bd808: Killed rogue salt-master process on deployment-bastion
- 19:26 bd808: Deleted salt keys for retired apache0[12] minions
- 00:13 bd808: Upgraded elasticsearch to 1.3.2 on deployment-logstash1
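A sketch of the minion recovery above, combining the 20:12 tip with a batched restart (the `-b 1` variant is borrowed from the August 18 entries below); run on deployment-salt:

```
# List minions that no longer respond to the master.
sudo salt-run manage.down

# Restart the minion service everywhere, one host at a time (-b 1),
# so a single wedged minion cannot stall the whole run.
sudo salt '*' -b 1 service.restart salt-minion
```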
August 19
- 16:11 hashar: deleted /usr/local/apache/common-local symlink, made it a directory and retriggered https://integration.wikimedia.org/ci/job/beta-scap-eqiad/17887/console
- 16:03 bd808: Removed local changes to /usr/local/apache/conf/wmflabs-logging.conf on deployment-mediawiki02; logs back to nfs share
- 15:52 bd808: Changed apache logging level from debug to notice on deployment-mediawiki02 in /usr/local/apache/conf/wmflabs-logging.conf
- 15:47 bd808: Changed apache logging level from debug to warn on deployment-mediawiki02
- 15:44 bd808: /var full on deployment-mediawiki02; deleting 572M /var/log/apache2/debug.log.1
- 15:03 hashar: Killed some stalled scap / rsync process on deployment-bastion that were preventing https://integration.wikimedia.org/ci/job/beta-scap-eqiad/ from acquiring the lock.
- 14:17 hashar: huge rsync in progress on bastion
- 14:00 hashar: On bastion, reverted the symlink and manually created directory /usr/local/apache/common-local
- 13:55 hashar_: On bastion, deleting /usr/local/apache/common-local and symlinking it to /srv/common-local
August 18
- 22:22 ^d: dropped apache01/02 instances; unused, and we need the resources
- 18:23 manybubbles: finished upgrading elasticsearch in beta - everything seems ok so far
- 18:15 bd808: Restarted salt-minion on deployment-mediawiki01 & deployment-rsync01
- 18:15 bd808: Ran `sudo pkill python` on deployment-rsync01 to kill hundreds of grain-ensure processes
- 18:12 bd808: Ran `sudo pkill python` on deployment-mediawiki01 to kill hundreds of grain-ensure processes
- 18:10 manybubbles: finally restarting beta's elasticsearch servers now that they have new jars
- 17:56 bd808: Manually ran trebuchet fetches on deployment-elastic0*
- 17:49 bd808: Forcing puppet run on deployment-elastic01
- 17:47 godog: upgraded hhvm on mediawiki02 to 3.3-dev+20140728+wmf5
- 17:44 bd808: Trying to restart minions again with `salt '*' -b 1 service.restart salt-minion`
- 17:39 bd808: Restarting minions via `salt '*' service.restart salt-minion`
- 17:38 bd808: Restarted salt-master service on deployment-salt
- 17:19 bd808: 16:37 Restarted Apache and HHVM on deployment-mediawiki02 to pick up removal of /etc/php5/conf.d/mail.ini (logged in prod SAL by mistake)
- 16:59 manybubbles|lunc: upgrading Elasticsearch in beta to 1.3.2
- 16:11 bd808: Manually applied https://gerrit.wikimedia.org/r/#/c/141287/12/templates/mail/exim4.minimal.erb on deployment-mediawiki02 and restarted exim4 service
- 15:28 bd808: Puppet failing for deployment-mathoid due to duplicate definition error in trebuchet config
- 15:15 bd808: Reinstated puppet patch to depool deployment-mediawiki01 and forced puppet run on all deployment-cache-* hosts
- 15:04 bd808: Puppet run failing on deployment-mediawiki01 (apache won't start); Puppet disabled on deployment-mediawiki02 ('reason not specified'). Probably needs to wait until Giuseppe is back from vacation for fixing.
- 15:00 bd808: Rebooting deployment-eventlogging02 via wikitech; console filling with OOM killer messages and puppet runs failing with "Cannot allocate memory - fork(2)"
- 14:29 bd808: Forced puppet run on deployment-cache-upload02
- 14:27 bd808: Forced puppet run on deployment-cache-text02
- 14:24 bd808: Forced puppet run on deployment-cache-mobile03
- 14:20 bd808: Forced puppet run on deployment-cache-bits01
August 17
- 22:58 bd808: Attempting to reboot deployment-cache-bits01.eqiad.wmflabs via wikitech
- 22:56 bd808: deployment-cache-bits01.eqiad.wmflabs not allowing ssh access and wikitech console full of OOM killer messages
August 15
- 21:57 legoktm: set $wgVERPsecret in PrivateSettings.php
- 21:42 hashSpeleology: Beta cluster database updates are broken due to CentralNotice. Fix up is 154231
- 20:57 hashSpeleology: deployment-rsync01 : deleting /usr/local/apache/common-local content. Then ln -s /srv/common-local /usr/local/apache/common-local as set by beta::common which is not applied on that host for some reason. bug 69590
- 20:55 hashSpeleology: puppet administratively disabled on mediawiki02 . Assuming some work in progress on that host. Leaving it untouched
- 20:54 hashSpeleology: puppet is proceeding on mediawiki01
- 20:52 hashSpeleology: attempting to unbreak mediawiki code update bug 69590 by cherry picking 154329
- 20:39 hashSpeleology: in case it is not in SAL: MediaWiki is no longer synced to the app servers; bug 69590
- 20:20 hashSpeleology: rebooting mediawiki01; /var refuses to clear out and sticks at 100% usage
- 20:16 hashSpeleology: cleaning up /var/log on deployment-mediawiki02
- 20:14 hashSpeleology: on deployment-mediawiki01 deleting /var/log/apache2/access.log.1
- 20:13 hashSpeleology: on deployment-mediawiki01 deleting /var/log/apache2/debug.log.1
- 20:13 hashSpeleology: bunch of instances have a full /var/log :-/
- 11:37 ori: deployment-cache-bits01 unresponsive; console shows OOMs: https://dpaste.de/LDRi/raw . rebooting
- 03:20 jeremyb: 02:46:37 UTC <ebernhardson> !log beta /dev/vda1 full. moved /srv-old to /mnt/srv-old and freed up 2.1G
August 14
- 12:23 hashar: manually rebased operations/puppet.git on puppetmaster
August 13
- 08:02 hashar: beta-code-update-eqiad is running again
- 07:57 hashar: fixing ownership under /srv/scap-stage-dir/php-master/skins; some files belonged to root
- 07:55 hashar: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is broken :-/
August 8
- 16:05 bd808: Fixed merge conflict that was preventing updates on puppet master
August 6
- 13:13 hashar: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ is running again
- 13:13 hashar: removed a bunch of local hacks on deployment-bastion:/srv/scap-stage-dir/php-master. They caused the git repo to be dirty and prevented scap from completing git pull there
- 12:08 hashar: Manually pruning whole text cache on deployment-cache-text02
- 12:07 hashar: Apache virtual hosts were not properly loaded on mediawiki02. I have hacked /etc/apache2/apache2.conf to make it Include /usr/local/apache/conf/all.conf (instead of main.conf, which does not include everything)
- 08:43 hashar: pruning cache on deployment-cache-text02 / restarting varnish
August 2
- 08:53 swtaarrs: rebuilt and restarted hhvm on deployment-mediawiki02 with potential fix
- 05:17 swtaarrs: restarted hhvm on deployment-mediawiki0{1,2} to unwedge them
August 1
- 15:03 bd808: Updated cherry-pick of Iceb8f43
- 15:02 bd808: Cleaned up puppet repo on deployment-salt; merge conflicts with local Ia463120 hack; reapplied depool of deployment-mediawiki01
- 14:50 bd808: Restarted stuck hhvm on deployment-mediawiki02; apache had 89 children waiting for a response
- 13:27 godog: changed inplace bt-hhvm on deployment-mediawiki01/02 to also copy the binary
- 05:32 ori: depooled deployment-mediawiki02 to investigate HHVM lock-up by cherry-picking I7df8c5310 on beta.
- 00:40 ori: disabled puppet on deployment-mediawiki{01,02} and enabled verbose apache logging
July 31
- 22:41 bd808: Restarted hhvm on -mediawiki{01,02}. Brett looked at 01 before I did and said "it's the same as before"
- 20:09 cscott: updated OCG to version d2919c59eb09e09fc87777696411a070620aef45
- 19:59 hashar: Granted sudo right to cscott (under NDA). Will let him reboot OCG service
- 18:58 ori: re-enabled puppet on deployment-mediawiki{01,02}
- 10:41 hashar: Taking gdb traces of hhvm on mediawiki01 and mediawiki02. Restarting hhvm
- 05:08 bd808: HHVM hung on both boxes. Grabbed core and backtrace before restarting
July 30
- 19:59 bd808: Created local commit 7d56b79 in puppet to work around bugs in Ia463120718dceab087ad3f8e3f35917fa879f387
- 19:46 bd808: Restored prior /etc/hhvm/php.ini from puppet filebucket archive on deployment-mediawiki0[12]
- 19:32 bd808: Disabled puppet on deployment-mediawiki02 for the same reason
- 19:31 bd808: Disabled puppet on deployment-mediawiki01; Ori will look into hhvm config changes that were being applied
- 16:52 bd808: Fixed beta-scap-eqiad Jenkins job by correcting ssh problems in beta project
- 16:43 bd808: Fixed ssh to jobrunner01 and videoscaler01 by correcting unrelated puppet manifest problem and forcing run via salt.
- 16:00 bd808: Puppet runs on videoscaler01 and jobrunner01 failing for "Could not find dependency Ferm::Rule[bastion-ssh] for Ferm::Rule[deployment-bastion-scap-ssh]"
- 16:00 bd808: Puppet seems manually disabled on apache0[12].
- 15:59 bd808: Can't ssh to apache0[12], videoscaler01 and jobrunner01. Puppet not running on any of them. libnss-ldapd unattended update has broken /etc/nslcd.conf
- 15:23 bd808: Removed cherry-pick for Iac547efa83cf059a1276b6e279c3ebd4c7224b2c and updated cherry-pick for I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 to latest patch set.
- 15:05 bd808: Two cherry-picks in puppet conflicting with merged production changes: I5afba2c6b0fbf90ff8495cc4a82f5c7851893b52 and Iac547efa83cf059a1276b6e279c3ebd4c7224b2c (ori, twentyafterfour)
- 14:49 bd808: Started apache2 service on deployment-mediawiki01
- 14:16 hashar: rebooting hhvm
- 09:42 hashar: bastion had broken puppet because deployment_server and zuul both declared the same python packages 150501
- 09:40 hashar: restoring on puppetmaster modules/mediawiki/templates/apache/apache2.conf.erb which got deleted somehow
- 09:29 hashar: Rebooting apache01/02 to see whether it fixes the ssh connection issue
- 09:27 hashar: manually started hhvm on mediawiki01
- 09:25 hashar: rebooting deployment-mediawiki01; hhvm process went zombie
- 09:23 hashar: restarting hhvm on mediawiki 01/02
- 09:05 hashar_: Beta scap script broken since 6:30am UTC https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
July 29
- 22:56 cscott: updated OCG to version aeb8623d6ebe41ae7c7e36c57844bd9ea8e6d595
- 21:02 bd808: Converted deployment-sentry2.eqiad.wmflabs to use beta salt/puppet master
- 19:14 hashar: Removed all jobs from the queue, restarted the slave agent. Update jobs are coming back
- 19:09 hashar: deployment-bastion jenkins slave is stuck. Beta cluster is no longer updating code :-//
- 15:58 godog: restarted hhvm on deployment-mediawiki01
- 15:52 godog: restarted hhvm on deployment-mediawiki02
- 15:50 godog: installed libevent-dbg on deployment-mediawiki02 to capture an hhvm backtrace
- 15:17 bd808: _joe_ restarting hhvm on deployment-mediawiki01
- 15:00 bd808: Apache stuck with 65 children on both deployment-mediawiki servers
- 10:37 hashar: Restarted hhvm on mediawiki{01,02}
July 28
- 17:41 bd808: Updated hhvm to latest 3.3-dev+20140728 build on deployment-mediawiki0[12]
- 15:37 manybubbles: rebuilding elasticsearch indexes to build a weighted all field we'll try to use to improve performance
- 15:32 bd808: Restarted hhvm on deployment-mediawiki0[12]. All apache children were stuck waiting for hhvm to respond.
- 15:20 bd808: Restarted apache on deployment-mediawiki02. 65 children and non-responsive to requests. (same as mediawiki01)
- 15:18 bd808: Restarted apache on deployment-mediawiki01. 65 children and non-responsive to requests.
- 14:23 manybubbles: or not - looks like I can't!
- 14:22 manybubbles: rebuilding cirrus search indexes to pick up the sped-up all field
- 08:30 hashar: restarted varnish on deployment-cache-bits01. Hoping to clear the bits cache
July 25
- 18:29 bd808: Added twentyafterfour and several other WMF staff to under_NDA sudo group
- 17:15 bd808: Morebots is back!
- 16:38 bd808: pstree showed "hhvm─┬─271*[sh]" on deployment-mediawiki02
- 16:38 bd808: Killed apache2+hhvm and restarted on deployment-mediawiki0[12]
- 16:06 bd808: `tcpdump -n udp dst port 8324` shows packets leaving deployment-bastion for deployment-logstash1 (sketch at the end of this day's entries)
- 16:00 bd808: Stopped udp2log and started udp2log-wm with no apparent effect
- 16:00 bd808: udp2log events not being sent from deployment-bastion to deployment-logstash1
- 15:49 bd808: Restarted logstash on deployment-logstash1
- 09:45 mwalker: rebasing puppet repo to get a ocg patch
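A sketch of the 16:06 packet check above; port 8324 is the udp2log port referenced there, and the receiving-side capture is an assumption added for the other half of the check:

```
# On deployment-bastion: confirm udp2log packets leave for the logstash host.
sudo tcpdump -n udp dst port 8324

# Assumption: on deployment-logstash1, confirm the packets actually arrive.
sudo tcpdump -n udp port 8324
```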
July 24
- 16:09 bd808: Reverted the MW config change that re-enabled luasandbox; back to luastandalone for now
- 15:44 bd808: Updated MW config to re-enable luasandbox mode
- 15:43 bd808: Updated hhvm-luasandbox to 2.0-3 and restarted hhvm instances
- 14:21 hashar: killed hhvm process on deployment-mediawiki01 and 02. init script does not work.
- 02:59 ori: promoted legoktm to project-admin
July 23
- 23:30 bd808: Running `find . -type d -exec chmod 777 {} +` in /data/project/upload7 to fix shared image dir permissions (sketch below)
- 20:49 bd808: Changed config to run lua via external executable to avoid hhvm crashing bug
- 16:20 bd808: hhvm upgraded to 3.1+20140723-1+wmf1 on deployment-mediawiki0[12]
- 15:34 bd808: Reverted hhvm to 3.1+20140630-1+wm1 on deployment-mediawiki02
- 15:21 bd808: Upgraded hhvm to 3.1+20140630; seeing problems with luasandbox extension
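The 23:30 directory-permission fix above, sketched with an absolute path; mode 777 matches the log entry, though a tighter mode would normally be preferable:

```
# chmod directories only, leaving files untouched; '{} +' batches many
# paths into each chmod invocation instead of forking once per directory.
find /data/project/upload7 -type d -exec chmod 777 {} +
```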
July 22
- 14:26 hashar: upgrading varnish on deployment-cache-mobile03
- 14:22 hashar: upgrading varnish on deployment-cache-text02
- 14:02 hashar: rebooting deployment-cache-upload02 varnish not happy with memory mapping
- 13:51 hashar: rebooting bits varnish cache
- 13:43 hashar: rebased puppetmaster repo. Rebase got broken after 0317463 ('beta: New script to restart apaches') got merged in.
- 13:35 hashar: apt-get upgrade on deployment-cache-bits01 + varnish upgrade
- 09:28 hashar: Removing role::beta::natfix, which is now handled by labs DNS; the class is removed with 146091
July 21
- 23:37 ori: Switched over beta cluster app servers to HHVM
- 21:27 bd808: Killed update.php jobs; Antoine will give jobs a longer timeout
- 21:23 bd808: Running update.php for simplewiki in screen
- 21:22 bd808: Running update.php for hewiki in screen
- 21:21 bd808: Running update.php for eswiki in screen
- 21:21 bd808: Running update.php for cawiki in screen
- 21:21 bd808: Running update.php for commonswiki in screen
- 21:18 hashar: Restarting udp2log-mw on deployment-bastion. There are a bunch of [python] <defunct> processes
- 17:32 bd808: Updated scap to 4871208 (+ cherry pick of I6a56b5e)
- 17:12 bd808: Hotfix for scap ssh host key checking to fix jenkins scap job
- 17:03 bd808: Testing scap change I40a891b via cherry-pick
- 10:25 hashar: on bastion, fixed some puppet dependencies to have nutcracker start with the proper configuration 148043
- 10:20 hashar: upgrading packages on deployment-bastion
- 10:19 hashar: deleted /var/lib/apt/lists/lock on bastion. It was preventing apt-get update from running
- 10:18 hashar: setting up nutcracker on deployment-bastion. It was installed but the puppet class to configure it was not being applied. Related Gerrit patches: 148041 and 148042
- 09:25 hashar: rebooting deployment-apache02
- 09:22 hashar: rebooting deployment-apache01.
- 00:27 ori: deployment-mediawiki01 & deployment-mediawiki02: configured for project-local puppet & salt masters
July 18
- 00:30 bd808: removed local l10nupdate user from deployment-jobrunner01 and deployment-videoscaler01
- 00:22 bd808: Killed stuck beta-update-databases-eqiad job (stuck for over 60m waiting for executor; deadlock?)
- 00:21 ori: beta broke due to I433826423. app servers load prod apache confs from /etc/apache2/wikimedia. temp fix: locally hack apache2.conf to load /usr/local/apache2/conf/all.conf; disable puppet.
July 17
- 23:18 bd808: Puppet broken for deployment-bastion by labs specific logic in misc::deployment::vars.
- 19:01 mwalker: possibly breaking labs by cherry picking an apparmor patch that affects mysql https://gerrit.wikimedia.org/r/#/c/147027/
July 16
- 19:15 mwalker: updated puppet about 20 minutes ago for new ocg variables (now officially in production puppet instead of just cherry picked)
July 15
- 18:26 bd808: Removed local mwdeploy user from /etc/passwd on deployment-videoscaler01 and deployment-jobrunner01
- 16:59 bd808: scap failing to deployment-videoscaler01 and deployment-jobrunner01 due to other random failures now. Lots of strange permissions errors during rsync
- 16:37 bd808: scap failing to deployment-videoscaler01 and deployment-jobrunner01 due to ssh auth failures; likely a puppet config problem
July 10
- 22:37 bd808: Added Gergő Tisza and Yuvipanda as project admins
July 8
- 23:37 bd808: Updated Kibana to 0afda49 (latest upstream head)
- 17:03 greg-g: Added John F. Lewis to the project after his NDA was signed by Mark (RT 7722)
July 7
- 20:55 bd808: Killed stuck `apt-get update` job on deployment-jobrunner01 started on Jun 17
- 20:20 bd808: Fixed puppet on deployment-analytics01 with manual apt-get commands.
- 20:08 bd808: Ran `apt-get dist-upgrade` on deployment-analytics01 to upgrade hadoop, hive, pig, etc which were failing to update via puppet.
July 4
- 02:28 RoanKattouw: Unbroke replication on deployment-db2, it's catching up now
July 3
- 18:59 legoktm: manually created centralauth.renameuser_status table
- 16:04 bd808: Updated scap to ff04431
- 09:24 hashar: Reindexed ElasticSearch index for cawiki/eswiki with: mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php --wiki {cawiki,eswiki} --batch-size=50 (sketch at the end of this day's entries)
- 09:22 hashar: Blew away the ElasticSearch indices for cawiki and eswiki with: mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php --wiki cawiki --startOver --indexType content && mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php --wiki cawiki --startOver --indexType general
- 09:10 hashar: used addwiki.php to create the wiki. Manually triggered the Jenkins job that updates the databases https://integration.wikimedia.org/ci/job/beta-update-databases-eqiad/2319/
- 09:06 hashar: Adding cawiki and eswiki for cxserver testing Ibbcbd4
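The 09:22/09:24 rebuild above, sketched as a loop over the affected wikis; the loop is an assumption for readability, while the mwscript invocations are copied from the log entries:

```
for wiki in cawiki eswiki; do
    # Drop and recreate both index types, then repopulate them.
    mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php \
        --wiki "$wiki" --startOver --indexType content
    mwscript extensions/CirrusSearch/maintenance/updateOneSearchIndexConfig.php \
        --wiki "$wiki" --startOver --indexType general
    mwscript extensions/CirrusSearch/maintenance/forceSearchIndex.php \
        --wiki "$wiki" --batch-size=50
done
```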
July 2
- 07:49 hashar: cxserver being configured! 140723 by Kartik and Niklas \O/
July 1
- 15:46 bd808: Fixed git rebase conflict in operations/puppet on deployment-salt
- 13:29 manybubbles: rebuilding Cirrus search index in beta to pick up new configuration and cache warmers
- 11:20 hashar: Added Filippo Giunchedi to the project as an admin (WMF ops)
June 30
- 20:47 bd808: The state of puppet for beta is badly broken. I have hacked things to get puppet to apply on deployment-apache0[12] but puppet won't apply on deployment-bastion in part due to the same hacks.
- 18:48 bd808: Created symlink /apache -> /usr/local/apache on deployment-apache0[12] to fix docroot symlinks
- 18:09 bd808: Beta apaches are broken with latest puppet config applied. Working to correct.
- 18:08 bd808: Manually added symlink for /etc/apache/wmf on deployment-apache0[12]
June 26
- 12:48 YuviPanda: cherry picked https://gerrit.wikimedia.org/r/#/c/142228/ to puppetmaster, sending events to charcoal.wmflabs.org now with projectname \o/
- 09:46 YuviPanda: cherry-picked https://gerrit.wikimedia.org/r/#/c/142210/ on to puppetmaster
- 09:38 hashar: Granting sudo to YuviPanda
June 25
- 20:58 bd808: Fixed rebase conflict in operations/puppet.git on deployment-salt caused by cherry-picked vcl patch left over from varnish submodule usage
June 24
- 19:29 bd808: Manually updated operations/puppet checkout on deployment-salt to deal with varnish submodule change
June 19
- 22:47 bd808: Updated scap to 792a572
- 22:46 bd808: Trebuchet runs on deployment-videoscaler01 are succeeding but not showing up in the `git deploy report` output
- 22:40 bd808: Deleted /var/log/diamond/diamond.log on deployment-jobrunner01 because /var was full
June 18
- 16:55 bd808: Setup hourly cron as user bd808 on deployment-salt to test automatic update of puppet repo using ~bd808/git-sync-upstream script
June 17
- 20:36 bd808: Upgraded elasticsearch to version 1.2.1 on deployment-logstash1
June 16
- 21:16 bd808: Jenkins beta-scap-eqiad job broken because of missing puppet config on deployment-jobrunner01; needs role::beta::scap_target
- 20:36 bd808: Enabled puppet on deployment-jobrunner01 and forced a run
- 20:34 bd808: Puppet disabled on deployment-jobrunner01 since 2014-06-03; No SAL logs explaining why
- 20:19 bd808: Updated scap to 5adce72; trebuchet reported i-00000237 (deployment-videoscaler01) as not updating, but manual check shows it did sync properly
- 20:00 bd808: Deleted /var/lib/puppet/state/agent_catalog_run.lock on deployment-bastion after verifying that no puppet processes were running
- 19:55 bd808: Truncated /var/log/diamond/diamond.log and restarted diamond on deployment-bastion
- 19:36 bd808: /var/log/diamond is 787M of 1.2G total logs
- 19:29 bd808: /var 0% free on deployment-bastion; looking for things to clean-up
June 9
- 15:19 andrewbogott: doing a 'rebase origin' on deployment-salt, because it needs it.
- 15:10 andrewbogott: updating all instances to puppet 3 via a cherry-pick of https://gerrit.wikimedia.org/r/#/c/137898/ on deployment-salt
June 7
- 02:44 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-06-06T22:11:04
June 6
- 19:26 bblack: synced labs/private on deployment-salt again
- 16:30 bd808: Rebooted deployment-salt
- 16:27 bd808: Made /var/log a symlink to /srv/var-log on deployment-salt (sketch at the end of this day's entries)
- 16:26 bblack: Updated labs/private.git on puppetmaster. brings in updated zero+netmapper password for beta
- 16:18 bd808: Changed from role::labs::lvm::biglogs to role::labs::lvm::srv on deployment-salt and made /var/lib a symlink to /srv/var-lib
- 15:45 bd808: /var on deployment-salt still at 97% full after moving logs; /var/lib is our problem
- 15:43 bd808: Archived deployment-salt:/var/log to /data/project/deployment-salt
- 15:40 bd808: Disabled puppet on deployment-salt to work on disk space issues
- 12:44 hashar: Updated labs/private.git on puppetmaster. Brings Brandon Black change "add labs copy of zerofetcher auth file" 137918
- 02:48 mwalker: added role::labs::lvm::biglogs to deployment-salt because it is out of room on /var and I don't know what I can delete
- 01:25 bd808: Live hacked /etc/apache2/wmf/hhvm.conf on apaches to allow them to start
- 00:30 bd808: `git stash`ed dirty dblist files found in /a/common on deployment-bastion
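A hedged reconstruction of the 15:40-16:30 disk-space fix above; the role change, symlinks, and reboot come from the log entries, while the exact copy/rename commands are assumptions:

```
# Puppet was disabled first (15:40 entry) so it would not fight the changes.
sudo puppet agent --disable

# Copy the data onto the larger /srv volume, then symlink it back.
sudo mkdir -p /srv/var-log /srv/var-lib
sudo rsync -a /var/log/ /srv/var-log/
sudo rsync -a /var/lib/ /srv/var-lib/
sudo mv /var/log /var/log.orig && sudo ln -s /srv/var-log /var/log
sudo mv /var/lib /var/lib.orig && sudo ln -s /srv/var-lib /var/lib

sudo reboot   # the host was rebooted afterwards (16:30 entry)
```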
June 5
- 14:16 manybubbles: rebuilt beta's jawiki search index without kuromoji - it didn't help much anyway
- 14:14 manybubbles: recovered from busted elasticsearch - two problems: 1. I had an index that used the kuromoji plugin but I'd uninstalled it, and 2. I had plugins for 1.2.1 but was trying to start 1.1.0. Solution: 1. delete the index and recreate it without kuromoji; 2. upgrade to 1.2.1 as I had planned to do anyway. (sketch at the end of this day's entries)
- 14:01 manybubbles: elasticsearch cluster got really angry in beta when I restarted some node - it's like they aren't talking to each other properly - trying to recover. Once that is done I'll upgrade to 1.2.1 and that might fix it
- 13:59 hashar: deployment-elastic01 puppet was broken due to bug 63322, i.e. having some HTML garbage as the ec2id, which would be used as the puppet certname
- 13:47 manybubbles: rolling restart of elasticsearch nodes in beta to pick up new kernel
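Given the 14:14 post-mortem above (plugins built for 1.2.1, server still starting 1.1.0), a quick version-skew check before restarting a node; paths assume the stock elasticsearch package layout:

```
# Version of the running server.
curl -s localhost:9200 | grep '"number"'

# Plugins installed on this host; their versions must match the server.
/usr/share/elasticsearch/bin/plugin --list
ls /usr/share/elasticsearch/plugins
```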
June 4
- 20:46 bd808: Fixed file ownership on /data/project/apache/uncommon for beta-recompile-math-texvc-eqiad job
- 19:27 manybubbles: sorry, can't do that yet
- 19:27 manybubbles: plugins deployed to beta - time to restart Elasticsearch in beta - should cause no interruption of service
- 19:01 manybubbles: deploying Elasticsearch 1.2.1 and some updated plugins to beta
- 17:11 bd808: Unwedged the jenkins jobs updating beta by stopping the stuck db update job
- 16:27 bd808: Changed uid/gid for files owned by l10nupdate user
- 09:50 mwalker: Reset salt caches by running `salt '*' state.clear_cache` from deployment-salt -- deployment-pdf01 now no longer reports errors when returning status for deployment
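A sketch expanding the 09:50 fix above; the `test.ping` verification is an assumption, not in the log:

```
# Clear cached state data on every minion...
sudo salt '*' state.clear_cache

# ...then confirm all minions still respond to the master.
sudo salt '*' test.ping
```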
June 3
- 22:30 bd808: Deleted unused /data/project/apache/common-local on NFS share.
June 2
- 19:42 bd808: Updated scap to a7da355
- 05:14 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-06-01T07:22:56
May 30
- 21:45 bd808: Restarted uwsgi on deployment-graphite
- 18:43 bd808: Updated scap to c4204dd
May 29
- 21:07 bd808: mwalker cleaned up log spam from upstart on deployment-pdf01
- 20:59 bd808: /var full on deployment-pdf01
- 20:55 bd808: Restarted salt minion on deployment-pdf01 with `sudo salt 'i-00000396.eqiad.wmflabs' service.restart salt-minion`
May 28
- 17:53 bd808: Restarted logstash on deployment-logstash1; last event logged at 2014-05-28T12:11:37
- 16:56 bd808: Updated scap to fd7e538
May 27
- 19:08 bd808: Updated scap to 48c7e28
- 14:56 bd808: Updated scap to 9609e8d
May 23
- 16:32 bd808: Upgraded elasticsearch to 1.1.0 on deployment-logstash1
- 13:36 manybubbles: restarting elasticsearch on deployment-elastic01 to pick up some gc setting recommended by elasticsearch team
May 22
- 23:00 bd808: Added 20after4 as a project admin
- 22:59 bd808: Added matanya as a project member
- 21:38 bd808|LUNCH: Deployed scap 096cb3f
May 21
- 17:33 mwalker: converted deployment-pdf01 (i-00000396.eqiad.wmflabs) to use local puppet & salt master
- 14:50 bd808: restarted logstash on deployment-logstash1; getting really tired of these soft crashes
- 00:33 bd808: Puppet failing on deployment-videoscaler01 with duplicate definition of Class[Mediawiki::Jobrunner]
- 00:07 bd808: Fixed puppet for deployment-jobrunner01 using https://gerrit.wikimedia.org/r/#/c/134519/2
May 20
- 23:49 bd808: Fixed puppet for deployment-apache[12] using https://gerrit.wikimedia.org/r/#/c/134519/2
- 23:11 bd808: deployment-apache01 needs more work: "Could not set shell on user[mwdeploy]"
- 23:06 bd808: Fixing puppet config for upstream rename of role::applicationserver -> role::mediawiki
- 21:14 ori: Converted deployment-stream to use local puppet & salt masters
- 21:08 RoanKattouw: chown'ed /data/project/parsoid/parsoid.log from mwalker (?!?) to parsoid so Parsoid runs again
- 15:53 bd808: Deployed scap 7b6fc47 via trebuchet
May 19
- 14:34 bd808: Restarted logstash service on deployment-logstash1; it stopped logging new events at 10:37:13Z
May 16
- 21:20 manybubbles: restarting elasticsearch in beta to update some plugins
- 00:34 bd808: Updated EventLogging to I89819bd
May 15
- 22:14 bd808: Restarted logstash on deployment-logstash1 yet again; memory leak from invalid encoding bug
- 00:14 bd808: Disabled puppet on deployment-logstash1 to test a local logstash config change
May 14
- 23:33 bd808: Added irc input to logstash via I409fec9
May 13
- 09:28 bd808: Restarted logstash service on deployment-logstash1
- 09:28 bd808: Logstash events stop at 2014-05-11T18:36:35Z; Log file shows many "Failed parsing date from field" errors which probably triggered the known upstream memory leak bug
May 10
- 18:02 bd808: Restarted logstash on deployment-logstash1
May 6
- 17:54 bd808: Restarted logstash on deployment-logstash1
- 17:53 bd808: Logstash in beta hasn't recorded any events since 2014-05-04T04:32:36.
- 15:33 manybubbles: rolling restart of Elasticsearch servers in beta to pick up a new highlighter plugin to fix bugs found when we fixed Hebrew analysis, and to implement phrase highlighting.
May 5
- 21:29 mwalker: ran puppetstoredconfigclean and revoked puppet and salt keys for i-00000339.eqiad.wmflabs (was pdf01)
- 21:24 mwalker: removing pdf01 instance -- labs just uses production mwlib which works just fine. I'll recreate this when I make the OCG test instance
- 20:57 manybubbles: deploying new plugin to Elasticsearch (swift)
May 3
- 18:10 mwalker: Updated kernel on deployment-pdf01 (manually set console=ttyS0 to match older installed kernels)
- 17:58 mwalker: Converted i-00000339.eqiad.wmflabs (deployment-pdf01) to use local puppet & salt masters
- 17:54 mwalker: signed salt key for i-00000339.eqiad.wmflabs (deployment-pdf01)
- 17:43 bd808: Added mwalker to under_NDA sudoers group
May 2
- 17:01 bd808: Switched scap to use scripts delivered by trebuchet
May 1
- 15:46 manybubbles: upgrading Elasticsearch highlighter via a rolling restart
- 00:56 bd808: Fixed empty PrivateSettings.php configuration file (which I also broke earlier)
April 28
- 16:12 manybubbles: upgrading highlighter plugin in Elasticsearch
- 15:43 bd808: Created empty /srv/scap-stage-dir/wmf-config/mwblocker.log file to stop missing file warnings in beta.
April 25
- 11:31 hashar: commonswiki-75388f96: 0.6183 19.5M SQL ERROR (ignored): Table 'commonswiki.revtag_type' doesn't exist (10.68.16.193)
- 11:30 hashar: Authentication is broken on the beta cluster. Well at least from commons.wikimedia.beta.wmflabs.org
April 23
- 19:34 ^demon|lunch: created zhwiki, ukwiki, ruwiki, kowiki, hiwiki, jawiki for testing
- 10:19 hashar: stopping udp2log and starting udp2log-mw instead (known old bug that prevents logging)
April 22
- 18:42 bd808: Rebooting deployment-bastion in a wild attempt to get the jenkins slave there working again
April 18
- 19:24 manybubbles: rebuilding Cirrus indexes to pick up auxiliary fields and smarter accent matching
April 16
- 18:56 hashar: Migrating memc04 and memc05 to self master/salt [[bugzilla:64010|bug 64010]]
- 13:13 manybubbles: done
- 13:10 manybubbles: rolling restart of Elasticsearch nodes in beta to make super sure it picked up new plugins
- 09:33 hashar: rebased puppetmaster
April 15
- 20:02 manybubbles: restarting elasticsearch in beta to pick up a plugin update - no downtime should occur
- 14:24 hashar: rebased puppetmaster
April 11
- 17:41 bd808: Tried to enable role::protoproxy::ssl::beta on deployment-cache-text02 but it failed to apply because /etc/ssl/certs/star.wmflabs.org.pem and /etc/ssl/private/star.wmflabs.org.key don't match.
- 03:59 bd808: sudo apt-get install mysql-client on deployment-bastion
- 03:54 bd808: Added legoktm as a project member
- 00:02 bd808: Enabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/
April 10
- 21:35 bd808: Running scap on deployment-bastion for the first time in eqiad
- 21:13 bd808: Disabled https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ to work on scap setup
- 14:52 hashar: Adding Tobias Gritschacher to the project so he can look at udp2log / apache logs whenever needed :-]
April 9
- 23:04 bd808: Re-enabled puppet on deployment-apache02 and forced a puppet run
- 21:39 bd808: Cherry-picked I8f77e0c into puppet and forced puppet run on deployment-bastion
April 8
- 17:53 manybubbles: rebuilding simplewiki's search index optimized for the new highlighter to check the size difference
- 05:34 Ryan_Lane: upgraded libssl on all nodes, restarted affected ssl servers
- 05:03 Ryan_Lane: upgraded libssl on all salt accessible nodes
April 5
- 11:19 hashar: Attempting to reenable SSL support with 124057
April 4
- 21:39 bd808: Restarted logstash; it stopped processing events again at 2014-04-04T19:56:46Z
- 17:31 bd808: Forced puppet run on deployment-cache-text02
- 17:29 bd808: Manually fixed puppet config on deployment-cache-text02 (the cert html error problem)
- 17:22 bd808: Rebooting deployment-cache-bits01
- 17:21 bd808: Forced puppet run on deployment-cache-bits01
- 16:15 manybubbles: Performing a rolling restart of Elasticsearch nodes to pick up a new plugin
April 3
- 17:32 bd808: Fixed certname in /etc/puppet/puppet.conf manually on deployment-bastion so puppet would run again.
- 15:33 bd808: Restarted logstash on deployment-logstash1; stuck in a bad state due to jvm oom logged at 2014-04-03T12:03:43Z
April 2
- 17:54 manybubbles: done installing plugins on Elasticsearch in beta
- 14:10 hashar: Fixed database updating job https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/. It was not running on the proper node.
- 12:50 hashar: restarted parsoid daemon on deployment-parsoid04.eqiad.wmflabs. It also now logs to /data/project/parsoid/parsoid.log
- 12:36 hashar: Manually deleting parsoid user/group on deployment-parsoid04. Will use the LDAP uid/gid instead.
April 1
- 21:38 hashar: Removed the Zuul triggers that updated beta cluster in PMTPA 123100.
- 19:49 bd808: Converted deployment-graphite.eqiad.wmflabs to use local puppet & salt masters
- 19:20 bd808: Deleting and re-creating deployment-graphite because I forgot to add the web security group
- 15:57 andrewbogott: shutting down all pmtpa instances
- 14:32 manybubbles: completed upgrade to Elasticsearch 1.1.0 and fixed deployment-elastic04.
- 13:32 hashar: Thumbs access more or less fixed
- 13:31 hashar: deployment-upload is rejecting connections on port 80. Applying role::beta::uploadservice from 122786
- 13:30 manybubbles: upgrading labs Elasticsearch to 1.1.0
- 13:06 hashar: Applying role::beta::natfix on deployment-upload.eqiad.wmflabs . Might let it access images from commons.wikimedia.beta.wmflabs.org ( ex: http://upload.beta.wmflabs.org/wikipedia/commons/thumb/4/43/Feed-icon.svg/16px-Feed-icon.svg.png yields: Error retrieving thumbnail from scaling server: couldn't connect to host commons.wikimedia.beta.wmflabs.org )
- 08:31 hashar: MediaWiki config paths tweaks for Math [[bugzilla:63331|bug 63331]] and Captchas [[bugzilla:63342|bug 63342]]
- 00:32 bd808: Converted deployment-graphite to use local puppet & salt masters
March 31
- 21:02 hashar: Making the Parsoid daemon write its logs to /data/project/parsoid/parsoid.log 122561
- 20:47 hashar: Puppet master is fixed. The certificates got badly messed up, had to regenerate them following the documentation "Regenerate Certificates for Puppet Master"
- 20:17 hashar: restarted parsoid daemon
- 20:00 hashar: stopped parsoid. It was killing the application servers
- 19:53 hashar: restarting both apaches
- 19:21 hashar: restarting job service on jobrunner01 to apply 122436
- 19:20 hashar: Unbreak puppetmaster on deployment-salt.eqiad.wmflabs
- 19:01 hashar: puppet master is broken :(
- 17:39 hashar: lowering # of jobs spawned by the jobrunner 122436
- 16:00 bd808: Restarted logstash service on deployment-logstash1; no new log events seen since 2014-03-28T10:57
- 15:58 bd808: Updated kibana on deployment-logstash1 to e317bc6
- 15:56 hashar_: Cluster slow because some CirrusSearch job is spamming simplewiki. Gotta find a way to throttle the number of jobs being run on jobrunner01 or add more apache boxes. It is transient anyway, might look at limiting the runs tonight
- 15:10 hashar_: Rebased puppet repository. Only one hack left: https://gerrit.wikimedia.org/r/#/c/119534/
- 14:20 hashar: deleting deployment-parsoidcache01 cache the hard way: stopping varnish, deleting files in /srv/vdb/, starting varnish
- 14:05 hashar: shutting down database and apache boxes for now.
- 14:03 hashar: shutting down varnish instances in pmtpa
- 13:56 hashar: Deleted deployment-cache-upload01 , replaced by deployment-cache-upload02
- 13:52 hashar: upload varnish cache working :-]
- 13:47 hashar: applying role::cache::upload to deployment-cache-upload02
- 13:37 hashar: migrating deployment-cache-upload02.eqiad.wmflabs to self puppet/salt master
- 13:22 hashar: Creating deployment-cache-upload02 to replace deployment-cache-upload01 which was missing the security group "web"
- 11:30 hashar: Update DNS entries to point to EQIAD instances (aka switching beta cluster to eqiad)
March 28
- 16:18 hashar: rebased puppet on deployment-salt
- 15:39 hashar: Last log made to wrong project
- 15:39 hashar: deleting instance integration-selenium-driver, no longer needed. browsertests jobs should now be runnable on integration-slave1001 and integration-slave1002 (in eqiad)
- 10:54 hashar: deleting instance integration-debian-builder. That is breaking all debian-glue jobs. Will revisit later next week to get pbuilder/cowbuilder set up on the other eqiad slaves
- 08:48 hashar: deleting integration-slave-pbuilder. Unneeded (I need a coffee)
- 08:43 hashar: Created integration-slave-pbuilder on eqiad to replace pmtpa instance integration-debian-builder
- 00:23 bd808: `sudo chmod -R a+rwx /data/project/upload7`; We need to get this file permissions thing figured out
March 27
- 15:23 hashar: role::beta::natfix cant run on deployment-bastion.eqiad because the ferm rules conflicts with the Augeas rules coming from udp2log :-(
- 15:21 hashar: applying role::beta::natfix on deployment-bastion.eqiad
- 14:58 hashar: fixed up role::beta::natfix. Ferm is now being applied again on various application server instances (121378)
- 13:58 hashar: rebased puppetmaster git repository, reapplied ottomata live hacks.
- 12:55 hashar: mediawiki l10n cache being rebuilt!!!
- 12:54 hashar: Fixed permissions on eqiad bastion for /srv/scap. Others (such as mwdeploy) could not read / execute the scap scripts (sketch below)
- 11:29 hashar: MediaWiki code and configuration are now self updating on EQIAD cluster via Jenkins jobs. First run: https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/4/console
- 11:11 hashar: deleting job beta-code-update , replaced by datacenter variants beta-code-update-pmtpa and beta-code-update-eqiad
- 10:54 hashar: Deleting job beta-update-databases , replaced by datacenter variants beta-update-databases-pmtpa and beta-update-databases-eqiad
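(The /srv/scap permission fix from the 12:54 entry above was presumably something along these lines; the exact mode was not logged, so the o+rX is an assumption:)

```bash
# Let "others" (e.g. mwdeploy) read everything and traverse/execute what is
# executable; capital X only affects directories and files that already
# carry an execute bit, so plain data files stay non-executable.
sudo chmod -R o+rX /srv/scap
```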
March 26
- 19:05 bd808: Added ottomata as a project member and admin
- 15:46 springle: deployment-db1 data loaded
- 14:45 bd808: created proxy https://logstash-beta.wmflabs.org for logstash instance
- 14:17 hashar: fixed up redis configuration in eqiad. Jobrunner is happy now: aawiki-504cd7d2: 0.9649 21.5M Creating a new RedisConnectionPool instance with id 627014d. 121060
- 14:05 hashar: udp2log functional on eqiad beta cluster \O/
- 13:55 hashar: stopping udp2log on eqiad bastion, starting udp2log-mw (really should fix that issue one day)
- 13:52 hashar: dropped some live hacks on eqiad in /data/project/apache/common-local and ran git pull
- 13:14 hashar: Dropping enwikivoyage and dewikivoyage databases from sql02. Related changes are updating the Jenkins config: https://gerrit.wikimedia.org/r/#/c/121045/ and cleaning up the mw-config : https://gerrit.wikimedia.org/r/#/c/121047/
- 07:53 springle: installed mariadb via puppet on deployment-db1. no data yet
March 25
- 19:43 hashar: created jenkins slave deployment-bastion.eqiad
- 17:17 hashar: Created and validated job that updates Parsoid on the EQIAD beta cluster \O/
March 24
- 23:16 marktraceur: Touching all the MMV scripts because they're not getting invalidated or something
- 23:10 hashar: l10n cache got broken due to a PHP fatal error I introduced. It is back up now. Found out via https://integration.wikimedia.org/dashboard/
- 23:09 hashar: upgraded all pmtpa varnishes, ran puppet on all of them. all set!
- 22:57 hashar: restarting deployment-cache-upload04, apparently stalled
- 22:48 hashar: upgrading varnish on all pmtpa caches.
- 22:47 hashar: apt-get upgrade varnish on deployment-cache-bits03
- 22:45 marktraceur: attempted restart of varnish on betalabs; seems to have failed, trying again
- 22:42 hashar: made marktraceur a project admin and granted sudo rights
- 22:39 marktraceur: Restarting betalabs varnish to workaround https://bugzilla.wikimedia.org/show_bug.cgi?id=63034
- 17:25 bd808: Converted deployment-db1.eqiad.wmflabs to use local puppet & salt masters
- 17:06 bd808: Changed rules in sql security group to use CIDR 10.0.0.0/8.
- 17:05 bd808: Changed rules in search security group to use CIDR 10.0.0.0/8.
- 17:05 bd808: Built deployment-elastic04.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
- 16:19 bd808: Built deployment-elastic03.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
- 16:08 bd808: Built deployment-elastic02.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
- 15:54 bd808: Built deployment-elastic01.eqiad.wmflabs with local salt/puppet master, secondary disk on /var/lib/elasticsearch and role::elasticsearch::server
- 10:31 hashar: migrated deployment-solr to self puppet/salt masters
March 21
- 09:29 hashar: l10ncache is now rebuilt properly: https://integration.wikimedia.org/ci/job/beta-code-update/53508/console
- 09:23 hashar: fixing l10ncache on deployment-bastion: ran chown -R l10nupdate:l10nupdate /data/project/apache/common-local/php-master/cache/l10n (the l10nupdate UID/GID have been changed and are now in LDAP)
March 20
- 23:46 bd808: Mounted secondary disk as /var/lib/elasticsearch on deployment-logstash1
- 23:46 bd808: Converted deployment-tin to use local puppet & salt masters
- 22:09 hashar: Migrated videoscaler01 to use self salt/puppet masters.
- 21:30 hashar: manually installing timidity-daemon on jobrunner01.eqiad so puppet can stop it and stop whining
- 21:00 hashar: migrate jobrunner01.eqiad.wmflabs to self puppet/salt masters
- 20:55 hashar: deleting deployment-jobrunner02, let's start with a single instance for now
- 20:51 hashar: Creating deployment-jobrunner01 and 02 in eqiad.
- 15:47 hashar: fixed salt-minion service on deployment-cache-upload01 and deployment-cache-mobile03 by deleting /etc/salt/pki/minion/minion_master.pub (sketch below)
- 15:30 hashar: migrated deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs to use the salt/puppetmaster deployment-salt.eqiad.wmflabs.
- 15:30 hashar: deployment-cache-upload01.eqiad.wmflabs and deployment-cache-mobile03.eqiad.wmflabs recovered!! /dev/vdb does not exist on eqiad, which caused the instances to stall.
- 10:48 hashar: Stopped the simplewiki script. Would need to recreate the db from scratch instead
- 10:37 hashar: Cleaning up simplewiki by deleting most pages in the main namespace. Would free up some disk space. deleteBatch.php is running in a screen on deployment-bastion.pmtpa.wmflabs
- 10:08 hashar: applying role::labs::lvm::mnt on deployment-db1 to provide additional disk space on /mnt
- 09:39 hashar: convert all remaining hosts but db1 to use the local puppet and salt masters
- 04:40 springle: created deployment-db1 for mariadb master in eqiad
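(The salt fix from the 15:47 entry above generalises: the minion caches the master's public key, so after an instance is pointed at a new master the stale key makes authentication fail. A sketch, assuming the stock init scripts:)

```bash
sudo service salt-minion stop
sudo rm /etc/salt/pki/minion/minion_master.pub   # stale key from the old master
sudo service salt-minion start                   # minion re-fetches the new master's key
```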
March 19
- 21:23 bd808: Converted deployment-cache-text02 to use local puppet & salt masters
- 20:21 hashar: migrating eqiad varnish caches to use xfs
- 17:58 bd808: Converted deployment-parsoid04 to use local puppet & salt masters
- 17:51 bd808: Converted deployment-eventlogging02 to use local puppet & salt masters
- 17:22 bd808: Converted deployment-cache-bits01 to use local puppet & salt masters; puppet:///volatile/GeoIP not found on deployment-salt puppetmaster
- 17:00 bd808: Converted deployment-apache02 to use local puppet & salt masters
- 16:49 bd808: Converted deployment-apache01 to use local puppet & salt masters
- 16:30 hashar: Varnish caches in eqiad are failing puppet because there is no /dev/vdb. Will figure it out tomorrow :-]
- 16:15 hashar: Applying role::logging::mediawiki::errors on deployment-fluoride.eqiad.wmflabs. It is not receiving anything yet though.
- 15:50 hashar: fixed the udp2log-mw daemon not starting on the eqiad bastion (/var/log/udp2log belonged to the wrong UID/GID; sketch below)
- 15:49 hashar: deleted local user l10nupdate on deployment-bastion. It is in ldap now.
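(For the 15:50 udp2log-mw fix, a sketch of the repair; the assumption that the daemon runs as a udp2log user is mine, the exact ownership used was not logged:)

```bash
sudo chown -R udp2log:udp2log /var/log/udp2log   # re-own after the UID/GID change
sudo /etc/init.d/udp2log-mw start
```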
March 18
- 03:31 bd808: deployment-bastion now using deployment-salt as puppet master
March 17
- 15:02 hashar: Started copying /data/project from pmtpa to eqiad
- 14:46 hashar: manually purging all commonswiki archived files (on beta of course)
March 14
- 14:47 hashar: changing uid/gid of mwdeploy, which is now provisioned via LDAP (aka deleting the local user and group on all instances + file permission tweaks)
March 11
- 10:46 hashar: dropping some unused databases from deployment-sql instance.
March 10
- 11:09 hashar: Deleting http://simple.wikipedia.beta.wmflabs.org/wiki/MediaWiki:Robots.txt
- 09:54 hashar: Reducing memcached instances to 3GB (115617). Seems to fix writing to the EQIAD memcaches, which only have 3GB
- 09:08 hashar: Restarted bits cache (CPU / mem overload)
March 6
- 09:07 hashar: restarted varnish and varnish-frontend on deployment-cache-text1
March 5
- 17:26 hashar: hacked mwversioninuse to return "master=aawiki". Relaunched the l10n job using the mwdeploy user and then ran mw-update-l10n
- 17:07 hashar: mwversioninuse gives a wmf branch instead of master. That breaks l10n messages update and the job https://integration.wikimedia.org/ci/job/beta-code-update/ . Root cause is the python based scap.
March 3
- 17:28 manybubbles: doing an Elasticsearch reindex on beta before I try another one in production
February 28
- 10:17 hashar: Puppet running on varnish upload cache after several months. Might break random things in the process :(
February 27
- 14:11 manybubbles: upgrading beta to Elasticsearch 1.0
February 26
- 20:44 hashar: Cleaning up commonswiki archived files with mwscript deleteArchivedFiles.php --wiki=commonswiki --delete
- 20:44 hashar: deleted all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload (gwtoolset import test). Deleted File:Title_0* (Selenium tests).
- 15:06 hashar: deleted all thumbs from the shared directory: /data/project/upload7/*/*/thumb/* (sketch below)
- 14:54 hashar: cleaning out 2013 archived logs.
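(For the 15:06 thumb purge: a glob that wide can blow past the shell's argument-length limit; a find-based equivalent is safer. Sketch:)

```bash
# Same effect as: rm -rf /data/project/upload7/*/*/thumb/*
# but immune to "argument list too long"
find /data/project/upload7 -mindepth 4 -maxdepth 4 \
    -path '/data/project/upload7/*/*/thumb/*' -exec rm -rf {} +
```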
February 25
- 08:42 hashar: Upgrading all varnishes.
February 24
- 23:36 MaxSem: Rolled back
- 23:25 hoo: recursively chowned extensions/MobileFrontend to mwdeploy:mwdeploy
- 23:21 hoo: chowned /data/project/apache/common-local/php-master/extensions/.git/modules/MobileFrontend/* to mwdeploy:mwdeploy
- 17:47 MaxSem: Investigating a mobile bug, might cause intermittent problems
- 17:36 MaxSem: Rebooted deployment-cache-mobile01 - was impossible to log into it though Varnish still worked
February 21
- 19:42 MaxSem: Adjusted read privs on /home/wikipedia/syslog/apache.log to allow fatalmonitor to work
February 19
- 16:24 hashar: -bastion: /etc/init.d/udp2log stop && /etc/init.d/udp2log-mw start (known bug)
- 16:23 hashar: rebooting -bastion
- 16:22 hashar: rebooting apache32 and apache33, breaking beta :-]
February 17
- 15:26 hashar: rebooting bits cache
February 11
- 21:55 manybubbles: update elasticsearch schema after recent changes. will run a links update as well
February 6
- 22:20 Krinkle: Manually ran changePassword.php to help someone (password reminder emails don't get sent)
- 14:43 hashar: restarting udp2log-mw on deployment-bastion; logstash.wmflabs.org had not been receiving fatal logs since Jan 31st
February 4
- 17:22 hashar: fixed up beta-parsoid-update job so Parsoid should be up to date again. The issue is that the multigit job pointed to a wrong host (ZUUL_URL should be zuul.eqiad.wmnet)
- 13:33 hashar: removing role::memcached from both apache servers
- 09:58 hashar: rebooting all varnish caches
- 09:57 hashar: Upgrading all varnish
February 3
- 16:59 hashar: upgrading varnish on deployment-parsoidcache3
January 30
- 19:35 hashar: restarted gmond on deployment-cache-bits03 (it was leaking memory). Upgrading varnish
- 19:32 hashar: Canceled the varnish package upgrade on deployment-cache-mobile01; it runs a specific version (3.0.5plus~wmftest-wm1) instead of 3.0.3plus~rc1-wm29 (see the package-hold sketch below)
- 19:30 hashar: upgrading varnish on deployment-cache-mobile01
- 19:29 hashar: upgrading varnish on deployment-cache-bits03
- 19:29 hashar: upgrading varnish on deployment-staging-cache-mobile02
- 19:28 hashar: upgrading varnish on deployment-cache-upload04
- 19:27 hashar: reenabling puppet on deployment-cache-mobile01
- 17:10 manybubbles: done reindexing beta. everything looks good
- 16:54 manybubbles: reindexing beta like we're going to do in production when the release train departs later today
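(Re the 19:32 entry: to keep a blanket apt-get upgrade from clobbering the hand-pinned varnish build on deployment-cache-mobile01, the package can be put on hold. Sketch:)

```bash
# Mark varnish as held so "apt-get upgrade" skips it on this instance
echo 'varnish hold' | sudo dpkg --set-selections
dpkg --get-selections varnish   # verify: should print "varnish ... hold"
```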
January 28
- 17:10 hashar: added addshore and jhall to project so they can grep logs
January 27
- 15:17 hashar: applying role::beta::fatalmonitor puppet class on deployment-bastion (bug 60046)
January 23
- 19:38 hashar: VisualEditor was not being updated properly because some files belonged to root instead of mwdeploy. Ran chown -R mwdeploy:mwdeploy /data/project/apache/common-local/php-master/extensions/VisualEditor
January 16
- 20:54 manybubbles: turning on elasticsearch's disk-space-aware allocator
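(In 0.90.x-era Elasticsearch the disk-space-aware allocator is a dynamic cluster setting; a sketch of turning it on, with illustrative watermark values rather than the ones actually used:)

```bash
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "0.85",
    "cluster.routing.allocation.disk.watermark.high": "0.90"
  }
}'
```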
January 15
- 21:14 manybubbles: finished updating to elasticsearch 0.90.10
- 08:48 andrewbogott: rebooted deployment-cache-text1
January 2
- 15:32 hashar: Migrated parsoid on deployment-parsoid2 to use mediawiki/services/parsoid out of checkouts made in /srv/deployment/parsoid/{parsoid,deploy}. No job self-updating it yet
- 15:00 manybubbles: finished upgrading Elasticsearch in beta. We're on 0.90.9 now.
- 14:07 hashar: running mw-update-l10n , it was broken because of https://gerrit.wikimedia.org/r/#/c/104741/ fixed up by https://gerrit.wikimedia.org/r/#/c/104953/
- 13:54 manybubbles: upgrading Elasticsearch servers in beta
December 26
- 18:54 manybubbles: performing in place index rebuild for wikis in beta after recent cirrus update
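(An in-place rebuild here refers to CirrusSearch's index maintenance script; a sketch of the per-wiki invocation from that era, with the flags to be taken as an assumption rather than a record of what was run:)

```bash
# Rebuild the index in place and swap it in once populated (per wiki)
mwscript extensions/CirrusSearch/maintenance/updateSearchIndexConfig.php \
    --wiki=enwiki --reindexAndRemoveOk --indexIdentifier now
```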
December 23
- 20:40 anomie: Restarting mw-job-runner service on deployment-jobrunner08, since jobs don't seem to be running
- 20:03 anomie: Restarting apache on deployment-apache33 to see if that clears the odd errors going on
December 18
- 10:56 hashar: reenabling puppet on parsoid2 and deploying the new Parsoid upstart configuration 99656