Nova Resource:Tools/SAL

2016-01-12

  • 09:48 scfc_de: tools-checker-01: Removed exim paniclog (OOM).

2016-01-11

  • 22:19 valhallasw`cloud: reset maxujobs 0->128, job_load_adjustments none->np_load_avg=0.50, load_ad... -> 0:7:30
  • 22:12 YuviPanda: restarted gridengine master again
  • 22:07 valhallasw`cloud: set job_load_adjustments from np_load_avg=0.50 to none and load_adjustment_decay_time to 0:0:0
  • 22:05 valhallasw`cloud: set maxujobs back to 0, but doesn't help
  • 21:57 valhallasw`cloud: reset to 7:30
  • 21:57 valhallasw`cloud: that cleared the measure, but jobs still not starting. Ugh!
  • 21:56 valhallasw`cloud: set job_load_adjustments_decay_time = 0:0:0
  • 21:45 YuviPanda: restarted gridengine master
  • 21:43 valhallasw`cloud: qstat -j <jobid> shows all queues overloaded; seems to have started just after a load test for the new maxujobs setting
  • 21:42 valhallasw`cloud: resetting to 0:7:30, as it's not having the intended effect
  • 21:41 valhallasw`cloud: currently 353 jobs in qw state
  • 21:40 valhallasw`cloud: that's load_adjustment_decay_time
  • 21:40 valhallasw`cloud: temporarily sudo qconf -msconf to 0:0:1
  • 19:59 YuviPanda: Set maxujobs (max concurrent jobs per user) on gridengine to 128
  • 17:51 YuviPanda: kill all queries running on labsdb1003
  • 17:20 YuviPanda: stopped webservice for quentinv57-tools
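
The maxujobs, job_load_adjustments and load_adjustment_decay_time values changed above are all fields of the grid engine scheduler configuration. As a rough sketch (not the exact commands run that night), such changes are made with qconf on a grid admin host:

    # Show the current scheduler configuration, including maxujobs,
    # job_load_adjustments and load_adjustment_decay_time.
    qconf -ssconf

    # Edit the scheduler configuration interactively (opens $EDITOR); this is
    # how values like "maxujobs 128" or "load_adjustment_decay_time 0:7:30"
    # get set or reverted.
    sudo qconf -msconf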

2016-01-09

  • 21:07 valhallasw`cloud: moved tools-checker/208.80.155.229 back to tools-checker-01
  • 21:02 andrewbogott: rebooting tools-checker-01 as it is unresponsive.
  • 13:12 valhallasw`cloud: tools-worker-1002. is unresponsive. Maybe that's where the other grrrit-wm is hiding? Rebooting.

2015-12-30

  • 04:06 YuviPanda: delete all webgrid jobs to start with a clean slate
  • 03:54 YuviPanda: qmod -rj all tools in the continuous queue, they are all orphaned
  • 02:39 YuviPanda: remove lbenedix and ebekebe from tools.hcclab
  • 00:40 YuviPanda: restarted master on grid-master
  • 00:40 YuviPanda: copied and cleaned out spooldb
  • 00:10 YuviPanda: reboot tools-grid-shadow
  • 00:08 YuviPanda: attempt to stop shadowd
  • 00:03 YuviPanda: attempting to start gridengine-master on tools-grid-shadow
  • 00:00 YuviPanda: kill -9'd gridengine master

2015-12-29

  • 23:31 YuviPanda: rebooting tools-grid-master
  • 23:22 YuviPanda: restart gridengine-master on tools-grid-master
  • 00:18 YuviPanda: shut down redis on tools-redis-01

2015-12-28

  • 22:34 chasemp: attempt to unmount nfs volumes on tools-redis-01 to debug but it hangs (I am on console and see root at console hang on login)
  • 22:31 YuviPanda: disable NFS on tools-redis-1001 and 1002
  • 21:32 YuviPanda: disable puppet on tools-redis-01 and -02
  • 21:27 YuviPanda: created tools-redis-1001

2015-12-23

  • 21:21 YuviPanda: deleted tools-worker-01 to -05, creating tools-worker-1001 to 1005
  • 21:19 valhallasw`cloud: tools-proxy-01: umount /home /data/project /data/scratch /public/dumps
  • 19:01 valhallasw`cloud: ah, connections that are kept open. A new incognito window is routed correctly.
  • 18:59 valhallasw`cloud: switched to -02, worked correctly, switched back. Switching back does not seem to fully work?!
  • 18:40 valhallasw`cloud: scratch that, first going to eat dinner
  • 18:38 valhallasw`cloud: dynamicproxy ban system deployed on tools-proxy-02 working correctly for localhost; switching over users there by moving the external IP.
  • 14:42 valhallasw`cloud: toollabs homepage is unhappy because tools.xtools-articleinfo is using a lot of cpu on tools-webgrid-lighttpd-1409. Checking to see what's happening there.
  • 10:46 YuviPanda: migrate tools-worker-01 to 3.19 kernel

2015-12-22

  • 18:30 YuviPanda: rescheduling all webservices
  • 18:17 YuviPanda: failed over active proxy to proxy-01
  • 18:12 YuviPanda: upgraded kernel and rebooted tools-proxy-01
  • 01:42 YuviPanda: rebooting tools-worker-08

2015-12-21

  • 18:44 YuviPanda: reboot tools-proxy-01
  • 18:31 YuviPanda: failover proxy to tools-proxy-02

2015-12-20

  • 00:00 YuviPanda: tools-worker-08 stuck again :|

2015-12-18

  • 15:16 andrewbogott: rebooting locked up host tools-exec-1409

2015-12-16

  • 23:14 andrewbogott: rebooting tools-exec-1407, unresponsive
  • 22:48 YuviPanda: run qmod -c '*' to clear error state on gridengine
  • 21:28 andrewbogott: deleted tools-docker-registry-01
  • 16:24 andrewbogott: rebooting tools-exec-1221 as it was in kernel lockup

2015-12-12

  • 10:08 YuviPanda: restarted cron on tools-submit

2015-12-10

  • 12:47 valhallasw`cloud: broke tools-proxy-02 login (for valhallasw, root still works) by restarting nslcd. Restarting; current proxy is -01.

2015-12-07

  • 13:46 Coren: The new grid masters are happy, killing the old ones (-shadow, -master)
  • 10:46 YuviPanda: restarted nscd on tools-proxy-01

2015-12-06

  • 10:29 YuviPanda: did webservice start on tool 'derivative', was missing service.manifest

2015-12-04

  • 19:33 Coren: switching master role to tools-grid-master
  • 04:42 yuvipanda: disabled puppet on tools-puppetmaster-01 because everything sucks
  • 04:09 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/256618 to tools-puppetmaster-01

2015-12-02

  • 18:29 Coren: switching gridmaster activity to tools-grid-shadow
  • 05:13 yuvipanda: increased security groups quota to 50 because why not

2015-12-01

  • 21:07 yuvipanda: added bd808 as admin
  • 21:01 andrewbogott: deleted tool/service group tools.test300

2015-11-25

  • 15:42 Coren: migrating tools-web-static-02 to labvirt1010 to free space on labvirt1002

2015-11-20

  • 22:02 Coren: tools-webgrid-lighttpd-1412 tools-webgrid-lighttpd-1413 tools-webgrid-lighttpd-1414 tools-webgrid-lighttpd-1415 done and back in rotation.
  • 21:46 Coren: tools-webgrid-lighttpd-1411 tools-webgrid-lighttpd-1211 done and back in rotation.
  • 21:30 Coren: tools-webgrid-lighttpd-1410 tools-webgrid-lighttpd-1210 done and back in rotation.
  • 21:25 Coren: tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1209 done and back in rotation.
  • 21:13 Coren: tools-webgrid-lighttpd-1408 tools-webgrid-lighttpd-1208 done and back in rotation.
  • 20:58 Coren: tools-webgrid-lighttpd-1407 tools-webgrid-lighttpd-1207 done and back in rotation.
  • 20:53 Coren: tools-webgrid-lighttpd-1406 tools-webgrid-lighttpd-1206 done and back in rotation.
  • 20:41 Coren: tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1205 tools-webgrid-generic-1405 done and back in rotation.
  • 20:28 Coren: tools-webgrid-lighttpd-1404 tools-webgrid-lighttpd-1204 tools-webgrid-generic-1404 done and back in rotation.
  • 19:49 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1203 tools-webgrid-generic-1403
  • 19:25 Coren: -lighttpd-1403 wants a restart.
  • 19:15 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1202 tools-webgrid-generic-1402
  • 18:55 Coren: Putting -lighttpd-1401 -lighttpd-1201 -generic-1401 back in rotation, disabling the others.
  • 18:24 Coren: Beginning draining web nodes; -lighttpd-1401 -lighttpd-1201 -generic-1401
  • 18:10 Coren: disabling puppet on the grid nodes listed at https://phabricator.wikimedia.org/P2337 so that the /tmp change in https://gerrit.wikimedia.org/r/#/c/252506/ do not apply early and break services
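
The node-by-node rotation above follows the usual drain/repool pattern for web grid hosts. A minimal sketch, assuming queue instances named queue@host as elsewhere in this log (the job-listing command is an assumption, not a transcript):

    HOST=tools-webgrid-lighttpd-1401.eqiad.wmflabs   # example host

    # Take the host out of rotation so no new jobs land on it.
    sudo qmod -d "*@${HOST}"

    # Reschedule the jobs still running there; webservices restart on other nodes.
    for job in $(qstat -u '*' -s r | awk -v h="@${HOST}" '$0 ~ h {print $1}'); do
        sudo qmod -rj "$job"
    done

    # ...apply the change (here, the /tmp change after re-enabling puppet),
    # then put the host back in rotation.
    sudo qmod -e "*@${HOST}"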

2015-11-17

  • 19:39 YuviPanda: created tools-worker-03 to be k8s worker node
  • 19:34 YuviPanda: blanked 'realm' for tools-bastion-01 to figure out what happens

2015-11-03

  • 03:59 scfc_de: tools-submit, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411: Removed exim paniclog (OOM).

2015-11-02

  • 22:57 YuviPanda: pooled tools-webgrid-lighttpd-1413
  • 22:10 YuviPanda: created tools-webgrid-lighttpd-1414 and 1415
  • 22:04 YuviPanda: created tools-webgrid-lighttpd-1412 and 1413
  • 19:53 YuviPanda: drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
  • 19:50 YuviPanda: drain webgrid-lighttpd-1408 of jobs

2015-10-26

  • 20:53 YuviPanda: updated 6.9 ssh backport to all trusty hosts

2015-10-11

  • 22:54 yuvipanda: delete service.manifest for tool wikiviz to prevent it from attempting to be started. It set itself up for nodejs but didn't actually have any code

2015-10-06

  • 04:35 yuvipanda: created tools-puppetmaster-02 as hot spare

2015-10-02

  • 17:30 scfc_de: tools-webgrid-lighttpd-1402: Removed exim paniclog (OOM).

2015-10-01

  • 23:38 yuvipanda: actually rebooting tools-worker-02, had actually rebooted -01 earlier #facepalm
  • 23:20 yuvipanda: rebooting tools-worker-02 to pickup new kernel
  • 23:10 yuvipanda: failed over tools-proxy-01 to -02, restarting -01 to pick up new kernel
  • 22:58 yuvipanda: rebooted tools-proxy-02 to pick up new kernel

2015-09-30

  • 07:12 yuvipanda: deleted tools-webproxy-01 and -02, running on proxy-01 and -02 now
  • 06:40 yuvipanda: migrated webproxy to tools-proxy-01

2015-09-29

  • 12:08 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).

2015-09-28

  • 15:24 Coren: rebooting tools-shadow after mount option changes.

2015-09-25

  • 16:02 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).

2015-09-24

  • 14:06 scfc_de: tools-exec-1201: Restarted grid engine exec for T109485.
  • 13:56 scfc_de: tools-master: Restarted grid engine master for T109485.

2015-09-16

  • 17:33 scfc_de: Removed python-tools-webservice from precise-tools as apparently old version of tools-webservice.
  • 01:17 YuviPanda: attempting to move grrrit-wm to kubernetes
  • 01:17 YuviPanda: attempting to move to kubernetes

2015-09-15

  • 01:18 scfc_de: Added unixodbc_2.2.14p2-5_amd64.deb back to precise-tools to diagnose if it is related to T111760.

2015-09-14

  • 23:47 scfc_de: Archived unixodbc_2.2.14p2-5_amd64 from deb-precise and aptly, no reference in Puppet or Phabricator and same version as distribution.

2015-09-13

  • 20:53 scfc_de: Archived lua-json_1.3.2-1 from labsdebrepo and aptly, upgraded manually to Trusty's new 1.3.1-1ubuntu0.1~ubuntu14.04.1, restarted nginx on tools-webproxy-01 and tools-webproxy-02, checked that proxy and localhost:8081/list works.
  • 20:42 scfc_de: rm -f /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist on all hosts (cf. T110055).

2015-09-11

  • 14:54 scfc_de: tools-webgrid-lighttpd-1403: Removed exim paniclog (OOM).

2015-09-08

  • 08:05 valhallasw`cloud: Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated.
    Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated.
  • 08:04 valhallasw`cloud: added all packages in data/project/.system/deb-precise to aptly repo precise-tools
  • 08:03 valhallasw`cloud: added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools
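
For reference, the aptly steps behind the entries above look roughly like this; the repo names and package directories are the ones logged, but treat the exact invocations as a sketch rather than a transcript:

    # Import every package from the shared deb directories into the local repos.
    aptly repo add trusty-tools  /data/project/.system/deb-trusty
    aptly repo add precise-tools /data/project/.system/deb-precise

    # Refresh the already-published repositories so the new packages show up;
    # this prints the "Publish for local repo ... has been successfully
    # updated" messages quoted above.
    aptly publish update trusty-tools
    aptly publish update precise-tools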

2015-09-07

  • 18:49 valhallasw`cloud: ran sudo mount -o remount /data/project on tools-static-01, which also solved the issue, so skipping the reboot
  • 18:47 valhallasw`cloud: switched static webserver to tools-static-02
  • 18:45 valhallasw`cloud: weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting.
  • 17:57 YuviPanda: created tools-k8s-master-01 with jessie, will be etcd and kubernetes master

2015-09-03

  • 07:09 valhallasw`cloud: and just re-running puppet solves the issue. Sigh.
  • 07:09 valhallasw`cloud: last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file
  • 07:07 valhallasw`cloud: err, is empty.
  • 07:07 valhallasw`cloud: Puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?!

2015-09-02

  • 15:01 scfc_de: Added -M option to qsub call for crontab of tools.sdbot.
  • 13:58 valhallasw`cloud: rebooting tools-exec-1403; https://phabricator.wikimedia.org/T107052 happening, also causing significant NFS server load
  • 13:55 valhallasw`cloud: restarted gridengine_exec on tools-exec-1403
  • 13:53 valhallasw`cloud: tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py. Rescheduled that job.
  • 13:16 YuviPanda: deleted all jobs of ralgisbot
  • 13:12 YuviPanda: suspended all jobs in ralgisbot temporarily
  • 12:57 YuviPanda: rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles

2015-09-01

  • 21:01 valhallasw`cloud: killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
  • 16:12 scfc_de: tools-bastion-01: Killed bot of tools.cobain.
  • 15:47 valhallasw`cloud: git reset --hard cdnjs on tools-web-static-01
  • 06:23 valhallasw`cloud: seems to have worked. SGE :(
  • 06:17 valhallasw`cloud: going to restart sge_qmaster, hoping this solves the issue :/
  • 06:08 valhallasw`cloud: e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
  • 06:06 valhallasw`cloud: test job does not get submitted because all queues are overloaded?!
  • 06:06 valhallasw`cloud: investigating SGE issues reported on irc/email
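
The "overloaded" messages above come from the scheduler's load adjustment: for every recently dispatched job it adds job_load_adjustments (np_load_avg=0.50 here) to the host's load, scaled by the processor count, and only lets that bonus decay over load_adjustment_decay_time. A reading of the quoted numbers, plus the commands to inspect the knobs involved (an interpretation, not an official formula):

    # Quoted by the scheduler: np_load_avg=1.82 >= load threshold 1.75, which
    # reads as
    #   1.82 = 0.07 + 0.50 * 14 / 4
    #        = real normalized load
    #          + job_load_adjustments * (recently dispatched job slots, presumably) / nproc
    # so 14 freshly dispatched jobs on a 4-core node push the *virtual* load
    # over the limit even though the real load is only ~0.3.

    qconf -ssconf | grep -E 'job_load_adjustments|load_adjustment_decay_time'
    qconf -sq task | grep load_thresholds   # 'task' is one of the queue names used here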

2015-08-31

  • 23:20 scfc_de: Changed host name tools-webgrid-generic-1405 in "qconf -mq webgrid-generic" to fix the "au" state of the queue on that host.
  • 21:21 valhallasw`cloud: webservice: error: argument server: invalid choice: 'generic' (choose from 'lighttpd', 'tomcat', 'uwsgi-python', 'nodejs', 'uwsgi-plain') (for tools.javatest)
  • 21:20 valhallasw`cloud: restarted webservicemonitor
  • 21:19 valhallasw`cloud: seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
  • 21:18 valhallasw`cloud: running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
  • 21:15 valhallasw`cloud: several webservices seem to actually have not gotten back online?! what on earth is going on.
  • 21:10 valhallasw`cloud: some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
  • 20:29 valhallasw`cloud: |sort is not so spread out in terms of affected hosts because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
  • 20:25 valhallasw`cloud: ca 500 jobs @ 5s/job = approx 40 minutes
  • 20:23 valhallasw`cloud: doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
  • 20:21 valhallasw`cloud: now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
  • 19:36 valhallasw`cloud: last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
  • 19:35 valhallasw`cloud: one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
  • 19:31 valhallasw`cloud: https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues
  • 07:31 valhallasw`cloud: removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
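
The paced restart described in the 19:31-20:29 entries amounts to walking a pre-built, sorted list of job ids and rescheduling them one at a time with a delay. A sketch, using the job list file and 5-second interval mentioned above and the accounting file path referenced elsewhere in this log:

    # /home/valhallaw/webgrid_jobs: one webgrid job id per line, sorted so that
    # consecutive restarts spread across different queues/hosts.
    while read -r jobid; do
        sudo qmod -rj "$jobid"   # ask grid engine to restart the job elsewhere
        sleep 5                  # pace the restarts so the master keeps up
    done < /home/valhallaw/webgrid_jobs

    # Meanwhile, watch the accounting log for jobs that die instead of restarting.
    tail -f /var/lib/gridengine/default/common/accounting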

2015-08-30

  • 13:23 valhallasw`cloud: killed wikibugs-backup and grrrit-wm on tools-webproxy-01
  • 13:20 valhallasw`cloud: disabling 503 error page

2015-08-29

  • 04:09 scfc_de: Disabled queue webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs (qmod -d) because I can't ssh to it and jobs deployed there fail with "failed assumedly before job:can't get password entry for user".

2015-08-27

  • 15:00 valhallasw`cloud: killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again

2015-08-26

  • 01:10 scfc_de: Felt lucky: kill -STOP bigbrother on tools-submit, installed I00cd7a90273e0d745699855eb671710afb4e85a7 on tools-services-02 and service bigbrothermonitor start. If it goes berserk, please service bigbrothermonitor stop.

2015-08-25

  • 20:23 scfc_de: tools-webgrid-generic-1405: killall mpt-statusd.
  • 14:58 YuviPanda: pooled in two new instances for the precise exec pool
  • 14:45 YuviPanda: reboot tools-exec-1221
  • 14:26 YuviPanda: rebooting tools-exec-1220 because NFS wedge...
  • 14:18 YuviPanda: pooled in tools-webgrid-generic-1405
  • 10:16 YuviPanda: created tools-webgrid-generic-1405
  • 10:04 YuviPanda: apply exec node puppet roles to tools-exec-1220 and -1221
  • 09:59 YuviPanda: created tools-exec-1220 and -1221

2015-08-24

  • 16:37 valhallasw`cloud: more processes were started, so added a talk page message on User:Coet (who was starting the processes according to /var/log/auth.log) and using 'write coet' on tools-bastion-01
  • 16:15 valhallasw`cloud: kill -9'ing because normal killing doesn't work
  • 16:13 valhallasw`cloud: killing all processes of tools.cobain which are flooding tools-bastion-01

2015-08-20

  • 18:44 valhallasw`cloud: both are now at 3dbbc87
  • 18:43 valhallasw`cloud: running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
  • 18:42 valhallasw`cloud: tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
  • 18:39 valhallasw`cloud: cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
  • 17:06 valhallasw`cloud: wait, what timezone is this?!

2015-08-19

  • 10:45 valhallasw`cloud: ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406

2015-08-18

  • 15:53 scfc_de: Added valhallasw as grid manager (qconf -am valhallasw).
  • 14:42 scfc_de: tools-webgrid-lighttpd-1411: Killed mpt-statusd (T104779).
  • 13:57 valhallasw`cloud: same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
  • 13:55 valhallasw`cloud: no, wait, that's tools-webgrid-lighttpd-1411.eqiad.wmflabs, not the actual host tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs. We should fix that dns mess as well.
  • 13:54 valhallasw`cloud: tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.
  • 13:47 valhallasw`cloud: that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state
  • 13:46 valhallasw`cloud: starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done
  • 08:37 valhallasw`cloud: sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs tools-webgrid-lighttpd-1406.eqiad.wmflabs tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
  • 08:33 valhallasw`cloud: tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)
  • 08:30 valhallasw`cloud: hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config
  • 08:21 valhallasw`cloud: still sudo qmod -e "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" -> invalid queue "*@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs"
  • 08:20 valhallasw`cloud: sudo qconf -mhgrp "@webgrid", added tools-webgrid-lighttpd-1411.eqiad.wmflabs
  • 08:14 valhallasw`cloud: and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻
  • 08:10 valhallasw`cloud: /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list.
  • 08:07 valhallasw`cloud: sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list
  • 08:06 valhallasw`cloud: ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so.
  • 08:04 valhallasw`cloud: installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps.
  • 08:00 valhallasw`cloud: running puppet agent -tv again
  • 07:55 valhallasw`cloud: argh. Disabling toollabs::node::web::generic again and enabling toollabs::node::web::lighttpd
  • 07:54 valhallasw`cloud: various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in
  • 07:53 valhallasw`cloud: Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory --- ran sudo touch /usr/lib/adminbot/README
  • 07:37 valhallasw`cloud: applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411
  • 07:31 valhallasw`cloud: reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing?
  • 07:26 valhallasw`cloud: andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that...
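
Condensing the 07:26-08:37 walkthrough above, adding a freshly built node as an exec host boiled down to roughly the following (a reconstruction from the log entries, not a verified runbook); the hostgroup step is where the .tools.eqiad.wmflabs vs .eqiad.wmflabs mismatch bit:

    FQDN=tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs

    # Register the machine as an exec host, from the file puppet generates
    # under /var/lib/gridengine/etc/exechosts/ once the right roles are applied.
    sudo qconf -Ae "/var/lib/gridengine/etc/exechosts/${FQDN}"

    # The webgrid-lighttpd queue uses the @webgrid hostgroup as its host list,
    # so the host (with the *same* FQDN spelling) has to be added there too.
    sudo qconf -mhgrp @webgrid   # opens the hostgroup in $EDITOR
    qconf -shgrp @webgrid        # verify

    # Finally, enable the host's queue instances.
    sudo qmod -e "*@${FQDN}"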

2015-08-17

  • 19:00 scfc_de: tools-checker-01, tools-exec-1410, tools-exec-catscan, tools-redis-01, tools-redis-02, tools-web-static-01, tools-webgrid-lighttpd-1406, tools-webproxy-02: Remounted /public/dumps (T109261).
  • 16:17 andrewbogott: disable queues for tools-exec-1205 tools-exec-1207 tools-exec-1208 tools-exec-140 tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-exec-catscan tools-web-static-01 tools-webgrid-lighttpd-1201 tools-webgrid-lighttpd-1205 tools-webgrid-lighttpd-1206 tools-webgrid-lighttpd-1406 tools-webproxy-02
  • 15:33 andrewbogott: re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
  • 14:50 andrewbogott: killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01

2015-08-15

  • 05:14 andrewbogott: resumed tools-exec-gift, seems not to have been the culprit
  • 05:10 andrewbogott: suspending tools-exec-gift, just for a moment...

2015-08-14

  • 17:21 andrewbogott: disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
  • 15:20 andrewbogott: Adding back to the grid engine queue: tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
  • 14:43 andrewbogott: killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407

2015-08-13

  • 18:51 valhallasw`cloud: which was resolved by scfc earlier
  • 18:50 valhallasw`cloud: tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid
    Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists))
  • 18:08 scfc_de: scfc@tools-exec-1201: Removed stale /var/lib/puppet/state/agent_catalog_run.lock; Puppet run was started Aug 12 15:06:08, instance was rebooted ~ 15:14.
  • 16:44 andrewbogott: disabling job queue for tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
  • 14:48 andrewbogott: and tools-webgrid-lighttpd-1408
  • 14:48 andrewbogott: rescheduling (and in some cases killing) jobs on tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405

2015-08-12

  • 16:05 andrewbogott: depooling tools-exec-1203 tools-exec-1210 tools-exec-1214 tools-exec-1402 tools-exec-1405 tools-exec-gift tools-services-01 tools-web-static-02 tools-webgrid-generic-1403 tools-webgrid-lighttpd-1204 tools-webgrid-lighttpd-1209 tools-webgrid-lighttpd-1401 tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1408
  • 15:20 valhallasw`cloud: re-enabling queues on restarted hosts
  • 14:41 andrewbogott: forcing reschedule of jobs on tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410

2015-08-11

  • 18:17 andrewbogott: depooling tools-exec-1201 tools-exec-1202 tools-exec-1204 tools-exec-1206 tools-exec-1209 tools-exec-1213 tools-exec-1217 tools-exec-1218 tools-exec-1408 tools-webgrid-generic-1404 tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1410 in anticipation of labvirt1001 reboot tomorrow

2015-08-04

  • 13:43 scfc_de: Fixed owner of ~tools.kasparbot/error.log (T99576).

2015-08-03

  • 19:13 andrewbogott: deleted tools-static-01

2015-08-01

  • 18:09 andrewbogott: depooling/rebooting tools-webgrid-lighttpd-1407 because it’s unable to fork
  • 16:54 scfc_de: tools-webgrid-lighttpd-1407: Removed exim paniclog (OOM).

2015-07-30

  • 15:00 andrewbogott: rebooting tools-bastion-01 aka tools-login
  • 14:46 scfc_de: tools-webgrid-lighttpd-1408, tools-webgrid-lighttpd-1409: Removed exim paniclog (OOM).
  • 02:53 scfc_de: "webservice uwsgi-python start" for blogconverter.
  • 02:40 scfc_de: qdel 545479 (hazard-bot, "release=trusty-quiet", stuck since July 9th).
  • 02:39 scfc_de: qdel 301895 (projanalysis, "release=trust", stuck since July 1st).
  • 02:38 scfc_de: tools-webgrid-generic-1401, tools-webgrid-generic-1402, tools-webgrid-generic-1403: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).
  • 01:41 scfc_de: tools-webgrid-lighttpd-1406: Rebooted for T107052 (disabled queue, killall -TERM lighttpd, let tools-manifest restart webservices elsewhere, reboot, enabled queue).

2015-07-29

  • 23:43 andrewbogott: draining, rebooting tools-webgrid-lighttpd-1408
  • 20:11 andrewbogott: rebooting tools-webgrid-lighttpd-1404
  • 19:58 scfc_de: tools-*: sudo rmdir /etc/ssh/userkeys/ubuntu{/.ssh{/authorized_keys\ {/public{/keys{/ubuntu{/.ssh,},},},},},}

2015-07-28

  • 17:49 valhallasw`cloud: Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
  • 17:43 valhallasw`cloud: rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
  • 17:16 valhallasw`cloud: disabled queue "webgrid-lighttpd@tools-webgrid-lighttpd-1401.eqiad.wmflabs"
  • 02:07 YuviPanda: removed pacct files from tools-bastion-01

2015-07-27

  • 21:27 valhallasw`cloud: turned off process accounting on tools-login while we try to find the root cause of phab:T107052:
    accton off

2015-07-19

  • 01:51 scfc_de: tools-bastion-01: Removed exim paniclog (OOM).

2015-07-11

  • 00:01 mutante: fixing puppet runs on tools-webgrid-* via salt

2015-07-10

  • 23:59 mutante: fixing puppet runs on tools-exec via salt
  • 20:09 valhallasw`cloud: it took three of us, but adminbot is updated!

July 6

  • 09:49 valhallasw`cloud: 10:14 <jynus> s51053 is abusing his/her access to replica dbs and creating lag for other users. His/her queries are to be terminated. (= tools.jackbot / user jackpotte)

July 2

  • 17:07 valhallasw`cloud: can't login to tools-mailrelay-01., probably because puppet was disabled for too long. Deleting instance.
  • 16:12 valhallasw`cloud: I mean tools-bastion-01
  • 16:12 valhallasw`cloud: stopping puppet on tools-login and tools-mail to check for changes in deploying https://gerrit.wikimedia.org/r/#/c/205914/

June 29

  • 17:29 YuviPanda: failed over tools webproxy to tools-webproxy-02

June 21

  • 18:57 scfc_de: tools-precise-dev: apt-get purge python-ldap3 (the previous fix for "Cache has broken packages, exiting" didn't work).
  • 16:39 scfc_de: tools-precise-dev: apt-get clean ("Cache has broken packages, exiting").
  • 16:33 scfc_de: tools-submit: Removed exim4 paniclog (OOM).

June 19

  • 15:07 YuviPanda: remounting /data/scratch

June 10

  • 11:52 YuviPanda: tools-trusty be gone

June 8

  • 16:31 YuviPanda: added Nova Tools Bot as admin, for automated nova API access

June 7

  • 17:05 YuviPanda: killed sort /data/project/templatetiger/public_html/dumps/ruwiki-2015-03-24.txt -k4,4 -k2,2 -k3,3n -k5,5n -t? -o /data/project/templatetiger/public_html/dumps/sort/ruwiki-2015-03-24.txt -T /data/project/templatetiger to rescue NFS

June 5

  • 17:44 YuviPanda: migrate tools-shadow to labvirt1002

June 2

  • 18:34 Coren: rebooting tools-webgrid-lighttpd-1406.eqiad.wmflabs
  • 16:27 YuviPanda: cleaned out /etc/hosts file on tools-shadow
  • 16:20 Coren: switching back to tools-master
  • 16:10 YuviPanda: restart nscd on tools-submit
  • 15:54 Coren: Switching names for tools-exec-1401
  • 15:43 Coren: adding the "new" exec nodes (aka, current nodes with new names)
  • 14:34 YuviPanda: turned off dnsmasq for toollabs
  • 13:54 Coren: adding new-style names for submit hosts
  • 13:53 YuviPanda: moved tools-master / shadow to designate
  • 13:52 Coren: new-style names for gridengin admin hosts added
  • 13:28 Coren: sge_shadowd started a new master as expected, after /two/ timeouts of 60s (unexpected)
  • 13:23 Coren: stracing the shadowd to see what's up; master is down as expected.
  • 13:17 Coren: killing the sge_qmaster to test failover
  • 12:56 YuviPanda: switched labs webproxies to designate, forcing puppet run and restarting nscd

May 29

  • 13:39 YuviPanda: tools-redis-01 is redis master now
  • 13:35 YuviPanda: enable puppet on all hosts, redis move-around completed
  • 13:01 YuviPanda: recreating tools-redis-01 and -02
  • 12:52 YuviPanda: disable puppet on all toollabs hosts for tools-redis update
  • 12:27 YuviPanda: created two redis instances (tools-redis-01 and tools-redis-02), beginning to set up stuff

May 28

  • 12:22 wm-bot: petrb: inserted some local IP's to hosts file
  • 12:15 wm-bot: petrb: shutting nscd off on tools-master
  • 12:14 wm-bot: petrb: test
  • 11:28 petan: syslog is full of these May 28 11:27:36 tools-master nslcd[1041]: [81823a] <group=550> error writing to client: Broken pipe
  • 11:25 petan: rebooted tools-master in order to try fix that network issues

May 27

  • 20:10 LostPanda: disabled puppet on tools-shadow too
  • 19:46 LostPanda: echo -n 'tools-master.eqiad.wmflabs' > /var/lib/gridengine/default/common/act_qmaster haaail someone?
  • 19:10 YuviPanda: reverted gridengine-common on tools-shadow to 6.2u5-4 as well, to match tools-master
  • 18:58 YuviPanda: rebooting tools-master after switchover failed and it cannot seem to do DNS

May 23

  • 19:56 scfc_de: tools-webgrid-lighttpd-1410: Removed exim4 paniclog (OOM).

May 22

  • 20:37 yuvipanda: deleted and depooled tools-exec-07

May 20

  • 20:09 yuvipanda: transient shinken puppet alerts because I tried to force puppet runs on all tools hosts but cancelled
  • 20:01 yuvipanda: enabling puppet on all hosts
  • 20:01 yuvipanda: tested new /etc/hosts on tools-bastion-01, puppet run produced no diffs, all good
  • 19:56 yuvipanda: copy cleaned up and regenerated /etc/hosts from tools-precise-dev to all toollabs hosts
  • 19:54 yuvipanda: copy cleaned up hosts file to /etc/hosts on tools-precise-dev
  • 19:54 yuvipanda: enabled puppet on tools-precise-dev
  • 19:33 yuvipanda: disabling puppet on *all* hosts for https://gerrit.wikimedia.org/r/#/c/210000/
  • 06:21 yuvipanda: killed a bunch of webservice jobs stuck in dRr state

May 19

  • 21:06 yuvipanda: failed over services to tools-services-02, -01 was refusing to start some webservices with permission denied errors for setegid
  • 20:16 yuvipanda: qdel -f for all webservice jobs that were in dr state
  • 20:12 yuvipanda: force killed croptool webservice

May 18

  • 01:36 yuvipanda: created new tools-checker-01, applying role and provisioning
  • 01:32 yuvipanda: killed tools-checker-01 instance, recreating

May 15

  • 12:06 valhallasw: killed those perl scripts; kmlexport's lighttpd is also using excessive memory (5%), so restarting that
  • 12:01 valhallasw: webgrid-lighttpd-1402 puppet failure caused by major memory usage; tools.kmlexport is running heavy perl scripts
  • 00:27 yuvipanda: cleared graphite data for /var/* mounts on tools-redis

May 14

  • 21:53 valhallasw: shut down & removed "tools-exec-08.eqiad.wmflabs" from execution host list
  • 21:11 valhallasw: forced rescheduling of (non-cont) welcome.py job (iluvatarbot, jobid 8869)
  • 03:29 yuvipanda: drained, depooled and deleted tools-exec-15

May 10

  • 22:08 yuvipanda: created tools-precise-dev instance
  • 09:28 yuvipanda: cleared and depooled tools-exec-02 and -13. only job running was deadlocked for a long, long time (week)
  • 05:47 scfc_de: tools-submit: Removed paniclog (OOM) and stopped apache2.

May 5

  • 18:50 Betacommand: helperbot (WP:AVI bot) was running logged out and its owner is MIA; Coren killed the job from 1204 and commented out the crontab

May 4

  • 21:24 yuvipanda: reboot tools-submit, was stuck

May 2

  • 10:21 yuvipanda: drained all the old webgrid nodes, pooled in all the new webgrid nodes! POTATO!
  • 10:13 yuvipanda: cleaned out webgrid jobs from tools-webgrid-03
  • 10:12 yuvipanda: pooled tools-webgrid-lighttpd-{06-10}
  • 08:56 yuvipanda: drained and deleted tools-webgrid-01
  • 07:31 yuvipanda: depooled and deleted tools-webgrid-{01,02}
  • 07:31 yuvipanda: disabled catmonitor task / cron, was heavily using an sqlite db on NFS
  • 06:56 yuvipanda: pooled tools-webgrid-generic-{01-04}
  • 03:44 yuvipanda: drained and deleted old trusty webgrid tools-webgrid-{05-07}
  • 02:13 yuvipanda: created tools-webgrid-lighttpd-12{01-05} and tools-webgrid-generic-14{01-04}
  • 01:59 yuvipanda: created tools-webgrid-lighttpd-14{01-10}
  • 01:58 yuvipanda: increased tools instance quota

May 1

  • 03:55 YuviKTM: depooled and deleted tools-exec-20
  • 03:54 YuviKTM: killed final job in tools-exec-20 (9911317), decommissioning node

April 30

  • 19:33 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
  • 19:31 YuviKTM: depooled and deleted tools-exec-01, -05, -06 and -11.
  • 06:30 YuviKTM: added public IPs for all exec nodes so IRC tools continue to work. Removed all associated hostnames, let’s not do those
  • 06:13 YuviKTM: allocating new floating IPs for the new instances, because IRC bots need them.
  • 05:42 YuviKTM: disabled and drained tools-exec-1{1-5} of continuous jobs
  • 05:40 YuviKTM: pooled in tools-exec-121{1-9}
  • 05:39 YuviKTM: rebooted tools-exec-121{1-9} instances so they can apply gridengine-common properly
  • 05:39 YuviKTM: created new instances tools-exec-121{1-9} as precise
  • 05:39 YuviKTM: killed tools-dev, nobody still ssh’d in, no crontabs
  • 05:39 YuviKTM: depooled exec-{06-10}, rejigged jobs to newer nodes
  • 05:39 YuviKTM: delete tools-exec-10, was out of jobs
  • 04:28 YuviKTM: deleted tools-exec-09
  • 04:27 YuviKTM: depooled tools-exec-09.eqiad.wmflabs
  • 04:23 YuviKTM: repooled tools-exec-1201; it's all good now
  • 04:19 YuviKTM: rejuggle jobs again in trustyland
  • 04:14 YuviKTM: repooled tools-exec-09, apt troubles fixed
  • 04:08 YuviKTM: depooled tools-exec-09, apt troubles
  • 04:04 YuviKTM: pooled tools-exec-1408 and tools-exec-1409
  • 04:00 YuviKTM: pooled tools-exec-1406 and 1407
  • 03:58 YuviKTM: pooled tools-exec-12{02-10}, forgot to put appropriate roles on 1201, fixing now
  • 03:54 YuviKTM: tools-exec-03 and -04 have been deleted a long time ago
  • 03:53 YuviKTM: depooled tools-exec-03 / 04
  • 03:31 YuviKTM: depooled and deleted tools-exec-12 had nothing on it
  • 03:28 YuviKTM: deleted tools-exec-21 to 24, one task still running on tools-exec
  • 03:24 YuviKTM: disabled and drained continuous tasks off tools-exec-20 to tools-exec-24
  • 03:18 YuviKTM: pooled tools-exec-1403, 1404
  • 03:13 YuviKTM: pooled tools-exec-1402
  • 03:07 YuviKTM: pooled tools-exec-1405
  • 03:04 YuviKTM: pooled tools-exec-1401
  • 02:53 YuviKTM: created tools-exec-14{06-10}
  • 02:14 YuviKTM: created tools-exec-14{01-05}
  • 01:09 YuviPanda: killing local copy of python-requests, there seems to be a newer version in prod

April 29

  • 19:33 valhallasw`cloud: re-created tools-mailrelay-01 with precise: Nova_Resource:I-00000bca.eqiad.wmflabs
  • 19:30 YuviPanda: set appropriate classes for recreated tools-exec-12* nodes
  • 19:28 YuviPanda: recreated tools-static-02
  • 19:11 YuviPanda: failed over tools-static to tools-static-01
  • 14:47 andrewbogott: deleting tools-exec-04
  • 14:44 Coren: -exec-04 drained; removed from queues. Rest well, old friend.
  • 14:41 Coren: disabled -exec-04 (going away)
  • 02:35 YuviPanda: set tools-exec-12{01-10} to configure as exec nodes
  • 02:27 YuviPanda: created tools-exec-12{01-10}

April 28

  • 21:41 andrewbogott: shrinking tools-master
  • 21:33 YuviPanda: failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work
  • 21:32 andrewbogott: shrinking tools-redis
  • 21:28 YuviPanda: attempting to failover gridengine to tools-shadow
  • 21:27 andrewbogott: shrinking tools-submit
  • 21:21 YuviPanda: backup crontabs onto NFS
  • 21:18 andrewbogott: shrinking tools-webproxy-02
  • 21:14 andrewbogott: shrinking tools-static-01
  • 21:11 andrewbogott: shrinking tools-exec-gift
  • 21:06 YuviPanda: failover tools-webproxy to tools-webproxy-01
  • 21:06 andrewbogott: stopping, shrinking and starting tools-exec-catscan
  • 21:01 YuviPanda: failover tools-static to tools-static-02
  • 20:53 andrewbogott: stopping, shrinking, restarting tools-shadow
  • 20:43 andrewbogott: stopping, shrinking, starting tools-static-02
  • 20:39 valhallasw`cloud: created tools-mailrelay-01 Nova_Resource:I-00000bac.eqiad.wmflabs
  • 20:26 YuviPanda: failed over tools-services to services-01
  • 18:11 Coren: reenabled -webgrid-generic-02
  • 18:05 Coren: reenabled -webgrid-03, -webgrid-08, -webgrid-generic-01; drained -webgrid-generic-02
  • 17:44 Coren: -webgrid-03, -webgrid-08 and -webgrid-generic-01 drained
  • 14:04 Coren: reenable -exec-11 for jobs.
  • 13:55 andrewbogott: stopping tools-exec-11 for a resize experiment

April 25

  • 01:32 YuviPanda: deleted tools-static, tools-static-01 has taken over
  • 01:02 YuviPanda: deleted tools-login, tools-bastion-01 has been running for long enough

April 24

  • 16:29 Coren: repooled -exec-02, -08, -12
  • 16:05 Coren: -exec-02, -08 and -12 draining
  • 15:54 Coren: reenabled tools-exec-07, -10 and -11 after reboot of host
  • 15:41 Coren: -exec-03 goes away for good.
  • 15:31 Coren: draining -exec-03 to ease migration
  • 13:43 Coren: draining tools-exec-07,10,11 to allow virt host reboot

April 23

  • 22:41 YuviPanda: disabled *@tools-exec-09
  • 22:40 YuviPanda: add tools-exec-09 back to @general
  • 22:38 YuviPanda: take tools-exec-09 from @general group
  • 20:53 YuviPanda: restart bigbrother
  • 20:28 YuviPanda: restarted nscd on tools-login and tools-dev
  • 20:22 valhallasw`cloud: removed 10.68.16.4 tools-webproxy tools.wmflabs.org from /etc/hosts
  • 13:17 andrewbogott: beginning migration of tools instances to labvirt100x hosts
  • 01:00 YuviPanda: good bye tools-login.eqiad.wmflabs

April 20

  • 13:38 scfc_de: tools-mail: Removed paniclog and killed superfluous exim.

April 18

  • 20:09 YuviPanda: sysctl vm.overcommit_memory=1 on tools-redis to allow it to bgsave again
  • 19:52 valhallasw`cloud: tools-redis unresponsive (T96485); rebooting

April 17

  • 01:48 YuviPanda: disable puppet on live webproxy (-01) to apply firewall changes to -02

April 16

  • 20:57 Coren: -webgrid-08 drained, rebooting
  • 20:46 Coren: -webgrid-03 repooled, depooling -webgrid-08
  • 20:45 Coren: -webgrid-03 drained, rebooting
  • 20:38 Coren: -webgrid-03 depooled
  • 20:38 Coren: -webgrid-02 repooled
  • 20:35 Coren: -webgrid-02 drained, rebooting
  • 20:33 Coren: -webgrid-02 depooled
  • 20:32 Coren: -webgrid-01 repooled
  • 20:06 Coren: -webgrid-01 drained, rebooting.
  • 19:56 Coren: depooling -webgrid-01 for reboot
  • 14:37 Coren: rebooting -master
  • 14:29 Coren: rebooting -mail
  • 14:22 Coren: rebooting -shadow
  • 14:22 Coren: -exec-15 repooled
  • 14:19 Coren: -exec-15 drained, rebooting.
  • 13:46 Coren: -exec-14 repooled. That's it for general exec nodes.
  • 13:44 Coren: -exec-14 drained, rebooting.

April 15

  • 21:06 Coren: -exec-10 repooled
  • 20:55 Coren: -exec-10 drained, rebooting
  • 20:49 Coren: -exec-07 repooled.
  • 20:47 Coren: -exec-07 drained, rebooting
  • 20:43 Coren: -exec-06 requeued
  • 20:41 Coren: -exec-06 drained, rebooting
  • 20:15 Coren: repool -exec-05
  • 20:10 Coren: -exec-05 drained, rebooting.
  • 19:56 Coren: -exec-04 repooled
  • 19:52 Coren: -exec-04 drained, rebooting.
  • 19:41 Coren: disabling new jobs on remaining (exec) precise instances
  • 19:32 Coren: repool -exec-02
  • 19:30 Coren: draining -exec-04
  • 19:29 Coren: -exec-02 drained, rebooting
  • 19:28 Coren: -exec-03 rebooted, requeing
  • 19:26 Coren: -exec-03 drained, rebooting
  • 18:50 Coren: dequeuing tools-exec-03 whilst waiting for -02 to drain.
  • 18:43 Coren: tools-exec-01 back sans idmap, returning to pool
  • 18:40 Coren: tools-exec-01 drained of jobs; rebooting
  • 18:39 YuviPanda: disabled puppet on running webproxy, tools-webproxy-01
  • 18:25 Coren: disabled -exec-01 and -exec-02 to new jobs.

April 14

  • 13:13 scfc_de: tools-submit: Removed exim paniclog (OOM doom).
  • 13:13 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

April 13

  • 21:11 YuviPanda: restart portgranter on all webgrid nodes

April 12

  • 10:52 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 11

  • 21:49 andrewbogott: moved /data/project/admin/toollabs to /data/project/admin/toollabsbak on tools-webproxy-01 and tools-webproxy-02 to fix permission errors
  • 02:15 YuviPanda: rebooted tools-submit, was not responding

April 10

  • 07:10 PissedPanda: take out tools-services-01 to test switchover and also to recreate as small
  • 05:20 YuviPanda: delete the tomcat node finally :D

April 9

  • 23:24 scfc_de: rm -f /puppet_{host,service}groups.cfg on all hosts (apparently a Puppet/hiera mishap last November).
  • 23:11 scfc_de: tools-webgrid-04: Rescheduled all jobs running on this instance (T95537).
  • 08:32 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

April 8

  • 13:25 scfc_de: Repaired servicegroups repository and restarted toolhistory job; was stuck at 2015-03-29T09:15:05Z (NFS?).
  • 12:01 scfc_de: Removed empty tools with no maintainers javed/javedbaker/shell.
  • 09:10 scfc_de: Removed stale proxy entries for analytalks/anno/commons-coverage/coursestats/eagleeye/hashtags/itwiki/mathbot/nasirkhanbot/rc-vikidia/wikistream.

April 7

  • 07:42 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

April 5

  • 10:11 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 4

  • 22:48 scfc_de: Removed zombie jobs (qdel 1991607,1994800,1994826,1994827,2054201,3449476,3450329,3451518,3451549,3451590,3451628,3451635,3451830,3451869,3452632,3452633,3452654,3452655,3452657,3452668,4218785,4219210,4219674,4219722,4219791,4219923,4220646).
  • 08:49 scfc_de: tools-submit: Restarted bigbrother because it didn't notice admin's .bigbrotherrc.
  • 08:49 scfc_de: Add webservice to .bigbrotherrc for admin tool.
  • 03:35 scfc_de: Deployed jobutils/misctools 1.5 (T91954).

April 3

  • 22:55 scfc_de: Removed empty cgi-bin directories.
  • 20:35 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

April 2

  • 20:07 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
  • 20:06 scfc_de: tools-submit: Removed exim paniclog (OOM).
  • 01:25 YuviPanda: created tools-bastion-02

April 1

  • 00:14 scfc_de: tools-webgrid-03: Rebooted, was stuck on console input when unable to mount NFS on boot (per wikitech console output).

March 31

  • 14:02 Coren: rebooting tools-submit
  • 07:07 YuviPanda: moved tools.wmflabs.org to tools-webproxy-01
  • 07:02 YuviPanda: reboot tools-webgrid-03 and tools-exec-03
  • 00:21 andrewbogott: temporarily shutting ‘toolsbeta-pam-sshd-motd-test’ down to conserve resources. It can be restarted any time.

March 30

  • 22:53 Coren: resyncing project storage with rsync
  • 22:40 Coren: reboot tools-login
  • 22:30 Coren: also bastion2
  • 22:28 Coren: reboot bastion1 so users can log in
  • 21:49 Coren: rebooting dedicated exec nodes.
  • 21:49 Coren: rebooting tools-submit
  • 17:27 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

March 29

  • 19:30 scfc_de: tools-submit: Restarted bigbrother for T90384.

March 28

  • 19:42 YuviPanda: created tools-exec-20

March 26

  • 21:24 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 25

  • 16:49 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).

March 24

  • 16:03 scfc_de: tools-login: Removed exim paniclog (entries from Sunday).
  • 15:51 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 23

  • 21:23 scfc_de: tools-login, tools-dev, tools-trusty: Now actually disabled role::labs::bastion per T93661 :-).
  • 21:08 scfc_de: tools-login, tools-dev, tools-trusty: role::labs::bastion is still enabled due to T93663.
  • 20:57 scfc_de: tools-login, tools-dev, tools-trusty: Disabled role::labs::bastion per T93661.
  • 03:02 andrewbogott: wiped out atop.log on tools-dev because /var was filling up

March 22

  • 23:08 scfc_de: qconf -ah tools-bastion-01.eqiad.wmflabs
  • 23:07 scfc_de: for host in {tools-bastion-01,tools-webgrid-07,tools-webgrid-generic-{01,02}}.eqiad.wmflabs; do qconf -as "$host"; done
  • 23:07 yuvipanda: copied /etc/hosts into place on tools-bastion-01

March 21

  • 16:18 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.

March 15

  • 22:38 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.

March 13

  • 16:23 YuviPanda: cleaned out / on tools-trusty

March 11

  • 04:28 YuviPanda: tools-redis is back now, as trusty and hopefully slightly more fortified
  • 04:14 YuviPanda: kill tools-redis instance, upgrade to trusty while it is down anyway
  • 03:56 YuviPanda: restarted redis server, it had OOM-killed

March 9

  • 11:02 scfc_de: Deleted probably outdated proxy entry for tool wp-signpost and restarted webservice.
  • 10:22 scfc_de: Deleted obsolete proxy entries without webservice for tools bracketbot/herculebot/extreg-wos/pirsquared/searchsbl/translate/yifeibot.
  • 10:11 scfc_de: Restarted webservices for tools blahma/catmonitor/catscan2/contributions-summary/eagleeye/imagemapedit/jackbot/tb-dev/vcat/wikihistory/xtools-ec (cf. T91939).
  • 08:27 scfc_de: qmod -cq webgrid-lighttpd@tools-webgrid-03.eqiad.wmflabs (OOM of two jobs in the past).

March 7

  • 12:17 scfc_de: Moved obsolete packages that are installed on no instance at all from /data/project/.system/deb to ~tools.admin/archived-packages.

March 6

  • 07:46 scfc_de: Set role::labs::tools::toolwatcher for tools-login.
  • 07:43 scfc_de: Deployed jobutils/misctools 1.4.

March 1

  • 15:11 YuviPanda|brb: pooled in tools-webgrid-07 to lighty webgrid, moving some tools off -05 and -06 to relieve pressure

February 28

  • 07:51 YuviPanda: create tools-webgrid-07
  • 01:00 Coren: Set vm.overcommit_memory=0 on -webgrid-05 (also trusty)
  • 01:00 Coren: Also That was -webgrid-05
  • 00:59 Coren: set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down.

February 27

  • 17:53 YuviPanda: increased quota to 512G RAM and 256 cores
  • 15:33 Coren: Switched back to -master. I'm making a note here: great success.
  • 15:27 Coren: Gridengine master failover test part three; killing the master with -9
  • 15:20 Coren: Gridengine master failover test part deux - now with verbose logs
  • 15:10 YuviPanda: created tools-webgrid-generic-02
  • 15:10 YuviPanda: increase instance quota to 64
  • 15:10 Coren: Master restarted - test not successful.
  • 14:50 Coren: testing gridengine master failover starting now
  • 08:27 YuviPanda: restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well
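
Background for the failover tests above: the active master is whichever host is named in the act_qmaster file (the path shown in the May 27 entry further up), and sge_shadowd on the shadow host promotes itself when the master stops responding. A minimal sketch of the check/kill cycle used in such a test:

    # Which host currently holds the master role?
    cat /var/lib/gridengine/default/common/act_qmaster

    # On that host, kill the qmaster hard to simulate a crash ...
    sudo pkill -9 sge_qmaster

    # ... then watch from the shadow host: after sge_shadowd's timeout it
    # rewrites act_qmaster and starts its own sge_qmaster.
    watch cat /var/lib/gridengine/default/common/act_qmaster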

February 24

  • 18:33 Coren: tools-submit not recovering well from outage, kicking it.
  • 17:58 YuviPanda: rebooting *all* webgrid jobs on toollabs

February 16

  • 02:31 scfc_de: rm -f /var/log/exim4/paniclog.

February 13

  • 18:01 Coren: tools-redis is dead, long live tools-redis
  • 17:48 Coren: rebuilding tools-redis with moar ramz
  • 17:38 legoktm: redis on tools-redis is OOMing?
  • 17:26 marktraceur: restarting grrrit-wm because it's not behaving

February 1

  • 10:55 scfc_de: Submitted dummy jobs for tools ftl/limesmap/newwebtest/osm-add-tags/render/tsreports/typoscan/usersearch to get bigbrother to recognize those users and cleaned up output files afterwards.
  • 07:51 YuviPanda: cleared error state of stuck queues
  • 06:41 YuviPanda: set chmod +xw manually on /var/run/lighttpd on webgrid-05, need to investigate why it was necessary
  • 05:47 YuviPanda: completed migrating magnus' tools to trusty, more details at https://etherpad.wikimedia.org/p/tools-trusty-move
  • 05:37 YuviPanda: added tools-webgrid-06 as trusty webnode, operational now
  • 04:52 YuviPanda: migrating all of magnus’ tools, after consultation with him (https://etherpad.wikimedia.org/p/tools-trusty-move for status)
  • 04:10 YuviPanda: widar moved to trusty
  • 03:01 YuviPanda: ran salt -G 'instanceproject:tools' cmd.run 'sudo rm -rf /var/tmp/core' because disks were getting full.

January 29

  • 17:26 YuviPanda: reschedule all tomcat jobs

January 27

  • 23:27 YuviPanda: qdel -f 7662482 7661111 for Merlissimo

January 19

  • 20:51 YuviPanda: because valhallasw is nice
  • 10:34 YuviPanda: manually started tools-webgrid-generic-01
  • 09:48 YuviPanda: restarted tools-webgrid-03
  • 08:42 scfc_de: qmod -cq {continuous,mailq,task}@tools-exec-{06,10,11,15}.eqiad.wmflabs
  • 08:36 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog and killed second exim (belated SAL amendment).

January 16

  • 22:11 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.

January 15

  • 22:10 YuviPanda: created instance tools-webgrid-generic-01

January 11

  • 06:38 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog.

January 8

  • 07:40 YuviPanda: increase memory limit for autolist from 4G to 7G

December 23

  • 06:00 YuviPanda: tools-uwsgi-01 randomly went to SHUTOFF state, rebooting from virt1000

December 22

  • 07:43 YuviPanda: increased RAM and Cores quota for tools

December 19

  • 16:38 YuviPanda: puppet disabled on tools-webproxy because urlproxy.lua is hand-hacked to remove stupid syntax errors that got merged.
  • 12:00 YuviPanda|brb: created tools-static, static http server
  • 07:07 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

December 17

  • 22:38 YuviPanda: touched /data/project/repo/Packages so tools-webproxy stops complaining about that not existing and never running apt-get

December 12

  • 14:08 scfc_de: Ran Puppet on all hosts to fix puppet-run issue.

December 11

  • 07:58 YuviPanda: rebooted tools-login, wasn’t responsive.

December 8

  • 00:15 YuviPanda: killed all db and tools-webproxy aliases in /etc/hosts on tools-webproxy; otherwise puppet fails because ec2id thinks we're not in labs, because hostname -d is empty, because we set /etc/hosts to resolve the IP directly to tools-webproxy
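
To illustrate the dependency chain in the entry above: hostname -d is derived from the canonical name the resolver returns for the machine's own hostname, and /etc/hosts wins over DNS, so a bare short-name entry leaves the domain empty. An illustrative sketch (the 10.68.16.4 / tools-webproxy names are the ones that appear elsewhere in this log; the exact file contents are assumptions):

    # Bad: the canonical name has no domain part.
    #   10.68.16.4   tools-webproxy
    hostname -f   # -> tools-webproxy
    hostname -d   # -> (empty), so the ec2id/labs detection in puppet breaks

    # Good: fully-qualified name first, short name as an alias.
    #   10.68.16.4   tools-webproxy.eqiad.wmflabs tools-webproxy
    hostname -f   # -> tools-webproxy.eqiad.wmflabs
    hostname -d   # -> eqiad.wmflabs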

December 7

  • 06:31 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).
  • 06:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).

December 2

  • 21:31 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (multiple exim4 processes, again).
  • 21:30 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 26

  • 19:26 YuviPanda: created tools-webgrid-05 on trusty to set up a working webnode for trusty

November 25

  • 06:53 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 24

  • 14:02 YuviPanda: rebooting tools-login, OOM'd
  • 02:51 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 22

  • 19:05 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM, again).

November 17

  • 20:40 YuviPanda: cleaned out /tmp on tools-login

November 16

  • 21:31 matanya: back to normal
  • 21:27 matanya: "Could not resolve hostname bastion.wmflabs.org"

November 15

  • 07:24 YuviPanda|zzz: move coredumps from tools-webgrid-04 to /home/yuvipanda

November 14

  • 20:23 YuviPanda: cleared out coredumps on tools-webgrid-01 to free up space
  • 18:26 YuviPanda: cleaned out core dumps on tools-webgrid
  • 16:55 scfc_de: tools-webgrid-02: rm -f /var/log/exim4/paniclog (OOM).

November 13

  • 21:11 YuviPanda: disable puppet on tools-dev to check shinken
  • 21:00 scfc_de: qmod -cq continuous@tools-exec-09,continuous@tools-exec-11,continuous@tools-exec-13,continuous@tools-exec-14,mailq@tools-exec-09,mailq@tools-exec-11,mailq@tools-exec-13,mailq@tools-exec-14,task@tools-exec-06,task@tools-exec-09,task@tools-exec-11,task@tools-exec-13,task@tools-exec-14,task@tools-exec-15,webgrid-lighttpd@tools-webgrid-01,webgrid-lighttpd@tools-webgrid-02,webgrid-lighttpd@tools-webgrid-04 (fallout from /var being full).
  • 20:38 YuviPanda: didn't actually stop puppet, need more patches
  • 20:38 YuviPanda: stopping puppet on tools-dev to test shinken
  • 15:30 scfc_de: tools-exec-06, tools-webgrid-01: rm -f /var/tmp/core/*.
  • 13:31 scfc_de: tools-exec-09, tools-exec-11, tools-exec-13, tools-exec-14, tools-exec-15, tools-webgrid-02, tools-webgrid-04: rm -f /var/tmp/core/*.

November 12

  • 22:07 StupidPanda: enabled puppet on tools-exec-07
  • 21:47 StupidPanda: removed coredumps from tools-webgrid-04 to reclaim space
  • 21:45 StupidPanda: removed coredump from tools-webgrid-01 to reclaim space
  • 20:31 YuviPanda: disabling puppet on tools-exec-07 to test shinken

November 7

  • 13:56 scfc_de: tools-submit, tools-webgrid-04: rm -f /var/log/exim4/paniclog (OOM around the time of the filesystem outage).

November 6

  • 13:21 scfc_de: tools-dev: Gzipped /var/log/account/pacct.0 (804111872 bytes); looks like root had his own bigbrother instance running on tools-dev (multiple invocations of webservice per second).

November 5

  • 19:15 mutante: exec nodes have p7zip-full now
  • 10:07 YuviPanda: cleaned out pacct and atop logs on tools-login

November 4

  • 19:50 mutante: apt-get clean on tools-login, and gzipped some logs

November 1

  • 12:51 scfc_de: Removed log files in /var/log/diamond older than five weeks (pdsh -f 1 -g tools sudo find /var/log/diamond -type f -mtime +35 -ls -delete).

October 30

  • 14:37 YuviPanda: cleaned out pacct and atop logs on tools-dev
  • 06:18 paravoid: killed a "vi" process belonging to user icelabs and running for two days saturating the I/O network bandwidth, and rm'ed a 3.5T(!) .final_mg.txt.swp

October 27

  • 16:06 scfc_de: tools-mail: Killed -HUP old queue runners and restarted exim4; probably the source of paniclog's "re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory".
  • 15:36 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Recreated (empty) /var/log/apache2 and /var/log/upstart.

October 26

  • 12:35 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/account.
  • 12:33 scfc_de: tools-trusty: Went through shadowed /var and rebooted.
  • 12:31 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/exim4, started exim4 and ran queue.

October 24

  • 20:31 andrewbogott: moved tools-exec-12, tools-shadow and tools-mail to virt1006

October 23

  • 22:55 Coren: reboot tools-shadow, upstart seems hosed

October 14

  • 23:22 YuviPanda|zzz: removed stale puppet lockfile and ran puppet manually on tools-exec-07

October 11

  • 15:31 andrewbogott: rebooting tools-master, stab in the dark
  • 06:01 YuviPanda: restarted gridengine-master on tools-master

October 4

  • 18:31 scfc_de: tools-mail: Deleted /usr/local/bin/collect_exim_stats_via_gmetric and root's crontab; clean-up for Ic9e0b5bb36931aacfb9128cfa5d24678c263886b

October 2

  • 17:59 andrewbogott: added Ryan back to tools admins because that turned out to not have anything to do with the bounce messages
  • 17:32 andrewbogott: removing ryan lane from tools admins, because his email in ldap is defunct and I get bounces every time something goes wrong in tools

September 28

  • 14:45 andrewbogott: rebased /var/lib/git/operations/puppet on toolsbeta-puppetmaster3

September 25

  • 14:43 YuviPanda: cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now

September 17

  • 21:40 andrewbogott: caused a brief auth outage while messing with codfw ldap

September 15

  • 11:00 YuviPanda: tested CPU monitoring on tools-exec-12 by running stress, seems to work

September 13

  • 20:52 yuvipanda: cleaned out rotated log files on tools-webproxy

September 12

  • 21:54 jeremyb: [morebots] booted all bots, reverted to using systemwide (.deb) codebase

September 8

  • 16:08 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM @ 2014-09-07 15:13:59)

September 5

  • 22:22 scfc_de: Deleted stale nginx entries for "rightstool" and "svgcheck"
  • 22:20 scfc_de: Stopped 12 webservices for tool "meta" and started one
  • 18:50 scfc_de: geohack's lighttpd dumped core and left an entry in Redis behind; tools-webproxy: "DEL prefix:geohack"; geohack: "webservice start"

September 4

  • 19:47 lokal-profil: local-heritage: Renamed two Swedish tables

September 2

  • 04:31 scfc_de: "iptables -A OUTPUT -d 10.68.16.1 -p udp -m udp --dport 53" on all hosts in support of bug #70076

August 23

  • 17:44 scfc_de: qmod -cq task@tools-exec-07 (job #2796555, "11  : before job")

August 21

  • 20:05 scfc_de: Deployed release 1.0.11 of jobutils and miscutils

August 15

  • 16:45 legoktm: fixed grrrit-wm
  • 16:36 legoktm: restarting grrrit-wm

August 14

  • 22:36 scfc_de: Removed again jobs in error state due to LDAP with "for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then qdel "$JOBID"; fi; done"; cf. also bug #69529
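
The same one-liner as logged above, reflowed for readability (no new commands, only formatting):

    # delete every job stuck in Eqw whose error output shows the LDAP
    # "can't get password entry for user" failure
    for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do
        if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then
            qdel "$JOBID"
        fi
    done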

August 12

  • 03:32 scfc_de: tools-exec-08, tools-exec-wmt, tools-webgrid-02, tools-webgrid-03, tools-webgrid-04: Removed stale "apt-get update" processes to get Puppet working again

August 2

  • 16:39 scfc_de: tools.mybot's crontab uses qsub without -M, added that as a temporary measure and will inform user later
  • 16:36 scfc_de: Manually rerouted mails for tools.mybot@tools-submit.eqiad.wmflabs

August 1

  • 22:41 scfc_de: Deleted all jobs in "E" state that were caused by an LDAP failure at ~ 2014-07-30 07:00Z ("can't get password entry for user [...]")

July 24

  • 20:53 scfc_de: Set SGE "mailer" parameter again for bug #61160
  • 14:51 scfc_de: Removed ignored file /etc/apt/preferences.d/puppet_base_2.7 on all hosts

July 21

  • 18:39 scfc_de: Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa
  • 18:38 scfc_de: Restarted webservice for stewardbots because it wasn't in Redis
  • 18:33 scfc_de: Stopped eight (!) webservices of tools.bookmanagerv2 and started one again

July 18

  • 14:29 scfc_de: admin: Set up .bigbrotherrc for toolhistory
  • 13:24 scfc_de: Made tools-webgrid-04 a grid submit host
  • 12:58 scfc_de: Made tools-webgrid-03 a grid submit host

July 16

  • 22:41 YuviPanda: reloaded nginx on tools-webproxy to pick up https://gerrit.wikimedia.org/r/#/c/146466/3
  • 15:18 scfc_de: replagstats OOMed four hours after start on May 6th; with ganglia.wmflabs.org down, not restarting
  • 15:14 scfc_de: Restarted toolhistory with 350 MBytes; OOMed June 1st

July 15

  • 11:31 scfc_de: Started webservice for sulinfo; stopped at 2014-06-29 18:31:04

July 14

  • 20:40 andrewbogott: on tools-login
  • 20:39 andrewbogott: manually deleted /var/lib/apt/lists/lock, forcing apt to update

July 13

  • 13:13 scfc_de: tools-exec-13: Moved /var/log around, reboot, iptables-restore & reenabled queues
  • 13:11 scfc_de: tools-exec-12: Moved /var/log around, reboot & iptables-restore

July 12

  • 17:57 scfc_de: tools-exec-11: Stopping apache2 service; no clue how it got there
  • 17:53 scfc_de: tools-exec-11: Moved log files around, rebooted, restored iptables and reenabled queue ("qmod -e {continuous,task}@tools-exec-11...")
  • 13:00 scfc_de: tools-exec-11, tools-exec-13: qmod -r continuous@tools-exec-1[13].eqiad.wmflabs in preparation for reboot
  • 12:58 scfc_de: tools-exec-11, tools-exec-13: Disabled queues in preparation for reboot
  • 11:58 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: mkdir -m 2750 /var/log/exim4 && chown Debian-exim:adm /var/log/exim4; I'll file a bug why the directory wasn't created later

July 11

  • 11:59 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: cp -f /data/project/.system/hosts /etc/hosts

July 10

  • 20:35 scfc_de: tools-exec-11, tools-exec-12, tools-exec-13: iptables-restore /data/project/.system/iptables.conf
  • 16:00 YuviPanda: manually removed mariadb remote repo from tools-exec-12 instance, won't be added to new instances (puppet patch was merged)
  • 01:33 YuviPanda|zzz: tools-exec-11 and tools-exec-13 have been added to the @general hostgroup

July 9

  • 23:14 YuviPanda: applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13
  • 23:09 YuviPanda: created tools-exec-13 with precise
  • 23:08 YuviPanda: created tools-exec-12 as trusty by accident, will keep on standby for testing
  • 23:07 YuviPanda: created tools-exec-12
  • 23:06 YuviPanda: created tools-exec-11
  • 19:23 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again
  • 14:12 scfc_de: tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later
  • 13:33 scfc_de: tools-exec-cyberbot: Freed 402398 inodes ...
  • 12:50 scfc_de: tools-exec-cyberbot: "find /tmp -maxdepth 1 -type f -name \*cyberbotpeachy.cookies\* -mtime +30 -delete" as a first step
  • 12:40 scfc_de: tools-exec-cyberbot: Root partition has run out of inodes
  • 12:34 scfc_de: tools-exec-gift: Forgot to log yesterday: The problems were due to overload (load >> 150); SGE shouldn't have allowed that
  • 12:28 YuviPanda: cleaned out old diamond archive logs on tools-master
  • 12:28 YuviPanda: cleaned out old diamond archive logs on tools-webgrid-04
  • 12:25 YuviPanda: cleaned out old diamond archive logs from tools-exec-08

July 8

  • 20:57 scfc_de: tools-exec-gift: Puppet hangs due to "apt-get update" not finishing in time; manual runs of the latter take forever
  • 19:52 scfc_de: tools-exec-wmt, tools-shadow: Removed stale Puppet lock files and reran manually (handy: "sudo find /var/lib/puppet/state -maxdepth 1 -type f -name agent_catalog_run.lock -ls -ok rm -f \{\} \; -exec sudo puppet agent apply -tv \;")
  • 18:09 scfc_de: tools-webgrid-03, tools-webgrid-04: killall -TERM gmond (bug #64216)
  • 17:57 scfc_de: tools-exec-08, tools-exec-09, tools-webgrid-02, tools-webgrid-03: Removed stale Puppet lock files and reran manually
  • 17:26 scfc_de: tools-tcl-test: Rebooted because system said so
  • 17:04 YuviPanda: webservice start on tools.meetbot since it seemed down
  • 14:55 YuviPanda: cleaned out old diamond archive logs on tools-webproxy
  • 13:39 scfc_de: tools-login: rm -f /var/log/exim4/paniclog ("daemon: fork of queue-runner process failed: Cannot allocate memory")

July 6

  • 12:09 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog after I20afa5fb2be7d8b9cf5c3bf4018377d0e847daef got merged

July 5

July 4

  • 08:51 scfc_de: tools-exec-08 (some hours ago): rm -f /var/log/diamond/* && restart diamond
  • 00:02 scfc_de: tools-master: rm -f /var/log/diamond/* && restart diamond

July 3

  • 16:59 Betacommand: Coren: It may take a while though; what the catscan queries was blocking is a DDL query changing the schema and that pauses replication.
  • 16:58 Betacommand: Coren: transactions over 30ks killed; the DB should start catching up soon.
  • 14:37 Betacommand: replication for enwiki is halted current lag is at 9876

July 2

  • 00:21 YuviPanda: restarted diamond on almost all nodes to stop sending nfs stats, some still need to be flushed
  • 00:21 YuviPanda: restarted diamond on all exec nodes to stop sending nfs stats

July 1

  • 23:09 legoktm: tools-pywikibot started the webservice, don't know why it wasn't running
  • 21:08 scfc_de: Reset queues in error state again
  • 17:51 YuviPanda: tools-exec-04 removed stale pid file and force puppet run
  • 16:07 YuviPanda: applied biglogs to tools-exec-02 and rejigged things
  • 15:54 YuviPanda: tools-exec-02 removed stale puppet pid file, forcing run
  • 15:51 Coren: adjusted resource limits for -exec-07 to match the smaller instance size.
  • 15:50 Coren: created logfile disk for -exec-07 by hand (smaller instance)
  • 01:53 YuviPanda: tools-exec-10 applied biglogs, moved logs around, killed some old diamond logs
  • 01:41 YuviPanda: tools-exec-03 restarted diamond, atop, exim4, ssh to pick up new log partition
  • 01:40 YuviPanda: tools-exec-03 applied biglogs, moved logs around, killed some old diamond logs
  • 01:34 scfc_de: tools-exec-03, tools-exec-10: Removed /var/log/diamond/diamond.log, restarted diamond and bzip2'ed /var/log/diamond/*.log.2014*

June 30

  • 22:10 YuviPanda: ran webservice start for enwp10
  • 22:06 YuviPanda: stale lockfile in tools-login as well, removing and forcing puppet run
  • 22:01 YuviPanda: removed stale lockfile for puppet, forcing run
  • 19:58 YuviPanda|food: added tools-webgrid-04 to webgrid queue, had to start portgranter manually
  • 17:43 YuviPanda: created tools-webgrid-04, applying webnode role and running puppet
  • 17:27 YuviPanda: created tools-webgrid-03 and added it to the queue

June 29

  • 19:45 scfc_de: magnustools: "webservice start"
  • 18:24 YuviPanda: rebooted tools-webgrid-02. Could not ssh, was dead

June 28

  • 21:07 YuviPanda: removed alias for tools-webproxy and tools.wmflabs.org from /etc/hosts on tools-webproxy

June 21

  • 20:09 scfc_de: Created tool mediawiki-mirror (yuvipanda + Nemo_bis) and chown'ed & chmod o-w /shared/mediawiki

June 20

  • 21:01 scfc_de: tools-webgrid-tomcat: Added to submit host list with "qconf -as" for bug #66882
  • 14:47 scfc_de: Restarted webservice for mono; cf. bug #64219

June 16

  • 23:50 scfc_de: Shut down diamond services and removed log files on all hosts

June 15

  • 17:12 YuviPanda: deleted tools-mongo. MongoDB pre-allocates db files, so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non-preallocating version is 'not meant for production', so putting this on hold for now
  • 16:50 scfc_de: qmod -cq cyberbot@tools-exec-cyberbot.eqiad.wmflabs
  • 16:48 scfc_de: tools-exec-cyberbot: rm -f /var/log/diamond/diamond.log && restart diamond
  • 16:48 scfc_de: tools-exec-cyberbot: No DNS entry (again)

June 13

  • 22:59 YuviPanda: "sudo -u ineditable -s" to force creation of homedir, since the user was unable to login before. /var/log/auth.log had no record of their attempts, but now seems to work. straange

June 10

  • 21:51 scfc_de: Restarted diamond service on all Tools hosts to actually free the disk space :-)
  • 21:36 scfc_de: Deleted /var/log/diamond/diamond.log on all Tools hosts to free up space on /var

June 3

  • 17:50 Betacommand: Brief network outage. Source: it's not clearly determined yet; we aborted the investigation to roll back and restore service. As far as we can tell, there is something subtly wrong with the LACP switch configuration.

June 2

  • 20:15 YuviPanda: create instance tools-trusty-test to test nginx proxy on trusty
  • 19:00 scfc_de: zoomviewer: Set TMPDIR to /data/project/zoomviewer/var/tmp and ./webwatcher.sh; cannot see *any* temporary files being created anywhere, though. iipsrv.fcgi however has TMPDIR set as planned.

May 27

  • 18:49 wm-bot: petrb: temporarily hardcoding tools-exec-cyberbot to /etc/hosts so that host resolution works
  • 10:36 scfc_de: tools-webgrid-01: removed all files of tools.zoomviewer in /tmp
  • 10:22 scfc_de: tools-webgrid-01: /tmp was full, removed files of tools.zoomviewer older than five days
  • 07:52 wm-bot: petrb: restarted webservice of tool admin in order to purge that huge access.log

May 25

  • 14:27 scfc_de: tools-mail: "rm -f /var/log/exim4/paniclog" to leave only relay_domains errors

May 23

  • 14:14 andrewbogott: rebooting tools-webproxy so that services start logging again
  • 14:10 andrewbogott: applying role::labs::lvm::biglogs on tools-webproxy because /var/log was full and causing errors

May 22

  • 02:45 scfc_de: tools-mail: Enabled role::labs::lvm::biglogs, moved data around & rebooted.
  • 02:36 scfc_de: tools-mail: Removed all jsub notifications from hazard-bot from queue.
  • 01:46 scfc_de: hazard-bot: Disabled minutely cron job github-updater
  • 01:36 scfc_de: tools-mail: Freezing all messages to Yahoo!: "421 4.7.1 [TS03] All messages from 208.80.155.162 will be permanently deferred; Retrying will NOT succeed. See http://postmaster.yahoo.com/421-ts03.html"
  • 01:12 scfc_de: tools-mail: /var is full

May 20

  • 18:34 YuviPanda: back to homerolled nginx 1.5 on proxy, newer versions causing too many issues

May 16

  • 17:01 scfc_de: tools-webgrid-02: rm -f /tmp/core (tools.misc2svg, May 13 06:10, 3861106688)

May 14

  • 16:31 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis
  • 00:23 Betacommand: 503's related to bug 65179

May 13

  • 20:36 YuviPanda: restarting redis on tools-webproxy fixed 503s
  • 20:36 valhallasw: redis failed, causing tools-webproxy to throw 503's
  • 19:09 marktraceur: Restarted grrrit because it had a stupid nick

May 10

  • 14:50 YuviPanda: upgraded nginx to 1.7.0 on tools-webproxy to get SPDY/3.1

May 9

  • 13:16 scfc_de: Cleared error state of queues {continuous,mailq,task}@tools-exec-06 and webgrid-lighttpd; no obvious or persistent causes

May 6

  • 19:31 scfc_de: replagstats fixed; Ganglia graphs are now under the virtual host "tools-replags"
  • 17:53 scfc_de: Don't think replagstats is really working ...
  • 16:40 scfc_de: Moved ~scfc/bin/replagstats to ~tools.admin/bin/ and enabled as a continuous job (cf. also bug #48694).

April 28

  • 11:51 YuviPanda: pywikibugs Deployed bf1be7b

April 27

  • 13:34 scfc_de: Restarted webservice for geohack and moved {access,error}.log to {access,error}.log.1

April 24

  • 23:39 YuviPanda: restarted grrrit-wm, not greg-g. greg-g does not survive restarts and hence care must be taken to make sure he is not restarted.
  • 23:38 YuviPanda: restarted greg-g after cherry-picking aec09a6 for auth of IRC bot
  • 23:33 legoktm: restarting grrrit-wm https://gerrit.wikimedia.org/r/129610
  • 13:07 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog (relay_domains bug)

April 20

  • 14:27 scfc_de: tools-redis: Set role::labs::lvm::mnt and $lvm_mount_point=/var/lib, moved the data around and rebooted
  • 14:08 scfc_de: tools-redis: /var is full
  • 08:59 legoktm: grrrit-wm: 2014-04-20T08:28:15.889Z - error: Caught error in redisClient.brpop: Redis connection to tools-redis:6379 failed - connect ECONNREFUSED
  • 08:48 legoktm: Your job 438884 ("lolrrit-wm") has been submitted
  • 08:47 legoktm: [01:28:28] * grrrit-wm has quit (Remote host closed the connection)

April 13

April 12

  • 23:51 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("unknown named domain list "+relay_domains"")

April 11

April 10

  • 18:20 scfc_de: tools-webgrid-01, tools-webgrid-02: "kill -HUP" all php-cgis that are not (grand-)children of lighttpd processes

April 8

  • 05:06 Ryan_Lane: restart nginx on tools-proxy-test
  • 05:03 Ryan_Lane: upgraded libssl on all nodes

April 4

  • 15:48 Coren: Moar powar!!1!one: added two exec nodes (-09 -10) and one webgrid node (-02)
  • 11:11 scfc_de: Set /data/project/.system/config/wikihistory.workers to 20 on apper's request

March 30

  • 18:16 scfc_de: Removed empty directories /data/project/{d930913,sudo-test{,-2},testbug{,2,3}}: Corresponding service groups don't exist (anymore)
  • 18:13 scfc_de: Removed /data/project/backup: Only empty dynamic-proxy backup files of January 3rd and earlier

March 29

  • 10:14 wm-bot: petrb: disabled 1 job in cron in -login of user tools.tools-info which was killing login server

March 28

  • 11:53 wm-bot: petrb: did the same on -mail server (removed /var/log/exim4/paniclog) so that we don't get spam every day
  • 11:51 wm-bot: petrb: removed content of /var/log/exim4/paniclog
  • 11:49 wm-bot: petrb: disabled default vimrc which everybody hates on -login

March 21

  • 16:50 scfc_de: tools-login: pkill -u tools.bene (OOM)
  • 16:13 scfc_de: rmdir /home/icinga (totally empty, "drwxr-xr-x 2 nemobis 50383 4096 Mar 17 16:42", perhaps an artifact of the mass migration?)
  • 15:49 scfc_de: sudo cp -R /etc/skel /home/csroychan && sudo chown -R csroychan.wikidev /home/csroychan; that should close [[bugzilla:62132]]
  • 15:15 scfc_de: sudo cp -R /etc/skel /home/annabel && sudo chown -R annabel.wikidev /home/annabel
  • 15:14 scfc_de: sudo chown -R torin8.wikidev /home/torin8

March 20

  • 18:36 scfc_de: Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]

March 5

  • 13:57 wm-bot: petrb: test

March 4

  • 22:35 wm-bot: petrb: uninstalling it from -login too
  • 22:32 wm-bot: petrb: uninstalling apache2 from tools-dev it has nothing to do there

March 3

  • 19:20 wm-bot: petrb: shutting down almost all services on webserver-02 in order to make the system usable and finish the upgrade
  • 19:17 wm-bot: petrb: upgrading all packages on webserver-02
  • 19:15 petan: rebooting webserver-01 which is totally dead
  • 19:07 wm-bot: petrb: restarting apache on webserver-02 it complains about OOM but the server has more than 1.5g memory free
  • 19:03 wm-bot: petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that
  • 16:44 scfc_de: tools-webserver-03: Apache was swamped by request for /guc. "webservice start" for that, and pkill -HUP -u local-guc.
  • 12:54 scfc_de: tools-webserver-02: Rebooted, apache2/error.log told of OOM, though more than 1G free memory.
  • 12:50 scfc_de: tools-webserver-03: Rebooted, scripts were timing out
  • 12:42 scfc_de: tools-webproxy: Rebooted; wasn't accessible by ssh.

March 1

  • 03:42 Coren: disabled puppet in pmtpa tool labs

February 28

  • 14:46 wm-bot: petrb: extending /usr on tools-dev by 800mb
  • 00:26 scfc_de: tools-webserver-02: Rebooted; inaccessible via ssh, http said "500 Internal Server Error"

February 27

  • 15:28 scfc_de: chmod g-w ~fsainsbu/.forward

February 25

  • 22:48 rdwrer: Lol, so, something happened with grrrit-wm earlier and nobody logged any of it. It was yoyoing, Yuvi killed it, then aude did something and now it's back.

February 23

  • 20:46 scfc_de: morebots: labs HUPped to reconnect to IRC

February 21

  • 17:32 scfc_de: tools-dev: mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets; automount seems to have been stuck
  • 15:24 scfc_de: tools-webserver-03: Rebooted, wasn't accessible by ssh and apparently no access to /public/datasets either

February 20

  • 21:23 scfc_de: tools-login: Disabled crontab for local-rezabot and left a message at User talk:Reza#Running bots on tools-login, etc. (fa:بحث_کاربر:Reza1615 is write-protected)
  • 20:15 scfc_de: tools-login: Disabled crontab for local-chobot and left a message at ko:사용자토론:ChongDae#Running bots on tools-login, etc.
  • 10:42 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
  • 10:30 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 10:28 scfc_de: Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch

February 19

  • 20:21 scfc_de: morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots
  • 19:14 scfc_de: tools-login: Disabled crontab and pkill -HUP -u fatemi127

February 18

  • 11:42 scfc_de: tools-mail: Rerouted queued mail (@tools-login.pmtpa.wmflabs => @tools.wmflabs.org)
  • 11:34 scfc_de: tools-exec-08: Rebooted due to not responding on ssh and SGE
  • 10:39 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list" => probably artifacts from Coren's LDAP changes)
  • 10:37 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)

February 14

  • 23:54 legoktm: restarting grrrit-wm since it disappeared
  • 08:19 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)

February 13

  • 13:11 scfc_de: Deleted old job of user veblenbot stuck in error state
  • 13:08 scfc_de: Deleted old jobs of user v2 stuck in error state
  • 10:49 scfc_de: tools-login: Commented out local-shuaib-bot's crontab with a pointer to Tools/Help

February 12

  • 07:51 wm-bot: petrb: removed /data/project/james/adminstats/wikitools per request from james on irc

February 11

  • 15:47 scfc_de: Restarted webservice for geohack
  • 13:02 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 13:00 scfc_de: Killed -HUP local-hawk-eye-bot's jobs; one was hanging with a stale NFS handle on tools-exec-05

February 10

  • 23:16 Coren: rebooting webproxy (braindead autofs)

February 9

February 6

February 4

January 31

  • 03:43 scfc_de: Cleaned up all exim queues
  • 01:26 scfc_de: chmod g-w ~{bgwhite,daniel,euku,fale,henna,hydriz,lfaraone}/.forward (test: sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -ls); see the sketch below
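
A small sketch combining the logged test with the fix; the -exec form is an assumption (the log applied chmod per home directory by hand):

    # list group-writable ~/.forward files (the test quoted in the entry above)
    sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -ls
    # assuming the same layout, drop the group-write bit on whatever it finds
    sudo find /home -mindepth 2 -maxdepth 2 -type f -name .forward -perm /g=w -exec chmod g-w {} +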

January 30

  • 21:48 scfc_de: chmod g-w ~fluff/.forward
  • 21:40 scfc_de: local-betabot: Added "-M" option to crontab's qsub call and rerouted queued mail (freeze, exim -Mar, exim -Mmd, thaw)
  • 18:33 scfc_de: tools-exec-04: puppetd --enable (apparently disabled sometime around 2014-01-16?!)
  • 17:25 scfc_de: tools-exec-06: mv -f /etc/init.d/nagios-nrpe-server{.dpkg-dist,} (nagios-nrpe-server didn't start because start-up script tried to "chown icinga" instead of "chown nagios")

January 28

  • 04:27 scfc_de: tools-webproxy: Blocked Phonifier

January 25

  • 05:37 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (OOM)

January 24

  • 01:07 scfc_de: tools-db: Removed /var/lib/mysql2, set expire_logs_days to 1 day
  • 00:11 scfc_de: tools-db: and restarted mysqld
  • 00:11 scfc_de: tools-db: Moved 4.2 GBytes of the oldest binlogs to /var/lib/mysql2/

January 23

  • 19:24 legoktm: restarting grrrit-wm now https://gerrit.wikimedia.org/r/#/c/109116/
  • 19:23 legoktm: ^ was for grrrit-wm
  • 19:23 legoktm: re-committed password to local repo, not sure why that wasn't committed already

January 21

  • 17:41 scfc_de: tools-exec-09: iptables-restore /data/project/.system/iptables.conf

January 20

  • 07:02 andrewbogott: merged a lint patch to the gridengine module. Should be a noop

January 16

  • 17:11 scfc_de: tools-exec-09: "iptables-restore /data/project/.system/iptables.conf" after reboot

January 15

  • 13:36 scfc_de: After reboot of tools-exec-09, all continuous jobs were successfully restarted ("Rr"); task jobs (1974113, 2188472) failed ("19  : before writing exit_status")
  • 13:27 scfc_de: tools-login: rm -f /var/log/exim4/paniclog (OOM)
  • 08:54 andrewbogott: rebooted tools-exec-09
  • 08:32 andrewbogott: rebooted tools-db

January 14

  • 15:10 scfc_de: tools-login: pkill -u local-mlwikisource: Freed 1 GByte of memory
  • 14:58 scfc_de: tools-login: Disabled local-mlwikisource's crontab with explanation
  • 13:57 scfc_de: tools-webserver-02: rm -f /var/log/exim4/paniclog (out of memory errors on 2014-01-10)

January 10

January 9

January 8

  • 13:44 scfc_de: Cleared error states of continuous@tools-exec-05, task@tools-exec-05, task@tools-exec-09

January 7

  • 18:59 scfc_de: tools-login, tools-mail: rm -f /var/log/exim4/paniclog (apparently some artifacts of the LDAP failure)

January 6

  • 14:06 YuviPanda: deleted instance tools-mc, didn't know it had come back from the dead

January 1

  • 13:24 scfc_de: tools-exec-02, tools-master, tools-shadow, tools-webserver-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
  • 11:27 scfc_de: tools-webserver-01, tools-webserver-01: rm -f /var/log/exim4/paniclog; out of memory errors
  • 11:18 scfc_de: Emptied /{data/project,home}/.snaplist as the snapshots themselves are not available

December 27

  • 07:39 legoktm: grrrit-wm restart didn't really work.
  • 07:38 legoktm: restarting grrrit-wm, for some reason it reconnected and lost its cloak

December 23

  • 18:30 marktraceur: restart grrrit-wm for subbu

December 21

  • 06:50 scfc_de: tools-exec-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update

December 19

  • 17:22 marktraceur: deploying grrrit config change

December 17

  • 23:19 legoktm: rebooted grrrit-wm with new config stuffs

December 14

  • 18:13 marktraceur: restarting grrrit-wm to fix its nickname
  • 13:17 scfc_de: tools-exec-08: Purged packages libapache2-mod-suphp and suphp-common (probably remnants from when the host was misconfigured as a webserver)
  • 13:09 scfc_de: tools-dev, tools-login, tools-mail, tools-webserver-01, tools-webserver-02: rm /var/log/exim4/paniclog (mostly out of memory errors)

December 4

  • 22:15 Coren: tools-exec-01 rebooted to fix the autofs issue; will return to rotation shortly.
  • 16:33 Coren: rebooting webproxy with new kernel settings to help against the DDOS

December 1

  • 14:05 Coren: underlying virtualization hardware rebooted; tools-master and friends coming back up.

November 25

  • 21:03 YuviPanda: created tools-proxy-test instance to play around with the dynamicproxy
  • 12:16 wm-bot: petrb: deswapping -login (swapoff -a && swapon -a)

November 24

  • 07:19 paravoid: disabled crontab for user avocato on tools-login, see above
  • 07:17 paravoid: pkill -u avocato on tools-login, multiple /home/avocato/pywikipedia/redirect.py DoSing the bastion

November 14

  • 09:12 ori-l: Added aude to lolrrit-wm maintainers group

November 13

  • 22:36 andrewbogott: removed 'imagescaler' class from tools-login because that class hasn't existed for a year. Which, a year ago, was before that instance even existed, so what the heck?

November 3

  • 16:49 ori-l: grrrit-wm stopped receiving events. restarted it; didn't help. then restarted gerrit-to-redis, which seems to have fixed it.

November 1

  • 16:11 wm-bot: petrb: restarted terminator daemon on -login to sort out memory issues caused by heavy mysql client by elbransco

October 23

  • 15:19 Coren: deleted tools-tyrant and tools-exec-cyberbot (cleanup of obsoleted instances)

October 20

  • 18:52 wm-bot: petrb: everything looks better
  • 18:51 wm-bot: petrb: restarting apache server on tools-webproxy
  • 18:49 wm-bot: petrb: installed links on -dev; going to investigate what is wrong with the Apaches. Coren, please update the documentation

October 15

  • 21:03 Coren: labs-login rebooted to fix the ownership/take issue, with success.

October 10

  • 09:49 addshore: tools-webserver-01 is getting a 500 Internal Server Error again

September 23

  • 06:44 YuviPanda: remove unpuppetized install of openjdk-6 packages causing problems in -dev (for bug: 54444)
  • 05:15 legoktm: logging a log to test the log logging
  • 05:13 legoktm: logging a log to test the log logging

September 11

  • 09:39 wm-bot: petrb: started toolwatcher

August 24

  • 18:00 wm-bot: petrb: freed 1600mb of ram by killing yasbot processes on -login
  • 17:59 wm-bot: petrb: killing all python processes of yasbot on -login, this bot needs to run on grid, -login is constantly getting OOM because of this bot

August 23

  • 12:17 wm-bot: petrb: test
  • 12:15 wm-bot: petrb: making pv from /dev/vdb on new nodes
  • 11:49 wm-bot: petrb: syncing packages of -login with exec nodes
  • 11:48 petan: someone installed firefox on exec nodes, should investigate / remove

August 22

  • 01:24 scfc_de: tools-webserver-03: Installed python-oursql

August 20

  • 23:00 scfc_de: Opened port 3000 for intra-Labs traffic in execnode security group for YuviPanda's proxy experiments

August 19

  • 09:52 wm-bot: petrb: deleting fatestwiki tool, requested by creator

August 16

  • 00:16 scfc_de: tools-exec-01 doesn't come up again even after repeated reboots

August 15

  • 15:14 scfc_de: tools-webserver-01: Simplified /usr/local/bin/php-wrapper
  • 14:31 scfc_de: tools-webserver-01: "dpkg --configure -a" on apt-get's advice
  • 14:24 scfc_de: chmod 644 ~magnus/.forward
  • 03:07 scfc_de: tools-webproxy: Temporarily serving 403s to AhrefsBot/bingbot/Googlebot/PaperLiBot/TweetmemeBot/YandexBot until they reread robots.txt
  • 02:02 scfc_de: robots.txt: "Disallow: /"

August 11

  • 03:14 scfc_de: tools-mc: Purged memcached

August 10

  • 02:36 scfc_de: Disabled terminatord on tools-login and tools-dev
  • 02:24 scfc_de: chmod g-w ~whym/.forward

August 6

  • 19:26 scfc_de: Set up basic robots.txt to exclude Geohack to see how that affects traffic
  • 02:09 scfc_de: tools-mail: Enabled rudimentary Ganglia monitoring in root's crontab

August 5

  • 20:32 scfc_de: chmod g-w ~ladsgroup/.forward

August 2

  • 23:45 scfc_de: tools-dev: Installed dialog for testing

August 1

  • 19:57 scfc_de: Created new instance tools-redis with redis_maxmemory = "7GB"
  • 19:56 scfc_de: Added redis_maxmemory to wikitech Puppet variables

July 31

  • 10:50 HenriqueCrang: ptwikis added graph with mobile edits

July 30

  • 19:08 scfc_de: tools-webproxy: Purged popularity-contest and ubuntu-standard
  • 07:32 wm-bot: petrb: deleted local-addbot jobs
  • 02:01 scfc_de: tools-webserver-01: Symlinked /usr/local/bin/{job,jstart,jstop,jsub} to /usr/bin; were obsolete versions.

July 29

  • 15:15 scfc_de: tools-webserver-01: rm /var/log/exim4/paniclog
  • 15:10 scfc_de: Purged popularity-contest from tools-webserver-01.
  • 02:40 scfc_de: Restarted toolwatcher on tools-login.
  • 02:11 scfc_de: Reboot tools-login, was not responsive

July 25

  • 23:37 Ryan_Lane: added myself to lolrrit-wm tool
  • 12:06 wm-bot: petrb: test
  • 07:11 wm-bot: petrb: created /var/log/glusterfs/bricks/ to stop rotatelogs from complaining about it being missing

July 20

  • 15:19 petan: rebooting tools-redis

July 19

  • 07:06 petan: instances were rebooted for unknown reasons
  • 00:42 helderwiki: it works! :-)
  • 00:41 legoktm: test

July 10

  • 18:04 wm-bot: petrb: installing mysqltcl on grid
  • 18:01 wm-bot: petrb: installing tclodbc on grid

July 5

  • 19:38 AzaToth: test
  • 19:36 AzaToth: test for example
  • 18:23 Coren: brief outage of webproxy complete (back to business!)
  • 18:13 Coren: brief outage of webproxy (rollback 2.4 upgrade)

July 3

  • 13:44 scfc_de: Set "HostbasedAuthentication yes" and "EnableSSHKeysign yes" in tools-dev's /etc/ssh/ssh_config
  • 12:58 petan: rebooting -mc, it's apparently dying of OOM

July 2

  • 16:24 wm-bot: petrb: installed maria on all nodes so we can connect to the db even from SGE
  • 12:19 wm-bot: petrb: installing packages -- libmediawiki-api-perl libdatetime-format-strptime-perl libbot-basicbot-perl libdatetime-format-duration-perl

July 1

  • 18:39 wm-bot: petrb: started toolwatcher on -login
  • 14:22 wm-bot: petrb: installing following packages on grid: libdata-dumper-simple-perl libhtml-html5-entities-perl libirc-utils-perl libtask-weaken-perl libobject-pluggable-perl libpoe-component-syndicator-perl libpoe-filter-ircd-perl libsocket-getaddrinfo-perl libpoe-component-irc-perl libxml-simple-perl
  • 12:05 wm-bot: petrb: starting toolwatcher
  • 11:40 wm-bot: petrb: tools is back o/
  • 09:42 wm-bot: petrb: installing python-zmq and python-matplotlib on -dev
  • 03:33 scfc_de: Rebooted tools-login apparently out of memory and not responding to ssh

June 30

  • 17:58 scfc_de: Set ssh_hba to yes on tools-exec-06
  • 17:13 scfc_de: Installed python-matplotlib and python-zmq on tools-login for YuviPanda

June 26

  • 21:16 Coren: +Tim Landscheidt to project admins, local-admin
  • 14:23 wm-bot: petrb: updating several packages on -login
  • 13:43 wm-bot: petrb: killing old instance of redis: Jun15 ? 00:06:49 /usr/bin/redis-server /etc/redis/redis.conf
  • 13:42 wm-bot: petrb: restarting redis
  • 13:28 wm-bot: petrb: running puppet on -mc
  • 13:27 wm-bot: petrb: adding ::redis role to tools-mc - if anything will break, YuviPanda did it :P
  • 09:35 wm-bot: petrb: updated status.php to a version which displays free vmem as well

June 25

  • 12:34 wm-bot: petrb: installing php5-mcrypt on exec and web

June 24

  • 15:45 wm-bot: petrb: changed colors of the root prompt: production vs. testing
  • 07:57 wm-bot: petrb: 50527 4186 22830 1 Jun23 pts/41 00:08:54 python fill2.py eats 48% of ram on -login

June 19

  • 12:17 wm-bot: petrb: increasing limit on mysql connections

June 17

  • 17:34 wm-bot: petrb: /var/spool/cron/crontabs/ has -rw------- 1 8006 crontab 1176 Apr 11 14:07 local-voxelbot fixing

June 16

  • 21:23 Coren: 1.0.3 deployed (jobutils, misctools)

June 15

  • 21:40 wm-bot: petrb: there is no lvm on -db, which we badly need; therefore no swap either, nor storage for binary logs :( I've got a feeling that mysql will die of OOM soonish
  • 21:39 wm-bot: petrb: db has 5% free RAM eeeek
  • 18:36 wm-bot: root: removed a lot of ?audit? logs from exec-04; they were eating too much storage
  • 18:23 wm-bot: petrb: temporarily disabling /tmp on exec-04 in order to set up lvm
  • 18:23 wm-bot: petrb: exec-04 96% / usage, creating a new volume
  • 12:33 wm-bot: petrb: installing redis on tools-mc

June 14

  • 12:35 wm-bot: petrb: updating logsplitter to new version

June 13

  • 21:59 wm-bot: petrb: replaced logsplitter on both apache servers with a far more powerful C++ version, thus saving a lot of resources on both servers
  • 12:43 wm-bot: petrb: tools-webserver-01 is running a quite expensive python job (currently eating almost 1 GB of RAM); it may need to be fixed or moved to a separate webserver. Adding swap to prevent the machine from dying of OOM
  • 12:22 wm-bot: petrb: killing process 31187 sort -T./enwiki/target -t of user local-enwp10 for same reason as previous one
  • 12:21 wm-bot: petrb: killing process 31190 sort -T./enwiki/target of user local-enwp10 for same reason as previous one
  • 12:17 wm-bot: petrb: killing process 31186 31185 69 Jun11 pts/32 1-13:14:41 /usr/bin/perl ./bin/catpagelinks.pl ./enwiki/target/main_pages_sort_by_ids.lst ./enwiki/target/pagelinks_main_sort_by_ids.lst because it seems to be a bot running on login server eating too many resources

June 11

  • 07:36 wm-bot: petrb: installed libdigest-crc-perl

June 10

  • 13:05 wm-bot: petrb: installing libcrypt-gcrypt-perl
  • 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix !b 49383
  • 08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix become afcbot 49383
  • 08:44 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix become afcbot 49383
  • 08:25 wm-bot: petrb: fixing missing packages on exec nodes

June 9

  • 20:44 wm-bot: petrb: moved logs on -login to separate storage

June 8

  • 21:24 wm-bot: petrb: installing python-imaging-tk on grid
  • 21:20 wm-bot: petrb: installing python-tk
  • 21:16 wm-bot: petrb: installing python-flickrapi on grid
  • 21:16 wm-bot: petrb: installing
  • 16:49 wm-bot: petrb: turned off the wmf style of vi on tools-dev; feel free to slap me :o or do cat /etc/vim/vimrc.local >> .vimrc if you love it
  • 15:33 wm-bot: petrb: grid is overloaded, needs to be either enlarged or jobs calmed down :o
  • 09:55 wm-bot: petrb: backporting tcl 8.6 from debian
  • 09:38 wm-bot: petrb: update python requests to version 1.2.3.1

June 7

  • 15:29 Coren: Deleted no-longer-needed tools-exec-cg node (spun off to its own project)

June 5

  • 09:52 wm-bot: petrb: on -dev
  • 09:52 wm-bot: petrb: moving /usr to a separate volume; expect problems :o
  • 09:41 wm-bot: petrb: moved /var/log to separate volume on -dev
  • 09:31 wm-bot: petrb: Houston, we have a problem: / on -dev is at 94%
  • 09:28 wm-bot: petrb: installed openjdk7 on -dev
  • 09:00 wm-bot: petrb: removing wd-terminator service
  • 08:39 wm-bot: petrb: started toolwatcher
  • 07:04 wm-bot: petrb: installing maven on -dev

June 4

  • 14:49 wm-bot: petrb: installing sbt in order to fix b48859
  • 13:28 wm-bot: petrb: installing csh on cluster
  • 08:37 wm-bot: petrb: installing python-memcache on exec nodes

June 3

  • 21:40 Coren: Rebooting -login; it's thrashing. Will keep an eye on it.
  • 14:15 wm-bot: petrb: removing popularity contest
  • 14:11 wm-bot: petrb: removing /etc/logrotate.d/glusterlogs on all servers to fix logrotate daemon
  • 09:43 wm-bot: petrb: syncing packages on exec nodes to avoid troubles with missing libs on some etc

June 2

  • 08:39 wm-bot: petrb: installing ack-grep everywhere per yuvipanda and irc

June 1

  • 20:57 wm-bot: petrb: installed this to exec nodes because it was on some and not on others cpp-4.4 cpp-4.5 cython dbus dosfstools ed emacs23 ftp gcc-4.4-base iptables iputils-tracepath ksh lsof ltrace lshw mariadb-client-5.5 nano python-dbus python-egenix-mxdatetime python-egenix-mxtools python-gevent python-greenlet strace telnet time -y
  • 20:42 wm-bot: petrb: installing wikitools cluster wide
  • 20:40 wm-bot: petrb: installing oursql cluster wide
  • 10:46 wm-bot: petrb: created new instance for experiments with sasl memcache tools-mc

May 31

  • 19:17 petan: deleting xtools project (requested by Cyberpower678)
  • 17:24 wm-bot: petrb: removing old kernels from -dev because / is almost full
  • 17:17 wm-bot: petrb: installed lsof to -dev
  • 15:55 wm-bot: petrb: installed subversion to exec nodes 4 legoktm
  • 15:47 wm-bot: petrb: replacing mysql with maria on exec nodes
  • 15:46 wm-bot: petrb: replacing mysql with maria on exec nodes
  • 15:14 wm-bot: petrb: installing default-jre in order to satisfy its dependencies
  • 15:13 wm-bot: petrb: installing /data/project/.system/deb/all/sbt.deb to -dev in order to test it
  • 13:04 wm-bot: petrb: installing bashdb on tools and -dev
  • 12:27 wm-bot: petrb: removing project local-jimmyxu - per request on irc
  • 10:54 wm-bot: petrb: killing process 3060 on -login (mahdiz 3060 1964 88 May30 ? 21:32:51 /bin/nano /tmp/crontab.Ht3bSO/crontab) it takes max cpu and doesn't seem to be attached

May 30

  • 12:24 wm-bot: petrb: deleted job 1862 from queue (error state)
  • 08:26 wm-bot: petrb: updated sql command

May 29

  • 21:05 wm-bot: petrb: running sudo apt-get install php5-gd

May 28

  • 20:00 wm-bot: petrb: installing p7zip-full to -dev and -login

May 27

  • 08:46 wm-bot: petrb: changed config of mysql to use /mnt as path to save binary logs, this however requires server to be restarted

May 24

  • 08:44 petan: setting up lvm on new exec nodes because it is more flexible and allows us to change the size of volumes on the fly
  • 08:28 petan: created 2 more exec nodes, setting up now...

May 23

  • 09:20 wm-bot: petrb: process 27618 on -login is constantly eating 100% of cpu, changing priority to 20

May 22

  • 20:54 wm-bot: petrb: changing ownership of /data/project/bracketbot/ to local-bracketbot
  • 14:28 labs-logs-bottie: petrb: installed netcat as well
  • 14:28 labs-logs-bottie: petrb: installed telnet to -dev
  • 14:02 Coren: tools-webserver-02 now live; / and /cluebot/ moved there

May 21

  • 20:27 labs-logs-bottie: petrb: uploaded hosts to -dev

May 19

  • 13:40 labs-logs-bottie: petrb: killing that nano process; it seems to be hung and is unattached anyway
  • 12:59 labs-logs-bottie: petrb: changed priority of nano process to 19
  • 12:55 labs-logs-bottie: petrb: local-hawk-eye-bot's /bin/nano /tmp/crontab.d4JhUj/crontab eats too much cpu
  • 12:50 petan: nvm previous line
  • 12:50 labs-logs-bottie: petrb: vul alias viewuserlang

May 14

  • 21:22 labs-logs-bottie: petrb: created a separate volume for /tmp on -login so that temp files do not fragment the root fs and it does not get filled up by them; it also makes it easier to track filesystem usage
  • 13:16 Coren: reboot -dev, need to test kernel upgrade

May 10

  • 15:08 Coren: create tools-webserver-02 for Apache 2.4 experimentation

May 9

  • 04:12 Coren: added -exec-03 and -exec-04. Moar power!!1!

May 6

  • 19:59 Coren: made tools-dev.wmflabs.org public
  • 08:04 labs-logs-bottie: petrb: created a small swap on -login so that users cannot bring it to OOM so easily and so that unused memory blocks can be swapped out in order to use the remaining memory more effectively
  • 08:00 labs-logs-bottie: petrb: making lvm from unused disk from /mnt on -login so that we can eventually use it somewhere if needed

May 4

  • 17:50 labs-logs-bottie: petrb: foobar as well
  • 17:47 labs-logs-bottie: petrb: removing project flask-stub using rmtool
  • 15:33 labs-logs-bottie: petrb: fixing missing db user for local-stub
  • 12:51 labs-logs-bottie: petrb: creating mysql accounts by hand for alchimista and fubar

May 2

  • 20:49 labs-logs-bottie: petrb: uploaded motd to exec-N as well, with information about which server users are connected to

May 1

  • 16:59 labs-logs-bottie: petrb: fixed invalid permissions on /home

April 27

  • 18:54 labs-logs-bottie: petrb: installing pymysql using pip on whole grid because it is needed for greenrosseta (for some reason it is better than python-mysql package)

April 26

  • 23:55 Coren: reboot to finish security updates
  • 08:00 labs-logs-bottie: petrb: patching qtop
  • 07:57 labs-logs-bottie: petrb: added tools-dev to the admin host list so that qtop works, and fixed the qtop bug
  • 07:28 labs-logs-bottie: petrb: installing GE tools to -dev so that we can develop new j|q* stuff there

April 25

  • 19:00 Coren: Maintenance over; systems restarted and should be working.
  • 18:18 labs-logs-bottie: petrb: we are getting into trouble with memory on tools-db; less than 20% of memory is free
  • 18:01 Coren: Begin maintenance (login disabled)
  • 13:21 petan: removing local-wikidatastats from ldap

April 24

  • 13:17 labs-logs-bottie: petrb: sudo chown local-peachy PeachyFrameworkLogo.png
  • 11:37 labs-logs-bottie: petrb: created new project stats and cloned acl from wikidatastats, which is supposed to be deleted
  • 11:32 legoktm: wikidatastats attempting to install limn
  • 11:15 labs-logs-bottie: petrb: installing npm to -login instance
  • 07:34 petan: creating project wikidatastats for legoktm addshore and yuvipandianablah :P

April 23

  • 13:32 labs-logs-bottie: petrb: changing permissions of cyberbot and peachy to 775 so that it is easier to use them
  • 12:14 labs-logs-bottie: petrb: qtop on -dev
  • 12:12 labs-logs-bottie: petrb: removed part of motd from login server that got there in a mysterious way

April 19

  • 22:38 Coren: reboot -login, all done with the NFS config. yeay.
  • 17:13 Coren: (final?) reboot of -login with the new autofs configuration
  • 16:24 Coren: (rebooted -login)
  • 16:24 Coren: autofs + gluster = fail
  • 14:45 Coren: reboot -login (NFS mount woes)

April 15

  • 22:29 Coren: also a test; note how said bot knows its place.  :-)
  • 22:14 andrewbogott: this is a test of labs-morebots.
  • 21:49 andrewbogott: this is a test
  • 15:41 labs-logs-bottie: petrb: installing p7zip everywhere
  • 08:00 labs-logs-bottie: petrb: installing dev packages needed for YuviPanda on login box

April 11

  • 22:39 Coren: rebooted tools-puppet-test (no end-user impact): hung filesystem prevents login
  • 07:42 labs-logs-bottie: petrb: removed reboot information from motd

April 10

  • 21:42 labs-logs-bottie: petrb: reverting the change
  • 21:35 labs-logs-bottie: petrb: inserting /lib to /etc/ld.so.conf in order to fix the bug with gcc / ubuntu see irc logs (22:30 GMT)
  • 21:22 labs-logs-bottie: petrb: installing jobutils.deb on login
  • 20:30 labs-logs-bottie: petrb: installing some dev tools to -dev
  • 20:23 petan: created -dev instance for various purposes

April 8

  • 14:07 labs-logs-bottie: petrb: ongrid apt-get install mono-complete
  • 13:50 labs-logs-bottie: local-afcbot: unable to run mono applications: The assembly mscorlib.dll was not found or could not be loaded.

April 4

  • 14:40 labs-logs-bottie: petrb: trying to convert afcbot to new service group local-afcbot

April 2

  • 16:04 labs-logs-bottie: petrb: installed log to /home/petrb/bin/ and testing it
  • 15:55 petan: patched /usr/local/bin/qdisplay so that it can display jobs per node properly
  • 15:54 petan: giving sudo to Petrb in order to update qdisplay

March 28

  • 15:44 Coren: reboot (still unactivated) tools-shadow

March 26

  • 18:17 Coren: Doubled the size of the compute grid! (added tools-exec-02 to the grid)

March 21

  • 23:30 Coren: turned on interpretation of .py as CGI by default on tools-webserver-* to parallel .php
  • 16:15 Coren: Added tools-login.wmflabs.org public IP for the tools-login instance and allowed incoming ssh to it.

March 19

  • 14:21 Coren: reboot cycle (all instances) to apply security updates

March 13

  • 14:04 Coren: restarted webserver: relax AllowOverride options

March 11

  • 15:47 Coren: enabled X forwarding for qmon. Also, installed qmon.
  • 13:17 Coren: added python-requests (1.0, from pip)

March 7

  • 20:41 Coren: tools' php errors now sent to ~/php_errors.log
  • 19:31 Coren: access.log now split by tools (in tool homedir)
  • 16:15 Coren: can haz database (support for user/tool databases in place)

March 6

  • 20:25 Coren: tools-db installed mariadb-server from official repo
  • 19:50 Coren: created tools-db instance for a (temporary) mysql install

March 5

  • 21:45 Coren: rejiggered the webproxy config to be smarter about paths not leading to specific tools

February 26

  • 23:49 Coren: Original note structure: created tools-{master,exec-01,webserver-01,webproxy} instances
  • 18:39 Coren: Created tools-puppet-test for dev and testing of tools' puppet classes.
  • 01:52 Coren: created instance tools-login (primary login/dev instance)
  • 01:52 Coren: created sudo policies and security groups (skeletal)
  • 01:08 Coren: Creation of the new project for preproduction deployment of the current (preliminary) plan mw:Wikimedia Labs/Tool Labs/Design