19:01 valhallasw`cloud: ah, connections that are kept open. A new incognito window is routed correctly.
18:59 valhallasw`cloud: switched to -02, worked correctly, switched back. Switching back does not seem to fully work?!
18:40 valhallasw`cloud: scratch that, first going to eat dinner
18:38 valhallasw`cloud: dynamicproxy ban system deployed on tools-proxy-02 working correctly for localhost; switching over users there by moving the external IP.
14:42 valhallasw`cloud: toollabs homepage is unhappy because tools.xtools-articleinfo is using a lot of cpu on tools-webgrid-lighttpd-1409. Checking to see what's happening there.
10:46 YuviPanda: migrate tools-worker-01 to 3.19 kernel
2015-12-22
18:30 YuviPanda: rescheduling all webservices
18:17 YuviPanda: failed over active proxy to proxy-01
18:12 YuviPanda: upgraded kernel and rebooted tools-proxy-01
01:42 YuviPanda: rebooting tools-worker-08
2015-12-21
18:44 YuviPanda: reboot tools-proxy-01
18:31 YuviPanda: failover proxy to tools-proxy-02
2015-12-20
00:00 YuviPanda: tools-worker-08 stuck again :|
2015-12-18
15:16 andrewbogott: rebooting locked up host tools-exec-1409
18:29 Coren: switching gridmaster activity to tools-grid-shadow
05:13 yuvipanda: increased security groups quota to 50 because why not
2015-12-01
21:07 yuvipanda: added bd808 as admin
21:01 andrewbogott: deleted tool/service group tools.test300
2015-11-25
15:42 Coren: migrating tools-web-static-02 to labvirt1010 to free space on labvirt1002
2015-11-20
22:02 Coren: tools-webgrid-lighttpd-1412 tools-webgrid-lighttpd-1413 tools-webgrid-lighttpd-1414 tools-webgrid-lighttpd-1415 done and back in rotation.
21:46 Coren: tools-webgrid-lighttpd-1411 tools-webgrid-lighttpd-1211 done and back in rotation.
21:30 Coren: tools-webgrid-lighttpd-1410 tools-webgrid-lighttpd-1210 done and back in rotation.
21:25 Coren: tools-webgrid-lighttpd-1409 tools-webgrid-lighttpd-1209 done and back in rotation.
21:13 Coren: tools-webgrid-lighttpd-1408 tools-webgrid-lighttpd-1208 done and back in rotation.
20:58 Coren: tools-webgrid-lighttpd-1407 tools-webgrid-lighttpd-1207 done and back in rotation.
20:53 Coren: tools-webgrid-lighttpd-1406 tools-webgrid-lighttpd-1206 done and back in rotation.
20:41 Coren: tools-webgrid-lighttpd-1405 tools-webgrid-lighttpd-1205 tools-webgrid-generic-1405 done and back in rotation.
20:28 Coren: tools-webgrid-lighttpd-1404 tools-webgrid-lighttpd-1204 tools-webgrid-generic-1404 done and back in rotation.
19:49 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1203 tools-webgrid-generic-1403
19:25 Coren: -lighttpd-1403 wants a restart.
19:15 Coren: done, and putting back in rotation: tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1202 tools-webgrid-generic-1402
18:55 Coren: Putting -lighttpd-1401 -lighttpd-1201 -generic-1401 back in rotation, disabling the others.
18:24 Coren: Beginning draining web nodes; -lighttpd-1401 -lighttpd-1201 -generic-1401
22:57 YuviPanda: pooled tools-webgrid-lighttpd-1413
22:10 YuviPanda: created tools-webgrid-lighttpd-1414 and 1415
22:04 YuviPanda: created tools-webgrid-lighttpd-1412 and 1413
19:53 YuviPanda: drained continuous jobs and disabled queues on tools-exec-1203 and tools-exec-1402
19:50 YuviPanda: drain webgrid-lighttpd-1408 of jobs
2015-10-26
20:53 YuviPanda: updated 6.9 ssh backport to all trusty hosts
2015-10-11
22:54 yuvipanda: delete service.manifest for tool wikiviz to prevent it from being started automatically. It set itself up for nodejs but didn't actually have any code
18:22 valhallasw`cloud: experimenting with https://github.com/jordansissel/fpm on tools-packages, and manually installing packages for that. Noting them here.
2015-09-16
17:33 scfc_de: Removed python-tools-webservice from precise-tools as it is apparently an old version of tools-webservice.
01:17 YuviPanda: attempting to move grrrit-wm to kubernetes
2015-09-15
01:18 scfc_de: Added unixodbc_2.2.14p2-5_amd64.deb back to precise-tools to diagnose if it is related to T111760.
2015-09-14
23:47 scfc_de: Archived unixodbc_2.2.14p2-5_amd64 from deb-precise and aptly, no reference in Puppet or Phabricator and same version as distribution.
2015-09-13
20:53 scfc_de: Archived lua-json_1.3.2-1 from labsdebrepo and aptly, upgraded manually to Trusty's new 1.3.1-1ubuntu0.1~ubuntu14.04.1, restarted nginx on tools-webproxy-01 and tools-webproxy-02, checked that proxy and localhost:8081/list works.
20:42 scfc_de: rm -f /etc/apt/apt.conf.d/20auto-upgrades.ucf-dist on all hosts (cf. T110055).
08:05 valhallasw`cloud: Publish for local repo ./trusty-tools [all, amd64] publishes {main: [trusty-tools]} has been successfully updated. Publish for local repo ./precise-tools [all, amd64] publishes {main: [precise-tools]} has been successfully updated.
08:04 valhallasw`cloud: added all packages in data/project/.system/deb-precise to aptly repo precise-tools
08:03 valhallasw`cloud: added all packages in data/project/.system/deb-trusty to aptly repo trusty-tools
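Note: a minimal sketch of the aptly workflow behind the three entries above (repo names and package directories are from the log; the exact flags used are an assumption):
  aptly repo add trusty-tools /data/project/.system/deb-trusty/*.deb
  aptly repo add precise-tools /data/project/.system/deb-precise/*.deb
  aptly publish update trusty-tools
  aptly publish update precise-tools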
2015-09-07
18:49 valhallasw`cloud: ran sudo mount -o remount /data/project on tools-static-01, which also solved the issue, so skipping the reboot
18:47 valhallasw`cloud: switched static webserver to tools-static-02
18:45 valhallasw`cloud: weird NFS issue on tools-web-static-01. Switching over to -02 before rebooting.
17:57 YuviPanda: created tools-k8s-master-01 with jessie, will be etcd and kubernetes master
2015-09-03
07:09 valhallasw`cloud: and just re-running puppet solves the issue. Sigh.
07:09 valhallasw`cloud: last message in puppet.log.1.gz is Error: /Stage[main]/Toollabs::Exec_environ/Package[fonts-ipafont-gothic]/ensure: change from 00303-5 to latest failed: Could not get latest version: Execution of '/usr/bin/apt-cache policy fonts-ipafont-gothic' returned 100: fonts-ipafont-gothic: (...) E: Cache is out of sync, can't x-ref a package file
07:07 valhallasw`cloud: err, is empty.
07:07 valhallasw`cloud: Puppet failure on tools-exec-1215 is CRITICAL 66.67% of data above the critical threshold -- but /var/log/puppet.log doesn't exist?!
2015-09-02
15:01 scfc_de: Added -M option to qsub call for crontab of tools.sdbot.
13:55 valhallasw`cloud: restarted gridengine_exec on tools-exec-1403
13:53 valhallasw`cloud: tools-exec-1403 does lots of locking operations. Only job there was jid 1072678 = /data/project/hat-collector/irc-bots/snitch.py . Rescheduled that job.
13:16 YuviPanda: deleted all jobs of ralgisbot
13:12 YuviPanda: suspended all jobs in ralgisbot temporarily
12:57 YuviPanda: rescheduled all jobs of ralgisbot, was suffering from stale NFS file handles
2015-09-01
21:01 valhallasw`cloud: killed one of the grrrit-wm jobs; for some reason two of them were running?! Not sure what SGE is up to lately.
16:12 scfc_de: tools-bastion-01: Killed bot of tools.cobain.
15:47 valhallasw`cloud: git reset --hard cdnjs on tools-web-static-01
06:23 valhallasw`cloud: seems to have worked. SGE :(
06:17 valhallasw`cloud: going to restart sge_qmaster, hoping this solves the issue :/
06:08 valhallasw`cloud: e.g. "queue instance "task@tools-exec-1211.eqiad.wmflabs" dropped because it is overloaded: np_load_avg=1.820000 (= 0.070000 + 0.50 * 14.000000 with nproc=4) >= 1.75" but the actual load is only 0.3?!
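Note: the arithmetic in that message appears to be SGE's job load adjustment at work: np_load_avg = load_avg/nproc + (load_adjustment * recent_jobs)/nproc = 0.07 + (0.50 * 14)/4 = 1.82. Fourteen recently scheduled jobs each temporarily count as 0.50 extra load, pushing hosts whose real load is ~0.3 (0.07 * 4) over the 1.75 threshold, so every queue looks overloaded while the machines are nearly idle.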
06:06 valhallasw`cloud: test job does not get submitted because all queues are overloaded?!
06:06 valhallasw`cloud: investigating SGE issues reported on irc/email
2015-08-31
23:20 scfc_de: Changed host name tools-webgrid-generic-1405 in "qconf -mq webgrid-generic" to fix the "au" state of the queue on that host.
21:19 valhallasw`cloud: seems to have some errors in restarting: subprocess.CalledProcessError: Command '['/usr/bin/sudo', '-i', '-u', 'tools.javatest', '/usr/local/bin/webservice', '--release', 'trusty', 'generic', 'restart']' returned non-zero exit status 2
21:18 valhallasw`cloud: running puppet agent -tv on tools-services-02 to make sure webservicemonitor is running
21:15 valhallasw`cloud: several webservices seem to actually have not gotten back online?! what on earth is going on.
21:10 valhallasw`cloud: some jobs still died (including tools.admin). I'm assuming service.manifest will make sure they start again
20:29 valhallasw`cloud: the |sort approach does not spread the affected hosts out much, because a lot of jobs were started on lighttpd-1409 and -1410 around the same time.
20:25 valhallasw`cloud: ca 500 jobs @ 5s/job = approx 40 minutes
20:23 valhallasw`cloud: doh. accidentally used the wrong file, causing restarts for another few uwsgi hosts. Three more jobs dead *sigh*
20:21 valhallasw`cloud: now doing more rescheduling, with 5 sec intervals, on a sorted list to spread load between queues
19:36 valhallasw`cloud: last restarted job is 1423661, rest of them are still in /home/valhallaw/webgrid_jobs
19:35 valhallasw`cloud: one per second still seems to make SGE unhappy; there's a whole set of jobs dying, mostly uwsgi?
19:31 valhallasw`cloud: https://phabricator.wikimedia.org/T110861 : rescheduling 521 webgrid jobs, at a rate of one per second, while watching the accounting log for issues
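Note: a sketch of the throttled rescheduling described above, assuming a plain job-id list (the 5-second interval and job file are from the 20:21 and 19:36 entries):
  while read jid; do
    qmod -rj "$jid"   # ask gridengine to reschedule this job
    sleep 5           # throttle so the master and queues can keep up
  done < /home/valhallaw/webgrid_jobs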
07:31 valhallasw`cloud: removed paniclog on tools-submit; probably related to the NFS outage yesterday (although I'm not sure why that would give OOMs)
2015-08-30
13:23 valhallasw`cloud: killed wikibugs-backup and grrrit-wm on tools-webproxy-01
13:20 valhallasw`cloud: disabling 503 error page
2015-08-29
04:09 scfc_de: Disabled queue webgrid-lighttpd@tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs (qmod -d) because I can't ssh to it and jobs deployed there fail with "failed assumedly before job:can't get password entry for user".
2015-08-27
15:00 valhallasw`cloud: killed multiple kmlexport processes on tools-webgrid-lighttpd-1401 again
2015-08-26
01:10 scfc_de: Felt lucky: kill -STOP bigbrother on tools-submit, installed I00cd7a90273e0d745699855eb671710afb4e85a7 on tools-services-02 and service bigbrothermonitor start. If it goes berserk, please service bigbrothermonitor stop.
14:58 YuviPanda: pooled in two new instances for the precise exec pool
14:45 YuviPanda: reboot tools-exec-1221
14:26 YuviPanda: rebooting tools-exec-1220 because NFS wedge...
14:18 YuviPanda: pooled in tools-webgrid-generic-1405
10:16 YuviPanda: created tools-webgrid-generic-1405
10:04 YuviPanda: apply exec node puppet roles to tools-exec-1220 and -1221
09:59 YuviPanda: created tools-exec-1220 and -1221
2015-08-24
16:37 valhallasw`cloud: more processes were started, so added a talk page message on User:Coet (who was starting the processes according to /var/log/auth.log) and using 'write coet' on tools-bastion-01
16:15 valhallasw`cloud: kill -9'ing because normal killing doesn't work
16:13 valhallasw`cloud: killing all processes of tools.cobain which are flooding tools-bastion-01
2015-08-20
18:44 valhallasw`cloud: both are now at 3dbbc87
18:43 valhallasw`cloud: running git reset --hard origin/master on both checkouts. Old HEAD is 86ec36677bea85c28f9a796f7e57f93b1b928fa7 (-01) / c4abeabd3acf614285a40e36538f50655e53b47d (-02).
18:42 valhallasw`cloud: tools-web-static-01 has the same issue, but with different commit ids (because different hostname). No local changes on static-01. The initial merge commit on -01 is 57994c, merging 1e392ab and fc918b8; on -02 it's 511617f, merging a90818c and fc918b8.
18:39 valhallasw`cloud: cdnjs on tools-web-static-02 can't pull because it has a dirty working tree, and there's a bunch of weird merge commits. Old commit is c4abeabd3acf614285a40e36538f50655e53b47d, the dirty working tree is changes from http to https in various files
17:06 valhallasw`cloud: wait, what timezone is this?!
2015-08-19
10:45 valhallasw`cloud: ran `for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done`; this fixed queues on tools-exec-1404 tools-exec-1409 tools-exec-1410 tools-webgrid-lighttpd-1406
2015-08-18
15:53 scfc_de: Added valhallasw as grid manager (qconf -am valhallasw).
13:57 valhallasw`cloud: same issue seems to happen with the other hosts: tools-exec-1401.tools.eqiad.wmflabs vs tools-exec-1401.eqiad.wmflabs and tools-exec-catscan.tools.eqiad.wmflabs vs tools-exec-catscan.eqiad.wmflabs.
13:55 valhallasw`cloud: no, wait, that's tools-webgrid-lighttpd-1411.eqiad.wmflabs, not the actual host tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs. We should fix that dns mess as well.
13:54 valhallasw`cloud: tried to restart gridengine-exec on tools-exec-1401, no effect. tools-webgrid-lighttpd-1411 also just went into 'au' state.
13:47 valhallasw`cloud: that brought tools-exec-1403, tools-exec-1406 and tools-webgrid-generic-1402 back up, tools-exec-1401 and tools-exec-catscan are still in 'au' state
13:46 valhallasw`cloud: starting gridengine-exec on hosts with queues in 'au' (=alarm, unknown) state using for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do echo $i; ssh $i sudo service gridengine-exec start; done
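Note: the same one-liner, expanded with comments:
  # list queue instances in 'au' state, take the host part of the queue@host
  # name, and start gridengine-exec on each host via ssh
  for i in $(qstat -f -xml | grep "<state>au" -B 6 | grep "<name>" | cut -d'@' -f2 | cut -d. -f1); do
    echo $i
    ssh $i sudo service gridengine-exec start
  done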
08:37 valhallasw`cloud: sudo service gridengine-exec start on tools-webgrid-lighttpd-1404.eqiad.wmflabs, tools-webgrid-lighttpd-1406.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs
08:33 valhallasw`cloud: tools-webgrid-lighttpd-1403.eqiad.wmflabs, tools-webgrid-lighttpd-1404.eqiad.wmflabs and tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs are all broken (queue dropped because it is temporarily not available)
08:30 valhallasw`cloud: hostname mismatch: host is called tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs in config, but it was named tools-webgrid-lighttpd-1411.eqiad.wmflabs in the hostgroup config
08:14 valhallasw`cloud: and the hostgroup @webgrid doesn't even exist? (╯°□°)╯︵ ┻━┻
08:10 valhallasw`cloud: /var/lib/gridengine/etc/queues/webgrid-lighttpd does not seem to be the correct configuration as the current config refers to '@webgrid' as host list.
08:07 valhallasw`cloud: sudo qconf -Ae /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs -> root@tools-bastion-01.eqiad.wmflabs added "tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs" to exechost list
08:06 valhallasw`cloud: ok, success. /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs now exists. Do I still have to add it manually to the grid? I suppose so.
08:04 valhallasw`cloud: installing packages from /data/project/.system/deb-trusty seems to fail. sudo apt-get update helps.
08:00 valhallasw`cloud: running puppet agent -tv again
07:55 valhallasw`cloud: argh. Disabling toollabs::node::web::generic again and enabling toollabs::node::web::lighttpd
07:54 valhallasw`cloud: various issues such as Error: /Stage[main]/Gridengine::Submit_host/File[/var/lib/gridengine/default/common/accounting]/ensure: change from absent to link failed: Could not set 'link' on ensure: No such file or directory - /var/lib/gridengine/default/common at 17:/etc/puppet/modules/gridengine/manifests/submit_host.pp; probably an ordering issue in the Puppet manifests
07:53 valhallasw`cloud: Setting up adminbot (1.7.8) ... chmod: cannot access '/usr/lib/adminbot/README': No such file or directory --- ran sudo touch /usr/lib/adminbot/README
07:37 valhallasw`cloud: applying role::labs::tools::compute and toollabs::node::web::generic to tools-webgrid-lighttpd-1411
07:31 valhallasw`cloud: reading puppet suggests I should qconf -ah /var/lib/gridengine/etc/exechosts/tools-webgrid-lighttpd-1411.tools.eqiad.wmflabs but that file is missing?
07:26 valhallasw`cloud: andrewbogott built tools-webgrid-lighttpd-1411 yesterday but it's not actually added as exec host. Trying to figure out how to do that...
15:33 andrewbogott: re-enabling the queue on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
14:50 andrewbogott: killing remaining jobs on tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01
2015-08-15
05:14 andrewbogott: resumed tools-exec-gift, seems not to have been the culprit
05:10 andrewbogott: suspending tools-exec-gift, just for a moment...
2015-08-14
17:21 andrewbogott: disabling grid jobqueue for tools-exec-1211 tools-exec-1212 tools-exec-1215 tools-exec-1403 tools-exec-1406 tools-master tools-shadow tools-webgrid-generic-1402 tools-webgrid-lighttpd-1203 tools-webgrid-lighttpd-1208 tools-webgrid-lighttpd-1403 tools-webgrid-lighttpd-1404 tools-webproxy-01 in anticipation of monday reboot of labvirt1004
15:20 andrewbogott: Adding back to the grid engine queue: tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
14:43 andrewbogott: killing remaining jobs on tools-exec-1216 tools-exec-1219 tools-exec-1407 tools-mail tools-services-02 tools-webgrid-generic-1401 tools-webgrid-lighttpd-1202 tools-webgrid-lighttpd-1207 tools-webgrid-lighttpd-1210 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407
2015-08-13
18:51 valhallasw`cloud: which was resolved by scfc earlier
18:50 valhallasw`cloud: tools-exec-1201/Puppet staleness was critical due to an agent lock (Ignoring stale puppet agent lock for pid Run of Puppet configuration client already in progress; skipping (/var/lib/puppet/state/agent_catalog_run.lock exists))
18:08 scfc_de: scfc@tools-exec-1201: Removed stale /var/lib/puppet/state/agent_catalog_run.lock; Puppet run was started Aug 12 15:06:08, instance was rebooted ~ 15:14.
17:49 valhallasw`cloud: Jobs were drained at 19:43, but this did not decrease the rate, which is still at ~50k/minute. Now running "sysctl -w sunrpc.nfs_debug=1023 && sleep 2 && sysctl -w sunrpc.nfs_debug=0" which hopefully doesn't kill the server
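Note: sunrpc.nfs_debug is a bitmask of NFS client debug flags; setting it to 1023 (0x3ff) switches on verbose NFS/RPC tracing in the kernel log for the two-second window, and the second sysctl turns it back off.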
17:43 valhallasw`cloud: rescheduled all webservice jobs on tools-webgrid-lighttpd-1401.eqiad.wmflabs, server is now empty
2015-07-11
00:01 mutante: fixing puppet runs on tools-webgrid-* via salt
2015-07-10
23:59 mutante: fixing puppet runs on tools-exec via salt
20:09 valhallasw`cloud: it took three of us, but adminbot is updated!
July 6
09:49 valhallasw`cloud: 10:14 <jynus> s51053 is abusing his/her access to replica dbs and creating lag for other users. His/her queries are to be terminated. (= tools.jackbot / user jackpotte)
July 2
17:07 valhallasw`cloud: can't login to tools-mailrelay-01., probably because puppet was disabled for too long. Deleting instance.
19:30 YuviPanda: set appropriate classes for recreated tools-exec-12* nodes
19:28 YuviPanda: recreated tools-static-02
19:11 YuviPanda: failed over tools-static to tools-static-01
14:47 andrewbogott: deleting tools-exec-04
14:44 Coren: -exec-04 drained; removed from queues. Rest well, old friend.
14:41 Coren: disabled -exec-04 (going away)
02:35 YuviPanda: set tools-exec-12{01-10} to configure as exec nodes
02:27 YuviPanda: created tools-exec-12{01-10}
April 28
21:41 andrewbogott: shrinking tools-master
21:33 YuviPanda: failover is going to take longer than actual recompression for tools-master, so let’s just recompress. tools-shadow should take over automatically if that doesn’t work
21:32 andrewbogott: shrinking tools-redis
21:28 YuviPanda: attempting to failover gridengine to tools-shadow
21:27 andrewbogott: shrinking tools-submit
21:21 YuviPanda: backup crontabs onto NFS
21:18 andrewbogott: shrinking tools-webproxy-02
21:14 andrewbogott: shrinking tools-static-01
21:11 andrewbogott: shrinking tools-exec-gift
21:06 YuviPanda: failover tools-webproxy to tools-webproxy-01
21:06 andrewbogott: stopping, shrinking and starting tools-exec-catscan
21:01 YuviPanda: failover tools-static to tools-static-02
13:13 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
April 13
21:11 YuviPanda: restart portgranter on all webgrid nodes
April 12
10:52 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
April 11
21:49 andrewbogott: moved /data/project/admin/toollabs to /data/project/admin/toollabsbak on tools-webproxy-01 and tools-webproxy-02 to fix permission errors
02:15 YuviPanda: rebooted tools-submit, was not responding
April 10
07:10 PissedPanda: take out tools-services-01 to test switchover and also to recreate as small
05:20 YuviPanda: delete the tomcat node finally :D
April 9
23:24 scfc_de: rm -f /puppet_{host,service}groups.cfg on all hosts (apparently a Puppet/hiera mishap last November).
23:11 scfc_de: tools-webgrid-04: Rescheduled all jobs running on this instance (T95537).
08:32 scfc_de: tools-mail: Removed paniclog (multiple exims, but only one found).
April 8
13:25 scfc_de: Repaired servicegroups repository and restarted toolhistory job; was stuck at 2015-03-29T09:15:05Z (NFS?).
12:01 scfc_de: Removed empty tools with no maintainers javed/javedbaker/shell.
09:10 scfc_de: Removed stale proxy entries for analytalks/anno/commons-coverage/coursestats/eagleeye/hashtags/itwiki/mathbot/nasirkhanbot/rc-vikidia/wikistream.
April 7
07:42 scfc_de: tools-mail: Killed superfluous exim and removed paniclog.
April 5
10:11 scfc_de: tools-mail: Killed superfluous exims and removed paniclog.
April 4
22:48 scfc_de: Removed zombie jobs (qdel 1991607,1994800,1994826,1994827,2054201,3449476,3450329,3451518,3451549,3451590,3451628,3451635,3451830,3451869,3452632,3452633,3452654,3452655,3452657,3452668,4218785,4219210,4219674,4219722,4219791,4219923,4220646).
08:49 scfc_de: tools-submit: Restarted bigbrother because it didn't notice admin's .bigbrotherrc.
08:49 scfc_de: Add webservice to .bigbrotherrc for admin tool.
15:11 YuviPanda|brb: pooled in tools-webgrid-07 to lighty webgrid, moving some tools off -05 and -06 to relieve pressure
February 28
07:51 YuviPanda: create tools-webgrid-07
01:00 Coren: Set vm.overcommit_memory=0 on -webgrid-05 (also trusty)
01:00 Coren: Also, that was -webgrid-05
00:59 Coren: set exec-06 to vm.overcommit_memory=0 for now, until the vm behaviour difference between precise and trusty can be nailed down.
February 27
17:53 YuviPanda: increased quota to 512G RAM and 256 cores
15:33 Coren: Switched back to -master. I'm making a note here: great success.
15:27 Coren: Gridengine master failover test part three; killing the master with -9
15:20 Coren: Gridengine master failover test part deux - now with verbose logs
15:10 YuviPanda: created tools-webgrid-generic-02
15:10 YuviPanda: increase instance quota to 64
15:10 Coren: Master restarted - test not successful.
14:50 Coren: testing gridengine master failover starting now
08:27 YuviPanda: restart *all* webtools (with qmod -rj webgrid-lighttpd) to have tools-webproxy-01 and -02 pick them up as well
February 24
18:33 Coren: tools-submit not recovering well from outage, kicking it.
17:58 YuviPanda: rebooting *all* webgrid jobs on toollabs
February 16
02:31 scfc_de: rm -f /var/log/exim4/paniclog.
February 13
18:01 Coren: tools-redis is dead, long live tools-redis
17:48 Coren: rebuilding tools-redis with moar ramz
17:38 legoktm: redis on tools-redis is OOMing?
17:26 marktraceur: restarting grrrit-wm because it's not behaving
February 1
10:55 scfc_de: Submitted dummy jobs for tools ftl/limesmap/newwebtest/osm-add-tags/render/tsreports/typoscan/usersearch to get bigbrother to recognize those users and cleaned up output files afterwards.
07:51 YuviPanda: cleared error state of stuck queues
06:41 YuviPanda: set chmod +xw manually on /var/run/lighttpd on webgrid-05, need to investigate why it was necessary
00:15 YuviPanda: killed all db and tools-webproxy aliases in /etc/hosts for tools-webproxy, since otherwise puppet fails because ec2id thinks we’re not in labs because hostname -d is empty because we set /etc/hosts to resolve IP directly to tools-webproxy
21:11 YuviPanda: disable puppet on tools-dev to check shinken
21:00 scfc_de: qmod -cq continuous@tools-exec-09,continuous@tools-exec-11,continuous@tools-exec-13,continuous@tools-exec-14,mailq@tools-exec-09,mailq@tools-exec-11,mailq@tools-exec-13,mailq@tools-exec-14,task@tools-exec-06,task@tools-exec-09,task@tools-exec-11,task@tools-exec-13,task@tools-exec-14,task@tools-exec-15,webgrid-lighttpd@tools-webgrid-01,webgrid-lighttpd@tools-webgrid-02,webgrid-lighttpd@tools-webgrid-04 (fallout from /var being full).
20:38 YuviPanda: didn't actually stop puppet, need more patches
20:38 YuviPanda: stopping puppet on tools-dev to test shinken
22:07 StupidPanda: enabled puppet on tools-exec-07
21:47 StupidPanda: removed coredumps from tools-webgrid-04 to reclaim space
21:45 StupidPanda: removed coredump from tools-webgrid-01 to reclaim space
20:31 YuviPanda: disabling puppet on tools-exec-07 to test shinken
November 7
13:56 scfc_de: tools-submit, tools-webgrid-04: rm -f /var/log/exim4/paniclog (OOM around the time of the filesystem outage).
November 6
13:21 scfc_de: tools-dev: Gzipped /var/log/account/pacct.0 (804111872 bytes); looks like root had his own bigbrother instance running on tools-dev (multiple invocations of webservice per second).
November 5
19:15 mutante: exec nodes have p7zip-full now
10:07 YuviPanda: cleaned out pacct and atop logs on tools-login
November 4
19:50 mutante: apt-get clean on tools-login, and gzipped some logs
November 1
12:51 scfc_de: Removed log files in /var/log/diamond older than five weeks (pdsh -f 1 -g tools sudo find /var/log/diamond -type f -mtime +35 -ls -delete).
October 30
14:37 YuviPanda: cleaned out pacct and atop logs on tools-dev
06:18 paravoid: killed a "vi" process belonging to user icelabs and running for two days saturating the I/O network bandwidth, and rm'ed a 3.5T(!) .final_mg.txt.swp
October 27
16:06 scfc_de: tools-mail: Killed -HUP old queue runners and restarted exim4; probably the source of paniclog's "re-exec of exim (/usr/sbin/exim4) with -Mc failed: No such file or directory".
15:36 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Recreated (empty) /var/log/apache2 and /var/log/upstart.
October 26
12:35 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/account.
12:33 scfc_de: tools-trusty: Went through shadowed /var and rebooted.
12:31 scfc_de: tools-exec-07, tools-exec-14, tools-exec-15: Created /var/log/exim4, started exim4 and ran queue.
October 24
20:31 andrewbogott: moved tools-exec-12, tools-shadow and tools-mail to virt1006
23:22 YuviPanda|zzz: removed stale puppet lockfile and ran puppet manually on tools-exec-07
October 11
15:31 andrewbogott: rebooting tools-master, stab in the dark
06:01 YuviPanda: restarted gridengine-master on tools-master
October 4
18:31 scfc_de: tools-mail: Deleted /usr/local/bin/collect_exim_stats_via_gmetric and root's crontab; clean-up for Ic9e0b5bb36931aacfb9128cfa5d24678c263886b
October 2
17:59 andrewbogott: added Ryan back to tools admins because that turned out to not have anything to do with the bounce messages
17:32 andrewbogott: removing ryan lane from tools admins, because his email in ldap is defunct and I get bounces every time something goes wrong in tools
September 28
14:45 andrewbogott: rebased /var/lib/git/operations/puppet on toolsbeta-puppetmaster3
September 25
14:43 YuviPanda: cleaned up ghost /var/log (from before biglogs mount) that was taking up space, /var space situation better now
September 17
21:40 andrewbogott: caused a brief auth outage while messing with codfw ldap
September 15
11:00 YuviPanda: tested CPU monitoring on tools-exec-12 by running stress, seems to work
September 13
20:52 yuvipanda: cleaned out rotated log files on tools-webproxy
September 12
21:54 jeremyb: [morebots] booted all bots, reverted to using systemwide (.deb) codebase
20:05 scfc_de: Deployed release 1.0.11 of jobutils and miscutils
August 15
16:45 legoktm: fixed grrrit-wm
16:36 legoktm: restarting grrrit-wm
August 14
22:36 scfc_de: Removed again jobs in error state due to LDAP with "for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then qdel "$JOBID"; fi; done"; cf. also bug #69529
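Note: the same loop, expanded with comments:
  # collect the ids of all jobs stuck in Eqw (error) state
  for JOBID in $(qstat -u \* | sed -ne 's/^\([0-9]\+\) .*Eqw.*$/\1/p;'); do
    # delete only the jobs whose error message is the LDAP lookup failure
    if qstat -j "$JOBID" | fgrep -q "can't get password entry for user"; then
      qdel "$JOBID"
    fi
  done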
August 12
03:32 scfc_de: tools-exec-08, tools-exec-wmt, tools-webgrid-02, tools-webgrid-03, tools-webgrid-04: Removed stale "apt-get update" processes to get Puppet working again
August 2
16:39 scfc_de: tools.mybot's crontab uses qsub without -M, added that as a temporary measure and will inform user later
16:36 scfc_de: Manually rerouted mails for tools.mybot@tools-submit.eqiad.wmflabs
August 1
22:41 scfc_de: Deleted all jobs in "E" state that were caused by an LDAP failure at ~ 2014-07-30 07:00Z ("can't get password entry for user [...]")
July 24
20:53 scfc_de: Set SGE "mailer" parameter again for bug #61160
14:51 scfc_de: Removed ignored file /etc/apt/preferences.d/puppet_base_2.7 on all hosts
July 21
18:39 scfc_de: Removed stale Redis entries for currentevents, misc2svg, osm4wiki, wp-signpost, wscredits and yadfa
18:38 scfc_de: Restarted webservice for stewardbots because it wasn't in Redis
18:33 scfc_de: Stopped eight (!) webservices of tools.bookmanagerv2 and started one again
July 18
14:29 scfc_de: admin: Set up .bigbrotherrc for toolhistory
13:24 scfc_de: Made tools-webgrid-04 a grid submit host
12:58 scfc_de: Made tools-webgrid-03 a grid submit host
16:00 YuviPanda: manually removed mariadb remote repo from tools-exec-12 instance, won't be added to new instances (puppet patch was merged)
01:33 YuviPanda|zzz: tools-exec-11 and tools-exec-13 have been added to the @general hostgroup
July 9
23:14 YuviPanda: applied execnode, hba and biglogs to tools-exec-11 and tools-exec-13
23:09 YuviPanda: created tools-exec-13 with precise
23:08 YuviPanda: created tools-exec-12 as trusty by accident, will keep on standby for testing
23:07 YuviPanda: created tools-exec-12
23:06 YuviPanda: created tools-exec-11
19:23 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis again
14:12 scfc_de: tools-exec-cyberbot: Reran Puppet successfully and hotfixed the Peachy temporary file issue; will mail labs-l later
16:59 Betacommand: Coren: It may take a while though; what the catscan queries was blocking is a DDL query changing the schema and that pauses replication.
16:58 Betacommand: Coren: transactions over 30ks killed; the DB should start catching up soon.
14:37 Betacommand: replication for enwiki is halted current lag is at 9876
July 2
00:21 YuviPanda: restarted diamond on almost all nodes to stop sending nfs stats, some still need to be flushed
00:21 YuviPanda: restarted diamond on all exec nodes to stop sending nfs stats
July 1
23:09 legoktm: tools-pywikibot started the webservice, don't know why it wasn't running
21:08 scfc_de: Reset queues in error state again
17:51 YuviPanda: tools-exec-04 removed stale pid file and force puppet run
16:07 YuviPanda: applied biglogs to tools-exec-02 and rejigged things
15:54 YuviPanda: tools-exec-02 removed stale puppet pid file, forcing run
15:51 Coren: adjusted resource limits for -exec-07 to match the smaller instance size.
15:50 Coren: created logfile disk for -exec-07 by hand (smaller instance)
01:53 YuviPanda: tools-exec-10 applied biglogs, moved logs around, killed some old diamond logs
01:41 YuviPanda: tools-exec-03 restarted diamond, atop, exim4, ssh to pick up new log partition
01:40 YuviPanda: tools-exec-03 applied biglogs, moved logs around, killed some old diamond logs
22:06 YuviPanda: stale lockfile in tools-login as well, removing and forcing puppet run
22:01 YuviPanda: removed stale lockfile for puppet, forcing run
19:58 YuviPanda|food: added tools-webgrid-04 to webgrid queue, had to start portgranter manually
17:43 YuviPanda: created tools-webgrid-04, applying webnode role and running puppet
17:27 YuviPanda: created tools-webgrid-03 and added it to the queue
June 29
19:45 scfc_de: magnustools: "webservice start"
18:24 YuviPanda: rebooted tools-webgrid-02. Could not ssh, was dead
June 28
21:07 YuviPanda: removed alias for tools-webproxy and tools.wmflabs.org from /etc/hosts on tools-webproxy
June 21
20:09 scfc_de: Created tool mediawiki-mirror (yuvipanda + Nemo_bis) and chown'ed & chmod o-w /shared/mediawiki
June 20
21:01 scfc_de: tools-webgrid-tomcat: Added to submit host list with "qconf -as" for bug #66882
14:47 scfc_de: Restarted webservice for mono; cf. bug #64219
June 16
23:50 scfc_de: Shut down diamond services and removed log files on all hosts
June 15
17:12 YuviPanda: deleted tools-mongo. MongoDB pre-allocates db files, and so allocating one db to every tool fills up the disk *really* quickly, even with 0 data. Their non preallocating version is 'not meant for production', so putting on hold for now
16:48 scfc_de: tools-exec-cyberbot: No DNS entry (again)
June 13
22:59 YuviPanda: "sudo -u ineditable -s" to force creation of homedir, since the user was unable to login before. /var/log/auth.log had no record of their attempts, but now seems to work. straange
June 10
21:51 scfc_de: Restarted diamond service on all Tools hosts to actually free the disk space :-)
21:36 scfc_de: Deleted /var/log/diamond/diamond.log on all Tools hosts to free up space on /var
June 3
17:50 Betacommand: Brief network outage. Source: not clearly determined yet; we aborted the investigation to roll back and restore service. As far as we can tell, there is something subtly wrong with the LACP switch configuration.
June 2
20:15 YuviPanda: create instance tools-trusty-test to test nginx proxy on trusty
19:00 scfc_de: zoomviewer: Set TMPDIR to /data/project/zoomviewer/var/tmp and ./webwatcher.sh; cannot see *any* temporary files being created anywhere, though. iipsrv.fcgi however has TMPDIR set as planned.
May 27
18:49 wm-bot: petrb: temporarily hardcoding tools-exec-cyberbot to /etc/hosts so that host resolution works
10:36 scfc_de: tools-webgrid-01: removed all files of tools.zoomviewer in /tmp
10:22 scfc_de: tools-webgrid-01: /tmp was full, removed files of tools.zoomviewer older than five days
07:52 wm-bot: petrb: restarted webservice of tool admin in order to purge that huge access.log
May 25
14:27 scfc_de: tools-mail: "rm -f /var/log/exim4/paniclog" to leave only relay_domains errors
May 23
14:14 andrewbogott: rebooting tools-webproxy so that services start logging again
14:10 andrewbogott: applying role::labs::lvm::biglogs on tools-webproxy because /var/log was full and causing errors
May 22
02:45 scfc_de: tools-mail: Enabled role::labs::lvm::biglogs, moved data around & rebooted.
02:36 scfc_de: tools-mail: Removed all jsub notifications from hazard-bot from queue.
01:36 scfc_de: tools-mail: Freezing all messages to Yahoo!: "421 4.7.1 [TS03] All messages from 208.80.155.162 will be permanently deferred; Retrying will NOT succeed. See http://postmaster.yahoo.com/421-ts03.html"
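Note: one way to do the freezing described above, assuming the stock Debian exim4 tools (the exact invocation used is not in the log):
  # -i prints matching message ids, -r matches on recipient address;
  # exim4 -Mf marks the given messages as frozen
  exiqgrep -i -r 'yahoo\.com' | xargs exim4 -Mf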
01:12 scfc_de: tools-mail: /var is full
May 20
18:34 YuviPanda: back to homerolled nginx 1.5 on proxy, newer versions causing too many issues
16:31 scfc_de: tools-webproxy: "iptables -A INPUT -p tcp \! --source 127/8 --dport 6379 -j REJECT" to block connections from other Tools instances to Redis
18:36 scfc_de: Pointed tools-dev.wmflabs.org at tools-dev.eqiad.wmflabs; cf. [[Bugzilla:62883]]
March 5
13:57 wm-bot: petrb: test
March 4
22:35 wm-bot: petrb: uninstalling it from -login too
22:32 wm-bot: petrb: uninstalling apache2 from tools-dev it has nothing to do there
March 3
19:20 wm-bot: petrb: shutting down almost all services on webserver-02 in order to make the system usable and finish the upgrade
19:17 wm-bot: petrb: upgrading all packages on webserver-02
19:15 petan: rebooting webserver-01 which is totally dead
19:07 wm-bot: petrb: restarting apache on webserver-02 it complains about OOM but the server has more than 1.5g memory free
19:03 wm-bot: petrb: switched local-svg-map-maker to webserver-02 because 01 is not accessible to me, hence I can't debug that
16:44 scfc_de: tools-webserver-03: Apache was swamped by requests for /guc. "webservice start" for that, and pkill -HUP -u local-guc.
12:54 scfc_de: tools-webserver-02: Rebooted, apache2/error.log told of OOM, though more than 1G free memory.
12:50 scfc_de: tools-webserver-03: Rebooted, scripts were timing out
12:42 scfc_de: tools-webproxy: Rebooted; wasn't accessible by ssh.
March 1
03:42 Coren: disabled puppet in pmtpa tool labs
February 28
14:46 wm-bot: petrb: extending /usr on tools-dev by 800mb
00:26 scfc_de: tools-webserver-02: Rebooted; inaccessible via ssh, http said "500 Internal Server Error"
February 27
15:28 scfc_de: chmod g-w ~fsainsbu/.forward
February 25
22:48 rdwrer: Lol, so, something happened with grrrit-wm earlier and nobody logged any of it. It was yoyoing, Yuvi killed it, then aude did something and now it's back.
February 23
20:46 scfc_de: morebots: labs HUPped to reconnect to IRC
February 21
17:32 scfc_de: tools-dev: mount -t nfs -o nfsvers=3,ro labstore1.pmtpa.wmnet:/publicdata-project /public/datasets; automount seems to have been stuck
15:24 scfc_de: tools-webserver-03: Rebooted, wasn't accessible by ssh and apparently no access to /public/datasets either
10:42 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list", cf. [[bugzilla:61583]])
10:28 scfc_de: Reset error status of task@tools-exec-09 ("can't get password entry for user 'local-voxelbot'"); "getent passwd local-voxelbot" works on tools-exec-09, possibly a glitch
February 19
20:21 scfc_de: morebots: Set "enable_twitter=False" in confs/labs-logbot.py and restarted labs-morebots
19:14 scfc_de: tools-login: Disabled crontab and pkill -HUP -u fatemi127
February 18
11:42 scfc_de: tools-mail: Rerouted queued mail (@tools-login.pmtpa.wmflabs => @tools.wmflabs.org)
11:34 scfc_de: tools-exec-08: Rebooted due to not responding on ssh and SGE
10:39 scfc_de: tools-mail: rm -f /var/log/exim4/paniclog ("User 0 set for local_delivery transport is on the never_users list" => probably artifacts from Coren's LDAP changes)
07:02 andrewbogott: merged a lint patch to the gridengine module. Should be a noop
January 16
17:11 scfc_de: tools-exec-09: "iptables-restore /data/project/.system/iptables.conf" after reboot
January 15
13:36 scfc_de: After reboot of tools-exec-09, all continuous jobs were successfully restarted ("Rr"); task jobs (1974113, 2188472) failed ("19 : before writing exit_status")
07:38 legoktm: restarting grrrit-wm; for some reason it reconnected and lost its cloak
December 23
18:30 marktraceur: restart grrrit-wm for subbu
December 21
06:50 scfc_de: tools-exec-01: Commented out duplicate MariaDB entries in /etc/apt/sources.list and re-ran apt-get update
December 19
17:22 marktraceur: deploying grrrit config change
December 17
23:19 legoktm: rebooted grrrit-wm with new config stuffs
December 14
18:13 marktraceur: restarting grrrit-wm to fix its nickname
13:17 scfc_de: tools-exec-08: Purged packages libapache2-mod-suphp and suphp-common (probably remnants from when the host was misconfigured as a webserver)
13:09 scfc_de: tools-dev, tools-login, tools-mail, tools-webserver-01, tools-webserver-02: rm /var/log/exim4/paniclog (mostly out of memory errors)
December 4
22:15 Coren: tools-exec-01 rebooted to fix the autofs issue; will return to rotation shortly.
16:33 Coren: rebooting webproxy with new kernel settings to help against the DDOS
December 1
14:05 Coren: underlying virtualization hardware rebooted; tools-master and friends coming back up.
November 25
21:03 YuviPanda: created tools-proxy-test instance to play around with the dynamicproxy
12:16 wm-bot: petrb: deswapping -login (swapoff -a && swapon -a)
November 24
07:19 paravoid: disabled crontab for user avocato on tools-login, see above
07:17 paravoid: pkill -u avocato on tools-login, multiple /home/avocato/pywikipedia/redirect.py DoSing the bastion
November 14
09:12 ori-l: Added aude to lolrrit-wm maintainers group
November 13
22:36 andrewbogott: removed 'imagescaler' class from tools-login because that class hasn't existed for a year - and a year ago is before that instance even existed, so what the heck?
November 3
16:49 ori-l: grrrit-wm stopped receiving events. restarted it; didn't help. then restarted gerrit-to-redis, which seems to have fixed it.
November 1
16:11 wm-bot: petrb: restarted terminator daemon on -login to sort out memory issues caused by heavy mysql client by elbransco
October 23
15:19 Coren: deleted tools-tyrant and tools-exec-cyberbot (cleanup of obsoleted instances)
October 20
18:52 wm-bot: petrb: everything looks better
18:51 wm-bot: petrb: restarting apache server on tools-webproxy
18:49 wm-bot: petrb: installed links on -dev; going to investigate what is wrong with the apaches. Coren, please update the documentation
October 15
21:03 Coren: labs-login rebooted to fix the ownership/take issue with success.
October 10
09:49 addshore: tools-webserver-01 is getting a 500 Internal Server Error again
September 23
06:44 YuviPanda: remove unpuppetized install of openjdk-6 packages causing problems in -dev (for bug: 54444)
05:15 legoktm: logging a log to test the log logging
05:13 legoktm: logging a log to test the log logging
September 11
09:39 wm-bot: petrb: started toolwatcher
August 24
18:00 wm-bot: petrb: freed 1600mb of ram by killing yasbot processes on -login
17:59 wm-bot: petrb: killing all python processes of yasbot on -login, this bot needs to run on grid, -login is constantly getting OOM because of this bot
August 23
12:17 wm-bot: petrb: test
12:15 wm-bot: petrb: making pv from /dev/vdb on new nodes
11:49 wm-bot: petrb: syncing packages of -login with exec nodes
11:48 petan: someone installed firefox on exec nodes, should investigate / remove
21:40 wm-bot: petrb: there is no lvm on -db, which we badly need - therefore no swap, nor storage for binary logs :( I've got a feeling that mysql will die of OOM soonish
21:39 wm-bot: petrb: db has 5% free RAM eeeek
18:36 wm-bot: root: removed a lot of 'audit' logs from exec-04; they were eating too much storage
18:23 wm-bot: petrb: temporarily disabling /tmp on exec-04 in order to set up lvm
18:23 wm-bot: petrb: exec-04 96% / usage, creating a new volume
12:33 wm-bot: petrb: installing redis on tools-mc
June 14
12:35 wm-bot: petrb: updating logsplitter to new version
June 13
21:59 wm-bot: petrb: replaced logsplitter on both apache servers with a far more powerful c++ version, thus saving a lot of resources on both servers
12:43 wm-bot: petrb: tools-webserver-01 is running a quite expensive python job (currently eating almost 1gb of ram); it may need to be fixed or moved to a separate webserver. Adding swap to prevent the machine from dying of OOM
12:22 wm-bot: petrb: killing process 31187 sort -T./enwiki/target -t of user local-enwp10 for same reason as previous one
12:21 wm-bot: petrb: killing process 31190 sort -T./enwiki/target of user local-enwp10 for same reason as previous one
12:17 wm-bot: petrb: killing process 31186 31185 69 Jun11 pts/32 1-13:14:41 /usr/bin/perl ./bin/catpagelinks.pl ./enwiki/target/main_pages_sort_by_ids.lst ./enwiki/target/pagelinks_main_sort_by_ids.lst because it seems to be a bot running on login server eating too many resources
08:45 wm-bot: petrb: updated /usr/local/bin/logsplitter on webserver-01 in order to fix bug 49383
08:25 wm-bot: petrb: fixing missing packages on exec nodes
June 9
20:44 wm-bot: petrb: moved logs on -login to separate storage
June 8
21:24 wm-bot: petrb: installing python-imaging-tk on grid
21:20 wm-bot: petrb: installing python-tk
21:16 wm-bot: petrb: installing python-flickrapi on grid
21:16 wm-bot: petrb: installing
16:49 wm-bot: petrb: turned off wmf style of vi on tools-dev; feel free to slap me :o or do cat /etc/vim/vimrc.local >> .vimrc if you love it
15:33 wm-bot: petrb: grid is overloaded, needs to be either enlarged or jobs calmed down :o
09:55 wm-bot: petrb: backporting tcl 8.6 from debian
09:38 wm-bot: petrb: update python requests to version 1.2.3.1
June 7
15:29 Coren: Deleted no-longer-needed tools-exec-cg node (spun off to its own project)
June 5
09:52 wm-bot: petrb: on -dev
09:52 wm-bot: petrb: moving /usr to separate volume expect problems :o
09:41 wm-bot: petrb: moved /var/log to separate volume on -dev
09:31 wm-bot: petrb: Houston, we have a problem: / on -dev is at 94%
09:28 wm-bot: petrb: installed openjdk7 on -dev
09:00 wm-bot: petrb: removing wd-terminator service
08:39 wm-bot: petrb: started toolwatcher
07:04 wm-bot: petrb: installing maven on -dev
June 4
14:49 wm-bot: petrb: installing sbt in order to fix b48859
13:28 wm-bot: petrb: installing csh on cluster
08:37 wm-bot: petrb: installing python-memcache on exec nodes
June 3
21:40 Coren: Rebooting -login; it's trashing. Will keep an eye on it.
14:15 wm-bot: petrb: removing popularity contest
14:11 wm-bot: petrb: removing /etc/logrotate.d/glusterlogs on all servers to fix logrotate daemon
09:43 wm-bot: petrb: syncing packages on exec nodes to avoid trouble with libs missing on some of them
June 2
08:39 wm-bot: petrb: installing ack-grep everywhere per yuvipanda and irc
June 1
20:57 wm-bot: petrb: installed these on exec nodes because they were on some and not on others: cpp-4.4 cpp-4.5 cython dbus dosfstools ed emacs23 ftp gcc-4.4-base iptables iputils-tracepath ksh lsof ltrace lshw mariadb-client-5.5 nano python-dbus python-egenix-mxdatetime python-egenix-mxtools python-gevent python-greenlet strace telnet time
10:46 wm-bot: petrb: created new instance for experiments with sasl memcache tools-mc
May 31
19:17 petan: deleting xtools project (requested by Cyberpower678)
17:24 wm-bot: petrb: removing old kernels from -dev because / is almost full
17:17 wm-bot: petrb: installed lsof to -dev
15:55 wm-bot: petrb: installed subversion to exec nodes 4 legoktm
15:47 wm-bot: petrb: replacing mysql with maria on exec nodes
15:14 wm-bot: petrb: installing default-jre in order to satisfy its dependencies
15:13 wm-bot: petrb: installing /data/project/.system/deb/all/sbt.deb to -dev in order to test it
13:04 wm-bot: petrb: installing bashdb on tools and -dev
12:27 wm-bot: petrb: removing project local-jimmyxu - per request on irc
10:54 wm-bot: petrb: killing process 3060 on -login (mahdiz 3060 1964 88 May30 ? 21:32:51 /bin/nano /tmp/crontab.Ht3bSO/crontab) it takes max cpu and doesn't seem to be attached
May 30
12:24 wm-bot: petrb: deleted job 1862 from queue (error state)
20:00 wm-bot: petrb: installing p7zip-full to -dev and -login
May 27
08:46 wm-bot: petrb: changed mysql config to use /mnt as the path for saving binary logs; this however requires the server to be restarted
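Note: the change described above amounts to a my.cnf fragment along these lines (the exact path under /mnt is an assumption), followed by a server restart:
  [mysqld]
  log_bin = /mnt/mysql-binlog/mysql-bin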
May 24
08:44 petan: setting up lvm on new exec nodes because it is more flexible and allows us to change the size of volumes on the fly
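Note: a minimal sketch of that kind of LVM setup (volume group name and size are assumptions; /dev/vdb matches the 'making pv from /dev/vdb' entry logged above):
  pvcreate /dev/vdb          # register the spare disk as a physical volume
  vgcreate vd /dev/vdb       # create a volume group on it
  lvcreate -L 10G -n tmp vd  # carve out a logical volume, resizable later
  mkfs.ext4 /dev/vd/tmp      # put a filesystem on it and mount as needed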
08:28 petan: created 2 more exec nodes, setting up now...
May 23
09:20 wm-bot: petrb: process 27618 on -login is constantly eating 100% of cpu, changing priority to 20
May 22
20:54 wm-bot: petrb: changing ownership of /data/project/bracketbot/ to local-bracketbot
14:28 labs-logs-bottie: petrb: installed netcat as well
14:28 labs-logs-bottie: petrb: installed telnet to -dev
14:02 Coren: tools-webserver-02 now live; / and /cluebot/ moved there
May 21
20:27 labs-logs-bottie: petrb: uploaded hosts to -dev
May 19
13:40 labs-logs-bottie: petrb: killing that nano process; it seems to be hung and unattached anyway
12:59 labs-logs-bottie: petrb: changed priority of nano process to 19
12:55 labs-logs-bottie: petrb: local-hawk-eye-bot's /bin/nano /tmp/crontab.d4JhUj/crontab is eating too much cpu
12:50 petan: nvm previous line
12:50 labs-logs-bottie: petrb: vul alias viewuserlang
May 14
21:22 labs-logs-bottie: petrb: created a separate volume for /tmp on -login so that temp files do not fragment the root fs or fill it up; it also makes it easier to track filesystem usage
13:16 Coren: reboot -dev, need to test kernel upgrade
May 10
15:08 Coren: create tools-webserver-02 for Apache 2.4 experimentation
May 9
04:12 Coren: added -exec-03 and -exec-04. Moar power!!1!
May 6
19:59 Coren: made tools-dev.wmflabs.org public
08:04 labs-logs-bottie: petrb: created a small swap on -login so that users cannot bring it to OOM so easily, and so that unused memory blocks can be swapped out in order to use the remaining memory more effectively
08:00 labs-logs-bottie: petrb: making lvm from unused disk from /mnt on -login so that we can eventually use it somewhere if needed
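Note: a sketch of the swap setup in the two entries above (volume group name and size are assumptions):
  lvcreate -L 2G -n swap vd  # small logical volume on the reclaimed disk
  mkswap /dev/vd/swap        # format it as swap
  swapon /dev/vd/swap        # and enable it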
May 4
17:50 labs-logs-bottie: petrb: foobar as well
17:47 labs-logs-bottie: petrb: removing project flask-stub using rmtool
15:33 labs-logs-bottie: petrb: fixing missing db user for local-stub
12:51 labs-logs-bottie: petrb: creating mysql accounts by hand for alchimista and fubar
May 2
20:49 labs-logs-bottie: petrb: uploaded motd to exec-N as well, with information about which server users are connected to
May 1
16:59 labs-logs-bottie: petrb: fixed invalid permissions on /home
April 27
18:54 labs-logs-bottie: petrb: installing pymysql using pip on whole grid because it is needed for greenrosseta (for some reason it is better than python-mysql package)
April 26
23:55 Coren: reboot to finish security updates
08:00 labs-logs-bottie: petrb: patching qtop
07:57 labs-logs-bottie: petrb: added tools-dev to admin host list so that qtop works, and fixed a bug in qtop
07:28 labs-logs-bottie: petrb: installing GE tools to -dev so that we can develop new j|q* stuff there
April 25
19:00 Coren: Maintenance over; systems restarted and should be working.
18:18 labs-logs-bottie: petrb: we are getting into trouble with memory on tools-db; there is less than 20% free memory
18:01 Coren: Begin maintenance (login disabled)
13:21 petan: removing local-wikidatastats from ldap