Friday, 2016-03-18

*** ajmiller has joined #openstack-infra-incident		02:23
*** ajmiller has quit IRC		03:15
*** lifeless has quit IRC		04:55
*** lifeless has joined #openstack-infra-incident		04:56
*** ajmiller_ has joined #openstack-infra-incident		14:18
*** ajmiller has joined #openstack-infra-incident		14:18
*** ajmiller_ has quit IRC		14:19
*** anteaya has joined #openstack-infra-incident		15:17
anteaya	so jesusaur has noticed a jenkins security advisory: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-24	15:18
anteaya	and right now yolanda is the only infra-root online	15:18
anteaya	fungi appears to be traveling with poor connectivity when he can get it	15:18
anteaya	and since jenkins posts the version in every footer, it seems we should discuss a plan to upgrade	15:19
yolanda	mordred should be around	15:20
yolanda	and i'd like to get opinion of more infra-root and jenkins experts like zaro	15:20
yolanda	i'm afraid of plugins not being compatible	15:20
anteaya	yolanda: good points	15:20
anteaya	and I agree with you	15:21
jeblair	start by installing the new version of jenkins on jenkins-dev	15:21
anteaya	morning jeblair	15:21
anteaya	yolanda: are you able to focus on installing the suggested version of jenkins on jenkins-dev?	15:22
yolanda	anteaya, i need to leave in a little over an hour at most	15:22
yolanda	children going out from school, and is the last day before holiday	15:23
anteaya	yolanda: okay well this seems to me a higher priority than firefox at the moment	15:23
anteaya	hmmmm	15:23
anteaya	okay thanks, I understand yolanda	15:23
yolanda	i can start trying on jenkins-dev but i cannot follow the issue until completion	15:24
jesusaur	is the jenkins version in puppet somewhere that I'm not seeing? how would jenkins-dev be upgraded?	15:24
anteaya	jesusaur: look at the bottom: https://jenkins.openstack.org/	15:25
yolanda	i think manually installing the war, jesusaur	15:25
anteaya	jesusaur: oh sorry, that wasn't the question you asked, my apologies	15:25
jesusaur	yolanda: ah	15:25
jeblair	hopefully one of clarkb, pleia2, fungi, jhesketh, mordred, pabelanger, SergeyLukjanov or nibalizer will be around by then.	15:25
anteaya	pleia2: is in singapore	15:25
anteaya	yolanda: can you get started and post what actions you are taking?	15:26
anteaya	so someone else can pick up from where you left off?	15:26
*** AJaeger has joined #openstack-infra-incident		15:26
yolanda	that will be the first time that i do an upgrade on upstream so i actually need to learn about it first	15:27
yolanda	i've done on downstream but i don't know if you follow same steps	15:27
mordred	morning	15:27
yolanda	mordred, jeblair, how do you do the upgrades? install new war manually?	15:27
anteaya	mordred: morning	15:27
anteaya	mordred: glad you are here	15:28
anteaya	mordred: we need to discuss upgrading our jenkins, starting with jenkins-dev: https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-24	15:28
mordred	anteaya: just waking up - haven't coffeed yet, so I may not be AWESOME at incident helping yet	15:28
anteaya	mordred: fair enough	15:28
mordred	let me start the coffee real quick	15:28
anteaya	yup	15:28
fungi	yeah, my reception is terrible at the moment and i'm about to be getting back on the road again	15:28
fungi	yolanda: apt-get install jenkins	15:29
anteaya	mordred: if you can help yolanda learn how we upgrade a jenkins that would be a great place to start	15:29
fungi	well, sudo apt-get install jenkins	15:29
mordred	anteaya: my personal process is "cower in terror and cry a lot"	15:29
fungi	but on jenkins-dev first as jeblair said	15:29
anteaya	mordred: good to know	15:29
anteaya	fungi: glad to have your help when we can get it, thank you	15:29
anteaya	fungi: safe travels	15:29
mordred	yolanda: yah - we do jenkins-dev first just to make sure that the jenkins people haven't broke us	15:30
*** ChanServ changes topic to "https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-02-24"		15:31
fungi	i think you _may_ want to also confirm that's pulling in the lts version and not the dev version. if it ends up with a dev version then you may need to check their apt repository for the corresponding package for the lts version instead, wget that and dpkg -i it to install	15:31
anteaya	fungi: thank you, I assumed we were running their lts version, but glad to have the confirmation	15:31
fungi	we tend to avoid the dev releases of jenkins because of stability concerns	15:31
yolanda	ok going to try that on jenkins-dev	15:32
yolanda	i have a fear about gear plugin	15:32
anteaya	yolanda: thank you	15:32
anteaya	yup, good thing it is the -dev server	15:33
jeblair	https://review.openstack.org/271543 is the change to fix the gearman plugin	15:35
yolanda	ok fingers crossed	15:35
anteaya	yolanda: go you	15:35
jeblair	it claims to have passed unit tests, but if you actually read the console log at http://logs.openstack.org/43/271543/3/check/gate-gearman-plugin-build/d49cb94/console.html.gz there are errors	15:36
yolanda	jeblair so this needs to go first...	15:36
yolanda	jenkins-dev is starting now but it should fail	15:36
jeblair	yolanda: the commit message says gearman function registration will fail; so we may need to check on that	15:37
yolanda	jeblair i remember having this problem on my tests with latest versions	15:37
anteaya	yolanda: what version was pulled with apt?	15:37
yolanda	1.642.3	15:38
anteaya	thanks	15:38
yolanda	so jeblair, how do we test? we have some test jobs we can trigger in zuul to point to jenkins-dev?	15:39
jeblair	oh wow, this exists now: https://wiki.jenkins-ci.org/display/JENKINS/Slave+To+Master+Access+Control/	15:39
fungi	yeah, they added that a year or so ago	15:40
fungi	finally	15:40
anteaya	acls for jenkins	15:40
fungi	anyway, i need to disappear again for a few hours. hope to be back to the internet around 1900z or a little after	15:41
anteaya	fungi: thank you	15:41
anteaya	best of luck with whatever you are doing	15:41
yolanda	jeblair, jenkins-dev is up, but how can we better check about gear interaction , and zmq?	15:42
jeblair	yolanda: i'll clean up zuul-dev; you can reconfigure jenkins-dev to talk to gearman on zuul-dev.	15:42
anteaya	zaro: once you are online, any suggestions you could make are welcome	15:43
yolanda	jeblair, to talk with zuul gear in production?	15:43
jeblair	yolanda: zuul-dev	15:44
yolanda	ok i updated the url	15:45
jeblair	the stop and set description jobs are registered, but no real jobs	15:48
jeblair	that suggests this version is also incompatible	15:48
yolanda	so i rollback it?	15:49
yolanda	or doesn't it matter to have that on dev for future tests?	15:49
jeblair	no, now we need to land 271543	15:49
jeblair	i have rechecked to see if it will pass tests	15:49
yolanda	ok	15:50
jeblair	i will also go ahead and pull the build off of that node so we don't have to wait for it to actually land	15:53
jeblair	it failed with the same errors, but i'm going to go ahead and install it on jenkins-dev and see what happens	15:54
yolanda	ok	15:55
jeblair	it's installed and restarted and many more functions are registered now	15:56
yolanda	can you trigger some job on zuul-dev to test the full workflow?	15:57
jeblair	let me see how hard that would be	15:58
nibalizer	i am touristing in france	15:58
nibalizer	so pretty useless	15:58
nibalizer	sorry	15:58
yolanda	lucky you :)	15:58
jeblair	nibalizer: go enjoy france!	15:59
anteaya	nibalizer: thanks for letting us know	15:59
nibalizer	kk	16:00
jeblair	zuul-dev is configured to only run noop; i'll manually reconfigure it to run a real job real quick	16:00
nibalizer	good luck :hugops:	16:00
anteaya	nibalizer: thanks, have fun	16:03
jeblair	https://jenkins-dev.openstack.org/view/zaro/job/check-gtest-org/8/console	16:04
jeblair	that looks reasonably good	16:05
jeblair	the content of the job is all wrong which is what the error is about	16:05
jeblair	but it ran it	16:05
anteaya	it did	16:05
jeblair	i think we should now shut down a jenkins master and upgrade it	16:06
jeblair	i will put 07 and 06 in shutdown mode	16:07
jeblair	actually, first i will restart firefox	16:07
yolanda	i need to go for a couple of hours	16:08
anteaya	yolanda: thanks for your help	16:08
jeblair	yolanda: thanks, it sounds like mordred will be online soon.	16:08
anteaya	yolanda: enjoy your next thing	16:08
yolanda	anteaya, family duties. Have a nice weekend	16:09
anteaya	yolanda: thanks you too	16:09
jeblair	anteaya: did the ubuntu-trusty cleanup change land yesterday? and did zuul get restarted?	16:13
jeblair	jenkins06 and jenkins07 are in quiesce mode	16:13
anteaya	sorry will look	16:16
anteaya	we reverted clarkb's move puppet jobs to ubuntu-trusty yesterday: https://review.openstack.org/#/c/294199/	16:17
anteaya	as there are issues with the images, as in it doesn't contain cron for example	16:17
jeblair	wow, crontab	16:17
anteaya	yeah	16:17
jeblair	that is something i would support in a base image :)	16:18
anteaya	and there is a jjb yaml parsing item	16:18
anteaya	jeblair: yay another convert	16:18
anteaya	ha ha ha	16:18
anteaya	and did zuul get restarted, yes it did	16:18
anteaya	fungi restarted it late in the day	16:18
jeblair	anteaya: the second link in your comment doesn't work (it was to jenkins, not logs.o.o)	16:18
anteaya	I'm wrong, zuul didn't appear to get restarted: http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-03-18.log.html#t2016-03-18T00:39:10	16:19
anteaya	jeblair: ah sorry about that broken comment link	16:20
jeblair	okay, i find it strange that we would run for a whole day without restarting zuul when it's having such a big impact on our quota	16:20
jeblair	i will restart it now	16:20
anteaya	thanks	16:20
anteaya	it was discussed but the timing was to restart at the end of the day	16:21
jeblair	okay, i will be a softie and wait just a few mins for the top 5 changes to merge	16:22
AJaeger	jeblair, anteaya : Last words from fungi last night: He didn't restart and asked somebody else to do it...	16:22
AJaeger	He ran out of time ;(	16:22
jeblair	AJaeger: if last night is anything like now, i guess no one else was around	16:22
AJaeger	jeblair: Last night: 12+ hours ago	16:23
* jeblair is starting to think the more roots we are the fewer we have		16:23
anteaya	I'm noticing the same thing	16:23
anteaya	AJaeger: yup, found that in the logs, thanks	16:24
jeblair	since i'm throwing in a zuul restart, i'll go ahead and quiesce jenkins01 and jenkins02	16:26
anteaya	ack	16:26
jeblair	then i'll be able to knock out half the system at a time	16:26
anteaya	wonderful	16:27
AJaeger	is anything in the release queue? Should we give release team a heads-up?	16:27
anteaya	post does have 70 changes	16:27
jeblair	i don't see anything in release	16:28
anteaya	oh sorry, release not post	16:28
* AJaeger gives them a heads-up now...		16:29
anteaya	thanks	16:29
jeblair	AJaeger: ++	16:29
jeblair	this is the slowest job ever	16:31
anteaya	:(	16:31
*** pabelanger has joined #openstack-infra-incident		16:45
pabelanger	o/	16:45
anteaya	hello pabelanger	16:45
pabelanger	just catching up on backscroll for #openstack-infra	16:45
anteaya	yup	16:46
pabelanger	While on PTO today, I have some time for the next 2 hours to help out if needed	16:46
pabelanger	kids down for naps	16:46
clarkb	I am only just now waking up, will be a bit before I can help but looks like jenkins and gearman plugin upgrades in process	16:46
anteaya	thanks pabelanger but you should be time offing if you are on time off	16:47
anteaya	guess it just would be helpful to know in advance so not everyone is off on the same day if possible	16:47
anteaya	clarkb: welcome	16:47
pabelanger	anteaya: don't mind offering if there is an issue ATM	16:48
anteaya	yup, understood and thanks	16:48
pabelanger	anteaya: agreed. Not sure the protocol for PTO upstream, I have no issues sending out a notification in advance, just need to figure out where :)	16:49
anteaya	well in the past we have just said in channel	16:49
anteaya	I'll be away tomorrow	16:49
anteaya	has worked well thus far when practiced	16:49
jeblair	i will be away tomorrow ;)	16:52
jeblair	also the day after	16:52
anteaya	awesome	16:52
* anteaya makes a note		16:52
jeblair	back on monday :)	16:52
anteaya	woooo	16:52
jeblair	why is this job still running?!	16:53
anteaya	:(	16:53
jeblair	did we add vexxhost to grafana?	16:53
clarkb	jeblair: yes it and osic are both in grafana	16:53
jeblair	clarkb: oh! it's in the dropdown but not on the main page	16:54
jeblair	there is no indication that the list of dashboards is not complete. :(	16:54
pabelanger	ya, we need to rejig the front-page to include them	16:55
jeblair	it looks like it generally runs tempest jobs in a reasonable time, but there is a 2h spike. i wonder if this is another similar spike.	16:56
clarkb	jeblair: are you just waiting for jobs to finish on 01 02 06 and 07?	17:10
AJaeger	jeblair: did you restart zuul? ttx wants to know on #openstack-release...	17:12
jeblair	nope, still waiting on https://jenkins04.openstack.org/job/gate-tempest-dsvm-full/16564/consoleFull to finish	17:13
jeblair	when ^ finishes, i'll restart zuul; if one of the masters finishes first (unlikely?), i'll upgrade it	17:13
* anteaya checks the water in the basement		17:14
jeblair	wow, this run has taken more than 2 hours	17:22
anteaya	:(	17:24
*** dims_ has joined #openstack-infra-incident		17:28
* dhellmann catches up on scrollback		17:33
anteaya	dhellmann: hello	17:33
dhellmann	hi, anteaya	17:33
dhellmann	just to make sure I have the situation clear, you're working on upgrading jenkins still, correct?	17:34
jeblair	yes, and about to restart zuul	17:34
dhellmann	cool, thanks, I'll keep an eye on this channel	17:35
anteaya	dhellmann: great	17:35
jeblair	zuul is stopped	17:35
jeblair	deleted all nodepool nodes	17:36
jeblair	starting zuul	17:36
jeblair	(rather: set all nodepool nodes to be deleted)	17:36
jeblair	re-enqueuing changes	17:36
clarkb	jeblair: all nodepool nodes or just devstack-trusty?	17:37
jeblair	clarkb: all -- i'm restarting zuul	17:37
clarkb	oh right all the jobs will go off in never never land	17:38
jeblair	right that	17:38
jeblair	it's eventually consistent, but this speeds up freeing up the 4 masters and gives us more available quota	17:38
jeblair	everything is re-enqueued and 01,02,06,07 are idle	17:40
jeblair	i gave the all-clear in -release	17:41
jeblair	i'm going to upgrade jenkins07 now	17:41
clarkb	ok	17:41
jeblair	removed all slaves from config.xml	17:44
jeblair	SEVERE: Failed to create slave log directory /var/lib/jenkins/logs/slaves/devstack-centos7-internap-nyj01-8810925	17:45
jeblair	neat	17:45
jeblair	okay	17:46
jeblair	somehow the gearman plugin is working	17:46
jeblair	i have not upgraded it	17:46
anteaya	odd	17:46
jeblair	i have quiesced jenkins07 to stop further jobs from running there	17:47
clarkb	07 claims to be the old version	17:47
clarkb	oh maybe my html just needed more of a refresh	17:47
clarkb	now its new and shiny	17:47
jeblair	Mar 18, 2016 5:47:27 PM org.gearman.common.GearmanJobServerSession handleReqSessionEvent	17:48
jeblair	INFO: Session GearmanJobServerSession:1038:GearmanNIOJobServerConnection:zuul.openstack.org/162.242.150.96:4730 handling a REQ/CAN_DO event	17:48
jeblair	Mar 18, 2016 5:47:27 PM org.gearman.common.GearmanJobServerSession submitTask	17:48
jeblair	INFO: Session GearmanJobServerSession:1038:GearmanNIOJobServerConnection:zuul.openstack.org/162.242.150.96:4730 is now handling the task GearmanTask:null:1213a77b-4535-448f-ab26-d266eca4859d	17:48
jeblair	also, those logs are insane	17:48
clarkb	jeblair: was not upgrading the plugin intentional for testing?	17:50
jeblair	clarkb: nope. dpkg helpfully restarted jenkins	17:50
jeblair	i think the order needs to be install new plugin, then apt-get upgrade	17:51
jeblair	er, apt-get install	17:51
clarkb	gotcha	17:51
jeblair	the reason it can't create the slave log directory is that there are too many entries in that directory already	17:51
jeblair	i expect that has been the case for some time, and we probably don't really care that much atm, so i'm just going to leave it	17:51
clarkb	kk	17:52
clarkb	I can probably whip up a find to clean out anything older than say 2 months	17:52
jeblair	so... given that i have no idea what the deal is with the gearman plugin; should we just go ahead and upgrade it anyway?	17:52
clarkb	yes I think upgrade sounds safest	17:53
anteaya	I can support that	17:53
anteaya	on dev we saw the upgraded plugin worked	17:53
jeblair	okay, i will nodepool delete all of the jobs running on 07, then upgrade gearman plugin there	17:55
jeblair	the nodepool delete will cause them to be re-enqueued in zuul	17:55
anteaya	ack	17:55
clarkb	find /var/lib/jenkins/logs/slaves -mtime +60 -delete ?	17:59
clarkb	jeblair: ^ should I go ahead and run that on the 8 masters?	17:59
jeblair	clarkb: sure; we could probably do mtime +7 really...	18:00
clarkb	jeblair: it seems to have decided that all of the slaves dir itself should go away	18:02
jeblair	clarkb: you mean /var/lib/jenkins/logs/slaves or all of the entries in there?	18:02
clarkb	/var/lib/jenkins/logs/slaves	18:02
jeblair	clarkb: neat	18:03
jeblair	clarkb: it was jenkins:nogroup	18:03
clarkb	thanks	18:03
clarkb	recreated	18:03
jeblair	i have uploaded the gearman-plugin	18:04
clarkb	bah because the first output of that is that dir	18:04
clarkb	will do this better on not 07 :)	18:04
jeblair	i'm going to restart jenkins on 07 now	18:04
clarkb	kk	18:05
clarkb	find /var/lib/jenkins/logs/slaves/ -mindepth 1 -mtime +60 -delete appears to be what we want	18:06
clarkb	am running that on 06 now	18:06
jeblair	i'm going to create a new default view for all of the masters like this: https://jenkins01.openstack.org/	18:07
jeblair	it has no jobs in	18:07
jeblair	it	18:07
clarkb	++	18:08
jeblair	which means that when we visit one of the masters, we don't have to wait 5 minutes for the list of 8000 jobs to load	18:08
clarkb	yup that above find is much happier, it removed all contents of the dir anyways so it was more than 2 months ago when we ran out of room for new dirs	18:08
clarkb	but left the dir in place	18:08
clarkb	will do the remainder of the masters now	18:08
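A minimal Python sketch of the cleanup clarkb describes, for illustration only: it removes entries older than a cutoff while leaving the top-level directory in place, which is the effect `-mindepth 1` had in the corrected `find`. The function name `prune_old_entries` and its parameters are hypothetical, and the age test only approximates `find -mtime +60`, which rounds ages to whole days.

```python
import os
import time


def prune_old_entries(root, max_age_days=60, now=None):
    """Remove old files and old empty directories below root, never root itself.

    Returns the number of files removed. Hypothetical helper, not the
    command that was actually run on the masters.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400

    # Snapshot candidates first: deleting a child updates its parent's mtime,
    # which would otherwise make every parent directory look recent.
    old_files, old_dirs = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_mtime < cutoff:
                old_files.append(path)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_mtime < cutoff:
                old_dirs.append(path)

    for path in old_files:
        os.remove(path)

    # A child path is always longer than its parent, so this visits the
    # deepest directories first. Directories that still hold newer entries
    # (or are symlinks) fail to rmdir and are left alone.
    for path in sorted(old_dirs, key=len, reverse=True):
        try:
            os.rmdir(path)
        except OSError:
            pass

    return len(old_files)
```

Because `root` is only ever walked and never added to the candidate lists, the directory itself survives even when everything inside it is stale, which avoids the problem seen on 07.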
anteaya	clarkb: yay	18:09
anteaya	jeblair: sounds good	18:09
jeblair	oh, are slave config files now outside of the main config?	18:11
clarkb	jeblair: they werent on the old version but thats good news if so	18:12
clarkb	jeblair: looks like /var/lib/jenkins/nodes	18:15
jeblair	yep	18:15
jeblair	i'm watching a build to make sure the whole lifecycle works	18:15
jeblair	https://jenkins07.openstack.org/job/gate-horizon-npm-run-test/2838/console is finished	18:15
jeblair	it ran on https://jenkins07.openstack.org/computer/ubuntu-trusty-bluebox-sjc1-8878183/	18:15
jeblair	which is offline now (good)	18:16
jeblair	nodepool got the complete event for that and will delete it shortly	18:16
anteaya	I hadn't seen a gui for a single node before, thanks	18:17
clarkb	jeblair: this is with upgraded gearman and upgraded jenkins right?	18:18
jeblair	clarkb: yes	18:18
jeblair	scp console log looks good: http://logs.openstack.org/14/289314/21/check/gate-horizon-npm-run-test/bbe19e1/console.html	18:18
jeblair	jenkins node was just deleted (good)	18:19
jeblair	regular scp looks good: http://docs-draft.openstack.org/57/292957/1/check/gate-storyboard-webclient-js-draft/a4f46f9/dist/#!/page/about	18:20
jeblair	(different job from jenkins07)	18:20
jesusaur	where is the new plugin hosted? and will a new release be tagged?	18:20
jeblair	that's all i can think of; anything else we should check?	18:20
clarkb	/monitoring 404s	18:20
jeblair	clarkb: neat	18:21
jeblair	jesusaur: i approved the change earlier, that should at least result in a branch-tip build	18:21
clarkb	maybe I need to login first and won't get redirected automagically anymore	18:21
jeblair	i'm logged in and it 404s too	18:21
clarkb	nope still 404s ya	18:21
jeblair	monitoring used to be in the manage menu but is gone	18:22
clarkb	maybe we just need to upgrade the plugin	18:22
clarkb	https://wiki.jenkins-ci.org/display/JENKINS/Monitoring has newer releases so presumably it should still work if upgraded	18:22
jeblair	it is not listed under installed plugins on 07	18:23
jeblair	i will install the current version	18:23
jeblair	shall i delete all nodes from 07 and restart?	18:24
anteaya	I support that	18:24
clarkb	btw jenkins02 cleanup of slave logs is far slower than the higher numbered jenkinses. I think the reason this one is so much slower is that its IO isn't as good	18:24
clarkb	jeblair: sure, then we can bundle an upgrade of javamelody into the later restarts	18:24
anteaya	yes fungi said 01 and 02 are the slowest	18:24
anteaya	since they were the first and are probably on the slowest vms so should be upgraded	18:25
anteaya	when someone finds the time...	18:25
jeblair	clarkb: weirdly 01 and 02 are the fastest at rendering the overview dashboards...	18:25
anteaya	really?	18:25
jeblair	okay, all clear in #-release, so i'm restarting 07 again now	18:26
*** asselin has joined #openstack-infra-incident		18:26
jeblair	anteaya: i confirmed our rax errors are likely because our ord quota is insufficient for the request antonym made	18:32
jeblair	Forbidden: Quota exceeded for ram: Requested 8192, but already used 516096 of 512000 ram (HTTP 403) (Request-ID: req-c7dc80bb-6798-4783-9165-ea4155e8410b)	18:32
anteaya	jeblair: ah okay	18:32
anteaya	jeblair: I can offer a fix, any suggestions for quota?	18:33
jeblair	anteaya: antonym is increasing our ord quota	18:33
anteaya	ord is currently 195	18:33
anteaya	oh okay, that is even better	18:33
jeblair	okay, 07 is idle again, i will stop it, remove the leaked node config directories, and start again	18:35
anteaya	do I calculate correctly that a quota of 512000 ram is 62 nodes?	18:36
jeblair	yep	18:36
anteaya	thanks	18:36
jeblair	anteaya: we set max-servers a few lower than that in order to have headroom for building images, etc.	18:36
anteaya	ack	18:36
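The quota arithmetic above can be checked with a short Python sketch. The 512000 and 8192 figures come from the error message quoted earlier; the function name `max_nodes` and the `headroom` parameter are hypothetical, and the headroom value used below is only an example, not the actual max-servers setting.

```python
def max_nodes(ram_quota_mb, ram_per_node_mb, headroom=0):
    """Whole nodes that fit in a RAM quota, minus nodes reserved as headroom."""
    return max(ram_quota_mb // ram_per_node_mb - headroom, 0)


# 512000 MB of quota at 8192 MB per node leaves room for 62 whole nodes.
print(max_nodes(512000, 8192))

# The "already used" figure in the error, 516096 MB, is exactly 63 such nodes.
print(516096 // 8192)

# Reserving a few nodes for image builds (example value, not the real setting).
print(max_nodes(512000, 8192, headroom=5))
```

Integer division is used because a partial node cannot be booted, which is why 62.5 rounds down to 62.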
jeblair	clarkb: https://jenkins07.openstack.org/monitoring exists	18:37
anteaya	so the original quota of 55 seems to have been the max without a rax increase	18:37
anteaya	I can sign into jenkins but I can't see the monitoring dashboard	18:37
clarkb	jeblair: confirmed that monitoring is working for me	18:38
anteaya	guess I'll just have to listen to the stories	18:38
clarkb	and if I log out it forces me to log in	18:38
jeblair	anteaya: yeah, it's limited to admins since it lets you kill threads which would do damage	18:38
anteaya	agreed	18:38
jeblair	are we ready to proceed to 1,2,6 now?	18:38
anteaya	clarkb: thou shalt be known!	18:38
anteaya	I have no objection to proceeding	18:39
jeblair	clarkb: ?	18:43
clarkb	oh ya I have no objections	18:43
clarkb	01-07 have the logs/slaves cleaned out	18:44
jeblair	cool, will proceed; i'll upgrade monitoring plugin, gearman plugin, then restart	18:44
jeblair	er, then jenkins, then restart :)	18:44
anteaya	ack	18:44
jeblair	06 is starting	18:47
anteaya	yay	18:47
jeblair	06 is up; monitoring exists...	18:48
anteaya	wooooo	18:49
jeblair	nodes are being added	18:49
jeblair	and its running jobs	18:49
jeblair	i think before i do 01 and 02, i will pause and do jenkins.o.o	18:49
nibalizer	hello	18:49
jeblair	since that's the one we actually care about	18:49
anteaya	makes sense	18:49
nibalizer	i have returned to a computer, are there things I can help with?	18:49
anteaya	nibalizer: hi there	18:50
anteaya	I'll let jeblair respond when he has a moment	18:50
anteaya	as I don't know	18:50
clarkb	jeblair: ok I am about to clean out the slave logs on jenkins.o.o	18:50
clarkb	I don't think that will interfere with your upgrade process	18:50
jeblair	clarkb: thanks and i agree	18:50
jeblair	nibalizer: i think we've got it covered now	18:52
clarkb	all done	18:52
nibalizer	jeblair: fantastic	18:52
anteaya	nibalizer: how is the touristing going?	18:52
nibalizer	anteaya: pretty good	18:52
anteaya	awesome	18:53
nibalizer	im exhausted and have lots of pictures	18:53
anteaya	have you had crepes?	18:53
anteaya	nibalizer: sounds like you're doing it right	18:53
nibalizer	anteaya: no i should get some crepes!	18:53
nibalizer	wait no i did have crepes	18:53
anteaya	obviously memorable	18:53
anteaya	were they any good?	18:53
nibalizer	ya	18:54
anteaya	nice	18:54
nibalizer	it was street vendor crepes during the jet lag march of death	18:54
anteaya	ah	18:54
anteaya	restaurant crepes	18:54
anteaya	to savour	18:54
* anteaya fondly remembers crepes from normandy		18:55
clarkb	I think the zuul restart earlier may have confused the nodepool image uploads	18:55
clarkb	supposedly we have 42 images uploading but I can't find any evidence that is in progress	18:56
jeblair	we have the all-clear from release to upgrade jenkins.o.o	18:56
jeblair	it's restarting	18:56
clarkb	confirmed with geard that none of the uploads are actually queued or running	18:59
clarkb	I am going to restart nodepool-builder	18:59
clarkb	and then probably have to clean out nodepool itself	18:59
clarkb	actually nodepool-builder restart probably not necessary as there are 4 workers for each upload	18:59
jeblair	clarkb: i think we knew there were still some edge cases in the builder job code; this may be one of them.	18:59
clarkb	ya	19:00
jeblair	jenkins.o.o is up	19:00
anteaya	yay	19:00
AJaeger	great	19:00
clarkb	the simple fix I think is to restart nodepoold	19:00
clarkb	since that is supposed to clear any building images out of the db	19:00
jeblair	now i will do 01 and 02	19:01
clarkb	so maybe lets wait for jenkins stuff to settle down a bit before restarting nodepoold	19:01
jeblair	01 and 02 are restarting	19:05
jeblair	01 is up	19:05
jeblair	02 is up	19:07
jeblair	01 is getting nodes and jobs	19:08
AJaeger	that went smooth, great!	19:08
clarkb	jeblair: let me know when you think it is relatively safe to restart nodepool	19:08
jeblair	AJaeger: still have 3 more masters to go, but it's pretty mechanical at this point	19:08
jeblair	02 is getting nodes and jobs	19:09
* AJaeger leaves now for the weekend! Hope that's the last fire for today ;)		19:09
jeblair	clarkb: https://jenkins04.openstack.org/monitoring does not exist	19:10
jeblair	which is weird.	19:11
clarkb	jeblair: huh, it hasnt been upgraded	19:11
jeblair	clarkb: oh sorry, yeah, i haven't started on 3,4,5 yet	19:11
jeblair	clarkb: i was just taking stock of the old state	19:11
clarkb	right just noticing that its the old version so should be there	19:11
jeblair	i have created the new 'Default' dashboard on them and just now put them in quiesce mode	19:11
jeblair	clarkb: ok yep.	19:12
clarkb	03 has it	19:12
jeblair	and 5	19:12
jeblair	clarkb: i'm thinking of doing the 'delete nodes from under jenkins' trick for 3-5 and just knocking this out. what do you think?	19:13
clarkb	jeblair: seems reasonable, better to get it done than spend all day on it	19:13
anteaya	I support that	19:13
clarkb	then when it is done we can restart nodepool and get new images up	19:13
pabelanger	clarkb: jeblair I did https://review.openstack.org/#/c/294339/ last night (exposed upload_workers), but haven't tested the uploads yet.	19:15
jeblair	we have clearance from release, so i'm going to upgrade 3-5 now	19:15
pabelanger	should be able to do that on Monday	19:15
clarkb	the one thing we didn't test was jjb ? just thinking of that now	19:15
jeblair	pabelanger: awesome!	19:15
clarkb	but I think other people are successfully using jjb on a variety of jenkins versions so not very worried	19:15
jeblair	clarkb: heh, true. and i think we can work through it if there are problems	19:16
pabelanger	okay, kids are waking up for naps, back to PTO shortly. I'll check back in later tonight.	19:16
pabelanger	from*	19:16
anteaya	pabelanger: thanks for your help	19:16
jeblair	hrm, 04 is somewhat unresponsive	19:19
jeblair	it got better	19:20
anteaya	heh	19:21
jeblair	deleting nodes now	19:21
clarkb	number of threads is keeping steady on the upgraded jenkinses	19:24
jeblair	upgrading 03 now	19:26
jeblair	and 04	19:27
jeblair	and 05	19:28
jeblair	03 is up	19:29
jeblair	and 04	19:29
jeblair	and 05	19:29
jeblair	03 is getting nodes and jobs	19:30
jeblair	i have quiesced 04	19:31
jeblair	04 did not end up with the monitoring plugin installed	19:31
anteaya	ah	19:32
jeblair	05 is getting nodes and jobs	19:32
anteaya	so just respinning 04 remaining?	19:32
jeblair	re-re-installing monitoring plugin on 04	19:32
anteaya	ack	19:33
jeblair	restarting 04	19:33
jeblair	quiescing 04 again	19:34
clarkb	a start-in-shutdown mode would be nice	19:34
jeblair	no kidding	19:34
jeblair	this time it snagged some nodes; i'm deleting them	19:36
jeblair	http://paste.openstack.org/show/491153/	19:36
jeblair	is the error	19:36
clarkb	wow heap issues	19:36
clarkb	possible the plugin data is corrupt?	19:37
jeblair	clarkb:	19:38
jeblair	FYI, there is a proprietary plugin in Jenkins Enterprise by CloudBees	19:38
jeblair	for this purpose.	19:38
jeblair	https://groups.google.com/forum/#!topic/jenkinsci-dev/Gr2QOxSl7_8	19:38
jeblair	though, someone did post a groovy script to do it	19:38
jeblair	https://wiki.jenkins-ci.org/display/JENKINS/Post-initialization+script	19:38
clarkb	wow	19:39
clarkb	value add	19:39
jeblair	clarkb: the /etc/default/jenkins file on 04 does not match that of 03	19:39
jeblair	03 has -Xmx12g; 04 does not have an Xmx arg	19:40
clarkb	huh	19:40
clarkb	puppet appears to have run on 04 recently	19:41
jeblair	ah....	19:41
clarkb	hrm 04 has it now	19:42
clarkb	maybe this is the package racing puppet or something?	19:42
jeblair	yeah, maybe a dpkg puppet race	19:42
jeblair	i will restart 04 now	19:42
jeblair	(ansible confirms all /etc/default/jenkins files are identical)	19:42
jeblair	what it failed again	19:44
anteaya	:(	19:44
jeblair	quiesced	19:44
clarkb	huh same error with the heap?	19:45
jeblair	yep	19:45
jeblair	i see -Xmx12g in the command line	19:45
clarkb	/tmp/hsperfdata_jenkins is the melody data I think	19:46
clarkb	but its mostly empty	19:46
jeblair	deleting 04 nodes again	19:46
jeblair	i need to lunch now; i'm going to leave 04 as-is if clarkb or anyone else wants to look into it	19:47
clarkb	ok	19:47
clarkb	it is trying to copy an array it looks like and that hits the error	19:49
anteaya	is it worth it to spin up a fresh vm and install jenkins on it and call it 04?	19:50
clarkb	no we should figure out why it is doing this	19:50
anteaya	okay	19:51
clarkb	the md5sums of the .jpi files differ	19:54
clarkb	jeblair: I think whatever process you were using to update isn't functioning. We can try copying the .jpi in from another master	19:54
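A small Python sketch of the kind of checksum comparison clarkb mentions, assuming copies of the two plugin files are available locally. The function names `md5_of_file` and `files_match` are hypothetical, and the log does not say how the checksums were actually computed.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """True when both files have identical contents according to MD5."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

A mismatch between the copy from a working master and the copy from 04 would point at 04 holding a different plugin build, which is what the differing sums suggested here.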
anteaya	wonder why that would only cause an issue on 04	19:55
fungi	i am back and starting to catch up, though i may also have to pop back out again very briefly at 2130z (but not for long)	19:55
anteaya	welcome back	19:55
fungi	and yeah, zuul was still not caught up when i started running out of steam last night, sounds like there wasn't really anybody else around in shape to restart it after that either	19:56
anteaya	fungi: yeah didn't appear so	19:56
anteaya	jeblair observed that the more root folks we have the less we seem to have	19:57
clarkb	jeblair: my suggestion would be to do that, copy the jpi in from 07, rm plugins/monitoring/ on 04 and restart 04	19:57
fungi	so jenkins 1.642.3 is the latest lts. interesting they've changed up the theming on the webui	19:58
clarkb	jeblair: but will wait for your lunch to conclude so we can figure out what might be wrong with the process that was previously used	19:59
clarkb	fungi: basic sitrep is zuul has been restarted (which broke nodepools gearman state so it needs a restart too, then manual uploads of ubuntu-trusty for uuidgen updates) and every master but jenkins04 is fully operational on new jenkins lts with new gearman plugin version	20:03
clarkb	fungi: the monitoring plugin install is unhappy on 04; it appears to be due to trying to use the old version against new jenkins lts	20:03
clarkb	I intend on restarting nodepoold as soon as the jenkinses are happy	20:03
clarkb	and will requeue ubuntu-trusty uploads	20:05
fungi	okay, sounds better than it could be i guess	20:08
fungi	and makes sense that nodepool image updates might be disrupted by its gearman disappearing out from under it at the moment	20:09
clarkb	it should mostly handle that but there are still a few corner cases	20:09
fungi	and yeah, yesterday was mostly a bust for getting anything done other than dealing with the openstackid/zanata situation	20:09
clarkb	the fix for uuidgen did get in we just haven't managed to upload any of those images yet from what I can see	20:10
fungi	also yolanda and krotscheck seem to have discovered that missing dbus on the minimal images was keeping firefox from working	20:22
fungi	or i think that was the conclusion they had arrived at	20:23
anteaya	yes	20:23
anteaya	I replied in -infra to keep the conversation about anything not incident related in there	20:24
anteaya	for logs searching purposes	20:25
clarkb	-rw-rw-r-- 1 jenkins jenkins 0 Mar 18 20:03 plugin.ini	20:27
clarkb	that's extra weird	20:27
clarkb	er ww	20:27
fungi	what's weird? the fact that it's empty?	20:31
fungi	maybe we can just copy that from another server?	20:31
clarkb	this was response to emilienm in other channel	20:31
fungi	oh	20:32
clarkb	their job failed being unable to copy that file and I found it weird the file existed	20:32
clarkb	since the copy failed	20:32
clarkb	but I think scp plugin must create the dirent before copying contents	20:32
clarkb	rather than doing a w and writing	20:32
fungi	ahh, i haven't finished catching up in there yet	20:32
jeblair	clarkb: catching up	20:42
jeblair	clarkb: i used the webui; on jenkins04 and 07 i 'installed' the plugin; on others i 'upgraded' it	20:44
jeblair	clarkb: just selecting from the list of available plugins; no manual uploads or anything in the case of monitoring	20:44
clarkb	jeblair: huh, for some reason it appears to still be using the old version on 04 when you do that	20:44
clarkb	jeblair: maybe we can 'upgrade' it now?	20:44
jeblair	(for gearman-plugin, they were all uploaded through the webui)	20:44
* clarkb looks		20:44
jeblair	clarkb: i think it still wasn't showing as upgradable, probably since it isn't really loaded	20:45
clarkb	ya I don't see it	20:45
jeblair	clarkb: assuming that's still the case, i'd agree that next steps are probably to delete, manually install as you suggest	20:45
clarkb	my original suggestion of just copying a jpi from the other servers over is the best idea i have	20:45
jeblair	let's first do the groovy script to have it start in shutdown mode	20:46
clarkb	++	20:46
jeblair	i will do that	20:46
jeblair	clarkb: do you want to work on staging the delete/copy?	20:46
clarkb	yup	20:46
clarkb	it is at jenkins04:/home/clarkb/monitoring.jpi and the md5sums match	20:48
clarkb	we can just cp that into place when 04 jenkins process is stopped	20:48
jeblair	clarkb: my groovy script is staged	20:50
jeblair	clarkb: i'm all set; why don't you drive from here?	20:50
clarkb	ok I will stop jenkins now	20:50
clarkb	ok plugin copied in place starting jenkins	20:51
clarkb	jeblair: ready?	20:51
jeblair	yep	20:52
jeblair	clarkb: same error :/	20:53
jeblair	it did start in shutdown mode	20:54
clarkb	monitoring/META-INF/MANIFEST.MF looks good at least	20:54
clarkb	but ya not working	20:54
clarkb	perhaps /var/lib/jenkins/monitoring is the problem?	20:55
clarkb	there are almost 13k files in there	20:55
clarkb	hrm 07 has almost as many	20:56
clarkb	that would be my next best guess move that dir aside, and have the plugin restart from scratch	20:56
jeblair	let's try it	20:57
clarkb	jeblair: you doing it this time? and do we have to update the groovy stuff each restart or is it set once and that way until the script is removed?	20:58
jeblair	clarkb: it's there for good. it's a file in ~jenkins/init.groovy.d/	20:58
clarkb	I like the idea of putting that on every master and updating the ansible playbook to restart them to explicitly online them	20:59
jeblair	clarkb: i worry a little about system restarts though...	20:59
clarkb	oh good point	20:59
clarkb	jeblair: I can move the monitoring dir aside if you want	21:00
jeblair	clarkb: but, we could ansible the process of adding and removing that	21:00
jeblair	clarkb: so we could have an "ansible shut down jenkins" and "ansible start jenkins" that does that for us	21:00
clarkb	ya	21:00
jeblair	clarkb: why don't you continue to drive	21:00
clarkb	kk	21:00
clarkb	stopping jenkins again	21:01
clarkb	and starting	21:01
jeblair	seems happier now	21:02
clarkb	yup I can get to the melody page now	21:03
clarkb	jeblair: anything else you want to check before cancelling shutdown mode?	21:03
jeblair	clarkb: nope, i'll go ahead and do that and rm the script	21:03
clarkb	kk	21:03
clarkb	it is running jobs	21:05
clarkb	looking good, I am ready to restart nodepool if others are ready	21:06
clarkb	jeblair: fungi anteaya ^ any reason to not do that now?	21:08
fungi	looks like a fine time for a nodepool restart	21:08
fungi	i'm around to help. pretty much caught up on what i've missed the first part of the day	21:09
clarkb	ok stopping nodepool services now	21:09
anteaya	I have no reason to not restart nodepool	21:09
jeblair	++	21:09
clarkb	ok both services are restarted and appear happy	21:10
clarkb	I will start ubuntu-trusty uploads in a screen session	21:10
anteaya	yay	21:10
fungi	and wow so moving the monitoring dir aside solved it?	21:11
fungi	freaky	21:11
fungi	i guess that's accumulating graph/stats data over time?	21:11
anteaya	fungi: clarkb removed a whole bunch of slave files accumulated on the jenkins earlier today	21:13
anteaya	not sure if that is related or not	21:13
jeblair	it wasn't the largest such directory; 03 has 17k files in it	21:13
clarkb	fungi: ya tons of rrd files	21:13
fungi	yikes	21:14
jeblair	but i guess something in there was :(	21:14
clarkb	if I had to guess it wasn't the size so much as maybe corrupt data	21:14
clarkb	which caused the jvm reading them in to allocate crazy amounts of memory and fail	21:14
jeblair	ya. a lot of those files are for ephemeral nodes too	21:14
clarkb	all ubuntu-trusty image uploads are queued	21:14
clarkb	I started with the qcow2 clouds	21:14
clarkb	as they should flush out quicker	21:15
clarkb	I need to step out now back in 20 or so	21:15
anteaya	clarkb: thanks	21:15
fungi	the more data there is, the more likely some of it is to end up corrupt as well	21:18
anteaya	this was the magic incantation: find /var/lib/jenkins/logs/slaves/ -mindepth 1 -mtime +60 -delete	21:19
anteaya	the first round was sans -mindepth 1 and took out the dir as well	21:19
anteaya	which was recreated	21:19
fungi	yep, saw that in scrollback	21:21
anteaya	figured	21:21
*** zaro has quit IRC		21:25
*** greghaynes has quit IRC		21:25
*** zaro has joined #openstack-infra-incident		21:26
*** mordred has quit IRC		21:26
*** dhellmann has quit IRC		21:26
*** dhellmann has joined #openstack-infra-incident		21:26
*** mordred has joined #openstack-infra-incident		21:27
*** greghaynes has joined #openstack-infra-incident		21:28
clarkb	I am back	21:31
clarkb	vexxhost image upload failed on the nginx looking thing	21:32
clarkb	osic and bluebox are done	21:32
*** ChanServ changes topic to "situation normal"		22:21
*** asselin has quit IRC		23:22
