Thursday, 2018-01-11

*** jascott1 has joined #openstack-infra00:00
*** edmondsw has quit IRC00:00
*** iyamahat_ has quit IRC00:01
*** dougwig has quit IRC00:03
*** jascott1 has quit IRC00:03
*** jascott1 has joined #openstack-infra00:03
*** iyamahat_ has joined #openstack-infra00:05
*** iyamahat_ has quit IRC00:06
*** jascott1 has quit IRC00:07
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Add skipped CRD tests  https://review.openstack.org/53188700:08
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies  https://review.openstack.org/53080600:08
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests  https://review.openstack.org/53269900:08
*** wolverineav has joined #openstack-infra00:16
*** felipemonteiro has joined #openstack-infra00:19
*** sbra has joined #openstack-infra00:22
*** yamamoto has quit IRC00:24
*** yamamoto has joined #openstack-infra00:27
*** erlon has quit IRC00:27
*** armax has joined #openstack-infra00:31
openstackgerritIan Wienand proposed openstack-infra/project-config master: Revert "Pause builds for dib 2.10 release"  https://review.openstack.org/53270100:36
*** abelur_ has quit IRC00:38
*** yamamoto has quit IRC00:39
*** abelur_ has joined #openstack-infra00:39
*** claudiub has quit IRC00:43
*** sree has joined #openstack-infra00:43
*** rmcall has quit IRC00:46
*** rmcall has joined #openstack-infra00:46
*** sbra has quit IRC00:46
*** sree has quit IRC00:48
*** wolverin_ has joined #openstack-infra00:49
*** wolverineav has quit IRC00:50
*** wolverin_ has quit IRC00:53
*** cuongnv has joined #openstack-infra00:58
*** cuongnv has quit IRC00:59
openstackgerritPaul Belanger proposed openstack-infra/project-config master: Set max-server to 0 for infracloud-vanilla  https://review.openstack.org/53270501:01
*** pcrews has quit IRC01:01
*** threestrands has quit IRC01:03
clarkbianw: ^ do you want to review that too?01:03
openstackgerritKendall Nelson proposed openstack-infra/storyboard master: [WIP]Migration Error with Suspended User  https://review.openstack.org/53270601:04
*** pcrews has joined #openstack-infra01:05
*** ilpianista has quit IRC01:05
ianware we 100% sure the gate is moving?01:09
*** ricolin has joined #openstack-infra01:10
clarkbianw: I am not no01:12
*** stakeda has joined #openstack-infra01:12
clarkbze01 at least is running stuff01:12
pabelangerissues?01:12
clarkbzuul scheduler is running01:12
clarkband zm01 has a zuul-merger01:13
ianw532395,1 ... my change of course ... i would call stuck ... just looking at some of the jobs01:13
ianweverything on the grafana page seems to be going01:13
*** cuongnv has joined #openstack-infra01:15
ianwzuul-executor on ze04 seems to have pretty high load ... maybe it's normal01:15
ianwhttp://paste.openstack.org/show/642527/ in the logs, odd one01:15
ianwgit.exc.GitCommandError: Cmd('git') failed due to: exit code(128)01:15
ianw  stderr: 'Host key verification failed.01:16
clarkbI've got to afk to take care of kids. I did approve pabelanger's max-servers: 0 for infracloud01:16
clarkbianw: ya that was brought up earlier today too01:16
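That "Host key verification failed" traceback comes out of GitPython on the executor when the remote's SSH host key is not in known_hosts. A minimal sketch of how the failure surfaces and can be detected (the repo URL and helper are hypothetical; the real fix is populating known_hosts on the executor, not code):

    import git  # GitPython, the library raising GitCommandError in the paste above

    def clone_with_hostkey_check(url, dest):
        """Clone a repo and surface SSH host key problems explicitly."""
        try:
            return git.Repo.clone_from(url, dest)
        except git.exc.GitCommandError as e:
            # git exits 128 and prints this message when the remote's host key
            # is missing from known_hosts (or has changed since it was recorded).
            if 'Host key verification failed' in str(e.stderr):
                raise RuntimeError('missing or changed SSH host key for %s; '
                                   'update known_hosts on this executor' % url)
            raise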
pabelangeryah, if we loaded from queues, it is possible one executor grabbed more than normal01:16
clarkbI think tomorrow once meltdown is behind us we need to do some zuul stabilization and investigation01:16
pabelangercool01:16
clarkband for you that may be today >_>01:16
ianwdo we have 11 executors now?01:16
clarkbianw: just 10, ze01-1001:16
ianwgrafana seems to think we have 1101:17
*** kiennt26 has joined #openstack-infra01:17
clarkbcould be stale connection to geard?01:17
pabelangerI can't seem to get any of the stream.html pages working01:18
pabelangerfinger 779ce43930ab4f48aa41ad6f9e422734@zuulv3.openstack.org also returns Internal streaming error01:19
ianw2018-01-11 00:10:20,511 DEBUG zuul.AnsibleJob: [build: ac4dbd3dd3eb4dbbbbed8ab3171500b2] Ansible complete, result RESULT_NORMAL code 001:19
ianwthat was about an hour ago, and the status page hasn't picked up the job returned01:19
*** slaweq has joined #openstack-infra01:19
pabelangertcp6       0      0 :::7900                 :::*                    LISTEN01:20
pabelangerthat is, new01:20
*** slaweq_ has joined #openstack-infra01:21
pabelangerdid we land new tcp 7900 changes?01:21
pabelangercorvus: Shrews: ^01:21
pabelangeryah, we must have01:21
pabelangerso, we have a mix of executors on tcp/79 and other on tcp/790001:22
ianwmy other job, 0aee65504f2d491ea497064898f2ad8e, maybe hasn't been picked up by an executor01:22
pabelanger2018-01-11 01:18:11,940 DEBUG zuul.web.LogStreamingHandler: Connecting to finger server ze09:7901:23
pabelangersocket.gaierror: [Errno -2] Name or service not known01:23
pabelangerin zuul-web debug log01:24
*** slaweq has quit IRC01:24
ianwthe job at the top of the integrated gate queue01:25
ianwhttp://zuulv3.openstack.org/stream.html?uuid=a379a36040e047cebfbc4b3e2e9d79a3&logfile=console.log01:25
ianw2018-01-10 22:53:10,886 DEBUG zuul.AnsibleJob: [build: a379a36040e047cebfbc4b3e2e9d79a3] Ansible complete, result RESULT_NORMAL code 001:25
*** slaweq_ has quit IRC01:25
pabelangerokay, so I think just ze09 is having web streaming issues01:26
pabelangerdue to hostname01:26
ianwi think my amateur analysis is ... something's not right01:26
*** ilpianista has joined #openstack-infra01:26
pabelangeryup, hostname on ze09 is ze09 again01:27
pabelangerwhile ze01.o.o is ze01.openstack.org01:27
clarkbyes and plan is to switch everything to be like ze0901:28
clarkbat least that was mordreds ideal and no one objected01:28
*** masayukig_ has quit IRC01:28
pabelangerI guess zuulv3.o.o cannot resolve ze09 right now01:28
pabelangerguess we need to append domain01:28
clarkbor use IP addrs01:29
pabelangerhttp://paste.openstack.org/show/642528/01:29
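The two failure modes in play here are an unresolvable short hostname (the socket.gaierror from zuul-web) and executors listening on the old finger port 79 versus the new 7900. A small diagnostic sketch, using only the hostnames and ports mentioned above:

    import socket

    def probe_finger(host, ports=(79, 7900), timeout=5):
        """Resolve an executor hostname and report which finger port answers."""
        try:
            addr = socket.getaddrinfo(host, None)[0][4][0]
        except socket.gaierror as exc:
            # This is the zuul-web error above: bare 'ze09' does not resolve from
            # zuulv3.o.o, while 'ze09.openstack.org' or a raw IP address would.
            print('%s: cannot resolve (%s)' % (host, exc))
            return
        for port in ports:
            try:
                with socket.create_connection((addr, port), timeout=timeout):
                    print('%s (%s): finger daemon listening on %d' % (host, addr, port))
                    return
            except OSError:
                continue
        print('%s (%s): no finger port open' % (host, addr))

    for candidate in ('ze09', 'ze09.openstack.org'):
        probe_finger(candidate)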
ianwinfra-root: ^ do we want to agree zuul isn't currently making progress?  restart?01:30
pabelangeryah, something to deal with in the morning01:30
*** masayukig has joined #openstack-infra01:30
clarkbianw: I'll have to defer to your judgement, am on phone and feeding kids01:30
pabelangerianw: no, it is inap; we have been having issues there today01:30
pabelangerI'd wait until the job times out01:31
pabelangerianw: we should see if we can SSH into node and see what network looks like01:31
ianwpabelanger: but look at something like a379a36040e047cebfbc4b3e2e9d79a3 ?  it appears to have finished but zuul hasn't noticed?01:32
pabelangerianw: what does load look like on the executor?01:32
ianwthat job went to ze0301:32
ianwzuul-executor is busy there, but probably not more than normal.  load ~1.501:33
pabelangerlooks like a379a36040e047cebfbc4b3e2e9d79a3 is still running01:33
*** zhurong has joined #openstack-infra01:33
*** felipemonteiro has quit IRC01:35
pabelangerianw: I think we need to wait for a379a36040e047cebfbc4b3e2e9d79a3 to timeout01:35
ianwyeah ok, that has gone to internap01:36
*** caphrim007_ has joined #openstack-infra01:36
ianwubuntu-xenial-inap-mtl01-000180399601:36
ianwac4dbd3dd3eb4dbbbbed8ab3171500b2 is a docs job that should have finished ages ago however ...01:36
ianwthat one went to ovh01:37
*** caphrim007 has quit IRC01:38
pabelangerianw: do you know which executor ac4dbd3dd3eb4dbbbbed8ab3171500b2 was on?01:39
pabelangerze04.o.o, according to scheduler01:39
ianwyes, ze04 ... it went to 158.69.75.10101:39
*** yamamoto has joined #openstack-infra01:39
ianwit's got everything cloned, so something happened01:40
pabelangerwow01:40
pabelangerI think ze04.o.o was stop and started01:40
pabelangeror killed01:40
pabelanger2018-01-11 01:00:06,180 DEBUG zuul.log_streamer: LogStreamer starting on port 790001:41
*** askb has quit IRC01:41
ianwhmm, this might explain the 11 executors01:41
pabelangersomething happened 40mins ago01:41
ianwpuppet run?01:41
*** caphrim007_ has quit IRC01:41
pabelangerchecking01:42
*** rcernin has quit IRC01:42
*** askb has joined #openstack-infra01:43
*** edmondsw has joined #openstack-infra01:44
pabelangerwe rebooted 40mins ago01:44
pabelangerwhich make sense01:44
pabelangeris that when we did meltdown reboots?01:44
ianwno, i rebooted them all yesterday!01:44
pabelangerI think we did it again01:44
pabelangeruptime is 1hr:44mins01:44
ianwactually only 44 minutes (it's 1:44) ...01:45
pabelangerokay, so I think the reboot in logs was from server reboot, so executor didn't crash01:46
pabelangerhowever01:46
ianwafaics there was no login around the time of the reboot.  was it externally triggered?01:46
*** shu-mutou-AWAY is now known as shu-mutou01:46
pabelangerac4dbd3dd3eb4dbbbbed8ab3171500b2 was running just before server reboot01:47
*** yamamoto has quit IRC01:47
pabelangermaybe?01:47
clarkbwe didn't do ze's unless jeblair did them?01:47
pabelangerwas the server live migrated?01:47
ianwit must have been, or something.  it happened at exactly 01:0001:47
pabelangerbecause executor was killed when ac4dbd3dd3eb4dbbbbed8ab3171500b2 was running01:48
pabelangerso, in that case, zuul does think it is running01:48
*** askb has quit IRC01:48
pabelangerwhen it is in fact done01:48
pabelangerwhich means, I do think we need to dump queues and restart scheduler01:48
*** edmondsw has quit IRC01:48
ianwyah :/01:48
pabelangerokay, then I agree01:49
*** askb has joined #openstack-infra01:49
pabelangerthat also explains why it is running as tcp/790001:49
pabelangersince daemon restarted01:49
*** markvoelker has joined #openstack-infra01:50
*** rcernin has joined #openstack-infra01:50
ianwis anyone logged into the rax console where they send those notifications?01:50
Shrewspabelanger: hrm, yes, it seems my executor change recently landed01:51
ianwis this finger thing something we want to consider before reloading?01:51
Shrewspossibly? we may not have considered all of the things that need to change for it01:53
*** bobh has joined #openstack-infra01:53
Shrewsthere's this: https://review.openstack.org/53259401:53
Shrewsand we might need iptables rules for the new port?01:53
ianwwith the gate the way it is, unless we start force pushing, not sure we can do too much changing01:55
openstackgerritDavid Shrewsbury proposed openstack-infra/system-config master: Zuul executor needs to open port 7900 now.  https://review.openstack.org/53270901:57
*** Apoorva has quit IRC01:58
Shrews532594 and 532709 are going to be needed since that change landed. Then restart the executors. I'm not sure what the recently restarted ze* servers are going to need since they're probably running as root and not zuul now. Might run into file permission issues? Can't speak to that.02:00
Shrewsianw: pabelanger: I wasn't expecting that change to land so quickly. My fault for not marking it WIP or -2 before we discussed coordination.02:03
Shrewsianw: pabelanger: what can i do to assist here?02:04
ianwwell only ze04 has probably restarted with the new code02:04
ianwis the only consequence that live streaming doesn't work?02:04
Shrewsright02:05
Shrewsi see the processes there running as root, not zuul. i'm not sure of the implications of that02:05
Shrewscorvus would know02:06
pabelangerwe could manually downgrade ze04.o.o for tonight02:06
pabelangerthen work on rolling out new code02:07
ianwor just turn it off?02:07
pabelangeror that02:07
pabelangermaybe best for tonight02:07
pabelangerI'm going to have to run shortly02:07
ianwyeah, given it's run as root now, turning it back to zuul seems dangerous02:07
ianwso, let's stop ze0402:08
ianwthen i'll dump & restart zuul scheduler02:08
pabelanger1wfm02:08
ianwand cross fingers none of the other ze's restart02:08
*** xarses has joined #openstack-infra02:08
ianwand the a-team can pick this up tomorrow02:08
pabelangeragree02:08
ianwok02:08
Shrewswfm x 2. should I -2 532594 and 532709 for now?02:08
Shrewslanding 709 w/o restarting the executors would break streaming on all executors until restart02:09
ianwi think so, just for safety02:09
pabelangerwe could make 709 support both 79,7900 then remove 79 in follow02:10
*** gouthamr has joined #openstack-infra02:10
*** ramishra has joined #openstack-infra02:10
pabelangermight be safer approach02:10
Shrews-2'd for now02:11
pabelangerkk02:11
*** annp has joined #openstack-infra02:11
ianw#status log zuul-executor stopped on ze04.o.o and it is placed in the emergency file, due to an external reboot applying https://review.openstack.org/#/c/532575/.  we will need to more carefully consider the rollout of this code02:13
openstackstatusianw: finished logging02:13
*** Apoorva has joined #openstack-infra02:14
*** gouthamr_ has joined #openstack-infra02:15
*** gouthamr has quit IRC02:15
pabelangerianw: okay, I'm EOD, good luck02:17
ianwi'm not sure i like the look of this02:18
ianwhttp://paste.openstack.org/show/642531/02:18
*** namnh has joined #openstack-infra02:18
*** yamahata has joined #openstack-infra02:18
Shrewsianw: where's that from?02:18
Shrewsi think dmsimard pasted something similar a couple of days ago02:19
*** gouthamr_ has quit IRC02:19
*** yamamoto has joined #openstack-infra02:19
dmsimardyeah02:19
dmsimarddid you restart zuul ?02:19
*** Apoorva has quit IRC02:19
dmsimardiirc I saw that being spammed (like hell) in zuulv3.o.o logs after starting the scheduler/web02:20
ianwyes, just restarted02:20
Shrewsdmsimard: did anyone look into that?02:20
ianwit seems to be going ...02:20
ianwok, i'm going to requeue02:20
*** rmcall has quit IRC02:20
dmsimardShrews: no one jumped on it when I pointed it out, no02:21
ianwif its job is to scare the daylights out of the person restarting, then it has worked :)02:22
dmsimardianw: iirc there's the same errors in the zuul-web logs too02:22
dmsimardnot just zuul-scheduler02:22
dmsimardyeah, zuul web: http://paste.openstack.org/raw/642533/02:23
*** gouthamr has joined #openstack-infra02:23
Shrewslooks like it might be a bug in the /{tenant}/status route?02:30
ShrewstristanC: know anything about that? ^^^^02:30
dmsimardI was talking to him about that just now :D02:30
tristanCShrews: yep, i'm currently writing a better exception handling02:30
Shrewsianw: my instinct says it's not critical02:30
Shrewsw00t!02:31
dmsimardHe says it's because when the scheduler restarts, the layouts are not yet available, so when you try to load the webpage it tries to seek them out02:31
dmsimardor something along those lines02:31
tristanCdmsimard: ianw: this happens when the requested tenant layout isn't ready in the scheduler02:31
Shrewsdmsimard: tristanC: thx for the explanation02:32
ianwtristanC: ok :) crisis averted02:33
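The pattern tristanC is describing for 532718 is to treat a not-yet-loaded tenant layout as an expected condition rather than letting it raise a traceback in the status route. A rough sketch of that shape (illustrative names only, not Zuul's actual handler code):

    import json

    class TenantNotReady(Exception):
        pass

    def format_tenant_status(scheduler, tenant_name):
        """Return status JSON for a tenant, or signal that it is still loading."""
        tenant = scheduler.tenants.get(tenant_name)  # illustrative attribute
        if tenant is None or tenant.layout is None:
            # Right after a scheduler restart the tenant layouts are still being
            # built; report that cleanly instead of raising deep in the formatter.
            raise TenantNotReady(tenant_name)
        return json.dumps(tenant.layout.formatStatus())  # illustrative method

    def handle_status_request(scheduler, tenant_name):
        try:
            return 200, format_tenant_status(scheduler, tenant_name)
        except TenantNotReady:
            return 503, json.dumps({'error': 'tenant %s is still loading' % tenant_name})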
ianwstill requeueing02:33
*** Qiming has quit IRC02:33
*** markvoelker has quit IRC02:36
*** yamahata has quit IRC02:38
ianw#status log zuul restarted due to the unexpected loss of ze04; jobs requeued02:38
openstackstatusianw: finished logging02:38
*** yamahata has joined #openstack-infra02:39
ianwok, we're back to 9 executors in grafana, which seems right02:39
*** Qiming has joined #openstack-infra02:39
*** gouthamr has quit IRC02:40
ianwwill just leave things for a bit, will send a summary email to avoid people having to pull apart chat logs02:40
*** caphrim007 has joined #openstack-infra02:41
*** hongbin has joined #openstack-infra02:44
dmsimardwe have 10, what other executor is missing ?02:45
*** caphrim007 has quit IRC02:45
*** threestrands has joined #openstack-infra02:47
*** yamahata has quit IRC02:54
*** mriedem has quit IRC02:56
*** wolverineav has joined #openstack-infra03:01
*** fultonj has quit IRC03:01
*** masber has quit IRC03:02
*** fultonj has joined #openstack-infra03:02
*** masber has joined #openstack-infra03:05
*** wolverineav has quit IRC03:05
*** ijw has joined #openstack-infra03:06
*** aeng has quit IRC03:08
*** aeng has joined #openstack-infra03:08
*** katkapilatova1 has quit IRC03:08
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: scheduler: better handle format status error  https://review.openstack.org/53271803:10
*** rmcall has joined #openstack-infra03:10
*** yamamoto_ has joined #openstack-infra03:12
ianwdmsimard: 4 is turned off, see prior, and grafana was saying there was 11 at one point, so something had gone wrong there03:14
*** yamamoto has quit IRC03:16
*** gyee has quit IRC03:16
*** felipemonteiro has joined #openstack-infra03:18
openstackgerritTristan Cacqueray proposed openstack-infra/zuul feature/zuulv3: scheduler: better handle format status error  https://review.openstack.org/53271803:21
*** slaweq has joined #openstack-infra03:21
*** harlowja_ has quit IRC03:21
*** slaweq_ has joined #openstack-infra03:21
*** ijw has quit IRC03:22
*** slaweq has quit IRC03:25
*** ramishra has quit IRC03:25
*** slaweq_ has quit IRC03:26
*** ramishra has joined #openstack-infra03:27
*** wolverineav has joined #openstack-infra03:33
*** wolverineav has quit IRC03:38
*** ijw has joined #openstack-infra03:39
*** sbra has joined #openstack-infra03:42
*** ramishra has quit IRC03:45
*** ijw has quit IRC03:47
*** rlandy|bbl is now known as rlandy03:50
mtreinishclarkb: I just did a quick test and afs works fine on one of my raspberry pi's running arch03:54
*** sbra has quit IRC03:54
*** verdurin_ has joined #openstack-infra03:55
*** verdurin has quit IRC03:56
*** felipemonteiro has quit IRC03:57
*** caphrim007 has joined #openstack-infra03:58
*** ameliac has quit IRC04:00
*** ramishra has joined #openstack-infra04:01
*** verdurin has joined #openstack-infra04:03
*** verdurin_ has quit IRC04:04
*** xarses has quit IRC04:05
openstackgerritMerged openstack/diskimage-builder master: Revert "Dont install python-pip for py3k"  https://review.openstack.org/53239504:05
*** esberglu has quit IRC04:10
*** bobh has quit IRC04:11
*** zhurong has quit IRC04:12
*** xarses has joined #openstack-infra04:13
*** xarses_ has joined #openstack-infra04:14
*** xarses_ has quit IRC04:14
*** xarses_ has joined #openstack-infra04:15
*** felipemonteiro has joined #openstack-infra04:15
*** daidv has joined #openstack-infra04:16
*** xarses has quit IRC04:18
*** jamesmcarthur has joined #openstack-infra04:18
*** jamesmcarthur has quit IRC04:18
*** jamesmcarthur has joined #openstack-infra04:19
*** sree has joined #openstack-infra04:21
*** armax has quit IRC04:24
*** links has joined #openstack-infra04:25
*** spzala has joined #openstack-infra04:31
*** andreas_s has joined #openstack-infra04:32
*** ykarel|away has joined #openstack-infra04:33
*** rosmaita has quit IRC04:34
*** bhavik1 has joined #openstack-infra04:35
*** shu-mutou has quit IRC04:35
*** shu-mutou has joined #openstack-infra04:36
*** xarses_ has quit IRC04:37
*** andreas_s has quit IRC04:37
*** felipemonteiro has quit IRC04:39
*** spzala has quit IRC04:42
*** dingyichen has quit IRC04:43
*** rlandy has quit IRC04:53
*** rmcall has quit IRC04:54
*** dingyichen has joined #openstack-infra04:55
*** links has quit IRC04:56
*** links has joined #openstack-infra04:57
*** janki has joined #openstack-infra05:07
*** dingyichen has quit IRC05:09
*** dingyichen has joined #openstack-infra05:09
*** nibalizer has joined #openstack-infra05:14
*** claudiub has joined #openstack-infra05:15
*** sree has quit IRC05:19
*** edmondsw has joined #openstack-infra05:20
*** psachin has joined #openstack-infra05:21
*** edmondsw has quit IRC05:24
*** gema has quit IRC05:26
*** gema has joined #openstack-infra05:27
*** psachin has quit IRC05:27
*** zhurong has joined #openstack-infra05:28
*** rcernin has quit IRC05:29
*** CrayZee has joined #openstack-infra05:29
*** harlowja has joined #openstack-infra05:29
*** sree has joined #openstack-infra05:29
*** gcb has quit IRC05:30
*** gcb has joined #openstack-infra05:31
*** hongbin has quit IRC05:32
*** shu-mutou is now known as shu-mutou-AWAY05:32
*** wolverineav has joined #openstack-infra05:34
*** psachin has joined #openstack-infra05:35
*** wolverin_ has joined #openstack-infra05:35
*** jamesmcarthur has quit IRC05:36
*** wolveri__ has joined #openstack-infra05:38
*** wolverineav has quit IRC05:39
*** wolverin_ has quit IRC05:39
*** bhavik1 has quit IRC05:42
*** wolveri__ has quit IRC05:43
*** markvoelker has joined #openstack-infra05:47
*** coolsvap has joined #openstack-infra05:47
*** dhajare has joined #openstack-infra05:47
*** jamesmcarthur has joined #openstack-infra05:52
*** xinliang has quit IRC05:56
*** sandanar has joined #openstack-infra05:58
*** ykarel|away is now known as ykarel05:59
*** jamesmcarthur has quit IRC06:00
*** jbadiapa has quit IRC06:06
*** xinliang has joined #openstack-infra06:08
*** xinliang has quit IRC06:08
*** xinliang has joined #openstack-infra06:08
CrayZeeHi, Is there anything going on with zuul? I see over 300 tasks in the queue with tasks running over 3.5 hours that are probably requeued ?06:08
CrayZeee.g. 532631...06:09
*** liujiong has joined #openstack-infra06:09
*** slaweq has joined #openstack-infra06:09
*** slaweq has quit IRC06:13
*** kjackal has joined #openstack-infra06:14
*** armaan has quit IRC06:17
ianwCrayZee: I think we're just flat out ATM.  we're down one provider until we can sort out some issues, but http://grafana01.openstack.org/dashboard/db/zuul-status seems to be okish06:17
*** Hunner has quit IRC06:19
*** bmjen has quit IRC06:19
CrayZeeianw: thanks06:21
*** liujiong has quit IRC06:28
*** jamesmcarthur has joined #openstack-infra06:28
*** jamesmcarthur has quit IRC06:33
*** liujiong has joined #openstack-infra06:33
*** jamesmcarthur has joined #openstack-infra06:35
*** annp has quit IRC06:38
*** markvoelker has quit IRC06:40
*** markvoelker has joined #openstack-infra06:42
eumel8ianw: you're still around?06:44
*** kzaitsev_pi has quit IRC06:45
*** yamamoto has joined #openstack-infra06:45
ianweumel8: a little ... wrapping up06:45
*** alexchadin has joined #openstack-infra06:46
*** harlowja has quit IRC06:47
eumel8ianw:  morning :) can you please take a look into https://review.openstack.org/#/c/531736/ and plan the next upgrade of Zanata?06:47
*** yamamoto_ has quit IRC06:49
*** kzaitsev_pi has joined #openstack-infra06:52
ianweumel8: ok, i can probably try that out tomorrow my time.  basically 1) pause puppet 2) do the table drop 3) commit that 4) restart puppet?06:54
eumel8ianw: drop database, not table :)06:55
*** aeng has quit IRC06:56
ianwahh, ok.  how about i stop puppet for it now & we get it in the queue06:56
eumel8ianw: sounds good, thx06:57
*** threestrands has quit IRC07:00
*** sbra has joined #openstack-infra07:00
*** threestrands has joined #openstack-infra07:01
*** threestrands has quit IRC07:02
*** threestrands has joined #openstack-infra07:03
*** threestrands has quit IRC07:03
*** threestrands has joined #openstack-infra07:03
*** threestrands has quit IRC07:04
*** threestrands has joined #openstack-infra07:04
*** edmondsw has joined #openstack-infra07:08
*** jamesmcarthur has quit IRC07:12
*** edmondsw has quit IRC07:13
*** armaan has joined #openstack-infra07:13
*** dhill_ has quit IRC07:13
*** dhill__ has joined #openstack-infra07:13
*** jamesmcarthur has joined #openstack-infra07:19
masayukigmtreinish: clarkb: yeah, I don't have a lot of stuff installed on my machines. But I also tested it on a very slow hdd. Files might be on disk cache because the size is not so big.07:19
*** pcaruana has joined #openstack-infra07:21
*** dims_ has quit IRC07:21
openstackgerritMerged openstack-infra/infra-manual master: Clarify the point of a repo in a zuulv3 job name  https://review.openstack.org/53268607:21
*** hemna_ has quit IRC07:21
*** jamesmcarthur has quit IRC07:24
*** dims has joined #openstack-infra07:25
openstackgerritMerged openstack-infra/system-config master: Upgrade translate-dev.o.o to Zanata 4.3.3  https://review.openstack.org/53173607:25
openstackgerritMerged openstack-infra/project-config master: Switch grafana neutron board to non-legacy jobs  https://review.openstack.org/53263207:26
*** afred312 has quit IRC07:27
*** abelur_ has quit IRC07:29
*** vsaienk0 has joined #openstack-infra07:29
*** jamesmcarthur has joined #openstack-infra07:30
*** afred312 has joined #openstack-infra07:30
AJaegerconfig-core, could you review some changes on https://etherpad.openstack.org/p/Nvt3ovbn5x , please? I put up ready changes there for easier review07:34
*** jamesmcarthur has quit IRC07:35
*** slaweq has joined #openstack-infra07:38
*** jtomasek has joined #openstack-infra07:39
*** jamesmcarthur has joined #openstack-infra07:39
*** jtomasek has joined #openstack-infra07:39
*** armaan has quit IRC07:41
*** armaan has joined #openstack-infra07:41
*** jamesmcarthur has quit IRC07:44
openstackgerritMerged openstack-infra/project-config master: Set max-server to 0 for infracloud-vanilla  https://review.openstack.org/53270507:45
*** jamesmcarthur has joined #openstack-infra07:46
*** armaan has quit IRC07:46
*** armaan has joined #openstack-infra07:47
*** dciabrin has quit IRC07:49
AJaegerfrickler: could you review https://review.openstack.org/#/c/523018/ to remove an unused template so that it won't get used again, please?07:50
*** andreas_s has joined #openstack-infra07:53
*** jamesmcarthur has quit IRC07:55
gemamtreinish: thanks for trying it, good to know afs can be built on arm :)07:57
AJaegerinfra-root, looking at http://grafana.openstack.org/dashboard/db/nodepool I see a nearly constant number of "deleting nodes". Do we have a problem with deletion somewhere?07:57
*** annp has joined #openstack-infra08:01
*** jamesmcarthur has joined #openstack-infra08:04
*** evin has joined #openstack-infra08:04
*** sbra has quit IRC08:06
*** threestrands has quit IRC08:08
*** jamesmcarthur has quit IRC08:09
*** slaweq_ has joined #openstack-infra08:10
*** apetrich has quit IRC08:13
*** florianf has joined #openstack-infra08:14
*** slaweq_ has quit IRC08:14
*** jamesmcarthur has joined #openstack-infra08:15
*** pcichy has joined #openstack-infra08:17
*** tesseract has joined #openstack-infra08:17
*** ramishra has quit IRC08:18
*** aviau has quit IRC08:20
*** aviau has joined #openstack-infra08:20
*** ramishra has joined #openstack-infra08:20
*** jamesmcarthur has quit IRC08:21
*** jamesmcarthur has joined #openstack-infra08:22
*** shardy has joined #openstack-infra08:22
*** jamesmcarthur has quit IRC08:26
openstackgerritVasyl Saienko proposed openstack-infra/project-config master: Add networking-generic-switch-tempest-plugin repo  https://review.openstack.org/53254208:27
openstackgerritVasyl Saienko proposed openstack-infra/project-config master: Add jobs for n-g-s tempest-plugin  https://review.openstack.org/53254308:27
*** alexchadin has quit IRC08:28
*** tosky has joined #openstack-infra08:29
*** cshastri has joined #openstack-infra08:31
*** jamesmcarthur has joined #openstack-infra08:31
*** jpena|off is now known as jpena08:34
*** jamesmcarthur has quit IRC08:36
*** jamesmcarthur has joined #openstack-infra08:37
*** jbadiapa has joined #openstack-infra08:40
*** alexchadin has joined #openstack-infra08:41
*** jamesmcarthur has quit IRC08:42
*** pcaruana has quit IRC08:44
*** links has quit IRC08:46
openstackgerritDmitrii Shcherbakov proposed openstack-infra/project-config master: add charm-panko project-config  https://review.openstack.org/53276908:47
*** jaosorior has joined #openstack-infra08:47
*** b_bezak has joined #openstack-infra08:47
*** armaan has quit IRC08:47
*** armaan has joined #openstack-infra08:48
*** jamesmcarthur has joined #openstack-infra08:48
*** ralonsoh has joined #openstack-infra08:50
*** jamesmcarthur has quit IRC08:53
*** dingyichen has quit IRC08:54
*** edmondsw has joined #openstack-infra08:56
*** b_bezak has quit IRC08:57
*** hashar has joined #openstack-infra08:57
*** links has joined #openstack-infra08:59
*** jamesmcarthur has joined #openstack-infra09:01
*** edmondsw has quit IRC09:01
*** jpich has joined #openstack-infra09:02
*** jamesmcarthur has quit IRC09:05
*** pblaho has quit IRC09:06
*** sree_ has joined #openstack-infra09:10
*** sree_ is now known as Guest3737509:11
*** sree has quit IRC09:12
*** sree has joined #openstack-infra09:12
*** Guest37375 has quit IRC09:15
*** dbecker has joined #openstack-infra09:17
*** jamesmcarthur has joined #openstack-infra09:17
*** flaper87 has quit IRC09:17
*** pcichy has quit IRC09:18
*** dsariel has joined #openstack-infra09:19
*** pblaho has joined #openstack-infra09:19
*** jamesmcarthur has quit IRC09:19
*** flaper87 has joined #openstack-infra09:21
*** flaper87 has quit IRC09:21
*** e0ne has joined #openstack-infra09:21
*** flaper87 has joined #openstack-infra09:23
*** lucas-afk is now known as lucasagomes09:26
*** kopecmartin has joined #openstack-infra09:27
*** armaan has quit IRC09:27
*** armaan has joined #openstack-infra09:27
*** dsariel has quit IRC09:27
*** pblaho has quit IRC09:36
*** armaan has quit IRC09:36
*** dtantsur|afk is now known as dtantsur09:38
ykarelJobs are in the queue for long, some jobs for 7+ hr; is some issue going on or is it just that infra is out of nodes?09:43
*** pblaho has joined #openstack-infra09:48
*** links has quit IRC09:51
*** apetrich has joined #openstack-infra09:53
*** vsaienk0 has quit IRC09:55
*** stakeda has quit IRC09:57
CrayZeeykarel: This is the answer I got a few hours ago: "(06:17:58 UTC) ianw: CrayZee: I think we're just flat out ATM.  we're down one provider until we can sort out some issues, but http://grafana01.openstack.org/dashboard/db/zuul-status seems to be okish"09:59
*** CrayZee has quit IRC10:00
*** armaan has joined #openstack-infra10:03
*** pcaruana has joined #openstack-infra10:04
*** links has joined #openstack-infra10:05
ykarelCrayZee Thanks10:06
*** slaweq_ has joined #openstack-infra10:11
*** markvoelker has quit IRC10:13
*** ankkumar has joined #openstack-infra10:13
*** sambetts|afk is now known as sambetts10:14
*** namnh has quit IRC10:14
*** slaweq_ has quit IRC10:15
ankkumarHi10:16
ankkumarI am trying to build openstack/ironic CI using zuul v310:16
ankkumarCan anyone tell me how it would read my customized zuul.yaml file, because the openstack/ironic repo has their own?10:16
*** cuongnv has quit IRC10:16
ankkumarHow would my gate run, I mean using which zuul yaml files?10:16
*** pbourke has joined #openstack-infra10:16
*** liujiong has quit IRC10:23
*** sree_ has joined #openstack-infra10:25
*** vsaienk0 has joined #openstack-infra10:25
*** sree_ is now known as Guest8860010:26
*** ricolin has quit IRC10:28
*** sree has quit IRC10:29
*** rosmaita has joined #openstack-infra10:32
*** alexchadin has quit IRC10:34
*** cshastri has quit IRC10:35
*** zhurong has quit IRC10:35
*** ldnunes has joined #openstack-infra10:37
*** abelur_ has joined #openstack-infra10:37
*** e0ne has quit IRC10:38
*** e0ne has joined #openstack-infra10:39
*** sandanar has quit IRC10:39
*** sandanar has joined #openstack-infra10:40
*** edmondsw has joined #openstack-infra10:44
*** pcichy has joined #openstack-infra10:46
*** slaweq_ has joined #openstack-infra10:46
*** danpawlik_ has joined #openstack-infra10:46
*** maciejjo1 has joined #openstack-infra10:47
*** andreas_s has quit IRC10:48
*** slaweq_ has quit IRC10:48
*** maciejjozefczyk has quit IRC10:48
*** danpawlik has quit IRC10:48
*** slaweq_ has joined #openstack-infra10:48
*** cshastri has joined #openstack-infra10:48
*** slaweq has quit IRC10:48
*** edmondsw has quit IRC10:49
*** pcichy has quit IRC10:50
*** pcichy has joined #openstack-infra10:50
*** numans has quit IRC10:54
*** numans has joined #openstack-infra10:55
*** andreas_s has joined #openstack-infra10:58
fricklerinfra-root: AJaeger: ack, seeing lots of deletion-related exceptions on nl02, will try to dig a bit more11:10
*** andreas_s has quit IRC11:19
*** shardy has quit IRC11:22
*** shardy has joined #openstack-infra11:24
*** panda|rover|afk has quit IRC11:24
*** derekh has joined #openstack-infra11:25
*** eyalb has joined #openstack-infra11:26
*** maciejjo1 is now known as maciejjozefczyk11:27
eyalbany news on zuul? our jobs are queuing for a long time11:28
fricklereyalb: due to various issues over the last days we have a pretty large backlog currently, please be patient. if you have specific jobs that seem stuck, please post them here so we can take a look11:30
eyalbfrickler: thanks11:31
*** tpsilva_ has joined #openstack-infra11:31
fricklerinfra-root: seems we have multiple issues currently: a) nodes in stuck deleting state, this seems to be on all providers, all since about the time zuul seems to have been restarted (around 22:30)11:32
fricklerinfra-root: b) some timeouts when deleting current nodes and c) some more issues with infracloud-vanilla11:33
fricklerfor a) I think a restart of nodepool might help, but I don't want to make the situation worse. this seems to be just blocking about 30% of our quota, so not optimal, but also not catastrophic I'd say, as the remaining quota seems to get used without issues11:34
*** tpsilva_ is now known as tpsilva11:34
fricklerb) might actually be normal behaviour. c) infracloud-vanilla seems to have been taken offline anyway, so also not critical11:35
*** andreas_s has joined #openstack-infra11:36
*** panda has joined #openstack-infra11:38
frickleralthough it may be that the leaked instance cleanup is not working due to c)11:38
*** apetrich has quit IRC11:38
*** sdague has joined #openstack-infra11:39
*** wolverineav has joined #openstack-infra11:39
*** andreas_s has quit IRC11:40
*** wolverineav has quit IRC11:44
*** andreas_s has joined #openstack-infra11:50
*** florianf has quit IRC11:52
*** florianf has joined #openstack-infra11:53
*** rhallisey has joined #openstack-infra11:54
*** andreas_s has quit IRC11:55
Shrewsfrickler: AJaeger: ugh, not what i want to see when i first wake up. i'll take a look at nodepool11:56
*** e0ne has quit IRC11:58
*** janki has quit IRC11:59
Shrewsoh, this is new: http://paste.openstack.org/show/642553/11:59
*** panda has quit IRC11:59
Shrewsa restart of nodepool would not help with that12:00
Shrewsfrickler: you said "more" issues with vanilla. what were the previous issues?12:00
*** davidlenwell has quit IRC12:01
*** yamamoto has quit IRC12:01
*** davidlenwell has joined #openstack-infra12:01
*** yamamoto has joined #openstack-infra12:01
*** electrical has quit IRC12:01
*** icey has quit IRC12:02
*** pcaruana has quit IRC12:02
*** pcaruana|afk| has joined #openstack-infra12:02
*** icey has joined #openstack-infra12:02
*** electrical has joined #openstack-infra12:02
Shrewsoh, the meltdown patches causing issues12:03
Shrewswe might should disable all of vanilla until someone can look into that infrastructure a bit more12:04
*** jpena is now known as jpena|lunch12:04
*** jkilpatr has joined #openstack-infra12:05
*** annp has quit IRC12:05
Shrewsoh, it IS disabled in nodepool.yaml12:06
*** kjackal has quit IRC12:07
Shrewsoh geez, happening across other providers too. *sigh*12:08
Shrewsmordred: was there a new shade release recently?12:09
*** dsariel has joined #openstack-infra12:10
*** arxcruz|ruck is now known as arxcruz|rover12:11
*** slaweq has joined #openstack-infra12:11
fricklerShrews: the failure in your paste is happening because keystone isn't responding. but I'm also seeing timeouts when deleting nodes from other providers12:12
*** panda has joined #openstack-infra12:13
Shrewsfrickler: yeah, looking at those on the other launcher now12:13
Shrewsseems to be affecting rax, ovh, and chocolate too12:13
Shrewsno idea what's up with those12:13
*** markvoelker has joined #openstack-infra12:14
fricklerShrews: http://paste.openstack.org/show/642563/ seems to show the reason for infra-chocolate12:15
AJaegerShrews:  https://review.openstack.org/532705 disabled vanilla already.12:16
fricklerShrews: nodepool seems always to retry deleting the same node before deleting others12:16
*** slaweq has quit IRC12:16
AJaegerfrickler: thanks for looking into this.12:16
AJaegerand good morning, Shrews !12:17
*** sandanar has quit IRC12:18
fricklerShrews: I'm wondering whether we should delete infracloud-vanilla from nodepool.yaml completely for the time being. that should at least allow the leaked instance cleanup to run for the other clouds12:18
*** smatzek has joined #openstack-infra12:21
Shrewsfrickler: iirc, what should be happening is a new thread is started for each node delete. so it should cycle through all deleted instances rather quickly, but the actual delete threads are the things timing out12:22
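A rough sketch of the per-delete threading Shrews describes, where each instance delete gets its own worker with a timeout so a single unresponsive cloud (vanilla here) cannot wedge cleanup for the other providers (the provider object and deleteServer call are placeholders):

    import logging
    import threading

    log = logging.getLogger('cleanup')

    def delete_instance(provider, server_id, timeout=600):
        """Run one server delete in its own thread with an overall timeout."""
        def _delete():
            provider.deleteServer(server_id)  # placeholder for the cloud API call

        worker = threading.Thread(target=_delete, daemon=True)
        worker.start()
        worker.join(timeout)
        if worker.is_alive():
            # A hung API (e.g. a dead keystone endpoint) should not block the
            # cleanup loop; log it and move on to the next instance/provider.
            log.warning('Delete of %s on %s timed out after %ss',
                        server_id, provider, timeout)

    def cleanup_deleted(nodes_by_provider):
        for provider, server_ids in nodes_by_provider.items():
            for server_id in server_ids:
                delete_instance(provider, server_id)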
*** sdoran has quit IRC12:24
*** mrmartin has quit IRC12:24
*** madhuvishy has quit IRC12:24
*** tomhambleton_ has quit IRC12:24
*** madhuvishy has joined #openstack-infra12:24
*** lucasagomes is now known as lucas-hungry12:25
*** pcichy has quit IRC12:25
*** abelur has quit IRC12:27
*** abelur has joined #openstack-infra12:27
*** kjackal has joined #openstack-infra12:28
*** e0ne has joined #openstack-infra12:30
*** pcichy has joined #openstack-infra12:31
Shrewslooks like we lost connection to our zookeeper server yesterday12:32
*** edmondsw has joined #openstack-infra12:33
*** e0ne has quit IRC12:33
Shrewsok, this is weird. i'm going to restart the launchers to see what happens12:33
*** rosmaita has quit IRC12:34
*** abelur_ has quit IRC12:35
AJaegerShrews: wait, please...12:35
AJaegerShrews: first read http://lists.openstack.org/pipermail/openstack-infra/2018-January/005774.html - ianw run into a problem with ze04 that we should check before restarting12:36
ShrewsAJaeger: already restarted nl01, which seems to be cleaning up the instances. i worked with ianw last night on ze04, so aware of that12:37
Shrewsunrelated things12:37
*** edmondsw has quit IRC12:38
AJaegerShrews: glad to hear.12:38
*** mrmartin has joined #openstack-infra12:39
*** e0ne has joined #openstack-infra12:42
Shrewsnl02 restarted which seems to have cleaned up the enabled providers there12:42
*** bobh has joined #openstack-infra12:42
ShrewsAJaeger: frickler: not sure what happened there. I'm going to have to spend the day sorting through tons of logs.12:43
Shrewscan one of you #status the things while I finally have a cup of coffee?12:43
*** vsaienk0 has quit IRC12:44
AJaeger#status log nl01 and nl02 restarted to recover nodes in deletion12:47
openstackstatusAJaeger: finished logging12:47
*** markvoelker has quit IRC12:48
AJaegerShrews: done - or do you want to write more - or on broader scale? Enjoy your coffee!12:48
*** pgadiya has joined #openstack-infra12:49
*** pcaruana|afk| has quit IRC12:49
*** krtaylor_ has quit IRC12:50
ShrewsAJaeger: thx. good enough for now i guess. i don't yet know enough about the problem to say more12:50
mpetersonhello, I was wondering if you know what might be causing a job not to go into the RUN phase at all? it does PRE and then POST all successful but not RUN12:51
*** e0ne has quit IRC12:52
*** e0ne has joined #openstack-infra12:53
*** andreas_s has joined #openstack-infra12:58
pabelangerwe'll be deleting infracloud-vanilla, and moving compute node to chocolate. We lost the controller yesterday to a bad HDD12:58
pabelangerhttps://review.openstack.org/532705/ should have disable it12:59
*** ankkumar has quit IRC12:59
*** jpena|lunch is now known as jpena13:00
*** pcaruana|afk| has joined #openstack-infra13:02
*** andreas_s has quit IRC13:04
*** vsaienk0 has joined #openstack-infra13:04
*** dprince has joined #openstack-infra13:08
*** jaosorior has quit IRC13:09
AJaegermpeterson: do you have logs?13:10
*** olaph has quit IRC13:13
*** olaph has joined #openstack-infra13:14
*** janki has joined #openstack-infra13:14
mpetersonAJaeger: yes, patches 530642 and 51735913:16
mpetersonAJaeger: first one in devstack-tempest, second one in networking-odl-functional-carbon13:16
*** pcichy has quit IRC13:17
AJaegermpeterson: we have on https://wiki.openstack.org/wiki/Infrastructure_Status item "another set of broken images has been in use from about 06:00-11:00 UTC, reverted once more to the previous one".13:19
*** e0ne has quit IRC13:20
AJaegermpeterson: that was at 14:00 and your run was at that time - and would explain it. I'm just wondering whether the time span is accurate.13:20
fricklerpabelanger: the issue seems to have been that nodepool still wanted to delete nodes on vanilla and that may have blocked other deletes13:22
AJaegermpeterson: wait, those are changes to the Zuul v3 config itself. I wonder whether there's a bug in them that lets Zuul skip the run playbooks.13:22
AJaegermpeterson: please discuss with rest of the team later...13:22
pabelangerfrickler: ouch, okay. We should be able to write a unit test for that in nodepool at least and see what happens with a bad provider13:23
mpetersonAJaeger: yeah, I think it is probably an issue in layer 8 (aka me) rather than the infra13:23
mpetersonAJaeger: sure, when would the rest team be in aprox?13:24
*** katkapilatova1 has joined #openstack-infra13:24
AJaegermpeterson: US best - so during the next three hours...13:24
AJaegerUS based I mean13:24
openstackgerritMerged openstack-infra/project-config master: Remove Neutron legacy jobs definition  https://review.openstack.org/53050013:24
*** e0ne has joined #openstack-infra13:24
*** katkapilatova1 has left #openstack-infra13:25
*** bobh has quit IRC13:25
mpetersonAJaeger: okey I'm EMEA based, I hope I'll still be around, thanks13:25
fricklerpabelanger: Shrews: fyi there are still errors about this happening now on nl02, not sure about the impact13:26
pabelangerfrickler: specific to a cloud?13:27
Shrewspabelanger: you'll keep seeing them for vanilla, unless we remove that provider entirely13:28
*** kiennt26_ has joined #openstack-infra13:28
Shrewsi don't see any deleted instances that aren't getting cleaned up in zookeeper except for vanilla13:29
*** panda is now known as panda|ruck13:29
Shrewspabelanger: were there any issues with the zookeeper server yesterday evening? looks like we lost connection to it around 2018-01-10 22:27:18,481 UTC (5:27pm eastern)13:30
*** e0ne has quit IRC13:30
Shrewsi think that's about when this delete problem started13:31
*** eyalb has left #openstack-infra13:31
Shrewshrm, we probably didn't notice since it recovered about a second later13:33
*** e0ne has joined #openstack-infra13:33
*** apetrich has joined #openstack-infra13:33
*** e0ne has quit IRC13:34
frickler"nodepool list" still shows about 50 nodes in infracloud-vanilla in state deleting13:35
mgagnepabelanger: any known/pending issue with inap? I keep seeing inap mentioned and I'm starting to wonder if there is an underlying issue on our side that needs to be addressed.13:35
pabelangerShrews: yah, working on patches to remove vanilla now13:36
pabelangerShrews: also, not aware of any issues with zookeeper13:36
*** rlandy has joined #openstack-infra13:36
*** trown|outtypewww is now known as trown13:37
Shrewspabelanger: fyi, if we never re-enable vanilla, we'll need to manually delete its instances. might even need to cleanup zookeeper nodes for it.13:38
Shrewsi guess we can't do the instances without a working controller though13:39
openstackgerritZara proposed openstack-infra/storyboard-webclient master: Fix reference to angular-bootstrap in package.json  https://review.openstack.org/53281913:40
pabelangerShrews: Agree13:41
openstackgerritPaul Belanger proposed openstack-infra/project-config master: Remove infracloud-vanilla from nodepool  https://review.openstack.org/53282013:41
pabelangerShrews: frickler: ^13:41
*** rosmaita has joined #openstack-infra13:41
pabelangerShrews: it might be a good idea to maybe come up with a tool or docs to help do the clean up for users in this case of a dead cloud.13:42
pabelangerbefore we'd have to purge the database, would be good to figure out the syntax in zookeeper13:42
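Something like the following is the shape of the tool pabelanger is asking for: a kazoo script that purges a dead provider's node records directly from ZooKeeper. It assumes nodepool's /nodepool/nodes/<id> znode layout with a 'provider' field in each node's JSON; verify that against the real schema before running anything like it:

    import json

    from kazoo.client import KazooClient

    def purge_provider_nodes(zk_hosts, provider, dry_run=True):
        """Remove znodes for nodes that belong to a dead provider."""
        client = KazooClient(hosts=zk_hosts)
        client.start()
        try:
            for node_id in client.get_children('/nodepool/nodes'):
                path = '/nodepool/nodes/' + node_id
                data, _ = client.get(path)
                node = json.loads(data.decode('utf8'))
                if node.get('provider') != provider:
                    continue
                print('%s node %s (state=%s)' % (
                    'would delete' if dry_run else 'deleting',
                    node_id, node.get('state')))
                if not dry_run:
                    client.delete(path, recursive=True)
        finally:
            client.stop()

    # e.g. purge_provider_nodes('zk.example.org:2181', 'infracloud-vanilla')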
vsaienk0AJaeger: pabelanger please add to your review queue https://review.openstack.org/#/c/531375/ - adds releasenotes job to n-g-s13:44
*** lucas-hungry is now known as lucasagomes13:44
*** markvoelker has joined #openstack-infra13:44
*** wolverineav has joined #openstack-infra13:46
Shrewspabelanger: ++  Having someone do anything manual in zookeeper is not a friendly user experience13:47
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Improve logging around ZooKeeper suspension  https://review.openstack.org/53282313:48
Shrews^^^ might help with debugging this situation in the future13:49
*** tosky has quit IRC13:50
fricklerShrews: pabelanger: different issue, would this be correct if I want to hold a node for debugging? zuul autohold --tenant openstack --project openstack/cookbook-openstack-common --job openstack-chef-repo-integration --reason "frickler: debug 523030" --count 113:51
*** masber has quit IRC13:52
Shrewsfrickler: i believe that's correct13:52
pabelangeryah13:52
pabelangerlooks right13:52
Shrewsfrickler: just make sure to delete the node when you're done with it13:52
fricklerShrews: sure13:53
*** tosky has joined #openstack-infra13:54
fricklerhmm, is it expected that "zuul autohold-list" takes like forever? or is that maybe also blocking, trying to access vanilla?13:56
*** pgadiya has quit IRC13:56
*** e0ne has joined #openstack-infra13:58
fricklermy autohold command also doesn't proceed. maybe I'll just wait with that until nodepool issues are sorted out13:58
*** alexchadin has joined #openstack-infra13:59
*** shardy has quit IRC14:02
pabelangerdid nodepool restart?14:03
pabelangergrafana.o.o says we have zero nodes online14:03
*** kiennt26_ has quit IRC14:04
pabelangerhttp://grafana.openstack.org/dashboard/db/nodepool14:04
pabelangerfrickler: Shrews ^14:04
*** kiennt26_ has joined #openstack-infra14:05
pabelangeroh, maybe we just had a gate reset14:05
pabelangerbut ~500 nodes just deleted14:06
Shrewspabelanger: yes, restarted launchers earlier14:06
fricklerpabelanger: Shrews: https://review.openstack.org/532303 failed at the head of integrated gate14:06
pabelangerShrews: this just happened a minute ago14:06
pabelangeryah, must be a gate reset14:07
*** eharney has joined #openstack-infra14:07
pabelangerall our nodes tied up in gate pipeline14:07
*** efried has joined #openstack-infra14:08
fricklerhttp://logs.openstack.org/03/532303/1/gate/openstack-tox-py35/c7bad8c/job-output.txt.gz has a timeout and ssh host id changed warning14:08
pabelangerlooks like citycloud14:09
*** dave-mccowan has joined #openstack-infra14:10
*** slaweq has joined #openstack-infra14:12
*** dave-mccowan has quit IRC14:14
rosmaitahowdy infra folks ... quick question: when a gate job seems to be stuck in queued for a long time, is there any action i can take (other than be patient or go grab a coffee)?14:14
fricklerrosmaita: you can tell us which job it is so we can take a closer look14:15
*** dave-mccowan has joined #openstack-infra14:15
fricklerrosmaita: but also we have a pretty large backlog currently due to various issues during the last couple of days14:16
rosmaitafrickler thanks! the one i was interested in has actually been picked up14:16
*** abhishekk has joined #openstack-infra14:17
*** slaweq has quit IRC14:17
*** markvoelker has quit IRC14:18
*** Goneri has joined #openstack-infra14:18
*** edmondsw has joined #openstack-infra14:21
*** tomhambleton_ has joined #openstack-infra14:21
*** sdoran has joined #openstack-infra14:21
*** evin has quit IRC14:24
*** edmondsw_ has joined #openstack-infra14:24
*** psachin has quit IRC14:24
*** esberglu has joined #openstack-infra14:28
openstackgerritChandan Kumar proposed openstack-infra/project-config master: Switch to tempest-plugin-jobs for ec2api-tempest-plugin  https://review.openstack.org/53283514:28
*** edmondsw has quit IRC14:28
chandankumarAJaeger: regarding uploading package to pypi, respective team will take care of that14:29
chandankumarfor their tempest plugins14:29
mordredShrews: morning! anything I should be looking at?14:29
*** ijw has joined #openstack-infra14:31
Shrewsmordred: no, don't think so14:32
*** rosmaita has quit IRC14:34
*** jkilpatr has quit IRC14:39
*** jkilpatr has joined #openstack-infra14:45
*** edmondsw_ is now known as edmondsw14:45
*** abhishekk has quit IRC14:46
openstackgerritTytus Kurek proposed openstack-infra/project-config master: Add charm-interface-designate project  https://review.openstack.org/52937914:46
*** hongbin has joined #openstack-infra14:48
*** Swami has joined #openstack-infra14:52
AJaegerchandankumar: once all are created, tell me in the patch and we can merge...14:55
chandankumarAJaeger: sure!14:56
*** hoangcx_ has joined #openstack-infra14:57
hoangcx_AJaeger: Hi, Do you know why https://docs.openstack.org/neutron-vpnaas/latest/ is not live even if we merged a patch in our repos recently14:58
*** eharney has quit IRC15:00
hoangcx_AJaeger: I mean https://review.openstack.org/#/c/522695/ merged in project-config and the we megered one patch in vpnaas to reflect the change. But I don't see the link alive yet.15:00
*** gouthamr has joined #openstack-infra15:01
hoangcx_s/the we/then we15:01
*** kopecmartin has quit IRC15:02
*** markvoelker has joined #openstack-infra15:02
*** markvoelker has quit IRC15:02
*** mriedem has joined #openstack-infra15:02
AJaegerhoangcx_: check http://zuulv3.openstack.org/ - the post queue should contain your job15:03
*** eharney has joined #openstack-infra15:03
AJaegerhoangcx_: we have a backlog of 11+ hours, and your change only merged 1h ago15:04
openstackgerritsebastian marcet proposed openstack-infra/openstackid-resources master: Added new endpoint merge speakers  https://review.openstack.org/53284415:04
*** ihrachys has quit IRC15:05
*** ihrachys has joined #openstack-infra15:05
*** kgiusti has joined #openstack-infra15:08
*** shardy has joined #openstack-infra15:08
fricklerinfra-root: I've seen multiple gate failures with "node_failure" now, but don't know how to debug these further, zuul.log just says "node request failed". http://paste.openstack.org/show/642701/ and also 526061,215:08
hoangcx_AJaeger: Clear the point now. I will check it tomorrow. Sorry for trouble and thank you for pointing it out.15:11
openstackgerritMerged openstack-infra/openstackid-resources master: Added new endpoint merge speakers  https://review.openstack.org/53284415:11
*** hoangcx_ has quit IRC15:12
*** Apoorva has joined #openstack-infra15:12
*** hjensas has quit IRC15:12
*** jkilpatr has quit IRC15:13
*** jkilpatr has joined #openstack-infra15:13
*** jkilpatr has quit IRC15:13
*** jkilpatr has joined #openstack-infra15:13
*** kopecmartin has joined #openstack-infra15:15
*** bmjen has joined #openstack-infra15:15
*** Hunner has joined #openstack-infra15:16
*** Hunner has quit IRC15:16
*** Hunner has joined #openstack-infra15:16
*** hamzy has quit IRC15:17
*** edmondsw has quit IRC15:18
*** hashar is now known as hasharAway15:19
Shrewsfrickler: i'm thinking that is likely related to the vanilla failure. vanilla worker thread in the launcher is getting assigned the request, but it's throwing an exception while trying to determine if it has quota to handle it15:20
Shrewsfrickler: i think we need to cleanup exception handling there. it's newish code15:20
Shrewsoh, the exception is handled correctly. it fails the request15:21
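In other words: the provider worker accepts the request, the dynamic quota lookup hits the dead cloud's API and raises, and the handled exception turns into a failed request, which shows up as NODE_FAILURE. The kind of hardening being hinted at would look roughly like this (illustrative names, not nodepool's actual code):

    import logging

    log = logging.getLogger('launcher')

    def has_quota(provider, request):
        """Best-effort quota check that degrades instead of failing the request."""
        try:
            # Dynamic path: ask the cloud what is actually available right now.
            limits = provider.get_compute_limits()  # illustrative API call
            available = limits.max_instances - limits.used_instances
        except Exception:
            # A dead provider (keystone unreachable, etc.) should not turn into a
            # NODE_FAILURE; fall back to the static max-servers bookkeeping, or
            # decline so another provider can pick the request up instead.
            log.exception('Could not query quota for %s', provider.name)
            available = provider.max_servers - provider.launched_nodes
        return available >= len(request.node_types)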
sshnaidm|afkpabelanger, hi, we have connection errors to nodes since yesterday, is it known issue? http://logs.openstack.org/63/531563/3/gate/tripleo-ci-centos-7-scenario004-multinode-oooq-container/357414a/job-output.txt.gz#_2018-01-11_14_06_38_64818215:21
*** sshnaidm|afk is now known as sshnaidm15:21
*** kopecmartin has quit IRC15:21
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Log request ID on request failure  https://review.openstack.org/53285715:23
*** markvoelker has joined #openstack-infra15:24
Shrewspabelanger: approved https://review.openstack.org/53282015:24
Shrewsfrickler: ^^^ should help15:24
*** edmondsw has joined #openstack-infra15:24
*** bobh has joined #openstack-infra15:24
*** rmcall has joined #openstack-infra15:26
*** pcichy has joined #openstack-infra15:27
*** pcichy has joined #openstack-infra15:27
*** alexchad_ has joined #openstack-infra15:28
dmsimardFYI there are github unicorns going around15:30
dmsimardin case some jobs fail because of that15:30
*** slaweq_ has quit IRC15:30
dmsimardhttps://status.github.com/messages confirmed outage15:31
*** slaweq has joined #openstack-infra15:31
*** alexchadin has quit IRC15:31
*** afred312 has quit IRC15:31
*** afred312 has joined #openstack-infra15:32
pabelangerShrews: danke15:32
*** kopecmartin has joined #openstack-infra15:33
*** Swami has quit IRC15:33
pabelangerdmsimard: shouldn't be much of an impact, minus replication15:34
*** janki has quit IRC15:35
*** shardy has quit IRC15:35
*** slaweq has quit IRC15:35
*** yamamoto has quit IRC15:37
*** felipemonteiro has joined #openstack-infra15:37
*** alexchad_ has quit IRC15:37
*** links has quit IRC15:38
*** yamamoto has joined #openstack-infra15:38
*** felipemonteiro_ has joined #openstack-infra15:38
*** caphrim007 has quit IRC15:40
*** caphrim007 has joined #openstack-infra15:40
corvusShrews: i think we can approve 532594 now and if you make the change i suggested in 532709 we can merge it too.15:42
*** yamamoto has quit IRC15:43
*** felipemonteiro has quit IRC15:43
mwhahahagetting post failures again today15:44
openstackgerritDavid Shrewsbury proposed openstack-infra/system-config master: Zuul executor needs to open port 7900 now.  https://review.openstack.org/53270915:45
Shrewscorvus: done15:45
mwhahahahttp://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22SSH%20Error%3A%20data%20could%20not%20be%20sent%20to%20remote%20host%5C%2215:45
*** xarses has joined #openstack-infra15:45
*** caphrim007 has quit IRC15:45
dmsimardmwhahaha: we're already tracking it in elastic-recheck15:45
dmsimardI've spent some amount of time investigating this week without being able to pinpoint a specific issue that wouldn't be related to load15:46
pabelangercorvus: mordred: dmsimard: this doesn't look healthy: http://paste.openstack.org/show/642816/15:46
pabelangerfirst time I've seen DB errors on ARA15:46
pabelangers/error/warning15:47
dmsimardI've discussed that issue with upstream Ansible recently15:47
dmsimardIt's an Ansible bug15:47
*** kiennt26_ has quit IRC15:48
*** hamzy has joined #openstack-infra15:48
dmsimardtl;dr, in some circumstances Ansible can pass a non-boolean ignore_errors down to callbacks (breaking the "contract")15:48
* dmsimard searches logs15:51
*** esberglu has quit IRC15:52
dmsimard2018-01-05 18:16:46     @sivel  dmsimard: I just did a little testing, your `yes` bug was fixed by bcoca in https://github.com/ansible/ansible/commit/aa54a3510f6f14491808291a0300da609b42753d15:52
dmsimardThat's landed in 2.4.215:52
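For context, the callback contract in question is v2_runner_on_failed(result, ignore_errors=False); before 2.4.2 Ansible could pass the raw string (e.g. 'yes') through instead of a bool. A defensive coercion on the callback side would look roughly like this (a sketch of the pattern only, not ARA's actual code):

    from ansible.plugins.callback import CallbackBase

    TRUTHY = frozenset(('yes', 'on', 'true', '1'))

    def to_bool(value):
        """Coerce the loosely typed ignore_errors value older Ansible may pass."""
        if isinstance(value, bool):
            return value
        return str(value).strip().lower() in TRUTHY

    class DefensiveCallback(CallbackBase):
        CALLBACK_VERSION = 2.0
        CALLBACK_NAME = 'defensive_example'

        def v2_runner_on_failed(self, result, ignore_errors=False):
            # Pre-2.4.2 Ansible could hand us 'yes' here instead of True; normalize
            # before persisting so storage/database code only ever sees a real bool.
            ignore_errors = to_bool(ignore_errors)
            self._record_failure(result, ignore_errors)

        def _record_failure(self, result, ignore_errors):
            pass  # placeholder for whatever the real callback records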
Shrewsdmsimard: sounds like we need https://review.openstack.org/53100915:53
dmsimardShrews: yeah I was just about to mention that15:53
dmsimard2.4.2 also contains other fixes we need15:53
*** alexchadin has joined #openstack-infra15:53
Shrewsdmsimard: that might require coordination with infra-root about the upgrade path15:54
pabelangeryah, maybe even PTG?15:54
pabelangersince we are all in a room15:54
dmsimardShrews: right, but it also means we'll be upgrading all the jobs, playbooks, roles, etc... to 2.4.215:54
pabelangers/all/most all/15:54
dmsimardand you know what it means to upgrade from a minor ansible release to the next15:54
Shrewsi'm going to -2 that, lest someone approve before we discussed it (based on last night's experience)15:54
pabelangerdmsimard: yah, part of my concern as well, some playbook / role some place might not be 2.4.2 compat15:55
pabelangerso, we'll need to be on deck to help migrate that15:55
pabelangermaybe we need to bring zuulv3-dev.o.o online as thirdparty to do some testing15:55
pabelangerwould be a good way to get some code coverage on 2.4.215:56
dmsimardpabelanger: and it's not just z-j and o-z-j, it's every project's playbooks/roles, across branches, etc.15:56
pabelangeryes15:56
*** ykarel is now known as ykarel|away15:57
*** derekh has quit IRC15:59
Shrewsi have to step away for a bit now for an appointment. bbl16:01
*** Guest88600 has quit IRC16:03
toskydmsimard: re those POST_FAILURE you discussed earlier, should I just recheck or wait?16:03
*** armaan has quit IRC16:04
*** sree has joined #openstack-infra16:04
dmsimardtosky: there's a lot of load in general right now16:04
*** ricolin has joined #openstack-infra16:04
dmsimardcorvus: I know you want to wait for the RAM governor but the executors aren't keeping up..16:04
dmsimarddo you have any suggestions ?16:05
*** ykarel|away has quit IRC16:05
dmsimardthe graphs are fairly thundering herd-ish16:05
*** sbra has joined #openstack-infra16:05
toskyI guess I can wait a bit until people leave the offices in the US :)16:06
corvusdmsimard: fix the broken executor?16:06
corvusdmsimard: and then implement the ram governor?  perhaps that should be considered a high priority?16:07
dmsimardShrews: should we rebuild ze04 ? I'm not sure what's the takeaway from your email16:08
*** jbadiapa has quit IRC16:08
*** sree has quit IRC16:09
*** slaweq has joined #openstack-infra16:09
pabelangerI don't think we need a rebuild, we just need to finish landing tcp/7900 changes in puppet-zuul16:09
pabelangerthen, schedule restarts16:09
corvusdon't schedule, just do.  but there will be permissions that need fixing.16:09
pabelanger+3 on https://review.openstack.org/532709/ for firewall16:10
corvushttps://review.openstack.org/532594 is the other16:10
pabelangergreat, +316:11
pabelangerchecking permissions16:11
corvuswhile those are landing, someone can go on ze04 and fix filesystem permissions16:11
openstackgerritMerged openstack-infra/project-config master: Remove infracloud-vanilla from nodepool  https://review.openstack.org/53282016:12
*** slaweq has quit IRC16:13
*** slaweq_ has joined #openstack-infra16:13
pabelangercorvus: any objections to also having puppet-zuul check them?16:15
corvuspabelanger: why?16:15
*** eumel8 has quit IRC16:16
corvuslet's just fix this and move on.  it was a one time mistake.16:16
pabelangerok16:16
*** armaan has joined #openstack-infra16:17
*** slaweq_ has quit IRC16:17
pabelangeractually, we'll need to change these on other executors too, right? Currently /var/log/zuul is root:root16:18
pabelangeractually16:18
pabelangerzuul:root16:18
*** kopecmartin has quit IRC16:19
pabelangerokay, we're safe there16:19
*** kopecmartin has joined #openstack-infra16:19
efriedqq, does 'recheck' kick a currently-running check out of the queue?  With how long stuff is taking, I want to be able to recheck as soon as I see any failure rather than waiting for the job to finish.16:19
*** armaan has quit IRC16:21
pabelangerodd, /var/log/zuul/executor.log is root:root; I would have thought it was zuul:zuul or zuul:root on ze0216:22
*** e0ne has quit IRC16:24
*** alexchadin has quit IRC16:24
*** alexchadin has joined #openstack-infra16:25
*** esberglu has joined #openstack-infra16:26
AJaegerefried: no, it does not. one way to stop: rebase16:26
pabelangerokay, I think we'll also need fix permissions on /var/lib/zuul for all executors after we stop16:26
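For reference, a minimal sketch of the ownership fix being discussed, assuming the executor service name and the zuul:zuul ownership are what the puppet change expects:
    # stop the executor, hand its log and state dirs to the zuul user, then start it again
    sudo service zuul-executor stop
    sudo chown -R zuul:zuul /var/log/zuul /var/lib/zuul
    sudo service zuul-executor start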
efriedAJaeger Okay, thanks.  I was getting tempted to do that anyway to see if it knocked anything loose :)16:27
mriedemwhat does NODE_FAILURE as a job result generally mean?16:28
*** alexchadin has quit IRC16:29
clarkbmriedem: shrews was saying earlier that he thought it was related to the vanilla infracloud going away and quota calculations not working16:29
*** sree has joined #openstack-infra16:30
clarkbShrews: does ^ mean max-servers: 0 is no longer effective? or maybe we didn't get that merged quickly enough to avoid problems?16:30
mriedemso it's like creating a vm fails ?16:30
mriedemjust suspicious because it's on a new job definition i'm working on and it's in the experimental queue, and zuul never posted results on it last night, and now it failed this morning with node_failure16:30
clarkbmriedem: that was my understanding of it, due to a bug in new code that tries to more dynamically determine available quota, rather than statically assuming that what we have told nodepool about its quota is correct16:30
*** rosmaita has joined #openstack-infra16:31
mriedemdynamically determine available quota, good luck16:31
mriedemupgrade to pike where there are no more quota reservations to screw everything up16:31
clarkbShrews: pabelanger I'm not sure that removing vanilla entirely should be our fix in this case. Nodepool should handle cloud outages gracefully (and always has in the past)16:32
*** shardy has joined #openstack-infra16:32
*** rwellum has left #openstack-infra16:32
mriedemhere is an unrelated question, i see a change that's been approved for a few days now is sitting in the gate queue for 9 hours, and it's also sitting in the check queue16:32
*** ijw has quit IRC16:32
mriedemshould i recheck it? or just wait for it to drop from the queues altogether?16:32
pabelangerclarkb: agree, it sounds like there was some bug preventing nodes from deleting. Something we should fix for sure16:33
pabelangerclarkb: I removed vanilla to stop image uploads by nodepool-builder16:33
dmsimardmriedem: might be side effect of people re-checking the patches and the few zuul restarts that have happened where we re-enqueued jobs16:33
*** pblaho1 has joined #openstack-infra16:33
mriedemis there a timeout where a change just gets kicked out altogether? like 10 hours or something?16:33
mriedemgiven the reboots and timeouts and such, i assume people are just rechecking like mad to try and get anything in16:34
openstackgerritMerged openstack-infra/puppet-zuul master: Start executor as 'zuul' user  https://review.openstack.org/53259416:34
*** sree has quit IRC16:34
*** ramishra has quit IRC16:34
*** vsaienk0 has quit IRC16:34
*** maciejjozefczyk has quit IRC16:35
clarkbpabelanger: but deleting nodes should be independent of handling node requests, right? In any case it's something we should make sure nodepool handles gracefully so that losing a cloud doesn't affect jobs16:35
*** pblaho has quit IRC16:35
*** HenryG has quit IRC16:36
dmsimardmriedem: the rechecks are exacerbating the load issues that we have right now so we have a bit of a thundering herd effect going on.16:36
pabelangerclarkb: agree, I'd have to defer to Shrews on why that didn't work. But I would guess we can reproduce in a unit test to see why16:36
mriedemthat's what i was wondering - if it's just better to not recheck16:36
dmsimardmriedem: things are slowly stabilizing and we also have a zuul executor that isn't running right now (which is not helping) so our current focus is getting that executor back in line (and resolving the root cause of why it went out in the first place)16:38
mriedemack16:38
dmsimardmriedem: a good idea might be to keep an eye on the zuul-executor graphs here: http://grafana.openstack.org/dashboard/db/zuul-status16:38
mriedemdmsimard: what's a healthy graph look like?16:39
mriedemnot maxed out?16:39
dmsimardmriedem: one where the load of the executors is not 100 :)16:39
mriedemheh is that all16:40
mriedemok will do - thanks for the link btw, i always forget where that is (bookmarked now)16:40
fungiAJaeger: efried: you may be able to abort running check jobs and restart them by abandoning and restoring the change, so you don't need gratuitous rebases16:40
dmsimardfungi: I'm not sure restoring the change triggers the jobs by itself, it might need a recheck after restore but I'm not sure16:41
efriedfungi That's a nice tip.  Though wouldn't that require re+2 as well as re+W?16:41
efriedfungi In any case, rebase wasn't gonna hurt this particular one.16:41
*** HenryG has joined #openstack-infra16:41
fungiefried: as opposed to rebasing? abandon doesn't remove any votes16:41
dmsimardefried: a rebase is likely to require re-voting as well FWIW (though I'm not entirely sure on the criteria about what votes are removed when)16:41
efrieddmsimard In my recent experience, code reviews stick around on a (non-manual) rebase, but +Ws are cleared.16:42
fungidmsimard: it's configurable in gerrit, but generally for our projects it's set to keep code review votes (and clear verified and workflow votes) when the diff of the new change basically matches the diff of the old change. used to use git patch-id to achieve that but i haven't looked at the code in gerrit recently to know whether that's still the case16:43
*** kopecmartin has quit IRC16:43
fungi(`git patch-id` is a sha-1 checksum of the diff's contents, not to be confused with the commit id which covers additional metadata like the commit message and timestamps)16:45
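As a concrete illustration, comparing the patch-ids of two revisions shows whether their diffs are identical (the revisions below are just examples):
    # prints "<patch-id> <commit-id>"; rebases that keep the same diff keep the same patch-id
    git show HEAD | git patch-id
    git show HEAD~1 | git patch-id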
dmsimardfungi: yeah I assumed it had something to do along those lines -- I've modified commit messages before without having to re-run jobs for example16:47
*** dtantsur is now known as dtantsur|afk16:47
fungihopefully not in our gerrit. a new patch set should clear the verified vote and cause new jobs to run16:48
dmsimardmaybe it wasn't in the upstream gerrit, I don't remember16:49
fungialso clears workflow votes so that we don't automatically send it into the gate without an explicit reapproval16:49
fungibut yes, that's configurable per label, so other gerrits may preserve verified votes on rebase, for example16:49
clarkb(we've had to do this because tests whose results are affected by the commit message are or were common)16:49
fungifor similar reasons, we clear all votes (including code review) when a commit message is changed16:50
fungibut that's also configurable16:50
fungii love logging into the rax dashboard. you never know what new tickets will be waiting16:52
fungi"This message is to inform you that the host your cloud server 'ze04.openstack.org' resides on became unresponsive. We have rebooted the server and will continue to monitor it for any further alerts."16:52
pabelangerokay, think I've cleaned up all permissions on ze0416:53
pabelangerwill start it up here in a moment once puppet runs16:53
clarkbpabelanger: that is to pick up the init script change?16:53
clarkb(puppet that is)16:53
pabelangerclarkb: yah16:53
pabelangerbut it does look like logging on the executor is being written as root, so we also need to set up permissions on the others after we stop16:54
*** armaan has joined #openstack-infra16:55
fungi#status log previously mentioned trove maintenance activities in rackspace have been postponed/cancelled and can be ignored16:56
openstackstatusfungi: finished logging16:56
*** pcaruana|afk| has quit IRC16:57
*** felipemonteiro_ has quit IRC16:58
*** florianf has quit IRC17:00
*** dsariel has quit IRC17:00
fungittx: are you still using odsreg.openstack.org or can we delete that instance?17:00
ttxfungi: I thought we cleared it some times ago, I remember someone asking me about it17:01
fungithat may have been forumtopics17:01
ttxthose days we only use ptgbot on one side and forumtopics.o.o17:01
fungii thought i remembered asking you about odsreg but scoured my irc logs and came up with nothing so thought it best to ask again17:02
fungii'm happy to delete odsreg.openstack.org now if you don't object17:02
ttxfungi:  go for it!17:02
fungidone. thanks for confirming ttx!17:03
fungi#status log deleted old odsreg.openstack.org instance17:03
openstackstatusfungi: finished logging17:03
*** iyamahat has joined #openstack-infra17:03
pabelangerokay, ze04 is coming online17:10
*** pcichy has quit IRC17:10
*** iyamahat has quit IRC17:11
pabelangerand running jobs17:14
pabelangerstill waiting for firewall to open up I think17:14
pabelangerfinger dd7cbca51ff543389eeb43dac537557f@zuulv3.openstack.org17:14
*** electrofelix has left #openstack-infra17:16
*** links has joined #openstack-infra17:17
*** links has quit IRC17:17
*** jpich has quit IRC17:17
*** lucasagomes is now known as lucas-afk17:19
clarkbI'm really confused by this node failure stuff17:20
clarkbI'm looking at a node request that went in at about 1420UTC got rejected by citycloud and vexxhost then appears to go idle. Then 2 hours later infracloud picks it up and then the status is marked as failed?17:21
pabelangerI haven't looked, is there a log?17:21
clarkbpabelanger: ya I'm trying to put it together for this example I'll have a paste up soon17:21
pabelangerI did see some citycloud failures this morning about SSH hostkey changing, wonder if we had some ghost instances17:22
pabelangerbut haven't looked more into it17:22
clarkbthis is way before any ssh can happen17:23
clarkbits all in the "please boot me a node" negotiation17:23
*** amotoki has quit IRC17:24
*** dhajare has quit IRC17:24
pabelangerShrews: do you have a moment to look at fingergw.log? Seeing some exceptions around routing; possibly we need to improve logging, or it's a bug17:24
*** electrofelix has joined #openstack-infra17:26
*** Apoorva has quit IRC17:28
*** hjensas has joined #openstack-infra17:28
fungiclarkb: possible when we restarted the launchers we upgraded to a regression of some sort?17:28
Shrewspabelanger: i just got back from the appointment i mentioned earlier. let me catch up on backscroll17:29
pabelangerokay, I think it might be firewall17:29
pabelangerconfirming that is open17:29
pabelangerah, yup: https://review.openstack.org/532709/ isn't merged17:30
pabelangerUmm17:31
pabelanger[Thu Jan 11 17:26:37 2018] zuul-scheduler[19039]: segfault at a9 ip 0000000000513ef4 sp 00007f2411437ea8 error 4 in python3.5[400000+3a9000]17:31
*** trown is now known as trown|outtypewww17:31
pabelangerthat doesn't look good17:31
pabelangerhttp://zuulv3.openstack.org/ is also down17:31
fungidoes that correspond to another puppet exec upgrading zuul or its deps?17:32
Shrewsclarkb: the new quota checks from tristanC were failing (b/c the query to the provider wasn't working). i think we should short-circuit that if we've said max-servers is 0. it worked before on a failed provider b/c it didn't query that provider, and just went off max-servers17:32
pabelangerfungi: let me check17:32
clarkbpabelanger: Shrews corvus http://paste.openstack.org/show/642927/ (not to distract from the segfault) thats what I can find about infracloud vanilla17:32
*** armax has joined #openstack-infra17:32
pabelangerfungi: yes17:33
clarkbShrews: ya I see that but why did it take 2 hours to process that request in the first place?17:33
pabelangerJan 11 17:16:43 zuulv3 puppet-user[26991]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events17:33
Shrewsand by "i think we should short-circuit", i mean "should fix it so that it short-circuits"17:33
*** slaweq has joined #openstack-infra17:33
fungiso far all the segfault events have happened right on the heels (within a minute) of a puppet exec upgrading zuul or a dependency on the server where it happened17:33
clarkbShrews: but also I'm not sure its sufficient to short circuit, nodepool should handle cloud failures gracefully17:33
clarkbShrews: if the cloud is not there then we shouldn't mark the request failed and instead let some other cloud service it17:33
Shrewsclarkb: i dunno. that is new info to me17:34
corvusthe segfault killed the gearman server17:34
Shrewsclarkb: i am not disagreeing with you.17:34
corvusthis is not recoverable17:34
pabelangerI also do not see zuul-web17:35
pabelangercorvus: happy to pass to you to drive if you'd like17:35
clarkbShrews: cool, just pointing out a short circuit on max-servers isn't sufficient to get that behavior17:35
corvuspabelanger: i don't think there's anything to do but to restart with empty queues.17:35
corvuspabelanger: you continue driving17:35
pabelangerokay17:36
clarkbShrews: having a hard time with the logs because we don't seem to log the point where we mark it failed on the nodepool side with the request id (that may be because we've bubbled up super far after the cloud exception)17:36
Shrewsclarkb: https://review.openstack.org/53285717:36
pabelangerokay, stopping zuul-scheduler per corvus's recommendation17:36
*** coolsvap has quit IRC17:36
*** slaweq has quit IRC17:37
pabelangerscheduler starting17:37
Shrewsclarkb: pabelanger: fwiw, there were 2 separate issues i found this morning in the firefight. the vanilla issue is not related to the stuck delete issue17:39
pabelangerokay, I've started, and stopped zuul-web. It seems to raising exceptions in scheduler17:39
pabelangerAttributeError: 'NoneType' object has no attribute 'layout'17:39
pabelangercat jobs now running17:39
Shrewsthe vanilla issue not being handled well we can fix. i don't have an answer on the delete thing17:40
corvuspabelanger: you can start zuul-web, that's harmless17:40
pabelangercorvus: okay17:40
pabelangerzuul-web started17:40
*** vsaienk0 has joined #openstack-infra17:41
corvusis there any place where we have the full output from the pip install of zuul?17:41
*** cshastri has quit IRC17:41
corvusie, puppet reports or anything?17:41
Shrewspabelanger: did you still need me to look at something?17:42
pabelangerokay, scheduler back online17:42
corvuspabelanger: i suggest sending a notice17:42
pabelangerShrews: I don't think so, firewall change hasn't landed yet17:42
pabelangercorvus: agree17:42
*** jamesmcarthur has joined #openstack-infra17:43
fungicorvus: i believe pip will log in the homedir of the account running that exec, so presumably ~root/17:43
fungilooking17:43
clarkbShrews: fwiw I think the deleted issue is the smaller concern of the two since it doesn't directly affect job results17:43
fungithough i'm not finding it yet17:44
clarkbShrews: comment on https://review.openstack.org/#/c/532857/117:44
pabelangerhow does this sound: #status log Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets17:44
clarkbnow to catch up on the zuul scheduler thing17:45
fungipabelanger: lgtm17:45
pabelanger#status notice Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets17:45
openstackstatuspabelanger: sending notice17:45
clarkbI'm guessing its too late now but maybe we should turn on core dumps?17:46
corvusit's never too late to turn on core dumps17:46
fungiJan 11 17:16:40 zuulv3 kernel: [67122.422307] zuul-web[19121]: segfault at a9 ip 0000000000513ef4 sp 00007fb4116b98a8 error 4 in python3.5[400000+3a9000]17:46
clarkb(might be able to modify that ulimit for a running process somehow?)17:46
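One way to do that without a restart, assuming util-linux's prlimit is available on the host and using pgrep to find the pid:
    # raise the core-file size limit on the already-running scheduler process
    sudo prlimit --pid "$(pgrep -o -f zuul-scheduler)" --core=unlimited:unlimited
    # and point core dumps at a known location (pattern is just an example)
    sudo sysctl -w kernel.core_pattern=/var/crash/core.%e.%p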
fungiJan 11 17:16:43 zuulv3 puppet-user[26991]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events17:46
pabelangerI am unsure where pip install would be logged now17:46
fungithat's pretty tight correlation :/17:46
-openstackstatus- NOTICE: Due to an unexpected issue with zuulv3.o.o, we were not able to preserve running jobs for a restart. As a result, you'll need to recheck your previous patchsets17:47
corvusfungi: the scheduler segfault (which was the geard process) was 10m later?17:47
fungi17:26:38 yeah17:48
corvusi think it would be really useful to know if a certain dependency was updated, or if it was just zuul17:48
corvusthus my query about logs17:48
openstackstatuspabelanger: finished sending notice17:48
fungithis is at least the third time we've seen this happen (last couple times it took out all the executors and maybe the mergers too)17:49
fungiand yeah, i'm looking to see if explicit pip logging needs to be enabled in that exec17:49
pabelangerI did run kick.sh on zuulv3.o.o to attempt to pick up firewall changes, I don't see how, but maybe related?17:50
Shrewsclarkb: i think that log thing is totally separate. there's a bug there, i think, with a handler being present when the provider has been removed17:50
*** vsaienk0 has quit IRC17:51
fungipabelanger: at what time?17:51
Shrewsclarkb: i think we should land the log improvement, then I can look at the other thing17:51
clarkbShrews: run_handler is what creates the self.log object and so if it crashes in run_handler then we hvae no logger17:51
clarkbShrews: I'm fine with that but I also don't know that it will actually give us the logs we want17:51
*** ricolin has quit IRC17:51
pabelangerfungi: I do not have timestamps on my terminal, but it was before I noticed an issue with http://zuulv3.o.o17:51
clarkb(but merging the change isn't a regression in that case, it just also won't fix things?)17:51
fungipabelanger: was the kick happening roughly concurrent with the 17:26:38 segfault for the gearman process or the 17:16:40 segfault for the zuul-web daemon?17:52
clarkbShrews: I'll approve it17:52
pabelangerthe puppet run, was success with no unknown errors17:52
pabelangerfungi: it was after 2018-01-11 17:15:05,152 so it could have been the reason for puppet run17:52
corvusclarkb, Shrews: erm, let's fix it right?17:52
clarkbcorvus: yes we should fix it17:53
fungipabelanger: yeah, i don't see any puppet activity around 17:26:38 just 17:16:4017:53
fungipabelanger: so sounds likely to be the earlier one17:53
clarkbcorvus: its unrelated enough to the existing change and if the exception happens after configuring the logger we will be fine17:53
clarkbso I think we can approve the existing change then swing around and fix the logger17:53
*** Apoorva has joined #openstack-infra17:53
corvusclarkb, Shrews: i'd hate to restart nodepool just to end up without the extra info we need because of that bug17:53
pabelangerfungi: in fact, I only see puppet-user from my kick.sh attempt17:53
pabelangerfungi: I don't see any runs of our wheel for some reason17:54
*** sambetts is now known as sambetts|afk17:54
fungipabelanger: yeah, i wonder if something about our meltdown upgrades yesterday has broken puppeting17:55
*** jistr is now known as jistr|afk17:55
clarkbcorvus: I can remove the +W if you want to see it fixed all at once17:55
pabelangergit.o.o is failing puppet, i think that is blocking zuulv3.o.o from running17:55
pabelangerlooking at git.o.o now17:55
clarkbwent ahead and did that17:55
pabelangeroh, yum-crontab fix didn't land17:56
Shrewsclarkb: corvus: run_handler() is re-entrant, executed multiple times for a request. most of the time, self.log is available. but that race i mentioned is what we need to fix, and is rare17:56
pabelangerthat is odd17:56
pabelangerhttps://review.openstack.org/532331/17:57
pabelangerI thought that merged, apparently not17:57
*** ijw has joined #openstack-infra17:57
pabelangerso, it is possible that was the first puppet run on zuulv3.o.o in a few days17:57
*** yamamoto has joined #openstack-infra18:00
*** nicolasbock has joined #openstack-infra18:00
pabelangerApologies, it does seem my kick.sh was the reason for puppet run18:00
openstackgerritDirk Mueller proposed openstack/diskimage-builder master: Add SUSE Mapping  https://review.openstack.org/53292518:00
corvuspabelanger: well, i mean, puppet runs happen.  it's hardly your fault.  i don't think there's anything about kick.sh that should cause a segfault.18:02
corvusclarkb, pabelanger, fungi: this segfault thing is pretty critical -- we'll sink if zuul just randomly gets killed.  what do we need to do to track it down?18:03
*** ralonsoh has quit IRC18:04
*** jbadiapa has joined #openstack-infra18:04
*** yamamoto has quit IRC18:04
fungicorvus: looking through pip --help i see options for directing it where to log and how verbosely, but we'll probably want to set that in /etc/pip.conf given the mix of explicit pip calls and pip puppet package resources18:04
fungiso i'm looking up the corresponding config options18:05
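A sketch of the kind of pip.conf entry being looked up; the path and option name should be verified against the pip version actually installed:
    # log every pip invocation (explicit calls and puppet package resources alike) to one file
    sudo mkdir -p /var/log/pip
    sudo tee /etc/pip.conf <<'EOF'
    [global]
    log = /var/log/pip/pip.log
    EOF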
corvusmaybe we can figure out what was updated by filesytem timestamps18:05
corvusfungi, pabelanger, clarkb: http://paste.openstack.org/show/642940/18:05
fungimaybe, though with the gearman crash coming after a 10-minute delay that may indicate that the correlation isn't tight enough to be able to match them18:05
corvuswas msgpack the thing from earlier?18:05
fungiyes, it's the one which changed its name i think?18:05
johnsomFYI, we are still seeing "retry_limit" errors that are bogus.18:06
johnsomzuul patch 531514 for example18:06
fungicorvus: also, deploying pip.conf updates widely will depend on getting puppet working again, so also need to look into why that broke18:06
clarkbcorvus: probably a good idea to get core dumps turned on too?18:06
pabelangerI agree we should consider coredumps for zuul, is there any impact to that?18:07
clarkbpabelanger: could fill up our disk (but we can watch it closely)18:07
corvusso does this happen everytime msgpack updates, or just once?18:07
pabelangerI believe msgpack is what dmsimard saw for the executor issue also18:07
fungicorvus: we've had several crashes across all zuul daemons (launchers, mergers) i think, so maybe each time msgpack updates?18:08
dmsimardeverything crashed *again* ? sorry I've been in meetings all morning18:08
dmsimardFWIW we're not the only ones having issues with this... https://github.com/msgpack/msgpack-python/issues/268 && https://github.com/msgpack/msgpack-python/issues/26618:09
fungidmsimard: so far just the daemons on the zuulv3.o.o server, but puppet isn't working (pabelanger ran it manually against that server) so that may be why it hasn't crashed any others yet18:09
dmsimardWe might have to manually uninstall both msgpack-python and msgpack, make sure there's no more files in /usr/local/python3.5/dist-packages (stale distinfo, eggs, etc.) and then re-run the (zuul?) reinstallation or just install msgpack by itself, I don't know.18:10
fungithough interestingly, pip3 list says zuulv3 and ze01 both have the same versions of msgpack and msgpack-python18:10
dmsimardfungi: yeah, that's mentioned in the issues I linked18:10
fungimaybe zuulv3 upgrade was delayed somehow?18:11
corvusokay, so if puppet wasn't running on zuulv3.o.o, then maybe this is the same issue, just delayed, and it's not a systemic problem?18:11
dmsimardwell, when it first happened last sunday, only the ZE and ZM nodes were impacted .. and then monday (I think?) zuulv3.o.o crashed because of msgpack as well18:11
corvusoh, so it has happened before18:11
clarkbpabelanger: yes msgpack is what I think we tracked the executors dying to18:12
corvusin that case, since there was just a msgpack release, i wonder if we can expect the z* servers to all crash shortly?18:12
fungicorvus: yeah, it's tanked all the zuul servers at least twice now, but there may have been two versions of msgpack triggering this18:12
pabelangerdmsimard: when on Monday? We had a swapping issue due to memory, but wasn't aware of a crash18:12
dmsimardfungi: the correlation can be found in /usr/local/lib/python3.5/dist-packages (see timestamps where modules were last updated) and the dmesg timestamps of the general protection fault18:13
fungicorvus: i think what crashed it most recently on the other servers is what just now crashed zuulv3 but was delayed because puppet hasn't run on that server in a while until pabelanger did so manually18:13
clarkbcould it be that the msgpack builds are replacing compiled so files under zuul?18:13
dmsimardthat's how I ended up finding out what was the issue18:13
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Delete the pool thread when a provider is removed  https://review.openstack.org/53293118:13
clarkbexcept we should load everything into memory right? so probably not that18:13
corvusi'm worried that msgpack_python (the deprecated package) was updated, but not msgpack (the replacement)18:14
dmsimardpabelanger: there was two separate incidents for msgpack, one sunday and one monday18:14
dmsimard2018-01-07 20:55:48 UTC (dmsimard) all zuul-mergers and zuul-executors stopped simultaneously after what seems to be a msgpack update which did not get installed correctly: http://paste.openstack.org/raw/640474/ everything is started after reinstalling msgpack properly.18:14
pabelangerfungi: corvus: I believe so, I am going to see if I can reproduce it locally here by installing old version of msgpack, then upgrading it when zuul is running18:14
dmsimard2018-01-08 20:33:57 UTC (dmsimard) the msgpack issue experienced yesterday on zm and ze nodes propagated to zuulv3.o.o and crashed zuul-web and zuul-scheduler with the same python general protection fault. They were started after re-installing msgpack but the contents of the queues were lost.18:14
*** weshay is now known as weshay_interview18:14
pabelangerdmsimard: okay18:15
dmsimardpabelanger: it might require something to hit that region of the memory for the GPF to trigger18:15
fungipypi release history shows msgpack and msgpack-python releases on january 6 and january 918:15
fungione release each on each of those two dates18:15
corvushttps://pypi.python.org/pypi/msgpack/0.5.1 is worth a read18:15
pabelangerlet me see when the last time we ran puppet on zuulv3.o.o was18:16
dmsimardwhy the hell is there two *package names* using the same *module name* ?18:16
fungi0.5.0 for both on the 6th and 0.5.1 on the 9th18:16
*** jpena is now known as jpena|off18:16
pabelangerJan  9 20:33:36 zuulv3 puppet-user[17648]: Finished catalog run in 7.59 seconds18:16
pabelangerwas the last puppet run on zuulv3, before the kick.sh18:17
dmsimardthat's weird, it's almost an exact match from the 01-08 timestamp18:17
fungianybody happen to know off the top of their head what zuul dep is dragging in msgpack?18:17
pabelangerJan  8 21:49:06 zuulv3 puppet-user[10102]: (/Stage[main]/Zuul/Exec[install_zuul]) Triggered 'refresh' from 1 events18:17
pabelangerwas our last install_zuul before kisk.sh18:18
fungi(or msgpack-python i guess)18:18
corvusfungi: no, but i'd like to find out18:18
fungiyeah, was going to track that down next18:18
fungijust didn't know if anyone already had18:18
dmsimardis there something else than puppet that could update python modules ? here's the timestamps from dist-packages: http://paste.openstack.org/raw/642949/18:18
*** dhajare has joined #openstack-infra18:18
corvusfungi: cachecontrol18:19
dmsimardfungi, corvus: the stack trace from msgpack should give us an idea of where it's imported from http://paste.openstack.org/raw/640474/18:19
fungidmsimard: potentially unattended-upgrades if installed from distro packages18:19
*** SumitNaiksatam has joined #openstack-infra18:19
fungicorvus: thanks18:19
corvusdmsimard: confirmed :)18:19
pabelangerdrwxr-sr-x   2 root staff   4096 Jan 11 17:16 msgpack_python-0.5.1.dist-info I think that is what we just installed18:20
dmsimardfungi: doesn't look like it'd be from a package http://paste.openstack.org/raw/642953/18:20
fungilatest release of cachecontrol is what started using msgpack, it seems18:21
dmsimardfungi: but from that last paste (dist-packages), you can see that it's not just msgpack that was updated.. there's zuul there as well18:21
fungiintroduced in 0.12.0, while 0.11.2 and earlier don't use it18:21
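A quick way to confirm that chain on any of the hosts (package names as discussed above):
    # cachecontrol's metadata lists its dependencies, including msgpack as of 0.12.x
    pip3 show cachecontrol | grep -i requires
    # and show which msgpack distributions are actually installed
    pip3 list 2>/dev/null | grep -i msgpack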
*** rmcall has quit IRC18:21
fungidmsimard: right, it happens when we pip install zuul from source each time a new commit lands18:21
dmsimardfungi: so that's not driven by puppet then, it's automatic somehow ?18:22
fungidmsimard: it's driven by puppet. there's a vcsrepo resource for teh zuul git repo which triggers an exec to upgrade zuul from that18:22
dmsimardok, yeah, there's a puppet run that matches the timestamps where the modules have been updated18:23
dmsimardBTW, if this happens again (where we lose the full scheduler) -- there is a short window where the status.json appears still populated while the scheduler reloads everything and we might be able to dump everything during that short window18:24
dmsimardbut I haven't tried18:24
dmsimard(only realized when it was too late)18:24
*** ldnunes has quit IRC18:27
corvusdmsimard: i believe we were alerted to this because zuul-web died18:27
dmsimardcorvus: yeah but for some reason, (monday) when I started zuul-scheduler and zuul-web again, I recall seeing the whole status page and wondering why it wasn't empty18:28
corvusdmsimard: that'll be the apache cache.  it was gone.18:28
dmsimardcorvus: likely18:29
corvusdmsimard: i'm certain; i checked.18:29
dmsimardcorvus: status.json is generated dynamically right ? it's not dumped to disk periodically (even for cache purposes) ?18:29
corvusdmsimard: it's in-memory in zuul.  apache may put it on disk.18:30
dmsimardcorvus: should we consider dumping it to disk periodically even if just for backup purposes so we can re-queue if need be ? though now that I think about it, this can be out of band -- like a cron that just gets it every minute or something18:31
corvusdmsimard: i don't object to a cron18:32
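A minimal sketch of such a crontab entry; the status.json URL and backup directory are assumptions, and the timestamped filename just keeps a rolling day of snapshots:
    # snapshot Zuul's status page once a minute (percent signs must be escaped in crontab)
    * * * * * curl -sf http://zuulv3.openstack.org/status.json -o /var/lib/zuul/backup/status.json.$(date +\%H\%M)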
corvusin the mean time, what should we do about this?18:32
dmsimardI believe pabelanger mentioned he'd like to try and  reproduce it locally18:32
dmsimardI'll send a patch for a cron which could at least help prevent us losing the entire queues -- check and gate aren't so bad, but losing post/release/tag kind of sucks18:33
corvusdmsimard: post/release/tag aren't safe to automatically re-enqueue18:33
*** dprince has quit IRC18:34
dmsimardright, but we have no visibility on what jobs might not have run18:34
corvusit's fine to save them, just pointing out that we can't restore them without consideration.18:34
* dmsimard nods18:34
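If a saved snapshot ever did need to be replayed by hand, it would be one RPC-client call per change along these lines; every value below is a placeholder and the exact option names should be checked against the installed zuul:
    # re-enqueue a single change recorded in a saved status.json (all values are examples)
    zuul enqueue --tenant openstack --trigger gerrit \
        --pipeline check --project openstack/nova --change 123456,7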
*** ldnunes has joined #openstack-infra18:41
*** dprince has joined #openstack-infra18:42
*** sree has joined #openstack-infra18:43
*** iyamahat has joined #openstack-infra18:43
*** tesseract has quit IRC18:45
*** jamesmcarthur has quit IRC18:46
*** sree has quit IRC18:47
AJaegermordred: your change https://review.openstack.org/#/c/532304/ fail in http://git.openstack.org/cgit/openstack-infra/openstack-zuul-jobs/tree/roles/configure-unbound/tasks/main.yaml#n44 . Is role_path not set correctly?18:48
*** fultonj has quit IRC18:48
*** beagles has quit IRC18:49
*** b3nt_pin has joined #openstack-infra18:49
*** b3nt_pin is now known as beagles18:49
openstackgerritMerged openstack-infra/system-config master: Fix typo with yum-cron package / service  https://review.openstack.org/53233118:51
openstackgerritMerged openstack-infra/zuul-jobs master: Capture and report errors in sibling installation  https://review.openstack.org/53221618:51
AJaegermordred: or is that a problem of the base-minimal parent?18:51
*** yamahata has joined #openstack-infra18:53
AJaegermordred: left comments on 53230418:54
*** felipemonteiro has joined #openstack-infra18:55
*** felipemonteiro_ has joined #openstack-infra18:56
*** hemna_ has joined #openstack-infra18:59
*** sree has joined #openstack-infra19:00
*** jkilpatr_ has joined #openstack-infra19:00
*** felipemonteiro has quit IRC19:00
*** jkilpatr has quit IRC19:01
*** shardy has quit IRC19:01
*** caphrim007 has joined #openstack-infra19:02
*** sree has quit IRC19:05
*** fultonj has joined #openstack-infra19:06
*** sree has joined #openstack-infra19:06
AJaegermlavalle, just commented on https://review.openstack.org/#/c/531496 - do you know what to do or do you have further questions?19:07
*** jkilpatr_ has quit IRC19:07
corvusfungi, pabelanger, clarkb: based on what i reported in #zuul, i *think* we can expect msgpack not to break us again unless they do another rename.  further version upgrades shouldn't break us.  i'm assuming we have 0.5.1 installed everywhere now.  if we want to be extra safe, we might want to uninstall msgpack and msgpack-python everywhere and reinstall msgpack-python 0.5.1 just to clean up, but19:09
corvusshouldn't be necessary.19:09
corvusdmsimard: ^19:09
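Roughly what that extra-safe cleanup would look like on a single host, assuming pip3 owns the packages under /usr/local as noted earlier:
    # remove both distributions, reinstall the fixed transitional package, and sanity-check the import
    sudo pip3 uninstall -y msgpack msgpack-python
    sudo pip3 install msgpack-python==0.5.1
    python3 -c 'import msgpack; print(msgpack.version)'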
pabelangerokay, puppet is running again on zuulv3.o.o now that git servers are running puppet again19:09
clarkbcorvus: good to know thanks for digging in19:09
corvusShrews: what do we need to do for nodepool?19:10
*** sree has quit IRC19:10
pabelangerI am also still waiting for 532709 to land and confirm console logging is working on ze04.o.o before continuing with reboots of other executors19:12
*** sshnaidm is now known as sshnaidm|afk19:15
*** weshay_interview is now known as weshay19:16
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Fix races around deleting a provider  https://review.openstack.org/53293119:16
Shrewscorvus: bunches of things?19:18
Shrewscorvus: sort of sinking in the changes at the moment and can't cover them all19:19
Shrewsand i've totally missed lunch b/c of it, so gonna grab a quick bite. brb19:19
*** slaweq has joined #openstack-infra19:20
corvusShrews: okay.  it looks like the segfault thing is mostly sorted, so when you get a chance to sort out what needs to happen, let me know.19:20
*** vsaienk0 has joined #openstack-infra19:20
*** panda|ruck is now known as panda|ruck|afk19:20
*** jkilpatr_ has joined #openstack-infra19:21
smcginnisLooking into some stable branch failures. Anyone know how this could have happened? http://logs.openstack.org/periodic-stable/git.openstack.org/openstack/kolla/stable/pike/build-openstack-sphinx-docs/60e2946/job-output.txt.gz#_2018-01-09_06_16_06_65155319:23
smcginnisMust be something local: http://git.openstack.org/cgit/openstack/kolla/tree/doc/source?h=stable/pike19:24
fungiwhere did we end up defining zuul nodeset names?19:24
corvusfungi: ozj i think19:26
*** dprince has quit IRC19:26
openstackgerritIhar Hrachyshka proposed openstack-infra/openstack-zuul-jobs master: Switched all jobs from q-qos to neutron-qos  https://review.openstack.org/53294819:26
fungicorvus: thanks! i thought it was in pc for some reason19:26
*** smatzek has quit IRC19:27
AJaegerteam, is our pypi mirror working? We pushed https://pypi.python.org/pypi/openstackdocstheme 2hours ago and it's not yet used in new jobs19:27
*** florianf has joined #openstack-infra19:27
*** smatzek has joined #openstack-infra19:28
openstackgerritMerged openstack-infra/system-config master: Zuul executor needs to open port 7900 now.  https://review.openstack.org/53270919:28
*** alexchadin has joined #openstack-infra19:29
pabelangerAJaeger: let me check19:30
*** sree has joined #openstack-infra19:30
*** vsaienk0 has quit IRC19:30
pabelangerAJaeger: bandersnatch is running19:31
AJaegerpabelanger: and it's at http://mirror.ca-ymq-1.vexxhost.openstack.org/pypi/simple/openstackdocstheme/ - let me recheck then.19:31
AJaegerthanks19:31
*** smatzek has quit IRC19:33
*** sree has quit IRC19:34
Shrewscorvus: so nodepool, we have A) new quota handling does not fail as gracefully since we do more than just check max-servers now  B) it has been suggested that instead of failing a request that gets an exception in request handling, that we instead let another provider try (which is a good idea IMO)  and C) some zookeeper wonkiness caused us to not be able to delete 'deleting' znodes even though the19:39
Shrewsactual instance was deleted. I have no idea on this one atm19:39
Shrewscorvus: and the other thing clarkb was concerned about is hopefully handled in https://review.openstack.org/53293119:40
Shrewsbut i can't test that one locally b/c linux and the world hates me19:40
pabelangernice19:41
pabelangerfinger 8ee380a2a3ec4b1698ccd4fe6e6d5ecb@zuulv3.openstack.org19:42
pabelangerworks19:42
pabelangerthat should be using tcp/7900 now19:42
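A quick way to confirm the port is actually reachable through the firewall from another host (hostname taken from the discussion above):
    # zero-I/O connect test against the executor's finger port
    nc -zv ze04.openstack.org 7900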
*** eharney has quit IRC19:42
pabelangerI think we can proceed with rolling restarts of zuul-executors to drop root permissions19:43
*** harlowja has joined #openstack-infra19:43
pabelangeralong with /var/log/zuul permission fix19:43
*** eharney has joined #openstack-infra19:44
*** edmondsw_ has joined #openstack-infra19:48
*** edmonds__ has joined #openstack-infra19:48
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json  https://review.openstack.org/53295519:49
dmsimard^ as per my suggestion19:49
dmsimardcorvus: ^19:49
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json  https://review.openstack.org/53295519:51
*** alexchadin has quit IRC19:51
clarkbShrews: it would be good to get A) and B) sorted out so that we can safely reboot the chocolate infracloud controller19:51
clarkbI'm sort of all over the place today taking care of sick family and getting new glasses and doing travel paperwork19:51
clarkbbut happy to help where I can (will review that one change shortly)19:51
pabelangerclarkb: are we okay to proceed with zuul-executor restarts? This is to drop root permissions and change the finger port to tcp/7900. Confirmed to be working on ze0419:52
*** edmondsw has quit IRC19:52
*** edmondsw_ has quit IRC19:52
clarkbpabelanger: I think as long as it isn't expected to affect job results we should be fine19:52
clarkbexecutor stops result in jobs being rerun right?19:52
clarkb(its been a couple rough days so doing everything we can to make it less rough is nice)19:53
pabelangerclarkb: yah, jobs will abort and requeue19:53
pabelangerjust means people wait a little longer for stuff to merge19:53
clarkband we are at capacity ya?19:53
clarkbmight be best to wait for things to cool off a bit for that?19:54
pabelangeryah, we are maxed out right now19:54
dmsimardthere's a lot of compounding issues right now19:54
dmsimardthe restarts, the loaded executors, leading people to recheck a significant backlog of things19:55
dmsimardneed to afk food19:55
pabelangerI think what might happen is: if an executor is stopped and started again with /var/log/zuul still owned by root, it might not properly start again19:56
pabelangereg: live migration19:56
clarkbdmsimard: right, I think it may be best to let things settle in a bit19:56
clarkband see where we are since the executor restarts aren't urgent19:56
clarkbpabelanger: will it not fail to start at all or will it start and be broken?19:57
*** jistr|afk is now known as jistr19:57
*** smatzek has joined #openstack-infra19:57
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool feature/zuulv3: Short-circuit request handling on disable provider  https://review.openstack.org/53295719:57
pabelangerlet me check something19:57
pabelangerokay, we might be good19:58
pabelanger-rw-rw-rw- 1 root root 171389851 Jan 11 19:58 /var/log/zuul/executor-debug.log19:58
pabelangerzuul user will be able to write19:58
*** sree has joined #openstack-infra19:59
corvusShrews, clarkb: i think (B) is compatible with the algorithm and should be okay to implement.  i think that would manifest as: when all of the node launch retries have been exhausted, we decline the request.  the final provider which handles that request would still cause it to fail, in the normal way that requests that are universally declined are failed.19:59
corvuspabelanger: it's world writable?20:01
Shrewscorvus: it would be more than node launch retries (in the vanilla case, it was throwing an exception when trying to query the provider for quota info), but... yeah20:01
*** Apoorva has quit IRC20:01
corvusShrews: sounds good20:01
corvusShrews: is there more detail about (A)?20:02
openstackgerritSean McGinnis proposed openstack-infra/openstack-zuul-jobs master: Add commit irrelevant files to tempest-full  https://review.openstack.org/53295920:02
pabelangercorvus: yes20:02
*** armaan has quit IRC20:02
corvuspabelanger: let's fix that after the restarts :)20:03
pabelangeragree20:03
*** armaan has joined #openstack-infra20:03
Shrewscorvus: such as? https://review.openstack.org/532957 re-enables that short-circuit20:03
openstackgerritSean McGinnis proposed openstack-infra/project-config master: Remove irrelevant-files for tempest-full  https://review.openstack.org/53296020:03
Shrewscorvus: i think you know as much as i do at this point.... happy to clarify anything though20:04
corvusShrews: that sounds good, but what do you mean by "new quota handling does not fail as gracefully since we do more than just check max-servers now" ? in what cases does it fail not gracefully?20:04
*** sree has quit IRC20:04
Shrewscorvus: http://paste.openstack.org/show/643045/20:05
ianwthere's a lot of scroll-back for us antipodeans ... are we at a point we want to merge https://review.openstack.org/#/c/532701/ to start new image builds, or is there too much else going on?20:06
corvusShrews: or are A and B really related -- if we fix B, we'll handle the case where we can't calculate quota because the provider is broken by declining the request.  and then when we come along and set max-servers to 0 because we know the cloud is broken, change 532957 will short circuit that and make things cleaner?20:06
*** gema has quit IRC20:06
Shrewscorvus: oh, not failing gracefully... the new quota stuff adds the dimension of exceptions from a wonky provider during the "should I decline this?" checks20:07
ShrewsA and B are sort of related, yes20:07
*** hasharAway is now known as hashar20:08
corvusShrews: okay, i think i grok.  my understanding is: 532957 should fix the most immediate thing and let us restart infracloud without node failures, then solving (B) will let infracloud break in the future without spewing lots of node_failure messages..  correct?20:09
Shrewscorvus: 532957 would have hidden the provider failure, but doesn't fix the exception stuff that can still occur when max-servers >020:09
corvusShrews: yep, that jives with my understanding20:09
Shrewsbecause vanilla was disabled, but we were still getting the exceptions20:09
pabelangerianw: +320:10
Shrewscorvus: yes, with all my current changes that I have up, we can restart and run for a while while I sort B out20:11
pabelangerianw: also, ze04.o.o is back online and running as zuul user again20:11
pabelangerianw: we have not done any other executors yet, likely do so in another day20:11
fungiianw: probably safe as long as we don't expect that updates to the images will bring yet new regressions... we're somewhat backlogged and maxxed out on capacity at the moment20:11
corvusShrews: i've +2 532931 and 53295720:11
corvusShrews: any others?20:11
pabelangerianw: but all patches for puppet-zuul and firewall landed20:11
Shrewscorvus: clarkb: what were the concerns about chocolate? i'm afraid i've been too heads down in firecoding to have noticed20:12
pabelangerfungi: ianw: actually, lets hold off until tomorrow then, just to be safe20:12
pabelangerand time for some patches to merge20:12
*** sree has joined #openstack-infra20:12
ianwok, that puts it into my weekend, so maybe leave it with +2's until my monday20:13
fungiShrews: we need to reboot some of the chocolate control plane for meltdown patching still20:13
Shrewscorvus: no, i think you got them all20:13
ianwthat way a) it's usually quiet(er) and b) i can monitor20:13
pabelangerianw: good idea, might want to -2 then20:13
fungiShrews: and are concerned that any prolonged outage of the api will cause heartbreak for nodepool20:13
Shrewsfungi: is chocolate disabled in nodepool?20:13
corvusclarkb: +3 532931 and 532957 please20:13
fungiShrews: no, only vanilla is disabled because we were unable to get it back into operation reliably20:13
Shrewsfungi: has chocolate been otherwise functional?20:14
fungii _think_ so (inasmuch as it ever is anyway)20:15
corvusShrews: i think the extent of what i was saying is that being able to reliably zero max-servers for a cloud gives us the room to enable/disable as needed without worrying about errors related to (B)20:15
Shrewscorvus: yeah. i think we should disable chocolate with max-servers=0 if we're concerned about it working20:15
Shrews957 gives us that leeway20:16
fungibut if we set it to max-servers=0 are we still going to cause the same problems that vanilla was/is causing with max-servers=0 already?20:16
Shrewsfungi: not with 95720:16
Shrews(in theory)20:16
fungiaha, right, that's the missing piece thanks20:16
*** sree has quit IRC20:16
* Shrews has missing pieces scattered everywhere today20:17
*** cody-somerville has joined #openstack-infra20:23
*** dprince has joined #openstack-infra20:26
clarkbcorvus: looking now20:27
clarkbfwiw 957 isn't really the concern with restarting the controller20:28
clarkbwe shouldn't have node failures if a cloud goes away even if max servers is > 020:29
*** SumitNaiksatam has quit IRC20:29
clarkbthis requires us to know in advance when clouds will have outages which isn't always the case20:29
clarkbI've approved 957 because its a good improvement either way. Also left a comment on it for a followup20:30
clarkbShrews: ^20:30
*** fultonj has quit IRC20:31
*** cody-somerville has quit IRC20:31
Shrewsclarkb: yeah, not saying that fixes the B part. it's a stop-gap until i can fix the other thing20:32
Shrewsbut it's a good stop-gap that should stay since it prevents unnecessary provider calls20:33
*** gema has joined #openstack-infra20:33
clarkbShrews: yup20:33
clarkbShrews: in https://review.openstack.org/#/c/532931/2 why are we removing the extra logging that you just added?20:33
clarkboh wait I misread that nevermind20:33
clarkbcorvus: Shrews both changes have been approved, just the one minor improvement idea on 95720:36
clarkbI need to grab lunch now20:36
Shrewsclarkb: can I +A the logging change?20:37
*** hrubi has quit IRC20:37
*** hrubi has joined #openstack-infra20:39
*** eharney has quit IRC20:41
clarkbShrews: ya I think so20:42
efriedDon't want to recheck unnecessarily and worsen the problem, so: if my change isn't showing up on the zuulv3.o.o dashboard, do I need to recheck it?20:49
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Fix races around deleting a provider  https://review.openstack.org/53293120:49
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Short-circuit request handling on disable provider  https://review.openstack.org/53295720:49
efriedSpecifically https://review.openstack.org/#/c/518633/20:50
*** eharney has joined #openstack-infra20:51
AJaegerefried: yes. But in your case: Ask mriedem to toggle the +W, so it goes directly into gate. A recheck would run it first through check.20:52
*** eharney has quit IRC20:52
efriedAJaeger Oh, that's useful to know.  Thank you.20:52
AJaegerefried: or have another nova core just add an additional +W to trigger that20:52
AJaegerefried: see also top entry at https://wiki.openstack.org/wiki/Infrastructure_Status20:52
efriedAJaeger What about the other already+W'd patches behind that guy?  Will they go to the gate automatically once the first one clears?20:53
efriedor do they need to be +W-twiddled too?20:53
AJaegerefried: hope so ;) Give it a try20:53
efriedAJaeger Thanks.20:53
mriedemwhat do i need to do?20:55
AJaegermriedem: toggle +W on  https://review.openstack.org/#/c/51863320:55
*** e0ne has joined #openstack-infra20:55
mriedemtoggle == remove +W and add it back?20:55
AJaegermriedem: yes20:56
mriedemconsider it toggled20:56
* mriedem blushes20:56
AJaegerthanks, mriedem20:56
efriedThanks mriedem20:56
mriedemmy pleasure20:56
efriedIt appeared in the gate right away, wohoo!20:56
efriedNow we'll see if it gets through.  Last time it sat there for about 12 hours before disappearing mysteriously without a trace...20:57
AJaegerefried: https://review.openstack.org/#/c/518982/19 will not go into gate - there's no Zuul +1 vote. So, that one needs a recheck.20:57
mriedemit hit the bermuda triangle20:57
AJaegerefried: I suggest you recheck those in the stack that don't have Zuul +1.20:58
efriedAJaeger Thanks, will do.20:58
AJaegermriedem: it tried to hide but you found it ;)20:59
fungione of zuul's dependencies made a botched attempt at transitioning between package names, and that has resulted in a couple of rather sudden outages where we ended up unable to reenqueue previously running jobs once we got it back online20:59
fungithankfully, we think this shouldn't be a recurrent problem now that they've fixed their transitional package on pypi20:59
fungithe joys of continuous delivery ;)21:00
mtreinishfungi: which package?21:00
fungimsgpack-python -> msgpack21:00
mtreinishah, ok21:00
fungia transitive dep via cachecontrol21:00
fungimsgpack-python 0.5.0 was basically empty, and so in-place upgrading it caused segfaults for running zuul processes21:01
*** krtaylor has joined #openstack-infra21:01
*** Apoorva has joined #openstack-infra21:02
fungiand upgrading from it had similarly disastrous impact once they released a fixed version21:02
*** david-lyle has quit IRC21:03
*** krtaylor has quit IRC21:03
*** david-lyle has joined #openstack-infra21:04
*** krtaylor has joined #openstack-infra21:04
ianwfungi: ahh, so that's the suspect in our "all the executors went bye-bye" scenario the other day?21:06
*** Apoorva has quit IRC21:06
corvusianw: yep; there's a bit more detail about versions, etc, in #zuul21:06
*** gouthamr has quit IRC21:07
*** edmonds__ is now known as edmondsw21:07
*** e0ne has quit IRC21:09
*** masber has joined #openstack-infra21:10
*** sree has joined #openstack-infra21:12
efriedmriedem Well, that worked so well for that patch, would you mind doing the same for https://review.openstack.org/#/c/521686/ ?21:14
*** gouthamr has joined #openstack-infra21:15
mriedemefried: bauzas is still awake, i'm sure he can do it21:15
efriedHum, that one might be different, since it's still in the check queue (six times ??)21:15
*** olaph1 has joined #openstack-infra21:15
ianweumel8 / ianychoi : i dropped the db (made a backup) and ran puppet on translate-dev ... it doesn't seem to have redeployed zanata.  looking into it21:16
*** olaph has quit IRC21:16
*** sree has quit IRC21:17
AJaegerconfig-core, I updated the list of reviews to review at https://etherpad.openstack.org/p/Nvt3ovbn5x - the backlog is growing; just in case somebody has time left after all the updating and fixing...21:21
AJaegerefried: 521686 has no Zuul +121:21
fungismcginnis: pasted a job log in #-release where a job failed to push content via rsync to static.o.o, connection unexpectedly closed21:23
fungier, smcginnis pasted21:23
fungihttp://logs.openstack.org/a5/a52aa0b2ad06a52e50be8879f9256576ceceb91c/release-post/publish-static/cee3a5f/job-output.txt.gz#_2018-01-11_21_08_06_84464121:23
efriedAJaeger Perhaps you can help me understand what I'm seeing on the dashboard.  518633,23 showed up in the gate and was trucking along nicely.  Then while it was going, three copies of it showed up in the check queue.  They're still there.  And now the one in the gate doesn't have the sub-job thingies anymore - just the one line labeled 518633,2321:23
fungii can connect to the server, at least21:23
efriedAJaeger What does it mean when a change shows up that way (with just the one status line labeled with the change number, that never seems to move)?21:24
fungiefried: does hovering over the dot next to it tell you anything in a pop-up tooltip?21:24
efriedfungi Oo, "dependent change required for testing" -- howzat?21:24
AJaegerefried: I see 518633 as bottom of stack, so that is fine. You rechecked some changes that are stacked on top of it. so, this is fine21:25
fungiefried: looks like the cinder change ahead of it failed a voting job, so your change has been restarted to no longer include testing with that failing cinder change21:25
efriedfungi When you say "ahead of it"...21:26
fungiit took a moment to get new node assignments for the jobs21:26
fungi"above" it in the gate pipeline21:26
efriedThere's no dependency between them, is there?21:26
*** ldnunes has quit IRC21:26
openstackgerritMerged openstack-infra/nodepool feature/zuulv3: Improve logging around ZooKeeper suspension  https://review.openstack.org/53282321:26
fungiefried: only insofar as they share some jobs and there's a chance that the cinder change could break the ability of your change to pass jobs (or vice versa) so we test them together to make sure we don't race in merging an interdependency bug21:27
*** jkilpatr_ has quit IRC21:27
fungibut if we realize the cinder change isn't going to merge, we have to retest your nova change without it applied21:27
efriedfungi I see.  And that helps me understand how the whole queue thrashing problem can compound as it gets more full.21:28
fungiyep. we have a sliding stability window where we'll try to test as many changes at a time as possible, but if we fail too often we scale the window down to a minimum of 20 changes at a time in a shared queue21:28
fungijust to keep the thrash manageable21:29
lbragstadi have a quick question, we started noticing this on stable/pike http://logs.openstack.org/periodic-stable/git.openstack.org/openstack/keystone/stable/pike/build-openstack-sphinx-docs/a5a6cb8/job-output.txt.gz#_2018-01-09_06_13_52_29743221:29
lbragstadwe did see that on master a while ago but I don't think we fixed it with a requirements bump21:29
lbragstadi'm wondering if that jogs anyone's memory here21:30
fungilbragstad: smells like a dependency updating in a backward-incompatible fashion21:30
smcginnislbragstad: Is that package python-ldap?21:30
cmurphylbragstad: fungi let me dig up AJaeger's patch that fixed it - the problem was autodoc wasn't loading all the libs that were declared in setup.cfg and not in requirements.txt21:31
lbragstadhttps://github.com/openstack/keystone/blob/master/setup.cfg#L3121:31
smcginnislbragstad: Ah, it can't be in setup.cfg anymore since the source isn't installed to run the docs job.21:32
lbragstadaha21:32
smcginnislbragstad: Probably need to backport the change cmurphy is thinking of.21:32
*** e0ne has joined #openstack-infra21:32
ianwfungi: any thoughts on what we can do about translate-dev getting  "451 4.7.1 Greylisting in action, please come back in 00:04:59" when sending confirm emails?  how do we make it look more legit?21:32
smcginnisI've been doing that in a few projects to get stable jobs working. Let me know if there's any questions about it.21:32
fungiianw: make sure it's not covered by a listing in the spamhaus pbl, and apply for an exception for that ip address if it is21:33
fungiianw: rackspace has added basically all of their ip assignments to the pbl, i guess as a way to cut down on abuse reports, so you have to explicitly poke holes in those blanket listings for systems you want to be able to send e-mail to popular domains which may make filtering decisions based on pbl lookups21:34
fungicmurphy: great memory, i didn't even consider this might be fallout from the docs pti compliance work21:35
lbragstadcmurphy: wasn't it this one? https://review.openstack.org/#/c/530087/21:35
*** numans has quit IRC21:36
*** numans has joined #openstack-infra21:36
fungiianw: i usually start by looking at https://talosintelligence.com/reputation_center/lookup?search=translate-dev.openstack.org and that's suggesting there aren't any matching blacklist entries so it's probably not pbl impact at least21:37
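The equivalent manual PBL check is a DNS lookup against spamhaus with the server's IPv4 address reversed; 192.0.2.10 below is only an example address:
    # NXDOMAIN means not listed; a 127.0.0.x answer means listed (x indicates which list)
    host 10.2.0.192.zen.spamhaus.org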
cmurphylbragstad: it was either that one or this one https://review.openstack.org/#/c/52896021:37
smcginnislbragstad: It was probably a change in openstack/keystone.21:37
cmurphyif it's that one ^ then we just need to backport21:37
smcginniscmurphy: Yep, that looks like what you'd need.21:37
ianwfungi: yeah ... also digging into the logs, there's not much email traffic out, but it might just be @redhat.com that's being too picky21:37
fungiianw: up side, if it's just greylisting, is that usually subsides as popular domains get used to receiving e-mail from it21:38
smcginnislbragstad, cmurphy: Only tricky thing I've run in to backporting these is the differences in global-requirements with the stable branch.21:38
cmurphysmcginnis: oh hrm21:38
corvusianw: is it putting translate-dev01.openstack.org in the envelope sender address, or translate-dev?21:38
fungiooh, excellent point21:38
*** sbra has quit IRC21:38
smcginniscmurphy, lbragstad: But actually, doesn't look like that patch was complete. Should switch s/python setup.py build_sphinx/sphinx-build .../. But I don't think that's necessary as far as fixing pike.21:39
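    For reference, a minimal sketch of the shape of that fix under the docs PTI, assuming
    the standard doc/requirements.txt layout: the libraries autodoc needs move out of the
    setup.cfg extras into the docs requirements, and the job calls sphinx-build directly
    instead of "python setup.py build_sphinx". The package list and pins below are
    illustrative, not the exact contents of the keystone patches.

        # doc/requirements.txt (illustrative entries; real pins come from constraints)
        sphinx>=1.6.2
        openstackdocstheme>=1.17.0
        python-ldap>=3.0.0
        ldappool>=2.0.0

        # docs build invocation per the PTI, replacing "python setup.py build_sphinx"
        sphinx-build -W -b html doc/source doc/build/html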
ianwcorvus: hmm, possibly "noreply@openstack.org" http://paste.openstack.org/show/643127/21:40
lbragstadsmcginnis: ack - let me get a backport proposed quick21:40
fungiianw: if the message is still in exim's queue, you should be able to find the message (body and headers) in /var/spool/exim4/input/21:40
ianwfungi: yep, that's what i just did :)  anyway, let's see if it's an issue with the real server & people not getting their auth emails and then dig into it21:41
*** aeng has joined #openstack-infra21:41
fungiand yeah, looking at it i agree it seems to be using noreply@openstack.org for sender21:42
fungiwhich at least matches the from header21:42
*** smatzek has quit IRC21:42
ianwthere are some interesting options.  i wonder if i want to turn my "maximum gravatar rating shown" up to "X"21:42
fungiit doesn't go to 11?21:43
fungier, i mean, xi?21:43
corvusnoreply@openstack.org does not accept bounce messages, so it's a bad choice for an envelope sender.  that's a legit greylist trigger.21:43
*** jtomasek has quit IRC21:43
*** sree has joined #openstack-infra21:43
fungiyep, we usually configure an alias of the infra-root address as sender for other systems21:44
*** olaph1 has quit IRC21:44
*** olaph has joined #openstack-infra21:45
corvushttp://paste.openstack.org/show/643135/21:45
*** eharney has joined #openstack-infra21:46
*** rcernin has joined #openstack-infra21:46
fungiright, not gonna get accepted by any mta which enforces sender verification callbacks21:46
ianwi wonder where it comes from.  not immediately obvious from http://codesearch.openstack.org/?q=noreply%40openstack.org&i=nope&files=&repos=21:47
*** sree has quit IRC21:48
ianwwell the server gui configuration has a "from email address" durr21:48
fungii was gonna say, may be manually configured21:48
*** eharney has quit IRC21:48
ianwi just dropped the db for it to repopulate from scratch though21:49
corvusit's fine as a from header, just not great as an envelope sender21:49
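    The distinction corvus is making maps directly onto the SMTP API: the From: header
    lives inside the message, while the envelope sender (MAIL FROM) is passed separately
    and is what greylisting and sender-verification callouts act on. A small illustrative
    Python sketch with made-up addresses; the real fix is pointing the envelope sender at
    an alias that actually accepts bounces.

        import smtplib
        from email.message import EmailMessage

        msg = EmailMessage()
        msg["From"] = "noreply@openstack.org"   # header recipients see; fine to keep
        msg["To"] = "user@example.com"
        msg["Subject"] = "translate-dev account confirmation"
        msg.set_content("Please confirm your account.")

        with smtplib.SMTP("localhost") as smtp:
            # from_addr sets the envelope sender (MAIL FROM). Receiving MTAs run
            # greylisting and sender verification against this address, so it
            # should be a mailbox that accepts bounces, e.g. an infra-root alias.
            smtp.send_message(msg, from_addr="infra-root@example.org")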
lbragstadsmcginnis: not sure if i'm on the right track here https://review.openstack.org/#/c/532984/21:49
lbragstadbut there was a merge conflict with the test-requirements.txt file21:49
fungiianw: maybe it doesn't store that in the remote db?21:50
lbragstadand i'm not quite sure if what i proposed breaks process or not (since i had to change requirement versions because of the conflict)21:50
ianwfungi: yeah, i guess.  thanks, i think it's some good info if it's a problem with other people not getting emails, especially on the live server21:52
smcginnislbragstad: No, that looks right to me. Let's see how it goes in the gate.21:52
*** jkilpatr has joined #openstack-infra21:52
ianwclarkb / ianychoi / eumel8 : i'll drop some notes in the change, but i think translate-dev should be back with the fresh db, i can log in and poke around at least.  let me know if issues21:53
lbragstadsmcginnis: cool - thanks for the sanity check21:53
lbragstadcmurphy: fyi ^21:53
lbragstadsmcginnis: fwiw - if that does fix things, you might have to kick that one through (keystone doesn't have two active stable cores)21:54
*** slaweq has quit IRC21:55
*** slaweq has joined #openstack-infra21:56
*** e0ne has quit IRC21:57
openstackgerritDavid Moreau Simard proposed openstack-infra/puppet-zuul master: Periodically retrieve and back up Zuul's status.json  https://review.openstack.org/53295521:57
*** threestrands has joined #openstack-infra21:58
*** threestrands has quit IRC21:58
*** threestrands has joined #openstack-infra21:58
*** hamzy has quit IRC21:58
*** threestrands has quit IRC21:59
*** threestrands has joined #openstack-infra21:59
*** threestrands has quit IRC22:00
smcginnislbragstad: Really?22:00
smcginnisOh, stable. Sure, I will +2 once things pass.22:00
*** threestrands has joined #openstack-infra22:01
*** jascott1 has joined #openstack-infra22:02
*** threestrands has quit IRC22:02
*** gyee has joined #openstack-infra22:02
*** threestrands has joined #openstack-infra22:02
*** Apoorva has joined #openstack-infra22:03
*** sree has joined #openstack-infra22:06
*** tpsilva has quit IRC22:09
*** nicolasbock has quit IRC22:10
*** sree has quit IRC22:10
smcginnislbragstad: Looked up the stable/pike g-r values.22:11
clarkbnot really back yet but looks like nodepool changes did merge, next step is restarting the launchers or is that done?22:16
*** dave-mccowan has quit IRC22:16
*** dave-mccowan has joined #openstack-infra22:17
*** olaph has quit IRC22:17
*** Goneri has quit IRC22:20
*** Apoorva_ has joined #openstack-infra22:23
*** hashar has quit IRC22:24
*** dave-mccowan has quit IRC22:25
*** Apoorva has quit IRC22:26
corvusclarkb: i don't think it's been done22:27
*** Apoorva_ has quit IRC22:30
*** Apoorva has joined #openstack-infra22:31
*** dprince has quit IRC22:31
*** sree has joined #openstack-infra22:32
*** bobh has quit IRC22:32
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Support cross-source dependencies  https://review.openstack.org/53080622:35
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: WIP: Add cross-source tests  https://review.openstack.org/53269922:35
*** sree has quit IRC22:36
*** esberglu has quit IRC22:36
*** olaph has joined #openstack-infra22:38
*** sree has joined #openstack-infra22:38
*** kjackal has quit IRC22:39
*** kjackal has joined #openstack-infra22:40
*** sree has quit IRC22:43
openstackgerritJames E. Blair proposed openstack-infra/zuul feature/zuulv3: Delete stale jobdirs at startup  https://review.openstack.org/53151022:44
*** slaweq has quit IRC22:46
*** jbadiapa has quit IRC22:47
*** markvoelker has quit IRC22:49
ianwcorvus: one thing i'd note about deleting job dirs is that on the cinder-mounted volumes it actually takes quite a while22:49
*** markvoelker has joined #openstack-infra22:49
ianwwhen i restarted the executors the other day i cleared out the old stuff, and it was like upwards of 10 minutes22:50
ianwit made me think maybe a cron job that removes X day old dirs might work22:50
corvusianw: well, if we need that, we should build it into zuul.22:50
corvustrouble is, it's hard to tell if the admin intended to keep the dir22:51
corvusso if we do that, we'd need to drop a flag in the dirs indicating they were 'kept' and not have the 'cron' delete them22:51
*** niska has quit IRC22:51
*** ruhe has quit IRC22:51
ianwoh, from the keep var in config you mean?22:52
*** rlandy is now known as rlandy|bbl22:52
*** niska has joined #openstack-infra22:52
corvusianw: it's a run-time toggle, so can change at any time22:52
*** ruhe has joined #openstack-infra22:52
ianwyeah, a stamp-file in the directory might be a good interface22:53
*** Jeffrey4l has quit IRC22:53
*** markvoelker has quit IRC22:54
*** andreas_s has joined #openstack-infra22:54
*** sree has joined #openstack-infra22:54
*** bandini has quit IRC22:54
*** edmondsw has quit IRC22:54
openstackgerritJames E. Blair proposed openstack-infra/system-config master: Add zuul mailing lists  https://review.openstack.org/53300622:54
ianwit's probably fine to clear out on start, but at least in our case, with the slower remote storage for dirs, having the slate almost clean when it starts would be helpful22:54
*** wolverineav has quit IRC22:55
ianwespecially because you're usually restarting them in a pressure situation :)22:55
*** edmondsw has joined #openstack-infra22:55
*** abelur_ has joined #openstack-infra22:55
*** vsaienk0 has joined #openstack-infra22:55
*** felipemonteiro_ has quit IRC22:55
*** wolverineav has joined #openstack-infra22:55
corvusyep.  i think both things are incremental improvements.22:55
ianwi'll add it to my todo list :)22:55
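    A rough sketch of what that todo item could look like: a periodic sweep that removes
    old build directories unless they carry a stamp file marking them as deliberately
    kept. The ".zuul-keep" stamp name and the builds path are assumptions for
    illustration, not Zuul's actual on-disk layout.

        import os
        import shutil
        import time

        BUILD_ROOT = "/var/lib/zuul/builds"   # assumed location for the sketch
        KEEP_STAMP = ".zuul-keep"             # hypothetical marker for kept dirs
        MAX_AGE_DAYS = 2

        def prune_stale_jobdirs(root=BUILD_ROOT, max_age_days=MAX_AGE_DAYS):
            cutoff = time.time() - max_age_days * 86400
            for name in os.listdir(root):
                path = os.path.join(root, name)
                if not os.path.isdir(path):
                    continue
                if os.path.exists(os.path.join(path, KEEP_STAMP)):
                    # The keep toggle was on when this build ran; leave it alone.
                    continue
                if os.path.getmtime(path) < cutoff:
                    shutil.rmtree(path, ignore_errors=True)

        if __name__ == "__main__":
            prune_stale_jobdirs()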
*** Jeffrey4l has joined #openstack-infra22:56
*** bandini has joined #openstack-infra22:56
*** markvoelker has joined #openstack-infra22:56
clarkbok back with new glasses. I can see again. Except now everything looks funny22:58
*** sree has quit IRC22:58
clarkbcorvus: nl02 is the launcher for chocolate and has the new nodepool code installed. thoughts on restarting that one now?22:58
*** edmondsw has quit IRC22:59
*** markvoelker has quit IRC22:59
*** wolverineav has quit IRC22:59
corvusclarkb: i'm around for a bit longer and can help with probs, i say go for it23:00
*** andreas_s has quit IRC23:00
clarkbcorvus: ok restart it now then23:00
*** markvoelker has joined #openstack-infra23:00
clarkbit is running again and appears to be acting normally (declining requests that are at quota, and I just saw a node go ready)23:03
clarkbcorvus: Shrews what is the request handler behavior when all clouds are at quota? I seem to recall reading that code at some point but forget the behavior23:03
*** vsaienk0 has quit IRC23:04
*** markvoelker has quit IRC23:05
corvusclarkb: every provider will grab one more request than it can handle and then block on that request, not accepting any more until it completes.23:05
*** sree has joined #openstack-infra23:05
corvus(strictly speaking, when a single provider is at quota, it grabs one more request and blocks.  providers act independently, so if they are all at quota, they simply all do that)23:06
*** jascott1 has quit IRC23:06
clarkbcorvus: "Pausing request handling to satisfy request" appears to be the logged message for that?23:07
*** jascott1 has joined #openstack-infra23:07
corvusclarkb: i believe so23:07
corvusclarkb: once it gets enough available quota, it should 'unpause', finish that request, and as long as we are backlogged, grab another one and go right back to being paused.23:08
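    What corvus describes can be illustrated with a toy provider loop: the worker keeps
    accepting requests until one would exceed quota, then holds exactly that request and
    stops polling for new ones until capacity frees up. Class and method names here are
    invented for the sketch; this is not nodepool's implementation.

        import collections

        Request = collections.namedtuple("Request", "nodes")

        class ProviderWorker:
            """Toy model of one provider pool worker."""

            def __init__(self, name, quota):
                self.name = name
                self.quota = quota
                self.in_use = 0
                self.paused_request = None

            def has_capacity(self, request):
                return self.in_use + request.nodes <= self.quota

            def poll(self, queue):
                if self.paused_request is not None:
                    # Paused: only retry the held request, never grab new ones.
                    if self.has_capacity(self.paused_request):
                        self.launch(self.paused_request)
                        self.paused_request = None
                    return
                if queue:
                    request = queue.popleft()
                    if self.has_capacity(request):
                        self.launch(request)
                    else:
                        # At quota: hold this one request and pause.
                        self.paused_request = request

            def launch(self, request):
                self.in_use += request.nodes

        if __name__ == "__main__":
            pending = collections.deque(Request(1) for _ in range(5))
            worker = ProviderWorker("example-provider", quota=3)
            for tick in range(8):
                worker.poll(pending)
                state = "paused" if worker.paused_request else "active"
                print(f"tick {tick}: in_use={worker.in_use} {state}")
                if tick == 5:
                    worker.in_use -= 1  # pretend a node was deleted, freeing quota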
*** jascott1 has quit IRC23:09
*** jascott1 has joined #openstack-infra23:09
*** sree has quit IRC23:10
clarkbcorvus: 2018-01-11 23:01:28,873 DEBUG nodepool.driver.openstack.OpenStackNodeRequestHandler[nl02-7970-PoolWorker.citycloud-la1-main]: Declining node request 100-0002001054 because it would exceed quota lots of messages like that which I would expect wouldn't happen if the handler was paused?23:11
*** jascott1 has quit IRC23:11
openstackgerritMerged openstack-infra/system-config master: Add note on how to talk to zuul's gearman  https://review.openstack.org/53152223:11
*** jascott1 has joined #openstack-infra23:11
*** jascott1 has quit IRC23:12
*** jascott1 has joined #openstack-infra23:13
*** jbadiapa has joined #openstack-infra23:14
*** jtomasek has joined #openstack-infra23:15
openstackgerritMerged openstack-infra/puppet-lodgeit master: Systemd: start lodgeit after network  https://review.openstack.org/52772923:16
corvusclarkb: that should only happen if it's impossible for the provider to ever satisfy that (ie, a request for 10 nodes in a provider where our absolute limit is 5)23:16
clarkbhuh I wonder if someone is requesting 50 nodes for a job23:17
*** jascott1 has quit IRC23:17
clarkb(that could also explain why we are at quota so much)23:17
corvusclarkb: the example you cited was 1 node23:18
clarkbcorvus: oh wait citycloud has a couple regions that are turned off23:19
clarkband la1 is one of them23:19
clarkbok that explains it23:19
corvuslooks like it was also declined by rh1-main23:20
corvuser rh123:20
clarkbcorvus: have you seen anything re nl02 that would indicate we shouldn't restart nl01 as well? I haven't yet23:20
*** jtomasek has quit IRC23:21
corvusclarkb: nope,  i say go.23:22
clarkbdone23:24
*** jascott1 has joined #openstack-infra23:29
clarkbShrews: corvus: related to the max-servers: 0 short circuit from earlier today, we may want to avoid doing all the extra logging and pause the handler entirely (eg stop polling for requests)23:30
*** jbadiapa has quit IRC23:32
clarkbcorvus: do you want to review https://review.openstack.org/#/c/523951/ I think it is ready and will allow us to merge zuulv3 branches into master23:33
corvusclarkb: when a provider is paused, it should stop polling for new requests23:36
corvusclarkb: i'll take a look23:36
corvus+3 and reviewed children as well23:40
clarkbcool23:40
*** kgiusti has left #openstack-infra23:41
*** smatzek has joined #openstack-infra23:43
*** sree has joined #openstack-infra23:43
openstackgerritClark Boylan proposed openstack-infra/project-config master: Set infracloud chocolate to max-servers: 0  https://review.openstack.org/53301223:45
clarkb^^ is not a rush, don't think I will have time to do controller00 patching reboots today but will try for tomorrow23:45
*** smatzek has quit IRC23:47
*** sree has quit IRC23:47
*** hongbin has quit IRC23:53
*** armaan has quit IRC23:53
clarkbzxiiro: is https://review.openstack.org/#/c/194497/ still valid?23:54
clarkbcorvus: I think https://review.openstack.org/#/c/163637/ ended up being replaced by the zuulv3 spec?23:55
corvusclarkb: yeah, i think it covered everything there except the ip whitelist.23:57
zxiiroclarkb: not sure. I can bring it up on our meeting tomorrow though23:57
clarkbzxiiro: thanks23:57
