Tuesday, 2016-11-29

*** willthames has joined #zuul		00:14
ianw	jamielennox / pabelanger: ok, i think i came up with a solution, which is activating the virtualenv, if it exists, in the builder. patch in a bit :)	00:29
*** Shuo has quit IRC		01:04
jeblair	ianw: maybe i misunderstood, but nodepool should not internally know anything about virtualenvs	01:20
ianw	jeblair: well, there's that too. the other option is to say "you run nodepool from within an activated virtualenv (i.e. /path/to/venv/bin is in the $PATH) or it doesn't work"	01:21
jeblair	ianw: i think that's one of many options :)	01:22
jeblair	(others include: add a config file option for the dib binary, symlink, set PATH in startup script, update-alternatives, and i'm sure i'm missing more)	01:23
ianw	config file for the dib binary doesn't really work, because dib just hits the same problem	01:23
jeblair	oh good you -1d that with that comment, thanks	01:24
jeblair	ianw: that combined with greghaynes message suggests that 'delay' may be a good way forward :)	01:27
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966	01:28
ianw	jeblair: ^ with usual bonus ianw too verbose commit msg. lightly tested (i.e. not)	01:29
ianw	but that's the gist	01:29
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966	01:36
openstackgerrit	Merged openstack-infra/nodepool: Use database for non-zookeeper commands https://review.openstack.org/403869	01:49
pabelanger	ianw: jamielennox: I've done what SpamapS suggested about /bin/bash -c "source ..." in systemd, which makes diskimage-builder happy. Going to update my local env to do that now, see how it works long term	02:07
*** willthames has quit IRC		02:26
*** willthames has joined #zuul		02:46
ianw	pabelanger: yep, that'll work too	03:01
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966	03:06
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966	03:09
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966	03:20
*** hashar has joined #zuul		05:01
openstackgerrit	Jamie Lennox proposed openstack-infra/zuul: Re-enable requirement-older-than tests https://review.openstack.org/401027	05:32
openstackgerrit	Jamie Lennox proposed openstack-infra/zuul: Re-enable requirement-username tests https://review.openstack.org/401028	05:32
openstackgerrit	Jamie Lennox proposed openstack-infra/zuul: Re-enable requirement-email tests https://review.openstack.org/401029	05:32
openstackgerrit	Jamie Lennox proposed openstack-infra/zuul: Re-enable requirement-newer-than tests https://review.openstack.org/401019	05:32
*** saneax-_-\|AFK is now known as saneax		06:42
ianw	pabelanger / jeblair: run out of time to look further, but i'm not convinced nodepool tests on master are stable. locally i get weird hangs and exceptions, and results are never the same twice on https://review.openstack.org/#/c/403966/	07:04
openstackgerrit	Cao Xuan Hoang proposed openstack-infra/nodepool: Clean imports in code https://review.openstack.org/404059	07:26
*** abregman has joined #zuul		08:02
*** willthames has quit IRC		08:22
*** yolanda has quit IRC		08:31
*** yolanda has joined #zuul		08:43
*** Cibo_ has joined #zuul		09:27
*** abregman has quit IRC		09:35
*** abregman has joined #zuul		09:35
*** Cibo_ has quit IRC		09:40
*** openstackgerrit has quit IRC		09:48
*** openstackgerrit has joined #zuul		09:49
*** yolanda has quit IRC		09:55
*** yolanda has joined #zuul		09:57
*** markmcd has quit IRC		10:40
*** abregman has quit IRC		11:03
*** abregman has joined #zuul		12:06
*** bhavik1 has joined #zuul		13:12
*** abregman is now known as abregman\|afk		14:13
*** bhavik1 has quit IRC		14:14
*** abregman\|afk is now known as abregman		14:17
pabelanger	morning	14:25
pabelanger	continuing work on nb01.o.o	14:25
mordred	pabelanger: let me know if there's anything I can do to help	14:40
pabelanger	mordred: ack!	14:42
pabelanger	just waiting for ansible wheel to run	14:43
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool: Test rotation of builds in nodepool-builder https://review.openstack.org/400004	14:45
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool: Test rotation of builds in nodepool-builder https://review.openstack.org/400004	14:52
*** abregman is now known as abregman\|afk		14:54
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: Re-enable test: test_list_nodes https://review.openstack.org/400955	15:04
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: Re-enable test: test_image_upload_fail https://review.openstack.org/400970	15:04
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: Fix nits in test_image_upload_fail https://review.openstack.org/404257	15:04
Shrews	pabelanger: mordred: Those 1st two were rebased to avoid the alien test that was -1'd. They were already +3'd, so if you could push them through...	15:05
mordred	Shrews: done	15:06
Shrews	muchas gracias	15:06
Shrews	clarkb: 257 fixes your nits from another review	15:08
pabelanger	\o/	15:20
pabelanger	http://nb01.openstack.org/	15:20
pabelanger	builds are started	15:20
pabelanger	will take about 1 hours to get through first image	15:22
Shrews	mordred: jeblair: clarkb: regarding the metadata comment in https://review.openstack.org/#/c/400836/2/nodepool/cmd/nodepoolcmd.py, if i'm not supposed to used 'snapshot' as the key, what should I be using?	15:22
Shrews	err, snapshot as the value	15:23
*** saneax is now known as saneax-_-\|AFK		15:23
Shrews	basically, what metadata tells me it is a nodepool managed image?	15:23
*** abregman\|afk is now known as abregman		15:25
pabelanger	blarg	15:30
pabelanger	mordred: jeblair: clarkb: any objections for adding an elastic-recheck for nodepool about our database drop issue?	15:31
openstackgerrit	Merged openstack-infra/nodepool: Re-enable test: test_list_nodes https://review.openstack.org/400955	15:32
Shrews	This is an interesting failure: http://logs.openstack.org/57/404257/1/check/gate-nodepool-python27-ubuntu-xenial/c7cca8e/testr_results.html.gz	15:37
Shrews	We have two upload threads uploading the same image. We have checks to guard against that. Not sure what's failing	15:38
Shrews	i'll need to add some logging stuff to debug	15:42
mordred	Shrews: oh yay	15:57
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading https://review.openstack.org/404292	15:57
clarkb	pabelanger: nope	16:00
clarkb	pabelanger: that is what it is there for	16:00
pabelanger	kewl	16:00
pabelanger	just uploaded it and pinged you in openstack-infra	16:00
Shrews	mordred: yeah. i'm hoping that zk node locking actually works like we expect it to	16:00
Shrews	i mean, we do test locking, too	16:01
pabelanger	blarg	16:01
pabelanger	DIB element strikes again	16:01
pabelanger	2016-11-29 15:54:33,234 INFO nodepool.image.build.fedora-23: + die 'Can not find Zuul public key!'	16:01
mordred	Shrews: well, if it doesn't, this is definitely the right time to learn that	16:02
clarkb	Shrews: I think shade sets metdata properties itself but also nodepool may set some too	16:03
mordred	both set things	16:04
mordred	shade sets the md5/sha256 checksums	16:04
mordred	those are both prefixed with owner_specified.shade	16:05
mordred	then nodepool should be setting image_type	16:07
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Remove unused function make_image_dict https://review.openstack.org/404306	16:11
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Move to using properties instead of metadata https://review.openstack.org/404307	16:11
mordred	Shrews, clarkb: nodepool sends provider_image.meta as the extra metada on an image	16:13
mordred	Shrews, clarkb: but that's all user defined - there is currently no key that nodepool adds to an image that would let us know it's nodepool managed	16:17
clarkb	I think if we had onethat was "not cloud managed" that would be sufficient (since that was what the old code was doing)	16:19
clarkb	but I am not sure if glance differentiates	16:19
mordred	the old code was not doing that	16:19
mordred	actually	16:19
clarkb	mordred: it was	16:19
clarkb	it was only looking at snapshot images	16:19
mordred	only for snapshot images	16:19
clarkb	yes, it was broken but that was the intent	16:19
mordred	yah	16:19
mordred	well, it's a super easy fix	16:19
clarkb	basically we want to keep the cloud images out of the alien list	16:21
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Flag uploaded images as being managed by nodepool https://review.openstack.org/404322	16:21
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading https://review.openstack.org/404292	16:22
mordred	clarkb: the logic in that patch in alien list is still wrong - one sec	16:22
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Flag uploaded images as being managed by nodepool https://review.openstack.org/404322	16:23
mordred	there. that should do it	16:23
*** bhavik1 has joined #zuul		16:24
Shrews	mordred: we should just add all of that to my test re-enable of alien-image-list	16:24
mordred	Shrews: ++ - lemme do that real quick	16:24
jeblair	mordred, clarkb, Shrews: how about we mirror the nodepool node metadata structure for images?	16:27
jeblair	http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/provider_manager.py?h=feature/zuulv3#n211	16:29
openstackgerrit	Merged openstack-infra/nodepool: Re-enable test: test_image_upload_fail https://review.openstack.org/400970	16:30
jeblair	mordred: though istr you wanted to de-json that and split it into multiple keys (because there is a length limit for values (which we are under) though of course there is a (provider-specified <sigh>) limit on the number of keys)	16:31
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836	16:32
jeblair	mordred: anyway, i think the corresponding information we'd want to have there would be build number and upload number (provider and image name are obviously implied)	16:32
mordred	jeblair: ++	16:32
clarkb	jeblair: right its 255 bytes per key value or something	16:33
mordred	jeblair: and yes, I do want to de-jsonify it - and yes ot sigh on number of keys	16:33
jeblair	mordred: ok. i left a review on your abandoned change https://review.openstack.org/404322 :)	16:34
mordred	jeblair: I'mma do that in a follow up patch for now just so I don't keep stepping on Shrews' patch	16:35
mordred	we can squash if we want once we're happy	16:35
jeblair	squash or simulprove is ok	16:36
pabelanger	current dib-image-list on nb01.o.o: http://paste.openstack.org/show/590854/	16:36
pabelanger	going to see why we are not cleaning up the building state of DIBs	16:37
pabelanger	otherwise, going to be a long list	16:37
jeblair	pabelanger: it's doing things!	16:37
pabelanger	it is!	16:37
pabelanger	https://review.openstack.org/#/q/topic:zuulv3-builder	16:37
pabelanger	we need to land ^	16:37
Shrews	jeblair: i'm about to post a change to include the ZnodeStat structure in the build and upload objects. I'm concerned that node versioning might be biting us (though I haven't confirmed that) and that info could be useful for debugging.	16:37
pabelanger	to fix ssh key issues	16:37
jeblair	Shrews: in the repr you mean?	16:37
Shrews	jeblair: yes	16:38
jeblair	cool	16:38
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: Add znode stat structure to build/upload objects https://review.openstack.org/404292	16:39
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading https://review.openstack.org/404335	16:39
jeblair	Shrews: i'm catching up on the failure from earlier so i hope to be useful on this soon	16:40
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Add build and upload information to image metadata https://review.openstack.org/404337	16:41
Shrews	it's weird that upload thing happened twice on back-to-back reviews. Now I've had 2 rechecks and 2 test reviews that haven't reproduced it	16:41
mordred	Shrews, jeblair: ^^ it doesn't seem that we have access to upload id yet when we upload the image - am I reading that wrong?	16:44
Shrews	mordred: correct. won't have it until storeImageUpload is called	16:47
mordred	k. so - jeblair, we can easily add build_id to the images, but not upload_id unless we added an additional api call to update the image metadata	16:47
Shrews	mordred: the caller knows it	16:48
mordred	I'm not sure it's worth an extra API call	16:48
mordred	Shrews: it does?	16:48
Shrews	mordred: yes, just not passed to _uploadImage()	16:48
Shrews	mordred: upnum	16:48
mordred	Shrews: oh - awesome. thanks!	16:48
mordred	jeblair: ignore me	16:48
Shrews	mordred: also, as far as stepping on toes, the only thing left is to add the non-empty alien image test case, which I'd actually prefer to do in a separate review.	16:49
jeblair	Shrews: i'm looping that failed test locally, no luck so far	16:50
Shrews	jeblair: yeah, i have not been able to reproduce it locally either	16:50
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Add build and upload information to image metadata https://review.openstack.org/404337	16:52
mordred	jeblair: ok - I think that's ^^ more inline with things	16:52
Shrews	jeblair: looking at http://logs.openstack.org/57/404257/1/check/gate-nodepool-python27-ubuntu-xenial/c7cca8e/testr_results.html.gz at the failure, that zk LOST message between uploads is a bit concerning. I do not know if that's normal	16:53
Shrews	it may be nothing	16:54
jeblair	pabelanger: looking at 404309, i think i spotted a problem -- but also, in openstackci::builder we use both nodepool and nodepool::builder -- so should we just add the key as a param to openstackci as in 404317 and then pass it to nodepool instead of nodepool::builder?	16:55
mordred	Shrews: ok. I'll just squash onto your patch then	16:55
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836	16:56
*** hashar is now known as hasharAWay		16:57
pabelanger	jeblair: 404309, I can move .ssh file back up to the original line, no problem	16:58
jeblair	pabelanger: oh, i see, i didn't see the move, i understand that change better now	17:00
pabelanger	jeblair: trying to understand your follow up comment for 404317, do you mind rephrasing it?	17:00
jeblair	pabelanger: yeah, so don't change 404309 just yet	17:00
pabelanger	k	17:00
jeblair	pabelanger: what i meant was, i think you could abandon 404309 and then in 404317, add the public key to the instantiation of class ::nodepool	17:01
jeblair	pabelanger: i think the way to decide between the two approaches is to say whether the public key is something that should be common to all nodepool nodes, or something that is builder-specific	17:03
jeblair	pabelanger: i guess it's builder-specific? and the private key will eventually be launcher-specific?	17:04
jeblair	pabelanger: so maybe your approach is best? :)	17:04
pabelanger	jeblair: well, in fact, I don't actually like what we are doing with the public key, its specific how we build the zuul user with DIB elements today. So, I think you might be right, abandon 404309 for now, then add the logic into 404317 or system-config	17:05
jeblair	pabelanger: or maybe openstackci	17:06
jeblair	pabelanger: at any rate, i +2d your existing changes :)	17:07
pabelanger	clarkb: any thoughts on ^ +topic:zuulv3-builder	17:07
pabelanger	jeblair: thanks	17:07
jeblair	Shrews: i think what we're seeing is the command starting up and performing its ZK connect and disconnect. which is probably telling us that an upload worker is deciding to upload after we delete the upload.	17:08
clarkb	pabelanger: sorry caffeinating	17:09
clarkb	will pull up ina few	17:09
*** abregman has quit IRC		17:09
Shrews	jeblair: yeah, i think so	17:09
Shrews	jeblair: oh, i grok now. so things are working correctly, but the 2nd upload happens too quickly?	17:12
jeblair	Shrews: yeah, if that's correct, then i think it's normal behavior, i think we just need to alter the test to stop trying to upload images after we delete.	17:12
jeblair	Shrews: or alter it to wait for that specific upload's deletion	17:13
jeblair	which might be the easier and better thing	17:14
Shrews	jeblair: so make waitForImageDeletion take an optional upload ID might be the ticket	17:16
jeblair	Shrews: i think if you throw a sleep between nodepoolcmd.main() and waitForImageDeletion, it repros. and yeah, i think that sounds like a good solution	17:17
Shrews	ack	17:17
clarkb	jeblair: Shrews in the past we have tried to be specific and not turn off the engine once a state is reached	17:17
jeblair	clarkb: so that's another vote in favor if optional upload id?	17:18
jeblair	Shrews: are you writing the fix?	17:18
Shrews	jeblair: on it	17:18
clarkb	jeblair: ya.	17:18
clarkb	pabelanger: ok stack reviewed, need to add an extra check on the first change but otherwise lgtm	17:19
jeblair	mordred, rcarrillocruz: i'm going to set the topic of the devstack ansible changes to zuulv3	17:24
jeblair	it was just the callback change that was missing it, now they all show up	17:25
rcarrillocruz	++ thx	17:26
mordred	jeblair: woot	17:26
jeblair	mordred: 404295 seems to output all the gathered facts	17:29
jeblair	mordred: http://logs.openstack.org/95/404295/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/5aaf721/logs/devstack-gate-setup-host.txt.gz#_2016-11-29_16_02_47_661929	17:30
mordred	jeblair: yup. was just asking folks about that in infra	17:30
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: Fix test_image_delete https://review.openstack.org/404356	17:31
Shrews	jeblair: clarkb: ^^^	17:31
jeblair	+2	17:32
pabelanger	jeblair: mind +3 on 404309 update to address clarkb comments	17:49
jeblair	pabelanger: done	17:50
pabelanger	danke	17:51
*** morgan_ is now known as morganfainberg		18:03
*** morganfainberg is now known as morgan		18:04
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836	18:47
Shrews	mordred: fixed the merge conflict ^^^	18:48
openstackgerrit	Merged openstack-infra/nodepool: Remove unused function make_image_dict https://review.openstack.org/404306	18:48
*** bhavik1 has quit IRC		18:51
openstackgerrit	Merged openstack-infra/nodepool: Fix nits in test_image_upload_fail https://review.openstack.org/404257	18:56
mordred	Shrews: yay!	19:07
*** abregman has joined #zuul		19:08
mordred	clarkb: I did not +A https://review.openstack.org/400836 since I wrote code in it	19:08
*** abregman has quit IRC		19:09
clarkb	mordred: and I just -1'd it >_>	19:09
clarkb	its still listing the cloud provided images in the alien list	19:09
*** abregman has joined #zuul		19:10
mordred	clarkb: oh - that's an easy fix	19:11
Shrews	mordred: you fixing that or want me to?	19:13
mordred	Shrews: I got it	19:14
mordred	Shrews: just going to filter the images when we get the provider_image list	19:14
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836	19:14
mordred	Shrews, clarkb: ^^	19:14
clarkb	thanks will rereview post meeting	19:15
clarkb	(head down in code review is how I missed the start of the meeting so will avoid it until done)	19:15
mordred	clarkb: yah. same here :)	19:15
pabelanger	ugh	20:14
pabelanger	another DIB fail	20:14
pabelanger	jeblair: Shrews: What do you think about us doing a merge master to feature/zuulv3?	20:14
pabelanger	I just ran into a bug	20:14
pabelanger	which we fixed in master	20:14
pabelanger	but are missing in feature/zuulv3	20:15
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool: Stop running DIB usage report https://review.openstack.org/404409	20:17
pabelanger	jeblair: clarkb: mordred: ^ is the fix we need in feature/zuulv3	20:17
pabelanger	Did a cherry-pick	20:17
pabelanger	or we can merge master	20:17
clarkb	+2	20:19
clarkb	though if we think we are this close to really running the builder with zk may be a good time to work on a merge too	20:19
pabelanger	things are running now	20:20
pabelanger	merge would help fix issues like ^	20:20
jeblair	pabelanger, clarkb: https://review.openstack.org/400536	20:23
jeblair	that needs a refresh anyway, so i will +3 the cherry-pick	20:24
clarkb	ok	20:24
jeblair	clarkb, pabelanger: jhesketh did request a review on that despite the test failure	20:25
pabelanger	jeblair: sure, will look today	20:25
*** hasharAWay is now known as hashar		20:31
*** abregman has quit IRC		20:53
openstackgerrit	Merged openstack-infra/nodepool: Stop running DIB usage report https://review.openstack.org/404409	20:59
openstackgerrit	Monty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836	21:16
mordred	Shrews, clarkb: whoops. ^^ I had broked things	21:18
pabelanger	oh snap	21:19
pabelanger	\| fedora-23-0000000004 \| fedora-23 \| nb01 \| qcow2,vhd,raw \| ready \| 00:00:00:18 \|	21:19
pabelanger	http://paste.openstack.org/show/590895/	21:19
pabelanger	nodepool image-list	21:19
mordred	pabelanger: woot!	21:20
mordred	pabelanger: that's super exciting	21:20
mordred	pabelanger: I just +A'd two of your patches - but gerrit claims they conflict. we'll just have to see	21:21
pabelanger	mordred: woot indeed	21:21
mordred	pabelanger: so - we really do need to get jhesketh's merge master patch sorted out :)	21:22
pabelanger	mordred: agreed, I think that will help with some yak shaving	21:22
mordred	mmm. shaving yaks	21:22
mordred	pabelanger: there is a fellow outside of vancouver who has a herd of tibetan yaks for sale	21:22
pabelanger	ha, nice	21:23
openstackgerrit	Merged openstack-infra/nodepool: Fix test_image_delete https://review.openstack.org/404356	21:23
openstackgerrit	Merged openstack-infra/nodepool: Add pause support for diskimages https://review.openstack.org/403780	21:23
pabelanger	Also need to land our nodepool patch to use checksum files, to avoid duplicate checksums	21:24
mordred	pabelanger: seems like we should port that patch to v3 at this point	21:24
clarkb	now to see how reliable uploads will be for us	21:24
clarkb	pabelanger: mordred maybe just get that from merge?	21:25
clarkb	since its a performamce noy functional thing	21:25
mordred	clarkb: well, it hasn't landed in either branch	21:25
mordred	so I was mostly thinking we should not land it in master then deal with it on merge since we're so close to re-merging everything	21:26
clarkb	oh I thought the first version of it was landed	21:26
mordred	if it is, I totally agree it should just be a thing on merge	21:26
*** willthames has joined #zuul		21:28
pabelanger	ya, I don't think it merged yet	21:29
pabelanger	checking	21:29
pabelanger	383519	21:30
pabelanger	I can port to zuulv3	21:30
pabelanger	mordred: clarkb: a review on https://review.openstack.org/#/c/404296/ for elastic-recheck. I believe I got the syntax right	21:32
pabelanger	for our database drop failures	21:32
Shrews	pabelanger: what does the cpu usage look like on nb01? curious as to how the small delay between schedule checks is affecting it	21:34
pabelanger	$ cat /proc/loadavg	21:35
pabelanger	4.34 4.63 4.23 7/544 387	21:35
pabelanger	right now	21:35
pabelanger	277.5 CPU for nodepool-builder ATM	21:35
pabelanger	but, think shade is still doing md5sum	21:35
mordred	we're spending 39% of a CPU in dnf?	21:36
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: [DNM] test https://review.openstack.org/404428	21:36
mordred	Shrews: it does feel like an awful lot of CPU usage given that I think there is currently one dib build going on	21:38
Shrews	mordred: the BuildWorker and UploadWorkers both only sleep 0.1s between their respective checks.	21:38
Shrews	so they are always doing something	21:38
mordred	Shrews: ah	21:39
mordred	then that CPU usage seems about right on	21:39
pabelanger	Was thinking we should make that an exponential backoff on error, up to X seconds?	21:39
pabelanger	or mins	21:39
pabelanger	as not to spam logs on error	21:40
jeblair	remote: https://review.openstack.org/404430 Add nb01.openstack.org to cacti	21:40
jeblair	pabelanger, mordred: ^	21:40
pabelanger	+3	21:40
Shrews	pabelanger: backoff on upload?	21:41
Shrews	pabelanger: not sure how to make that work since one thread would have to keep the upload lock to keep other threads from re-attempting the upload	21:41
Shrews	which i guess it could do	21:42
*** Cibo_ has joined #zuul		21:42
pabelanger	Shrews: I haven't looked into it yet, but something I was thinking about, since my logs were filling up aggressively on a mis-configuration of clouds.yaml	21:42
mordred	pabelanger: easy soluiton - don't mis-configure clouds.yaml ;)	21:43
Shrews	pabelanger: incentive to fix your clouds.yaml :)	21:43
pabelanger	indeed	21:43
Shrews	it's a feature	21:43
jeblair	i mean, it actually is a feature	21:43
jeblair	we have way more problems with uploads failing with EOPENSTACK than we have with misconfigured clouds.yamls	21:44
jeblair	which doesn't mean 'continue blowing up indefinitely' is the only answer, just that the... er... obstinancy of the builder is intentional. :)	21:46
jeblair	Shrews: and oh yeah, we did want to make the sleep longer, just need to keep it short in the check.	21:47
jeblair	(i was thinking something like 10 secs)	21:47
jeblair	er 'keep it short in the tests'	21:47
openstackgerrit	Paul Belanger proposed openstack-infra/nodepool: Use diskimage-builder checksum files https://review.openstack.org/404432	21:48
Shrews	jeblair: do what we did with cleanup_interval for the CleanupWorker thread?	21:49
pabelanger	will check dsvm test on that	21:49
jeblair	Shrews: yeah, probably so	21:49
Shrews	jeblair: k. i'll work on that	21:49
jeblair	Shrews: thanks; pabelanger: that should make things 100 times better. :)	21:50
pabelanger	100x improvement	21:50
Shrews	golly you maths good	21:50
jeblair	i learned me some math	21:51
pabelanger	wonder if we should enable shade debugs for now on nb01	21:52
pabelanger	to see what is going on	21:52
pabelanger	not seeing much traffic between nb01.o.o and clouds	21:52
pabelanger	Oh, errors	22:01
mordred	yay!	22:01
pabelanger	OpenStackCloudException: Error fetching image list: 'X509' object has no attribute '_x509'	22:01
mordred	pabelanger: whereat?	22:01
pabelanger	/var/log/nodepool/builder-debug.log	22:01
mordred	that is a bug in python requests, iirc	22:01
pabelanger	something, something, ssl?	22:02
pabelanger	off to the googles	22:02
mordred	pabelanger: https://github.com/kennethreitz/requests/issues/3701	22:03
pabelanger	PyOpenSSL seems to be what people are pointing too	22:03
pabelanger	ya	22:03
pabelanger	reading that	22:03
openstackgerrit	David Shrewsbury proposed openstack-infra/nodepool: Make build/upload worker sleep time configurable https://review.openstack.org/404438	22:03
pabelanger	python -c 'import OpenSSL; print(OpenSSL.__version__)'	22:04
pabelanger	0.13	22:04
pabelanger	looks like we need a version bump	22:04
mordred	that'll do it	22:04
mordred	did we install from disto packages or something?	22:04
pabelanger	let me see how it gets pulled in	22:04
pabelanger	checking	22:04
pabelanger	ii python-openssl 0.13-2ubuntu6 amd64 Python 2 wrapper around the OpenSSL library	22:05
pabelanger	ya	22:05
mordred	def wanna not have that and have pip installed version instead	22:06
jeblair	er, on nodepool.o.o there is no openssl	22:07
pabelanger	checking to see how it was installed	22:07
mordred	nothing seems to depend on it	22:07
Shrews	mordred: pabelanger: nb01 will also need the unreleased shade to get the image meta	22:07
Shrews	if it doesn't have it already	22:08
pabelanger	k	22:08
mordred	Shrews: I'd actually like to release shade - since that is a fairly hefty regression	22:08
Shrews	mordred: DO EET	22:08
pabelanger	/var/log/apt/history.log doesn't list python-openssl	22:09
mordred	maybe base-image ?	22:09
jeblair	pabelanger: agreed, i can't find anything either	22:09
pabelanger	mordred: ya, I wonder if that is the case	22:10
mordred	really need to get around to having our own dib-created base images that are what we run our servers on	22:10
pabelanger	++	22:10
pabelanger	soon(tm(	22:10
jeblair	let's go to -infra	22:11
Shrews	let's go drinking. night all	22:11
jeblair	Shrews: cheers!	22:12
adam_g	anyone able to clear up some confusion ive got around item.change.{refspec, branch, ref} WRT https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/launcher/client.py?h=feature/zuulv3#n28 ?	22:17
adam_g	im not sure whats meant to happen there when item.change is a Ref and not a Change	22:18
clarkb	adam_g: iirc its supposed to tell the merger to fetch the ref	22:19
clarkb	so that when the job runs the ref is available to it	22:19
clarkb	(but that behavior may have changed with v3)	22:20
adam_g	clarkb: right, but the models differ quiet a bit and it looks like the logic in there is missing some smarts in dealing with some of those differences	22:20
clarkb	adam_g: well there its just constructing the data	22:20
adam_g	i can update that to fish out (what i think are) the correct bits from each, but having not much prior knowledge here, im not sure its correct	22:20
clarkb	so either you get a change + patchset or you get an oldrev and newrev	22:20
clarkb	then things consuming the data will have to switch based on that	22:21
adam_g	clarkb: right, but things branch seem to be located in different places depending on what type of item.change comes in.	22:22
adam_g	this is /w all the newer v3 model changes, btw	22:22
clarkb	for v3 specifics there I am really the wrong person to ask, generally though the differences are in the source event type from gerrit and on the merger side making sure you have the correct ref locally for the test to have access to	22:23
adam_g	yep. i guess my confusion is just about the internal zuul modelling around that	22:25
jeblair	adam_g: one of the todos for v3 is to rework the change/changeish/ref hierarchy	22:28
jeblair	but that's mostly for clarity and to simplify some code	22:29
adam_g	jeblair: ah, i see	22:29
adam_g	was just hacking on getting some of the post tests working. lemme push up what i have and you can lemme kow if its worth working on before that happens	22:29
jeblair	adam_g: ok. i think you should be able to proceed with just corrections, not refactoring. but yeah, something to look at would be nice.	22:30
openstackgerrit	Adam Gandelman proposed openstack-infra/zuul: Enable test_post https://review.openstack.org/404445	22:31
jeblair	adam_g: what happens if you don't make those changes?	22:33
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: Fail on bad options to fake-image-create https://review.openstack.org/291516	22:38
pabelanger	Shrews: jeblair: looks like we need to do some exception handling: http://paste.openstack.org/show/590899/	22:40
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: Fail on bad options to fake-image-create https://review.openstack.org/404452	22:40
adam_g	jeblair: make_merger_item() hits a few AttributeErrors	22:44
jeblair	pabelanger: i thought Shrews fixed that in a recent change	22:45
Shrews	https://review.openstack.org/400970	22:45
jeblair	pabelanger: are we running with that?	22:46
pabelanger	jeblair: we are now	22:46
pabelanger	but not before I think	22:46
pabelanger	just restarted nodepool-builder 5mins ago	22:46
jeblair	pabelanger: so i think we should be good. could probably delete the old uploading ones if you want	22:47
pabelanger	lets see what happens	22:47
jeblair	pabelanger: i don't think they will be deleted until the underlying build ages out. we might want to fix that (though with 970 this is only likely to happen after a hard-stop)	22:51
pabelanger	k	22:51
pabelanger	seems to be what I am seeing, they are hanging around in deleting state	22:51
pabelanger	and ya, I did do a hard stop too	22:52
jeblair	pabelanger: they are in deleting state? your paste said uploading.	22:52
pabelanger	jeblair: I did image-delete on them	22:53
pabelanger	the old ones	22:53
jeblair	pabelanger: then i wonder why those are not being deleted.	22:53
pabelanger	\| 0000000004 \| 0000000001 \| infracloud-vanilla \| fedora-23 \| None \| None \| deleting \| 00:00:04:49 \|	22:53
pabelanger	for example	22:53
pabelanger	maybe logic issue around uploads, since we didn't actually upload anything	22:54
pabelanger	then, did the delete	22:54
pabelanger	I can try and reproduce with a test in a bit	22:55
jeblair	pabelanger: well, we have a test for image-delete :)	22:55
pabelanger	Yay	22:57
pabelanger	uploads started	22:57
pabelanger	after 10min delay	22:58
clarkb	we used to have logic in nodepoold that on start would change the state automatically	22:58
clarkb	we likely need that in the builder itself now because its more standalone	22:58
clarkb	and I think there may even be a test for that explicit case	22:58
pabelanger	https://review.openstack.org/#/c/404432/ should help speed things up, and reduce duplicate checksum	22:58
jeblair	clarkb: no, we don't want to change state on startup because it's a multi-node system	22:59
clarkb	jeblair: but uploads are process specific	22:59
clarkb	(as are builds)	22:59
clarkb	so even though its multinode you want to update the db for the specific upload/build that belonged to a dead process	22:59
jeblair	clarkb: sure, but it does not make sense for a builder B to change builder A's uploads to delet on startup.	22:59
*** hashar has quit IRC		23:00
clarkb	ya but thats an implementation detail	23:00
jeblair	it does make sense for builder B to delete builder A's failed uploads. and that is the intent.	23:00
clarkb	we need something that will update state for processes that die	23:00
jeblair	we have it	23:00
clarkb	oh pabelanger indicated it required manual intervention?	23:00
jeblair	we are talking about the implementation details of this thing :)	23:00
jeblair	there's a lot going on, let's take it one at a time	23:01
jeblair	first, the things that are in the uploading state are due to a bug that is fixed in a patch that is landed	23:01
jeblair	so that shouldn't happen again	23:01
jeblair	second, since that did happen with a version of the program that was running before the fix landed, manual corrective action was indicated	23:02
jeblair	that is not working as expected	23:02
jeblair	because the state is set to delete, but the record is not actually being deleted	23:02
jeblair	i think that our test for that is not testing what we think	23:03
pabelanger	Ya, sorry for confusing.	23:03
clarkb	gotcha so the fix is potentially flawed	23:03
clarkb	by not doing deletes completely	23:04
jeblair	no, the fix should be fine	23:04
jeblair	which fix?	23:04
clarkb	23:01:25 jeblair \| first, the things that are in the uploading state are due to a bug that is fixed in a patch that is landed	23:04
clarkb	whatever that is	23:04
Shrews	i think there are too many unknowns to make a call. how many builds are there? how many successful uploads? what was the state of things when the delete was issued?	23:04
Shrews	without more info, i cannot say that things are not working as we expect them to	23:04
jeblair	Shrews: i don't think there are any unknowns	23:05
jeblair	i just think we're talking about 5 things and not being specific enough	23:05
Shrews	jeblair: i have unknowns. all i have is a pastebin	23:05
jeblair	so as annoying as it is, if this(these) conversation(s) are going to be fruitful, we need to be extremely specific	23:05
jeblair	Shrews: i'm working from the same pastebin	23:05
jeblair	okay, so the first thing:	23:06
jeblair	http://paste.openstack.org/show/590899/	23:06
jeblair	lots of uploads in 'uploading' state -- more than one for each image-provider combo	23:06
jeblair	those are uploads that failed, but the process was running before 400970 landed, so we didn't put them in the failed state	23:07
jeblair	that has landed, the current process is running that, so that bug is fixed now.	23:07
jeblair	to clarkb's point on this...	23:08
jeblair	it is possible for us to see this behavior again (as i noted earlier) if a builder is killed while uploading	23:08
jeblair	that's much more rare than a failed image uploa	23:08
jeblair	we might want to think about having the cleanupworker deal with that by checking that uploads in the 'uploading' state really are uploading	23:09
pabelanger	Ya, that might be good	23:09
pabelanger	also, this just happened:	23:09
pabelanger	2016-11-29 23:08:28,439 INFO nodepool.builder.UploadWorker.13: Image build 0000000004 in infracloud-vanilla is ready	23:09
Shrews	if that is all the info, then there is only a single READY build. image-delete will delete any provider uploads, but the build remains READY. since it's the only one (we keep the two most recent), it will not be deleted.	23:10
clarkb	considering we are almost always uploading (based on current builder) the chances of hitting that case will be high)	23:10
jeblair	but it's not super urgent, because the uploadworker itself will start a new upload if an existing one is not actually in progress.	23:10
pabelanger	agree, is a cosmetic thing for me right now	23:11
jeblair	the second problem is why image-delete didn't do anything	23:11
jeblair	or, well, it set the state to deleting, but no actual deleting happened	23:12
pabelanger	I can upload a new pastebin showing that issue	23:12
pabelanger	http://paste.openstack.org/show/590901/ is the current output	23:12
pabelanger	after running: nodepool image-delete --provider infracloud-chocolate --image fedora-23 --upload-id 0000000001 --build-id 0000000004	23:12
jeblair	Shrews: and yes, i believe that's the case for this issue -- the cleanup worker is ignoring all uploads related to that build because the build has not expired	23:13
pabelanger	I wasn't expecting the 'deleting' state to persist. However, jeblair has an idea of the issue I believe	23:13
clarkb	pabelanger: is the build id required there? that seems unnecessary (since the upload id is what ties us to a provider) (not trying to distract from other discussion we can talk about this when the first thing is wrapped up)	23:13
openstackgerrit	Adam Gandelman proposed openstack-infra/zuul: Enable test_post* https://review.openstack.org/404466	23:13
jeblair	we should probably rework that so that the cleanupworker will delete anything in the deleting state	23:14
pabelanger	clarkb: required today, but I think that CLI command is going to change. global UUIDs for iamges?	23:14
clarkb	jeblair: ++	23:14
clarkb	pabelanger: even without a global uuid isn't the provider + image name + upload id unique?	23:14
jeblair	we do have a test for image-delete, however, it consideres an image deleted as soon as it doesn't show up as ready	23:14
pabelanger	clarkb: Shrews would know	23:14
jeblair	so i think there is a mismatch in what we consider 'deleted' versus what that test is looking for	23:15
jeblair	clarkb, pabelanger: no, upload ids start at 1 again for each new image build	23:15
clarkb	jeblair: ah ok hrm	23:15
pabelanger	right	23:15
pabelanger	3 images online now	23:16
pabelanger	for fedora-23	23:16
clarkb	120 to go :)	23:16
clarkb	(I forget the exact number but its not small)	23:16
pabelanger	we only have fedora-23 right now, still building other images	23:17
pabelanger	debian-jessie almost done	23:17
clarkb	pabelanger: I can go ahead and write the zookeeper change if you want	23:18
clarkb	thogu h Iam going to guess that will require a restart of zookeeper	23:18
pabelanger	clarkb: sure, I haven't looked yet	23:18
clarkb	oh the puppet module already sets it to 10k	23:19
* clarkb checks the version we are running		23:19
Shrews	jeblair: the deleted state will persist until the build itself ages out.	23:19
Shrews	also, maybe we need a new build-delete command that recursively deletes all successful uploads and, lastly, the build itself	23:20
pabelanger	clarkb: oh, nice	23:20
* Shrews really shouldn't cook and irc simultaneously		23:20
pabelanger	Shrews: from your fridge? say from your fridge :)	23:21
mordred	finally a use for IoT!	23:21
jeblair	Shrews: that's the dib-image-delete command	23:21
clarkb	pabelanger: https://github.com/deric/puppet-zookeeper/blob/v0.5.5/manifests/init.pp#L42 so we might be fine as is	23:22
pabelanger	snapCount=10000	23:22
pabelanger	confirmed on nodepool.o.o	23:22
Shrews	jeblair: ah, right. forgot that existed	23:22
jeblair	neat, i just used boartty to add a story with no tasks	23:22
jeblair	clarkb, Shrews, pabelanger: story for the first thing: https://storyboard.openstack.org/#!/story/2000811	23:23
Shrews	pabelanger: mordred: doesn't everyone have their fridge beep when their nick is mentioned on irc?	23:24
clarkb	jeblair: lgtm	23:24
pabelanger	agreed	23:24
Shrews	jeblair: code already does that	23:25
jeblair	clarkb, Shrews, pabelanger: story for the second thing: https://storyboard.openstack.org/#!/story/2000812	23:25
jeblair	Shrews: if it did, we would not have seen uploads stay in 'uploading' after the builder was restarted	23:26
Shrews	jeblair: it won't be cleaned up if it's a current build	23:26
Shrews	i think what you all are driving at is cleaning up the cruft from current builds	23:27
Shrews	which does not happen currently	23:27
pabelanger	Yup	23:28
Shrews	so the first SB needs rephrased a bit to clarify it	23:28
clarkb	right the problem with that state is that a build in uploading state should mean that that build will or is being uploaded. But after catastrophic failure that is not the case so we should clean it up	23:28
jeblair	Shrews: yeah, i think you're right. these both may be the same bug. ;)	23:31
jeblair	fundamentally, i think the algo to decide when to delete a ready image/upload is correct, but it should be separate from when to delete images/uploads that are in non-ready states	23:34
jeblair	adam_g: do you have those attribute errors handy? i'm curious/confused why this was broken	23:35
jeblair	Shrews: we may go less insane if we just separate them and do it as two passes	23:36
* clarkb checks and confirms already insane		23:36
clarkb	ETOOMANYTHINGSTOTRACK	23:36
pabelanger	jeblair: adam_g: I had this problem I believe a while back, have an existing WIP review: https://review.openstack.org/#/c/393887/	23:36
pabelanger	but haven't debugged some of the why yet	23:37
jeblair	Shrews: could even have pass 1 be "decide which ready things to delete", then pass 2 be "delete thing in deleting state"	23:37
jeblair	(and maybe handle 'failed' in pass 2 as well, or just drop the state altogether and set it to deleting on failure)	23:38
jeblair	pabelanger: i agree with your question there :)	23:38
pabelanger	Ya, it was confusing	23:39
pabelanger	but old review, 3 weeks	23:39
pabelanger	haven't looped back to it, so happy for adam_g to dig into it :)	23:39
openstackgerrit	Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966	23:48
pabelanger	end of line for me	23:52
pabelanger	images are uploads, builds are happening on nb01.o.o	23:53
pabelanger	lets see what happens over night	23:53

Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!