*** willthames has joined #zuul | 00:14 | |
ianw | jamielennox / pabelanger: ok, i think i came up with a solution, which is activating the virtualenv, if it exists, in the builder. patch in a bit :) | 00:29 |
---|---|---|
*** Shuo has quit IRC | 01:04 | |
jeblair | ianw: maybe i misunderstood, but nodepool should not internally know anything about virtualenvs | 01:20 |
ianw | jeblair: well, there's that too. the other option is to say "you run nodepool from within an activated virtualenv (i.e. /path/to/venv/bin is in the $PATH) or it doesn't work" | 01:21 |
jeblair | ianw: i think that's one of *many* options :) | 01:22 |
jeblair | (others include: add a config file option for the dib binary, symlink, set PATH in startup script, update-alternatives, and i'm sure i'm missing more) | 01:23 |
ianw | config file for the dib binary doesn't really work, because dib just hits the same problem | 01:23 |
jeblair | oh good you -1d that with that comment, thanks | 01:24 |
jeblair | ianw: that combined with greghaynes message suggests that 'delay' may be a good way forward :) | 01:27 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966 | 01:28 |
ianw | jeblair: ^ with usual bonus ianw too verbose commit msg. lightly tested (i.e. not) | 01:29 |
ianw | but that's the gist | 01:29 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966 | 01:36 |
openstackgerrit | Merged openstack-infra/nodepool: Use database for non-zookeeper commands https://review.openstack.org/403869 | 01:49 |
pabelanger | ianw: jamielennox: I've done what SpamapS suggested about /bin/bash -c "source ..." in systemd, which makes diskimage-builder happy. Going to update my local env to do that now, see how it works long term | 02:07 |
*** willthames has quit IRC | 02:26 | |
*** willthames has joined #zuul | 02:46 | |
ianw | pabelanger: yep, that'll work too | 03:01 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966 | 03:06 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966 | 03:09 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966 | 03:20 |
*** hashar has joined #zuul | 05:01 | |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul: Re-enable requirement-older-than tests https://review.openstack.org/401027 | 05:32 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul: Re-enable requirement-username tests https://review.openstack.org/401028 | 05:32 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul: Re-enable requirement-email tests https://review.openstack.org/401029 | 05:32 |
openstackgerrit | Jamie Lennox proposed openstack-infra/zuul: Re-enable requirement-newer-than tests https://review.openstack.org/401019 | 05:32 |
*** saneax-_-|AFK is now known as saneax | 06:42 | |
ianw | pabelanger / jeblair: run out of time to look further, but i'm not convinced nodepool tests on master are stable. locally i get weird hangs and exceptions, and results are never the same twice on https://review.openstack.org/#/c/403966/ | 07:04 |
openstackgerrit | Cao Xuan Hoang proposed openstack-infra/nodepool: Clean imports in code https://review.openstack.org/404059 | 07:26 |
*** abregman has joined #zuul | 08:02 | |
*** willthames has quit IRC | 08:22 | |
*** yolanda has quit IRC | 08:31 | |
*** yolanda has joined #zuul | 08:43 | |
*** Cibo_ has joined #zuul | 09:27 | |
*** abregman has quit IRC | 09:35 | |
*** abregman has joined #zuul | 09:35 | |
*** Cibo_ has quit IRC | 09:40 | |
*** openstackgerrit has quit IRC | 09:48 | |
*** openstackgerrit has joined #zuul | 09:49 | |
*** yolanda has quit IRC | 09:55 | |
*** yolanda has joined #zuul | 09:57 | |
*** markmcd has quit IRC | 10:40 | |
*** abregman has quit IRC | 11:03 | |
*** abregman has joined #zuul | 12:06 | |
*** bhavik1 has joined #zuul | 13:12 | |
*** abregman is now known as abregman|afk | 14:13 | |
*** bhavik1 has quit IRC | 14:14 | |
*** abregman|afk is now known as abregman | 14:17 | |
pabelanger | morning | 14:25 |
pabelanger | continuing work on nb01.o.o | 14:25 |
mordred | pabelanger: let me know if there's anything I can do to help | 14:40 |
pabelanger | mordred: ack! | 14:42 |
pabelanger | just waiting for ansible wheel to run | 14:43 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Test rotation of builds in nodepool-builder https://review.openstack.org/400004 | 14:45 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Test rotation of builds in nodepool-builder https://review.openstack.org/400004 | 14:52 |
*** abregman is now known as abregman|afk | 14:54 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Re-enable test: test_list_nodes https://review.openstack.org/400955 | 15:04 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Re-enable test: test_image_upload_fail https://review.openstack.org/400970 | 15:04 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Fix nits in test_image_upload_fail https://review.openstack.org/404257 | 15:04 |
Shrews | pabelanger: mordred: Those 1st two were rebased to avoid the alien test that was -1'd. They were already +3'd, so if you could push them through... | 15:05 |
mordred | Shrews: done | 15:06 |
Shrews | muchas gracias | 15:06 |
Shrews | clarkb: 257 fixes your nits from another review | 15:08 |
pabelanger | \o/ | 15:20 |
pabelanger | http://nb01.openstack.org/ | 15:20 |
pabelanger | builds are started | 15:20 |
pabelanger | will take about 1 hours to get through first image | 15:22 |
Shrews | mordred: jeblair: clarkb: regarding the metadata comment in https://review.openstack.org/#/c/400836/2/nodepool/cmd/nodepoolcmd.py, if i'm not supposed to used 'snapshot' as the key, what should I be using? | 15:22 |
Shrews | err, snapshot as the value | 15:23 |
*** saneax is now known as saneax-_-|AFK | 15:23 | |
Shrews | basically, what metadata tells me it is a nodepool managed image? | 15:23 |
*** abregman|afk is now known as abregman | 15:25 | |
pabelanger | blarg | 15:30 |
pabelanger | mordred: jeblair: clarkb: any objections for adding an elastic-recheck for nodepool about our database drop issue? | 15:31 |
openstackgerrit | Merged openstack-infra/nodepool: Re-enable test: test_list_nodes https://review.openstack.org/400955 | 15:32 |
Shrews | This is an interesting failure: http://logs.openstack.org/57/404257/1/check/gate-nodepool-python27-ubuntu-xenial/c7cca8e/testr_results.html.gz | 15:37 |
Shrews | We have two upload threads uploading the same image. We have checks to guard against that. Not sure what's failing | 15:38 |
Shrews | i'll need to add some logging stuff to debug | 15:42 |
mordred | Shrews: oh yay | 15:57 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading https://review.openstack.org/404292 | 15:57 |
clarkb | pabelanger: nope | 16:00 |
clarkb | pabelanger: that is what it is there for | 16:00 |
pabelanger | kewl | 16:00 |
pabelanger | just uploaded it and pinged you in openstack-infra | 16:00 |
Shrews | mordred: yeah. i'm hoping that zk node locking actually works like we expect it to | 16:00 |
Shrews | i mean, we do test locking, too | 16:01 |
pabelanger | blarg | 16:01 |
pabelanger | DIB element strikes again | 16:01 |
pabelanger | 2016-11-29 15:54:33,234 INFO nodepool.image.build.fedora-23: + die 'Can not find Zuul public key!' | 16:01 |
mordred | Shrews: well, if it doesn't, this is definitely the right time to learn that | 16:02 |
clarkb | Shrews: I think shade sets metdata properties itself but also nodepool may set some too | 16:03 |
mordred | both set things | 16:04 |
mordred | shade sets the md5/sha256 checksums | 16:04 |
mordred | those are both prefixed with owner_specified.shade | 16:05 |
mordred | then nodepool should be setting image_type | 16:07 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Remove unused function make_image_dict https://review.openstack.org/404306 | 16:11 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Move to using properties instead of metadata https://review.openstack.org/404307 | 16:11 |
mordred | Shrews, clarkb: nodepool sends provider_image.meta as the extra metada on an image | 16:13 |
mordred | Shrews, clarkb: but that's all user defined - there is currently no key that nodepool adds to an image that would let us know it's nodepool managed | 16:17 |
clarkb | I think if we had onethat was "not cloud managed" that would be sufficient (since that was what the old code was doing) | 16:19 |
clarkb | but I am not sure if glance differentiates | 16:19 |
mordred | the old code was not doing that | 16:19 |
mordred | actually | 16:19 |
clarkb | mordred: it was | 16:19 |
clarkb | it was only looking at snapshot images | 16:19 |
mordred | only for snapshot images | 16:19 |
clarkb | yes, it was broken but that was the intent | 16:19 |
mordred | yah | 16:19 |
mordred | well, it's a super easy fix | 16:19 |
clarkb | basically we want to keep the cloud images out of the alien list | 16:21 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Flag uploaded images as being managed by nodepool https://review.openstack.org/404322 | 16:21 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading https://review.openstack.org/404292 | 16:22 |
mordred | clarkb: the logic in that patch in alien list is still wrong - one sec | 16:22 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Flag uploaded images as being managed by nodepool https://review.openstack.org/404322 | 16:23 |
mordred | there. that should do it | 16:23 |
*** bhavik1 has joined #zuul | 16:24 | |
Shrews | mordred: we should just add all of that to my test re-enable of alien-image-list | 16:24 |
mordred | Shrews: ++ - lemme do that real quick | 16:24 |
jeblair | mordred, clarkb, Shrews: how about we mirror the nodepool *node* metadata structure for images? | 16:27 |
jeblair | http://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/provider_manager.py?h=feature/zuulv3#n211 | 16:29 |
openstackgerrit | Merged openstack-infra/nodepool: Re-enable test: test_image_upload_fail https://review.openstack.org/400970 | 16:30 |
jeblair | mordred: though istr you wanted to de-json that and split it into multiple keys (because there is a length limit for values (which we are under) though of course there is a (provider-specified <sigh>) limit on the number of keys) | 16:31 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836 | 16:32 |
jeblair | mordred: anyway, i think the corresponding information we'd want to have there would be build number and upload number (provider and image name are obviously implied) | 16:32 |
mordred | jeblair: ++ | 16:32 |
clarkb | jeblair: right its 255 bytes per key value or something | 16:33 |
mordred | jeblair: and yes, I do want to de-jsonify it - and yes ot sigh on number of keys | 16:33 |
jeblair | mordred: ok. i left a review on your abandoned change https://review.openstack.org/404322 :) | 16:34 |
mordred | jeblair: I'mma do that in a follow up patch for now just so I don't keep stepping on Shrews' patch | 16:35 |
mordred | we can squash if we want once we're happy | 16:35 |
jeblair | squash or simulprove is ok | 16:36 |
pabelanger | current dib-image-list on nb01.o.o: http://paste.openstack.org/show/590854/ | 16:36 |
pabelanger | going to see why we are not cleaning up the building state of DIBs | 16:37 |
pabelanger | otherwise, going to be a long list | 16:37 |
jeblair | pabelanger: it's doing things! | 16:37 |
pabelanger | it is! | 16:37 |
pabelanger | https://review.openstack.org/#/q/topic:zuulv3-builder | 16:37 |
pabelanger | we need to land ^ | 16:37 |
Shrews | jeblair: i'm about to post a change to include the ZnodeStat structure in the build and upload objects. I'm concerned that node versioning might be biting us (though I haven't confirmed that) and that info could be useful for debugging. | 16:37 |
pabelanger | to fix ssh key issues | 16:37 |
jeblair | Shrews: in the repr you mean? | 16:37 |
Shrews | jeblair: yes | 16:38 |
jeblair | cool | 16:38 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Add znode stat structure to build/upload objects https://review.openstack.org/404292 | 16:39 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading https://review.openstack.org/404335 | 16:39 |
jeblair | Shrews: i'm catching up on the failure from earlier so i hope to be useful on this soon | 16:40 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Add build and upload information to image metadata https://review.openstack.org/404337 | 16:41 |
Shrews | it's weird that upload thing happened twice on back-to-back reviews. Now I've had 2 rechecks and 2 test reviews that haven't reproduced it | 16:41 |
mordred | Shrews, jeblair: ^^ it doesn't seem that we have access to upload id yet when we upload the image - am I reading that wrong? | 16:44 |
Shrews | mordred: correct. won't have it until storeImageUpload is called | 16:47 |
mordred | k. so - jeblair, we can easily add build_id to the images, but not upload_id unless we added an additional api call to update the image metadata | 16:47 |
Shrews | mordred: the caller knows it | 16:48 |
mordred | I'm not sure it's worth an extra API call | 16:48 |
mordred | Shrews: it does? | 16:48 |
Shrews | mordred: yes, just not passed to _uploadImage() | 16:48 |
Shrews | mordred: upnum | 16:48 |
mordred | Shrews: oh - awesome. thanks! | 16:48 |
mordred | jeblair: ignore me | 16:48 |
Shrews | mordred: also, as far as stepping on toes, the only thing left is to add the non-empty alien image test case, which I'd actually prefer to do in a separate review. | 16:49 |
jeblair | Shrews: i'm looping that failed test locally, no luck so far | 16:50 |
Shrews | jeblair: yeah, i have not been able to reproduce it locally either | 16:50 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Add build and upload information to image metadata https://review.openstack.org/404337 | 16:52 |
mordred | jeblair: ok - I think that's ^^ more inline with things | 16:52 |
Shrews | jeblair: looking at http://logs.openstack.org/57/404257/1/check/gate-nodepool-python27-ubuntu-xenial/c7cca8e/testr_results.html.gz at the failure, that zk LOST message between uploads is a bit concerning. I do not know if that's normal | 16:53 |
Shrews | it may be nothing | 16:54 |
jeblair | pabelanger: looking at 404309, i think i spotted a problem -- but also, in openstackci::builder we use both nodepool and nodepool::builder -- so should we just add the key as a param to openstackci as in 404317 and then pass it to nodepool instead of nodepool::builder? | 16:55 |
mordred | Shrews: ok. I'll just squash onto your patch then | 16:55 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836 | 16:56 |
*** hashar is now known as hasharAWay | 16:57 | |
pabelanger | jeblair: 404309, I can move .ssh file back up to the original line, no problem | 16:58 |
jeblair | pabelanger: oh, i see, i didn't see the move, i understand that change better now | 17:00 |
pabelanger | jeblair: trying to understand your follow up comment for 404317, do you mind rephrasing it? | 17:00 |
jeblair | pabelanger: yeah, so don't change 404309 just yet | 17:00 |
pabelanger | k | 17:00 |
jeblair | pabelanger: what i meant was, i think you could abandon 404309 and then in 404317, add the public key to the instantiation of class ::nodepool | 17:01 |
jeblair | pabelanger: i think the way to decide between the two approaches is to say whether the public key is something that should be common to all nodepool nodes, or something that is builder-specific | 17:03 |
jeblair | pabelanger: i guess it's builder-specific? and the private key will eventually be launcher-specific? | 17:04 |
jeblair | pabelanger: so maybe your approach is best? :) | 17:04 |
pabelanger | jeblair: well, in fact, I don't actually like what we are doing with the public key, its specific how we build the zuul user with DIB elements today. So, I think you might be right, abandon 404309 for now, then add the logic into 404317 or system-config | 17:05 |
jeblair | pabelanger: or maybe openstackci | 17:06 |
jeblair | pabelanger: at any rate, i +2d your existing changes :) | 17:07 |
pabelanger | clarkb: any thoughts on ^ +topic:zuulv3-builder | 17:07 |
pabelanger | jeblair: thanks | 17:07 |
jeblair | Shrews: i *think* what we're seeing is the command starting up and performing its ZK connect and disconnect. which is probably telling us that an upload worker is deciding to upload after we delete the upload. | 17:08 |
clarkb | pabelanger: sorry caffeinating | 17:09 |
clarkb | will pull up ina few | 17:09 |
*** abregman has quit IRC | 17:09 | |
Shrews | jeblair: yeah, i think so | 17:09 |
Shrews | jeblair: oh, i grok now. so things are working correctly, but the 2nd upload happens too quickly? | 17:12 |
jeblair | Shrews: yeah, if that's correct, then i think it's normal behavior, i think we just need to alter the test to stop trying to upload images after we delete. | 17:12 |
jeblair | Shrews: or alter it to wait for that specific upload's deletion | 17:13 |
jeblair | which might be the easier and better thing | 17:14 |
Shrews | jeblair: so make waitForImageDeletion take an optional upload ID might be the ticket | 17:16 |
jeblair | Shrews: i think if you throw a sleep between nodepoolcmd.main() and waitForImageDeletion, it repros. and yeah, i think that sounds like a good solution | 17:17 |
Shrews | ack | 17:17 |
clarkb | jeblair: Shrews in the past we have tried to be specific and not turn off the engine once a state is reached | 17:17 |
jeblair | clarkb: so that's another vote in favor if optional upload id? | 17:18 |
jeblair | Shrews: are you writing the fix? | 17:18 |
Shrews | jeblair: on it | 17:18 |
clarkb | jeblair: ya. | 17:18 |
clarkb | pabelanger: ok stack reviewed, need to add an extra check on the first change but otherwise lgtm | 17:19 |
jeblair | mordred, rcarrillocruz: i'm going to set the topic of the devstack ansible changes to zuulv3 | 17:24 |
jeblair | it was just the callback change that was missing it, now they all show up | 17:25 |
rcarrillocruz | ++ thx | 17:26 |
mordred | jeblair: woot | 17:26 |
jeblair | mordred: 404295 seems to output all the gathered facts | 17:29 |
jeblair | mordred: http://logs.openstack.org/95/404295/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/5aaf721/logs/devstack-gate-setup-host.txt.gz#_2016-11-29_16_02_47_661929 | 17:30 |
mordred | jeblair: yup. was just asking folks about that in infra | 17:30 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Fix test_image_delete https://review.openstack.org/404356 | 17:31 |
Shrews | jeblair: clarkb: ^^^ | 17:31 |
jeblair | +2 | 17:32 |
pabelanger | jeblair: mind +3 on 404309 update to address clarkb comments | 17:49 |
jeblair | pabelanger: done | 17:50 |
pabelanger | danke | 17:51 |
*** morgan_ is now known as morganfainberg | 18:03 | |
*** morganfainberg is now known as morgan | 18:04 | |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836 | 18:47 |
Shrews | mordred: fixed the merge conflict ^^^ | 18:48 |
openstackgerrit | Merged openstack-infra/nodepool: Remove unused function make_image_dict https://review.openstack.org/404306 | 18:48 |
*** bhavik1 has quit IRC | 18:51 | |
openstackgerrit | Merged openstack-infra/nodepool: Fix nits in test_image_upload_fail https://review.openstack.org/404257 | 18:56 |
mordred | Shrews: yay! | 19:07 |
*** abregman has joined #zuul | 19:08 | |
mordred | clarkb: I did not +A https://review.openstack.org/400836 since I wrote code in it | 19:08 |
*** abregman has quit IRC | 19:09 | |
clarkb | mordred: and I just -1'd it >_> | 19:09 |
clarkb | its still listing the cloud provided images in the alien list | 19:09 |
*** abregman has joined #zuul | 19:10 | |
mordred | clarkb: oh - that's an easy fix | 19:11 |
Shrews | mordred: you fixing that or want me to? | 19:13 |
mordred | Shrews: I got it | 19:14 |
mordred | Shrews: just going to filter the images when we get the provider_image list | 19:14 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836 | 19:14 |
mordred | Shrews, clarkb: ^^ | 19:14 |
clarkb | thanks will rereview post meeting | 19:15 |
clarkb | (head down in code review is how I missed the start of the meeting so will avoid it until done) | 19:15 |
mordred | clarkb: yah. same here :) | 19:15 |
pabelanger | ugh | 20:14 |
pabelanger | another DIB fail | 20:14 |
pabelanger | jeblair: Shrews: What do you think about us doing a merge master to feature/zuulv3? | 20:14 |
pabelanger | I just ran into a bug | 20:14 |
pabelanger | which we fixed in master | 20:14 |
pabelanger | but are missing in feature/zuulv3 | 20:15 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Stop running DIB usage report https://review.openstack.org/404409 | 20:17 |
pabelanger | jeblair: clarkb: mordred: ^ is the fix we need in feature/zuulv3 | 20:17 |
pabelanger | Did a cherry-pick | 20:17 |
pabelanger | or we can merge master | 20:17 |
clarkb | +2 | 20:19 |
clarkb | though if we think we are this close to really running the builder with zk may be a good time to work on a merge too | 20:19 |
pabelanger | things are running now | 20:20 |
pabelanger | merge would help fix issues like ^ | 20:20 |
jeblair | pabelanger, clarkb: https://review.openstack.org/400536 | 20:23 |
jeblair | that needs a refresh anyway, so i will +3 the cherry-pick | 20:24 |
clarkb | ok | 20:24 |
jeblair | clarkb, pabelanger: jhesketh did request a review on that despite the test failure | 20:25 |
pabelanger | jeblair: sure, will look today | 20:25 |
*** hasharAWay is now known as hashar | 20:31 | |
*** abregman has quit IRC | 20:53 | |
openstackgerrit | Merged openstack-infra/nodepool: Stop running DIB usage report https://review.openstack.org/404409 | 20:59 |
openstackgerrit | Monty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests https://review.openstack.org/400836 | 21:16 |
mordred | Shrews, clarkb: whoops. ^^ I had broked things | 21:18 |
pabelanger | oh snap | 21:19 |
pabelanger | | fedora-23-0000000004 | fedora-23 | nb01 | qcow2,vhd,raw | ready | 00:00:00:18 | | 21:19 |
pabelanger | http://paste.openstack.org/show/590895/ | 21:19 |
pabelanger | nodepool image-list | 21:19 |
mordred | pabelanger: woot! | 21:20 |
mordred | pabelanger: that's super exciting | 21:20 |
mordred | pabelanger: I just +A'd two of your patches - but gerrit claims they conflict. we'll just have to see | 21:21 |
pabelanger | mordred: woot indeed | 21:21 |
mordred | pabelanger: so - we really do need to get jhesketh's merge master patch sorted out :) | 21:22 |
pabelanger | mordred: agreed, I think that will help with some yak shaving | 21:22 |
mordred | mmm. shaving yaks | 21:22 |
mordred | pabelanger: there is a fellow outside of vancouver who has a herd of tibetan yaks for sale | 21:22 |
pabelanger | ha, nice | 21:23 |
openstackgerrit | Merged openstack-infra/nodepool: Fix test_image_delete https://review.openstack.org/404356 | 21:23 |
openstackgerrit | Merged openstack-infra/nodepool: Add pause support for diskimages https://review.openstack.org/403780 | 21:23 |
pabelanger | Also need to land our nodepool patch to use checksum files, to avoid duplicate checksums | 21:24 |
mordred | pabelanger: seems like we should port that patch to v3 at this point | 21:24 |
clarkb | now to see how reliable uploads will be for us | 21:24 |
clarkb | pabelanger: mordred maybe just get that from merge? | 21:25 |
clarkb | since its a performamce noy functional thing | 21:25 |
mordred | clarkb: well, it hasn't landed in either branch | 21:25 |
mordred | so I was mostly thinking we should not land it in master then deal with it on merge since we're so close to re-merging everything | 21:26 |
clarkb | oh I thought the first version of it was landed | 21:26 |
mordred | if it is, I totally agree it should just be a thing on merge | 21:26 |
*** willthames has joined #zuul | 21:28 | |
pabelanger | ya, I don't think it merged yet | 21:29 |
pabelanger | checking | 21:29 |
pabelanger | 383519 | 21:30 |
pabelanger | I can port to zuulv3 | 21:30 |
pabelanger | mordred: clarkb: a review on https://review.openstack.org/#/c/404296/ for elastic-recheck. I believe I got the syntax right | 21:32 |
pabelanger | for our database drop failures | 21:32 |
Shrews | pabelanger: what does the cpu usage look like on nb01? curious as to how the small delay between schedule checks is affecting it | 21:34 |
pabelanger | $ cat /proc/loadavg | 21:35 |
pabelanger | 4.34 4.63 4.23 7/544 387 | 21:35 |
pabelanger | right now | 21:35 |
pabelanger | 277.5 CPU for nodepool-builder ATM | 21:35 |
pabelanger | but, think shade is still doing md5sum | 21:35 |
mordred | we're spending 39% of a CPU in dnf? | 21:36 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: [DNM] test https://review.openstack.org/404428 | 21:36 |
mordred | Shrews: it does feel like an awful lot of CPU usage given that I think there is currently one dib build going on | 21:38 |
Shrews | mordred: the BuildWorker and UploadWorkers both only sleep 0.1s between their respective checks. | 21:38 |
Shrews | so they are always doing something | 21:38 |
mordred | Shrews: ah | 21:39 |
mordred | then that CPU usage seems about right on | 21:39 |
pabelanger | Was thinking we should make that an exponential backoff on error, up to X seconds? | 21:39 |
pabelanger | or mins | 21:39 |
pabelanger | as not to spam logs on error | 21:40 |
jeblair | remote: https://review.openstack.org/404430 Add nb01.openstack.org to cacti | 21:40 |
jeblair | pabelanger, mordred: ^ | 21:40 |
pabelanger | +3 | 21:40 |
Shrews | pabelanger: backoff on upload? | 21:41 |
Shrews | pabelanger: not sure how to make that work since one thread would have to keep the upload lock to keep other threads from re-attempting the upload | 21:41 |
Shrews | which i guess it could do | 21:42 |
*** Cibo_ has joined #zuul | 21:42 | |
pabelanger | Shrews: I haven't looked into it yet, but something I was thinking about, since my logs were filling up aggressively on a mis-configuration of clouds.yaml | 21:42 |
mordred | pabelanger: easy soluiton - don't mis-configure clouds.yaml ;) | 21:43 |
Shrews | pabelanger: incentive to fix your clouds.yaml :) | 21:43 |
pabelanger | indeed | 21:43 |
Shrews | it's a feature | 21:43 |
jeblair | i mean, it actually is a feature | 21:43 |
jeblair | we have way more problems with uploads failing with EOPENSTACK than we have with misconfigured clouds.yamls | 21:44 |
jeblair | which doesn't mean 'continue blowing up indefinitely' is the only answer, just that the... er... obstinancy of the builder is intentional. :) | 21:46 |
jeblair | Shrews: and oh yeah, we did want to make the sleep longer, just need to keep it short in the check. | 21:47 |
jeblair | (i was thinking something like 10 secs) | 21:47 |
jeblair | er 'keep it short in the tests' | 21:47 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Use diskimage-builder checksum files https://review.openstack.org/404432 | 21:48 |
Shrews | jeblair: do what we did with cleanup_interval for the CleanupWorker thread? | 21:49 |
pabelanger | will check dsvm test on that | 21:49 |
jeblair | Shrews: yeah, probably so | 21:49 |
Shrews | jeblair: k. i'll work on that | 21:49 |
jeblair | Shrews: thanks; pabelanger: that should make things 100 times better. :) | 21:50 |
pabelanger | 100x improvement | 21:50 |
Shrews | golly you maths good | 21:50 |
jeblair | i learned me some math | 21:51 |
pabelanger | wonder if we should enable shade debugs for now on nb01 | 21:52 |
pabelanger | to see what is going on | 21:52 |
pabelanger | not seeing much traffic between nb01.o.o and clouds | 21:52 |
pabelanger | Oh, errors | 22:01 |
mordred | yay! | 22:01 |
pabelanger | OpenStackCloudException: Error fetching image list: 'X509' object has no attribute '_x509' | 22:01 |
mordred | pabelanger: whereat? | 22:01 |
pabelanger | /var/log/nodepool/builder-debug.log | 22:01 |
mordred | that is a bug in python requests, iirc | 22:01 |
pabelanger | something, something, ssl? | 22:02 |
pabelanger | off to the googles | 22:02 |
mordred | pabelanger: https://github.com/kennethreitz/requests/issues/3701 | 22:03 |
pabelanger | PyOpenSSL seems to be what people are pointing too | 22:03 |
pabelanger | ya | 22:03 |
pabelanger | reading that | 22:03 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Make build/upload worker sleep time configurable https://review.openstack.org/404438 | 22:03 |
pabelanger | python -c 'import OpenSSL; print(OpenSSL.__version__)' | 22:04 |
pabelanger | 0.13 | 22:04 |
pabelanger | looks like we need a version bump | 22:04 |
mordred | that'll do it | 22:04 |
mordred | did we install from disto packages or something? | 22:04 |
pabelanger | let me see how it gets pulled in | 22:04 |
pabelanger | checking | 22:04 |
pabelanger | ii python-openssl 0.13-2ubuntu6 amd64 Python 2 wrapper around the OpenSSL library | 22:05 |
pabelanger | ya | 22:05 |
mordred | def wanna not have that and have pip installed version instead | 22:06 |
jeblair | er, on nodepool.o.o there is no openssl | 22:07 |
pabelanger | checking to see how it was installed | 22:07 |
mordred | nothing seems to depend on it | 22:07 |
Shrews | mordred: pabelanger: nb01 will also need the unreleased shade to get the image meta | 22:07 |
Shrews | if it doesn't have it already | 22:08 |
pabelanger | k | 22:08 |
mordred | Shrews: I'd actually like to release shade - since that is a fairly hefty regression | 22:08 |
Shrews | mordred: DO EET | 22:08 |
pabelanger | /var/log/apt/history.log doesn't list python-openssl | 22:09 |
mordred | maybe base-image ? | 22:09 |
jeblair | pabelanger: agreed, i can't find anything either | 22:09 |
pabelanger | mordred: ya, I wonder if that is the case | 22:10 |
mordred | really need to get around to having our own dib-created base images that are what we run our servers on | 22:10 |
pabelanger | ++ | 22:10 |
pabelanger | soon(tm( | 22:10 |
jeblair | let's go to -infra | 22:11 |
Shrews | let's go drinking. night all | 22:11 |
jeblair | Shrews: cheers! | 22:12 |
adam_g | anyone able to clear up some confusion ive got around item.change.{refspec, branch, ref} WRT https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/launcher/client.py?h=feature/zuulv3#n28 ? | 22:17 |
adam_g | im not sure whats meant to happen there when item.change is a Ref and not a Change | 22:18 |
clarkb | adam_g: iirc its supposed to tell the merger to fetch the ref | 22:19 |
clarkb | so that when the job runs the ref is available to it | 22:19 |
clarkb | (but that behavior may have changed with v3) | 22:20 |
adam_g | clarkb: right, but the models differ quiet a bit and it looks like the logic in there is missing some smarts in dealing with some of those differences | 22:20 |
clarkb | adam_g: well there its just constructing the data | 22:20 |
adam_g | i can update that to fish out (what i think are) the correct bits from each, but having not much prior knowledge here, im not sure its correct | 22:20 |
clarkb | so either you get a change + patchset or you get an oldrev and newrev | 22:20 |
clarkb | then things consuming the data will have to switch based on that | 22:21 |
adam_g | clarkb: right, but things branch seem to be located in different places depending on what type of item.change comes in. | 22:22 |
adam_g | this is /w all the newer v3 model changes, btw | 22:22 |
clarkb | for v3 specifics there I am really the wrong person to ask, generally though the differences are in the source event type from gerrit and on the merger side making sure you have the correct ref locally for the test to have access to | 22:23 |
adam_g | yep. i guess my confusion is just about the internal zuul modelling around that | 22:25 |
jeblair | adam_g: one of the todos for v3 is to rework the change/changeish/ref hierarchy | 22:28 |
jeblair | but that's mostly for clarity and to simplify some code | 22:29 |
adam_g | jeblair: ah, i see | 22:29 |
adam_g | was just hacking on getting some of the post tests working. lemme push up what i have and you can lemme kow if its worth working on before that happens | 22:29 |
jeblair | adam_g: ok. i *think* you should be able to proceed with just corrections, not refactoring. but yeah, something to look at would be nice. | 22:30 |
openstackgerrit | Adam Gandelman proposed openstack-infra/zuul: Enable test_post https://review.openstack.org/404445 | 22:31 |
jeblair | adam_g: what happens if you don't make those changes? | 22:33 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: Fail on bad options to fake-image-create https://review.openstack.org/291516 | 22:38 |
pabelanger | Shrews: jeblair: looks like we need to do some exception handling: http://paste.openstack.org/show/590899/ | 22:40 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: Fail on bad options to fake-image-create https://review.openstack.org/404452 | 22:40 |
adam_g | jeblair: make_merger_item() hits a few AttributeErrors | 22:44 |
jeblair | pabelanger: i thought Shrews fixed that in a recent change | 22:45 |
Shrews | https://review.openstack.org/400970 | 22:45 |
jeblair | pabelanger: are we running with that? | 22:46 |
pabelanger | jeblair: we are now | 22:46 |
pabelanger | but not before I think | 22:46 |
pabelanger | just restarted nodepool-builder 5mins ago | 22:46 |
jeblair | pabelanger: so i think we should be good. could probably delete the old uploading ones if you want | 22:47 |
pabelanger | lets see what happens | 22:47 |
jeblair | pabelanger: i don't think they will be deleted until the underlying build ages out. we might want to fix that (though with 970 this is only likely to happen after a hard-stop) | 22:51 |
pabelanger | k | 22:51 |
pabelanger | seems to be what I am seeing, they are hanging around in deleting state | 22:51 |
pabelanger | and ya, I did do a hard stop too | 22:52 |
jeblair | pabelanger: they are in deleting state? your paste said uploading. | 22:52 |
pabelanger | jeblair: I did image-delete on them | 22:53 |
pabelanger | the old ones | 22:53 |
jeblair | pabelanger: then i wonder why those are not being deleted. | 22:53 |
pabelanger | | 0000000004 | 0000000001 | infracloud-vanilla | fedora-23 | None | None | deleting | 00:00:04:49 | | 22:53 |
pabelanger | for example | 22:53 |
pabelanger | maybe logic issue around uploads, since we didn't actually upload anything | 22:54 |
pabelanger | then, did the delete | 22:54 |
pabelanger | I can try and reproduce with a test in a bit | 22:55 |
jeblair | pabelanger: well, we have a test for image-delete :) | 22:55 |
pabelanger | Yay | 22:57 |
pabelanger | uploads started | 22:57 |
pabelanger | after 10min delay | 22:58 |
clarkb | we used to have logic in nodepoold that on start would change the state automatically | 22:58 |
clarkb | we likely need that in the builder itself now because its more standalone | 22:58 |
clarkb | and I think there may even be a test for that explicit case | 22:58 |
pabelanger | https://review.openstack.org/#/c/404432/ should help speed things up, and reduce duplicate checksum | 22:58 |
jeblair | clarkb: no, we don't want to change state on startup because it's a multi-node system | 22:59 |
clarkb | jeblair: but uploads are process specific | 22:59 |
clarkb | (as are builds) | 22:59 |
clarkb | so even though its multinode you want to update the db for the specific upload/build that belonged to a dead process | 22:59 |
jeblair | clarkb: sure, but it does not make sense for a builder B to change builder A's uploads to delet on startup. | 22:59 |
*** hashar has quit IRC | 23:00 | |
clarkb | ya but thats an implementation detail | 23:00 |
jeblair | it *does* make sense for builder B to delete builder A's failed uploads. and that is the intent. | 23:00 |
clarkb | we need something that will update state for processes that die | 23:00 |
jeblair | we have it | 23:00 |
clarkb | oh pabelanger indicated it required manual intervention? | 23:00 |
jeblair | we are talking about the implementation details of this thing :) | 23:00 |
jeblair | there's a lot going on, let's take it one at a time | 23:01 |
jeblair | first, the things that are in the uploading state are due to a bug that is fixed in a patch that is landed | 23:01 |
jeblair | so that shouldn't happen again | 23:01 |
jeblair | second, since that did happen with a version of the program that was running before the fix landed, manual corrective action was indicated | 23:02 |
jeblair | that is not working as expected | 23:02 |
jeblair | because the state is set to delete, but the record is not actually being deleted | 23:02 |
jeblair | i think that our test for that is not testing what we think | 23:03 |
pabelanger | Ya, sorry for confusing. | 23:03 |
clarkb | gotcha so the fix is potentially flawed | 23:03 |
clarkb | by not doing deletes completely | 23:04 |
jeblair | no, the fix should be fine | 23:04 |
jeblair | which fix? | 23:04 |
clarkb | 23:01:25 jeblair | first, the things that are in the uploading state are due to a bug that is fixed in a patch that is landed | 23:04 |
clarkb | whatever that is | 23:04 |
Shrews | i think there are too many unknowns to make a call. how many builds are there? how many successful uploads? what was the state of things when the delete was issued? | 23:04 |
Shrews | without more info, i cannot say that things are not working as we expect them to | 23:04 |
jeblair | Shrews: i don't think there are any unknowns | 23:05 |
jeblair | i just think we're talking about 5 things and not being specific enough | 23:05 |
Shrews | jeblair: i have unknowns. all i have is a pastebin | 23:05 |
jeblair | so as annoying as it is, if this(these) conversation(s) are going to be fruitful, we need to be extremely specific | 23:05 |
jeblair | Shrews: i'm working from the same pastebin | 23:05 |
jeblair | okay, so the first thing: | 23:06 |
jeblair | http://paste.openstack.org/show/590899/ | 23:06 |
jeblair | lots of uploads in 'uploading' state -- more than one for each image-provider combo | 23:06 |
jeblair | those are uploads that failed, but the process was running before 400970 landed, so we didn't put them in the failed state | 23:07 |
jeblair | that has landed, the current process is running that, so that bug is fixed now. | 23:07 |
jeblair | to clarkb's point on this... | 23:08 |
jeblair | it is possible for us to see this behavior again (as i noted earlier) if a builder is killed while uploading | 23:08 |
jeblair | that's much more rare than a failed image uploa | 23:08 |
jeblair | we *might* want to think about having the cleanupworker deal with that by checking that uploads in the 'uploading' state really are uploading | 23:09 |
pabelanger | Ya, that might be good | 23:09 |
pabelanger | also, this just happened: | 23:09 |
pabelanger | 2016-11-29 23:08:28,439 INFO nodepool.builder.UploadWorker.13: Image build 0000000004 in infracloud-vanilla is ready | 23:09 |
Shrews | if that is all the info, then there is only a single READY build. image-delete will delete any provider uploads, but the build remains READY. since it's the only one (we keep the two most recent), it will not be deleted. | 23:10 |
clarkb | considering we are almost always uploading (based on current builder) the chances of hitting that case will be high) | 23:10 |
jeblair | but it's not super urgent, because the uploadworker itself will start a new upload if an existing one is not actually in progress. | 23:10 |
pabelanger | agree, is a cosmetic thing for me right now | 23:11 |
jeblair | the second problem is why image-delete didn't do anything | 23:11 |
jeblair | or, well, it set the state to deleting, but no actual deleting happened | 23:12 |
pabelanger | I can upload a new pastebin showing that issue | 23:12 |
pabelanger | http://paste.openstack.org/show/590901/ is the current output | 23:12 |
pabelanger | after running: nodepool image-delete --provider infracloud-chocolate --image fedora-23 --upload-id 0000000001 --build-id 0000000004 | 23:12 |
jeblair | Shrews: and yes, i believe that's the case for this issue -- the cleanup worker is ignoring all uploads related to that build because the build has not expired | 23:13 |
pabelanger | I wasn't expecting the 'deleting' state to persist. However, jeblair has an idea of the issue I believe | 23:13 |
clarkb | pabelanger: is the build id required there? that seems unnecessary (since the upload id is what ties us to a provider) (not trying to distract from other discussion we can talk about this when the first thing is wrapped up) | 23:13 |
openstackgerrit | Adam Gandelman proposed openstack-infra/zuul: Enable test_post* https://review.openstack.org/404466 | 23:13 |
jeblair | we should probably rework that so that the cleanupworker will delete anything in the deleting state | 23:14 |
pabelanger | clarkb: required today, but I think that CLI command is going to change. global UUIDs for iamges? | 23:14 |
clarkb | jeblair: ++ | 23:14 |
clarkb | pabelanger: even without a global uuid isn't the provider + image name + upload id unique? | 23:14 |
jeblair | we do have a test for image-delete, however, it consideres an image deleted as soon as it doesn't show up as ready | 23:14 |
pabelanger | clarkb: Shrews would know | 23:14 |
jeblair | so i think there is a mismatch in what we consider 'deleted' versus what that test is looking for | 23:15 |
jeblair | clarkb, pabelanger: no, upload ids start at 1 again for each new image build | 23:15 |
clarkb | jeblair: ah ok hrm | 23:15 |
pabelanger | right | 23:15 |
pabelanger | 3 images online now | 23:16 |
pabelanger | for fedora-23 | 23:16 |
clarkb | 120 to go :) | 23:16 |
clarkb | (I forget the exact number but its not small) | 23:16 |
pabelanger | we only have fedora-23 right now, still building other images | 23:17 |
pabelanger | debian-jessie almost done | 23:17 |
clarkb | pabelanger: I can go ahead and write the zookeeper change if you want | 23:18 |
clarkb | thogu h Iam going to guess that will require a restart of zookeeper | 23:18 |
pabelanger | clarkb: sure, I haven't looked yet | 23:18 |
clarkb | oh the puppet module already sets it to 10k | 23:19 |
* clarkb checks the version we are running | 23:19 | |
Shrews | jeblair: the deleted state will persist until the build itself ages out. | 23:19 |
Shrews | also, maybe we need a new build-delete command that recursively deletes all successful uploads and, lastly, the build itself | 23:20 |
pabelanger | clarkb: oh, nice | 23:20 |
* Shrews really shouldn't cook and irc simultaneously | 23:20 | |
pabelanger | Shrews: from your fridge? say from your fridge :) | 23:21 |
mordred | finally a use for IoT! | 23:21 |
jeblair | Shrews: that's the dib-image-delete command | 23:21 |
clarkb | pabelanger: https://github.com/deric/puppet-zookeeper/blob/v0.5.5/manifests/init.pp#L42 so we might be fine as is | 23:22 |
pabelanger | snapCount=10000 | 23:22 |
pabelanger | confirmed on nodepool.o.o | 23:22 |
Shrews | jeblair: ah, right. forgot that existed | 23:22 |
jeblair | neat, i just used boartty to add a story with no tasks | 23:22 |
jeblair | clarkb, Shrews, pabelanger: story for the first thing: https://storyboard.openstack.org/#!/story/2000811 | 23:23 |
Shrews | pabelanger: mordred: doesn't everyone have their fridge beep when their nick is mentioned on irc? | 23:24 |
clarkb | jeblair: lgtm | 23:24 |
pabelanger | agreed | 23:24 |
Shrews | jeblair: code already does that | 23:25 |
jeblair | clarkb, Shrews, pabelanger: story for the second thing: https://storyboard.openstack.org/#!/story/2000812 | 23:25 |
jeblair | Shrews: if it did, we would not have seen uploads stay in 'uploading' after the builder was restarted | 23:26 |
Shrews | jeblair: it won't be cleaned up if it's a current build | 23:26 |
Shrews | i think what you all are driving at is cleaning up the cruft from current builds | 23:27 |
Shrews | which does not happen currently | 23:27 |
pabelanger | Yup | 23:28 |
Shrews | so the first SB needs rephrased a bit to clarify it | 23:28 |
clarkb | right the problem with that state is that a build in uploading state should mean that that build will or is being uploaded. But after catastrophic failure that is not the case so we should clean it up | 23:28 |
jeblair | Shrews: yeah, i think you're right. these both may be the same bug. ;) | 23:31 |
jeblair | fundamentally, i think the algo to decide when to delete a ready image/upload is correct, but it should be separate from when to delete images/uploads that are in non-ready states | 23:34 |
jeblair | adam_g: do you have those attribute errors handy? i'm curious/confused why this was broken | 23:35 |
jeblair | Shrews: we may go less insane if we just separate them and do it as two passes | 23:36 |
* clarkb checks and confirms already insane | 23:36 | |
clarkb | ETOOMANYTHINGSTOTRACK | 23:36 |
pabelanger | jeblair: adam_g: I had this problem I believe a while back, have an existing WIP review: https://review.openstack.org/#/c/393887/ | 23:36 |
pabelanger | but haven't debugged some of the why yet | 23:37 |
jeblair | Shrews: could even have pass 1 be "decide which ready things to delete", then pass 2 be "delete thing in deleting state" | 23:37 |
jeblair | (and maybe handle 'failed' in pass 2 as well, or just drop the state altogether and set it to deleting on failure) | 23:38 |
jeblair | pabelanger: i agree with your question there :) | 23:38 |
pabelanger | Ya, it was confusing | 23:39 |
pabelanger | but old review, 3 weeks | 23:39 |
pabelanger | haven't looped back to it, so happy for adam_g to dig into it :) | 23:39 |
openstackgerrit | Ian Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib https://review.openstack.org/403966 | 23:48 |
pabelanger | end of line for me | 23:52 |
pabelanger | images are uploads, builds are happening on nb01.o.o | 23:53 |
pabelanger | lets see what happens over night | 23:53 |
Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!