Tuesday, 2016-11-29

*** willthames has joined #zuul00:14
ianwjamielennox / pabelanger: ok, i think i came up with a solution, which is activating the virtualenv, if it exists, in the builder.  patch in a bit :)00:29
*** Shuo has quit IRC01:04
jeblairianw: maybe i misunderstood, but nodepool should not internally know anything about virtualenvs01:20
ianwjeblair: well, there's that too.  the other option is to say "you run nodepool from within an activated virtualenv (i.e. /path/to/venv/bin is in the $PATH) or it doesn't work"01:21
jeblairianw: i think that's one of *many* options :)01:22
jeblair(others include: add a config file option for the dib binary, symlink, set PATH in startup script, update-alternatives, and i'm sure i'm missing more)01:23
ianwconfig file for the dib binary doesn't really work, because dib just hits the same problem01:23
jeblairoh good you -1d that with that comment, thanks01:24
jeblairianw: that combined with greghaynes message suggests that 'delay' may be a good way forward :)01:27
openstackgerritIan Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib  https://review.openstack.org/40396601:28
ianwjeblair: ^ with usual bonus ianw too verbose commit msg.  lightly tested (i.e. not)01:29
ianwbut that's the gist01:29
openstackgerritIan Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib  https://review.openstack.org/40396601:36
openstackgerritMerged openstack-infra/nodepool: Use database for non-zookeeper commands  https://review.openstack.org/40386901:49
pabelangerianw: jamielennox: I've done what SpamapS suggested about /bin/bash -c "source ..." in systemd, which makes diskimage-builder happy.  Going to update my local env to do that now, see how it works long term02:07
*** willthames has quit IRC02:26
*** willthames has joined #zuul02:46
ianwpabelanger: yep, that'll work too03:01
openstackgerritIan Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib  https://review.openstack.org/40396603:06
openstackgerritIan Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib  https://review.openstack.org/40396603:09
openstackgerritIan Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib  https://review.openstack.org/40396603:20
*** hashar has joined #zuul05:01
openstackgerritJamie Lennox proposed openstack-infra/zuul: Re-enable requirement-older-than tests  https://review.openstack.org/40102705:32
openstackgerritJamie Lennox proposed openstack-infra/zuul: Re-enable requirement-username tests  https://review.openstack.org/40102805:32
openstackgerritJamie Lennox proposed openstack-infra/zuul: Re-enable requirement-email tests  https://review.openstack.org/40102905:32
openstackgerritJamie Lennox proposed openstack-infra/zuul: Re-enable requirement-newer-than tests  https://review.openstack.org/40101905:32
*** saneax-_-|AFK is now known as saneax06:42
ianwpabelanger / jeblair: run out of time to look further, but i'm not convinced nodepool tests on master are stable.  locally i get weird hangs and exceptions, and results are never the same twice on https://review.openstack.org/#/c/403966/07:04
openstackgerritCao Xuan Hoang proposed openstack-infra/nodepool: Clean imports in code  https://review.openstack.org/40405907:26
*** abregman has joined #zuul08:02
*** willthames has quit IRC08:22
*** yolanda has quit IRC08:31
*** yolanda has joined #zuul08:43
*** Cibo_ has joined #zuul09:27
*** abregman has quit IRC09:35
*** abregman has joined #zuul09:35
*** Cibo_ has quit IRC09:40
*** openstackgerrit has quit IRC09:48
*** openstackgerrit has joined #zuul09:49
*** yolanda has quit IRC09:55
*** yolanda has joined #zuul09:57
*** markmcd has quit IRC10:40
*** abregman has quit IRC11:03
*** abregman has joined #zuul12:06
*** bhavik1 has joined #zuul13:12
*** abregman is now known as abregman|afk14:13
*** bhavik1 has quit IRC14:14
*** abregman|afk is now known as abregman14:17
pabelangermorning14:25
pabelangercontinuing work on nb01.o.o14:25
mordredpabelanger: let me know if there's anything I can do to help14:40
pabelangermordred: ack!14:42
pabelangerjust waiting for ansible wheel to run14:43
openstackgerritPaul Belanger proposed openstack-infra/nodepool: Test rotation of builds in nodepool-builder  https://review.openstack.org/40000414:45
openstackgerritPaul Belanger proposed openstack-infra/nodepool: Test rotation of builds in nodepool-builder  https://review.openstack.org/40000414:52
*** abregman is now known as abregman|afk14:54
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: Re-enable test: test_list_nodes  https://review.openstack.org/40095515:04
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: Re-enable test: test_image_upload_fail  https://review.openstack.org/40097015:04
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: Fix nits in test_image_upload_fail  https://review.openstack.org/40425715:04
Shrewspabelanger: mordred: Those 1st two were rebased to avoid the alien test that was -1'd. They were already +3'd, so if you could push them through...15:05
mordredShrews: done15:06
Shrewsmuchas gracias15:06
Shrewsclarkb: 257 fixes your nits from another review15:08
pabelanger\o/15:20
pabelangerhttp://nb01.openstack.org/15:20
pabelangerbuilds are started15:20
pabelangerwill take about 1 hours to get through first image15:22
Shrewsmordred: jeblair: clarkb: regarding the metadata comment in https://review.openstack.org/#/c/400836/2/nodepool/cmd/nodepoolcmd.py, if i'm not supposed to used 'snapshot' as the key, what should I be using?15:22
Shrewserr, snapshot as the value15:23
*** saneax is now known as saneax-_-|AFK15:23
Shrewsbasically, what metadata tells me it is a nodepool managed image?15:23
*** abregman|afk is now known as abregman15:25
pabelangerblarg15:30
pabelangermordred: jeblair: clarkb: any objections for adding an elastic-recheck for nodepool about our database drop issue?15:31
openstackgerritMerged openstack-infra/nodepool: Re-enable test: test_list_nodes  https://review.openstack.org/40095515:32
ShrewsThis is an interesting failure: http://logs.openstack.org/57/404257/1/check/gate-nodepool-python27-ubuntu-xenial/c7cca8e/testr_results.html.gz15:37
ShrewsWe have two upload threads uploading the same image. We have checks to guard against that. Not sure what's failing15:38
Shrewsi'll need to add some logging stuff to debug15:42
mordredShrews: oh yay15:57
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading  https://review.openstack.org/40429215:57
clarkbpabelanger: nope16:00
clarkbpabelanger: that is what it is there for16:00
pabelangerkewl16:00
pabelangerjust uploaded it and pinged you in openstack-infra16:00
Shrewsmordred: yeah. i'm hoping that zk node locking actually works like we expect it to16:00
Shrewsi mean, we do test locking, too16:01
pabelangerblarg16:01
pabelangerDIB element strikes again16:01
pabelanger2016-11-29 15:54:33,234 INFO nodepool.image.build.fedora-23: + die 'Can not find Zuul public key!'16:01
mordredShrews: well, if it doesn't, this is definitely the right time to learn that16:02
clarkbShrews: I think shade sets metdata properties itself but also nodepool may set some too16:03
mordredboth set things16:04
mordredshade sets the md5/sha256 checksums16:04
mordredthose are both prefixed with owner_specified.shade16:05
mordredthen nodepool should be setting image_type16:07
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Remove unused function make_image_dict  https://review.openstack.org/40430616:11
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Move to using properties instead of metadata  https://review.openstack.org/40430716:11
mordredShrews, clarkb: nodepool sends provider_image.meta as the extra metada on an image16:13
mordredShrews, clarkb: but that's all user defined - there is currently no key that nodepool adds to an image that would let us know it's nodepool managed16:17
clarkbI think if we had onethat was "not cloud managed" that would be sufficient (since that was what the old code was doing)16:19
clarkbbut I am not sure if glance differentiates16:19
mordredthe old code was not doing that16:19
mordredactually16:19
clarkbmordred: it was16:19
clarkbit was only looking at snapshot images16:19
mordredonly for snapshot images16:19
clarkbyes, it was broken but that was the intent16:19
mordredyah16:19
mordredwell, it's a super easy fix16:19
clarkbbasically we want to keep the cloud images out of the alien list16:21
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Flag uploaded images as being managed by nodepool  https://review.openstack.org/40432216:21
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading  https://review.openstack.org/40429216:22
mordredclarkb: the logic in that patch in alien list is still wrong - one sec16:22
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Flag uploaded images as being managed by nodepool  https://review.openstack.org/40432216:23
mordredthere. that should do it16:23
*** bhavik1 has joined #zuul16:24
Shrewsmordred: we should just add all of that to my test re-enable of alien-image-list16:24
mordredShrews: ++ - lemme do that real quick16:24
jeblairmordred, clarkb, Shrews: how about we mirror the nodepool *node* metadata structure for images?16:27
jeblairhttp://git.openstack.org/cgit/openstack-infra/nodepool/tree/nodepool/provider_manager.py?h=feature/zuulv3#n21116:29
openstackgerritMerged openstack-infra/nodepool: Re-enable test: test_image_upload_fail  https://review.openstack.org/40097016:30
jeblairmordred: though istr you wanted to de-json that and split it into multiple keys (because there is a length limit for values (which we are under) though of course there is a (provider-specified <sigh>) limit on the number of keys)16:31
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests  https://review.openstack.org/40083616:32
jeblairmordred: anyway, i think the corresponding information we'd want to have there would be build number and upload number (provider and image name are obviously implied)16:32
mordredjeblair: ++16:32
clarkbjeblair: right its 255 bytes per key value or something16:33
mordredjeblair: and yes, I do want to de-jsonify it - and yes ot sigh on number of keys16:33
jeblairmordred: ok.  i left a review on your abandoned change https://review.openstack.org/404322 :)16:34
mordredjeblair: I'mma do that in a follow up patch for now just so I don't keep stepping on Shrews' patch16:35
mordredwe can squash if we want once we're happy16:35
jeblairsquash or simulprove is ok16:36
pabelangercurrent dib-image-list on nb01.o.o: http://paste.openstack.org/show/590854/16:36
pabelangergoing to see why we are not cleaning up the building state of DIBs16:37
pabelangerotherwise, going to be a long list16:37
jeblairpabelanger: it's doing things!16:37
pabelangerit is!16:37
pabelangerhttps://review.openstack.org/#/q/topic:zuulv3-builder16:37
pabelangerwe need to land ^16:37
Shrewsjeblair: i'm about to post a change to include the ZnodeStat structure in the build and upload objects. I'm concerned that node versioning might be biting us (though I haven't confirmed that) and that info could be useful for debugging.16:37
pabelangerto fix ssh key issues16:37
jeblairShrews: in the repr you mean?16:37
Shrewsjeblair: yes16:38
jeblaircool16:38
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: Add znode stat structure to build/upload objects  https://review.openstack.org/40429216:39
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: DNM: Debug of uploading  https://review.openstack.org/40433516:39
jeblairShrews: i'm catching up on the failure from earlier so i hope to be useful on this soon16:40
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Add build and upload information to image metadata  https://review.openstack.org/40433716:41
Shrewsit's weird that upload thing happened twice on back-to-back reviews. Now I've had 2 rechecks and 2 test reviews that haven't reproduced it16:41
mordredShrews, jeblair: ^^ it doesn't seem that we have access to upload id yet when we upload the image - am I reading that wrong?16:44
Shrewsmordred: correct. won't have it until storeImageUpload is called16:47
mordredk. so - jeblair, we can easily add build_id to the images, but not upload_id unless we added an additional api call to update the image metadata16:47
Shrewsmordred: the caller knows it16:48
mordredI'm not sure it's worth an extra API call16:48
mordredShrews: it does?16:48
Shrewsmordred: yes, just not passed to _uploadImage()16:48
Shrewsmordred: upnum16:48
mordredShrews: oh - awesome. thanks!16:48
mordredjeblair: ignore me16:48
Shrewsmordred: also, as far as stepping on toes, the only thing left is to add the non-empty alien image test case, which I'd actually prefer to do in a separate review.16:49
jeblairShrews: i'm looping that failed test locally, no luck so far16:50
Shrewsjeblair: yeah, i have not been able to reproduce it locally either16:50
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Add build and upload information to image metadata  https://review.openstack.org/40433716:52
mordredjeblair: ok - I think that's ^^ more inline with things16:52
Shrewsjeblair: looking at http://logs.openstack.org/57/404257/1/check/gate-nodepool-python27-ubuntu-xenial/c7cca8e/testr_results.html.gz at the failure, that zk LOST message between uploads is a bit concerning. I do not know if that's normal16:53
Shrewsit may be nothing16:54
jeblairpabelanger: looking at 404309, i think i spotted a problem -- but also, in openstackci::builder we use both nodepool and nodepool::builder -- so should we just add the key as a param to openstackci as in 404317 and then pass it to nodepool instead of nodepool::builder?16:55
mordredShrews: ok. I'll just squash onto your patch then16:55
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests  https://review.openstack.org/40083616:56
*** hashar is now known as hasharAWay16:57
pabelangerjeblair: 404309, I can move .ssh file back up to the original line, no problem16:58
jeblairpabelanger: oh, i see, i didn't see the move, i understand that change better now17:00
pabelangerjeblair: trying to understand your follow up comment for 404317, do you mind rephrasing it?17:00
jeblairpabelanger: yeah, so don't change 404309 just yet17:00
pabelangerk17:00
jeblairpabelanger: what i meant was, i think you could abandon 404309 and then in 404317, add the public key to the instantiation of class ::nodepool17:01
jeblairpabelanger: i think the way to decide between the two approaches is to say whether the public key is something that should be common to all nodepool nodes, or something that is builder-specific17:03
jeblairpabelanger: i guess it's builder-specific?  and the private key will eventually be launcher-specific?17:04
jeblairpabelanger: so maybe your approach is best? :)17:04
pabelangerjeblair: well, in fact, I don't actually like what we are doing with the public key, its specific how we build the zuul user with DIB elements today.  So, I think you might be right, abandon 404309 for now, then add the logic into 404317 or system-config17:05
jeblairpabelanger: or maybe openstackci17:06
jeblairpabelanger: at any rate, i +2d your existing changes :)17:07
pabelangerclarkb: any thoughts on ^ +topic:zuulv3-builder17:07
pabelangerjeblair: thanks17:07
jeblairShrews: i *think* what we're seeing is the command starting up and performing its ZK connect and disconnect.  which is probably telling us that an upload worker is deciding to upload after we delete the upload.17:08
clarkbpabelanger: sorry caffeinating17:09
clarkbwill pull up ina few17:09
*** abregman has quit IRC17:09
Shrewsjeblair: yeah, i think so17:09
Shrewsjeblair: oh, i grok now. so things are working correctly, but the 2nd upload happens too quickly?17:12
jeblairShrews: yeah, if that's correct, then i think it's normal behavior, i think we just need to alter the test to stop trying to upload images after we delete.17:12
jeblairShrews: or alter it to wait for that specific upload's deletion17:13
jeblairwhich might be the easier and better thing17:14
Shrewsjeblair: so make waitForImageDeletion take an optional upload ID might be the ticket17:16
jeblairShrews: i think if you throw a sleep between nodepoolcmd.main() and waitForImageDeletion, it repros.  and yeah, i think that sounds like a good solution17:17
Shrewsack17:17
clarkbjeblair: Shrews in the past we have tried to be specific and not turn off the engine once a state is reached17:17
jeblairclarkb: so that's another vote in favor if optional upload id?17:18
jeblairShrews: are you writing the fix?17:18
Shrewsjeblair: on it17:18
clarkbjeblair: ya.17:18
clarkbpabelanger: ok stack reviewed, need to add an extra check on the first change but otherwise lgtm17:19
jeblairmordred, rcarrillocruz: i'm going to set the topic of the devstack ansible changes to zuulv317:24
jeblairit was just the callback change that was missing it, now they all show up17:25
rcarrillocruz++ thx17:26
mordredjeblair: woot17:26
jeblairmordred: 404295 seems to output all the gathered facts17:29
jeblairmordred: http://logs.openstack.org/95/404295/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/5aaf721/logs/devstack-gate-setup-host.txt.gz#_2016-11-29_16_02_47_66192917:30
mordredjeblair: yup. was just asking folks about that in infra17:30
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: Fix test_image_delete  https://review.openstack.org/40435617:31
Shrewsjeblair: clarkb: ^^^17:31
jeblair+217:32
pabelangerjeblair: mind +3 on 404309 update to address clarkb comments17:49
jeblairpabelanger: done17:50
pabelangerdanke17:51
*** morgan_ is now known as morganfainberg18:03
*** morganfainberg is now known as morgan18:04
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests  https://review.openstack.org/40083618:47
Shrewsmordred: fixed the merge conflict ^^^18:48
openstackgerritMerged openstack-infra/nodepool: Remove unused function make_image_dict  https://review.openstack.org/40430618:48
*** bhavik1 has quit IRC18:51
openstackgerritMerged openstack-infra/nodepool: Fix nits in test_image_upload_fail  https://review.openstack.org/40425718:56
mordredShrews: yay!19:07
*** abregman has joined #zuul19:08
mordredclarkb: I did not +A https://review.openstack.org/400836 since I wrote code in it19:08
*** abregman has quit IRC19:09
clarkbmordred: and I just -1'd it >_>19:09
clarkbits still listing the cloud provided images in the alien list19:09
*** abregman has joined #zuul19:10
mordredclarkb: oh - that's an easy fix19:11
Shrewsmordred: you fixing that or want me to?19:13
mordredShrews: I got it19:14
mordredShrews: just going to filter the images when we get the provider_image list19:14
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests  https://review.openstack.org/40083619:14
mordredShrews, clarkb: ^^19:14
clarkbthanks will rereview post meeting19:15
clarkb(head down in code review is how I missed the start of the meeting so will avoid it until done)19:15
mordredclarkb: yah. same here :)19:15
pabelangerugh20:14
pabelangeranother DIB fail20:14
pabelangerjeblair: Shrews: What do you think about us doing a merge master to feature/zuulv3?20:14
pabelangerI just ran into a bug20:14
pabelangerwhich we fixed in master20:14
pabelangerbut are missing in feature/zuulv320:15
openstackgerritPaul Belanger proposed openstack-infra/nodepool: Stop running DIB usage report  https://review.openstack.org/40440920:17
pabelangerjeblair: clarkb: mordred: ^ is the fix we need in feature/zuulv320:17
pabelangerDid a cherry-pick20:17
pabelangeror we can merge master20:17
clarkb+220:19
clarkbthough if we think we are this close to really running the builder with zk may be a good time to work on a merge too20:19
pabelangerthings are running now20:20
pabelangermerge would help fix issues like ^20:20
jeblairpabelanger, clarkb: https://review.openstack.org/40053620:23
jeblairthat needs a refresh anyway, so i will +3 the cherry-pick20:24
clarkbok20:24
jeblairclarkb, pabelanger: jhesketh did request a review on that despite the test failure20:25
pabelangerjeblair: sure, will look today20:25
*** hasharAWay is now known as hashar20:31
*** abregman has quit IRC20:53
openstackgerritMerged openstack-infra/nodepool: Stop running DIB usage report  https://review.openstack.org/40440920:59
openstackgerritMonty Taylor proposed openstack-infra/nodepool: Re-enable alien-image-list command and tests  https://review.openstack.org/40083621:16
mordredShrews, clarkb: whoops. ^^ I had broked things21:18
pabelangeroh snap21:19
pabelanger| fedora-23-0000000004      | fedora-23      | nb01    | qcow2,vhd,raw | ready    | 00:00:00:18 |21:19
pabelangerhttp://paste.openstack.org/show/590895/21:19
pabelangernodepool image-list21:19
mordredpabelanger: woot!21:20
mordredpabelanger: that's super exciting21:20
mordredpabelanger: I just +A'd two of your patches - but gerrit claims they conflict. we'll just have to see21:21
pabelangermordred: woot indeed21:21
mordredpabelanger: so - we really do need to get jhesketh's merge master patch sorted out :)21:22
pabelangermordred: agreed, I think that will help with some yak shaving21:22
mordredmmm. shaving yaks21:22
mordredpabelanger: there is a fellow outside of vancouver who has a herd of tibetan yaks for sale21:22
pabelangerha, nice21:23
openstackgerritMerged openstack-infra/nodepool: Fix test_image_delete  https://review.openstack.org/40435621:23
openstackgerritMerged openstack-infra/nodepool: Add pause support for diskimages  https://review.openstack.org/40378021:23
pabelangerAlso need to land our nodepool patch to use checksum files, to avoid duplicate checksums21:24
mordredpabelanger: seems like we should port that patch to v3 at this point21:24
clarkbnow to see how reliable uploads will be for us21:24
clarkbpabelanger: mordred maybe just get that from merge?21:25
clarkbsince its a performamce noy functional thing21:25
mordredclarkb: well, it hasn't landed in either branch21:25
mordredso I was mostly thinking we should not land it in master then deal with it on merge since we're so close to re-merging everything21:26
clarkboh I thought the first version of it was landed21:26
mordredif it is, I totally agree it should just be a thing on merge21:26
*** willthames has joined #zuul21:28
pabelangerya, I don't think it merged yet21:29
pabelangerchecking21:29
pabelanger38351921:30
pabelangerI can port to zuulv321:30
pabelangermordred: clarkb: a review on https://review.openstack.org/#/c/404296/ for elastic-recheck. I believe I got the syntax right21:32
pabelangerfor our database drop failures21:32
Shrewspabelanger: what does the cpu usage look like on nb01? curious as to how the small delay between schedule checks is affecting it21:34
pabelanger$ cat /proc/loadavg21:35
pabelanger4.34 4.63 4.23 7/544 38721:35
pabelangerright now21:35
pabelanger277.5 CPU for nodepool-builder ATM21:35
pabelangerbut, think shade is still doing md5sum21:35
mordredwe're spending 39% of a CPU in dnf?21:36
openstackgerritIan Wienand proposed openstack-infra/nodepool: [DNM] test  https://review.openstack.org/40442821:36
mordredShrews: it does feel like an awful lot of CPU usage given that I think there is currently one dib build going on21:38
Shrewsmordred: the BuildWorker and UploadWorkers both only sleep 0.1s between their respective checks.21:38
Shrewsso they are always doing something21:38
mordredShrews: ah21:39
mordredthen that CPU usage seems about right on21:39
pabelangerWas thinking we should make that an exponential backoff on error, up to X seconds?21:39
pabelangeror mins21:39
pabelangeras not to spam logs on error21:40
jeblairremote:   https://review.openstack.org/404430 Add nb01.openstack.org to cacti21:40
jeblairpabelanger, mordred: ^21:40
pabelanger+321:40
Shrewspabelanger: backoff on upload?21:41
Shrewspabelanger: not sure how to make that work since one thread would have to keep the upload lock to keep other threads from re-attempting the upload21:41
Shrewswhich i guess it could do21:42
*** Cibo_ has joined #zuul21:42
pabelangerShrews: I haven't looked into it yet, but something I was thinking about, since my logs were filling up aggressively on a mis-configuration of clouds.yaml21:42
mordredpabelanger: easy soluiton - don't mis-configure clouds.yaml ;)21:43
Shrewspabelanger: incentive to fix your clouds.yaml  :)21:43
pabelangerindeed21:43
Shrewsit's a feature21:43
jeblairi mean, it actually is a feature21:43
jeblairwe have way more problems with uploads failing with EOPENSTACK than we have with misconfigured clouds.yamls21:44
jeblairwhich doesn't mean 'continue blowing up indefinitely' is the only answer, just that the... er... obstinancy of the builder is intentional.  :)21:46
jeblairShrews: and oh yeah, we did want to make the sleep longer, just need to keep it short in the check.21:47
jeblair(i was thinking something like 10 secs)21:47
jeblairer 'keep it short in the tests'21:47
openstackgerritPaul Belanger proposed openstack-infra/nodepool: Use diskimage-builder checksum files  https://review.openstack.org/40443221:48
Shrewsjeblair: do what we did with cleanup_interval for the CleanupWorker thread?21:49
pabelangerwill check dsvm test on that21:49
jeblairShrews: yeah, probably so21:49
Shrewsjeblair: k. i'll work on that21:49
jeblairShrews: thanks; pabelanger: that should make things 100 times better.  :)21:50
pabelanger100x improvement21:50
Shrewsgolly you maths good21:50
jeblairi learned me some math21:51
pabelangerwonder if we should enable shade debugs for now on nb0121:52
pabelangerto see what is going on21:52
pabelangernot seeing much traffic between nb01.o.o and clouds21:52
pabelangerOh, errors22:01
mordredyay!22:01
pabelangerOpenStackCloudException: Error fetching image list: 'X509' object has no attribute '_x509'22:01
mordredpabelanger: whereat?22:01
pabelanger/var/log/nodepool/builder-debug.log22:01
mordredthat is a bug in python requests, iirc22:01
pabelangersomething, something, ssl?22:02
pabelangeroff to the googles22:02
mordredpabelanger: https://github.com/kennethreitz/requests/issues/370122:03
pabelangerPyOpenSSL seems to be what people are pointing too22:03
pabelangerya22:03
pabelangerreading that22:03
openstackgerritDavid Shrewsbury proposed openstack-infra/nodepool: Make build/upload worker sleep time configurable  https://review.openstack.org/40443822:03
pabelanger python -c 'import OpenSSL; print(OpenSSL.__version__)'22:04
pabelanger0.1322:04
pabelangerlooks like we need a version bump22:04
mordredthat'll do it22:04
mordreddid we install from disto packages or something?22:04
pabelangerlet me see how it gets pulled in22:04
pabelangerchecking22:04
pabelangerii  python-openssl                   0.13-2ubuntu6                         amd64        Python 2 wrapper around the OpenSSL library22:05
pabelangerya22:05
mordreddef wanna not have that and have pip installed version instead22:06
jeblairer, on nodepool.o.o there is no openssl22:07
pabelangerchecking to see how it was installed22:07
mordrednothing seems to depend on it22:07
Shrewsmordred: pabelanger: nb01 will also need the unreleased shade to get the image meta22:07
Shrewsif it doesn't have it already22:08
pabelangerk22:08
mordredShrews: I'd actually like to release shade - since that is a fairly hefty regression22:08
Shrewsmordred: DO EET22:08
pabelanger /var/log/apt/history.log doesn't list python-openssl22:09
mordredmaybe base-image ?22:09
jeblairpabelanger: agreed, i can't find anything either22:09
pabelangermordred: ya, I wonder if that is the case22:10
mordredreally need to get around to having our own dib-created base images that are what we run our servers on22:10
pabelanger++22:10
pabelangersoon(tm(22:10
jeblairlet's go to -infra22:11
Shrewslet's go drinking. night all22:11
jeblairShrews: cheers!22:12
adam_ganyone able to clear up some confusion ive got around item.change.{refspec, branch, ref} WRT https://git.openstack.org/cgit/openstack-infra/zuul/tree/zuul/launcher/client.py?h=feature/zuulv3#n28 ?22:17
adam_gim not sure whats meant to happen there when item.change is a Ref and not a Change22:18
clarkbadam_g: iirc its supposed to tell the merger to fetch the ref22:19
clarkbso that when the job runs the ref is available to it22:19
clarkb(but that behavior may have changed with v3)22:20
adam_gclarkb: right, but the models differ quiet a bit and it looks like the logic in there is missing some smarts in dealing with some of those differences22:20
clarkbadam_g: well there its just constructing the data22:20
adam_gi can update that to fish out (what i think are) the correct bits from each, but having not much prior knowledge here, im not sure its correct22:20
clarkbso either you get a change + patchset or you get an oldrev and newrev22:20
clarkbthen things consuming the data will have to switch based on that22:21
adam_gclarkb: right, but things branch seem to be located in different places depending on what type of item.change comes in.22:22
adam_gthis is /w all the newer v3 model changes, btw22:22
clarkbfor v3 specifics there I am really the wrong person to ask, generally though the differences are in the source event type from gerrit and on the merger side making sure you have the correct ref locally for the test to have access to22:23
adam_gyep. i guess my confusion is just about the internal zuul modelling around that22:25
jeblairadam_g: one of the todos for v3 is to rework the change/changeish/ref hierarchy22:28
jeblairbut that's mostly for clarity and to simplify some code22:29
adam_gjeblair: ah, i see22:29
adam_gwas just hacking on getting some of the post tests working. lemme push up what i have and you can lemme kow if its worth working on before that happens22:29
jeblairadam_g: ok.  i *think* you should be able to proceed with just corrections, not refactoring.  but yeah, something to look at would be nice.22:30
openstackgerritAdam Gandelman proposed openstack-infra/zuul: Enable test_post  https://review.openstack.org/40444522:31
jeblairadam_g: what happens if you don't make those changes?22:33
openstackgerritIan Wienand proposed openstack-infra/nodepool: Fail on bad options to fake-image-create  https://review.openstack.org/29151622:38
pabelangerShrews: jeblair: looks like we need to do some exception handling: http://paste.openstack.org/show/590899/22:40
openstackgerritIan Wienand proposed openstack-infra/nodepool: Fail on bad options to fake-image-create  https://review.openstack.org/40445222:40
adam_gjeblair: make_merger_item() hits a few AttributeErrors22:44
jeblairpabelanger: i thought Shrews fixed that in a recent change22:45
Shrewshttps://review.openstack.org/40097022:45
jeblairpabelanger: are we running with that?22:46
pabelangerjeblair: we are now22:46
pabelangerbut not before I think22:46
pabelangerjust restarted nodepool-builder 5mins ago22:46
jeblairpabelanger: so i think we should be good.  could probably delete the old uploading ones if you want22:47
pabelangerlets see what happens22:47
jeblairpabelanger: i don't think they will be deleted until the underlying build ages out.  we might want to fix that (though with 970 this is only likely to happen after a hard-stop)22:51
pabelangerk22:51
pabelangerseems to be what I am seeing, they are hanging around in deleting state22:51
pabelangerand ya, I did do a hard stop too22:52
jeblairpabelanger: they are in deleting state?  your paste said uploading.22:52
pabelangerjeblair: I did image-delete on them22:53
pabelangerthe old ones22:53
jeblairpabelanger: then i wonder why those are not being deleted.22:53
pabelanger| 0000000004 | 0000000001 | infracloud-vanilla   | fedora-23 | None                | None              | deleting  | 00:00:04:49 |22:53
pabelangerfor example22:53
pabelangermaybe logic issue around uploads, since we didn't actually upload anything22:54
pabelangerthen, did the delete22:54
pabelangerI can try and reproduce with a test in a bit22:55
jeblairpabelanger: well, we have a test for image-delete :)22:55
pabelangerYay22:57
pabelangeruploads started22:57
pabelangerafter 10min delay22:58
clarkbwe used to have logic in nodepoold that on start would change the state automatically22:58
clarkbwe likely need that in the builder itself now because its more standalone22:58
clarkband I think there may even be a test for that explicit case22:58
pabelangerhttps://review.openstack.org/#/c/404432/ should help speed things up, and reduce duplicate checksum22:58
jeblairclarkb: no, we don't want to change state on startup because it's a multi-node system22:59
clarkbjeblair: but uploads are process specific22:59
clarkb(as are builds)22:59
clarkbso even though its multinode you want to update the db for the specific upload/build that belonged to a dead process22:59
jeblairclarkb: sure, but it does not make sense for a builder B to change builder A's uploads to delet on startup.22:59
*** hashar has quit IRC23:00
clarkbya but thats an implementation detail23:00
jeblairit *does* make sense for builder B to delete builder A's failed uploads.  and that is the intent.23:00
clarkbwe need something that will update state for processes that die23:00
jeblairwe have it23:00
clarkboh pabelanger indicated it required manual intervention?23:00
jeblairwe are talking about the implementation details of this thing :)23:00
jeblairthere's a lot going on, let's take it one at a time23:01
jeblairfirst, the things that are in the uploading state are due to a bug that is fixed in a patch that is landed23:01
jeblairso that shouldn't happen again23:01
jeblairsecond, since that did happen with a version of the program that was running before the fix landed, manual corrective action was indicated23:02
jeblairthat is not working as expected23:02
jeblairbecause the state is set to delete, but the record is not actually being deleted23:02
jeblairi think that our test for that is not testing what we think23:03
pabelangerYa, sorry for confusing.23:03
clarkbgotcha so the fix is potentially flawed23:03
clarkbby not doing deletes completely23:04
jeblairno, the fix should be fine23:04
jeblairwhich fix?23:04
clarkb23:01:25    jeblair | first, the things that are in the uploading state are due to a bug that is fixed in a patch that is landed23:04
clarkbwhatever that is23:04
Shrewsi think there are too many unknowns to make a call. how many builds are there? how many successful uploads? what was the state of things when the delete was issued?23:04
Shrewswithout more info, i cannot say that things are not working as we expect them to23:04
jeblairShrews: i don't think there are any unknowns23:05
jeblairi just think we're talking about 5 things and not being specific enough23:05
Shrewsjeblair: i have unknowns. all i have is a pastebin23:05
jeblairso as annoying as it is, if this(these) conversation(s) are going to be fruitful, we need to be extremely specific23:05
jeblairShrews: i'm working from the same pastebin23:05
jeblairokay, so the first thing:23:06
jeblairhttp://paste.openstack.org/show/590899/23:06
jeblairlots of uploads in 'uploading' state -- more than one for each image-provider combo23:06
jeblairthose are uploads that failed, but the process was running before 400970 landed, so we didn't put them in the failed state23:07
jeblairthat has landed, the current process is running that, so that bug is fixed now.23:07
jeblairto clarkb's point on this...23:08
jeblairit is possible for us to see this behavior again (as i noted earlier) if a builder is killed while uploading23:08
jeblairthat's much more rare than a failed image uploa23:08
jeblairwe *might* want to think about having the cleanupworker deal with that by checking that uploads in the 'uploading' state really are uploading23:09
pabelangerYa, that might be good23:09
pabelangeralso, this just happened:23:09
pabelanger2016-11-29 23:08:28,439 INFO nodepool.builder.UploadWorker.13: Image build 0000000004 in infracloud-vanilla is ready23:09
Shrewsif that is all the info, then there is only a single READY build. image-delete will delete any provider uploads, but the build remains READY. since it's the only one (we keep the two most recent), it will not be deleted.23:10
clarkbconsidering we are almost always uploading (based on current builder) the chances of hitting that case will be high)23:10
jeblairbut it's not super urgent, because the uploadworker itself will start a new upload if an existing one is not actually in progress.23:10
pabelangeragree, is a cosmetic thing for me right now23:11
jeblairthe second problem is why image-delete didn't do anything23:11
jeblairor, well, it set the state to deleting, but no actual deleting happened23:12
pabelangerI can upload a new pastebin showing that issue23:12
pabelangerhttp://paste.openstack.org/show/590901/ is the current output23:12
pabelangerafter running: nodepool image-delete --provider infracloud-chocolate --image fedora-23 --upload-id 0000000001 --build-id 000000000423:12
jeblairShrews: and yes, i believe that's the case for this issue -- the cleanup worker is ignoring all uploads related to that build because the build has not expired23:13
pabelangerI wasn't expecting the 'deleting' state to persist.  However, jeblair has an idea of the issue I believe23:13
clarkbpabelanger: is the build id required there? that seems unnecessary (since the upload id is what ties us to a provider) (not trying to distract from other discussion we can talk about this when the first thing is wrapped up)23:13
openstackgerritAdam Gandelman proposed openstack-infra/zuul: Enable test_post*  https://review.openstack.org/40446623:13
jeblairwe should probably rework that so that the cleanupworker will delete anything in the deleting state23:14
pabelangerclarkb: required today, but I think that CLI command is going to change.  global UUIDs for iamges?23:14
clarkbjeblair: ++23:14
clarkbpabelanger: even without a global uuid isn't the provider + image name + upload id unique?23:14
jeblairwe do have a test for image-delete, however, it consideres an image deleted as soon as it doesn't show up as ready23:14
pabelangerclarkb: Shrews would know23:14
jeblairso i think there is a mismatch in what we consider 'deleted' versus what that test is looking for23:15
jeblairclarkb, pabelanger: no, upload ids start at 1 again for each new image build23:15
clarkbjeblair: ah ok hrm23:15
pabelangerright23:15
pabelanger3 images online now23:16
pabelangerfor fedora-2323:16
clarkb120 to go :)23:16
clarkb(I forget the exact number but its not small)23:16
pabelangerwe only have fedora-23 right now, still building other images23:17
pabelangerdebian-jessie almost done23:17
clarkbpabelanger: I can go ahead and write the zookeeper change if you want23:18
clarkbthogu h Iam going to guess that will require a restart of zookeeper23:18
pabelangerclarkb: sure, I haven't looked yet23:18
clarkboh the puppet module already sets it to 10k23:19
* clarkb checks the version we are running23:19
Shrewsjeblair: the deleted state will persist until the build itself ages out.23:19
Shrewsalso, maybe we need a new build-delete command that recursively deletes all successful uploads and, lastly, the build itself23:20
pabelangerclarkb: oh, nice23:20
* Shrews really shouldn't cook and irc simultaneously23:20
pabelangerShrews: from your fridge? say from your fridge :)23:21
mordredfinally a use for IoT!23:21
jeblairShrews: that's the dib-image-delete command23:21
clarkbpabelanger: https://github.com/deric/puppet-zookeeper/blob/v0.5.5/manifests/init.pp#L42 so we might be fine as is23:22
pabelangersnapCount=1000023:22
pabelangerconfirmed on nodepool.o.o23:22
Shrewsjeblair: ah, right. forgot that existed23:22
jeblairneat, i just used boartty to add a story with no tasks23:22
jeblairclarkb, Shrews, pabelanger: story for the first thing: https://storyboard.openstack.org/#!/story/200081123:23
Shrewspabelanger: mordred: doesn't everyone have their fridge beep when their nick is mentioned on irc?23:24
clarkbjeblair: lgtm23:24
pabelangeragreed23:24
Shrewsjeblair: code already does that23:25
jeblairclarkb, Shrews, pabelanger: story for the second thing: https://storyboard.openstack.org/#!/story/200081223:25
jeblairShrews: if it did, we would not have seen uploads stay in 'uploading' after the builder was restarted23:26
Shrewsjeblair: it won't be cleaned up if it's a current build23:26
Shrewsi think what you all are driving at is cleaning up the cruft from current builds23:27
Shrewswhich does not happen currently23:27
pabelangerYup23:28
Shrewsso the first SB needs rephrased a bit to clarify it23:28
clarkbright the problem with that state is that a build in uploading state should mean that that build will or is being uploaded. But after catastrophic failure that is not the case so we should clean it up23:28
jeblairShrews: yeah, i think you're right.  these both may be the same bug.  ;)23:31
jeblairfundamentally, i think the algo to decide when to delete a ready image/upload is correct, but it should be separate from when to delete images/uploads that are in non-ready states23:34
jeblairadam_g: do you have those attribute errors handy?  i'm curious/confused why this was broken23:35
jeblairShrews: we may go less insane if we just separate them and do it as two passes23:36
* clarkb checks and confirms already insane23:36
clarkbETOOMANYTHINGSTOTRACK23:36
pabelangerjeblair: adam_g: I had this problem I believe a while back, have an existing WIP review: https://review.openstack.org/#/c/393887/23:36
pabelangerbut haven't debugged some of the why yet23:37
jeblairShrews: could even have pass 1 be "decide which ready things to delete", then pass 2 be "delete thing in deleting state"23:37
jeblair(and maybe handle 'failed' in pass 2 as well, or just drop the state altogether and set it to deleting on failure)23:38
jeblairpabelanger: i agree with your question there :)23:38
pabelangerYa, it was confusing23:39
pabelangerbut old review, 3 weeks23:39
pabelangerhaven't looped back to it, so happy for adam_g to dig into it :)23:39
openstackgerritIan Wienand proposed openstack-infra/nodepool: Activate virtualenv before running dib  https://review.openstack.org/40396623:48
pabelangerend of line for me23:52
pabelangerimages are uploads, builds are happening on nb01.o.o23:53
pabelangerlets see what happens over night23:53

Generated by irclog2html.py 2.14.0 by Marius Gedminas - find it at mg.pov.lt!