openstackgerrit | Merged openstack-infra/zuul: Re-enable test_success_pattern as test_success_url https://review.openstack.org/400455 | 00:10 |
*** jamielennox is now known as jamielennox|away | 00:17 | |
*** jamielennox|away is now known as jamielennox | 00:23 | |
*** jamielennox is now known as jamielennox|away | 00:46 | |
*** jamielennox|away is now known as jamielennox | 00:53 | |
pabelanger | okay, fedora-23-0000000004 didn't delete properly | 01:58 |
pabelanger | http://paste.openstack.org/show/591193/ | 01:58 |
pabelanger | we'll have to debug it in the morning | 01:58 |
clarkb | pabelanger: it waits for all the uploads to delete first right? | 02:02 |
pabelanger | clarkb: yup, they look to be in the deleting state | 02:02 |
pabelanger | let me get output | 02:02 |
clarkb | but they are still there | 02:02 |
clarkb | eg they have to be removed | 02:02 |
pabelanger | right, let me check | 02:03 |
pabelanger | http://paste.openstack.org/show/591194/ | 02:03 |
pabelanger | checking infracloud | 02:03 |
clarkb | (not saying this is correct but ^ may explain it) | 02:03 |
*** persia has quit IRC | 02:06 | |
pabelanger | infracloud looks good | 02:06 |
clarkb | pabelanger: meaning the image isn't in infracloud? | 02:07 |
pabelanger | right | 02:07 |
clarkb | (but it still has a node in zk) | 02:07 |
pabelanger | I see 4 fedora-24 images | 02:07 |
pabelanger | and times look correct | 02:08 |
*** persia has joined #zuul | 02:09 | |
pabelanger | Ya, 0004 is not in infracloud-vanilla | 02:09 |
pabelanger | but still exists in ZK | 02:09 |
pabelanger | so, out of sync | 02:09 |
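A minimal sketch of the cross-check being described here: compare the image-upload records nodepool keeps in ZooKeeper against what the provider actually has, so records that exist only in ZK ("out of sync") stand out. The znode path layout and JSON field names below are assumptions inferred from the pasted records, not taken from the nodepool source.

```python
# Hedged sketch: list nodepool image-upload records from ZooKeeper so they
# can be compared against the provider's real image list. The znode layout
# (/nodepool/images/.../providers/.../images/...) is an assumption.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

def uploads_for(image, build, provider):
    base = '/nodepool/images/%s/builds/%s/providers/%s/images' % (
        image, build, provider)
    if not zk.exists(base):
        return {}
    uploads = {}
    for upload_id in zk.get_children(base):
        data, _ = zk.get(base + '/' + upload_id)
        uploads[upload_id] = json.loads(data.decode('utf8')) if data else None
    return uploads

# Anything listed here whose external_name is missing from the cloud's own
# image list (e.g. `openstack image list`) is a stale, out-of-sync record.
print(uploads_for('fedora-23', '0000000004', 'infracloud-vanilla'))
zk.stop()
```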
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Add log message when deleting images https://review.openstack.org/405741 | 02:27 |
*** jamielennox is now known as jamielennox|away | 03:11 | |
*** jamielennox|away is now known as jamielennox | 03:20 | |
*** saneax-_-|AFK is now known as saneax | 03:27 | |
*** jamielennox is now known as jamielennox|away | 03:30 | |
*** jamielennox|away is now known as jamielennox | 03:37 | |
*** saneax is now known as saneax-_-|AFK | 03:58 | |
*** bhavik1 has joined #zuul | 05:19 | |
*** openstackgerrit has quit IRC | 06:33 | |
*** jamielennox is now known as jamielennox|away | 07:18 | |
*** saneax-_-|AFK is now known as saneax | 07:43 | |
*** jamielennox|away is now known as jamielennox | 07:44 | |
*** openstackgerrit has joined #zuul | 07:57 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul: tools: Add git-repo support script https://review.openstack.org/405976 | 07:57 |
*** openstackgerrit has quit IRC | 08:03 | |
Zara | jeblair: ah, cool, I'm guessing those are all tasks in stories tagged v3, which is why automatic lanes can't be used (yet)? I'm glad the api makes it possible to work around that way for now! (and yeah, the more content in the board, the slower things will be, so I'd recommend only using a 'merged' lane if you need to) | 09:31 |
*** openstackgerrit has joined #zuul | 09:40 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul: Read all Gerrit events from poll interruption https://review.openstack.org/406019 | 09:40 |
*** hashar has joined #zuul | 09:53 | |
openstackgerrit | Tobias Henkel proposed openstack-infra/zuul: tools: Add git-repo support script https://review.openstack.org/405976 | 11:49 |
*** Cibo_ has joined #zuul | 12:32 | |
*** bhavik1 has quit IRC | 13:21 | |
*** saneax is now known as saneax-_-|AFK | 13:38 | |
*** dmsimard is now known as dmsimard|pto | 13:39 | |
openstackgerrit | Tristan Cacqueray proposed openstack-infra/zuul: Don't getChange on source not triggering a change https://review.openstack.org/406129 | 14:01 |
pabelanger | morning | 14:10 |
pabelanger | image builds / uploads worked for nb01.o.o | 14:10 |
pabelanger | looking into cleanup this morning | 14:10 |
Shrews | pabelanger: let me know if i can help. | 14:28 |
pabelanger | Shrews: I think I see the issue, upload.external_name is None | 14:29 |
pabelanger | so, we are raising an exception | 14:29 |
pabelanger | however | 14:29 |
Shrews | oh neat | 14:29 |
pabelanger | http://paste.openstack.org/show/591246/ | 14:29 |
pabelanger | that is the traceback | 14:30 |
pabelanger | so, if we clean up TypeError, I think nodepool-builder will do the right thing | 14:30 |
Shrews | hrm, how can we delete the image through the manager without an external name? Is that an error propagated from somewhere else? | 14:31 |
Shrews | should we be using external_id instead? | 14:31 |
Shrews | (that code was copied straight from the older builder code) | 14:32 |
pabelanger | Shrews: I think this is left over crud from your delete patch | 14:32 |
pabelanger | in fact, we don't have any images that are uploaded | 14:32 |
pabelanger | but | 14:32 |
pabelanger | {'build_id': u'0000000003', 'external_name': None, 'state_time': 1480674727, '_state': u'deleting', 'image_name': u'ubuntu-xenial', 'provider_name': 'ovh-bhs1', '_id': u'0000000001', 'external_id': None} | 14:32 |
pabelanger | that is the current value | 14:32 |
Shrews | so we probably just need an "if external_name:" before the deleteImage call | 14:35 |
pabelanger | think so | 14:35 |
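This is roughly the guard being proposed: skip the provider delete when the upload never got an external_name, but still drop the ZooKeeper record so the builder stops retrying it. The helper names (manager.deleteImage, zk.deleteUpload) are illustrative assumptions, not the exact nodepool API.

```python
# Hedged sketch of the "if external_name:" guard discussed above.
def delete_upload(manager, zk, upload):
    if upload.external_name:
        # The upload actually reached the cloud; ask the provider to remove it.
        manager.deleteImage(upload.external_name)
    # Either way, remove the ZK record so the cleanup loop stops raising
    # TypeError and hammering the cloud over an image that does not exist.
    zk.deleteUpload(upload.image_name, upload.build_id,
                    upload.provider_name, upload.id)
```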
*** Cibo_ has quit IRC | 14:38 | |
Shrews | pabelanger: add it to 405741? | 14:42 |
pabelanger | Shrews: sure | 14:44 |
*** jamielennox is now known as jamielennox|away | 14:45 | |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Handle delete exception when external_name is empty https://review.openstack.org/406151 | 14:46 |
pabelanger | oops | 14:46 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Handle delete exception when external_name is empty https://review.openstack.org/405741 | 14:47 |
pabelanger | okay, left a comment about how this could happen | 14:48 |
pabelanger | however, concerned we could leak images | 14:48 |
pabelanger | but, in our case, I think we just have bad data in zookeeper | 14:50 |
pabelanger | I'm also going to stop nodepool-builder, as we are hammering our clouds trying to delete images that don't exist | 14:52 |
Shrews | ++ | 14:53 |
pabelanger | back shortly, running an errand | 15:03 |
*** jhesketh has quit IRC | 15:04 | |
*** jhesketh has joined #zuul | 15:06 | |
*** openstackgerrit has quit IRC | 15:33 | |
clarkb | pabelanger: also this is what the alien image list is for, to find those leaks if they happen | 15:55 |
*** openstackgerrit has joined #zuul | 16:23 | |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Handle delete exception when external_name is empty https://review.openstack.org/405741 | 16:23 |
clarkb | pabelanger: maybe ^ should come with a test? | 16:24 |
pabelanger | clarkb: Sure, was thinking that too | 16:25 |
clarkb | I would let it run with the fakes to successfully upload, then edit the zk entry to set externals to None then see that it gets cleaned up | 16:25 |
pabelanger | Ya, that works | 16:25 |
*** saneax-_-|AFK is now known as saneax | 16:39 | |
*** saneax is now known as saneax-_-|AFK | 16:55 | |
pabelanger | okay | 17:01 |
pabelanger | reproduced the issue | 17:01 |
*** bhavik1 has joined #zuul | 17:06 | |
pabelanger | remote: https://review.openstack.org/405741 Make sure we clean up diskimages with invalid external_name | 17:10 |
pabelanger | clarkb: ^ when you have time, see what you think of the test | 17:10 |
pabelanger | that should be close to what happened yesterday | 17:10 |
pabelanger | jeblair: Shrews: ^too | 17:11 |
clarkb | pabelanger: I'll admit I don't immediately grok when the test starts setting the rebuild age to 2 days but then later we only decrement by one day | 17:15 |
clarkb | pabelanger: oh ya I think that's not necessary in this test? | 17:17 |
pabelanger | clarkb: I wanted to emulate 2 days passed for our build times. So, first image build is 172800, 2nd 86400, and 3rd would be now() | 17:17 |
*** hashar has quit IRC | 17:17 | |
clarkb | ok, maybe a comment explaining that might be good | 17:18 |
pabelanger | sure | 17:18 |
clarkb | pabelanger: ok posted comments. Mostly nits but also I think the assertion can be more robust | 17:22 |
pabelanger | looking | 17:22 |
pabelanger | clarkb: re: 3rd comment, imageDelete would not happen in this case, since we don't have an external_name. Suggestions on how to check we actually don't issue the deleteImage() call? | 17:28 |
clarkb | pabelanger: I think what you are looking for is that the upload no longer exists in zk | 17:28 |
pabelanger | ya, we can do that | 17:28 |
clarkb | basically get uploads 0000001 should be None | 17:29 |
clarkb | or [] etc | 17:29 |
pabelanger | yes | 17:29 |
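The test shape clarkb describes could look something like the following: let the fake provider upload successfully, null out the external fields in the stored record, then wait for the cleanup worker and assert the upload record disappears. The fixture helpers used here (startBuilder, waitForImage, getUploads, storeImageUpload, iterate_timeout) are assumptions about the test harness, not necessarily the real names.

```python
# Hedged sketch of the test discussed above, assuming test fixtures with
# helpers roughly like these.
def test_cleanup_upload_without_external_name(self):
    self.startBuilder('node.yaml')            # fakes upload successfully
    image = self.waitForImage('fake-provider', 'fake-image')

    # Simulate the bad data seen in production: an upload record whose
    # external_name/external_id were never filled in.
    upload = self.zk.getUploads('fake-image', image.build_id,
                                'fake-provider')[0]
    upload.external_name = None
    upload.external_id = None
    upload.state = 'deleting'
    self.zk.storeImageUpload('fake-image', image.build_id,
                             'fake-provider', upload)

    # The cleanup worker should drop the record instead of raising
    # TypeError; eventually no uploads remain for this build/provider.
    for _ in self.iterate_timeout(60, 'upload cleanup'):
        if not self.zk.getUploads('fake-image', image.build_id,
                                  'fake-provider'):
            break
```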
SpamapS | jeblair: that's amazing! | 17:37 |
SpamapS | jeblair: (the auto board you made) | 17:37 |
SpamapS | jeblair: will it get confused if we add some stories as well? I like the idea of having epic stories in 'todo' that invite people to click through and _add_ tasks, like the test adding one. | 17:38 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Make sure we clean up diskimages with invalid external_name https://review.openstack.org/405741 | 17:44 |
jeblair | SpamapS: nope, that will be fine -- it ignores stories | 17:46 |
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Make sure we clean up diskimages with invalid external_name https://review.openstack.org/405741 | 17:47 |
pabelanger | okay, commands to hopefully explain the madness | 17:47 |
pabelanger | comments* | 17:47 |
jeblair | SpamapS: i can point the script at board 41 if you want | 17:47 |
pabelanger | jeblair: Shrews: do you mind reviewing 405741 again | 17:53 |
jeblair | np | 17:59 |
SpamapS | jeblair: sounds like a good idea to me. Let me add a new lane. | 18:03 |
SpamapS | jeblair: ok, it has a New lane, and I removed Done since the script will just cull the finished things | 18:05 |
SpamapS | jeblair: maybe we can put your script in the zuulv3 branch under a contrib dir or something? | 18:06 |
jeblair | SpamapS: good idea, will do. | 18:17 |
jeblair | 18:17 < openstackgerrit> James E. Blair proposed openstack-infra/infra-specs: Zuul v3: use Zookeeper for Nodepool-Zuul protocol https://review.openstack.org/305506 | 18:18 |
jeblair | Shrews, jhesketh, mordred, clarkb: ^ i have updated that based on comments and things we've learned so far... if you can take a look at that today or monday, that would be great -- that should let us get it on the infra agenda tuesday, and hopefully have it in place by the time we're ready to start on it. | 18:20 |
mordred | jeblair: looks good on first read - I'm going to give it another read before voting to make sure | 18:29 |
jeblair | mordred: cool, thx. i added extra detail (mostly because i had to think a little more deeply to answer questions/comments and didn't want it to go to waste) but hopefully it's not too much. | 18:30 |
mordred | no - I think it's great detail - it makes it easier to read | 18:31 |
mordred | sort of like how the extended lord of the rings movies actually feel shorter | 18:31 |
Shrews | pabelanger: technically, you should be locking the upload before calling storeImageUpload (similar to storeBuild), but I can't mentally trace what issues not doing so in that test might cause | 18:32 |
jeblair | the two towers is so different | 18:32 |
clarkb | Shrews: pabelanger in the test, as long as the builder has reached steady state there shouldn't be any contention for the resources; however, doing the locking is probably not a bad idea just for sanity | 18:34 |
Shrews | clarkb: cleanup thread does locking | 18:34 |
clarkb | Shrews: but only when it makes changes which happens after these updates right? | 18:34 |
clarkb | but ya better/easier to just lock and not worry than try to figure out if it is safe not to | 18:35 |
pabelanger | sure, I can lock | 18:36 |
pabelanger | give me a sec | 18:36 |
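The locking pattern Shrews is asking for, sketched with assumed helper names (lockImageUpload/unlockImageUpload): the point is simply that the test should hold the upload's lock while rewriting its record, mirroring what the builder's own threads do.

```python
# Hedged sketch: hold the upload lock around storeImageUpload, the same way
# the cleanup thread does. Lock helper names are assumptions.
def store_upload_locked(zk, image_name, build_id, provider, upload):
    zk.lockImageUpload(image_name, build_id, provider, upload.id)
    try:
        zk.storeImageUpload(image_name, build_id, provider, upload)
    finally:
        zk.unlockImageUpload(image_name, build_id, provider, upload.id)
```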
openstackgerrit | Paul Belanger proposed openstack-infra/nodepool: Make sure we clean up diskimages with invalid external_name https://review.openstack.org/405741 | 18:41 |
pabelanger | Shrews: does that look right?^ | 18:51 |
pabelanger | ah, you added +1 cool | 18:51 |
pabelanger | jeblair: clarkb: okay, if you don't mind reviewing 405741 again, should be able to merge and start nodepool-builder again on nb01.o.o | 18:52 |
pabelanger | and things cleanup magically | 18:52 |
pabelanger | :) | 18:52 |
Shrews | This pretty simple change has been seeking some love for a while: https://review.openstack.org/404292 | 18:54 |
pabelanger | I'll work on the sorted output of image-list once nodepool-builder is restarted | 18:58 |
jlk | just a quick question, is there any active work ongoing for github integration with zv3? | 18:59 |
jeblair | jlk: nope, it's deferred until we have the first stage working | 18:59 |
jlk | (we're discussing some workflow stuff and github limitations right now) | 18:59 |
jlk | jeblair: okay thanks. | 19:00 |
openstackgerrit | Merged openstack-infra/nodepool: Add znode stat structure to build/upload objects https://review.openstack.org/404292 | 19:04 |
*** morgan has quit IRC | 19:34 | |
*** TheJulia has quit IRC | 19:34 | |
openstackgerrit | Merged openstack-infra/nodepool: Make sure we clean up diskimages with invalid external_name https://review.openstack.org/405741 | 19:43 |
pabelanger | restarting nodepool-builder | 19:50 |
pabelanger | yay | 19:50 |
pabelanger | things are happening | 19:50 |
pabelanger | http://paste.openstack.org/show/591281/ | 19:51 |
pabelanger | current image-list: http://paste.openstack.org/show/591282/ | 19:52 |
jeblair | that's nicer | 19:55 |
pabelanger | oh, haha | 19:55 |
pabelanger | checksums still exist | 19:55 |
pabelanger | need to add them into cleanup worker | 19:55 |
jeblair | :) | 19:55 |
pabelanger | we still have some cruft in dib-image-list: http://paste.openstack.org/show/591284/ | 19:56 |
pabelanger | will have to see what is going on there | 19:56 |
jeblair | pabelanger: may i suggest adding a remove function to dibimagefile? that way we centralize all the things needed to remove a "file" from the filesystem? | 19:56 |
pabelanger | jeblair: sure, that makes sense | 19:56 |
jeblair | and yeah, i agree those 'deleting' ubuntu-precise images should be gone (within a minute or so after starting i would think) | 19:59 |
jeblair | pabelanger: did the underlying files for those builds get deleted (or never created)? | 20:00 |
jeblair | pabelanger: if so, the builder won't know it's responsible and won't delete them | 20:00 |
pabelanger | jeblair: never created, we have diskimage-builder failures | 20:00 |
pabelanger | had* | 20:01 |
jeblair | so yeah, i think we will need to use zk shell for that | 20:01 |
pabelanger | likely so | 20:01 |
pabelanger | I plan on adding more testing around disk-image-create to make sure we don't leak info like that, haven't looped back to it | 20:02 |
pabelanger | I should make a story for it | 20:02 |
jeblair | and probably make sure we handle that case better in the future (remove the znode if the build fails). or, use the builder name to determine whether a builder is responsible | 20:02 |
pabelanger | let me see how to do it in zk-shell | 20:03 |
jeblair | (to summarize: only the responsible builder can delete the image build znode because it must also delete the file from disk. currently, whether the file is on disk is used to determine whether a given builder is responsible. so the case where a failure happens and no file is created needs to be handled somehow) | 20:04 |
clarkb | If it's set to failed, the builder that built it should be able to safely remove the znode regardless of the on-disk situation, right? | 20:04 |
clarkb | ah | 20:04 |
clarkb | that answered my question as if you knew I would ask | 20:04 |
jeblair | :) | 20:05 |
jeblair | clarkb: right. so we either have the builder remove the znode immediately after failure (reduces the edge case down to "the builder is forcefully killed before the file is created"), or use the builder name in the znode as authoritative (makes unique and unchanging hostnames "important"). | 20:05 |
jeblair | pabelanger: ^ some copy/paste material for the story :) | 20:06 |
clarkb | maybe use a nodepool install specific identifier that defaults to hostname? | 20:07 |
pabelanger | ack | 20:07 |
clarkb | then we can manage logical ownership separately | 20:07 |
jeblair | ya | 20:10 |
jeblair | i think in our case it will always be okay to use hostname (until we decide to store builds on $networkfilesystem -- then the configured logical id makes sense) | 20:11 |
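A small sketch of clarkb's "logical owner" idea: record an install-specific builder identifier on each build, defaulting to the hostname, so responsibility for deleting the znode does not depend on whether the image file happens to be on local disk. The 'builder-id' config key and the 'builder' record field are made up for illustration.

```python
# Hedged sketch: decide build ownership by a configured identifier that
# defaults to the hostname, rather than by the presence of a file on disk.
import socket

def builder_identifier(config):
    # 'builder-id' is a hypothetical config key for this sketch.
    return config.get('builder-id') or socket.gethostname()

def is_responsible(build_record, config):
    # Only the builder whose identifier is recorded on the build znode may
    # delete it (and the file, if one was ever created).
    return build_record.get('builder') == builder_identifier(config)
```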
*** hashar has joined #zuul | 20:12 | |
pabelanger | okay, I've cleaned up the cruft in zk-shell | 20:15 |
pabelanger | have to run an errand, I'll address ^ once I get back | 20:16 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Delete hard upload failures from current builds https://review.openstack.org/406342 | 20:16 |
Shrews | pabelanger: jeblair: Is something like that ^^^ what you guys had in mind? | 20:17 |
*** TheJulia has joined #zuul | 20:19 | |
*** morgan has joined #zuul | 20:25 | |
Shrews | the only thing about that change is that you lose a record of the failure, unless you examine logs | 20:25 |
Shrews | though that error should be very rare now, actually | 20:28 |
clarkb | Shrews: isn't that error dependent on the cloud? | 20:33 |
Shrews | clarkb: we've only seen that happen on older versions where we weren't trapping the exception | 20:33 |
clarkb | or is this a case of host instance reboots? | 20:33 |
clarkb | gotcha | 20:34 |
Shrews | i've since fixed that | 20:34 |
Shrews | so it's a "clean up our oops from before" and "should it somehow happen again, clean that up too" | 20:34 |
clarkb | ya | 20:35 |
clarkb | maybe we want an explicit log for it? | 20:35 |
clarkb | somewhere around line 362 of builder.py | 20:35 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Delete hard upload failures from current builds https://review.openstack.org/406342 | 20:37 |
Shrews | gah | 20:38 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Delete hard upload failures from current builds https://review.openstack.org/406342 | 20:38 |
Shrews | clarkb: ^^^ log added | 20:38 |
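Not the actual 406342 change, just a sketch of the shape being discussed: when a current build carries uploads that ended in a hard failed state, log that explicitly (so the record of the failure isn't lost silently) and delete the record. The state value and helper names are assumptions.

```python
# Hedged sketch of cleaning up hard upload failures from a current build,
# with an explicit log line before each delete. Names are assumptions.
def cleanup_failed_uploads(log, zk, image_name, build_id, provider):
    for upload in zk.getUploads(image_name, build_id, provider):
        if upload.state == 'failed':
            log.info('Removing failed upload %s of %s (build %s) in %s',
                     upload.id, image_name, build_id, provider)
            zk.deleteUpload(image_name, build_id, provider, upload.id)
```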
clarkb | cool will post proper review when done eating lunch | 20:39 |
*** bhavik1 has quit IRC | 20:39 | |
clarkb | but looks fine at first glance | 20:39 |
openstackgerrit | Adam Gandelman proposed openstack-infra/zuul: Re-enable merge-mode config option and add more tests https://review.openstack.org/406361 | 21:06 |
clarkb | Shrews: in this change why are we restricting this cleanup to builds_to_keep? | 21:21 |
clarkb | Shrews: is it not possible for a build record to have a non ready state and also have upload in uploading state? | 21:22 |
Shrews | clarkb: because a non-ready state build should be deleted by the cleanup thread. these are already handled in that case | 21:26 |
clarkb | Shrews: even if there are outstanding records in a non delete state? | 21:29 |
Shrews | clarkb: yep, unless one of those is a most recent upload | 21:30 |
Shrews | but an aged out build should have already been cleaned up by this new code while it was current, anyway | 21:31 |
clarkb | kk | 21:33 |
jeblair | lgtm | 21:56 |
Shrews | hrm, there was a bug there, but surprised the test did not catch it | 22:08 |
Shrews | ah, found it | 22:17 |
*** jamielennox|away is now known as jamielennox | 22:22 | |
openstackgerrit | Ricardo Carrillo Cruz proposed openstack-infra/zuul: WIP Add per-repo public and private keys https://review.openstack.org/406382 | 22:28 |
openstackgerrit | Ricardo Carrillo Cruz proposed openstack-infra/zuul: WIP Add per-repo public and private keys https://review.openstack.org/406382 | 22:30 |
Shrews | i'm impressed at how much that patch just does NOT work | 22:41 |
pabelanger | Shrews: jeblair: just noticed, fedora-24 currently has 2 uploads in the 'uploading' state for the same provider: http://paste.openstack.org/show/591299/ | 22:47 |
pabelanger | I believe we are only uploading 1 image right now, so possibly it is a stale status | 22:47 |
openstackgerrit | Ricardo Carrillo Cruz proposed openstack-infra/zuul: WIP Add per-repo public and private keys https://review.openstack.org/406382 | 22:48 |
Shrews | pabelanger: it would have to be. uploading is locked. either one is from a failure, or perhaps you killed it during an upload | 22:48 |
Shrews | my last review should get rid of that, but it currently doesn't work | 22:49 |
pabelanger | Shrews: Oh, that is possible. I might have stopped nodepool-builder while an upload was in process | 22:49 |
pabelanger | k | 22:49 |
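One way the stale 'uploading' case can be recognized, sketched here with assumed lock helpers: a record still marked uploading whose lock can be acquired has no worker thread behind it (for example because the builder was killed mid-upload), so it is safe to treat as abandoned.

```python
# Hedged sketch: probe the upload lock to detect records abandoned in the
# 'uploading' state. Lock/delete helper names are assumptions, and the
# cleanup of the lock node itself is elided for brevity.
def cleanup_stale_uploads(zk, image_name, build_id, provider):
    for upload in zk.getUploads(image_name, build_id, provider):
        if upload.state != 'uploading':
            continue
        if zk.lockImageUpload(image_name, build_id, provider,
                              upload.id, blocking=False):
            # Nothing holds this lock, so no upload is actually in flight;
            # the record was left behind and can be removed.
            zk.deleteUpload(image_name, build_id, provider, upload.id)
            zk.unlockImageUpload(image_name, build_id, provider, upload.id)
```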
openstackgerrit | Ricardo Carrillo Cruz proposed openstack-infra/zuul: WIP Add per-repo public and private keys https://review.openstack.org/406382 | 22:51 |
Shrews | So... VERY WEIRD zk behavior, but when i fix that code to actually delete the 'uploading' node, a new build ID is generated for that build, but it isn't from something that the builder does | 23:03 |
jeblair | efriday | 23:03 |
jeblair | Shrews: if you want to push up a non-working patchset, i'd be happy to poke at it. | 23:04 |
jeblair | 23:07 < openstackgerrit> James E. Blair proposed openstack-infra/infra-specs: Zuul v3: update with Ansible role information https://review.openstack.org/381329 | 23:07 |
openstackgerrit | David Shrewsbury proposed openstack-infra/nodepool: Delete hard upload failures from current builds https://review.openstack.org/406342 | 23:07 |
Shrews | jeblair: run that single test with that ^^^ and look at the FOUND BUILDS log lines | 23:08 |
Shrews | the JSON errors are thrown because build ID 000000002 shows up, but there is no data there to parse | 23:09 |
jeblair | cool, i see that behavior. i will eat some chocolate and ponder. | 23:10 |
openstackgerrit | Adam Gandelman proposed openstack-infra/zuul: Re-enable merge-mode config option and add more tests https://review.openstack.org/406361 | 23:10 |
Shrews | jeblair: let me know when your eyes start to bleed, as mine have. i will go do dinner type things for a while | 23:14 |
*** hashar has quit IRC | 23:44 | |
* SpamapS trying to figure out where test_rerun_on_error gets the 'RUN_ERROR' state at all | 23:53 | |
SpamapS | jeblair: I only ever see 'RUN_ERROR' as a build state in the zuul test suite. Is it possible this test is an old vestige of the pre-2.5 days and some aspect of jenkins? | 23:56 |
SpamapS | like I can't even see how anything other than the test suite itself arrives at this RUN_ERROR result | 23:56 |