Wednesday, 2021-07-07

opendevreviewJames E. Blair proposed zuul/zuul-registry master: Build images on bionic  https://review.opendev.org/c/zuul/zuul-registry/+/79972800:28
opendevreviewJames E. Blair proposed zuul/zuul-registry master: Add a restricted mode (read authentication required)  https://review.opendev.org/c/zuul/zuul-registry/+/78838900:46
opendevreviewJames E. Blair proposed zuul/zuul-registry master: Make TLS optional  https://review.opendev.org/c/zuul/zuul-registry/+/78839000:46
opendevreviewJames E. Blair proposed zuul/zuul-registry master: Add content-length headers and debug messages  https://review.opendev.org/c/zuul/zuul-registry/+/79106800:46
opendevreviewJames E. Blair proposed zuul/zuul-registry master: Add missing size to image configs  https://review.opendev.org/c/zuul/zuul-registry/+/79972600:46
opendevreviewJames E. Blair proposed zuul/zuul-registry master: Also include missing size attributes in layers  https://review.opendev.org/c/zuul/zuul-registry/+/79972900:53
*** iury|holiday is now known as iurygregory05:53
opendevreviewSimon Westphahl proposed zuul/zuul master: Store tenant layout state in Zookeeper  https://review.opendev.org/c/zuul/zuul/+/77146106:52
boha[m]<avass[m] "boha: apparently we are using th"> Thanks, that did the trick ;-). Is the task "debug" with hello message from the quick start guide supposed to work on windows? I might have messed something else up in the process, but I only see pre and post jobs running.07:27
*** jpena|off is now known as jpena07:46
*** bhagyashris_ is now known as bhagyashris|ruck08:37
*** rpittau|afk is now known as rpittau09:42
avass[m]boha: yes that should work, usually when tasks don't show up it's because they target a node that does not exist in inventory and the task is skipped11:27
*** jpena is now known as jpena|lunch11:39
opendevreviewSimon Westphahl proposed zuul/zuul master: Fix race waiting for watches to be registered  https://review.opendev.org/c/zuul/zuul/+/79979712:09
opendevreviewSimon Westphahl proposed zuul/zuul master: dnm: try to reproduce race condition  https://review.opendev.org/c/zuul/zuul/+/79979812:21
*** jpena|lunch is now known as jpena12:43
*** eliadcohen__ is now known as eliadcohen13:47
opendevreviewFelix Edel proposed zuul/zuul master: Switch to ZooKeeper backed NodesProvisionedEvents  https://review.opendev.org/c/zuul/zuul/+/79983313:50
*** bhagyashris is now known as bhagyashris|ruck13:59
*** bhagyashris_ is now known as bhagyashris|ruck14:24
clarkbcorvus: Is the next set of v5 work finishing up the result event migration to zk and the storage of tenant configs in zk?15:29
corvusclarkb: yes, there are some changes ready for review under "hashtag:sos" -- i plan on reviewing them today15:30
clarkbwhy sos?15:30
corvusscale out scheduler15:31
clarkbthat is a fun collision :)15:31
corvus;)15:31
corvusthere's one nodepool change in there too -- we're going to need a little more information in the node requests to account for the fact that the schedulers won't be tracking node requests in memory any more (and a scheduler may have to process a node request submitted by another scheduler)15:33
clarkbhttps://review.opendev.org/q/hashtag:%2522sos%2522+(status:open) for anyone else wanting to see the changes.15:33
clarkbI'll try to dig into those today as well15:34
corvusso there's a change to basically add some extra data to the noderequest that is opaque to nodepool, but zuul can see it and know what node requests belong to what builds15:34
corvusthat's https://review.opendev.org/798746 -- it might be good to go ahead and get that merged and in a nodepool release so than the next zuul release can depend on it15:35
clarkbI'll start there then15:35
clarkbcorvus: in that nodepool change we need nodepool to be aware of the data (but then ignore it) so that when it serializes it back into zk after updates we don't lose that information?15:42
corvusclarkb: exactly15:43
clarkbhttps://review.opendev.org/c/zuul/nodepool/+/798746 it is actually a very straightforward change if anyone else has time to review it15:44
clarkbtobiash[m]: tristanC ^ maybe?15:44
tobiash[m]I already have it open ;)15:44
*** marios is now known as marios|out16:06
*** rpittau is now known as rpittau|afk16:09
opendevreviewMerged zuul/nodepool master: Add requestor_data field to node request  https://review.opendev.org/c/zuul/nodepool/+/79874616:45
*** jpena is now known as jpena|off16:52
*** mgoddard- is now known as mgoddard16:54
opendevreviewJames E. Blair proposed zuul/zuul-operator master: Add instructions and tools for running tests with kind  https://review.opendev.org/c/zuul/zuul-operator/+/78576317:31
opendevreviewJames E. Blair proposed zuul/zuul-operator master: Fix error with multiple nodepool providers  https://review.opendev.org/c/zuul/zuul-operator/+/79987317:31
opendevreviewJames E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test  https://review.opendev.org/c/zuul/zuul-operator/+/79987417:31
clarkbhttps://review.opendev.org/c/zuul/zuul/+/799797 failed both unittest runs on test_cancel_starting_build that change doesn't affect that particular test so I will reapprove it when done with my review, but that may indicate a race or bug in that particular test17:55
opendevreviewJames E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test  https://review.opendev.org/c/zuul/zuul-operator/+/79987418:04
opendevreviewJames E. Blair proposed zuul/zuul-operator master: Add instructions and tools for running tests with kind  https://review.opendev.org/c/zuul/zuul-operator/+/78576318:04
clarkbI've got `tox -e py38 -- --until-failure test_cancel_starting_build` running locally now18:05
corvusclarkb: yeah swest is trying to get more info on that in https://review.opendev.org/79979818:05
clarkbwe'll see if I can trip over it locally18:05
clarkboh cool18:05
opendevreviewJames E. Blair proposed zuul/zuul master: dnm: try to reproduce race condition  https://review.opendev.org/c/zuul/zuul/+/79979818:07
corvusi'll set up a local loop too18:07
clarkbcorvus: ok I caught one but I was running against master. Let me rerun on top of swest's change18:12
clarkbcorvus: do we maybe need to do similar to https://review.opendev.org/c/zuul/zuul/+/799797/1/tests/unit/test_zk.py and wait for our watch to be in place before cancelling the build?18:15
clarkbwe have the ordering correct right now (watch before cancel) but can the watch creation happen on the backend after creating the object?18:16
clarkbcorvus: I got one with the extra logging. Let me push it up somewhere18:18
clarkbI need to take a break though, but can read through it after18:18
clarkbcorvus: test_cancel_starting_build.log in my homedir on the zuul scheduler18:21
clarkbok back in a bit18:21
clarkblooking at that log the watcher seems to be in place first and even hits the znode doesn't exist yet path18:49
clarkbI think I see it. The watch event fires, then we delete the znode, then we get the data via the watch event18:50
clarkbI think I see the fix too.18:51
clarkbit almost seems like if the delete races the watcher that the data watch function isn't called19:24
clarkbwhich may be related to the issue we had to fix last week?19:24
corvusclarkb: i'm back; i'll take a look at your log19:27
corvusoh also my local run caught something19:28
clarkbcorvus: if you look at the log you can see the watch start, then later the cancel znode gets created, but verysoon after it also gets deleted. By the time the watch anlder tries to read the data from zk it gets a NoNodeError19:29
clarkbcorvus: I've got a change that rewrites the data watch function to try and handle that better but it is still failing occasionally so I'm adding more debugging in my log19:29
clarkband it seems like maybe we get called with a None event more often than I expected19:30
clarkboh wait19:31
clarkbI need send_event?19:31
clarkbreading the docs it seems send_event is only necessary on a childwatch19:32
corvusclarkb: if we set two threading.Events, we can confirm that we have performed the initial get which sets the watch before we cancel the build, then the second call should be for the created event (which we should get even if the zoned is deleted by the time we call 'get')19:33
clarkbcorvus: yup that is basically what my current fix does, but it tries to check the event type is CREATED too on the second event19:33
clarkband I think I had at least one case where that wasn't the case19:33
clarkbI'll push up what I have now anyway I guess and we can improve it if necessary19:33
clarkbI'm up to 81 successful runs after adding more debug logging to try and determine why it wasn't a CREATED event19:34
clarkbalso I suppose it is possible I derped when running tests and I didn't :w properly19:36
corvusheh :)19:39
opendevreviewClark Boylan proposed zuul/zuul master: Fix data watch race in test_cancel_starting_build  https://review.opendev.org/c/zuul/zuul/+/79979819:40
clarkbI reused the same change id as the extra logging change but cleaned it up a bit to make it mroe mergable. I also removed the parent change dependencies as those are unrelated19:41
clarkbfeel free to move it back or restore the logging in that change etc19:41
clarkbI'm running that test locally in a loop again to build confidence in what I pushed. It failed on me but I also did rebase and stuff and it is possible that tox caught something while the git repo was changing states19:42
clarkbI'll leave git alone now19:42
corvusclarkb: oh, that's not quite what i was imagining; let me flesh it out and i'll push up another19:43
clarkbcorvus: yours should be about the same, its just using the watcher state to track the first event in your example19:44
clarkb*mine is just using the watcher state to track the first event in your example19:44
clarkbok mine just failed locally without any changes to disk going on in the repo at the time19:48
clarkbI'm rerunning again with the logging put back in19:49
corvusjust working on the commit message now19:50
corvusclarkb: i'm doubting my fix right now...19:55
corvusclarkb: looking at your log again... i don't see a response to this transaction:19:56
corvus2021-07-07 11:13:40,043 kazoo.client                     DEBUG    Sending request(xid=22): Exists(path='/test/uTMrboPF19071tests.unit.testscheduler.TestScheduler.testcancelstartingbuild/zuul/build-requests/unzoned/e4268fef7b0541ca814ba540803f8d18/cancel', watcher=<bound method DataWatch._watcher of <kazoo.recipe.watchers.DataWatch object at 0x7f754c148fa0>>)19:56
corvusthen again, i also don't see that in a normal test run19:57
clarkbcorvus: ya so that seems to be in the first pass of the watch path19:58
clarkbit first calls the GetData and that returns NoNodeError so it tries Exists next19:58
corvusyep; and it's exists that actually sets the zk watch19:58
corvusso seeing a response to that would give me a warm fuzzy19:59
clarkbya and if that never gets a response maybe we aren't hitting the first pass19:59
corvusbut i think it's a red herring -- i don't think kazoo logs reponses to exists19:59
corvusi think for the moment we must assume exists is called and correctly sets the watch20:00
clarkbI'm going to try and add some richer logging ot hte data watch function20:03
clarkband see what it shows rather than just the kazoo logging20:03
clarkbcorvus: with the code I have pushed and only adding a self.log.debug(event) to the top of data_watch() it seems that data_watch() is only called once20:05
corvusthat should not be the case20:06
clarkbya it is definitely unexpected20:07
clarkbI'm putting some unique strings in that logger so that I don't just get None back out (this way I'm confident that is the log line I see)20:08
corvusi see it called 2x20:08
corvusbut i'm not running your code20:08
clarkbI think the old code could fail in the 2x case too20:08
clarkbmy change should only fail if it is called 1x? but I'm trying to confirm that now20:08
clarkbyup confirmed now20:09
clarkbwith my chagne when it fails it seems to fail when data_watch is only called once20:09
corvusclarkb: which call do you get?20:10
clarkbcorvus: event is None so I think it is the initial setup call20:10
corvusthat makes sense.  that one should be more-or-less synchronous, in that we shouldn't return from creating the DataWatch until that has happened.20:11
corvusbut that does sort of suggest something along the lines of "exist() did not set the watch"20:12
clarkbReceived EVENT: Watch(type=1, state=3, path='/test/kfSdnawG_25854_tests.unit.test_scheduler.TestScheduler.test_cancel_starting_build/zuul/build-requests/unzoned/b3160cca502446c291b4f6e8cd51456b/cancel') it does seem toget the created event notification at least20:14
clarkbwould it get that if it wasn't watching the node?20:14
corvusnope -- so that makes me think more along the lines of "kazoo didn't map the event to the watcher"20:14
clarkbThen the Delete for that path happens then the GetData from the watch handler then the exists20:14
clarkbbut then it doesn't seem to call our supplied function20:15
clarkbcorvus: I think the issue is in kazoo20:16
corvusso there was a get_data callback?20:16
clarkbhttps://github.com/python-zk/kazoo/blob/master/kazoo/recipe/watchers.py#L205 initial_version == self._version because we have never seen any real data20:16
clarkbcorvus: yes I think it is getting to ^ that line then not calling our function again20:17
corvusclarkb: i agree20:19
clarkbkazoo is basically going "this state is the same as the previous state so I noop"20:19
corvusclarkb: maybe for this we should just use a low-level exists call with a watch20:19
corvusinstead of a DataWatch20:19
corvuswe don't even need to handle reconnection events, it's a unit test20:20
clarkbcorvus: could we use a childwatch maybe?20:20
clarkbon the parent side20:20
corvusi think that's subject to race20:21
clarkbThen just assume if that gets triggered the cancel node was created20:21
corvusi think an exists with a watcher is exactly what we want20:21
corvusit will get called when the node exists :)20:21
clarkbok I'll try that20:22
clarkbany idea what the type of the watch method is there?20:23
corvusclarkb:     def _watcher(self, event):20:24
corvuslike that20:24
clarkbthat seems to be working. Let me put this together20:26
corvusyeah, i've got it running in a loop locally too.  i think it's going to be less code than before actually :)20:27
clarkbyup its quite small20:27
opendevreviewClark Boylan proposed zuul/zuul master: Fix data watch race in test_cancel_starting_build  https://review.opendev.org/c/zuul/zuul/+/79979820:31
clarkbsomething like that I think20:31
clarkboh i have an unused var in there one sec20:31
opendevreviewClark Boylan proposed zuul/zuul master: Fix data watch race in test_cancel_starting_build  https://review.opendev.org/c/zuul/zuul/+/79979820:31
corvusclarkb: i left 3 notes :)20:32
opendevreviewClark Boylan proposed zuul/zuul master: Fix data watch race in test_cancel_starting_build  https://review.opendev.org/c/zuul/zuul/+/79979820:33
clarkbcorvus: ^ all three should be fixed in that ps20:33
corvusclarkb: cool, i think we can probably +w that, and ping tobiash, swest, and felixedel so they're aware of that DataWatch behavior nuance20:34
clarkb++20:34
clarkbI'm up to 107 successful runs so far with that code20:35
avass[m]today was a productive day. I just spent five hours trying to figure out why my new router couldn't get a dhcp lease. turns out the cable plugged into the wall was lose and it happened to work by a fluke when i plugged it in directly to my desktop...20:35
corvusavass: wow20:35
clarkbavass[m]: the best kind of bug20:35
fungithat sure sounds like a wednesday20:36
avass[m]best part is that we we're two people debugging it and we even started reading the sourcecode for udhcpc to figure out what was happening :)20:44
corvusavass: if you had just called tech support, they would have asked "did you try unplugging it and plugging it back in?" :)20:45
avass[m]oh right, they were also able to see my mac address (?) and saw that I got an ip address when I plugged it into my desktop.20:46
avass[m]in my experience tech support is usually pretty good here as long as you describe the steps you've already taken20:47
avass[m]at least we've got and answer to "How many software engineers do you need to fix a hardware problem"20:48
avass[m]if we just ignore that I studied electrical engineering for a moment... :)20:49
clarkbcorvus: swest you may want to check my simplification suggestions on https://review.opendev.org/c/zuul/zuul/+/795985 I didn't -1 because the code isn't wrong, but I didn't +2 because I think the code will be a lot easier to follow with that simplification21:02
clarkbbut if you prefer I can go ahead and approve it as is and we can sort that out in followups21:02
opendevreviewJames E. Blair proposed zuul/zuul-operator master: Add instructions and tools for running tests with kind  https://review.opendev.org/c/zuul/zuul-operator/+/78576321:04
opendevreviewJames E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test  https://review.opendev.org/c/zuul/zuul-operator/+/79987421:04
opendevreviewJames E. Blair proposed zuul/zuul-operator master: Fix error with multiple nodepool providers  https://review.opendev.org/c/zuul/zuul-operator/+/79991721:04
corvusclarkb: i think we can wait for swest to update that21:06
corvusi agree that will make it easier to read21:06
avass[m]So how close is v5 anyway? Do we have an approximate timeline?21:08
corvusavass: on the order of months21:09
clarkbprogress is picking up though seems like. I'm trying to dedicate more time to helping it along at least via reviews21:11
corvuswhat do folks think about bumping nodepool to 4.7.0, so that the next zuul version can be 4.7.0 and can depend on the same nodepool version?  or should we just do 4.2.0 for nodepool?21:14
corvus(this is because zuul will need the requestor_data nodepool change)21:16
clarkbI don't think syncing up versions is super important. But if users expect things to line up maybe that is the best option21:16
fungiwe've not made attempts to sync up minor revisions in the past, but if there's going to be a zuul dependency on a new nodepool feature then maybe that's a good enough excuse to skip a few versions21:16
corvusokay, maybe just 4.2.0 then?21:17
clarkbwfm21:17
corvuszuul-maint: how does this look for a nodepool release?  commit 9757e3f1ceade7ca1a866edc95243f6e29fc9522 (HEAD -> master, tag: 4.2.0, origin/master)21:17
clarkbOne thing that comes to mind is that dib is likely to get a new release if it hasn't already to fix centos builds iirc.21:19
fungiyeah, that matches origin/master for me, and i agree there are quite a few feature changes in there since the last tag (4.1.0) so looks correct21:19
clarkbmgiht want to bump that dep too if a new dib version exists /me looks21:19
clarkba new version does not exist. Lets do 4.2.0 /me checks the sha21:20
clarkb9757e3f1ceade7ca1a866edc95243f6e29fc9522 at 4.2.0 looks correct to me21:20
corvusk, i'll give that some more time in case others want to weigh in, then i'll push in a bit21:21
avass[m]Syncing versions between zuul and nodepool would make things slightly easier, but i assume there's a release note to tell people which versions are compatible and I don't see any reason why new installations would want to run an older version of zuul or nodepool. So that doesn't seem like a big deal to me21:33
avass[m]Unless you've got multiple zuuls that uses the same nodepools21:34
corvusit's a one way dependency, so that's not a problem21:35
corvus(upgrade nodepool first before upgrading any zuuls)21:35
avass[m]Ah yeah21:35
fungican multiple individual schedulers share a launcher?21:35
fungii didn't even realize that was possible21:36
avass[m]I just assumed that's possible, but maybe there would be key collisions if someone tried that21:37
corvusfungi: yes, but i think we may have inadvertently made that more difficult/impossible due to the v5 work.  i don't think there's a way to separate the /zuul hierarchy from the nodepool root.21:37
corvusbut that's not too hard to fix21:37
corvus(it's not a structural problem, more of an oversight)21:37
fungiahh, yeah i guess as long as they all shared teh same zk cluster21:37
corvusin v3, only nodepool was in zk, so you just pointed X zuuls at the same chroot in zk; zuul's data was all local to that zuul21:38
corvusbut now zuul data are stored in /zuul under the same chroot where /nodepool is21:38
corvusso we probably need to either make the nodepool zk connection a second connection, or add an extra path element for zuul21:39
corvussince no one has complained about that, i'm guessing there are zero such users, so it's probably not our most pressing concern21:40
corvus(zuul's multi-tenancy sort of mitigates that)21:41
opendevreviewMerged zuul/zuul master: Fix data watch race in test_cancel_starting_build  https://review.opendev.org/c/zuul/zuul/+/79979821:56
opendevreviewMerged zuul/zuul master: Fix race waiting for watches to be registered  https://review.opendev.org/c/zuul/zuul/+/79979721:56
corvuspushed nodepool 4.2.022:26
fungithanks!22:28
opendevreviewJames E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test  https://review.opendev.org/c/zuul/zuul-operator/+/79987422:46
opendevreviewJames E. Blair proposed zuul/zuul-operator master: WIP: add static node to functional test  https://review.opendev.org/c/zuul/zuul-operator/+/79987423:26
corvusnodepool release announcement sent23:31
corvusswest: https://review.opendev.org/799463 approved but has a +2 comment from clarkb23:45
clarkbits a super minor thing23:47
clarkbcorvus: can you check my comments on https://review.opendev.org/c/zuul/zuul/+/799464 particularly the third one?23:58
clarkbtl;dr is I'm not sure that will properly handle tenant deletion23:58
clarkbthe old code made a new abide which created a new abide.tenants OrderedDict but now we reuse the abide and I think we never discard old tenants that go away23:59

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!