Monday, 2021-07-19

-opendevstatus- NOTICE: The maintenance of the review.opendev.org Gerrit service is now complete and service has been restored. Please alert us in #opendev if you have any issues. Thank you03:30
*** dpawlik0 is now known as dpawlik05:10
*** mgoddard- is now known as mgoddard06:04
*** dpawlik5 is now known as dpawlik06:50
*** hashar is now known as Guest136507:02
*** hashar_ is now known as hashar07:02
*** dpawlik7 is now known as dpawlik07:06
*** dpawlik3 is now known as dpawlik08:30
*** dpawlik4 is now known as dpawlik08:44
*** rpittau|afk is now known as rpittau09:31
opendevreviewAndy Ladjadj proposed zuul/zuul master: [DNM] use packedrefs for all refs  https://review.opendev.org/c/zuul/zuul/+/80127609:55
opendevreviewAndy Ladjadj proposed zuul/zuul master: [DNM] use packedrefs for all refs  https://review.opendev.org/c/zuul/zuul/+/80127609:57
opendevreviewAndy Ladjadj proposed zuul/nodepool master: [driver][aws] add a cache on retrieve image id  https://review.opendev.org/c/zuul/nodepool/+/80128811:14
opendevreviewAndy Ladjadj proposed zuul/nodepool master: [driver][aws] add a cache on retrieve image id  https://review.opendev.org/c/zuul/nodepool/+/80128811:50
opendevreviewPaul Belanger proposed zuul/nodepool master: First ensure ssh connection is valid before scanning keys  https://review.opendev.org/c/zuul/nodepool/+/80118912:59
opendevreviewTristan Cacqueray proposed zuul/zuul master: Document RETRY build status  https://review.opendev.org/c/zuul/zuul/+/80130313:10
*** sshnaidm|afk is now known as sshnaidm13:10
clarkbcorvus: I seem to recall mention of fixing a static node nodepool test? is test_static_parallel_increase the one that was failing? it hit pabelanger[m]'s change above if so. Do we have a fix up for that yet?14:45
corvusclarkb: that doesn't ring a bell14:48
clarkbhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_a28/801189/2/check/nodepool-tox-py36/a28686a/testr_results.html has the traceback. It looks like a zk issue trying to unlock a noderequest without holding the noderequest14:49
clarkbI'll recheck pabelangers change and see if it is persistent14:50
clarkband then maybe run that test in a loop locally since that worked well the last time we had an issue14:51
tristanCis it possible to use the rebase merge mode in github (request from https://storyboard.openstack.org/#!/story/2009062 ) ?15:57
clarkbtristanC: zuul had support for it at one time with gerrit (not sure if it got ripped out). We found it created weird interactions that were not easily addressed. I don't recall the specifics though.15:59
clarkbMight be worthwhile digging into the history there (it is probably in the zuul git log somewhere?) and evaluating if those issues are still issues and whether or not they can be addressed for other hosting systems like github16:00
clarkbOne problem was specific to gerrit and that was that it rewrote the commits (github may do the same?)16:01
clarkbthis meant that even in a fast forward situation you never got the same commit hash out again16:01
tristanCclarkb: thank you, that is my recollection too16:01
clarkb"The rebase and merge behavior on GitHub deviates slightly from git rebase. Rebase and merge on GitHub will always update the committer information and create new commit SHAs, whereas git rebase outside of GitHub does not change the committer information when the rebase happens on top of an ancestor commit."16:04
clarkbhttps://docs.github.com/en/github/administering-a-repository/configuring-pull-request-merges/about-merge-methods-on-github#rebasing-and-merging-your-commits16:04
clarkbit seems that github has this problem too. Whether or not that is still a problem for zuul I don't know16:04
clarkbtristanC: did you want to respond to the story with some of that info? I can in a bit if you don't16:06
corvushere's a puzzle -- i'm looking into zk connection losses on opendev.  check out the following log entries: https://paste.opendev.org/show/807564/16:07
corvusthe first traceback makes perfect sense to me.16:07
corvusthe second traceback is very confusing -- how did we jump from the deferred exception rais in kazoo/handlers/utils.py to the gerrit connection election?16:08
corvus(line 35/36 of that paste)16:08
corvus35 should have been the end... is there some data corruption happening inside of kazoo?16:09
clarkbcorvus: is it possible the log messages ended up out of order and there is another "Traceback (most recent call last):" out of sequence somewhere?16:10
clarkbtaht would mean we have 3 exceptions instead of 216:10
clarkbor at least 3 I guess16:10
corvusclarkb: i don't see any in the recent history above that point.  below that is another disconnection and traceback; i think that's a separate event.16:11
corvusalso, the individual chunks in that make sense16:12
clarkbin that case it certainly seems like call stacks are weird. THough I haven't dug into the specific code to be more confident of that16:12
corvusi've checked 4 disconnects so far, and they all have the same pattern in at least one TB.16:20
tristanCclarkb: i don't mind copying your answer, though it's not urgent and i'd rather let you do it if you don't mind16:24
clarkbtristanC: ok I'll get to that in a bit16:25
tristanCcorvus: isn't that a property of AsyncResult?16:27
opendevreviewAndy Ladjadj proposed zuul/zuul master: [DNM] use packedrefs for all refs  https://review.opendev.org/c/zuul/zuul/+/80127616:27
tristanC(that the exception can comes from another code path)16:27
*** marios is now known as marios|out16:29
pabelanger[m]clarkb: do have info about the failing non-voting nodepool test? is that what you were looking into16:29
clarkbpabelanger[m]: no I think that is expected due to the zk ssl work? the failure in the py36 unittests was what I asked about earlier16:30
clarkbpabelanger[m]: I don't think that failure is related to your chagne as it seems related to zk trying to unlock a nodrequest and failing16:30
tobiash[m]corvus: we've also experienced zk connection losses after updating on saturday. We had to rollback16:30
pabelanger[m]okay, great. thanks for looking16:31
tobiash[m]corvus: if you check the logs for 'zookeeper connection' you may see it toggles between suspended and connected16:31
pabelanger[m]if I could get a few reviews on https://review.opendev.org/c/zuul/nodepool/+/801189, it would be helpful for us on zuul.a.c. Even if we could schedule a release for it too16:31
corvustristanC: the exception attached to asyncresult should just include a stack trace from the call stack within the kazoo client; i don't understand why the call stack there would include the leader election16:31
tobiash[m]corvus: swest also discovered a merged but yet unrelease fix on the kazoo connection level: https://github.com/python-zk/kazoo/pull/63316:32
tristanCsince the traceback contains two get_children_async call, could it be that the handler is mixing the two here: https://github.com/python-zk/kazoo/blob/master/kazoo/client.py#L123516:47
corvustristanC: i wonder -- it's certainly suspicious that both tracebacks in my paste relate to get_children on the same znode16:50
opendevreviewTristan Cacqueray proposed zuul/zuul master: Document RETRY build status  https://review.opendev.org/c/zuul/zuul/+/80130316:54
clarkbtristanC: https://storyboard.openstack.org/#!/story/2009062 has been updated with some of my notes from above17:25
tobiash[m]clarkb, tristanC commented on ^17:32
*** rpittau is now known as rpittau|afk17:38
tobiash[m]pabelanger: lgtm17:42
opendevreviewChao Zhang proposed zuul/zuul-jobs master: Update commits since tag calculation  https://review.opendev.org/c/zuul/zuul-jobs/+/80137018:03
opendevreviewChao Zhang proposed zuul/zuul-jobs master: Update commits since tag calculation  https://review.opendev.org/c/zuul/zuul-jobs/+/80137018:04
opendevreviewMerged zuul/zuul master: Document RETRY build status  https://review.opendev.org/c/zuul/zuul/+/80130318:26
opendevreviewMerged zuul/nodepool master: First ensure ssh connection is valid before scanning keys  https://review.opendev.org/c/zuul/nodepool/+/80118918:49
opendevreviewChao Zhang proposed zuul/zuul-jobs master: Update commits since tag calculation  https://review.opendev.org/c/zuul/zuul-jobs/+/80137019:59
opendevreviewChao Zhang proposed zuul/zuul-jobs master: Update commits since tag calculation  https://review.opendev.org/c/zuul/zuul-jobs/+/80137020:00
opendevreviewJames E. Blair proposed zuul/zuul master: Split ZK event queues by type  https://review.opendev.org/c/zuul/zuul/+/80138721:37
corvusclarkb: i think that's something you suggested on an earlier review?21:38
clarkbcorvus: yup, as it simplifies the idea of what a queue is :)21:38
clarkbI'll give it a review21:38
corvusi'm about to make some changes to the event queues so that they shard large events (there's a possibility that's what caused the problems tobiash et al saw over the weekend), so i figured that would be a good first step before further changing the queue structure21:39
corvusclarkb: i think it's failing my local testing -- so if you do look at it at this point, maybe just review it for the overall concept :)21:41
clarkbnoted21:41
clarkbcorvus: the overall shape of the change lgtm. One thought is how will migrating to that look for those of us running the old system? will we orphan those events? probably fine if so21:49
clarkbor maybe they are ephemeral already21:49
corvusclarkb: i believe we'll orphan them21:58
corvuscoming up soon on my list is the "delete everything from zk" cli utility; we can use that to clean up the cruft21:58
clarkbcorvus: https://review.opendev.org/c/zuul/zuul/+/799127 is also related to some feedback I left on a change I think. Any idea what is up with testing on that one?22:10
corvusokay, found the issue; was missing an ensure_path; it's not a substantial change.  will upload new ps when local tests finish.22:10
corvusclarkb: no, i haven't looked into that yet.22:11
opendevreviewJames E. Blair proposed zuul/zuul master: Split ZK event queues by type  https://review.opendev.org/c/zuul/zuul/+/80138722:14
corvusthat should be gtg now22:14
corvusnow on to sharding22:14
clarkbcorvus: you didn't remove the tenant specific string.22:14
clarkbnot a big deal but you called it out so pointing it out22:15
opendevreviewJames E. Blair proposed zuul/zuul master: Split ZK event queues by type  https://review.opendev.org/c/zuul/zuul/+/80138722:15
corvusclarkb: derp thx22:15
opendevreviewJames E. Blair proposed zuul/zuul master: Shard event queue data  https://review.opendev.org/c/zuul/zuul/+/80140223:44
corvusokay that's step 2.  i think 1+2 are the minimum needed to address potential issues with large events and results.  i think we should also do one more change which is to utilize that new system to forward events with the data in-place; that should further reduce zk traffic and churn.23:46
corvuswell, that's a lot harder than it sounds (because some of the forwarding methods use quite a bit of the event data, so it needs to be read anyway).  might still be able to avoid a write, but it's not a slam-dunk.  i may stop here for now.23:55
corvusswest, tobiash: please take a look at https://review.opendev.org/801387 and https://review.opendev.org/801402 and let me know what you think23:56
opendevreviewPaul Belanger proposed zuul/nodepool master: Add release note for previous commit  https://review.opendev.org/c/zuul/nodepool/+/80140323:57
pabelanger[m]clarkb: tobiash corvus: I should have included a release note in my change, to allow for a nodepool release.23:58
pabelanger[m]^23:58
clarkbpabelanger[m]: one msall thing but if we have you here should be a quick update (left it on the chnage)23:59

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!