Friday, 2022-10-21

-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-sphinx] 862215: Workaround circular import warning https://review.opendev.org/c/zuul/zuul-sphinx/+/86221503:40
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-sphinx] 862216: Workaround circular import warning https://review.opendev.org/c/zuul/zuul-sphinx/+/86221603:42
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-sphinx] 862215: Workaround circular import warning https://review.opendev.org/c/zuul/zuul-sphinx/+/86221503:45
-@gerrit:opendev.org- Ian Wienand proposed: [zuul/zuul-jobs] 861588: Revert sphinx pin https://review.opendev.org/c/zuul/zuul-jobs/+/86158803:46
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 862192: Fix test_ensure_cloned and JWT tests https://review.opendev.org/c/zuul/zuul/+/86219206:35
-@gerrit:opendev.org- Zuul merged on behalf of Tony Breeds: [zuul/zuul-jobs] 861909: [docs][trivial] Remove redundant colon : https://review.opendev.org/c/zuul/zuul-jobs/+/86190908:38
-@gerrit:opendev.org- Zuul merged on behalf of Simon Westphahl: [zuul/zuul] 861376: Fix re-use of existing job data when hash matches https://review.opendev.org/c/zuul/zuul/+/86137610:52
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 861494: Add JobData refresh test https://review.opendev.org/c/zuul/zuul/+/86149410:52
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 862134: [DNM] TEST zuul_stream: close streamer when task is skipped https://review.opendev.org/c/zuul/zuul/+/86213411:56
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 862194: Defensively handle null JobData hashes https://review.opendev.org/c/zuul/zuul/+/86219412:09
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 862134: [DNM] TEST zuul_stream: close streamer when task is skipped https://review.opendev.org/c/zuul/zuul/+/86213412:20
-@gerrit:opendev.org- Thomas Cardonne proposed wip: [zuul/zuul] 862134: [DNM] TEST zuul_stream: close streamer when task is skipped https://review.opendev.org/c/zuul/zuul/+/86213412:45
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:13:47
- [zuul/zuul] 860033: Add access-rules configuration and documentation https://review.opendev.org/c/zuul/zuul/+/860033
- [zuul/zuul] 860034: Set Access-Control-Allow-Origin headers in check_auth tool https://review.opendev.org/c/zuul/zuul/+/860034
- [zuul/zuul] 860035: Support authz for read-only web access https://review.opendev.org/c/zuul/zuul/+/860035
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed on behalf of Ian Wienand: [zuul/zuul] 860632: web/api: Reject unknown methods and consolidate CORS and request checking https://review.opendev.org/c/zuul/zuul/+/86063213:47
-@gerrit:opendev.org- Zuul merged on behalf of Benjamin Schanzel: [zuul/nodepool] 853151: Fix doc for k8s mem limits as MiB instead of MB https://review.opendev.org/c/zuul/nodepool/+/85315114:45
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/nodepool] 861797: DNM checking docker image builds https://review.opendev.org/c/zuul/nodepool/+/86179715:37
@clarkb:matrix.orgcorvus: I think the nodepool python310 job might be flaky. I'm still in PTG things but if no one else looks at it before me I can try to take alook this afternoon15:58
@jim:acmegating.comClark: interesting that it's just py310.  seems mostly to be related to static driver.  i wonder if it's the slots change.16:03
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:16:07
- [zuul/nodepool] 862375: DNM: run a lot of py310 tests https://review.opendev.org/c/zuul/nodepool/+/862375
- [zuul/nodepool] 862376: Revert "Add "slots" to static node driver" https://review.opendev.org/c/zuul/nodepool/+/862376
@jim:acmegating.comClark: i don't have a lot of time to dig into that now, but there are some changes that should help us decide if we should revert the slots change.16:07
@clarkb:matrix.orgthanks16:07
@jim:acmegating.comhas anyone volunteered to look into the pyjwt issue?16:07
@clarkb:matrix.orgnot that I've seen yet. I may have time for that one this afternoon too. But this morning is all PTG16:07
@jim:acmegating.comwell, its an opportunity for someone else to contribute to maintenance if they want16:12
@clarkb:matrix.org++16:12
@michael_kelly_anet:matrix.orgCan I get some reviews on:16:16
https://review.opendev.org/c/zuul/zuul-operator/+/861279/3
https://review.opendev.org/c/zuul/zuul-operator/+/853592/11
https://review.opendev.org/c/zuul/zuul-operator/+/853695/13
?
-@gerrit:opendev.org- Michael Kelly proposed: [zuul/zuul-operator] 862390: helm: Add cert-manager as optional dependency https://review.opendev.org/c/zuul/zuul-operator/+/86239017:50
@clarkb:matrix.orgcorvus: looks like both the parent chagne and the slots revert change failed in a similar manner. I think that likely rules out the slots change. PTG should end in a half hour or so, then I need to eat lunch, then I've got a school thing then I'll dig in. I should get there just eventually :)18:26
@jim:acmegating.comok, so probably just some test instability showing up due to unrelated timing changes18:43
@clarkb:matrix.orgOne of the failures is here: https://opendev.org/zuul/nodepool/src/branch/master/nodepool/tests/unit/test_driver_gce.py#L330 and it is failing because the node is already fulfilled at that point. The config we run allows for up to 8 nodes and we have enough quota so nodepool doesn't wait there like the test seems to think it should21:30
@clarkb:matrix.orgI think maybe pre python3.10 nodepool isn't able to turn around that fulfillment and this is just a preexisting race inthe code. I'll need to read a bit more to decide if we need to wait there to test what we want to test21:31
@clarkb:matrix.orgI think the test intends on exercising the wait for quota and testing that workflow so the correct fix here is to actually hit the quota21:33
@clarkb:matrix.orgFor the other issue I think it is a bug in the static driver. We somehow reregistered a static node twice after it was used and deleted allowing the single node to be registered twice allowing it to fulfill two requests immediately21:54
@clarkb:matrix.orghttps://6355e428223de2a8abab-7be64e411bd94d4a66b71f09ef725f83.ssl.cf1.rackcdn.com/862375/1/check/nodepool-tox-py310-2/0775735/testr_results.html has the logs for that21:54
@clarkb:matrix.orgI think maybe because syncNodeCount() and nodeDeletionNotification() can run over each other?21:56
@clarkb:matrix.orgoh a I think I see the fix21:57
@clarkb:matrix.orghrm there is a lock that exists that I think syncNodeCount() needs to take into account. But I'm not seeing where we ever check details with the lock held to prevent moving forward and reregistering anyway22:00
@clarkb:matrix.orgcorvus: I think where I'm ending up is that the slots implementation for the static driver is buggy beacuse it appears possible to register a node twice for the same slot due to a race between those two methods calling the node registration method. I'm having a hard time understanding the expected behavior here as we don't even seem to document how to use slots in the static driver?22:10
@clarkb:matrix.orgI guess slots are an implementation detail for max-parallel-jobs so don't need their own top level documentation?22:11
@clarkb:matrix.orgI'll make a maybe not correct patch and then we can decide if we want to push forward with that or revert or something22:12
-@gerrit:opendev.org- Clark Boylan proposed:22:37
- [zuul/nodepool] 862414: Fix gce test pausing at quote https://review.opendev.org/c/zuul/nodepool/+/862414
- [zuul/nodepool] 862415: Fix race reregistering nodes in the static driver https://review.opendev.org/c/zuul/nodepool/+/862415
@clarkb:matrix.orgcorvus: ^ I've made a note in the second change there that we should also consider rewriting things and simplifying. There are a lot of places where we seem to assume we are thread safe but after this bug I'm not convinced. But I think we can be thread safe if we move the registration and deregistration of nodes into the main run loop22:38
@clarkb:matrix.organd that should result in a much simpler driver since we alerady have the main run loop doing that stuff but we also do it elsewhere22:38
@clarkb:matrix.orgI found myself quite confused tracing all the paths in that driver. I think part of the problem is that we're tracking state that would typically otherwise be in the cloud provider, but I do think trying to simplify if we can would be helpful to avoid bugs like this22:41
@clarkb:matrix.orgfwiw I think the bug exists in the driver regardless of the slots change. But maybe the slots change has made the race easier to hit22:42
@clarkb:matrix.orgso I'm less happy with reverting that change now22:42
@clarkb:matrix.orghrm, one thing I am confused about is how it managed to race at all. At first reading I thought this was a proper run loop but it seems to only be called when we reconfigure the provider and as far as I can tell that didn't happen here. Which implies it was called just once and somehow registered the node for the first request, used it, deleted it, then called syncNodeCount again somehow while the nodedeletednotification was running?22:57
@clarkb:matrix.org(the reason I've made this assumption it must be syncNodeCount is that nodeDeletedNotification does hold the lock and gets the slots info with the lock held which means if it was called twice the lock would serialize us)22:58
@clarkb:matrix.orgwhich is why I'm reasonably certain that this must be the start(), _start(), syncNodeCount() path somehow22:58
@clarkb:matrix.orgdefinitely look into this separately if you can before assuming what I've written is correct: https://6355e428223de2a8abab-7be64e411bd94d4a66b71f09ef725f83.ssl.cf1.rackcdn.com/862375/1/check/nodepool-tox-py310-2/0775735/testr_results.html22:59
@jim:acmegating.comClark: it does look like there may be some more complexity there -- the tests on that change are timing out.   with the nodepool-in-zuul work coming up, i think it might make sense to see about reworking that driver as a statemachine driver before we try to do anything else with it.  i have the openstack driver first on my list, but i can look at that one next.  i think that will simplify how we think about the lifecycle and hopefully will make this easier to think about.23:33
@jim:acmegating.com(it takes me a long time to get back up to speed on that driver every time i have to think about it)23:33
@jim:acmegating.com(also, re docs -- yes, there's no nodepool docs about slots, since they're not really user facing, but there will be in zuul once things stabilize enough we decide to expose it there)23:34
@clarkb:matrix.orgYa I'm realizing that I had some assumptions about how the driver worked (loop vs single call to start()) that invalidate some of the thinking behind that fix.23:34
@clarkb:matrix.orgI also think we may just need more info (via logging) to understand this better. I'll look at that early next week to at least start collecting that data if we have more failures23:35
@clarkb:matrix.orgIt is almost certainly a race so adding logs may fix it :)23:35
@jim:acmegating.comhaha23:36
@jim:acmegating.comfight fire with fire23:36

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!