Friday, 2023-02-10

@clarkb:matrix.orgok I think I've figured out what is going on and it is actually a leak in cheroot. When you ask the cheroot threadpool to shut down, it doesn't close accepted sockets00:53
@clarkb:matrix.orgI think on our side we could ensure that the http request to cherrypy (and cheroot) is complete and requests shutdown from the client side before we shutdown the server00:54
@clarkb:matrix.orgbut then that introduces another complication in that cheroot allows its connections to linger and close later: https://github.com/cherrypy/cheroot/blob/main/cheroot/server.py#L1359-L137200:56
@clarkb:matrix.orgBased on that I'm now reasonably confident this is a known deficiency in cheroot.00:56
@clarkb:matrix.orgI think I'll stop putting so much effort into making these warnings go away for now. Really what python seems to be complaining about is that cheroot never calls socket.close() on the socket objects resulting from accept() and when the garbage collector runs it complains about this.00:57
@clarkb:matrix.orghttps://github.com/cherrypy/cheroot/issues/508 is related01:11
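For context, here is a minimal standalone sketch (plain sockets only, not cheroot or Zuul code) of the warning class being discussed: CPython emits a ResourceWarning when a socket object returned by accept() is garbage collected without close() ever having been called on it.

```python
import gc
import socket
import warnings

# ResourceWarnings are ignored by default outside dev mode; surface them.
warnings.simplefilter("always", ResourceWarning)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

client = socket.create_connection((host, port))
conn, _addr = server.accept()

# Drop the accepted socket without calling conn.close(), then force a
# collection; CPython warns "unclosed <socket ...>" at this point.
del conn
gc.collect()

client.close()
server.close()
```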
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 871863: Add an !unsafe change_message variable https://review.opendev.org/c/zuul/zuul/+/87186310:01
@mhuin:matrix.orgzuul-maint: https://review.opendev.org/c/zuul/zuul-client/+/819118 had a +2 a long time ago then fell through the cracks. It lets users authenticate via username/password instead of using a token if the identity provider is configured to do so13:27
@mhuin:matrix.orgThat's a better user experience than looking for your JWT on the GUI13:27
@mbecker12:matrix.orgHi! we noticed that on zuul 6.4.0 any job running on openshift pods got disconnected after 15 minutes and all we got was a `MODULE FAILURE`. After a lot of digging, we decided to install kubectl in an additional executor image layer, which solved the problem, so that we had a proper kubectl installation that was not a symlink to oc. Afterwards, we noticed that we can reproduce the same error with an old version of oc, whereas a current oc installation doesn't seem to have this problem. I just wanted to mention it here and make you guys aware of it :) 15:00
The kubectl version resolves to 1.11, maybe it could be good to update the version of oc/kubectl in the image? Or maybe this is already done in newer versions of zuul and I might have missed that
@clarkb:matrix.org> <@mbecker12:matrix.org> Hi! we noticed that on zuul 6.4.0 any job running on openshift pods got disconnected after 15 minutes and all we got was a `MODULE FAILURE`. After a lot of digging, we decided to install kubectl in an additional executor image layer, which solved the problem, so that we had a proper kubectl installation that was not a symlink to oc. Afterwards, we noticed that we can reproduce the same error with an old version of oc, whereas a current oc installation doesn't seem to have this problem. I just wanted to mention it here and make you guys aware of it :) 15:06
> The kubectl version resolves to 1.11, maybe it could be good to update the version of oc/kubectl in the image? Or maybe this is already done in newer versions of zuul and I might have missed that
https://review.opendev.org/c/zuul/zuul/+/867137
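A small sketch of the kind of check mbecker describes, to spot an image whose kubectl is really a symlink to an (older) oc binary; the paths and layout here are assumptions, not Zuul's actual executor image tooling.

```python
import os
import shutil

# Locate kubectl on PATH and see whether it is just a symlink to oc.
kubectl = shutil.which("kubectl")
if kubectl and os.path.islink(kubectl):
    target = os.path.realpath(kubectl)
    if os.path.basename(target) == "oc":
        print(f"kubectl -> {target}: client is provided by oc and may be outdated")
elif kubectl:
    print(f"kubectl at {kubectl} is a real binary")
else:
    print("kubectl not found on PATH")
```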
@fungicide:matrix.orgdid any changes merge last week that might have impacted zk schema in backward-incompatible ways? apparently following our automatic update to master over the weekend, we started logging this and failing to enqueue into at least a couple of different pipelines: https://paste.opendev.org/show/bcGaLU1wkHZzXgcZ4agi/15:08
@clarkb:matrix.orgfungi:  chances are that's corrupt data in the db not a schema change15:09
@clarkb:matrix.orgSince everything is expected to be stored as json (possibly compressed) iirc15:10
@fungicide:matrix.orginteresting. in that case we've got more than one pipeline impacted and they have entirely disjoint triggers (completely separate projects), so maybe something happened to corrupt more than one znode15:11
@clarkb:matrix.orgThe znode path is in that paste fwiw. End of the first line15:12
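As a hedged sketch of what manually inspecting one of those znodes could look like (the ZK host and znode path are placeholders resembling the paste, and the zlib fallback is an assumption about how some Zuul objects are compressed):

```python
import json
import zlib

from kazoo.client import KazooClient

client = KazooClient(hosts="zk01.example.org:2181")
client.start()

# Placeholder path; use the znode path from the end of the traceback.
data, stat = client.get("/zuul/tenant/example/pipeline/check/status")

try:
    payload = zlib.decompress(data)  # some objects are stored compressed
except zlib.error:
    payload = data

try:
    json.loads(payload)
    print("znode payload parses as JSON; look elsewhere for the failure")
except ValueError as exc:
    print(f"corrupt znode data: {exc}")

client.stop()
```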
@jim:acmegating.comfungi: what host/logfile for that exception?15:13
@fungicide:matrix.orgcorvus: i was pulling it from /var/log/zuul/debug.log on zuul01.opendev.org but i haven't looked at the other yet. there are close to 30k repeats of that error in today's log alone15:14
@fungicide:matrix.orgmaybe more like 60k15:18
@fungicide:matrix.orgafter deduplication, i think we have 34 different znodes mentioned in today's log15:18
@fungicide:matrix.orgacross multiple pipelines and tenants15:18
@fungicide:matrix.orgthough 10 different tenant+pipeline pairs if i filter down the "Exception loading ZKObject" messages further to ignore the zkobject ids15:20
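Roughly the kind of dedup/filtering described above, as an illustrative sketch; the message format and znode path layout assumed by this regex are guesses, not the real Zuul debug.log format.

```python
import re
from collections import Counter

# Assumed log shape: "... Exception loading ZKObject at /zuul/tenant/<t>/pipeline/<p>/..."
pattern = re.compile(
    r"Exception loading ZKObject at (?P<path>/zuul/tenant/(?P<tenant>[^/]+)"
    r"/pipeline/(?P<pipeline>[^/]+)\S*)"
)

znodes = set()
pairs = Counter()
with open("/var/log/zuul/debug.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            znodes.add(match.group("path"))
            pairs[(match.group("tenant"), match.group("pipeline"))] += 1

print(f"{len(znodes)} distinct znodes across {len(pairs)} tenant+pipeline pairs")
```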
@jim:acmegating.comfungi: Clark there is a high degree of probability this is the race described and fixed by https://review.opendev.org/87248215:21
@jim:acmegating.comi think it would be worthwhile to try to land that today15:22
@jim:acmegating.comi'll switch to #opendev to talk about fixing the current state15:22
@fungicide:matrix.orgthere may be more than one cause, i see a handful of "Exception loading ZKObject" messages (<100/day) in logs up until a few days ago, and then around then their frequency increases by a couple of orders of magnitude on 2023-02-0615:24
@jim:acmegating.comfungi: there are some cases where that's harmless; that's likely the case for the <100/day time period.15:25
@fungicide:matrix.orgi figured, it just complicates the analysis since they're hard to filter out15:26
@clarkb:matrix.orgcorvus: I posted some questions on https://review.opendev.org/872482; that change is actually fairly involved and I want to make sure I'm understanding things correctly16:54
@jim:acmegating.comClark: looking16:54
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 873410: Remove layout_uuid from PipelineState create call https://review.opendev.org/c/zuul/zuul/+/87341017:41
@jim:acmegating.comClark: your comments prompted that ^17:41
@clarkb:matrix.org+217:43
@clarkb:matrix.orgThe fix for the race condition is failing in the gate. It seems to have exploded on NoNodeExceptions.19:16
@clarkb:matrix.orgOne of my warning fixup changes had NoNode exceptions in a couple tests too, but the race condition fix seems to have failed more explosively. Maybe in a large recursive loop? The output log is quite large so it's hard to parse, but it looks like the end of the file shows a very deep stack trace19:18
@clarkb:matrix.orgI think in the case of the two failed unittests the issue starts with a zookeeper disconnect. However, I don't see anything in the zk logs that indicates zk was unhappy19:23
@clarkb:matrix.orgthe system load is high just prior to that, and maybe that trips things over somehow.19:27
@clarkb:matrix.orgah yup in the logs the nonodeerror is preceded by a zk connection loss too19:28
@clarkb:matrix.orgcorvus: the failure on the latest recheck looks like it could be related to the change? It's got errors processing the gate with the status and change list objects. I think these errors were expected, but I thought we'd get past them?20:50
@clarkb:matrix.orgHrm actually I wonder if the issue is periodic request queue cleanups breaking after we start test shutdown. I will investigate that more later today20:57
@clarkb:matrix.orgI think I've decided that idea is a red herring. It shows up in the test log as an error, but in the code we catch the exception, log it, and move on. The actual issue seems to be due to a bad version error21:31
@clarkb:matrix.orgwhich I think is possibly related to the change, so I won't recheck it again. It occurs when processing pipelines, and it's pipeline processing that we've modified in the change21:32
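For reference, a standalone kazoo sketch of the "bad version" failure mode (this assumes a local ZooKeeper and uses throwaway paths; it is not Zuul's pipeline code): a compare-and-set write using a stale znode version raises BadVersionError.

```python
from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

client = KazooClient(hosts="127.0.0.1:2181")
client.start()

client.create("/bad-version-demo", b"v1", makepath=True)
data, stat = client.get("/bad-version-demo")

# Another writer updates the znode after we read it...
client.set("/bad-version-demo", b"v2")

try:
    # ...so writing back with the version we originally read fails.
    client.set("/bad-version-demo", b"v3", version=stat.version)
except BadVersionError:
    print("znode changed since it was read; reload state and retry")

client.delete("/bad-version-demo")
client.stop()
```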
@jpew:matrix.orgianw: Did you happen to see my comment on https://review.opendev.org/c/zuul/zuul-jobs/+/871664 ?21:49
@jim:acmegating.comClark: looking at the second one (which is the one with actionable logs) https://7a4304b52924f118030f-a5c518380d1db6458f4f9d5f7d87c4d0.ssl.cf1.rackcdn.com/872482/4/gate/zuul-nox-py311/74c8bed/testr_results.html22:44
@jim:acmegating.comi think you're on to something with the cleanup path -- it looks like it's cleaning up what it thinks is a stale queue, so it actually deletes the queue while the run handler is running22:51
@jim:acmegating.comClark: i think the issue is that the unit test does that while not holding the pipeline lock22:51
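An illustrative toy of that race (all names here are hypothetical, not the real test or scheduler code): a cleanup thread deletes a queue while the run handler iterates it, which taking the same lock in both places prevents.

```python
import threading

pipeline_lock = threading.Lock()
change_queues = {"gate": ["change-A", "change-B"]}

def process(change):
    # Stand-in for whatever the handler does with a change.
    print(f"processing {change}")

def run_handler():
    # The handler holds the pipeline lock while it works on the queue.
    with pipeline_lock:
        for change in change_queues.get("gate", []):
            process(change)

def cleanup_stale_queues():
    # Without taking pipeline_lock here, this delete could land while
    # run_handler is still iterating the queue -- the race described above.
    with pipeline_lock:
        change_queues.pop("gate", None)

threads = [threading.Thread(target=run_handler),
           threading.Thread(target=cleanup_stale_queues)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```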
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 873437: Fix race in test_queue unit tests https://review.opendev.org/c/zuul/zuul/+/87343723:02
@jim:acmegating.comClark: fungi ^ i think we should merge that and re-approve the fix23:03
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:23:03
- [zuul/zuul] 872482: Fix race condition in pipeline change list init https://review.opendev.org/c/zuul/zuul/+/872482
- [zuul/zuul] 873410: Remove layout_uuid from PipelineState create call https://review.opendev.org/c/zuul/zuul/+/873410
