@clarkb:matrix.org | ok I think I've figured out what is going on and it is actually a leak in cheroot. When you ask the cheroot threadpool to shut down it doesn't close accepted sockets | 00:53 |
---|---|---|
@clarkb:matrix.org | I think on our side we could ensure that the http request to cherrypy (and cheroot) is complete and request shutdown from the client side before we shut down the server | 00:54 |
@clarkb:matrix.org | but then that introduces another complication in that cheroot allows its connections to linger and close later: https://github.com/cherrypy/cheroot/blob/main/cheroot/server.py#L1359-L1372 | 00:56 |
@clarkb:matrix.org | Based on that I'm now reasonably confident this is a known deficiency in cheroot. | 00:56 |
@clarkb:matrix.org | I think I'll stop putting so much effort into making these warnings go away for now. Really what python seems to be complaining about is that cheroot never calls socket.close() on the socket objects resulting from accept() and when the garbage collector runs it complains about this. | 00:57 |
@clarkb:matrix.org | https://github.com/cherrypy/cheroot/issues/508 is related | 01:11 |
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 871863: Add an !unsafe change_message variable https://review.opendev.org/c/zuul/zuul/+/871863 | 10:01 | |
@mhuin:matrix.org | zuul-maint: https://review.opendev.org/c/zuul/zuul-client/+/819118 had a +2 a long time ago then fell through the cracks. It lets users authenticate via username/password instead of using a token if the identity provider is configured to do so | 13:27 |
@mhuin:matrix.org | That's a better user experience than looking for your JWT on the GUI | 13:27 |
@mbecker12:matrix.org | Hi! We noticed on zuul 6.4.0 that any job running on openshift pods got disconnected after 15 minutes, and all we got was a `MODULE FAILURE`. After a lot of digging, we decided to install kubectl in an additional executor image layer, which solved the problem: we then had a proper kubectl installation that was not a symlink to oc. Afterwards, we noticed that we can reproduce the same error with an old version of oc, whereas a current oc installation doesn't seem to have this problem. I just wanted to mention it here and make you guys aware of it :) | 15:00 |
The kubectl version resolves to 1.11, maybe it could be good to update the version of oc/kubectl in the image? Or maybe this is already done in newer versions of zuul and I might have missed that | ||
@clarkb:matrix.org | (in reply to @mbecker12:matrix.org) https://review.opendev.org/c/zuul/zuul/+/867137 | 15:06 |
@fungicide:matrix.org | did any changes merge last week that might have impacted zk schema in backward-incompatible ways? apparently following our automatic update to master over the weekend, we started logging this and failing to enqueue into at least a couple of different pipelines: https://paste.opendev.org/show/bcGaLU1wkHZzXgcZ4agi/ | 15:08 |
@clarkb:matrix.org | fungi: chances are that's corrupt data in the db not a schema change | 15:09 |
@clarkb:matrix.org | Since everything is expected to be stored as json (possibly compressed) iirc | 15:10 |
@fungicide:matrix.org | interesting. in that case we've got more than one pipeline impacted and they have entirely disjoint triggers (completely separate projects), so maybe something happened to corrupt more than one znode | 15:11 |
@clarkb:matrix.org | The znode path is in that paste fwiw. End of the first line | 15:12 |
@jim:acmegating.com | fungi: what host/logfile for that exception? | 15:13 |
@fungicide:matrix.org | corvus: i was pulling it from /var/log/zuul/debug.log on zuul01.opendev.org but i haven't looked at the other yet. there are close to 30k repeats of that error in today's log alone | 15:14 |
@fungicide:matrix.org | maybe more like 60k | 15:18 |
@fungicide:matrix.org | after deduplication, i think we have 34 different znodes mentioned in today's log | 15:18 |
@fungicide:matrix.org | across multiple pipelines and tenants | 15:18 |
@fungicide:matrix.org | though 10 different tenant+pipeline pairs if i filter down the "Exception loading ZKObject" messages further to ignore the zkobject ids | 15:20 |
@jim:acmegating.com | fungi: Clark there is a high degree of probability this is the race described and fixed by https://review.opendev.org/872482 | 15:21 |
@jim:acmegating.com | i think it would be worthwhile to try to land that today | 15:22 |
@jim:acmegating.com | i'll switch to #opendev to talk about fixing the current state | 15:22 |
@fungicide:matrix.org | there may be more than one cause, i see a handful of "Exception loading ZKObject" messages (<100/day) in logs up until a few days ago, and then their frequency increases by a couple of orders of magnitude on 2023-02-06 | 15:24 |
@jim:acmegating.com | fungi: there are some cases where that's harmless; that's likely the case for the <100/day time period. | 15:25 |
@fungicide:matrix.org | i figured, it just complicates the analysis since they're hard to filter out | 15:26 |
@clarkb:matrix.org | corvus: I posted some questions on https://review.opendev.org/872482. That change is actually fairly involved and I want to make sure I'm understanding things correctly | 16:54 |
@jim:acmegating.com | Clark: looking | 16:54 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 873410: Remove layout_uuid from PipelineState create call https://review.opendev.org/c/zuul/zuul/+/873410 | 17:41 | |
@jim:acmegating.com | Clark: your comments prompted that ^ | 17:41 |
@clarkb:matrix.org | +2 | 17:43 |
@clarkb:matrix.org | The fix for the race condition is failing in the gate. It seems to have exploded on NoNodeExceptions. | 19:16 |
@clarkb:matrix.org | One of my warning fixup changes had NoNode exceptions in a couple tests too, but the race condition fix seems to have failed more explosively. Maybe in a large recursive loop? The output log is quite large so hard to parse it, but it looks like the end of the file shows a very deep stack trace | 19:18 |
@clarkb:matrix.org | I think in the case of two failed unittests the issue starts with a zookeeper disconnect. However I don't see anything in the zk logs that indicates zk was unhappy | 19:23 |
@clarkb:matrix.org | the system load is high just prior to that, and maybe that trips things over somehow. | 19:27 |
@clarkb:matrix.org | ah yup in the logs the nonodeerror is preceded by a zk connection loss too | 19:28 |
@clarkb:matrix.org | corvus: the failure on the latest recheck looks like it could be related to the change? It's got errors processing the gate with the status and change list objects. I think these were expected but that we'd get past them? | 20:50 |
@clarkb:matrix.org | Hrm actually I wonder if the issue is periodic request queue cleanups breaking after we start test shutdown. I will investigate that more later today | 20:57 |
@clarkb:matrix.org | I think I've decided that idea is a red herring. It shows up in the test log as an error but in the code we catch the exception and log it and move on. The actual issue seems to be due to bad version | 21:31 |
@clarkb:matrix.org | which I think possibly is related to the change so I won't recheck it again. It occurs when processing pipelines and it's pipeline processing that we've modified in the change | 21:32 |
@jpew:matrix.org | ianw: Did you happen to see my comment on https://review.opendev.org/c/zuul/zuul-jobs/+/871664 ? | 21:49 |
@jim:acmegating.com | Clark: looking at the second one (which is the one with actionable logs) https://7a4304b52924f118030f-a5c518380d1db6458f4f9d5f7d87c4d0.ssl.cf1.rackcdn.com/872482/4/gate/zuul-nox-py311/74c8bed/testr_results.html | 22:44 |
@jim:acmegating.com | i think you're on to something with the cleanup path -- it looks like it's cleaning up what it thinks is a stale queue, so it actually deletes the queue while the run handler is running | 22:51 |
@jim:acmegating.com | Clark: i think the issue is that the unit test does that while not holding the pipeline lock | 22:51 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 873437: Fix race in test_queue unit tests https://review.opendev.org/c/zuul/zuul/+/873437 | 23:02 | |
@jim:acmegating.com | Clark: fungi ^ i think we should merge that and re-approve the fix | 23:03 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: | 23:03 | |
- [zuul/zuul] 872482: Fix race condition in pipeline change list init https://review.opendev.org/c/zuul/zuul/+/872482 | ||
- [zuul/zuul] 873410: Remove layout_uuid from PipelineState create call https://review.opendev.org/c/zuul/zuul/+/873410 |
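
For anyone reading the 00:53 to 00:57 discussion later: the warning Clark describes can be reproduced without cheroot at all. The following is only a minimal sketch using the standard library, not cheroot's or Zuul's code; `accept_once` and `close_conn` are hypothetical names made up for this illustration. It accepts one loopback connection, optionally skips `close()` on the accepted socket, and records any `ResourceWarning` that Python emits when that socket object is finalized.

```python
import gc
import socket
import warnings


def accept_once(close_conn: bool) -> list[str]:
    """Accept one loopback connection; return recorded ResourceWarning texts.

    Hypothetical helper for illustration only (not part of cheroot or Zuul).
    """
    with warnings.catch_warnings(record=True) as caught:
        # ResourceWarning is ignored by default, so enable it explicitly.
        warnings.simplefilter("always", ResourceWarning)
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
            server.listen(1)
            with socket.create_connection(server.getsockname()):
                conn, _addr = server.accept()
                if close_conn:
                    conn.close()  # the step the chat says is never reached
                # Drop the last reference and force a collection so the
                # socket is finalized while warnings are being recorded.
                del conn
                gc.collect()
    return [
        str(w.message)
        for w in caught
        if issubclass(w.category, ResourceWarning)
    ]
```

Calling `accept_once(close_conn=False)` should yield a message starting with `unclosed <socket.socket ...`, while `accept_once(close_conn=True)` should yield none, which matches the description of the garbage collector complaining about sockets from `accept()` that were never closed.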