Friday, 2023-02-10

@clarkb:matrix.orgok I think I've figured out what is going on and it is actually a leak in cheroot. When you ask the cheroot threadpool to shut down, it doesn't close accepted sockets00:53
@clarkb:matrix.orgI think on our side we could ensure that the http request to cherrypy (and cheroot) is complete and requests shutdown from the client side before we shutdown the server00:54
@clarkb:matrix.orgbut then that introduces another complication in that cheroot allows its connections to linger and close later: https://github.com/cherrypy/cheroot/blob/main/cheroot/server.py#L1359-L137200:56
@clarkb:matrix.orgBased on that I'm now reasonably confident this is a known deficiency in cheroot.00:56
@clarkb:matrix.orgI think I'll stop putting so much effort into making these warnings go away for now. Really what python seems to be complaining about is that cheroot never calls socket.close() on the socket objects resulting from accept() and when the garbage collector runs it complains about this.00:57
@clarkb:matrix.orghttps://github.com/cherrypy/cheroot/issues/508 is related01:11
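For context, here is a minimal standalone sketch (plain sockets only, not cheroot or Zuul code) of the warning class being discussed: CPython emits a ResourceWarning when a socket object returned by accept() is garbage collected without close() ever having been called on it.

```python
import gc
import socket
import warnings

# ResourceWarnings are ignored by default outside dev mode; surface them.
warnings.simplefilter("always", ResourceWarning)

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
host, port = server.getsockname()

client = socket.create_connection((host, port))
conn, _addr = server.accept()

# Drop the accepted socket without calling conn.close(), then force a
# collection; CPython warns "unclosed <socket ...>" at this point.
del conn
gc.collect()

client.close()
server.close()
```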
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 871863: Add an !unsafe change_message variable https://review.opendev.org/c/zuul/zuul/+/87186310:01
@mhuin:matrix.orgzuul-maint: https://review.opendev.org/c/zuul/zuul-client/+/819118 had a +2 a long time ago then fell through the cracks. It lets users authenticate via username/password instead of using a token if the identity provider is configured to do so13:27
@mhuin:matrix.orgThat's a better user experience than looking for your JWT on the GUI13:27
@mbecker12:matrix.orgHi! we noticed that on zuul 6.4.0 any job running on openshift pods got disconnected after 15 minutes and all we got was a `MODULE FAILURE`. After a lot of digging, we decided to install kubectl in an additional executor image layer, which solved the problem, so that we had a proper kubectl installation that was not a symlink to oc. Afterwards, we noticed that we can reproduce the same error with an old version of oc, whereas a current oc installation doesn't seem to have this problem. I just wanted to mention it here and make you guys aware of it :) 15:00
The kubectl version resolves to 1.11, maybe it could be good to update the version of oc/kubectl in the image? Or maybe this is already done in newer versions of zuul and I might have missed that
@clarkb:matrix.org> <@mbecker12:matrix.org> Hi! we noticed that on zuul 6.4.0 any job running on openshift pods got disconnected after 15 minutes and all we got was a `MODULE FAILURE`. After a lot of digging, we decided to install kubectl in an additional executor image layer, which solved the problem, so that we had a proper kubectl installation that was not a symlink to oc. Afterwards, we noticed that we can reproduce the same error with an old version of oc, whereas a current oc installation doesn't seem to have this problem. I just wanted to mention it here and make you guys aware of it :) 15:06
> The kubectl version resolves to 1.11, maybe it could be good to update the version of oc/kubectl in the image? Or maybe this is already done in newer versions of zuul and I might have missed that
https://review.opendev.org/c/zuul/zuul/+/867137
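A small sketch of the kind of check mbecker describes, to spot an image whose kubectl is really a symlink to an (older) oc binary; the paths and layout here are assumptions, not Zuul's actual executor image tooling.

```python
import os
import shutil

# Locate kubectl on PATH and see whether it is just a symlink to oc.
kubectl = shutil.which("kubectl")
if kubectl and os.path.islink(kubectl):
    target = os.path.realpath(kubectl)
    if os.path.basename(target) == "oc":
        print(f"kubectl -> {target}: client is provided by oc and may be outdated")
elif kubectl:
    print(f"kubectl at {kubectl} is a real binary")
else:
    print("kubectl not found on PATH")
```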
@fungicide:matrix.orgdid any changes merge last week that might have impacted zk schema in backward-incompatible ways? apparently following our automatic update to master over the weekend, we started logging this and failing to enqueue into at least a couple of different pipelines: https://paste.opendev.org/show/bcGaLU1wkHZzXgcZ4agi/15:08
@clarkb:matrix.orgfungi:  chances are that's corrupt data in the db not a schema change15:09
@clarkb:matrix.orgSince everything is expected to be stored as json (possibly compressed) iirc15:10
@fungicide:matrix.orginteresting. in that case we've got more than one pipeline impacted and they have entirely disjoint triggers (completely separate projects), so maybe something happened to corrupt more than one znode15:11
@clarkb:matrix.orgThe znode path is in that paste fwiw. End of the first line15:12
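As a hedged sketch of what manually inspecting one of those znodes could look like (the ZK host and znode path are placeholders resembling the paste, and the zlib fallback is an assumption about how some Zuul objects are compressed):

```python
import json
import zlib

from kazoo.client import KazooClient

client = KazooClient(hosts="zk01.example.org:2181")
client.start()

# Placeholder path; use the znode path from the end of the traceback.
data, stat = client.get("/zuul/tenant/example/pipeline/check/status")

try:
    payload = zlib.decompress(data)  # some objects are stored compressed
except zlib.error:
    payload = data

try:
    json.loads(payload)
    print("znode payload parses as JSON; look elsewhere for the failure")
except ValueError as exc:
    print(f"corrupt znode data: {exc}")

client.stop()
```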
@jim:acmegating.comfungi: what host/logfile for that exception?15:13
@fungicide:matrix.orgcorvus: i was pulling it from /var/log/zuul/debug.log on zuul01.opendev.org but i haven't looked at the other yet. there are close to 30k repeats of that error in today's log alone15:14
@fungicide:matrix.orgmaybe more like 60k15:18
@fungicide:matrix.orgafter deduplication, i think we have 34 different znodes mentioned in today's log15:18
@fungicide:matrix.orgacross multiple pipelines and tenants15:18
@fungicide:matrix.orgthough 10 different tenant+pipeline pairs if i filter down the "Exception loading ZKObject" messages further to ignore the zkobject ids15:20
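Roughly the kind of dedup/filtering described above, as an illustrative sketch; the message format and znode path layout assumed by this regex are guesses, not the real Zuul debug.log format.

```python
import re
from collections import Counter

# Assumed log shape: "... Exception loading ZKObject at /zuul/tenant/<t>/pipeline/<p>/..."
pattern = re.compile(
    r"Exception loading ZKObject at (?P<path>/zuul/tenant/(?P<tenant>[^/]+)"
    r"/pipeline/(?P<pipeline>[^/]+)\S*)"
)

znodes = set()
pairs = Counter()
with open("/var/log/zuul/debug.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            znodes.add(match.group("path"))
            pairs[(match.group("tenant"), match.group("pipeline"))] += 1

print(f"{len(znodes)} distinct znodes across {len(pairs)} tenant+pipeline pairs")
```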
@jim:acmegating.comfungi: Clark there is a high degree of probability this is the race described and fixed by https://review.opendev.org/87248215:21
@jim:acmegating.comi think it would be worthwhile to try to land that today15:22
@jim:acmegating.comi'll switch to #opendev to talk about fixing the current state15:22
@fungicide:matrix.orgthere may be more than one cause, i see a handful of "Exception loading ZKObject" messages (<100/day) in logs up until a few days ago, and then around then their frequency increases by a couple of orders of magnitude on 2023-02-0615:24
@jim:acmegating.comfungi: there are some cases where that's harmless; that's likely the case for the <100/day time period.15:25
@fungicide:matrix.orgi figured, it just complicates the analysis since they're hard to filter out15:26
@clarkb:matrix.orgcorvus: I posted some questions on https://review.opendev.org/872482; that change is actually fairly involved and I want to make sure I'm understanding things correctly16:54
@jim:acmegating.comClark: looking16:54
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 873410: Remove layout_uuid from PipelineState create call https://review.opendev.org/c/zuul/zuul/+/87341017:41
@jim:acmegating.comClark: your comments prompted that ^17:41
@clarkb:matrix.org+217:43
@clarkb:matrix.orgThe fix for the race condition is failing in the gate. It seems to have exploded on NoNodeExceptions.19:16
@clarkb:matrix.orgOne of my warning fixup changes had NoNode exceptions in a couple tests too, but the race condition fix seems to have failed more explosively. Maybe in a large recursive loop? The output log is quite large so it's hard to parse, but it looks like the end of the file shows a very deep stack trace19:18
@clarkb:matrix.orgI think in the case of the two failed unittests the issue starts with a zookeeper disconnect. However, I don't see anything in the zk logs that indicates zk was unhappy19:23
@clarkb:matrix.orgthe system load is high just prior to that, and maybe that trips things over somehow.19:27
@clarkb:matrix.orgah yup in the logs the nonodeerror is preceded by a zk connection loss too19:28
@clarkb:matrix.orgcorvus: the failure on the latest recheck looks like it could be related to the change? It's got errors processing the gate with the status and change list objects. I think these errors were expected, but I thought we'd get past them?20:50
@clarkb:matrix.orgHrm actually I wonder if the issue is periodic request queue cleanups breaking after we start test shutdown. I will investigate that more later today20:57
@clarkb:matrix.orgI think I've decided that idea is a red herring. It shows up in the test log as an error, but in the code we catch the exception, log it, and move on. The actual issue seems to be due to a bad version error21:31
@clarkb:matrix.orgwhich I think is possibly related to the change, so I won't recheck it again. It occurs when processing pipelines, and it's pipeline processing that we've modified in the change21:32
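For reference, a standalone kazoo sketch of the "bad version" failure mode (this assumes a local ZooKeeper and uses throwaway paths; it is not Zuul's pipeline code): a compare-and-set write using a stale znode version raises BadVersionError.

```python
from kazoo.client import KazooClient
from kazoo.exceptions import BadVersionError

client = KazooClient(hosts="127.0.0.1:2181")
client.start()

client.create("/bad-version-demo", b"v1", makepath=True)
data, stat = client.get("/bad-version-demo")

# Another writer updates the znode after we read it...
client.set("/bad-version-demo", b"v2")

try:
    # ...so writing back with the version we originally read fails.
    client.set("/bad-version-demo", b"v3", version=stat.version)
except BadVersionError:
    print("znode changed since it was read; reload state and retry")

client.delete("/bad-version-demo")
client.stop()
```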
@jpew:matrix.orgianw: Did you happen to see my comment on https://review.opendev.org/c/zuul/zuul-jobs/+/871664 ?21:49
@jim:acmegating.comClark: looking at the second one (which is the one with actionable logs) https://7a4304b52924f118030f-a5c518380d1db6458f4f9d5f7d87c4d0.ssl.cf1.rackcdn.com/872482/4/gate/zuul-nox-py311/74c8bed/testr_results.html22:44
@jim:acmegating.comi think you're on to something with the cleanup path -- it looks like it's cleaning up what it thinks is a stale queue, so it actually deletes the queue while the run handler is running22:51
@jim:acmegating.comClark: i think the issue is that the unit test does that while not holding the pipeline lock22:51
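An illustrative toy of that race (all names here are hypothetical, not the real test or scheduler code): a cleanup thread deletes a queue while the run handler iterates it, which taking the same lock in both places prevents.

```python
import threading

pipeline_lock = threading.Lock()
change_queues = {"gate": ["change-A", "change-B"]}

def process(change):
    # Stand-in for whatever the handler does with a change.
    print(f"processing {change}")

def run_handler():
    # The handler holds the pipeline lock while it works on the queue.
    with pipeline_lock:
        for change in change_queues.get("gate", []):
            process(change)

def cleanup_stale_queues():
    # Without taking pipeline_lock here, this delete could land while
    # run_handler is still iterating the queue -- the race described above.
    with pipeline_lock:
        change_queues.pop("gate", None)

threads = [threading.Thread(target=run_handler),
           threading.Thread(target=cleanup_stale_queues)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```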
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 873437: Fix race in test_queue unit tests https://review.opendev.org/c/zuul/zuul/+/87343723:02
@jim:acmegating.comClark: fungi ^ i think we should merge that and re-approve the fix23:03
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:23:03
- [zuul/zuul] 872482: Fix race condition in pipeline change list init https://review.opendev.org/c/zuul/zuul/+/872482
- [zuul/zuul] 873410: Remove layout_uuid from PipelineState create call https://review.opendev.org/c/zuul/zuul/+/873410
