Wednesday, 2023-04-12

@clarkb:matrix.orgOh connection limits make sense00:09
@clarkb:matrix.org++ will look into that00:09
@clarkb:matrix.orglooks like mariadb respects the same setting we already set. But maybe it doesn't close connections as quickly. I'll look into collecting the journal which should hopefully give us a better idea00:35
@jjbeckman:matrix.org> <@jim:acmegating.com> jjbeckman: the zuul container images include oc06:28
Oh, thanks for pointing that out. We already implemented a `kubectl cp` based playbook that works nicely for our use case. Is there any merit in using `oc` over `kubectl` in a native Kubernetes cluster environment?
@jjbeckman:matrix.org> <@jim:acmegating.com> jjbeckman: the zuul container images include oc06:37
* Oh, thanks for pointing that out. We already implemented a `kubectl cp` based playbook that works nicely for our use case. Is there any advantage in using `oc` over `kubectl` in a native Kubernetes cluster environment?
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 880138: Ensure cycle dependencies are enqueued ahead https://review.opendev.org/c/zuul/zuul/+/88013809:30
@vonschultz:matrix.orgNodepool sometimes gives the error `AttributeError: 'NoneType' object has no attribute 'availability_zone'` when used with the AWS driver. There's one patch from February fixing it, https://review.opendev.org/c/zuul/nodepool/+/875050, and another patch from March, https://review.opendev.org/c/zuul/nodepool/+/878093. Could someone maybe review them? (Either patch would solve the problem.)09:44
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 880138: Ensure cycle dependencies are enqueued ahead https://review.opendev.org/c/zuul/zuul/+/88013810:39
@jim:acmegating.comjjbeckman: it supports rsync and works with both13:24
-@gerrit:opendev.org- Zuul merged on behalf of Christian von Schultz: [zuul/nodepool] 875050: Don't get AWS instance az if subnet is None https://review.opendev.org/c/zuul/nodepool/+/87505015:03
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/88016616:03
@clarkb:matrix.orgHopefully one of the debugging methods in that change is able to catch the issue16:03
-@gerrit:opendev.org- Clark Boylan proposed:17:24
- [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166
- [zuul/zuul] 880212: Increase mysqld connections for mariadb https://review.opendev.org/c/zuul/zuul/+/880212
-@gerrit:opendev.org- Clark Boylan proposed:17:31
- [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166
- [zuul/zuul] 880212: Increase mysqld connections for mariadb https://review.opendev.org/c/zuul/zuul/+/880212
-@gerrit:opendev.org- Clark Boylan proposed:17:44
- [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166
- [zuul/zuul] 880212: Increase mysqld connections for mariadb https://review.opendev.org/c/zuul/zuul/+/880212
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/zone-zuul-ci.org] 880213: Set default ttl to one hour https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/88021317:49
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/zone-gating.dev] 880214: Set default ttl to one hour https://review.opendev.org/c/opendev/zone-gating.dev/+/88021417:50
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 878538: WIP: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/87853817:50
@clarkb:matrix.orgThis mariadb mystery deepens and it looks like focal and jammy have different logging behaviors. I'm hopeful this latest patchset will work for both and get us the info we need17:52
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/zone-zuul-ci.org] 879783: Revert short @ record TTLs https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/87978317:52
@jim:acmegating.comthat promote change is not ready yet; i just wanted a clean rebase on ianw's latest updates17:52
@clarkb:matrix.orgwhat is interesting is that particular failure does seem to happen at least once per 3.11 run on the regular but 3.8 doesn't. And it changes tests which is why I decided to go ahead and push a change bumping up the connection count as that feels like a race17:56
@clarkb:matrix.orghrm https://zuul.opendev.org/t/zuul/build/c9af56088bc6460fb62decd1556ba857 shows it happening on the very first test(s) being run in the zuul-nox-remote job. Nothing in the logs either17:59
@clarkb:matrix.orgalso this failure was on connection setup to configure the test fixture not for fixture cleanup18:00
@clarkb:matrix.orgpersistent ssh connections make group updates not take effect without an explicit meta: reset_connection.18:27
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 878538: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/87853818:32
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 878538: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/87853818:34
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 878538: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/87853818:36
@jim:acmegating.comClark: ianw ^ okay i think that's structurally sound if you want to take an initial look over it18:38
@clarkb:matrix.orgawesome I'll pull it up after lunch18:38
-@gerrit:opendev.org- Clark Boylan proposed:18:41
- [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166
- [zuul/zuul] 880212: Increase mysqld connections for mariadb https://review.opendev.org/c/zuul/zuul/+/880212
@clarkb:matrix.orgThat should reconnect forcing the group update to take effect allowing us to retrieve journal logs for mariadb18:41
@clarkb:matrix.organd hopefully that will help us debug the actual error when it happens because I've yet to find anything else that indicates why this is breaking18:41
@clarkb:matrix.orgcorvus: I think my change stack managed to catch one with logs but unfortunately those logs aren't sufficient to determine the cause: https://zuul.opendev.org/t/zuul/build/d990cd3af03a4d42b42f6d1f462686f7/log/mysql.journal.log#34 and https://zuul.opendev.org/t/zuul/build/d990cd3af03a4d42b42f6d1f462686f7/log/job-output.txt#5735 seem to roughly line up19:51
@jim:acmegating.comClark: super weird -- like it's getting rejected very early in the connection process?20:16
@jim:acmegating.comClark: there was another rejection after that though at 19:27:2920:17
@jim:acmegating.comno corresponding mariadb log for that though20:17
@clarkb:matrix.orgYup looking at pymysql it happens when reading server info which is apparently before it tries to auth. Maybe the server info determines auth specifics20:21
@clarkb:matrix.orgI also don't think this is happening with the older mariadb on focal.20:21
@clarkb:matrix.org> <@jim:acmegating.com> Clark: there was another rejection after that though at 19:27:2920:22
Is this the same rejection getting reported in the log again?
@jim:acmegating.comClark: oh that may be20:25
@jim:acmegating.comwhew20:25
@clarkb:matrix.orgThe max connections increase didn't help so we can probably rule that out unless we're running up significant numbers of connections. I expect to see more failures if that was the case20:26
@clarkb:matrix.orgI'm beginning to suspect a bug in mariadb or pymysql considering this happens infrequently and is before we auth even?20:26
@jim:acmegating.comi wonder what happened to the other 3 errors?  maybe they happened to hit tests that don't have critical sql paths.  but if we're hitting at least 4 of these, then that's pretty high frequency.20:27
@clarkb:matrix.orgit's still 4 out of like 2000 though20:27
@jim:acmegating.comright, but it only takes one to fail the job, and if we rely on sql more, then we increase the chances of failure (though we get to test our internal redundancy a bit)20:28
@clarkb:matrix.orgoh ya it's definitely a problem. I'm just trying to justify why I think it might not be a bug on our side :)20:29
@clarkb:matrix.orgI think if it were a bug in zuul we'd hit this far more20:29
@jim:acmegating.comyeah, agreed with that :)20:29
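The arithmetic behind "it only takes one to fail the job" can be sketched as below. This is an illustration under stated assumptions, not a measurement: it takes the rough "4 out of like 2000" figure from the discussion as a per-test rejection rate, treats tests as independent, and uses a hypothetical helper name (`job_failure_probability`) and hypothetical counts of SQL-dependent tests.

```python
def job_failure_probability(per_test_rate: float, sql_dependent_tests: int) -> float:
    """Chance that at least one SQL-dependent test hits a rejected connection.

    Assumes every test independently hits a rejection with the same probability.
    """
    return 1.0 - (1.0 - per_test_rate) ** sql_dependent_tests


# Rough rate quoted in the chat: about 4 rejections in about 2000 tests.
rate = 4 / 2000

# Hypothetical numbers of tests whose outcome depends on the SQL connection.
for n in (10, 100, 1000):
    print(f"{n} SQL-dependent tests: {job_failure_probability(rate, n):.3f}")
```

Under these assumptions the job-level failure chance grows quickly as more tests depend on SQL, which is the concern raised above.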
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 880241: Retry mariadb connections on error https://review.opendev.org/c/zuul/zuul/+/88024120:31
@clarkb:matrix.orgI don't think ^ is the sort of thing we should merge but if we can immediately reconnect that would point more to mariadb I think20:31
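The idea of retrying the connection as a diagnostic can be sketched roughly as follows. This is not the content of change 880241; it is a minimal, generic sketch in which `connect_with_retry`, its parameters, and the fake `flaky_connect` callable are all hypothetical names chosen for illustration. If an immediate retry succeeds, that hints the rejection is transient on the server side.

```python
import time


def connect_with_retry(connect, attempts=3, delay=0.5, retry_on=(Exception,)):
    """Call connect() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns whatever connect() returns; re-raises the last error if every try fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except retry_on as exc:
            last_error = exc
            # Log each failure so the frequency of rejections stays visible.
            print(f"connection attempt {attempt}/{attempts} failed: {exc!r}")
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Stand-in for a real database connect call: fails once, then succeeds.
state = {"calls": 0}


def flaky_connect():
    state["calls"] += 1
    if state["calls"] == 1:
        raise ConnectionError("rejected during handshake")
    return "connected"


result = connect_with_retry(flaky_connect, attempts=3, delay=0)
```

In a real test fixture the callable would wrap the actual driver connect call, and `retry_on` would be narrowed to that driver's connection error type so unrelated failures are not masked.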
@jim:acmegating.comClark: apparently we can increase the mariadb log verbosity.  i'd bet a nickel that if we do that, we get a whole lot of logs and no more answers, but might be worth a shot.  https://mariadb.com/kb/en/error-log/#configuring-the-error-log-verbosity20:32
@clarkb:matrix.org++ let me try that20:33
@clarkb:matrix.orglooks like we want at least level 420:34
@jim:acmegating.comClark: another brainstorming idea: we do a lot of database creation and deletion, along with flushing privileges.  what if doing that opens a window where connection auth is unreliable?  maybe try a throwaway change where instead of creating new users, we just use the global openstackci user for everything?  still create new databases, but basically remove the user creation/drop calls and most importantly the flush privs calls?20:35
-@gerrit:opendev.org- Clark Boylan proposed:20:36
- [zuul/zuul] 880212: Increase mysqld connections and log level for mariadb https://review.opendev.org/c/zuul/zuul/+/880212
- [zuul/zuul] 880241: Retry mariadb connections on error https://review.opendev.org/c/zuul/zuul/+/880241
@clarkb:matrix.orgoh that is an interesting idea.20:36
-@gerrit:opendev.org- Clark Boylan proposed:20:41
- [zuul/zuul] 880241: Retry mariadb connections on error https://review.opendev.org/c/zuul/zuul/+/880241
- [zuul/zuul] 880242: Don't create per test db users and passwords https://review.opendev.org/c/zuul/zuul/+/880242
@clarkb:matrix.orghrm I'm not convinced the server is respecting the config update for the log level :/20:51
@clarkb:matrix.orgcorvus: the intermediate registry promotion change lgtm but I left one question about a possible improvement20:57
@clarkb:matrix.orgoh we only use that mysql config file in the docker-compose setup. Not the one used by CI /me looks at that more closely as a couple of assumptions i had were just invalidated20:59
-@gerrit:opendev.org- Clark Boylan proposed:21:06
- [zuul/zuul] 880212: Increase mysqld connections and log level for mariadb https://review.opendev.org/c/zuul/zuul/+/880212
- [zuul/zuul] 880242: Don't create per test db users and passwords https://review.opendev.org/c/zuul/zuul/+/880242
- [zuul/zuul] 880241: Retry mariadb connections on error https://review.opendev.org/c/zuul/zuul/+/880241
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 879989: Do not wait for streamer when disabled https://review.opendev.org/c/zuul/zuul/+/87998922:10
@iwienand:matrix.orgsigh i don't have emails about those zuul-jobs changes.  i wonder if RH is rejecting gerrit mail, again22:11
@clarkb:matrix.orgall three of those investigative changes passed python3.11 this time around22:23
@clarkb:matrix.orglooking at the more verbose db logs I am a bit surprised that the default connection limit of 151 was working for us.22:24
@clarkb:matrix.orgOnce the python38 jobs finish and buildsets report I'll recheck things to see if we can catch any failures. But it may be that the SET GLOBAL for max_connections addresses things after all22:24
@jim:acmegating.comClark: yeah.  early rejection seems like a plausible way for them to have implemented the limit.22:44
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 878497: test-registry: split docker and container paths https://review.opendev.org/c/zuul/zuul-jobs/+/87849723:46
