@clarkb:matrix.org | Oh connection limits make sense | 00:09 |
---|---|---|
@clarkb:matrix.org | ++ will look into that | 00:09 |
@clarkb:matrix.org | looks like mariadb respects the same setting we already set. But maybe it doesn't close connections as quickly. I'll look into collecting the journal which should hopefully give us a better idea | 00:35 |
@jjbeckman:matrix.org | > <@jim:acmegating.com> jjbeckman: the zuul container images include oc | 06:37 |
* Oh, thanks for pointing that out. We already implemented a `kubectl cp` based playbook that works nicely for our use case. Is there any advantage in using `oc` over `kubectl` in a native Kubernetes cluster environment? | ||
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 880138: Ensure cycle dependencies are enqueued ahead https://review.opendev.org/c/zuul/zuul/+/880138 | 09:30 | |
@vonschultz:matrix.org | Nodepool sometimes gives the error `AttributeError: 'NoneType' object has no attribute 'availability_zone'` when used with the AWS driver. There's one patch from February fixing it, https://review.opendev.org/c/zuul/nodepool/+/875050, and another patch from March, https://review.opendev.org/c/zuul/nodepool/+/878093. Could someone maybe review them? (Either patch would solve the problem.) | 09:44 |
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 880138: Ensure cycle dependencies are enqueued ahead https://review.opendev.org/c/zuul/zuul/+/880138 | 10:39 | |
@jim:acmegating.com | jjbeckman: it supports rsync and works with both | 13:24 |
-@gerrit:opendev.org- Zuul merged on behalf of Christian von Schultz: [zuul/nodepool] 875050: Don't get AWS instance az if subnet is None https://review.opendev.org/c/zuul/nodepool/+/875050 | 15:03 | |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166 | 16:03 | |
@clarkb:matrix.org | Hopefully one of the debugging methods in that change is able to catch the issue | 16:03 |
-@gerrit:opendev.org- Clark Boylan proposed: | 17:24 | |
- [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166 | ||
- [zuul/zuul] 880212: Increase mysqld connections for mariadb https://review.opendev.org/c/zuul/zuul/+/880212 | ||
-@gerrit:opendev.org- Clark Boylan proposed: | 17:31 | |
- [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166 | ||
- [zuul/zuul] 880212: Increase mysqld connections for mariadb https://review.opendev.org/c/zuul/zuul/+/880212 | ||
-@gerrit:opendev.org- Clark Boylan proposed: | 17:44 | |
- [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166 | ||
- [zuul/zuul] 880212: Increase mysqld connections for mariadb https://review.opendev.org/c/zuul/zuul/+/880212 | ||
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/zone-zuul-ci.org] 880213: Set default ttl to one hour https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/880213 | 17:49 | |
-@gerrit:opendev.org- Clark Boylan proposed: [opendev/zone-gating.dev] 880214: Set default ttl to one hour https://review.opendev.org/c/opendev/zone-gating.dev/+/880214 | 17:50 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 878538: WIP: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/878538 | 17:50 | |
@clarkb:matrix.org | This mariadb mystery deepens and it looks like focal and jammy have different logging behaviors. I'm hopeful this latest patchset will work for both and get us the info we need | 17:52 |
-@gerrit:opendev.org- Zuul merged on behalf of Clark Boylan: [opendev/zone-zuul-ci.org] 879783: Revert short @ record TTLs https://review.opendev.org/c/opendev/zone-zuul-ci.org/+/879783 | 17:52 | |
@jim:acmegating.com | that promote change is not ready yet; i just wanted a clean rebase on ianw's latest updates | 17:52 |
@clarkb:matrix.org | what is interesting is that particular failure does seem to happen at least once per 3.11 run on the regular but 3.8 doesn't. And it changes tests which is why I decided to go ahead and push a change bumping up the connection count as that feels like a race | 17:56 |
@clarkb:matrix.org | hrm https://zuul.opendev.org/t/zuul/build/c9af56088bc6460fb62decd1556ba857 shows it happening on the very first test(s) being run in the zuul-nox-remote job. Nothing in the logs either | 17:59 |
@clarkb:matrix.org | also this failure was on connection setup to configure the test fixture not for fixture cleanup | 18:00 |
@clarkb:matrix.org | persistent ssh connections make group updates not take effect without an explicit meta: reset_connection. | 18:27 |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 878538: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/878538 | 18:32 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 878538: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/878538 | 18:34 | |
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul-jobs] 878538: Update promote-container-image to copy from intermediate registry https://review.opendev.org/c/zuul/zuul-jobs/+/878538 | 18:36 | |
@jim:acmegating.com | Clark: ianw ^ okay i think that's structurally sound if you want to take an initial look over it | 18:38 |
@clarkb:matrix.org | awesome I'll pull it up after lunch | 18:38 |
-@gerrit:opendev.org- Clark Boylan proposed: | 18:41 | |
- [zuul/zuul] 880166: Add extra debugging around mariadb https://review.opendev.org/c/zuul/zuul/+/880166 | ||
- [zuul/zuul] 880212: Increase mysqld connections for mariadb https://review.opendev.org/c/zuul/zuul/+/880212 | ||
@clarkb:matrix.org | That should reconnect forcing the group update to take effect allowing us to retrieve journal logs for mariadb | 18:41 |
@clarkb:matrix.org | and hopefully that will help us debug the actual error when it happens because I've yet to find anything else that indicates why this is breaking | 18:41 |
@clarkb:matrix.org | corvus: I think my change stack managed to catch one with logs but unfortunately those logs aren't sufficient to determine the cause: https://zuul.opendev.org/t/zuul/build/d990cd3af03a4d42b42f6d1f462686f7/log/mysql.journal.log#34 and https://zuul.opendev.org/t/zuul/build/d990cd3af03a4d42b42f6d1f462686f7/log/job-output.txt#5735 seem to roughly line up | 19:51 |
@jim:acmegating.com | Clark: super weird -- like it's getting rejected very early in the connection process? | 20:16 |
@jim:acmegating.com | Clark: there was another rejection after that though at 19:27:29 | 20:17 |
@jim:acmegating.com | no corresponding mariadb log for that though | 20:17 |
@clarkb:matrix.org | Yup looking at pymysql it happens when reading server info which is apparently before it tries to auth. Maybe the server info determines auth specifics | 20:21 |
@clarkb:matrix.org | I also don't think this is happening with the older mariadb on focal. | 20:21 |
@clarkb:matrix.org | > <@jim:acmegating.com> Clark: there was another rejection after that though at 19:27:29 | 20:22 |
Is this the same rejection getting reported in the log again? | ||
@jim:acmegating.com | Clark: oh that may be | 20:25 |
@jim:acmegating.com | whew | 20:25 |
@clarkb:matrix.org | The max connections increase didn't help so we can probably rule that out unless we're running up significant numbers of connections. I'd expect to see more failures if that were the case | 20:26 |
@clarkb:matrix.org | I'm beginning to suspect a bug in mariadb or pymysql considering this happens infrequently and is before we auth even? | 20:26 |
@jim:acmegating.com | i wonder what happened to the other 3 errors? maybe they happened to hit tests that don't have critical sql paths. but if we're hitting at least 4 of these, then that's pretty high frequency. | 20:27 |
@clarkb:matrix.org | it's still 4 out of like 2000 though | 20:27 |
@jim:acmegating.com | right, but it only takes one to fail the job, and if we rely on sql more, then we increase the chances of failure (though we get to test our internal redundancy a bit) | 20:28 |
@clarkb:matrix.org | oh ya it's definitely a problem. I'm just trying to justify why I think it might not be a bug on our side :) | 20:29 |
@clarkb:matrix.org | I think if it were a bug in zuul we'd hit this far more | 20:29 |
@jim:acmegating.com | yeah, agreed with that :) | 20:29 |
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 880241: Retry mariadb connections on error https://review.opendev.org/c/zuul/zuul/+/880241 | 20:31 | |
@clarkb:matrix.org | I don't think ^ is the sort of thing we should merge but if we can immediately reconnect that would point more to mariadb I think | 20:31 |
@jim:acmegating.com | Clark: apparently we can increase the mariadb log verbosity. i'd bet a nickel that if we do that, we get a whole lot of logs and no more answers, but might be worth a shot. https://mariadb.com/kb/en/error-log/#configuring-the-error-log-verbosity | 20:32 |
@clarkb:matrix.org | ++ let me try that | 20:33 |
@clarkb:matrix.org | looks like we want at least level 4 | 20:34 |
@jim:acmegating.com | Clark: another brainstorming idea: we do a lot of database creation and deletion, along with flushing privileges. what if doing that opens a window where connection auth is unreliable? maybe try a throwaway change where instead of creating new users, we just use the global openstackci user for everything? still create new databases, but basically remove the user creation/drop calls and most importantly the flush privs calls? | 20:35 |
-@gerrit:opendev.org- Clark Boylan proposed: | 20:36 | |
- [zuul/zuul] 880212: Increase mysqld connections and log level for mariadb https://review.opendev.org/c/zuul/zuul/+/880212 | ||
- [zuul/zuul] 880241: Retry mariadb connections on error https://review.opendev.org/c/zuul/zuul/+/880241 | ||
@clarkb:matrix.org | oh that is an interesting idea. | 20:36 |
-@gerrit:opendev.org- Clark Boylan proposed: | 20:41 | |
- [zuul/zuul] 880241: Retry mariadb connections on error https://review.opendev.org/c/zuul/zuul/+/880241 | ||
- [zuul/zuul] 880242: Don't create per test db users and passwords https://review.opendev.org/c/zuul/zuul/+/880242 | ||
@clarkb:matrix.org | hrm I'm not convinced the server is respecting the config update for the log level :/ | 20:51 |
@clarkb:matrix.org | corvus: the intermediate registry promotion change lgtm but I left one question about a possible improvement | 20:57 |
@clarkb:matrix.org | oh we only use that mysql config file in the docker-compose setup. Not the one used by CI /me looks at that more closely as a couple of assumptions i had were just invalidated | 20:59 |
-@gerrit:opendev.org- Clark Boylan proposed: | 21:06 | |
- [zuul/zuul] 880212: Increase mysqld connections and log level for mariadb https://review.opendev.org/c/zuul/zuul/+/880212 | ||
- [zuul/zuul] 880242: Don't create per test db users and passwords https://review.opendev.org/c/zuul/zuul/+/880242 | ||
- [zuul/zuul] 880241: Retry mariadb connections on error https://review.opendev.org/c/zuul/zuul/+/880241 | ||
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 879989: Do not wait for streamer when disabled https://review.opendev.org/c/zuul/zuul/+/879989 | 22:10 | |
@iwienand:matrix.org | sigh i don't have emails about those zuul-jobs changes. i wonder if RH is rejecting gerrit mail, again | 22:11 |
@clarkb:matrix.org | all three of those investigative changes passed python3.11 this time around | 22:23 |
@clarkb:matrix.org | looking at the more verbose db logs I am a bit surprised that the default connection limit of 151 was working for us. | 22:24 |
@clarkb:matrix.org | Once the python38 jobs finish and buildsets report I'll recheck things to see if we can catch any failures. But it may be that the SET GLOBAL for max_connections addresses things after all | 22:24 |
@jim:acmegating.com | Clark: yeah. early rejection seems like a plausible way for them to have implemented the limit. | 22:44 |
-@gerrit:opendev.org- Zuul merged on behalf of Ian Wienand: [zuul/zuul-jobs] 878497: test-registry: split docker and container paths https://review.opendev.org/c/zuul/zuul-jobs/+/878497 | 23:46 |
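
The 20:31 discussion of change 880241 ("Retry mariadb connections on error") treats an immediate reconnect as a diagnostic: if a second attempt succeeds right away, that points more toward the server than toward the test code. A minimal sketch of that idea follows. It is not the content of that change; `connect_with_retry` and its parameters are hypothetical names, and `connect` stands in for whatever callable opens the database connection. The default `retry_on=(OSError,)` is only a placeholder; with pymysql one would likely pass its `OperationalError` instead.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 3,
    delay: float = 0.1,
    retry_on: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, int]:
    """Call connect() until it succeeds or the attempts run out.

    Returns the connection together with the number of failed tries
    that preceded it, so a caller can log how often a retry was needed.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    failures = 0
    while True:
        try:
            return connect(), failures
        except retry_on:
            failures += 1
            if failures >= attempts:
                # Out of attempts: re-raise the last error unchanged.
                raise
            sleep(delay)
```

Returning the failure count alongside the connection keeps the diagnostic value: a run where every connection reports zero failures says something different from one where a handful needed a second try.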