Tuesday, 2023-10-17

clarkbhey it is meeting time19:04
clarkbsorry I got distracted by the followup to the tc meeting discussing mail delivery problems19:04
clarkbI suspect that it will be a quiet day for the meeting since people are busy but I can run through the topics really quickly anyway19:05
clarkb#startmeeting infra19:05
opendevmeetMeeting started Tue Oct 17 19:05:19 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:05
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:05
opendevmeetThe meeting name has been set to 'infra'19:05
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/DBCAGOTUNVOC2NLM4FATGKZK6GTZRJQ5/ Our Agenda19:05
clarkb#topic Announcements19:05
clarkbThe PTG is happening next week from October 23-27. Please keep this in mind as we make changes to our hosted systems19:05
clarkb#topic Mailman 319:06
clarkbAll of our lists have been migrated into lists01.opendev.org and are running with mailman 319:06
clarkbthank you fungi for getting this mostly over the finish line19:06
clarkbThere are two issues (one outstanding) that should be called out19:07
clarkbThe first is that we had exim configured to copy deliveries to openstack-discuss to a local mailbox. We had done this on the old server to debug dmarc issues19:07
clarkbExim was unable to make these local copy deliveries because the dest dir didn't exist. Senders were getting "not delivered" emails as a result19:07
clarkbThe periodic jobs at ~02:00 Monday should have fixed this as we landed a change to remove the local copy config in exim entirely19:08
clarkbI sent email today and can report back tomorrow on whether I get the "not delivered" message for it (it arrived 24 hours later last time)19:08
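For context, a minimal sketch of the sort of exim router and transport that makes an extra "unseen" local copy of list mail; the router, transport, and path names here are hypothetical, not the exact configuration that was removed:

    # hypothetical router making an extra copy of openstack-discuss mail;
    # "unseen" lets normal delivery to the list continue after the copy
    openstack_discuss_copy:
      driver = accept
      domains = lists.openstack.org
      local_parts = openstack-discuss
      unseen
      transport = local_copy_delivery

    # hypothetical appendfile transport writing the copy to a local mbox;
    # delivery fails if the destination directory does not exist, which
    # matches the "not delivered" symptom described above
    local_copy_delivery:
      driver = appendfile
      file = /var/mail/copies/openstack-discuss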
clarkbThe other issue is that RH corp email appears to not be delivering to the new server. As far as we can tell this is because they use some service that is resolving lists.openstack.org to the old server which refuses smtp connections at this point19:08
clarkbNot much we can do for that other than bring it to the attention of others who might engage with this service. This is an ongoing effort. I just brought it up with the openstack TC which is made up of some RHers19:09
clarkbI think the last remaining tasks are to plan for cleaning up the old server19:10
fricklerwell we could enable exim on the old server again and make it forward to the new one19:10
clarkband maybe consider adding MX records alongside our A/AAAA records19:10
tonybthat's super strange19:10
fricklerat least for some time19:10
clarkbfrickler: I think I would -2 that.19:10
clarkbwe shouldn't need to act as a proxy for email in that way. And it will just prolong the need to keep around the 11.5 year old server that sometimes fails to boot19:11
fricklerI'm not saying we should do that, but it is something we could do to resolve the issue from our side19:11
frickleradding MX records doesn't sound wrong, either19:12
clarkbI think working around it will give the third party an excuse to not resolve it. Best we tackle it directly19:12
clarkbya if we add MX records we should do it for all of the list domains for consistency. I don't think it will help this issue but may make other senders happy19:12
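For anyone checking from outside, something along these lines shows what is published today versus what adding a record would look like; the hostnames come from this discussion, the MX priority and TTL are only illustrative:

    # what resolvers return today: A/AAAA records only, no MX
    dig +short A lists.openstack.org
    dig +short MX lists.openstack.org    # currently returns nothing

    # a hypothetical MX record pointing at the same host, if we chose to
    # publish one for each list domain:
    # lists.openstack.org.  3600  IN  MX  10 lists01.opendev.org.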
clarkbunfortunately fungi is at a conference so can't weigh in. Hopefully we can catch up on all the mailing list fun with him later this week19:13
clarkb#topic LE certcheck failures in Ansible19:14
clarkbWhile I was trying to get the exim config on lists01 updated to fix the undeliverable local copy errors I hit a problem with our LE jobs19:14
tonybI definitely see it as a RH bug and once it's been raised as an issue inside RH it's up to their infra teams to fix19:14
clarkbtonyb: fwiw hberaud and ralonsoh have been working it as affected users already aiui19:15
clarkbWhen compiling the list of domains to configure certcheck with we got `'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'`19:15
tonybokay19:15
clarkbthis error does not occur 100% of the time so I suspect some sort of weird ansible issue19:15
clarkbdigging through the info in the logs I wasn't able to find any nodes in the letsencrypt group that didn't have letsencrypt_certcheck_domains applied to them19:16
clarkb#link https://review.opendev.org/c/opendev/system-config/+/898475 Changes to LE roles to improve debugging19:16
clarkbI don't have a fix but did write up ^ to add more debugging to the system to hopefully make the problem more clear19:16
corvus(mx records should not be required; they kind of only muddy the waters; i agree they're not technically wrong though, and could be a voodoo solution to the problem.  just bringing that up in case anyone is under the impression that they are required and we are wrong for omitting them)19:16
clarkbannoyingly ansible does not report the iteration item when a loop iteration fails. It does report it when it succeeds....19:17
clarkbwhich makes loop failures like the one building the certcheck list difficult to debug19:17
clarkbmy change basically hacks around that by recording the info directly19:17
clarkbreviews welcome as is fresh eyes debugging if someone has time to look at the logs. I can probably even paste them somewhere after making sure no secrets are leaked if that helps19:18
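Not the actual content of 898475, but a sketch of the kind of extra recording that helps here, assuming the certcheck list is built by looping over the letsencrypt group:

    # hypothetical debug task run just before the failing loop, so the offending
    # host is identified even though Ansible doesn't report the loop item on failure
    - name: Record certcheck domains per letsencrypt host
      debug:
        msg: "{{ item }}: {{ hostvars[item]['letsencrypt_certcheck_domains'] | default('UNDEFINED') }}"
      loop: "{{ groups['letsencrypt'] }}"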
clarkb#topic Zuul not properly caching branches when they are created19:18
clarkbThis doesn't appear to be a 100% of the time problem either. But yesterday we noticed after a user reported jobs weren't running on a change that zuul seemed unaware of the branch that change was proposed to19:18
clarkbcorvus: theorized that this may be due to Gerrit emitting the ref-updated event that zuul processes before the git repos have the branch in them on disk (which zuul refers to when listing branches)19:19
clarkbthe long term fix for this is to have zuul query the gerrit api for branch listing which should be consistent with the events stream19:19
clarkbin the meantime we can force zuul to reload the affected zuul tenant which fixes the problem19:19
clarkbRun `docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` on scheduler to fix19:20
corvusi might nitpick the topic title here and point out that it may not actually be a zuul bug; there's a good chance that the issue might be described as "Gerrit doesn't report the correct set of branches over git under some circumstances".  but i agree that it manifests to users as "zuul doesn't have the right branch list" :)19:20
clarkbI did this yesterday and it took about 21 minutes but afterwards all was well19:20
tonybwhen someone reports it right?19:20
clarkbtonyb: yup19:20
corvusand yes, i think the next step in fixing is to switch to the gerrit rest api to see if it behaves better19:20
clarkbfair point. The cooperation between services is broken by data consistency expectations that don't hold :)19:21
corvusyes! :)19:21
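The REST call corvus mentions is Gerrit's standard branch listing; a quick manual check against a project looks roughly like this (the project name is just an example):

    # list branches via the Gerrit REST API; the first line of the response is
    # Gerrit's ")]}'" XSSI guard, so strip it before parsing as JSON
    curl -s "https://review.opendev.org/projects/opendev%2Fsystem-config/branches/" \
      | tail -n +2 | python3 -m json.tool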
clarkb#topic Server Upgrades19:21
clarkbNo new server upgrades. Some services have been upgraded though. More on that later19:22
clarkb#topic InMotion/OnMetal Cloud Redeployment19:22
clarkb#undo19:22
opendevmeetRemoving item from minutes: #topic InMotion/OnMetal Cloud Redeployment19:22
clarkb#topic InMotion/OpenMetal Cloud Redeployment19:22
clarkbI always want to type OnMetal because that was Rax's thing19:22
clarkbAfter discussing this last week I think I'm leaning towards doing a single redeployment early next year. That way we get all the new goodness with the least amount of effort19:23
clarkbthe main resource we tend to lack is time so minimizing time required to use services and tools seems important to me19:23
frickler+119:24
tonyb+119:24
clarkbwe can always change our mind later if we find a new good reason to deploy sooner. But until then I'm happy as is.19:24
clarkb#topic Python Container Updates19:26
clarkb#link https://review.opendev.org/q/(+topic:bookworm-python3.11+OR+hashtag:bookworm+)status:open19:26
clarkbThe end of this process is in sight. Everything but zuul/zuul-operator and openstack/python-openstackclient is now on python3.11. Everything on Python3.11 is on bookworm except for zuul/zuul-registry19:26
clarkbI have a change to fixup some of the job dependencies (something we missed when making the other changes) and then another change to drop python3.9 entirely as nothing is using it19:27
clarkbOnce zuul-operator and openstackclient move to python3.11 we can drop the python3.10 builds too19:27
tonybNice19:27
clarkbAnd then we can look into adding python3.12 image builds, but I don't think this is urgent as we don't have a host platform outside of the containers for running things like linters and unittests. But having the images ready would be nice19:28
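For reference, the bump in these changes is essentially a base image tag change; a hypothetical Dockerfile line, assuming the opendevorg python-builder naming used for these images, would move like this:

    # before: FROM docker.io/opendevorg/python-builder:3.10-bullseye AS builder
    # after:
    FROM docker.io/opendevorg/python-builder:3.11-bookworm AS builder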
clarkb#topic Gita 1.21 Upgrade19:29
tonyb+119:29
clarkb#undo19:29
opendevmeetRemoving item from minutes: #topic Gita 1.21 Upgrade19:29
clarkb#topic Gitea 1.21 Upgrade19:29
clarkbI cannot type today19:29
clarkbNothing really new here. Upstream hasn't produced a new rc or final release so there is no changelog yet19:30
clarkbHopefully we get one soon so that we can plan key rotations if we deem that necessary as well as the gitea upgrade proper19:30
clarkb#topic Zookeeper 3.8 Upgrade19:30
clarkbThis wasn't on the agenda I sent out because updates happened in docker hub this morning. I decided to go ahead and upgrade the zookeeper cluster to 3.8.3 today after new images with some bug fixes became available19:31
clarkbThis is now done. All three nodes are updated and the cluster seems happy19:31
clarkb#link https://review.opendev.org/c/opendev/system-config/+/898614 check myid in zookeeper testing19:31
clarkbThis change came out of one of the things corvus was checking during the upgrade. Basically a sanity check that the zookeeper node recognizes its own id properly19:32
clarkbThe main motivation behind this is that 3.9 is out now, which means 3.8 is the current stable release. Now we're caught up and getting all the latest updates19:32
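A rough manual equivalent of the kind of myid sanity check 898614 adds to testing; the data path and the whitelisting of the "conf" four-letter command are assumptions here:

    # compare the on-disk id with what the running server reports about itself
    cat /var/zookeeper/data/myid                    # path depends on dataDir in zoo.cfg
    echo conf | nc localhost 2181 | grep serverId   # needs conf in 4lw.commands.whitelist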
clarkb#topic Ansible 8 Upgrade for OpenDev Control Plane19:33
clarkbAnother topic that didn't make it on the agenda. This was entirely my fault as I knew about it but failed to add it19:33
clarkb#link https://review.opendev.org/c/opendev/system-config/+/898505 Update ansible on bridge to ansible 819:33
clarkbI think it is time for us to catch up on ansible releases for our control plane. Zuul has been happy with it and the migration seemed straightforward19:33
clarkbI do note that change did not run many system-config-run-* jobs against ansible 8 so we should modify the change to trigger more of those to get good coverage with the new ansible version before merging it. I've got that as a todo for later today19:34
clarkbassuming our system-config-run jobs are happy it should be very safe to land. Just need to monitor after it goes in to ensure the upgrade is successful and we didn't miss any compatibility issues19:35
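A simple post-upgrade spot check, assuming the bridge install layout stays the same; the useful fact is that the Ansible 8 community package pins ansible-core 2.15:

    # after 898505 lands, the version reported on bridge should be from the 2.15 series
    ansible --version    # expect "ansible [core 2.15.x]" once Ansible 8 is installed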
clarkb#topic Open Discussion19:35
clarkbAnything else?19:35
fricklercoming back to lists.openstack.org, one possible issue occurred to me19:35
fricklerfor the old server, rdns pointed back to lists.openstack.org, now we have lists01.opendev.org19:36
fricklerso in fact doing an MX record pointing to the latter might be more correct19:36
clarkbfrickler: you think they may only accept forward records that have matching reverse records?19:36
fricklerI know some people do when receiving mail, not sure how strict things are when sending19:37
clarkbya may be worth a try with MX records I guess then. Though I'd like fungi to weigh in on that before we take action since he has been driving this whole thing19:37
frickleralso in the SMTP dialog the server identifies as lists0119:37
corvushrm, i have received 2 different A responses for lists.openstack.org.  it's possible the old one was cached19:37
corvuslists.openstack.org.    30    IN    A    50.56.173.22219:38
corvuslists.openstack.org.    21    IN    A    162.209.78.7019:38
fricklercorvus: from where did you receive those? first is the old IP19:38
corvusjust a local lookup; so it's entirely possible it's some old cache on my router19:38
corvusi'll keep an eye out and let folks know if it flaps back19:39
clarkbthanks. I have only received the new ip from my local resolver, the authoritative servers, google, cloudflare, and quad9 so far19:39
clarkbif that is more consistent it may be a thread to pull on19:39
corvusjava is famous for not respecting ttls; so if rh has some java thing involved, that could be related19:40
clarkbI pushed an update to https://gerrit-review.googlesource.com/c/plugins/replication/+/387314 earlier today. It still doesn't pass all tests but passes my new tests and I'm hoping I can get feedback on the approach before doing the work to make all test cases pass and fix one known issue19:40
clarkbcorvus: yup java 5 and older ignored ttls by default using only the first resolved values. Then after that this behavior became configured19:41
clarkb*configurable19:41
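The JVM knob in question, for anyone chasing this on the RH side: positive DNS lookups are cached according to the networkaddress.cache.ttl security property (-1 means cache forever), with sun.net.inetaddr.ttl as the legacy system-property override; the jar name below is made up:

    # where the default lives in a JDK install (path varies by JDK version/packaging)
    grep -n "networkaddress.cache.ttl" "$JAVA_HOME/conf/security/java.security"
    # hypothetical override forcing the JVM to honor a short cache lifetime
    java -Dsun.net.inetaddr.ttl=60 -jar some-mail-relay.jar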
corvusif a server is unhappy about forward/reverse dns matching, an mx record probably won't help that.  the important thing is that the forward dns of the helo matches the reverse dns19:42
tonybI'll check with the affected users and make sure an internal ticket is raised19:42
corvus(and that the A record for the name returned by the PTR matches the incoming IP)19:42
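A quick way to see the pieces corvus is describing line up; the IP and hostname are taken from earlier in the discussion:

    # 1) the name the server uses in its HELO/EHLO: lists01.opendev.org
    # 2) the PTR for the sending IP should return that same name
    dig +short -x 162.209.78.70
    # 3) and the A record for that name should return the sending IP
    dig +short A lists01.opendev.org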
clarkbcorvus: I feel like this needs pictures :)19:43
clarkbwith lots of arrows19:43
tonyblol19:43
corvusimagine a cat playing with a ball of yarn19:44
clarkbsounds like that may be it. We can talk dns and smtp with fungi when he is able and take it from there19:45
clarkbthank you for your time everyone and sorry I was a few minutes late19:45
corvusthank you clarkb :)19:45
clarkbI think we will have a meeting next week during the PTG since our normal meeting time is outside of PTG times, and I don't think I'm going to be super busy with the PTG this time around, but that may change19:45
clarkb#endmeeting19:45
opendevmeetMeeting ended Tue Oct 17 19:45:41 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:45
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.html19:45
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.txt19:45
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.log.html19:45
tonybthanks all19:45
