Tuesday, 2023-10-17

clarkbhey it is meeting time19:04
clarkbsorry I got distracted by the followup to the tc meeting discussing mail delivery problems19:04
clarkbI suspect that it will be a quiet day for the meeting since people are busy but I can run through the topics really quickly anyway19:05
clarkb#startmeeting infra19:05
opendevmeetMeeting started Tue Oct 17 19:05:19 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:05
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:05
opendevmeetThe meeting name has been set to 'infra'19:05
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/DBCAGOTUNVOC2NLM4FATGKZK6GTZRJQ5/ Our Agenda19:05
clarkb#topic Announcements19:05
clarkbThe PTG is happening next week from October 23-27. Please keep this in mind as we make changes to our hosted systems19:05
clarkb#topic Mailman 319:06
clarkbAll of our lists have been migrated into lists01.opendev.org and are running with mailman 319:06
clarkbthank you fungi for getting this mostly over the finish line19:06
clarkbThere are two issues (one outstanding) that should be called out19:07
clarkbThe first is that we had exim configured to copy deliveries to openstack-discuss to a local mailbox. We had done this on the old server to debug dmarc issues19:07
clarkbExim was unable to make these local copy deliveries because the dest dir didn't exist. Senders were getting "not delivered" emails as a result19:07
clarkbThe periodic jobs at ~02:00 Monday should have fixed this as we landed a change to remove the local copy config in exim entirely19:08
clarkbI sent email today and can report back tomorrow on whether I get the "not delivered" message for it (it arrived 24 hours later last time)19:08
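For context, a minimal sketch of the sort of exim router and transport that makes an extra "unseen" local copy of list mail; the router, transport, and path names here are hypothetical, not the exact configuration that was removed:

    # hypothetical router making an extra copy of openstack-discuss mail;
    # "unseen" lets normal delivery to the list continue after the copy
    openstack_discuss_copy:
      driver = accept
      domains = lists.openstack.org
      local_parts = openstack-discuss
      unseen
      transport = local_copy_delivery

    # hypothetical appendfile transport writing the copy to a local mbox;
    # delivery fails if the destination directory does not exist, which
    # matches the "not delivered" symptom described above
    local_copy_delivery:
      driver = appendfile
      file = /var/mail/copies/openstack-discuss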
clarkbThe other issue is that RH corp email appears to not be delivering to the new server. As far as we can tell this is because they use some service that is resolving lists.openstack.org to the old server which refuses smtp connections at this point19:08
clarkbNot much we can do for that other than bring it to the attention of others who might engage with this service. This is an ongoing effort. I just brought it up with the openstack TC which is made up of some RHers19:09
clarkbI think the last remaining tasks are to plan for cleaning up the old server19:10
fricklerwell we could enable exim on the old server again and make it forward to the new one19:10
clarkband maybe consider adding MX records alongside our A/AAAA records19:10
tonybthat's super strange19:10
fricklerat least for some time19:10
clarkbfrickler: I think I would -2 that.19:10
clarkbwe shouldn't need to act as a proxy for email in that way. And it will just prolong the need to keep around the 11.5 year old server that sometimes fails to boot19:11
fricklerI'm not saying we should do that, but it is something we could do to resolve the issue from our side19:11
frickleradding MX records doesn't sound wrong, either19:12
clarkbI think working around it will give the third party an excuse to not resolve it. Best we tackle it directly19:12
clarkbya if we add MX records we should do it for all of the list domains for consistency. I don't think it will help this issue but may make other senders happy19:12
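For anyone checking from outside, something along these lines shows what is published today versus what adding a record would look like; the hostnames come from this discussion, the MX priority and TTL are only illustrative:

    # what resolvers return today: A/AAAA records only, no MX
    dig +short A lists.openstack.org
    dig +short MX lists.openstack.org    # currently returns nothing

    # a hypothetical MX record pointing at the same host, if we chose to
    # publish one for each list domain:
    # lists.openstack.org.  3600  IN  MX  10 lists01.opendev.org.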
clarkbunfortunately fungi is at a conference so can't weigh in. Hopefully we can catch up on all the mailing list fun with him later this week19:13
clarkb#topic LE certcheck failures in Ansible19:14
clarkbWhile I was trying to get the exim config on lists01 updated to fix the undeliverable local copy errors I hit a problem with our LE jobs19:14
tonybI definitely see it as a RH bug and once it's been raised as an issue inside RH it's up to their infra teams to fix19:14
clarkbtonyb: fwiw hberaud and ralonsoh have been working it as affected users already aiui19:15
clarkbWhen compiling the list of domains to configure certcheck with we got `'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'`19:15
tonybokay19:15
clarkbthis error does not occur 100% of the time so I suspect some sort of weird ansible issue19:15
clarkbdigging through the info in the logs I wasn't able to find any nodes in the letsencrypt group that didn't have letsencrypt_certcheck_domains applied to them19:16
clarkb#link https://review.opendev.org/c/opendev/system-config/+/898475 Changes to LE roles to improve debugging19:16
clarkbI don't have a fix but did write up ^ to add more debugging to the system to hopefully make the problem more clear19:16
corvus(mx records should not be required; they kind of only muddy the waters; i agree they're not technically wrong though, and could be a voodoo solution to the problem.  just bringing that up in case anyone is under the impression that they are required and we are wrong for omitting them)19:16
clarkbannoyingly ansible does not report the iteration item when a loop iteration fails. It does report it when it succeeds....19:17
clarkbwhich makes loop failures like the one building the certcheck list difficult to debug19:17
clarkbmy change basically hacks around that by recording the info directly19:17
clarkbreviews welcome as is fresh eyes debugging if someone has time to look at the logs. I can probably even paste them somewhere after making sure no secrets are leaked if that helps19:18
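Not the actual content of 898475, but a sketch of the kind of extra recording that helps here, assuming the certcheck list is built by looping over the letsencrypt group:

    # hypothetical debug task run just before the failing loop, so the offending
    # host is identified even though Ansible doesn't report the loop item on failure
    - name: Record certcheck domains per letsencrypt host
      debug:
        msg: "{{ item }}: {{ hostvars[item]['letsencrypt_certcheck_domains'] | default('UNDEFINED') }}"
      loop: "{{ groups['letsencrypt'] }}"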
clarkb#topic Zuul not properly caching branches when they are created19:18
clarkbThis doesn't appear to be a 100% of the time problem either. But yesterday we noticed after a user reported jobs weren't running on a change that zuul seemed unaware of the branch that change was proposed to19:18
clarkbcorvus: theorized that this may be due to Gerrit emitting the ref-updated event that zuul processes before the git repos have the branch in them on disk (which zuul refers to when listing branches)19:19
clarkbthe long term fix for this is to have zuul query the gerrit api for branch listing which should be consistent with the events stream19:19
clarkbin the meantime we can force zuul to reload the affected zuul tenant which fixes the problem19:19
clarkbRun `docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` on scheduler to fix19:20
corvusi might nitpick the topic title here and point out that it may not actually be a zuul bug; there's a good chance that the issue might be described as "Gerrit doesn't report the correct set of branches over git under some circumstances".  but i agree that it manifests to users as "zuul doesn't have the right branch list" :)19:20
clarkbI did this yesterday and it took about 21 minutes but afterwards all was well19:20
tonybwhen someone reports it right?19:20
clarkbtonyb: yup19:20
corvusand yes, i think the next step in fixing is to switch to the gerrit rest api to see if it behaves better19:20
clarkbfair point. The cooperation between services is broken by data consistency expectations that don't hold :)19:21
corvusyes! :)19:21
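The REST call corvus mentions is Gerrit's standard branch listing; a quick manual check against a project looks roughly like this (the project name is just an example):

    # list branches via the Gerrit REST API; the first line of the response is
    # Gerrit's ")]}'" XSSI guard, so strip it before parsing as JSON
    curl -s "https://review.opendev.org/projects/opendev%2Fsystem-config/branches/" \
      | tail -n +2 | python3 -m json.tool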
clarkb#topic Server Upgrades19:21
clarkbNo new server upgrades. Some services have been upgraded though. More on that later19:22
clarkb#topic InMotion/OnMetal Cloud Redeployment19:22
clarkb#undo19:22
opendevmeetRemoving item from minutes: #topic InMotion/OnMetal Cloud Redeployment19:22
clarkb#topic InMotion/OpenMetal Cloud Redeployment19:22
clarkbI always want to type OnMetal because that was Rax's thing19:22
clarkbAfter discussing this last week I think I'm leaning towards doing a single redeployment early next year. That way we get all the new goodness with the least amount of effort19:23
clarkbthe main resource we tend to lack is time so minimizing time required to use services and tools seems important to me19:23
frickler+119:24
tonyb+119:24
clarkbwe can always change our mind later if we find a new good reason to deploy sooner. But until then I'm happy as is.19:24
clarkb#topic Python Container Updates19:26
clarkb#link https://review.opendev.org/q/(+topic:bookworm-python3.11+OR+hashtag:bookworm+)status:open19:26
clarkbThe end of this process is in sight. Everything but zuul/zuul-operator and openstack/python-openstackclient is now on python3.11. Everything on Python3.11 is on bookworm except for zuul/zuul-registry19:26
clarkbI have a change to fixup some of the job dependencies (something we missed when making the other changes) and then another change to drop python3.9 entirely as nothing is using it19:27
clarkbOnce zuul-operator and openstackclient move to python3.11 we can drop the python3.10 builds too19:27
tonybNice19:27
clarkbAnd then we can look into adding python3.12 image builds, but I don't think this is urgent as we don't have a host platform outside of the containers for running things like linters and unittests. But having the images ready would be nice19:28
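For reference, the bump in these changes is essentially a base image tag change; a hypothetical Dockerfile line, assuming the opendevorg python-builder naming used for these images, would move like this:

    # before: FROM docker.io/opendevorg/python-builder:3.10-bullseye AS builder
    # after:
    FROM docker.io/opendevorg/python-builder:3.11-bookworm AS builder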
clarkb#topic Gita 1.21 Upgrade19:29
tonyb+119:29
clarkb#undo19:29
opendevmeetRemoving item from minutes: #topic Gita 1.21 Upgrade19:29
clarkb#topic Gitea 1.21 Upgrade19:29
clarkbI cannot type today19:29
clarkbNothing really new here. Upstream hasn't produced a new rc or final release so there is no changelog yet19:30
clarkbHopefully we get one soon so that we can plan key rotations if we deem that necessary as well as the gitea upgrade proper19:30
clarkb#topic Zookeeper 3.8 Upgrade19:30
clarkbThis wasn't on the agenda I sent out because updates happened in docker hub this morning. I decided to go ahead and upgrade the zookeeper cluster to 3.8.3 today after new images with some bug fixes became available19:31
clarkbThis is now done. All three nodes are updated and the cluster seems happy19:31
clarkb#link https://review.opendev.org/c/opendev/system-config/+/898614 check myid in zookeeper testing19:31
clarkbThis change came out of one of the things corvus was checking during the upgrade. Basically a sanity check that the zookeeper node recognizes its own id properly19:32
clarkbThe main motivation behind this is that 3.9 is out now, which means 3.8 is the current stable release. Now we're caught up and getting all the latest updates19:32
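A rough manual equivalent of the kind of myid sanity check 898614 adds to testing; the data path and the whitelisting of the "conf" four-letter command are assumptions here:

    # compare the on-disk id with what the running server reports about itself
    cat /var/zookeeper/data/myid                    # path depends on dataDir in zoo.cfg
    echo conf | nc localhost 2181 | grep serverId   # needs conf in 4lw.commands.whitelist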
clarkb#topic Ansible 8 Upgrade for OpenDev Control Plane19:33
clarkbAnother topic that didn't make it on the agenda. This was entirely my fault as I knew about it but failed to add it19:33
clarkb#link https://review.opendev.org/c/opendev/system-config/+/898505 Update ansible on bridge to ansible 819:33
clarkbI think it is time for us to catch up on ansible releases for our control plane. Zuul has been happy with it and the migration seemed straightforward19:33
clarkbI do note that change did not run many system-config-run-* jobs against ansible 8 so we should modify the change to trigger more of those to get good coverage with the new ansible version before merging it. I've got that as a todo for later today19:34
clarkbassuming our system-config-run jobs are happy it should be very safe to land. Just need to monitor after it goes in to ensure the upgrade is successful and we didn't miss any compatibility issues19:35
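A simple post-upgrade spot check, assuming the bridge install layout stays the same; the useful fact is that the Ansible 8 community package pins ansible-core 2.15:

    # after 898505 lands, the version reported on bridge should be from the 2.15 series
    ansible --version    # expect "ansible [core 2.15.x]" once Ansible 8 is installed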
clarkb#topic Open Discussion19:35
clarkbAnything else?19:35
fricklercoming back to lists.openstack.org, one possible issue occurred to me19:35
fricklerfor the old server, rdns pointed back to lists.openstack.org, now we have lists01.opendev.org19:36
fricklerso in fact doing an MX record pointing to the latter might be more correct19:36
clarkbfrickler: you think they may only accept forward records that have matching reverse records?19:36
fricklerI know some people do when receiving mail, not sure how strict things are when sending19:37
clarkbya may be worth a try with MX records I guess then. Though I'd like fungi to weigh in on that before we take action since he has been driving this whole thing19:37
frickleralso in the SMTP dialog the server identifies as lists0119:37
corvushrm, i have received 2 different A responses for lists.openstack.org.  it's possible the old one was cached19:37
corvuslists.openstack.org.    30    IN    A    50.56.173.22219:38
corvuslists.openstack.org.    21    IN    A    162.209.78.7019:38
fricklercorvus: from where did you receive those? first is the old IP19:38
corvusjust a local lookup; so it's entirely possible it's some old cache on my router19:38
corvusi'll keep an eye out and let folks know if it flaps back19:39
clarkbthanks. I have only received the new ip from my local resolver, the authoritative servers, google, cloudflare, and quad9 so far19:39
clarkbif that is more consistent it may be a thread to pull on19:39
corvusjava is famous for not respecting ttls; so if rh has some java thing involved, that could be related19:40
clarkbI pushed an update to https://gerrit-review.googlesource.com/c/plugins/replication/+/387314 earlier today. It still doesn't pass all tests but passes my new tests and I'm hoping I can get feedback on the approach before doing the work to make all test cases pass and fix one known issue19:40
clarkbcorvus: yup java 5 and older ignored ttls by default using only the first resolved values. Then after that this behavior became configured19:41
clarkb*configurable19:41
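The JVM knob in question, for anyone chasing this on the RH side: positive DNS lookups are cached according to the networkaddress.cache.ttl security property (-1 means cache forever), with sun.net.inetaddr.ttl as the legacy system-property override; the jar name below is made up:

    # where the default lives in a JDK install (path varies by JDK version/packaging)
    grep -n "networkaddress.cache.ttl" "$JAVA_HOME/conf/security/java.security"
    # hypothetical override forcing the JVM to honor a short cache lifetime
    java -Dsun.net.inetaddr.ttl=60 -jar some-mail-relay.jar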
corvusif a server is unhappy about forward/reverse dns matching, an mx record probably won't help that.  the important thing is that the forward dns of the helo matches the reverse dns19:42
tonybI'll check with the affected users and make sure an internal ticket is raised19:42
corvus(and that the A record for the name returned by the PTR matches the incoming IP)19:42
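A quick way to see the pieces corvus is describing line up; the IP and hostname are taken from earlier in the discussion:

    # 1) the name the server uses in its HELO/EHLO: lists01.opendev.org
    # 2) the PTR for the sending IP should return that same name
    dig +short -x 162.209.78.70
    # 3) and the A record for that name should return the sending IP
    dig +short A lists01.opendev.org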
clarkbcorvus: I feel like this needs pictures :)19:43
clarkbwith lots of arrows19:43
tonyblol19:43
corvusimagine a cat playing with a ball of yarn19:44
clarkbsounds like that may be it. We can talk dns and smtp with fungi when he is able and take it from there19:45
clarkbthank you for your time everyone and sorry I was a few minutes late19:45
corvusthank you clarkb :)19:45
clarkbI think we will have a meeting next week during the PTG since our normal meeting time is outside of PTG times, and I don't think I'm going to be super busy with the PTG this time around, but that may change19:45
clarkb#endmeeting19:45
opendevmeetMeeting ended Tue Oct 17 19:45:41 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:45
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.html19:45
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.txt19:45
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.log.html19:45
tonybthanks all19:45
