clarkb | hey it is meeting time | 19:04 |
---|---|---|
clarkb | sorry I got distracted by the followup to the tc meeting discussing mail delivery problems | 19:04 |
clarkb | I suspect that it will be a quiet day for the meeting since people are busy but I can run through the topics really quickly anyway | 19:05 |
clarkb | #startmeeting infra | 19:05 |
opendevmeet | Meeting started Tue Oct 17 19:05:19 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:05 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:05 |
opendevmeet | The meeting name has been set to 'infra' | 19:05 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/DBCAGOTUNVOC2NLM4FATGKZK6GTZRJQ5/ Our Agenda | 19:05 |
clarkb | #topic Announcements | 19:05 |
clarkb | The PTG is happening next week from October 23-27. Please keep this in mind as we make changes to our hosted systems | 19:05 |
clarkb | #topic Mailman 3 | 19:06 |
clarkb | All of our lists have been migrated into lists01.opendev.org and are running with mailman 3 | 19:06 |
clarkb | thank you fungi for getting this mostly over the finish line | 19:06 |
clarkb | There are two issues (one outstanding) that should be called out | 19:07 |
clarkb | The first is that we had exim configured to copy deliveries to openstack-discuss to a local mailbox. We had done this on the old server to debug dmarc issues | 19:07 |
clarkb | Exim was unable to make these local copy deliveries because the dest dir didn't exist. Senders were getting "not delivered" emails as a result | 19:07 |
clarkb | The periodic jobs at ~02:00 Monday should have fixed this as we landed a change to remove the local copy config in exim entirely | 19:08 |
clarkb | I sent email today and can report back tomorrow if I got the "not delivered" message for it (it arrived 24 hours later last time) | 19:08 |
clarkb | The other issue is that RH corp email appears to not be delivering to the new server. As far as we can tell this is because they use some service that is resolving lists.openstack.org to the old server which refuses smtp connections at this point | 19:08 |
clarkb | Not much we can do for that other than bring it to the attention of others who might engage with this service. This is an ongoing effort. I just brought it up with the openstack TC which is made up of some RHers | 19:09 |
clarkb | I think the last remaining tasks are to plan for cleaning up the old server | 19:10 |
frickler | well we could enable exim on the old server again and make it forward to the new one | 19:10 |
clarkb | and maybe consider adding MX records alongside our A/AAAA records | 19:10 |
tonyb | that's super strange | 19:10 |
frickler | at least for some time | 19:10 |
clarkb | frickler: I think I would -2 that. | 19:10 |
clarkb | we shouldn't need to act as a proxy for email in that way. And it will just prolong the need to keep the 11.5-year-old server (which sometimes fails to boot) around | 19:11 |
frickler | I'm not saying we should do that, but it is something we could do to resolve the issue from our side | 19:11 |
frickler | adding MX records doesn't sound wrong, either | 19:12 |
clarkb | I think working around it will give the third party an excuse to not resolve it. Best we tackle it directly | 19:12 |
clarkb | ya if we add MX records we should do it for all of the list domains for consistency. I don't think it will help this issue but may make other senders happy | 19:12 |
clarkb | unfortunately fungi is at a conference so can't weigh in. Hopefully we can catch up on all the mailing list fun with him later this week | 19:13 |
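As background for the MX record discussion above, here is a minimal sketch of how one might inspect the current records for the list domains. It assumes the third-party dnspython package, and the domain list is illustrative rather than exhaustive:

```python
# Hedged sketch: compare MX and A/AAAA records for the list domains.
# Assumes the third-party dnspython package; the domain list is illustrative.
import dns.resolver

LIST_DOMAINS = ["lists.opendev.org", "lists.openstack.org"]

for domain in LIST_DOMAINS:
    for rtype in ("MX", "A", "AAAA"):
        try:
            answers = dns.resolver.resolve(domain, rtype)
            records = ", ".join(rr.to_text() for rr in answers)
            print(f"{domain} {rtype}: {records}")
        except dns.resolver.NoAnswer:
            # With no MX record, SMTP senders fall back to the A/AAAA records.
            print(f"{domain} {rtype}: (no records)")
        except dns.resolver.NXDOMAIN:
            print(f"{domain}: domain does not exist")
            break
```

As noted in the discussion, MX records are optional for receiving mail; when none exist, SMTP senders fall back to the A/AAAA records, so adding MX records would be for consistency and sender-side happiness rather than correctness.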
clarkb | #topic LE certcheck failures in Ansible | 19:14 |
clarkb | While I was trying to get the exim config on lists01 updated to fix the undeliverable local copy errors I hit a problem with our LE jobs | 19:14 |
tonyb | I definitely see it as an RH bug and once it's been raised as an issue inside RH it's up to their infra teams to fix | 19:14 |
clarkb | tonyb: fwiw hberaud and ralonsoh have been working it as affected users already aiui | 19:15 |
clarkb | When compiling the list of domains to configure certcheck with we got `'ansible.vars.hostvars.HostVarsVars object' has no attribute 'letsencrypt_certcheck_domains'` | 19:15 |
tonyb | okay | 19:15 |
clarkb | this error does not occur 100% of the time so I suspect some sort of weird ansible issue | 19:15 |
clarkb | digging through the info in the logs I wasn't able to find any nodes in the letsencrypt group that didn't have letsencrypt_certcheck_domains applied to them | 19:16 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/898475 Changes to LE roles to improve debugging | 19:16 |
clarkb | I don't have a fix but did write up ^ to add more debugging to the system to hopefully make the problem more clear | 19:16 |
corvus | (mx records should not be required; they kind of only muddy the waters; i agree they're not technically wrong though, and could be a voodoo solution to the problem. just bringing that up in case anyone is under the impression that they are required and we are wrong for omitting them) | 19:16 |
clarkb | annoyingly ansible does not report the iteration item when a loop iteration fails. It does report it when it succeeds.... | 19:17 |
clarkb | that makes loop failures like the one building the certcheck list difficult to debug | 19:17 |
clarkb | my change basically hacks around that by recording the info directly | 19:17 |
clarkb | reviews welcome as is fresh eyes debugging if someone has time to look at the logs. I can probably even paste them somewhere after making sure no secrets are leaked if that helps | 19:18 |
clarkb | #topic Zuul not properly caching branches when they are created | 19:18 |
clarkb | This doesn't appear to be a 100% of the time problem either. But yesterday we noticed after a user reported jobs weren't running on a change that zuul seemed unaware of the branch that change was proposed to | 19:18 |
clarkb | corvus: theorized that this may be due to Gerrit emitting the ref-updated event that zuul processes before the git repos have the branch in them on disk (which zuul consults to list branches) | 19:19 |
clarkb | the long term fix for this is to have zuul query the gerrit api for branch listing which should be consistent with the events stream | 19:19 |
clarkb | in the meantime we can force zuul to reload the affected zuul tenant which fixes the problem | 19:19 |
clarkb | Run `docker exec zuul-scheduler_scheduler_1 zuul-scheduler tenant-reconfigure openstack` on scheduler to fix | 19:20 |
corvus | i might nitpick the topic title here and point out that it may not actually be a zuul bug; there's a good chance that the issue might be described as "Gerrit doesn't report the correct set of branches over git under some circumstances". but i agree that it manifests to users as "zuul doesn't have the right branch list" :) | 19:20 |
clarkb | I did this yesterday and it took about 21 minutes but afterwards all was well | 19:20 |
tonyb | when someone reports it right? | 19:20 |
clarkb | tonyb: yup | 19:20 |
corvus | and yes, i think the next step in fixing is to switch to the gerrit rest api to see if it behaves better | 19:20 |
clarkb | fair point. The cooperation between services is broken by data consistency expectations that don't hold :) | 19:21 |
corvus | yes! :) | 19:21 |
clarkb | #topic Server Upgrades | 19:21 |
clarkb | No new server upgrades. Some services have been upgraded though. More on that later | 19:22 |
clarkb | #topic InMotion/OnMetal Cloud Redeployment | 19:22 |
clarkb | #undo | 19:22 |
opendevmeet | Removing item from minutes: #topic InMotion/OnMetal Cloud Redeployment | 19:22 |
clarkb | #topic InMotion/OpenMetal Cloud Redeployment | 19:22 |
clarkb | I always want to type OnMetal because that was Rax's thing | 19:22 |
clarkb | After discussing this last week I think I'm leaning towards doing a single redeployment early next year. That way we get all the new goodness with the least amount of effort | 19:23 |
clarkb | the main resource we tend to lack is time so minimizing time required to use services and tools seems important to me | 19:23 |
frickler | +1 | 19:24 |
tonyb | +1 | 19:24 |
clarkb | we can always change our mind later if we find a new good reason to deploy sooner. But until then I'm happy as is. | 19:24 |
clarkb | #topic Python Container Updates | 19:26 |
clarkb | #link https://review.opendev.org/q/(+topic:bookworm-python3.11+OR+hashtag:bookworm+)status:open | 19:26 |
clarkb | The end of this process is in sight. Everything but zuul/zuul-operator and openstack/python-openstackclient are now on python3.11. Everything on Python3.11 is on bookworm except for zuul/zuul-registry | 19:26 |
clarkb | I have a change to fixup some of the job dependencies (something we missed when making the other changes) and then another change to drop python3.9 entirely as nothing is using it | 19:27 |
clarkb | Once zuul-operator and openstackclient move to python3.11 we can drop the python3.10 builds too | 19:27 |
tonyb | Nice | 19:27 |
clarkb | And then we can look into adding python3.12 image builds, but I don't think this is urgent as we don't have a host platform outside of the containers for running things like linters and unittests. But having the images ready would be nice | 19:28 |
clarkb | #topic Gita 1.21 Upgrade | 19:29 |
tonyb | +1 | 19:29 |
clarkb | #undo | 19:29 |
opendevmeet | Removing item from minutes: #topic Gita 1.21 Upgrade | 19:29 |
clarkb | #topic Gitea 1.21 Upgrade | 19:29 |
clarkb | I cannot type today | 19:29 |
clarkb | Nothing really new here. Upstream hasn't produced a new rc or final release so there is no changelog yet | 19:30 |
clarkb | Hopefully we get one soon so that we can plan key rotations if we deem that necessary as well as the gitea upgrade proper | 19:30 |
clarkb | #topic Zookeeper 3.8 Upgrade | 19:30 |
clarkb | This wasn't on the agenda I sent out because updates happened in docker hub this morning. I decided to go ahead and upgrade the zookeeper cluster to 3.8.3 today after new images with some bug fixes became available | 19:31 |
clarkb | This is now done. All three nodes are updated and the cluster seems happy | 19:31 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/898614 check myid in zookeeper testing | 19:31 |
clarkb | This change came out of one of the things corvus was checking during the upgrade. Basically a sanity check that the zookeeper node recognizes its own id properly | 19:32 |
clarkb | The main motivation behind this upgrade is that 3.9 is out now which means 3.8 is the current stable release. Now we're caught up and getting all the latest updates | 19:32 |
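The myid sanity check mentioned above could look roughly like the sketch below. The data directory path, port, and use of the "conf" four-letter command (which has to be allowed via 4lw.commands.whitelist) are assumptions for illustration, not a description of the actual system-config test:

```python
# Hedged sketch: compare the on-disk myid with the serverId the running
# ZooKeeper server reports. Paths, port, and use of the "conf" four-letter
# command are assumptions; the real test may check this differently.
import socket

MYID_FILE = "/var/zookeeper/data/myid"   # assumed data dir layout
HOST, PORT = "localhost", 2181

def four_letter(cmd: str) -> str:
    """Send a ZooKeeper four-letter command and return the raw response."""
    with socket.create_connection((HOST, PORT), timeout=5) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while data := sock.recv(4096):
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")

with open(MYID_FILE) as f:
    myid = f.read().strip()

# "conf" output includes a serverId=N line when the node is in an ensemble.
server_id = None
for line in four_letter("conf").splitlines():
    if line.startswith("serverId="):
        server_id = line.split("=", 1)[1].strip()

assert server_id == myid, f"myid {myid} != reported serverId {server_id}"
print(f"ok: node recognizes itself as server {myid}")
```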
clarkb | #topic Ansible 8 Upgrade for OpenDev Control Plane | 19:33 |
clarkb | Another topic that didn't make it on the agenda. This was entirely my fault as I knew about it but failed to add it | 19:33 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/898505 Update ansible on bridge to ansible 8 | 19:33 |
clarkb | I think it is time for us to catch up on ansible releases for our control plane. Zuul has been happy with it and the migration seemed straightforward | 19:33 |
clarkb | I do note that change did not run many system-config-run-* jobs against ansible 8 so we should modify the change to trigger more of those to get good coverage with the new ansible version before merging it. I've got that as a todo for later today | 19:34 |
clarkb | assuming our system-config-run jobs are happy it should be very safe to land. Just need to monitor after it goes in to ensure the upgrade is successful and we didn't miss any compatibility issues | 19:35 |
clarkb | #topic Open Discussion | 19:35 |
clarkb | Anything else? | 19:35 |
frickler | coming back to lists.openstack.org, one possible issue occurred to me | 19:35 |
frickler | for the old server, rdns pointed back to lists.openstack.org, now we have lists01.opendev.org | 19:36 |
frickler | so in fact doing an MX record pointing to the latter might be more correct | 19:36 |
clarkb | frickler: you think they may only accept forward records that have matching reverse records? | 19:36 |
frickler | I know some people do when receiving mail, not sure how strict things are when sending | 19:37 |
clarkb | ya may be worth a try with MX records I guess then. Though I'd like fungi to weigh in on that before we take action since he has been driving this whole thing | 19:37 |
frickler | also in the SMTP dialog the server identifies as lists01 | 19:37 |
corvus | hrm, i have received 2 different A responses for lists.openstack.org. it's possible the old one was cached | 19:37 |
corvus | lists.openstack.org. 30 IN A 50.56.173.222 | 19:38 |
corvus | lists.openstack.org. 21 IN A 162.209.78.70 | 19:38 |
frickler | corvus: from where did you receive those? first is the old IP | 19:38 |
corvus | just a local lookup; so it's entirely possible it's some old cache on my router | 19:38 |
corvus | i'll keep an eye out and let folks know if it flaps back | 19:39 |
clarkb | thanks. I have only received the new ip from my local resolver, the authoritative servers, google, cloudflare, and quad9 so far | 19:39 |
clarkb | if that is more consistent it may be a thread to pull on | 19:39 |
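A sketch of the cross-resolver comparison described here, assuming the third-party dnspython package and a few well-known public resolvers:

```python
# Hedged sketch: ask several public resolvers for the A record of
# lists.openstack.org and flag any disagreement between them.
import dns.resolver

NAME = "lists.openstack.org"
RESOLVERS = {
    "google": "8.8.8.8",
    "cloudflare": "1.1.1.1",
    "quad9": "9.9.9.9",
}

seen = {}
for label, server in RESOLVERS.items():
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    answers = resolver.resolve(NAME, "A")
    seen[label] = sorted(rr.address for rr in answers)
    print(f"{label:10s} {seen[label]}")

if len({tuple(v) for v in seen.values()}) > 1:
    print("resolvers disagree; worth pulling on that thread")
else:
    print("all resolvers agree")
```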
corvus | java is famous for not respecting ttls; so if rh has some java thing involved, that could be related | 19:40 |
clarkb | I pushed an update to https://gerrit-review.googlesource.com/c/plugins/replication/+/387314 earlier today. It still doesn't pass all tests but passes my new tests and I'm hoping I can get feedback on the approach before doing the work to make all test cases pass and fix one known issue | 19:40 |
clarkb | corvus: yup java 5 and older ignored ttls by default using only the first resolved values. Then after that this behavior became configurable | 19:41 |
corvus | if a server is unhappy about forward/reverse dns matching, an mx record probably won't help that. the important thing is that the forward dns of the helo matches the reverse dns | 19:42 |
tonyb | I'll check with the affected users and make sure an internal ticket is raised | 19:42 |
corvus | (and that the A record for the name returned by the PTR matches the incoming IP) | 19:42 |
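A small sketch of the checks corvus describes, using only the Python standard library; the IP and HELO name in the example invocation are placeholders rather than values taken from this incident:

```python
# Hedged sketch: given a sending IP and the name it uses in HELO/EHLO,
# verify that (1) the PTR for the IP points at a name whose A/AAAA records
# include that IP, and (2) the HELO name's forward records include the IP.
import socket

def forward_ips(name: str) -> set[str]:
    """All A/AAAA addresses for a name."""
    try:
        return {info[4][0] for info in socket.getaddrinfo(name, None)}
    except socket.gaierror:
        return set()

def check_sender(ip: str, helo_name: str) -> None:
    ptr_name, _, _ = socket.gethostbyaddr(ip)   # reverse (PTR) lookup
    print(f"PTR for {ip} -> {ptr_name}")

    ptr_ok = ip in forward_ips(ptr_name)        # PTR name resolves back to the IP
    helo_ok = ip in forward_ips(helo_name)      # HELO name resolves to the IP
    print(f"PTR name forward-confirms IP: {ptr_ok}")
    print(f"HELO name resolves to IP:     {helo_ok}")

# Example invocation with placeholder values:
# check_sender("203.0.113.10", "lists01.opendev.org")
```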
clarkb | corvus: I feel like this needs pictures :) | 19:43 |
clarkb | with lots of arrows | 19:43 |
tonyb | lol | 19:43 |
corvus | imagine a cat playing with a ball of yarn | 19:44 |
clarkb | sounds like that may be it. We can talk dns and smtp with fungi when he is able and take it from there | 19:45 |
clarkb | thank you for your time everyone and sorry I was a few minutes late | 19:45 |
corvus | thank you clarkb :) | 19:45 |
clarkb | I think we will have a meeting next week during the PTG since our normal meeting time is outside of PTG times and I don't think I'm going to be super busy with the PTG this time around but that may change | 19:45 |
clarkb | #endmeeting | 19:45 |
opendevmeet | Meeting ended Tue Oct 17 19:45:41 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:45 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.html | 19:45 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.txt | 19:45 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-10-17-19.05.log.html | 19:45 |
tonyb | thanks all | 19:45 |