Tuesday, 2023-08-22

clarkbmeeting time19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Aug 22 19:01:11 2023 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VRBBT25TOTXJG3L5SXKWV3EELG34UC5E/ Our Agenda19:01
clarkbWe've actually got a fairly full agenda so I may move quicker than I'd like. But we can always go back to discussing items at the end of our hour if we have time19:01
clarkb#topic Announcements19:01
clarkbThe service coordinator nomination period ends today. I haven't seen anyone nominate themselves yet. Does this mean everyone is happy with and prefers me to keep doing it?19:02
fungii'm happy to be not-it ;)19:02
fungi(also you're doing a great job!)19:02
corvusi can confirm 100% that is the correct interpretation19:02
clarkbok I guess I can make it official with an email later today before today ends to avoid any needless process confusion19:03
clarkbanything else to announce?19:04
clarkb#topic Infra root google account19:05
clarkbJust a quick note that I haven't tried to login yet so I have no news yet19:05
clarkbbut it is on my todo list and hopefully I can get to it soon. This week should be a bit less crazy for me than last week (half the visiting family is no longer here)19:05
clarkb#topic Mailman 319:05
clarkbfungi: we made some changes and got things to a good stable stopping point I think. What is next? mailman 3 upgrade?19:06
fungiwe merged the remaining fixes last week and the correct site names are showing up on archive pages now19:06
fungii've got a held node built with https://review.opendev.org/869210 and have just finished syncing a copy of all our production mm2 lists to it to run a test import19:07
fungi#link https://review.opendev.org/869210 Upgrade to latest Mailman 3 releases19:07
clarkboh right we wanted to make sure that upgrading wouldn't put us in a worse position for the 2 -> 3 migration19:08
fungii'll step through the migration steps in https://etherpad.opendev.org/p/mm3migration to make sure they're still working correctly19:08
fungi#link https://etherpad.opendev.org/p/mm3migration Mailman 3 Migration Plan19:08
fungiat which point we can merge the upgrade change to swap out the containers and start scheduling new domain migrations19:08
fungithat's where we're at currently19:09
clarkbsounds good. Let us know when you feel ready for us to review and approve the upgrade change19:09
clarkb#topic Gerrit Updates19:09
fungiwill do, thanks!19:09
clarkbI did email the gerrit list last week about the too many query terms problem frickler has run into with starred changes19:10
clarkbthey seem to acknowledge that this is less than ideal (one idea was to log/report the query in its entirety so that you could use that information to find the starred changes)19:10
clarkbbut no suggestions for a workaround other than what we already know (bump index.maxTerms)19:11
clarkbno one said don't do that either so I think we can proceed with frickler's change and then monitor the resulting performance situation19:11
clarkb#link https://review.opendev.org/c/opendev/system-config/+/892057 Bump index.maxTerms to address starred changes limit.19:11
clarkbideally we can approve that and restart gerrit with the new config during a timeframe that frickler is able to confirm it has fixed the issue quickly (so that we can revert and/or debug further if necessary)19:12
clarkbfrickler: I know you couldn't attend the meeting today, but maybe we can sync up later this week on a good time to do that restart with you19:13
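For context, the setting being bumped lives in Gerrit's etc/gerrit.config; a minimal sketch of the relevant stanza (the value shown is illustrative, not necessarily what 892057 sets):

    [index]
        # Maximum number of terms allowed in a query; the default limit (1024)
        # is what the large starred-changes query exceeded.
        maxTerms = 2048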
clarkb#topic Server Upgrades19:13
clarkbno news here. Mostly a casualty of my traveling and having family around. I should have more time for this in the near future19:13
clarkb(I don't like replacing servers when I feel I'm not in a position to revert or debug unexpected problems)19:14
clarkb#topic Rax IAD image upload struggles19:14
clarkbnodepool image uploads to rax iad are timing out19:15
clarkbthis problem seems to get worse the more uploads you perform at the same time19:15
clarkbThe other two rackspace regions do not have this problem (or if they share the underlying mechanism it doesn't manifest as badly so we don't really care/notice)19:15
fungispecifically, the bulk of the time/increase seems to occur on the backend in glance after the upload completes19:16
clarkbfungi and frickler have been doing the bulk of the debugging (thank you)19:16
clarkbI think our end goal is to collect enough info that we can file a ticket with rackspace to see if this is something they can fixup19:16
fungiwhen we were trying to upload all our images, i clocked around 5 hours from end of an upload to when it would first start to appear in the image list19:16
fungiwhen we're not uploading any images, a single test upload is followed by around 30 minutes of time before it appears in the image list19:17
clarkband when you multiply the number of images we have by 30 minutes you end up with something suspiciously close to 5 hours19:18
fungiand yes, what we're mostly lacking right now is someone to have time to file a ticket and try to explain our observations in a way that doesn't come across as us having unrealistic expectations19:18
corvusin the mean time, are we increasing the upload timeout to accomodate 5h?19:19
clarkbI think a key part of doing that is showing that dfw and ord don't suffer from this which would indicate an actual problem with the region19:19
fungi"we're trying to upload 15 images in parallel, around 20gb each, and your service can't keep up" is likely to result in us being told "please stop that"19:19
clarkbfungi: note it should eventually balance out to an image an hour or so due to rebuild timelines19:20
clarkbbut due to the long processing times they all pile up instead19:20
fungicorvus: frickler piecemeal unpaused some images manually to get fresher uploads for our more frequently used images to complete19:20
fungihe did test by overriding the openstacksdk default wait timeout to something like 10 hours, just to confirm it did work around the issue19:21
clarkbthat isn't currently configurable in nodepool today? Or maybe we could do it with clouds.yaml?19:21
corvusoh this is an sdk-level timeout?  not the nodepool one?19:21
clarkbya19:21
fungiyes19:21
funginodepool doesn't currently expose a config option for passing the sdk timeout parameter19:22
fungiwe could add one, but also this seems like a pathological condition19:22
corvusstill, i think that would be an ok change.19:23
fungiyeah, i agree, i just don't think we're likely to want to actually set that long term19:23
corvusin general nodepool has a bunch of user-configurable timeouts for stuff like that because we know especially with clouds, things are going to be site dependent.19:23
corvusyeah, it's not an ideal solution, but, i think, an acceptable one.  :)19:24
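For reference, a minimal sketch of the openstacksdk cloud-layer call whose wait timeout frickler overrode (the cloud name, filename, and image formats here are illustrative assumptions, not what nodepool actually passes):

    import openstack

    # Connect using a clouds.yaml entry; 'rax-iad' is an illustrative name.
    conn = openstack.connect(cloud='rax-iad')

    # create_image() can block until glance marks the image active; the
    # timeout= argument is the sdk default that was bumped to ~10 hours
    # to work around the slow post-upload processing in IAD.
    image = conn.create_image(
        'test-upload',
        filename='test-image.vhd',
        disk_format='vhd',
        container_format='bare',
        wait=True,
        timeout=10 * 3600,  # seconds
    )
    print(image.status)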
clarkbother things to note. We had leaked images in all three regions. fungi cleared those out manually. I'm not sure why the leak detection and cleanup in nodepool didn't find them and take care of it for us.19:24
clarkband we had instances that we could not delete in all three regions that rackspace cleaned up for us after a ticket was submitted19:24
fungii cleared out around 1200 leaked images in iad (mostly due to the upload timeout problem i think, based on their ages). the other two regions have around 400 leaked images but have not been cleaned up yet19:25
corvusmaybe the leaked images had a different schema or something.  if it happens again, ping me and i can try to see why.19:25
clarkbthanks!19:25
fungicorvus: sure, we can dig deeper. it seemed outwardly that nodepool decided the images never got created19:25
fungibecause the sdk returned an error when it gave up waiting for them19:26
corvusah, could be missing the metadata entirely then too19:26
fungiyes, it could be missing followup steps the sdk would otherwise have performed19:26
fungiif the metadata is something that gets set post-upload19:27
fungi(from the cloud api side i mean)19:27
clarkbanything else on this topic?19:28
funginot from me19:28
clarkb#topic Fedora Cleanup19:28
clarkbI pushed two changes earlier today to start on this. Bindep is the only fedora-latest user in codesearch that is also in the zuul tenant config.19:29
clarkb#link https://review.opendev.org/c/opendev/base-jobs/+/892380 Cleanup fedora-latest in bindep19:29
clarkber that's the next change, one second while I copy-paste the bindep one properly19:29
clarkb#undo19:29
opendevmeetRemoving item from minutes: #link https://review.opendev.org/c/opendev/base-jobs/+/89238019:29
clarkb#link https://review.opendev.org/c/opendev/bindep/+/892378 Cleanup fedora-latest in bindep19:29
clarkbThis should be a very safe change. The next one which removes nodeset: fedora-latest is less safe because older branches may use it etc19:30
clarkb#link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove fedora-latest nodeset19:30
clarkbI'm personally inclined to land both of them and we can revert 892380 if something unexpected happens. but that nodeset doesn't work anyway so those jobs should already be broken19:30
clarkbWe can then look at cleaning things up from nodepool and the mirrors etc19:31
clarkblet me know if you disagree or find extra bits that need cleanup first19:31
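For readers following along, the object that 892380 removes is a Zuul nodeset definition in opendev/base-jobs; it looks roughly like this (the label name is illustrative, not necessarily what the repo contains):

    - nodeset:
        name: fedora-latest
        nodes:
          - name: fedora-latest
            label: fedora-36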
clarkb#topic Gitea 1.20 Upgrade19:32
fungithis cleanup also led to us rushing through an untested change for the zuul/zuul-jobs repo, we do need to remember that it has stakeholders beyond our deployment19:32
clarkbGitea has published a 1.20.3 release already. I think my impression that this is a big release with not a lot of new features is backed up by the amount of fixing they have had to do19:32
clarkbBut I think I've managed to work through all the documented breaking changes (and one undocumented breaking change)19:33
clarkbthere is a held node that seems to work here: https://158.69.78.38:3081/opendev/system-config19:33
clarkbThe main thing at this point would be to go over the change itself and that held node to make sure you are happy with the changes I had to make19:33
clarkband if so I can add the new necessary but otherwise completely ignored secret data to our prod secrets and merge the change when we can monitor it19:34
clarkb#link https://review.opendev.org/c/opendev/system-config/+/886993 Gitea 1.20 change19:34
clarkbthere is the change19:34
clarkblooks like fungi did +2 it. I should probably figure out those secrets then19:35
clarkbthanks for the review19:35
fungiit looked fine to me, but i won't be around today to troubleshoot if something goes sideways with the upgrade19:35
clarkbwe can approve tomorrow. I'm not in a huge rush other than simply wanting it off my todo list19:35
clarkb#topic Zuul changes and updates19:35
clarkbThere are a number of user facing changes that have been made or will be made very soon in zuul19:36
clarkbI want to make sure we're all aware of them and have some sort of plan for working through them19:36
clarkbFirst up Zuul has added Ansible 8 support. In the not too distant future Zuul will drop Ansible 6 support which is what we default to today19:36
clarkbin the past what we've done is asked our users to test new ansible against their jobs if they are worried and set a hard cutover date via tenant config in the future19:37
corvusand between those 2 events, zuul will switch the default from 6 to 819:37
fungialso it's skipping ansible 7 support, right?19:37
corvusyeah we're too slow, missed it.19:37
clarkbOpenStack Bobcat releases ~Oct 619:37
clarkbI'm thinking we switch all tenants to ansible 8 by default the week after that19:38
clarkbthough we should probably go ahead and switch the opendev tenant nowish. So all tenants that haven't switched yet on that date19:38
corvusoh that sounds extremely generous.  i think we can and should look into switching much earlier19:38
clarkbcorvus: my concern with that is openstack will have a revolt if anything breaks due to their CI already being super flaky and the release coming up19:39
corvuswhat if we switch opendev, wait a while, switch everyone but openstack, wait a while, and then switch openstack...19:39
fungiwfm19:39
clarkbthat is probably fine19:39
fricklerwith "a while" = 6 weeks that sounds fine19:40
corvusi think if we run into any problems, we can throw the brakes easily, but given that zuul (including zuul-jobs) switched with no changes, this might go smoothly...19:40
clarkbmaybe try to switch opendev this week, everyone else less openstack the week after if opendev is happy. Then do openstack after the bobcat release?19:40
fungisounds good19:40
clarkbthat way we can get an email out and give people at least some lead time19:40
corvuswell that's not really what i'm suggesting19:40
corvuswe could do a week between each and have it wrapped up by mid september19:41
clarkbhrm I think the main risk is that specific jobs break19:41
clarkband openstack is going to be angry about that given how unhappy openstack ci has been lately19:41
corvusor we could do opendev now, and then everyone else 1 week after that and wrap it up 2 weeks from now19:41
fricklerwhere is this urgency coming from?19:42
corvuswe actually should be doing this much more quickly and much more frequently19:42
fungibeing able to merge the change in zuul that drops ansible 6 and being able to continue upgrading the opendev deployment when that happens19:42
clarkbI think last time we went quickly with the idea that specific jobs could force ansible 6, but then other zuul config errors complicated that more than we anticipated19:42
corvuswe need to change ansible versions every 6 months to keep up with what's supported upstream19:42
corvusso i would like us to acclimate to this being somewhat frequent and not a big deal19:42
clarkband to be fair we've been pushing openstack to address those but largely only frickler has put any effort into it :/19:42
fungithough as of an hour ago it seems like we've got buy-in from the openstack tc to start deleting branches in repos with unfixed zuul config errors19:43
fricklerwhat's wrong with running unsupported ansible versions? people are still running python219:43
clarkbfrickler: at least one issue is the size of the installation for ansible. Every version you support creates massive bloat in your zuul executors19:44
corvusit's not our intention to use out-of-support ansible19:44
fricklerI don't think that that is compatible with the state openstack is in19:44
corvushow so?19:44
corvusdo we know that openstack runs jobs that don't work with ansible 8?19:45
fungiwe certainly can't encourage it, given zuul is designed to run untrusted payloads and ansible upstream won't be fixing security vulnerabilities in those versions19:45
fricklerthere is no developer capacity to adapt19:45
clarkbno we do not know that yet. JayF thought that some ironic jobs might use openstack collections though which are apparently not backward compatible in 819:45
fricklerif it all works fine, fine, but if not, adapting will take a long time19:45
corvuszuul's executor can't use collections...19:45
fungii don't see how ironic jobs would be using collections with the executor's ansible19:45
clarkbah19:45
fungiif they're using collections that would be in a nested ansible, so not affected by this19:46
clarkbmaybe a compromise would be to try it soonish and if things break in ways that aren't reasonable to address then we can revert to 6? But it sounds like we expect 8 to work so go for it until we have evidence to the contrary?19:46
corvusyeah, i am fully on-board with throwing the emergency brake lever if we find problems19:46
clarkbI'm not sure if installing the big pypi ansible package gets you the openstack stuff19:46
fricklerit does afaict19:47
corvusi don't think we should assume that everything will break.  :)19:47
fricklerthe problem is to find out what breaks, how will you do that?19:47
clarkbok proposal: switch opendev nowish. If that looks happy plan to switch everyone else less openstack early next week. If that looks good switch openstack late next week or early the week after19:47
fricklerwaiting for people to complain will not work19:47
clarkbfrickler: I mean if a tree falls in a forest and no one hears...19:48
clarkbI understand your concern but if no one is paying attention then we aren't going to solve that either way19:48
clarkbwe can however react to those who are paying attention19:48
clarkband I think that is the best we can do whether we wait a long time or a short time19:48
corvusthat works for me; if we want to give openstack more buffer around the release, then switching earlier may help.  either sounds good to me.19:48
clarkbdoesn't look like any of the TC took fungi's invitation to discuss it here so we can't get their input19:50
fungiwell, can't get their input in this meeting anyway19:51
clarkbcorvus: just pushed a change to switch opendev19:51
clarkblets land that asap and then if we don't have anything pushing the brakes by tomorrow I can send email to service-announce with the plan I sketched out above19:51
clarkbwe are running out of time though and I wanted to get to a few more items19:52
corvus#link  https://review.opendev.org/c/openstack/project-config/+/892405 Switch opendev tenant to Ansible 819:52
fricklerswitching the ansible version does work speculatively, right?19:52
clarkbfrickler: it does at a job level yes19:52
clarkbso we can ask people to test things before we switch them too19:52
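Concretely, the switch is a tenant-level default while individual projects can test ahead of time with a per-job override; a hedged sketch of both (job names other than the opendev tenant are hypothetical, and 892405 is not quoted verbatim here):

    # Tenant default in zuul's main.yaml (what a change like 892405 adjusts):
    - tenant:
        name: opendev
        default-ansible-version: '8'

    # Per-job override a project can use to test speculatively before the
    # tenant default flips:
    - job:
        name: my-project-test   # hypothetical job name
        ansible-version: '8'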
clarkbZuul is also planning to uncombine stdout and stderr in command/shell like tasks19:52
corvusthis one is riskier.19:53
clarkbI think this may be more likely to cause problems than the ansible 8 upgrade since it isn't always clear things are going to stderr particularly when it has just worked historically19:53
clarkbwe probably need to test this more methodically from a base jobs/base roles standpoint and then move up the job inheritance ladder19:53
corvusi think the best we can do there is speculatively test it with zuul-jobs first to see if anything explodes...19:54
corvusthen what clarkb said19:54
corvuswe might want to make new base jobs in opendev so we can flip tenants one at a time...?19:54
corvus(because we can change the per-tenant default base job...)19:54
clarkboh that is an interesting idea I hadn't considered19:55
corvusi would be very happy to let this one bake for quite a while in zuul....19:55
clarkbok so this is less urgent. That is good to know19:55
clarkbwe can probably punt on decisions for it now. But keep it in mind and let us know if you have any good ideas for testing that ahead of time19:56
corvuslike, long enough for lots of community members to upgrade their zuuls and try it out, etc.19:56
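A hedged sketch of corvus's per-tenant base job idea, assuming Zuul's default-parent tenant setting and assuming the new behavior can be toggled from a base job as suggested (the tenant and job names below are made up):

    - tenant:
        name: example-tenant
        # Point jobs with no explicit parent at a new base job that enables
        # the split stdout/stderr behavior, so tenants can be flipped one
        # at a time.
        default-parent: base-split-streams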
clarkbAnd finally early failure detection in tasks via regex matching of task output is in the works19:56
corvusthere's no ticking clock on this one...19:56
clarkback19:56
corvusthe regex change is being gated as we speak19:56
clarkbthe failure stuff is more "this is a cool feature you might want to take advantage of" it won't affect existing jobs without intervention19:57
corvuswe'll try it out in zuul and maybe come up with some patterns that can be copied19:57
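A hedged sketch of the kind of configuration the early-failure feature enables, assuming it lands as a job-level list of regexes (the attribute name and pattern below are illustrative, not confirmed syntax, since the change was still in the gate at the time):

    - job:
        name: my-project-test      # hypothetical job name
        # If streamed task output matches one of these patterns, zuul can
        # report the build as failing before the job finishes.
        failure-output:
          - 'FATAL: all hosts have failed'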
clarkb#topic Base container image updates19:58
clarkbreally quickly before we run out of time. We are now in a good spot to convert the consumers of the base container images to bookworm19:58
clarkbThe only one I really expect to maybe have problems is gerrit due to the java stuff and it may be a non-issue there since containers seem to have fewer issues with this19:58
corvusi think zuul is ready for that19:59
clarkbhelp appreciated updating any of our images :)19:59
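The conversion itself is typically a one-line change per consumer image; a hedged sketch (image names and tags are illustrative of the opendevorg base images, not copied from a specific Dockerfile):

    # A typical consumer Dockerfile bumps its base image tags from the
    # bullseye variants to the bookworm ones, e.g.:
    FROM docker.io/opendevorg/python-builder:3.11-bookworm as builder
    # ... build steps elided ...
    FROM docker.io/opendevorg/python-base:3.11-bookworm
    # ... runtime steps elided ...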
clarkb#topic Open Discussion19:59
clarkbAnything important last minute before we call it a meeting?19:59
funginothing here20:00
clarkbThank you everyone for your time. Feel free to continue discussion in our other venues20:00
clarkb#endmeeting20:01
opendevmeetMeeting ended Tue Aug 22 20:01:02 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:01
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.html20:01
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.txt20:01
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.log.html20:01
fungithanks clarkb!20:01

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!