clarkb | meeting time | 19:00 |
---|---|---|
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Aug 22 19:01:11 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VRBBT25TOTXJG3L5SXKWV3EELG34UC5E/ Our Agenda | 19:01 |
clarkb | We've actually got a fairly full agenda so I may move quicker than I'd like. But we can always go back to discussing items at the end of our hour if we have time | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | The service coordinator nomination period ends today. I haven't seen anyone nominate themselves yet. Does this mean everyone is happy with and prefers me to keep doing it? | 19:02 |
fungi | i'm happy to be not-it ;) | 19:02 |
fungi | (also you're doing a great job!) | 19:02 |
corvus | i can confirm 100% that is the correct interpretation | 19:02 |
clarkb | ok I guess I can make it official with an email later today, before the day ends, to avoid any needless process confusion | 19:03 |
clarkb | anything else to announce? | 19:04 |
clarkb | #topic Infra root google account | 19:05 |
clarkb | Just a quick note that I haven't tried to login yet so I have no news yet | 19:05 |
clarkb | but it is on my todo list and hopefully I can get to it soon. This week should be a bit less crazy for me than last week (half the visiting family is no longer here) | 19:05 |
clarkb | #topic Mailman 3 | 19:05 |
clarkb | fungi: we made some changes and got things to a good stable stopping point I think. What is next? mailman 3 upgrade? | 19:06 |
fungi | we merged the remaining fixes last week and the correct site names are showing up on archive pages now | 19:06 |
fungi | i've got a held node built with https://review.opendev.org/869210 and have just finished syncing a copy of all our production mm2 lists to it to run a test import | 19:07 |
fungi | #link https://review.opendev.org/869210 Upgrade to latest Mailman 3 releases | 19:07 |
clarkb | oh right we wanted to make sure that upgrading wouldn't put us in a worse position for the 2 -> 3 migration | 19:08 |
fungi | i'll step through the migration steps in https://etherpad.opendev.org/p/mm3migration to make sure they're still working correctly | 19:08 |
fungi | #link https://etherpad.opendev.org/p/mm3migration Mailman 3 Migration Plan | 19:08 |
fungi | at which point we can merge the upgrade change to swap out the containers and start scheduling new domain migrations | 19:08 |
fungi | that's where we're at currently | 19:09 |
clarkb | sounds good. Let us know when you feel ready for us to review and approve the upgrade change | 19:09 |
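A rough sketch of the per-list import commands this kind of migration plan typically exercises, for context only; the authoritative sequence is in the etherpad above, and the list name, paths, and Django settings below are placeholders.

```shell
# Placeholders throughout; run inside the mailman-core / mailman-web containers
# as appropriate, with Django settings already configured for hyperkitty_import.
LIST=service-discuss@lists.opendev.org

# Import the Mailman 2 list configuration and membership from its config.pck
mailman import21 "$LIST" /var/lib/mailman2/lists/service-discuss/config.pck

# Import the mbox archive into HyperKitty for the web archive pages
django-admin hyperkitty_import -l "$LIST" \
    /var/lib/mailman2/archives/private/service-discuss.mbox/service-discuss.mbox
```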
clarkb | #topic Gerrit Updates | 19:09 |
fungi | will do, thanks! | 19:09 |
clarkb | I did email the gerrit list last week about the too many query terms problem frickler has run into with starred changes | 19:10 |
clarkb | they seem to acknowledge that this is less than ideal (one idea was to log/report the query in its entirety so that you could use that information to find the starred changes) | 19:10 |
clarkb | but no suggestions for a workaround other than what we already know (bump index.maxTerms) | 19:11 |
clarkb | no one said don't do that either so I think we can proceed with frickler's change and then monitor the resulting performance situation | 19:11 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/892057 Bump index.maxTerms to address starred changes limit. | 19:11 |
clarkb | ideally we can approve that and restart gerrit with the new config during a timeframe that frickler is able to confirm it has fixed the issue quickly (so that we can revert and/or debug further if necessary) | 19:12 |
clarkb | frickler: I know you couldn't attend the meeting today, but maybe we can sync up later this week on a good time to do that restart with you | 19:13 |
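For reference, the setting under discussion is a single key in gerrit.config, and it only takes effect after a Gerrit restart, which is why the restart window matters; the value shown is illustrative only, the real one is whatever 892057 sets.

```ini
# gerrit.config excerpt; the value here is illustrative, not the one from 892057
[index]
    maxTerms = 10000
```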
clarkb | #topic Server Upgrades | 19:13 |
clarkb | no news here. Mostly a casualty of my traveling and having family around. I should have more time for this in the near future | 19:13 |
clarkb | (I don't like replacing servers when I feel I'm not in a position to revert or debug unexpected problems) | 19:14 |
clarkb | #topic Rax IAD image upload struggles | 19:14 |
clarkb | nodepool image uploads to rax iad are timing out | 19:15 |
clarkb | this problem seems to get worse the more uploads you perform at the same time | 19:15 |
clarkb | The other two rackspace regions do not have this problem (or if they share the underlying mechanism it doesn't manifest as badly so we don't really care/notice) | 19:15 |
fungi | specifically, the bulk of the time/increase seems to occur on the backend in glance after the upload completes | 19:16 |
clarkb | fungi and frickler have been doing the bulk of the debugging (thank you) | 19:16 |
clarkb | I think our end goal is to collect enough info that we can file a ticket with Rackspace to see if this is something they can fix up | 19:16 |
fungi | when we were trying to upload all our images, i clocked around 5 hours from the end of an upload to when it would first start to appear in the image list | 19:16 |
fungi | when we're not uploading any images, a single test upload is followed by around 30 minutes of time before it appears in the image list | 19:17 |
clarkb | and when you multiply the number of images we have by 30 minutes you end up with something suspiciously close to 5 hours | 19:18 |
fungi | and yes, what we're mostly lacking right now is someone to have time to file a ticket and try to explain our observations in a way that doesn't come across as us having unrealistic expectations | 19:18 |
corvus | in the meantime, are we increasing the upload timeout to accommodate 5h? | 19:19 |
clarkb | I think a key part of doing that is showing that dfw and ord don't suffer from this, which would indicate an actual problem with the region | 19:19 |
fungi | "we're trying to upload 15 images in parallel, around 20gb each, and your service can't keep up" is likely to result in us being told "please stop that" | 19:19 |
clarkb | fungi: note it should eventually balance out to an image an hour or so due to rebuild timelines | 19:20 |
clarkb | but due to the long processing times they all pile up instead | 19:20 |
fungi | corvus: frickler piecemeal unpaused some images manually to get fresher uploads for our more frequently used images to complete | 19:20 |
fungi | he did test by overriding the openstacksdk default wait timeout to something like 10 hours, just to confirm it did work around the issue | 19:21 |
clarkb | that isn't currently configurable in nodepool today? Or maybe we could do it with clouds.yaml? | 19:21 |
corvus | oh this is an sdk-level timeout? not the nodepool one? | 19:21 |
clarkb | ya | 19:21 |
fungi | yes | 19:21 |
fungi | nodepool doesn't currently expose a config option for passing the sdk timeout parameter | 19:22 |
fungi | we could add one, but also this seems like a pathological condition | 19:22 |
corvus | still, i think that would be an ok change. | 19:23 |
fungi | yeah, i agree, i just don't think we're likely to want to actually set that long term | 19:23 |
corvus | in general nodepool has a bunch of user-configurable timeouts for stuff like that because we know especially with clouds, things are going to be site dependent. | 19:23 |
corvus | yeah, it's not an ideal solution, but, i think, an acceptable one. :) | 19:24 |
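For context, the SDK-level knob being discussed is the wait/timeout pair on openstacksdk's image upload call. A minimal sketch of the kind of manual workaround described above, with the cloud name, image path, and exact timeout value all assumed:

```python
import openstack

# Assumed clouds.yaml entry name and image path; the 10h value mirrors the
# manual workaround mentioned above rather than any agreed-on setting.
conn = openstack.connect(cloud='rax', region_name='IAD')

image = conn.create_image(
    'ubuntu-jammy-test',
    filename='/opt/nodepool/images/ubuntu-jammy.vhd',
    disk_format='vhd',
    container_format='bare',
    wait=True,               # block until the image becomes active in glance
    timeout=10 * 60 * 60,    # much longer than the SDK's default wait
)
print(image.id, image.status)
```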
clarkb | other things to note. We had leaked images in all three regions. fungi cleared those out manually. I'm not sure why the leak detection and cleanup in nodepool didn't find them and take care of it for us. | 19:24 |
clarkb | and we had instances that we could not delete in all three regions that rackspace cleaned up for us after a ticket was submitted | 19:24 |
fungi | i cleared out around 1200 leaked images in iad (mostly due to the upload timeout problem i think, based on their ages). the other two regions have around 400 leaked images but have not been cleaned up yet | 19:25 |
corvus | maybe the leaked images had a different schema or something. if it happens again, ping me and i can try to see why. | 19:25 |
clarkb | thanks! | 19:25 |
fungi | corvus: sure, we can dig deeper. it seemed outwardly that nodepool decided the images never got created | 19:25 |
fungi | because the sdk returned an error when it gave up waiting for them | 19:26 |
corvus | ah, could be missing the metadata entirely then too | 19:26 |
fungi | yes, it could be missing followup steps the sdk would otherwise have performed | 19:26 |
fungi | if the metadata is something that gets set post-upload | 19:27 |
fungi | (from the cloud api side i mean) | 19:27 |
clarkb | anything else on this topic? | 19:28 |
fungi | not from me | 19:28 |
clarkb | #topic Fedora Cleanup | 19:28 |
clarkb | I pushed two changes earlier today to start on this. Bindep is the only fedora-latest user in codesearch that is also in the zuul tenant config. | 19:29 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Cleanup fedora-latest in bindep | 19:29 |
clarkb | er that's the next change, one second while I copy-paste the bindep one properly | 19:29 |
clarkb | #undo | 19:29 |
opendevmeet | Removing item from minutes: #link https://review.opendev.org/c/opendev/base-jobs/+/892380 | 19:29 |
clarkb | #link https://review.opendev.org/c/opendev/bindep/+/892378 Cleanup fedora-latest in bindep | 19:29 |
clarkb | This should be a very safe change. The next one which removes nodeset: fedora-latest is less safe because older branches may use it etc | 19:30 |
clarkb | #link https://review.opendev.org/c/opendev/base-jobs/+/892380 Remove fedora-latest nodeset | 19:30 |
clarkb | I'm personally inclined to land both of them and we can revert 892380 if something unexpected happens. but that nodeset doesn't work anyway so those jobs should already be broken | 19:30 |
clarkb | We can then look at cleaning things up from nodepool and the mirrors etc | 19:31 |
clarkb | let me know if you disagree or find extra bits that need cleanup first | 19:31 |
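For anyone reviewing 892380, this is roughly the shape of the nodeset definition being removed from opendev/base-jobs; the node label name here is an assumption.

```yaml
# Approximate shape of the definition 892380 removes; label name assumed.
- nodeset:
    name: fedora-latest
    nodes:
      - name: fedora-latest
        label: fedora-37
```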
clarkb | #topic Gitea 1.20 Upgrade | 19:32 |
fungi | this cleanup also led to us rushing through an untested change for the zuul/zuul-jobs repo; we do need to remember that it has stakeholders beyond our deployment | 19:32 |
clarkb | Gitea has published a 1.20.3 release already. I think my impression that this is a big release with not a lot of new features is backed up by the amount of fixing they have had to do | 19:32 |
clarkb | But I think I've managed to work through all the documented breaking changes (and one undocumented breaking change) | 19:33 |
clarkb | there is a held node that seems to work here: https://158.69.78.38:3081/opendev/system-config | 19:33 |
clarkb | The main thing at this point would be to go over the change itself and that held node to make sure you are happy with the changes I had to make | 19:33 |
clarkb | and if so I can add the new necessary but otherwise completely ignored secret data to our prod secrets and merge the change when we can monitor it | 19:34 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/886993 Gitea 1.20 change | 19:34 |
clarkb | there is the change | 19:34 |
clarkb | looks like fungi did +2 it. I should probably figure out those secrets then | 19:35 |
clarkb | thanks for the review | 19:35 |
fungi | it looked fine to me, but i won't be around today to troubleshoot if something goes sideways with the upgrade | 19:35 |
clarkb | we can approve tomorrow. I'm not in a huge rush other than simply wanting it off my todo list | 19:35 |
clarkb | #topic Zuul changes and updates | 19:35 |
clarkb | There are a number of user facing changes that have been made or will be made very soon in zuul | 19:36 |
clarkb | I want to make sure we're all aware of them and have some sort of plan for working through them | 19:36 |
clarkb | First up Zuul has added Ansible 8 support. In the not too distant future Zuul will drop Ansible 6 support which is what we default to today | 19:36 |
clarkb | in the past what we've done is asked our users to test new ansible against their jobs if they are worried and set a hard cutover date via tenant config in the future | 19:37 |
corvus | and between those 2 events, zuul will switch the default from 6 to 8 | 19:37 |
fungi | also it's skipping ansible 7 support, right? | 19:37 |
corvus | yeah we're too slow, missed it. | 19:37 |
clarkb | OpenStack Bobcat releases ~Oct 6 | 19:37 |
clarkb | I'm thinking we switch all tenants to ansible 8 by default the week after that | 19:38 |
clarkb | though we should probably go ahead and switch the opendev tenant nowish. So on that date we'd switch all tenants that haven't switched yet | 19:38 |
corvus | oh that sounds extremely generous. i think we can and should look into switching much earlier | 19:38 |
clarkb | corvus: my concern with that is openstack will have a revolt if anything breaks due to their CI already being super flaky and the release coming up | 19:39 |
corvus | what if we switch opendev, wait a while, switch everyone but openstack, wait a while, and then switch openstack... | 19:39 |
fungi | wfm | 19:39 |
clarkb | that is probably fine | 19:39 |
frickler | with "a while" = 6 weeks that sounds fine | 19:40 |
corvus | i think if we run into any problems, we can throw the brakes easily, but given that zuul (including zuul-jobs) switched with no changes, this might go smoothly... | 19:40 |
clarkb | maybe try to switch opendev this week, everyone else less openstack the week after if opendev is happy. Then do openstack after the bobcat release? | 19:40 |
fungi | sounds good | 19:40 |
clarkb | that way we can get an email out and give people at least some lead time | 19:40 |
corvus | well that's not really what i'm suggesting | 19:40 |
corvus | we could do a week between each and have it wrapped up by mid september | 19:41 |
clarkb | hrm I think the main risk is that specific jobs break | 19:41 |
clarkb | and openstack is going to be angry about that given how unhappy openstack ci has been lately | 19:41 |
corvus | or we could do opendev now, and then everyone else 1 week after that and wrap it up 2 weeks from now | 19:41 |
frickler | where is this urgency coming from? | 19:42 |
corvus | we actually should be doing this much more quickly and much more frequently | 19:42 |
fungi | being able to merge the change in zuul that drops ansible 6 and being able to continue upgrading the opendev deployment when that happens | 19:42 |
clarkb | I think last time we went quickly with the idea that specific jobs could force ansible 6, but then other zuul config errors complicated that more than we anticipated | 19:42 |
corvus | we need to change ansible versions every 6 months to keep up with what's supported upstream | 19:42 |
corvus | so i would like us to acclimate to this being somewhat frequent and not a big deal | 19:42 |
clarkb | and to be fair we've been pushing openstack to address those but largely only frickler has put any effort into it :/ | 19:42 |
fungi | though as of an hour ago it seems like we've got buy-in from the openstack tc to start deleting branches in repos with unfixed zuul config errors | 19:43 |
frickler | what's wrong with running unsupported ansible versions? people are still running python2 | 19:43 |
clarkb | frickler: at least one issue is the size of the installation for ansible. Every version you support creates massive bloat in your zuul executors | 19:44 |
corvus | it's not our intention to use out-of-support ansible | 19:44 |
frickler | I don't think that is compatible with the state openstack is in | 19:44 |
corvus | how so? | 19:44 |
corvus | do we know that openstack runs jobs that don't work with ansible 8? | 19:45 |
fungi | we certainly can't encourage it, given zuul is designed to run untrusted payloads and ansible upstream won't be fixing security vulnerabilities in those versions | 19:45 |
frickler | there is no developer capacity to adapt | 19:45 |
clarkb | no we do not know that yet. JayF thought that some ironic jobs might use openstack collections though which are apparently not backward compatible in 8 | 19:45 |
frickler | if it all works fine, fine, but if not, adapting will take a long time | 19:45 |
corvus | zuul's executor can't use collections... | 19:45 |
fungi | i don't see how ironic jobs would be using collections with the executor's ansible | 19:45 |
clarkb | ah | 19:45 |
fungi | if they're using collections that would be in a nested ansible, so not affected by this | 19:46 |
clarkb | maybe a compromise would be to try it soonish and if things break in ways that aren't reasonable to address then we can revert to 6? But it sounds like we expect 8 to work so go for it until we have evidence to the contrary? | 19:46 |
corvus | yeah, i am fully on-board with throwing the emergency brake lever if we find problems | 19:46 |
clarkb | I'm not sure if installing the big pypi ansible package gets you the openstack stuff | 19:46 |
frickler | it does afaict | 19:47 |
corvus | i don't think we should assume that everything will break. :) | 19:47 |
frickler | the problem is to find out what breaks, how will you do that? | 19:47 |
clarkb | ok proposal: switch opendev nowish. If that looks happy plan to switch everyone else less openstack early next week. If that looks good switch openstack late next week or early the week after | 19:47 |
frickler | waiting for people to complain will not work | 19:47 |
clarkb | frickler: I mean if a tree falls in a forest and no one hears... | 19:48 |
clarkb | I understand your concern but if no one is paying attention then we aren't going to solve that either way | 19:48 |
clarkb | we can however react to those who are paying attention | 19:48 |
clarkb | and I think that is the best we can do whether we wait a long time or a short time | 19:48 |
corvus | that works for me; if we want to give openstack more buffer around the release, then switching earlier may help. either sounds good to me. | 19:48 |
clarkb | doesn't look like any of the TC took fungi's invitation to discuss it here so we can't get their input | 19:50 |
fungi | well, can't get their input in this meeting anyway | 19:51 |
clarkb | corvus: just pushed a change to switch opendev | 19:51 |
clarkb | let's land that asap and then if we don't have anything pushing the brakes by tomorrow I can send email to service-announce with the plan I sketched out above | 19:51 |
clarkb | we are running out of time though and I wanted to get to a few more items | 19:52 |
corvus | #link https://review.opendev.org/c/openstack/project-config/+/892405 Switch opendev tenant to Ansible 8 | 19:52 |
frickler | switching the ansible version does work speculatively, right? | 19:52 |
clarkb | frickler: it does at a job level yes | 19:52 |
clarkb | so we can ask people to test things before we switch them too | 19:52 |
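The two knobs in play here are the tenant-level default and the per-job override. A sketch under the assumption that both remain available during the transition (the real tenant change is 892405, linked just above); required tenant sections are elided for brevity.

```yaml
# Tenant-wide default, set in the Zuul tenant config
# (connection/source definitions elided for brevity):
- tenant:
    name: opendev
    default-ansible-version: '8'

# Per-job opt-out for a job that needs more time on the old version:
- job:
    name: some-legacy-job
    ansible-version: '6'
```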
clarkb | Zuul is also planning to uncombine stdout and stderr in command/shell like tasks | 19:52 |
corvus | this one is riskier. | 19:53 |
clarkb | I think this may be more likely to cause problems than the ansible 8 upgrade since it isn't always clear things are going to stderr, particularly when it has just worked historically | 19:53 |
clarkb | we probably need to test this more methodically from a base jobs/base roles standpoint and then move up the job inheritance ladder | 19:53 |
corvus | i think the best we can do there is speculatively test it with zuul-jobs first to see if anything explodes... | 19:54 |
corvus | then what clarkb said | 19:54 |
corvus | we might want to make new base jobs in opendev so we can flip tenants one at a time...? | 19:54 |
corvus | (because we can change the per-tenant default base job...) | 19:54 |
clarkb | oh that is an interesting idea I hadn't considered | 19:55 |
corvus | i would be very happy to let this one bake for quite a while in zuul.... | 19:55 |
clarkb | ok so this is less urgent. That is good to know | 19:55 |
clarkb | we can probably punt on decisions for it now. But keep it in mind and let us know if you have any good ideas for testing that ahead of time | 19:56 |
corvus | like, long enough for lots of community members to upgrade their zuuls and try it out, etc. | 19:56 |
clarkb | And finally early failure detection in tasks via regex matching of task output is in the works | 19:56 |
corvus | there's no ticking clock on this one... | 19:56 |
clarkb | ack | 19:56 |
corvus | the regex change is being gated as we speak | 19:56 |
clarkb | the failure stuff is more "this is a cool feature you might want to take advantage of" it won't affect existing jobs without intervention | 19:57 |
corvus | we'll try it out in zuul and maybe come up with some patterns that can be copied | 19:57 |
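Purely as a hypothetical illustration of the feature corvus describes, since it had not merged at meeting time; the attribute name (failure-output) and exact semantics are assumptions based on the discussion above.

```yaml
# Hypothetical: attribute name and behavior assumed, pending the change
# that was in the gate during this meeting.
- job:
    name: some-long-integration-job
    failure-output:
      - 'FATAL: database migration failed'
      - 'ERROR: deployment step returned non-zero'
```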
clarkb | #topic Base container image updates | 19:58 |
clarkb | really quickly before we run out of time. We are now in a good spot to convert the consumers of the base container images to bookworm | 19:58 |
clarkb | The only one I really expect to maybe have problems is gerrit due to the java stuff, and it may be a non-issue there since containers seem to have fewer issues with this | 19:58 |
corvus | i think zuul is ready for that | 19:59 |
clarkb | help appreciated updating any of our images :) | 19:59 |
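For anyone picking up one of those image updates, the change is typically just the base image tags in the consumer's Dockerfile. A sketch assuming the opendevorg builder/base image pattern and that the bookworm tags are published:

```dockerfile
# Sketch only: tags assumed; the assemble / install-from-bindep helpers come
# from the opendevorg builder and base images respectively.
FROM docker.io/opendevorg/python-builder:3.11-bookworm as builder
COPY . /tmp/src
RUN assemble

FROM docker.io/opendevorg/python-base:3.11-bookworm
COPY --from=builder /output/ /output
RUN /output/install-from-bindep
```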
clarkb | #topic Open Discussion | 19:59 |
clarkb | Anything important last minute before we call it a meeting? | 19:59 |
fungi | nothing here | 20:00 |
clarkb | Thank you everyone for your time. Feel free to continue discussion in our other venues | 20:00 |
clarkb | #endmeeting | 20:01 |
opendevmeet | Meeting ended Tue Aug 22 20:01:02 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:01 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.html | 20:01 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.txt | 20:01 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-22-19.01.log.html | 20:01 |
fungi | thanks clarkb! | 20:01 |