clarkb | hello it is meeting time | 19:01 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Sep 5 19:01:10 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | It feels like yesterday was a holiday. So many things this morning | 19:01 |
fungi | yesterday felt less like a holiday than it could have | 19:01 |
ianychoi | o/ | 19:02 |
fungi | but them's the breaks | 19:02 |
ianychoi | I have not been aware of such holidays but hope that many people had great holidays :) | 19:02 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/F5B2EF7BWK62UQZCTHVGEKER4XFRDSIE/ Our Agenda | 19:02 |
clarkb | ianychoi: I did! it was the last day my parents were here visiting and we made smoked bbq meats | 19:03 |
clarkb | #topic Announcements | 19:03 |
clarkb | I have nothing | 19:03 |
ianychoi | Wow great :) | 19:03 |
ianychoi | (I will bring translation topics during open discussion time.. thanks!) | 19:04 |
clarkb | #topic Infra Root Google Account Activity | 19:05 |
clarkb | I have nothing to report here | 19:05 |
clarkb | I've still got it on the todo list | 19:06 |
clarkb | hopefully soon | 19:06 |
clarkb | #topic Mailman 3 | 19:06 |
clarkb | as mentioned last week fungi thinks we're ready to start migrating additional domains and I agree. That means we need to schedule a time to migrate lists.kata-containers.io and lists.airshipit.org | 19:06 |
fungi | yeah, so i've decided which are the most active lists on each site to notify | 19:07 |
fungi | it seems like thursday september 14 may be a good date to do those two imports | 19:07 |
fungi | if people agree, i can notify the airship-discuss and kata-dev mailing lists with a message similar to the one i sent to zuul-discuss when the lists.zuul-ci.org site was migrated | 19:08 |
fungi | is there a time of day which would make sense to have more interested parties around to handle comms or whatever might arise? | 19:09 |
frickler | I have no idea where the main timezones of those communities are located | 19:10 |
clarkb | airship was in US central time iirc | 19:10 |
clarkb | and kata is fairly global | 19:10 |
frickler | so likely US morning would be best suited to give you some room to handle possible fallout | 19:11 |
fungi | yeah, i'm more concerned with having interested sysadmins around for the maintenance window. the communities themselves will know to plan for the list archives to be temporarily unavailable and for mail deliveries to be deferred | 19:11 |
fungi | i'm happy to run the migration commands (though they're also documented in the planning pad and scripted in system-config too) | 19:12 |
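The authoritative commands are the ones in the planning pad and in system-config, as fungi says; purely as a rough sketch of what a Mailman 2 to 3 import involves (the list address and file paths below are placeholders, not the real ones), the upstream tooling is driven roughly like this:

```python
import subprocess

# Placeholder list address and paths; the real values come from the migration
# plan and the system-config scripts mentioned above.
mlist = "kata-dev@lists.kata-containers.io"

# Import the Mailman 2 list configuration pickle into Mailman 3.
subprocess.run(
    ["mailman", "import21", mlist,
     "/var/lib/mailman/lists/kata-dev/config.pck"],
    check=True,
)

# Import the old pipermail archives into HyperKitty (run with the usual
# DJANGO_SETTINGS_MODULE pointing at the mailman-web settings).
subprocess.run(
    ["django-admin", "hyperkitty_import", "-l", mlist,
     "/var/lib/mailman/archives/private/kata-dev.mbox/kata-dev.mbox"],
    check=True,
)
```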
clarkb | I'm happy to help anytime after about 15:30 UTC | 19:12 |
fungi | but having more people around in general at those times helps | 19:12 |
clarkb | and thursday the 14th should work for me anytime after then | 19:12 |
fungi | the migration process will probably take around an hour start to finish, and that includes dns propagation | 19:13 |
fungi | i'll revisit what we did for the first migrations, but basically we leveraged dns as a means of making incoming messages temporarily undeliverable until the migration was done, and then updated dns to point to the new server | 19:14 |
fungi | for kata it may be easier since it's on its own server already, the reason we did it via dns for the first migrations is that they shared a server with other sites that weren't moving at the same times | 19:15 |
clarkb | using DNS seemed to work fine though so we can stick to that | 19:16 |
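As an illustration of that DNS-based cutover, a small dnspython loop like the one below can confirm propagation before deliveries are un-deferred; the hostname is one of the sites being moved, but the address and the polling approach are just a sketch, not the actual runbook:

```python
import time

import dns.exception
import dns.resolver  # pip install dnspython

# Placeholder values: the real names and addresses live in the migration plan.
SITE = "lists.airshipit.org"
NEW_SERVER_IP = "203.0.113.10"  # documentation-range address, illustrative only


def migrated(name: str, expected: str) -> bool:
    """Return True once the A record for the list site points at the new host."""
    try:
        answers = dns.resolver.resolve(name, "A")
    except dns.exception.DNSException:
        return False
    return any(rr.address == expected for rr in answers)


while not migrated(SITE, NEW_SERVER_IP):
    time.sleep(30)
print(f"{SITE} now resolves to {NEW_SERVER_IP}; cutover has propagated")
```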
fungi | frickler: does 15:30-16:30 utc work well for you? | 19:16 |
fungi | if so i'll let the airship and kata lists know that's the time we're planning | 19:16 |
frickler | I'm not sure I'll be around, but time is fine in general | 19:17 |
frickler | *the time | 19:17 |
fungi | okay, no worries. thanks! | 19:17 |
clarkb | great see yall then | 19:17 |
clarkb | anything else mailman 3 related? | 19:17 |
fungi | let's go with 15:30-16:30 utc on thursday 2023-09-14 | 19:17 |
fungi | i'll send announcements some time tomorrow | 19:17 |
clarkb | thanks | 19:18 |
clarkb | #topic Server Upgrades | 19:19 |
clarkb | Nothing new here either | 19:20 |
clarkb | #topic Rax IAD image upload struggles | 19:20 |
clarkb | Lots of progress/news here though | 19:20 |
clarkb | fungi filed a ticket with rax and the response was essentially that iad is expected to behave differently | 19:20 |
fungi | yes, sad panda | 19:20 |
clarkb | this means we can't rely on the cloud provider to fix it for us. Instead we've reduced the number of upload threads to 1 per builder and increased the image rebuild time intervals | 19:21 |
fungi | though that prompted us to look at whether we're too aggressively updating our images | 19:21 |
clarkb | the idea here is that we don't actually need to rebuild every image constantly and can more conservatively update things in a way that the cloud region can hopefully keep up with | 19:21 |
corvus | what's the timing look like now? how long between update cycles? and how long does a full update cycle take? | 19:22 |
fungi | yeah, basically update the default nodeset's label daily, current versions of other distros every 2 days, and older versions of distros weekly | 19:22 |
frickler | but that was only merged 2 days ago, so no effect seen yet | 19:22 |
corvus | makes sense | 19:22 |
fungi | noting that the "default nodeset" isn't necessarily always consistent across tenants, but we can come up with a flexible policy there | 19:23 |
clarkb | fungi: we currently define it centrally in opendev/base-jobs though | 19:23 |
fungi | and this is an experiment in order to hopefully get reasonably current images for jobs while minimizing the load we put on our providers' image services | 19:23 |
frickler | looking at the upload ids, some images seem to have taken 4 attempts in iad to be successful | 19:23 |
corvus | maybe we could look at adjusting that to so that current versions of all distros are updated daily. once we have more data. | 19:24 |
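For context, the tiered cadence fungi describes maps onto nodepool's per-diskimage rebuild-age setting (expressed in seconds); the sketch below only models the 1/2/7-day policy to make the tiers concrete, and the label groupings are assumed rather than taken from the actual config:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed tier groupings; the real knob is the rebuild-age value (in seconds)
# on each diskimage in the nodepool builder configuration.
REBUILD_INTERVALS = {
    "default-nodeset": timedelta(days=1),   # e.g. ubuntu-jammy
    "current-distros": timedelta(days=2),   # current versions of other distros
    "older-distros": timedelta(days=7),     # older distro versions
}


def rebuild_due(tier: str, last_built: datetime, now: Optional[datetime] = None) -> bool:
    """Return True if an image in the given tier is older than its interval."""
    now = now or datetime.now(timezone.utc)
    return now - last_built >= REBUILD_INTERVALS[tier]


# A default-nodeset image built 30 hours ago is due; a 30-hour-old image in the
# every-2-days tier is not.
built = datetime.now(timezone.utc) - timedelta(hours=30)
print(rebuild_due("default-nodeset", built))   # True
print(rebuild_due("current-distros", built))   # False
```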
fungi | frickler: that also means we probably have new leaked images in iad; we should look at the metadata now to see if we can identify for sure why nodepool doesn't clean them up | 19:24 |
frickler | I was looking at nodepool stats in grafana and > 50% of the average used nodes were jammy | 19:24 |
frickler | fungi: yes | 19:24 |
fungi | corvus: yes, that also seems reasonable | 19:24 |
corvus | if you find a leaked image, ping me with the details please | 19:25 |
fungi | corvus: will do, thanks! | 19:25 |
fungi | i'll try to take a look after dinner | 19:25 |
fungi | i'll avoid cleaning them up until we have time to go over them | 19:26 |
corvus | it's probably enough to keep one around if you want to clean up others | 19:26 |
fungi | the handful we've probably leaked pales in comparison to the 1200 i cleaned up in iad recently | 19:26 |
fungi | so i'll probably just delay cleanup until we're satisfied | 19:27 |
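For the leak inspection itself, something along these lines with openstacksdk would list candidate images and their metadata without deleting anything; the cloud name, and the assumption that nodepool stamps its uploads with nodepool_* properties, are assumptions rather than verified details:

```python
import openstack

# Assumed clouds.yaml entry name; adjust to the real credentials for rax iad.
conn = openstack.connect(cloud="rax-iad")

for image in conn.image.images():
    props = image.properties or {}
    # Images carrying nodepool-style metadata but unknown to the builders are
    # leak candidates; print them for review rather than deleting anything.
    if any(key.startswith("nodepool") for key in props):
        print(image.id, image.name, image.status, props)
```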
clarkb | sounds like that is all for images | 19:29 |
clarkb | #topic Fedora cleanup | 19:29 |
clarkb | The nodeset removal from base-jobs landed | 19:29 |
clarkb | I've also seen some projects like ironic push changes to clean up their use of fedora | 19:29 |
clarkb | I think the next step is to actually remove the label (and images) from nodepool when we think people have had enough time to prepare | 19:30 |
clarkb | should we send an email announcing a date for that? | 19:30 |
frickler | well, none of that would still work, or was it only devstack that was broken? | 19:31 |
fungi | i thought devstack dropped fedora over a month ago | 19:31 |
fungi | dropped the devstack-defined fedora nodesets anyway | 19:31 |
frickler | but didn't all fedora testing stop working when they pulled their repo content? | 19:31 |
clarkb | frickler: yes, unless jobs got updated to pull from other locations | 19:32 |
fungi | yes, well anything that tried to install packages anyway | 19:32 |
clarkb | I don't think we need to wait very long as you are correct most things would be very broken | 19:32 |
clarkb | more of a final warning if anyone had this working somehow | 19:32 |
frickler | I don't think waiting a week or two will help anyone, but it also doesn't hurt us | 19:33 |
clarkb | ya I was thinking about a week | 19:33 |
clarkb | maybe announce removal for Monday? | 19:33 |
fungi | wfm | 19:33 |
frickler | ack | 19:33 |
clarkb | that gives me time to write the changes for it too :) | 19:33 |
clarkb | cool | 19:33 |
clarkb | #topic Zuul Ansible 8 Default | 19:34 |
clarkb | All of the OpenDev Zuul tenants are ansible 8 by default now | 19:34 |
clarkb | I haven't heard of or seen anyone needing to pin to ansible 6 either | 19:34 |
corvus | what's the openstack switcheroo date? | 19:34 |
fungi | yesterday | 19:35 |
clarkb | This doesn't need to be on the agenda for next week, but I wanted to make note of this and remind people to call out oddities if they see them | 19:35 |
frickler | it was yesterday | 19:35 |
frickler | there's only one concern that I mentioned earlier: we might not notice when jobs pass that actually should fail | 19:35 |
corvus | cool. :) sorry i misread comment from clark :) | 19:35 |
frickler | that might happen if the new ansible changed the error handling | 19:35 |
clarkb | frickler: yes, I think that risk exists but it seems to be a low probability | 19:35 |
clarkb | since ansible is fail by default if anything goes wrong generally | 19:35 |
fungi | frickler: i probably skimmed that comment too quickly earlier, what error handling changed in 8? | 19:36 |
fungi | or was it hypothetical? | 19:36 |
frickler | that was purely hypothetical | 19:36 |
fungi | okay, yes i agree that there are a number of hypothetical concerns with any upgrade | 19:36 |
fungi | since in theory any aspect of the software can change | 19:36 |
clarkb | yup mostly just be aware there may be behavior changes and if you see them please let zuul and opendev folks know | 19:38 |
fungi | from a practical standpoint, unless anyone has mentioned specific changes to error handling in ansible 8 i'm not going to lose sleep over the possibility of that sort of regression, but we should of course be mindful of the ever-present possibility | 19:38 |
clarkb | both of us will be interested in any observed differences even if they are minor | 19:38 |
clarkb | (one thing I want to look at if I ever find time is performance) | 19:38 |
clarkb | #topic Zuul PCRE regex support is deprecated | 19:39 |
clarkb | The automatic weekend upgrade of zuul pulled in changes to deprecate PCRE regexes within zuul. This results in warnings where regexes that re2 cannot support are used | 19:39 |
clarkb | There was a bug that caused these warnings to prevent new config updates from being usable. We tracked down and fixed those bugs and corvus restarted zuul schedulers outside of the automated upgrade system | 19:40 |
corvus | sorry for the disruption, and thanks for the help | 19:40 |
clarkb | Where that leaves us is opendev's zuul configs will need to be updated to remove pcre regexes. I don't think this is super urgent but cutting down on the warnings in the error list helps reduce noise | 19:40 |
fungi | no need to apologize, thanks for implementing a very useful feature | 19:41 |
fungi | and for the fast action on the fixes too | 19:41 |
corvus | #link https://review.opendev.org/893702 merged change from frickler for project-config | 19:41 |
corvus | #link https://review.opendev.org/893792 change to openstack-zuul-jobs | 19:42 |
corvus | that showed us an issue with zuul-sphinx which should be resolved soon | 19:42 |
corvus | i think those 2 changes will take care of most of the "central" stuff. | 19:42 |
corvus | after they merge, i can write a message letting the wider community know about the change, how to make updates, point at those changes, etc. | 19:43 |
fungi | i guess the final decision on the !zuul user match was that the \S trailer wasn't needed any longer? | 19:43 |
corvus | i agreed with that assessment and approved it | 19:43 |
fungi | okay, cool | 19:43 |
frickler | I also checked the git history of that line and it seemed to agree with that assessment | 19:44 |
fungi | yay for simplicity | 19:44 |
corvus | in general, there's a lot less line noise in our branch matchers now. :) | 19:44 |
fungi | thankfully | 19:44 |
corvus | also, quick reminder in case it's useful -- branches is a list, so you can make a whole list of positive and negated regexes if you need to. the list is boolean or. | 19:45 |
corvus | i haven't run into a case where that's necessary, or looks better than just a (a|b|c) sequence, but it's there if we need it. | 19:45 |
fungi | a list of or'ed negated regexes wouldn't work would it? | 19:46 |
corvus | i mean... it'd "work"... | 19:46 |
fungi | !a|!b would include everything | 19:46 |
opendevmeet | fungi: Error: "a|!b" is not a valid command. | 19:46 |
fungi | opendevmeet agrees | 19:46 |
opendevmeet | fungi: Error: "agrees" is not a valid command. | 19:46 |
fungi | and would like to subscribe to our newsletter | 19:46 |
corvus | but yes, probably of limited utility. :) | 19:46 |
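To make both points concrete with the stdlib re module (which, unlike re2, still accepts lookahead), here is roughly what moving from a PCRE-style negative lookahead to a negated positive matcher looks like, plus why OR-ing negated matchers ends up matching every branch:

```python
import re

branches = ["master", "stable/zed", "feature/foo"]

# PCRE-style negative lookahead: matches any branch that is not master.
# The stdlib re module compiles this fine, but re2 (and therefore Zuul's new
# matcher) rejects lookahead constructs entirely.
pcre_style = re.compile(r"^(?!master$).*")
print([b for b in branches if pcre_style.match(b)])  # ['stable/zed', 'feature/foo']

# The re2-friendly replacement is a positive regex that the config marks as
# negated; the effect is roughly this:
positive = re.compile(r"^master$")
print([b for b in branches if not positive.match(b)])  # ['stable/zed', 'feature/foo']

# And the pitfall from the discussion: OR-ing two *negated* matchers matches
# every branch, because any branch fails to match at least one of them.
neg_a = re.compile(r"^master$")
neg_b = re.compile(r"^stable/zed$")
print([b for b in branches if (not neg_a.match(b)) or (not neg_b.match(b))])
# ['master', 'stable/zed', 'feature/foo']  -- i.e. everything
```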
clarkb | anything else on regexes? | 19:47 |
corvus | nak | 19:47 |
clarkb | have a couple more things to get to and we are running out of time. Thanks | 19:47 |
clarkb | #topic Bookworm updates | 19:47 |
clarkb | #link https://review.opendev.org/q/hashtag:bookworm+status:open Next round of image rebuilds onto bookworm. | 19:47 |
clarkb | I think we're ready to proceed on this with zuul if zuul is ready. But nodepool may pose problems between ext4 options and older grub | 19:47 |
clarkb | I am helping someone debug this today and I'm not sure yet if bookworm is affected. But generally you can create an ext4 fs that grub doesn't like using newer mkfs | 19:48 |
fungi | what's grub's problem there? | 19:48 |
clarkb | https://bugs.launchpad.net/ubuntu/+source/grub2/+bug/1844012 appears related | 19:48 |
corvus | would we notice issues at the point where we already put the new image in service? so would need to roll back to "yesterday's"* image? | 19:48 |
clarkb | fungi: basically grub says "unknown filesystem" when it doesn't like the features enabled in the ext4 fs | 19:48 |
corvus | (* yesterday may not equal yesterday anymore) | 19:48 |
clarkb | corvus: no, we would fail to build images in the first place | 19:48 |
fungi | are we using grub in our bookworm containers? | 19:49 |
clarkb | fungi: the bookworm containers run dib which makes grub for all of our images | 19:49 |
clarkb | *all of our vm images | 19:49 |
corvus | oh ok. in that case... is there dib testing using the nodepool container? | 19:49 |
fungi | oh, right, that's the nodepool connection. now i understand | 19:49 |
clarkb | corvus: oh there should be so we can use a depends on | 19:49 |
clarkb | corvus: good idea | 19:49 |
corvus | clarkb: not sure if they're in the same tenant though | 19:49 |
clarkb | sorry I'm just being made aware of this as our meeting started so it's all new to me | 19:49 |
fungi | exciting | 19:50 |
corvus | but if depends-on doesn't work, some other ideas: | 19:50 |
clarkb | corvus: hrm they aren't. But we can probably figure out some way of testing that. Maybe hardcoding off of the intermediate registry | 19:50 |
corvus | yeah, manually specify the intermediate registry container | 19:50 |
corvus | or we could land it in nodepool and fast-revert if it breaks | 19:50 |
fungi | absolute worst case, new images we upload will fail to boot, so jobs will end up waiting indefinitely for node assignments until we roll back to prior images | 19:51 |
clarkb | it is possible that bookworm avoids the feature anyway and we're fine so definitely worth testing | 19:51 |
clarkb | fungi: dib fails hard on the grub failure so it shouldn't get that far | 19:51 |
clarkb | corvus: ya I'll keep the test in production alternative in mind if I can't find an easy way to test it otherwise | 19:51 |
fungi | oh, so our images will just get stale? even less impact, we just have to watch for it since it may take a while to notice otherwise | 19:51 |
corvus | oh, one other possibility might be a throwaway nodepool job that uses an older distro | 19:51 |
corvus | (since nodepool does have one functional job that builds images with dib; just probably not on an affected distro) | 19:52 |
clarkb | fungi: right dib running in bullseye today runs mkfs.ext4 and creates a new ext4 fs that focal grub can install into. When we switch to bookworm the concern is that grub will say unknown filesystem, exit with an error, and the image build will fail | 19:52 |
clarkb | but it shouldn't ever upload | 19:52 |
clarkb | corvus: oh cool I can look at that too. And see this is happening at build not boot time we don't need complicated verification of the end result. Just that the build itself succeeds | 19:53 |
clarkb | *since this is happening | 19:53 |
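For anyone who wants to reproduce the failure mode clarkb describes without a full dib run, formatting a scratch file with and without the newer ext4 features is a quick probe; the specific features named below (metadata_csum_seed, orphan_file) are the usual suspects when pairing newer e2fsprogs with pre-2.12 grub, but whether bookworm's e2fsprogs actually enables them by default is exactly the open question:

```python
import subprocess
import tempfile

# Create a small scratch image to format; purely for experimentation.
img = tempfile.NamedTemporaryFile(suffix=".img", delete=False)
subprocess.run(["truncate", "-s", "256M", img.name], check=True)

# Default mkfs.ext4: with a new enough e2fsprogs this may enable features
# (e.g. metadata_csum_seed or orphan_file) that older GRUB reports as
# "unknown filesystem".
subprocess.run(["mkfs.ext4", "-F", img.name], check=True)

# Conservative variant: explicitly disable the suspect features so an older
# grub-install inside the chroot can still read the filesystem.
subprocess.run(
    ["mkfs.ext4", "-F", "-O", "^metadata_csum_seed,^orphan_file", img.name],
    check=True,
)
```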
clarkb | that was all I had. Happy for zuul to proceed with bookworm in the meantime | 19:53 |
clarkb | #topic Open Discussion | 19:54 |
clarkb | ianychoi: I know you wanted to talk about the Zanata db/stats api stuff? | 19:54 |
ianychoi | Yep | 19:54 |
clarkb | I'm not aware of us doing anything special to prevent the stats api from being used. Which is why I wonder if it is an admin only function | 19:55 |
clarkb | If it is i think we can provide you or someone else with admin access to use the api. | 19:55 |
ianychoi | First, public APIs for user stats do not work - e.g., https://translate.openstack.org/rest/stats/user/ianychoi/2022-10-05..2023-03-22 | 19:55 |
clarkb | I am concerned that I don't know what all goes into that database though so am wary of providing a database dump. But others may know more and have other thoughts | 19:55 |
ianychoi | It previously worked for calculating translation stats to sync with Stackalytics and to calculate extra ATC status | 19:56 |
ianychoi | The root cause might be some messy DB state in the Zanata instance, but it is not easy to get help.. | 19:57 |
ianychoi | So I thought investigating the DB issues would be one idea. | 19:57 |
clarkb | I see. Probably before we get that far we should check the server logs to see if there are any errors associated with those requests. I do note that the zanata rest api documentation doesn't seem to show user stats just project stats | 19:58 |
clarkb | http://zanata.org/zanata-platform/rest-api-docs/resource_StatisticsResource.html maybe you can get the data on a project by project basis instead? | 19:59 |
frickler | I just noticed once again that I'm too young an infra-root to be able to access that host, but I'm fine with not changing that situation | 19:59 |
ianychoi | Yep I also figured out that project APIs are working well | 19:59 |
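For reference, a minimal comparison of the failing per-user call and the working per-project call could look like this with requests; the user URL is the one quoted above, while the project path is inferred from the linked StatisticsResource docs and the slugs are illustrative:

```python
import requests

BASE = "https://translate.openstack.org/rest"
HEADERS = {"Accept": "application/json"}

# Per-user stats endpoint that used to feed Stackalytics / extra-ATC counts;
# this is the call that currently fails.
user = requests.get(
    f"{BASE}/stats/user/ianychoi/2022-10-05..2023-03-22", headers=HEADERS, timeout=30
)
print(user.status_code)

# Project/version stats, which are reported to still work; the slugs here are
# illustrative, not a specific OpenStack project.
proj = requests.get(
    f"{BASE}/stats/proj/example-project/iter/master", headers=HEADERS, timeout=30
)
print(proj.status_code, proj.json() if proj.ok else proj.text)
```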
clarkb | frickler: interesting, I thought that server was getting users managed by ansible like the other servers do | 19:59 |
ianychoi | So, ideally some help working on this part together with infra-root would be great | 19:59 |
ianychoi | Or maybe me or Seongsoo need to step up :p | 20:00 |
clarkb | ianychoi: ok, the main thing is that this service is long deprecated so we're unlikely to be able to invest much in it. But I think we can check logs for obvious errors. | 20:00 |
fungi | frickler: you have an ssh account on translate.openstack.org | 20:00 |
fungi | but maybe it's set up wrong or something | 20:00 |
ianychoi | Agree with clarkb. Would you help check logs? | 20:01 |
ianychoi | Or feel free to point me to them so that I can investigate the detailed log messages | 20:01 |
fungi | service admins for that platform are probably added manually though not through ansible | 20:01 |
clarkb | ianychoi: yes a root will need to check the logs. I can look later today | 20:01 |
ianychoi | Thank you! | 20:02 |
frickler | ah, I was using the wrong username. so I can check tomorrow if no one else has time | 20:02 |
fungi | i think pleia2 added our sysadmins as zanata admins back when the server was first set up, but it likely hasn't been revisited since | 20:02 |
clarkb | sounds like a plan we can sync up from there | 20:02 |
clarkb | but we are out of time. Feel free to bring up more discussion in #opendev or on the service-discuss mailing list | 20:02 |
clarkb | thank you everyone! | 20:03 |
clarkb | #endmeeting | 20:03 |
opendevmeet | Meeting ended Tue Sep 5 20:03:03 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:03 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.html | 20:03 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.txt | 20:03 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-09-05-19.01.log.html | 20:03 |
clarkb | (I can smell lunch and I am very hungry :) ) | 20:03 |
ianychoi | Thank you all | 20:03 |
fungi | thanks! | 20:03 |