*** diablo_rojo_phone is now known as Guest166 | 07:50 | |
clarkb | We'll start the meeting momentarily | 18:59 |
---|---|---|
ianw | o/ | 19:01 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Sep 13 19:01:07 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-September/000359.html Our Agenda | 19:01 |
clarkb | There is an agenda with quite a number of things on it. They are mostly small things so I may go quickly to be sure we get through it all then we can swing back around on anything that needed extra discussion | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | Nothing major here. Just a reminder that OpenStack is in the middle of its release process and elections | 19:02 |
clarkb | Don't forget to vote if you're eligible and take care to double check changes you are making to ensure we don't inadvertently break something the release depends on | 19:02 |
clarkb | #topic Topics | 19:03 |
clarkb | #topic Bastion Host Updates | 19:03 |
clarkb | We've taken yet another pivot after realizing we likely just never want to run the console stream daemon in these infra prod jobs. At least not in its current form | 19:04 |
clarkb | but the command module (and its relatives like shell) write the files out regardless | 19:04 |
clarkb | ianw: wrote some changes to make that optional which I think will be helpful for us | 19:04 |
clarkb | #link https://review.opendev.org/c/zuul/zuul/+/855309/ make console stream file writing toggleable | 19:04 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/855472 Disable file writing for infra-prod | 19:04 |
ianw | yes sorry that needs a revision from your comments | 19:04 |
clarkb | ianw: ya and did you see my note about modifying the base jobs repo in a similar manner to system-config as well? | 19:04 |
ianw | ummm, sorry no, but can do | 19:05 |
clarkb | ping me if I don't rereview those quickly enough after updates. I'd like to see those get in as they appear to be a good improvement for our use case (and probably others in a similar boat) | 19:05 |
clarkb | #topic Upgrading Bionic Servers | 19:07 |
clarkb | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done. | 19:07 |
clarkb | I keep meaning to pick this up but then other things pop up and grab my attention | 19:07 |
clarkb | Help appreciated and let me know if anyone starts working on this and needs changes reviewed or help debugging issues. I'm more than happy to take a look | 19:08 |
clarkb | But no real updates on this yet | 19:08 |
clarkb | #topic Mailman 3 | 19:09 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/851248 Change to deploy mm3 server. | 19:09 |
clarkb | #link https://etherpad.opendev.org/p/mm3migration Server and list migration notes | 19:09 |
clarkb | We (mostly fungi at this point) continue to make progress on getting to the point where this is ready | 19:09 |
clarkb | The database appears happy with the larger connection buffer settings. | 19:09 |
clarkb | fungi: ^ have you checked if mysqldump is happy with that setting too? We should check that (maybe by manually running the mysqldump?) | 19:10 |
fungi | no, i didn't check that, but can make a note in the etherpad to test it with the next hold after a full import | 19:10 |
clarkb | Other todos include retesting now that the change is creating all the lists and not just those that ansible for mm2 knew about, checking the pipermail redirects, and I think adding redirects for non list archive urls | 19:10 |
clarkb | fungi: thanks | 19:11 |
clarkb | fungi: we should probably go ahead and add a hold and recheck nowish? | 19:11 |
clarkb | fungi: I can do that after the meeting if that is helpful | 19:11 |
fungi | yeah, i just hadn't gotten to it yet | 19:11 |
clarkb | cool I'll sync up after the meeting to get that moving forward | 19:11 |
clarkb | Thanks for all the help on this. You definitely realize just how many little details go into a big migration like this when you start testing it out | 19:12 |
clarkb | #topic Jaeger Tracing Server | 19:12 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/855983 adds deployment for jaeger | 19:12 |
corvus | my ball; will update this week. | 19:12 |
clarkb | There is a change now. CI isn't happy with it yet and I think ianw has some feedback | 19:12 |
clarkb | corvus: great, just wanted to make sure others were aware too. | 19:13 |
corvus | seems like ppl generally like it so far. just working through some technical details. | 19:13 |
clarkb | #topic Fedora 36 | 19:15 |
clarkb | #link https://review.opendev.org/c/zuul/nodepool/+/853914 Remove fedora 35 testing from nodepool | 19:16 |
clarkb | ianw everywhere else is using the fedora-latest label and will get automatically updated? | 19:16 |
ianw | devstack still has https://review.opendev.org/c/openstack/devstack/+/854334 but i need to look into that | 19:17 |
clarkb | ah they have their own labels. | 19:18 |
ianw | but other than that, yes -- so with the nodepool change one step closer to dropping f35 | 19:18 |
clarkb | ianw: looks like the issue there is they are branching nodeset definitions :/ | 19:18 |
clarkb | that's going to create problems for every transition that uses an alias like -latest | 19:18 |
clarkb | might make sense to move that into openstack-zuul-jobs or similar to avoid that problem | 19:19 |
ianw | we always seem to have this discussion about making sure various testing jobs don't end up on stable branches | 19:19 |
clarkb | another option is for them to use anonymous nodesets | 19:19 |
clarkb | but I don't think they should be managing aliased nodesets on branched repos | 19:20 |
clarkb | as this will be a problem every 6 months | 19:20 |
fungi | yeah, a branchless repo like osj should fit the bill | 19:20 |
fungi | er, ozj | 19:20 |
ianw | there should be no fedora on anything but master -- but i agree this could have a better home | 19:20 |
clarkb | ianw: ya the problem is they branch yoga and don't clean it up | 19:21 |
clarkb | it's better to just avoid having it on master where it can end up in a stable branch probably | 19:21 |
ianw | i can add a todo to have a look | 19:21 |
clarkb | anyway we can sort that out with the qa team separately | 19:21 |
clarkb | is there anything other than reviewing the nodepool change that we can do to help | 19:21 |
ianw | i don't think so, thanks. unless people want to start debugging devstack, which i don't think they do :) | 19:22 |
clarkb | #topic Jitsi Meet Updates | 19:23 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/856553 Update to use colibri websockets and scale out JVBs | 19:23 |
clarkb | This is one of those changes where in theory I've done what is expected of the service | 19:23 |
clarkb | But it's kinda hard to confirm that without having a full blown service running with dns setup and being able to talk to it with our browsers | 19:24 |
fungi | but it's also fairly easy to test if we set aside a window to do so | 19:24 |
clarkb | In particular it isn't clear to me if the JVB java keystores need to have some relationship to a CA or to each other or be verifiable in some way | 19:24 |
clarkb | All of the bits I could find on the docs and forum posts about this don't indicate any sort of relationship like that so I think they may be using this just for encryption and not verification | 19:24 |
clarkb | fungi: ya exactly | 19:25 |
clarkb | I think if people are comfortable with the change we could probably land it and test things during a quiet time (Friday?) | 19:25 |
clarkb | and if it breaks either revert or try to roll forward and fix | 19:25 |
clarkb | I do think there is a window of opportunity here where we should get it done or wait until after the ptg though. Probably week before ptg is not the time to land this but before that is ok? | 19:25 |
fungi | and i think it should be reasonably safe to merge first, make sure things aren't broken, take a jvb server out of emergency, redeploy to it gets updated, stio the jvb container on the main server, test again | 19:25 |
fungi | i like friday | 19:26 |
fungi | s/stio/stop | 19:26 |
ianw | ++ | 19:26 |
clarkb | sounds good. /me makes a note on the todo list to try and get that done friday | 19:27 |
clarkb | Other than that I think we are in good shape for having the service up for the ptg. The non jvb setup seems to be working | 19:27 |
clarkb | (just a question of whether or not it can scale, but that is what the jvb change is for) | 19:28 |
clarkb | #topic Stability of the Zuul Reboot Playbook | 19:28 |
clarkb | If you didn't know this already Clouds are excellent chaos monkey drivers | 19:28 |
clarkb | Over the weekend we hit another issue with the playbook. This time it is a race between asking the container to stop in an exec and the container quitting out from under the docker exec | 19:29 |
clarkb | when the container exits before the exec is complete the docker command return code is 137 and ansible gets angry | 19:29 |
clarkb | I pushed an update to handle this as well as an unexpected behavior with docker-compose ps -q printing exited containers that frickler pointed out (docker ps -q does not do this) | 19:29 |
clarkb | I started a manual run of that yesterday and we are currently waiting for ze08 to stop | 19:30 |
clarkb | Hoping that completes today which will have what should become zuul 6.4.0 deployed in opendev for a bit before the release is made | 19:30 |
clarkb | Calling this out because I think it is a good idea for us to keep an eye on this playbook for a bit until we're satisfied it is stable | 19:30 |
corvus | the original run was dev10/dev18 | 19:30 |
corvus | the new run is upgrading to dev21? | 19:30 |
corvus | was it resumed or is everything going to dev21? | 19:31 |
corvus | (i'm not sure how to read ze01 being at dev18, ze05 at dev21, and ze12 at dev18 again) | 19:31 |
clarkb | corvus: all of the ze's updated to dev18 over the weekend as the crash happened on zm05 which was after the zes | 19:32 |
clarkb | corvus: some time after my manual restart of the playbook yesterday a change or two landed to zuul and our hourly zuul playbook docker-compose pulled that so nodes after that point are upgrading to dev21 | 19:32 |
clarkb | once this is done we can go and update ze01-ze04 to dev21 to match | 19:33 |
clarkb | as they should be the only ones out of sync (unless more zuul changes land in the interim) | 19:33 |
corvus | gotcha. i was hoping to avoid that, but it looks like the changes that merged are inconsequential to the release | 19:33 |
clarkb | yes I looked at them and they didn't look to be major | 19:33 |
corvus | nah, no need, we can run with diverse versions | 19:33 |
corvus | we don't use the elasticsearch reporter :) | 19:34 |
clarkb | So far the updated playbook seems happy. I'll continue to monitor it | 19:35 |
corvus | \o/ | 19:35 |
clarkb | #topic Python Container Image Updates | 19:35 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/856537 | 19:35 |
clarkb | This is a great time to update our python container base images as they now include a fixed glibc for the ansible issue and new python minor releases | 19:36 |
clarkb | Once we land that we can remove the zuul glibc workaround and let that change rebuild the zuul images | 19:36 |
clarkb | I wouldn't call this urgent, but it is good hygiene to update these periodically so that changes to our various images can pick up the underlying updates | 19:37 |
ianw | ok, i feel like the zuul workaround is separate though | 19:37 |
clarkb | ianw: once the base image has fixed glibc the zuul workaround is no longer required? | 19:38 |
clarkb | This is a necessary precondition of removing the workaround | 19:38 |
ianw | oh right, it builds on these base images. although it might do an apt-get upgrade as part of building | 19:38 |
ianw | zuul that is? | 19:38 |
clarkb | zuul might, that's true | 19:38 |
clarkb | our base images don't | 19:38 |
ianw | anyway, yeah pulling into base images seems good | 19:39 |
corvus | #link zuul workaround: https://review.opendev.org/849795 | 19:39 |
corvus | i'm not aware of an apt-get upgrade | 19:39 |
ianw | right, and https://review.opendev.org/c/zuul/zuul/+/854939 was to revert it | 19:40 |
ianw | i updated that to depends-on the system-config change; so ordering should be right now | 19:41 |
clarkb | cool | 19:41 |
corvus | sounds like a plan | 19:41 |
clarkb | #topic Improving Ansible Task Runtime | 19:42 |
clarkb | This is largely meant to be informational to help people be conscious of this as they write new ansible | 19:42 |
clarkb | But I'm also happy if people end up refactoring existing ansible :) | 19:42 |
clarkb | The TL;DR is that even though zuul is using ssh control persistence and ansible pipelining, the cost to run an individual task as simple as copying a file of a few bytes or execing ls is often measured in seconds | 19:43 |
clarkb | The exact number of seconds seems to vary across our clouds but we've seen it as high as 6 in some :( | 19:43 |
clarkb | This becomes particularly problematic when you are running ansible tasks in a loop with a large number of loop inputs | 19:44 |
clarkb | each input creates a new task that can take 6 seconds to execute. Multiply that by 100 items in a loop and now you just spent 10 minutes doing something that probably should've taken a second or two at most | 19:44 |
clarkb | I've written a few changes at this point to pick off some low hanging fruit that suffer from this | 19:44 |
clarkb | #link https://review.opendev.org/c/zuul/zuul-jobs/+/855402 | 19:45 |
clarkb | #link https://review.opendev.org/c/zuul/zuul-jobs/+/857228 | 19:45 |
clarkb | in particular improve some shared library roles so that everyone can benefit | 19:45 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/857232 | 19:45 |
clarkb | this change is specific to how we run nested ansible and saves 1-3 minutes or so depending on the test node. As noted in the commit message of this change there is a downside to it (more complicated nested ansible setup) and I've asked for feedback on whether or not we think that cost is worthwhile | 19:46 |
clarkb | I've just WIP'd it to ensure we don't merge it before additional feedback is given | 19:46 |
clarkb | So ya, try to be aware of this as you write ansible, it can make a big impact on how long our jobs take to execute | 19:47 |
clarkb | sometimes it might be appropriate to move actions into a shell script rather than have ansible work through logic and iteration | 19:47 |
clarkb | sometimes we can use synchronize instead of a loop of copies, and so on | 19:47 |
clarkb | And be on the lookout for any particularly problematic bits that we might be able to improve. The multi node known hosts stuff could be quicker after my improvement above for example and maybe our infra log encryption could be sped up too | 19:48 |
clarkb | #topic Open Discussion | 19:49 |
clarkb | We got through the agenda. Anything else or anything we covered above that you'd like to go into more detail on? | 19:49 |
fungi | i've got nothing else | 19:50 |
clarkb | the debian reprepro mirror needs help | 19:51 |
fungi | yep, planning to dig into that during/after dinner, unless someone beats me to it | 19:51 |
clarkb | it somehow leaked a lock file which I cleaned up earlier today and now it complains of a corrupt db | 19:51 |
fungi | database rebuild seems to be necessary | 19:51 |
ianw | yeah, i feel like i have notes on that, i can take a look | 19:51 |
ianw | #link https://review.opendev.org/c/opendev/system-config/+/852056 | 19:51 |
ianw | is one; about reverting the pin of the grafana container. frickler isn't a fan, i'm a bit less worried about it -- not sure what others think | 19:52 |
clarkb | ianw: are they going to keep releasing beta software to :latest? | 19:52 |
clarkb | I'm ok with deploying it if they stop doing that | 19:52 |
frickler | ah, I checked that, the dashboard page looks empty with :latest | 19:52 |
clarkb | there was talk of them doing a :stable or similar tag iirc | 19:53 |
frickler | also didn't we have a patch that generates screenshots of all the individual dashboards? I didn't find that | 19:53 |
ianw | that's a point, this job doesn't run that | 19:53 |
clarkb | frickler: I think that job runs on the project-config side. We could run it here too though, and that is probably a good idea | 19:53 |
frickler | anyway something still seems broken with latest, so we can either try some tagged version or try to find a fix in our setup | 19:54 |
frickler | not sure if someone has time and energy for that | 19:54 |
ianw | well yeah, if there is a problem with :latest, ignoring it is only going to make it worse :) that's kind of my point | 19:55 |
clarkb | right but it seems that they started releasing known problematic stuff to :latest | 19:55 |
clarkb | whereas before it was vetted releases | 19:55 |
clarkb | I'm ok with keeping up with their releases but don't think we should be responsible for beta testing for them | 19:56 |
ianw | well, i doubt they would say that, and really it is our model of loading via the yaml path etc. that i think we're testing, and that's not going to be something covered by upstream ci | 19:57 |
clarkb | ianw: aiui when we broke previously it was because latest was a beta release | 19:57 |
clarkb | and the issue was a known issue they were already working to fix that would never end up in the final release | 19:57 |
ianw | not really -- it was their bug -- but we reported it, and confirmed it, and helped get it fixed | 19:58 |
frickler | different topic, just to shortly mention it before time's up, there seem to be some issues with nested-kvm on ovh-gra1. I'm testing with a beta version of cirros, will apply some special cmdline option | 19:59 |
frickler | hope to have some more information tomorrow | 19:59 |
clarkb | frickler: thanks. | 19:59 |
ianw | yeah, if it's the same thing as we saw with our jammy nodes booting there, i think we'll need some help from the cloud side | 20:00 |
clarkb | checking docker hub they don't seem to have stable tags | 20:00 |
ianw | it relates to kernel messages spewing from a prctl due to cpu flags | 20:00 |
ianw | #link https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1973839 | 20:00 |
fungi | though i expect their business model provides them with an incentive to leverage users of the open source version as beta testers in order to shield their paying customers from bugs | 20:01 |
clarkb | amorin was responsive at least. Suggested trying a different flavor on a one off boot to check if that was any better | 20:01 |
fungi | (grafana's business model, i mean) | 20:01 |
frickler | I'll try the added kernel option first, other flavor second, didn't get to it today | 20:01 |
clarkb | and we are at time. Thanks everyone. Feel free to continue the grafana and nested virt discussion in #opendev | 20:02 |
frickler | made updates to service-types-authority work again | 20:02 |
clarkb | We'll be back here same time and place next week. | 20:02 |
clarkb | #endmeeting | 20:02 |
opendevmeet | Meeting ended Tue Sep 13 20:02:24 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:02 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.html | 20:02 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.txt | 20:02 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.log.html | 20:02 |
fungi | thanks clarkb! | 20:02 |
ianw | fungi: i feel like that's a bit unfair. i think they would say that they have a lot of CI and try to maintain a lot of compatibility. when we reported a bug it was investigated and fixed promptly. not sure you can ask for a lot more | 20:02 |
fungi | ianw: rather, i mean they have an incentive to not provide a "stable" tag, because that's what people pay them for | 20:02 |
fungi | a tag for "give me the latest official release" | 20:03 |
fungi | rather than consuming their beta test stream or pinning to a specific version | 20:04 |
frickler | I think most people don't need that, they use numbered tagged versions and something like renovate bot on github to keep track | 20:04 |
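
Editor's note on the Zuul reboot playbook topic (19:29): the race described there ends with the docker command returning 137 when the container exits before the exec completes. By shell convention 137 is 128 + 9, meaning the process was ended by SIGKILL. The actual fix was made in the Ansible playbook and is not reproduced in this log; the Python sketch below only illustrates the general idea of tolerating that return code. The names `stop_succeeded` and `run_stop` are hypothetical and do not come from the playbook.

```python
import subprocess
from typing import Iterable, Sequence

# Conventional "killed by SIGKILL" status: 128 + signal number 9.
SIGKILL_EXIT = 128 + 9

def stop_succeeded(returncode: int,
                   tolerated: Iterable[int] = (0, SIGKILL_EXIT)) -> bool:
    """Return True if a stop command's exit status counts as success.

    Assumption: the container going away mid-exec (status 137) is the
    outcome we wanted, so it is treated the same as a clean exit.
    """
    return returncode in tuple(tolerated)

def run_stop(cmd: Sequence[str]) -> int:
    """Run a stop command and raise only on unexpected exit statuses."""
    proc = subprocess.run(list(cmd), check=False)
    if not stop_succeeded(proc.returncode):
        raise subprocess.CalledProcessError(proc.returncode, list(cmd))
    return proc.returncode
```

In Ansible itself the equivalent idea is usually expressed with a `failed_when` condition on the task's `rc`; whether the real change does exactly that is not stated in the meeting.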
Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!
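
Editor's note on the Ansible task runtime topic (19:43 to 19:48): the arithmetic quoted in the meeting (up to 6 seconds of overhead per task, 100 loop items, roughly 10 minutes) can be checked with a small Python sketch. The function names and the single-task comparison are illustrative assumptions, not measurements from the OpenDev jobs.

```python
def looped_cost_seconds(items: int, per_task_overhead_s: float) -> float:
    """Estimated overhead when each loop item runs as its own task."""
    return items * per_task_overhead_s

def batched_cost_seconds(per_task_overhead_s: float) -> float:
    """Estimated overhead when the whole loop is folded into one task,
    for example one shell script or one synchronize call (assumption:
    the per-item work itself is negligible next to the task overhead).
    """
    return per_task_overhead_s

if __name__ == "__main__":
    looped = looped_cost_seconds(items=100, per_task_overhead_s=6.0)
    batched = batched_cost_seconds(per_task_overhead_s=6.0)
    print(f"looped:  {looped / 60:.0f} minutes")
    print(f"batched: {batched:.0f} seconds")
```

The sketch only models fixed per-task overhead, which is the cost the meeting calls out; real savings depend on the cloud and on how much work each item does.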