Tuesday, 2022-09-13

clarkb	We'll start the meeting momentarily	18:59
ianw	o/	19:01
clarkb	#startmeeting infra	19:01
opendevmeet	Meeting started Tue Sep 13 19:01:07 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.	19:01
opendevmeet	Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.	19:01
opendevmeet	The meeting name has been set to 'infra'	19:01
clarkb	#link https://lists.opendev.org/pipermail/service-discuss/2022-September/000359.html Our Agenda	19:01
clarkb	There is an agenda with quite a number of things on it. They are mostly small things so I may go quickly to be sure we get through it all then we can swing back around on anything that needed extra discussion	19:01
clarkb	#topic Announcements	19:01
clarkb	Nothing major here. Just a reminder that OpenStack is in the middle of its release process and elections	19:02
clarkb	Don't forget to vote if you're eligible and take care to double check changes you are making to ensure we don't inadvertently break something the release depends on	19:02
clarkb	#topic Topics	19:03
clarkb	#topic Bastion Host Updates	19:03
clarkb	We've taken yet another pivot after realizing we likely just never want to run the console stream daemon in these infra prod jobs. At least not in its current form	19:04
clarkb	but the command module (and its relatives like shell) write the files out regardless	19:04
clarkb	ianw wrote some changes to make that optional which I think will be helpful for us	19:04
clarkb	#link https://review.opendev.org/c/zuul/zuul/+/855309/ make console stream file writing toggleable	19:04
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/855472 Disable file writing for infra-prod	19:04
ianw	yes sorry that needs a revision from your comments	19:04
clarkb	ianw: ya and did you see my note about modifying the base jobs repo in a similar manner to system-config as well?	19:04
ianw	ummm, sorry no, but can do	19:05
clarkb	ping me if I don't rereview those quickly enough after updates. I'd like to see those get in as they appear to be a good improvement for our use case (and probably others in a similar boat)	19:05
clarkb	#topic Upgrading Bionic Servers	19:07
clarkb	#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.	19:07
clarkb	I keep meaning to pick this up but then other things pop up and grab my attention	19:07
clarkb	Help appreciated and let me know if anyone starts working on this and needs changes reviewed or help debugging issues. I'm more than happy to take a look	19:08
clarkb	But no real updates on this yet	19:08
clarkb	#topic Mailman 3	19:09
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/851248 Change to deploy mm3 server.	19:09
clarkb	#link https://etherpad.opendev.org/p/mm3migration Server and list migration notes	19:09
clarkb	We (mostly fungi at this point) continue to make progress on getting to the point where this is ready	19:09
clarkb	The database appears happy with the larger connection buffer settings.	19:09
clarkb	fungi: ^ have you checked if mysqldump is happy with that setting too? We should check that (maybe by manually running the mysqldump?)	19:10
fungi	no, i didn't check that, but can make a note in the etherpad to test it with the next hold after a full import	19:10
clarkb	Other todos include retesting now that the change is creating all the lists and not just those that ansible for mm2 knew about, checking the pipermail redirects, and I think adding redirects for non list archive urls	19:10
clarkb	fungi: thanks	19:11
clarkb	fungi: we should probably go ahead and add a hold and recheck nowish?	19:11
clarkb	fungi: I can do that after the meeting if that is helpful	19:11
fungi	yeah, i just hadn't gotten to it yet	19:11
clarkb	cool I'll sync up after the meeting to get that moving forward	19:11
clarkb	Thanks for all the help on this. You definitely realize just how many little details go into a big migration like this when you start testing it out	19:12
clarkb	#topic Jaeger Tracing Server	19:12
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/855983 adds deployment for jaeger	19:12
corvus	my ball; will update this week.	19:12
clarkb	There is a change now. CI isn't happy with it yet and I think ianw has some feedback	19:12
clarkb	corvus: great, just wanted to make sure others were aware too.	19:13
corvus	seems like ppl generally like it so far. just working through some technical details.	19:13
clarkb	#topic Fedora 36	19:15
clarkb	#link https://review.opendev.org/c/zuul/nodepool/+/853914 Remove fedora 35 testing from nodepool	19:16
clarkb	ianw everywhere else is using the fedora-latest label and will get automatically updated?	19:16
ianw	devstack still has https://review.opendev.org/c/openstack/devstack/+/854334 but i need to look into that	19:17
clarkb	ah they have their own labels.	19:18
ianw	but other than that, yes -- so with the nodepool change one step closer to dropping f35	19:18
clarkb	ianw: looks like the issue there is they are branching nodeset definitions :/	19:18
clarkb	that's going to create problems for every transition that uses an alias like -latest	19:18
clarkb	might make sense to move that into openstack-zuul-jobs or similar to avoid that problem	19:19
ianw	we always seem to have this discussion about making sure various testing jobs don't end up on stable branches	19:19
clarkb	another option is for them to use anonymous nodesets	19:19
clarkb	but I don't think they should be managing aliased nodesets on branched repos	19:20
clarkb	as this will be a problem every 6 months	19:20
fungi	yeah, a branchless repo like osj should fit the bill	19:20
fungi	er, ozj	19:20
ianw	there should be no fedora on anything but master -- but i agree this could have a better home	19:20
clarkb	ianw: ya the problem is they branch yoga and don't clean it up	19:21
clarkb	it's better to just avoid having it on master where it can end up in a stable branch probably	19:21
ianw	i can add a todo to have a look	19:21
clarkb	anyway we can sort that out with the qa team separately	19:21
clarkb	is there anything other than reviewing the nodepool change that we can do to help	19:21
ianw	i don't think so, thanks. unless people want to start debugging devstack, which i don't think they do :)	19:22
clarkb	#topic Jitsi Meet Updates	19:23
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/856553 Update to use colibri websockets and scale out JVBs	19:23
clarkb	This is one of those changes where in theory I've done what is expected of the service	19:23
clarkb	But it's kinda hard to confirm that without having a full blown service running with dns setup and being able to talk to it with our browsers	19:24
fungi	but it's also fairly easy to test if we set aside a window to do so	19:24
clarkb	In particular it isn't clear to me if the JVB java keystores need to have some relationship to a CA or to each other or be verifiable in some way	19:24
clarkb	All of the bits I could find on the docs and forum posts about this don't indicate any sort of relationship like that so I think they may be using this just for encryption and not verification	19:24
clarkb	fungi: ya exactly	19:25
clarkb	I think if people are comfortable with the change we could probably land it and test things during a quiet time (Friday?)	19:25
clarkb	and if it breaks either revert or try to roll forward and fix	19:25
clarkb	I do think there is a window of opportunity here where we should get it done or wait until after the ptg though. Probably week before ptg is not the time to land this but before that is ok?	19:25
fungi	and i think it should be reasonably safe to merge first, make sure things aren't broken, take a jvb server out of emergency, redeploy so it gets updated, stio the jvb container on the main server, test again	19:25
fungi	i like friday	19:26
fungi	s/stio/stop	19:26
ianw	++	19:26
clarkb	sounds good. /me makes a note on the todo list to try and get that done friday	19:27
clarkb	Other than that I think we are in good shape for having the service up for the ptg. The non jvb setup seems to be working	19:27
clarkb	(just a question of whether or not it can scale, but that is what the jvb change is for)	19:28
clarkb	#topic Stability of the Zuul Reboot Playbook	19:28
clarkb	If you didn't know this already Clouds are excellent chaos monkey drivers	19:28
clarkb	Over the weekend we hit another issue with the playbook. This time it is a race between asking the container to stop in an exec and the container quitting out from under the docker exec	19:29
clarkb	when the container exits before the exec is complete the docker command return code is 137 and ansible gets angry	19:29
clarkb	I pushed an update to handle this as well as an unexpected behavior with docker-compose ps -q printing exited containers that frickler pointed out (docker ps -q does not do this)	19:29
clarkb	I started a manual run of that yesterday and we are currently waiting for ze08 to stop	19:30
clarkb	Hoping that completes today which will have what should become zuul 6.4.0 deployed in opendev for a bit before the release is made	19:30
clarkb	Calling this out because I think it is a good idea for us to keep an eye on this playbook for a bit until we're satisfied it is stable	19:30
corvus	the original run was dev10/dev18	19:30
corvus	the new run is upgrading to dev21?	19:30
corvus	was it resumed or is everything going to dev21?	19:31
corvus	(i'm not sure how to read ze01 being at dev18, ze05 at dev21, and ze12 at dev18 again)	19:31
clarkb	corvus: all of the ze's updated to dev18 over the weekend as the crash happened on zm05 which was after the zes	19:32
clarkb	corvus: some time after my manual restart of the playbook yesterday a change or two landed to zuul and our hourly zuul playbook docker-compose pulled that so nodes after that point are upgrading to dev21	19:32
clarkb	once this is done we can go and update ze01-ze04 to dev21 to match	19:33
clarkb	as they should be the only ones out of sync (unless more zuul changes land in the interim)	19:33
corvus	gotcha. i was hoping to avoid that, but it looks like the changes that merged are inconsequential to the release	19:33
clarkb	yes I looked at them and they didn't look to be major	19:33
corvus	nah, no need, we can run with diverse versions	19:33
corvus	we don't use the elasticsearch reporter :)	19:34
clarkb	So far the updated playbook seems happy. I'll continue to monitor it	19:35
corvus	\o/	19:35
clarkb	#topic Python Container Image Updates	19:35
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/856537	19:35
clarkb	This is a great time to update our python container base images as they now include a fixed glibc for the ansible issue and new python minor releases	19:36
clarkb	Once we land that we can remove the zuul glibc workaround and let that change rebuild the zuul images	19:36
clarkb	I wouldn't call this urgent, but it is good hygiene to update these periodically so that changes to our various images can pick up the underlying updates	19:37
ianw	ok, i feel like the zuul workaround is separate though	19:37
clarkb	ianw: once the base image has fixed glibc the zuul workaround is no longer required?	19:38
clarkb	This is a necessary precondition of removing the workaround	19:38
ianw	oh right, it builds on these base images. although it might do an apt-get upgrade as part of building	19:38
ianw	zuul that is?	19:38
clarkb	zuul might, that's true	19:38
clarkb	our base images don't	19:38
ianw	anyway, yeah pulling into base images seems good	19:39
corvus	#link zuul workaround: https://review.opendev.org/849795	19:39
corvus	i'm not aware of an apt-get upgrade	19:39
ianw	right, and https://review.opendev.org/c/zuul/zuul/+/854939 was to revert it	19:40
ianw	i updated that to depends-on the system-config change; so ordering should be right now	19:41
clarkb	cool	19:41
corvus	sounds like a plan	19:41
clarkb	#topic Improving Ansible Task Runtime	19:42
clarkb	This is largely meant to be informational to help people be conscious of this as they write new ansible	19:42
clarkb	But I'm also happy if people end up refactoring existing ansible :)	19:42
clarkb	The TL;DR is that even though zuul is using ssh control persistence and ansible pipelining, the cost to run an individual task as simple as copying a file of a few bytes or execing ls is often measured in seconds	19:43
clarkb	The exact number of seconds seems to vary across our clouds but we've seen it as high as 6 in some :(	19:43
clarkb	This becomes particularly problematic when you are running ansible tasks in a loop with a large number of loop inputs	19:44
clarkb	each input creates a new task that can take 6 seconds to execute. Multiply that by 100 items in a loop and now you just spent 10 minutes doing something that probably should've taken a second or two at most	19:44
clarkb	I've written a few changes at this point to pick off some low hanging fruit that suffer from this	19:44
clarkb	#link https://review.opendev.org/c/zuul/zuul-jobs/+/855402	19:45
clarkb	#link https://review.opendev.org/c/zuul/zuul-jobs/+/857228	19:45
clarkb	in particular improve some shared library roles so that everyone can benefit	19:45
clarkb	#link https://review.opendev.org/c/opendev/system-config/+/857232	19:45
clarkb	this change is specific to how we run nested ansible and saves 1-3 minutes or so depending on the test node. As noted in the commit message of this change there is a downside to it (more complicated nested ansible setup) and I've asked for feedback on whether or not we think that cost is worthwhile	19:46
clarkb	I've just WIP'd it to ensure we don't merge it before additional feedback is given	19:46
clarkb	So ya, try to be aware of this as you write ansible, it can make a big impact on how long our jobs take to execute	19:47
clarkb	sometimes it might be appropriate to move actions into a shell script rather than have ansible work through logic and iteration	19:47
clarkb	sometimes we can use synchronize instead of a loop of copies, and so on	19:47
clarkb	And be on the lookout for any particularly problematic bits that we might be able to improve. The multi node known hosts stuff could be quicker after my improvement above for example and maybe our infra log encryption could be sped up too	19:48
clarkb	#topic Open Discussion	19:49
clarkb	We got through the agenda. Anything else or anything we covered above that you'd like to go into more detail on?	19:49
fungi	i've got nothing else	19:50
clarkb	the debian reprepro mirror needs help	19:51
fungi	yep, planning to dig into that during/after dinner, unless someone beats me to it	19:51
clarkb	it somehow leaked a lock file which I cleaned up earlier today and now it complains of a corrupt db	19:51
fungi	database rebuild seems to be necessary	19:51
ianw	yeah, i feel like i have notes on that, i can take a look	19:51
ianw	#link https://review.opendev.org/c/opendev/system-config/+/852056	19:51
ianw	is one; about reverting the pin of the grafana container. frickler isn't a fan, i'm a bit less worried about it -- not sure what others think	19:52
clarkb	ianw: are they going to keep releasing beta software to :latest?	19:52
clarkb	I'm ok with deploying it if they stop doing that	19:52
frickler	ah, I checked that, the dashboard page looks empty with :latest	19:52
clarkb	there was talk of them doing a :stable or similar tag iirc	19:53
frickler	also didn't we have a patch that generates screenshots of all the individual dashboards? I didn't find that	19:53
ianw	that's a point, this job doesn't run that	19:53
clarkb	frickler: I think that job runs on the project-config side we could run it here too though and probably a good idea	19:53
frickler	anyway something still seems broken with latest, so we can either try some tagged version or try to find a fix in our setup	19:54
frickler	not sure if someone has time and energy for that	19:54
ianw	well yeah, if there is a problem with :latest, ignoring it is only going to make it worse :) that's kind of my point	19:55
clarkb	right but it seems that they started releasing known problematic stuff to :latest	19:55
clarkb	whereas before it was vetted releases	19:55
clarkb	I'm ok with keeping up with their releases but don't think we should be responsible for beta testing for them	19:56
ianw	well, i doubt they would say that, and really it is our model of loading via the yaml path etc. that i think we're testing, and that's not going to be something covered by upstream ci	19:57
clarkb	ianw: aiui when we broke previously it was because latest was a beta release	19:57
clarkb	and the issue was a known issue they were already working to fix that would never end up in the final release	19:57
ianw	not really -- it was their bug -- but we reported it, and confirmed it, and helped get it fixed	19:58
frickler	different topic, just to shortly mention it before time's up, there seem to be some issues with nested-kvm on ovh-gra1. I'm testing with a beta version of cirros, will apply some special cmdline option	19:59
frickler	hope to have some more information tomorrow	19:59
clarkb	frickler: thanks.	19:59
ianw	yeah, if it's the same thing as we saw with our jammy nodes booting there, i think we'll need some help from the cloud side	20:00
clarkb	checking docker hub they don't seem to have stable tags	20:00
ianw	it relates to kernel messages spewing from a prctl due to cpu flags	20:00
ianw	#link https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1973839	20:00
fungi	though i expect their business model provides them with an incentive to leverage users of the open source version as beta testers in order to shield their paying customers from bugs	20:01
clarkb	amorin was responsive at least. Suggested trying a different flavor on a one off boot to check if that was any better	20:01
fungi	(grafana's business model, i mean)	20:01
frickler	I'll try the added kernel option first, other flavor second, didn't get to it today	20:01
clarkb	and we are at time. Thanks everyone. Feel free to continue the grafana and nested virt discussion in #opendev	20:02
frickler	made updates to service-types-authority work again	20:02
clarkb	We'll be back here same time and place next week.	20:02
clarkb	#endmeeting	20:02
opendevmeet	Meeting ended Tue Sep 13 20:02:24 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)	20:02
opendevmeet	Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.html	20:02
opendevmeet	Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.txt	20:02
opendevmeet	Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-09-13-19.01.log.html	20:02
fungi	thanks clarkb!	20:02
ianw	fungi: i feel like that's a bit unfair. i think they would say that they have a lot of CI and try to maintain a lot of compatibility. when we reported a bug it was investigated and fixed promptly. not sure you can ask for a lot more	20:02
fungi	ianw: rather, i mean they have an incentive to not provide a "stable" tag, because that's what people pay them for	20:02
fungi	a tag for "give me the latest official release"	20:03
fungi	rather than consuming their beta test stream or pinning to a specific version	20:04
frickler	I think most people don't need that, they use numbered tagged versions and something like renovate bot on github to keep track	20:04
