Tuesday, 2024-07-09

18:58 <clarkb> We'll start our meeting in a couple minutes
19:00 <clarkb> #startmeeting infra
19:00 <opendevmeet> Meeting started Tue Jul  9 19:00:15 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00 <opendevmeet> The meeting name has been set to 'infra'
19:00 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/PF7F3JFOMNXHCMB7J426QSYOJZXGQ66D/ Our Agenda
19:00 <clarkb> #topic Announcements
19:00 <clarkb> A heads up that I'll be AFK the first half of next week through Wednesday
19:01 <clarkb> which means I'll miss that week's meeting. I'm happy for someone else to organize and chair the meeting, but it also seems fine to just skip it
19:02 <clarkb> We can probably make a decision on which option to choose based on how things are going early next week
19:02 <clarkb> not something that needs to be decided here
19:02 <tonyb> Sounds fair
19:03 <clarkb> #topic Upgrading Old Servers
19:03 <clarkb> tonyb: good morning! anything new to add here? I think there was some held node behavior investigation done with fungi
19:04 <tonyb> Not a lot of change. We did some investigation, which didn't show any problems.
19:04 <clarkb> should we be reviewing changes for that at this point then?
19:05 <tonyb> Not yet
19:05 <fungi> i did some preliminary testing
19:05 <tonyb> I'll push up a new revision soon and then we can begin the review process
19:05 <fungi> but haven't done things that would involve me using multiple separate accounts yet
19:06 <fungi> so far it all looks great though, no concerns yet
19:06 <clarkb> sounds good. Since we're deploying a new server alongside the old one we should be able to merge things before we're 100% certain they are ready, and then we can always delete and start over if necessary
19:06 <clarkb> There was also discussion about clouds lacking Ubuntu Noble images at this point.
19:07 <clarkb> tonyb: I did upload a noble image to openmetal if you want to start with a mirror replacement there (but then that runs into the openafs problems...)
19:07 <fungi> yeah, our typical approach is to get the server running on a snapshot of the production data and then re-sync its state at cut-over time
19:07 <tonyb> Yup, I'll work on that later this week
19:07 <clarkb> but ya, I think we can just grab the appropriate image from ubuntu and convert as necessary, then upload
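As a rough sketch of that fetch/convert/upload flow (the cloud name, image name, file paths, and upstream URL below are illustrative assumptions rather than OpenDev's actual tooling or configuration), something along these lines with openstacksdk and qemu-img should work:

```python
# Sketch only: download the upstream Ubuntu Noble cloud image, convert it
# with qemu-img if the target cloud prefers raw, and upload via openstacksdk.
# Cloud name, image name, paths, and the source URL are assumptions.
import subprocess
import urllib.request

import openstack

SOURCE = "https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img"

urllib.request.urlretrieve(SOURCE, "noble.qcow2")

# Convert qcow2 -> raw for clouds that prefer raw images (requires qemu-img).
subprocess.run(
    ["qemu-img", "convert", "-f", "qcow2", "-O", "raw", "noble.qcow2", "noble.raw"],
    check=True,
)

# Credentials come from clouds.yaml; "openmetal" is a hypothetical cloud name.
conn = openstack.connect(cloud="openmetal")
image = conn.create_image(
    name="ubuntu-noble",
    filename="noble.raw",
    disk_format="raw",
    container_format="bare",
    wait=True,
)
print("uploaded image", image.id)
```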
19:07 <fungi> probably easier in this case since it'll use a new primary domain name anyway
19:08 <clarkb> tonyb: oh, I did want to mention that I'm not sure if the two vexxhost regions have different bfv requirements (you mentioned gitea servers are not bfv, but they are in one region and gerrit is in the other). Something we can test though
19:08 <fungi> the only real cut-over step is swapping out the current wiki.openstack.org dns to point at wiki.opendev.org's redirect
19:08 <tonyb> Good to know.
19:09 <clarkb> fungi: we also need to shut things down to move db contents
19:09 <clarkb> so there will be a short downtime I think, as well as updating the dns
19:09 <tonyb> I'll re-work the wiki announcement to send out sometime soon
19:10 <fungi> yeah, i mentioned re-syncing data
19:10 <clarkb> ah yup
19:10 <fungi> but the point is that it only needs to be done one last time when we're ready to change dns for the old domain
19:10 <clarkb> ++
19:11 <fungi> so we can test the new production server pretty thoroughly before that
19:12 <clarkb> anything else related to server upgrades?
19:12 <tonyb> Not from me.
19:13 <clarkb> #topic AFS Mirror Cleanups
19:14 <clarkb> So last week I threw out there that we might consider force merging some centos 8 stream job removals from openstack-zuul-jobs in particular. Doing so will impact projects that are still using fips jobs on centos 8 stream (glance was an example)
19:14 <clarkb> I wanted to see how things were looking post openstack CVE patching before committing to that. Do we think that openstack is in a reasonable place for this to happen now?
19:15 <frickler> yes
19:15 <fungi> i do think so too, yes
19:16 <clarkb> cool, I pushed a change up for that not realizing that frickler had already pushed a similar change.
19:16 <fungi> keep in mind that openstack also has stream 9 fips jobs, so projects running them on 8 are simply on old job configs and should update
19:16 <clarkb> oh, maybe that was just for wheel jobs
19:17 <frickler> clarkb: which ones are you referring to?
19:17 <clarkb> https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922649 and https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922314
19:18 <clarkb> frickler: maybe I should rebase yours on top of mine and then mine can drop the wheel build stuff. Then we force merge the fips removal and work through wheel build cleanup directly (since it's just requirements and ozj that need updates for that)
19:18 <clarkb> if that sounds reasonable I can do that after the meeting and lunch
19:19 <frickler> we can do the reqs update later, too
19:19 <clarkb> ya, it's less urgent since it's less painful to clean up. For the painful stuff ideally we get it done sooner so that people have more time to deal with it
19:20 <clarkb> ok, I'll do that proposed change since I don't hear any objections, and then we can proceed with additional cleanups
19:20 <clarkb> Anything else on this topic?
19:20 <frickler> I mean we could also force merge both and clean up afterwards
19:20 <frickler> but rebasing to avoid conflicts is a good idea anyway
19:21 <clarkb> ack
19:21 <clarkb> #topic Gitea 1.22 Upgrade
19:21 <clarkb> There is a gitea 1.22.1 release now
19:21 <fungi> (can't force merge if they merge-conflict regardless)
19:21 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/920580/ Change implementing the upgrade
19:21 <tonyb> Oh yay
19:21 <clarkb> This change has been updated to deploy the new release. It passes testing and there is a held node
19:22 <clarkb> #link https://104.130.219.4:3081/opendev/system-config 1.22.1 Held node
19:22 <clarkb> I think our best approach here may be to go ahead and do the upgrade once we're comfortable with it (it's a big upgrade...), then worry about the db doctoring after the upgrade, since we need the new version for the db doctoring tool anyway
19:22 <fungi> sounds great to me
19:23 <clarkb> with that plan it would be great if people could review the change and call out any concerns from the release notes or what I've had to update
19:23 <clarkb> and then once we're happy with it, do the upgrade
19:23 <fungi> will do
19:24 <clarkb> #topic Etherpad 2.1.1 Upgrade
19:24 <clarkb> Similarly, we have some recent etherpad releases we should consider updating to. The biggest change appears to be re-adding APIKEY auth to the service
19:24 <clarkb> There are a number of bugfixes too though
19:24 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/923661 Etherpad 2.1.1 Upgrade
19:24 <clarkb> This also passes testing and I've got a held node for it as well
19:24 <clarkb> 149.202.167.222 is the held node and "clarkb-test" is the pad I used there
19:25 <clarkb> you have to edit /etc/hosts for this one, otherwise redirects send you to the prod server
19:25 <clarkb> also I don't bother to revert the auth method, since I would expect apikey to go away at some point, just with better communication the next time around
19:26 <clarkb> Similar to gitea, I guess: look over the change, release notes, and held node and call out any concerns; otherwise I think we're in good shape to proceed with this one too
19:26 <fungi> i'll test it out after dinner and approve if all looks good. can keep an eye on the deploy job and re-test prod after too
19:26 <clarkb> thanks. I'll be around too after lunch
19:26 <clarkb> #topic Testing Rackspace's New Cloud Offering
19:27 <clarkb> Unfortunately I never heard back after my suggestion for a meeting this week. I'll need to take a different tack
19:27 <clarkb> I left this on the agenda so that I could be clear that no meeting is happening yet
19:27 <clarkb> but that was all
19:27 <clarkb> #topic Drop x/* projects with config errors from zuul
19:27 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/923509 Proposed cleanup of inactive x/ projects
19:28 <clarkb> frickler has proposed that we remove idle/inactive projects with broken zuul configs from the zuul tenant config. I think this is fine to do. It is easy to add projects back if anyone complains
19:28 <frickler> do you want to send a mail about this first?
19:29 <frickler> and possibly give a last warning period?
19:29 <fungi> an announcement with a planned merge date wouldn't be terrible
19:29 <clarkb> yes, I was just going to suggest that. I think what we can do is send email to service-announce indicating we'll start cleaning up projects, point to your change as the first iteration, and encourage people to fix things up or let us know if they are still active and will fix things
19:29 <frickler> also I didn't actually check the "idle/inactive" part
19:29 <clarkb> frickler: ya, I was going to check it before +2ing :)
19:30 <frickler> I just took the data from the config-errors list
19:30 <clarkb> frickler: do you want to send that email or should I? if you want me to send it, what time frame do you think is a good one for merging it? Sometime next week?
19:30 <frickler> please do send the mail, 1 week notice is fine IMO
19:30 <clarkb> ok, that is on my todo list
19:30 <fungi> most of them probably have my admin account owning their last changes from the initial x/ namespace migration
19:31 <clarkb> thank you for bringing this up, I think cleanups like this will go a long way in reducing the blast radius around things like image label removals from nodepool
19:31 <clarkb> anything else on this subject?
19:31 <frickler> a related question would be whether we want to do any actual repo retirements later
19:31 <frickler> like similar to what openstack does
19:31 <clarkb> to be clear, I'll check activity, leave a review, then send email to service-discuss in the near future
19:32 <frickler> so you'll check all repos? or just the ones with config errors?
19:32 <clarkb> frickler: I think we've avoided doing that ourselves because 1) it's a fair bit of work and 2) it doesn't affect us much if the repos are active and the maintainers don't want to indicate things have shut down
19:32 <clarkb> frickler: just the ones in your change
19:32 <clarkb> frickler: but I guess I can try sampling others to see if we can easily add to the list
19:33 <clarkb> basically we aren't responsible for people managing their software projects in a way that indicates if they have all moved on to other things, and I don't think we should set that expectation
19:33 <frickler> but we might care about projects that have silently become abandoned
19:34 <clarkb> I don't think we do?
19:35 <frickler> one relatively easy to check criterion would be no zuul jobs running (passing) in a year or so
19:35 <clarkb> I mean we care about their impact on the larger CI system, for example. But we shouldn't be responsible for changing the git repo content to say everyone has gone away
19:35 <fungi> proactively identifying and weeding out abandoned projects doesn't seem tractable to me, but unconfiguring projects which are causing us problems is fine
19:35 <clarkb> projects may elect to do that themselves but I don't think we should be expected to do that for anyone
19:36 <frickler> hmm, so I'm in a minority then it seems, fine
19:37 <fungi> it's possible projects are "feature complete" and haven't needed any bug fixes in years so haven't run any jobs, but would get things running if they discovered a fix was needed in the future
19:38 <fungi> and we haven't done a good job of collecting contact information for projects, so reaching out to them may also present challenges
19:39 <clarkb> ya, I think the harm is minimized if we disconnect things from the CI system until needed again
19:39 <clarkb> but we don't need to police project activity beyond that
19:39 <fungi> that roughly matches my position as well
19:39 <frickler> ok, let's move on, then
19:40 <clarkb> #topic Zuul DB Performance Issues
19:40 <clarkb> #link https://zuul.opendev.org/t/openstack/buildsets times out without showing results
19:40 <clarkb> #link https://zuul.opendev.org/t/openstack/buildsets?pipeline=gate takes 20-30s
19:40 <frickler> yes, I noticed that while shepherding the gate last week for the cve patches
19:40 <clarkb> my hunch here is that we're simply lacking indexes that are necessary to make this performant
19:41 <clarkb> since there are at least as many builds as there are buildsets and the builds queries seem to be responsive enough
19:42 <clarkb> corvus: are you around?
19:43 <corvus> i am
19:43 <clarkb> fwiw the behaviors frickler pointed out do appear to reproduce today as well, so it's not just a "zuul is busy with cve patches" behavior
19:43 <clarkb> corvus: cool, I wanted to make sure you saw this (don't expect answers right away)
19:43 <corvus> yeah i don't have any insight
19:44 <frickler> so do we need more inspection on the opendev side, or rather someone to look into the zuul implementation side first?
19:45 <corvus> i think the next step would be asking opendev's db to explain the query plan in this environment to try to track down the problem
19:45 <clarkb> maybe a little bit of both. Looking at the api backend for those queries might point out an obvious inefficiency like a missing index, but if there isn't anything obvious then checking the opendev db server slow logs is probably the next step
19:45 <fungi> ideally we'd spot the culprit on our zuul-web services first, yeah
19:46 <corvus> it's unlikely to be reproducible outside of that environment; each rdbms does its own query plans based on its own assumptions about the current data set
19:46 <fungi> it's always possible the problems we're observing are related to our deployment/data and not upstream defects, after all
19:46 <clarkb> corvus: is there a good way to translate the sqlalchemy python query expression to sql? Maybe it's in the logs on the server side?
19:46 <corvus> "show full processlist" while the query is running
19:46 <clarkb> ack
19:46 <fungi> with as long as these queries are taking, capturing that should be trivial
19:47 <tonyb> I can look at the performance of the other zuuls I have access to, but they're much smaller
19:47 <clarkb> so that's basically: find the running query in the process list, then have the server describe the query plan for that query, then determine what if anything needs to change in zuul
19:47 <corvus> or on the server.  yes.
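A minimal sketch of that workflow, assuming a MariaDB/MySQL backend reachable with PyMySQL and Zuul's default zuul_buildset table name; the host, credentials, and database name are placeholders:

```python
# Rough sketch of the workflow described above: capture the slow query from
# the running process list, then ask the server for its query plan.
# Host, credentials, and database name are placeholders.
import pymysql

conn = pymysql.connect(
    host="zuul-db.example.org",
    user="zuul_ro",
    password="secret",
    database="zuul",
    cursorclass=pymysql.cursors.DictCursor,
)

with conn.cursor() as cur:
    # Find the long-running SELECT issued by the buildsets API endpoint.
    cur.execute("SHOW FULL PROCESSLIST")
    candidates = [
        row for row in cur.fetchall()
        if row["Info"] and "zuul_buildset" in row["Info"]
    ]
    for row in candidates:
        print(row["Time"], row["Info"][:120])

    # Then have the server describe how it plans to execute that query;
    # missing or unused indexes usually show up here as full table scans.
    if candidates:
        cur.execute("EXPLAIN " + candidates[0]["Info"])
        for row in cur.fetchall():
            print(row)

conn.close()
```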
19:48 <clarkb> cool, we have one more topic and we're running out of time, so let's move on
19:48 <frickler> fwiw I don't see any noticeable delay on my downstream zuul server, so yes, likely related to the amount of data opendev has collected
19:48 <clarkb> we can follow up on any investigation outside of the meeting
19:48 <frickler> ack
19:48 <corvus> it's not necessarily the size, it's the characteristics of the data
19:48 <clarkb> #topic Reconsider Zuul Pipeline Queue Floor for OpenStack Tenant
19:49 <clarkb> openstack's cve patching process exposed that things are quite flaky there at the moment, and it is almost impossible that the 20th change in an openstack gate queue would merge at the same time as the first change
19:49 <frickler> this is also mainly based on my observations made last week
19:49 <clarkb> currently we configure the gate pipeline to have a floor of 20 in its windowing algorithm, which means that is the minimum number of changes that will be enqueued
19:50 <clarkb> s/enqueued/running jobs at the same time if we have at least that many changes/
19:50 <clarkb> I think reducing that number to say 10 would be fine (this was frickler's suggestion in the agenda), because as you point out it is highly unlikely we'd merge 20 changes together right now, and it also isn't that common to have a queue that deep these days
19:51 <clarkb> so basically 90% of the time we won't notice either way, and the other 10% of the time it is likely to be helpful
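For illustration only, a simplified simulation of the windowing behavior described above; this is not Zuul's actual implementation, and the growth/shrink rules here are assumptions chosen to show why a floor of 20 keeps that many changes consuming nodes even while the gate keeps resetting:

```python
# A rough, simplified simulation of the windowing behavior being discussed;
# not Zuul's implementation, just an illustration of why a high window floor
# keeps many changes running jobs even when merges keep failing.
def simulate(results, floor=20, start=20, increase=1, decrease_factor=2):
    """Track the active window over a series of buildset results."""
    window = start
    history = []
    for ok in results:
        if ok:
            window += increase  # grow slowly on success
        else:
            # shrink on failure, but never drop below the configured floor
            window = max(floor, window // decrease_factor)
        history.append(window)
    return history

# With a floor of 20, repeated gate failures never reduce the number of
# changes actively consuming nodes; with a floor of 10 they would.
flaky_results = [False] * 5 + [True] * 3 + [False] * 2
print(simulate(flaky_results, floor=20))
print(simulate(flaky_results, floor=10))
```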
19:51 <corvus> is there a measurable problem and outcome we would like to achieve?
19:51 <fungi> i think, from an optimization perspective, because the gate pipeline will almost always contain fewer items than the check pipeline, that having the floor somewhat overestimate a maximum probable merge window is still preferable
19:51 <clarkb> corvus: time to merge openstack changes in the gate, which is impacted by throwing away many test nodes?
19:52 <clarkb> corvus: in particular, openstack also has clean check but the gate has priority for nodes. So if we're burning nodes in the gate that won't ever pass jobs, then check is slowed down, which slows down the time to get into the gate
19:52 <frickler> also time to check further changes while the gate is using a large percentage of total available resources
19:52 <corvus> has "reconsider clean check" been discussed?
19:52 <fungi> we'll waste more resources on gate pipeline results by being optimistic, but if it's not starving the gate pipeline of resources then we have options for manually dealing with resource starvation in lower-priority pipelines
19:53 <clarkb> not that I am aware of. I suspect that removing clean check would just lead to less stability, but maybe the impact of the lower stability would be minimized
19:53 <corvus> if the goal of clean check is to reduce the admission of racy tests, it may be better overall to just run every job twice in gate.  :)
19:53 <clarkb> fungi: like dequeuing things from the gate?
19:54 <fungi> yes, and promoting, and directly enqueuing urgent changes from check
19:54 <corvus> i think originally clean check was added because people were throwing bad changes at the gate and people didn't want to wait for them to fall out
19:54 <clarkb> corvus: yes, one of the original goals of clean check was to reduce the likelihood that a change would be rechecked and forced in before someone looked closer and went "hrm this is actually flaky and broken"
19:54 <corvus> (if that's no longer a problem, it may be more trouble than it's worth now)
19:54 <clarkb> I think that history has shown people are more likely to just recheck harder rather than investigate though
19:55 <clarkb> from opendev's perspective I don't think it creates any issues for us to either remove clean check or reduce the window floor (the window will still grow if things get less flaky)
19:56 <frickler> well, openstack at least is working on trying to reduce blank rechecks
19:56 <clarkb> but OpenStack needs to consider what the various fallout consequences would be in those scenarios
19:56 <corvus> how often is the window greater than 10?
19:56 <corvus> sorry, the queue depth
19:56 <clarkb> corvus: it's very rare these days, which is why I mentioned the vast majority of the time it is a noop
19:57 <clarkb> but it does still happen around feature freeze, requirements update windows, and security patching with a lot of backports
19:57 <clarkb> it's just not daily like it once was
19:57 <fungi> random data point, at this moment in time the openstack tenant has one item in the gate pipeline and it's not even for an openstack project ;)
19:58 <frickler> and it's not in the integrated queue
19:58 <corvus> i don't love the idea, and i especially don't love the idea without any actual metrics around the problem or solution.  i think there are better things that can be done first.
19:58 <corvus> one of those things is early failure detection
19:59 <corvus> is anyone from the openstack tenant working on that?
19:59 <corvus> that literally gets bad changes out of the critical path gate queue faster
19:59 <clarkb> corvus: I guess the idea is early failure detection could help because it would release resources more quickly to be used elsewhere?
19:59 <clarkb> ya
20:00 <frickler> I think that is working for the openstack tenant?
20:00 <fungi> devstack/tempest/grenade jobs in particular (tox-based jobs already get it by default i think?)
20:00 <corvus> fungi: the zuul project is the only one using that now afaict
20:00 <clarkb> frickler: it's disabled by default and not enabled for openstack iirc
20:00 <frickler> at least I saw tempest jobs go red before they reported failure
20:00 <clarkb> the bug related to that impacted projects whether or not they were enabling the feature
20:01 <corvus> the playbook based detection is automatic
20:01 <fungi> longer-running jobs (e.g. tempest) would benefit the most from early failure signalling
20:01 <corvus> the output based detection needs to be configured
20:01 <corvus> (it's regex based)
20:02 <clarkb> we are at time. But maybe a good first step is enabling early failure detection for tempest and unittests and seeing if that helps things
20:02 <corvus> but we worked out a regex for zuul that seems to be working with testr; it would be useful to copy that to other tenants
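The regex corvus mentions is not reproduced here; purely as a hypothetical illustration, the kind of output pattern that might flag a testr/stestr failure could be prototyped and checked against sample console output like this before being wired into a job configuration:

```python
# Hypothetical example of the kind of pattern early failure detection could
# watch for in streamed testr/stestr output; this is NOT the regex the Zuul
# project actually uses, just an illustration that can be tested locally.
import re

EARLY_FAILURE = re.compile(r"^FAILED: .*|\bFAIL: \S+", re.MULTILINE)

sample_output = """\
test_volume_attach ... ok
FAIL: tempest.api.compute.test_servers.ServersTest.test_reboot
traceback follows...
"""

match = EARLY_FAILURE.search(sample_output)
if match:
    # In a real job, matching output would let the scheduler treat the
    # buildset as failing before the remaining tests finish running.
    print("early failure detected:", match.group(0))
```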
20:02 <clarkb> then if that doesn't, we can move on to the next thing, which may or may not be removing clean check or reducing the window floor
20:02 <fungi> i guess in particular, jobs that "fail" early into a long-running playbook without immediately wrapping up the playbook are the best candidates for optimization there?
20:03 <corvus> ++ i'm not super-strongly opposed to changing the floor; i just don't want to throw out the benefits of a deep queue when there are other options.
20:03 <clarkb> fungi: ya, tempest fits that criterion well, as there may be an hour of testing after the first failure I think
20:04 <corvus> clarkb: fungi: exactly
20:04 <fungi> whereas jobs that abort a playbook on the first error are getting most of the benefits already
20:04 <corvus> yep
20:04 <corvus> imagine the whole gate pipeline turning red and kicking out a bunch of changes in 10 minutes :)
20:05 <corvus> (kicking them out of the main line, i mean, not literally ejecting them; not talking about fail-fast here)
20:06 <fungi> i do think that would make a significant improvement in throughput/resource usage
20:07 <corvus> i can't do that work, but i'm happy to help anyone who wants to
20:08 <clarkb> thanks! We're well over time now and I'm hungry. Let's end the meeting here and we can coordinate those improvements either in the regular irc channel or on the mailing list
20:08 <clarkb> thanks everyone!
20:08 <clarkb> #endmeeting
20:08 <opendevmeet> Meeting ended Tue Jul  9 20:08:16 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
20:08 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.html
20:08 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.txt
20:08 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.log.html
20:08 <corvus> thanks clarkb !
20:08 <fungi> thanks!
20:09 <clarkb> Sorry we didn't have time for open discussion, but feel free to bring other items up in the regular channel or on the mailing list as well
23:17 *** mordred1 is now known as Guest10162
