Tuesday, 2024-07-09

18:58 <clarkb> We'll start our meeting in a couple minutes
19:00 <clarkb> #startmeeting infra
19:00 <opendevmeet> Meeting started Tue Jul  9 19:00:15 2024 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:00 <opendevmeet> Useful Commands: #action #agreed #help #info #idea #link #topic #startvote.
19:00 <opendevmeet> The meeting name has been set to 'infra'
19:00 <clarkb> #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/PF7F3JFOMNXHCMB7J426QSYOJZXGQ66D/ Our Agenda
19:00 <clarkb> #topic Announcements
19:00 <clarkb> A heads up that I'll be AFK the first half of next week through Wednesday
19:01 <clarkb> which means I'll miss that week's meeting. I'm happy for someone else to organize and chair the meeting, but it also seems fine to just skip it
19:02 <clarkb> We can probably make a decision on which option to choose based on how things are going early next week
19:02 <clarkb> not something that needs to be decided here
19:02 <tonyb> Sounds fair
19:03 <clarkb> #topic Upgrading Old Servers
19:03 <clarkb> tonyb: good morning! anything new to add here? I think there was some held node behavior investigation done with fungi
19:04 <tonyb> Not a lot of change. We did some investigation, which didn't show any problems.
19:04 <clarkb> should we be reviewing changes for that at this point then?
19:05 <tonyb> Not yet
19:05 <fungi> i did some preliminary testing
19:05 <tonyb> I'll push up a new revision soon and then we can begin the review process
19:05 <fungi> but haven't done things that would involve me using multiple separate accounts yet
19:06 <fungi> so far it all looks great though, no concerns yet
19:06 <clarkb> sounds good. Since we're deploying a new server alongside the old one we should be able to merge things before we're 100% certain they are ready, and then we can always delete and start over if necessary
19:06 <clarkb> There was also discussion about clouds lacking Ubuntu Noble images at this point.
19:07 <clarkb> tonyb: I did upload a noble image to openmetal if you want to start with a mirror replacement there (but then that runs into the openafs problems...)
19:07 <fungi> yeah, our typical approach is to get the server running on a snapshot of the production data and then re-sync its state at cut-over time
19:07 <tonyb> Yup, I'll work on that later this week
19:07 <clarkb> but ya, I think we can just grab the appropriate image from ubuntu and convert as necessary, then upload
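As a rough sketch of that fetch/convert/upload flow (the cloud name, image name, file paths, and upstream URL below are illustrative assumptions rather than OpenDev's actual tooling or configuration), something along these lines with openstacksdk and qemu-img should work:

```python
# Sketch only: download the upstream Ubuntu Noble cloud image, convert it
# with qemu-img if the target cloud prefers raw, and upload via openstacksdk.
# Cloud name, image name, paths, and the source URL are assumptions.
import subprocess
import urllib.request

import openstack

SOURCE = "https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img"

urllib.request.urlretrieve(SOURCE, "noble.qcow2")

# Convert qcow2 -> raw for clouds that prefer raw images (requires qemu-img).
subprocess.run(
    ["qemu-img", "convert", "-f", "qcow2", "-O", "raw", "noble.qcow2", "noble.raw"],
    check=True,
)

# Credentials come from clouds.yaml; "openmetal" is a hypothetical cloud name.
conn = openstack.connect(cloud="openmetal")
image = conn.create_image(
    name="ubuntu-noble",
    filename="noble.raw",
    disk_format="raw",
    container_format="bare",
    wait=True,
)
print("uploaded image", image.id)
```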
19:07 <fungi> probably easier in this case since it'll use a new primary domain name anyway
19:08 <clarkb> tonyb: oh, I did want to mention that I'm not sure if the two vexxhost regions have different bfv requirements (you mentioned gitea servers are not bfv, but they are in one region and gerrit is in the other). Something we can test though
19:08 <fungi> the only real cut-over step is swapping out the current wiki.openstack.org dns to point at wiki.opendev.org's redirect
19:08 <tonyb> Good to know.
19:09 <clarkb> fungi: we also need to shut things down to move db contents
19:09 <clarkb> so there will be a short downtime I think, as well as updating the dns
19:09 <tonyb> I'll re-work the wiki announcement to send out sometime soon
19:10 <fungi> yeah, i mentioned re-syncing data
19:10 <clarkb> ah yup
19:10 <fungi> but the point is that it only needs to be done one last time when we're ready to change dns for the old domain
19:10 <clarkb> ++
19:11 <fungi> so we can test the new production server pretty thoroughly before that
19:12 <clarkb> anything else related to server upgrades?
19:12 <tonyb> Not from me.
19:13 <clarkb> #topic AFS Mirror Cleanups
19:14 <clarkb> So last week I threw out there that we might consider force merging some centos 8 stream job removals from openstack-zuul-jobs in particular. Doing so will impact projects that are still using fips jobs on centos 8 stream (glance was an example)
19:14 <clarkb> I wanted to see how things were looking post openstack CVE patching before committing to that. Do we think that openstack is in a reasonable place for this to happen now?
19:15 <frickler> yes
19:15 <fungi> i do think so too, yes
19:16 <clarkb> cool, I pushed a change up for that not realizing that frickler had already pushed a similar change.
19:16 <fungi> keep in mind that openstack also has stream 9 fips jobs, so projects running them on 8 are simply on old job configs and should update
19:16 <clarkb> oh, maybe that was just for wheel jobs
19:17 <frickler> clarkb: which ones are you referring to?
19:17 <clarkb> https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922649 and https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922314
19:18 <clarkb> frickler: maybe I should rebase yours on top of mine and then mine can drop the wheel build stuff. Then we force merge the fips removal and work through wheel build cleanup directly (since it's just requirements and ozj that need updates for that)
19:18 <clarkb> if that sounds reasonable I can do that after the meeting and lunch
19:19 <frickler> we can do the reqs update later, too
19:19 <clarkb> ya, it's less urgent since it's less painful to clean up. For the painful stuff ideally we get it done sooner so that people have more time to deal with it
19:20 <clarkb> ok, I'll do that proposed change since I don't hear any objections, and then we can proceed with additional cleanups
19:20 <clarkb> Anything else on this topic?
19:20 <frickler> I mean we could also force merge both and clean up afterwards
19:20 <frickler> but rebasing to avoid conflicts is a good idea anyway
19:21 <clarkb> ack
19:21 <clarkb> #topic Gitea 1.22 Upgrade
19:21 <clarkb> There is a gitea 1.22.1 release now
19:21 <fungi> (can't force merge if they merge-conflict regardless)
19:21 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/920580/ Change implementing the upgrade
19:21 <tonyb> Oh yay
19:21 <clarkb> This change has been updated to deploy the new release. It passes testing and there is a held node
19:22 <clarkb> #link https://104.130.219.4:3081/opendev/system-config 1.22.1 Held node
19:22 <clarkb> I think our best approach here may be to go ahead and do the upgrade once we're comfortable with it (it's a big upgrade...), then worry about the db doctoring after the upgrade, since we need the new version for the db doctoring tool anyway
19:22 <fungi> sounds great to me
19:23 <clarkb> with that plan it would be great if people could review the change and call out any concerns from the release notes or what I've had to update
19:23 <clarkb> and then once we're happy with it, do the upgrade
19:23 <fungi> will do
19:24 <clarkb> #topic Etherpad 2.1.1 Upgrade
19:24 <clarkb> Similarly, we have some recent etherpad releases we should consider updating to. The biggest change appears to be re-adding APIKEY auth to the service
19:24 <clarkb> There are a number of bugfixes too though
19:24 <clarkb> #link https://review.opendev.org/c/opendev/system-config/+/923661 Etherpad 2.1.1 Upgrade
19:24 <clarkb> This also passes testing and I've got a held node for it as well
19:24 <clarkb> 149.202.167.222 is the held node and "clarkb-test" is the pad I used there
19:25 <clarkb> you have to edit /etc/hosts for this one, otherwise redirects send you to the prod server
19:25 <clarkb> also I don't bother to revert the auth method, since I would expect apikey to go away at some point, just with better communication the next time around
19:26 <clarkb> Similar to gitea, I guess: look over the change, release notes, and held node and call out any concerns; otherwise I think we're in good shape to proceed with this one too
19:26 <fungi> i'll test it out after dinner and approve if all looks good. can keep an eye on the deploy job and re-test prod after too
19:26 <clarkb> thanks. I'll be around too after lunch
19:26 <clarkb> #topic Testing Rackspace's New Cloud Offering
19:27 <clarkb> Unfortunately I never heard back after my suggestion for a meeting this week. I'll need to take a different tack
19:27 <clarkb> I left this on the agenda so that I could be clear that no meeting is happening yet
19:27 <clarkb> but that was all
19:27 <clarkb> #topic Drop x/* projects with config errors from zuul
19:27 <clarkb> #link https://review.opendev.org/c/openstack/project-config/+/923509 Proposed cleanup of inactive x/ projects
19:28 <clarkb> frickler has proposed that we remove idle/inactive projects with broken zuul configs from the zuul tenant config. I think this is fine to do. It is easy to add projects back if anyone complains
19:28 <frickler> do you want to send a mail about this first?
19:29 <frickler> and possibly give a last warning period?
19:29 <fungi> an announcement with a planned merge date wouldn't be terrible
19:29 <clarkb> yes, I was just going to suggest that. I think what we can do is send email to service-announce indicating we'll start cleaning up projects, point to your change as the first iteration, and encourage people to fix things up or let us know if they are still active and will fix things
19:29 <frickler> also I didn't actually check the "idle/inactive" part
19:29 <clarkb> frickler: ya, I was going to check it before +2ing :)
19:30 <frickler> I just took the data from the config-errors list
19:30 <clarkb> frickler: do you want to send that email or should I? if you want me to send it, what time frame do you think is a good one for merging it? Sometime next week?
19:30 <frickler> please do send the mail, 1 week notice is fine IMO
19:30 <clarkb> ok, that is on my todo list
19:30 <fungi> most of them probably have my admin account owning their last changes from the initial x/ namespace migration
19:31 <clarkb> thank you for bringing this up, I think cleanups like this will go a long way in reducing the blast radius around things like image label removals from nodepool
19:31 <clarkb> anything else on this subject?
19:31 <frickler> a related question would be whether we want to do any actual repo retirements later
19:31 <frickler> like similar to what openstack does
19:31 <clarkb> to be clear, I'll check activity, leave a review, then send email to service-discuss in the near future
19:32 <frickler> so you'll check all repos? or just the ones with config errors?
19:32 <clarkb> frickler: I think we've avoided doing that ourselves because 1) it's a fair bit of work and 2) it doesn't affect us much if the repos are active and the maintainers don't want to indicate things have shut down
19:32 <clarkb> frickler: just the ones in your change
19:32 <clarkb> frickler: but I guess I can try sampling others to see if we can easily add to the list
19:33 <clarkb> basically we aren't responsible for people managing their software projects in a way that indicates if they have all moved on to other things, and I don't think we should set that expectation
19:33 <frickler> but we might care about projects that have silently become abandoned
19:34 <clarkb> I don't think we do?
19:35 <frickler> one relatively easy to check criterion would be no zuul jobs running (passing) in a year or so
19:35 <clarkb> I mean we care about their impact on the larger CI system, for example. But we shouldn't be responsible for changing the git repo content to say everyone has gone away
19:35 <fungi> proactively identifying and weeding out abandoned projects doesn't seem tractable to me, but unconfiguring projects which are causing us problems is fine
19:35 <clarkb> projects may elect to do that themselves but I don't think we should be expected to do that for anyone
19:36 <frickler> hmm, so I'm in a minority then it seems, fine
19:37 <fungi> it's possible projects are "feature complete" and haven't needed any bug fixes in years so haven't run any jobs, but would get things running if they discovered a fix was needed in the future
19:38 <fungi> and we haven't done a good job of collecting contact information for projects, so reaching out to them may also present challenges
19:39 <clarkb> ya, I think the harm is minimized if we disconnect things from the CI system until needed again
19:39 <clarkb> but we don't need to police project activity beyond that
19:39 <fungi> that roughly matches my position as well
19:39 <frickler> ok, let's move on, then
19:40 <clarkb> #topic Zuul DB Performance Issues
19:40 <clarkb> #link https://zuul.opendev.org/t/openstack/buildsets times out without showing results
19:40 <clarkb> #link https://zuul.opendev.org/t/openstack/buildsets?pipeline=gate takes 20-30s
19:40 <frickler> yes, I noticed that while shepherding the gate last week for the cve patches
19:40 <clarkb> my hunch here is that we're simply lacking indexes that are necessary to make this performant
19:41 <clarkb> since there are at least as many builds as there are buildsets and the builds queries seem to be responsive enough
19:42 <clarkb> corvus: are you around?
19:43 <corvus> i am
19:43 <clarkb> fwiw the behaviors frickler pointed out do appear to reproduce today as well, so it's not just a "zuul is busy with cve patches" behavior
19:43 <clarkb> corvus: cool, I wanted to make sure you saw this (don't expect answers right away)
19:43 <corvus> yeah i don't have any insight
19:44 <frickler> so do we need more inspection on the opendev side, or rather someone to look into the zuul implementation side first?
19:45 <corvus> i think the next step would be asking opendev's db to explain the query plan in this environment to try to track down the problem
19:45 <clarkb> maybe a little bit of both. Looking at the api backend for those queries might point out an obvious inefficiency like a missing index, but if there isn't anything obvious then checking the opendev db server slow logs is probably the next step
19:45 <fungi> ideally we'd spot the culprit on our zuul-web services first, yeah
19:46 <corvus> it's unlikely to be reproducible outside of that environment; each rdbms does its own query plans based on its own assumptions about the current data set
19:46 <fungi> it's always possible the problems we're observing are related to our deployment/data and not upstream defects, after all
19:46 <clarkb> corvus: is there a good way to translate the sqlalchemy python query expression to sql? Maybe it's in the logs on the server side?
19:46 <corvus> "show full processlist" while the query is running
19:46 <clarkb> ack
19:46 <fungi> with as long as these queries are taking, capturing that should be trivial
19:47 <tonyb> I can look at the performance of the other zuuls I have access to, but they're much smaller
19:47 <clarkb> so that's basically: find the running query in the process list, then have the server describe the query plan for that query, then determine what if anything needs to change in zuul
19:47 <corvus> or on the server.  yes.
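A minimal sketch of that workflow, assuming a MariaDB/MySQL backend reachable with PyMySQL and Zuul's default zuul_buildset table name; the host, credentials, and database name are placeholders:

```python
# Rough sketch of the workflow described above: capture the slow query from
# the running process list, then ask the server for its query plan.
# Host, credentials, and database name are placeholders.
import pymysql

conn = pymysql.connect(
    host="zuul-db.example.org",
    user="zuul_ro",
    password="secret",
    database="zuul",
    cursorclass=pymysql.cursors.DictCursor,
)

with conn.cursor() as cur:
    # Find the long-running SELECT issued by the buildsets API endpoint.
    cur.execute("SHOW FULL PROCESSLIST")
    candidates = [
        row for row in cur.fetchall()
        if row["Info"] and "zuul_buildset" in row["Info"]
    ]
    for row in candidates:
        print(row["Time"], row["Info"][:120])

    # Then have the server describe how it plans to execute that query;
    # missing or unused indexes usually show up here as full table scans.
    if candidates:
        cur.execute("EXPLAIN " + candidates[0]["Info"])
        for row in cur.fetchall():
            print(row)

conn.close()
```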
19:48 <clarkb> cool, we have one more topic and we're running out of time, so let's move on
19:48 <frickler> fwiw I don't see any noticeable delay on my downstream zuul server, so yes, likely related to the amount of data opendev has collected
19:48 <clarkb> we can follow up on any investigation outside of the meeting
19:48 <frickler> ack
19:48 <corvus> it's not necessarily the size, it's the characteristics of the data
19:48 <clarkb> #topic Reconsider Zuul Pipeline Queue Floor for OpenStack Tenant
19:49 <clarkb> openstack's cve patching process exposed that things are quite flaky there at the moment, and it is almost impossible that the 20th change in an openstack gate queue would merge at the same time as the first change
19:49 <frickler> this is also mainly based on my observations made last week
19:49 <clarkb> currently we configure the gate pipeline to have a floor of 20 in its windowing algorithm, which means that is the minimum number of changes that will be enqueued
19:50 <clarkb> s/enqueued/running jobs at the same time if we have at least that many changes/
19:50 <clarkb> I think reducing that number to say 10 would be fine (this was frickler's suggestion in the agenda), because as you point out it is highly unlikely we'd merge 20 changes together right now, and it also isn't that common to have a queue that deep these days
19:51 <clarkb> so basically 90% of the time we won't notice either way, and the other 10% of the time it is likely to be helpful
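For illustration only, a simplified simulation of the windowing behavior described above; this is not Zuul's actual implementation, and the growth/shrink rules here are assumptions chosen to show why a floor of 20 keeps that many changes consuming nodes even while the gate keeps resetting:

```python
# A rough, simplified simulation of the windowing behavior being discussed;
# not Zuul's implementation, just an illustration of why a high window floor
# keeps many changes running jobs even when merges keep failing.
def simulate(results, floor=20, start=20, increase=1, decrease_factor=2):
    """Track the active window over a series of buildset results."""
    window = start
    history = []
    for ok in results:
        if ok:
            window += increase  # grow slowly on success
        else:
            # shrink on failure, but never drop below the configured floor
            window = max(floor, window // decrease_factor)
        history.append(window)
    return history

# With a floor of 20, repeated gate failures never reduce the number of
# changes actively consuming nodes; with a floor of 10 they would.
flaky_results = [False] * 5 + [True] * 3 + [False] * 2
print(simulate(flaky_results, floor=20))
print(simulate(flaky_results, floor=10))
```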
19:51 <corvus> is there a measurable problem and outcome we would like to achieve?
19:51 <fungi> i think, from an optimization perspective, because the gate pipeline will almost always contain fewer items than the check pipeline, that having the floor somewhat overestimate a maximum probable merge window is still preferable
19:51 <clarkb> corvus: time to merge openstack changes in the gate, which is impacted by throwing away many test nodes?
19:52 <clarkb> corvus: in particular, openstack also has clean check but the gate has priority for nodes. So if we're burning nodes in the gate that won't ever pass jobs, then check is slowed down, which slows down the time to get into the gate
19:52 <frickler> also time to check further changes while the gate is using a large percentage of total available resources
19:52 <corvus> has "reconsider clean check" been discussed?
19:52 <fungi> we'll waste more resources on gate pipeline results by being optimistic, but if it's not starving the gate pipeline of resources then we have options for manually dealing with resource starvation in lower-priority pipelines
19:53 <clarkb> not that I am aware of. I suspect that removing clean check would just lead to less stability, but maybe the impact of the lower stability would be minimized
19:53 <corvus> if the goal of clean check is to reduce the admission of racy tests, it may be better overall to just run every job twice in gate.  :)
19:53 <clarkb> fungi: like dequeuing things from the gate?
19:54 <fungi> yes, and promoting, and directly enqueuing urgent changes from check
19:54 <corvus> i think originally clean check was added because people were throwing bad changes at the gate and people didn't want to wait for them to fall out
19:54 <clarkb> corvus: yes, one of the original goals of clean check was to reduce the likelihood that a change would be rechecked and forced in before someone looked closer and went "hrm this is actually flaky and broken"
19:54 <corvus> (if that's no longer a problem, it may be more trouble than it's worth now)
19:54 <clarkb> I think that history has shown people are more likely to just recheck harder rather than investigate though
19:55 <clarkb> from opendev's perspective I don't think it creates any issues for us to either remove clean check or reduce the window floor (the window will still grow if things get less flaky)
19:56 <frickler> well, openstack at least is working on trying to reduce blank rechecks
19:56 <clarkb> but OpenStack needs to consider what the various fallout consequences would be in those scenarios
19:56 <corvus> how often is the window greater than 10?
19:56 <corvus> sorry, the queue depth
19:56 <clarkb> corvus: it's very rare these days, which is why I mentioned the vast majority of the time it is a noop
19:57 <clarkb> but it does still happen around feature freeze, requirements update windows, and security patching with a lot of backports
19:57 <clarkb> it's just not daily like it once was
19:57 <fungi> random data point, at this moment in time the openstack tenant has one item in the gate pipeline and it's not even for an openstack project ;)
19:58 <frickler> and it's not in the integrated queue
19:58 <corvus> i don't love the idea, and i especially don't love the idea without any actual metrics around the problem or solution.  i think there are better things that can be done first.
19:58 <corvus> one of those things is early failure detection
19:59 <corvus> is anyone from the openstack tenant working on that?
19:59 <corvus> that literally gets bad changes out of the critical path gate queue faster
19:59 <clarkb> corvus: I guess the idea is early failure detection could help because it would release resources more quickly to be used elsewhere?
19:59 <clarkb> ya
20:00 <frickler> I think that is working for the openstack tenant?
20:00 <fungi> devstack/tempest/grenade jobs in particular (tox-based jobs already get it by default i think?)
20:00 <corvus> fungi: the zuul project is the only one using that now afaict
20:00 <clarkb> frickler: it's disabled by default and not enabled for openstack iirc
20:00 <frickler> at least I saw tempest jobs go red before they reported failure
20:00 <clarkb> the bug related to that impacted projects whether or not they were enabling the feature
20:01 <corvus> the playbook based detection is automatic
20:01 <fungi> longer-running jobs (e.g. tempest) would benefit the most from early failure signalling
20:01 <corvus> the output based detection needs to be configured
20:01 <corvus> (it's regex based)
20:02 <clarkb> we are at time. But maybe a good first step is enabling early failure detection for tempest and unittests and seeing if that helps things
20:02 <corvus> but we worked out a regex for zuul that seems to be working with testr; it would be useful to copy that to other tenants
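The regex corvus mentions is not reproduced here; purely as a hypothetical illustration, the kind of output pattern that might flag a testr/stestr failure could be prototyped and checked against sample console output like this before being wired into a job configuration:

```python
# Hypothetical example of the kind of pattern early failure detection could
# watch for in streamed testr/stestr output; this is NOT the regex the Zuul
# project actually uses, just an illustration that can be tested locally.
import re

EARLY_FAILURE = re.compile(r"^FAILED: .*|\bFAIL: \S+", re.MULTILINE)

sample_output = """\
test_volume_attach ... ok
FAIL: tempest.api.compute.test_servers.ServersTest.test_reboot
traceback follows...
"""

match = EARLY_FAILURE.search(sample_output)
if match:
    # In a real job, matching output would let the scheduler treat the
    # buildset as failing before the remaining tests finish running.
    print("early failure detected:", match.group(0))
```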
20:02 <clarkb> then if that doesn't, we can move on to the next thing, which may or may not be removing clean check or reducing the window floor
20:02 <fungi> i guess in particular, jobs that "fail" early into a long-running playbook without immediately wrapping up the playbook are the best candidates for optimization there?
20:03 <corvus> ++ i'm not super-strongly opposed to changing the floor; i just don't want to throw out the benefits of a deep queue when there are other options.
20:03 <clarkb> fungi: ya, tempest fits that criterion well, as there may be an hour of testing after the first failure I think
20:04 <corvus> clarkb: fungi: exactly
20:04 <fungi> whereas jobs that abort a playbook on the first error are getting most of the benefits already
20:04 <corvus> yep
20:04 <corvus> imagine the whole gate pipeline turning red and kicking out a bunch of changes in 10 minutes :)
20:05 <corvus> (kicking them out of the main line, i mean, not literally ejecting them; not talking about fail-fast here)
20:06 <fungi> i do think that would make a significant improvement in throughput/resource usage
20:07 <corvus> i can't do that work, but i'm happy to help anyone who wants to
20:08 <clarkb> thanks! We're well over time now and I'm hungry. Let's end the meeting here and we can coordinate those improvements either in the regular irc channel or on the mailing list
20:08 <clarkb> thanks everyone!
20:08 <clarkb> #endmeeting
20:08 <opendevmeet> Meeting ended Tue Jul  9 20:08:16 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)
20:08 <opendevmeet> Minutes:        https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.html
20:08 <opendevmeet> Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.txt
20:08 <opendevmeet> Log:            https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.log.html
20:08 <corvus> thanks clarkb !
20:08 <fungi> thanks!
20:09 <clarkb> Sorry we didn't have time for open discussion, but feel free to bring other items up in the regular channel or on the mailing list as well
23:17 *** mordred1 is now known as Guest10162
