clarkb | We'll start our meeting in a couple minutes | 18:58 |
---|---|---|
clarkb | #startmeeting infra | 19:00 |
opendevmeet | Meeting started Tue Jul 9 19:00:15 2024 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:00 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:00 |
opendevmeet | The meeting name has been set to 'infra' | 19:00 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/PF7F3JFOMNXHCMB7J426QSYOJZXGQ66D/ Our Agenda | 19:00 |
clarkb | #topic Announcements | 19:00 |
clarkb | A heads up that I'll be AFK the first half of next week through Wednesday | 19:00 |
clarkb | which means I'll miss that week's meeting. I'm happy for someone else to organize and chair it, but it also seems fine to skip | 19:01 |
clarkb | Can probably make a decision on which option to choose based on how things are going early next week | 19:02 |
clarkb | not something that needs to be decided here | 19:02 |
tonyb | Sounds fair | 19:02 |
clarkb | #topic Upgrading Old Servers | 19:03 |
clarkb | tonyb: good morning! anything new to add here? I think there was some held node behavior investigation done with fungi | 19:03 |
tonyb | Not a lot of change. we did some investigation, which didn't show any problems. | 19:04 |
clarkb | should we be reviewing changes for that at this point then? | 19:04 |
tonyb | Not yet | 19:05 |
fungi | i did some preliminary testing | 19:05 |
tonyb | I'll push up a new revision soon and then we can begin the review process | 19:05 |
fungi | but haven't done things that would involve me using multiple separate accounts yet | 19:05 |
fungi | so far it all looks great though, no concerns yet | 19:06 |
clarkb | sounds good. Since we're deploying a new server alongside the old one we should be able to merge things before we're 100% certain they are ready and then we can always delete and start over if necessary | 19:06 |
clarkb | There was also discussion about clouds lacking Ubuntu Noble images at this point. | 19:06 |
clarkb | tonyb: I did upload a noble image to openmetal if you want to start with a mirror replacement there (but then that runs into the openafs problems...) | 19:07 |
fungi | yeah, our typical approach is to get the server running on a snapshot of the production data and then re-sync its state at cut-over time | 19:07 |
tonyb | Yup I'll work on that later this week | 19:07 |
clarkb | but ya I think we can just grab the appropriate image from ubuntu and convert as necessary then upload | 19:07 |
fungi | probably easier in this case since it'll use a new primary domain name anyway | 19:07 |
clarkb | tonyb: oh I did want to mention that I'm not sure if the two vexxhost regions have different bfv requirements (you mentioned gitea servers are not bfv but they are in one region and gerrit is in the other). Something we can test though | 19:08 |
fungi | the only real cut-over step is swapping out the current wiki.openstack.org dns to point at wiki.opendev.org's redirect | 19:08 |
tonyb | Good to know. | 19:08 |
clarkb | fungi: we also need to shut things down to move db contents | 19:09 |
clarkb | so there will be a short downtime I think as well as updating the dns | 19:09 |
tonyb | I'll re-work the wiki announcement to send out sometime soon | 19:09 |
fungi | yeah, i mentioned re-syncing data | 19:10 |
clarkb | ah yup | 19:10 |
fungi | but point is that only needs to be done one last time when we're ready to change dns for the old domain | 19:10 |
clarkb | ++ | 19:10 |
fungi | so we can test the new production server pretty thoroughly before that | 19:11 |
clarkb | anything else related to server upgrades? | 19:12 |
tonyb | Not from me. | 19:12 |
clarkb | #topic AFS Mirror Cleanups | 19:13 |
clarkb | So last week I threw out the idea that we might consider force merging some centos 8 stream job removals from openstack-zuul-jobs. Doing so will impact projects that are still using fips jobs on centos 8 stream (glance was an example) | 19:14 |
clarkb | I wanted to see how things were looking post openstack CVE patching before committing to that. Do we think that openstack is in a reasonable place for this to happen now? | 19:14 |
frickler | yes | 19:15 |
fungi | i do think so too, yes | 19:15 |
clarkb | cool I pushed a change up for that not realizing that frickler had already pushed a similar change. | 19:16 |
fungi | keep in mind that openstack also has stream 9 fips jobs, so projects running them on 8 are simply on old job configs and should update | 19:16 |
clarkb | oh maybe that was just for wheel jobs | 19:16 |
frickler | clarkb: which ones are you referring to? | 19:17 |
clarkb | https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922649 and https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/922314 | 19:17 |
clarkb | frickler: maybe I should rebase yours on top of mine and then mine can drop the wheel build stuff. Then we force merge the fips removal and then work through wheel build cleanup directly (since it's just requirements and ozj that need updates for that) | 19:18 |
clarkb | if that sounds reasonable I can do that after the meeting and lunch | 19:18 |
frickler | we can do the reqs update later, too | 19:19 |
clarkb | ya it's less urgent since it's less painful to clean up. For the painful stuff ideally we get it done sooner so that people have more time to deal with it | 19:19 |
clarkb | ok I'll do that proposed change since I don't hear any objections and then we can proceed with additional cleanups | 19:20 |
clarkb | Anything else on this topic? | 19:20 |
frickler | I mean we could also force merge both and clean up afterwards | 19:20 |
frickler | but rebasing to avoid conflicts is a good idea anyway | 19:20 |
clarkb | ack | 19:21 |
clarkb | #topic Gitea 1.22 Upgrade | 19:21 |
clarkb | There is a gitea 1.22.1 release now | 19:21 |
fungi | (can't force merge if they merge-conflict regardless) | 19:21 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/920580/ Change implementing the upgrade | 19:21 |
tonyb | Oh yay | 19:21 |
clarkb | This change has been updated to deploy the new release. It passes testing and there is a held node | 19:21 |
clarkb | #link https://104.130.219.4:3081/opendev/system-config 1.22.1 Held node | 19:22 |
clarkb | I think our best approach here may be to go ahead and do the upgrade once we're comfortable with it (it's a big upgrade...) and then worry about the db doctoring after the upgrade, since we need the new version for the db doctoring tool anyway | 19:22 |
fungi | sounds great to me | 19:22 |
clarkb | with that plan it would be great if people could review the change and call out any concerns from the release notes or what I've had to update | 19:23 |
clarkb | and then once we're happy with it do the upgrade | 19:23 |
fungi | will do | 19:23 |
clarkb | #topic Etherpad 2.1.1 Upgrade | 19:24 |
clarkb | Similarly we have some recent etherpad releases we should consider updating to. The biggest change appears to be re-adding APIKEY auth to the service | 19:24 |
clarkb | There are a number of bugfixes too though | 19:24 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/923661 Etherpad 2.1.1 Upgrade | 19:24 |
clarkb | This also passes testing and I've got a held node for it as well | 19:24 |
clarkb | 149.202.167.222 is the held node and "clarkb-test" is the pad I used there | 19:24 |
clarkb | you have to edit /etc/hosts for this one otherwise redirects send you to the prod server | 19:25 |
clarkb | also I didn't bother to revert the auth method since I expect apikey will go away again at some point, just with better communication the next time around | 19:25 |
clarkb | Similar to gitea I guess look over the change and release notes and held node and call out any concerns otherwise I think we're in good shape to proceed with this one too | 19:26 |
fungi | i'll test it out after dinner and approve if all looks good. can keep an eye on the deploy job and re-test prod after too | 19:26 |
clarkb | thanks. I'll be around too after lunch | 19:26 |
clarkb | #topic Testing Rackspace's New Cloud Offering | 19:26 |
clarkb | Unfortunately I never heard back after my suggestion for a meeting this week. I'll need to take a different tack | 19:27 |
clarkb | I left this on the agenda so that I could be clear that no meeting is happening yet | 19:27 |
clarkb | but that was all | 19:27 |
clarkb | #topic Drop x/* projects with config errors from zuul | 19:27 |
clarkb | #link https://review.opendev.org/c/openstack/project-config/+/923509 Proposed cleanup of inactive x/ projects | 19:27 |
clarkb | frickler has proposed that we clean up idle/inactive projects with broken zuul configs from the zuul tenant config. I think this is fine to do. It is easy to add projects back if anyone complains | 19:28 |
frickler | do you want to send a mail about this first? | 19:28 |
frickler | and possibly give a last warning period? | 19:29 |
fungi | an announcement with a planned merge date wouldn't be terrible | 19:29 |
clarkb | yes, I was just going to suggest that. I think what we can do is send email to service announce indicating we'll start cleaning up projects and point to your change as the first iteration and encourage people to fix things up or let us know if they are still active and will fix things | 19:29 |
frickler | also I didn't actually check the "idle/inactive" part | 19:29 |
clarkb | frickler: ya I was going to check it before +2ing :) | 19:29 |
frickler | I just took the data from the config-errors list | 19:30 |
clarkb | frickler: do you want to send that email or should I? if you want me to send it, what time frame do you think is a good one for merging it? Sometime next week? | 19:30 |
frickler | please do send the mail, 1 week notice is fine IMO | 19:30 |
clarkb | ok that is on my todo list | 19:30 |
fungi | most of them probably have my admin account owning their last changes from the initial x/ namespace migration | 19:30 |
clarkb | thank you for bringing this up, I think cleanups like this will go a long way in reducing the blast radius around things like image label removals from nodepool | 19:31 |
clarkb | anything else on this subject? | 19:31 |
frickler | a related question would be whether we want to do any actual repo retirements later | 19:31 |
frickler | like similar to what openstack does | 19:31 |
clarkb | to be clear I'll check activity, leave a review, then send email to service-discuss in the near future | 19:31 |
frickler | so you'll check all repos? or just the ones with config errors? | 19:32 |
clarkb | frickler: I think we've avoided doing that ourselves because 1) it's a fair bit of work and 2) it doesn't affect us much if the repos are active and the maintainers don't want to indicate things have shut down | 19:32 |
clarkb | frickler: just the ones in your change | 19:32 |
clarkb | frickler: but I guess I can try sampling others to see if we can easily add to the list | 19:32 |
clarkb | basically we aren't responsible for people managing their software projects in a way that indicates if they have all moved on to other things and I don't think we should set that expectation | 19:33 |
frickler | but we might care about projects that have silently become abandoned | 19:33 |
clarkb | I don't think we do? | 19:34 |
frickler | one relatively easy to check criterion would be no zuul jobs running (passing) in a year or so | 19:35 |
clarkb | I mean we care about their impact on the larger CI system for example. But we shouldn't be responsible for changing the git repo content to say everyone has gone away | 19:35 |
fungi | proactively identifying and weeding out abandoned projects doesn't seem tractable to me, but unconfiguring projects which are causing us problems is fine | 19:35 |
clarkb | projects may elect to do that themselves but I don't think we should be expected to do that for anyone | 19:35 |
frickler | hmm, so I'm in a minority then it seems, fine | 19:36 |
fungi | it's possible projects are "feature complete" and haven't needed any bug fixes in years so haven't run any jobs, but would get things running if they discovered a fix was needed in the future | 19:37 |
fungi | and we haven't done a good job of collecting contact information for projects, so reaching out to them may also present challenges | 19:38 |
clarkb | ya I think the harm is minimized if we disconnect things from the CI system until needed again | 19:39 |
clarkb | but we don't need to police project activity beyond that | 19:39 |
fungi | that roughly matches my position as well | 19:39 |
frickler | ok, let's move on, then | 19:39 |
clarkb | #topic Zuul DB Performance Issues | 19:40 |
clarkb | #link https://zuul.opendev.org/t/openstack/buildsets times out without showing results | 19:40 |
clarkb | #link https://zuul.opendev.org/t/openstack/buildsets?pipeline=gate takes 20-30s | 19:40 |
frickler | yes, I noticed that while shepherding the gate last week for the cve patches | 19:40 |
clarkb | my hunch here is that we're simply lacking indexes that are necessary to make this performant | 19:40 |
clarkb | since there are at least as many builds as there are buildsets and the builds queries seem to be responsive enough | 19:41 |
clarkb | corvus: are you around? | 19:42 |
corvus | i am | 19:43 |
clarkb | fwiw the behaviors frickler pointed out do appear to reproduce today as well so not just a "zuul is busy with cve patches" behavior | 19:43 |
clarkb | corvus: cool I wanted to make sure you saw this (don't expect answers right away) | 19:43 |
corvus | yeah i don't have any insight | 19:43 |
frickler | so do we need more inspection on the opendev side or rather someone to look into the zuul implementation side first? | 19:44 |
corvus | i think the next step would be asking opendev's db to explain the query plan in this environment to try to track down the problem | 19:45 |
clarkb | maybe a little bit of both. Looking at the api backend for those queries might point out an obvious inefficiency such as a missing index, but if there isn't anything obvious then checking the opendev db server slow logs is probably the next step | 19:45 |
fungi | ideally we'd spot the culprit on our zuul-web services first, yeah | 19:45 |
corvus | it's unlikely to be reproducible outside of that environment; each rdbms does its own query plans based on its own assumptions about the current data set | 19:46 |
fungi | it's always possible problems we're observing are related to our deployment/data and not upstream defects, after all | 19:46 |
clarkb | corvus: is there a good way to translate the sqlalchemy python query expression to sql? Maybe it's in the logs on the server side? | 19:46 |
corvus | "show full processlist" while the query is running | 19:46 |
clarkb | ack | 19:46 |
fungi | with as long as these queries are taking, capturing that should be trivial | 19:46 |
tonyb | I can look at the performance of the other zuuls I have access to but they're much smaller | 19:47 |
clarkb | so that's basically: find the running query in the process list, then have the server describe the query plan for that query, then determine what if anything needs to change in zuul | 19:47 |
corvus | or on the server. yes. | 19:47 |
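To make the investigation steps above concrete, here is a minimal sketch (not OpenDev's actual tooling; the hostname, credentials, and 5-second threshold are placeholders) of capturing the long-running query from the process list and asking the database server for its plan, using Python with PyMySQL:

```python
import pymysql

# Placeholder connection details -- substitute the real Zuul DB host and a
# read-only account before running anything like this.
conn = pymysql.connect(host="zuul-db.example.org", user="zuul_ro",
                       password="REDACTED", database="zuul")
with conn.cursor() as cur:
    # Equivalent of running SHOW FULL PROCESSLIST while the slow API request is in flight.
    cur.execute("SHOW FULL PROCESSLIST")
    # Row layout: Id, User, Host, db, Command, Time, State, Info
    for row in cur.fetchall():
        command, runtime, query = row[4], row[5], row[7]
        if command == "Query" and query and query.lstrip().upper().startswith("SELECT") and runtime > 5:
            print(f"--- running for {runtime}s ---\n{query}")
            cur.execute("EXPLAIN " + query)  # ask the server to describe its query plan
            for plan_row in cur.fetchall():
                print(plan_row)
conn.close()
```

The EXPLAIN output should show whether the buildsets query is falling back to a full table scan, which would support the missing-index hunch.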
clarkb | cool we have one more topic and we're running out of time so lets move on | 19:48 |
frickler | fwiw I don't see any noticeable delay on my downstream zuul server, so yes, likely related to the amount of data opendev has collected | 19:48 |
clarkb | we can followup on any investigation outside of the meeting | 19:48 |
frickler | ack | 19:48 |
corvus | it's not necessarily the size, it's the characteristics of the data | 19:48 |
clarkb | #topic Reconsider Zuul Pipeline Queue Floor for OpenStack Tenant | 19:48 |
clarkb | openstack's cve patching process exposed that things are quite flaky there at the moment and it is almost impossible that the 20th change in an openstack gate queue would merge at the same time as the first change | 19:49 |
frickler | this is also mainly based on my observations made last week | 19:49 |
clarkb | currently we configure the gate pipeline to have a floor of 20 in its windowing algorithm which means that is the minimum number of changes that will be enqueued | 19:49 |
clarkb | s/enqueued/running jobs at the same time if we have at least that many changes/ | 19:50 |
clarkb | I think reducing that number to say 10 would be fine (this was frickler's suggestion in the agenda) because as you point out it is highly unlikely we'd merge 20 changes together right now and it also isn't that common to have a queue that deep these days | 19:50 |
clarkb | so basically 90% of the time we won't notice either way and the other 10% of the time it is likely to be helpful | 19:51 |
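For anyone unfamiliar with the windowing behavior being discussed, the sketch below is a rough simulation of how I understand Zuul's dependent-pipeline window: successes grow it, failures shrink it, and it never drops below the configured floor. It is an approximation for illustration only, not Zuul's actual implementation or OpenStack's exact settings.

```python
def simulate(outcomes, window=20, floor=20, increase=1, decrease=2):
    """outcomes: sequence of 'merge'/'fail' results for items at the queue head."""
    for outcome in outcomes:
        if outcome == "merge":
            window += increase                       # grow on success
        else:
            window = max(floor, window // decrease)  # shrink on failure, clamped at the floor
        print(f"{outcome:>5}: active window = {window}")

# With floor=20 a flaky gate keeps 20 changes' worth of jobs running even while
# resets are frequent; with floor=10 roughly half as many nodes are tied up.
simulate(["fail", "merge", "fail", "fail"], floor=20)
print("---")
simulate(["fail", "merge", "fail", "fail"], floor=10)
```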
corvus | is there a measurable problem and outcome we would like to achieve? | 19:51 |
fungi | i think, from an optimization perspective, because the gate pipeline will almost always contain fewer items than the check pipeline, having the floor somewhat overestimate the maximum probable merge window is still preferable | 19:51 |
clarkb | corvus: time to merge openstack changes in the gate which is impacted by throwing away many test nodes? | 19:51 |
clarkb | corvus: in particular openstack also has clean check but the gate has priority for nodes. So if we're burning nodes in the gate that won't ever pass jobs then check is slowed down, which slows down the time to get into the gate | 19:52 |
frickler | also time to check further changes while the gate is using a large percentage of total available resources | 19:52 |
corvus | has "reconsider clean check" been discussed? | 19:52 |
fungi | we'll waste more resources on gate pipeline results by being optimistic, but if it's not starving the gate pipeline of resources then we have options for manually dealing with resource starvation in lower-priority pipelines | 19:52 |
clarkb | not that I am aware of. I suspect that removing clean check would just lead to less stability, but maybe the impact of the lower stability would be minimized | 19:53 |
corvus | if the goal of clean check is to reduce the admission of racy tests, it may be better overall to just run every job twice in gate. :) | 19:53 |
clarkb | fungi: like dequeuing things from the gate? | 19:53 |
fungi | yes, and promoting, and directly enqueuing urgent changes from check | 19:54 |
corvus | i think originally clean check was added because people were throwing bad changes at the gate and people didn't want to wait for them to fall out | 19:54 |
clarkb | corvus: yes, one of the original goals of clean check was to reduce the likelihood that a change would be rechecked and forced in before someone looked closer and went "hrm this is actually flaky and broken" | 19:54 |
corvus | (if that's no longer a problem, it may be more trouble than it's worth now) | 19:54 |
clarkb | I think history has shown people are more likely to just recheck harder rather than investigate though | 19:54 |
clarkb | from opendev's perspective I don't think it creates any issues for us to either remove clean check or reduce the window floor (the window will still grow if things get less flaky) | 19:55 |
frickler | well openstack at least is working on trying to reduce blank rechecks | 19:56 |
clarkb | but OpenStack needs to consider what the various fallout consequences would be in those scenarios | 19:56 |
corvus | how often is the window greater than 10? | 19:56 |
corvus | sorry, the queue depth | 19:56 |
clarkb | corvus: its very rare these days which is why I mentioned the vast majority of the time it is a noop | 19:56 |
clarkb | but it does still happen around feature freeze, requirements update windows, and security patching with a lot of backports | 19:57 |
clarkb | its just not daily like it once was | 19:57 |
fungi | random data point, at this moment in time the openstack tenant has one item in the gate pipeline and it's not even for an openstack project ;) | 19:57 |
frickler | and it's not in the integrated queue | 19:58 |
corvus | i don't love the idea, and i especially don't love the idea without any actual metrics around the problem or solution. i think there are better things that can be done first. | 19:58 |
corvus | one of those things is early failure detection | 19:58 |
corvus | is anyone from the openstack tenant working on that? | 19:59 |
corvus | that literally gets bad changes out of the critical path gate queue faster | 19:59 |
clarkb | corvus: I guess the idea is early failure detection could help because it would release resources more quickly to be used elsewhere? | 19:59 |
clarkb | ya | 19:59 |
frickler | I think that that is working for the openstack tenant? | 20:00 |
fungi | devstack/tempest/grenade jobs in particular (tox-based jobs already get it by default i think?) | 20:00 |
corvus | fungi: the zuul project is the only one using that now afaict | 20:00 |
clarkb | frickler: it's disabled by default and not enabled for openstack iirc | 20:00 |
frickler | at least I saw tempest jobs go red before they reported failure | 20:00 |
clarkb | the bug related to that impacted projects whether or not they were enabling the feature | 20:00 |
corvus | the playbook based detection is automatic | 20:01 |
fungi | longer-running jobs (e.g. tempest) would benefit the most from early failure signalling | 20:01 |
corvus | the output based detection needs to be configured | 20:01 |
corvus | (it's regex based) | 20:01 |
clarkb | we are at time. But maybe a good first step is enabling early failure detection for tempest and unittests and seeing if that helps | 20:02 |
corvus | but we worked out a regex for zuul that seems to be working with testr; it would be useful to copy that to other tenants | 20:02 |
clarkb | then if that doesn't help we can move on to the next thing, which may or may not be removing clean check or reducing the window floor | 20:02 |
fungi | i guess in particular, jobs that "fail" early into a long-running playbook without immediately wrapping up the playbook are the best candidates for optimization there? | 20:02 |
corvus | ++ i'm not super-strongly opposed to changing the floor; i just don't want to throw out the benefits of a deep queue when there are other options. | 20:03 |
clarkb | fungi: ya tempest fits that criterion well as there may be an hour of testing after the first failure I think | 20:03 |
corvus | clarkb: fungi exactly | 20:04 |
fungi | whereas jobs that abort a playbook on the first error are getting most of the benefits already | 20:04 |
corvus | yep | 20:04 |
corvus | imagine the whole gate pipeline turning red and kicking out a bunch of changes in 10 minutes :) | 20:04 |
corvus | (kicking them out of the main line, i mean, not literally ejecting them; not talking about fail-fast here) | 20:05 |
fungi | i do think that would make a significant improvement in throughput/resource usage | 20:06 |
corvus | i can't do that work, but i'm happy to help anyone who wants to | 20:07 |
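As a follow-up to corvus's point about output-based early failure detection being regex driven: a quick way to vet a candidate pattern against sample test output before adding it to a job config is just to exercise it in Python. The pattern below is purely illustrative, not the regex the Zuul project settled on, and the exact Zuul job attribute that consumes such patterns should be checked against the Zuul docs.

```python
import re

# Hypothetical candidate: flag a run as failed as soon as a per-test FAIL line
# appears in the streamed output, instead of waiting for the final summary.
candidate = re.compile(r"^FAIL: ")

sample_lines = [
    "test_reboot_server (tempest.api.compute.ServersTest) ... ok",
    "FAIL: tempest.api.compute.ServersTest.test_rebuild_server",
    "OK (skipped=14)",
]

for line in sample_lines:
    if candidate.search(line):
        print(f"early failure would trigger on: {line!r}")
```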
clarkb | thanks! We're well over time now and I'm hungry. Let's end the meeting here and we can coordinate those improvements either in the regular irc channel or on the mailing list | 20:08 |
clarkb | thanks everyone! | 20:08 |
clarkb | #endmeeting | 20:08 |
opendevmeet | Meeting ended Tue Jul 9 20:08:16 2024 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:08 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.html | 20:08 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.txt | 20:08 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2024/infra.2024-07-09-19.00.log.html | 20:08 |
corvus | thanks clarkb ! | 20:08 |
fungi | thanks! | 20:08 |
clarkb | Sorry we didn't have time for open discussion but feel free to bring other items up in the regular channel or on the mailing list as well | 20:09 |