-opendevstatus- NOTICE: Zuul job execution is temporarily paused while we rearrange local storage on the servers | 16:53 | |
-opendevstatus- NOTICE: Zuul job execution has resumed with additional disk space on the servers | 17:42 | |
clarkb | almost meeting time | 18:59 |
fungi | ahoy! | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Aug 15 19:01:05 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/FRMRI2B7KC2HPOC5VTJYQBKARGCTY5GA/ Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | I'm back in my normal timezone. Other than that I didn't have anything to announce | 19:01 |
clarkb | oh! the openstack feature freeze happens right at the end of this month / start of September | 19:02 |
fungi | yeah, that's notable | 19:02 |
clarkb | something to be aware of as we make changes to avoid major impact to openstack | 19:02 |
fungi | it will be the busiest time for our zuul resources | 19:03 |
clarkb | sounds like openstack continues to struggle with reliability problems in the gate too so we may be asked to help diagnose issues | 19:03 |
clarkb | I expect they will become more urgent in feature freeze crunch time | 19:04 |
clarkb | #topic Topics | 19:04 |
clarkb | #topic Google Account for Infra root | 19:04 |
clarkb | our infra root email address got notified that an associated google account is on the chopping block December 1st if there is no activity before then | 19:05 |
clarkb | assuming we decide to preserve the account we'll need to do some sort of activity every two years or it will be deleted | 19:05 |
fungi | any idea what it's used for? | 19:05 |
fungi | or i guess not used for | 19:05 |
clarkb | deleted account names cannot be reused so we don't have to worry about someone taking it over at least | 19:05 |
fungi | formerly used for long ago | 19:05 |
clarkb | I'm bringing it up here in hopes someone else knows what it was used for :) | 19:06 |
corvus | i think they retire the address so it may be worth someone logging in just in case we decide it's important later | 19:06 |
corvus | but i don't recall using it for anything, sorry | 19:06 |
clarkb | ya if we login we'll reset that 2 year counter and the account itself may have clues for what it was used for | 19:07 |
clarkb | I can try to do that before our next meeting and hopefully have new info to share then | 19:08 |
clarkb | if anyone else recalls what it was for later please share | 19:08 |
clarkb | but we have time to sort this out for now at least | 19:08 |
clarkb | #topic Bastion Host Updates | 19:08 |
clarkb | #link https://review.opendev.org/q/topic:bridge-backups | 19:09 |
clarkb | this topic could still use root review by others | 19:09 |
fungi | i haven't found time to look yet | 19:09 |
clarkb | I also thought we may need to look at upgrading ansible on the bastion but I think ianw may have already taken care of that | 19:10 |
clarkb | double checking probably a good idea though | 19:10 |
clarkb | #topic Mailman 3 | 19:10 |
clarkb | fungi: it looks like the mailman 3 vhosting stuff is working as expected now. I recall reviewing some mm3 changes though I'm not sure where we've ended up since | 19:10 |
fungi | so the first three changes in topic:mailman3 still need more reviews but should be safe to merge | 19:12 |
fungi | the last change in that series currently is the version upgrade to latest mm3 releases | 19:12 |
clarkb | #link https://review.opendev.org/q/topic:mailman3+status:open | 19:13 |
fungi | i have a held node prepped to do another round of import testing, but got sideswiped by other tasks and haven't had time to run through those yet | 19:13 |
clarkb | ok I've got etherpad and gitea prod updates to land too. After the meeting we should make a rough plan for landing some of these things and pushing forward | 19:13 |
fungi | the upgrade is probably also safe to merge, but has no votes and i can understand if reviewers would rather wait until i've tested importing on the held node | 19:14 |
clarkb | that does seem like a good way to exercise the upgrade | 19:14 |
clarkb | to summarize, no known issues. General update changes needed. Upgrade change queued after general updates. Import testing needed for further migration | 19:15 |
fungi | also the manual steps for adding the django domains/postorius mailhost associations are currently recorded in the migration etherpad | 19:15 |
fungi | i'll want to add those to our documentation | 19:15 |
clarkb | ++ | 19:15 |
fungi | they involve things like ssh port forwarding so you can connect to the django admin url from localhost | 19:16 |
fungi | and need the admin credentials from the ansible inventory | 19:16 |
fungi | once i've done another round of import tests and we merge the version update, we should be able to start scheduling more domain migrations | 19:17 |
clarkb | once logged in the steps are a few button clicks though so pretty straightforward | 19:17 |
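For reference, the manual login step looks roughly like the sketch below. The hostname, forwarded port, and admin path are assumptions rather than values from the deployment (check the ansible inventory and the mailman3 compose file); only the general ssh port-forwarding approach is taken from the discussion above.

```shell
# Forward a local port to the django/postorius web UI on the list server
# (hostname and port 8000 are placeholders -- verify against the deployment)
ssh -L 8000:localhost:8000 root@lists01.opendev.org

# Then browse to http://localhost:8000/admin/ and sign in with the admin
# credentials recorded in the ansible inventory, and add the django domain
# and postorius mail host through the web forms.
```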
fungi | yup | 19:17 |
fungi | i just wish that part were easier to script | 19:17 |
clarkb | ++ | 19:17 |
fungi | from discussion on the mm3 list, it seems there is no api endpoint in postorius for the mailhost association step, which would be the real blocker (we'd have to hack up something based on the current source code for how postorius's webui works) | 19:18 |
clarkb | let's coordinate to land some of these changes after the meeting and then figure out docs and an upgrade as followups | 19:18 |
fungi | sounds great, thanks! that's all i had on this topic | 19:18 |
clarkb | #topic Gerrit Updates | 19:19 |
clarkb | I've kept this agenda item for two reasons. First I'm still hoping for some feedback on dealing with the replication task leaks. Second I'm hoping to start testing the 3.7 -> 3.8 upgrade very soon | 19:19 |
clarkb | For replication task leaks the recent restart where we moved those aside/deleted them showed that is a reasonable thing to do | 19:20 |
clarkb | we can either script that or simply stop bind mounting the dir where they are stored so docker rms them for us | 19:20 |
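A rough sketch of the "script that" option, for illustration only. The compose file location, site path, and the directory where the replication plugin persists its tasks are all assumptions here, not taken from the production deployment:

```shell
# Hypothetical cleanup around a Gerrit restart; paths and locations assumed.
# The replication plugin persists queued tasks under the site's data/ dir.
cd /etc/gerrit-compose   # assumed location of the compose file
docker-compose down
mv /home/gerrit2/review_site/data/replication/ref-updates \
   /home/gerrit2/review_site/data/replication/ref-updates.$(date +%F)
docker-compose up -d
```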
clarkb | For gerrit upgrades the base upgrade job is working (and has been working) but we need to go through the release notes and test things on a held node like reverts (if possible) and any new features or behaviors that concern us | 19:21 |
frickler | did you ever see my issue with starred changes? | 19:21 |
clarkb | frickler: I don't think I did | 19:22 |
frickler | seems I'm unable to view them because I have more than 1024 starred | 19:22 |
clarkb | interesting. This is querying the is:starred listing? | 19:22 |
frickler | yes, getting "Error 400 (Bad Request): too many terms in query: 1193 terms (max = 1024)" | 19:22 |
frickler | and similar via gerrit cli | 19:22 |
fungi | ick | 19:23 |
clarkb | have we brought this up on their mailing list or via a bug? | 19:23 |
frickler | I don't think so | 19:23 |
clarkb | ok I can write an email to repo discuss if you prefer | 19:23 |
frickler | I also use this very rarely, so it may be an older regression | 19:23 |
clarkb | ack | 19:23 |
clarkb | it's also possible that is a configurable limit | 19:23 |
frickler | that would be great, then I could at least find out which changes are starred and unstar them | 19:24 |
frickler | a maybe related issue is that stars aren't shown in any list view | 19:24 |
frickler | just in the view of the change itself | 19:24 |
clarkb | good point. I can ask for hints on methods for finding subsets for unstarring | 19:24 |
frickler | thx | 19:25 |
clarkb | #topic Server Upgrades | 19:26 |
clarkb | No new servers booted recently that I am aware of | 19:26 |
corvus | index.maxTerms | 19:26 |
corvus | frickler: ^ | 19:26 |
clarkb | However we had trouble with zuul executors running out of disk today. The underlying issue was that /var/lib/zuul was not a dedicated fs with extra space | 19:26 |
clarkb | so a reminder to all of us replacing servers and reviewing server replacements to check for volumes/filesystem mounts | 19:27 |
fungi | those got replaced over the first two weeks of july, so it's amazing we didn't start seeing problems before now | 19:27 |
corvus | https://gerrit-review.googlesource.com/Documentation/config-gerrit.html (i can't seem to deep link it, so search for maxTerms there) | 19:27 |
clarkb | #topic Fedora Cleanup | 19:28 |
corvus | re mounts -- i guess the question is what do we want to do in the future? update launch-node to have an option to switch around /opt? or make it a standard part of server documentation? (but then we have to remember to read the docs which we normally don't have to do for a simple server replacement) | 19:29 |
clarkb | #undo | 19:29 |
opendevmeet | Removing item from minutes: #topic Fedora Cleanup | 19:29 |
clarkb | corvus: maybe a good place to annotate that info is in our inventory file since I at least tend to look there in order to get the next server in sequence | 19:29 |
corvus | that's a good idea | 19:29 |
clarkb | because you are right that it will be easy to miss in proper documentation | 19:29 |
fungi | launch node does also have an option to add volumes, i think, which would be more portable outside rackspace | 19:30 |
clarkb | fungi: yes it does volume management and can do arbitrary mounts for volumes | 19:30 |
corvus | or... | 19:30 |
fungi | so if we moved the executors to, say, vexxhost or ovh we'd want to do it that way presumably | 19:30 |
corvus | we could update zuul executors to bind-mount in /opt as /var/lib/zuul, and/or reconfigure them to use /opt/zuul as the build dir | 19:31 |
clarkb | /opt/zuul is a good idea actually | 19:31 |
corvus | (one of those violates our "same path in container" rule, but the second doesn't) | 19:31 |
clarkb | since that reduces moving parts and keeps things simple | 19:31 |
corvus | yeah, /opt/zuul would keep the "same path" rule, so is maybe the best option... | 19:31 |
corvus | i like that. | 19:32 |
fungi | it does look like launch/src/opendev_launch/make_swap.sh currently hard-codes /opt as the mountpoint | 19:32 |
clarkb | yup and swap as the other portion | 19:33 |
fungi | so would need patching if we wanted to make it configurable | 19:33 |
clarkb | I like the simplicity of /opt/zuul | 19:33 |
fungi | thus patching the compose files seems more reasonable | 19:34 |
fungi | and if we're deploying outside rax we just need to remember to add a cinder volume for /opt | 19:34 |
frickler | ack, that's also what my zuul aio uses | 19:34 |
frickler | or have large enough / | 19:34 |
fungi | (unless the flavor in that cloud has tons of rootfs and we feel safe using that instead) | 19:34 |
fungi | yeah, exactly | 19:35 |
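As a sketch of the /opt/zuul idea discussed above (the compose file layout and service name are assumptions; only the "same path inside and outside the container" constraint comes from the conversation):

```yaml
# Hypothetical excerpt of an executor docker-compose.yaml
services:
  executor:
    volumes:
      # same path on the host and in the container, so zuul can be pointed at
      # /opt/zuul (which lives on the large ephemeral or cinder-backed disk)
      - /opt/zuul:/opt/zuul
      # the alternative, remapping /opt to /var/lib/zuul, would violate the
      # "same path in container" rule mentioned above:
      # - /opt:/var/lib/zuul
```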
frickler | coming back to index.maxTerms, do we want to try bumping that to 2k? | 19:36 |
frickler | or 1.5k? | 19:36 |
frickler | I think it'll likely require a restart, though? | 19:37 |
clarkb | at the very least we can probably bump it temporarily allowing you to adjust your star count | 19:37 |
clarkb | yes it will require a restart | 19:37 |
frickler | ok, I'll propose a patch and then we can discuss the details | 19:37 |
clarkb | I don't know what the memory scaling is like for terms but that would be my main concern | 19:38 |
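For reference, the knob corvus pointed at lives in gerrit.config; bumping it would look something like the snippet below (the value shown is illustrative, and as noted it only takes effect after a restart):

```ini
# gerrit.config -- raise the per-query term limit (default 1024)
[index]
        maxTerms = 2048
```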
clarkb | #topic Fedora Cleanup | 19:38 |
clarkb | tonyb and I looked at doing this the graceful way and then upstream deleted the packages anyway | 19:38 |
clarkb | I suspect this means we can forge ahead and simply remove the image type since they are largely non functional due to changes upstream of us | 19:39 |
clarkb | then we can clear out the mirror content | 19:39 |
clarkb | any concerns with that? I know nodepool recently updated its jobs to exclude fedora | 19:39 |
clarkb | I think devstack has done similar cleanup | 19:39 |
fungi | it's possible people have adjusted the urls in jobs to grab packages from the graveyard, but unlikely | 19:39 |
corvus | zuul-jobs is mostly fedora free now due to the upstream yank | 19:40 |
clarkb | I'm hearing we should just go ahead and remove the images :) | 19:40 |
clarkb | I'll try to push that forward this week too | 19:40 |
clarkb | cc tonyb if still interested | 19:40 |
corvus | (also it's worth specifically calling out that there is now no fedora testing in zuul jobs, meaning that the base job playbooks, etc, could break for fedora at any time) | 19:41 |
clarkb | even the software factory third party CI which uses fedora is on old nodes and not running jobs properly | 19:41 |
corvus | so if anyone adds fedora images back to opendev, please make sure to add them to zuul-jobs for testing first before using them in any other projects | 19:41 |
clarkb | ++ | 19:41 |
clarkb | and maybe software factory is interested in updating their third party ci | 19:41 |
fungi | maybe the fedora community wants to run a third-party ci | 19:41 |
fungi | since they do use zuul to build fedora packages | 19:42 |
fungi | (in a separate sf-based zuul deployment from the public sf, as i understand it) | 19:42 |
fungi | so it's possible they have newer fedora on theirs than the public sf | 19:42 |
clarkb | ya we can bring it up with the sf folks and take it from there | 19:43 |
fungi | or bookwar maybe | 19:43 |
corvus | for the base jobs roles, first party ci would be ideal | 19:43 |
fungi | certainly | 19:43 |
corvus | and welcome. just to be clear. | 19:44 |
fungi | just needs someone interested in working on that | 19:44 |
corvus | base job roles aren't completely effectively tested by third party ci | 19:44 |
corvus | (there is some testing, but not 100% coverage, due to the nature of base job roles) | 19:44 |
clarkb | good to keep in mind | 19:46 |
clarkb | #topic Gitea 1.20 | 19:46 |
clarkb | I sorted out the access log issue. Turns out there were additional undocumented in release notes breaking changes | 19:47 |
fungi | gotta love those | 19:47 |
clarkb | and you need different configs for access logs now. I still need to cross-check their format against production since the breaking change they did document is that the format may differ | 19:47 |
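As a rough illustration of the kind of config change involved, a minimal sketch is below; the exact keys are my recollection of the 1.20 logger rework and should be verified against the 1.20 breaking-changes notes rather than treated as authoritative:

```ini
; app.ini -- new-style access logger configuration (keys assumed, verify)
[log]
logger.access.MODE = file
```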
clarkb | Then I've got a whole list of TODOs in the commit message to work through | 19:47 |
clarkb | in general though I just need a block of focused time to page all this back in and get up to speed on it | 19:48 |
clarkb | but good news some progress here | 19:48 |
clarkb | #topic etherpad 1.9.1 | 19:48 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/887006 Etherpad 1.9.1 | 19:48 |
clarkb | Looks like fungi and ianw have tested the held node | 19:48 |
fungi | yes, seems good to me | 19:49 |
clarkb | cool so we can probably land this one real soon | 19:49 |
clarkb | as mentioned earlier we should sync up on a rough plan for some of these and start landing them | 19:49 |
clarkb | #topic Python Container Updates | 19:50 |
clarkb | we discovered last week when trying to sort out zookeeper installs on bookworm that the java packaging for bookworm is broken but not in a consistent manner | 19:50 |
clarkb | it seems to run package setup in different orders depending on which packages you have installed and it only breaks in one order | 19:50 |
clarkb | testing has package updates to fix this but they haven't made it back to bookworm yet. For zookeeper installs we are pulling the affected package from testing. | 19:51 |
clarkb | I think the only service this currently affects is gerrit | 19:51 |
clarkb | and we can probably take our time upgrading gerrit waiting for bookworm to be fixed properly | 19:51 |
clarkb | but be aware of that if you are doing java things on the new bookworm images | 19:51 |
corvus | #link https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1030129 | 19:51 |
clarkb | otherwise I think we are as ready as we can be for migrating our images to bookworm. Sounds like zuul and nodepool plan to do so after their next release | 19:53 |
corvus | clarkb: ... but local testing with a debian:bookworm image had the ca certs install working somehow... | 19:53 |
corvus | so ... actually... we might be able to build bookworm java app containers? | 19:53 |
clarkb | corvus: it appears to be related to the packages already installed and/or being installed affecting the order of installation | 19:53 |
clarkb | corvus: thats true we may be able to build the gerrit containers and sidestep the issue | 19:54 |
corvus | (but nobody will know why they work :) | 19:54 |
clarkb | specifically if the jre package is set up before the ca-certificates-java package it works. But if we go in the other order it breaks | 19:54 |
clarkb | the jre package depends on the certificates package so you can't do separate install invocations between them | 19:55 |
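The "pull the affected package from testing" workaround mentioned above could look roughly like the sketch below; the JRE package name is a stand-in and the details are assumptions, not a copy of the actual zookeeper role:

```shell
# Sketch: pin the fixed ca-certificates-java to testing, then install the JRE
# normally so apt pulls the fixed package regardless of setup ordering.
echo 'deb http://deb.debian.org/debian testing main' \
  > /etc/apt/sources.list.d/testing.list
cat > /etc/apt/preferences.d/ca-certificates-java <<'EOF'
Package: ca-certificates-java
Pin: release a=testing
Pin-Priority: 990
EOF
apt-get update
apt-get install -y default-jre-headless   # illustrative target package
```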
clarkb | #topic Open Discussion | 19:55 |
clarkb | Anything else? | 19:55 |
fungi | forgot to add to the agenda, rackspace issues | 19:56 |
fungi | around the end of july we started seeing frequent image upload errors to the iad glance | 19:56 |
fungi | that led to filling up the builders and they ceased to be able to update images anywhere for about 10 days | 19:57 |
fungi | i cleaned up the builders but the issue with glance in iad persists (we've paused uploads for it) | 19:57 |
fungi | that still needs more looking into, and probably a ticket opened | 19:57 |
clarkb | ++ I mentioned this last week but I think our best bet is to engage rax and show them how the other rax regions differ (if not entirely in behavior at least by degree) | 19:58 |
fungi | separately, we have a bunch of stuck "deleting" nodes in multiple rackspace regions (including iad i think), taking up the majority of the quotas | 19:58 |
fungi | frickler did some testing with a patched builder and increasing the hardcoded 60-minute timeout for images to become active did work around the issue | 19:59 |
fungi | for glance uploads i mean | 19:59 |
fungi | but clearly that's a pathological case and not something we should bother actually implementing | 19:59 |
fungi | and that's all i had | 20:00 |
frickler | yes, that worked when uploading not too many images in parallel | 20:00 |
frickler | but yes, 1h should be enough for any healthy cloud | 20:00 |
corvus | 60m or 6 hour? | 20:00 |
frickler | 1h is the default from sdk | 20:01 |
frickler | I bumped it to 10h on nb01/02 temporarily | 20:01 |
fungi | aha, so you patched the builder to specify a timeout when calling the upload method | 20:01 |
corvus | got it, thx. (so many timeout values) | 20:01 |
frickler | ack | 20:01 |
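For context, the timeout being discussed is the one the builder hands (directly or by default) to openstacksdk's image upload. Standalone, the call looks roughly like this minimal sketch; the cloud name, image name, and file path are placeholders:

```python
# Sketch: upload an image with a longer wait-for-active timeout via
# openstacksdk's cloud layer. The default timeout is 3600s (the 1h noted above).
import openstack

conn = openstack.connect(cloud='rax-iad')  # cloud name assumed
image = conn.create_image(
    'debian-bookworm-test',          # placeholder image name
    filename='debian-bookworm.vhd',  # placeholder file path
    wait=True,
    timeout=10 * 3600,  # allow up to 10 hours for the image to go active
)
print(image.status)
```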
clarkb | we are out of time | 20:02 |
fungi | anyway, we were seeing upwards of a 5.5 hour delay for images to become active there when uploading manually | 20:02 |
fungi | thanks clarkb! | 20:02 |
clarkb | I'm going to end it here but feel free to continue conversation | 20:02 |
clarkb | I just don't want to keep people from lunch/dinner/breakfast as necessary | 20:02 |
clarkb | thank you all! | 20:02 |
clarkb | #endmeeting | 20:02 |
opendevmeet | Meeting ended Tue Aug 15 20:02:50 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:02 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.html | 20:02 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.txt | 20:02 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-08-15-19.01.log.html | 20:02 |