Tuesday, 2022-10-25

clarkbJust about meeting time18:59
ianwo/19:00
clarkbWe do have a fairly large agenda so I'll try to keep things moving19:00
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Oct 25 19:01:16 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-October/000369.html Our Agenda19:01
clarkb#topic Announcements19:01
clarkbNo announcements so we can dive right in19:01
clarkb#topic Bastion Host Changes19:02
clarkbianw: you've made a bunch of progress on this both with the zuul console log files and the virtualenv and upgrade work19:02
ianwyep in short there is one change that is basically s/bridge.openstack.org/bridge01.opendev.org/ -> https://review.opendev.org/c/opendev/system-config/+/86111219:03
ianwthe new host is ready19:03
clarkbianw: at this point do we expect that we won't have any console logs written to the host? we updated the base jobs repo and system-config? Have we deleted the old files?19:03
ianwoh, in terms of the console logs in /tmp -- yep they should be gone and i removed all the old files19:03
clarkbI guess that is less important for bridge as we're replacing the host. But for static that is important19:04
clarkbalso great19:04
ianwon bridge and static19:04
clarkbFor the bridge replacement I saw there were a couple of struggles with the overlap between testing and prod. Are any of those worth digging into?19:04
ianwnot at this point -- it was all about trying to minimise the number of places we hardcode literal "bridge.openstack.org"19:05
ianwi think I have it down to about the bare minimum; so 861112 is basically it19:05
clarkbFor the new server the host vars and group vars and secrets files are moved over?19:06
clarkb(since that requires a manual step)19:06
ianwno, so i plan on doing that today if no objections19:06
ianwthere's a few manual steps -- copying the old secrets, and setting up zuul login19:06
ianwand i am 100% sure there is something forgotten that will be revealed when we actually try it19:06
clarkbya I think the rough order of operations should be copying that content over, ask other roots to double check things and then land https://review.opendev.org/c/opendev/system-config/+/861112 ?19:07
ianwbut i plan to keep notes and add a small checklist for migrating bridge to system-config docs19:07
clarkb++19:07
clarkbif we want we can do a pruning pass of that data first too (since we may have old hosts var files or similar)19:07
ianwyep, that is about it19:07
clarkbbut that seems less critical and can be done on the new host afterwards too19:08
clarkbok sounds good to me19:08
ianwyeah i think at this point i'd like to get the migration done and prod jobs working on it -- then we can move over any old ~ data and prune, etc.19:08
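A rough sketch of the copy step being described, with hypothetical paths; the actual locations of the host/group vars and secrets on bridge may differ:

    # Hypothetical sketch only: sync the private Ansible data (host_vars,
    # group_vars, secrets kept outside git) from the old bastion to its
    # replacement. Paths and the staging directory are placeholders.
    rsync -av "root@bridge.openstack.org:/etc/ansible/" bridge-backup/
    rsync -av bridge-backup/ "root@bridge01.opendev.org:/etc/ansible/"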
clarkbanything else on this topic?19:09
ianwnope, hopefully next week it won't be a topic! :)19:09
corvus(and try to reconstruct our venvs!  ;)19:09
fungithat'll be the hardest part!19:09
clarkbit may be worth keeping the old bridge around for a bit too just in case19:09
clarkbthank you for pushing this along. Great progress19:09
ianwcorvus: i've got a change out for us to have launch node in a venv setup by system-config, so we don't need separate ones19:09
corvus++19:09
ianw#link https://review.opendev.org/c/opendev/system-config/+/86128419:09
clarkb#topic Upgrading Bionic Servers19:09
clarkbWe have our first jammy server in production. gitea-lb02 which fronts opendev.org19:10
clarkbThis server was booted in vexxhost which does/did not have a jammy image already. I took ubuntu's published image and converted it to raw and uploaded that to vexxhost19:10
clarkbI did the raw conversion for maximum compatibility with vexxhost ceph19:11
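For reference, the conversion and upload described above is roughly the following; the image URL, name, and properties are a sketch rather than the exact commands used for gitea-lb02:

    # Fetch Ubuntu's published Jammy cloud image, convert it to raw for
    # compatibility with ceph-backed storage, and upload it to the cloud.
    wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img
    qemu-img convert -f qcow2 -O raw \
        jammy-server-cloudimg-amd64.img jammy-server-cloudimg-amd64.raw
    openstack image create --disk-format raw --container-format bare \
        --file jammy-server-cloudimg-amd64.raw ubuntu-jammy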
clarkbThat seems to be working fine. But did require a modern paramiko in a venv to do ssh as jammy ssh seems to not want to do rsa + sha119:11
clarkbI thought about updating launch node to use an ed25519 key instead but paramiko doesn't have key generation routines for that key type like it does rsa19:11
clarkbAnyway it mostly works except for the paramiko thing. I don't think there is much to add to this other than that ianw's bridge work should hopefully mitigate some of this19:12
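If the tooling ever stops relying on paramiko for key generation, one possible workaround (purely hypothetical here, not what launch node does today) would be to shell out to ssh-keygen, which can generate ed25519 keys:

    # Hypothetical alternative: generate an ed25519 keypair with OpenSSH
    # instead of paramiko's API, which only offers RSA generation.
    ssh-keygen -t ed25519 -N '' -f /tmp/launch-node-key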
clarkbOtherwise I think we can go ahead and launch jammy nodes19:12
clarkb#topic Removing snapd19:12
ianw++ the new bridge is jammy too19:12
clarkbWhen doing the new jammy node I noticed that we don't remove snapd which is something I thought we were doing. Fungi did some excellent git history investigating and discovered we did remove snapd at one time but stopped so that we could install the kubectl snap19:13
clarkbWe aren't currently using kubectl for anything in production and even if we were I think we could find a different install method. This makes me wonder if we should go back to removing snapd?19:13
clarkbI don't think we need to make a hard decision here in the meeting but wanted to call it out as something to think about and if you have thoughts I'm happy for them to be shared19:14
fungialso we only needed to stop removing it from the server(s) where we installed kubectl19:14
fungiand also there now seem to be more sane ways of installing an updated kubectl anyway19:14
ianwi hit something tangentially related with the screenshots -- i wanted to use firefox on the jammy hosts but the geckodriver  bits don't work because firefox is embedded in a snap19:15
ianwwhich -- i guess i get why you want your browser sandboxed.  but it's also quite a departure from the traditional idea of a packaged system19:16
clarkbprobably something that deserves a bit more investigation to understand its broader impact then19:16
clarkbI'll try to make time for that. One thing that might be good is listing snaps for which there aren't packages that we might end up using like kubectl or firefox19:16
clarkbAnd then take it from there19:17
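A minimal sketch of that audit, assuming a stock Ubuntu host; nothing here reflects a decision that has been made:

    # See whether any snaps are actually installed on a given host...
    snap list
    # ...and, if we decide to go back to removing snapd, purge it.
    sudo apt-get purge -y snapd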
clarkb#topic Mailman 319:17
clarkbMoving along so we don't run out of time19:17
clarkbfungi: I think our testing is largely complete at this point. Are we ready to boot a new jammy server and if so have we decided where it should live?19:18
fungiif folks are generally satisfied with our forked image strategy, yeah i guess next steps are deciding where to boot it and then booting it and getting it excluded from blocklists if needed19:18
clarkbat this point I still haven't heard from the upstream image maintainer. I do think we should probably accept that we'll need to maintain our own images at least for now19:19
fungionce we have ip addresses for the server, we can include those in communications around migration planning for lists.opendev.org and lists.zuul-ci.org as our first sites to move19:19
clarkbre hosting location it occurred to me that we can't easily get reverse dns records outside of rax which makes me think rax is the best location for a mail server19:20
clarkbBut I think we could also host it in vexxhost if mnaser doesn't have concerns with email flowing through his IPs and he is willing to edit dns records for us19:21
fungiperhaps, but rackspace also preemptively places their netblocks on the sbl19:21
fungiwhich makes them less great for it19:21
fungier, on the pbl i mean19:21
clarkbya so maybe step 0 is send a feeler to mnaser about it19:21
fungi(spamhaus policy blocklist)19:21
clarkbto figure out how problematic the dns records and email traffic would be19:21
corvusi think that's normal/expected behavior19:22
corvusand removal from pbl is easy?19:22
fungiexclusion from pbl used to be easier19:22
corvusi think vexxhost can do reverse dns by request19:22
funginow they require you to periodically renew your pbl exclusion and there's no way to find out when it will run out that i can find19:22
corvusis it not easy?  i thought it was click-and-done19:22
corvusah :(19:23
ianwfor review02 we did have to ask mnaser, but it was also easy :)19:23
clarkbfrom our end being able to set reverse dns records was what came to mind. Sounds like pbl is also worth considering19:23
ianwso there's already a lot of mail coming out of that19:23
fungicorvus: at least i recall spotting that change recently, looking now for a clear quote i can link19:24
corvus(either place seems good to me; seems like nothing's perfect)19:25
clarkbI guess the two todos are for people to weigh in on whether or not we're comfortable with forked images and specify if they have a strong preference for hosting location19:26
clarkbI agree sounds like we'll just deal with different things in either location19:26
clarkb#link https://review.opendev.org/c/opendev/system-config/+/860157 Change to fork upstream mailman3 docker images19:26
fungii concur19:26
clarkbmaybe drop your thoughts there?19:26
fungialso i suppose merging those changes will be a prerequisite to booting the new server19:27
fungithere's a series of several19:27
clarkbfungi: we can boot the new server first, it just won't do much until changes land19:27
clarkbbut I don't think the boot order is super important here19:27
fungigood point19:27
clarkbok lets move on. Please leave thoughts on the change otherwise I expect we'll proceed19:28
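For whoever boots the server, the reverse dns and PBL status discussed above can be sanity-checked from a shell; the address below is a placeholder:

    # Reverse DNS for the (placeholder) server address should resolve to
    # the expected lists hostname.
    dig +short -x 203.0.113.10
    # Query Spamhaus ZEN with the octets reversed; any 127.0.0.x answer
    # means the address appears on one of their lists (including the PBL).
    dig +short 10.113.0.203.zen.spamhaus.org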
clarkb#topic Switching our base job nodeset to Jammy19:28
clarkb#link https://review.opendev.org/c/opendev/base-jobs/+/86262419:29
clarkbtoday is the day we said we would make this swap19:29
fungiyeah, we can merge it after the meeting wraps up19:29
clarkb++ Mostly a heads up that this is changing and to be on the lookout for fallout19:29
clarkbI did find a place in zuul-jobs that would likely break which was python3.8 jobs running without a nodeset specifier19:30
fungiif anyone else wants to review that three-line change before we approve it, you have roughly half an hour19:30
clarkbI expect that sort of thing to be the bulk of what we run into19:30
clarkb#topic Updating our base python images to use pip wheel19:31
clarkbAbout a week ago Nodepool could no longer build its container images. The issue was that we weren't using wheels built by the builder in the prod image19:31
clarkbafter a bunch of debugging it basically came down to pip 22.3 changed the location it caches wheels compared to 22.2.2 and prior19:32
clarkbI think this is actually a pip bug (because it reduces the file integrity assertions that existed previously)19:32
fungior rather the layout of the cache directory19:32
fungichanged19:32
clarkbya19:32
clarkb#link https://github.com/pypa/pip/issues/1152719:32
clarkb#link https://github.com/pypa/pip/pull/1153819:33
clarkbI filed an issue upstream and wrote a patch. The patch is currently not passing CI due to a different git change (that zuul also ran into) that impacts their test suite. They've asked if I want to write a patch for that too but I haven't found time yet19:33
clarkbAnyway part of the fallout from this is that pip says we shouldn't use the cache that way as its more of an implementation detail for pip which is a reasonable position19:33
clarkbTheir suggestion is to use `pip wheel` instead and explicitly fetch/build wheels and use them that way19:34
clarkb#link https://review.opendev.org/c/opendev/system-config/+/86215219:34
clarkbthat change updates our base images to do this. I've tested it with a change to nodepool and diskimage builder which helps to exercise that the modifications actually work without breaking sibling installs and extras installs19:35
clarkbThis shouldn't actually change our images much, but should make our build process more reliable in the future19:35
clarkbreviews and concerns appreciated.19:35
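The general shape of that two-stage approach, as a sketch (the actual paths and options in the system-config change may differ):

    # Builder stage: resolve and build every wheel once, while compilers
    # and -dev packages are available.
    pip wheel --wheel-dir /output/wheels -r requirements.txt
    # Final stage: install only from those prebuilt wheels, so the
    # production image needs no build dependencies or network fetches.
    pip install --no-index --find-links /output/wheels -r requirements.txt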
fungitonyb is looking into doing something similar with rewrites of the wheel cache builder jobs, i think19:35
fungiand constraints generation jobs more generally19:36
clarkbThe other piece of feedback that came out of this is that other people do similar, but instead of creating a wheel cache and copying that and performing another install on the prod image they do a pip install --user on the builder side then just copy over $USER/.local to the prod image19:36
clarkbthis has the upside of not needing wheel files in the final image which reduces the final image size19:36
fungias long as the path remains the same, right?19:37
clarkbI think we should consider doing that as well, but all of our consuming images would need to be updated to find the executables in the local dir or a virtualenv19:37
clarkbfungi: yes it only works if the two sides stay in sync for python versions (something we already attempt to do) and paths19:37
ianwthat would be ... interesting19:37
ianwi think most would not19:37
fungiwell, but also venvs aren't supposed to be relocateable19:37
clarkbit's the "and paths" bit that makes it difficult for us to transition as we'd need to update the consuming images19:37
ianw(find things in a venv)19:37
clarkbfungi: yes, except in this case they aren't relocating; as far as they are concerned everything stays in the same spot19:38
fungithe path of the venv inside the container image would need to be the same as where they're copied from on the host where they're built?19:38
fungior maybe i'm misunderstanding how docker image builds work19:38
clarkbfungi: the way it works today is we have a builder image and a base image that becomes the prod image19:39
corvusi think the global install has a lot going for it and prefer that to a user/venv install19:39
clarkbthe builder image makes wheels using compile time deps. We copy the wheels to the base prod image and install there which means we don't need build time deps in the prod image19:39
fungiokay, so you're saying create the venv in the builder image but then copy it to the base image19:39
clarkbin the venv case you'd make the venv on the builder and copy it to base19:39
fungiin that case the paths would be identical, right19:39
clarkbcorvus: ya it would definitely be a lot of effort to switch considering existing assumptions so we better really like the smaller images19:40
corvuswhy would they be smaller?19:40
clarkbanyway I bring it up as it was mentioned and I do think it is a good idea if the tiniest image is the goal. I don't think we should shelve the pip wheel work in favor of that as its a lot more effort19:40
clarkbcorvus: because we copy the wheels from the builder to the base image which increases the base image by the aggregate size of all the wheels. You don't have this step in the venv case19:40
fungicorvus: because pip will cache the wheels while installing19:40
fungior otherwise needs a local copy of them19:41
corvuswe can just remove the wheel cache after installing them?19:41
clarkbcorvus: that doesn't reduce the size of the image unfortunately19:41
corvusi think there are tools/techniques for that19:42
clarkbbecause they are copied in using a docker COPY directive we get a layer with that copy. Then any removals are just another layer delta saying the files don't exist anymore. But the layer is still there with the contents19:42
clarkbanyway we don't need to debug the sizes here. I just wanted to call it out as another alternative to what my changes propose. But one I think would require significantly more effort which is why I didn't change direction19:42
fungithe recent changes to image building aren't making larger images than what we did before anyway19:43
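The layer behaviour described above is easy to confirm with docker history: deleting copied files in a later step only adds a new layer, it never shrinks the COPY layer itself. The image name below is a placeholder:

    # Per-layer sizes; a COPY of the wheel directory shows up as its own
    # layer and keeps its size even if a later RUN deletes the files.
    docker history --no-trunc some-image:latest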
clarkb#topic Dropping python3.8 base docker images19:44
clarkbrelated but not really is removing python3.8 base docker images to make room for yesterday's python3.11 release19:44
clarkb#link https://review.opendev.org/q/status:open+(topic:use-new-python+OR+topic:docker-cleanups)19:44
clarkbat this point we're ready to land the removal. I didn't +A it earlier since docker hub was having trouble but sounds like that may be over19:44
clarkbthen we should also look at updating python3.9 things to 3.10/3.11 but there is a lot more stuff on 3.9 than 3.819:45
clarkbThank you for all the reviews and moving this along19:46
clarkb#topic iweb cloud going away by the end of the year19:46
clarkbleaseweb acquired iweb which was spun out of inap19:46
clarkbleaseweb is a cloud provider but not primarily an openstack cloud provider.19:46
clarkbThey have told us that the openstack environment currently backing our iweb provider in nodepool will need to go away by the end of the year. But they said we could keep using it until then and to let them know when we stop using it19:47
clarkbthat pool gives us 200 nodes which is a fair bit.19:48
fungiaround 20-25% of our theoretical total quotas, i guess19:48
clarkbThe good news is that they were previously open to the idea of providing us test resources via cloudstack19:48
clarkbthis would require a new nodepool driver. I've got a meeting on friday to talk to them about whether or not this is still something they are interested in19:49
clarkbI don't think we need to do anything today. And I should make a calendar reminder for mid-December to shut down that provider in our nodepool config19:49
clarkbAnd now you all know what I know :)19:50
clarkb#topic Etherpad container log growth19:50
clarkbDuring the PTG last week we discovered the etherpad server's root fs was filling up over time. It turned out to be the container log itself, as there hasn't been an etherpad release to upgrade to in a while so the container has been running for a long time19:50
clarkbTo address that we docker-compose down'd then up'd the service which made a new container and cleared out the old large log file19:51
clarkbMy question here is if we would expect ianw's container syslogging stuff to mitigate this. If so we should convert etherpad to it19:51
clarkbmy understanding is that etherpad writes to stdout/stderr and docker accumulates that into a log file that never rotates19:52
ianwit seems like putting that in /var/log/containers and having normal logrotate would help in that situation19:53
clarkbya I thought it would, but didn't feel like paging in how all that works in order to write a change before someone else agreed it would :)19:53
clarkbsounds like something we should try and get done19:54
ianwit should all be gate testable in that the logfile will be created, and you can confirm the output of docker-logs doesn't have it too19:54
clarkbgood point19:54
ianwi can take a todo to update the etherpad installation19:54
clarkbianw: that would be great (I don't mind doing it either just to page in how it all works, but with the pip things and mailman things and so on time is always an issue)19:55
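For the record, the file that was growing can be located and measured like this; the container name is a placeholder for whatever docker-compose assigned:

    # Find the json-file log docker keeps for the container and check its
    # size; without rotation it grows for as long as the container runs.
    docker inspect --format '{{.LogPath}}' etherpad_etherpad_1 | xargs ls -lh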
clarkb#topic Open Discussion19:55
clarkbSomehow I have more things to bring up that didn't make it to the agenda19:55
clarkbcorvus discovered we're underutilizing our quota in the inmotion cloud19:55
clarkbI believe this to be due to leaked placement allocations in the placement service for that cloud19:55
clarkbhttps://docs.openstack.org/nova/latest/admin/troubleshooting/orphaned-allocations.html19:55
clarkbThat is nova docs on how to deal with it and this is something melwitt has helped with in the past19:55
clarkbI've got that on my todo list to try and take a look but if anyone wants to look at nova debugging I'm happy to let someone else look19:56
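For anyone picking that up, the osc-placement plugin provides the relevant commands; a hedged sketch of the manual cleanup (UUIDs are placeholders):

    # Compare each provider's reported usage against what is actually
    # running to spot stale consumers.
    openstack resource provider list
    openstack resource provider usage show <provider-uuid>
    # Show and, once confirmed orphaned (no matching server), delete the
    # allocations held by a stale consumer UUID.
    openstack resource provider allocation show <consumer-uuid>
    openstack resource provider allocation delete <consumer-uuid>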
clarkbAnd finally, the foundation has sent email to various project mailing lists asking for feedback on the potential for a PTG colocated with the Vancouver summit. There is a survey you can fill out to give them your thoughts19:56
clarkbAnything else?19:57
ianwone minor thing is 19:57
ianw#link https://review.opendev.org/c/zuul/zuul-sphinx/+/86221519:57
ianwsee the links inline, but works around what i think is a docutils bug (no response on that bug from upstream, not sure how active they are)19:58
ianw#link https://review.opendev.org/q/topic:ansible-lint-6.8.219:58
ianwis also out there -- but i just noticed that a bunch of the jobs stopped working because it seems part of the testing is to install zuul-client, which must have just dropped 3.6 support maybe?19:59
clarkbianw: yes it did. That came out of feedback for my docker image updates to zuul-client19:59
ianwanyway, i'll have to loop back on some of the -1's there on some platforms to figure that out, but in general the changes can be looked at 19:59
clarkband we are at time20:00
fungithanks clarkb!20:00
clarkbThank you everyone. Sorry for the long agenda. I guess that is what happens when you skip due to a ptg20:00
clarkb#endmeeting20:00
opendevmeetMeeting ended Tue Oct 25 20:00:37 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)20:00
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.html20:00
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.txt20:00
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-10-25-19.01.log.html20:00
