clarkb | Just about meeting time | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue May 23 19:01:19 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | Hello everyone (I expect a small group today, that's fine) | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/GMCSR45YSJJUK3DNJYTPUI52L4BDP3BM/ Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | I guess a friendly reminder that the Open Infra summit is coming up in a few weeks | 19:02 |
clarkb | less than a month away now (like 3 weeks?) | 19:02 |
* fungi | sticks his head in the sand | 19:02 |
corvus | 2 more weeks till time to panic | 19:02 |
clarkb | #topic Migrating container images to quay.io | 19:03 |
clarkb | After our discussion last week I dove in and started poking at running services with podman in system-config and leaned on Zuul jobs to give us info back | 19:03 |
clarkb | The good news is that podman and buildkit seem to solve the problem with speculative testing of container images | 19:04 |
clarkb | (this was expected but good to confirm) | 19:04 |
clarkb | The bad news is that switching to podman isn't super straightforward, due to a number of smaller issues that add up in my opinion | 19:04 |
clarkb | Hurdles include: packaging for podman and friends isn't great until you get to Ubuntu Jammy or newer; podman doesn't support syslog logging, so we need to switch all our logging over to journald; and beyond simply checking that services can run under podman, we need to transition to it, which means stopping docker services and starting podman services in some sort of coordinated fashion. There are also questions about whether or not we should change which user runs services when moving to podman | 19:06 |
clarkb | I have yet to find a single blocker that would prevent us from doing the move, but I'm not confident we can do it in a short period of time | 19:06 |
clarkb | For this reason I brought this up yesterday in #opendev and basically said I think we should go with the skopeo workaround or revert to docker hub. In both cases we could then work forward to move services onto podman and either remove the skopeo workaround or migrate to quay.io at a later date | 19:07 |
clarkb | During that discussion the consensus seemed to be that we preferred reverting back to docker hub. Then we can move to podman, then to quay.io, and everything should be happy with us, unlike the current state | 19:07 |
fungi | that matches my takeaway | 19:08 |
clarkb | I wanted to bring this up more formally in the meeting before proceeding with that plan. Has anyone changed their mind or have new input etc? | 19:08 |
clarkb | if we proceed with that plan I think it will look very similar to the quay.io move. We want to move the base images first so that when we move the other images back they rebuild and see the current base images | 19:09 |
clarkb | One thing that makes this tricky is the sync of container images back to docker hub since I don't think I can use a personal account for that like I did with quay.io. But that is solvable | 19:09 |
clarkb | I'll also draft up a document like I did for the move to quay.io so that we can have a fairly complete list of tasks and keep track of it all. | 19:10 |
clarkb | I'm not hearing any objections or new input. In that case I plan to start on this tomorrow (I'd like to be able to be pretty head down on it, at least to start, to make sure I don't miss anything important) | 19:11 |
corvus | clarkb: sounds good to me | 19:11 |
fungi | thanks! | 19:11 |
tonyb | sounds good. Thank you clarkb for pulling all that together | 19:11 |
corvus | clarkb: i think you can just manually use the opendev zuul creds on docker? | 19:11 |
clarkb | corvus: yup | 19:12 |
clarkb | should be able to docker login with them during the period of time I need them | 19:12 |
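For illustration, a rough sketch of what that one-time copy back to Docker Hub could look like, assuming skopeo is used for the transfer and driven from a small Python wrapper; the image names and credential placeholder below are hypothetical, not the real list or account:

```python
# Sketch only: copy a handful of images from quay.io back to Docker Hub with
# skopeo. Image names are placeholders; real credentials would be the opendev
# Zuul account's, used only for the duration of the sync.
import subprocess

IMAGES = [
    "opendevorg/python-base:3.11-bookworm",     # hypothetical
    "opendevorg/python-builder:3.11-bookworm",  # hypothetical
]

def copy_back(image: str, dest_creds: str) -> None:
    """Copy all architectures of one image from quay.io to docker.io."""
    subprocess.run(
        [
            "skopeo", "copy", "--all",
            "--dest-creds", dest_creds,  # "username:password"
            f"docker://quay.io/{image}",
            f"docker://docker.io/{image}",
        ],
        check=True,
    )

if __name__ == "__main__":
    creds = "REPLACE_WITH_DOCKERHUB_CREDS"
    for img in IMAGES:
        copy_back(img, creds)
```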
corvus | also, it seems like zuul can continue moving forward with quay.io | 19:12 |
corvus | since it's isolated from the opendev/openstack tenant | 19:12 |
corvus | so it can be the advance party | 19:12 |
clarkb | agreed, and the podman moves there are isolated to CI jobs which are a lot easier to transition. They either work or they don't (and as of a few minutes ago I think they are working) | 19:13 |
corvus | i'm in favor of zuul continuing with that, iff the plan is for opendev to eventually move. | 19:13 |
corvus | i'd like us to be aligned long-term, one way or the other | 19:13 |
clarkb | I'd personally like to see opendev move to podman and quay.io. I think for long term viability the extra functionality is going to be useful | 19:13 |
corvus | ++ | 19:14 |
fungi | same | 19:14 |
clarkb | both tools have extra features that enable things like per image access controls in quay.io, speculative gating etc | 19:14 |
clarkb | I'm just acknowledging it will take time | 19:14 |
corvus | also, we should be able to move services to podman one at a time, then once everything is on podman, make the quay switch | 19:14 |
clarkb | ++ | 19:15 |
clarkb | servers currently on Jammy would be a good place to start since jammy has podman packages built in | 19:16 |
clarkb | currently I think that is gitea, gitea-lb and etherpad? | 19:16 |
clarkb | alright I probably won't get to that today since I've got other stuff in flight already, but hope to focus on this tomorrow. I'll ping for reviews :) | 19:16 |
clarkb | #topic Bastion Host Changes | 19:17 |
clarkb | #link https://review.opendev.org/q/topic:bridge-backups | 19:17 |
clarkb | this topic still needs reviews if anyone has time | 19:17 |
clarkb | otherwise I am not aware of anything new | 19:17 |
fungi | lists01 is also jammy | 19:18 |
clarkb | ah, yup | 19:18 |
clarkb | #topic Mailman 3 | 19:18 |
clarkb | fungi: any new movement with mailman3 things, speaking of lists01? | 19:18 |
fungi | i'm fiddling with it now, current held node is 173.231.255.71 | 19:18 |
clarkb | anything we can do to help? | 19:19 |
fungi | with the current stack the default mailman hostname is no longer one of the individual list hostnames, which gives us some additional flexibility | 19:19 |
clarkb | fungi: we probably want to test the transition from our current state to that state as well? | 19:20 |
clarkb | (I'm not sure if that has been looked at yet) | 19:20 |
fungi | i'm still messing with the django commands to see how we might automate the additional django site to postorius/hyperkitty hostname mappings | 19:20 |
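As a rough sketch of the sort of Django-level mapping being discussed (assuming django_mailman3's MailDomain model is what ties a mail domain to a Django Site; hostnames here are illustrative and the real thing would be driven from our configuration management):

```python
# Sketch (e.g. via "manage.py shell" in the mailman-web container): give each
# list hostname its own Django Site and associate the mail domain with it, so
# Postorius/Hyperkitty can serve multiple vhosts. Assumes django_mailman3's
# MailDomain model; hostnames are illustrative only.
from django.contrib.sites.models import Site
from django_mailman3.models import MailDomain

LIST_HOSTS = ["lists.opendev.org", "lists.example.org"]  # illustrative

for host in LIST_HOSTS:
    site, _ = Site.objects.get_or_create(domain=host, defaults={"name": host})
    MailDomain.objects.get_or_create(mail_domain=host, defaults={"site": site})
```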
fungi | but yes, that should be fairly testable manually | 19:20 |
fungi | for a one-time transition it probably isn't all that useful to build automated testing | 19:21 |
fungi | but ultimately it's just a few db entries changing | 19:21 |
clarkb | ya not sure it needs to be automated. More just tested | 19:21 |
fungi | agreed | 19:22 |
clarkb | similar to how we've checked the gerrit upgrade rollback procedures | 19:22 |
clarkb | Sounds good. Let us know if we can help | 19:22 |
fungi | folks are welcome to override their /etc/hosts to point to the held node and poke around the webui, though this one needs data imported still | 19:22 |
fungi | and there are still things i haven't set, so that's probably premature anyway | 19:23 |
clarkb | ok I've scribbled a note to do that if I have time. Can't hurt anyway | 19:23 |
fungi | exactly | 19:23 |
fungi | thanks! | 19:23 |
clarkb | #topic Gerrit leaked replication task files | 19:24 |
clarkb | This is ongoing. No movement upstream in the issues I've filed. The number of files is growing at a steady but manageable rate | 19:24 |
clarkb | I'm honestly tempted to undo the bind mount for this directory and go back to potentially lossy replication though | 19:24 |
clarkb | I've been swamped with other stuff though and since the rate hasn't grown in a scary way I've been happy to deprioritize looking into this further | 19:25 |
clarkb | Maybe late this week / early next week I can put a java developer hat on and see about fixing it though | 19:25 |
clarkb | #topic Upgrading Old Servers | 19:26 |
clarkb | similar to the last item I haven't had time to look at this recently. I don't think anyone else has either based on what I've seen in gerrit/irc/email | 19:26 |
fungi | i have not, sorry | 19:26 |
clarkb | This should probably be a higher priority than playing java dev though so maybe I start here when I dig out of my hole | 19:27 |
clarkb | if anyone else ends up with time please jump on one of these. It is a big help | 19:27 |
clarkb | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes | 19:27 |
clarkb | #topic Fedora Cleanup | 19:27 |
clarkb | This topic was the old openafs utilization topic. But I pushed a change to remove unused fedora mirror content and that freed up about 200GB and now we are below 90% utilization | 19:28 |
tonyb | yay | 19:28 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/IOYIYWGTZW3TM4TR2N47XY6X7EB2W2A6/ Proposed removal of fedora mirrors | 19:28 |
clarkb | No one has said please keep fedora we need it for X, Y or Z | 19:28 |
clarkb | corvus: ^ I actually recently remembered that the nodepool openshift testing uses fedora but I don't think it needs to | 19:28 |
clarkb | openshift itself is installed on centos and then nodepool is run on fedora. It could probably even be a single node job, or the fedora node could be replaced with anything else | 19:29 |
clarkb | either way I think the next step here is removing fedora mirror config from our jobs so that jobs on fedora talk upstream | 19:29 |
tonyb | I started looking at pulling the setup of f36 out of zuul, but that's a little bigger than expected so we don't break existing users outside of here | 19:29 |
clarkb | then we can remove all fedora mirroring and then we can work on clearing out fedora jobs and the nodes themselves | 19:30 |
fungi | ianw has a (failing/wip) change to remove our ci mirrors from fedora builds in dib | 19:30 |
clarkb | tonyb: oh cool | 19:30 |
fungi | #link https://review.opendev.org/883798 "fedora: don't use CI mirrors" | 19:30 |
clarkb | fungi: that change should pass as soon as the nodepool podman change merges | 19:30 |
corvus | clarkb: i agree, it should be able to be replaced | 19:30 |
clarkb | tonyb: is the issue that we assume all nodes should have a local mirror configured? I wonder how we are handling that with rocky. Maybe we just ignored rocky in that role? | 19:30 |
fungi | clarkb: there are fedora-specific tests in dib which are currently in need of removal for that change to pass, it looked like to me | 19:31 |
tonyb | yeah rocky is ignored | 19:31 |
tonyb | my initial plan was to just pull fedora but then I realised if there were non-OpenDev users of zuul that did care about fedora mirroring they'd get broken | 19:32 |
clarkb | fungi: ah | 19:32 |
tonyb | so now I'm working on adding a flag to the base zuul jobs to say ... skip fedora | 19:32 |
fungi | oh, never mind. now the failures i'm looking at are about finding files on disk, so maybe. (or maybe i'm looking at different errors now) | 19:33 |
clarkb | tonyb: got it. That does pose a problem. If you add flags it might be good to add flags for everything too for consistency | 19:33 |
tonyb | but trying to do that well and struggling with perfect being the enemy of good | 19:33 |
clarkb | tonyb: but getting something up so that the zuul community can weigh in is probably best before over engineering | 19:33 |
clarkb | ++ | 19:33 |
tonyb | clarkb: yeah doing it for everything was the plan | 19:33 |
fungi | but yeah, the dib-functests failure doesn't appear on the surface to be container-toolchain-related | 19:34 |
tonyb | kinda adding a distro-enable-mirror style flag | 19:34 |
tonyb | I'll post a WIP for a more concrete discussion | 19:34 |
clarkb | sounds good | 19:35 |
clarkb | #topic Quo Vadis Storyboard | 19:35 |
clarkb | Anything new here? | 19:35 |
clarkb | I don't have anything new. ETOOMANYTHINGS | 19:35 |
fungi | nor i | 19:36 |
fungi | well, the spamhaus blocklist thing | 19:36 |
fungi | apparently spamhaus css has listed the /64 containing storyboard01's ipv6 address | 19:36 |
clarkb | oh ya we should probably talk to openstack about that. tl;dr is cloud providers (particularly public cloud providers) should assign no less than a /64 to each instance | 19:36 |
fungi | due to some other tenant in rackspace sending spam from a different address in that same network | 19:37 |
fungi | they processed my removal request, but then immediately re-added it because of the problem machine that isn't ours | 19:37 |
fungi | #link https://storyboard.openstack.org/#!/story/2010689 "Improve email deliverabilty for storyboard@storyboard.openstack.org, some emails are considered SPAM" | 19:37 |
fungi | there was also a related ml thread on openstack-discuss, which one of my story comments links to a message in | 19:38 |
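For reference, a listing like this can be checked with a plain DNSBL lookup; a minimal sketch using dnspython (an assumption, it may not be installed on the server), with a placeholder address rather than storyboard01's real one:

```python
# Sketch of a Spamhaus DNSBL check. Note that Spamhaus reportedly does not
# answer queries routed through large public resolvers, so this would need to
# run against a resolver with its own egress address.
import ipaddress

import dns.resolver

def is_listed(addr: str, zone: str = "zen.spamhaus.org") -> bool:
    ip = ipaddress.ip_address(addr)
    if ip.version == 6:
        nibbles = ip.exploded.replace(":", "")
        query = ".".join(reversed(nibbles)) + "." + zone
    else:
        query = ".".join(reversed(addr.split("."))) + "." + zone
    try:
        answers = dns.resolver.resolve(query, "A")
    except (dns.resolver.NXDOMAIN, dns.resolver.NoAnswer):
        return False
    # Listings come back as 127.0.0.x return codes (127.0.0.3 is the CSS list).
    return any(str(r).startswith("127.0.0.") for r in answers)

print(is_listed("2001:db8::1"))  # placeholder, not the real server address
```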
corvus | i know you can tell exim not to use v6 for outgoing; it may be possible to tell it to prefer v4... | 19:38 |
fungi | right, that's a variation of what i was thinking in my last comment on the story | 19:38 |
fungi | and less hacky | 19:39 |
fungi | though still definitely hacky | 19:39 |
fungi | anyway, that's all i really had storyboard-related this week | 19:39 |
clarkb | thanks! | 19:39 |
clarkb | #topic Open Discussion | 19:39 |
clarkb | Anything else? | 19:39 |
fungi | there were a slew of rackspace tickets early today about server outages... pulling up the details | 19:40 |
fungi | zm06, nl04, paste01 | 19:41 |
fungi | looks like they were from right around midnight utc | 19:41 |
fungi | er, no, 04:00 utc | 19:41 |
fungi | anyway, if anyone spots problems with those three servers, that's likely related | 19:42 |
clarkb | only paste01 is a singleton that would really show up as a problem. Might be worth double checking the other two | 19:43 |
fungi | all three servers have a ~16-hour uptime right now | 19:43 |
fungi | so they at least came back up and are responding to ssh | 19:43 |
fungi | host became unresponsive and was rebooted in all three cases (likely all on the same host) | 19:44 |
clarkb | zm06 did not restart its merger | 19:45 |
clarkb | nl04 did restart its launcher | 19:45 |
clarkb | I can restart zm06 after lunch | 19:45 |
clarkb | thank you for calling that out | 19:45 |
corvus | we have "restart: on-failure"... | 19:45 |
fungi | i'll go ahead and close out those tickets on the rackspace end | 19:46 |
corvus | and services normally restart after boots | 19:46 |
fungi | also they worked two tickets i opened yesterday about undeletable servers, most were in the iad region | 19:46 |
corvus | so i wonder why zm06 didn't? | 19:46 |
fungi | in total, freed up 118 nodes worth of capacity | 19:46 |
clarkb | corvus: maybe hard reboot under the host doesn't count as failure? | 19:46 |
fungi | in the twisted land of systemd | 19:47 |
corvus | May 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.636033310Z" level=info msg="Removing stale sandbox 88813ad85aa8c751aa92b684e64e1ea7f2e9f2e9c8209ce79bfcf9fe18ee77e7 (05ce3156e52feef4e8e78ad8015aabba477f7d926635e8bf59534a8294d44559)" | 19:47 |
corvus | May 23 03:46:44 zm06 dockerd[541]: time="2023-05-23T03:46:44.638772820Z" level=warning msg="Error (Unable to complete atomic operation, key modified) deleting object [endpoint f128a3c81341d323dfbcbb367224049ec2aac64f10af50e702b3f546f8f09a6c 6e0ac20c216668af07c98111a59723329046fab81fde48b45369a1f7d088ffeb], retrying...." | 19:47 |
corvus | i have no idea what either of those mean | 19:47 |
corvus | other than that, i don't see any logs about docker's decision making there | 19:48 |
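One way to dig further is to compare the container's restart policy with the state docker recorded for it; a quick sketch using docker inspect from Python, where the container name is a guess at the compose default rather than the verified one:

```python
# Sketch: dump a container's restart policy and last recorded state to see why
# it was not brought back after the host reboot. The container name below is a
# hypothetical docker-compose default, not confirmed.
import json
import subprocess

def inspect(container: str) -> dict:
    out = subprocess.run(
        ["docker", "inspect", container],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)[0]

info = inspect("zuul-merger_merger_1")  # hypothetical name
policy = info["HostConfig"]["RestartPolicy"]
state = info["State"]
print("restart policy:", policy["Name"], policy["MaximumRetryCount"])
print("status:", state["Status"],
      "exit code:", state["ExitCode"],
      "finished at:", state["FinishedAt"])
```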
clarkb | probably fine to start things back up again then and see if it happens again in the future? I guess if we really want we can do a graceful reboot and see if it comes back up | 19:49 |
clarkb | Anything else? Last call; I can give you all 10 minutes back for breakfast/lunch/dinner :) | 19:49 |
fungi | i'm set | 19:50 |
clarkb | Thank you everyone. I expect we'll be back here next week and the week after. But then we'll skip on june 13th due to the summit | 19:50 |
clarkb | #endmeeting | 19:51 |
opendevmeet | Meeting ended Tue May 23 19:51:03 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:51 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.html | 19:51 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.txt | 19:51 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-05-23-19.01.log.html | 19:51 |
fungi | thanks clarkb! | 19:52 |