*** corvus is now known as Guest490 | 09:27 | |
*** tristanC_ is now known as tristanC | 13:16 | |
clarkb | Almost meeting time | 18:59 |
---|---|---|
ianw | o/ | 19:01 |
fungi | ahoy? | 19:01 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Sep 21 19:01:21 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | hello | 19:01 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2021-September/000285.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | Minor notice that the next few days I'll be afk a bit. Have doctor visits and also brothers are dragging me out fishing assuming the fishing is good today (the salmon are swimming upstream) | 19:02 |
clarkb | I'll be around most of the day tomorrow. Then not very around thursday | 19:02 |
clarkb | #topic Actions from last meeting | 19:03 |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-09-14-19.01.txt minutes from last meeting | 19:03 |
diablo_rojo | o/ | 19:03 |
clarkb | There were no recorded actions. However I probably should've recorded one for the next thing :) | 19:03 |
clarkb | #topic Specs | 19:03 |
clarkb | #link https://review.opendev.org/c/opendev/infra-specs/+/804122 Prometheus Cacti replacement | 19:03 |
clarkb | As we discussed last week there is some consideration for whether or not we should continue to use snmp or switch to node exporter | 19:04 |
clarkb | it looks like the big downsides to node exporter are going to be that using distro packages is problematic because the distro packages are older and have multiple backward incompatible changes to the current 1.x release series | 19:04 |
clarkb | using docker to deploy node exporter is possible but as ianw and corvus point out a bit odd because we have to expose system resources to it in the container | 19:05 |
clarkb | Then the downsides to snmp are needing to do a lot of work to build out the metrics and graphs ourselves | 19:05 |
clarkb | I think I'm leaning more towards node exporter. One of the reasons we are switching to prometheus is it gives us the ability to do richer metrics beyond just system level stuff (think applications) and leaning into their more native tooling seems reasonable as a result | 19:06 |
clarkb | Anyway please leave your preferences in review and I'll update it if necessary | 19:06 |
clarkb | #action Everyone provide feedback on Prometheus spec and indicate a preference for snmp or node exporter | 19:07 |
ianw | ++ personally i feel like un-containerised makes most sense, even if we pull it from a ppa or something for consistency | 19:07 |
fungi | in progress, i'm putting together a mailman 3.x migration spec, hope to have it up by the next meeting | 19:07 |
clarkb | ianw: I think if we can get at least node exporter v1.x that would work. As they appear to have gotten a lot better about not just changing the names of stuff | 19:07 |
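
As an aside on the container option discussed above, here is a minimal sketch of what running node_exporter in a container can look like, using the docker python SDK and the upstream prom/node-exporter image. The tag, flags, and restart policy are assumptions for illustration, not a settled deployment.

```python
#!/usr/bin/env python3
"""Minimal sketch of a containerised node_exporter, assuming the docker
python SDK and the upstream prom/node-exporter image are acceptable."""
import docker

client = docker.from_env()

# The host's root filesystem, network namespace, and pid namespace all
# have to be exposed to the container so node_exporter can see real
# system metrics -- this is the "a bit odd" part of the container approach.
client.containers.run(
    "prom/node-exporter:v1.2.2",           # assumed tag, pick as appropriate
    command=["--path.rootfs=/host"],        # read host metrics via the mount
    name="node-exporter",
    network_mode="host",
    pid_mode="host",
    volumes={"/": {"bind": "/host", "mode": "ro"}},
    restart_policy={"Name": "unless-stopped"},
    detach=True,
)
```
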
clarkb | fungi: thanks! | 19:08 |
clarkb | #topic Topics | 19:08 |
clarkb | #topic Listserv updates | 19:08 |
clarkb | Just a heads up that we pinned the kernel packages on lists.o.o. If we need to update the kernel there we can do it explicitly then run the extract-vmlinux tool against the result and replace the file in /boot | 19:09 |
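
For context, a rough sketch of the manual kernel update flow described above. It assumes the kernel tree's scripts/extract-vmlinux helper is available on the server; the version string and paths are illustrative only, and apt-mark hold stands in for however the packages are actually pinned.

```python
#!/usr/bin/env python3
"""Rough sketch of the pinned-kernel update flow on lists.o.o.

Assumes scripts/extract-vmlinux from the kernel tree is available locally;
the kernel version and paths below are illustrative only.
"""
import subprocess

VERSION = "5.4.0-86-generic"  # hypothetical target kernel version
PACKAGES = [f"linux-image-{VERSION}"]

# lift the package hold, install the new kernel explicitly, then re-hold it
subprocess.run(["apt-mark", "unhold", *PACKAGES], check=True)
subprocess.run(["apt-get", "install", "-y", *PACKAGES], check=True)
subprocess.run(["apt-mark", "hold", *PACKAGES], check=True)

# extract an uncompressed kernel from the packaged (compressed) image and
# replace the file in /boot that the server actually boots from
with open(f"/boot/vmlinux-{VERSION}", "wb") as out:
    subprocess.run(
        ["./extract-vmlinux", f"/boot/vmlinuz-{VERSION}"],
        stdout=out,
        check=True,
    )
```
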
clarkb | As far as replacing the server goes I think we should consider that in the context of fungi's spec. Seems like there may be a process where we spin up a mm3 server and then migrate into that to transition servers as well as services | 19:09 |
fungi | yes, we could in theory migrate on a domain by domain basis at least. migrating list by list would be more complex (involving apache redirects and mail forwards) | 19:10 |
clarkb | Another option available to us is to spin up a new mm2 server using the test mode flag which won't email all the list owners. Then we can migrate list members, archives, and configs to that server. I think this becomes a good option if it is easier to upgrade mm3 from mm2 in place | 19:10 |
clarkb | I'm somewhat deferring to reading fungi's spec to get a better understanding of which approach is preferable | 19:11 |
fungi | there's a config importer for mm3, and it can also back-populate the new hyperkitty list archives (sans attachments) from the mbox copies | 19:11 |
fungi | but it's also typical to continue serving the old pipermail-based archives indefinitely so as to not break existing hyperlinks | 19:12 |
fungi | we can rsync them over as a separate step of course | 19:12 |
fungi | otherwise it's mostly switching dns records | 19:12 |
clarkb | that seems reasonable | 19:12 |
fungi | i think the bulk of the work will be up front, figuring out how we want to deploy and configure the containers | 19:13 |
fungi | so that's where i'm going to need the most input once i push up the draft spec | 19:13 |
clarkb | noted | 19:13 |
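
For reference, a hedged sketch of the per-list import path fungi describes above: mailman 3's import21 command for the old 2.1 config pickle plus hyperkitty's hyperkitty_import management command for the mbox archives. The list address, filesystem paths, and the way the django settings are wired up are illustrative assumptions, not the eventual deployment.

```python
#!/usr/bin/env python3
"""Hedged sketch of importing one mailman 2.1 list into mailman 3.

List address and paths are illustrative; hyperkitty_import also needs the
site's django settings configured before it will run.
"""
import subprocess

LIST = "service-discuss@lists.opendev.org"

# import the old list configuration (config.pck) into mailman 3 core
subprocess.run(
    ["mailman", "import21", LIST,
     "/var/lib/mailman/lists/service-discuss/config.pck"],
    check=True,
)

# back-populate the hyperkitty archive from the existing mbox copy
# (attachments are not carried over, as noted above)
subprocess.run(
    ["django-admin", "hyperkitty_import", "-l", LIST,
     "/var/lib/mailman/archives/service-discuss.mbox"],
    check=True,
)
```
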
clarkb | are there any other concerns or items to note about the existing server? | 19:14 |
clarkb | I think we're fairly stable now. And the tools to do the kernel extraction should all be in my homedir on the server | 19:14 |
fungi | not presently, as far as i'm aware | 19:14 |
fungi | not presently any other concerns or items to note, i mean | 19:14 |
clarkb | #topic Improving OpenDev's CD throughput | 19:15 |
clarkb | I suspect that this one has taken a backseat to firefighting and other items | 19:15 |
clarkb | ianw: ^ anything new to call out? Totally fine if not (I've had my own share of distractions recently) | 19:15 |
ianw | no, sorry, will get back to it | 19:16 |
clarkb | #topic Gerrit Account Cleanups | 19:16 |
clarkb | I keep intending to send out emails for this early in a week but then finding other more urgent items early in the week :/ | 19:17 |
clarkb | At this point I'm hopeful this can happen tomorrow | 19:17 |
clarkb | But I haven't sent any emails yet | 19:17 |
clarkb | #topic OpenDev Logo Hosting | 19:17 |
clarkb | This one has made great progress. Thank you ianw. | 19:17 |
clarkb | At this point we've got about 3 things half related to this gerrit update for the logo in our gerrit theme. in #opendev I proposed that we land the gerrit updates soon. Then we can do a gerrit pull and restart to pick up the replication timeout config change and the theme changes. | 19:18 |
ianw | i see we have a plan for paste | 19:18 |
ianw | we can also use the buildkit approach with the gerrit container and copy it in via the assets container | 19:19 |
clarkb | Then when that is done we can update the gitea 1.15.3 change to stop trying to manage the gerrit theme logo url and upgrade gitea | 19:19 |
ianw | but that is separate to actually unblocking gitea | 19:19 |
fungi | yeah, i'm still trying to get the local hosting for paste working, have added a test and an autohold, will see if my test fails | 19:19 |
ianw | ok, will peruse the change | 19:19 |
clarkb | ianw: I think I'm ok with the gerrit approach as is since we already copy other assets in using this system | 19:19 |
clarkb | Separately I did push up a gitea 1.14.7 change stacked under the 1.15.3 change which I think is safe to land today and we should consider doing so | 19:19 |
clarkb | (I'm not sure if gitea tests old point release to latest release upgrades) | 19:20 |
clarkb | ianw: anyway I didn't approve the gerrit logo changes because I wanted to make sure we are all cool with the above approach before committing to it | 19:20 |
fungi | i definitely don't mind copying assets into containers, the two goals as i saw it were 1. only have one copy of distinct assets in our git repositories, and 2. not cause browsers to grab assets for one service from an unrelated one | 19:20 |
clarkb | ianw: but feel free to approve if this sounds good to you | 19:20 |
clarkb | Sounds like that may be it on this topic? | 19:22 |
ianw | this sounds good, will go through today after breakfast | 19:22 |
clarkb | ianw: thanks. Let me know if I can help with anything too | 19:22 |
clarkb | #topic Gerrit Replication "leaks" | 19:22 |
clarkb | I did more digging into this today. What I found was that there is no indication on the gitea side that gerrit is talking to it (no ssh processes, no git-receive-pack processes and no sockets) | 19:23 |
clarkb | fungi checked the gerrit side and saw that gerrit did think it has a socket open to the gitea | 19:23 |
clarkb | The good news with that is I suspect the no network traffic timeout may actually help us here as a result | 19:24 |
clarkb | Other things I have found include giteas have ipv6 addresses but no AAAA records. This means all replication happens over ipv4. This is a good thing because it appears gitea05 cannot talk to review02 via ipv6 | 19:24 |
clarkb | I ran some ping -c 100 processes between gitea05 and review02 and from both sides saw about a 2% packet loss during one iteration | 19:25 |
clarkb | Makes me suspect something funny with networking is happening but that will need more investigating | 19:25 |
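
The packet loss check amounts to something like the sketch below; the hostnames come from the discussion above, and the parsing assumes iputils-style ping output.

```python
#!/usr/bin/env python3
"""Sketch of the ping-based packet loss check between gitea05 and review02.
Hostnames are taken from the discussion; loss parsing assumes iputils ping."""
import re
import subprocess

def packet_loss(target: str, count: int = 100) -> float:
    """Run ping -c <count> against target and return the loss percentage."""
    proc = subprocess.run(
        ["ping", "-c", str(count), target],
        capture_output=True, text=True,
    )
    match = re.search(r"([\d.]+)% packet loss", proc.stdout)
    return float(match.group(1)) if match else float("nan")

for host in ("gitea05.opendev.org", "review02.opendev.org"):
    print(host, packet_loss(host), "% loss")
```
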
clarkb | Finally we've left 3 leaked tasks in place this morning to see if gerrit eventually handles them itself | 19:26 |
fungi | when looking at the leaked connections earlier, i did notice there was one which was established on the gitea side but not the gerrit side | 19:26 |
clarkb | If necessary we can kill and reenqueue the replication for those but as long as no one complains or notices it is a good sanity check to see if gerrit eventually cleans up after itself | 19:26 |
clarkb | fungi: oh I thought it was the gerrit side that was open but not gitea | 19:26 |
clarkb | or did you see that too? | 19:26 |
fungi | er, yeah might have been. i'd need to revisit that | 19:27 |
clarkb | ok | 19:27 |
fungi | also we had some crazy dos situation at the time, so i sort of stopped digging deeper | 19:27 |
clarkb | Also while I was digging into this a bit more ^ happened | 19:27 |
fungi | conditions could have been complicated by that situation | 19:27 |
clarkb | fungi and I made notes of the details in the incident channel | 19:27 |
fungi | i would not assume they're typical results | 19:27 |
clarkb | should this occur again we've identified a likely culprit and they can be temporarily filtered via iptables on the haproxy server | 19:27 |
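
If it recurs, the temporary filter amounts to a DROP rule on the haproxy host, roughly as sketched below; the source address is a documentation placeholder, not the actual client involved.

```python
#!/usr/bin/env python3
"""Sketch of temporarily filtering an abusive source on the haproxy node.
The address below is a TEST-NET placeholder, not a real client."""
import subprocess

BAD_SOURCE = "198.51.100.7"  # placeholder address for illustration

# insert a temporary DROP rule at the top of the INPUT chain
subprocess.run(
    ["iptables", "-I", "INPUT", "-s", BAD_SOURCE, "-j", "DROP"],
    check=True,
)

# later, once the abuse stops, remove the same rule again
subprocess.run(
    ["iptables", "-D", "INPUT", "-s", BAD_SOURCE, "-j", "DROP"],
    check=True,
)
```
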
clarkb | #topic Scheduling Gerrit Project Renames | 19:29 |
clarkb | Just a reminder that these requests are out there and we said we would pencil in the week of October 11-15 | 19:29 |
clarkb | I'm beginning to strongly suspect that we cannot delete old orgs and have working redirects from the old org name to the new one | 19:30 |
fungi | yeah, the list seems fairly solidified at this point, barring further additions | 19:30 |
fungi | if anyone has repos they want renamed, now's the time to get the changes up for them | 19:30 |
fungi | also we decided that emptying a namespace might cause issues on the gitea side? | 19:30 |
fungi | was there a resolution to that? | 19:30 |
clarkb | And I looked at the rename playbook briefly to see if I could determine what would be required to force update all the project metadata after a rename. I think the biggest issue here is access to the metadata as the rename playbook has a very small set of data | 19:31 |
clarkb | fungi: see my note above. I think it is only an issue if we delete the org | 19:31 |
fungi | ahh, okay | 19:31 |
clarkb | fungi: we won't delete the org when we rename. | 19:31 |
clarkb | I brought it up to try and figure out if we could safely cleanup old orgs but I think that is a bad idea | 19:31 |
fungi | and yes, for metadata the particular concern raised by users is that in past renames we haven't updated issues links | 19:31 |
fungi | so renamed orgs with storyboard links are going to their old urls still | 19:32 |
clarkb | ya for metadata I think where I ended up was that the simplest solution is to make our rename process a two pass system. First pass is the rename playbook. Then we run the gitea project management playbook with the force update flag set to true, but only run it against the subset of projects that are affected by the rename | 19:32 |
fungi | though separately, a nice future addition would be some redirects in apache on the storyboard end (could just be a .htaccess file even) | 19:32 |
clarkb | rather than try and have the rename playbook learn how to do it all at once (because the datastructures are very different) | 19:32 |
clarkb | This two pass system should be testable in the existing jobs we've got for spinning up a gitea | 19:33 |
clarkb | if someone has time to update the job to run the force update after a rename that would be a good addition | 19:33 |
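
A purely illustrative sketch of the two pass flow described above; the playbook filenames, variable names, and project list are hypothetical stand-ins, not the actual system-config playbooks or their real options.

```python
#!/usr/bin/env python3
"""Illustrative-only sketch of the two pass rename flow.
All playbook and variable names below are hypothetical."""
import subprocess

RENAMED_PROJECTS = ["newns/project-one", "newns/project-two"]  # hypothetical

# Pass 1: the existing rename playbook moves the repos to their new names
subprocess.run(
    ["ansible-playbook", "rename-repos.yaml",
     "-e", "rename_file=renames.yaml"],
    check=True,
)

# Pass 2: re-run gitea project management with force update, limited to the
# projects touched by the rename, so descriptions and issue links refresh
subprocess.run(
    ["ansible-playbook", "gitea-project-management.yaml",
     "-e", "force_update=true",
     "-e", "project_filter=" + ",".join(RENAMED_PROJECTS)],
    check=True,
)
```
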
clarkb | Anything else on project renames? | 19:34 |
fungi | nope, the last one went fairly smoothly | 19:34 |
clarkb | #topic InMotion Scale Up | 19:35 |
fungi | we do however need to make sure that all our servers are up so ansible doesn't have a cow man | 19:35 |
clarkb | ++ | 19:35 |
clarkb | last week I fixed leaked placement records in the inmotion cloud which corrected the no valid host found errors there | 19:35 |
clarkb | Then on Friday and over the weekend the cloud was updated to have a few more IPs assigned to it and we bumped up the nodepool max-servers | 19:35 |
clarkb | In the process we've discovered we need to tune that setting for the cloud's abilities better | 19:36 |
clarkb | TheJulia noticed some unittests took a long time and more recently I've found that zuul jobs running there have difficulty talking to npm's registry (though I'm not yet certain this was a cloud issue as I couldn't replicate it from hosts with the same IP in the same cloud) | 19:36 |
clarkb | All this to say please be aware of this and don't be afraid to dial back max-servers if evidence points to problems | 19:37 |
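
The kind of reproduction attempt mentioned above (checking whether a host in the cloud can reach npm's registry) can be as simple as the probe below; the URL, attempt count, and timeout are arbitrary choices for illustration.

```python
#!/usr/bin/env python3
"""Quick connectivity probe against npm's registry; counts and timeouts
are arbitrary illustration values."""
import time
import urllib.request

URL = "https://registry.npmjs.org/"
ATTEMPTS = 20
failures = 0

for attempt in range(ATTEMPTS):
    try:
        # fetch just enough to prove the connection works end to end
        with urllib.request.urlopen(URL, timeout=10) as resp:
            resp.read(1)
    except OSError as exc:  # URLError, timeouts, and connection resets
        failures += 1
        print(f"attempt {attempt}: {exc}")
    time.sleep(1)

print(f"{failures}/{ATTEMPTS} attempts failed")
```
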
fungi | i think yuriys mentioned yesterday adding some datadog agents to the underlying systems in order to better profile resource utilization too | 19:37 |
clarkb | They are very interested in helping us run our CI jobs and I want to support that which I guess means risking a few broken eggs | 19:37 |
fungi | as of this morning we lowered max_servers to 32 | 19:38 |
clarkb | fungi: yup that was one idea that was mentioned. I was ok with it if they felt that was the best approach | 19:38 |
fungi | this morning my time (around 13z i think?) | 19:38 |
clarkb | But thought others might have opinions about using the non-free service (I think they use it internally so are able to parse those metrics) | 19:38 |
fungi | i also suggested to yuriys that he can tweak quotas on the openstack side to more dynamically adjust how many nodes we boot if that's easier for troubleshooting/experimentation | 19:39 |
clarkb | note we can do that too as we have access to set quotas on the project | 19:39 |
clarkb | also 8 was the old stable node count | 19:39 |
fungi | yep, though we also have access to just edit the launcher's config and put it in the emergency list | 19:39 |
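
A hedged sketch of the quota-side knob fungi mentions, using openstacksdk; the cloud and project names are assumptions, and the quota value would be chosen to match whatever capacity is being tested.

```python
#!/usr/bin/env python3
"""Hedged sketch of capping capacity via openstack quotas with openstacksdk.
The clouds.yaml entry name and project name below are assumptions."""
import openstack

# assumes a clouds.yaml entry for the inmotion cloud exists locally
conn = openstack.connect(cloud="inmotion")

project = conn.identity.find_project("opendev-ci")  # hypothetical project name

current = conn.get_compute_quotas(project.id)
print("instances quota is currently", current.instances)

# cap how many instances the nodepool launcher can actually boot
conn.set_compute_quotas(project.id, instances=32)
```
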
clarkb | But ya they seem very interested in helping us so I think it is worth working through this | 19:40 |
clarkb | and it seems like they have been getting valuable feedback too. Hopefully win win for everyone | 19:40 |
ianw | i'm not sure about the datadog things, but it sounds a lot like the stats nodepool puts out via openstackapi anyway | 19:42 |
clarkb | ianw: I think the datadog agents can attach to the various openstack python processes and record things like rabbitmq connection issues and placement allocation problems like we saw | 19:42 |
clarkb | similar to what prometheus theoretically lets us do with gerrit and so on | 19:43 |
clarkb | at the very least I'm willing to experiement with it if they feel it would be helpful. We've always said we can redeploy this cloud if necessary | 19:43 |
clarkb | but if anyone has strong objections definitely let yuriys know | 19:43 |
fungi | yeah, he's around in #opendev and paying attention | 19:44 |
ianw | https://grafana.opendev.org/d/4sdNjeXGk/nodepool-inmotion?orgId=1 is looking a little sad on openstackapi stats anyway | 19:44 |
fungi | at least lately | 19:44 |
clarkb | ianw: hrm is that a bug in our nodepool configs? | 19:44 |
clarkb | or maybe openstacksdk updated again and changed everything? | 19:44 |
ianw | i feel like i've fixed things in here before, i'll have to investigate | 19:45 |
clarkb | #topic Open Discussion | 19:46 |
clarkb | sounds like that may have been it for our last agenda item? Anything else can go here :) | 19:46 |
clarkb | I suspect that zuul will be wanting to do a full restart of the opendev zuul install soon. There are a number of scale out scheduler changes that have landed as well as bugfixes for issues we've seen | 19:47 |
clarkb | We should be careful to do that around the openstack release in a way that doesn't impact them greatly | 19:47 |
fungi | i still need to do the afs01.dfw cinder volume replacement | 19:47 |
fungi | that was going to be today, until git asploded | 19:47 |
clarkb | half related the paste cinder volume seems more stable today | 19:48 |
fungi | good | 19:48 |
clarkb | I know openstack has a number of bugs in its CI unrelated to the infrastructure too so don't be surprised if we get requests to hold instances or help debug them | 19:49 |
clarkb | some of them are as simple as debuntu package does not install reliably :/ | 19:49 |
Guest490 | would a restart later today be okay? | 19:50 |
fungi | also the great setuptools shakeup | 19:50 |
fungi | Guest490: i'm guessing you're corvus and asking about restarting zuul? | 19:51 |
clarkb | I suspect that today or tomorrow are likely to be ok particularly later in the day Pacific time | 19:51 |
fungi | if so, yes seems fine to me | 19:51 |
clarkb | seems we get a big early rush then it tails off | 19:51 |
clarkb | and then next week is likely to be bad for restarts | 19:51 |
clarkb | (I suspect second rcs to roll through next week) | 19:52 |
fungi | should we time the zuul and gerrit restarts together? | 19:52 |
clarkb | fungi: that is an option if we can get the theme updates in | 19:52 |
clarkb | zuul goes quickly enough that we probably don't need to require that though | 19:52 |
Guest490 | yep i am corvus | 19:52 |
fungi | i'm happy to help with restarting it all after the outstanding patches for the gerrit container build and deploy | 19:54 |
fungi | in the meantime i need to start preparing dinner | 19:54 |
clarkb | cool I'll be around too this afternoon as noted in #opendev. And ya I need lunch now | 19:54 |
clarkb | Thanks everyone! feel free to continue conversation in #opendev or on the service-discuss mailing list | 19:55 |
clarkb | #endmeeting | 19:55 |
opendevmeet | Meeting ended Tue Sep 21 19:55:12 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:55 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.html | 19:55 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.txt | 19:55 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2021/infra.2021-09-21-19.01.log.html | 19:55 |
fungi | thanks clarkb! | 19:56 |
*** Guest490 is now known as corvus | 21:55 | |
*** corvus is now known as _corvus | 21:56 | |
*** _corvus is now known as corvus | 21:56 |