clarkb | Almost meeting time | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Feb 14 19:01:06 2023 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/archives/list/service-discuss@lists.opendev.org/thread/VH5EQYJH3E2YKTP3K4IXQI27WRRSEUMR/ Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | It is Service Coordinator nomination time. I've not seen any nominations yet and the time period ends today. I suppose that's a gentle way of saying I should keep doing it? | 19:02 |
fungi | congratudolences! | 19:02 |
clarkb | If no one speaks up indicating their interest before I finish lunch today then I guess I'll make my nomination official after lunch | 19:02 |
clarkb | The other announcement today is that we'll cancel next week's meeting. fungi and I are traveling and busy next Tuesday. That's ~50% of our normal attendance so I think we can just skip | 19:03 |
ianw | ++ | 19:04 |
clarkb | #topic Bastion Host Updates | 19:04 |
clarkb | #link https://review.opendev.org/q/topic:bridge-backups | 19:04 |
clarkb | #link https://review.opendev.org/q/topic:prod-bastion-group Remaining changes are part of parallel ansible runs on bridge | 19:04 |
clarkb | ianw: ^ you should just start nagging me to review that first set of changes. I keep putting it off due to distractions. I have been doing zuul reviews today and when I get tired of those I should do some opendev reviews too | 19:05 |
clarkb | (zuul early day due to overlap with europe is good then opendev late day due to overlap with au is good :) ) | 19:05 |
ianw | :) i should loop back on the parallel stuff too | 19:05 |
ianw | it probably needs remerging etc. | 19:06 |
clarkb | are there any other bastion concerns? | 19:06 |
ianw | it being a jammy host it hits | 19:07 |
ianw | #link https://review.opendev.org/c/opendev/system-config/+/872808 | 19:07 |
ianw | with the old apt-key config | 19:07 |
ianw | that's all i can think of | 19:07 |
clarkb | oh I had a question about that which i guess I didn't post on the review directly (my bad) | 19:07 |
clarkb | specifically how new of an apt do we need to support that method of key trust | 19:07 |
clarkb | We would need it to work on bionic and newer iirc | 19:08 |
ianw | i think 1.4 which is >=bionic | 19:08 |
clarkb | and I guess reverting and making it distro release specific isn't too terrible either | 19:08 |
clarkb | I'll do a quick rereview after the meeting since i didn't record my previous thoughts properly | 19:08 |
clarkb | #topic Mailman 3 | 19:10 |
clarkb | We're still poking at the site creation stuff last I saw, but there was one other thing that had a change to address it | 19:11 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/873337 Fix warnings about missing migrations | 19:11 |
fungi | my testing on the previous held node was probably invalid because of the lingering db migration issue which i think is what also resulted in my container restart errors | 19:11 |
fungi | i thought i had approved 873337 already but i guess not | 19:12 |
fungi | approved now | 19:12 |
fungi | i'll get a new held node once we have new images | 19:12 |
clarkb | sounds good | 19:12 |
fungi | or i guess that fix won't need new images | 19:12 |
fungi | so i can recheck the dnm change as soon as that merges | 19:12 |
clarkb | Any other mailman related items? I think we've managed to chip away at most of it other than the site creation to fix vhosting (which makes sense since that is the complicated bit with db migrations) | 19:12 |
fungi | i don't have anything else, no. i still haven't had time to wrap my head around creating new sites with django migrations | 19:13 |
clarkb | #topic Gerrit Updates | 19:13 |
clarkb | #link https://review.opendev.org/c/openstack/project-config/+/867931 Cleaning up deprecated copy conditions in project ACLs | 19:13 |
clarkb | This update has been announced. I think we can probably land it whenever we're confident jeepyb is happy (and last I saw it was working?) | 19:14 |
clarkb | ianw: ^ not sure if you had a specific plan for that one | 19:14 |
ianw | i think it might need to be manually applied as it will take longer than the job timeout | 19:14 |
clarkb | ianw: jeepyb will only update those 90ish repos? Oh except those files might be used in more than 90 repos | 19:15 |
clarkb | if we think increasing the timeout would work that seems fine, otherwise manual application also seems fine. | 19:15 |
fungi | we could chunk the change up into batches i guess, but manual manage-projects run seems fine to me | 19:16 |
ianw | it's *probably* ok, but still. might be an idea to 1) put in emergency 2) down gerrit 3) run a manual backup run 4) up gerrit 5) manually apply? 6) remove from emergency? | 19:16 |
clarkb | ianw: do we think we need a backup like that? The acls are all in git so theoretically we can just revert them if necessary | 19:17 |
clarkb | mostly just thinking that I'm not sure a gerrit downtime is necessary | 19:18 |
ianw | i could go either way; i was just thinking it's an unambiguous snapshot | 19:19 |
clarkb | I think I'm willing to trust the acl system's historical record here. We've relied on it in the past and can continue to do so | 19:19 |
ianw | i guess we have now double-checked all the acl files, and gerrit shouldn't let us merge anything it doesn't like | 19:19 |
fungi | right, as long as we merge it through gerrit rather than behind its back, the worst that should happen is manage-projects throws errors and we can't create new projects or update existing ones for a little while until we sort it out | 19:20 |
fungi | or would have to take manual action in order to do so at least | 19:21 |
clarkb | I guess doing a canary change with a smaller set of updates might be good if we're worried about getting syntax wrong etc | 19:21 |
clarkb | but ya I think a downtime for backups is overkill given gerrit's builtin checks and record keeping | 19:21 |
fungi | technically the syntax is already checked by the manage-projects test in our gerrit job, right? | 19:22 |
clarkb | fungi: yes, but using the rules we try to interpret from gerrit not gerrit itself | 19:22 |
clarkb | and this is a new set of rules so possible we got it wrong | 19:22 |
fungi | oh, i guess i thought we had an integration test creating a project in gerrit | 19:23 |
clarkb | not using our production acls | 19:23 |
clarkb | that would take too long to run probably unfortunately | 19:23 |
fungi | and the acl change doesn't update the test acl to match? | 19:23 |
clarkb | I don't think so. | 19:24 |
fungi | i didn't think to check that myself | 19:24 |
clarkb | they are decoupled. We test jeepyb + gerritlib in that bubble. We test our deployment of gerrit in system-config. And then we do simple linter type checks in project-config | 19:24 |
clarkb | this change is to project-config and doesn't impact the others | 19:24 |
ianw | i don't think we set any conditions in system-config, but we can | 19:25 |
clarkb | ya maybe that is better than doing a canary change | 19:25 |
clarkb | just to make sure we get the general syntax correct | 19:25 |
clarkb | But ya I'm not too worried about it given the ability to rollback etc | 19:26 |
clarkb | There were two other Gerrit related items | 19:26 |
ianw | ok, will do, and then i'll plan to apply it manually just to really watch it closely and because of timeouts, but with no downtime | 19:26 |
clarkb | Both of which i've put on the Gerrit community meeting agenda for March 2 (8am pacific) | 19:26 |
clarkb | The first is Java 17 support. I have a change up to swap us to java 17 which works in our CI jobs. But you have to use an ugly java cli option to make it happen which seems at odds with their full compatibility statement | 19:27 |
clarkb | I'm hoping to get a better sense of that support in the community meeting and if that's the path forward I guess we roll with it | 19:27 |
clarkb | The other is the ssh connectivity problem with channel tracking | 19:27 |
clarkb | ianw: has been digging into this quite a bit and I think discovered a bug in the upstream implementation of channel tracking. That doesn't explain why ssh is unhappy though right? just that fixing the bug will get us better information when those cases happen? | 19:28 |
ianw | so i think the bug means that the workaround committed was actually not doing anything | 19:29 |
clarkb | #link https://github.com/apache/mina-sshd/issues/319 Gerrit SSH issues with flaky networks. | 19:29 |
ianw | #link https://gerrit-review.googlesource.com/c/gerrit/+/358314 | 19:29 |
clarkb | ianw: "the workaround" is to disable channel tracking? | 19:29 |
ianw | specifically it's https://gerrit-review.googlesource.com/c/gerrit/+/238384 | 19:30 |
ianw | what that is supposed to do is track when a ssh channel is opened in a variable | 19:30 |
ianw | then, if an unhandledchannelerror is raised by mina, it looks at what channel it was, and if that channel has been opened before, basically ignores it | 19:31 |
clarkb | ah but since it wasn't tracking it that error propagates. So your fix may be the actual fix too | 19:31 |
ianw | right, the "track when opened" was never running because it wasn't registered to receive the channel open events | 19:32 |
clarkb | In that case I can use the community meeting to beg for reviews if they haven't landed it by then | 19:32 |
clarkb | :) | 19:32 |
ianw | so ... it's a fix ... but it doesn't really seem to answer any questions of what's going on | 19:32 |
clarkb | which is why your extra logging change remains to collect that info and hopefully debug the underlying situation | 19:33 |
ianw | #link https://gerrit-review.googlesource.com/c/gerrit/+/357694 | 19:33 |
ianw | right, yeah that change has logging for basically every channel event. but i'm not sure how much it helps now -- since we would be getting log messages when the channel is initialized from the prior change, which was mostly what we were interested in | 19:34 |
ianw | i don't know. i think maybe merge the "fix" and just move on with life and don't think too hard about it :) | 19:35 |
clarkb | works for me. I'll bring it up with gerrit if we don't manage to make progress before the meeting | 19:35 |
ianw | something is still not quite right in mina I think, but this probably isn't the context to find it | 19:36 |
clarkb | #topic Upgrading Servers | 19:37 |
clarkb | I'm trying to pick this up again and have begun looking at the gitea backends | 19:38 |
clarkb | A couple of things make this easier than I feared and one thing makes this painful :) | 19:38 |
clarkb | We control gitea ansible group independently of what servers haproxy load balances to and gerrit replicates to. This means we can pretty easily spin up a new gitea on a new server running with a bunch of empty git repos | 19:39 |
clarkb | Then when we are happy with the state of the server add it to gerrit replication, force gerrit to replicate everything to that server, then wait | 19:39 |
clarkb | Then add the server to haproxy and probably remove an old server. Repeat in a loop | 19:39 |
clarkb | What makes this painful/difficult is ensuring gitea state is what we want it to be. Specifically for redirects | 19:40 |
clarkb | I poked around in a held gitea test node's db yesterday and I think we can construct the redirects from scratch given info we have, but one thing that complicates that is we need to create gitea orgs that don't exist in projects.yaml | 19:40 |
clarkb | essentially leading me to realize that bootstrapping that all from an empty state is probably more effort than necessary right now (though a noble exercise and maybe one we should get around to eventually) | 19:41 |
clarkb | instead I think we should stop gitea after the initial bring up then replace its db with a prod db | 19:41 |
clarkb | er replace its fresh db with a copy of a prod db from an old host | 19:41 |
clarkb | that will bring over the other orgs and redirects in theory. | 19:42 |
clarkb | What I'm concerned about doing this is that maybe we'll end up with stuff missing on disk. But since we never have to put the server into a public facing capacity until we are happy with it I think we just do that and see if it works | 19:42 |
clarkb | Looking at my current calendar and todo list maybe I can spin up that new server tomorrow, get it deployed as a blank gitea then start attempting to make it a prod like gitea on Thursday | 19:43 |
clarkb | For things that are not gitea we have etherpad, nameservers, static, mirrors, and jitsimeet. Of those I think etherpad and nameservers are the priorities | 19:44 |
ianw | this is all because in the past we've made the gitea projects as usual via the api, but then they've been moved, which we've also done via gitea, which has internally applied db updates to reflect this on its instance, but when we're starting a new host we have no way of capturing this (at the moment, at least), right? | 19:44 |
clarkb | ianw: correct. We have a repo that captures the renames at that point in time but there is no tooling to apply that to gitea as a set of old orgs and redirects | 19:44 |
clarkb | I suppose as an alternative we could do inplace server upgrades. But I like to avoid those when we can | 19:45 |
ianw | it is always nice to validate we can start fresh | 19:45 |
clarkb | For the other servers I'm thinking etherpad and nameservers are the other priorities. In particular I had some notes about doing the nameservers but am not really confident in the process for that. If anyone has time to think that through and write out a small plan that would be appreciated | 19:46 |
clarkb | I suspect I'm overcomplicating the effort to update the nameservers in my head | 19:46 |
clarkb | and yes help much appreciated. Thanks for all the help so far too | 19:46 |
ianw | ++ i can have a look at nameservers | 19:49 |
clarkb | #topic Quo vadis Storyboard | 19:49 |
clarkb | This topic like the service has become a victim of a lack of time | 19:50 |
clarkb | I don't have anything new here. But maybe we should have a meeting dedicated to this in order to create a forcing function to spend time on it | 19:50 |
clarkb | I'd suggest the PTG but the PTG conflicts with spring break around here so I'm trying to limit my PTG commitments :) | 19:51 |
clarkb | But maybe a higher bw call type setup the week before PTG or something? | 19:51 |
clarkb | Let me get through next week's travel and then try to put something together for that | 19:52 |
clarkb | #topic Open Discussion | 19:52 |
clarkb | As mentioned at the beginning of the meeting I'll make my service coordinator nomination official in an hour or so after lunch assuming no one beats me to it | 19:53 |
clarkb | Zuul's sqlalchemy 2.0 change merged earlier today. I may try to kick off a zuul restart sooner than the regularly scheduled weekend restart just to get that checked more quickly | 19:53 |
clarkb | anything else? | 19:56 |
ianw | not from me, thanks! | 19:57 |
clarkb | Thank you everyone for your time during this meeting but also for contributing to OpenDev. We'll skip next week's meeting and be back here in two weeks | 19:57 |
clarkb | #endmeeting | 19:58 |
opendevmeet | Meeting ended Tue Feb 14 19:58:01 2023 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:58 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.html | 19:58 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.txt | 19:58 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2023/infra.2023-02-14-19.01.log.html | 19:58 |
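
The channel-tracking discussion in the Gerrit Updates topic (19:30 to 19:32) can be illustrated with a minimal Python sketch. This is not Gerrit or MINA SSHD code: every class and function name here is hypothetical, and it only models the behaviour as ianw describes it, namely that a tracker which is never registered for channel-open events records nothing, so the "ignore errors for channels we saw open" workaround never takes effect.

```python
class ChannelTracker:
    """Hypothetical listener that remembers which channels were opened."""

    def __init__(self):
        self.opened = set()

    def channel_opened(self, channel_id):
        self.opened.add(channel_id)


class Session:
    """Hypothetical SSH session that notifies registered listeners."""

    def __init__(self):
        self.listeners = []

    def add_channel_listener(self, listener):
        self.listeners.append(listener)

    def open_channel(self, channel_id):
        # Only listeners that were registered hear about the open event.
        for listener in self.listeners:
            listener.channel_opened(channel_id)


def handle_unhandled_channel_error(tracker, channel_id):
    """Ignore the error only if the tracker saw this channel being opened."""
    return "ignored" if channel_id in tracker.opened else "propagated"


def simulate(register_listener):
    session = Session()
    tracker = ChannelTracker()
    if register_listener:
        # The step described as missing: registering for channel open events.
        session.add_channel_listener(tracker)
    session.open_channel(7)
    return handle_unhandled_channel_error(tracker, 7)
```

Calling `simulate(False)` models the situation described as the bug (the error propagates even though the channel was opened), while `simulate(True)` models the intended workaround once the tracker is registered.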