Tuesday, 2022-08-23

clarkbmeeting time. We'll get started in a couple of minutes18:59
clarkb#startmeeting infra19:01
opendevmeetMeeting started Tue Aug 23 19:01:10 2022 UTC and is due to finish in 60 minutes.  The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot.19:01
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.19:01
opendevmeetThe meeting name has been set to 'infra'19:01
clarkb#link https://lists.opendev.org/pipermail/service-discuss/2022-August/000355.html Our Agenda19:01
clarkb#topic Announcements19:01
ianwo/19:01
clarkbThe service coordinator nomination period ended last week. I saw only one nomination; the one for myself. I guess that means I'm it by default again19:02
frickler\o19:02
clarkb#link https://releases.openstack.org/zed/schedule.html OpenStack Feature Freeze begins next week19:02
clarkbHeads up that openstack is about to enter the typically most crazy portion of its release cycle19:03
clarkbthough the last few have been pretty chill comparatively19:03
ianwlong live the king^W service-coordinator! :)19:03
clarkbAnd finally I'll be AFK tomorrow. Back thursday19:04
fungioff celebrating your reelection... i mean drowning your sorrows?19:04
clarkblooking for salmon swimming up the columbia river19:04
clarkband coincidentally escaping the heat at home19:05
fungiso that's a yes19:05
clarkbha19:05
clarkb#topic Bastion Host Updates19:05
clarkbTime to dive in19:05
clarkbianw: one thing that occurred to me is that the recent Zuul auto upgrades should've deployed your fixes for the console log file leaks?19:06
clarkbI think those changes landed19:06
clarkbIf that is the case should we go ahead and manually clear those files out of bridge and static as they should leak far less quickly now?19:06
ianwumm, yes, i think this weekend actually should have deployed the file deletion in /tmp 19:07
ianwi'll double check, restart the zuul_console on the static nodes and cleanup the tmps, then make a note to go back and check19:07
corvuswe're probably testing the backwards compat now, until the daemon is killed19:08
clarkbianw: just be careful to not delete the files that will be automatically deleted. Might need to use an age filter when deleting19:08
ianwlast weekend it deployed just the initial changes, which broke xenial/centos-7 because it used 3-era f-strings19:08
clarkbcorvus: ianw: oh right we need to have it start a new one.19:08
ianwin related news19:09
ianw#link https://review.opendev.org/q/topic:stream-2.7-container19:09
clarkbbut also we block port 19885 so the zuul cluster doesn't succeed at getting the logs. I wonder if we should also look at just not trying to run it at all on bridge (I think we need to be very careful exposing those logs directly)19:09
ianwruns the console streamer in a python 2.7 environment to do ultimate backwards compat testing19:09
clarkbsounds like progress though. Thank you for looking at that19:09
corvusso opendev doesn't actually need / will not use these changes?19:10
clarkbcorvus: it does use them because the console log stuff runs there and leaks the files. But we don't currently expose the results on the live stream through zuul finger protocol19:10
corvusi mean, we could have just stopped running the log daemon?  or aggressively pruned?19:11
clarkbcorvus: maybe? I think zuul will continue to try and fetch the data but I guess the firewall blocking the port and not having anything listening on the port are roughly equivalent from that perspective (sorry this just occurred to me)19:12
corvusi'm asking partly for curiosity, but also trying to get a handle on whether this is actually getting tested19:12
clarkbcorvus: it will be tested19:12
corvusit sounds like opendev is not going to be a robust test of this feature?19:12
clarkbbut ya it may not be a complete test19:12
fungithough also worth revisiting whether we might be comfortable streaming such console logs in the future19:13
clarkbfungi: yup that too19:13
corvusthanks, it's good to know the limitations of opendev's production testing of new features like this.19:13
clarkbcorvus: I think the regular jobs will exercise this pretty well too fwiw. Since it happens for all the jobs19:13
clarkbI think where our gap may be is how much we'll leak due to aborted jobs and the like?19:14
ianwthere was also some work from several years ago to tunnel the console logs over a unix socket over ssh19:14
corvusif we don't allow connections on 19885 then will anything be deleted from bridge?19:14
fungiwe put a lot of belts and suspenders in place early on because we were unsure of the security of some solutions, but now we've had time to evaluate things in a production scenario and could make better (informed by observed data and experience) decisions19:14
clarkboh the entire protocol happens over 19885? yes I think that is correct. This may not delete anything on bridge I guess :/19:14
ianwhrm, that is a wrinkle, it does now send a message "i've finished with this, remove it"19:15
corvusi suggested that zuul-console should have a periodic deletion as a backstop.  did that get implemented?19:15
clarkbfungi: we would need to review all of the log files we produce to double check them for leaked sensitive info. Address any such leaks, then remember to not add new ones19:15
fungiwhat's the link to the review for the feature which got merged?19:16
clarkbcorvus: I don't think so19:16
clarkbfungi: https://review.opendev.org/c/zuul/zuul/+/85027019:16
fungithanks19:16
clarkbcorvus: I think the python 2.7 testing has become the next focus there to avoid regressions on the ansible target end19:17
ianwi haven't implemented cleanup, but i have expanded the docs to talk about it explicitly19:17
ianwthat is19:18
ianw#link https://review.opendev.org/c/zuul/zuul/+/851942/19:18
ianwwhich could use a review19:18
corvusso how is bridge going to get cleaned up?19:18
clarkbcorvus: it won't, but we only just realized that19:18
clarkbwe'd need to stop running the console streamer on bridge or implement the periodic cleanups.19:19
corvusianw: do you plan on implementing periodic cleanup in zuul-console, or separately?19:20
corvus(by separately, i mean just a cron job on bridge?)19:20
ianwi don't have immediate plans to work on adding it to zuul-console19:21
corvusdid this make it into a zuul release?19:21
fungilooks like no19:22
fungii don't see a tag which contains df3f9dcd30a13232447b3be67c7845c51cb527a0 in its history19:22
fungiso it could be easily reverted19:22
ianwi'm not sure anything needs to be reverted19:23
corvusokay.  i think further discussion can happen in #zuul, but given that the feature won't be used for the intended use case, it's probably worth considering19:23
clarkbya we've got a few more topics to get to here. probably best to pick this back up in the zuul room on matrix19:23
clarkbreally quickly before the next topic, ianw any venv on bridge changes needing review yet?19:24
fungiyeah, just to confirm, it merged 9 days ago, and the most recent release (6.2.0) was 10 days before that19:24
ianwno sorry, i have some local work i haven't pushed yet19:24
clarkbno problem. Just want to make sure I'm not missing any important changes to review19:24
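The age filter and the cron-style backstop raised in this topic could look roughly like the following. This is a minimal sketch, not what zuul-console or OpenDev actually runs: the `console-*.log` file name pattern, the seven-day threshold, and both function names are assumptions made for illustration.

```python
import fnmatch
import os
import time


def stale_files(directory, pattern="console-*.log", max_age_days=7, now=None):
    """Return matching regular files whose mtime is older than the cutoff.

    The pattern and the default age are illustrative assumptions.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in os.scandir(directory):
        # Skip directories and symlinks; only plain files are candidates.
        if not entry.is_file(follow_symlinks=False):
            continue
        if not fnmatch.fnmatch(entry.name, pattern):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            stale.append(entry.path)
    return sorted(stale)


def remove_stale(directory, dry_run=True, **kwargs):
    """Delete stale files unless dry_run is set; return the paths selected."""
    paths = stale_files(directory, **kwargs)
    if not dry_run:
        for path in paths:
            os.remove(path)
    return paths
```

Running `remove_stale("/tmp")` with the default `dry_run=True` only lists what would be removed, which leaves recent files alone so that anything the streamer will delete on its own is not touched.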
clarkb#topic Updating Bionic Servers to Focal or Jammy19:25
clarkbI don't think there is anything new on this front19:25
clarkbBut I think we are generally ready to deploy to jammy for new or replacement things.19:25
clarkbJammy just got its .1 release too19:25
clarkbwhich is when they open up the in place upgrade path for focal installations. A good indication upstream thinks it is ready too19:26
clarkb#link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done.19:26
clarkbFeel free to add any additional bits of info to that etherpad as we start to take this on19:26
clarkb#topic Mailman 319:27
clarkb#link https://review.opendev.org/c/opendev/system-config/+/85124819:27
clarkbI think the deployment is largely there now. Plan is to start testing migration of actual lists ~Thursday19:27
clarkbReviews definitely welcome at this point. I've still got it marked WIP but it has congealed into something that looks mergeable now19:28
fungii may be able to try out some of the migration tools on the held node tomorrow19:28
fungithanks for putting the deployment together!19:28
clarkbThere is also a held node at 198.72.124.71 if anyone wants to poke at it19:28
clarkbyou're welcome19:29
clarkbImportantly I think I've got the native vhosting working19:29
clarkbwhich means we don't have any regressions from our mm2 vhosting behavior19:29
clarkband the rest api seems to be sufficient for the management we need to do.19:30
clarkbFor downsides mm3 is a significantly more complicated piece of software built on django with a database and all that. But it shouldn't be too bad19:30
clarkbAnyway feel free to poke at the held node and leave review comments. I'll do my best to catch up on that after tomorrow. Previous investigation has been helpful in improving the deployment19:31
clarkb#topic Gitea 1.17 Upgrade19:31
clarkb#link https://review.opendev.org/c/opendev/system-config/+/847204 1.17.1 out, time to schedule the upgrade19:31
clarkbGitea 1.17.1 is out. That change has been updated to deploy 1.17.1.19:31
clarkbI think we can upgrade whenever we are comfortable doing so with the openstack release schedule and so on19:32
fungiyeah, the list of changes didn't look too risky for us19:32
clarkbBig changes for gitea to pay attention to: main is the default branch for new projects. Testing was updated to ensure we continue to create master by default to match gerrit and jeepyb19:32
clarkbalso they added a package repos feature that had a bunch of bugs in the .0 release. We intentionally disable it in part due to our distributed cluster not having shared storage, but also because we likely don't have sufficient storage for it19:33
clarkbIf I can get reviews I'm happy to babysit that upgrade on thursday when I get back19:33
ianw++ will look19:33
clarkb#topic Gerrit Load Issues19:34
clarkbLast week a couple of times around 08:00 UTC Gerrit got busy and stopped accepting http requests19:34
clarkbI believe that Gerrit itself was still running it just exhausted its thread pool which caused it and apache to return 500 errors19:35
clarkbIn response to that we've bumped up our http thread count above the value for ssh+http git request threads. The idea being we can still have a responsive web ui and rest api if the git side of gerrit is busy19:35
clarkbLet's keep an eye on it and evaluate if further changes need to be made based on the new behavior with this config update19:36
clarkbDuring debugging and response to this a few busy IPs were blackholed using the linux route table.19:36
clarkbOne class of blockage was jenkins servers using the gerrit trigger plugin because they make a request to /plugins/events-log/ every 2 seconds which 404s19:37
clarkb#link https://github.com/jenkinsci/gerrit-trigger-plugin/pull/470 trying to make the gerrit trigger plugin less noisy19:37
clarkbI've made that pull request trying to improve the plugin to be less noisy as Jenkins is still reasonable for third party CI19:37
clarkbI did unblock an IP because the users noticed19:37
clarkbAlso this pointed out that infra-root may use different approaches to block traffic to a server.19:38
clarkbThe first place I always look for network traffic blockages is the firewall. I was very confused when I couldn't find iptables rules for this.19:38
clarkbI don't think we need to solve this in this meeting but it would be good if we are consistent in applying those temporary ephemeral blockages on our servers. We should decide on a method either iptables or ip route and stick to it19:39
clarkbMaybe give that some thought over the next week and we can discuss it in next week's meeting19:39
clarkbAnd finally if we have to make additional changes to Gerrit to address this let's try to make them as slowly as is reasonable so that we can measure their impact and avoid negatively impacting openstack feature freeze (a time when gerrit tends to be busy)19:40
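For comparing the two blocking methods mentioned in this topic, the commands involved can be sketched as below. This is only an illustration under stated assumptions: the helper name `block_commands` is hypothetical, the addresses are documentation-range examples, and inserting the rule at the top of the `INPUT` chain is one possible choice rather than an agreed OpenDev convention. The function only builds the argument lists and does not run anything.

```python
import ipaddress


def block_commands(address, method):
    """Build the argv for a temporary block of one address.

    method is "route" (blackhole route) or "iptables" (DROP rule).
    Raises ValueError for an invalid address or unknown method.
    """
    ip = ipaddress.ip_address(address)
    v6 = ip.version == 6
    # A single host is a /32 for IPv4 and a /128 for IPv6.
    prefix = f"{ip}/{128 if v6 else 32}"
    if method == "route":
        family = ["-6"] if v6 else []
        return ["ip"] + family + ["route", "add", "blackhole", prefix]
    if method == "iptables":
        binary = "ip6tables" if v6 else "iptables"
        return [binary, "-I", "INPUT", "-s", prefix, "-j", "DROP"]
    raise ValueError(f"unknown method: {method}")
```

The practical difference is where a later reader would look: a blackhole route shows up in `ip route show`, while a DROP rule shows up in the iptables listing, which is the mismatch described above.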
clarkb#topic Jaeger Tracging Server19:42
clarkb#undo19:42
opendevmeetRemoving item from minutes: #topic Jaeger Tracging Server19:42
clarkb#topic Jaeger Tracing Server19:42
clarkb#link https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html Zuul spec for OpenTelemetry support19:42
clarkbZuul will be growing support for opentelemetry tracing19:42
clarkbThe question is whether or not OpenDev should have a Jaeger server to help test this and take advantage of the functionality19:42
clarkbmy initial thought was maybe we can colocate this with the prometheus server I have on my todo list. But then immediately decided keeping them separate to reduce difficulty of OS upgrades etc is better19:43
clarkbI think from an operational standpoint zuul's logs have been pretty good and we've been able to debug problems tracing through logs. frickler and I did that recently with that github repo's unexpected behavior19:44
fungioase os upgrades would mater for containerized prometheus and jaeger deployments?19:44
fungier, base os upgrades i mean19:44
corvusjaeger would be containerized19:44
clarkbfungi: typically we do OS upgrades by spinning up new hosts and both prometheus and jaeger appear to store data in databases of some sort that would need to be migrated19:45
fricklerI must admit I have no idea what jaeger does. what kind of additional data would we get from it?19:45
clarkbkeeping things separate will likely simplify things and there isn't a strong reason to collocate19:45
fungiopentelemetry tracing data19:45
ianwclarkb: not that different to graphite though?  that has a db that needs to be moved if we update19:46
clarkbfrickler: it's like fancy log data. But instead of raw text it goes into a db and there is tooling to render it nicely with timings and so on19:46
fungifrickler: basically timing and sequencing of events19:46
corvusi think the utility here would be a marginal improvement to infra-root's ability to debug user-facing issues; potentially a significant improvement in ability for users to self-diagnose; and a benefit to the zuul project in being able to fully demonstrate and collaborate on the reference zuul implementation that opendev runs19:46
clarkbcorvus: user exposure is a good point since the tracing data is far more sanitized than the raw logs19:46
corvusi plan on adding a sample deployment to the zuul quickstart system, so it's not a big deal for me to port that over to opendev.19:46
fungi#link https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html19:47
fungithat might be a missing bit of context19:47
corvusclarkb: yep, i expect this to be fully user-safe19:47
fungier, i guess clarkb already linked the zuul spec19:47
clarkbthe zuul event id in our logs is a rudimentary form of tracing19:47
clarkbyou can think of it like grep 'eventid' /var/log/zuul/ across all the zuul machines for curated info19:48
corvusjaeger can store data locally on the filesystem19:48
corvusso i'm imagining a really simple self-contained jaeger server.  and it's okay if we lose the data.19:49
clarkbno opposition from me.19:49
clarkbparticularly now that I've realized it can help users debug or at least better understand unexpected zuul behavior19:49
ianwfor mine, we have such good pre-production system-config testing, as long as a service is working in with that I don't see any reason not to just bring it up19:49
clarkbthat's new functionality that would be useful19:49
fungifrickler: anyway, the introduction to that spec outlines the potential benefits of including an interface to that information in our deployment19:50
corvusgiven the simplicity of that (self-contained, ephemeral, no security issues) i thought maybe i could just propose the change to implement it and we can review it there (i'll be doing the work regardless, so i can accept the risk that we run into a blocker in review)19:50
fungisounds good to me19:50
clarkbya a spec may be overkill given the low risk of deploying it19:50
clarkbworst case we just turn it off and then write a spec :)19:50
corvusclarkb: yep19:51
corvusokay, sounds like once i'm ready, i'll propose an implementation change19:52
clarkbcorvus: and probably good to target jammy as the base at this point.19:52
corvusack19:53
clarkbSounds like that may be it for this topic?19:53
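The "grep 'eventid' across all the zuul machines" analogy from this topic can be sketched as a small script that gathers every line for one event and orders the lines by time. This is a rough illustration only: the timestamp layout and the `[e: <id>]` tag in `LINE_RE` are assumptions about the log format, and `trace_event` and the host names are hypothetical.

```python
import re
from datetime import datetime

# Assumed line shape: "<date> <time>,<ms> LEVEL logger: [e: <id>] message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*\[e: ([0-9a-f]+)\]"
)


def trace_event(event_id, sources):
    """Collect lines tagged with event_id from several hosts, oldest first.

    sources maps a host name to an iterable of log lines.
    Returns a list of (timestamp, host, line) tuples.
    """
    hits = []
    for host, lines in sources.items():
        for line in lines:
            match = LINE_RE.match(line)
            if match and match.group(2) == event_id:
                stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
                hits.append((stamp, host, line.rstrip("\n")))
    hits.sort()
    return hits
```

A tracing backend does the same correlation without the grep, and adds the span timings that raw log lines do not carry.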
clarkb#topic Open Discussion19:54
clarkbAnything else?19:54
funginothing from me19:54
clarkbIf that is all then thank you everyone. We can end here19:56
clarkbWe'll be back next week at the same time and location19:57
clarkb#endmeeting19:57
opendevmeetMeeting ended Tue Aug 23 19:57:07 2022 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)19:57
opendevmeetMinutes:        https://meetings.opendev.org/meetings/infra/2022/infra.2022-08-23-19.01.html19:57
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-08-23-19.01.txt19:57
opendevmeetLog:            https://meetings.opendev.org/meetings/infra/2022/infra.2022-08-23-19.01.log.html19:57
fungithanks clarkb!19:57
