clarkb | meeting time. We'll get started in a couple of minutes | 18:59 |
---|---|---|
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Aug 23 19:01:10 2022 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link https://lists.opendev.org/pipermail/service-discuss/2022-August/000355.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
ianw | o/ | 19:01 |
clarkb | The service coordinator nomination period ended last week. I saw only one nomination; the one for myself. I guess that means I'm it by default again | 19:02 |
frickler | \o | 19:02 |
clarkb | #link https://releases.openstack.org/zed/schedule.html OpenStack Feature Freeze begins next week | 19:02 |
clarkb | Heads up that openstack is about to enter the typically most crazy portion of its release cycle | 19:03 |
clarkb | though the last few have been pretty chill comparatively | 19:03 |
ianw | long live the king^W service-coordinator! :) | 19:03 |
clarkb | And finally I'll be AFK tomorrow. Back thursday | 19:04 |
fungi | off celebrating your reelection... i mean drowning your sorrows? | 19:04 |
clarkb | looking for salmon swimming up the columbia river | 19:04 |
clarkb | and coincidentally escaping the heat at home | 19:05 |
fungi | so that's a yes | 19:05 |
clarkb | ha | 19:05 |
clarkb | #topic Bastion Host Updates | 19:05 |
clarkb | Time to dive in | 19:05 |
clarkb | ianw: one thing that occurred to me is that the recent Zuul auto upgrades should've deployed your fixes for the console log file leaks? | 19:06 |
clarkb | I think those changes landed | 19:06 |
clarkb | If that is the case should we go ahead and manually clear those files out of bridge and static as they should leak far less quickly now? | 19:06 |
ianw | umm, yes, i think this weekend actually should have deployed the file deletion in /tmp | 19:07 |
ianw | i'll double check, restart the zuul_console on the static nodes and cleanup the tmps, then make a note to go back and check | 19:07 |
corvus | we're probably testing the backwards compat now, until the daemon is killed | 19:08 |
clarkb | ianw: just be careful to not delete the files that will be automatically deleted. Might need to use an age filter when deleting | 19:08 |
ianw | last weekend it deployed just the initial changes, which broke xenial/centos-7 because it used 3-era f-strings | 19:08 |
clarkb | corvus: ianw: oh right we need to have it start a new one. | 19:08 |
ianw | in related news | 19:09 |
ianw | #link https://review.opendev.org/q/topic:stream-2.7-container | 19:09 |
clarkb | but also we block port 19885 so the zuul cluster doesn't succeed at getting the logs. I wonder if we should also look at just not trying to run it at all on bridge (I think we need to be very careful exposing those logs directly) | 19:09 |
ianw | runs the console streamer in a python 2.7 environment to do ultimate backwards compat testing | 19:09 |
clarkb | sounds like progress though. Thank you for looking at that | 19:09 |
corvus | so opendev doesn't actually need / will not use these changes? | 19:10 |
clarkb | corvus: it does use them because the console log stuff runs there and leaks the files. But we don't currently expose the results on the live stream through zuul finger protocol | 19:10 |
corvus | i mean, we could have just stopped running the log daemon? or aggressively pruned? | 19:11 |
clarkb | corvus: maybe? I think zuul will continue to try and fetch the data but I guess the firewall blocking the port and not having anything listening on the port are roughly equivalent from that perspective (sorry this just occurred to me) | 19:12 |
corvus | i'm asking partly for curiosity, but also trying to get a handle on whether this is actually getting tested | 19:12 |
clarkb | corvus: it will be tested | 19:12 |
corvus | it sounds like opendev is not going to be a robust test of this feature? | 19:12 |
clarkb | but ya it may not be a complete test | 19:12 |
fungi | though also worth revisiting whether we might be comfortable streaming such console logs in the future | 19:13 |
clarkb | fungi: yup that too | 19:13 |
corvus | thanks, it's good to know the limitations of opendev's production testing of new features like this. | 19:13 |
clarkb | corvus: I think the regular jobs will exercise this pretty well too fwiw. Since it happens for all the jobs | 19:13 |
clarkb | I think where our gap may be is how much we'll leak due to aborted jobs and the like? | 19:14 |
ianw | there was also some work from several years ago to tunnel the console logs over a unix socket over ssh | 19:14 |
corvus | if we don't allow connections on 19885 then will anything be deleted from bridge? | 19:14 |
fungi | we put a lot of belts and suspenders in place early on because we were unsure of the security of some solutions, but now we've had time to evaluate things in a production scenario and could make better (informed by observed data and experience) decisions | 19:14 |
clarkb | oh the entire protocol happens over 19885? yes I think that is correct. This may not delete anything on bridge I guess :/ | 19:14 |
ianw | hrm, that is a wrinkle, it does now send a message "i've finished with this, remove it" | 19:15 |
corvus | i suggested that zuul-console should have a periodic deletion as a backstop. did that get implemented? | 19:15 |
clarkb | fungi: we would need to review all of the log files we produce to double check them for leaked sensitive info. Address any such leaks, then remember to not add new ones | 19:15 |
fungi | what's the link to the review for the feature which got merged? | 19:16 |
clarkb | corvus: I don't think so | 19:16 |
clarkb | fungi: https://review.opendev.org/c/zuul/zuul/+/850270 | 19:16 |
fungi | thanks | 19:16 |
clarkb | corvus: I think the python 2.7 testing has become the next focus there to avoid regressions on the ansible target end | 19:17 |
ianw | i haven't implemented cleanup, but i have expanded the docs to talk about it explicitly | 19:17 |
ianw | that is | 19:18 |
ianw | #link https://review.opendev.org/c/zuul/zuul/+/851942/ | 19:18 |
ianw | which could use a review | 19:18 |
corvus | so how is bridge going to get cleaned up? | 19:18 |
clarkb | corvus: it won't, but we only just realized that | 19:18 |
clarkb | we'd need to stop running the console streamer on bridge or implement the periodic cleanups. | 19:19 |
corvus | ianw: do you plan on implementing periodic cleanup in zuul-console, or separately? | 19:20 |
corvus | (by separately, i mean just a cron job on bridge?) | 19:20 |
ianw | i don't have immediate plans to work on adding it to zuul-console | 19:21 |
corvus | did this make it into a zuul release? | 19:21 |
fungi | looks like no | 19:22 |
fungi | i don't see a tag which contains df3f9dcd30a13232447b3be67c7845c51cb527a0 in its history | 19:22 |
fungi | so it could be easily reverted | 19:22 |
ianw | i'm not sure anything needs to be reverted | 19:23 |
corvus | okay. i think further discussion can happen in #zuul, but given that the feature won't be used for the intended use case, it's probably worth considering | 19:23 |
clarkb | ya we've got a few more topics to get to here. probably best to pick this back up in the zuul room on matrix | 19:23 |
clarkb | really quickly before the next topic, ianw any venv on bridge changes needing review yet? | 19:24 |
fungi | yeah, just to confirm, it merged 9 days ago, and the most recent release (6.2.0) was 10 days before that | 19:24 |
ianw | no sorry, i have some local work i haven't pushed yet | 19:24 |
clarkb | no problem. Just want to make sure I'm not missing any important changes to review | 19:24 |
clarkb | #topic Updating Bionic Servers to Focal or Jammy | 19:25 |
clarkb | I don't think there is anything new on this front | 19:25 |
clarkb | But I think we are generally ready to deploy to jammy for new or replacement things. | 19:25 |
clarkb | Jammy just got its .1 release too | 19:25 |
clarkb | which is when they open up the in place upgrade path for focal installations. A good indication upstream thinks it is ready too | 19:26 |
clarkb | #link https://etherpad.opendev.org/p/opendev-bionic-server-upgrades Notes on the work that needs to be done. | 19:26 |
clarkb | Feel free to add any additional bits of info to that etherpad as we start to take this on | 19:26 |
clarkb | #topic Mailman 3 | 19:27 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/851248 | 19:27 |
clarkb | I think the deployment is largely there now. Plan is to start testing migration of actual lists ~Thursday | 19:27 |
clarkb | Reviews definitely welcome at this point. I've still got it marked WIP but it has congealed into something that looks mergeable now | 19:28 |
fungi | i may be able to try out some of the migration tools on the held node tomorrow | 19:28 |
fungi | thanks for putting the deployment together! | 19:28 |
clarkb | There is also a held node at 198.72.124.71 if anyone wants to poke at it | 19:28 |
clarkb | you're welcome | 19:29 |
clarkb | Importantly I think I've got the native vhosting working | 19:29 |
clarkb | which means we don't have any regressions from our mm2 vhosting behavior | 19:29 |
clarkb | and the rest api seems to be sufficient for the management we need to do. | 19:30 |
clarkb | For downsides mm3 is a significantly more complicated piece of software built on django with a database and all that. But it shouldn't be too bad | 19:30 |
clarkb | Anyway feel free to poke at the held node and leave review comments. I'll do my best to catch up on that after tomorrow. Previous investigation has been helpful in improving the deployment | 19:31 |
clarkb | #topic Gitea 1.17 Upgrade | 19:31 |
clarkb | #link https://review.opendev.org/c/opendev/system-config/+/847204 1.17.1 out, time to schedule the upgrade | 19:31 |
clarkb | Gitea 1.17.1 is out. That change has been updated to deploy 1.17.1. | 19:31 |
clarkb | I think we can upgrade whenever we are comfortable doing so with the openstack release schedule and so on | 19:32 |
fungi | yeah, the list of changes didn't look too risky for us | 19:32 |
clarkb | Big changes for gitea to pay attention to: main is the default branch for new projects. Testing was updated to ensure we continue to create master by default to match gerrit and jeepyb | 19:32 |
clarkb | also they added a package repos feature that had a bunch of bugs in the .0 release. We intentionally disable it in part due to our distributed cluster not having shared storage, but also because we likely don't have sufficient storage for it | 19:33 |
clarkb | If I can get reviews I'm happy to babysit that upgrade on thursday when I get back | 19:33 |
ianw | ++ will look | 19:33 |
clarkb | #topic Gerrit Load Issues | 19:34 |
clarkb | Last week a couple of times around 08:00 UTC Gerrit got busy and stopped accepting http requests | 19:34 |
clarkb | I believe that Gerrit itself was still running it just exhausted its thread pool which caused it and apache to return 500 errors | 19:35 |
clarkb | In response to that we've bumped up our http thread count above the value for ssh+http git request threads. The idea being we can still have a responsive web ui and rest api if the git side of gerrit is busy | 19:35 |
clarkb | Let's keep an eye on it and evaluate if further changes need to be made based on the new behavior with this config update | 19:36 |
clarkb | During debugging and response to this a few busy IPs were blackholed using the linux route table. | 19:36 |
clarkb | One class of blockage was jenkins servers using the gerrit trigger plugin because they make a request to /plugins/events-log/ every 2 seconds which 404s | 19:37 |
clarkb | #link https://github.com/jenkinsci/gerrit-trigger-plugin/pull/470 trying to make the gerrit trigger plugin less noisy | 19:37 |
clarkb | I've made that pull request trying to improve the plugin to be less noisy as Jenkins is still reasonable for third party CI | 19:37 |
clarkb | I did unblock an IP because the users noticed | 19:37 |
clarkb | Also this pointed out that infra-root may use different approaches to block traffic to a server. | 19:38 |
clarkb | The first place I always look for network traffic blockages is the firewall. I was very confused when I couldn't find iptables rules for this. | 19:38 |
clarkb | I don't think we need to solve this in this meeting but it would be good if we are consistent in applying those temporary ephemeral blockages on our servers. We should decide on a method either iptables or ip route and stick to it | 19:39 |
clarkb | Maybe give that some thought over the next week and we can discuss it in next week's meeting | 19:39 |
clarkb | And finally if we have to make additional changes to Gerrit to address this let's try to make them as slowly as is reasonable so that we can measure their impact and avoid negatively impacting openstack feature freeze (a time when gerrit tends to be busy) | 19:40 |
clarkb | #topic Jaeger Tracging Server | 19:42 |
clarkb | #undo | 19:42 |
opendevmeet | Removing item from minutes: #topic Jaeger Tracging Server | 19:42 |
clarkb | #topic Jaeger Tracing Server | 19:42 |
clarkb | #link https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html Zuul spec for OpenTelemetry support | 19:42 |
clarkb | Zuul will be growing support for opentelemetry tracing | 19:42 |
clarkb | The question is whether or not OpenDev should have a Jaeger server to help test this and take advantage of the functionality | 19:42 |
clarkb | my initial thought was maybe we can colocate this with the prometheus server I have on my todo list. But then immediately decided keeping them separate to reduce difficulty of OS upgrades etc is better | 19:43 |
clarkb | I think from an operational standpoint zuul's logs have been pretty good and we've been able to debug problems tracing through logs. frickler and I did that recently with that github repo's unexpected behavior | 19:44 |
fungi | oase os upgrades would matter for containerized prometheus and jaeger deployments? | 19:44 |
fungi | er, base os upgrades i mean | 19:44 |
corvus | jaeger would be containerized | 19:44 |
clarkb | fungi: typically we do OS upgrades by spinning up new hosts and both prometheus and jaeger appear to store data in databases of some sort that would need to be migrated | 19:45 |
frickler | I must admit I have no idea what jaeger does. what kind of additional data would we get from it? | 19:45 |
clarkb | keeping things separate will likely simplify things and there isn't a strong reason to collocate | 19:45 |
fungi | opentelemetry tracing data | 19:45 |
ianw | clarkb: not that different to graphite though? that has a db that needs to be moved if we update | 19:46 |
clarkb | frickler: it's like fancy log data. But instead of raw text it goes into a db and there is tooling to render it nicely with timings and so on | 19:46 |
fungi | frickler: basically timing and sequencing of events | 19:46 |
corvus | i think the utility here would be a marginal improvement to infra-root's ability to debug user-facing issues; potentially a significant improvement in ability for users to self-diagnose; and a benefit to the zuul project in being able to fully demonstrate and collaborate on the reference zuul implementation that opendev runs | 19:46 |
clarkb | corvus: user exposure is a good point since the tracing data is far more sanitized than the raw logs | 19:46 |
corvus | i plan on adding a sample deployment to the zuul quickstart system, so it's not a big deal for me to port that over to opendev. | 19:46 |
fungi | #link https://zuul-ci.org/docs/zuul/latest/developer/specs/tracing.html | 19:47 |
fungi | that might be a missing bit of context | 19:47 |
corvus | clarkb: yep, i expect this to be fully user-safe | 19:47 |
fungi | er, i guess clarkb already linked the zuul spec | 19:47 |
clarkb | the zuul event id in our logs is a rudimentary form of tracing | 19:47 |
clarkb | you can think of it like grep 'eventid' /var/log/zuul/ across all the zuul machines for curated info | 19:48 |
corvus | jaeger can store data locally on the filesystem | 19:48 |
corvus | so i'm imagining a really simple self-contained jaeger server. and it's okay if we lose the data. | 19:49 |
clarkb | no opposition from me. | 19:49 |
clarkb | particularly now that I've realized it can help users debug or at least better understand unexpected zuul behavior | 19:49 |
ianw | for mine, we have such good pre-production system-config testing, as long as a service is working in with that I don't see any reason not to just bring it up | 19:49 |
clarkb | that's new functionality that would be useful | 19:49 |
fungi | frickler: anyway, the introduction to that spec outlines the potential benefits of including an interface to that information in our deployment | 19:50 |
corvus | given the simplicity of that (self-contained, ephemeral, no security issues) i thought maybe i could just propose the change to implement it and we can review it there (i'll be doing the work regardless, so i can accept the risk that we run into a blocker in review) | 19:50 |
fungi | sounds good to me | 19:50 |
clarkb | ya a spec may be overkill given the low risk of deploying it | 19:50 |
clarkb | worst case we just turn it off and then write a spec :) | 19:50 |
corvus | clarkb: yep | 19:51 |
corvus | okay, sounds like once i'm ready, i'll propose an implementation change | 19:52 |
clarkb | corvus: and probably good to target jammy as the base at this point. | 19:52 |
corvus | ack | 19:53 |
clarkb | Sounds like that may be it for this topic? | 19:53 |
clarkb | #topic Open Discussion | 19:54 |
clarkb | Anything else? | 19:54 |
fungi | nothing from me | 19:54 |
clarkb | If that is all then thank you everyone. We can end here | 19:56 |
clarkb | We'll be back next week at the same time and location | 19:57 |
clarkb | #endmeeting | 19:57 |
opendevmeet | Meeting ended Tue Aug 23 19:57:07 2022 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 19:57 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2022/infra.2022-08-23-19.01.html | 19:57 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2022/infra.2022-08-23-19.01.txt | 19:57 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2022/infra.2022-08-23-19.01.log.html | 19:57 |
fungi | thanks clarkb! | 19:57 |
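
Under the Bastion Host Updates topic, clarkb suggested using an age filter when manually clearing leaked console log files, so that files the streamer will delete on its own are left alone, and corvus asked about a periodic cleanup as a backstop. A minimal sketch of that idea follows. It is not Zuul code: the `/tmp` directory, the `console-*.log` filename pattern, the one-day threshold, and both function names are assumptions made for illustration and should be checked against the actual deployment before use.

```python
import glob
import os
import time


def stale_console_logs(directory="/tmp", pattern="console-*.log",
                       min_age_seconds=24 * 60 * 60, now=None):
    """Return matching files last modified more than min_age_seconds ago.

    The directory and filename pattern are assumptions, not Zuul settings.
    """
    now = time.time() if now is None else now
    stale = []
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        try:
            mtime = os.stat(path).st_mtime
        except FileNotFoundError:
            # The file was removed between listing and stat; skip it.
            continue
        if now - mtime > min_age_seconds:
            stale.append(path)
    return stale


def remove_stale_console_logs(dry_run=True, **kwargs):
    """Delete stale files, or only report them when dry_run is True."""
    handled = []
    for path in stale_console_logs(**kwargs):
        if not dry_run:
            try:
                os.remove(path)
            except FileNotFoundError:
                # Something else cleaned it up first; nothing to do.
                continue
        handled.append(path)
    return handled
```

Calling `remove_stale_console_logs()` with the default `dry_run=True` only lists candidates, which makes it possible to review the list before deleting anything. The same function could be run from a cron job on bridge if a separate periodic cleanup is preferred over adding one to zuul-console.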