clarkb | Our meeting will start in a few minutes. | 18:58 |
---|---|---|
ianw | o/ | 19:00 |
clarkb | #startmeeting infra | 19:01 |
opendevmeet | Meeting started Tue Dec 7 19:01:01 2021 UTC and is due to finish in 60 minutes. The chair is clarkb. Information about MeetBot at http://wiki.debian.org/MeetBot. | 19:01 |
opendevmeet | Useful Commands: #action #agreed #help #info #idea #link #topic #startvote. | 19:01 |
opendevmeet | The meeting name has been set to 'infra' | 19:01 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000305.html Our Agenda | 19:01 |
clarkb | #topic Announcements | 19:01 |
clarkb | This didn't make it onto the agenda because it didn't occur to me until this morning. We are fast approaching a holiday period for many of us. I'll be unable to make a meeting on the 21st and likely unable to make a meeting on January 4 | 19:02 |
fungi | i'm okay with skipping those | 19:02 |
clarkb | ya I think we can go ahead and cancel the 21st and 4th. And I'll try hard to do a check in on the 28th though I expect things will get pretty quiet all around | 19:03 |
clarkb | everyone should enjoy the holidays and their associated time off. I'm going to attempt to do this myself :) | 19:03 |
ianw | ++ won't be regularly around then either | 19:04 |
clarkb | #topic Actions from last meeting | 19:05 |
clarkb | #link http://eavesdrop.openstack.org/meetings/infra/2021/infra.2021-11-30-19.01.txt minutes from last meeting | 19:05 |
clarkb | There weren't any actions recorded | 19:05 |
clarkb | #topic Topics | 19:05 |
clarkb | #topic Improving CD Throughput | 19:05 |
clarkb | We made some progress here and also took a step or two back but we learned some stuff | 19:05 |
clarkb | When we switched in the "setup source for system-config on bridge at the start of each buildset" change we missed a few important things, so we have reverted that change | 19:06 |
clarkb | We need to make sure that we are using nodeless jobs, that we update system-config on bridge and not on a normal zuul node, we need to honor DISABLE-ANSIBLE and we need to be sure every buildset has this job run first | 19:06 |
clarkb | The good news is that since we learned all of that we are able to regroup and make a new plan. | 19:06 |
clarkb | link http://lists.opendev.org/pipermail/service-discuss/2021-December/000306.html has a good rundown of the next steps | 19:07 |
clarkb | #link http://lists.opendev.org/pipermail/service-discuss/2021-December/000306.html has a good rundown of the next steps | 19:07 |
ianw | on the DISABLE-ANSIBLE ... i was thinking about that | 19:07 |
clarkb | infra-root ^ that email describes a refactor of various things to make it harder to make those previous mistakes again | 19:07 |
ianw | i feel like it makes sense to check that in the base job that sets up the system-config checkout for the buildset ... that will hold all prod jobs | 19:08 |
ianw | but maybe not so much in the prod jobs themselves. that way, if a buildset starts, it completes, but a new one won't | 19:08 |
clarkb | ya I'm somewhat on the fence over that. To me if I disable ansible that means no more ansible even if the buildset is still running | 19:09 |
clarkb | But I can see an argument for allowing an in progress buildset to complete for consistency | 19:09 |
ianw | in parallel operation, it seems unclear if you dropped it in the middle of a deploy buildset what it would catch | 19:09 |
clarkb | What we can do if we really really need to stop the production line is move the authorized keys file aside | 19:09 |
clarkb | and usually we use that toggle when doing stuff like project renames, not as an emergency off switch (ssh authorized_keys seems better for that) | 19:10 |
corvus | or dequeue the job | 19:10 |
clarkb | ya I guess that too | 19:10 |
ianw | with zuul authenticated ui, that would be practical | 19:10 |
clarkb | ianw: we should update the documentation to make that behavior change clear though | 19:10 |
ianw | (currently, pretty sure the jobs would be done before i'd pulled up a login window and figured out :) | 19:10 |
ianw | sure, i can post a doc update and we can discuss there | 19:11 |
clarkb | sounds good, thanks | 19:11 |
fungi | we talked about having the base job abort if the disable-ansible file is present, did i push that change (or has someone)? i can't recall now | 19:12 |
clarkb | fungi: I don't recall seeing a change for that. You did split it out into a separate role if we wanted to consume it in multiple jobs | 19:12 |
fungi | oh, right | 19:13 |
ianw | fungi: abort as in abort, or do the pause thing it does now? | 19:13 |
fungi | i think the pause thing it does now, sorry | 19:13 |
clarkb | infra-root ^ if you can review the changes outlined in that email that would be great. I'm planning on digging in this afternoon myself. I think we're really close to being able to start updating semaphores and getting parallel runs which is exciting | 19:13 |
ianw | fungi: that would be the status quo I believe, as that is checked in the setup-src job | 19:14 |
ianw | currently every prod job runs that; after the changes, only the bootstrap-bridge (that all other jobs depend on) would run it | 19:15 |
clarkb | I think we have to avoid soft dependencies to make that work, but I was already asking for that. | 19:15 |
fungi | yeah, and the problem we ran into was that subsequent jobs didn't check it so proceeded normally when setup-src got skipped | 19:15 |
clarkb | I suspect this because in the current system if you set the disable-ansible file they all run serially failing and retrying in a loop until they have failed 3 times in a row? Or maybe that is only when you pull the ssh keys out | 19:16 |
fungi | as belt-and-braces safety we could check it in the job they all inherit from | 19:16 |
fungi | or in base | 19:16 |
clarkb | fungi: ya I think that is an artifact of the soft dependency | 19:16 |
ianw | right, yes the base job (after proposed changes) is a hard dependency that should always run (no file matchers) | 19:16 |
clarkb | if we make it a hard dependency then they shouldn't proceed | 19:16 |
corvus | if you don't want child jobs to run, you can filter them out of the list | 19:17 |
clarkb | corvus: isn't that what a hard dependency failing to succeed will already do? | 19:17 |
corvus | https://zuul-ci.org/docs/zuul/reference/jobs.html#skipping-dependent-jobs | 19:17 |
ianw | yeah, it was supposed to be a hard dependency in this case -- it has to run to checkout the system-config source for the buildset | 19:18 |
corvus | yeah, but i think you could do "child_jobs: []" to cause 0 child jobs to run regardless of hard/soft | 19:18 |
corvus | so if you want to do it with soft, that could be a way | 19:18 |
corvus | but if it needs to be hard for other reasons, then meh. :) | 19:18 |
clarkb | gotcha, ya in this case I think we need a hard dependency either way | 19:18 |
ianw | it was a bug to not run it, not the intention | 19:18 |
corvus | ack | 19:19 |
clarkb | Let's continue on as we have a few other subjects to cover | 19:19 |
clarkb | #topic User management on our systems | 19:19 |
clarkb | Yesterday we managed to update the matrix-gerritbot image to run under the gerritbot user | 19:20 |
clarkb | I think what we learned from this exercise is that even simple "read only" appearing images can be complicated and that setting users to run a container under is going to be an image by image exercise | 19:20 |
clarkb | that said I still think there is value in this and we should try to pick them off as we can | 19:20 |
clarkb | But beware that we need to be careful about permissions within the image and bind mounts as well as expectations of the running processes. Turns out openssh fails if it is running as a user without an entry in /etc/passwd | 19:21 |
clarkb | At this point I don't think there is anything else to review or cover other than to say, if you've got free time you might look into updating one of our containers :) | 19:21 |
ianw | do we have a list? | 19:22 |
clarkb | IRC bots in particular seem like good targets since they all run on a shared host | 19:22 |
clarkb | ianw: I haven't made a comprehensive one yet as I was mostly going to focus on eavesdrop to start | 19:22 |
clarkb | low impact from our perspective to restart them and debug as we go, but also relatively high ROI since they share a host | 19:22 |
clarkb | most other systems are all dedicated hosts so less returns | 19:22 |
ianw | ok, np. i know i had issues when haproxy switched *to* having a separate user with the LB setup | 19:22 |
clarkb | ianw: looking at my notes hound, lodgeit, refstack, grafana are others. But this isn't a comprehensive list I don't think | 19:24 |
clarkb | but ya I was focusing on ircbots to start since all of ^ are on dedicated hosts | 19:24 |
clarkb | Anyway as mentioned I/we have learned a bit doing this for the gerritbots and there are more irc/matrix bots to address. Also the services above. If you've got time feel free to pick them off. Our testing helps with ensuring it is happy too | 19:25 |
clarkb | #topic Zuul Gearman Going Away | 19:25 |
clarkb | Zuul's gearman tooling is very close to being deleted. This means we can no longer use the zuul gearman commands to enqueue/dequeue etc | 19:26 |
clarkb | Instead we'll need to use Zuul client to talk to the REST API for this which requires a JWT | 19:26 |
clarkb | corvus has changes up to set up a local JWT for administrative tasks on our zuul installation. We should also update our docs and our queue saving scripts to match when that is ready | 19:26 |
corvus | i think they just merged (thanks fungi ) | 19:27 |
corvus | with those in place, i'll generate a jwt and set up zuul-client | 19:27 |
fungi | yeah, so we should in theory still be able to run them from a shell on the server without looking up credentials | 19:27 |
corvus | note, zuul-client != zuul. they are very similar, but only zuul-client has the ability to read a jwt from a config file. | 19:27 |
corvus | we will probably remove the admin commands from zuul eventually too since they are redundant and not as useful as zuul-client's implementation | 19:28 |
corvus | so anyway, that'll be "zuul-client enqueue" etc in the future | 19:28 |
fungi | thanks! | 19:29 |
clarkb | yup mostly calling this out so people are aware and that we don't forget to update docs and our queue saving script | 19:29 |
clarkb | #topic keycloak.opendev.org | 19:29 |
clarkb | On the subject of authentication we now have a keycloak server to experiment with | 19:30 |
clarkb | The main thing I wanted to clarify on this is currently the server is in a pre production state right? we shouldn't be relying on this for anything production like and instead use it to figure out how to make keycloak work according to our auth spec that fungi wrote | 19:30 |
clarkb | for example we can integrate keycloak with zuul's new auth stuff but we aren't doing that yet while we learn about keycloak? | 19:31 |
clarkb | or maybe if we do that it will be in a limited capacity and functionality could come and go. We'll continue to rely on local auth for admin stuff | 19:31 |
fungi | i'm willing to be flexible there | 19:31 |
corvus | it is in pre-prod. expect data to disappear at any time. | 19:31 |
corvus | i would like to go ahead and create a realm for use with zuul... i think maybe something simple where a few of us make some accounts manually or something | 19:32 |
fungi | one thing we already learned is it's apparently still easy to accidentally create multiple accounts when you use different ids if you don't link them in advance | 19:32 |
corvus | yeah. that can be resolved, but only if we allow password authentication. | 19:32 |
corvus | (like, you can fix that in a self-service way, but only if password auth is available too) | 19:33 |
clarkb | Seems like we should avoid that if we can to make sure people understand we aren't intending to be an actual auth identity | 19:33 |
fungi | which we had previously wanted to avoid so we could not be in the business of having a database of passwords as a high-profile target, nor deal with frequent password reset requests. it's something we'll need to weigh as the poc moves along | 19:33 |
corvus | i think that's worth revisiting. here's a thought experiment: | 19:34 |
clarkb | ya and I bet it is impossible to disable the password auth for external identity usages because that identity should be the same for any method used to authenticate via keycloak | 19:34 |
corvus | how different is a database of passwords from a database of mappings from a threat POV. | 19:34 |
corvus | sorry, was meant to be a question | 19:34 |
fungi | if users avoid reusing passwords, not terribly different. but users often reuse passwords | 19:35 |
clarkb | if there was a way to run it where password auth only let you run keycloak account tasks and not log in elsewhere I think that would be fine | 19:35 |
clarkb | But I strongly suspect that isn't how things are designed | 19:35 |
corvus | anyway -- not something we need to answer now, but i do think it's worth revisiting that with updated knowledge | 19:35 |
corvus | clarkb: i couldn't say whether that's possible or not | 19:35 |
fungi | also if we add passwords, we probably need to add integrated 2fa | 19:36 |
fungi | which could become its own support burden | 19:36 |
ianw | iiuc, a holdup for gerrit conversion was that keycloak doesn't allow adding launchpad/openid right? but there was a theory that it wouldn't be too hard to add? is that accurate? | 19:36 |
clarkb | ya all stuff to explore. Maybe figure out if 2fa is viable and if we can require it for example to mitigate the concerns with passwords | 19:36 |
corvus | clarkb: there's a lot of workflow-by-form stuff, so maybe something can be created for that. but it's certainly not a "checkbox" :) | 19:36 |
corvus | ianw: yes | 19:36 |
clarkb | ianw: yes there is a php saml tool thing that can translate to other backends and keycloak speaks the saml to that php tool in theory | 19:36 |
corvus | 2fa is available and is a "checkbox" :) | 19:36 |
fungi | ianw: yes, there's a proposal in the spec to create a sort of bridge from keycloak to openid via phpsimplesaml | 19:37 |
fungi | corvus: i expect turning on 2fa is not hard, but helping users reset it every time they lock themselves out might be | 19:37 |
clarkb | writing that bit would be a good next step for someone interested in experimenting with keycloak more | 19:37 |
ianw | cool, well this seems like a great step in having an environment we can test that too. i'd be interested in working on that in the future | 19:37 |
clarkb | ianw: ++ having the actual service up gives us something to look at that is more than theoretical | 19:37 |
clarkb | I'll also need to finish the gerrit user cleanups so that we can update the external ids database in a straightforward manner | 19:38 |
corvus | yeah... and i won't be able to drive this, so having other folks step in and pick it up would be great | 19:38 |
clarkb | Alright tldr is work to be done, feel free to experiment, but this isn't for production use yet | 19:39 |
clarkb | anything else? | 19:39 |
clarkb | #topic Adding a lists.openinfra.dev mailman site | 19:40 |
fungi | i'm still trying to fix things to make our current mailman orchestration go | 19:40 |
clarkb | fungi and I ran into some trouble with newlist when doing this that we thought we had corrected. Long story short newlist is still looking for input to confirm emailing people | 19:40 |
clarkb | seems that redirecting /dev/null into newlist corrects this, but it also exposes that our testing is different than prod | 19:40 |
fungi | i was able to reproduce it with a dnm change, and determined that redirecting stdin from /dev/null in a shell task properly solves it | 19:41 |
clarkb | fungi: the plan is to update our system-config-run jobs to all block port 25 outbound then we can tell newlist to send email right? | 19:41 |
clarkb | oh sorry I'll let fungi fill us in :) | 19:41 |
fungi | setting stdin to a null string in a cmd task does not have the same effect | 19:41 |
fungi | (which is what we had merged previously) | 19:41 |
fungi | and yeah, i have changes up to collect exim logs so we can see what's trying to send e-mail through the mta in tests, as well as blocking 25/tcp egress to prevent our deploy jobs from accidentally sending e-mail | 19:42 |
fungi | and then i'm dropping the test-time addition of the -q option for the newlist command | 19:42 |
clarkb | fungi: are any of those ready for review yet? | 19:42 |
fungi | probably, though i have a pending update to one of them once i get test results back from the latest revision to the iptables change | 19:43 |
fungi | and haven't rebased the dropping of -q onto that stack yet | 19:43 |
clarkb | gotcha, feel free to ping me when you want reviews and I'll happily take a look | 19:43 |
fungi | should hopefully have it up right after the meeting, and then once those merge we can add new mailing lists again more easily | 19:43 |
clarkb | thanks! | 19:44 |
fungi | topic:mailman-lists | 19:44 |
fungi | in case anyone's looking for them | 19:44 |
clarkb | #topic Gerrit User Summit | 19:45 |
clarkb | Gerrit User Summit happened last week. I found it useful to catch up on some of the gerrit upstream activities | 19:45 |
clarkb | I took notes and they are in my homedir on review02. But I'll try to summarize some of the interesting bits really quickly here | 19:45 |
clarkb | Gerrit 3.2 is EOL. Thank you ianw for helping get us to 3.3 | 19:46 |
clarkb | The new Checks UI work relies on a plugin in the Gerrit server that queries CI systems for results/status and then renders them in a consistent way regardless of the CI system | 19:46 |
clarkb | this means that we could probably replace the Zuul summary plugin with a Checks UI plugin using this new system. But I think that is 3.4 and beyond. Not a 3.3 thing | 19:46 |
corvus | and that's a java plugin? or is it a pg plugin? | 19:47 |
clarkb | I think that is a java plugin because you have to interact with gerrit internal state | 19:47 |
clarkb | the plugin acts as a data retrieval and conformance system between your CI system and the checks UI | 19:48 |
clarkb | and I think that requires you make writes somewhere which I suspect the js stuff can't do | 19:48 |
clarkb | however, that wasn't entirely clear to me so I could be wrong | 19:48 |
clarkb | Gerrit is working towards deleting prolog for complex acl rule applications. Instead they are replacing it with "Composable Submit Requirements" which use a simple query language based on Gerrit's existing query language | 19:49 |
clarkb | you essentially write rules that say "if this gerrit query returns a result then this rule applies to this change" | 19:49 |
clarkb | and the rules can say this is required for submitting etc | 19:49 |
clarkb | I don't expect we'll migrate to this quickly for anything though random users may use it for various additional checks. However, we should be careful to ensure we don't accidentally reduce our requirements for submitting via zuul | 19:50 |
clarkb | There is a ChronicleMap libmodule plugin for persistent caches. This apparently improves performance quite a bit since you don't lose cache data when restarting gerrit. Some people suggested it be incorporated directly into Gerrit rather than a plugin | 19:51 |
clarkb | Our performance is pretty good these days and we don't restart Gerrit often but may be worth looking into as some users (those talking to nova stuff) have indicated slowness after restarts | 19:51 |
clarkb | And finally the Gerrit meetings are open to the entire community. You can also put stuff on their agenda if you have something specific you want to discuss | 19:52 |
clarkb | This is something I wasn't clear about since I think they title them the EC meeting or similar | 19:52 |
clarkb | I'll probably try to start attending these once I figure out when they happen | 19:52 |
clarkb | So ya feel free to ask me any questions if you have them though I'm still a gerrit community noob | 19:53 |
clarkb | Overall I think the event went well and I learned some stuff about what to look for in the future | 19:53 |
clarkb | #topic Nodepool Image cleanups | 19:54 |
ianw | thanks for attending/the summary! | 19:54 |
clarkb | We've got a number of images that are either under maintained or going EOL | 19:55 |
clarkb | #link http://lists.opendev.org/pipermail/service-announce/2021-December/000029.html | 19:55 |
clarkb | I sent email outlining a rough plan for cleanups to service-announce | 19:55 |
clarkb | This should reduce a lot of pressure on our AFS volumes too | 19:56 |
clarkb | If we do get responses for opensuse and gentoo help it would be good to maybe also try and run periodic jobs on those platforms somewhere | 19:57 |
clarkb | to serve as a signal when they break? | 19:57 |
clarkb | An idea I had that might make maintenance a bit more responsive in the future | 19:57 |
clarkb | #topic Open Discussion | 19:57 |
ianw | we do have a periodic run of zuul-jobs that tries everything | 19:57 |
clarkb | ianw: ah ok so those volunteers could watch that | 19:57 |
ianw | but if a job fails in the woods with nobody listening ... :) | 19:57 |
ianw | speaking of | 19:58 |
clarkb | We are almost out of time, anything else? | 19:58 |
ianw | #link https://review.opendev.org/c/zuul/zuul-jobs/+/818702 | 19:58 |
jentoio | I'd like to help/volunteer | 19:58 |
ianw | that will enable f35; which seems to have gone smoothly | 19:58 |
clarkb | jentoio: cool, can you respond to that email so that we can help keep track of it and not miss that there is interest? | 19:59 |
jentoio | sure, I was hoping we can meet for coffee to discuss | 19:59 |
jentoio | since we live near each other - unless you moved ;) | 19:59 |
ianw | i can also lookup the details of the zuul-jobs runs and post | 19:59 |
clarkb | I'm still in the same part of the world though I don't get out much these days | 20:00 |
jentoio | but I'll respond to email as well. | 20:00 |
clarkb | ya I think that helps since others are involved as well and around the world | 20:00 |
clarkb | but I'm happy to discuss further as well. And we are at time | 20:00 |
clarkb | thank you everyone! | 20:00 |
clarkb | #endmeeting | 20:00 |
opendevmeet | Meeting ended Tue Dec 7 20:00:49 2021 UTC. Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4) | 20:00 |
opendevmeet | Minutes: https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-07-19.01.html | 20:00 |
opendevmeet | Minutes (text): https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-07-19.01.txt | 20:00 |
opendevmeet | Log: https://meetings.opendev.org/meetings/infra/2021/infra.2021-12-07-19.01.log.html | 20:00 |
fungi | thanks clarkb! | 20:01 |
clarkb | As always feel free to continue discussion on the mailing list (service-discuss@lists.opendev.org) or in #opendev | 20:01 |
clarkb | And we'll see you here next week before taking a break for the holidays | 20:01 |
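
The stdin fix discussed under the mailman topic (redirecting stdin from /dev/null so that a prompting command sees end-of-file instead of waiting for input) can be illustrated with a minimal Python sketch. The child script below is a hypothetical stand-in for an interactive tool such as newlist and does not reproduce its real behavior; the helper name `run_with_null_stdin` is likewise an assumption made for this example.

```python
import subprocess
import sys

# Hypothetical stand-in for an interactive tool: it tries to read a
# confirmation line from stdin and reports whether it got any input.
CHILD = (
    "import sys\n"
    "reply = sys.stdin.readline()\n"
    "print('got EOF' if reply == '' else 'got input')\n"
)


def run_with_null_stdin(argv):
    """Run argv with stdin attached to the null device.

    Any read from stdin in the child returns end-of-file immediately, so a
    confirmation prompt cannot block waiting for a terminal.
    """
    return subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,  # same effect as `< /dev/null` in a shell
        capture_output=True,
        text=True,
        timeout=30,
        check=True,
    )


result = run_with_null_stdin([sys.executable, "-c", CHILD])
print(result.stdout.strip())
```

This only shows the general mechanism. Whether a given tool treats end-of-file as "continue" or as an error depends on the tool, which is why the log collection and 25/tcp egress blocking mentioned in the meeting still matter for verifying what actually happens in tests.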