Friday, 2023-01-20

03:07 <@iwienand:matrix.org> > <@g_gobi:matrix.org> One of my use cases is that I'm running my testing using Zuul. I want to check the infrastructure after some of the test suites. For example, if any test suite fails, pause the execution and debug it.
      Zuul allows for that via autoholds; see https://zuul-ci.org/docs/zuul-client/commands.html ; e.g. https://zuul.openstack.org/autoholds
03:25 <@g_gobi:matrix.org> Thanks ianw: I'll take a look
04:30 <@jim:acmegating.com> tdlaw: i made a video about that: https://www.youtube.com/watch?v=WEQ0X3SGD0g
08:04 -@gerrit:opendev.org- Per Wiklund proposed wip: [zuul/nodepool] 871102: Introduce driver for openshift virtualmachines https://review.opendev.org/c/zuul/nodepool/+/871102
14:31 -@gerrit:opendev.org- Simon Westphahl proposed:
      - [zuul/zuul] 871106: Require latest layout for processing mgmt events https://review.opendev.org/c/zuul/zuul/+/871106
      - [zuul/zuul] 871107: Periodically cleanup leaked pipeline state https://review.opendev.org/c/zuul/zuul/+/871107
      - [zuul/zuul] 871108: Cleanup deleted pipelines and event queues https://review.opendev.org/c/zuul/zuul/+/871108
14:35 <@fungicide:matrix.org> > <@g_gobi:matrix.org> Thanks ianw: I'll take a look
      note that autoholds don't pause mid-execution of the build, but rather instruct nodepool not to delete the nodes after the build completes with a failure result
14:36 <@fungicide:matrix.org> but more generally, inspecting the state of test environments after a failure is what that feature is mostly used for
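[As a concrete sketch of the autohold workflow described above (flag names as given in the zuul-client command docs linked earlier; the tenant/project/job/reason values here are placeholders, not a real deployment):]

```shell
# Hold the nodes of the next failing build of a job, so the test
# environment can be inspected after the failure (placeholder names):
zuul-client autohold \
    --tenant example-tenant \
    --project example/project \
    --job integration-tests \
    --reason "debugging flaky integration tests" \
    --count 1
```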
14:43 -@gerrit:opendev.org- Ade Lee proposed: [zuul/zuul-jobs] 866881: Add ubuntu to enable-fips role https://review.opendev.org/c/zuul/zuul-jobs/+/866881
14:45 -@gerrit:opendev.org- Joshua Watt proposed: [zuul/zuul] 871197: Fix include-branches priority https://review.opendev.org/c/zuul/zuul/+/871197
15:25 <@fungicide:matrix.org> has anyone else observed instability in the roles.upload-logs-base.library.test_zuul_ibm_upload.TestUpload.test_upload_result test? i just saw it fail the same way in py38 and py311 for a change while passing on py39 and py310: https://review.opendev.org/c/zuul/zuul-jobs/+/866881
15:29 <@jim:acmegating.com> fungi: not an answer, but some context: a lot of that test is actually shared between all the roles, so it may or may not be ibm-specific (it could be that test race timing makes it more likely to show up there right now).
15:29 <@fungicide:matrix.org> got it. interesting that it would hit the exact same test both times in that case
15:30 <@fungicide:matrix.org> i looked through the build history, but it seems like zuul-jobs has been somewhat quiet, so not a lot of data points
15:42 <@g_gobi:matrix.org> > <@fungicide:matrix.org> note that autoholds don't pause mid-execution of the build, but rather instruct nodepool not to delete the nodes after the build completes with a failure result
      fungi: thanks for the information
16:43 <@clarkb:matrix.org> jpew: re https://review.opendev.org/c/zuul/zuul/+/871197 I agree the docs and behavior appear out of sync, and that change should fix that. I wonder if we should put the release notes under the "upgrade:" heading to make it clear this is a behavior change. It is a bug fix relative to the docs, but it is also potentially a behavior change.
16:44 <@jpew:matrix.org> Clark: Ya, that works for me. Probably with a note on how to check your configuration when upgrading?
16:46 <@clarkb:matrix.org> ++ Might also be good to sanity check with corvus that the docs aren't at fault here (however, having explicit include override exclude makes sense to me, so I think the docs are correct)
16:55 <@jim:acmegating.com> yeah, i think a typical pattern might be: exclude `dev/.*` and then someone might want to include `dev/something`, so that's likely the best order and i think we should stick with the docs and merge jpew's change. i'm almost ambivalent about the reno question, but i also would tend toward upgrade, so it sounds like 3 votes for upgrade. i'll +2 the change, but feel free to update it with the reno change and i'll reapply.
16:57 <@jpew:matrix.org> corvus: Well, that's sort of an interesting point, because in a certain sense "yes", that's how it works, but it does so by completely ignoring the exclude list
16:58 <@jpew:matrix.org> If you have include-branches... you don't need exclude-branches, because include-branches means include _only_ these branches
16:58 <@jpew:matrix.org> (which my patch doesn't change FWIW)
16:59 <@jpew:matrix.org> (the old code _also_ ignored any branch that wasn't in the include list)
17:02 <@jim:acmegating.com> yeah, i think the old code was trying for some kind of interaction between the two which it didn't quite accomplish. the only thing it let you do was exclude something that was included, which is the opposite of the docs, and the opposite of what i think would be the most common pattern.
17:03 <@jim:acmegating.com> `It will not exclude a branch which already matched include-branches.` from the docs is exactly the opposite of what it does
17:06 <@jim:acmegating.com> (Strike that. Reverse it.)
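[A minimal sketch of the documented priority being discussed, not Zuul's actual implementation; branch patterns are simplified here to full-match regexes:]

```python
import re

def branch_allowed(branch, include=(), exclude=()):
    """Sketch of the documented include/exclude-branches priority:
    a branch matching include-branches is kept even when it also
    matches exclude-branches (include wins over exclude)."""
    if any(re.fullmatch(p, branch) for p in include):
        return True
    if any(re.fullmatch(p, branch) for p in exclude):
        return False
    # Neither list matched: allow by default in this sketch.
    return True

# The pattern from the discussion: exclude dev/.*, but rescue dev/something
assert branch_allowed("dev/something", include=["dev/something"], exclude=[r"dev/.*"])
assert not branch_allowed("dev/other", include=["dev/something"], exclude=[r"dev/.*"])
```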
17:11 -@gerrit:opendev.org- Joshua Watt proposed: [zuul/zuul] 871197: Fix include-branches priority https://review.opendev.org/c/zuul/zuul/+/871197
18:00 <@clarkb:matrix.org> Zuulians: Ansible 7 (2.14) drops support for Python 3.8 on the control node. It requires 3.9-3.11. I may be getting ahead of myself, as Ansible 7 work hasn't started yet as far as I know, but I'm wondering if we bump 3.8 to 3.9 or maybe all the way to 3.10? https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html#control-node-requirements
18:00 <@clarkb:matrix.org> I guess the consideration there is that rhel 9 and friends are on python 3.9?
18:01 <@fungicide:matrix.org> conservatively, i'd say keep python 3.8 support until it's time to drop support for ansible 6
18:02 <@clarkb:matrix.org> fungi: historically that's been not long after adding ansible N+1 fwiw
18:02 <@clarkb:matrix.org> I think the two are tied together pretty closely
18:02 <@fungicide:matrix.org> obviously there's no point in testing with 3.8 after ansible 7 is the oldest we support, since they're incompatible, but until then people may want to be able to use platforms which don't have newer python
18:03 <@clarkb:matrix.org> we dropped 5 shortly after adding 6
18:06 -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 870430: Add a cache for image lists https://review.opendev.org/c/zuul/nodepool/+/870430
18:09 <@jim:acmegating.com> we should definitely add ansible 7 asap. once we have that, it's kind of a grey area whether we should continue supporting 3.8, since running ansible 7 on a 3.8-based system is not supported. i would tend to lean toward increasing zuul's minimum at the point where we add 7.
18:10 <@jim:acmegating.com> (this isn't a typical dependency situation, in that a deployer could choose to deploy on 3.8 and then a user can choose to run ansible 7)
18:10 <@clarkb:matrix.org> it may not even be possible to run ansible 7 in our test suite on 3.8 (though we could add test skips or something)
18:10 <@fungicide:matrix.org> good point, i didn't consider that we don't do specific ansible versions in different jobs
18:11 <@fungicide:matrix.org> though it's interesting that ansible has decided to drop support for python 3.8 now, when it's still got another 1.75 years before its final scheduled release
18:11 <@jim:acmegating.com> the jobs mirror the deployment choices, in that we have a high-version and a low-version job.
18:11 <@jim:acmegating.com> fungi: hrm, that is interesting.
18:12 <@clarkb:matrix.org> I suspect this is due to 3.9 being the rhel version
18:12 <@fungicide:matrix.org> if that were the impetus, you'd think they'd also be considering dropping support for running the ansible controller on any other distro
18:13 <@clarkb:matrix.org> old rhel uses ansible 2.9 or something. New rhel is all that matters going forward, type of deal
18:14 <@jim:acmegating.com> i think ansible is forcing our hand on at least 3.9. what are reasons to do more than that and go to 3.10+?
18:14 <@clarkb:matrix.org> corvus: the test suite is much faster on 3.10+
18:14 <@clarkb:matrix.org> it would be a selfish dev reason to reduce rtt on changes by about 30 minutes
18:15 <@jim:acmegating.com> i find this argument strangely compelling :)
18:15 <@fungicide:matrix.org> looking at alternative distros, the current python3 packages on debian stable (bullseye) and ubuntu lts (jammy) are at least 3.9 anyway, so maybe they consider rhel to be the most conservative server distro where python versions are concerned
18:16 <@jim:acmegating.com> so it sounds like 3.9 is in the bag; we probably should see if anyone has a need to keep 3.9 in support for zuul before we bump to 3.10 (which i realize is exactly why you started this convo, i guess i'm just re-articulating). might be worth a mailing list post to get more input.
18:17 <@clarkb:matrix.org> ya, I figured I'd bring this up sooner rather than later so that people can start thinking about it. I can draft up a mailing list email too
18:17 <@fungicide:matrix.org> jammy's default python is 3.10, and bookworm will probably be 3.11 unless it gets rolled back to 3.10 at the last minute
18:18 <@jim:acmegating.com> *so* many folks are running from container images now that i suspect there may even be a good chance we could just settle on a *single* version.
18:18 <@fungicide:matrix.org> maybe once bookworm is out, we drop 3.9? but i honestly have no idea if anyone deploys zuul on debian/stable with the system python anyway
18:19 <@fungicide:matrix.org> could be worth bringing up on the ml i guess
18:20 <@clarkb:matrix.org> I guess expanding on "selfish developer" more: those same sorts of improvements are likely seen in production too on busy systems. Could try to frame it as performance improvements for end users too if they drop old python.
19:32 <@clarkb:matrix.org> email sent
19:34 <@michael_kelly_anet:matrix.org> Is there some sort of generic webhook trigger available in Zuul? Context is that I'm working with pact.io and I want to be able to verify contracts with providers when they're updated - the pact workflow is to dispatch a webhook for this.
19:39 <@michael_kelly_anet:matrix.org> And, assuming there's no such mechanism, any objections to adding it?
19:54 <@fungicide:matrix.org> Michael Kelly: maybe have a look at the rtd role in zuul-jobs? pretty sure that does a lightweight ansible url call from the executor as a nodeless job
19:54 <@fungicide:matrix.org> to call rtd's publish webhook
19:54 <@fungicide:matrix.org> or do you mean in the reverse direction (have a zuul pipeline that is triggered by the pact.io actions)?
19:55 <@michael_kelly_anet:matrix.org> The latter
19:55 <@fungicide:matrix.org> that would normally be a new trigger driver, yeah
19:59 <@fungicide:matrix.org> https://zuul-ci.org/docs/zuul/latest/drivers/ lists all the current zuul drivers, and all of them except elasticsearch/mqtt/smtp have trigger components. as you surmised, there's no generic "i'm just a webhook" driver included at the moment
20:00 <@fungicide:matrix.org> at least a few of those drivers do have webhook implementations, but all of them needed more functionality than just the webhook
20:02 <@clarkb:matrix.org> I think part of the reason for that is that Zuul jobs are typically git-event driven. I'm not sure how you map that onto a generic system without git
20:03 <@fungicide:matrix.org> the timer trigger is probably the closest model we have for that. i guess a webhook driver would need some additional git context supplied in the pipeline configuration, e.g. what branch to run with or something
20:03 <@fungicide:matrix.org> also something has to map the webhook call to a project
20:04 <@fungicide:matrix.org> i have a feeling a "generic webhook" would turn into some sort of parser language for dealing with every service's unique flavor of webhook metadata encoding in order to suss out the relevant context for the pipeline
20:07 <@fungicide:matrix.org> but as i don't have to deal much with services that dispatch webhooks, i don't know that for sure (is there some convention for how you add information into a webhook?)
20:26 <@michael_kelly_anet:matrix.org> I think it's ok to actually impose a standard structure for the webhook - this is what Jenkins does, for example. I think expecting some sort of structure on this could make sense. My other debate is whether to just spend my cycles on a Kafka driver (which I need anyway for other reasons...)
20:27 <@michael_kelly_anet:matrix.org> In my example with pact, constraining it to something like (project, ref, pipeline) is totally fine because I can easily encode all of this in the webhook.
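[One way to picture the standard structure being proposed here - entirely hypothetical, since no such driver exists in Zuul today, and all field names are illustrative only:]

```python
import json

# Hypothetical payload shape for a generic webhook trigger: the caller
# supplies the git context (project, ref) that Zuul cannot infer on its
# own, plus the pipeline to enqueue into, and an opaque blob for the job.
payload = {
    "project": "example/provider-service",  # placeholder project name
    "ref": "refs/heads/main",               # branch/ref to run against
    "pipeline": "contract-verify",          # placeholder pipeline name
    "data": {"pact_version": "1.2.3"},      # passed through to the job unmodified
}
body = json.dumps(payload)
```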
20:28 <@michael_kelly_anet:matrix.org> The alternative would be to drive this against the REST endpoint for enqueuing, but I expect that's not an ideal approach.
20:29 <@michael_kelly_anet:matrix.org> (eg: there's no API-token-like mechanism that I'm aware of for this?)
20:32 <@clarkb:matrix.org> There are jwts
20:35 -@gerrit:opendev.org- Zuul merged on behalf of Joshua Watt: [zuul/zuul] 871197: Fix include-branches priority https://review.opendev.org/c/zuul/zuul/+/871197
20:39 -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 870430: Add a cache for image lists https://review.opendev.org/c/zuul/nodepool/+/870430
21:03 <@michael_kelly_anet:matrix.org> Clark: Assuming you're referring to the admin tokens as in https://zuul-ci.org/docs/zuul/latest/authentication.html ?
21:04 <@clarkb:matrix.org> correct
21:04 <@michael_kelly_anet:matrix.org> How does Zuul feel about the expiration of those tokens? :)
21:05 <@clarkb:matrix.org> you have to set an expiration date on them iirc
21:05 <@clarkb:matrix.org> I don't know if zuul enforces a max time to expiration though
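[For context on the tokens discussed above: per the authentication docs linked earlier, a Zuul admin token is an ordinary JWT carrying an `exp` claim and a `zuul.admin` claim listing the tenants it may administer. A stdlib-only sketch of that shape (the issuer/audience/secret values are placeholders, not a real deployment, and this hand-rolled encoding is for illustration - a real setup would use a JWT library):]

```python
import base64, hashlib, hmac, json, time

def make_zuul_admin_jwt(secret, tenant, user, ttl=600):
    # Sketch of the claim layout from Zuul's authentication docs
    # (iss/aud values here are placeholders for deployment config).
    def b64(data: bytes) -> bytes:
        return base64.urlsafe_b64encode(data).rstrip(b"=")

    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": "example-issuer",        # matches the authenticator config
        "aud": "zuul.example.com",      # matches the authenticator config
        "sub": user,
        "exp": now + ttl,               # expiry, as discussed above
        "zuul": {"admin": [tenant]},    # tenants this token may administer
    }
    signing_input = (
        b64(json.dumps(header).encode()) + b"." + b64(json.dumps(payload).encode())
    )
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return (signing_input + b"." + b64(sig)).decode()
```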
21:07 <@michael_kelly_anet:matrix.org> Right. Hmm. I'll have to explore this a bit. The other thing I think I'm missing here is how I get some arbitrary JSON blob conveyed to the underlying job so that it knows, eg, which contract version(s) to verify.
21:08 <@michael_kelly_anet:matrix.org> I think it'd be ok for it to receive that as an ansible variable and do absolutely no post-processing of it though.
21:08 <@michael_kelly_anet:matrix.org> Jenkins has some annoying fiddling around jenkins job parameters that's mildly fragile
21:14 <@michael_kelly_anet:matrix.org> Completely unrelated set of questions that came up wrt some discussions I was having around managing/maintaining our kubernetes Zuul instance(s) - what's the advantage of using multiple tenants vs, say, just using multiple Zuul deployments instead?
21:19 <@clarkb:matrix.org> a long time ago in a galaxy far far away, many of us worked in companies where every team/project/business unit had a different Jenkins server. And sometimes they had multiples that different people managed. This led to jenkins becoming old and outdated with no central process for getting configs updated. I think these experiences influenced a lot of Zuul's design, and one of the things that includes is trying to centralize CI by making it scalable and configurable by its users. Then you can have a single high-throughput system that is up to date, with known methods for modifying it. No one is surprised when someone's cube loses power and all the Jenkins builds stop :) But you still need to separate things logically, hence tenants
21:19 <@clarkb:matrix.org> Basically it's easier to manage a small bit of logical config for a service than it is to run 20 different versions of that service
21:21 <@michael_kelly_anet:matrix.org> Interesting take. This sounds very familiar... we sort of pendulum back and forth on this one. eg: Our mono-jenkins is a bit of a tire fire because the sheer number of users means that there's all kinds of conflicts wrt plugins and such.
21:22 <@michael_kelly_anet:matrix.org> And once the mono thing gets so big, the humans who were just part-time maintainers basically end up having their life be about running jenkins.
21:22 <@clarkb:matrix.org> you can also only have so many agents attached to a single jenkins. And its security boundaries are iffy. Also, it leaks threads onto the stack, so you OOM not because the heap is full but because the stack is full. (this is all decade-old info and may not be current). But it makes it really hard to run a single jenkins for more than like 100 people
21:23 <@clarkb:matrix.org> back in the day we ran 8 jenkins servers for openstack alone. And that was driven by the agents-per-server limits. I also had a cronjob that gracefully restarted them all weekly to get ahead of the stack OOM
21:24 <@michael_kelly_anet:matrix.org> fwiw, trying to occupy a middle ground between these two things, where it's like "no, I'm not going to support your custom requests" and "you can do that in your own universe, and here's how to make it easy while getting upgrades, backups, etc"
21:24 <@clarkb:matrix.org> Zuul executors are still constrained by how many nodes you can talk to, but you can scale executors horizontally
21:25 <@clarkb:matrix.org> Anyway, I think it ultimately boils down to: managing one thing is easier than managing a lot of different things.
21:25 <@michael_kelly_anet:matrix.org> Gotcha.
21:26 <@michael_kelly_anet:matrix.org> So I think the tl;dr really works out to how much configuration drift there is.
21:26 <@michael_kelly_anet:matrix.org> multiple tenants is probably a good place to start, and as people step on toes, maybe go multi-instance.
21:26 <@jim:acmegating.com> The tools of tenant configuration in Zuul allow you to set the dial anywhere between full local control and no local control for tenants.
21:27 <@jim:acmegating.com> The main reasons to have multiple instances are pretty esoteric (like low-level hardware trust, etc).
21:59 <@michael_kelly_anet:matrix.org> Interesting. Part of my calculation may end up being the impact scope of problems too. eg: if I nuke myself, I'd rather take down 20 people than 500 people
22:02 <@jim:acmegating.com> no promises, but we're pretty good on system uptime. usually when we accidentally nuke things it's with changes to config projects, and that's the same blast radius no matter how many zuuls you have.
22:03 <@jpew:matrix.org> We've had really good uptime running Zuul on kubernetes... I do a restart of one of the cluster servers every week for regular maintenance, and basically our users don't even notice
22:04 <@jpew:matrix.org> I don't even tell them I'm doing it anymore
22:04 <@michael_kelly_anet:matrix.org> Oh, for sure. Usually failure modes tend to be self-inflicted for most of what we build/run internally as well.
22:04 <@jim:acmegating.com> jpew: are you fully redundant (multi-scheduler?)
22:04 <@jpew:matrix.org> Yep
22:04 <@michael_kelly_anet:matrix.org> The real issue is that when we self-inflict those failure modes, I don't want to stop the whole company.
22:04 <@jpew:matrix.org> 3 schedulers/web instances, and 6 executors
22:05 <@michael_kelly_anet:matrix.org> jpew: Are you using the operator? :)
22:05 <@jpew:matrix.org> No
22:05 <@michael_kelly_anet:matrix.org> :(
22:06 <@jpew:matrix.org> I've used it before, but we have too much stuff we need to customize for it to be worth it here. We are using argo to manage the cluster
22:06 <@michael_kelly_anet:matrix.org> Gotcha.
22:06 <@jpew:matrix.org> Need like custom SSL certs and things like that; we run vanilla zuul images
22:06 <@michael_kelly_anet:matrix.org> Ah.
22:07 <@michael_kelly_anet:matrix.org> We're using Ambassador for ingress (different set of problems), so it sort of sidesteps that whole problem.
22:07 <@michael_kelly_anet:matrix.org> But it introduces its own problems.
22:07 <@jpew:matrix.org> I ran zuul on my k8s cluster at home for a while, and I used the operator there to good effect
22:07 <@michael_kelly_anet:matrix.org> Keen.
22:08 <@jpew:matrix.org> ... but it was also why I knew _not_ to use it in the more restricted work environment :)
22:08 <@michael_kelly_anet:matrix.org> hah
22:09 <@michael_kelly_anet:matrix.org> I'm still hacking on it myself - trying to make it more broadly functional to cover cases that I want.
22:09 <@jpew:matrix.org> I think it's gotten better recently though; that was almost 2 years ago and it wasn't getting as much maintenance
22:09 <@jpew:matrix.org> Ya, that's good
22:11 <@michael_kelly_anet:matrix.org> If you're bored, feel free to pull my commit stack and play around.
22:11 <@jpew:matrix.org> Well, I'm not running it at home any more, so I don't have as much need
22:11 <@jpew:matrix.org> But if I do, I will
22:12 <@michael_kelly_anet:matrix.org> hah
22:12 <@michael_kelly_anet:matrix.org> Fair enough.
22:12 <@michael_kelly_anet:matrix.org> I have to bring my home cluster back to life.
22:25 <@michael_kelly_anet:matrix.org> Also, if you have anything that would make the operator more useful to you in the corporate space, feel free to nudge me. I might be inclined to hack it out if it's sufficiently congruent with my own needs
23:48 <@clarkb:matrix.org> jpew: ya, OpenDev does weekly rolling restarts, and same deal - we don't announce it or anything; it just happens and we stay up to date

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!