Friday, 2023-01-20

03:07 <@iwienand:matrix.org> > <@g_gobi:matrix.org> One of my use cases is that I'm running my testing using Zuul. I want to check the infrastructure after some of the test suites. For example, if any test suite fails, pause the execution and debug it.
      Zuul allows for that via autoholds; see https://zuul-ci.org/docs/zuul-client/commands.html ; e.g. https://zuul.openstack.org/autoholds
03:25 <@g_gobi:matrix.org> Thanks ianw: I'll take a look
04:30 <@jim:acmegating.com> tdlaw: i made a video about that: https://www.youtube.com/watch?v=WEQ0X3SGD0g
08:04 -@gerrit:opendev.org- Per Wiklund proposed wip: [zuul/nodepool] 871102: Introduce driver for openshift virtualmachines https://review.opendev.org/c/zuul/nodepool/+/871102
14:31 -@gerrit:opendev.org- Simon Westphahl proposed:
      - [zuul/zuul] 871106: Require latest layout for processing mgmt events https://review.opendev.org/c/zuul/zuul/+/871106
      - [zuul/zuul] 871107: Periodically cleanup leaked pipeline state https://review.opendev.org/c/zuul/zuul/+/871107
      - [zuul/zuul] 871108: Cleanup deleted pipelines and event queues https://review.opendev.org/c/zuul/zuul/+/871108
14:35 <@fungicide:matrix.org> > <@g_gobi:matrix.org> Thanks ianw: I'll take a look
      note that autoholds don't pause mid-execution of the build, but rather instruct nodepool not to delete the nodes after the build completes with a failure result
14:36 <@fungicide:matrix.org> but more generally, inspecting the state of test environments after a failure is what that feature is mostly used for
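[As a concrete sketch of the autohold workflow described above (flag names as given in the zuul-client command docs linked earlier; the tenant/project/job/reason values here are placeholders, not a real deployment):]

```shell
# Hold the nodes of the next failing build of a job, so the test
# environment can be inspected after the failure (placeholder names):
zuul-client autohold \
    --tenant example-tenant \
    --project example/project \
    --job integration-tests \
    --reason "debugging flaky integration tests" \
    --count 1
```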
14:43 -@gerrit:opendev.org- Ade Lee proposed: [zuul/zuul-jobs] 866881: Add ubuntu to enable-fips role https://review.opendev.org/c/zuul/zuul-jobs/+/866881
14:45 -@gerrit:opendev.org- Joshua Watt proposed: [zuul/zuul] 871197: Fix include-branches priority https://review.opendev.org/c/zuul/zuul/+/871197
15:25 <@fungicide:matrix.org> has anyone else observed instability in the roles.upload-logs-base.library.test_zuul_ibm_upload.TestUpload.test_upload_result test? i just saw it fail the same way in py38 and py311 for a change while passing on py39 and py310: https://review.opendev.org/c/zuul/zuul-jobs/+/866881
15:29 <@jim:acmegating.com> fungi: not an answer, but some context: a lot of that test is actually shared between all the roles, so it may or may not be ibm-specific (it could be that test race timing makes it more likely to show up there right now).
15:29 <@fungicide:matrix.org> got it. interesting that it would hit the exact same test both times in that case
15:30 <@fungicide:matrix.org> i looked through the build history, but it seems like zuul-jobs has been somewhat quiet, so not a lot of data points
15:42 <@g_gobi:matrix.org> > <@fungicide:matrix.org> note that autoholds don't pause mid-execution of the build, but rather instruct nodepool not to delete the nodes after the build completes with a failure result
      fungi: thanks for the information
16:43 <@clarkb:matrix.org> jpew: re https://review.opendev.org/c/zuul/zuul/+/871197 I agree the docs and behavior appear out of sync, and that change should fix that. I wonder if we should put the release notes under the "upgrade:" heading to make it clear this is a behavior change. It is a bug fix relative to the docs, but it is also potentially a behavior change.
16:44 <@jpew:matrix.org> Clark: Ya, that works for me. Probably with a note on how to check your configuration when upgrading?
16:46 <@clarkb:matrix.org> ++ Might also be good to sanity check with corvus that the docs aren't at fault here (however, having explicit include override exclude makes sense to me, so I think the docs are correct)
16:55 <@jim:acmegating.com> yeah, i think a typical pattern might be: exclude `dev/.*` and then someone might want to include `dev/something`, so that's likely the best order and i think we should stick with the docs and merge jpew's change. i'm almost ambivalent about the reno question, but i also would tend toward upgrade, so it sounds like 3 votes for upgrade. i'll +2 the change, but feel free to update it with the reno change and i'll reapply.
16:57 <@jpew:matrix.org> corvus: Well, that's sort of an interesting point, because in a certain sense "yes", that's how it works, but it does so by completely ignoring the exclude list
16:58 <@jpew:matrix.org> If you have include-branches... you don't need exclude-branches, because include-branches means include _only_ these branches
16:58 <@jpew:matrix.org> (which my patch doesn't change FWIW)
16:59 <@jpew:matrix.org> (the old code _also_ ignored any branch that wasn't in the include list)
17:02 <@jim:acmegating.com> yeah, i think the old code was trying for some kind of interaction between the two which it didn't quite accomplish. the only thing it let you do was exclude something that was included, which is the opposite of the docs, and the opposite of what i think would be the most common pattern.
17:03 <@jim:acmegating.com> `It will not exclude a branch which already matched include-branches.` from the docs is exactly the opposite of what it does
17:06 <@jim:acmegating.com> (Strike that. Reverse it.)
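[A minimal sketch of the documented priority being discussed, not Zuul's actual implementation; branch patterns are simplified here to full-match regexes:]

```python
import re

def branch_allowed(branch, include=(), exclude=()):
    """Sketch of the documented include/exclude-branches priority:
    a branch matching include-branches is kept even when it also
    matches exclude-branches (include wins over exclude)."""
    if any(re.fullmatch(p, branch) for p in include):
        return True
    if any(re.fullmatch(p, branch) for p in exclude):
        return False
    # Neither list matched: allow by default in this sketch.
    return True

# The pattern from the discussion: exclude dev/.*, but rescue dev/something
assert branch_allowed("dev/something", include=["dev/something"], exclude=[r"dev/.*"])
assert not branch_allowed("dev/other", include=["dev/something"], exclude=[r"dev/.*"])
```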
17:11 -@gerrit:opendev.org- Joshua Watt proposed: [zuul/zuul] 871197: Fix include-branches priority https://review.opendev.org/c/zuul/zuul/+/871197
18:00 <@clarkb:matrix.org> Zuulians: Ansible 7 (2.14) drops support for Python 3.8 on the control node. It requires 3.9-3.11. I may be getting ahead of myself, as Ansible 7 work hasn't started yet as far as I know, but I'm wondering if we bump 3.8 to 3.9 or maybe all the way to 3.10? https://docs.ansible.com/ansible/latest/installation_guide/intro_installation.html#control-node-requirements
18:00 <@clarkb:matrix.org> I guess the consideration there is that rhel 9 and friends are on python 3.9?
18:01 <@fungicide:matrix.org> conservatively, i'd say keep python 3.8 support until it's time to drop support for ansible 6
18:02 <@clarkb:matrix.org> fungi: historically that's been not long after adding ansible N+1 fwiw
18:02 <@clarkb:matrix.org> I think the two are tied together pretty closely
18:02 <@fungicide:matrix.org> obviously there's no point in testing with 3.8 after ansible 7 is the oldest we support, since they're incompatible, but until then people may want to be able to use platforms which don't have newer python
18:03 <@clarkb:matrix.org> we dropped 5 shortly after adding 6
18:06 -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 870430: Add a cache for image lists https://review.opendev.org/c/zuul/nodepool/+/870430
18:09 <@jim:acmegating.com> we should definitely add ansible 7 asap. once we have that, it's kind of a grey area whether we should continue supporting 3.8, since running ansible 7 on a 3.8-based system is not supported. i would tend to lean toward increasing zuul's minimum at the point where we add 7.
18:10 <@jim:acmegating.com> (this isn't a typical dependency situation, in that a deployer could choose to deploy on 3.8 and then a user can choose to run ansible 7)
18:10 <@clarkb:matrix.org> it may not even be possible to run ansible 7 in our test suite on 3.8 (though we could add test skips or something)
18:10 <@fungicide:matrix.org> good point, i didn't consider that we don't do specific ansible versions in different jobs
18:11 <@fungicide:matrix.org> though it's interesting that ansible has decided to drop support for python 3.8 now, when it's still got another 1.75 years before its final scheduled release
18:11 <@jim:acmegating.com> the jobs mirror the deployment choices, in that we have a high-version and a low-version job.
18:11 <@jim:acmegating.com> fungi: hrm, that is interesting.
18:12 <@clarkb:matrix.org> I suspect this is due to 3.9 being the rhel version
18:12 <@fungicide:matrix.org> if that were the impetus, you'd think they'd also be considering dropping support for running the ansible controller on any other distro
18:13 <@clarkb:matrix.org> old rhel uses ansible 2.9 or something. New rhel is all that matters going forward, type of deal
18:14 <@jim:acmegating.com> i think ansible is forcing our hand on at least 3.9. what are reasons to do more than that and go to 3.10+?
18:14 <@clarkb:matrix.org> corvus: the test suite is much faster on 3.10+
18:14 <@clarkb:matrix.org> it would be a selfish dev reason to reduce rtt on changes by about 30 minutes
18:15 <@jim:acmegating.com> i find this argument strangely compelling :)
18:15 <@fungicide:matrix.org> looking at alternative distros, the current python3 packages on debian stable (bullseye) and ubuntu lts (jammy) are at least 3.9 anyway, so maybe they consider rhel to be the most conservative server distro where python versions are concerned
18:16 <@jim:acmegating.com> so it sounds like 3.9 is in the bag; we probably should see if anyone has a need to keep 3.9 in support for zuul before we bump to 3.10 (which i realize is exactly why you started this convo, i guess i'm just re-articulating). might be worth a mailing list post to get more input.
18:17 <@clarkb:matrix.org> ya, I figured I'd bring this up sooner rather than later so that people can start thinking about it. I can draft up a mailing list email too
18:17 <@fungicide:matrix.org> jammy's default python is 3.10, and bookworm will probably be 3.11 unless it gets rolled back to 3.10 at the last minute
18:18 <@jim:acmegating.com> *so* many folks are running from container images now that i suspect there may even be a good chance we could just settle on a *single* version.
18:18 <@fungicide:matrix.org> maybe once bookworm is out, we drop 3.9? but i honestly have no idea if anyone deploys zuul on debian/stable with the system python anyway
18:19 <@fungicide:matrix.org> could be worth bringing up on the ml i guess
18:20 <@clarkb:matrix.org> I guess expanding on "selfish developer" more: those same sorts of improvements are likely seen in production too on busy systems. Could try to frame it as performance improvements for end users too if they drop old python.
19:32 <@clarkb:matrix.org> email sent
19:34 <@michael_kelly_anet:matrix.org> Is there some sort of generic webhook trigger available in Zuul? Context is that I'm working with pact.io and I want to be able to verify contracts with providers when they're updated - the pact workflow is to dispatch a webhook for this.
19:39 <@michael_kelly_anet:matrix.org> And, assuming there's no such mechanism, any objections to adding it?
19:54 <@fungicide:matrix.org> Michael Kelly: maybe have a look at the rtd role in zuul-jobs? pretty sure that does a lightweight ansible url call from the executor as a nodeless job
19:54 <@fungicide:matrix.org> to call rtd's publish webhook
19:54 <@fungicide:matrix.org> or do you mean in the reverse direction (have a zuul pipeline that is triggered by the pact.io actions)?
19:55 <@michael_kelly_anet:matrix.org> The latter
19:55 <@fungicide:matrix.org> that would normally be a new trigger driver, yeah
19:59 <@fungicide:matrix.org> https://zuul-ci.org/docs/zuul/latest/drivers/ lists all the current zuul drivers, and all of them except elasticsearch/mqtt/smtp have trigger components. as you surmised, there's no generic "i'm just a webhook" driver included at the moment
20:00 <@fungicide:matrix.org> at least a few of those drivers do have webhook implementations, but all of them needed more functionality than just the webhook
20:02 <@clarkb:matrix.org> I think part of the reason for that is that Zuul jobs are typically git-event driven. I'm not sure how you map that onto a generic system without git
20:03 <@fungicide:matrix.org> the timer trigger is probably the closest model we have for that. i guess a webhook driver would need some additional git context supplied in the pipeline configuration, e.g. what branch to run with or something
20:03 <@fungicide:matrix.org> also something has to map the webhook call to a project
20:04 <@fungicide:matrix.org> i have a feeling a "generic webhook" would turn into some sort of parser language for dealing with every service's unique flavor of webhook metadata encoding in order to suss out the relevant context for the pipeline
20:07 <@fungicide:matrix.org> but as i don't have to deal much with services that dispatch webhooks, i don't know that for sure (is there some convention for how you add information into a webhook?)
20:26 <@michael_kelly_anet:matrix.org> I think it's ok to actually impose a standard structure for the webhook - this is what Jenkins does, for example. I think expecting some sort of structure on this could make sense. My other debate is whether to just spend my cycles on a Kafka driver (which I need anyway for other reasons...)
20:27 <@michael_kelly_anet:matrix.org> In my example with pact, constraining it to something like (project, ref, pipeline) is totally fine because I can easily encode all of this in the webhook.
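[One way to picture the standard structure being proposed here - entirely hypothetical, since no such driver exists in Zuul today, and all field names are illustrative only:]

```python
import json

# Hypothetical payload shape for a generic webhook trigger: the caller
# supplies the git context (project, ref) that Zuul cannot infer on its
# own, plus the pipeline to enqueue into, and an opaque blob for the job.
payload = {
    "project": "example/provider-service",  # placeholder project name
    "ref": "refs/heads/main",               # branch/ref to run against
    "pipeline": "contract-verify",          # placeholder pipeline name
    "data": {"pact_version": "1.2.3"},      # passed through to the job unmodified
}
body = json.dumps(payload)
```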
20:28 <@michael_kelly_anet:matrix.org> The alternative would be to drive this against the REST endpoint for enqueuing, but I expect that's not an ideal approach.
20:29 <@michael_kelly_anet:matrix.org> (eg: there's no API-token-like mechanism that I'm aware of for this?)
20:32 <@clarkb:matrix.org> There are jwts
20:35 -@gerrit:opendev.org- Zuul merged on behalf of Joshua Watt: [zuul/zuul] 871197: Fix include-branches priority https://review.opendev.org/c/zuul/zuul/+/871197
20:39 -@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 870430: Add a cache for image lists https://review.opendev.org/c/zuul/nodepool/+/870430
21:03 <@michael_kelly_anet:matrix.org> Clark: Assuming you're referring to the admin tokens as in https://zuul-ci.org/docs/zuul/latest/authentication.html ?
21:04 <@clarkb:matrix.org> correct
21:04 <@michael_kelly_anet:matrix.org> How does Zuul feel about the expiration of those tokens? :)
21:05 <@clarkb:matrix.org> you have to set an expiration date on them iirc
21:05 <@clarkb:matrix.org> I don't know if zuul enforces a max time to expiration though
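[For context on the tokens discussed above: per the authentication docs linked earlier, a Zuul admin token is an ordinary JWT carrying an `exp` claim and a `zuul.admin` claim listing the tenants it may administer. A stdlib-only sketch of that shape (the issuer/audience/secret values are placeholders, not a real deployment, and this hand-rolled encoding is for illustration - a real setup would use a JWT library):]

```python
import base64, hashlib, hmac, json, time

def make_zuul_admin_jwt(secret, tenant, user, ttl=600):
    # Sketch of the claim layout from Zuul's authentication docs
    # (iss/aud values here are placeholders for deployment config).
    def b64(data: bytes) -> bytes:
        return base64.urlsafe_b64encode(data).rstrip(b"=")

    header = {"alg": "HS256", "typ": "JWT"}
    now = int(time.time())
    payload = {
        "iss": "example-issuer",        # matches the authenticator config
        "aud": "zuul.example.com",      # matches the authenticator config
        "sub": user,
        "exp": now + ttl,               # expiry, as discussed above
        "zuul": {"admin": [tenant]},    # tenants this token may administer
    }
    signing_input = (
        b64(json.dumps(header).encode()) + b"." + b64(json.dumps(payload).encode())
    )
    sig = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return (signing_input + b"." + b64(sig)).decode()
```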
21:07 <@michael_kelly_anet:matrix.org> Right. Hmm. I'll have to explore this a bit. The other thing I think I'm missing here is how I get some arbitrary JSON blob conveyed to the underlying job so that it knows, eg, which contract version(s) to verify.
21:08 <@michael_kelly_anet:matrix.org> I think it'd be ok for it to receive that as an ansible variable and do absolutely no post-processing of it though.
21:08 <@michael_kelly_anet:matrix.org> Jenkins has some annoying fiddling around jenkins job parameters that's mildly fragile
21:14 <@michael_kelly_anet:matrix.org> Completely unrelated set of questions that came up wrt some discussions I was having around managing/maintaining our kubernetes Zuul instance(s) - what's the advantage of using multiple tenants vs, say, just using multiple Zuul deployments instead?
21:19 <@clarkb:matrix.org> a long time ago in a galaxy far far away, many of us worked in companies where every team/project/business unit had a different Jenkins server. And sometimes they had multiples that different people managed. This led to jenkins becoming old and outdated with no central process for getting configs updated. I think these experiences influenced a lot of Zuul's design, and one of the things that includes is trying to centralize CI by making it scalable and configurable by its users. Then you can have a single high-throughput system that is up to date, with known methods for modifying it. No one is surprised when someone's cube loses power and all the Jenkins builds stop :) But you still need to separate things logically, hence tenants
21:19 <@clarkb:matrix.org> Basically it's easier to manage a small bit of logical config for a service than it is to run 20 different versions of that service
21:21 <@michael_kelly_anet:matrix.org> Interesting take. This sounds very familiar... we sort of pendulum back and forth on this one. eg: Our mono-jenkins is a bit of a tire fire because the sheer number of users means that there's all kinds of conflicts wrt plugins and such.
21:22 <@michael_kelly_anet:matrix.org> And once the mono thing gets so big, the humans who were just part-time maintainers basically end up having their life be about running jenkins.
21:22 <@clarkb:matrix.org> you can also only have so many agents attached to a single jenkins. And its security boundaries are iffy. Also, it leaks threads onto the stack, so you OOM not because the heap is full but because the stack is full. (this is all decade-old info and may not be current). But it makes it really hard to run a single jenkins for more than like 100 people
21:23 <@clarkb:matrix.org> back in the day we ran 8 jenkins servers for openstack alone. And that was driven by the agents-per-server limits. I also had a cronjob that gracefully restarted them all weekly to get ahead of the stack OOM
21:24 <@michael_kelly_anet:matrix.org> fwiw, trying to occupy a middle ground between these two things, where it's like "no, I'm not going to support your custom requests" and "you can do that in your own universe, and here's how to make it easy while getting upgrades, backups, etc"
21:24 <@clarkb:matrix.org> Zuul executors are still constrained by how many nodes you can talk to, but you can scale executors horizontally
21:25 <@clarkb:matrix.org> Anyway, I think it ultimately boils down to: managing one thing is easier than managing a lot of different things.
21:25 <@michael_kelly_anet:matrix.org> Gotcha.
21:26 <@michael_kelly_anet:matrix.org> So I think the tl;dr really works out to how much configuration drift there is.
21:26 <@michael_kelly_anet:matrix.org> multiple tenants is probably a good place to start, and as people step on toes, maybe go multi-instance.
21:26 <@jim:acmegating.com> The tools of tenant configuration in Zuul allow you to set the dial anywhere between full local control and no local control for tenants.
21:27 <@jim:acmegating.com> The main reasons to have multiple instances are pretty esoteric (like low-level hardware trust, etc).
21:59 <@michael_kelly_anet:matrix.org> Interesting. Part of my calculation may end up being the impact scope of problems too. eg: if I nuke myself, I'd rather take down 20 people than 500 people
22:02 <@jim:acmegating.com> no promises, but we're pretty good on system uptime. usually when we accidentally nuke things it's with changes to config projects, and that's the same blast radius no matter how many zuuls you have.
22:03 <@jpew:matrix.org> We've had really good uptime running Zuul on kubernetes... I do a restart of one of the cluster servers every week for regular maintenance, and basically our users don't even notice
22:04 <@jpew:matrix.org> I don't even tell them I'm doing it anymore
22:04 <@michael_kelly_anet:matrix.org> Oh, for sure. Usually failure modes tend to be self-inflicted for most of what we build/run internally as well.
22:04 <@jim:acmegating.com> jpew: are you fully redundant (multi-scheduler?)
22:04 <@jpew:matrix.org> Yep
22:04 <@michael_kelly_anet:matrix.org> The real issue is that when we self-inflict those failure modes, I don't want to stop the whole company.
22:04 <@jpew:matrix.org> 3 schedulers/web instances, and 6 executors
22:05 <@michael_kelly_anet:matrix.org> jpew: Are you using the operator? :)
22:05 <@jpew:matrix.org> No
22:05 <@michael_kelly_anet:matrix.org> :(
22:06 <@jpew:matrix.org> I've used it before, but we have too much stuff we need to customize for it to be worth it here. We are using argo to manage the cluster
22:06 <@michael_kelly_anet:matrix.org> Gotcha.
22:06 <@jpew:matrix.org> Need like custom SSL certs and things like that; we run vanilla zuul images
22:06 <@michael_kelly_anet:matrix.org> Ah.
22:07 <@michael_kelly_anet:matrix.org> We're using Ambassador for ingress (different set of problems), so it sort of sidesteps that whole problem.
22:07 <@michael_kelly_anet:matrix.org> But it introduces its own problems.
22:07 <@jpew:matrix.org> I ran zuul on my k8s cluster at home for a while, and I used the operator there to good effect
22:07 <@michael_kelly_anet:matrix.org> Keen.
22:08 <@jpew:matrix.org> ... but it was also why I knew _not_ to use it in the more restricted work environment :)
22:08 <@michael_kelly_anet:matrix.org> hah
22:09 <@michael_kelly_anet:matrix.org> I'm still hacking on it myself - trying to make it more broadly functional to cover cases that I want.
22:09 <@jpew:matrix.org> I think it's gotten better recently though; that was almost 2 years ago and it wasn't getting as much maintenance
22:09 <@jpew:matrix.org> Ya, that's good
22:11 <@michael_kelly_anet:matrix.org> If you're bored, feel free to pull my commit stack and play around.
22:11 <@jpew:matrix.org> Well, I'm not running it at home any more, so I don't have as much need
22:11 <@jpew:matrix.org> But if I do, I will
22:12 <@michael_kelly_anet:matrix.org> hah
22:12 <@michael_kelly_anet:matrix.org> Fair enough.
22:12 <@michael_kelly_anet:matrix.org> I have to bring my home cluster back to life.
22:25 <@michael_kelly_anet:matrix.org> Also, if you have anything that would make the operator more useful to you in the corporate space, feel free to nudge me. I might be inclined to hack it out if it's sufficiently congruent with my own needs
23:48 <@clarkb:matrix.org> jpew: ya, OpenDev does weekly rolling restarts, and same deal - we don't announce it or anything; it just happens and we stay up to date

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!