Monday, 2021-10-18

-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed:00:40
- [zuul/zuul] 814243: Make FrozenJob a ZKObject https://review.opendev.org/c/zuul/zuul/+/814243
- [zuul/zuul] 814329: Implement frozen job serialization/deserialization https://review.opendev.org/c/zuul/zuul/+/814329
@tibeer:matrix.orgHi guys, is there something planned for this weeks PTG?04:22
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul] 814354: test https://review.opendev.org/c/zuul/zuul/+/81435406:12
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul] 814354: test https://review.opendev.org/c/zuul/zuul/+/81435406:42
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 813014: wip: only create build set when needed https://review.opendev.org/c/zuul/zuul/+/81301407:38
-@gerrit:opendev.org- Felix Edel proposed: [zuul/zuul] 760805: Add /components API endpoint to zuul-web https://review.opendev.org/c/zuul/zuul/+/76080508:11
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 813014: wip: only create build set when needed https://review.opendev.org/c/zuul/zuul/+/81301409:42
-@gerrit:opendev.org- Simon Westphahl proposed: [zuul/zuul] 813014: wip: only create build set when needed https://review.opendev.org/c/zuul/zuul/+/81301411:55
@westphahl:matrix.orgcorvus: one comment on 81428112:45
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul-jobs] 813034: Implement role for limiting zuul log file size https://review.opendev.org/c/zuul/zuul-jobs/+/81303413:06
@jpew:matrix.orgMy builds are timing out after 4 hours, and I cannot figure out why. I have both the tenant max-job-timeout and the job timeout set longer than this, but it still stops13:27
@jpew:matrix.orgIs there some other timeout I'm unaware of?13:27
@avass:vassast.orgjpew: there's also `post-timeout`: https://www.zuul-ci.org/docs/zuul/reference/job_def.html#attr-job.post-timeout13:38
@jpew:matrix.orgIt's before the post-playbook13:39
@jpew:matrix.orghttps://zuul.wattissoftware.com/t/wattissoftware-zuul/build/85a653631dcc49539134838c9979b916/console13:39
@avass:vassast.orgjpew: that doesn't look like a zuul timeout13:40
@avass:vassast.orghere's an example from opendev: https://zuul.opendev.org/t/openstack/build/3169b9ae251a474daa1d7220097924a413:40
@jpew:matrix.orgDoes ansible have a per-task timeout maybe?13:41
@avass:vassast.orgI don't think there is13:42
@fungicide:matrix.org> <@tibeer:matrix.org> Hi guys, is there something planned for this weeks PTG?13:47
zuul isn't officially scheduled for any ptg sessions, but at least a few of the zuul maintainers and other contributors will certainly be in some sessions. i'm in an openstack qa team session now, diversity and inclusion working group next, and so on... the opendev collaboratory has a couple of hours blocked out on wednesday and we'll probably be talking about some things related to our deployment of zuul there
@avass:vassast.orgjpew: maybe there's something with sshd limiting connections to 4h?13:48
@jpew:matrix.orgAh, it looks like kubelet has a default 4 hour timeout on commands13:48
@avass:vassast.orgoh or kubelet13:48
@tibeer:matrix.org> <@fungicide:matrix.org> zuul isn't officially scheduled for any ptg sessions, but at least a few of the zuul maintainers and other contributors will certainly be in some sessions. i'm in an openstack qa team session now, diversity and inclusion working group next, and so on... the opendev collaboratory has a couple of hours blocked out on wednesday and we'll probably be talking about some things related to our deployment of zuul there13:49
Okay, wasn't aware of it. I wanted to talk about the zuul-operator. Wasn't it you I chatted with when you were off for vacation?
@fungicide:matrix.orgi honestly don't recall ;)13:50
@fungicide:matrix.orgi don't know much about the operator, to be honest13:50
@avass:vassast.orgVolvo is working on using it afaik ;)13:51
-@gerrit:opendev.org- Felix Edel proposed:14:19
- [zuul/zuul] 760805: Add /components API endpoint to zuul-web https://review.opendev.org/c/zuul/zuul/+/760805
- [zuul/zuul] 760806: UI: Add actions and reducers to retrieve components https://review.opendev.org/c/zuul/zuul/+/760806
- [zuul/zuul] 760807: UI: Add components page https://review.opendev.org/c/zuul/zuul/+/760807
@tristanc_:matrix.orgperhaps we could arrange an adhoc meetpad to discuss the operator roadmap?14:33
Albin Vass corvus Tim Beermann I'm available this week (UTC-5 tz)
@clarkb:matrix.orgjpew: Zuul does log pretty explicitly when it has hit its own timeouts on a job. That can help you determine whether or not it was zuul or something under it hitting a timeout. Though sounds like you found it was kubelet.14:35
@tibeer:matrix.org> <@tristanc_:matrix.org> perhaps we could arrange an adhoc meetpad to discuss the operator roadmap?14:36
> Albin Vass corvus Tim Beermann I'm available this week (UTC-5 tz)
This would be great. I'm almost every time available during normal UTC times.
@jim:acmegating.comwhat would the topic be?14:36
@jim:acmegating.comto have a productive ptg session, we should have an agenda and a goal for an outcome14:37
@tibeer:matrix.orgAs far as I know, the Zuul operator is currently not working, so the topic could be: Get Zuul Operator working and production ready?14:38
@tristanc_:matrix.orgMeeting zuul-operator users sounds like a nice goal/outcome to me :) And perhaps we could gather feedback and define the roadmap for the next release?14:39
@jim:acmegating.comare you having trouble using it?  i'm not aware of issues with it...14:39
@jim:acmegating.comtristanC: sure, i mean, that's sort of a birds of a feather session rather than a design session...14:39
@jim:acmegating.comjust want to know what people are expecting14:40
@jpew:matrix.orgI'm using the operator, but I had to make a few changes to get it working: https://review.opendev.org/q/project:zuul/zuul-operator+status:open+owner:JPEWhacker%2540gmail.com14:40
@tibeer:matrix.orgcorvus: yes, i have issues setting it up. But it's also lacking more stuff like nodepool etc.14:40
@jim:acmegating.comit should manage nodepool launchers14:41
@tibeer:matrix.orgMaybe we can also define a session with the outcome of documenting how to use the operator ^^14:42
@tibeer:matrix.orgI have a Setup working with docker-compose. But we are currently migrating heavily to kubernetes, so getting the operator running is highly wanted 14:43
@jim:acmegating.comthere's some documentation here... https://zuul-ci.org/docs/zuul-operator/#simple-example14:43
-@gerrit:opendev.org- Dong Zhang proposed: [zuul/zuul] 814437: In github report, using warning emoji for CANCELED job https://review.opendev.org/c/zuul/zuul/+/81443714:44
@tibeer:matrix.orgI've read that one a couple of months ago. I'll give it a try again and come back to you when i ran into issues and why.14:44
@jim:acmegating.comso it sounds like there isn't a proposal for new work, but maybe an interest in an operator-related birds-of-a-feather session?14:46
@jim:acmegating.comto share experiences and chat about it in general?14:46
@tibeer:matrix.orgWould also help14:47
@jim:acmegating.comhow about 1400 utc on thursday?14:48
@jim:acmegating.comthat should give us enough time to send out an email to zuul-discuss and give people some notice14:49
@tibeer:matrix.orgwould fit for me14:50
@tristanc_:matrix.orgcorvus: that works for me, perhaps this could be extended to any zuul-related topics?14:50
@tibeer:matrix.org * would fit for me :)14:50
@jim:acmegating.comjpew: ^ interested?  does that time work for you?14:52
@jpew:matrix.orgYes, and yes. Thanks!14:52
@jim:acmegating.comtristanC: sure, maybe we can give operator first billing and then open the floor for other topics14:54
@jim:acmegating.comavass: ^ i assume that time is ok for you too?14:54
@jim:acmegating.comClark, fungi: is there something we should do to make an official PTG session or something?14:55
@jim:acmegating.comor should we just send an email with a date/time and meetpad link?14:55
@clarkb:matrix.orgIf you want to get on the official schedule with the etherpad links and all that then I think you need to tell ptgbot about your activity15:15
@clarkb:matrix.orgbut if you don't care about that an email with a date/time and meetpad link should be sufficient15:15
@fungicide:matrix.orgyeah, i can add a zuul track (as a chanop in #openinfra-events irc) if desired, and then we can "book" an available room on the schedule if people think that will help, though all the slots are already booked at that particular time. i agree just having a meetpad room at a time that's convenient for the parties involved ought to be enough15:20
@clarkb:matrix.orgAs a heads up I intend on pushing a gear 0.16.0 release today. Zuul has pinned this dep so up to date zuul shouldn't have problems with that (nor do we really expect issues anyway but you've been warned)15:21
@jim:acmegating.comClark: thanks15:22
@jim:acmegating.comfungi: i guess if all the slots are taken we can just skip the ptg thing and send an email15:22
@avass:vassast.orgcorvus: yeah I could probably join in around that time, just gotta make sure I'm not booked with anything else :)15:23
@jim:acmegating.comAlbin Vass Tim Beermann tristanC jpew i sent an email about the meetup http://lists.zuul-ci.org/pipermail/zuul-discuss/2021-October/001735.html16:09
@tristanc_:matrix.orgcorvus: thanks, i'm looking forward to it!16:12
@jim:acmegating.comClark, fungi, swest, tobiash: okay i think i understand the stuck queue issue in opendev.  i think we raced a cache update and cleanup.17:40
@jim:acmegating.com1) several hours before the first error, some event caused us to query gerrit for change 814237 and put it in the change cache17:40
2) for a while, it was in some pipelines which caused it to stay in the 'relevant' set and therefore stay in the cache
3) it left the pipelines it was in
4) a cache prune process started and calculated the relevant set (it was not in this set) and also the changes that were last modified > 1 hour ago (it was in this set). this combination means it's subject to pruning.
5) the cache cleanup starts slowly deleting changes (this takes about 3 minutes)
6) an event arrives for our change. gerrit is queried and the updated change is inserted into the cache
7) the cache cleanup method gets around to deleting our change from the cache
8) subsequent queue processes can't find the change in the cache
we should be able to reduce the race window from minutes to milliseconds by checking the last_modified time again right before deletion. i'm not sure i see a way to eliminate it without stopping the world.
@jim:acmegating.comi'll work on a change for that17:41
@fungicide:matrix.orga toctou sort of race i guess. is it possible we could make the pipeline manager more robust in the face of disappearing change cache entries too?17:46
@jim:acmegating.comfungi: that would be ideal, however it's not clear what we should do.  it's tempting to repopulate the cache, but the system pretty much assumes that change items are singletons and things start to get weird if they aren't.  we could work to change that.  the only other thing we could do is catch this failure and dequeue (or re-enqueue?) the item.17:54
@fungicide:matrix.orgyeah, it's probably a rare enough race already even without the extra update time check, that even an inefficient/disruptive mitigation could be tolerated in the long run17:57
@fungicide:matrix.orgre-enqueuing doesn't sound that outrageous. dequeuing alone might create hard-to-troubleshoot incidents (though i expect the scheduler logs would elucidate quickly in such cases)17:58
@fungicide:matrix.orgalso your explanation makes sense as to why we hit it in the pipeline we did, that one evaluates enqueuing on every review comment18:00
@fungicide:matrix.orgfar, far more opportunities to lose that race18:00
@tristanc_:matrix.orgcorvus: if i understand correctly, (and please forgive me as I haven't looked closely at the changes in the zookeeper implementation), the gerrit connection and the pruning process are not using locks to prevent such issue?18:23
@tristanc_:matrix.organd i guess we are not stopping the world because the pruning process is too slow, but perhaps it could be made faster?18:27
@jim:acmegating.comtristanC: yeah, the change cache is designed to avoid locks to keep the zk traffic low... but typing that made me realize one of the techniques we use to avoid locks (checking the zk version on write to detect conflicts) can probably be applied to detect this issue on delete... so we might be able to close the hole entirely.  i'll try that; thanks!19:05
@fungicide:matrix.orgthat does sound promising19:06
@jim:acmegating.com(basically, there should be a version # associated with the >1hr old change; pass that version to the delete call and it should fail if it's been updated since then; if it fails because of that, leave it in the cache for next time)19:06
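The version-checked delete described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from change 814493: `VersionedStore`, `BadVersionError`, and `prune` are hypothetical in-memory stand-ins for a store that keeps a version number per entry and rejects a delete when the caller's expected version is stale.

```python
import threading


class BadVersionError(Exception):
    """Raised when the stored version does not match the expected one."""


class VersionedStore:
    """Hypothetical in-memory store that tracks a version per key."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}  # key -> (value, version)

    def set(self, key, value):
        # Every write bumps the version; the first write yields version 0.
        with self._lock:
            _, version = self._data.get(key, (None, -1))
            self._data[key] = (value, version + 1)
            return version + 1

    def get(self, key):
        with self._lock:
            return self._data[key]

    def delete(self, key, version):
        # Delete only if nobody has written to the key since `version`.
        with self._lock:
            if key not in self._data:
                return
            _, current = self._data[key]
            if current != version:
                raise BadVersionError(key)
            del self._data[key]

    def __contains__(self, key):
        with self._lock:
            return key in self._data


def prune(store, candidates):
    """Delete (key, version) candidates; keep any that changed meanwhile."""
    kept = []
    for key, version in candidates:
        try:
            store.delete(key, version)
        except BadVersionError:
            # The entry was updated after it was selected for pruning,
            # so leave it in the cache for the next prune run.
            kept.append(key)
    return kept
```

In this sketch, an entry that is rewritten between candidate selection and deletion survives the prune, which mirrors steps 4 through 7 of the race with the hole closed.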
@tristanc_:matrix.orgnice, i'd be happy to review that strategy19:13
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 814493: Handle concurrent modification during change cache delete https://review.opendev.org/c/zuul/zuul/+/81449321:06
@jim:acmegating.comtristanC, fungi, Clark, swest, tobiash: ^ that should do it.  passes tests locally for me.21:18
@jim:acmegating.comwe can't actually test the race in prune, but i added a test with the same sequence of events.21:19
@jim:acmegating.comoh... should i add a reno in anticipation of 4.10.3?  :)21:20
@jim:acmegating.comi think i should.  will update.21:21
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 814493: Handle concurrent modification during change cache delete https://review.opendev.org/c/zuul/zuul/+/81449321:22
@jim:acmegating.com^ with reno21:22
@clarkb:matrix.orgthanks I'll review that today21:36
@jim:acmegating.comrelaying some discussion in #opendev -- there are some stuck paused jobs which appear to have occurred after the opendev scheduler was restarted, but the executors were not.  it looks like we may not handle the case where an executor pauses a job, then the scheduler is restarted -- nothing ever unpauses that job.  this shouldn't be an issue once we finish moving the pipeline state into zk, so i'm somewhat inclined to just say that for the moment, if you restart the scheduler you should also restart executors.21:51
@clarkb:matrix.orgshould you also restart mergers?21:53
@clarkb:matrix.orgmaybe those aren't a problem as they don't have the more complicated state changes21:53
@jim:acmegating.comshouldn't be necessary, but couldn't hurt21:55
@jim:acmegating.comthis specific issue is because the executor sits there holding the lock; the mergers will eventually finish their merge job and release the lock, so they don't suffer this bug21:56
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/zuul] 814493: Handle concurrent modification during change cache delete https://review.opendev.org/c/zuul/zuul/+/81449322:02
@jim:acmegating.comClark: don't hate me, but i think i made that slightly more pythonic ^22:02
@clarkb:matrix.orgcorvus: what is the difference between change.cache_version and change.cache_stat?22:08
@clarkb:matrix.orgwe seem to use both for similar tasks22:08
@jim:acmegating.comClark: change.cacheversion is a @property change.cachestat.version22:08
@jim:acmegating.comi wanted to avoid the property thing in order to save a local reference so we didn't get inconsistent values 22:09
@clarkb:matrix.orgI see and sometimes we need the last modified info and the version not just the version22:09
@jim:acmegating.comyeah.  all of those ride together on the same cachestat object; so they could all be updated in another thread by replacing the cachestat22:09
@jim:acmegating.comand sorry for the missing underscores22:10
@jim:acmegating.comin chat22:10
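The point about saving a local reference can be illustrated with a small sketch. It assumes, as described above, that the version and last-modified values travel together on one stat object that another thread may swap out wholesale; the `CacheStat` field names and the `snapshot` helper are hypothetical and not taken from Zuul.

```python
from collections import namedtuple

# Hypothetical immutable stat record; replaced as a whole on each update.
CacheStat = namedtuple("CacheStat", ["key", "version", "last_modified"])


class Change:
    def __init__(self, cache_stat):
        self.cache_stat = cache_stat

    @property
    def cache_version(self):
        # Convenience accessor; each call re-reads self.cache_stat.
        return self.cache_stat.version


def snapshot(change):
    """Read version and last_modified from one consistent stat object."""
    stat = change.cache_stat  # single read; later swaps do not affect `stat`
    return stat.version, stat.last_modified
```

Reading `change.cache_version` and then `change.cache_stat.last_modified` separately could mix values from two different stat objects if a replacement happens in between; holding `stat` locally avoids that.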
@clarkb:matrix.org+2'd with a super minor thing if you happen to write a new patchset for some reason22:12
@jim:acmegating.comClark: responded -- can you take a look?22:13
@jim:acmegating.com(note the lines below the comment if it's not clear)22:13
@clarkb:matrix.orgcorvus: shouldn't we remove the stale entry in the bad version case?22:14
@jim:acmegating.comClark: it will be (or probably already has been) updated by a watch22:15
@clarkb:matrix.orgah ok22:15
@clarkb:matrix.orgthen ya your comment makes sense. I'll leave a note22:15
@jim:acmegating.comkk22:15
@clarkb:matrix.orgtristanC: 814493 is ready for your review :) We can restart opendev on that change and correct the other issue noted above about needing to restart scheduler and executors together at the same time if we can land that22:31
@tristanc_:matrix.orgClark: I'm reviewing it now22:40
@ecsantos:matrix.orgHi folks22:52
@ecsantos:matrix.orgI have a use case and I'd like to know if it'd be possible to implement it with the current Zuul features22:52
@ecsantos:matrix.orgI want to use the Gerrit projects from opendev.org to import job content to my local Zuul instance (e.g., base-jobs, zuul-jobs)22:52
@ecsantos:matrix.orgAfter importing them, I'd like to:22:53
• make my own customizations
• keep these customizations when new upstream changes are merged
@ecsantos:matrix.orgThis sounds impossible, but I wanted to validate if it indeed is. Is it?22:53
@clarkb:matrix.orgecsantos: I think you need to keep a local fork of the repos and load those into zuul rather than the ones that are hosted directly on opendev22:57
@clarkb:matrix.orgAlso you might consider having a set of local roles and job defs outside of the ones from opendev and not try to manage them in the same repos22:57
@clarkb:matrix.orgthat might be the closest to what you are trying to achieve but would rely on zuul loading configs from multiple repos (those in opendev and those locally)22:57
@ecsantos:matrix.orgMy first was to do that, fork them locally. Software Factory does that with zuul-jobs. But I wanted to keep up to date with changes in the upstream23:11
@ecsantos:matrix.orgI thought of creating a local repo to hold job content that would override some variables and values set by the upstream jobs23:11
@ecsantos:matrix.orgDoes this sound doable in Zuul?23:11
@ecsantos:matrix.org* My thought first was to do that, fork them locally. Software Factory does that with zuul-jobs. But I wanted to keep up to date with changes in the upstream23:12
@ecsantos:matrix.org* I also thought of creating a local repo to hold job content that would override some variables and values set by the upstream jobs23:12
@clarkb:matrix.orgecsantos: yes in most cases you should be able to make your own child jobs that override things23:20
@ecsantos:matrix.org* My first thought was to do that, fork them locally. Software Factory does that with zuul-jobs. But I wanted to keep up to date with changes in the upstream23:21
@clarkb:matrix.orgopenstack does similar with openstack/openstack-zuul-jobs on opendev23:21
@ecsantos:matrix.orgWill take a look, maybe it'll help. Thanks Clark! :)23:21
-@gerrit:opendev.org- James E. Blair https://matrix.to/#/@jim:acmegating.com proposed: [zuul/nodepool] 814503: Add user-data/custom-data to Azure driver https://review.opendev.org/c/zuul/nodepool/+/81450323:33
-@gerrit:opendev.org- Clark Boylan proposed: [zuul/zuul] 814504: Cleanup empty secrets dirs when deleting secrets https://review.opendev.org/c/zuul/zuul/+/81450423:35
@clarkb:matrix.org814504 is a response to opendev's daily export-keys complaining after some project renames23:35
@jim:acmegating.comClark: as an alternative, what about having the delete command not fail?23:40
@jim:acmegating.comer sorry, the export command23:40
@clarkb:matrix.orgcorvus: we could do that too, but then we would have accumulated cruft in the zk db over time23:41
@clarkb:matrix.orgThat was why I went with the cleanup approach instead, but if we prefer the safety of having the export not complain if there isn't secret data I could write that up instead23:41
@jim:acmegating.comClark: just to double check, you're keeping in mind that we end up with "prefix/project" no matter how many /s are in the original project, right?  so "foo" becomes "foo/foo", "foo/bar" becomes "foo/foo%2fbar", "foo/bar/baz" becomes "foo/foo%2fbar%2fbaz".23:47
@clarkb:matrix.orgcorvus: yup I updated the existing functions to get the safe encoded name and then rely on that23:48
@jim:acmegating.comClark: i'm pretty sure your code is designed with that in mind; i just want to call it out as it may not be obvious.23:48
@jim:acmegating.comcool23:48
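The "prefix/project" naming described above can be sketched like this. It is an assumption-laden illustration, not the helper in `zuul/lib/strings.py`: the function name `safe_project_path` is hypothetical, and it uses `urllib.parse.quote_plus`, which emits uppercase hex digits (`%2F`) where the chat wrote `%2f`.

```python
from urllib.parse import quote_plus


def safe_project_path(project_name):
    """Return '<first path component>/<url-quoted full name>'.

    However many slashes the project name contains, the result has
    exactly two levels, because the slashes in the second component
    are percent-encoded.
    """
    prefix = project_name.split("/")[0]
    return f"{prefix}/{quote_plus(project_name)}"
```

Because the quoted name never contains a literal `/`, an emptied prefix directory can be identified and cleaned up without walking an arbitrarily deep tree.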
@jim:acmegating.comClark: the change lgtm; but i did leave 2 suggestions inline which if you choose to implement i would be thoroughly happy to re-review :)23:49
@clarkb:matrix.orgyup I see those. I can do those cleanups tomorrow probably. We're starting to figure out dinner now. FWIW https://review.opendev.org/c/zuul/zuul/+/814504/1/zuul/lib/strings.py was edited to make the name handling consistent with existing encoding23:50
@jim:acmegating.comClark: oh i just noticed your new strings omitted the leading /23:51
@clarkb:matrix.orgoops23:51
@clarkb:matrix.orgfeel free to -1 and I'll pick this up tomorrow23:52
@jim:acmegating.comyeah, that feels like something worth being consistent on; i will go ahead -1 for that.  thanks!23:52
