Monday, 2022-05-09

@jsokolvewd:matrix.orgWell... Half works. I was using `prepare-workspace` role from zuul-jobs and that one does need ssh to work as it uses `synchronize` (rsync) 07:42
@avass:vassast.org> <@jsokolvewd:matrix.org> Well... Half works. I was using `prepare-workspace` role from zuul-jobs and that one does need ssh to work as it uses `synchronize` (rsync)11:52
You need this for k8s: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace-openshift
:)
@jsokolvewd:matrix.orgI have a vanilla k8s cluster, not an openshift one, can it still work there? 11:54
@avass:vassast.orgI used it for my k8s cluster in digital ocean previously, so it should work11:55
@avass:vassast.organd I think `oc` is practically a wrapper around `kubectl` to add better openshift support11:57
@avass:vassast.orgJakub Sokół: https://docs.openshift.com/container-platform/4.10/cli_reference/openshift_cli/usage-oc-kubectl.html#the-oc-binary11:57
@jsokolvewd:matrix.orgIndeed the rsync commands work with my cluster. Thanks Albin Vass12:15
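(For context, a minimal sketch of how the role mentioned above is typically wired in: it is run from a base job's pre-run playbook instead of the ssh/rsync-based prepare-workspace. The playbook path below is an illustrative assumption, not taken from either deployment discussed here.)

```yaml
# playbooks/base/pre.yaml (hypothetical path) -- prepare the workspace on
# kubernetes/openshift nodes using the zuul-jobs role discussed above,
# rather than the ssh/rsync-based prepare-workspace role.
- hosts: all
  roles:
    - prepare-workspace-openshift
```

The base job used for kubernetes-backed labels would then point its `pre-run` at this playbook instead of the ssh-based one.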
@jpew:matrix.orgWould someone be able to look at https://review.opendev.org/c/zuul/nodepool/+/839226 and https://review.opendev.org/c/zuul/zuul/+/838924 please?13:34
@clarkb:matrix.orgZuulians https://review.opendev.org/c/zuul/nodepool/+/840990 and child as well as https://review.opendev.org/c/zuul/zuul/+/840987 and child address python version information in nodepool and zuul if you have time for them. This was done due to feedback from Jakub Sokół on this info being missing from the v6 release notes16:50
@clarkb:matrix.orgfungi: responded to your question at https://review.opendev.org/c/zuul/zuul-website/+/84095116:55
@clarkb:matrix.orgjpew: I'll try to review those today. Once I've caught up on some operational stuff and my own changes I'm hoping to be able to do reviews and will do my best to include those16:55
@jpew:matrix.orgClark: OK. Thank you!16:56
@clarkb:matrix.orgI find that reviews are easier for me when I do them in blocks. I can page in more stuff and be more useful that way16:56
@y2kenny:matrix.orgHi, I just want to report an observation and a somewhat related long-standing issue.  We have a daily job (periodic pipeline) set on a repo that has 1k+ branches.  The job only applies to one of the branches.  Email notification from zuul is set.  One day, the git server went down during the scheduled periodic trigger.  What happened next is that Zuul sent out 1k+ emails (one per branch) claiming "Merge failure"18:11
@y2kenny:matrix.orgWhile undesirable, I think I understand why zuul sent 1k emails (it was zuul attempting to work on all the branches in the monitored repo even when there are no jobs for them.)  A possible issue here is the not-very-accurate "Merge failure" error message when the actual problem was a connection failure to git/gerrit.18:19
@y2kenny:matrix.orgThe long-standing issue is somewhat related to this... it's the tendency for Zuul to "freeze up" when faced with a 'large' number of events.  From my setup, I see it most often around the time of the periodic trigger.  Because Zuul generates an event per branch in the repo, hundreds of events get created at once every night.  From that point on, Zuul basically freezes with no jobs getting scheduled for an hour or two.  And then the blockage seems to resolve itself all of a sudden and everything goes back to normal.18:24
@y2kenny:matrix.orgJust from observation, there seems to be a global lock of some kind that is blocking everything but I don't know for sure.18:25
@y2kenny:matrix.orgAnother "stop the world" type freeze that I just observed is when someone pushed a patch series of 30 commits to gerrit for review.  That branch does not have anything to do with zuul but other branches in the same repo do.  What I just saw is a lot of events created for each pipeline, and zuul seemed to freeze up until I dequeued some of the events with zuul-client.18:28
@y2kenny:matrix.orgFrom the scheduler log, I am seeing repeated messages about "Adding change ... to queue... in pipeline" for the same 30 commits over and over again18:30
@y2kenny:matrix.orgThis is Zuul 5.2.2 with 6x scheduler, 10x executor, 10x merger18:33
@y2kenny:matrix.orgthe only other thing I can think of is the connection to gerrit... the pipe to gerrit and gerrit responsiveness are outside of my control, but they could also be another factor that is not scaling properly18:34
@y2kenny:matrix.orgI am not sure how to solve this.  I was hoping to deploy Zuul incrementally to avoid the scalability issues, but it seems like I can't really do that unless I use a completely separate set of repositories with the majority of the branches stripped out.  If folks have other solutions that would be great.18:40
@y2kenny:matrix.orgZuul runs pretty well normally btw, if patches come in a few (3~5) at a time.  The freeze is only obvious when there's a burst of 30~100+ events18:41
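(One configuration-side mitigation that follows from the description above, sketched with made-up job and branch names: pin the nightly job to the single branch it applies to with an explicit branch matcher. As described above, Zuul still enqueues one periodic trigger event per branch, so this limits the work each event can generate rather than the size of the burst itself.)

```yaml
# .zuul.yaml on the relevant branch (illustrative names) -- make the periodic
# job match only the one branch it is meant for, so trigger events for the
# other 1k+ branches produce no builds.
- job:
    name: nightly-build      # assumed job name
    branches: ^main$         # assumed branch name; this is a regex matcher
```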
@fungicide:matrix.orgevent queues are sequential queues, so the scheduler needs to process them in the order in which they arrive. is it that you have non-event-driven activity you want zuul to be working on which is a higher priority than processing the events it's queued, or that you want it to somehow be able to know which events to process out of order?18:43
@fungicide:matrix.orgor just that you want event processing to be faster overall?18:44
@y2kenny:matrix.orgI think if event processing is faster then the other issues will probably matter less.  The key questions for me, just from observing, are: 1) why is nothing moving on the server for a couple of hours, and 2) why are things not moving due to a bunch of events that turn out to be no-ops?  I guess what you are saying is that, to know these events are no-ops, there's actually quite a bit of processing, and the processing must be done serially?18:50
@y2kenny:matrix.orgJust to help my understanding, what does the scheduler need to do to process the events?  Is this an issue because every one of the burst of events may contain a commit that has zuul config related modifications?18:55
@y2kenny:matrix.orgI am wondering if things can be simplified if there's a 3rd project type (beyond config and untrusted)...  perhaps "regular" or "plain" or something... (not sure... just thinking out loud...)19:04
@y2kenny:matrix.orgbut perhaps you would want untrusted-project to be as performant regardless19:05
@fungicide:matrix.orgprofiling the events might help to understand where the scheduler is spending the bulk of its time on each one, yes. if there were a way for it to know more quickly that a given event would be a no-op without e.g. needing to check whether it will need a speculative layout built and so on, if that's where the time is going, or maybe there's contention over zookeeper, or fetching additional information from gerrit for each one...20:02
@fungicide:matrix.orgKenny Ho: probably the first thing i would do, if you haven't already, is pick a representative event id from the debug log of a scheduler which handled it and put together a timeline from the associated timestamps of what happened with it, seeing what gaps appear from the time it first starts to be processed20:14
@fungicide:matrix.orgfor example, we've seen an overloaded gerrit server lead to long response times for queries, which could cause the processing of each event to take considerably longer20:15
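(To make the timeline suggestion concrete, a small sketch in plain Ansible, since a Zuul deployment already has it available. The inventory group, scheduler log path and `event_id` variable are assumptions about the environment; the idea, as described above, is simply to pull every scheduler debug line for one event id and read the timestamps.)

```yaml
# timeline.yaml (hypothetical) -- collect every scheduler debug line that
# mentions one Zuul event id so its timestamps can be laid out as a timeline.
- hosts: zuul_schedulers       # assumed inventory group for the schedulers
  gather_facts: false
  tasks:
    - name: Pull all debug lines referencing the event id
      ansible.builtin.command: grep -F "{{ event_id }}" /var/log/zuul/scheduler.log
      register: event_lines
      changed_when: false
      failed_when: event_lines.rc not in [0, 1]   # rc 1 only means "no matches"

    - name: Show the matching lines; the leading timestamps form the timeline
      ansible.builtin.debug:
        msg: "{{ event_lines.stdout_lines }}"
```

Run with something like `ansible-playbook timeline.yaml -e event_id=<id from the debug log>`, then look for large gaps between consecutive timestamps to see where the scheduler spends its time.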
@y2kenny:matrix.orgfungi: Our gerrit definitely has high load.  This is another question I meant to ask... would having multiple executors/schedulers actually worsen that situation because each of them will try to pull things from gerrit at the same time? (i.e. no point in scaling out executors and schedulers too much if gerrit is busy.)20:37
@jpew:matrix.orgWould it make any sense for the zuul exector to renice ansible processes?21:43
@jpew:matrix.org(to lower their priority that is)21:43
@fungicide:matrix.orgKenny Ho: hard to know without seeing telemetry for the gerrit instance, that's what monitoring and resource tracking are for. yes, your gerrit might need a bigger server or some rethinking of how projects are organized if it can't handle the query volume. multiple schedulers could definitely increase the number of queries being performed in parallel anyway21:48
@fungicide:matrix.orgjpew: it might make sense if you're running other things besides executors on the same servers which need higher priority than the ansible processes. usually we just increase the number of executors to make sure we have enough that we don't block on accepting new builds for long21:49
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 840381: Update docs to clarify inheritance and global repo state https://review.opendev.org/c/zuul/zuul/+/84038122:13
@clarkb:matrix.orgjpew: left comments on those changes, thanks.22:26
