Monday, 2022-05-09

@jsokolvewd:matrix.orgWell... Half works. I was using `prepare-workspace` role from zuul-jobs and that one does need ssh to work as it uses `synchronize` (rsync) 07:42
@avass:vassast.org> <@jsokolvewd:matrix.org> Well... Half works. I was using `prepare-workspace` role from zuul-jobs and that one does need ssh to work as it uses `synchronize` (rsync)11:52
You need this for k8s: https://opendev.org/zuul/zuul-jobs/src/branch/master/roles/prepare-workspace-openshift
:)
@jsokolvewd:matrix.orgI have a vanilla k8s cluster, not an openshift one, can it still work there? 11:54
@avass:vassast.orgI used it for my k8s cluster in digital ocean previously, so it should work11:55
@avass:vassast.organd I think `oc` is practically a wrapper around `kubectl` to add better openshift support11:57
@avass:vassast.orgJakub Sokół: https://docs.openshift.com/container-platform/4.10/cli_reference/openshift_cli/usage-oc-kubectl.html#the-oc-binary11:57
@jsokolvewd:matrix.orgIndeed the rsync commands work with my cluster. Thanks Albin Vass12:15
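(For context, a minimal sketch of how the role mentioned above is typically wired in: it is run from a base job's pre-run playbook instead of the ssh/rsync-based prepare-workspace. The playbook path below is an illustrative assumption, not taken from either deployment discussed here.)

```yaml
# playbooks/base/pre.yaml (hypothetical path) -- prepare the workspace on
# kubernetes/openshift nodes using the zuul-jobs role discussed above,
# rather than the ssh/rsync-based prepare-workspace role.
- hosts: all
  roles:
    - prepare-workspace-openshift
```

The base job used for kubernetes-backed labels would then point its `pre-run` at this playbook instead of the ssh-based one.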
@jpew:matrix.orgWould someone be able to look at https://review.opendev.org/c/zuul/nodepool/+/839226 and https://review.opendev.org/c/zuul/zuul/+/838924 please?13:34
@clarkb:matrix.orgZuulians https://review.opendev.org/c/zuul/nodepool/+/840990 and child as well as https://review.opendev.org/c/zuul/zuul/+/840987 and child address python version information in nodepool and zuul if you have time for them. This was done due to feedback from Jakub Sokół on this info being missing from the v6 release notes16:50
@clarkb:matrix.orgfungi: responded to your question at https://review.opendev.org/c/zuul/zuul-website/+/84095116:55
@clarkb:matrix.orgjpew: I'll try to review those today. Once I've caught up on some operational stuff and my own changes I'm hoping to be able to do reviews and will do my best to include those16:55
@jpew:matrix.orgClark: OK. Thank you!16:56
@clarkb:matrix.orgI find that reviews are easier for me when I do them in blocks. I can page in more stuff and be more useful that way16:56
@y2kenny:matrix.orgHi, I just want to report an observation and a somewhat related long-standing issue.  We have a daily job (periodic pipeline) set on a repo that has 1k+ branches.  The job only applies to one of the branches.  Email notification from zuul is set.  One day, the git server went down during the scheduled periodic trigger.  What happened next is that Zuul sent out 1k+ emails (one per branch) claiming "Merge failure"18:11
@y2kenny:matrix.orgWhile undesirable, I think I understand why zuul sent 1k emails (it was zuul attempting to work on all the branches in the monitored repo even when there are no jobs for them.)  A possible issue here is the not-very-accurate "Merge failure" error message when the actual problem was a connection failure to git/gerrit.18:19
@y2kenny:matrix.orgThe long-standing issue is somewhat related to this... it's the tendency for Zuul to "freeze up" when faced with a 'large' number of events.  From my setup, I see it most often around the time of the periodic trigger.  Because Zuul generates an event per branch in the repo, hundreds of events get created at once every night.  From that point on, Zuul basically freezes with no jobs getting scheduled for an hour or two.  And then the blockage seems to resolve itself all of a sudden and everything goes back to normal.18:24
@y2kenny:matrix.orgJust from observation, there seems to be a global lock of some kind that is blocking everything but I don't know for sure.18:25
@y2kenny:matrix.orgAnother "stop the world" type freeze that I just observed is when someone pushed a patch series of 30 commits to gerrit for review.  That branch does not have anything to do with zuul but other branches in the same repo do.  What I just saw is a lot of events created for each pipeline, and zuul seemed to freeze up until I dequeued some of the events with zuul-client.18:28
@y2kenny:matrix.orgFrom the scheduler log, I am seeing repeated messages about "Adding change ... to queue... in pipeline" for the same 30 commits over and over again18:30
@y2kenny:matrix.orgThis is Zuul 5.2.2 with 6x scheduler, 10x executor, 10x merger18:33
@y2kenny:matrix.orgthe only other thing I can think of is the connection to gerrit... the pipe to gerrit and gerrit responsiveness are outside of my control, but they could also be another factor that is not scaling properly18:34
@y2kenny:matrix.orgI am not sure how to solve this.  I was hoping to deploy Zuul incrementally to avoid the scalability issues, but it seems like I can't really do that unless I use a completely separate set of repositories with the majority of the branches stripped out.  If folks have other solutions that would be great.18:40
@y2kenny:matrix.orgZuul runs pretty well normally btw, if patches come in a few (3~5) at a time.  The freeze is only obvious when there's a burst of 30~100+ events18:41
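(One configuration-side mitigation that follows from the description above, sketched with made-up job and branch names: pin the nightly job to the single branch it applies to with an explicit branch matcher. As described above, Zuul still enqueues one periodic trigger event per branch, so this limits the work each event can generate rather than the size of the burst itself.)

```yaml
# .zuul.yaml on the relevant branch (illustrative names) -- make the periodic
# job match only the one branch it is meant for, so trigger events for the
# other 1k+ branches produce no builds.
- job:
    name: nightly-build      # assumed job name
    branches: ^main$         # assumed branch name; this is a regex matcher
```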
@fungicide:matrix.orgevent queues are sequential queues, so the scheduler needs to process them in the order in which they arrive. is it that you have non-event-driven activity you want zuul to be working on which is a higher priority than processing the events it's queued, or that you want it to somehow be able to know which events to process out of order?18:43
@fungicide:matrix.orgor just that you want event processing to be faster overall?18:44
@y2kenny:matrix.orgI think if event processing is faster then the other issues will probably matter less.  The key questions for me, just from observing, are: 1) why is nothing moving on the server for a couple of hours, and 2) why are things not moving due to a bunch of events that turn out to be no-ops?  I guess what you are saying is that, to know these events are no-ops, there's actually quite a bit of processing, and the processing must be done serially?18:50
@y2kenny:matrix.orgJust to help my understanding, what does the scheduler need to do to process the events?  Is this an issue because every one of the burst of events may contain a commit that has zuul config related modifications?18:55
@y2kenny:matrix.orgI am wondering if things can be simplified if there's a 3rd project type (beyond config and untrusted)...  perhaps "regular" or "plain" or something... (not sure... just thinking out loud...)19:04
@y2kenny:matrix.orgbut perhaps you would want untrusted-project to be as performant regardless19:05
@fungicide:matrix.orgprofiling the events might help to understand where the scheduler is spending the bulk of its time on each one, yes. if there were a way for it to know more quickly that a given event would be a no-op without e.g. needing to check whether it will need a speculative layout built and so on, if that's where the time is going, or maybe there's contention over zookeeper, or fetching additional information from gerrit for each one...20:02
@fungicide:matrix.orgKenny Ho: probably the first thing i would do, if you haven't already, is pick a representative event id from the debug log of a scheduler which handled it and put together a timeline from the associated timestamps of what happened with it, seeing what gaps appear from the time it first starts to be processed20:14
@fungicide:matrix.orgfor example, we've seen an overloaded gerrit server lead to long response times for queries, which could cause the processing of each event to take considerably longer20:15
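(To make the timeline suggestion concrete, a small sketch in plain Ansible, since a Zuul deployment already has it available. The inventory group, scheduler log path and `event_id` variable are assumptions about the environment; the idea, as described above, is simply to pull every scheduler debug line for one event id and read the timestamps.)

```yaml
# timeline.yaml (hypothetical) -- collect every scheduler debug line that
# mentions one Zuul event id so its timestamps can be laid out as a timeline.
- hosts: zuul_schedulers       # assumed inventory group for the schedulers
  gather_facts: false
  tasks:
    - name: Pull all debug lines referencing the event id
      ansible.builtin.command: grep -F "{{ event_id }}" /var/log/zuul/scheduler.log
      register: event_lines
      changed_when: false
      failed_when: event_lines.rc not in [0, 1]   # rc 1 only means "no matches"

    - name: Show the matching lines; the leading timestamps form the timeline
      ansible.builtin.debug:
        msg: "{{ event_lines.stdout_lines }}"
```

Run with something like `ansible-playbook timeline.yaml -e event_id=<id from the debug log>`, then look for large gaps between consecutive timestamps to see where the scheduler spends its time.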
@y2kenny:matrix.orgfungi: Our gerrit definitely has high load.  This is another question I meant to ask... would having multiple executors/schedulers actually worsen that situation because each of them will try to pull things from gerrit at the same time? (i.e. no point in scaling out executors and schedulers too much if gerrit is busy.)20:37
@jpew:matrix.orgWould it make any sense for the zuul exector to renice ansible processes?21:43
@jpew:matrix.org(to lower their priority that is)21:43
@fungicide:matrix.orgKenny Ho: hard to know without seeing telemetry for the gerrit instance, that's what monitoring and resource tracking are for. yes, your gerrit might need a bigger server or some rethinking of how projects are organized if it can't handle the query volume. multiple schedulers could definitely increase the number of queries being performed in parallel anyway21:48
@fungicide:matrix.orgjpew: it might make sense if you're running other things besides executors on the same servers which need higher priority than the ansible processes. usually we just increase the number of executors to make sure we have enough that we don't block on accepting new builds for long21:49
-@gerrit:opendev.org- Zuul merged on behalf of James E. Blair https://matrix.to/#/@jim:acmegating.com: [zuul/zuul] 840381: Update docs to clarify inheritance and global repo state https://review.opendev.org/c/zuul/zuul/+/84038122:13
@clarkb:matrix.orgjpew: left comments on those changes, thanks.22:26
