Wednesday, 2021-08-11

clarkbcorvus: testing failed on the key startup change00:00
clarkbcorvus: I think in the test web keys test we may need to load the keys when they are requested if they are not already present? There may be a few spots in the code that rely on things being pre-cached00:02
fungioh, indeed00:20
opendevreviewJames E. Blair proposed zuul/zuul master: Add include-branches tenant config option  https://review.opendev.org/c/zuul/zuul/+/80417700:21
opendevreviewJames E. Blair proposed zuul/zuul master: Use branchesForRepoState in all cases  https://review.opendev.org/c/zuul/zuul/+/80417800:21
corvusclarkb: yeah that sounds likely; i'll whack the moles if no one raises high-level issues with the change00:22
opendevreviewMerged zuul/zuul master: Stop running mypy in linters  https://review.opendev.org/c/zuul/zuul/+/80360202:07
*** marios is now known as marios|ruck05:11
*** bhagyashris_ is now known as bhagyashris05:39
*** rpittau|afk is now known as rpittau07:23
*** jpena|off is now known as jpena07:32
*** bhavikdbavishi1 is now known as bhavikdbavishi09:46
*** rpittau is now known as rpittau|afk11:28
*** dviroel|out is now known as dviroel11:32
*** jpena is now known as jpena|lunch11:35
*** jpena|lunch is now known as jpena|off12:28
opendevreviewDong Zhang proposed zuul/zuul master: WIP: Show emoji to highlight failed jobs in build result in Github  https://review.opendev.org/c/zuul/zuul/+/80354712:45
opendevreviewDong Zhang proposed zuul/zuul master: WIP: Show emoji to highlight failed jobs in build result in Github  https://review.opendev.org/c/zuul/zuul/+/80354712:57
opendevreviewMatthieu Huin proposed zuul/zuul master: [WIP] web UI: user login with OpenID Connect  https://review.opendev.org/c/zuul/zuul/+/73408214:36
*** jpena|off is now known as jpena14:57
opendevreviewDong Zhang proposed zuul/zuul master: WIP: Show emoji to highlight failed jobs in build result in Github  https://review.opendev.org/c/zuul/zuul/+/80354715:03
opendevreviewDong Zhang proposed zuul/zuul master: WIP: Show emoji to highlight failed jobs in build result in Github  https://review.opendev.org/c/zuul/zuul/+/80354715:07
*** sshnaidm is now known as sshnaidm|afk15:35
*** jpena is now known as jpena|off15:42
pabelanger[m]I am seeing a lot of ERRORs in our zuul-scheduler, could someone help explain what might be happening?16:03
pabelanger[m]2021-08-11 16:02:20,627 ERROR zuul.zk.SemaphoreHandler: [e: c8aeb4b0-fa81-11eb-8732-0b1879f54153] Semaphore /zuul/semaphores/ansible/ansible-test-cloud-integration-aws can not be released for 3d9166f9654646c193a9dbb9f8016005-ansible-test-cloud-integration-aws-py36_5 because the semaphore is not held16:03
corvuspabelanger: ignore it16:04
fungipabelanger[m]: https://review.opendev.org/803948 merged yesterday to fix that16:05
fungi"fix" by silencing the error message, that is16:05
pabelanger[m]k, I started looking because we have some jobs in gate pipeline waiting on a semaphore, so figured it was related16:05
clarkbya once you've upgraded past that change the error becomes meaningful again16:05
pabelanger[m]will keep digging into logs16:06
clarkbpabelanger[m]: there was a change that broke semaphore cleanup that landed a few days ago and the fix for that also landed very recently. If you are running off of master you may have hit that?16:06
clarkbif running from releases that bug isn't present16:06
pabelanger[m]we are on 4.7.0, so I can see the ERROR message16:07
corvuszuul-maint: does this look good for a zuul release?  commit d07397a73c2551d5c77e0ffc3d98008337168902 (tag: 4.8.0)16:14
clarkbthat commit is one commit behind master but the commit at tip of master is just the mypy removal. The commit lgtm. The tag value also lgtm as new features (not just bug fixes) were added16:16
clarkband we have release notes for those features at https://zuul-ci.org/docs/zuul/reference/releasenotes.html#new-features16:16
clarkb++ LGTM16:17
tobiash[m]lgtm16:18
*** dviroel is now known as dviroel|away16:19
mordred++16:21
fungicorvus: yep, d07397a as 4.8.0 sounds great, thanks!16:25
clarkbhttps://zuul.opendev.org/t/openstack/build/8365660d04df497488f6fab0cca0f9c6/console is an interesting failure. We've indexed the stdout of find-testr.sh at [0] but that is a blank line and [1] has the path we want. That script hasn't changed in a year16:30
corvusclarkb: i can't think of anything that would have changed that.  note stderr does not have a blank line, so it's not a general issue16:32
clarkbcorvus: ya I'm looking at the script now and wonder if maybe find did something different?16:32
clarkbI'm trying to figure it out, but it is a really weird one16:33
clarkbor type?16:34
corvusclarkb: super weird but -- i just ran "type -p subunit2html" (it's not in my path) and it emitted a blank line.  then i ran it again and no blank line.16:36
corvusobvs can't reproduce that; take it with a grain of salt.16:36
corvusmaybe i just hit enter twice16:37
clarkbya I always get a single line16:37
clarkbmaybe it is different if you have it in multiple locations in PATH?16:38
corvusi bet that was a copy paste that had a newline.  probably not relevant16:39
clarkbhelp type does say that type won't return a string if type -t doesn't return FILE16:40
clarkbI think we want to use type -P16:40
clarkbbut I'm not sure that will fix the problem16:41
clarkbheh putting a different ls in your path ahead of /usr/bin/ls is not the greatest idea16:44
clarkbI cannot reproduce the behavior with multiple entries in my PATH16:44
clarkbmaybe we should just do a string chomp16:45
clarkbfungi: corvus ^ do you know of a good way to do that in bash? I think having newlines also complicates it since that involves multiline regexes16:50
opendevreviewJames E. Blair proposed zuul/zuul master: Disable aliases in inventory.yaml for better readibility  https://review.opendev.org/c/zuul/zuul/+/80267416:50
fungiclarkb: the hackish way is to wrap the output in a subshell and echo the stdout from it16:52
corvusclarkb: my first thought is awk but i don't have a spell for that handy either16:53
fungiecho `type -p subunit2html`16:53
fungiit's not very safe though16:54
fungiif you're just wanting to filter out extra blank lines you could: type -p subunit2html | grep -v ^$16:55
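The blank-line filtering suggested above can be sketched as a small shell helper. This is only an illustration and not the change that was later proposed for zuul-jobs: the function name first_nonblank_line and the path /usr/bin/stestr are made up for the example, and the lookup output is simulated with printf instead of coming from a real type -p call.

```shell
# Hypothetical helper: print the first line of stdin that is not blank.
first_nonblank_line() {
    # grep -v drops empty and whitespace-only lines;
    # head -n 1 keeps only the first line that survives.
    grep -v '^[[:space:]]*$' | head -n 1
}

# Simulate the reported output: a blank line, then the real path.
# /usr/bin/stestr is an assumed example path, not taken from the build log.
printf '\n/usr/bin/stestr\n' | first_nonblank_line
```

In a script, the helper would sit at the end of the lookup pipeline, for example type -p stestr | first_nonblank_line, so a stray leading blank line is never mistaken for the command path.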
clarkbI thought maybe for a second that it was looking for testr and stestr because the .testrepository and .stestr stuff both existed but they don't16:55
fungiwhat specifically is it you want to do? i probably haven't been following closely enough to properly infer the goal16:56
clarkbfungi: an ironic-inspector job failed https://zuul.opendev.org/t/openstack/build/8365660d04df497488f6fab0cca0f9c6/console because type -p stestr appears to have emitted a blank line which we treated as the command path for stestr16:57
clarkbthe actual command path was on the next line16:58
clarkbI am unable to reproduce the double line behavior16:58
fungithat sounds familiar. i think i've seen it before, though i want to say it was because of some iteration over a list16:58
fungiis it possible type was called more than once?16:58
clarkbmaybe? That was what I was looking for by checking if they have a .testrepository and a .stestr in the repo but they don't. The script find-testr.sh does have a loop looking for both commands depending on whether or not the .dirs are in place16:59
* fungi could almost swear he implemented a fix for this16:59
fungithis is on a stable branch, maybe my fix was only to master devstack16:59
clarkbfungi: the script is in zuul-jobs though17:00
fungioh, nm17:00
clarkband hasn't been modified in a year17:00
corvusyeah, this is (at least nominally) an issue with generically finding the subunit html in zuul-jobs.  (of course, tempest may have hosed the host in such a way to make it fail, but who knows)17:01
opendevreviewClark Boylan proposed zuul/zuul-jobs master: Find (s)testr more reliably  https://review.opendev.org/c/zuul/zuul-jobs/+/80428017:05
clarkbthat is a bit brute force and doesn't explain why this happens. But if that passes testing I suspect it should be safe enough?17:05
fungiclarkb: hah, i found where we discussed it in #opendev a couple months ago, but i think we ended up at the same place as today's analysis: https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-06-22.log.html#t2021-06-22T13:26:5117:35
fungimy opinion was "we could add some debugging from the script to stderr i suppose, which shouldn't conflict with parsing the stdout, but since this seems to occur infrequently it would be hard to leverage without merging it to the role"17:36
corvusfungi, clarkb: past-fungi makes a good point in that the newline may not be coming from the type command, but the script or something else, in which case clarkb's fix is too low-level?17:38
mordredyah - although it probably shouldn't hurt?17:38
corvustrue17:39
fungipast-fungi had more braincells. i kill them off at an alarming rate17:39
corvusso maybe still +2 on clarkb's change and if it recurs, then we know it's (probably) something higher up the stack?17:39
clarkbwfm17:39
clarkbbut ya that's a good point it could be coming from something earlier in the script17:39
fungiif nothing else, it does confirm we're continuing to see this problem. if we could give it some sort of discoverable signal combined with a bit of debugging info, maybe we could get to the bottom of it (might not just be affecting this one script)17:40
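The idea of emitting debugging information on stderr without disturbing callers that parse stdout could look roughly like the sketch below. The wrapper name lookup_with_debug is hypothetical, and command -v is used as a portable stand-in for type -p; nothing here is taken from find-testr.sh itself.

```shell
# Hypothetical wrapper: look up a command and report what was found on stderr.
lookup_with_debug() {
    # command -v stands in for "type -p" here because it is portable.
    raw=$(command -v "$1" || true)
    # Diagnostics go to stderr, so a caller capturing stdout never sees them.
    printf 'debug: lookup of "%s" returned "%s"\n' "$1" "$raw" >&2
    printf '%s\n' "$raw"
}

# The caller still receives only the lookup result on stdout.
cmd_path=$(lookup_with_debug sh)
echo "using: ${cmd_path}"
```

Because the quoted debug line shows the raw value, an unexpected leading newline would be visible in the job's stderr the next time the problem recurs.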
corvusclarkb: it's coming from... INSIDE THE SCRIPT!!!!!17:42
corvusi think it might be time to afk for a few minutes.17:42
clarkbha17:44
clarkblooking at the script I don't see anywhere else where we could be echoing a blank newline. It is all variable assignments and for loops18:16
clarkband running the script locally against my zuul repo which has tox and stestr things reliably outputs a single line18:21
clarkbreally really weird18:21
clarkbany objections to approving 804280?18:25
clarkbfungi: ^ in particular hasn't weighed in on the change18:25
KennyHo[m]I am having a huge backlog on my Zuul right now but all of the nodes are idle.  What is the best way to debug this?  I am suspecting something is wrong with merger or the executor but I am not sure where to start.18:35
clarkbKennyHo[m]: I would start with a single job and trace it to figure out why it isn't running18:36
clarkbin your zuul scheduler logs you should have event ids on the log lines. If you grep for that event id you'll get all the logs associated with that event including job processing (or most of them anyway)18:36
clarkbthis should include things like node requests ids which you can take to nodepool logs and grep for there to see if maybe nodepool isn't provisioning nodes18:37
KennyHo[m]clarkb: Looking at some of these, I am not sure if the jobs are supposed to run.18:37
KennyHo[m]these are basically events that have no zuul jobs18:37
KennyHo[m]but everything is just queued up on the web UI18:38
clarkbIn general I would expect this sort of problem if nodes are not launching in nodepool, executors are shutdown, or maybe semaphores being held18:38
clarkbKennyHo[m]: ya you should try and identify the event for a specific thing queued up in the UI18:38
clarkbI think if you grep for the change id that may give you the eventid then you can grep for the event id?18:38
KennyHo[m]this is all within the scheduler right?18:39
clarkbyes18:39
clarkbthen from that you get clues like node request ids etc that can be taken to the other services18:39
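The tracing approach described above (find the event id for a change, then grep for that event id) can be sketched in shell. This assumes the "[e: <event id>]" marker seen in the SemaphoreHandler error earlier in this log appears on the relevant scheduler log lines; the function names and the log path in the usage comment are hypothetical.

```shell
# Print each distinct event id found on log lines that mention a search string.
# $1: string to search for (for example a change number); $2: log file
event_ids_for() {
    grep -F -- "$1" "$2" | sed -n 's/.*\[e: \([^]]*\)].*/\1/p' | sort -u
}

# Print every log line that belongs to any event matching the search string.
trace_events() {
    event_ids_for "$1" "$2" | while read -r event_id; do
        echo "=== event ${event_id} ==="
        grep -F -- "[e: ${event_id}]" "$2"
    done
}

# Example invocation (the log path is an assumption, adjust to your deployment):
#   trace_events 804177 /var/log/zuul/scheduler.log
```

Identifiers that show up in the traced lines, such as node request ids, can then be fed to the same grep against the nodepool or executor logs.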
KennyHo[m]clarkb: So I see merge state: PENDING... I am not sure if that's significant18:48
clarkbKennyHo[m]: that could be a clue. One of the first things zuul will do for a change is merge it against the tip of the target branch because that is what it builds its jobs off of18:48
clarkbif the merges aren't completing that could stop the processing pipeline. Every executor runs its own merger threads and you can run separate merger processes. I would check those logs for merge errors or maybe the processes have stopped completely?18:49
KennyHo[m]the mergers have a different log or does it all goes to the executor's log?18:50
KennyHo[m](I haven't figured out how to get merger working yet so all my executor == merger)18:50
corvusexecutor log18:51
KennyHo[m]corvus: um... I think I am having a merger traffic jam.  Is merger supposed to block the whole world?  What I noticed in the past is that no job seems to get scheduled when the mergers are busy.18:54
KennyHo[m]what happens is that I have a periodic pipeline that runs a daily job on the linux kernel repo18:54
KennyHo[m]on that periodic event, it actually generates thousands of events... one per branch18:55
KennyHo[m]I am not sure if that's by design18:55
KennyHo[m]the daily job is really only working on one branch18:56
clarkbI think the periodic trigger works by issuing an event for each branch on the target repos then the job processing pipeline decides if anything needs to run from there18:56
clarkbunlike with a gerrit change or github pr there isn't a specific entity to run against just a time to attempt to run things18:57
KennyHo[m]but seems like the merger is doing a merge test for every single branch in the repo in question even when no one really cares about those branches.18:57
clarkbyes, I suspect zuul is doing that because it doesn't know if it cares about them or not until after the git repo updates for each branch18:57
clarkbthe indication for whether or not it should do work is in the branches so it needs to update them iirc18:58
KennyHo[m]ok... so is there a way to clear the queue of the periodic pipeline?18:58
clarkbyou can run the zuul dequeue command against those entries to remove them18:59
clarkbsomething like a branch filter might make sense on the periodic trigger?18:59
KennyHo[m]1352 entry... :'(19:00
clarkbthat is a lot of branches19:01
KennyHo[m]zuul basically triggered an event for every devs personal branches and project branches...19:01
KennyHo[m]clarkb: something else is going on I think... these events used to clear much quicker19:03
clarkbIn gerrit land this is addressed by not having branches for all the things. Instead everything is organized into changes targeting a smaller set of branches. In github and gitlab we have the option to exclude unprotected branches and basically only run zuul against the protected stuff which isn't where personal dev happens.19:03
clarkbI wonder if we need similar filtering on a general level19:03
clarkbcorvus: ^19:03
avass[m]Of course you can also configure jobs to only match a specific branch19:04
clarkbavass[m]: yes, but I think the merge happens before that to get the .zuul.yaml content?19:05
KennyHo[m]avass: the periodic job is matching specific branch19:05
clarkbKennyHo[m]: re not happening before it is possible those merges went much more quickly before for some reason? Either far fewer branches or cheaper merges per branch? Once zuul does the update and notices that it doesn't have any jobs to run for that branch it will stop processing19:06
corvusclarkb: something like https://review.opendev.org/804177 ?19:06
clarkbcorvus: ya exactly19:06
KennyHo[m]clarkb: I have multiple merger/executors... I am wondering if it went faster on the instances that are warmed up19:06
*** dviroel|away is now known as dviroel19:09
opendevreviewMerged zuul/zuul-jobs master: Find (s)testr more reliably  https://review.opendev.org/c/zuul/zuul-jobs/+/80428019:21
KennyHo[m]Another question... is the merge attempt per change per pipeline?  So let's say I have a burst of 100 commits and I have 5 pipelines for zuul, I would get 500 merge attempts?19:22
Clark[m]It is at least per event. I'm not sure if the merges happen separately if the event is handled by multiple pipelines19:24
KennyHo[m]ok... I am just looking at the log but I am not sure if I am understanding it correctly.  I am seeing multiple Merge... complete line each having an unique id (for the same change.19:26
KennyHo[m])19:26
corvusif the event matches 5 pipelines, there will be 5 merge jobs19:26
Clark[m]It probably is per pipeline because dependent pipelines have to construct a pipeline specific git tree19:26
corvusi would consider 5 identical periodic pipelines unusual though19:27
KennyHo[m]corvus: for this case, they are not periodic pipelines but multiple check pipeline variant19:28
KennyHo[m]each doing different kind of gerrit scoring19:28
KennyHo[m]I ended up killing everything and restarted both scheduler and executors for the periodic issue19:30
KennyHo[m]this reminds me... in the current webUI, there's a "Queue length" at the top left corner of the status page, which queue is that?19:32
KennyHo[m]Is that the Gearman queue?19:33
KennyHo[m]this is the queue before things get to the pipeline queue.  (I am just wondering because someone accidentally checked in a periodic pipeline that was running every minute.... and those events seem to survive zuul restart and I wasn't sure how to clear them.)19:34
KennyHo[m]fortunately when those events get to the pipeline queue, they got cleared fairly quickly.19:35
corvusthey're moving to zk now; we'll have a "delete everything from zk" tool eventually, but we don't have it yet19:36
KennyHo[m]corvus: ok...  (yea, last few days hasn't been fun... but I guess we were stress testing Zuul a bit? ;))19:37
corvusKenny Ho: well, mergers are a scale-out component for a reason :)19:38
corvusobviously if you don't need to do the merges, that's best, but at some point, if you have sustained queues, then adding mergers (and/or executors) is a good way to alleviate it19:40
fungiin opendev we run 8 additional mergers beyond those supplied by our 12 executors19:40
KennyHo[m]I scaled up to 9 executors19:40
fungibut we also don't have projects with thousands of branches, nor any with a history as lengthy as the linux kernel19:41
KennyHo[m]what is the reason to scale with mergers instead of making them all executors?19:41
fungiKennyHo[m]: we're able to supply smaller servers to act as additional mergers, since they don't need all the resources you would usually consume with an executor19:41
corvusso that initial pipeline processing isn't blocked19:42
corvusonce an executor gets busy enough, it's not going to perform well as a merger19:42
KennyHo[m]fungi: smaller in terms of CPU/mem usage?  I'd imagine the storage requirement is the same for merger and executor?19:42
fungibut on the sizing front, yes we allocate 8gb ram and 8 cpus to our executors, while the dedicated mergers have only 2gb ram and 2 cpus19:43
fungiKennyHo[m]: we actually also use a separate filesystem for /var/lib/zuul on our executors, because workspaces can need additional storage (more bursty than what tends to be consumed by the git caches)19:46
KennyHo[m]fungi: yea... I haven't looked into that optimization yet.  I have been thinking of using networked storage but I am not sure what the performance trade off is vs local storage19:48
KennyHo[m]even though my deployment is done via k8s, I am only using local storage on an nvme right now.19:48
fungidepends on the sort of storage you're considering. git operations on remote filesystems could be costly. on remote block devices on the other hand, it would probably depend on the latency and bandwidth to your storage arrays19:49
fungigit on nfs would probably be a disaster, git on a filesystem which is formatted on top of iscsi or fibrechannel would probably not be bad as long as you had good connectivity19:50
fungithe kernel's filesystem cache also tends to smooth out a lot of that19:50
KennyHo[m]I see.  Thanks for the tips19:51
fungiany time19:53
clarkbalso those repos are largely ephemeral20:01
clarkbso using simple fast local storage is probably the best option20:01
opendevreviewJames E. Blair proposed zuul/zuul master: Add include-branches tenant config option  https://review.opendev.org/c/zuul/zuul/+/80417720:03
opendevreviewJames E. Blair proposed zuul/zuul master: Use branchesForRepoState in all cases  https://review.opendev.org/c/zuul/zuul/+/80417820:03
clarkbI should review that since I suggested it would be a good feature20:06
clarkbcorvus: would it make sense to allow that filter to be a regex? Then a project like openstack could do something like - ^stable/.*$ - master20:09
clarkbthen if users ended up having dev branches somehow they would be filtered out?20:09
clarkbKennyHo[m]: in your case is there a fairly static set of branches you do care about or would filtering on a prefix match be useful?20:10
pabelanger[m]clarkb: corvus: okay, i think I've figured out my semaphore issue. If I am debugging this correctly, because my check pipeline is processed first it is getting priority to the semaphore lock, over say gate which is second.  This seems to be regardless of pipeline precedence.20:18
clarkbpabelanger[m]: semaphores are independent of pipelines so ya that can happen20:18
pabelanger[m]I'm not sure how to best solve, as our gate pipeline has been starved for 11+ hours: https://dashboard.zuul.ansible.com/t/ansible/status 20:19
clarkbIf you can divide the resources the semaphore controls you could use two different semaphores one per pipeline. That would reduce overall throughput if one queue is empty and the other isn't20:20
clarkbpabelanger[m]: is the priority difference due to setting different priorities in the pipelines or just because zuul evaluates them in a specific order at the same priority?20:20
clarkbIf you aren't already setting specific priorities you could make gate a higher priority (opendev does this, we order by proximity to releases and merges)20:21
pabelanger[m]I think it is because check is listed first in the pipeline order20:21
clarkbif you are setting priorities you could stop or flip the priorities20:21
pabelanger[m]if I moved gate to be the first one, I think it would work as expected20:21
pabelanger[m]but gate is already higher20:21
pabelanger[m]https://github.com/ansible/project-config/blob/master/zuul.d/pipelines.yaml20:22
clarkbya looks like you have precedence high on gate and normal on check20:23
clarkbbut check is listed in the file before gate20:23
pabelanger[m]we really only do that to control render order on https://dashboard.zuul.ansible.com/t/ansible/status20:23
clarkbI wonder if zuul should order the pipelines in its processing list based on precedence too20:24
clarkbthough that might impact the ui?20:24
pabelanger[m]possible, I can look into it20:24
*** dviroel is now known as dviroel|ruck20:35
pabelanger[m]clarkb: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L1555 looks to be why20:37
clarkbya I wonder if we should sort that for loop by the precedence level20:37
pabelanger[m]I can work on that, along with test to expose semaphore issue20:44
opendevreviewJames E. Blair proposed zuul/zuul master: Add delete-state command to delete everything from ZK  https://review.opendev.org/c/zuul/zuul/+/80430421:10
opendevreviewPaul Belanger proposed zuul/zuul master: Process pipelines based on precedence  https://review.opendev.org/c/zuul/zuul/+/80430521:13
pabelanger[m]okay, that should sort them. ^ I'll work on a test in the morning21:14
pabelanger[m]hopefully nothing booms in testing21:14
corvustristanC: it looks like we're down to 3 kopf changes (then all known issues should be addressed)  starting at https://review.opendev.org/800786 if you're interested21:14
corvuspabelanger: maybe WIP that until ready?21:15
pabelanger[m]yup!21:16
pabelanger[m]EOD for today, thanks for help21:17
corvuspabelanger: g'night :)21:20
*** dviroel|ruck is now known as dviroel|ruck|out21:46
clarkbfungi: I think https://review.opendev.org/c/zuul/nodepool/+/777641 was proposed to debug the thing with stale node request locks that cause us to restart nodepool21:47
clarkbfungi: I haven't seen one of those recently. Do you know if that is still happening?21:47
fungiclarkb: i haven't seen one this week, though recently they were complicated by the unrelated zuul bug with the request retries reusing the old request object21:53
clarkbya21:53
fungilikely if we are still getting stale node request locks, the retries are masking them since the affected builds eventually get nodes assigned21:54
clarkbI think zuul will only retry once the node request has been unlocked though21:54
clarkbzuul waits for that state change21:54
fungioh, so it won't give up waiting after some timeout21:55
clarkbI don't think so22:00
clarkbthe negotiation there expects the nodepool side to give up and unlock and hand control back to zuul or succeed and similarly hand back control22:01
