Wednesday, 2021-08-11

clarkb	corvus: testing failed on the key startup change	00:00
clarkb	corvus: I think in the test web keys test we may need to load the keys when they are requested if they are not already present? There may be a few spots in the code that rely on things being pre-cached	00:02
fungi	oh, indeed	00:20
opendevreview	James E. Blair proposed zuul/zuul master: Add include-branches tenant config option https://review.opendev.org/c/zuul/zuul/+/804177	00:21
opendevreview	James E. Blair proposed zuul/zuul master: Use branchesForRepoState in all cases https://review.opendev.org/c/zuul/zuul/+/804178	00:21
corvus	clarkb: yeah that sounds likely; i'll whack the moles if no one raises high-level issues with the change	00:22
opendevreview	Merged zuul/zuul master: Stop running mypy in linters https://review.opendev.org/c/zuul/zuul/+/803602	02:07
*** marios is now known as marios|ruck		05:11
*** bhagyashris_ is now known as bhagyashris		05:39
*** rpittau|afk is now known as rpittau		07:23
*** jpena|off is now known as jpena		07:32
*** bhavikdbavishi1 is now known as bhavikdbavishi		09:46
*** rpittau is now known as rpittau|afk		11:28
*** dviroel|out is now known as dviroel		11:32
*** jpena is now known as jpena|lunch		11:35
*** jpena|lunch is now known as jpena|off		12:28
opendevreview	Dong Zhang proposed zuul/zuul master: WIP: Show emoji to highlight failed jobs in build result in Github https://review.opendev.org/c/zuul/zuul/+/803547	12:45
opendevreview	Dong Zhang proposed zuul/zuul master: WIP: Show emoji to highlight failed jobs in build result in Github https://review.opendev.org/c/zuul/zuul/+/803547	12:57
opendevreview	Matthieu Huin proposed zuul/zuul master: [WIP] web UI: user login with OpenID Connect https://review.opendev.org/c/zuul/zuul/+/734082	14:36
*** jpena|off is now known as jpena		14:57
opendevreview	Dong Zhang proposed zuul/zuul master: WIP: Show emoji to highlight failed jobs in build result in Github https://review.opendev.org/c/zuul/zuul/+/803547	15:03
opendevreview	Dong Zhang proposed zuul/zuul master: WIP: Show emoji to highlight failed jobs in build result in Github https://review.opendev.org/c/zuul/zuul/+/803547	15:07
*** sshnaidm is now known as sshnaidm|afk		15:35
*** jpena is now known as jpena|off		15:42
pabelanger[m]	I am seeing a lot of ERRORs in our zuul-scheduler, could someone help explain what might be happening?	16:03
pabelanger[m]	2021-08-11 16:02:20,627 ERROR zuul.zk.SemaphoreHandler: [e: c8aeb4b0-fa81-11eb-8732-0b1879f54153] Semaphore /zuul/semaphores/ansible/ansible-test-cloud-integration-aws can not be released for 3d9166f9654646c193a9dbb9f8016005-ansible-test-cloud-integration-aws-py36_5 because the semaphore is not held	16:03
corvus	pabelanger: ignore it	16:04
fungi	pabelanger[m]: https://review.opendev.org/803948 merged yesterday to fix that	16:05
fungi	"fix" by silencing the error message, that is	16:05
pabelanger[m]	k, I started looking because we have some jobs in gate pipeline waiting on a semaphore, so figured it was related	16:05
clarkb	ya once you've upgraded past that change the error becomes meaningful again	16:05
pabelanger[m]	will keep digging into logs	16:06
clarkb	pabelanger[m]: there was a change that broke semaphore cleanup that landed a few days ago and the fix for that also landed very recently. If you are running off of master you may have hit that?	16:06
clarkb	if running from releases that bug isn't present	16:06
pabelanger[m]	we are on 4.7.0, so I can see the ERROR message	16:07
corvus	zuul-maint: does this look good for a zuul release? commit d07397a73c2551d5c77e0ffc3d98008337168902 (tag: 4.8.0)	16:14
clarkb	that commit is one commit behind master but the commit at tip of master is just the mypy removal. The commit lgtm. The tag value also lgtm as new features (not just bug fixes) were added	16:16
clarkb	and we have release notes for those features at https://zuul-ci.org/docs/zuul/reference/releasenotes.html#new-features	16:16
clarkb	++ LGTM	16:17
tobiash[m]	lgtm	16:18
*** dviroel is now known as dviroel|away		16:19
mordred	++	16:21
fungi	corvus: yep, d07397a as 4.8.0 sounds great, thanks!	16:25
clarkb	https://zuul.opendev.org/t/openstack/build/8365660d04df497488f6fab0cca0f9c6/console is an interesting failure. We've indexed the stdout of find-testr.sh at [0] but that is a blank line and [1] has the path we want. That script hasn't changed in a year	16:30
corvus	clarkb: i can't think of anything that would have changed that. note stderr does not have a blank line, so it's not a general issue	16:32
clarkb	corvus: ya I'm looking at the script now and wonder if maybe find did something different?	16:32
clarkb	I'm trying to figure it out, but it is a really weird one	16:33
clarkb	or type?	16:34
corvus	clarkb: super weird but -- i just ran "type -p subunit2html" (it's not in my path) and it emitted a blank line. then i ran it again and no blank line.	16:36
corvus	obvs can't reproduce that; take it with a grain of salt.	16:36
corvus	maybe i just hit enter twice	16:37
clarkb	ya I always get a single line	16:37
clarkb	maybe it is different if you have it in multiple locations in PATH?	16:38
corvus	i bet that was a copy paste that had a newline. probably not relevant	16:39
clarkb	help type does say that type won't return a string if type -t doesn't return FILE	16:40
clarkb	I think we want to use type -P	16:40
clarkb	but I'm not sure that will fix the problem	16:41
clarkb	heh putting a different ls in your path ahead of /usr/bin/ls is not the greatest idea	16:44
clarkb	I cannot reproduce the behavior with multiple entries in my PATH	16:44
clarkb	maybe we should just do a string chomp	16:45
clarkb	fungi: corvus ^ do you know of a good way to do that in bash? I think having newlines also complicates it since that involves multiline regexes	16:50
opendevreview	James E. Blair proposed zuul/zuul master: Disable aliases in inventory.yaml for better readibility https://review.opendev.org/c/zuul/zuul/+/802674	16:50
fungi	clarkb: the hackish way is to wrap the output in a subshell and echo the stdout from it	16:52
corvus	clarkb: my first thought is awk but i don't have a spell for that handy either	16:53
fungi	echo `type -p subunit2html`	16:53
fungi	it's not very safe though	16:54
fungi	if you're just wanting to filter out extra blank lines you could: type -p subunit2html | grep -v ^$	16:55
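The blank-line filtering idea can be sketched as a small helper. This is a minimal illustration, not the code in find-testr.sh or in the zuul-jobs change: the function name first_nonblank_line and the simulated path are made up, and the stray blank line is faked with printf because nobody in the channel could reproduce it.

```shell
# Hypothetical helper: print only the first non-empty line read from stdin.
first_nonblank_line() {
    grep -v '^$' | head -n 1
}

# Simulate the failure mode from the job: a blank line comes before the
# real command path. The path below is invented for this example.
found=$(printf '\n/usr/local/bin/stestr\n' | first_nonblank_line)

echo "command path: ${found}"
```

In a real script the printf would be replaced by whatever command emits the path, so an unexpected leading blank line can no longer be mistaken for the result.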
clarkb	I thought maybe for a second that it was looking for testr and stestr because the .testrepository and .stestr stuff both existed but they don't	16:55
fungi	what specifically is it you want to do? i probably haven't been following closely enough to properly infer the goal	16:56
clarkb	fungi: an ironic-inspector job failed https://zuul.opendev.org/t/openstack/build/8365660d04df497488f6fab0cca0f9c6/console because type -p stestr appears to have emitted a blank line which we treated as the command path for stestr	16:57
clarkb	the actual command path was on the next line	16:58
clarkb	I am unable to reproduce the double line behavior	16:58
fungi	that sounds familiar. i think i've seen it before, though i want to say it was because of some iteration over a list	16:58
fungi	is it possible type was called more than once?	16:58
clarkb	maybe? That was what I was looking for by checking if they have a .testrepository and a .stestr in the repo but they don't. The script find-testr.sh does have a loop looking for both commands depending on whether or not the .dirs are in place	16:59
* fungi could almost swear he implemented a fix for this		16:59
fungi	this is on a stable branch, maybe my fix was only to master devstack	16:59
clarkb	fungi: the script is in zuul-jobs though	17:00
fungi	oh, nm	17:00
clarkb	and hasn't been modified in a year	17:00
corvus	yeah, this is (at least nominally) an issue with generically finding the subunit html in zuul-jobs. (of course, tempest may have hosed the host in such a way to make it fail, but who knows)	17:01
opendevreview	Clark Boylan proposed zuul/zuul-jobs master: Find (s)testr more reliably https://review.opendev.org/c/zuul/zuul-jobs/+/804280	17:05
clarkb	that is a bit brute force and doesn't explain why this happens. But if that passes testing I suspect it should be safe enough?	17:05
fungi	clarkb: hah, i found where we discussed it in #opendev a couple months ago, but i think we ended up at the same place as today's analysis: https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2021-06-22.log.html#t2021-06-22T13:26:51	17:35
fungi	my opinion was "we could add some debugging from the script to stderr i suppose, which shouldn't conflict with parsing the stdout, but since this seems to occur infrequently it would be hard to leverage without merging it to the role"	17:36
corvus	fungi, clarkb: past-fungi makes a good point in that the newline may not be coming from the type command, but the script or something else, in which case clarkb's fix is too low-level?	17:38
mordred	yah - although it probably shouldn't hurt?	17:38
corvus	true	17:39
fungi	past-fungi had more braincells. i kill them off at an alarming rate	17:39
corvus	so maybe still +2 on clarkb's change and if it recurs, then we know it's (probably) something higher up the stack?	17:39
clarkb	wfm	17:39
clarkb	but ya that's a good point, it could be coming from something earlier in the script	17:39
fungi	if nothing else, it does confirm we're continuing to see this problem. if we could give it some sort of discoverable signal combined with a bit of debugging info, maybe we could get to the bottom of it (might not just be affecting this one script)	17:40
corvus	clarkb: it's coming from... INSIDE THE SCRIPT!!!!!	17:42
corvus	i think it might be time to afk for a few minutes.	17:42
clarkb	ha	17:44
clarkb	looking at the script I don't see anywhere else where we could be echoing a blank newline. It is all variable assignments and for loops	18:16
clarkb	and running the script locally against my zuul repo which has tox and stestr things reliably outputs a single line	18:21
clarkb	really really weird	18:21
clarkb	any objections to approving 804280?	18:25
clarkb	fungi: ^ in particular hasn't weighed in on the change	18:25
KennyHo[m]	I am having a huge backlog on my Zuul right now but all of the nodes are idle. What is the best way to debug this? I am suspecting something is wrong with merger or the executor but I am not sure where to start.	18:35
clarkb	KennyHo[m]: I would start with a single job and trace it to figure out why it isn't running	18:36
clarkb	in your zuul scheduler logs you should have event ids on the log lines. If you grep for that event id you'll get all the logs associated with that event including job processing (or most of them anyway)	18:36
clarkb	this should include things like node request ids which you can take to nodepool logs and grep for there to see if maybe nodepool isn't provisioning nodes	18:37
KennyHo[m]	clarkb: Looking at some of these, I am not sure if the jobs are supposed to run.	18:37
KennyHo[m]	these are basically events that have no zuul jobs	18:37
KennyHo[m]	but everything is just queued up on the web UI	18:38
clarkb	In general I would expect this sort of problem if nodes are not launching in nodepool, executors are shutdown, or maybe semaphores being held	18:38
clarkb	KennyHo[m]: ya you should try and identify the event for a specific thing queued up in the UI	18:38
clarkb	I think if you grep for the change id that may give you the eventid then you can grep for the event id?	18:38
KennyHo[m]	this is all within the scheduler right?	18:39
clarkb	yes	18:39
clarkb	then from that you get clues like node request ids etc that can be taken to the other services	18:39
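The two-step grep described above (change id to event id, then event id to all related lines) might look roughly like this. It is only a sketch: the sample log lines, logger names, change id and node request id are invented, and the only detail taken from real output is the "[e: <id>]" prefix visible in the SemaphoreHandler error quoted earlier in this log.

```shell
# Build a throwaway log file with made-up scheduler lines.
log=$(mktemp)
cat > "$log" <<'EOF'
2021-08-11 18:30:00,000 DEBUG zuul.Scheduler: [e: aaaa1111] Processing trigger event for change 12345,1
2021-08-11 18:30:01,000 DEBUG zuul.Pipeline: [e: aaaa1111] Submitted node request 200-0000000042
2021-08-11 18:30:02,000 DEBUG zuul.Scheduler: [e: bbbb2222] Processing trigger event for change 67890,3
EOF

change="12345,1"   # hypothetical change id taken from the status page

# Step 1: find a line mentioning the change and pull out its event id.
event_id=$(grep -F "$change" "$log" | sed -n 's/.*\[e: \([^]]*\)\].*/\1/p' | head -n 1)

# Step 2: collect every line tagged with that event id.
event_lines=$(grep -F "[e: ${event_id}]" "$log")

printf '%s\n' "$event_lines"
rm -f "$log"
```

Identifiers that show up in the matched lines, such as a node request id, can then be searched for in the nodepool or executor logs in the same way.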
KennyHo[m]	clarkb: So I see merge state: PENDING... I am not sure if that's significant	18:48
clarkb	KennyHo[m]: that could be a clue. One of the first things zuul will do for a change is merge it against the tip of the target branch because that is what it builds its jobs off of	18:48
clarkb	if the merges aren't completing that could stop the processing pipeline. Every executor runs its own merger threads and you can run separate merger processes. I would check those logs for merge errors or maybe the processes have stopped completely?	18:49
KennyHo[m]	do the mergers have a different log or does it all go to the executor's log?	18:50
KennyHo[m]	(I haven't figured out how to get merger working yet so all my executor == merger)	18:50
corvus	executor log	18:51
KennyHo[m]	corvus: um... I think I am having a merger traffic jam. Is merger supposed to block the whole world? What I noticed in the past is that no job seems to get scheduled when the mergers are busy.	18:54
KennyHo[m]	what happens is that I have a periodic pipeline that runs a daily job on the linux kernel repo	18:54
KennyHo[m]	on that periodic event, it actually generates thousands of events... one per branch	18:55
KennyHo[m]	I am not sure if that's by design	18:55
KennyHo[m]	the daily job is really only working on one branch	18:56
clarkb	I think the periodic trigger works by issuing an event for each branch on the target repos then the job processing pipeline decides if anything needs to run from there	18:56
clarkb	unlike with a gerrit change or github pr there isn't a specific entity to run against just a time to attempt to run things	18:57
KennyHo[m]	but seems like the merger is doing a merge test for every single branch in the repo in question even when no one really cares about those branches.	18:57
clarkb	yes, I suspect zuul is doing that because it doesn't know if it cares about them or not until after the git repo updates for each branch	18:57
clarkb	the indication for whether or not it should do work is in the branches so it needs to update them iirc	18:58
KennyHo[m]	ok... so is there a way to clear the queue of the periodic pipeline?	18:58
clarkb	you can run the zuul dequeue command against those entries to remove them	18:59
clarkb	something like a branch filter might make sense on the periodic trigger?	18:59
KennyHo[m]	1352 entries... :'(	19:00
clarkb	that is a lot of branches	19:01
KennyHo[m]	zuul basically triggered an event for every dev's personal branches and project branches...	19:01
KennyHo[m]	clarkb: something else is going on I think... these events used to clear much quicker	19:03
clarkb	In gerrit land this is addressed by not having branches for all the things. Instead everything is organized into changes targeting a smaller set of branches. In github and gitlab we have the option to exclude unprotected branches and basically only run zuul against the protected stuff which isn't where personal dev happens.	19:03
clarkb	I wonder if we need similar filtering on a general level	19:03
clarkb	corvus: ^	19:03
avass[m]	Of course you can also configure jobs to only match a specific branch	19:04
clarkb	avass[m]: yes, but I think the merge happens before that to get the .zuul.yaml content?	19:05
KennyHo[m]	avass: the periodic job is matching specific branch	19:05
clarkb	KennyHo[m]: re this not happening before: it is possible those merges went much more quickly before for some reason? Either far fewer branches or cheaper merges per branch? Once zuul does the update and notices that it doesn't have any jobs to run for that branch it will stop processing	19:06
corvus	clarkb: something like https://review.opendev.org/804177 ?	19:06
clarkb	corvus: ya exactly	19:06
KennyHo[m]	clarkb: I have multiple merger/executors... I am wondering if it went faster on the instances that are warmed up	19:06
*** dviroel|away is now known as dviroel		19:09
opendevreview	Merged zuul/zuul-jobs master: Find (s)testr more reliably https://review.opendev.org/c/zuul/zuul-jobs/+/804280	19:21
KennyHo[m]	Another question... is the merge attempt per change per pipeline? So let's say I have a burst of 100 commits and I have 5 pipelines for zuul, I would get 500 merge attempts?	19:22
Clark[m]	It is at least per event. I'm not sure if the merges happen separately if the event is handled by multiple pipelines	19:24
KennyHo[m]	ok... I am just looking at the log but I am not sure if I am understanding it correctly. I am seeing multiple Merge... complete lines each having a unique id (for the same change.)	19:26
corvus	if the event matches 5 pipelines, there will be 5 merge jobs	19:26
Clark[m]	It probably is per pipeline because dependent pipelines have to construct a pipeline specific git tree	19:26
corvus	i would consider 5 identical periodic pipelines unusual though	19:27
KennyHo[m]	corvus: for this case, they are not periodic pipelines but multiple check pipeline variant	19:28
KennyHo[m]	each doing different kind of gerrit scoring	19:28
KennyHo[m]	I ended up killing everything and restarted both scheduler and executors for the periodic issue	19:30
KennyHo[m]	this reminds me... in the current webUI, there's a "Queue length" at the top left corner of the status page, which queue is that?	19:32
KennyHo[m]	Is that the Gearman queue?	19:33
KennyHo[m]	this is the queue before things get to the pipeline queue. (I am just wondering because someone accidentally checked in a periodic pipeline that was running every minute.... and those events seem to survive zuul restart and I wasn't sure how to clear them.)	19:34
KennyHo[m]	fortunately when those events get to the pipeline queue, they got cleared fairly quickly.	19:35
corvus	they're moving to zk now; we'll have a "delete everything from zk" tool eventually, but we don't have it yet	19:36
KennyHo[m]	corvus: ok... (yea, last few days hasn't been fun... but I guess we were stress testing Zuul a bit? ;))	19:37
corvus	Kenny Ho: well, mergers are a scale-out component for a reason :)	19:38
corvus	obviously if you don't need to do the merges, that's best, but at some point, if you have sustained queues, then adding mergers (and/or executors) is a good way to alleviate it	19:40
fungi	in opendev we run 8 additional mergers beyond those supplied by our 12 executors	19:40
KennyHo[m]	I scaled up to 9 executors	19:40
fungi	but we also don't have projects with thousands of branches, nor any with a history as lengthy as the linux kernel	19:41
KennyHo[m]	what is the reason to scale with mergers instead of making them all executors?	19:41
fungi	KennyHo[m]: we're able to supply smaller servers to act as additional mergers, since they don't need all the resources you would usually consume with an executor	19:41
corvus	so that initial pipeline processing isn't blocked	19:42
corvus	once an executor gets busy enough, it's not going to perform well as a merger	19:42
KennyHo[m]	fungi: smaller in terms of CPU/mem usage? I'd imagine the storage requirement is the same for merger and executor?	19:42
fungi	but on the sizing front, yes we allocate 8gb ram and 8 cpus to our executors, while the dedicated mergers have only 2gb ram and 2 cpus	19:43
fungi	KennyHo[m]: we actually also use a separate filesystem for /var/lib/zuul on our executors, because workspaces can need additional storage (more bursty than what tends to be consumed by the git caches)	19:46
KennyHo[m]	fungi: yea... I haven't looked into that optimization yet. I have been thinking of using networked storage but I am not sure what the performance trade off is vs local storage	19:48
KennyHo[m]	even though my deployment is done via k8s, I am only using local storage on an nvme right now.	19:48
fungi	depends on the sort of storage you're considering. git operations on remote filesystems could be costly. on remote block devices on the other hand, it would probably depend on the latency and bandwidth to your storage arrays	19:49
fungi	git on nfs would probably be a disaster, git on a filesystem which is formatted on top of iscsi or fibrechannel would probably not be bad as long as you had good connectivity	19:50
fungi	the kernel's filesystem cache also tends to smooth out a lot of that	19:50
KennyHo[m]	I see. Thanks for the tips	19:51
fungi	any time	19:53
clarkb	also those repos are largely ephemeral	20:01
clarkb	so using simple fast local storage is probably the best option	20:01
opendevreview	James E. Blair proposed zuul/zuul master: Add include-branches tenant config option https://review.opendev.org/c/zuul/zuul/+/804177	20:03
opendevreview	James E. Blair proposed zuul/zuul master: Use branchesForRepoState in all cases https://review.opendev.org/c/zuul/zuul/+/804178	20:03
clarkb	I should review that since I suggested it would be a good feature	20:06
clarkb	corvus: would it make sense to allow that filter to be a regex? Then a project like openstack could do something like - ^stable/.*$ - master	20:09
clarkb	then if users ended up having dev branches somehow they would be filtered out?	20:09
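The matching behaviour being proposed can be illustrated with plain grep. This only demonstrates the regex semantics and is not how Zuul implements any branch filter: the branch names are invented, and anchoring "master" as ^master$ is an assumption about how a literal entry would be treated.

```shell
# Made-up branch list: two release branches, master, and personal/dev branches.
branches='master
stable/wallaby
stable/xena
dev/alice/wip
feature/foo'

# Keep only branches matching one of the include patterns.
included=$(printf '%s\n' "$branches" | grep -E -e '^stable/.*$' -e '^master$')

printf '%s\n' "$included"
```

With a filter like this, the personal and feature branches never reach the later processing steps.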
clarkb	KennyHo[m]: in your case is there a fairly static set of branches you do care about or would filtering on a prefix match be useful?	20:10
pabelanger[m]	clarkb: corvus: okay, i think I've figured out my semaphore issue. If I am debugging this correctly, because my check pipeline is processed first it is getting priority to the semaphore lock, over say gate which is second. This seems to be regardless of pipeline precedence.	20:18
clarkb	pabelanger[m]: semaphores are independent of pipelines so ya that can happen	20:18
pabelanger[m]	I'm not sure how to best solve, as our gate pipeline has been starved for 11+ hours: https://dashboard.zuul.ansible.com/t/ansible/status	20:19
clarkb	If you can divide the resources the semaphore controls you could use two different semaphores one per pipeline. That would reduce overall throughput if one queue is empty and the other isn't	20:20
clarkb	pabelanger[m]: is the priority difference due to setting different priorities in the pipelines or just because zuul evaluates them in a specific order at the same priority?	20:20
clarkb	If you aren't already setting specific priorities you could make gate a higher priority (opendev does this, we order by proximity to releases and merges)	20:21
pabelanger[m]	I think it is because check is listed first in the pipeline order	20:21
clarkb	if you are setting priorities you could stop or flip the priorities	20:21
pabelanger[m]	if I moved gate to be the first one, I think it would work as expected	20:21
pabelanger[m]	but gate is already higher	20:21
pabelanger[m]	https://github.com/ansible/project-config/blob/master/zuul.d/pipelines.yaml	20:22
clarkb	ya looks like you have precedence high on gate and normal on check	20:23
clarkb	but check is listed in the file before gate	20:23
pabelanger[m]	we really only do that to control render order on https://dashboard.zuul.ansible.com/t/ansible/status	20:23
clarkb	I wonder if zuul should order the pipelines in its processing list based on precedence too	20:24
clarkb	though that might impact the ui?	20:24
pabelanger[m]	possible, I can look into it	20:24
*** dviroel is now known as dviroel|ruck		20:35
pabelanger[m]	clarkb: https://opendev.org/zuul/zuul/src/branch/master/zuul/scheduler.py#L1555 looks to be why	20:37
clarkb	ya I wonder if we should sort that for loop by the precedence level	20:37
pabelanger[m]	I can work on that, along with a test to expose the semaphore issue	20:44
opendevreview	James E. Blair proposed zuul/zuul master: Add delete-state command to delete everything from ZK https://review.opendev.org/c/zuul/zuul/+/804304	21:10
opendevreview	Paul Belanger proposed zuul/zuul master: Process pipelines based on precedence https://review.opendev.org/c/zuul/zuul/+/804305	21:13
pabelanger[m]	okay, that should sort them. ^ I'll work on a test in the morning	21:14
pabelanger[m]	hopefully nothing booms in testing	21:14
corvus	tristanC: it looks like we're down to 3 kopf changes (then all known issues should be addressed) starting at https://review.opendev.org/800786 if you're interested	21:14
corvus	pabelanger: maybe WIP that until ready?	21:15
pabelanger[m]	yup!	21:16
pabelanger[m]	EOD for today, thanks for help	21:17
corvus	pabelanger: g'night :)	21:20
*** dviroel|ruck is now known as dviroel|ruck|out		21:46
clarkb	fungi: I think https://review.opendev.org/c/zuul/nodepool/+/777641 was proposed to debug the thing with stale node request locks that cause us to restart nodepool	21:47
clarkb	fungi: I haven't seen one of those recently. Do you know if that is still happening?	21:47
fungi	clarkb: i haven't seen one this week, though recently they were complicated by the unrelated zuul bug with the request retries reusing the old request object	21:53
clarkb	ya	21:53
fungi	likely if we are still getting stale node request locks, the retries are masking them since the affected builds eventually get nodes assigned	21:54
clarkb	I think zuul will only retry once the node request has been unlocked though	21:54
clarkb	zuul waits for that state change	21:54
fungi	oh, so it won't give up waiting after some timeout	21:55
clarkb	I don't think so	22:00
clarkb	the negotiation there expects the nodepool side to give up and unlock and hand control back to zuul or succeed and similarly hand back control	22:01
