Monday, 2024-12-09

mnasiadkafungi: actually - is there an option to force a job to be run on raxflex - or is it down to a series of rechecks? ;-)08:03
fricklermnasiadka: we don't have a special label for raxflex. I've set up two holds now with the hope to catch one of the failing jobs there, though08:55
*** ralonsoh_ is now known as ralonsoh09:21
opendevreviewJaromír Wysoglad proposed openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos  https://review.opendev.org/c/openstack/project-config/+/93733909:52
opendevreviewAlbin Vass proposed zuul/zuul-jobs master: prepare-workspace-git: Make it possible to sync a subset of projects  https://review.opendev.org/c/zuul/zuul-jobs/+/93682812:49
opendevreviewMerged openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos  https://review.opendev.org/c/openstack/project-config/+/93733913:29
*** darmach5 is now known as darmach13:30
opendevreviewDr. Jens Harbott proposed zuul/zuul-jobs master: zuul_debug_info: Add /proc/cpuinfo output  https://review.opendev.org/c/zuul/zuul-jobs/+/93737615:24
fricklerfungi: ^^ that's regarding the discussion in #openstack-infra (cc corvus)15:24
clarkbfrickler: do ansible facts not already capture what we need?15:39
clarkbI know they capture cpu count not sure about feature flags or model string15:39
fricklerclarkb: well in the case in question they said cpu count = 1, which shouldn't happen with the CPU=8 flavor we are using, should it?15:40
fricklerhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/host-info.ubuntu-noble.yaml15:40
clarkbfrickler: see my message in #openstack-infra it shouldn't happen but it does and it isn't zuul or ansible's fault15:40
clarkbit is xen's iirc15:40
clarkbor maybe noble's kernel, or one or the other15:40
fricklerso if we know that that may happen, maybe we should do an early discard for that node and let zuul retry?15:41
clarkbthat is one option. I was hoping that the affected parties (this came up in like october iirc) would debug and sort it out15:41
fricklerclarkb: well seems like they didn't, I also don't remember that discussion, do you know if it was here or maybe in #zuul?15:42
clarkbwe just have to be careful with how we do that because <8vcpu is fine; we have 4vcpu in raxflex for example. Might do a specific check of the xen version and the distro version and kick out if they are wrong rather than checking vcpu15:43
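A minimal sketch of the kind of early node check clarkb describes here: inspect the visible CPU count, the distro, and the reported Xen BIOS version, and bail out so the job can be retried on a different node. The file paths, the "4.1." cutoff, and the exit-nonzero behaviour are illustrative assumptions, not an existing OpenDev role.

```python
#!/usr/bin/env python3
# Minimal sketch of the early node sanity check described above: on an
# Ubuntu Noble guest that reports an old Xen BIOS, bail out if only one
# CPU is visible so zuul can retry the job on a different node.
# The paths, the "4.1." cutoff, and exiting nonzero are illustrative
# assumptions, not values taken from an existing OpenDev role.
import os
import sys


def read(path, default=""):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return default


bios_vendor = read("/sys/class/dmi/id/bios_vendor")
bios_version = read("/sys/class/dmi/id/bios_version")
os_release = read("/etc/os-release")
cpus = os.cpu_count() or 0

is_noble = "UBUNTU_CODENAME=noble" in os_release
is_old_xen = bios_vendor == "Xen" and bios_version.startswith("4.1.")

if is_noble and is_old_xen and cpus < 2:
    print(f"Suspect node: {bios_vendor} {bios_version} exposes only "
          f"{cpus} CPU(s); discarding so the job is retried elsewhere")
    sys.exit(1)
```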
clarkbfrickler: it was in #openstack-infra iirc and was a neutron or nova or maybe both problem15:43
frickleralso maybe cardoe might be able to escalate if needed?15:43
clarkbthe jobs assumed they would have more than one vcpu for something (was it numa?) and when they only had one kaboom15:43
cardoeflex isn't using xen15:43
clarkbright this is old rax, and it is a single hypervisor or a few hypervisors, but not all hypervisors in that region, that exhibit this issue15:44
fricklerclarkb: hmm, I only remember the 4cpu vs. 8cpu discussion, going down to 1cpu is new to me15:44
cardoewhat flavor do you use in old rax?15:44
fricklercardoe: flavor-name: 'Performance'15:45
cardoeThe issue here is the underlying CPU?15:46
fungicpu count, yes15:46
fungishould have gotten 8, ended up with only 115:46
clarkbfrickler: looks like this was previously debugged on October 2, 2024 and I discussed it in this channel, but I thought it originated in #openstack-infra; maybe I just haven't found the logs there15:47
clarkbcardoe: yes and in October when I looked at this it seemed specific to Ubuntu Noble booting on hypervisors with a conspicuously older version of Xen than Noble instances with 8 vcpu in the same region15:47
fungibut the api is being suuuuper slooooow right now, i'm trying to check whether we're somehow ending up with any instances booted on a flavor other than "8 GB Performance"15:47
fungijust to rule that out15:48
frickleroh, that was when I was on PTO and stopped my IRC client. need to read logs15:48
clarkbhttps://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:1315:49
clarkbit was cinder15:49
clarkbI think we forgot to remind cardoe after vacation :)15:50
fricklerseems it is also happening rarely enough so we weren't reminded by more obscure failure reports ;)15:52
clarkbcardoe: in rax-ord host_id: 7eca1835ed13e21e6a6b3c7bba861f314865eb616acfeaf63911026b booted a ubuntu noble VM and presented only a single cpu. Ansible reports ansible_bios_vendor: Xen ansible_bios_version: 4.1.5. Per the IRC logs above it seems when that version is newer we get 8vcpu using the same VM image16:00
clarkbthis is from the logs in https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/ which is a job that ran a couple of days ago16:00
fricklerhmm, so "flavor-name: 'Performance'" does a substring search and combines that with the min-ram condition? sadly nodepool doesn't seem to log the actual flavor it ends up with, but the id should be "performance1-8" according to flavor list16:01
clarkbI really don't think it is a flavor problem based on my previous investigation16:02
clarkbthe memory and disk totals aligned properly.16:02
clarkband a consistent difference was the reported xen version16:02
clarkbso I suspect a bug in xen or the kernel or both16:02
fungi53 more defunct subscriptions disabled automatically on openstack-discuss today, and i've manually disabled another 4 that were causing uncaught bounce reports (the delays compared to saturday's batch seem to be a combination of users who were doing digest deliveries and/or deferred deliveries which took longer to time out)16:13
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729016:23
clarkbLet's see if I've sufficiently modified Gerrit permissions to be allowed to push acl updates as administrator now16:23
clarkbfungi: and I guess it hasn't disabled anyone you'd expect to stay enabled this time around?16:24
funginope16:24
fricklersadly the host_id seems to be too long for opensearch to find it, but it finds multiple jobs running on different host_ids but with the same bios version, all properly with vcpus=8 though16:24
fungifrickler: under noble specifically?16:25
frickleroh, I didn't check that, let me add a filter16:25
clarkbright this seems to only affect noble16:25
clarkbhttps://gerrit-review.googlesource.com/c/gerrit/+/445064 since this has merged all we need to do is rebuild our gerrit 3.10 image and redeploy it to get the openid redirect fix16:32
fricklerI actually didn't find another noble run during the last 2 weeks16:32
clarkbfrickler: ya that was my problem with the first round of debugging: the sample size was low and the only failure case was the one called out16:32
clarkbbut in all cases of successful noble runs in rax ord the bios version aka xen version was newer than where the failure occurred16:33
clarkbit's probably a very small number of hypervisors so rare to hit16:33
clarkbthe hostid I recorded in the original etherpad is different than the one ykarel found though so probably at least 2 hypervisors16:34
frickleryes, most logs I found seem to be clustered at the periodic pipeline early morning rush16:34
clarkbinfra-root on my post gerrit upgrade todo list is cleaning up autoholds. I don't see anything indicating we should keep those around for additional testing you good with me deleting them at this point?16:42
clarkblooks like I have some local system updates to apply. I'll clean up autoholds when that is done16:42
fungiseems fine to clean those up, i'll also drop my mm3 test hold16:43
tonybclarkb: Sounds good to me16:45
clarkbcool my todo list today is full of followup stuff to the gerrit upgrade. There is also the dib testing situation so that we can land a bugfix and tag a release but I'm not sure I'll get to that today. Might make a good meeting agenda item though16:46
tonybI won't be at the meeting as I'll be landing in Sydney :(16:47
clarkbI've just realized the way I'm editing All-Projects in the test gerrit is to add a new refs/* section for push in the config file. Does anyone know if you can have multiple entries for access refs/* and if they merge or one overrides another?16:53
clarkbfungi: ^ you may know?16:53
fungii've never tried using duplicate section headings16:59
fungicould you run `git config ...` to add them instead?16:59
fungithen it shouldn't wind up with duplicate sections17:00
fungioh, wait, you mean the gerrit config (acl) file, not the git config file?17:00
fungiyeah, without constructing a test scenario around that, i don't know17:01
clarkbya the acl files since I need to add push = +force group Administrators to access refs/*17:09
clarkbit might work to use git config and supply it with a path though /me tries this locally17:09
clarkbof course now I have to figure out how to interact with [access "refs/*"] via git config17:16
clarkbgit config get --file foo.ini --all access.refs/\*.push <- like that17:19
fungioh, cool17:19
clarkband happily --all returns all the values so this probably actually would work by appending additional sections to the acl config (it's not C git but still I expect they use an equivalent to --all since you can have many entries)17:19
clarkbmy test foo.ini had multiple sections too17:20
fungiprobably, yes. also in a pinch we've had success using python's configparser, and then "fixing" the indentation to match what gerrit expects17:20
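A rough sketch of scripting the git-config approach from this exchange: drive `git config --file` with --add/--get-all so multi-valued push entries under [access "refs/*"] are appended rather than overwritten. The file name and the duplicate check are illustrative assumptions, not a finished tool.

```python
# Rough sketch of scripting the git-config approach discussed above:
# use --add/--get-all against the ACL file so multi-valued push entries
# under [access "refs/*"] are appended rather than overwritten.
# The file name and the duplicate check are illustrative assumptions.
import subprocess

ACL_FILE = "project.config"
KEY = "access.refs/*.push"


def get_all(key):
    result = subprocess.run(
        ["git", "config", "--file", ACL_FILE, "--get-all", key],
        capture_output=True, text=True)
    # A missing key exits nonzero with empty output, which is fine here.
    return result.stdout.splitlines()


def add_once(key, value):
    if value not in get_all(key):
        subprocess.run(
            ["git", "config", "--file", ACL_FILE, "--add", key, value],
            check=True)


add_once(KEY, "+force group Administrators")
print(get_all(KEY))
```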
clarkbinfra-root (and maybe corvus in particular) the integrated gate has a number of nova changes with cancelled jobs but no failing jobs in the change with cancelled jobs or the change ahead. Any idea what has happened there? I note the change at the head restarted at least two jobs; I wonder if zuul is detecting that as a gate failure again17:22
clarkb(I know we had a problem with this and fixed it and I thought we had a test to help prevent this regression occurring again)17:23
clarkbthe change status says "failing because at least one job failed" but I suspect that may be because cancelled isn't in the success list17:23
clarkbthe cancelled jobs all cancelled at 14:24:50 which does not line up with when the retried jobs stopped due to failures17:25
clarkbso I don't think it was retries causing this17:25
fungidid nova maybe add configuration to cancel jobs if certain ones fail?17:27
cardoeugh. yeah I don't have a good answer there on those hypervisors.17:30
cardoecloudnull: poke ^17:31
fungiclarkb: yeah, i'm not seeing any indication that nova's zuul config is doing anything unusual at least17:32
clarkb2024-12-09 14:06:54,827 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] <QueueItem 1101548b9a6a4152abe011717bca9767 live for [<Change 0x7f3fa99535d0 openstack/nova 924593,4>] in gate> is a failing item because ['at least one job failed'] <- from zuul01 is where zuul seems to have first noticed it needs to do something17:32
fungiwhich i guess was the timed_out build for nova-tox-functional-py312?17:33
fungithe buildset for the parent change does have two retried jobs, but neither of them ended at the time that the jobs on the child change got cancelled17:36
clarkbfungi: where do you see a timed out build?17:36
clarkboh!17:37
fungiopenstack/nova 924593,4 has nova-tox-functional-py312 timed_out17:37
clarkbthat is it. The color choices here make it too easy for me to skim and see it as cancelled too17:37
clarkbok that explains it17:37
fungibut the cancelled jobs don't depend on that job do they?17:37
clarkbthat particular job timed out which cancelled the others. Then the changes behind it all depend on 924593 so were also cancelled17:37
clarkbfungi: they shouldn't so ya nova may be failing fast?17:37
funginot that i can find in nova's .zuul.yaml no17:38
fungibut maybe i'm not looking for the right thing17:38
clarkbthe changes behind all depend on the timed out change so the other changes not reenqueing is expected17:38
clarkbso the only question is why cancel all of the jobs in 924593 and its children due to a single timed out failure17:38
clarkboctavia and designate set fail-fast to true but they also have their own queues17:40
clarkbI wonder if it could be set on a stable branch so not showing up in codesearch or if some other project sets it and it affects the entire integrated gate queue?17:41
corvusi agree the color is problematic17:42
fungi14:05:25 is when the log for that build says it timed out, the cancelled builds all have completion times of 14:24:50 which is nearly 20 minutes. i'm not convinced the timeout correlates to the cancels17:44
fungithough i suppose if it hit during a reconfigure or something then event processing could have been delayed17:44
corvusreconfigs don't take that long these days (like 30s or so)17:45
clarkbit does start cancelling jobs almost immediately in the log17:46
clarkb2024-12-09 14:06:54,843 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] Cancel jobs for <QueueItem b2823b0facbc4f9b912c1a9488466d9c live for [<Change 0x7f3fa81e4a10 openstack/nova 924594,4>] in gate>17:46
clarkbhowever that appears to be for all the children not the change itself17:47
clarkbignoring the unexplained behavior for 924593 it's almost like we cancel all of the builds for its children in preparation for a reenqueue but then we can't reenqueue because the parent is broken17:49
clarkbI wonder if that is a regression from old behavior where things would keep running in their own fork of the subway graph previously17:50
fungiwould they be reenqueued though if a required parent is failing?17:50
corvus592 was failing at one point17:50
corvus2024-12-09 14:24:38,995 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] <QueueItem 9e0478f1cb0d49ff9dfc05304de1fab7 live for [<Change 0x7f3fa80c6c50 openstack/nova 924592,4>] in gate> is a failing item because ['at least one job failed']17:51
clarkboh interesting so maybe this is related to the retries?17:51
corvusit... got better?  not quite dead yet?  maybe related to retries17:51
clarkband that timestamp is much closer to where we see the 593 cancelations17:51
clarkbso the theory: 594 and beyond cancel when 593 has the timeout job but we don't actually cancel anything on 593 until 592 is considered failing17:52
corvushttps://zuul.opendev.org/t/openstack/buildset/e0429ce698fa427b9dfdc2ed2682b578 is the current 592 buildset17:52
clarkbI think there are two problems here first is why was 592 considered failing if it isn't right now and second why don't we let the jobs run to completion (if only on a fork of the subway graph)17:53
fungicould https://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8 have been in early failure detection and then hit a different condition which zuul decided should qualify it for a retry instead?17:54
clarkboh that is an interesting theory. It would explain why the timestamps don't initially appear to line up17:55
fungihttps://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8/log/job-output.txt#3494517:55
fungibingo17:55
fungi2024-12-09 14:24:37.761281 | controller | Early failure in job, matched regex "\{\d+\} (.*?) \[[\d\.]+s\] \.\.\. FAILED"17:56
fungiat least the first half of the theory is confirmed to correlate timewise17:56
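For reference, the early-failure regex fungi quoted is a stestr-style FAILED matcher; a minimal standalone check against a made-up test line is sketched below. Only the regex itself comes from the log.

```python
# Minimal standalone check of the early-failure regex quoted above
# against a stestr-style output line.  The sample test name is made up;
# only the regex itself comes from the log.
import re

pattern = re.compile(r"\{\d+\} (.*?) \[[\d\.]+s\] \.\.\. FAILED")
sample = "{2} tempest.api.compute.servers.test_foo.TestFoo.test_bar [12.345s] ... FAILED"

match = pattern.search(sample)
assert match is not None
print(match.group(1))  # -> tempest.api.compute.servers.test_foo.TestFoo.test_bar
```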
clarkbyup it failed in tempest then later the host goes unreachable17:56
clarkbso this is a corner case where unreachable hosts override early failure detection for retries (maybe it shouldn't? haven't thought that through)17:56
fungiright, so maybe as an optimization zuul should not retry builds which indicate early failure17:57
corvusi think this is as expected/documented17:57
clarkband then we also have to understand why the jobs behind aren't allowed to continue to completion17:57
corvuswe've always canceled jobs behind the failing change17:57
corvusonce we identify a failing change, (even I think) it's a waste of resources to keep running the jobs behind it17:57
clarkbcorvus: I thought we let them run to completion on a fork of the subway graph. But maybe I'm misremembering17:57
clarkback17:58
corvuswe do keep running the jobs for the failing change though17:58
fungiand apparently even sometimes retry them17:58
corvus(which, in this case, is counter-inuitively 592 not 593)17:58
fungias was the case here17:58
clarkbso probably the main things to think about are changing the timeout color (that ended up not being directly related but was still confusing) and then brainstorm if early failure detection wins over retries17:58
clarkbhowever we may have failed early due to something that should be retried so I'm not sure what the right behavior is there17:59
corvusyeah... i think we did tweak something recently to make early failure less overridable, lemme look that up17:59
clarkbtonyb: you said we don't need your old autohold for gerrit stuff right?18:00
clarkbtonyb: I can clean that up along with my autoholds if that is the case. I'm like 95% certain it is so may clean it up anyway18:00
fungiyeah, either early failure could be due to something that later ends in a retryable condition, in which case the depending items shouldn't get their builds cancelled because the failing job may still succeed on retry; or the early failure was for a legitimate reason, in which case we shouldn't retry the build and should cancel the children because the failing item is not going to merge18:01
fungiso the fault in this case could be seen as either deciding to retry the failing build, or cancelling the builds for subsequent items when we weren't actually sure the first item wouldn't merge18:03
corvushttps://review.opendev.org/c/zuul/zuul/+/892247 is what i was thinking of18:03
corvusthat means once early failure hits, success can never happen18:03
corvusmaking the host unreachable is a neat workaround ;)18:04
fungiokay, so this is an unexpected loophole18:04
clarkbok so the intended behavior is that early failure detection would keep the build in a failed sate18:04
clarkb*state18:05
clarkbso it is a bug18:05
corvusi'm leaning towards saying that because of the reasoning in 892247 we should close the loophole and make early failure override the unreachable retry18:05
fungii concur18:05
clarkbI feel like that may not be the most correct in every situation but I think it would be easiest to understand18:05
clarkband most consistent18:05
fungiagreed on both points18:05
clarkb(then you adjust your early failure detection criteria to not overmatch if indeed it was caused by unreachable fallout)18:05
corvusyep.  it's possible an unreachable host could cause an early failure detection, but we're kind of trying to avoid having the nnfi algorithm flop around.18:06
clarkb++18:07
corvusfor regex-based early failures, it is, as clarkb said, a matter of not overmatching.  for ansible-based failures, i think we should get the unreachable signal before or instead of getting the early failure signal, so it should be okay.  but that's a thing to double check with any change we make here.18:08
fungii need to disappear for an eye exam, might be a while since that office seems to always run chronically behind schedule. hoping to be back by 20:00z but it could be later18:11
clarkbhave fun!18:11
fungii'm sure i will18:12
clarkbmy gerrit 3.11 testing failed on docker hub rate limits trying to pull mariadb18:36
clarkbthis is me thinking out loud: I wonder if the huge influx of kolla jobs is basically a surefire way to push as many IPs as possible over that limit18:36
clarkbthere are several kolla changes up, many of which are running >60 jobs and probably on the order of 100 nodes apiece.18:36
clarkbgiving me the opportunity to code review my own work before pushing up a new patchset to fix some obvious bugs18:40
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729018:41
clarkbsince 937290 rebuilds our gerrit images I won't make a special change to rebuild them to pick up the openid fix18:51
clarkbI should probably hold a node and just double check openid login works on 3.10 like it did with 3.9 after my update18:51
clarkbbut one step at a time18:51
clarkband I'm not too worried about it; I don't think that code changes much between releases and my edit was extremely small and tested on 3.918:52
clarkbweird, the latest patchset hit a bazelisk build failure with js files for the ui apparently not being present. I wonder if the underlying js build failed somehow and bazel doesn't notice until it tries to copy the results19:11
clarkbthe image builds just an hour before passed so unlikely to be a problem with what we are doing but could be an upstream issue. We'll see how the 3.11 build goes19:12
clarkb3.11 built just fine...19:21
fungiokay, back earlier than expected from the eye doctor19:26
clarkbdid they hit your eyeballs with the puffs of air and then blast them with very bright green light to take pictures?19:27
fungii think they did the contact pressure test because they got in "really close" with the gage, plus i don't remember any puffs of air and i expect i would have noticed them19:28
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729019:38
clarkbI fixed ^ for http previously but not for ssh. But progress is being made19:38
fungii suspect we're going to need similar workarounds for git-review's integration testing19:48
clarkbyes and the gerritlib integration testing which I pushed a change up to pin to 3.10 for now19:53
clarkbfwiw I have confirmed that gitea is still suffering OOM issues21:28
clarkbno movement on https://github.com/go-gitea/gitea/issues/31565 for several weeks too21:29
clarkbI'm waiting on CI results for 937290 just to be sure I don't need to push any further edits then I'm going to take advantage of the fact it isn't raining today to go for a bike ride21:51
clarkbarg hit the mariadb rate limit again21:58
clarkbmaybe this is a sign to do something else for a bit21:58
clarkbok rechecked again maybe I'll get lucky. Going to go on a bike ride while I wait. Reminder to let me know of edits to the meeting agenda or make them directly too22:01
clarkbI'll get that sent out when I get back22:01
fungihave fun! i'll try to keep an eye on it22:01
clarkbfungi: looking at the meeting agenda I think this week is a good one to try the db server purging22:01
fungidb server purging...22:02
fungiremind me22:02
Clark[m]Sorry backup server not db22:16
Clark[m]And now I'm afk for a bit22:16
fricklerClark[m]: maybe mention the rax vcpus=1 issue for the agenda?22:40
frickleralso we have a kind of workaround-ish fix for the raxflex disk label issue in kolla now, just fyi https://review.opendev.org/c/openstack/kolla/+/937345 still needs backports and a similar patch for k-a22:41
fungii agree this week would be good for the backup purge23:11
