Monday, 2024-12-09

mnasiadkafungi: actually - is there an option to force a job to be run on raxflex - or is it down to a series of rechecks? ;-)08:03
fricklermnasiadka: we don't have a special label for raxflex. I've set up two holds now with the hope to catch one of the failing jobs there, though08:55
*** ralonsoh_ is now known as ralonsoh09:21
opendevreviewJaromír Wysoglad proposed openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos  https://review.opendev.org/c/openstack/project-config/+/93733909:52
opendevreviewAlbin Vass proposed zuul/zuul-jobs master: prepare-workspace-git: Make it possible to sync a subset of projects  https://review.opendev.org/c/zuul/zuul-jobs/+/93682812:49
opendevreviewMerged openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos  https://review.opendev.org/c/openstack/project-config/+/93733913:29
*** darmach5 is now known as darmach13:30
opendevreviewDr. Jens Harbott proposed zuul/zuul-jobs master: zuul_debug_info: Add /proc/cpuinfo output  https://review.opendev.org/c/zuul/zuul-jobs/+/93737615:24
fricklerfungi: ^^ that's regarding the discussion in #openstack-infra (cc corvus)15:24
clarkbfrickler: do ansible facts not already capture what we need?15:39
clarkbI know they capture cpu count not sure about feature flags or model string15:39
fricklerclarkb: well in the case in question they said cpu count = 1, which shouldn't happen with the CPU=8 flavor we are using, should it?15:40
fricklerhttps://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/host-info.ubuntu-noble.yaml15:40
clarkbfrickler: see my message in #openstack-infra it shouldn't happen but it does and it isn't zuul or ansible's fault15:40
clarkbit is xen's iirc15:40
clarkbor maybe noble's kernel, or one or the other15:40
fricklerso if we know that that may happen, maybe we should do an early discard for that node and let zuul retry?15:41
clarkbthat is one option. I was hoping that the affected parties (this came up in like october iirc) would debug and sort it out15:41
fricklerclarkb: well seems like they didn't, I also don't remember that discussion, do you know if it was here or maybe in #zuul?15:42
clarkbwe just have to be careful with how we do that because <8vcpu is fine; we have 4vcpu in raxflex for example. Might do a specific check of the xen version and the distro version and kick out if they are wrong rather than checking vcpu15:43
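A minimal sketch of the kind of early node check clarkb describes here: inspect the visible CPU count, the distro, and the reported Xen BIOS version, and bail out so the job can be retried on a different node. The file paths, the "4.1." cutoff, and the exit-nonzero behaviour are illustrative assumptions, not an existing OpenDev role.

```python
#!/usr/bin/env python3
# Minimal sketch of the early node sanity check described above: on an
# Ubuntu Noble guest that reports an old Xen BIOS, bail out if only one
# CPU is visible so zuul can retry the job on a different node.
# The paths, the "4.1." cutoff, and exiting nonzero are illustrative
# assumptions, not values taken from an existing OpenDev role.
import os
import sys


def read(path, default=""):
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return default


bios_vendor = read("/sys/class/dmi/id/bios_vendor")
bios_version = read("/sys/class/dmi/id/bios_version")
os_release = read("/etc/os-release")
cpus = os.cpu_count() or 0

is_noble = "UBUNTU_CODENAME=noble" in os_release
is_old_xen = bios_vendor == "Xen" and bios_version.startswith("4.1.")

if is_noble and is_old_xen and cpus < 2:
    print(f"Suspect node: {bios_vendor} {bios_version} exposes only "
          f"{cpus} CPU(s); discarding so the job is retried elsewhere")
    sys.exit(1)
```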
clarkbfrickler: it was in #openstack-infra iirc and was a neutron or nova or maybe both problem15:43
frickleralso maybe cardoe might be able to escalate if needed?15:43
clarkbthe jobs assumed they would have more than one vcpu for something (was it numa?) and when they only had one kaboom15:43
cardoeflex isn't using xen15:43
clarkbright this is old rax, and it is a single hypervisor or a few hypervisors, but not all hypervisors in that region, that exhibit this issue15:44
fricklerclarkb: hmm, I only remember the 4cpu vs. 8cpu discussion, going down to 1cpu is new to me15:44
cardoewhat flavor do you use in old rax?15:44
fricklercardoe: flavor-name: 'Performance'15:45
cardoeThe issue here is the underlying CPU?15:46
fungicpu count, yes15:46
fungishould have gotten 8, ended up with only 115:46
clarkbfrickler: looks like this was previously debugged on October 2, 2024 and I discussed it in this channel, but I thought it originated in #openstack-infra; maybe I just haven't found the logs there15:47
clarkbcardoe: yes and in October when I looked at this it seemed specific to Ubuntu Noble booting on hypervisors with a conspicuously older version of Xen than Noble instances with 8 vcpu in the same region15:47
fungibut the api is being suuuuper slooooow right now, i'm trying to check whether we're somehow ending up with any instances booted on a flavor other than "8 GB Performance"15:47
fungijust to rule that out15:48
frickleroh, that was when I was on PTO and stopped my IRC client. need to read logs15:48
clarkbhttps://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:1315:49
clarkbit was cinder15:49
clarkbI think we forgot to remind cardoe after vacation :)15:50
fricklerseems it is also happening rarely enough so we weren't reminded by more obscure failure reports ;)15:52
clarkbcardoe: in rax-ord host_id: 7eca1835ed13e21e6a6b3c7bba861f314865eb616acfeaf63911026b booted a ubuntu noble VM and presented only a single cpu. Ansible reports ansible_bios_vendor: Xen ansible_bios_version: 4.1.5. Per the IRC logs above it seems when that version is newer we get 8vcpu using the same VM image16:00
clarkbthis is from the logs in https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/ which is a job that ran a couple of days ago16:00
fricklerhmm, so "flavor-name: 'Performance'" does a substring search and combines that with the min-ram condition? sadly nodepool doesn't seem to log the actual flavor it ends up with, but the id should be "performance1-8" according to flavor list16:01
clarkbI really don't think it is a flavor problem based on my previous investigation16:02
clarkbthe memory and disk totals aligned properly.16:02
clarkband a consistent difference was the reported xen version16:02
clarkbso I suspect a bug in xen or the kernel or both16:02
fungi53 more defunct subscriptions disabled automatically on openstack-discuss today, and i've manually disabled another 4 that were causing uncaught bounce reports (the delays compared to saturday's batch seem to be a combination of users who were doing digest deliveries and/or deferred deliveries which took longer to time out)16:13
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729016:23
clarkbLet's see if I've sufficiently modified Gerrit permissions to be allowed to push acl updates as administrator now16:23
clarkbfungi: and I guess it hasn't disabled anyone you'd expect to stay enabled this time around?16:24
funginope16:24
fricklersadly the host_id seems to be too long for opensearch to find it, but it finds multiple jobs running on different host_ids but with the same bios version, all properly with vcpus=8 though16:24
fungifrickler: under noble specifically?16:25
frickleroh, I didn't check that, let me add a filter16:25
clarkbright this seems to only affect noble16:25
clarkbhttps://gerrit-review.googlesource.com/c/gerrit/+/445064 since this has merged all we need to do is rebuild our gerrit 3.10 image and redeploy it to get the openid redirect fix16:32
fricklerI actually didn't find another noble run during the last 2 weeks16:32
clarkbfrickler: ya that was my problem with the first round of debugging: the sample size was low and the only failure case was the one called out16:32
clarkbbut in all cases of successful noble runs in rax ord the bios version aka xen version was newer than where the failure occurred16:33
clarkbit's probably a very small number of hypervisors so rare to hit16:33
clarkbthe hostid I recorded in the original etherpad is different than the one ykarel found though so probably at least 2 hypervisors16:34
frickleryes, most logs I found seem to be clustered at the periodic pipeline early morning rush16:34
clarkbinfra-root on my post gerrit upgrade todo list is cleaning up autoholds. I don't see anything indicating we should keep those around for additional testing you good with me deleting them at this point?16:42
clarkblooks like I have some local system updates to apply. I'll clean up autoholds when that is done16:42
fungiseems fine to clean those up, i'll also drop my mm3 test hold16:43
tonybclarkb: Sounds good to me16:45
clarkbcool my todo list today is full of followup stuff to the gerrit upgrade. There is also the dib testing situation so that we can land a bugfix and tag a release but I'm not sure I'll get to that today. Might make a good meeting agenda item though16:46
tonybI won't be at the meeting as I'll be landing in Sydney :(16:47
clarkbI've just realized the way I'm editing All-Projects in the test gerrit is to add a new refs/* section for push in the config file. Does anyone know if you can have multiple entries for access refs/* and if they merge or one overrides another?16:53
clarkbfungi: ^ you may know?16:53
fungii've never tried using duplicate section headings16:59
fungicould you run `git config ...` to add them instead?16:59
fungithen it shouldn't wind up with duplicate sections17:00
fungioh, wait, you mean the gerrit config (acl) file, not the git config file?17:00
fungiyeah, without constructing a test scenario around that, i don't know17:01
clarkbya the acl files since I need to add push = +force group Administrators to access refs/*17:09
clarkbit might work to use git config and supply it with a path though /me tries this locally17:09
clarkbof course now I have to figure out how to interact with [access "refs/*"] via git config17:16
clarkbgit config get --file foo.ini --all access.refs/\*.push <- like that17:19
fungioh, cool17:19
clarkband happily --all returns all the values so this probably actually would work by appending additional sections to the acl config (it's not C git but still I expect they use an equivalent to --all since you can have many entries)17:19
clarkbmy test foo.ini had multiple sections too17:20
fungiprobably, yes. also in a pinch we've had success using python's configparser, and then "fixing" the indentation to match what gerrit expects17:20
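A rough sketch of scripting the git-config approach from this exchange: drive `git config --file` with --add/--get-all so multi-valued push entries under [access "refs/*"] are appended rather than overwritten. The file name and the duplicate check are illustrative assumptions, not a finished tool.

```python
# Rough sketch of scripting the git-config approach discussed above:
# use --add/--get-all against the ACL file so multi-valued push entries
# under [access "refs/*"] are appended rather than overwritten.
# The file name and the duplicate check are illustrative assumptions.
import subprocess

ACL_FILE = "project.config"
KEY = "access.refs/*.push"


def get_all(key):
    result = subprocess.run(
        ["git", "config", "--file", ACL_FILE, "--get-all", key],
        capture_output=True, text=True)
    # A missing key exits nonzero with empty output, which is fine here.
    return result.stdout.splitlines()


def add_once(key, value):
    if value not in get_all(key):
        subprocess.run(
            ["git", "config", "--file", ACL_FILE, "--add", key, value],
            check=True)


add_once(KEY, "+force group Administrators")
print(get_all(KEY))
```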
clarkbinfra-root (and maybe corvus in particular) the integrated gate has a number of nova changes with cancelled jobs but no failing jobs in the change with cancelled jobs or the change ahead. Any idea what has happened there? I note the change at the head restarted at least two jobs; I wonder if zuul is detecting that as a gate failure again17:22
clarkb(I know we had a problem with this and fixed it and I thought we had a test to help prevent this regression occurring again)17:23
clarkbthe change status says "failing because at least one job failed" but I suspect that may be because cancelled isn't in the success list17:23
clarkbthe cancelled jobs all cancelled at 14:24:50 which does not line up with when the retried jobs stopped due to failures17:25
clarkbso I don't think it was retries causing this17:25
fungidid nova maybe add configuration to cancel jobs if certain ones fail?17:27
cardoeugh. yeah I don't have a good answer there on those hypervisors.17:30
cardoecloudnull: poke ^17:31
fungiclarkb: yeah, i'm not seeing any indication that nova's zuul config is doing anything unusual at least17:32
clarkb2024-12-09 14:06:54,827 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] <QueueItem 1101548b9a6a4152abe011717bca9767 live for [<Change 0x7f3fa99535d0 openstack/nova 924593,4>] in gate> is a failing item because ['at least one job failed'] <- from zuul01 is where zuul seems to have first noticed it needs to do something17:32
fungiwhich i guess was the timed_out build for nova-tox-functional-py312?17:33
fungithe buildset for the parent change does have two retried jobs, but neither of them ended at the time that the jobs on the child change got cancelled17:36
clarkbfungi: where do you see a timed out build?17:36
clarkboh!17:37
fungiopenstack/nova 924593,4 has nova-tox-functional-py312 timed_out17:37
clarkbthat is it. The color choices here make it too easy for me to skim and see it as cancelled too17:37
clarkbok that explains it17:37
fungibut the cancelled jobs don't depend on that job do they?17:37
clarkbthat particular job timed out which cancelled the others. Then the changes behind it all depend on 924593 so were also cancelled17:37
clarkbfungi: they shouldn't so ya nova may be failing fast?17:37
funginot that i can find in nova's .zuul.yaml no17:38
fungibut maybe i'm not looking for the right thing17:38
clarkbthe changes behind all depend on the timed out change so the other changes not reenqueing is expected17:38
clarkbso the only question is why cancel all of the jobs in 924593 and its children due to a single timed out failure17:38
clarkboctavia and designate set fail-fast to true but they also have their own queues17:40
clarkbI wonder if it could be set on a stable branch so not showing up in codesearch or if some other project sets it and it affects the entire integrated gate queue?17:41
corvusi agree the color is problematic17:42
fungi14:05:25 is when the log for that build says it timed out, the cancelled builds all have completion times of 14:24:50 which is nearly 20 minutes. i'm not convinced the timeout correlates to the cancels17:44
fungithough i suppose if it hit during a reconfigure or something then event processing could have been delayed17:44
corvusreconfigs don't take that long these days (like 30s or so)17:45
clarkbit does start cancelling jobs almost immediately in the log17:46
clarkb2024-12-09 14:06:54,843 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] Cancel jobs for <QueueItem b2823b0facbc4f9b912c1a9488466d9c live for [<Change 0x7f3fa81e4a10 openstack/nova 924594,4>] in gate>17:46
clarkbhowever that appears to be for all the children not the change itself17:47
clarkbignoring the unexplained behavior for 924593 it's almost like we cancel all of the builds for its children in preparation for a reenqueue but then we can't reenqueue because the parent is broken17:49
clarkbI wonder if that is a regression from old behavior where things would keep running in their own fork of the subway graph previously17:50
fungiwould they be reenqueued though if a required parent is failing?17:50
corvus592 was failing at one point17:50
corvus2024-12-09 14:24:38,995 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] <QueueItem 9e0478f1cb0d49ff9dfc05304de1fab7 live for [<Change 0x7f3fa80c6c50 openstack/nova 924592,4>] in gate> is a failing item because ['at least one job failed']17:51
clarkboh interesting so maybe this is related to the retries?17:51
corvusit... got better?  not quite dead yet?  maybe related to retries17:51
clarkband that timestamp is much closer to where we see the 593 cancelations17:51
clarkbso the theory: 594 and beyond cancel when 593 has the timeout job but we don't actually cancel anything on 593 until 592 is considered failing17:52
corvushttps://zuul.opendev.org/t/openstack/buildset/e0429ce698fa427b9dfdc2ed2682b578 is the current 592 buildset17:52
clarkbI think there are two problems here first is why was 592 considered failing if it isn't right now and second why don't we let the jobs run to completion (if only on a fork of the subway graph)17:53
fungicould https://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8 have been in early failure detection and then hit a different condition which zuul decided should qualify it for a retry instead?17:54
clarkboh that is an interesting theory. It would explain why the timestamps don't initially appear to line up17:55
fungihttps://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8/log/job-output.txt#3494517:55
fungibingo17:55
fungi2024-12-09 14:24:37.761281 | controller | Early failure in job, matched regex "\{\d+\} (.*?) \[[\d\.]+s\] \.\.\. FAILED"17:56
fungiat least the first half of the theory is confirmed to correlate timewise17:56
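For reference, the early-failure regex fungi quoted is a stestr-style FAILED matcher; a minimal standalone check against a made-up test line is sketched below. Only the regex itself comes from the log.

```python
# Minimal standalone check of the early-failure regex quoted above
# against a stestr-style output line.  The sample test name is made up;
# only the regex itself comes from the log.
import re

pattern = re.compile(r"\{\d+\} (.*?) \[[\d\.]+s\] \.\.\. FAILED")
sample = "{2} tempest.api.compute.servers.test_foo.TestFoo.test_bar [12.345s] ... FAILED"

match = pattern.search(sample)
assert match is not None
print(match.group(1))  # -> tempest.api.compute.servers.test_foo.TestFoo.test_bar
```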
clarkbyup it failed in tempest then later the host goes unreachable17:56
clarkbso this is a corner case where unreachable hosts override early failure detection for retries (maybe it shouldn't? haven't thought that through)17:56
fungiright, so maybe as an optimization zuul should not retry builds which indicate early failure17:57
corvusi think this is as expected/documented17:57
clarkband then we also have to understand why the jobs behind aren't allowed to continue to completion17:57
corvuswe've always canceled jobs behind the failing change17:57
corvusonce we identify a failing change, (even I think) it's a waste of resources to keep running the jobs behind it17:57
clarkbcorvus: I thought we let them run to completion on a fork of the subway graph. But maybe I'm misremembering17:57
clarkback17:58
corvuswe do keep running the jobs for the failing change though17:58
fungiand apparently even sometimes retry them17:58
corvus(which, in this case, is counter-inuitively 592 not 593)17:58
fungias was the case here17:58
clarkbso probably the main things to think about are changing the timeout color (that ended up not being directly related but was still confusing) and then brainstorm if early failure detection wins over retries17:58
clarkbhowever we may have failed early due to something that should be retried so I'm not sure what the right behavior is there17:59
corvusyeah... i think we did tweak something recently to make early failure less overridable, lemme look that up17:59
clarkbtonyb: you said we don't need your old autohold for gerrit stuff right?18:00
clarkbtonyb: I can clean that up along with my autoholds if that is the case. I'm like 95% certain it is so may clean it up anyway18:00
fungiyeah, either early failure could be due to something that later ends in a retryable condition, in which case the depending items shouldn't get their builds cancelled because the failing job may still succeed on retry; or the early failure was for a legitimate reason, in which case we shouldn't retry the build and should cancel the children because the failing item is not going to merge18:01
fungiso the fault in this case could be seen as either deciding to retry the failing build, or cancelling the builds for subsequent items when we weren't actually sure the first item wouldn't merge18:03
corvushttps://review.opendev.org/c/zuul/zuul/+/892247 is what i was thinking of18:03
corvusthat means once early failure hits, success can never happen18:03
corvusmaking the host unreachable is a neat workaround ;)18:04
fungiokay, so this is an unexpected loophole18:04
clarkbok so the intended behavior is that early failure detection would keep the build in a failed sate18:04
clarkb*state18:05
clarkbso it is a bug18:05
corvusi'm leaning towards saying that because of the reasoning in 892247 we should close the loophole and make early failure override the unreachable retry18:05
fungii concur18:05
clarkbI feel like that may not be the most correct in every situation but I think it would be easiest to understand18:05
clarkband most consistent18:05
fungiagreed on both points18:05
clarkb(then you adjust your early failure detection criteria to not overmatch if indeed it was caused by unreachable fallout)18:05
corvusyep.  it's possible an unreachable host could cause an early failure detection, but we're kind of trying to avoid having the nnfi algorithm flop around.18:06
clarkb++18:07
corvusfor regex-based early failures, it is, as clarkb said, a matter of not overmatching.  for ansible-based failures, i think we should get the unreachable signal before or instead of getting the early failure signal, so it should be okay.  but that's a thing to double check with any change we make here.18:08
fungii need to disappear for an eye exam, might be a while since that office seems to always run chronically behind schedule. hoping to be back by 20:00z but it could be later18:11
clarkbhave fun!18:11
fungii'm sure i will18:12
clarkbmy gerrit 3.11 testing failed on docker hub rate limits trying to pull mariadb18:36
clarkbthis is me thinking out loud: I wonder if the huge influx of kolla jobs is basically a surefire way to push as many IPs as possible over that limit18:36
clarkbthere are several kolla changes up, many of which are running >60 jobs and probably on the order of 100 nodes apiece.18:36
clarkbgiving me the opportunity to code review my own work before pushing up a new patchset to fix some obvious bugs18:40
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729018:41
clarkbsince 937290 rebuilds our gerrit images I won't make a special change to rebuild them to pick up the openid fix18:51
clarkbI should probably hold a node and just double check openid login works on 3.10 like it did with 3.9 after my update18:51
clarkbbut one step at a time18:51
clarkband I'm not too worried about it; I don't think that code changes much between releases and my edit was extremely small and tested on 3.918:52
clarkbweird, the latest patchset hit a bazelisk build failure with js files for the ui apparently not being present. I wonder if the underlying js build failed somehow and bazel doesn't notice until it tries to copy the results19:11
clarkbthe image builds just an hour before passed so unlikely to be a problem with what we are doing but could be an upstream issue. We'll see how the 3.11 build goes19:12
clarkb3.11 built just fine...19:21
fungiokay, back earlier than expected from the eye doctor19:26
clarkbdid they hit your eyeballs with the puffs of air and then blast them with very bright green light to take pictures?19:27
fungii think they did the contact pressure test because they got in "really close" with the gage, plus i don't remember any puffs of air and i expect i would have noticed them19:28
opendevreviewClark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing  https://review.opendev.org/c/opendev/system-config/+/93729019:38
clarkbI fixed ^ for http previously but not for ssh. But progress is being made19:38
fungii suspect we're going to need similar workarounds for git-review's integration testing19:48
clarkbyes and the gerritlib integration testing which I pushed a change up to pin to 3.10 for now19:53
clarkbfwiw I have confirmed that gitea is still suffering OOM issues21:28
clarkbno movement on https://github.com/go-gitea/gitea/issues/31565 for several weeks too21:29
clarkbI'm waiting on CI results for 937290 just to be sure I don't need to push any further edits then I'm going to take advantage of the fact it isn't raining today to go for a bike ride21:51
clarkbarg hit the mariadb rate limit again21:58
clarkbmaybe this is a sign to do something else for a bit21:58
clarkbok rechecked again maybe I'll get lucky. Going to go on a bike ride while I wait. Reminder to let me know of edits to the meeting agenda or make them directly too22:01
clarkbI'll get that sent out when I get back22:01
fungihave fun! i'll try to keep an eye on it22:01
clarkbfungi: looking at the meeting agenda I think this week is a good one to try the db server purging22:01
fungidb server purging...22:02
fungiremind me22:02
Clark[m]Sorry backup server not db22:16
Clark[m]And now I'm afk for a bit22:16
fricklerClark[m]: maybe mention the rax vcpus=1 issue for the agenda?22:40
frickleralso we have a kind of workaround-ish fix for the raxflex disk label issue in kolla now, just fyi https://review.opendev.org/c/openstack/kolla/+/937345 still needs backports and a similar patch for k-a22:41
fungii agree this week would be good for the backup purge23:11
