mnasiadka | fungi: actually - is there an option to force a job to be run on raxflex - or it's down to series of rechecks? ;-) | 08:03 |
---|---|---|
frickler | mnasiadka: we don't have a special label for raxflex. I've set up two holds now with the hope to catch one of the failing jobs there, though | 08:55 |
*** ralonsoh_ is now known as ralonsoh | 09:21 | |
opendevreview | Jaromír Wysoglad proposed openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos https://review.opendev.org/c/openstack/project-config/+/937339 | 09:52 |
opendevreview | Albin Vass proposed zuul/zuul-jobs master: prepare-workspace-git: Make it possible to sync a subset of projects https://review.opendev.org/c/zuul/zuul-jobs/+/936828 | 12:49 |
opendevreview | Albin Vass proposed zuul/zuul-jobs master: prepare-workspace-git: Make it possible to sync a subset of projects https://review.opendev.org/c/zuul/zuul-jobs/+/936828 | 12:49 |
opendevreview | Merged openstack/project-config master: Add openstack-k8s-operators/sg-core to zuul repos https://review.opendev.org/c/openstack/project-config/+/937339 | 13:29 |
*** darmach5 is now known as darmach | 13:30 | |
opendevreview | Dr. Jens Harbott proposed zuul/zuul-jobs master: zuul_debug_info: Add /proc/cpuinfo output https://review.opendev.org/c/zuul/zuul-jobs/+/937376 | 15:24 |
frickler | fungi: ^^ that's regarding the discussion in #openstack-infra (cc corvus) | 15:24 |
clarkb | frickler: do ansible facts not already capture what we need? | 15:39 |
clarkb | I know they capture cpu count not sure about feature flags or model string | 15:39 |
frickler | clarkb: well in the case in question they said cpu count = 1, which shouldn't happen with the CPU=8 flavor we are using, should it? | 15:40 |
frickler | https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/zuul-info/host-info.ubuntu-noble.yaml | 15:40 |
clarkb | frickler: see my message in #openstack-infra it shouldn't happen but it does and it isn't zuul or ansible's fault | 15:40 |
clarkb | it is Xen's iirc | 15:40 |
clarkb | or maybe noble's kernel, one or the other | 15:40 |
frickler | so if we know that that may happen, maybe we should do an early discard for that node and let zuul retry? | 15:41 |
clarkb | that is one option. I was hoping that the affected parties (this came up in like october iirc) would debug and sort it out | 15:41 |
frickler | clarkb: well seems like they didn't, I also don't remember that discussion, do you know if it was here or maybe in #zuul? | 15:42 |
clarkb | we just have to be careful with how we do that because <8vcpu is fine; we have 4vcpu in raxflex for example. Might do a specific check of the xen version and the distro version and kick out if they are wrong rather than checking vcpu | 15:43 |
clarkb | frickler: it was in #openstack-infra iirc and was a neutron or nova or maybe both problem | 15:43 |
frickler | also maybe cardoe might be able to escalate if needed? | 15:43 |
clarkb | the jobs assumed they would have more than one vcpu for something (was it numa?) and when they only had one kaboom | 15:43 |
cardoe | flex isn't using xen | 15:43 |
clarkb | right, this is old rax, and it's a single hypervisor or a few hypervisors, but not all hypervisors in that region, that exhibit this issue | 15:44 |
frickler | clarkb: hmm, I only remember the 4cpu vs. 8cpu discussion, going down to 1cpu is new to me | 15:44 |
cardoe | what flavor do you use in old rax? | 15:44 |
frickler | cardoe: flavor-name: 'Performance' | 15:45 |
cardoe | The issue here is the underlying CPU? | 15:46 |
fungi | cpu count, yes | 15:46 |
fungi | should have gotten 8, ended up with only 1 | 15:46 |
clarkb | frickler: looks like this was previously debugged on october 2, 2024 and I discussed it in this channel but I thought it originated in #openstack-infra maybe I just haven't found the logs there | 15:47 |
clarkb | cardoe: yes and in october when I looked at this it seemed specific to Ubuntu Noble booting on hypervisors with a conspicuously older version of Xen than Noble instances with 8 vcpu in the same region | 15:47 |
fungi | but the api is being suuuuper slooooow right now, i'm trying to check whether we're somehow ending up with any instances booted on a flavor other than "8 GB Performance" | 15:47 |
fungi | just to rule that out | 15:48 |
frickler | oh, that was when I was on PTO and stopped my IRC client. need to read logs | 15:48 |
clarkb | https://meetings.opendev.org/irclogs/%23opendev/%23opendev.2024-10-02.log.html#t2024-10-02T15:55:13 | 15:49 |
clarkb | it was cinder | 15:49 |
clarkb | I think we forgot to remind cardoe after vacation :) | 15:50 |
frickler | seems it is also happening rarely enough so we weren't reminded by more obscure failure reports ;) | 15:52 |
clarkb | cardoe: in rax-ord host_id: 7eca1835ed13e21e6a6b3c7bba861f314865eb616acfeaf63911026b booted a ubuntu noble VM and presented only a single cpu. Ansible reports ansible_bios_vendor: Xen ansible_bios_version: 4.1.5. Per the IRC logs above it seems when that version is newer we get 8vcpu using the same VM image | 16:00 |
clarkb | this is from the logs in https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_5c1/936807/5/check/openstack-tox-py312/5c1b043/ which is a job that ran a couple of days ago | 16:00 |
frickler | hmm, so "flavor-name: 'Performance'" does a substring search and combines that with the min-ram condition? sadly nodepool doesn't seem to log the actual flavor it ends up with, but the id should be "performance1-8" according to flavor list | 16:01 |
clarkb | I really don't think it is a flavor problem based on my previous investigation | 16:02 |
clarkb | the memory and disk totals aligned properly. | 16:02 |
clarkb | and a consistent difference was the reported xen version | 16:02 |
clarkb | so I suspect a bug in xen or the kernel or both | 16:02 |
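(Editorial aside: below is a minimal Python sketch of the early-discard check frickler and clarkb describe above. The idea of failing a pre-run check so Zuul retries the job on another node comes from the discussion; the threshold, the flagged Xen/BIOS version, and the assumption that it would run as a pre-run task are illustrative, not a deployed check.)

```python
#!/usr/bin/env python3
# Hypothetical pre-run sanity check; values below are illustrative,
# taken from the discussion rather than any deployed configuration.
import os
import sys

MIN_VCPUS = 2                    # assumed floor; raxflex legitimately runs 4 vCPU flavors
SUSPECT_BIOS_VERSION = "4.1.5"   # the Xen/BIOS version reported on the misbehaving host

def bios_version() -> str:
    # Same source ansible_bios_version reads from on Linux guests.
    try:
        with open("/sys/class/dmi/id/bios_version") as f:
            return f.read().strip()
    except OSError:
        return ""

vcpus = os.cpu_count() or 0
if vcpus < MIN_VCPUS and bios_version() == SUSPECT_BIOS_VERSION:
    print(f"suspect node: {vcpus} vCPU(s) on Xen/BIOS {SUSPECT_BIOS_VERSION}")
    sys.exit(1)  # a failing pre-run task would let Zuul retry the job on another node
```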
fungi | 53 more defunct subscriptions disabled automatically on openstack-discuss today, and i've manually disabled another 4 that were causing uncaught bounce reports (the delays compared to saturday's batch seem to be a combination of users who were doing digest deliveries and/or deferred deliveries which took longer to time out) | 16:13 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing https://review.opendev.org/c/opendev/system-config/+/937290 | 16:23 |
clarkb | Let's see if I've sufficiently modified Gerrit permissions to be allowed to push acl updates as administrator now | 16:23 |
clarkb | fungi: and I guess it hasn't disabled anyone you'd expect to stay enabled this time around? | 16:24 |
fungi | nope | 16:24 |
frickler | sadly the host_id seems to be too long for opensearch to find it, but it finds multiple jobs running on different host_ids but with the same bios version, all properly with vcpus=8 though | 16:24 |
fungi | frickler: under noble specifically? | 16:25 |
frickler | oh, I didn't check that, let me add a filter | 16:25 |
clarkb | right this seems to only affect noble | 16:25 |
clarkb | https://gerrit-review.googlesource.com/c/gerrit/+/445064 since this has merged all we need to do is rebuild our gerrit 3.10 image and redeploy it to get the openid redirect fix | 16:32 |
frickler | I actually didn't find another noble run during the last 2 weeks | 16:32 |
clarkb | frickler: ya that was my problem with the first round of debugging the sample size was low and the only failure case was the one called out | 16:32 |
clarkb | but in all cases of successful noble runs in rax ord the bios version aka xen version was newer than where the failure occurred | 16:33 |
clarkb | it's probably a very small number of hypervisors so rare to hit | 16:33 |
clarkb | the hostid I recorded in the original etherpad is different than the one ykarel found though so probably at least 2 hypervisors | 16:34 |
frickler | yes, most logs I found seem to be clustered at the periodic pipeline early morning rush | 16:34 |
clarkb | infra-root on my post gerrit upgrade todo list is cleaning up autoholds. I don't see anything indicating we should keep those around for additional testing you good with me deleting them at this point? | 16:42 |
clarkb | looks like I have some local system updates to apply. I'll clean up autoholds when that is done | 16:42 |
fungi | seems fine to clean those up, i'll also drop my mm3 test hold | 16:43 |
tonyb | clarkb: Sounds good to me | 16:45 |
clarkb | cool my todo list today is full of followup stuff to the gerrit upgrade. There is also the dib testing situation so that we can land a bugfix and tag a release but I'm not sure I'll get to that today. Might make a good meeting agenda item though | 16:46 |
tonyb | I won't be at the meeting as I'll be landing in Sydney :( | 16:47 |
clarkb | I've just realized the way I'm editing All-Projects in the test gerrit is to add a new refs/* section for push in the config file. Does anyone know if you can have multiple entries for access refs/* and if they merge or one overrides another? | 16:53 |
clarkb | fungi: ^ you may know? | 16:53 |
fungi | i've never tried using duplicate section headings | 16:59 |
fungi | could you run `git config ...` to add them instead? | 16:59 |
fungi | then it shouldn't wind up with duplicate sections | 17:00 |
fungi | oh, wait, you mean the gerrit config (acl) file, not the git config file? | 17:00 |
fungi | yeah, without constructing a test scenario around that, i don't know | 17:01 |
clarkb | ya the acl files since I need to add push = +force group Administrators to access refs/* | 17:09 |
clarkb | it might work to use git config and supply it with a path though /me tries this locally | 17:09 |
clarkb | of course now I have to figure out how to interact with [access "refs/*"] via git config | 17:16 |
clarkb | git config get --file foo.ini --all access.refs/\*.push <- like that | 17:19 |
fungi | oh, cool | 17:19 |
clarkb | and happily --all returns all the values so this probably actually would work by appending additional sections to the acl config (it's not C git but still I expect they use an equivalent to --all since you can have many entries) | 17:19 |
clarkb | my test foo.ini had multiple sections too | 17:20 |
fungi | probably, yes. also in a pinch we've had success using python's configparser, and then "fixing" the indentation to match what gerrit expects | 17:20 |
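(Editorial aside: a small sketch of the git-config-driven approach clarkb experiments with above. The `access.<ref>.<permission>` key syntax and the `push = +force group Administrators` value come from the log; the ACL file name and the overall workflow are illustrative.)

```python
#!/usr/bin/env python3
# Sketch: edit a Gerrit ACL file (git-config syntax) with `git config --file`.
import subprocess

ACL = "project.config"  # assumed local copy of the All-Projects ACL file

# Append a push permission under [access "refs/*"]; --add preserves existing values.
subprocess.run(
    ["git", "config", "--file", ACL, "--add",
     "access.refs/*.push", "+force group Administrators"],
    check=True,
)

# Read back every push entry for refs/*; --get-all returns multi-valued keys,
# consistent with the observation above that duplicate sections merge.
result = subprocess.run(
    ["git", "config", "--file", ACL, "--get-all", "access.refs/*.push"],
    check=True, capture_output=True, text=True,
)
print(result.stdout, end="")
```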
clarkb | infra-root (and maybe corvus in particular) the integrated gate has a number of nova changes with cancelled jobs but no failing jobs in the change with cancelled jobs or the change ahead. Any idea what has happened there? I note the change at the head restarted at least two jobs; I wonder if zuul is detecting that as a gate failure again | 17:22 |
clarkb | (I know we had a problem with this and fixed it and I thought we had a test to help prevent this regression occurring again) | 17:23 |
clarkb | the change status says "failing because at least one job failed" but I suspect that may be because cancelled isn't in the success list | 17:23 |
clarkb | the cancelled jobs all cancelled at 14:24:50 which does not line up with when the retried jobs stopped due to failures | 17:25 |
clarkb | so I don't think it was retries causing this | 17:25 |
fungi | did nova maybe add configuration to cancel jobs if certain ones fail? | 17:27 |
cardoe | ugh. yeah I don't have a good answer there on those hypervisors. | 17:30 |
cardoe | cloudnull: poke ^ | 17:31 |
fungi | clarkb: yeah, i'm not seeing any indication that nova's zuul config is doing anything unusual at least | 17:32 |
clarkb | 2024-12-09 14:06:54,827 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] <QueueItem 1101548b9a6a4152abe011717bca9767 live for [<Change 0x7f3fa99535d0 openstack/nova 924593,4>] in gate> is a failing item because ['at least one job failed'] <- from zuul01 is where zuul seems to have first noticed it needs to do something | 17:32 |
fungi | which i guess was the timed_out build for nova-tox-functional-py312? | 17:33 |
fungi | the buildset for the parent change does have two retried jobs, but neither of them ended at the time that the jobs on the child change got cancelled | 17:36 |
clarkb | fungi: where do you see a timed out build? | 17:36 |
clarkb | oh! | 17:37 |
fungi | openstack/nova 924593,4 has nova-tox-functional-py312 timed_out | 17:37 |
clarkb | that is it. The color choices here make it too easy for me to skim and see it as cancelled too | 17:37 |
clarkb | ok that explains it | 17:37 |
fungi | but the cancelled jobs don't depend on that job do they? | 17:37 |
clarkb | that particular job timed out which cancelled the others. Then the changes behind it all depend on 924593 so were also cancelled | 17:37 |
clarkb | fungi: they shouldn't so ya nova may be failing fast? | 17:37 |
fungi | not that i can find in nova's .zuul.yaml no | 17:38 |
fungi | but maybe i'm not looking for the right thing | 17:38 |
clarkb | the changes behind all depend on the timed out change so the other changes not reenqueing is expected | 17:38 |
clarkb | so the only question is why cancel all of the jobs in 924593 and its children due to a single timed out failure | 17:38 |
clarkb | octavia and designate set fail-fast to true but they also have their own queues | 17:40 |
clarkb | I wonder if it could be set on a stable branch so not showing up in codesearch or if some other project sets it and it affects the entire integrated gate queue? | 17:41 |
corvus | i agree the color is problematic | 17:42 |
fungi | 14:05:25 is when the log for that build says it timed out, the cancelled builds all have completion times of 14:24:50 which is nearly 20 minutes. i'm not convinced the timeout correlates to the cancels | 17:44 |
fungi | though i suppose if it hit during a reconfigure or something then event processing could have been delayed | 17:44 |
corvus | reconfigs don't take that long these days (like 30s or so) | 17:45 |
clarkb | it does start cancelling jobs almost immediately in the log | 17:46 |
clarkb | 2024-12-09 14:06:54,843 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] Cancel jobs for <QueueItem b2823b0facbc4f9b912c1a9488466d9c live for [<Change 0x7f3fa81e4a10 openstack/nova 924594,4>] in gate> | 17:46 |
clarkb | however that appears to be for all the children not the change itself | 17:47 |
clarkb | ignoring the unexplained behavior for 924593 it's almost like we cancel all of the builds for its children in preparation for a reenqueue but then we can't reenqueue because the parent is broken | 17:49 |
clarkb | I wonder if that is a regression from old behavior where things would keep running in their own fork of the subway graph previously | 17:50 |
fungi | would they be reenqueued though if a required parent is failing? | 17:50 |
corvus | 592 was failing at one point | 17:50 |
corvus | 2024-12-09 14:24:38,995 DEBUG zuul.Pipeline.openstack.gate: [e: f9829647324849379fea66c0d18277eb] <QueueItem 9e0478f1cb0d49ff9dfc05304de1fab7 live for [<Change 0x7f3fa80c6c50 openstack/nova 924592,4>] in gate> is a failing item because ['at least one job failed'] | 17:51 |
clarkb | oh interesting so maybe this is related to the retries? | 17:51 |
corvus | it... got better? not quite dead yet? maybe related to retries | 17:51 |
clarkb | and that timestamp is much closer to where we see the 593 cancelations | 17:51 |
clarkb | so theory 594 and beyond cancel when 593 has the timeout job but we don't actually cancel anything on 593 until 592 is considered failing | 17:52 |
corvus | https://zuul.opendev.org/t/openstack/buildset/e0429ce698fa427b9dfdc2ed2682b578 is the current 592 buildset | 17:52 |
clarkb | I think there are two problems here first is why was 592 considered failing if it isn't right now and second why don't we let the jobs run to completion (if only on a fork of the subway graph) | 17:53 |
fungi | could https://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8 have been in early failure detection and then hit a different condition which zuul decided should qualify it for a retry instead? | 17:54 |
clarkb | oh that is an interesting theory. It would explain why the timestamps don't initially appear to line up | 17:55 |
fungi | https://zuul.opendev.org/t/openstack/build/1a4f2c1d17b946cda628491f5a5d91f8/log/job-output.txt#34945 | 17:55 |
fungi | bingo | 17:55 |
fungi | 2024-12-09 14:24:37.761281 | controller | Early failure in job, matched regex "\{\d+\} (.*?) \[[\d\.]+s\] \.\.\. FAILED" | 17:56 |
fungi | at least the first half of the theory is confirmed to correlate timewise | 17:56 |
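(Editorial aside: a quick demonstration of how that early-failure regex classifies a build. Only the pattern itself is quoted from the job log; the sample stestr/tempest result line below is invented for illustration.)

```python
import re

# Pattern quoted from the build log above.
pattern = re.compile(r"\{\d+\} (.*?) \[[\d\.]+s\] \.\.\. FAILED")

# Invented example of the stestr-style output such a pattern targets.
line = "{3} tempest.api.compute.servers.test_servers.ServersTest.test_update [12.345s] ... FAILED"

match = pattern.search(line)
if match:
    print("early failure detected for:", match.group(1))
```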
clarkb | yup it failed in tempest then later the host goes unreachable | 17:56 |
clarkb | so this is a corner case where unreachable hosts override early failure detection for retries (maybe it shouldn't? haven't thought that through) | 17:56 |
fungi | right, so maybe as an optimization zuul should not retry builds which indicate early failure | 17:57 |
corvus | i think this is as expected/documented | 17:57 |
clarkb | and then we also have to understand why the jobs behind aren't allowed to continue to completion | 17:57 |
corvus | we've always canceled jobs behind the failing change | 17:57 |
corvus | once we identify a failing change, (even I think) it's a waste of resources to keep running the jobs behind it | 17:57 |
clarkb | corvus: I thought we let them run to completion on a fork of the subway graph. But maybe I'm misremembering | 17:57 |
clarkb | ack | 17:58 |
corvus | we do keep running the jobs for the failing change though | 17:58 |
fungi | and apparently even sometimes retry them | 17:58 |
corvus | (which, in this case, is counter-intuitively 592 not 593) | 17:58 |
fungi | as was the case here | 17:58 |
clarkb | so probably the main things to think about are changing the timeout color (that ended up not being directly related but was still confusing) and then brainstorm if early failure detection wins over retries | 17:58 |
clarkb | however we may have failed early due to something that should be retried so I'm not sure what the right behavior is there | 17:59 |
corvus | yeah... i think we did tweak something recently to make early failure less overridable, lemme look that up | 17:59 |
clarkb | tonyb: you said we don't need your old autohold for gerrit stuff right? | 18:00 |
clarkb | tonyb: I can clean that up along with my autoholds if that is the case. I'm like 95% certain it is so may clean it up anyway | 18:00 |
fungi | yeah either early failure could be due to something that later ends in a retryable condition in which case the depending items shouldn't get their builds cancelled because the failing job may still succeed on retry, or the early failure was for a legitimate reason in which case we shouldn't retry the build and should cancel the children because the failing item is not going to merge | 18:01 |
fungi | so the fault in this case could be seen as either deciding to retry the failing build, or cancelling the builds for subsequent items when we weren't actually sure the first item wouldn't merge | 18:03 |
corvus | https://review.opendev.org/c/zuul/zuul/+/892247 is what i was thinking of | 18:03 |
corvus | that means once early failure hits, success can never happen | 18:03 |
corvus | making the host unreachable is a neat workaround ;) | 18:04 |
fungi | okay, so this is an unexpected loophole | 18:04 |
clarkb | ok so the intended behavior is that early failure detection would keep the build in a failed state | 18:04 |
clarkb | so it is a bug | 18:05 |
corvus | i'm leaning towards saying that because of the reasoning in 892247 we should close the loophole and make early failure override the unreachable retry | 18:05 |
fungi | i concur | 18:05 |
clarkb | I feel like that may not be most correct in every situation but I think it would be easiest to understand | 18:05 |
clarkb | and most consistent | 18:05 |
fungi | agreed on both points | 18:05 |
clarkb | (then you adjust your early failure detection criteria to not overmatch if indeed it was caused by unreachable fallout) | 18:05 |
corvus | yep. possible an unreachableness could cause an early failure detection, but we're kind of trying to avoid having the nnfi algorithm flop around. | 18:06 |
clarkb | ++ | 18:07 |
corvus | for regex-based early failures, it is, as clarkb said, a matter of not overmatching. for ansible-based failures, i think we should get the unreachable signal before or instead of getting the early failure signal, so it should be okay. but that's a thing to double check with any change we make here. | 18:08 |
fungi | i need to disappear for an eye exam, might be a while since that office seems to always run chronically behind schedule. hoping to be back by 20:00z but it could be later | 18:11 |
clarkb | have fun! | 18:11 |
fungi | i'm sure i will | 18:12 |
clarkb | my gerrit 3.11 testing failed on docker hub rate limits trying to pull mariadb | 18:36 |
clarkb | this is me thinking out loud: I wonder if the huge influx of kolla jobs is basically a surefire way to push as many IPs as possible over that limit | 18:36 |
clarkb | there are several kolla changes up many of which are running >60 jobs and probably on the order of 100 nodes apiece. | 18:36 |
clarkb | giving me the opportunity to code review my own work before pushing up a new patchset to fix some obvious bugs | 18:40 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing https://review.opendev.org/c/opendev/system-config/+/937290 | 18:41 |
clarkb | since 937290 rebuilds our gerrit images I won't make a special change to rebuild them to pick up the openid fix | 18:51 |
clarkb | I should probably hold a node and just double check openid login works on 3.10 like it did with 3.9 after my update | 18:51 |
clarkb | but one step at a time | 18:51 |
clarkb | and I'm not too worried about it I don't think that code changes much between releases and my edit was extremely small and tested on 3.9 | 18:52 |
clarkb | weird, latest patchset hit a bazelisk build failure with js files for the ui apparently not being present. I wonder if the underlying js build failed somehow and bazel doesn't notice until it tries to copy the results | 19:11 |
clarkb | the image builds just an hour before passed so unlikely to be a problem with what we are doing but could be an upstream issue. We'll see how the 3.11 build goes | 19:12 |
clarkb | 3.11 built just fine... | 19:21 |
fungi | okay, back earlier than expected from the eye doctor | 19:26 |
clarkb | did they hit your eyeballs with the puffs of air and then blast them with very bright green light to take pictures? | 19:27 |
fungi | i think they did the contact pressure test because they got in "really close" with the gauge, plus i don't remember any puffs of air and i expect i would have noticed them | 19:28 |
opendevreview | Clark Boylan proposed opendev/system-config master: Add Gerrit 3.11 image builds and testing https://review.opendev.org/c/opendev/system-config/+/937290 | 19:38 |
clarkb | I fixed ^ for http previously but not for ssh. But progress is being made | 19:38 |
fungi | i suspect we're going to need similar workarounds for git-review's integration testing | 19:48 |
clarkb | yes and the gerritlib integration testing which I pushed a change up to pin to 3.10 for now | 19:53 |
clarkb | fwiw I have confirmed that gitea is still suffering OOM issues | 21:28 |
clarkb | no movement on https://github.com/go-gitea/gitea/issues/31565 for several weeks too | 21:29 |
clarkb | I'm waiting on CI results for 937290 just to be sure I don't need to push any further edits then I'm going to take advantage of the fact it isn't raining today to go for a bike ride | 21:51 |
clarkb | arg hit the mariadb rate limit again | 21:58 |
clarkb | maybe this is a sign to do something else for a bit | 21:58 |
clarkb | ok rechecked again maybe I'll get lucky. Going to go on a bike ride while I wait. Reminder to let me know of edits to the meeting agenda or make them directly too | 22:01 |
clarkb | I'll get that sent out when I get back | 22:01 |
fungi | have fun! i'll try to keep an eye on it | 22:01 |
clarkb | fungi: looking at the meeting agenda I think this week is a good one to try the db server purging | 22:01 |
fungi | db server purging... | 22:02 |
fungi | remind me | 22:02 |
Clark[m] | Sorry backup server not db | 22:16 |
Clark[m] | And now I'm afk for a bit | 22:16 |
frickler | Clark[m]: maybe mention the rax vcpus=1 issue for the agenda? | 22:40 |
frickler | also we have a kind of workaround-ish fix for the raxflex disk label issue in kolla now, just fyi https://review.opendev.org/c/openstack/kolla/+/937345 still needs backports and a similar patch for k-a | 22:41 |
fungi | i agree this week would be good for the backup purge | 23:11 |