Thursday, 2022-12-15

00:02 <opendevreview> Merged openstack/project-config master: Update post-review zuul pipeline definiton  https://review.opendev.org/c/openstack/project-config/+/867282
02:24 *** rlandy|bbl is now known as rlandy|out
05:08 *** yadnesh|away is now known as yadnesh
05:57 *** soniya29 is now known as soniya29|pto
07:54 *** jpena|off is now known as jpena
10:37 <opendevreview> Ade Lee proposed openstack/project-config master: Add FIPS job for ubuntu  https://review.opendev.org/c/openstack/project-config/+/867112
11:05 *** yadnesh is now known as yadnesh|afk
11:12 *** dviroel|out is now known as dviroel|rover
11:14 *** rlandy|out is now known as rlandy
12:02 *** yadnesh|afk is now known as yadnesh
13:00 <opendevreview> Thierry Carrez proposed openstack/project-config master: Bring back the PTL+1 column in release dashboard  https://review.opendev.org/c/openstack/project-config/+/867801
13:34 <ykarel_> fungi, Thanks, I got nodes on hold. who can help from the vexxhost-ca-ymq-1 side, as they may have already dealt with such nested virt issues?
13:36 <fungi> ykarel_: well, for a start, do you need me to give your ssh key access to log into the held nodes?
13:36 <ykarel_> fungi, I already have my keys injected there in the pre playbook
13:37 <ykarel_> now collecting the information required for reporting bugs as described in https://www.kernel.org/doc/html/latest/virt/kvm/x86/running-nested-guests.html#reporting-bugs-from-nested-setups
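(The kernel document linked above asks reporters to capture the kernel, KVM module, CPU and hypervisor details at each level of the nested stack. The log doesn't show exactly what ykarel_ collected, but a minimal sketch of gathering that sort of detail on the held test node, i.e. the L1 guest, could look like the following; all field choices here are assumptions about the general shape, not the actual commands run.)

    import pathlib
    import platform
    import subprocess

    def read_param(path):
        """Return the contents of a sysfs/procfs file, or N/A if it is absent."""
        p = pathlib.Path(path)
        return p.read_text().strip() if p.exists() else "N/A"

    # Kernel running at this level of the virt stack.
    print("kernel:", platform.release())
    # Whether the KVM modules at this level allow nesting (Intel vs AMD).
    print("kvm_intel nested:", read_param("/sys/module/kvm_intel/parameters/nested"))
    print("kvm_amd nested:", read_param("/sys/module/kvm_amd/parameters/nested"))
    # CPU virtualization flags exposed to this level (vmx = Intel, svm = AMD).
    cpuinfo = pathlib.Path("/proc/cpuinfo").read_text()
    print("virt flags exposed:", "vmx" in cpuinfo or "svm" in cpuinfo)
    # Userspace hypervisor version, if QEMU is installed at this level.
    try:
        out = subprocess.run(["qemu-system-x86_64", "--version"],
                             capture_output=True, text=True, check=False)
        print("qemu:", out.stdout.splitlines()[0] if out.stdout else "unknown")
    except FileNotFoundError:
        print("qemu: not installed at this level")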
13:37 <fungi> ahh, okay. in that case, mnaser has helped look into this sort of thing before from the vexxhost side
13:38 <ykarel_> Thanks fungi
13:38 <fungi> it looks like the uuid for 199.19.213.61 is 1982206a-6620-454f-9983-465db3651cf8, while the uuid for 199.19.213.167 is 5200d42a-e41f-4fc4-a5f8-3d8a48a71349
13:38 <fungi> that may make it easier to look up on the nova side
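(On the provider side, mapping those instance UUIDs to the compute hosts they landed on is a short openstacksdk call, given admin credentials; the cloud name below is a made-up clouds.yaml entry for illustration, not something from the log.)

    import openstack

    # "vexxhost-admin" is a hypothetical clouds.yaml entry with admin access.
    conn = openstack.connect(cloud="vexxhost-admin")
    for uuid in ("1982206a-6620-454f-9983-465db3651cf8",
                 "5200d42a-e41f-4fc4-a5f8-3d8a48a71349"):
        server = conn.compute.get_server(uuid)
        # hypervisor_hostname is only populated for admin users.
        print(uuid, "->", server.hypervisor_hostname)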
13:39 <ykarel_> mnaser, since we switched jobs to run on ubuntu jammy, we started seeing random issues: https://bugs.launchpad.net/neutron/+bug/1999249. can you please help in clearing it? we are only seeing it in the vexxhost provider
13:39 <ykarel_> Thanks fungi
13:40 <fungi> ykarel_: and just to be clear, you're enabling nested virt in the tempest ovn jobs?
13:40 <ykarel_> fungi, yes we do
13:41 <fungi> yeah, looks like you limited those to the nested-virt-ubuntu-jammy label
13:41 <ykarel_> https://github.com/openstack/neutron-tempest-plugin/blob/master/zuul.d/base-nested-switch.yaml#L31-L32
13:41 <ykarel_> yes
13:45 *** akekane is now known as abhishekk
14:30 *** dasm|off is now known as dasm
14:45 *** frenzy_friday is now known as frenzy_friday|food
15:01 *** yadnesh is now known as yadnesh|away
15:26 <clarkb> ykarel_: fungi: to be extra clear, this is exactly why we don't support nested virt... We have those labels set up under the condition that you know it is unsupported and any issues will need to be sorted out between you and the cloud provider (we only add it to clouds that indicate they are willing to work through this)
15:27 <clarkb> we can collect info out of nodepool if necessary (though at this point zuul and the node itself should have all that data?) but we haven't in the past been able to debug this for you
15:32 <ykarel_> clarkb, Thanks, yes I understand that. yes, the nodes have the required data; if anything else is required we can collect that too, as it's quite easily reproducible, but for such issues investigation is required at the L0 level too
15:33 <ykarel_> those nodes provide good performance, that's why we moved to them ;)
15:33 <clarkb> ykarel_: ok, that is not why you should use them
15:33 *** frenzy_friday|food is now known as frenzy_friday
15:33 <clarkb> they are there specifically for jobs that require the functionality to run at all, with an understanding that debugging might need to be undertaken
15:34 <clarkb> they don't provide the same redundancy as other labels, nor the same capacity. We need to use them judiciously to balance the needs of jobs against limited resources and the possibility of brokenness
15:34 <opendevreview> Merged openstack/project-config master: Bring back the PTL+1 column in release dashboard  https://review.opendev.org/c/openstack/project-config/+/867801
15:36 <ykarel_> clarkb, yes, we had considered the capacity/redundancy/debugging aspects when we started using these
15:37 <ykarel_> we hadn't considered the functionality part, as for those jobs we don't need any nested virt functionality
15:37 <clarkb> if you don't need any nested virt functionality then the jobs probably shouldn't be using that label
15:38 <clarkb> and that is for two reasons. The first is that these are limited resources and some jobs actually do need that functionality (octavia for example), and second, they are known to be less reliable because nested virt is less reliable, and now your jobs are broken
15:40 <ykarel_> yes, got it. it worked great for us with focal, but is broken with jammy :(
15:41 <ykarel_> job time was reduced approximately by half, plus far fewer failures and thus rechecks
15:42 <mlavalle> hey, can I get someone from the devstack core team to take a look at https://review.opendev.org/c/openstack/devstack/+/866944. Just needs to be pushed over the edge
15:42 <clarkb> right, and the idea is that we very specifically target jobs that need that to these labels. If nested virt worked everywhere and was reliable we wouldn't have this situation. Unfortunately nested virt is extremely complicated and relies on hardware details and the kernel versions for each level of operating system, and probably libvirt too.
15:42 <clarkb> mlavalle: you'll want to ask in #openstack-qa
15:42 <ykarel_> and now with jammy with qemu>=5 guest vms are consuming too much memory and we are seeing oom-kills
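(A quick way to confirm the oom-kills mentioned here is to scan the archived syslog or journal from a failed job for the kernel OOM killer messages; the assumption is only that the job archives such a log file somewhere under its log directory.)

    import re
    import sys

    # Usage: python3 find_oom.py <path-to-archived-syslog>
    pattern = re.compile(r"out of memory|oom-kill|killed process", re.IGNORECASE)
    with open(sys.argv[1], errors="replace") as fh:
        for lineno, line in enumerate(fh, 1):
            if pattern.search(line):
                print(f"{lineno}: {line.rstrip()}")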
15:42 <mlavalle> clarkb: ack, thanks
15:45 <ykarel_> clarkb, yes, right
15:47 <clarkb> For a long time we thought that nested virt was stable on AMD cpus because the linux kernel set the flag to enable it by default on AMD but not Intel. The implication being "we expect it to work on AMD but be less reliable on Intel". Turns out that was a bug and they had set the flag improperly and it is flaky on both :(
15:48 * ykarel_ was not aware of ^
15:50 <ykarel_> clarkb, if we are not affecting the capacity for projects that really need nested virt functionality, can't we keep on using these nodes, as we understand the supportability issues with them?
15:50 <ykarel_> after the jammy issue is fixed with those
15:52 <clarkb> ykarel_: looking at https://grafana.opendev.org/d/b283670153/nodepool-vexxhost?orgId=1&from=now-7d&to=now it does appear that we are not hitting our capacity for those resources. I think the main risks to you are being queued behind capacity limits, the cloud having an outage or going away and no longer being able to run there, and the nested virt errors themselves. I guess
15:52 <ykarel_> we are not running all neutron jobs there, just a few
15:52 <clarkb> I'm ok with it as long as capacity isn't an issue
15:53 <clarkb> the main risk is that you could end up being unable to test if that cloud can't work for some reason (like now with the nested virt problems, or if the cloud has to shut down or has an outage)
15:54 <ykarel_> Thanks clarkb, yes, totally understandable
15:55 <clarkb> ykarel_: I would definitely keep it to targeted usage
15:55 <fungi> ykarel_: out of curiosity, is that job failing this way 100% of the time, or just sometimes?
15:56 <ykarel_> fungi, on the vexxhost provider 40% of the time; on ovh gra1 and bhs1 we haven't seen a failure yet
15:56 <clarkb> ya, it's very likely specific to the underlying CPUs or kernels
15:56 <ykarel_> so I was trying to understand if it's only with a few compute nodes in the cloud
15:56 <clarkb> and they almost certainly differ between vexxhost and ovh
15:57 <fungi> but also the fact that it doesn't lock up immediately and actually works all the way through a job most of the time is interesting, definitely sounds like some sort of corner case (granted a fairly impactful one in this situation)
15:58 <fungi> or maybe it's that only 40% of the nova controllers have the right cpu or kernel version to expose the bug
15:58 <clarkb> fungi: that would be my hunch
16:01 <ykarel_> yeap, that's something to figure out: what's triggering it
16:01 <ykarel_> will wait for mnaser to look into it
16:01 <ykarel_> maybe he already knows about it
16:05 <johnsom> ykarel_ As others have said, please don’t use the nested-virt nodes unless you absolutely need them.
16:06 <fungi> but also, finding out new nested virt bugs and getting them reported to the kernel devs is helpful
16:08 <fungi> ricolin: ^ also not sure if that environment might be one you have insight into
16:13 <ykarel_> johnsom, just wondering if you've seen this issue with octavia or any other project using those nodes?
16:16 <johnsom> There are only two projects that really need these nodes (that I know of). So far I have not seen any issues with Jammy on them.
16:19 <ykarel_> thanks, it would be interesting to see why it's not seen there, as just running libguestfs-test-tool gets stuck on the affected system
16:19 <clarkb> oh libguestfs....
16:21 <johnsom> Typically we have to attach to the kernel during boot to debug issues on these nodes. It is a lot of work. If you don’t absolutely need them for the test cases, I recommend you don’t use them
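(Since ykarel_ notes that just running libguestfs-test-tool hangs on an affected node, a small wrapper with a timeout is enough to turn that into a yes/no reproducer when poking at held nodes; the 10-minute timeout is an arbitrary choice, and this is only a sketch of the symptom check, not anything shown in the log.)

    import subprocess

    try:
        # libguestfs-test-tool boots a small appliance VM, which exercises nested KVM.
        result = subprocess.run(["libguestfs-test-tool"], capture_output=True,
                                text=True, timeout=600)
        print("completed with exit code", result.returncode)
    except subprocess.TimeoutExpired:
        print("hung for 10 minutes - matches the nested virt failure symptom")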
16:55 <opendevreview> Merged openstack/project-config master: Set iweb max-servers to 0  https://review.opendev.org/c/openstack/project-config/+/867262
17:16 *** jpena is now known as jpena|off
18:55 <opendevreview> Ghanshyam proposed openstack/openstack-zuul-jobs master: Pin tox<4 for stable branches (<=stable/zed) testing  https://review.opendev.org/c/openstack/openstack-zuul-jobs/+/867849
21:07 <opendevreview> Merged openstack/project-config master: Set iweb to empty labels and diskimages  https://review.opendev.org/c/openstack/project-config/+/867263
21:16 *** dviroel|rover is now known as dviroel|rover|afk
22:29 *** cloudnull1 is now known as cloudnull
22:29 *** blarnath is now known as d34dh0r53
22:49 *** rlandy is now known as rlandy|out
22:55 *** sfinucan is now known as stephenfin
23:59 *** dasm is now known as dasm|off
