Monday, 2024-01-15

07:43 *** gibi_off is now known as gibi
07:45 * gibi is back
07:46 <gibi> let me know if there are urgent+important things I should be aware of
08:17 <elodilles> gibi: welcome back o/ maybe a nice clean cherry pick for an easy start? :D o:) https://review.opendev.org/c/openstack/nova/+/904371
08:24 <gibi> elodilles: thanks. Easy win indeed. Done
08:24 <elodilles> \o/
08:24 <elodilles> thanks, too
08:28 <gibi> ping me when the next round of backports from that is up
08:31 <elodilles> sure thing :)
08:58 <opendevreview> Merged openstack/nova stable/2023.2: Reproduce bug #2025480 in a functional test  https://review.opendev.org/c/openstack/nova/+/904370
09:35 *** Continuity__ is now known as Continuity
09:43 <bauzas> gibi: want some paperwork change? sure https://review.opendev.org/c/openstack/nova-specs/+/905342 :p
09:45 <gibi> bauzas: sure. done.
09:45 <bauzas> <3
09:45 * bauzas hugs
09:58 <opendevreview> Merged openstack/nova-specs master: add the blueprint link to the specs  https://review.opendev.org/c/openstack/nova-specs/+/905342
10:24 <opendevreview> Merged openstack/nova stable/2023.2: Do not untrack resources of a server being unshelved  https://review.opendev.org/c/openstack/nova/+/904371
10:45 <gibi> gate seems to be healthy on stable
10:48 <elodilles> gibi: yepp, it looks good \o/ and the 2023.2 version of the backport has merged ;)
10:49 <elodilles> so I've +2'd the patch on 2023.1
11:07 <gibi> elodilles: let's see how the 2023.1 gate performs :)
11:09 <bauzas> fwiw, back in 2024 and trying to find why we race for https://42ba7385992f85feadcc-7de3ea29d2988c6d3fb53b5f30580c1f.ssl.cf2.rackcdn.com/904177/5/check/nova-tox-functional-py38/b878af0/testr_results.html
11:09 <bauzas> 3 weeks after testing, I don't remember why
11:26 <sean-k-mooney[m]> bauzas: is it because you added the test before https://review.opendev.org/c/openstack/nova/+/904209/4 ?
11:30 <sean-k-mooney[m]> it also looks like whatever fixture you're using to provide the mdevs is likely not taking effect
12:32 <opendevreview> Amit Uniyal proposed openstack/nova master: enforce remote console shutdown  https://review.opendev.org/c/openstack/nova/+/901824
12:36 <opendevreview> Amit Uniyal proposed openstack/nova master: enforce remote console shutdown  https://review.opendev.org/c/openstack/nova/+/901824
12:46 <opendevreview> Merged openstack/nova stable/2023.1: Reproduce bug #2025480 in a functional test  https://review.opendev.org/c/openstack/nova/+/904372
12:55 *** elodilles is now known as elodilles_afk
13:01 <opendevreview> Amit Uniyal proposed openstack/nova master: Allow swap resize from non-zero to zero  https://review.opendev.org/c/openstack/nova/+/857339
13:01 <opendevreview> Amit Uniyal proposed openstack/nova master: WIP: temp commit for tests  https://review.opendev.org/c/openstack/nova/+/905597
13:25 <bauzas> sean-k-mooney: nope, sorry but you're wrong, the test works correctly
13:26 <bauzas> I'm running stestr with --until-failure for 2 hours now over all functional tests with mdevs, no problem so far
13:27 <bauzas> stestr --test-path=./nova/tests/functional run nova.tests.functional.libvirt.test_vgpu --until-failure
13:27 <bauzas> ======
13:27 <bauzas> Totals
13:27 <bauzas> ======
13:27 <bauzas> Ran: 3924 tests in 8192.6080 sec.
13:27 <bauzas>  - Passed: 3924
13:27 <bauzas>  - Skipped: 0
13:27 <bauzas>  - Expected Fail: 0
13:27 <bauzas>  - Unexpected Success: 0
13:27 <bauzas>  - Failed: 0
13:27 <bauzas> Sum of execute time for each test: 35255.4411 sec.
13:33 *** elodilles_afk is now known as elodilles
14:21 *** elodilles is now known as elodilles_afk
16:12 * frickler also has a simple stable paperwork change to offer https://review.opendev.org/c/openstack/nova/+/897297
16:27 *** elodilles_afk is now known as elodilles
16:49 <kashyap> zigo: Or anyone: Do you know if #ubuntu-kernel on Libera is still the place to talk to Ubuntu kernel maintainers?
16:50 <clarkb> kashyap: the ubuntu wiki says that is the correct location and the ubuntu security team is in a similarly named channel on libera
16:58 <kashyap> clarkb: Thank you! Mid last year I had a chat with 'juergh' about a certain kernel panic. I was hoping to catch them again, but the channel right now has about 4 folks.
17:04 <sean-k-mooney> http://tinyurl.com/bdfk6j9e
17:04 <kashyap> melwitt: dansmith: Hi, so even with CirrOS 0.5.3 (with the updated stable kernel), we're seeing these sorts of failures, yeah? - https://bugs.launchpad.net/nova/+bug/2018612
17:05 <sean-k-mooney> so for context we have had 226 kernel panics in the last 10-ish days
17:05 <kashyap> (Meta note: If so, we (I can do it) have to file a separate issue as per the above)
17:05 <sean-k-mooney> many have been in the cinder project jobs
17:05 <kashyap> sean-k-mooney: Yeah, I was just about to run a query in OpenSearch too; thanks for the link
17:05 <melwitt> kashyap: we're using cirros 0.6.2 and 0.6.1
17:05 <melwitt> orly? ok, that's good to know
17:06 <kashyap> melwitt: Ah, thanks! I'm running on an outdated brain model
17:06 <melwitt> :)
17:06 <sean-k-mooney> they are in neutron, glance, cinder, nova
17:06 <dansmith> the first one I see in that stack from sean-k-mooney is not like the one from the bug, kashyap
17:07 <dansmith> it's failing to load userspace libs, which is what sean-k-mooney was mentioning, but is different from the page fault in the bug.. similar time in boot, but different behavior
17:07 <sean-k-mooney> "Kernel panic - not syncing: IO-APIC + timer"
17:07 <sean-k-mooney> that is from using cirros 5.x
17:07 <sean-k-mooney> it's a fixed kernel bug that was just not in the cirros images so we can ignore those
17:09 <kashyap> dansmith: Yeah, the one I just searched for is this (also in melwitt's CI list):
17:09 <kashyap> "Kernel panic - not syncing: Attempted to kill init! exitcode=0x00001000"
17:09 <sean-k-mooney> right, the "/sbin/init: can't load library 'libtirpc.so.3'" is the main panic we still see after the recent changes we made
17:09 <kashyap> The above panic has 36 global hits (12 Tempest) in the last 7 days.  Nova and Cinder projects affected
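[The three panic signatures quoted above lend themselves to mechanical bucketing when grepping console logs. The helper below is purely illustrative, with labels paraphrasing the conversation; it is not part of nova or any CI tooling.]

```python
# Toy triage helper for the kernel panic signatures discussed above.
# The signature -> label mapping mirrors the conversation; the names
# and the function itself are hypothetical.

PANIC_SIGNATURES = {
    # Fixed kernel bug; only hits old cirros 0.5.2 images (per sean-k-mooney)
    "IO-APIC + timer": "stale cirros 0.5.2 kernel; ignorable",
    # Userspace load failure in /sbin/init; the main remaining panic
    "can't load library 'libtirpc.so.3'": "userspace load failure in /sbin/init",
    # init died early; tracked separately
    "Attempted to kill init": "init killed early; separate bug",
}


def classify_panic(console_log: str) -> str:
    """Return a triage label for the first known signature found."""
    for fragment, label in PANIC_SIGNATURES.items():
        if fragment in console_log:
            return label
    return "unknown panic signature"
```

Feeding it a console line such as "Kernel panic - not syncing: IO-APIC + timer" would place the hit in the ignorable cirros-0.5.2 bucket.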
17:10 <dansmith> sean-k-mooney: meaning we see that even with the split boot right?
17:10 <dansmith> I hadn't seen that manifestation until just now
17:10 <melwitt> I guess that's "good" that we can still get some samples for showing kernel devs if we can get some help
17:10 <dansmith> or hadn't noticed. I've mostly been in "see a panic, recheck" mode for a while now
17:10 <kashyap> sean-k-mooney: IIUC, "5.x" needs to be 0.5.3 or higher
17:11 <sean-k-mooney> dansmith: yep, but from what I have seen it's mostly (perhaps always) been a boot-from-volume test
17:11 <sean-k-mooney> kashyap: I don't think the IO-APIC issue was ever fixed in 0.5.3
17:12 <sean-k-mooney> I tried to get it rebuilt but it didn't happen in the past
17:12 <sean-k-mooney> if 0.5.3 happened in the last year then it might have the updated kernel
17:12 <sean-k-mooney> if not then it is still using the old ubuntu package
17:13 <dansmith> sean-k-mooney: ack, but it seems like it's still in initramfs at this point so still seems like bfv "shouldn't matter"
17:13 <kashyap> By the "io api issues" you mean the IO-APIC related patch "no_timer_check" that was at some point added to a CirrOS image?
17:14 <sean-k-mooney> kashyap: no
17:14 <sean-k-mooney> the reason this happens in 5.2 for example is ubuntu had already fixed the kernel bug but cirros was not rebuilt with the fixed kernel package
17:14 <kashyap> sean-k-mooney: CirrOS 0.5.3 happened in May 2023 (after the bug Dan filed): https://github.com/cirros-dev/cirros/releases/tag/0.5.3
17:14 <sean-k-mooney> even though it was released
17:15 <sean-k-mooney> ok so it did happen in the last year
17:16 <kashyap> Yep; I filed this request after Dan's bug above: https://github.com/cirros-dev/cirros/issues/102
17:16 <sean-k-mooney> I asked to do this 2-3 years ago and they said no at the time
17:16 <kashyap> Maybe because at that time the older kernel was still "in support"
17:17 <sean-k-mooney> in support didn't matter
17:17 <sean-k-mooney> it was pinned to a specific deb package
17:17 <sean-k-mooney> so it was receiving no backports
17:17 <sean-k-mooney> it was using 5.3.0-26.28~18.04.1 explicitly
17:18 <sean-k-mooney> anyway, yeah, if we are using cirros 5.2 anywhere it should be updated to 5.3
17:20 <sean-k-mooney> https://codesearch.opendev.org/?q=cirros-0.5.2&i=nope&literal=nope&files=&excludeFiles=&repos=
17:20 <sean-k-mooney> so unfortunately we are using 0.5.2 in some places
17:20 <sean-k-mooney> obviously on master we should be using 6.x
17:21 <sean-k-mooney> but we likely should fix that if nova is pinned, and inform others to update their repos to avoid the IO-APIC issue
17:21 <sean-k-mooney> the cinder-tempest-plugin-lvm-lio-barbican-centos-9-stream job is using 0.5.2 for example
17:22 <sean-k-mooney> or at least having the IO-APIC panic
17:22 <sean-k-mooney> nova is using it for the arm job
17:23 <sean-k-mooney> that's the only usage of 5.2 that affects us directly
17:23 <sean-k-mooney> I'm not sure the kernel bug exists in the arm builds
17:27 <kashyap> melwitt: sean-k-mooney: I'll start a "kernel panics"-only Etherpad to track some of these in one place
17:27 <melwitt> thanks kashyap
17:29 <sean-k-mooney> I'm in the middle of something else, but dansmith melwitt do you want me to select one of the jobs and go back to the normal image and larger nodeset label
17:29 <sean-k-mooney> or will ye have time to do that
17:30 <dansmith> not just larger nodeset, but also with a larger flavor
17:30 <sean-k-mooney> melwitt: also I had hoped to start reviewing your series today but it's looking like it will be Wednesday
17:30 <sean-k-mooney> dansmith: yeah, that's what I meant but didn't say
17:30 <dansmith> I'm actually supposed to be off doing something else right now too, but I can work on that later if you never get to it
17:31 <dansmith> (later this week)
17:31 <sean-k-mooney> ack, I'll see if I have time before I finish for today
17:31 <sean-k-mooney> to that end I'll go back to my Ansible patch
17:33 <kashyap> sean-k-mooney: melwitt: A quick start here: https://etherpad.opendev.org/p/Kernel-panics-in-Nova-CI
17:44 <gibi> why does a privsep-decorated function like https://github.com/openstack/nova/blob/master/nova/virt/libvirt/cpu/core.py#L63-L66 not get escalated privileges? I see https://paste.opendev.org/show/bFZ9WgE2828iGamQoDaB/ and sure, /sys/devices/system/cpu/cpu1/online is only writable by root, but I would assume that is why we have the privsep decorator on the call
17:46 <sean-k-mooney> it should be elevated
17:46 <sean-k-mooney> that is a file not found issue
17:47 <gibi> no, the file is visible
17:47 <gibi> [root@edpm-compute-1 ~]# podman exec -it nova_compute /bin/bash
17:47 <gibi> bash-5.1$ ls -alh /sys/devices/system/cpu/cpu1/online
17:47 <gibi> -rw-r--r--. 1 root root 4.0K Jan 15 14:38 /sys/devices/system/cpu/cpu1/online
17:47 <sean-k-mooney> can you write to it with sudo?
17:47 <sean-k-mooney> I'm wondering if it's SELinux
17:48 <sean-k-mooney> I'm guessing you're on CentOS because of py 3.9
17:48 <gibi> yes I can
17:48 <gibi> [root@edpm-compute-1 ~]# podman exec -it -u root nova_compute /bin/bash
17:48 <gibi> [root@edpm-compute-1 /]# echo "online" > /sys/devices/system/cpu/cpu1/online
17:48 <gibi> [root@edpm-compute-1 /]# echo "offline" > /sys/devices/system/cpu/cpu1/online
17:48 <gibi> [root@edpm-compute-1 /]# cat  /sys/devices/system/cpu/cpu1/online
17:48 <gibi> 0
17:49 <sean-k-mooney> odd
17:49 <sean-k-mooney> anything in the audit log?
17:49 <sean-k-mooney> I'm wondering if the permission issue is due to user/group, due to SELinux, or something else
17:51 <gibi> no related log in audit.log
17:52 <gibi> wouldn't I should see the code that escalates privileges in the stack trace?
17:52 <gibi> * shouldn't I see ...
17:52 <sean-k-mooney> well it should be running via privsep
17:52 <sean-k-mooney> so you should see some privsep debug logs
17:53 <sean-k-mooney> I'm not sure why this is saying ERROR oslo_service.service
17:54 <sean-k-mooney> that's the wrong module logger
17:54 <gibi> there is no debug log about privsep either
17:54 <gibi> but we have `oslo.privsep.daemon=INFO` in the config
17:55 <sean-k-mooney> you should see a message from nova saying the daemon was started
17:55 <sean-k-mooney> even at info
17:55 <sean-k-mooney> perhaps not the detailed logs but you should see it start
18:01 <gibi> something is very strange
18:02 <gibi> at the first nova-compute startup the privsep daemon logs that it is started, but on subsequent container restarts it does not log it
18:03 <gibi> I'm going to redeploy the computes to see if I can reproduce it
18:04 <sean-k-mooney> this is the nova compute container
18:04 <sean-k-mooney> hum, ok, I wonder if we are leaking the unix socket
18:05 <sean-k-mooney> or whatever we are using to detect the start between container starts
18:05 <sean-k-mooney> and then not recreating the privsep daemon
18:09 <gibi> I pushed the revert https://github.com/openstack-k8s-operators/nova-operator/pull/650 to unblock Andrew and the folks testing cpu pinning
18:10 <sean-k-mooney> hehe ok, I'm not sure that is super relevant to upstream nova folks but perhaps we have a bug
18:10 <gibi> nah, that repo is also upstream :D
18:10 <sean-k-mooney> I suppose
18:11 <gibi> spreading the word that a new deployer engine exists ;)
18:11 <sean-k-mooney> anyway I suspect we need to look into how privsep is being executed
18:11 <sean-k-mooney> I didn't look at that previously because I was expecting it to run the same way as TripleO did
18:11 <sean-k-mooney> but perhaps something has changed there
18:12 <gibi> I will redeploy tomorrow and check for the privsep process before touching anything
18:12 <gibi> then I will try a container restart to see if privsep also restarts
18:13 <sean-k-mooney> so you might need to do an operation that needs privsep before it's started
18:13 <sean-k-mooney> so you might need to boot a vm or something like that
18:13 <sean-k-mooney> I think it's lazy loaded
18:13 <sean-k-mooney> we are not using the fork model, I believe
18:13 <sean-k-mooney> there are 2 ways to run privsep but I think we are using the privsep-helper way
18:13 <sean-k-mooney> via sudo/rootwrap
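[The lazy-start behaviour described above can be sketched with a toy stand-in. This models the control flow only: real oslo.privsep forwards each decorated call over a unix socket to a separate root daemon, started on first use either by forking or via a sudo/rootwrap privsep-helper, and that daemon/socket state is exactly what could go stale across container restarts.]

```python
# Toy model of the oslo.privsep entrypoint pattern; illustrative only,
# not the oslo.privsep API. Here the "daemon" is an in-process flag so
# the lazy-start control flow is visible.
import functools


class ToyPrivContext:
    def __init__(self):
        # Stands in for "is the privileged daemon up / socket connected".
        self.daemon_started = False

    def _ensure_daemon(self):
        # Real privsep lazily starts a root daemon on the first privileged
        # call. If stale state made this a no-op while the real daemon was
        # gone, privileged calls would fail like the traceback above.
        if not self.daemon_started:
            self.daemon_started = True

    def entrypoint(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            self._ensure_daemon()
            # Real privsep serializes the call and runs func inside the
            # daemon; the caller's own process never gains privileges.
            return func(*args, **kwargs)
        return wrapper


ctx = ToyPrivContext()


@ctx.entrypoint
def write_sysfs(path, value):
    # Hypothetical privileged operation; the name is made up.
    return "would write %r to %s as root" % (value, path)
```

Note that the daemon flag only flips after the first decorated call, which is why sean-k-mooney suggests booting a VM (or doing any privileged operation) before checking for the privsep process.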
18:14 <gibi> I will check tomorrow and report back
18:14 <sean-k-mooney> cool. are you almost done for today?
18:14 <sean-k-mooney> I'm going to be around until the top of the hour but I approved the revert quickly to unblock folks
18:15 <gibi> I'm done for today
18:15 <gibi> see you all tomorrow
18:15 <sean-k-mooney> o/
19:07 <opendevreview> Merged openstack/nova stable/2023.2: Fix URLs in status check results  https://review.opendev.org/c/openstack/nova/+/897297
19:31 <opendevreview> sean mooney proposed openstack/nova master: [DNM] debug kernel paincs  https://review.opendev.org/c/openstack/nova/+/905628
19:32 <sean-k-mooney> dansmith: it looks like the 16G nodepool labels are not available so I used the 32G ones ^ I can see about re-adding the 16G flavors but I just want to see if that is correct first. the vexxhost flavor still exists as we have 16G labels, but with bionic
19:33 <sean-k-mooney> so we just need a 16G label with jammy
19:35 <sean-k-mooney> with that said, I'm not sure, but it looks like vexxhost might be providing 32G vms by default now https://opendev.org/openstack/project-config/src/branch/master/nodepool/nl03.opendev.org.yaml#L147-L187
20:06 <opendevreview> Merged openstack/nova stable/2023.1: Do not untrack resources of a server being unshelved  https://review.opendev.org/c/openstack/nova/+/904373
21:51 <opendevreview> sean mooney proposed openstack/nova master: [DNM] debug kernel paincs  https://review.opendev.org/c/openstack/nova/+/905628

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!