Friday, 2025-04-04

zigoFYI, we were affected by nova-compute flapping, we applied https://review.opendev.org/c/openstack/nova/+/939317 and that solved the issue for us. Thanks to melwitt and others who worked on it.07:24
opendevreviewMasahito Muroi proposed openstack/nova master: Use dict object for request_specs_dict in the _list_view  https://review.opendev.org/c/openstack/nova/+/93965808:06
sean-k-mooneydansmith: i feel like you will care about https://etherpad.opendev.org/p/nova-2025.2-ptg#L57713:04
sean-k-mooneywell, have an opinion13:04
sean-k-mooneycare implies a positive emotional sentiment that i don't think you will express toward the idea13:05
sahido/13:06
sahidWe don't have any logs in the scheduler that show how the host list is re-ordered by the weighers13:07
sean-k-mooneywe do at debug13:08
sean-k-mooneywe have the weights per host13:08
sean-k-mooneyper weigher13:08
dansmithsean-k-mooney: that line is blank, which thing?13:36
fricklerhmm, the timeslider sadly doesn't show line numbers. the line is now at #578, though14:00
sean-k-mooneydansmith: I would quite like an equivalent to oVirt's VDSM hooking mechanism in Nova. 14:00
sean-k-mooneyi.e. the request to have a way to add arbitrary hooks to modify the libvirt xml without changing nova14:01
dansmithsean-k-mooney: okay I figured that might be it :)14:01
dansmithbauzas: any chance you could try to review some of this OTU series? https://review.opendev.org/c/openstack/nova/+/94414814:03
dansmiththe bottom two are just trivial setup for the larger one above14:03
bauzas_dansmith: I definitely want to 14:03
dansmithall of them are +2 already, so the bottom two could probably be sent to the gate without a lot of time investment :)14:03
dansmiththanks14:03
sean-k-mooneydansmith: +2w on the first patch +2 on the second with a question https://review.opendev.org/c/openstack/nova/+/944149/comment/133f1fc5_dabaf92d/ 14:43
sean-k-mooneyim going to grab something to drink and kick off a devstack install but ill try and take a look at the rest when i get back14:44
dansmithsean-k-mooney: please see response14:44
dansmithhttps://review.opendev.org/c/openstack/nova/+/943816/12/nova/compute/pci_placement_translator.py specifically L202 here :)14:45
sahidsean-k-mooney: We don't have any logs like the ones for filters, e.g. "Filter ComputeFilter returned 2 host(s)". It would be good to have the same for the weighers as well. For example, I have removed one of them and I have no idea whether it was actually removed or not14:51
sahidI can see a WeighedHost line but no detail about which weigher was used to order the list14:52
sean-k-mooneysahid: https://zuul.opendev.org/t/openstack/build/95c4e84b168d4a79ba750797f28a81c7/log/controller/logs/screen-n-sch.txt#101015:11
sean-k-mooneywe have output per weigher per host both pre and post normalization and multiplying15:12
sean-k-mooneysahid: again only at debug level like the filters15:12
sean-k-mooneygrep for "DEBUG nova.weights" in your logs15:13
sean-k-mooneyhttps://github.com/openstack/nova/blob/9d910ec4bf2a12baf3b5f0ec3bc41686413538fb/nova/weights.py#L136-L17015:15
sean-k-mooneysahid: it was added 4 years ago https://github.com/openstack/nova/commit/154ab7b2f9ad80fe432d2c036d5e8c4ee171897b15:16
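The "pre and post normalization and multiplying" values mentioned above can be pictured with a small sketch. This is not nova's code: the function names, the min-max scaling, and the sample host data are assumptions chosen for illustration, showing one common way to scale each weigher's raw values into [0, 1], apply a per-weigher multiplier, and sum per host.

```python
def normalize(weights):
    """Min-max scale raw weights into [0.0, 1.0] (illustrative, not nova's code)."""
    if not weights:
        return []
    lo, hi = min(weights), max(weights)
    spread = hi - lo
    if spread == 0:
        # All hosts scored the same, so this weigher cannot rank them.
        return [0.0] * len(weights)
    return [(w - lo) / spread for w in weights]


def weigh_hosts(raw_by_weigher, multipliers):
    """Return (host, total) pairs sorted best-first.

    raw_by_weigher: {weigher_name: {host: raw_weight}}  (hypothetical shape)
    multipliers:    {weigher_name: multiplier}
    """
    totals = {}
    for weigher, raw in raw_by_weigher.items():
        hosts = list(raw)
        scaled = normalize([raw[h] for h in hosts])
        for host, value in zip(hosts, scaled):
            totals[host] = totals.get(host, 0.0) + multipliers[weigher] * value
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    raw = {"ram": {"a": 1024, "b": 3072}, "cpu": {"a": 4, "b": 2}}
    print(weigh_hosts(raw, {"ram": 1.0, "cpu": 2.0}))
```

Logging the raw, normalized, and multiplied value at each step is what makes it possible to tell which weigher moved a host up or down the list.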
sean-k-mooneydansmith: that's not what i was asking, let me restate it15:17
dansmithoh? okay15:17
sean-k-mooneydansmith: do we need to invalidate the cache for all pci in placement RPs or just the ones that are also marked OTU15:17
sean-k-mooneyfor normal devices we do not expect the reserved to change15:18
dansmithsean-k-mooney: ah, so if you look at some of gibi's comments on the commit message, he noted that PCI-in-placement specifically calls out reserved as a thing that could be changed by the operator directly in placement,15:19
dansmithand as such it's probably good that we invalidate our cache for all of those to make sure we're seeing the right info, even though technically only OTUs care at the moment15:19
sean-k-mooneyoh it does? ok i dont know why but if that's the existing behavior then the code you wrote makes sense15:19
dansmithsean-k-mooney: yeah he provided a link to a comment in the code that specifically says that, and asked for me to add words to the commit message about it15:20
sean-k-mooneyi dont know what the original usecase for allowing the reserved to be changed was but if thats the case then ill upgrade to +w15:20
dansmithso we _could_ potentially follow up with this and use the OTU trait as well to control invalidation, but I think he would argue it's best if we just don't so that future developers are clear that we won't get stuck in the same way I initially did until I added this :)15:21
sean-k-mooneyah https://review.opendev.org/c/openstack/nova/+/944149/6//COMMIT_MSG#1515:21
sean-k-mooneyok i missed that when i looked at the commit15:21
gibisean-k-mooney: the idea behind it is that nova should own only as small a part of the inventory as it cares about. Before OTU it only cared about total and max_unit (as it is mandatory) so the rest of the fields were never set in the inventory of a PCI RP15:22
gibimaking it possible to set them externally15:22
sean-k-mooneydansmith: well as we discussed a few weeks ago im not really sure why we are caching at all15:22
sean-k-mooneyso im all for being safe with correctness here15:22
sean-k-mooneyeven if it might be slightly slower15:22
sean-k-mooneygibi: right and nova cares about reserved even without OTU15:23
sean-k-mooneywe report the inventory based on the devspec15:23
dansmithwell, for a big tree of lots of providers it probably is a real gain, but brings problems like this (as does all caching of course)15:23
sean-k-mooneyso you should never be changing reserved since that will make placement and the pci devices table be sort of out of sync15:23
sean-k-mooneyanyway, i dont want to boil the ocean, i was just wondering why we were not checking for "is pci in placement and one time use" rather than just "is pci in placement"15:24
dansmithwell, that patch is before OTU, so that's part of why :)15:25
sean-k-mooney ya i was originally wondering if that was the reason and if you were going to modify it again in the next patch15:26
dansmithnova is not currently reporting anything reserved for pci-in-placement providers, AFAIK (other than OTU in the next patch), so I think gibi's point is that if you want to reserve some VFs you would do it in placement directly15:26
dansmithsean-k-mooney: yeah I understand now, I just didn't get that from your comment :)15:26
sean-k-mooneyno worries. so +2w on the first too. ill take a look at the rest once i have devstack deploying15:27
dansmiththanks15:28
sean-k-mooneydansmith: as an aside, you have a vexxhost label created, have you tried to use this in ci yet?15:33
sean-k-mooneyso the host im using is not the one that has igb support. you said in your testing the rng didnt crash the host vm?15:40
sean-k-mooneyif so ill see if i can play with that as well15:40
dansmith(on a call)15:49
melwittzigo: just fyi that we had a partial regression involving mdevs with that patch which was tracked and fixed here https://bugs.launchpad.net/nova/+bug/209889216:03
sean-k-mooneydansmith: not urgent but you have a trivial bug in the unit test. https://review.opendev.org/c/openstack/nova/+/943816/comment/4d20bb05_6116bff7/ it does not actually break anything so it can be a follow up16:21
dansmithsean-k-mooney: ah yeah will fix in a follow up.. 16:26
sean-k-mooneyyou're not actually building it into a tree or populating the pci tracker so technically it wont break anything. it just will be confusing if we read it in like a year so ya no need to respin but nice to fix16:27
dansmithyep, spinning a follow-up now16:28
dansmithsean-k-mooney: are you holding the +W until you test locally?16:29
dansmithoh nm :)16:29
sean-k-mooneyi was until my devstack failed for the 3rd time16:30
dansmithheh16:30
sean-k-mooneythere is a patch sitting in ci to make our new prometous plugin not fail if /etc/promethous exists and i keep forgetting i need to delete the dir before i restack16:33
sean-k-mooneyi should just checkout the fixed version but im manually testing a docs change for the default local.conf16:34
sean-k-mooney* default of watcher + promethous 16:36
* sean-k-mooney cant spell that and its annoying how much i have to type it now16:36
melwittp8s 😛 16:39
sean-k-mooneyi rather unofrtunetly misclicked yesterday and added a commone misspelling of approch to my firefox? or garmmerly dictonatry instead of fixing it16:41
sean-k-mooneyi need to go fix that before i forget16:42
sean-k-mooneyintorspeciton that is also not a correct spelling.... how do i spellcheck my dictionary 16:43
sean-k-mooneyi guess google16:43
sean-k-mooneythis is a very sean problem16:44
dansmithheh16:44
sean-k-mooneyskipable also has too Ps16:45
melwittyou're on a roll16:45
sean-k-mooneyi have custom dictionaries for jagon like podman or mixed-rhel16:46
sean-k-mooneybut i really shoudl just delete anything that a real word because it proably wrong16:46
sean-k-mooney sudo mkdir /etc/prometheus16:48
sean-k-mooneymkdir: cannot create directory '/etc/prometheus': File exists16:48
sean-k-mooney... attempt 516:48
sean-k-mooneythis time im just going to use the fixed versions16:48
sahidsean-k-mooney: thanks for the reference :-) we are running a version a bit too old I guess...16:48
sean-k-mooneysahid: its a pretty trivial backport if thats an option for you16:49
sahidno I don't think we need to backport it. we don't run in debug anyway, it was more about proving the change on UAT16:53
sahidthank you sean-k-mooney 16:53
-opendevstatus- NOTICE: The Gerrit service on review.opendev.org will be offline momentarily for a patch release update17:17
sean-k-mooneyjust as an fyi if you end up with passt or pasta installed on your system because of docker/podman or i think technically libvirt? not sure 17:57
sean-k-mooneyhttps://bugs.launchpad.net/ubuntu/+source/apparmor/+bug/2065685/comments/717:57
sean-k-mooneyyou may need to manually fix apparmor to fix your devstack17:58
sean-k-mooneythe issue is fixed just not on 24.04...17:58
sean-k-mooneyfinally18:01
sean-k-mooneyInvalid [pci]device_spec config: one_time_use=true requires pci.report_in_placement to be enabled 18:08
sean-k-mooneywell at least i confirmed that works18:08
sean-k-mooneyright... AttributeError: module 'os_traits' has no attribute 'HW_PCI_ONE_TIME_USE'18:09
sean-k-mooneydansmith: so the db and placement all look good https://paste.opendev.org/show/bpBlYPAE5cvPwAK1pOuS/ im quite tired but this is almost set up so im going to quickly create an alias and flavor and see if i can boot a vm and delete it and reset it18:16
sean-k-mooneythen ill call it a day and rest18:17
dansmithsean-k-mooney: cool18:22
sean-k-mooney... Uggla  the new alias examples we added for live migratable devices are wrong18:29
sean-k-mooneyUggla: you used `'` not `"`18:29
sean-k-mooneythey are json so they have to be double quotes18:29
dansmithooh, yeah I see18:30
sean-k-mooneyyou get a really unintuitive error18:31
sean-k-mooneyBadRequestException: 400: Client Error for url: http://192.168.16.185/compute/v2.1/servers, Invalid PCI alias definition: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)18:31
sean-k-mooneyi mean it makes sense but i dont expect end users to get that exception18:31
sean-k-mooneywhen its an error in your alias in the config file18:31
sean-k-mooneyits really a server side error not a client side one18:32
dansmithah yeah, exposing that to the user is probably not a great idea either18:33
sean-k-mooneyit might be because i did that as admin18:33
sean-k-mooneyill check that after im done18:34
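The quoting problem described above is easy to reproduce with Python's standard `json` module, which rejects single-quoted property names. The sketch below is only an illustration: the alias keys and values (`device_type`, `name`, `type-VF`, `a1`) are sample data, and `parse_alias` is a hypothetical helper, not a nova function.

```python
import json

# Same content, different quoting. Only the double-quoted form is valid JSON.
SINGLE_QUOTED = "{'device_type': 'type-VF', 'name': 'a1'}"
DOUBLE_QUOTED = '{"device_type": "type-VF", "name": "a1"}'


def parse_alias(text):
    """Return (parsed_dict, None) on success or (None, error_message) on failure."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, str(exc)


if __name__ == "__main__":
    for candidate in (SINGLE_QUOTED, DOUBLE_QUOTED):
        parsed, error = parse_alias(candidate)
        print(candidate, "->", parsed if error is None else "ERROR: " + error)
```

Validating the alias when the configuration is loaded, rather than at server-create time, would surface the mistake to the operator instead of the API user; whether nova should do that is the open question in the conversation.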
sean-k-mooneyhttps://paste.opendev.org/show/bwJuKNE7VW3NvAjnxiWr/18:36
sean-k-mooneyso cool i was able to boot a vm18:36
sean-k-mooneyit reserved the device18:37
sean-k-mooneyand after deleting it i now get a no valid host18:37
dansmith\o/18:38
sean-k-mooneyand cool i can reset it and boot again18:40
dansmith\o\ /o/ \o/18:40
sean-k-mooneyim too tired this evening to review the final functional test patch but ill do that on monday18:41
sean-k-mooneyi was testing with the reno patch just before it18:41
dansmithno worries, good to have another confirmation, thanks18:41
dansmithrest up for next week ;)18:41
sean-k-mooneyo/18:43
dansmithgmann: melwitt: this recently-added test seems to be failing a *lot* https://review.opendev.org/c/openstack/tempest/+/85888519:03
dansmithand it seems to be failing because the swap devices don't actually show up or disappear in the guest19:04
dansmithprobably because it doesn't wait for SSHABLE I'm guessing19:05
sean-k-mooneythat should not matter for resize since there is no hotplug19:05
sean-k-mooneybut i dont think it will auto swap on19:05
sean-k-mooneyso depending on what the test is doing it might not be stable19:06
sean-k-mooneyhum blkid looking for type swap19:06
sean-k-mooneydo we format the disk with a swap partition19:07
dansmithwhat I mean is, I'm not sure it's waiting for the instance to have actually been resized19:07
sean-k-mooneyoh 19:07
sean-k-mooneywell that's potentially different19:07
dansmithI guess the underlying stuff waits for the verify stage, so perhaps not19:08
dansmitheither way, seems to be quite flaky19:08
melwittsean-k-mooney: swap is formatted19:08
sean-k-mooneyack19:08
sean-k-mooneyi thought it was but was not sure19:08
sean-k-mooneyim not sure why there are reboots in those tests19:08
dansmithme either, but it's failing before the reboot19:09
sean-k-mooneyi get why its creating a flavor but im not sure why that's part of the resize helper19:09
sean-k-mooneyi mean that's fine, just not what i expect from the function name19:10
dansmithI also don't think the 2048_to_1024 test is doing anything particularly useful since it's not asserting the size of the device19:10
sean-k-mooneyhttps://github.com/openstack/tempest/blob/b1e168015316f3f73131957e9fb6abfd2fdc20f1/tempest/api/compute/base.py#L441-L454 should be confirming as you said19:11
dansmithyeah19:11
sean-k-mooneydansmith: it looks like these tests were partly ported from the functional tests for the same thing19:12
sean-k-mooneybut those had more asserts19:12
dansmitheither way, I wonder if we should fast revert that.. it's been merged for less than 24 hours19:13
sean-k-mooneyprobably, if its flaky. i saw the ping to review a few days ago but didnt really have time to look at it19:15
dansmithproposed https://review.opendev.org/c/openstack/tempest/+/94637919:16
sean-k-mooneyits the remote command19:18
dansmithlooks like all the multicell jobs are failing this since it merged (which is only three amazingly) https://zuul.opendev.org/t/openstack/builds?job_name=nova-multi-cell&skip=019:18
sean-k-mooney[20:16:32]❯ blkid19:18
sean-k-mooney/dev/nvme0n1p3: UUID="9e5b21f3-dd4d-425a-b309-dcfbf585ad16" TYPE="crypto_LUKS" PARTUUID="82a4c723-1748-43f6-b82d-f051b8d3a6c5"19:19
sean-k-mooney/dev/nvme0n1p1: UUID="F3DE-29C6" BLOCK_SIZE="512" TYPE="vfat" PARTUUID="7208aa8f-8f44-432c-8163-693e4c1dcb8b"19:19
sean-k-mooney/dev/nvme0n1p2: UUID="f84e29ae-bdd6-45a2-9590-f7d9fc71f90c" BLOCK_SIZE="4096" TYPE="ext4" PARTUUID="b2216aac-c4e1-4cf1-930b-37a879cb1cd3"19:19
sean-k-mooney/dev/mapper/nvme0n1p3_crypt: UUID="0ebbc49c-51a0-4b8c-bdc1-702bf9d178b7" UUID_SUB="83086ee4-93b7-49c7-9915-0b8da9589da6" BLOCK_SIZE="4096" TYPE="btrfs"19:19
sean-k-mooney~ via 🐹 v1.21.11 via 🐍 v3.11.11 19:19
sean-k-mooney[20:16:35]➜ lsblk19:19
sean-k-mooneyNAME                MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS19:19
sean-k-mooneyloop0                 7:0    0  76.8M  1 loop  19:19
sean-k-mooneyloop1                 7:1    0  76.8M  1 loop  19:19
sean-k-mooneyzram0               252:0    0  15.6G  0 disk  [SWAP]19:19
sean-k-mooneynvme0n1             259:0    0 953.9G  0 disk  19:19
sean-k-mooney├─nvme0n1p1         259:1    0   1.1G  0 part  /boot/efi19:19
sean-k-mooney├─nvme0n1p2         259:2    0  14.9G  0 part  /boot19:19
sean-k-mooney└─nvme0n1p3         259:3    0 937.9G  0 part  19:19
sean-k-mooney  └─nvme0n1p3_crypt 253:0    0 937.8G  0 crypt /var/lib/docker/btrfs19:19
sean-k-mooney                                               /home19:19
sean-k-mooney                                               /swap19:19
sean-k-mooney                                               /19:19
sean-k-mooneypaste.o.o is down19:19
sean-k-mooneyblkid with no args does not always show swap19:19
sean-k-mooneyah 19:20
sean-k-mooneyso with sudo the swap disk show up 19:20
sean-k-mooneybut not without19:20
sean-k-mooneyat least initially 19:21
sean-k-mooneyonce i run blkid with sudo the non sudo one has the swap partition too19:21
sean-k-mooneyso it must be caching or something like that19:21
dansmithI suppose I won't recheck my patches since this has failed twice in a row on the same thing19:23
sean-k-mooneyhttps://paste.centos.org/view/ebce235e19:23
dansmithyeah weird, but... busybox19:24
sean-k-mooneywell the last two are the gnu versions19:24
sean-k-mooneybusybox apparently does not show anything as non root19:24
sean-k-mooneywhich could also be the problem here since cirros is using busybox19:25
sean-k-mooneywhich is why i checked it19:25
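A check that looks for swap via `blkid` boils down to parsing `KEY="value"` pairs per device and filtering on `TYPE="swap"`. The sketch below assumes the output format shown in the paste above; `parse_blkid` and `swap_devices` are hypothetical helpers, not the tempest implementation. As noted in the conversation, the result depends on what `blkid` actually prints, which can differ between root and non-root runs and between the busybox and util-linux versions.

```python
import re

# Matches KEY="value" pairs as printed by blkid, e.g. TYPE="ext4".
PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_blkid(output):
    """Map each device path to a dict of its blkid attributes."""
    devices = {}
    for line in output.splitlines():
        device, sep, rest = line.partition(": ")
        if not sep:
            continue  # not a "device: KEY=..." line, e.g. a shell prompt
        devices[device] = dict(PAIR.findall(rest))
    return devices


def swap_devices(output):
    """Return the device paths whose TYPE attribute is "swap"."""
    return [dev for dev, attrs in parse_blkid(output).items()
            if attrs.get("TYPE") == "swap"]


if __name__ == "__main__":
    # /dev/vdb and its UUID are made-up sample data for illustration.
    sample = (
        '/dev/vda1: UUID="F3DE-29C6" BLOCK_SIZE="512" TYPE="vfat"\n'
        '/dev/vdb: UUID="00000000-0000-0000-0000-000000000000" TYPE="swap"\n'
    )
    print(swap_devices(sample))
```

An empty result therefore does not prove the swap disk is missing; it may only mean the command was run without the privileges needed to probe the device.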
dansmithmakes me think that this test wasn't actually running in whatever tempest configs are required to merge19:25
dansmithanyway, nothing much I can do about it now I guess19:27
dansmithI'mma go get food19:27
sean-k-mooneydid you hit the revert button19:28
dansmithpasted above: https://review.opendev.org/c/openstack/tempest/+/94637919:28
sean-k-mooneyack19:28
sean-k-mooneyok commented on that too and now i need to go before the shops start to close o/19:31
gmanndansmith: ok, I did not check the multicell job but it was passing on the regular one. agree to merge the revert first and then we can fix the test19:42
dansmithgmann: I'm really not sure why the multicell job would be special here, almost like it is missing a regex exclude or something20:11
