Monday, 2023-01-16

opendevreviewMerged openstack/nova stable/ussuri: [compute] always set instance.host in post_livemigration  https://review.opendev.org/c/openstack/nova/+/86400707:39
opendevreviewAmit Uniyal proposed openstack/nova stable/train: Adds a repoducer for post live migration fail  https://review.opendev.org/c/openstack/nova/+/86380607:42
opendevreviewAmit Uniyal proposed openstack/nova stable/train: [compute] always set instance.host in post_livemigration  https://review.opendev.org/c/openstack/nova/+/86405507:42
gibidansmith: I saw such interpreter crashes before. It is really a segfault of the python interpreter based on dmesg. Unfortunately it happens randomly afais.08:48
gibidansmith: hm, but this time it was OOM08:53
viks__hi, is there a way to set say `/var/lib/nova1` instead of `/var/lib/nova` ? i could not find any configuration to do that in `nova.conf`09:16
gibiviks__: https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.instances_path and https://docs.openstack.org/nova/latest/configuration/config.html#DEFAULT.state_path are the way I think09:43
viks__gibi: thanks... got it.. actually i was searching for `/var/lib/nova` in sample conf so i did not find it before... anyway can i set multiple values for it say for eg: `/var/lib/nova,/var/lib/nova1`  where they are differnt 2 mount points ?09:48
gibiviks__: you can only set a single path09:50
viks__gibi: ok.. thanks.. one more thing.. when to use `instances_path` if `state_path` itself will do the job? any suggestions?09:52
gibiviks__: if you want to store the instance local disks in a different place then the nova lock files then instances_path will let you separate the instance disk from the lock files09:55
viks__gibi: ok.. thanks10:04
opendevreviewSahid Orentino Ferdjaoui proposed openstack/nova master: compute: enhance compute evacuate instance to support target state  https://review.opendev.org/c/openstack/nova/+/85838310:14
opendevreviewSahid Orentino Ferdjaoui proposed openstack/nova master: api: extend evacuate instance to support target state  https://review.opendev.org/c/openstack/nova/+/85838410:14
*** elodilles_pto is now known as elodilles10:19
gibidansmith: opened the bug for the OOM https://bugs.launchpad.net/nova/+bug/200295110:32
sean-k-mooneyviks__: if you cant use a single mount path for soem reason you could fake it using lvm voluems to combine multipel disk into one or do it at thte file system level instead of block level using mergerfs https://manpages.ubuntu.com/manpages/impish/man1/mergerfs.1.html10:54
sean-k-mooneynova need everything to be in a single directory on the file system but we dont really care where that folder comes form or how you created it10:55
bauzasauniyal: so, about what we discussed for https://bugs.launchpad.net/nova/+bug/199673210:56
bauzasauniyal: what you need to do first is to check how to look at the host_state.failed_builds value10:57
sean-k-mooneyall you need to do is add a new exception that inherits form the existign one and then where we incremente the value skip it if its the new excpetion10:58
sean-k-mooneythe late affinity failure will use the new expction and not be counted10:58
sean-k-mooneybut because it inherits form the orginal any clean up that was previousl done will still be done10:58
sean-k-mooneyso ya one of the first steps is find out where we modify the build failure value10:59
sean-k-mooneyand also where we raise the current excetption for the affinity check10:59
sean-k-mooneyafter that you can add the new expction and modify both to use it11:00
bauzassean-k-mooney: the concern of auniyal was how to test iut11:00
sean-k-mooneyout side of functional tests it would be tricky to do end to end but unit test for the excption raising and functional test for the end to end interaction. there is no point doing tepest type testing since fault inject will be needed and that is not somethign tempest is good for 11:04
bauzassean-k-mooney: I think it's possible to have a functest for it11:06
bauzasbut we need to be able to verify the stats11:06
viks__sean-k-mooney: thanks for the suggestions11:06
bauzassean-k-mooney: if we have a functest that creates a group with anti-affinity policy without having the filter, then we can create two instances asking for the same host11:09
auniyalsean-k-mooney, bauzas, suppose we have only one host hostA, an instance with server group having a anti-affinity is present hostA, so if user create anothere instance of same group, that time it will go for reschedule, and hence increase the counter of build faild11:09
auniyalis this a valid scenario11:09
auniyalto test11:09
bauzassean-k-mooney: and then we could verify the stats value11:09
bauzasauniyal: yeah11:10
bauzasauniyal: but then you need to verify the counter11:10
bauzashttps://github.com/openstack/nova/blob/2eb358cdcec36fcfe5388ce6982d2961ca949d0a/nova/compute/resource_tracker.py#L1978-L198011:10
bauzaswhich is a defaultdict set here https://github.com/openstack/nova/blob/2eb358cdcec36fcfe5388ce6982d2961ca949d0a/nova/compute/resource_tracker.py#L9911:11
bauzasso the stats is updated here https://github.com/openstack/nova/blob/2eb358cdcec36fcfe5388ce6982d2961ca949d0a/nova/compute/stats.py#L14311:12
bauzasso, as you'll see, this is a keyed dict11:12
bauzaswith a functional test, you can introspect that compute.stats dict 11:13
bauzasand see that the value of that dict for the key 'failed_builds' is incremented by 111:13
bauzasafter creating inst211:13
bauzasand that's what we want to change11:14
bauzasauniyal: ^11:14
auniyalack, this will tell us isntance failed again and agian11:16
gibiauniyal: no, if you have one host with an instance in an anti-affinity group and you try to schedule the second instance to the same group then it will not trigger a reschedule11:19
gibiit will simply fail the scheduling with NoValidHost11:19
auniyalyes11:19
bauzasyes you need a test with 2 nodes, don't disagree11:20
gibibauzas: two nodes will not help either as the scheduler will pick the other host11:20
bauzasgibi: not if you trick it :)11:20
gibito trigger a late affinity check failure (and hence the build failure counter increase) you need to have two parallel scheduling request11:20
bauzasgibi: my proposal is simplier with a functest, we have code snippets for tricking the scheduler11:21
bauzasor you could use the az hack11:22
bauzasit will skip the scheduler11:22
gibi(alternatively we could try to remove the anti-affinity filter from the config then the scheduler will allow both VMs to the same host, and the second will fail the late affinity check there)11:23
bauzasgibi: ah, you missed then my point11:26
bauzas(12:09:29) bauzas: sean-k-mooney: if we have a functest that creates a group with anti-affinity policy without having the filter, then we can create two instances asking for the same host11:27
bauzaswe indeed need a functest that *doesn't* use the AntiAffinityFilter11:27
bauzasfaking the scheduler is just for making sure we land instances on the same host11:27
gibibauzas: ack, then we thought about the same thing. cool :)11:28
auniyalcan we reproduce it manually 11:29
auniyalI understand reschdule is correct but its should be counted11:31
auniyalso we just need to verify this before sauing buld failed at https://github.com/openstack/nova/blob/2eb358cdcec36fcfe5388ce6982d2961ca949d0a/nova/compute/manager.py#L226511:32
auniyal*it should NOT be counted11:32
bauzasauniyal: you can reproduce it with devstack11:37
bauzasauniyal: make sure the filter is disabled11:37
bauzasand force to create two instances with the same group on the same host11:37
bauzasthat should work11:37
sean-k-mooneybauzas: you can contol what filters are used in the fucntest so that should not be a problem11:37
bauzasintrospecting the stats field would be a bit trickier but I think we log the stats with the DEBUG level11:38
sean-k-mooneyya we likely do but if you really need to you can jsut go driect to the db11:38
bauzassean-k-mooney: yeah, and I even think we have a funtest fake filter for ensuring all instances go the same host11:38
sean-k-mooneyoften w will use a spy function to intercept and recored such thigns11:38
sean-k-mooneybauzas: we do yes11:39
bauzasanyway, /me goes off for lunch11:39
kashyapHmm, on this bz: https://bugzilla.redhat.com/show_bug.cgi?id=2138381 (on CPU compatibility).  A Red Hat customer-facing person says using the new CPU API still throws the same error.  Actually removing the check is what works correctly: 11:42
kashyaphttps://review.opendev.org/c/openstack/nova/+/869587 -- libvirt: Remove compareCPU() check in _check_cpu_compatibility()11:42
kashyapAbout debuggability concerns (if you remove the compareCPU() check: the same guy confirms you'd get the same error from libvirt.  And it is still debuggable (which is what I said before)11:43
sean-k-mooneykashyap: its much much much less debugable as you now need to boot a vm to triger it11:44
sean-k-mooneysince you have https://review.opendev.org/c/openstack/nova/+/869950 i think i would prefer if you abandoned https://review.opendev.org/c/openstack/nova/+/869587 and we proceded with the replacment instead11:46
kashyapsean-k-mooney: Sure, I would also prefer the replacement11:47
kashyapWe can agree to disagree on "much much much"11:47
kashyapI don't have the energy to argue much anyway; /me is still recovering from a bike accident11:48
sean-k-mooneywhen they tested this did they use teh cpu falgs to remove teh flag 11:48
sean-k-mooneyas in did they set cpu_extra_flags=-whatever11:49
kashyapI don't think they have specified it; I'll ask 'em on th ebz11:49
sean-k-mooneyok its also kind fo sucks that the errror message does not tell you want is incompatible11:50
sean-k-mooneyit woudl be nice if it said this set of features are unavaible11:50
kashyapYeah11:51
sean-k-mooneylooks like they found a beaker node internally with icelake to test on11:52
sean-k-mooneymaybe we can do the same or get access to test it there11:52
sean-k-mooneyit might save some back and forth if you can get direct ssh access to see whats happening or get a dev env yourself that you can deploy devstack on11:53
kashyapsean-k-mooney: Also, we should still keep the removing the check option open absolutely.  Please don't insist on keeping it w/o good reasons.11:58
kashyapThe tests pass, and DanPB also once said that code is wrong and should be even removed.11:58
kashyap("tests pass" is not the full reason; but it is not causing problems/troubles.  And it was also properly tested by the same RHT person)11:59
sean-k-mooneykashyap: i have given yuou a good reason it will regress novas functionality to remove it and i have explained why11:59
sean-k-mooneyif i was insiting i would be using my -2 rights on the patch. i am not12:00
kashyapsean-k-mooney: Sigh; the definition of "regression" is not serious here.  We're going in circles.  I also want other people's take here.12:01
kashyap(You have to see the _effect_ of the patch: it is changing _where_ it is failing.  Yes, it's a kind of a "regression"; but functionally users are better off)12:02
kashyapAnyway.  Let's explore the replacement patch in fuller too.12:02
kashyapsean-k-mooney: I'm in agreement with you on definitely using the newer API, as that's a net-benefit.  (I was not debating that one.)  12:07
sahido/ guys, do we have a process to convert an option from bool to int?12:11
sean-k-mooneysahid: we do it called not doing it. basically if your changing the type you have to rename the option and deprecat the old one. in this case your going form bool which in our congi is based on string to int12:17
sean-k-mooneyyou can do that in plcaee befause we accpet true/yes|false/no not just 1|012:18
sean-k-mooneysahid: what config option do you want to modify12:19
sean-k-mooneyyou will basically have to deprecate the old one and add a new one in the new format and support both in the A cycle.12:19
sean-k-mooneysupporting both formats is required becasue you are not allowd to requrie config change to upgrade12:20
sahidyes that the point I don't want to break things.12:25
sahidsean-k-mooney: I'm not sure to understand you mean we can update from bool to int transparently as this is using a string to int?12:25
sean-k-mooneywe cant do it transparently because its string to int12:27
sahidoh.. but the pb in our case will be that, a True will not be converted to a int12:27
sean-k-mooneyin c it would be int to int12:27
sean-k-mooneywe still need to accpet yes/y/True ectra in the config12:28
sean-k-mooneyand that woudl have to be converted to 1 i guess 12:28
sean-k-mooneybut you would also have to accpet 1 and any other values you are supproting12:28
sean-k-mooneyso ya after your change you still need to be able to handel a config with True in it as valid if it was to be transparent12:29
sean-k-mooneyso thats the problem in this case12:29
sahidsean-k-mooney: is related to this one, if you have a moment to take a look https://review.opendev.org/c/openstack/nova/+/86732412:33
sahidbasically it's to add ability to set number of retry12:33
sahidoriginaly the option is Bool, and used to activated or desactive announces12:34
sean-k-mooneyah that patch i saw that breifly fly by12:34
sean-k-mooneyhonestly i would just add a second config option for the retry12:34
sahidit's now needed to specify a number of retries and i wamted to avoid that we introduce a new option12:35
sean-k-mooneyyep but if we do this i think its just clean to add a new option and default to 1 or 312:35
sahidyes... as it turn now it's basically what we will have to do12:35
sean-k-mooneyya so workaround options still are treatd like normal config options so the same rules apply12:36
sahidyou mean we could harcored the number of retry instead,12:36
sean-k-mooneyin this case i would jsu tkeep the enable as a bool and add a retry option12:37
sean-k-mooneywell we coudl but im ok with a config option for the reties12:37
sean-k-mooneyill just comment on the patch one sec.12:37
sahidso one option to enable, one option to set the number of retries, and one option to specify the interval12:37
sahidcool thank you12:37
sean-k-mooneyyep exactly12:38
sean-k-mooneyand we can set teh retires and interval to whatever we think is a good default12:38
sahidok fairenough :)12:39
sean-k-mooneyok done i was suggesting 1 or 3 because 1 i sthe current behavior and 3 is what qemu defaults too when it sends them12:43
kashyapsean-k-mooney: BTW a small data point on that "mpx" saga: if Nova doesn't break at the first CPU compare in check_cpu_compatibility(), then using "cpu_model_extra_flags=-mpx" works13:37
kashyap(That gives a hint too that the first compare is wrong)13:38
sean-k-mooneythe way it should be working is we should be removing the mpx flag form all the modles listed in cpu_models and if any of them pass then we proceed as normal13:39
sean-k-mooneyso as long as any of the listed modeles work with the cpu_model_extra_flags option applied then we shoudl boot13:40
sean-k-mooney/boot/start the agent/13:40
sean-k-mooneyalthough really if any of them are invlied with that combination we shoudl reject it as an error13:41
*** dasm|off is now known as dasm14:03
opendevreviewAaron S proposed openstack/nova master: Add further workaround features for qemu_monitor_announce_self  https://review.opendev.org/c/openstack/nova/+/86732416:33
opendevreviewArtom Lifshitz proposed openstack/nova master: Microversion 2.94: FQDN in hostname  https://review.opendev.org/c/openstack/nova/+/86981216:56
artombauzas, ^^16:56
bauzasartom: ack, will look16:57
bauzasdamn, who knows the Launchpad nick of Kirill ? /me needs to paperwork the right ownership of https://blueprints.launchpad.net/nova/+spec/ironic-vnc-console17:39
bauzasanyway, I can live with that17:40
bauzaswow, the numbers of accepted blueprints for Antelope are identical to Yoga17:44
bauzasdisclaimer: this is gonna be a productive 5-week17:44
sean-k-mooneyya we have more then we will likely land but we shal see how it goes17:53
sean-k-mooneypci in palcemnt is technially complete we jsut have some cleanups  and a bugfix still waiting ot merge17:53
sean-k-mooneybut the feature is fully merged17:53
sean-k-mooneyim hoping artom's fqdn change, shaids evacuate change and dansmits uuid change will merge in the next week17:54
sean-k-mooneywe will see i guess based on review bandwith17:55
opendevreviewMerged openstack/nova master: Follow up for the PCI in placement series  https://review.opendev.org/c/openstack/nova/+/85565418:40
opendevreviewMerged openstack/nova master: Rename _to_device_spec_conf to _to_list_of_json_str  https://review.opendev.org/c/openstack/nova/+/85564819:44
*** dasm is now known as dasm|off22:37
opendevreviewMerged openstack/nova master: Enable new defaults and scope checks by default  https://review.opendev.org/c/openstack/nova/+/86621823:49
opendevreviewMerged openstack/nova master: Remove use of removeprefix  https://review.opendev.org/c/openstack/nova/+/86778823:49
opendevreviewMerged openstack/nova master: Unit test exceptions raised duing live migration monitoring  https://review.opendev.org/c/openstack/nova/+/85935823:56
opendevreviewMerged openstack/nova master: Reproduce PCI pool filtering bug  https://review.opendev.org/c/openstack/nova/+/85564923:56

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!