Wednesday, 2022-08-17

<opendevreview> melanie witt proposed openstack/nova master: Adapt websocketproxy tests for SimpleHTTPServer fix  https://review.opendev.org/c/openstack/nova/+/853379  [00:34]
<gibi> melwitt: thank you for taking the triage baton  [07:06]
<whoami-rajat> hey bauzas, jfyi I've rebased my rebuild feature to use mv 2.94  [07:14]
<gibi> melwitt, sean-k-mooney, stephenfin: with frickler's help https://review.opendev.org/c/openstack/nova/+/826523 is unblocked now (the procedural -2 is removed). I'll leave it to some of you to +A the bottom when you feel it can start landing  [10:06]
<sean-k-mooney> ah sorry, had to load context; frickler removed sylvain's procedural -2  [10:08]
<sean-k-mooney> got it  [10:08]
<sean-k-mooney> looks like the first third to half of the feature is ready to merge, so I'll kick off the start of the series, thanks for that  [10:09]
<sean-k-mooney> gibi: done  [10:11]
<gibi> sean-k-mooney: cool  [10:11]
<sean-k-mooney> that should merge everything up until https://review.opendev.org/c/openstack/nova/+/760456/12  [10:12]
<sean-k-mooney> ah, I was +1 on that while we were debating the terminology  [10:14]
<sean-k-mooney> ok, upgraded my votes on 2 others, so that should bring us up to https://review.opendev.org/c/openstack/nova/+/826529/9 and later, which I have not really reviewed. That's just over halfway I think, so that will help manage the patch series review  [10:19]
<sean-k-mooney> gibi: I'm not sure I'll have time today, but am I right in assuming there are some patches ready to review in the pci series?  [10:19]
<gibi> sean-k-mooney: yes, up to https://review.opendev.org/c/openstack/nova/+/850468 it is ready  [10:20]
<gibi> patches on top of that are under development (they have TODOs in the commit message)  [10:21]
<sean-k-mooney> ok cool, I'll see if I can make some time for those  [10:21]
<gibi> thanks  [10:30]
<sean-k-mooney> do you have the link to the etherpad for the feature reviews, by the way?  [10:31]
<sean-k-mooney> I have the microversion plan https://etherpad.opendev.org/p/nova-zed-microversions-plan but can't recall what the other one was called  [10:32]
<sean-k-mooney> oh, it's probably on the meeting wiki  [10:34]
<sean-k-mooney> ah yes, https://etherpad.opendev.org/p/nova-zed-blueprint-status  [10:34]
<gibi> sorry, I was distracted  [10:43]
<gibi> but yes, you found the right one  [10:44]
<opendevreview> Merged openstack/nova master: Add limitation to docs about bug 1983570  https://review.opendev.org/c/openstack/nova/+/852168  [11:14]
<sean-k-mooney> artom: o/  [11:18]
<artom> ~o~  [11:19]
<sean-k-mooney> artom: will you have time to work on updating bauzas's mdev parsing series this week?  [11:19]
<artom> sean-k-mooney, yeah, I didn't forget  [11:20]
<sean-k-mooney> thanks  [11:21]
<opendevreview> Merged openstack/nova stable/train: [CI] Fix gate by using zuulv3 live migration and grenade jobs  https://review.opendev.org/c/openstack/nova/+/795435  [13:06]
<opendevreview> Merged openstack/nova stable/xena: Reproducer unit test for bug 1934094  https://review.opendev.org/c/openstack/nova/+/842922  [14:17]
*** dasm|off is now known as dasm  [14:22]
<opendevreview> Merged openstack/nova stable/xena: Fix instance's image_ref lost on failed unshelving  https://review.opendev.org/c/openstack/nova/+/842923  [15:14]
<melwitt> gibi: \o/ thanks!  [15:27]
<gibi> melwitt: :)  [15:29]
<melwitt> elodilles: thank you for getting stable/train unblocked. it's a miracle! 😂  [15:34]
<opendevreview> Balazs Gibizer proposed openstack/nova master: Reproduce bug 1986838  https://review.opendev.org/c/openstack/nova/+/853516  [15:54]
<gibi> sean-k-mooney: ^^ another interesting bug, fallout from the pci work  [15:54]
<gibi> the scheduler expects the compute will fail, and the compute says the scheduler should have done a better job  [15:56]
<gibi> but eventually nobody does its job and we have broken instances  [15:57]
<gibi> story of nova :D  [15:57]
<elodilles> melwitt: np, it was your patch anyway o:) so thanks too! :)  [16:08]
<elodilles> I still don't know what is causing the original problem, but at least with zuul v3 jobs the gate works \o/  [16:09]
<melwitt> same. glad you thought to try that patch again, I don't think I would have thought of it :)  [16:16]
<elodilles> well, it was not even my idea, as the neutron team fixed the same issue with the same solution (changing to zuul v3), so that is why I also tried it o:)  [16:19]
<melwitt> :D  [16:24]
<sean-k-mooney> gibi: I thought we had a check to prevent two aliases being the same  [16:53]
<gibi> sean-k-mooney: we don't. And with traits in the alias we can create different aliases that match the same device from nova's perspective  [16:54]
<gibi> but at least with placement, in the above case there would be zero allocation candidates  [16:54]
<gibi> so the scheduling would fail accordingly  [16:54]
<gibi> anyhow, I will propose a simple fix that is backportable  [16:55]
<gibi> and on master this could never happen with placement  [16:55]
<gibi> after the pci in placement work lands  [16:55]
<sean-k-mooney> https://github.com/openstack/nova/blob/master/nova/tests/unit/pci/test_request.py#L190-L231  [16:55]
<sean-k-mooney> this is what I was remembering  [16:56]
<sean-k-mooney> we prevent the same alias being defined twice with conflicting numa_policies or device types  [16:56]
<gibi> yeah, we prevent defining the same alias twice  [16:57]
<gibi> but we allow two aliases matching the same device  [16:57]
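[editor's note: a minimal sketch of the situation being discussed, with invented names and fields rather than nova's real alias data model, showing how two differently named aliases can both match the same single physical device.]

```python
# Hypothetical PCI alias specs (not nova's real data model): two aliases
# with different names but identical matching criteria.

def alias_matches(alias_spec, device):
    """True if every key/value in the alias spec matches the device."""
    return all(device.get(key) == value for key, value in alias_spec.items())

aliases = {
    "gpu-a": {"vendor_id": "10de", "product_id": "1eb8"},
    "gpu-b": {"vendor_id": "10de", "product_id": "1eb8"},
}

device = {"vendor_id": "10de", "product_id": "1eb8", "address": "0000:81:00.0"}

# Both aliases match the single device, so a flavor requesting
# "gpu-a:1,gpu-b:1" produces two PCI requests that only this one
# device can satisfy -- the double-claim scenario discussed above.
matching = [name for name, spec in aliases.items() if alias_matches(spec, device)]
```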
<sean-k-mooney> yep, the second is valid  [16:57]
<sean-k-mooney> what is not valid is having 2 claims against one device  [16:57]
<sean-k-mooney> the two aliases should have created two pci request objects  [16:57]
<sean-k-mooney> the bug is that we don't ensure all pci_request objects are fulfilled by unique host pci addresses  [16:58]
<gibi> there will be two InstancePCIRequest objects, that is OK. the bug is either a) the scheduler does not fail when it fails to consume both requests, or b) pci_claim does not fail when it fails to consume both requests  [16:59]
<sean-k-mooney> for example, if the alias name is different it's fine for the address/(product_id/vendor_id) to be the same with different numa policies  [16:59]
<sean-k-mooney> ya, so it's a)  [16:59]
<sean-k-mooney> we should not get to the claim on the compute host if there is only one device  [17:00]
<sean-k-mooney> by the way, the code for the scheduler and the compute node is basically the same  [17:00]
<sean-k-mooney> except the scheduler path uses a copy of the data  [17:00]
<sean-k-mooney> when it's checking if you can consume the pci device  [17:00]
<sean-k-mooney> so if you fix it, it will fix it for both  [17:01]
<gibi> but even if a) passes because there are two free devices at that time, b) should detect and fail if a parallel claim consumed one of the devices and the new request cannot be fulfilled  [17:01]
<sean-k-mooney> yes  [17:01]
<sean-k-mooney> but a) is implemented by calling the code for b) on a copy of the data  [17:01]
<sean-k-mooney> so if we fix the latter, it fixes both  [17:01]
<gibi> while in general the logic of the scheduler and the pci claim is the same for pci, in reality this part differs. I will push the fix and show that we need to fix this in two separate places  [17:02]
<gibi> I think we are the same in using _filter_pools, but that is called from two different paths by the two services  [17:04]
<sean-k-mooney> https://github.com/openstack/nova/blob/master/nova/pci/stats.py#L622-L647  [17:04]
<sean-k-mooney> support_requests just calls filter_pools  [17:04]
<sean-k-mooney> correct  [17:04]
<sean-k-mooney> on the compute node it calls consume_requests  [17:05]
<sean-k-mooney> https://github.com/openstack/nova/blob/master/nova/pci/stats.py#L226-L269  [17:05]
<gibi> btw, https://github.com/openstack/nova/blob/master/nova/pci/stats.py#L645 is broken, as it individually matches requests to pools and does not consume pools by requests  [17:05]
<sean-k-mooney> but fixing it in _filter_pools is what I meant by fixing it in one place to fix both  [17:05]
<gibi> the scheduler calls apply_requests, which detects the double consumption  [17:05]
<gibi> sean-k-mooney: fixing it in filter_pools is tricky, as filter_pools is not meant to consume things  [17:06]
<gibi> can be done though, if needed  [17:06]
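[editor's note: a simplified sketch of the distinction being discussed — illustrative only, nova's real `support_requests`/`consume_requests` in `nova/pci/stats.py` are far more involved. A per-request filter check that never decrements counts passes even when two requests need the same single device, while a consume-style check catches it.]

```python
def support_requests(pools, requests):
    # Filter-style check in the spirit of the broken line referenced
    # above: each request is matched against the pools independently,
    # without decrementing any counts.
    return all(any(count > 0 for count in pools.values()) for _ in requests)

def consume_requests(pools, requests):
    # Consume-style check: each satisfied request decrements a count,
    # so a second request for the same single device is refused.
    pools = dict(pools)  # work on a copy, leave the caller's data alone
    for _ in requests:
        for addr, count in pools.items():
            if count > 0:
                pools[addr] -= 1
                break
        else:
            return False
    return True

# One free device, but two PCI requests (e.g. from two aliases that
# both match it).
pools = {"0000:81:00.0": 1}
requests = ["alias-a", "alias-b"]
```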
<sean-k-mooney> right, but apply is not meant to be called in the scheduler  [17:06]
<sean-k-mooney> it will work, but we need to make sure not to commit it to the db  [17:06]
<sean-k-mooney> so it will need to be a deep copy  [17:07]
<sean-k-mooney> with no save  [17:07]
<sean-k-mooney> ok, it does not save  [17:07]
<gibi> nova.scheduler.manager.SchedulerManager._consume_selected_host  [17:07]
<sean-k-mooney> so as long as we do a deep copy of the pool, we are good  [17:07]
<gibi> that is the one calling apply  [17:07]
<sean-k-mooney> really, for multi-create  [17:08]
<sean-k-mooney> we might already be using a copy, by the way  [17:08]
<gibi> yes, probably  [17:08]
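[editor's note: the deep-copy point can be sketched like this — invented classes, not nova's real HostState or PciDeviceStats. The scheduler runs an apply/consume against a copy of the pools, so a failed consumption never leaks back to the authoritative data and nothing is saved to the database.]

```python
import copy

class PoolStats:
    """Toy stand-in for a per-host PCI pool tracker."""

    def __init__(self, pools):
        # pools maps a PCI address to the number of free devices there
        self.pools = pools

    def apply_requests(self, requests):
        """Consume one device per request; raise if one cannot be met."""
        for request in requests:
            free = [addr for addr, count in self.pools.items() if count > 0]
            if not free:
                raise ValueError("no free PCI device for %s" % request)
            self.pools[free[0]] -= 1

host_stats = PoolStats({"0000:81:00.0": 1})

# Scheduler-side view: a deep copy, as discussed above.
scheduler_view = copy.deepcopy(host_stats)
try:
    # Two requests against a single free device: the second one fails.
    scheduler_view.apply_requests(["alias-a", "alias-b"])
    scheduled = True
except ValueError:
    scheduled = False
# Whatever happened, host_stats is untouched -- only the copy changed.
```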
<sean-k-mooney> we had to fix this in the past  [17:08]
<sean-k-mooney> gibi: https://opendev.org/openstack/nova/src/branch/master/nova/scheduler/host_manager.py#L309  [17:15]
<gibi> yepp, that one  [17:15]
<gibi> as you said, we need to consume in the scheduler to handle multi-create  [17:15]
<sean-k-mooney> ya, ok  [17:15]
<sean-k-mooney> the host manager has a copy of the pci stats  [17:16]
<sean-k-mooney> so I guess doing the apply in the filter might be ok  [17:16]
<sean-k-mooney> I would have expected _locked_consume_from_request  [17:16]
<sean-k-mooney> to fail  [17:16]
<sean-k-mooney> if calling apply_requests was enough  [17:16]
<gibi> it fails  [17:17]
<sean-k-mooney> oh, so we never get to the compute  [17:17]
<gibi> but @set_update_time_on_success catches everything  [17:17]
<sean-k-mooney> I thought you said we got to the compute  [17:17]
<gibi> https://opendev.org/openstack/nova/src/branch/master/nova/scheduler/host_manager.py#L266  [17:17]
<gibi> https://opendev.org/openstack/nova/src/branch/master/nova/scheduler/host_manager.py#L81  [17:17]
<gibi> this is where the scheduler points to the compute :D  [17:18]
<sean-k-mooney> ok, but we don't actually select the host and proceed, do we  [17:18]
<sean-k-mooney> I guess we might, if nothing depends on the result of the consume  [17:18]
<sean-k-mooney> ok, it returns nothing  [17:19]
<sean-k-mooney> and since that just logs  [17:19]
<sean-k-mooney> we continue and fail  [17:19]
<sean-k-mooney> on the compute  [17:19]
<gibi> yepp, that just logs  [17:19]
<sean-k-mooney> ok, so that should raise an instance build exception or similar  [17:19]
<sean-k-mooney> so we retry on the next alternate host  [17:19]
<sean-k-mooney> well, no  [17:20]
<sean-k-mooney> we need to fail in the filter  [17:20]
<sean-k-mooney> so call apply there  [17:20]
<sean-k-mooney> but we probably should also address the fact that it eats valid exceptions  [17:20]
<gibi> and this is where the compute detects the error and only logs, does not fail the instance build: https://github.com/openstack/nova/blob/master/nova/pci/stats.py#L244  [17:21]
<gibi> it points to the scheduler :D  [17:21]
<sean-k-mooney> well, it's meant to return None  [17:21]
<sean-k-mooney> and the caller should error, but I guess it does not  [17:21]
<sean-k-mooney> i.e. the caller probably should be checking the length of the allocations against the requests  [17:22]
<gibi> also, apply does not care about dependent devices, but consume_requests does, so we might actually need to call consume_requests  [17:22]
<sean-k-mooney> so support_requests only exists to avoid calling consume  [17:22]
<sean-k-mooney> but we probably could remove it and call consume  [17:23]
<gibi> yeah, probably  [17:23]
<sean-k-mooney> the reason we don't right now  [17:23]
<sean-k-mooney> is in case the host is eliminated by another filter  [17:23]
<sean-k-mooney> we want to make sure the host state object has the correct data  [17:24]
<sean-k-mooney> but we can do that by doing a deep copy of the pci stats in the filter  [17:24]
<sean-k-mooney> there are other ways around that too  [17:25]
<sean-k-mooney> this comes back to correct behavior for multi-create/server groups  [17:25]
<sean-k-mooney> I would probably try just calling consume and see if that breaks any of our tests  [17:25]
<sean-k-mooney> then we might be able to remove the call to apply  [17:26]
<sean-k-mooney> in the host manager  [17:26]
<gibi> yepp  [17:26]
<gibi> that makes sense  [17:26]
* sean-k-mooney hates multi-create, our lives would be much simpler without it  [17:28]
<spatel> can I set a hard limit in nova on the number of VM instances created?  [17:40]
<JayF> So, trying to figure out why gate jobs keep failing on my change. They are seemingly random, and I noticed something strange in zuul status: https://home.jvf.cc/~jay/gated-after-master-change.png  [18:15]
<JayF> 826523,5 is a change landing into master AFAICT  [18:15]
<JayF> but is in the same merge queue as my change to stable/yoga (800873,2)  [18:15]
<JayF> s/yoga/victoria/  [18:16]
<sean-k-mooney> they might be intermittent failures, yes  [18:17]
<JayF> Well, I'm just saying I'm surprised that stable/ patches and master patches are gate-queued together  [18:17]
<sean-k-mooney> oh, they should not be, unless it's tempest  [18:18]
<JayF> that's what my png ^^^ shows  [18:18]
<JayF> is that it seems to have done exactly that  [18:18]
<sean-k-mooney> do you have the link to the backport?  [18:18]
<JayF> oooh, or do the separate lines indicate "nope, this is in a separate queue"?  [18:18]
<JayF> https://review.opendev.org/c/openstack/nova/+/800873 same as yesterday  [18:19]
<JayF> trying to get it to actually land  [18:19]
<sean-k-mooney> the separate lines are separate sets of patches that are in flight  [18:19]
<sean-k-mooney> so those cinder changes are separate from the nova one  [18:19]
<JayF> got it, this is just me learning for the first time, after looking at this UI for an embarrassingly long time  [18:19]
<JayF> that the lines on the left have meaning  [18:19]
<JayF> lol  [18:20]
<JayF> thank you for the clarity :D  [18:20]
<sean-k-mooney> there is something wrong here  [18:20]
<sean-k-mooney> well, hmm, I'm not sure  [18:20]
<sean-k-mooney> 800873,2 is a master patch  [18:20]
<sean-k-mooney> but I don't think it depends on the 826523,5  [18:21]
<JayF> https://review.opendev.org/c/openstack/nova/+/800873 that's not true  [18:21]
<sean-k-mooney> I think the patch it was speculatively merged with has already merged  [18:21]
<JayF> this says repo | branch: openstack/nova | stable/victoria  [18:21]
<sean-k-mooney> on that, yes  [18:21]
<sean-k-mooney> but not on https://review.opendev.org/c/openstack/nova/+/826523/  [18:21]
<JayF> that is 800873,2  [18:21]
<sean-k-mooney> that is the second nova patch, the one with the x  [18:22]
<sean-k-mooney> so the integrated queue currently has 3 sets of changes in flight, I think  [18:22]
<sean-k-mooney> 826523,5, which is your one that depends on nothing  [18:22]
<sean-k-mooney> 800873,2, which depends on a change that has already merged  [18:22]
<sean-k-mooney> and then the swift/cinder ones  [18:23]
<sean-k-mooney> which are a third set  [18:23]
<JayF> yeah; I think I misread the queue docs  [18:23]
<JayF> the behavior seems sane to me now that I understand what the UI was tryin' to tell me  [18:23]
<sean-k-mooney> the UI is not super clear  [18:24]
<sean-k-mooney> on the plus side, the backport is still green  [18:24]
<sean-k-mooney> so hopefully it will have merged in the next half hour or so  [18:24]
<JayF> no, the backport is the one that is X, I thought  [18:25]
<sean-k-mooney> oh, you are right  [18:25]
<JayF> yeah, nova-live-migration failed again :|  [18:25]
<sean-k-mooney> failed on nova-live-migration  [18:25]
<sean-k-mooney> ah  [18:25]
<JayF> I wish I could tell Zuul "trust me, the Ironic driver can't break this" :|  [18:26]
<sean-k-mooney> test_live_block_migration_with_attached_volume  [18:26]
<sean-k-mooney> well, actually this came up in the past  [18:26]
<sean-k-mooney> we could skip the live migration job if it's an ironic-only change  [18:26]
<JayF> I mean, that seems sane-ish to me, but maybe a lot of effort for a branch where this is likely literally one of the only ironic driver things to hit  [18:26]
* JayF is doing an audit in Ironic right now for patches to backport  [18:27]
<sean-k-mooney> we can add nova/virt/ironic and nova/tests/unit/virt/ironic  [18:27]
<sean-k-mooney> to the irrelevant-files list  [18:27]
<sean-k-mooney> well, I'm sure you would not mind having it on master, and it should be simple to backport  [18:27]
<JayF> you want to tell me where that change goes? If so I'll post it, and you can keep your +2 :D  [18:27]
<JayF> I took just long enough of a break from OpenStack to not really grok the new zuul configs very well  [18:28]
<sean-k-mooney> https://github.com/openstack/nova/blob/master/.zuul.yaml#L92-L106  [18:28]
<JayF> incoming  [18:28]
<sean-k-mooney> unfortunately, that is used as the definition of nova-base-irrelevant-files  [18:29]
<sean-k-mooney> that is signified by the &  [18:29]
<sean-k-mooney> so you will need to copy that section to https://github.com/openstack/nova/blob/master/.zuul.yaml#L131  [18:29]
<sean-k-mooney> then update it in the current one for the migration job  [18:29]
<JayF> got it  [18:29]
<JayF> yaml anchors are awesome and terrible :D  [18:29]
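[editor's note: for reference, the anchor/alias pattern being discussed looks roughly like this — a simplified sketch, not the actual contents of nova's .zuul.yaml. The `&` names a block so `*` can reuse it, which is why the shared list has to be copied rather than extended in place.]

```yaml
- job:
    name: nova-base-job
    # "&" defines the anchor that other jobs reference.
    irrelevant-files: &nova-base-irrelevant-files
      - ^doc/.*$
      - ^releasenotes/.*$

- job:
    name: nova-live-migration
    # "*" reuses the anchored list as-is; YAML has no way to extend an
    # aliased list, so adding ironic-only paths for this job means
    # copying the whole list and appending to the copy.
    irrelevant-files: *nova-base-irrelevant-files
```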
<sean-k-mooney> ok, I'm going to go figure out dinner. I'll check back in a while, but feel free to add me to the review and/or ping me here  [18:30]
<sean-k-mooney> in the meantime you can likely just recheck once the gate change fails  [18:30]
<sean-k-mooney> with a message please :) "recheck ironic does not support live migration" or similar  [18:31]
<sean-k-mooney> we have an issue with volume detach intermittently failing, and I think that is what you are hitting  [18:32]
<opendevreview> Jay Faulkner proposed openstack/nova master: nova-live-migration tests not needed for Ironic  https://review.opendev.org/c/openstack/nova/+/853529  [18:32]
<JayF> sean-k-mooney: ^  [18:33]
<JayF> Thanks sean :)  [18:33]
<sean-k-mooney> ok, that looks ok I think  [18:34]
<sean-k-mooney> test-only changes and release-note-only changes are already skipped  [18:34]
<sean-k-mooney> so just the ironic virt driver  [18:34]
<JayF> That's why I didn't add the tests line, yep  [18:34]
<JayF> I know how the stanzas for zuul config work, I just can't ever find the #@%$# configs anymore lol  [18:35]
<sean-k-mooney> you can technically go from the zuul UI to the job definition, mostly; at least to the repo where they're defined  [18:36]
<sean-k-mooney> I normally fall back on https://codesearch.opendev.org/ if I can't find it quickly  [18:36]
<sean-k-mooney> anyway, chat to you tomorrow o/  [18:36]
<opendevreview> Merged openstack/nova stable/victoria: [ironic] Minimize window for a resource provider to be lost  https://review.opendev.org/c/openstack/nova/+/800873  [20:27]
<JayF> \o/  [20:27]
<opendevreview> Jay Faulkner proposed openstack/nova stable/ussuri: [ironic] Minimize window for a resource provider to be lost  https://review.opendev.org/c/openstack/nova/+/853540  [20:30]
<opendevreview> Rico Lin proposed openstack/nova master: Add locked_memory extra spec and image property  https://review.opendev.org/c/openstack/nova/+/778347  [21:19]
<opendevreview> Rico Lin proposed openstack/nova master: libvirt: Add vIOMMU device to guest  https://review.opendev.org/c/openstack/nova/+/830646  [21:19]
<opendevreview> Rico Lin proposed openstack/nova master: Add traits for viommu model  https://review.opendev.org/c/openstack/nova/+/844507  [21:19]
<ricolin> gibi: I was asking about the necessity of raising an exception for locked memory, but I guess the right behavior is to raise it. Just updated the patch accordingly. Thanks for your review, please help review again. ^^^  [21:20]
*** dasm is now known as dasm|off  [22:00]
<opendevreview> Jay Faulkner proposed openstack/nova stable/train: [ironic] Minimize window for a resource provider to be lost  https://review.opendev.org/c/openstack/nova/+/853546  [22:04]
<JayF> Train is as far back as you all go, right?  [22:05]
<sean-k-mooney[m]> I think we still might have older branches, but I think we have gate issues beyond train  [22:23]
<sean-k-mooney[m]> we still have stable/queens branches  [22:24]
<sean-k-mooney[m]> but I don't think the gate will pass. you can try  [22:24]
<sean-k-mooney[m]> we have been discussing disabling some broken jobs on really old branches  [22:25]
<JayF> train is probably the most reasonable to go back to anyway  [22:35]
<melwitt> JayF: iiuc, the eventual fix in the ironic driver was this https://review.opendev.org/q/Iba87cef50238c5b02ab313f2311b826081d5b4ab but it hasn't been reviewed on older branches yet. fyi  [22:57]

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!