Friday, 2021-10-22

00:29 <brinzhang> bauzas: I agree with gibi and sean-k-mooney, I have no difficulty discussing it, thanks
02:29 *** ganso_ is now known as ganso
02:31 *** viks___ is now known as viks__
07:25 <gibi> morning
08:00 <bauzas> good morning
08:09 <opendevreview> Balazs Gibizer proposed openstack/nova master: Add a WA flag waiting for vif-plugged event during reboot  https://review.opendev.org/c/openstack/nova/+/813419
08:13 <opendevreview> Balazs Gibizer proposed openstack/nova master: Add a WA flag waiting for vif-plugged event during reboot  https://review.opendev.org/c/openstack/nova/+/813419
08:27 <opendevreview> Balazs Gibizer proposed openstack/nova stable/pike: Add a WA flag waiting for vif-plugged event during reboot  https://review.opendev.org/c/openstack/nova/+/813437
08:47 -opendevstatus- NOTICE: zuul needed to be restarted, queues were lost, you may need to recheck your changes
08:56 <stephenfin> bauzas: Morning o/ Care to look at https://review.opendev.org/c/openstack/nova/+/814547
08:56 <bauzas> stephenfin: ok, thanks for finding it!
08:58 <bauzas> stephenfin: just a thought, you're not explaining in https://review.opendev.org/c/openstack/nova/+/814547/1//COMMIT_MSG why we now have a regression
08:59 <stephenfin> oh, sorry, the regression is because I removed I6ce930fa86c82da1008089791942b1fff7d04c18
08:59 <stephenfin> I mention that at the end of the commit message. It's kind of implicit though, admittedly
09:00 <stephenfin> I thought I'd fixed the issue that made I6ce930fa86c82da1008089791942b1fff7d04c18 necessary. Evidently not :(
09:00 <opendevreview> Rajat Dhasmana proposed openstack/nova-specs master: Add spec for volume backed server rebuild  https://review.opendev.org/c/openstack/nova-specs/+/809621
09:10 <bauzas> stephenfin: ok, then I'll leave a comment noting that and then I'll approve
09:12 <bauzas> done.
09:36 <stephenfin> ty
09:40 <opendevreview> Merged openstack/nova master: db: Increase timeout for migration tests  https://review.opendev.org/c/openstack/nova/+/814547
10:04 <opendevreview> Wenping Song proposed openstack/nova master: Support concurrently add hosts to aggregates  https://review.opendev.org/c/openstack/nova/+/815105
10:16 <gibi> sean-k-mooney[m]: do I understand correctly that neutron's sriov-nic-agent only sends plug-time plug/unplug events for vnic_type=direct ports but not for vnic_type=direct-physical ports?
10:18 <gibi> sean-k-mooney[m]: https://github.com/openstack/neutron/blob/6d8e830859cd4ac9708701b8e344fdc68cbcaebb/neutron/plugins/ml2/drivers/mech_sriov/mech_driver/mech_driver.py#L135-L137
10:20 <sean-k-mooney[m]> hmm, that is a good question. I guess that would be the case, yes, since for PFs the agent does not configure anything: anything it did would be undone when we detach the device from the host kernel and attach it to the guest
10:21 <sean-k-mooney[m]> I have never actually checked its behavior in that regard
10:22 <gibi> in my local env I see plug/unplug events during nova hard reboot for VF ports but not for PF ports, so this is probably the case
10:23 <sean-k-mooney[m]> yes, so you might need to make an exception in your workaround patch
10:23 <sean-k-mooney[m]> perhaps change it from a boolean to a list of vnic_types
10:24 <sean-k-mooney[m]> ODL only supports vnic_type normal and vhost_user
10:24 <sean-k-mooney[m]> well, vhost-user
10:24 <gibi> I think the doc in the patch is still correct where we say to set the flag only for ml2/ovs or networking-odl
10:25 <gibi> I might extend that with mech_sriov + vnic_type direct
10:25 <sean-k-mooney[m]> right, but if you filter by vnic_type you can use it when you have ODL and SRIOV on the same host
10:26 <gibi> yeah, I can ignore direct-physical ports when waiting for plug
10:26 <sean-k-mooney[m]> ya, I guess that also works
10:26 <gibi> sean-k-mooney[m]: what would be your way to filter?
10:28 <sean-k-mooney[m]> if we make the config option a list of vnic_types to wait for on hard reboot, we just do
10:28 <sean-k-mooney[m]> if vif.vnic_type in CONF.wait_on_reboot: …
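A minimal sketch of the list-based filter sean-k-mooney suggests here, assuming a hypothetical [workarounds]/wait_for_vif_plugged_vnic_types option; the option name, default, and helper function are illustrative, not the code that ended up in the patch:

    # Hedged sketch of a vnic_type allow-list for the reboot workaround.
    # The option name and VIF access pattern are assumptions for
    # illustration only.
    from oslo_config import cfg

    CONF = cfg.CONF
    CONF.register_opts([
        cfg.ListOpt('wait_for_vif_plugged_vnic_types',
                    default=[],
                    help='vnic_types whose plug events nova waits for '
                         'during hard reboot'),
    ], group='workarounds')

    def vifs_to_wait_for(network_info):
        """Return the VIFs whose plugged events we should wait for."""
        wanted = CONF.workarounds.wait_for_vif_plugged_vnic_types
        return [vif for vif in network_info
                if vif['vnic_type'] in wanted]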
10:29 <gibi> hm yeah, that is also a way
10:29 <sean-k-mooney[m]> I'm not sure if we need different behavior for other vnic_types, like the cyborg ones
10:30 <sean-k-mooney[m]> or baremetal, though that is used only by ironic
10:31 <sean-k-mooney[m]> I would have to look at the spec again, but when we are using cyborg-provided smart NICs neutron still sends the events, right?
10:35 <gibi> hm, neutron seems to support accelerator-direct with mech_sriov, and direct means a VF, so I assume there are plug-time events
10:35 <sean-k-mooney[m]> I would assume so too, but it's not actually mentioned in https://specs.openstack.org/openstack/nova-specs/specs/xena/implemented/sriov-smartnic-support.html
10:38 <sean-k-mooney[m]> gibi: so you could either limit this to the cases we know work (normal, direct, vhost-user) or filter out the cases we know won't work (direct-physical, accelerator-direct-physical)
10:38 <gibi> yeah
10:38 <gibi> as the config today needs to be opt-in
10:38 <gibi> I guess opting in to supported vnic_types is better
10:39 <gibi> when you say vhost-user, why is it not enough to simply filter for vnic_type direct?
10:40 <sean-k-mooney[m]> in that case the list would be normal, direct, macvtap, accelerator-direct, vhost-user
10:40 <sean-k-mooney[m]> well, direct should work, right
10:40 <sean-k-mooney[m]> hardware-offloaded OVS with ml2/ovs supports direct and will send plug-time events, and the sriov nic agent should also
10:41 <sean-k-mooney[m]> and vhost-user should also work with ml2/ovs and ml2/odl
10:41 <sean-k-mooney[m]> the sriov nic agent should support macvtap plug-time events
10:42 <sean-k-mooney[m]> ml2/ovs should also send them for vdpa
10:42 <gibi> for the deployer it is probably easier to just list vnic_types and not go into details like vhost-user
10:42 <sean-k-mooney[m]> vhost-user is a vnic_type
10:42 <gibi> sean-k-mooney[m]: is it?
10:42 <sean-k-mooney[m]> yes
10:42 <sean-k-mooney[m]> it's not a vif_type
10:43 <gibi> blob/6d8e830859cd4ac9708701b8e344fdc68cbcaebb/neutron/plugins/ml2/drivers/mech_sriov/mech_driver/mech_driver.py#L164
10:43 <gibi> sorry
10:43 <gibi> wrong buffer
10:43 <gibi> https://github.com/openstack/neutron-lib/blob/f01b2e9025d33aeff3bf22ea2568bda036878819/neutron_lib/api/definitions/portbindings.py#L131
10:43 <sean-k-mooney[m]> so I think if you want to hardcode it, just filter out direct-physical, baremetal and accelerator-direct-physical
10:43 <gibi> so here are the vnic_types
10:44 <gibi> I don't see vhost-user as a vnic_type in that list
10:45 <sean-k-mooney[m]> oh sorry, maybe you're right; I have not looked at DPDK in 2 years or more
10:45 <sean-k-mooney[m]> I might be misremembering; let me check the ml2 driver, but I guess it's vnic_normal
10:46 <gibi> I think it is mapped to normal, yes
10:46 <gibi> anyhow, I think we are in agreement to have this filtering based on vnic_type
10:46 <gibi> I think I will amend the current patch with that
10:46 <sean-k-mooney[m]> ya, you are right, it is
10:47 <sean-k-mooney[m]> so we should just skip waiting then for *-physical and baremetal
10:48 <sean-k-mooney[m]> looking at the other vnic_types, I don't think vnic_type smartnic is used with ovs or odl
10:48 <gibi> OK, I will discuss this with the downstream folks too and see if they prefer a configurable vnic_type or are OK with a hardcode
10:48 <sean-k-mooney[m]> I think that is used by ironic
10:49 <sean-k-mooney[m]> ack
10:49 <gibi> yes, smartnic is ironic afaik
10:49 <gibi> so we can filter out that too
10:49 <sean-k-mooney[m]> ya, most likely
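The hardcoded alternative boils down to an exclusion set; a sketch under the assumption that the strings match neutron-lib's portbindings values (the constant and helper names are made up):

    # Never wait for plug events on port types known (or suspected here)
    # not to emit them; strings mirror neutron-lib portbindings values.
    NO_PLUG_EVENT_VNIC_TYPES = frozenset([
        'direct-physical',              # PF passthrough: agent can't see it
        'accelerator-direct-physical',  # cyborg-managed PF, same problem
        'baremetal',                    # ironic-only
        'smart-nic',                    # also ironic per the discussion
    ])

    def should_wait_for_plug(vif):
        return vif['vnic_type'] not in NO_PLUG_EVENT_VNIC_TYPES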
10:50 <sean-k-mooney[m]> you could make the config an exclude list and default to the set we know won't work
10:50 <sean-k-mooney[m]> actually no
10:50 <sean-k-mooney[m]> that would enable it by default, which we do not want
10:51 <sean-k-mooney[m]> ok, I'll be afk for 20 mins or so; chat to you later
10:52 <gibi> ack, thanks!
11:13 <frickler> kashyap: a couple more findings: a) no change with the Nehalem cpu settings patch from clarkb
11:13 <frickler> b) same issue with qemu-6.1 compiled from source
11:13 <kashyap> frickler: Hi
11:14 <frickler> c) the delta doesn't really increase with large flavors, i.e. with 512M or 1G, the cirros process still stays at 600M
11:15 <kashyap> frickler: The Nehalem thing here is not relevant (unless you're using CentOS 9).
11:15 <kashyap> frickler: Good to know that you've actually tested it with a QEMU compiled from source
11:15 <frickler> the latter is likely why this issue isn't more widely seen. it just affects CIs that try to start a larger number of small instances
11:15 <frickler> ... in a limited-memory environment
11:17 <kashyap> frickler: Right. So, this is TCG, which is not super well-tested upstream because a lot of folks use hardware accel. That said:
11:18 <kashyap> frickler: Can you please file an upstream QEMU bug here (do you have a GitLab account?) - https://gitlab.com/qemu-project/qemu/-/issues
11:19 <kashyap> frickler: That'll help me investigate the issue with a TCG dev.
11:21 <kashyap> Also, please include the bits you posted yesterday. (https://paste.opendev.org/raw/810150/)
11:22 <kashyap> frickler: I wonder if we can replicate this outside of OpenStack CI: like artificially triggering a script that'll start a ton of CirrOS instances?
11:23 <frickler> kashyap: I'll do the bug report, though likely not today; I'll let you know then
11:23 <kashyap> Thanks! Do mention the buggy version where you saw it first. And also the 6.1 compiled-from-source test.
11:23 <kashyap> It'll help with bisecting.
11:24 <frickler> kashyap: well, I replicated it with a local devstack deployment. I can also try to just create an instance with virt-manager
11:24 <kashyap> Yes, that'd be preferable, if possible.
11:25 <frickler> kashyap: ok, thx for your feedback so far
11:25 <kashyap> No problem. These TCG bugs (if it is indeed a bug) are hard to suss out.
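A rough out-of-CI reproducer along the lines kashyap suggests: boot a small CirrOS guest under pure TCG and sample the QEMU process's resident memory. The image path, guest size, and the 60-second boot wait are assumptions:

    # Boot one small TCG guest and print its VmRSS, to compare against
    # the ~600M figure reported above. Paths and sizes are assumptions.
    import subprocess
    import time

    CIRROS = 'cirros-0.5.2-x86_64-disk.img'  # assumed local image

    proc = subprocess.Popen([
        'qemu-system-x86_64',
        '-accel', 'tcg',            # force TCG; no KVM
        '-m', '128',                # small guest, as in the CI jobs
        '-snapshot', '-display', 'none',
        '-drive', 'file=%s,format=qcow2' % CIRROS,
    ])
    time.sleep(60)                  # crude wait for the guest to boot
    with open('/proc/%d/status' % proc.pid) as f:
        print(next(line.strip() for line in f
                   if line.startswith('VmRSS')))
    proc.terminate()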
11:37 <sean-k-mooney1> frickler: you are seeing this just with a normal boot, right
11:37 <sean-k-mooney1> you don't need to boot many VMs to trigger it
11:37 <sean-k-mooney1> you are seeing the large memory usage with just a single instance
11:37 <sean-k-mooney1> so this should not be hard to replicate, right
11:37 <kashyap> sean-k-mooney1: I don't think it's just one boot. He said "CIs that try to start a larger no. of small instances"
11:37 <sean-k-mooney1> frickler: out of interest, do you have swap available on these hosts
11:38 <kashyap> sean-k-mooney1: Good question ;-) The "ghost of swap"...
11:38 <sean-k-mooney1> kashyap: it's only an issue for CIs because those small instances that used to fit no longer do
11:38 <sean-k-mooney1> kashyap: when I first spoke to frickler about this I think they mentioned it happens for any small VM created
11:39 *** sean-k-mooney1 is now known as sean-k-mooney
11:39 <sean-k-mooney> i.e. in CI a VM that used to take, say, 128MB of RAM is now using 600, so we can't run 4 of them in parallel anymore
11:39 * kashyap nods (on both points)
11:40 <sean-k-mooney> so what I'm wondering is: in environments without swap, are we seeing more resident memory usage
11:40 <frickler> sean-k-mooney: yes, I see this with a single instance. the CI example is just where we noticed it first, with jobs OOMing while tempest runs parallel tests
11:41 <sean-k-mooney> frickler: we had a very weird customer issue where we saw a python process with very large resident memory usage when no swap was present, but when swap was allocated it did not have high memory usage and also did not use any swap
11:42 <sean-k-mooney> it was as if having swap available stopped the memory allocator preallocating the memory
11:42 <frickler> hmm, indeed I have no swap on my test host. but we do have swap enabled on CI instances
11:43 <sean-k-mooney> ah ok, I was going to say: could you add a 1G swap file temporarily to your devstack and see if it changes behavior
11:43 <sean-k-mooney> if it's in the CI then there's no point; it's not related
12:07 *** tosky_ is now known as tosky
12:49 <bauzas> (late) reminder: final PTG day for nova sessions starting in 12 mins at https://www.openstack.org/ptg/rooms/newton
12:57 <sean-k-mooney> dansmith: for the health check, you want me to support HTTP over TCP rather than just a raw TCP socket, right? Assuming yes, I'm wondering whether it makes sense to make this a real WSGI application or to just use https://eventlet.net/doc/modules/wsgi.html#eventlet.wsgi.server to call a binary-specific health check function
12:58 <dansmith> sean-k-mooney: yes, http, and I'd keep it uber simple (so the latter)
12:58 <sean-k-mooney> ok
13:00 <sean-k-mooney> brb, just going to make a coffee and I'll join
13:02 <bauzas> nova session started
13:03 <dansmith> sean-k-mooney: out of curiosity, does haproxy support some sort of bare tcp socket health check?
13:03 <dansmith> I would kinda expect not
13:03 <dansmith> I thought even systemd wanted http, but it can use a script too
13:10 <sean-k-mooney> dansmith: I think it can, but haproxy was not part of my original use cases
13:10 <dansmith> okay
13:11 <sean-k-mooney> I was originally thinking of this as a command/control interface with command objects exchanged, more like the RPC bus
13:11 <sean-k-mooney> with nc or nova-manage as the CLI
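What the "keep it uber simple" option might look like: a hedged sketch using eventlet's built-in WSGI server, where check_health(), the URL path, and the port are all invented for illustration:

    # Minimal per-binary health check served over HTTP via eventlet.wsgi.
    # check_health(), the path, and the port are illustrative only.
    import eventlet
    import eventlet.wsgi

    def check_health():
        # A real check would verify the service's internal state
        # (RPC connectivity, DB access, etc.).
        return True

    def app(environ, start_response):
        if environ['PATH_INFO'] == '/healthcheck' and check_health():
            start_response('200 OK', [('Content-Type', 'text/plain')])
            return [b'OK']
        start_response('503 Service Unavailable',
                       [('Content-Type', 'text/plain')])
        return [b'UNHEALTHY']

    if __name__ == '__main__':
        # Bind to localhost; haproxy or systemd can poll this endpoint.
        eventlet.wsgi.server(eventlet.listen(('127.0.0.1', 9090)), app)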
13:12 <gibi> sean-k-mooney: fyi, the sriov agent only unreliably sends vif plug events for VFs, as it polls the hypervisor. If the unplug/plug is fast enough then the agent might miss the state when the device was down
13:12 <sean-k-mooney> gibi: more fun
13:12 <sean-k-mooney> ok
13:12 <gibi> it is fun all the way down :D
13:13 <sean-k-mooney> so we might want to only wait for vnic_type=normal then
13:13 <gibi> it seems so
13:13 <sean-k-mooney> long term we really do need to fix this interface and enforce a stricter contract
13:14 <sean-k-mooney> rather than guessing, because there are so many factors to consider
13:14 <gibi> yes, we need neutron to either enforce that events are always sent, or declare in the port what events can be expected. There is no way nova can maintain a sane mapping alone
13:25 <bauzas> dmitriis: saw the chat?
13:26 <dmitriis> bauzas: looking, 1 sec
13:26 <bauzas> dmitriis: I'm about to propose postponing your topic to after 3pm UTC
13:27 <dmitriis> bauzas: got a conflict at 3PM but I can make it work
13:27 <bauzas> dmitriis: maybe later then?
13:27 <bauzas> the idea is just to avoid discussing your stuff *before* :)
13:28 <dmitriis> bauzas: let's do it at 3PM, later is more complicated :^)
13:28 <bauzas> ack, moving your topic then :)
13:28 <dmitriis> bauzas: ack, ty for pinging
13:33 <stephenfin> Is it just me or is tbarron's sound clipping real bad? I can understand him though (i.e. it can be fixed later)
13:38 <tbarron> stephenfin: It may be from my end, sorry. Yesterday zoom was having trouble with my rural location.
13:39 <stephenfin> tbarron: nw, I was just concerned something was broken on my end :) I could understand everything just fine
13:39 <tbarron> cool
13:43 <opendevreview> Stephen Finucane proposed openstack/nova master: db: Remove models that were moved to the API database  https://review.opendev.org/c/openstack/nova/+/812149
13:43 <opendevreview> Stephen Finucane proposed openstack/nova master: db: Remove models for removed services, features  https://review.opendev.org/c/openstack/nova/+/812150
13:43 <opendevreview> Stephen Finucane proposed openstack/nova master: db: Remove nova-network models  https://review.opendev.org/c/openstack/nova/+/812151
13:45 <opendevreview> Balazs Gibizer proposed openstack/nova stable/pike: Add a WA flag waiting for vif-plugged event during reboot  https://review.opendev.org/c/openstack/nova/+/813437
13:56 <kashyap> clarkb: I'm inundated with a few things; I will reply to your email on the list on Monday. Hope that's okay
15:15 <clarkb> kashyap: yup, no worries
15:16 <clarkb> have a good weekend
17:04 <bauzas> dansmith: dunno if you fancy joining us, but we're discussing a topic where your knowledge could be helpful: instance v3.0 object bump
17:04 <dansmith> sorry, I'm tied up
17:05 <bauzas> or you're stuck in the TC meeting
17:05 <bauzas> heh, no worries
17:05 <dansmith> I definitely have opinions
17:05 <bauzas> dansmith: notes on the etherpad would be appreciated tho :)
17:05 <bauzas> https://etherpad.opendev.org/p/nova-yoga-ptg L646
17:26 <sean-k-mooney> https://etherpad.opendev.org/p/nova-yoga-ptg-backup
17:26 <sean-k-mooney> no colors, but there is a snapshot ^
17:28 <clarkb> I just restored it to the version that bauzas identified as good (the original etherpad url I mean)
17:28 <bauzas> yeah \o/
17:28 <sean-k-mooney> yep, that looks ok
17:28 <bauzas> I just wrote our last conclusions
17:28 <bauzas> we can call it a wrap
17:28 <sean-k-mooney> clarkb++ thanks
17:28 <bauzas> sean-k-mooney: just copy it again, so we have a backup
17:29 * bauzas doesn't wanna implore the infra team for the 3rd time (on the same etherpad) :D
17:29 <sean-k-mooney> I just did a plain-text export from the timeline and imported it again in a different pad
17:29 <bauzas> yeah, that works
17:29 <sean-k-mooney> but I'll try that in HTML form and see if it can keep colours
17:29 <bauzas> we have the highlights with the history
17:29 <bauzas> so nothing is technically lost
17:30 <sean-k-mooney> in etherpad format it goes to the latest point in history, not the one you have selected in the timeline
17:31 <sean-k-mooney> ya, so HTML format does not keep colors either
17:31 <sean-k-mooney> but we have the backup in any case
17:31 <sean-k-mooney> and the original is restored, so we are good
17:32 <bauzas> yup
17:32 <bauzas> on that note, /me calls it a week
17:32 <bauzas> \o
17:32 <gibi> me too
17:32 <gibi> o/
17:33 <bauzas> I'm just sad to hear that the next PTG will still be virtual, but that's life
17:33 <bauzas> I'm just exhausted and I miss our whiteboards and hallway discussions
17:34 <bauzas> but I'll open a beer to enjoy the last day of the PTG as if it was physical :)
19:22 <mnaser_> hm
19:22 <mnaser_> would anyone have an idea as to why the metadata service seems to be stalling / very slow
19:23 <mnaser_> 2021-10-22 19:23:13.422 11 INFO nova.metadata.wsgi.server [req-f65469d1-4082-4b9e-ab3b-3c710c1f0366 - - - - -] 10.30.107.187,10.101.2.141 "GET /latest/meta-data/block-device-mapping/root HTTP/1.1" status: 200 len: 148 time: 25.8355188
19:23 <mnaser_> it almost feels like all 'requests' stall for a bit and then they all burst out at once
19:23 <mnaser_> memcache is fine, mysql is fine
19:24 <mnaser_> plenty of conductors
19:25 <mnaser_> 2021-10-22 19:25:15.880 13 INFO nova.metadata.wsgi.server [-] 192.168.1.5,10.101.2.142 "GET /openstack HTTP/1.1" status: 200 len: 235 time: 5.0295317
19:25 <mnaser_> especially this, it seems very 5s-y
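One way to confirm the ~5s clustering from inside a guest: poll the endpoint in a loop and print per-request latency. The metadata IP is the conventional link-local address; the loop parameters are arbitrary:

    # Sample metadata latency repeatedly to see whether response times
    # cluster around multiples of ~5s. Run from inside a guest.
    import time
    import urllib.request

    URL = 'http://169.254.169.254/openstack'

    for _ in range(20):
        start = time.monotonic()
        urllib.request.urlopen(URL, timeout=60).read()
        print('%.2fs' % (time.monotonic() - start))
        time.sleep(1)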
19:46 <clarkb> mnaser_: I've noticed that on a couple of devstack jobs recently too, but haven't had a chance to dig into it beyond noting that was the error returned by tempest
19:47 <clarkb> specifically, some tests say the metadata service fails to respond in time and the test times out
20:44 <prometheanfire> nova fails with the new oslo-concurrency 4.5.0 https://zuul.opendev.org/t/openstack/build/9a385f3324fb46a3abf6257a09020d38
20:45 <prometheanfire> looks like an extra value was used (blocking)?
