Wednesday, 2021-01-13

*** tosky has quit IRC00:01
*** macz_ has quit IRC00:15
*** bbowen has quit IRC00:36
*** bbowen has joined #openstack-nova00:37
*** LinPeiWen has joined #openstack-nova00:40
*** mlavalle has quit IRC00:47
*** kevinz has joined #openstack-nova01:03
*** LinPeiWen has quit IRC01:08
*** LinPeiWen94 has joined #openstack-nova01:19
openstackgerritBrin Zhang proposed openstack/nova master: Replaces tenant_id with project_id from List/Update Servers APIs  https://review.opendev.org/c/openstack/nova/+/76429201:28
openstackgerritBrin Zhang proposed openstack/nova master: Replace all_tenants with all_projects in List Server APIs  https://review.opendev.org/c/openstack/nova/+/76531101:28
*** dklyle has quit IRC01:40
*** tkajinam has quit IRC01:41
*** tkajinam has joined #openstack-nova01:42
*** dklyle has joined #openstack-nova01:48
*** chengsheng1 is now known as chengsheng02:00
*** tinwood has quit IRC02:08
*** tkajinam has quit IRC02:09
*** tkajinam has joined #openstack-nova02:10
*** tinwood has joined #openstack-nova02:11
*** macz_ has joined #openstack-nova02:16
*** zenkuro has quit IRC02:18
*** macz_ has quit IRC02:21
*** ccstone has quit IRC02:26
*** ccstone has joined #openstack-nova02:26
*** zzzeek has quit IRC02:30
*** spatel has joined #openstack-nova02:31
*** zzzeek has joined #openstack-nova02:31
openstackgerritBrin Zhang proposed openstack/nova master: Replaces tenant_id with project_id from Rebuild Server API  https://review.opendev.org/c/openstack/nova/+/76638002:35
*** zzzeek has quit IRC02:36
*** zzzeek has joined #openstack-nova02:40
*** dklyle has quit IRC02:45
*** hamalq has quit IRC02:56
*** rcernin has quit IRC02:57
*** sapd1 has joined #openstack-nova02:58
*** sapd1 has quit IRC03:03
*** mkrai has joined #openstack-nova03:04
*** rcernin has joined #openstack-nova03:18
*** rcernin has quit IRC03:21
*** rcernin has joined #openstack-nova03:21
*** psachin has joined #openstack-nova03:33
*** sapd1 has joined #openstack-nova03:36
*** swp20 has quit IRC03:44
*** swp20 has joined #openstack-nova03:45
*** sapd1 has quit IRC04:00
*** sapd1 has joined #openstack-nova04:09
*** rcernin has quit IRC04:35
*** rcernin has joined #openstack-nova04:35
*** ratailor has joined #openstack-nova04:52
openstackgerritsean mooney proposed openstack/os-traits master: add vdpa trait  https://review.opendev.org/c/openstack/os-traits/+/77053004:57
*** sapd1 has quit IRC05:01
*** gyee has quit IRC05:08
*** vishalmanchanda has joined #openstack-nova05:11
openstackgerritsean mooney proposed openstack/os-traits master: add vdpa trait  https://review.opendev.org/c/openstack/os-traits/+/77053005:12
openstackgerritsean mooney proposed openstack/nova master: [WIP] add vdpa nodedev parsing and interface config gen  https://review.opendev.org/c/openstack/nova/+/77053205:14
openstackgerritsean mooney proposed openstack/nova master: [WIP] add vdpa trait reporting.  https://review.opendev.org/c/openstack/nova/+/77053305:14
openstackgerritsean mooney proposed openstack/nova master: add constants for vnic type vdpa  https://review.opendev.org/c/openstack/nova/+/77047405:21
openstackgerritsean mooney proposed openstack/nova master: [WIP] add vdpa nodedev parsing and interface config gen  https://review.opendev.org/c/openstack/nova/+/77053205:21
openstackgerritsean mooney proposed openstack/nova master: [WIP] add vdpa trait reporting.  https://review.opendev.org/c/openstack/nova/+/77053305:21
*** alex_xu has joined #openstack-nova05:26
*** rcernin_ has joined #openstack-nova05:42
*** rcernin has quit IRC05:42
openstackgerritsean mooney proposed openstack/nova master: [WIP] add vdpa prefilter  https://review.opendev.org/c/openstack/nova/+/77053405:47
*** sapd1 has joined #openstack-nova05:55
*** spatel has quit IRC05:58
*** hemanth_n has joined #openstack-nova06:47
*** mkrai has quit IRC06:54
*** mkrai has joined #openstack-nova06:54
*** mkrai has quit IRC07:13
*** ralonsoh has joined #openstack-nova07:19
*** zzzeek has quit IRC07:28
*** rcernin_ has quit IRC07:28
*** zzzeek has joined #openstack-nova07:31
*** openstackgerrit has quit IRC07:47
*** mkrai has joined #openstack-nova07:52
*** nightmare_unreal has joined #openstack-nova07:53
gibigood morning07:55
*** slaweq has joined #openstack-nova07:59
*** slaweq has quit IRC08:04
*** rcernin_ has joined #openstack-nova08:06
*** slaweq has joined #openstack-nova08:10
*** andrewbonney has joined #openstack-nova08:13
*** mkrai has quit IRC08:13
*** tesseract has joined #openstack-nova08:17
*** rpittau|afk is now known as rpittau08:25
*** rcernin_ has quit IRC08:26
*** tosky has joined #openstack-nova08:39
*** mkrai has joined #openstack-nova08:44
lyarwoodMorning08:48
gibilyarwood: melwitt explained one of my questions in the detach patch, so I have things to do with that patch, but if you have any other hints about the open questions then I would be glad to discuss08:49
lyarwoodgibi: I've just got the change open now, let me take a look08:51
gibicool08:51
gibiI promise I will not disappear now for a couple of hours :)08:52
*** songwenping_ has joined #openstack-nova09:12
lyarwoodgibi: okay updated, I need to check if there's an internal libvirt timeout for these detach events09:12
*** swp20 has quit IRC09:14
lyarwoodgibi: ah nope, it's raised on a sync failure, there's no async checking within libvirtd that raises it09:15
lyarwoodI didn't post my comments anyway, doh!09:16
*** dasp_ has quit IRC09:16
*** dasp has joined #openstack-nova09:17
gibilyarwood: thanks09:18
gibilyarwood: yeah, the persistent/live error comes synchronously09:19
gibilyarwood: do you happen to know, when we check that the device is in the domain, does that check look into the live domain?09:20
lyarwoodgibi: iirc we use XMLDesc(0) to dump the domain and that's the live config09:23
lyarwoodgibi: there was a bug about this for paused instances iirc09:23
lyarwoodgibi: where we need to provide the VIR_DOMAIN_XML_INACTIVE flag https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainXMLFlags09:24
gibilyarwood: thanks, so we check the live config, that's good; then if the sync error comes we can simply check the live domain and if the device is there then we can retry09:27
lyarwoodgibi: yeah I'd continue to retry on a direct sync error if the device is still there, VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED (but that should be a direct sync failure?) and a configurable timeout within n-cpu09:29
gibiVIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED is the failed event09:29
gibiso that is async09:29
gibiI can unify the retry: if we get a sync or async failure and the device is still in the live domain then we retry09:30
lyarwoodright sorry my point was that within libvirt at least it looks like that's only actually raised synchronously with the failure of the initial request to QEMU and that should bubble up directly to our call to libvirt09:30
lyarwoodyup cool that works09:31
lyarwoodI'm likely missing something in the libvirt code anyway regarding where VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED is being raised so that sounds like the best approach09:31
gibiack, thanks for the help09:32
gibiif you get the libvirt timeout value for detach event then let me know and I will update the nova timeout value to be bigger09:32
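The retry approach gibi and lyarwood settle on above can be sketched roughly as follows. The `FakeDomain` and `DetachError` classes here are hypothetical stand-ins for the libvirt-python objects, so only the control flow (retry while the device is still visible in the live config) reflects the discussion:

```python
class DetachError(Exception):
    """Stand-in for the error libvirt raises on a synchronous detach failure."""


class FakeDomain:
    """Minimal stand-in for a libvirt domain: detach fails twice, then works."""

    def __init__(self):
        self._devices = {"vdb"}
        self._failures_left = 2

    def detach_device(self, dev):
        if self._failures_left > 0:
            self._failures_left -= 1
            raise DetachError("device busy")  # sync failure from the hypervisor
        self._devices.discard(dev)

    def device_in_live_config(self, dev):
        # A real implementation would dump the live domain XML (without the
        # VIR_DOMAIN_XML_INACTIVE flag) and search it for the device.
        return dev in self._devices


def detach_with_retry(dom, dev, max_attempts=5):
    """Retry detach while the device is still present in the live config."""
    for attempt in range(1, max_attempts + 1):
        try:
            dom.detach_device(dev)
        except DetachError:
            if dom.device_in_live_config(dev):
                continue  # still attached: retry (a real impl would sleep/wait)
        if not dom.device_in_live_config(dev):
            return attempt  # detach confirmed against the live domain
    raise DetachError("gave up after %d attempts" % max_attempts)


attempts = detach_with_retry(FakeDomain(), "vdb")
print(attempts)  # the fake fails twice, so the detach succeeds on attempt 3
```

A real version would also honour a configurable timeout in n-cpu, as lyarwood suggests, rather than a fixed attempt count.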
lyarwoodkashyap: https://review.opendev.org/c/openstack/nova/+/770246 ; you might be interested in this, gibi is trying to rewrite our detach device logic in the libvirt driver to use events. I've made some comments in the change but if you have anymore context feel free to add it there.09:35
*** derekh has joined #openstack-nova09:35
* lyarwood will try to join the libvirt channel again later today and ask about the behaviour of the events when we fail to detach09:35
kashyaplyarwood: Yeah, was just skimming the chat here.  Was responding to something downstream that was breathing down my neck09:35
lyarwoodnp09:35
lyarwoodswitching topics, stephenfin how's your SQL/sqlalchemy foo? trying to work out if 1. the following is a valid query for a nova-status command and 2. if it would work in sqlalchemy.09:37
lyarwoodselect distinct instances.uuid from instances left join instance_system_metadata on instances.uuid = instance_system_metadata.instance_uuid where instances.uuid not in (select instance_system_metadata.instance_uuid from instance_system_metadata where instance_system_metadata.key = 'hw_machine_type');09:37
lyarwoodtl;dr I'm trying to list the instance uuids that *don't* have a `hw_machine_type` key set in instance_system_metadata09:38
lyarwoodand it has been waaaaaaaaaaaay too long since I wrote any SQL so this might be entirely wrong09:38
*** songwenping_ has quit IRC09:39
*** songwenping_ has joined #openstack-nova09:39
kashyapgibi: Thx for taking up that; I just skimmed the patch.  I'll look deeper; once I switch context.09:40
gibikashyap: thanks09:41
lyarwoodoh and that reminds me, sean-k-mooney, you know how you asked if we could stash image metadata properties in instance_system_metadata? Well they are already there.09:41
stephenfinlyarwood: It's not my strongest skill, but that does look reasonable to me. I don't think the subquery is necessary, but the syntax I'm thinking of could be backend-specific09:42
lyarwoodsean-k-mooney: https://github.com/openstack/nova/blob/e6f5e814050a19d6f027037424556b2889514ec3/nova/objects/image_meta.py#L113-L12709:42
lyarwoodstephenfin: yeah I couldn't work out the SQL to select instances.uuid where instance_system_metadata.key doesn't contain 'hw_machine_type'09:43
*** zenkuro has joined #openstack-nova09:43
*** hoonetorg has joined #openstack-nova09:44
lyarwoodstephenfin: I'll convert this into sqla for now and go from there, thanks09:45
stephenfinlyarwood: 0c441e636ba9d287909584b6ddf15eab5d479f0e would be good prior art also09:47
stephenfinIf not an exact match, at least it might help in terms of wiring up the machinery for an online migration09:48
lyarwoodstephenfin: I wasn't going to write an online migration for this09:49
lyarwoodstephenfin: this is something n-cpu will populate at startup09:49
lyarwoodstephenfin: and nova-status can warn about later prior to changing defaults09:50
stephenfinah, gotcha09:50
lyarwoodstephenfin: I don't see a query in that change FWIW09:50
lyarwoodwell not a join etc09:50
*** openstackgerrit has joined #openstack-nova10:28
openstackgerritYumengBao proposed openstack/os-traits master: add owner traits for accelerator resources  https://review.opendev.org/c/openstack/os-traits/+/77056910:28
*** tesseract has quit IRC10:31
openstackgerritBrin Zhang proposed openstack/nova master: Replaces tenant_id with project_id from List SG API  https://review.opendev.org/c/openstack/nova/+/76672610:32
*** tesseract has joined #openstack-nova10:33
gibibrinzhang, alex_xu: responded in https://review.opendev.org/c/openstack/nova/+/729563 (finally)10:34
openstackgerritKashyap Chamarthy proposed openstack/os-traits master: Add a trait for UEFI Secure Boot support  https://review.opendev.org/c/openstack/os-traits/+/77057010:34
*** ociuhandu has joined #openstack-nova10:40
*** dtantsur|afk is now known as dtantsur10:41
openstackgerritStephen Finucane proposed openstack/python-novaclient master: Add support for microversion v2.88  https://review.opendev.org/c/openstack/python-novaclient/+/77057310:42
brinzhanggibi: so we won't merge this patch, right?10:49
*** hemanth_n has quit IRC11:01
*** songwenping__ has joined #openstack-nova11:02
*** songwenping_ has quit IRC11:05
*** mkrai has quit IRC11:26
*** ociuhandu has quit IRC11:36
*** zenkuro has quit IRC11:40
*** zenkuro has joined #openstack-nova11:41
gibibrinzhang: we need a separate bugfix, that is all I said11:46
brinzhangIMO, the bug fix should not prevent this patch going11:48
brinzhangwe should register a bugfix, and then fix it11:48
gibibrinzhang: yepp, that works for me11:51
brinzhanggibi: thanks, I hope we can make this patch merge, it also meets alex_xu's meaning11:53
*** sapd1 has quit IRC12:01
*** mgariepy has quit IRC12:04
*** raildo has joined #openstack-nova12:16
*** ratailor has quit IRC12:26
sean-k-mooneylyarwood: i know the image metadata is in the instance_system_metadata table12:26
sean-k-mooneylyarwood: thats why i wanted you to set the value there12:26
sean-k-mooneythey are just prefixed with img_12:27
lyarwoodsean-k-mooney: ah I thought you were also suggesting that we dump all of the image metadata props in there as well12:27
lyarwoodsean-k-mooney: I just missed that they were there already, prefixed by image_ as you said12:28
sean-k-mooneyyeah so i was suggesting setting the effective values of all image props there instead of just the set values12:28
sean-k-mooneye.g. if you dont have hw_vif_model set today it will normally default to virtio12:29
lyarwoodah right12:29
sean-k-mooneyso we woudl store img_hw_vif_model=virtio12:29
sean-k-mooneyor whatever it is12:29
sean-k-mooneyas if it had been set12:29
lyarwoodFWIW I'm not overwriting image_hw_machine_type at the moment12:30
lyarwoodI'm just dumping it into hw_machine_type12:30
lyarwoodimage_hw_machine_type just remains on the original value12:30
sean-k-mooneyya you could do that, the only issue with that approach is you have to now check both in the code12:30
sean-k-mooneywell if image_hw_machine_type was set then you would not be setting hw_machine_type12:31
sean-k-mooneysince you only need to set that if the machine type is not set in the image12:31
lyarwoodit's just copied from image_meta in that case12:31
sean-k-mooneyyep12:32
sean-k-mooneyi was trying to avoid having two sources of truth12:32
sean-k-mooneye.g. image_hw_machine_type and hw_machine_type12:32
lyarwoodimage_hw_machine_type is just the original, hw_machine_type is the single source of truth from now on12:33
lyarwoodas we can change it over time etc12:33
sean-k-mooneywell no we cant thats the point12:33
sean-k-mooneyif its set in the image it cant be changed12:33
lyarwoodyou can through the versioned machine types12:33
sean-k-mooneyno if its set in the image thats it we dont use the config values at all12:34
lyarwoodwhy? moving forward through the versioned machine types provides a stable ABI etc12:34
sean-k-mooneyit would break backwards compatibility with the existing usage12:35
lyarwoodhow?12:35
kashyapYes to what lyarwood said on versioned machine types12:35
kashyapsean-k-mooney: When talking of this topic, a clear example would make sure we're not talking of different things.12:35
sean-k-mooneythe existing usage is that if you set a versioned machine type in the image metadata it will have that version for the lifetime of the instance12:35
sean-k-mooneyif you set the unversioned one then it will use the latest version on the host it spawns on12:35
sean-k-mooneywe should not be changing that behavior in your spec12:36
lyarwoodthe only part I'm changing is that instead of being for the lifetime of the instance operators can now update the versioned machine type12:36
lyarwoodaliases from the image would stay, I don't switch them out for the versioned machine types etc12:37
sean-k-mooneylyarwood: that was not part of the spec12:37
sean-k-mooneywe did not provide any mechanism to update the machine type over the instance lifetime12:38
sean-k-mooneythat is what the recreate api would provide12:38
lyarwoodthat's between types, why would we ask users to rebuild for a version update?12:39
sean-k-mooneyoperators could always update the versioned machine type by updating the config for instances that don't have hw_machine_type set12:39
lyarwoodand I'm pretty sure it's in the spec12:39
sean-k-mooneylyarwood: recreate with the same image and flavor was meant to just update the metadata, it's not the same as rebuild in that case12:40
lyarwoodsean-k-mooney: ah sorry you're talking about an API that doesn't exist :)12:40
sean-k-mooneyyes the part that was deferred/rejected at the PTG, meaning we had no agreed way to update the machine type12:41
lyarwoodsean-k-mooney: and I agree that would be nicer and would mean we wouldn't need a nova-manage command for this12:41
* lyarwood is being called for lunch, brb12:41
sean-k-mooneyya so in the spec the only way to change the machine type is via the nova-manage command12:43
openstackgerritVlad Gusev proposed openstack/nova stable/victoria: Use subqueryload() instead of joinedload() for (system_)metadata  https://review.opendev.org/c/openstack/nova/+/76180912:43
sean-k-mooneybut the image metadata still has precedence12:43
sean-k-mooneyso it only matters for vms without that12:43
*** ociuhandu has joined #openstack-nova12:45
*** spatel has joined #openstack-nova12:47
openstackgerritBrin Zhang proposed openstack/nova master: Replaces tenant_id with project_id from Flavor Access APIs  https://review.opendev.org/c/openstack/nova/+/76770412:51
*** spatel has quit IRC12:52
*** bbowen has quit IRC12:53
brinzhangbauzas: do you have time to check my response to your comments in https://review.opendev.org/c/openstack/nova/+/729563, there is a -1 on it12:54
brinzhangbauzas: it may take some of your time, but thanks anyway12:55
*** ociuhandu has quit IRC12:55
*** ociuhandu has joined #openstack-nova12:56
*** ociuhandu has quit IRC12:56
*** vishalmanchanda has quit IRC13:01
*** ociuhandu has joined #openstack-nova13:01
*** brinzhang has quit IRC13:11
*** brinzhang has joined #openstack-nova13:11
*** mgariepy has joined #openstack-nova13:21
*** bbowen has joined #openstack-nova13:21
bauzasbrinzhang: ack, will look at your replies then13:31
*** whoami-rajat__ has joined #openstack-nova13:46
*** links has joined #openstack-nova13:47
*** links has quit IRC13:47
*** nweinber has joined #openstack-nova13:55
*** nweinber has quit IRC14:01
*** liuyulong has joined #openstack-nova14:01
*** nweinber has joined #openstack-nova14:02
*** liuyulong has quit IRC14:03
*** jmlowe has joined #openstack-nova14:26
*** vishalmanchanda has joined #openstack-nova14:35
*** mkrai has joined #openstack-nova14:40
openstackgerritTakashi Natsume proposed openstack/python-novaclient master: Deprecate agent commands and APIs  https://review.opendev.org/c/openstack/python-novaclient/+/76906814:44
*** belmoreira has joined #openstack-nova14:53
*** ociuhandu_ has joined #openstack-nova14:58
*** zenkuro has quit IRC15:02
*** ociuhandu has quit IRC15:02
*** zenkuro has joined #openstack-nova15:03
*** macz_ has joined #openstack-nova15:13
*** macz_ has quit IRC15:17
*** psachin has quit IRC15:28
*** ociuhandu_ has quit IRC15:35
*** ociuhandu has joined #openstack-nova15:35
*** dklyle has joined #openstack-nova15:46
*** sapd1 has joined #openstack-nova15:59
*** macz_ has joined #openstack-nova16:02
*** mkrai has quit IRC16:03
*** mkrai_ has joined #openstack-nova16:03
*** rnoriega_ is now known as rnoriega12316:06
*** rnoriega123 is now known as rnoriega_16:06
*** rnoriega_ is now known as rnoriega16:09
*** ociuhandu_ has joined #openstack-nova16:11
*** ociuhandu has quit IRC16:11
melwittstephenfin, artom: would like to have your numa expert review on this patch (and the func test below it) please if you could spare some time this week https://review.opendev.org/c/openstack/nova/+/76961416:19
artommelwitt, will take a look tomorrow16:19
*** mgariepy has quit IRC16:20
melwittthanks!16:20
artomHopefully before then, actually16:20
*** mkrai_ has quit IRC16:22
openstackgerritLee Yarwood proposed openstack/nova master: WIP libvirt: Record the machine_type of instances in system_metadata  https://review.opendev.org/c/openstack/nova/+/76753316:29
openstackgerritLee Yarwood proposed openstack/nova master: WIP nova-manage: Add commands for managing instance machine type  https://review.opendev.org/c/openstack/nova/+/76954816:29
openstackgerritLee Yarwood proposed openstack/nova master: WIP nova-status: Add hw_machine_type check for libvirt instances  https://review.opendev.org/c/openstack/nova/+/77064316:29
*** slaweq has quit IRC16:34
*** slaweq has joined #openstack-nova16:36
*** adrianc has quit IRC16:41
*** tosky has quit IRC16:41
*** adrianc has joined #openstack-nova16:42
*** tosky has joined #openstack-nova16:42
*** tesseract has quit IRC16:54
sean-k-mooneyartom: i haven't made any changes yet but i responded to your questions in https://review.opendev.org/c/openstack/nova-specs/+/764999/2/specs/wallaby/approved/libvirt-vdpa-support.rst can you re-review16:56
sean-k-mooneyi have some WIP patches up as well https://review.opendev.org/q/topic:%22vhost-vdpa%22+(status:open%20OR%20status:merged)16:57
openstackgerritBalazs Gibizer proposed openstack/nova master: DNM try to replace retry with libvirt event in detach  https://review.opendev.org/c/openstack/nova/+/77024616:57
sean-k-mooneyits not fully complete as i still need to extend the pci tracker and where we create the pci request16:57
sean-k-mooneyand i need to test it locally and write tests and docs16:58
sean-k-mooneybut it has the outline of about 2 thirds of the code16:58
artomsean-k-mooney, cool - I want to un-WIP my socket affinity spec, then I'll take a look16:59
*** gyee has joined #openstack-nova17:00
*** ociuhandu_ has quit IRC17:06
*** markguz_ has joined #openstack-nova17:08
*** ociuhandu has joined #openstack-nova17:09
markguz_Hi nova folks. The good people at #openstack-ironic thought it might be a good idea for me to ask about my problem here.17:10
markguz_I've got an issue where my ironic instance spawning is getting stuck for +/- 10mins just after i start deployment.  The nova-scheduler picks the correct compute node and then the nova-compute node reports "Starting instance... _do_build_and_run_instance"17:12
markguz_then nothing happens for about 10 mins. then suddenly the process starts and deployment continues..17:12
*** ociuhandu_ has joined #openstack-nova17:13
markguz_TheJulia over at ironic thinks that the process is getting stuck at the scheduling stage.  The vm reports Building, but the task status sits at "none" during that 10mins17:13
markguz_For VMs there is no delay, only for BMs.  This is Rocky, and it is not a busy deployment.  Very little activity going on. We can go days without spawning bm or vm instances17:14
markguz_I've been trying to dig around the code to see what happens when the compute node reports "Starting instance... _do_build_and_run_instance" but it's very hard to follow17:15
markguz_If anyone could help me follow the rabbit through the rabbit hole and see exactly what is happening, I'd be most appreciative17:16
*** ociuhandu has quit IRC17:16
*** ociuhandu_ has quit IRC17:17
*** mgariepy has joined #openstack-nova17:20
*** ociuhandu has joined #openstack-nova17:22
melwittmarkguz_: if it's landed on the compute node already, we don't consider that to be "scheduling" as it's already been scheduled/placed. but to dig into this further you'll want to trace the request id of the line that says "Starting instance" in the nova-compute log and see if you can see where it stops making progress. that would help17:23
sean-k-mooneyif this is ironic by they way whe n we get do do_build_and_run_instance at some point the compute manager will hand of to the ironic driver which will call ironic to provision the node17:25
melwittright. would want to look and see if he can verify it's gotten to that point17:26
sean-k-mooneyhttps://opendev.org/openstack/nova/src/branch/master/nova/compute/manager.py#L2186 starting to build instance is right at the top17:26
*** ociuhandu has quit IRC17:27
sean-k-mooneythe we save the task state at None and vmstate building17:27
melwittyeah, I know. just saying he can trace the request id to see how far it gets17:27
melwittafter that17:27
markguz_here's a grep of the req id for an instance out of the logs http://paste.openstack.org/show/801601/17:29
markguz_nothing between 9.40 and 10.0517:29
sean-k-mooneydo you have the concurnet build limit set17:31
stephenfinmelwitt: Comments left17:31
stephenfinsean-k-mooney: ^17:31
sean-k-mooneyfor the compute service17:31
sean-k-mooneyit defaults to 10 i believe17:31
*** zenkuro has quit IRC17:31
TheJuliamarkguz_: something between scheduling and initial network setup :\17:31
sean-k-mooneyLock "compute_resources" acquired by "nova.compute.resource_tracker.instance_claim" :: waited 1495.269s17:31
sean-k-mooneyit looks like it was just waiting on the RT lock17:32
melwittyep, waited 24 min for the lock17:32
melwittthank you stephenfin17:32
markguz_why would the lock take 25mins?17:33
sean-k-mooneythe compute service is likely busy starting up17:34
sean-k-mooneyyou mentioned this is only after the initial start right17:34
sean-k-mooneyor is this for each spawn17:34
lyarwoodjust grep for the compute_resources lock and see what was holding it before?17:34
sean-k-mooneyya that too17:35
sean-k-mooneyyou could see how many other instances got the lock in that interval17:35
markguz_sean-k-mooney: it's every spawn. usually 10mins, sometimes longer and sometimes shorter17:36
lyarwooddoes the resource tracker make external API calls in the Ironic driver?17:36
sean-k-mooneythe RT is shared so i dont think so17:38
lyarwoodyeah sorry I mean the code that refreshes it within the driver17:39
sean-k-mooneyim pretty sure this is the lock in question https://opendev.org/openstack/nova/src/branch/stable/rocky/nova/compute/manager.py#L2221-L222217:40
melwittmarkguz_: I think you might be hitting https://bugs.launchpad.net/nova/+bug/186412217:41
openstackLaunchpad bug 1864122 in OpenStack Compute (nova) "Instances (bare metal) queue for 30-60 seconds when managing a large amount of Ironic nodes" [Medium,Fix released] - Assigned to Jason Anderson (jasonandersonatuchicago)17:41
sean-k-mooneyyep that was what grabbed the lock https://opendev.org/openstack/nova/src/branch/stable/rocky/nova/compute/resource_tracker.py#L159-L16017:41
markguz_i have +/- 240 nodes17:42
markguz_is that a large amount?17:42
sean-k-mooneymelwitt: yep a race on the lock with the update periodic task seems likely17:42
melwittif you read the bug it says can be seen around > 100 nodes17:42
sean-k-mooneymarkguz_: how many ironic compute services do you have17:43
melwittthat fix is available in ussuri and onward, it was not backported because it requires a newer version of oslo.concurrency17:43
markguz_sean-k-mooney: 117:43
sean-k-mooneyi believe the periodic will only update the resource usage for the nodes that are assigned to it17:43
sean-k-mooneyso i think you can scale it by deploying more compute service instances, TheJulia is that correct?17:44
sean-k-mooneymarkguz_: if you have 3 controllers i would suggest running an ironic nova-compute service instance on each, assuming that makes sense to TheJulia or others17:45
TheJuliasean-k-mooney: yes, you can, you should just be able to run multiple instances17:45
TheJuliamarkguz_: ^^^ instances of nova-compute configured for ironic17:45
stephenfinmelwitt: comments left on the bug report too17:45
* stephenfin knocks off for the evening o/17:45
sean-k-mooneymarkguz_: the other thing you could do is reduce the interval of the periodic17:46
melwitthm, I thought you needed to configure node partitioning to do that17:46
sean-k-mooneywe fixed it by changing the type of lock we use17:46
melwitt"conductor groups"17:46
TheJuliamelwitt: only to force specific grouping/allocation into specific grouping17:46
sean-k-mooneythat on the ironic side i think17:46
melwittit's not17:47
TheJuliaits on both sides17:47
sean-k-mooneyah ok17:47
markguz_peridoc_task_interval is set to 24017:47
melwittwell, it might be but you have to do it on the nova side too17:47
TheJuliaotherwise it runs a hash ring  based upon the node list17:47
TheJuliaand the group is just a key in the hash ring17:47
sean-k-mooneywe improved this in nova by using oslo's fair locks17:48
TheJuliathe nova side name is a little different because naming_is_fun^TM17:48
sean-k-mooneyhttps://review.opendev.org/c/openstack/nova/+/711528/217:48
markguz_so we use this in a lab env and when we spin up baremetal we need to spin up a spcific node as they are connected to specific hardware that is being tested17:48
sean-k-mooneybut that was only done in ussuri17:48
TheJuliasean-k-mooney: ohhhhh neat17:48
sean-k-mooneyso we would have to backport it, unfortunately i'm not sure oslo has the required support in rocky, let me check17:49
melwittok conductor groups are not available until stein anyways17:49
sean-k-mooneywe would need oslo.concurrency 3.29.0 to backport it17:49
markguz_if i run multiple computes i'm guessing that i will need to change how i call an instance. right now i use the "avail_zone:compute_host:bm_uuid" trick17:49
melwittyeah, I said all of that earlier17:49
sean-k-mooneystable rocky is oslo.concurrency===3.27.017:49
TheJuliamarkguz_: yeah, :\17:50
melwittyes, the patch that added fair locks bumped the oslo.concurrency version17:50
sean-k-mooneyso it can go back to stein17:50
sean-k-mooneybut not rocky17:50
melwittso it wasn't bumped until ussuri17:50
sean-k-mooneymarkguz_: ya you would need to know which host has it17:51
markguz_what's ironic (pun intended) is that i was upgrading with the intention of getting to ussuri but when ironic failed at the rocky step i didn't want to compound the problem by continuing to upgrade17:51
melwitthuh yeah actually it could be backported to stein because the upper constraint is 3.29.1 for whatever reason17:52
melwittI did not expect that17:52
markguz_assuming the bug is the problem, going to ussuri will fix it? but that will break a lot of our automation due to the way we're calling the nodes17:53
markguz_i mean it's not the end of the world, but ugh.. more work :-(17:53
markguz_at least i finally have a better idea of what's wrong. i seriously was losing the will to live over this ;-)17:53
melwittyeah. you could try to haxx and apply the patch to see if it helps. you just need oslo.concurrency >= 3.29.017:54
melwitt(so you know for sure whether you're hitting that bug)17:54
markguz_melwitt: does the patch need to go on the scheduler or the compute node? or both?17:55
melwittmarkguz_: compute node17:55
markguz_melwitt: then i can probably crowbar that in17:56
*** mlavalle has joined #openstack-nova17:57
melwittbleh, there's merge conflicts but it's really just adding fair=True to all the @utils.synchronized(COMPUTE_RESOURCE_SEMAPHORE, fair=True)17:58
markguz_ok. i'll give it try and see what happens.17:59
markguz_will i need to upgrade the other oslo. components or just concurrency?18:00
melwittjust concurrency18:00
openstackgerritmelanie witt proposed openstack/nova stable/train: Use fair locks in resource tracker  https://review.opendev.org/c/openstack/nova/+/77058518:02
*** hamalq has joined #openstack-nova18:03
sean-k-mooneywe could probably implement a version of the patch for rocky too18:04
sean-k-mooneythat just did not use the fair lock from oslo18:04
*** rpittau is now known as rpittau|afk18:05
*** derekh has quit IRC18:07
sean-k-mooneymarkguz_: this was the implementation of the fair lock https://github.com/openstack/oslo.concurrency/commit/2b55da68ae45ff45cba68672cdbc24342cf115f618:10
*** andrewbonney has quit IRC18:12
sean-k-mooneymarkguz_: if you wanted to backport the upstream patch and then backport the implementation of the fair lock into nova we could evaluate that, or at least the stable team could18:13
sean-k-mooneyits just using https://fasteners.readthedocs.io/en/latest/api/lock.html#fasteners.lock.ReaderWriterLock18:13
sean-k-mooneyto actually provide the FIFO behavior18:14
sean-k-mooneythe version of fasteners on stable rocky has the required functionality18:15
sean-k-mooneythat said i think you can just bump the oslo.concurrency version locally and locally apply the nova patch and it should run fine18:16
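A rough idea of what the fair (FIFO) lock discussed above buys, as a self-contained sketch. This is not the oslo.concurrency/fasteners implementation; it is just a minimal illustration of FIFO hand-off, so a waiter like instance_claim can't be starved by later arrivals such as the update periodic task:

```python
import collections
import threading
import time

class FairLock:
    """Toy FIFO lock: release() always wakes the oldest waiter first."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._held = False
        self._waiters = collections.deque()

    def acquire(self):
        with self._mutex:
            if not self._held and not self._waiters:
                self._held = True
                return
            ticket = threading.Event()
            self._waiters.append(ticket)
        ticket.wait()  # blocks until release() hands the lock to us

    def release(self):
        with self._mutex:
            if self._waiters:
                self._waiters.popleft().set()  # FIFO hand-off, lock stays held
            else:
                self._held = False

    def waiter_count(self):
        with self._mutex:
            return len(self._waiters)


order = []
lock = FairLock()
lock.acquire()  # main thread holds the lock while contenders queue up

def contender(name):
    lock.acquire()
    order.append(name)
    lock.release()

threads = []
for name in ("first", "second", "third"):
    t = threading.Thread(target=contender, args=(name,))
    t.start()
    # wait until this thread is queued before starting the next,
    # so the enqueue order is deterministic
    while lock.waiter_count() < len(threads) + 1:
        time.sleep(0.001)
    threads.append(t)

lock.release()  # hand the lock down the queue
for t in threads:
    t.join()
print(order)  # FIFO: ['first', 'second', 'third']
```

A plain (unfair) lock makes no ordering promise, which is how a steady stream of periodic-task acquisitions could keep an instance_claim waiting for many minutes, as in the log markguz_ pasted.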
openstackgerritmelanie witt proposed openstack/nova stable/stein: Use fair locks in resource tracker  https://review.opendev.org/c/openstack/nova/+/77065718:19
*** ralonsoh has quit IRC18:20
sean-k-mooneystephenfin: by the way as far as i am aware we never use the cpu_topology field in the numa cell object to generate the XML at all18:20
openstackgerritmelanie witt proposed openstack/nova stable/stein: Use fair locks in resource tracker  https://review.opendev.org/c/openstack/nova/+/77065718:20
sean-k-mooneythe numa topology of the guest or host should have no impact on the cpu topology of the guest, period18:21
sean-k-mooneyany other behavior is inconsistent with the intended behavior as described by the specs18:22
*** dtantsur is now known as dtantsur|afk18:43
openstackgerritMerged openstack/nova stable/victoria: Omit resource inventories from placement update if zero  https://review.opendev.org/c/openstack/nova/+/76617718:58
*** belmoreira has quit IRC19:03
markguz_sean-k-mooney: i think it would be simpler for me to just upgrade to ussuri19:14
sean-k-mooneyif that is an option yes19:14
openstackgerritMerged openstack/nova stable/victoria: Add upgrade check about old computes  https://review.opendev.org/c/openstack/nova/+/76192419:15
markguz_fortunately for me this is an internal deployment that is not used by paying customers so i have some degree of flexibility on its availability19:15
sean-k-mooneywhat do you use to deploy/manage it19:16
markguz_sean-k-mooney: originally i deployed kilo with rdo packstack.  since then it's become a bit of a hodgepodge of manual installs. I mostly use ansible to keep things up to date19:21
sean-k-mooneyah i see19:21
sean-k-mooneypackstack has more or less been unsupported for a few years now19:22
sean-k-mooneyi think it still technically exists but redhat stopped supporting it with our product in queens i think19:22
markguz_yeah. i generally just install from the package manager. and have some ansible plays that configure compute nodes etc etc.19:22
sean-k-mooneye.g. we moved to require all customers to deploy with TripleO around queens19:22
markguz_i've got some bits and pieces that are installed from git master for things like magnum and heat.19:23
sean-k-mooneymarkguz_: you should look into openstack ansible so19:23
markguz_it's on my todo list :-)19:23
markguz_The patch seems to have fixed my problem... spawning is happening at the expected rate now19:26
sean-k-mooneydid you just bump the oslo.concurrency version and apply the nova patch19:26
markguz_sean-k-mooney: yup19:26
sean-k-mooneythere is still going to be contention on the lock but it should now preserve order19:27
markguz_so splitting to multiple compute nodes is still probably the best path19:27
*** hamalq has quit IRC19:29
sean-k-mooneylong term probably, but at least your current issue is mitigated19:30
sean-k-mooneyi won't say solved but manageable19:30
openstackgerritArtom Lifshitz proposed openstack/nova-specs master: `socket` PCI NUMA-affinity Policy  https://review.opendev.org/c/openstack/nova-specs/+/76555119:33
artomsean-k-mooney, stephenfin (though I suspect you're done for the day) ^^19:33
* artom looks at sean-k-mooney's specs next19:34
sean-k-mooneymore or less, i'll leave it open however for tomorrow19:34
artom... with some snow shovelling and kid driving thrown in the mix19:34
*** zenkuro has joined #openstack-nova19:50
*** whoami-rajat__ has quit IRC19:55
*** hoonetorg has quit IRC20:04
*** nightmare_unreal has quit IRC20:16
*** slaweq has quit IRC20:41
*** adeberg has quit IRC20:53
*** nweinber has quit IRC21:09
*** jdillaman has joined #openstack-nova21:09
*** hoonetorg has joined #openstack-nova21:18
*** vishalmanchanda has quit IRC21:41
*** hoonetorg has quit IRC21:43
*** rcernin has joined #openstack-nova21:59
*** xek has quit IRC22:00
*** xek has joined #openstack-nova22:01
*** xek has quit IRC22:05
*** brinzhang_ has joined #openstack-nova23:02
*** songwenping_ has joined #openstack-nova23:02
*** songwenping__ has quit IRC23:05
*** brinzhang has quit IRC23:05
openstackgerritMerged openstack/nova stable/stein: [stable-only] Cap bandit and make lower-constraints job non-voting  https://review.opendev.org/c/openstack/nova/+/76648723:54

Generated by irclog2html.py 2.17.2 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!