Wednesday, 2023-04-12

opendevreviewSam Morrison proposed openstack/nova master: Filter out deleted instances when looking for build timouts  https://review.opendev.org/c/openstack/nova/+/88012506:48
bauzasdansmith: gmann: sean-k-mooney: I added my thoughts on the deprecation patches but I'm afraid a backport wouldn't help operators notice08:17
opendevreviewJorge San Emeterio proposed openstack/nova master: WIP: Testing whether tests on bug#1998148 still fail.  https://review.opendev.org/c/openstack/nova/+/88013508:48
Uggla bauzas , hi, if you have some time, can you have a look at the virtiofs series and especially the interfaces ?09:05
bauzassure I've seen your updates 09:06
bauzasbut I need a jar of coffee first 09:06
sean-k-mooneyUggla: have you moved the libvirt driver changes for creating the xml config to the start of the series09:34
sean-k-mooneythat should be the first set of patches in the series so that it can be merged to allow the scaphanda work to start in parallel09:34
sean-k-mooneywhile we wait for Manila to add the lock api09:35
manuvakery1Hi, I have a 2 node cluster and I am running a rally scenario against it to create and delete servers.  When I set the server count to 2 I see the response times as follows09:36
manuvakery1nova.boot_servers : 11.941 sec (Avg)09:36
manuvakery1nova.delete_servers: 14.675 sec (Avg)09:36
manuvakery1making the server count 5 added a huge difference in delete_servers 09:36
manuvakery1nova.boot_servers: 18.803 sec (Avg)09:36
manuvakery1nova.delete_servers: 42.988 sec (Avg)09:36
manuvakery1is this expected?09:36
manuvakery1the compute nodes are not under any load09:37
Ugglasean-k-mooney, hum, not really.09:37
sean-k-mooneyok I'm not sure when we discussed that09:40
kashyapbauzas: Do we have a conclusion on this? -- https://review.opendev.org/c/openstack/nova/+/87902109:41
sean-k-mooneybut if we want to merge any code before the Manila API is implemented, the libvirt driver change for the xml generation is the lowest risk09:41
bauzassean-k-mooney: I haven't gone further than the 4th change, but the Manila API should be the last change in the series09:42
noonedeadpunkhey folks! We're reviewing our nova role, as we haven't touched it for a while, and now we're a bit /o\ about the logic we have there. so would be glad if you can share your perspective on some things :)09:42
sean-k-mooneybauzas: yes the api should be last and the xml changes in the driver should be first09:42
sean-k-mooneydriver then db then rpc finally api09:42
bauzassean-k-mooney: for the libvirt driver, meh to me, since I think we should be quite okay for the first 4 patches (DB and objects)09:42
bauzasnoonedeadpunk: shoot09:43
noonedeadpunkSo first question - when `discover_hosts_in_cells_interval` is set to -1, I assume you need to run `nova-manage cell_v2 discover_hosts` before compute will appear in `openstack compute service list`?09:43
noonedeadpunkAs I recall, the compute itself does report back on startup as well09:43
sean-k-mooneynoonedeadpunk: no they will show up in the list09:43
sean-k-mooneybut they wont be in the cell mappings09:43
noonedeadpunkand this option is for scheduler09:43
bauzasnoonedeadpunk: no, but they won't be scheduled09:43
bauzassince they won't have a cell mapping09:44
noonedeadpunkaha, gotcha09:44
bauzashaha09:44
sean-k-mooneydiscover_hosts tells us which rabbit to use to talk to the compute and what cell db it's in09:44
noonedeadpunkok, then we have this set correctly :D09:44
noonedeadpunkbut it discovers... through API?09:45
sean-k-mooneyno09:45
noonedeadpunknot through conductor then?09:45
noonedeadpunkah, right09:45
sean-k-mooneyit discovers by connecting to all the cell dbs09:45
noonedeadpunkok-ok, yes, I recalled :)09:45
sean-k-mooneyhttps://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L1043 09:46
sean-k-mooneycalls https://github.com/openstack/nova/blob/ae42400b7663bc58d5562de99e976c95131b77a9/nova/objects/host_mapping.py#L247-L28509:46
noonedeadpunkand in order for compute to be discovered by discover_hosts it must already be in service list?09:47
opendevreviewMerged openstack/placement master: Db: Drop redundant indexes for columns already having unique constraint  https://review.opendev.org/c/openstack/placement/+/85677009:47
sean-k-mooneycorrect09:47
sean-k-mooneyso the compute agent connects to the conductor on startup09:47
sean-k-mooneyand then it asks the conductor to register a service in the db09:47
sean-k-mooney"the cell db"09:48
sean-k-mooneythe conductor may or may not have access to the api db09:48
sean-k-mooneyso we can't assume it can add the cell mapping09:48
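A self-contained toy model of the flow sean-k-mooney describes, purely illustrative and not the actual nova code (the real logic lives in nova/service.py and the host_mapping code linked above): the compute asks its conductor to create a service record in the cell database, and a later discover_hosts run scans the cell databases and adds any missing host mappings to the API database.

```python
# Toy model of the registration + discovery flow -- not the actual nova code.
api_db = {"host_mappings": {}}                # host -> cell, lives in the API DB
cell_dbs = {"cell1": {"services": set()},     # per-cell databases
            "cell2": {"services": set()}}

def compute_startup(host, cell):
    # The compute agent asks its conductor to register a service record in
    # the cell DB; that conductor may have no API DB access, so no host
    # mapping is created here -- the host only shows up in the service list.
    cell_dbs[cell]["services"].add(host)

def discover_hosts():
    # Roughly what "nova-manage cell_v2 discover_hosts" does: scan every cell
    # DB and map any registered compute host not yet known to the API DB.
    # Idempotent, so it is safe to run at any time.
    added = []
    for cell, db in cell_dbs.items():
        for host in sorted(db["services"]):
            if host not in api_db["host_mappings"]:
                api_db["host_mappings"][host] = cell
                added.append(host)
    return added

compute_startup("compute-1", "cell1")   # visible in the service list, not schedulable yet
print(discover_hosts())                 # ['compute-1'] -> host mapping created
print(discover_hosts())                 # [] -> second run is a no-op
```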
noonedeadpunkyeah, ok, thanks for the explanation09:50
noonedeadpunkAs I was mixing up discovery flow09:50
bauzasnoonedeadpunk: maybe this would help you :p https://docs.openstack.org/nova/latest/admin/cells.html#faqs09:50
bauzasbut yeah, discover_hosts is idempotent so you can call it anytime you want09:51
bauzasthis is explained at the end of https://docs.openstack.org/nova/latest/admin/cells.html#configuring-a-new-deployment09:52
noonedeadpunkwell, I was unsure about logic, as I could not fully understand the reason why we're waiting for compute to be present in service list when nova_discover_hosts_in_cells_interval=-1 but run cell_v2 discover_hosts only after that09:52
bauzasnoonedeadpunk: we have two different databases now as a reminder09:53
bauzasso we need to make sure we synchronize them09:53
noonedeadpunkAs for some reason I thought that discover_hosts would also add to the service list, but now I know it's not how it works)09:53
sean-k-mooneybauzas: technically we have 3. we always have api + cell0 and cell1 in any real deployment :)09:54
bauzassean-k-mooney: we have two different table schemas, but yeah :)09:57
bauzascell0 is logically identical to any cell 09:57
bauzasonly the usage of this cell is different09:57
bauzasbut we're bikeshedding :)09:58
noonedeadpunkAnother thing I am unsure about - db online_data_migrations and nova-status upgrade check and when they can/should be run.09:59
noonedeadpunkAs far as I got - db online_data_migrations should be done only during upgrades10:00
noonedeadpunkBut is it recommended to run `nova-status upgrade check` before that?10:00
noonedeadpunkAs it feels like upgrade check should be run once all computes are upgraded, while online_data_migrations requires only conductor to be done?10:01
bauzasnoonedeadpunk: this is correct10:05
bauzasyou should run nova-status upgrade check *before* the db upgrade as a pre-flight check10:05
bauzasnova-status is a controller script, checking a few APIs and DBs10:06
bauzasto see whether you're safe to upgrade10:06
bauzaslike, say we introduced a thing but it wasn't the default10:07
bauzasbefore we flip the default and remove the old bits, we ask you to check that you opted in first10:07
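For illustration only, a minimal standalone sketch of the pre-flight-check idea bauzas describes (the real nova-status is built on oslo.upgradecheck; the option name and check below are hypothetical, not actual nova code):

```python
# Standalone sketch of a "pre-flight" upgrade check -- hypothetical, not a real nova-status check.
import enum

class Code(enum.IntEnum):
    SUCCESS = 0
    WARNING = 1
    FAILURE = 2

def check_new_thing_opted_in(conf):
    # Hypothetical check: a default is about to flip in the next release, so
    # fail loudly if the operator has not opted in to the new behaviour yet.
    if conf.get("use_new_thing"):
        return Code.SUCCESS, "new behaviour already enabled"
    return Code.FAILURE, "enable use_new_thing before upgrading"

checks = [check_new_thing_opted_in]
results = [check({"use_new_thing": False}) for check in checks]
worst = max(code for code, _ in results)
for code, msg in results:
    print(f"{code.name}: {msg}")
print(f"exit code: {int(worst)}")        # non-zero means "not safe to upgrade yet"
```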
noonedeadpunkaha, yes, exactly what I was unsure about. As currently we run upgrade check basically after cell_v2 discover_hosts at the very end10:07
noonedeadpunkwhich didn't make much sense to me10:08
noonedeadpunkis there any way to tell if online_data_migrations are needed at all? Like by some exit code of smth?10:08
bauzasnoonedeadpunk: see the upgrade docs, this explains it way better than me now https://docs.openstack.org/nova/latest/admin/upgrades.html#rolling-upgrade-process10:09
bauzasnoonedeadpunk: wait, I said something wrong10:10
bauzasnoonedeadpunk: https://docs.openstack.org/nova/latest/cli/nova-status.html#upgrade10:10
bauzasyou need to do DB schema expands/contracts before with the migrations10:11
bauzasthen, you run the status check10:11
bauzasand only then you need to start the compute services10:11
* bauzas was used to those limbo dances before but we haven't really touched nova-status in a while10:12
bauzasnor the online data migrations10:12
bauzasnoonedeadpunk: as a summary, say that during a Bobcat release we may add some online data migrations10:15
noonedeadpunkok, so status check before service restart but after online migrations?10:15
bauzaswe would provide the nova-manage command to perform such data migrations10:15
bauzasaccordingly, we would propose a nova-status check addition that would check whether the data is fully migrated10:16
bauzasso when upgrading to C, you would run this nova-status check that would fail if you forgot to run the online data migrations command10:16
bauzassame goes with Placement10:16
bauzassay that by Bobcat, we start using a new Placement API call10:17
bauzaswe would need to continue supporting the old Placement calls10:17
bauzasbut if we start using that sole version of Placement, then we would need to add a nova-status check that would fail if you forgot to update Placement10:18
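A standalone sketch of the batched pattern nova's online data migrations generally follow, as described above: each migration reports how many rows still need work and how many it migrated this run, and the runner loops until nothing is left (illustrative toy only; the field and table are made up).

```python
# Toy illustration of the batched online-data-migration pattern -- made-up data, not nova code.
rows = [{"id": i, "new_field": None} for i in range(10)]    # fake legacy table

def migrate_new_field(max_count):
    # Returns (found, done): rows still needing migration, and rows actually
    # migrated by this invocation (bounded by max_count).
    todo = [r for r in rows if r["new_field"] is None]
    done = 0
    for row in todo[:max_count]:
        row["new_field"] = f"value-{row['id']}"             # the "data migration"
        done += 1
    return len(todo), done

# Roughly what the online_data_migrations command loops over:
while True:
    found, done = migrate_new_field(max_count=4)
    print(f"found={found} done={done}")
    if found == 0:          # nothing left to migrate -> data is fully migrated
        break
```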
bauzasI just hope I haven't lost you :)10:19
wangrongdansmith: sean-k-mooney: Thank you for your feedback in generic-vdpa spec. Regarding some of the issues you raised, I have explained them in my reply. Could you please take a look at the submission again. Thank you for your assistance. https://review.opendev.org/c/openstack/nova-specs/+/87933810:25
sean-k-mooneyI saw, I skimmed it briefly this morning10:25
gibisean-k-mooney: https://bugs.launchpad.net/nova/+bug/2008716 this seems to be a real bug in our migration code but you had comments before regarding different dbs being targeted. I don't see how that is possible given the error message states that the column is missing, not the table10:25
noonedeadpunkbauzas: sorry, was side-pinged :(10:26
sean-k-mooneywangrong: in general I think we will still need to change the design significantly from what you originally proposed10:26
noonedeadpunkthat makes total sense. 10:27
noonedeadpunkSo yes, once I've upgraded packages on Bobcat, I should do the migrations, and then run the upgrade check to ensure it's safe to restart now10:27
noonedeadpunkok, now will try to make some patches out of this discussion10:35
wangrongsean-k-mooney: ok, could you offer more detailed info about the problems with the current design in the review comments10:37
sean-k-mooneyyes, I need to also fully re-read your comments10:41
sean-k-mooneywangrong: basically I think we need two features before we can implement the feature you are proposing10:42
sean-k-mooneyallow setting the hw_vif_model per port and allow setting hw_disk_bus per cinder volume10:43
sean-k-mooneyif we have both features we can add hw_vif_model=vdpa and hw_disk_bus=vdpa10:43
sean-k-mooneyto enable generic vdpa support10:43
wangrongsean-k-mooney: yes, I think so,  if we plan to use hw_xxx.10:46
opendevreviewOleksandr Klymenko proposed openstack/placement master: CADF audit support for Placement API  https://review.opendev.org/c/openstack/placement/+/88014511:07
noonedeadpunkNo, wait. I'm confused again. What's written in the docs is that I should run online_data_migrations _after_ the service restart, and the upgrade check _before_ the restart. So the upgrade check should be run before the migrations11:38
bauzasnoonedeadpunk: because the upgrade check would be verifying something other than the online data migrations for the same cycle11:46
noonedeadpunkyeah, ok, we just got things mixed up a bit11:51
noonedeadpunkwe were running online_data_migrations early, before the first conductor is restarted, but upgrade_check when all services are11:51
noonedeadpunkSo i got really confused11:51
sean-k-mooneyso you should be able to do the db sync before any restarts11:52
sean-k-mooneyI'm not sure about online migrations 11:52
noonedeadpunkyeah, db_syncs are done before, but were followed by online migrations...11:52
sean-k-mooneypart of me expects those to happen after the db models are updated, i.e. after all the controller services are restarted11:53
noonedeadpunkWhat I'm not sure about is whether upgrade_check is going to pass if only conductor/api are available, but no computes. I assume it should not care as long as the cell mappings are in place.11:53
sean-k-mooneyso I'm not sure that will break things but it leaves a window where more data could be created because of the old db model that is in use11:53
noonedeadpunkWell. That is true. 11:54
sean-k-mooneyso I think online migrations should be after the restarts11:54
noonedeadpunkBut the docs you've shown say it should be done after everything11:54
sean-k-mooneyya11:54
sean-k-mooneyI would generally put them right at the end11:54
noonedeadpunkAs data will be partially migrated on demand even without running the migrations11:54
sean-k-mooneywhen all services are fully upgraded11:54
noonedeadpunkyup, great, thanks for help as usual!11:55
sean-k-mooneyso the thing is as long as you are not trying to skip several releases11:55
sean-k-mooneywe will have the fallback code11:55
sean-k-mooneyso the online migrations could be done even in a separate maintenance window11:55
noonedeadpunkwell, it seemed to work nicely the old way as well for years, so I was quite hesitant to touch anything there :D11:55
sean-k-mooneybut it's better to do them as part of the normal upgrade after all the services are updated11:56
* noonedeadpunk always skipping releases11:56
sean-k-mooneywhere you would have a problem is if there is a db contraction11:56
* noonedeadpunk but has to deal with consequences11:56
sean-k-mooneyalthough I don't think that would be part of the online migrations11:57
sean-k-mooneyi.e. if we removed a column11:57
noonedeadpunkYeah, that's what I was thinking about11:57
noonedeadpunkthat if the db schema changes - it should not be done post-upgrade, but I assume it's done during db_sync11:57
noonedeadpunkwhich is done really at early stage11:58
sean-k-mooneyya that's a good point11:58
sean-k-mooneyschema changes are separate11:58
sean-k-mooneythe online migrations are just data migrations11:58
noonedeadpunkok, then will see how CI will fail dramatically with my changes :D11:59
bauzasnoonedeadpunk: as written in the docs I linked, db sync only changes the schemas12:04
bauzasnoonedeadpunk: then either we change the data internally (either by a lazy call, or when restarting the service), or we ask operators to run online_data_migrations12:05
bauzasbut online_data_migrations can run throughout the cycle, until you want to upgrade to a new release12:06
bauzasthen, once you want to upgrade, you can first run nova-status upgrade check to verify that you're done with the online data migrations12:07
noonedeadpunkwith N-1 online data migrations, right?12:08
noonedeadpunkso running upgrade check on N will verify that N-1 migrations are done12:10
noonedeadpunkor well, N-2 with slurp, but that's different topic :D12:10
bauzasyup12:13
noonedeadpunkok, awesome, thanks for your time and patience :)12:14
* bauzas needs to go off for 30 mins12:14
dansmithnoonedeadpunk: yeah, can't run online migrations until you're past the point where conductors, api, scheduler are upgraded (or stopped before upgrade)13:35
dansmithnoonedeadpunk: the idea is that online migrations should be run *after* everything is done and back up13:36
dansmithservices will migrate what they need on-demand, and the online migrations are just there to push things that don't get migrated on demand13:36
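A toy illustration of the on-demand migration dansmith describes, loosely modelled on the Instance.compute_id backfill proposed later in this log but not the actual nova object code: a record touched by a running service gets its new field filled in lazily, and the batch online_data_migrations command only has to sweep up rows nothing ever touched.

```python
# Toy model of on-demand data migration -- not the actual nova object code.
db_row = {"uuid": "abc", "host": "compute-1", "compute_id": None}   # old-style row

def lookup_compute_id(host):
    return {"compute-1": 42}.get(host)   # hypothetical host -> compute node id mapping

def load_instance(row):
    instance = dict(row)
    if instance["compute_id"] is None:
        # Backfill the new field the first time the record is loaded by an
        # upgraded service, and persist it so the row is migrated in place.
        instance["compute_id"] = lookup_compute_id(instance["host"])
        row["compute_id"] = instance["compute_id"]
    return instance

print(load_instance(db_row))   # first load backfills compute_id on demand
print(db_row)                  # the stored row is now migrated too
```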
noonedeadpunkyup, thanks for confirming that!13:38
noonedeadpunkwill go bug cinder folks with the same question then :-)13:41
dansmithack, I didn't mean to repeat, just wasn't sure there was a clear summary of that convo, but sounds like you got it :)13:45
noonedeadpunkyeah, I've pushed change as a result - I should have posted it https://review.opendev.org/c/openstack/openstack-ansible-os_nova/+/88014713:46
dansmithnoonedeadpunk: I think running the status check after online migrations also makes sense as there are cases where it will tell you that there are still pending migrations to do if they're not complete13:57
dansmithbut regardless, the meat of that change sounds right yeah13:57
bauzasdansmith: thanks for having explained it better than me :)14:52
bauzasyour summary is far easier :)14:52
dansmithbauzas: when you're done with the current call you're on I want to chat about some resource tracker stuff15:12
noonedeadpunkbtw, not that long ago (at the PTG?) we discussed the enable_new_services option. And at the end of the day it should be applied not to computes, but somewhere on conductor/scheduler/api - not sure where exactly as I have them combined in the same place (with the same nova.conf)15:54
noonedeadpunkso while it's defined in nova.conf.compute - it's not a compute option15:55
noonedeadpunk(https://opendev.org/openstack/nova/src/branch/master/nova/conf/compute.py#L1456-L1474)15:55
dansmithconductor15:57
dansmithbut we could also fix that15:57
bauzasdansmith: I'm here16:00
dansmithbauzas: so... the RT has this notion of "disabled compute nodes"16:00
dansmithwhich seems to cover things actually disabled (via ironic) as well as "no such compute node on this host"16:00
* bauzas reloads the RT context in his mind16:00
dansmithwe do some weird things in move claims if we think the compute node is missing,16:01
dansmithlike accept the migration but do no claim16:01
bauzasiirc, this was done for the latter, not the former case16:01
dansmiththe latter meaning a missing compute node?16:01
dansmithI can't think of any reason why that would make sense, as we'll just mess up our accounting16:01
dansmithhttps://review.opendev.org/c/openstack/nova/+/879682/2/nova/compute/resource_tracker.py#29616:02
dansmithso I'm looking at that ^ specifically16:02
dansmithwhere we just do a nopclaim and return from move_claim16:02
dansmithwhich means, AFAICT, we would have created the migration object on the target machine, claimed no resources, and agreed to accept the new incoming instance16:03
dansmitheven though it's for a node that we don't have or know anything about?16:03
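A simplified paraphrase of the control flow under discussion, condensed from the description above rather than copied from the real resource_tracker code: the migration record is written before the "do we even know this node?" check, so an unknown node still gets a migration record plus a no-op claim.

```python
# Simplified paraphrase of the flow being discussed -- not the real resource tracker code.
class NopClaim:
    """Placeholder claim that reserves nothing."""

known_nodes = {"node-a"}
migrations = []

def move_claim(instance, dest_nodename):
    # Migration record is created/updated *before* the node check.
    migrations.append({"instance": instance, "dest_node": dest_nodename})
    if dest_nodename not in known_nodes:
        # The "disabled node" path: accept the move but claim nothing --
        # no migration_context, no PCI claims, no resource accounting.
        return NopClaim()
    return f"real claim on {dest_nodename}"

print(move_claim("inst-1", "node-that-does-not-exist"))   # a NopClaim comes back
print(migrations)   # migration record points at a node this host knows nothing about
```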
bauzasI'm just trying to understand the intent16:04
dansmithalso note this from mr. pipes: https://github.com/openstack/nova/blob/master/nova/compute/resource_tracker.py#L15416:04
dansmithin the instance claim, where we'll also do weird stuff for new instances with no such node16:04
dansmiththe earlier part of the comment says it shouldn't happen16:05
dansmithso, like, I'm confused and feel like this is probably some very old cruft16:05
dansmithlike, surely we can't have these things happen in a placement world and not get everything messed up, right?16:05
bauzasyeah, found the commit https://github.com/openstack/nova/commit/1c967593fbb0ab8b9dc8b0b509e388591d32f53716:06
bauzasok, so now I think I remember where it comes16:07
bauzasthe virt drivers are responsible for creating the RTs16:07
bauzasat compute startup, we were originally setting one RT per compute node resources dict that was returned by the virt driver16:08
bauzasand then mr jay consolidated that 16:08
bauzasby having one single RT instance16:08
dansmith...right16:08
bauzasso the disabled property originally came from the fact there could be a race where the RT was instantiated but the virt resources not fully set up16:09
dansmiththat may be the case for startup, but we should never accept an instance boot or migration if we can't account for the resources right?16:10
bauzashttps://github.com/openstack/nova/commit/1c967593fbb0ab8b9dc8b0b509e388591d32f537#diff-ed9525d7ae319fd575249ed72daf634d27182e08fea7a4f740cb3164233612b7L43516:10
bauzaswell, that whole thing predates placement16:12
bauzashttps://github.com/openstack/nova/commit/1c967593fbb0ab8b9dc8b0b509e388591d32f537#diff-ed9525d7ae319fd575249ed72daf634d27182e08fea7a4f740cb3164233612b7L41216:12
dansmithright16:12
dansmith(those links do not link to a hunk for me, btw)16:12
bauzasah shit16:12
bauzasso, the fact is, before placement,16:12
bauzaswe were relying on the ComputeNode records for scheduling16:12
bauzasif there were no ComputeNode records, then the HostStates from the scheduler weren't generated16:13
bauzasand no instances were able to schedule into them16:13
dansmithyour point being that we could (should) never get into that case where we do the NopClaims because of that?16:13
dansmithI think that's probably also not true because of forced migrations16:14
bauzasthe NopClaims were there for saying 'meh' 16:14
bauzasif no compute record existed then basically we were tracking nothing16:14
dansmithright but they also cause us to skip all the other stuff like PCI and migration objects, etc16:14
dansmithbut we allow the instance to be sent there16:15
bauzasif the scheduler doesn't have a compute node record, then it shouldn't be able to send an instance to it16:15
bauzasand I think this assumption is still valid16:15
dansmithexcept for times where we didn't go through the scheduler, which I think back when this was done, was the case16:16
dansmithbut if you're right, why would we not raise an exception here instead of silently "meh" ?16:16
bauzasgood question16:16
dansmithand also,16:16
dansmithtransient changes could result in the scheduler thinking a compute node was valid for a given host, send a node there,16:17
bauzasthe NopClaims even predate Mr Jay's work on RT16:17
bauzasand even predate my presence in OpenStack this decade :)16:17
dansmithand the RT would literally create a migration record with a destination node name that does not match anything it knows about16:17
dansmiththe nopclaim stuff itself predates tons of stuff, yeah, but the usage of it in this case is much newer16:18
dansmithhere's my problem:16:18
bauzasoh16:19
dansmithin order to do the strict tying of instance->compute->service, we need migrations to have a strict tie from migration->compute16:19
dansmithwhich means we need the compute id in the migration16:19
bauzasI missed the fact we record a migration before we return the NopClaim16:19
dansmithright16:19
bauzastechnically this isn't a noop16:19
dansmithcrazy right?16:19
bauzasdansmith: I wonder something, lemme check16:19
dansmithso my code moves that to below the disabled check and refuses to create a migration record for a destination compute node we don't know anything about (and thus don't have a node id)16:19
bauzasmy guess is that someone added the migration checks *after* the nopclaim implementation16:20
bauzaswithout thinking about the implication16:20
dansmithhere, the nodename loose affiliation is us just lying about the destination, which is easy if the nodename is just a name, but the id requires stricter adherence16:20
bauzas(gosh hope it isn't me :) )16:20
dansmithwhat "migration checks" ?16:22
dansmiththose lines I'm moving are the creation of the migration object16:22
dansmith(or the assignment of the migration to a node)16:22
dansmithlike, creating the migration and then returning out of that early also means instance.migration_context isn't created16:23
bauzasyeah, I guess we need to be smarter here16:25
bauzaswe probably piled a couple of modifications without really considering the impact16:25
dansmithokay, that's what I'm hoping .. that we really never get here and that we're not either (a) depending on this behavior or (b) silently sometimes hitting this code path and losing resources or something16:26
bauzasthe fact is, say the virt driver suddenly returns nothing16:26
bauzaswe need to handle this case correctly16:26
dansmithso my change there means we no longer update/create the migration if we're going to bail out early16:26
dansmithbut we might want to follow up with something to actually abort16:26
bauzasyes16:26
dansmithso I thought this might be for ironic for some reason16:27
bauzasthe fact is, we were allowing a migration to succeed even if we weren't tracking its resources16:27
dansmithbecause they override their disabled check to mean more than just a missing compute16:27
dansmithif not, I can update my comment there16:27
bauzasdo you feel brave enough to untangle this spaghetti code ?16:27
dansmithI've been holding off writing tests for this change in case I was missing something but it sounds like I'm good to proceed16:28
bauzasyeah and honestly I'm a bit afraid of the potential impacts we may create16:28
dansmithall I have the appetite for at the moment is doing this change to it, because I've still got work to do on the rest of the series16:28
dansmithpotential impact of what? not creating the migration object or making this an error instead of a silent Nop ?16:28
bauzasthe potential impact of having a migration succeed as a noop without having a migration record set16:29
dansmithokay I'm confused I thought you were sure we couldn't be actually running this16:30
dansmithhowever, if we are, surely it's better to have it fail, find out how we're getting there, and fix it right?16:30
dansmithbecause like I said,16:30
dansmithif we hit this code, we're not even creating instance.migration_context16:30
dansmithor the pci claims16:31
bauzasthere are two preconditions in order to have this disabled property returning True16:31
dansmithand we won't have the numa stuff that goes along with it16:31
bauzasthe first precondition, which is to have the variable set internally, seems trivial enough to me, as it's a race-condition hack at startup16:31
bauzasI'm a bit concerned by the second precondition, which is that the virt driver returns yay or nay16:32
dansmithyeah, and "disabled" is a very strange word to use for the first case16:32
dansmithbauzas: for ironic it seems to be "if this node exists at all" which hopefully should never happen for transient reasons16:32
dansmithfor libvirt it would potentially be a rename issue I think,16:33
dansmithwhich is a case where we might have been renamed and then accept a migration for a node we're not actually supposed to be hosting16:33
dansmithwhich is another good reason to make this strict16:33
dansmithlike, if I send a message to a compute node and say "yo dawg, I'm looking to migrate an instance into your does-not-exist node", the compute node will happily create a migration object for that node, and punt on the rest of everything and silently say "roger that"16:38
dansmithif I set CONF.host to something that matches another compute by accident, I'll do that for nodes they legit own16:38
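Paraphrasing the two preconditions just discussed (approximate, not the actual ResourceTracker.disabled() source): the node is considered "disabled" if the RT is not tracking it yet, or if the virt driver no longer reports it as available.

```python
# Approximate paraphrase of the two "disabled" preconditions -- not the actual nova source.
def disabled(nodename, tracked_nodes, driver_available_nodes):
    # 1) the RT has no ComputeNode entry for this node yet (e.g. a startup
    #    race), or
    # 2) the virt driver says the node is not available (e.g. an ironic node
    #    that vanished, or a renamed libvirt host).
    return (nodename not in tracked_nodes or
            nodename not in driver_available_nodes)

print(disabled("node-a", tracked_nodes=set(), driver_available_nodes={"node-a"}))       # True: startup race
print(disabled("node-a", tracked_nodes={"node-a"}, driver_available_nodes=set()))       # True: driver lost it
print(disabled("node-a", tracked_nodes={"node-a"}, driver_available_nodes={"node-a"}))  # False: healthy
```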
bauzasfor libvirt, yup16:40
dansmithright, I mean for libvirt16:40
bauzasagreed, this is a weird exception handling16:40
bauzaslike, you are sent to a compute service where the virt driver attached to it reports that it doesn't know the compute node target you may want to use16:41
bauzasthis ^ is for all drivers16:41
dansmithright16:42
dansmithand it's crazy to just silently accept16:42
dansmithsince we don't even create the migration context in the current form, I'm sure no migrations could be working in this case16:43
bauzasI also don't see any good reason for doing this, even with the FakeDriver16:44
dansmithack16:44
dansmithI'll update the comments there to reflect this and proceed unless you have any other checking you want to do16:44
bauzasdansmith: given all we discussed, my only open concern is, shouldn't we rather hardstop the migration call here?16:46
bauzasinstead of returning a noop ?16:46
dansmithbauzas: we should, but I just don't want to tangle that up with my effort to get these objects linked in the database16:46
dansmithbecause like I said, migration_context below is already skipped, and no migration can work in that case either, AFAIK16:47
dansmith(instance.migration_context, even though the migration object is created at the top)16:47
bauzasdansmith: that's the only concern I have, as said16:47
bauzasif we don't return an exception, then the migration will be accepted16:48
dansmithif we weren't creating the migration object at the top before we check for the node, I wouldn't even have had to ask about this, and it's not really related to the rest of the work16:48
dansmithbauzas: right but it already is16:48
bauzaseventually the instance status could be something like 'resize_confirm'16:48
bauzaswithout having any migration records attached to it16:48
dansmithbauzas: my point is that the create_migration() will proceed even with a missing node, so it's not like that is stopping us from returning the nopclaim16:49
bauzasand revert resize wouldn't work, since no migration record exists16:49
dansmithright but we can't even get that far without instance.migration_context right?16:49
bauzasI have to admit this is out of my knowledge16:50
dansmithactually, I think the conductor will fail earlier than that even,16:51
dansmithokay you know what.. maybe we're missing more history here16:53
dansmithI think maybe this _create_migration() is from before we had conductor-directed migrations?16:53
bauzashonestly, tomorrow I'll set up a 2-node devstack and try doing crazypants migrations16:53
dansmithso maybe we always pre-create the migration in the conductor (at least for resize) now?16:53
bauzasoh, this has surely existed for 9 years 16:53
dansmithso maybe we should actually fail here if migration is None and not fall back to creating the migration16:54
dansmith(need to check the other migration types though)16:54
bauzasfrom what I recall, this was refactored by ndipanov when he wrote the NUMA migrations16:54
bauzaswhich were something like Liberty IIRC16:54
bauzasor even Kilo16:55
bauzasway before we had the conductor engines managing the migration records (in Newton or Ocata IIRC)16:55
bauzasso yeah you may be right16:55
bauzasmaybe all those codepaths aren't walked anyway, if we fail earlier16:56
bauzasor maybe the conductor fails after16:56
dansmithI'm now realizing there is more in the conductor that needs to be node-id-focused16:56
dansmithyeah my concern is live migration16:56
dansmithand evac16:56
dansmithcross-cell definitely happens in the superconductor because it mirrors the source cell migration16:57
bauzasI think we still rely on conductor-based operations for allocations16:57
bauzasfor those two move ops16:57
dansmitheither way, the node id part is handled on the destination node, so it actually doesn't change anything16:57
dansmithwe'll call to the compute and it will stamp the nodename we gave it on the migration even if it doesn't know that node16:58
dansmiththe update half of that16:58
dansmithokay you're probably EOD at this point right?16:59
dansmithI have to run to something else for a bit, but maybe think on it a bit16:59
dansmithI guess I could also put a DNM patch to raise there and make sure at least we don't see it in any of our multinode CI tests17:00
bauzasdansmith: tbh, I was planning to do another pass on your spec17:01
bauzasI started it before the meeting17:01
dansmithyes, that would also be appreciated17:01
bauzasand then you puzzled me :)17:01
dansmithI updated it for some of this migration stuff last week17:01
bauzasso before I call it a day, I'll give it a round17:01
dansmithack17:02
bauzasand done17:06
bauzasthis was quick, I was at the migration object change17:06
bauzasUggla: you're next in the review list :)17:09
bauzasbut this will be done tomorrow 17:09
bauzassee ya folks17:09
opendevreviewDan Smith proposed openstack/nova master: Populate ComputeNode.service_id  https://review.opendev.org/c/openstack/nova/+/87990417:26
opendevreviewDan Smith proposed openstack/nova master: Add compute_id columns to instances, migrations  https://review.opendev.org/c/openstack/nova/+/87949917:26
opendevreviewDan Smith proposed openstack/nova master: Add dest_compute_id to Migration object  https://review.opendev.org/c/openstack/nova/+/87968217:26
opendevreviewDan Smith proposed openstack/nova master: Add compute_id to Instance object  https://review.opendev.org/c/openstack/nova/+/87950017:26
opendevreviewDan Smith proposed openstack/nova master: Online migrate missing Instance.compute_id fields  https://review.opendev.org/c/openstack/nova/+/87990517:26
opendevreviewDan Smith proposed openstack/nova master: DNM: Look for disabled node claims  https://review.opendev.org/c/openstack/nova/+/87968717:26
opendevreviewsean mooney proposed openstack/nova master: [WIP] add hypervisor version weigher  https://review.opendev.org/c/openstack/nova/+/88023119:25
opendevreviewKarl Kloppenborg proposed openstack/placement stable/yoga: chore: back-merge os-traits version update to placement in stable/yoga to support owner traits  https://review.opendev.org/c/openstack/placement/+/88024923:14
opendevreviewKarl Kloppenborg proposed openstack/placement stable/zed: chore: update stable/zed OS-TRAITS version to 2.10.0  https://review.opendev.org/c/openstack/placement/+/88025023:21
