Tuesday, 2023-02-21

*** ralonsoh_ooo is now known as ralonsoh07:32
opendevreviewJorge San Emeterio proposed openstack/nova master: WIP: Look for cpu controller on cgroups v2  https://review.opendev.org/c/openstack/nova/+/87312708:49
bauzasgibi: so a bit of heads-up 09:26
bauzasgibi: first, you may be interested in knowing what the logs tell us for the functests https://paste.opendev.org/show/bfvZX0XeKsELzY54EGb8/09:27
bauzasgibi: secondly, I created a docs patch and a fup for the cpu mgmnt series https://review.opendev.org/c/openstack/nova/+/874514 and https://review.opendev.org/c/openstack/nova/+/874515/09:28
bauzaseventually, I'll tell about the RC1 etherpad in the meeting https://etherpad.opendev.org/p/nova-antelope-rc-potential09:29
bauzaswe now have a LP tag for antelope rc09:29
opendevreviewJorge San Emeterio proposed openstack/nova stable/train: WIP: Fixing python-devel package for RHEL 8  https://review.opendev.org/c/openstack/nova/+/87454710:13
opendevreviewJorge San Emeterio proposed openstack/nova stable/train: Changing "python-devel" to "python3-devel" on bindep test requirements for RPM based distros.  https://review.opendev.org/c/openstack/nova/+/87454711:10
opendevreviewAlexey Stupnikov proposed openstack/nova stable/victoria: reenable greendns in nova.  https://review.opendev.org/c/openstack/nova/+/83343612:03
opendevreviewRajesh Tailor proposed openstack/nova master: Handle InstanceExists exception for duplicate instance  https://review.opendev.org/c/openstack/nova/+/86093812:39
*** ralonsoh is now known as ralonsoh_lunch12:51
*** ralonsoh_lunch is now known as ralonsoh13:31
opendevreviewJorge San Emeterio proposed openstack/nova stable/train: Indicate dependency on "python3-devel" for py3 based RPM distros.  https://review.opendev.org/c/openstack/nova/+/87454714:10
*** dasm|off is now known as dasm14:12
opendevreviewJorge San Emeterio proposed openstack/nova stable/train: Add binary test dependency "python3-devel" for py3 based RPM distros.  https://review.opendev.org/c/openstack/nova/+/87454714:12
opendevreviewJorge San Emeterio proposed openstack/nova stable/train: [stable-only] Add binary test dependency "python3-devel" for py3 based RPM distros.  https://review.opendev.org/c/openstack/nova/+/87454714:13
gibibauzas: I will have to drop around 17:30 during the nova weekly meeting 14:40
bauzasgibi: ack, np14:40
elodillesbauzas: are you editing the meeting page? let me know when i can update stable section14:43
bauzaselodilles: do it now14:44
bauzaselodilles: I'll add all the Bobcat plans and RC1 later14:44
elodillesbauzas: done14:45
bauzasall cool14:45
opendevreviewJorge San Emeterio proposed openstack/nova master: WIP: Look for cpu controller on cgroups v2  https://review.opendev.org/c/openstack/nova/+/87312714:45
elodillesbauzas: btw, have you seen this? https://review.opendev.org/c/openstack/releases/+/87445014:51
elodilles(i know that you are busy with everything o:))14:51
bauzaselodilles: yup, it's now in the RC1 etherpad14:51
bauzaswe'll discuss it in the meeting14:52
elodillesbauzas: ++14:52
opendevreviewMerged openstack/nova-specs master: Create specs directory for 2023.2 Bobcat  https://review.opendev.org/c/openstack/nova-specs/+/87206815:39
bauzas#startmeeting nova16:00
opendevmeetMeeting started Tue Feb 21 16:00:38 2023 UTC and is due to finish in 60 minutes.  The chair is bauzas. Information about MeetBot at http://wiki.debian.org/MeetBot.16:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.16:00
opendevmeetThe meeting name has been set to 'nova'16:00
Ugglao/16:00
bauzashey folks, hola everyone16:00
dansmitho/16:01
bauzas#link https://wiki.openstack.org/wiki/Meetings/Nova#Agenda_for_next_meeting16:01
*** artom_ is now known as artom16:01
elodilleso/16:01
bauzaslet's start, some people have to leave early16:01
bauzas#topic Bugs (stuck/critical) 16:01
bauzas#info No Critical bug16:02
bauzas#link https://bugs.launchpad.net/nova/+bugs?search=Search&field.status=New 16 new untriaged bugs (-1 since the last meeting)16:02
gibio/16:02
bauzasauniyal helped me with triage16:02
bauzasI created an etherpad16:02
bauzasand I have a bug I'd like to discuss with you folks16:02
bauzas#link https://etherpad.opendev.org/p/nova-bug-triage-2023021416:02
bauzasthe bug in question :16:03
bauzas#link https://bugs.launchpad.net/nova/+bug/200677016:03
bauzasas you see, i did set it to Opinion16:03
bauzastl;dr: this is about our ip query param for instances list16:03
bauzaswe directly call Neutron to get the ports16:03
bauzasit basically works, but the reporter had some concerns16:04
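[Editor's note: the lookup being described — list the Neutron ports matching the ip filter, then keep the instances that own those ports — can be sketched roughly as follows. The data shapes and the helper name are hypothetical illustrations, not Nova's actual code.]

```python
# Minimal sketch of filtering an instance list by IP via port data:
# collect the device_ids (instance uuids) of ports whose fixed IPs
# match the filter, then keep only those instances.
def filter_instances_by_ip(instances, ports, ip_substring):
    """Return instances owning a port whose IP contains ip_substring."""
    matching_device_ids = {
        port["device_id"]
        for port in ports
        for fixed_ip in port["fixed_ips"]
        if ip_substring in fixed_ip["ip_address"]
    }
    return [inst for inst in instances if inst["uuid"] in matching_device_ids]
```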
bauzasdo people want to discuss this bug now or later ? 16:05
bauzas(we can discuss it in the open disc topic if we have time)16:05
bauzaslet's say later then :)16:06
bauzas(people can lookup the bug if they want meanwhile)16:06
dansmithopinion seems right to me :)16:06
bauzaslet's discuss this then later in the open discussion topic16:06
bauzasso people will have time16:06
bauzasmoving on16:06
bauzas#info Add yourself in the team bug roster if you want to help https://etherpad.opendev.org/p/nova-bug-triage-roster16:07
bauzasUggla: works for you to get the baton this week ?16:07
Ugglabauzas, ok16:07
bauzasack16:07
bauzas#info bug baton is being passed to Uggla16:07
bauzas#topic Gate status 16:07
bauzas#link https://bugs.launchpad.net/nova/+bugs?field.tag=gate-failure Nova gate bugs 16:07
bauzasbut the best is to track the etherpad16:07
bauzas#link https://etherpad.opendev.org/p/nova-ci-failures16:08
bauzasit was a dodgy week16:08
dansmithso,16:08
dansmiththis got merged: https://review.opendev.org/c/openstack/devstack/+/87364616:08
dansmithwhich seems to allow halving the memory used by mysqld16:08
bauzashaha, gtk 16:09
dansmithwhich may help with the OOM issues we see, especially in the fat jobs like ceph-multistore16:09
dansmithwe could enable that in our nova-ceph-multistore job if we want to be on the leading edge and try to make sure that it's actually helping16:09
dansmith(it's opt-in right now)16:09
bauzasindeed, I'll double check later if we continue to have some OOM issues16:09
bauzasah my bad16:09
dansmithwe could remove it if it causes other problems, but.. might be good to try it16:09
bauzassurely16:10
bauzasdansmith: thanks for having worked on it16:10
bauzasdansmith: I can write a zuul patch for nova 16:10
dansmithI can do it too, just wanted to socialize16:10
bauzasdansmith: ack cool then, ping me for reviews16:10
dansmithack16:11
bauzas++ again16:11
bauzasdansmith: I also need to look at all the Gerrit recheck comments I wrote last week16:12
bauzasI maybe found some other races16:12
bauzasbut we'll see16:12
bauzaswe also have the OOM logger patch that was telling us a few things16:13
bauzashttps://paste.opendev.org/show/bfvZX0XeKsELzY54EGb8/16:13
bauzasbut let's discuss this off-meeting16:13
* gibi had no time to look at the extra logs from the functional race16:14
bauzasgibi: basically, each of the 6 failures with logs had a different functest16:15
gibicool, that can serve as a basis for a local repro16:15
bauzasanyway, moving on16:15
sean-k-mooney1bauzas: they are all in the libvirt test suite16:15
sean-k-mooney1so likely all have the same common issue16:16
sean-k-mooney1but ya lets move on16:16
bauzasmaybe, I didn't have time yet to look at the code 16:16
bauzas#link https://zuul.openstack.org/builds?project=openstack%2Fnova&project=openstack%2Fplacement&pipeline=periodic-weekly Nova&Placement periodic jobs status16:16
bauzasall of them are green ^16:16
bauzas#info Please look at the gate failures and file a bug report with the gate-failure tag.16:16
bauzas#info STOP DOING BLIND RECHECKS aka. 'recheck' https://docs.openstack.org/project-team-guide/testing.html#how-to-handle-test-failures16:16
bauzasthat's it16:16
bauzas#topic Release Planning 16:16
bauzas#link https://releases.openstack.org/antelope/schedule.html16:16
bauzasso we're now on Feature Freeze16:17
bauzas#link https://etherpad.opendev.org/p/nova-antelope-blueprint-status Blueprint status for 2023.116:17
bauzasyou can see what we merged16:17
bauzasI also created two changes for my own series that were asked16:17
bauzasbut I'll ping folks tomorrow about them16:17
bauzas#info Antelope-rc1 is in 1.5 weeks16:17
bauzasnow, we need to prepare for our RC1 where we branch master16:18
bauzas#link https://etherpad.opendev.org/p/nova-antelope-rc-potential16:18
bauzasas you see ^ I created an etherpad16:18
bauzasthanks btw. again takashi for creating some changes that are needed16:18
bauzasas a reminder, if people find some bugs, they can use a specific tag :16:19
bauzashttps://bugs.launchpad.net/nova/+bugs?field.tag=antelope-rc-potential16:19
bauzasbefore RC1, any bug report can use this tag, but we prefer to make sure they are regressions16:19
bauzasafter RC1, only regressions should use this tag16:20
bauzasI created a cycle highlights change too : 16:20
bauzas#link https://review.opendev.org/c/openstack/releases/+/874483 Cycle highlights for Nova Antelope16:20
bauzasplease review it 16:20
bauzasat least gibi, dansmith, artom and other folks that were having merged changes16:21
gibiack16:21
bauzasI'll +1 on Thursday16:21
bauzaswe need to merge this before this Thursday for the Foundation market folks16:22
bauzaswe also have https://review.opendev.org/c/openstack/releases/+/874450 to +116:22
bauzasI guess we're done with our clients16:22
bauzasso I'll branch os-vif, osc-placement and novaclient unless people have concerns16:23
bauzasas you see in the commit msg, it will be merged eventually on Friday16:23
* bauzas will just verify the SHA116:24
elodillesor earlier if a release liaison +1s it16:24
bauzaselodilles: yup, but I'm asking people if they have concerns 16:24
sean-k-mooney1speaking of which, i'm not sure if i will have time to continue to do that16:24
bauzaslooks not, so I'll just verify the SHA1s before +1ing16:24
sean-k-mooney1i may leave myself in for this cycle if no one else wants to take on that role16:24
bauzassean-k-mooney1: yup, I know and I was planning to ask you16:25
sean-k-mooney1but i am not sure of my availability to keep an eye on it this cycle16:25
*** sean-k-mooney1 is now known as sean-k-mooney16:25
bauzasok, so maybe it's not time yet to ask if someone else wants to be a release liaison16:25
bauzasbut I'll officially ask it next week16:25
sean-k-mooneyok16:25
bauzaswe can have more than one release liaison btw.16:26
bauzasno need to remove you before someone arrives or something like that16:26
sean-k-mooneyack the primary role is to reduce the bus factor and ensure that releases are done correctly and in a timely fashion so it does not all fall on the PTL16:26
bauzasand we can even have *two* liaisons if we really find *two* people wanting to be :)16:26
bauzasno need to battle :po16:26
bauzasI'll explain next week what a release liaison is and what they do16:27
bauzasbut if people want, they can DM me 16:27
bauzasbefore next meeting16:27
bauzas#info If someone wants to run as a Nova release liaison next cycle, please ping bauzas16:28
bauzasI think that's it for the RC1 agenda16:28
bauzasoh16:28
bauzasone last thing16:28
bauzasthanks to takashi, https://review.opendev.org/c/openstack/nova-specs/+/872068 is merged16:29
bauzasyou can now add your specs for Bocat16:29
bauzasBobcat even16:29
bauzaslike, people who had accepted specs for Antelope can just repropose them for Bobcat and I'll quickly +2/+W directly if nothing changes between both spec files16:30
* bauzas tries to not eye at folks16:30
bauzasI'll do the Launchpad Bobcat magic later next week (I guess)16:31
bauzasthat's it this time16:31
bauzas#topic vPTG Planning 16:31
bauzasas a weekly reminder :16:31
bauzas#link https://www.eventbrite.com/e/project-teams-gathering-march-2023-tickets-483971570997 Register your free ticket16:31
* gibi needs to drop, will read back tomorrow16:31
bauzasmaybe you haven't seen but we are officially a PTG team 16:31
bauzasI don't know yet how long we could run the vPTG sessions16:32
bauzasbut like every cycle, I'll ask your opinions about the timing16:32
bauzasnot today, but once I'm asked16:32
bauzasgood time for saying16:32
bauzas#link https://etherpad.opendev.org/p/nova-bobcat-ptg Draft PTG etherpad16:33
bauzasI feel alone with this etherpad ^16:33
bauzasand I'm sure people have topics they want to discuss16:33
bauzasanyway, moving on16:34
bauzas(just hoping people read our meeting notes)16:34
bauzas#topic Review priorities 16:34
bauzas#link https://review.opendev.org/q/status:open+(project:openstack/nova+OR+project:openstack/placement+OR+project:openstack/os-traits+OR+project:openstack/os-resource-classes+OR+project:openstack/os-vif+OR+project:openstack/python-novaclient+OR+project:openstack/osc-placement)+(label:Review-Priority%252B1+OR+label:Review-Priority%252B2)16:34
bauzas#info As a reminder, cores eager to review changes can +1 to indicate their interest, +2 for committing to the review16:34
bauzas#topic Stable Branches 16:35
bauzaselodilles: your turn16:35
elodilles#info stable gates seem to be OK (victoria gate workaround has landed and it is now unblocked)16:35
elodilleswell, unblocked16:35
elodillesthough it's not everywhere easy to merge in patches16:35
elodilles#info stable branch status / gate failures tracking etherpad: https://etherpad.opendev.org/p/nova-stable-branch-ci16:36
bauzasindeed16:36
elodillesthat's the short summary16:36
bauzasI still have the ussuri CVE VMDK fix to be merged16:36
bauzasI rechecked it a few times16:36
bauzaselodilles: thanks for the notes16:37
elodillesnp16:37
bauzas#topic Open discussion 16:37
bauzasso, nothing on the agenda16:37
bauzaswe can discuss https://bugs.launchpad.net/nova/+bug/2006770 if people want or close the meeting16:37
bauzasthe fact is, I wrote Opinion16:37
bauzasunless people have concerns with what I wrote, I'm done.16:38
bauzaslooks not16:39
bauzasthen I assume we're done.16:39
dansmith++16:40
bauzasthanks all16:40
bauzas#endmeeting16:40
opendevmeetMeeting ended Tue Feb 21 16:40:23 2023 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)16:40
opendevmeetMinutes:        https://meetings.opendev.org/meetings/nova/2023/nova.2023-02-21-16.00.html16:40
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/nova/2023/nova.2023-02-21-16.00.txt16:40
opendevmeetLog:            https://meetings.opendev.org/meetings/nova/2023/nova.2023-02-21-16.00.log.html16:40
elodillesthanks o/16:40
dansmithbauzas: so, gmann and I were running that memory usage patch in periodic on tempest jobs for a few days to make sure it didn't substantially worsen things16:40
dansmithand my survey at the moment indicates that it looks good16:41
dansmithso I'll propose to make it enabled for ceph-multistore (which will also impact glance) and we'll see if gmann is cool with that when he's around16:41
bauzasnice to hear16:41
bauzasack, do it and I'll vote16:41
opendevreviewDan Smith proposed openstack/nova master: Use mysql memory reduction flags for ceph job  https://review.opendev.org/c/openstack/nova/+/87466416:45
dansmithbauzas: ^16:45
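[Editor's note: a sketch of what opting the job in might look like, assuming the `MYSQL_REDUCE_MEMORY` devstack variable added by the linked devstack change; the job layout below is illustrative, not the actual proposed patch.]

```yaml
# Hypothetical zuul variant of the nova-ceph-multistore job enabling the
# devstack mysql memory reduction flag (opt-in at this point in time).
- job:
    name: nova-ceph-multistore
    vars:
      devstack_localrc:
        MYSQL_REDUCE_MEMORY: true
```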
bauzasdansmith: I doubt that cells_v2 map_instances could work with https://bugs.launchpad.net/nova/+bug/2007922 (even though I asked for it)17:43
bauzasdansmith: tl;dr: the instance mapping exists but the cell value is None17:43
bauzasand we know the instance is in cell0 DB17:43
dansmithyeah, as I said, I initially missed that the person said they had reference in the mappings table17:43
bauzasdansmith: I guess the simplest thing is to hack the DB to add the cell0 uuid in the instancemapping record, nope ?17:43
dansmithprobably17:44
bauzasor do we have a better nova-manage command ? 17:44
bauzaslooking at the docs, nope17:44
dansmithnot that I know of17:44
bauzasthis instance somehow sits in the middle 17:44
bauzasnot fully migrated but in between17:44
dansmithnot fully ... mapped?17:44
bauzassorry, yeah mapped17:45
bauzasI'll propose the ALTER to the reporter17:45
dansmithdon't we have a mapped flag on the instance (or something else)?17:47
bauzasin the instances table you mean ?17:48
dansmithI thought it was.. that's how we survey instances that need to be mapped right?17:48
* bauzas just checks the map_instances code17:49
dansmithjust wondering if that flag matches or not17:49
bauzasso17:52
bauzashttps://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L87417:52
bauzaswe just iterate over a limit and a marker on the instances table from a cell that's given17:53
dansmithah right17:54
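[Editor's note: the limit/marker iteration described here can be illustrated with a stand-alone sketch; the data shapes are hypothetical, and the real `map_instances` additionally persists its marker so the command can resume.]

```python
# Stand-alone illustration of limit/marker pagination, the pattern
# map_instances uses over a cell's instances table: fetch up to `limit`
# rows past the marker, process them, advance the marker, repeat.
def iterate_with_marker(rows, limit):
    """Yield pages of `rows` (dicts with a 'uuid' key), `limit` at a time."""
    rows = sorted(rows, key=lambda r: r["uuid"])
    marker = None
    while True:
        start = 0
        if marker is not None:
            start = next(i for i, r in enumerate(rows) if r["uuid"] == marker) + 1
        page = rows[start:start + limit]
        if not page:
            return
        yield page
        marker = page[-1]["uuid"]  # the real command persists this marker
```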
bauzasand I think I understand how the cell ID was set to None17:54
bauzashttps://github.com/openstack/nova/blob/439c67254859485011e7fd2859051464e570d78b/nova/objects/instance_mapping.py#L7317:54
dansmithit only does that if it's not none though17:55
bauzasanyway, map_instances *could* work with cell017:55
bauzasif the reporter runs map_instances with cell0 attribute, it will loop over the contents of cell0's instances table and will create an instancemapping object17:56
bauzasoh wait, fuck no17:56
bauzashttps://github.com/openstack/nova/blob/master/nova/cmd/manage.py#L791-L79217:56
bauzasso, definitely the easiest is to alter the db17:56
dansmithagain, I only thought it was useful to run map if the mapping didn't exist18:00
bauzasyup18:03
bauzasor the reporter could then delete the instance mapping18:03
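[Editor's note: the hand-edit being proposed would look roughly like this against nova's API database. A sketch only — take a backup first; the instance uuid is a placeholder, and the all-zeros uuid is cell0's well-known uuid.]

```sql
-- Point the broken instance_mappings row at cell0 by looking up
-- cell0's row in cell_mappings (instance uuid is a placeholder).
UPDATE instance_mappings
   SET cell_id = (SELECT id FROM cell_mappings
                  WHERE uuid = '00000000-0000-0000-0000-000000000000')
 WHERE instance_uuid = '<instance-uuid>';
```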
sean-k-mooneybauzas: elodilles  can we prioritise review of this if possible https://review.opendev.org/c/openstack/nova/+/87454718:10
opendevreviewTakashi Natsume proposed openstack/placement master: Move implemented specs for Xena and Yoga release  https://review.opendev.org/c/openstack/placement/+/85373018:10
sean-k-mooneythis will help us fix our downstream ci 18:11
bauzasdone but I leave you +W as I don't have a lot of context18:13
sean-k-mooneytl;dr we use bindep in our downstream jobs to install deps before running tox but rhel 8 no longer has python-devel18:15
gmanndansmith: +W on 'mysql memory reduction flags for ceph job'18:15
sean-k-mooneyi wanted to check with elodilles to make sure they were ok with the stable-only change18:15
dansmithgmann: cool18:16
dansmiththanks18:16
dansmithgmann: oh jeez, I didn't realize the mysql periodic thing hadn't landed yet18:24
dansmithgmann: do you think we should wait for that to soak for a bit?18:24
dansmithI know the devstack one did, and I guess I misread that the tempest one hadn't yet18:24
gmanndansmith: I also did not realize it when I checked that patch.  but I think it is ok to enable it in ceph job and see. 18:26
gmannwe can always revert it if it fails and delays things during release time18:26
dansmithokay, that's my preference too18:26
mnaseri got a fun one.  it looks like by default nova saves the az of the vm in the cell db, but it doesn't update the request_spec, but when we do migrations, we pass the request_spec to the scheduler (which contains az=null) which then moves you from one az to another in the migration18:44
mnasersince .. https://github.com/openstack/nova/blob/90e2a5e50fbf08e62a1aedd5e176845ee22d96c9/nova/scheduler/request_filter.py#L138-L166 checks for request_spec az18:45
sean-k-mooneymnaser: this was changed recently18:45
mnaserthis is in a scenario where an operator wants to make vms stick to their az if a user doesn't specify one18:45
sean-k-mooneyright, so we spent a lot of time trying to decide what the semantics should be18:46
sean-k-mooneyi'm trying to find the spec18:46
sean-k-mooneyhttps://specs.openstack.org/openstack/nova-specs/specs/zed/implemented/unshelve-to-host.html18:47
sean-k-mooneyi guess this was for unshelve18:47
sean-k-mooneymnaser: we expect that the request spec would not have the az, by the way, if the user did not request one18:48
mnasermakes sense because that's their request18:48
mnaseri understand it might not be everyone that wants this, but maybe for the live migration use case it can cause issues if nova ends up doing cross-az migrations18:49
sean-k-mooneyfor move operations where we support specifying an AZ it would be ok in some cases to set it in the request spec18:49
sean-k-mooneymnaser: but we would want to have the same behaviour as in the unshelve spec18:50
sean-k-mooneyi don't recall if we fixed the other move operations to be consistent with that when we did this18:50
sean-k-mooneyUggla: do you recall18:50
sean-k-mooneymnaser: with unshelve to a specific az, if you set it in the unshelve request and it was not set in the original request spec, it will be set afterwards18:51
sean-k-mooneymnaser: live migration does not currently support an az18:52
mnasersean-k-mooney: essentially i'm thinking this is where this can be changed https://github.com/openstack/nova/blob/f01a90ccb85ab254236f84009cd432d03ce12ebb/nova/compute/api.py#L5499-L550018:52
mnasercause live migrating from one az to another could pretty much fail, and we can just have it as an option i guess if we don't want to change the default behaviour18:53
sean-k-mooneynor does migrate18:53
mnaserin most worlds migrate or live migrate will fail across az's18:53
mnaseresp if you're using different storage backends for example18:53
sean-k-mooneymnaser: this would be an api change and need a spec18:53
sean-k-mooneyin general live migration between AZs will either work or not work depending on your deployment. in general i would expect it to work in most cases18:54
sean-k-mooneyit just comes down to whether you have exchanged ssh keys such that the hypervisors can communicate and whether you are using AZs with cinder or not18:55
sean-k-mooneyand cross_az_attach18:55
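[Editor's note: the option referred to here is nova.conf's `[cinder] cross_az_attach`, which defaults to True, i.e. cross-AZ volume attach allowed.]

```ini
# nova.conf: refuse to attach a volume from a different
# availability zone than the instance's.
[cinder]
cross_az_attach = False
```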
mnaserMaybe we can make it so that if cross_az_attach = false then it would update the request spec to match?18:56
sean-k-mooneyno18:56
sean-k-mooneyno config-driven api behavior18:56
sean-k-mooneythis is not a bug18:56
sean-k-mooneyif we want to support a move operation to target an AZ or change the request spec, this is an api change18:56
mnaserNo it’s not to target an AZ18:56
sean-k-mooneyi know you want to prefer to keep affinity18:57
mnaserit's so that it stays in the same AZ, as otherwise the live migration will fail18:57
sean-k-mooneylike a weigher or filter18:57
sean-k-mooneyhowever that is not what the end user asked for18:57
mnaserif nova allows you to live migrate from one az to another for a vm with cross_az_attach set to false, is that a bug ?18:58
sean-k-mooneynot a scheduler bug18:58
sean-k-mooneyit will fail in pre-live-migrate18:59
sean-k-mooneyand the vm will stay in active on the host18:59
sean-k-mooney(source host)18:59
mnasernow if you're using rbd for images_type and you have 2 clusters with each az using a different cluster18:59
mnaserAnd you do a live migrate and end up with the vm running on the other side but using its original storage19:00
sean-k-mooneythen you need to configure your filters to ensure that you target the vms to a specific cluster using a flavor or similar19:00
mnaserAnd then on resize ops it blows up horribly because it's trying to use the destination cluster id19:00
sean-k-mooneyyep, that's operator error if they did not configure things properly to prevent this19:01
sean-k-mooneyaddressing these use cases is something that could be done, but it would be a feature, not a bug19:01
mnaserHow?  So if you have 3 azs you create 3 flavors?19:01
sean-k-mooneyyep19:01
mnaserDo you think that's user friendly at all19:01
sean-k-mooneynope but it's how it's currently designed19:02
sean-k-mooneyand fixing it would not be a bug fix19:02
mnaserSo really what you're saying is nova will do live migrations that will break your vm19:02
mnaserAnd that's not a bug19:02
sean-k-mooneynope19:02
sean-k-mooneynova checks if it can attach the volumes to the selected host before it live migrates19:03
sean-k-mooneyso it will pass the scheduler but fail in pre live migrate19:03
mnaserok, let's put that aside and talk about the users who use images_type=rbd19:03
mnaserwith different az's19:03
mnaserit will break those vms19:03
sean-k-mooneyalso live migrate is an admin only api and we allow you as an admin to select the host19:03
mnaserok when we're deploying openstack for customers they don't expect to sit and decide which host they are going to move things onto at scale19:04
sean-k-mooneymnaser: not if you use cross_az_attach=false19:04
mnaserif i tell them 'sorry, openstack is kinda silly, it picks the wrong hosts, you just pick the right host yourself instead'19:04
mnasernon-bfv, images_type=rbd, 2 az's with a ceph cluster each will result in broken live migrations19:04
sean-k-mooneyif you want to propose a new feature for this i'm open to reviewing that19:04
sean-k-mooneywhat i do not think would be correct is considering this a bug when we previously declared it out of scope, and backporting this19:05
sean-k-mooneymnaser: it would break if the ceph cluster was inaccessible, yes19:05
sean-k-mooneyalthough i believe19:06
sean-k-mooneythe vm would stay running on the source host in active19:06
sean-k-mooneywith the migration in error19:06
mnaserand any reasonable operator would make a sane assumption that the cloud would not live migrate across az's19:06
sean-k-mooneylibvirt will detect that the qemu instance was not able to connect19:06
mnaserthe vm does migrate if the cluster is accessible, and then all further operations like resize/migrate/etc are broken19:06
sean-k-mooneyand it should abort the migration19:06
mnaserso it goes into a user-facing broken state19:06
sean-k-mooneyazs are not fault domains19:07
sean-k-mooneyor isolated segments19:07
sean-k-mooneymnaser: i do not believe you will get into a user facing broken state for live migration19:07
mnaseryou will.. if both ceph clusters are accessible, then the further operations will try to use the fsid of the target vm19:07
mnaseri can ask to get tracebacks and logs from the customer19:08
mnaserbut it makes sense since now it's trying to use the _new_ cluster fsid, but doesn't find the volume, since it's attached from the old cluster fsid19:08
sean-k-mooneyif both are accessible and you only have ceph creds for one of them on the compute host then qemu will not be able to connect19:08
sean-k-mooneymnaser: that sounds like they are trying to use the same user/keyring between both clusters19:09
mnaserok, assume one cluster with different pools when you're using ceph then19:09
sean-k-mooneywhich is incorrect19:09
mnaseri haven't dug that deep into their stuff19:09
mnasernow when nova tries to do things it'll do it on the new pool but can't find that _disk image19:10
sean-k-mooneywhich will fail when we try to create the qemu instance on the dest19:10
sean-k-mooneybut the migration should abort then19:10
mnaserisnt the old xml get transferred19:11
sean-k-mooneyand the vm should stay running on the source node in active19:11
mnaserso it successfully completes?19:11
mnasers/isnt/doesnt/19:11
sean-k-mooneyno, the vm gets created really really early on the dest 19:11
mnaseri don't think we rebuild the xml from scratch on the target but rather rely on shipping the xml from the old libvirt to the new one?19:11
sean-k-mooneywe have to create the vm on the dest so that the ram can be copied 19:11
sean-k-mooneymnaser: we generate a new xml on the source for the dest19:11
mnaserok something is not adding up then19:12
sean-k-mooneyso my expectation is that it should use the old cluster19:12
sean-k-mooneyso you would have cross az traffic19:12
mnaseroh ok right yes, it would add up nevermind19:12
mnaserif we generate the xml on the source for the dest it'll have the old19:12
sean-k-mooneywhat might break is a hard reboot after that19:12
mnaseryes exactly, or resize, etc19:12
sean-k-mooneyright but that's a completely different issue19:13
sean-k-mooneywe do not support move operations across different storage backends at all19:13
sean-k-mooneyand preventing that is left to the operator today; it has always been that way in nova19:13
mnaserso as someone who's trying to get people to use openstack, this is giving them a big gun to shoot themselves in the foot19:13
mnaserand then when they do that, it doesn't seem very trivial and obvious that what they did is wrong19:14
sean-k-mooneymnaser: the simpler approach is to use cells19:14
mnaserwhen they went ahead, created az, aggregates, etc19:14
sean-k-mooneywe do not allow cross cell live migration19:14
mnaserthat's a really good point19:14
mnaserso ensure the same storage backend inside a cell19:14
mnaserseems like pretty sane advice19:15
sean-k-mooneyyes19:15
sean-k-mooneywith all that said we could work on a feature to address this19:15
sean-k-mooneybut it would be a new feature and it would have to still allow use cases where cross az move operations make sense19:15
sean-k-mooneymnaser: for example we recently added a similar feature for neutron routed networks19:16
sean-k-mooneyhttps://specs.openstack.org/openstack/nova-specs/specs/wallaby/implemented/routed-networks-scheduling.html19:16
mnasersometimes i really feel letting users create az's was a massive mistake lol19:16
sean-k-mooneywell users can't19:17
mnaserit was always so loose and there are so many people who get shot in the foot with it19:17
mnasernah i mean from an operator perspective19:17
sean-k-mooneyit's admin only unless you change the policy19:17
mnaserpeople build out something and then it almost never gives them what they want19:17
sean-k-mooneyoh well the issue is people confuse nova azs with aws19:17
sean-k-mooneyand they are nothing like each other19:17
mnaseryeah19:17
sean-k-mooneyso before wallaby there was no scheduler support for routed l3 networks19:18
mnaseraws has a strong presence so it's natural to think of it that way19:18
sean-k-mooneyi.e. there was nothing preventing you from cold/live migrating to a host where that ip could not be routed19:18
sean-k-mooneyhttps://specs.openstack.org/openstack/nova-specs/specs/wallaby/implemented/routed-networks-scheduling.html added support for this19:18
sean-k-mooneyit would not be unreasonable to have a similar feature for nova storage19:19
sean-k-mooneyfor example if we used the ceph fsid to create a placement aggregate containing all hosts that were configured to use that ceph cluster19:19
sean-k-mooneyand then recorded that in the instance_system_metadata and scheduled based on that if set19:19
sean-k-mooneywe would just need to do member_of=<fsid> in the placement query19:20
mnaseryeah, that seems like a handy simple way to track that for ceph19:20
sean-k-mooneyif rbd_fsid was in instance_system_metadata19:21
mnaseri guess we would technically toss that into the block device mapping data19:21
mnaseri can't remember if nova uses that for its own storage19:21
sean-k-mooneyish19:21
sean-k-mooneywe do in weird ways19:22
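[Editor's note: the member_of idea above maps onto the existing placement API — since microversion 1.21, GET /allocation_candidates accepts a member_of aggregate filter. A sketch with placeholder endpoint, token and aggregate uuid; the fsid-to-aggregate mapping itself is the hypothetical part.]

```shell
# Hypothetical query restricting candidates to hosts in the aggregate
# derived from the ceph fsid (requires a valid token and endpoint).
curl -s \
  -H "X-Auth-Token: $TOKEN" \
  -H "OpenStack-API-Version: placement 1.21" \
  "$PLACEMENT_URL/allocation_candidates?resources=VCPU:1,MEMORY_MB:512&member_of=<aggregate-uuid>"
```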
mnasermaybe we should add a warning to the doc https://docs.openstack.org/nova/latest/admin/availability-zones.html about looking into using cells if you want full isolation and to not allow migrations from one az to another19:22
sean-k-mooneybut this would be for the root disk really, although you could map cinder volumes to placement aggregates in a similar way19:22
sean-k-mooneybauzas: when you have time, reading back over ^ would be good19:23
mnaseri'll push a PR to add some details about migrations and bring up cells19:23
sean-k-mooneymnaser: cells are still not full isolation but ya.19:24
mnaseri have to be honest about my ability to provide help; spec + new feature discussion + all that is a bit too far of a reach for this19:24
sean-k-mooneymnaser: the other approach would be to have a weigher19:24
sean-k-mooneyso an az affinity weigher19:24
mnaserhmm19:24
mnaseri could do that out of tree i guess19:24
mnaseras i don't think nova particularly would want to carry that19:24
sean-k-mooneywe would need to pass the instance's current cell to the scheduler and then the weigher could prefer to stay in the same az19:25
sean-k-mooneyam, i would not be against having it19:25
sean-k-mooneywe would need to modify the destination object and add a preferred az field or something19:25
mnaseri guess it could be a filter too but it would be very ugly 19:26
mnasercause it would have to check if this is a reschedule (aka the instance exists and we can find it) or first time (ignore)19:27
sean-k-mooneywell it should not be a filter because cross az move operations are valid19:27
mnaserah yes, also addressing that19:28
sean-k-mooneymnaser: basically we could add a "current_az" field here https://github.com/openstack/nova/blob/master/nova/objects/request_spec.py#L1092-L112219:28
mnaserthis starts to enter the domain of requiring more resources/time than i have, so trying to see how i can be the most useful with the little resource i can spend on this 😅19:29
sean-k-mooneythat's used in a few places but we basically would just need to get the instance.az and pass it on19:29
sean-k-mooneywell the simple solution is a docs patch + ptg topic19:29
sean-k-mooneyand i can raise it as an "operator pain point" internally and see if there is interest in addressing it19:30
sean-k-mooneyalthough i think we likely won't have time in the next cycle to work on this19:30
sean-k-mooneythere are potentially 2 features here: an az affinity weigher, and reporting ceph cluster reachability to placement19:31
sean-k-mooneyboth help usability in different ways19:32
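[Editor's note: a stand-alone sketch of the az-affinity weigher idea discussed above. Purely illustrative — a real nova weigher would subclass nova.scheduler.weights.BaseHostWeigher and get the az from the RequestSpec; the class name and data shapes here are hypothetical.]

```python
# Toy AZ-affinity weigher: hosts in the instance's current AZ get weight
# 1.0, others 0.0, so same-AZ hosts sort first while cross-AZ moves stay
# possible -- a weigher rather than a filter, exactly because cross-AZ
# move operations are valid.
class AZAffinityWeigher:
    def weigh_hosts(self, hosts, current_az):
        """Return hosts sorted with same-AZ hosts first (stable order)."""
        return sorted(
            hosts,
            key=lambda h: 1.0 if h["az"] == current_az else 0.0,
            reverse=True,
        )
```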
*** dasm is now known as dasm|off19:56
simondodsleyIn Train when I try to volume migrate a boot volume of a shutdown instance I get the message `Cannot 'swap_volume' instance xyx while it is in vm_state stopped`21:36
simondodsleyIs there any way to do this?21:36
simondodsleyOr was this something added after Train21:36
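[Editor's note: the quoted error comes from nova's compute API guarding operations by vm_state. A stand-alone sketch of that guard pattern — not nova's actual check_instance_state decorator; the names, allowed states, and data shapes here are hypothetical.]

```python
import functools

# Toy vm_state guard mirroring the quoted error: refuse the operation
# unless the instance is in one of the allowed states.
def check_vm_state(allowed):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(instance, *args, **kwargs):
            if instance["vm_state"] not in allowed:
                raise ValueError(
                    "Cannot %r instance %s while it is in vm_state %s"
                    % (fn.__name__, instance["uuid"], instance["vm_state"])
                )
            return fn(instance, *args, **kwargs)
        return wrapper
    return decorator

@check_vm_state(allowed={"active", "paused"})
def swap_volume(instance, old_volume, new_volume):
    # placeholder for the actual volume swap
    return (old_volume, new_volume)
```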

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!