Wednesday, 2021-09-29

opendevreviewGhanshyam proposed openstack/nova master: DNM testing grenade neutron-trunk fix  https://review.opendev.org/c/openstack/nova/+/81111800:51
opendevreviewGhanshyam proposed openstack/nova stable/xena: DNM: Testing nova-grenade-multinode with neutron-trunk  https://review.opendev.org/c/openstack/nova/+/81149101:02
opendevreviewGhanshyam proposed openstack/nova stable/wallaby: DNM: Testing nova-grenade-multinode with neutron-trunk  https://review.opendev.org/c/openstack/nova/+/81151301:03
opendevreviewHang Yang proposed openstack/nova master: Support creating servers with RBAC SGs  https://review.opendev.org/c/openstack/nova/+/81152101:34
opendevreviewmelanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes  https://review.opendev.org/c/openstack/nova/+/80662901:48
opendevreviewFederico Ressi proposed openstack/nova master: Check Nova project changes with Tobiko scenario test cases  https://review.opendev.org/c/openstack/nova/+/80685301:52
opendevreviewFederico Ressi proposed openstack/nova master: Debug Nova APIs call failures  https://review.opendev.org/c/openstack/nova/+/80668301:56
opendevreviewGhanshyam proposed openstack/nova stable/victoria: DNM: Testing nova-grenade-multinode with neutron-trunk  https://review.opendev.org/c/openstack/nova/+/81154004:10
bauzasgood morning Nova06:43
dekeHi08:16
dekeHas there been any discussion about implementing a MAAS driver for nova to enable ironic-like functionality for users who have deployed openstack with juju on MAAS?08:16
bauzasdeke: I haven't heard anything about this08:36
bauzasdeke: tbh, it would be a laaaarge discussion, right?08:36
dekeyea it would08:36
dekeit was just a thought I had08:36
dekeif we are already deploying on top of a baremetal provisioning service08:37
dekethen why deploy another baremetal provisioning service08:37
bauzaswe are not saying "no" for a new virt driver08:39
bauzasbut for having a new virt driver upstream, that means we need to discuss how to use it and how we could verify it by the CI08:40
bauzasthat means large discussions during a lot of PTGs + making sure we at least have a third-party CI 08:40
lyarwooddansmith: https://review.opendev.org/c/openstack/grenade/+/811117 - looks like you're the only remaining active core on grenade, can you ack this when you get online to unblock Nova's gate?08:42
gibigmann: thanks for the patches!09:03
bauzasgmann: what gibi said, thanks for having worked on them09:05
bauzasso the xena and master changes for fixing the placement API endpoints are now merged, we only have the grenade issue left, right?09:06
lyarwoodYup I believe so09:07
lyarwoodand I think we only need it in master09:07
gibiyupp thats my view too09:07
lyarwoodI don't know why we've backported it tbh09:07
gibilyarwood: you mean why we are enabling trunk testing on wallaby and back?09:07
lyarwoodon stable/xena and backwards, I thought grenade used the current branch of itself and the previous branch for everything else for the initial deploy09:08
lyarwoodso master grenade deploys a stable/xena env and then upgrades that to master09:08
bauzasgibi: I stupidly forgot to +1 the RC2 patches yesterday before leaving09:10
bauzas:facepalm:09:10
bauzasnow we need elodilles_pto but as his nick says, he's on PTO :)09:10
* bauzas hides09:11
gibilyarwood: if we do this https://review.opendev.org/c/openstack/devstack/+/811518/2/lib/tempest then we need to do this as well https://review.opendev.org/c/openstack/grenade/+/811542/1/.zuul.yaml on stable. But I agree that we don't have to enable trunk on stable, skipping is OK with me too.09:12
lyarwoodyeah I agree with the comment on the stable/wallaby change that we shouldn't be adding tests to stable this late on09:12
gibibauzas: I think other release cores can approve the RC2, we don't necessarily need to wait for elodilles_pto.09:13
gibianyhow he is back tomorrow so no worries09:13
gibilyarwood: ack, we can discuss this with gmann when he is up09:14
lyarwoodkk09:20
opendevreviewLee Yarwood proposed openstack/nova master: nova-manage: Ensure mountpoint is passed when updating attachment  https://review.opendev.org/c/openstack/nova/+/81171310:56
opendevreviewLee Yarwood proposed openstack/nova master: nova-manage: Always get BDMs using get_by_volume_and_instance  https://review.opendev.org/c/openstack/nova/+/81171611:17
lyarwoodhttps://review.opendev.org/c/openstack/nova/+/811118 - cool the check queue is passing with the grenade fix11:26
gibi\o/11:27
mdbooth👋 Are we confident that https://review.opendev.org/c/openstack/devstack/+/811399 fixed the devstack issue on Ubuntu? My devstack-using CI job still seems to be failing:12:00
mdboothhttps://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/kubernetes-sigs_cluster-api-provider-openstack/1009/pull-cluster-api-provider-openstack-e2e-test/1443169418262614016/artifacts/logs/cloud-final.log12:00
mdboothI can see the following in the devstack output: "ProxyPass "/placement" "unix:/var/run/uwsgi/placement-api.socket|uwsgi://uwsgi-uds-placement-api" retry=0", which looks like it contains the fix. However I'm still seeing the borked GETs to placement12:02
mdboothI'm far from discounting some other issue in the environment, btw, just looking for confirmation or otherwise that we've seen this fix the issue.12:04
*** swp20 is now known as songwenping12:04
gibimdbooth: your logs show the original error, I can confirm, as placement logs "GET /placemen//resource_providers?in_"12:05
mdboothgibi: Right12:05
gibiand I do see the ProxyPass you mentioned, and that is the correct one12:06
gibiare you sure that your env is using the ProxyPass that is logged?12:06
mdboothgibi: There's a git pull; git show at the top of that output which suggests the working directory is at "7f16f6d4 Fix uwsgi config for trailing slashes"12:06
mdboothgibi: i.e. Can I confirm that the apache config actually contains that ProxyPass?12:07
gibigibi: or that apache2 was restarted / reloaded after the setting is applied12:07
gibihm, I see an apache2 reload later12:08
mdboothThis is running in cloud-init on initial boot. *However* it is running from an image that already contains devstack bits, so it's entirely possible there's dirty state there.12:08
mdboothLet me try to confirm that. I hadn't considered that it might not actually be updating the apache config.12:09
gibihm, sorry I only see logs that suggest the reload like12:09
gibiTo activate the new configuration, you need to run:12:09
gibisystemctl reload apache212:09
gibibut I don't see the reload itself12:10
gibiso if you can log into the system, try accessing placement; if it fails, check the ProxyPass config, then reload apache and check that placement is accessible now12:11
gibi(I did this manually in my local devstack so I do believe the fix helps, at least in devstack)12:11
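For reference, a minimal sketch of the check gibi describes, assuming Python with requests on the devstack host ("controller.example" below is a placeholder): with a correct ProxyPass, Apache passes "/placement/..." through to the uwsgi socket unmangled, whereas the broken config mangled the path that reached placement (e.g. "GET /placemen//resource_providers").

    import requests

    # Placeholder host; on an all-in-one devstack this is the node's own address.
    # The root version document is normally served without a token, so a simple
    # GET is enough to see whether the proxy routes /placement correctly.
    resp = requests.get('http://controller.example/placement/',
                        headers={'Accept': 'application/json'})
    print(resp.status_code)
    print(resp.text)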
mdboothUnfortunately this is running in a CI system I don't have access to :( I have to debug via modifying the job and re-executing 😬12:12
mdboothI'll see what I can confirm. Thanks!12:12
gibimdbooth: no problem, let us know if this still fails for you12:13
mdboothOk, I can confirm that the image the machine is created from already contains dirty devstack state12:18
mdboothIncluding the broken ProxyPass directives12:18
mdboothSo unless we consider it a bug that devstack doesn't update that, I'm guessing this is my problem?12:19
gibimdbooth: so you have both wrong and bad ProxyPass lines in the config file?12:20
gibiI mean wrong and good12:20
gibi:D12:20
mdboothSo *before* pulling and executing a patched devstack I already have bad config present12:21
mdboothAnd running the patched devstack doesn't fix it12:21
gibiI do see multiple ProxyPass lines locally too, I guess those are from multiple unstack.sh / stack.sh runs in my case12:21
lyarwoodyeah that isn't a devstack bug12:22
lyarwood./clean.sh first12:22
gibihm, if clean.sh removes it then I agree it is not a bug (I'm lazy to always run clean.sh but I should)12:22
mdboothlyarwood: Or more likely summon the arcane wizards of CI and update the 'preinstalled' image!12:23
lyarwoodyeah or that if it's already baked into the image12:23
gibimdbooth: yeah, for CI I suggest to always start from a clean image :)12:23
lyarwoodthat's odd tbh12:23
mdboothApparently it saves a ton of time, although I haven't personally measured it. I'm about to measure it, though :)12:24
gibinot too long ago dansmith added parallelism to the devstack stack.sh run, which helped a lot with runtime12:24
lyarwoodyeah that we use across most jobs now AFAIK12:27
mdboothlyarwood: Is there a flag?12:28
gibithere should be12:29
gibibut is on by default12:29
mdboothOk, cool12:29
lyarwoodyeah it's there by default in master and xena12:29
gibibtw, ./clean.sh deletes all the config from /etc/apache2/sites-enabled/ except glance :D12:29
mdboothAh, looks like it's DEVSTACK_PARALLEL, and it's not in victoria12:30
mdboothPerhaps I'll update12:30
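As a concrete illustration, enabling the parallel stack.sh runs mentioned above is just a local.conf setting on branches that carry the feature (a sketch; the flag name is taken from the discussion):

    [[local|localrc]]
    DEVSTACK_PARALLEL=True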
lyarwoodwhy are you deploying victoria btw?12:30
mdboothlyarwood: Because nobody changed it12:31
lyarwoodoh fun12:31
gibi:)12:31
mdboothHonestly this CI system is awesome, I'm not complaining. It installs devstack in a VM in GCE and runs tests against it, and until we hit this it was rock solid.12:32
mdboothOk, I'm going to run again from a clean image instead of the preinstalled image to see how long that takes. Then I'm going to do the same again but against Xena to measure again.12:33
mdboothThanks for all the help!12:33
gibihappy to help12:34
gmannbauzas: other than neutron-grenade, nova-ceph-multistore is broken on stable (https://bugs.launchpad.net/devstack-plugin-ceph/+bug/1945358); fixes are ready to merge - https://review.opendev.org/q/topic:%22bug%252F1945358%22+(status:open%20OR%20status:merged)12:36
gmanngibi: lyarwood checking comments on grenade fixes, 12:36
gibigmann: in short we are questioning whether we need to turn on trunk testing on stable branches12:39
gmanngibi: lyarwood ok, I will say we should, as Tempest master is used to test stable/ussuri -> master, and if we do not enable it then we skip the trunk test for no reason. It was just a miss in enabling the extension in stable branches.12:41
lyarwoodgm12:41
lyarwoodops sorry12:41
lyarwoodgmann: Yeah I appreciate that but if it wasn't tested until now it seems a little odd to add it in for non-master branches12:42
lyarwoodgmann: the missing coverage was mostly around live migration right?12:42
gmannand we can see it is passing in nova grenade jobs on stable so we do not need anything extra beyond the grenade fix12:42
gmannlyarwood: yeah, live migration trunk tests. those are only tests for trunk currently in tempest12:43
lyarwoodgmann: okay well if it's passing and the neutron folks are okay with helping with any fallout that might appear in the coming days then I guess we can go ahead12:43
gmannlyarwood: yeah, if neutron team is not comfortable or confident then we can just keep it like that.12:43
gmannlyarwood: whether we need it for Xena or not, I remember the master testing patch passed only when I added the xena fix as a depends-on here https://review.opendev.org/c/openstack/nova/+/81111812:53
gmannI think I added it due to the zuul job inventory but let me debug what exactly is happening here12:54
lyarwoodokay if that's the case lets just merge both12:54
gmannlyarwood: zuul inventory seems to take job definition from master only https://zuul.opendev.org/t/openstack/build/01f10e2cb162450e912551026dde8f85/log/zuul-info/inventory.yaml#294-30413:08
gmannlet me remove the xena fix as depends-on and then we can get more clarity 13:09
opendevreviewGhanshyam proposed openstack/nova master: DNM testing grenade neutron-trunk fix  https://review.opendev.org/c/openstack/nova/+/81111813:09
lyarwoodgmann: kk13:16
bauzassorry folks, I had issues with my computer13:55
bauzasgmann: thanks, unfortunately gtk13:56
bauzaslooks like it's limbo13:57
gmannnova grenade master fix is in gate https://review.opendev.org/c/openstack/grenade/+/81111713:58
gmannwhether to enable testing on stable or not is a different thing and we can continue the discussion.13:58
dansmithbauzas: gibi looks like the nova-yoga-ptg etherpad has been emptied14:01
bauzasORLY ?14:01
gibishit, I see14:02
bauzasdansmith: this was really done 3 mins before14:02
gibithis is the last known state https://etherpad.opendev.org/p/nova-yoga-ptg/timeslider#827214:02
gibilast known good state14:03
bauzasyup14:03
dansmithbauzas: okay I just loaded it and saw it empty14:03
bauzasdansmith: me too14:03
bauzasit's just, something happened between 8272 and 827314:03
dansmithcertainly you can restore a rev right?14:03
gibithe content yes, the coloring I think no14:03
bauzasI wonder14:04
bauzaslemme ask infra14:04
gibisure14:04
dansmithhmm, I thought there was some way14:04
gibiwhat I did before is that I exported the old state and copied that back14:04
bauzasfungi is looking at restoring14:05
gibiack14:05
bauzasfolks, don't touch now14:05
dansmithsweet14:05
bauzasdansmith: thanks for having identified this14:06
lyarwoodah ffs I think that was me14:06
lyarwoodI was editing and my dock crashed14:06
lyarwoodand then I couldn't reconnect to the pad14:07
lyarwoodyeah my items were the last on there14:08
lyarwoodhttps://etherpad.opendev.org/p/nova-yoga-ptg/timeslider#827214:08
lyarwoodnot entirely sure how my laptop dock (and thus network) crashing caused this tbh14:09
gibian interesting untested edge case in the etherpad code :)14:09
fungibauzas: gibi: dansmith: i've rolled it back to revision 827214:13
fungilyarwood: ^14:13
dansmithlooks good, thanks fungi !14:13
bauzasfungi: with all my love14:13
fungicool, just making sure it's looking like you needed14:13
lyarwoodthanks for that and apologies all14:13
fungiwe do also back up the db behind it daily, worst case14:13
bauzasfungi: it is, all the colors14:14
gibifungi: awesome thanks14:14
fungino problem, glad we could recover it14:14
bauzasand yeah, interesting edge case14:14
bauzasa docking issue with a network problem swallows an etherpad14:14
fungithose aren't so bad. in the past there have also been bugs which clients somehow tickled to make the pads completely unusable14:16
fungiand the most we can do in those cases is dump an earlier text copy with the api and stick that into a new pad, but it loses all the attribution and history14:16
fungiand formatting14:17
bauzasfungi: thanks, gtk14:18
bauzasgmann: i know you're busy with all that devstack and grenade stuff14:19
bauzasgmann: but tell me when you think it would be a good opportunity for the x-p PTG session between oslo and nova re: policy14:19
gmannbauzas: give me 10 min, currently internal meeting14:22
bauzasgmann: nah, no rush14:22
lyarwoodhttps://review.opendev.org/c/openstack/nova/+/811118/ is passing with just the master grenade fix that's in the gate FWIW14:34
* lyarwood does a little dance14:34
gibi\o/14:35
bauzaslyarwood: https://media.giphy.com/media/HTjcWZwMtHpyhuCGKZ/giphy-downsized-large.gif?cid=ecf05e47cwkexzrwaxwlkbicd2ppy22bb1l52sg7yoa8sk9x&rid=giphy-downsized-large.gif&ct=g14:52
* bauzas is sometimes regretting we can't add media in IRC14:52
bauzasbut that lasts 0.1 sec and then I say "meh"14:52
fungiif you used a matrix client with the matrix-oftc bridge to join this channel, you and other matrix users could share inline media while irc users would just see a url to it in-channel14:53
bauzasfungi: what I said, "meh" :p15:04
fungiheh15:05
gmannlyarwood: perfect, then I will say it was a late-night thing that made me think the Xena fix was needed :)15:05
gmannbauzas: for PTG oslo sessions, is it possible on Tuesday Oct 19th between 13-15 UTC15:07
opendevreviewmelanie witt proposed openstack/nova stable/train: Reject open redirection in the console proxy  https://review.opendev.org/c/openstack/nova/+/79180715:14
opendevreviewmelanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes  https://review.opendev.org/c/openstack/nova/+/80662915:14
gmannbauzas: and as per the current topic I need only 30 min unless there are more things to discuss from anyone else15:19
bauzasgmann: sorry, I'm in a meeting but I guess we could run a session after 1400UTC as we have a cyborg x-p session first15:20
gmannbauzas: on tuesday right?15:22
bauzasgmann: on the 19th of October, yes15:22
gmannbauzas: perfect, sounds good. thanks 15:22
bauzashttps://etherpad.opendev.org/p/nova-yoga-ptg L47 tells me the 14-15:00UTC slot is already taken15:23
gmannbauzas: that is 'Cyborg-Nova: Tuesday (19th Oct) 13:00 UTC - 14: 00 UTC:'15:24
bauzasmy bad, yeah15:24
bauzasagain, wfm for a oslo x-p session on Oct-19 14:00UTC15:24
gmann+1, thanks again15:24
lyarwoodgmann: ^_^ no issues thanks for working on it so late15:24
bauzasgmann: all good 15:25
gmannlyarwood: np! and for the stable backports we can wait for the neutron team's opinion, so I agree on 'no hurry for those'.15:25
lyarwoodawesome15:25
bauzasartom: honestly, I'm torn with https://review.opendev.org/c/openstack/nova/+/80847416:12
bauzasthat's a behavioural change16:13
opendevreviewBalazs Gibizer proposed openstack/nova master: Enable min pps tempest testing in nova-next  https://review.opendev.org/c/openstack/nova/+/81174816:13
bauzasoperators assume a plug-pull when they delete16:13
bauzasnow, we'll first try to shut down the guest for every instance16:13
bauzasincluding drivers other than libvirt16:13
melwittthat is my concern as well. maybe it could be conditional on bfv that is not "delete on termination"?16:14
melwittis that the only case where this would be desired?16:15
melwittor rather, is there any gain in doing it for non bfv non delete on termination?16:15
dansmithgenerally you don't want that :)16:15
melwittmaybe also shared storage is another case16:16
dansmiththe only real case for non-bfv volumes is for precious data16:16
melwittbut precious data on a local disk that's going to be deleted anyway? I must be missing something16:16
dansmithsorry, thought you were talking about delete-on-termination for non-bfv cinder volumes16:17
artommelwitt, "is there any gain in doing it for non bfv non delete on termination?" None that I can see16:17
artomWell, no, any volume, really16:17
artomDoesn't have to be bfv16:17
melwittoh, yeah attached volumes. I wasn't thinking of that. yeah16:17
dansmiththat's my point, delete-on-termination should only be useful for bfv volumes we created with non-precious data from an image16:18
artomIt's still attached and mounted in the guest, and would ideally be flushed correctly if it's not delete_on_termination=True16:18
artomI need to run an errand quickly, can this be carried over to the gerrit review?16:18
artomAnd thanks for looking into it :)16:18
artomAnd yeah, so bauzas's point, the compute manager/driver division of labour here is pretty muddy16:19
melwittmy bad for not looking at the change yet. if it's targeted at only the instances-with-volume(s) cases I think that makes a lot more sense16:19
gibiis it really a gracefull shutdown via openstack server stop and then a openstack server delete?16:19
dansmithgibi: stop + delete should be graceful16:19
gibiso my point is this can already be done with our APIs16:20
melwittjust saying I don't think we should be doing it for everything, for things where the data is going to be blown away anyway16:20
dansmithgibi: for sure. I assume the goal is to make nova do the graceful behavior if volumes are attached, but to do it properly really requires some higher-level orch, like a stop...timeout...destroy kind of thing16:21
dansmith"do the graceful behavior *automatically*" I should have said16:21
gibiOK I see16:21
melwittyeah that is my understanding as well16:21
dansmithI'm a bit torn, because unless you're running with unsafe cache, I would think that fast destroy is fine.. might have a journal to replay when you use the volume later, but...16:22
gibiit makes sense for data consistency but it also makes delete slower so I think this should be opt-in16:22
melwittdansmith: yeah it's weird, the user is experiencing volume gets corrupted and no longer usable when they delete without stopping first16:23
melwittwe had thought just deleting should be fine but it's behaving in a way we didn't expect16:23
melwittnot sure why16:23
dansmithdestroy of a running vm is the same as pulling the plug.. if you're using a precious volume, you wouldn't do that to a physical server, so...16:23
artomdansmith, so the "real" problem is https://bugzilla.redhat.com/show_bug.cgi?id=196508116:24
melwittyeah but re: "I would think that fast destroy is fine"?16:24
artomApparently in some cases stop+delete causes races16:24
melwittyeah I think you have to poll and wait for it to be stopped no?16:25
dansmithmelwitt: to do the graceful shutdown, you'd need some long-running task, yeah, what I said above16:25
artomSo we can either make delete safer, or require that any orchestration/automation on top of Nova does stop + delete, but then we would need to fix that race16:25
dansmithartom: so this has nothing to do with volume safety?16:25
artomdansmith, it does, because the reason for doing stop + delete (which can cause this deadlock) is volume safety16:26
melwittit does. it began with the racing problem and then we said "try a delete without the stop" and then their volumes got messed up16:26
dansmithoh the dbdeadlock came from stop?16:26
lyarwooddoesn't delete take an instance.uuid lock on the compute?16:26
artomdansmith, stop immediately followed by delete, apparently16:26
melwittit came from doing a delete right after a stop16:27
melwittwithout waiting for the stop to be stopped16:27
melwittlyarwood: good question16:27
dansmithokay, so, fix that16:27
dansmithdon't engineer an orchestrated graceful delete, IMHO16:27
dansmithI'm guessing task_state doesn't protect the delete from running,16:27
artomI dunno, I don't necessarily think expecting an attached volume to not be corrupted after a delete is invalid16:28
dansmithand the api might start the delete process while the stop is still running on the compute host or something16:28
melwitthm, stop and delete are both locked with instance.uuid16:28
dansmithartom: I agree, delete could leave a volume unhappy, but I'm thinking if you issued a stop and then delete you're assuming they're queued16:28
melwittI just checked16:28
lyarwoodhttps://github.com/openstack/nova/blob/e07bb310b674fb471a92edf3258e564f05534595/nova/compute/manager.py#L3237-L3255 looks like soft delete doesn't16:28
dansmithartom: in reality it's probably about like hitting shutdown on your server and then pulling the plug before it finishes, but...16:29
melwittdansmith: that's exactly what they want, the queuing16:29
dansmithI think relying on the instance lock is probably too fragile here,16:30
dansmithsince it's just on the compute node, but I'd have to go look at the (many) delete path(s) we have16:30
lyarwoodI wouldn't say so for this case where the instance was running on a host16:31
dansmithdelete is and always has been pretty much "I want this to complete and stop charging me immediately", so... we're really not wrong here, IMHO16:31
lyarwoodthat makes perfect sense16:31
opendevreviewBalazs Gibizer proposed openstack/nova master: Enable min pps tempest testing in nova-next  https://review.opendev.org/c/openstack/nova/+/81174816:31
dansmithlyarwood: wouldn't say what, that relying on the lock is unsafe? we can do stuff in the api to delete things (in the local case) which has nothing to do with the instance lock,16:32
dansmithand I'm not sure that relying on the ordering of two calls is really safe either unless we have perfectly fair locks16:32
bauzassorry, I was afk16:32
* bauzas scrolling back to the convo I started16:33
lyarwooddansmith: we can't force the delete in the API of an instance still associated with a host can we?16:33
dansmithlike, if you issue three calls on that instance and the stop is #2, delete is #3, the lock has to be fair to make sure we don't happen to choose the waiting #3 thread when we stop doing #116:33
dansmithlyarwood: of course we can16:33
dansmithif we think the host is down, or we missed the last service update, etc16:33
lyarwoodright I'm stuck in the happy path here16:34
melwittwe have the ability to add fair=True to the locks, I wonder if that would work?16:34
bauzasmy biggest concern here is that we would make destroy wait for a graceful shutdown16:34
dansmithbauzas: yeah I don't think we should do that for sure16:35
dansmithbauzas: I think at most, we should try to make sure a delete doesn't preempt a stop operation in progress16:35
bauzasyup, agreed16:35
bauzasif a stop is occurring, destroy should wait16:35
melwitthow can we make it wait without rejecting it?16:35
bauzasexcellent question16:36
melwittI do wonder if the fair=True would help16:36
bauzasI'd say we would reject synchronously by looking at the VM state16:36
melwittI'll try it in a devstack if I can repro the problem16:36
bauzasdestroy is synchronous, right?16:37
bauzaseven if the host is down, destroy will occur16:37
dansmithmelwitt: I'm just suggesting that lock inversion is *one* possible way that we might not be able to depend on the ordering16:37
bauzasthat's the guest destroy which is async, right?16:37
dansmithanything else that happens on the compute host before we block on the lock could cause it, as well as if conductor is involved or something like that16:38
bauzasdansmith: I missed your proposal16:38
bauzasyou're telling we could lock on stop ?16:38
bauzashence preventing the delete ?16:38
melwittwe do lock on stop on the compute node16:39
melwittbut lock waiters would not be in order because they are not fair locks16:39
melwitt*would not necessarily be in order16:39
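A minimal sketch of what "adding fair=True" would mean, assuming oslo.concurrency's fair-lock support and illustrative names (this is not nova code): with fair=True, waiting threads acquire the lock in arrival order, which is what a stop-then-delete sequence on the same compute would rely on.

    from oslo_concurrency import lockutils

    # Serialize operations on one instance by its uuid. With fair=True the
    # waiters are granted the lock in FIFO order; the default lock gives no
    # such ordering guarantee.
    @lockutils.synchronized('instance-<uuid>', fair=True)
    def do_stop():
        print('stopping instance')

    @lockutils.synchronized('instance-<uuid>', fair=True)
    def do_delete():
        print('deleting instance')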
dansmithbauzas: I haven't proposed anything16:39
bauzasok, melwitt explained16:40
dansmithyeah, so when we go to do delete,16:41
melwittdansmith: ack, I was just thinking that would be such a small and simple way to fix it if the fair locks would do that. but like you said if we have an issue with two async things arriving on the compute node in the "wrong order" then it wouldn't help16:41
dansmithwe create an instance event before we run terminate_instance and grab the lock,16:41
dansmithwhich means we're doing network IO to conductor16:41
dansmithso even fair locks won't prevent order inversion there16:41
melwittyeah...16:41
dansmithso if stop and delete arrive in the right order, but each call out to conductor, we hit any conductor in the cluster, each one tries to create records in the db, who knows which one will finish first, grab the lock, etc16:42
dansmiththat's what I mean by assuming this sort of super tight ordering is unsafe, even though we *think* we're on the compute and largely single-threaded16:42
bauzasI see16:42
bauzasstupid idea, can't we rely on the state of the instance ?16:43
dansmithdelete is anything-goes I think16:43
bauzasyeah, that's the original problem16:43
dansmithbecause that's what we want...16:43
bauzasyup... 16:43
dansmithand anything else we build into there is going to be pretty obscure.. "delete always deletes, except stop but not shelve... and either waits or refuses or ..."16:44
dansmithmakes the "delete always works" contract a little less clear16:44
bauzashonestly I don't know how to move on with this ask16:44
dansmithtell them that delete means pulling the plug,16:44
bauzas"ask your orchestration to be smarter ?"16:44
lyarwoodhow about making stop more graceful just for the libvirt driver?16:44
gibican we add a flag to the delete api saying I-want-a-shutdown-first?16:44
dansmithand you wouldn't do that while waiting for start->shutdown on a physical server, so they should wait for vm_state=STOPPED before delete16:45
dansmithgibi: that makes it a little less obscure, but doesn't eliminate the need to make that orchestration bit work, of course16:45
bauzasyeah16:45
lyarwoodartom: the main issue with stop was that it eventually destroys the instance right?16:45
bauzasthat doesn't solve the ordering problem16:45
lyarwoodartom: and in your change you've suggested that we call shutdown on the domain as an initial step to make this more graceful16:46
dansmithlyarwood: can we call quiesce or something else constant-time before libvirt destroy?16:46
dansmithinstead of shutdown, which the guest can block or ignore?16:46
bauzasthe root problem is the I/O flushes, right?16:47
dansmithnovaclient has a --poll option for some things.  "nova stop server --poll; nova delete server" would solve this pretty easy :)16:47
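The same stop/poll/delete sequence expressed with the openstacksdk (a sketch; the cloud and server names are placeholders): stop, wait until Nova reports SHUTOFF so the guest has flushed its disks, then delete.

    import openstack

    conn = openstack.connect(cloud='devstack')        # placeholder cloud name
    server = conn.compute.find_server('precious-vm')  # placeholder server name

    conn.compute.stop_server(server)
    # Poll until the guest has shut down cleanly before pulling the plug.
    conn.compute.wait_for_server(server, status='SHUTOFF', wait=300)
    conn.compute.delete_server(server)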
melwittthey're using tripleo/heat but I'm pretty sure heat has dependency or waiting ability16:49
dansmithsurely16:49
lyarwooddansmith: quiesce before $domain.shutdown() might help flush things if that's where the guestOS is getting hung up16:49
melwitttheir argument has been that stop + insta delete should not break the volume, IIUC16:49
lyarwoodwaiting assumes we don't kill it before it's finished shutting down16:50
dansmithlyarwood: I meant to make sure journal buffers are written before we nuke the guest16:50
dansmithmelwitt: that argument is fine as long as you wait between stop and delete :)16:50
dansmithmelwitt: because again, start->shutdown and then pulling the plug before it finishes yields a corrupted disk :)16:51
melwittdansmith: that was the first thing I said on the bug report but it's gone on for a long time now and gotten into the weeds16:51
* dansmith nods16:51
bauzasmelwitt: I haven't seen any bug report against artom's patch16:52
bauzasI guess you're talking internally16:52
melwittbauzas: yes internally16:52
bauzasyeah, because that's what worried me originally 16:52
bauzastechnically, destroy works like expected16:52
dansmithwe also really need to do a better job of making this a sanitized bug externally if we're going to claim this is a bug in nova16:52
bauzasI'm not happy with claiming this as an upstream bug16:53
dansmithme either, fwiw :)16:53
bauzasa blueprint or a wishlist bug16:53
dansmithwe could add a feature as gibi said, but destroy is doing the right thing here16:54
bauzas(18:52:51) bauzas: technically, destroy works like expected16:54
melwitt+1 to all of that16:54
dansmithan alternative to gibi's idea, would be a delete flag that says "assuming task_state=None" meaning "delete this if nothing else is going on"16:54
bauzasstrong agreement here16:54
dansmithbut it would require the client to retry, which they could currently do by just waiting for the stop to finish16:54
dansmithso, meh16:54
bauzasdansmith: yeah, that's why I was considering the vm state or the task state16:55
bauzasif we really want to do *something* 16:55
bauzasbut,16:55
bauzasthis can't be done with the current destroy API16:55
dansmithit doesn't make it do what they want, and honestly it's kinda weird since they could just wait for the stop just as well, but it's less new orchestration stuff16:55
dansmith[09:47:37]  <dansmith> novaclient has a --poll option for some things.  "nova stop server --poll; nova delete server" would solve this pretty easy :)16:56
dansmith^ :)16:56
bauzasit would be a 'destroy++" API16:56
bauzasheh16:56
dansmithalso, if melwitt is right and they're using heat, then FFS, get heat to wait or something16:56
bauzas(18:44:41) bauzas: "ask your orchestration to be smarter ?"16:56
dansmithlol16:56
lyarwoodwe keep saying they could wait for the stop to finish but wasn't that part of the issue here? Even if they did wait the libvirt driver would kill the instance prematurely before it had finished shutting down?16:56
dansmithif we're just re-quoting ourselves, are we done here? :P16:57
gibiif currently stop + wait for STOPPED + delete works, then a new delete-with-graceful-shutdown could also be orchestrated from the conductor but it is obviously an orchestration and adds complexity to the already complex delete codepaths16:57
bauzasgibi: and again, this can't be the straight delete16:57
dansmithgibi: we don't go through conductor directly for stop or delete right now, AFAIK, so it would really get confusing to add another whole path like that, IMHO16:57
gibibauzas: yepp this is delete++ :D16:57
gibidansmith: then we either need to make the stop RPC sync or do polling from the nova-api16:58
melwittlyarwood: I didn't think so?16:58
bauzascan't we just have as dansmith suggested a "destroy++" API return some 40x when the task state is not None ?16:58
lyarwoodthat's fine if they aren't waiting 16:59
melwittif that's the case I missed it16:59
dansmithgibi: we'd need a cast to conductor, a new method there, and a sync stop operation to compute from conductor followed by delete.. waiting from the api is not reasonable because stop can take a *long* time16:59
gibiyeah16:59
lyarwoodmelwitt: I was sure we had talked about them waiting and libvirt still killing the domain before things had shutdown sorry16:59
lyarwoodI'll look at the bug again16:59
dansmithin addition to the other four ways we can delete things :)16:59
melwittlyarwood: my understanding (and I could be wrong) is that they have never tried waiting to stopped16:59
lyarwoodI was sure we suggested that early on17:00
dansmithmelwitt: right that seems like it to me17:00
melwittlyarwood: ok same, maybe I'm way off17:00
bauzasI'd rather prefer this destroy++ API to stop synchronously at the API level if some conditions aren't met17:00
bauzasrather than us pursuing the idea we could achieve some distributed locking mechanism17:00
dansmithgibi: also delete returns "I will eventually do this" to the client, which it would no longer be able to guarantee17:00
dansmithbauzas: fwiw, I don't think that is really going to do what this person wants, so we should make sure they think it would help before we do that work17:01
gibidansmith: ohh, that is correct and bad :/17:01
gibidelete++ is just too complex17:02
bauzasdansmith: welp, good point17:02
bauzaseither way, I think we need a proper tracking17:07
bauzasartom: I guess you need to file a blueprint and honestly, given the brainstorm efforts we made over the last hour, you need to write a bit of a spec17:07
bauzasmostly because the current destroy action can't be used for this and we need to consider a new action parameter (or any change in our API)17:08
bauzasartom: also feel free to write a PTG proposal for this one, we could continue the talk there17:08
* bauzas stops for the day17:22
* gibi too17:23
*** beekneemech is now known as bnemec17:41
spatelAny idea how to do VM nic bonding using two VFs with SR-IOV?17:57
spateli am looking for redundancy with the SR-IOV implementation and the only solution is to do bonding inside the VM17:58
opendevreviewIvan Kolodyazhny proposed openstack/nova master: Add release note which descrube NVMe attach issue  https://review.opendev.org/c/openstack/nova/+/81144718:04
opendevreviewmelanie witt proposed openstack/nova stable/train: Reject open redirection in the console proxy  https://review.opendev.org/c/openstack/nova/+/79180719:15
opendevreviewmelanie witt proposed openstack/nova stable/train: address open redirect with 3 forward slashes  https://review.opendev.org/c/openstack/nova/+/80662919:15
opendevreviewmelanie witt proposed openstack/nova stable/train: [stable-only] Set lower-constraints job as non-voting  https://review.opendev.org/c/openstack/nova/+/81176219:15
opendevreviewDavid Vallee Delisle proposed openstack/nova master: rephrasing config description for num_pcie_ports in libvirt  https://review.opendev.org/c/openstack/nova/+/81117321:59
opendevreviewGoutham Pacha Ravi proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific  https://review.opendev.org/c/openstack/nova/+/81180222:19
melwittgouthamr: thanks for that! ^ I was going to do it and forgot22:33
gouthamro/ melwitt - i may have hit a circular dependency of sorts22:34
melwittoh hm.. (looking at it now)22:35
melwittgouthamr: I wonder if you need to use the url instead of the change-id? not sure how it would know which of the two (master or stable/wallaby) to use22:36
melwittfor the depends-on22:37
gouthamr+1 can try that22:37
opendevreviewGoutham Pacha Ravi proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific  https://review.opendev.org/c/openstack/nova/+/81180222:38
clarkbthe url is preferred now because you can depend on things in other code review systems22:38
clarkbchange id is gerrit specific22:38
gmanngouthamr: i do not see circular deps, it should work fine. nova->plugin->devstack22:39
melwittgood to know22:39
gouthamrthanks, but zuul still tells me "Unable to freeze job graph: 0"22:40
clarkbgouthamr: that implies you have a bug in your zuul config I think22:40
clarkbgouthamr: can you give me links to all the changes involved?22:41
gouthamrclarkb: yep: https://review.opendev.org/c/openstack/nova/+/811802/22:41
gouthamrhttps://review.opendev.org/c/openstack/devstack-plugin-ceph/+/810059 and https://review.opendev.org/c/openstack/devstack/+/81020222:42
opendevreviewGhanshyam proposed openstack/nova stable/wallaby: DNM: Test wallaby with ceph pacific  https://review.opendev.org/c/openstack/nova/+/81180222:42
gmanngouthamr: I think it's running now ^^ ?22:43
gouthamroh!22:43
gmannyeah22:43
gouthamrgmann: thanks! i am used to pushing empty changes to trigger the CI elsewhere :)22:43
gmannwith right deps too https://zuul.openstack.org/status#nova22:44
melwittgmann++22:44
clarkbthe issue is you have no files in the commit22:47
clarkbseems like you figured that out22:47
clarkbI don't know that that is a use case we should support. If you aren't changing anything then why bother22:48
melwittI noticed that but didn't know that it would cause problems22:48
gmannclarkb: yeah, the error was confusing though22:48
clarkbgmann: yes, https://paste.opendev.org/show/809680/ is the internal handling of it. It is an exceptional case currently22:49
clarkbI've brought it up in the zuul matrix room to see if that is something we can handle better22:49
gmannclarkb: +122:49
gmannthanks 22:49
gouthamrawesome thanks clarkb!22:51
clarkbthe other issue you'll run into with that, even if zuul didn't explode on it, is that so many jobs match on files that are modified. If no files are modified then none of those jobs will run22:52
opendevreviewmelanie witt proposed openstack/nova stable/wallaby: Add functional regression test for bug 1853009  https://review.opendev.org/c/openstack/nova/+/81180523:51
opendevreviewmelanie witt proposed openstack/nova stable/wallaby: Clear rebalanced compute nodes from resource tracker  https://review.opendev.org/c/openstack/nova/+/81180623:51
opendevreviewmelanie witt proposed openstack/nova stable/wallaby: Invalidate provider tree when compute node disappears  https://review.opendev.org/c/openstack/nova/+/81180723:51
opendevreviewmelanie witt proposed openstack/nova stable/wallaby: Prevent deletion of a compute node belonging to another host  https://review.opendev.org/c/openstack/nova/+/81180823:51
opendevreviewmelanie witt proposed openstack/nova stable/wallaby: Fix inactive session error in compute node creation  https://review.opendev.org/c/openstack/nova/+/81180923:51
