Wednesday, 2024-01-24

opendevreviewDale Smith proposed openstack/magnum master: Drop Swarm support  https://review.opendev.org/c/openstack/magnum/+/89439500:20
*** LarsErik1 is now known as LarsErikP08:35
jakeyiphi dalees, do you have anything you want to talk about today? 08:51
daleeshey jakeyip - mostly related to patchsets. I added some topics to the agenda08:52
daleesother than prompting for reviews, i did want to ask if anyone has objections to magnum-api looking up and calling functions on the driver class.08:53
jakeyipok lets meet 'officially' then, I'll start the meeting in a bit08:55
jakeyip#startmeeting magnum09:00
opendevmeetMeeting started Wed Jan 24 09:00:43 2024 UTC and is due to finish in 60 minutes.  The chair is jakeyip. Information about MeetBot at http://wiki.debian.org/MeetBot.09:00
opendevmeetUseful Commands: #action #agreed #help #info #idea #link #topic #startvote.09:00
opendevmeetThe meeting name has been set to 'magnum'09:00
jakeyipAgenda:09:00
jakeyip#link https://etherpad.opendev.org/p/magnum-weekly-meeting09:00
jakeyip#topic Roll Call09:00
jakeyipo/ 09:00
daleeso/09:00
jakeyip#topic review - flakey tests09:01
daleesso this work was prompted when trying to merge the Swarm patchset09:01
jakeyipthanks. I had a brief look, haven't tested it out yet due to lack of time09:02
daleesand there was some state leaking after some tests, so I've now mocked these and the broken test case seems to be stable now.09:02
daleescool - I've taken the liberty of rebasing the Swarm removal onto this too - it passed CI09:03
daleesbut otherwise there's no rush on merging. it can wait until reviewers have time09:04
jakeyipI'm ok with it, but I was hoping mnasiadka can have a look when he comes back since he was working in this space too09:05
jakeyipif there's no urgency maybe we can wait a week? I think he'll be back soon09:05
daleesagreed, let's park it until then09:06
jakeyipthanks for understanding09:06
jakeyip#topic control plane resize09:07
daleesso for this i had a question I posted yesterday:09:07
daleesI'm interested in thoughts on Magnum-API looking up the driver for a cluster and using validation methods on it. I think the driver is only used by the Magnum Conductor currently, so perhaps the expected way to do this is RPC calls? This is in relation to https://review.opendev.org/c/openstack/magnum/+/90608609:07
jakeyiphmm I have a thought, we currently don't support (?) but what if conductor and api are different versions (e.g. middle of upgrade) and there's new driver code09:11
daleeswell the validation (in this case for control plane sizes on create or resize) will be validated according to whatever driver version is installed in the magnum-api instance and not the magnum-conductor (which still performs the *actual* resize actions)09:13
jakeyipalso, we don't store the driver used for a cluster 'at point of creation'; rather, the conductor looks it up, so in time this may be an issue... not really sure09:13
jakeyipto be fair I don't think mixed version is really much of a problem, we never say we support mixed version right?09:15
jakeyipgenerally i think ok, not shooting this down, just posting edge cases09:16
daleesNo, I wouldn't think so - perhaps for a short time during upgrade.09:16
daleesyeah it's tricky. Seems a big waste of effort to call out to RPC just to validate the master count, but maybe that keeps the delineation.09:17
jakeyipI agree it's a lot of effort. I'm not keen for such a simple case09:18
mkjpryorHi all09:18
daleeshi mkjpryor 09:18
mkjpryorApologies for my late arrival09:18
jakeyiphi mkjpryor glad you can join :) 09:19
mkjpryorDid I miss anything important?09:19
daleesmkjpryor: backlog available here: https://meetings.opendev.org/meetings/magnum/2024/magnum.2024-01-24-09.00.log.txt09:19
mkjpryorThanks dalees09:19
daleesjust discussing a question i had about magnum-api using driver lookup. I'm interested if you have thoughts, too09:19
mkjpryorFWIW, I quite like the fact that the API is completely implementation agnostic09:21
mkjpryorIt makes things very easy to reason about09:21
mkjpryorI think it would have to be very compelling functionality to change that09:22
mkjpryordalees: What exactly is the use case you are worried about?09:22
daleesright now validation of API requests is performed by magnum-api. It can respond immediately to failures (ie. master node count not accepted).09:23
daleesif I do it in the resize function of the conductor's driver, it's an async RPC call and the cluster transitions into UPDATE_FAILED09:23
mkjpryorI get you09:23
mkjpryorSo you're worried about the case where there is driver-specific validation that needs to happen09:24
mkjpryorSo we can fail early09:24
daleesyes09:24
daleesmaybe the answer is just a non-async RPC to do that, and still have the conductor locate and run the driver code.09:24
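[Editor's sketch: the call-vs-cast distinction dalees raises here is the crux. In oslo.messaging terms, a `call` blocks for a result (so magnum-api could return an immediate 400), while a `cast` is fire-and-forget (so a bad size only surfaces later as UPDATE_FAILED). A minimal plain-Python illustration of the two styles; all names are hypothetical stand-ins, not Magnum's real RPC API:]

```python
# Illustrative only: models synchronous "call" vs asynchronous "cast"
# between magnum-api and magnum-conductor. Hypothetical names.

class FakeConductor:
    """Stands in for magnum-conductor, which owns the driver."""

    def validate_resize(self, master_count):
        # e.g. many drivers accept only an odd number of control
        # plane nodes
        return master_count % 2 == 1

    def resize(self, master_count):
        # the actual (slow, asynchronous) resize work happens here
        pass


class FakeRpcClient:
    """Stands in for an oslo.messaging RPCClient."""

    def __init__(self, server):
        self.server = server

    def call(self, method, **kwargs):
        # synchronous: caller blocks and receives the return value
        return getattr(self.server, method)(**kwargs)

    def cast(self, method, **kwargs):
        # asynchronous: fire-and-forget, caller gets no result
        getattr(self.server, method)(**kwargs)
        return None


client = FakeRpcClient(FakeConductor())

# a synchronous call lets magnum-api fail fast with an HTTP 400...
ok = client.call("validate_resize", master_count=100)

# ...whereas a cast gives the API nothing to base a response on
nothing = client.cast("resize", master_count=100)
```

[The real choice discussed below is whether that validation round-trip is worth the latency, versus the API importing the driver class and calling it directly.]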
mkjpryorI guess you are worried about the API and the conductor versions getting out-of-sync if we call driver methods directly in the API?09:25
mkjpryorI'd be worried about the latency of the RPC call in the API code09:26
mkjpryorBut I guess there must be other OpenStack services that do that09:26
mkjpryor:shrugs:09:26
daleesit's a concern we were discussing, in most deployments both would be deployed at once I'd imagine.09:26
mkjpryorI can't imagine many cases, other than a failed upgrade, where they would be different for longer than a few seconds09:27
mkjpryorAnd if you have a failed upgrade, you probably have other problems :D09:27
jakeyipopenstack upgrade? for us the workflow is probably more than a few seconds :D09:28
daleescould be longer, with several api nodes and deployments progressing through them. But yeah, it wouldn't be long. Same problem with a conductor that doesn't know how to answer the validation RPC I guess.09:28
mkjpryorOpenStack upgrade - yeah. True - if you have multiple controllers then it might be a few minutes.09:29
mkjpryorOr if the conductor at the old versions answers the validation RPC and the conductor at the new version performs the operation09:29
mkjpryorI think whatever we do, other than denying API requests during an upgrade, we have a risk of mismatched versions09:30
mkjpryorSo we might as well do the easy/low-latency thing of calling a driver method directly in the API IMHO09:30
daleesokay thanks; I'll think more on your initial comments about it being easier to reason about if the conductor does it. I think that's important.09:33
daleesmaybe it just needs some defined boundaries and commenting. We don't ever want magnum-api doing driver actions, but validation seems helpful when these drivers aren't all exposing the same capabilities.09:34
jakeyipyeah, this case is relatively trivial; I'm concerned that 'doing it right' just takes so much more effort and blocks things. I want to have some time to think about the API doing validation09:34
mkjpryorAs long as there is a well-defined contract for the driver to implement, on reflection I have less of a problem with the API calling that directly09:35
mkjpryorAnd it is clear which bits of the driver get called by the API and which bits by the conductor09:36
jakeyiptrying to think about cinder resize, does api figure out if backend driver supports resize?09:36
daleeshmm, I'll try and figure that one out, jakeyip  - could be similar, or quite different :)09:39
daleesjust to add concrete code; here are the proposed additions to the base driver, that the magnum-api may call (I need to add really clear docs!): https://review.opendev.org/c/openstack/magnum/+/906086/2/magnum/drivers/common/driver.py#19609:41
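[Editor's sketch: in outline, a driver-level validation contract of the kind the linked patchset proposes could look like the following. Method and attribute names here are illustrative, not necessarily those in the review:]

```python
# Sketch of a validation hook on the base driver that magnum-api
# could call directly, so the user gets an immediate HTTP 400 rather
# than a later UPDATE_FAILED. Names are hypothetical; see the linked
# review for the actual proposed methods.

class BaseDriver:
    def validate_master_resize(self, master_count):
        """Raise ValueError if this driver cannot support the
        requested control plane size."""
        raise NotImplementedError


class ExampleDriver(BaseDriver):
    # e.g. only odd sizes up to 5, as a CAPI-style driver might require
    supported_master_counts = (1, 3, 5)

    def validate_master_resize(self, master_count):
        if master_count not in self.supported_master_counts:
            raise ValueError(
                "invalid master count %d; valid sizes: %s"
                % (master_count, self.supported_master_counts))


driver = ExampleDriver()
driver.validate_master_resize(3)        # accepted
try:
    driver.validate_master_resize(100)  # rejected up front
except ValueError as exc:
    print(exc)
```

[The key property is that the contract is declared on the base class, so each driver can expose its own limits while magnum-api stays driver-agnostic in everything except validation.]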
jakeyipdalees: what are the bad things if an invalid size is specified and cluster goes into UPDATE_FAILED ?09:42
daleesjakeyip: 1) the user doesn't get feedback in their CLI or UI; they get "accepted" and it fails later. 2) UPDATE_FAILED clusters cannot be acted on with some features like upgrade. You have to use a "trick" to resize them to the same size. (I'm proposing we change this in the patchset too).09:43
daleesMostly that it's a much worse user experience than getting a 400 "no you can't, here are the valid sizes".09:44
jakeyipwhat's the trick?09:46
daleesif a cluster is in UPDATE_FAILED you send it a resize to the same size it already is. magnum-api accepts that action and transitions to UPDATE_COMPLETE.09:47
jakeyipok thanks, just want to confirm because I vaguely remember doing something similar but with worker nodegroups etc.09:49
dalees(this is in magnum/conductor/handlers/cluster_conductor.py in allow_update_status variables)09:49
daleesyeah it doesn't matter what nodegroup you do it with. The whole cluster changes state into UPDATE_COMPLETE which allows more actions on it.09:51
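[Editor's sketch: the "trick" described above can be modeled as a small state transition — a resize request is accepted while the cluster is UPDATE_FAILED, and a resize to the current size completes trivially, flipping the cluster to UPDATE_COMPLETE. A deliberately simplified model, not the real cluster_conductor.py logic:]

```python
# Simplified model of the workaround. Not actual Magnum code: the
# real allow_update_status handling in
# magnum/conductor/handlers/cluster_conductor.py is more involved.

# statuses in which an update/resize request is accepted
ALLOW_UPDATE_STATUS = ("CREATE_COMPLETE", "UPDATE_COMPLETE",
                       "UPDATE_FAILED")


class Cluster:
    def __init__(self, node_count, status):
        self.node_count = node_count
        self.status = status

    def resize(self, new_count):
        if self.status not in ALLOW_UPDATE_STATUS:
            raise RuntimeError("resize not allowed in %s" % self.status)
        self.node_count = new_count
        # a same-size resize has nothing to do, so it "succeeds"
        self.status = "UPDATE_COMPLETE"


cluster = Cluster(node_count=3, status="UPDATE_FAILED")
cluster.resize(3)      # resize to the current size...
print(cluster.status)  # ...leaves the cluster UPDATE_COMPLETE
```

[Once UPDATE_COMPLETE, actions such as upgrade are unblocked — which is the behaviour the patchset proposes to make available without the workaround.]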
jakeyipthe change to allow upgrade when UPDATE_FAILED seems to be a bit dangerous, can you explain why you need that? is it not possible to resize to current, get it to transition to UPDATE_COMPLETE then upgrade?09:53
daleesbecause that's a silly workaround? Perhaps UPDATE_FAILED covers too many cases, some where you would rather deny upgrades - so I don't stand by that entirely.09:57
daleesif update_failed because you tried to set the master size to 100, but it failed and stayed at 3 - well what's the point in denying a cluster upgrade?09:58
jakeyipalso if we want Magnum's future to be a thin shim, should we even do validation? what if one day even numbers of controllers are acceptable? (because magic :P). do we want to update validation logic in magnum? where do we want to draw the line at validation?09:58
daleeswell yeah, this goes a long way ;)09:59
jakeyipyeah UPDATE_FAILED may have many reasons, I guess it was (is?) good to make sure the cluster is healthy before allowing another operation. this can change, of course09:59
daleesfor that *exact* example, I got you covered. The driver defines whatever it wants to support: https://github.com/stackhpc/magnum-capi-helm/pull/41/files09:59
jakeyipOK for stackhpc example understood10:02
jakeyipso in the size 100 case, should it be an operation to resize to 3 again then upgrade? what should the status be after upgrade if we allow? db will be storing size 100 I assume?10:02
daleeswell no, size 100 is right now denied as an HTTP 400 and not stored - that's the use case for validation vs letting magnum-api trigger the async RPC.10:04
daleesbut i think there are more use cases that Magnum right now encodes that we'll want to move into the driver10:05
daleesvexxhost had one a week back which was CNI support for Cilium. The list is hardcoded in Magnum right now.10:06
jakeyipsorry, I am thinking of the case where it is allowed, or it is a resize of a worker nodegroup, and then no quota or something results in an UPDATE_FAILED status10:06
jakeyipwe are past time, want to have a think and come back next week?10:09
daleesyep all good - a bit to think about here. thanks for the discussion mkjpryor and jakeyip 10:09
jakeyipthanks for bringing this up10:09
jakeyip#endmeeting10:10
opendevmeetMeeting ended Wed Jan 24 10:10:33 2024 UTC.  Information about MeetBot at http://wiki.debian.org/MeetBot . (v 0.1.4)10:10
opendevmeetMinutes:        https://meetings.opendev.org/meetings/magnum/2024/magnum.2024-01-24-09.00.html10:10
opendevmeetMinutes (text): https://meetings.opendev.org/meetings/magnum/2024/magnum.2024-01-24-09.00.txt10:10
opendevmeetLog:            https://meetings.opendev.org/meetings/magnum/2024/magnum.2024-01-24-09.00.log.html10:10
*** freyes_ is now known as freyes14:01
*** Guest20 is now known as diablo_rojo23:58

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!