Wednesday, 2025-03-19

opendevreview	sean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught https://review.opendev.org/c/openstack/nova/+/944933	00:25
opendevreview	OpenStack Proposal Bot proposed openstack/nova master: Imported Translations from Zanata https://review.opendev.org/c/openstack/nova/+/944963	04:01
opendevreview	Adam Kijak proposed openstack/nova master: Fix race condition when attaching a volume and deleting the instance https://review.opendev.org/c/openstack/nova/+/944976	09:51
opendevreview	Stephen Finucane proposed openstack/nova master: Remove tags from README https://review.opendev.org/c/openstack/nova/+/944994	12:30
opendevreview	Pierre Riteau proposed openstack/nova master: Fix missing backtick in configuration option help https://review.opendev.org/c/openstack/nova/+/931253	13:43
priteau	Hello Nova cores. This trivial doc fix has been open for a few months ^	13:43
priteau	Could you please review?	13:43
opendevreview	sean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught https://review.opendev.org/c/openstack/nova/+/944933	14:23
opendevreview	Bogdan Dobrelya proposed openstack/nova master: [DNM] Hard OS exit instead of SysExit exception https://review.opendev.org/c/openstack/nova/+/945011	14:24
opendevreview	sean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught https://review.opendev.org/c/openstack/nova/+/944933	14:26
opendevreview	Bogdan Dobrelya proposed openstack/nova master: [DNM] Hard OS exit instead of SysExit exception https://review.opendev.org/c/openstack/nova/+/945011	14:31
dansmith	sean-k-mooney: so one approach is to try to exit really really hard to get apache/mod_wsgi to notice... but,	15:10
dansmith	what about us making sure we never raise and just return 500s if we would have otherwise failed?	15:11
sean-k-mooney	hum	15:11
dansmith	even without a dedicated healthcheck that would be easy to arrange to not be healthy	15:11
sean-k-mooney	we would have to kind of hack it	15:12
dansmith	curl /versions or whatever, always returning 500 would leave a pod unsuitable for traffic	15:12
sean-k-mooney	but we could set a module level global i guess	15:12
sean-k-mooney	and check that really early	15:12
dansmith	sure, we'd need some way to record it and continue logging the config error or whatever so the operator saw it in the logs	15:12
dansmith	sean-k-mooney: or just make application() return a different stub object if we failed to init the real one	15:12
sean-k-mooney	perhaps. the issue is it reloading that applciation into the exisitng interperter	15:13
sean-k-mooney	so we keep all the module level state like that partly parsed oslo config global	15:13
dansmith	do we need to?	15:13
sean-k-mooney	do we need to keep that state?	15:14
dansmith	we could just have a safety proxy that sits around the real one and keeps trying to load the real one, eventually proxying to it if it loads, or returning the 500s if not	15:14
sean-k-mooney	i was condiring trying to reset it so that we woul parse the config and raise the excption again	15:14
dansmith	I kinda don't think we need to worry about the partial state	15:14
sean-k-mooney	well that why its getting to the microverison endpoint when its reloaded	15:15
dansmith	I imagine everyone doing this in a modern way is going to fix the config and restart a container when they do, so just having it be a terminal state if config is broken seems fine to me	15:15
sean-k-mooney	just enough fo the state is thare that we get passed where the excption woudl be raised	15:15
sean-k-mooney	right im just wondering how to make it terminal	15:15
dansmith	I mean if mod_wsgi reloads us and we have to try initializing again and still fail, who cares?	15:16
sean-k-mooney	im totally fine with changing the guts of the decorator to do something like record if it ever raised an excption	15:16
sean-k-mooney	and then keep raising that if invoked in the future	15:16
sean-k-mooney	that would be pretty clean i think	15:16
sean-k-mooney	dansmith: right now when we are reloaded we dont fail on the init the second time	15:17
dansmith	yeah, just set a global if we failed, stop trying, just return all the 500s, wait for them to fix config and restart	15:17
sean-k-mooney	becuase of the leaked state	15:17
dansmith	sean-k-mooney: ahh, I see, then yeah, some global to prevent us from trying	15:17
sean-k-mooney	dansmith: lets wait for the current ci job to complete	15:17
sean-k-mooney	ill refactor to raise 500 if we ever raised before and see what that looks like	15:18
dansmith	I just feel like trying to sys.exit() harder and harder is sort of missing the point	15:18
bogdando	we also have one testing the os._exit "fork"	15:18
sean-k-mooney	ya i dont like calling that ^	15:18
dansmith	we should just detect, log, and signal (which for a web application is a 500 healthcheck, IMHO)	15:18
dansmith	sean-k-mooney: agree about os._exit, that's what I mean by "harder" :)	15:18
sean-k-mooney	ya so i dont know if you looked at my most recent version wich follows the reload docs	15:19
sean-k-mooney	i choose to use sig kill instead of sig interupt	15:19
sean-k-mooney	but either i think woudl work	15:19
bogdando	sys exit is trying the door, but apache won't let us out	15:20
sean-k-mooney	dansmith: i think your suggestion is worth trying as it shoudl be pretty clean.	15:20
bogdando	os.exit - trying a window	15:20
bogdando	500 code - sitting in the room arguing with the door guard	15:20
sean-k-mooney	500s are more polite	15:21
dansmith	sean-k-mooney: to me, the kill/reload thing is what we should do if we need to retry, but for bad config or almost all of our situations, we don't	15:21
dansmith	so doing kill there just means we crashloop	15:21
dansmith	starting up and returning 500 will signal that we're not happy instead of constantly looking like we're restarting	15:21
sean-k-mooney	yep	15:21
dansmith	I much prefer the "unhealthy" marker instead of the constant "starting..." status where you're not sure if you're still supposed to wait	15:22
sean-k-mooney	long term either implemeting helthchecks or hooking the system ready socket is cleaner	15:22
bogdando	the problem is we need a hard stop, no half measures	15:22
sean-k-mooney	i.e. for knowing if we started properly	15:22
bogdando	letting the service operating half-configured even a fraction of a second could be destructive	15:22
sean-k-mooney	well if my current patch works it wont	15:23
bogdando	(if we got unlucky enough)	15:23
sean-k-mooney	and if we do the 500 thing it wont either	15:23
dansmith	it's not going to operate half-configured either way if we never return application	15:23
dansmith	right	15:23
bogdando	ok, makes sense	15:23
bogdando	what about a tail of rpc calls?	15:23
dansmith	not sure what you mean	15:24
bogdando	it won't accept new API requests, but what about a pending work	15:24
sean-k-mooney	it gets dropped on the floor	15:24
bogdando	woudln't it heppily pick it up after crash looping?	15:24
bogdando	ok	15:24
sean-k-mooney	the http request wont get into the applciation logic and it will get a 500	15:24
opendevreview	ribaudr proposed openstack/nova master: FUP Remove unnecessary PCI check https://review.opendev.org/c/openstack/nova/+/944105	15:24
opendevreview	ribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers https://review.opendev.org/c/openstack/nova/+/944106	15:24
opendevreview	ribaudr proposed openstack/nova master: FUP Add a warning to make non-explicit live migration request debugging easier https://review.opendev.org/c/openstack/nova/+/944133	15:24
opendevreview	ribaudr proposed openstack/nova master: FUP Update pci-passthrough and virtual-gpu documentation https://review.opendev.org/c/openstack/nova/+/944153	15:24
bogdando	IIRC, there is a logic for recovering ongoing rpc calls (long running ops) after restarting the process. Wouldn't restarting half-configured introduce more problems that it solves, even if never taking new API requests?	15:26
sean-k-mooney	no	15:26
sean-k-mooney	config errors should not allow the api to ever come up	15:27
dansmith	if we never return application, we can never start rpc calls	15:27
sean-k-mooney	so you shoudl nto be able to start rpcs	15:27
dansmith	I'm not sure what the concern is	15:27
bogdando	oh ok then, sorry	15:27
opendevreview	ribaudr proposed openstack/nova master: FUP Remove unnecessary PCI check https://review.opendev.org/c/openstack/nova/+/944105	15:27
opendevreview	ribaudr proposed openstack/nova master: FUP improve comment accuracy and variable naming for tag removal https://review.opendev.org/c/openstack/nova/+/943124	15:27
opendevreview	ribaudr proposed openstack/nova master: FUP Add a warning to make non-explicit live migration request debugging easier https://review.opendev.org/c/openstack/nova/+/944133	15:27
sean-k-mooney	also if you restart the api and it has done an rpc the reply is dropped on the floor	15:27
opendevreview	ribaudr proposed openstack/nova master: FUP Update pci-passthrough and virtual-gpu documentation https://review.opendev.org/c/openstack/nova/+/944153	15:27
opendevreview	ribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers https://review.opendev.org/c/openstack/nova/+/944106	15:27
dansmith	also, the 500 approach specifically avoids restarting if we know we're not going to be able to do so successfull	15:28
sean-k-mooney	it will be about 2 hours until the ci job finishes	15:29
sean-k-mooney	ill see if i can do a poc of the 500 approch this evenign or early tomorrow	15:29
sean-k-mooney	i mean i can push one now	15:29
bogdando	what about conductor, it service no API, right? but we cannot afford it restarted half-configured for some moment? sorry I am fully noob in Nova services architecture	15:29
sean-k-mooney	but i guess i can triger a job via a second dnm change	15:30
sean-k-mooney	the conductor is not a wsig process	15:30
bogdando	it serves no api*	15:30
sean-k-mooney	this si only a problem because of mod_wsgi	15:30
bogdando	oh yes, indeed	15:30
sean-k-mooney	all the rest exit on config error	15:30
dansmith	yeah I'm not sure what conductor has to do with this.. it can use the usual systemd stuff	15:30
bogdando	not that much a noob to grasp that :)	15:30
sean-k-mooney	dansmith: i have a poc locally but i dont think we want to keep raising the excption and returning 500 in all cases. for config yes but for DBConnectionError maybe not.	16:28
sean-k-mooney	dansmith: ill push what i have shortly and we can dicuss if we want to filter ot a subset of excptions	16:29
dansmith	sean-k-mooney: yep, that's fair.. I would default to 500, excluding specific things we know are transient	16:29
sean-k-mooney	ack	16:29
dansmith	anything unexpected should be assumed to be fatal	16:29
bogdando	you see, arguing more with the door gouard? :)	16:31
bogdando	fail early is always the best approach	16:31
sean-k-mooney	so today we allow retrying in case teh db is not ready yet	16:31
sean-k-mooney	so it would be a regression to hard fail in that case	16:31
sean-k-mooney	a small one but still a chagne in behvior	16:31
bogdando	well, systemd or k8s or pidone shall restart	16:32
sean-k-mooney	it might	16:32
sean-k-mooney	it might not	16:32
bogdando	I mean it is a normal assumption for underlying OS to expect that	16:32
sean-k-mooney	we can change the behavior we jsut need to conservitive and advertise it peroperly ot operators	16:32
dansmith	compute tolerates certain things on startup and not others, so this mirrors that	16:32
dansmith	compute also handles conductor not started, waits for it, etc	16:33
sean-k-mooney	i am raising this because im modifying a test that emulatest this and im liek "i proably shoudl not be changing this test"	16:33
bogdando	then it seems we only have one choice, code 500 with tricky logic around	16:34
dansmith	how is it tricky logic?	16:35
sean-k-mooney	no we can hard exit just not on any excption	16:35
sean-k-mooney	config errors are fine	16:35
sean-k-mooney	i jsut cant do excpet Excpetion	16:35
sean-k-mooney	which is what my current patch does	16:35
bogdando	ack	16:35
sean-k-mooney	if the genal approch looks ok to folks after i push it ill refactor it to be more selective	16:36
bogdando	@dansmith the trickyness that I meant is how we have started firstly with a simple return http 500, and this evolved into decisions making	16:38
dansmith	this is not tricky, it's a single "try...except $allowed: pass except Exception: fail"	16:39
sean-k-mooney	bogdando: its more we cant regress https://launchpad.net/bugs/1882094 while fixing this	16:39
sean-k-mooney	while i can change the test case we added to prevent that regressing there are trade offs and if i can aovid changing the test i likely shoudl	16:41
bogdando	what happens first, oslo config validation or DB connection during wsgi app init?	16:41
sean-k-mooney	config	16:41
sean-k-mooney	so that it can know what db to connect too	16:41
bogdando	I recall there were corner cases with logs eating some events because errors happened before the logs configure	16:42
bogdando	sean-k-mooney: while that is totally logical, that was not clear to me after reading https://bugs.launchpad.net/nova/+bug/1882094	16:44
bogdando	that "Removing and moving the DB connection establishment out of our WSGI init script" could be understood like it connects to DB even before config gets validated	16:45
dansmith	of course config :)	16:46
sean-k-mooney	bogdando: it could but that not what the code does	16:46
bogdando	well, sometimes it firstly needs alive MQ connection before to start verifying something else in a config	16:47
bogdando	not saying that's exactly a case here	16:47
sean-k-mooney	the api does not	16:47
sean-k-mooney	so this call to init_global_data is waht calls config.parse_args https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L128	16:48
sean-k-mooney	here https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L101	16:48
sean-k-mooney	the first time the db is acccesed is here https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L64	16:49
sean-k-mooney	which is after we return from init_global_data	16:49
sean-k-mooney	dansmith: by hte way this is why we are not getting the config error the second time https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L92-L93	16:50
bogdando	at least my silly question brought some useful considerations	16:50
dansmith	sean-k-mooney: ah interesting :)	16:50
sean-k-mooney	i could proably tweak run_once for this edge case too but im slighly reluctant to broaden the scope of that	16:53
bogdando	@sean-k-mooney so, that means we can hit that initialization again with an invalid config	16:53
bogdando	a rogue database connection :)	16:53
sean-k-mooney	bogdando: yes but rememebr python is ment to die if you dont cache the excption	16:53
sean-k-mooney	mod_wsgi is the thing that is breaking expections :)	16:54
bogdando	my point is now this try catch becomes even more nuanced	16:54
sean-k-mooney	that run_once was added for the orginal bug	16:55
bogdando	yea, mod_wsgi is the thing that is breaking excEPtions	16:55
sean-k-mooney	it was there to make sure we dint try and re parse the config amoung other things if it succeed the first time	16:55
sean-k-mooney	becuase that is an error	16:55
sean-k-mooney	bogdando: extra context we have shipt the run_once approch for 5 years and backported it to all our downstream release that are still supproted.	16:57
sean-k-mooney	so while we can condier changing this we have several years fo experince that this work good enough	16:57
sean-k-mooney	so im trying to be tactical without createing more tech debt	16:58
opendevreview	sean mooney proposed openstack/nova master: [WIP] raise_forever on wsgi app init https://review.opendev.org/c/openstack/nova/+/945028	17:00
bogdando	puppet used to be much more restrictive on configs management	17:00
bogdando	than eating free form snippets	17:00
sean-k-mooney	pupet was a terribel user experince	17:01
bogdando	but it was guarding us against this behavior	17:01
sean-k-mooney	to be clear we are only having this converstaion because healthcheck are not merged	17:01
sean-k-mooney	we have an implmemtned feature in review for over 2 years that was intned to solve this and other problems	17:02
bogdando	good point	17:02
sean-k-mooney	https://review.opendev.org/q/topic:%22per-process-healthchecks%22+status:open	17:02
opendevreview	sean mooney proposed openstack/nova master: [WIP] raise_forever on wsgi app init https://review.opendev.org/c/openstack/nova/+/945028	17:10
sean-k-mooney	i just rebased that to master	17:10
sean-k-mooney	and here is the test patch https://github.com/openstack-k8s-operators/nova-operator/pull/943	17:12
sean-k-mooney	im going to leave it there for today but i will look at the result tomrrow	17:12
opendevreview	Mina Shahsavan pour proposed openstack/nova master: Volume Affinity Filter https://review.opendev.org/c/openstack/nova/+/945031	17:19
opendevreview	Balazs Gibizer proposed openstack/nova master: Ignore metadata tags in pci/stats _find_pool logic https://review.opendev.org/c/openstack/nova/+/944277	17:26
dansmith	bauzas: Uggla good to approve stuff for master=flamingo yes?	17:53
dansmith	aight, stable/2025.1 is present so I'm just gonna do it	18:28
sean-k-mooney	dansmith: do you have any idea what "reservation id" is for a nova instance	19:10
sean-k-mooney	hum	19:12
sean-k-mooney	"The reservation id for the server. This is an id that can be useful in tracking groups of servers created with multiple create, that will all have the same reservation_id. By default, it appears in the response for administrative users only."	19:12
sean-k-mooney	ok apprently that show up even if you dont use multi create	19:13
sean-k-mooney	i dont think i have seen that in horizon before but i guess i dont use the admin project most of the time	19:14
dansmith	sean-k-mooney: that may be a holdover from being able to support the ec2 api	19:18
sean-k-mooney	it was added in 2.3 but maybe	19:18
sean-k-mooney	i was wondering if it was somehow blazar related	19:18
sean-k-mooney	just off the name	19:18
dansmith	orly okay	19:19
opendevreview	Merged openstack/nova master: Fix missing backtick in configuration option help https://review.opendev.org/c/openstack/nova/+/931253	19:25
opendevreview	Merged openstack/nova master: Reproduce bug/2098496 https://review.opendev.org/c/openstack/nova/+/941673	19:26
opendevreview	Merged openstack/nova master: Ignore metadata tags in pci/stats _find_pool logic https://review.opendev.org/c/openstack/nova/+/944277	22:04

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!