Wednesday, 2025-03-19

opendevreviewsean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught  https://review.opendev.org/c/openstack/nova/+/94493300:25
opendevreviewOpenStack Proposal Bot proposed openstack/nova master: Imported Translations from Zanata  https://review.opendev.org/c/openstack/nova/+/94496304:01
opendevreviewAdam Kijak proposed openstack/nova master: Fix race condition when attaching a volume and deleting the instance  https://review.opendev.org/c/openstack/nova/+/94497609:51
opendevreviewStephen Finucane proposed openstack/nova master: Remove tags from README  https://review.opendev.org/c/openstack/nova/+/94499412:30
opendevreviewPierre Riteau proposed openstack/nova master: Fix missing backtick in configuration option help  https://review.opendev.org/c/openstack/nova/+/93125313:43
priteauHello Nova cores. This trivial doc fix has been open for a few months ^13:43
priteauCould you please review?13:43
opendevreviewsean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught  https://review.opendev.org/c/openstack/nova/+/94493314:23
opendevreviewBogdan Dobrelya proposed openstack/nova master: [DNM] Hard OS exit instead of SysExit exception  https://review.opendev.org/c/openstack/nova/+/94501114:24
opendevreviewsean mooney proposed openstack/nova master: Force api to exit if exceptions are uncaught  https://review.opendev.org/c/openstack/nova/+/94493314:26
opendevreviewBogdan Dobrelya proposed openstack/nova master: [DNM] Hard OS exit instead of SysExit exception  https://review.opendev.org/c/openstack/nova/+/94501114:31
dansmithsean-k-mooney: so one approach is to try to exit really really hard to get apache/mod_wsgi to notice... but,15:10
dansmithwhat about us making sure we never raise and just return 500s if we would have otherwise failed?15:11
sean-k-mooneyhum15:11
dansmitheven without a dedicated healthcheck that would be easy to arrange to not be healthy15:11
sean-k-mooneywe would have to kind of hack it15:12
dansmithcurl /versions or whatever, always returning 500 would leave a pod unsuitable for traffic15:12
sean-k-mooneybut we could set a module level global i guess15:12
sean-k-mooneyand check that really early15:12
dansmithsure, we'd need some way to record it and continue logging the config error or whatever so the operator saw it in the logs15:12
dansmithsean-k-mooney: or just make application() return a different stub object if we failed to init the real one15:12
sean-k-mooneyperhaps. the issue is it reloading that applciation into the exisitng interperter15:13
sean-k-mooneyso we keep all the module level state like that partly parsed oslo config global15:13
dansmithdo we need to?15:13
sean-k-mooneydo we need to keep that state?15:14
dansmithwe could just have a safety proxy that sits around the real one and keeps trying to load the real one, eventually proxying to it if it loads, or returning the 500s if not15:14
sean-k-mooneyi was condiring trying to reset it so that we woul parse the config and raise the excption again15:14
dansmithI kinda don't think we need to worry about the partial state15:14
sean-k-mooneywell that why its getting to the microverison endpoint when its reloaded15:15
dansmithI imagine everyone doing this in a modern way is going to fix the config and restart a container when they do, so just having it be a terminal state if config is broken seems fine to me15:15
sean-k-mooneyjust enough fo the state is thare that we get passed where the excption woudl be raised15:15
sean-k-mooneyright im just wondering how to make it terminal15:15
dansmithI mean if mod_wsgi reloads us and we have to try initializing again and still fail, who cares?15:16
sean-k-mooneyim totally fine with changing the guts of the decorator to do something like record if it ever raised an excption15:16
sean-k-mooneyand then keep raising that if invoked in the future15:16
sean-k-mooneythat would be pretty clean i think15:16
sean-k-mooneydansmith: right now when we are reloaded we dont fail on the init the second time15:17
dansmithyeah, just set a global if we failed, stop trying, just return all the 500s, wait for them to fix config and restart15:17
sean-k-mooneybecuase of the leaked state 15:17
dansmithsean-k-mooney: ahh, I see, then yeah, some global to prevent us from trying15:17
sean-k-mooneydansmith: lets wait for the current ci job to complete15:17
sean-k-mooneyill refactor to raise 500 if we ever raised before and see what that looks like15:18
dansmithI just feel like trying to sys.exit() harder and harder is sort of missing the point15:18
bogdandowe also have one testing the os._exit "fork"15:18
sean-k-mooneyya i dont like calling that ^15:18
dansmithwe should just detect, log, and signal (which for a web application is a 500 healthcheck, IMHO)15:18
dansmithsean-k-mooney: agree about os._exit, that's what I mean by "harder" :)15:18
sean-k-mooneyya so i dont know if you looked at my most recent version wich follows the reload docs 15:19
sean-k-mooneyi choose to use sig kill instead of sig interupt15:19
sean-k-mooneybut either i think woudl work15:19
bogdandosys exit is trying the door, but apache won't let us out15:20
sean-k-mooneydansmith: i think your suggestion is worth trying as it shoudl be pretty clean.15:20
bogdandoos.exit - trying a window15:20
bogdando500 code - sitting in the room arguing with the door guard15:20
sean-k-mooney500s are more polite15:21
dansmithsean-k-mooney: to me, the kill/reload thing is what we should do if we need to retry, but for bad config or almost all of our situations, we don't15:21
dansmithso doing kill there just means we crashloop15:21
dansmithstarting up and returning 500 will signal that we're not happy instead of constantly looking like we're restarting15:21
sean-k-mooneyyep15:21
dansmithI much prefer the "unhealthy" marker instead of the constant "starting..." status where you're not sure if you're still supposed to wait15:22
sean-k-mooneylong term either implemeting helthchecks or hooking the system ready socket is cleaner15:22
bogdandothe problem is we need a hard stop, no half measures15:22
sean-k-mooneyi.e. for knowing if we started properly15:22
bogdandoletting the service operating half-configured even a fraction of a second could be destructive15:22
sean-k-mooneywell if my current patch works it wont15:23
bogdando(if we got unlucky enough)15:23
sean-k-mooneyand if we do the 500 thing it wont either15:23
dansmithit's not going to operate half-configured either way if we never return application15:23
dansmithright15:23
bogdandook, makes sense15:23
bogdandowhat about a tail of rpc calls?15:23
dansmithnot sure what you mean15:24
bogdandoit won't accept new API requests, but what about a pending work15:24
sean-k-mooneyit gets dropped on the floor15:24
bogdandowoudln't it heppily pick it up after crash looping?15:24
bogdandook15:24
sean-k-mooneythe http request wont get into the applciation logic and it will get a 50015:24
opendevreviewribaudr proposed openstack/nova master: FUP Remove unnecessary PCI check  https://review.opendev.org/c/openstack/nova/+/94410515:24
opendevreviewribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers  https://review.opendev.org/c/openstack/nova/+/94410615:24
opendevreviewribaudr proposed openstack/nova master: FUP Add a warning to make non-explicit live migration request debugging easier  https://review.opendev.org/c/openstack/nova/+/94413315:24
opendevreviewribaudr proposed openstack/nova master: FUP Update pci-passthrough and virtual-gpu documentation  https://review.opendev.org/c/openstack/nova/+/94415315:24
bogdandoIIRC, there is a logic for recovering ongoing rpc calls (long running ops) after restarting the process. Wouldn't restarting half-configured introduce more problems that it solves, even if never taking new API requests?15:26
sean-k-mooneyno15:26
sean-k-mooneyconfig errors should not allow the api to ever come up15:27
dansmithif we never return application, we can never start rpc calls15:27
sean-k-mooneyso you shoudl nto be able to start rpcs15:27
dansmithI'm not sure what the concern is15:27
bogdandooh ok then, sorry15:27
opendevreviewribaudr proposed openstack/nova master: FUP Remove unnecessary PCI check  https://review.opendev.org/c/openstack/nova/+/94410515:27
opendevreviewribaudr proposed openstack/nova master: FUP improve comment accuracy and variable naming for tag removal  https://review.opendev.org/c/openstack/nova/+/94312415:27
opendevreviewribaudr proposed openstack/nova master: FUP Add a warning to make non-explicit live migration request debugging easier  https://review.opendev.org/c/openstack/nova/+/94413315:27
sean-k-mooneyalso if you restart the api and it has done an rpc the reply is dropped on the floor15:27
opendevreviewribaudr proposed openstack/nova master: FUP Update pci-passthrough and virtual-gpu documentation  https://review.opendev.org/c/openstack/nova/+/94415315:27
opendevreviewribaudr proposed openstack/nova master: FUP improve and add integration tests for PCI SR-IOV servers  https://review.opendev.org/c/openstack/nova/+/94410615:27
dansmithalso, the 500 approach specifically avoids restarting if we know we're not going to be able to do so successfull15:28
sean-k-mooneyit will be about 2 hours until the ci job finishes15:29
sean-k-mooneyill see if i can do a poc of the 500 approch this evenign or early  tomorrow15:29
sean-k-mooneyi mean i can push one now15:29
bogdandowhat about conductor, it service no API, right? but we cannot afford it restarted half-configured for some moment? sorry I am fully noob in Nova services architecture15:29
sean-k-mooneybut i guess i can triger a job via a second dnm change15:30
sean-k-mooneythe conductor is not a wsig process15:30
bogdandoit serves no api*15:30
sean-k-mooneythis si only a problem because of mod_wsgi15:30
bogdandooh yes, indeed15:30
sean-k-mooneyall the rest exit on config error15:30
dansmithyeah I'm not sure what conductor has to do with this.. it can use the usual systemd stuff15:30
bogdandonot that much a noob to grasp that :)15:30
sean-k-mooneydansmith: i have a poc locally but i dont think we want to keep raising the excption and returning 500 in all cases. for config yes but for DBConnectionError maybe not.16:28
sean-k-mooneydansmith: ill push what i have shortly and we can dicuss if we want to filter ot a subset of excptions16:29
dansmithsean-k-mooney: yep, that's fair.. I would default to 500, excluding specific things we know are transient16:29
sean-k-mooneyack16:29
dansmithanything unexpected should be assumed to be fatal16:29
bogdandoyou see, arguing more with the door gouard? :)16:31
bogdandofail early is always the best approach16:31
sean-k-mooneyso today we allow retrying in case teh db is not ready yet16:31
sean-k-mooneyso it would be a regression to hard fail in that case16:31
sean-k-mooneya small one but still a chagne in behvior16:31
bogdandowell, systemd or k8s or pidone shall restart16:32
sean-k-mooneyit might16:32
sean-k-mooneyit might not16:32
bogdandoI mean it is a normal assumption for underlying OS to expect that16:32
sean-k-mooneywe can change the behavior we jsut need to conservitive and advertise it peroperly ot operators16:32
dansmithcompute tolerates certain things on startup and not others, so this mirrors that16:32
dansmithcompute also handles conductor not started, waits for it, etc16:33
sean-k-mooneyi am raising this because im modifying a test that emulatest this and im liek "i proably shoudl not be changing this test"16:33
bogdandothen it seems we only have one choice, code 500 with tricky logic around16:34
dansmithhow is it tricky logic?16:35
sean-k-mooneyno we can hard exit just not on any excption16:35
sean-k-mooneyconfig errors are fine16:35
sean-k-mooneyi jsut cant do excpet Excpetion16:35
sean-k-mooneywhich is what my current patch does16:35
bogdandoack16:35
sean-k-mooneyif the genal approch looks ok to folks after i push it ill refactor it to be more selective16:36
bogdando@dansmith the trickyness that I meant is how we have started firstly with a simple return http 500, and this evolved into decisions making16:38
dansmiththis is not tricky, it's a single "try...except $allowed: pass except Exception: fail"16:39
sean-k-mooneybogdando: its more we cant regress https://launchpad.net/bugs/1882094 while fixing this16:39
sean-k-mooneywhile i can change the test case we added to prevent that regressing there are trade offs and if i can aovid changing the test i likely shoudl16:41
bogdandowhat happens first, oslo config validation or DB connection during wsgi app init?16:41
sean-k-mooneyconfig16:41
sean-k-mooneyso that it can know what db to connect too16:41
bogdandoI recall there were corner cases with logs eating some events because errors happened before the logs configure16:42
bogdandosean-k-mooney: while that is totally logical, that was not clear to me after reading https://bugs.launchpad.net/nova/+bug/188209416:44
bogdandothat "Removing and moving the DB connection establishment out of our WSGI init script" could be understood like it connects to DB even before config gets validated16:45
dansmithof course config :)16:46
sean-k-mooneybogdando: it could but that not what the code does16:46
bogdandowell, sometimes it firstly needs alive MQ connection before to start verifying something else in a config16:47
bogdandonot saying that's exactly a case here16:47
sean-k-mooneythe api does not 16:47
sean-k-mooneyso this call to init_global_data is waht calls config.parse_args https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L12816:48
sean-k-mooneyhere https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L10116:48
sean-k-mooneythe first time the db is acccesed is here https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L6416:49
sean-k-mooneywhich is after we return from init_global_data16:49
sean-k-mooneydansmith: by hte way this is why we are not getting the config error the second time https://github.com/openstack/nova/blob/master/nova/api/openstack/wsgi_app.py#L92-L9316:50
bogdandoat least my silly question brought some useful considerations16:50
dansmithsean-k-mooney: ah interesting :)16:50
sean-k-mooneyi could proably tweak run_once for this edge case too but im slighly reluctant to broaden the scope of that16:53
bogdando@sean-k-mooney so, that means we can hit that initialization again with an invalid config16:53
bogdandoa rogue database connection :)16:53
sean-k-mooneybogdando: yes but rememebr python is ment to die if you dont cache the excption16:53
sean-k-mooneymod_wsgi is the thing that is breaking expections :)16:54
bogdandomy point is now this try catch becomes even more nuanced16:54
sean-k-mooneythat run_once was added for the orginal bug16:55
bogdandoyea, mod_wsgi is the thing that is breaking excEPtions16:55
sean-k-mooneyit was there to make sure we dint try and re parse the config amoung other things if it succeed the first time16:55
sean-k-mooneybecuase that is an error16:55
sean-k-mooneybogdando: extra context we have shipt the run_once approch for 5 years and backported it to all our downstream release that are still supproted.16:57
sean-k-mooneyso while we can condier changing this we have several years fo experince that this work good enough16:57
sean-k-mooneyso im trying to be tactical without createing more tech debt16:58
opendevreviewsean mooney proposed openstack/nova master: [WIP] raise_forever on wsgi app init  https://review.opendev.org/c/openstack/nova/+/94502817:00
bogdandopuppet used to be much more restrictive on configs management17:00
bogdandothan eating free form snippets17:00
sean-k-mooneypupet was a terribel user experince17:01
bogdandobut it was guarding us against this behavior17:01
sean-k-mooneyto be clear we are only having this converstaion because healthcheck are not merged17:01
sean-k-mooneywe have an implmemtned feature in review for over 2 years that was intned to solve this and other problems17:02
bogdandogood point17:02
sean-k-mooneyhttps://review.opendev.org/q/topic:%22per-process-healthchecks%22+status:open17:02
opendevreviewsean mooney proposed openstack/nova master: [WIP] raise_forever on wsgi app init  https://review.opendev.org/c/openstack/nova/+/94502817:10
sean-k-mooneyi just rebased that to master17:10
sean-k-mooneyand here is the test patch https://github.com/openstack-k8s-operators/nova-operator/pull/94317:12
sean-k-mooneyim going to leave it there for today but i will look at the result tomrrow17:12
opendevreviewMina Shahsavan pour proposed openstack/nova master: Volume Affinity Filter  https://review.opendev.org/c/openstack/nova/+/94503117:19
opendevreviewBalazs Gibizer proposed openstack/nova master: Ignore metadata tags in pci/stats _find_pool logic  https://review.opendev.org/c/openstack/nova/+/94427717:26
dansmithbauzas: Uggla good to approve stuff for master=flamingo yes?17:53
dansmithaight, stable/2025.1 is present so I'm just gonna do it18:28
sean-k-mooneydansmith: do you have any idea what "reservation id" is for a nova instance19:10
sean-k-mooneyhum19:12
sean-k-mooney"The reservation id for the server. This is an id that can be useful in tracking groups of servers created with multiple create, that will all have the same reservation_id. By default, it appears in the response for administrative users only."19:12
sean-k-mooneyok apprently that show up even if you dont use multi create19:13
sean-k-mooneyi dont think i have seen that in horizon before but i guess i dont use the admin project most of the time19:14
dansmithsean-k-mooney: that may be a holdover from being able to support the ec2 api19:18
sean-k-mooneyit was added in 2.3 but maybe19:18
sean-k-mooneyi was wondering if it was somehow blazar related19:18
sean-k-mooneyjust off the name19:18
dansmithorly okay19:19
opendevreviewMerged openstack/nova master: Fix missing backtick in configuration option help  https://review.opendev.org/c/openstack/nova/+/93125319:25
opendevreviewMerged openstack/nova master: Reproduce bug/2098496  https://review.opendev.org/c/openstack/nova/+/94167319:26
opendevreviewMerged openstack/nova master: Ignore metadata tags in pci/stats _find_pool logic  https://review.opendev.org/c/openstack/nova/+/94427722:04

Generated by irclog2html.py 2.17.3 by Marius Gedminas - find it at https://mg.pov.lt/irclog2html/!